Creating Data Visualizations Using Python: A Beginner's Guide

Introduction

This post walks through, step by step, how to build a data visualization in Python. We'll use Pandas, Plotly, Scikit-learn, and PyTorch to visualize skill embeddings that have been clustered with machine learning techniques. The end result is an interactive 3D plot displaying skill clusters.

Step 1: Setting Up Your Environment

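First, make sure the required Python libraries are installed. You can install them with pip (sentence-transformers is included because the model used in Step 3 is a sentence transformer):

pip install pandas plotly scikit-learn torch sentence-transformers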
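A quick way to confirm the installation worked is to print the installed versions. The sketch below uses only the standard library; the helper name installed_versions is made up for this example. One thing worth checking in the output: the n_init="auto" argument passed to KMeans in Step 6 needs scikit-learn 1.2 or newer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    wanted = ["pandas", "plotly", "scikit-learn", "torch"]
    for name, ver in installed_versions(wanted).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```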

Step 2: Import Libraries

Start your script by importing the necessary libraries:

import pandas as pd
import plotly.express as px
from plotly import io as pio
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import torch

Step 3: Define the Visualizer Class

We'll create a class named SkillsVisualizer to encapsulate all our visualization logic:

class SkillsVisualizer:
    def __init__(self, skills, model):
        self.skills = skills
        self.model = model

This constructor initializes the visualizer with a list of skills (words) and a pre-trained language model (model).

The model is a sentence transformer: it takes a word together with its context as an input sentence and outputs an encoding vector that embeds the sentence's meaning.
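To make the expected input concrete, here is a small sketch of what the skills list might look like. The example sentences are invented. The format, a skill word followed by some context, is inferred from the plotting code in Step 7, which keeps only the first space-separated word of each entry as the label.

```python
# Hypothetical input: each entry is "<skill> <context...>". The full sentence
# gives the model context, while the first word serves as the plot label.
skills = [
    "Python programming language used for data analysis",
    "Docker containers for packaging and deploying applications",
    "SQL queries for relational databases",
]


def first_word(skill):
    """Return the first space-separated word, as the plotting step does."""
    return skill.split(" ")[0]


labels = [first_word(s) for s in skills]
print(labels)  # ['Python', 'Docker', 'SQL']
```

Note that with this convention a multi-word skill such as "Machine learning" would be labeled just "Machine", so it may be worth writing such skills as a single token if you follow the same approach.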

Step 4: Embedding Skills

Define a method to convert skill descriptions into numerical embeddings using the provided language model:

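def embed_sentences(self):
    # output_value=None makes encode() return one dict per sentence,
    # including the per-token embeddings under 'token_embeddings'
    token_embeddings = self.model.encode(self.skills, output_value=None)
    # Sum the token vectors so each skill is represented by a single vector
    list_embedding = [token_embedding['token_embeddings'].sum(dim=0) for token_embedding in token_embeddings]
    # Move to CPU before converting, in case the model runs on a GPU
    self.embeddings = torch.stack(list_embedding, dim=0).cpu().numpy()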

This method computes embeddings for each skill, summing up token embeddings to get a single vector per skill.
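The summing step is easier to see on a toy tensor. The sketch below uses made-up numbers in place of real model output, and also shows mean pooling as a possible alternative (not what the method above does).

```python
import torch

# Fake token embeddings for one skill: 3 tokens, 4 dimensions each.
# In the real pipeline these come from the model; the values are made up.
token_embeddings = torch.tensor([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 2.0, 0.0],
    [2.0, 2.0, 2.0, 3.0],
])

sum_vector = token_embeddings.sum(dim=0)    # same pooling as embed_sentences
mean_vector = token_embeddings.mean(dim=0)  # alternative: average, not sum

print(sum_vector.tolist())   # [3.0, 3.0, 6.0, 3.0]
print(mean_vector.tolist())  # [1.0, 1.0, 2.0, 1.0]

# Stacking one vector per skill gives a (n_skills, n_dimensions) array
stacked = torch.stack([sum_vector, mean_vector], dim=0).numpy()
print(stacked.shape)         # (2, 4)
```

One design point to keep in mind: a summed vector grows in magnitude with the number of tokens, so longer skill descriptions produce larger vectors. Averaging removes that length effect, which may or may not matter for your data.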

Step 5: Dimensionality Reduction

Reduce the dimensionality of embeddings to make clustering manageable and visualization possible in three dimensions:

def perform_pca(self):
    pca = PCA(n_components=3)
    self.pca_result = pca.fit_transform(self.embeddings)

Using PCA, we reduce the embeddings to three principal components. Other dimensionality reduction algorithms, such as t-SNE, UMAP, or LDA, could be used here instead.
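Three components will usually discard a good deal of information, so it can be useful to check how much variance they retain. The sketch below runs the same PCA call on random stand-in data (the 50 x 384 shape is arbitrary, chosen only for illustration) and reads the fitted explained_variance_ratio_ attribute.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for real embeddings: 50 "skills", 384 dimensions each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 384))

pca = PCA(n_components=3)
pca_result = pca.fit_transform(embeddings)

print(pca_result.shape)  # (50, 3)

# Fraction of the total variance captured by the three components
retained = pca.explained_variance_ratio_.sum()
print(f"Variance kept by 3 components: {retained:.1%}")
```

On real embeddings, a low percentage here suggests the 3D picture is only a rough summary of how the skills relate to each other.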

Step 6: Clustering Skills

Apply a clustering algorithm to group similar skills. The idea is to categorize the skills so that domain-specific skills appear together.

def perform_clustering(self):
    kmeans = KMeans(n_clusters=7, random_state=0, n_init="auto")
    kmeans.fit(self.pca_result)
    self.labels = ['Cluster_'+str(item) for item in kmeans.labels_]

This method uses KMeans clustering to group the skills into seven clusters, and stores a string label such as Cluster_0 for each skill.
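The value n_clusters=7 is a choice rather than something the data dictates. One common way to pick it is to compare silhouette scores across several candidate values. The sketch below does this on synthetic 3D points standing in for self.pca_result; the blob centers and the range of k are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 3D points standing in for the PCA output:
# three well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 10, 10], [-10, 10, 0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 3)) for c in centers])

# Fit KMeans for each candidate k and record the silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

Real skill embeddings are rarely this cleanly separated, so treat the score as a guide alongside a visual check of the plot.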

Step 7: Creating an Interactive 3D Plot

Generate an interactive 3D scatter plot of the skills:

def create_interactive_plot(self):
    words_only = [w.split(' ')[0] for w in self.skills]
    embedding_df = pd.DataFrame({
        'Skill': words_only,
        'X': self.pca_result[:, 0],
        'Y': self.pca_result[:, 1],
        'Z': self.pca_result[:, 2],
        'Color': self.labels
    })
    fig = px.scatter_3d(embedding_df, x='X', y='Y', z='Z', text='Skill', color='Color')
    fig.update_layout(title="3D Skills Embeddings & Clustering", scene=dict(aspectmode="cube"))
    return pio.to_html(fig, full_html=False)

This function creates a DataFrame from the PCA results and cluster labels, plots it with Plotly's scatter_3d, and returns the figure as an HTML fragment.

Step 8: Visualize Skills

Add a method to run all steps and visualize the results:

def visualize_skills(self):
    self.embed_sentences()
    self.perform_pca()
    self.perform_clustering()
    return self.create_interactive_plot()

The returned value is an HTML string containing the interactive plot, which lets the user rotate, navigate, and zoom through the 3D space of skills, each labeled by its word.
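Because full_html=False produces a fragment rather than a complete page, it needs to be embedded somewhere before a browser can open it. A minimal sketch of one way to do that is shown below; the helper name save_plot_page and the output file name are made up, and a placeholder string stands in for the real return value of visualize_skills().

```python
from pathlib import Path


def save_plot_page(fragment, path, title="Skills plot"):
    """Wrap an HTML fragment in a minimal page and write it to disk."""
    page = (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f'<head><meta charset="utf-8"><title>{title}</title></head>\n'
        f"<body>\n{fragment}\n</body>\n"
        "</html>\n"
    )
    out = Path(path)
    out.write_text(page, encoding="utf-8")
    return out


if __name__ == "__main__":
    # Placeholder standing in for the string returned by visualize_skills()
    fragment = "<div>plot goes here</div>"
    saved = save_plot_page(fragment, "skills_plot.html")
    print(f"Wrote {saved}")
```

With Plotly's default settings the fragment should already carry the JavaScript it needs, so opening the saved file in a browser is normally enough. The same fragment can also be dropped into a template in a web framework instead of being written to a file.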

Conclusion

To visualize your data, simply instantiate the SkillsVisualizer with your data and model, then call visualize_skills(). This guide provides a practical example of integrating machine learning and visualization techniques using Python to produce meaningful insights from data.