Creating Data Visualizations Using Python: A Beginner's Guide
Introduction
This post walks through, in simplified form, how to build a data visualization in Python. Specifically, we'll use Pandas, Plotly, Scikit-learn, and PyTorch to visualize skill embeddings grouped with a clustering algorithm. Let's take a structured, step-by-step approach to building an interactive 3D plot of skill clusters.
Step 1: Setting Up Your Environment
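First, make sure the necessary Python libraries are installed. You can install them using pip. The model used later is a sentence transformer, so the sentence-transformers package is included as well:
pip install pandas plotly scikit-learn torch sentence-transformers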
Step 2: Import Libraries
Start your script by importing the necessary libraries:
import pandas as pd
import plotly.express as px
from plotly import io as pio
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import torch
Step 3: Define the Visualizer Class
We'll create a class named SkillsVisualizer to encapsulate all our visualization logic:
class SkillsVisualizer:
    def __init__(self, skills, model):
        self.skills = skills
        self.model = model
This constructor initializes the visualizer with a list of skills (skills) and a pre-trained language model (model).
The model is a sentence transformer: it takes each word together with its context as an input sentence and outputs an encoding vector that embeds the sentence's meaning.
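To make the expected input concrete, here is a small sketch of what the skills list might look like. The skill strings are made up for illustration; the only convention taken from this guide is that each entry starts with the skill word, followed by its context, which is why the plotting step later keeps just the first word as a label.

```python
# Hypothetical skills: each entry is the skill word followed by some context.
skills = [
    "Python programming language used for data analysis",
    "Docker containerization of backend services",
    "Negotiation with suppliers and clients",
]

# Same label extraction as in the plotting step: keep the first word only.
words_only = [s.split(' ')[0] for s in skills]
print(words_only)
```

Note that with this convention a multi-word skill name such as "Machine learning" would be shortened to "Machine" on the plot, so you may want to adapt the label extraction to your own data.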
Step 4: Embedding Skills
Define a method to convert skill descriptions into numerical embeddings using the provided language model:
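    def embed_sentences(self):
        # output_value=None returns one dict of model outputs per skill
        token_embeddings = self.model.encode(self.skills, output_value=None)
        # Sum over the token axis to get a single vector per skill
        list_embedding = [token_embedding['token_embeddings'].sum(dim=0) for token_embedding in token_embeddings]
        # Move to CPU before converting, in case the model runs on a GPU
        self.embeddings = torch.stack(list_embedding, dim=0).cpu().numpy()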
This method computes embeddings for each skill, summing up token embeddings to get a single vector per skill.
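The pooling step can be tried in isolation on toy tensors, without downloading a model. The sketch below uses random tensors as a stand-in for the model output. The attention_mask key and the masked variant are assumptions on my part, not something the method above does: if your model output includes padding positions, masking them before summing may give cleaner vectors, so check what your sentence-transformers version returns.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for model.encode(skills, output_value=None): one dict per skill.
# The 'attention_mask' entry is an assumption about the model output.
fake_outputs = [
    {"token_embeddings": torch.randn(4, 8), "attention_mask": torch.tensor([1, 1, 1, 0])},
    {"token_embeddings": torch.randn(4, 8), "attention_mask": torch.tensor([1, 1, 0, 0])},
]


def sum_pool(output, use_mask=False):
    """Collapse (n_tokens, dim) token embeddings into one (dim,) vector."""
    tokens = output["token_embeddings"]
    if use_mask:
        # Zero out padding positions before summing.
        mask = output["attention_mask"].unsqueeze(-1).to(tokens.dtype)
        tokens = tokens * mask
    return tokens.sum(dim=0)


# Plain sum, as in embed_sentences, and the hypothetical masked variant.
plain = torch.stack([sum_pool(o) for o in fake_outputs], dim=0).cpu().numpy()
masked = torch.stack([sum_pool(o, use_mask=True) for o in fake_outputs], dim=0).cpu().numpy()

print(plain.shape)   # one row per skill, one column per embedding dimension
print(masked.shape)
```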
Step 5: Dimensionality Reduction
Reduce the dimensionality of embeddings to make clustering manageable and visualization possible in three dimensions:
    def perform_pca(self):
        pca = PCA(n_components=3)
        self.pca_result = pca.fit_transform(self.embeddings)
Using PCA, we reduce the embeddings to three principal components. Other dimensionality-reduction algorithms, such as t-SNE, UMAP, or LDA, could be used here instead.
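Three components rarely capture everything in a high-dimensional embedding, so it can be worth checking how much variance the projection keeps. Here is a minimal sketch on random data; the 50 x 384 array is a made-up placeholder for self.embeddings, and real embeddings will usually retain more variance than random noise does.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for self.embeddings: 50 skills, 384-dimensional vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 384))

pca = PCA(n_components=3)
pca_result = pca.fit_transform(embeddings)

# Fraction of the total variance explained by the three kept components.
retained = pca.explained_variance_ratio_.sum()

print(pca_result.shape)
print(f"Variance retained by 3 components: {retained:.1%}")
```

If the retained fraction is very low, the 3D plot is only a rough picture of the original space, which is worth keeping in mind when reading the clusters.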
Step 6: Clustering Skills
Apply a clustering algorithm to group similar skills. The idea is to categorize the skills so that domain-specific skills appear together.
    def perform_clustering(self):
        kmeans = KMeans(n_clusters=7, random_state=0, n_init="auto")
        kmeans.fit(self.pca_result)
        self.labels = ['Cluster_'+str(item) for item in kmeans.labels_]
This method uses KMeans clustering to categorize the skills into seven clusters. Note that n_init="auto" requires scikit-learn 1.2 or later.
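The number of clusters is fixed at 7 in the method above. If you are unsure what value suits your data, one common heuristic is to compare silhouette scores over a range of candidates. The sketch below is not part of the visualizer; it runs on synthetic blobs from make_blobs as a placeholder for self.pca_result, and uses n_init=10 so that it also works on older scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 3D points standing in for self.pca_result (assumption for the demo).
points, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# A higher silhouette score suggests tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Treat the result as a hint rather than a rule: for a visualization, a slightly different number of clusters may simply read better.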
Step 7: Creating an Interactive 3D Plot
Generate an interactive 3D scatter plot of the skills:
    def create_interactive_plot(self):
        words_only = [w.split(' ')[0] for w in self.skills]
        embedding_df = pd.DataFrame({
            'Skill': words_only,
            'X': self.pca_result[:, 0],
            'Y': self.pca_result[:, 1],
            'Z': self.pca_result[:, 2],
            'Color': self.labels
        })
        fig = px.scatter_3d(embedding_df, x='X', y='Y', z='Z', text='Skill', color='Color')
        fig.update_layout(title="3D Skills Embeddings & Clustering", scene=dict(aspectmode="cube"))
        return pio.to_html(fig, full_html=False)
This function creates a DataFrame from the PCA results and clusters, then plots them using Plotly's scatter_3d.
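Because full_html=False is passed, the method returns an HTML fragment (a div) rather than a complete page, which is convenient for embedding in a web template. If you just want to open the plot in a browser, one option is to wrap the fragment in a minimal page yourself. The helper below is a hypothetical sketch, not part of Plotly; save_plot_page, the file name, and the placeholder fragment are all made up for the example.

```python
import html
import tempfile
from pathlib import Path


def save_plot_page(fragment, path, title="Skills clusters"):
    """Wrap an HTML fragment in a minimal page and write it to disk."""
    page = (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f'<head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body>\n{fragment}\n</body>\n"
        "</html>\n"
    )
    path = Path(path)
    path.write_text(page, encoding="utf-8")
    return path


# Placeholder fragment; in practice, pass the string returned by
# create_interactive_plot().
demo_fragment = "<div id='plot'>plot goes here</div>"
out_path = save_plot_page(demo_fragment, Path(tempfile.gettempdir()) / "skills_plot.html")
print(out_path)
```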
Step 8: Visualize Skills
Add a method to run all steps and visualize the results:
    def visualize_skills(self):
        self.embed_sentences()
        self.perform_pca()
        self.perform_clustering()
        return self.create_interactive_plot()
The returned value is an interactive HTML fragment that lets the user navigate and zoom in the 3D space of projected skills, each labeled by its word.
Conclusion
To visualize your data, simply instantiate the SkillsVisualizer with your data and model, then call visualize_skills(). This guide provides a practical example of integrating machine learning and visualization techniques using Python to produce meaningful insights from data.