To keep the example simple, each input text is a single word.
Install the required modules:
$ pip install umap-learn
$ pip install hdbscan
$ pip install matplotlib
$ pip install sentence-transformers  # imported by the script below; skip if already installed
Python code:
$ vi clustering.py
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from matplotlib import pyplot
import numpy as np
# load embedding model
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
# generate embeddings
texts = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
embeddings = embedding_model.encode(texts)
print(f'Number of the embedded documents and their dimensions: {embeddings.shape}')
# reduce the dimensionality of the embeddings (384 -> 5)
reduced_embeddings = UMAP(n_components=5, random_state=42).fit_transform(embeddings)
print(f'Number of the embedded documents and their reduced dimensions: {reduced_embeddings.shape}')
# create an hdbscan object and fit the model to the data
cluster_algorithm = HDBSCAN(min_cluster_size=2).fit(reduced_embeddings)
# get the cluster labels (note: -1 means noise points)
cluster_labels = cluster_algorithm.labels_
# get the number of clusters
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
print(f'Number of clusters: {n_clusters}')
# print features in cluster 0
print("Cluster 0:")
cluster = 0
for index in np.where(cluster_labels==cluster)[0][:10]:
    print(f'Feature {index}: {texts[index][:10]}')
# print features in cluster 1
print("Cluster 1:")
cluster = 1
for index in np.where(cluster_labels==cluster)[0][:10]:
    print(f'Feature {index}: {texts[index][:10]}')
# plot the results
pyplot.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=cluster_labels, cmap='Spectral', s=40)
pyplot.colorbar()
pyplot.savefig('hdbscan_cluster_plot.png')
Run the Python script:
$ python3 clustering.py
Output:
Number of the embedded documents and their dimensions: (7, 384)
Number of the embedded documents and their reduced dimensions: (7, 5)
Number of clusters: 2
Cluster 0:
Feature 0: cats
Feature 1: dogs
Feature 2: elephants
Feature 3: birds
Cluster 1:
Feature 4: cars
Feature 5: trains
Feature 6: planes
Chart of the clusters: hdbscan_cluster_plot.png
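The script prints cluster 0 and cluster 1 with two nearly identical loops, which only works when exactly two clusters are found. A minimal sketch of a more general approach is shown below: it groups the texts by label for any number of clusters and keeps noise points (label -1) separate. The helper name group_by_cluster is hypothetical, and the labels are copied from the output above so the snippet runs on its own; in clustering.py you would pass cluster_labels instead.

```python
from collections import defaultdict


def group_by_cluster(texts, labels):
    """Map each cluster label to its texts; label -1 (noise) is returned separately.

    Hypothetical helper for illustration, not part of hdbscan.
    """
    clusters = defaultdict(list)
    noise = []
    for text, label in zip(texts, labels):
        if label == -1:
            noise.append(text)
        else:
            # int() also converts NumPy integer labels to plain Python ints
            clusters[int(label)].append(text)
    return dict(sorted(clusters.items())), noise


texts = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
# Labels copied from the output above (assumption: same run as shown there)
labels = [0, 0, 0, 0, 1, 1, 1]

clusters, noise = group_by_cluster(texts, labels)
print(f'Number of clusters: {len(clusters)}')
for label, members in clusters.items():
    print(f'Cluster {label}: {members}')
print(f'Noise: {noise}')
```

Because noise points never enter the dictionary, len(clusters) gives the same count as the n_clusters expression in the script.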