Cluster 0: ['cats', 'dogs', 'elephants', 'birds'] ==> topic: animals
Cluster 1: ['cars', 'trains', 'planes'] ==> topic: transportation
BERTopic is a modern topic modeling technique that leverages transformer-based embeddings to create more semantically meaningful topics. Unlike classical approaches that rely on bag-of-words representations, BERTopic uses contextual embeddings that capture semantic relationships between words and phrases. In BERTopic, document clusters are formed based on semantic similarity in high-dimensional embedding space and then interpreted as coherent topics.
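To make "semantic similarity in embedding space" concrete, here is a minimal sketch that needs no model download. The 3-dimensional vectors are hand-picked, hypothetical stand-ins for real sentence embeddings (which have hundreds of dimensions), and `nearest_neighbor` is an illustrative helper, not part of BERTopic.

```python
import math

# Hypothetical 3-d "embeddings" chosen by hand for illustration only;
# real sentence embeddings are produced by a model and are much larger.
toy_embeddings = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "cars": [0.1, 0.9, 0.2],
    "trains": [0.0, 0.8, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbor(word):
    """Return the other word whose vector is most similar to `word`."""
    others = [w for w in toy_embeddings if w != word]
    return max(
        others,
        key=lambda w: cosine_similarity(toy_embeddings[word], toy_embeddings[w]),
    )


for word in toy_embeddings:
    print(word, "->", nearest_neighbor(word))
```

With these toy vectors, each animal's closest neighbor is the other animal and each vehicle's closest neighbor is the other vehicle, which is the kind of grouping a clustering step can then pick up.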
$ pip install bertopic
Python code:
$ vi topic-modeling.py
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
# sample data - in practice, you'd use full sentences or documents
sentences = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
# initialize the sentence transformer model
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
# generate embeddings
embeddings = embedding_model.encode(sentences)
# configure BERTopic with custom parameters + fit the model
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=UMAP(n_components=5, random_state=42),
    hdbscan_model=HDBSCAN(min_cluster_size=2),
    verbose=True
).fit(sentences, embeddings)
# display results
print("Topics info:")
print(topic_model.get_topic_info())
print("Topic 0 info:")
print(topic_model.get_topic(0))
print("Topic 1 info:")
print(topic_model.get_topic(1))
# create and save visualizations
fig = topic_model.visualize_barchart()
fig.write_html("bertopic-barchart-figure.html")
Run the Python script:
$ python3 topic-modeling.py
Output:
Topics info:
   Topic  Count  Name                         Representation                              Representative_Docs
0  0      4      0_cats_birds_elephants_dogs  [cats, birds, elephants, dogs, , , , , , ]  [birds, cats, dogs]
1  1      3      1_cars_trains_planes_        [cars, trains, planes, , , , , , , ]        [planes, cars, trains]
Topic 0 info:
[
('cats', np.float64(0.34657359027997264)),
('birds', np.float64(0.34657359027997264)),
('elephants', np.float64(0.34657359027997264)),
('dogs', np.float64(0.34657359027997264)),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05)
]
Topic 1 info:
[
('cars', np.float64(0.46209812037329684)),
('trains', np.float64(0.46209812037329684)),
('planes', np.float64(0.46209812037329684)),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05)
]
Topics are represented by the main keywords extracted from the clustered documents, ranked by their importance scores (class-based TF-IDF, or c-TF-IDF, weights).
Each topic name is automatically generated by joining the topic ID and its top keywords with underscores ("_"), for example 0_cats_birds_elephants_dogs.
The scores indicate how strongly each word represents the topic, with higher scores meaning stronger association.
A special topic labeled "-1" may also appear, which typically includes outliers and documents that do not clearly fit into any specific topic cluster.
This outlier category helps identify noise in the data or documents that require different clustering parameters.
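The importance scores in the output above can be reproduced by hand. The following is a simplified sketch of class-based TF-IDF, not BERTopic's actual code: each word's frequency within its cluster is multiplied by log(1 + A / f), where f is the word's frequency across all clusters and A is the average number of words per cluster. Truncating A to an integer is an assumption here; it is the choice that matches the printed scores. The function name `c_tf_idf` is illustrative.

```python
import math
from collections import Counter

# The two clusters from the example output, one list of words per topic.
clusters = {
    0: ["cats", "dogs", "elephants", "birds"],
    1: ["cars", "trains", "planes"],
}


def c_tf_idf(clusters):
    """Score each word per cluster: normalized term frequency * log(1 + A / f)."""
    counts = {topic: Counter(words) for topic, words in clusters.items()}
    # A: average number of words per cluster, truncated to an integer (assumption).
    avg_words = int(sum(len(words) for words in clusters.values()) / len(clusters))
    # f: how often each word occurs across all clusters.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    scores = {}
    for topic, counter in counts.items():
        n_words = sum(counter.values())
        scores[topic] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in counter.items()
        }
    return scores


scores = c_tf_idf(clusters)
for topic, word_scores in scores.items():
    print(topic, {word: round(score, 4) for word, score in word_scores.items()})
```

Every word occurs exactly once, so all words in a cluster tie: 1/4 * ln(4) ≈ 0.3466 for topic 0 and 1/3 * ln(4) ≈ 0.4621 for topic 1. Words in the smaller cluster score higher only because each one makes up a larger share of its cluster.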
INPUT TEMPLATE:
+ Representative documents from the topic
+ Keywords that define the topic
Documents: [DOCUMENTS]
Keywords: [KEYWORDS]
Task: Generate a concise, descriptive label for this topic.
OUTPUT: <human-readable topic labels>
Python code:
$ vi label-topic-modeling.py
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from transformers import pipeline
from bertopic.representation import TextGeneration
sentences = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
# initialize the sentence transformer model
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
# create embeddings
embeddings = embedding_model.encode(sentences)
# configure BERTopic with custom parameters + fit the model
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=UMAP(n_components=5, random_state=42),
    hdbscan_model=HDBSCAN(min_cluster_size=2),
    verbose=True
).fit(sentences, embeddings)
# prompt for topic labeling
prompt = """These documents belong to the same topic:
[DOCUMENTS]
These keywords give details about the topic: '[KEYWORDS]'.
Given these documents and keywords, what is this topic about?"""
# initialize text generation pipeline
# use a model ("google/flan-t5-small") to label the topics
generator = pipeline("text2text-generation", model="google/flan-t5-small")
# create representation model
representation_model = TextGeneration(
    generator,
    prompt=prompt,
    doc_length=50,
    tokenizer="whitespace"
)
# update topics with the generated labels
topic_model.update_topics(sentences, representation_model=representation_model)
# print the topic labels
print(topic_model.get_topic_info())
Run the Python script:
$ python3 label-topic-modeling.py
Output:
   Topic  Count  Name          Representation               Representative_Docs
0  0      4      0_animals___  [animals, , , , , , , , , ]  [birds, cats, dogs]
1  1      3      1_car___      [car, , , , , , , , , ]      [planes, cars, trains]
Here the generated label is "car" rather than "transportation", which likely reflects both the tiny dataset and the limited capacity of a small model such as google/flan-t5-small.
With only three transportation-related words, the model may latch onto a single representative term instead of naming the broader category.
In real-world applications with larger, more diverse datasets, language models typically generate more comprehensive and accurate topic labels.
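To see what the language model actually receives, it can help to fill in the prompt's two placeholders by hand. The sketch below uses a hypothetical helper, `fill_prompt`, that does a plain string substitution. BERTopic performs this step internally, and its exact formatting of documents and keywords may differ, so treat the layout here as an assumption.

```python
# Same prompt text as in label-topic-modeling.py.
prompt = """These documents belong to the same topic:
[DOCUMENTS]
These keywords give details about the topic: '[KEYWORDS]'.
Given these documents and keywords, what is this topic about?"""


def fill_prompt(template, documents, keywords):
    """Hypothetical helper: substitute the two placeholders in the template."""
    # One document per line, prefixed with "- " (formatting is an assumption).
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    return (
        template
        .replace("[DOCUMENTS]", doc_block)
        .replace("[KEYWORDS]", ", ".join(keywords))
    )


# Representative documents and keywords of topic 1 from the output above.
filled = fill_prompt(
    prompt,
    documents=["planes", "cars", "trains"],
    keywords=["cars", "trains", "planes"],
)
print(filled)
```

Printing the filled prompt shows how little context the model has to work with in this toy example: three single-word documents and the same three words as keywords.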