Let's start with a basic example that uses individual words rather than full documents; this simplified setup makes the core functionality easy to follow.
Install the required modules:
$ pip install bertopic
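Installing bertopic also pulls in umap-learn, hdbscan, and sentence-transformers as dependencies, so this single install should cover every import used below.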
Python code:
$ vi topic-modeling.py
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
# sample data - in practice, you'd use full sentences or documents
sentences = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
# initialize the sentence transformer model
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
# generate embeddings
embeddings = embedding_model.encode(sentences)
# configure BERTopic with custom parameters and fit the model;
# min_cluster_size=2 lets HDBSCAN form clusters from this tiny dataset,
# and random_state=42 makes the UMAP projection reproducible
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=UMAP(n_components=5, random_state=42),
    hdbscan_model=HDBSCAN(min_cluster_size=2),
    verbose=True,
).fit(sentences, embeddings)
# display results
print("Topics info:")
print(topic_model.get_topic_info())
print("Topic 0 info:")
print(topic_model.get_topic(0))
print("Topic 1 info:")
print(topic_model.get_topic(1))
# create and save visualizations
fig = topic_model.visualize_barchart()
fig.write_html("bertopic-barchart-figure.html")
Run the Python script:
$ python3 topic-modeling.py
Output:
Topics info:
   Topic  Count                         Name                               Representation     Representative_Docs
0      0      4  0_cats_birds_elephants_dogs  [cats, birds, elephants, dogs, , , , , , ]     [birds, cats, dogs]
1      1      3        1_cars_trains_planes_        [cars, trains, planes, , , , , , , ]  [planes, cars, trains]
Topic 0 info:
[
('cats', np.float64(0.34657359027997264)),
('birds', np.float64(0.34657359027997264)),
('elephants', np.float64(0.34657359027997264)),
('dogs', np.float64(0.34657359027997264)),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05)
]
Topic 1 info:
[
('cars', np.float64(0.46209812037329684)),
('trains', np.float64(0.46209812037329684)),
('planes', np.float64(0.46209812037329684)),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05)
]
Topics are represented by the top keywords extracted from the clustered documents, ranked by their importance scores (c-TF-IDF weights).
Each topic name is generated automatically by joining the most representative keywords with underscores ("_").
The scores indicate how strongly each word represents its topic; higher scores mean a stronger association.
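To see which topic each input was assigned to, you can query the fitted model directly. The following is a minimal sketch that assumes the topic_model and sentences variables from the script above are still in scope:
# per-document (here: per-word) topic assignments
doc_info = topic_model.get_document_info(sentences)
print(doc_info[["Document", "Topic", "Name"]])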
A special topic labeled "-1" may also appear; it typically contains outliers, that is, documents that do not clearly fit into any specific topic cluster.
This outlier category helps surface noise in the data, and a large "-1" topic can also be a sign that the clustering parameters need adjusting.
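If too many documents end up in the "-1" topic, BERTopic can reassign them to their nearest topics with reduce_outliers(). A minimal sketch, again assuming the fitted topic_model from the script above:
# reassign outlier documents to the closest existing topic
new_topics = topic_model.reduce_outliers(sentences, topic_model.topics_)
# recompute the topic representations with the new assignments
topic_model.update_topics(sentences, topics=new_topics)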
Chart of the topics (Topic Word Scores): bertopic-barchart-figure.html
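BERTopic ships several other Plotly-based visualizations that can be saved the same way. A sketch using the topic similarity heatmap (note that some plots, such as the intertopic distance map, need more topics than this toy dataset produces):
# topic similarity heatmap, saved alongside the bar chart
fig = topic_model.visualize_heatmap()
fig.write_html("bertopic-heatmap-figure.html")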