Let's start with a basic example using individual words to understand the fundamental concepts.
Install the required modules:
$ pip install bertopic
Python code:
$ vi topic-modeling.py
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
# sample data - in practice, you'd use full sentences or documents
sentences = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
# initialize the sentence transformer model
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
# generate embeddings
embeddings = embedding_model.encode(sentences)
# configure BERTopic with custom parameters + fit the model
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=UMAP(n_components=5, random_state=42),
    hdbscan_model=HDBSCAN(min_cluster_size=2),
    verbose=True
).fit(sentences, embeddings)
# display results
print("Topics info:")
print(topic_model.get_topic_info())
print("Topic 0 info:")
print(topic_model.get_topic(0))
print("Topic 1 info:")
print(topic_model.get_topic(1))
# create and save visualizations
fig = topic_model.visualize_barchart()
fig.write_html("bertopic-barchart-figure.html")
Run the Python script:
$ python3 topic-modeling.py
Output:
Topics info:
   Topic  Count                         Name                              Representation     Representative_Docs
0      0      4  0_cats_birds_elephants_dogs  [cats, birds, elephants, dogs, , , , , , ]     [birds, cats, dogs]
1      1      3        1_cars_trains_planes_        [cars, trains, planes, , , , , , , ]  [planes, cars, trains]
Topic 0 info:
[
('cats', np.float64(0.34657359027997264)),
('birds', np.float64(0.34657359027997264)),
('elephants', np.float64(0.34657359027997264)),
('dogs', np.float64(0.34657359027997264)),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05)
]
Topic 1 info:
[
('cars', np.float64(0.46209812037329684)),
('trains', np.float64(0.46209812037329684)),
('planes', np.float64(0.46209812037329684)),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05),
('', 1e-05)
]
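Each topic is represented by its highest-scoring keywords extracted from the text.
Each topic name is formed by joining the topic number and the first four keywords with underscores ("_").
Because these topics contain fewer than ten distinct words, the keyword lists are padded with empty strings (score 1e-05), which is why the name of topic 1 ends with a trailing underscore.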
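The naming rule can be checked by hand against the output above. The following is a minimal sketch, not BERTopic's own code: build_topic_name is a hypothetical helper, and the keyword lists are copied from the printed results.

```python
# Keyword lists copied from the output above; the empty strings are padding entries.
topic_keywords = {
    0: ["cats", "birds", "elephants", "dogs", "", "", "", "", "", ""],
    1: ["cars", "trains", "planes", "", "", "", "", "", "", ""],
}


def build_topic_name(topic_id, keywords, n_words=4):
    """Hypothetical helper: join the topic id and the first n_words keywords with '_'."""
    return "_".join([str(topic_id)] + keywords[:n_words])


topic_names = {tid: build_topic_name(tid, words) for tid, words in topic_keywords.items()}
for name in topic_names.values():
    print(name)
```

For topic 1, the fourth keyword is an empty string, so the joined name ends with an underscore, matching the Name column in the output.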
A special topic labeled "-1" may also appear. It holds the outliers: documents that the clustering step (HDBSCAN) could not assign to any of the identified topics.
The keywords listed for topic -1 come from those unassigned documents, so this topic is usually ignored when interpreting the results.
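The run above produced no topic -1, so here is a small sketch of how outliers could be separated once topic assignments are available. The documents and assignments below are hypothetical illustration data. In a real run, the assignments would come from the fitted model (for instance its topics_ attribute, which holds one topic id per input document).

```python
# Hypothetical data for illustration only; not produced by the script above.
documents = ["cats", "dogs", "cars", "trains", "bananas"]
assignments = [0, 0, 1, 1, -1]  # one topic id per document, -1 marks an outlier

OUTLIER_TOPIC = -1

# Split the documents into outliers and documents that belong to a real topic.
outliers = [doc for doc, topic in zip(documents, assignments) if topic == OUTLIER_TOPIC]
assigned = [doc for doc, topic in zip(documents, assignments) if topic != OUTLIER_TOPIC]

print("Outliers:", outliers)
print("Assigned:", assigned)
```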
Chart of the topics (Topic Word Scores): bertopic-barchart-figure.html
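The word scores shown in the chart are the same values printed for each topic above. BERTopic weights words with a class-based TF-IDF (c-TF-IDF). The sketch below is an assumption about how the numbers in this small example arise, not a copy of the library code: it treats each topic as one document, uses the word's share within the topic as term frequency, and uses log(1 + A / f) as the inverse frequency term, where A is the average number of words per topic truncated to an integer and f is the word's total count across topics. The names topic_words and word_score are hypothetical.

```python
import math

# Words per topic, taken from the output above (each word occurs exactly once).
topic_words = {
    0: ["cats", "birds", "elephants", "dogs"],
    1: ["cars", "trains", "planes"],
}

# Assumption: average number of words per topic, truncated to an integer (3.5 -> 3).
avg_words_per_topic = int(sum(len(w) for w in topic_words.values()) / len(topic_words))


def word_score(word, topic_id):
    """Sketch of a class-based TF-IDF score for one word in one topic."""
    words = topic_words[topic_id]
    tf = words.count(word) / len(words)  # share of the word within its topic
    total = sum(w.count(word) for w in topic_words.values())  # count across all topics
    idf = math.log(1 + avg_words_per_topic / total)
    return tf * idf


scores = {tid: word_score(words[0], tid) for tid, words in topic_words.items()}
print(scores)
```

Under these assumptions every word in topic 0 gets log(4) / 4 and every word in topic 1 gets log(4) / 3, which agrees with the printed scores of about 0.3466 and 0.4621. All words within a topic score the same because each word appears exactly once.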