LLMs | Topic Modeling
  1. Topic Modeling
  2. Basic Topic Modeling Example
  3. Generating Topic Labels

  1. Topic Modeling
    Topic modeling is an unsupervised machine learning technique that automatically discovers abstract topics within a collection of documents. It identifies patterns in word usage and groups documents that share similar themes, providing insights into the underlying structure of large text corpora.

    Key Benefits:
    • Automatically organize large document collections.
    • Discover hidden themes and patterns in text data.
    • Reduce dimensionality of text data for analysis.
    • Enable content recommendation and search improvements.
    • Support exploratory data analysis of textual content.

Example (for simplicity, we use one-word documents):
Cluster 0: ['cats', 'dogs', 'elephants', 'birds'] ==> topic: animals
Cluster 1: ['cars', 'trains', 'planes'] ==> topic: vehicles
    BERTopic is a topic modeling technique that leverages transformer-based embeddings to create more semantically meaningful topics.
    In BERTopic, document clusters are formed based on semantic similarity and then interpreted as topics.

    The topic modeling steps in BERTopic:
    • Document Embeddings: Convert documents into high-dimensional vector representations using transformer models.

    • Dimensionality Reduction: Use UMAP to reduce embedding dimensions while preserving local structure.

    • Clustering: Apply HDBSCAN to group similar documents into clusters.

    • Topic Representation: Extract representative keywords for each cluster using TF-IDF or other representation models.
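The four steps above can be sketched end-to-end with a small NumPy-only toy. Note that this is only an illustration of the flow, not the real stack: hand-crafted vectors stand in for transformer embeddings, an SVD projection stands in for UMAP, and a naive distance threshold stands in for HDBSCAN.

```python
import numpy as np

# Toy stand-ins for the four BERTopic stages (NOT the real models).

docs = ["cats", "dogs", "cars", "trains"]

# 1) "Embeddings": animals share one direction, vehicles another
embeddings = np.array([
    [1.0, 0.9, 0.1, 0.0],   # cats
    [0.9, 1.0, 0.0, 0.1],   # dogs
    [0.1, 0.0, 1.0, 0.9],   # cars
    [0.0, 0.1, 0.9, 1.0],   # trains
])

# 2) Dimensionality reduction: project onto the top-2 right singular vectors
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered)
reduced = centered @ vt[:2].T

# 3) Clustering: greedy assignment by Euclidean distance to a cluster seed
clusters = {}
labels = [-1] * len(docs)
for i, vec in enumerate(reduced):
    for label, members in clusters.items():
        if np.linalg.norm(vec - reduced[members[0]]) < 1.0:
            members.append(i)
            labels[i] = label
            break
    else:
        labels[i] = len(clusters)
        clusters[labels[i]] = [i]

# 4) Topic representation: here, simply the member words themselves
topics = {label: [docs[i] for i in members] for label, members in clusters.items()}
print(topics)
```

With real data you would swap each stage for the actual component (SentenceTransformer, UMAP, HDBSCAN, c-TF-IDF), as in the full example below.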


    BERTopic characteristics:
    • Semantic Understanding: Uses contextual embeddings that capture word meaning better than bag-of-words approaches.

    • Hierarchical Structure: Supports topic hierarchies and subtopics.

    • Flexibility: Modular design allows customization of each component.

    • Visualization: Rich visualization capabilities for topic exploration.

    See this page for more details about BERTopic:
    https://maartengr.github.io/BERTopic/index.html
  2. Basic Topic Modeling Example
    Let's start with a basic example using individual words to understand the fundamental concepts.

    Install the required modules:
    $ pip install bertopic
    Python code:
    $ vi topic-modeling.py
    from sentence_transformers import SentenceTransformer
    from umap import UMAP
    from hdbscan import HDBSCAN
    from bertopic import BERTopic
    
    # sample data - in practice, you'd use full sentences or documents
    sentences = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
    
    # initialize the sentence transformer model
    embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
    
    # generate embeddings
    embeddings = embedding_model.encode(sentences)
    
    # configure BERTopic with custom parameters + fit the model
    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=UMAP(n_components=5, random_state=42),
        hdbscan_model=HDBSCAN(min_cluster_size=2),
        verbose=True
    ).fit(sentences, embeddings)
    
    # display results
    print("Topics info:")
    print(topic_model.get_topic_info())
    
    print("Topic 0 info:")
    print(topic_model.get_topic(0))
    
    print("Topic 1 info:")
    print(topic_model.get_topic(1))
    
    # create and save visualizations
    fig = topic_model.visualize_barchart()
    fig.write_html("bertopic-barchart-figure.html")
    Run the Python script:
    $ python3 topic-modeling.py
    Output:
    Topics info:
           Topic  Count    Name                         Representation                                 Representative_Docs
    0      0      4        0_cats_birds_elephants_dogs  [cats, birds, elephants, dogs, , , , , , ]     [birds, cats, dogs]
    1      1      3        1_cars_trains_planes_        [cars, trains, planes, , , , , , , ]           [planes, cars, trains]
    
    Topic 0 info:
    [
        ('cats', np.float64(0.34657359027997264)),
        ('birds', np.float64(0.34657359027997264)),
        ('elephants', np.float64(0.34657359027997264)),
        ('dogs', np.float64(0.34657359027997264)),
        ('', 1e-05),
        ('', 1e-05),
        ('', 1e-05),
        ('', 1e-05),
        ('', 1e-05),
        ('', 1e-05)
    ]
    
    Topic 1 info:
    [
        ('cars', np.float64(0.46209812037329684)),
        ('trains', np.float64(0.46209812037329684)),
        ('planes', np.float64(0.46209812037329684)),
        ('', 1e-05),
        ('', 1e-05),
        ('', 1e-05),
        ('', 1e-05),
        ('', 1e-05),
        ('', 1e-05),
        ('', 1e-05)
    ]
Topics are represented by the main keywords extracted from the text. Each default topic name concatenates the topic id and these keywords using underscores ("_"). Because each cluster here contains only a few unique words, the remaining keyword slots (BERTopic returns 10 keywords per topic by default) are padded with empty strings. A special topic labeled "-1" may also appear; it collects the outlier documents that the clustering step could not assign to any cluster, and its keywords are derived from those documents.
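The default naming pattern is easy to reproduce. BERTopic builds names like "0_cats_birds_elephants_dogs" internally; this small sketch just mirrors the pattern using the (keyword, score) tuples returned by get_topic():

```python
# Sketch of how a default topic name like "0_cats_birds_elephants_dogs"
# is composed: the topic id followed by its top keywords, joined by "_".
# (BERTopic builds these names internally; this only mirrors the pattern.)

def default_topic_name(topic_id, keywords, top_n=4):
    """Join the topic id and its top-N keywords with underscores."""
    return "_".join([str(topic_id)] + [kw for kw, _ in keywords[:top_n]])

topic_0 = [("cats", 0.3466), ("birds", 0.3466), ("elephants", 0.3466), ("dogs", 0.3466)]
print(default_topic_name(0, topic_0))  # → 0_cats_birds_elephants_dogs
```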

    Chart of the topics (Topic Word Scores): bertopic-barchart-figure.html
  3. Generating Topic Labels
    One of BERTopic's powerful features is the ability to generate human-readable topic labels using language models.

    In our example, we will create a prompt that has two parts:
    • A subset of documents that best represent the topics will be inserted using the [DOCUMENTS] tag.
    • The keywords that make up the topic cluster will be inserted using the [KEYWORDS] tag.

INPUT:
  a subset of representative documents:
  [DOCUMENTS]
  the topic keywords:
  [KEYWORDS]
  instruction: predict the label of the topic.
OUTPUT:
  <topic label>
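Before the prompt is sent to the language model, the [DOCUMENTS] and [KEYWORDS] tags are replaced with the topic's representative documents and keywords. BERTopic's TextGeneration representation performs this substitution internally; the sketch below only illustrates the idea with plain string replacement:

```python
# Illustrative sketch of the [DOCUMENTS]/[KEYWORDS] substitution.
# The prompt text matches the one used in the script below; the
# substitution logic here is NOT BERTopic's actual implementation.

prompt = """These documents belong to the same topic:
[DOCUMENTS]

These keywords give details about the topic: '[KEYWORDS]'.

Given these documents and keywords, what is this topic about?"""

representative_docs = ["birds", "cats", "dogs"]
keywords = ["cats", "birds", "elephants", "dogs"]

filled = (prompt
          .replace("[DOCUMENTS]", "\n".join(f"- {d}" for d in representative_docs))
          .replace("[KEYWORDS]", ", ".join(keywords)))
print(filled)
```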
    Python code:
    $ vi label-topic-modeling.py
    from sentence_transformers import SentenceTransformer
    from umap import UMAP
    from hdbscan import HDBSCAN
    from bertopic import BERTopic
    from transformers import pipeline
    from bertopic.representation import TextGeneration
    
    sentences = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
    
    # initialize the sentence transformer model
    embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
    
    # create embeddings
    embeddings = embedding_model.encode(sentences)
    
    # configure BERTopic with custom parameters + fit the model
    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=UMAP(n_components=5, random_state=42),
        hdbscan_model=HDBSCAN(min_cluster_size=2),
        verbose=True
    ).fit(sentences, embeddings)
    
    # prompt for topic labeling
    prompt = """These documents belong to the same topic:
    [DOCUMENTS]
    
    These keywords give details about the topic: '[KEYWORDS]'.
    
    Given these documents and keywords, what is this topic about?"""
    
    # initialize text generation pipeline
    # use a model ("google/flan-t5-small") to label the topics
    generator = pipeline("text2text-generation", model="google/flan-t5-small")
    
    # create representation model
    representation_model = TextGeneration(
        generator,
        prompt=prompt,
        doc_length=50,
        tokenizer="whitespace"
    )
    
    # update topics with the generated labels
    topic_model.update_topics(sentences, representation_model=representation_model)
    
    # print the topic labels
    print(topic_model.get_topic_info())
    Run the Python script:
    $ python3 label-topic-modeling.py
    Output:
           Topic  Count          Name           Representation                 Representative_Docs
    0      0      4              0_animals___  [animals, , , , , , , , , ]     [birds, cats, dogs]
    1      1      3              1_car___      [car, , , , , , , , , ]         [planes, cars, trains]
© 2025  mtitek