LLMs | Topic Modeling
  1. Topic Modeling
  2. Basic Topic Modeling Example
  3. AI-Generated Topic Labels

  1. Topic Modeling
    Topic modeling is an unsupervised machine learning technique that automatically discovers abstract topics within a collection of documents. It identifies patterns in word usage and groups documents that share similar themes, providing insights into the underlying structure of large text corpora.

    Key Benefits:
    • Automatically organize large document collections.
    • Discover hidden themes and patterns in text data.
    • Reduce dimensionality of text data for analysis.
    • Enable content recommendation and search improvements.
    • Support exploratory data analysis of textual content.

    Example (to simplify, we use one-word sentences):

    BERTopic is a topic modeling technique that leverages transformer-based embeddings to create more semantically meaningful topics.

    The topic modeling steps in BERTopic:
    • Document Embeddings: Convert documents into high-dimensional vector representations using transformer models.

    • Dimensionality Reduction: Use UMAP to reduce embedding dimensions while preserving local structure.

    • Clustering: Apply HDBSCAN to group similar documents into clusters.

    • Topic Representation: Extract representative keywords for each cluster using class-based TF-IDF (c-TF-IDF) or another representation model.
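
    The topic representation step can be illustrated without any dependencies. The sketch below is a simplified, pure-Python take on class-based TF-IDF: each cluster is treated as one large document, and a term is scored as its frequency in the cluster multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters. The function name `c_tf_idf`, the whitespace tokenization, and the sample clusters are assumptions made for illustration; BERTopic's own implementation works on sparse matrices and may differ in details.

```python
import math
from collections import Counter


def c_tf_idf(clusters: dict[int, list[str]], top_n: int = 3) -> dict[int, list[str]]:
    """Return the top_n keywords per cluster, ranked by a c-TF-IDF style score."""
    # Treat each cluster as one big document and count its words.
    term_counts = {
        cid: Counter(word for doc in docs for word in doc.lower().split())
        for cid, docs in clusters.items()
    }

    # f: frequency of each term across all clusters.
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(term_counts)

    keywords = {}
    for cid, counts in term_counts.items():
        n_words = sum(counts.values())
        scores = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[cid] = ranked[:top_n]
    return keywords


# Hypothetical clusters, as if produced by the clustering step.
clusters = {
    0: ["the cat chased the mouse", "the cat sleeps", "a dog and a cat"],
    1: ["the car needs fuel", "the car engine is loud", "a truck and a car"],
}
print(c_tf_idf(clusters, top_n=1))  # -> {0: ['cat'], 1: ['car']}
```

    Words such as "the" occur often in both clusters, so their score is damped, while "cat" and "car" are frequent in only one cluster and rank first.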


    BERTopic characteristics:
    • Semantic Understanding: Uses contextual embeddings that capture word meaning better than bag-of-words approaches.

    • Hierarchical Structure: Supports topic hierarchies and subtopics.

    • Flexibility: Modular design allows customization of each component.

    • Visualization: Rich visualization capabilities for topic exploration.
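
    As a sketch of the modular design, the function below assembles a BERTopic model from one explicitly chosen component per pipeline step. The helper name `build_topic_model`, the embedding model "all-MiniLM-L6-v2", and all parameter values are assumptions picked for illustration, not settings taken from this page.

```python
def build_topic_model(min_cluster_size: int = 15):
    """Assemble a BERTopic model from explicitly chosen components."""
    # Imports are local so that defining this helper does not require the
    # (fairly heavy) dependencies to be installed.
    from bertopic import BERTopic
    from bertopic.vectorizers import ClassTfidfTransformer
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    # Step 1: document embeddings (model name is an assumption).
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

    # Step 2: dimensionality reduction.
    umap_model = UMAP(
        n_neighbors=15, n_components=5, min_dist=0.0,
        metric="cosine", random_state=42,
    )

    # Step 3: clustering.
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size, metric="euclidean",
        cluster_selection_method="eom", prediction_data=True,
    )

    # Step 4: topic representation (bag-of-words counts + c-TF-IDF weighting).
    vectorizer_model = CountVectorizer(stop_words="english")
    ctfidf_model = ClassTfidfTransformer()

    return BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        ctfidf_model=ctfidf_model,
    )
```

    Any component can be swapped independently, for example a different clustering algorithm in place of HDBSCAN. After `topic_model = build_topic_model()` and `topic_model.fit_transform(docs)`, calls such as `topic_model.hierarchical_topics(docs)` and `topic_model.visualize_hierarchy()` can be used to explore topic hierarchies.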

    See this page for more details about BERTopic:
    https://maartengr.github.io/BERTopic/index.html
  2. Basic Topic Modeling Example
    Let's start with a basic example using individual words to understand the fundamental concepts.

    Install the required modules:

    Python code:
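
    A minimal sketch of such a script, assuming the `bertopic` package and its dependencies are installed (for instance with `pip install bertopic`). The word list, the helper name `run_basic_example`, and the small `n_neighbors` and `min_topic_size` values are assumptions chosen for a tiny corpus; the resulting topics depend on the embedding model and may vary between runs and library versions.

```python
# One-word "documents" loosely grouped around a few themes (illustrative data).
DOCS = [
    "cat", "dog", "horse", "rabbit", "tiger", "lion", "wolf", "sheep", "goat", "monkey",
    "apple", "banana", "orange", "grape", "mango", "peach", "lemon", "cherry", "pear", "melon",
    "car", "truck", "bus", "train", "bicycle", "motorcycle", "airplane", "boat", "tram", "scooter",
    "red", "blue", "green", "yellow", "purple", "pink", "brown", "black", "white", "gray",
]


def run_basic_example(docs, chart_path="bertopic-barchart-figure.html"):
    """Fit BERTopic on the documents, print the topics, and save a bar chart."""
    # Local imports keep the module importable without the heavy dependencies.
    from bertopic import BERTopic
    from umap import UMAP

    # Smaller neighborhood and topic size than the defaults, because the
    # corpus only has a few dozen documents (assumed values).
    umap_model = UMAP(
        n_neighbors=5, n_components=5, min_dist=0.0,
        metric="cosine", random_state=42,
    )
    topic_model = BERTopic(umap_model=umap_model, min_topic_size=3)

    # topics[i] is the topic ID assigned to docs[i]; -1 marks an outlier.
    topics, _ = topic_model.fit_transform(docs)

    # One row per topic: ID, size, name, and representative keywords.
    print(topic_model.get_topic_info())
    for doc, topic in zip(docs, topics):
        print(f"{doc}: {topic}")

    # "Topic Word Scores" bar chart, written as a standalone HTML file.
    fig = topic_model.visualize_barchart()
    fig.write_html(chart_path)
    return topic_model
```

    Appending a call such as `run_basic_example(DOCS)` at the end of the file makes it runnable as a script. The first run downloads the default embedding model, so it needs network access.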

    Run the Python script:

    Output:

    Each topic is represented by the main keywords extracted from its documents, and the topic name is formed by joining the topic ID and these keywords with underscores ("_"). A special topic with ID -1 may also appear: it collects outliers, that is, documents that were not assigned to any of the identified topics, so its keywords are usually not meaningful.
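
    The naming scheme and the handling of topic -1 can be sketched in plain Python. The helpers `topic_name` and `split_outliers` are hypothetical and only mimic the behavior described above; the choice of four keywords per name reflects what BERTopic appears to do by default and should be treated as an assumption.

```python
def topic_name(topic_id: int, keywords: list[str], top_n: int = 4) -> str:
    """Build a name such as '0_cat_dog_lion_tiger' from a topic ID and keywords."""
    return "_".join([str(topic_id), *keywords[:top_n]])


def split_outliers(docs: list[str], topics: list[int]):
    """Separate documents with a real topic from outliers (topic -1)."""
    assigned = [(doc, topic) for doc, topic in zip(docs, topics) if topic != -1]
    outliers = [doc for doc, topic in zip(docs, topics) if topic == -1]
    return assigned, outliers


# Hypothetical assignments, as if returned by fit_transform().
docs = ["cat", "dog", "car", "hello"]
topics = [0, 0, 1, -1]

print(topic_name(0, ["cat", "dog", "lion", "tiger", "wolf"]))  # -> 0_cat_dog_lion_tiger
print(split_outliers(docs, topics))
# -> ([('cat', 0), ('dog', 0), ('car', 1)], ['hello'])
```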

    Chart of the topics (Topic Word Scores): bertopic-barchart-figure.html
    [Figure: Topics Plot]
  3. AI-Generated Topic Labels
    One of BERTopic's powerful features is the ability to generate human-readable topic labels using language models.

    In our example, we will create a prompt that has two parts:
    • A subset of the documents that best represent the topic, inserted in place of the [DOCUMENTS] tag.
    • The keywords that describe the topic (cluster), inserted in place of the [KEYWORDS] tag.
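
    To make the role of the two tags concrete, the sketch below fills a prompt template by hand. BERTopic performs this substitution internally when a prompt is passed to a representation model; the template wording and the helper `fill_prompt` are assumptions used only to show what the final prompt might look like.

```python
PROMPT_TEMPLATE = """I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: [KEYWORDS]

Based on the documents and keywords, give a short label for this topic."""


def fill_prompt(template: str, documents: list[str], keywords: list[str]) -> str:
    """Replace the [DOCUMENTS] and [KEYWORDS] tags with actual content."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    keyword_line = ", ".join(keywords)
    return template.replace("[DOCUMENTS]", doc_block).replace("[KEYWORDS]", keyword_line)


# Hypothetical representative documents and keywords for one topic.
prompt = fill_prompt(PROMPT_TEMPLATE, ["cat", "dog", "lion"], ["cat", "dog"])
print(prompt)
```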


    Python code:
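
    A minimal sketch of how such a prompt might be wired into BERTopic, assuming the `bertopic` and `transformers` packages are installed. The prompt text, the helper name `build_labeling_model`, and the model "google/flan-t5-small" are assumptions chosen for illustration; any text generation model supported by BERTopic's representation models could be used instead, and the pipeline task name may differ across `transformers` versions.

```python
LABEL_PROMPT = (
    "I have a topic described by the following keywords: [KEYWORDS]\n"
    "Representative documents of the topic:\n[DOCUMENTS]\n"
    "Based on the above, what is a short label for this topic?"
)


def build_labeling_model(prompt: str = LABEL_PROMPT,
                         llm_name: str = "google/flan-t5-small"):
    """Create a BERTopic model whose topics are labeled by a language model."""
    # Local imports keep the module importable without the heavy dependencies.
    from bertopic import BERTopic
    from bertopic.representation import TextGeneration
    from transformers import pipeline

    # Small instruction-tuned model run locally (assumed choice).
    generator = pipeline("text2text-generation", model=llm_name)

    # BERTopic fills in [KEYWORDS] and [DOCUMENTS] for each topic
    # before sending the prompt to the generator.
    representation_model = TextGeneration(generator, prompt=prompt)

    return BERTopic(representation_model=representation_model)
```

    After `topic_model = build_labeling_model()` and `topic_model.fit_transform(docs)`, the generated labels should appear in the topic representations returned by `topic_model.get_topic_info()`. Label quality depends heavily on the chosen model.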

    Run the Python script:

    Output:
© 2025  mtitek