LLMs | Text Clustering
  1. Text Clustering
  2. Example: word clustering

  1. Text Clustering
    Text clustering is an unsupervised machine learning technique that groups documents or text snippets based on their semantic similarity. Unlike classification, clustering doesn't require labeled data—instead, it discovers hidden patterns and structures within text collections. This makes it invaluable for exploratory data analysis, content organization, and understanding large text corpora.

    Key Applications:
    • Document organization and categorization
    • Customer feedback analysis
    • News article grouping
    • Academic paper classification
    • Social media content analysis
    • Market research and trend detection

    Example (to simplify, we use one-word sentences):
    INPUT (unstructured textual data): ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
    OUTPUT (clusters of semantically similar data):
    Cluster 0 ['cats', 'dogs', 'elephants', 'birds'],
    Cluster 1 ['cars', 'trains', 'planes']
    To cluster documents, we follow these three steps:

    • Text Embedding Generation:
      The foundation of semantic clustering lies in converting text into numerical representations that capture meaning. Modern embedding models use transformer architectures trained on vast text corpora to understand semantic relationships.

      Popular Embedding Models:
      • all-MiniLM-L12-v2: Fast, efficient, good general performance
      • all-mpnet-base-v2: Higher quality, slightly slower
      • text-embedding-ada-002 (OpenAI): Commercial option with excellent performance
      • multilingual-E5-large: For multilingual applications

      Example (illustrative values):
      INPUT: texts
      ['cats', 'dogs', ...]
      OUTPUT: embeddings
      cats [1,0,0,0,1, ...],
      dogs [2,0,0,0,2, ...],
      ...
      Example Implementation:
      from sentence_transformers import SentenceTransformer
      
      # load embedding model
      embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
      
      # generate embeddings
      texts = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
      embeddings = embedding_model.encode(texts)
      
      print(f'Embedding shape: {embeddings.shape}') # (7, 384)
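      To check that the embeddings capture semantic similarity, you can compare them pairwise. A minimal sketch using the cosine-similarity helper from sentence_transformers (the exact scores will vary by model):
      from sentence_transformers import util
      
      # pairwise cosine similarities between all embeddings
      similarities = util.cos_sim(embeddings, embeddings)
      
      # similarities of 'cats' to every word; expect higher scores for the other animals
      print(similarities[0])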
    • Dimensionality Reduction (optional but recommended):
      To make clustering large volumes of data easier, we can reduce the dimensionality of the embeddings with a dimensionality reduction library. Note that this process can lose information, which may reduce clustering accuracy.

      Example (illustrative values):
      INPUT (embedding with dimension 5): [1,0,0,0,1]
      OUTPUT (compressed embedding with dimension 3): [2,0,2]
      High-dimensional embeddings can face the "curse of dimensionality" in clustering, where increasing dimensions require exponentially more data to capture patterns accurately, leading to issues like data sparsity, distance concentration, and overfitting.
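      A quick way to see distance concentration is to compare pairwise distances between random points as the dimension grows. A small illustrative numpy sketch (not part of the clustering pipeline):
      import numpy as np
      
      rng = np.random.default_rng(42)
      for dim in (2, 50, 1000):
          points = rng.random((500, dim))
          # distances from the first point to all the others
          dists = np.linalg.norm(points[1:] - points[0], axis=1)
          # the relative spread of distances shrinks as the dimension grows
          print(f'dim={dim}: (max-min)/min = {(dists.max() - dists.min()) / dists.min():.2f}')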

      Dimensionality reduction techniques help by:
      • Reducing computational complexity
      • Eliminating noise and redundant features
      • Improving clustering algorithm performance
      • Enabling visualization in 2D/3D space

      UMAP (Uniform Manifold Approximation and Projection) is preferred because it:
      • Preserves both local and global structure
      • Handles non-linear relationships effectively
      • Maintains cluster separation better
      • Provides more interpretable low-dimensional representations

      UMAP Configuration Guidelines:
      from umap import UMAP
      
      # conservative reduction for clustering
      reducer = UMAP(
          n_components=50, # Moderate reduction
          n_neighbors=15, # Local neighborhood size
          min_dist=0.0, # Tight clusters
          metric='cosine', # Good for text embeddings
          random_state=42 # Reproducibility
      )
      
      reduced_embeddings = reducer.fit_transform(embeddings)
      Parameter Tuning Tips:
      • n_components: Start with 10-50 for clustering, 2-3 for visualization
      • n_neighbors: Higher values preserve global structure, lower values preserve local structure
      • min_dist: Lower values create tighter clusters
      • metric: Use 'cosine' for text embeddings, 'euclidean' for other data
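      For example, a visualization-oriented configuration uses 2 components. A minimal sketch reusing the embeddings generated earlier:
      # 2D projection for plotting
      # note: n_neighbors must be smaller than the number of samples
      viz_reducer = UMAP(
          n_components=2,  # 2D for visualization
          n_neighbors=15,
          min_dist=0.0,
          metric='cosine',
          random_state=42
      )
      viz_embeddings = viz_reducer.fit_transform(embeddings)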

      See this page for more details about UMAP and dimension reduction:
      https://umap-learn.readthedocs.io/en/latest/index.html

    • Clustering Algorithm Selection:
      The last step is to use a clustering library to find groups of semantically similar documents.

      Example (illustrative values):
      INPUT (reduced embeddings): [1,0,1], [2,0,2], ...
      OUTPUT:
      Cluster 0 ['cats', 'dogs', 'elephants', 'birds'],
      Cluster 1 ['cars', 'trains', 'planes']
      HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) excels at text clustering because it:
      • Automatically determines the number of clusters
      • Handles clusters of varying densities and shapes
      • Identifies outliers and noise points
      • Provides hierarchical cluster structure
      • Doesn't assume spherical clusters

      HDBSCAN Parameter Tuning:
      from hdbscan import HDBSCAN
      
      clusterer = HDBSCAN(
          min_cluster_size=5,            # Minimum points per cluster
          min_samples=3,                 # Core point threshold
          metric='euclidean',            # Distance metric
          cluster_selection_method='eom' # Excess of Mass
      )
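      The clusterer is then fit on the (reduced) embeddings; a short usage sketch, where label -1 marks noise points:
      # fit the model and get one cluster label per document
      labels = clusterer.fit_predict(reduced_embeddings)
      
      # soft membership strength of each point in its cluster
      print(clusterer.probabilities_)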
      See this page for more details about the HDBSCAN clustering library:
      https://hdbscan.readthedocs.io/en/latest/index.html
  2. Example: word clustering
    To simplify, we use one-word sentences in this example.

    Install the required modules:
    $ pip install sentence-transformers
    $ pip install umap-learn
    $ pip install hdbscan
    $ pip install matplotlib
    Python code:
    $ vi clustering.py
    from sentence_transformers import SentenceTransformer
    from umap import UMAP
    from hdbscan import HDBSCAN
    from matplotlib import pyplot
    import numpy as np
    
    # load embedding model
    embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
    
    # generate embeddings
    texts = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
    embeddings = embedding_model.encode(texts)
    
    print(f'Number of embedded documents and their dimensions: {embeddings.shape}')
    
    # reduce the embeddings dimensions
    reduced_embeddings = UMAP(n_components=5, random_state=42).fit_transform(embeddings)
    
    print(f'Number of embedded documents and their reduced dimensions: {reduced_embeddings.shape}')
    
    # create an HDBSCAN object and fit the model to the data
    cluster_algorithm = HDBSCAN(min_cluster_size=2).fit(reduced_embeddings)
    
    # get the cluster labels (note: -1 means noise points)
    cluster_labels = cluster_algorithm.labels_
    
    # get the number of clusters
    n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
    
    print(f'Number of clusters: {n_clusters}')
    
    # print the features in each cluster
    for cluster in range(n_clusters):
        print(f'Cluster {cluster}:')
        for index in np.where(cluster_labels==cluster)[0][:10]:
            print(f'Feature {index}: {texts[index][:10]}')
    
    # plot the results (first two reduced dimensions)
    pyplot.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=cluster_labels, cmap='Spectral', s=40)
    pyplot.colorbar()
    pyplot.savefig('hdbscan_cluster_plot.png')
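    If HDBSCAN marks any points as noise (label -1), they can be listed the same way; a small optional addition to the script:
    # optional: list noise points, if any (label -1)
    for index in np.where(cluster_labels == -1)[0]:
        print(f'Noise {index}: {texts[index]}')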
    Run the Python script:
    $ python3 clustering.py
    Output:
    Number of embedded documents and their dimensions: (7, 384)
    
    Number of embedded documents and their reduced dimensions: (7, 5)
    
    Number of clusters: 2
    
    Cluster 0:
    Feature 0: cats
    Feature 1: dogs
    Feature 2: elephants
    Feature 3: birds
    
    Cluster 1:
    Feature 4: cars
    Feature 5: trains
    Feature 6: planes
    Chart of the clusters: hdbscan_cluster_plot.png