LLMs | Embeddings
  1. Embeddings
  2. Create token embeddings
  3. Create text embeddings

  1. Embeddings
    Embeddings are the foundation of modern NLP, enabling machines to work with human language in a mathematically meaningful way. Understanding how to create, use, and optimize embeddings is essential for any text-processing application.

    Embeddings are numerical representations that transform discrete tokens (words, subwords, characters) into continuous vector spaces. Think of them as a way to give computers a mathematical understanding of language meaning.

    Key Characteristics:
    • Dense vectors: Each embedding is a list of real numbers (typically 100-4096 dimensions).
    • Semantic capture: Similar words have similar embeddings in vector space.
    • Contextual: Modern embeddings consider surrounding context, not just the word itself.
    • Fixed dimensionality: All embeddings from a model have the same number of dimensions.

    Alternative names: embedding vectors, vector representations, dense representations, distributed representations.

    By converting words to vectors, we enable:
    • Semantic similarity: Measuring how similar two pieces of text are (see the cosine-similarity sketch after this list).
    • Machine learning: Using text as input to neural networks.
    • Information retrieval: Finding relevant documents or passages.
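
    All three capabilities come down to comparing vectors, most often with cosine similarity. A minimal sketch with toy 3-dimensional vectors (the values are illustrative only; real embeddings have hundreds of dimensions):

      import numpy as np

      def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
          # cosine of the angle between two vectors: 1.0 means same direction
          return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

      # toy vectors for illustration only
      cat = np.array([0.8, 0.1, 0.3])
      dog = np.array([0.7, 0.2, 0.3])
      car = np.array([0.1, 0.9, 0.2])

      print(cosine_similarity(cat, dog))  # high: related meanings
      print(cosine_similarity(cat, car))  # lower: unrelated meanings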

    Types of Embeddings
    • Token Embeddings
      • Scope: Individual tokens (words, subwords, characters).
      • Use case: Building blocks for language models.
      • Example: "language" → [0.1, -0.1, 0.2, ..., 0.3]

    • Sentence Embeddings
      • Scope: Complete sentences or short passages.
      • Use case: Semantic search, text classification, clustering.
      • Example: "Hello Embeddings!" → [0.4, -0.3, 0.7, ..., 0.8]

    • Document Embeddings
      • Scope: Entire documents or long passages.
      • Use case: Document classification, recommendation systems.

    Contextual vs. Static Embeddings
    • Static Embeddings (Word2Vec, GloVe): Same vector for a word regardless of context.
    • Contextual Embeddings (BERT, GPT, etc.): Different vectors based on surrounding context (illustrated in the sketch below).
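
    A quick way to observe the contextual behavior (a sketch, assuming bert-base-uncased; the sentences are illustrative): extract the vector of "bank" in two different contexts and compare them. A static model would return the same vector both times; a contextual model does not.

      import torch
      from transformers import AutoTokenizer, AutoModel

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
      model = AutoModel.from_pretrained("bert-base-uncased")

      def token_vector(sentence: str, word: str) -> torch.Tensor:
          # return the contextual vector of `word` inside `sentence`
          inputs = tokenizer(sentence, return_tensors="pt")
          with torch.no_grad():
              hidden = model(**inputs).last_hidden_state[0]
          index = inputs.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids(word))
          return hidden[index]

      v1 = token_vector("I deposited cash at the bank.", "bank")
      v2 = token_vector("We sat on the bank of the river.", "bank")
      print(torch.cosine_similarity(v1, v2, dim=0))  # below 1.0: context changed the vector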

    Training Process
    • Initialization: Start with random vectors for each token.
    • Context learning: Model learns from massive text datasets.
    • Optimization: Vectors adjust to capture semantic relationships.
    • Convergence: Final embeddings encode learned patterns.

    Practical Applications
    • Text Generation
      • Language models use embeddings as input representations.
      • Enable models to understand and generate coherent text.

    • Text Classification
      • Convert documents to embeddings.
      • Train classifiers on vector representations.
      • Examples: sentiment analysis, spam detection.

    • Semantic Search & RAG
      • Convert queries and documents to embeddings.
      • Find similar content using vector similarity.
      • Power recommendation systems and search engines.

    • Text Clustering
      • Group similar documents using embedding similarity.
      • Organize large text collections.
      • Discover hidden themes in data.

    Strategies for combining token embeddings into sentence embeddings (a minimal sketch follows this list):
    • Mean pooling: Average all token vectors.
    • Max pooling: Take maximum value across each dimension.
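
    A minimal pooling sketch (assuming PyTorch; the 7×384 shape mirrors the token embeddings produced in section 2, with random values standing in for real ones):

      import torch

      # placeholder token embeddings: 7 tokens x 384 dimensions
      token_embeddings = torch.randn(7, 384)

      mean_pooled = token_embeddings.mean(dim=0)       # average over the token axis
      max_pooled = token_embeddings.max(dim=0).values  # per-dimension maximum

      print(mean_pooled.shape, max_pooled.shape)  # torch.Size([384]) torch.Size([384])

    In practice, mean pooling is usually weighted by the attention mask so that padding tokens do not dilute the average.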

    Model Selection
    • General purpose: all-MiniLM-L6-v2, all-mpnet-base-v2.
    • High performance: all-MiniLM-L12-v2, larger models.
    • Domain-specific: Fine-tuned models for medical, legal, scientific text.
    • Multilingual: paraphrase-multilingual-MiniLM-L12-v2.
  2. Create token embeddings
    First, download the model:
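    The original command is not preserved here. Assuming the model is sentence-transformers/all-MiniLM-L6-v2 (its 384-dimensional hidden size matches the output discussed below), it can be pre-downloaded with the huggingface_hub CLI:

      pip install -U "huggingface_hub[cli]"
      huggingface-cli download sentence-transformers/all-MiniLM-L6-v2

    Alternatively, the transformers library downloads and caches the model automatically on first use.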

    Python code:
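    A minimal sketch of such a script (the original code is not preserved; the input sentence and the file name token_embeddings.py are assumptions): tokenize a sentence and read the per-token vectors from the model's last hidden state.

      # token_embeddings.py (hypothetical file name)
      import torch
      from transformers import AutoTokenizer, AutoModel

      model_name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed model (384-dim)
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      model = AutoModel.from_pretrained(model_name)

      # tokenize one sentence; return PyTorch tensors (batch of size 1)
      inputs = tokenizer("Hello Embeddings!", return_tensors="pt")

      # forward pass without gradient tracking; last_hidden_state holds
      # one contextual vector per token
      with torch.no_grad():
          outputs = model(**inputs)

      token_embeddings = outputs.last_hidden_state
      print(token_embeddings.size())  # (batch, tokens, embedding dimension)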

    Run the Python script:
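    Assuming the script was saved as token_embeddings.py:

      python token_embeddings.py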

    Output:
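    The middle dimension depends on the input sentence; for the input used on the original page (seven tokens after tokenization, special tokens included), the script prints:

      torch.Size([1, 7, 384])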

    Note that the created embeddings have the following size: [1, 7, 384]
    • 1: the batch dimension.
    • 7: seven tokens.
    • 384: each token is embedded in a vector of 384 values.

    The batch dimension can be larger than 1 when multiple sentences are passed to the model to be processed at the same time.
  3. Create text embeddings
    Install the Sentence Transformers library:
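      pip install -U sentence-transformers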

    Python code:
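    A minimal sketch (the original script is not preserved; the model choice, the sample sentence, and the file name text_embeddings.py are assumptions):

      # text_embeddings.py (hypothetical file name)
      from sentence_transformers import SentenceTransformer

      # assumed model; produces 384-dimensional sentence embeddings
      model = SentenceTransformer("all-MiniLM-L6-v2")

      # encode a full sentence into one fixed-size vector
      # (this model mean-pools its token embeddings internally)
      embedding = model.encode("Hello Embeddings!")

      print(embedding.shape)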

    Run the Python script:
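    Assuming the script was saved as text_embeddings.py:

      python text_embeddings.py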

    Output:
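    With the assumed model, the whole sentence is reduced to a single 384-dimensional vector, so the printed shape is:

      (384,)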