LLMs | Embeddings
  1. Embeddings
  2. Create token embeddings
  3. Create text embeddings

  1. Embeddings
    Embeddings are the foundation of modern NLP, enabling machines to work with human language in a mathematically meaningful way. Understanding how to create, use, and optimize embeddings is essential for any text-processing application.

    Embeddings are numerical representations that transform discrete tokens (words, subwords, characters) into continuous vector spaces. Think of them as a way to give computers a mathematical understanding of language meaning.

    Key Characteristics:
    • Dense vectors: Each embedding is a list of real numbers (typically 100-4096 dimensions).
    • Semantic capture: Similar words have similar embeddings in vector space.
    • Contextual: Modern embeddings consider surrounding context, not just the word itself.
    • Fixed dimensionality: All embeddings from a model have the same number of dimensions.

    Alternative names: embedding vectors, vector representations, dense representations, distributed representations

    By converting words to vectors, we enable:
    • Semantic similarity: Measuring how similar two pieces of text are (see the cosine-similarity sketch after this list).
    • Machine learning: Using text as input to neural networks.
    • Information retrieval: Finding relevant documents or passages.
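    A minimal sketch of the similarity idea (the vectors below are made up for illustration; real embeddings have hundreds of dimensions): cosine similarity compares the direction of two embedding vectors.
    import numpy as np

    def cosine_similarity(a, b):
        # dot product of the vectors divided by the product of their norms
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # toy 4-dimensional "embeddings" (made-up values)
    cat = np.array([0.7, 0.2, 0.1, 0.5])
    dog = np.array([0.6, 0.3, 0.2, 0.4])
    car = np.array([-0.4, 0.8, -0.3, 0.1])

    print(cosine_similarity(cat, dog)) # ~0.98 → the vectors point in similar directions
    print(cosine_similarity(cat, car)) # ~-0.12 → the vectors are nearly unrelated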

    Types of Embeddings
    • Token Embeddings
      • Scope: Individual tokens (words, subwords, characters).
      • Use case: Building blocks for language models.
      • Example: "language" → [0.1, -0.1, 0.2, ..., 0.3]

    • Sentence Embeddings
      • Scope: Complete sentences or short passages.
      • Use case: Semantic search, text classification, clustering.
      • Example: "Hello Embeddings!" → [0.4, -0.3, 0.7, ..., 0.8]

    • Document Embeddings
      • Scope: Entire documents or long passages.
      • Use case: Document classification, recommendation systems.

    Contextual vs. Static Embeddings
    • Static Embeddings (Word2Vec, GloVe): Same vector for a word regardless of context.
    • Contextual Embeddings (BERT, GPT, etc.): Different vectors based on surrounding context.
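    A minimal sketch of this difference (using bert-base-uncased purely as an example of a contextual model): the word "bank" gets a different vector in each sentence below, whereas a static embedding would assign it a single fixed vector.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    sentences = ["I deposited cash at the bank.", "We sat on the bank of the river."]

    with torch.no_grad():
        for sentence in sentences:
            tokens = tokenizer(sentence, return_tensors="pt")
            hidden_states = model(**tokens).last_hidden_state[0]
            # locate the token "bank" and print the first values of its contextual embedding
            bank_index = tokens["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
            print(hidden_states[bank_index][:5]) # different values for each sentence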

    Training Process
    • Initialization: Start with random vectors for each token.
    • Context learning: Model learns from massive text datasets.
    • Optimization: Vectors adjust to capture semantic relationships.
    • Convergence: Final embeddings encode learned patterns.
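    A toy sketch of this loop (a skip-gram-style objective on made-up word pairs, not a real training setup): the vectors start random and are adjusted so that co-occurring words score higher than unrelated ones.
    import torch
    import torch.nn as nn

    vocab = {"king": 0, "queen": 1, "banana": 2, "throne": 3}
    embeddings = nn.Embedding(len(vocab), 8) # initialization: random vectors for each token
    optimizer = torch.optim.Adam(embeddings.parameters(), lr=0.1)

    # toy training signal: label 1.0 if the two words co-occur, 0.0 otherwise
    pairs = [("king", "throne", 1.0), ("queen", "throne", 1.0), ("banana", "throne", 0.0)]

    for epoch in range(100): # context learning / optimization: adjust the vectors to fit the signal
        for word, context, label in pairs:
            word_vec = embeddings(torch.tensor([vocab[word]]))[0]
            context_vec = embeddings(torch.tensor([vocab[context]]))[0]
            score = torch.sigmoid(word_vec @ context_vec).unsqueeze(0) # dot product → co-occurrence probability
            loss = nn.functional.binary_cross_entropy(score, torch.tensor([label]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # convergence: "king" and "throne" now score high together, "banana" and "throne" low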

    Practical Applications
    • Text Generation
      • Language models use embeddings as input representations.
      • Enable models to understand and generate coherent text.

    • Text Classification
      • Convert documents to embeddings.
      • Train classifiers on vector representations.
      • Examples: sentiment analysis, spam detection.
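      A minimal sketch of this pipeline (using scikit-learn's LogisticRegression on a tiny made-up dataset; a real classifier would need much more labeled data):
      from sentence_transformers import SentenceTransformer
      from sklearn.linear_model import LogisticRegression

      model = SentenceTransformer("all-MiniLM-L6-v2")

      # tiny made-up sentiment dataset: 1 = positive, 0 = negative
      texts = ["I love this product", "Absolutely fantastic", "Terrible experience", "I want a refund"]
      labels = [1, 1, 0, 0]

      # convert the documents to embeddings, then train a classifier on the vectors
      classifier = LogisticRegression().fit(model.encode(texts), labels)

      print(classifier.predict(model.encode(["This was great"]))) # likely [1] (positive)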

    • Semantic Search & RAG
      • Convert queries and documents to embeddings.
      • Find similar content using vector similarity.
      • Power recommendation systems and search engines.
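      A minimal sketch of embedding-based retrieval (the documents below are made up; cos_sim is the library's cosine-similarity helper):
      from sentence_transformers import SentenceTransformer, util

      model = SentenceTransformer("all-MiniLM-L6-v2")

      documents = [
          "How to install Docker on Ubuntu",
          "Introduction to Kubernetes pods",
          "Baking sourdough bread at home",
      ]
      query = "set up Docker on Linux"

      # embed the query and the documents, then rank the documents by cosine similarity
      doc_embeddings = model.encode(documents)
      query_embedding = model.encode(query)
      scores = util.cos_sim(query_embedding, doc_embeddings)[0]

      for score, document in sorted(zip(scores.tolist(), documents), reverse=True):
          print(f"{score:.3f} {document}") # the Docker document should rank first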

    • Text Clustering
      • Group similar documents using embedding similarity.
      • Organize large text collections.
      • Discover hidden themes in data.
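      A minimal sketch of clustering on embeddings (using scikit-learn's KMeans on a handful of made-up sentences):
      from sentence_transformers import SentenceTransformer
      from sklearn.cluster import KMeans

      model = SentenceTransformer("all-MiniLM-L6-v2")

      sentences = [
          "The stock market rallied today",
          "Investors are optimistic about earnings",
          "The team won the championship game",
          "A last-minute goal decided the match",
      ]

      # group similar sentences by clustering their embeddings
      embeddings = model.encode(sentences)
      kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)

      for sentence, cluster in zip(sentences, kmeans.labels_):
          print(cluster, sentence) # the finance and sports sentences should land in separate clusters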

    Strategies for combining token embeddings into sentence embeddings:
    • Mean pooling: Average all token vectors.
    • Max pooling: Take maximum value across each dimension.
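    A minimal sketch of both strategies on a dummy token-embedding tensor (3 tokens, 4 dimensions, made-up values):
    import torch

    # dummy token embeddings: [number_of_tokens, embeddings_dimension]
    token_embeddings = torch.tensor([[0.1, 0.9, -0.2, 0.4],
                                     [0.3, 0.1,  0.8, 0.0],
                                     [0.5, 0.5, -0.4, 0.2]])

    mean_pooled = token_embeddings.mean(dim=0)      # average over the token axis
    max_pooled = token_embeddings.max(dim=0).values # maximum per dimension

    print(mean_pooled) # tensor([0.3000, 0.5000, 0.0667, 0.2000])
    print(max_pooled)  # tensor([0.5000, 0.9000, 0.8000, 0.4000])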

    Model Selection
    • General purpose: all-MiniLM-L6-v2, all-mpnet-base-v2.
    • High performance: all-MiniLM-L12-v2, larger models.
    • Domain-specific: Fine-tuned models for medical, legal, scientific text.
    • Multilingual: paraphrase-multilingual-MiniLM-L12-v2.
  2. Create token embeddings
    First, download the model (this step is optional: from_pretrained() downloads the model automatically on first use):
    $ huggingface-cli download microsoft/deberta-v3-xsmall
    Python code:
    $ vi token-embeddings.py
    from transformers import AutoModel, AutoTokenizer
    
    # load the model and its tokenizer
    # (note: the tokenizer is loaded from the deberta-base checkpoint, while the model is deberta-v3-xsmall)
    model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
    
    # tokenize input text
    tokens = tokenizer('Hello Embeddings!', return_tensors='pt')
    
    # decode tokens to see how text was split
    for token in tokens['input_ids'][0]:
      # convert each input token id to its corresponding token
      print(tokenizer.decode(token))
    
    # generate embeddings
    output = model(**tokens)[0]
    
    # shape: [batch_size, number_of_tokens, embeddings_dimension]
    print(output.shape) # torch.Size([1, 7, 384])
    
    # output embeddings
    print(output)
    Run the Python script:
    $ python3 token-embeddings.py
    Output:
    # input tokens
    [CLS]
    Hello
     Emb
    edd
    ings
    !
    [SEP]
    
    # output shape: [batch_size, number_of_tokens, embeddings_dimension]
    torch.Size([1, 7, 384])
    
    # output embeddings
    tensor([[[-3.3186,  0.1003, -0.1506,  ..., -0.2840, -0.3882, -0.1670],
             [-0.5446,  0.7986, -0.4200,  ...,  0.1163, -0.3322, -0.3622],
             [-0.1689,  0.6443, -0.0145,  ...,  0.0207, -0.5754,  1.3607],
             ...,
             [ 0.0366,  0.0818, -0.0607,  ..., -0.4793, -0.7831, -0.9185],
             [-0.0555,  0.3136,  0.2662,  ...,  0.3092, -0.4876, -0.3294],
             [-3.1255,  0.1324, -0.0899,  ..., -0.1426, -0.5295,  0.0731]]],
           grad_fn=<NativeLayerNormBackward0>)
    Note that the output embeddings have shape [1, 7, 384] (the token count varies with the input text and its tokenization):
    • 1: the batch dimension
    • 7: seven tokens
    • 384: each token is embedded in a vector of 384 values

    The batch dimension is larger than 1 when multiple sentences are passed to the model and processed at the same time.
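    For example, appending the following lines to token-embeddings.py (same model and tokenizer as above) embeds two sentences in one batch; the second sentence is made up for illustration, and padding aligns both sequences to the same length:
    # tokenize two sentences at once (padding aligns them to the same length)
    batch = tokenizer(['Hello Embeddings!', 'Another sentence.'], return_tensors='pt', padding=True)

    # shape: [batch_size, number_of_tokens, embeddings_dimension]
    print(model(**batch)[0].shape) # e.g. torch.Size([2, 7, 384])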
  3. Create text embeddings
    Install the Sentence Transformers library:
    $ pip install sentence-transformers
    Python code:
    $ vi text-embeddings.py
    from sentence_transformers import SentenceTransformer
    
    # load pre-trained sentence transformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    
    # example sentences
    sentences = ["Hello Sentence Transformers library!", "Generate sentence embeddings!"]
    
    # generate embeddings
    embeddings = model.encode(sentences)
    
    # output shape: [number_of_sentences, embeddings_dimension]
    print(embeddings.shape) # (2, 384) → 2 sentences, each with a 384-dimensional embedding
    
    # output embeddings
    print(embeddings)
    Run the Python script:
    $ python3 text-embeddings.py
    Output:
    # output shape: [number_of_sentences, embeddings_dimension]
    (2, 384)
    
    # output embeddings
    [[-6.70545474e-02 -3.04300548e-03  3.52957926e-04  4.17553373e-02
       5.08048979e-04  1.49061205e-02  1.29256323e-02  5.43267690e-02
    ...
       1.12174535e-02  1.13273829e-01  5.92597015e-02 -1.89474523e-02]]
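    As a follow-up, appending the following lines to text-embeddings.py compares the two sentence embeddings with the library's cos_sim utility:
    from sentence_transformers import util

    # cosine similarity between the two sentence embeddings
    print(util.cos_sim(embeddings[0], embeddings[1])) # a value between -1 and 1; higher means more similar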