LLMs | Embeddings
  1. Embeddings
  2. Create token embeddings
  3. Create text embeddings

  1. Embeddings
    Embeddings are the foundation of modern NLP, enabling machines to work with human language in a mathematically meaningful way. Understanding how to create, use, and optimize embeddings is essential for any text-processing application.

    Embeddings are numerical representations that transform discrete tokens (words, subwords, characters) into continuous vector spaces. Think of them as a way to give computers a mathematical understanding of language meaning.

    Key Characteristics:
    • Dense vectors: Each embedding is a list of real numbers (typically 100-4096 dimensions).
    • Semantic capture: Similar words have similar embeddings in vector space.
    • Contextual: Modern embeddings consider surrounding context, not just the word itself.
    • Fixed dimensionality: All embeddings from a model have the same number of dimensions.

    Alternative names: embedding vectors, vector representations, dense representations, distributed representations

    By converting words to vectors, we enable:
    • Semantic similarity: Measuring how similar two pieces of text are (see the cosine-similarity sketch after this list).
    • Machine learning: Using text as input to neural networks.
    • Information retrieval: Finding relevant documents or passages.
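    A minimal sketch of the similarity idea (the vectors below are made up for illustration; real embeddings have hundreds of dimensions): cosine similarity compares the direction of two embedding vectors.
    import numpy as np

    def cosine_similarity(a, b):
        # dot product of the vectors divided by the product of their norms
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # toy 4-dimensional "embeddings" (made-up values)
    cat = np.array([0.7, 0.2, 0.1, 0.5])
    dog = np.array([0.6, 0.3, 0.2, 0.4])
    car = np.array([-0.4, 0.8, -0.3, 0.1])

    print(cosine_similarity(cat, dog)) # ~0.98 → the vectors point in similar directions
    print(cosine_similarity(cat, car)) # ~-0.12 → the vectors are nearly unrelated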

    Types of Embeddings
    • Token Embeddings
      • Scope: Individual tokens (words, subwords, characters).
      • Use case: Building blocks for language models.
      • Example: "language" → [0.1, -0.1, 0.2, ..., 0.3]

    • Sentence Embeddings
      • Scope: Complete sentences or short passages.
      • Use case: Semantic search, text classification, clustering.
      • Example: "Hello Embeddings!" → [0.4, -0.3, 0.7, ..., 0.8]

    • Document Embeddings
      • Scope: Entire documents or long passages.
      • Use case: Document classification, recommendation systems.

    Contextual vs. Static Embeddings
    • Static Embeddings (Word2Vec, GloVe): Same vector for a word regardless of context.
    • Contextual Embeddings (BERT, GPT, etc.): Different vectors based on surrounding context.
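    A minimal sketch of this difference (using bert-base-uncased purely as an example of a contextual model): the word "bank" gets a different vector in each sentence below, whereas a static embedding would assign it a single fixed vector.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    sentences = ["I deposited cash at the bank.", "We sat on the bank of the river."]

    with torch.no_grad():
        for sentence in sentences:
            tokens = tokenizer(sentence, return_tensors="pt")
            hidden_states = model(**tokens).last_hidden_state[0]
            # locate the token "bank" and print the first values of its contextual embedding
            bank_index = tokens["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
            print(hidden_states[bank_index][:5]) # different values for each sentence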

    Training Process
    • Initialization: Start with random vectors for each token.
    • Context learning: Model learns from massive text datasets.
    • Optimization: Vectors adjust to capture semantic relationships.
    • Convergence: Final embeddings encode learned patterns.
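    A toy sketch of this loop (a skip-gram-style objective on made-up word pairs, not a real training setup): the vectors start random and are adjusted so that co-occurring words score higher than unrelated ones.
    import torch
    import torch.nn as nn

    vocab = {"king": 0, "queen": 1, "banana": 2, "throne": 3}
    embeddings = nn.Embedding(len(vocab), 8) # initialization: random vectors for each token
    optimizer = torch.optim.Adam(embeddings.parameters(), lr=0.1)

    # toy training signal: label 1.0 if the two words co-occur, 0.0 otherwise
    pairs = [("king", "throne", 1.0), ("queen", "throne", 1.0), ("banana", "throne", 0.0)]

    for epoch in range(100): # context learning / optimization: adjust the vectors to fit the signal
        for word, context, label in pairs:
            word_vec = embeddings(torch.tensor([vocab[word]]))[0]
            context_vec = embeddings(torch.tensor([vocab[context]]))[0]
            score = torch.sigmoid(word_vec @ context_vec).unsqueeze(0) # dot product → co-occurrence probability
            loss = nn.functional.binary_cross_entropy(score, torch.tensor([label]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # convergence: "king" and "throne" now score high together, "banana" and "throne" low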

    Practical Applications
    • Text Generation
      • Language models use embeddings as input representations.
      • Enable models to understand and generate coherent text.

    • Text Classification
      • Convert documents to embeddings.
      • Train classifiers on vector representations.
      • Examples: sentiment analysis, spam detection.
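      A minimal sketch of this pipeline (using scikit-learn's LogisticRegression on a tiny made-up dataset; a real classifier would need much more labeled data):
      from sentence_transformers import SentenceTransformer
      from sklearn.linear_model import LogisticRegression

      model = SentenceTransformer("all-MiniLM-L6-v2")

      # tiny made-up sentiment dataset: 1 = positive, 0 = negative
      texts = ["I love this product", "Absolutely fantastic", "Terrible experience", "I want a refund"]
      labels = [1, 1, 0, 0]

      # convert the documents to embeddings, then train a classifier on the vectors
      classifier = LogisticRegression().fit(model.encode(texts), labels)

      print(classifier.predict(model.encode(["This was great"]))) # likely [1] (positive)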

    • Semantic Search & RAG
      • Convert queries and documents to embeddings.
      • Find similar content using vector similarity.
      • Power recommendation systems and search engines.
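      A minimal sketch of embedding-based retrieval (the documents below are made up; cos_sim is the library's cosine-similarity helper):
      from sentence_transformers import SentenceTransformer, util

      model = SentenceTransformer("all-MiniLM-L6-v2")

      documents = [
          "How to install Docker on Ubuntu",
          "Introduction to Kubernetes pods",
          "Baking sourdough bread at home",
      ]
      query = "set up Docker on Linux"

      # embed the query and the documents, then rank the documents by cosine similarity
      doc_embeddings = model.encode(documents)
      query_embedding = model.encode(query)
      scores = util.cos_sim(query_embedding, doc_embeddings)[0]

      for score, document in sorted(zip(scores.tolist(), documents), reverse=True):
          print(f"{score:.3f} {document}") # the Docker document should rank first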

    • Text Clustering
      • Group similar documents using embedding similarity.
      • Organize large text collections.
      • Discover hidden themes in data.
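      A minimal sketch of clustering on embeddings (using scikit-learn's KMeans on a handful of made-up sentences):
      from sentence_transformers import SentenceTransformer
      from sklearn.cluster import KMeans

      model = SentenceTransformer("all-MiniLM-L6-v2")

      sentences = [
          "The stock market rallied today",
          "Investors are optimistic about earnings",
          "The team won the championship game",
          "A last-minute goal decided the match",
      ]

      # group similar sentences by clustering their embeddings
      embeddings = model.encode(sentences)
      kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)

      for sentence, cluster in zip(sentences, kmeans.labels_):
          print(cluster, sentence) # the finance and sports sentences should land in separate clusters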

    Strategies for combining token embeddings into sentence embeddings:
    • Mean pooling: Average all token vectors.
    • Max pooling: Take maximum value across each dimension.
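    A minimal sketch of both strategies on a dummy token-embedding tensor (3 tokens, 4 dimensions, made-up values):
    import torch

    # dummy token embeddings: [number_of_tokens, embeddings_dimension]
    token_embeddings = torch.tensor([[0.1, 0.9, -0.2, 0.4],
                                     [0.3, 0.1,  0.8, 0.0],
                                     [0.5, 0.5, -0.4, 0.2]])

    mean_pooled = token_embeddings.mean(dim=0)      # average over the token axis
    max_pooled = token_embeddings.max(dim=0).values # maximum per dimension

    print(mean_pooled) # tensor([0.3000, 0.5000, 0.0667, 0.2000])
    print(max_pooled)  # tensor([0.5000, 0.9000, 0.8000, 0.4000])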

    Model Selection
    • General purpose: all-MiniLM-L6-v2, all-mpnet-base-v2.
    • High performance: all-MiniLM-L12-v2, larger models.
    • Domain-specific: Fine-tuned models for medical, legal, scientific text.
    • Multilingual: paraphrase-multilingual-MiniLM-L12-v2.
  2. Create token embeddings
    First, download the model (this step is optional: from_pretrained() downloads the model automatically on first use):
    $ huggingface-cli download microsoft/deberta-v3-xsmall
    Python code:
    $ vi token-embeddings.py
    from transformers import AutoModel, AutoTokenizer
    
    # load the model and its tokenizer
    # (note: the tokenizer is loaded from the deberta-base checkpoint, while the model is deberta-v3-xsmall)
    model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
    
    # tokenize input text
    tokens = tokenizer('Hello Embeddings!', return_tensors='pt')
    
    # decode tokens to see how text was split
    for token in tokens['input_ids'][0]:
      # convert each input token id to its corresponding token
      print(tokenizer.decode(token))
    
    # generate embeddings
    output = model(**tokens)[0]
    
    # shape: [batch_size, number_of_tokens, embeddings_dimension]
    print(output.shape) # torch.Size([1, 7, 384])
    
    # output embeddings
    print(output)
    Run the Python script:
    $ python3 token-embeddings.py
    Output:
    # input tokens
    [CLS]
    Hello
     Emb
    edd
    ings
    !
    [SEP]
    
    # output shape: [batch_size, number_of_tokens, embeddings_dimension]
    torch.Size([1, 7, 384])
    
    # output embeddings
    tensor([[[-3.3186,  0.1003, -0.1506,  ..., -0.2840, -0.3882, -0.1670],
             [-0.5446,  0.7986, -0.4200,  ...,  0.1163, -0.3322, -0.3622],
             [-0.1689,  0.6443, -0.0145,  ...,  0.0207, -0.5754,  1.3607],
             ...,
             [ 0.0366,  0.0818, -0.0607,  ..., -0.4793, -0.7831, -0.9185],
             [-0.0555,  0.3136,  0.2662,  ...,  0.3092, -0.4876, -0.3294],
             [-3.1255,  0.1324, -0.0899,  ..., -0.1426, -0.5295,  0.0731]]],
           grad_fn=<NativeLayerNormBackward0>)
    Note that the output embeddings have shape [1, 7, 384] (the token count varies with the input text and its tokenization):
    • 1: the batch dimension
    • 7: seven tokens
    • 384: each token is embedded in a vector of 384 values

    The batch dimension is larger than 1 when multiple sentences are passed to the model and processed at the same time.
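    For example, appending the following lines to token-embeddings.py (same model and tokenizer as above) embeds two sentences in one batch; the second sentence is made up for illustration, and padding aligns both sequences to the same length:
    # tokenize two sentences at once (padding aligns them to the same length)
    batch = tokenizer(['Hello Embeddings!', 'Another sentence.'], return_tensors='pt', padding=True)

    # shape: [batch_size, number_of_tokens, embeddings_dimension]
    print(model(**batch)[0].shape) # e.g. torch.Size([2, 7, 384])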
  3. Create text embeddings
    Install the Sentence Transformers library:
    $ pip install sentence-transformers
    Python code:
    $ vi text-embeddings.py
    from sentence_transformers import SentenceTransformer
    
    # load pre-trained sentence transformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    
    # example sentences
    sentences = ["Hello Sentence Transformers library!", "Generate sentence embeddings!"]
    
    # generate embeddings
    embeddings = model.encode(sentences)
    
    # output shape: [number_of_sentences, embeddings_dimension]
    print(embeddings.shape) # (2, 384) → 2 sentences, each with a 384-dimensional embedding
    
    # output embeddings
    print(embeddings)
    Run the Python script:
    $ python3 text-embeddings.py
    Output:
    # output shape: [number_of_sentences, embeddings_dimension]
    (2, 384)
    
    # output embeddings
    [[-6.70545474e-02 -3.04300548e-03  3.52957926e-04  4.17553373e-02
       5.08048979e-04  1.49061205e-02  1.29256323e-02  5.43267690e-02
    ...
       1.12174535e-02  1.13273829e-01  5.92597015e-02 -1.89474523e-02]]
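    As a follow-up, appending the following lines to text-embeddings.py compares the two sentence embeddings with the library's cos_sim utility:
    from sentence_transformers import util

    # cosine similarity between the two sentence embeddings
    print(util.cos_sim(embeddings[0], embeddings[1])) # a value between -1 and 1; higher means more similar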