Embeddings are numerical representations of a token:
- They are also known as embedding vectors, vector representations, or vectors.
- They aim to capture the semantics (meaning, context, patterns) of the embedded text.
- They represent a token and its relationships with other tokens, based on the contexts in which it appears (contextualized token embeddings).
- They allow measuring the semantic similarity between two tokens.
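For example, semantic similarity can be measured by comparing two embedding vectors with cosine similarity. The sketch below assumes the `sentence-transformers` library and the `all-MiniLM-L6-v2` model as an example embedding model:

```python
# Minimal sketch, assuming sentence-transformers is installed and the
# all-MiniLM-L6-v2 model can be downloaded (example model, not prescribed by the notes).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed two short texts into fixed-size vectors.
embeddings = model.encode(["king", "queen"])

# Cosine similarity: close to 1.0 for semantically related texts,
# close to 0.0 (or negative) for unrelated ones.
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity.item())
```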
Embeddings are used for:
- Text generation
- Text classification
- Text clustering
- ...
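As an illustrative sketch of such a downstream use, text clustering can operate directly on the embedding vectors. This assumes `sentence-transformers` and `scikit-learn` are installed; the texts, model, and cluster count are arbitrary examples:

```python
# Sketch: cluster texts by embedding them and running k-means on the vectors.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

texts = [
    "The stock market rallied today.",
    "Investors reacted to the interest rate decision.",
    "The team won the championship game.",
    "The striker scored twice in the final.",
]

# Each text becomes one fixed-size vector; clustering groups similar meanings.
embeddings = model.encode(texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # e.g. [0 0 1 1]: finance texts vs. sports texts
```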
Embeddings have the following characteristics:
- The size (number of dimensions) of the embeddings is fixed and depends on the underlying embedding model.
- Each dimension holds a numerical value (a property).
- The combination of these properties (numerical values) is what represents a token.
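The fixed dimensionality can be observed directly. The sketch below assumes `sentence-transformers` with the `all-MiniLM-L6-v2` model, whose vectors have 384 dimensions:

```python
# Sketch: the embedding size is fixed by the model, regardless of input length.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model with 384-dimensional vectors

short_vec = model.encode("Hi")
long_vec = model.encode("A much longer sentence about embeddings and their fixed size.")

# Both vectors have the same number of dimensions, fixed by the model.
print(short_vec.shape, long_vec.shape)           # (384,) (384,)
print(model.get_sentence_embedding_dimension())  # 384
```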
A tokenizer is used in the training process of its associated model:
- The model is linked with its tokenizer and can't be used with another tokenizer.
- The model holds an embedding for each token in the tokenizer's vocabulary.
- When a model is created, its embeddings and weights are randomly initialized.
- When a model is trained, its embeddings and weights are assigned proper values that capture the semantics of each token.
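A sketch with the Hugging Face `transformers` library (using `bert-base-uncased` as an example checkpoint) shows this pairing: the model's input embedding matrix has exactly one row per token in its tokenizer's vocabulary:

```python
# Sketch: a model and its tokenizer are loaded as a matched pair.
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"  # example checkpoint; tokenizer and model must match
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

embedding_matrix = model.get_input_embeddings().weight
print(tokenizer.vocab_size)    # 30522 tokens in the vocabulary
print(embedding_matrix.shape)  # torch.Size([30522, 768]): one 768-d vector per token

# Look up the (trained) embedding of a single token.
token_id = tokenizer.convert_tokens_to_ids("king")
print(embedding_matrix[token_id][:5])  # first 5 of the 768 values
```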
Types of embeddings:
- Token embeddings: creating one vector that represents each token.
- Sentence embeddings: creating one vector that represents each sentence (used for categorization, semantic search, RAG).
- Document embeddings: creating one vector that represents each document (used for categorization, semantic search, RAG).
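For instance, sentence/document embeddings make a basic semantic search (the retrieval step of RAG) possible: embed the documents and the query, then rank by cosine similarity. A sketch assuming `sentence-transformers` and the example model `all-MiniLM-L6-v2`:

```python
# Sketch: semantic search over a few documents using sentence embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "The Eiffel Tower is located in Paris.",
    "Python is a popular programming language.",
    "Moby-Dick was written by Herman Melville.",
]
query = "Where is the Eiffel Tower?"

doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)

# Rank documents by cosine similarity to the query and return the best match.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = scores.argmax().item()
print(documents[best], scores[best].item())
```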
When generating embeddings for text longer than a single token, the embeddings should capture the meaning of the whole text.
One way to generate such embeddings is to average the embeddings of all the tokens in the text (mean pooling), as sketched below.
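A possible implementation of that averaging over contextualized token embeddings, sketched with the Hugging Face `transformers` library and `bert-base-uncased` as an example checkpoint; padding positions are excluded via the attention mask:

```python
# Sketch: mean pooling of token embeddings into one vector for the whole text.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("Embeddings capture the meaning of a whole text.", return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # [1, num_tokens, 768]

# Average the token embeddings, ignoring padding positions via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1)              # [1, num_tokens, 1]
text_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

print(token_embeddings.shape)  # one contextualized vector per token
print(text_embedding.shape)    # torch.Size([1, 768]): one vector for the whole text
```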