Tokenization is the process of breaking text down into smaller units called tokens (characters, words, or sub-words).
Tokenizers are used to convert in both directions:
- Tokens into the token IDs associated with them (encoding).
- Token IDs back into the tokens associated with them (decoding).
Token IDs are integers that uniquely identify each token (character, word, sub-word) in the tokenizer vocabulary (a lookup table of all its tokens and their IDs).
Tokenizer vocabulary:
- Tokens: large, language, model, ...
- Token IDs: 00001, 00002, 00003, ...
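As a toy illustration of this two-way lookup, a vocabulary can be thought of as a pair of dictionaries (the tokens and IDs below are hypothetical):

```python
# Toy vocabulary: two lookup tables, one per direction (hypothetical IDs).
token_to_id = {"large": 1, "language": 2, "model": 3}
id_to_token = {i: t for t, i in token_to_id.items()}

tokens = ["large", "language", "model"]
ids = [token_to_id[t] for t in tokens]       # encoding: tokens -> token IDs
print(ids)                                   # [1, 2, 3]
print([id_to_token[i] for i in ids])         # decoding: token IDs -> tokens
```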
Words that are not in the tokenizer vocabulary can be split into smaller tokens that are (for example: tokenizers => token + izers).
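A minimal sketch of this splitting using a greedy longest-prefix match (a simplification: real tokenizers use learned merge rules such as Byte Pair Encoding, and the vocabulary below is hypothetical):

```python
def split_into_subwords(word, vocab):
    """Greedily take the longest known prefix, then continue on the rest."""
    pieces = []
    while word:
        for end in range(len(word), 0, -1):
            if word[:end] in vocab:
                pieces.append(word[:end])
                word = word[end:]
                break
        else:
            # No prefix is in the vocabulary: fall back to an unknown marker.
            pieces.append("[UNK]")
            word = word[1:]
    return pieces

vocab = {"token", "izers", "hello"}              # hypothetical vocabulary
print(split_into_subwords("tokenizers", vocab))  # ['token', 'izers']
print(split_into_subwords("hello", vocab))       # ['hello']
```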
Tokenizers need to handle special cases (see the sketch after this list):
- capitalization
- numbers
- multiple languages
- emojis
- programming code
- ...
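A quick way to see how these cases affect the split is to count tokens for different inputs. The sketch below assumes the tiktoken library (pip install tiktoken) and its cl100k_base encoding; any tokenizer can be probed the same way:

```python
import tiktoken

# cl100k_base is an assumption; pick the encoding that matches your model.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", "Hello", "HELLO", "3.14159", "🙂", 'print("hi")']:
    ids = enc.encode(text)
    print(f"{text!r} -> {len(ids)} token(s): {ids}")
```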
Examples of tokens (see the sketch after this list):
- Some tokens are complete words (for example: hello)
- Some tokens are parts of words (for example: token, izers)
- Punctuation and special characters have their own tokens
- Special tokens (for example: <|endoftext|>)
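The sketch below (again assuming tiktoken) shows that punctuation gets its own IDs and that a special token such as <|endoftext|> is encoded as a single ID, provided it is explicitly allowed:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(enc.encode("Hello, world!"))  # punctuation marks get their own IDs

# Special tokens are disallowed by default; they must be allowed explicitly.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # one ID
```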
Each tokenizer is characterized by its parameters (see the sketch after this list):
- The size of its vocabulary
- The special tokens
- ...
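These parameters can be inspected directly; the sketch below assumes tiktoken, whose Encoding objects expose the vocabulary size and the set of special tokens:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)              # size of the vocabulary
print(enc.special_tokens_set)   # special tokens known to this tokenizer
```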
Special tokens are unique symbols used to carry extra information or to serve specific purposes (see the sketch after this list):
- Beginning of text: marks the start of a sequence (often written [BOS])
- End of text (for example <|endoftext|> or [EOS]): used by the model to signal that it has finished generating
- Padding [PAD]: used to align the input with the context length
- Unknown [UNK]: used to represent any token that is not in the vocabulary
- CLS [CLS]: prepended to the input and used for classification tasks
- Masking [MASK]: used to hide a token (for example, during masked language model training)
- ...
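To see several of these special tokens in one encoded input, the sketch below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint (both assumptions; any BERT-style tokenizer behaves similarly):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pad a short sentence to a fixed length to make [PAD] visible.
encoded = tokenizer("Tokenizers are useful.", padding="max_length", max_length=12)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# The output starts with [CLS], closes the sentence with [SEP],
# and fills the remaining positions with [PAD].

print(tokenizer.all_special_tokens)  # e.g. [UNK], [SEP], [PAD], [CLS], [MASK]
```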
Example:
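A minimal encode/decode round trip, assuming the tiktoken library (any tokenizer with encode and decode functions would illustrate the same idea):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenizers convert text into token IDs."
token_ids = enc.encode(text)                   # text -> token IDs
tokens = [enc.decode([i]) for i in token_ids]  # inspect each token

print(token_ids)
print(tokens)
print(enc.decode(token_ids))                   # token IDs -> original text
```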
Output: the list of token IDs, the corresponding token strings, and the decoded text, which matches the original string (the exact IDs depend on the tokenizer's vocabulary).