Tokenization is the process of breaking text down into smaller units called tokens (characters, words, or sub-words).
Tokenizers convert tokens into their associated token IDs, and also convert token IDs back into their associated tokens.
Token IDs are integers that uniquely identify each token (character, word, sub-word) in the tokenizer vocabulary (the table of all its tokens and their IDs).
Example of how tokens/token IDs look in a tokenizer vocabulary:
- Tokens: large, language, model, ...
- Token IDs: 00001, 00002, 00003, ...
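At its core, the vocabulary is a two-way mapping between tokens and token IDs. A minimal sketch in plain Python (a toy three-entry vocabulary, not a real tokenizer):

```python
# Toy vocabulary: token -> token ID (a real tokenizer has tens of thousands of entries).
vocab = {"large": 1, "language": 2, "model": 3}
id_to_token = {token_id: token for token, token_id in vocab.items()}  # token ID -> token

tokens = ["large", "language", "model"]
token_ids = [vocab[token] for token in tokens]          # encode: tokens -> token IDs
print(token_ids)                                        # [1, 2, 3]
print([id_to_token[i] for i in token_ids])              # decode: token IDs -> tokens
```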
Words that are not in the tokenizer vocabulary can be split into sub-word tokens (for example: tokenizers => token + izers), as in the sketch below.
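A quick way to see this is to tokenize an out-of-vocabulary word with a real tokenizer. A minimal sketch, assuming the Hugging Face transformers library and the GPT-2 tokenizer (the exact sub-word pieces depend on that tokenizer's vocabulary):

```python
# Requires the Hugging Face `transformers` package.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumption: GPT-2's byte-level BPE tokenizer

# A word missing from the vocabulary is split into smaller sub-word tokens.
print(tokenizer.tokenize("tokenizers"))
```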
Tokenizers need to handle special cases (see the sketch after this list):
- capitalization
- numbers
- languages
- emojis
- programming code
- ...
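The sketch below (again assuming the transformers GPT-2 tokenizer) runs a few of these special cases through the same tokenizer; each one is handled, just with different kinds and numbers of tokens:

```python
# Requires the Hugging Face `transformers` package.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumption: GPT-2 tokenizer

# Capitalization, numbers, emojis, and code all get tokenized,
# but the resulting tokens differ from case to case.
for text in ["Hello", "hello", "HELLO", "12345", "🙂", "for i in range(10):"]:
    print(repr(text), "->", tokenizer.tokenize(text))
```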
Types of tokens (illustrated in the example after this list):
- Some tokens are complete words (for example: hello)
- Some tokens are parts of words (for example: token, izers)
- Punctuation and special characters have their own tokens
- Special tokens (for example: <|endoftext|>)
- ...
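The sketch below (assuming the transformers GPT-2 tokenizer) shows whole words, parts of words, and punctuation coming out of the same tokenizer, and shows that a special token such as <|endoftext|> is a single vocabulary entry with its own ID:

```python
# Requires the Hugging Face `transformers` package.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumption: GPT-2 tokenizer

# Whole words, parts of words, and punctuation each become tokens.
print(tokenizer.tokenize("Hello, tokenizers!"))

# The special end-of-text token is a single entry in the vocabulary with its own ID.
print(tokenizer.eos_token, "->", tokenizer.eos_token_id)
```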
Each tokenizer is characterized by its parameters (shown in the sketch below):
- The size of its vocabulary
- The special tokens
- ...
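These parameters can be read directly off a loaded tokenizer. A minimal sketch, assuming the transformers library and the GPT-2 tokenizer:

```python
# Requires the Hugging Face `transformers` package.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumption: GPT-2 tokenizer

# Two defining parameters: the vocabulary size and the special tokens.
print("Vocabulary size:", tokenizer.vocab_size)
print("Special tokens: ", tokenizer.all_special_tokens)
```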
Special tokens are unique symbols used to carry additional information or to serve specific purposes (see the example after this list):
- Beginning of text
- End of text: used by the model to indicate that it completed processing the request
- Padding [PAD]: used to align the input with the context length
- Unknown [UNK]: stands in for any token that is not in the vocabulary
- CLS [CLS]: used for classification tasks
- Masking [MASK]: used to hide a token
- ...
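The sketch below assumes the transformers library and the bert-base-uncased tokenizer, which uses [CLS], [SEP], [PAD], [MASK], and [UNK]; it prints their IDs and shows [PAD] being used to align a short input to a fixed length:

```python
# Requires the Hugging Face `transformers` package.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumption: BERT tokenizer

# Each special token is a regular vocabulary entry with its own ID.
for token in [tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token,
              tokenizer.mask_token, tokenizer.unk_token]:
    print(token, "->", tokenizer.convert_tokens_to_ids(token))

# [CLS] and [SEP] are added automatically; [PAD] fills the input up to the requested length.
encoded = tokenizer("large language model", padding="max_length", max_length=8)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```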
Example: run a short Python script that encodes a piece of text into token IDs and then decodes the IDs back into text.
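A minimal sketch of such a script, assuming the Hugging Face transformers library and the GPT-2 tokenizer (the exact tokens and IDs in the output depend on the tokenizer's vocabulary):

```python
# Requires the Hugging Face `transformers` package.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumption: GPT-2 tokenizer

text = "Tokenization breaks text into tokens."

# Encode: text -> tokens -> token IDs.
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
print("Tokens:   ", tokens)
print("Token IDs:", token_ids)

# Decode: token IDs -> text (round trip back to the original string).
print("Decoded:  ", tokenizer.decode(token_ids))
```

The output lists the tokens, their IDs, and the decoded text; the specific values depend on the tokenizer used, but the decoded text should match the original input.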