Tokenization is the critical first step in natural language processing, in which text is broken down into smaller units called tokens.
These tokens can be characters, words, or subwords,
and they form the basic building blocks that Large Language Models (LLMs) use to understand and generate language.
A tokenizer serves two essential functions:
- Converting raw text into a sequence of tokens, then into numerical token IDs.
- Converting token IDs back into tokens and reassembling them into readable text.
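Both directions can be seen in a minimal sketch, assuming the Hugging Face transformers library is installed and using the pretrained GPT-2 tokenizer purely as an example (any other tokenizer behaves analogously):

```python
from transformers import AutoTokenizer

# Load a pretrained tokenizer (GPT-2 is used here only as an example).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Large language models process tokens, not raw text."

# Direction 1: raw text -> tokens -> numerical token IDs.
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
print(tokens)     # token strings (e.g. 'Ġlanguage', 'Ġmodels', ... for this sentence)
print(token_ids)  # a list of integers; exact values depend on the vocabulary

# Direction 2: token IDs -> reassembled text.
print(tokenizer.decode(token_ids))
```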
The vocabulary of a tokenizer is a comprehensive mapping between tokens and their unique numeric IDs. For example:
- Tokens: "large", "language", "model", ...
- Token IDs: 00001, 00002, 00003, ...
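The mapping can be inspected directly; a small sketch, again assuming the transformers library and the GPT-2 tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The vocabulary is a plain token -> ID dictionary.
vocab = tokenizer.get_vocab()
print(len(vocab))  # vocabulary size (about 50,000 entries for GPT-2)

# Look up a few tokens; exact IDs depend on the vocabulary,
# and 'Ġ' marks a token that was preceded by a space.
for token in ["hello", "Ġlanguage", "Ġmodel"]:
    print(token, "->", vocab.get(token, "not in vocabulary"))
```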
When a tokenizer encounters a word that is not in its vocabulary, it falls back on subword tokenization.
This approach lets LLMs represent an effectively unlimited set of words with a fixed-size vocabulary by combining subword units.
For example:
- "tokenizers" → ["token", "izers"]
- "transformers" → ["transform", "ers"]
Tokenization Challenges:
- Capitalization: "Hello" vs "hello" are usually mapped to different tokens (see the sketch after this list).
- Numbers: Efficient representation of numerical values.
- Multiple languages: Support for cross-lingual text.
- Emojis and Unicode: Proper handling of non-ASCII characters.
- Whitespace: Significant in many tokenization schemes.
- Programming code: Special handling for code syntax.
- Domain-specific terminology: Medical, legal, or scientific jargon.
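Several of these challenges are easy to observe directly; a small sketch, assuming the transformers library and the GPT-2 tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Capitalization: "Hello" and "hello" typically receive different IDs.
print(tokenizer.encode("Hello"), tokenizer.encode("hello"))

# Whitespace: a leading space changes the token (GPT-2 marks it with 'Ġ').
print(tokenizer.tokenize("hello"), tokenizer.tokenize(" hello"))

# Numbers: long numbers are usually split into several pieces.
print(tokenizer.tokenize("3.14159"), tokenizer.tokenize("1234567890"))

# Emojis and other non-ASCII characters fall back to byte-level pieces.
print(tokenizer.tokenize("I love NLP 🤖"))
```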
Types of Tokens:
- Complete words: Common words that have their own token (e.g., "hello", "world").
- Subwords: Parts of words (e.g., "un", "expect", "ed").
- Characters: Individual characters for rare or complex words.
- Punctuation: Commas, periods, etc. typically have dedicated tokens.
- Special tokens: System tokens with specific functions in the model.
- Whitespace tokens: Some tokenizers mark tokens that follow a space with a special prefix character (e.g., 'Ġ' in GPT-2 or '▁' in SentencePiece), as shown in the sketch below.
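Most of these types show up in a single ordinary sentence; a minimal sketch, assuming the transformers library and the GPT-2 tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# One sentence typically mixes complete-word tokens, subword pieces,
# punctuation tokens, and space-prefixed ('Ġ') tokens.
print(tokenizer.tokenize("Hello world, tokenization is unexpectedly hard!"))
```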
Tokenizer Parameters:
- Vocabulary size: Typically between 10,000 and 100,000 tokens.
- Maximum sequence length: The context window size (2K, 4K, 8K, 32K, etc.).
- Special tokens: Model-specific control tokens.
- Tokenization algorithm: BPE, WordPiece, SentencePiece, Unigram, etc.
- Pre-tokenization rules: How raw text is initially segmented.
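Several of these parameters can be read directly off a loaded tokenizer; a small sketch, assuming the transformers library and the GPT-2 tokenizer (the values printed are GPT-2's and will differ for other models):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(tokenizer.vocab_size)          # vocabulary size (about 50,000 for GPT-2)
print(tokenizer.model_max_length)    # maximum sequence length the model expects
print(tokenizer.special_tokens_map)  # the model-specific special tokens
```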
Special Tokens:
- Beginning of text [BOS]: Signals the start of input.
- End of text [EOS]: Signals completion of generation.
- Padding [PAD]: Fills sequences to uniform length.
- Unknown [UNK]: Placeholder for tokens outside vocabulary.
- Classification [CLS]: Used for sentence-level classification tasks.
- Masking [MASK]: Used in masked language modeling (BERT-style).
- Separator [SEP]: Separates distinct text segments.
- System prompt [SYS]: Delimits the system prompt that defines the assistant's behavior; the exact token varies across chat formats (e.g., those used by Claude and GPT).
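As a concrete example, the classic BERT tokenizer defines most of these tokens; a minimal sketch, assuming the transformers library and the bert-base-uncased checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# BERT's named special tokens.
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token,
      tokenizer.unk_token, tokenizer.mask_token)   # [CLS] [SEP] [PAD] [UNK] [MASK]

# Special tokens are added automatically when text is encoded.
ids = tokenizer.encode("Hello world")
print(tokenizer.convert_ids_to_tokens(ids))        # ['[CLS]', 'hello', 'world', '[SEP]']
```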