Tokenization is the critical first step in natural language processing, in which text is broken down into smaller units called tokens.
These tokens can be characters, words, or subwords,
and they form the basic building blocks that Large Language Models (LLMs) use to understand and generate language.
A tokenizer serves two essential functions:
- Converting raw text into a sequence of tokens, then into numerical token IDs.
- Converting token IDs back into tokens and reassembling them into readable text.
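Both directions can be seen in a minimal sketch, assuming the Hugging Face transformers library is installed and using the pretrained GPT-2 tokenizer purely as an example (any other tokenizer behaves analogously):

```python
from transformers import AutoTokenizer

# Load a pretrained tokenizer (GPT-2 is used here only as an example).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Large language models process tokens, not raw text."

# Direction 1: raw text -> tokens -> numerical token IDs.
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
print(tokens)     # token strings (e.g. 'Ġlanguage', 'Ġmodels', ... for this sentence)
print(token_ids)  # a list of integers; exact values depend on the vocabulary

# Direction 2: token IDs -> reassembled text.
print(tokenizer.decode(token_ids))
```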
The vocabulary of a tokenizer is a comprehensive mapping between tokens and their unique numeric IDs. For example:
- Tokens: "large", "language", "model", ...
- Token IDs: 00001, 00002, 00003, ...
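The mapping can be inspected directly; a small sketch, again assuming the transformers library and the GPT-2 tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The vocabulary is a plain token -> ID dictionary.
vocab = tokenizer.get_vocab()
print(len(vocab))  # vocabulary size (about 50,000 entries for GPT-2)

# Look up a few tokens; exact IDs depend on the vocabulary,
# and 'Ġ' marks a token that was preceded by a space.
for token in ["hello", "Ġlanguage", "Ġmodel"]:
    print(token, "->", vocab.get(token, "not in vocabulary"))
```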
When a tokenizer encounters a word that is not in its vocabulary, it falls back on subword tokenization.
This approach lets LLMs represent an effectively unlimited set of words with a fixed-size vocabulary by combining subword units.
For example:
- "tokenizers" → ["token", "izers"]
- "transformers" → ["transform", "ers"]
Tokenization Challenges:
- Capitalization: "Hello" vs "hello" are usually mapped to different tokens (see the sketch after this list).
- Numbers: Efficient representation of numerical values.
- Multiple languages: Support for cross-lingual text.
- Emojis and Unicode: Proper handling of non-ASCII characters.
- Whitespace: Significant in many tokenization schemes.
- Programming code: Special handling for code syntax.
- Domain-specific terminology: Medical, legal, or scientific jargon.
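Several of these challenges are easy to observe directly; a small sketch, assuming the transformers library and the GPT-2 tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Capitalization: "Hello" and "hello" typically receive different IDs.
print(tokenizer.encode("Hello"), tokenizer.encode("hello"))

# Whitespace: a leading space changes the token (GPT-2 marks it with 'Ġ').
print(tokenizer.tokenize("hello"), tokenizer.tokenize(" hello"))

# Numbers: long numbers are usually split into several pieces.
print(tokenizer.tokenize("3.14159"), tokenizer.tokenize("1234567890"))

# Emojis and other non-ASCII characters fall back to byte-level pieces.
print(tokenizer.tokenize("I love NLP 🤖"))
```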
Types of Tokens:
- Complete words: Common words that have their own token (e.g., "hello", "world").
- Subwords: Parts of words (e.g., "un", "expect", "ed").
- Characters: Individual characters for rare or complex words.
- Punctuation: Commas, periods, etc. typically have dedicated tokens.
- Special tokens: System tokens with specific functions in the model.
- Whitespace tokens: Some tokenizers mark tokens that follow a space with a special prefix character (e.g., 'Ġ' in GPT-2 or '▁' in SentencePiece), as shown in the sketch below.
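Most of these types show up in a single ordinary sentence; a minimal sketch, assuming the transformers library and the GPT-2 tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# One sentence typically mixes complete-word tokens, subword pieces,
# punctuation tokens, and space-prefixed ('Ġ') tokens.
print(tokenizer.tokenize("Hello world, tokenization is unexpectedly hard!"))
```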
Tokenizer Parameters:
- Vocabulary size: Typically between 10,000 and 100,000 tokens.
- Maximum sequence length: The context window size (2K, 4K, 8K, 32K, etc.).
- Special tokens: Model-specific control tokens.
- Tokenization algorithm: BPE, WordPiece, SentencePiece, Unigram, etc.
- Pre-tokenization rules: How raw text is initially segmented.
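Several of these parameters can be read directly off a loaded tokenizer; a small sketch, assuming the transformers library and the GPT-2 tokenizer (the values printed are GPT-2's and will differ for other models):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(tokenizer.vocab_size)          # vocabulary size (about 50,000 for GPT-2)
print(tokenizer.model_max_length)    # maximum sequence length the model expects
print(tokenizer.special_tokens_map)  # the model-specific special tokens
```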
Special Tokens:
- Beginning of text [BOS]: Signals the start of input.
- End of text [EOS]: Signals completion of generation.
- Padding [PAD]: Fills sequences to uniform length.
- Unknown [UNK]: Placeholder for tokens outside vocabulary.
- Classification [CLS]: Used for sentence-level classification tasks.
- Masking [MASK]: Used in masked language modeling (BERT-style).
- Separator [SEP]: Separates distinct text segments.
- System prompt [SYS]: Delimits the system prompt that defines the assistant's behavior; the exact token varies across chat formats (e.g., those used by Claude and GPT).
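As a concrete example, the classic BERT tokenizer defines most of these tokens; a minimal sketch, assuming the transformers library and the bert-base-uncased checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# BERT's named special tokens.
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token,
      tokenizer.unk_token, tokenizer.mask_token)   # [CLS] [SEP] [PAD] [UNK] [MASK]

# Special tokens are added automatically when text is encoded.
ids = tokenizer.encode("Hello world")
print(tokenizer.convert_ids_to_tokens(ids))        # ['[CLS]', 'hello', 'world', '[SEP]']
```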