LLMs | Tokenization
  1. Tokenization: the foundation of Large Language Models
  2. Practical Example: Working with Tokenizers

  1. Tokenization: the foundation of Large Language Models
    Tokenization is the critical first step in natural language processing where text is broken down into smaller units called tokens. These tokens can be characters, words, or sub-words, and they form the basic building blocks that Large Language Models (LLMs) use to understand and generate language.

    A tokenizer serves two essential functions:
    • Converting raw text into a sequence of tokens, then into numerical token IDs.
    • Converting token IDs back into tokens and reassembling the text.
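
    A minimal sketch of both directions, assuming the Hugging Face Transformers library and the publicly available bert-base-uncased tokenizer (both are illustrative choices, not requirements):

      from transformers import AutoTokenizer

      # Load a pretrained tokenizer (the model name is an illustrative choice).
      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

      # Text -> tokens -> token IDs.
      ids = tokenizer.encode("large language model", add_special_tokens=False)

      # Token IDs -> tokens -> assembled text.
      text = tokenizer.decode(ids)
      print(ids, text)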

    The vocabulary of a tokenizer is a comprehensive mapping between tokens and their unique numeric IDs. For example:
    • Tokens: "large", "language", "model", ...
    • Token IDs: 00001, 00002, 00003, ...
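
    The vocabulary can be inspected directly; a short sketch, again assuming the Hugging Face Transformers library and the bert-base-uncased tokenizer (actual IDs differ from one model to another):

      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

      # The vocabulary is a plain mapping from token strings to numeric IDs.
      vocab = tokenizer.get_vocab()
      print(len(vocab))             # number of entries in the vocabulary
      print(vocab.get("language"))  # ID of one token, if it is in the vocabulary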

    When a tokenizer encounters a word that is not in its vocabulary, it falls back on subword tokenization. This approach lets LLMs represent an effectively unlimited vocabulary by combining known subword units. For example:
    • "tokenizers" → ["token", "izers"]
    • "transformers" → ["transform", "ers"]

    Tokenization Challenges:
    • Capitalization: "Hello" vs "hello" (may be treated as different tokens).
    • Numbers: Efficient representation of numerical values.
    • Multiple languages: Support for cross-lingual text.
    • Emojis and Unicode: Proper handling of non-ASCII characters.
    • Whitespace: Significant in many tokenization schemes.
    • Programming code: Special handling for code syntax.
    • Domain-specific terminology: Medical, legal, or scientific jargon.
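
    The sketch below illustrates two of these challenges, capitalization and whitespace, using the Hugging Face Transformers library; gpt2 is an illustrative choice whose byte-level BPE is sensitive to both:

      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("gpt2")

      # Capitalization: "Hello" and "hello" may map to different token IDs.
      print(tokenizer.encode("Hello"), tokenizer.encode("hello"))

      # Whitespace: a leading space typically changes the token as well.
      print(tokenizer.tokenize("hello"), tokenizer.tokenize(" hello"))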

    Types of Tokens:
    • Complete words: Common words that have their own token (e.g., "hello", "world").
    • Subwords: Parts of words (e.g., "un", "expect", "ed").
    • Characters: Individual characters for rare or complex words.
    • Punctuation: Commas, periods, etc. typically have dedicated tokens.
    • Special tokens: System tokens with specific functions in the model.
    • Whitespace tokens: Some tokenizers mark a preceding space by prefixing the token with a special character (e.g., "Ġ" in GPT-2's byte-level BPE, "▁" in SentencePiece).
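
    A single call can surface several of these token types at once; a short sketch assuming the Hugging Face Transformers library and bert-base-uncased:

      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

      # A short string yields complete words ("hello"), punctuation (",", "!")
      # and, for the rare word, subword pieces marked with "##" by WordPiece.
      print(tokenizer.tokenize("hello, untokenizable!"))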

    Tokenizer Parameters:
    • Vocabulary size: Typically between 10,000 and 100,000 tokens.
    • Maximum sequence length: The context window size (2K, 4K, 8K, 32K, etc.).
    • Special tokens: Model-specific control tokens.
    • Tokenization algorithm: BPE, WordPiece, SentencePiece, Unigram, etc.
    • Pre-tokenization rules: How raw text is initially segmented.
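
    Some of these parameters can be read off a loaded tokenizer; a minimal sketch, again with the Hugging Face Transformers library and bert-base-uncased:

      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

      print(tokenizer.vocab_size)        # size of the base vocabulary (excluding added tokens)
      print(tokenizer.model_max_length)  # maximum sequence length the tokenizer is configured for
                                         # (a very large placeholder if the checkpoint does not set one)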

    Special Tokens:
    • Beginning of text [BOS]: Signals the start of input.
    • End of text [EOS]: Signals completion of generation.
    • Padding [PAD]: Fills sequences to uniform length.
    • Unknown [UNK]: Placeholder for tokens outside vocabulary.
    • Classification [CLS]: Used for sentence-level classification tasks.
    • Masking [MASK]: Used in masked language modeling (BERT-style).
    • Separator [SEP]: Separates distinct text segments.
    • System prompt [SYS]: Defines the AI assistant's behavior (Claude, GPT).
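
    The sketch below shows a model's special tokens and where they end up in an encoded sequence, assuming the Hugging Face Transformers library and bert-base-uncased (BERT uses [CLS]/[SEP] rather than [BOS]/[EOS]):

      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

      # The special tokens this model defines, in their text form.
      print(tokenizer.special_tokens_map)

      # Encoding a sentence pair inserts them automatically:
      # [CLS] first segment [SEP] second segment [SEP]
      ids = tokenizer.encode("How are you?", "I am fine.")
      print(tokenizer.convert_ids_to_tokens(ids))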
  2. Practical Example: Working with Tokenizers
    Let's explore a practical implementation of tokenization using the Hugging Face Transformers library:
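
    A minimal sketch of such a script is shown below; the model name (bert-base-uncased) and the file name (tokenize_example.py) are illustrative choices, and any pretrained tokenizer from the Hugging Face Hub exposes the same calls.

      # tokenize_example.py
      # Requires: pip install transformers
      from transformers import AutoTokenizer

      # Load a pretrained tokenizer (its files are downloaded on the first run).
      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

      text = "Tokenization is the foundation of Large Language Models."

      # 1. Raw text -> tokens (strings).
      tokens = tokenizer.tokenize(text)
      print("Tokens:", tokens)

      # 2. Tokens -> numeric token IDs.
      token_ids = tokenizer.convert_tokens_to_ids(tokens)
      print("Token IDs:", token_ids)

      # 3. Encoding directly adds the model's special tokens (e.g. [CLS]/[SEP] for BERT).
      encoded = tokenizer(text)
      print("Input IDs:", encoded["input_ids"])

      # 4. Token IDs -> text.
      print("Decoded:", tokenizer.decode(encoded["input_ids"]))
      print("Decoded (no special tokens):",
            tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))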


    Run the Python script:
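
      python3 tokenize_example.py

    (The file name matches the sketch above; use whatever name you saved the script under.)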

    Output: the script prints the token strings, their numeric IDs, the full input IDs including special tokens, and the decoded text. The exact tokens and IDs depend on the chosen model's vocabulary.