  1. Tokenization: the foundation of Large Language Models
  2. Practical Example: Working with Tokenizers

  1. Tokenization: the foundation of Large Language Models
    Tokenization is the critical first step in natural language processing where text is broken down into smaller units called tokens. These tokens can be characters, words, or sub-words, and they form the basic building blocks that Large Language Models (LLMs) use to understand and generate language.

    A tokenizer serves two essential functions:
    • Converting raw text into a sequence of tokens, then into numerical token IDs.
    • Converting token IDs back into tokens and reconstructing the original text.
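
    Both directions can be seen in a minimal sketch, assuming the Hugging Face Transformers library is installed and using the same tokenizer as the practical example below (the token IDs shown are specific to that tokenizer):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

    # text -> tokens -> token IDs
    ids = tokenizer.encode("Hello Tokenizers!")
    print(ids)                    # [15496, 29130, 11341, 0]

    # token IDs -> tokens -> reconstructed text
    print(tokenizer.decode(ids))  # Hello Tokenizers!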

    The vocabulary of a tokenizer is a comprehensive mapping between tokens and their unique numeric IDs. For example:
    • Tokens: "large", "language", "model", ...
    • Token IDs: 123, 456, 789, ...
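
    This mapping can be inspected directly. A small sketch, reusing the tokenizer loaded in the practical example below (the IDs shown are specific to its vocabulary):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

    vocab = tokenizer.get_vocab()                    # dict: token string -> token ID
    print(len(vocab))                                # total number of entries in the vocabulary
    print(tokenizer.convert_tokens_to_ids("Hello"))  # 15496
    print(tokenizer.convert_ids_to_tokens(15496))    # Hello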

    When a tokenizer encounters a word not in its vocabulary, it employs subword tokenization strategies. This approach allows LLMs to handle a virtually unlimited range of words (by combining subword units).
    For example (exact splits depend on the tokenizer):
    • "tokenizers" → ["token", "izers"]
    • "transformers" → ["transform", "ers"]

    Tokenization Challenges:
    • Capitalization: "Hello" vs "hello" (may be treated as different tokens; see the sketch after this list).
    • Numbers: Efficient representation of numerical values.
    • Multiple languages: Support for cross-lingual text.
    • Emojis and Unicode: Proper handling of non-ASCII characters.
    • Whitespace: Significant in many tokenization schemes.
    • Programming code: Special handling for code syntax.
    • Domain-specific terminology: Medical, legal, or scientific jargon.
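
    The capitalization point is easy to verify with a quick sketch (same tokenizer as the practical example below; the exact IDs depend on the vocabulary):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

    # "Hello" and "hello" are encoded as different token IDs
    print(tokenizer.encode("Hello"))  # [15496]
    print(tokenizer.encode("hello"))  # a different ID sequence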

    Types of Tokens:
    • Complete words: Common words that have their own token (e.g., "hello", "world").
    • Subwords: Parts of words (e.g., "un", "expect", "ed").
    • Characters: Individual characters for rare or complex words.
    • Punctuation: Commas, periods, etc. typically have dedicated tokens.
    • Special tokens: System tokens with specific functions in the model.
    • Whitespace tokens: Some tokenizers prefix tokens with a space character to mark word boundaries.
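
    The whitespace handling can be seen by comparing a raw token with its decoded text, again with the GPT-2-style tokenizer from the practical example below:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

    # this tokenizer stores the leading space inside the token itself, written as 'Ġ'
    print(tokenizer.convert_ids_to_tokens(29130))  # 'ĠToken' (the raw token string)
    print(tokenizer.decode([29130]))               # ' Token' (the decoded text, with the space restored)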

    Tokenizer Parameters:
    • Vocabulary size: Typically between 10,000 and 100,000 tokens.
    • Maximum sequence length: The context window size (2K, 4K, 8K, 32K, etc.).
    • Special tokens: Model-specific control tokens.
    • Tokenization algorithm: BPE, WordPiece, SentencePiece, Unigram, etc.
    • Pre-tokenization rules: How raw text is initially segmented.
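
    Several of these parameters can be read off a loaded tokenizer. A small sketch with the tokenizer from the practical example below (the values printed depend on the model):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

    print(tokenizer.vocab_size)          # size of the base vocabulary (50257 for GPT-2-style tokenizers)
    print(tokenizer.model_max_length)    # maximum sequence length configured for the tokenizer
    print(tokenizer.all_special_tokens)  # the special tokens registered for this tokenizer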

    Special Tokens:
    • Beginning of text [BOS]: Signals the start of input.
    • End of text [EOS]: Signals completion of generation.
    • Padding [PAD]: Fills sequences to uniform length.
    • Unknown [UNK]: Placeholder for tokens outside vocabulary.
    • Classification [CLS]: Used for sentence-level classification tasks.
    • Masking [MASK]: Used in masked language modeling (BERT-style).
    • Separator [SEP]: Separates distinct text segments.
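
    Which special tokens a given tokenizer actually defines can be printed directly. A sketch with the GPT-2-style tokenizer from the practical example below, which typically reuses a single <|endoftext|> token for several of these roles:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

    print(tokenizer.special_tokens_map)  # the special tokens configured for this tokenizer
    print(tokenizer.eos_token)           # <|endoftext|>
    print(tokenizer.eos_token_id)        # 50256 (the ID that ends the generation in the practical example below)
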
  2. Practical Example: Working with Tokenizers
    Let's explore a practical implementation of tokenization using the Hugging Face Transformers library:

    $ vi tokenizer.py
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    # load a pre-trained model and its tokenizer
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
    
    # encode text into token IDs
    input_ids = tokenizer.encode("Hello Tokenizers!", return_tensors='pt')
    
    # display the token IDs
    print(input_ids) # tensor([[15496, 29130, 11341,     0]])
    
    # display individual tokens by converting IDs back to text
    for token_id in input_ids[0]:
      # convert the input token ID to its corresponding token
      print(tokenizer.decode(token_id)) # [Hello] [ Token] [izers] [!]
    
    # generate text from the model based on the input
    output = model.generate(
      input_ids=input_ids,
      max_new_tokens=50
    )
    
    # display the output token IDs
    print(output) # tensor([[15496, 29130, 11341,     0, 50256]])
    
    # display individual output tokens
    for token_id in output[0]:
      # convert the output token ID to its corresponding token
      print(tokenizer.decode(token_id)) # [Hello] [ Token] [izers] [!] [<|endoftext|>]
    
    # convert all output token IDs to their corresponding text
    print(tokenizer.decode(output[0])) # Hello Tokenizers!<|endoftext|>
    Run the Python script:
    $ python3 tokenizer.py
    Output:
    # input token IDs
    tensor([[15496, 29130, 11341,     0]])
    
    # input tokens
    Hello
     Token
    izers
    !
    
    # output token IDs
    tensor([[15496, 29130, 11341,     0, 50256]])
    
    # output tokens
    Hello
     Token
    izers
    !
    <|endoftext|>
    
    # output
    Hello Tokenizers!<|endoftext|>
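
    As a small optional addition at the end of the script, passing skip_special_tokens=True to decode() drops the <|endoftext|> marker from the final text:

    # same as the final print above, but without the special token in the text
    print(tokenizer.decode(output[0], skip_special_tokens=True)) # Hello Tokenizers!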