  1. Tokenization
    Tokenization is the process of breaking down text into smaller units (characters, words, sub-words) called tokens.

    Tokenizers are used to convert (see the sketch after this list):
    • Tokens into the token IDs associated with them (encoding).
    • Token IDs back into the tokens associated with them (decoding).
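
    A minimal sketch of both conversions, assuming the Hugging Face transformers library and its gpt2 tokenizer (both choices are assumptions for illustration):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed example tokenizer

    text = "large language model"
    token_ids = tokenizer.encode(text)     # text -> tokens -> token IDs
    decoded = tokenizer.decode(token_ids)  # token IDs -> tokens -> text

    print(token_ids)
    print(decoded)  # "large language model"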

    Token IDs are integers that uniquely identify each token (character, word, sub-word) in the Tokenizer vocabulary (a lookup table of all its tokens and their IDs).

    Tokenizer vocabulary (see the sketch after this list):
    • Tokens: large, language, model, ...
    • Token IDs: 00001, 00002, 00003, ...
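
    With the same assumed gpt2 tokenizer, the vocabulary can be inspected as a token-to-ID lookup table:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed example tokenizer

    vocab = tokenizer.get_vocab()  # dict mapping each token to its token ID
    print(len(vocab))              # size of the vocabulary

    token = tokenizer.tokenize("model")[0]  # a token guaranteed to be in the vocabulary
    print(token, vocab[token])              # the token and its unique token ID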

    Words that are not in the Tokenizer vocabulary can be split into known sub-word tokens (for example: tokenizers => token + izers).
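
    A sketch of this splitting, with the same assumed gpt2 tokenizer (the exact split depends on its vocabulary):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed example tokenizer

    # A word absent from the vocabulary is split into known sub-word tokens.
    print(tokenizer.tokenize("tokenizers"))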

    Tokenizers need to handle special cases (illustrated after this list):
    • capitalization
    • numbers
    • languages
    • emojis
    • programming code
    • ...
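
    A sketch of how these cases can be probed, with the same assumed gpt2 tokenizer (each case may produce very different splits):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed example tokenizer

    # Capitalization, numbers, emojis, and programming code all end up as tokens.
    for sample in ["Hello", "hello", "12345", "😀", "for (int i = 0; i < n; i++)"]:
        print(sample, "->", tokenizer.tokenize(sample))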

    Examples of tokens (illustrated after this list):
    • Some tokens are complete words (for example: hello)
    • Some tokens are parts of words (for example: token, izers)
    • Punctuation and special characters have their own tokens
    • Special tokens (for example: <|endoftext|>)
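
    These cases can be observed with the same assumed gpt2 tokenizer, which uses <|endoftext|> as its special end-of-text token:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed example tokenizer

    print(tokenizer.tokenize("hello tokenizers!"))  # word, sub-word, and punctuation tokens
    print(tokenizer.eos_token)                      # special token: <|endoftext|>
    print(tokenizer.eos_token_id)                   # its token ID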

    Each Tokenizer is characterized by its parameters (see the sketch after this list):
    • The size of its vocabulary
    • The special tokens
    • ...
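
    These parameters can be read directly from a tokenizer, for example (same assumed library and tokenizer):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed example tokenizer

    print(tokenizer.vocab_size)          # the size of its vocabulary
    print(tokenizer.all_special_tokens)  # its special tokens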

    Special tokens are unique symbols used to carry additional information or to serve specific purposes (see the sketch after this list):
    • Beginning of text: used to mark the start of the input
    • End of text: used by the model to indicate that it has completed processing the request
    • Padding [PAD]: used to align the input with the context length
    • Unknown [UNK]: used to represent a token that is not in the Tokenizer vocabulary
    • CLS [CLS]: used for classification tasks
    • Masking [MASK]: used to hide a token
    • ...
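
    A sketch of these special tokens, assuming the bert-base-uncased tokenizer (an assumed example that defines the [PAD], [UNK], [CLS], and [MASK] tokens listed above):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed example

    print(tokenizer.pad_token, tokenizer.pad_token_id)    # [PAD] and its token ID
    print(tokenizer.unk_token, tokenizer.unk_token_id)    # [UNK] and its token ID
    print(tokenizer.cls_token, tokenizer.cls_token_id)    # [CLS] and its token ID
    print(tokenizer.mask_token, tokenizer.mask_token_id)  # [MASK] and its token ID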

    Example (a minimal sketch, assuming the Hugging Face transformers library and its gpt2 tokenizer):
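
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed example tokenizer

    text = "Hello world"

    tokens = tokenizer.tokenize(text)                    # text -> tokens
    token_ids = tokenizer.convert_tokens_to_ids(tokens)  # tokens -> token IDs
    decoded = tokenizer.decode(token_ids)                # token IDs -> text

    print("Tokens:", tokens)
    print("Token IDs:", token_ids)
    print("Decoded:", decoded)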

    Output (illustrative; exact tokens and token IDs depend on the tokenizer):
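
    Tokens: ['Hello', 'Ġworld']
    Token IDs: [15496, 995]
    Decoded: Hello world

    (With the gpt2 tokenizer, Ġ marks a leading space in its byte-level vocabulary.)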