Let's explore a practical implementation of tokenization using the Hugging Face Transformers library:
$ vi tokenizer.py
from transformers import AutoModelForCausalLM, AutoTokenizer
# load a pre-trained model and its tokenizer
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# encode text into token IDs
input_ids = tokenizer.encode("Hello Tokenizers!", return_tensors='pt')
# display the token IDs
print(input_ids) # tensor([[15496, 29130, 11341, 0]])
# display individual tokens by converting IDs back to text
for token_id in input_ids[0]:
    # convert each input token ID to its corresponding token
    print(tokenizer.decode(token_id)) # [Hello] [ Token] [izers] [!]
# generate text from the model based on the input
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=50
)
# display the output token IDs
print(output) # tensor([[15496, 29130, 11341, 0, 50256]])
# display individual output tokens
for token_id in output[0]:
    # convert each output token ID to its corresponding token
    print(tokenizer.decode(token_id)) # [Hello] [ Token] [izers] [!] [<|endoftext|>]
# convert all output token IDs to their corresponding text
print(tokenizer.decode(output[0])) # Hello Tokenizers!<|endoftext|>
Run the Python script:
$ python3 tokenizer.py
Output:
# input token IDs
tensor([[15496, 29130, 11341, 0]])
# input tokens
Hello
Token
izers
!
# output token IDs
tensor([[15496, 29130, 11341, 0, 50256]])
# output tokens
Hello
Token
izers
!
<|endoftext|>
# output
Hello Tokenizers!<|endoftext|>
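Notice how "Tokenizers" was split into " Token" and "izers": the word is not in the vocabulary as a single entry, so the tokenizer falls back to subword pieces. As a rough sketch of this idea (a greedy longest-match over a toy, hand-picked vocabulary, not the actual byte-pair-encoding algorithm the GPT-2 tokenizer uses), we can reproduce the same split:

```python
# Toy greedy longest-match subword tokenizer.
# NOTE: the vocabulary below is illustrative only, not the real
# GPT-2/DialoGPT vocabulary.
def greedy_tokenize(text, vocab):
    tokens = []
    i = 0
    while i < len(text):
        # try the longest remaining substring first, shrinking
        # one character at a time until a vocabulary match is found
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # no match: fall back to emitting the single character
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"Hello", " Token", "izers", "!"}
print(greedy_tokenize("Hello Tokenizers!", vocab))
# ['Hello', ' Token', 'izers', '!']
```

Because "Tokenizers" (with its leading space) is absent from the vocabulary but " Token" and "izers" are present, the greedy match produces the same four pieces we saw from the real tokenizer above.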