LLMs | Running Models
  1. Hugging Face Hub
  2. Standard Transformer Implementation
  3. Pipeline-Based Implementation
  4. Run a transformer model using llama-cpp-python
  5. Integration with OpenAI API
  6. Key parameters of the transformer models
  7. Save the model and its associated tokenizer and configuration files
  8. Load the saved model and its associated tokenizer and configuration files

  1. Hugging Face Hub
    The Hugging Face Hub is an extensive repository hosting over 1 million models for text, image, audio, and video processing.

    Hugging Face Models:
    https://huggingface.co/models

    When selecting a model for your application, consider these factors:
    • Architectural foundation (representation vs. generative capabilities).
    • Model size and computational requirements.
    • Performance benchmarks and efficacy metrics.
    • Task specialization and compatibility.
    • Multilingual support.
    • ...

    For a comparative analysis of embedding models across languages and tasks, the MTEB (Massive Text Embedding Benchmark) leaderboard provides comprehensive benchmarking data:
    https://huggingface.co/spaces/mteb/leaderboard
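
    Models can also be searched and downloaded programmatically with the huggingface_hub library. A minimal sketch, assuming the huggingface_hub package is installed and using microsoft/DialoGPT-small as an illustrative repository:
    from huggingface_hub import HfApi, snapshot_download
    
    # list a few popular text-generation models (illustrative filter values)
    api = HfApi()
    for m in api.list_models(filter="text-generation", sort="downloads", limit=5):
        print(m.id)
    
    # download a full model repository to the local cache and print its local path
    local_path = snapshot_download(repo_id="microsoft/DialoGPT-small")
    print(local_path)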
  2. Standard Transformer Implementation
    You can use the Hugging Face CLI to download a model:
    $ huggingface-cli download microsoft/DialoGPT-small
    Python code:
    $ vi llm-transformers.py
    # import the model and the tokenizer objects from the Transformers library
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    # load the model
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
    
    # load the model's tokenizer
    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
    
    # tokenize the input
    input_ids = tokenizer.encode("Hello!" + tokenizer.eos_token, return_tensors='pt')
    
    # generate the text
    output = model.generate(
        input_ids=input_ids,
        max_new_tokens=50
    )
    
    # decode generated tokens
    print(tokenizer.decode(output[0]))
    Run the Python script:
    $ python3 llm-transformers.py
    Output:
    Hello!<|endoftext|>Hi!<|endoftext|>
  3. Pipeline-Based Implementation
    Python code:
    $ vi llm-transformers-pipeline.py
    # import the model and the tokenizer objects from the Transformers library
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    
    # load the model
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
    
    # load the model's tokenizer
    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
    
    # create a pipeline object for the "text-generation" task
    generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=50
    )
    
    # prompt pipeline with some initial text to generate more text
    output = generator("Hello!" + tokenizer.eos_token)
    
    print(output[0]["generated_text"])
    Run the Python script:
    $ python3 llm-transformers-pipeline.py
    Output:
    Hello!<|endoftext|>Hi!
  4. Run a transformer model using llama-cpp-python
    Download this model:
    $ wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
    Python code:
    $ vi llm-llama-cpp.py
    from llama_cpp import Llama
    model = Llama(model_path="./Phi-3-mini-4k-instruct-q4.gguf")
    
    prompt = """
    Question: What's 1+1?
    """
    
    output = model(
        prompt,
        max_tokens=50, # limits the length of the generated text.
        temperature=0, # controls the randomness of the output. Lower values are more deterministic.
        top_p=1, # nucleus sampling threshold in (0, 1]; controls the diversity of token selection. Lower values restrict sampling to the most probable tokens.
        echo=True # includes the prompt in the output if True.
    )
    print(output["choices"][0]["text"])
    Run the Python script:
    $ python3 llm-llama-cpp.py
    Output:
    Question: What's 1+1?
    <|assistant|> 1+1 equals 2.
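
    Phi-3-mini-4k-instruct is an instruction-tuned chat model, so llama-cpp-python can also drive it through its chat-completion interface, which formats the messages with the model's chat template. A minimal sketch reusing the GGUF file downloaded above:
    from llama_cpp import Llama
    
    model = Llama(model_path="./Phi-3-mini-4k-instruct-q4.gguf")
    
    # chat-style call: the messages are formatted with the model's chat template
    output = model.create_chat_completion(
        messages=[{"role": "user", "content": "What's 1+1?"}],
        max_tokens=50,
        temperature=0
    )
    print(output["choices"][0]["message"]["content"])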
  5. Integration with OpenAI API
    OpenAI's GPT models (the models behind ChatGPT, e.g. gpt-4o-mini) are proprietary and can be accessed through OpenAI's API.
    You need to sign up and create an API key here: https://platform.openai.com/api-keys
    The API key is used to authenticate your requests to OpenAI's API.

    • Try the model using curl:
      curl https://api.openai.com/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer YOUR_API_KEY" \
        -d '{
          "model": "gpt-4o-mini",
          "store": true,
          "messages": [
            {"role": "user", "content": "What is 1+1?"}
          ]
        }'
      Output:
      {
        "id": "chatcmpl-BWREFCaHDdorA60Q4ufKWGZ9yY70Z",
        "object": "chat.completion",
        "model": "gpt-4o-mini-2024-07-18",
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "1 + 1 equals 2.",
              "refusal": null,
              "annotations": []
            },
            "logprobs": null,
            "finish_reason": "stop"
          }
        ],
        "usage": {
          "prompt_tokens": 14,
          "completion_tokens": 9,
          "total_tokens": 23,
          "prompt_tokens_details": {
            "cached_tokens": 0,
            "audio_tokens": 0
          },
          "completion_tokens_details": {
            "reasoning_tokens": 0,
            "audio_tokens": 0,
            "accepted_prediction_tokens": 0,
            "rejected_prediction_tokens": 0
          }
        }
      }
    • Try the model using Python:

      Install the OpenAI Python SDK:
      $ pip install openai
      Check OpenAI Python SDK installation:
      $ pip show openai
      Name: openai
      Version: 1.76.0
      ...
      Python code:
      $ vi llm-gpt.py
      import openai
      
      openai.api_key = "YOUR_API_KEY"
      
      completion = openai.chat.completions.create(
        model="gpt-4o-mini",
        store=True,
        messages=[
          {"role": "user", "content": "What is 1+1?"}
        ]
      )
      
      print(completion.choices[0].message)
      Run the Python script:
      $ python3 llm-gpt.py
      Output:
      ChatCompletionMessage(
          content='1 + 1 equals 2.',
          refusal=None,
          role='assistant',
          annotations=[],
          audio=None,
          function_call=None,
          tool_calls=None
      )
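    • Read the API key from an environment variable (recommended over hardcoding it in the script). A minimal sketch using the client-based interface of the openai 1.x SDK:
      import os
      from openai import OpenAI
      
      # the client can also pick up OPENAI_API_KEY from the environment automatically
      client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
      
      completion = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {"role": "user", "content": "What is 1+1?"}
          ]
      )
      
      print(completion.choices[0].message.content)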
  6. Key parameters of the transformer models
    There are a few parameters that can affect the output of the model (see the sketch after this list):

    • Context Length:
      A model has a context length (a.k.a. the context window, context size, token limit):
      • The context length is the maximum number of tokens the model can process at once (prompt plus generated tokens).
      • Generative models are autoregressive: each generated token is appended to the context, so the number of tokens in use grows as generation proceeds.

    • return_full_text:
      If set to "False", only the newly generated text is returned.
      Otherwise, the full text is returned, including the user prompt.

    • max_new_tokens:
      It sets the maximum number of tokens the model can generate.

    • do_sample:
      The model computes a probability for every possible next token and ranks the candidates by that probability.

      If the "do_sample" parameter is set to "False", the model always selects the most probable next token; this leads to a more predictable and consistent response. Otherwise, the model samples from the probability distribution, leading to a wider variety of possible token outputs.

      When the "do_sample" parameter is set to "True", the "temperature" parameter can also be used to make the output more random, so the same prompt can produce different outputs.

    • temperature:
      It controls how likely the model is to choose less probable tokens.

      When we set the "temperature" parameter to 0 (deterministic), the model should always generate the same response when given the same prompt.

      The closer the "temperature" value is to 1 (high randomness), the more likely the model is to pick less probable tokens, producing more varied output.
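
    The sketch below exercises these parameters with the DialoGPT-small pipeline from the earlier sections; the prompt and parameter values are illustrative:
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
    
    # the maximum input length accepted by the tokenizer (1024 tokens for DialoGPT-small)
    print(tokenizer.model_max_length)
    
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
    
    # deterministic: greedy decoding, prompt stripped from the output
    output = generator(
        "Hello!" + tokenizer.eos_token,
        max_new_tokens=50,
        do_sample=False,
        return_full_text=False
    )
    print(output[0]["generated_text"])
    
    # sampled: a higher temperature gives more varied output for the same prompt
    output = generator(
        "Hello!" + tokenizer.eos_token,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.9,
        return_full_text=False
    )
    print(output[0]["generated_text"])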
  7. Save the model and its associated tokenizer and configuration files
    To save a model, tokenizer, and configuration files, we can use the "save_pretrained" method from the Hugging Face Transformers library.

    Ideally, you will save all related files in the same folder (see the sketch at the end of this section).

    Note that saving the model also saves its configuration file.

    • Save the model and its associated configuration files:

      Python code:
      $ vi llm-save-model.py
      from transformers import AutoModelForCausalLM
      
      model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
      
      path = "./models/microsoft/model/dialogpt-small"
      
      # model serialization
      model.save_pretrained(path)
      Run the Python script:
      $ python3 llm-save-model.py
      This will create a directory containing:
      $ ls -1 models/microsoft/model/dialogpt-small/
      config.json
      generation_config.json
      model.safetensors
    • Save the model tokenizer files:

      Python code:
      $ vi llm-save-tokenizer.py
      from transformers import AutoTokenizer
      
      tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
      
      path = "./models/microsoft/tokenizer/dialogpt-small"
      
      # tokenizer serialization
      tokenizer.save_pretrained(path)
      Run the Python script:
      $ python3 llm-save-tokenizer.py
      This will create a directory containing:
      $ ls -1 models/microsoft/tokenizer/dialogpt-small/
      merges.txt
      special_tokens_map.json
      tokenizer.json
      tokenizer_config.json
      vocab.json
    • Save only the model configuration file:

      Python code:
      $ vi llm-save-model-config.py
      from transformers import AutoConfig
      
      config = AutoConfig.from_pretrained("microsoft/DialoGPT-small")
      
      path = "./models/microsoft/config/dialogpt-small"
      
      # configuration serialization
      config.save_pretrained(path)
      Run the Python script:
      $ python3 llm-save-model-config.py
      This will create a directory containing:
      $ ls -1 models/microsoft/config/dialogpt-small/
      config.json
    Files:
    • config.json: The configuration file of the model.
      {
        "architectures": [
          "GPT2LMHeadModel"
        ],
      ...
        "transformers_version": "4.51.3",
        "vocab_size": 50257
      }

    • tokenizer_config.json: The configuration file of the tokenizer.
      {
        "add_bos_token": false,
        "add_prefix_space": false,
        "added_tokens_decoder": {
          "50256": {
            "content": "<|endoftext|>",
            "lstrip": false,
            "normalized": true,
            "rstrip": false,
            "single_word": false,
            "special": true
          }
        },
        "bos_token": "<|endoftext|>",
        "chat_template": "{% for message in messages %}{{ message.content }}{{ eos_token }}{% endfor %}",
        "clean_up_tokenization_spaces": true,
        "eos_token": "<|endoftext|>",
        "errors": "replace",
        "extra_special_tokens": {},
        "model_max_length": 1024,
        "pad_token": null,
        "tokenizer_class": "GPT2Tokenizer",
        "unk_token": "<|endoftext|>"
      }

    • vocab.json, tokenizer.json: contain the vocabulary and the mapping of tokens to IDs.

    • special_tokens_map.json: contains the mapping of special tokens used by the tokenizer.

    • model.safetensors: contains the model's weights.

    • generation_config.json: the default text-generation settings of the model; merges.txt: the BPE merge rules used by the tokenizer.
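
    As noted above, the model and tokenizer are typically saved into the same folder so they can be reloaded together. A minimal sketch; the combined path is illustrative:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
    
    # hypothetical combined folder holding both the model and the tokenizer files
    path = "./models/microsoft/dialogpt-small"
    
    model.save_pretrained(path)
    tokenizer.save_pretrained(path)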
  8. Load the saved model and its associated tokenizer and configuration files
    To load the saved model, tokenizer and configuration files, we can use the "from_pretrained" method from the Hugging Face Transformers library.

    Ideally, you will have saved all related files in the same folder.

    Python code:
    $ vi llm-load-model-tokenizer-config.py
    from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
    
    model_path = "./models/microsoft/model/dialogpt-small"
    tokenizer_path = "./models/microsoft/tokenizer/dialogpt-small"
    config_path = "./models/microsoft/config/dialogpt-small"
    
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    config = AutoConfig.from_pretrained(config_path)
    
    print(model)
    print(tokenizer)
    print(config)
    Run the Python script:
    $ python3 llm-load-model-tokenizer-config.py
    Output:
    GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
    ...
      )
      (lm_head): Linear(in_features=768, out_features=50257, bias=False)
    )
    GPT2TokenizerFast(
        name_or_path='./models/microsoft/tokenizer/dialogpt-small',
        vocab_size=50257,
        model_max_length=1024,
    ...
        special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'},
        added_tokens_decoder={50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),}
    )
    GPT2Config {
      "architectures": [
        "GPT2LMHeadModel"
      ],
    ...
      "transformers_version": "4.51.3",
      "vocab_size": 50257
    }
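
    Once reloaded, the model and tokenizer behave exactly like the ones downloaded from the Hub and can be used for generation as in the earlier sections; a minimal sketch reusing the paths above:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model = AutoModelForCausalLM.from_pretrained("./models/microsoft/model/dialogpt-small")
    tokenizer = AutoTokenizer.from_pretrained("./models/microsoft/tokenizer/dialogpt-small")
    
    # generate text exactly as in section 2, but from the locally saved files
    input_ids = tokenizer.encode("Hello!" + tokenizer.eos_token, return_tensors='pt')
    output = model.generate(input_ids=input_ids, max_new_tokens=50)
    print(tokenizer.decode(output[0]))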