LLMs | Running Models
  1. Hugging Face Hub
  2. Standard Transformer Implementation
  3. Pipeline-Based Implementation
  4. Run a transformer model using llama-cpp-python
  5. Integration with OpenAI API
  6. Key parameters of the transformer models
  7. Save the model and its associated tokenizer and configuration files
  8. Load the saved model and its associated tokenizer and configuration files

  1. Hugging Face Hub
    The Hugging Face Hub is an extensive repository hosting over 1 million models for text, image, audio, and video processing.

    Hugging Face Models:
    https://huggingface.co/models

    When selecting a model for your application, consider these factors:
    • Architectural foundation (representation vs. generative capabilities).
    • Model size and computational requirements.
    • Performance benchmarks and efficacy metrics.
    • Task specialization and compatibility.
    • Multilingual support.
    • ...

    For a comparative analysis of embedding models across languages and tasks, the MTEB (Massive Text Embedding Benchmark) leaderboard provides comprehensive benchmarking data:
    https://huggingface.co/spaces/mteb/leaderboard
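
    Models can also be searched and downloaded programmatically with the huggingface_hub library. A minimal sketch, assuming the huggingface_hub package is installed and using microsoft/DialoGPT-small as an illustrative repository:
    from huggingface_hub import HfApi, snapshot_download
    
    # list a few popular text-generation models (illustrative filter values)
    api = HfApi()
    for m in api.list_models(filter="text-generation", sort="downloads", limit=5):
        print(m.id)
    
    # download a full model repository to the local cache and print its local path
    local_path = snapshot_download(repo_id="microsoft/DialoGPT-small")
    print(local_path)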
  2. Standard Transformer Implementation
    You can use the Hugging Face CLI to download a model:
    $ huggingface-cli download microsoft/DialoGPT-small
    Python code:
    $ vi llm-transformers.py
    # import the model and the tokenizer objects from the Transformers library
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    # load the model
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
    
    # load the model's tokenizer
    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
    
    # tokenize the input
    input_ids = tokenizer.encode("Hello!" + tokenizer.eos_token, return_tensors='pt')
    
    # generate the text
    output = model.generate(
        input_ids=input_ids,
        max_new_tokens=50
    )
    
    # decode generated tokens
    print(tokenizer.decode(output[0]))
    Run the Python script:
    $ python3 llm-transformers.py
    Output:
    Hello!<|endoftext|>Hi!<|endoftext|>
  3. Pipeline-Based Implementation
    Python code:
    $ vi llm-transformers-pipeline.py
    # import the model and the tokenizer objects from the Transformers library
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    
    # load the model
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
    
    # load the model's tokenizer
    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
    
    # create a pipeline object for the "text-generation" task
    generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=50
    )
    
    # prompt pipeline with some initial text to generate more text
    output = generator("Hello!" + tokenizer.eos_token)
    
    print(output[0]["generated_text"])
    Run the Python script:
    $ python3 llm-transformers-pipeline.py
    Output:
    Hello!<|endoftext|>Hi!
  4. Run a transformer model using llama-cpp-python
    Download this model:
    $ wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
    Python code:
    $ vi llm-llama-cpp.py
    from llama_cpp import Llama
    model = Llama(model_path="./Phi-3-mini-4k-instruct-q4.gguf")
    
    prompt = """
    Question: What's 1+1?
    """
    
    output = model(
        prompt,
        max_tokens=50, # limits the length of the generated text.
        temperature=0, # controls the randomness of the output. Lower values are more deterministic.
        top_p=1, # nucleus sampling threshold in (0, 1]; controls the diversity of token selection. Lower values restrict sampling to the most probable tokens.
        echo=True # includes the prompt in the output if True.
    )
    print(output["choices"][0]["text"])
    Run the Python script:
    $ python3 llm-llama-cpp.py
    Output:
    Question: What's 1+1?
    <|assistant|> 1+1 equals 2.
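
    Phi-3-mini-4k-instruct is an instruction-tuned chat model, so llama-cpp-python can also drive it through its chat-completion interface, which formats the messages with the model's chat template. A minimal sketch reusing the GGUF file downloaded above:
    from llama_cpp import Llama
    
    model = Llama(model_path="./Phi-3-mini-4k-instruct-q4.gguf")
    
    # chat-style call: the messages are formatted with the model's chat template
    output = model.create_chat_completion(
        messages=[{"role": "user", "content": "What's 1+1?"}],
        max_tokens=50,
        temperature=0
    )
    print(output["choices"][0]["message"]["content"])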
  5. Integration with OpenAI API
    OpenAI's GPT models (the models behind ChatGPT, e.g. gpt-4o-mini) are proprietary and can be accessed through OpenAI's API.
    You need to sign up and create an API key here: https://platform.openai.com/api-keys
    The API key is used to authenticate your requests to OpenAI's API.

    • Try the model using curl:
      curl https://api.openai.com/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer YOUR_API_KEY" \
        -d '{
          "model": "gpt-4o-mini",
          "store": true,
          "messages": [
            {"role": "user", "content": "What is 1+1?"}
          ]
        }'
      Output:
      {
        "id": "chatcmpl-BWREFCaHDdorA60Q4ufKWGZ9yY70Z",
        "object": "chat.completion",
        "model": "gpt-4o-mini-2024-07-18",
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "1 + 1 equals 2.",
              "refusal": null,
              "annotations": []
            },
            "logprobs": null,
            "finish_reason": "stop"
          }
        ],
        "usage": {
          "prompt_tokens": 14,
          "completion_tokens": 9,
          "total_tokens": 23,
          "prompt_tokens_details": {
            "cached_tokens": 0,
            "audio_tokens": 0
          },
          "completion_tokens_details": {
            "reasoning_tokens": 0,
            "audio_tokens": 0,
            "accepted_prediction_tokens": 0,
            "rejected_prediction_tokens": 0
          }
        }
      }
    • Try the model using Python:

      Install the OpenAI Python SDK:
      $ pip install openai
      Check OpenAI Python SDK installation:
      $ pip show openai
      Name: openai
      Version: 1.76.0
      ...
      Python code:
      $ vi llm-gpt.py
      import openai
      
      openai.api_key = "YOUR_API_KEY"
      
      completion = openai.chat.completions.create(
        model="gpt-4o-mini",
        store=True,
        messages=[
          {"role": "user", "content": "What is 1+1?"}
        ]
      )
      
      print(completion.choices[0].message)
      Run the Python script:
      $ python3 llm-gpt.py
      Output:
      ChatCompletionMessage(
          content='1 + 1 equals 2.',
          refusal=None,
          role='assistant',
          annotations=[],
          audio=None,
          function_call=None,
          tool_calls=None
      )
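    • Read the API key from an environment variable (recommended over hardcoding it in the script). A minimal sketch using the client-based interface of the openai 1.x SDK:
      import os
      from openai import OpenAI
      
      # the client can also pick up OPENAI_API_KEY from the environment automatically
      client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
      
      completion = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {"role": "user", "content": "What is 1+1?"}
          ]
      )
      
      print(completion.choices[0].message.content)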
  6. Key parameters of the transformer models
    There are a few parameters that can affect the output of the model (see the sketch after this list):

    • Context Length:
      A model has a context length (a.k.a. the context window, context size, token limit):
      • The context length is the maximum number of tokens the model can process at once (prompt plus generated tokens).
      • Generative models are autoregressive: each generated token is appended to the context, so the number of tokens in use grows as generation proceeds.

    • return_full_text:
      If set to "False", only the newly generated text is returned.
      Otherwise, the full text is returned, including the user prompt.

    • max_new_tokens:
      It sets the maximum number of tokens the model can generate.

    • do_sample:
      The model computes a probability for every possible next token and ranks the candidates by that probability.

      If the "do_sample" parameter is set to "False", the model always selects the most probable next token; this leads to a more predictable and consistent response. Otherwise, the model samples from the probability distribution, leading to a wider variety of possible token outputs.

      When the "do_sample" parameter is set to "True", the "temperature" parameter can also be used to make the output more random, so the same prompt can produce different outputs.

    • temperature:
      It controls how likely the model is to choose less probable tokens.

      When we set the "temperature" parameter to 0 (deterministic), the model should always generate the same response when given the same prompt.

      The closer the "temperature" value is to 1 (high randomness), the more likely the model is to pick less probable tokens, producing more varied output.
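
    The sketch below exercises these parameters with the DialoGPT-small pipeline from the earlier sections; the prompt and parameter values are illustrative:
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
    
    # the maximum input length accepted by the tokenizer (1024 tokens for DialoGPT-small)
    print(tokenizer.model_max_length)
    
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
    
    # deterministic: greedy decoding, prompt stripped from the output
    output = generator(
        "Hello!" + tokenizer.eos_token,
        max_new_tokens=50,
        do_sample=False,
        return_full_text=False
    )
    print(output[0]["generated_text"])
    
    # sampled: a higher temperature gives more varied output for the same prompt
    output = generator(
        "Hello!" + tokenizer.eos_token,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.9,
        return_full_text=False
    )
    print(output[0]["generated_text"])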
  7. Save the model and its associated tokenizer and configuration files
    To save a model, tokenizer, and configuration files, we can use the "save_pretrained" method from the Hugging Face Transformers library.

    Ideally, you will save all related files in the same folder (see the sketch at the end of this section).

    Note that saving the model also saves its configuration file.

    • Save the model and its associated configuration files:

      Python code:
      $ vi llm-save-model.py
      from transformers import AutoModelForCausalLM
      
      model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
      
      path = "./models/microsoft/model/dialogpt-small"
      
      # model serialization
      model.save_pretrained(path)
      Run the Python script:
      $ python3 llm-save-model.py
      This will create a directory containing:
      $ ls -1 models/microsoft/model/dialogpt-small/
      config.json
      generation_config.json
      model.safetensors
    • Save the model tokenizer files:

      Python code:
      $ vi llm-save-tokenizer.py
      from transformers import AutoTokenizer
      
      tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
      
      path = "./models/microsoft/tokenizer/dialogpt-small"
      
      # tokenizer serialization
      tokenizer.save_pretrained(path)
      Run the Python script:
      $ python3 llm-save-tokenizer.py
      This will create a directory containing:
      $ ls -1 models/microsoft/tokenizer/dialogpt-small/
      merges.txt
      special_tokens_map.json
      tokenizer.json
      tokenizer_config.json
      vocab.json
    • Save only the model configuration file:

      Python code:
      $ vi llm-save-model-config.py
      from transformers import AutoConfig
      
      config = AutoConfig.from_pretrained("microsoft/DialoGPT-small")
      
      path = "./models/microsoft/config/dialogpt-small"
      
      # configuration serialization
      config.save_pretrained(path)
      Run the Python script:
      $ python3 llm-save-model-config.py
      This will create a directory containing:
      $ ls -1 models/microsoft/config/dialogpt-small/
      config.json
    Files:
    • config.json: The configuration file of the model.
      {
        "architectures": [
          "GPT2LMHeadModel"
        ],
      ...
        "transformers_version": "4.51.3",
        "vocab_size": 50257
      }

    • tokenizer_config.json: The configuration file of the tokenizer.
      {
        "add_bos_token": false,
        "add_prefix_space": false,
        "added_tokens_decoder": {
          "50256": {
            "content": "<|endoftext|>",
            "lstrip": false,
            "normalized": true,
            "rstrip": false,
            "single_word": false,
            "special": true
          }
        },
        "bos_token": "<|endoftext|>",
        "chat_template": "{% for message in messages %}{{ message.content }}{{ eos_token }}{% endfor %}",
        "clean_up_tokenization_spaces": true,
        "eos_token": "<|endoftext|>",
        "errors": "replace",
        "extra_special_tokens": {},
        "model_max_length": 1024,
        "pad_token": null,
        "tokenizer_class": "GPT2Tokenizer",
        "unk_token": "<|endoftext|>"
      }

    • vocab.json, tokenizer.json: contain the vocabulary and the mapping of tokens to IDs.

    • special_tokens_map.json: contains the mapping of special tokens used by the tokenizer.

    • model.safetensors: contains the model's weights.

    • generation_config.json: the default text-generation settings of the model; merges.txt: the BPE merge rules used by the tokenizer.
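
    As noted above, the model and tokenizer are typically saved into the same folder so they can be reloaded together. A minimal sketch; the combined path is illustrative:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
    
    # hypothetical combined folder holding both the model and the tokenizer files
    path = "./models/microsoft/dialogpt-small"
    
    model.save_pretrained(path)
    tokenizer.save_pretrained(path)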
  8. Load the saved model and its associated tokenizer and configuration files
    To load the saved model, tokenizer and configuration files, we can use the "from_pretrained" method from the Hugging Face Transformers library.

    Ideally, you will have saved all related files in the same folder.

    Python code:
    $ vi llm-load-model-tokenizer-config.py
    from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
    
    model_path = "./models/microsoft/model/dialogpt-small"
    tokenizer_path = "./models/microsoft/tokenizer/dialogpt-small"
    config_path = "./models/microsoft/config/dialogpt-small"
    
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    config = AutoConfig.from_pretrained(config_path)
    
    print(model)
    print(tokenizer)
    print(config)
    Run the Python script:
    $ python3 llm-load-model-tokenizer-config.py
    Output:
    GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
    ...
      )
      (lm_head): Linear(in_features=768, out_features=50257, bias=False)
    )
    GPT2TokenizerFast(
        name_or_path='./models/microsoft/tokenizer/dialogpt-small',
        vocab_size=50257,
        model_max_length=1024,
    ...
        special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'},
        added_tokens_decoder={50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),}
    )
    GPT2Config {
      "architectures": [
        "GPT2LMHeadModel"
      ],
    ...
      "transformers_version": "4.51.3",
      "vocab_size": 50257
    }
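
    Once reloaded, the model and tokenizer behave exactly like the ones downloaded from the Hub and can be used for generation as in the earlier sections; a minimal sketch reusing the paths above:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model = AutoModelForCausalLM.from_pretrained("./models/microsoft/model/dialogpt-small")
    tokenizer = AutoTokenizer.from_pretrained("./models/microsoft/tokenizer/dialogpt-small")
    
    # generate text exactly as in section 2, but from the locally saved files
    input_ids = tokenizer.encode("Hello!" + tokenizer.eos_token, return_tensors='pt')
    output = model.generate(input_ids=input_ids, max_new_tokens=50)
    print(tokenizer.decode(output[0]))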