LLMs | Running Models
  1. Hugging Face Hub
  2. Run a model using Transformers
  3. Run a model using Transformers Pipelines
  4. Run a transformer model using llama-cpp-python
  5. Run a model with ChatGPT (OpenAI)
  6. Key parameters of the transformer models
  7. Save the model and its associated tokenizer and configuration files
  8. Load the saved model and its associated tokenizer and configuration files

  1. Hugging Face Hub
    The Hugging Face Hub is an open platform with over 1 million models that can be used to process and generate text, images, audio, video, ...
    https://huggingface.co/models

    Selecting a model depends on:
    • The underlying architecture of the model (representation/generative model)
    • The size of the model
    • The performance of the model
    • The task to be executed by the model
    • The languages supported by the model
    • ...

    You can use the MTEB (Massive Text Embedding Benchmark) leaderboard to compare embedding models across many tasks and languages.
    https://huggingface.co/spaces/mteb/leaderboard
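
    You can also explore the Hub programmatically with the "huggingface_hub" library. A minimal sketch, assuming an example search query and sort criteria:

      from huggingface_hub import list_models

      # List a few models matching a search term (example query), most downloaded first.
      for model in list_models(search="qwen", sort="downloads", direction=-1, limit=5):
          print(model.id)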
  2. Run a model using Transformers
    You can use the Hugging Face CLI to download a model:
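    For example (the model ID below is only an illustration):

      huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
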
    Python code:
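    A minimal sketch, assuming an example model ID, prompt, and generation settings:

      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # example model ID

      # Load the tokenizer and the model weights from the Hugging Face Hub (or the local cache).
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      model = AutoModelForCausalLM.from_pretrained(model_name)

      # Tokenize the prompt and generate a completion.
      prompt = "What is the capital of Canada?"
      inputs = tokenizer(prompt, return_tensors="pt")
      outputs = model.generate(**inputs, max_new_tokens=50)

      # Decode the generated token IDs back into text.
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))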

    Run the Python script:

    Output:
  3. Run a model using Transformers Pipelines
    Python code:
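    A minimal sketch, assuming an example model ID and prompt:

      from transformers import pipeline

      # Create a text-generation pipeline; it loads the model and its tokenizer in one step.
      generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")  # example model ID

      # Generate a completion for the prompt (return only the generated text, not the prompt).
      result = generator("What is the capital of Canada?", max_new_tokens=50, return_full_text=False)
      print(result[0]["generated_text"])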

    Run the Python script:

    Output:
  4. Run a transformer model using llama-cpp-python
    Download this model:

    Python code:
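    A minimal sketch, assuming the GGUF file has already been downloaded locally (the file path and prompt are example values):

      from llama_cpp import Llama

      # Load the GGUF model file (example path); n_ctx sets the context length.
      llm = Llama(model_path="./models/qwen2.5-0.5b-instruct-q4_k_m.gguf", n_ctx=2048)

      # Generate a completion for the prompt.
      output = llm("Q: What is the capital of Canada? A:", max_tokens=50, echo=False)
      print(output["choices"][0]["text"])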

    Run the Python script:

    Output:
  5. Run a model with ChatGPT (OpenAI)
    ChatGPT (OpenAI) is based on proprietary models that can be accessed through OpenAI's API.
    You need to sign up and create an API key here: https://platform.openai.com/api-keys
    The API key will be used to communicate with OpenAI's API.

    • Try the model using curl:
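      A minimal sketch of a chat completions request (the model name is an example; the API key is read from the OPENAI_API_KEY environment variable):

        curl https://api.openai.com/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $OPENAI_API_KEY" \
          -d '{
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "What is the capital of Canada?"}]
          }'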

      Output:

    • Try the model using Python:

      Install the OpenAI Python SDK:
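      For example, using pip:

        pip install openai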

      Check OpenAI Python SDK installation:
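      For example, show the installed package version:

        pip show openai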

      Python code:
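      A minimal sketch using the OpenAI Python SDK (v1 style); the model name is an example and the API key is read from the OPENAI_API_KEY environment variable:

        from openai import OpenAI

        # The client reads the API key from the OPENAI_API_KEY environment variable by default.
        client = OpenAI()

        response = client.chat.completions.create(
            model="gpt-4o-mini",  # example model name
            messages=[{"role": "user", "content": "What is the capital of Canada?"}],
        )

        print(response.choices[0].message.content)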

      Run the Python script:

      Output:
  6. Key parameters of the transformer models
    There are a few parameters that can affect the output of the model:

    • The model's context length:
      A model has a context length (a.k.a. the context window, context size, token limit):
      • The context length represents the maximum number of tokens the model can process.
      • Generative models are autoregressive, so the number of tokens in the context grows as new tokens are generated; the prompt plus the generated tokens must fit within the context length.

    • return_full_text:
      If set to "False", only the model output is returned.
      Otherwise, the full text is returned; including the user prompt.

    • max_new_tokens:
      It sets the maximum number of tokens the model can generate.

    • do_sample:
      The model computes a probability for each possible next token and ranks the candidate tokens by their probability of being chosen.

      If the "do_sample" parameter is set to "False", the model selects the most probable next token; this leads to a more predictable and consistent response. Otherwise, the model will sample from the probability distribution, leading to more possible tokens that can be chosen by the model.

      When we set the "do_sample" parameter to true, we can also use the "temperature" parameter to make the output more "random". Hence we can get different output for the same prompt.

    • temperature:
      It controls how likely the model is to choose less probable tokens.

      When we set the "temperature" parameter to 0 (deterministic), the model should always generate the same response when given the same prompt.

      The closer the value of the "temperature" parameter is to 1 (high randomness), the more likely we are to get a random output.
  7. Save the model and its associated tokenizer and configuration files
    To save a model, tokenizer, and configuration files, we can use the "save_pretrained" method from the Hugging Face Transformers library.

    Ideally, you will save all related files in the same folder.

    Note that saving the model also saves its configuration file.

    • Save the model and its associated configuration files:

      Python code:
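      A minimal sketch (the model ID and output folder are example values):

        from transformers import AutoModelForCausalLM

        model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example model ID

        # Saves the model weights along with its configuration files to the given folder.
        model.save_pretrained("./saved-model")  # example output folder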

      Run the Python script:

      This will create a directory containing:

    • Save the model tokenizer files:

      Python code:
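      A minimal sketch (the model ID and output folder are example values):

        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example model ID

        # Saves the tokenizer files (vocabulary, tokenizer config, special tokens map, ...).
        tokenizer.save_pretrained("./saved-model")  # example output folder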

      Run the Python script:

      This will create a directory containing:

    • Save only the model configuration file:

      Python code:
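      A minimal sketch (the model ID and output folder are example values):

        from transformers import AutoConfig

        config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example model ID

        # Saves only the model configuration file (config.json).
        config.save_pretrained("./saved-config")  # example output folder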

      Run the Python script:

      This will create a directory containing:

    Files:
    • config.json: The configuration file of the model.

    • tokenizer_config.json: The configuration file of the tokenizer.

    • vocab.json, tokenizer.json: contain the vocabulary and the mapping of tokens to IDs.

    • special_tokens_map.json: contains the mapping of special tokens used by the tokenizer.

    • model.safetensors: contains the model's weights.

    • generation_config.json: contains the default text-generation parameters of the model.

    • merges.txt: contains the merge rules used by the tokenizer (for BPE-based tokenizers).
  8. Load the saved model and its associated tokenizer and configuration files
    To load the saved model, tokenizer, and configuration files, we can use the "from_pretrained" method from the Hugging Face Transformers library.

    Ideally, you will have saved all related files in the same folder.

    Python code:
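    A minimal sketch, assuming the files were saved to a local folder with "save_pretrained" (the folder path and prompt are example values):

      from transformers import AutoModelForCausalLM, AutoTokenizer

      # Load the model and its tokenizer from the local folder (example path).
      tokenizer = AutoTokenizer.from_pretrained("./saved-model")
      model = AutoModelForCausalLM.from_pretrained("./saved-model")

      # Quick check: generate a short completion with the reloaded model.
      inputs = tokenizer("What is the capital of Canada?", return_tensors="pt")
      outputs = model.generate(**inputs, max_new_tokens=20)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))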

    Run the Python script:

    Output: