LLMs | Retrieval-Augmented Generation (RAG)
  1. Notes
  2. Question Answering: basic query example
  3. Question Answering: prompt example
  4. Question Answering: chain types

  1. Notes
    RAG addresses a key limitation of LLMs: their inability to access up-to-date or domain-specific information outside their training data. By retrieving relevant information at runtime, RAG enables LLMs to produce more accurate, factual, and contextually appropriate responses.

    RAG combines an information retrieval system with an LLM to deliver accurate and relevant results. The query is first run against the retrieval system (such as a vector store, database, or search engine). The retrieved results are then fed to the model, which generates a contextually appropriate response to the prompt.

    RAG Architecture:
    • Query Processing: The user query is received and processed.
    • Information Retrieval: The query is executed against a retrieval system such as a vector store, database, or search engine.
    • Context Integration: Retrieved results (documents or chunks) are combined with the original query.
    • LLM Generation: The combined information is passed to the LLM along with a prompt.
    • Response Synthesis: The LLM generates a contextually relevant response.

    Query (question) -> Vector Store -> Results (documents or chunks) -> LLM + Prompt -> Response (answer)
    ┌───────────┐     ┌─────────────┐     ┌────────────────┐     ┌─────────┐     ┌──────────┐
    │           │     │             │     │                │     │         │     │          │
    │   Query   │────▶│Vector Store │────▶│ Retrieved Data │────▶│  LLM +  │────▶│ Response │
    │           │     │             │     │                │     │ Prompt  │     │          │
    └───────────┘     └─────────────┘     └────────────────┘     └─────────┘     └──────────┘
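
    The flow above can be compressed into a few lines of illustrative Python. This is a conceptual sketch only: vector_store and llm are hypothetical placeholders for whatever retrieval system and model are used.

      # Conceptual sketch: vector_store and llm are hypothetical placeholders.
      def rag_answer(query: str, vector_store, llm) -> str:
          # 1-2. Query processing and information retrieval
          docs = vector_store.similarity_search(query)
          # 3. Context integration: combine the retrieved chunks with the original query
          context = "\n".join(str(doc) for doc in docs)
          # 4-5. LLM generation and response synthesis
          prompt = f"Use the context to answer the question.\nContext:\n{context}\nQuestion: {query}\nAnswer:"
          return llm(prompt)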
    
  2. Question Answering: basic query example
    Let's use RetrievalQA to perform a simple RAG query against a local vector database.

    We use a collection of 10 BBC news headlines covering diverse topics. In a real-world application, this corpus would typically be much larger and could contain full documents rather than just headlines.

    LLM details (we use a 4-bit quantized Phi-3-mini instruct model):
    • Quantization: The q4 version (4-bit quantization) significantly reduces memory requirements while maintaining most of the performance.
    • Size: As a "mini" model, it offers a good balance between speed and capability.
    • Instruction tuning: The "instruct" version is fine-tuned to follow instructions, which is ideal for RAG workflows.

    LLM Configuration: We initialize the LlamaCpp model with the following parameters (see the sketch below):
    • model_path: Path to the quantized Phi-3-mini model (4-bit quantization for reduced memory footprint).
    • max_tokens: Limiting responses to 50 tokens for concise answers.
    • temperature: Set to 0.8, balancing creativity with factuality.
    • top_p: Nucleus sampling parameter set to 0.95, allowing for varied but relevant token selection.
    • n_ctx: Context window size of 512 tokens - sufficient for our simple RAG implementation.
    • seed: Fixed seed for reproducible results.
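
    A minimal sketch of this initialization, assuming the quantized GGUF file sits at a local path of your choosing (the path and the seed value below are assumptions):

      from langchain_community.llms import LlamaCpp

      llm = LlamaCpp(
          model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",  # hypothetical local path to the q4 GGUF file
          max_tokens=50,    # concise answers
          temperature=0.8,  # balance creativity with factuality
          top_p=0.95,       # nucleus sampling
          n_ctx=512,        # context window size
          seed=1,           # fixed seed for reproducible results (value is arbitrary)
      )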

    Embedding Model: We use the lightweight but effective all-MiniLM-L6-v2 model from HuggingFace, which offers a good balance between performance and computational efficiency (see the snippet below):
    • Dimension: It produces 384-dimensional vectors, balancing expressiveness with storage requirements.
    • Speed: It's optimized for efficiency, allowing for quick embedding generation.
    • Quality: Despite its small size, it performs well on semantic similarity tasks.
    • Resource usage: It can run effectively on CPU, making it accessible for development environments.
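
    A sketch of the embedding model setup (the model is downloaded from the HuggingFace hub on first use):

      from langchain_community.embeddings import HuggingFaceEmbeddings

      # all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings and runs well on CPU.
      embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")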

    Vector Store Creation: FAISS (Facebook AI Similarity Search) is used to create our vector database, converting each text entry into a vector embedding and enabling efficient similarity search (see the snippet below):
    • Efficiency: Optimized for fast similarity search operations.
    • Scalability: Can handle from thousands to billions of vectors.
    • In-memory operation: Perfect for examples and small to medium datasets.
    • Indexing options: Supports various indexing methods for different performance profiles.
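
    A sketch of the vector store creation, reusing the embeddings object from the previous snippet (the two entries are placeholders, not the actual 10-headline corpus):

      from langchain_community.vectorstores import FAISS

      # Placeholder corpus; in this example it would be the 10 BBC news headlines.
      texts = [
          "Placeholder headline about sports ...",
          "Placeholder headline about technology ...",
      ]

      # Each text entry is embedded and indexed for similarity search.
      vectorstore = FAISS.from_texts(texts, embeddings)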

    RetrievalQA Chain: The chain ties the components above together (see the code below); it will:
    • Take a query and retrieve relevant documents from the vector store
    • Format these documents along with the query
    • Send the formatted input to the LLM for response generation

    Install the required modules:
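    The exact package set depends on your LangChain version; with a recent release, something like the following covers the imports used in this example:

      pip install langchain langchain-community llama-cpp-python faiss-cpu sentence-transformers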

    Python code:
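    Putting the pieces above together, the chain is built and queried as follows. This sketch continues with the llm and vectorstore objects created earlier; the number of retrieved entries (k) and the query text are assumptions consistent with the output described below.

      from langchain.chains import RetrievalQA

      # Build the RetrievalQA chain: retrieve the top matching entries and pass them to the LLM.
      qa_chain = RetrievalQA.from_chain_type(
          llm=llm,
          retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),  # top-2 most similar entries
      )

      # Run a query end to end: retrieval, context integration, and generation.
      result = qa_chain.invoke({"query": "What major achievement is mentioned in the headlines?"})
      print(result["result"])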

    Run the Python script:

    Output:

    The model correctly identified Barcelona's La Liga victory as the major achievement mentioned in the corpus. It successfully retrieved the relevant information from the vector store and generated a response based on that information.
  3. Question Answering: prompt example
    Let's improve our implementation by adding a custom prompt template to better guide the LLM's responses. We will again use RetrievalQA against the same local vector database, this time with a custom prompt.

    Enhanced Features:
    • Custom Prompt Template: This is the key enhancement in this implementation. The template provides:
      • Clear instructions to the model about its role ("assistant for question-answering tasks").
      • Explicit guidance on how to use the retrieved context.
      • Format constraints ("three sentences maximum").
      • Chat-specific formatting with user/assistant markers.

    • Explicit Chain Type: We now explicitly specify chain_type='stuff', which tells LangChain to use the simplest chain type that "stuffs" all retrieved documents into a single prompt.

    • Chain Type Kwargs: We pass our custom prompt through the chain_type_kwargs parameter, which allows for customization of the chain behavior.

    Python code:
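    A sketch of the enhanced version. The llm and vectorstore objects are built exactly as in the previous example; the prompt wording and the Phi-3 chat markers below are an approximation of what is described above.

      from langchain.chains import RetrievalQA
      from langchain_core.prompts import PromptTemplate

      # Custom prompt: role, guidance on using the retrieved context, a length constraint,
      # and Phi-3 style chat markers. The exact wording is an approximation.
      template = (
          "<|user|>\n"
          "You are an assistant for question-answering tasks. "
          "Use the following pieces of retrieved context to answer the question. "
          "Use three sentences maximum and keep the answer concise.\n"
          "Context: {context}\n"
          "Question: {question}<|end|>\n"
          "<|assistant|>"
      )
      prompt = PromptTemplate(template=template, input_variables=["context", "question"])

      qa_chain = RetrievalQA.from_chain_type(
          llm=llm,
          chain_type="stuff",                    # stuff all retrieved documents into a single prompt
          retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
          chain_type_kwargs={"prompt": prompt},  # pass the custom prompt to the chain
      )

      result = qa_chain.invoke({"query": "What major achievement is mentioned in the headlines?"})
      print(result["result"])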

    Run the Python script:

    Output:

    The response is now more concise and focused, directly addressing the question without unnecessary information. The custom prompt has effectively guided the model to produce a more streamlined response.
  4. Question Answering: chain types
    LangChain offers several chain types for RAG implementations, each with different trade-offs and use cases:
    • stuff
      • How it works: Combines all retrieved documents into a single prompt.
      • Pros: Simple and efficient; a single LLM call.
      • Cons: Limited by the context window size.
      • Best for: Small to medium-sized retrievals and quick Q&A.

    • refine
      • How it works: Processes documents sequentially, refining the answer with each document.
      • Pros: Can handle larger sets of documents; progressive refinement.
      • Cons: Multiple LLM calls (higher latency and cost).
      • Best for: Complex questions requiring nuanced answers.

    • map_reduce
      • How it works: Processes each document separately, then combines the results.
      • Pros: Can handle very large document sets; parallelizable.
      • Cons: Multiple LLM calls; potential information loss.
      • Best for: Large document collections and distributed processing.

    To use the refine chain, update the above code as follows:

    Python code:
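    Only the chain construction changes; a sketch, reusing the llm, vectorstore, and query from the previous examples. Note that the custom "stuff" prompt is dropped here: the refine chain uses its own default question/refine prompts rather than a single prompt.

      from langchain.chains import RetrievalQA

      # refine: process the retrieved documents one at a time, refining the answer at each step.
      qa_chain = RetrievalQA.from_chain_type(
          llm=llm,
          chain_type="refine",
          retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
      )

      result = qa_chain.invoke({"query": "What major achievement is mentioned in the headlines?"})
      print(result["result"])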

    Output:

    Note how the refine chain has produced a more comprehensive answer that combines multiple pieces of information. It not only identified Barcelona's La Liga victory but also added information about the club's environmental efforts, demonstrating the refine chain's ability to build more complete answers.

    The map_reduce chain can be used when the retrieved results are too large to fit into the model's context window. Each retrieved document (or chunk) is sent to the model individually (which increases processing time), and the individual responses are then combined into a final query. This final query is sent to the model, which produces the final response. The map_reduce chain generally provides less accurate results than the stuff and refine chains.

    To use the map_reduce chain, update the above code as follows (note that the value of the n_ctx parameter needs to be increased):

    Python code:
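    A sketch of the changes: the LLM is re-created with a larger context window (the value 2048 is an assumption; the text above only states that n_ctx must be increased), and the chain type switches to map_reduce.

      from langchain_community.llms import LlamaCpp
      from langchain.chains import RetrievalQA

      # Same parameters as before, except for the larger context window.
      llm = LlamaCpp(
          model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",  # hypothetical local path
          max_tokens=50,
          temperature=0.8,
          top_p=0.95,
          n_ctx=2048,  # increased context window (assumed value)
          seed=1,
      )

      # map_reduce: answer against each document separately, then combine the partial answers.
      qa_chain = RetrievalQA.from_chain_type(
          llm=llm,
          chain_type="map_reduce",
          retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
      )

      result = qa_chain.invoke({"query": "What major achievement is mentioned in the headlines?"})
      print(result["result"])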

    Output:

    The map_reduce output is notably different: it identified a different achievement (Jhonattan Vegas leading the PGA Championship) and produced a somewhat fragmented response. This illustrates the potential for information loss or confusion when using map_reduce with small context windows.