LLMs | Retrieval-Augmented Generation (RAG)
  1. Notes
  2. Question Answering: basic query example
  3. Question Answering: prompt example
  4. Question Answering: chain types

  1. Notes
    RAG addresses a key limitation of LLMs: their inability to access up-to-date or domain-specific information outside their training data. By retrieving relevant information at runtime, RAG enables LLMs to produce more accurate, factual, and contextually appropriate responses.

    RAG combines an information retrieval system with an LLM to deliver accurate and relevant results. The query is first run against the retrieval system (such as a vector store, database, or search engine). The retrieved results are then fed to the model, which generates a contextually appropriate response to the prompt.

    RAG Architecture:
    • Query Processing: The user query is received and processed.
    • Information Retrieval: The query is executed against a retrieval system such as a vector store, database, or search engine.
    • Context Integration: Retrieved results (documents or chunks) are combined with the original query.
    • LLM Generation: The combined information is passed to the LLM along with a prompt.
    • Response Synthesis: The LLM generates a contextually relevant response.

    Query (question) -> Vector Store -> Results (documents or chunks) -> LLM + Prompt -> Response (answer)
    ┌───────────┐     ┌─────────────┐     ┌────────────────┐     ┌─────────┐     ┌──────────┐
    │           │     │             │     │                │     │         │     │          │
    │   Query   │────▶│Vector Store │────▶│ Retrieved Data │────▶│  LLM +  │────▶│ Response │
    │           │     │             │     │                │     │ Prompt  │     │          │
    └───────────┘     └─────────────┘     └────────────────┘     └─────────┘     └──────────┘
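
    The flow above can be compressed into a few lines of illustrative Python. This is a conceptual sketch only: vector_store and llm are hypothetical placeholders for whatever retrieval system and model are used.

      # Conceptual sketch: vector_store and llm are hypothetical placeholders.
      def rag_answer(query: str, vector_store, llm) -> str:
          # 1-2. Query processing and information retrieval
          docs = vector_store.similarity_search(query)
          # 3. Context integration: combine the retrieved chunks with the original query
          context = "\n".join(str(doc) for doc in docs)
          # 4-5. LLM generation and response synthesis
          prompt = f"Use the context to answer the question.\nContext:\n{context}\nQuestion: {query}\nAnswer:"
          return llm(prompt)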
    
  2. Question Answering: basic query example
    Let's use RetrievalQA to perform a simple RAG query against a local vector database.

    We use a collection of 10 BBC news headlines covering diverse topics. In a real-world application, this corpus would typically be much larger and could contain full documents rather than just headlines.

    LLM details (we use a 4-bit quantized Phi-3-mini instruct model):
    • Quantization: The q4 version (4-bit quantization) significantly reduces memory requirements while maintaining most of the performance.
    • Size: As a "mini" model, it offers a good balance between speed and capability.
    • Instruction tuning: The "instruct" version is fine-tuned to follow instructions, which is ideal for RAG workflows.

    LLM Configuration: We initialize the LlamaCpp model with the following parameters (see the sketch below):
    • model_path: Path to the quantized Phi-3-mini model (4-bit quantization for reduced memory footprint).
    • max_tokens: Limiting responses to 50 tokens for concise answers.
    • temperature: Set to 0.8, balancing creativity with factuality.
    • top_p: Nucleus sampling parameter set to 0.95, allowing for varied but relevant token selection.
    • n_ctx: Context window size of 512 tokens - sufficient for our simple RAG implementation.
    • seed: Fixed seed for reproducible results.
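
    A minimal sketch of this initialization, assuming the quantized GGUF file sits at a local path of your choosing (the path and the seed value below are assumptions):

      from langchain_community.llms import LlamaCpp

      llm = LlamaCpp(
          model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",  # hypothetical local path to the q4 GGUF file
          max_tokens=50,    # concise answers
          temperature=0.8,  # balance creativity with factuality
          top_p=0.95,       # nucleus sampling
          n_ctx=512,        # context window size
          seed=1,           # fixed seed for reproducible results (value is arbitrary)
      )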

    Embedding Model: We use the lightweight but effective all-MiniLM-L6-v2 model from HuggingFace, which offers a good balance between performance and computational efficiency (see the snippet below):
    • Dimension: It produces 384-dimensional vectors, balancing expressiveness with storage requirements.
    • Speed: It's optimized for efficiency, allowing for quick embedding generation.
    • Quality: Despite its small size, it performs well on semantic similarity tasks.
    • Resource usage: It can run effectively on CPU, making it accessible for development environments.
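
    A sketch of the embedding model setup (the model is downloaded from the HuggingFace hub on first use):

      from langchain_community.embeddings import HuggingFaceEmbeddings

      # all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings and runs well on CPU.
      embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")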

    Vector Store Creation: FAISS (Facebook AI Similarity Search) is used to create our vector database, converting each text entry into a vector embedding and enabling efficient similarity search (see the snippet below):
    • Efficiency: Optimized for fast similarity search operations.
    • Scalability: Can handle from thousands to billions of vectors.
    • In-memory operation: Perfect for examples and small to medium datasets.
    • Indexing options: Supports various indexing methods for different performance profiles.
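
    A sketch of the vector store creation, reusing the embeddings object from the previous snippet (the two entries are placeholders, not the actual 10-headline corpus):

      from langchain_community.vectorstores import FAISS

      # Placeholder corpus; in this example it would be the 10 BBC news headlines.
      texts = [
          "Placeholder headline about sports ...",
          "Placeholder headline about technology ...",
      ]

      # Each text entry is embedded and indexed for similarity search.
      vectorstore = FAISS.from_texts(texts, embeddings)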

    RetrievalQA Chain: The chain ties the components above together (see the code below); it will:
    • Take a query and retrieve relevant documents from the vector store
    • Format these documents along with the query
    • Send the formatted input to the LLM for response generation

    Install the required modules:
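    The exact package set depends on your LangChain version; with a recent release, something like the following covers the imports used in this example:

      pip install langchain langchain-community llama-cpp-python faiss-cpu sentence-transformers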

    Python code:
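    Putting the pieces above together, the chain is built and queried as follows. This sketch continues with the llm and vectorstore objects created earlier; the number of retrieved entries (k) and the query text are assumptions consistent with the output described below.

      from langchain.chains import RetrievalQA

      # Build the RetrievalQA chain: retrieve the top matching entries and pass them to the LLM.
      qa_chain = RetrievalQA.from_chain_type(
          llm=llm,
          retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),  # top-2 most similar entries
      )

      # Run a query end to end: retrieval, context integration, and generation.
      result = qa_chain.invoke({"query": "What major achievement is mentioned in the headlines?"})
      print(result["result"])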

    Run the Python script:

    Output:

    The model correctly identified Barcelona's La Liga victory as the major achievement mentioned in the corpus. It successfully retrieved the relevant information from the vector store and generated a response based on that information.
  3. Question Answering: prompt example
    Let's improve our implementation by adding a custom prompt template to better guide the LLM's responses. We will again use RetrievalQA against the same local vector database, this time with a custom prompt.

    Enhanced Features:
    • Custom Prompt Template: This is the key enhancement in this implementation. The template provides:
      • Clear instructions to the model about its role ("assistant for question-answering tasks").
      • Explicit guidance on how to use the retrieved context.
      • Format constraints ("three sentences maximum").
      • Chat-specific formatting with user/assistant markers.

    • Explicit Chain Type: We now explicitly specify chain_type='stuff', which tells LangChain to use the simplest chain type that "stuffs" all retrieved documents into a single prompt.

    • Chain Type Kwargs: We pass our custom prompt through the chain_type_kwargs parameter, which allows for customization of the chain behavior.

    Python code:
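    A sketch of the enhanced version. The llm and vectorstore objects are built exactly as in the previous example; the prompt wording and the Phi-3 chat markers below are an approximation of what is described above.

      from langchain.chains import RetrievalQA
      from langchain_core.prompts import PromptTemplate

      # Custom prompt: role, guidance on using the retrieved context, a length constraint,
      # and Phi-3 style chat markers. The exact wording is an approximation.
      template = (
          "<|user|>\n"
          "You are an assistant for question-answering tasks. "
          "Use the following pieces of retrieved context to answer the question. "
          "Use three sentences maximum and keep the answer concise.\n"
          "Context: {context}\n"
          "Question: {question}<|end|>\n"
          "<|assistant|>"
      )
      prompt = PromptTemplate(template=template, input_variables=["context", "question"])

      qa_chain = RetrievalQA.from_chain_type(
          llm=llm,
          chain_type="stuff",                    # stuff all retrieved documents into a single prompt
          retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
          chain_type_kwargs={"prompt": prompt},  # pass the custom prompt to the chain
      )

      result = qa_chain.invoke({"query": "What major achievement is mentioned in the headlines?"})
      print(result["result"])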

    Run the Python script:

    Output:

    The response is now more concise and focused, directly addressing the question without unnecessary information. The custom prompt has effectively guided the model to produce a more streamlined response.
  4. Question Answering: chain types
    LangChain offers several chain types for RAG implementations, each with different trade-offs and use cases:
    • stuff
      • How it works: Combines all retrieved documents into a single prompt.
      • Pros: Simple and efficient; a single LLM call.
      • Cons: Limited by the context window size.
      • Best for: Small to medium-sized retrievals and quick Q&A.

    • refine
      • How it works: Processes documents sequentially, refining the answer with each document.
      • Pros: Can handle larger sets of documents; progressive refinement.
      • Cons: Multiple LLM calls (higher latency and cost).
      • Best for: Complex questions requiring nuanced answers.

    • map_reduce
      • How it works: Processes each document separately, then combines the results.
      • Pros: Can handle very large document sets; parallelizable.
      • Cons: Multiple LLM calls; potential information loss.
      • Best for: Large document collections and distributed processing.

    To use the refine chain, update the above code as follows:

    Python code:
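    Only the chain construction changes; a sketch, reusing the llm, vectorstore, and query from the previous examples. Note that the custom "stuff" prompt is dropped here: the refine chain uses its own default question/refine prompts rather than a single prompt.

      from langchain.chains import RetrievalQA

      # refine: process the retrieved documents one at a time, refining the answer at each step.
      qa_chain = RetrievalQA.from_chain_type(
          llm=llm,
          chain_type="refine",
          retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
      )

      result = qa_chain.invoke({"query": "What major achievement is mentioned in the headlines?"})
      print(result["result"])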

    Output:

    Note how the refine chain has produced a more comprehensive answer that combines multiple pieces of information. It not only identified Barcelona's La Liga victory but also added information about the club's environmental efforts, demonstrating the refine chain's ability to build more complete answers.

    The map_reduce chain can be used when the retrieved results are too large to fit into the model's context window. Each retrieved document (or chunk) is sent to the model individually (which increases processing time), and the individual responses are then combined into a final query. This final query is sent to the model, which produces the final response. The map_reduce chain generally provides less accurate results than the stuff and refine chains.

    To use the map_reduce chain, update the above code as follows (note that the value of the n_ctx parameter needs to be increased):

    Python code:
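    A sketch of the changes: the LLM is re-created with a larger context window (the value 2048 is an assumption; the text above only states that n_ctx must be increased), and the chain type switches to map_reduce.

      from langchain_community.llms import LlamaCpp
      from langchain.chains import RetrievalQA

      # Same parameters as before, except for the larger context window.
      llm = LlamaCpp(
          model_path="./models/Phi-3-mini-4k-instruct-q4.gguf",  # hypothetical local path
          max_tokens=50,
          temperature=0.8,
          top_p=0.95,
          n_ctx=2048,  # increased context window (assumed value)
          seed=1,
      )

      # map_reduce: answer against each document separately, then combine the partial answers.
      qa_chain = RetrievalQA.from_chain_type(
          llm=llm,
          chain_type="map_reduce",
          retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
      )

      result = qa_chain.invoke({"query": "What major achievement is mentioned in the headlines?"})
      print(result["result"])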

    Output:

    The map_reduce output is notably different: it identified a different achievement (Jhonattan Vegas leading the PGA Championship) and produced a somewhat fragmented response. This illustrates the potential for information loss or confusion when using map_reduce with small context windows.