Let's use RetrievalQA to perform a simple RAG query against a local vector database.
We use a collection of 10 BBC news headlines covering diverse topics.
In a real-world application, this corpus would typically be much larger and could contain full documents rather than just headlines.
LLM details:
- Quantization: The q4 version (4-bit quantization) significantly reduces memory requirements while maintaining most of the performance.
- Size: As a "mini" model, it offers a good balance between speed and capability.
- Instruction tuning: The "instruct" version is fine-tuned to follow instructions, which is ideal for RAG workflows.
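If the model file is not already on disk, the 4-bit GGUF can be fetched from Hugging Face; the command below assumes the microsoft/Phi-3-mini-4k-instruct-gguf repository, which hosts this quantized file:
$ huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf Phi-3-mini-4k-instruct-q4.gguf --local-dir .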
LLM Configuration:
We initialize the LlamaCpp model with the following parameters:
- model_path: Path to the quantized Phi-3-mini model (4-bit quantization for reduced memory footprint).
- max_tokens: Limiting responses to 50 tokens for concise answers.
- temperature: Set to 0.8, which allows some variation in phrasing; lower values give more deterministic, fact-focused answers.
- top_p: Nucleus sampling parameter set to 0.95, allowing for varied but relevant token selection.
- n_ctx: Context window of 512 tokens - small, but enough to hold the retrieved headlines, the question, and a 50-token answer in this simple example.
- seed: Fixed seed for reproducible results.
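With these parameters in hand, the model can be sanity-checked on its own before any retrieval is involved. This is a minimal sketch, assuming the GGUF file sits in the current directory:
from langchain_community.llms.llamacpp import LlamaCpp

# Load the quantized model with the settings listed above
llm = LlamaCpp(model_path="./Phi-3-mini-4k-instruct-q4.gguf",
               max_tokens=50, temperature=0.8, top_p=0.95,
               n_ctx=512, seed=50, verbose=False)
# A plain prompt, with no retrieval, just to confirm the model loads and responds
print(llm.invoke("In one sentence, what is retrieval-augmented generation?"))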
Embedding Model:
We use the lightweight but effective all-MiniLM-L6-v2 model from HuggingFace,
which offers a good balance between embedding quality and computational efficiency; the short snippet after this list verifies its output dimension.
- Dimension: It produces 384-dimensional vectors, balancing expressiveness with storage requirements.
- Speed: It's optimized for efficiency, allowing for quick embedding generation.
- Quality: Despite its small size, it performs well on general-purpose semantic similarity tasks; for specialized domains, a larger or fine-tuned model may be more appropriate.
- Resource usage: It can run effectively on CPU, making it accessible for development environments.
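As a quick check on the dimension claim, the embedding model can be loaded on its own and used to embed a single string; a minimal sketch, independent of the main script below:
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings

# Load the sentence-transformers model on CPU
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2",
                                        model_kwargs={'device': 'cpu'})
# Embed one string and inspect the vector length
vector = embedding_model.embed_query("Barcelona celebrate winning La Liga.")
print(len(vector))  # expected: 384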
Vector Store Creation:
FAISS (Facebook AI Similarity Search) is used to create our vector database,
converting each text entry into a vector embedding and enabling efficient similarity search; a standalone retrieval check follows the feature list below.
- Efficiency: Optimized for fast similarity search operations.
- Scalability: Can handle from thousands to billions of vectors.
- In-memory operation: Perfect for examples and small to medium datasets.
- Indexing options: Supports various indexing methods for different performance profiles.
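Before wiring up the full chain, the retrieval step alone can be exercised with similarity_search. The sketch below builds a tiny two-entry index rather than the full corpus:
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores.faiss import FAISS

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
texts = [
    "Barcelona celebrate winning La Liga under manager Hansi Flick.",
    "The world's largest cloud computing company plans new data centres in the UK."
]
# Build an in-memory FAISS index and retrieve the single closest entry
store = FAISS.from_texts(texts, embedding_model)
docs = store.similarity_search("Who won the Spanish league?", k=1)
print(docs[0].page_content)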
RetrievalQA Chain:
The RetrievalQA chain ties retrieval and generation together. For each query it will (the sketch after this list unrolls these steps by hand):
- Take the query and retrieve relevant documents from the vector store
- Format these documents along with the query into a prompt
- Send the formatted prompt to the LLM for response generation
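The sketch below unrolls those three steps by hand. It assumes the llm and vector_store objects created in the full script further down, and uses a simplified prompt rather than the chain's built-in one:
# 1. Retrieve the most relevant documents for the query
retriever = vector_store.as_retriever()
docs = retriever.invoke("What's the major achievement?")
# 2. Format the retrieved texts and the question into a single prompt
context = "\n".join(doc.page_content for doc in docs)
prompt = ("Use the following context to answer the question.\n\n"
          f"Context:\n{context}\n\n"
          "Question: What's the major achievement?\nAnswer:")
# 3. Send the formatted prompt to the LLM for response generation
print(llm.invoke(prompt))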
Install the required modules:
$ pip install langchain langchain_community llama-cpp-python
$ pip install langchain_huggingface
$ pip install faiss-cpu
Python code:
$ vi rag-query.py
# these libraries provide the essential functionality for embedding generation (HuggingFace), vector storage (FAISS), and LLM capabilities (LlamaCpp).
from langchain_community.llms.llamacpp import LlamaCpp
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores.faiss import FAISS
from langchain.chains import RetrievalQA
# Corpus: BBC news headlines covering diverse topics
corpus = [
    "Researchers in Japan and the US have unlocked the 60-year mystery of what gives these cats their orange colour.",
    "Astronomers have spotted around a dozen of these weird, rare blasts. Could they be signs of a special kind of black hole?",
    "The world's largest cloud computing company plans to spend £8bn on new data centres in the UK over the next four years.",
    "The Caribbean island is building a power station that will use steam naturally heated by volcanic rock.",
    "As Barcelona celebrate winning La Liga, Spanish football expert Guillem Balague looks at how manager Hansi Flick turned his young side into champions.",
    "Venezuela's Jhonattan Vegas leads the US PGA Championship with several European players close behind, but Rory McIlroy endures a tough start.",
    "Locals and ecologists are troubled by the potential impacts a looming seawall could have on the biodiverse Japanese island of Amami Ōshima.",
    "The government has made little progress in preparing the UK for rising temperatures, climate watchdog the CCC says.",
    "Half a century after the world's first deep sea mining tests picked nodules from the seafloor off the US east coast, the damage has barely begun to heal.",
    "The Cuyahoga River was so polluted it regularly went up in flames. Images of one dramatic blaze in 1952 shaped the US's nascent environmental movement, long after the flames went out."
]
llm = LlamaCpp(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",
    max_tokens=50,
    temperature=0.8,
    top_p=0.95,
    n_ctx=512,
    seed=50,
    verbose=False
)
model_name = "all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embedding_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
# indexing: vector database
vector_store = FAISS.from_texts(corpus, embedding_model)
# RetrievalQA chain
rqa_chain = RetrievalQA.from_chain_type(llm, retriever=vector_store.as_retriever())
question = "What's the major achievement?"
output = rqa_chain.invoke({"query": question})
print(output)
Run the Python script:
$ python3 rag-query.py
Output:
{
'query': "What's the major achievement?",
'result': '\n===\nThe major achievement mentioned in the context is Barcelona winning La Liga, and how manager Hansi Flick turned his young side into champions.'
}
The model correctly identified Barcelona's La Liga victory as the major achievement mentioned in the corpus.
It successfully retrieved the relevant information from the vector store and generated a response based on that information.
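To confirm which corpus entry the chain actually retrieved, it can be rebuilt with return_source_documents=True; this is a small variation on the script above, reusing the same llm and vector_store:
rqa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vector_store.as_retriever(),
    return_source_documents=True
)
output = rqa_chain.invoke({"query": "What's the major achievement?"})
# The retrieved headlines come back alongside the generated answer
for doc in output["source_documents"]:
    print(doc.page_content)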