Let's use LangChain's RetrievalQA chain to perform a simple RAG query against a local vector database.
We use a collection of 10 BBC news headlines covering diverse topics.
In a real-world application, this corpus would typically be much larger and could contain full documents rather than just headlines.
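In code, the corpus can be a plain Python list of strings. The headlines below are illustrative placeholders rather than the actual ten BBC headlines (only the Barcelona item mirrors the result discussed at the end):

```python
# Illustrative stand-ins for the ten BBC news headlines used as the corpus.
texts = [
    "Barcelona clinch the La Liga title after weekend victory",
    "New study finds regular exercise improves memory in older adults",
    "Tech firm unveils foldable smartphone at annual developer event",
    # ... seven more headlines covering politics, science, business, etc.
]
```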
LLM details (a quantized Phi-3-mini instruct model, run locally via llama-cpp):
- Quantization: The q4 version (4-bit quantization) significantly reduces memory requirements while maintaining most of the performance.
- Size: As a "mini" model, it offers a good balance between speed and capability.
- Instruction tuning: The "instruct" version is fine-tuned to follow instructions, which is ideal for RAG workflows.
LLM Configuration:
We initialize the LlamaCpp model with the following parameters (a minimal initialization sketch follows the list):
- model_path: Path to the quantized Phi-3-mini model (4-bit quantization for reduced memory footprint).
- max_tokens: Limiting responses to 50 tokens for concise answers.
- temperature: Set to 0.8, balancing creativity with factuality.
- top_p: Nucleus sampling parameter set to 0.95, allowing for varied but relevant token selection.
- n_ctx: Context window size of 512 tokens - sufficient for our simple RAG implementation.
- seed: Fixed seed for reproducible results.
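A sketch of that initialization, assuming the model has been downloaded as a local GGUF file (the path and seed value below are placeholders) and the langchain-community import path:

```python
from langchain_community.llms import LlamaCpp

# Local 4-bit quantized Phi-3-mini GGUF file; adjust the path to your setup.
llm = LlamaCpp(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",
    max_tokens=50,    # keep answers concise
    temperature=0.8,  # sampling temperature
    top_p=0.95,       # nucleus sampling
    n_ctx=512,        # context window size
    seed=42,          # fixed seed for reproducibility (value is illustrative)
    verbose=False,
)
```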
Embedding Model:
We use the lightweight but effective all-MiniLM-L6-v2 model from HuggingFace (a setup sketch follows the list).
- Dimension: It produces 384-dimensional vectors, balancing expressiveness with storage requirements.
- Speed: It's optimized for efficiency, allowing for quick embedding generation.
- Quality: Despite its small size, it performs well on semantic similarity tasks.
- Resource usage: It can run effectively on CPU, making it accessible for development environments.
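A sketch of the embedding setup (the import path may differ slightly depending on your LangChain version):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

# Downloads the model from the Hugging Face Hub on first use; runs fine on CPU.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
```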
Vector Store Creation:
FAISS (Facebook AI Similarity Search) is used to create our vector database,
converting each text entry into a vector embedding and enabling efficient similarity search (a sketch of the index construction follows the list).
- Efficiency: Optimized for fast similarity search operations.
- Scalability: Can handle from thousands to billions of vectors.
- In-memory operation: Perfect for examples and small to medium datasets.
- Indexing options: Supports various indexing methods for different performance profiles.
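Building the in-memory index from the raw texts, assuming the `texts` list and `embeddings` object sketched above:

```python
from langchain_community.vectorstores import FAISS

# Embed every headline and store the vectors in an in-memory FAISS index.
db = FAISS.from_texts(texts, embeddings)
retriever = db.as_retriever()  # default similarity search over the index
```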
RetrievalQA Chain:
The chain ties retrieval and generation together. It will:
- Take a query and retrieve relevant documents from the vector store
- Format these documents along with the query
- Send the formatted input to the LLM for response generation (see the sketch after this list)
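A sketch of wiring these steps together with the "stuff" chain type, which inserts the retrieved documents directly into the prompt (the query string is illustrative):

```python
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # place retrieved documents directly into the prompt
    retriever=retriever,
)

result = qa.invoke({"query": "What major sporting achievement is mentioned in the news?"})
print(result["result"])
```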
Install the required modules:
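Something along these lines should cover the dependencies (the exact package set may vary with your environment and LangChain version):

```bash
pip install langchain langchain-community llama-cpp-python sentence-transformers faiss-cpu
```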
Python code:
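The snippets above, assembled into a single runnable script (paths, seed, corpus texts, and the query remain placeholders):

```python
# Minimal RAG example: headlines -> FAISS index -> RetrievalQA over a local Phi-3-mini.
from langchain_community.llms import LlamaCpp
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Corpus: illustrative stand-ins for the ten BBC headlines.
texts = [
    "Barcelona clinch the La Liga title after weekend victory",
    "New study finds regular exercise improves memory in older adults",
    "Tech firm unveils foldable smartphone at annual developer event",
    # ... remaining headlines ...
]

# Local quantized LLM.
llm = LlamaCpp(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",  # adjust to your local file
    max_tokens=50,
    temperature=0.8,
    top_p=0.95,
    n_ctx=512,
    seed=42,
    verbose=False,
)

# Embeddings and vector store.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.from_texts(texts, embeddings)

# RetrievalQA chain and query.
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=db.as_retriever())
result = qa.invoke({"query": "What major sporting achievement is mentioned in the news?"})
print(result["result"])
```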
Run the Python script:
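Assuming the script was saved as rag_retrievalqa.py (the filename is hypothetical):

```bash
python rag_retrievalqa.py
```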
Output:
The model correctly identified Barcelona's La Liga victory as the major achievement mentioned in the corpus.
The chain retrieved the relevant headline from the vector store, and the model generated its answer from that retrieved information.