Semantic Search
  1. Introduction to Semantic Search
  2. Vector Embeddings and Similarity Metrics
  3. Text Chunking Strategies
  4. Implementation: Direct FAISS Index
  5. Implementation: LangChain with FAISS Vector Store

  1. Introduction to Semantic Search
    Semantic search revolutionizes information retrieval by understanding the meaning and context of queries rather than relying solely on keyword matching. Traditional keyword-based search systems often miss relevant documents that use different terminology or fail to capture the user's true intent. Semantic search addresses these limitations by leveraging machine learning to understand the conceptual relationships between words and phrases.

    The fundamental advantage of semantic search lies in its ability to bridge the vocabulary gap between queries and documents. For example, a search for "automobile" can successfully retrieve documents about "cars" or "vehicles" even though the query term never appears in those documents. This contextual understanding makes search results more comprehensive and relevant to user needs.

    Modern semantic search systems typically follow a two-phase architecture:
    • Indexing Phase:
      Documents → Text Preprocessing → Chunking → Embedding Model → Vectors → Vector Database
    • Search Phase:
      Query → Embedding Model → Query Vector → Similarity Search → Ranked Results
    The preprocessing step may include text cleaning, normalization, and language detection. The chunking process divides long documents into manageable segments that preserve semantic coherence.

    Search results can be further enhanced using Large Language Models (LLMs) through techniques such as re-ranking and Retrieval-Augmented Generation (RAG). Re-ranking uses more sophisticated models to refine the initial similarity-based results, while RAG combines retrieved information with generative models to provide comprehensive answers.
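
    As an illustration, re-ranking is often implemented with a cross-encoder that scores each (query, document) pair jointly; a minimal sketch (the model name is one common choice, and the candidate list is hypothetical):
    from sentence_transformers import CrossEncoder

    # a cross-encoder reads query and document together, scoring relevance
    # more accurately than comparing independently computed embeddings
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    query = "major achievement"
    candidates = [
        "Barcelona celebrate winning La Liga.",
        "The company plans to spend £8bn on new data centres.",
    ]

    # score every pair, then sort the initial hits by the refined scores
    scores = reranker.predict([(query, doc) for doc in candidates])
    for doc, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
        print(f"{score:.3f}  {doc}")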
  2. Vector Embeddings and Similarity Metrics
    Vector embeddings transform text into numerical representations that capture semantic meaning in high-dimensional space. These dense vectors encode contextual relationships, allowing documents with similar meanings to cluster together regardless of exact word matches.

    The quality of embeddings directly impacts search performance. Domain-specific models often outperform general-purpose embeddings for specialized applications.

    Similarity Metrics:
    • Cosine Similarity: Measures the angle between two vectors, ranging from -1 to 1. Values closer to 1 indicate higher similarity.

    • Euclidean Distance (L2): Measures straight-line distance between vectors in high-dimensional space. Lower distances indicate higher similarity.
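
    Both metrics can be computed directly with NumPy; a minimal sketch using two small hypothetical vectors (real embeddings have hundreds of dimensions):
    import numpy as np

    a = np.array([0.2, 0.8, 0.1, 0.5])
    b = np.array([0.3, 0.7, 0.0, 0.6])

    # cosine similarity: angle between the vectors, in [-1, 1]
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Euclidean (L2) distance: straight-line distance, lower = more similar
    l2 = np.linalg.norm(a - b)

    # for unit-length vectors the two give the same ranking:
    # squared L2 distance = 2 - 2 * cosine similarity
    print(f"cosine: {cosine:.4f}, L2: {l2:.4f}")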

    Search Algorithms:
    • k-Nearest Neighbors (kNN): Performs exact search by computing similarities with all vectors in the database. Guarantees finding the true nearest neighbors but becomes computationally expensive for large datasets.

    • Approximate Nearest Neighbors (ANN): Uses indexing structures and approximation algorithms to achieve faster search with slight accuracy trade-offs. Essential for large-scale applications with millions of embeddings. FAISS is a popular ANN library.
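
    As an illustration, FAISS exposes exact and approximate indexes behind the same search API; a minimal sketch with random vectors (the HNSW parameter is arbitrary):
    import faiss
    import numpy as np

    d = 384                                 # e.g. all-MiniLM-L6-v2 dimensionality
    vectors = np.random.rand(10000, d).astype("float32")

    exact_index = faiss.IndexFlatL2(d)      # exact kNN: scans every stored vector
    exact_index.add(vectors)

    ann_index = faiss.IndexHNSWFlat(d, 32)  # ANN: HNSW graph, 32 = connectivity
    ann_index.add(vectors)

    query = np.random.rand(1, d).astype("float32")
    print(exact_index.search(query, 5))     # true nearest neighbors
    print(ann_index.search(query, 5))       # close approximation, much faster at scale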
  3. Text Chunking Strategies
    When indexing documents in a vector database, generating a single embedding for an entire document can wash out contextual information. Long documents often contain multiple topics or concepts that a single vector representation cannot capture effectively. To address this limitation, documents are typically split into smaller, semantically coherent chunks, with each chunk receiving its own embedding vector.

    The choice of chunking strategy significantly impacts search quality and system performance. Smaller chunks provide more precise matches but may lack sufficient context, while larger chunks offer more context but can blur the distinct topics they contain. The optimal chunk size often depends on the embedding model's context window and the nature of the documents being indexed.

    Text Chunking Strategies:
    • Fixed-Size Chunking
      • Character-based: Split at fixed character counts (e.g., 512, 1024 characters). Simple to implement but may break sentences or paragraphs mid-thought, potentially losing semantic coherence.

      • Token-based: Split at fixed token counts, respecting embedding model limits (e.g., 256, 512 tokens). More aligned with model constraints but still risks breaking semantic units.

    • Semantic Chunking
      • Sentence-based: Maintains sentence boundaries to preserve complete thoughts. Combines sentences until reaching desired chunk size. Better semantic coherence but variable chunk sizes.

      • Paragraph-based: Preserves logical document structure by keeping paragraphs intact. Ideal for well-structured documents but may create very large or small chunks.

      • Topic-based: Uses natural language processing to identify topic boundaries and create chunks around coherent themes. More sophisticated but computationally expensive.

    • Overlapping Windows
      • Maintains context continuity between adjacent chunks by including shared content. Helps capture information that spans chunk boundaries.

      • Typical overlap ranges from 10-20% of chunk size, balancing context preservation against storage overhead (see the sketch after this list).

    • Advanced Techniques
      • LLM-guided chunking: Uses language models to identify optimal split points based on semantic coherence and topic boundaries. More accurate but resource-intensive.

      • Metadata enrichment: Augments chunks with summaries, keywords, extracted entities, or document structure information to improve retrieval accuracy.
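
    A minimal sketch of fixed-size, character-based chunking with an overlapping window (the sizes are illustrative, not recommendations):
    def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
        # split text into fixed-size character chunks; adjacent chunks share
        # `overlap` characters so content spanning a boundary is not lost
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        step = chunk_size - overlap
        return [text[i:i + chunk_size] for i in range(0, len(text), step)]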
  4. Implementation: Direct FAISS Index
    FAISS (Facebook AI Similarity Search) is a highly optimized library for efficient similarity search and clustering of dense vectors.

    This example demonstrates building a semantic search system using FAISS directly, providing fine-grained control over the indexing and search process. The direct approach is ideal when you need custom index configurations or want to minimize dependencies.

    Install the required modules:
    $ pip install faiss-cpu sentence-transformers pandas numpy
    Python code:
    $ vi faiss-semantic-search.py
    from sentence_transformers import SentenceTransformer
    import faiss
    import numpy as np
    import pandas as pd
    
    # sample corpus (BBC article titles)
    corpus = [
        "Researchers in Japan and the US have unlocked the 60-year mystery of what gives these cats their orange colour.",
        "Astronomers have spotted around a dozen of these weird, rare blasts. Could they be signs of a special kind of black hole?",
        "The world's largest cloud computing company plans to spend £8bn on new data centres in the UK over the next four years.",
        "The Caribbean island is building a power station that will use steam naturally heated by volcanic rock.",
        "As Barcelona celebrate winning La Liga, Spanish football expert Guillem Balague looks at how manager Hansi Flick turned his young side into champions.",
        "Venezuela's Jhonattan Vegas leads the US PGA Championship with several European players close behind, but Rory McIlroy endures a tough start.",
        "Locals and ecologists are troubled by the potential impacts a looming seawall could have on the biodiverse Japanese island of Amami Ōshima.",
        "The government has made little progress in preparing the UK for rising temperatures, climate watchdog the CCC says.",
        "Half a century after the world's first deep sea mining tests picked nodules from the seafloor off the US east coast, the damage has barely begun to heal.",
        "The Cuyahoga River was so polluted it regularly went up in flames. Images of one dramatic blaze in 1952 shaped the US's nascent environmental movement, long after the flames went out."
    ]
    
    # initialize embedding model
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    
    # generate corpus embeddings
    corpus_embeddings = embedding_model.encode(corpus)
    
    # build FAISS index
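    # note: with L2-normalized vectors, ranking by L2 distance is equivalent to
    # ranking by cosine similarity (squared L2 distance = 2 - 2 * cosine)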
    index = faiss.IndexFlatL2(corpus_embeddings.shape[1])
    faiss.normalize_L2(corpus_embeddings)
    index.add(corpus_embeddings)
    
    # embed the query the same way as the corpus
    query = "major achievement"
    query_embedding = embedding_model.encode(query)
    
    # search for the top 5 most similar documents
    # (normalize the query vector to match the normalized corpus vectors)
    query_vector = np.asarray([query_embedding], dtype=np.float32)
    faiss.normalize_L2(query_vector)
    distances, results = index.search(query_vector, 5)
    
    # print results as pandas DataFrame
    corpus_np_array = np.array(corpus)
    data = {'Results':corpus_np_array[results[0]], 'Distances':distances[0]}
    results_df = pd.DataFrame(data)
    print(results_df)
    Run the Python script:
    $ python3 faiss-semantic-search.py
    Output:
                                                 Results  Distances
    0  As Barcelona celebrate winning La Liga, Spanis...   1.628376
    1  The Cuyahoga River was so polluted it regularl...   1.827289
    2  Venezuela's Jhonattan Vegas leads the US PGA C...   1.839513
    3  Half a century after the world's first deep se...   1.871128
    4  Astronomers have spotted around a dozen of the...   1.902063
    The results demonstrate semantic understanding: the query "major achievement" successfully retrieves articles about sports victories and historic milestones without requiring exact keyword matches. Lower distance values indicate higher similarity, with Barcelona's La Liga victory ranking highest due to its achievement-related content.
  5. Implementation: LangChain with FAISS Vector Store
    LangChain provides a higher-level abstraction for working with vector stores, offering built-in document handling, metadata management, and integration with various embedding models. This approach simplifies development and provides additional features like metadata filtering, making it ideal for production applications where you need robust document management capabilities.

    The LangChain implementation automatically handles document preprocessing, embedding generation, and index management. It also supports advanced features like persistent storage, batch operations, and seamless integration with LLM chains for RAG applications.

    Python code:
    $ vi langchain-semantic-search.py
    from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
    from langchain_community.vectorstores.faiss import FAISS
    from langchain_community.docstore.in_memory import InMemoryDocstore
    from langchain_core.documents import Document
    from uuid import uuid4
    import faiss
    
    # create document objects with metadata (BBC article titles)
    documents = [
        Document(page_content="Researchers in Japan and the US have unlocked the 60-year mystery of what gives these cats their orange colour.", metadata={"source":"bbc"}),
        Document(page_content="Astronomers have spotted around a dozen of these weird, rare blasts. Could they be signs of a special kind of black hole?", metadata={"source":"bbc"}),
        Document(page_content="The world's largest cloud computing company plans to spend £8bn on new data centres in the UK over the next four years.", metadata={"source":"bbc"}),
        Document(page_content="The Caribbean island is building a power station that will use steam naturally heated by volcanic rock.", metadata={"source":"bbc"}),
        Document(page_content="As Barcelona celebrate winning La Liga, Spanish football expert Guillem Balague looks at how manager Hansi Flick turned his young side into champions.", metadata={"source":"bbc"}),
        Document(page_content="Venezuela's Jhonattan Vegas leads the US PGA Championship with several European players close behind, but Rory McIlroy endures a tough start.", metadata={"source":"bbc"}),
        Document(page_content="Locals and ecologists are troubled by the potential impacts a looming seawall could have on the biodiverse Japanese island of Amami Ōshima.", metadata={"source":"bbc"}),
        Document(page_content="The government has made little progress in preparing the UK for rising temperatures, climate watchdog the CCC says.", metadata={"source":"bbc"}),
        Document(page_content="Half a century after the world's first deep sea mining tests picked nodules from the seafloor off the US east coast, the damage has barely begun to heal.", metadata={"source":"bbc"}),
        Document(page_content="The Cuyahoga River was so polluted it regularly went up in flames. Images of one dramatic blaze in 1952 shaped the US's nascent environmental movement, long after the flames went out.", metadata={"source":"bbc"})
    ]
    
    # configure embedding model
    model_name = "all-MiniLM-L6-v2"
    model_kwargs = {'device': 'cpu'}
    encode_kwargs = {'normalize_embeddings': False}
    embedding_model = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    
    # initialize FAISS index
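    # (the vector dimension is probed by embedding a sample string)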
    index = faiss.IndexFlatL2(len(embedding_model.embed_query("hello semantic search")))
    
    # create vector store
    vector_store = FAISS(
        embedding_function=embedding_model,
        index=index,
        docstore=InMemoryDocstore(),
        index_to_docstore_id={},
    )
    
    # add documents with unique IDs
    uuids = [str(uuid4()) for _ in range(len(documents))]
    
    # add documents to vector store
    vector_store.add_documents(documents=documents, ids=uuids)
    
    # perform filtered search
    query = "major achievement"
    
    # similarity search (+ use metadata filter)
    results = vector_store.similarity_search(
        query,
        k=2,
        filter={"source": "bbc"},
    )
    
    for res in results:
        print(f"{res.page_content}")
    Run the Python script:
    $ python3 langchain-semantic-search.py
    Output:
    As Barcelona celebrate winning La Liga, Spanish football expert Guillem Balague looks at how manager Hansi Flick turned his young side into champions.
    The Cuyahoga River was so polluted it regularly went up in flames. Images of one dramatic blaze in 1952 shaped the US's nascent environmental movement, long after the flames went out.
    This implementation showcases several advantages of the LangChain approach: automatic document handling with metadata support, built-in filtering capabilities, and simplified API for common operations. The metadata filtering allows for sophisticated search scenarios, such as restricting results to specific sources, document types, or time periods.
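
    Beyond in-memory use, the vector store can be persisted to disk and reloaded without re-embedding the corpus; a minimal sketch (the directory name is arbitrary):
    # save the FAISS index and docstore to a local directory
    vector_store.save_local("faiss_bbc_index")

    # reload it later with the same embedding model
    restored = FAISS.load_local(
        "faiss_bbc_index",
        embedding_model,
        allow_dangerous_deserialization=True,  # opt-in: the docstore is pickled
    )
    print(restored.similarity_search("major achievement", k=1)[0].page_content)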