LLMs | Semantic Search
  1. Introduction
  2. Example: Direct FAISS Index
  3. Example: LangChain with FAISS Vector Store

  1. Introduction
    Semantic search retrieves information by matching the meaning and context of a query rather than relying solely on keyword matching. Both documents and queries are represented as vector embeddings in a high-dimensional space, so retrieval reduces to measuring the similarity between vectors.

    Vector embeddings transform text into numerical representations that capture semantic meaning. Documents with similar meanings cluster together in the vector space, regardless of exact word matches.

    Similarity Metrics
    • Cosine Similarity: Measures the cosine of the angle between two vectors, ignoring their magnitudes (most common for text).

    • Euclidean Distance (L2): Measures straight-line distance between vectors.
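
    The two metrics can be sketched in a few lines of dependency-free Python. The function names and the toy 3-dimensional vectors are illustrative assumptions; real embeddings have hundreds of dimensions and would normally be handled with NumPy or FAISS.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance between vectors a and b."""
    return math.dist(a, b)

# toy "embeddings" (illustrative values only)
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice the length
c = [3.0, -1.0, 0.5]   # different direction

print("cosine(a, b):   ", cosine_similarity(a, b))
print("euclidean(a, b):", euclidean_distance(a, b))
print("cosine(a, c):   ", cosine_similarity(a, c))
```

    Vectors a and b point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is not zero. This is why cosine similarity is often preferred for text, where vector length carries little meaning.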

    Search Algorithms
    • k-Nearest Neighbors (kNN): Exact search that compares the query against every stored vector; suitable for smaller datasets.

    • Approximate Nearest Neighbors (ANN): Faster search with slight accuracy trade-off, ideal for large-scale applications (millions+ embeddings).
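
    Exact kNN search can be sketched as a brute-force scan. The helper name `knn_search` and the 2-dimensional sample vectors are hypothetical and chosen only for illustration.

```python
import heapq
import math

def knn_search(query, vectors, k):
    """Exact kNN: score every stored vector against the query, keep the k closest.

    Returns (index, distance) pairs ordered from nearest to farthest.
    """
    scored = ((i, math.dist(query, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# toy 2-dimensional vectors (illustrative values only)
vectors = [
    [0.0, 0.0],
    [1.0, 1.0],
    [5.0, 5.0],
    [0.9, 1.2],
]
query = [1.0, 1.0]

for idx, dist in knn_search(query, vectors, k=2):
    print(idx, round(dist, 3))
```

    The scan touches every vector, so its cost grows linearly with the size of the collection. ANN indexes (FAISS provides several, such as IndexIVFFlat and IndexHNSWFlat) avoid the full scan at the price of occasionally missing a true nearest neighbor.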

    Semantic search workflow:
    • Indexing Phase:
      Documents → Text Preprocessing → Chunking → Embedding Model → Vectors → FAISS Index

    • Search Phase:
      Query → Embedding Model → Query Vector → FAISS Search → Ranked Results

    When indexing or storing a document in a vector database, one approach is to generate a single embedding vector for the entire document. However, a single vector has to summarize every topic the document covers, which blurs fine-grained details and can lead to less accurate search results. To address this, the document can be split into smaller chunks, with an embedding generated for each chunk to preserve finer-grained semantic details.

    Text Chunking Strategies
    • Fixed-Size Chunking
      • Character-based: Split at fixed character counts (e.g., 512, 1024 characters).
      • Token-based: Split at fixed token counts (respects model limits).

    • Semantic Chunking
      • Sentence-based: Maintain sentence boundaries.
      • Paragraph-based: Preserve logical document structure.

    • Overlapping Windows
      • Maintain context continuity between chunks.
      • Typical overlap: 10-20% of chunk size.
      • Helps capture cross-boundary information.

    • Advanced Techniques
      • LLM-guided chunking: Use language models to identify optimal split points.
      • Metadata enrichment: Add summaries, keywords, or extracted entities.
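
    A minimal sketch of two of the strategies above: fixed-size character chunking with an overlapping window, and sentence-based chunking. The function names, default sizes, and the naive regular-expression sentence splitter are assumptions made for illustration; a production pipeline would more likely count tokens with the embedding model's tokenizer.

```python
import re

def chunk_fixed(text, chunk_size=512, overlap=64):
    """Character-based chunks of at most chunk_size characters; each chunk
    shares `overlap` characters with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

def chunk_sentences(text, max_chars=512):
    """Sentence-based chunks: pack whole sentences until max_chars would be exceeded."""
    # naive splitter: break after '.', '!' or '?' followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "Semantic search uses embeddings. Chunks keep context local. Overlap links neighbours."
for chunk in chunk_fixed(text, chunk_size=40, overlap=8):
    print(repr(chunk))
for chunk in chunk_sentences(text, max_chars=60):
    print(repr(chunk))
```

    The default overlap of 64 characters on a 512-character chunk is 12.5%, within the typical 10-20% range. A sentence longer than max_chars is kept whole rather than cut, which favors readable chunks over a strict size limit.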

    We’ll use a few examples to demonstrate how semantic search works. The first step is to embed all documents into a vector space and store them in a vector database. Queries must then be embedded using the same embedding model that was used to generate the document embeddings.

    The goal is to retrieve the most relevant documents: those whose embeddings are the nearest neighbors of the query embedding in the vector space. The examples below use a FAISS L2 index; for unit-length (normalized) embeddings, ranking by L2 distance produces the same order as ranking by cosine similarity.

    Search results can be further enhanced using Large Language Models (LLMs) through techniques such as re-ranking and Retrieval-Augmented Generation (RAG).
  2. Example: Direct FAISS Index
    Install the required modules:
    $ pip install faiss-cpu sentence-transformers numpy pandas
    Python code:
    $ vi faiss-semantic-search.py
    from sentence_transformers import SentenceTransformer
    import faiss
    import numpy as np
    import pandas as pd
    
    # sample corpus (BBC article titles)
    corpus = [
        "Researchers in Japan and the US have unlocked the 60-year mystery of what gives these cats their orange colour.",
        "Astronomers have spotted around a dozen of these weird, rare blasts. Could they be signs of a special kind of black hole?",
        "The world's largest cloud computing company plans to spend £8bn on new data centres in the UK over the next four years.",
        "The Caribbean island is building a power station that will use steam naturally heated by volcanic rock.",
        "As Barcelona celebrate winning La Liga, Spanish football expert Guillem Balague looks at how manager Hansi Flick turned his young side into champions.",
        "Venezuela's Jhonattan Vegas leads the US PGA Championship with several European players close behind, but Rory McIlroy endures a tough start.",
        "Locals and ecologists are troubled by the potential impacts a looming seawall could have on the biodiverse Japanese island of Amami Ōshima.",
        "The government has made little progress in preparing the UK for rising temperatures, climate watchdog the CCC says.",
        "Half a century after the world's first deep sea mining tests picked nodules from the seafloor off the US east coast, the damage has barely begun to heal.",
        "The Cuyahoga River was so polluted it regularly went up in flames. Images of one dramatic blaze in 1952 shaped the US's nascent environmental movement, long after the flames went out."
    ]
    
    # initialize embedding model
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    
    # generate corpus embeddings
    corpus_embeddings = embedding_model.encode(corpus)
    
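    # build FAISS index (exact search, squared L2 distance);
    # vectors are normalized to unit length so L2 ranking matches cosine ranking
    faiss.normalize_L2(corpus_embeddings)
    index = faiss.IndexFlatL2(corpus_embeddings.shape[1])
    index.add(corpus_embeddings)
    
    # embed the query and normalize it the same way as the corpus
    query = "major achievement"
    query_embeddings = embedding_model.encode([query])
    faiss.normalize_L2(query_embeddings)
    
    # search for top 5 similar documents
    distances, results = index.search(query_embeddings, 5)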
    
    # print results as pandas DataFrame
    corpus_np_array = np.array(corpus)
    data = {'Results':corpus_np_array[results[0]], 'Distances':distances[0]}
    results_df = pd.DataFrame(data)
    print(results_df)
    Run the Python script:
    $ python3 faiss-semantic-search.py
    Output:
                                                 Results  Distances
    0  As Barcelona celebrate winning La Liga, Spanis...   1.628376
    1  The Cuyahoga River was so polluted it regularl...   1.827289
    2  Venezuela's Jhonattan Vegas leads the US PGA C...   1.839513
    3  Half a century after the world's first deep se...   1.871128
    4  Astronomers have spotted around a dozen of the...   1.902063
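
    The Distances column is easier to interpret once converted to cosine similarity. IndexFlatL2 reports squared L2 distances, and for unit-length vectors the squared distance equals 2 - 2 * cosine similarity. The sketch below applies that identity to the values printed above; it assumes both the corpus and the query embeddings are unit length, and the helper name is hypothetical.

```python
# squared L2 distances copied from the output above
distances = [1.628376, 1.827289, 1.839513, 1.871128, 1.902063]

def squared_l2_to_cosine(d2):
    """For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - d2 / 2.0

for d2 in distances:
    print(f"{d2:.6f} -> cosine similarity {squared_l2_to_cosine(d2):.4f}")
```

    Under that assumption the best match has a cosine similarity of roughly 0.19, so even the top result is only loosely related to the short query "major achievement".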
  3. Example: LangChain with FAISS Vector Store
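    Install the required modules:
    $ pip install langchain-huggingface langchain-community sentence-transformers faiss-cpu
    Python code: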
    $ vi langchain-semantic-search.py
    from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
    from langchain_community.vectorstores.faiss import FAISS
    from langchain_community.docstore.in_memory import InMemoryDocstore
    from langchain_core.documents import Document
    from uuid import uuid4
    import faiss
    
    # create document objects with metadata (BBC article titles)
    documents = [
        Document(page_content="Researchers in Japan and the US have unlocked the 60-year mystery of what gives these cats their orange colour.", metadata={"source":"bbc"}),
        Document(page_content="Astronomers have spotted around a dozen of these weird, rare blasts. Could they be signs of a special kind of black hole?", metadata={"source":"bbc"}),
        Document(page_content="The world's largest cloud computing company plans to spend £8bn on new data centres in the UK over the next four years.", metadata={"source":"bbc"}),
        Document(page_content="The Caribbean island is building a power station that will use steam naturally heated by volcanic rock.", metadata={"source":"bbc"}),
        Document(page_content="As Barcelona celebrate winning La Liga, Spanish football expert Guillem Balague looks at how manager Hansi Flick turned his young side into champions.", metadata={"source":"bbc"}),
        Document(page_content="Venezuela's Jhonattan Vegas leads the US PGA Championship with several European players close behind, but Rory McIlroy endures a tough start.", metadata={"source":"bbc"}),
        Document(page_content="Locals and ecologists are troubled by the potential impacts a looming seawall could have on the biodiverse Japanese island of Amami Ōshima.", metadata={"source":"bbc"}),
        Document(page_content="The government has made little progress in preparing the UK for rising temperatures, climate watchdog the CCC says.", metadata={"source":"bbc"}),
        Document(page_content="Half a century after the world's first deep sea mining tests picked nodules from the seafloor off the US east coast, the damage has barely begun to heal.", metadata={"source":"bbc"}),
        Document(page_content="The Cuyahoga River was so polluted it regularly went up in flames. Images of one dramatic blaze in 1952 shaped the US's nascent environmental movement, long after the flames went out.", metadata={"source":"bbc"})
    ]
    
    # configure embedding model
    model_name = "all-MiniLM-L6-v2"
    model_kwargs = {'device': 'cpu'}
    encode_kwargs = {'normalize_embeddings': False}
    embedding_model = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    
    # initialize FAISS index
    index = faiss.IndexFlatL2(len(embedding_model.embed_query("hello semantic search")))
    
    # create vector store
    vector_store = FAISS(
        embedding_function=embedding_model,
        index=index,
        docstore=InMemoryDocstore(),
        index_to_docstore_id={},
    )
    
    # generate a unique ID for each document
    uuids = [str(uuid4()) for _ in range(len(documents))]
    
    # add documents to vector store
    vector_store.add_documents(documents=documents, ids=uuids)
    
    # query text
    query = "major achievement"
    
    # similarity search: top 2 matches, restricted by a metadata filter
    # (every sample document has source "bbc", so the filter excludes nothing here)
    results = vector_store.similarity_search(
        query,
        k=2,
        filter={"source": "bbc"},
    )
    
    for res in results:
        print(f"{res.page_content}")
    Run the Python script:
    $ python3 langchain-semantic-search.py
    Output:
    As Barcelona celebrate winning La Liga, Spanish football expert Guillem Balague looks at how manager Hansi Flick turned his young side into champions.
    The Cuyahoga River was so polluted it regularly went up in flames. Images of one dramatic blaze in 1952 shaped the US's nascent environmental movement, long after the flames went out.
© 2025  mtitek