
RAG Explained: Building Retrieval-Augmented Generation with LangChain

A practical LangChain RAG tutorial that goes past the demo — chunking strategy, embedding choice, hybrid search, evaluation, and the source-citation grounding that keeps a chatbot from making things up.

May 28, 2026 6 min read

Most RAG tutorials get you a chatbot that answers the first question beautifully and falls apart by the third. The reason is almost always the same: the retrieval is naive. I build RAG systems where trustworthy retrieval is the entire point. The Ilm AI knowledge assistant only works because every answer traces back to a verified source, and the Multi-AI RAG Accounting System lets people query financial documents in plain English without getting a confidently wrong number. This is the LangChain RAG build that holds up past the demo.

Quick answer: what RAG is

Retrieval-augmented generation (RAG) is a way to make an LLM answer from your data instead of only from its training data. At query time you search a knowledge base for the passages most relevant to the question, paste those passages into the prompt as context, and ask the model to answer using them, with a citation back to each source. The model supplies fluency and reasoning; your documents supply the facts. That separation is what keeps the answer current and verifiable.

The RAG pipeline in LangChain

A RAG system has two phases: an ingestion phase that runs once (or on updates) to build the searchable index, and a query phase that runs on every question.

INGESTION (offline)              QUERY (per request)
─────────────────                ───────────────────
load documents                   embed the question
   │                                │
split into chunks                search vector store (+ keyword)
   │                                │
embed each chunk                 rerank top results
   │                                │
store vectors + metadata         build grounded prompt
                                    │
                                 LLM generates answer + citations

Step 1: load and split — where RAG quality is won or lost

Chunking is the most underrated decision in RAG. Chunk too large and you bury the relevant sentence in noise the retriever can't rank. Chunk too small and you sever the context that gives a passage meaning. On Ilm AI this mattered enormously — splitting a ruling mid-sentence can change what it means entirely.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# `docs` is the list of LangChain Document objects produced by your loader
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,        # counted in characters by default; tune to your content
    chunk_overlap=120,     # overlap preserves context across boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # structure first; "" is the last-resort fallback
)
chunks = splitter.split_documents(docs)

The rules I follow: split on natural structure (paragraphs, sections) before raw character counts; keep overlap so a thought is never cut clean in half; and attach metadata to every chunk — source, section, date, document type. That metadata is what lets you filter retrieval later, and it is what powers citations.
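To make those rules concrete, here is a minimal sketch in plain Python. It is an illustration, not how RecursiveCharacterTextSplitter works internally: it packs whole paragraphs into chunks, repeats the last paragraph as overlap, and stamps every chunk with metadata. The `Chunk` class, the `chunk_by_paragraph` helper, and the `source` and `chunk_index` field names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_paragraph(text, metadata, max_chars=800, overlap_paragraphs=1):
    """Pack whole paragraphs into chunks of at most max_chars where possible.

    The last `overlap_paragraphs` paragraphs of a chunk are repeated at the
    start of the next one. A single paragraph longer than max_chars is kept
    whole rather than cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []

    def flush():
        body = "\n\n".join(current)
        # Copy the shared metadata and add the chunk's position in the document.
        chunks.append(Chunk(body, {**metadata, "chunk_index": len(chunks)}))

    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            flush()
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        flush()
    return chunks


sample = "alpha one.\n\nbeta two.\n\ngamma three."
for chunk in chunk_by_paragraph(sample, {"source": "handbook.md"}, max_chars=25):
    print(chunk.metadata, repr(chunk.text))
```

The overlap here is a whole paragraph rather than a fixed character count, which is one way to keep a thought from being cut in half. Whether that is too much repetition depends on how long your paragraphs run.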

Step 2: embed and store

Embeddings turn text into vectors so "what did we spend on software" can match a passage about SaaS expenses even with zero shared keywords. Pick an embedding model deliberately — they differ in quality, cost, and dimension — and never mix models between ingestion and query.

from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

store = PGVector.from_documents(
    documents=chunks,
    embedding=embeddings,
    connection="postgresql+psycopg://...",
    collection_name="kb",
)

I default to Postgres with pgvector — one database for relational data and vectors until scale genuinely demands a dedicated vector DB. (I wrote a full comparison of pgvector vs Pinecone vs Chroma if you are weighing the trade-off.)
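One way to enforce the "never mix models" rule is to store the embedding model's name next to the index and check it before every search. The sketch below assumes a hypothetical `IndexManifest` record and `check_query_embedding` helper; neither is part of LangChain or pgvector. The 1536 figure is the default output size of text-embedding-3-small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexManifest:
    """Hypothetical record saved alongside the vector collection at ingestion."""
    collection_name: str
    embedding_model: str
    dimensions: int


def check_query_embedding(manifest, query_model, query_vector):
    """Raise if the query was embedded differently from the stored chunks."""
    if query_model != manifest.embedding_model:
        raise ValueError(
            f"index '{manifest.collection_name}' was built with "
            f"{manifest.embedding_model}, but the query used {query_model}"
        )
    if len(query_vector) != manifest.dimensions:
        raise ValueError(
            f"expected {manifest.dimensions} dimensions, got {len(query_vector)}"
        )


manifest = IndexManifest("kb", "text-embedding-3-small", 1536)
check_query_embedding(manifest, "text-embedding-3-small", [0.0] * 1536)  # passes silently
```

A dimension mismatch usually fails loudly on its own. The sneakier case is two models with the same dimension, where the search runs fine and returns nonsense, which is why the name check comes first.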

Step 3: retrieve — and do not rely on cosine similarity alone

Pure semantic search misses exact matches: invoice numbers, product SKUs, proper nouns. The fix is hybrid search — combine dense vector similarity with sparse keyword (BM25) matching, then rerank the merged results so the genuinely relevant passages float to the top.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # needs the rank_bm25 package

semantic = store.as_retriever(search_kwargs={"k": 8})
keyword = BM25Retriever.from_documents(chunks)
keyword.k = 8  # return the top 8 keyword matches

retriever = EnsembleRetriever(
    retrievers=[semantic, keyword],
    weights=[0.6, 0.4],   # tune to your query mix
)

In my experience, hybrid search is the biggest single accuracy upgrade over a default RAG demo, especially for data full of identifiers. That is exactly the case in the accounting system, where "invoice 0047" must match the literal string, not just something semantically nearby.
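If you are wondering how two ranked lists become one, a common approach is weighted Reciprocal Rank Fusion (RRF). My understanding is that EnsembleRetriever merges results with a variant of this, but treat the sketch below as an illustration of the technique rather than the library's code. The `weighted_rrf` function, the constant `c=60`, and the document ids are assumptions for the example.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document ids into one ranking.

    A document earns weight / (c + rank) from each list it appears in,
    with rank starting at 1. Scores are summed across lists.
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("need one weight per ranked list")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


semantic_hits = ["saas-expenses", "invoice-0046", "invoice-0047"]
keyword_hits = ["invoice-0047", "vendor-list"]
fused = weighted_rrf([semantic_hits, keyword_hits], weights=[0.6, 0.4])
print(fused)
```

Here `invoice-0047` is only third in the semantic list, yet it ends up first after fusion because the keyword list ranks it at the top. That is the behaviour you want for literal identifiers.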

Step 4: generate with grounding and citations

The prompt is where you prevent hallucination. Inject the retrieved chunks, instruct the model to answer only from them, tell it to admit when the context doesn't cover the question, and require it to cite sources.

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("""
Answer using ONLY the context below. If the context does not contain
the answer, say so plainly — do not invent one. Cite the source of
each fact as [source].

Context:
{context}

Question: {question}
""")

Every response on both my RAG systems carries a citation back to the document it came from. That is not a nicety — for financial data or scholarly rulings, an answer you cannot verify is an answer you cannot trust.
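The model can only cite what it can see, so the source label has to travel with each chunk into the `{context}` slot. A minimal sketch of that formatting step follows. The `format_context` helper, the dict shape, the `source` and `section` field names, and the sample passages are all made up for illustration.

```python
def format_context(chunks):
    """Render retrieved chunks as labelled blocks the model can cite.

    Each chunk is a dict with "text" and "metadata" keys. The "source" and
    "section" metadata fields are assumed names, not a LangChain requirement.
    """
    blocks = []
    for chunk in chunks:
        meta = chunk["metadata"]
        label = meta.get("source", "unknown")
        if meta.get("section"):
            label = f"{label}, {meta['section']}"
        blocks.append(f"[{label}]\n{chunk['text']}")
    return "\n\n".join(blocks)


# Invented example passages, not real data.
retrieved = [
    {"text": "Software subscriptions are booked under operating expenses.",
     "metadata": {"source": "ledger-guide.pdf", "section": "Expenses"}},
    {"text": "Invoice 0047 was issued to a vendor.",
     "metadata": {"source": "invoices.csv"}},
]
context = format_context(retrieved)
print(context)
```

The resulting string is what you would pass as `context` when filling the prompt template, so that a `[source]` citation in the answer can be checked against a label that really appeared in the prompt.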

Evaluation: the step that separates real RAG from a demo

You cannot improve retrieval you do not measure. Before shipping, I build an eval set of real questions with known-correct sources and track two things:

Retrieval quality — for each question, did the correct passage appear in the top-k results? (Recall@k.) If the right chunk never gets retrieved, no amount of prompt engineering saves the answer.

Answer quality — is the generated answer correct and properly grounded in what was retrieved? You can grade this with a stronger model as a judge, or by hand for a smaller set.

When retrieval is wrong, you fix chunking, embeddings, or hybrid weights — not the LLM. Most RAG failures are retrieval failures wearing a generation costume.
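Recall@k is simple enough to compute by hand. Here is a minimal sketch, assuming an eval set of (question, expected source) pairs and a `retrieve` callable that returns ranked source ids. Both shapes are assumptions for the example, and the stand-in retriever is a plain dictionary rather than a real vector store.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose expected source appears in the top k.

    eval_set: list of (question, expected_source) pairs.
    retrieve: callable returning a ranked list of source ids for a question.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = sum(
        1 for question, expected in eval_set
        if expected in retrieve(question)[:k]
    )
    return hits / len(eval_set)


# Stand-in retriever with canned rankings, for illustration only.
canned = {
    "q1": ["doc-a", "doc-b", "doc-c"],
    "q2": ["doc-x", "doc-y", "doc-b"],
    "q3": ["doc-c", "doc-a", "doc-z"],
}
eval_set = [("q1", "doc-a"), ("q2", "doc-b"), ("q3", "doc-a")]
score = recall_at_k(eval_set, lambda q: canned[q], k=2)
print(f"recall@2 = {score:.2f}")
```

Tracking this one number across changes to chunking, embeddings, or hybrid weights tells you whether a tweak actually helped retrieval, before the LLM is involved at all.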

Common mistakes that break RAG in production

The big four, in the order I see them: naive fixed-size chunking that severs context; semantic-only retrieval that misses exact identifiers; no evaluation, so quality is a vibe instead of a number; and a too-large context stuffed with marginally-relevant chunks, which dilutes the signal and raises cost. Retrieve fewer, better passages — not more, noisier ones.
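For the last of those mistakes, one simple guard is to cap the context by both relevance and size before building the prompt. The sketch below assumes each passage arrives with a relevance score where higher is better. The `select_passages` helper and the threshold and budget values are placeholders to tune, not recommendations.

```python
def select_passages(scored_passages, min_score=0.5, max_chars=2000):
    """Keep the best passages that clear min_score and fit within max_chars.

    scored_passages: list of (score, text) pairs, higher score = more relevant.
    """
    selected, used = [], 0
    ranked = sorted(scored_passages, key=lambda pair: pair[0], reverse=True)
    for score, text in ranked:
        if score < min_score:
            break  # everything after this point is even less relevant
        if used + len(text) > max_chars:
            continue  # skip a passage that would blow the budget
        selected.append(text)
        used += len(text)
    return selected


passages = [(0.9, "a" * 50), (0.3, "b" * 10), (0.7, "c" * 60), (0.6, "d" * 20)]
kept = select_passages(passages, min_score=0.5, max_chars=80)
print([p[0] for p in kept])
```

In this run the 0.7 passage is dropped because it does not fit after the top passage, and the 0.3 passage never makes the cut. A character budget is a rough stand-in for a token budget, so swap in a tokenizer count if cost matters.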

The takeaway

RAG is straightforward to demo and hard to make trustworthy, and the entire difference lives in retrieval. Chunk on structure with overlap and metadata, embed deliberately, search hybrid and rerank, ground generation strictly with citations, and measure recall before you ship. Do that and you get a chatbot that handles the messy real question, not just the clean one from your demo script.


Want a chatbot grounded in your actual data that retrieves the right answer instead of the nearest-sounding one? See RAG & Chatbots or book a scope call.


Ayaan Motiwala

AI Specialist in Surat. I ship multi-LLM systems, voice agents, and automations that survive real users — and write about what breaks along the way.