Document RAG
Full production RAG pipeline. From chunking strategies to citation generation to RAGAS evaluation.
Production RAG Architecture
You learned the basic RAG pipeline in Lesson 2.4. Now we go deeper into production patterns: advanced chunking, metadata filtering, citation generation, and rigorous evaluation.
Production Pipeline
Documents → Parse (extract text) → Split (chunk) → Embed (vectorize) → Store (index)
Query → Retrieve (hybrid search + rerank) → Generate (LLM + citations) → Answer
Chunking Strategies Deep Dive
Chunking is where most RAG systems fail. Poor chunking leads to poor retrieval. Here are the production-ready strategies.
1. Recursive Character Splitting (Default)
Splits on paragraphs first, then sentences, then words, falling back to characters. Preserves the document's structural hierarchy where possible.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")2. Semantic Chunking
Split based on semantic similarity. Keeps related content together.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# Semantic chunking uses embeddings to find natural breakpoints
semantic_splitter = SemanticChunker(
embeddings=OpenAIEmbeddings(),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95
)
semantic_chunks = semantic_splitter.split_documents(documents)
3. Parent-Child (Hierarchical)
Embed small chunks for precision, retrieve parent chunks for context.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
import faiss
# Small chunks for embedding (better precision)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# Large chunks for context (better for LLM)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
# Store parent docs, index children
store = InMemoryStore()
embeddings = OpenAIEmbeddings()
# FAISS can't be built from an empty document list, so create an empty index explicitly
index = faiss.IndexFlatL2(len(embeddings.embed_query("probe")))
vectorstore = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
retriever = ParentDocumentRetriever(
vectorstore=vectorstore,
docstore=store,
child_splitter=child_splitter,
parent_splitter=parent_splitter
)
retriever.add_documents(documents)
# Search returns parent documents containing matched children
results = retriever.get_relevant_documents("query")
4. Sliding Window with Overlap
Fixed-size chunks with overlap to avoid splitting important content.
def sliding_window_chunks(text: str, chunk_size: int = 500,
                          overlap: int = 100) -> list:
    """Create overlapping chunks to avoid losing context at boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap  # Overlap with previous chunk
    return chunks
Chunking Strategy Comparison
| Strategy | Best For | Weakness |
|---|---|---|
| Recursive | General purpose, structured docs | May split mid-thought |
| Semantic | Topic-based retrieval | Expensive, variable sizes |
| Parent-Child | Long documents, context-heavy | Complex to implement |
| Sliding Window | Dense technical docs | Redundant storage |
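To make the comparison concrete, here is a minimal sketch of a strategy selector built on the splitters shown above. It covers only the two splitter-style strategies (parent-child and sliding window use the retriever and helper shown earlier), and the get_splitter helper with its defaults is illustrative, not part of any library.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

def get_splitter(strategy: str):
    """Return a text splitter for the named strategy (illustrative defaults)."""
    if strategy == "recursive":
        return RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    if strategy == "semantic":
        return SemanticChunker(embeddings=OpenAIEmbeddings(),
                               breakpoint_threshold_type="percentile")
    raise ValueError(f"Unknown strategy: {strategy}")

chunks = get_splitter("recursive").split_documents(documents)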
Metadata Filtering
Production RAG systems use metadata to filter results before or during retrieval. This dramatically improves precision for scoped queries.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document
# Documents with rich metadata
documents = [
Document(
page_content="Section 4.2: Employees receive 20 days PTO annually.",
metadata={
"source": "employee_handbook.pdf",
"section": "benefits",
"page": 42,
"year": 2024,
"department": "HR"
}
),
Document(
page_content="API rate limits are 1000 requests per minute.",
metadata={
"source": "api_docs.pdf",
"section": "limits",
"page": 15,
"year": 2024,
"department": "Engineering"
}
),
# ... more documents
]
# Create vectorstore
vectorstore = FAISS.from_documents(documents, OpenAIEmbeddings())
# Filtered retrieval - only search HR documents
results = vectorstore.similarity_search(
"vacation policy",
k=5,
filter={"department": "HR"}
)
# Multiple filter conditions
results = vectorstore.similarity_search(
"rate limits",
k=5,
filter={"source": "api_docs.pdf", "year": 2024}
)Common Metadata Fields
- source: Original document path
- page: Page number
- section: Document section
- date: Creation/update date
- author: Document author
- category: Topic classification
Metadata Use Cases
- Multi-tenant isolation: filter by organization_id (see the sketch below)
- Access control: filter by permission_level
- Freshness: filter by date range
- Domain-specific scoping: filter by product or region
- Debugging: trace an answer back to its source
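As a sketch of the multi-tenant and access-control cases: assuming each chunk carries hypothetical organization_id and permission_level metadata fields (they are not in the example documents above), a scoped query could look like this.
# organization_id and permission_level are illustrative metadata fields
results = vectorstore.similarity_search(
    "expense reimbursement policy",
    k=5,
    filter={
        "organization_id": "acme-corp",   # tenant isolation
        "permission_level": "employee"    # access control
    }
)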
Citation Generation
Production RAG must cite sources. This enables verification and builds trust. Here's how to generate proper citations.
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
def generate_with_citations(query: str, chunks: list) -> dict:
    """Generate answer with inline citations."""
    # Format chunks with citation markers
    context_parts = []
    for i, chunk in enumerate(chunks):
        source = chunk.metadata.get("source", "Unknown")
        page = chunk.metadata.get("page", "N/A")
        context_parts.append(f"[{i+1}] (Source: {source}, Page: {page})\n{chunk.page_content}")
    context = "\n\n".join(context_parts)

    prompt = ChatPromptTemplate.from_messages([
        ("system", """Answer questions using ONLY the provided context.
Include inline citations like [1], [2] when referencing sources.
If the context doesn't contain the answer, say "I don't have enough information."
Always cite your sources."""),
        ("user", """Context:
{context}
Question: {question}
Answer with citations:""")
    ])

    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    chain = prompt | llm
    response = chain.invoke({
        "context": context,
        "question": query
    })

    # Extract citations used
    import re
    citations_used = set(re.findall(r'\[(\d+)\]', response.content))

    return {
        "answer": response.content,
        "sources": [
            {
                "citation": f"[{i+1}]",
                "source": chunks[i].metadata.get("source"),
                "page": chunks[i].metadata.get("page"),
                "text": chunks[i].page_content[:200] + "..."
            }
            for i in range(len(chunks))
            if str(i+1) in citations_used
        ]
    }
# Usage
result = generate_with_citations("What is the vacation policy?", retrieved_chunks)
print(result["answer"])
print("\nSources:")
for source in result["sources"]:
    print(f"  {source['citation']}: {source['source']}, page {source['page']}")
Example output:
According to the employee handbook, employees receive 20 days of paid time off (PTO) annually [1]. This PTO accrues monthly and can be carried over up to 5 days into the next year [1]. For employees with 5+ years of tenure, an additional 5 days are granted [2].
Sources:
[1]: employee_handbook.pdf, page 42
[2]: employee_handbook.pdf, page 43
Full Production Pipeline
Here's a complete production RAG pipeline combining everything we've learned.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from sentence_transformers import CrossEncoder
import numpy as np
class ProductionRAG:
    def __init__(self):
        self.embeddings = OpenAIEmbeddings()
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.reranker = CrossEncoder('BAAI/bge-reranker-large')
        self.vectorstore = None

    def ingest(self, documents: list):
        """Ingest documents with smart chunking."""
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50,
            separators=["\n\n", "\n", ". ", " "]
        )
        chunks = splitter.split_documents(documents)
        self.vectorstore = FAISS.from_documents(chunks, self.embeddings)
        return len(chunks)

    def retrieve(self, query: str, k: int = 10,
                 rerank_k: int = 5, filters: dict = None) -> list:
        """Retrieve with optional filtering and reranking."""
        # Stage 1: Vector retrieval
        if filters:
            candidates = self.vectorstore.similarity_search(query, k=k, filter=filters)
        else:
            candidates = self.vectorstore.similarity_search(query, k=k)
        if len(candidates) <= rerank_k:
            return candidates

        # Stage 2: Reranking
        pairs = [[query, doc.page_content] for doc in candidates]
        scores = self.reranker.predict(pairs)
        ranked_indices = np.argsort(scores)[::-1][:rerank_k]
        return [candidates[i] for i in ranked_indices]

    def generate(self, query: str, chunks: list) -> dict:
        """Generate answer with citations."""
        context_parts = []
        for i, chunk in enumerate(chunks):
            source = chunk.metadata.get("source", "Unknown")
            page = chunk.metadata.get("page", "N/A")
            context_parts.append(
                f"[{i+1}] (Source: {source}, Page: {page})\n{chunk.page_content}"
            )

        prompt = ChatPromptTemplate.from_messages([
            ("system", """Answer using ONLY the provided context.
Include inline citations [1], [2], etc. when referencing sources.
If context is insufficient, say so. Always cite sources."""),
            ("user", "Context:\n{context}\n\nQuestion: {question}")
        ])
        chain = prompt | self.llm
        response = chain.invoke({
            "context": "\n\n".join(context_parts),
            "question": query
        })
        return {
            "answer": response.content,
            "chunks": chunks
        }

    def query(self, question: str, filters: dict = None) -> dict:
        """Full RAG pipeline."""
        chunks = self.retrieve(question, filters=filters)
        return self.generate(question, chunks)
# Usage
rag = ProductionRAG()
num_chunks = rag.ingest(documents)
print(f"Indexed {num_chunks} chunks")
result = rag.query(
"What is the vacation policy?",
filters={"department": "HR"}
)
print(result["answer"])
Evaluation with RAGAS
RAGAS (Retrieval Augmented Generation Assessment) is the standard framework for evaluating RAG systems. It measures both retrieval quality and generation quality.
Retrieval Metrics
- Context Precision: Are the retrieved chunks relevant to the question?
- Context Recall: Do the retrieved chunks cover all the needed information?
Generation Metrics
- Faithfulness: Is the answer grounded in the retrieved context?
- Answer Relevancy: Does the answer address the question?
from ragas import evaluate
from ragas.metrics import (
context_precision,
context_recall,
faithfulness,
answer_relevancy
)
from datasets import Dataset
# Prepare evaluation data
eval_data = {
"question": [
"What is the vacation policy?",
"How do I reset my password?"
],
"answer": [
"Employees receive 20 days PTO annually [1].",
"You can reset via the self-service portal [1]."
],
"contexts": [
["Section 4.2: Employees receive 20 days PTO annually."],
["Password reset is available in the IT self-service portal."]
],
"ground_truth": [
"20 days paid time off per year",
"Use the self-service portal"
]
}
dataset = Dataset.from_dict(eval_data)
# Run evaluation
results = evaluate(
dataset,
metrics=[
context_precision,
context_recall,
faithfulness,
answer_relevancy
]
)
print(results)
# Output:
# {'context_precision': 0.92, 'context_recall': 0.88,
# 'faithfulness': 0.95, 'answer_relevancy': 0.91}
Benchmark Targets
| Metric | Acceptable | Good | Excellent |
|---|---|---|---|
| Context Precision | > 0.70 | > 0.85 | > 0.95 |
| Context Recall | > 0.65 | > 0.80 | > 0.90 |
| Faithfulness | > 0.75 | > 0.90 | > 0.95 |
| Answer Relevancy | > 0.70 | > 0.85 | > 0.95 |
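One practical use of these targets is as a regression gate in CI: fail the build whenever a metric drops below its threshold. A minimal sketch, assuming results is the RAGAS result from the evaluation above and using the "Acceptable" column as the cutoffs; the dict conversion may need adapting to your ragas version.
# Minimum acceptable scores, taken from the "Acceptable" column above
THRESHOLDS = {
    "context_precision": 0.70,
    "context_recall": 0.65,
    "faithfulness": 0.75,
    "answer_relevancy": 0.70,
}

scores = dict(results)  # aggregate metric scores from the RAGAS evaluation
failures = {metric: score for metric, score in scores.items()
            if metric in THRESHOLDS and score < THRESHOLDS[metric]}
if failures:
    raise SystemExit(f"RAG evaluation below threshold: {failures}")
print("All RAG metrics meet production targets")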
Key Takeaways
1. Chunking strategy matters most - parent-child for long documents, semantic for topic-based retrieval, recursive for general use.
2. Metadata enables precision - filter by source, date, or department; essential for multi-tenant systems and access control.
3. Citations build trust - always cite sources so users can verify answers, using inline citations backed by source metadata.
4. RAGAS for evaluation - measure faithfulness, relevancy, precision, and recall; target >0.85 for production.