Level 3: Production (~30 min)

Document RAG

Full production RAG pipeline. From chunking strategies to citation generation to RAGAS evaluation.

Production RAG Architecture

You learned the basic RAG pipeline in Lesson 2.4. Now we go deeper into production patterns: advanced chunking, metadata filtering, citation generation, and rigorous evaluation.

Production Pipeline

Ingestion: Documents -> Parse (extract text) -> Split (chunk) -> Embed (vectorize) -> Store (index)

Query: Query -> Retrieve (hybrid + rerank) -> Generate (LLM + citations) -> Output (answer)

Chunking Strategies Deep Dive

Chunking is where most RAG systems fail. Poor chunking leads to poor retrieval. Here are the production-ready strategies.

1. Recursive Character Splitting (Default)

Split by paragraph, then sentence, then word. Preserves structure hierarchy.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

2. Semantic Chunking

Split based on semantic similarity. Keeps related content together.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Semantic chunking uses embeddings to find natural breakpoints
semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

semantic_chunks = semantic_splitter.split_documents(documents)

3. Parent-Child (Hierarchical)

Embed small chunks for precision, retrieve parent chunks for context.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_openai import OpenAIEmbeddings
import faiss

embeddings = OpenAIEmbeddings()

# Small chunks for embedding (better precision)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Large chunks for context (better for the LLM)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

# Store parent docs, index children.
# FAISS cannot be built from an empty document list, so start from an empty index.
store = InMemoryStore()
index = faiss.IndexFlatL2(len(embeddings.embed_query("hello")))
vectorstore = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

retriever.add_documents(documents)

# Search returns parent documents containing matched children
results = retriever.get_relevant_documents("query")

4. Sliding Window with Overlap

Fixed-size chunks with overlap, so content that falls on a chunk boundary still appears intact in at least one chunk.

def sliding_window_chunks(text: str, chunk_size: int = 500,
                          overlap: int = 100) -> list:
    """Create overlapping chunks to avoid losing context at boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Reached the end; avoid emitting a redundant tail chunk
        start = end - overlap  # Overlap with previous chunk
    return chunks
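A quick usage sketch (the handbook.txt file and the sizes are hypothetical):

text = open("handbook.txt").read()  # hypothetical plain-text source
chunks = sliding_window_chunks(text, chunk_size=500, overlap=100)
print(f"Created {len(chunks)} chunks; adjacent chunks share 100 characters")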

Chunking Strategy Comparison

Strategy | Best For | Weakness
Recursive | General purpose, structured docs | May split mid-thought
Semantic | Topic-based retrieval | Expensive, variable sizes
Parent-Child | Long documents, context-heavy | Complex to implement
Sliding Window | Dense technical docs | Redundant storage

Metadata Filtering

Production RAG systems use metadata to filter results before or during retrieval. This dramatically improves precision for scoped queries.

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document

# Documents with rich metadata
documents = [
    Document(
        page_content="Section 4.2: Employees receive 20 days PTO annually.",
        metadata={
            "source": "employee_handbook.pdf",
            "section": "benefits",
            "page": 42,
            "year": 2024,
            "department": "HR"
        }
    ),
    Document(
        page_content="API rate limits are 1000 requests per minute.",
        metadata={
            "source": "api_docs.pdf",
            "section": "limits",
            "page": 15,
            "year": 2024,
            "department": "Engineering"
        }
    ),
    # ... more documents
]

# Create vectorstore
vectorstore = FAISS.from_documents(documents, OpenAIEmbeddings())

# Filtered retrieval - only search HR documents
results = vectorstore.similarity_search(
    "vacation policy",
    k=5,
    filter={"department": "HR"}
)

# Multiple filter conditions
results = vectorstore.similarity_search(
    "rate limits",
    k=5,
    filter={"source": "api_docs.pdf", "year": 2024}
)

Common Metadata Fields

  • source: Original document path
  • page: Page number
  • section: Document section
  • date: Creation/update date
  • author: Document author
  • category: Topic classification

Metadata Use Cases

  • Multi-tenant: filter by organization_id (see the sketch below)
  • Access control: filter by permission_level
  • Freshness: filter by date range
  • Domain-specific: filter by product, region
  • Debug: trace answer back to source
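A minimal sketch combining the multi-tenant, access-control, and freshness patterns. The organization_id and permission_level keys are hypothetical metadata fields you would attach at ingestion time; the equality filter runs in the vector store, and the date check is a simple post-filter in Python:

# Tenant- and permission-scoped search (organization_id / permission_level are
# hypothetical metadata keys attached during ingestion)
results = vectorstore.similarity_search(
    "vacation policy",
    k=20,
    filter={"organization_id": "acme-corp", "permission_level": "employee"}
)

# Freshness: keep only documents from 2023 or later via a post-filter on metadata
fresh_results = [doc for doc in results if doc.metadata.get("year", 0) >= 2023][:5]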

Citation Generation

Production RAG must cite sources. This enables verification and builds trust. Here's how to generate proper citations.

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

def generate_with_citations(query: str, chunks: list) -> dict:
    """Generate answer with inline citations."""

    # Format chunks with citation markers
    context_parts = []
    for i, chunk in enumerate(chunks):
        source = chunk.metadata.get("source", "Unknown")
        page = chunk.metadata.get("page", "N/A")
        context_parts.append(f"[{i+1}] (Source: {source}, Page: {page})\n{chunk.page_content}")

    context = "\n\n".join(context_parts)

    prompt = ChatPromptTemplate.from_messages([
        ("system", """Answer questions using ONLY the provided context.
Include inline citations like [1], [2] when referencing sources.
If the context doesn't contain the answer, say "I don't have enough information."
Always cite your sources."""),
        ("user", """Context:
{context}

Question: {question}

Answer with citations:""")
    ])

    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    chain = prompt | llm

    response = chain.invoke({
        "context": context,
        "question": query
    })

    # Extract citations used
    import re
    citations_used = set(re.findall(r'\[(\d+)\]', response.content))

    return {
        "answer": response.content,
        "sources": [
            {
                "citation": f"[{i+1}]",
                "source": chunks[i].metadata.get("source"),
                "page": chunks[i].metadata.get("page"),
                "text": chunks[i].page_content[:200] + "..."
            }
            for i in range(len(chunks))
            if str(i+1) in citations_used
        ]
    }

# Usage
result = generate_with_citations("What is the vacation policy?", retrieved_chunks)
print(result["answer"])
print("\nSources:")
for source in result["sources"]:
    print(f"  {source['citation']}: {source['source']}, page {source['page']}")
Example output:

According to the employee handbook, employees receive 20 days of paid time off (PTO) annually [1]. This PTO accrues monthly and can be carried over up to 5 days into the next year [1]. For employees with 5+ years of tenure, an additional 5 days are granted [2].

Sources:

[1]: employee_handbook.pdf, page 42

[2]: employee_handbook.pdf, page 43

Full Production Pipeline

Here's a complete production RAG pipeline combining everything we've learned.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from sentence_transformers import CrossEncoder
import numpy as np

class ProductionRAG:
    def __init__(self):
        self.embeddings = OpenAIEmbeddings()
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.reranker = CrossEncoder('BAAI/bge-reranker-large')
        self.vectorstore = None

    def ingest(self, documents: list):
        """Ingest documents with smart chunking."""
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50,
            separators=["\n\n", "\n", ". ", " ", ""]
        )
        chunks = splitter.split_documents(documents)
        self.vectorstore = FAISS.from_documents(chunks, self.embeddings)
        return len(chunks)

    def retrieve(self, query: str, k: int = 10,
                 rerank_k: int = 5, filters: dict = None) -> list:
        """Retrieve with optional filtering and reranking."""
        # Stage 1: Vector retrieval
        if filters:
            candidates = self.vectorstore.similarity_search(query, k=k, filter=filters)
        else:
            candidates = self.vectorstore.similarity_search(query, k=k)

        if len(candidates) <= rerank_k:
            return candidates

        # Stage 2: Reranking
        pairs = [[query, doc.page_content] for doc in candidates]
        scores = self.reranker.predict(pairs)
        ranked_indices = np.argsort(scores)[::-1][:rerank_k]

        return [candidates[i] for i in ranked_indices]

    def generate(self, query: str, chunks: list) -> dict:
        """Generate answer with citations."""
        context_parts = []
        for i, chunk in enumerate(chunks):
            source = chunk.metadata.get("source", "Unknown")
            page = chunk.metadata.get("page", "N/A")
            context_parts.append(
                f"[{i+1}] (Source: {source}, Page: {page})\n{chunk.page_content}"
            )

        prompt = ChatPromptTemplate.from_messages([
            ("system", """Answer using ONLY the provided context.
Include inline citations [1], [2], etc. when referencing sources.
If context is insufficient, say so. Always cite sources."""),
            ("user", "Context:\n{context}\n\nQuestion: {question}")
        ])

        chain = prompt | self.llm
        response = chain.invoke({
            "context": "\n\n".join(context_parts),
            "question": query
        })

        return {
            "answer": response.content,
            "chunks": chunks
        }

    def query(self, question: str, filters: dict = None) -> dict:
        """Full RAG pipeline."""
        chunks = self.retrieve(question, filters=filters)
        return self.generate(question, chunks)

# Usage
rag = ProductionRAG()
num_chunks = rag.ingest(documents)
print(f"Indexed {num_chunks} chunks")

result = rag.query(
    "What is the vacation policy?",
    filters={"department": "HR"}
)
print(result["answer"])

Evaluation with RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is the standard framework for evaluating RAG systems. It measures both retrieval quality and generation quality.

Retrieval Metrics

  • Context Precision: Are the retrieved chunks relevant to the question?
  • Context Recall: Do the retrieved chunks cover all the information needed to answer?

Generation Metrics

  • Faithfulness: Is the answer grounded in the retrieved context?
  • Answer Relevancy: Does the answer actually address the question?
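For intuition, faithfulness can be thought of as the fraction of claims in the answer that the retrieved context supports. RAGAS uses an LLM to extract and verify those claims; this toy function only illustrates the ratio:

def naive_faithfulness(claims_supported: int, claims_total: int) -> float:
    """Toy illustration: fraction of answer claims supported by the retrieved context."""
    return claims_supported / claims_total if claims_total else 0.0

# "Employees receive 20 days PTO annually [1]." makes one claim, and the
# retrieved handbook chunk supports it, so faithfulness is 1.0 here.
print(naive_faithfulness(claims_supported=1, claims_total=1))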

# pip install ragas
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": [
        "What is the vacation policy?",
        "How do I reset my password?"
    ],
    "answer": [
        "Employees receive 20 days PTO annually [1].",
        "You can reset via the self-service portal [1]."
    ],
    "contexts": [
        ["Section 4.2: Employees receive 20 days PTO annually."],
        ["Password reset is available in the IT self-service portal."]
    ],
    "ground_truth": [
        "20 days paid time off per year",
        "Use the self-service portal"
    ]
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy
    ]
)

print(results)
# Output:
# {'context_precision': 0.92, 'context_recall': 0.88,
#  'faithfulness': 0.95, 'answer_relevancy': 0.91}

Benchmark Targets

Metric | Acceptable | Good | Excellent
Context Precision | > 0.70 | > 0.85 | > 0.95
Context Recall | > 0.65 | > 0.80 | > 0.90
Faithfulness | > 0.75 | > 0.90 | > 0.95
Answer Relevancy | > 0.70 | > 0.85 | > 0.95
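One possible way to turn these targets into a regression gate, assuming the RAGAS scores have been collected into a plain dict like the printed output above (the threshold values here mirror the "Good" column):

GOOD_THRESHOLDS = {
    "context_precision": 0.85,
    "context_recall": 0.80,
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
}

def check_rag_quality(scores: dict) -> list:
    """Return a list of metrics that fall below their target thresholds."""
    return [
        f"{metric}: {scores.get(metric, 0.0):.2f} < {target:.2f}"
        for metric, target in GOOD_THRESHOLDS.items()
        if scores.get(metric, 0.0) < target
    ]

failures = check_rag_quality({
    "context_precision": 0.92, "context_recall": 0.88,
    "faithfulness": 0.95, "answer_relevancy": 0.91,
})
assert not failures, f"RAG quality regression: {failures}"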

Key Takeaways

  1. Chunking strategy matters most - Parent-child for long docs, semantic for topic-based retrieval, recursive for general use.

  2. Metadata enables precision - Filter by source, date, or department. Essential for multi-tenant systems and access control.

  3. Citations build trust - Always cite sources so users can verify answers; pair inline citations with source metadata.

  4. RAGAS for evaluation - Measure faithfulness, relevancy, precision, and recall. Target >0.85 for production.