Basic RAG Pipeline
Ground your LLM in real data. Build a complete retrieval-augmented generation system from scratch.
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that enhances LLMs by connecting them to external knowledge sources. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents at query time.
Think of it as giving the LLM an open-book exam instead of a closed-book test. The model can look up information rather than trying to recall everything from memory.
// Without RAG (closed-book)
User: "What's our company's vacation policy?"
LLM: "I don't have information about your specific company..."
// With RAG (open-book)
User: "What's our company's vacation policy?"
LLM: "According to section 4.2 of the employee handbook, you receive 20 days PTO..."
Why RAG Matters
The Hallucination Problem
LLMs confidently generate plausible-sounding but factually incorrect information. They can't distinguish between what they know and what they're making up.
The RAG Solution
By grounding responses in retrieved documents, RAG provides verifiable sources. The model can cite where information came from, making it auditable.
Stale Knowledge
Training data has a cutoff date. GPT-4's knowledge stops at a fixed point. It can't know about events, products, or changes that happened after training.
Private Data Access
RAG lets you query internal documents, databases, and proprietary knowledge without fine-tuning. Update your knowledge base anytime without retraining.
The 4 Stages of RAG
Every RAG pipeline follows the same fundamental pattern: Chunk, Embed, Retrieve, Generate.
Chunk
Split your documents into smaller pieces. Large documents don't fit in context windows, and smaller chunks provide more precise retrieval. Choose between fixed-size, sentence-based, or semantic chunking strategies.
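As a concrete sketch of this stage, the fixed-size chunker below splits text into overlapping word windows. This is a simplification: the function name and parameters are illustrative, words stand in for tokens, and a production pipeline would count tokens with the embedding model's own tokenizer.

```python
# Minimal fixed-size chunking with overlap (words approximate tokens).
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks
```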
Embed
Convert each chunk into a vector embedding that captures its semantic meaning. Similar content produces similar vectors, enabling semantic search rather than keyword matching.
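One way to implement this step locally is with the sentence-transformers library, sketched below; the model name is just an example, and any embedding model or hosted API plays the same role. `chunk_text` and `document` refer to the chunking sketch above.

```python
# Embed each chunk; semantically similar chunks land close together in vector space.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # example embedding model
chunks = chunk_text(document, chunk_size=512)      # document: your source text
chunk_embeddings = model.encode(chunks)            # shape: (num_chunks, embedding_dim)
```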
Retrieve
When a query arrives, embed it and find the most similar chunks using vector similarity (cosine similarity or dot product). Return the top-k most relevant chunks as context for generation.
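A brute-force version of retrieval fits in a few lines of NumPy, as sketched below; real systems usually delegate this to a vector database, but the math is the same. The `model` and `chunk_embeddings` variables come from the embedding sketch above.

```python
import numpy as np

def retrieve(query: str, chunks: list[str], chunk_embeddings: np.ndarray, k: int = 5):
    """Return the top-k chunks by cosine similarity, highest first."""
    query_emb = model.encode([query])[0]
    # Cosine similarity = dot product of L2-normalized vectors
    chunk_norms = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    query_norm = query_emb / np.linalg.norm(query_emb)
    scores = chunk_norms @ query_norm
    top_idx = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top_idx]
```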
Generate
Pass the retrieved chunks along with the user's question to the LLM. The model synthesizes an answer grounded in the provided context, ideally citing its sources.
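The sketch below shows one way to wire this up using the OpenAI Python client; the client, model name, and message layout are examples only, and any chat-completion API can fill the same role. `retrieve` is the function from the previous sketch.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str, chunks, chunk_embeddings) -> str:
    retrieved = retrieve(question, chunks, chunk_embeddings, k=5)
    context = "\n\n".join(f"[{i+1}] {chunk}" for i, (chunk, _score) in enumerate(retrieved))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system", "content": (
                "Answer using ONLY the provided context. If the context is "
                "insufficient, say so. Cite sources as [1], [2], etc.")},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion:\n{question}"},
        ],
    )
    return response.choices[0].message.content
```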
Chunking Strategies
Chunking is deceptively important. Bad chunking leads to bad retrieval, which leads to bad answers. Here are the main strategies:
| Strategy | Pros | Cons |
|---|---|---|
| Fixed Size | Simple, predictable chunk counts | May split mid-sentence |
| Sentence-based | Preserves sentence boundaries | Variable sizes, may be too granular |
| Semantic / Paragraph | Preserves topic coherence | Chunks may be too large |
| Recursive | Hierarchical structure preserved | More complex implementation |
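To make the sentence-based row concrete, here is a rough sketch that packs whole sentences into chunks under a word budget. The regex split is a stand-in for a proper sentence segmenter (nltk or spaCy, for example), and the function name and budget are illustrative.

```python
import re

def sentence_chunks(text: str, max_words: int = 200) -> list[str]:
    """Group consecutive sentences into chunks without crossing a word budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```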
Chunk Size Experiments
256 tokens: fine-grained, precise retrieval
- Pros: high precision for specific facts; more chunks means more retrieval options
- Cons: may lack surrounding context; more embeddings to store
512 tokens: balanced approach (a common default)
- Pros: good balance of precision and context; works well for most use cases
- Cons: may still split related content
1024 tokens: more context per chunk
- Pros: rich context in each chunk; fewer total chunks
- Cons: less precise retrieval; may dilute the relevance signal
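A quick way to feel out these trade-offs is to chunk the same document at several sizes and compare the results, for example with the `chunk_text` sketch from earlier (sizes here use its word-based approximation of tokens):

```python
# Sweep chunk sizes and compare how the document gets divided.
for size in (256, 512, 1024):
    chunks = chunk_text(document, chunk_size=size)
    print(f"{size}-token chunks: {len(chunks)} total, "
          f"avg {sum(len(c.split()) for c in chunks) / len(chunks):.0f} words each")
```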
Retrieval: Top-K and Thresholds
Two key parameters control what gets retrieved:
Top-K Selection
Return the K most similar chunks regardless of their absolute similarity score.
k=3: Focused, minimal context
k=5: Balanced (common default)
k=10: Broad context, more noise
Similarity Threshold
Only return chunks above a minimum similarity score. Prevents irrelevant results.
0.7: Strict - only highly relevant
0.5: Moderate threshold
0.3: Loose - may include tangential
Pro tip: Combine both approaches. Use a similarity threshold to filter out irrelevant chunks, then take the top-k from what remains. This prevents the system from hallucinating when there's no relevant context.
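A sketch of that combination, building on the `retrieve` function from the pipeline sketches above (which returns (chunk, score) pairs sorted by similarity); the threshold value is only an example:

```python
def retrieve_filtered(query: str, chunks, chunk_embeddings, k: int = 5, threshold: float = 0.5):
    """Score everything, drop chunks below the threshold, then keep the top-k."""
    scored = retrieve(query, chunks, chunk_embeddings, k=len(chunks))
    kept = [(c, s) for c, s in scored if s >= threshold]
    # An empty result is a useful signal: the prompt layer should answer
    # "I don't have enough information" instead of forcing an answer.
    return kept[:k]
```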
Prompt Engineering for RAG
The prompt template you use significantly impacts answer quality. Here are key patterns:
Example RAG Prompt
System:
You are a helpful assistant. Answer questions using ONLY the provided context. If the context doesn't contain the answer, say "I don't have enough information to answer that question." Cite your sources using [1], [2], etc.
Context:
[1] {chunk_1_text}
[2] {chunk_2_text}
[3] {chunk_3_text}
Question:
{user_question}
1. Constrain to context
Explicitly tell the model to use ONLY the provided context. This reduces hallucination.
2. Handle missing information
Give the model a way out. If context is insufficient, it should admit it rather than guess.
3. Require citations
Ask for source attribution. This makes answers verifiable and builds user trust.
4. Context placement
Place context before the question. Models pay more attention to content near the end of the prompt.
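Putting the four patterns together, a prompt-assembly helper might look like the sketch below. It mirrors the template above; the function name and message format are illustrative rather than any particular framework's API.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer questions using ONLY the provided context. "
    "If the context doesn't contain the answer, say \"I don't have enough information "
    "to answer that question.\" Cite your sources using [1], [2], etc."
)

def build_rag_prompt(question: str, retrieved: list[tuple[str, float]]) -> list[dict]:
    """Build chat messages: constrained system prompt, then context before the question."""
    context = "\n".join(f"[{i+1}] {chunk}" for i, (chunk, _score) in enumerate(retrieved))
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion:\n{question}"},
    ]
```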
Key Takeaways
1. RAG = Chunk + Embed + Retrieve + Generate. This four-stage pipeline grounds LLM outputs in real data.
2. Chunking strategy matters. Bad chunking leads to bad retrieval; experiment with size and approach.
3. Tune top-k and thresholds. Too few chunks means missing context; too many means noise and distraction.
4. Prompt engineering is critical. Constrain answers to the context, handle unknowns gracefully, and require citations.