Level 3: Production (~25 min)

Hallucination Detection

LLMs confidently lie. Learn to detect and prevent hallucinations in production systems.

What Causes Hallucinations

LLMs are trained to predict the next token, not to be truthful. They hallucinate for several fundamental reasons that cannot be fully eliminated through prompting alone.

Training Data Issues

  • Outdated information (knowledge cutoff)
  • Incorrect facts in training data
  • Underrepresented topics (long tail)
  • Conflicting information from multiple sources

Model Architecture Issues

  • Poor confidence calibration
  • Tendency to produce fluent but false text
  • Pattern matching over factual recall
  • Context window limitations

Hallucination Examples

Intrinsic hallucination (contradicts context)

Context: "The company was founded in 2015."

LLM: "The company, established in 2012, has grown significantly."

Extrinsic hallucination (adds unsupported facts)

Context: "The CEO announced quarterly earnings."

LLM: "The CEO, John Smith, announced record quarterly earnings of $5B."

Detection Method 1: NLI-Based Verification

Natural Language Inference (NLI) models classify whether a hypothesis is entailed, neutral, or contradicted by a premise. We use this to verify if LLM claims are supported by the source context.

NLI Classification

  • ENTAILMENT: the claim is supported by the context (grounded)
  • NEUTRAL: the context neither confirms nor denies the claim (uncertain)
  • CONTRADICTION: the claim conflicts with the context (hallucination)

# pip install transformers torch
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the NLI model once (BART fine-tuned on MNLI).
# Label order for facebook/bart-large-mnli: 0 = contradiction, 1 = neutral, 2 = entailment.
MODEL_NAME = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def check_grounded(claim: str, context: str) -> dict:
    """
    Check if a claim is supported by the context.
    Returns the predicted label and per-label scores.
    """
    # NLI: the context is the premise, the claim is the hypothesis.
    inputs = tokenizer(context, claim, return_tensors="pt", truncation=True)

    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=1)[0]

    labels = ["contradiction", "neutral", "entailment"]
    label_probs = {label: prob.item() for label, prob in zip(labels, probs)}

    return {
        "is_grounded": label_probs["entailment"] > 0.5,
        "label": max(label_probs, key=label_probs.get),
        "scores": label_probs
    }

# Usage
context = "The company reported revenue of $10 million in Q3 2024."
claim = "The company made $10 million in the third quarter."

result = check_grounded(claim, context)
print(f"Grounded: {result['is_grounded']}")
print(f"Label: {result['label']}")
print(f"Entailment score: {result['scores']['entailment']:.3f}")
Example outputs:
# Grounded claim
Context: "The company reported revenue of $10 million in Q3 2024."
Claim: "The company made $10 million in Q3."
>>> Grounded: True, Entailment: 0.89

# Hallucinated claim
Context: "The company reported revenue of $10 million in Q3 2024."
Claim: "The company achieved record profits this quarter."
>>> Grounded: False, Entailment: 0.12 (Neutral: 0.71)

Detection Method 2: Self-Consistency

Generate multiple answers to the same question. If the model gives inconsistent answers, it's likely hallucinating. Consistent answers across samples are more reliable.

from openai import OpenAI
from collections import Counter
import re

client = OpenAI()

def self_consistency_check(question: str, context: str,
                           n_samples: int = 5,
                           temperature: float = 0.7) -> dict:
    """
    Generate multiple answers and check for consistency.
    Returns the majority answer and a confidence score.
    """
    answers = []

    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer based ONLY on the context provided. Be concise."},
                {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}
            ],
            temperature=temperature
        )
        answers.append(response.choices[0].message.content.strip())

    # Normalize answers for comparison (lowercase, strip punctuation)
    def normalize(text):
        return re.sub(r'[^a-z0-9\s]', '', text.lower()).strip()

    normalized = [normalize(a) for a in answers]

    # Count occurrences
    counter = Counter(normalized)
    most_common, count = counter.most_common(1)[0]

    # Find original answer that matches normalized version
    majority_answer = next(a for a, n in zip(answers, normalized) if n == most_common)

    consistency_score = count / n_samples

    return {
        "answer": majority_answer,
        "consistency_score": consistency_score,
        "is_consistent": consistency_score >= 0.6,
        "all_answers": answers,
        "sample_count": n_samples
    }

# Usage
context = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976."
question = "When was Apple founded?"

result = self_consistency_check(question, context, n_samples=5)
print(f"Answer: {result['answer']}")
print(f"Consistency: {result['consistency_score']:.0%}")
print(f"Reliable: {result['is_consistent']}")

Self-Consistency Insight

Factual questions should yield consistent answers (100% consistency). Subjective questions naturally have variance. Low consistency on factual questions signals potential hallucination or insufficient context.
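
Exact string matching on normalized answers works for short factual responses but breaks down for longer free-form ones. One variant, sketched below, scores consistency as the average pairwise entailment probability between sampled answers; it reuses the NLI model and tokenizer loaded in Method 1, and pairwise_agreement is an illustrative helper, not part of any library.

def pairwise_agreement(answers: list[str]) -> float:
    """Average pairwise entailment probability between sampled answers.

    Low agreement across samples suggests the model is improvising
    rather than recalling a grounded fact.
    """
    scores = []
    for i, premise in enumerate(answers):
        for j, hypothesis in enumerate(answers):
            if i == j:
                continue
            inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
            with torch.no_grad():
                probs = torch.softmax(model(**inputs).logits, dim=1)[0]
            scores.append(probs[2].item())  # index 2 = entailment for bart-large-mnli
    return sum(scores) / len(scores) if scores else 1.0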

Detection Method 3: Retrieval Verification

Extract claims from the LLM response, then verify that each claim is supported by the retrieved context. This is how the RAGAS faithfulness metric works.

from openai import OpenAI
import json

client = OpenAI()

def extract_claims(answer: str) -> list:
    """Extract atomic claims from an answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Extract all factual claims from the text.
Return a JSON array of strings, each being one atomic claim.
Only include verifiable factual statements, not opinions."""},
            {"role": "user", "content": answer}
        ],
        response_format={"type": "json_object"}
    )

    result = json.loads(response.choices[0].message.content)
    return result.get("claims", [])

def verify_claims(claims: list, context: str) -> dict:
    """Verify each claim against the context using NLI."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # In production, load the model and tokenizer once at startup, not per call.
    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
    model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

    verified = []
    for claim in claims:
        inputs = tokenizer(context, claim, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.softmax(outputs.logits, dim=1)[0]

        entailment_score = probs[2].item()  # Index 2 is entailment
        verified.append({
            "claim": claim,
            "supported": entailment_score > 0.5,
            "score": entailment_score
        })

    supported_count = sum(1 for v in verified if v["supported"])
    faithfulness = supported_count / len(claims) if claims else 1.0

    return {
        "faithfulness_score": faithfulness,
        "verified_claims": verified,
        "supported_count": supported_count,
        "total_claims": len(claims)
    }

def full_hallucination_check(answer: str, context: str) -> dict:
    """Complete hallucination detection pipeline."""
    claims = extract_claims(answer)
    verification = verify_claims(claims, context)

    return {
        "is_faithful": verification["faithfulness_score"] >= 0.8,
        "faithfulness_score": verification["faithfulness_score"],
        "claims": verification["verified_claims"],
        "summary": f"{verification['supported_count']}/{verification['total_claims']} claims verified"
    }

# Usage
context = """
Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.
The company is headquartered in Cupertino, California. As of 2023, Apple is
the world's largest technology company by revenue, with $383 billion in 2023.
"""

answer = """
Apple was founded in 1976 by Steve Jobs and Steve Wozniak.
The company is based in Cupertino and generated over $380 billion in revenue in 2023.
Apple is known for the iPhone, which was first released in 2008.
"""

result = full_hallucination_check(answer, context)
print(f"Faithful: {result['is_faithful']}")
print(f"Score: {result['faithfulness_score']:.0%}")
print(f"Verified: {result['summary']}")
Example output:
Faithful: False
Score: 67%
Verified: 2/3 claims verified

Claims:
[SUPPORTED] "Apple was founded in 1976 by Steve Jobs and Steve Wozniak" (0.91)
[SUPPORTED] "The company is based in Cupertino and generated over $380B in 2023" (0.87)
[NOT SUPPORTED] "iPhone was first released in 2008" (0.12)
  ^ Context doesn't mention iPhone release date (actual: 2007)

Mitigation Strategies

Detection is only half the battle. Here's how to prevent and mitigate hallucinations.

1. Grounding with RAG

Always provide source context. Instruct the model to ONLY use provided information.

system_prompt = """Answer based ONLY on the provided context.
If the context doesn't contain the answer, say "I don't have enough information."
Never add facts that aren't in the context.
Cite sources using [1], [2] format."""
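
The system prompt above asks for [1], [2] citations, which only works if the retrieved chunks are numbered in the prompt. A minimal sketch of assembling such a context block, assuming the retrieval step has already produced a list of text chunks (build_grounded_messages is a hypothetical helper):

def build_grounded_messages(question: str, chunks: list[str]) -> list[dict]:
    # Number each retrieved chunk so the model can cite it as [1], [2], ...
    numbered_context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context:\n{numbered_context}\n\nQuestion: {question}"}
    ]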

2. Confidence Thresholds

Ask the model to express uncertainty. Filter low-confidence answers.

def generate_with_confidence(query: str, context: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Answer the question based on context.
After your answer, rate your confidence from 0-100.
Format: [Answer]\n\nConfidence: [0-100]"""},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"}
        ]
    )

    text = response.choices[0].message.content
    # Parse the self-reported confidence score; treat a missing score as 0
    match = re.search(r'Confidence:\s*(\d+)', text)
    confidence = int(match.group(1)) if match else 0

    if confidence < 70:
        return {"answer": None, "reason": "Low confidence", "confidence": confidence}

    return {"answer": text.split("\n\nConfidence")[0].strip(), "confidence": confidence}

3. Mandatory Citations

Force the model to cite sources for every claim. Uncited claims are suspect.

system_prompt = """Every factual statement MUST have a citation.
Use format: [statement] [1] where [1] refers to a source.
If you cannot cite a source for a claim, do not make that claim.
Unsupported claims are forbidden."""
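
Citation prompts only help if you also check the output. A simple heuristic sketch, regex-based and therefore prone to missing edge cases, that flags sentences lacking a [n] citation marker:

import re

def find_uncited_sentences(answer: str) -> list[str]:
    """Return sentences that carry no [n] citation marker."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', answer) if s.strip()]
    return [s for s in sentences if not re.search(r'\[\d+\]', s)]

# Any sentence returned here is a candidate for removal or re-verification.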

4. Post-Generation Verification

Run hallucination detection on every response before returning to user.

def safe_generate(query: str, context: str) -> dict:
    # Generate an answer (generate_answer stands in for your application's
    # grounded generation function)
    answer = generate_answer(query, context)

    # Verify
    verification = full_hallucination_check(answer, context)

    if not verification["is_faithful"]:
        # Option 1: Regenerate with stricter prompt
        # Option 2: Filter out unverified claims
        # Option 3: Add warning to user
        return {
            "answer": answer,
            "warning": "Some claims could not be verified",
            "unverified_claims": [c["claim"] for c in verification["claims"]
                                  if not c["supported"]]
        }

    return {"answer": answer, "verified": True}

Evaluation Metrics

Measure hallucination rates systematically to track improvement over time.

Faithfulness

% of claims in the answer that are supported by context.

Faithfulness = Supported Claims / Total Claims

Target: > 0.90

Factuality

% of claims that are objectively true (vs ground truth).

Factuality = True Claims / Total Claims

Target: > 0.95

Hallucination Rate

% of responses containing at least one hallucination.

HalRate = Responses with Halluc / Total Responses

Target: < 0.10

Abstention Rate

% of times model correctly says "I don't know" when it should.

Abstention = Correct IDK / Should IDK

Target: > 0.80
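
To compute these metrics over an evaluation batch, you need per-response verification results plus abstention labels. Below is a sketch under those assumptions; the result schema (verified_claims, abstained, should_abstain fields) is illustrative, and factuality is omitted because it requires ground-truth labels for every claim.

def aggregate_metrics(results: list[dict]) -> dict:
    """Roll per-response verification results up into dashboard metrics."""
    total_claims = sum(len(r["verified_claims"]) for r in results)
    supported = sum(
        sum(1 for c in r["verified_claims"] if c["supported"]) for r in results
    )
    with_hallucination = sum(
        1 for r in results if any(not c["supported"] for c in r["verified_claims"])
    )
    should_abstain = [r for r in results if r["should_abstain"]]
    correct_abstentions = sum(1 for r in should_abstain if r["abstained"])

    return {
        "faithfulness": supported / total_claims if total_claims else 1.0,
        "hallucination_rate": with_hallucination / len(results) if results else 0.0,
        "abstention_rate": (
            correct_abstentions / len(should_abstain) if should_abstain else None
        ),
    }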

Production Monitoring Dashboard

Metric               This Week   Last Week   Target    Status
Faithfulness         0.92        0.89        > 0.90    Pass
Hallucination Rate   8.2%        11.5%       < 10%     Pass
Abstention Rate      0.72        0.68        > 0.80    Improve
Avg Confidence       0.84        0.81        > 0.75    Pass

Key Takeaways

  1. Hallucinations are fundamental - LLMs are trained to be fluent, not truthful. Detection is essential for production.

  2. NLI is the workhorse - Use NLI models to verify if claims are entailed by source context.

  3. Multi-method detection - Combine NLI, self-consistency, and claim extraction for robust detection.

  4. Mitigate with grounding and citations - Force models to cite sources, verify post-generation, track metrics.