Codesota · Guides · Prompting techniques
Original research · 100 samples · 9 frameworks
Published December 2025

Guide · Prompting

The prompting framework tarpit.

We benchmarked RTF, TAG, RACE, and five other popular “prompting frameworks.” None improved accuracy. Some made it worse — by as much as nineteen points.

LinkedIn is flooded with prompting frameworks promising 340% ROI. We ran the numbers. The claims are fabricated. Here is why smart people keep falling for them — and what actually works.


§ 01 · Premise

Style is not substance.

Why the acronyms feel useful, and what they are actually changing.

Prompting frameworks affect style — how the model speaks. They do not affect substance — what it knows or how it reasons. Telling the model to act like Einstein gives you Einstein's vocabulary. It does not give you his reasoning.

What frameworks change: vocabulary and formality, response structure, tone and personality, the confidence level of language. What they do not change: actual accuracy on tasks, reasoning ability, knowledge depth, logical correctness.

Why smart people fall for it

Frameworks help you organise your thoughts, but LLMs do not think like humans. They parse intent from tokens, not templates. The structure is scaffolding for your brain, not theirs. Three biases do the rest: confirmation bias (you remember the time RTF “worked”), the authority heuristic (50K-follower prompt engineers seem credible), and complexity bias (a seven-step framework feels more real than “describe what you want”).

§ 02 · Benchmark

Nine frameworks, one hundred samples.

Methodology: Llama 3.3-70b via Groq, four task types, pass@1.

Nine prompting approaches on 100 samples across email classification, sentiment analysis, data extraction, and Q&A. Code and data at github.com/codesota/benchmarks/prompting.

Framework	Accuracy	Δ vs. baseline (points)	Avg tokens	Token waste
Baseline	97%	—	93	—
APE	97%	0	108	+16%
RACE	97%	0	123	+32%
TRACE	97%	0	122	+31%
COAST	95%	-2	121	+30%
ROSES	95%	-2	118	+27%
RTF	94%	-3	119	+28%
STAR	80%	-17	132	+42%
TAG	78%	-19	132	+42%

Key finding

Baseline ties or beats every framework. STAR and TAG hurt performance by 17 and 19 points respectively — their rigid structure confused the model on data-extraction tasks.
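
The two derived columns can be rechecked from the accuracy and average-token figures in the table. A minimal sketch, assuming "token waste" means the percentage increase in average tokens over the baseline's 93; the names `RESULTS` and `compare_to_baseline` are illustrative and not taken from the benchmark repository:

```python
# Figures copied from the benchmark table: (accuracy %, average tokens).
RESULTS = {
    "Baseline": (97, 93),
    "APE": (97, 108),
    "RACE": (97, 123),
    "TRACE": (97, 122),
    "COAST": (95, 121),
    "ROSES": (95, 118),
    "RTF": (94, 119),
    "STAR": (80, 132),
    "TAG": (78, 132),
}

def compare_to_baseline(results, baseline="Baseline"):
    """Return {framework: (accuracy delta in points, token overhead in %)}."""
    base_acc, base_tokens = results[baseline]
    return {
        name: (acc - base_acc, round((tokens - base_tokens) / base_tokens * 100))
        for name, (acc, tokens) in results.items()
        if name != baseline
    }

for name, (delta, overhead) in compare_to_baseline(RESULTS).items():
    print(f"{name:<6} {delta:+d} points  {overhead:+d}% tokens")
```

On these figures every framework spends between 16% and 42% more tokens than the baseline without gaining a single point of accuracy.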

§ 03 · Literature

What peer-reviewed papers say.

Four citations. None support the frameworks.

“When ‘A Helpful Assistant’ Is Not Really Helpful” · arXiv 2024: Tested personas in system prompts across multiple LLMs. Personas do not improve performance. “You are an expert…” is essentially a no-op for accuracy.
“Persona is a Double-edged Sword” · ACL 2024: Role prompting hurts reasoning in 13–14% of cases, helps in 15–16%. Net effect is nearly random. Random persona choice works as well as careful selection.
“The Decreasing Value of Chain of Thought” · Wharton 2025: CoT benefits are shrinking with newer models. For reasoning models, CoT provides only 2.9–3.1% improvement.
RTF, TAG, RACE, COAST, STAR, APE, TRACE, ROSES · LinkedIn 2023–25: Zero peer-reviewed papers. Zero reproducible experiments. “340% ROI” has no citations. These are marketing constructs.

§ 04 · Evidence

Techniques that have empirical support.

They change HOW the model reasons, not how it formats output.

Technique	Accuracy	Tokens	Latency	Cost
Zero-Shot	60%	1×	1×	1×
Few-Shot (3 examples)	72%	2.5×	1.2×	2.5×
Chain-of-Thought	78%	3.5×	1.8×	3.5×
Self-Consistency (n=5)	84%	5×	5×	5×
Few-Shot + CoT	82%	4.5×	2×	4.5×

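Self-Consistency achieves the highest accuracy but at five times the cost and latency. For most applications, Few-Shot + CoT is the better trade: it lands within two points of Self-Consistency (82% versus 84%) at lower cost (4.5× versus 5×) and less than half the latency (2× versus 5×).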

§ 05 · Model variance

Chain-of-Thought gains, by model.

Newer, more capable models benefit less from explicit reasoning prompts.

Model	Zero-shot	With CoT	Gain (relative)
Gemini Flash 2.0	71.2%	80.8%	+13.5%
Claude 3.5 Sonnet	74.1%	82.8%	+11.7%
GPT-4o	76.4%	82.7%	+8.2%
GPT-4o-mini	68.9%	71.9%	+4.4%
Claude 3 Haiku	62.3%	66.1%	+6.1%
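
The gain figures in the last column match the relative change from the zero-shot score rather than the difference in percentage points. A short sketch that recomputes both from the table; the function names are illustrative:

```python
# Zero-shot and CoT accuracy (%) copied from the table above.
SCORES = {
    "Gemini Flash 2.0": (71.2, 80.8),
    "Claude 3.5 Sonnet": (74.1, 82.8),
    "GPT-4o": (76.4, 82.7),
    "GPT-4o-mini": (68.9, 71.9),
    "Claude 3 Haiku": (62.3, 66.1),
}

def absolute_gain(zero_shot: float, with_cot: float) -> float:
    """Difference in percentage points."""
    return round(with_cot - zero_shot, 1)

def relative_gain(zero_shot: float, with_cot: float) -> float:
    """Improvement expressed as a percentage of the zero-shot score."""
    return round((with_cot - zero_shot) / zero_shot * 100, 1)

for model, (zs, cot) in SCORES.items():
    print(f"{model:<18} {absolute_gain(zs, cot):+.1f} points  "
          f"{relative_gain(zs, cot):+.1f}% relative")
```

The distinction matters when comparing with the other tables in this guide, which mostly quote differences in accuracy points.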

Warning

CoT can hurt performance on easy questions. The model may overthink and introduce errors where a direct answer would be correct. Profile your task distribution before defaulting to CoT.

§ 06 · Implementations

Each technique, in code.

Four patterns, copy-paste, with the edges they work on and the edges where they fail.

Zero-shot

Direct instruction, no examples. Baseline for comparison. Best token efficiency when it works. Strong on simple classification and extraction; weak on multi-step reasoning.

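# Zero-shot: Direct instruction
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Classify this review as positive, negative, or neutral: 'The product arrived late but works great.'"
    }]
)
print(response.choices[0].message.content)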

Few-shot

Provide two to five examples to establish the pattern. About +12% over zero-shot on average. Diminishing returns beyond five. Best for format-sensitive output and domain jargon; costs context window.

# Few-shot: Provide examples to establish pattern
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": """Classify reviews as positive, negative, or neutral.

Review: "Absolutely love it, best purchase ever!"
Classification: positive

Review: "Broken on arrival, total waste of money."
Classification: negative

Review: "It's okay, nothing special but works fine."
Classification: neutral

Review: "The product arrived late but works great."
Classification:"""
    }]
)
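
When the examples live in data rather than in a hand-written string, the same prompt can be assembled programmatically. A small sketch; `build_few_shot_prompt` is a hypothetical helper, not part of any SDK, and it only builds the text that would go into the `content` field:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt from (review, label) pairs plus the new review."""
    blocks = [instruction, ""]
    for review, label in examples:
        blocks.append(f'Review: "{review}"')
        blocks.append(f"Classification: {label}")
        blocks.append("")
    # Leave the final label blank so the model completes it.
    blocks.append(f'Review: "{query}"')
    blocks.append("Classification:")
    return "\n".join(blocks)

EXAMPLES = [
    ("Absolutely love it, best purchase ever!", "positive"),
    ("Broken on arrival, total waste of money.", "negative"),
    ("It's okay, nothing special but works fine.", "neutral"),
]

prompt = build_few_shot_prompt(
    "Classify reviews as positive, negative, or neutral.",
    EXAMPLES,
    "The product arrived late but works great.",
)
print(prompt)
```

Keeping examples in a list makes it easy to swap them per task or to measure how accuracy changes as examples are added or removed.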

Chain-of-Thought

Request step-by-step reasoning. Most effective on math and logic. Adds 20–80% latency; can yield 15–20% accuracy gains on complex tasks. Weak on factual recall.

# Chain-of-Thought: Request step-by-step reasoning
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": """Solve this step by step:
A store has 45 apples. They sell 1/3 of them in the morning
and 1/2 of the remaining in the afternoon.
How many apples are left?

Let's think through this step by step:"""
    }]
)

Self-consistency

Run the same prompt multiple times with temperature > 0, then vote on the most common answer. High cost but highest accuracy for critical decisions.

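import asyncio
from collections import Counter

from openai import AsyncOpenAI

# The async client is required: the synchronous client's calls cannot be awaited.
async_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

def extract_answer(text: str) -> str:
    """Placeholder parser: treat the last non-empty line as the final answer."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1].lower() if lines else ""

async def self_consistency(prompt: str, n: int = 5) -> str:
    """Run prompt n times, return majority answer."""
    responses = await asyncio.gather(*[
        async_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7  # Need variation
        )
        for _ in range(n)
    ])

    # Extract final answers and vote
    answers = [extract_answer(r.choices[0].message.content or "")
               for r in responses]
    most_common = Counter(answers).most_common(1)[0][0]
    return most_common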
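
The voting step can be tested on its own, without any API calls. A minimal sketch that also reports how many samples agreed with the winner, which is an addition of this example rather than part of the snippet above; the sampled answers are hypothetical:

```python
from collections import Counter

def majority_vote(answers):
    """Return (winning answer, share of samples that agreed with it)."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

# Hypothetical answers extracted from five sampled completions.
sampled = ["15", "15", "10", "15", "15"]
answer, agreement = majority_vote(sampled)
print(answer, agreement)
```

A low agreement share is a cheap signal that the question is hard for the model and the majority answer deserves less trust. On ties, `Counter.most_common` returns the answer that was seen first.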

§ 07 · Efficiency

Measuring the token-budget tradeoff.

TALE (Dec 2024) showed that budget-aware prompting reduces token usage by 68.9% with less than 5% accuracy loss.

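# TALE-inspired: Budget-aware prompting
# Placeholders: replace with worked examples from your own task.
FEW_SHOT_EXAMPLES = "<two input/output examples>"
FEW_SHOT_COT_EXAMPLES = "<examples that include step-by-step reasoning>"
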
def budget_aware_prompt(task: str, complexity: str) -> dict:
    """Adjust prompting strategy based on task complexity."""

    if complexity == "simple":
        # Zero-shot, minimal tokens
        return {
            "prompt": task,
            "max_tokens": 50,
            "strategy": "zero-shot"
        }

    elif complexity == "medium":
        # Few-shot with 2 examples
        return {
            "prompt": f"{FEW_SHOT_EXAMPLES}\n\n{task}",
            "max_tokens": 200,
            "strategy": "few-shot"
        }

    else:  # complex
        # Full CoT with examples
        return {
            "prompt": f"{FEW_SHOT_COT_EXAMPLES}\n\n{task}\nLet's solve step by step:",
            "max_tokens": 500,
            "strategy": "few-shot-cot"
        }

Efficiency calculation

def calculate_prompting_efficiency(
    accuracy: float,
    tokens_used: int,
    latency_ms: int,
    cost_per_1k: float = 0.01
) -> dict:
    """Calculate efficiency metrics for a prompting strategy."""

    # Token efficiency: accuracy per 1000 tokens
    token_efficiency = (accuracy / tokens_used) * 1000

    # Cost efficiency: accuracy points per dollar spent
    cost = (tokens_used / 1000) * cost_per_1k
    cost_efficiency = accuracy / cost if cost > 0 else 0

    # Time-accuracy tradeoff
    accuracy_per_second = accuracy / (latency_ms / 1000)

    return {
        "accuracy": accuracy,
        "tokens_used": tokens_used,
        "latency_ms": latency_ms,
        "cost_usd": round(cost, 4),
        "token_efficiency": round(token_efficiency, 2),
        "cost_efficiency": round(cost_efficiency, 2),
        "accuracy_per_second": round(accuracy_per_second, 2)
    }

# Example comparison
zero_shot = calculate_prompting_efficiency(60, 150, 800)
few_shot = calculate_prompting_efficiency(72, 450, 1100)
cot = calculate_prompting_efficiency(78, 650, 1500)

print(f"Zero-shot efficiency: {zero_shot['token_efficiency']}")  # 400.0
print(f"Few-shot efficiency: {few_shot['token_efficiency']}")    # 160.0
print(f"CoT efficiency: {cot['token_efficiency']}")              # 120.0

Interpretation

Zero-shot scores 400 accuracy points per 1,000 tokens; CoT drops to 120. Zero-shot extracts 3.3× more accuracy per token. Use CoT only when the accuracy gain justifies the cost.
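
To judge whether the accuracy gain justifies the cost, it helps to turn token counts into a monthly bill. A rough sketch using the token counts and accuracy figures from the comparison above; the volume of 10,000 requests per day and the 30-day month are assumptions, and `monthly_cost` is an illustrative helper:

```python
def monthly_cost(requests_per_day, tokens_per_request, cost_per_1k=0.01, days=30):
    """Estimated monthly spend in USD for one prompting strategy."""
    total_tokens = requests_per_day * days * tokens_per_request
    return round(total_tokens / 1000 * cost_per_1k, 2)

# Token counts from the example comparison; the request volume is assumed.
zero_shot_cost = monthly_cost(10_000, 150)
cot_cost = monthly_cost(10_000, 650)
extra_cost = round(cot_cost - zero_shot_cost, 2)

# Accuracy figures from the same comparison: 60 (zero-shot) versus 78 (CoT).
extra_points = 78 - 60
print(f"zero-shot ${zero_shot_cost}, CoT ${cot_cost}, difference ${extra_cost}")
print(f"cost per extra accuracy point: ${round(extra_cost / extra_points, 2)}")
```

Whether that price per accuracy point is acceptable depends on what an error costs in your application.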

§ 08 · Decision

Choose your technique by task and constraint.

Task Type               | Recommended Technique | Expected Gain | Token Cost
------------------------|----------------------|---------------|------------
Simple classification   | Zero-shot            | Baseline      | 1x
Format-sensitive output | Few-shot (2-3 ex)    | +12-15%       | 2-3x
Math/logic problems     | Chain-of-Thought     | +15-20%       | 3-4x
High-stakes decisions   | Self-Consistency     | +20-25%       | 5x
Complex domain tasks    | Few-shot + CoT       | +20-25%       | 4-5x
Latency-critical        | Zero-shot or cached  | Baseline      | 1x

When to use zero-shot

Latency under 1 second required
High volume (>10K requests/day)
Task accuracy already >80%
Simple classification or extraction

When to use CoT

Math or logic problems
Multi-step reasoning required
Accuracy is paramount
Tasks where errors are costly

When to use few-shot

Output format must match exactly
Domain-specific terminology
Edge cases in training data
Style consistency matters

When to use self-consistency

Error cost > 100× prompt cost
Medical / legal / financial decisions
Low volume, high stakes
When you can wait for results
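
The rules above can be collapsed into a small chooser. This is a rule-of-thumb sketch, not a definitive policy: the thresholds (one second, 10K requests per day) come from the lists, while the order in which the rules are checked and the function signature are assumptions.

```python
def choose_technique(
    needs_multistep_reasoning: bool,
    format_sensitive: bool,
    high_stakes: bool,
    latency_budget_s: float,
    requests_per_day: int,
) -> str:
    """Pick a prompting technique from coarse task properties."""
    # Hard constraints first: tight latency or high volume favour zero-shot.
    if latency_budget_s < 1 or requests_per_day > 10_000:
        return "zero-shot"
    # Costly errors at low volume justify sampling several answers.
    if high_stakes:
        return "self-consistency"
    if needs_multistep_reasoning and format_sensitive:
        return "few-shot + cot"
    if needs_multistep_reasoning:
        return "chain-of-thought"
    if format_sensitive:
        return "few-shot"
    return "zero-shot"

print(choose_technique(True, False, False, latency_budget_s=5, requests_per_day=200))
```

Treat the result as a starting point for measurement, not as a substitute for it.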

§ 09 · Mistakes

Common traps.

Using CoT for everything: CoT adds latency and cost. On simple factual queries it can decrease accuracy by overthinking. Profile the task distribution first.
Too many few-shot examples: Beyond five examples, returns diminish rapidly. Three diverse examples typically suffice.
Not measuring baseline first: Always establish zero-shot accuracy before adding complexity. If zero-shot hits 90%, no technique can add more than 10 points.
Ignoring model-specific behavior: Prompts that work for GPT-4 may fail for Claude or Gemini. Test on your target model.

§ 10 · Method

Your measurement checklist.

01 · Establish baseline: Run zero-shot on 100+ samples. Record accuracy, latency, tokens.
02 · Test techniques independently: Few-shot alone, CoT alone, then combinations. Measure each.
03 · Calculate efficiency: Token efficiency = (accuracy / tokens) × 1000. Compare ratios.
04 · Apply cost constraints: At your volume, what is the monthly cost difference? Is accuracy worth it?
05 · Monitor in production: Track metrics over time. Model updates can change optimal strategy.
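
The first two steps of the checklist amount to running each strategy through the same scoring loop. A minimal harness sketch: `evaluate` and the `predict(text) -> (label, tokens_used)` contract are assumptions of this example, and `keyword_baseline` is a stand-in for a real model call so the code runs offline.

```python
import time

def evaluate(predict, samples):
    """Score a predict(text) -> (label, tokens_used) callable on labelled samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = 0
    total_tokens = 0
    start = time.perf_counter()
    for text, expected in samples:
        label, tokens = predict(text)
        correct += int(label == expected)
        total_tokens += tokens
    elapsed = time.perf_counter() - start
    n = len(samples)
    return {
        "accuracy": 100 * correct / n,
        "avg_tokens": total_tokens / n,
        "avg_latency_ms": 1000 * elapsed / n,
    }

# Stand-in for a real model call; it counts words instead of real tokens.
def keyword_baseline(text):
    label = "positive" if "great" in text.lower() else "negative"
    return label, len(text.split())

SAMPLES = [
    ("Works great, very happy", "positive"),
    ("Broken on arrival", "negative"),
    ("Great value", "positive"),
    ("Not great, stopped working", "negative"),
]

report = evaluate(keyword_baseline, SAMPLES)
print(report)
```

Swapping `keyword_baseline` for a function that calls your zero-shot, few-shot, or CoT prompt gives directly comparable numbers for each strategy.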

§ 11 · Toolkit

The evidence-based toolkit.

Two tiers: proven prompting, and the techniques that actually change capability.

Tier 1 · Proven prompting

Clear, specific instructions: The single highest-impact factor. Describe exactly what you want, in what format, with what constraints. +20–40% over vague prompts.
Few-shot examples: Two to five examples establishing the pattern. Most effective for format matching and domain vocabulary. +10–15% accuracy, 2–3× token cost.
Chain-of-thought for complex reasoning: Ask the model to show its work. Works for math, logic, multi-step problems. Diminishing returns on newer models. +5–20%, 2–4× tokens.
Structured output (JSON mode): Use the model's native JSON mode or schema enforcement. 100% format compliance, minimal cost.
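
Native JSON mode takes care of syntax; checking the parsed values against your own expectations is still a cheap safeguard. A sketch of that check, assuming a hypothetical reply shape with a single `label` field and the three sentiment labels used earlier in this guide:

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}

def parse_classification(raw: str) -> dict:
    """Parse a model reply and check it against the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    label = data.get("label")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return {"label": label}

print(parse_classification('{"label": "positive"}'))
```

Failing loudly on an unexpected label keeps bad outputs from propagating silently into downstream systems.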

Tier 2 · Beyond prompting

These techniques add real knowledge or change model weights. Prompting cannot make a model know things it was not trained on.

RAG — Retrieval-Augmented Generation: Inject relevant documents at query time. Model can now answer questions about your private data, recent events, or specialised domains. Use when the model lacks knowledge you have in documents.
Fine-tuning: Train on your specific data. Changes actual weights. Model learns your domain’s patterns, terminology, edge cases. Use for consistent style/format at scale.
Tool use / function calling: Let the model call external APIs, databases, or code. Extends capability to real-time data and actions.
Agentic workflows: Chain multiple LLM calls with planning, reflection, and tool use. Handles complex multi-step tasks.

§ 12 · Take

What actually moves the needle.

RAG works. Fine-tuning works. But they are still somewhat overhyped as silver bullets. RAG requires careful chunking, retrieval tuning, and context management. Fine-tuning requires quality data, evaluation pipelines, and ongoing maintenance.

What actually changed my productivity: agents that can iterate. Not “agent” as a marketing term. I mean systems that learn from mistakes, ask clarifying questions, research autonomously, and iterate on solutions.

A simple prompt plus an iteration loop beats a perfectly-crafted one-shot prompt every time. The frameworks LinkedIn loves optimise for single-turn interactions. Real work is multi-turn, iterative, and requires adaptation.
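
The shape of such an iteration loop is simple: generate, check, feed the complaint back, repeat. A minimal sketch under stated assumptions: `iterate`, `fake_model`, and `must_be_integer` are hypothetical names, and the fake model stands in for a real LLM call so the loop runs offline.

```python
def iterate(generate, check, task, max_rounds=3):
    """Call generate, feed check's complaint back in, stop when check passes."""
    feedback = None
    for round_number in range(1, max_rounds + 1):
        if feedback is None:
            prompt = task
        else:
            prompt = f"{task}\n\nPrevious attempt failed: {feedback}"
        attempt = generate(prompt)
        feedback = check(attempt)  # None means the attempt is acceptable
        if feedback is None:
            return attempt, round_number
    raise RuntimeError(f"no acceptable answer after {max_rounds} rounds")

# Stand-in model: it only answers correctly after seeing feedback.
def fake_model(prompt):
    return "15" if "failed" in prompt else "fifteen apples"

# Stand-in validator: returns None on success, otherwise a complaint.
def must_be_integer(attempt):
    return None if attempt.isdigit() else "answer with digits only"

result, rounds = iterate(fake_model, must_be_integer, "How many apples are left?")
print(result, rounds)
```

The validator is where the leverage sits: a test suite, a schema check, or a second model call can all play the role of `check`.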

§ 13 · TL;DR

The six-line summary.

Skip: RTF, TAG, RACE, COAST, STAR, APE, TRACE, ROSES. Zero evidence. Wasted tokens.
Be skeptical: Role prompting (“You are an expert…”). Research shows unpredictable effects.
Use: Clear instructions, few-shot examples, structured output, CoT for complex reasoning.
Level up: RAG for knowledge, fine-tuning for consistency, tool use for actions.
Real breakthrough: Agents that iterate, learn from errors, and ask questions. Multi-turn beats perfect one-shot.
Measure: Run your own benchmarks. Your task is unique.

§ 14 · Sources

Citations and data.

Peer-reviewed research

“When ‘A Helpful Assistant’ Is Not Really Helpful” (arXiv 2024): Personas do not improve LLM performance.
“Persona is a Double-edged Sword” (ACL 2024): Role prompting: 13–14% hurt, 15–16% help.
“The Decreasing Value of Chain of Thought” (Wharton 2025): CoT gains shrinking with newer models.
“TALE: Token-Budget-Aware LLM Reasoning” (arXiv 2024): 68.9% token reduction, <5% accuracy loss.
“Chain-of-Thought Prompting Elicits Reasoning” (Wei et al., 2022): Original CoT paper from Google.

Our original research

Framework benchmark code & data: 100 samples · 9 frameworks · 4 task types
Model: Llama 3.3-70b via Groq (500 tok/s inference speed)
Tasks: Email, Sentiment, Extraction, Q&A (real-world office scenarios)

Not research — marketing

RTF, TAG, RACE, COAST, STAR, APE, TRACE, ROSES. Zero citations. Zero reproducible experiments. LinkedIn virality only.