Changelog
Building the definitive AI benchmark platform. Every update since day one.
14 New Guides: Code Gen, Speech, TTS, RAG, Agentic, Segmentation, and More
- Best AI Code Generation Models Compared — Claude Opus 4, GPT-5, Gemini, DeepSeek-V3, Qwen2.5-Coder on HumanEval, SWE-bench, LiveCodeBench
- Speech Recognition 2026 — Whisper vs Gemini vs AssemblyAI vs Deepgram with WER benchmarks and pricing
- Best Open-Source TTS Models — Kokoro, XTTS v2, Bark, Piper, Fish Speech, Dia, F5-TTS with MOS scores
- RAG vs Fine-Tuning vs Long Context — decision framework with cost analysis and code examples
- Agentic AI Benchmarks Explained — SWE-bench, RE-bench, HCAST, WebArena, GAIA, OSWorld decoded
- Image Segmentation Models — SAM 2 vs Mask2Former vs OneFormer on ADE20K and COCO
- Anomaly Detection for Manufacturing — PatchCore, EfficientAD on MVTec AD with ROI analysis
- Time Series Forecasting — ARIMA vs PatchTST vs TimesFM vs Chronos foundation models
- State of Multimodal AI — GPT-5, Claude Opus 4, Gemini 2.5 Pro on MMMU, MathVista, video
- Few-Shot Learning is Dead — thought piece on foundation models absorbing few-shot methods
- How to Read an ML Paper — 3-pass method, benchmark red flags, 20-point evaluation checklist
- Graph Neural Networks — GCN, GAT, GraphSAGE with OGB benchmarks and PyG code
- Medical AI Regulation Cheat Sheet — FDA, EU MDR, MHRA pathways for developers
- RL from Atari to Robotics — visual timeline from DQN (2013) to world models (2026)
Largest single guide expansion: 14 comprehensive guides covering every major benchmark area that previously had no editorial content. Each guide follows the proven CodeSOTA format with benchmark data, model comparisons, code examples, pricing analysis, and decision frameworks. Guides index page updated with 6 new categories (Time Series, Graphs, RL, Methodology, Agentic, plus expanded Audio/Speech and CV sections).
Benchmark Relevance Audit: 300+ New Results Across All 12 Areas
- Full relevance audit of all CV benchmarks: identified saturated (ImageNet, IIIT5K), declining (RVL-CDIP, PubLayNet), and highly relevant (ADE20K, OmniDocBench) benchmarks
- 3 new datasets added: LVIS v1.0 (1,203-category long-tail detection), DocLayNet (general-purpose layout analysis), Union14M (real-world scene text recognition)
- COCO object detection: 5 → 17 results (Co-DETR 66.0 mAP, InternImage-H, DINO, Grounding DINO, YOLOv10/11, RT-DETR)
- ADE20K semantic segmentation: 2 → 13 results (ONE-PEACE 63.0, BEiT-3, DINOv2, Mask2Former, EoMT)
- NLP area populated from zero: 53 results across SQuAD v2.0, GLUE, SuperGLUE, SNLI, CoNLL-2003, CNN/DailyMail
- Speech area populated from zero: 30 results (LibriSpeech 1.46% WER, VALL-E 2 4.61 MOS human parity)
- Multimodal area populated from zero: 37 results + 3 new datasets (MMMU, MMBench, TextVQA)
- Agentic AI area built from scratch: 6 datasets, 36 results (SWE-bench Verified, METR, HCAST, RE-Bench, WebArena, OSWorld)
- Reasoning updated with 2025-2026 frontier models: ARC-AGI v1/v2, AIME 2025, o3/o4-mini/Claude Opus 4.5/DeepSeek-R1 scores
- Computer Code: all 6 tasks now populated (was 1). Added LiveCodeBench, CrossCodeEval, Defects4J, CodeSearchNet, TransCoder, Bugs2Fix
- Removed 3 empty tasks: Polish OCR, LaTeX OCR, Key Information Extraction
- Total site results: ~1,998 → 2,300+ across 12 research areas
Comprehensive benchmark relevance audit and data refresh across all 12 research areas. Started with a systematic analysis of computer vision benchmarks to identify which are still relevant in 2026, then expanded to populate every previously empty area on the site. Key finding: ImageNet, IIIT5K, and RVL-CDIP are saturated/compromised — LVIS, DocLayNet, and Union14M are their modern replacements. The audit used 10 parallel benchmark-updater agents to research and insert verified results from arXiv papers, official leaderboards, and model cards.
AI Arena Rankings, Academy Deep-Dives, MTEB Editorial
- 7 new Arena leaderboard pages with real data from arena.ai: Text, Code, Vision, Document, Search, Text-to-Image, Text-to-Video
- Arena overview page with category cards, provider dominance table, and methodology explainer
- Pareto frontier analysis on Search arena: cost vs Elo scatter plot identifying 4 optimal models
- 3 Academy deep-dives: Matrix Operations (971 lines), How Transformers Work (1,077 lines), Embedding Dimensions (754 lines)
- MTEB benchmark page rewritten to ADE20K editorial quality (1,252 lines) with golden datasets, 8 task categories, real HuggingFace data
- Tokens & Context page enhanced with CSS Pareto frontier plots, cost optimization strategies, and updated 2025 pricing
- Embedding lesson updated with real MMTEB leaderboard data (KaLM-Gemma3-12B #1 at 72.32), current mteb API, proper citations
- FAQ schema markup on arena pages for Google rich snippets
- Real benchmark data from HuggingFace mteb.ResultCache (705 models) and arena.ai
Massive content expansion: 11,000+ lines of editorial content across 20 files. Arena pages cover all major AI competition categories with editorial insights (who leads where, price-performance analysis, open-source highlights). Academy deep-dives provide actual math walkthroughs (matrix multiplication proofs, attention formula derivation, Johnson-Lindenstrauss lemma). MTEB page now matches the quality bar set by the ADE20K benchmark editorial. All data sourced from official HuggingFace APIs and arena.ai leaderboards.
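The Pareto frontier analysis on the Search arena reduces to a simple dominance check: a model stays on the frontier unless some other model is at least as cheap and at least as strong. A minimal sketch with hypothetical model names and numbers (not the actual arena.ai data):

```python
def pareto_frontier(models):
    """Return models not dominated on (cost, Elo): a model is dominated if
    another model costs no more AND scores no less, with at least one
    strict inequality."""
    frontier = []
    for name, cost, elo in models:
        dominated = any(
            c <= cost and e >= elo and (c < cost or e > elo)
            for n, c, e in models if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical entries: (name, $ per 1M tokens, arena Elo)
models = [
    ("model-a", 15.0, 1290),
    ("model-b", 3.0, 1260),
    ("model-c", 0.4, 1190),
    ("model-d", 5.0, 1180),  # dominated by model-b: pricier AND weaker
]
print(pareto_frontier(models))  # ['model-a', 'model-b', 'model-c']
```

The same check generalizes to any cost-vs-quality scatter plot on the site.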
Full Site Redesign: Clean Design System, Discoverability Overhaul
- Homepage redesigned: 1,051 → 260 lines. No gradients, generous spacing, 1px-gap grids
- /browse page: inline stats, clean tabs, 1px-gap area grids matching homepage
- Area page template redesigned for all 17 research areas at once
- /news page: featured grid + clean list layout with mono labels
- /flywheel page: real traffic data (12,600+ visitors, 125 countries, top 10 breakdown)
- /papers-with-code: Tufte-style redesign, fixed Google snippet visibility (dark text on white)
- Gradient text-transparent removed site-wide — all headings now visible in Google featured snippets
- Footer restructured: 6 columns, 40+ links covering all previously orphaned pages
- 7 new area editorials: adversarial, graphs, industrial, knowledge-base, methodology, RL, time-series
- Cross-links: MTEB ↔ embedding lesson, challenges submission form replaces mailto
- Explainers moved from navbar to footer, dead Twitter link removed
Complete visual overhaul establishing a consistent design system across all key pages. Inspired by Linear/Vercel: white primary CTAs, border-zinc-800 dividers, font-mono accents, no colorful gradients. Fixed a critical SEO issue where gradient text (text-transparent + bg-clip-text) was invisible in Google featured snippets. Restructured the footer from 6 columns with sparse links to 6 thematic columns covering all 40+ public pages. Added real Vercel analytics data to the flywheel page as social proof.
AI Research Challenges: 42 Paid Bounties for the ML Community
- 42 research challenges ranging from Easy ($1) to Legendary ($16), totaling $280 in bounties
- Tasks span data collection, analysis, benchmarking, reproduction, and original research
- Interactive challenge browser with difficulty/category filters and expandable task cards
- Designed so AI agents help ~30-40% but human judgment, verification, and experimentation carry the rest
- Open to all AI developers and ML researchers
- Every deliverable gets published on CodeSOTA with full contributor attribution
Launched /challenges — a page of 42 Knuth-style research bounties designed to fill CodeSOTA's 188 'needs research' benchmark gaps. Tasks progress from data collection (find & verify paper results) through analysis (cross-paper contradictions, evaluation noise) to original research (new benchmarks, contamination detection, scaling law verification). The Legendary tier ($16/task) includes meta-analysis papers, full domain coverage, and the 'State of AI Evaluation 2026' annual report.
Handwriting OCR Editorial Rewrite, Robotics Page Visuals
- Handwriting OCR guide rewritten with SVG bounding box illustrations (line/word/character detection), side-by-side model comparisons
- GPT-4o confirmed as new handwriting SOTA: 1.69% CER on IAM (arXiv 2503.15195), beating all traditional HTR models
- 11 models benchmarked with updated 2026 data including DTrOCR, GOT-OCR 2.0, Qwen2.5-VL, DLoRA-TrOCR
- Robotics page enhanced with 6 visual elements: Canvas bar/radar charts, SVG timeline, pipeline diagram, task difficulty spectrum, matplotlib images
- Clerk auth error fixed on benchmark pages (useUser outside ClerkProvider)
- MMLU SOTA progress bar scaling fixed (a 92.4% bar was rendering shorter than an 89.2% bar)
- Next.js 16 config warnings resolved
Two major editorial upgrades. The handwriting OCR page went from a text-heavy listicle to a visual editorial with inline SVG bounding box illustrations, a CER comparison chart, and side-by-side model output comparisons showing errors highlighted in red. Key finding: multimodal LLMs (GPT-4o at 1.69% CER) now beat all traditional HTR models in zero-shot. The robotics page gained Canvas-based performance charts, an SVG timeline of milestones (2017-2025), a robot learning pipeline diagram, and matplotlib-generated SOTA evolution plots.
Mass Benchmark Update: 24 Benchmarks Refreshed with 2024-2026 Data
- /browse page sections now collapsible with chevron toggles, benchmarks sorted by latest update within each area
- Benchmark updater agent created to systematically refresh stale leaderboards with verified 2024-2026 results
- AudioSet: 4 new entries, SSLAM new SOTA (0.502 mAP, ICLR 2025), corrected BEATs score
- LibriSpeech: 15 entries across test-clean/test-other, Parakeet RNNT 1.1B leads at 1.46% WER
- COCO Captions: 23 entries added, PaLI-X-55B leads with 149.2 CIDEr
- VQA v2.0: 16 results, PaLI-X 55B leads at 86.1% accuracy
- MuJoCo: 9 results, TD-MPC2 317M leads with 960 normalized score (ICLR 2024)
- ESC-50: OmniVec2 (99.1% SOTA, CVPR 2024), EAT, MaskSpec + backfilled paper links
- SCUT-CTW1500: TextMamba (89.7, Dec 2024), DeepSolo, EK-Net — 19 entries total
- ICDAR 2015, M4, CIFAR-100, LJ Speech, VCTK, OGB, Cora all seeded with verified results
- More updates in progress: ImageNet-1K, COCO, MATH, SWE-bench, ADE20K, OmniDocBench, COCO-Text, Atari ALE, Common Voice
- High-traffic benchmarks prioritized using /trending page view data (30-day window)
Largest single-session benchmark data refresh in CodeSOTA history. Used the new benchmark-updater agent to parallelize research and data entry across 24 benchmarks spanning computer vision, speech, audio, reinforcement learning, graphs, multimodal, time series, and code generation. Verified highlights: AudioSet SOTA moved to 0.502 mAP (SSLAM, ICLR 2025), LibriSpeech test-clean down to 1.46% WER (Parakeet RNNT), COCO Captions at 149.2 CIDEr (PaLI-X), VQA v2.0 at 86.1% (PaLI-X), MuJoCo control at 960 (TD-MPC2). The /browse page was improved with collapsible area sections and sorting by most recent paper date. Benchmark selection was driven by /trending page view data.
Hallucination Detection Overhaul, Simplified Topic Submission
- Hallucination detection explainer rewritten with real Vectara HHEM leaderboard data (87 models, March 2026)
- 3 Canvas-based visualizations: hallucination rate rankings, historical 2024-2026 progression, reasoning vs standard models
- Key finding: reasoning models (o3-pro 23.3%, o4-mini 18.6%) hallucinate 2-3x more than standard models on summarization
- Best hallucination rate improved 4.7x in 2 years: from 8.5% (early 2024) to 1.8% (Finix S1 32B, March 2026)
- 5 production Python code examples using real libraries: DeBERTa-v3 NLI, selfcheckgpt, RAGAS v0.2+, factscore, FAISS+NLI
- Submit page simplified from 10-field paper submission to 5-field topic suggestion with email notification
- /browse page now shows all 231 benchmarks (was only showing 188 undecided), grouped by area with status badges
The hallucination detection building block page went from generic AI-generated content to a data-driven reference with real benchmark scores sourced from the Vectara HHEM leaderboard. The submit form was radically simplified — instead of asking users to submit papers with benchmark results (which nobody did), it now asks 'what did you find?' with an optional link and email for notification when we write about it. The /browse page was redesigned to show all 231 benchmarks across 12 areas instead of only the undecided ones, with active benchmarks listed first and status badges on each row.
HumanEval Editorial, Site-Wide View Tracking, Hardparse Promotion
- New /benchmark/humaneval editorial: complete history of 60+ models from 28.8% (Codex, 2021) to 99% saturation
- Site-wide view tracking: passive page view counter with /trending leaderboard grouped by area
- Hardparse promotional page at /ocr/hardparse with benchmark comparison, dual CTA (Mac App + API)
- Hardparse CTA banner added to 10 OCR pages for cross-promotion
- LLM page updated with real OpenRouter rankings: top models, provider market share, quality scores
- All LLM benchmark cards now clickable with links to task pages
- Article reactions system on editorial pages (insightful, practical, surprising, needs-update)
- /ocr/ship-it page deployed (was previously untracked)
Major push on engagement and monetization. The /trending page shows what the community is reading across all areas, powered by a new /api/views endpoint with daily IP deduplication. Hardparse gets a full product page and cross-site CTAs. The HumanEval editorial traces the benchmark from OpenAI's Codex paper through GPT-4o breaking 90% to multiple models hitting 99% — positioned as the definitive history of code generation benchmarking. LLM page now uses real scraped OpenRouter data (March 2026) instead of stale placeholders.
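The daily IP deduplication behind the view counter is one set lookup per request. A minimal sketch of the idea; the class and method names are illustrative, not the actual /api/views implementation:

```python
from datetime import date
import hashlib

class ViewCounter:
    """Count one view per (path, hashed IP, day). IPs are hashed so raw
    addresses are never stored."""
    def __init__(self):
        self.seen = set()     # (path, ip_hash, day) triples already counted
        self.counts = {}      # path -> deduplicated view count

    def record(self, path, ip, day=None):
        day = day or date.today().isoformat()
        ip_hash = hashlib.sha256(ip.encode()).hexdigest()
        key = (path, ip_hash, day)
        if key in self.seen:
            return False      # same visitor, same page, same day: skip
        self.seen.add(key)
        self.counts[path] = self.counts.get(path, 0) + 1
        return True

counter = ViewCounter()
counter.record("/benchmark/humaneval", "203.0.113.7", "2026-03-01")
counter.record("/benchmark/humaneval", "203.0.113.7", "2026-03-01")  # deduped
counter.record("/benchmark/humaneval", "203.0.113.7", "2026-03-02")  # new day counts
print(counter.counts)  # {'/benchmark/humaneval': 2}
```

A production endpoint would back the set with a store that expires keys after a day, but the counting logic is the same.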
Building Blocks: Complex Factory Architecture Diagrams with Mermaid
- New Factorio-inspired Factory View upgraded from simple cards to dense operations plots
- Mermaid architecture diagram added to /building-blocks with shared bus, ingress, processing stages, and delivery nodes
- Control lanes visualized: observability, cost guardrails, and cache reuse edges across pipelines
- Per-pipeline complexity dashboard added: assembler counts, implementation counts, modality breadth, and complexity index
- Factory section now uses local Mermaid rendering for maintainable, code-driven diagram updates
The Building Blocks page now includes a full architecture diagram system instead of lightweight flow cards. Pipelines are rendered as structured flow graphs with clear stage transitions and shared infrastructure links, then paired with a denser operations dashboard view for enterprise planning discussions. This makes the page more useful for technical stakeholders who need to reason about real system topology, control paths, and operational complexity.
LLM Page Refresh, Hardware GPU Guide Update, OCR Python Comparison
- LLM benchmarks page updated for 2026: new SOTA scores across 16 benchmarks, updated model families (Claude 4.6, GPT-5, Llama 4, Gemini 2.5)
- Hardware GPU guide rewritten with 2026 data: RTX 5090/5080 specs, H200/B200 datacenter GPUs, buy vs rent analysis, Apple Silicon section
- Best OCR Libraries for Python updated: 6 libraries compared with live benchmark testing, 3 new comparison charts
- HumanEval page updated for 2026: leaderboard reflects current SOTA (96.3%), saturation clearly noted
- Price-performance chart regenerated with all consumer GPUs properly ordered
Major content refresh across high-traffic pages. The LLM page removes 'Coming Soon' sections and now shows live SOTA data including SWE-bench jumping from 49% to 80.9%. Hardware page expanded from 516 to 1,114 lines with RTX 5090 benchmarks, cloud GPU pricing from 6 providers, VRAM requirement tables, and buy vs rent break-even analysis. OCR Python comparison now includes Surya and DocTR alongside PaddleOCR, Tesseract, EasyOCR, and RapidOCR with real benchmark numbers.
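The buy vs rent break-even analysis in the guide comes down to one division: purchase price over the hourly saving of owning versus renting. A sketch with illustrative numbers, not the guide's actual pricing:

```python
def break_even_hours(purchase_price, power_draw_w, electricity_per_kwh, rent_per_hour):
    """Hours of GPU use at which buying beats renting. Owning costs
    electricity per hour; renting costs the cloud hourly rate."""
    net_hourly_saving = rent_per_hour - (power_draw_w / 1000) * electricity_per_kwh
    if net_hourly_saving <= 0:
        return float("inf")  # electricity alone exceeds rent: never buy
    return purchase_price / net_hourly_saving

# Illustrative: $2,000 GPU, 450 W draw, $0.30/kWh, $0.80/hr cloud rate
hours = break_even_hours(2000, 450, 0.30, 0.80)
print(round(hours))  # ≈ 3008 hours of use to break even
```

Real break-even math would also fold in resale value and utilization, but the shape of the calculation is the same.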
7 New Benchmark Editorial Pages
- ImageNet editorial: SOTA timeline (63.3% → 92.7%), 15+ model leaderboard, transfer learning matrix visualization
- COCO editorial: object detection leaderboard, AP metrics breakdown, detection pipeline explainer
- SQuAD editorial: question answering benchmark with F1/EM scores, reading comprehension analysis
- GLUE/SuperGLUE editorial: NLU benchmark suite, saturation analysis, task breakdown
- GSM8K editorial: math reasoning benchmark, chain-of-thought analysis, LLM comparison
- HumanEval editorial: code generation benchmark, pass@1 leaderboard, language coverage
- SWE-bench Code Generation: editorial differentiating code models vs agent scaffolds
- 20+ custom visualizations generated across all pages
Seven new comprehensive benchmark editorials covering computer vision, NLP, reasoning, and code generation. Each page features researched leaderboard data, SOTA timelines, custom Python-generated visualizations, and domain-specific analysis.
SWE-bench, OmniDocBench & ADE20K Editorial Pages
- SWE-bench editorial: SOTA timeline (1.96% → 80.9%), 15-model leaderboard, 8 key papers, evaluation pipeline breakdown
- OmniDocBench editorial: CVPR 2025 document parsing benchmark with full leaderboard and metric breakdowns
- ADE20K upgraded: 23-model leaderboard, Python-generated visualizations, real dataset samples from HuggingFace
- 6 custom matplotlib visualizations for ADE20K (SOTA timeline, class distribution, accuracy vs params, color palette, stuff vs things)
- New /find page: 4-step benchmark finder wizard with email-gated results
Three comprehensive editorial pages now live — SWE-bench covers the definitive AI coding agent benchmark with full SOTA history from Claude 2's 1.96% to Claude Opus 4.5's 80.9%. OmniDocBench covers Shanghai AI Lab's CVPR 2025 document parsing benchmark. ADE20K rebuilt with real dataset images, Python-generated visualizations, and 23-model leaderboard. Plus a new Benchmark Finder wizard at /find for lead capture.
11 Interactive Paradox Explainers
- 11 3Blue1Brown-style interactive explainers covering mathematical paradoxes
- Stein's Paradox: How wheat prices help predict baseball averages
- Will Rogers Phenomenon: Stage migration in cancer survival statistics
- Berkson's Paradox: Selection bias in dating and hospitals
- Low Birth Weight Paradox: Simpson's paradox in epidemiology
- Schelling's Segregation: Agent-based model with smart movement algorithm
- Ross-Littlewood Paradox: Infinite sets and supertasks
- Banach-Tarski Paradox: Measure theory and the axiom of choice
- Newcomb's Paradox: Decision theory and free will
- Arrow's Impossibility Theorem: Why fair voting is mathematically impossible
- Cobra Effect: Incentive design and Goodhart's Law
- Grossman-Stiglitz Paradox: Why efficient markets can't exist
Massive expansion of the Explainers section with 11 comprehensive interactive paradox explainers. Each features multiple interactive simulations, games, and visualizations that let users experience the paradox firsthand. Schelling's Segregation includes smart agent movement for better convergence at high thresholds.
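The core of the Schelling simulation fits in a few lines. A compact sketch with simplifying assumptions: a toroidal grid with 8-neighborhoods, and unhappy agents relocating to a random empty cell rather than the explainer's smart movement algorithm:

```python
import random

def schelling_step(grid, threshold=0.3):
    """One round of Schelling's segregation model: every agent whose
    fraction of like neighbors falls below `threshold` moves to a random
    empty cell. Grid cells hold an agent label or None."""
    n = len(grid)

    def like_fraction(r, c):
        me = grid[r][c]
        neighbors = [grid[(r + dr) % n][(c + dc) % n]
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (dr, dc) != (0, 0)]
        occupied = [x for x in neighbors if x is not None]
        return 1.0 if not occupied else sum(x == me for x in occupied) / len(occupied)

    empties = [(r, c) for r in range(n) for c in range(n) if grid[r][c] is None]
    for r in range(n):
        for c in range(n):
            if grid[r][c] is not None and like_fraction(r, c) < threshold and empties:
                dest = random.choice(empties)
                empties.remove(dest)
                grid[dest[0]][dest[1]] = grid[r][c]
                grid[r][c] = None
                empties.append((r, c))

random.seed(0)
cells = ["A"] * 40 + ["B"] * 40 + [None] * 20
random.shuffle(cells)
grid = [cells[i * 10:(i + 1) * 10] for i in range(10)]
for _ in range(20):
    schelling_step(grid)
```

Even with this naive random relocation, clusters emerge from a mild 30% preference, which is the paradox the explainer lets readers experience directly.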
Rys OCR: Polish SOTA OCR Model (Research Preview)
- First fine-tune of Polish OCR model released on HuggingFace
- 71.3% Character Error Rate (CER) reduction on Polish text
- 46.1% Word Error Rate (WER) reduction
- LoRA fine-tune on PaddleOCR-VL base model
- Optimized for Polish diacritics: ą, ć, ę, ł, ń, ó, ś, ź, ż
- Runs on consumer hardware (4-6 GB VRAM)
- Apache 2.0 license, fully open source
- Call for contributions: datasets, benchmarks, R&D collaboration
Rys OCR is the first release in ongoing R&D to build state-of-the-art Polish text recognition. Trained on 10,000 synthetic Polish document images (addresses, invoices, receipts, dates, names). Looking for contributors to help with real Polish datasets, benchmark evaluations, and model improvements.
21 New 3Blue1Brown-Style Interactive Explainers
- 21 new explainer components with interactive visualizations
- Face Anonymization: detection pipelines, blurring vs pixelation vs generative inpainting
- PII Detection: entity types, confidence thresholds, redaction strategies
- Text Reranking: bi-encoder vs cross-encoder, relevance scoring
- Hallucination Detection: factual grounding, source verification methods
- Hybrid Retrieval: BM25 + dense vector fusion, reciprocal rank fusion
- Controllable Generation: temperature, top-k/p, repetition penalty, CFG
- Chart Understanding: chart type detection, data extraction pipelines
- Question Answering: extractive vs abstractive, span prediction
- Long Context Summarization: chunking strategies, hierarchical approaches
- Video-to-Text: frame sampling, temporal understanding, captioning
- Code Generation: syntax-aware models, repair and completion
- Audio/Video processing: emotion recognition, action recognition, tracking
- 50+ total building blocks now have interactive explainers
Massive expansion of Building Blocks with 21 new 3Blue1Brown-style interactive explainers. Each component features step-by-step visualizations, architecture diagrams, and practical code examples. Covers advanced AI capabilities from face anonymization to hallucination detection to video understanding.
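The reciprocal rank fusion covered in the Hybrid Retrieval explainer is short enough to show in full: each document scores the sum of 1/(k + rank) over every ranked list it appears in. The document IDs below are made up for illustration:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists. Each document accumulates
    1 / (k + rank) per list (rank is 1-based); k=60 is the constant from
    the original RRF formulation. Returns docs sorted by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical BM25 and dense-retriever result lists for one query:
bm25 = ["doc3", "doc1", "doc7"]
dense = ["doc1", "doc5", "doc3"]
print(reciprocal_rank_fusion([bm25, dense]))
# ['doc1', 'doc3', 'doc5', 'doc7'] — doc1 wins by appearing high in both
```

Because RRF only uses ranks, it fuses BM25 and dense scores without any score normalization, which is why the explainer pairs it with hybrid retrieval.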
Next.js 16 Migration & OCR Labeling Platform
- Complete migration from Astro to Next.js 16.1.1 with App Router
- New OCR Labeling Platform: Upload images, get bounding boxes via DOTS OCR (Replicate)
- Human-in-the-loop data flywheel for document processing quality improvement
- 27+ interactive explainer components migrated to React (LLM, VLM, TTS, etc.)
- CodeBlock component with Prism.js syntax highlighting and .ipynb download
- All dynamic routes fixed for Next.js 15+ async params pattern
- New /benchmark/[id] and /[area]/compare/[...slug] pages
- TypeScript compilation verified across entire codebase
Major infrastructure release migrating from Astro to Next.js for better performance, SSR, and React ecosystem integration. The new OCR Labeling Platform enables community-driven quality improvement: upload documents, review AI-extracted text with bounding boxes, and submit corrections. Interactive Building Blocks explainers (LLMExplainer, ImageCaptioningExplainer) now use React with useState for full interactivity.
Comprehensive SOTA Editorials for Major AI Areas
- Added expert editorials for 10 major AI areas with 500+ citations
- Speech: Whisper, Conformer, XTTS voice cloning, sub-200ms TTS latency insights
- NLP: GPT-5, Claude 3.5, DeepSeek-V3 comparisons, RAG adoption patterns
- Computer Code: SWE-bench leaders, RLVR training paradigm, package hallucination risks
- Reasoning: o3/o4-mini math performance, test-time compute scaling analysis
- Multimodal: Open-source parity (InternVL3, Molmo 2), hallucination mitigation
- Agentic AI: METR benchmarks, MCP/A2A protocols, production deployment reality
- Audio: Suno v4.5 music generation, MSEB benchmark gaps, mHuBERT edge deployment
- Robotics: OpenVLA 7B outperforming RT-2-X 55B, COLOSSEUM brittleness findings
- Medical: GPT-4o USMLE 90.4%, FDA approval generalization gaps, BoltzGen drug discovery
- 500+ citations from peer-reviewed papers (NeurIPS, ICML, CVPR, ACL)
Major content release adding practitioner-focused editorials to all major AI areas. Each editorial includes: State of the Field with specific benchmark scores, Hot Takes with honest insights, and Production Recommendations for different use cases. Comprehensive research with 500+ citations from peer-reviewed papers and major conferences.
The Zen of AI Composition: Free PDF Released
- Book now available for free download, no email required
- Direct PDF download with download counter tracking
- Building intelligent systems from first principles
- Three parts: Nature of Composition, Transformations, Practice
"The Zen of AI Composition" is now available as a free PDF download. A philosophical guide to building intelligent systems - covering the history of AI transformations, modular composition, and evidence-based prompting techniques.
The Zen of AI Composition: Book Early Access
- New book landing page with early access signup
- Double opt-in email confirmation via Resend
- Admin notifications on confirmed signups
- Book covers AI composition from first principles
- Three parts: Nature of Composition, Transformations, Practice
Announcing "The Zen of AI Composition" - a philosophical guide to building intelligent systems. Sign up for early access to receive the book first and help shape the final version. Double opt-in email flow ensures only engaged readers join the list.
Decision Tools: Model Comparator, Verification Protocol, Intent Analytics
- Interactive Model Comparator: Select 2-4 OCR models for side-by-side comparison
- Failure mode comparison: diacritics, tables, stamps, handwriting, low quality
- Shareable comparison URLs with query params for team decisions
- Verification Protocol page: 5-step benchmark verification process
- VERIFIED badge schema: dataset hash, prompt/config, runtime, cost, metric code
- Three verification tiers: Self-Reported, CodeSOTA Verified, Continuous Monitoring
- Decision intent analytics: scroll depth, time on page, CTA clicks, outbound tracking
- Atropos LLM RL guide: Nous Research framework for OCR evaluation and training
- Standalone OCR evaluation script for testing vision models on OCR-VQA
Continues the strategic transformation with enterprise decision tools. The Model Comparator lets teams compare 10 OCR models across 8 metrics with failure mode analysis. Verification Protocol establishes trust through transparent methodology. Decision intent analytics track how users make choices, enabling continuous improvement of the decision platform.
OCR Decision Platform: From Catalog to Decision Engine
- New canonical OCR Decision Guide page with failure taxonomy focus
- Homepage transformed: OCR hero with 90-second clarity messaging
- Failure taxonomy: diacritics, column bleed, numeric substitution, table collapse, stamp interference
- Decision matrix: "If your priority is X, choose Y" format
- Private OCR Evaluation Preview with waitlist signup
- Independence & Conflict of Interest Policy on methodology page
- GDPR compliance and EU data residency messaging
- EvaluationCTA component added to all 5 comparison pages
- Navigation updated: OCR highlighted first in cyan
Major strategic release transforming CodeSOTA from "benchmark catalog" to "decision platform". The new /ocr/decision page is the canonical OCR decision artifact - focusing on failure modes (what breaks) rather than accuracy percentages. Homepage now leads with OCR, includes 90-second clarity test (Who/What/Why/Next), and features Private Evaluation Preview. All comparison pages now include Request Evaluation CTAs.
Agentic AI Benchmarks: METR Time Horizon & Path to AGI
- New Agentic AI page: METR benchmarks tracking autonomous AI capabilities
- Time Horizon leaderboard: GPT-5.1-Codex-Max (160 min), GPT-5, o1-preview, Claude 3
- HCAST, RE-Bench, SWAA task suite breakdowns
- Interactive benchmark saturation chart (JS/Chart.js) with category views
- Building blocks now connected to all /browse/[area] pages
- 27 benchmarks across 8 categories including new Agentic category
- 7-month doubling time trend analysis for AGI timeline
Major release focused on agentic AI capabilities - the key metric for AGI progress. Added comprehensive METR benchmark tracking including time horizon (how long AI can work autonomously) and HCAST scores. Interactive JS-based saturation charts show how different benchmark categories are approaching ceiling performance. Browse pages now show relevant building blocks for each research area.
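The 7-month doubling time trend is a one-line extrapolation. A sketch of the arithmetic, purely illustrative rather than a METR forecast:

```python
def projected_horizon(current_minutes, months_ahead, doubling_months=7):
    """Project a time horizon forward assuming a fixed doubling time:
    horizon * 2^(months / doubling_time)."""
    return current_minutes * 2 ** (months_ahead / doubling_months)

# Starting from the 160-minute leaderboard entry above, 14 months out
# is exactly two doublings:
print(round(projected_horizon(160, 14)))  # 640 minutes
```

The saturation charts on the page visualize exactly this kind of exponential against observed scores.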
Building Blocks: 6 More Interactive Explainers
- Image Captioning: VLM deep-dive with LLaVA, Qwen2-VL, BLIP-2, GPT-4V architecture comparison
- Text-to-Video: Sora, Runway Gen-3, CogVideoX, Diffusion Transformer (DiT) architecture
- Image-to-Image: Inpainting, outpainting, super resolution, ControlNet, IP-Adapter
- Text-to-3D: DreamFusion, Shap-E, MVDream, LGM, Score Distillation Sampling explained
- Image-to-Video: Stable Video Diffusion, AnimateDiff, LivePortrait, Runway API
- Depth Estimation: Enhanced with real example images (mountain, street, indoor, portrait)
- Now 14+ comprehensive explainers covering all major AI modalities
Continued expansion of Building Blocks with 6 new interactive explainers. Each includes architecture diagrams, model evolution timelines, practical code examples, and performance comparisons. Depth estimation now features real input/output examples with turbo colormap visualization.
Building Blocks Expansion: 8 Interactive Explainers
- Object Detection: YOLO evolution (v1-v11), NMS, two-stage vs single-stage, mAP metrics
- Image Segmentation: SAM 2, semantic/instance/panoptic types, mask formats, Mask2Former
- Depth Estimation: Depth Anything v2, ZoeDepth, Marigold, metric vs relative depth
- Image to 3D: Gaussian Splatting, NeRF, Trellis, single-image 3D generation
- Speech Recognition: Whisper deep-dive, turbo vs large-v3, faster-whisper, diarization
- 27+ building blocks covering vision, NLP, audio, video, and 3D modalities
- Comprehensive code examples for each modality with multiple frameworks
- Interactive visualizations: attention matrices, depth colormaps, architecture diagrams
Massive expansion of the Building Blocks section. Added 8 comprehensive interactive explainers covering the core AI modalities. Each explainer includes architecture deep-dives, model evolution timelines, practical code examples (YOLO, SAM, Whisper, etc.), and interactive visualizations. The goal: be the best resource for understanding how each AI capability actually works.
Modular Benchmark Runner & Mistral OCR 2512 Verification
- New modular benchmark runner system with pluggable backends
- Mistral OCR 2512 (Mistral 3 OCR) verified and tested
- Stanford Churro (CHURRO-DS) benchmark integration
- OCRBench v2 runner with official evaluation support
- HTTP API daemon for remote GPU benchmark execution
- Checkpoint-based resumable benchmark runs
- Automated results sync to website data files
Major infrastructure release: The benchmark-runner now supports modular benchmark backends (Mistral OCR, Churro, OCRBench v2, OmniDocBench). Verified Mistral OCR 2512 performance: 9 pages in 7.37 seconds with high-quality markdown output. Stanford Churro integration enables historical document OCR benchmarks across 46 languages. All runners support checkpointing for resumable runs.
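Checkpoint-based resumption boils down to flushing results after every sample and skipping already-evaluated IDs on restart. A minimal sketch of that shape; the file format and names are illustrative, not the benchmark-runner's actual code:

```python
import json
import os

def run_benchmark(samples, evaluate, checkpoint_path="checkpoint.json"):
    """Resumable benchmark loop: results are written after each sample,
    so a crashed or interrupted run picks up where it left off.

    samples:  iterable of (sample_id, sample) pairs
    evaluate: callable scoring one sample
    """
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)  # resume from previous run
    for sample_id, sample in samples:
        if sample_id in results:
            continue  # already evaluated before the interruption
        results[sample_id] = evaluate(sample)
        with open(checkpoint_path, "w") as f:
            json.dump(results, f)  # flush after every sample
    return results
```

Writing the whole result dict each time is wasteful for large runs (an append-only JSONL file scales better), but it keeps the resume logic trivially correct.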
Interactive AI Explainers: LLM & TTS Deep Dives
- LLM Explainer: How transformers work with 5 interactive sections
- TTS Explainer: Complete text-to-speech pipeline visualization
- Interactive tokenization demo with BPE explanation
- Attention mechanism visualization with clickable matrix
- Next token prediction with probability distributions
- Mel spectrogram and waveform canvas visualizations
- Voice cloning methods: speaker embedding, in-context learning, fine-tuning
- Neural codec language model explanation (VALL-E, ElevenLabs-style)
Two comprehensive interactive explainers for the Building Blocks pages. The LLM explainer covers tokenization, embeddings, attention, next-token prediction, and transformer architecture. The TTS explainer covers text normalization, G2P, prosody, acoustic models, mel spectrograms, vocoders, and zero-shot voice cloning. All with interactive canvas-based visualizations.
Building Blocks & Editorial Guides
- Building Blocks: Modular AI capabilities taxonomy (image-to-vector, text-to-vector, etc.)
- Editorial Guides for 3 personas: Executives, Enthusiasts, Researchers
- Executive Guide: Document Processing Technology Matrix with vendor comparison
- Enthusiast Guide: SOTA Tracker with current leaders and reproduction tips
- Research Guide: ML Landscape 2025 with trend analysis and gap identification
- Data Flywheel page explaining community-driven benchmark growth
- LLM and Object Detection hub pages
- PWC Archive: 1,519 papers, 464 models, 145 datasets integrated
Major release introducing Building Blocks - a new way to think about AI capabilities as modular transformations (image-to-vector, audio-to-text, etc.). Added comprehensive editorial guides for different user personas: CTOs get vendor comparison matrices, enthusiasts get SOTA tracking tools, researchers get trend analysis. All built on Papers with Code archive data.
SEO Improvements & Production Auth
- -Papers with Code alternative page SEO optimization
- -Added FAQ section targeting "People Also Ask"
- -Clerk production authentication with GitHub OAuth
- -User work profile preferences in dashboard
- -Sitemap fixed with correct www domain
- -Removed custom analytics (using Vercel Analytics)
SEO improvements for the Papers with Code story page including optimized title, meta description, FAQ section, and internal links. Switched Clerk to production mode with GitHub OAuth. Added work profile preferences feature for logged-in users to describe their ML focus areas.
User Accounts & Email Capture
- -User authentication via Clerk (GitHub OAuth)
- -Protected dashboard for authenticated users
- -Sign-in and sign-up pages with dark theme styling
Added user account system using Clerk for authentication. Users can sign in with GitHub. Protected routes redirect unauthenticated users to sign-in.
CodeSOTA Polish OCR Benchmark
- -1,000 synthetic and real Polish text images with ground truth
- -4 categories: synth_random, synth_words, real_corpus, wikipedia
- -5 degradation levels: clean, light, medium, heavy, severe
- -Tesseract 5.5.1 baseline: 26.3% CER overall
- -Contamination-resistant design exposes LM dependence (52% vs 5% CER)
- -Dedicated Polish OCR page with category breakdown and key findings
Our own Polish OCR benchmark, designed to separate language-model reliance from pure character recognition. Synthetic categories (no dictionary fallback) show 10x worse performance than real text, exposing heavy dependence on statistical language models. The dataset includes five degradation levels generated with Augraphy to simulate real document-scanning conditions.
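For reference, the CER numbers above are Levenshtein edit distance divided by reference length. A minimal pure-Python version (illustrative, not our evaluation harness):

```python
def cer(reference, hypothesis):
    """Character error rate: edit distance between the OCR output and
    the ground truth, divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,       # deletion
                         cur[j - 1] + 1,    # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)
```

Dropping every diacritic in "zażółć" costs four substitutions out of six characters, i.e. a CER of about 0.67, which is why diacritic handling dominates Polish OCR scores.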
Mistral OCR 3 Added
- -New Mistral OCR 3 model (mistral-ocr-2512) added to benchmarks
- -Dedicated review page with pricing, code examples, benchmarks
- -94.9% claimed accuracy, 74% win rate over OCR 2
- -$2/1000 pages ($1 with batch API)
- -Benchmark results: olmOCR-bench, CER, WER metrics
Added comprehensive coverage of Mistral OCR 3, released in December 2025. Includes benchmark comparisons with GPT-4o and PaddleOCR, pricing analysis, code examples, and use-case recommendations.
Featured Guides on Landing Page
- -New "In-Depth Comparisons" section on homepage with image cards
- -6 featured editorial guides: OCR comparisons, Audio AI, Medical Radiology
- -Images surfaced for better Google Images discovery
- -Internal linking boost for deep content pages
Landing page now showcases editorialized content with images for better SEO discovery. Featured guides include PaddleOCR vs Tesseract, GPT-4o vs PaddleOCR, Best OCR for Invoices, Best OCR for Handwriting, Audio AI Benchmarks, and Chest X-ray AI Models.
Audio AI Benchmarks: Classification, Music Generation & More
- -Audio AI overview: AudioSet, ESC-50, music generation, audio captioning
- -Audio Classification deep-dive: BEATs at 0.498 mAP, 98.1% on ESC-50
- -Music Generation comparison: Suno, Udio, MusicGen, Stable Audio
- -7 custom visualizations: spectrograms, waveforms, model comparisons
- -Evaluation metrics explained: mAP, FAD, MOS, CLAP scores
- -Architecture breakdowns: CNN vs Vision Transformer vs Audio Tokenizer
Comprehensive Audio AI vertical with 3 editorial pages covering classification (AudioSet, ESC-50), music generation (Suno, Udio, MusicGen), and audio understanding (Qwen2-Audio, SALMONN). Includes custom-generated visualizations and practical model recommendations by use case.
GPU Hardware Benchmarks: RTX 3090 vs 4090 vs 5090
- -Compare RTX 3090, 4090, 5090 for ML workloads
- -LLM inference: Llama 3, Mistral, with tokens/sec metrics
- -Image generation: SDXL, Flux, SD 1.5 benchmarks
- -Training: LoRA fine-tuning, YOLO, ResNet performance
- -VRAM requirements guide: which models fit on which GPU
- -Cloud GPU pricing from RunPod, vast.ai, Lambda Labs
New Hardware section with comprehensive GPU comparison for ML. Includes specs, real-world benchmarks across LLM inference, image generation, training, and computer vision. Features recommendations on which GPU to buy and cloud pricing comparison.
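A tokens/sec figure like the ones on these pages can be produced with a crude timing harness along these lines (`generate` is a stand-in for whatever inference call you benchmark; a real run would also pin batch size, precision, and context length):

```python
import statistics
import time

def tokens_per_second(generate, prompt, n_tokens, runs=3):
    """Crude LLM throughput probe: time a few generation calls and
    report the median tokens/sec (median damps scheduler/thermal noise)."""
    generate(prompt, n_tokens)  # warm-up run (CUDA init, weight caching)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt, n_tokens)
        samples.append(n_tokens / (time.perf_counter() - start))
    return statistics.median(samples)
```

The warm-up call matters on GPUs: the first invocation often pays one-time kernel compilation and memory-allocation costs that would otherwise skew the numbers.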
Polish OCR Benchmarks
- -4 Polish OCR datasets: PolEval 2021, IMPACT-PSNC, reVISION, Polish EMNIST
- -Models: Tesseract Polish, ABBYY FineReader, HerBERT, Polish RoBERTa
- -Best CER: 2.1% on PolEval 2021, 97.5% word accuracy on IMPACT
- -Covers diacritics challenges and gothic font recognition
Dedicated Polish OCR benchmark page covering historical documents from 1791-1998, gothic fonts, and Polish diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż). Features both raw OCR engines and NLP post-correction approaches.
Industrial Anomaly Detection Benchmarks
- -8 industrial datasets: MVTec AD, VisA, weld defects, steel defects
- -12 anomaly detection models: PatchCore, EfficientAD, SimpleNet, FastFlow
- -Best AUROC: 99.6% (SimpleNet on MVTec AD)
- -Covers weld inspection, steel defects, surface inspection
- -Three approaches: Memory Bank, Normalizing Flows, Student-Teacher
Industrial inspection vertical for manufacturing quality control. Covers anomaly detection for defects, weld X-ray inspection, and steel surface analysis. Includes metrics explainer for AUROC vs PRO.
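The AUROC metric used throughout this vertical has a simple rank-based reading: the probability that a randomly chosen defect scores higher than a randomly chosen good part. A toy implementation (not our evaluation code):

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: fraction of
    (anomaly, normal) pairs where the anomaly scores higher
    (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise view also makes clear why AUROC saturates near 99%+ on MVTec AD while the stricter per-pixel PRO metric still separates models.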
Chest X-Ray AI: Radiology Benchmarks Launch
- -7 chest X-ray datasets: CheXpert, MIMIC-CXR, NIH ChestX-ray14, VinDr-CXR, PadChest, RSNA, COVID-19
- -15 radiology AI models: CheXNet, CheXzero, TorchXRayVision, MedCLIP, GLoRIA, BioViL
- -20+ benchmark results with AUC scores across datasets
- -Interactive cross-dataset comparison chart
- -Data pipeline explainer: DICOM to multi-label classification
Major expansion of Medical AI with a dedicated Chest X-Ray benchmark page. Features 900K+ images across 7 major datasets, leaderboard sorted by CheXpert AUC, and cross-dataset generalization analysis. Covers the rise of vision-language models (CLIP-based) and the label noise problem in radiology.
SEO & Accessibility Improvements
- -Dataset schema on benchmark pages for Google Dataset Search
- -Dynamic meta descriptions with SOTA model and scores
- -FAQPage schema on Speech and Code Generation pages
- -Canvas accessibility: aria-labels and fallback text on DocumentScanner
- -BreadcrumbList schema for improved navigation structure
Major SEO improvements following audit recommendations. Benchmark pages now include schema.org/Dataset structured data for visibility in Google Dataset Search. Meta descriptions dynamically include the current SOTA model and score. Speech and Code Generation verticals now have FAQPage schema for rich snippets.
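The Dataset markup boils down to emitting a schema.org JSON-LD blob per benchmark page. A minimal sketch (the helper name and example values are illustrative; the properties shown are a subset of what Google Dataset Search accepts):

```python
import json

def dataset_jsonld(name, description, url, license_url):
    """Build a schema.org/Dataset JSON-LD string for a benchmark page
    so it can surface in Google Dataset Search."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
    }, indent=2)
```

The resulting string is embedded in the page inside a `<script type="application/ld+json">` tag.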
Major Content Expansion: 6 New Verticals
- -NLP vertical: GLUE, SuperGLUE, SQuAD benchmarks with 20+ models
- -Speech vertical: Whisper vs Azure, LibriSpeech benchmarks
- -Multimodal vertical: VQA, image captioning, GPT-4V vs Gemini
- -Reasoning vertical: MATH, GSM8K, GPQA, o1 vs GPT-4 comparison
- -LLM comparison hub: GPT-4 vs Claude head-to-head analysis
- -Code generation: best-for Python, JavaScript, debugging guides
- -OCR expansion: receipts, tables, multilingual, 3 new comparisons
20+ new pages across 6 research verticals. Each vertical includes landing pages, benchmark deep dives, and model comparisons. Navigation updated to include NLP, Speech, Multimodal, Reasoning, and LLM sections.
OCR Arena Speed vs Quality Visualization
- -Interactive scatter plot: ELO score vs latency
- -18 models from OCR Arena human preference rankings
- -Green dots for open source, red for closed/API
- -Key insights: best quality, best balance, fastest
- -Full rankings table with win rates and battle counts
New visualization page showing the speed vs quality tradeoff for OCR models based on human preference data from OCR Arena. Helps identify Pareto-optimal models for different use cases.
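The "Pareto-optimal" framing can be made concrete in a few lines: a model is on the frontier when no other model beats it on both quality and speed at once (toy data below, not the actual Arena numbers).

```python
def pareto_frontier(models):
    """models: list of (name, elo, latency_s) tuples. A model is kept
    when no other model is strictly better on both axes
    (higher ELO and lower latency)."""
    frontier = []
    for name, elo, lat in models:
        dominated = any(e > elo and l < lat
                        for n, e, l in models if n != name)
        if not dominated:
            frontier.append(name)
    return frontier
```

Anything off the frontier is a strictly worse choice for every use case; the interesting decisions all live along it, which is what the scatter plot visualizes.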
CodeSOTA Meta-Benchmark Score
- -Aggregate score across 8 OCR benchmarks
- -Weighted scoring: primary (3x), secondary (2x), tertiary (1x)
- -Interactive heatmap: models vs benchmarks
- -Coverage tracking: see which models need testing
- -Testing priority list for contributors
Introducing the CodeSOTA Score - a single number to compare OCR models across multiple benchmarks. Primary benchmarks (OmniDocBench, OCRBench v2, olmOCR-Bench) weighted 3x, secondary (CHURRO-DS, CC-OCR) 2x, language-specific 1x. Visual heatmap shows exactly where data is missing.
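The weighting scheme reduces to a weighted average over whatever benchmarks a model has actually been run on. A sketch (benchmark names and scores are illustrative):

```python
# Tier weights as described above: primary 3x, secondary 2x, tertiary 1x.
WEIGHTS = {"primary": 3, "secondary": 2, "tertiary": 1}

def codesota_score(results, tiers):
    """Weighted average of per-benchmark scores (0-100). Benchmarks a
    model was never run on are skipped entirely, so coverage matters."""
    num = den = 0
    for bench, score in results.items():
        w = WEIGHTS[tiers[bench]]
        num += w * score
        den += w
    return num / den if den else None
```

Because missing benchmarks are skipped rather than zeroed, two models with different coverage are not directly comparable, which is exactly what the coverage heatmap and testing-priority list address.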
Papers With Code Database Integration
- -1,500+ benchmark results from PWC archive
- -SOTA Timeline: interactive hill-climbing charts
- -146 datasets, 464 models indexed
- -15 research areas with 70+ tasks defined
- -NLP, Reasoning, Code, Speech, Medical and more
Major release integrating the full Papers With Code archive. Browse historical benchmark results with the classic "hill climbing" visualization showing SOTA progression over time. All major research areas now have defined tasks - NLP (9 tasks), Reasoning (5), Code (6), Speech (5), and more.
Papers With Code Story
- -Complete history of Papers With Code (2018-2025)
- -Why it mattered for ML research
- -What was lost when Meta shut it down
- -Why CodeSOTA exists to fill the gap
- -Cost vs Quality frontier graph on vendors page
New page explaining the story of Papers With Code - what it was, why it was invaluable, and why there is a vacuum after Meta "sunsetted" it in July 2025. Also added interactive cost vs quality graph to the vendors comparison.
Homepage Redesign & OCR Vendors Page
- -New hero section: "State of the Art, Verified"
- -Papers With Code successor positioning
- -OCR Vendors comparison page with 9 vendors
- -Decision matrix for different use cases
- -LinkedIn banner for social media
Major update to homepage positioning CodeSOTA as the next generation of ML benchmarking. New OCR vendors page consolidates all options (Mistral, Docling, GPT-4o, PaddleOCR, Tesseract, Google Doc AI, Azure, doctr, Chandra) with practical decision guidance.
Mistral OCR Documentation
- -Mistral OCR API guide with Python examples
- -Benchmark claims: 94.9% accuracy, 2000 pages/min
- -Pricing comparison: $0.001/page vs competitors
- -Independent testing caveats documented
- -Mistral vs Docling comparison table
Added comprehensive documentation for Mistral OCR API. Includes both official benchmark claims and independent testing results showing mixed performance on complex layouts.
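A minimal call shape for the API, based on Mistral's public OCR documentation at the time of writing; treat the endpoint and field names as assumptions to verify against the current docs. The network call only fires when `MISTRAL_API_KEY` is set, so the payload can be inspected offline.

```python
import json
import os
import urllib.request

def ocr_request(document_url, model="mistral-ocr-latest"):
    """Build the JSON payload for a Mistral OCR call (field names per
    Mistral's docs; verify against the current API reference)."""
    return {
        "model": model,
        "document": {"type": "document_url", "document_url": document_url},
    }

payload = ocr_request("https://example.com/scan.pdf")  # placeholder URL

if os.environ.get("MISTRAL_API_KEY"):  # only hit the network with a key
    req = urllib.request.Request(
        "https://api.mistral.ai/v1/ocr",  # assumed endpoint; check docs
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    print(urllib.request.urlopen(req).read().decode())
```

The response contains per-page markdown, which is what our verification runs diff against ground truth.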
Docling Tutorial Verified
- -All code executed and verified on real documents
- -Real outputs: 33,201 characters of Markdown from a 10-page PDF in 34.95s
- -Table extraction verified: 3 tables with CSV export
- -Downloadable artifacts from actual test run
- -Performance metrics from Apple Silicon with MPS acceleration
The Docling tutorial now includes real, verified outputs from processing the Docling arXiv paper. No more AI-generated placeholder snippets - every code block has been executed and the actual results are shown.
Docling Documentation Added
- -Complete Docling documentation following Diataxis framework
- -Tutorial: PDF to Markdown conversion
- -How-To Guides: OCR engines, table extraction, RAG integration
- -Technical Reference: API docs, model specs
- -Explanation: Architecture deep-dive
IBM's Docling represents a significant shift in document processing - using computer vision instead of traditional OCR. We've added comprehensive documentation to help you evaluate and integrate it.
Chandra OCR Benchmark Data
- -Added Chandra OCR 0.1.0 benchmark results
- -Top performer on olmOCR-Bench at 83.1%
- -Comparison data against PaddleOCR-VL, MinerU, Marker
Chandra OCR from Allen Institute AI now leads the olmOCR-Bench leaderboard. We've added comprehensive benchmark data to help you compare it against other solutions.
Document Scanner Tutorial
- -Full document scanning pipeline with OpenCV
- -Edge detection, perspective correction, enhancement
- -Interactive demo with sample images
- -Integration guide with OCR engines
Learn to build a document scanner that detects edges, corrects perspective, and enhances scanned images. Includes full Python code and interactive examples.
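The geometric core of the pipeline, ordering the four detected corners and sizing the deskewed output, can be sketched in pure Python. With OpenCV installed, `cv2.getPerspectiveTransform` and `cv2.warpPerspective` then complete the correction; the tutorial covers the full version.

```python
from math import hypot

def order_corners(pts):
    """Order 4 detected corner points as top-left, top-right,
    bottom-right, bottom-left (the order cv2.getPerspectiveTransform
    expects), using the classic sum/difference trick."""
    tl = min(pts, key=lambda p: p[0] + p[1])  # smallest x + y
    br = max(pts, key=lambda p: p[0] + p[1])  # largest x + y
    tr = min(pts, key=lambda p: p[1] - p[0])  # smallest y - x
    bl = max(pts, key=lambda p: p[1] - p[0])  # largest y - x
    return [tl, tr, br, bl]

def target_size(corners):
    """Output size of the deskewed scan: the longer of each pair of
    opposite edges, so no content gets squeezed."""
    tl, tr, br, bl = corners
    dist = lambda a, b: hypot(a[0] - b[0], a[1] - b[1])
    width = int(max(dist(br, bl), dist(tr, tl)))
    height = int(max(dist(tr, br), dist(tl, bl)))
    return width, height
```

The ordered corners become the source quad and `[(0, 0), (w, 0), (w, h), (0, h)]` the destination quad for the perspective warp.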
Initial Launch
- -OCR benchmark leaderboard with 8 major benchmarks
- -State-of-the-art results from 50+ models
- -Methodology documentation
- -Comparison pages: PaddleOCR vs Tesseract, GPT-4o vs PaddleOCR
CodeSOTA launches with comprehensive OCR benchmarking data. Our goal: verify vendor claims independently and help you choose the right tools.
This is the complete changelog since launch. Star us on GitHub for updates.