Changelog
Building the definitive AI benchmark platform. Every update since day one.
14 New Guides: Code Gen, Speech, TTS, RAG, Agentic, Segmentation, and More
- Best AI Code Generation Models Compared — Claude Opus 4, GPT-5, Gemini, DeepSeek-V3, Qwen2.5-Coder on HumanEval, SWE-bench, LiveCodeBench
- Speech Recognition 2026 — Whisper vs Gemini vs AssemblyAI vs Deepgram with WER benchmarks and pricing
- Best Open-Source TTS Models — Kokoro, XTTS v2, Bark, Piper, Fish Speech, Dia, F5-TTS with MOS scores
- RAG vs Fine-Tuning vs Long Context — decision framework with cost analysis and code examples
- Agentic AI Benchmarks Explained — SWE-bench, RE-bench, HCAST, WebArena, GAIA, OSWorld decoded
- Image Segmentation Models — SAM 2 vs Mask2Former vs OneFormer on ADE20K and COCO
- Anomaly Detection for Manufacturing — PatchCore, EfficientAD on MVTec AD with ROI analysis
- Time Series Forecasting — ARIMA vs PatchTST vs TimesFM vs Chronos foundation models
- State of Multimodal AI — GPT-5, Claude Opus 4, Gemini 2.5 Pro on MMMU, MathVista, video
- Few-Shot Learning is Dead — thought piece on foundation models absorbing few-shot methods
- How to Read an ML Paper — 3-pass method, benchmark red flags, 20-point evaluation checklist
- Graph Neural Networks — GCN, GAT, GraphSAGE with OGB benchmarks and PyG code
- Medical AI Regulation Cheat Sheet — FDA, EU MDR, MHRA pathways for developers
- RL from Atari to Robotics — visual timeline from DQN (2013) to world models (2026)
Largest single guide expansion: 14 comprehensive guides covering every major benchmark area that previously had no editorial content. Each guide follows the proven CodeSOTA format with benchmark data, model comparisons, code examples, pricing analysis, and decision frameworks. Guides index page updated with 6 new categories (Time Series, Graphs, RL, Methodology, Agentic, plus expanded Audio/Speech and CV sections).
Benchmark Relevance Audit: 300+ New Results Across All 12 Areas
- Full relevance audit of all CV benchmarks: identified saturated (ImageNet, IIIT5K), declining (RVL-CDIP, PubLayNet), and highly relevant (ADE20K, OmniDocBench) benchmarks
- 3 new datasets added: LVIS v1.0 (1,203-category long-tail detection), DocLayNet (general-purpose layout analysis), Union14M (real-world scene text recognition)
- COCO object detection: 5 → 17 results (Co-DETR 66.0 mAP, InternImage-H, DINO, Grounding DINO, YOLOv10/11, RT-DETR)
- ADE20K semantic segmentation: 2 → 13 results (ONE-PEACE 63.0, BEiT-3, DINOv2, Mask2Former, EoMT)
- NLP area populated from zero: 53 results across SQuAD v2.0, GLUE, SuperGLUE, SNLI, CoNLL-2003, CNN/DailyMail
- Speech area populated from zero: 30 results (LibriSpeech 1.46% WER, VALL-E 2 4.61 MOS human parity)
- Multimodal area populated from zero: 37 results + 3 new datasets (MMMU, MMBench, TextVQA)
- Agentic AI area built from scratch: 6 datasets, 36 results (SWE-bench Verified, METR, HCAST, RE-Bench, WebArena, OSWorld)
- Reasoning updated with 2025-2026 frontier models: ARC-AGI v1/v2, AIME 2025, o3/o4-mini/Claude Opus 4.5/DeepSeek-R1 scores
- Computer Code: all 6 tasks now populated (was 1). Added LiveCodeBench, CrossCodeEval, Defects4J, CodeSearchNet, TransCoder, Bugs2Fix
- Removed 3 empty tasks: Polish OCR, LaTeX OCR, Key Information Extraction
- Total site results: ~1,998 → 2,300+ across 12 research areas
Comprehensive benchmark relevance audit and data refresh across all 12 research areas. Started with a systematic analysis of computer vision benchmarks to identify which are still relevant in 2026, then expanded to populate every previously empty area on the site. Key finding: ImageNet, IIIT5K, and RVL-CDIP are saturated/compromised — LVIS, DocLayNet, and Union14M are their modern replacements. The audit used 10 parallel benchmark-updater agents to research and insert verified results from arXiv papers, official leaderboards, and model cards.
AI Arena Rankings, Academy Deep-Dives, MTEB Editorial
- 7 new Arena leaderboard pages with real data from arena.ai: Text, Code, Vision, Document, Search, Text-to-Image, Text-to-Video
- Arena overview page with category cards, provider dominance table, and methodology explainer
- Pareto frontier analysis on Search arena: cost vs Elo scatter plot identifying 4 optimal models
- 3 Academy deep-dives: Matrix Operations (971 lines), How Transformers Work (1,077 lines), Embedding Dimensions (754 lines)
- MTEB benchmark page rewritten to ADE20K editorial quality (1,252 lines) with golden datasets, 8 task categories, real HuggingFace data
- Tokens & Context page enhanced with CSS Pareto frontier plots, cost optimization strategies, and updated 2025 pricing
- Embedding lesson updated with real MMTEB leaderboard data (KaLM-Gemma3-12B #1 at 72.32), current mteb API, proper citations
- FAQ schema markup on arena pages for Google rich snippets
- Real benchmark data from HuggingFace mteb.ResultCache (705 models) and arena.ai
Massive content expansion: 11,000+ lines of editorial content across 20 files. Arena pages cover all major AI competition categories with editorial insights (who leads where, price-performance analysis, open-source highlights). Academy deep-dives provide actual math walkthroughs (matrix multiplication proofs, attention formula derivation, Johnson-Lindenstrauss lemma). MTEB page now matches the quality bar set by the ADE20K benchmark editorial. All data sourced from official HuggingFace APIs and arena.ai leaderboards.
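The Pareto frontier analysis on the Search arena reduces to a simple dominance check: a model stays on the frontier unless some other model is at least as cheap and at least as strong. A minimal sketch with hypothetical model names and numbers (not the actual arena.ai data):

```python
def pareto_frontier(models):
    """Return models not dominated on (cost, Elo): a model is dominated if
    another model costs no more AND scores no less, with at least one
    strict inequality."""
    frontier = []
    for name, cost, elo in models:
        dominated = any(
            c <= cost and e >= elo and (c < cost or e > elo)
            for n, c, e in models if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical entries: (name, $ per 1M tokens, arena Elo)
models = [
    ("model-a", 15.0, 1290),
    ("model-b", 3.0, 1260),
    ("model-c", 0.4, 1190),
    ("model-d", 5.0, 1180),  # dominated by model-b: pricier AND weaker
]
print(pareto_frontier(models))  # ['model-a', 'model-b', 'model-c']
```

The same check generalizes to any cost-vs-quality scatter plot on the site.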
Full Site Redesign: Clean Design System, Discoverability Overhaul
- Homepage redesigned: 1,051 → 260 lines. No gradients, generous spacing, 1px-gap grids
- /browse page: inline stats, clean tabs, 1px-gap area grids matching homepage
- Area page template redesigned for all 17 research areas at once
- /news page: featured grid + clean list layout with mono labels
- /flywheel page: real traffic data (12,600+ visitors, 125 countries, top 10 breakdown)
- /papers-with-code: Tufte-style redesign, fixed Google snippet visibility (dark text on white)
- Gradient text-transparent removed site-wide — all headings now visible in Google featured snippets
- Footer restructured: 6 columns, 40+ links covering all previously orphaned pages
- 7 new area editorials: adversarial, graphs, industrial, knowledge-base, methodology, RL, time-series
- Cross-links: MTEB ↔ embedding lesson, challenges submission form replaces mailto
- Explainers moved from navbar to footer, dead Twitter link removed
Complete visual overhaul establishing a consistent design system across all key pages. Inspired by Linear/Vercel: white primary CTAs, border-zinc-800 dividers, font-mono accents, no colorful gradients. Fixed a critical SEO issue where gradient text (text-transparent + bg-clip-text) was invisible in Google featured snippets. Restructured the footer from 6 columns with sparse links to 6 thematic columns covering all 40+ public pages. Added real Vercel analytics data to the flywheel page as social proof.
AI Research Challenges: 42 Paid Bounties for the ML Community
- 42 research challenges ranging from Easy ($1) to Legendary ($16), totaling $280 in bounties
- Tasks span data collection, analysis, benchmarking, reproduction, and original research
- Interactive challenge browser with difficulty/category filters and expandable task cards
- Designed so AI agents help ~30-40% but human judgment, verification, and experimentation carry the rest
- Open to all AI developers and ML researchers
- Every deliverable gets published on CodeSOTA with full contributor attribution
Launched /challenges — a page of 42 Knuth-style research bounties designed to fill CodeSOTA's 188 'needs research' benchmark gaps. Tasks progress from data collection (find & verify paper results) through analysis (cross-paper contradictions, evaluation noise) to original research (new benchmarks, contamination detection, scaling law verification). The Legendary tier ($16/task) includes meta-analysis papers, full domain coverage, and the 'State of AI Evaluation 2026' annual report.
Handwriting OCR Editorial Rewrite, Robotics Page Visuals
- Handwriting OCR guide rewritten with SVG bounding box illustrations (line/word/character detection), side-by-side model comparisons
- GPT-4o confirmed as new handwriting SOTA: 1.69% CER on IAM (arXiv 2503.15195), beating all traditional HTR models
- 11 models benchmarked with updated 2026 data including DTrOCR, GOT-OCR 2.0, Qwen2.5-VL, DLoRA-TrOCR
- Robotics page enhanced with 6 visual elements: Canvas bar/radar charts, SVG timeline, pipeline diagram, task difficulty spectrum, matplotlib images
- Clerk auth error fixed on benchmark pages (useUser outside ClerkProvider)
- MMLU SOTA progress bar scaling fixed (a 92.4% bar was rendering shorter than an 89.2% bar)
- Next.js 16 config warnings resolved
Two major editorial upgrades. The handwriting OCR page went from a text-heavy listicle to a visual editorial with inline SVG bounding box illustrations, a CER comparison chart, and side-by-side model output comparisons showing errors highlighted in red. Key finding: multimodal LLMs (GPT-4o at 1.69% CER) now beat all traditional HTR models in zero-shot. The robotics page gained Canvas-based performance charts, an SVG timeline of milestones (2017-2025), a robot learning pipeline diagram, and matplotlib-generated SOTA evolution plots.
Mass Benchmark Update: 24 Benchmarks Refreshed with 2024-2026 Data
- /browse page sections now collapsible with chevron toggles, benchmarks sorted by latest update within each area
- Benchmark updater agent created to systematically refresh stale leaderboards with verified 2024-2026 results
- AudioSet: 4 new entries, SSLAM new SOTA (0.502 mAP, ICLR 2025), corrected BEATs score
- LibriSpeech: 15 entries across test-clean/test-other, Parakeet RNNT 1.1B leads at 1.46% WER
- COCO Captions: 23 entries added, PaLI-X-55B leads with 149.2 CIDEr
- VQA v2.0: 16 results, PaLI-X 55B leads at 86.1% accuracy
- MuJoCo: 9 results, TD-MPC2 317M leads with 960 normalized score (ICLR 2024)
- ESC-50: OmniVec2 (99.1% SOTA, CVPR 2024), EAT, MaskSpec + backfilled paper links
- SCUT-CTW1500: TextMamba (89.7, Dec 2024), DeepSolo, EK-Net — 19 entries total
- ICDAR 2015, M4, CIFAR-100, LJ Speech, VCTK, OGB, Cora all seeded with verified results
- More updates in progress: ImageNet-1K, COCO, MATH, SWE-bench, ADE20K, OmniDocBench, COCO-Text, Atari ALE, Common Voice
- High-traffic benchmarks prioritized using /trending page view data (30-day window)
Largest single-session benchmark data refresh in CodeSOTA history. Used the new benchmark-updater agent to parallelize research and data entry across 24 benchmarks spanning computer vision, speech, audio, reinforcement learning, graphs, multimodal, time series, and code generation. Verified highlights: AudioSet SOTA moved to 0.502 mAP (SSLAM, ICLR 2025), LibriSpeech test-clean down to 1.46% WER (Parakeet RNNT), COCO Captions at 149.2 CIDEr (PaLI-X), VQA v2.0 at 86.1% (PaLI-X), MuJoCo control at 960 (TD-MPC2). The /browse page was improved with collapsible area sections and sorting by most recent paper date. Benchmark selection was driven by /trending page view data.
Hallucination Detection Overhaul, Simplified Topic Submission
- Hallucination detection explainer rewritten with real Vectara HHEM leaderboard data (87 models, March 2026)
- 3 Canvas-based visualizations: hallucination rate rankings, historical 2024-2026 progression, reasoning vs standard models
- Key finding: reasoning models (o3-pro 23.3%, o4-mini 18.6%) hallucinate 2-3x more than standard models on summarization
- Best hallucination rate improved 4.7x in 2 years: from 8.5% (early 2024) to 1.8% (Finix S1 32B, March 2026)
- 5 production Python code examples using real libraries: DeBERTa-v3 NLI, selfcheckgpt, RAGAS v0.2+, factscore, FAISS+NLI
- Submit page simplified from 10-field paper submission to 5-field topic suggestion with email notification
- /browse page now shows all 231 benchmarks (was only showing 188 undecided), grouped by area with status badges
The hallucination detection building block page went from generic AI-generated content to a data-driven reference with real benchmark scores sourced from the Vectara HHEM leaderboard. The submit form was radically simplified — instead of asking users to submit papers with benchmark results (which nobody did), it now asks 'what did you find?' with an optional link and email for notification when we write about it. The /browse page was redesigned to show all 231 benchmarks across 12 areas instead of only the undecided ones, with active benchmarks listed first and status badges on each row.
HumanEval Editorial, Site-Wide View Tracking, Hardparse Promotion
- New /benchmark/humaneval editorial: complete history of 60+ models from 28.8% (Codex, 2021) to 99% saturation
- Site-wide view tracking: passive page view counter with /trending leaderboard grouped by area
- Hardparse promotional page at /ocr/hardparse with benchmark comparison, dual CTA (Mac App + API)
- Hardparse CTA banner added to 10 OCR pages for cross-promotion
- LLM page updated with real OpenRouter rankings: top models, provider market share, quality scores
- All LLM benchmark cards now clickable with links to task pages
- Article reactions system on editorial pages (insightful, practical, surprising, needs-update)
- /ocr/ship-it page deployed (was previously untracked)
Major push on engagement and monetization. The /trending page shows what the community is reading across all areas, powered by a new /api/views endpoint with daily IP deduplication. Hardparse gets a full product page and cross-site CTAs. The HumanEval editorial traces the benchmark from OpenAI's Codex paper through GPT-4o breaking 90% to multiple models hitting 99% — positioned as the definitive history of code generation benchmarking. LLM page now uses real scraped OpenRouter data (March 2026) instead of stale placeholders.
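The daily IP deduplication behind the view counter is one set lookup per request. A minimal sketch of the idea; the class and method names are illustrative, not the actual /api/views implementation:

```python
from datetime import date
import hashlib

class ViewCounter:
    """Count one view per (path, hashed IP, day). IPs are hashed so raw
    addresses are never stored."""
    def __init__(self):
        self.seen = set()     # (path, ip_hash, day) triples already counted
        self.counts = {}      # path -> deduplicated view count

    def record(self, path, ip, day=None):
        day = day or date.today().isoformat()
        ip_hash = hashlib.sha256(ip.encode()).hexdigest()
        key = (path, ip_hash, day)
        if key in self.seen:
            return False      # same visitor, same page, same day: skip
        self.seen.add(key)
        self.counts[path] = self.counts.get(path, 0) + 1
        return True

counter = ViewCounter()
counter.record("/benchmark/humaneval", "203.0.113.7", "2026-03-01")
counter.record("/benchmark/humaneval", "203.0.113.7", "2026-03-01")  # deduped
counter.record("/benchmark/humaneval", "203.0.113.7", "2026-03-02")  # new day counts
print(counter.counts)  # {'/benchmark/humaneval': 2}
```

A production endpoint would back the set with a store that expires keys after a day, but the counting logic is the same.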
Building Blocks: Complex Factory Architecture Diagrams with Mermaid
- New Factorio-inspired Factory View upgraded from simple cards to dense operations plots
- Mermaid architecture diagram added to /building-blocks with shared bus, ingress, processing stages, and delivery nodes
- Control lanes visualized: observability, cost guardrails, and cache reuse edges across pipelines
- Per-pipeline complexity dashboard added: assembler counts, implementation counts, modality breadth, and complexity index
- Factory section now uses local Mermaid rendering for maintainable, code-driven diagram updates
The Building Blocks page now includes a full architecture diagram system instead of lightweight flow cards. Pipelines are rendered as structured flow graphs with clear stage transitions and shared infrastructure links, then paired with a denser operations dashboard view for enterprise planning discussions. This makes the page more useful for technical stakeholders who need to reason about real system topology, control paths, and operational complexity.
LLM Page Refresh, Hardware GPU Guide Update, OCR Python Comparison
- LLM benchmarks page updated for 2026: new SOTA scores across 16 benchmarks, updated model families (Claude 4.6, GPT-5, Llama 4, Gemini 2.5)
- Hardware GPU guide rewritten with 2026 data: RTX 5090/5080 specs, H200/B200 datacenter GPUs, buy vs rent analysis, Apple Silicon section
- Best OCR Libraries for Python updated: 6 libraries compared with live benchmark testing, 3 new comparison charts
- HumanEval page updated for 2026: leaderboard reflects current SOTA (96.3%), saturation clearly noted
- Price-performance chart regenerated with all consumer GPUs properly ordered
Major content refresh across high-traffic pages. The LLM page removes 'Coming Soon' sections and now shows live SOTA data including SWE-bench jumping from 49% to 80.9%. Hardware page expanded from 516 to 1,114 lines with RTX 5090 benchmarks, cloud GPU pricing from 6 providers, VRAM requirement tables, and buy vs rent break-even analysis. OCR Python comparison now includes Surya and DocTR alongside PaddleOCR, Tesseract, EasyOCR, and RapidOCR with real benchmark numbers.
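The buy vs rent break-even analysis in the guide comes down to one division: purchase price over the hourly saving of owning versus renting. A sketch with illustrative numbers, not the guide's actual pricing:

```python
def break_even_hours(purchase_price, power_draw_w, electricity_per_kwh, rent_per_hour):
    """Hours of GPU use at which buying beats renting. Owning costs
    electricity per hour; renting costs the cloud hourly rate."""
    net_hourly_saving = rent_per_hour - (power_draw_w / 1000) * electricity_per_kwh
    if net_hourly_saving <= 0:
        return float("inf")  # electricity alone exceeds rent: never buy
    return purchase_price / net_hourly_saving

# Illustrative: $2,000 GPU, 450 W draw, $0.30/kWh, $0.80/hr cloud rate
hours = break_even_hours(2000, 450, 0.30, 0.80)
print(round(hours))  # ≈ 3008 hours of use to break even
```

Real break-even math would also fold in resale value and utilization, but the shape of the calculation is the same.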
7 New Benchmark Editorial Pages
- ImageNet editorial: SOTA timeline (63.3% → 92.7%), 15+ model leaderboard, transfer learning matrix visualization
- COCO editorial: object detection leaderboard, AP metrics breakdown, detection pipeline explainer
- SQuAD editorial: question answering benchmark with F1/EM scores, reading comprehension analysis
- GLUE/SuperGLUE editorial: NLU benchmark suite, saturation analysis, task breakdown
- GSM8K editorial: math reasoning benchmark, chain-of-thought analysis, LLM comparison
- HumanEval editorial: code generation benchmark, pass@1 leaderboard, language coverage
- SWE-bench Code Generation: editorial differentiating code models vs agent scaffolds
- 20+ custom visualizations generated across all pages
Seven new comprehensive benchmark editorials covering computer vision, NLP, reasoning, and code generation. Each page features researched leaderboard data, SOTA timelines, custom Python-generated visualizations, and domain-specific analysis.
SWE-bench, OmniDocBench & ADE20K Editorial Pages
- SWE-bench editorial: SOTA timeline (1.96% → 80.9%), 15-model leaderboard, 8 key papers, evaluation pipeline breakdown
- OmniDocBench editorial: CVPR 2025 document parsing benchmark with full leaderboard and metric breakdowns
- ADE20K upgraded: 23-model leaderboard, Python-generated visualizations, real dataset samples from HuggingFace
- 6 custom matplotlib visualizations for ADE20K (SOTA timeline, class distribution, accuracy vs params, color palette, stuff vs things)
- New /find page: 4-step benchmark finder wizard with email-gated results
Three comprehensive editorial pages now live — SWE-bench covers the definitive AI coding agent benchmark with full SOTA history from Claude 2's 1.96% to Claude Opus 4.5's 80.9%. OmniDocBench covers Shanghai AI Lab's CVPR 2025 document parsing benchmark. ADE20K rebuilt with real dataset images, Python-generated visualizations, and 23-model leaderboard. Plus a new Benchmark Finder wizard at /find for lead capture.
11 Interactive Paradox Explainers
- 11 3Blue1Brown-style interactive explainers covering mathematical paradoxes
- Stein's Paradox: How wheat prices help predict baseball averages
- Will Rogers Phenomenon: Stage migration in cancer survival statistics
- Berkson's Paradox: Selection bias in dating and hospitals
- Low Birth Weight Paradox: Simpson's paradox in epidemiology
- Schelling's Segregation: Agent-based model with smart movement algorithm
- Ross-Littlewood Paradox: Infinite sets and supertasks
- Banach-Tarski Paradox: Measure theory and the axiom of choice
- Newcomb's Paradox: Decision theory and free will
- Arrow's Impossibility Theorem: Why fair voting is mathematically impossible
- Cobra Effect: Incentive design and Goodhart's Law
- Grossman-Stiglitz Paradox: Why efficient markets can't exist
Massive expansion of the Explainers section with 11 comprehensive interactive paradox explainers. Each features multiple interactive simulations, games, and visualizations that let users experience the paradox firsthand. Schelling's Segregation includes smart agent movement for better convergence at high thresholds.
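The core of the Schelling simulation fits in a few lines. A compact sketch with simplifying assumptions: a toroidal grid with 8-neighborhoods, and unhappy agents relocating to a random empty cell rather than the explainer's smart movement algorithm:

```python
import random

def schelling_step(grid, threshold=0.3):
    """One round of Schelling's segregation model: every agent whose
    fraction of like neighbors falls below `threshold` moves to a random
    empty cell. Grid cells hold an agent label or None."""
    n = len(grid)

    def like_fraction(r, c):
        me = grid[r][c]
        neighbors = [grid[(r + dr) % n][(c + dc) % n]
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (dr, dc) != (0, 0)]
        occupied = [x for x in neighbors if x is not None]
        return 1.0 if not occupied else sum(x == me for x in occupied) / len(occupied)

    empties = [(r, c) for r in range(n) for c in range(n) if grid[r][c] is None]
    for r in range(n):
        for c in range(n):
            if grid[r][c] is not None and like_fraction(r, c) < threshold and empties:
                dest = random.choice(empties)
                empties.remove(dest)
                grid[dest[0]][dest[1]] = grid[r][c]
                grid[r][c] = None
                empties.append((r, c))

random.seed(0)
cells = ["A"] * 40 + ["B"] * 40 + [None] * 20
random.shuffle(cells)
grid = [cells[i * 10:(i + 1) * 10] for i in range(10)]
for _ in range(20):
    schelling_step(grid)
```

Even with this naive random relocation, clusters emerge from a mild 30% preference, which is the paradox the explainer lets readers experience directly.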
Rys OCR: Polish SOTA OCR Model (Research Preview)
- First fine-tune of Polish OCR model released on HuggingFace
- 71.3% Character Error Rate (CER) reduction on Polish text
- 46.1% Word Error Rate (WER) reduction
- LoRA fine-tune on PaddleOCR-VL base model
- Optimized for Polish diacritics: ą, ć, ę, ł, ń, ó, ś, ź, ż
- Runs on consumer hardware (4-6 GB VRAM)
- Apache 2.0 license, fully open source
- Call for contributions: datasets, benchmarks, R&D collaboration
Rys OCR is the first release in ongoing R&D to build state-of-the-art Polish text recognition. Trained on 10,000 synthetic Polish document images (addresses, invoices, receipts, dates, names). Looking for contributors to help with real Polish datasets, benchmark evaluations, and model improvements.
21 New 3Blue1Brown-Style Interactive Explainers
- 21 new explainer components with interactive visualizations
- Face Anonymization: detection pipelines, blurring vs pixelation vs generative inpainting
- PII Detection: entity types, confidence thresholds, redaction strategies
- Text Reranking: bi-encoder vs cross-encoder, relevance scoring
- Hallucination Detection: factual grounding, source verification methods
- Hybrid Retrieval: BM25 + dense vector fusion, reciprocal rank fusion
- Controllable Generation: temperature, top-k/p, repetition penalty, CFG
- Chart Understanding: chart type detection, data extraction pipelines
- Question Answering: extractive vs abstractive, span prediction
- Long Context Summarization: chunking strategies, hierarchical approaches
- Video-to-Text: frame sampling, temporal understanding, captioning
- Code Generation: syntax-aware models, repair and completion
- Audio/Video processing: emotion recognition, action recognition, tracking
- 50+ total building blocks now have interactive explainers
Massive expansion of Building Blocks with 21 new 3Blue1Brown-style interactive explainers. Each component features step-by-step visualizations, architecture diagrams, and practical code examples. Covers advanced AI capabilities from face anonymization to hallucination detection to video understanding.
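The reciprocal rank fusion covered in the Hybrid Retrieval explainer is short enough to show in full: each document scores the sum of 1/(k + rank) over every ranked list it appears in. The document IDs below are made up for illustration:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists. Each document accumulates
    1 / (k + rank) per list (rank is 1-based); k=60 is the constant from
    the original RRF formulation. Returns docs sorted by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical BM25 and dense-retriever result lists for one query:
bm25 = ["doc3", "doc1", "doc7"]
dense = ["doc1", "doc5", "doc3"]
print(reciprocal_rank_fusion([bm25, dense]))
# ['doc1', 'doc3', 'doc5', 'doc7'] — doc1 wins by appearing high in both
```

Because RRF only uses ranks, it fuses BM25 and dense scores without any score normalization, which is why the explainer pairs it with hybrid retrieval.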
Next.js 16 Migration & OCR Labeling Platform
- Complete migration from Astro to Next.js 16.1.1 with App Router
- New OCR Labeling Platform: Upload images, get bounding boxes via DOTS OCR (Replicate)
- Human-in-the-loop data flywheel for document processing quality improvement
- 27+ interactive explainer components migrated to React (LLM, VLM, TTS, etc.)
- CodeBlock component with Prism.js syntax highlighting and .ipynb download
- All dynamic routes fixed for Next.js 15+ async params pattern
- New /benchmark/[id] and /[area]/compare/[...slug] pages
- TypeScript compilation verified across entire codebase
Major infrastructure release migrating from Astro to Next.js for better performance, SSR, and React ecosystem integration. The new OCR Labeling Platform enables community-driven quality improvement: upload documents, review AI-extracted text with bounding boxes, and submit corrections. Interactive Building Blocks explainers (LLMExplainer, ImageCaptioningExplainer) now use React with useState for full interactivity.
Comprehensive SOTA Editorials for Major AI Areas
- Added expert editorials for 10 major AI areas with 500+ citations
- Speech: Whisper, Conformer, XTTS voice cloning, sub-200ms TTS latency insights
- NLP: GPT-5, Claude 3.5, DeepSeek-V3 comparisons, RAG adoption patterns
- Computer Code: SWE-bench leaders, RLVR training paradigm, package hallucination risks
- Reasoning: o3/o4-mini math performance, test-time compute scaling analysis
- Multimodal: Open-source parity (InternVL3, Molmo 2), hallucination mitigation
- Agentic AI: METR benchmarks, MCP/A2A protocols, production deployment reality
- Audio: Suno v4.5 music generation, MSEB benchmark gaps, mHuBERT edge deployment
- Robotics: OpenVLA 7B outperforming RT-2-X 55B, COLOSSEUM brittleness findings
- Medical: GPT-4o USMLE 90.4%, FDA approval generalization gaps, BoltzGen drug discovery
- 500+ citations from peer-reviewed papers (NeurIPS, ICML, CVPR, ACL)
Major content release adding practitioner-focused editorials to all major AI areas. Each editorial includes: State of the Field with specific benchmark scores, Hot Takes with honest insights, and Production Recommendations for different use cases. Comprehensive research with 500+ citations from peer-reviewed papers and major conferences.
The Zen of AI Composition: Free PDF Released
- Book now available for free download, no email required
- Direct PDF download with download counter tracking
- Building intelligent systems from first principles
- Three parts: Nature of Composition, Transformations, Practice
"The Zen of AI Composition" is now available as a free PDF download. A philosophical guide to building intelligent systems - covering the history of AI transformations, modular composition, and evidence-based prompting techniques.
The Zen of AI Composition: Book Early Access
- New book landing page with early access signup
- Double opt-in email confirmation via Resend
- Admin notifications on confirmed signups
- Book covers AI composition from first principles
- Three parts: Nature of Composition, Transformations, Practice
Announcing "The Zen of AI Composition" - a philosophical guide to building intelligent systems. Sign up for early access to receive the book first and help shape the final version. Double opt-in email flow ensures only engaged readers join the list.
Decision Tools: Model Comparator, Verification Protocol, Intent Analytics
- Interactive Model Comparator: Select 2-4 OCR models for side-by-side comparison
- Failure mode comparison: diacritics, tables, stamps, handwriting, low quality
- Shareable comparison URLs with query params for team decisions
- Verification Protocol page: 5-step benchmark verification process
- VERIFIED badge schema: dataset hash, prompt/config, runtime, cost, metric code
- Three verification tiers: Self-Reported, CodeSOTA Verified, Continuous Monitoring
- Decision intent analytics: scroll depth, time on page, CTA clicks, outbound tracking
- Atropos LLM RL guide: Nous Research framework for OCR evaluation and training
- Standalone OCR evaluation script for testing vision models on OCR-VQA
Continues the strategic transformation with enterprise decision tools. The Model Comparator lets teams compare 10 OCR models across 8 metrics with failure mode analysis. Verification Protocol establishes trust through transparent methodology. Decision intent analytics track how users make choices, enabling continuous improvement of the decision platform.
OCR Decision Platform: From Catalog to Decision Engine
- New canonical OCR Decision Guide page with failure taxonomy focus
- Homepage transformed: OCR hero with 90-second clarity messaging
- Failure taxonomy: diacritics, column bleed, numeric substitution, table collapse, stamp interference
- Decision matrix: "If your priority is X, choose Y" format
- Private OCR Evaluation Preview with waitlist signup
- Independence & Conflict of Interest Policy on methodology page
- GDPR compliance and EU data residency messaging
- EvaluationCTA component added to all 5 comparison pages
- Navigation updated: OCR highlighted first in cyan
Major strategic release transforming CodeSOTA from "benchmark catalog" to "decision platform". The new /ocr/decision page is the canonical OCR decision artifact - focusing on failure modes (what breaks) rather than accuracy percentages. Homepage now leads with OCR, includes 90-second clarity test (Who/What/Why/Next), and features Private Evaluation Preview. All comparison pages now include Request Evaluation CTAs.
Agentic AI Benchmarks: METR Time Horizon & Path to AGI
- New Agentic AI page: METR benchmarks tracking autonomous AI capabilities
- Time Horizon leaderboard: GPT-5.1-Codex-Max (160 min), GPT-5, o1-preview, Claude 3
- HCAST, RE-Bench, SWAA task suite breakdowns
- Interactive benchmark saturation chart (JS/Chart.js) with category views
- Building blocks now connected to all /browse/[area] pages
- 27 benchmarks across 8 categories including new Agentic category
- 7-month doubling time trend analysis for AGI timeline
Major release focused on agentic AI capabilities - the key metric for AGI progress. Added comprehensive METR benchmark tracking including time horizon (how long AI can work autonomously) and HCAST scores. Interactive JS-based saturation charts show how different benchmark categories are approaching ceiling performance. Browse pages now show relevant building blocks for each research area.
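The 7-month doubling time trend is a one-line extrapolation. A sketch of the arithmetic, purely illustrative rather than a METR forecast:

```python
def projected_horizon(current_minutes, months_ahead, doubling_months=7):
    """Project a time horizon forward assuming a fixed doubling time:
    horizon * 2^(months / doubling_time)."""
    return current_minutes * 2 ** (months_ahead / doubling_months)

# Starting from the 160-minute leaderboard entry above, 14 months out
# is exactly two doublings:
print(round(projected_horizon(160, 14)))  # 640 minutes
```

The saturation charts on the page visualize exactly this kind of exponential against observed scores.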
Building Blocks: 6 More Interactive Explainers
- Image Captioning: VLM deep-dive with LLaVA, Qwen2-VL, BLIP-2, GPT-4V architecture comparison
- Text-to-Video: Sora, Runway Gen-3, CogVideoX, Diffusion Transformer (DiT) architecture
- Image-to-Image: Inpainting, outpainting, super resolution, ControlNet, IP-Adapter
- Text-to-3D: DreamFusion, Shap-E, MVDream, LGM, Score Distillation Sampling explained
- Image-to-Video: Stable Video Diffusion, AnimateDiff, LivePortrait, Runway API
- Depth Estimation: Enhanced with real example images (mountain, street, indoor, portrait)
- Now 14+ comprehensive explainers covering all major AI modalities
Continued expansion of Building Blocks with 6 new interactive explainers. Each includes architecture diagrams, model evolution timelines, practical code examples, and performance comparisons. Depth estimation now features real input/output examples with turbo colormap visualization.
Building Blocks Expansion: 8 Interactive Explainers
- Object Detection: YOLO evolution (v1-v11), NMS, two-stage vs single-stage, mAP metrics
- Image Segmentation: SAM 2, semantic/instance/panoptic types, mask formats, Mask2Former
- Depth Estimation: Depth Anything v2, ZoeDepth, Marigold, metric vs relative depth
- Image to 3D: Gaussian Splatting, NeRF, Trellis, single-image 3D generation
- Speech Recognition: Whisper deep-dive, turbo vs large-v3, faster-whisper, diarization
- 27+ building blocks covering vision, NLP, audio, video, and 3D modalities
- Comprehensive code examples for each modality with multiple frameworks
- Interactive visualizations: attention matrices, depth colormaps, architecture diagrams
Massive expansion of the Building Blocks section. Added 8 comprehensive interactive explainers covering the core AI modalities. Each explainer includes architecture deep-dives, model evolution timelines, practical code examples (YOLO, SAM, Whisper, etc.), and interactive visualizations. The goal: be the best resource for understanding how each AI capability actually works.
Modular Benchmark Runner & Mistral OCR 2512 Verification
- New modular benchmark runner system with pluggable backends
- Mistral OCR 2512 (Mistral 3 OCR) verified and tested
- Stanford Churro (CHURRO-DS) benchmark integration
- OCRBench v2 runner with official evaluation support
- HTTP API daemon for remote GPU benchmark execution
- Checkpoint-based resumable benchmark runs
- Automated results sync to website data files
Major infrastructure release: The benchmark-runner now supports modular benchmark backends (Mistral OCR, Churro, OCRBench v2, OmniDocBench). Verified Mistral OCR 2512 performance: 9 pages in 7.37 seconds with high-quality markdown output. Stanford Churro integration enables historical document OCR benchmarks across 46 languages. All runners support checkpointing for resumable runs.
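Checkpoint-based resumption boils down to flushing results after every sample and skipping already-evaluated IDs on restart. A minimal sketch of that shape; the file format and names are illustrative, not the benchmark-runner's actual code:

```python
import json
import os

def run_benchmark(samples, evaluate, checkpoint_path="checkpoint.json"):
    """Resumable benchmark loop: results are written after each sample,
    so a crashed or interrupted run picks up where it left off.

    samples:  iterable of (sample_id, sample) pairs
    evaluate: callable scoring one sample
    """
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)  # resume from previous run
    for sample_id, sample in samples:
        if sample_id in results:
            continue  # already evaluated before the interruption
        results[sample_id] = evaluate(sample)
        with open(checkpoint_path, "w") as f:
            json.dump(results, f)  # flush after every sample
    return results
```

Writing the whole result dict each time is wasteful for large runs (an append-only JSONL file scales better), but it keeps the resume logic trivially correct.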
Interactive AI Explainers: LLM & TTS Deep Dives
- LLM Explainer: How transformers work with 5 interactive sections
- TTS Explainer: Complete text-to-speech pipeline visualization
- Interactive tokenization demo with BPE explanation
- Attention mechanism visualization with clickable matrix
- Next token prediction with probability distributions
- Mel spectrogram and waveform canvas visualizations
- Voice cloning methods: speaker embedding, in-context learning, fine-tuning
- Neural codec language model explanation (VALL-E, ElevenLabs-style)
Two comprehensive interactive explainers for the Building Blocks pages. The LLM explainer covers tokenization, embeddings, attention, next-token prediction, and transformer architecture. The TTS explainer covers text normalization, G2P, prosody, acoustic models, mel spectrograms, vocoders, and zero-shot voice cloning. All with interactive canvas-based visualizations.
Building Blocks & Editorial Guides
- Building Blocks: Modular AI capabilities taxonomy (image-to-vector, text-to-vector, etc.)
- Editorial Guides for 3 personas: Executives, Enthusiasts, Researchers
- Executive Guide: Document Processing Technology Matrix with vendor comparison
- Enthusiast Guide: SOTA Tracker with current leaders and reproduction tips
- Research Guide: ML Landscape 2025 with trend analysis and gap identification
- Data Flywheel page explaining community-driven benchmark growth
- LLM and Object Detection hub pages
- PWC Archive: 1,519 papers, 464 models, 145 datasets integrated
Major release introducing Building Blocks - a new way to think about AI capabilities as modular transformations (image-to-vector, audio-to-text, etc.). Added comprehensive editorial guides for different user personas: CTOs get vendor comparison matrices, enthusiasts get SOTA tracking tools, researchers get trend analysis. All built on Papers with Code archive data.
SEO Improvements & Production Auth
- -Papers with Code alternative page SEO optimization
- -Added FAQ section targeting "People Also Ask"
- -Clerk production authentication with GitHub OAuth
- -User work profile preferences in dashboard
- -Sitemap fixed with correct www domain
- -Removed custom analytics (using Vercel Analytics)
SEO improvements for the Papers with Code story page including optimized title, meta description, FAQ section, and internal links. Switched Clerk to production mode with GitHub OAuth. Added work profile preferences feature for logged-in users to describe their ML focus areas.
User Accounts & Email Capture
- -User authentication via Clerk (GitHub OAuth)
- -Protected dashboard for authenticated users
- -Sign-in and sign-up pages with dark theme styling
Added user account system using Clerk for authentication. Users can sign in with GitHub. Protected routes redirect unauthenticated users to sign-in.
CodeSOTA Polish OCR Benchmark
- -1,000 synthetic and real Polish text images with ground truth
- -4 categories: synth_random, synth_words, real_corpus, wikipedia
- -5 degradation levels: clean, light, medium, heavy, severe
- -Tesseract 5.5.1 baseline: 26.3% CER overall
- -Contamination-resistant design exposes LM dependence (52% vs 5% CER)
- -Dedicated Polish OCR page with category breakdown and key findings
Our own Polish OCR benchmark, designed to separate language-model reliance from pure character recognition. Synthetic categories (no dictionary fallback) show 10x worse performance than real text, exposing heavy dependence on statistical language models. The dataset includes five degradation levels generated with Augraphy to simulate real document-scanning conditions.
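For reference, the CER numbers above are Levenshtein edit distance divided by reference length. A minimal pure-Python version (illustrative, not our evaluation harness):

```python
def cer(reference, hypothesis):
    """Character error rate: edit distance between the OCR output and
    the ground truth, divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,       # deletion
                         cur[j - 1] + 1,    # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)
```

Dropping every diacritic in "zażółć" costs four substitutions out of six characters, i.e. a CER of about 0.67, which is why diacritic handling dominates Polish OCR scores.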
Mistral OCR 3 Added
- -New Mistral OCR 3 model (mistral-ocr-2512) added to benchmarks
- -Dedicated review page with pricing, code examples, benchmarks
- -94.9% claimed accuracy, 74% win rate over OCR 2
- -$2/1000 pages ($1 with batch API)
- -Benchmark results: olmOCR-bench, CER, WER metrics
Added comprehensive coverage of Mistral OCR 3, released in December 2025. Includes benchmark comparisons with GPT-4o and PaddleOCR, pricing analysis, code examples, and use-case recommendations.
Featured Guides on Landing Page
- -New "In-Depth Comparisons" section on homepage with image cards
- -6 featured editorial guides: OCR comparisons, Audio AI, Medical Radiology
- -Images surfaced for better Google Images discovery
- -Internal linking boost for deep content pages
Landing page now showcases editorialized content with images for better SEO discovery. Featured guides include PaddleOCR vs Tesseract, GPT-4o vs PaddleOCR, Best OCR for Invoices, Best OCR for Handwriting, Audio AI Benchmarks, and Chest X-ray AI Models.
Audio AI Benchmarks: Classification, Music Generation & More
- -Audio AI overview: AudioSet, ESC-50, music generation, audio captioning
- -Audio Classification deep-dive: BEATs at 0.498 mAP, 98.1% on ESC-50
- -Music Generation comparison: Suno, Udio, MusicGen, Stable Audio
- -7 custom visualizations: spectrograms, waveforms, model comparisons
- -Evaluation metrics explained: mAP, FAD, MOS, CLAP scores
- -Architecture breakdowns: CNN vs Vision Transformer vs Audio Tokenizer
Comprehensive Audio AI vertical with 3 editorial pages covering classification (AudioSet, ESC-50), music generation (Suno, Udio, MusicGen), and audio understanding (Qwen2-Audio, SALMONN). Includes custom-generated visualizations and practical model recommendations by use case.
GPU Hardware Benchmarks: RTX 3090 vs 4090 vs 5090
- -Compare RTX 3090, 4090, 5090 for ML workloads
- -LLM inference: Llama 3, Mistral, with tokens/sec metrics
- -Image generation: SDXL, Flux, SD 1.5 benchmarks
- -Training: LoRA fine-tuning, YOLO, ResNet performance
- -VRAM requirements guide: which models fit on which GPU
- -Cloud GPU pricing from RunPod, vast.ai, Lambda Labs
New Hardware section with comprehensive GPU comparison for ML. Includes specs, real-world benchmarks across LLM inference, image generation, training, and computer vision. Features recommendations on which GPU to buy and cloud pricing comparison.
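A tokens/sec figure like the ones on these pages can be produced with a crude timing harness along these lines (`generate` is a stand-in for whatever inference call you benchmark; a real run would also pin batch size, precision, and context length):

```python
import statistics
import time

def tokens_per_second(generate, prompt, n_tokens, runs=3):
    """Crude LLM throughput probe: time a few generation calls and
    report the median tokens/sec (median damps scheduler/thermal noise)."""
    generate(prompt, n_tokens)  # warm-up run (CUDA init, weight caching)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt, n_tokens)
        samples.append(n_tokens / (time.perf_counter() - start))
    return statistics.median(samples)
```

The warm-up call matters on GPUs: the first invocation often pays one-time kernel compilation and memory-allocation costs that would otherwise skew the numbers.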
Polish OCR Benchmarks
- -4 Polish OCR datasets: PolEval 2021, IMPACT-PSNC, reVISION, Polish EMNIST
- -Models: Tesseract Polish, ABBYY FineReader, HerBERT, Polish RoBERTa
- -Best CER: 2.1% on PolEval 2021, 97.5% word accuracy on IMPACT
- -Covers diacritics challenges and gothic font recognition
Dedicated Polish OCR benchmark page covering historical documents from 1791-1998, gothic fonts, and Polish diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż). Features both raw OCR engines and NLP post-correction approaches.
Industrial Anomaly Detection Benchmarks
- -8 industrial datasets: MVTec AD, VisA, weld defects, steel defects
- -12 anomaly detection models: PatchCore, EfficientAD, SimpleNet, FastFlow
- -Best AUROC: 99.6% (SimpleNet on MVTec AD)
- -Covers weld inspection, steel defects, surface inspection
- -Three approaches: Memory Bank, Normalizing Flows, Student-Teacher
Industrial inspection vertical for manufacturing quality control. Covers anomaly detection for defects, weld X-ray inspection, and steel surface analysis. Includes metrics explainer for AUROC vs PRO.
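The AUROC metric used throughout this vertical has a simple rank-based reading: the probability that a randomly chosen defect scores higher than a randomly chosen good part. A toy implementation (not our evaluation code):

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: fraction of
    (anomaly, normal) pairs where the anomaly scores higher
    (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise view also makes clear why AUROC saturates near 99%+ on MVTec AD while the stricter per-pixel PRO metric still separates models.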
Chest X-Ray AI: Radiology Benchmarks Launch
- -7 chest X-ray datasets: CheXpert, MIMIC-CXR, NIH ChestX-ray14, VinDr-CXR, PadChest, RSNA, COVID-19
- -15 radiology AI models: CheXNet, CheXzero, TorchXRayVision, MedCLIP, GLoRIA, BioViL
- -20+ benchmark results with AUC scores across datasets
- -Interactive cross-dataset comparison chart
- -Data pipeline explainer: DICOM to multi-label classification
Major expansion of Medical AI with a dedicated Chest X-Ray benchmark page. Features 900K+ images across 7 major datasets, leaderboard sorted by CheXpert AUC, and cross-dataset generalization analysis. Covers the rise of vision-language models (CLIP-based) and the label noise problem in radiology.
SEO & Accessibility Improvements
- -Dataset schema on benchmark pages for Google Dataset Search
- -Dynamic meta descriptions with SOTA model and scores
- -FAQPage schema on Speech and Code Generation pages
- -Canvas accessibility: aria-labels and fallback text on DocumentScanner
- -BreadcrumbList schema for improved navigation structure
Major SEO improvements following audit recommendations. Benchmark pages now include schema.org/Dataset structured data for visibility in Google Dataset Search. Meta descriptions dynamically include the current SOTA model and score. Speech and Code Generation verticals now have FAQPage schema for rich snippets.
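The Dataset markup boils down to emitting a schema.org JSON-LD blob per benchmark page. A minimal sketch (the helper name and example values are illustrative; the properties shown are a subset of what Google Dataset Search accepts):

```python
import json

def dataset_jsonld(name, description, url, license_url):
    """Build a schema.org/Dataset JSON-LD string for a benchmark page
    so it can surface in Google Dataset Search."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
    }, indent=2)
```

The resulting string is embedded in the page inside a `<script type="application/ld+json">` tag.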
Major Content Expansion: 6 New Verticals
- -NLP vertical: GLUE, SuperGLUE, SQuAD benchmarks with 20+ models
- -Speech vertical: Whisper vs Azure, LibriSpeech benchmarks
- -Multimodal vertical: VQA, image captioning, GPT-4V vs Gemini
- -Reasoning vertical: MATH, GSM8K, GPQA, o1 vs GPT-4 comparison
- -LLM comparison hub: GPT-4 vs Claude head-to-head analysis
- -Code generation: best-for Python, JavaScript, debugging guides
- -OCR expansion: receipts, tables, multilingual, 3 new comparisons
20+ new pages across 6 research verticals. Each vertical includes landing pages, benchmark deep dives, and model comparisons. Navigation updated to include NLP, Speech, Multimodal, Reasoning, and LLM sections.
OCR Arena Speed vs Quality Visualization
- -Interactive scatter plot: ELO score vs latency
- -18 models from OCR Arena human preference rankings
- -Green dots for open source, red for closed/API
- -Key insights: best quality, best balance, fastest
- -Full rankings table with win rates and battle counts
New visualization page showing the speed vs quality tradeoff for OCR models based on human preference data from OCR Arena. Helps identify Pareto-optimal models for different use cases.
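The "Pareto-optimal" framing can be made concrete in a few lines: a model is on the frontier when no other model beats it on both quality and speed at once (toy data below, not the actual Arena numbers).

```python
def pareto_frontier(models):
    """models: list of (name, elo, latency_s) tuples. A model is kept
    when no other model is strictly better on both axes
    (higher ELO and lower latency)."""
    frontier = []
    for name, elo, lat in models:
        dominated = any(e > elo and l < lat
                        for n, e, l in models if n != name)
        if not dominated:
            frontier.append(name)
    return frontier
```

Anything off the frontier is a strictly worse choice for every use case; the interesting decisions all live along it, which is what the scatter plot visualizes.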
CodeSOTA Meta-Benchmark Score
- -Aggregate score across 8 OCR benchmarks
- -Weighted scoring: primary (3x), secondary (2x), tertiary (1x)
- -Interactive heatmap: models vs benchmarks
- -Coverage tracking: see which models need testing
- -Testing priority list for contributors
Introducing the CodeSOTA Score - a single number to compare OCR models across multiple benchmarks. Primary benchmarks (OmniDocBench, OCRBench v2, olmOCR-Bench) weighted 3x, secondary (CHURRO-DS, CC-OCR) 2x, language-specific 1x. Visual heatmap shows exactly where data is missing.
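The weighting scheme reduces to a weighted average over whatever benchmarks a model has actually been run on. A sketch (benchmark names and scores are illustrative):

```python
# Tier weights as described above: primary 3x, secondary 2x, tertiary 1x.
WEIGHTS = {"primary": 3, "secondary": 2, "tertiary": 1}

def codesota_score(results, tiers):
    """Weighted average of per-benchmark scores (0-100). Benchmarks a
    model was never run on are skipped entirely, so coverage matters."""
    num = den = 0
    for bench, score in results.items():
        w = WEIGHTS[tiers[bench]]
        num += w * score
        den += w
    return num / den if den else None
```

Because missing benchmarks are skipped rather than zeroed, two models with different coverage are not directly comparable, which is exactly what the coverage heatmap and testing-priority list address.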
Papers With Code Database Integration
- -1,500+ benchmark results from PWC archive
- -SOTA Timeline: interactive hill-climbing charts
- -146 datasets, 464 models indexed
- -15 research areas with 70+ tasks defined
- -NLP, Reasoning, Code, Speech, Medical and more
Major release integrating the full Papers With Code archive. Browse historical benchmark results with the classic "hill climbing" visualization showing SOTA progression over time. All major research areas now have defined tasks - NLP (9 tasks), Reasoning (5), Code (6), Speech (5), and more.
Papers With Code Story
- -Complete history of Papers With Code (2018-2025)
- -Why it mattered for ML research
- -What was lost when Meta shut it down
- -Why CodeSOTA exists to fill the gap
- -Cost vs Quality frontier graph on vendors page
New page explaining the story of Papers With Code - what it was, why it was invaluable, and why there is a vacuum after Meta "sunsetted" it in July 2025. Also added interactive cost vs quality graph to the vendors comparison.
Homepage Redesign & OCR Vendors Page
- -New hero section: "State of the Art, Verified"
- -Papers With Code successor positioning
- -OCR Vendors comparison page with 9 vendors
- -Decision matrix for different use cases
- -LinkedIn banner for social media
Major update to homepage positioning CodeSOTA as the next generation of ML benchmarking. New OCR vendors page consolidates all options (Mistral, Docling, GPT-4o, PaddleOCR, Tesseract, Google Doc AI, Azure, doctr, Chandra) with practical decision guidance.
Mistral OCR Documentation
- -Mistral OCR API guide with Python examples
- -Benchmark claims: 94.9% accuracy, 2000 pages/min
- -Pricing comparison: $0.001/page vs competitors
- -Independent testing caveats documented
- -Mistral vs Docling comparison table
Added comprehensive documentation for Mistral OCR API. Includes both official benchmark claims and independent testing results showing mixed performance on complex layouts.
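A minimal call shape for the API, based on Mistral's public OCR documentation at the time of writing; treat the endpoint and field names as assumptions to verify against the current docs. The network call only fires when `MISTRAL_API_KEY` is set, so the payload can be inspected offline.

```python
import json
import os
import urllib.request

def ocr_request(document_url, model="mistral-ocr-latest"):
    """Build the JSON payload for a Mistral OCR call (field names per
    Mistral's docs; verify against the current API reference)."""
    return {
        "model": model,
        "document": {"type": "document_url", "document_url": document_url},
    }

payload = ocr_request("https://example.com/scan.pdf")  # placeholder URL

if os.environ.get("MISTRAL_API_KEY"):  # only hit the network with a key
    req = urllib.request.Request(
        "https://api.mistral.ai/v1/ocr",  # assumed endpoint; check docs
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    print(urllib.request.urlopen(req).read().decode())
```

The response contains per-page markdown, which is what our verification runs diff against ground truth.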
Docling Tutorial Verified
- -All code executed and verified on real documents
- -Real outputs: 33,201 characters of Markdown from a 10-page PDF in 34.95s
- -Table extraction verified: 3 tables with CSV export
- -Downloadable artifacts from actual test run
- -Performance metrics from Apple Silicon with MPS acceleration
The Docling tutorial now includes real, verified outputs from processing the Docling arXiv paper. No more AI-generated placeholder snippets - every code block has been executed and the actual results are shown.
Docling Documentation Added
- -Complete Docling documentation following Diataxis framework
- -Tutorial: PDF to Markdown conversion
- -How-To Guides: OCR engines, table extraction, RAG integration
- -Technical Reference: API docs, model specs
- -Explanation: Architecture deep-dive
IBM's Docling represents a significant shift in document processing - using computer vision instead of traditional OCR. We've added comprehensive documentation to help you evaluate and integrate it.
Chandra OCR Benchmark Data
- -Added Chandra OCR 0.1.0 benchmark results
- -Top performer on olmOCR-Bench at 83.1%
- -Comparison data against PaddleOCR-VL, MinerU, Marker
Chandra OCR from Allen Institute AI now leads the olmOCR-Bench leaderboard. We've added comprehensive benchmark data to help you compare it against other solutions.
Document Scanner Tutorial
- -Full document scanning pipeline with OpenCV
- -Edge detection, perspective correction, enhancement
- -Interactive demo with sample images
- -Integration guide with OCR engines
Learn to build a document scanner that detects edges, corrects perspective, and enhances scanned images. Includes full Python code and interactive examples.
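The geometric core of the pipeline, ordering the four detected corners and sizing the deskewed output, can be sketched in pure Python. With OpenCV installed, `cv2.getPerspectiveTransform` and `cv2.warpPerspective` then complete the correction; the tutorial covers the full version.

```python
from math import hypot

def order_corners(pts):
    """Order 4 detected corner points as top-left, top-right,
    bottom-right, bottom-left (the order cv2.getPerspectiveTransform
    expects), using the classic sum/difference trick."""
    tl = min(pts, key=lambda p: p[0] + p[1])  # smallest x + y
    br = max(pts, key=lambda p: p[0] + p[1])  # largest x + y
    tr = min(pts, key=lambda p: p[1] - p[0])  # smallest y - x
    bl = max(pts, key=lambda p: p[1] - p[0])  # largest y - x
    return [tl, tr, br, bl]

def target_size(corners):
    """Output size of the deskewed scan: the longer of each pair of
    opposite edges, so no content gets squeezed."""
    tl, tr, br, bl = corners
    dist = lambda a, b: hypot(a[0] - b[0], a[1] - b[1])
    width = int(max(dist(br, bl), dist(tr, tl)))
    height = int(max(dist(tr, br), dist(tl, bl)))
    return width, height
```

The ordered corners become the source quad and `[(0, 0), (w, 0), (w, h), (0, h)]` the destination quad for the perspective warp.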
Initial Launch
- -OCR benchmark leaderboard with 8 major benchmarks
- -State-of-the-art results from 50+ models
- -Methodology documentation
- -Comparison pages: PaddleOCR vs Tesseract, GPT-4o vs PaddleOCR
CodeSOTA launches with comprehensive OCR benchmarking data. Our goal: verify vendor claims independently and help you choose the right tools.
This is the complete changelog since launch. Star us on GitHub for updates.