Changelog

Latest updates, benchmarks, and documentation releases.

v0.6.9 Benchmark

CodeSOTA Polish OCR Benchmark

  • - 1,000 synthetic and real Polish text images with ground truth
  • - 4 categories: synth_random, synth_words, real_corpus, wikipedia
  • - 5 degradation levels: clean, light, medium, heavy, severe
  • - Tesseract 5.5.1 baseline: 26.3% CER overall
  • - Contamination-resistant design exposes LM dependence (52% vs 5% CER)
  • - Dedicated Polish OCR page with category breakdown and key findings

Our own Polish OCR benchmark, designed to separate language-model reliance from pure character recognition. Synthetic categories (no dictionary fallback) show 10x worse performance than real text, exposing heavy dependence on statistical language models. The dataset includes 5 degradation levels generated with Augraphy to simulate real document scanning conditions.
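
For reference, CER here is the Levenshtein edit distance divided by the length of the ground-truth text; a minimal sketch of that metric (the benchmark's own scoring script may normalize differently):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Diacritic errors alone push CER up quickly on Polish text.
print(cer("zażółć gęślą jaźń", "zazolc gesla jazn"))
```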

v0.6.8 Release

Mistral OCR 3 Added

  • - New Mistral OCR 3 model (mistral-ocr-2512) added to benchmarks
  • - Dedicated review page with pricing, code examples, benchmarks
  • - 94.9% claimed accuracy, 74% win rate over OCR 2
  • - $2/1000 pages ($1 with batch API)
  • - Benchmark results: olmOCR-bench, CER, WER metrics

Added comprehensive coverage of Mistral OCR 3, released in December 2025. Includes benchmark comparisons with GPT-4o and PaddleOCR, pricing analysis, code examples, and use case recommendations.
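
As a taste of the code examples, a hedged sketch of calling the OCR endpoint with the mistralai Python SDK; the method and response field names follow the SDK as documented at the time of writing and may differ in newer releases, and the document URL is illustrative:

```python
import os
from mistralai import Mistral  # pip install mistralai

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.ocr.process(
    model="mistral-ocr-2512",  # model name from this changelog entry
    document={
        "type": "document_url",
        "document_url": "https://example.com/sample-invoice.pdf",  # illustrative
    },
)

# Each page comes back as Markdown; concatenate for the full document.
markdown = "\n\n".join(page.markdown for page in response.pages)
print(markdown[:500])
```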

v0.6.7

Featured Guides on Landing Page

  • - New "In-Depth Comparisons" section on homepage with image cards
  • - 6 featured editorial guides: OCR comparisons, Audio AI, Medical Radiology
  • - Images surfaced for better Google Images discovery
  • - Internal linking boost for deep content pages

The landing page now showcases editorial content with images for better SEO discovery. Featured guides include PaddleOCR vs Tesseract, GPT-4o vs PaddleOCR, Best OCR for Invoices, Best OCR for Handwriting, Audio AI Benchmarks, and Chest X-ray AI Models.

v0.6.6 Release

Audio AI Benchmarks: Classification, Music Generation & More

  • - Audio AI overview: AudioSet, ESC-50, music generation, audio captioning
  • - Audio Classification deep-dive: BEATs at 0.498 mAP, 98.1% on ESC-50
  • - Music Generation comparison: Suno, Udio, MusicGen, Stable Audio
  • - 7 custom visualizations: spectrograms, waveforms, model comparisons
  • - Evaluation metrics explained: mAP, FAD, MOS, CLAP scores
  • - Architecture breakdowns: CNN vs Vision Transformer vs Audio Tokenizer

Comprehensive Audio AI vertical with 3 editorial pages covering classification (AudioSet, ESC-50), music generation (Suno, Udio, MusicGen), and audio understanding (Qwen2-Audio, SALMONN). Includes custom-generated visualizations and practical model recommendations by use case.
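
mAP on AudioSet-style multi-label classification is the mean of per-class average precision; a minimal sketch with scikit-learn using synthetic labels and scores, purely illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_clips, n_classes = 200, 10          # illustrative sizes, not AudioSet's 527 classes
y_true = rng.integers(0, 2, size=(n_clips, n_classes))   # multi-label ground truth
y_score = rng.random((n_clips, n_classes))               # model confidence per class

# Mean average precision: average the per-class AP values.
per_class_ap = [
    average_precision_score(y_true[:, c], y_score[:, c]) for c in range(n_classes)
]
print(f"mAP: {np.mean(per_class_ap):.3f}")
```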

v0.6.5 Release

GPU Hardware Benchmarks: RTX 3090 vs 4090 vs 5090

  • - Compare RTX 3090, 4090, 5090 for ML workloads
  • - LLM inference: Llama 3, Mistral, with tokens/sec metrics
  • - Image generation: SDXL, Flux, SD 1.5 benchmarks
  • - Training: LoRA fine-tuning, YOLO, ResNet performance
  • - VRAM requirements guide: which models fit on which GPU
  • - Cloud GPU pricing from RunPod, vast.ai, Lambda Labs

New Hardware section with a comprehensive GPU comparison for ML. Includes specs and real-world benchmarks across LLM inference, image generation, training, and computer vision, with recommendations on which GPU to buy and a cloud pricing comparison.
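
Tokens/sec numbers of this kind are usually measured by timing generation and dividing new tokens by wall-clock time; a hedged sketch with Hugging Face transformers (the model name is a placeholder, and real harnesses also control batch size, precision, and warmup):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("Explain the transformer architecture.", return_tensors="pt").to("cuda")

# Warmup run so CUDA kernel compilation and caches don't skew the timed run.
model.generate(**inputs, max_new_tokens=16)

torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```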

v0.6.4

Polish OCR Benchmarks

  • - 4 Polish OCR datasets: PolEval 2021, IMPACT-PSNC, reVISION, Polish EMNIST
  • - Models: Tesseract Polish, ABBYY FineReader, HerBERT, Polish RoBERTa
  • - Best CER: 2.1% on PolEval 2021, 97.5% word accuracy on IMPACT
  • - Covers diacritics challenges and gothic font recognition

Dedicated Polish OCR benchmark page covering historical documents from 1791-1998, gothic fonts, and Polish diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż). Features both raw OCR engines and NLP post-correction approaches.
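
For context, running Tesseract's Polish model is a one-liner with pytesseract (this assumes the tesseract binary and its pol traineddata are installed; the file path is illustrative):

```python
from PIL import Image
import pytesseract  # pip install pytesseract; requires the tesseract binary + pol traineddata

# lang="pol" loads Tesseract's Polish model, which handles ą, ć, ę, ł, ń, ó, ś, ź, ż.
text = pytesseract.image_to_string(Image.open("scan_1823_gothic.png"), lang="pol")
print(text)
```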

v0.6.3 Release

Industrial Anomaly Detection Benchmarks

  • - 8 industrial datasets: MVTec AD, VisA, weld defects, steel defects
  • - 12 anomaly detection models: PatchCore, EfficientAD, SimpleNet, FastFlow
  • - Best AUROC: 99.6% (SimpleNet on MVTec AD)
  • - Covers weld inspection, steel defects, surface inspection
  • - Three approaches: Memory Bank, Normalizing Flows, Student-Teacher

Industrial inspection vertical for manufacturing quality control. Covers anomaly detection for defects, weld X-ray inspection, and steel surface analysis. Includes metrics explainer for AUROC vs PRO.
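
Image-level AUROC is computed from per-image anomaly scores against good/defective labels; a minimal sketch with scikit-learn using illustrative scores (PRO additionally measures pixel-level region overlap and needs segmentation masks):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# 0 = good part, 1 = defective part (e.g. labels from an MVTec AD test split).
labels = np.array([0, 0, 0, 1, 1, 0, 1, 0, 1, 1])
# Per-image anomaly score, e.g. the max patch distance from a memory bank.
scores = np.array([0.12, 0.08, 0.21, 0.87, 0.64, 0.15, 0.91, 0.33, 0.58, 0.77])

print(f"Image-level AUROC: {roc_auc_score(labels, scores):.3f}")
```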

v0.6.2 Release

Chest X-Ray AI: Radiology Benchmarks Launch

  • - 7 chest X-ray datasets: CheXpert, MIMIC-CXR, NIH ChestX-ray14, VinDr-CXR, PadChest, RSNA, COVID-19
  • - 15 radiology AI models: CheXNet, CheXzero, TorchXRayVision, MedCLIP, GLoRIA, BioViL
  • - 20+ benchmark results with AUC scores across datasets
  • - Interactive cross-dataset comparison chart
  • - Data pipeline explainer: DICOM to multi-label classification

Major expansion of Medical AI with a dedicated Chest X-Ray benchmark page. Features 900K+ images across 7 major datasets, leaderboard sorted by CheXpert AUC, and cross-dataset generalization analysis. Covers the rise of vision-language models (CLIP-based) and the label noise problem in radiology.
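
The DICOM-to-multi-label pipeline boils down to reading pixel data, normalizing it, and pairing it with a binary label vector per finding; a simplified sketch with pydicom (real pipelines also handle rescale slope/intercept and dataset-specific label schemas):

```python
import numpy as np
import pydicom

FINDINGS = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Pleural Effusion"]

def load_cxr(path: str) -> np.ndarray:
    """Read a chest X-ray DICOM and scale pixel values to [0, 1]."""
    ds = pydicom.dcmread(path)
    img = ds.pixel_array.astype(np.float32)
    img = (img - img.min()) / max(img.max() - img.min(), 1e-6)
    if getattr(ds, "PhotometricInterpretation", "") == "MONOCHROME1":
        img = 1.0 - img  # invert so higher values mean brighter
    return img

# Multi-label target: one binary flag per finding.
image = load_cxr("study_0001.dcm")                     # illustrative path
target = np.array([1, 0, 0, 1, 0], dtype=np.float32)   # e.g. Atelectasis + Edema
print(image.shape, target)
```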

v0.6.1 Fix

SEO & Accessibility Improvements

  • - Dataset schema on benchmark pages for Google Dataset Search
  • - Dynamic meta descriptions with SOTA model and scores
  • - FAQPage schema on Speech and Code Generation pages
  • - Canvas accessibility: aria-labels and fallback text on DocumentScanner
  • - BreadcrumbList schema for improved navigation structure

Major SEO improvements following audit recommendations. Benchmark pages now include schema.org/Dataset structured data for visibility in Google Dataset Search. Meta descriptions dynamically include the current SOTA model and score. Speech and Code Generation verticals now have FAQPage schema for rich snippets.
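
For context, schema.org/Dataset markup is embedded as JSON-LD in a script tag; a sketch of the kind of object involved, with illustrative values rather than the site's actual markup:

```python
import json

dataset_jsonld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Polish OCR Benchmark",  # illustrative values throughout
    "description": "1,000 Polish text images with ground truth across 5 degradation levels.",
    "url": "https://example.com/benchmarks/polish-ocr",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["OCR", "Polish", "benchmark"],
}

# Embed the result inside <script type="application/ld+json"> on the benchmark page.
print(json.dumps(dataset_jsonld, indent=2))
```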

v0.6.0 Release

Major Content Expansion: 6 New Verticals

  • - NLP vertical: GLUE, SuperGLUE, SQuAD benchmarks with 20+ models
  • - Speech vertical: Whisper vs Azure, LibriSpeech benchmarks
  • - Multimodal vertical: VQA, image captioning, GPT-4V vs Gemini
  • - Reasoning vertical: MATH, GSM8K, GPQA, o1 vs GPT-4 comparison
  • - LLM comparison hub: GPT-4 vs Claude head-to-head analysis
  • - Code generation: best-for Python, JavaScript, debugging guides
  • - OCR expansion: receipts, tables, multilingual, 3 new comparisons

20+ new pages across 6 research verticals. Each vertical includes landing pages, benchmark deep dives, and model comparisons. Navigation updated to include NLP, Speech, Multimodal, Reasoning, and LLM sections.

v0.5.1

OCR Arena Speed vs Quality Visualization

  • - Interactive scatter plot: ELO score vs latency
  • - 18 models from OCR Arena human preference rankings
  • - Green dots for open source, red for closed/API
  • - Key insights: best quality, best balance, fastest
  • - Full rankings table with win rates and battle counts

New visualization page showing the speed vs quality tradeoff for OCR models based on human preference data from OCR Arena. Helps identify Pareto-optimal models for different use cases.
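
A speed-vs-quality scatter of this kind takes a few lines of matplotlib; a sketch with made-up points (the real page uses OCR Arena ELO and measured latency):

```python
import matplotlib.pyplot as plt

# Illustrative points only: (latency in seconds, ELO, open-source flag).
models = {
    "Model A": (1.2, 1250, True),
    "Model B": (4.8, 1310, False),
    "Model C": (0.6, 1100, True),
    "Model D": (9.5, 1345, False),
}

for name, (latency, elo, is_open) in models.items():
    plt.scatter(latency, elo, color="green" if is_open else "red")
    plt.annotate(name, (latency, elo), textcoords="offset points", xytext=(5, 5))

plt.xlabel("Latency per page (s)")
plt.ylabel("OCR Arena ELO")
plt.title("Speed vs quality (illustrative data)")
plt.show()
```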

v0.5.0 Release

CodeSOTA Meta-Benchmark Score

  • - Aggregate score across 8 OCR benchmarks
  • - Weighted scoring: primary (3x), secondary (2x), tertiary (1x)
  • - Interactive heatmap: models vs benchmarks
  • - Coverage tracking: see which models need testing
  • - Testing priority list for contributors

Introducing the CodeSOTA Score: a single number for comparing OCR models across multiple benchmarks. Primary benchmarks (OmniDocBench, OCRBench v2, olmOCR-Bench) are weighted 3x, secondary benchmarks (CHURRO-DS, CC-OCR) 2x, and language-specific benchmarks 1x. A visual heatmap shows exactly where data is missing.
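
The weighting reduces to a weighted mean over whichever benchmarks a model has results for; a minimal sketch with illustrative scores (the real aggregation likely normalizes each benchmark's scale first):

```python
# Tier weights from this release: primary 3x, secondary 2x, language-specific 1x.
WEIGHTS = {"primary": 3, "secondary": 2, "tertiary": 1}
BENCH_TIER = {
    "OmniDocBench": "primary", "OCRBench v2": "primary", "olmOCR-Bench": "primary",
    "CHURRO-DS": "secondary", "CC-OCR": "secondary",
    "Polish OCR": "tertiary",
}

def codesota_score(results: dict) -> float:
    """Weighted mean of per-benchmark scores, skipping benchmarks with no result."""
    num = sum(WEIGHTS[BENCH_TIER[b]] * s for b, s in results.items())
    den = sum(WEIGHTS[BENCH_TIER[b]] for b in results)
    return num / den

# Illustrative scores for a single model.
print(codesota_score({"olmOCR-Bench": 83.1, "OmniDocBench": 78.0, "CC-OCR": 71.5}))
```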

v0.4.0 Release

Papers With Code Database Integration

  • - 1,500+ benchmark results from PWC archive
  • - SOTA Timeline: interactive hill-climbing charts
  • - 146 datasets, 464 models indexed
  • - 15 research areas with 70+ tasks defined
  • - NLP, Reasoning, Code, Speech, Medical and more

Major release integrating the full Papers With Code archive. Browse historical benchmark results with the classic "hill climbing" visualization showing SOTA progression over time. All major research areas now have defined tasks: NLP (9 tasks), Reasoning (5), Code (6), Speech (5), and more.
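
The "hill climbing" view is essentially the running best score over time; a sketch of the underlying computation with made-up results:

```python
import matplotlib.pyplot as plt

# (year, score) results for one benchmark, in publication order; illustrative values.
results = [(2018, 68.0), (2019, 72.5), (2019, 71.0), (2020, 80.2), (2021, 79.0), (2022, 86.4)]

years, running_best = [], []
best = float("-inf")
for year, score in results:
    best = max(best, score)  # SOTA only moves up, hence the "hill climbing" shape
    years.append(year)
    running_best.append(best)

plt.step(years, running_best, where="post")
plt.xlabel("Year")
plt.ylabel("Best reported score")
plt.title("SOTA progression (illustrative)")
plt.show()
```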

v0.3.1 Documentation

Papers With Code Story

  • - Complete history of Papers With Code (2018-2025)
  • - Why it mattered for ML research
  • - What was lost when Meta shut it down
  • - Why CodeSOTA exists to fill the gap
  • - Cost vs Quality frontier graph on vendors page

New page explaining the story of Papers With Code: what it was, why it was invaluable, and why a vacuum has existed since Meta "sunsetted" it in July 2025. Also adds an interactive cost vs quality graph to the vendors comparison.

v0.3.0 Release

Homepage Redesign & OCR Vendors Page

  • - New hero section: "State of the Art, Verified"
  • - Papers With Code successor positioning
  • - OCR Vendors comparison page with 9 vendors
  • - Decision matrix for different use cases
  • - LinkedIn banner for social media

Major update to homepage positioning CodeSOTA as the next generation of ML benchmarking. New OCR vendors page consolidates all options (Mistral, Docling, GPT-4o, PaddleOCR, Tesseract, Google Doc AI, Azure, doctr, Chandra) with practical decision guidance.

v0.2.2 Documentation

Mistral OCR Documentation

  • - Mistral OCR API guide with Python examples
  • - Benchmark claims: 94.9% accuracy, 2000 pages/min
  • - Pricing comparison: $0.001/page vs competitors
  • - Independent testing caveats documented
  • - Mistral vs Docling comparison table

Added comprehensive documentation for the Mistral OCR API. Includes both official benchmark claims and independent testing results showing mixed performance on complex layouts.

v0.2.1 Tutorial

Docling Tutorial Verified

  • - All code executed and verified on real documents
  • - Real outputs: 33,201 chars markdown from 10-page PDF in 34.95s
  • - Table extraction verified: 3 tables with CSV export
  • - Downloadable artifacts from actual test run
  • - Performance metrics from Apple Silicon with MPS acceleration

The Docling tutorial now includes real, verified outputs from processing the Docling arXiv paper. No more AI-generated placeholder snippets: every code block has been executed, and the actual results are shown.

v0.2.0 Documentation

Docling Documentation Added

  • - Complete Docling documentation following the Diataxis framework
  • - Tutorial: PDF to Markdown conversion
  • - How-To Guides: OCR engines, table extraction, RAG integration
  • - Technical Reference: API docs, model specs
  • - Explanation: Architecture deep-dive

IBM's Docling represents a significant shift in document processing, using computer vision models instead of traditional OCR. We've added comprehensive documentation to help you evaluate and integrate it.
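
The tutorial's PDF-to-Markdown conversion comes down to a few lines; a sketch based on Docling's documented Python API (the input URL is illustrative):

```python
from docling.document_converter import DocumentConverter

# Convert a PDF (local path or URL) into Docling's document model, then export Markdown.
converter = DocumentConverter()
result = converter.convert("https://example.com/paper.pdf")  # illustrative URL
markdown = result.document.export_to_markdown()
print(f"{len(markdown)} characters of Markdown")
```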

v0.1.2 Benchmark

Chandra OCR Benchmark Data

  • - Added Chandra OCR 0.1.0 benchmark results
  • - Top performer on olmOCR-Bench at 83.1%
  • - Comparison data against PaddleOCR-VL, MinerU, Marker

Chandra OCR now leads the Allen Institute's olmOCR-Bench leaderboard. We've added comprehensive benchmark data to help you compare it against other solutions.

v0.1.1 Tutorial

Document Scanner Tutorial

  • - Full document scanning pipeline with OpenCV
  • - Edge detection, perspective correction, enhancement
  • - Interactive demo with sample images
  • - Integration guide with OCR engines

Learn to build a document scanner that detects edges, corrects perspective, and enhances scanned images. Includes full Python code and interactive examples.
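
The perspective-correction step maps the four detected page corners onto a flat rectangle with a homography; a condensed OpenCV sketch (corner detection is assumed to have already produced the points):

```python
import cv2
import numpy as np

def warp_document(image: np.ndarray, corners: np.ndarray) -> np.ndarray:
    """Warp a quadrilateral page region (tl, tr, br, bl corners) to a flat rectangle."""
    corners = corners.astype(np.float32)
    (tl, tr, br, bl) = corners
    width = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
    height = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))
    dst = np.array([[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]],
                   dtype=np.float32)
    matrix = cv2.getPerspectiveTransform(corners, dst)
    return cv2.warpPerspective(image, matrix, (width, height))

# Usage: corners come from contour detection on a Canny edge map, as in the tutorial.
# scanned = warp_document(cv2.imread("receipt.jpg"), detected_corners)
```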

v0.1.0 Release

Initial Launch

  • - OCR benchmark leaderboard with 8 major benchmarks
  • - State-of-the-art results from 50+ models
  • - Methodology documentation
  • - Comparison pages: PaddleOCR vs Tesseract, GPT-4o vs PaddleOCR

CodeSOTA launches with comprehensive OCR benchmarking data. Our goal: verify vendor claims independently and help you choose the right tools.

This is the complete changelog since launch. Star us on GitHub for updates.