Vendor-provided benchmark results lack independent verification. Marketing materials cherry-pick metrics. Academic benchmarks use outdated datasets.
CodeSOTA runs independent evaluations with versioned datasets, deterministic configurations, and public methodology. Every verified result is reproducible.
VERIFIED Badge Schema
What the "VERIFIED by CodeSOTA" badge guarantees:
Dataset Hash
SHA-256 hash of the exact dataset version used for evaluation. Guarantees same test data across runs. Prevents dataset drift.
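Verifying a downloaded dataset against its published digest takes only a few lines; this is a sketch (file name and comparison are illustrative, not CodeSOTA's tooling):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large dataset archives hash in constant memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical usage: compare against the hash published with the benchmark.
# assert sha256_of_file("ocr-invoices-eu-v1.3.0.tar.gz") == published_hash
```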
sha256:a3f2b9c8d1e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0
Prompt & Configuration
Exact parameters: temperature, top-p, max tokens, system prompts, API versions. Version-controlled configuration files stored in Git.
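A version-controlled run configuration might look like the following; the field names and values here are an illustrative sketch, not CodeSOTA's actual schema:

```yaml
# Illustrative deterministic run configuration (hypothetical schema)
model: paddle-ocr-vl
api_version: "2024-12-01"
generation:
  temperature: 0.0
  top_p: 1.0
  max_tokens: 4096
  seed: 42
system_prompt: prompts/ocr_extraction_v3.txt
```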
Runtime Environment
Pinned Docker images with locked dependency versions. GPU/CPU specs, Python/library versions documented. Reproducible anywhere.
Cost Calculation Methodology
Transparent pricing breakdown: API costs (input/output tokens), compute costs (GPU hours), infrastructure overhead. Updated with vendor pricing changes.
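The breakdown above can be sketched as a simple cost function; all rates below are placeholder values, not real vendor prices:

```python
def evaluation_cost(input_tokens: int, output_tokens: int, gpu_hours: float,
                    price_in_per_1k: float, price_out_per_1k: float,
                    gpu_hourly_rate: float, overhead_rate: float = 0.10) -> float:
    """Total cost: API tokens + GPU compute, plus an infrastructure overhead share."""
    api_cost = (input_tokens / 1000) * price_in_per_1k \
             + (output_tokens / 1000) * price_out_per_1k
    compute_cost = gpu_hours * gpu_hourly_rate
    return round((api_cost + compute_cost) * (1 + overhead_rate), 4)

# Placeholder rates: $0.50/$1.50 per 1k tokens, $2.00 per GPU-hour, 10% overhead.
```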
Metric Computation Code
Public GitHub repository with evaluation scripts. Character Error Rate (CER), Word Error Rate (WER), Table Edit Distance (TEDS), F1 scores - all open-source implementations.
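Both CER and WER reduce to Levenshtein distance over characters or words; a minimal reference sketch (not the repository's actual code) looks like:

```python
def levenshtein(ref, hyp) -> int:
    """Edit distance between two sequences, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```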
github.com/codesota/ocr-benchmarks
Date of Verification
Timestamp of benchmark execution. Tracks freshness. Models improve, APIs change - date context is critical for interpretation.
Verification Process
Step-by-step methodology for independent benchmark execution:
Model Setup
Pinned Docker images for each model evaluation. Deterministic configurations with fixed random seeds. API version locking to prevent silent changes.
docker pull codesota/paddle-ocr:v2.8.1
docker run --gpus all -e SEED=42 codesota/paddle-ocr:v2.8.1 evaluate
Dataset Versioning
Semantic versioning (v1.2.3) for datasets. Cached datasets with SHA-256 integrity checks. Immutable storage - versions never change after publication.
dataset: ocr-invoices-eu-v1.3.0
hash: sha256:a3f2b9c8d1e4f5a6b7c8d9e0f1a2b3c4
size: 10,000 documents (Polish, German, Czech)
Evaluation Execution
Deterministic evaluation scripts with version control. Automated pipelines run identical code for every model. Parallel execution for efficiency, isolated environments for integrity.
python evaluate.py \
--model paddle-ocr-vl \
--dataset ocr-invoices-eu-v1.3.0 \
--config configs/paddle_deterministic.yaml
Result Validation
Cross-checking against vendor claims where available. Statistical analysis for outlier detection. Re-runs for suspicious results. Human review of failure cases.
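The statistical outlier check can be as simple as flagging runs whose score deviates from the mean by several standard deviations; this is a sketch, and the 2.5-sigma threshold is an illustrative choice, not CodeSOTA's published rule:

```python
from statistics import mean, stdev

def flag_outliers(scores: list[float], threshold: float = 2.5) -> list[int]:
    """Return indices of runs more than `threshold` standard deviations from
    the mean -- candidates for re-running and human review."""
    if len(scores) < 3:
        return []  # too few runs for a meaningful spread estimate
    mu, sigma = mean(scores), stdev(scores)
    if sigma == 0:
        return []  # all runs identical, nothing to flag
    return [i for i, s in enumerate(scores) if abs(s - mu) / sigma > threshold]
```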
Publication
Git commit hash linking results to exact code version. Benchmark version tagging (v2025.01). Public changelog tracking all methodology changes.
commit: 7a3f2b9c (2025-01-15)
benchmark: ocr-invoices-eu v1.3.0
methodology: v2025.01 (no breaking changes since v2024.12)
Transparency Commitments
How we maintain trust through openness:
Public Methodology on GitHub
All evaluation code, configuration files, and metric implementations are open-source. Fork, audit, reproduce.
Reproducible Results
Every verified result includes reproduction instructions. Run the same evaluation yourself with provided Docker images and datasets.
Deterministic by design
Version Control for All Changes
Methodology changes tracked in Git. Breaking changes increment major version. Public changelog with rationale for every modification.
No silent updates
Freshness Tracking
Timestamp on every benchmark result. "Last verified" dates visible on all leaderboards. Automatic staleness warnings for results older than 90 days.
Context is critical
Verification Tiers
Not all results are equal. We distinguish three levels of validation:
Self-Reported
UNVERIFIED
Results submitted by third parties (vendors, researchers, community). Not independently reproduced by CodeSOTA. Provided for completeness but flagged as unverified.
CodeSOTA Verified
VERIFIED
Independently reproduced by CodeSOTA. Meets all six badge criteria: dataset hash, prompt/config, runtime, cost, metric code, verification date. Single-run validation.
Continuous Monitoring
MONITORED
Automated regular reruns (weekly/monthly). Tracks model drift, API changes, performance degradation. Alerts on significant deviations. Highest confidence tier.
Request Verification for Your Model
Get the VERIFIED badge and independent validation for procurement teams
Vendor Benefits:
- Independent validation that buyers trust in RFP responses
- Competitive intelligence: see where you stand vs. alternatives
- Failure-mode analysis for product improvement roadmap
- VERIFIED badge for marketing materials and sales decks
- Avoiding a negative signal: to procurement teams, non-participation suggests a lack of confidence
Verification process: 2-3 weeks for standard OCR models. Custom datasets and private evaluations available.
Conflict of Interest Policy
Independence: CodeSOTA accepts no vendor investment, equity, or revenue-sharing agreements with OCR providers. We make money from private evaluations and enterprise consulting, not from vendors seeking favorable rankings.
Verification Fees: Vendors may pay for verification services (benchmark execution, badge issuance), but verification is pass/fail - we publish results as-is, favorable or not. Payment does not influence methodology or ranking.
Disclosure: Any financial relationship with a benchmarked vendor (consulting, evaluation fees, partnerships) is disclosed on the relevant benchmark page.
Methodology Integrity: Methodology changes are never made at vendor request. All changes go through public review with rationale documented in Git changelog.