
Benchmark Verification Protocol

How CodeSOTA ensures benchmark results are trustworthy

Independent verification, reproducible methodology, transparent execution

Vendor-provided benchmark results lack independent verification. Marketing materials cherry-pick metrics. Academic benchmarks use outdated datasets.

CodeSOTA runs independent evaluations with versioned datasets, deterministic configurations, and public methodology. Every verified result is reproducible.

VERIFIED Badge Schema

What the "VERIFIED by CodeSOTA" badge guarantees:

Dataset Hash

SHA-256 hash of the exact dataset version used for evaluation. Guarantees same test data across runs. Prevents dataset drift.

sha256:a3f2b9c8d1e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0
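
As a minimal sketch, the published hash can be re-checked locally before any evaluation run; the archive filename below is an illustrative placeholder, while the expected digest is the one shown above.

import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the archive in chunks so large datasets fit in constant memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Expected digest copied from the badge above; filename is a placeholder.
EXPECTED = "a3f2b9c8d1e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0"
actual = sha256_of_file("ocr-invoices-eu-v1.3.0.tar.gz")
assert actual == EXPECTED, f"Dataset hash mismatch: {actual}"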

Prompt & Configuration

Exact parameters: temperature, top-p, max tokens, system prompts, API versions. Version-controlled configuration files stored in Git.

Model: gpt-4o-2024-11-20
Temperature: 0.0 (deterministic)
Prompt: Extract all text from this document...
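
For illustration, a pinned request with these parameters might look like the sketch below. It assumes the official OpenAI Python client; the system prompt, token limit, and omission of the document image input are placeholders, not CodeSOTA's actual configuration.

from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",   # pinned model version, never "latest"
    temperature=0.0,             # deterministic decoding
    top_p=1.0,
    max_tokens=4096,             # placeholder limit
    messages=[
        {"role": "system", "content": "You are an OCR engine."},  # placeholder system prompt
        {"role": "user", "content": "Extract all text from this document..."},
    ],
)
print(response.choices[0].message.content)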

Runtime Environment

Pinned Docker images with locked dependency versions. GPU/CPU specs, Python/library versions documented. Reproducible anywhere.

Container: codesota/ocr-eval:2025.01
Python: 3.11.7
Hardware: NVIDIA A100 40GB / 8-core CPU
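
One way to record this environment fingerprint alongside each run is a small snippet like the sketch below; the listed dependencies are examples, not an actual CodeSOTA script.

import json, platform, subprocess
from importlib import metadata

def runtime_fingerprint() -> dict:
    # Record the interpreter, example pinned dependencies, and GPU model with the results.
    return {
        "python": platform.python_version(),
        "paddleocr": metadata.version("paddleocr"),
        "torch": metadata.version("torch"),
        "gpu": subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip(),
    }

print(json.dumps(runtime_fingerprint(), indent=2))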

Cost Calculation Methodology

Transparent pricing breakdown: API costs (input/output tokens), compute costs (GPU hours), infrastructure overhead. Updated with vendor pricing changes.

API: $0.015/1K input tokens + $0.06/1K output tokens
Compute: $2.50/hour A100 GPU
Total per page: ~$0.023
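
The per-page figure can be reproduced from the published rates; the token counts and GPU seconds per page in the sketch below are assumed averages for illustration, not measured values.

# Published rates (see above).
INPUT_PER_1K = 0.015   # USD per 1K input tokens
OUTPUT_PER_1K = 0.06   # USD per 1K output tokens
GPU_PER_HOUR = 2.50    # USD per A100 hour

# Assumed per-page averages (illustrative only).
input_tokens, output_tokens, gpu_seconds = 1_100, 60, 4

api_cost = input_tokens / 1000 * INPUT_PER_1K + output_tokens / 1000 * OUTPUT_PER_1K
compute_cost = gpu_seconds / 3600 * GPU_PER_HOUR
print(f"~${api_cost + compute_cost:.3f} per page")  # about $0.023 with these assumptions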

Metric Computation Code

Public GitHub repository with evaluation scripts. Character Error Rate (CER), Word Error Rate (WER), Tree Edit Distance Similarity (TEDS) for table structure, F1 scores - all open-source implementations.

github.com/codesota/ocr-benchmarks
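
For reference, CER and WER both reduce to an edit distance over characters or tokens divided by the reference length. The sketch below is a minimal implementation of that idea, not the repository's code.

def edit_distance(ref: list, hyp: list) -> int:
    # Classic Levenshtein distance with a single rolling row.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference.split(), hypothesis.split()) / max(len(reference.split()), 1)

print(cer("invoice 1234", "invo1ce 1234"))  # 1 substitution / 12 chars ~ 0.083
print(wer("invoice 1234", "invo1ce 1234"))  # 1 wrong word / 2 words = 0.5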

Date of Verification

Timestamp of benchmark execution. Tracks freshness. Models improve, APIs change - date context is critical for interpretation.

Last verified: 2025-01-15 14:32 UTC

Verification Process

Step-by-step methodology for independent benchmark execution:

Step 1: Model Setup

Pinned Docker images for each model evaluation. Deterministic configurations with fixed random seeds. API version locking to prevent silent changes.

docker pull codesota/paddle-ocr:v2.8.1
docker run --gpus all -e SEED=42 codesota/paddle-ocr:v2.8.1 evaluate
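
Inside the container, deterministic behavior also depends on seeding every framework in use. A typical seeding helper, assuming PyTorch and NumPy and shown for illustration only, looks like:

import os, random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    # Seed every source of randomness the evaluation touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade kernel speed for reproducibility on GPU.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ["PYTHONHASHSEED"] = str(seed)  # inherited by any subprocesses

set_seed(int(os.environ.get("SEED", "42")))  # matches the SEED=42 passed to docker run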

Step 2: Dataset Versioning

Semantic versioning (v1.2.3) for datasets. Cached datasets with SHA-256 integrity checks. Immutable storage - versions never change after publication.

dataset: ocr-invoices-eu-v1.3.0
hash: sha256:a3f2b9c8d1e4f5a6b7c8d9e0f1a2b3c4
size: 10,000 documents (Polish, German, Czech)
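
Immutability can be enforced with an append-only mapping from dataset version to digest: once published, an entry is never overwritten, only superseded by a new version. A minimal sketch (the registry structure and function names are illustrative):

# version -> SHA-256 digest (truncated here for display, as in the listing above).
REGISTRY = {"ocr-invoices-eu-v1.3.0": "a3f2b9c8d1e4f5a6b7c8d9e0f1a2b3c4"}

def publish(version: str, digest: str) -> None:
    # Append-only: re-publishing an existing version is an error.
    if version in REGISTRY:
        raise ValueError(f"{version} already published; release a new version instead")
    REGISTRY[version] = digest

def verify_cached(version: str, digest: str) -> bool:
    # A cached copy is accepted only if its digest matches the published record.
    return REGISTRY.get(version) == digest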

Step 3: Evaluation Execution

Deterministic evaluation scripts with version control. Automated pipelines run identical code for every model. Parallel execution for efficiency, isolated environments for integrity.

python evaluate.py \
  --model paddle-ocr-vl \
  --dataset ocr-invoices-eu-v1.3.0 \
  --config configs/paddle_deterministic.yaml
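
A stripped-down skeleton of what such an entry point might look like is sketched below; the real scripts live in the repository above, and the printed summary stands in for model inference and metric computation.

import argparse
import yaml  # PyYAML, assumed available in the evaluation image

def main() -> None:
    # Every model is evaluated through the same entry point with a pinned config file.
    parser = argparse.ArgumentParser(description="Run one model against one dataset version.")
    parser.add_argument("--model", required=True)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--config", required=True)
    args = parser.parse_args()

    with open(args.config) as f:
        config = yaml.safe_load(f)  # version-controlled parameters, no hidden defaults

    print(f"evaluating {args.model} on {args.dataset} with {config}")
    # ... model inference and metric computation would follow here ...

if __name__ == "__main__":
    main()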

Step 4: Result Validation

Cross-checking against vendor claims where available. Statistical analysis for outlier detection. Re-runs for suspicious results. Human review of failure cases.

Automated checks: Distribution analysis, variance thresholds
Manual review: Random sample of 100 predictions per model
Dispute resolution: Contact vendor for discrepancies before publishing
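
The automated check can be as simple as flagging per-document scores that sit far from the run's mean; the threshold and scores in the sketch below are illustrative, not the published ones.

from statistics import mean, pstdev

def flag_outliers(scores: list[float], z_threshold: float = 3.0) -> list[int]:
    # Return indices of documents whose score deviates suspiciously from the mean.
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return []
    return [i for i, s in enumerate(scores) if abs(s - mu) / sigma > z_threshold]

# Documents flagged here are queued for re-runs and manual review.
scores = [0.02, 0.03] * 10 + [0.41]
print(flag_outliers(scores))  # -> [20]: the document with CER 0.41 is flagged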

Step 5: Publication

Git commit hash linking results to exact code version. Benchmark version tagging (v2025.01). Public changelog tracking all methodology changes.

commit: 7a3f2b9c (2025-01-15)
benchmark: ocr-invoices-eu v1.3.0
methodology: v2025.01 (no breaking changes since v2024.12)
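
Linking a result to the exact code version can be done at publication time by embedding the current commit hash in the result metadata; a minimal sketch using the git CLI (the field names are illustrative):

import json, subprocess
from datetime import datetime, timezone

def publication_metadata(benchmark: str, methodology: str) -> dict:
    # The short commit hash ties the published numbers to the exact evaluation code.
    commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
    return {
        "commit": commit,
        "benchmark": benchmark,
        "methodology": methodology,
        "verified_at": datetime.now(timezone.utc).isoformat(timespec="minutes"),
    }

print(json.dumps(publication_metadata("ocr-invoices-eu v1.3.0", "v2025.01"), indent=2))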

Transparency Commitments

How we maintain trust through openness:

Public Methodology on GitHub

All evaluation code, configuration files, and metric implementations are open-source. Fork, audit, reproduce.


Reproducible Results

Every verified result includes reproduction instructions. Run the same evaluation yourself with provided Docker images and datasets.

Deterministic by design

Version Control for All Changes

Methodology changes tracked in Git. Breaking changes increment major version. Public changelog with rationale for every modification.

No silent updates

Freshness Tracking

Timestamp on every benchmark result. "Last verified" dates visible on all leaderboards. Automatic staleness warnings for results older than 90 days.

Context is critical
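
The 90-day rule is simple to apply client-side as well; in the sketch below the cutoff mirrors the policy above, while the function name and timestamp format are illustrative.

from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=90)  # matches the staleness policy above

def is_stale(last_verified_utc: str) -> bool:
    # Expects timestamps like "2025-01-15 14:32", interpreted as UTC.
    verified = datetime.strptime(last_verified_utc, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - verified > STALE_AFTER

print(is_stale("2025-01-15 14:32"))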

Verification Tiers

Not all results are equal. We distinguish three levels of validation:

Self-Reported

UNVERIFIED

Results submitted by third parties (vendors, researchers, community). Not independently reproduced by CodeSOTA. Provided for completeness but flagged as unverified.

Use case: Quick discovery of new models, initial comparison, tracking vendor claims

CodeSOTA Verified

VERIFIED

Independently reproduced by CodeSOTA. Meets all six badge criteria: dataset hash, prompt/config, runtime, cost, metric code, verification date. Single-run validation.

Use case: Procurement decisions, RFP benchmarking, vendor selection, technical evaluations

Continuous Monitoring

MONITORED

Automated regular reruns (weekly/monthly). Tracks model drift, API changes, performance degradation. Alerts on significant deviations. Highest confidence tier.

Use case: Production monitoring, SLA tracking, regression detection, long-term reliability assessment

Request Verification for Your Model

Get the VERIFIED badge and independent validation for procurement teams

Vendor Benefits:

  • Independent validation that buyers trust in RFP responses
  • Competitive intelligence: see where you stand vs. alternatives
  • Failure-mode analysis for product improvement roadmap
  • VERIFIED badge for marketing materials and sales decks
  • Participation avoids the negative signal that declining verification sends to procurement teams

Verification process: 2-3 weeks for standard OCR models. Custom datasets and private evaluations available.

Conflict of Interest Policy

Independence: CodeSOTA accepts no vendor investment, equity, or revenue-sharing agreements with OCR providers. We make money from private evaluations and enterprise consulting, not from vendors seeking favorable rankings.

Verification Fees: Vendors may pay for verification services (benchmark execution, badge issuance), but verification is pass/fail - we publish results as-is, favorable or not. Payment does not influence methodology or ranking.

Disclosure: Any financial relationship with a benchmarked vendor (consulting, evaluation fees, partnerships) is disclosed on the relevant benchmark page.

Methodology Integrity: Methodology changes are never made at vendor request. All changes go through public review with rationale documented in Git changelog.