Codesota · Benchmark · OCRBench v2
South China University of Technology

OCRBench v2.

OCRBench v2 tests 8 core OCR capabilities across 23 tasks, evaluating large multimodal models (LMMs) on capabilities including text recognition, referring, and extraction.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

Overall (Chinese)

Overall Zh Private is the reported evaluation metric for OCRBench v2. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers: verified, paper, vendor, community, unverified.

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Trust | Score | Year | Notes |
|------|-------|-------|-------|------|-------|
| 01 | Qwen2.5-VL-72B | paper | 63.7 | 2025 | From Qwen2.5-VL-72B-Instruct model card benchmark table. |
| 02 | gemini-25-pro | paper | 62.2 | 2025 | Chinese, Private split. #1 on Chinese |
| 03 | Gemini 2.5 Pro | unverified | 62.2 | 2025 | Chinese, Private split. #1 on Chinese |
| 04 | Qianfan-OCR | paper | 60.77 | 2025 | Baidu Qianfan-OCR 4B (Qwen3-4B + Qianfan-ViT), Apache 2.0, 192 langs. Layout-as-Thought. #1 on zh |
| 05 | minicpm-v-4.5-8b | unverified | 58.8 | 2025 | Chinese, Private split. #4 overall |
| 06 | sail-vl2-8b | paper | 57.6 | 2025 | |
| 07 | claude-3.5-sonnet | unverified | 48.4 | 2024 | |
| 08 | InternVL2.5-78B | paper | 46.2 | 2025 | From Qwen2.5-VL-72B-Instruct model card comparison table. |
| 09 | Qwen2-VL-72B | paper | 46.1 | 2024 | From Qwen2.5-VL-72B-Instruct model card comparison table. |
| 10 | gpt-4o-2024 | unverified | 45.7 | 2024 | |
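
The muting rule stated above the table (a row is muted when an earlier or same-year result already scored better) can be sketched as a simple comparison. This is a minimal illustration, not Codesota's implementation: the `Result` class and `is_muted` function are hypothetical names, and it assumes only the year is known for each result, so "earlier or same-year" is approximated as `year <= year`. The sample rows are taken from the Overall (Chinese) table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One leaderboard row (hypothetical structure for illustration)."""
    model: str
    score: float
    year: int


def is_muted(row: Result, rows: list[Result]) -> bool:
    """True if some result from the same or an earlier year scored strictly higher."""
    return any(
        other.year <= row.year and other.score > row.score
        for other in rows
    )


# A few rows from the Overall (Chinese) table.
rows = [
    Result("Qwen2.5-VL-72B", 63.7, 2025),
    Result("claude-3.5-sonnet", 48.4, 2024),
    Result("Qwen2-VL-72B", 46.1, 2024),
    Result("gpt-4o-2024", 45.7, 2024),
]

# Models whose rows would be muted under this reading of the rule.
muted = [r.model for r in rows if is_muted(r, rows)]
print(muted)
```

Under this reading, the best 2024 result stays unmuted even though a 2025 result later surpassed it, because the rule only looks at results from the same year or earlier.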

English Score

English Score is the reported evaluation metric for OCRBench v2.

Higher is better

Trust tiers: verified, paper, vendor, community, unverified.

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Trust | Score | Year |
|------|-------|-------|-------|------|
| 01 | Ovis2.5-9B | unverified | 63.4 | 2025 |
| 02 | Intern-S1-Pro | unverified | 60.1 | 2026 |

Overall (English)

Overall En Private is the reported evaluation metric for OCRBench v2.

Higher is better

Trust tiers: verified, paper, vendor, community, unverified.

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Trust | Score | Year | Notes |
|------|-------|-------|-------|------|-------|
| 01 | seed-1.6-vision | paper | 62.2 | 2025 | English, Private split. #1 on OCRBench v2 |
| 02 | Seed1.6-vision | unverified | 62.2 | 2025 | English, Private split. #1 on OCRBench v2 |
| 03 | Qwen2.5-VL-72B | paper | 61.5 | 2025 | From Qwen2.5-VL-72B-Instruct model card benchmark table. HF: Qwen/Qwen2.5-VL-72B-Instruct. |
| 04 | qwen3-omni-30b | paper | 61.3 | 2025 | |
| 05 | Nemotron Nano V2 VL | unverified | 61.2 | 2025 | |
| 06 | nemotron-nano-v2-vl | paper | 61.2 | 2025 | |
| 07 | gemini-25-pro | paper | 59.3 | 2025 | |
| 08 | Gemini 2.5 Pro | unverified | 59.3 | 2025 | |
| 09 | llama-3.1-nemotron-nano-vl-8b | paper | 56.4 | 2025 | |
| 10 | Qianfan-OCR | paper | 56 | 2025 | Baidu Qianfan-OCR 4B (Qwen3-4B + Qianfan-ViT), Apache 2.0, 192 langs. Layout-as-Thought. |
| 11 | gpt-4o | paper | 55.5 | 2024 | Listed as GPT5-2025-08-07 on leaderboard |
| 12 | ovis2.5-8b | unverified | 54.1 | 2025 | |
| 13 | gemini-1.5-pro | unverified | 51.6 | 2024 | |
| 14 | sail-vl2-8b | paper | 49.3 | 2025 | |
| 15 | minicpm-v-4.5-8b | unverified | 48.4 | 2025 | |
| 16 | Qwen2-VL-72B | paper | 47.8 | 2024 | From Qwen2.5-VL-72B-Instruct model card comparison table. |
| 17 | gpt-4o-2024 | paper | 47.6 | 2024 | GPT-4o baseline (not GPT5-2025-08-07) |
| 18 | claude-3.5-sonnet | paper | 47.5 | 2024 | |
| 19 | internvl3.5-14b | unverified | 47.1 | 2025 | |
| 20 | step-1v | unverified | 46.8 | 2024 | |
| 21 | grok4 | unverified | 45 | 2025 | |
| 22 | InternVL2.5-78B | paper | 45 | 2025 | From Qwen2.5-VL-72B-Instruct model card comparison table. |
| 23 | GPT-4o mini | unverified | 44.1 | 2024 | |
| 24 | gpt-4o-mini | paper | 44.1 | 2024 | |
| 25 | Claude Sonnet 4 | unverified | 42.4 | 2025 | Claude-sonnet-4-20250514 |
| 26 | claude-sonnet-4 | paper | 42.4 | 2025 | Claude-sonnet-4-20250514 |
| 27 | qwen2.5-vl-7b | unverified | 41.8 | 2025 | |
| 28 | deepseek-vl2-small | paper | 41 | 2024 | |
| 29 | pixtral-12b | unverified | 38.4 | 2024 | |
| 30 | phi-4-multimodal | unverified | 38.1 | 2025 | |
| 31 | glm-4v-9b | unverified | 37.1 | 2024 | |
| 32 | molmo-7b | unverified | 33.9 | 2024 | |
| 33 | llava-ov-7b | paper | 33.7 | 2024 | |

Chinese Score

Chinese Score is the reported evaluation metric for OCRBench v2.

Higher is better

Trust tiers: verified, paper, vendor, community, unverified.

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Trust | Score | Year |
|------|-------|-------|-------|------|
| 01 | Intern-S1-Pro | unverified | 60.6 | 2026 |
| 02 | Ovis2.5-9B | unverified | 58 | 2025 |

Overall Zh Public

Overall Zh Public is the reported evaluation metric for OCRBench v2.

Higher is better

Trust tiers: verified, paper, vendor, community, unverified.

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Trust | Score | Year | Notes |
|------|-------|-------|-------|------|-------|
| 01 | InternVL3-14B | paper | 55.7 | 2025 | Table 3, arxiv:2501.00321. Highest on Chinese public split, 0.1 above Qwen2.5-VL-7B. |
| 02 | Qwen2.5-VL-7B | paper | 55.6 | 2025 | Table 3, arxiv:2501.00321. |
| 03 | Ovis2-8B | paper | 49.2 | 2025 | Table 3, arxiv:2501.00321. |
| 04 | Gemini 1.5 Pro | paper | 43.1 | 2024 | Table 3, arxiv:2501.00321. |
| 05 | DeepSeek-VL2-Small | paper | 42.7 | 2024 | Table 3, arxiv:2501.00321. |
| 06 | Step-1V | paper | 42.6 | 2024 | Table 3, arxiv:2501.00321. |
| 07 | MiniCPM-o-2.6 | paper | 41.1 | 2024 | Table 3, arxiv:2501.00321. |
| 08 | Claude 3.5 Sonnet | paper | 39.6 | 2024 | Table 3, arxiv:2501.00321. |
| 09 | GLM-4V-9B | paper | 36.6 | 2024 | Table 3, arxiv:2501.00321. |
| 10 | GPT-4o | paper | 32.2 | 2024 | Table 3, arxiv:2501.00321. |

Overall En Public

Overall En Public is the reported evaluation metric for OCRBench v2.

Higher is better

Trust tiers: verified, paper, vendor, community, unverified.

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Trust | Score | Year | Notes |
|------|-------|-------|-------|------|-------|
| 01 | InternVL3-14B | paper | 52.6 | 2025 | Highest score on English public split. Table 2, arxiv:2501.00321. |
| 02 | Gemini 1.5 Pro | paper | 51.9 | 2024 | Table 2, arxiv:2501.00321. Gemini-1.5-Pro. |
| 03 | Ovis2-8B | paper | 47.7 | 2025 | Table 2, arxiv:2501.00321. |
| 04 | Step-1V | paper | 46.7 | 2024 | Table 2, arxiv:2501.00321. |
| 05 | Qwen2.5-VL-7B | paper | 46.7 | 2025 | Table 2, arxiv:2501.00321. Same as Step-1V average (46.7). |
| 06 | GPT-4o | paper | 46.5 | 2024 | Table 2, arxiv:2501.00321. |
| 07 | Claude 3.5 Sonnet | paper | 45.2 | 2024 | Table 2, arxiv:2501.00321. claude-3-5-sonnet-20241022. |
| 08 | MiniCPM-o-2.6 | paper | 45.1 | 2024 | Table 2, arxiv:2501.00321. |
| 09 | DeepSeek-VL2-Small | paper | 43.3 | 2024 | Table 2, arxiv:2501.00321. |
| 10 | GLM-4V-9B | paper | 42.6 | 2024 | Table 2, arxiv:2501.00321. |
| 11 | Pixtral-12B | paper | 40.3 | 2024 | Table 2, arxiv:2501.00321. |
| 12 | LLaVA-OneVision-7B | paper | 36.4 | 2024 | Table 2, arxiv:2501.00321. |
| 13 | Cambrian-1-8B | paper | 34.7 | 2024 | Table 2, arxiv:2501.00321. |
| 14 | Molmo-7B | paper | 34.5 | 2024 | Table 2, arxiv:2501.00321. |

§ 03 · Lineage

OCRBench v2 in context.

Predecessors (1)
OCRBench (2023-05, superseded). Relative to it, OCRBench v2 has 10× more items, is human-verified, has EN+ZH parity, and uses four public/private splits to combat contamination. The original v1 saturated within 18 months; v2 reopened the gap.

This benchmark (1)
OCRBench v2 (2024-12, active).