Layout analysis decomposes a page into regions — title, paragraph, list, table, figure, caption, header, footer — with bounding boxes and a reading order. It is the task that sits upstream of OCR, and the one that quietly determines whether the text you extract makes sense.
Plain OCR emits a flat character stream. On a two-column research paper, a reader that ignores layout splices the left and right columns line by line — “We propose a method the dataset contains that improves 120k pages …” — producing text that is individually correct and collectively meaningless. Layout analysis is what prevents that. Get the order wrong and every downstream step inherits the error.
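The splice described above is easy to reproduce. The following is a minimal sketch, assuming each column is already available as a list of text lines; the function names are illustrative and not part of any OCR library.

```python
# Two columns of a page, each as a list of text lines (taken from the example above).
left_column = ["We propose a method", "that improves"]
right_column = ["the dataset contains", "120k pages"]


def read_ignoring_layout(left, right):
    """Read straight across the page: line 1 of both columns, then line 2, and so on."""
    out = []
    for left_line, right_line in zip(left, right):
        out.extend([left_line, right_line])
    return " ".join(out)


def read_with_layout(left, right):
    """Read the whole left column first, then the right column."""
    return " ".join(left + right)


spliced = read_ignoring_layout(left_column, right_column)
ordered = read_with_layout(left_column, right_column)
```

Every word in `spliced` is recognized correctly; only the order is wrong, which is why character-level accuracy alone does not reveal the problem.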
A layout model produces three things:
- Bounding boxes: one for every block. Detection-style output: [x0,y0,x1,y1] per element.
- Region type: Title, Text, List, Table, Figure, Caption, Page-header, Page-footer. DocLayNet uses 11 labels.
- Reading order: the sequence in which regions should be consumed. It is the hardest part, and the one most benchmarks under-measure.
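One way to hold these three outputs together is a small record per region. This is a sketch only: the `Region` class and its field names are assumptions made for illustration, not the schema of any particular model.

```python
from dataclasses import dataclass


@dataclass
class Region:
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 in pixels
    category: str                            # e.g. "Title", "Text", "Table"
    order: int                               # 0-based position in reading order


def in_reading_order(regions):
    """Return regions sorted by their assigned reading-order index."""
    return sorted(regions, key=lambda r: r.order)


# A hypothetical page: detection order rarely matches reading order.
page = [
    Region((50, 120, 300, 400), "Text", 1),
    Region((50, 40, 560, 100), "Title", 0),
    Region((320, 120, 560, 400), "Table", 2),
]
categories = [r.category for r in in_reading_order(page)]
```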
Layout is scored like an object-detection problem. mAP (mean average precision, averaged over IoU thresholds) measures how well predicted boxes match ground-truth regions; layout F1 balances precision and recall per class. PubLayNet (~360k clean scientific pages) is near-saturated; DocLayNet (80,863 manually annotated pages across finance, law, manuals, and patents) is the harder, more representative test.
| Model (PubLayNet) | mAP |
|---|---|
| Hybrid DETR (Shehzadi '24) | 97.3 |
| RoDLA (CVPR '24) | 96.0 |
| DiT-large | 94.5 |

| Model (DocLayNet) | mAP |
|---|---|
| DiT-large | 79.5 |
| LayoutLMv3 | 76.8 |
| YOLOv8 | 73.2 |
Sources: Shehzadi et al. arXiv:2404.17888; RoDLA (CVPR 2024) arXiv:2403.14442; DocLayNet (IBM). The roughly 18-point gap between the best score in each table (97.3 on PubLayNet vs. 79.5 on DocLayNet) is the cost of leaving clean scientific PDFs for real-world documents.
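The mAP numbers above rest on one primitive: intersection over union (IoU) between a predicted box and a ground-truth box. The sketch below computes IoU and a simplified precision/recall at a single threshold, assuming a single class and predictions already sorted by confidence. It is not a full mAP implementation; COCO-style evaluation additionally builds a precision-recall curve per class and averages over IoU thresholds from 0.50 to 0.95.

```python
def iou(a, b):
    """Intersection over union of two [x0, y0, x1, y1] boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, truths, threshold=0.5):
    """Greedy one-to-one matching of predictions to ground truth at one IoU threshold."""
    unmatched = list(truths)
    true_positives = 0
    for pred in preds:  # assumed sorted by descending confidence
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


# Made-up boxes: two good predictions and one false positive.
truths = [[0, 0, 100, 100], [200, 0, 300, 100]]
preds = [[0, 0, 100, 90], [210, 0, 300, 100], [400, 400, 450, 450]]
loose = precision_recall(preds, truths, threshold=0.5)
strict = precision_recall(preds, truths, threshold=0.95)
```

Both good predictions have an IoU of 0.9, so they count as matches at a 0.5 threshold and as misses at 0.95. Averaging over thresholds is what makes mAP sensitive to how tightly boxes fit, not just whether regions are found.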
One approach is a dedicated detector: a YOLO, DETR, or DiT model trained only to find and classify regions. It is fast and cheap, runs on CPU, and gives clean boxes. You then crop each region and route it to the right reader (text → OCR, table → table model). Reading order is computed separately.
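The routing step can be sketched as a dispatch table keyed on region category. The reader functions here are hypothetical stand-ins that return placeholder strings; in a real pipeline each would crop the region from the page image and call an OCR engine or table model.

```python
# Hypothetical readers: stand-ins for whatever OCR or table model you actually call.
def read_text(region):
    return f"text@{region['bbox']}"


def read_table(region):
    return f"table@{region['bbox']}"


def skip(region):
    return None  # e.g. pictures carry no text to read


# "Picture" is DocLayNet's label for figures; anything not listed falls back to OCR.
READERS = {"Table": read_table, "Picture": skip}


def route(regions):
    """Send each detected region to the reader for its category."""
    results = []
    for region in regions:
        reader = READERS.get(region["category"], read_text)
        results.append({"category": region["category"], "content": reader(region)})
    return results


regions = [
    {"bbox": [50, 40, 550, 100], "category": "Title"},
    {"bbox": [50, 120, 550, 400], "category": "Table"},
    {"bbox": [50, 420, 550, 700], "category": "Picture"},
]
routed = route(regions)
```

Defaulting unknown categories to plain OCR is a design choice in this sketch: it keeps text from being dropped when the detector emits a label the table does not cover.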
The alternative is an end-to-end model: document VLMs like dots.ocr and PaddleOCR-VL emit categories, bounding boxes and text in one pass, with reading order baked in. There are fewer moving parts, and the layout and the read stay consistent because one model produced both.
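Consuming that kind of output is mostly a matter of walking the elements in the order the model emitted them. The sketch below assumes the model returns a JSON list of objects with `bbox`, `category` and `text` keys, already in reading order; that schema is an assumption for illustration, so check the actual output format of the model you use.

```python
import json

# Categories that are usually noise in the extracted body text.
SKIP = {"Page-header", "Page-footer"}


def to_text(model_output: str) -> str:
    """Join element text in emitted order, dropping running headers and footers."""
    parts = []
    for element in json.loads(model_output):
        if element["category"] in SKIP:
            continue
        text = element.get("text", "").strip()
        if not text:
            continue  # e.g. a picture with no transcribed text
        parts.append(f"# {text}" if element["category"] == "Title" else text)
    return "\n\n".join(parts)


# Made-up model output in the assumed schema.
sample = json.dumps([
    {"bbox": [50, 10, 550, 30], "category": "Page-header", "text": "Journal of Examples"},
    {"bbox": [50, 40, 550, 100], "category": "Title", "text": "Layout Analysis"},
    {"bbox": [50, 120, 290, 400], "category": "Text", "text": "We propose a method that improves"},
    {"bbox": [310, 120, 550, 400], "category": "Picture"},
])
document = to_text(sample)
```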
```python
from huggingface_hub import hf_hub_download
from ultralytics import YOLO

# A layout detector returns regions + classes, NOT text.
model = YOLO(hf_hub_download("DILHTWD/yolov8-doclaynet", "yolov8-doclaynet.pt"))


def detect_regions(image_path: str):
    """Run the detector on one page image and return ordered region dicts."""
    result = model(image_path)[0]  # one image in, one Results object out
    regions = []
    for box, cls, conf in zip(result.boxes.xyxy, result.boxes.cls, result.boxes.conf):
        regions.append({
            "bbox": [round(v) for v in box.tolist()],  # x0,y0,x1,y1
            "category": result.names[int(cls)],        # Title, Text, Table, ...
            "confidence": round(float(conf), 3),
        })
    # Reading order is your job: sort by column, then top-to-bottom.
    return order_regions(regions)


def order_regions(regions, page_width_mid=None):
    """Naive multi-column reading order: split by x-midpoint, then y."""
    if not regions:
        return []  # nothing detected; avoids min()/max() on an empty list

    def center_x(r):
        return (r["bbox"][0] + r["bbox"][2]) / 2

    if page_width_mid is None:
        # Estimate the column boundary from the spread of region centers.
        xs = [center_x(r) for r in regions]
        page_width_mid = (min(xs) + max(xs)) / 2
    left = sorted([r for r in regions if center_x(r) < page_width_mid],
                  key=lambda r: r["bbox"][1])
    right = sorted([r for r in regions if center_x(r) >= page_width_mid],
                   key=lambda r: r["bbox"][1])
    return left + right  # read the whole left column, then the right
```
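The midpoint split above breaks down when a title or table spans both columns, because such a region gets pushed into one column and read out of sequence. One possible refinement, offered here as a heuristic sketch rather than the article's method, is to treat wide regions as separators and order each band between them column by column. The function name `order_with_spans` and the `span_ratio` default are assumptions for illustration.

```python
def order_with_spans(regions, page_width, span_ratio=0.6):
    """Reading order for two-column pages with full-width regions.

    Regions at least span_ratio * page_width wide are read in place;
    the regions between them are read left column first, then right.
    """
    mid = page_width / 2

    def center_x(r):
        return (r["bbox"][0] + r["bbox"][2]) / 2

    def is_span(r):
        return (r["bbox"][2] - r["bbox"][0]) >= span_ratio * page_width

    ordered, band = [], []

    def flush():
        # band is already top-to-bottom, so filtering preserves vertical order.
        ordered.extend([r for r in band if center_x(r) < mid])
        ordered.extend([r for r in band if center_x(r) >= mid])
        band.clear()

    for region in sorted(regions, key=lambda r: r["bbox"][1]):
        if is_span(region):
            flush()  # finish the columns above the spanning region
            ordered.append(region)
        else:
            band.append(region)
    flush()
    return ordered


# Made-up page, 600 px wide: title, two columns, a full-width table, two more columns.
page = [
    {"id": "right1", "bbox": [310, 120, 550, 300]},
    {"id": "table", "bbox": [50, 520, 550, 700]},
    {"id": "left2", "bbox": [50, 310, 290, 500]},
    {"id": "title", "bbox": [50, 40, 550, 100]},
    {"id": "left1", "bbox": [50, 120, 290, 300]},
    {"id": "right2", "bbox": [310, 310, 550, 500]},
    {"id": "left3", "bbox": [50, 720, 290, 800]},
    {"id": "right3", "bbox": [310, 720, 550, 800]},
]
reading_order = [r["id"] for r in order_with_spans(page, page_width=600)]
```

This still assumes at most two columns and axis-aligned boxes. Pages with three columns, sidebars, or rotated content need a more general method.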