Image Segmentation Models Compared
SAM 2, OneFormer, Mask2Former, SegGPT, Grounded SAM, SEEM, and Florence-2 -- benchmarked side-by-side with realistic scores, inference costs, and production trade-offs.
TL;DR
- Best closed-vocabulary accuracy: Mask2Former and OneFormer -- 57-58 mIoU on ADE20K, ~58 PQ on COCO.
- Best zero-shot / interactive: SAM 2.1 -- prompt with points, boxes, or masks; tracks through video.
- Best text-prompted: Grounded SAM -- describe objects in natural language, get pixel masks.
- Most versatile foundation model: Florence-2 -- segmentation is one of 10+ vision tasks in a single model.
Three Types of Image Segmentation
Before comparing models, it helps to understand the three segmentation tasks they target. Each answers a different question about the pixels in an image.
Semantic Segmentation
Label every pixel with a class (road, sky, person) but do not distinguish between individual instances of the same class.
Metric: mIoU (mean Intersection over Union) -- see the sketch below
Benchmark: ADE20K (150 classes)
Use case: autonomous driving, medical imaging, land-cover mapping
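For concreteness, here is a minimal sketch of how mIoU is computed from two integer label maps; mean_iou and its arguments are illustrative names, not a library API.
import numpy as np
def mean_iou(pred, gt, num_classes):
    # Per-class IoU, averaged over classes present in either map
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both prediction and GT
            ious.append(inter / union)
    return float(np.mean(ious))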
Instance Segmentation
Detect each object and produce a pixel mask for it. Two people in a photo get two separate masks.
Metric: AP (Average Precision) on masks -- see the evaluation sketch below
Benchmark: COCO (80 classes)
Use case: robotics grasping, photo editing, counting objects
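Mask AP is rarely computed by hand; the standard route is pycocotools' COCOeval with iouType="segm". A sketch, assuming hypothetical annotation and prediction file names:
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
# Hypothetical paths; predictions.json holds RLE-encoded instance masks
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("predictions.json")
evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # first printed line is mask AP @ IoU 0.50:0.95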
Panoptic Segmentation
Combines semantic + instance: every pixel gets a class label and countable objects get unique IDs.
Metric: PQ (Panoptic Quality) = SQ x RQ -- see the sketch below
Benchmark: COCO Panoptic (133 classes)
Use case: scene understanding, AR/VR, comprehensive scene parsing
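The PQ formula is simple enough to sketch directly. This assumes predicted and ground-truth segments have already been matched one-to-one at IoU > 0.5 (the standard criterion); the function name is illustrative.
def panoptic_quality(matched_ious, num_pred, num_gt):
    # matched_ious: IoU of each matched prediction/ground-truth pair
    tp = len(matched_ious)
    fp = num_pred - tp  # unmatched predicted segments
    fn = num_gt - tp    # unmatched ground-truth segments
    if tp == 0:
        return 0.0
    sq = sum(matched_ious) / tp              # segmentation quality
    rq = tp / (tp + 0.5 * fp + 0.5 * fn)     # recognition quality
    return sq * rq                           # PQ = SQ x RQ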
Benchmark Comparison
Scores are for the largest publicly available checkpoint of each model. Speed measured on a single A100 GPU at 1024x1024 resolution where applicable.
| Model | ADE20K mIoU | COCO PQ | COCO AP | Params | Speed (ms) | Zero-Shot | Video |
|---|---|---|---|---|---|---|---|
| SAM 2 (Meta AI) | -- | -- | 46.5 | 312M | ~44 | Yes | Yes |
| SAM 2.1 (Meta AI) | -- | -- | 48.1 | 312M | ~40 | Yes | Yes |
| Mask2Former (Meta AI) | 57.8 | 57.8 | 50.1 | 216M | ~72 | No | No |
| OneFormer (SHI Labs) | 58.0 | 58.0 | 49.0 | 220M | ~80 | No | No |
| SegGPT (BAAI) | 42.5 | -- | -- | 354M | ~110 | Yes | No |
| Grounded SAM (IDEA Research) | -- | -- | 46.8 | ~500M | ~130 | Yes | No |
| SEEM (Microsoft) | 50.2 | 52.1 | -- | 310M | ~85 | Yes | No |
| Florence-2 (Microsoft) | 44.0 | -- | 37.5 | 232M | ~60 | Yes | No |
SAM 2 / SAM 2.1 COCO AP measured on instance segmentation with automatic mask generation. Mask2Former and OneFormer scores from Swin-L backbones. SegGPT mIoU is few-shot (1-shot) on ADE20K.
Model Cards
SAM 2
Meta AI / 2024 -- Zero-Shot, Video, Open Source. Prompt-based. Excels at interactive segmentation and video object segmentation.
SAM 2.1
Meta AI / 2024 -- Zero-Shot, Video, Open Source. Improved training data and occlusion handling over SAM 2.
Mask2Former
Meta AI / 2022 -- Open Source. Universal architecture for semantic, instance, and panoptic segmentation.
OneFormer
SHI Labs / 2023 -- Open Source. Multi-task design: one model, one forward pass for all three segmentation tasks.
SegGPT
BAAI / 2023 -- Zero-Shot, Open Source. In-context learning: segments anything given example image-mask pairs.
Grounded SAM
IDEA Research / 2024 -- Zero-Shot, Open Source. Grounding DINO + SAM: text prompts to segmentation masks.
SEEM
Microsoft / 2023 -- Zero-Shot, Open Source. Segment Everything Everywhere All at Once: multi-modal prompting.
Florence-2
Microsoft / 2024 -- Zero-Shot, Open Source. Vision foundation model; segmentation is one of many capabilities.
Code Examples
Working Python snippets for the most popular models. All use official libraries or Hugging Face Transformers.
SAM 2.1 -- Interactive Image Segmentation
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
from PIL import Image
import numpy as np
# Load model
checkpoint = "sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
# Load image
image = np.array(Image.open("photo.jpg"))
predictor.set_image(image)
# Segment with point prompt (x, y)
point_coords = np.array([[500, 375]])
point_labels = np.array([1]) # 1 = foreground
masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
# Best mask
best_mask = masks[np.argmax(scores)]
print(f"Mask shape: {best_mask.shape}, Score: {scores.max():.3f}")SAM 2 -- Video Object Segmentation
from sam2.build_sam import build_sam2_video_predictor
import numpy as np
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",
    "sam2.1_hiera_large.pt",
)
# Initialize with a video directory of JPEG frames
state = predictor.init_state(video_path="./video_frames/")
# Add prompt on frame 0
_, _, masks = predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)
# Propagate through entire video
for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
    for obj_id, mask in zip(obj_ids, masks):
        binary = (mask[0] > 0.0).cpu().numpy()
        print(f"Frame {frame_idx}, Object {obj_id}: {binary.sum()} pixels")
Mask2Former -- Panoptic Segmentation
import torch
from transformers import (
    Mask2FormerForUniversalSegmentation,
    Mask2FormerImageProcessor,
)
from PIL import Image
# Load panoptic segmentation model
processor = Mask2FormerImageProcessor.from_pretrained(
    "facebook/mask2former-swin-large-coco-panoptic"
)
model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-large-coco-panoptic"
)
image = Image.open("street.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Panoptic segmentation
result = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
seg_map = result["segmentation"] # H x W tensor of segment IDs
segments = result["segments_info"]
for seg in segments:
    print(f"ID: {seg['id']}, Label: {seg['label_id']}, "
          f"Score: {seg['score']:.3f}, Area: {(seg_map == seg['id']).sum()}")
Grounded SAM -- Text-Prompted Segmentation
from autodistill_grounded_sam_2 import GroundedSAM2
from autodistill.detection import CaptionOntology
from PIL import Image
# Define what to segment via text
ontology = CaptionOntology({
    "car": "car",
    "person": "person",
    "traffic light": "traffic light",
})
model = GroundedSAM2(ontology=ontology)
# Run on an image
results = model.predict("street_scene.jpg")
# predict() returns a supervision Detections object; read its arrays
class_names = ontology.classes()
for class_id, confidence, mask in zip(
    results.class_id, results.confidence, results.mask
):
    print(f"Class: {class_names[class_id]}")
    print(f"Confidence: {confidence:.3f}")
    print(f"Mask pixels: {mask.sum()}")
When to Use What
Interactive annotation / labeling tool -- SAM 2.1
Click-to-segment with real-time feedback. Best prompt-based model available.
Video object tracking with masks -- SAM 2 / SAM 2.1
Only model with native video propagation. Prompt once, track across frames.
Full scene parsing (all pixels labeled) -- OneFormer
Highest ADE20K mIoU. Trained on fixed class vocabularies for dense prediction.
Open-vocabulary: "segment the red car" -- Grounded SAM
Natural language input. No predefined classes. Combines detection + segmentation.
Segment from a visual example (no text, no click) -- SegGPT
In-context learning. Provide a reference image-mask pair, model generalizes.
Multi-task vision pipeline (detect + segment + caption) -- Florence-2
Single model handles 10+ tasks. Lower per-task accuracy but extreme versatility.
Production panoptic segmentation at scale -- Mask2Former
Battle-tested, strong COCO PQ, good speed/accuracy trade-off with smaller backbones.
Multi-modal prompting (text + click + audio) -- SEEM
Accepts text, point, box, and even audio prompts. Good for research prototypes.
Key Takeaways
- There is no single best model. SAM 2 wins on interactivity and video; Mask2Former/OneFormer win on closed-vocabulary accuracy; Grounded SAM wins on open-vocabulary ease.
- Zero-shot does not mean highest accuracy. SAM 2 cannot label pixels by class name. For semantic segmentation with a known label set, supervised models still dominate.
- Combine models when needed. Grounded SAM is literally Grounding DINO + SAM composed together. Many production pipelines chain a detector with a segmentor, as in the sketch after this list.
- Video segmentation is still young. SAM 2 is the clear leader, but long-video consistency and re-identification after occlusion remain open challenges.
- Foundation models are converging. Florence-2 and SEEM show that segmentation is becoming one capability among many inside unified vision models.
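As a concrete illustration of the detector-plus-segmentor pattern, a minimal sketch that feeds detector boxes into SAM 2's box prompt; predictor is assumed to be the SAM2ImagePredictor from the earlier example (with set_image already called), and boxes is a hypothetical detector output.
import numpy as np
# xyxy boxes from any detector (e.g. Grounding DINO); values are made up
boxes = [np.array([120, 80, 400, 360])]
for box in boxes:
    # One box prompt -> one mask; multimask_output=False keeps the top mask
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    print(f"Box {box.tolist()} -> mask covering {int(masks[0].sum())} pixels")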