Image Segmentation Models Compared
SAM 2, OneFormer, Mask2Former, SegGPT, Grounded SAM, SEEM, and Florence-2 -- benchmarked side-by-side with realistic scores, inference costs, and production trade-offs.
TL;DR
- Best closed-vocabulary accuracy: Mask2Former and OneFormer -- 57-58 mIoU on ADE20K, ~58 PQ on COCO.
- Best zero-shot / interactive: SAM 2.1 -- prompt with points, boxes, or masks; tracks through video.
- Best text-prompted: Grounded SAM -- describe objects in natural language, get pixel masks.
- Most versatile foundation model: Florence-2 -- segmentation is one of 10+ vision tasks in a single model.
Three Types of Image Segmentation
Before comparing models, it helps to understand the three segmentation tasks they target. Each answers a different question about the pixels in an image.
Semantic Segmentation
Label every pixel with a class (road, sky, person) but do not distinguish between individual instances of the same class.
Metric: mIoU (mean Intersection over Union) -- see the sketch below
Benchmark: ADE20K (150 classes)
Use case: autonomous driving, medical imaging, land-cover mapping
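For concreteness, here is a minimal sketch of how mIoU is computed from two integer label maps; mean_iou and its arguments are illustrative names, not a library API.
import numpy as np
def mean_iou(pred, gt, num_classes):
    # Per-class IoU, averaged over classes present in either map
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both prediction and GT
            ious.append(inter / union)
    return float(np.mean(ious))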
Instance Segmentation
Detect each object and produce a pixel mask for it. Two people in a photo get two separate masks.
Metric: AP (Average Precision) on masks -- see the evaluation sketch below
Benchmark: COCO (80 classes)
Use case: robotics grasping, photo editing, counting objects
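Mask AP is rarely computed by hand; the standard route is pycocotools' COCOeval with iouType="segm". A sketch, assuming hypothetical annotation and prediction file names:
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
# Hypothetical paths; predictions.json holds RLE-encoded instance masks
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("predictions.json")
evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # first printed line is mask AP @ IoU 0.50:0.95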
Panoptic Segmentation
Combines semantic + instance: every pixel gets a class label and countable objects get unique IDs.
Metric: PQ (Panoptic Quality) = SQ x RQ -- see the sketch below
Benchmark: COCO Panoptic (133 classes)
Use case: scene understanding, AR/VR, comprehensive scene parsing
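The PQ formula is simple enough to sketch directly. This assumes predicted and ground-truth segments have already been matched one-to-one at IoU > 0.5 (the standard criterion); the function name is illustrative.
def panoptic_quality(matched_ious, num_pred, num_gt):
    # matched_ious: IoU of each matched prediction/ground-truth pair
    tp = len(matched_ious)
    fp = num_pred - tp  # unmatched predicted segments
    fn = num_gt - tp    # unmatched ground-truth segments
    if tp == 0:
        return 0.0
    sq = sum(matched_ious) / tp              # segmentation quality
    rq = tp / (tp + 0.5 * fp + 0.5 * fn)     # recognition quality
    return sq * rq                           # PQ = SQ x RQ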
Benchmark Comparison
Scores are for the largest publicly available checkpoint of each model. Speed measured on a single A100 GPU at 1024x1024 resolution where applicable.
| Model | ADE20K mIoU | COCO PQ | COCO AP | Params | Speed (ms) | Zero-Shot | Video |
|---|---|---|---|---|---|---|---|
| SAM 2 (Meta AI) | -- | -- | 46.5 | 312M | ~44 | Yes | Yes |
| SAM 2.1 (Meta AI) | -- | -- | 48.1 | 312M | ~40 | Yes | Yes |
| Mask2Former (Meta AI) | 57.8 | 57.8 | 50.1 | 216M | ~72 | No | No |
| OneFormer (SHI Labs) | 58.0 | 58.0 | 49.0 | 220M | ~80 | No | No |
| SegGPT (BAAI) | 42.5 | -- | -- | 354M | ~110 | Yes | No |
| Grounded SAM (IDEA Research) | -- | -- | 46.8 | ~500M | ~130 | Yes | No |
| SEEM (Microsoft) | 50.2 | 52.1 | -- | 310M | ~85 | Yes | No |
| Florence-2 (Microsoft) | 44.0 | -- | 37.5 | 232M | ~60 | Yes | No |
SAM 2 / SAM 2.1 COCO AP measured on instance segmentation with automatic mask generation. Mask2Former and OneFormer scores from Swin-L backbones. SegGPT mIoU is few-shot (1-shot) on ADE20K.
Model Cards
SAM 2
Meta AI / 2024 -- Zero-Shot, Video, Open Source. Prompt-based. Excels at interactive segmentation and video object segmentation.
SAM 2.1
Meta AI / 2024 -- Zero-Shot, Video, Open Source. Improved training data and occlusion handling over SAM 2.
Mask2Former
Meta AI / 2022 -- Open Source. Universal architecture for semantic, instance, and panoptic segmentation.
OneFormer
SHI Labs / 2023 -- Open Source. Multi-task design: one model, one forward pass for all three segmentation tasks.
SegGPT
BAAI / 2023 -- Zero-Shot, Open Source. In-context learning: segments anything given example image-mask pairs.
Grounded SAM
IDEA Research / 2024 -- Zero-Shot, Open Source. Grounding DINO + SAM: text prompts to segmentation masks.
SEEM
Microsoft / 2023 -- Zero-Shot, Open Source. Segment Everything Everywhere All at Once: multi-modal prompting.
Florence-2
Microsoft / 2024 -- Zero-Shot, Open Source. Vision foundation model; segmentation is one of many capabilities.
Code Examples
Working Python snippets for the most popular models. All use official libraries or Hugging Face Transformers.
SAM 2.1 -- Interactive Image Segmentation
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
from PIL import Image
import numpy as np
# Load model
checkpoint = "sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
# Load image
image = np.array(Image.open("photo.jpg"))
predictor.set_image(image)
# Segment with point prompt (x, y)
point_coords = np.array([[500, 375]])
point_labels = np.array([1]) # 1 = foreground
masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
# Best mask
best_mask = masks[np.argmax(scores)]
print(f"Mask shape: {best_mask.shape}, Score: {scores.max():.3f}")SAM 2 -- Video Object Segmentation
from sam2.build_sam import build_sam2_video_predictor
import numpy as np
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",
    "sam2.1_hiera_large.pt",
)
# Initialize with a video directory of JPEG frames
state = predictor.init_state(video_path="./video_frames/")
# Add prompt on frame 0
_, _, masks = predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)
# Propagate through entire video
for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
    for obj_id, mask in zip(obj_ids, masks):
        binary = (mask[0] > 0.0).cpu().numpy()
        print(f"Frame {frame_idx}, Object {obj_id}: {binary.sum()} pixels")
Mask2Former -- Panoptic Segmentation
import torch
from transformers import (
    Mask2FormerForUniversalSegmentation,
    Mask2FormerImageProcessor,
)
from PIL import Image
# Load panoptic segmentation model
processor = Mask2FormerImageProcessor.from_pretrained(
    "facebook/mask2former-swin-large-coco-panoptic"
)
model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-large-coco-panoptic"
)
image = Image.open("street.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Panoptic segmentation
result = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
seg_map = result["segmentation"] # H x W tensor of segment IDs
segments = result["segments_info"]
for seg in segments:
    print(f"ID: {seg['id']}, Label: {seg['label_id']}, "
          f"Score: {seg['score']:.3f}, Area: {(seg_map == seg['id']).sum()}")
Grounded SAM -- Text-Prompted Segmentation
from autodistill_grounded_sam_2 import GroundedSAM2
from autodistill.detection import CaptionOntology
from PIL import Image
# Define what to segment via text
ontology = CaptionOntology({
    "car": "car",
    "person": "person",
    "traffic light": "traffic light",
})
model = GroundedSAM2(ontology=ontology)
# Run on an image
results = model.predict("street_scene.jpg")
# predict() returns a supervision Detections object; read its arrays
class_names = ontology.classes()
for class_id, confidence, mask in zip(
    results.class_id, results.confidence, results.mask
):
    print(f"Class: {class_names[class_id]}")
    print(f"Confidence: {confidence:.3f}")
    print(f"Mask pixels: {mask.sum()}")
When to Use What
Interactive annotation / labeling tool -- SAM 2.1
Click-to-segment with real-time feedback. Best prompt-based model available.
Video object tracking with masks -- SAM 2 / SAM 2.1
Only model with native video propagation. Prompt once, track across frames.
Full scene parsing (all pixels labeled) -- OneFormer
Highest ADE20K mIoU. Trained on fixed class vocabularies for dense prediction.
Open-vocabulary: "segment the red car" -- Grounded SAM
Natural language input. No predefined classes. Combines detection + segmentation.
Segment from a visual example (no text, no click) -- SegGPT
In-context learning. Provide a reference image-mask pair, model generalizes.
Multi-task vision pipeline (detect + segment + caption) -- Florence-2
Single model handles 10+ tasks. Lower per-task accuracy but extreme versatility.
Production panoptic segmentation at scale -- Mask2Former
Battle-tested, strong COCO PQ, good speed/accuracy trade-off with smaller backbones.
Multi-modal prompting (text + click + audio) -- SEEM
Accepts text, point, box, and even audio prompts. Good for research prototypes.
Key Takeaways
- There is no single best model. SAM 2 wins on interactivity and video; Mask2Former/OneFormer win on closed-vocabulary accuracy; Grounded SAM wins on open-vocabulary ease.
- Zero-shot does not mean highest accuracy. SAM 2 cannot label pixels by class name. For semantic segmentation with a known label set, supervised models still dominate.
- Combine models when needed. Grounded SAM is literally Grounding DINO + SAM composed together. Many production pipelines chain a detector with a segmentor, as in the sketch after this list.
- Video segmentation is still young. SAM 2 is the clear leader, but long-video consistency and re-identification after occlusion remain open challenges.
- Foundation models are converging. Florence-2 and SEEM show that segmentation is becoming one capability among many inside unified vision models.
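As a concrete illustration of the detector-plus-segmentor pattern, a minimal sketch that feeds detector boxes into SAM 2's box prompt; predictor is assumed to be the SAM2ImagePredictor from the earlier example (with set_image already called), and boxes is a hypothetical detector output.
import numpy as np
# xyxy boxes from any detector (e.g. Grounding DINO); values are made up
boxes = [np.array([120, 80, 400, 360])]
for box in boxes:
    # One box prompt -> one mask; multimask_output=False keeps the top mask
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    print(f"Box {box.tolist()} -> mask covering {int(masks[0].sum())} pixels")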