Computer Vision | Comparison | 2026

Image Segmentation Models Compared

SAM 2, OneFormer, Mask2Former, SegGPT, Grounded SAM, SEEM, and Florence-2 -- benchmarked side-by-side with realistic scores, inference costs, and production trade-offs.

March 2026 | 20 min read

TL;DR

  • Best closed-vocabulary accuracy: Mask2Former and OneFormer -- 57-58 mIoU on ADE20K, ~58 PQ on COCO.
  • Best zero-shot / interactive: SAM 2.1 -- prompt with points, boxes, or masks; tracks through video.
  • Best text-prompted: Grounded SAM -- describe objects in natural language, get pixel masks.
  • Most versatile foundation model: Florence-2 -- segmentation is one of 10+ vision tasks in a single model.

Three Types of Image Segmentation

Before comparing models, it helps to understand the three segmentation tasks they target. Each answers a different question about the pixels in an image.

Semantic Segmentation

Label every pixel with a class (road, sky, person) but do not distinguish between individual instances of the same class.

Metric: mIoU (mean Intersection over Union)

Benchmark: ADE20K (150 classes)

Use case: autonomous driving, medical imaging, land-cover mapping
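The metric itself is small enough to write out. A minimal NumPy sketch of per-class IoU averaging (function and variable names are ours, not from any benchmark toolkit):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average IoU over classes present in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious))

# toy 2x2 label maps with classes 0 and 1
pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
print(f"mIoU = {mean_iou(pred, gt, num_classes=2):.4f}")
```

ADE20K evaluation does the same thing over 150 classes, accumulating intersections and unions across the whole validation set before dividing.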

Instance Segmentation

Detect each object and produce a pixel mask for it. Two people in a photo get two separate masks.

Metric: AP (Average Precision) on masks

Benchmark: COCO (80 classes)

Use case: robotics grasping, photo editing, counting objects
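Mask AP is the area under the precision-recall curve after matching predictions to ground truth by IoU. A simplified single-threshold sketch (AP at IoU 0.5; COCO additionally averages over thresholds 0.50 to 0.95, and all names here are ours):

```python
import numpy as np

def mask_iou(a, b):
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def average_precision_50(pred_masks, scores, gt_masks, thresh=0.5):
    """Simplified mask AP at a single IoU threshold."""
    order = np.argsort(scores)[::-1]  # rank predictions by confidence
    matched = set()
    tp = np.zeros(len(order))
    fp = np.zeros(len(order))
    for rank, i in enumerate(order):
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(gt_masks):
            if j in matched:
                continue
            iou = mask_iou(pred_masks[i], g)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= thresh:
            tp[rank] = 1
            matched.add(best_j)
        else:
            fp[rank] = 1
    recall = np.cumsum(tp) / len(gt_masks)
    precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))
    # exact area under the interpolated precision-recall curve
    ap, prev_r = 0.0, 0.0
    for r in recall:
        ap += (r - prev_r) * precision[recall >= r].max()
        prev_r = r
    return ap

# toy check: two predictions that exactly match two ground-truth masks
gt = [np.array([[True, False], [False, False]]),
      np.array([[False, False], [False, True]])]
preds = [gt[0].copy(), gt[1].copy()]
ap = average_precision_50(preds, np.array([0.9, 0.8]), gt)
print(f"AP@0.5 = {ap:.2f}")
```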

Panoptic Segmentation

Combines semantic + instance: every pixel gets a class label and countable objects get unique IDs.

Metric: PQ (Panoptic Quality) = SQ x RQ (Segmentation Quality x Recognition Quality)

Benchmark: COCO Panoptic (133 classes)

Use case: scene understanding, AR/VR, comprehensive scene parsing
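The PQ = SQ x RQ decomposition falls out directly from the definition: matched-pair IoU mass divided by TP + FP/2 + FN/2. A toy NumPy sketch (our own function names, not the official panopticapi toolkit):

```python
import numpy as np

def panoptic_quality(pred_segments, gt_segments):
    """pred_segments / gt_segments: one boolean mask per segment.
    A prediction matches a ground-truth segment when IoU > 0.5
    (that threshold guarantees the matching is unique)."""
    iou_sum, tp = 0.0, 0
    matched_gt = set()
    for p in pred_segments:
        for j, g in enumerate(gt_segments):
            if j in matched_gt:
                continue
            union = np.logical_or(p, g).sum()
            iou = np.logical_and(p, g).sum() / union if union else 0.0
            if iou > 0.5:
                iou_sum += iou
                tp += 1
                matched_gt.add(j)
                break
    fp = len(pred_segments) - tp
    fn = len(gt_segments) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    sq = iou_sum / tp if tp else 0.0   # segmentation quality
    rq = tp / denom if denom else 0.0  # recognition quality
    pq = sq * rq                       # = iou_sum / denom
    return pq, sq, rq

# toy check: one predicted segment overlapping one ground-truth segment
gt = [np.array([[True, True], [False, False]])]
pred = [np.array([[True, True], [True, False]])]
pq, sq, rq = panoptic_quality(pred, gt)
print(f"PQ={pq:.3f} SQ={sq:.3f} RQ={rq:.3f}")
```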

Benchmark Comparison

Scores are for the largest publicly available checkpoint of each model. Speed measured on a single A100 GPU at 1024x1024 resolution where applicable.

| Model | Org | ADE20K mIoU | COCO PQ | COCO AP | Params | Speed (ms) | Zero-Shot | Video |
|---|---|---|---|---|---|---|---|---|
| SAM 2 | Meta AI | -- | -- | 46.5 | 312M | ~44 | Yes | Yes |
| SAM 2.1 | Meta AI | -- | -- | 48.1 | 312M | ~40 | Yes | Yes |
| Mask2Former | Meta AI | 57.8 | 57.8 | 50.1 | 216M | ~72 | No | No |
| OneFormer | SHI Labs | 58 | 58 | 49 | 220M | ~80 | No | No |
| SegGPT | BAAI | 42.5 | -- | -- | 354M | ~110 | Yes | No |
| Grounded SAM | IDEA Research | -- | -- | 46.8 | ~500M | ~130 | Yes | No |
| SEEM | Microsoft | 50.2 | 52.1 | -- | 310M | ~85 | Yes | No |
| Florence-2 | Microsoft | 44 | -- | 37.5 | 232M | ~60 | Yes | No |

COCO AP for SAM 2 / SAM 2.1 is measured on instance segmentation with automatic mask generation. Mask2Former and OneFormer scores use Swin-L backbones. SegGPT mIoU is few-shot (1-shot) on ADE20K.
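Latency numbers like the ones above are sensitive to warmup and timer placement. A minimal timing-harness sketch (function names are ours; for GPU models you would also call torch.cuda.synchronize() inside the timed callable so the timer sees real kernel time):

```python
import time
import numpy as np

def benchmark(fn, warmup=3, iters=10):
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):  # discard cold-start runs
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000.0)
    return float(np.median(times))

# toy stand-in for model inference
latency_ms = benchmark(lambda: np.fft.fft2(np.ones((256, 256))))
print(f"~{latency_ms:.2f} ms")
```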

Model Cards

SAM 2

Meta AI / 2024 | Zero-Shot | Video | Open Source

Prompt-based. Excels at interactive segmentation and video object segmentation.

COCO AP: 46.5 | Params: 312M | Latency: ~44 ms

SAM 2.1

Meta AI / 2024 | Zero-Shot | Video | Open Source

Improved training data and occlusion handling over SAM 2.

COCO AP: 48.1 | Params: 312M | Latency: ~40 ms

Mask2Former

Meta AI / 2022 | Open Source

Universal architecture for semantic, instance, and panoptic segmentation.

ADE20K: 57.8 mIoU | COCO PQ: 57.8 | COCO AP: 50.1 | Params: 216M | Latency: ~72 ms

OneFormer

SHI Labs / 2023 | Open Source

Multi-task design. One model, one forward pass for all three segmentation tasks.

ADE20K: 58 mIoU | COCO PQ: 58 | COCO AP: 49 | Params: 220M | Latency: ~80 ms

SegGPT

BAAI / 2023 | Zero-Shot | Open Source

In-context learning. Segments anything given example image-mask pairs.

ADE20K: 42.5 mIoU | Params: 354M | Latency: ~110 ms

Grounded SAM

IDEA Research / 2024 | Zero-Shot | Open Source

Grounding DINO + SAM. Text-prompt to segmentation masks.

COCO AP: 46.8 | Params: ~500M | Latency: ~130 ms

SEEM

Microsoft / 2023 | Zero-Shot | Open Source

Segment Everything Everywhere All at Once. Multi-modal prompting.

ADE20K: 50.2 mIoU | COCO PQ: 52.1 | Params: 310M | Latency: ~85 ms

Florence-2

Microsoft / 2024 | Zero-Shot | Open Source

Vision foundation model. Segmentation is one of many capabilities.

ADE20K: 44 mIoU | COCO AP: 37.5 | Params: 232M | Latency: ~60 ms

Code Examples

Working Python snippets for the most popular models. All use official libraries or Hugging Face Transformers.

SAM 2.1 -- Interactive Image Segmentation

sam2_image.py
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
from PIL import Image
import numpy as np

# Load model
checkpoint = "sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Load image
image = np.array(Image.open("photo.jpg"))
predictor.set_image(image)

# Segment with point prompt (x, y)
point_coords = np.array([[500, 375]])
point_labels = np.array([1])  # 1 = foreground

masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)

# Best mask
best_mask = masks[np.argmax(scores)]
print(f"Mask shape: {best_mask.shape}, Score: {scores.max():.3f}")

SAM 2 -- Video Object Segmentation

sam2_video.py
from sam2.build_sam import build_sam2_video_predictor
import numpy as np

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",
    "sam2.1_hiera_large.pt",
)

# Initialize with a video directory of JPEG frames
state = predictor.init_state(video_path="./video_frames/")

# Add prompt on frame 0
_, _, masks = predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)

# Propagate through entire video
for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
    for obj_id, mask in zip(obj_ids, masks):
        binary = (mask[0] > 0.0).cpu().numpy()
        print(f"Frame {frame_idx}, Object {obj_id}: {binary.sum()} pixels")

Mask2Former -- Panoptic Segmentation

mask2former_panoptic.py
from transformers import (
    Mask2FormerForUniversalSegmentation,
    Mask2FormerImageProcessor,
)
from PIL import Image
import torch

# Load panoptic segmentation model
processor = Mask2FormerImageProcessor.from_pretrained(
    "facebook/mask2former-swin-large-coco-panoptic"
)
model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-large-coco-panoptic"
)

image = Image.open("street.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Panoptic segmentation
result = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]

seg_map = result["segmentation"]  # H x W tensor of segment IDs
segments = result["segments_info"]

for seg in segments:
    print(f"ID: {seg['id']}, Label: {seg['label_id']}, "
          f"Score: {seg['score']:.3f}, Area: {(seg_map == seg['id']).sum()}")

Grounded SAM -- Text-Prompted Segmentation

grounded_sam.py
from autodistill_grounded_sam_2 import GroundedSAM2
from autodistill.detection import CaptionOntology

# Define what to segment via text (caption prompt -> class name)
ontology = CaptionOntology({
    "car": "car",
    "person": "person",
    "traffic light": "traffic light",
})

model = GroundedSAM2(ontology=ontology)

# Run on an image -- returns a supervision Detections object
results = model.predict("street_scene.jpg")

class_names = ontology.classes()
for i in range(len(results)):
    print(f"Class: {class_names[results.class_id[i]]}")
    print(f"Confidence: {results.confidence[i]:.3f}")
    print(f"Mask pixels: {results.mask[i].sum()}")

When to Use What

Interactive annotation / labeling tool

SAM 2.1

Click-to-segment with real-time feedback. Best prompt-based model available.

Video object tracking with masks

SAM 2.1

Only model with native video propagation. Prompt once, track across frames.

Full scene parsing (all pixels labeled)

OneFormer or Mask2Former

Highest ADE20K mIoU. Trained on fixed class vocabularies for dense prediction.

Open-vocabulary: "segment the red car"

Grounded SAM

Natural language input. No predefined classes. Combines detection + segmentation.

Segment from a visual example (no text, no click)

SegGPT

In-context learning. Provide a reference image-mask pair, model generalizes.

Multi-task vision pipeline (detect + segment + caption)

Florence-2

Single model handles 10+ tasks. Lower per-task accuracy but extreme versatility.

Production panoptic segmentation at scale

Mask2Former

Battle-tested, strong COCO PQ, good speed/accuracy trade-off with smaller backbones.

Multi-modal prompting (text + click + audio)

SEEM

Accepts text, point, box, and even audio prompts. Good for research prototypes.

Key Takeaways

  1. There is no single best model. SAM 2 wins on interactivity and video; Mask2Former/OneFormer win on closed-vocabulary accuracy; Grounded SAM wins on open-vocabulary ease.
  2. Zero-shot does not mean highest accuracy. SAM 2 cannot label pixels by class name. For semantic segmentation with a known label set, supervised models still dominate.
  3. Combine models when needed. Grounded SAM is literally Grounding DINO + SAM composed together. Many production pipelines chain a detector with a segmentor.
  4. Video segmentation is still young. SAM 2 is the clear leader, but long-video consistency and re-identification after occlusion remain open challenges.
  5. Foundation models are converging. Florence-2 and SEEM show that segmentation is becoming one capability among many inside unified vision models.
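The detector-plus-segmentor chaining in takeaway 3 reduces to plain function composition. A toy sketch with stub callables standing in for the real models (all names here are illustrative; a production pipeline would plug Grounding DINO and SAM 2 into the same two slots):

```python
import numpy as np

def chain(detector, segmentor, image):
    """Detector returns (x1, y1, x2, y2) boxes; each box prompts the
    segmentor for a pixel mask -- the Grounded SAM composition pattern."""
    return [segmentor(image, box) for box in detector(image)]

# stub callables standing in for Grounding DINO and SAM 2
def toy_detector(img):
    return [(0, 0, 2, 2)]

def toy_segmentor(img, box):
    x1, y1, x2, y2 = box
    mask = np.zeros(img.shape[:2], dtype=bool)
    mask[y1:y2, x1:x2] = True
    return mask

image = np.zeros((4, 4, 3))
masks = chain(toy_detector, toy_segmentor, image)
print(len(masks), int(masks[0].sum()))
```

Keeping the two stages behind simple callables also makes them swappable: replace the detector with any open-vocabulary model and the masks keep flowing.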

Related Resources