Image Segmentation

Image segmentation classifies each pixel in an image. It yields precise object boundaries for medical imaging, autonomous vehicles, and image editing.

How Image Segmentation Works

A technical deep-dive into image segmentation, from pixel-level classification to SAM's promptable foundation model.

1. Segmentation Types

Three main types: semantic (what class), instance (which object), and panoptic (both).

Semantic Segmentation

Label every pixel with a class

Output: Class mask (H x W)
Use case: Autonomous driving, medical imaging

Instance Segmentation

Distinguish individual objects

Example labels: car_1, car_2, car_3
Output: Instance masks + class
Use case: Counting, tracking, robotics

Panoptic Segmentation

Semantic + Instance combined

Example labels: person_1, car_1
Output: Unified segmentation
Use case: Scene understanding, AR/VR

Aspect                   | Semantic | Instance | Panoptic
Distinguishes instances? | No       | Yes      | Yes
Background classes?      | Yes      | No       | Yes
Overlapping masks?       | No       | Yes      | No
Main metric              | mIoU     | AP       | PQ
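
To make the three output types concrete, here is a minimal NumPy sketch on a toy 4 x 6 scene. The class IDs (0 = road, 1 = car) and the panoptic encoding `class_id * 1000 + instance_id` are assumptions chosen for illustration; datasets differ in how they pack panoptic labels.

```python
import numpy as np

# Semantic segmentation: one class ID per pixel (assumed: 0 = road, 1 = car).
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance segmentation: one binary mask per object, each with a class label.
car_1 = np.zeros(semantic.shape, dtype=bool)
car_1[1:3, 0:2] = True
car_2 = np.zeros(semantic.shape, dtype=bool)
car_2[1:3, 4:6] = True
instances = [("car", car_1), ("car", car_2)]

# Panoptic segmentation: every pixel carries a class and an instance ID.
# Assumed encoding: class_id * 1000 + instance_id, with instance_id 0 for
# "stuff" classes such as road.
instance_ids = np.zeros_like(semantic)
for idx, (_, mask) in enumerate(instances, start=1):
    instance_ids[mask] = idx
panoptic = semantic * 1000 + instance_ids

print(np.unique(semantic).tolist())   # [0, 1]
print(len(instances))                 # 2
print(np.unique(panoptic).tolist())   # [0, 1001, 1002]
```

The semantic map cannot tell the two cars apart, the instance list says nothing about the road, and the panoptic map covers both in a single non-overlapping array.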

2. Architecture Evolution

From FCN to SAM 2: a decade of progress in segmentation architectures.

Chart: reported score by model and year

Model       | Year | Score
FCN         | 2015 | 62.2%
U-Net       | 2015 | 71%
DeepLab v3+ | 2018 | 82.1%
Mask R-CNN  | 2017 | 83%
SegFormer   | 2021 | 84%
Mask2Former | 2022 | 86.4%
SAM         | 2023 | 89%
SAM 2       | 2024 | 91%

Encoder-Decoder (U-Net style)

Input -> Encoder -> Decoder -> Mask

Downsampling captures context, upsampling recovers spatial detail. Skip connections preserve fine features.
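
The shape flow of an encoder-decoder with skip connections can be sketched in plain NumPy. This is only an illustration of the data path: a real U-Net applies learned convolutions at every stage, which are omitted here, and the helper names `downsample` and `upsample` are hypothetical.

```python
import numpy as np

def downsample(x):
    """2x2 average pooling: halves height and width, keeps channels."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbor 2x upsampling: doubles height and width."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

image = np.random.rand(64, 64, 3)

# Encoder: shrink the spatial size, keeping each stage for the skip connections.
e1 = image                    # 64 x 64 x 3
e2 = downsample(e1)           # 32 x 32 x 3
bottleneck = downsample(e2)   # 16 x 16 x 3

# Decoder: upsample, then concatenate the matching encoder stage (the skip
# connection) so fine spatial detail is available again.
d2 = np.concatenate([upsample(bottleneck), e2], axis=-1)   # 32 x 32 x 6
d1 = np.concatenate([upsample(d2), e1], axis=-1)           # 64 x 64 x 9

print(bottleneck.shape, d2.shape, d1.shape)
```

The final feature map has the input's resolution again, which is what allows a per-pixel prediction head to sit on top of it.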

Transformer-Based (SAM style)

Image Encoder + Prompt Encoder -> Mask Decoder

Pre-computed image embeddings plus lightweight prompt encoding enable real-time interactive segmentation.

3. SAM: Segment Anything Model

Meta's foundation model for segmentation, trained on 11M images and 1.1B masks. It is promptable: segment anything with points, boxes, or rough masks. Text prompts were explored in the SAM paper but are not part of the released model.

SAM Architecture

Image Encoder: ViT-H, 632M params, run once per image
+
Prompt Encoder: sparse prompts (points/boxes), lightweight
->
Mask Decoder: transformer with 2-way attention, ~4M params
->
Output: 3 masks + IoU scores, ambiguity-aware

Prompt Types

Point: Click on object
Box: Draw bounding box
Mask: Rough mask input
Text: Natural language (not supported by the released SAM or SAM 2 models; available through pipelines such as Grounded SAM)

SAM (2023)

Image only
  • +Zero-shot transfer to any domain
  • +Real-time with pre-computed embeddings
  • +Ambiguity-aware (3 mask outputs)
  • -No video/temporal support

SAM 2 (2024)

Image + Video
  • +Unified image and video model
  • +Memory mechanism for tracking
  • +6x faster than SAM
  • +Streaming architecture

How SAM Works

1. Add Point Prompt: Click on target object
2. Model Processes: Decoder generates masks
3. Output Mask: Precise segmentation

4. Mask Formats & Representation

How segmentation masks are stored and encoded.

Format      | Layout             | Size        | Use
Binary Mask | H x W (0/1)        | 1 bit/pixel | Single object
Class Mask  | H x W (0-N)        | 8 bit/pixel | Semantic seg
RLE         | Run-length encoded | Compressed  | COCO format
Polygon     | [[x,y], ...]       | Variable    | Annotation tools
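
The per-pixel sizes above can be checked with a short NumPy sketch. The image size (1080 x 1920) and the class ID 3 are arbitrary assumptions. Note that NumPy stores a boolean array as one byte per pixel, so reaching 1 bit per pixel requires packing, shown here with `np.packbits`.

```python
import numpy as np

h, w = 1080, 1920

# Binary mask: True where the object is.
binary = np.zeros((h, w), dtype=bool)
binary[200:800, 300:1500] = True

# In memory a bool array uses 1 byte per pixel; packing gives 1 bit per pixel.
packed = np.packbits(binary)
print(binary.nbytes, packed.nbytes)   # 2073600 259200

# Class mask: uint8 holds class IDs 0-255 at 8 bits per pixel.
class_mask = np.zeros((h, w), dtype=np.uint8)
class_mask[binary] = 3                # hypothetical class ID
print(class_mask.nbytes)              # 2073600

# Unpacking restores the original binary mask.
restored = np.unpackbits(packed)[: h * w].reshape(h, w).astype(bool)
print(np.array_equal(restored, binary))   # True
```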

Run-Length Encoding (RLE)

COCO dataset uses RLE to compress binary masks efficiently. Stores runs of consecutive values.

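Binary Mask (visualized):
0 0 0 1 1 1 1 0
0 0 1 1 1 1 1 1
0 0 1 1 1 1 1 1
0 0 0 1 1 1 1 0
RLE Encoded (scanning row by row):
{"counts": [3, 4, 3, 6, 2, 6, 3, 4, 1], "size": [4, 8]}
Reads: 3 zeros, 4 ones, 3 zeros, 6 ones, ...
COCO itself scans column by column, so it stores the same mask as {"counts": [9, 2, 1, 16, 1, 2, 1], "size": [4, 8]}.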
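
A dependency-free sketch of run-length encoding and decoding for the 4 x 8 mask shown above. The function names `rle_encode` and `rle_decode` and the `order` parameter are hypothetical, not the pycocotools API. The sketch supports both scan orders because COCO flattens masks column by column, while reading row by row is easier to follow by eye.

```python
def rle_encode(mask, order="C"):
    """Run-length encode a 2D 0/1 mask given as a list of rows.

    order="C" scans row by row; order="F" scans column by column (the order
    COCO uses). Counts always begin with the run of zeros, which may be 0.
    """
    h, w = len(mask), len(mask[0])
    if order == "F":
        flat = [mask[r][c] for c in range(w) for r in range(h)]
    else:
        flat = [v for row in mask for v in row]
    counts, prev, run = [], 0, 0
    for v in flat:
        if v == prev:
            run += 1
        else:
            counts.append(run)
            prev, run = v, 1
    counts.append(run)
    return {"counts": counts, "size": [h, w]}

def rle_decode(rle, order="C"):
    """Invert rle_encode; `order` must match the one used for encoding."""
    h, w = rle["size"]
    flat, value = [], 0
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    if order == "F":
        return [[flat[c * h + r] for c in range(w)] for r in range(h)]
    return [flat[r * w:(r + 1) * w] for r in range(h)]

mask = [
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
]

print(rle_encode(mask)["counts"])        # [3, 4, 3, 6, 2, 6, 3, 4, 1]
print(rle_encode(mask, "F")["counts"])   # [9, 2, 1, 16, 1, 2, 1]
print(rle_decode(rle_encode(mask, "F"), "F") == mask)   # True
```

Both encodings sum to 32, the number of pixels, which is a quick sanity check on any uncompressed RLE.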

Common Mask Operations

Resize: Interpolation matters! Use nearest neighbor for masks.
Boolean Ops: AND, OR, XOR, NOT. Combine or subtract masks.
Morphology: Erode, dilate, open, close. Clean up mask boundaries.
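
The three operation families can be sketched with NumPy alone. The helpers `resize_nearest`, `dilate`, and `erode` are hypothetical names written for this example, with a fixed 3 x 3 square structuring element; in practice OpenCV or SciPy provide tuned equivalents. Nearest-neighbor resizing matters because it only copies existing labels, whereas bilinear interpolation would blend class IDs into values that mean nothing.

```python
import numpy as np

def resize_nearest(mask, new_h, new_w):
    """Nearest-neighbor resize: copies existing labels, never blends them."""
    h, w = mask.shape
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return mask[rows[:, None], cols]

def dilate(mask):
    """Grow a boolean mask by one pixel (3x3 square structuring element)."""
    padded = np.pad(mask, 1, constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

def erode(mask):
    """Shrink a boolean mask by one pixel, as the dual of dilation.

    Pixels outside the image count as foreground here, so objects touching
    the border are not eroded along it.
    """
    return ~dilate(~mask)

a = np.zeros((6, 6), dtype=bool)
a[1:4, 1:4] = True
b = np.zeros((6, 6), dtype=bool)
b[2:5, 2:5] = True

# Boolean ops: combine or subtract masks.
union, overlap, only_a = a | b, a & b, a & ~b
print(union.sum(), overlap.sum(), only_a.sum())   # 14 4 5

# Morphology: opening (erode, then dilate) removes specks smaller than the
# structuring element while keeping larger shapes.
opened = dilate(erode(a))
print(np.array_equal(opened, a))                  # True

# Resize: upscaling 2x keeps the mask strictly binary.
big = resize_nearest(a.astype(np.uint8), 12, 12)
print(big.shape, np.unique(big).tolist())         # (12, 12) [0, 1]
```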

5. Segmentation Metrics

How to measure segmentation quality.

Metric       | Range  | Name                         | Formula
mIoU         | 0-100% | Mean Intersection over Union | TP / (TP + FP + FN) per class, averaged over classes
Dice         | 0-100% | Dice Coefficient (F1)        | 2*TP / (2*TP + FP + FN)
PA           | 0-100% | Pixel Accuracy               | Correct / Total pixels
Boundary IoU | 0-100% | Boundary Quality             | IoU on boundary pixels
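
For a single foreground class, the first three metrics follow directly from the pixel counts. A minimal sketch, where `binary_metrics` is a hypothetical helper and the toy masks are chosen so that the prediction is the ground truth shifted by one column:

```python
import numpy as np

def binary_metrics(pred, gt):
    """IoU, Dice, and pixel accuracy for two boolean masks of equal shape.

    Assumes at least one foreground pixel in pred or gt; otherwise the IoU
    and Dice denominators are zero.
    """
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    tp = int(np.sum(pred & gt))     # foreground predicted as foreground
    fp = int(np.sum(pred & ~gt))    # background predicted as foreground
    fn = int(np.sum(~pred & gt))    # foreground predicted as background
    tn = int(np.sum(~pred & ~gt))   # background predicted as background
    return {
        "iou": tp / (tp + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "pixel_accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

gt = np.zeros((6, 6), dtype=bool)
gt[1:4, 1:4] = True
pred = np.zeros((6, 6), dtype=bool)
pred[1:4, 2:5] = True

m = binary_metrics(pred, gt)
print(m["iou"])                        # 0.5
print(round(m["dice"], 4))             # 0.6667
print(round(m["pixel_accuracy"], 4))   # 0.8333
```

Pixel accuracy looks comfortable at 83% even though the masks overlap by only half, because the many correctly labeled background pixels dominate the count. Dice and IoU are tied by Dice = 2 * IoU / (1 + IoU), so they always rank predictions in the same order.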

IoU (Intersection over Union) for Masks

Figure: predicted mask (blue) overlapping the ground truth (green), with the intersection in yellow.
IoU = Intersection / Union
Rule of thumb: 1.0 is perfect, 0.9+ is excellent, 0.5-0.9 is good, below 0.5 is poor.
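
For multi-class semantic segmentation, per-class IoU is usually read off a confusion matrix and then averaged into mIoU. The sketch below is one common way to compute it; `mean_iou` is a hypothetical helper, and skipping classes that appear in neither mask is an assumption (benchmarks differ on how they treat absent classes).

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Return (mIoU, per-class IoU) for two integer class masks."""
    pred, gt = np.asarray(pred).ravel(), np.asarray(gt).ravel()
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    cm = np.bincount(gt * num_classes + pred, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = tp + fp + fn
    present = denom > 0   # skip classes absent from both pred and gt
    ious = tp[present] / denom[present]
    return float(ious.mean()), ious

gt = [[0, 0, 1, 1],
      [0, 0, 1, 1],
      [2, 2, 2, 2]]
pred = [[0, 0, 0, 1],
        [0, 0, 1, 1],
        [2, 2, 2, 1]]

miou, per_class = mean_iou(pred, gt, num_classes=3)
print(per_class.tolist())   # [0.8, 0.6, 0.75]
print(round(miou, 4))       # 0.7167
```

Because every class contributes equally to the mean, a small class that is segmented badly pulls mIoU down as much as a large one, which pixel accuracy would hide.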

ADE20K Semantic Segmentation Leaderboard

Model         | Backbone      | mIoU (val) | Year
InternImage-H | InternImage-H | 62.9%      | 2023
Mask2Former   | Swin-L        | 57.3%      | 2022
SegFormer     | MiT-B5        | 51.8%      | 2021
DeepLab v3+   | ResNet-101    | 45.7%      | 2018

6. Code Examples

Get started with segmentation in Python.

SAM (Foundation Model)
Install: pip install segment-anything

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM model
sam = sam_model_registry['vit_h'](checkpoint='sam_vit_h.pth')
predictor = SamPredictor(sam)

# Load and set image
image = cv2.imread('image.jpg')
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image_rgb)

# Prompt with point (x, y) and label (1=foreground)
input_point = np.array([[500, 375]])
input_label = np.array([1])

# Generate mask
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True
)

# Best mask
best_mask = masks[np.argmax(scores)]

Quick Reference

For Interactive/Promptable
  • SAM / SAM 2
  • Grounded SAM
  • SEEM
For Semantic/Panoptic
  • Mask2Former
  • OneFormer
  • SegFormer
For Video
  • SAM 2
  • XMem
  • DEVA

Use Cases

  • Medical image analysis
  • Autonomous driving
  • Background removal
  • Satellite imagery analysis

Architectural Patterns

Semantic Segmentation

Classify every pixel into categories (no instance distinction).

Pros:
  • Dense predictions
  • Well-suited for scene parsing
Cons:
  • Doesn't separate instances
  • Needs full annotations

Instance Segmentation

Segment and distinguish individual object instances.

Pros:
  • Separates objects
  • Combines detection + segmentation
Cons:
  • More complex
  • Higher compute cost

Panoptic Segmentation

Unified semantic + instance segmentation.

Pros:
  • Complete scene understanding
  • Both stuff and things
Cons:
  • Most complex
  • Requires rich annotations

Implementations

Open Source

Segment Anything (SAM)
License: Apache 2.0
Zero-shot segmentation from point or box prompts.

SAM 2
License: Apache 2.0
Video segmentation. Tracks objects through frames.

Mask2Former
License: MIT
Universal architecture for semantic, instance, and panoptic segmentation.

YOLOv8-seg
License: AGPL-3.0
Fast instance segmentation. Same ease as YOLO detection.

nnU-Net
License: Apache 2.0
Self-configuring for medical imaging. Top performer on many challenges.

Quick Facts

Input: Image
Output: Segmentation Mask
Implementations: 5 open source, 0 API
Patterns: 3 approaches
