Image→Segmentation Mask

Image Segmentation

Classify each pixel in an image. Enables precise object boundaries for medical imaging, autonomous vehicles, and image editing.

How Image Segmentation Works

A technical deep-dive into image segmentation, from pixel-level classification to SAM's promptable foundation model.

1. Segmentation Types 2. Architectures 3. SAM Deep-Dive 4. Mask Formats 5. Metrics 6. Code

Segmentation Types

Three main types: semantic (what class), instance (which object), and panoptic (both).

Semantic Segmentation

Label every pixel with a class

Output: Class mask (H x W)

Use case: Autonomous driving, medical imaging

Instance Segmentation

Distinguish individual objects

Example instance labels: car_1, car_2, car_3

Output: Instance masks + class

Use case: Counting, tracking, robotics

Panoptic Segmentation

Semantic + Instance combined

Example labels: person_1, car_1

Output: Unified segmentation

Use case: Scene understanding, AR/VR

Aspect	Semantic	Instance	Panoptic
Distinguishes instances?	No	Yes	Yes
Background classes?	Yes	No	Yes
Overlapping masks?	No	Yes	No
Main metric	mIoU	AP	PQ
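The differences in the table are easiest to see in the shape of each output. The following is a minimal NumPy sketch on a toy 4 x 6 scene; the class ids, the array layout, and the use of instance id 0 for background are illustrative assumptions, not a standard format.

```python
import numpy as np

# Toy 4x6 scene: two cars on a road (all values are illustrative).
H, W = 4, 6
ROAD, CAR = 0, 1

# Semantic: one H x W array of class ids; both cars share the same label.
semantic = np.zeros((H, W), dtype=np.uint8)
semantic[1:3, 0:2] = CAR
semantic[1:3, 3:6] = CAR

# Instance: one boolean mask per object plus a class id per mask.
# Background pixels are not covered by any mask.
instance_masks = np.zeros((2, H, W), dtype=bool)
instance_masks[0, 1:3, 0:2] = True   # car_1
instance_masks[1, 1:3, 3:6] = True   # car_2
instance_classes = np.array([CAR, CAR])

# Panoptic: every pixel gets a (class id, instance id) pair.
# Assumption: instance id 0 marks background such as road.
panoptic = np.zeros((H, W, 2), dtype=np.int32)
panoptic[..., 0] = semantic
for idx, mask in enumerate(instance_masks, start=1):
    panoptic[mask, 1] = idx

print(np.unique(semantic))                # class ids present
print(instance_masks.sum(axis=(1, 2)))    # pixel area per instance
print(np.unique(panoptic[..., 1]))        # instance ids present
```

The semantic array cannot tell the two cars apart, the instance masks say nothing about the road, and the panoptic array carries both pieces of information for every pixel.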

Architecture Evolution

From FCN to SAM 2: a decade of progress in segmentation architectures.

[Chart: reported score by model and release year; y-axis from 60% to 90%]

Model	Year	Score
FCN	2015	62.2%
U-Net	2015	71%
DeepLab v3+	2018	82.1%
Mask R-CNN	2017	83%
SegFormer	2021	84%
Mask2Former	2022	86.4%
SAM	2023	89%
SAM 2	2024	91%

Encoder-Decoder (U-Net style)

Input → Encoder → Decoder → Mask

Downsampling captures context, upsampling recovers spatial detail. Skip connections preserve fine features.
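To make the data flow concrete, here is a shape-level sketch in NumPy. It is not a trained network: the learned convolutions are left out, average pooling and nearest-neighbour upsampling stand in for the real layers, and the helper names `downsample` and `upsample` are hypothetical.

```python
import numpy as np

def downsample(x):
    """2x2 average pooling over a (C, H, W) feature map (H and W must be even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """2x nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

x = np.random.default_rng(0).random((3, 64, 64))  # stand-in input "image"

# Encoder: resolution halves at each stage; each stage's output is kept as a skip.
e1 = x                         # (3, 64, 64)
e2 = downsample(e1)            # (3, 32, 32)
bottleneck = downsample(e2)    # (3, 16, 16)

# Decoder: upsample, then concatenate the matching encoder map along channels.
d2 = np.concatenate([upsample(bottleneck), e2], axis=0)   # (6, 32, 32)
d1 = np.concatenate([upsample(d2), e1], axis=0)           # (9, 64, 64)

print(bottleneck.shape, d2.shape, d1.shape)
```

The concatenation is the skip connection: the decoder sees both the coarse, context-rich features and the full-resolution features that pooling would otherwise have discarded.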

Transformer-Based (SAM style)

Image Encoder + Prompt Encoder → Mask Decoder

Pre-computed image embeddings plus lightweight prompt encoding enable real-time interactive segmentation.

SAM: Segment Anything Model

Meta's foundation model for segmentation, trained on 11M images and 1.1B masks. It is promptable: segment anything with points, boxes, or rough masks. Text prompts were explored in the SAM paper but are not part of the released model.

SAM Architecture

Image Encoder: ViT-H, 632M params. Run once per image.

Prompt Encoder: sparse prompts (points/boxes). Lightweight.

Mask Decoder: Transformer with 2-way attention, ~4M params.

Output: 3 masks + IoU scores. Ambiguity-aware.

Prompt Types

Point

Click on object

Box

Draw bounding box

Mask

Rough mask input

Text

Natural language (explored in the SAM paper; not supported by the released SAM or SAM 2 models)

SAM (2023)

Image only

+ Zero-shot transfer to any domain
+ Real-time with pre-computed embeddings
+ Ambiguity-aware (3 mask outputs)
- No video/temporal support

SAM 2 (2024)

Image + Video

+ Unified image and video model
+ Memory mechanism for tracking
+ 6x faster than SAM
+ Streaming architecture

How SAM Works

1. Add Point Prompt

Click on target object

2. Model Processes

Decoder generates masks

3. Output Mask

Precise segmentation

Mask Formats & Representation

How segmentation masks are stored and encoded.

Binary Mask

Format: H x W (0/1)

Size: 1 bit/pixel

Use: Single object

Class Mask

Format: H x W (0-N)

Size: 8 bit/pixel

Use: Semantic seg

RLE

Format: Run-length encoded

Size: Compressed

Use: COCO format

Polygon

Format: [[x,y], ...]

Size: Variable

Use: Annotation tools
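The per-pixel sizes quoted above can be checked directly. This is a small sketch using NumPy's `packbits` and `unpackbits`; the 480 x 640 image size and the rectangle are arbitrary choices for illustration.

```python
import numpy as np

H, W = 480, 640
binary = np.zeros((H, W), dtype=bool)
binary[100:300, 200:500] = True

# Stored naively, a boolean or uint8 array costs one byte per pixel.
as_bytes = binary.astype(np.uint8)

# Packing 8 pixels into each byte reaches the 1 bit/pixel figure.
packed = np.packbits(binary, axis=None)

# A class mask with at most 256 classes fits in uint8: 8 bits per pixel.
class_mask = np.zeros((H, W), dtype=np.uint8)

print(as_bytes.nbytes, packed.nbytes, class_mask.nbytes)

# Unpacking restores the original mask exactly.
restored = np.unpackbits(packed, count=H * W).reshape(H, W).astype(bool)
```

Packed bits are compact but awkward to index, which is one reason masks are usually kept as byte arrays in memory and compressed only for storage.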

Run-Length Encoding (RLE)

The COCO dataset uses RLE to compress binary masks. Instead of storing every pixel, it stores the lengths of runs of consecutive equal values.

[Figure: the binary mask, visualized]

RLE Encoded:

{"counts": [3, 4, 2, 6, 2, 6, 3, 4, 2], "size": [4, 8]}

Reads: 3 zeros, 4 ones, 2 zeros, 6 ones, ...
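The encoding above can be decoded and re-encoded in a few lines. The sketch below handles uncompressed RLE only, and the function names `rle_encode` and `rle_decode` are hypothetical. It assumes row-major flattening, which turns the example counts into a compact blob; COCO's own tooling (pycocotools) flattens column by column, so pass `order="F"` when COCO-compatible counts are needed.

```python
import numpy as np

def rle_encode(mask, order="C"):
    """Uncompressed RLE: run lengths alternating zeros, ones, zeros, ...

    The first count is always a run of zeros (0 if the mask starts with a 1).
    """
    arr = np.asarray(mask, dtype=np.uint8)
    counts = []
    current, run = 0, 0
    for value in arr.flatten(order=order):
        if value == current:
            run += 1
        else:
            counts.append(run)
            current, run = int(value), 1
    counts.append(run)
    return {"counts": counts, "size": list(arr.shape)}

def rle_decode(rle, order="C"):
    """Rebuild the (height, width) mask from alternating run lengths."""
    h, w = rle["size"]
    flat = np.zeros(h * w, dtype=np.uint8)
    pos = 0
    for i, run in enumerate(rle["counts"]):
        if i % 2 == 1:            # odd positions are runs of ones
            flat[pos:pos + run] = 1
        pos += run
    return flat.reshape((h, w), order=order)

rle = {"counts": [3, 4, 2, 6, 2, 6, 3, 4, 2], "size": [4, 8]}
mask = rle_decode(rle)
for row in mask:
    print("".join(str(int(v)) for v in row))
```

The counts sum to 32, matching the 4 x 8 size, and encoding the decoded mask returns the original counts.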

Common Mask Operations

Resize

Interpolation matters!

Use nearest neighbor for masks

Boolean Ops

AND, OR, XOR, NOT

Combine or subtract masks

Morphology

Erode, dilate, open, close

Clean up mask boundaries
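The three operations above can be sketched with plain NumPy. This is a minimal illustration with a fixed 3 x 3 neighbourhood; the helper names are hypothetical, and in practice a library such as OpenCV or SciPy would normally be used for resizing and morphology.

```python
import numpy as np

def resize_nearest(mask, new_h, new_w):
    """Nearest-neighbour resize: each output pixel copies one input pixel,
    so no label values are invented (as averaging interpolation would do)."""
    h, w = mask.shape
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return mask[rows[:, None], cols[None, :]]

def dilate(mask):
    """3x3 dilation: a pixel is set if any pixel in its neighbourhood is set."""
    h, w = mask.shape
    padded = np.pad(mask, 1, constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def erode(mask):
    """3x3 erosion: a pixel survives only if its whole neighbourhood is set."""
    h, w = mask.shape
    padded = np.pad(mask, 1, constant_values=False)
    out = np.ones_like(mask)
    for dy in range(3):
        for dx in range(3):
            out &= padded[dy:dy + h, dx:dx + w]
    return out

a = np.zeros((6, 6), dtype=bool)
a[1:5, 1:5] = True
b = np.zeros((6, 6), dtype=bool)
b[3:6, 3:6] = True

union = a | b             # OR: combine masks
intersection = a & b      # AND: overlap only
a_minus_b = a & ~b        # AND NOT: subtract b from a
opened = dilate(erode(a)) # opening: erode, then dilate

labels = np.array([[0, 5], [7, 0]], dtype=np.uint8)
print(resize_nearest(labels, 4, 4))
```

Resizing the class mask with bilinear interpolation would blend 5 and 7 into values such as 6, which may be a different class entirely; nearest-neighbour keeps only ids that were already present.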

Segmentation Metrics

How to measure segmentation quality.

mIoU

0-100%

Mean Intersection over Union

TP / (TP + FP + FN) per class, averaged over classes

Dice

0-100%

Dice Coefficient (F1)

2*TP / (2*TP + FP + FN)

0-100%

Pixel Accuracy

Correct / Total pixels

Boundary IoU

0-100%

Boundary Quality

IoU on boundary pixels
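The first three metrics follow directly from pixel counts. The sketch below is one plausible implementation; the function names are hypothetical, returning 1.0 when both masks are empty is a convention chosen here, and Boundary IoU is left out because it needs a boundary-extraction step.

```python
import numpy as np

def binary_metrics(pred, gt):
    """IoU, Dice and pixel accuracy for a pair of binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = int(np.sum(pred & gt))
    fp = int(np.sum(pred & ~gt))
    fn = int(np.sum(~pred & gt))
    tn = int(np.sum(~pred & ~gt))
    denom = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    iou = tp / denom if denom else 1.0
    dice = 2 * tp / (2 * tp + fp + fn) if denom else 1.0
    accuracy = (tp + tn) / pred.size
    return {"iou": iou, "dice": dice, "accuracy": accuracy}

def mean_iou(pred, gt, num_classes):
    """Average per-class IoU, skipping classes absent from both masks."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.sum(p | g)
        if union:
            ious.append(np.sum(p & g) / union)
    return float(np.mean(ious))

gt = np.zeros((4, 4), dtype=np.uint8)
gt[0:2, :] = 1                  # 8 foreground pixels
pred = np.zeros((4, 4), dtype=np.uint8)
pred[0:2, 1:4] = 1              # 6 correct foreground pixels
pred[2, 0:2] = 1                # 2 false positives

print(binary_metrics(pred, gt))
print(mean_iou(pred, gt, num_classes=2))
```

Here TP = 6, FP = 2, FN = 2 and TN = 6, so IoU is 6/10, Dice is 12/16 and pixel accuracy is 12/16.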

IoU (Intersection over Union) for Masks

Overlap

Blue=Pred, Green=GT, Yellow=Intersection

IoU = Intersection / Union

Perfect: 1.0 | Good: above 0.7 | Poor: below 0.3

IoU above 0.9 (Excellent)

IoU 0.5 to 0.9 (Good)

IoU below 0.5 (Poor)
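Because IoU and Dice are built from the same counts, one can be converted into the other. The short derivation below uses only the two formulas given in the metrics section.

```latex
\mathrm{IoU} = \frac{TP}{TP + FP + FN},
\qquad
\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}
             = \frac{2\,TP}{TP + (TP + FP + FN)}

% Divide numerator and denominator by (TP + FP + FN):
\mathrm{Dice} = \frac{2\,\mathrm{IoU}}{1 + \mathrm{IoU}},
\qquad
\mathrm{IoU} = \frac{\mathrm{Dice}}{2 - \mathrm{Dice}}

% Worked values:
\mathrm{IoU} = 0.5 \;\Rightarrow\; \mathrm{Dice} = \tfrac{1.0}{1.5} \approx 0.667,
\qquad
\mathrm{IoU} = 0.9 \;\Rightarrow\; \mathrm{Dice} = \tfrac{1.8}{1.9} \approx 0.947
```

Dice is never lower than IoU for the same prediction, so thresholds stated for one metric should not be applied unchanged to the other.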

ADE20K Semantic Segmentation Leaderboard

Model	Backbone	mIoU (val)	Year
InternImage-H	InternImage-H	62.9%	2023
Mask2Former	Swin-L	57.3%	2022
SegFormer	MiT-B5	51.8%	2021
DeepLab v3+	ResNet-101	45.7%	2018

Code Examples

Get started with segmentation in Python.

SAM

pip install segment-anything

Foundation Model

from segment_anything import sam_model_registry, SamPredictor
import cv2
import numpy as np

# Load SAM model (the checkpoint path must point to a downloaded ViT-H weights file)
sam = sam_model_registry['vit_h'](checkpoint='sam_vit_h.pth')
predictor = SamPredictor(sam)

# Load and set image
image = cv2.imread('image.jpg')
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image_rgb)

# Prompt with point (x, y) and label (1=foreground)
input_point = np.array([[500, 375]])
input_label = np.array([1])

# Generate mask
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True
)

# Best mask
best_mask = masks[np.argmax(scores)]
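A common next step is to turn the chosen mask into an area and a bounding box. The sketch below uses a synthetic array as a stand-in for `best_mask` so that it runs without the model; `mask_to_box` is a hypothetical helper, not part of the segment-anything package.

```python
import numpy as np

def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) of a boolean mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Stand-in for the H x W boolean mask chosen in the example above.
best_mask = np.zeros((480, 640), dtype=bool)
best_mask[100:200, 300:450] = True

area = int(best_mask.sum())      # number of foreground pixels
box = mask_to_box(best_mask)     # inclusive pixel coordinates, XYXY order
print(area, box)
```

A box in this XYXY order could then be fed back as a box prompt to refine the result, assuming the same predictor and image are still loaded.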