
Depth Estimation

Predict depth from a single image. Critical for 3D reconstruction, AR/VR, and robotics.

How Depth Estimation Works

A technical deep dive into depth estimation, from monocular depth prediction to 3D point cloud generation.

Real Examples

See how depth estimation handles different scene types. Depth maps are shown with a turbo colormap: red = close, blue = far.

Mountain Landscape: sky (far, dark blue) -> mountains (mid, cyan/green) -> foreground rocks (close, red/yellow)

Street Scene: distant buildings (far) -> road/cars (mid) -> foreground elements (close)

Indoor Room: back wall (far) -> furniture (mid) -> near objects (close)

Person Portrait: blurred background (far) -> face/body (mid-close) -> nose tip (closest)

Depth Estimation Types

Three approaches: monocular (single image), stereo (two cameras), and multi-view (many images).

Monocular Depth

Input: a single image from one camera.

Pros: Single camera needed, Works on any image, Fast inference
Cons: Scale ambiguity, Less accurate
Examples: Depth Anything, MiDaS, ZoeDepth

Stereo Depth

Input: two images (left/right) from a calibrated stereo pair.

Pros: Metric depth, More accurate, Triangulation-based
Cons: Needs calibrated cameras, Matching artifacts
Examples: RAFT-Stereo, AANet, PSMNet
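
To make the triangulation concrete, here is a minimal sketch of converting a disparity map to metric depth via Z = f * B / d, assuming a rectified pair with known focal length f (in pixels) and baseline B (in meters); the variable names and example values are illustrative:

import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Triangulate depth from stereo disparity: Z = f * B / d.
    disparity: H x W array in pixels (e.g. from a stereo matcher).
    focal_px: focal length in pixels; baseline_m: camera separation in meters."""
    depth = np.full_like(disparity, np.inf, dtype=np.float64)
    valid = disparity > 0                       # zero disparity = infinitely far
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Illustrative values: 640 px focal length, 12 cm baseline
depth = disparity_to_depth(np.random.rand(480, 640) * 64, 640.0, 0.12)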

Multi-view Depth

Input: multiple images of the same scene from different viewpoints.

Pros: Dense reconstruction, High accuracy
Cons: Complex setup, Slow processing
Examples: MVSNet, COLMAP, DUSt3R

Relative vs Metric Depth

Relative Depth

Ordinal relationships: "A is closer than B"

Output: 0-1 normalized values. Scale-invariant. Good for visual effects, not 3D reconstruction.

Metric Depth

Actual distances: "A is 2.5 meters away"

Output: Real-world units (meters). Required for robotics, AR, autonomous driving.
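
Relative depth can be promoted to metric depth when a few sparse metric measurements are available (from LiDAR, a rangefinder, etc.) by fitting a scale and shift, the same least-squares alignment commonly used when evaluating relative-depth models. A sketch, with illustrative names:

import numpy as np

def align_scale_shift(rel_depth, metric_samples, sample_mask):
    """Fit s, t minimizing ||s * rel + t - metric||^2 over pixels with known
    metric depth, then apply to the whole map. sample_mask marks those pixels."""
    rel = rel_depth[sample_mask]
    met = metric_samples[sample_mask]
    A = np.stack([rel, np.ones_like(rel)], axis=1)   # least-squares design matrix
    (s, t), *_ = np.linalg.lstsq(A, met, rcond=None)
    return s * rel_depth + t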

Model Evolution

From self-supervised learning to foundation models.

MonoDepth (2017): self-supervised, left-right consistency
MiDaS (2020): multi-dataset mixed training, relative depth
DPT (2021): Vision Transformer encoder
ZoeDepth (2023): metric, relative + metric bins
Depth Anything (2024): foundation model, 1.5M images, zero-shot
Depth Anything v2 (2024): foundation model, synthetic data, better edges
Depth Pro (2024): metric, Apple, sharp boundaries
Marigold (2024): diffusion, Stable Diffusion prior

Standout models:

Depth Anything v2: best zero-shot generalization (62.4M params, trained on 1.5M+ images)
Depth Pro: best metric accuracy and edges (Apple, 2024; sharp boundary preservation)
Marigold: best fine details (diffusion; uses a Stable Diffusion prior)

How Monocular Depth Works

Neural networks learn depth cues from massive datasets.

Learned Depth Cues

Perspective: parallel lines converge with distance
Relative Size: smaller apparent size = farther
Occlusion: objects in front block objects behind
Texture Gradient: denser texture = farther

Typical Architecture

RGB input -> encoder (ViT / CNN, feature extraction) -> decoder (DPT-style, upsampling) -> depth output
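
As a rough illustration of this encoder-decoder shape, here is a toy PyTorch network; the layers and sizes are arbitrary stand-ins for the ViT encoder and DPT-style decoder used by real models:

import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder: downsample to features, upsample to a depth map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                  # H x W -> H/4 x W/4
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                  # H/4 x W/4 -> H x W
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return torch.relu(self.decoder(self.encoder(x)))  # non-negative depth

depth = TinyDepthNet()(torch.randn(1, 3, 256, 256))  # -> (1, 1, 256, 256)
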
Supervised

Train with ground truth depth from LiDAR, RGBD sensors, or synthetic data.
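
One widely used supervised objective is the scale-invariant log loss of Eigen et al. (2014); a minimal PyTorch sketch (the 0.85 weighting follows common practice in later metric-depth models):

import torch

def silog_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-invariant log loss: penalizes log-depth errors while
    discounting a global scale offset. pred, target: positive depths."""
    g = torch.log(pred + eps) - torch.log(target + eps)
    return torch.sqrt((g ** 2).mean() - lam * g.mean() ** 2)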

Self-Supervised

Learn from stereo pairs or video sequences using view synthesis as supervision.
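
As an illustration of view synthesis as supervision, the stereo variant (as in MonoDepth) predicts a disparity map and warps the right image to reconstruct the left; the photometric error is the training signal. A simplified sketch (sign conventions vary between implementations):

import torch
import torch.nn.functional as F

def photometric_loss(left, right, disparity):
    """Reconstruct the left view by sampling the right image shifted by the
    predicted disparity, then compare to the real left view (L1 error).
    left, right: (B, 3, H, W); disparity: (B, 1, H, W) in pixels."""
    b, _, h, w = left.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=left.device),
                            torch.linspace(-1, 1, w, device=left.device),
                            indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    # A left-image pixel at x appears at x - d in the right image
    grid[..., 0] -= 2.0 * disparity.squeeze(1) / (w - 1)
    reconstructed = F.grid_sample(right, grid, align_corners=True)
    return (reconstructed - left).abs().mean()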

Foundation Model

Pre-train on massive diverse data for zero-shot transfer to any domain.


Output Formats

Different representations for different use cases.

Relative Depth: 0-1 normalized; for visualization and ordering
Metric Depth: meters; for 3D reconstruction and robotics
Disparity: proportional to 1/depth; for stereo matching
Point Cloud: (x, y, z) points; for 3D applications

Depth Map Visualization

Grayscale: light = close, dark = far
Turbo colormap: red = close, blue = far
Viridis colormap: perceptually uniform
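
For example, rendering a depth array with these colormaps in matplotlib (file names are placeholders; note matplotlib's turbo maps low values to blue, so invert the normalized depth if you want red = close):

import numpy as np
import matplotlib.pyplot as plt

depth = np.load("depth.npy")                      # placeholder H x W depth array
norm = (depth - depth.min()) / (depth.max() - depth.min())
plt.imshow(1.0 - norm, cmap="turbo")              # inverted: red = close, blue = far
plt.axis("off")
plt.colorbar(label="inverse relative depth")
plt.savefig("depth_turbo.png", bbox_inches="tight")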

Depth to Point Cloud

With camera intrinsics (fx, fy, cx, cy), back-project each pixel (u, v) to 3D:

Z = depth[v, u]
X = (u - cx) * Z / fx
Y = (v - cy) * Z / fy

This yields per-pixel (X, Y, Z) coordinates, one 3D point per pixel.
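
A vectorized NumPy version of this back-projection (the example intrinsics are illustrative; real pipelines read them from calibration):

import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an H x W metric depth map into an (H*W, 3) point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# e.g. points = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)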

Applications

Where depth estimation is used in practice.

3D Photos: parallax effects, depth-of-field
AR/VR: occlusion, spatial understanding
Robotics: navigation, obstacle avoidance
Autonomous Driving: distance estimation

Code Examples

Get started with depth estimation in Python.

Depth Anything v2 (recommended). Install: pip install transformers

from transformers import pipeline
from PIL import Image
import numpy as np

# Load Depth Anything v2 model
pipe = pipeline(
    task='depth-estimation',
    model='depth-anything/Depth-Anything-V2-Large-hf'
)

# Run inference
image = Image.open('image.jpg')
result = pipe(image)

# Get depth map
depth = result['depth']  # PIL Image
depth_array = np.array(depth)  # H x W array

# Normalize for visualization
depth_normalized = (depth_array - depth_array.min()) / \
                   (depth_array.max() - depth_array.min())

print(f'Depth shape: {depth_array.shape}')
print(f'Depth range: {depth_array.min():.2f} - {depth_array.max():.2f}')

Quick Reference

For General Use
  • Depth Anything v2
  • MiDaS 3.1
For Metric Depth
  • ZoeDepth
  • Depth Pro
  • UniDepth
For Fine Details
  • Marigold
  • Depth Pro

Use Cases

  • 3D scene reconstruction
  • AR/VR applications
  • Robot navigation
  • Computational photography

Architectural Patterns

Monocular Depth Estimation

Predict depth from a single image using learned priors.

Pros:
  • Works with any camera
  • No calibration needed
Cons:
  • Scale ambiguity
  • Relative depth only

Metric Depth Estimation

Predict absolute depth in real-world units.

Pros:
  • Real-world scale
  • Directly usable
Cons:
  • Needs training data with ground-truth depth
  • Domain-specific

Stereo Depth

Use stereo image pairs for triangulation.

Pros:
  • Accurate
  • Physically grounded
Cons:
  • Needs a stereo camera
  • Calibration required

Implementations

Open Source

Depth Anything V2

Apache 2.0
Open Source

State-of-the-art monocular depth. Very robust.

MiDaS

MIT
Open Source

Robust cross-domain depth. Good zero-shot generalization.

ZoeDepth

MIT
Open Source

Metric depth estimation. Real-world scale output.

Marigold

Apache 2.0
Open Source

Diffusion-based depth. High-quality fine details.

Benchmarks

Quick Facts

Input
Image
Output
Depth Map
Implementations
4 open source, 0 API
Patterns
3 approaches
