
Depth Estimation

Predict depth from a single image. Critical for 3D reconstruction, AR/VR, and robotics.

How Depth Estimation Works

A technical deep dive into depth estimation, from monocular depth prediction to 3D point cloud generation.

Real Examples

See how depth estimation handles different scene types. Depth maps are shown with a turbo colormap: red = close, blue = far.

Mountain Landscape: sky (far, dark blue) -> mountains (mid, cyan/green) -> foreground rocks (close, red/yellow)

Street Scene: distant buildings (far) -> road/cars (mid) -> foreground elements (close)

Indoor Room: back wall (far) -> furniture (mid) -> near objects (close)

Person Portrait: blurred background (far) -> face/body (mid-close) -> nose tip (closest)

Depth Estimation Types

Three approaches: monocular (single image), stereo (two cameras), and multi-view (many images).

Monocular Depth

Input: a single image from one camera.

Pros: Single camera needed, Works on any image, Fast inference
Cons: Scale ambiguity, Less accurate
Examples: Depth Anything, MiDaS, ZoeDepth

Stereo Depth

Input: two images (left/right) from a calibrated stereo pair.

Pros: Metric depth, More accurate, Triangulation-based
Cons: Needs calibrated cameras, Matching artifacts
Examples: RAFT-Stereo, AANet, PSMNet
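
To make the triangulation concrete, here is a minimal sketch of converting a disparity map to metric depth via Z = f * B / d, assuming a rectified pair with known focal length f (in pixels) and baseline B (in meters); the variable names and example values are illustrative:

import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Triangulate depth from stereo disparity: Z = f * B / d.
    disparity: H x W array in pixels (e.g. from a stereo matcher).
    focal_px: focal length in pixels; baseline_m: camera separation in meters."""
    depth = np.full_like(disparity, np.inf, dtype=np.float64)
    valid = disparity > 0                       # zero disparity = infinitely far
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Illustrative values: 640 px focal length, 12 cm baseline
depth = disparity_to_depth(np.random.rand(480, 640) * 64, 640.0, 0.12)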

Multi-view Depth

Input: multiple images of the same scene from different viewpoints.

Pros: Dense reconstruction, High accuracy
Cons: Complex setup, Slow processing
Examples: MVSNet, COLMAP, DUSt3R

Relative vs Metric Depth

Relative Depth

Ordinal relationships: "A is closer than B"

Output: 0-1 normalized values. Scale-invariant. Good for visual effects, not 3D reconstruction.

Metric Depth

Actual distances: "A is 2.5 meters away"

Output: Real-world units (meters). Required for robotics, AR, autonomous driving.
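
Relative depth can be promoted to metric depth when a few sparse metric measurements are available (from LiDAR, a rangefinder, etc.) by fitting a scale and shift, the same least-squares alignment commonly used when evaluating relative-depth models. A sketch, with illustrative names:

import numpy as np

def align_scale_shift(rel_depth, metric_samples, sample_mask):
    """Fit s, t minimizing ||s * rel + t - metric||^2 over pixels with known
    metric depth, then apply to the whole map. sample_mask marks those pixels."""
    rel = rel_depth[sample_mask]
    met = metric_samples[sample_mask]
    A = np.stack([rel, np.ones_like(rel)], axis=1)   # least-squares design matrix
    (s, t), *_ = np.linalg.lstsq(A, met, rcond=None)
    return s * rel_depth + t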

Model Evolution

From self-supervised learning to foundation models.

MonoDepth (2017): self-supervised, left-right consistency
MiDaS (2020): multi-dataset mixed training, relative depth
DPT (2021): Vision Transformer encoder
ZoeDepth (2023): metric, relative + metric bins
Depth Anything (2024): foundation model, 1.5M images, zero-shot
Depth Anything v2 (2024): foundation model, synthetic data, better edges
Depth Pro (2024): metric, Apple, sharp boundaries
Marigold (2024): diffusion, Stable Diffusion prior

Standout models:

Depth Anything v2: best zero-shot generalization (62.4M params, trained on 1.5M+ images)
Depth Pro: best metric accuracy and edges (Apple, 2024; sharp boundary preservation)
Marigold: best fine details (diffusion; uses a Stable Diffusion prior)

How Monocular Depth Works

Neural networks learn depth cues from massive datasets.

Learned Depth Cues

Perspective: parallel lines converge with distance
Relative Size: smaller apparent size = farther
Occlusion: objects in front block objects behind
Texture Gradient: denser texture = farther

Typical Architecture

RGB input -> encoder (ViT / CNN, feature extraction) -> decoder (DPT-style, upsampling) -> depth output
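
As a rough illustration of this encoder-decoder shape, here is a toy PyTorch network; the layers and sizes are arbitrary stand-ins for the ViT encoder and DPT-style decoder used by real models:

import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder: downsample to features, upsample to a depth map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                  # H x W -> H/4 x W/4
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                  # H/4 x W/4 -> H x W
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return torch.relu(self.decoder(self.encoder(x)))  # non-negative depth

depth = TinyDepthNet()(torch.randn(1, 3, 256, 256))  # -> (1, 1, 256, 256)
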
Supervised

Train with ground truth depth from LiDAR, RGBD sensors, or synthetic data.
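
One widely used supervised objective is the scale-invariant log loss of Eigen et al. (2014); a minimal PyTorch sketch (the 0.85 weighting follows common practice in later metric-depth models):

import torch

def silog_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-invariant log loss: penalizes log-depth errors while
    discounting a global scale offset. pred, target: positive depths."""
    g = torch.log(pred + eps) - torch.log(target + eps)
    return torch.sqrt((g ** 2).mean() - lam * g.mean() ** 2)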

Self-Supervised

Learn from stereo pairs or video sequences using view synthesis as supervision.
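
As an illustration of view synthesis as supervision, the stereo variant (as in MonoDepth) predicts a disparity map and warps the right image to reconstruct the left; the photometric error is the training signal. A simplified sketch (sign conventions vary between implementations):

import torch
import torch.nn.functional as F

def photometric_loss(left, right, disparity):
    """Reconstruct the left view by sampling the right image shifted by the
    predicted disparity, then compare to the real left view (L1 error).
    left, right: (B, 3, H, W); disparity: (B, 1, H, W) in pixels."""
    b, _, h, w = left.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=left.device),
                            torch.linspace(-1, 1, w, device=left.device),
                            indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    # A left-image pixel at x appears at x - d in the right image
    grid[..., 0] -= 2.0 * disparity.squeeze(1) / (w - 1)
    reconstructed = F.grid_sample(right, grid, align_corners=True)
    return (reconstructed - left).abs().mean()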

Foundation Model

Pre-train on massive diverse data for zero-shot transfer to any domain.


Output Formats

Different representations for different use cases.

Relative Depth: 0-1 normalized; for visualization and ordering
Metric Depth: meters; for 3D reconstruction and robotics
Disparity: proportional to 1/depth; for stereo matching
Point Cloud: (x, y, z) points; for 3D applications

Depth Map Visualization

Grayscale: light = close, dark = far
Turbo colormap: red = close, blue = far
Viridis colormap: perceptually uniform
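
For example, rendering a depth array with these colormaps in matplotlib (file names are placeholders; note matplotlib's turbo maps low values to blue, so invert the normalized depth if you want red = close):

import numpy as np
import matplotlib.pyplot as plt

depth = np.load("depth.npy")                      # placeholder H x W depth array
norm = (depth - depth.min()) / (depth.max() - depth.min())
plt.imshow(1.0 - norm, cmap="turbo")              # inverted: red = close, blue = far
plt.axis("off")
plt.colorbar(label="inverse relative depth")
plt.savefig("depth_turbo.png", bbox_inches="tight")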

Depth to Point Cloud

With camera intrinsics (fx, fy, cx, cy), back-project each pixel (u, v) to 3D:

Z = depth[v, u]
X = (u - cx) * Z / fx
Y = (v - cy) * Z / fy

This yields per-pixel (X, Y, Z) coordinates, one 3D point per pixel.
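
A vectorized NumPy version of this back-projection (the example intrinsics are illustrative; real pipelines read them from calibration):

import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an H x W metric depth map into an (H*W, 3) point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# e.g. points = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)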

Applications

Where depth estimation is used in practice.

3D Photos: parallax effects, depth-of-field
AR/VR: occlusion, spatial understanding
Robotics: navigation, obstacle avoidance
Autonomous Driving: distance estimation

Code Examples

Get started with depth estimation in Python.

Depth Anything v2 (recommended). Install: pip install transformers

from transformers import pipeline
from PIL import Image
import numpy as np

# Load Depth Anything v2 model
pipe = pipeline(
    task='depth-estimation',
    model='depth-anything/Depth-Anything-V2-Large-hf'
)

# Run inference
image = Image.open('image.jpg')
result = pipe(image)

# Get depth map
depth = result['depth']  # PIL Image
depth_array = np.array(depth)  # H x W array

# Normalize for visualization
depth_normalized = (depth_array - depth_array.min()) / \
                   (depth_array.max() - depth_array.min())

print(f'Depth shape: {depth_array.shape}')
print(f'Depth range: {depth_array.min():.2f} - {depth_array.max():.2f}')

Quick Reference

For General Use
  • Depth Anything v2
  • MiDaS 3.1
For Metric Depth
  • ZoeDepth
  • Depth Pro
  • UniDepth
For Fine Details
  • Marigold
  • Depth Pro

Use Cases

  • 3D scene reconstruction
  • AR/VR applications
  • Robot navigation
  • Computational photography

Architectural Patterns

Monocular Depth Estimation

Predict depth from a single image using learned priors.

Pros:
  • Works with any camera
  • No calibration needed
Cons:
  • Scale ambiguity
  • Relative depth only

Metric Depth Estimation

Predict absolute depth in real-world units.

Pros:
  • Real-world scale
  • Directly usable
Cons:
  • Needs training data with ground-truth depth
  • Domain-specific

Stereo Depth

Use stereo image pairs for triangulation.

Pros:
  • Accurate
  • Physically grounded
Cons:
  • Needs a stereo camera
  • Calibration required

Implementations

Open Source

Depth Anything V2

Apache 2.0
Open Source

State-of-the-art monocular depth. Very robust.

MiDaS

MIT
Open Source

Robust cross-domain depth. Good zero-shot generalization.

ZoeDepth

MIT
Open Source

Metric depth estimation. Real-world scale output.

Marigold

Apache 2.0
Open Source

Diffusion-based depth. High-quality fine details.

Benchmarks

Quick Facts

Input
Image
Output
Depth Map
Implementations
4 open source, 0 API
Patterns
3 approaches
