
Image-Text-to-Image

Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.
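As a concrete illustration of the instruction-guided editing loop described above, here is a minimal sketch using the Hugging Face diffusers library and the publicly released InstructPix2Pix checkpoint ("timbrooks/instruct-pix2pix"); the library, checkpoint, file names, and parameter values are assumptions of this example, not part of the benchmark page.

```python
# Sketch: instruction-guided image editing with InstructPix2Pix via diffusers.
# Assumes diffusers, torch, and Pillow are installed and a CUDA GPU is available.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix",  # released InstructPix2Pix weights
    torch_dtype=torch.float16,
).to("cuda")

source = Image.open("source.jpg").convert("RGB")  # hypothetical input file
edited = pipe(
    "change the dress to red",    # the text instruction
    image=source,                 # the image to be edited
    num_inference_steps=20,
    guidance_scale=7.5,           # how strongly to follow the instruction
    image_guidance_scale=1.5,     # how closely to preserve the source image
).images[0]
edited.save("edited.jpg")
```

The tension the paragraph describes shows up directly in the two guidance knobs: raising `image_guidance_scale` preserves identity and background at the cost of weaker edits, while raising `guidance_scale` does the opposite.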

Datasets: 2
Results: 0
Canonical metric: clip-score

Canonical Benchmark

InstructPix2Pix

Instruction-guided image editing benchmark

Primary metric: clip-score
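The CLIP score used here measures image-text agreement; exact definitions vary by implementation, but a common one (e.g. in torchmetrics) is w * max(cos(image_emb, text_emb), 0) with w = 100. The toy embeddings below stand in for real CLIP encoder outputs, which this sketch does not compute.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """CLIP score: 100 * max(cosine similarity, 0) between two embeddings."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_emb / np.linalg.norm(text_emb)
    return float(100.0 * max(np.dot(img, txt), 0.0))

# Toy 512-d vectors standing in for CLIP image/text encoder outputs.
rng = np.random.default_rng(0)
emb = rng.normal(size=512)
print(round(clip_score(emb, emb), 1))  # identical embeddings -> 100.0
```

Because the score is clamped at zero, anti-aligned embeddings score 0 rather than negative, which is why leaderboard values fall in [0, 100].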

Top 10

Leading models on InstructPix2Pix.

No results yet.

All datasets

2 datasets tracked for this task.

Related tasks

Other tasks in Multimodal.

Run Inference

Looking to run a model? Hugging Face hosts inference for this task type.
