GIE-Bench Logo
Grounded Image Editing Evaluation Benchmark
Yusu Qian, Jiasen Lu, Tsu-Jui Fu, Xinze Wang, Chen Chen, Yinfei Yang, Wenze Hu, Zhe Gan
Apple

Overview

GIE-Bench is a benchmark designed to evaluate text-guided image editing models along two critical dimensions: (i) functional correctness and (ii) image content preservation.

It includes over 1,000 high-quality editing examples across 20 content categories and 9 edit types, each annotated with editing instructions, evaluation questions, and spatial object masks.
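For concreteness, one way a single benchmark item could be represented is sketched below. The class and field names are illustrative only and do not reflect the released data schema.

```python
from dataclasses import dataclass, field

@dataclass
class EditingExample:
    """One GIE-Bench-style item (field names are illustrative, not the released schema)."""
    image_path: str                  # source image to be edited
    category: str                    # one of the 20 content categories
    edit_type: str                   # one of the 9 edit types
    instruction: str                 # natural-language editing instruction
    questions: list[dict] = field(default_factory=list)  # multiple-choice QA for correctness
    mask_path: str = ""              # spatial mask of the region the edit should touch
```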

Abstract

Editing images using natural language instructions has become a natural and expressive way to modify visual content, yet evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent, using an object-aware masking technique and preservation scoring. The benchmark includes over 1,000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We conduct a large-scale study comparing GPT-Image-1, the latest flagship model in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy but often over-modifies irrelevant image regions, highlighting a key trade-off in current model behavior. GIE-Bench provides a scalable, reproducible framework for more accurate evaluation of text-guided image editing.

Evaluation Pipeline

GIE-Bench pipeline overview

The benchmark automatically verifies whether the desired edit was applied (via VQA) and whether unintended changes occurred (via masked similarity).
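A rough sketch of that two-part check for a single example is shown below. Both graders are injected placeholders: `answer_fn` stands in for whatever VQA model judges the edit, and `preservation_fn` stands in for the masked-similarity scoring (a sketch of which appears under "Content Preservation"). Neither signature is taken from the released code.

```python
def evaluate_one(example, source_image, edited_image, edit_mask, answer_fn, preservation_fn):
    """Two-part check for a single edit, mirroring the pipeline described above.

    answer_fn(image, question, choices) -> chosen answer string (VQA grader placeholder)
    preservation_fn(src, out, mask)     -> dict of masked-similarity scores (placeholder)
    """
    # 1) Functional correctness: ask each multiple-choice question about the output.
    hits = [
        answer_fn(edited_image, q["question"], q["choices"]) == q["answer"]
        for q in example.questions
    ]
    correctness = sum(hits) / max(len(hits), 1)

    # 2) Content preservation: compare source and output outside the edit mask.
    preservation = preservation_fn(source_image, edited_image, edit_mask)

    return {"correctness": correctness, **preservation}
```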

Examples

GIE-Bench visual examples

Benchmark Details

Leaderboard

Functional Correctness

Leaderboard - Functional Correctness

Functional correctness is evaluated via VQA-style multiple-choice questions for each edit. GPT-Image-1 achieves the highest overall accuracy (85.00%) across the 9 edit types. OmniGen and MagicBrush also perform well, though with more variance across categories.
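The accuracy computation itself is straightforward. The sketch below assumes a hypothetical `answer_fn(image, question, choices)` VQA grader and the illustrative `EditingExample` fields from the overview, and aggregates accuracy per edit type.

```python
from collections import defaultdict

def functional_correctness(examples, edited_images, answer_fn):
    """Score instruction following with multiple-choice VQA.

    answer_fn(image, question, choices) -> str is a stand-in for the
    multimodal grader; the benchmark's actual grader and prompt format may differ.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for ex, img in zip(examples, edited_images):
        for q in ex.questions:
            pred = answer_fn(img, q["question"], q["choices"])
            total[ex.edit_type] += 1
            if pred == q["answer"]:
                correct[ex.edit_type] += 1
    # per-edit-type accuracy
    return {t: correct[t] / total[t] for t in total}
```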

Content Preservation

Leaderboard - Content Preservation

Preservation is measured with metrics such as masked SSIM, masked CLIP similarity, PSNR, and MSE, all computed over non-edited regions. OneDiffusion and MagicBrush consistently achieve top scores in preserving unedited content, while GPT-Image-1 tends to over-edit, especially at the pixel level.
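A minimal sketch of the pixel-level portion of this scoring, assuming images as NumPy arrays of equal size and an HxW boolean mask that is True where the edit was allowed, might look like the following; masked SSIM and masked CLIP similarity would need their own implementations.

```python
import numpy as np

def masked_preservation(original, edited, edit_mask, max_val=255.0):
    """MSE and PSNR restricted to pixels outside the edited region.

    original, edited: HxWxC arrays of the same shape.
    edit_mask: HxW boolean array, True where the edit was allowed to change pixels.
    """
    keep = ~edit_mask  # pixels that should stay unchanged
    diff = original[keep].astype(np.float64) - edited[keep].astype(np.float64)
    mse = float(np.mean(diff ** 2))
    psnr = float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
    return {"masked_mse": mse, "masked_psnr": psnr}
```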

Failure Modes

Examples of GPT-Image-1’s failure modes

BibTeX

@misc{qian2025giebench,
  title={{GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing}},
  author={Yusu Qian and Jiasen Lu and Tsu-Jui Fu and Xinze Wang and Chen Chen and Yinfei Yang and Wenze Hu and Zhe Gan},
  year={2025},
  eprint={2505.11493},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}