GIE-Bench Logo
Grounded Image Editing Evaluation Benchmark
Yusu Qian, Jiasen Lu, Tsu-Jui Fu, Xinze Wang, Chen Chen, Yinfei Yang, Wenze Hu, Zhe Gan
Apple

Overview

GIE-Bench is a benchmark designed to evaluate text-guided image editing models along two critical dimensions: (i) functional correctness and (ii) image content preservation.

It includes over 1,000 high-quality editing examples across 20 content categories and 9 edit types, each annotated with editing instructions, evaluation questions, and spatial object masks.
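For concreteness, one way a single benchmark item could be represented is sketched below. The class and field names are illustrative only and do not reflect the released data schema.

```python
from dataclasses import dataclass, field

@dataclass
class EditingExample:
    """One GIE-Bench-style item (field names are illustrative, not the released schema)."""
    image_path: str                  # source image to be edited
    category: str                    # one of the 20 content categories
    edit_type: str                   # one of the 9 edit types
    instruction: str                 # natural-language editing instruction
    questions: list[dict] = field(default_factory=list)  # multiple-choice QA for correctness
    mask_path: str = ""              # spatial mask of the region the edit should touch
```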

Abstract

Editing images using natural language instructions has become a natural and expressive way to modify visual content, yet evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent, using an object-aware masking technique and preservation scoring. The benchmark includes over 1,000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We conduct a large-scale study comparing GPT-Image-1, the latest flagship model in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy but often over-modifies irrelevant image regions, highlighting a key trade-off in current model behavior. GIE-Bench provides a scalable, reproducible framework for more accurate evaluation of text-guided image editing.

Evaluation Pipeline

GIE-Bench pipeline overview

The benchmark automatically verifies whether the desired edit was applied (via VQA) and whether unintended changes occurred (via masked similarity).
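A rough sketch of that two-part check for a single example is shown below. Both graders are injected placeholders: `answer_fn` stands in for whatever VQA model judges the edit, and `preservation_fn` stands in for the masked-similarity scoring (a sketch of which appears under "Content Preservation"). Neither signature is taken from the released code.

```python
def evaluate_one(example, source_image, edited_image, edit_mask, answer_fn, preservation_fn):
    """Two-part check for a single edit, mirroring the pipeline described above.

    answer_fn(image, question, choices) -> chosen answer string (VQA grader placeholder)
    preservation_fn(src, out, mask)     -> dict of masked-similarity scores (placeholder)
    """
    # 1) Functional correctness: ask each multiple-choice question about the output.
    hits = [
        answer_fn(edited_image, q["question"], q["choices"]) == q["answer"]
        for q in example.questions
    ]
    correctness = sum(hits) / max(len(hits), 1)

    # 2) Content preservation: compare source and output outside the edit mask.
    preservation = preservation_fn(source_image, edited_image, edit_mask)

    return {"correctness": correctness, **preservation}
```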

Examples

GIE-Bench visual examples

Benchmark Details

Leaderboard

Functional Correctness

Leaderboard - Functional Correctness

Functional correctness is evaluated via VQA-style multiple-choice questions for each edit. GPT-Image-1 achieves the highest overall accuracy (85.00%) across the 9 edit types. OmniGen and MagicBrush also perform well, though with more variance across categories.
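The accuracy computation itself is straightforward. The sketch below assumes a hypothetical `answer_fn(image, question, choices)` VQA grader and the illustrative `EditingExample` fields from the overview, and aggregates accuracy per edit type.

```python
from collections import defaultdict

def functional_correctness(examples, edited_images, answer_fn):
    """Score instruction following with multiple-choice VQA.

    answer_fn(image, question, choices) -> str is a stand-in for the
    multimodal grader; the benchmark's actual grader and prompt format may differ.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for ex, img in zip(examples, edited_images):
        for q in ex.questions:
            pred = answer_fn(img, q["question"], q["choices"])
            total[ex.edit_type] += 1
            if pred == q["answer"]:
                correct[ex.edit_type] += 1
    # per-edit-type accuracy
    return {t: correct[t] / total[t] for t in total}
```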

Content Preservation

Leaderboard - Content Preservation

Preservation is measured with metrics such as masked SSIM, masked CLIP similarity, PSNR, and MSE, all computed over non-edited regions. OneDiffusion and MagicBrush consistently achieve top scores in preserving unedited content, while GPT-Image-1 tends to over-edit, especially at the pixel level.
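A minimal sketch of the pixel-level portion of this scoring, assuming images as NumPy arrays of equal size and an HxW boolean mask that is True where the edit was allowed, might look like the following; masked SSIM and masked CLIP similarity would need their own implementations.

```python
import numpy as np

def masked_preservation(original, edited, edit_mask, max_val=255.0):
    """MSE and PSNR restricted to pixels outside the edited region.

    original, edited: HxWxC arrays of the same shape.
    edit_mask: HxW boolean array, True where the edit was allowed to change pixels.
    """
    keep = ~edit_mask  # pixels that should stay unchanged
    diff = original[keep].astype(np.float64) - edited[keep].astype(np.float64)
    mse = float(np.mean(diff ** 2))
    psnr = float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
    return {"masked_mse": mse, "masked_psnr": psnr}
```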

Failure Modes

Examples of GPT-Image-1’s failure modes

BibTeX

@misc{qian2025giebench,
  title={{GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing}},
  author={Yusu Qian and Jiasen Lu and Tsu-Jui Fu and Xinze Wang and Chen Chen and Yinfei Yang and Wenze Hu and Zhe Gan},
  year={2025},
  eprint={2505.11493},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}