
GIE-Bench is a benchmark designed to evaluate text-guided image editing models along two critical dimensions: functional correctness and image content preservation.
It includes over 1,000 high-quality editing examples across 20 categories and 9 edit types, each annotated with object masks, editing instructions, and evaluation questions.
Editing images with natural language instructions has become an intuitive and expressive way to modify visual content, yet evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent, using an object-aware masking technique and preservation scoring. The benchmark includes over 1,000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We conduct a large-scale study comparing GPT-Image-1, the latest flagship in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy but often over-modifies irrelevant image regions, highlighting a key trade-off in current model behavior. GIE-Bench provides a scalable, reproducible framework for more accurate evaluation of text-guided image editing.
The benchmark automatically verifies whether the desired edit was applied (via VQA) and whether unintended changes occurred (via masked similarity).
Functional correctness is evaluated via VQA-style multiple-choice questions for each edit. GPT-Image-1 achieves the highest overall accuracy (85.00%) across 9 edit types. OmniGen and MagicBrush also perform well, but with more variance across categories.
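To make the scoring concrete, here is a minimal sketch of the functional-correctness loop, assuming each example carries the edited image, one automatically generated multiple-choice question, its answer options, and the index of the correct option. The EditExample fields and the vqa_model callable are illustrative placeholders, not the benchmark's released interface.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class EditExample:
    edited_image_path: str      # path to the model-edited image
    question: str               # auto-generated multiple-choice question
    options: Sequence[str]      # answer options shown to the VQA model
    correct_index: int          # option that matches the requested edit

def functional_correctness(
    examples: Sequence[EditExample],
    vqa_model: Callable[[str, str, Sequence[str]], int],
) -> float:
    """Fraction of edits judged successful: the VQA model answers each
    question about the edited image and must pick the correct option."""
    correct = sum(
        int(vqa_model(ex.edited_image_path, ex.question, ex.options) == ex.correct_index)
        for ex in examples
    )
    return correct / len(examples)

Overall accuracy is then simply the fraction of questions answered with the option corresponding to the requested edit.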
Preservation is measured with masked SSIM, CLIP similarity, PSNR, and MSE computed over non-edited regions. OneDiffusion and MagicBrush consistently achieve top scores in preserving unedited content, while GPT-Image-1 tends to over-edit, especially at the pixel level.
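Below is a minimal sketch of how the masked pixel-level metrics could be computed, assuming a binary object mask that is True over the region the instruction targets; function and variable names are illustrative, and the masked CLIP similarity is omitted for brevity.

import numpy as np
from skimage.metrics import structural_similarity

def masked_preservation_scores(original, edited, mask):
    """original, edited: float arrays in [0, 1] with shape (H, W, 3);
    mask: bool array of shape (H, W), True where the edit was requested."""
    keep = ~mask  # pixels that should remain unchanged

    # MSE and PSNR restricted to the non-edited region.
    sq_err = (original - edited) ** 2
    mse = sq_err[keep].mean()
    psnr = 10.0 * np.log10(1.0 / max(float(mse), 1e-12))

    # SSIM: compute the full per-pixel SSIM map, then average it outside the mask.
    _, ssim_map = structural_similarity(
        original, edited, channel_axis=-1, data_range=1.0, full=True
    )
    masked_ssim = ssim_map[keep].mean()

    return {"mse": float(mse), "psnr": float(psnr), "ssim": float(masked_ssim)}

Higher PSNR and SSIM and lower MSE outside the mask indicate that non-targeted regions were left untouched.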
@misc{qian2025giebench,
  title={{GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing}},
  author={Yusu Qian and Jiasen Lu and Tsu-Jui Fu and Xinze Wang and Chen Chen and Yinfei Yang and Wenze Hu and Zhe Gan},
  year={2025},
  eprint={2505.11493},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}