
Have We Unified Image Generation And Understanding Yet? An Empirical Study Of Gpt-4o's Image Generation Ability

Ning Li, Jingran Zhang, Justin Cui. No Venue, 2025

[Paper]
Tags: Evaluation, Model Architecture, Training Techniques

OpenAI’s multimodal GPT-4o has demonstrated remarkable capabilities in image generation and editing, yet its ability to achieve world knowledge-informed semantic synthesis (seamlessly integrating domain knowledge, contextual reasoning, and instruction adherence) remains unproven. In this study, we systematically evaluate these capabilities across three critical dimensions: (1) Global Instruction Adherence, (2) Fine-Grained Editing Precision, and (3) Post-Generation Reasoning. While existing benchmarks highlight GPT-4o’s strong capabilities in image generation and editing, our evaluation reveals GPT-4o’s persistent limitations: the model frequently defaults to literal interpretations of instructions, inconsistently applies knowledge constraints, and struggles with conditional reasoning tasks. These findings challenge prevailing assumptions about GPT-4o’s unified understanding and generation capabilities, exposing significant gaps in its dynamic knowledge integration. Our study calls for the development of more robust benchmarks and training strategies that go beyond surface-level alignment, emphasizing context-aware and reasoning-grounded multimodal generation.
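The sketch below illustrates the kind of probing this evaluation implies: one prompt per dimension sent to an image-generation endpoint, with outputs saved for later scoring. The prompts, the model name ("gpt-image-1"), and the file layout are illustrative assumptions, not the authors' actual benchmark.

```python
# Minimal sketch of a three-dimension probe harness, assuming access to the
# OpenAI Images API via the official `openai` Python package (pip install openai).
# Model name, prompts, and output layout are hypothetical, not the paper's setup.
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One illustrative probe per evaluation dimension from the paper.
PROBES = {
    "global_instruction_adherence": (
        "Draw a kitchen scene, but render every object that is normally red in blue."
    ),
    "fine_grained_editing": (
        "Show a chessboard after 1. e4 e5 2. Nf3, with all other pieces on their "
        "starting squares."
    ),
    "post_generation_reasoning": (
        "If the animal in the image is nocturnal, place it under a night sky; "
        "otherwise under a noon sun. The animal is an owl."
    ),
}

out_dir = Path("gpt4o_image_eval")
out_dir.mkdir(exist_ok=True)

for dimension, prompt in PROBES.items():
    # gpt-image-1 returns base64-encoded image data in data[0].b64_json.
    result = client.images.generate(model="gpt-image-1", prompt=prompt)
    image_bytes = base64.b64decode(result.data[0].b64_json)
    (out_dir / f"{dimension}.png").write_bytes(image_bytes)
    print(f"saved {dimension}.png for manual or VLM-based scoring")
```

Scoring the saved images (by humans or a judge model) against the instruction, the knowledge constraint, and the conditional is where the paper's reported failure modes would surface.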

https://huggingface.co/discussions/paper/67fdc0c60c63732d9e0b13d2
