Artificial intelligence (AI) is developing rapidly, and multimodal models like GPT-4o, which can process both text and images, are at the center of this progress. GPT-4o generates images of impressive quality and detail. However, a new study from the University of California, Los Angeles (UCLA) shows that despite these visual capabilities, the model exhibits weaknesses in logical reasoning, contextual understanding, and inference.
The UCLA study divided its investigation into three categories: global instructions, image editing, and post-generation inference. In the global-instructions category, GPT-4o struggled to apply overarching rules. For example, when the model was told that "left" now means "right" and was then instructed to generate an image with a dog on the left side, it still placed the dog on the left. Similar failures occurred with numerical rules. This suggests that GPT-4o does not reliably carry contextual rules into the image generation process.
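If the study's setup were reduced to code, a global-instruction test case might look like the minimal Python sketch below. The rule table, function names, and observed outputs are illustrative assumptions, not the paper's actual harness; the point is only to make the expected remapping explicit.

```python
# A minimal sketch (not from the study) of a "global instruction" test case:
# a rule that remaps spatial terms is applied to a placement instruction
# before checking the model's output. All names here are illustrative.

SPATIAL_REMAP = {"left": "right", "right": "left"}  # the overriding global rule

def expected_side(instruction_side: str, remap: dict[str, str]) -> str:
    """Side where the object should appear once the global rule is applied."""
    return remap.get(instruction_side, instruction_side)

def score_case(instruction_side: str, observed_side: str) -> bool:
    """True if the generated image respects the remapped instruction."""
    return observed_side == expected_side(instruction_side, SPATIAL_REMAP)

# The failure pattern reported by the study: asked for a dog on the "left"
# under the rule left -> right, the model still places it on the left.
print(score_case("left", observed_side="left"))   # False: rule ignored
print(score_case("left", observed_side="right"))  # True: rule applied
```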
GPT-4o also showed weaknesses in image editing. Asked to replace only the reflection of a horse in water with a lion, the model changed both the reflection and the horse itself. In another case, when asked to remove seated people from an image, it also deleted standing people in the background. These results suggest that GPT-4o struggles with semantically precise modifications and fine-grained interpretation of visual content.
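One way to verify such edits automatically is to check that pixel changes stay confined to the intended region. The sketch below assumes hypothetical before.png/after.png files and a hand-annotated bounding box for the reflection; it is not taken from the UCLA study.

```python
# A minimal sketch, assuming "before.png"/"after.png" and a hand-annotated
# bounding box for the reflection; none of this comes from the UCLA paper.
import numpy as np
from PIL import Image

def changed_mask(before_path: str, after_path: str, threshold: int = 16) -> np.ndarray:
    """Boolean mask of pixels that differ noticeably between two images."""
    before = np.asarray(Image.open(before_path).convert("RGB"), dtype=np.int16)
    after = np.asarray(Image.open(after_path).convert("RGB"), dtype=np.int16)
    assert before.shape == after.shape, "images must have identical dimensions"
    return np.abs(before - after).max(axis=-1) > threshold

def edit_confined_to(mask: np.ndarray, box: tuple[int, int, int, int]) -> bool:
    """True if all changed pixels fall inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    outside = mask.copy()
    outside[top:bottom, left:right] = False  # ignore changes inside the target region
    return not outside.any()

# Example: the reflection region was annotated at rows 300-480, cols 100-380.
mask = changed_mask("before.png", "after.png")
print(edit_confined_to(mask, box=(300, 100, 480, 380)))
```

If the model edits the horse itself rather than only its reflection, changed pixels appear outside the annotated box and the check fails.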
The greatest challenges appeared in logical reasoning. In one scenario, GPT-4o was first asked to create an image of a dog and a cat. It was then instructed to replace the dog with a cat and move the scene to a beach, but only if the original image did not contain a cat. Although the first image already contained a cat, GPT-4o performed both changes anyway. This illustrates the model's difficulty with conditional logic and multi-step inference.
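The conditional logic the model failed is simple to state in code. The following sketch is purely illustrative (the scene representation and function are assumptions, not the study's harness); it shows that when the precondition fails, the correct behavior is a no-op.

```python
# A small sketch of the conditional edit the model failed: apply the two
# changes only if no cat is present in the original scene.

def apply_conditional_edit(scene_objects: set[str], background: str) -> tuple[set[str], str]:
    """Replace the dog with a cat and move the scene to a beach,
    but only if the original scene contains no cat."""
    if "cat" in scene_objects:
        return scene_objects, background  # condition fails: leave the scene untouched
    edited = (scene_objects - {"dog"}) | {"cat"}
    return edited, "beach"

# The study's scenario: the first image already contained a cat, so the
# correct behavior is a no-op. GPT-4o applied both changes anyway.
print(apply_conditional_edit({"dog", "cat"}, "garden"))  # ({'dog', 'cat'}, 'garden')
print(apply_conditional_edit({"dog"}, "garden"))         # ({'cat'}, 'beach')
```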
The UCLA study underscores the need for new benchmarks to evaluate multimodal AI models. Previous tests have focused mainly on text-image alignment, image quality, and control over style and minor edits, while the ability to integrate world knowledge, apply abstract rules, and draw logical conclusions has been neglected. More comprehensive benchmarks are crucial for realistically assessing the actual capabilities and limits of models like GPT-4o and for steering research in the right direction.
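To make this concrete, a record in such a benchmark might pair each prompt with its category, an optional gating condition, and the expected outcome. The schema below is an assumption for illustration, not the structure actually used in the paper.

```python
# A minimal sketch of what a record in such a benchmark could look like;
# the fields and categories are assumptions, not the paper's actual schema.
from dataclasses import dataclass

@dataclass
class ReasoningCase:
    category: str          # e.g. "global_instruction", "image_edit", "conditional"
    prompt: str            # instruction given to the model
    condition: str | None  # optional precondition that gates the edit
    expected: str          # description of the correct outcome

cases = [
    ReasoningCase("global_instruction",
                  "From now on, 'left' means 'right'. Place a dog on the left.",
                  None,
                  "dog appears on the right side"),
    ReasoningCase("conditional",
                  "Replace the dog with a cat and move the scene to a beach.",
                  "only if the original image contains no cat",
                  "image unchanged (a cat is already present)"),
]
for case in cases:
    print(f"[{case.category}] {case.prompt} -> {case.expected}")
```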
The results of the UCLA study show that the development of multimodal AI models is far from complete. GPT-4o can generate impressive images, but its deficits in logical reasoning and contextual understanding are clear. Further research is needed to address these weaknesses and realize the full potential of the technology. Building robust AI systems that are both visually and cognitively capable remains an exciting challenge for the future.
Sources:
- https://the-decoder.com/?p=22933
- https://ground.news/article/gpt-4o-makes-beautiful-images-but-fails-basic-reasoning-tests-ucla-study-finds
- https://twitter.com/theaitechsuite/status/1913551735936713055
- https://www.threads.net/@the_ainavigator/post/DIo1Xr7z5E4/gpt-4o-makes-beautiful-images-but-fails-basic-reasoning-tests-ucla-study-findsht
- https://the-decoder.com/the-next-leap-in-ai-depends-on-agents-that-learn-by-doing-not-just-by-reading-what-humans-wrote/
- https://www.reddit.com/r/OpenAI/comments/1iivb46/gpt_4o_reasoning_but_not_generating_images/
- https://www.facebook.com/groups/chatgpt4u/posts/1579116262718077/
- https://openai.com/index/thinking-with-images/
- https://arxiv.org/pdf/2504.08003