April 20, 2025

COLORBENCH: A New Benchmark for Evaluating Color Perception in Vision-Language Models

Color Science for Artificial Intelligence: COLORBENCH Tests the Color Understanding of Vision-Language Models

Color plays a central role in human perception and is essential in fields such as medical imaging, remote sensing, and product recognition. But how well do AI models actually understand color? A new benchmark called COLORBENCH aims to investigate this question systematically by putting the color perception of Vision-Language Models (VLMs) to the test.

VLMs combine image and text processing to solve complex tasks, such as describing images or answering questions about visual content. COLORBENCH, developed by a research team at the University of Maryland, is the first benchmark specifically dedicated to the color perception of these models. It comprises eleven tasks with a total of 1,448 instances and 5,814 image-text queries, covering three main dimensions: color perception, color understanding, and robustness to color changes.
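
To make the structure of the benchmark more concrete, the following is a hypothetical sketch of what a single image-text query could look like. The field names and values are invented for illustration and are not the released COLORBENCH format.

```python
# Hypothetical sketch of a single COLORBENCH-style query; the field names and
# values are invented for illustration and do not reflect the released format.
example_query = {
    "dimension": "color perception",   # or "color understanding", "robustness"
    "task": "color recognition",
    "image": "objects_on_table.png",   # placeholder image path
    "question": "What color is the mug on the left?",
    "choices": ["red", "green", "blue", "yellow"],
    "answer": "blue",
}
```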

The tasks range from simple color recognition and estimating color proportions to counting objects of a specific color and robustness to known optical illusions. One example: the models must deliver consistent answers when the colors of certain image segments are rotated through a series of hue shifts. Another test examines how well the models recognize colors under simulated color blindness.
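
As a rough illustration of the color-rotation test described above, the sketch below hue-shifts an image in fixed steps using Pillow and NumPy. It is not the authors' pipeline, and the file names are placeholders; the point is that a robust model should give consistent answers across the resulting variants.

```python
# Minimal sketch of a hue-rotation perturbation, similar in spirit to the
# robustness test described above (not the authors' actual code).
import numpy as np
from PIL import Image

def rotate_hue(image: Image.Image, degrees: float) -> Image.Image:
    """Shift the hue channel of an RGB image by the given angle."""
    hsv = np.array(image.convert("HSV"), dtype=np.uint16)
    # Pillow stores hue as 0-255, so map the rotation angle to that range.
    shift = int(round(degrees / 360.0 * 256)) % 256
    hsv[..., 0] = (hsv[..., 0] + shift) % 256
    return Image.fromarray(hsv.astype(np.uint8), mode="HSV").convert("RGB")

if __name__ == "__main__":
    img = Image.open("input.png").convert("RGB")  # placeholder path
    for angle in (60, 120, 180, 240, 300):
        # A robust VLM should answer color-independent questions identically
        # for every one of these hue-shifted variants.
        rotate_hue(img, angle).save(f"input_hue{angle}.png")
```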

The researchers tested 32 common VLMs with COLORBENCH, including GPT-4o, Gemini 2, and various open-source models with up to 78 billion parameters. The results show that larger models generally perform better, but the effect is less pronounced than on other benchmarks. The performance gap between open-source and proprietary models is also comparatively small.

All tested models performed particularly poorly on tasks such as color counting or the color blindness tests, with accuracy often below 30%. Even in color extraction, where the models were asked to identify specific HSV or RGB values, large models mostly achieved only mediocre results. They performed better on tasks that involved object recognition or recognizing colors in the context of objects. The researchers attribute this to the nature of the training data.
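
For context, color extraction asks for numeric color values rather than color names. The small sketch below shows the RGB-to-HSV relationship involved, using Python's standard colorsys module; the specific values are made up for illustration.

```python
# Illustration of the RGB <-> HSV relationship underlying color extraction;
# the values here are made up and not taken from the benchmark.
import colorsys

r, g, b = 200, 60, 30  # hypothetical dominant color of an image region
h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
print(f"RGB ({r}, {g}, {b}) -> HSV ({h * 360:.0f} deg, {s:.2f}, {v:.2f})")
```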

Interestingly, color information can also mislead VLMs. In tasks with optical illusions or camouflaged objects, the performance of the models improved when the images were converted to grayscale. This suggests that color information was misleading rather than helpful in these cases. Conversely, some tasks could not be meaningfully solved without color information.
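
A minimal sketch of such a grayscale ablation, assuming Pillow is available and using a placeholder file name; the idea is simply to re-run the same query on a color-free version of the image.

```python
# Minimal sketch of a grayscale ablation (not the authors' code): strip the
# color information, save the result, and re-run the same query on it.
from PIL import Image

gray = Image.open("illusion.png").convert("L").convert("RGB")  # placeholder path
gray.save("illusion_gray.png")
```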

The study also found that chain-of-thought (CoT) reasoning not only increased performance on reasoning tasks but also improved robustness to color changes, even though only the image colors, not the questions, were altered. With CoT prompting, for example, GPT-4o's robustness score rose from 46.2% to 69.9%.
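
To illustrate the difference in prompting style, here is a hypothetical pair of queries; the wording is invented and not taken from the benchmark.

```python
# Hypothetical prompt pair contrasting a direct query with a chain-of-thought
# query; the wording is invented and not taken from the benchmark.
direct_prompt = (
    "What color is the largest object in the image? Answer with one word."
)
cot_prompt = (
    "What color is the largest object in the image? "
    "First list the objects you see and their approximate colors, "
    "then reason step by step before giving a final one-word answer."
)
```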

A structural problem of current VLMs identified by the researchers is the limited scaling of the vision encoders. Model performance correlated more strongly with the size of the language model than with the size of the vision encoder. Most vision encoders are relatively small, at 300 to 400 million parameters, which makes it difficult to assess how much they contribute to color understanding. The team therefore recommends further development of the visual components.

COLORBENCH is publicly available and aims to support the development of more color-sensitive and robust vision-language systems. Future versions of the benchmark are planned to include tasks that combine color with texture, shape, and spatial relationships.

Bibliography:
Liang, Li et al. (2025). *Title of the work*. arXiv preprint arXiv:2504.10514. http://arxiv.org/abs/2504.10514
https://the-decoder.com/researchers-introduce-colorbench-to-test-color-understanding-in-vision-language-models/
https://dev.to/aimodels-fyi/colorbench-new-test-reveals-how-ai-sees-color-surprising-results-h33
https://huggingface.co/papers/2504.10514
https://www.youtube.com/watch?v=43IXdLgGh1I
https://twitter.com/theaitechsuite/status/1913544185367478429
https://www.youtube.com/watch?v=HAkIHB5AOqg
https://arxiv.org/html/2405.11685v2
https://openreview.net/forum?id=eRleg6vy0Y&referrer=%5Bthe%20profile%20of%20Serena%20Yeung-Levy%5D(%2Fprofile%3Fid%3D~Serena_Yeung-Levy1)
https://openaccess.thecvf.com/content/CVPR2024/papers/Zeng_Investigating_Compositional_Challenges_in_Vision-Language_Models_for_Visual_Grounding_CVPR_2024_paper.pdf