April 23, 2025

New Benchmark CAPTURe Tests Spatial Reasoning in Vision Language Models


Visual Reasoning in Focus: New Benchmark CAPTURe Tests Spatial Understanding of AI Models

Artificial intelligence (AI) is making rapid progress, especially in the field of vision-language models (VLMs). These models can interpret images, answer questions about them, and even describe complex scenes. But how well do they actually understand the spatial relationships between objects, especially when some of those objects are partially hidden from view? A new benchmark called CAPTURe (Counting Amodally for Patterns Through Unseen REgions) aims to test precisely this ability and reveal the limitations of current VLMs.

The Challenge of Occlusion

Occlusion, the partial or complete obscuring of objects, is an everyday phenomenon in visual perception. Humans usually have no trouble inferring the presence of hidden objects and estimating their number. For AI models, however, this poses a significant challenge: they must recognize the visible pattern and extrapolate it into the regions they cannot see.

CAPTURe: A New Approach to Evaluation

CAPTURe was specifically designed to investigate the spatial reasoning of VLMs in relation to occluded objects. The benchmark consists of two parts: CAPTURe-real uses real photos of objects arranged in patterns, while CAPTURe-synthetic is based on generated images to create a controlled test environment. In both cases, some objects are hidden by occluders. The task of the VLMs is to determine the total number of objects by continuing the pattern behind the occluder.
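To make the task format concrete, the following sketch shows how a CAPTURe-synthetic-style item could be set up. The class and its fields are illustrative assumptions, not the benchmark's actual generation code:

```python
# Minimal sketch of a CAPTURe-synthetic-style counting item (illustrative only;
# this is not code from the benchmark itself). Objects lie on a regular grid,
# a rectangular occluder hides part of the grid, and the expected answer is the
# TOTAL number of objects, including the hidden ones.

from dataclasses import dataclass

@dataclass
class OccludedGridTask:
    rows: int        # rows in the object pattern
    cols: int        # columns in the object pattern
    occluder: tuple  # (row_start, row_end, col_start, col_end), end-exclusive

    def total_count(self) -> int:
        # Ground truth: the full pattern, regardless of occlusion.
        return self.rows * self.cols

    def visible_count(self) -> int:
        # What a model that ignores the hidden region would count.
        r0, r1, c0, c1 = self.occluder
        return self.rows * self.cols - (r1 - r0) * (c1 - c0)

task = OccludedGridTask(rows=4, cols=6, occluder=(1, 3, 2, 5))
print("visible objects:", task.visible_count())  # 18
print("correct answer:", task.total_count())     # 24
```

A model that only counts what it sees would answer 18 in this example; solving the task requires inferring that the grid continues behind the occluder and answering 24.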

Current VLMs Reach Their Limits

Initial tests with leading VLMs such as GPT-4o, InternVL2, Molmo, and Qwen2-VL show that these models have difficulty recognizing both visible and occluded patterns and counting the objects correctly. The weakness is particularly evident under occlusion: model performance drops significantly as soon as objects are hidden. This suggests that VLMs still struggle to infer spatial relationships they cannot see directly and to build a complete model of the scene.
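How such a performance drop could be quantified is sketched below using a simple exact-match counting accuracy. Both the metric and the example numbers are illustrative assumptions, not results reported in the paper:

```python
# Hedged sketch: comparing counting performance with and without occlusion.
# The metric (exact-match accuracy on predicted counts) and the numbers below
# are illustrative assumptions, not figures from the CAPTURe paper.

def count_accuracy(predictions: list[int], ground_truth: list[int]) -> float:
    correct = sum(int(p == g) for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

truth            = [24, 24, 30, 18]
preds_unoccluded = [24, 24, 30, 18]  # pattern fully visible
preds_occluded   = [18, 20, 27, 18]  # model tends to count only visible objects

print("accuracy without occlusion:", count_accuracy(preds_unoccluded, truth))
print("accuracy with occlusion:   ", count_accuracy(preds_occluded, truth))
```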

Humans as the Benchmark

In contrast to the AI models, human test subjects show remarkably high accuracy in handling CAPTURe tasks. This underscores the complexity of spatial reasoning and the ongoing challenges for the development of VLMs.

Additional Information Improves Performance

Interestingly, the performance of the VLMs improves when they are provided with additional information about the location of the hidden objects. This suggests that the models' difficulties are not solely due to occlusion itself, but also to more general weaknesses in counting objects in images.
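One way to picture this setting is as a prompt ablation: the baseline query asks only for the total count, while the augmented query additionally states where the hidden objects are. The wording and the coordinate format below are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical prompt ablation for the "additional information" setting.
# Prompt wording and coordinate format are assumptions, not taken from CAPTURe.

def baseline_prompt(object_name: str) -> str:
    return (
        f"The image shows {object_name}s arranged in a regular pattern, partly "
        f"hidden by an occluder. Count the total number of {object_name}s, "
        f"including the hidden ones. Answer with a single number."
    )

def augmented_prompt(object_name: str, hidden_points: list[tuple[int, int]]) -> str:
    coords = ", ".join(f"({x}, {y})" for x, y in hidden_points)
    return (
        baseline_prompt(object_name)
        + f" Hint: hidden {object_name}s are located at pixel coordinates {coords}."
    )

print(augmented_prompt("apple", [(120, 340), (180, 340), (240, 340)]))
```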

Outlook: Improved Spatial Understanding for AI

The results of CAPTURe highlight the need to further improve the spatial reasoning capabilities of VLMs. Future research should focus on models that can better infer unseen spatial relationships and build a more comprehensive understanding of visual scenes. This is an important step towards more robust and reliable AI systems capable of handling complex tasks in the real world.

Bibliography:
Pothiraj, A., Stengel-Eskin, E., Cho, J., & Bansal, M. (2025). CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting. arXiv preprint arXiv:2504.15485.
Paperreading.club. (n.d.). CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting. Retrieved from https://paperreading.club/page?id=301164
Bansal, M., Pothiraj, A., Stengel-Eskin, E., & Cho, J. (2024). CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting. Advances in Neural Information Processing Systems, 40.
Li, Y., Wang, L., & Liu, Z. (2024). Structured Spatial Reasoning with Open Vocabulary Object Detectors. Proceedings of the European Conference on Computer Vision (ECCV).
Jiayuww. (n.d.). SpatialEval. Retrieved from https://github.com/jiayuww/SpatialEval
Zhang, X., Li, Y., & Wang, L. (2024). LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models.
Wolf, T., Mille, J., & Lambrecht, J.-M. (2025). Towards an Exhaustive Evaluation of Vision Language Foundation Models.