Multimodal Large Language Models (MLLMs) have made impressive progress in areas such as reasoning and planning in recent years. Especially for their use as so-called "Embodied Agents," i.e., AI agents that operate in a simulated or real environment, the ability to process and interpret multiple viewpoints is becoming increasingly important. This ability, known as multi-view understanding, combines visual information from different viewpoints and applies it to tasks such as navigation, object manipulation, and 3D scene understanding. Current MLLMs, however, still show significant weaknesses in precisely this area.
While MLLMs can understand and generate complex text, they struggle to process visual information that comes from different viewpoints. The challenge lies in maintaining geometric consistency across views and establishing correspondences between them: an object that is partially occluded in one view must still be identified correctly in another, and the relative positions of objects with respect to each other must be recoverable regardless of the viewpoint. These capabilities are essential for a comprehensive understanding of a scene and form the basis for purposeful action in a 3D environment.
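To make this consistency requirement concrete, the following minimal Python/NumPy sketch (with hypothetical object and camera positions, not data from the benchmark) shows that a rigid change of viewpoint alters per-view coordinates but leaves the relative layout of the scene, measured here as pairwise object distances, unchanged.

```python
# A minimal sketch, assuming a toy scene with hypothetical object and camera
# positions: a rigid camera transform changes coordinates but preserves
# pairwise distances, which is the cross-view consistency a model must maintain.
import numpy as np

# Hypothetical 3D object positions in a shared world frame (metres).
objects_world = np.array([
    [1.0, 0.0, 4.0],   # chair
    [2.5, 0.0, 6.0],   # table
    [0.0, 1.2, 5.0],   # lamp
])

def camera_extrinsics(camera_pos, yaw_deg):
    """World-to-camera rigid transform: rotation about the vertical axis plus translation."""
    yaw = np.deg2rad(yaw_deg)
    R = np.array([
        [np.cos(yaw), 0.0, -np.sin(yaw)],
        [0.0,         1.0,  0.0],
        [np.sin(yaw), 0.0,  np.cos(yaw)],
    ])
    t = -R @ np.asarray(camera_pos, dtype=float)
    return R, t

def to_camera_frame(points, R, t):
    """Express world-frame points in the camera frame: x_cam = R @ x_world + t."""
    return points @ R.T + t

def pairwise_distances(points):
    """Euclidean distances between every pair of points."""
    return np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Two cameras observing the same scene from different viewpoints.
R1, t1 = camera_extrinsics(camera_pos=[0.0, 1.5, 0.0], yaw_deg=0.0)
R2, t2 = camera_extrinsics(camera_pos=[4.0, 1.5, 1.0], yaw_deg=35.0)
view1 = to_camera_frame(objects_world, R1, t1)
view2 = to_camera_frame(objects_world, R2, t2)

# Per-view coordinates differ, but inter-object distances agree across views.
assert np.allclose(pairwise_distances(view1), pairwise_distances(view2))
print(pairwise_distances(view1).round(2))
```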
To evaluate the capabilities of MLLMs in the area of multi-view understanding, the "All-Angles Bench" was developed. This benchmark comprises over 2100 carefully human-annotated question-answer pairs related to 90 different real-world scenes. The benchmark's six tasks – counting, attribute recognition, relative distance, relative direction, object manipulation, and camera pose estimation – specifically test the models' ability to recognize geometric correspondences and consistently reconcile information across different views.
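The benchmark's exact data format is not detailed here; purely as an illustration, a single multi-view question-answer item might be organized roughly as in the following sketch, where all field names and values are assumptions rather than the benchmark's actual schema.

```python
# Purely illustrative sketch of a multi-view QA item; field names and values
# are assumptions, not the All-Angles Bench's actual schema.
from dataclasses import dataclass

@dataclass
class MultiViewQuestion:
    scene_id: str           # identifier of the real-world scene
    task: str               # e.g. "counting", "relative_direction", "camera_pose_estimation"
    image_paths: list[str]  # the views of the scene shown to the model
    question: str
    options: list[str]      # multiple-choice answer candidates
    answer: str             # human-annotated ground truth

example = MultiViewQuestion(
    scene_id="scene_0042",
    task="relative_distance",
    image_paths=["scene_0042/view_1.jpg", "scene_0042/view_2.jpg"],
    question="Considering both views, which object is closer to the backpack?",
    options=["A. the chair", "B. the monitor", "C. the trash bin"],
    answer="A",
)
print(example.task, example.answer)
```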
In extensive experiments, 27 representative MLLMs, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o, were compared against human participants. The results reveal a substantial performance gap between the models and human-level understanding. MLLMs perform particularly poorly at establishing correspondences across views under partial occlusion and at estimating coarse camera poses.
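How such a comparison is typically scored can be sketched generically: per-task accuracy is computed for each model and for the human baseline from multiple-choice answers. The item structure and task names below are assumptions for illustration, not the benchmark's official evaluation code.

```python
# Generic scoring sketch for a multiple-choice, multi-task benchmark; the item
# structure and task names are illustrative assumptions.
from collections import defaultdict

def per_task_accuracy(items, predictions):
    """items: dicts with 'task' and ground-truth 'answer'; predictions: answers aligned by index."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item["task"]] += 1
        correct[item["task"]] += int(pred == item["answer"])
    return {task: correct[task] / total[task] for task in total}

items = [
    {"task": "counting", "answer": "B"},
    {"task": "camera_pose_estimation", "answer": "C"},
    {"task": "camera_pose_estimation", "answer": "A"},
]
print(per_task_accuracy(items, ["B", "C", "B"]))  # {'counting': 1.0, 'camera_pose_estimation': 0.5}
```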
The results of the All-Angles Bench highlight the need for further research on multi-view understanding in MLLMs. Closing the performance gap will require model adaptations and modules that build in stronger multi-view awareness. Future work could focus on training methods that explicitly account for the geometric relationships between views, as well as on integrating specialized modules for camera pose estimation and for handling partial occlusions.
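To illustrate what a dedicated camera-pose module would have to deliver, here is a minimal sketch of the classical two-view geometry pipeline: recovering the relative rotation and translation between two views from point correspondences via the essential matrix. It uses OpenCV and NumPy on a synthetic scene; the intrinsics, ground-truth pose, and generated correspondences are placeholder assumptions, not part of the benchmark or of any proposed module.

```python
# Classical relative-pose estimation between two views (essential matrix +
# pose recovery). Synthetic data only; intrinsics and poses are placeholders.
import cv2
import numpy as np

def relative_camera_pose(pts1, pts2, K):
    """Estimate rotation R and translation direction t of view 2 relative to view 1."""
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    return R, t

def project(points_3d, R, t, K):
    """Project 3D world points into a pinhole camera with extrinsics (R, t)."""
    cam = points_3d @ R.T + t
    uv = cam @ K.T
    return (uv[:, :2] / uv[:, 2:3]).astype(np.float32)

# Synthetic scene: random 3D points in front of the first camera.
rng = np.random.default_rng(0)
points_3d = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(60, 3))

# Placeholder pinhole intrinsics shared by both views.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])

# Ground-truth pose of view 2: small rotation about the vertical axis plus a translation.
R_gt, _ = cv2.Rodrigues(np.array([[0.0], [0.2], [0.0]]))
t_gt = np.array([0.5, 0.0, 0.1])

# In practice the 2D correspondences would come from a feature matcher or a
# learned correspondence model; here we project the same 3D points into both views.
pts1 = project(points_3d, np.eye(3), np.zeros(3), K)
pts2 = project(points_3d, R_gt, t_gt, K)

R_est, t_est = relative_camera_pose(pts1, pts2, K)
print("estimated rotation:\n", np.round(R_est, 3))
print("estimated translation direction:", np.round(t_est.ravel(), 3))
```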
The All-Angles Bench offers valuable insights into the current weaknesses of MLLMs and contributes to reducing the gap between machine and human multi-view understanding. This is an important step towards robust and reliable Embodied AI systems that can operate effectively in complex 3D environments.