November 28, 2024

ChatRex: Enhancing Visual Perception in Multimodal Large Language Models

Multimodal LLMs: ChatRex Improves Perception in Image Analysis

Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities in understanding visual information. However, they often lack precise perception, which limits their use in tasks that demand both perception and understanding. Object localization is a case in point: while MLLMs can generate complex image descriptions, they often struggle to accurately determine where in the image the objects they describe are located.

The newly developed model ChatRex addresses this challenge with a novel approach. Instead of having the LLM predict box coordinates directly, ChatRex uses a so-called Universal Proposal Network (UPN). This network first generates a set of candidate bounding boxes covering possible object locations in the image. The LLM then analyzes the image and selects, from the boxes proposed by the UPN, those that best match the query. This turns box-coordinate prediction from a regression task into a retrieval task, which LLMs can handle far more reliably.
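
To make the retrieval framing concrete, here is a minimal Python sketch of the selection step. The BoxProposal structure, the "<obj{i}>" index tokens, and the ground_query helper are all illustrative assumptions about the interface, not the released ChatRex API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BoxProposal:
    """One candidate object location from the Universal Proposal Network."""
    index: int                      # identity the LLM can cite, e.g. <obj3>
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    score: float                    # UPN objectness confidence

def ground_query(proposals: List[BoxProposal], llm_output: str) -> List[BoxProposal]:
    """Map the LLM's answer back to boxes.

    The LLM never emits coordinates; it answers with the indices of the
    proposals that match the query (encoded here as '<obj{i}>' tokens, an
    assumed format). Grounding thus becomes a lookup, not a regression.
    """
    return [p for p in proposals if f"<obj{p.index}>" in llm_output]

# Usage: the UPN proposed three boxes; the LLM's answer cites one of them.
proposals = [
    BoxProposal(0, (12, 40, 180, 300), 0.91),    # a person
    BoxProposal(1, (200, 60, 340, 310), 0.88),   # a second person
    BoxProposal(2, (150, 220, 210, 280), 0.75),  # a dog
]
answer = "The man in the yellow shirt is <obj1>."
print([p.box for p in ground_query(proposals, answer)])  # [(200, 60, 340, 310)]
```

Because the model only has to name an index rather than emit four coordinates token by token, a grounding error collapses to a wrong choice instead of a malformed box.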

The Architecture of ChatRex

ChatRex is based on a decoupled architecture. The UPN and the LLM are trained separately and then combined. The UPN specializes in identifying potential object locations in the image. The LLM, on the other hand, focuses on understanding the image content and selecting the relevant boxes from the UPN proposals.
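
A structural sketch of that decoupling follows, using hypothetical Protocol interfaces for the two components; the method names propose and generate are assumptions for illustration, not the project's actual API.

```python
from typing import List, Protocol, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

class ProposalNetwork(Protocol):
    """Perception half: class-agnostic candidate boxes, trained on its own."""
    def propose(self, image) -> List[Box]: ...

class GroundingLLM(Protocol):
    """Understanding half: answers questions by citing proposal indices."""
    def generate(self, image, prompt: str, box_candidates: List[Box]) -> str: ...

class ChatRexPipeline:
    """Wires the two separately trained parts together at inference time."""

    def __init__(self, upn: ProposalNetwork, llm: GroundingLLM):
        self.upn = upn
        self.llm = llm

    def answer(self, image, question: str, max_proposals: int = 100) -> str:
        # Stage 1, perception: the UPN proposes boxes without seeing the
        # question; it only has to answer "where could an object be?".
        proposals = self.upn.propose(image)[:max_proposals]
        # Stage 2, understanding: the LLM reads the image, the question,
        # and the indexed candidates, and selects rather than regresses.
        return self.llm.generate(image, question, proposals)
```

Because the contract between the two halves is just a list of boxes, either side can be retrained or swapped without touching the other.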

Rexverse-2M: A Dataset for Joint Optimization of Perception and Understanding

The Rexverse-2M dataset was developed for training ChatRex. As its name suggests, it contains roughly two million images with detailed annotations of objects and their positions. The annotations span several levels of granularity, from coarse object descriptions to fine-grained details. This diversity allows ChatRex to jointly optimize both the perception and the understanding of images.
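
The description suggests records that pair one image with annotations at several granularities. A hypothetical record might look like the following; the field names are illustrative, not the published schema.

```python
# One hypothetical Rexverse-2M-style training record.
sample = {
    "image": "000123.jpg",
    "caption": "A man walks his dog along a rainy street.",  # coarse, image level
    "regions": [
        {
            "box": [200, 60, 340, 310],                      # (x1, y1, x2, y2)
            "category": "person",                            # coarse region label
            "referring": "the man in the yellow shirt",      # fine-grained description
        },
        {
            "box": [150, 220, 210, 280],
            "category": "dog",
            "referring": "a small brown dog on a leash",
        },
    ],
}
```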

Applications of ChatRex

The combination of precise perception and in-depth understanding opens up a wide range of applications for ChatRex:

Object Recognition: ChatRex can accurately locate and classify objects in images.

Grounding: ChatRex can link text descriptions to the corresponding objects in the image.

Referring Expressions: ChatRex can identify objects based on textual references, e.g., "the man in the yellow shirt."

Grounded Image Descriptions: ChatRex can generate detailed image descriptions that account for the identified objects and their relationships to each other (see the parsing sketch after this list).

Grounded Conversations: ChatRex can participate in dialogues related to visual content, accurately referencing objects in the image.
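
As an illustration of the grounded-description case, the sketch below parses a caption in which each phrase carries the index of the proposal it refers to. The "[phrase]<objN>" markup is an assumed convention for this example, not the model's documented output format.

```python
import re

caption = "[A man]<obj1> in a yellow shirt walks [his dog]<obj2> down the street."

# Extract (phrase, proposal index) pairs from the tagged caption.
for phrase, idx in re.findall(r"\[([^\]]+)\]<obj(\d+)>", caption):
    print(f"proposal {idx}: {phrase}")
# proposal 1: A man
# proposal 2: his dog
```

Tying every mentioned phrase to a concrete box is what distinguishes a grounded description or conversation from a free-floating caption.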

ChatRex and Mindverse

The development of ChatRex underscores the importance of MLLMs with strong perceptual abilities. For companies like Mindverse, which develop AI-powered content tools and customized solutions, models like ChatRex offer the potential to significantly improve the functionality and user-friendliness of their products. The precise object recognition and the ability to understand images in detail open up new possibilities for automated content creation, image analysis, and the development of interactive applications.
