November 21, 2024

Improving Visual Contextualization in Large Multimodal Models with Symbolic Direct Preference Optimization


Symbol-Based Direct Preference Optimization for Enhanced Visual Contextualization in Large Multimodal Models

Large language models (LLMs) have made impressive strides in in-context learning (ICL) in recent years. Given a few in-context demonstrations (ICDs), they can solve new tasks without any task-specific training. This capability has carried over to large multimodal models (LMMs), which process both text and images. However, current LMMs are often weak at handling visual context: they tend to rely on textual patterns in the demonstrations and neglect the image information.

This phenomenon, referred to as "visual context overlook," leads LMMs to generate incorrect answers even when the visual cues needed for the correct answer are present. Studies have shown that replacing the images in ICDs with placeholders, or removing them entirely, has little impact on model performance, which indicates that LMMs currently make little use of visual information during ICL.

To address this problem, the SymDPO (Symbol Demonstration Direct Preference Optimization) method has been developed. SymDPO breaks with the traditional construction of multimodal demonstrations by replacing the text answers in the examples with random symbols. This forces the model to analyze the images in the demonstrations more closely and establish a connection between the images and the symbols to answer the questions correctly.
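The replacement step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Demo` record, the symbol format (five random letters and digits), and the helper names `random_symbol` and `symbolize` are all hypothetical.

```python
import random
import string
from dataclasses import dataclass


@dataclass
class Demo:
    image: str      # hypothetical image path or identifier
    question: str
    answer: str


def random_symbol(rng, length=5):
    """Return a meaningless string of lowercase letters and digits."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def symbolize(demos, rng):
    """Replace every text answer in the demonstrations with a random symbol.

    Assumption: identical answers share one symbol and different answers
    get different symbols, so the mapping is consistent within a prompt.
    """
    mapping = {}
    used = set()
    symbolized = []
    for demo in demos:
        if demo.answer not in mapping:
            symbol = random_symbol(rng)
            while symbol in used:  # avoid reusing a symbol for another answer
                symbol = random_symbol(rng)
            mapping[demo.answer] = symbol
            used.add(symbol)
        symbolized.append(Demo(demo.image, demo.question, mapping[demo.answer]))
    return symbolized, mapping


demos = [
    Demo("img_001.jpg", "What animal is shown?", "cat"),
    Demo("img_002.jpg", "What animal is shown?", "dog"),
    Demo("img_003.jpg", "What animal is shown?", "cat"),
]
symbolized, mapping = symbolize(demos, random.Random(0))
```

Because the symbols carry no meaning of their own, the text of a demonstration no longer hints at the answer; only the image can tell the model which symbol belongs to which kind of content.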

The method is based on Direct Preference Optimization (DPO), a technique that fine-tunes a model directly on pairs of preferred and rejected responses and has been used to improve the instruction-following capabilities of LMMs. However, traditional DPO methods are usually geared towards general instruction-following tasks and do not consider the specific requirements of multimodal demonstrations in the ICL context. Furthermore, many visual question answering (VQA) tasks can be answered largely from the text alone, which makes it difficult to collect reliable preference data for multimodal learning.
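For readers unfamiliar with DPO, the standard per-pair objective can be written as a small function. This is a minimal sketch of the general DPO loss, not SymDPO-specific code: the function name, the argument names, and the default `beta=0.1` are assumptions chosen for illustration. The inputs are the summed log-probabilities that the trained policy and a frozen reference model assign to the preferred ("chosen") and dispreferred ("rejected") responses.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    All four inputs are sequence log-probabilities. The loss is
    -log(sigmoid(margin)), where the margin measures how much more the
    policy favors the chosen response than the reference model does.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is zero and the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability towards the chosen response: the loss drops.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model, which is why the construction of the chosen and rejected answers matters so much.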

SymDPO addresses these challenges by forcing the models to utilize both visual and textual information in ICDs. Replacing text answers with semantically neutral symbols establishes a link between image content and symbolic response. This makes the visual information essential for understanding and generating correct answers.

The implementation of SymDPO involves the creation of symbolic preference data. The text answers in the demonstrations are replaced with symbols that are unrelated to the meaning of the original answers, so that the correct response can only be found by aligning the symbols with the visual context. Experiments with various LMM architectures on multiple benchmarks have shown that SymDPO consistently improves model performance. The method reduces "visual context overlook" and promotes a deeper multimodal understanding.
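One way such a symbolic preference example could be assembled is sketched below. The article does not describe how the rejected answer is chosen, so this sketch assumes it is a symbol attached to a different demonstration in the same prompt. The record layout (`prompt`, `chosen`, `rejected`), the "Q: ... A: ..." text template, and the function name `build_preference_pair` are likewise hypothetical.

```python
import random


def build_preference_pair(demos, query, chosen_symbol, rng):
    """Build one preference example from symbolized demonstrations.

    demos: list of dicts with 'image', 'question', and 'answer' (a symbol).
    query: dict with 'image' and 'question'.
    chosen_symbol: the symbol that the demonstrations associate with the
        visual content of the query image.
    The rejected answer is sampled from the other symbols in the prompt
    (an assumption of this sketch).
    """
    distractors = sorted({d["answer"] for d in demos} - {chosen_symbol})
    if not distractors:
        raise ValueError("need at least one demonstration with a different symbol")

    prompt = [
        {"image": d["image"], "text": f"Q: {d['question']} A: {d['answer']}"}
        for d in demos
    ]
    # The query is appended last, with its answer left open.
    prompt.append({"image": query["image"], "text": f"Q: {query['question']} A:"})

    return {
        "prompt": prompt,
        "chosen": chosen_symbol,
        "rejected": rng.choice(distractors),
    }


demos = [
    {"image": "cat_1.jpg", "question": "What animal is shown?", "answer": "x7q2b"},
    {"image": "dog_1.jpg", "question": "What animal is shown?", "answer": "k2m9z"},
]
query = {"image": "cat_2.jpg", "question": "What animal is shown?"}
pair = build_preference_pair(demos, query, "x7q2b", random.Random(0))
```

In this example the question text is identical across all demonstrations, so the preferred and rejected symbols can only be told apart by comparing the query image with the demonstration images.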

SymDPO represents a significant step towards improving ICL in LMMs. By specifically integrating visual information into the learning process, these models can better utilize their potential in handling multimodal tasks and generate more accurate and contextually relevant answers. For Mindverse, as a provider of AI-powered content solutions, these developments are of particular interest as they have the potential to significantly expand the quality and scope of multimodal AI applications.
