December 9, 2024

Open Source Multimodal AI Achieves Near Commercial Performance

Open-Source AI Reaches New Performance Levels in Multimodal Understanding

Artificial intelligence (AI) keeps moving quickly. InternVL 2.5, the latest release of the open-source system developed by OpenGVLab, promises machines a markedly better understanding of images, videos, and other forms of information. The improvements concern the system's visual capabilities, the quality of the training data, and the optimization of inference methods.

InternVL 2.5 is a series of open-source multimodal large language models. The original architecture has been retained; the gains come from improvements in training strategy, test-time strategy, and data quality. As a result, the system approaches the performance of top commercial AI systems. A notable result is a score above 70% on the MMMU benchmark, a level comparable to commercial models such as GPT-4o. This is the first time an open-source model has reached that mark.

Core Innovations and Improvements

Among the most important improvements of InternVL 2.5 are:

A larger visual encoder with 6 billion parameters significantly reduces the amount of training data the model depends on. Data quality has been improved through strict filtering mechanisms. The test-time strategy has been optimized by applying Chain-of-Thought reasoning, in which the model writes out intermediate steps before answering, and this increases performance on complex tasks.
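To make the Chain-of-Thought idea concrete, here is a minimal sketch in Python of how a multiple-choice question might be wrapped in a step-by-step prompt and how the final choice could be read back from the model's text. The prompt wording, the `Answer: <letter>` convention, the helper names, and the majority vote over several sampled responses are all assumptions made for illustration; the article does not describe how InternVL 2.5 implements this step.

```python
import re
from collections import Counter
from typing import List, Optional

# Hypothetical instruction appended to every question (not taken from InternVL 2.5).
COT_SUFFIX = ("Think through the problem step by step, then give the "
              "final choice on its own line as 'Answer: <letter>'.")


def build_cot_prompt(question: str, options: List[str]) -> str:
    """Format a multiple-choice question with lettered options and a CoT instruction."""
    lines = [question]
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    lines.append(COT_SUFFIX)
    return "\n".join(lines)


def extract_answer(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in a response, or None if there is none."""
    matches = re.findall(r"Answer:\s*\(?([A-Z])\b", response)
    return matches[-1] if matches else None


def majority_vote(responses: List[str]) -> Optional[str]:
    """Pick the most frequent extracted answer across several sampled responses."""
    votes = Counter(a for a in map(extract_answer, responses) if a)
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    print(build_cot_prompt("Which shape has three sides?", ["Square", "Triangle"]))
    sampled = ["Three sides means a triangle.\nAnswer: B", "Answer: (B)", "Answer: A"]
    print(majority_vote(sampled))
```

Taking the last match rather than the first lets the reasoning text mention candidate answers along the way without confusing the parser.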

Performance and Technical Highlights

InternVL 2.5 demonstrates impressive performance in various benchmark tests, including multidisciplinary reasoning, document comprehension, image and video understanding, real-world comprehension, detection of multimodal hallucinations, visual localization, and multilingual capabilities.

Technically, the system is characterized by a progressive scaling strategy for efficient model alignment, dynamic high-resolution training for improved processing of high-resolution inputs, and a rigorous data filtering process to reduce the influence of low-quality data.
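Dynamic high-resolution processing is commonly done by resizing an image to a grid of fixed-size tiles whose shape matches the image's aspect ratio, then feeding each tile to the visual encoder. The Python sketch below illustrates that general idea only. The tile size of 448 pixels, the limit of 12 tiles, the tie-breaking rule, and all function names are assumptions made for this example, not details confirmed by the article.

```python
from typing import List, Tuple

Grid = Tuple[int, int]           # (columns, rows)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def candidate_grids(max_tiles: int) -> List[Grid]:
    """All (columns, rows) grids that use between 1 and max_tiles tiles."""
    return [(c, r)
            for c in range(1, max_tiles + 1)
            for r in range(1, max_tiles + 1)
            if c * r <= max_tiles]


def choose_grid(width: int, height: int,
                tile_size: int = 448, max_tiles: int = 12) -> Grid:
    """Pick the grid whose aspect ratio is closest to the image's.

    Ties are broken by preferring the grid whose tile count is closest
    to the image area measured in tiles, so small images are not
    blown up into many tiles.
    """
    target_ratio = width / height
    area_in_tiles = (width * height) / (tile_size ** 2)
    return min(candidate_grids(max_tiles),
               key=lambda g: (abs(g[0] / g[1] - target_ratio),
                              abs(g[0] * g[1] - area_in_tiles)))


def tile_boxes(width: int, height: int,
               tile_size: int = 448, max_tiles: int = 12) -> List[Box]:
    """Crop boxes for each tile, assuming the image is first resized
    to (columns * tile_size) x (rows * tile_size)."""
    cols, rows = choose_grid(width, height, tile_size, max_tiles)
    return [(c * tile_size, r * tile_size,
             (c + 1) * tile_size, (r + 1) * tile_size)
            for r in range(rows) for c in range(cols)]


if __name__ == "__main__":
    for size in [(448, 448), (896, 896), (1344, 448), (800, 600)]:
        print(size, choose_grid(*size), len(tile_boxes(*size)))
```

Because the grid adapts to each image, a wide document scan and a square photo end up with different tile layouts instead of both being squeezed into one fixed resolution.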

Outlook

The developments surrounding InternVL 2.5 highlight the enormous potential of open-source AI. The continuous advancements in multimodal understanding open up new possibilities for various applications, from image and video analysis to complex tasks that require a deep understanding of the real world. The availability of such powerful open-source systems promotes innovation and allows a broader community to benefit from the advancements in the field of AI. For companies like Mindverse, which develop customized AI solutions, these developments offer a valuable foundation for shaping future applications.

Bibliography: https://developer.aliyun.com/article/86899