December 9, 2024

OpenGVLab Releases InternVL 2.5 Open Source Multimodal Model

InternVL 2.5: Advancements in Open-Source Multimodal Models

OpenGVLab has released InternVL 2.5, a new generation of its open-source multimodal large language models (MLLMs). The family spans several sizes, from a 1-billion-parameter variant suitable for edge devices up to a 78-billion-parameter flagship. InternVL 2.5 aims to push the performance boundaries of open-source multimodal models by scaling along three axes: model size, training data, and test-time computation.
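
For readers who want to try one of these checkpoints, the following is a minimal inference sketch based on the usage pattern shown in OpenGVLab's Hugging Face model cards. The model ID OpenGVLab/InternVL2_5-8B, the 448x448 single-tile preprocessing, and the chat() helper provided by the model's remote code are assumptions drawn from those model cards; details may differ per checkpoint.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

# Assumed model ID following OpenGVLab's naming scheme on Hugging Face.
MODEL_ID = "OpenGVLab/InternVL2_5-8B"

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # InternVL ships custom modeling code
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)

# Minimal single-tile preprocessing: resize to the 448x448 input resolution
# of the vision encoder and normalize with ImageNet statistics.
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open("chart.png").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0).to(torch.bfloat16).cuda()

# The chat() helper is provided by the model's remote code (see the model card).
question = "<image>\nWhat trend does this chart show?"
generation_config = dict(max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```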

Model Scaling, Data Expansion, and Test-Time Scaling

InternVL 2.5 retains the core design of previous versions and extends it with targeted improvements. Larger model variants deliver higher performance and can handle more complex tasks. The expanded training data improves the understanding of the different modalities, including text, images, and video. Test-time scaling, for example via chain-of-thought reasoning, lets the model spend additional compute at inference to arrive at more reliable answers.

Improved Performance in Various Benchmarks

In benchmarks such as MathVista, ChartQA, DocVQA, and MMBench, InternVL 2.5 shows improved performance over its predecessors and other open-source models. Particularly noteworthy is its ability to handle complex tasks such as mathematical reasoning, chart interpretation, and document analysis. The results indicate a deeper understanding of multimodal data.

Progressive Alignment with Large Language Models

For training, InternVL 2.5 uses a progressive alignment strategy that couples the vision encoder with large language models step by step. By gradually scaling the model from small to large while refining the data from coarse to fine, large models can be trained efficiently and still reach high performance on a comparatively modest compute budget.
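
To make the idea of progressive alignment more concrete, here is a conceptual PyTorch sketch of a staged schedule in which only the projector between the vision encoder and the language model is trained first, before all components are fine-tuned jointly. The module sizes and the two-stage split are purely illustrative and do not reproduce InternVL's actual training pipeline.

```python
import torch
from torch import nn

# Tiny placeholder modules standing in for the three components of an
# InternVL-style architecture; the real models are orders of magnitude larger.
vision_encoder = nn.Linear(64, 128)   # stand-in for the vision encoder
projector = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))
language_model = nn.Linear(128, 256)  # stand-in for the LLM

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage: int) -> torch.optim.Optimizer:
    """Illustrative two-stage schedule: first align only the projector,
    then fine-tune all components jointly on higher-quality data."""
    if stage == 1:
        set_trainable(vision_encoder, False)
        set_trainable(language_model, False)
        set_trainable(projector, True)
    else:
        for m in (vision_encoder, projector, language_model):
            set_trainable(m, True)
    params = [p for m in (vision_encoder, projector, language_model)
              for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=1e-4 if stage == 1 else 1e-5)

for stage in (1, 2):
    optimizer = configure_stage(stage)
    # ... run the training loop for this stage on the corresponding data mix ...
```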

Multimodal Inputs and Multitask Outputs

InternVL 2.5 accepts text, images, and video as input and can produce diverse outputs such as images, bounding boxes, and masks. By connecting the MLLM to different decoders for downstream tasks, it can be applied to a wide range of vision-language problems and reaches performance comparable to specialized models. This versatility opens up new application areas.
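
Building on the loading sketch above (reusing model, tokenizer, and preprocess), a multi-frame, video-style query could look as follows. The num_patches_list argument and the "Frame{i}: <image>" prompt format are taken from OpenGVLab's model-card examples; the frame handling here is deliberately simplified and assumes the frames have already been extracted to image files.

```python
import torch
from PIL import Image

# Assumption: video frames were extracted beforehand to individual images.
frame_paths = ["frame_0.jpg", "frame_1.jpg", "frame_2.jpg", "frame_3.jpg"]
frames = [preprocess(Image.open(p).convert("RGB")) for p in frame_paths]
pixel_values = torch.stack(frames).to(torch.bfloat16).cuda()
num_patches_list = [1] * len(frames)  # one 448x448 tile per frame

# Prefix each frame with its own <image> placeholder, as in the model cards.
video_prefix = "".join(f"Frame{i + 1}: <image>\n" for i in range(len(frames)))
question = video_prefix + "Describe what happens in this clip."
generation_config = dict(max_new_tokens=512, do_sample=False)

response = model.chat(
    tokenizer, pixel_values, question, generation_config,
    num_patches_list=num_patches_list,
)
print(response)
```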

Outlook and Future Developments

The release of InternVL 2.5 represents a significant advancement in the field of open-source multimodal models. The improved performance and flexible architecture options open up new possibilities for developers and researchers. Future developments could focus on further improving performance in specific areas such as counting accuracy and spatial reasoning. The continuous development of open-source models like InternVL 2.5 contributes to expanding the accessibility and breadth of application of AI technologies.
