OpenGVLab has released seven new Vision-Language Models (VLMs) under the name InternVL. The models are based on various combinations of InternViT, Qwen2.5, and InternLM2 and offer a range of sizes and capabilities. The largest model, InternVL-78B, combines InternViT 6B with Qwen2.5-72B Instruct and is available under the MIT license.
The new InternVL models combine the strengths of existing architectures. InternViT, a vision encoder, is used in 300M and 6B variants. For language processing, Qwen2.5 (in 0.5B, 3B, 32B, and 72B sizes) and InternLM2 (in 7B, 8B, and 20B sizes) are used. These combinations allow the models to be adapted flexibly to different use cases and resource budgets.
The 78B model stands out due to its size and performance. It combines the InternViT 6B vision encoder with the Qwen2.5-72B Instruct language model. According to the announcement, it can handle a variety of tasks that require both image and text understanding. The MIT license permits broad use and adaptation of the model, both for research and for commercial applications.
Vision-Language Models like InternVL find application in a variety of areas. These include:
- Image captioning
- Image search based on text descriptions
- Answering questions about images
- Generating text about images
- Visually grounded dialogue systems

The models are available via Hugging Face. Further information about the models, their architecture, and their capabilities can be found in the official documentation and the associated repositories.
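As a quick illustration, the sketch below shows how such a checkpoint might be loaded with the Hugging Face `transformers` library. The model ID and loading options are assumptions for illustration only; the exact inference interface (for example, a chat method shipped as remote code) is documented on the respective model card.

```python
# Minimal sketch: loading an InternVL checkpoint from the Hugging Face hub.
# The model ID below is an assumption; check the OpenGVLab organization page
# and the model card for the authoritative name and inference example.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "OpenGVLab/InternVL2_5-78B"  # hypothetical ID for illustration

model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half-precision weights to reduce memory use
    trust_remote_code=True,       # the repository ships custom modeling code
    low_cpu_mem_usage=True,
).eval()

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Image preprocessing and the chat/inference call are defined by the
# repository's remote code; follow the usage example on the model card.
```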