OpenGVLab has released seven new Vision-Language Models (VLMs) under the name InternVL. The models are based on various combinations of InternViT, Qwen2.5, and InternLM2 and offer a range of sizes and capabilities. The largest model, InternVL-78B, combines InternViT 6B with Qwen2.5-72B Instruct and is available under the MIT license.
The new InternVL models combine the strengths of existing architectures. InternViT, a vision encoder, is used in 300M and 6B variants. For language processing, Qwen2.5 (in 0.5B, 3B, 32B, and 72B sizes) and InternLM2 (in 7B, 8B, and 20B sizes) are used. These combinations allow the models to be adapted flexibly to different use cases and resource budgets.
The 78B model stands out due to its size and performance. It combines the InternViT 6B vision encoder with the Qwen2.5-72B Instruct language model. According to the announcement, it can handle a variety of tasks that require both image and text understanding. The MIT license permits broad use and adaptation of the model, both for research and for commercial applications.
Vision-Language Models like InternVL find application in a variety of areas. These include:
- Image captioning
- Image search based on text descriptions
- Answering questions about images
- Generating text about images
- Visually grounded dialogue systems

The models are available via Hugging Face. Further information about the models, their architecture, and their capabilities can be found in the official documentation and the associated repositories.
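As a quick illustration, the sketch below shows how such a checkpoint might be loaded with the Hugging Face `transformers` library. The model ID and loading options are assumptions for illustration only; the exact inference interface (for example, a chat method shipped as remote code) is documented on the respective model card.

```python
# Minimal sketch: loading an InternVL checkpoint from the Hugging Face hub.
# The model ID below is an assumption; check the OpenGVLab organization page
# and the model card for the authoritative name and inference example.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "OpenGVLab/InternVL2_5-78B"  # hypothetical ID for illustration

model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half-precision weights to reduce memory use
    trust_remote_code=True,       # the repository ships custom modeling code
    low_cpu_mem_usage=True,
).eval()

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Image preprocessing and the chat/inference call are defined by the
# repository's remote code; follow the usage example on the model card.
```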