The development of multimodal large language models (MLLMs), which can process both text and visual information, is advancing rapidly. A key aspect of this development is the improvement of visual encoding to optimize performance in various downstream tasks. Microsoft has introduced Florence-VL, a new MLLM that uses the generative vision encoder Florence-2, thus charting a new direction in vision-language alignment.
Unlike conventional CLIP-based vision transformers, which are trained through contrastive learning, Florence-2 uses a generative training objective. This allows it to capture visual features at different levels of abstraction and of different aspects, making it adaptable to a wider range of tasks. Conditioned on text prompts, Florence-2 can produce task-specific visual representations for image captioning, object detection, grounding, and OCR (Optical Character Recognition). This versatility benefits downstream tasks such as extracting textual information from images or understanding spatial relationships between objects.
To effectively utilize Florence-2's diverse visual features, Florence-VL employs what is called "Depth-Breadth Fusion" (DBFusion). "Depth" refers to integrating features from different layers of the vision encoder, which represent different levels of abstraction. "Breadth," on the other hand, describes using multiple image features generated under different task prompts, each capturing a different visual aspect. DBFusion combines these depth and breadth features through simple channel concatenation, producing a comprehensive visual representation that serves as input to the LLM.
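The fusion step described above can be sketched in a few lines. The sketch below is a minimal illustration of channel concatenation, not the paper's implementation: the token counts, channel sizes, number of depth/breadth features, and the random projection standing in for the trained projection layer are all hypothetical.

```python
import numpy as np

# Hypothetical dimensions for illustration only (not the paper's actual sizes).
num_tokens = 576   # visual tokens per image
channels = 1024    # feature channels per encoder output

rng = np.random.default_rng(0)

# "Depth": features taken from different encoder layers (levels of abstraction).
depth_features = [rng.standard_normal((num_tokens, channels)) for _ in range(2)]

# "Breadth": features produced under different task prompts
# (e.g. captioning, OCR, grounding).
breadth_features = [rng.standard_normal((num_tokens, channels)) for _ in range(3)]

# DBFusion: concatenate all feature maps along the channel dimension.
fused = np.concatenate(depth_features + breadth_features, axis=-1)
print(fused.shape)  # (576, 5120) — 5 feature maps × 1024 channels each

# A projection layer (here a random matrix as a stand-in for the trained
# projector) then maps the fused representation into the LLM's embedding space.
llm_dim = 4096
projection = rng.standard_normal((fused.shape[-1], llm_dim)) * 0.01
visual_tokens = fused @ projection
print(visual_tokens.shape)  # (576, 4096)
```

Because concatenation happens along the channel axis rather than the token axis, the number of visual tokens fed to the LLM stays constant regardless of how many feature maps are fused.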
Florence-VL is trained with a novel recipe of open-source training data, consisting of a large dataset of detailed image descriptions and a mixture of datasets for instruction tuning. The training consists of end-to-end pre-training of the entire model, followed by fine-tuning of the projection layer and the LLM. The results show that Florence-VL achieves significant improvements over existing MLLMs in various benchmarks, including general VQA (Visual Question Answering), perception, hallucination, OCR, chart understanding, and knowledge-intensive understanding. Quantitative analyses and visualizations of Florence-VL's visual features demonstrate better alignment with LLMs compared to common vision encoders like CLIP and SigLIP.
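The two-stage training schedule described above can be sketched as a simple freeze/unfreeze policy. This is a schematic illustration under assumed module names ("vision_encoder", "projection", "llm"); it is not the authors' actual training code.

```python
# Hypothetical sketch of Florence-VL's two-stage schedule.
# Module names are illustrative, not the paper's actual identifiers.
model = {
    "vision_encoder": {"trainable": True},
    "projection": {"trainable": True},
    "llm": {"trainable": True},
}

def set_stage(model, stage):
    """Toggle which components receive gradient updates in each stage."""
    if stage == "pretraining":
        # Stage 1: end-to-end pre-training — the entire model is updated.
        for module in model.values():
            module["trainable"] = True
    elif stage == "finetuning":
        # Stage 2: instruction tuning — only the projection layer and the
        # LLM are updated; the vision encoder is frozen.
        model["vision_encoder"]["trainable"] = False
        model["projection"]["trainable"] = True
        model["llm"]["trainable"] = True
    else:
        raise ValueError(f"unknown stage: {stage}")

set_stage(model, "finetuning")
print([name for name, m in model.items() if m["trainable"]])
# ['projection', 'llm']
```

In an actual framework this policy would typically be expressed by setting `requires_grad` on parameter groups rather than flags in a dictionary.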
The development of Florence-VL represents a significant step in the evolution of multimodal language models. The use of a generative vision encoder and the DBFusion method enable more comprehensive and flexible processing of visual information. The release of the models and the training recipe as open source promotes further research and development in this area and contributes to the democratization of AI technologies. For Mindverse, a German company specializing in AI-powered content creation and customized AI solutions, Florence-VL offers interesting potential for the further development of their products and services. The integration of such advanced multimodal models could significantly expand the capabilities of chatbots, voicebots, AI search engines, and knowledge systems, raising the possibilities of AI-powered content creation to a new level.
Bibliography:
Chen, J., Yang, J., Wu, H., Li, D., Gao, J., Zhou, T., & Xiao, B. (2024). Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion. arXiv preprint arXiv:2412.04424. https://huggingface.co/papers/2412.04424
Razzaq, A. (2024, December 7). Microsoft Introduces Florence-VL: A Multimodal Model Redefining Vision-Language Alignment with Generative Vision Encoding and Depth-Breadth Fusion. MarkTechPost. https://www.marktechpost.com/2024/12/07/microsoft-introduces-florence-vl-a-multimodal-model-redefining-vision-language-alignment-with-generative-vision-encoding-and-depth-breadth-fusion/
Microsoft. (n.d.). Project Florence-VL. Microsoft Research. https://www.microsoft.com/en-us/research/project/project-florence-vl/
Zilliz. (n.d.). Florence: Novel Vision Foundation Model by Microsoft. https://zilliz.com/learn/florence-novel-vision-foundation-model-by-microsoft
Carrell, T. (2024, July 4). Introducing Florence-2: Microsoft’s Latest Multi-Modal, Compact Visual Language Model. Datature. https://www.datature.io/blog/introducing-florence-2-microsofts-latest-multi-modal-compact-visual-language-model