November 22, 2024

Nexa AI Updates OmniVision-968M Multimodal Model with Improved Image Understanding

OmniVision-968M: Update Improves Image Understanding of Nexa AI's Smallest Multimodal Model

Nexa AI has released an improved version of its multimodal model OmniVision-968M. The update, driven by user feedback, is available as a preview on Hugging Face and focuses on better image descriptions, particularly for artwork, complex scenes, and anime, as well as color and detail recognition. The final model files will be published in the Hugging Face repository after final adjustments.

OmniVision-968M is a compact multimodal model with just under one billion parameters (968M) that can process both visual and textual information. It is based on the LLaVA (Large Language and Vision Assistant) architecture and has been optimized specifically for edge devices. A key feature of OmniVision-968M is the ninefold reduction of image tokens from 729 to 81, which substantially lowers latency and computational cost. This makes the model particularly suitable for applications where resources are limited.
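The exact projector design is not spelled out here, but one common way to achieve such a 729-to-81 reduction is to merge each 3×3 neighborhood of a 27×27 patch grid into a single token by concatenating along the channel axis. The following is a minimal sketch of that idea, assuming hypothetical grid and embedding sizes; it is an illustration, not OmniVision's actual implementation:

```python
import numpy as np

def compress_tokens(patches: np.ndarray, group: int = 3) -> np.ndarray:
    """Merge each (group x group) block of patch embeddings into one token.

    patches: (grid, grid, dim) array of visual patch embeddings,
             e.g. a 27x27 grid -> 729 tokens.
    returns: (out*out, group*group*dim) array, e.g. 9x9 -> 81 tokens.
    """
    grid, _, dim = patches.shape
    out = grid // group
    x = patches.reshape(out, group, out, group, dim)  # split grid into blocks
    x = x.transpose(0, 2, 1, 3, 4)                    # (out, out, group, group, dim)
    return x.reshape(out * out, group * group * dim)  # flatten each block

# Hypothetical 27x27 grid of 768-dim embeddings: 729 tokens in, 81 out.
tokens = np.random.rand(27, 27, 768)
compressed = compress_tokens(tokens)
print(compressed.shape)  # (81, 6912)
```

Since the language model's cost grows with sequence length, feeding 81 image tokens instead of 729 cuts the visual share of the context by a factor of nine, which is where the latency savings on edge hardware come from.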

OmniVision-968M was trained in three phases: pre-training on image-text pairs to align visual and linguistic representations, supervised fine-tuning (SFT) on image-based question-answer datasets to improve contextual understanding, and Direct Preference Optimization (DPO) to further raise answer quality and reduce hallucinations. In the DPO phase, the base model generates answers to images, a teacher model corrects them, and the resulting pairs of original and corrected answers are used to fine-tune the model.
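The pairs of original and corrected answers feed the standard DPO objective, which rewards the model for preferring the teacher-corrected answer over its own original one, measured against a frozen reference copy of the model. A minimal sketch, with purely illustrative log-probability values:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    'chosen' is the teacher-corrected answer, 'rejected' the model's
    original answer; ref_* come from the frozen reference model.
    """
    # Margin: how much more the policy prefers the corrected answer
    # than the reference model does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Loss = -log(sigmoid(beta * margin)); small when the margin is large.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Illustrative sequence log-probabilities (not real model outputs):
good = dpo_loss(-5.0, -9.0, -6.0, -8.0)  # policy favors corrected answer
bad = dpo_loss(-9.0, -5.0, -8.0, -6.0)   # policy favors original answer
print(good < bad)  # True: the loss drops as the preference gap widens
```

The gradient of this loss pushes probability mass toward the corrected answers while the reference term keeps the model from drifting too far from its SFT behavior, which is how DPO curbs hallucinations without a separate reward model.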

The current version of the model is still under development. Nexa AI plans to expand the DPO training further and to improve document and text understanding. In the long term, the company intends to develop OmniVision-968M into a fully optimized, production-ready solution for multimodal edge AI applications.

Key Improvements at a Glance:

The update of OmniVision-968M brings improvements in various areas:

Art Descriptions: The model now provides more detailed and precise descriptions of artwork.

Complex Images: The analysis and description of complex scenes with multiple objects and interactions have been improved.

Anime: The recognition and description of anime images has been optimized.

Color and Detail Recognition: The model now recognizes and describes colors and fine details more reliably.

World Knowledge: The model's general world knowledge has been expanded.

Nexa AI makes the model available on Hugging Face and encourages users to provide feedback to support the further development of the model. The company sees OmniVision-968M as a promising tool for various applications in the field of multimodal AI, especially on resource-constrained devices.
