Google has released PaliGemma 2, the next generation of its open-source vision-language model. The new version promises more detailed image descriptions and better performance across a range of applications. PaliGemma 2 combines the SigLIP-So400m vision encoder with the Gemma 2 language model family (2B to 27B parameters) and supports three input resolutions: 224×224, 448×448, and 896×896 pixels. Users can therefore trade output quality against compute cost by choosing the model size and resolution that fit their needs.
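To make the size and resolution trade-off concrete, the following minimal sketch enumerates the possible combinations and estimates how many image tokens the vision encoder hands to the language model at each resolution. It rests on assumptions that the text above does not state: a patch size of 14 pixels for SigLIP-So400m, one token per patch, and 9B as the intermediate Gemma 2 size between 2B and 27B. The function names are illustrative only.

```python
from itertools import product

# Assumption: SigLIP-So400m splits the image into 14x14-pixel patches
# and emits one token per patch.
PATCH_SIZE = 14

# Assumption: the Gemma 2 family spans 2B, 9B and 27B parameters
# (the article only states the range "2B to 27B").
LM_SIZES = ("2B", "9B", "27B")

# Input resolutions named in the article (square images, in pixels).
RESOLUTIONS = (224, 448, 896)


def image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Return the number of patch tokens for a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


def variants() -> list[dict]:
    """List every language-model size / resolution combination."""
    return [
        {
            "language_model": f"Gemma 2 {size}",
            "resolution": res,
            "image_tokens": image_tokens(res),
        }
        for size, res in product(LM_SIZES, RESOLUTIONS)
    ]


if __name__ == "__main__":
    for v in variants():
        print(f"{v['language_model']:>12} @ {v['resolution']}px "
              f"-> {v['image_tokens']} image tokens")
```

Under these assumptions, each doubling of the resolution quadruples the number of image tokens, which is why the higher-resolution variants need noticeably more memory and compute.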
A core improvement of PaliGemma 2 is its ability to generate more detailed image descriptions. The model goes beyond mere object recognition: it can describe actions, emotions, and the context of a scene, capturing not only what is visible but also what is happening and what mood the image conveys. Like other generative AI models, PaliGemma 2 can still hallucinate, that is, describe image elements that do not exist or overlook content that is clearly visible. Google nevertheless emphasizes the progress made over previous models in generating detailed and contextually relevant descriptions.
According to Google's technical report, PaliGemma 2 performs strongly on several specialized tasks, including recognizing chemical formulas, reading musical scores, generating reports from X-ray images, and spatial reasoning. This ability to process and interpret complex visual information opens up applications ranging from medical image analysis to scientific research.
Existing PaliGemma users can upgrade easily, because version 2 is designed as a drop-in replacement. It offers improved performance on most tasks without major code changes, and it can be fine-tuned for specific tasks and datasets. The model and code are available via Hugging Face and Kaggle, and Google provides extensive documentation and example notebooks. PaliGemma 2 is compatible with several frameworks, including Hugging Face Transformers, Keras, PyTorch, JAX, and Gemma.cpp.
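As an illustration of the Hugging Face Transformers route, here is a hedged sketch of captioning a single image. It is not taken from the article. It assumes the checkpoint name `google/paligemma2-3b-pt-224`, the `<image>caption en` prompt convention, and that you have accepted the model license on Hugging Face and are logged in. The helpers `caption_prompt` and `describe_image` are hypothetical names introduced here.

```python
def caption_prompt(lang: str = "en") -> str:
    """Build a captioning prompt (assumed PaliGemma task-prefix format)."""
    return f"<image>caption {lang}"


def describe_image(
    image_path: str,
    model_id: str = "google/paligemma2-3b-pt-224",  # assumed checkpoint name
    max_new_tokens: int = 50,
) -> str:
    """Generate a caption for one local image file."""
    # Heavy dependencies are imported here so the module itself can be
    # imported without them; install with: pip install transformers torch pillow
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=caption_prompt(), images=image, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    # Drop the prompt tokens and decode only the newly generated text.
    return processor.decode(output[0][prompt_len:], skip_special_tokens=True)
```

Calling `describe_image("photo.jpg")` would download the weights on first use. The pretrained checkpoints are intended as a starting point for fine-tuning, so caption quality from this sketch may vary from image to image.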
The release of PaliGemma 2 expands Google's growing Gemma model family, which already includes models for code completion (CodeGemma) and more efficient inference (RecurrentGemma). Adding a capable vision-language model underscores Google's commitment to making AI technologies accessible for a broad range of applications, and the open-source nature of the Gemma family promotes collaboration and innovation within the AI community.