Google releases PaliGemma 2 vision-language models with 3B/10B/28B variants and multiple input resolutions
AI Impact Summary
Google introduces PaliGemma 2, a vision-language model family that pairs the SigLIP image encoder with the Gemma 2 text decoder, expanding from the original single 3B variant to 3B, 10B, and 28B checkpoints. The release supports three input resolutions (224x224, 448x448, 896x896) and includes DOCCI-tuned variants at 3B and 10B, enabling richer captioning and VQA capabilities after downstream fine-tuning. Released under the Gemma license with open repositories, the models can be used via Hugging Face Transformers with PaliGemmaForConditionalGeneration and AutoProcessor, though teams should expect larger resource requirements at higher resolutions and parameter counts, plus licensing considerations when distributing derivatives. This broadens options for vision-language applications but calls for careful benchmarking of resolution-cost tradeoffs and integration work with existing pipelines.
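A minimal sketch of loading a PaliGemma 2 checkpoint through the Transformers classes named above. The checkpoint-id scheme (`google/paligemma2-{size}b-pt-{resolution}`) is an assumption inferred from the size/resolution grid in the release; verify exact repo ids on the Hugging Face Hub before use.

```python
# Sketch: captioning with a PaliGemma 2 variant via Hugging Face Transformers.
# Repo-id naming below is an assumption based on the released 3B/10B/28B x
# 224/448/896 grid; confirm actual ids on the Hub.

def checkpoint_id(params_b: int, resolution: int) -> str:
    """Build a (hypothetical) Hub repo id for a PaliGemma 2 pretrained variant."""
    if params_b not in {3, 10, 28} or resolution not in {224, 448, 896}:
        raise ValueError(f"unsupported variant: {params_b}B @ {resolution}px")
    return f"google/paligemma2-{params_b}b-pt-{resolution}"

def caption_image(image, prompt: str = "caption en",
                  params_b: int = 3, resolution: int = 224) -> str:
    # Imports deferred so the helper above is usable without torch/transformers.
    import torch
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    repo = checkpoint_id(params_b, resolution)
    processor = AutoProcessor.from_pretrained(repo)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        repo, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=64)
    # Drop the prompt tokens, then decode only the newly generated caption.
    generated = out[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(generated, skip_special_tokens=True)
```

Larger variants and higher resolutions raise memory and latency costs roughly in step with parameter count and image-token count, which is why benchmarking across the grid matters.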
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info