Google releases PaliGemma 2 vision-language models with 3B/10B/28B variants and multiple input resolutions
AI Impact Summary
Google introduces PaliGemma 2, a vision-language model family that pairs the SigLIP image encoder with the Gemma 2 text decoder, expanding from the original single 3B variant to 3B, 10B, and 28B checkpoints. The release supports three input resolutions (224x224, 448x448, 896x896) and includes DOCCI-tuned variants at 3B and 10B, enabling richer captioning and VQA capabilities after downstream fine-tuning. Released under the Gemma license with open repositories, the models can be used via Hugging Face Transformers with PaliGemmaForConditionalGeneration and AutoProcessor, though teams should expect larger resource requirements at higher resolutions and parameter counts, plus licensing considerations when distributing derivatives. This broadens options for vision-language applications but calls for careful benchmarking of resolution-cost tradeoffs and integration work with existing pipelines.
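A minimal sketch of loading a PaliGemma 2 checkpoint through the Transformers classes named above. The checkpoint-id scheme (`google/paligemma2-{size}b-pt-{resolution}`) is an assumption inferred from the size/resolution grid in the release; verify exact repo ids on the Hugging Face Hub before use.

```python
# Sketch: captioning with a PaliGemma 2 variant via Hugging Face Transformers.
# Repo-id naming below is an assumption based on the released 3B/10B/28B x
# 224/448/896 grid; confirm actual ids on the Hub.

def checkpoint_id(params_b: int, resolution: int) -> str:
    """Build a (hypothetical) Hub repo id for a PaliGemma 2 pretrained variant."""
    if params_b not in {3, 10, 28} or resolution not in {224, 448, 896}:
        raise ValueError(f"unsupported variant: {params_b}B @ {resolution}px")
    return f"google/paligemma2-{params_b}b-pt-{resolution}"

def caption_image(image, prompt: str = "caption en",
                  params_b: int = 3, resolution: int = 224) -> str:
    # Imports deferred so the helper above is usable without torch/transformers.
    import torch
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    repo = checkpoint_id(params_b, resolution)
    processor = AutoProcessor.from_pretrained(repo)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        repo, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=64)
    # Drop the prompt tokens, then decode only the newly generated caption.
    generated = out[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(generated, skip_special_tokens=True)
```

Larger variants and higher resolutions raise memory and latency costs roughly in step with parameter count and image-token count, which is why benchmarking across the grid matters.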
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info