Google PaliGemma 2 Mix: new vision-language models for OCR, captioning, and VQA (3B/10B/28B)
AI Impact Summary
Google has announced PaliGemma 2 Mix, a family of vision-language models built on the SigLIP vision encoder and the Gemma 2 language model, available in 3B, 10B, and 28B sizes with input resolutions of 224x224, 448x448, and 896x896. The Mix variants are fine-tuned on a mixture of vision-language tasks (OCR, captioning, VQA, and related tasks), so they can be evaluated out of the box and give a quick signal of the downstream performance to expect, without needing to be full end-to-end chat models. For platform teams, this allows faster benchmarking and a more informed choice of model size and resolution for tasks such as document understanding and image-based QA. Larger variants carry higher compute and inference costs.
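To make the size and resolution trade-off concrete, the following is a rough sketch that enumerates the combinations named above and estimates two cost drivers for each: the number of image tokens the language model has to process and the memory taken by the weights alone. It rests on assumptions that are not stated in the announcement: a SigLIP patch size of 14 pixels, 2 bytes per parameter (bfloat16 weights), and the nominal parameter counts of 3, 10, and 28 billion. The function names are illustrative and do not come from any library.

```python
from itertools import product

# Nominal parameter counts in billions (assumption: taken from the model names).
SIZES_B = {"3b": 3, "10b": 10, "28b": 28}
# Input resolutions listed in the summary above.
RESOLUTIONS = (224, 448, 896)
# Assumption: the SigLIP encoder splits the image into 14x14-pixel patches.
PATCH = 14


def image_tokens(resolution, patch=PATCH):
    """Number of image tokens for a square input of the given side length."""
    side = resolution // patch
    return side * side


def weight_gib(params_billion, bytes_per_param=2):
    """Approximate weight memory in GiB (assumption: bfloat16, 2 bytes/param)."""
    return params_billion * 1e9 * bytes_per_param / 2**30


def candidate_grid():
    """One row per (size, resolution) combination with rough cost estimates."""
    rows = []
    for (name, params), res in product(SIZES_B.items(), RESOLUTIONS):
        rows.append(
            {
                "size": name,
                "resolution": res,
                "image_tokens": image_tokens(res),
                "weight_gib": round(weight_gib(params), 1),
            }
        )
    return rows


if __name__ == "__main__":
    for row in candidate_grid():
        print(row)
```

These figures ignore activations, the KV cache, and the vision encoder's own compute, so they are best read as a lower bound for ranking candidates before benchmarking, not as a sizing guide.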
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info