Google PaliGemma 2 Mix: new vision-language models for OCR, captioning, and VQA (3B/10B/28B)

AI Impact Summary

Google announces PaliGemma 2 Mix, a family of vision-language models built on SigLIP and Gemma 2, available in 3B, 10B, and 28B sizes with input resolutions of 224x224, 448x448, and 896x896. The Mix variants are fine-tuned on a mixture of vision-language tasks (OCR, captioning, VQA, and related tasks), so they give a quick signal of how the models perform on those tasks out of the box; they are not intended as general-purpose chat models. For platform teams, this makes it faster to benchmark candidates and choose a model size and resolution for workloads such as document understanding and image-based QA. Larger sizes and higher resolutions incur higher compute and inference costs.
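
As a rough illustration of that size and resolution trade-off, the sketch below enumerates the combinations named in the summary and ranks them by a crude cost proxy. The patch size of 14 is an assumption about the SigLIP encoder, not something stated in the announcement, and `image_tokens`, `candidate_grid`, and `cost_proxy` are hypothetical names introduced here. The proxy (parameters times image tokens) only suggests an order in which to benchmark; it is not a measured cost, and not every pair is necessarily published as a checkpoint.

```python
from itertools import product

SIZES_B = [3, 10, 28]          # parameter counts in billions, from the summary
RESOLUTIONS = [224, 448, 896]  # square input side lengths in pixels, from the summary
PATCH = 14                     # ASSUMPTION: patch size of the SigLIP image encoder


def image_tokens(resolution, patch=PATCH):
    """Patch tokens for a square image, assuming non-overlapping patches."""
    side = resolution // patch
    return side * side


def candidate_grid():
    """Return every size/resolution pair, cheapest first by a crude proxy."""
    rows = []
    for size_b, res in product(SIZES_B, RESOLUTIONS):
        tokens = image_tokens(res)
        rows.append({
            "size_b": size_b,
            "resolution": res,
            "image_tokens": tokens,
            # Hypothetical proxy: billions of parameters times image tokens.
            "cost_proxy": size_b * tokens,
        })
    return sorted(rows, key=lambda r: r["cost_proxy"])


if __name__ == "__main__":
    for row in candidate_grid():
        print(row)
```

Under these assumptions, each doubling of the resolution quadruples the number of image tokens, so resolution can matter as much as model size when estimating cost.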

Affected Systems

PaliGemma 2 Mix, PaliGemma 2

Date: not specified
Change type: capability
Severity: info

