Google PaliGemma vision-language models: SigLIP-So400m encoder with Gemma-2B decoder
AI Impact Summary
Google's PaliGemma pairs a SigLIP-So400m image encoder with the Gemma-2B text decoder to form a new family of vision-language models. The release includes pretrained, mix, and fine-tuned checkpoints at 224x224, 448x448, and 896x896 resolutions, in bfloat16, float16, and float32 precisions, all hosted on Hugging Face and usable through the PaliGemmaForConditionalGeneration and AutoProcessor APIs in Transformers. Capabilities cover image captioning, visual question answering, detection via bounding-box location tokens, referring-expression segmentation, and document understanding. The models are not conversational out of the box and typically require task-specific fine-tuning, and the high-resolution variants are memory-intensive because larger images produce longer input sequences. Access is gated by the Gemma license terms; inference options include 4-bit quantization via bitsandbytes, and the big_vision repository provides a demo path alongside the Transformers integration.
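As a minimal sketch of the Transformers path described above, the following loads a mix checkpoint and runs a task-prefix prompt. The checkpoint ID and image URL are illustrative placeholders, and device_map="auto" assumes the accelerate package is installed:

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Illustrative checkpoint ID; access requires accepting the Gemma license on Hugging Face.
model_id = "google/paligemma-3b-mix-224"

processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # assumes accelerate is installed
)

# Placeholder image URL; substitute any RGB image.
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Mix checkpoints are steered by task-prefix prompts, e.g. "caption en",
# "answer en what is in the image?", or "detect cat" for bounding-box tokens.
prompt = "caption en"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)

# Strip the prompt tokens before decoding the generated continuation.
generated = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(generated, skip_special_tokens=True))
```

For memory-constrained inference, the 4-bit path mentioned above can be sketched with a BitsAndBytesConfig; this assumes the bitsandbytes and accelerate packages are available:

```python
import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

# Quantize weights to 4-bit at load time to reduce GPU memory use.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-224",  # illustrative checkpoint ID
    quantization_config=bnb_config,
    device_map="auto",
)
```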
Affected Systems
- Date: Not specified
- Change type: Capability
- Severity: Info