Google PaliGemma vision-language models released with PT/Mix/FT on HuggingFace Transformers
AI Impact Summary
Google released the PaliGemma family, a set of vision-language models that pairs the SigLIP-So400m image encoder with the Gemma-2B text decoder. Checkpoints are available in pretrained (pt), mix, and fine-tuned (ft) variants, at 224, 448, and 896 input resolutions, and in multiple precisions. The release supports Transformers-based inference via PaliGemmaForConditionalGeneration and AutoProcessor, including 4-bit loading to reduce memory usage, and the Hugging Face Hub repositories cover both the Transformers and JAX implementations. Capabilities span image captioning, visual question answering, detection, referring expression segmentation, and document understanding. Higher-resolution variants require substantially more memory, so teams should plan model selection and deployment carefully, including licensing and gated access on Hugging Face.
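A minimal sketch of the Transformers inference path described above. The checkpoint-id helper and the 4-bit loading via `BitsAndBytesConfig`/bitsandbytes are illustrative assumptions; the checkpoints are gated, so downloading them requires accepting the license on the Hugging Face Hub first.

```python
# Sketch: querying a PaliGemma checkpoint with Transformers, assuming
# transformers, torch, accelerate, and bitsandbytes are installed and
# the gated model license has been accepted on the Hub.

def checkpoint_id(variant: str, resolution: int) -> str:
    """Build a Hub id for a PaliGemma checkpoint (e.g. google/paligemma-3b-mix-224)."""
    assert variant in {"pt", "mix"}, "pretrained or mixed-task checkpoint"
    assert resolution in {224, 448, 896}
    return f"google/paligemma-3b-{variant}-{resolution}"


def run_query(image, prompt: str = "caption en",
              variant: str = "mix", resolution: int = 224) -> str:
    """Load a checkpoint in 4-bit and generate an answer for one image+prompt."""
    import torch
    from transformers import (
        AutoProcessor,
        BitsAndBytesConfig,
        PaliGemmaForConditionalGeneration,
    )

    model_id = checkpoint_id(variant, resolution)
    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit weights
        device_map="auto",
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=32)
    # Decode only the newly generated tokens, skipping the echoed prompt.
    return processor.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```

The same prompt format drives the other tasks mentioned above (e.g. `"detect <object>"` or a question for VQA), with higher resolutions trading memory for detail.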
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info