Google PaliGemma vision-language models released with PT/Mix/FT on HuggingFace Transformers
AI Impact Summary
Google released the PaliGemma family, a set of vision-language models that pairs the SigLIP-So400m image encoder with the Gemma-2B text decoder. Checkpoints are available in pretrained (pt), mix, and fine-tuned (ft) variants, at 224, 448, and 896 input resolutions, and in multiple precisions. The release supports Transformers-based inference via PaliGemmaForConditionalGeneration and AutoProcessor, including 4-bit loading to reduce memory usage, and the Hugging Face Hub repositories cover both the Transformers and JAX implementations. Capabilities span image captioning, visual question answering, detection, referring expression segmentation, and document understanding. Higher-resolution variants require substantially more memory, so teams should plan model selection and deployment carefully, including licensing and gated access on Hugging Face.
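A minimal sketch of the Transformers inference path described above. The checkpoint-id helper and the 4-bit loading via `BitsAndBytesConfig`/bitsandbytes are illustrative assumptions; the checkpoints are gated, so downloading them requires accepting the license on the Hugging Face Hub first.

```python
# Sketch: querying a PaliGemma checkpoint with Transformers, assuming
# transformers, torch, accelerate, and bitsandbytes are installed and
# the gated model license has been accepted on the Hub.

def checkpoint_id(variant: str, resolution: int) -> str:
    """Build a Hub id for a PaliGemma checkpoint (e.g. google/paligemma-3b-mix-224)."""
    assert variant in {"pt", "mix"}, "pretrained or mixed-task checkpoint"
    assert resolution in {224, 448, 896}
    return f"google/paligemma-3b-{variant}-{resolution}"


def run_query(image, prompt: str = "caption en",
              variant: str = "mix", resolution: int = 224) -> str:
    """Load a checkpoint in 4-bit and generate an answer for one image+prompt."""
    import torch
    from transformers import (
        AutoProcessor,
        BitsAndBytesConfig,
        PaliGemmaForConditionalGeneration,
    )

    model_id = checkpoint_id(variant, resolution)
    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit weights
        device_map="auto",
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=32)
    # Decode only the newly generated tokens, skipping the echoed prompt.
    return processor.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```

The same prompt format drives the other tasks mentioned above (e.g. `"detect <object>"` or a question for VQA), with higher resolutions trading memory for detail.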
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info