Salesforce BLIP-2: Zero-Shot Image-to-Text Generation
AI Impact Summary
Salesforce Research has released BLIP-2, a vision-language model that connects pretrained vision models to large language models. It uses a lightweight Querying Transformer (Q-Former) to pass features from a frozen image encoder to a frozen large language model, enabling zero-shot image-to-text generation tasks such as captioning and visual question answering. Because the image encoder and the language model stay frozen, this approach significantly reduces training cost and trainable parameter count compared with end-to-end vision-language pre-training, making multimodal models more accessible.
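As a rough illustration of the two zero-shot tasks mentioned above, the sketch below runs captioning and visual question answering through the Hugging Face `transformers` BLIP-2 classes. It is a minimal sketch rather than Salesforce's reference code: the checkpoint name `Salesforce/blip2-opt-2.7b`, the helper names `build_prompt` and `describe_image`, the "Question: ... Answer:" prompt template, and the 30-token generation limit are all assumptions made for this example.

```python
from typing import Optional

# Assumed checkpoint on the Hugging Face Hub; other BLIP-2 variants exist.
CHECKPOINT = "Salesforce/blip2-opt-2.7b"


def build_prompt(question: Optional[str] = None) -> Optional[str]:
    """Return a VQA-style prompt, or None when plain captioning is wanted."""
    if question is None:
        return None
    # Assumed prompt template for question answering with an OPT-based model.
    return f"Question: {question.strip()} Answer:"


def describe_image(image_path: str, question: Optional[str] = None) -> str:
    """Caption an image, or answer a question about it, without fine-tuning."""
    # Imported lazily so build_prompt works without the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(CHECKPOINT)
    model = Blip2ForConditionalGeneration.from_pretrained(CHECKPOINT)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    prompt = build_prompt(question)
    if prompt is None:
        # Image only: the model generates a caption.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Image plus text prompt: the model continues the prompt with an answer.
        inputs = processor(images=image, text=prompt, return_tensors="pt")

    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

With a local image at a hypothetical path, `describe_image("photo.jpg")` would request a caption, and `describe_image("photo.jpg", "how many cats are there?")` would request an answer. The model weights are downloaded on the first call and are several gigabytes in size.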
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info