Introducing Idefics2: 8B vision-language model with Apache 2.0 license
AI Impact Summary
Idefics2 introduces an 8B vision-language model that ingests text and images and returns text, enabling multimodal use cases such as visual question answering, image description, and multi-image reasoning. It features enhanced OCR, NaViT-style full-resolution image processing with optional sub-image splitting, and is built on Mistral-7B-v0.1 plus siglip-so400m-patch14-384 backbones, trained with The Cauldron instruction-tuning data for multi-turn conversations. The model is released under Apache 2.0 and is readily available via Hugging Face Transformers with weights at HuggingFaceM4/idefics2-8b, simplifying fine-tuning and integration into existing ML pipelines.
Affected Systems
- Date
- Date not specified
- Change type
- capability
- Severity
- info