Vision Language Models capabilities update — April 2025 adds new models and TRL-based fine-tuning
AI Impact Summary
The post details a capabilities expansion for vision-language models, including grounding features, a broader set of model options, and a TRL-based fine-tuning workflow updated in April 2025. It surveys architectures such as LLaVA (a CLIP image encoder, a multimodal projector, and a Vicuna decoder), end-to-end variants such as KOSMOS-2, and decoder-only designs such as Fuyu-8B, which feeds image patches directly into the language model, illustrating the trade-offs between inference cost and training requirements. Engineers can use the April 2025 update and the new TRL release to experiment with models referenced on the Hugging Face Hub (e.g., llava-hf/llava-v1.6-mistral-7b-hf) and compare candidates on evaluation resources such as Vision Arena, the Open VLM Leaderboard, and LMMS-Eval before selecting a model for their use case.
Affected Systems
- Date: April 2025 (exact day not specified)
- Change type: Capability
- Severity: Info