Vision Language Models capabilities update — April 2025 adds new models and TRL-based fine-tuning
AI Impact Summary
The post details a capabilities expansion for vision-language models, including grounding features, a broader set of model options, and a TRL-based fine-tuning workflow updated in April 2025. It surveys architectures such as LLaVA (a CLIP image encoder, a multimodal projector, and a Vicuna decoder), end-to-end variants such as KOSMOS-2, and decoder-only designs such as Fuyu-8B, which feeds image patches directly into the language model, illustrating the trade-offs between inference cost and training requirements. Engineers can use the April 2025 update and the new TRL release to experiment with models referenced on the Hugging Face Hub (e.g., llava-hf/llava-v1.6-mistral-7b-hf) and compare candidates on evaluation resources such as Vision Arena, the Open VLM Leaderboard, and LMMS-Eval before selecting a model for their use case.
Affected Systems
- Date: April 2025 (exact day not specified)
- Change type: Capability
- Severity: Info