TRL expands Vision Language Model alignment with MPO, GRPO, GSPO and Online DPO
AI Impact Summary
TRL introduces Mixed Preference Optimization (MPO) and group-based policies (GRPO, GSPO) for Vision Language Models, expanding beyond pairwise DPO and enabling richer preference signals. It also adds support for REINFORCE Leave-One-Out (RLOO) and Online Direct Preference Optimization (Online DPO), along with native Supervised Fine-Tuning for VLMs and accompanying training scripts and notebooks. This unlocks scalable multimodal alignment for models such as IDEFICS2 and Qwen2.5-VL-3B, where configuration via DPOConfig (loss_type and loss_weights) and DPOTrainer enables more flexible training workflows. Teams should plan to integrate these new configurations into their TRL-based pipelines and validate improvements using the referenced notebooks and examples.
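A minimal sketch of how MPO is configured through DPOConfig, assuming TRL's documented pattern of passing a list of losses plus per-loss weights; the loss names follow the TRL docs, while the output path and weight values are illustrative:

```python
# Hedged sketch: MPO in TRL mixes several preference losses in one DPO run.
from trl import DPOConfig

training_args = DPOConfig(
    output_dir="qwen2.5-vl-3b-mpo",            # hypothetical output path
    loss_type=["sigmoid", "bco_pair", "sft"],  # preference, quality, and generation losses
    loss_weights=[0.8, 0.2, 1.0],              # relative contribution of each loss (illustrative)
    bf16=True,
)
```

The resulting config is then passed to DPOTrainer as usual; the mixed loss is computed as the weighted sum of the listed loss terms.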
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info