TRL expands Vision Language Model alignment with MPO, GRPO, GSPO and Online DPO
AI Impact Summary
TRL introduces Mixed Preference Optimization (MPO) and group-based policies (GRPO, GSPO) for Vision Language Models, expanding beyond pairwise DPO and enabling richer preference signals. It also adds support for REINFORCE Leave-One-Out (RLOO) and Online Direct Preference Optimization (Online DPO), along with native Supervised Fine-Tuning for VLMs and accompanying training scripts and notebooks. This unlocks scalable multimodal alignment for models such as IDEFICS2 and Qwen2.5-VL-3B, where configuration via DPOConfig (loss_type and loss_weights) and DPOTrainer enables more flexible training workflows. Teams should plan to integrate these new configurations into their TRL-based pipelines and validate improvements using the referenced notebooks and examples.
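A minimal sketch of how MPO is configured through DPOConfig, assuming TRL's documented pattern of passing a list of losses plus per-loss weights; the loss names follow the TRL docs, while the output path and weight values are illustrative:

```python
# Hedged sketch: MPO in TRL mixes several preference losses in one DPO run.
from trl import DPOConfig

training_args = DPOConfig(
    output_dir="qwen2.5-vl-3b-mpo",            # hypothetical output path
    loss_type=["sigmoid", "bco_pair", "sft"],  # preference, quality, and generation losses
    loss_weights=[0.8, 0.2, 1.0],              # relative contribution of each loss (illustrative)
    bf16=True,
)
```

The resulting config is then passed to DPOTrainer as usual; the mixed loss is computed as the weighted sum of the listed loss terms.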
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info