Vision Language Model Alignment in TRL — MPO, GRPO, GSPO support for VLMs
AI Impact Summary
TRL adds MPO, GRPO, and GSPO support for Vision Language Model alignment, enabling richer preference signals and better scaling with large VLMs. MPO extends DPO-based workflows with a combined loss (sigmoid, BCO, and SFT terms), while GRPO and GSPO introduce group-based online training modes, alongside RLOO and Online DPO support and accompanying training notebooks. This reduces reliance on simple pairwise preferences and supports models such as IDEFICS2 and Qwen2.5VL-3B in TRL pipelines, potentially delivering higher-quality multimodal outputs. Teams should anticipate updated APIs (DPOConfig, DPOTrainer, GRPOConfig, GRPOTrainer) and plan to revalidate prompts and data pipelines to accommodate the new loss components and group-based updates.
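The combined-loss setup described above can be sketched as a DPOConfig that stacks several loss terms. This is a minimal illustration, not taken from the summary itself: the exact loss-type names (e.g. "bco_pair") and weights are assumptions that should be checked against the TRL documentation for your installed version.

```python
# Hypothetical sketch: MPO-style training in TRL via a combined DPO loss.
# Loss-type names and weights below are assumptions; verify against the
# DPOConfig docs for your TRL release.
from trl import DPOConfig, DPOTrainer

config = DPOConfig(
    output_dir="vlm-mpo",
    # MPO combines several preference losses into one objective:
    # a sigmoid (standard DPO) term, a BCO term, and an SFT term.
    loss_type=["sigmoid", "bco_pair", "sft"],
    loss_weights=[0.8, 0.2, 1.0],  # relative weight of each loss term
    beta=0.1,                      # temperature shared by the DPO-style terms
)

# For a VLM, the trainer takes the model's processor in place of a tokenizer:
# trainer = DPOTrainer(model=model, args=config, train_dataset=dataset,
#                      processing_class=processor)
# trainer.train()
```

Group-based modes follow the same pattern with GRPOConfig and GRPOTrainer in place of the DPO classes.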
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info