TRL adds Direct Preference Optimization (DPO) for Vision-Language Models with LoRA and quantization
AI Impact Summary
TRL now supports Direct Preference Optimization (DPO) for Vision-Language Models, enabling fine-tuning from pairwise preference comparisons rather than fixed supervised labels. This approach reduces labeling costs and improves alignment with nuanced human judgments on multimodal tasks. The guidance includes a concrete data format built on a binary-choice dataset (a chosen response paired with a rejected one) and shows how to adapt models such as Idefics2-8b with the TRL pipeline, including format conversion and processor setup. It also highlights memory-lean techniques, quantization and LoRA (via PEFT), that shrink the trainable parameter count from billions to tens of millions, making training feasible on high-end GPUs; the article notes compatibility with other VLMs such as Llava 1.5 and PaliGemma. Both steps, the data formatting and the quantized LoRA training setup, are sketched below.
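A minimal sketch of the format conversion and processor setup, assuming openbmb/RLAIF-V-Dataset as the binary-choice source and its column names (question, chosen, rejected, image); any other preference dataset with the same shape would convert the same way:

```python
from datasets import load_dataset
from transformers import AutoProcessor

# Processor for Idefics2-8b; disabling image splitting keeps memory usage down
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b", do_image_splitting=False
)

# Example binary-choice dataset: each row pairs a chosen and a rejected answer
dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train")

def to_preference_format(example):
    """Rewrite one row into the images/prompt/chosen/rejected layout
    that TRL's DPOTrainer expects for VLMs."""
    prompt = [{"role": "user",
               "content": [{"type": "image"},
                           {"type": "text", "text": example["question"]}]}]
    chosen = [{"role": "assistant",
               "content": [{"type": "text", "text": example["chosen"]}]}]
    rejected = [{"role": "assistant",
                 "content": [{"type": "text", "text": example["rejected"]}]}]
    # Downscale the image in place to half the processor's longest edge
    max_size = processor.image_processor.size["longest_edge"] // 2
    example["image"].thumbnail((max_size, max_size))
    return {
        "images": [example["image"]],
        "prompt": processor.apply_chat_template(prompt, tokenize=False),
        "chosen": processor.apply_chat_template(chosen, tokenize=False),
        "rejected": processor.apply_chat_template(rejected, tokenize=False),
    }

dataset = dataset.map(to_preference_format, remove_columns=dataset.column_names)
```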
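Continuing from that sketch, the memory-lean training setup (4-bit quantization plus a LoRA adapter from PEFT) could be wired to TRL's DPOTrainer roughly as below. The hyperparameter values are illustrative, and in newer TRL releases the processor is passed as processing_class rather than tokenizer:

```python
import torch
from peft import LoraConfig
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

# Load the policy model in 4-bit NF4 so it fits on a single high-end GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)

# LoRA on all linear layers: trainable parameters drop from billions
# to tens of millions
peft_config = LoraConfig(target_modules="all-linear")

training_args = DPOConfig(
    output_dir="idefics2-8b-dpo",   # illustrative settings throughout
    per_device_train_batch_size=2,
    gradient_accumulation_steps=32,
    gradient_checkpointing=True,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model,
    ref_model=None,          # with a PEFT adapter, the frozen base model
                             # serves as the implicit reference model
    args=training_args,
    train_dataset=dataset,   # the formatted dataset from the sketch above
    tokenizer=processor,     # processing_class=processor in newer TRL
    peft_config=peft_config,
)
trainer.train()
```

Given the compatibility note above, adapting this setup to Llava 1.5 or PaliGemma should amount largely to swapping the checkpoint name.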
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info