TRL adds Direct Preference Optimization (DPO) for Vision-Language Models with LoRA and quantization
AI Impact Summary
TRL now supports Direct Preference Optimization (DPO) for Vision-Language Models, enabling fine-tuning from pairwise preference comparisons rather than fixed supervised labels. This approach reduces labeling costs and improves alignment with nuanced human judgments on multimodal tasks. The guidance includes a concrete data format built on a binary-choice dataset (a chosen response paired with a rejected one) and shows how to adapt models such as Idefics2-8b with the TRL pipeline, including format conversion and processor setup. It also highlights memory-lean techniques, quantization and LoRA (via PEFT), that shrink the trainable parameter count from billions to tens of millions, making training feasible on high-end GPUs; the article notes compatibility with other VLMs such as Llava 1.5 and PaliGemma. Both steps, the data formatting and the quantized LoRA training setup, are sketched below.
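A minimal sketch of the format conversion and processor setup, assuming openbmb/RLAIF-V-Dataset as the binary-choice source and its column names (question, chosen, rejected, image); any other preference dataset with the same shape would convert the same way:

```python
from datasets import load_dataset
from transformers import AutoProcessor

# Processor for Idefics2-8b; disabling image splitting keeps memory usage down
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b", do_image_splitting=False
)

# Example binary-choice dataset: each row pairs a chosen and a rejected answer
dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train")

def to_preference_format(example):
    """Rewrite one row into the images/prompt/chosen/rejected layout
    that TRL's DPOTrainer expects for VLMs."""
    prompt = [{"role": "user",
               "content": [{"type": "image"},
                           {"type": "text", "text": example["question"]}]}]
    chosen = [{"role": "assistant",
               "content": [{"type": "text", "text": example["chosen"]}]}]
    rejected = [{"role": "assistant",
                 "content": [{"type": "text", "text": example["rejected"]}]}]
    # Downscale the image in place to half the processor's longest edge
    max_size = processor.image_processor.size["longest_edge"] // 2
    example["image"].thumbnail((max_size, max_size))
    return {
        "images": [example["image"]],
        "prompt": processor.apply_chat_template(prompt, tokenize=False),
        "chosen": processor.apply_chat_template(chosen, tokenize=False),
        "rejected": processor.apply_chat_template(rejected, tokenize=False),
    }

dataset = dataset.map(to_preference_format, remove_columns=dataset.column_names)
```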
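Continuing from that sketch, the memory-lean training setup (4-bit quantization plus a LoRA adapter from PEFT) could be wired to TRL's DPOTrainer roughly as below. The hyperparameter values are illustrative, and in newer TRL releases the processor is passed as processing_class rather than tokenizer:

```python
import torch
from peft import LoraConfig
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

# Load the policy model in 4-bit NF4 so it fits on a single high-end GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)

# LoRA on all linear layers: trainable parameters drop from billions
# to tens of millions
peft_config = LoraConfig(target_modules="all-linear")

training_args = DPOConfig(
    output_dir="idefics2-8b-dpo",   # illustrative settings throughout
    per_device_train_batch_size=2,
    gradient_accumulation_steps=32,
    gradient_checkpointing=True,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model,
    ref_model=None,          # with a PEFT adapter, the frozen base model
                             # serves as the implicit reference model
    args=training_args,
    train_dataset=dataset,   # the formatted dataset from the sketch above
    tokenizer=processor,     # processing_class=processor in newer TRL
    peft_config=peft_config,
)
trainer.train()
```

Given the compatibility note above, adapting this setup to Llava 1.5 or PaliGemma should amount largely to swapping the checkpoint name.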
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info