TRL fix: Averaging log-likelihoods in the IPO loss brings IPO in line with DPO on 7B models
AI Impact Summary
TRL's IPO loss implementation was incorrect: it summed the log-likelihoods of each completion instead of averaging them. The PR fixes this by averaging the log-likelihoods, which restores fidelity to the IPO paper. With the fix, IPO performs on par with DPO and outperforms KTO in the paired-preference setting. The reported experiments used OpenHermes-2.5-Mistral-7B and Zephyr-7b-beta-sft as base models, trained on the orca_dpo_pairs and ultrafeedback-binarized datasets and evaluated with MT-Bench. IPO matches DPO in these results once hyperparameters such as beta are tuned.
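The difference between summing and averaging can be illustrated with a minimal sketch in plain Python. This is not TRL's code: the function names `sequence_logp` and `ipo_loss`, their parameters, and the toy numbers are hypothetical, and the sketch assumes the squared-error form of the IPO objective with target 1/(2 * beta) applied to a single preference pair.

```python
from typing import Sequence


def sequence_logp(token_logps: Sequence[float], average: bool = True) -> float:
    """Collapse per-token log-probabilities into one score per completion.

    average=True is the corrected behaviour (mean over tokens);
    average=False reproduces the earlier behaviour (sum over tokens).
    """
    if not token_logps:
        raise ValueError("completion must contain at least one token")
    total = sum(token_logps)
    return total / len(token_logps) if average else total


def ipo_loss(
    policy_chosen: Sequence[float],
    policy_rejected: Sequence[float],
    ref_chosen: Sequence[float],
    ref_rejected: Sequence[float],
    beta: float = 0.1,
    average: bool = True,
) -> float:
    """Hypothetical single-pair IPO loss: (margin - 1 / (2 * beta)) ** 2."""
    # Log-ratio of policy to reference model for each completion.
    chosen_ratio = sequence_logp(policy_chosen, average) - sequence_logp(ref_chosen, average)
    rejected_ratio = sequence_logp(policy_rejected, average) - sequence_logp(ref_rejected, average)
    margin = chosen_ratio - rejected_ratio
    # IPO regresses the margin toward a fixed target instead of pushing it unboundedly.
    return (margin - 1.0 / (2.0 * beta)) ** 2


# Toy per-token log-probabilities: a 2-token chosen and a 4-token rejected completion.
policy_chosen, ref_chosen = [-1.0, -1.0], [-2.0, -2.0]
policy_rejected, ref_rejected = [-3.0] * 4, [-2.0] * 4

averaged = ipo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.25)
summed = ipo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.25, average=False)
print(averaged, summed)  # 0.0 16.0
```

In this toy case the averaged margin sits exactly on the target 1/(2 * beta) = 2, so the loss is zero, while the summed margin grows with completion length and is penalized. That length dependence is one plausible reading of why the summed variant behaved poorly.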
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info