Fine-tune Llama 2 with DPO via TRL — bypass RLHF reward modeling
AI Impact Summary
Direct Preference Optimization (DPO) fine-tunes Llama 2 with a direct likelihood objective on preference data, bypassing the RLHF reward-model and RL-optimization steps. The workflow uses TRL's DPOTrainer with a base model and a separate reference model (loaded via PEFT's AutoPeftModelForCausalLM), trained on a dataset formatted as prompt–chosen–rejected triplets (e.g., stack-exchange-paired) with 4-bit QLoRA via BitsAndBytesConfig. An SFT step (SFTTrainer with LoRA adapters) precedes DPO training, and a beta hyperparameter (typically 0.1–0.5) controls how strongly the trained policy is kept close to the reference model. This offers a simpler, potentially faster path to aligned Llama 2 behavior, but it requires careful data preparation and the TRL/PEFT stack.
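As an illustration of the data-preparation step, the sketch below maps stack-exchange-paired rows into the prompt/chosen/rejected columns that DPOTrainer consumes. The column names (question, response_j, response_k) and the data_dir split are assumptions about that dataset's layout rather than details confirmed by this summary.

```python
# Sketch: turn stack-exchange-paired rows into DPO preference triplets.
# Assumed columns: "question", "response_j" (preferred), "response_k" (rejected).
from datasets import load_dataset

def return_prompt_and_responses(samples):
    """Map a batch of rows to the prompt/chosen/rejected format used by DPOTrainer."""
    return {
        "prompt": [
            "Question: " + question + "\n\nAnswer: "
            for question in samples["question"]
        ],
        "chosen": samples["response_j"],    # answer rated better
        "rejected": samples["response_k"],  # answer rated worse
    }

# data_dir="data/rl" is an assumed split name for the preference-training portion.
dataset = load_dataset(
    "lvwerra/stack-exchange-paired",
    split="train",
    data_dir="data/rl",
)
dataset = dataset.map(
    return_prompt_and_responses,
    batched=True,
    remove_columns=dataset.column_names,  # keep only prompt/chosen/rejected
)
```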
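The SFT warm-up could look roughly like the following: the base Llama 2 checkpoint loaded in 4-bit through BitsAndBytesConfig and fine-tuned with LoRA adapters via SFTTrainer. The model name, hyperparameters, output paths, and text-formatting function are illustrative assumptions, and the keyword arguments shown (dataset_text_field and max_seq_length on the trainer) match older TRL releases; newer ones move them into SFTConfig.

```python
# Sketch: QLoRA supervised fine-tuning (SFT) step before DPO.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

# 4-bit NF4 quantization for QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model_name = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapters trained on top of the frozen 4-bit weights.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

# Assumed SFT split of the same dataset; each row becomes one "Question/Answer" string.
sft_dataset = load_dataset(
    "lvwerra/stack-exchange-paired",
    split="train",
    data_dir="data/finetune",
)
sft_dataset = sft_dataset.map(
    lambda row: {
        "text": "Question: " + row["question"] + "\n\nAnswer: " + row["response_j"]
    }
)

trainer = SFTTrainer(
    model=model,
    train_dataset=sft_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="llama-2-7b-se-sft",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,
        learning_rate=1e-4,
        max_steps=500,
        logging_steps=10,
    ),
)
trainer.train()
trainer.save_model("llama-2-7b-se-sft/final_checkpoint")  # adapters reloaded for DPO
```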
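The DPO step itself might then be wired up as in the sketch below: the SFT adapter checkpoint reloaded twice through AutoPeftModelForCausalLM, once as the trainable policy and once as the frozen reference, then handed to DPOTrainer together with the triplet dataset from the first sketch. The checkpoint path and hyperparameters are placeholders, and the constructor signature shown (beta, tokenizer, and peft_config passed directly) reflects older TRL releases; newer versions move beta into DPOConfig.

```python
# Sketch: DPO training on top of the SFT checkpoint, with a separate reference model.
import torch
from peft import AutoPeftModelForCausalLM, LoraConfig
from transformers import AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import DPOTrainer

sft_checkpoint = "llama-2-7b-se-sft/final_checkpoint"  # hypothetical SFT output path
base_model_name = "meta-llama/Llama-2-7b-hf"           # assumed base checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Trainable policy: the SFT adapters on top of the 4-bit base weights.
model = AutoPeftModelForCausalLM.from_pretrained(
    sft_checkpoint,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
)
model.config.use_cache = False

# Frozen reference: a second copy of the same SFT checkpoint.
model_ref = AutoPeftModelForCausalLM.from_pretrained(
    sft_checkpoint,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token

training_args = TrainingArguments(
    output_dir="llama-2-7b-se-dpo",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-4,
    max_steps=1000,
    logging_steps=10,
    remove_unused_columns=False,  # DPOTrainer needs the prompt/chosen/rejected columns
)

# Fresh LoRA adapters for the DPO phase.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

dpo_trainer = DPOTrainer(
    model,
    model_ref,
    args=training_args,
    beta=0.1,               # how strongly the policy is kept close to the reference
    train_dataset=dataset,  # prompt/chosen/rejected triplets from the first sketch
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_prompt_length=512,
    max_length=1024,
)
dpo_trainer.train()
dpo_trainer.save_model("llama-2-7b-se-dpo/final_checkpoint")
```

Lower beta values let the policy drift further from the reference model in exchange for fitting the preferences more aggressively; values toward 0.5 keep the updated model more conservative.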
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium