Hugging Face Transformers: Fix gradient accumulation loss calculation in Trainer
AI Impact Summary
With gradient accumulation enabled, the Transformers Trainer can report and optimize a loss that diverges from full-batch training, because the default loss computation normalizes each micro-batch independently rather than over the whole accumulation step. The fix enforces proper aggregation: the loss is normalized over the total number of non-padding tokens in an accumulation step, with user-supplied losses supported via PreTrainedModel.loss_function and the LOSS_MAPPING registry, so custom loss logic can be injected while the optimization signal stays consistent. Once these changes ship on main and in upstream releases, teams can upgrade to obtain correct loss reporting without reworking their training loops.
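As a rough illustration of the aggregation the fix targets, the sketch below contrasts averaging each micro-batch's mean loss with normalizing once over all non-padding tokens in the accumulation step. It is not the Trainer's code: the helper names, the assumption that `model` returns raw logits, the `(input_ids, labels)` micro-batch pairs, and the use of `-100` as the ignored label are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def accumulated_loss_naive(micro_batches, model):
    # Averages each micro-batch's mean loss. When micro-batches contain
    # different numbers of non-padding tokens, this weights tokens unevenly
    # and diverges from what a single full batch would compute.
    losses = []
    for input_ids, labels in micro_batches:
        logits = model(input_ids)
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            ignore_index=-100,  # mean taken over this micro-batch only
        )
        losses.append(loss)
    return torch.stack(losses).mean()


def accumulated_loss_token_normalized(micro_batches, model):
    # Sums the loss over every non-padding token, then divides once by the
    # total token count for the whole accumulation step, matching the loss
    # a single full batch would produce.
    num_items_in_batch = sum(
        int((labels != -100).sum()) for _, labels in micro_batches
    )
    total_loss = torch.zeros(())
    for input_ids, labels in micro_batches:
        logits = model(input_ids)
        total_loss = total_loss + F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            ignore_index=-100,
            reduction="sum",  # defer normalization to the step level
        )
    return total_loss / num_items_in_batch
```

The key design point mirrored here is that normalization happens once per accumulation step rather than once per micro-batch, which is why the fix threads the step-level token count through to the loss function.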
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info