Hugging Face shows prompt-format variance in MMLU; structured generation via Outlines improves consistency
AI Impact Summary
Evaluation results are highly sensitive to prompt formatting: across eight prompt formats for MMLU, model accuracy varied by roughly 10 points, enough to change model rankings. The work shows that constraining output structure (structured generation) with tools like Outlines can stabilize results, reducing dependence on prompt wording. One caveat: JSON-formatted prompts improved performance for most models but caused a notable dip for MetaMath-Tulpar-7b-v2-Slerp, suggesting model-specific interactions with formatting. The implication is that benchmarking pipelines should consider structured-generation approaches to achieve fairer cross-model comparisons.
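The core idea behind structured generation can be sketched without loading a model. The snippet below is a minimal illustration, not the Outlines implementation (Outlines compiles a regex or grammar into a token-level finite-state machine that guides decoding): at each step, logits for tokens outside the allowed set are masked out, so a multiple-choice answer like MMLU's A/B/C/D is always parseable regardless of how the prompt was phrased. The toy vocabulary and logit values here are invented for illustration.

```python
# Minimal sketch of constrained (structured) decoding for a multiple-choice
# benchmark like MMLU. Not the actual Outlines machinery -- just the core
# idea: mask the model's logits so only valid answer tokens can be emitted.

import math

def constrained_choice(logits: dict[str, float], allowed: list[str]) -> str:
    """Pick the highest-logit token among the allowed answer labels."""
    # Mask every token that is not a valid choice (logit -> -inf).
    masked = {tok: (logit if tok in allowed else -math.inf)
              for tok, logit in logits.items()}
    return max(masked, key=masked.get)

# Hypothetical logits for one decoding step: unconstrained, the model would
# emit "The" and start a free-form sentence that an exact-match scorer
# might fail to parse; constrained, it must commit to a choice label.
logits = {"A": 1.2, "B": 2.7, "C": 0.4, "D": -0.3, "The": 3.1, "Answer": 2.9}
print(constrained_choice(logits, ["A", "B", "C", "D"]))  # "B"
```

Because every model is forced through the same output space, score differences reflect the models' knowledge rather than how well each one happens to match a particular prompt format.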
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info