Hugging Face shows prompt-format variance in MMLU; structured generation via Outlines improves consistency
AI Impact Summary
Evaluation results are highly sensitive to prompt formatting: across eight prompt formats for MMLU, model accuracy varied by roughly 10 points, enough to change model rankings. The work shows that constraining output structure (structured generation) with tools like Outlines can stabilize results, reducing dependence on prompt wording. One caveat: JSON-formatted prompts improved performance for most models but caused a notable dip for MetaMath-Tulpar-7b-v2-Slerp, suggesting model-specific interactions with formatting. The implication is that benchmarking pipelines should consider structured-generation approaches to achieve fairer cross-model comparisons.
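The core idea behind structured generation can be sketched without loading a model. The snippet below is a minimal illustration, not the Outlines implementation (Outlines compiles a regex or grammar into a token-level finite-state machine that guides decoding): at each step, logits for tokens outside the allowed set are masked out, so a multiple-choice answer like MMLU's A/B/C/D is always parseable regardless of how the prompt was phrased. The toy vocabulary and logit values here are invented for illustration.

```python
# Minimal sketch of constrained (structured) decoding for a multiple-choice
# benchmark like MMLU. Not the actual Outlines machinery -- just the core
# idea: mask the model's logits so only valid answer tokens can be emitted.

import math

def constrained_choice(logits: dict[str, float], allowed: list[str]) -> str:
    """Pick the highest-logit token among the allowed answer labels."""
    # Mask every token that is not a valid choice (logit -> -inf).
    masked = {tok: (logit if tok in allowed else -math.inf)
              for tok, logit in logits.items()}
    return max(masked, key=masked.get)

# Hypothetical logits for one decoding step: unconstrained, the model would
# emit "The" and start a free-form sentence that an exact-match scorer
# might fail to parse; constrained, it must commit to a choice label.
logits = {"A": 1.2, "B": 2.7, "C": 0.4, "D": -0.3, "The": 3.1, "Answer": 2.9}
print(constrained_choice(logits, ["A", "B", "C", "D"]))  # "B"
```

Because every model is forced through the same output space, score differences reflect the models' knowledge rather than how well each one happens to match a particular prompt format.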
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info