Hallucinations Leaderboard: Open benchmark for LLM factuality using EleutherAI Harness and Hugging Face Leaderboard Template
AI Impact Summary
The Hallucinations Leaderboard is an open, ongoing effort to benchmark LLMs for factuality and faithfulness using a broad task suite (NQ Open, TriviaQA, TruthfulQA, XSum, CNN/DM, RACE, SQuADv2, MemoTrap, IFEval, FaithDial, True-False, HaluEval, SelfCheckGPT). It relies on the EleutherAI Language Model Evaluation Harness for zero-shot and few-shot evaluation and a fork of the Hugging Face Leaderboard Template for results tracking, with experiments run on the Edinburgh International Data Facility and NVIDIA A100 GPUs. For a technical team, this provides a transparent, reproducible baseline to compare hallucination risk across open-source models and inform procurement or migration decisions. However, benchmark scores may not fully reflect production distributions, so corroborate results with production-like evaluations and consider aligning prompts and task selections to your specific use case.
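Open-domain QA tasks in the suite, such as NQ Open and TriviaQA, are conventionally scored by normalized exact match against a set of gold answers. A minimal sketch of that scoring convention is below; the function names and normalization steps are illustrative, not the leaderboard's exact implementation.

```python
import re
import string


def normalize(text: str) -> str:
    # Lowercase, drop punctuation and English articles, collapse whitespace --
    # the usual normalization applied before exact-match comparison.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    # A prediction counts as correct if it equals any gold answer
    # after normalization.
    pred = normalize(prediction)
    return any(pred == normalize(answer) for answer in gold_answers)


print(exact_match("The Eiffel Tower.", ["eiffel tower"]))   # True
print(exact_match("Big Ben", ["eiffel tower"]))             # False
```

Aggregate accuracy is then the fraction of questions whose prediction exact-matches a gold answer; production evaluations may need a looser match (e.g. answer containment) depending on the prompt format.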
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info