Hallucinations Leaderboard: Open benchmark for LLM factuality using EleutherAI Harness and Hugging Face Leaderboard Template
AI Impact Summary
The Hallucinations Leaderboard is an open, ongoing effort to benchmark LLMs for factuality and faithfulness using a broad task suite (NQ Open, TriviaQA, TruthfulQA, XSum, CNN/DM, RACE, SQuADv2, MemoTrap, IFEval, FaithDial, True-False, HaluEval, SelfCheckGPT). It relies on the EleutherAI Language Model Evaluation Harness for zero-shot and few-shot evaluation and a fork of the Hugging Face Leaderboard Template for results tracking, with experiments run on the Edinburgh International Data Facility and NVIDIA A100 GPUs. For a technical team, this provides a transparent, reproducible baseline to compare hallucination risk across open-source models and inform procurement or migration decisions. However, benchmark scores may not fully reflect production distributions, so corroborate results with production-like evaluations and consider aligning prompts and task selections to your specific use case.
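Open-domain QA tasks in the suite, such as NQ Open and TriviaQA, are conventionally scored by normalized exact match against a set of gold answers. A minimal sketch of that scoring convention is below; the function names and normalization steps are illustrative, not the leaderboard's exact implementation.

```python
import re
import string


def normalize(text: str) -> str:
    # Lowercase, drop punctuation and English articles, collapse whitespace --
    # the usual normalization applied before exact-match comparison.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    # A prediction counts as correct if it equals any gold answer
    # after normalization.
    pred = normalize(prediction)
    return any(pred == normalize(answer) for answer in gold_answers)


print(exact_match("The Eiffel Tower.", ["eiffel tower"]))   # True
print(exact_match("Big Ben", ["eiffel tower"]))             # False
```

Aggregate accuracy is then the fraction of questions whose prediction exact-matches a gold answer; production evaluations may need a looser match (e.g. answer containment) depending on the prompt format.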
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info