BenCzechMark: Czech-language LLM Benchmark and Leaderboard
AI Impact Summary
BenCzechMark is introduced as the first comprehensive evaluation suite for Czech-language LLMs, featuring 50 tasks across 9 categories and a leaderboard covering 25+ open-source models. It evaluates models with multiple metrics (Accuracy, Exact Match, AUROC, Perplexity) and aggregates them through a duel-based scoring approach that computes model win-rates, enabling fair cross-model comparison without relying on fixed thresholds. For technical teams, this provides a standardized baseline for gauging Czech-language capability across grammar, factual knowledge, reading comprehension, NER, sentiment, and more, informing model selection, calibration strategies, and targeted improvements for Czech-domain applications.
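To illustrate the duel-based idea, here is a minimal sketch of one plausible pairwise win-rate scheme: every model duels every other model on every task, and a model's score is the fraction of duels it wins. This is an illustrative simplification, not BenCzechMark's exact aggregation procedure; the function name and tie-handling rule are assumptions.

```python
from itertools import combinations

def duel_win_rates(scores):
    """Illustrative pairwise win-rate aggregation (not the official method).

    scores: dict mapping model name -> dict of task -> metric value,
    where higher is better. A model wins a duel on a task if its metric
    is strictly higher; ties award half a win to each side.
    """
    models = list(scores)
    n_tasks = len(next(iter(scores.values())))
    wins = {m: 0.0 for m in models}
    for a, b in combinations(models, 2):
        for task in scores[a]:
            sa, sb = scores[a][task], scores[b][task]
            if sa > sb:
                wins[a] += 1
            elif sb > sa:
                wins[b] += 1
            else:  # tie: split the point
                wins[a] += 0.5
                wins[b] += 0.5
    # each model fights (len(models) - 1) opponents on n_tasks tasks
    total = (len(models) - 1) * n_tasks
    return {m: wins[m] / total for m in models}
```

Because scores are relative win fractions rather than raw metric values, models evaluated with different metrics (e.g. Accuracy vs. Perplexity-derived scores) become comparable on a common scale.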
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info