BenCzechMark: Comprehensive Czech-language LLM evaluation suite and leaderboard
AI Impact Summary
BenCzechMark is a Czech-language evaluation suite of 50 tasks in 9 categories, including reading comprehension, named entity recognition (NER), factual knowledge, sentiment analysis, and math reasoning. Tasks are scored with several metrics: accuracy (Acc), exact match (EM), AUROC, and perplexity (Ppl). A model-duel framework, the Duel Win Score (DWS), combines these per-task results into a single cross-model ranking. The leaderboard covers open-source models such as Llama-405B and Aya-23-35B, giving concrete data on Czech-language capabilities and on gaps in cross-lingual transfer. Technical teams can use the per-category results to see where a model's Czech understanding and domain knowledge are strong or weak, which informs model selection and calibration for Czech-language products.
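The idea of ranking models by duels can be illustrated with a minimal sketch. This is not BenCzechMark's actual DWS implementation: it assumes a duel is won whenever one model's score on a task is strictly higher than the other's, and that every metric is oriented so that higher is better (a metric like perplexity would need to be negated first). The function name `duel_win_scores`, the model names, and the numbers are all hypothetical.

```python
def duel_win_scores(scores):
    """Return, for each model, the fraction of pairwise task duels it wins.

    scores: {model_name: {task_name: metric_value}}, higher is better.
    Assumption: a duel is won on a strictly higher score; the real
    leaderboard may use a different rule to decide wins.
    """
    result = {}
    for model, own in scores.items():
        wins = 0
        duels = 0
        for other, theirs in scores.items():
            if other == model:
                continue
            # Only compare on tasks both models were evaluated on.
            for task in own:
                if task not in theirs:
                    continue
                duels += 1
                if own[task] > theirs[task]:
                    wins += 1
        result[model] = wins / duels if duels else 0.0
    return result


# Hypothetical scores, for illustration only.
example = {
    "model_a": {"ner": 0.80, "sentiment": 0.70},
    "model_b": {"ner": 0.75, "sentiment": 0.68},
    "model_c": {"ner": 0.60, "sentiment": 0.65},
}

dws = duel_win_scores(example)
# Rank models from highest to lowest win fraction.
ranking = sorted(dws.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

Aggregating wins instead of averaging raw scores is one way to put tasks with different metrics on a common scale, since each duel only asks which model did better on that task.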
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info