Open ASR Leaderboard adds multilingual and long-form tracks — trends and implications
AI Impact Summary
The Open ASR Leaderboard has expanded to include multilingual and long-form transcription tracks, broadening the evaluation surface beyond short English clips. This matters for teams selecting models for meetings, podcasts, and other long-form content, since throughput and language coverage are now explicit performance factors. The highest accuracy currently comes from Conformer encoders paired with LLM decoders (examples: NVIDIA Canary-Qwen-2.5B, IBM Granite-Speech-3.3-8B, Microsoft Phi-4-Multimodal-Instruct), while real-time or high-volume offline use may favor faster decoders (CTC/TDT models such as NVIDIA Parakeet, or optimized Whisper variants), which offer roughly 10–100x the throughput at a modest WER penalty. In multilingual scenarios, Whisper Large v3 remains a strong baseline, though fine-tuned variants and dedicated multilingual models show varying tradeoffs; long-form results still favor closed-source systems, creating a clear need for targeted benchmarking and deliberate architecture choices for enterprise workloads.
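The accuracy and throughput tradeoffs above are typically reported as word error rate (WER) and inverse real-time factor (RTFx, audio duration divided by processing time). A minimal sketch of both metrics follows; the function names are illustrative, not taken from the leaderboard codebase, and production evaluation would also apply text normalization before scoring:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over word tokens, single-row dynamic programming.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion of a reference word
                       d[j - 1] + 1,      # insertion of a hypothesis word
                       prev + (r != h))   # substitution (free when words match)
            prev = cur
    return d[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: values above 1 mean faster than real time."""
    return audio_seconds / processing_seconds
```

For example, a transcript with one substituted word out of two scores a WER of 0.5, and a model that transcribes 60 seconds of audio in 6 seconds has an RTFx of 10, which is the kind of gap the 10–100x throughput comparison refers to.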
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info