Open ASR Leaderboard adds multilingual and long-form tracks — trends and implications
AI Impact Summary
The Open ASR Leaderboard has expanded to include multilingual and long-form transcription tracks, broadening the evaluation surface beyond short English clips. This matters for teams selecting models for meetings, podcasts, and other long-form content, since throughput and language coverage are now explicit performance factors. The highest accuracy currently comes from Conformer encoders paired with LLM decoders (examples: NVIDIA Canary-Qwen-2.5B, IBM Granite-Speech-3.3-8B, Microsoft Phi-4-Multimodal-Instruct), while real-time or high-volume offline use may favor faster decoders (CTC/TDT models such as NVIDIA Parakeet, or optimized Whisper variants), which offer roughly 10–100x the throughput at a modest WER penalty. In multilingual scenarios, Whisper Large v3 remains a strong baseline, though fine-tuned variants and dedicated multilingual models show varying tradeoffs; long-form results still favor closed-source systems, creating a clear need for targeted benchmarking and deliberate architecture choices for enterprise workloads.
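The accuracy and throughput tradeoffs above are typically reported as word error rate (WER) and inverse real-time factor (RTFx, audio duration divided by processing time). A minimal sketch of both metrics follows; the function names are illustrative, not taken from the leaderboard codebase, and production evaluation would also apply text normalization before scoring:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over word tokens, single-row dynamic programming.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion of a reference word
                       d[j - 1] + 1,      # insertion of a hypothesis word
                       prev + (r != h))   # substitution (free when words match)
            prev = cur
    return d[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: values above 1 mean faster than real time."""
    return audio_seconds / processing_seconds
```

For example, a transcript with one substituted word out of two scores a WER of 0.5, and a model that transcribes 60 seconds of audio in 6 seconds has an RTFx of 10, which is the kind of gap the 10–100x throughput comparison refers to.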
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info