Big Bench Audio evaluates audio reasoning for GPT-4o and Gemini 1.5: GPT-4o scores 66% speech-to-speech vs 92% on text
AI Impact Summary
Artificial Analysis has released Big Bench Audio, a benchmark that measures audio-based reasoning by converting Big Bench Hard questions into audio across four categories. The results show a notable speech reasoning gap: GPT-4o scores 92% on the text-only version of the tasks but only 66% in speech-to-speech mode, a difference of 26 percentage points, indicating that voice-based reasoning lags even for top models. A traditional pipeline that chains Whisper transcription, GPT-4o text reasoning, and TTS-1 speech synthesis delivers the strongest performance among the audio configurations, though it still trails pure text. This underscores the need for model improvements or hybrid workflows when reasoning quality is critical.
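The pipeline configuration described above can be sketched as three chained stages. This is a minimal illustration, not Artificial Analysis's evaluation harness: `run_pipeline` and the three `fake_*` functions are hypothetical names, and the stand-in stages only pass text through so the example runs offline. In a real setup each stage would call the corresponding speech-to-text, language, and text-to-speech model.

```python
from typing import Callable, Tuple


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    reason: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Chain speech-to-text, text reasoning, and text-to-speech."""
    question = transcribe(audio)   # stage 1: audio question -> text
    answer = reason(question)      # stage 2: reasoning happens in text
    return answer, synthesize(answer)  # stage 3: text answer -> audio


# Hypothetical stand-ins (assumptions for illustration only).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_reason(question: str) -> str:
    return "answer to: " + question


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


answer_text, answer_audio = run_pipeline(
    b"Is the sequence valid?", fake_transcribe, fake_reason, fake_synthesize
)
```

Because the reasoning stage only ever sees text, a pipeline like this can inherit most of the text model's reasoning quality, which is consistent with it outperforming the other audio configurations in the summary above.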
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info