Evaluating Audio Reasoning with Big Bench Audio — GPT-4o shows a 26% accuracy gap
AI Impact Summary
Artificial Analysis has released Big Bench Audio, a new dataset designed to evaluate the audio reasoning capabilities of large language models such as GPT-4o and Gemini 1.5. The dataset adapts challenging questions from Big Bench Hard into an audio format and reveals a significant performance gap: GPT-4o's accuracy drops by 26% when it is evaluated speech-to-speech rather than on text alone. This highlights a critical area for improvement in native speech-enabled AI systems.
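As a rough illustration of how such a gap can be measured, the sketch below scores the same questions under two modalities and reports the difference in percentage points. The function names and the toy outcomes are hypothetical and are not taken from the Big Bench Audio release. Reading the 26% figure as a percentage-point difference, rather than a relative drop, is also an assumption.

```python
def accuracy(results):
    """Return the fraction of correct answers in a list of booleans."""
    return sum(results) / len(results)


def accuracy_gap_points(text_results, speech_results):
    """Return text accuracy minus speech accuracy, in percentage points."""
    return 100 * (accuracy(text_results) - accuracy(speech_results))


# Hypothetical toy outcomes for the same 10 questions (not real benchmark data).
text_only = [True] * 9 + [False] * 1
speech_to_speech = [True] * 6 + [False] * 4

gap = accuracy_gap_points(text_only, speech_to_speech)
print(f"text-only: {accuracy(text_only):.0%}")
print(f"speech-to-speech: {accuracy(speech_to_speech):.0%}")
print(f"gap: {gap:.0f} percentage points")
```

Scoring both modalities on the same question set keeps the comparison paired, so the gap reflects the input and output modality rather than differences in question difficulty.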
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info