Big Bench Audio evaluates the audio reasoning gap in GPT-4o and Gemini 1.5 across 18 configurations
AI Impact Summary
Artificial Analysis has released Big Bench Audio, a benchmark that quantifies reasoning performance across modalities using 1,000 questions drawn from Big Bench Hard, spanning the formal_fallacies, navigate, object_counting, and web_of_lies categories. The evaluation compares GPT-4o and Gemini 1.5 variants across Speech to Speech, Speech to Text, Text to Speech, and Text to Text, and reveals a substantial speech reasoning gap: GPT-4o, for example, scores 92% on text but 66% on Speech to Speech. Pipeline approaches (Whisper transcription → GPT-4o reasoning → TTS-1) currently appear more robust for audio reasoning. The assessment relies on an automated LLM evaluator (Anthropic Claude 3.5 Sonnet) and standardized transcription via OpenAI Whisper. The results underscore the need for improved native audio reasoning or optimized end-to-end audio pipelines for production use.
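The pipeline approach and the reported gap can be illustrated with a minimal Python sketch. This is not Artificial Analysis's evaluation code: `run_pipeline`, `gap_in_points`, and the three stage callables are hypothetical names introduced here, and the stub stages stand in for real speech-to-text, text reasoning, and text-to-speech models so the example runs offline.

```python
from typing import Callable

# Hypothetical stage signatures (assumptions for illustration). In practice
# each stage would wrap a speech-to-text model, a text reasoning model, and
# a text-to-speech model respectively.
Transcribe = Callable[[bytes], str]
Reason = Callable[[str], str]
Synthesize = Callable[[str], bytes]


def run_pipeline(
    audio: bytes,
    transcribe: Transcribe,
    reason: Reason,
    synthesize: Synthesize,
) -> bytes:
    """Answer a spoken question by chaining three separate stages."""
    question_text = transcribe(audio)    # speech -> text
    answer_text = reason(question_text)  # reasoning happens on text only
    return synthesize(answer_text)       # text -> speech


def gap_in_points(text_accuracy_pct: float, speech_accuracy_pct: float) -> float:
    """Return the accuracy drop from text to speech, in percentage points."""
    return text_accuracy_pct - speech_accuracy_pct


if __name__ == "__main__":
    # Stub stages: decode bytes, upper-case the text, encode back to bytes.
    spoken_answer = run_pipeline(
        b"how many objects?",
        transcribe=lambda audio: audio.decode("utf-8"),
        reason=lambda text: text.upper(),
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(spoken_answer)

    # GPT-4o figures quoted in the summary: 92% on text, 66% on Speech to Speech.
    print(gap_in_points(92, 66))
```

Because the reasoning stage only ever sees text, a pipeline of this shape inherits the text model's reasoning ability, which is one plausible reading of why such pipelines score closer to the Text to Text results than native Speech to Speech models do.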
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info