Evaluating Audio Reasoning with Big Bench Audio — GPT-4o shows a 26% accuracy gap

AI Impact Summary

Artificial Analysis has released Big Bench Audio, a new dataset designed to evaluate the audio reasoning capabilities of large language models such as GPT-4o and Gemini 1.5. The dataset adapts challenging questions from Big Bench Hard into an audio format and reveals a significant performance gap: GPT-4o's accuracy drops by 26% when it is evaluated speech-to-speech rather than on text-only input. This highlights a critical area for improvement in native speech-enabled AI systems.

Affected Systems

GPT-4o, Gemini 1.5

Date: Date not specified
Change type: capability
Severity: info
