
Big Bench Audio measures the audio reasoning gap in GPT-4o and Gemini 1.5 across 18 configurations

AI Impact Summary

Artificial Analysis has released Big Bench Audio, a benchmark that quantifies reasoning performance across modalities using 1,000 questions drawn from Big Bench Hard, spanning formal_fallacies, navigate, object_counting, and web_of_lies. It compares GPT-4o and Gemini 1.5 variants across four input/output modes (Speech to Speech, Speech to Text, Text to Speech, and Text to Text) and finds a substantial speech reasoning gap: GPT-4o, for example, scores 92% on text but 66% on Speech to Speech. Pipeline approaches (Whisper transcription → GPT-4o reasoning → TTS-1) currently appear more robust for audio reasoning. Answers are graded by an automated LLM evaluator (Anthropic Claude 3.5 Sonnet), with transcription standardized via OpenAI Whisper. The results point to a need for improved native audio reasoning or optimized end-to-end audio pipelines for production use.
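As a rough illustration of the pipeline approach named in the summary, the sketch below chains three stages (transcription, text reasoning, speech synthesis) and computes the quoted gap in percentage points. The class name, the function names, and the stub stages are all hypothetical; they stand in for real models and are not the benchmark's own code or any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures. Real stages would wrap a speech-to-text
# model, a text reasoning model, and a text-to-speech model.
Transcribe = Callable[[bytes], str]
Reason = Callable[[str], str]
Synthesize = Callable[[str], bytes]


@dataclass
class SpeechPipeline:
    """Speech-in, speech-out pipeline built from three text-centric stages."""

    transcribe: Transcribe
    reason: Reason
    synthesize: Synthesize

    def answer(self, audio_question: bytes) -> bytes:
        text_question = self.transcribe(audio_question)  # audio -> text
        text_answer = self.reason(text_question)         # text -> text
        return self.synthesize(text_answer)              # text -> audio


def accuracy_gap(text_accuracy: float, speech_accuracy: float) -> float:
    """Difference in percentage points between two accuracies given in percent."""
    return text_accuracy - speech_accuracy


# Stub stages (assumptions) so the sketch runs without any external service:
# "audio" is just UTF-8 bytes, and the reasoner always answers "valid".
pipeline = SpeechPipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    reason=lambda question: "valid",
    synthesize=lambda text: text.encode("utf-8"),
)

spoken_answer = pipeline.answer(b"Is the argument valid or invalid?")
gap = accuracy_gap(92.0, 66.0)  # GPT-4o figures quoted in the summary
print(spoken_answer, gap)
```

Because the reasoning stage only ever sees text, a pipeline of this shape can reuse a model's text reasoning ability, which is one plausible reading of why the summary describes pipelines as more robust at present.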

Affected Systems

GPT-4o

Date: not specified
Change type: capability
Severity: info

