Hugging Face: Evaluating Audio Reasoning with Big Bench Audio - GPT-4o shows a 26% accuracy gap | SignalBreak