DABstep: Data Agent Benchmark for Multi-step Reasoning by Adyen & Hugging Face
AI Impact Summary
DABstep introduces a real-world benchmark to stress-test agentic data-analysis workflows, comprising over 450 tasks drawn from Adyen workloads that span structured and unstructured data. The results show that current state-of-the-art reasoning agents struggle: top performers achieve only 16% accuracy, underscoring a substantial gap to practical production use. By exposing datasets (e.g., payments.csv, fees.json, MCC codes), a binary factoid evaluation, and a real-time leaderboard, it provides concrete targets for improving data extraction, reasoning over mixed data formats, and end-to-end automation pipelines. For engineering teams, DABstep defines a realistic evaluation surface to guide tooling, governance, and migration planning toward autonomous data analysis.
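The binary factoid evaluation mentioned above can be sketched as an all-or-nothing exact-match scorer: each task has a single gold answer, a prediction either matches it or it does not, and accuracy is the fraction of matches. The function names, normalization rule, and sample task IDs below are hypothetical illustrations, not DABstep's actual grader.

```python
# Minimal sketch of a binary factoid scorer, assuming each task has one gold
# answer and credit is all-or-nothing. Names and data here are hypothetical.

def normalize(answer: str) -> str:
    """Strip whitespace and lowercase so trivial formatting differences don't count."""
    return answer.strip().lower()

def score(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Return accuracy: the fraction of tasks whose prediction matches the gold factoid."""
    correct = sum(
        normalize(predictions.get(task_id, "")) == normalize(expected)
        for task_id, expected in gold.items()
    )
    return correct / len(gold)

# Illustrative run: two of three hypothetical tasks answered correctly.
gold = {"t1": "7.24", "t2": "NL", "t3": "GlobalCard"}
preds = {"t1": "7.24", "t2": "nl", "t3": "Visa"}
print(score(preds, gold))  # → 0.666... (2 of 3 correct)
```

A binary metric like this leaves no room for partial credit, which is one reason headline accuracy on such benchmarks can look low even when agents get part of the way to an answer.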
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info