DABstep: Data Agent Benchmark for Multi-step Reasoning by Adyen & Hugging Face
AI Impact Summary
DABstep introduces a real-world benchmark to stress-test agentic data-analysis workflows, comprising over 450 tasks drawn from Adyen workloads that span structured and unstructured data. The results show that current state-of-the-art reasoning agents struggle: top performers achieve only 16% accuracy, underscoring a substantial gap to practical production use. By exposing datasets (e.g., payments.csv, fees.json, MCC codes), a binary factoid evaluation, and a real-time leaderboard, it provides concrete targets for improving data extraction, reasoning over mixed data formats, and end-to-end automation pipelines. For engineering teams, DABstep defines a realistic evaluation surface to guide tooling, governance, and migration planning toward autonomous data analysis.
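The binary factoid evaluation mentioned above can be sketched as an all-or-nothing exact-match scorer: each task has a single gold answer, a prediction either matches it or it does not, and accuracy is the fraction of matches. The function names, normalization rule, and sample task IDs below are hypothetical illustrations, not DABstep's actual grader.

```python
# Minimal sketch of a binary factoid scorer, assuming each task has one gold
# answer and credit is all-or-nothing. Names and data here are hypothetical.

def normalize(answer: str) -> str:
    """Strip whitespace and lowercase so trivial formatting differences don't count."""
    return answer.strip().lower()

def score(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Return accuracy: the fraction of tasks whose prediction matches the gold factoid."""
    correct = sum(
        normalize(predictions.get(task_id, "")) == normalize(expected)
        for task_id, expected in gold.items()
    )
    return correct / len(gold)

# Illustrative run: two of three hypothetical tasks answered correctly.
gold = {"t1": "7.24", "t2": "NL", "t3": "GlobalCard"}
preds = {"t1": "7.24", "t2": "nl", "t3": "Visa"}
print(score(preds, gold))  # → 0.666... (2 of 3 correct)
```

A binary metric like this leaves no room for partial credit, which is one reason headline accuracy on such benchmarks can look low even when agents get part of the way to an answer.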
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info