VAKRA Benchmark Reveals Agent Failure Modes in Enterprise Reasoning
AI Impact Summary
The VAKRA benchmark advances the evaluation of AI agent capabilities by simulating complex, real-world enterprise workflows. Agents struggle with its multi-step reasoning chains, which require chaining API calls with unstructured document retrieval, exposing limitations in compositional reasoning and tool use. The benchmark's reliance on a `get_data` API and the specific data structures within the SEL-BIRD collection further reveal weaknesses in agents' handling of structured data and adaptation to domain-specific formats, demonstrating a need for more robust and adaptable agents.
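The failure pattern described above can be illustrated with a minimal sketch of the task shape: a structured API call whose result parameterizes a retrieval step, with both outputs composed into a final answer. The `get_data` name comes from the summary, but its signature, the `retrieve_documents` helper, and all data here are hypothetical illustrations, not the benchmark's actual interface.

```python
def get_data(table: str, key: str) -> dict:
    """Stand-in for a structured-data API (assumed shape, not VAKRA's real one)."""
    records = {
        ("invoices", "INV-042"): {"vendor_id": "V-7", "amount": 1200.0},
    }
    return records[(table, key)]

def retrieve_documents(query: str, corpus: list[str]) -> list[str]:
    """Naive keyword retrieval over unstructured text (assumed helper)."""
    terms = query.lower().split()
    return [doc for doc in corpus if any(t in doc.lower() for t in terms)]

def answer_task(invoice_id: str, corpus: list[str]) -> dict:
    # Step 1: structured API call.
    record = get_data("invoices", invoice_id)
    # Step 2: use the structured result to parameterize a retrieval query --
    # this dependency between steps is what makes the chain compositional.
    docs = retrieve_documents(f"vendor {record['vendor_id']} contract", corpus)
    # Step 3: compose both sources into a final answer.
    return {"amount": record["amount"], "supporting_docs": docs}

corpus = [
    "Contract for vendor V-7: net-30 payment terms.",
    "Unrelated memo about office plants.",
]
result = answer_task("INV-042", corpus)
```

An agent that answers from either source alone, or that mangles the domain-specific key formats when forming the second call, fails tasks of this shape even when each individual tool call succeeds.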
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info