VAKRA benchmark released to evaluate tool-use, API chaining, and multi-hop reasoning in AI agents
AI Impact Summary
VAKRA is a new executable benchmark that evaluates AI agents in enterprise-like settings by measuring how well they reason and act across APIs and documents. It covers four capabilities (API chaining, tool selection via dashboards, multi-hop reasoning, and multi-source, policy-aware tasks) across the SLOT-BIRD, SEL-BIRD, and REST-BIRD tool collections with MCP servers. Initial findings indicate that models struggle with these tasks, with failure modes around tool selection, data setup via get_data, and constrained tool lists under the OpenAI API. The benchmark provides a concrete, repeatable baseline for identifying gaps and guiding targeted improvements in tooling, policy adherence, and data access layers before production deployment.
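To make "API chaining" and "data setup via get_data" concrete, here is a minimal Python sketch of a chain in which each tool call consumes the previous call's result. Everything in it is an assumption for illustration: the toy dataset, the `filter_rows` and `sum_column` tools, the `"$prev"` placeholder, and the `get_data` signature are hypothetical and are not taken from VAKRA's actual tool definitions.

```python
from typing import Any, Callable

# Hypothetical in-memory store; stands in for whatever backend a real tool uses.
_STATE: dict[str, list[dict[str, Any]]] = {}


def get_data(source: str) -> str:
    """Load a toy dataset and return a handle that later tools must reference."""
    _STATE[source] = [
        {"customer": "acme", "region": "EU", "amount": 120},
        {"customer": "globex", "region": "US", "amount": 300},
        {"customer": "initech", "region": "EU", "amount": 80},
    ]
    return source


def filter_rows(handle: str, column: str, value: Any) -> str:
    """Keep rows where row[column] == value and return a new handle."""
    if handle not in _STATE:
        raise KeyError(f"unknown handle {handle!r}: call get_data first")
    new_handle = f"{handle}|{column}={value}"
    _STATE[new_handle] = [row for row in _STATE[handle] if row[column] == value]
    return new_handle


def sum_column(handle: str, column: str) -> int:
    """Sum one numeric column of the rows behind a handle."""
    if handle not in _STATE:
        raise KeyError(f"unknown handle {handle!r}: call get_data first")
    return sum(row[column] for row in _STATE[handle])


# Hypothetical tool registry mapping tool names to callables.
TOOLS: dict[str, Callable[..., Any]] = {
    "get_data": get_data,
    "filter_rows": filter_rows,
    "sum_column": sum_column,
}


def run_chain(steps: list[tuple[str, list[Any]]]) -> Any:
    """Run tools in order; the string "$prev" is replaced by the previous result."""
    prev: Any = None
    for name, args in steps:
        resolved = [prev if arg == "$prev" else arg for arg in args]
        prev = TOOLS[name](*resolved)
    return prev


# A three-step chain: set up data, filter it, then aggregate.
chain = [
    ("get_data", ["orders"]),
    ("filter_rows", ["$prev", "region", "EU"]),
    ("sum_column", ["$prev", "amount"]),
]
result = run_chain(chain)
print(result)
```

In this sketch, an agent that skips the setup step or picks the wrong tool fails at the first call that needs a valid handle, which loosely mirrors the tool-selection and data-setup failure modes described above.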
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info