VAKRA benchmark released to evaluate tool-use, API chaining, and multi-hop reasoning in AI agents

AI Impact Summary

VAKRA is a new executable benchmark that evaluates AI agents in enterprise-like settings by measuring how well they reason and act across APIs and documents. It covers four capabilities (API chaining, tool selection via dashboards, multi-hop reasoning, and multi-source, policy-aware tasks) across the SLOT-BIRD, SEL-BIRD, and REST-BIRD tool collections with MCP servers. Initial findings indicate that models struggle with these tasks, with failure modes around tool selection, data setup via get_data, and constrained tool lists under the OpenAI API. The benchmark provides a concrete, repeatable baseline for identifying gaps and guiding targeted improvements in tooling, policy adherence, and data access layers before production deployment.
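
To illustrate the "constrained tool lists" failure mode mentioned above, here is a minimal sketch of shortlisting a large tool catalogue before it is sent to a model API that caps the number of tools per request. It is not VAKRA's own method. The cap value, the word-overlap ranking, the function shortlist_tools, and the tool names are all assumptions made for illustration.

```python
import re

# Assumed cap on tools per request; check the limit of the API you target.
MAX_TOOLS = 128


def tokenize(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(query, tools, max_tools=MAX_TOOLS):
    """Return at most max_tools tools, ranked by word overlap with the query.

    Word overlap is a deliberately naive stand-in for whatever retrieval
    step a real agent would use.
    """
    query_words = tokenize(query)

    def score(tool):
        tool_words = tokenize(tool["name"] + " " + tool["description"])
        return len(query_words & tool_words)

    # sorted() is stable, so tools with equal scores keep catalogue order.
    ranked = sorted(tools, key=score, reverse=True)
    return ranked[:max_tools]


# Hypothetical catalogue; these tool names are not taken from VAKRA.
tools = [
    {"name": "list_orders", "description": "List all orders for a customer"},
    {"name": "filter_rows", "description": "Filter rows by a column value"},
    {"name": "send_email", "description": "Send an email message"},
]

selected = shortlist_tools(
    "filter the rows where the column value is high", tools, max_tools=2
)
print([t["name"] for t in selected])  # ['filter_rows', 'list_orders']
```

A shortlist like this trades recall for fitting under the cap: if the ranking drops the tool a task actually needs, the agent fails before it makes a single call, which is one plausible way a constrained tool list becomes a failure mode.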

Affected Systems

VAKRA, SLOT-BIRD

Date: not specified
Change type: capability
Severity: info