IBM and UC Berkeley diagnose enterprise agent failures using ITBench and MAST
AI Impact Summary
IBM and UC Berkeley applied MAST to ITBench to diagnose why enterprise agents fail in real-world IT automation, converting raw traces into structured failure signatures across Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. They found that frontier models fail at isolated bottlenecks such as incorrect verification, while large open models exhibit cascading failures, with Incorrect Verification (FM-3.3) acting as a strong predictor and early reasoning mismatches poisoning the context downstream. The study translates these findings into concrete engineering guidance: externalize verification, require hard tool evidence before exit, enforce termination and loop controls outside the LLM, and add explicit stop conditions. The goal is to improve reliability and debuggability for IT automation tasks such as Kubernetes outages, patching, and FinOps workflows.
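The guidance above (termination and loop controls outside the LLM, hard tool evidence before exit) can be sketched as a small harness-side guard. This is an illustrative sketch only, not code from the study; the `LoopGuard` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class LoopGuard:
    """Enforces termination and loop controls outside the LLM:
    a hard iteration cap, plus a requirement that the agent cite
    tool evidence before it is allowed to exit (a guard against
    Incorrect Verification, FM-3.3)."""
    max_steps: int = 10
    steps: int = 0
    evidence: list = field(default_factory=list)

    def record_tool_result(self, result: str) -> None:
        # Hard tool evidence gathered during the run.
        self.evidence.append(result)

    def may_continue(self) -> bool:
        # Explicit stop condition enforced by the harness, not the model.
        self.steps += 1
        return self.steps <= self.max_steps

    def may_exit(self) -> bool:
        # Block "done" claims that lack any tool evidence.
        return len(self.evidence) > 0

# Hypothetical usage: the harness, not the LLM, decides when to stop.
guard = LoopGuard(max_steps=5)
while guard.may_continue():
    # ... call the LLM, run tools, then record what the tools returned:
    guard.record_tool_result("kubectl get pods -> all Running")
    if guard.may_exit():
        break
```

The key design choice is that both the iteration cap and the exit check live in the orchestration code, so a model that hallucinates success cannot terminate the loop without verifiable tool output.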
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info