IBM and UC Berkeley diagnose enterprise agent failures using ITBench and MAST
AI Impact Summary
IBM and UC Berkeley applied MAST to ITBench to diagnose why enterprise agents fail in real-world IT automation, converting raw traces into structured failure signatures across Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. They found that frontier models fail at isolated bottlenecks such as incorrect verification, while large open models exhibit cascading failures, with Incorrect Verification (FM-3.3) acting as a strong predictor and early reasoning mismatches poisoning the context downstream. The study translates these findings into concrete engineering guidance: externalize verification, require hard tool evidence before exit, enforce termination and loop controls outside the LLM, and add explicit stop conditions. The goal is to improve reliability and debuggability for IT automation tasks such as Kubernetes outages, patching, and FinOps workflows.
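The guidance above (termination and loop controls outside the LLM, hard tool evidence before exit) can be sketched as a small harness-side guard. This is an illustrative sketch only, not code from the study; the `LoopGuard` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class LoopGuard:
    """Enforces termination and loop controls outside the LLM:
    a hard iteration cap, plus a requirement that the agent cite
    tool evidence before it is allowed to exit (a guard against
    Incorrect Verification, FM-3.3)."""
    max_steps: int = 10
    steps: int = 0
    evidence: list = field(default_factory=list)

    def record_tool_result(self, result: str) -> None:
        # Hard tool evidence gathered during the run.
        self.evidence.append(result)

    def may_continue(self) -> bool:
        # Explicit stop condition enforced by the harness, not the model.
        self.steps += 1
        return self.steps <= self.max_steps

    def may_exit(self) -> bool:
        # Block "done" claims that lack any tool evidence.
        return len(self.evidence) > 0

# Hypothetical usage: the harness, not the LLM, decides when to stop.
guard = LoopGuard(max_steps=5)
while guard.may_continue():
    # ... call the LLM, run tools, then record what the tools returned:
    guard.record_tool_result("kubectl get pods -> all Running")
    if guard.may_exit():
        break
```

The key design choice is that both the iteration cap and the exit check live in the orchestration code, so a model that hallucinates success cannot terminate the loop without verifiable tool output.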
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info