AssetOpsBench: Industrial AI agent benchmark with six qualitative dimensions and TrajFM failure analysis
AI Impact Summary
AssetOpsBench is a benchmark for industrial AI agents that evaluates multi-agent coordination and failure handling across six qualitative dimensions under the real-world constraints of asset management. It aggregates 2.3M sensor points, 140+ scenarios across 4 agents, 4.2K work orders, and 53 structured failure modes. Its TrajFM pipeline combines LLM reasoning with clustering to surface interpretable failure patterns. Early results show that coordination and uncertainty issues are prevalent across the leading models tested (GPT-4.1, Mistral-Large, LLaMA-4 Maverick, LLaMA-3-70B): many traces fall short of the 85-point deployment-readiness threshold, and recurring failure patterns include “Sounds Right, Is Wrong.” The framework supports feedback-driven iteration and privacy-conscious evaluation, with the aim of safer, more dependable agent deployments in industrial settings such as chillers and air handling units.
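To make the 85-point deployment-readiness threshold concrete, the following is a minimal sketch of how per-trace scores could be checked against it and summarized per model. It is not the benchmark's actual scoring code. The `TraceScore` record, the 0-100 aggregate scale, the "greater than or equal" comparison, and the function names are all assumptions made for illustration; only the threshold value of 85 comes from the summary above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# The 85-point threshold is stated in the summary; treating it as a cutoff on a
# single 0-100 aggregate score per trace is an assumption of this sketch.
READINESS_THRESHOLD = 85.0


@dataclass
class TraceScore:
    """Hypothetical record for one evaluated agent trace."""
    trace_id: str  # hypothetical identifier
    model: str     # name of the model that produced the trace
    score: float   # assumed aggregate score on a 0-100 scale


def is_deployment_ready(trace: TraceScore,
                        threshold: float = READINESS_THRESHOLD) -> bool:
    """Return True if the trace meets the threshold (assumed inclusive)."""
    return trace.score >= threshold


def readiness_rate_by_model(traces: Iterable[TraceScore]) -> Dict[str, float]:
    """Fraction of traces per model that meet the readiness threshold."""
    totals: Dict[str, int] = {}
    ready: Dict[str, int] = {}
    for t in traces:
        totals[t.model] = totals.get(t.model, 0) + 1
        ready[t.model] = ready.get(t.model, 0) + int(is_deployment_ready(t))
    return {model: ready[model] / totals[model] for model in totals}
```

Called on a list of `TraceScore` records, `readiness_rate_by_model` returns a mapping from model name to the share of its traces at or above the threshold, which is one simple way to express "many traces fall short" as a number.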
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info