AssetOpsBench: Industrial AI agent benchmark with six qualitative dimensions and TrajFM failure analysis

AI Impact Summary

AssetOpsBench is a benchmark for industrial AI agents that evaluates multi-agent coordination and failure handling across six qualitative dimensions under the real-world constraints of asset management. It aggregates 2.3M sensor points, 140+ scenarios across 4 agents, 4.2K work orders, and 53 structured failure modes. Its TrajFM pipeline combines LLM reasoning with clustering to surface interpretable failure patterns. Early results show that coordination and uncertainty issues are prevalent across leading models (GPT-4.1, Mistral-Large, LLaMA-4 Maverick, LLaMA-3-70B): many traces fall short of the 85-point deployment-readiness threshold, and recurring failure patterns include “Sounds Right, Is Wrong.” The framework supports feedback-driven iteration and privacy-conscious evaluation, with the aim of safer, more dependable agent deployments in settings such as chillers and air handling units.
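As a rough illustration of the 85-point readiness check mentioned in the summary, the sketch below scores a trace on six dimensions and compares the result against the threshold. The summary does not say how the six dimensions are named, scaled, or combined, so the placeholder dimension names, the 0-100 scale, and the unweighted mean are all assumptions made for illustration, not the benchmark's actual scoring method.

```python
from statistics import mean

# From the summary: traces are judged against an 85-point readiness threshold.
READINESS_THRESHOLD = 85.0

# Hypothetical placeholder names; the summary does not name the six dimensions.
DIMENSIONS = tuple(f"dimension_{i}" for i in range(1, 7))


def readiness_score(scores: dict[str, float]) -> float:
    """Assumed aggregation: unweighted mean of six 0-100 dimension scores."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return mean(scores[d] for d in DIMENSIONS)


def is_deployment_ready(scores: dict[str, float],
                        threshold: float = READINESS_THRESHOLD) -> bool:
    """A trace counts as ready when its aggregate score meets the threshold."""
    return readiness_score(scores) >= threshold


def readiness_rate(traces: list[dict[str, float]]) -> float:
    """Fraction of traces that meet the threshold (0.0 for an empty list)."""
    if not traces:
        return 0.0
    return sum(is_deployment_ready(t) for t in traces) / len(traces)


if __name__ == "__main__":
    # Two made-up traces, purely to exercise the functions.
    strong = {d: 90.0 for d in DIMENSIONS}
    weak = {d: 80.0 for d in DIMENSIONS}
    print(is_deployment_ready(strong), is_deployment_ready(weak))
    print(readiness_rate([strong, weak]))
```

Under a different aggregation rule (for example, requiring every dimension to clear the threshold on its own), only `readiness_score` would need to change.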

Affected Systems

AssetOpsBench

Date: not specified
Change type: capability
Severity: info
