TextQuests benchmark shows LLMs struggle with long-context reasoning in text-based games
AI Impact Summary
TextQuests evaluates LLM agents by running them through 25 Infocom games with long, growing histories, forcing sustained self-directed planning without external tools. The study shows models struggle with long-context reasoning: context windows can exceed 100K tokens, yet agents hallucinate past actions, fall into repetitive loops, and fail to build usable mental maps for navigation. Efficient exploration would require dynamic reasoning that spends only the tokens a step needs, but in practice performance often improves only with higher test-time compute, raising cost concerns. These findings imply that production agents in interactive environments need robust memory, plan stability, and context management to avoid regressions as history grows.
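The context-management takeaway can be illustrated with a minimal sketch. The class below is hypothetical (not from the TextQuests paper): it bounds the history sent to the model with a sliding window of recent turns while keeping a persistent tally of actions, so repeated actions can be flagged as a loop even after they scroll out of the prompt.

```python
from collections import deque


class AgentContext:
    """Hypothetical context manager for a text-game agent: a bounded
    sliding window keeps the prompt small, while a persistent action
    tally detects the repeat-action loops the benchmark reports."""

    def __init__(self, max_turns: int = 50):
        # Sliding window of (action, observation) pairs sent to the model
        self.recent = deque(maxlen=max_turns)
        # Persistent counts survive window eviction, enabling loop detection
        self.action_counts: dict[str, int] = {}

    def record(self, action: str, observation: str) -> None:
        self.recent.append((action, observation))
        self.action_counts[action] = self.action_counts.get(action, 0) + 1

    def is_looping(self, action: str, threshold: int = 3) -> bool:
        # Flag an action already tried `threshold` or more times
        return self.action_counts.get(action, 0) >= threshold

    def prompt_history(self) -> str:
        # Only the bounded recent window goes into the model prompt
        return "\n".join(f"> {a}\n{o}" for a, o in self.recent)
```

This separates what the model sees (a fixed-size window) from what the agent remembers (the full action tally), one simple way to keep token usage flat as the episode grows.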
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info