TextQuests benchmark shows LLMs struggle with long-context reasoning in text-based games
AI Impact Summary
TextQuests evaluates LLM agents by running them through 25 Infocom games with long, growing histories, forcing sustained self-directed planning without external tools. The study shows models struggle with long-context reasoning: context windows can exceed 100K tokens, yet agents hallucinate past actions, fall into repetitive loops, and fail to build usable mental maps for navigation. Efficient exploration would require dynamic reasoning that spends only the tokens a step needs, but in practice performance often improves only with higher test-time compute, raising cost concerns. These findings imply that production agents in interactive environments need robust memory, plan stability, and context management to avoid regressions as history grows.
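The context-management takeaway can be illustrated with a minimal sketch. The class below is hypothetical (not from the TextQuests paper): it bounds the history sent to the model with a sliding window of recent turns while keeping a persistent tally of actions, so repeated actions can be flagged as a loop even after they scroll out of the prompt.

```python
from collections import deque


class AgentContext:
    """Hypothetical context manager for a text-game agent: a bounded
    sliding window keeps the prompt small, while a persistent action
    tally detects the repeat-action loops the benchmark reports."""

    def __init__(self, max_turns: int = 50):
        # Sliding window of (action, observation) pairs sent to the model
        self.recent = deque(maxlen=max_turns)
        # Persistent counts survive window eviction, enabling loop detection
        self.action_counts: dict[str, int] = {}

    def record(self, action: str, observation: str) -> None:
        self.recent.append((action, observation))
        self.action_counts[action] = self.action_counts.get(action, 0) + 1

    def is_looping(self, action: str, threshold: int = 3) -> bool:
        # Flag an action already tried `threshold` or more times
        return self.action_counts.get(action, 0) >= threshold

    def prompt_history(self) -> str:
        # Only the bounded recent window goes into the model prompt
        return "\n".join(f"> {a}\n{o}" for a, o in self.recent)
```

This separates what the model sees (a fixed-size window) from what the agent remembers (the full action tally), one simple way to keep token usage flat as the episode grows.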
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info