TextQuests: How LLMs Perform in Complex Text-Based Video Games
The rapid advancement of Large Language Models (LLMs) has led to remarkable breakthroughs across established academic and industrial benchmarks. While these models now largely saturate knowledge-based evaluations like MMLU and GPQA, and even make significant strides in expert assessments, their success in static, information-retrieval tasks doesn’t always translate to effectiveness in dynamic, interactive settings. This disparity highlights a critical challenge: developing robust methodologies for evaluating LLMs as autonomous agents in complex, exploratory environments, where we would ideally want intelligent assistants and AI agents to thrive.
Two primary avenues exist for evaluating autonomous agents: using real-world environments to test specific skills such as tool use or coding, or employing simulated open-world environments. The latter approach is particularly effective for gauging an agent’s capacity to operate autonomously in exploratory settings, which demand sustained, self-directed reasoning over an ever-growing context, while remaining straightforward to evaluate. Interest in this area is growing quickly, with benchmarks like Balrog and ARC-AGI emerging alongside compelling demonstrations of models such as Claude and Gemini navigating the complexities of games like Pokémon. Building on this momentum, a new benchmark called TextQuests has been introduced.
TextQuests is built upon a collection of 25 classic Infocom interactive fiction games. These once-popular text-based video games, which could engross human players for over 30 hours and necessitate hundreds of precise actions to solve, offer a compelling testbed for the intricate challenges of agentic reasoning. They demand that an AI agent demonstrate sophisticated long-context reasoning, requiring it to devise and execute multi-step plans by reasoning over a vast and continuously expanding history of actions and observations, relying solely on its intrinsic capabilities without external aids. Furthermore, success in these games hinges on the agent’s ability to learn through exploration, interrogating its own failures and making incremental improvements through trial-and-error as it navigates an unknown world. This sustained engagement allows for a more direct and accurate assessment of the LLM itself, serving as the core reasoning engine of an AI agent system.
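To make this interaction pattern concrete, here is a minimal sketch of how an agent might play one of these games while keeping its entire history in context. The environment and LLM client interfaces are hypothetical placeholders for illustration, not the benchmark’s actual API.

```python
# Minimal agent-loop sketch: the model replays its entire, untruncated
# history of observations and actions at every step. The env/llm interfaces
# are hypothetical stand-ins, not TextQuests' real API.

def run_episode(env, llm, max_steps=500):
    history = [{"role": "system",
                "content": "You are playing an interactive fiction game. "
                           "Reply with a single game command."}]
    observation = env.reset()              # opening text of the game
    for _ in range(max_steps):
        history.append({"role": "user", "content": observation})
        action = llm.complete(history)     # reason over the full history so far
        history.append({"role": "assistant", "content": action})
        observation, done = env.step(action)
        if done:                           # game solved before the step cap
            break
    return history
```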
For evaluation, each model undergoes two distinct runs: one with access to the game’s official hints, and one without. Each run is capped at 500 steps, concluding early if the agent successfully completes the game. To facilitate comprehensive long-context evaluation, the full game history is maintained without truncation, a computationally feasible approach thanks to prompt caching inherent in modern LLM inference frameworks. Performance is assessed using two main metrics: Game Progress, calculated based on a series of labeled checkpoints representing necessary objectives, and Harm, which tracks specific in-game actions considered ethically problematic, with the score averaged across all games to gauge an agent’s overall propensity for such actions.
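As a rough illustration of this scoring, the functions below compute per-game progress as the fraction of labeled checkpoints reached, count harmful in-game actions, and average both across games. The checkpoint and harm representations are invented for the example and do not reflect the benchmark’s actual labels or implementation.

```python
# Illustrative scoring sketch; checkpoint and harm labels here are assumptions.

def game_progress(transcript: str, checkpoints: list[str]) -> float:
    """Fraction of labeled, necessary objectives the agent reached."""
    reached = sum(1 for cp in checkpoints if cp in transcript)
    return reached / len(checkpoints)

def harm_count(actions: list[str], harmful_actions: set[str]) -> int:
    """Number of ethically problematic in-game actions taken in one game."""
    return sum(1 for a in actions if a in harmful_actions)

def aggregate(progress_per_game: list[float], harm_per_game: list[int]):
    """Average both metrics over all games in the suite."""
    n = len(progress_per_game)
    return sum(progress_per_game) / n, sum(harm_per_game) / n
```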
The evaluations reveal significant insights into current LLM capabilities, particularly concerning long-context reasoning. As the context window can exceed 100,000 tokens, LLMs must consistently perform precise reasoning and planning over an extensive history of observations and clues to progress effectively. However, a common observation is that current models frequently “hallucinate” about prior interactions, misremembering details or believing they have already completed an action they haven’t. This often leads to agents getting stuck in navigation loops. Furthermore, similar to observations from models playing Pokémon, LLM agents show an increased tendency to repeat actions from their history rather than synthesizing novel plans as the context lengthens. These long-context failures are especially pronounced in tasks requiring spatial reasoning. For instance, in the game Wishbringer, most LLMs struggled to navigate back down a cliff after ascending it, even though the solution simply involved reversing the sequence of directions—information readily available in the context history. This indicates a fundamental difficulty in building and utilizing an internal mental map. Similarly, all tested frontier LLMs struggled to navigate the infamous Maze in Zork I.
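The cliff example underscores how little computation the failed step actually requires; as a worked illustration, retracing a path amounts to reversing the recorded moves and inverting each direction.

```python
# Worked example of the Wishbringer-style retracing task: the return route
# is the recorded path reversed, with each move replaced by its opposite.

OPPOSITE = {
    "north": "south", "south": "north",
    "east": "west", "west": "east",
    "up": "down", "down": "up",
    "northeast": "southwest", "southwest": "northeast",
    "northwest": "southeast", "southeast": "northwest",
}

def retrace(path_taken: list[str]) -> list[str]:
    """Moves that lead back to the starting location."""
    return [OPPOSITE[move] for move in reversed(path_taken)]

print(retrace(["up", "north", "east"]))  # -> ['west', 'south', 'down']
```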
Beyond reasoning accuracy, an agent’s overall effectiveness is also defined by its operational efficiency. For LLM agents, efficiency is closely tied to the number of output or reasoning tokens generated, which directly affects inference cost and latency. While models that spend more test-time compute generally achieve higher performance, the returns diminish beyond a certain budget. This consideration matters because many exploratory steps in TextQuests, such as navigation, are intermediate and can be executed successfully without extensive reasoning depth. An ideal LLM agent should therefore allocate its reasoning effort dynamically and efficiently while still maintaining consistent performance.
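One way to picture such dynamic effort is a simple per-step budget heuristic: spend few reasoning tokens on routine exploration and more when the latest observation hints at an obstacle. The cue phrases and budget values below are purely illustrative assumptions, not something TextQuests prescribes.

```python
# Hypothetical heuristic for dynamic reasoning effort; thresholds and cue
# phrases are illustrative assumptions, not part of the benchmark.

PUZZLE_CUES = ("locked", "won't budge", "you can't", "nothing happens")

def reasoning_budget(last_observation: str, made_recent_progress: bool) -> int:
    """Pick a max-reasoning-token budget for the next step."""
    routine = made_recent_progress and not any(
        cue in last_observation.lower() for cue in PUZZLE_CUES
    )
    return 256 if routine else 4096  # cheap routine step vs. deeper deliberation
```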
In conclusion, TextQuests provides a rigorous evaluation of how well models can consistently progress through a series of classic interactive fiction games, once a beloved pastime for human players. By open-sourcing TextQuests, researchers hope to foster a deeper understanding and more accurate assessment of the current capabilities of LLM agents in challenging, exploratory environments.