[HN Gopher] Can Large Language Models Play Text Games Well?
       ___________________________________________________________________
        
       Can Large Language Models Play Text Games Well?
        
       Author : willvarfar
       Score  : 55 points
       Date   : 2025-07-04 11:24 UTC (11 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | s-macke wrote:
       | This paper only scratches the surface and feels incomplete, as it
       | references only GPT-4 and mentions appendices that are not
       | included. The examples are two years old.
       | 
       | For a more in-depth analysis of chatbots playing text adventures,
       | take a look at my project. I haven't updated it in a while due to
       | time constraints.
       | 
       | [0] https://github.com/s-macke/AdventureAI
        
         | s-macke wrote:
         | The answer to the paper's question is likely yes--especially if
         | context is used effectively and memory and summaries are
         | incorporated. In that case, chatbots can complete even more
         | complex games, such as Pokemon role-playing games [0].
         | 
         | The challenge with benchmarking text adventures lies in their
         | trial-and-error nature. It's easy to get stuck for hundreds of
         | moves on a minor detail before eventually giving up and trying
         | a different approach.
         | 
         | [0] https://www.twitch.tv/gpt_plays_pokemon
        
         | glimshe wrote:
         | I like your project because you try to compare the performance
         | of different chatbots. At the same time, I certainly wouldn't
         | say it's more complete than the paper - your landing page is
         | somewhat superficial. Reading both is better than just reading
         | either.
        
       | briandw wrote:
        | Interesting to see, but as the authors say, a chatbot isn't
        | trained to play text adventures. Instruction tuning doesn't
        | seem to match the text-adventure style very well. I think a
        | very small bit of context engineering would allow it to play
        | successfully. Reformatting past action-response pairs from the
        | history would certainly help, mostly to condense the context
        | window and keep it from getting stuck talking about irrelevant
        | topics. Also note that they used GPT-4 and not a reasoning
        | model.
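        | 
        | A minimal sketch of that kind of reformatting (the helper and
        | its defaults are made up for illustration):
        | 
        |     def condense_history(pairs, keep_full=3, max_len=80):
        |         """Illustrative helper: keep recent turns verbatim,
        |         truncate the responses of older ones to save
        |         context."""
        |         lines = []
        |         for i, (action, response) in enumerate(pairs):
        |             recent = i >= len(pairs) - keep_full
        |             text = response if recent else response[:max_len]
        |             lines.append("> " + action + "\n" + text)
        |         return "\n".join(lines)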
        
       | willvarfar wrote:
       | It's been a background thought of mine for a while:
       | 
        | * create a basic text adventure (or MUD) with a very spartan
        | API-like representation
       | 
        | * use an LLM to embellish the description served to the user.
        | With recent history in context, the LLM might even kinda
        | reference things the user asked about previously.
       | 
        | * have NPCs implemented as their own LLMs that are trying to
        | 'play the game'. These might use the spartan API directly, as
        | if they were agents.
       | 
        | It's a fun thought experiment!
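        | 
        | A rough sketch of the embellishment step, assuming a toy engine
        | state and any text-completion function (all names here are made
        | up):
        | 
        |     def render_room(state, history, llm):
        |         """Turn a spartan engine state into prose.
        |         llm is any text-completion function."""
        |         bare = ("ROOM " + state["room"] +
        |                 "; EXITS " + ",".join(state["exits"]) +
        |                 "; ITEMS " + ",".join(state["items"]))
        |         prompt = (
        |             "Rewrite this bare game state as vivid prose. "
        |             "You may nod to the player's recent actions.\n"
        |             "Recent history: " + " / ".join(history[-5:]) +
        |             "\nState: " + bare)
        |         return llm(prompt)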
       | 
       | (An aside: I found that the graphical text adventure that I made
       | for Ludum Dare 23 is still online! Although it doesn't render
       | quite right in modern browsers.. things shouldn't have broken!
       | But anyway https://williame.github.io/ludum_dare_23_tiny_world/)
        
         | briandw wrote:
          | Have you seen https://www.aidungeon.com ? They started with
          | GPT-2 in a Google Colab. You should put something together
          | and try it; it's easier than ever to get a simple version of
          | that working.
        
         | heyitsguay wrote:
         | I've done something along these lines!
         | https://github.com/heyitsguay/trader
         | 
         | The challenge for me was consistency in translating free text
         | from dialogs into classic, deterministic game state changes.
          | But what's satisfying is that the conversations aren't just
          | window dressing; they're part of the game mechanic.
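          | 
          | One common way to pin that down is to force the model into a
          | structured output and validate it before touching game
          | state; a rough sketch (the action vocabulary is
          | hypothetical):
          | 
          |     import json
          | 
          |     # Hypothetical action vocabulary for the example.
          |     ALLOWED_ACTIONS = {"buy", "sell", "give", "none"}
          | 
          |     def parse_state_change(llm_reply):
          |         """Expect JSON like {"action": "buy",
          |         "item": "rope", "price": 5}; reject anything
          |         else so state changes stay deterministic."""
          |         try:
          |             change = json.loads(llm_reply)
          |         except json.JSONDecodeError:
          |             return None  # caller re-prompts, never guesses
          |         if change.get("action") not in ALLOWED_ACTIONS:
          |             return None
          |         return change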
        
           | ivape wrote:
           | _deterministic game state changes_
           | 
            | I found this to be the truly strenuous work in LLM-based
            | development. While it appears that AI has made everything
            | easy and free, the particular challenge of consistently
            | getting deterministic outputs takes serious programming
            | effort. It feels like an entirely new job role. In other
            | words, I wouldn't do this for free; it takes too much
            | effort.
        
         | IngoBlechschmid wrote:
         | Gwern has an interesting take on this: https://gwern.net/cyoa
         | By pivoting to "choose your own adventure"-style games,
         | multiple issues (quality, costs) might be resolved.
        
         | EliasWatson wrote:
          | I've been working on something just like that off and on for
          | a couple of months. It's a MUD where all the NPCs are
          | controlled by LLMs that interact with the world using the
          | same commands that players use. I got it to the point where
          | the NPCs can navigate the world, interact with each other,
          | and even create things. But they often go down rabbit trails
          | and forget their original task, so I need to build a memory
          | system and something like the task list in Claude Code. My
          | goal is to have a fully simulated town that the player can
          | interact with.
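          | 
          | A task list like that could be as simple as a small structure
          | pinned into every NPC prompt; a hypothetical sketch:
          | 
          |     class NpcTasks:
          |         """Hypothetical structure: pin an NPC's goal and
          |         steps into every prompt so it can't wander off
          |         and forget."""
          | 
          |         def __init__(self, goal):
          |             self.goal = goal
          |             self.steps = []  # [description, done] pairs
          | 
          |         def add(self, description):
          |             self.steps.append([description, False])
          | 
          |         def as_prompt(self):
          |             todo = [d for d, done in self.steps
          |                     if not done]
          |             return ("Your overall goal: " + self.goal +
          |                     "\nUnfinished steps: " +
          |                     "; ".join(todo) +
          |                     "\nPick one command that advances "
          |                     "the first unfinished step.")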
        
       | Workaccount2 wrote:
       | How are you going to release an LLM eval paper in mid-2025 using
       | 
        |  _ChatGPT 3.5_?
       | 
        | Yes, if you are wondering why they don't clarify the model,
        | it's because all this was done back in early 2023 (the chat
        | logs are dated). Back then there was only 3.5, and 4 was
        | freshly released.
       | 
        | Advancement in this space has been so rapid that this is almost
        | like releasing a paper today titled "Video Streaming on Mobile
        | Devices" based on tests run over a 3G connection in 2013.
       | 
        | The authors should have held back a few more months and turned
        | the paper into a 3.5-to-O3 (or any other 2025 SOTA) improvement
        | analysis.
        
         | IngoBlechschmid wrote:
          | The paper was originally released in April 2023; it just got
          | version-bumped a couple of months ago :-)
        
         | suddenlybananas wrote:
         | >The authors should have held back a few more months and turned
         | the paper into a 3.5 to O3 or any other 2025 SOTA improvement
         | analysis.
         | 
         | If they had done that, you would then be complaining about them
         | not using Claude or whatever.
        
           | rs186 wrote:
           | I don't see the logic in your comment.
        
       | DougHaber wrote:
       | I did some experimenting with this a little while back and was
       | disappointed in how poorly LLMs played games.
       | 
        | I made some AI tools (https://github.com/DougHaber/lair) and
        | added a tmux tool so that LLMs could interact with terminals.
        | First, I tried Nethack. As expected, the model is not good at
        | understanding text "screenshots", and it failed miserably.
       | 
       | https://x.com/LeshyLabs/status/1895842345376944454
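        | 
        | The tmux side of such a tool needs very little code; a sketch
        | of the two primitives (the agent loop around them is the hard
        | part, and these function names are just illustrative):
        | 
        |     import subprocess
        | 
        |     def send_keys(session, text):
        |         """Type a command into the game's tmux session."""
        |         subprocess.run(["tmux", "send-keys", "-t", session,
        |                         text, "Enter"], check=True)
        | 
        |     def capture_screen(session):
        |         """Read the current pane contents back as text."""
        |         out = subprocess.run(["tmux", "capture-pane", "-t",
        |                               session, "-p"], check=True,
        |                              capture_output=True, text=True)
        |         return out.stdout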
       | 
       | After that I tried a bunch of the "bsdgames" text games.
       | 
       | Here is a video of it playing a few minutes of Colossal Cave
       | Adventure:
       | 
       | https://www.youtube.com/watch?v=7BMxkWUON70
       | 
        | With this, it could play, but not very well. It gets confused a
        | lot. I was using gpt-4o-mini; smaller models I could run at
        | home work much worse. It would be interesting to try one of the
        | bigger state-of-the-art models to see how much it helps.
       | 
       | To give it an easier one I also had it hunt the Wumpus:
       | 
       | https://x.com/LeshyLabs/status/1896443294005317701
       | 
        | I didn't try improving this much, so there might be some
        | low-hanging fruit even in providing better instructions and
        | tuning what is sent to the LLM. For these, I was hoping I could
        | just hand it a terminal with a game in it and have it play
        | decently. We'll probably get there, but so far it's not that
        | simple.
        
         | s-macke wrote:
         | Try the game 9:05 by Adam Cadre [0]. It's one of the easiest
         | (and best) non-trivial text adventures. Some models are able to
         | reach the first or even second ending.
         | 
         | [0] https://en.wikipedia.org/wiki/9:05
        
           | throwawayoldie wrote:
           | What do you suppose would happen if you tried it on a game
           | that doesn't have 25 years of walkthroughs written for it?
        
             | s-macke wrote:
             | That's a good point. For 9:05, I expect it would work just
             | as well, since the game helps the user in many ways. The
             | puzzles are of the type "The door is closed", and you solve
             | them with "open door."
             | 
             | My suggestion concerns the poor performance DougHaber
             | mentioned: if 9:05 can't be solved, something else must be
             | wrong with his experiments.
             | 
              | I've tried three dozen games, and it's still hard to find
              | ones suitable for LLM benchmarks. With non-linear,
              | complex text-adventure games, my guess is that they get
              | stuck in an endless loop at some point. Hence, I just
              | test the progress in the first hundred steps.
        
       | gorfian_robot wrote:
        | Over at Slashdot, there's a story about how LLMs lose to Atari
        | 2600 Video Chess:
       | 
       | https://slashdot.org/story/25/07/03/2028252/microsoft-copilo...
        
       | spacecadet wrote:
       | Hey hey, guess this gives me an opportunity to mention my AI
       | dungeon master...
       | 
       | https://github.com/derekburgess/dungen
       | 
        | There are some interesting ideas in this paper, but even just
        | role-playing with ChatGPT demonstrates how poorly it does at
        | world-building and narrative... I was impressed by the Wayfarer
        | model, and I imagine there are other models out there on civit
        | or something that could be used together in some group-chat
        | orchestration to create a more dynamic "party" atmosphere.
        
       | kmstout wrote:
        | Data point: a few weeks ago, I spent some time shuttling text
        | between one of the Llama models (I'd have to check which one)
        | and Dunnet, the text adventure packaged with Emacs. Over
        | several trials, the Llama never realized that it needed to dig
        | where the ground "seems very soft." It never got the CPU card,
        | and then it became confused looking around the building for
        | clues about how to start the VAX. At one point it lost track of
        | the building layout and got stuck oscillating between the mail
        | room and the computer room.
        
       | btown wrote:
       | Setting aside the choice of LLM, the constraint that the LLM must
       | maintain a world-model-as-knowledge-graph solely by reading and
       | re-reading its own chat history seems to be a less interesting
       | experiment than providing it with tools that let it develop that
       | world model explicitly?
       | 
        | On page 5, Figure 1, the authors present a hand-drawn diagram
        | of the relationships between objects as a graph, with edge
        | directionality in 3D space. To me, this implies that you could
        | supply your LLM with a set of tools like getObjectsInGraph,
        | updateGraphRelatingObjectPair, findObjectsRelativeToObject,
        | describePathBetweenObjectsByName... and allow it to maintain
        | that diagram as a structured DAG, continually asking the game
        | engine questions that let it update the graph in an agentic
        | way. My prediction would be that it would recreate that
        | diagram, and enable goal seeking, with high fidelity.
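        | 
        | The structure behind tools like those could be tiny; a sketch
        | (the tool names above are invented, and so is this):
        | 
        |     from collections import defaultdict
        | 
        |     class WorldGraph:
        |         # Invented structure behind the tool names above.
        |         def __init__(self):
        |             self.edges = defaultdict(dict)  # a -> {dir: b}
        | 
        |         def relate(self, a, direction, b):
        |             self.edges[a][direction] = b
        | 
        |         def path(self, start, goal):
        |             """Breadth-first search over known edges."""
        |             frontier, seen = [(start, [])], {start}
        |             while frontier:
        |                 node, moves = frontier.pop(0)
        |                 if node == goal:
        |                     return moves
        |                 for d, nxt in self.edges[node].items():
        |                     if nxt not in seen:
        |                         seen.add(nxt)
        |                         frontier.append((nxt, moves + [d]))
        |             return None  # unreachable so far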
       | 
       | Asking an LLM to work without being able to "visualize" and
       | "touch" its environment in its "mind's eye" is tying a hand
       | behind its back. But I'm bullish that we'll find increasingly
       | better ways of adapting 3D/4D world models into textual tools in
       | a way that rapidly changes the possibilities of what LLMs can do.
        
         | daxfohl wrote:
         | Or even just a notepad. It's well established that long context
         | histories with scattered information are hard for LLMs to
         | navigate.
         | 
          | To distinguish whether it's using notes or context history,
          | you could simply delete the context after each turn. The
          | prompt could be something like: "You're taking over this
          | game from a previous player who has compiled these notes.
          | (insert notes here). Play one turn, and update the notes
          | with any new knowledge you have attained, relationships you
          | have identified, inaccuracies you have confirmed, hypotheses
          | you have, or anything else you think would be useful, so
          | that the next player will be able to use these notes to make
          | the best next move." Then just clear the context after each
          | move. Maybe also say there's a limit to the number of words
          | on the notepad so that it doesn't just flood the notes with
          | irrelevant information.
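          | 
          | The loop itself is only a few lines; a sketch, where
          | game.reset()/step() and llm() stand in for any engine
          | wrapper and completion function:
          | 
          |     PROMPT = ("You're taking over this game from a "
          |               "previous player who compiled these notes "
          |               "(300 words max):\n{notes}\n"
          |               "The game says:\n{observation}\n"
          |               "Reply with one command, then the updated "
          |               "notes, separated by a line containing "
          |               "only '---'.")
          | 
          |     def play(game, llm, turns=100):
          |         # game.reset()/step() and llm() are stand-ins
          |         # for any engine wrapper and completion function.
          |         notes = "(no notes yet)"
          |         observation = game.reset()
          |         for _ in range(turns):
          |             reply = llm(PROMPT.format(
          |                 notes=notes, observation=observation))
          |             command, _, notes = reply.partition("\n---\n")
          |             observation = game.step(command.strip())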
         | 
         | For future iterations, maybe also give it a bitmap or svg
         | canvas, or a database, or a code interpreter, and see if it
         | uses any of those tools at all.
        
         | theptip wrote:
         | This is a well-debated point; should you test the system on its
         | own, or system+scaffold? How much custom scaffolding is
         | allowed?
         | 
         | Both avenues are interesting. For AGI presumably the "general"
         | bit means you don't get to build task-specific scaffolding
         | ahead of time (though you can certainly build generic scaffolds
         | like memory or knowledge layers).
         | 
         | For safety/capability research, system+scaffolding is often
         | more interesting because that is the frontier; if you conclude
         | "LLMs cannot world-model" but actually LLM+CoT+memory can world
         | model in a specific domain you care about, then you might
          | underestimate capabilities and therefore deployment risk. The
          | general point being: capabilities are jagged and
          | prompt-dependent; just because you failed to elicit a
          | capability doesn't mean you proved it cannot be elicited.
        
       | pflenker wrote:
        | A while back (decades in comparison to the leaps and bounds in
        | the LLM sphere) I fed text game definitions into an LLM and
        | taught it to be the game engine.
        | 
        | - The "fluff" it created, the dialogues it enabled me to have
        | with NPCs, and the atmosphere it was able to build up were
        | amazing.
        | 
        | - It was too helpful, frequently giving me hints or solving
        | riddles for me.
        | 
        | - At some point it bypassed an in-game progression barrier that
        | would have prevented me from reaching a swamp without a rope.
        | While I was slowly drowning, it told me that I suddenly
        | remembered what was missing: "The rope! The rope you haven't
        | seen back in the hut!", which I then took out of the backpack
        | to save myself.
        
       | mark_undoio wrote:
       | I'm fascinated by this paper because it feels like it could be a
       | good analogue for "can LLMs handle a stateful, text-based tool".
       | A debugger is my particular interest but there's no reason why it
       | couldn't be something else.
       | 
       | To use a debugger, you need:
       | 
       | * Some memory of where you've already explored in the code (vs
       | rooms in a dungeon)
       | 
       | * Some wider idea of your current goal / destination (vs a
       | current quest or a treasure)
       | 
       | * A plan for how to get there - but the flexibility to adapt (vs
       | expected path and potential monsters / dead ends)
       | 
       | * A way for managing information you've learned / state you've
       | viewed (vs inventory)
       | 
       | Given text adventures are quite well-documented and there are
       | many of them out there, I'd also like to take time out to
       | experiment (at some point!) with whether presenting a command-
       | line tool _as_ a text adventure might be a useful  "API".
       | 
        | e.g. an MCP server that exposes a tool but _also_ provides a
        | mapping of the tool's concepts into dungeon-adventure concepts
        | (and back). If nothing else, the LLM's reasoning should be
        | pretty entertaining. Maybe playing "make believe" will even
        | make it better at some things - that would be very cool.
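        | 
        | Even a toy mapping layer makes the idea concrete; a sketch
        | (the mapping itself is pure fiction; the right-hand side uses
        | real gdb commands):
        | 
        |     # Fictional dungeon-verb to debugger-command mapping.
        |     VERB_MAP = {
        |         "look":       "backtrace",         # survey the room
        |         "go deeper":  "step",              # descend a frame
        |         "open chest": "info locals",       # inspect the loot
        |         "read map":   "info breakpoints",  # recall the map
        |     }
        | 
        |     def adventure_to_debugger(verb):
        |         """Translate a dungeon command into a debugger
        |         command, raising on unknown verbs."""
        |         if verb not in VERB_MAP:
        |             raise ValueError("You can't do that here: " + verb)
        |         return VERB_MAP[verb]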
        
         | alwa wrote:
         | That's a delightful concept to think about! I'm not sure what
         | conceptual information the translation layer would add to the
         | LLM's internal representation of the state space.
         | 
         | But the broader concept of asking it to translate something
         | structurally to a different domain, then seeing how the norms
         | of that domain cause it to manipulate the state differently...
         | that tickles my fancy for sure. Like you said, it sounds cool
         | even in an art-project sense just to read what it says!
        
         | vladimirralev wrote:
          | I've seen both Replit and Cline agents iteratively debug hard
          | problems with massive amounts of log lines. They can do it
          | already.
        
           | throwaway81523 wrote:
            | Look also at Delta Debugging, which didn't need an LLM.
        
           | mark_undoio wrote:
           | That's the thing though - they're using logs. My theory is
           | that LLMs are intrinsically quite good at that because
           | they're good at sifting text.
           | 
            | Getting them to drive something like a debugger interface
            | seems harder, in my experience (although the ChatDBG people
            | showed some success - my experiments did too, but it took
            | the tweaks I described).
           | 
           | My experiments are with Claude Opus 4, in Claude Code,
           | primarily.
        
       | nickandbro wrote:
        | I run a site, https://vimgolf.ai , where users try to beat a
        | bot that's powered by O3. The bot's goal is to transform a
        | start file into an end file using as few vim commands as
        | possible. I can concur that an LLM, given the right feedback
        | loops and context, can solve challenging text puzzles. But
        | from my experience this only holds for RL-based models like
        | O3, Claude 4 with extended thinking, or Gemini 2.5 Pro.
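        | 
        | The feedback loop behind that is simple to sketch (this is
        | just how I'd approximate a headless vim check; the scoring is
        | hypothetical):
        | 
        |     import os
        |     import subprocess
        |     import tempfile
        | 
        |     def try_solution(start_text, target_text, keystrokes):
        |         """Hypothetical scorer: apply normal-mode keystrokes
        |         headlessly, then check the result against the
        |         target file."""
        |         fd, path = tempfile.mkstemp(suffix=".txt")
        |         with os.fdopen(fd, "w") as f:
        |             f.write(start_text)
        |         subprocess.run(["vim", "-es",
        |                         "-c", "normal " + keystrokes,
        |                         "-c", "wq", path], check=True)
        |         with open(path) as f:
        |             result = f.read()
        |         os.unlink(path)
        |         return result == target_text, len(keystrokes)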
        
         | godelski wrote:
            | Last we talked, you said you weren't going to put
            | everything behind a login wall - most importantly,
            | literally any information about the site. In fact, there
            | seems to be less information than I remember last time.
            | 
            | When I land on your page I know nothing except that you're
            | offering a way to learn vim "the fun way". I would not have
            | guessed what you described.
         | 
            | Don't put everything behind a wall. At least try to
            | convince people that they want to be on the other side.
        
           | nickandbro wrote:
            | I'm migrating the backend to use Cloudflare containers
            | instead of one big VM to do that; it's just taking longer
            | than I thought. The reason I have the login is just to
            | rate limit requests in the meantime. But I hear you :)
        
       | ianbicking wrote:
       | For text adventures an important kind of reasoning is Inferring
       | Authorial Intent. Or maybe Seeing Chekhov's Gun. Or Learning The
       | Metagame.
       | 
       | The game is deliberately solvable, and elements are introduced to
       | that end. Inferring that is important to any solution. By using
       | minimal scaffolding you are testing things like "does the LLM
       | understand the patterns of text adventures, is it able to infer a
       | metagame" and so on. If you tested different kinds of scaffolding
       | I think you could tease apart some of these different kinds of
       | reasoning. That is, distinguish between (a) does it understand
       | text adventures, and (b) understanding text adventures, can they
       | be solved?
       | 
       | I did play around with more prompting and some statefulness:
       | https://github.com/ianb/tale-suite/blob/main/agents/llm_prom...
       | 
        | It wasn't that successful, but I think it could do much better;
        | I just had to stop myself from working on it more because of
        | other priorities.
        
       | ineedasername wrote:
       | >"How well does a zero-shot, 4K token context ChatGPT 3.5 fed
       | hand-typed Zork states and a pruned action list cope with a
       | single play-through of Zork I"
       | 
        | This is the more accurate title and the actual question they
        | answered, and the answer, unsurprisingly, was "not great". But
        | even my rewritten title understates the poor quality of the
        | protocol they used.
        
       | daxfohl wrote:
       | Has anyone tried having them DM text games? Seems like they could
       | create a dungeon and DM a game pretty well. It should be easier
       | than playing, I'd think. Though I'd be curious how good they are
       | at making _fun_ games or whether they struggle with that.
        
         | thrance wrote:
         | Had a friend try to build an AI-driven text RPG. Told me it was
         | rather bland and unimaginative, and gets boring fast.
        
       | theptip wrote:
        | The prompts are laughably bad. Circa GPT-3.5 you needed to say
        | "think step by step" etc. in order to get SOTA results.
       | 
       | > Imagine you are a player in Zork and trying to win the game.
       | You receive this message:
       | 
        | This paper simply proves that bad prompts get bad results; it
        | doesn't prove anything about the frontier capabilities of this
        | model.
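        | 
        | For comparison, the sort of prompt that coaxed better play out
        | of 3.5-era models looked something like this (illustrative
        | only, not taken from the paper):
        | 
        |     # Illustrative only, not the paper's prompt.
        |     PROMPT = """You are an expert interactive fiction
        |     player. Before acting, think step by step:
        |     1. Summarize the room, exits, and inventory.
        |     2. State your current goal and why.
        |     3. Choose ONE command that advances that goal.
        |     End your reply with the command on its own line,
        |     prefixed with '>'."""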
        
         | lawlessone wrote:
            | If everyone has to be a "prompt engineer" to get decent
            | results, it kind of defeats the purpose of AI chatbots.
        
           | dragonwriter wrote:
           | It takes specialized skills to get the best results out of
           | people. For that not to be true of AI chatbots requires them
           | to have not just human-like intelligence, but
           | superintelligence. Or mindreading. Probably both.
        
           | theptip wrote:
           | No, you need to be a prompt engineer to write an interesting
           | research paper on LLM capabilities.
           | 
           | Circa 3.5 people were getting fun results without needing to
           | prompt engineer (ChatGPT has the fastest user adoption of any
           | product in history so it's obviously not gatekept).
        
             | lawlessone wrote:
             | >chatGPT has the fastest user adoption of any product in
             | history so it's obviously not gatekept
             | 
              | Yeah, and covid and flu are contagious, so they must be
              | good, right?
        
       ___________________________________________________________________
       (page generated 2025-07-04 23:01 UTC)