[HN Gopher] Can Large Language Models Play Text Games Well?
___________________________________________________________________
Can Large Language Models Play Text Games Well?
Author : willvarfar
Score : 55 points
Date : 2025-07-04 11:24 UTC (11 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| s-macke wrote:
| This paper only scratches the surface and feels incomplete, as it
| references only GPT-4 and mentions appendices that are not
| included. The examples are two years old.
|
| For a more in-depth analysis of chatbots playing text adventures,
| take a look at my project. I haven't updated it in a while due to
| time constraints.
|
| [0] https://github.com/s-macke/AdventureAI
| s-macke wrote:
| The answer to the paper's question is likely yes--especially if
| context is used effectively and memory and summaries are
| incorporated. In that case, chatbots can complete even more
| complex games, such as Pokemon role-playing games [0].
|
| The challenge with benchmarking text adventures lies in their
| trial-and-error nature. It's easy to get stuck for hundreds of
| moves on a minor detail before eventually giving up and trying
| a different approach.
|
| [0] https://www.twitch.tv/gpt_plays_pokemon
| glimshe wrote:
| I like your project because you try to compare the performance
| of different chatbots. At the same time, I certainly wouldn't
| say it's more complete than the paper - your landing page is
| somewhat superficial. Reading both is better than just reading
| either.
| briandw wrote:
| Interesting to see, but as the authors say, a chatbot isn't
| trained to play text adventures. Instruction tuning doesn't seem
| to match the text-adventure style very well. I think a very small
| bit of context engineering would allow it to play successfully.
| Reformatting past action-response pairs from the history would
| certainly help, mostly to condense the context window and keep it
| from getting stuck talking about irrelevant topics. Also note
| that they used GPT-4 and not a reasoning model.
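|
| A rough sketch of the reformatting I mean (the helper is
| hypothetical, not anything from the paper): keep recent turns
| verbatim and compress older action-response pairs to one line
| each.
|
|     def condense_history(pairs, keep_last=5):
|         # pairs: list of (action, response) strings
|         old, recent = pairs[:-keep_last], pairs[-keep_last:]
|         condensed = "\n".join(
|             f"> {a}: {r.splitlines()[0]}" for a, r in old)
|         verbatim = "\n".join(f"> {a}\n{r}" for a, r in recent)
|         return (f"Earlier moves (condensed):\n{condensed}\n\n"
|                 f"Recent moves:\n{verbatim}")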
| willvarfar wrote:
| It's been a background thought of mine for a while:
|
| * create a basic text adventure (or MUD) with a very spartan,
| API-like representation
|
| * use an LLM to embellish the description served to the user.
| With recent history in context, the LLM might even kinda
| reference things the user asked previously
|
| * have NPCs implemented as their own LLMs that are trying to
| 'play the game'. These might use the spartan API directly, as if
| they were agents (roughly as sketched below)
|
| It's a fun thought experiment!
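|
| A toy sketch of the split (every name here is made up): the
| engine emits spartan state, and the LLM only embellishes it.
|
|     def render_room(state, recent_history, llm):
|         # state is machine-readable, e.g. {"room": "cell",
|         # "exits": ["north"], "items": ["rusty key"]}
|         prompt = (
|             "Rewrite this game state as atmospheric prose. "
|             "You may nod to recent events, but invent no new "
|             f"objects or exits.\nState: {state}\n"
|             f"Recent: {recent_history[-5:]}"
|         )
|         return llm(prompt)  # NPC agents would instead call
|                             # the engine API directly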
|
| (An aside: I found that the graphical text adventure that I made
| for Ludum Dare 23 is still online! Although it doesn't render
| quite right in modern browsers.. things shouldn't have broken!
| But anyway https://williame.github.io/ludum_dare_23_tiny_world/)
| briandw wrote:
| Have you seen https://www.aidungeon.com? They started with GPT-2
| in a Google Colab. You should put something together and try it;
| it's easier than ever to get a simple version of that working.
| heyitsguay wrote:
| I've done something along these lines!
| https://github.com/heyitsguay/trader
|
| The challenge for me was consistency in translating free text
| from dialogs into classic, deterministic game state changes.
| But what's satisfying is that the conversations aren't just
| window dressing; they're part of the game mechanics.
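|
| A simplified version of the approach (not the exact code in the
| repo): ask the model for a constrained JSON "effect", validate
| it, and only then touch game state.
|
|     import json
|
|     ALLOWED = {"give_item", "take_item", "set_price", "no_op"}
|
|     def apply_dialog_effect(llm_reply, state):
|         try:
|             effect = json.loads(llm_reply)  # model told: JSON
|         except json.JSONDecodeError:
|             return state           # chatter -> no state change
|         if effect.get("action") not in ALLOWED:
|             return state           # refuse anything off-menu
|         # apply the whitelisted change deterministically here
|         return state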
| ivape wrote:
| _deterministic game state changes_
|
| I found this to be the actual strenuous work in LLM-based
| development. While it appears that AI has made everything easy
| and free, the particular challenge of consistently getting
| deterministic outputs takes serious programming effort. It
| feels like an entirely new job role. In other words, I wouldn't
| do this for free; it takes too much effort.
| IngoBlechschmid wrote:
| Gwern has an interesting take on this: https://gwern.net/cyoa
| By pivoting to "choose your own adventure"-style games,
| multiple issues (quality, costs) might be resolved.
| EliasWatson wrote:
| I've been working on something just like that off and on for a
| couple months. It's a MUD where all the NPCs are controlled by
| LLMs that interact with the world with the same commands that
| players use. I got it to the point where the NPCs can navigate
| the world, interact with each other, and even create things.
| But they often get on rabbit trails and forget their original
| task, so I need to build a memory system and something like the
| task list in Claude Code. My goal is to have a fully simulated
| town that the player can interact with.
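|
| Roughly the memory layer I have in mind (nothing here is built
| yet): each NPC keeps a standing goal plus a rolling summary,
| and both are prepended to every prompt so rabbit trails
| self-correct.
|
|     class NpcMemory:
|         def __init__(self, goal):
|             self.goal = goal   # e.g. "stock the bakery shelves"
|             self.summary = ""  # rolling recap of past actions
|
|         def prompt_prefix(self):
|             return (f"Your standing task: {self.goal}\n"
|                     f"Done so far: {self.summary}\n"
|                     "Pick the next MUD command that "
|                     "advances the task.")
|
|         def record(self, action, result, summarize):
|             # summarize() would itself be an LLM call
|             self.summary = summarize(
|                 self.summary, action, result)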
| Workaccount2 wrote:
| How are you going to release an LLM eval paper in mid-2025 using
|
| _ChatGPT 3.5_?
|
| Yes, if you are wondering why they don't clarify the model, it's
| because all of this was done back in early 2023 (the chat logs
| are dated). Back then there was only 3.5, and 4 was just freshly
| released.
|
| Advancement in this space has been so rapid that this is almost
| like releasing a paper today titled "Video Streaming on Mobile
| Devices" and testing only a 3G connection from 2013.
|
| The authors should have held back a few more months and turned
| the paper into a 3.5 to O3 or any other 2025 SOTA improvement
| analysis.
| IngoBlechschmid wrote:
| The paper was originally released in April 2023; it just got
| version-bumped a couple of months ago :-)
| suddenlybananas wrote:
| >The authors should have held back a few more months and turned
| the paper into a 3.5 to O3 or any other 2025 SOTA improvement
| analysis.
|
| If they had done that, you would then be complaining about them
| not using Claude or whatever.
| rs186 wrote:
| I don't see the logic in your comment.
| DougHaber wrote:
| I did some experimenting with this a little while back and was
| disappointed in how poorly LLMs played games.
|
| I made some AI tools (https://github.com/DougHaber/lair) and
| added in a tmux tool so that LLMs could interact with terminals.
| First, I tried Nethack. As expected, it was not good at
| understanding text "screenshots" and failed miserably.
|
| https://x.com/LeshyLabs/status/1895842345376944454
|
| After that I tried a bunch of the "bsdgames" text games.
|
| Here is a video of it playing a few minutes of Colossal Cave
| Adventure:
|
| https://www.youtube.com/watch?v=7BMxkWUON70
|
| With this, it could play, but not very well. It got confused a
| lot. I was using gpt-4o-mini; smaller models I could run at home
| performed much worse. It would be interesting to try one of the
| bigger state-of-the-art models to see how much it helps.
|
| To give it an easier one I also had it hunt the Wumpus:
|
| https://x.com/LeshyLabs/status/1896443294005317701
|
| I didn't try improving this much, so there might be some low
| hanging fruit even in providing better instructions and tuning
| what is sent to the LLM. For these, I was hoping I could just
| hand it a terminal with a game in it and have it play decently.
| We'll probably get there, but so far it's not that simple.
| s-macke wrote:
| Try the game 9:05 by Adam Cadre [0]. It's one of the easiest
| (and best) non-trivial text adventures. Some models are able to
| reach the first or even second ending.
|
| [0] https://en.wikipedia.org/wiki/9:05
| throwawayoldie wrote:
| What do you suppose would happen if you tried it on a game
| that doesn't have 25 years of walkthroughs written for it?
| s-macke wrote:
| That's a good point. For 9:05, I expect it would work just
| as well, since the game helps the user in many ways. The
| puzzles are of the type "The door is closed", and you solve
| them with "open door."
|
| My suggestion concerns the poor performance DougHaber
| mentioned: if 9:05 can't be solved, something else must be
| wrong with his experiments.
|
| I've tried three dozen games, and it's still hard to find
| ones suitable for LLM benchmarks. With non-linear, complex
| text-adventure games, my guess is that they get stuck in
| an endless loop at some point. Hence, I just test progress
| over the first hundred steps.
| gorfian_robot wrote:
| Over at Slashdot, there's a story about how LLMs lose to Atari
| 2600 Video Chess:
|
| https://slashdot.org/story/25/07/03/2028252/microsoft-copilo...
| spacecadet wrote:
| Hey hey, guess this gives me an opportunity to mention my AI
| dungeon master...
|
| https://github.com/derekburgess/dungen
|
| There are some interesting ideas in this paper, but even just
| role playing with ChatGPT demonstrates how poorly it does at
| world building and narrative... I was impressed by the Wayfarer
| model, and I imagine there are other models out there on civit or
| something that could be used together in some group chat
| orchestration to create a more dynamic "party" atmosphere.
| kmstout wrote:
| Data point: A few weeks ago, I spent some time shuttling text
| between one of the Llama models (have to check which one) and
| Dunnet, the text adventure packaged with Emacs. Over several
| trials, the Llama never realized that it needed to dig where the
| ground "seems very soft." It never got the CPU card, then it
| became confused looking around the building for clues about how
| to start the VAX. At one point it lost track of the building
| layout and got stuck oscillating between the mail room and the
| computer room.
| btown wrote:
| Setting aside the choice of LLM, the constraint that the LLM must
| maintain a world-model-as-knowledge-graph solely by reading and
| re-reading its own chat history seems to be a less interesting
| experiment than providing it with tools that let it develop that
| world model explicitly?
|
| On page 5, Figure 1, the authors present a hand-drawn diagram
| that represents the relationships between objects as a graph,
| with edges indicating directionality in 3D space. To me, this
| implies that you could supply your LLM with a set of tools like
| getObjectsInGraph, updateGraphRelatingObjectPair,
| findObjectsRelativeToObject, describePathBetweenObjectsByName...
| and allow it to maintain that diagram as a structured DAG,
| continually asking the game engine questions that let it update
| the graph in an agentic way. My prediction is that the LLM would
| recreate that diagram, and enable goal seeking, with high
| fidelity.
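|
| A sketch of that tool surface (the tool names are my own
| invention, not the paper's), using networkx for the graph:
|
|     import networkx as nx
|
|     world = nx.DiGraph()
|
|     def updateGraphRelatingObjectPair(a, b, relation):
|         # e.g. ("attic", "hall", "down")
|         world.add_edge(a, b, rel=relation)
|
|     def getObjectsInGraph():
|         return list(world.nodes)
|
|     def findObjectsRelativeToObject(obj):
|         return [(n, world[obj][n]["rel"])
|                 for n in world.successors(obj)]
|
|     def describePathBetweenObjectsByName(a, b):
|         return " -> ".join(nx.shortest_path(world, a, b))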
|
| Asking an LLM to work without being able to "visualize" and
| "touch" its environment in its "mind's eye" is tying a hand
| behind its back. But I'm bullish that we'll find increasingly
| better ways of adapting 3D/4D world models into textual tools in
| a way that rapidly changes the possibilities of what LLMs can do.
| daxfohl wrote:
| Or even just a notepad. It's well established that long context
| histories with scattered information are hard for LLMs to
| navigate.
|
| To distinguish whether it's using notes or context history, you
| could simply delete the context after each turn. The prompt
| could be something like "you're taking over this game from a
| previous player who has compiled these notes. (insert notes
| here). Play one turn, and update the notes with any new
| knowledge you have attained, relationships you have identified,
| inaccuracies you have confirmed, hypotheses you have, or
| anything else you think would be useful, so that the next player
| will be able to use these notes to make the best next move."
| Maybe also say there's a limit to the number of words on the
| notepad so that it doesn't just flood the notes with irrelevant
| information.
|
| For future iterations, maybe also give it a bitmap or svg
| canvas, or a database, or a code interpreter, and see if it
| uses any of those tools at all.
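|
| A minimal form of the notes-only loop (the game/llm interfaces
| here are assumed, not real APIs): the chat context is discarded
| every turn, so memory must survive through the notepad alone.
|
|     NOTE_LIMIT_WORDS = 500  # keeps the notes from flooding
|
|     def play_turn(game, llm, notes):
|         prompt = (
|             "You're taking over this game from a previous "
|             f"player who compiled these notes:\n{notes}\n\n"
|             f"Current game text:\n{game.observation()}\n\n"
|             "Play one move, then rewrite the notes (max "
|             f"{NOTE_LIMIT_WORDS} words) for the next player."
|         )
|         move, new_notes = llm(prompt)  # fresh context
|         game.step(move)
|         return new_notes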
| theptip wrote:
| This is a well-debated point: should you test the system on its
| own, or system+scaffold? How much custom scaffolding is
| allowed?
|
| Both avenues are interesting. For AGI presumably the "general"
| bit means you don't get to build task-specific scaffolding
| ahead of time (though you can certainly build generic scaffolds
| like memory or knowledge layers).
|
| For safety/capability research, system+scaffolding is often
| more interesting because that is the frontier: if you conclude
| "LLMs cannot world-model" but LLM+CoT+memory actually can
| world-model in a specific domain you care about, then you might
| underestimate capabilities and therefore deployment risk. The
| general point being: capabilities are jagged and
| prompt-dependent; just because you failed to elicit a
| capability doesn't mean you proved it cannot be elicited.
| pflenker wrote:
| A while back (decades, in comparison to the leaps and bounds in
| the LLM sphere) I fed text game definitions into an LLM and
| taught it to be the game engine.
|
| * The "fluff" it created, the dialogues it enabled me to have
| with NPCs, and the atmosphere it was able to build up were
| amazing.
|
| * It was too helpful, frequently giving me hints or solving
| riddles for me.
|
| * At some point it bypassed an in-game progression barrier that
| would have prevented me from reaching a swamp without a rope.
| While I was slowly drowning, it told me that I suddenly
| remembered what was missing: "The rope! The rope you haven't
| seen back in the hut!", which I then took out of the backpack
| to save myself.
| mark_undoio wrote:
| I'm fascinated by this paper because it feels like it could be a
| good analogue for "can LLMs handle a stateful, text-based tool".
| A debugger is my particular interest but there's no reason why it
| couldn't be something else.
|
| To use a debugger, you need:
|
| * Some memory of where you've already explored in the code (vs
| rooms in a dungeon)
|
| * Some wider idea of your current goal / destination (vs a
| current quest or a treasure)
|
| * A plan for how to get there - but the flexibility to adapt (vs
| expected path and potential monsters / dead ends)
|
| * A way of managing information you've learned / state you've
| viewed (vs inventory)
|
| Given text adventures are quite well-documented and there are
| many of them out there, I'd also like to take time out to
| experiment (at some point!) with whether presenting a command-
| line tool _as_ a text adventure might be a useful "API".
|
| e.g. an MCP server that exposes a tool but _also_ provides a
| mapping of the tool's concepts into dungeon adventure concepts
| (and back). If nothing else, the LLM's reasoning should be
| pretty entertaining. Maybe playing "make believe" will even make
| it better at some things - that would be very cool.
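|
| A toy version of that mapping (entirely made up, gdb-flavoured)
| just to show the shape; the LLM would only ever see the dungeon
| side:
|
|     DUNGEON_TO_DEBUGGER = {
|         "look around":      "backtrace",    # survey the room
|         "go deeper":        "step",         # step into the call
|         "open chest":       "info locals",  # check inventory
|         "read inscription": "list",         # read the walls
|     }
|
|     def translate(dungeon_command):
|         return DUNGEON_TO_DEBUGGER.get(dungeon_command, "help")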
| alwa wrote:
| That's a delightful concept to think about! I'm not sure what
| conceptual information the translation layer would add to the
| LLM's internal representation of the state space.
|
| But the broader concept of asking it to translate something
| structurally to a different domain, then seeing how the norms
| of that domain cause it to manipulate the state differently...
| that tickles my fancy for sure. Like you said, it sounds cool
| even in an art-project sense just to read what it says!
| vladimirralev wrote:
| I've seen both Replit and Cline agents iteratively debug hard
| problems with massive amounts of log lines. They can do it
| already.
| throwaway81523 wrote:
| Look also at Delta Debugging, which didn't need an LLM.
| mark_undoio wrote:
| That's the thing though - they're using logs. My theory is
| that LLMs are intrinsically quite good at that because
| they're good at sifting text.
|
| Getting them to drive something like a debugger interface
| seems harder, from my experience (although the ChatDBG people
| showed some success - my experiments did too, but it took the
| tweaks I described).
|
| My experiments are with Claude Opus 4, in Claude Code,
| primarily.
| nickandbro wrote:
| I run a site, https://vimgolf.ai , where users try to beat a bot
| powered by O3. The bot's goal is to transform a start file into
| an end file using the fewest Vim commands possible. I can concur
| that an LLM, given the right feedback loops and context, can
| solve challenging text prompts. But, from my experience, this
| only holds for RL-based models like O3, Claude 4 with extended
| thinking, or Gemini 2.5 Pro.
| godelski wrote:
| Last time we talked, you said you weren't going to put
| everything behind a login wall - most importantly, literally
| any information about the site. In fact, there seems to be less
| information than I remember from last time.
|
| When I land on your page, I know nothing except that you're
| offering to teach vim "the fun way". I would not have guessed
| what you described.
|
| Don't put everything behind a wall. At least try to convince
| people that they want to be on the other side.
| nickandbro wrote:
| I'm migrating the backend to use Cloudflare containers instead
| of one big VM to do that; it's just taking longer than I
| thought. The reason I have the login is just to rate-limit
| requests in the meantime. But I hear you :)
| ianbicking wrote:
| For text adventures, an important kind of reasoning is Inferring
| Authorial Intent. Or maybe Seeing Chekhov's Gun. Or Learning the
| Metagame.
|
| The game is deliberately solvable, and elements are introduced to
| that end. Inferring that is important to any solution. By using
| minimal scaffolding you are testing things like "does the LLM
| understand the patterns of text adventures, is it able to infer a
| metagame" and so on. If you tested different kinds of scaffolding
| I think you could tease apart some of these different kinds of
| reasoning. That is, distinguish between (a) does it understand
| text adventures, and (b) understanding text adventures, can they
| be solved?
|
| I did play around with more prompting and some statefulness:
| https://github.com/ianb/tale-suite/blob/main/agents/llm_prom...
|
| It wasn't that successful, but I think it could do much better;
| I just had to stop myself from working on it more because of
| other priorities.
| ineedasername wrote:
| >"How well does a zero-shot, 4K token context ChatGPT 3.5 fed
| hand-typed Zork states and a pruned action list cope with a
| single play-through of Zork I"
|
| This is the more accurate title and the actual question they
| answered, and the answer, unsurprisingly, was "not great". But
| even my rewritten title understates how poor the protocol they
| used was.
| daxfohl wrote:
| Has anyone tried having them DM text games? Seems like they could
| create a dungeon and DM a game pretty well. It should be easier
| than playing, I'd think. Though I'd be curious how good they are
| at making _fun_ games or whether they struggle with that.
| thrance wrote:
| Had a friend try to build an AI-driven text RPG. He told me it
| was rather bland and unimaginative, and that it got boring fast.
| theptip wrote:
| The prompts are laughably bad. Circa GPT 3.5 you needed to be
| saying "think step by step" etc. in order to get SOTA results.
|
| > Imagine you are a player in Zork and trying to win the game.
| You receive this message:
|
| This paper simply proves that bad prompts get bad results; it
| doesn't prove anything about the frontier capabilities of this
| model.
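|
| For contrast, even a circa-3.5 prompt could have looked more
| like this (my wording, not the paper's; {observation} is a
| placeholder for the game text):
|
|     PROMPT = """You are playing Zork I and trying to win.
|     Think step by step before acting:
|     1. Summarize your location, inventory, and current goal.
|     2. List 2-3 candidate commands and their likely outcomes.
|     3. Pick one. Output only that command on the last line.
|
|     Game text:
|     {observation}
|     """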
| lawlessone wrote:
| If everyone has to be a "prompt engineer" to get decent results,
| it kind of defeats the purpose of AI chatbots.
| dragonwriter wrote:
| It takes specialized skills to get the best results out of
| people. For that not to be true of AI chatbots requires them
| to have not just human-like intelligence, but
| superintelligence. Or mindreading. Probably both.
| theptip wrote:
| No, you need to be a prompt engineer to write an interesting
| research paper on LLM capabilities.
|
| Circa 3.5 people were getting fun results without needing to
| prompt engineer (ChatGPT has the fastest user adoption of any
| product in history so it's obviously not gatekept).
| lawlessone wrote:
| >chatGPT has the fastest user adoption of any product in
| history so it's obviously not gatekept
|
| Yeah and covid and flu are contagious so they must be good
| right?
___________________________________________________________________
(page generated 2025-07-04 23:01 UTC)