[HN Gopher] Evaluating LLMs Playing Text Adventures
       ___________________________________________________________________
        
       Evaluating LLMs Playing Text Adventures
        
       Author : todsacerdoti
       Score  : 84 points
       Date   : 2025-08-12 15:19 UTC (7 hours ago)
        
 (HTM) web link (entropicthoughts.com)
 (TXT) w3m dump (entropicthoughts.com)
        
       | throwawayoldie wrote:
       | My takeaway is: LLMs are not great at text adventures, even when
       | those text adventures are decades old and have multiple
       | walkthroughs available on the Internet. Slow clap.
        
       | ForHackernews wrote:
       | What blogging software is this with the sidenotes?
        
         | hombre_fatal wrote:
          | Noticed it was written in org mode with custom CSS, so I found
         | this post on their site: https://entropicthoughts.com/new-and-
         | improved-now-powered-by...
        
         | kqr wrote:
         | Some details of the side notes in particular are given here:
         | https://entropicthoughts.com/sidenotes-footnotes-inlinenotes
        
       | the_af wrote:
       | I know they define "achievements" in order to measure "how well"
       | the LLM plays the game, and by definition this is arbitrary. As
       | an experiment, I cannot argue with this.
       | 
        | However, I _must_ point out that the kind of "modern" (relatively
       | speaking) adventure games mentioned in the article -- which are
       | more accurately called "interactive fiction" by the community --
       | is not very suitable for this kind of experiment. Why? Well,
       | because so many of them are exploratory/experimental, and not at
       | all about "winning" (unlike, say, "Colossal Cave Adventure",
       | where there is a clear goal).
       | 
       | You cannot automate (via LLM) "playing" them, because they are
       | all about the thoughts and emotions (and maybe shocked laughter)
       | they elicit in _human_ players. This cannot be automated.
       | 
       | If you think I'm being snobby, consider this: the first game TFA
       | mentions is "9:05". Now, you _can_ set goals for a bot to play
        | this game, but truly -- if you've played the game -- you know
       | this would be completely missing the point. You cannot "win" this
       | game, it's all about subverting expectations, and about replaying
       | it once you've seen the first, most straightforward ending, and
       | having a laugh about it.
       | 
       | Saying more will spoil the game :)
       | 
       | (And do note there's no such thing as "spoiling a game" for an
       | LLM, which is precisely the reason they cannot truly "play" these
       | games!)
        
         | kqr wrote:
         | I disagree. Lockout, Dreamhold, Lost Pig, and So Far are new
         | games but in the old style. Plundered Hearts is literally one
         | of the old games (though ahead of its time).
         | 
         | I'll grant you that 9:05 and For a Change are somewhat more
         | modern: the former has easy puzzles, the latter very abstract
         | puzzles.
         | 
          | I disagree that new text adventures are not about puzzles and
         | winning. They come in all kinds of flavours these days. Even
         | games like 9:05 pace their narrative with traditional puzzles,
         | meaning we can measure forward progress just the same. And to
         | be fair, LLMs are so bad at these games that in these articles,
         | I'm merely trying to get them to navigate the world _at all_.
         | 
         | If anything, I'd argue _Adventure_ is a bad example of the
         | genre you refer to. It was (by design) more of a caving
            | simulator/sandbox with optional loot than a game with progress
         | toward a goal.
        
           | dfan wrote:
           | As the author of For A Change, I am astonished that anyone
           | would think it was a good testbed for an LLM text adventure
           | solver. It's fun that they tried, though.
        
             | kqr wrote:
             | Thank you for making it. The imagery of it is striking and
             | comes back to me every now and then. I cannot unhear "a
             | high wall is not high to be measured in units of length,
             | but of angle" -- beautifully put.
             | 
              | The idea was that it'd be a good example of having to
             | navigate somewhat foreign but internally consistent worlds,
             | an essential text adventure skill.
        
               | dfan wrote:
               | Ha, I didn't realize that I was replying to the person
               | who wrote the post!
               | 
               | The audience I had in mind when writing it was people who
               | were already quite experienced in playing interactive
               | fiction and could then be challenged in a new way while
               | bringing their old skills to bear. So it's sort of a
               | second-level game in that respect (so is 9:05, in
               | different ways, as someone else mentioned).
        
           | the_af wrote:
           | We will have to agree to disagree, if you'll allow me the
           | cliche.
           | 
            | I didn't use Adventure as an example of IF; it belongs in the
            | older "text adventure" genre, which is why I thought it would
            | be more fitting to test LLMs, since it's not about
           | experiences but about maxing points.
           | 
            | I think there's nothing about IF that an LLM can "solve".
            | This genre of games, in its modern expression, is
           | about breaking boundaries and expectations, and making the
           | player enjoy this. Sometimes the fun is simply seeing
           | different endings and how they relate to each other. Since
           | LLMs cannot experience joy or surprise, and can only
           | mechanically navigate the game (maybe "explore all possible
           | end states" is a goal?), they cannot "play" it. Before you
           | object: I'm aware you didn't claim the LLMs are _really_
           | playing the game!
           | 
           | But here's a test for your set of LLMs: how would they "win"
           | at "Rematch"? This game is about repeatedly dying,
           | understanding what's happening, and stringing together a
           | single sentence that will break the cycle and win the game.
            | Can any LLM solve this straightforward puzzle? I'd be
           | impressed!
        
             | kqr wrote:
             | I think I see what you mean and with these clarifications
              | we are in agreement. There are a lot of modern works of
              | interactive fiction that go way beyond what the old text
              | adventures did, and work even when judged as art or
             | literature. I just haven't played much of it because I'm a
             | fan of the old-style games.
             | 
             | As for the specific question, they would progress at
             | Rematch by figuring out ever more complicated interactions
              | that work and can then be used to survive, naturally.
        
         | fmbb wrote:
         | Of course you can automate "having fun" and "being
         | entertained". That is if you believe humanity will ever build
         | artificial intelligence.
        
           | drdeca wrote:
           | A p-zombie would not have fun or be entertained, only act
           | like it does. I don't think AGI requires being unlike a
           | p-zombie in this way.
        
           | the_af wrote:
           | > _Of course you can automate "having fun" and "being
           | entertained"_
           | 
           | This seems like begging the question to me.
           | 
           | I don't think there's a mechanistic (as in "token predictor")
           | procedure to generate the emotions of having fun, or being
           | surprised, or amazed. It's not on me to demonstrate it cannot
           | be done, it's on _them_ to demonstrate it can.
           | 
           | But to be clear, I don't think the author of TFA is making
           | this claim either. They are simply approaching IF games from
           | a "problem solving" perspective -- they don't claim this has
           | anything to do with fun or AGI -- and what I'm arguing is
           | that this mechanistic approach to IF games, i.e. "problem
           | solving", only touches on a small subset of what makes people
           | want to play these games. They are often (not all, as the
           | author rightly corrects me, but often) about generating
           | surprise and amazement in the player, something that cannot
           | be done to an LLM.
           | 
           | (Note I'm also not dismissing the author's experiment. As an
           | experiment it's interesting and, I'd argue, _fun for the
            | author_).
           | 
            | Current state-of-the-art LLMs cannot feel amazement, or
            | anything else really (and, I argue, no LLM in the current
            | tech branch ever will). I hope this isn't a controversial
           | statement.
        
         | Terr_ wrote:
          | That's like saying it's _wrong_ to test a robot's ability to
         | navigate and traverse a mountain... because the mountain has no
         | win-condition and is really a _context for human emotional
         | experiences._
         | 
         | The purpose of the test is whatever the tester decides it is.
         | If that means finding X% of the ambiguously-good game endings
         | within a budget of Y commands, then so be it.
        
           | the_af wrote:
           | > _The purpose of the test is whatever the tester decides it
           | is._
           | 
           | Well, I did say:
           | 
           | > _As an experiment, I cannot argue with this._
           | 
           | It was more a reflection on the fact that the primary goal of
            | a lot of modern IF games, among them "9:05", the first game
            | mentioned in TFA, is not like "traversing a mountain".
            | Traversing a mountain can have clear and meaningful goals,
            | such as "reach the summit", or "avoid
           | getting stuck", or "do not die or go missing after X hours".
           | Though of course, appreciating nature and sightseeing is
           | beyond the scope of an LLM.
           | 
           | Indeed, "9:05" has no other "goal" than, upon seeing a
           | different ending from the main one, revisiting the game with
           | the knowledge gained from that first playthrough. I'm being
           | purposefully opaque in order not to spoil the game for you
           | (you should play it, it's really short).
           | 
           | Let me put it another way: remember that fad, some years ago,
           | of making you pay attention to an image or video, with a
           | prompt like "colorblind people cannot see this shape after X
           | seconds" so you pay attention and then BAM! A jump scare!
           | Haha, joke's on you!
           | 
            | How would you "test" an LLM on such a jump scare? The goal is
            | to scare a human. LLMs cannot be scared. What would the
            | possible answers be?
           | 
           | A: I do not see any disappearing shapes after X seconds. Beep
           | boop! I must not be colorblind, nor human, for I am an LLM.
           | Beep!
           | 
           | or maybe
           | 
           | B: This is a well-known joke. Beep boop! After some short
           | time, a monster appears on screen. This is intended to scare
           | the person looking at it! Beep!
           | 
           | Would you say either response would show the LLM "playing"
           | the game?
           | 
           | (Trust me, this is a somewhat adjacent effect to what "9:05"
           | would play on you, and I fear I've said too much!)
        
       | benlivengood wrote:
       | Wouldn't playthroughs for these games be potentially in the
       | pretraining corpus for all of these models?
        
         | throwawayoldie wrote:
         | As a longtime IF fan, I can basically guarantee there are.
        
         | quesera wrote:
         | Reproducing specific chunks of long form text from distilled
         | (inherently lossy) model data is _not_ something that I would
         | expect LLMs to be good at.
         | 
         | And of course, there's no actual reasoning or logic going on,
         | so they cannot compete in this context with a curious 12 year
         | old, either.
        
       | jameshart wrote:
       | Nothing in the article mentioned how good the LLMs were at even
       | entering valid text adventure commands into the games.
       | 
       | If an LLM responds to "You are standing in an open field west of
       | a white house" with "okay, I'm going to walk up to the house",
       | and just gets back "THAT SENTENCE ISN'T ONE I RECOGNIZE", it's
       | not going to make much progress.
        
         | throwawayoldie wrote:
         | "You're absolutely right, that's not a sentence you
         | recognize..."
        
         | kqr wrote:
         | The previous article (linked in this one) gives an idea of
         | that.
        
           | jameshart wrote:
            | I did see that. But since that really focused on how Claude
           | handled that particular prompt format, it's not clear whether
           | the LLMs that scored low here were just failing at producing
           | valid input, struggled to handle that specific prompt/output
           | structure, or were doing fine at basically operating the text
           | adventure but were struggling at building a world model and
           | problem solving.
        
             | kqr wrote:
             | Ah, I see what you mean. Yeah, there was too much output
             | from too many models at once (combined with not enough
             | spare time) to really perform useful qualitative analysis
             | on all the models' performance.
        
       | fzzzy wrote:
        | I tried this earlier this year. I wrote a tool that let an LLM
       | play Zork. It was pretty fun.
        
         | bongodongobob wrote:
         | Did you do anything special? I tried this with just copy and
         | paste with GPT-4o and it was absolutely terrible at it. It
         | usually ended up spamming help in a loop and trying commands
         | that didn't exist.
        
           | fzzzy wrote:
           | I have my own agent loop that I wrote, and I gave it a tool
           | which it uses to send input to the parser. I also had a step
           | which took the previous output and generated an image for it.
           | It was just a toy, but it was pretty fun.
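            | 
            | Roughly the shape of it, as a minimal sketch -- dfrotz
            | standing in for the interpreter, and llm() for the agent
            | loop and its tool-call plumbing:
            | 
            |     import subprocess
            | 
            |     # Run the story through dfrotz (dumb frotz, plain ASCII
            |     # output) and let the model choose each command.
            |     game = subprocess.Popen(
            |         ["dfrotz", "-p", "zork1.z5"],
            |         stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            |         text=True,
            |     )
            | 
            |     def read_until_prompt():
            |         # Naive: read a character at a time until the "> "
            |         # prompt appears.
            |         out = ""
            |         while not out.endswith("> "):
            |             out += game.stdout.read(1)
            |         return out
            | 
            |     history = []
            |     for _ in range(100):            # turn budget
            |         history.append(read_until_prompt())
            |         command = llm(history)      # tool picks next input
            |         game.stdin.write(command + "\n")
            |         game.stdin.flush()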
        
       | lottaFLOPS wrote:
       | related research that was also announced this week:
       | https://www.textquests.ai/
        
         | 1970-01-01 wrote:
         | Very interesting how they all clearly suck at it. Even with
         | hints, they can't understand the task enough to complete the
         | game.
        
         | abraxas wrote:
          | That's a great tracker. How often is the leaderboard updated?
        
         | kqr wrote:
         | They seem to be going for a much simpler route of just giving
         | the LLM a full transcript of the game with its own reasoning
         | interspersed. I didn't have much luck with that, and I'm
         | worried it might not be effective once we're into the hundreds
         | of turns because of inadvertent context poisoning. It seems
         | like this might indeed be what happens, given the slowing of
         | progress indicated in the paper.
        
       | andrewla wrote:
       | The article links to a previous article discussing methodology
       | for this. The prompting is pretty extensive.
       | 
       | It is difficult here to separate out how much of this could be
       | fixed or improved by better prompting. A better baseline might be
       | to just give the LLM direct access to the text adventure, so that
       | everything the LLM replies is given to the game directly. I
       | suspect that the LLMs would do poorly on this task, but would
       | undoubtedly improve over time and generations.
       | 
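        | As a rough sketch of that baseline (llm() and send_to_game()
        | are placeholders for the model call and the game interface):
        | 
        |     # Everything the model says goes to the parser verbatim;
        |     # the game's text is the only feedback it gets back.
        |     transcript = "You are playing a text adventure game.\n"
        |     for _ in range(50):
        |         reply = llm(transcript)         # raw model output
        |         response = send_to_game(reply)  # fed straight to the game
        |         transcript += "\n> " + reply + "\n" + response
        | 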
       | EDIT: Just started playing 9:05 with GPT-4 with no prompting and
       | it did quite poorly; kept trying to explain to me what was going
       | on with the ever more complex errors it would get. Put in a one
       | line "You are playing a text adventure game" and off it went --
       | it took a shower and got dressed and drove to work.
        
       | SquibblesRedux wrote:
       | This is another great example of how LLMs are not really any sort
       | of AI, or even proper knowledge representation. Not saying they
        | don't have their uses (like souped-up search and permutation
       | generators), but definitely not something that resembles
       | intelligence.
        
         | nonethewiser wrote:
          | While I agree, it's still shocking how far next-token
          | prediction gets us toward looking like intelligence. It's
          | amazing we need examples such as this to demonstrate it.
        
           | SquibblesRedux wrote:
           | Another way to think about it is how interesting it is that
           | humans can be so easily influenced by strings of words. (Or
           | images, or sounds.) I suppose I would characterize it as so
           | many people being earnestly vulnerable. It all makes me think
           | of Kahneman's [0] System 1 (fast) and System 2 (slow)
           | thinking.
           | 
           | [0] "Thinking, Fast and Slow"
           | https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow
        
           | seba_dos1 wrote:
           | It is kinda shocking, but I'm sure ELIZA was too for many
            | people back then. It just took less time to realize what was
           | going on there.
        
       | henriquegodoy wrote:
       | Looking at this evaluation it's pretty fascinating how badly
       | these models perform even on decades old games that almost
       | certainly have walkthroughs scattered all over their training
       | data. Like, you'd think they'd at least brute force their way
       | through the early game mechanics by now, but honestly this kinda
        | validates something I've been thinking about: real
        | intelligence isn't just about having seen the answers before,
        | it's about being good at games and specifically new situations
        | where you can't just pattern match your way out.
       | 
       | This is exactly why something like arc-agi-3 feels so important
       | right now. Instead of static benchmarks that these models can
        | basically brute force with enough training data, it's designed
        | around interactive environments where you actually need to
        | perceive, decide, and act over multiple steps without prior
        | instructions. That shift from "can you reproduce known patterns"
       | to "can you figure out new patterns" seems like the real test of
       | intelligence.
       | 
       | What's clever about the game environment approach is that it
       | captures something fundamental about human intelligence that
        | static benchmarks miss entirely: when humans encounter a new
        | game, we explore, form plans, remember what worked, and adjust
        | our strategy -- all the interactive reasoning over time that
        | these text adventure results show LLMs are terrible at. We need
        | systems that can actually understand and adapt to new
        | situations, not just really good autocomplete engines that
        | happen to know a lot of trivia.
        
         | msgodel wrote:
         | I've been experimenting with this as well with the goal of
         | using it for robotics. I don't think this will be as hard to
         | train for as people think though.
         | 
         | It's interesting he wrote a separate program to wrap the
         | z-machine interpreter. I integrated my wrapper directly into my
         | pytorch training program.
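          | 
          | Something like a gym-style environment, as a sketch (the
          | interpreter API here is invented for illustration):
          | 
          |     # Z-machine wrapper a training loop can consume directly.
          |     class ZMachineEnv:
          |         def __init__(self, story_file):
          |             # Placeholder for however the interpreter is hooked up.
          |             self.interp = start_interpreter(story_file)
          | 
          |         def reset(self):
          |             # Restart and return the opening room description.
          |             return self.interp.restart()
          | 
          |         def step(self, command):
          |             text = self.interp.send(command)  # placeholder API
          |             reward = score_delta(text)        # e.g. parse the score line
          |             return text, reward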
        
         | da_chicken wrote:
         | I saw it somewhere else recently, but the idea is that LLMs are
         | language models, not world models. This seems like a perfect
         | example of that. You need a world model to navigate a text
         | game.
         | 
            | Otherwise, how can you determine that "North" is sometimes a
            | context change, but not always?
        
           | manbash wrote:
            | Thanks for this. I was struggling to put it into words, even
            | if this has perhaps long been a known distinguishing factor
            | for others.
        
           | zahlman wrote:
           | > I saw it somewhere else recently, but the idea is that LLMs
           | are language models, not world models.
           | 
           | Part of what distinguishes humans from artificial
           | "intelligence" to me is exactly that we _automatically_
           | develop models of _whatever is needed_.
        
           | lubujackson wrote:
           | Why, this sounds like Context Engineering!
        
         | godelski wrote:
         | > real intelligence isn't just about having seen the answers
         | before, it's about being good at games and specifically new
         | situations where you can't just pattern match your way out
         | 
         | It is insane to me that so many people believe intelligence is
          | measurable by pure question-answer testing. There are hundreds
          | of years of discussion about how limited this is for measuring
          | human intelligence. I'm sure we _all_ know someone who's a
          | really good test taker but whom you wouldn't consider to be
          | really bright. I'm sure every single one of us also knows
          | someone in the other camp (bad at tests but considered bright).
         | 
         | The definition you put down is much more agreed upon in the
          | scientific literature. While we don't have a good formal
          | definition of intelligence, that is different from having no
          | definition at all. I really do hope people read more about
          | intelligence and how we measure it in humans and animals. It is
          | very messy and there's a lot of noise, but at least we have a
          | good idea of the directions to move in. There are still nuances
          | to be learned, and while I think ARC is an important test, I
          | don't think success on it will prove AGI (and Chollet says this
          | too).
        
         | rkagerer wrote:
         | Hi, GPT-x here. Let's delve into my construction together. My
         | "intelligence" comes from patterns learned from vast amounts of
         | text. I'm trained to... oh look it's a butterfly. Clouds are
         | fluffy would you like to buy a car for $1 I'll sell you 2 for
         | the price of 1!
        
           | corobo wrote:
           | Ah dammit the AGI has ADHD
        
       | wiz21c wrote:
        | Adventure games require spatial reasoning (although text-based),
        | understanding puns, cultural references, etc. For me they really
        | need human intelligence to be solved (heck, they've been designed
        | that way).
       | 
        | I find it funny that some AIs score very well on ARC-AGI but
        | fail at these games...
        
       | andai wrote:
       | The GPT-5 used here is the Chat version, presumably gpt-5-chat-
       | latest, which from what I can tell is the same version used in
       | ChatGPT, which is not actually a model but a "system" -- a router
       | that semi-randomly forwards your request to various different
       | models (in a way designed to massively reduce costs for OpenAI,
       | based on people reporting inconsistent output and often worse
       | results than 4o).
       | 
       | So from this it seems that not only would many of these requests
       | not touch a reasoning model (or as it works now, have reasoning
       | set to "minimal"?), but they're probably being routed to a mini
       | or nano model?
       | 
       | It would make more sense, I think, to test on gpt-5 itself (and
       | ideally the -mini and -nano as well), and perhaps with different
       | reasoning effort, because that makes a big difference in many
       | evals.
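        | 
        | E.g. something along these lines against the API, pinning the
        | model and effort explicitly rather than letting a router decide
        | (a sketch; check the parameters against the current docs):
        | 
        |     from openai import OpenAI
        | 
        |     client = OpenAI()
        |     resp = client.responses.create(
        |         model="gpt-5",                  # not the chat router
        |         reasoning={"effort": "high"},
        |         input="You are playing a text adventure game. ...",
        |     )
        |     print(resp.output_text)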
       | 
       | EDIT: Yeah the Chat router is busted big time. It fails to apply
       | thinking even for problems that obviously call for it (analyzing
       | financial reports). You have to add "Think hard." to the end of
       | the prompt, or explicitly switch to the Thinking model in the UI.
        
         | kqr wrote:
         | This is correct, and was the reason I made sure to always
         | append "Chat" to the end of "GPT-5". I should perhaps have been
         | more clear about this. The reason I settled for the lesser
         | router is I don't have access to the full GPT-5, which would
         | have been a much better baseline, I agree.
        
           | andai wrote:
            | Do they require a driver's license to use it? They asked for
            | my ID for o3 Pro a few months ago.
        
             | kqr wrote:
             | That's the step at which I gave up, anyway.
        
         | varenc wrote:
         | > Yeah the Chat router is busted big time... You have to add
         | "Think hard." to the end of the prompt, or explicitly switch to
         | the Thinking model in the UI.
         | 
         | I don't really get this gripe? It seems no different than
         | before, except now it will sometimes opt into thinking harder
         | by itself. If you know you want CoT reasoning you just select
         | gpt5-thinking, no different than choosing o4-mini/o3 like
         | before.
        
       | seanwilson wrote:
        | I won't be surprised if LLMs get good at puzzle-heavy text
        | adventures once more attention is turned to this.
       | 
       | I've found for text adventures based on item manipulation,
       | variations of the same puzzles appear again and again because
       | there's a limit to how many obscure but not too obscure item
       | puzzles you can come up with, so training would be good for exact
       | matches of the same puzzle, and variations, like different ways
       | of opening locked doors.
       | 
       | Puzzles like key + door, crowbar + panel, dog + food, coin +
       | vending machine, vampire + garlic etc. You can obscure or layer
       | puzzles, like changing the garlic into garlic bread which would
        | still work on the vampire, so there's a logical connection to
       | make but often nothing too crazy.
       | 
       | A lot of the difficulty in these games comes from not noticing or
       | forgetting about clues/hints and potential puzzles because
       | there's so much going on, which is less likely to trip up a
       | computer.
       | 
       | You can already ask LLMs "in a game: 20 ways to open a door if I
       | don't have the key", "how to get past an angry guard dog" or "I'm
       | carrying X, Y, and Z, how do I open a door", and it'll list lots
       | of ways that are seen in games, so it's going to be good at
       | matching that with the current list of objects you're carrying,
       | items in the world, and so on.
       | 
        | Another comment mentions how the AI needs a world model
       | that's transforming as actions are performed, but you need
       | something similar to reason about maths proofs and code, where
       | you have to keep track of the current state/context. And most
       | adventure games don't require you to plan many steps in advance
       | anyway. They're often about figuring out which item to
       | combine/use with which other item next (where only one
       | combination works), and navigating to the room that contains the
       | latter item first.
       | 
       | So it feels like most of the parts are already there to me, and
       | it's more about getting the right prompts and presenting the
       | world in the right format e.g. maintaining a table of items,
       | clues, and open puzzles, to look for connections and matches, and
       | maintaining a map.
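        | 
        | A sketch of that scaffolding (all field names made up):
        | 
        |     # Explicit world state carried between turns and re-rendered
        |     # into every prompt, instead of relying on the raw transcript.
        |     state = {
        |         "location": "Kitchen",
        |         "inventory": ["crowbar", "garlic bread"],
        |         "exits": {"Kitchen": {"north": "Hallway"}},
        |         "clues": ["the panel by the stove is loose"],
        |         "open_puzzles": ["locked cellar door"],
        |     }
        | 
        |     def render_prompt(state, latest_game_text):
        |         return (
        |             "Location: " + state["location"] + "\n"
        |             "Exits: " + str(state["exits"].get(state["location"], {})) + "\n"
        |             "Inventory: " + ", ".join(state["inventory"]) + "\n"
        |             "Clues: " + "; ".join(state["clues"]) + "\n"
        |             "Unsolved: " + "; ".join(state["open_puzzles"]) + "\n\n"
        |             + latest_game_text + "\n\n"
        |             "Try to match an item or clue to an open puzzle."
        |         )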
       | 
       | Getting LLMs to get good at variations of The Witness would be
       | interesting, where the rules have to be learned through trial and
       | error, and combined.
        
       | standardly wrote:
       | LLMs work really well for open-ended role-playing sessions, but
       | not so much games with strict rules.
       | 
       | They just can't seem to grasp what would make a choice a "wrong"
        | choice in a text-based adventure game, so the game ends up having
        | no ending. You have to hard-code failure events, or you just never
       | get anything like "you chose to attack the wizard, but he's level
       | 99, dummy, so you died - game over!". It just accepts whatever
       | choice you make, ad infinitum.
       | 
       | My best session was one in which I had the AI give me 4 dialogue
       | options to choose from. I never "beat" the game, and we never
       | solved the mystery - it just kept going further down the rabbit
        | hole... But it was surprisingly enjoyable, and replayable! A
        | larger framework just needs to be written to keep the tires
        | between the lines and to hard-code certain game rules - what's
        | under the hood is already quite good for narratives imo.
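        | 
        | A sketch of that split (the rule table is invented for
        | illustration): the LLM narrates freely, but hard-coded rules
        | get first say on outcomes.
        | 
        |     # Hard-coded failure events checked before the LLM narrates.
        |     RULES = [
        |         # (predicate on action and state, forced ending)
        |         (lambda act, st: "attack wizard" in act and st["level"] < 99,
        |          "The wizard is level 99, dummy, so you died - game over!"),
        |     ]
        | 
        |     def take_turn(action, state, llm):
        |         for predicate, ending in RULES:
        |             if predicate(action, state):
        |                 return ending, True       # hard rule ends the game
        |         return llm(action, state), False  # otherwise, free narration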
        
       | spacecadet wrote:
        | I'll plug my repo, DUNGEN.
       | 
       | https://github.com/derekburgess/dungen
       | 
       | It's a configurable pipeline for generative dungeon master role
        | play content with a Zork-like UI. I use a model called "Wayfarer"
       | which is designed for challenging role play content and I find
       | that it can be pretty fun to engage with.
        
       ___________________________________________________________________
       (page generated 2025-08-12 23:01 UTC)