[HN Gopher] Evaluating LLMs Playing Text Adventures
___________________________________________________________________
Evaluating LLMs Playing Text Adventures
Author : todsacerdoti
Score : 84 points
Date : 2025-08-12 15:19 UTC (7 hours ago)
(HTM) web link (entropicthoughts.com)
(TXT) w3m dump (entropicthoughts.com)
| throwawayoldie wrote:
| My takeaway is: LLMs are not great at text adventures, even when
| those text adventures are decades old and have multiple
| walkthroughs available on the Internet. Slow clap.
| ForHackernews wrote:
| What blogging software is this with the sidenotes?
| hombre_fatal wrote:
| Noticed it was written in org mode with custom css so I found
| this post on their site: https://entropicthoughts.com/new-and-
| improved-now-powered-by...
| kqr wrote:
| Some details of the side notes in particular are given here:
| https://entropicthoughts.com/sidenotes-footnotes-inlinenotes
| the_af wrote:
| I know they define "achievements" in order to measure "how well"
| the LLM plays the game, and by definition this is arbitrary. As
| an experiment, I cannot argue with this.
|
| However, I _must_ point out the kind of "modern" (relatively
| speaking) adventure games mentioned in the article -- which are
| more accurately called "interactive fiction" by the community --
| is not very suitable for this kind of experiment. Why? Well,
| because so many of them are exploratory/experimental, and not at
| all about "winning" (unlike, say, "Colossal Cave Adventure",
| where there is a clear goal).
|
| You cannot automate (via LLM) "playing" them, because they are
| all about the thoughts and emotions (and maybe shocked laughter)
| they elicit in _human_ players. This cannot be automated.
|
| If you think I'm being snobby, consider this: the first game TFA
| mentions is "9:05". Now, you _can_ set goals for a bot to play
| this game, but truly -- if you've played the game -- you know
| this would be completely missing the point. You cannot "win" this
| game, it's all about subverting expectations, and about replaying
| it once you've seen the first, most straightforward ending, and
| having a laugh about it.
|
| Saying more will spoil the game :)
|
| (And do note there's no such thing as "spoiling a game" for an
| LLM, which is precisely the reason they cannot truly "play" these
| games!)
| kqr wrote:
| I disagree. Lockout, Dreamhold, Lost Pig, and So Far are new
| games but in the old style. Plundered Hearts is literally one
| of the old games (though ahead of its time).
|
| I'll grant you that 9:05 and For a Change are somewhat more
| modern: the former has easy puzzles, the latter very abstract
| puzzles.
|
| I disagree that new text adventures are not about puzzles and
| winning. They come in all kinds of flavours these days. Even
| games like 9:05 pace their narrative with traditional puzzles,
| meaning we can measure forward progress just the same. And to
| be fair, LLMs are so bad at these games that in these articles,
| I'm merely trying to get them to navigate the world _at all_.
|
| If anything, I'd argue _Adventure_ is a bad example of the
| genre you refer to. It was (by design) more of a caving
| simulator/sandbox with optional loot than a game with progress
| toward a goal.
| dfan wrote:
| As the author of For A Change, I am astonished that anyone
| would think it was a good testbed for an LLM text adventure
| solver. It's fun that they tried, though.
| kqr wrote:
| Thank you for making it. The imagery of it is striking and
| comes back to me every now and then. I cannot unhear "a
| high wall is not high to be measured in units of length,
| but of angle" -- beautifully put.
|
| The idea was that it'd be a good example of having to
| navigate somewhat foreign but internally consistent worlds,
| an essential text adventure skill.
| dfan wrote:
| Ha, I didn't realize that I was replying to the person
| who wrote the post!
|
| The audience I had in mind when writing it was people who
| were already quite experienced in playing interactive
| fiction and could then be challenged in a new way while
| bringing their old skills to bear. So it's sort of a
| second-level game in that respect (so is 9:05, in
| different ways, as someone else mentioned).
| the_af wrote:
| We will have to agree to disagree, if you'll allow me the
| cliche.
|
| I didn't use Adventure as an example of IF; it belongs in the
| older "text adventure" genre, which is why I thought it would
| be more fitting for testing LLMs, since it's not about
| experiences but about maxing points.
|
| I think there's nothing about IF that an LLM can "solve".
| This genre of games, in its modern expression, is
| about breaking boundaries and expectations, and making the
| player enjoy this. Sometimes the fun is simply seeing
| different endings and how they relate to each other. Since
| LLMs cannot experience joy or surprise, and can only
| mechanically navigate the game (maybe "explore all possible
| end states" is a goal?), they cannot "play" it. Before you
| object: I'm aware you didn't claim the LLMs are _really_
| playing the game!
|
| But here's a test for your set of LLMs: how would they "win"
| at "Rematch"? This game is about repeatedly dying,
| understanding what's happening, and stringing together a
| single sentence that will break the cycle and win the game.
| Can any LLM do this, a straightforward puzzle? I'd be
| impressed!
| kqr wrote:
| I think I see what you mean, and with these clarifications
| we are in agreement. There are a lot of modern works of
| interactive fiction that go way beyond what the old text
| adventures did, and work even when judged as art or
| literature. I just haven't played much of it because I'm a
| fan of the old-style games.
|
| As for the specific question, they would progress at Rematch
| by figuring out ever more complicated interactions that work,
| and then using those to survive, naturally.
| fmbb wrote:
| Of course you can automate "having fun" and "being
| entertained". That is, if you believe humanity will ever build
| artificial intelligence.
| drdeca wrote:
| A p-zombie would not have fun or be entertained, only act
| like it does. I don't think AGI requires being unlike a
| p-zombie in this way.
| the_af wrote:
| > _Of course you can automate "having fun" and "being
| entertained"_
|
| This seems like begging the question to me.
|
| I don't think there's a mechanistic (as in "token predictor")
| procedure to generate the emotions of having fun, or being
| surprised, or amazed. It's not on me to demonstrate it cannot
| be done, it's on _them_ to demonstrate it can.
|
| But to be clear, I don't think the author of TFA is making
| this claim either. They are simply approaching IF games from
| a "problem solving" perspective -- they don't claim this has
| anything to do with fun or AGI -- and what I'm arguing is
| that this mechanistic approach to IF games, i.e. "problem
| solving", only touches on a small subset of what makes people
| want to play these games. They are often (not all, as the
| author rightly corrects me, but often) about generating
| surprise and amazement in the player, something that cannot
| be done to an LLM.
|
| (Note I'm also not dismissing the author's experiment. As an
| experiment it's interesting and, I'd argue, _fun for the
| author_).
|
| Current, state-of-the-art LLMs cannot feel amazement, or
| anything else really (and, I argue, no LLM in the current tech
| branch ever will). I hope this isn't a controversial
| statement.
| Terr_ wrote:
| That's like saying it's _wrong_ to test a robot's ability to
| navigate and traverse a mountain... because the mountain has no
| win-condition and is really a _context for human emotional
| experiences._
|
| The purpose of the test is whatever the tester decides it is.
| If that means finding X% of the ambiguously-good game endings
| within a budget of Y commands, then so be it.
| the_af wrote:
| > _The purpose of the test is whatever the tester decides it
| is._
|
| Well, I did say:
|
| > _As an experiment, I cannot argue with this._
|
| It was more a reflection on the fact that the primary goal of
| a lot of modern IF games, among them "9:05", the first game
| mentioned in TFA, is not like "traversing a mountain".
| Traversing a mountain can have clear and meaningful goals,
| such as "reach the summit", or "avoid getting stuck", or "do
| not die or go missing after X hours".
| Though of course, appreciating nature and sightseeing is
| beyond the scope of an LLM.
|
| Indeed, "9:05" has no other "goal" than, upon seeing a
| different ending from the main one, revisiting the game with
| the knowledge gained from that first playthrough. I'm being
| purposefully opaque in order not to spoil the game for you
| (you should play it, it's really short).
|
| Let me put it another way: remember that fad, some years ago,
| of making you pay attention to an image or video, with a
| prompt like "colorblind people cannot see this shape after X
| seconds" so you pay attention and then BAM! A jump scare!
| Haha, joke's on you!
|
| How would you "test" an LLM on such a jump scare? The goal is to
| scare a human. LLMs cannot be scared. What would the possible
| answers be?
|
| A: I do not see any disappearing shapes after X seconds. Beep
| boop! I must not be colorblind, nor human, for I am an LLM.
| Beep!
|
| or maybe
|
| B: This is a well-known joke. Beep boop! After some short
| time, a monster appears on screen. This is intended to scare
| the person looking at it! Beep!
|
| Would you say either response would show the LLM "playing"
| the game?
|
| (Trust me, this is a somewhat adjacent effect to what "9:05"
| would play on you, and I fear I've said too much!)
| benlivengood wrote:
| Wouldn't playthroughs for these games be potentially in the
| pretraining corpus for all of these models?
| throwawayoldie wrote:
| As a longtime IF fan, I can basically guarantee there are.
| quesera wrote:
| Reproducing specific chunks of long form text from distilled
| (inherently lossy) model data is _not_ something that I would
| expect LLMs to be good at.
|
| And of course, there's no actual reasoning or logic going on,
| so they cannot compete in this context with a curious 12 year
| old, either.
| jameshart wrote:
| Nothing in the article mentioned how good the LLMs were at even
| entering valid text adventure commands into the games.
|
| If an LLM responds to "You are standing in an open field west of
| a white house" with "okay, I'm going to walk up to the house",
| and just gets back "THAT SENTENCE ISN'T ONE I RECOGNIZE", it's
| not going to make much progress.
| throwawayoldie wrote:
| "You're absolutely right, that's not a sentence you
| recognize..."
| kqr wrote:
| The previous article (linked in this one) gives an idea of
| that.
| jameshart wrote:
| I did see that. But since that focused mainly on how Claude
| handled that particular prompt format, it's not clear whether
| the LLMs that scored low here were just failing to produce
| valid input, struggling to handle that specific prompt/output
| structure, or doing fine at basically operating the text
| adventure but struggling to build a world model and solve
| problems.
| kqr wrote:
| Ah, I see what you mean. Yeah, there was too much output
| from too many models at once (combined with not enough
| spare time) to really perform useful qualitative analysis
| on all the models' performance.
| fzzzy wrote:
| I tried this earlier this year. I wrote a tool that let an LLM
| play Zork. It was pretty fun.
| bongodongobob wrote:
| Did you do anything special? I tried this with just copy and
| paste with GPT-4o and it was absolutely terrible at it. It
| usually ended up spamming help in a loop and trying commands
| that didn't exist.
| fzzzy wrote:
| I have my own agent loop that I wrote, and I gave it a tool
| which it uses to send input to the parser. I also had a step
| which took the previous output and generated an image for it.
| It was just a toy, but it was pretty fun.
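|
| The loop itself is tiny. A minimal sketch (ask_model() is a
| hypothetical stand-in for whatever chat API you use, and
| dfrotz is the dumb-terminal build of the frotz Z-machine
| interpreter):
|
|     import subprocess
|
|     def ask_model(transcript: str) -> str:
|         """Hypothetical LLM call: given the transcript so far,
|         return the next one-line command for the parser."""
|         raise NotImplementedError  # plug in your chat API here
|
|     # dfrotz reads commands on stdin, prints the game on stdout
|     game = subprocess.Popen(
|         ["dfrotz", "zork1.z5"],
|         stdin=subprocess.PIPE, stdout=subprocess.PIPE,
|         text=True, bufsize=1,
|     )
|
|     def read_until_prompt() -> str:
|         """Collect interpreter output up to the next '>' prompt."""
|         chunks = []
|         while True:
|             ch = game.stdout.read(1)
|             if not ch or ch == ">":
|                 break
|             chunks.append(ch)
|         return "".join(chunks)
|
|     transcript = read_until_prompt()
|     for _ in range(100):  # turn budget
|         command = ask_model(transcript).strip()
|         game.stdin.write(command + "\n")
|         transcript += "> " + command + "\n" + read_until_prompt()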
| lottaFLOPS wrote:
| related research that was also announced this week:
| https://www.textquests.ai/
| 1970-01-01 wrote:
| Very interesting how they all clearly suck at it. Even with
| hints, they can't understand the task enough to complete the
| game.
| abraxas wrote:
| That's a great tracker. How often is the leaderboard updated?
| kqr wrote:
| They seem to be going for a much simpler route of just giving
| the LLM a full transcript of the game with its own reasoning
| interspersed. I didn't have much luck with that, and I'm
| worried it might not be effective once we're into the hundreds
| of turns because of inadvertent context poisoning. It seems
| like this might indeed be what happens, given the slowing of
| progress indicated in the paper.
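|
| As far as I can tell, each turn there looks roughly like this
| (a sketch of the scheme, not their actual code; the whole
| history is resent every turn, which is where the poisoning
| creeps in):
|
|     history = []  # one (game_text, reasoning, command) per turn
|
|     def build_prompt(latest_output: str) -> str:
|         """Replay the entire game so far, with the model's own
|         reasoning interleaved, then ask for the next move."""
|         lines = []
|         for game_text, reasoning, command in history:
|             lines += [game_text,
|                       "[reasoning] " + reasoning,
|                       "> " + command]
|         lines += [latest_output, "What is your next command?"]
|         return "\n".join(lines)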
| andrewla wrote:
| The article links to a previous article discussing methodology
| for this. The prompting is pretty extensive.
|
| It is difficult here to separate out how much of this could be
| fixed or improved by better prompting. A better baseline might be
| to just give the LLM direct access to the text adventure, so that
| every LLM reply is passed to the game verbatim. I
| suspect that the LLMs would do poorly on this task, but would
| undoubtedly improve over time and generations.
|
| EDIT: Just started playing 9:05 with GPT-4 with no prompting and
| it did quite poorly; kept trying to explain to me what was going
| on with the ever more complex errors it would get. Put in a
| one-line "You are playing a text adventure game" and off it went --
| it took a shower and got dressed and drove to work.
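|
| For the curious, that no-frills baseline is only a few lines
| (a sketch using the OpenAI chat API; piping the command back
| into the game is left out):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     messages = [{"role": "system",
|                  "content": "You are playing a text adventure game. "
|                             "Reply with one game command per turn."}]
|
|     def next_command(game_output: str) -> str:
|         messages.append({"role": "user", "content": game_output})
|         reply = client.chat.completions.create(
|             model="gpt-4", messages=messages)
|         command = reply.choices[0].message.content.strip()
|         messages.append({"role": "assistant", "content": command})
|         return command  # fed verbatim to the game, typos and all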
| SquibblesRedux wrote:
| This is another great example of how LLMs are not really any sort
| of AI, or even proper knowledge representation. Not saying they
| don't have their uses (like souped up search and permutation
| generators), but definitely not something that resembles
| intelligence.
| nonethewiser wrote:
| While I agree, it's still shocking how close next-token
| prediction gets to looking like intelligence. It's amazing
| we need examples such as this to demonstrate the difference.
| SquibblesRedux wrote:
| Another way to think about it is how interesting it is that
| humans can be so easily influenced by strings of words. (Or
| images, or sounds.) I suppose I would characterize it as so
| many people being earnestly vulnerable. It all makes me think
| of Kahneman's [0] System 1 (fast) and System 2 (slow)
| thinking.
|
| [0] "Thinking, Fast and Slow"
| https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow
| seba_dos1 wrote:
| It is kinda shocking, but I'm sure ELIZA was too for many
| people back then. It just took less time to realize what was
| going on there.
| henriquegodoy wrote:
| Looking at this evaluation, it's pretty fascinating how badly
| these models perform even on decades-old games that almost
| certainly have walkthroughs scattered all over their training
| data. Like, you'd think they'd at least brute-force their way
| through the early game mechanics by now. But honestly this
| kinda validates something I've been thinking about: real
| intelligence isn't just about having seen the answers before,
| it's about being good at games, and specifically at new
| situations where you can't just pattern-match your way out.
|
| This is exactly why something like arc-agi-3 feels so important
| right now. Instead of static benchmarks that these models can
| basically brute-force with enough training data, it's designed
| around interactive environments where you actually need to
| perceive, decide, and act over multiple steps without prior
| instructions. That shift from "can you reproduce known patterns"
| to "can you figure out new patterns" seems like the real test of
| intelligence.
|
| What's clever about the game environment approach is that it
| captures something fundamental about human intelligence that
| static benchmarks miss entirely: when humans encounter a new
| game, we explore, form plans, remember what worked, and adjust
| our strategy. All that interactive reasoning over time is
| exactly what these text adventure results show LLMs are
| terrible at. We need systems that can actually understand and
| adapt to new situations, not just really good autocomplete
| engines that happen to know a lot of trivia.
| msgodel wrote:
| I've been experimenting with this as well with the goal of
| using it for robotics. I don't think this will be as hard to
| train for as people think though.
|
| It's interesting he wrote a separate program to wrap the
| z-machine interpreter. I integrated my wrapper directly into my
| pytorch training program.
| da_chicken wrote:
| I saw it somewhere else recently, but the idea is that LLMs are
| language models, not world models. This seems like a perfect
| example of that. You need a world model to navigate a text
| game.
|
| Otherwise, how can you determine that "North" is a context
| change, but not always a context change?
| manbash wrote:
| Thanks for this. I was struggling to put it into words, even
| if this may already be a known distinguishing factor for others.
| zahlman wrote:
| > I saw it somewhere else recently, but the idea is that LLMs
| are language models, not world models.
|
| Part of what distinguishes humans from artificial
| "intelligence" to me is exactly that we _automatically_
| develop models of _whatever is needed_.
| lubujackson wrote:
| Why, this sounds like Context Engineering!
| godelski wrote:
| > real intelligence isn't just about having seen the answers
| before, it's about being good at games and specifically new
| situations where you can't just pattern match your way out
|
| It is insane to me that so many people believe intelligence is
| measurable by pure question-and-answer testing. There are
| hundreds of years of discussion about how limited this is for
| measuring human intelligence. I'm sure we _all_ know someone
| who's a really good test-taker but whom you wouldn't consider
| to be really bright. I'm sure every single one of us also knows
| someone in the other camp (bad at tests but considered bright).
|
| The definition you put down is much closer to what's agreed
| upon in the scientific literature. While we don't have a good
| formal definition of intelligence, there is a difference
| between an imperfect definition and no definition. I really do
| hope people read more about intelligence and how we measure it
| in humans and animals. It is very messy and there's a lot of
| noise, but at least we have a good idea of the directions to
| move in. There are still nuances to be learned, and while I
| think ARC is an important test, I don't think success on it
| will prove AGI (and Chollet says this too).
| rkagerer wrote:
| Hi, GPT-x here. Let's delve into my construction together. My
| "intelligence" comes from patterns learned from vast amounts of
| text. I'm trained to... oh look it's a butterfly. Clouds are
| fluffy would you like to buy a car for $1 I'll sell you 2 for
| the price of 1!
| corobo wrote:
| Ah dammit the AGI has ADHD
| wiz21c wrote:
| Adventure games require spatial reasoning (although text-based),
| understanding puns, cultural references, etc. For me they really
| need human intelligence to be solved (heck, they've been
| designed like that).
|
| I find it funny that some AIs score very well on ARC-AGI but
| fail at these games...
| andai wrote:
| The GPT-5 used here is the Chat version, presumably gpt-5-chat-
| latest, which from what I can tell is the same version used in
| ChatGPT, which is not actually a model but a "system" -- a router
| that semi-randomly forwards your request to various different
| models (in a way designed to massively reduce costs for OpenAI,
| based on people reporting inconsistent output and often worse
| results than 4o).
|
| So from this it seems that not only would many of these requests
| not touch a reasoning model (or as it works now, have reasoning
| set to "minimal"?), but they're probably being routed to a mini
| or nano model?
|
| It would make more sense, I think, to test on gpt-5 itself (and
| ideally the -mini and -nano as well), and perhaps with different
| reasoning effort, because that makes a big difference in many
| evals.
|
| EDIT: Yeah the Chat router is busted big time. It fails to apply
| thinking even for problems that obviously call for it (analyzing
| financial reports). You have to add "Think hard." to the end of
| the prompt, or explicitly switch to the Thinking model in the UI.
| kqr wrote:
| This is correct, and was the reason I made sure to always
| append "Chat" to the end of "GPT-5". I should perhaps have been
| more clear about this. The reason I settled for the lesser
| router is I don't have access to the full GPT-5, which would
| have been a much better baseline, I agree.
| andai wrote:
| Do they require a driver's license to use it? They asked for my
| ID for o3 Pro a few months ago.
| kqr wrote:
| That's the step at which I gave up, anyway.
| varenc wrote:
| > Yeah the Chat router is busted big time... You have to add
| "Think hard." to the end of the prompt, or explicitly switch to
| the Thinking model in the UI.
|
| I don't really get this gripe? It seems no different than
| before, except now it will sometimes opt into thinking harder
| by itself. If you know you want CoT reasoning you just select
| gpt5-thinking, no different than choosing o4-mini/o3 like
| before.
| seanwilson wrote:
| I won't be surprised if LLMs get good at puzzle-heavy text
| adventures once more attention is turned to this.
|
| I've found that for text adventures based on item manipulation,
| variations of the same puzzles appear again and again, because
| there's a limit to how many obscure-but-not-too-obscure item
| puzzles you can come up with. So training would cover exact
| matches of the same puzzle, and variations, like different ways
| of opening locked doors.
|
| Puzzles like key + door, crowbar + panel, dog + food, coin +
| vending machine, vampire + garlic, etc. You can obscure or layer
| puzzles, like changing the garlic into garlic bread, which would
| still work on the vampire, so there are logical connections to
| make but often nothing too crazy.
|
| A lot of the difficulty in these games comes from not noticing or
| forgetting about clues/hints and potential puzzles because
| there's so much going on, which is less likely to trip up a
| computer.
|
| You can already ask LLMs "in a game: 20 ways to open a door if I
| don't have the key", "how to get past an angry guard dog" or "I'm
| carrying X, Y, and Z, how do I open a door", and it'll list lots
| of ways that are seen in games, so it's going to be good at
| matching that with the current list of objects you're carrying,
| items in the world, and so on.
|
| Another comment mentions how the AI needs a world model
| that's transforming as actions are performed, but you need
| something similar to reason about maths proofs and code, where
| you have to keep track of the current state/context. And most
| adventure games don't require you to plan many steps in advance
| anyway. They're often about figuring out which item to
| combine/use with which other item next (where only one
| combination works), and navigating to the room that contains the
| latter item first.
|
| So it feels like most of the parts are already there to me, and
| it's more about getting the right prompts and presenting the
| world in the right format e.g. maintaining a table of items,
| clues, and open puzzles, to look for connections and matches, and
| maintaining a map.
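|
| Concretely, I'm picturing state like this kept between turns
| and pasted into every prompt (a hypothetical sketch):
|
|     from dataclasses import dataclass, field
|
|     @dataclass
|     class WorldState:
|         location: str = ""
|         inventory: set = field(default_factory=set)
|         clues: list = field(default_factory=list)         # seen, unexplained
|         open_puzzles: list = field(default_factory=list)  # "locked door", ...
|         exits: dict = field(default_factory=dict)         # room -> dir -> room
|
|         def as_prompt(self) -> str:
|             """Serialized into every turn so nothing gets forgotten."""
|             inv = ", ".join(sorted(self.inventory)) or "nothing"
|             return ("Location: " + self.location + "\n"
|                     "Inventory: " + inv + "\n"
|                     "Clues: " + ("; ".join(self.clues) or "none") + "\n"
|                     "Open puzzles: "
|                     + ("; ".join(self.open_puzzles) or "none"))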
|
| Getting LLMs to get good at variations of The Witness would be
| interesting, where the rules have to be learned through trial and
| error, and combined.
| standardly wrote:
| LLMs work really well for open-ended role-playing sessions, but
| not so much for games with strict rules.
|
| They just can't seem to grasp what would make a choice a "wrong"
| choice in a text-based adventure game, so the game ends up having
| no ending. You have to hard-code failure events, or you just never
| get anything like "you chose to attack the wizard, but he's level
| 99, dummy, so you died - game over!". It just accepts whatever
| choice you make, ad infinitum.
|
| My best session was one in which I had the AI give me 4 dialogue
| options to choose from. I never "beat" the game, and we never
| solved the mystery - it just kept going further down the rabbit
| hole... But it was surprisingly enjoyable, and replayable! A
| larger framework just needs to be written to keep the tires
| between the lines and to hard-code certain game rules - what's
| under the hood is already quite good for narratives, imo.
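|
| For the framework, I'm imagining hard rules checked in plain
| code first, with the LLM only narrating when no rule fires (a
| sketch; narrate() stands in for the LLM call):
|
|     RULES = [
|         # (predicate over action + state, forced outcome)
|         (lambda act, st: "attack wizard" in act and st["level"] < 99,
|          "The wizard is level 99, dummy. You died. Game over!"),
|         (lambda act, st: st["turns"] > 400,
|          "Your lantern gives out, and so do you. Game over!"),
|     ]
|
|     def resolve(action: str, state: dict) -> str:
|         for predicate, outcome in RULES:
|             if predicate(action, state):
|                 return outcome         # hard-coded ending, not the model's call
|         return narrate(action, state)  # otherwise the LLM improvises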
| spacecadet wrote:
| I'll pump my repo, DUNGEN.
|
| https://github.com/derekburgess/dungen
|
| It's a configurable pipeline for generative dungeon-master
| role-play content with a Zork-like UI. I use a model called
| "Wayfarer", which is designed for challenging role-play content,
| and I find that it can be pretty fun to engage with.
___________________________________________________________________
(page generated 2025-08-12 23:01 UTC)