[HN Gopher] Letting Claude play text adventures
___________________________________________________________________
Letting Claude play text adventures
Author : varjag
Score : 47 points
Date : 2026-01-16 21:02 UTC (5 days ago)
(HTM) web link (borretti.me)
(TXT) w3m dump (borretti.me)
| skybrian wrote:
| It seems like asking Claude to keep notes somehow would work
| better. An AGENTS file and a TODO file? An issue tracker like
| beads? Lots of things to try.
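|
| A minimal sketch of what a notes tool could look like, using
| the Anthropic tool-use format (the tool name and schema here
| are made up for illustration):
|
|     # Hypothetical tool the agent can call each turn to persist
|     # observations outside its context window.
|     append_note_tool = {
|         "name": "append_note",
|         "description": "Append a note to the notes file.",
|         "input_schema": {
|             "type": "object",
|             "properties": {
|                 "note": {"type": "string"},
|             },
|             "required": ["note"],
|         },
|     }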
| pflenker wrote:
| For a game like Anchorhead, which is famous in its niche,
| shouldn't Claude already know it well enough to just solve it
| right away? I would expect its training data to contain
| multiple discussions and walkthroughs of the game.
| ratg13 wrote:
| It's very likely the model didn't stop to question whether the
| game it was playing was something it already knew, and just
| assumed it was a puzzle created for it.
| sfjailbird wrote:
| You can see Claude's responses in the repo. The first one is:
|
| _Ah, Anchorhead! One of the most celebrated pieces of
| interactive fiction ever written_
| imiric wrote:
| > By the time you get to day two, each turn costs tens of
| thousands of input tokens
|
| This behavior surprised me when I started using LLMs, since it's
| so counterintuitive.
|
| Why _does_ every interaction require submitting and processing
| all data in the current session up until that point? Surely there
| must be a way for the context to be stored server-side, and
| referenced and augmented by each subsequent interaction. Could
| this data be compressed in a way to keep the most important bits,
| and garbage collect everything else? Could there be different
| compression techniques depending on the type of conversation?
| Similar to the domain-specific memories and episodic memory
| mentioned in the article. Could "snapshots" be supported, so
| that the user can explore branching paths in the session history?
| Some of this is possible by manually managing context, but it's
| too cumbersome.
|
| Why are all these relatively simple engineering problems still
| unsolved?
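|
| To make the compounding concrete, here is a toy sketch of the
| usual chat loop (call_model is a stand-in for any provider's
| API): the full transcript is resubmitted as input on every
| turn, so input tokens grow roughly quadratically with turn
| count.
|
|     history = []
|
|     def play_turn(command: str) -> str:
|         # Each turn appends to the transcript, and the FULL
|         # transcript is sent as input on the next call.
|         history.append({"role": "user", "content": command})
|         reply = call_model(history)  # cost ~ len(history)
|         history.append({"role": "assistant", "content": reply})
|         return reply
|
|     # After n turns the model has re-read about n*(n+1)/2
|     # messages in total, hence tens of thousands of input
|     # tokens per turn by day two.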
| iamjackg wrote:
| It's not unsolved, at least not the first part of your
| question. In fact, it's a feature offered by all the major LLM
| providers!
|
| - https://platform.openai.com/docs/guides/prompt-caching
|
| - https://platform.claude.com/docs/en/build-with-
| claude/prompt...
|
| - https://ai.google.dev/gemini-api/docs/caching
| imiric wrote:
| Ah, that's good to know, thanks.
|
| But then why is there compounding token usage in the
| article's trivial solution? Is it just a matter of using the
| cache correctly?
| StevenWaterman wrote:
| Cached tokens are cheaper (roughly a 90% discount) but not free
| moyix wrote:
| Also, unlike OpenAI, Anthropic's prompt caching is
| _explicit_ (you set up to 4 cache "breakpoints"),
| meaning if you don't implement caching then you don't
| benefit from it.
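|
| A minimal sketch of setting one of those breakpoints with the
| Python SDK (the model id is a placeholder and
| transcript_so_far is a stand-in for the stable prefix; see the
| caching docs linked above):
|
|     import anthropic
|
|     client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
|
|     response = client.messages.create(
|         model="claude-3-5-sonnet-20241022",  # placeholder id
|         max_tokens=1024,
|         system=[{
|             "type": "text",
|             "text": transcript_so_far,  # long, stable prefix
|             # Cache breakpoint: later calls that share this
|             # prefix read it from cache at a steep discount
|             # instead of paying full input price.
|             "cache_control": {"type": "ephemeral"},
|         }],
|         messages=[{"role": "user", "content": "> look"}],
|     )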
| netcraft wrote:
| That's a very generous way of putting it. Anthropic's
| prompt caching is actively hostile and very difficult to
| implement properly.
| sfjailbird wrote:
| Cool! I would like to see the game sessions.
|
| Edit: they are there in the repo:
| https://github.com/eudoxia0/claude-plays-anchorhead/tree/mas...
| tiahura wrote:
| Claude Code, NetHack, and tmux are fun to experiment with.
| brimtown wrote:
| I'm currently letting Claude build and play its own Dwarf
| Fortress clone, as an installable plugin in Claude Code
|
| https://github.com/brimtown/claude-fortress
| twohearted wrote:
| This is a great idea and great work.
|
| Context is intuitively important, but people rarely put
| themselves in the LLM's shoes.
|
| It would be eye-opening to create an LLM test system that
| periodically sends a turn to a human instead of the model.
| Would you do better than the LLM? What tools would you call at
| that moment, given only that context and no other knowledge? The
| way many of these systems are constructed, I'd wager it would be
| difficult for a human.
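|
| A crude sketch of that harness: with some small probability a
| turn is routed to a human at the terminal, who sees only the
| context the model would have seen (call_model and the flat
| context string are stand-ins):
|
|     import random
|
|     def next_command(context: str, human_rate: float = 0.05):
|         if random.random() < human_rate:
|             # The human plays this turn with ONLY the model's
|             # context, and no memory of earlier turns.
|             print(context)
|             return input("your move> ")
|         return call_model(context)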
|
| The agent can't decide what is safe to delete from memory because
| it's a sort of bystander at that moment. Someone else made the
| list it received, and someone else will get the list it writes.
| The logic that went into why the notes exist is lost. LLMs are
| living the Christopher Nolan film Memento.
| lukev wrote:
| This is a great framework to experiment with memory
| architectures.
|
| Everything the author says about memory management tracks with
| my intuition of how Claude Code works, including my perception
| that it isn't very good at explicitly managing its own memory.
|
| My next step in getting it to work well on a bigger game would
| be to build a more "intuitive" memory tool, where the textual
| description of a room or an item would _automatically_ RAG
| previous interactions with that entity into context.
|
| That also is closer to how human memory works -- we're instantly
| reminded of things via a glimpse, a sound, a smell... we don't
| need to (analogously) write in or search our notebook for basic
| info we already know about the world.
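|
| A toy sketch of that kind of associative recall, using naive
| word overlap as a stand-in for a real embedding index:
|
|     class EntityMemory:
|         def __init__(self):
|             self.notes: list[str] = []
|
|         def remember(self, note: str) -> None:
|             self.notes.append(note)
|
|         def recall(self, description: str, k: int = 3):
|             # Rank stored notes by word overlap with the text
|             # the game just printed; a real version would use
|             # embeddings and a vector index.
|             words = set(description.lower().split())
|             ranked = sorted(
|                 self.notes,
|                 key=lambda n: len(words & set(n.lower().split())),
|                 reverse=True,
|             )
|             return ranked[:k]
|
|     # Whenever the game prints a room or item description, the
|     # top matching notes get injected into context on their
|     # own, rather than the model having to search a notebook.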
| daxfohl wrote:
| I tried something similar, but distilled to "solve this maze" as
| a text adventure, and while it usually solved it eventually, it
| almost always backtracked through fully-explored dead ends
| multiple times before finally getting to the end. I was pretty
| surprised by this, as I expected it would be able to traverse
| more or less optimally most of the time.
|
| I tried basic raw long-context chat, various approaches of
| getting it to externalize the state (i.e. prompting it to emit
| the known state of the maze after each move, but _not_ telling it
| exactly what to emit or how to format it), and even allowing it
| to emit code to execute after each turn (so long as it was a
| serialization/storage algorithm, not a solver in itself), but
| it would invariably get lost at some point. (It always
| neglected to emit a key for which coordinate was which and
| which direction was increasing. Even when I explicitly told it
| to do this, it would frequently forget at some point anyway and
| get turned around again.)
|
| Of course it had no problem writing an optimal algorithm to solve
| mazes when prompted. I thought the disparity was interesting.
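|
| (For reference, the kind of thing it would happily produce when
| asked directly: a standard breadth-first search over a grid,
| sketched here as an illustration.)
|
|     from collections import deque
|
|     def solve(maze, start, goal):
|         # maze: 2D list of bools, True = open cell.
|         # Returns the shortest path as a list of (row, col).
|         queue = deque([start])
|         parent = {start: None}
|         while queue:
|             cell = queue.popleft()
|             if cell == goal:
|                 path = []
|                 while cell is not None:
|                     path.append(cell)
|                     cell = parent[cell]
|                 return path[::-1]
|             r, c = cell
|             for nxt in ((r+1, c), (r-1, c), (r, c+1), (r, c-1)):
|                 nr, nc = nxt
|                 if (0 <= nr < len(maze) and 0 <= nc < len(maze[0])
|                         and maze[nr][nc] and nxt not in parent):
|                     parent[nxt] = cell
|                     queue.append(nxt)
|         return None  # goal unreachable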
|
| Note the mazes had the start and end positions inside the maze
| itself, so they weren't trivially solvable by the "follow the
| left-hand wall" algorithm.
|
| This was last summer so maybe newer models would do better. I
| also stopped due to cost.
| sfjailbird wrote:
| Having read through the entire game session, Claude plays the
| game admirably! For example, it finds a random tin of oily fish
| somewhere, and later tries (unsuccessfully) to use it to oil a
| rusty lock. Later it successfully solves a puzzle inside the
| house by thoroughly examining random furniture and picking up
| subtle clues about what to do.
|
| It did so well that I can't help but suspect it used some
| hints or walkthroughs, but then again it did a bunch of
| clueless stuff too, like any player new to the game.
|
| For one thing, this would be a great testing tool for the author
| of such a game. And more generally, the world of software testing
| is probably about to take some big leaps forward.
| woggy wrote:
| Very interesting; this seems like a good framework for testing
| and experimenting with memory. I am curious why it wasn't able
| to solve it, considering it is a well-known game. It would be
| interesting if puzzle games like this could be generated, so we
| know the model hasn't already been trained on them.
|
| I wonder whether the improvements from different memory-system
| approaches apply similarly to tasks that are in the training
| data vs. those that are not.
| justinclift wrote:
| This would be interesting to try with local models, where the
| token costs and token limits are quite different.
___________________________________________________________________
(page generated 2026-01-21 23:00 UTC)