[HN Gopher] Letting Claude play text adventures
___________________________________________________________________
Letting Claude play text adventures
Author : varjag
Score : 47 points
Date : 2026-01-16 21:02 UTC (5 days ago)
(HTM) web link (borretti.me)
(TXT) w3m dump (borretti.me)
| skybrian wrote:
| It seems like asking Claude to keep notes somehow would work
| better. An AGENTS file and a TODO file? An issue tracker like
| beads? Lots of things to try.
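|
| A minimal sketch of what a notes tool could look like, using
| the Anthropic tool-use format (the tool name and schema here
| are made up for illustration):
|
|     # Hypothetical tool the agent can call each turn to persist
|     # observations outside its context window.
|     append_note_tool = {
|         "name": "append_note",
|         "description": "Append a note to the notes file.",
|         "input_schema": {
|             "type": "object",
|             "properties": {
|                 "note": {"type": "string"},
|             },
|             "required": ["note"],
|         },
|     }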
| pflenker wrote:
| For a game like Anchorhead, which is famous in its niche,
| shouldn't Claude already know it well enough to just solve it
| right away? I would expect its training data to contain
| multiple discussions and walkthroughs of the game.
| ratg13 wrote:
| It's very likely the model didn't stop to question whether the
| game it was playing was something it already knew, and just
| assumed it was a puzzle created for it.
| sfjailbird wrote:
| You can see Claude's responses in the repo. The first one is:
|
| _Ah, Anchorhead! One of the most celebrated pieces of
| interactive fiction ever written_
| imiric wrote:
| > By the time you get to day two, each turn costs tens of
| thousands of input tokens
|
| This behavior surprised me when I started using LLMs, since it's
| so counterintuitive.
|
| Why _does_ every interaction require submitting and processing
| all data in the current session up until that point? Surely there
| must be a way for the context to be stored server-side, and
| referenced and augmented by each subsequent interaction. Could
| this data be compressed in a way to keep the most important bits,
| and garbage collect everything else? Could there be different
| compression techniques depending on the type of conversation?
| Similar to the domain-specific memories and episodic memory
| mentioned in the article. Could "snapshots" be supported, so
| that the user can explore branching paths in the session history?
| Some of this is possible by manually managing context, but it's
| too cumbersome.
|
| Why are all these relatively simple engineering problems still
| unsolved?
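|
| To make the compounding concrete, here is a toy sketch of the
| usual chat loop (call_model is a stand-in for any provider's
| API): the full transcript is resubmitted as input on every
| turn, so input tokens grow roughly quadratically with turn
| count.
|
|     history = []
|
|     def play_turn(command: str) -> str:
|         # Each turn appends to the transcript, and the FULL
|         # transcript is sent as input on the next call.
|         history.append({"role": "user", "content": command})
|         reply = call_model(history)  # cost ~ len(history)
|         history.append({"role": "assistant", "content": reply})
|         return reply
|
|     # After n turns the model has re-read about n*(n+1)/2
|     # messages in total, hence tens of thousands of input
|     # tokens per turn by day two.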
| iamjackg wrote:
| It's not unsolved, at least not the first part of your
| question. In fact, it's a feature offered by all the major LLM
| providers!
|
| - https://platform.openai.com/docs/guides/prompt-caching
|
| - https://platform.claude.com/docs/en/build-with-
| claude/prompt...
|
| - https://ai.google.dev/gemini-api/docs/caching
| imiric wrote:
| Ah, that's good to know, thanks.
|
| But then why is there compounding token usage in the
| article's trivial solution? Is it just a matter of using the
| cache correctly?
| StevenWaterman wrote:
| Cached tokens are cheaper (roughly a 90% discount) but not free
| moyix wrote:
| Also, unlike OpenAI, Anthropic's prompt caching is
| _explicit_ (you set up to 4 cache "breakpoints"),
| meaning if you don't implement caching then you don't
| benefit from it.
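|
| A minimal sketch of setting one of those breakpoints with the
| Python SDK (the model id is a placeholder and
| transcript_so_far is a stand-in for the stable prefix; see the
| caching docs linked above):
|
|     import anthropic
|
|     client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
|
|     response = client.messages.create(
|         model="claude-3-5-sonnet-20241022",  # placeholder id
|         max_tokens=1024,
|         system=[{
|             "type": "text",
|             "text": transcript_so_far,  # long, stable prefix
|             # Cache breakpoint: later calls that share this
|             # prefix read it from cache at a steep discount
|             # instead of paying full input price.
|             "cache_control": {"type": "ephemeral"},
|         }],
|         messages=[{"role": "user", "content": "> look"}],
|     )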
| netcraft wrote:
| That's a very generous way of putting it. Anthropic's
| prompt caching is actively hostile and very difficult to
| implement properly.
| sfjailbird wrote:
| Cool! I would like to see the game sessions.
|
| Edit: they are there in the repo:
| https://github.com/eudoxia0/claude-plays-anchorhead/tree/mas...
| tiahura wrote:
| Claude Code, NetHack, and tmux are fun to experiment with.
| brimtown wrote:
| I'm currently letting Claude build and play its own Dwarf
| Fortress clone, as an installable plugin in Claude Code
|
| https://github.com/brimtown/claude-fortress
| twohearted wrote:
| This is a great idea and great work.
|
| Context is intuitively important, but people rarely put
| themselves in the LLM's shoes.
|
| It would be eye-opening to create an LLM test system that
| periodically sends a turn to a human instead of the model.
| Would you do better than the LLM? What tools would you call at
| that moment, given only that context and no other knowledge? The
| way many of these systems are constructed, I'd wager it would be
| difficult for a human.
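|
| A crude sketch of that harness: with some small probability a
| turn is routed to a human at the terminal, who sees only the
| context the model would have seen (call_model and the flat
| context string are stand-ins):
|
|     import random
|
|     def next_command(context: str, human_rate: float = 0.05):
|         if random.random() < human_rate:
|             # The human plays this turn with ONLY the model's
|             # context, and no memory of earlier turns.
|             print(context)
|             return input("your move> ")
|         return call_model(context)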
|
| The agent can't decide what is safe to delete from memory because
| it's a sort of bystander at that moment. Someone else made the
| list it received, and someone else will get the list it writes.
| The logic that went into why the notes exist is lost. LLMs are
| living the Christopher Nolan film Memento.
| lukev wrote:
| This is a great framework to experiment with memory
| architectures.
|
| Everything the author says about memory management tracks with
| my intuition of how Claude Code works, including my perception
| that it isn't very good at explicitly managing its own memory.
|
| My next step in getting it to work well on a bigger game would
| be to build a more "intuitive" memory tool, where the textual
| description of a room or an item would _automatically_ RAG
| previous interactions with that entity into context.
|
| That also is closer to how human memory works -- we're instantly
| reminded of things via a glimpse, a sound, a smell... we don't
| need to (analogously) write in or search our notebook for basic
| info we already know about the world.
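|
| A toy sketch of that kind of associative recall, using naive
| word overlap as a stand-in for a real embedding index:
|
|     class EntityMemory:
|         def __init__(self):
|             self.notes: list[str] = []
|
|         def remember(self, note: str) -> None:
|             self.notes.append(note)
|
|         def recall(self, description: str, k: int = 3):
|             # Rank stored notes by word overlap with the text
|             # the game just printed; a real version would use
|             # embeddings and a vector index.
|             words = set(description.lower().split())
|             ranked = sorted(
|                 self.notes,
|                 key=lambda n: len(words & set(n.lower().split())),
|                 reverse=True,
|             )
|             return ranked[:k]
|
|     # Whenever the game prints a room or item description, the
|     # top matching notes get injected into context on their
|     # own, rather than the model having to search a notebook.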
| daxfohl wrote:
| I tried something similar, but distilled to "solve this maze" as
| a text adventure, and while it usually solved it eventually, it
| almost always backtracked through fully-explored dead ends
| multiple times before finally getting to the end. I was pretty
| surprised by this, as I expected it would be able to traverse
| more or less optimally most of the time.
|
| I tried basic raw long-context chat, various approaches of
| getting it to externalize the state (i.e. prompting it to emit
| the known state of the maze after each move, but _not_ telling it
| exactly what to emit or how to format it), and even allowing it
| to emit code to execute after each turn (so long as it was a
| serialization/storage algorithm, not a solver in itself), but
| it would invariably get lost at some point. (It always
| neglected to emit a key for which coordinate was which and
| which direction was increasing. Even when I explicitly told it
| to do this, it would frequently forget at some point anyway and
| get turned around again.)
|
| Of course it had no problem writing an optimal algorithm to solve
| mazes when prompted. I thought the disparity was interesting.
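|
| (For reference, the kind of thing it would happily produce when
| asked directly: a standard breadth-first search over a grid,
| sketched here as an illustration.)
|
|     from collections import deque
|
|     def solve(maze, start, goal):
|         # maze: 2D list of bools, True = open cell.
|         # Returns the shortest path as a list of (row, col).
|         queue = deque([start])
|         parent = {start: None}
|         while queue:
|             cell = queue.popleft()
|             if cell == goal:
|                 path = []
|                 while cell is not None:
|                     path.append(cell)
|                     cell = parent[cell]
|                 return path[::-1]
|             r, c = cell
|             for nxt in ((r+1, c), (r-1, c), (r, c+1), (r, c-1)):
|                 nr, nc = nxt
|                 if (0 <= nr < len(maze) and 0 <= nc < len(maze[0])
|                         and maze[nr][nc] and nxt not in parent):
|                     parent[nxt] = cell
|                     queue.append(nxt)
|         return None  # goal unreachable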
|
| Note the mazes had the start and end positions inside the maze
| itself, so they weren't trivially solvable by the "follow the
| left-hand wall" algorithm.
|
| This was last summer so maybe newer models would do better. I
| also stopped due to cost.
| sfjailbird wrote:
| Having read through the entire game session, Claude plays the
| game admirably! For example, it finds a random tin of oily fish
| somewhere, and later tries (unsuccessfully) to use it to oil a
| rusty lock. Later it successfully solves a puzzle inside the
| house by thoroughly examining random furniture and picking up
| subtle clues about what to do.
|
| It did so well that I can't help but suspect it used some
| hints or walkthroughs, but then again it did a bunch of
| clueless stuff too, like any player new to the game.
|
| For one thing, this would be a great testing tool for the author
| of such a game. And more generally, the world of software testing
| is probably about to take some big leaps forward.
| woggy wrote:
| Very interesting; this seems like a good framework for testing
| and experimenting with memory. I am curious why it wasn't able
| to solve it, considering it is a well-known game. It would be
| interesting if puzzle games like this could be generated, so we
| know the model hasn't already been trained on them.
|
| I wonder whether the improvements from different memory-system
| approaches apply similarly to tasks that are in the training
| data vs. those that are not.
| justinclift wrote:
| This would be interesting to try with local models, where the
| token costs and token limits are quite different.
___________________________________________________________________
(page generated 2026-01-21 23:00 UTC)