[HN Gopher] Baba Is Eval
       ___________________________________________________________________
        
       Baba Is Eval
        
       Author : fi-le
       Score  : 242 points
       Date   : 2025-07-03 13:54 UTC (2 days ago)
        
 (HTM) web link (fi-le.net)
 (TXT) w3m dump (fi-le.net)
        
       | kinduff wrote:
       | Do you think the performance can be improved if the
       | representation of the level is different?
       | 
       | I've seen AI struggle with ASCII, but when presented as other
       | data structures, it performs better.
       | 
       | edit:
       | 
        | e.g. JSON with structured coordinates, graph-based JSON, or a
        | semantic representation with coordinates
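        | 
        | Something like this, say (an illustrative sketch of the kind of
        | structured encoding I mean; the object names and layout are
        | made up, not any particular level):
        | 
        |   {
        |     "grid_size": [10, 7],
        |     "objects": [
        |       {"name": "baba", "x": 2, "y": 4},
        |       {"name": "flag", "x": 8, "y": 4},
        |       {"name": "text_baba", "x": 1, "y": 1},
        |       {"name": "text_is",   "x": 2, "y": 1},
        |       {"name": "text_you",  "x": 3, "y": 1}
        |     ],
        |     "active_rules": ["BABA IS YOU"]
        |   }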
        
         | hajile wrote:
         | If it struggles with the representation, that makes it an even
         | better test of the AI's thinking potential.
        
           | eru wrote:
           | I'm not sure. Adding superficial difficulties to an IQ test
           | for humans doesn't (necessarily) improve it as an IQ test.
        
         | RainyDayTmrw wrote:
         | In the limit case, to an actual general intelligence,
         | representation is superfluous, because it can figure out how to
         | convert freely.
         | 
         | To the extent that the current generation of AI isn't general,
         | yeah, papering over some of its weaknesses may allow you to
         | expose other parts of it, both strengths and other weaknesses.
        
           | kadoban wrote:
            | A human can easily struggle to solve a poorly communicated
            | puzzle, especially if paper and pencil or something isn't
            | available to convert it to a better format. LLMs can look
            | back at what they wrote, but that seems like a poor medium
            | for working out a better representation to me.
        
         | QuadmasterXLII wrote:
         | These models can "code," but they can't code yet. We'll know
         | that they can actually code once their performance on these
         | tasks becomes invariant to input representation, because they
         | can just whip up a script to convert representations.
        
       | k2xl wrote:
        | Baba Is You is a great game, part of a broader family of 2D
        | grid puzzle games.
        | 
        | (Shameless plug: I am one of the developers of Thinky.gg
        | (https://thinky.gg), a thinky puzzle game site with a
        | 'shortest path' style game [Pathology] and a Sokoban variant
        | [Sokoath].)
       | 
        | These games are typically NP-hard, so the techniques solvers
        | have employed for Sokoban (or Pathology) are brute-force
        | search with various heuristics (BFS, deadlock detection,
        | Zobrist hashing, and the like). However, once levels get
        | beyond a certain size with enough movable blocks, you end up
        | exhausting memory pretty quickly.
       | 
        | These types of games are still "AI-proof" so far, in that
        | LLMs are absolutely awful at solving them while humans are
        | very good (so it seems reasonable to consider them for
        | ARC-AGI-style benchmarks). Whenever a new reasoning model
        | gets released I typically try it on some basic Pathology
        | levels (like 'One at a Time'
        | https://pathology.thinky.gg/level/ybbun/one-at-a-time) and
        | they fail miserably.
       | 
        | Simple level code for the above level (0 is empty, 1 is a
        | wall, 2 is a movable block, 3 is the exit, 4 is the starting
        | block):
       | 
        | 000
        | 020
        | 023
        | 041
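        | 
        | For flavor, here is roughly what the brute-force approach
        | looks like on this encoding (a minimal Python sketch; the
        | cell semantics are my reading of the level code above, it
        | assumes you win by reaching the exit and blocks push
        | Sokoban-style, and a real solver would add deadlock detection
        | and Zobrist hashing on top):
        | 
        |   from collections import deque
        | 
        |   LEVEL = ["000", "020", "023", "041"]
        | 
        |   def solve(level):
        |       R, C = len(level), len(level[0])
        |       cells = lambda ch: {(r, c) for r in range(R)
        |                           for c in range(C) if level[r][c] == ch}
        |       walls = cells("1")
        |       exit_ = cells("3").pop()
        |       start = (cells("4").pop(), frozenset(cells("2")))
        |       seen, queue = {start}, deque([(start, "")])
        |       while queue:
        |           ((pr, pc), blocks), path = queue.popleft()
        |           if (pr, pc) == exit_:
        |               return path  # BFS: first hit is a shortest path
        |           for m, (dr, dc) in [("U", (-1, 0)), ("D", (1, 0)),
        |                               ("L", (0, -1)), ("R", (0, 1))]:
        |               p, b = (pr + dr, pc + dc), blocks
        |               if (not (0 <= p[0] < R and 0 <= p[1] < C)
        |                       or p in walls):
        |                   continue
        |               if p in blocks:  # try to push the block one cell
        |                   q = (p[0] + dr, p[1] + dc)
        |                   if (not (0 <= q[0] < R and 0 <= q[1] < C)
        |                           or q in walls or q in blocks):
        |                       continue
        |                   b = blocks - {p} | {q}
        |               state = (p, b)
        |               if state not in seen:
        |                   seen.add(state)
        |                   queue.append((state, path + m))
        |       return None
        | 
        |   print(solve(LEVEL))  # a shortest move string, e.g. "LUURDR"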
       | 
        | Similar to OP, I've found Claude can't manage rule dynamics,
        | blocked paths, or game objectives well, and spits out random
        | results.
        
         | kinduff wrote:
          | In the Factorio paper [1], page 3, the agent receives a
          | semantic representation with coordinates. Have you tried
          | this data format?
         | 
         | [1]: https://arxiv.org/pdf/2503.09617
        
         | eru wrote:
          | NP-hardness isn't much of a problem, because the levels are
          | fairly small, and instances are not chosen to be worst-case
          | hard but to be entertaining for humans to solve.
          | 
          | SMT/SAT solvers or integer linear programming can get you
          | pretty far. Many classic puzzle games like Minesweeper are
          | NP-hard, and you can solve any instance that a human would
          | be able to solve in their lifetime fairly quickly on a
          | computer.
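          | 
          | As a taste of the SAT approach, here is a toy Minesweeper
          | deduction (a minimal sketch assuming the python-sat
          | package; boolean variables 1..3 mean "cell i is a mine"):
          | 
          |   from pysat.solvers import Glucose3
          | 
          |   # A revealed "1" touching cells 1,2: exactly one is a mine.
          |   # A revealed "2" touching cells 1,2,3: exactly two are.
          |   clauses = [[1, 2], [-1, -2],
          |              [1, 2], [1, 3], [2, 3], [-1, -2, -3]]
          | 
          |   with Glucose3(bootstrap_with=clauses) as solver:
          |       # "Cell 3 is safe" is consistent iff -3 is satisfiable.
          |       print(solver.solve(assumptions=[-3]))  # False: a mine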
        
       | ekianjo wrote:
        | This is definitely a case for fine-tuning an LLM on this
        | game's data. There is currently no LLM out there that can
        | play many games of different kinds well.
        
       | captn3m0 wrote:
        | I once made an "RC plays Baba Is You" setup that controlled
        | the game through a single shared browser, streaming video
        | out and sending controls back to the game. It was quite fun!
        | 
        | But I am fairly sure all of the Baba Is You solutions are
        | present in the training data for modern LLMs, so it won't
        | make for a good eval.
        
         | chmod775 wrote:
          | > But I am fairly sure all of the Baba Is You solutions are
          | present in the training data for modern LLMs, so it won't
          | make for a good eval.
          | 
          | Claude 4 _cannot_ solve any Baba Is You level (except level
          | 0, which is solved by eight rightward inputs), so for now
          | it's at least a nice low bar to shoot for...
        
       | RainyDayTmrw wrote:
       | This is interesting. If you approach this game as individual
       | moves, the search tree is really deep. However, most levels can
       | be expressed as a few intermediate goals.
       | 
       | In some ways, this reminds me of the history of AI Go (board
       | game). But the resolution there was MCTS, which wasn't at all
       | what we wanted (insofar as MCTS is not generalizable to most
       | things).
        
         | rtpg wrote:
         | > However, most levels can be expressed as a few intermediate
         | goals
         | 
         | I think generally the whole thing with puzzle games is that you
         | have to determine the "right" intermediate goals. In fact, the
         | naive intermediate goals are often entirely wrong!
         | 
          | A canonical Sokoban-like inversion might be a level where
          | you have to push two blocks into goal areas. You might
          | think "OK, push one block into its goal area, then push
          | the other into its."
          | 
          | But many of these games have mechanisms that mean you would
          | first want to push one block into its goal, then undo that
          | for some reason (it might activate some extra
          | functionality), push the other block, and then finally go
          | back and redo the first.
          | 
          | There are always weird tricks that mean you're going to
          | walk backwards before walking forwards. I don't think it's
          | impossible for these things to stumble into it, though.
          | They just might spin a lot of cycles to get there (humans
          | do too, I guess).
        
           | matsemann wrote:
            | Yeah, working backwards and forwards at the same time is
            | often how to solve advanced puzzle games, and it's what
            | keeps the options from exploding. When thinking backwards
            | from the goal, you figure out constraints or "invariants"
            | the forward path must uphold, and thus can discard lots
            | of dead ends earlier in your forward path.
           | 
           | To me, those discoveries are the fun part of most puzzle
           | games. When you unlock the "trick" for each level and the
           | dopamine flies, heh.
        
             | TeMPOraL wrote:
              | I usually get good mileage out of jumping straight into
              | the middle :). Like, "hmm, let's look at this block; oh
              | cool, there's enough space around it that I could push
              | it away from the goal, for whatever reason". Turns out,
              | if it's possible, there usually is a good reason. So
              | whenever I get stuck, I skim every object in the puzzle
              | and consider it in isolation: what can I do with it?
              | That usually gives me anchor points to drive my forward
              | or backward thinking through.
        
         | kadoban wrote:
         | > But the resolution there was MCTS
         | 
         | MCTS wasn't _really_ the solution to go. MCTS-based AIs existed
         | for years and they weren't _that_ good. They weren't superhuman
         | for sure, and the moves/games they played were kind of boring.
         | 
          | The key to playing go well was doing something that vaguely
          | looks like MCTS, but the real guts are a network that can
          | answer "who's winning?" and "what are good moves to try
          | here?", using that to guide search. Additionally essential
          | was realizing that computation (run search for a while)
          | with a bad model could be effectively+efficiently used to
          | generate better training data to train a better model.
        
           | eru wrote:
           | > Additionally essential was realizing that computation (run
           | search for a while) with a bad model could be
           | effectively+efficiently used to generate better training data
           | to train a better model.
           | 
           | That has been known since at least the 1990s with TD-Gammon
           | beating the world champions in Backgammon. See eg
           | http://incompleteideas.net/book/ebook/node108.html or
           | https://en.wikipedia.org/wiki/TD-Gammon
           | 
            | In a sense, classic chess engines do that, too:
            | alpha-beta search combines a very weak model (eg just
            | checking for checkmate, otherwise counting material, or
            | what have you) with search to produce a much stronger
            | player. You can use that to generate data for training a
            | better model.
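            | 
            | A toy version of that "weak model + search" idea, in case
            | it's useful (a sketch; the stub game is illustrative:
            | players alternate taking 1 or 2 stones, and whoever takes
            | the last stone wins):
            | 
            |   INF = float("inf")
            | 
            |   def evaluate(stones, maximizing):
            |       # Deliberately weak model: it knows only terminal
            |       # won/lost positions, no heuristic knowledge at all.
            |       if stones == 0:  # no stones: player to move lost
            |           return -1 if maximizing else 1
            |       return 0
            | 
            |   def search(stones, depth, alpha, beta, maximizing):
            |       if stones == 0 or depth == 0:
            |           return evaluate(stones, maximizing)
            |       best = -INF if maximizing else INF
            |       for take in (1, 2):
            |           if take > stones:
            |               continue
            |           v = search(stones - take, depth - 1,
            |                      alpha, beta, not maximizing)
            |           if maximizing:
            |               best, alpha = max(best, v), max(alpha, v)
            |           else:
            |               best, beta = min(best, v), min(beta, v)
            |           if alpha >= beta:  # cutoff: line never reached
            |               break
            |       return best
            | 
            |   # Enough depth turns the know-nothing evaluator into a
            |   # perfect player: 6 stones loses for the side to move.
            |   print(search(6, 10, -INF, INF, True))  # -1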
        
             | kadoban wrote:
             | > That has been known since at least the 1990s with TD-
             | Gammon beating the world champions in Backgammon.
             | 
             | Yeah, I didn't mean to imply that reinforcement learning
             | (or applying it in this way) is novel. It was just
             | important to work out how to apply that to go specifically.
             | 
              | > In a sense, classic chess engines do that, too:
              | alpha-beta search combines a very weak model (eg just
              | checking for checkmate, otherwise counting material,
              | or what have you) with search to produce a much
              | stronger player. You can use that to generate data for
              | training a better model.
             | 
              | I would say that classic chess AIs specifically don't
              | do the important part. They aren't able to use a worse
              | model to, with computation, train a better model. They
              | can generate training data, but then they have no way
              | to incorporate it back into the AI.
        
       | pclmulqdq wrote:
       | I have noticed a trend of the word "Desiderata" appearing in a
       | lot more writing. Is this an LLM word or is it just in fashion?
        | Most people would use the words "Desires" or "Goals," so I
        | assume this might be the new "delve."
        
         | Tomte wrote:
          | It's academic jargon. Desiderata are often at the end of a
          | paper, in the section "someone should investigate X, but
          | I'm moving on to the next funded project".
        
           | ginko wrote:
            | So "Future Work"?
        
             | dgfl wrote:
              | Literally it means "things that we wish for", from the
              | Latin verb "desiderare" (to wish).
        
         | fi-le wrote:
         | At least in this instance, it came from my fleshy human brain.
         | Although I perhaps used it to come off as smarter than I really
         | am - just like an LLM might.
        
       | wohoef wrote:
        | In my experience LLMs have a hard time working with text
        | grids like this. They seem to find columns harder to "detect"
        | than rows, probably because their input shows the grid as one
        | giant row, if that makes sense.
        | 
        | They have the same problem with playing chess. But I'm not
        | sure if there is a datatype they could work with for this
        | kind of game. Currently it seems more like LLMs can't really
        | work on spatial problems. But this should actually be
        | something that can be fixed (pretty sure I saw an article
        | about it on HN recently)
        
         | froobius wrote:
          | Transformers can easily be trained / designed to handle
          | grids; it's just that standard off-the-shelf LLMs haven't
          | particularly been (although they would have seen some)
        
           | nine_k wrote:
           | Are there some well-known examples of success in it?
        
             | thethimble wrote:
              | Vision transformers effectively encode a grid of pixel
              | patches. It's ultimately a matter of ensuring the
              | position encoding incorporates both X and Y position.
              | 
              | For LLMs we only have one axis of position and - more
              | importantly - the vast majority of training data is
              | only oriented in this way.
        
         | stavros wrote:
         | If this were a limitation in the architecture, they wouldn't be
         | able to work with images, no?
        
           | hnlmorg wrote:
           | LLMs don't work with images.
        
             | stavros wrote:
             | They do, though.
        
               | hnlmorg wrote:
               | Do they? I thought it was completely different models
               | that did image generation.
               | 
               | LLMs might be used to translate requests into keywords,
               | but I didn't think LLMs themselves did any of the image
               | generation.
               | 
               | Am I wrong here?
        
               | stavros wrote:
               | Yes, that's why ChatGPT can look at an image and change
               | the style, or edit things in the image. The image itself
               | is converted to tokens and passed to the LLM.
        
               | hnlmorg wrote:
               | LLMs can be used as an agent to do all sorts of clever
               | things, but it doesn't mean the LLM is actually handling
               | the original data format.
               | 
               | I've created MCP servers that can scrape websites but
               | that doesn't mean the LLM _itself_ can make HTTP calls.
               | 
               | The reason I make this distinction is because someone
               | claimed that LLMs can read images. But they don't. They
               | act as an agent for another model that reads images and
                | creates metadata from it. LLMs then turn that
                | metadata into natural language.
               | 
               | The LLM itself doesn't see any pixels. It sees textual
               | information that another model has provided.
               | 
               | Edit: reading more about this online, it seems LLMs can
               | work with pixel level data. I had no idea that was
               | possible.
               | 
               | My apologies.
        
               | stavros wrote:
               | No problem. Again, if it happened the way you described
               | (which it did, until GPT-4o recently), the LLM wouldn't
               | have been able to edit images. You can't get a textual
               | description of an image and reconstruct it perfectly just
               | from that, with one part edited.
        
         | fi-le wrote:
         | Good point. The architectural solution that would come to mind
         | is 2D text embeddings, i.e. we add 2 sines and cosines to each
         | token embedding instead of 1. Apparently people have done it
         | before: https://arxiv.org/abs/2409.19700v2
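          | 
          | Roughly like this, say (a minimal numpy sketch of the fixed
          | sinusoidal variant; half the channels encode x and half y,
          | and the linked paper's scheme differs in details):
          | 
          |   import numpy as np
          | 
          |   def posemb2d(x, y, dim):
          |       # dim must be divisible by 4: sin/cos for x and y.
          |       half = dim // 2
          |       freqs = 10000.0 ** (-np.arange(0, half, 2) / half)
          |       enc = lambda p: np.concatenate(
          |           [np.sin(p * freqs), np.cos(p * freqs)])
          |       return np.concatenate([enc(x), enc(y)])
          | 
          |   # Added to the token embedding for the cell at (x, y),
          |   # in place of the usual single 1D position index.
          |   print(posemb2d(3, 1, dim=64).shape)  # (64,)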
        
           | ninjha wrote:
           | I think I remember one of the original ViT papers saying
           | something about 2D embeddings on image patches not actually
           | increasing performance on image recognition or segmentation,
           | so it's kind of interesting that it helps with text!
           | 
           | E: I found the paper: https://arxiv.org/pdf/2010.11929
           | 
           | > We use standard learnable 1D position embeddings, since we
           | have not observed significant performance gains from using
           | more advanced 2D-aware position embeddings (Appendix D.4).
           | 
           | Although it looks like that was just ImageNet so maybe this
           | isn't that surprising.
        
             | yorwba wrote:
             | They seem to have used a fixed input resolution for each
             | model, so the learnable 1D position embeddings are
             | equivalent to learnable 2D position embeddings where every
             | grid position gets its own embedding. It's when different
             | images may have a different number of tokens per row that
             | the correspondence between 1D index and 2D position gets
             | broken and a 2D-aware position embedding can be expected to
             | produce different results.
        
       | tibastral2 wrote:
       | It reminds me of
       | https://en.m.wikipedia.org/wiki/The_Ricks_Must_Be_Crazy. Hope we
       | are not ourselves in some sort of simulation ;)
        
       | ThouTo2C wrote:
        | There are numerous guides for all levels of Baba Is You
        | available. I think it's likely that any modern LLM has them
        | as part of its training dataset. That severely degrades this
        | as a test of complex problem-solving capabilities.
        | 
        | Still, it's interesting to see the challenges with dynamic
        | rules (like "Key Is Stop") that change where you are able to
        | move, etc.
        
         | klohto wrote:
         | Read the article first maybe
        
         | ethan_smith wrote:
         | The dynamic rule changes are precisely what make this a
         | valuable benchmark despite available guides. Each rule
         | modification creates a novel state-space that requires
         | reasoning about the consequences of those changes, not just
         | memorizing solution paths.
        
       | niemandhier wrote:
       | I think it's a great idea for a benchmark.
       | 
        | One key difference from ARC in its current iteration is that
        | there is a defined and learnable game physics.
        | 
        | ARC requires generalization based on few examples for
        | problems that are not well defined per se.
        | 
        | Hence ARC currently requires the models that work on it to
        | possess biases comparable to the ones that humans possess.
        
       | andy99 wrote:
        | I suspect real AGI evals aren't going to be "IQ test"-like,
        | which is how I'd categorize these benchmarks.
       | 
       | LLMs will probably continue to scale on such benchmarks, as they
       | have been, without needing real ingenuity or intelligence.
       | 
       | Obviously I don't know the answer but I think it's the same root
       | problem as why neural networks will never lead to intelligence.
       | We're building and testing idiot savants.
        
       | popcar2 wrote:
       | I would be way more interested in it playing niche community
       | levels, because I suspect a huge reason it's able to solve these
       | levels is because it was trained on a million Baba is You
       | walkthroughs. Same with people using Pokemon as a way to test
       | LLMs, it really just depends on how well it knows the game.
        
         | fi-le wrote:
          | Two corrections, as written in the post: At least Claude is
          | not able to solve the standard levels at all, and community
          | levels are definitely in scope.
        
       | WhitneyLand wrote:
       | "Reasoning models like o3 might be better equipped to come up
       | with a plan, so a natural step would be to try switching to
       | those, away from Claude Desktop..."
       | 
       | But...Claude Desktop does have a reasoning mode for both Sonnet
       | and Opus.
        
       | zahlman wrote:
       | > This is why the video of Claude solving level 1 at the top was
       | actually (dramatic musical cue) staged, and only possible via a
       | move-for-move tutorial that Claude nicely rationalized post hoc.
       | 
       | One of the things this arc of history has taught me is that post-
       | hoc rationalization is depressingly easy. Especially if it
       | doesn't have to make sense, but even passing basic logical checks
       | isn't too difficult. Ripping the rationalization apart often
       | requires identifying novel, non-obvious logical checks.
       | 
       | I thought I had learned that time and time again from human
       | politics, but AI somehow made it even clearer than I thought
       | possible. Perhaps simply because of _knowing_ that a machine is
       | doing it.
       | 
       | Edit: after watching the video more carefully:
       | 
       | > "This forms WALL IS WIN horizontally. But I need "FLAG IS WIN"
       | instead. Let me check if walls now have the WIN property. If they
       | do, I just need to touch a wall to win. Let me try moving to a
       | wall:
       | 
       | There's something extremely uncanny-valley about this. A human
       | player absolutely would accidentally win like this, and have
       | similar reasoning (not expressed so formally) about how the win
       | was achieved _after the fact_. (Winning depends on the walls
       | having WIN and _also not_ having STOP; many players get stuck on
       | later levels, even after having supposedly learned the lesson of
       | this one, by trying to make something WIN and walk onto it while
       | it is still STOP.)
       | 
       | But the WIN block was not originally in line with the WALL IS
       | text, so a human player would never accidentally form the rule,
       | but would only do it with the expectation of being able to win
       | that way. Especially since there was already an obvious, clear
       | path to FLAG -- a level like this has no Sokoban puzzle element
       | to it; it's purely about learning that the walls only block the
       | player because they are STOP.
       | 
       | Nor would (from my experience watching streamers at least) a
       | human spontaneously notice that the rule "WALL IS WIN" had been
       | formed and treat that as a cue to reconsider the entire strategy.
       | The natural human response to unintentionally forming a useful
       | rule is to keep pushing in the same direction.
       | 
       | On the other hand, an actually _dedicated_ AI system (in the way
        | that AlphaGo was dedicated to Go) could, I'm sure, figure out a
       | game like Baba Is You pretty easily. It would lack the human
       | instinct to treat the walls as if they were implicitly always
       | STOP; so it would never struggle with overriding it.
        
         | deadbabe wrote:
          | A simple feed-forward neural network with sufficient
          | training can solve levels way better than Claude. Why is
          | Claude being used at all?
        
           | wredcoll wrote:
           | The question isn't "can we write a computer program that can
           | beat X game", it is "do things like claude represent a truly
           | general purpose intelligence as demonstrated by its ability
           | to both write a limerick and play baba is you"
        
       ___________________________________________________________________
       (page generated 2025-07-05 23:01 UTC)