[HN Gopher] Baba Is Eval
___________________________________________________________________
Baba Is Eval
Author : fi-le
Score : 242 points
Date : 2025-07-03 13:54 UTC (2 days ago)
(HTM) web link (fi-le.net)
(TXT) w3m dump (fi-le.net)
| kinduff wrote:
| Do you think the performance can be improved if the
| representation of the level is different?
|
| I've seen AI struggle with ASCII, but when the level is
| presented as other data structures, it performs better.
|
| edit:
|
| e.g. JSON with structured coordinates, graph-based JSON, or a
| semantic representation with coordinates
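|
| A minimal sketch of what that might look like (the schema is
| illustrative, not from any actual eval):
|
|       import json
|
|       level = {
|           "width": 3, "height": 4,
|           "objects": [
|               {"type": "text_baba", "x": 0, "y": 0},
|               {"type": "baba",      "x": 1, "y": 3},
|               {"type": "flag",      "x": 2, "y": 2},
|           ],
|           "rules": ["BABA IS YOU", "FLAG IS WIN"],
|       }
|       print(json.dumps(level, indent=2))  # what the model sees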
| hajile wrote:
| If it struggles with the representation, that makes it an even
| better test of the AI's thinking potential.
| eru wrote:
| I'm not sure. Adding superficial difficulties to an IQ test
| for humans doesn't (necessarily) improve it as an IQ test.
| RainyDayTmrw wrote:
| In the limit case, to an actual general intelligence,
| representation is superfluous, because it can figure out how to
| convert freely.
|
| To the extent that the current generation of AI isn't general,
| yeah, papering over some of its weaknesses may allow you to
| expose other parts of it, both strengths and other weaknesses.
| kadoban wrote:
| A human can easily struggle to solve a poorly communicated
| puzzle, especially if paper and pencil or something similar
| isn't available to convert it to a better format. LLMs can
| look back at what they wrote, but that seems to me like a
| poor medium for working out a better representation.
| QuadmasterXLII wrote:
| These models can "code," but they can't code yet. We'll know
| that they can actually code once their performance on these
| tasks becomes invariant to input representation, because they
| can just whip up a script to convert representations.
| k2xl wrote:
| Baba Is You is a great game, part of a broader family of 2D
| grid puzzle games.
|
| (Shameless plug: I am one of the developers of Thinky.gg
| (https://thinky.gg), a thinky puzzle game site with a
| 'shortest path'-style game [Pathology] and a Sokoban variant
| [Sokoath].)
|
| These games are typically NP-hard, so the techniques that
| solvers have employed for Sokoban (or Pathology) amount to
| brute-force search with various heuristics (like BFS,
| deadlock detection, and Zobrist hashing). However, once
| levels get beyond a certain size with enough movable blocks,
| you end up exhausting memory pretty quickly.
|
| These types of games are still "AI-proof" so far, in that
| LLMs are absolutely awful at solving them while humans are
| very good (so they seem reasonable to consider for ARC-AGI
| benchmarks).
| Whenever a new reasoning model gets released I typically try it
| on some basic Pathology levels (like 'One at a Time'
| https://pathology.thinky.gg/level/ybbun/one-at-a-time) and they
| fail miserably.
|
| Simple level code for the above level (0 is empty, 1 is a
| wall, 2 is a movable block, 3 is the exit, 4 is the starting
| block):
|
|       000
|       020
|       023
|       041
|
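| For concreteness, here is a brute-force BFS over (player,
| blocks) states for the encoding above, in Python. It's a
| hedged sketch that assumes Sokoban-style pushing (stepping
| into a 2 shifts it one cell if the cell behind is free); the
| real Pathology rules differ in some details.
|
|       from collections import deque
|
|       LEVEL = ["000", "020", "023", "041"]
|
|       def parse(level):
|           walls, blocks = set(), set()
|           start = goal = None
|           for y, row in enumerate(level):
|               for x, c in enumerate(row):
|                   if c == "1": walls.add((x, y))
|                   if c == "2": blocks.add((x, y))
|                   if c == "3": goal = (x, y)
|                   if c == "4": start = (x, y)
|           return walls, frozenset(blocks), start, goal
|
|       def solve(level):
|           """BFS over (player, blocks) states; returns moves."""
|           walls, blocks, start, goal = parse(level)
|           h, w = len(level), len(level[0])
|           inside = lambda x, y: 0 <= x < w and 0 <= y < h
|           dirs = {"U": (0, -1), "D": (0, 1),
|                   "L": (-1, 0), "R": (1, 0)}
|           queue = deque([(start, blocks, [])])
|           seen = {(start, blocks)}
|           while queue:
|               (px, py), blks, path = queue.popleft()
|               if (px, py) == goal:
|                   return path
|               for name, (dx, dy) in dirs.items():
|                   nx, ny = px + dx, py + dy
|                   if not inside(nx, ny) or (nx, ny) in walls:
|                       continue
|                   nblks = blks
|                   if (nx, ny) in blks:  # try pushing the block
|                       bx, by = nx + dx, ny + dy
|                       if (not inside(bx, by) or (bx, by) in walls
|                               or (bx, by) in blks):
|                           continue
|                       nblks = (blks - {(nx, ny)}) | {(bx, by)}
|                   if ((nx, ny), nblks) not in seen:
|                       seen.add(((nx, ny), nblks))
|                       queue.append(((nx, ny), nblks, path + [name]))
|
|       print(solve(LEVEL))  # e.g. ['L', 'U', 'U', 'R', 'D', 'R']
|
| The state space is what blows up: every movable block
| multiplies the number of distinct (player, blocks) states you
| have to hash and store.
|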
| Similar to OP, I've found Claude can't manage rule dynamics,
| blocked paths, or game objectives well, and it spits out
| random results.
| kinduff wrote:
| In Factorio's paper [1] page 3, the agent receives a semantic
| representation with coordinates. Have you tried this data
| format?
|
| [1]: https://arxiv.org/pdf/2503.09617
| eru wrote:
| NP-hardness isn't much of a problem, because the levels are
| fairly small, and instances are not chosen to be worst-case
| hard but to be entertaining for humans to solve.
|
| SMT/SAT solvers or integer linear programming can get you
| pretty far. Many classic puzzle games like Minesweeper are
| NP-hard, and yet you can solve any instance that a human
| could solve in a lifetime fairly quickly on a computer.
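|
| As a sketch of the SAT route: a toy Minesweeper deduction,
| assuming the python-sat package (variable i means "cell i is
| a mine"; each clue becomes CNF clauses):
|
|       from pysat.solvers import Glucose3
|
|       s = Glucose3()
|       s.add_clause([1, 2])    # clue: at least one mine in {1, 2}
|       s.add_clause([-1, -2])  # clue: at most one mine in {1, 2}
|       s.add_clause([2, 3])    # clue: at least one mine in {2, 3}
|       s.add_clause([-2, -3])  # clue: at most one mine in {2, 3}
|       s.add_clause([1])       # we already know cell 1 is a mine
|
|       # Assume cell 2 is a mine; UNSAT means it is provably safe.
|       print(s.solve(assumptions=[2]))   # False -> cell 2 is safe
|       # Assume cell 3 is not a mine; UNSAT means it must be one.
|       print(s.solve(assumptions=[-3]))  # False -> cell 3 is a mine
|
| The same pattern scales to full boards: one variable per cell
| and one cardinality constraint per revealed number.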
| ekianjo wrote:
| This is definitely a case for fine-tuning an LLM on this
| game's data. There is currently no LLM out there that can
| play many different kinds of games well.
| captn3m0 wrote:
| I once made an "RC plays Baba Is You" that controlled the
| game through a single shared browser, streaming video out and
| relaying controls back to the game. Was quite fun!
|
| But I am fairly sure all of Baba Is You solutions are present in
| the training data for modern LLMs so it won't make for a good
| eval.
| chmod775 wrote:
| > But I am fairly sure all of Baba Is You solutions are present
| in the training data for modern LLMs so it won't make for a
| good eval.
|
| Claude 4 _cannot_ solve any Baba Is You level (except level
| 0, which is solved by 8 "right" inputs), so for now it's at
| least a nice low bar to shoot for...
| RainyDayTmrw wrote:
| This is interesting. If you approach this game as individual
| moves, the search tree is really deep. However, most levels can
| be expressed as a few intermediate goals.
|
| In some ways, this reminds me of the history of AI Go (board
| game). But the resolution there was MCTS, which wasn't at all
| what we wanted (insofar as MCTS is not generalizable to most
| things).
| rtpg wrote:
| > However, most levels can be expressed as a few intermediate
| goals
|
| I think generally the whole thing with puzzle games is that you
| have to determine the "right" intermediate goals. In fact, the
| naive intermediate goals are often entirely wrong!
|
| A canonical sokoban-like inversion might be where you have to
| push two blocks into goal areas. You might think "ok, push
| one block into its goal area and then push the other into its
| own."
|
| But many of these games will have mechanisms meaning you
| would first want to push one block into its goal, then undo
| that for some reason (it might activate some extra
| functionality), push the other block, and then finally go
| back and redo the first.
|
| There are always weird tricks that mean you're going to walk
| backwards before walking forwards. I don't think it's
| impossible for these things to stumble into it, though. They
| just might spin a lot of cycles to get there (humans do too,
| I guess).
| matsemann wrote:
| Yeah, working backwards and forwards at the same time is
| often how to solve advanced puzzle games, and it keeps the
| options from exploding. When thinking backwards from the
| goal, you figure out constraints or "invariants" the forward
| path must uphold, and thus can discard lots of dead ends
| earlier in your forward path.
|
| To me, those discoveries are the fun part of most puzzle
| games. When you unlock the "trick" for each level and the
| dopamine flies, heh.
| TeMPOraL wrote:
| I usually get good mileage out of jumping straight into the
| middle :). Like, "hmm, let's look at this block; oh cool,
| there's enough space around it that I could push it away
| from the goal, for whatever reason". Turns out, if it's
| possible, there usually is a good reason. So whenever I get
| stuck, I skim every object in the puzzle and consider, in
| isolation, what I can do with it, and this usually gives me
| anchor points to drive my forward or backward thinking
| through.
| kadoban wrote:
| > But the resolution there was MCTS
|
| MCTS wasn't _really_ the solution to Go. MCTS-based AIs
| existed for years and they weren't _that_ good. They weren't
| superhuman for sure, and the moves/games they played were
| kind of boring.
|
| The key to playing Go well was doing something that vaguely
| looks like MCTS, but the real guts are a network that can
| answer "who's winning?" and "what are good moves to try
| here?", used to guide the search. Additionally, it was
| essential to realize that computation (running search for a
| while) with a bad model could be effectively and efficiently
| used to generate better training data to train a better
| model.
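|
| The selection rule at the heart of that looks roughly like
| the PUCT sketch below (a hedged illustration; the numbers are
| made up): the policy network's prior p steers exploration,
| and the value network's estimates, accumulated as the mean
| value q, steer exploitation.
|
|       import math
|
|       def select_child(children, c_puct=1.5):
|           """children: dicts with prior p, visits n, value q."""
|           total = sum(ch["n"] for ch in children)
|           def puct(ch):
|               bonus = (c_puct * ch["p"] * math.sqrt(total)
|                        / (1 + ch["n"]))
|               return ch["q"] + bonus
|           return max(children, key=puct)
|
|       children = [
|           {"move": "D4",  "p": 0.6, "n": 10, "q": 0.52},
|           {"move": "Q16", "p": 0.3, "n": 2,  "q": 0.48},
|       ]
|       print(select_child(children)["move"])  # Q16: exploration wins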
| eru wrote:
| > Additionally essential was realizing that computation (run
| search for a while) with a bad model could be
| effectively+efficiently used to generate better training data
| to train a better model.
|
| That has been known since at least the 1990s with TD-Gammon
| beating the world champions in Backgammon. See eg
| http://incompleteideas.net/book/ebook/node108.html or
| https://en.wikipedia.org/wiki/TD-Gammon
|
| In a sense, classic chess engines do that, too: alpha-beta
| search uses a very weak model (e.g. just checking for
| checkmate, otherwise counting material, or what have you)
| plus search to produce a much stronger player. You can use
| that to generate data for training a better model.
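|
| A minimal negamax alpha-beta sketch over a hand-built game
| tree (values are illustrative), just to show how a weak
| static evaluation gets amplified by search:
|
|       def alphabeta(node, children, leaf_value,
|                     alpha=-float("inf"), beta=float("inf")):
|           kids = children.get(node, [])
|           if not kids:
|               return leaf_value[node]  # the "very weak model"
|           best = -float("inf")
|           for kid in kids:
|               # Negamax: each ply flips sign and swaps the window.
|               score = -alphabeta(kid, children, leaf_value,
|                                  -beta, -alpha)
|               best = max(best, score)
|               alpha = max(alpha, score)
|               if alpha >= beta:  # opponent would never allow this
|                   break
|           return best
|
|       children = {"root": ["L", "R"],
|                   "L": ["L1", "L2"], "R": ["R1", "R2"]}
|       # Leaf scores from the weak eval, from the side to move:
|       leaf_value = {"L1": 3, "L2": -2, "R1": -5, "R2": 9}
|       print(alphabeta("root", children, leaf_value))  # -2 (R2 pruned)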
| kadoban wrote:
| > That has been known since at least the 1990s with TD-
| Gammon beating the world champions in Backgammon.
|
| Yeah, I didn't mean to imply that reinforcement learning
| (or applying it in this way) is novel. It was just
| important to work out how to apply it to Go specifically.
|
| > In a sense, classic chess engines do that, too: alpha-
| beta-search uses a very weak model (eg just checking for
| checkmate, otherwise counting material, or what have you)
| and search to generate a much stronger player. You can use
| that to generate data for training a better model.
|
| I would say that classic chess AIs specifically don't do
| the important part. They aren't able to use a worse model
| to, with computation, train a better model. They can
| generate training data, but then they have no way to
| incorporate it back into the AI.
| pclmulqdq wrote:
| I have noticed a trend of the word "Desiderata" appearing in
| a lot more writing. Is this an LLM word or is it just in
| fashion? Most people would use the words "Desires" or
| "Goals," so I assume this might be the new "delve."
| Tomte wrote:
| It's academic jargon. Desiderata often appear at the end of a
| paper, in the "someone should investigate X, but I'm moving
| on to the next funded project" section.
| ginko wrote:
| So "Future Work"?
| dgfl wrote:
| Literally it means "things that we wish for", from the
| Latin verb "desiderare" (to wish).
| fi-le wrote:
| At least in this instance, it came from my fleshy human brain.
| Although I perhaps used it to come off as smarter than I really
| am - just like an LLM might.
| wohoef wrote:
| In my experience, LLMs have a hard time working with text
| grids like this. They seem to find columns harder to "detect"
| than rows, probably because their input shows the grid as one
| giant row, if that makes sense.
|
| They have the same problem with playing chess. But I'm not
| sure if there is a data type they could work with for this
| kind of game. Currently it seems like LLMs can't really work
| on spatial problems. But this should actually be something
| that can be fixed (pretty sure I saw an article about it on
| HN recently).
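|
| A quick illustration of why columns suffer: once a grid is
| serialized, vertically adjacent cells land width-many tokens
| apart in the stream the model actually sees.
|
|       grid = ["B..",
|               ".K.",
|               "..F"]
|       flat = "".join(grid)
|       width = len(grid[0])
|       print(flat)  # 'B...K...F' <- one giant row
|       # (1, 0) and (1, 1) touch on the board, 3 tokens apart here:
|       print(flat[1], flat[1 + width])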
| froobius wrote:
| Transformers can easily be trained / designed to handle
| grids; it's just that off-the-shelf standard LLMs haven't
| been, particularly (although they would have seen some grids
| in training).
| nine_k wrote:
| Are there some well-known examples of success in it?
| thethimble wrote:
| Vision transformers effectively encode a grid of pixel
| patches. It's ultimately a matter of ensuring the position
| encoding incorporates both X and Y positions.
|
| For LLMs we only have one axis of position and - more
| importantly - the vast majority of training data is only
| oriented in this way.
| stavros wrote:
| If this were a limitation in the architecture, they wouldn't be
| able to work with images, no?
| hnlmorg wrote:
| LLMs don't work with images.
| stavros wrote:
| They do, though.
| hnlmorg wrote:
| Do they? I thought it was completely different models
| that did image generation.
|
| LLMs might be used to translate requests into keywords,
| but I didn't think LLMs themselves did any of the image
| generation.
|
| Am I wrong here?
| stavros wrote:
| Yes, that's why ChatGPT can look at an image and change
| the style, or edit things in the image. The image itself
| is converted to tokens and passed to the LLM.
| hnlmorg wrote:
| LLMs can be used as an agent to do all sorts of clever
| things, but it doesn't mean the LLM is actually handling
| the original data format.
|
| I've created MCP servers that can scrape websites but
| that doesn't mean the LLM _itself_ can make HTTP calls.
|
| The reason I make this distinction is because someone
| claimed that LLMs can read images. But they don't. They
| act as an agent for another model that reads images and
| creates metadata from it. LLMs then turn that metadata
| into natural language.
|
| The LLM itself doesn't see any pixels. It sees textual
| information that another model has provided.
|
| Edit: reading more about this online, it seems LLMs can
| work with pixel-level data. I had no idea that was
| possible.
|
| My apologies.
| stavros wrote:
| No problem. Again, if it happened the way you described
| (which it did, until GPT-4o recently), the LLM wouldn't
| have been able to edit images. You can't get a textual
| description of an image and reconstruct it perfectly just
| from that, with one part edited.
| fi-le wrote:
| Good point. The architectural solution that would come to mind
| is 2D text embeddings, i.e. we add 2 sines and cosines to each
| token embedding instead of 1. Apparently people have done it
| before: https://arxiv.org/abs/2409.19700v2
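|
| A sketch of the idea in numpy (a hedged illustration, not
| necessarily the paper's exact scheme): half the channels
| encode x, half encode y, each with the usual sinusoidal
| recipe.
|
|       import numpy as np
|
|       def pos_embed_2d(x, y, d_model=64):
|           d = d_model // 2  # channels per axis
|           freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
|           def axis(p):  # sin/cos pair per frequency
|               return np.concatenate([np.sin(p * freqs),
|                                      np.cos(p * freqs)])
|           return np.concatenate([axis(x), axis(y)])
|
|       # The token at grid cell (3, 5) sees row and column info:
|       print(pos_embed_2d(3, 5).shape)  # (64,)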
| ninjha wrote:
| I think I remember one of the original ViT papers saying
| something about 2D embeddings on image patches not actually
| increasing performance on image recognition or segmentation,
| so it's kind of interesting that it helps with text!
|
| E: I found the paper: https://arxiv.org/pdf/2010.11929
|
| > We use standard learnable 1D position embeddings, since we
| have not observed significant performance gains from using
| more advanced 2D-aware position embeddings (Appendix D.4).
|
| Although it looks like that was just ImageNet so maybe this
| isn't that surprising.
| yorwba wrote:
| They seem to have used a fixed input resolution for each
| model, so the learnable 1D position embeddings are
| equivalent to learnable 2D position embeddings where every
| grid position gets its own embedding. It's when different
| images may have a different number of tokens per row that
| the correspondence between 1D index and 2D position gets
| broken and a 2D-aware position embedding can be expected to
| produce different results.
| tibastral2 wrote:
| It reminds me of
| https://en.m.wikipedia.org/wiki/The_Ricks_Must_Be_Crazy. Hope we
| are not ourselves in some sort of simulation ;)
| ThouTo2C wrote:
| There are numerous guides for all levels of Baba Is You
| available. I think it's likely that any modern LLM has them as
| part of its training dataset. That severely degrades this as a
| test for complex solution capabilities.
|
| Still, it's interesting to see the challenges with dynamic
| rules (like "Key is Stop") that change where you are able to
| move, etc.
| klohto wrote:
| Read the article first maybe
| ethan_smith wrote:
| The dynamic rule changes are precisely what make this a
| valuable benchmark despite available guides. Each rule
| modification creates a novel state-space that requires
| reasoning about the consequences of those changes, not just
| memorizing solution paths.
| niemandhier wrote:
| I think it's a great idea for a benchmark.
|
| One key difference to ARC in its current iteration is that
| there is a defined and learnable game physics.
|
| ARC requires generalization from few examples for problems
| that are not well defined per se.
|
| Hence ARC currently requires the models that work on it to
| possess biases that are comparable to the ones that humans
| possess.
| andy99 wrote:
| I suspect real AGI evals aren't going to be "IQ test"-like,
| which is how I'd categorize these benchmarks.
|
| LLMs will probably continue to scale on such benchmarks, as they
| have been, without needing real ingenuity or intelligence.
|
| Obviously I don't know the answer but I think it's the same root
| problem as why neural networks will never lead to intelligence.
| We're building and testing idiot savants.
| popcar2 wrote:
| I would be way more interested in it playing niche community
| levels, because I suspect a huge reason it can solve these
| levels is that it was trained on a million Baba Is You
| walkthroughs. Same with people using Pokemon as a way to test
| LLMs; it really just depends on how well the model knows the
| game.
| fi-le wrote:
| Two corrections, as written in the post: at least Claude is
| not able to solve the standard levels at all, and community
| levels are definitely in scope.
| WhitneyLand wrote:
| "Reasoning models like o3 might be better equipped to come up
| with a plan, so a natural step would be to try switching to
| those, away from Claude Desktop..."
|
| But...Claude Desktop does have a reasoning mode for both Sonnet
| and Opus.
| zahlman wrote:
| > This is why the video of Claude solving level 1 at the top was
| actually (dramatic musical cue) staged, and only possible via a
| move-for-move tutorial that Claude nicely rationalized post hoc.
|
| One of the things this arc of history has taught me is that post-
| hoc rationalization is depressingly easy. Especially if it
| doesn't have to make sense, but even passing basic logical checks
| isn't too difficult. Ripping the rationalization apart often
| requires identifying novel, non-obvious logical checks.
|
| I thought I had learned that time and time again from human
| politics, but AI somehow made it even clearer than I thought
| possible. Perhaps simply because of _knowing_ that a machine is
| doing it.
|
| Edit: after watching the video more carefully:
|
| > "This forms WALL IS WIN horizontally. But I need "FLAG IS WIN"
| instead. Let me check if walls now have the WIN property. If they
| do, I just need to touch a wall to win. Let me try moving to a
| wall:
|
| There's something extremely uncanny-valley about this. A human
| player absolutely would accidentally win like this, and have
| similar reasoning (not expressed so formally) about how the win
| was achieved _after the fact_. (Winning depends on the walls
| having WIN and _also not_ having STOP; many players get stuck on
| later levels, even after having supposedly learned the lesson of
| this one, by trying to make something WIN and walk onto it while
| it is still STOP.)
|
| But the WIN block was not originally in line with the WALL IS
| text, so a human player would never accidentally form the rule,
| but would only do it with the expectation of being able to win
| that way. Especially since there was already an obvious, clear
| path to FLAG -- a level like this has no Sokoban puzzle element
| to it; it's purely about learning that the walls only block the
| player because they are STOP.
|
| Nor would (from my experience watching streamers at least) a
| human spontaneously notice that the rule "WALL IS WIN" had been
| formed and treat that as a cue to reconsider the entire strategy.
| The natural human response to unintentionally forming a useful
| rule is to keep pushing in the same direction.
|
| On the other hand, an actually _dedicated_ AI system (in the
| way that AlphaGo was dedicated to Go) could, I'm sure, figure
| out a game like Baba Is You pretty easily. It would lack the
| human instinct to treat the walls as if they were implicitly
| always STOP, so it would never struggle with overriding it.
| deadbabe wrote:
| A simple feed-forward neural network with sufficient training
| can solve levels way better than Claude. Why is Claude being
| used at all?
| wredcoll wrote:
| The question isn't "can we write a computer program that can
| beat X game", it is "do things like claude represent a truly
| general purpose intelligence as demonstrated by its ability
| to both write a limerick and play baba is you"
___________________________________________________________________
(page generated 2025-07-05 23:01 UTC)