[HN Gopher] LLMs aren't world models
       ___________________________________________________________________
        
       LLMs aren't world models
        
       Author : ingve
       Score  : 185 points
       Date   : 2025-08-10 11:40 UTC (2 days ago)
        
 (HTM) web link (yosefk.com)
 (TXT) w3m dump (yosefk.com)
        
       | t0md4n wrote:
       | https://arxiv.org/abs/2501.17186
        
         | yosefk wrote:
          | This is interesting. Calling a rating of <1800 "professional
          | level" isn't accurate, but still.
         | 
         | However:
         | 
         | "A significant Elo rating jump occurs when the model's Legal
         | Move accuracy reaches 99.8%. This increase is due to the
         | reduction in errors after the model learns to generate legal
         | moves, reinforcing that continuous error correction and
         | learning the correct moves significantly improve ELO"
         | 
          | You should be able to reach close to 100% move legality with
          | few resources spent on it. Failing to do so means that it has
          | not learned a model of what chess is, at some basic level.
          | There is virtually no challenge in making legal moves.
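          | 
          | For scale, checking legality is a few lines with an off-the-
          | shelf library; a minimal sketch using python-chess (my own
          | illustration, not what the paper did):
          | 
          |     import chess  # pip install python-chess
          | 
          |     board = chess.Board()  # starting position; any FEN works too
          |     for san in ["e4", "e5", "Nf3"]:
          |         board.push_san(san)
          | 
          |     candidate = chess.Move.from_uci("b8c6")   # ...Nc6
          |     print(candidate in board.legal_moves)     # True
          |     print(len(list(board.legal_moves)))       # count of legal replies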
        
           | lostmsu wrote:
           | > r4rk1 pp6 8 4p2Q 3n4 4N3 qP5P 2KRB3 w -- -- 3 27
           | 
            | Can you say with 100% certainty that you can generate a good
            | next move (example from the paper) without using tools, and
            | that you will never accidentally make a mistake and give an
            | illegal move?
        
           | rpdillon wrote:
           | > Failing to do so means that it has not learned a model of
           | what chess is, at some basic level.
           | 
            | I'm not sure about this. Among a typical group of amateur
            | chess players, how often do they attempt an illegal move
            | when they lack any kind of guidance from a computer? I
            | played chess for years throughout elementary, middle and
            | high school, and I would easily say that even after hundreds
            | of hours of playing, I might make two mistakes out of a
            | thousand moves that were actually illegal, often because I
            | had missed that moving a piece would leave me in check due
            | to a discovered check.
           | 
           | It's hard to conclude from that experience that players that
           | are amateurs lack even a basic model of chess.
        
       | libraryofbabel wrote:
       | This essay could probably benefit from some engagement with the
       | literature on "interpretability" in LLMs, including the empirical
       | results about how knowledge (like addition) is represented inside
       | the neural network. To be blunt, I'm not sure being smart and
       | reasoning from first principles after asking the LLM a lot of
        | questions and cherry-picking what it gets wrong leads to any
        | novel insights at this point. And it already feels a little out
        | of date: with LLMs getting gold on the mathematical Olympiad,
        | they clearly have a pretty good world model of mathematics. I
        | don't think cherry-picking a failure to prove 2 + 2 = 4 in the
        | particular way the writer wanted to see disproves that at all.
       | 
       | LLMs have imperfect world models, sure. (So do humans.) That's
       | because they are trained to be generalists and because their
       | internal representations of things are _massively_ compressed
        | since they don't have enough weights to encode everything. I
       | don't think this means there are some natural limits to what they
       | can do.
        
         | armchairhacker wrote:
         | Any suggestions from this literature?
        
           | libraryofbabel wrote:
           | The papers from Anthropic on interpretability are pretty
           | good. They look at how certain concepts are encoded within
           | the LLM.
        
         | yosefk wrote:
         | Your being blunt is actually very kind, if you're describing
         | what I'm doing as "being smart and reasoning from first
         | principles"; and I agree that I am not saying something very
         | novel, at most it's slightly contrarian given the current
         | sentiment.
         | 
         | My goal is not to cherry-pick failures for its own sake as much
         | as to try to explain why I get pretty bad output from LLMs much
         | of the time, which I do. They are also very useful to me at
         | times.
         | 
         | Let's see how my predictions hold up; I have made enough to
         | look very wrong if they don't.
         | 
         | Regarding "failure disproving success": it can't, but it can
         | disprove a theory of how this success is achieved. And, I have
          | much better examples than the 2+2=4, which I am citing as
          | something that sorta works these days.
        
           | libraryofbabel wrote:
           | I mean yeah, it's a good essay in that it made me think and
           | try to articulate the gaps, and I'm always looking to read
           | things that push back on AI hype. I usually just skip over
           | the hype blogging.
           | 
           | I think my biggest complaint is that the essay points out
            | flaws in LLMs' world models (totally valid, they do
            | confidently get things wrong and hallucinate in ways that are
            | different from, and often more frustrating than, how humans
            | get things wrong) but then it jumps to claiming that there is
           | some fundamental limitation about LLMs that prevents them
           | from forming workable world models. In particular, it strays
           | a bit towards the "they're just stochastic parrots" critique,
           | e.g. "that just shows the LLM knows to put the words
           | explaining it after the words asking the question." That just
           | doesn't seem to hold up in the face of e.g. LLMs getting gold
           | on the Mathematical Olympiad, which features novel questions.
           | If that isn't a world model of mathematics - being able to
           | apply learned techniques to challenging new questions - then
           | I don't know what is.
           | 
           | A lot of that success is from reinforcement learning
           | techniques where the LLM is made to solve tons of math
           | problems _after_ the pre-training "read everything" step,
           | which then gives it a chance to update its weights. LLMs
           | aren't just trained from reading a lot of text anymore. It's
            | very similar to how the AlphaZero chess engine was trained,
           | in fact.
           | 
           | I do think there's a lot that the essay gets right. If I was
           | to recast it, I'd put it something like this:
           | 
           | * LLMs have imperfect models of the world which is
           | conditioned by how they're trained on next token prediction.
           | 
           | * We've shown we can drastically improve those world models
            | for particular tasks by reinforcement learning. You kind of
           | allude to this already by talking about how they've been
           | "flogged" to be good at math.
           | 
           | * I would claim that there's no particular reason these RL
           | techniques aren't extensible in principle to beat all sorts
           | of benchmarks that might look unrealistic now. (Two years ago
           | it would have been an extreme optimist position to say an LLM
           | could get gold on the mathematical Olympiad, and most LLM
           | skeptics would probably have said it could never happen.)
           | 
           | * Of course it's very expensive, so most world models LLMs
           | have won't get the RL treatment and so will be full of gaps,
           | especially for things that aren't amenable to RL. It's good
           | to beware of this.
           | 
           | I think the biggest limitation LLMs actually have, the one
           | that is the biggest barrier to AGI, is that they can't learn
           | on the job, during inference. This means that with a novel
           | codebase they are never able to build a good model of it,
           | because they can never update their weights. (If an LLM was
           | given tons of RL training on that codebase, it _could_ build
           | a better world model, but that's expensive and very
           | challenging to set up.) This problem is hinted at in your
           | essay, but the lack of on-the-job learning isn't centered.
           | But it's the real elephant in the room with LLMs and the one
           | the boosters don't really have an answer to.
           | 
           | Anyway thanks for writing this and responding!
        
             | yosefk wrote:
             | I'm not saying that LLMs can't learn about the world - I
             | even mention how they obviously do it, even at the learned
             | embeddings level. I'm saying that they're not compelled by
             | their training objective to learn about the world and in
             | many cases they clearly don't, and I don't see how to
             | characterize the opposite cases in a more useful way than
             | "happy accidents."
             | 
             | I don't really know how they are made "good at math," and
             | I'm not that good at math myself. With code I have a better
             | gut feeling of the limitations. I do think that you could
              | throw them off terribly with unusual math questions to show
             | that what they learned isn't math, but I'm not the guy to
             | do it; my examples are about chess and programming where I
             | am more qualified to do it. (You could say that my question
             | about the associativity of blending and how caching works
             | sort of shows that it can't use the concept of
             | associativity in novel situations; not sure if this can be
             | called an illustration of its weakness at math)
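              | 
              | (For reference, the property in question: the standard
              | "over" operator on premultiplied RGBA is associative,
              | which is what makes caching a partially blended stack of
              | layers valid. A minimal numeric sketch of my own, not from
              | any chat:)
              | 
              |     def over(fg, bg):
              |         # Porter-Duff "over" for premultiplied RGBA tuples in [0, 1]
              |         return tuple(f + (1.0 - fg[3]) * b for f, b in zip(fg, bg))
              | 
              |     a = (0.5, 0.0, 0.0, 0.5)   # half-opaque red, premultiplied
              |     b = (0.0, 0.3, 0.0, 0.3)
              |     c = (0.0, 0.0, 0.8, 0.8)
              | 
              |     print(over(a, over(b, c)))   # ~ (0.5, 0.15, 0.28, 0.93)
              |     print(over(over(a, b), c))   # same values: "over" is associative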
        
               | calf wrote:
               | But this is parallel to saying LLMs are not "compelled"
               | by the training algorithms to learn symbolic logic.
               | 
                | Which says to me there are two camps on this, and the
                | jury is still out on it and all related questions.
        
           | WillPostForFood wrote:
           | Your LLM output seems abnormally bad, like you are using old
           | models, bad models, or intentionally poor prompting. I just
           | copied and pasted your Krita example into ChatGPT, and
           | reasonable answer, nothing like what you paraphrased in your
           | post.
           | 
           | https://imgur.com/a/O9CjiJY
        
             | marcellus23 wrote:
             | I think it's hard to take any LLM criticism seriously if
             | they don't even specify which model they used. Saying "an
             | LLM model" is totally useless for deriving any kind of
             | conclusion.
        
               | p1esk wrote:
                | Yes, I'd be curious about his experience with the GPT-5
               | Thinking model. So far I haven't seen any blunders from
               | it.
        
             | typpilol wrote:
             | This seems like a common theme with these types of articles
        
         | AyyEye wrote:
         | With LLMs being unable to count how many Bs are in blueberry,
         | they clearly don't have any world model whatsoever. That
         | addition (something which only takes a few gates in digital
         | logic) happens to be overfit into a few nodes on multi-billion
         | node networks is hardly a surprise to anyone except the most
         | religious of AI believers.
        
           | yosefk wrote:
           | Actually I forgive them those issues that stem from
            | tokenization. I used to make fun of them for listing datum as
           | a noun whose plural form ends with an i, but once I learned
           | about how tokenization works, I no longer do it - it feels
           | like mocking a person's intelligence because of a speech
           | impediment or something... I am very kind to these things, I
           | think
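            | 
            | To make the tokenization point concrete, this is roughly
            | what the model "sees"; a sketch using OpenAI's tiktoken,
            | where the exact split depends on the tokenizer:
            | 
            |     import tiktoken  # pip install tiktoken
            | 
            |     enc = tiktoken.get_encoding("cl100k_base")
            |     ids = enc.encode("blueberry")
            |     print([enc.decode([i]) for i in ids])   # likely ['blue', 'berry']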
        
             | astrange wrote:
             | Tokenization makes things harder, but it doesn't make them
             | impossible. Just takes a bit more memorization.
             | 
              | Other writing systems come with "tokenization" built in,
              | making it still a live issue. Think of answering:
             | 
             | 1. How many n's are in Ri Ben ?
             | 
             | 2. How many n's are in Ri Ben ?
             | 
             | (Answers are 2 and 1.)
        
           | andyjohnson0 wrote:
           | > With LLMs being unable to count how many Bs are in
           | blueberry, they clearly don't have any world model
           | whatsoever.
           | 
           | Is this a real defect, or some historical thing?
           | 
           | I just asked GPT-5:                   How many "B"s in
           | "blueberry"?
           | 
           | and it replied:                   There are 2 -- the letter b
           | appears twice in "blueberry".
           | 
            | I also asked it how many Rs in Carrot, and how many Ps in
            | Pineapple, and it answered both questions correctly too.
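            | 
            | (Ground truth, for reference, via a trivial check:)
            | 
            |     word = "blueberry"
            |     print(word.count("b"), word.count("B"))   # 2 0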
        
             | libraryofbabel wrote:
             | It's a historical thing that people still falsely claim is
             | true, bizarrely without trying it on the latest models. As
             | you found, leading LLMs don't have a problem with it
             | anymore.
        
               | pydry wrote:
               | Depends how you define historical. If by historical you
               | mean more than two days ago then, yeah, it's ancient
               | history.
        
             | ThrowawayR2 wrote:
              | It was discussed and reproduced on GPT-5 on HN a couple of
             | days ago: https://news.ycombinator.com/item?id=44832908
             | 
             | Sibling poster is probably mistakenly thinking of the
             | strawberry issue from 2024 on older LLM models.
        
             | bgwalter wrote:
             | It is not historical:
             | 
             | https://kieranhealy.org/blog/archives/2025/08/07/blueberry-
             | h...
             | 
             | Perhaps they have a hot fix that special cases HN
             | complaints?
        
               | AyyEye wrote:
               | They clearly RLHF out the embarrassing cases and make
               | cheating on benchmarks into a sport.
        
             | nosioptar wrote:
             | Shouldn't the correct answer be that there is not a "B" in
             | "blueberry"?
        
           | BobbyJo wrote:
           | The core issue there isn't that the LLM isn't building
           | internal models to represent its world, it's that its world
           | is limited to tokens. Anything not represented in tokens, or
           | token relationships, can't be modeled by the LLM, by
           | definition.
           | 
           | It's like asking a blind person to count the number of colors
           | on a car. They can give it a go and assume glass, tires, and
           | metal are different colors as there is likely a correlation
           | they can draw from feeling them or discussing them. That's
           | the best they can do though as they can't actually perceive
           | color.
           | 
           | In this case, the LLM can't see letters, so asking it to
           | count them causes it to try and draw from some proxy of that
           | information. If it doesn't have an accurate one, then bam,
           | strawberry has two r's.
           | 
           | I think a good example of LLMs building models internally is
           | this: https://rohinmanvi.github.io/GeoLLM/
           | 
           | LLMs are able to encode geospatial relationships because they
            | can be represented by token relationships well. Two countries
           | that are close together will be talked about together much
           | more often than two countries far from each other.
        
             | vrighter wrote:
             | That is just not a solid argument. There are countless
             | examples of LLMs splitting "blueberry" into "b l u e b e r
             | r y", which would contain one token per letter. And then
             | they still manage to get it wrong.
             | 
             | Your argument is based on a flawed assumption, that they
             | can't see letters. If they didn't they wouldn't be able to
             | spell the word out. But they do. And when they do get one
             | token per letter, they still miscount.
        
             | xigoi wrote:
             | > It's like asking a blind person to count the number of
             | colors on a car.
             | 
             | I presume if I asked a blind person to count the colors on
             | a car, they would reply "sorry, I am blind, so I can't
             | answer this question".
        
           | libraryofbabel wrote:
           | > they clearly don't have any world model whatsoever
           | 
           | Then how did an LLM get gold on the mathematical Olympiad,
           | where it certainly hadn't seen the questions before? How _on
           | earth_ is that possible without a decent working model of
           | mathematics? Sure, LLMs might make weird errors sometimes
           | (nobody is denying that), but clearly the story is rather
           | more complicated than you suggest.
        
             | simiones wrote:
             | > where it certainly hadn't seen the questions before?
             | 
             | What are you basing this certainty on?
             | 
             | And even if you're right that the specific questions had
             | not come up, it may still be that the questions from the
             | math olympiad were rehashes of similar questions in other
             | texts, or happened to correspond well to a composition of
             | some other problems that were part of the training set,
             | such that the LLM could 'pick up' on the similarity.
             | 
             | It's also possible that the LLM was specifically trained on
             | similar problems, or may even have a dedicated sub-net or
             | tool for it. Still impressive, but possibly not in a way
             | that generalizes even to math like one might think based on
             | the press releases.
        
           | williamcotton wrote:
           | I don't solve math problems with my poetry writing skills:
           | 
           | https://chatgpt.com/share/689ba837-8ae0-8013-96d2-7484088f27.
           | ..
        
         | lossolo wrote:
         | https://arxiv.org/abs/2508.01191
        
       | rishi_devan wrote:
       | Haha. I enjoyed that Soviet-era joke at the end.
        
         | svantana wrote:
         | Yes, I hadn't heard that before. It's similar in spirit to this
          | Norwegian folk tale about a deaf man guessing what someone is
         | saying to him:
         | 
         | https://en.wikipedia.org/wiki/%22Good_day,_fellow!%22_%22Axe...
        
           | kgwgk wrote:
           | Another similar story:
           | 
            | King Frederick the Great of Prussia had a very fine army,
            | and none of the soldiers in it were finer than the Giant
            | Guards, who were all extremely tall men. It was difficult to
            | find enough soldiers for these Guards, as there were not
            | many men who were tall enough.
           | 
           | Frederick had made it a rule that no soldiers who did not
           | speak German could be admitted to the Giant Guards, and this
           | made the work of the officers who had to find men for them
           | even more difficult. When they had to choose between
           | accepting or refusing a really tall man who knew no German,
            | the officers used to accept him, and then teach him enough
            | German to be able to answer if the King questioned him.
           | 
           | Frederick, sometimes, used to visit the men who were on guard
           | around his castle at night to see that they were doing their
           | job properly, and it was his habit to ask each new one that
           | he saw three questions: "How old are you?" "How long have you
           | been in my army?" and "Are you satisfied with your food and
           | your conditions?"
           | 
            | The officers of the Giant Guards therefore used to teach new
           | soldiers who did not know German the answers to these three
           | questions.
           | 
            | One day, however, the King asked a new soldier the questions
            | in a different order. He began with, "How long have you been
            | in my army?" The young soldier immediately answered, "Twenty-
            | two years, Your Majesty". Frederick was very surprised.
           | "How old are you then?", he asked the soldier. "Six months,
           | Your Majesty", came the answer. At this Frederick became
           | angry, "Am I a fool, or are you one?" he asked. "Both, Your
           | Majesty", the soldier answered politely.
           | 
           | https://archive.org/details/advancedstoriesf0000hill
        
       | deadbabe wrote:
       | Don't: use LLMs to play chess against you
       | 
       | Do: use LLMs to talk shit to you while a _real_ chess AI plays
       | chess against you.
       | 
       | The above applies to a lot of things besides chess, and
       | illustrates a proper application of LLMs.
        
       | imenani wrote:
        | As far as I can tell they don't say which LLM they used, which
        | is kind of a shame, as there is a huge range of capabilities
        | even in newly released LLMs (e.g. reasoning vs not).
        
         | yosefk wrote:
         | ChatGPT, Claude, Grok and Google AI Overviews, whatever powers
         | the latter, were all used in one or more of these examples, in
         | various configurations. I think they can perform differently,
         | and I often try more than one when the 1st try doesn't work
          | great. I don't think there's any fundamental difference in the
          | principle of their operation, and I think there never will be
          | - rather, there will be another major breakthrough.
        
           | red75prime wrote:
           | My hypothesis is that a model fails to switch into a deep
           | thinking mode (if it has it) and blurts whatever it got from
           | all the internet data during autoregressive training. I
           | tested it with alpha-blending example. Gemini 2.5 flash -
           | fails, Gemini 2.5 pro - succeeds.
           | 
            | How does the presence/absence of a world model, er, blend
            | into all this? I guess "having a consistent world model at
            | all times"
           | is an incorrect description of humans, too. We seem to have
           | it because we have mechanisms to notice errors, correct
           | errors, remember the results, and use the results when
           | similar situations arise, while slowly updating intuitions
           | about the world to incorporate changes.
           | 
           | The current models lack "remember/use/update" parts.
        
           | imenani wrote:
           | Each of these models has a thinking/reasoning variant and a
           | default non-thinking variant. I would expect the reasoning
           | variants (o3 or "GPT5 Thinking", Gemini DeepThink, Claude
           | with Extended Thinking, etc) to do better at this. I think
           | there is also some chance that in their reasoning traces they
           | may display something you might see as closer to world
           | modelling. In particular, you might find them explicitly
           | tracking positions of pieces and checking validity.
        
           | red75prime wrote:
           | > I don't think there's any fundamental difference in the
           | principle of their operation
           | 
            | Yeah, they seem to be subject to the universal
           | approximation theorem (it needs to be checked more
           | thoroughly, but I think we can build a transformer that is
           | equivalent to any given fully-connected multilayered
           | network).
           | 
            | That is, at a certain size they can do anything a human can
            | do at a certain point in their life (that is, with no
            | additional training), regardless of whether humans have
            | world models and what those models are on the neuronal
            | level.
           | 
           | But there are additional nuances that are related to their
           | architectures and training regimes. And practical questions
           | of the required size.
        
         | lowsong wrote:
         | It doesn't matter. These limitations are fundamental to LLMs,
         | so all of them that will ever be made suffer from these
         | problems.
        
       | og_kalu wrote:
       | Yes LLMs can play chess and yes they can model it fine
       | 
       | https://arxiv.org/pdf/2403.15498v2
        
       | GaggiX wrote:
       | https://www.youtube.com/watch?v=LtG0ACIbmHw
       | 
        | SOTA LLMs do play legal moves in chess; I don't know why the
        | article seems to say otherwise.
        
         | tickettotranai wrote:
         | Technically yes, but... it's moderately tricky to get an LLM to
         | play good chess even though it can.
         | 
         | https://dynomight.net/more-chess/
         | 
         | This is significant in general because I personally would love
         | to get these things to code-switch into "hackernews poster" or
         | "writer for the Economist" or "academic philosopher", but I
         | think the "chat" format makes it impossible. The
         | inaccessibility of this makes me want to host my own LLM...
        
       | lordnacho wrote:
       | Here's what LLMs remind me of.
       | 
       | When I went to uni, we had tutorials several times a week. Two
       | students, one professor, going over whatever was being studied
       | that week. The professor would ask insightful questions, and the
       | students would try to answer.
       | 
       | Sometimes, I would answer a question correctly without actually
       | understanding what I was saying. I would be spewing out something
       | that I had read somewhere in the huge pile of books, and it would
       | be a sentence, with certain special words in it, that the
       | professor would accept as an answer.
       | 
       | But I would sometimes have this weird feeling of "hmm I actually
       | don't get it" regardless. This is kinda what the tutorial is for,
       | though. With a bit more prodding, the prof will ask something
       | that you genuinely cannot produce a suitable word salad for, and
       | you would be found out.
       | 
       | In math-type tutorials it would be things like realizing some
       | equation was useful for finding an answer without having a clue
       | about what the equation actually represented.
       | 
       | In economics tutorials it would be spewing out words about
       | inflation or growth or some particular author but then having
       | nothing to back up the intuition.
       | 
       | This is what I suspect LLMs do. They can often be very useful to
       | someone who actually has the models in their minds, but not the
       | data to hand. You may have forgotten the supporting evidence for
       | some position, or you might have missed some piece of the
        | argument due to imperfect memory. In these cases, an LLM is
       | fantastic as it just glues together plausible related words for
       | you to examine.
       | 
       | The wheels come off when you're not an expert. Everything it says
       | will sound plausible. When you challenge it, it just apologizes
       | and pretends to correct itself.
        
       | ej88 wrote:
       | This article is interesting but pretty shallow.
       | 
       | 0(?): there's no provided definition of what a 'world model' is.
       | Is it playing chess? Is it remembering facts like how computers
        | use math to blend colors? If so, then ChatGPT:
       | https://chatgpt.com/s/t_6898fe6178b88191a138fba8824c1a2c has a
       | world model right?
       | 
       | 1. The author seems to conflate context windows with failing to
        | model the world in the chess example. I challenge them to give
        | a SOTA model an image of a chess board, or its notation, and ask
        | it about the position. It might not give you GM-level analysis,
        | but it definitely has a model of what's going on.
       | 
       | 2. Without explaining which LLM they used or sharing the chats
       | these examples are just not valuable. The larger and better the
       | model, the better its internal representation of the world.
       | 
       | You can try it yourself. Come up with some question involving
       | interacting with the world and / or physics and ask GPT-5
       | Thinking. It's got a pretty good understanding of how things
       | work!
       | 
       | https://chatgpt.com/s/t_689903b03e6c8191b7ce1b85b1698358
        
         | yosefk wrote:
         | A "world model" depends on the context which defines which
         | world the problem is in. For chess, which moves are legal and
         | needing to know where the pieces are to make legal moves are
         | parts of the world model. For alpha blending, it being a
         | mathematical operation and the visibility of a background given
         | the transparency of the foreground are parts of the world
         | model.
         | 
         | The examples are from all the major commercial American LLMs as
         | listed in a sister comment.
         | 
         | You seem to conflate context windows with tracking chess
         | pieces. The context windows are more than large enough to
          | remember 10 moves. The model should either track the pieces,
          | or mention that it would be playing blindfold chess absent a
          | board to look at and that it isn't good at this, so could you
          | please list the position after every move to make it fair, or
          | it doesn't know what it's doing; it's demonstrably the last of
          | these.
        
       | jonplackett wrote:
       | I just tried a few things that are simple and a world model would
       | probably get right. Eg
       | 
       | Question to GPT5: I am looking straight on to some objects.
       | Looking parallel to the ground.
       | 
       | In front of me I have a milk bottle, to the right of that is a
       | Coca-Cola bottle. To the right of that is a glass of water. And
       | to the right of that there's a cherry. Behind the cherry there's
       | a cactus and to the left of that there's a peanut. Everything is
       | spaced evenly. Can I see the peanut?
       | 
       | Answer (after choosing thinking mode)
       | 
       | No. The cactus is directly behind the cherry (front row order:
       | milk, Coke, water, cherry). "To the left of that" puts the peanut
       | behind the glass of water. Since you're looking straight on, the
       | glass sits in front and occludes the peanut.
       | 
       | It doesn't consider transparency until you mention it, then
       | apologises and says it didn't think of transparency
        
         | RugnirViking wrote:
         | this seems like a strange riddle. In my mind I was thinking
         | that regardless of the glass, all of the objects can be seen
         | (due to perspective, and also the fact you mentioned the
         | locations, meaning you're aware of them).
         | 
         | It seems to me it would only actually work in an orthographic
         | perspective, which is not how our reality works
        
           | jonplackett wrote:
           | You can tell from the response it does understand the riddle
           | just fine, it just gets it wrong.
        
             | rpdillon wrote:
             | Have you asked five adults this riddle? I suspect at least
             | two of them would get it wrong or have some uncertainty
             | about whether or not the peanut was visible.
        
               | xg15 wrote:
               | This. Was also thinking "yes" first because of the glass
               | of water, transparency, etc, but then got unsure: The
               | objects might be spaced so widely that the milk or coke
               | bottle would obscure the view due to perspective - or the
               | peanut would simply end up outside the viewer's field of
               | vision.
               | 
               | Shows that even _if_ you have a world model, it might not
               | be the right one.
        
         | optimalsolver wrote:
         | Gemini 2.5 Pro gets this correct on the first attempt, and
         | specifically points out the transparency of the glass of water.
         | 
         | https://g.co/gemini/share/362506056ddb
         | 
         | Time to get the ol' goalpost-moving gloves out.
        
         | wilg wrote:
         | Worked for me: https://chatgpt.com/share/689bc3ef-
         | fa1c-800f-9275-93c2dbc11b...
        
       | Razengan wrote:
       | A slight tangent: I think/wonder if the one place where AIs could
       | be really useful, might be in translating alien languages :)
       | 
       | As in, an alien could teach one of our AIs their language faster
       | than an alien could teach an human, and vice versa..
       | 
       | ..though the potential for catastrophic disasters is also great
       | there lol
        
       | keeda wrote:
       | That whole bit about color blending and transparency and LLMs
       | "not knowing colors" is hard to believe. I am literally using
       | LLMs every day to write image-processing and computer vision code
       | using OpenCV. It seamlessly reasons across a range of concepts
       | like color spaces, resolution, compression artifacts, filtering,
       | segmentation and human perception. I mean, removing the alpha
       | from a PNG image was a preprocessing step it wrote by itself as
       | part of a larger task I had given it, so it certainly understands
       | transparency.
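        | 
        | (The step itself is a couple of lines of OpenCV; a sketch with
        | made-up file names:)
        | 
        |     import cv2  # pip install opencv-python
        | 
        |     img = cv2.imread("frame.png", cv2.IMREAD_UNCHANGED)  # keeps alpha if present
        |     if img is not None and img.ndim == 3 and img.shape[2] == 4:
        |         img = cv2.cvtColor(img, cv2.COLOR_BGRA2BGR)       # drop the alpha channel
        |     cv2.imwrite("frame_rgb.png", img)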
       | 
        | I even often describe the results, e.g. "this fails in X manner
        | when the image has grainy regions", and it figures out what is
        | going on and adapts the code accordingly. (It works with
       | uploading actual images too, but those consume a lot of tokens!)
       | 
       | And all this in a rather niche domain that seems relatively less
       | explored. The images I'm working with are rather small and low-
       | resolution, which most literature does not seem to contemplate
       | much. It uses standard techniques well known in the art, but it
       | adapts and combines them well to suit my particular requirements.
       | So they seem to handle "novel" pretty well too.
       | 
       | If it can reason about images and vision and write working code
       | for niche problems I throw at it, whether it "knows" colors in
       | the human sense is a purely philosophical question.
        
         | geraneum wrote:
         | > it wrote by itself as part of a larger task I had given it,
         | so it certainly understands transparency
         | 
         | Or it's a common step or a known pattern or combination of
         | steps that is prevalent in its training data for certain input.
         | I'm guessing you don't know what's exactly in the training
         | sets. I don't know either. They don't tell ;)
         | 
         | > but it adapts and combines them well to suit my particular
         | requirements. So they seem to handle "novel" pretty well too.
         | 
         | We tend to overestimate the novelty of our own work and our
         | methods and at the same time, underestimate the vastness of the
         | data and information available online for machines to train on.
          | LLMs are very sophisticated pattern recognizers. It doesn't
          | mean that what you are doing specifically has been done in
          | this exact way before; rather, the patterns adapted and the
          | approach may not be one of a kind.
         | 
         | > is a purely philosophical question
         | 
         | It is indeed. A question we need to ask ourselves.
        
       | skeledrew wrote:
       | Agree in general with most of the points, except
       | 
       | > but because I know you and I get by with less.
       | 
       | Actually we got far more data and training than any LLM. We've
       | been gathering and processing sensory data every second at least
       | since birth (more processing than gathering when asleep), and are
       | only really considered fully intelligent in our late teens to
       | mid-20s.
        
         | helloplanets wrote:
         | Don't forget the millions of years of pre-training! ;)
        
       | o_nate wrote:
       | What with this and your previous post about why sometimes
       | incompetent management leads to better outcomes, you are quickly
       | becoming one of my favorite tech bloggers. Perhaps I enjoyed the
       | piece so much because your conclusions basically track mine. (I'm
       | a software developer who has dabbled with LLMs, and has some
       | hand-wavey background on how they work, but otherwise can claim
       | no special knowledge.) Also your writing style really pops. No
       | one would accuse your post of having been generated by an LLM.
        
         | yosefk wrote:
         | thank you for your kind words!
        
       | neuroelectron wrote:
       | Not yet
        
       | ameliaquining wrote:
       | One thing I appreciated about this post, unlike a lot of AI-
       | skeptic posts, is that it actually makes a concrete falsifiable
       | prediction; specifically, "LLMs will never manage to deal with
       | large code bases 'autonomously'". So in the future we can look
       | back and see whether it was right.
       | 
       | For my part, I'd give 80% confidence that LLMs will be able to do
       | this within two years, without fundamental architectural changes.
        
         | moduspol wrote:
         | "Deal with" and "autonomously" are doing a lot of heavy lifting
         | there. Cursor already does a pretty good job indexing all the
         | files in a code base in a way that lets it ask questions and
         | get answers pretty quickly. It's just a matter of where you set
         | the goalposts.
        
           | ameliaquining wrote:
           | True, there'd be a need to operationalize these things a bit
           | more than is done in the post to have a good advance
           | prediction.
        
         | exe34 wrote:
         | How large? What does "deal" mean here? Autonomously - is that
         | on its own whim, or at the behest of a user?
        
         | shinycode wrote:
          | "Autonomously"? What happens when subtle updates that are not
          | bugs, but change the meaning of some features, break the
          | workflow in some other external part of a client's system? It
          | happens all the time, and because it's really hard to have
          | the whole meaning and the business rules written down and
          | maintained up to date, an LLM might never be able to grasp
          | some of that meaning. Maybe if, instead of developing code
          | and infrastructure, the whole industry shifts toward only
          | writing impossibly precise spec sheets that make meaning and
          | intent crystal clear, then maybe "autonomously" might be
          | possible to pull off.
        
           | wizzwizz4 wrote:
           | Those spec sheets exist: they're called software.
        
         | slt2021 wrote:
         | >LLMs will never manage to deal
         | 
         | time to prove hypothesis: infinity years
        
       | bithive123 wrote:
       | Language models aren't world models for the same reason languages
       | aren't world models.
       | 
       | Symbols, by definition, only represent a thing. They are not the
       | same as the thing. The map is not the territory, the description
       | is not the described, you can't get wet in the word "water".
       | 
       | They only have meaning to sentient beings, and that meaning is
       | heavily subjective and contextual.
       | 
       | But there appear to be some who think that we can grasp truth
       | through mechanical symbol manipulation. Perhaps we just need to
       | add a few million more symbols, they think.
       | 
       | If we accept the incompleteness theorem, then there are true
       | propositions that even a super-intelligent AGI would not be able
       | to express, because all it can do is output a series of
       | placeholders. Not to mention the obvious fallacy of knowing
       | super-intelligence when we see it. Can you write a test suite for
       | it?
        
         | habitue wrote:
         | > Symbols, by definition, only represent a thing.
         | 
         | This is missing the lesson of the Yoneda Lemma: symbols are
         | uniquely identified by their relationships with other symbols.
         | If those relationships are represented in text, then in
         | principle they can be inferred and navigated by an LLM.
         | 
         | Some relationships are not represented well in text: tacit
         | knowledge like how hard to twist a bottle cap to get it to come
         | off, etc. We aren't capturing those relationships between all
         | your individual muscles and your brain well in language, so an
         | LLM will miss them or have very approximate versions of them,
         | but... that's always been the problem with tacit knowledge:
         | it's the exact kind of knowledge that's hard to communicate!
        
         | exe34 wrote:
          | > Language models aren't world models for the same reason
          | languages aren't world models.
          | 
          | > Symbols, by definition, only represent a thing. They are not
          | the same as the thing. The map is not the territory, the
          | description is not the described, you can't get wet in the
          | word "water".
          | 
          | There are a lot of negatives in there, but I feel like it
          | boils down to: a model of a thing is not the thing. Well duh.
          | It's a model. A map is a model.
        
           | bithive123 wrote:
           | Right. It's a dead thing that has no independent meaning. It
            | doesn't even exist as a thing except conceptually. The
           | referent is not even another dead thing, but a reality that
           | appears nowhere in the map itself. It may have certain
           | limited usefulness in the practical realm, but expecting it
           | to lead to new insights ignores the fact that it's
           | fundamentally an abstraction of the real, not in relationship
           | to it.
        
         | auggierose wrote:
         | First: true propositions (that are not provable) can definitely
         | be expressed, if they couldn't, the incompleteness theorem
         | would not be true ;-)
         | 
         | It would be interesting to know what the percentage of people
         | is, who invoke the incompleteness theorem, and have no clue
         | what it actually says.
         | 
         | Most people don't even know what a proof is, so that cannot be
         | a hindrance on the path to AGI ...
         | 
         | Second: ANY world model that can be digitally represented would
         | be subject to the same argument (if stated correctly), not only
         | LLMs.
        
           | bithive123 wrote:
           | I knew someone would call me out on that. I used the wrong
           | word; what I meant was "expressed in a way that would
           | satisfy" which implies proof within the symbolic order being
           | used. I don't claim to be a mathematician or philosopher.
        
             | auggierose wrote:
             | Well, you don't get it. The LLM definitely can state
             | propositions "that satisfy", let's just call them true
             | propositions, and that this is not the same as having a
             | proof for it is what the incompleteness theorem says.
             | 
             | Why would you require an LLM to have proof for the things
             | it says? I mean, that would be nice, and I am actually
             | working on that, but it is not anything we would require of
             | humans and/or HN commenters, would we?
        
               | bithive123 wrote:
               | I clearly do not meet the requirements to use the
               | analogy.
               | 
               | I am hearing the term super intelligence a lot and it
               | seems to me the only form that would take is the machine
               | spitting out a bunch of symbols which either delight or
               | dismay the humans. Which implies they already know what
               | it looks like.
               | 
               | If this technology will advance science or even be useful
               | for everyday life, then surely the propositions it
               | generates will need to hold up to reality, either via
               | axiomatic rigor or empirically. I look forward to finding
               | out if that will happen.
               | 
               | But it's still just a movement from the known to the
               | known, a very limited affair no matter how many new
               | symbols you add in whatever permutation.
        
         | chamomeal wrote:
         | I'm not a math guy but the incompleteness theorem applies to
         | formal systems, right? I've never thought about LLMs as formal
         | systems, but I guess they are?
        
           | bithive123 wrote:
           | Nor am I. I'm not claiming an LLM is a formal system, but it
           | is mechanical and operates on symbols. It can't deal in
           | anything else. That should temper some of the enthusiasm
           | going around.
        
           | pron wrote:
           | Anything that runs on a computer is a formal system. "Formal"
           | (the manipulation of _forms_ ) is an old term for what, after
           | Turing, we call "mechanical".
        
         | scarmig wrote:
         | > If we accept the incompleteness theorem
         | 
         | And, by various universality theorems, a sufficiently large AGI
         | could approximate any sequence of human neuron firings to an
         | arbitrary precision. So if the incompleteness theorem means
         | that neural nets can never find truth, it also means that the
         | human brain can never find truth.
         | 
         | Human neuron firing patterns, after all, only represent a
         | thing; they are not the same as the thing. Your experience of
         | seeing something isn't recreating the physical universe in your
         | head.
        
           | bevr1337 wrote:
           | > And, by various universality theorems, a sufficiently large
           | AGI could approximate any sequence of human neuron firings to
           | an arbitrary precision.
           | 
           | Wouldn't it become harder to simulate a human brain the
           | larger a machine is? I don't know nothing, but I think that
            | pesky speed of light thing might pose a challenge.
        
             | drdeca wrote:
              | simulate ≠ simulate-in-real-time
        
         | overgard wrote:
         | I don't think you can apply the incompleteness theorem like
         | that, LLMs aren't constrained to formal systems
        
         | pron wrote:
         | > Symbols, by definition, only represent a thing. They are not
         | the same as the thing
         | 
         | First of all, the point isn't about the map becoming the
         | territory, but about whether LLMs can form a map that's similar
         | to the map in our brains.
         | 
         | But to your philosophical point, assuming there are only a
         | finite number of things and places in the universe - or at
         | least the part of which we care about - why wouldn't they be
         | representable with a finite set of symbols?
         | 
         | What you're rejecting is the Church-Turing thesis [1]
         | (essentially, that all mechanical processes, including that of
         | nature, can be simulated with symbolic computation, although
         | there are weaker and stronger variants). It's okay to reject
         | it, but you should know that not many people do (even some non-
         | orthodox thoughts by Penrose about the brain not being
         | simulatable by an ordinary digital computer still accept that
         | some physical machine - the brain - is able to represent what
         | we're interested in).
         | 
         | > If we accept the incompleteness theorem
         | 
          | There is no _if_ there. It's a theorem. But it's completely
         | irrelevant. It means that there are mathematical propositions
         | that can't be proven or disproven by some system of logic, i.e.
         | by some mechanical means. But if something is in the universe,
         | then it's already been proven by some mechanical process: the
         | mechanics of nature. That means that if some finite set of
         | symbols could represent the laws of nature, then anything in
         | nature can be proven in that logical system. Which brings us
         | back to the first point: the only way the mechanics of nature
         | cannot be represented by symbols is if they are somehow
         | infinite, i.e. they don't follow some finite set of laws. In
         | other words - there is no physics. Now, that may be true, but
         | if that's the case, then AI is the least of our worries.
         | 
         | Of course, if physics does exist - i.e. the universe is
         | governed by a finite set of laws - that doesn't mean that we
         | can predict the future, as that would entail both measuring
         | things precisely and simulating them faster than their
         | operation in nature, and both of these things are... difficult.
         | 
         | [1]: https://plato.stanford.edu/entries/church-turing/
        
           | astrange wrote:
           | > First of all, the point isn't about the map becoming the
           | territory, but about whether LLMs can form a map that's
           | similar to the map in our brains.
           | 
           | It should be capable of something similar (fsvo similar), but
           | the largest difference is that humans have to be power-
           | efficient and LLMs do not.
           | 
           | That is, people don't actually have world models, because
           | modeling something is a waste of time and energy insofar as
           | it's not needed for anything. People are capable of taking
           | out the trash without knowing what's in the garbage bag.
        
       | frankfrank13 wrote:
        | Great quote at the end that resonates a lot with me:
       | 
       | > Feeding these algorithms gobs of data is another example of how
       | an approach that must be fundamentally incorrect at least in some
       | sense, as evidenced by how data-hungry it is, can be taken very
       | far by engineering efforts -- as long as something is useful
       | enough to fund such efforts and isn't outcompeted by a new idea,
       | it can persist.
        
       | 1970-01-01 wrote:
       | I'm surprised the models haven't been enshittified by capitalism.
       | I think in a few years we're going to see lightning-fast LLMs
       | generating better output compared to what we're seeing today. But
       | it won't be 1000x better, it will be 10x better, 10x faster, and
       | completely enshittified with ads and clickbait links. Enjoy
       | ChatGPT while it lasts.
        
       ___________________________________________________________________
       (page generated 2025-08-12 23:00 UTC)