[HN Gopher] LLMs aren't world models
___________________________________________________________________
LLMs aren't world models
Author : ingve
Score : 185 points
Date : 2025-08-10 11:40 UTC (2 days ago)
(HTM) web link (yosefk.com)
(TXT) w3m dump (yosefk.com)
| t0md4n wrote:
| https://arxiv.org/abs/2501.17186
| yosefk wrote:
| This is interesting. The "professional level" rating of <1800
| isn't actually professional level, but still.
|
| However:
|
| "A significant Elo rating jump occurs when the model's Legal
| Move accuracy reaches 99.8%. This increase is due to the
| reduction in errors after the model learns to generate legal
| moves, reinforcing that continuous error correction and
| learning the correct moves significantly improve ELO"
|
| You should be able to reach move legality of around 100%
| with few resources spent on it. Failing to do so means that it
| has not learned a model of what chess is, at some basic level.
| There is virtually no challenge in making legal moves.
| lostmsu wrote:
| > r4rk1 pp6 8 4p2Q 3n4 4N3 qP5P 2KRB3 w -- -- 3 27
|
| Can you say 100% you can generate a good next move (example
| from the paper) without using tools, and will never
| accidentally make a mistake and give an illegal move?
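|
| (As an aside: with tools the legality half is trivial. A
| minimal sketch, assuming the python-chess package and reading
| the position above as a standard FEN with '/' rank separators,
| which is my reconstruction of the paper's notation:)
|
|     import chess
|
|     # Position quoted above, reconstructed as a standard FEN
|     # (assumption: '/' separators, single dashes for castling/ep).
|     fen = "r4rk1/pp6/8/4p2Q/3n4/4N3/qP5P/2KRB3 w - - 3 27"
|     board = chess.Board(fen)
|     legal = list(board.legal_moves)  # exhaustive legal-move list
|     print(len(legal), "legal moves, e.g.",
|           [board.san(m) for m in legal[:5]])
|
| The open question in the thread is doing this from text alone,
| without a move generator.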
| rpdillon wrote:
| > Failing to do so means that it has not learned a model of
| what chess is, at some basic level.
|
| I'm not sure about this. Among a typical set of amateur
| chess players, how often, when they lack any guidance from
| a computer, do they attempt to make a move that is
| illegal? I played chess for years throughout elementary,
| middle and high school, and I would easily say that even
| after hundreds of hours of playing, I might make two mistakes
| out of a thousand moves where the move was actually illegal,
| often because I had missed that moving that piece would
| leave me in check due to a discovered check.
|
| It's hard to conclude from that experience that amateur
| players lack even a basic model of chess.
| libraryofbabel wrote:
| This essay could probably benefit from some engagement with the
| literature on "interpretability" in LLMs, including the empirical
| results about how knowledge (like addition) is represented inside
| the neural network. To be blunt, I'm not sure being smart and
| reasoning from first principles after asking the LLM a lot of
| questions and cherry-picking what it gets wrong gets to any
| novel insights at this point. And it already feels a little
| out of date: with LLMs getting gold on the Mathematical
| Olympiad, they clearly have a pretty good world model of
| mathematics. I don't think cherry-picking a failure to prove
| 2 + 2 = 4 in the specific way the writer wanted to see
| disproves that at all.
|
| LLMs have imperfect world models, sure. (So do humans.) That's
| because they are trained to be generalists and because their
| internal representations of things are _massively_ compressed
| since they don't have enough weights to encode everything. I
| don't think this means there are some natural limits to what they
| can do.
| armchairhacker wrote:
| Any suggestions from this literature?
| libraryofbabel wrote:
| The papers from Anthropic on interpretability are pretty
| good. They look at how certain concepts are encoded within
| the LLM.
| yosefk wrote:
| Your being blunt is actually very kind, if you're describing
| what I'm doing as "being smart and reasoning from first
| principles"; and I agree that I am not saying something very
| novel, at most it's slightly contrarian given the current
| sentiment.
|
| My goal is not to cherry-pick failures for its own sake as much
| as to try to explain why I get pretty bad output from LLMs much
| of the time, which I do. They are also very useful to me at
| times.
|
| Let's see how my predictions hold up; I have made enough to
| look very wrong if they don't.
|
| Regarding "failure disproving success": it can't, but it can
| disprove a theory of how this success is achieved. And, I have
| much better examples than the 2+2=4, which I am citing as
| something that sorta works these days.
| libraryofbabel wrote:
| I mean yeah, it's a good essay in that it made me think and
| try to articulate the gaps, and I'm always looking to read
| things that push back on AI hype. I usually just skip over
| the hype blogging.
|
| I think my biggest complaint is that the essay points out
| flaws in LLMs' world models (totally valid, they do
| confidently get things wrong and hallucinate in ways that are
| different from, and often more frustrating than, how humans
| get things wrong) but then it jumps to claiming that there is
| some fundamental limitation about LLMs that prevents them
| from forming workable world models. In particular, it strays
| a bit towards the "they're just stochastic parrots" critique,
| e.g. "that just shows the LLM knows to put the words
| explaining it after the words asking the question." That just
| doesn't seem to hold up in the face of e.g. LLMs getting gold
| on the Mathematical Olympiad, which features novel questions.
| If that isn't a world model of mathematics - being able to
| apply learned techniques to challenging new questions - then
| I don't know what is.
|
| A lot of that success is from reinforcement learning
| techniques where the LLM is made to solve tons of math
| problems _after_ the pre-training "read everything" step,
| which then gives it a chance to update its weights. LLMs
| aren't just trained from reading a lot of text anymore. It's
| very similar to how the AlphaZero chess engine was trained,
| in fact.
|
| I do think there's a lot that the essay gets right. If I was
| to recast it, I'd put it something like this:
|
| * LLMs have imperfect models of the world, which are
| conditioned by how they're trained on next-token prediction.
|
| * We've shown we can drastically improve those world models
| for particular tasks by reinforcement learning. You kind of
| allude to this already by talking about how they've been
| "flogged" to be good at math.
|
| * I would claim that there's no particular reason these RL
| techniques aren't extensible in principle to beat all sorts
| of benchmarks that might look unrealistic now. (Two years ago
| it would have been an extreme optimist position to say an LLM
| could get gold on the mathematical Olympiad, and most LLM
| skeptics would probably have said it could never happen.)
|
| * Of course it's very expensive, so most world models LLMs
| have won't get the RL treatment and so will be full of gaps,
| especially for things that aren't amenable to RL. It's good
| to beware of this.
|
| I think the biggest limitation LLMs actually have, the one
| that is the biggest barrier to AGI, is that they can't learn
| on the job, during inference. This means that with a novel
| codebase they are never able to build a good model of it,
| because they can never update their weights. (If an LLM was
| given tons of RL training on that codebase, it _could_ build
| a better world model, but that's expensive and very
| challenging to set up.) This problem is hinted at in your
| essay, but the lack of on-the-job learning isn't centered.
| But it's the real elephant in the room with LLMs and the one
| the boosters don't really have an answer to.
|
| Anyway thanks for writing this and responding!
| yosefk wrote:
| I'm not saying that LLMs can't learn about the world - I
| even mention how they obviously do it, even at the learned
| embeddings level. I'm saying that they're not compelled by
| their training objective to learn about the world and in
| many cases they clearly don't, and I don't see how to
| characterize the opposite cases in a more useful way than
| "happy accidents."
|
| I don't really know how they are made "good at math," and
| I'm not that good at math myself. With code I have a better
| gut feeling of the limitations. I do think that you could
| throw them off terribly with unusual math questions to show
| that what they learned isn't math, but I'm not the guy to
| do it; my examples are about chess and programming where I
| am more qualified to do it. (You could say that my question
| about the associativity of blending and how caching works
| sort of shows that it can't use the concept of
| associativity in novel situations; not sure if this can be
| called an illustration of its weakness at math.)
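|
| (A minimal sketch of the blending property being referenced --
| my own illustration, not the exact prompt from the post: with
| premultiplied alpha, the "over" operator is associative, which
| is what lets a renderer pre-composite and cache a group of
| layers and later blend the cached result over any background.)
|
|     # Composite premultiplied RGBA 'top' over 'bottom'.
|     def over(top, bottom):
|         (tr, tg, tb, ta), (br, bg, bb, ba) = top, bottom
|         k = 1.0 - ta
|         return (tr + br * k, tg + bg * k, tb + bb * k, ta + ba * k)
|
|     a = (0.20, 0.05, 0.00, 0.25)  # arbitrary premultiplied layers
|     b = (0.10, 0.30, 0.10, 0.50)
|     c = (0.00, 0.10, 0.40, 0.80)
|
|     left = over(over(a, b), c)   # a over b first, then over c
|     right = over(a, over(b, c))  # b over c first (cacheable)
|     print(all(abs(x - y) < 1e-12
|               for x, y in zip(left, right)))  # True: associative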
| calf wrote:
| But this is parallel to saying LLMs are not "compelled"
| by the training algorithms to learn symbolic logic.
|
| Which says to me there are two camps here, and the jury is
| still out on this and all related questions.
| WillPostForFood wrote:
| Your LLM output seems abnormally bad, like you are using old
| models, bad models, or intentionally poor prompting. I just
| copied and pasted your Krita example into ChatGPT, and got a
| reasonable answer, nothing like what you paraphrased in your
| post.
|
| https://imgur.com/a/O9CjiJY
| marcellus23 wrote:
| I think it's hard to take any LLM criticism seriously if
| they don't even specify which model they used. Saying "an
| LLM model" is totally useless for deriving any kind of
| conclusion.
| p1esk wrote:
| Yes, I'd be curious about his experience with GPT-5
| Thinking model. So far I haven't seen any blunders from
| it.
| typpilol wrote:
| This seems like a common theme with these types of articles
| AyyEye wrote:
| With LLMs being unable to count how many Bs are in blueberry,
| they clearly don't have any world model whatsoever. That
| addition (something which only takes a few gates in digital
| logic) happens to be overfit into a few nodes on multi-billion
| node networks is hardly a surprise to anyone except the most
| religious of AI believers.
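|
| (For a sense of the "few gates" claim -- a one-bit full adder
| is two XORs, two ANDs and an OR, and chaining them adds
| arbitrary-width integers. A small illustrative sketch:)
|
|     def full_adder(a, b, carry_in):
|         s1 = a ^ b                              # first XOR
|         sum_bit = s1 ^ carry_in                 # second XOR
|         carry_out = (a & b) | (s1 & carry_in)   # two ANDs, one OR
|         return sum_bit, carry_out
|
|     def add_bits(x_bits, y_bits):
|         """Ripple-carry addition of little-endian bit lists."""
|         out, carry = [], 0
|         for a, b in zip(x_bits, y_bits):
|             s, carry = full_adder(a, b, carry)
|             out.append(s)
|         return out + [carry]
|
|     print(add_bits([1, 1, 0], [1, 0, 1]))  # 3 + 5 -> [0, 0, 0, 1] == 8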
| yosefk wrote:
| Actually I forgive them those issues that stem from
| tokenization. I used to make fun of them for listing datum as
| a noun whose plural form ends with an i, but once I learned
| about how tokenization works, I no longer do it - it feels
| like mocking a person's intelligence because of a speech
| impediment or something... I am very kind to these things, I
| think
| astrange wrote:
| Tokenization makes things harder, but it doesn't make them
| impossible. Just takes a bit more memorization.
|
| Other writing systems come with "tokenization" built in,
| making it still a live issue. Think of answering:
|
| 1. How many n's are in Ri Ben ?
|
| 2. How many n's are in Ri Ben ?
|
| (Answers are 2 and 1.)
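|
| (To make the tokenization point concrete -- a minimal sketch,
| assuming the tiktoken package and treating the cl100k_base
| encoding as representative. The model consumes token IDs, not
| letters, so a letter count has to be memorized or reasoned out
| rather than read off the input:)
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     word = "blueberry"
|     ids = enc.encode(word)
|     pieces = [enc.decode_single_token_bytes(t).decode("utf-8", "replace")
|               for t in ids]
|     print("what the model sees:", ids, pieces)
|     print("letters counted directly:", word.count("b"))  # 2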
| andyjohnson0 wrote:
| > With LLMs being unable to count how many Bs are in
| blueberry, they clearly don't have any world model
| whatsoever.
|
| Is this a real defect, or some historical thing?
|
| I just asked GPT-5: How many "B"s in
| "blueberry"?
|
| and it replied: There are 2 -- the letter b
| appears twice in "blueberry".
|
| I also asked it how many Rs in Carrot, and how many Ps in
| Pineapple, and it answered both questions correctly too.
| libraryofbabel wrote:
| It's a historical thing that people still falsely claim is
| true, bizarrely without trying it on the latest models. As
| you found, leading LLMs don't have a problem with it
| anymore.
| pydry wrote:
| Depends how you define historical. If by historical you
| mean more than two days ago then, yeah, it's ancient
| history.
| ThrowawayR2 wrote:
| It was discussed and reproduced with GPT-5 on HN a couple of
| days ago: https://news.ycombinator.com/item?id=44832908
|
| Sibling poster is probably mistakenly thinking of the
| strawberry issue from 2024 on older LLM models.
| bgwalter wrote:
| It is not historical:
|
| https://kieranhealy.org/blog/archives/2025/08/07/blueberry-
| h...
|
| Perhaps they have a hot fix that special cases HN
| complaints?
| AyyEye wrote:
| They clearly RLHF out the embarrassing cases and make
| cheating on benchmarks into a sport.
| nosioptar wrote:
| Shouldn't the correct answer be that there is not a "B" in
| "blueberry"?
| BobbyJo wrote:
| The core issue there isn't that the LLM isn't building
| internal models to represent its world, it's that its world
| is limited to tokens. Anything not represented in tokens, or
| token relationships, can't be modeled by the LLM, by
| definition.
|
| It's like asking a blind person to count the number of colors
| on a car. They can give it a go and assume glass, tires, and
| metal are different colors as there is likely a correlation
| they can draw from feeling them or discussing them. That's
| the best they can do though as they can't actually perceive
| color.
|
| In this case, the LLM can't see letters, so asking it to
| count them causes it to try and draw from some proxy of that
| information. If it doesn't have an accurate one, then bam,
| strawberry has two r's.
|
| I think a good example of LLMs building models internally is
| this: https://rohinmanvi.github.io/GeoLLM/
|
| LLMs are able to encode geospatial relationships because they
| can be represented by token relationships well. Two countries
| that are close together will be talked about together much
| more often than two countries far from each other.
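|
| (A toy illustration of that claim -- my own sketch, not from
| the linked GeoLLM work: rough distance-like structure can be
| recovered from co-occurrence counts alone.)
|
|     from collections import Counter
|     from itertools import combinations
|
|     corpus = [
|         "France and Germany share a border and a long history",
|         "Germany and Poland trade heavily across their border",
|         "France exports wine to Germany and Belgium",
|         "Japan and Australia are major Pacific trading partners",
|     ]
|     countries = {"France", "Germany", "Poland", "Belgium",
|                  "Japan", "Australia"}
|
|     cooc = Counter()
|     for sentence in corpus:
|         present = sorted(set(sentence.split()) & countries)
|         for pair in combinations(present, 2):
|             cooc[pair] += 1
|
|     # Nearby European countries co-occur; distant pairs rarely do.
|     print(cooc.most_common(3))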
| vrighter wrote:
| That is just not a solid argument. There are countless
| examples of LLMs splitting "blueberry" into "b l u e b e r
| r y", which would contain one token per letter. And then
| they still manage to get it wrong.
|
| Your argument is based on a flawed assumption: that they
| can't see letters. If they couldn't, they wouldn't be able to
| spell the word out. But they do. And when they do get one
| token per letter, they still miscount.
| xigoi wrote:
| > It's like asking a blind person to count the number of
| colors on a car.
|
| I presume if I asked a blind person to count the colors on
| a car, they would reply "sorry, I am blind, so I can't
| answer this question".
| libraryofbabel wrote:
| > they clearly don't have any world model whatsoever
|
| Then how did an LLM get gold on the mathematical Olympiad,
| where it certainly hadn't seen the questions before? How _on
| earth_ is that possible without a decent working model of
| mathematics? Sure, LLMs might make weird errors sometimes
| (nobody is denying that), but clearly the story is rather
| more complicated than you suggest.
| simiones wrote:
| > where it certainly hadn't seen the questions before?
|
| What are you basing this certainty on?
|
| And even if you're right that the specific questions had
| not come up, it may still be that the questions from the
| math olympiad were rehashes of similar questions in other
| texts, or happened to correspond well to a composition of
| some other problems that were part of the training set,
| such that the LLM could 'pick up' on the similarity.
|
| It's also possible that the LLM was specifically trained on
| similar problems, or may even have a dedicated sub-net or
| tool for it. Still impressive, but possibly not in a way
| that generalizes even to math like one might think based on
| the press releases.
| williamcotton wrote:
| I don't solve math problems with my poetry writing skills:
|
| https://chatgpt.com/share/689ba837-8ae0-8013-96d2-7484088f27.
| ..
| lossolo wrote:
| https://arxiv.org/abs/2508.01191
| rishi_devan wrote:
| Haha. I enjoyed that Soviet-era joke at the end.
| svantana wrote:
| Yes, I hadn't heard that before. It's similar in spirit to this
| norwegian folk tale about a deaf man guessing what someone is
| saying to him:
|
| https://en.wikipedia.org/wiki/%22Good_day,_fellow!%22_%22Axe...
| kgwgk wrote:
| Another similar story:
|
| King Frederick the Great of Prussia had a very fine army,
| and none of the soldiers in it were finer than the Giant Guards,
| who were all extremely tall men. It was difficult to find
| enough soldiers for these Guards, as there were not many men
| who were tall enough.
|
| Frederick had made it a rule that no soldiers who did not
| speak German could be admitted to the Giant Guards, and this
| made the work of the officers who had to find men for them
| even more difficult. When they had to choose between
| accepting or refusing a really tall man who knew no German,
| the officers used to accept him, and then teach him enough
| German to be able to answer if the King questioned him.
|
| Frederick, sometimes, used to visit the men who were on guard
| around his castle at night to see that they were doing their
| job properly, and it was his habit to ask each new one that
| he saw three questions: "How old are you?" "How long have you
| been in my army?" and "Are you satisfied with your food and
| your conditions?"
|
| The officers of the Giant Guards therefore used to teach new
| soldiers who did not know German the answers to these three
| questions.
|
| One day, however, the King asked a new soldier the questions
| in a different order; he began with, "How long have you been
| in my army?" The young soldier immediately answered,
| "Twenty-two years, Your Majesty". Frederick was very surprised.
| "How old are you then?", he asked the soldier. "Six months,
| Your Majesty", came the answer. At this Frederick became
| angry, "Am I a fool, or are you one?" he asked. "Both, Your
| Majesty", the soldier answered politely.
|
| https://archive.org/details/advancedstoriesf0000hill
| deadbabe wrote:
| Don't: use LLMs to play chess against you
|
| Do: use LLMs to talk shit to you while a _real_ chess AI plays
| chess against you.
|
| The above applies to a lot of things besides chess, and
| illustrates a proper application of LLMs.
| imenani wrote:
| As far as I can tell they don't say which LLM they used, which is
| kind of a shame as there is a huge range of capabilities even in
| newly released LLMs (e.g. reasoning vs not).
| yosefk wrote:
| ChatGPT, Claude, Grok and Google AI Overviews, whatever powers
| the latter, were all used in one or more of these examples, in
| various configurations. I think they can perform differently,
| and I often try more than one when the 1st try doesn't work
| great. I don't think there's any fundamental difference in the
| principle of their operation, and I think there never will be -
| there will be another major breakthrough
| red75prime wrote:
| My hypothesis is that a model fails to switch into a deep
| thinking mode (if it has it) and blurts whatever it got from
| all the internet data during autoregressive training. I
| tested it with alpha-blending example. Gemini 2.5 flash -
| fails, Gemini 2.5 pro - succeeds.
|
| How does the presence or absence of a world model, er, blend into
| all this? I guess "having a consistent world model at all times"
| is an incorrect description of humans, too. We seem to have
| it because we have mechanisms to notice errors, correct
| errors, remember the results, and use the results when
| similar situations arise, while slowly updating intuitions
| about the world to incorporate changes.
|
| The current models lack "remember/use/update" parts.
| imenani wrote:
| Each of these models has a thinking/reasoning variant and a
| default non-thinking variant. I would expect the reasoning
| variants (o3 or "GPT5 Thinking", Gemini DeepThink, Claude
| with Extended Thinking, etc) to do better at this. I think
| there is also some chance that in their reasoning traces they
| may display something you might see as closer to world
| modelling. In particular, you might find them explicitly
| tracking positions of pieces and checking validity.
| red75prime wrote:
| > I don't think there's any fundamental difference in the
| principle of their operation
|
| Yeah, they seem to be subject to the universal
| approximation theorem (it needs to be checked more
| thoroughly, but I think we can build a transformer that is
| equivalent to any given fully-connected multilayered
| network).
|
| That is, at a certain size they can do anything a human can do
| at a certain point in their life (that is, with no additional
| training), regardless of whether humans have world models and
| what those models are at the neuronal level.
|
| But there are additional nuances that are related to their
| architectures and training regimes. And practical questions
| of the required size.
| lowsong wrote:
| It doesn't matter. These limitations are fundamental to LLMs,
| so all of them that will ever be made suffer from these
| problems.
| og_kalu wrote:
| Yes LLMs can play chess and yes they can model it fine
|
| https://arxiv.org/pdf/2403.15498v2
| GaggiX wrote:
| https://www.youtube.com/watch?v=LtG0ACIbmHw
|
| SOTA LLMs do play legal moves in chess; I don't know why the
| article seems to say otherwise.
| tickettotranai wrote:
| Technically yes, but... it's moderately tricky to get an LLM to
| play good chess even though it can.
|
| https://dynomight.net/more-chess/
|
| This is significant in general because I personally would love
| to get these things to code-switch into "hackernews poster" or
| "writer for the Economist" or "academic philosopher", but I
| think the "chat" format makes it impossible. The
| inaccessibility of this makes me want to host my own LLM...
| lordnacho wrote:
| Here's what LLMs remind me of.
|
| When I went to uni, we had tutorials several times a week. Two
| students, one professor, going over whatever was being studied
| that week. The professor would ask insightful questions, and the
| students would try to answer.
|
| Sometimes, I would answer a question correctly without actually
| understanding what I was saying. I would be spewing out something
| that I had read somewhere in the huge pile of books, and it would
| be a sentence, with certain special words in it, that the
| professor would accept as an answer.
|
| But I would sometimes have this weird feeling of "hmm I actually
| don't get it" regardless. This is kinda what the tutorial is for,
| though. With a bit more prodding, the prof will ask something
| that you genuinely cannot produce a suitable word salad for, and
| you would be found out.
|
| In math-type tutorials it would be things like realizing some
| equation was useful for finding an answer without having a clue
| about what the equation actually represented.
|
| In economics tutorials it would be spewing out words about
| inflation or growth or some particular author but then having
| nothing to back up the intuition.
|
| This is what I suspect LLMs do. They can often be very useful to
| someone who actually has the models in their minds, but not the
| data to hand. You may have forgotten the supporting evidence for
| some position, or you might have missed some piece of the
| argument due to imperfect memory. In these cases, LLM is
| fantastic as it just glues together plausible related words for
| you to examine.
|
| The wheels come off when you're not an expert. Everything it says
| will sound plausible. When you challenge it, it just apologizes
| and pretends to correct itself.
| ej88 wrote:
| This article is interesting but pretty shallow.
|
| 0(?): there's no provided definition of what a 'world model' is.
| Is it playing chess? Is it remembering facts like how computers
| use math to blend colors? If so, then ChatGPT:
| https://chatgpt.com/s/t_6898fe6178b88191a138fba8824c1a2c has a
| world model, right?
|
| 1. The author seems to conflate context windows with failing to
| model the world in the chess example. I challenge them to give
| a SOTA model an image of a chess board (or the notation) and
| ask it about the position. It might not give you GM-level
| analysis but
| it definitely has a model of what's going on.
|
| 2. Without explaining which LLM they used or sharing the chats
| these examples are just not valuable. The larger and better the
| model, the better its internal representation of the world.
|
| You can try it yourself. Come up with some question involving
| interacting with the world and / or physics and ask GPT-5
| Thinking. It's got a pretty good understanding of how things
| work!
|
| https://chatgpt.com/s/t_689903b03e6c8191b7ce1b85b1698358
| yosefk wrote:
| A "world model" depends on the context which defines which
| world the problem is in. For chess, which moves are legal and
| needing to know where the pieces are to make legal moves are
| parts of the world model. For alpha blending, it being a
| mathematical operation and the visibility of a background given
| the transparency of the foreground are parts of the world
| model.
|
| The examples are from all the major commercial American LLMs as
| listed in a sister comment.
|
| You seem to conflate context windows with tracking chess
| pieces. The context windows are more than large enough to
| remember 10 moves. The model should either track the pieces, or
| mention that it would be playing blindfold chess absent a board
| to look at and it isn't good at this, so could you please list
| the position after every move to make it fair, or it doesn't
| know what it's doing; it's demonstrably the latter.
| jonplackett wrote:
| I just tried a few things that are simple and that a world
| model would probably get right, e.g.:
|
| Question to GPT5: I am looking straight on to some objects.
| Looking parallel to the ground.
|
| In front of me I have a milk bottle, to the right of that is a
| Coca-Cola bottle. To the right of that is a glass of water. And
| to the right of that there's a cherry. Behind the cherry there's
| a cactus and to the left of that there's a peanut. Everything is
| spaced evenly. Can I see the peanut?
|
| Answer (after choosing thinking mode)
|
| No. The cactus is directly behind the cherry (front row order:
| milk, Coke, water, cherry). "To the left of that" puts the peanut
| behind the glass of water. Since you're looking straight on, the
| glass sits in front and occludes the peanut.
|
| It doesn't consider transparency until you mention it, then
| apologises and says it didn't think of transparency
| RugnirViking wrote:
| this seems like a strange riddle. In my mind I was thinking
| that regardless of the glass, all of the objects can be seen
| (due to perspective, and also the fact you mentioned the
| locations, meaning you're aware of them).
|
| It seems to me it would only actually work with an orthographic
| projection, which is not how our reality works.
| jonplackett wrote:
| You can tell from the response it does understand the riddle
| just fine, it just gets it wrong.
| rpdillon wrote:
| Have you asked five adults this riddle? I suspect at least
| two of them would get it wrong or have some uncertainty
| about whether or not the peanut was visible.
| xg15 wrote:
| This. Was also thinking "yes" first because of the glass
| of water, transparency, etc, but then got unsure: The
| objects might be spaced so widely that the milk or coke
| bottle would obscure the view due to perspective - or the
| peanut would simply end up outside the viewer's field of
| vision.
|
| Shows that even _if_ you have a world model, it might not
| be the right one.
| optimalsolver wrote:
| Gemini 2.5 Pro gets this correct on the first attempt, and
| specifically points out the transparency of the glass of water.
|
| https://g.co/gemini/share/362506056ddb
|
| Time to get the ol' goalpost-moving gloves out.
| wilg wrote:
| Worked for me: https://chatgpt.com/share/689bc3ef-
| fa1c-800f-9275-93c2dbc11b...
| Razengan wrote:
| A slight tangent: I think/wonder if the one place where AIs could
| be really useful, might be in translating alien languages :)
|
| As in, an alien could teach one of our AIs their language faster
| than an alien could teach an human, and vice versa..
|
| ..though the potential for catastrophic disasters is also great
| there lol
| keeda wrote:
| That whole bit about color blending and transparency and LLMs
| "not knowing colors" is hard to believe. I am literally using
| LLMs every day to write image-processing and computer vision code
| using OpenCV. It seamlessly reasons across a range of concepts
| like color spaces, resolution, compression artifacts, filtering,
| segmentation and human perception. I mean, removing the alpha
| from a PNG image was a preprocessing step it wrote by itself as
| part of a larger task I had given it, so it certainly understands
| transparency.
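|
| (A minimal sketch of that kind of preprocessing -- my own
| reconstruction with made-up file names, not the commenter's
| actual code: flattening an RGBA PNG onto a white background
| with OpenCV before further processing.)
|
|     import cv2
|     import numpy as np
|
|     img = cv2.imread("input.png", cv2.IMREAD_UNCHANGED)
|     if img is not None and img.ndim == 3 and img.shape[2] == 4:
|         bgr = img[:, :, :3].astype(np.float32)
|         alpha = img[:, :, 3:4].astype(np.float32) / 255.0
|         white = np.full_like(bgr, 255.0)
|         # Alpha-blend onto white, dropping the alpha channel.
|         img = (bgr * alpha + white * (1.0 - alpha)).astype(np.uint8)
|     if img is not None:
|         cv2.imwrite("flattened.png", img)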
|
| I even often describe the results e.g. "this fails when in X
| manner when the image has grainy regions" and it figures out what
| is going on, and adapts the code accordingly. (It works with
| uploading actual images too, but those consume a lot of tokens!)
|
| And all this in a rather niche domain that seems relatively less
| explored. The images I'm working with are rather small and low-
| resolution, which most literature does not seem to contemplate
| much. It uses standard techniques well known in the art, but it
| adapts and combines them well to suit my particular requirements.
| So they seem to handle "novel" pretty well too.
|
| If it can reason about images and vision and write working code
| for niche problems I throw at it, whether it "knows" colors in
| the human sense is a purely philosophical question.
| geraneum wrote:
| > it wrote by itself as part of a larger task I had given it,
| so it certainly understands transparency
|
| Or it's a common step or a known pattern or combination of
| steps that is prevalent in its training data for certain input.
| I'm guessing you don't know what's exactly in the training
| sets. I don't know either. They don't tell ;)
|
| > but it adapts and combines them well to suit my particular
| requirements. So they seem to handle "novel" pretty well too.
|
| We tend to overestimate the novelty of our own work and our
| methods and at the same time, underestimate the vastness of the
| data and information available online for machines to train on.
| LLMs are very sophisticated pattern recognizers. It doesn't
| mean that what you are doing specifically has been done in
| this exact way before; rather, the patterns being adapted and
| the overall approach may not be one of a kind.
|
| > is a purely philosophical question
|
| It is indeed. A question we need to ask ourselves.
| skeledrew wrote:
| Agree in general with most of the points, except
|
| > but because I know you and I get by with less.
|
| Actually we got far more data and training than any LLM. We've
| been gathering and processing sensory data every second at least
| since birth (more processing than gathering when asleep), and are
| only really considered fully intelligent in our late teens to
| mid-20s.
| helloplanets wrote:
| Don't forget the millions of years of pre-training! ;)
| o_nate wrote:
| What with this and your previous post about why sometimes
| incompetent management leads to better outcomes, you are quickly
| becoming one of my favorite tech bloggers. Perhaps I enjoyed the
| piece so much because your conclusions basically track mine. (I'm
| a software developer who has dabbled with LLMs, and has some
| hand-wavey background on how they work, but otherwise can claim
| no special knowledge.) Also your writing style really pops. No
| one would accuse your post of having been generated by an LLM.
| yosefk wrote:
| thank you for your kind words!
| neuroelectron wrote:
| Not yet
| ameliaquining wrote:
| One thing I appreciated about this post, unlike a lot of AI-
| skeptic posts, is that it actually makes a concrete falsifiable
| prediction; specifically, "LLMs will never manage to deal with
| large code bases 'autonomously'". So in the future we can look
| back and see whether it was right.
|
| For my part, I'd give 80% confidence that LLMs will be able to do
| this within two years, without fundamental architectural changes.
| moduspol wrote:
| "Deal with" and "autonomously" are doing a lot of heavy lifting
| there. Cursor already does a pretty good job indexing all the
| files in a code base in a way that lets it ask questions and
| get answers pretty quickly. It's just a matter of where you set
| the goalposts.
| ameliaquining wrote:
| True, there'd be a need to operationalize these things a bit
| more than is done in the post to have a good advance
| prediction.
| exe34 wrote:
| How large? What does "deal" mean here? Autonomously - is that
| on its own whim, or at the behest of a user?
| shinycode wrote:
| "Autonomously"? What happens when subtle updates that are
| not bugs change the meaning of some features and break the
| workflow in some other external parts of a client's system?
| It happens all the time, and because it's really hard to keep
| the whole meaning and the business rules written down and
| maintained up to date, an LLM might never be able to grasp
| some of that meaning. Maybe if, instead of developing code
| and infrastructure, the whole industry shifts toward only
| writing impossibly precise spec sheets that make meaning and
| intent crystal clear, then "autonomously" might be possible
| to pull off.
| wizzwizz4 wrote:
| Those spec sheets exist: they're called software.
| slt2021 wrote:
| >LLMs will never manage to deal
|
| time to prove hypothesis: infinity years
| bithive123 wrote:
| Language models aren't world models for the same reason languages
| aren't world models.
|
| Symbols, by definition, only represent a thing. They are not the
| same as the thing. The map is not the territory, the description
| is not the described, you can't get wet in the word "water".
|
| They only have meaning to sentient beings, and that meaning is
| heavily subjective and contextual.
|
| But there appear to be some who think that we can grasp truth
| through mechanical symbol manipulation. Perhaps we just need to
| add a few million more symbols, they think.
|
| If we accept the incompleteness theorem, then there are true
| propositions that even a super-intelligent AGI would not be able
| to express, because all it can do is output a series of
| placeholders. Not to mention the obvious fallacy of knowing
| super-intelligence when we see it. Can you write a test suite for
| it?
| habitue wrote:
| > Symbols, by definition, only represent a thing.
|
| This is missing the lesson of the Yoneda Lemma: symbols are
| uniquely identified by their relationships with other symbols.
| If those relationships are represented in text, then in
| principle they can be inferred and navigated by an LLM.
|
| Some relationships are not represented well in text: tacit
| knowledge like how hard to twist a bottle cap to get it to come
| off, etc. We aren't capturing those relationships between all
| your individual muscles and your brain well in language, so an
| LLM will miss them or have very approximate versions of them,
| but... that's always been the problem with tacit knowledge:
| it's the exact kind of knowledge that's hard to communicate!
| exe34 wrote:
| > Language models aren't world models for the same reason
| languages aren't world models. > Symbols, by definition, only
| represent a thing. They are not the same as the thing. The map
| is not the territory, the description is not the described, you
| can't get wet in the word "water".
|
| There are a lot of negatives in there, but I feel like it boils
| down to a model of a thing is not the thing. Well duh. It's a
| model. A map is a model.
| bithive123 wrote:
| Right. It's a dead thing that has no independent meaning. It
| doesn't even exist as a thing except conceputally. The
| referent is not even another dead thing, but a reality that
| appears nowhere in the map itself. It may have certain
| limited usefulness in the practical realm, but expecting it
| to lead to new insights ignores the fact that it's
| fundamentally an abstraction of the real, not in relationship
| to it.
| auggierose wrote:
| First: true propositions (that are not provable) can definitely
| be expressed, if they couldn't, the incompleteness theorem
| would not be true ;-)
|
| It would be interesting to know what the percentage of people
| is, who invoke the incompleteness theorem, and have no clue
| what it actually says.
|
| Most people don't even know what a proof is, so that cannot be
| a hindrance on the path to AGI ...
|
| Second: ANY world model that can be digitally represented would
| be subject to the same argument (if stated correctly), not only
| LLMs.
| bithive123 wrote:
| I knew someone would call me out on that. I used the wrong
| word; what I meant was "expressed in a way that would
| satisfy" which implies proof within the symbolic order being
| used. I don't claim to be a mathematician or philosopher.
| auggierose wrote:
| Well, you don't get it. The LLM definitely can state
| propositions "that satisfy", let's just call them true
| propositions, and that this is not the same as having a
| proof for it is what the incompleteness theorem says.
|
| Why would you require an LLM to have proof for the things
| it says? I mean, that would be nice, and I am actually
| working on that, but it is not anything we would require of
| humans and/or HN commenters, would we?
| bithive123 wrote:
| I clearly do not meet the requirements to use the
| analogy.
|
| I am hearing the term super intelligence a lot and it
| seems to me the only form that would take is the machine
| spitting out a bunch of symbols which either delight or
| dismay the humans. Which implies they already know what
| it looks like.
|
| If this technology will advance science or even be useful
| for everyday life, then surely the propositions it
| generates will need to hold up to reality, either via
| axiomatic rigor or empirically. I look forward to finding
| out if that will happen.
|
| But it's still just a movement from the known to the
| known, a very limited affair no matter how many new
| symbols you add in whatever permutation.
| chamomeal wrote:
| I'm not a math guy but the incompleteness theorem applies to
| formal systems, right? I've never thought about LLMs as formal
| systems, but I guess they are?
| bithive123 wrote:
| Nor am I. I'm not claiming an LLM is a formal system, but it
| is mechanical and operates on symbols. It can't deal in
| anything else. That should temper some of the enthusiasm
| going around.
| pron wrote:
| Anything that runs on a computer is a formal system. "Formal"
| (the manipulation of _forms_ ) is an old term for what, after
| Turing, we call "mechanical".
| scarmig wrote:
| > If we accept the incompleteness theorem
|
| And, by various universality theorems, a sufficiently large AGI
| could approximate any sequence of human neuron firings to an
| arbitrary precision. So if the incompleteness theorem means
| that neural nets can never find truth, it also means that the
| human brain can never find truth.
|
| Human neuron firing patterns, after all, only represent a
| thing; they are not the same as the thing. Your experience of
| seeing something isn't recreating the physical universe in your
| head.
| bevr1337 wrote:
| > And, by various universality theorems, a sufficiently large
| AGI could approximate any sequence of human neuron firings to
| an arbitrary precision.
|
| Wouldn't it become harder to simulate a human brain the
| larger a machine is? I don't know nothing, but I think that
| peaky speed of light thing might pose a challenge.
| drdeca wrote:
| simulate ≠ simulate-in-real-time
| overgard wrote:
| I don't think you can apply the incompleteness theorem like
| that, LLMs aren't constrained to formal systems
| pron wrote:
| > Symbols, by definition, only represent a thing. They are not
| the same as the thing
|
| First of all, the point isn't about the map becoming the
| territory, but about whether LLMs can form a map that's similar
| to the map in our brains.
|
| But to your philosophical point, assuming there are only a
| finite number of things and places in the universe - or at
| least the part of which we care about - why wouldn't they be
| representable with a finite set of symbols?
|
| What you're rejecting is the Church-Turing thesis [1]
| (essentially, that all mechanical processes, including that of
| nature, can be simulated with symbolic computation, although
| there are weaker and stronger variants). It's okay to reject
| it, but you should know that not many people do (even some non-
| orthodox thoughts by Penrose about the brain not being
| simulatable by an ordinary digital computer still accept that
| some physical machine - the brain - is able to represent what
| we're interested in).
|
| > If we accept the incompleteness theorem
|
| There is no _if_ there. It's a theorem. But it's completely
| irrelevant. It means that there are mathematical propositions
| that can't be proven or disproven by some system of logic, i.e.
| by some mechanical means. But if something is in the universe,
| then it's already been proven by some mechanical process: the
| mechanics of nature. That means that if some finite set of
| symbols could represent the laws of nature, then anything in
| nature can be proven in that logical system. Which brings us
| back to the first point: the only way the mechanics of nature
| cannot be represented by symbols is if they are somehow
| infinite, i.e. they don't follow some finite set of laws. In
| other words - there is no physics. Now, that may be true, but
| if that's the case, then AI is the least of our worries.
|
| Of course, if physics does exist - i.e. the universe is
| governed by a finite set of laws - that doesn't mean that we
| can predict the future, as that would entail both measuring
| things precisely and simulating them faster than their
| operation in nature, and both of these things are... difficult.
|
| [1]: https://plato.stanford.edu/entries/church-turing/
| astrange wrote:
| > First of all, the point isn't about the map becoming the
| territory, but about whether LLMs can form a map that's
| similar to the map in our brains.
|
| It should be capable of something similar (fsvo similar), but
| the largest difference is that humans have to be power-
| efficient and LLMs do not.
|
| That is, people don't actually have world models, because
| modeling something is a waste of time and energy insofar as
| it's not needed for anything. People are capable of taking
| out the trash without knowing what's in the garbage bag.
| frankfrank13 wrote:
| Great quote at the end that resonates a lot with me:
|
| > Feeding these algorithms gobs of data is another example of how
| an approach that must be fundamentally incorrect at least in some
| sense, as evidenced by how data-hungry it is, can be taken very
| far by engineering efforts -- as long as something is useful
| enough to fund such efforts and isn't outcompeted by a new idea,
| it can persist.
| 1970-01-01 wrote:
| I'm surprised the models haven't been enshittified by capitalism.
| I think in a few years we're going to see lightning-fast LLMs
| generating better output compared to what we're seeing today. But
| it won't be 1000x better, it will be 10x better, 10x faster, and
| completely enshittified with ads and clickbait links. Enjoy
| ChatGPT while it lasts.
___________________________________________________________________
(page generated 2025-08-12 23:00 UTC)