[HN Gopher] Do Large Language Models learn world models or just ...
       ___________________________________________________________________
        
       Do Large Language Models learn world models or just surface
       statistics? (2023)
        
       Author : fragmede
       Score  : 39 points
       Date   : 2024-11-22 12:52 UTC (10 hours ago)
        
 (HTM) web link (thegradient.pub)
 (TXT) w3m dump (thegradient.pub)
        
       | pvg wrote:
       | Big thread at the time
       | https://news.ycombinator.com/item?id=34474043
        
         | randcraw wrote:
         | Thanks. Now, after almost two years of incomparably explosive
         | growth in LLMs since that paper, it's remarkable to realize
         | that we still don't know if Scarecrow has a brain. Or if he'll
         | forever remain just a song and dance man.
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _Do Large Language Models learn world models or just surface
         | statistics?_ - https://news.ycombinator.com/item?id=34474043 -
         | Jan 2023 (174 comments)
        
       | mjburgess wrote:
       | This is irrelevant, and it's very frustrating that computer
       | scientists think it is relevant.
       | 
       | If you give a universal function approximator the task of
       | approximating an abstract function, you will get an
       | approximation.
       | 
        | E.g.,
        | 
        |     def circle(radius): ... return points()
        |     approx_circle = neuralnetwork(sample(circle()))
        |     if is_model_of(samples(approx_circle), circle):
        |         print("OF COURSE!")
       | 
       | This is irrelevant: games, rules, shapes, etc. are all abstract.
       | So any model of samples of these is a model of them.
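        | 
        | For concreteness, a minimal runnable sketch of the same
        | point, with sklearn's MLPRegressor standing in for
        | `neuralnetwork` (all names and tolerances are illustrative):
        | 
        |     import numpy as np
        |     from sklearn.neural_network import MLPRegressor
        | 
        |     def circle(radius, n=2000):
        |         # abstract target: points on a circle
        |         t = np.random.uniform(0, 2 * np.pi, n)
        |         return t, radius * np.c_[np.cos(t), np.sin(t)]
        | 
        |     t, pts = circle(1.0)
        |     approx_circle = MLPRegressor((64, 64), max_iter=5000)
        |     approx_circle.fit(t[:, None], pts)
        | 
        |     # samples of the approximation lie (nearly) on the circle
        |     t_new = np.random.uniform(0, 2 * np.pi, 500)
        |     pred = approx_circle.predict(t_new[:, None])
        |     err = np.abs(np.linalg.norm(pred, axis=1) - 1.0).max()
        |     print("max radial error:", err)  # small: "OF COURSE!"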
       | 
        | The "world model" in question is a model _of the world_. Here
        | "data" is not computer science data, ie., numbers; it is
        | _measurements of the world_, ie., the state of a measuring
        | device causally induced by the target of measurement.
       | 
       | Here there is no "world" in the data, you have to make strong
       | causal assumptions about what properties of the target cause the
       | measures. This is not in the data. There is no "world model" _in_
       | measurement data. Hence the entirety of experimental science.
       | 
        | No result based on one mathematical function succeeding in
        | approximating another is relevant to whether measurement data
        | "contains" a theory of the world which generates it: it does not.
       | And _of course_ if your data is abstract, and hence _constitutes_
       | the target of modelling (only applies to pure math), then there
       | is no gap -- a model of  "measures" (ie., the points on a circle)
       | _is_ the target.
       | 
       | No model of actual measurement data, ie., no model in the whole
       | family we call "machine learning", is a model of its generating
       | process. It contains no "world model".
       | 
       | Photographs of the night sky are compatible with all theories of
       | the solar system in human history (including, eg., stars are
       | angels). There is no summary of these photographs which gives
       | information about _the world_ over and above just summarising
       | patterns in the night sky.
       | 
        | The sense in which _any_ model of measurement data is "surface
        | statistics" is the same. Consider Plato's cave: pots, swords,
       | etc. on the outside project shadows inside. Modelling the
       | measurement data is taking cardboard and cutting it out so it
       | matches the shadows. Modelling _the world_ means creating clay
       | pots to match the ones passing by.
       | 
       | The latter is science: you build models of the world and compare
       | them to data, using the data to decide between them.
       | 
        | The former is engineering (pseudoscience): you take models of
        | measures and replay these models to "predict" the next shadow.
       | 
        | If you claim the former is just a "surface shortcut", you're an
        | engineer. If you claim it's a world model, you're a
       | pseudoscientist.
        
         | sebzim4500 wrote:
         | I don't understand your objection at all.
         | 
         | In the example, the 'world' is the grid state. Obviously that's
         | much simpler than the real world but the point is to show that
         | even when the model is not directly trained to input/output
          | this world state, it is still learned as a side effect of
          | predicting the next token.
        
           | mjburgess wrote:
            | There is no world. The grid state is not a world; there is no
            | causal relationship between the grid state and the board. No
           | one in this debate denies that NNs approximate functions.
           | Since a game is just a discrete function, no one denies an NN
           | can approximate it. Showing this is entirely irrelevant and
           | shows a profound misunderstanding of what's at issue.
           | 
           | The whole debate is about whether surface patterns in
           | _measurement_ data can be reversed by NNs to describe their
            | generating process, ie., _the world_. If the "data" isn't
            | actual measurements of the world, no one is arguing about it.
           | 
           | If there is no gap between the generating algorithm and the
           | samples, eg., between a "circle" and "the points on a circle"
           | -- then there is no "world model" to learn. The world _is_
            | the data. To learn "the points on a circle" is to learn the
            | circle.
           | 
           | By taking cases where "the world" and "the data" are _the
            | same object_ (in the limit of all samples), you're just
            | showing that NNs model data. That's already obvious; no one
            | is arguing about it.
           | 
           | That a NN can approximate a discrete function does not mean
           | it can do science.
           | 
           | The whole issue is that the cause of pixel distributions _is
           | not_ in those distributions. A model of pixel patterns is
           | just a model of pixel patterns, not of the objects which
           | _cause_ those patterns. A TV is not made out of pixels.
           | 
           | The "debate" insofar as there is one, is just some
           | researchers being profoundly confused about what measurement
           | data is: measurements are _not_ their targets, and so no
           | model of data is a model of the target. A model of data _is
           | just_ "surface statistics" in the sense that these statistics
           | describe measurements, not what caused them.
        
         | bubblyworld wrote:
         | > There is no summary of these photographs which gives
         | information about the world over and above just summarising
         | patterns in the night sky.
         | 
         | You're stating this as fact but it seems to be the very
         | hypothesis the authors (and related papers) are exploring. To
         | my mind, the OthelloGPT papers are plainly evidence against
         | what you've written - summarising patterns in the sky really
         | does seem to give you information about the world above and
         | beyond the patterns themselves.
         | 
          | (to a scientist this is obvious, no? the precession of Mercury,
         | a pattern observable in these photographs, was famously _not_
         | compatible with known theories until fairly recently)
         | 
         | > Modelling the measurement data is taking cardboard and
         | cutting it out so it matches the shadows. Modelling the world
         | means creating clay pots to match the ones passing by.
         | 
         | I think these are matters of degree. The former is simply a
         | worse model than the latter of the "reality" in this case. Note
         | that our human impressions of what a pot "is" are shadows too,
         | on a higher-dimensional stage, and from a deeper viewpoint any
         | pot we build to "match" reality will likely be just as flawed.
         | Turtles all the way down.
        
           | mjburgess wrote:
            | Well, it doesn't; see my other comment below.
           | 
           | It is exactly this non-sequitur which I'm pointing out.
           | 
            | Approximating an abstract discrete function (a game) with a
           | function approximator has literally nothing to do with
           | whether you can infer the causal properties of the data
           | generating process from measurement data.
           | 
           | To equate the two is just rank pseudoscience. The world is
           | not made of measurements. Summaries of measurement data
           | aren't properties in the world, they're just the state of the
           | measuring device.
           | 
           | If you sample all game states from a game, you _define_ the
           | game. This is the nature of abstract mathematical objects,
           | they are defined by their  "data".
           | 
           | Actual physical objects are not defined by how we measure
           | them: the solar system isnt made of photographs. This is
           | astrology: to attribute to the patterns of light hitting the
           | eye some actual physical property in the universe which
           | corresponds to those patterns. No such exists.
           | 
           | It is impossible, and always has been, to treat patterns in
           | measurements as properties of objects. This is maybe one of
            | the most prominent characteristics of pseudoscience.
        
             | bubblyworld wrote:
             | The point is that approximating a distribution causally
             | downstream of the game (text-based descriptions, in this
             | case) produces a predictive model of the underlying game
             | mechanics itself. That is fascinating!
             | 
             | Yes, the one is formally derivable from the other, but the
             | reduction costs compute, and to a fixed epsilon of accuracy
             | this is the situation with everything we interact with on
             | the day to day.
             | 
             | The idea that you can learn underlying mechanics from
             | observation and refutation is central to formal models of
             | inductive reasoning like Solomonoff induction (and
              | idealised reasoners like AIXI, if you want the AI spin). At
             | best this is well established scientific method, at worst a
             | pretty decent epistemology.
             | 
             | Talking about sampling all of the game states is irrelevant
             | here; that wouldn't be possible even in principle for many
             | games and in this case they certainly didn't train the LLM
             | on every possible Othello position.
             | 
             | > This is astrology: to attribute to the patterns of light
             | hitting the eye some actual physical property in the
             | universe which corresponds to those patterns. No such
             | exists.
             | 
             | Of course not - but they are _highly correlated_ in
             | functional human beings. What do you think our perception
             | of the world grounds out in, if not something like the
              | discrepancies between (our brain's) observed data and its
             | predictions? There's even evidence in neuroscience that
             | this is literally what certain neuronal circuits in the
             | cortex are doing (the hypothesis being that so-called
             | "predictive processing" is more energy efficient than
             | alternative architectures).
             | 
             | Patterns in measurements absolutely reflect properties of
             | the objects being measured, for the simple reason that the
             | measurements are causally linked to the object itself in
             | controlled ways. To think otherwise is frankly insane -
             | this is why we call them measurements, and not noise.
        
         | HuShifang wrote:
         | I think this is a great explanation.
         | 
         | The "Ladder of Causation" proposed by Judea Pearl covers
         | similar ground - "Rung 1" reasoning is the purely predictive
         | work of ML models, "Rung 2" is the interactive optimization of
         | reinforcement learning, and "Rung 3" is the counterfactual and
          | causal reasoning / DGP construction and work of science. LLMs
          | can parrot Rung 3 understanding from ingested texts but they
          | can't generate it.
        
         | pkoird wrote:
         | > Photographs of the night sky are compatible with all theories
         | of the solar system in human history (including, eg., stars are
         | angels). There is no summary of these photographs which gives
         | information about the world over and above just summarising
         | patterns in the night sky.
         | 
         | This is blatantly incorrect. Keep in mind that much of modern
          | physics has been invented via observation. Kepler's laws and
         | ultimately the law of Gravitation and General Relativity came
         | from these "photographs" of the night sky.
         | 
         | If you are talking about the fact that these theories only ever
         | summarize what we see and maybe there's something else behind
         | the scenes that's going on, then this becomes a different
         | discussion.
        
         | naasking wrote:
         | > Here there is no "world" in the data, you have to make strong
         | causal assumptions about what properties of the target cause
         | the measures. This is not in the data. There is no "world
         | model" in measurement data.
         | 
         | That's wrong. Whatever your measuring device, it is
         | fundamentally a projection of some underlying reality, eg. a
         | function m in m(r(x)) mapping real values to real values, where
         | r is the function governing reality.
         | 
         | As you've acknowledged that neural networks can learn
         | functions, the neural network here is learning m(r(x)). Clearly
         | the world is in the model here, and if m is invertible, then we
         | can directly extract r.
         | 
          | Yes, the domain of x and range of m(r(x)) are limited, so the
         | inference will be limited for any given dataset, but it's wrong
         | to say the world is not there at all.
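          | 
          | A toy numeric sketch of that claim, under the stated
          | assumptions (m invertible, limited domain); the functions
          | here are made up and a polynomial fit stands in for the
          | network:
          | 
          |     import numpy as np
          | 
          |     r = np.sin                    # "reality", unknown
          |     m = lambda v: 2 * v + 1       # invertible measurement
          |     m_inv = lambda y: (y - 1) / 2
          | 
          |     x = np.linspace(-2, 2, 200)   # limited domain
          |     data = m(r(x))                # all we observe
          | 
          |     # learn g ~ m(r(.)) from observations alone
          |     g = np.poly1d(np.polyfit(x, data, deg=9))
          | 
          |     # since m is invertible, r is recoverable there
          |     r_hat = m_inv(g(x))
          |     print(np.abs(r_hat - r(x)).max())  # tiny residual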
        
           | mjburgess wrote:
            | Even in the limited sense in which the world is recoverable
            | from measures of it, recovery requires a model of how those
            | measures were generated.
           | 
           | For animals, we are born with primitive causal models of our
           | bodies we can recurse on to build models of the world in this
           | sense. So as toddlers we learn perception by having an
           | internal 3d model of our bodies -- so we can ascribe
           | distances to our optical measures.
           | 
           | Without such assumptions there really isnt any world at all
           | in this data. A grid of pixel patterns has no meaning as a
           | grid of numbers. NNs are just mapping this grid to a "summary
           | space" under supervision of how to place the points. This
           | supervision enables a useful encoding of the data, but does
           | not provide the kind of assumptions needed to work backwards
           | to properties of its generation.
           | 
           | In the case of photos, there is no such `m` -- the state of a
            | sensor is not uniquely caused by any catness or dogness
           | properties. Almost no photographs acquire their state from a
           | function X -> Y, because the sensor state is "radically
           | uncontrolled" in a causal sense. Thus the common premise of
            | ML, that y = f(x), is false from the start -- the relevant
           | causal graph has a near infinite number of causes that are
           | unspecified, so f does not exist.
        
           | foobarqux wrote:
           | This is obviously false: consider a (cryptographic)
           | pseudorandom number generator.
        
             | naasking wrote:
             | Trivial, m is not invertible in that case. By contrast,
             | measuring devices need to be invertible within some domain,
             | otherwise they're not actually _measuring_ , and we
             | wouldn't use them.
        
       | foobarqux wrote:
       | Lots of problems with this paper including the fact that, even if
       | you accept their claim that internal board state is equivalent to
       | world model, they don't appear to do the obvious thing which is
       | display the reconstructed "internal" board state. More
       | fundamentally though, reifying the internal board as a "world
       | model" is absurd: otherwise a (trivial) autoencoder would also be
       | building a "world model".
        
         | sebzim4500 wrote:
         | >More fundamentally though, reifying the internal board as a
         | "world model" is absurd: otherwise a (trivial) autoencoder
         | would also be building a "world model".
         | 
         | The point is that they aren't directly training the model to
         | output the grid state, like you would an autoencoder. It's
         | trained to predict the next action and learning the state of
         | the 'world' happens incidentally.
         | 
         | It's like how LLMs learn to build world models without directly
         | being trained to do so, just in order to predict the next
         | token.
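          | 
          | Roughly, the difference in training signal, as a
          | PyTorch-flavoured sketch with made-up shapes (note that
          | `board` never appears in the second loss):
          | 
          |     import torch
          |     import torch.nn as nn
          |     import torch.nn.functional as F
          | 
          |     B, T, V, S = 32, 20, 61, 64  # batch/moves/vocab/squares
          |     moves = torch.randint(0, V, (B, T))  # move tokens
          |     board = torch.randint(0, 3, (B, S))  # 0/1/2 per square
          | 
          |     # (a) autoencoder: the board state IS the target
          |     enc, dec = nn.Linear(S, 16), nn.Linear(16, S * 3)
          |     logits = dec(enc(board.float())).view(B, S, 3)
          |     loss_ae = F.cross_entropy(logits.reshape(-1, 3),
          |                               board.reshape(-1))
          | 
          |     # (b) Othello-GPT style: only the next move is the
          |     # target; the board is never shown to the model
          |     emb, head = nn.Embedding(V, 32), nn.Linear(32, V)
          |     h = emb(moves[:, :-1]).mean(dim=1)  # toy "transformer"
          |     loss_lm = F.cross_entropy(head(h), moves[:, -1])
          |     # claim under discussion: minimizing loss_lm alone still
          |     # yields activations from which `board` can be decoded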
        
           | optimalsolver wrote:
           | >It's like how LLMs learn to build world models without
           | directly being trained to do so, just in order to predict the
           | next token
           | 
           | That's the whole point under contention, but you're stating
           | it as fact.
        
           | foobarqux wrote:
            | By the same reasoning, if you train a neural net to output the
            | next action from the output of the autoencoder, then the whole
           | system also has a "world model", but if you accept that
           | definition of "world model" then it is extremely weak and not
           | the intelligence-like capability that is being implied.
           | 
           | And as I said in my original comment they are probably not
           | even able to extract the board state very well, otherwise
           | they would depict some kind of direct representation of the
           | state, not all of the other figures of board move causality
           | etc.
           | 
           | Note also that the board state is not directly encoded in the
           | neural network: they train _another_ neural network to find
            | weights to approximate the board state if given the internal
            | activations of the Othello network. It's a bit of fishing for
            | the answer you want.
        
         | IanCal wrote:
          | > they don't appear to do the obvious thing which is display the
         | reconstructed "internal" board state.
         | 
          | I'm very confused by this, because they do. Then they
         | manipulate the internal board state and see what move it makes.
         | That's the entire point of the paper. Figure 4 is _literally
         | displaying the reconstructed board state_.
        
           | foobarqux wrote:
           | I replied to a similar comment elsewhere: They aren't
           | comparing the reconstructed board state with the actual board
           | state which is the obvious thing to do.
        
         | og_kalu wrote:
         | >they don't appear to do the obvious thing which is display the
         | reconstructed "internal" board state
         | 
         | This is literally figure 4
         | 
          | This also reconstructs the board state of a chess-playing LLM
         | 
         | https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
        
           | foobarqux wrote:
           | Unless I'm misunderstanding something they are not comparing
           | the reconstructed board state to the actual state which is
           | the straightforward thing you would show. Instead they are
           | manipulating the internal state to show that it yields a
           | different next-action, which is a bizarre, indirect way to
           | show what could be shown in the obvious direct way.
        
             | og_kalu wrote:
             | Figure 4 is showing both things. Yes, there is manipulation
             | of the state but they also clearly show what the predicted
             | board state is before any manipulations (alongside the
             | actual board state)
        
               | foobarqux wrote:
                | The point is not to show only a single example; it is to
                | show how well the recovered internal state reflects the
                | actual state in general -- to analyze the performance
               | (this is particularly tricky due to the discrete nature
               | of board positions). That's ignoring all the other more
               | serious issues I raised.
               | 
               | I haven't read the paper in some time so it's possible
               | I'm forgetting something but I don't think so.
        
               | og_kalu wrote:
               | >That's ignoring all the other more serious issues I
               | raised.
               | 
               | The only other issue you raised doesn't make any sense. A
               | world model is a representation/model of your environment
               | you use for predictions. Yes, an auto-encoder learns to
               | model that data to some degree. To what degree is not
                | well known. If we found out that it learned things like
                | 'city x in country a is approximately distance b from
                | city y, so let's just learn where y is and unpack
                | everything else when the need arises', then that would
                | certainly qualify as a world model.
        
               | foobarqux wrote:
               | Linear regression also learns to model data to some
               | degree. Using the term "world model" that expansively is
               | intentionally misleading.
               | 
               | Besides that and the big red flag of not directly
                | analyzing the performance of the predicted board state, I
               | also said training a neural network to return a specific
               | result is fishy, but that is a more minor point than the
               | other two.
        
               | og_kalu wrote:
                | The degree matters. If we find autoencoders learning
                | surprisingly deep models then I have no problems saying
               | they have a world model. It's not the gotcha you think it
               | is.
               | 
               | >the big red flag of not directly analyzing the
               | performance of the predicted board state I also said
               | training a neural network to return a specific result is
               | fishy
               | 
               | The idea that probes are some red flag is ridiculous.
               | There are some things to take into account but statistics
               | is not magic. There's nothing fishy about training probes
                | to inspect a model's internals. If the internals don't
               | represent the state of the board then the probe won't be
               | able to learn to reconstruct the state of the board. The
               | probe only has access to internals. You can't squeeze
               | blood out of a rock.
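                | 
                | A minimal sketch of what such a probe is (sklearn
                | logistic regression on made-up activations; the point
                | is that the probe sees only the internals H, never the
                | game itself):
                | 
                |     import numpy as np
                |     from sklearn.linear_model import LogisticRegression
                | 
                |     # made-up data: activations for N positions, plus
                |     # the true contents of one board square
                |     N, D = 5000, 512
                |     H = np.random.randn(N, D)  # model internals
                |     square = np.random.randint(0, 3, N)  # 0/1/2
                | 
                |     # fit on half the positions, test on the rest
                |     probe = LogisticRegression(max_iter=1000)
                |     probe.fit(H[:2500], square[:2500])
                |     acc = probe.score(H[2500:], square[2500:])
                |     # ~chance on random H; high accuracy only if the
                |     # internals actually encode the square
                |     print(acc)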
        
               | foobarqux wrote:
               | I don't know what makes a "surprisingly deep model" but I
               | specifically chose autoencoders to show that simply
               | encoding the state internally can be trivial and
               | therefore makes that definition of "world model" vacuous.
               | If you want to add additional stipulations or some
               | measure of degree you have to make an argument for that.
               | 
               | In this case specifically "the degree" is pretty low
               | since predicting moves is very close to predicting board
               | state (because for one you have to assign zero
               | probability to moves to occupied positions). That's even
               | if you accept that world models are just states, which as
                | mjburgess explained is not reasonable.
               | 
               | Further if you read what I wrote I didn't say internal
               | probes are a big red flag (I explicitly called it the
               | minor problem). I said not directly evaluating how well
               | the putative internal state matches the actual state is.
               | And you can "squeeze blood out of a rock": it's the
               | multiple comparison problem and it happens in science all
               | the time and it is what you are doing by training a
               | neural network and fishing for the answer you want to
               | see. This is a very basic problem in statistics and has
               | nothing to do with "magic". But again all this is the
               | minor problem.
        
               | og_kalu wrote:
               | >In this case specifically "the degree" is pretty low
               | since predicting moves is very close to predicting board
               | state (because for one you have to assign zero
               | probability to moves to occupied positions).
               | 
               | The depth/degree or whatever is not about what is close
               | to the problem space. The blog above spells out the
               | distinction between a 'world model' and 'surface
               | statistics'. The point is that Othello GPT is not in fact
               | playing Othello by 'memorizing a long list of
               | correlations' but by modelling the rules and states of
               | Othello and using that model to make a good prediction of
               | the next move.
               | 
               | >I said not directly evaluating how well the putative
               | internal state matches the actual state is.
               | 
               | This is evaluated in the actual paper with the error
                | rates using the linear and non-linear probes. It's not a
               | red flag that a precursor blog wouldn't have such things.
               | 
               | >And you can "squeeze blood out of a rock": it's the
               | multiple comparison problem and it happens in science all
               | the time and it is what you are doing by training a
               | neural network and fishing for the answer you want to
               | see.
               | 
               | The multiple comparison problem is only a problem when
               | you're trying to run multiple tests on the same sample.
               | Obviously don't test your probe on states you fed it
               | during training and you're good.
        
       | burnt-resistor wrote:
       | I think they learn how to become salespeople, politicians,
       | lawyers, and resume consultants with fanciful language lacking in
       | facts, truth, and honesty.
        
         | 01HNNWZ0MV43FF wrote:
         | If we can put salespeople out of work it will be a great boon
         | to humankind
        
           | natpalmer1776 wrote:
           | I suddenly have a vision of an AI driven sales pipeline that
           | uses millions of invasive datapoints about you to create the
           | most convincing sales pitch mathematically possible.
        
       | javaunsafe2019 wrote:
        | Idk how old this article even is. For me, LLMs currently are
        | broken, and the majority is already aware of this.
        | 
        | Copilot fails to cleanly refactor complex Java methods, to the
        | point that I'm better off writing that stuff on my own, since I
        | have to understand it anyway.
        | 
        | And the news that they don't scale as predicted looks bad
        | compared to how weakly they currently perform...
        
         | lxgr wrote:
         | Why does an LLM have to be better than you to be useful to you?
         | 
         | Personally, I use them for the things they can do, and for the
         | things they can't, I just don't, exactly as I would for any
         | other tool.
         | 
         | People assuming they can do more than they are actually capable
         | of is a problem (compounded by our tendency to attribute
         | intelligence to entities with eloquent language, which might be
         | more of a surface level thing than we used to believe), but
         | that's literally been one for as long as we had proverbial
         | hammers and nails.
        
           | lou1306 wrote:
           | > Why does an LLM have to be better than you to be useful to
           | you?
           | 
           | If
           | 
           | ((time to craft the prompt) + (time required to fix LLM
           | output)) ~ (time to achieve the task on my own)
           | 
           | it's not hard to see that working on my own is a very
           | attractive proposition. It drives down complexity, does not
           | require me to acquire new skills (i.e., prompt engineering),
           | does not require me to provide data to a third party nor to
           | set up an expensive rig to run a model locally, etc.
        
             | lxgr wrote:
             | Then they might indeed not be the right tool for what
             | you're trying to do.
             | 
             | I'm just a little bit tired of sweeping generalizations
             | like "LLMs are completely broken". You can easily use them
              | as part of a process that then ends up being broken
             | (because it's the wrong tool!), yet that doesn't disqualify
             | them for all tool use.
        
         | vonneumannstan wrote:
         | If you can't find a use for the best LLMs it is 100% a skill
          | issue. If the only way you can think to use them is
          | refactoring complex Java codebases, you're ngmi.
        
           | lxgr wrote:
           | So far I haven't found one that does my dishes and laundry. I
           | really wish I knew how to properly use them.
           | 
           | My point being: Why would anyone _have to_ find a use for a
            | new tool? Why wouldn't "it doesn't help me with what I'm
           | trying to do" be an acceptable answer in many cases?
        
             | Workaccount2 wrote:
             | I have found more often than not that people in the "LLMs
             | are useless" camp are actually in the "I need LLMs to be
             | useless" camp.
        
               | exe34 wrote:
               | nice example of poisoning the well!
        
               | mdp2021 wrote:
               | Do not forget the very linear reality of those people
               | that shout "The car does not work!" in frustration
               | because they would gladly use a car.
        
       | dboreham wrote:
       | It turns out our word for "surface statistics" is "world model".
        
         | marcosdumay wrote:
         | Well, for some sufficiently platonic definition of "world".
        
           | mdp2021 wrote:
           | In a way the opposite, I'd say: the archetypes in Plato are
           | the most stable reality and are akin to the logos that the
           | past and future tradition hunted - knowing it is to know how
           | things are (how things work), hence knowledge of the state of
           | things, hence a faithful world model.
           | 
           | To utter conformist statements spawned from surface
           | statistics would be "doxa" - repeating "opinions".
        
             | marcosdumay wrote:
             | It has a profound and extensive knowledge about something.
             | But that "something" is how words follow each other on
             | popular media.
             | 
             | LLMs are very firmly stuck inside the Cave Allegory.
        
               | mdp2021 wrote:
               | If you mean that just like the experiencer in the cave,
               | seeing shadows instead of things (really, things instead
               | of Ideas), the machine sees words instead of things, that
               | would be in a way very right.
               | 
                | But we could argue it is not impossible to create
               | an ontology (a very descriptive ontology - "this is said
               | to be that, and that, and that...") from language alone.
               | Hence the question whether the ontology is there.
               | (Actually, the question at this stage remains: "How do
               | they work - in sufficient detail? Why the appearance of
               | some understanding?")
        
               | marcosdumay wrote:
               | Yeah, what I'm saying is that something very similar to
               | an ontology is there. (It's incomplete but extensive, not
               | coherent, and it's deeper in details than anything
               | anybody ever created.)
               | 
               | It's just that it's a kind of a useless ontology, because
               | the reality it's describing is language. Well, only "kind
               | of useless" because it should be very useful to parse,
               | synthesize and transform language. But it doesn't have
               | the kind of "knowledge" that most people expect an
               | intelligence to have.
               | 
               | Also, its world isn't only composed of words. All of them
               | got a very strong "Am I fooling somebody?" signal during
               | training.
        
         | mdp2021 wrote:
         | World model based interfaces have an internal representation
         | and when asked, describe its details.
         | 
         | Surface statistics based interfaces have an internal database
         | of what is expected, and when asked, they give a conformist
         | output.
        
           | naasking wrote:
           | The point is that "internal database of statistical
           | correlations" is a world model of sorts. We all have an
           | internal representation of the world featuring only
           | probabilistic accuracy after all. I don't think the
           | distinction is as clear as you want it to be.
        
             | mdp2021 wrote:
             | > _" internal database of statistical correlations" [would
             | be] a world model of sorts_
             | 
             | Not in the sense used in the article: <<memorizing "surface
             | statistics", i.e., a long list of correlations that do not
             | reflect a causal model of the process generating the
             | sequence>>.
             | 
             | A very basic example: when asked "two plus two", would the
             | interface reply "four" because it memorized a correlation
             | of the two ideas, or because it counted at some point (many
             | points in its development) and in that way assessed
             | reality? That is a dramatic difference.
        
           | exe34 wrote:
           | > and when asked, describe its details.
           | 
           | so humans don't typically have world models then. you ask
           | most people how they arrived at their conclusions (outside of
           | very technical fields) and they will confabulate just like an
           | LLM.
           | 
           | the best example is phenomenology, where people will grant
           | themselves skills that they don't have, to reach conclusions.
           | see also heterophenomenology, aimed at working around that:
           | https://en.wikipedia.org/wiki/Heterophenomenology
        
             | mdp2021 wrote:
             | That the descriptive is not the prescriptive should not be
             | a surprise.
             | 
             | That random people will largely have suboptimal skills
             | should not be a surprise.
             | 
             | Yes, many people can't think properly. Proper thinking
             | remains there as a potential.
        
               | exe34 wrote:
               | > Yes, many people can't think properly. Proper thinking
               | remains there as a potential.
               | 
               | that's a matter of faith, not evidence. by that
               | reasoning, the same can be said about LLMs. after all,
               | they do occasionally get it right.
        
               | mdp2021 wrote:
               | Let me rephrase it, there could be a misunderstanding:
               | "Surely many people cannot think properly but some have
               | much more ability than others: the proficient ability to
               | think well is a potential (expressed in some and not
                | expressed in many)".
               | 
               | To transpose that to LLMs, you should present one that
               | _systematically_ gets it right, not occasionally.
               | 
               | And anyway, the point was about two different processes
               | before statement formulation: some output the strongest
               | correlated idea ("2+2" - "4"); some look at the internal
               | model and check its contents ("2, 2" - "1 and 1, 1 and 1:
               | 4").
        
               | exe34 wrote:
               | > one that systematically gets it right, not
               | occasionally.
               | 
               | could Einstein systematically get new symphonies right?
               | could Feynman create tasty new dishes every single time?
               | Could ......
        
               | mdp2021 wrote:
               | > _could Einstein systematically_
               | 
               | Did (could) Einstein think about things long and hard?
               | Yes - that is how he explained having solved problems
                | ("How did you do it?" // "I thought about it long and
               | hard").
               | 
               | The artificial system in question should (1) be able to
               | do it, and (2) do it systematically, because it is
               | artificial.
        
       | jebarker wrote:
       | > Do they merely memorize training data and reread it out loud,
       | or are they picking up the rules of English grammar and the
       | syntax of C language?
       | 
       | This is a false dichotomy. Functionally the reality is in the
       | middle. They "memorize" training data in the sense that the loss
       | curve is fit to these points but at test time they are asked to
       | interpolate (and extrapolate) to new points. How well they
       | generalize depends on how well an interpolation between training
       | points works. If it reliably works then you could say that
       | interpolation is a good approximation of some grammar rule, say.
       | It's all about the data.
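        | 
        | A toy illustration of that interpolation/extrapolation
        | distinction (a polynomial fit standing in for the model;
        | everything here is made up):
        | 
        |     import numpy as np
        | 
        |     f = lambda x: np.sin(2 * np.pi * x)  # "true" rule
        |     rng = np.random.default_rng(0)
        |     x_train = rng.uniform(0, 1, 200)
        | 
        |     # fit the training points (the "loss curve" part)
        |     model = np.poly1d(np.polyfit(x_train, f(x_train), 9))
        | 
        |     x_in = rng.uniform(0, 1, 100)       # interpolation
        |     x_out = rng.uniform(1.5, 2.0, 100)  # extrapolation
        |     print(np.abs(model(x_in) - f(x_in)).max())    # small
        |     print(np.abs(model(x_out) - f(x_out)).max())  # huge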
        
         | mjburgess wrote:
         | This only applies to intra-distribution "generalisation", which
         | is not the meaning of the term we've come to associate with
         | science. Here generalisation means across all environments
          | (ie., something generalises if it's _valid_ and _reliable_,
          | where valid = it measures the property, and reliable = it does
          | so under causal permutation of the environment).
         | 
         | Since an LLM does not change in response to the change in
         | meaning of terms (eg., consider the change to "the war in
          | Ukraine" over the last 10 years) -- it isn't _reliable_ in the
          | scientific sense. Explaining why it isn't valid would take much
          | longer, but it's not valid either.
         | 
         | In any case: the notion of 'generalisation' used in ML just
         | means _we assume there is_ a single stationary distribution of
         | words, and we want to randomly sample from that distribution
         | without bias to oversampling from points identical to the data.
         | 
          | Not only is this assumption false (there is no stationary
          | distribution), it is also irrelevant to generalisation in the
          | traditional sense, since whether we are biased towards the data
         | or not isn't what we're interested in. We want output to be
         | valid (the system to use words to mean what they mean) and to
         | be reliable (to do so across all environments in which they
         | mean something).
         | 
         | This does not follow from, nor is it even related to, this ML
         | sense of generalisation. Indeed, if LLMs generalised in this
         | sense, they would be very bad at usefully generalising -- since
         | the assumptions here are false.
        
           | jebarker wrote:
           | I don't really follow what you're saying here. I understand
            | that the use of language in the real world is not
           | sampled from a stationary distribution, but it also seems
           | plausible that you could relax that assumption in an LLM,
           | e.g. conditioning the distribution on time, and then intra-
           | distribution generalization would still make sense to study
           | how well the LLM works for held-out test samples.
           | 
           | Intra-distribution generalization seems like the only
           | rigorously defined kind of generalization we have. Can you
           | provide any references that describe this other kind of
           | generalization? I'd love to learn more.
        
             | ericjang wrote:
             | intra-distribution generalization is also not well posed in
             | practical real world settings. suppose you learn a mapping
             | f : x -> y. casually, intra-distribution generalization
             | implies that f generalizes for "points from the same data
             | distribution p(x)". Two issues here:
             | 
             | 1. In practical scenarios, how do you know if x' is really
             | drawn from p(x)? Even if you could compute log p(x') under
             | the true data distribution, you can only verify that the
             | support for x' is non-zero. one sample is not enough to
              | tell you if x' is drawn from p(x).
             | 
             | 2. In high dimensional settings, x' that is not exactly
             | equal to an example within the training set can have
             | arbitrarily high generalization error. here's a criminally
             | under-cited paper discussing this:
             | https://arxiv.org/abs/1801.02774
        
       | maximus93 wrote:
       | Honestly, I think it's somewhere in between. LLMs are great at
       | spotting patterns in data and using that to make predictions, so
       | you could say they build a sort of "world model" for the data
       | they see. But it's not the same as truly understanding or
        | reasoning about the world; it's more like they're really good at
        | connecting the dots we give them.
        | 
        | They don't do science or causality; they're just working with
        | the shadows on the wall, not the actual objects casting them. So
       | yeah, they're impressive, but let's not overhype what they're
       | doing. It's pattern matching at scale, not magic. Correct me if I
       | am wrong.
        
       | not2b wrote:
       | They are learning a grammar, finding structure in the text. In
       | the case of Othello, the rules for what moves are valid are quite
       | simple, and can be represented in a very small model. The slogan
       | is "a minute to learn, a lifetime to master". So "what is a legal
       | move" is a much simpler problem than "what is a winning
       | strategy".
       | 
       | It's similar to asking a model to only produce outputs
       | corresponding to a regular expression, given a very large number
       | of inputs that match that regular expression. The RE is the most
       | compact representation that matches them all and it can figure
       | this out.
       | 
       | But we aren't building a "world model", we're building a model of
       | the training data. In artificial problems with simple rules, the
       | model might be essentially perfect, never producing an invalid
       | Othello move, because the problem is so limited.
       | 
       | I'd be cautious about generalizing from this work to a more open-
       | ended situation.
        
         | og_kalu wrote:
         | I don't think the point is that Othello-GPT has somehow
          | modelled the real world by training only on games, but that
         | tasking it to predict the next move forces it to model its data
         | in a deep way. There's nothing special about Othello games vs
         | internet text except that the latter will force it to model
          | many more things.
        
       ___________________________________________________________________
       (page generated 2024-11-22 23:01 UTC)