[HN Gopher] Evaluating the world model implicit in a generative ...
       ___________________________________________________________________
        
       Evaluating the world model implicit in a generative model
        
       Author : dsubburam
       Score  : 138 points
       Date   : 2024-11-07 05:51 UTC (17 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | fragmede wrote:
       | Wrong as it is, I'm impressed they were able to get any maps out
       | of their LLM that look vaguely cohesive. The shortest path map
       | has bits of streets downtown and around Central Park that aren't
       | totally red, and Central Park itself is clear on all 3 maps.
       | 
       | They used eight A100s, but don't say how long it took to train
       | their LLM. It would be interesting to know the wall clock time
        | they spent. Their dataset is, relatively speaking, tiny, which
       | means it should take fewer resources to replicate from scratch.
       | 
        | What's interesting, though, is that the smaller model performed
        | better; they don't speculate as to why.
        
         | zxexz wrote:
          | I can't imagine training took more than a day with 8 A100s,
          | even with that vocab size [0] (does lightning do implicit
          | vocab extension, maybe?) and a batch size of 1 [1] or 64 [2]
          | or 4096 [3]. (I have not trawled through the repo and other
          | work enough to see what they are actually using in the paper,
          | and let's be real - we've all copied random min/nano/whatever
          | GPT forks and not bothered renaming stuff.) They mentioned
          | their dataset is 120 million tokens, which is minuscule by
          | transformer standards. Even with a more graph-based model
          | making it 10x+ longer to train, 1.2 billion tokens per epoch
          | equivalent shouldn't take more than a couple of hours with no
          | optimization.
         | 
         | [0] https://github.com/keyonvafa/world-model-
         | evaluation/blob/949... [1] https://github.com/keyonvafa/world-
         | model-evaluation/blob/949... [2]
         | https://github.com/keyonvafa/world-model-evaluation/blob/949...
         | [3] https://github.com/keyonvafa/world-model-
         | evaluation/blob/mai...
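          | 
          | As a rough back-of-envelope (the throughput figure below is
          | an assumed ballpark for a nanoGPT-scale model, not anything
          | measured from their repo):
          | 
          |     # rough wall-clock estimate; everything except the
          |     # 120M-token dataset size is an assumption
          |     tokens_per_epoch = 120e6     # dataset size they report
          |     epochs = 10                  # assumed
          |     gpus = 8
          |     tok_per_sec_per_gpu = 1e5    # assumed ballpark throughput
          |     total_tokens = tokens_per_epoch * epochs
          |     seconds = total_tokens / (gpus * tok_per_sec_per_gpu)
          |     print(f"{seconds / 3600:.1f} hours")   # ~0.4 under these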
        
         | IshKebab wrote:
          | It's a bit unclear to me what the map visualisations are
          | showing, but I don't think your interpretation is correct.
          | They even
         | say:
         | 
         | > Our evaluation methods reveal they are very far from
         | recovering the true street map of New York City. As a
         | visualization, we use graph reconstruction techniques to
         | recover each model's implicit street map of New York City. The
         | resulting map bears little resemblance to the actual streets of
         | Manhattan, containing streets with impossible physical
         | orientations and flyovers above other streets.
        
           | fragmede wrote:
           | My read of
           | 
           | > Edges exit nodes in their specified cardinal direction. In
           | the zoomed-in images, edges belonging to the true graph are
           | black and false edges added by the reconstruction algorithm
           | are red.
           | 
            | is that the model output edges; valid ones were then
            | colored black and bad ones red. But it's a bit unclear, so
            | you could be right.
        
       | zxexz wrote:
        | I've seen some very impressive results just embedding a pre-
        | trained KGE model into a transformer model and letting it
        | "learn" to query it. (I've just used heterogeneous loss
        | functions during training, with "classifier dimensions" that
        | determine whether to greedily sample from the KGE sidecar; I'm
        | sure there are much better ways of doing this.) This is just a
        | subjective viewpoint, obviously, but I've played around quite a
        | lot with this idea, and it's very easy to get an "interactive"
        | small LLM with stable results doing such a thing. The only
        | problem I've found is _updating_ the knowledge cheaply without
        | partially retraining the LLM itself. For small, domain-specific
        | models this isn't really an issue though - for personal
        | projects I just use a couple of 3090s.
       | 
       | I think this stuff will become a _lot_ more fascinating after
       | transformers have bottomed out on their hype curve and become a
       | _tool_ when building specific types of models.
        
         | aix1 wrote:
         | > embedding a pre-trained KGE model into a transformer model
         | 
         | Do you have any good pointers (literature, code etc) on the
         | mechanics of this?
        
           | zxexz wrote:
            | Check out PyKEEN [0] and go wild. I like to train a bunch
            | of random models and "overfit" them to the extreme (in my
            | mind, overfitting them is the _point_ for this task; you
            | want dense, compressed knowledge). Resize the input and
            | output embeddings of an existing pretrained (but small) LLM
            | (resizing the input is only necessary if you're adding
            | extra metadata on input, but make sure you untie
            | input/output weights). You can add a linear layer extension
            | to the transformer blocks, pass it up as some sort of
            | residual, etc. - honestly, just find a way to shove it in,
            | detach the KGE from the computation graph, and add
            | something learnable between it and wherever you're
            | connecting it - like just a couple of linear layers and a
            | ReLU. The output side is more important: you can have some
            | indicator logit(s) to determine whether to "read" from the
            | detached graph or sample the outputs of the LLM. Or just
            | always do both and interpret it.
           | 
           | (like tinyllama or smaller, or just use whatever karpathy
           | repo is most fun at the moment and train some gpt2
           | equivalent)
           | 
           | [0] https://pykeen.readthedocs.io/en/stable/index.html
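            | 
            | To make the "shove it in" part concrete, here's a minimal
            | PyTorch sketch of the kind of wiring I mean - the module is
            | made up, the KGE table is a stand-in for real PyKEEN entity
            | vectors, and the gate is the indicator-logit idea above:
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     class KGESidecar(nn.Module):
            |         """Frozen KGE entity embeddings plus a small
            |         learnable adapter injecting them into the LLM's
            |         hidden states."""
            |         def __init__(self, kge_table, d_model):
            |             super().__init__()
            |             # freeze the KGE table; only adapter/gate learn
            |             self.kge = nn.Embedding.from_pretrained(
            |                 kge_table, freeze=True)
            |             self.adapter = nn.Sequential(
            |                 nn.Linear(kge_table.shape[1], d_model),
            |                 nn.ReLU(),
            |                 nn.Linear(d_model, d_model),
            |             )
            |             self.gate = nn.Linear(d_model, 1)  # "read graph?"
            | 
            |         def forward(self, entity_ids, hidden):
            |             # hidden: (batch, seq, d_model) from a block
            |             kge_vec = self.adapter(self.kge(entity_ids))
            |             gate = torch.sigmoid(self.gate(hidden))
            |             return hidden + gate * kge_vec  # residual-style
            | 
            |     # stand-in for pre-trained PyKEEN vectors: 50k entities
            |     sidecar = KGESidecar(torch.randn(50_000, 128), d_model=768)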
        
             | zxexz wrote:
             | Sorry if that was ridiculously vague. I don't know a ton
             | about the state of the art, and I'm really not sure there
             | _is_ one - the papers just seem to get more terminology-
             | dense and the research mostly just seems to end up
              | developing new terminology. My grug-brained philosophy is
              | just to make models small enough that you can shove
              | things in and iterate quickly in Colab or a locally
              | hosted notebook with access to a couple of 3090s, or even
              | just modern Ryzen/EPYC cores. I like to "evaluate" the
              | raw model using pyro-ppl to do MCMC or SVI on the raw
              | logits on a known holdout dataset.
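              | 
              | Roughly this kind of thing, as a sketch - the
              | temperature-calibration model here is just a toy I made
              | up to show the mechanics, not anything from a paper:
              | 
              |     import torch
              |     import pyro
              |     import pyro.distributions as dist
              |     from pyro.infer import SVI, Trace_ELBO
              |     from pyro.infer.autoguide import AutoNormal
              | 
              |     # logits: raw model outputs on a holdout set;
              |     # targets: the true next tokens
              |     def model(logits, targets):
              |         temp = pyro.sample("temp", dist.LogNormal(0.0, 1.0))
              |         with pyro.plate("data", logits.shape[0]):
              |             pyro.sample("obs",
              |                         dist.Categorical(logits=logits / temp),
              |                         obs=targets)
              | 
              |     guide = AutoNormal(model)
              |     svi = SVI(model, guide, pyro.optim.Adam({"lr": 1e-2}),
              |               loss=Trace_ELBO())
              | 
              |     logits = torch.randn(256, 100)   # stand-in holdout logits
              |     targets = torch.randint(0, 100, (256,))
              |     for _ in range(500):
              |         svi.step(logits, targets)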
             | 
             | Really always happy to chat about this stuff, with anybody.
             | Would love to explore ideas here, it's a fun hobby, and
             | we're living in a golden age of open-source structured
             | datasets. I haven't actually found a community interested
              | specifically in static knowledge injection. Email in
              | profile (ebg_13 encoded).
        
               | Jerrrrrrry wrote:
               | Thank you for your comments (good further reading terms),
               | and your open invitation for continued inquiry.
               | 
               | The "fomo" / deja vu / impending doom / incipient shift
               | in the Overton window regarding meta-architecture for
               | AI/ML capabilities and risks is so now glaring obvious of
               | an elephant in the room it is nearly catatonic to some.
               | 
               | https://www.youtube.com/watch?v=2ziuPUeewK0
        
           | napsternxg wrote:
           | We also did something similar in our NTULM paper at Twitter
           | https://youtu.be/BjAmQjs0sZk?si=PBQyEGBx1MSkeUpX
           | 
            | We used it in non-generative language models like BERT, but
            | it should help with generative models as well.
        
             | zxexz wrote:
             | Thanks for sharing! I'll give it a read tomorrow - I do not
             | appear to have read this. I really do wish there were good
             | places for randos like me to discuss this stuff casually.
              | I'm in so many Slack, Discord, etc. channels, but none of
              | them has the same intensity and hyperfocus as certain IRC
             | channels of yore.
        
       | isaacfrond wrote:
        | I think there is a philosophical angle to this. I mean, _my_
        | world map was constructed by chance interactions with the real
        | world. Does this mean that my world map is as close to the
        | real world as their NN's map is to Manhattan? Is my world map
        | full of non-existent streets, exits that are in the wrong
        | place, etc.? The NN map of Manhattan works almost 100%
        | correctly when used for normal navigation but breaks down
        | badly when it has to plan a detour. How brittle is my world
        | map?
        
         | cen4 wrote:
         | Also things are not static in the real world.
        
       | narush wrote:
       | I've replicated the OthelloGPT results mentioned in this paper
        | personally - and it definitely felt like the next-move-only
        | accuracy metric was not everything. Indeed, the authors of the
        | original paper knew this, and so further validated the world
        | model by intervening in the model's forward pass to directly
        | manipulate that world model (and checking the resulting change
        | in valid move predictions).
       | 
       | I'd also recommend checking out Neel Nanda's work on OthelloGPT,
       | where he demonstrated the world model was actually linear:
       | https://arxiv.org/abs/2309.00941
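        | 
        | For anyone wanting to poke at the "linear world model" claim,
        | the core experiment is just a per-square linear probe on cached
        | residual-stream activations. A sketch, with placeholder arrays
        | standing in for the real OthelloGPT activations and board
        | labels:
        | 
        |     import numpy as np
        |     from sklearn.linear_model import LogisticRegression
        | 
        |     # placeholders; in the real setup acts are residual-stream
        |     # activations and labels are per-square board states
        |     # (mine / yours / empty)
        |     n_pos, d_model, n_squares = 2000, 512, 64
        |     acts = np.random.randn(n_pos, d_model)
        |     labels = np.random.randint(0, 3, size=(n_pos, n_squares))
        | 
        |     # one linear probe per square; high held-out accuracy on
        |     # real activations is the evidence for a linear encoding
        |     probes = [LogisticRegression(max_iter=1000)
        |               .fit(acts, labels[:, sq]) for sq in range(n_squares)]
        |     print(probes[0].score(acts, labels[:, 0]))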
        
       | plra wrote:
       | Really cool results. I'd love to see some human baselines for,
       | say, NYC cabbies or regular Manhattanites, though. I'm sure my
       | world model is "incoherent" vis-a-vis these metrics as well, but
       | I'm not sure what degree of coherence I should be excited about.
        
         | shanusmagnus wrote:
         | Makes me think of an interesting related question: how aware
         | are we, normally, of our incoherence? What's the phenomenology
         | of that? Hmm.
        
       | HarHarVeryFunny wrote:
       | An LLM necessarily has to create _some_ sort of internal  "model"
       | / representations pursuant to its "predict next word" training
       | goal, given the depth and sophistication of context recognition
        | needed to do well. This isn't an N-gram model restricted to just
       | looking at surface word sequences.
       | 
       | However, the question should be what _sort_ of internal  "model"
       | has it built? It seems fashionable to refer to this as a "world
       | model", but IMO this isn't really appropriate, and certainly it's
       | going to be quite different to the predictive representations
       | that any animal that _interacts_ with the world, and learns from
       | those interactions, will have built.
       | 
       | The thing is that an LLM is an auto-regressive model - it is
       | trying to predict continuations of training set samples solely
       | based on word sequences, and is not privy to the world that is
       | actually being described by those word sequences. It can't model
       | the generative process of the humans who created those training
       | set samples because _that_ generative process has different
       | inputs - sensory ones (in addition to auto-regressive ones).
       | 
       | The "world model" of a human, or any other animal, is built
       | pursuant to predicting the environment, but not in a purely
       | passive way (such as a multi-modal LLM predicting next frame in a
       | video). The animal is primarily concerned with predicting the
        | outcomes of its _interactions_ with the environment, driven by
        | the evolutionary pressure to learn to act in a way that
        | maximizes survival and proliferation of its DNA. This is the
        | nature of a real "world model" - it's modelling the world (as
        | perceived through sensory inputs) as a dynamical process
        | reacting to the actions of
       | the animal. This is very different to the passive "context
       | patterns" learnt by an LLM that are merely predicting auto-
       | regressive continuations (whether just words, or multi-modal
       | video frames/etc).
        
         | mistercow wrote:
         | > It can't model the generative process of the humans who
         | created those training set samples because that generative
         | process has different inputs - sensory ones (in addition to
         | auto-regressive ones).
         | 
         | I think that's too strong a statement. I would say that it's
         | very constrained in its ability to model that, but not having
         | access to the same inputs doesn't mean you can't model a
         | process.
         | 
         | For example, we model hurricanes based on measurements taken
         | from satellites. Those aren't the actual inputs to the
         | hurricane itself, but abstracted correlates of those inputs. An
         | LLM _does_ have access to correlates of the inputs to human
         | writing, i.e. textual descriptions of sensory inputs.
        
           | shanusmagnus wrote:
           | Brilliant analogy.
           | 
           | And we can imagine that, in a sci-fi world where some super-
           | being could act on a scale that would allow it to perturb the
           | world in a fashion amenable to causing hurricanes, the
           | hurricane model could be substantially augmented, for the
            | same reason that motor babbling in an infant leads to fluid
            | motion in a child.
           | 
           | What has been a revelation to me is how, even peering through
           | this dark glass, titanic amounts of data allow quite useful
           | world models to emerge, even if they're super limited -- a
           | type of "bitter lesson" that suggests we're only at the
           | beginning of what's possible.
           | 
           | I expect robotics + LLM to drive the next big breakthroughs,
           | perhaps w/ virtual worlds [1] as an intermediate step.
           | 
           | [1] https://minedojo.org/
        
           | HarHarVeryFunny wrote:
            | You can model _a_ generative process, but it's necessarily
            | an auto-regressive generative process, not the same as the
            | originating generative process, which is based on the
            | external world.
            | 
            | Human language, and other actions, exist on a range from
            | almost auto-regressive (generating a stock/practiced phrase
            | such as "have a nice day") to highly interactive. An
            | auto-regressive model is obviously going to have more
            | success modelling an auto-regressive generative process.
           | 
            | Weather prediction is a really good example of the
            | limitations of auto-regressive models, as well as of models
            | that don't
           | accurately reflect the inputs to the process you are
           | attempting to predict. "There's a low pressure front coming
           | in, so the weather will be X, same as last time", works some
           | of the time. A crude physical weather model based on limited
           | data points, such as weather balloon inputs, or satellite
           | observation of hurricanes, also works some of the time. But
           | of course these models are sometimes hopelessly wrong too.
           | 
           | My real point wasn't about the lack of sensory data, even
           | though this does force a purely auto-regressive (i.e. wrong)
           | model, but rather about the difference between a passive
           | model (such as weather prediction), and an interactive one.
        
           | slashdave wrote:
           | Indeed. If you provided a talented individual with a
           | sufficient quantity and variety of video streams of travels
           | in a city (like New York), that person would be able to draw
           | you a map.
        
         | lxgr wrote:
         | But isn't the distinction between a "passive" and an "active"
         | model ultimately a metaphysical (freedom of will vs.
         | determinism) question, under the (possibly practically
         | infeasible) assumption that the passive model gets to witness
         | _all_ possible actions an agent might take?
         | 
         | Practically, I could definitely imagine interesting outcomes
         | from e.g. hooking up a model to a high-fidelity physics
         | simulator during training.
        
         | madaxe_again wrote:
          | You say this, yet the example of people such as Helen Keller
          | suggests that a full sensorium is not necessary to be a full
          | human. She had
         | some grasp of the idea of colour, of sound, and could use the
         | words around them appropriately - yet had no firsthand
         | experience of either. Is it really so different?
         | 
         | I think "we" each comprise a number of models, language being
         | just one of them - however an extremely powerful one, as it
         | allows the transmission of thought across time and space. It's
         | therefore understandable that much of what we recognise as
         | conscious thought, of a model of the world, emerges from such
         | an information dense system. It's literally developed to
         | describe the world, efficiently and completely, and so that
         | symbol map an LLM carries possibly isn't that different to our
         | own.
        
           | HarHarVeryFunny wrote:
           | It's not about the necessity of specific sensory inputs, but
           | rather about the difference in type of model that will be
           | built when the goal is passive, and auto-regressive, as
           | opposed to when the goal is interactive.
           | 
           | In the passive/auto-regressive case you just need to model
           | predictive contexts.
           | 
           | In the interactive case you need to model dynamical
           | behaviors.
        
         | stonemetal12 wrote:
         | People around here like to say "The map isn't the territory".
          | If we are talking about the physical world, then language is
          | a map, not the territory, and not a detailed one either; an
          | LLM trained on it is a second-order map.
          | 
          | If we consider the territory to be human intelligence, then
          | language is still a map, but it is a much more detailed one.
          | Thus an LLM trained on it becomes a more interesting second-
          | order map.
        
         | seydor wrote:
         | Animals could well use an autoregressive model to predict the
         | outcomes of their actions on their perceptions. It's not like
          | we run math in our everyday actions (it would take too long).
          | 
          | Perhaps that's why we can easily communicate those
          | predictions as words.
        
         | dsubburam wrote:
         | > The "world model" of a human, or any other animal, is built
         | pursuant to predicting the environment
         | 
          | What do you make of Immanuel Kant's claim that all thinking
          | presupposes the "Categories"--fundamental concepts like
          | quantity, quality and causality[1]? Do LLMs need to develop a
          | deep understanding of these?
         | 
         | [1] https://plato.stanford.edu/entries/categories/#KanCon
        
         | machiaweliczny wrote:
          | But if you squint, sensory inputs and reactions are also
          | sequential tokens. Even reactions can be encoded alongside
          | the input, as action tokens in a single token stream. Has
          | anyone tried something like this?
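          | 
          | Something like this, as a trivial sketch (the trace and the
          | vocabulary here are entirely made up):
          | 
          |     # toy serialization of an interaction trace into a
          |     # single autoregressive token stream
          |     trace = [("obs", "door_closed"), ("act", "open_door"),
          |              ("obs", "door_open"), ("act", "walk_forward")]
          | 
          |     vocab = {}
          |     def tok(s):
          |         return vocab.setdefault(s, len(vocab))
          | 
          |     stream = []
          |     for kind, value in trace:
          |         stream.append(tok(f"<{kind}>"))  # marker: <obs>/<act>
          |         stream.append(tok(value))
          | 
          |     print(stream)  # [0, 1, 2, 3, 0, 4, 2, 5]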
        
       | Jerrrrrrry wrote:
       | Once your model and map get larger than the thing it is
       | modeling/mapping, then what?
       | 
       | Let us hope the Pigeonhole principle isn't flawed, else we can
       | find ourselves batteries in the Matrix.
        
         | anon291 wrote:
          | In the paper 'Hopfield Networks is All You Need', they
          | calculate the total number of patterns able to be 'stored' in
          | the attention layers, and it's exponential in the dimension
          | of the stored patterns. So essentially, you can store more
          | 'ideas' in an LLM than there are particles in the universe. I
          | think we'll be good.
         | 
         | From a technical perspective, this is due to the softmax
         | activation function that causes high degrees of separation
         | between memory points.
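          | 
          | A tiny numpy illustration of that retrieval mechanism (not
          | from the paper): softmax over stored patterns acting as an
          | associative memory that snaps a noisy query back to the
          | nearest stored pattern.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     X = rng.standard_normal((5, 64))   # 5 stored patterns
          |     q = X[2] + 0.3 * rng.standard_normal(64)  # noisy pattern 2
          | 
          |     beta = 4.0                 # sharper softmax, cleaner recall
          |     scores = beta * X @ q
          |     weights = np.exp(scores - scores.max())
          |     weights /= weights.sum()   # softmax over stored patterns
          |     retrieved = weights @ X    # one modern-Hopfield update
          | 
          |     print(int(np.argmax(weights)))  # 2: recovers pattern 2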
        
           | Jerrrrrrry wrote:
           | > So essentially, you can store more 'ideas' in an LLM than
           | there are particles in the universe. I think we'll be good.
           | 
            | If it can compress humanity's knowledge corpus to <80 GB
            | unquanti-optimized, then I take my ironically typo'd double
            | negative, together with your seemingly genuine
            | confirmation, to be absolute confirmation:
            | 
            | we are fukt
        
       | UniverseHacker wrote:
       | Really glad to see some academic research on this- it was quite
       | obvious from interacting with LLMs that they form a world model
        | and can, e.g., correctly simulate simple physics experiments
        | that are not in the training set. I found it very frustrating
        | to see
       | people repeating the idea that "it can never do x" because it
       | lacks a world model. Predicting text that represents events in
       | the world requires modeling that world. Just because you can find
       | examples where the predictions of a certain model are bad does
       | not imply no model at all. At the limit of prediction becoming as
       | good as theoretically possible given the input data and model
       | size restrictions, the model also becomes as accurate and
        | complete as possible. This process is formally described by
        | Solomonoff induction.
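        | 
        | For reference, the Solomonoff prior weights every program p
        | whose output (on a universal prefix machine U) begins with the
        | observed string x by 2^-|p|, roughly:
        | 
        |     M(x) = \sum_{p : U(p) = x*} 2^{-|p|}
        | 
        | so shorter programs - more compressed models of the data - get
        | exponentially more prior weight.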
        
         | slashdave wrote:
         | > At the limit of prediction becoming as good as theoretically
         | possible given the input data and model size restrictions
         | 
         | You are treading on delicate ground here. Why do you believe
         | that sequence models are capable of reaching theoretical
         | maximums?
        
       | slashdave wrote:
       | Most of you probably know someone with a poor sense of direction
        | (or it may even be yourself). In my experience, such people
        | navigate
       | primarily (or solely) by landmarks. This makes me wonder if the
       | damaged maps shown in the paper are similar to the "world model"
       | belonging to a directionally challenged person.
        
       ___________________________________________________________________
       (page generated 2024-11-07 23:01 UTC)