[HN Gopher] Evaluating the world model implicit in a generative ...
___________________________________________________________________
Evaluating the world model implicit in a generative model
Author : dsubburam
Score : 138 points
Date : 2024-11-07 05:51 UTC (17 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| fragmede wrote:
| Wrong as it is, I'm impressed they were able to get any maps out
| of their LLM that look vaguely cohesive. The shortest path map
| has bits of streets downtown and around Central Park that aren't
| totally red, and Central Park itself is clear on all 3 maps.
|
| They used eight A100s, but don't say how long it took to train
| their LLM. It would be interesting to know the wall clock time
| they spent. Their dataset is, relatively speaking, tiny, which
| means it should take fewer resources to replicate from scratch.
|
| What's interesting, though, is that the smaller model performed
| better; they don't speculate as to why.
| zxexz wrote:
| I can't imagine training took more than a day with 8 A100s even
| with that vocab size [0] (does lightning do implicit vocab
| extension maybe?) and a batch size of 1 [1] or 64 [2] or 4096
| [3] (I have not trawled through the repo and other work enough
| to see what they are actually using in the paper, and let's be
| real - we've all copied random min/nano/whatever GPT forks and
| not bothered renaming stuff). They mentioned their dataset is
| 120 million tokens, which is minuscule by transformer
| standards. Even with a more graph-based model making it 10X+
| longer to train, 1.2 billion tokens per epoch equivalent
| shouldn't take more than a couple of hours with no optimization
| (rough arithmetic below).
|
| [0] https://github.com/keyonvafa/world-model-evaluation/blob/949...
| [1] https://github.com/keyonvafa/world-model-evaluation/blob/949...
| [2] https://github.com/keyonvafa/world-model-evaluation/blob/949...
| [3] https://github.com/keyonvafa/world-model-evaluation/blob/mai...
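|
| For the curious, the back-of-envelope version (the throughput
| figure is my own assumption, not something from the paper or
| repo):
|
|   # rough wall-clock estimate; tokens_per_sec is a guess for a
|   # small GPT-2-scale model on 8 A100s, not a measured number
|   dataset_tokens = 120e6
|   epochs = 10                   # assumed
|   tokens_per_sec = 1e6          # assumed aggregate throughput
|   hours = dataset_tokens * epochs / tokens_per_sec / 3600
|   print(f"{hours:.1f} hours")   # ~0.3 hours under these assumptions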
| IshKebab wrote:
| It's a bit unclear to me what the map visualisations are
| showing, but I don't think your interpretation is correct. They
| even say:
|
| > Our evaluation methods reveal they are very far from
| recovering the true street map of New York City. As a
| visualization, we use graph reconstruction techniques to
| recover each model's implicit street map of New York City. The
| resulting map bears little resemblance to the actual streets of
| Manhattan, containing streets with impossible physical
| orientations and flyovers above other streets.
| fragmede wrote:
| My read of
|
| > Edges exit nodes in their specified cardinal direction. In
| the zoomed-in images, edges belonging to the true graph are
| black and false edges added by the reconstruction algorithm
| are red.
|
| is that the model output the edges; valid ones were then colored
| black and bad ones red. But it's a bit unclear, so you could be
| right.
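|
| Something like this is what I'm picturing (my own toy sketch,
| not their code; the nodes and edges here are made up):
|
|   import networkx as nx
|
|   # ground-truth street graph (toy stand-in for the real one)
|   true_graph = nx.DiGraph()
|   true_graph.add_edge("A", "B", direction="N")
|
|   # transitions proposed by the model: (from, to, direction)
|   model_transitions = [("A", "B", "N"),   # real edge
|                        ("A", "C", "E")]   # hallucinated edge
|
|   recon = nx.DiGraph()
|   for u, v, d in model_transitions:
|       color = "black" if true_graph.has_edge(u, v) else "red"
|       recon.add_edge(u, v, direction=d, color=color)
|
|   print([(u, v, a["color"]) for u, v, a in recon.edges(data=True)])
|   # [('A', 'B', 'black'), ('A', 'C', 'red')]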
| zxexz wrote:
| I've seen some very impressive results just embedding a pre-
| trained KGE model into a transformer model and letting it
| "learn" to query it (I've just used heterogeneous loss functions
| during training with "classifier dimensions" that determine
| whether to greedily sample from the KGE sidecar; I'm sure there
| are much better ways of doing this). This is just a subjective
| viewpoint obviously, but I've played around quite a lot with this
| idea, and it's very easy to get an "interactive" small LLM with
| stable results doing such a thing. The only problem I've found is
| _updating_ the knowledge cheaply without partially retraining the
| LLM itself. For small, domain-specific models this isn't really
| an issue though - for personal projects I just use a couple of
| 3090s.
|
| I think this stuff will become a _lot_ more fascinating after
| transformers have bottomed out on their hype curve and become a
| _tool_ when building specific types of models.
| aix1 wrote:
| > embedding a pre-trained KGE model into a transformer model
|
| Do you have any good pointers (literature, code etc) on the
| mechanics of this?
| zxexz wrote:
| Check out PyKEEN [0] and go wild. I like to train a bunch of
| random models and "overfit" them to the extreme (in my mind
| overfitting them is the _point_ for this task; you want
| dense, compressed knowledge). Resize the input and output
| embeddings of an existing pretrained (but small) LLM (the
| input is only necessary if you're adding extra metadata on
| input, but make sure you untie input/output weights). You can
| add a linear-layer extension to the transformer blocks, pass
| it up as some sort of residual, etc. - honestly just find a
| way to shove it in: detach the KGE from the computation graph
| and add something learnable between it and wherever you're
| connecting it - like just a couple of linear layers and a
| ReLU. The output side is more important: you can have some
| indicator logit(s) to determine whether to "read" from the
| detached graph or sample the outputs of the LLM. Or just
| always do both and interpret it.
|
| (like tinyllama or smaller, or just use whatever karpathy
| repo is most fun at the moment and train some gpt2
| equivalent)
|
| [0] https://pykeen.readthedocs.io/en/stable/index.html
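|
| A minimal PyTorch sketch of the shape I mean (my own toy code,
| not a reference implementation - the module names, sizes, and
| the LM interface are all assumptions):
|
|   import torch
|   import torch.nn as nn
|
|   class KGESidecarLM(nn.Module):
|       # hypothetical wrapper: `lm` is any small causal LM that returns
|       # hidden states and exposes a `head`; `kge_emb` is a frozen
|       # (num_entities, kge_dim) table exported from e.g. a PyKEEN model
|       def __init__(self, lm, hidden_dim, kge_emb):
|           super().__init__()
|           self.lm = lm
|           self.kge = nn.Embedding.from_pretrained(kge_emb, freeze=True)
|           # learnable adapter between LM hidden states and KGE space
|           self.adapter = nn.Sequential(
|               nn.Linear(hidden_dim, kge_emb.shape[1]), nn.ReLU(),
|               nn.Linear(kge_emb.shape[1], kge_emb.shape[1]),
|           )
|           # indicator logit: "read from the sidecar" vs "use the LM head"
|           self.read_gate = nn.Linear(hidden_dim, 1)
|
|       def forward(self, input_ids):
|           h = self.lm(input_ids)                 # (batch, seq, hidden_dim)
|           q = self.adapter(h[:, -1])             # query for the last position
|           entity_logits = q @ self.kge.weight.T  # score every KGE entity
|           lm_logits = self.lm.head(h[:, -1])     # ordinary next-token logits
|           gate = torch.sigmoid(self.read_gate(h[:, -1]))  # p(read from KGE)
|           return lm_logits, entity_logits, gate
|
| At sampling time you greedily take an entity when the gate fires
| and a normal token otherwise; training mixes losses over the
| three outputs, which is roughly what I meant by heterogeneous
| loss functions upthread.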
| zxexz wrote:
| Sorry if that was ridiculously vague. I don't know a ton
| about the state of the art, and I'm really not sure there
| _is_ one - the papers just seem to get more terminology-
| dense and the research mostly just seems to end up
| developing new terminology. My grug-brained philosophy is
| just to make models small enough that you can shove things
| in and iterate quickly enough in Colab or a locally hosted
| notebook with access to a couple of 3090s, or even just
| modern Ryzen/EPYC cores. I like to "evaluate" the raw model
| using pyro-ppl to do MCMC or SVI over the raw logits on a
| known holdout dataset (a toy example of what I mean below).
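|
| For instance, something as simple as fitting a single
| temperature over the held-out logits with SVI (my own toy
| interpretation, not a recipe from anywhere; the logits and
| targets below are random stand-ins):
|
|   import torch
|   import pyro
|   import pyro.distributions as dist
|   from pyro.infer import SVI, Trace_ELBO
|   from pyro.optim import Adam
|
|   logits = torch.randn(1000, 50)            # held-out raw logits
|   targets = torch.randint(0, 50, (1000,))   # true next-token ids
|
|   def model(logits, targets):
|       # one latent temperature scaling all logits
|       temp = pyro.sample("temp", dist.LogNormal(0.0, 1.0))
|       with pyro.plate("data", logits.shape[0]):
|           pyro.sample("obs", dist.Categorical(logits=logits / temp),
|                       obs=targets)
|
|   def guide(logits, targets):
|       loc = pyro.param("loc", torch.tensor(0.0))
|       scale = pyro.param("scale", torch.tensor(0.1),
|                          constraint=dist.constraints.positive)
|       pyro.sample("temp", dist.LogNormal(loc, scale))
|
|   svi = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
|   for _ in range(500):
|       svi.step(logits, targets)
|   # median of the fitted posterior over the temperature
|   print("temperature ~", pyro.param("loc").exp().item())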
|
| Really always happy to chat about this stuff, with anybody.
| Would love to explore ideas here, it's a fun hobby, and
| we're living in a golden age of open-source structured
| datasets. I haven't actually found a community interested
| specifically in static knowledge injection. Email in
| profile, in (ebg_13 encoded).
| Jerrrrrrry wrote:
| Thank you for your comments (good further reading terms),
| and your open invitation for continued inquiry.
|
| The "fomo" / deja vu / impending doom / incipient shift
| in the Overton window regarding meta-architecture for
| AI/ML capabilities and risks is so now glaring obvious of
| an elephant in the room it is nearly catatonic to some.
|
| https://www.youtube.com/watch?v=2ziuPUeewK0
| napsternxg wrote:
| We also did something similar in our NTULM paper at Twitter
| https://youtu.be/BjAmQjs0sZk?si=PBQyEGBx1MSkeUpX
|
| We used it in non-generative language models like BERT, but it
| should help with generative models as well.
| zxexz wrote:
| Thanks for sharing! I'll give it a read tomorrow - I don't
| think I've come across it before. I really do wish there were
| good places for randos like me to discuss this stuff casually.
| I'm in so many Slack, Discord, etc. channels, but none of
| them have the same intensity and hyperfocus as certain IRC
| channels of yore.
| isaacfrond wrote:
| I think there is a philosophical angle to this. I mean, _my_
| world map was constructed by chance interactions with the real
| world. Does this mean that my world map is about as close to the
| real world as their NN's map is to Manhattan? Is my world
| map full of non-existent streets, exits in the wrong places,
| etc.? The NN's map of Manhattan works almost 100% correctly
| when used for normal navigation but breaks down badly when it
| has to plan a detour. How brittle is my world map?
| cen4 wrote:
| Also things are not static in the real world.
| narush wrote:
| I've replicated the OthelloGPT results mentioned in this paper
| personally - and it def felt like the next-move-only accuracy
| metric was not everything. Indeed, the authors of the original
| paper knew this, and so further validated the world model by
| intervening in a model's forward pass to directly manipulate the
| world model (and checking the resulting change in valid move
| predictions).
|
| I'd also recommend checking out Neel Nanda's work on OthelloGPT,
| where he demonstrated the world model was actually linear:
| https://arxiv.org/abs/2309.00941
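|
| For anyone curious what "linear" means concretely there, a
| minimal sketch (my own toy code with made-up shapes and random
| stand-in data, not Neel's): fit a single linear layer from one
| layer's residual-stream activations to the board state and see
| how well it decodes.
|
|   import torch
|   import torch.nn as nn
|
|   # stand-ins: activations from one layer on many game positions,
|   # plus board labels (64 squares x 3 classes: empty / mine / theirs)
|   acts = torch.randn(10000, 512)             # (n_positions, d_model)
|   board = torch.randint(0, 3, (10000, 64))   # (n_positions, 64)
|
|   probe = nn.Linear(512, 64 * 3)             # one linear map, nothing else
|   opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
|   loss_fn = nn.CrossEntropyLoss()
|
|   for step in range(1000):
|       logits = probe(acts).view(-1, 64, 3)
|       loss = loss_fn(logits.reshape(-1, 3), board.reshape(-1))
|       opt.zero_grad(); loss.backward(); opt.step()
|
|   acc = (probe(acts).view(-1, 64, 3).argmax(-1) == board).float().mean()
|   print(f"probe accuracy: {acc.item():.3f}")
|
| High probe accuracy on real activations (with a held-out split,
| which I'm skipping here) is the sense in which the board state
| is linearly decodable.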
| plra wrote:
| Really cool results. I'd love to see some human baselines for,
| say, NYC cabbies or regular Manhattanites, though. I'm sure my
| world model is "incoherent" vis-a-vis these metrics as well, but
| I'm not sure what degree of coherence I should be excited about.
| shanusmagnus wrote:
| Makes me think of an interesting related question: how aware
| are we, normally, of our incoherence? What's the phenomenology
| of that? Hmm.
| HarHarVeryFunny wrote:
| An LLM necessarily has to create _some_ sort of internal "model"
| / representations pursuant to its "predict next word" training
| goal, given the depth and sophistication of context recognition
| needed to do well. This isn't an N-gram model restricted to just
| looking at surface word sequences.
|
| However, the question should be what _sort_ of internal "model"
| has it built? It seems fashionable to refer to this as a "world
| model", but IMO this isn't really appropriate, and certainly it's
| going to be quite different to the predictive representations
| that any animal that _interacts_ with the world, and learns from
| those interactions, will have built.
|
| The thing is that an LLM is an auto-regressive model - it is
| trying to predict continuations of training set samples solely
| based on word sequences, and is not privy to the world that is
| actually being described by those word sequences. It can't model
| the generative process of the humans who created those training
| set samples because _that_ generative process has different
| inputs - sensory ones (in addition to auto-regressive ones).
|
| The "world model" of a human, or any other animal, is built
| pursuant to predicting the environment, but not in a purely
| passive way (such as a multi-modal LLM predicting next frame in a
| video). The animal is primarily concerned with predicting the
| outcomes of its _interactions_ with the environment, driven by
| the evolutionary pressure to learn to act in a way that maximizes
| survival and proliferation of its DNA. This is the nature of a
| real "world model" - it's modelling the world (as perceived thru
| sensory inputs) as a dynamical process reacting to the actions of
| the animal. This is very different to the passive "context
| patterns" learnt by an LLM that are merely predicting auto-
| regressive continuations (whether just words, or multi-modal
| video frames/etc).
| mistercow wrote:
| > It can't model the generative process of the humans who
| created those training set samples because that generative
| process has different inputs - sensory ones (in addition to
| auto-regressive ones).
|
| I think that's too strong a statement. I would say that it's
| very constrained in its ability to model that, but not having
| access to the same inputs doesn't mean you can't model a
| process.
|
| For example, we model hurricanes based on measurements taken
| from satellites. Those aren't the actual inputs to the
| hurricane itself, but abstracted correlates of those inputs. An
| LLM _does_ have access to correlates of the inputs to human
| writing, i.e. textual descriptions of sensory inputs.
| shanusmagnus wrote:
| Brilliant analogy.
|
| And we can imagine that, in a sci-fi world where some super-
| being could act on a scale that would allow it to perturb the
| world in a fashion amenable to causing hurricanes, the
| hurricane model could be substantially augmented, for the
| same reason motor babbling in an infant leads to fluid motion
| as a child.
|
| What has been a revelation to me is how, even peering through
| this dark glass, titanic amounts of data allow quite useful
| world models to emerge, even if they're super limited -- a
| type of "bitter lesson" that suggests we're only at the
| beginning of what's possible.
|
| I expect robotics + LLM to drive the next big breakthroughs,
| perhaps w/ virtual worlds [1] as an intermediate step.
|
| [1] https://minedojo.org/
| HarHarVeryFunny wrote:
| You can model _a_ generative process, but it's necessarily
| an auto-regressive generative process, not the same as the
| originating generative process, which is based on the
| external world.
|
| Human language, and other actions, exist on a range from
| almost auto-regressive (generating a stock/practiced phrase
| such as "have a nice day") to highly interactive ones. An
| auto-regressive model is obviously going to have more success
| modelling an auto-regressive generative process.
|
| Weather prediction is really a good example of the limitations
| of auto-regressive models, as well as of models that don't
| accurately reflect the inputs to the process you are
| attempting to predict. "There's a low pressure front coming
| in, so the weather will be X, same as last time", works some
| of the time. A crude physical weather model based on limited
| data points, such as weather balloon inputs, or satellite
| observation of hurricanes, also works some of the time. But
| of course these models are sometimes hopelessly wrong too.
|
| My real point wasn't about the lack of sensory data, even
| though this does force a purely auto-regressive (i.e. wrong)
| model, but rather about the difference between a passive
| model (such as weather prediction), and an interactive one.
| slashdave wrote:
| Indeed. If you provided a talented individual with a
| sufficient quantity and variety of video streams of travels
| in a city (like New York), that person would be able to draw
| you a map.
| lxgr wrote:
| But isn't the distinction between a "passive" and an "active"
| model ultimately a metaphysical (freedom of will vs.
| determinism) question, under the (possibly practically
| infeasible) assumption that the passive model gets to witness
| _all_ possible actions an agent might take?
|
| Practically, I could definitely imagine interesting outcomes
| from e.g. hooking up a model to a high-fidelity physics
| simulator during training.
| madaxe_again wrote:
| You say this, yet the example of people such as Helen Keller
| suggests that a full sensorium is not necessary to be a full
| human. She had
| some grasp of the idea of colour, of sound, and could use the
| words around them appropriately - yet had no firsthand
| experience of either. Is it really so different?
|
| I think "we" each comprise a number of models, language being
| just one of them - however an extremely powerful one, as it
| allows the transmission of thought across time and space. It's
| therefore understandable that much of what we recognise as
| conscious thought, of a model of the world, emerges from such
| an information dense system. It's literally developed to
| describe the world, efficiently and completely, and so that
| symbol map an LLM carries possibly isn't that different to our
| own.
| HarHarVeryFunny wrote:
| It's not about the necessity of specific sensory inputs, but
| rather about the difference in type of model that will be
| built when the goal is passive, and auto-regressive, as
| opposed to when the goal is interactive.
|
| In the passive/auto-regressive case you just need to model
| predictive contexts.
|
| In the interactive case you need to model dynamical
| behaviors.
| stonemetal12 wrote:
| People around here like to say "The map isn't the territory".
| If we are talking about the physical world, then language is a
| map, not the territory, and not a detailed one either; an LLM
| trained on it is a second-order map.
|
| If we consider the territory to be human intelligence, then
| language is still a map, but a much more detailed one. Thus an
| LLM trained on it becomes a more interesting second-order map.
| seydor wrote:
| Animals could well use an autoregressive model to predict the
| outcomes of their actions on their perceptions. It's not like
| we run the math in our everyday actions (it would take too
| long).
|
| Perhaps that's why we can easily communicate those predictions
| as words.
| dsubburam wrote:
| > The "world model" of a human, or any other animal, is built
| pursuant to predicting the environment
|
| What do you make of Immanuel Kant's claim that all thinking has
| as a basis the presumption of the "Categories"--fundamental
| concepts like quantity, quality and causality[1]? Do LLMs need
| to develop a deep understanding of these?
|
| [1] https://plato.stanford.edu/entries/categories/#KanCon
| machiaweliczny wrote:
| But if you squint, sensory inputs and reactions are also
| sequential tokens. Reactions can even be encoded alongside the
| input as action tokens in a single token stream. Has anyone
| tried something like this?
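|
| Something in the spirit of decision-transformer-style setups, I
| think. A toy sketch of the single-stream encoding I mean (the
| vocab offsets and data are made up):
|
|   # flatten (observation, action) pairs into one token stream,
|   # assuming small discrete observation and action vocabularies
|   OBS_VOCAB = 256          # observation tokens live in [0, 256)
|   ACT_OFFSET = 256         # action tokens live in [256, 256 + n_actions)
|
|   def encode_episode(observations, actions):
|       """Interleave each step's observation tokens with its action token."""
|       stream = []
|       for obs_tokens, action in zip(observations, actions):
|           stream.extend(obs_tokens)            # e.g. a tokenized sensor reading
|           stream.append(ACT_OFFSET + action)   # the action taken after it
|       return stream
|
|   # an autoregressive model trained on such streams predicts the next
|   # action conditioned on everything seen and done so far
|   print(encode_episode([[3, 17, 42], [5, 9, 10]], [1, 0]))
|   # [3, 17, 42, 257, 5, 9, 10, 256]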
| Jerrrrrrry wrote:
| Once your model and map get larger than the thing it is
| modeling/mapping, then what?
|
| Let us hope the pigeonhole principle isn't flawed, or else we
| may find ourselves as batteries in the Matrix.
| anon291 wrote:
| In the paper 'Hopfield Networks is All You Need', they
| calculate the total number of patterns that can be 'stored' in
| the attention layers, and it's exponential in the dimension of
| the stored patterns. So essentially, you can store more 'ideas'
| in an LLM than there are particles in the universe. I think
| we'll be good.
|
| From a technical perspective, this is due to the softmax
| activation function that causes high degrees of separation
| between memory points.
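|
| The retrieval update in that paper is just an attention-style
| softmax read over the stored patterns; a tiny numpy
| illustration (toy sizes, my own random stand-in data):
|
|   import numpy as np
|
|   def softmax(z):
|       e = np.exp(z - z.max())
|       return e / e.sum()
|
|   rng = np.random.default_rng(0)
|   d, n = 64, 10
|   X = rng.standard_normal((d, n))              # n stored patterns (columns)
|   xi = X[:, 3] + 0.3 * rng.standard_normal(d)  # corrupted copy of pattern 3
|
|   beta = 8.0                                   # inverse temperature
|   xi_new = X @ softmax(beta * (X.T @ xi))      # one retrieval step
|   print(np.argmax(X.T @ xi_new))               # recovers index 3
|
| The sharp softmax is what keeps nearby memories from blurring
| together - that's the "high degree of separation" part.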
| Jerrrrrrry wrote:
| > So essentially, you can store more 'ideas' in an LLM than
| there are particles in the universe. I think we'll be good.
|
| If it can compress humanity's knowledge corpus to <80GB,
| un-quantized and un-optimized, then I take my ironically
| typo'd double negative, together with your seemingly genuine
| confirmation, to be absolute confirmation:
|
| we are fukt
| UniverseHacker wrote:
| Really glad to see some academic research on this- it was quite
| obvious from interacting with LLMs that they form a world model
| and can, e.g., correctly simulate simple physics experiments that
| are not in the training set. I found it very frustrating to see
| people repeating the idea that "it can never do x" because it
| lacks a world model. Predicting text that represents events in
| the world requires modeling that world. Just because you can find
| examples where the predictions of a certain model are bad does
| not imply no model at all. At the limit of prediction becoming as
| good as theoretically possible given the input data and model
| size restrictions, the model also becomes as accurate and
| complete as possible. This process is formally described by the
| theory of Solomonoff induction.
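|
| For reference, the standard statement (textbook definition,
| loosely): with a universal monotone machine U, the Solomonoff
| prior weights every program by its length, and prediction is
| just conditioning:
|
|   M(x) = \sum_{p \,:\, U(p)\ \text{begins with}\ x} 2^{-\ell(p)},
|   \qquad M(a \mid x) = M(xa) / M(x)
|
| i.e. the predictor mixes over all programs consistent with what
| it has seen so far, weighted toward the shortest ones.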
| slashdave wrote:
| > At the limit of prediction becoming as good as theoretically
| possible given the input data and model size restrictions
|
| You are treading on delicate ground here. Why do you believe
| that sequence models are capable of reaching theoretical
| maximums?
| slashdave wrote:
| Most of you probably know someone with a poor sense of direction
| (or maybe that's you). From my experience, such people navigate
| primarily (or solely) by landmarks. This makes me wonder if the
| damaged maps shown in the paper are similar to the "world model"
| belonging to a directionally challenged person.
___________________________________________________________________
(page generated 2024-11-07 23:01 UTC)