[HN Gopher] Beyond self-attention: How a small language model pr...
___________________________________________________________________
Beyond self-attention: How a small language model predicts the next
token
Author : tplrbv
Score : 124 points
Date : 2024-02-04 16:54 UTC (6 hours ago)
(HTM) web link (shyam.blog)
(TXT) w3m dump (shyam.blog)
| kmeisthax wrote:
| I had the exact same idea after seeing Google point out that you
| can[0] get ChatGPT to regurgitate verbatim training data by
| asking it to repeat the same word over and over again[1]. I'm
| glad to see someone else actually bring it to fruition.
|
| This, of course, raises two additional questions:
|
| 1. Is this "AI, hold the AI" approach more energy-efficient than
| having gradient descent backpropagation compress a bunch of
| training data into a model that can then be run on specialized AI
| coprocessors?
|
| 2. Will this result wind up being evidence in the ongoing
| lawsuits against OpenAI and Stability AI?
|
| [0] Could. OpenAI now blocks generation if you fill the context
| window with a single word.
|
| [1] https://arxiv.org/abs/2311.17035
| refulgentis wrote:
| I'm confused, you had the exact same idea that LLM output is
| based on probability of next token, which is based on the
| training data?
|
| If that's the case, no, it's unlikely this result will end up
| _becoming_ evidence; that much is well known and fundamental.
|
| The author's contribution to the discussion is showing this to a
| technical audience writing their own GPT. As they note, most
| "how to implement this?" material focuses on transformers.
| yorwba wrote:
| This approach cannot possibly be more efficient than running
| the original model, because it relies on running the original
| model to get the activations, then searching the text corpus
| for strings with similar activations to compute the next-token
| statistics. You don't get to skip many steps, and you end up
| having to do a bunch of extra work.
|
| I'd be surprised if doing this with two completely separate
| corpora, one for training the model and the other to search for
| strings with similar activations, wouldn't lead to much the
| same results. Because the hard part is constructing similar
| activations for strings with similar next-token statistics in
| the first place.
|
| Note that in the per-layer weights [0.01, 0.01, 0.1, 1.5, 6,
| 0.01] the penultimate layer is the most important, where the
| input has already been transformed a lot. So you can't expect
| to use this to replace a transformer with a simple grep over
| the training data. (My guess as to why the penultimate layer
| has a much higher weight than the final one is that this is due
| to induction heads
| https://transformer-circuits.pub/2021/framework/index.html
| which implement copying repeated strings _from the input_,
| with the penultimate layer determining what to look for and
| the final layer doing the copying.)
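|
| For concreteness, here is a rough sketch (my own, not the
| author's code) of the kind of lookup described above: you still
| have to run the trained transformer once to get per-layer
| activations for the query context, then score precomputed
| corpus activations with those per-layer weights and read the
| next-token statistics off the closest matches. The names and
| array shapes below are assumptions.
|
|     import numpy as np
|
|     LAYER_WEIGHTS = np.array([0.01, 0.01, 0.1, 1.5, 6.0, 0.01])
|
|     def weighted_similarity(query_acts, corpus_acts):
|         # query_acts: (layers, d_model) activations at the current position
|         # corpus_acts: (n_strings, layers, d_model) precomputed over the corpus
|         q = query_acts / np.linalg.norm(query_acts, axis=-1, keepdims=True)
|         c = corpus_acts / np.linalg.norm(corpus_acts, axis=-1, keepdims=True)
|         per_layer = np.einsum('ld,nld->nl', q, c)   # cosine similarity per layer
|         return per_layer @ LAYER_WEIGHTS            # weighted score per string
|
|     def next_token_distribution(query_acts, corpus_acts, next_tokens, k=100):
|         # next_tokens[i] is the token id that followed corpus string i
|         scores = weighted_similarity(query_acts, corpus_acts)
|         top = np.argsort(-scores)[:k]
|         counts = np.bincount(next_tokens[top])
|         return counts / counts.sum()
|
| The point stands: producing query_acts in the first place is a
| full forward pass, so none of the transformer's work is saved.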
| awwaiid wrote:
| A thousand hands on a Ouija board.
| jimmySixDOF wrote:
| This was a good 3D visualization of the same kind of system;
| the two should probably be read together for maximum effect:
|
| LLM Visualization (https://bbycroft.net/llm)
| https://news.ycombinator.com/item?id=38505211
| danielmarkbruce wrote:
| This is a weird post. "What the transformer is actually doing"?
| You can just follow the code and see what it's doing. It's not
| doing something more or less than that. It's not doing some other
| thing.
| nl wrote:
| A walk-through of what the data looks like at each point is
| actually pretty useful.
| gjm11 wrote:
| The post is long and complicated and I haven't read most of it,
| so whether it's actually any good I shan't try to decide. But
| the above seems like a very weird argument.
|
| Sure, the code is doing what it's doing. But trying to
| understand it at that level of abstraction seems ... not at all
| promising.
|
| Consider a question about psychology. Say: "What are people
| doing when they decide what to buy in a shop?".
|
| If someone writes an article about this, drawing on some
| (necessarily simplified) model of human thinking and decision-
| making, and some experimental evidence about how people's
| purchasing decisions change in response to changes in price,
| different lighting conditions, mood, etc., ... would you say
| "You can just apply the laws of physics and see what the people
| are doing. They're not doing something more or less than
| that."?
|
| I mean, it would be _true_. People, so far as we know, do in
| fact obey the laws of physics. You could, in principle, predict
| what someone will buy in a given situation by modelling their
| body and surroundings at the level of atoms or thereabouts
| (quantum physics is a thing, of course, but it seems likely
| that a basically-classical model could be good enough for this
| purpose). When we make decisions, we are obeying the laws of
| physics and not doing some other thing.
|
| But this answer is completely useless for _actually
| understanding_ what we do. If you're wondering "what would
| happen if the price were ten cents higher?" you've got no way
| to answer it other than running the whole simulation again.
| Maybe running thousands of versions of it since other factors
| could affect the results. If you're wondering "does the
| lighting make a difference, and what level of lighting in the
| shop will lead to people spending least or most?" then you've
| got no way to answer it other than running simulations with
| many different lighting conditions.
|
| Whereas if you have a higher-level, less precise model that
| says things like "people mostly prefer to spend less" and
| "people try to predict quality on the basis of price, so
| sometimes they will spend more if it seems like they're getting
| something better that way" and "people like to feel that
| they're getting a bargain" and so on, you may be able to make
| predictions without running an impossibly detailed person-
| simulation zillions of times. You may be able to _give general
| advice_ to someone with a spending problem who'd like to spend
| more wisely, or to a shopkeeper who wants to encourage their
| customers to spend more.
|
| Similarly with language models and similar systems. Sure, you
| can find out what it does in some very specific situation by
| just running the code. But what if you have some broader
| question than that? Then simply knowing what the code does may
| not help you at all, because what the code does is gazillions
| of copies of "multiply these numbers together and add them".
|
| Again, I make no claim about whether the particular thing
| linked here offers much real insight. But it makes zero sense,
| so far as I can see, to dismiss it on the grounds that all you
| need to do is read the code.
| xanderlewis wrote:
| You're spot on; it's like saying you can understand the game
| of chess by simply reading the rules. In a certain very
| superficial sense, yes. But the universe isn't so simple. It's
| the same reason even a perfect understanding of what goes on
| at the level of subatomic particles isn't thought to be enough
| to say we 'understand the universe'. A hell of a lot can
| happen between the setting out of some basic rules and the
| (much higher-level) end result.
| drdeca wrote:
| Understanding how a given CPU (plus the rest of the computer
| hardware) works does not suffice to understand what is going
| on when a particular program is running. For that, you need to
| read the program, or an execution trace, or both, or something
| along those lines that is specific to the program being run.
| empiko wrote:
| Nice project, but the model being studied is really just a toy
| model (in both size and training data). As such, it can indeed
| be approximated by simpler models (I would suspect even n-gram
| LMs, along the lines of the sketch below), but it might not be
| representative of how larger LMs work.
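|
| A minimal character-level n-gram baseline of the sort I mean
| (my own sketch, not from the post): count (context -> next
| character) frequencies over the corpus and sample from them.
|
|     from collections import Counter, defaultdict
|     import random
|
|     def train_ngram(text, n=4):
|         # map each length-n context to a Counter of the characters that follow it
|         counts = defaultdict(Counter)
|         for i in range(len(text) - n):
|             counts[text[i:i + n]][text[i + n]] += 1
|         return counts
|
|     def generate(counts, seed, length=200, n=4):
|         out = seed
|         for _ in range(length):
|             options = counts.get(out[-n:])
|             if not options:
|                 break
|             chars, freqs = zip(*options.items())
|             out += random.choices(chars, weights=freqs)[0]
|         return out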
| kgeist wrote:
| >I trained a small (~10 million parameter) transformer following
| Andrej Karpathy's excellent tutorial, Let's build GPT: from
| scratch, in code, spelled out
|
| As soon as I learned about Andrej Karpathy's NanoGPT, I trained
| it on War and Peace (in Russian), and what I found interesting
| is that it almost grokked Russian grammar despite being just a
| 3 MB model. Russian has a complex synthetic-inflectional
| structure. For example, the preposition "na" ("upon") requires
| the following noun to be in the accusative case, which is
| manifested as the ending -a for animate masculine nouns, but as
| a null ending for inanimate nouns, as -ia for nouns ending in a
| "soft consonant", as -u for feminine nouns, etc. Or the verb
| "to use" requires the following noun to be in the instrumental
| case when it names the tool being used.
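|
| (For anyone who wants to reproduce this: the data prep is only
| a few lines, following the pattern of NanoGPT's character-level
| Shakespeare example. A rough sketch with my own file names, not
| the exact script from the repo:)
|
|     import pickle
|     import numpy as np
|
|     # build a character-level vocabulary over the corpus
|     text = open('war_and_peace_ru.txt', encoding='utf-8').read()
|     chars = sorted(set(text))
|     stoi = {ch: i for i, ch in enumerate(chars)}
|
|     # encode the corpus and write train/val splits as uint16 token ids
|     data = np.array([stoi[ch] for ch in text], dtype=np.uint16)
|     split = int(0.9 * len(data))
|     data[:split].tofile('train.bin')
|     data[split:].tofile('val.bin')
|
|     # save the vocab so generated ids can be decoded back to text
|     with open('meta.pkl', 'wb') as f:
|         pickle.dump({'vocab_size': len(chars), 'stoi': stoi,
|                      'itos': {i: ch for ch, i in stoi.items()}}, f)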
|
| Although it's not perfect and made mistakes, I found it
| interesting that NanoGPT was able to infer certain complex
| rules in just 3 minutes of training - and I searched the text
| for the exact examples it generated and found nothing verbatim.
|
| However, despite more or less getting the grammar right, the
| output was semantically complete nonsense.
___________________________________________________________________
(page generated 2024-02-04 23:00 UTC)