[HN Gopher] Beyond self-attention: How a small language model pr...
       ___________________________________________________________________
        
       Beyond self-attention: How a small language model predicts the next
       token
        
       Author : tplrbv
       Score  : 124 points
       Date   : 2024-02-04 16:54 UTC (6 hours ago)
        
 (HTM) web link (shyam.blog)
 (TXT) w3m dump (shyam.blog)
        
       | kmeisthax wrote:
       | I had the exact same idea after seeing Google point out that you
       | can[0] get ChatGPT to regurgitate verbatim training data by
       | asking it to repeat the same word over and over again[1]. I'm
       | glad to see someone else actually bring it to fruition.
       | 
       | This, of course, brings two additional questions:
       | 
       | 1. Is this "AI, hold the AI" approach more energy-efficient than
       | having gradient descent backpropagation compress a bunch of
       | training data into a model that can then be run on specialized AI
       | coprocessors?
       | 
       | 2. Will this result wind up being evidence in the ongoing
       | lawsuits against OpenAI and Stability AI?
       | 
       | [0] Could. OpenAI now blocks generation if you fill the context
       | window with a single word.
       | 
       | [1] https://arxiv.org/abs/2311.17035
        
         | refulgentis wrote:
         | I'm confused: you had the exact same idea, that LLM output is
         | based on the probability of the next token, which is based on
         | the training data?
         | 
         | If that's the case, no, it's unlikely this result will end up
         | _becoming_ evidence; that is well known and fundamental.
         | 
         | The author's contribution to the discussion is showing this to
         | a technical audience writing their own GPT; as they note, most
         | "how do I implement this?" write-ups focus on transformers.
        
         | yorwba wrote:
         | This approach cannot possibly be more efficient than running
         | the original model, because it relies on running the original
         | model to get the activations that are used to search the text
         | corpus for strings with similar activations, from which the
         | next-token statistics are then computed. You don't get to skip
         | many steps, and you end up having to do a bunch of extra work.
         | 
         | I'd be surprised if doing this with two completely separate
         | corpora, one for training the model and the other to search for
         | strings with similar activations, wouldn't lead to much the
         | same results. Because the hard part is constructing similar
         | activations for strings with similar next-token statistics in
         | the first place.
         | 
         | Note that in the per-layer weights [0.01, 0.01, 0.1, 1.5, 6,
         | 0.01] the penultimate layer is the most important, where the
         | input has already been transformed a lot. So you can't expect
         | to use this to replace a transformer with a simple grep over
         | the training data. (My guess as to why the penultimate layer
         | has a much higher weight than the final one is that this is due
         | to induction heads
         | https://transformer-circuits.pub/2021/framework/index.html
         | which implement copying repeated strings _from the input_, with
         | the penultimate layer determining what to look for and the
         | final layer doing the copying.)
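         | 
         | (A rough sketch of the kind of lookup being described, assuming
         | you have cached per-layer activations for corpus strings plus
         | the token that followed each; shapes and names are
         | illustrative, not the author's code:)
         | 
         |   import numpy as np
         |   
         |   # per-layer weights as quoted above
         |   LAYER_WEIGHTS = np.array([0.01, 0.01, 0.1, 1.5, 6.0, 0.01])
         |   
         |   def predict_next(query_acts, corpus_acts, next_tokens, k=50):
         |       # query_acts:  (n_layers, d_model) for the current prompt
         |       # corpus_acts: (n_strings, n_layers, d_model), cached
         |       # next_tokens: (n_strings,) token following each string
         |       q = query_acts / np.linalg.norm(
         |           query_acts, axis=-1, keepdims=True)
         |       c = corpus_acts / np.linalg.norm(
         |           corpus_acts, axis=-1, keepdims=True)
         |       # cosine similarity per layer, then weighted sum
         |       per_layer = np.einsum('ld,sld->sl', q, c)
         |       scores = per_layer @ LAYER_WEIGHTS
         |       top = np.argsort(scores)[-k:]  # k most similar strings
         |       counts = np.bincount(next_tokens[top],
         |                            minlength=next_tokens.max() + 1)
         |       return counts / counts.sum()   # next-token distribution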
        
       | awwaiid wrote:
       | A thousand hands on a Ouija board.
        
       | jimmySixDOF wrote:
       | This was a good 3D visualization of the same systems and they
       | probably should be read together for maximum effect ....
       | 
       | LLM Visualization (https://bbycroft.net/llm)
       | https://news.ycombinator.com/item?id=38505211
        
       | danielmarkbruce wrote:
       | This is a weird post. "What the transformer is actually doing"?
       | You can just follow the code and see what it's doing. It's not
       | doing something more or less than that. It's not doing some other
       | thing.
        
         | nl wrote:
         | A walk-through of what the data looks like at each point is
         | actually pretty useful.
        
         | gjm11 wrote:
         | The post is long and complicated and I haven't read most of it,
         | so whether it's actually any good I shan't try to decide. But
         | the above seems like a very weird argument.
         | 
         | Sure, the code is doing what it's doing. But trying to
         | understand it at that level of abstraction seems ... not at all
         | promising.
         | 
         | Consider a question about psychology. Say: "What are people
         | doing when they decide what to buy in a shop?".
         | 
         | If someone writes an article about this, drawing on some
         | (necessarily simplified) model of human thinking and decision-
         | making, and some experimental evidence about how people's
         | purchasing decisions change in response to changes in price,
         | different lighting conditions, mood, etc., ... would you say
         | "You can just apply the laws of physics and see what the people
         | are doing. They're not doing something more or less than
         | that."?
         | 
         | I mean, it would be _true_. People, so far as we know, do in
         | fact obey the laws of physics. You could, in principle, predict
         | what someone will buy in a given situation by modelling their
         | body and surroundings at the level of atoms or thereabouts
         | (quantum physics is a thing, of course, but it seems likely
         | that a basically-classical model could be good enough for this
         | purpose). When we make decisions, we are obeying the laws of
         | physics and not doing some other thing.
         | 
         | But this answer is completely useless for _actually
         | understanding_ what we do. If you're wondering "what would
         | happen if the price were ten cents higher?" you've got no way
         | to answer it other than running the whole simulation again.
         | Maybe running thousands of versions of it since other factors
         | could affect the results. If you're wondering "does the
         | lighting make a difference, and what level of lighting in the
         | shop will lead to people spending least or most?" then you've
         | got no way to answer it other than running simulations with
         | many different lighting conditions.
         | 
         | Whereas if you have a higher-level, less precise model that
         | says things like "people mostly prefer to spend less" and
         | "people try to predict quality on the basis of price, so
         | sometimes they will spend more if it seems like they're getting
         | something better that way" and "people like to feel that
         | they're getting a bargain" and so on, you may be able to make
         | predictions without running an impossibly detailed person-
         | simulation zillions of times. You may be able to _give general
         | advice_ to someone with a spending problem who'd like to spend
         | more wisely, or to a shopkeeper who wants to encourage their
         | customers to spend more.
         | 
         | Similarly with language models and similar systems. Sure, you
         | can find out what it does in some very specific situation by
         | just running the code. But what if you have some broader
         | question than that? Then simply knowing what the code does may
         | not help you at all, because what the code does is gazillions
         | of copies of "multiply these numbers together and add them".
         | 
         | Again, I make no claim about whether the particular thing
         | linked here offers much real insight. But it makes zero sense,
         | so far as I can see, to dismiss it on the grounds that all you
         | need to do is read the code.
        
           | xanderlewis wrote:
           | You're spot on; it's like saying you can understand the game
           | of chess by simply reading the rules. In a certain very
           | superficial sense, yes. But the universe isn't so simple.
           | It's the same reason even a perfect understanding of what
           | goes on at the level of subatomic particles isn't thought to
           | be enough to say we 'understand the universe'. A hell of a
           | lot can happen between the setting out of some basic rules
           | and the much higher-level end result.
        
         | drdeca wrote:
         | Understanding how a given CPU (plus the rest of the computer
         | hardware) works does not suffice to understand what is going on
         | when a particular program is running. For that, you need to
         | read the program, or an execution trace, or both, or something
         | along those lines, which is specific to the program being run.
        
       | empiko wrote:
       | Nice project, but the model being studied is really just a toy
       | model (both in size and training data). As such, it can indeed be
       | approximated by simpler models (I would suspect even by n-gram
       | LMs), but it might not be representative of how larger LMs work.
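       | 
       | (For reference, a minimal sketch of the kind of n-gram baseline I
       | mean: count which token follows each (n-1)-token context in the
       | training text and sample proportionally. Purely illustrative:)
       | 
       |   from collections import Counter, defaultdict
       |   import random
       |   
       |   def train_ngram(tokens, n=3):
       |       # count continuations of each (n-1)-token context
       |       counts = defaultdict(Counter)
       |       for i in range(len(tokens) - n + 1):
       |           context = tuple(tokens[i:i + n - 1])
       |           counts[context][tokens[i + n - 1]] += 1
       |       return counts
       |   
       |   def sample_next(counts, context):
       |       # sample the next token in proportion to its count
       |       options = counts[tuple(context)]
       |       if not options:
       |           return None  # unseen context; real LMs back off
       |       toks, freqs = zip(*options.items())
       |       return random.choices(toks, weights=freqs, k=1)[0]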
        
       | kgeist wrote:
       | >I trained a small (~10 million parameter) transformer following
       | Andrej Karpathy's excellent tutorial, Let's build GPT: from
       | scratch, in code, spelled out
       | 
       | As soon as I learned about Andrej Karpathy's NanoGPT, I trained
       | it on War and Peace (in Russian), and what I found interesting is
       | that it almost grokked Russian grammar despite being just a 3 MB
       | model. The Russian language has a complex synthetic-inflectional
       | structure. For example, the preposition "na" ("upon") requires
       | the following noun to be in the accusative case, which is
       | manifested as the ending -a for animate masculine nouns, but as a
       | null ending for inanimate nouns, as -ia for nouns that end in a
       | "soft consonant", as -u for feminine nouns, etc. Or the verb "to
       | use" requires the following noun to be in the instrumental case
       | if it's used as a tool.
       | 
       | Although it wasn't perfect and made mistakes, I found it
       | interesting that NanoGPT was able to infer certain complex rules
       | in just 3 minutes of training - and I searched the texts for the
       | exact examples it generated and found nothing verbatim.
       | 
       | However, despite understanding the grammar more or less,
       | semantically it was complete nonsense.
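       | 
       | (For anyone who wants to try the same thing: the data prep is
       | roughly the char-level setup from the tutorial. A sketch, with
       | the corpus file name as a placeholder:)
       | 
       |   import torch
       |   
       |   # char-level tokenization as in "Let's build GPT: from scratch"
       |   # 'war_and_peace.txt' is a placeholder for your own corpus
       |   text = open('war_and_peace.txt', encoding='utf-8').read()
       |   chars = sorted(set(text))    # vocabulary = unique characters
       |   stoi = {ch: i for i, ch in enumerate(chars)}
       |   itos = {i: ch for ch, i in stoi.items()}
       |   
       |   encode = lambda s: [stoi[c] for c in s]
       |   decode = lambda ids: ''.join(itos[i] for i in ids)
       |   
       |   data = torch.tensor(encode(text), dtype=torch.long)
       |   n = int(0.9 * len(data))     # 90/10 train/val split
       |   train_data, val_data = data[:n], data[n:]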
        
       ___________________________________________________________________
       (page generated 2024-02-04 23:00 UTC)