[HN Gopher] Understand how transformers work by demystifying the...
       ___________________________________________________________________
        
       Understand how transformers work by demystifying the math behind
       them
        
       Author : LaserPineapple
       Score  : 168 points
       Date   : 2024-01-03 21:10 UTC (1 hour ago)
        
 (HTM) web link (osanseviero.github.io)
 (TXT) w3m dump (osanseviero.github.io)
        
       | leereeves wrote:
       | > The complexity comes from the number of steps and the number of
       | parameters.
       | 
       | Yes, it seems like a transformer model simple enough for us to
       | understand isn't able to do anything interesting, and a
       | transformer complex enough to do something interesting is too
       | complex for us to understand.
       | 
       | I would love to study something in the middle, a model that is
       | both simple enough to understand and complex enough to do
       | something interesting.
        
         | calebkaiser wrote:
         | You might be interested, if you aren't already familiar, in
         | some of the work going on in the mechanistic interpretability
         | field. Neel Nanda has a lot of approachable work on the topic:
         | https://www.neelnanda.io/mechanistic-interpretability
        
           | leereeves wrote:
           | I was not familiar with it, and that does look fascinating,
           | thank you. If anyone else is interested, this guide "Concrete
           | Steps to Get Started in Transformer Mechanistic
           | Interpretability" on his site looks like a great place to
           | start:
           | 
           | https://www.neelnanda.io/mechanistic-
           | interpretability/gettin...
        
       | quickthrower2 wrote:
       | Transformer tutorials might be the new monad tutorial. A hard
       | concept to get, but one you need to struggle with (and practice
       | some examples) to understand. So a bit like much of computer
       | science :-).
        
         | hdhfjkrkrme wrote:
         | The moment you understand the Transformer you become incapable
         | of explaining it.
        
         | amelius wrote:
         | Waiting for a blogpost titled "You could have invented
         | transformers".
        
         | csdvrx wrote:
         | > Transformer tutorials might be the new monad tutorial. A hard
         | concept to get,
         | 
         | A hard concept?
         | 
         | But a monad is just a monoid in the category of endofunctors,
         | so what's the problem?
        
       | nemo8551 wrote:
       | There I was all excited to show off some of my electrical chops
       | on HN.
       | 
       | Not today.
        
         | rzzzt wrote:
          | Does mystified math lie behind how the ratio of input and
          | output voltages is equal to the ratio of the primary and
          | secondary windings? Can it be derived from Maxwell's equations?
         | 
         | Off to a search...
        
           | amelius wrote:
           | I bet that an LLM (which uses transformers) can explain those
           | aspects of a transformer to you.
        
       | ActorNightly wrote:
        | The whole "mystery" of the transformer is that instead of a
        | linear sequence of static weights times values in each layer,
        | you now have 3 different matrices that are obtained from the
        | same input through multiplication by learned weights, and then
        | you just multiply those matrices together. I.e. more
        | parallelism, which works out nicely, but it's very restrictive
        | since the attention formula is static.
        | 
        | We aren't going to see more progress until we have a way to
        | generalize the compute graph as a learnable parameter. I don't
        | know if this is even possible in the traditional sense of
        | gradients due to chaotic effects (i.e. small changes produce
        | big shifts in performance); it may have to be some form of
        | genetic algorithm or PSO that happens under the hood.
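The three-matrix computation described above - Q, K, and V projections of the same input, combined through the static attention formula - can be sketched in a few lines of NumPy. Dimensions and variable names here are illustrative, not from any particular implementation:

```python
import numpy as np

def attention(x, Wq, Wk, Wv):
    # Three matrices obtained from the same input via learned weights
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # The static attention formula: softmax(Q K^T / sqrt(d)) V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                       # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one output vector per input token
```

Note that the only learnable parts are Wq, Wk, and Wv; the formula itself never changes, which is the restriction the comment is pointing at.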
        
         | ex3ndr wrote:
          | It basically already does this - it can learn to ignore some
          | paths and amplify more important ones, and then you can just
          | cut those paths without a noticeable loss of quality. The
          | problem is that you are not going to win anything from this -
          | non-matrix multiplication would be slower or the same.
        
           | ActorNightly wrote:
            | The issue is that you are thinking of this in terms of
            | information compression, which is what LLMs are.
            | 
            | I'm more concerned with an LLM having the ability to be
            | trained to the point where a subset of the graph represents
            | all the NAND gates necessary for a CPU and RAM, so when you
            | ask it questions it can actually run code to compute them
            | accurately instead of offering a statistical best guess,
            | i.e. decompression after lossy compression.
        
             | exe34 wrote:
             | Just give it a computer? Even a virtual machine. It can
             | output assembly code or high level code that gets compiled.
        
               | ActorNightly wrote:
                | The issue is not having access to a CPU; the issue is
                | whether the model can be trained in such a way that it
                | has representative structures for applicable problem
                | solving. Furthermore, the structures themselves should
                | 
                | Philosophically, you can't just keep ad hoc-ing
                | functionalities on top of LLMs and expect major
                | progress. Sure, you can make them better, but you will
                | never get to the state where AI is massively useful.
               | 
                | For example, let's say you gather a whole bunch of
                | experts in their respective fields, and you give them a
                | task to put together a detailed plan for how to build a
                | flying car. You will have people doing design, doing
                | simulations, researching material sourcing, creating
                | CNC programs for manufacturing parts, sourcing tools
                | and equipment, writing software, etc. And when
                | executing this plan, they would be open to feedback on
                | anything missed, and could advise on how to proceed.
               | 
                | An AI with the above capability should be able to go
                | out on the internet, gather the relevant data, run any
                | sort of algorithms it needs to run, and, perhaps after
                | a month of number crunching on a rented cloud TPU rack,
                | produce a step-by-step plan with costs for how to do
                | all of that. And it would be better than those experts
                | because it should be able to create much higher-
                | fidelity simulations to account for things like
                | vibration and predict if some connector is going to
                | wobble loose.
        
       | brcmthrowaway wrote:
       | Does the human brain use transformers?
        
         | exe34 wrote:
         | No, but they can both implement a language virtual machine
         | which appears to be able to produce intelligent behaviour with
         | unknown bounds.
        
         | mirekrusin wrote:
          | Yes, through services like OpenAI's ChatGPT.
        
         | dartos wrote:
         | What?
        
         | __loam wrote:
         | Anyone telling you it does is a fraud.
        
       | bloopernova wrote:
       | Do LLMs use neural nets? If so, what makes up the "neuron"? i.e.
       | Is there a code structure that underlies the neuron, or is it
       | "just" fancy math?
        
         | theonlybutlet wrote:
          | Yes to both. The "neuron" would basically be a
          | probability/weighted parameter. A parameter is a mathematical
          | representation of a token and its weighting (they're
          | translated from/to the input/output token lists entering and
          | exiting the model). Usually tokens are pre-set small groups
          | of character combinations, like "if " or "cha", that make up
          | a word/sentence. The recorded path your value takes down the
          | chain of probabilities would be the "neural pathway" within
          | the wider "neural network".
         | 
         | Someone please correct me if I'm wrong or my terminology is
         | wrong.
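The "pre-set small groups of characters" idea can be illustrated with a toy tokenizer. The vocabulary and the greedy longest-match lookup below are made up purely for illustration - real LLM tokenizers (e.g. BPE) learn their vocabularies from data:

```python
# Made-up vocabulary mapping character chunks to token ids
VOCAB = {"cha": 0, "if ": 1, "r": 2, "c": 3, "h": 4, "a": 5, "t": 6, " ": 7}

def tokenize(text):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(3, 0, -1):          # try 3-char chunks first
            piece = text[i:i + size]
            if piece in VOCAB:
                tokens.append(VOCAB[piece])
                i += size
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return tokens

print(tokenize("if chat"))  # [1, 0, 6] -> "if ", "cha", "t"
```

The model only ever sees these integer ids; the mapping back to text happens at the output end.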
        
         | dartos wrote:
         | Transformers can be considered a kind of neural network.
         | 
          | It's mainly fancy math. With tools like PyTorch or
          | TensorFlow, you use Python to describe a graph of
          | computations, which gets compiled down into optimized
          | instructions.
         | 
         | There are some examples of people making transformers and other
         | NN architectures in about 100 lines of code. I'd google for
         | those to see what these things look like in code.
         | 
         | The training loop, data, and resulting weights are where the
         | magic is.
         | 
         | The code is disappointingly simple.
        
           | bloopernova wrote:
           | > The code is disappointingly simple.
           | 
           | I absolutely adore this sentence, it made me laugh to imagine
           | coders or other folks looking at the code and thinking
           | "That's it?!? But that's simple!"
           | 
            | Although it feels a little similar to the basic building
            | blocks that combine to make up DNA: simple units that work
            | together to form something much more complex.
           | 
           | (apologies for poor metaphors, I'm still trying to grasp some
           | of the concepts involved with this)
        
         | osanseviero wrote:
         | Just math, and not even that fancy.
         | 
         | Let's say you want to predict if you'll pass an exam based on
         | how many hours you studied (x1) and how many exercises you did
          | (x2). A neuron will learn a weight for each variable (w1 and
          | w2). If the model learns w1=0.5 and w2=1, the model will
          | give more importance to the # of exercises.
         | 
          | So if you study for 10 hours and only do 2 exercises, the
          | model will compute x1*w1 + x2*w2 = 10*0.5 + 2*1 = 7. The
          | neuron then outputs that. This is a bit (but not much)
          | simplified - we also have a bias term and an activation to
          | process the output.
         | 
         | Congrats! We built our first neuron together! Have thousands of
         | these neurons in connected layers, and you suddenly have a deep
         | neural network. Have billions or trillions of them, you have an
         | LLM :)
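The worked example above, written out as a few lines of Python. The sigmoid at the end is just one common activation choice, added for illustration along with the bias term the comment mentions:

```python
import math

def neuron(x1, x2, w1=0.5, w2=1.0, bias=0.0):
    # Weighted sum of the inputs, plus a bias term
    return x1 * w1 + x2 * w2 + bias

# 10 hours of study, 2 exercises -> 10*0.5 + 2*1 = 7
z = neuron(10, 2)
print(z)  # 7.0

# A real neuron passes z through an activation, e.g. a sigmoid,
# squashing the output into (0, 1) - usable as a pass probability
print(1 / (1 + math.exp(-z)))
```

Stack thousands of these in connected layers and you have the deep network the comment describes.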
        
         | abrichr wrote:
          | The "neuron" in a neural network is just a non-linear
          | function of the weighted sum of the inputs (plus a bias
          | term).
         | 
         | See the "definition" section in
         | https://en.wikipedia.org/wiki/Perceptron .
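As a sketch of that definition, here is a classic perceptron: a step function of the weighted sum plus bias. The weights are hand-picked purely for illustration so that it computes logical AND:

```python
def perceptron(inputs, weights, bias):
    # Non-linear (step) function of the weighted sum of inputs plus bias
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

# With weights (1, 1) and bias -1.5, only (1, 1) clears the threshold
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron((a, b), (1.0, 1.0), -1.5))
```

In modern networks the hard step is replaced by smooth activations so that gradients can flow during training.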
        
       | enriquto wrote:
       | For a dryer, more formal and succinct approach, see "The
       | Transformer Model in Equations" [0], by John Thickstun. The whole
       | thing fits in a single page, using standard mathematical
       | notation.
       | 
       | [0] https://johnthickstun.com/docs/transformers.pdf
        
       | dogline wrote:
       | Six paragraphs in, and I already have questions.
       | 
       | > Hello -> [1,2,3,4] World -> [2,3,4,5]
       | 
       | The vectors are random, but they look like they have a pattern
       | here. Does the 2 in both vector mean something? Or, is it the
       | entire set that makes it unique?
        
         | dan-robertson wrote:
          | The number reuse is just the author being a bit lazy. You
          | could estimate how similar these vectors are by seeing if
          | they point in similar directions or by calculating the angle
          | between them. Here they are only about 6 degrees apart and
          | point in much the same direction, but a lot of this is that
          | the author didn't want to put any negative numbers in the
          | example, so the vectors end up being a bit more similar than
          | they would be in reality.
          | 
          | That the numbers are reused isn't meaningful here: a 1 in
          | the first position is quite unrelated to a 1 in the second
          | (as no convolutions are done over this vector).
        
       | adamnemecek wrote:
       | It's a renormalization process. It can be modelled as a
       | convolution in a Hopf algebra.
        
       | snaxsnaxsnax wrote:
       | Hmmm, yes, I know some of these words.
        
       | naitgacem wrote:
       | Reading the title I thought this was about electrical
       | transformers :p
       | 
       | Although this is HN but my background is still stronger.
       | 
       | And by the way, is it worth it to invest time to get some idea
       | about this whole AI field? I'm from a compE background
        
       ___________________________________________________________________
       (page generated 2024-01-03 23:00 UTC)