[HN Gopher] Understand how transformers work by demystifying the...
___________________________________________________________________
Understand how transformers work by demystifying the math behind
them
Author : LaserPineapple
Score : 168 points
Date : 2024-01-03 21:10 UTC (1 hours ago)
(HTM) web link (osanseviero.github.io)
(TXT) w3m dump (osanseviero.github.io)
| leereeves wrote:
| > The complexity comes from the number of steps and the number of
| parameters.
|
| Yes, it seems like a transformer model simple enough for us to
| understand isn't able to do anything interesting, and a
| transformer complex enough to do something interesting is too
| complex for us to understand.
|
| I would love to study something in the middle, a model that is
| both simple enough to understand and complex enough to do
| something interesting.
| calebkaiser wrote:
| You might be interested, if you aren't already familiar, in
| some of the work going on in the mechanistic interpretability
| field. Neel Nanda has a lot of approachable work on the topic:
| https://www.neelnanda.io/mechanistic-interpretability
| leereeves wrote:
| I was not familiar with it, and that does look fascinating,
| thank you. If anyone else is interested, this guide "Concrete
| Steps to Get Started in Transformer Mechanistic
| Interpretability" on his site looks like a great place to
| start:
|
| https://www.neelnanda.io/mechanistic-
| interpretability/gettin...
| quickthrower2 wrote:
| Transformer tutorials might be the new monad tutorial. A hard
| concept to get, but one you need to struggle with (and practice
| some examples) to understand. So a bit like much of computer
| science :-).
| hdhfjkrkrme wrote:
| The moment you understand the Transformer you become incapable
| of explaining it.
| amelius wrote:
| Waiting for a blogpost titled "You could have invented
| transformers".
| csdvrx wrote:
| > Transformer tutorials might be the new monad tutorial. A hard
| concept to get,
|
| A hard concept?
|
| But a monad is just a monoid in the category of endofunctors,
| so what's the problem?
| nemo8551 wrote:
| There I was all excited to show off some of my electrical chops
| on HN.
|
| Not today.
| rzzzt wrote:
| Does mystified math lie behind how the ratio of input
| and output voltages is equal to the ratio of the primary and
| secondary windings? Can it be derived from Maxwell's equations?
|
| Off to a search...
| amelius wrote:
| I bet that an LLM (which uses transformers) can explain those
| aspects of a transformer to you.
| ActorNightly wrote:
| The whole "mystery" of transformer is that instead of a linear
| sequence of static weights times values in each layer, you now
| have 3 different matrices that are obtained from the same input
| through multiplication of learned weights, and then you just
| multiply the matrices together. I.e more parallelism which works
| out nice, but very restrictive since the attention formula is
| static.
|
| We aren't going to see more progress until we have a way to
| generalize the compute graph as a learnable parameter. I don't
| know if this is even possible in the traditional sense of
| gradients due to chaotic effects (i.e., small changes produce big
| shifts in performance); it may have to be some form of genetic
| algorithm or PSO that happens under the hood.
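|
| A minimal numpy sketch of that step, assuming a single attention
| head and toy dimensions (Wq, Wk, Wv stand in for the learned
| weights mentioned above):
|
|   import numpy as np
|
|   d = 4                         # toy embedding dimension
|   x = np.random.randn(3, d)    # 3 input tokens, one d-dim vector each
|
|   # three matrices obtained from the same input via learned weights
|   Wq = np.random.randn(d, d)   # query projection
|   Wk = np.random.randn(d, d)   # key projection
|   Wv = np.random.randn(d, d)   # value projection
|   Q, K, V = x @ Wq, x @ Wk, x @ Wv
|
|   # the static attention formula: softmax(Q K^T / sqrt(d)) V
|   scores = Q @ K.T / np.sqrt(d)
|   weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
|   out = weights @ V            # each output row mixes all value rows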
| ex3ndr wrote:
| It basically already does this - it can learn to ignore some
| paths and amplify the more important ones, and then you can just
| cut those paths without a noticeable loss of quality. The problem
| is that you aren't going to win anything from this - non-matrix
| multiplication would be slower or the same.
| ActorNightly wrote:
| The issue is that you are thinking of this in terms of
| information compression, which is what LLMs are.
|
| I'm more concerned with an LLM having the ability to be
| trained to the point where a subset of the graph represents
| all the NAND gates necessary for a CPU and RAM, so when you
| ask it questions it can actually run code to compute the
| answers accurately instead of offering a statistical best
| guess, i.e., decompression after lossy compression.
| exe34 wrote:
| Just give it a computer? Even a virtual machine. It can
| output assembly code or high level code that gets compiled.
| ActorNightly wrote:
| The issue is not having access to the CPU; the issue is
| the model being able to be trained in such a way that it
| has representative structures for applicable problem
| solving. Furthermore, the structures themselves should
|
| Philosophically, you can't just keep ad hoc-ing
| functionality on top of LLMs and expect major progress.
| Sure, you can make them better, but you will never get to
| the state where AI is massively useful.
|
| For example, let's say you gather a whole bunch of experts
| in their respective fields, and you give them the task of
| putting together a detailed plan for how to build a flying
| car. You will have people doing design, doing simulations,
| researching material sourcing, creating CNC programs for
| manufacturing parts, sourcing tools and equipment, writing
| software, etc. And when executing this plan, they would be
| open to feedback about anything missed, and could advise on
| how to proceed.
|
| An AI with the above capability should be able to go out on
| the internet, gather the relevant data, run any sort of
| algorithms it needs to run, and perhaps after a month of
| number crunching on a cloud-rented TPU rack produce a
| step-by-step plan with costs for how to do all of that. And
| it would be better than those experts because it should be
| able to create much higher fidelity simulations to account
| for things like vibration and predict if some connector is
| going to wobble loose.
| brcmthrowaway wrote:
| Does the human brain use transformers?
| exe34 wrote:
| No, but they can both implement a language virtual machine
| which appears to be able to produce intelligent behaviour with
| unknown bounds.
| mirekrusin wrote:
| Yes, through e.g. services like OpenAI's ChatGPT.
| dartos wrote:
| What?
| __loam wrote:
| Anyone telling you it does is a fraud.
| bloopernova wrote:
| Do LLMs use neural nets? If so, what makes up the "neuron"? i.e.
| Is there a code structure that underlies the neuron, or is it
| "just" fancy math?
| theonlybutlet wrote:
| Yes to both, the "neuron" would basically be a
| probability/weighted parameter. A parameter is an expression,
| it's a mathematical representation of a token and it's
| weighting (theyre translated from/to input/output token lists
| entering and exiting the model). Usually tokens are pre-set
| small groups of character combinations like "if " or "cha" that
| make up a word/sentence. The recorded path your value takes
| down the chain of probabilities would be the "neural pathway"
| within the wider "neural network".
|
| Someone please correct me if I'm wrong or my terminology is
| wrong.
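|
| A toy sketch of the token -> vector part, with a made-up
| vocabulary and random embedding values, just to show the shape of
| the idea:
|
|   import numpy as np
|
|   vocab = {"cha": 0, "t": 1, "if ": 2, "gpt": 3}  # hypothetical tokens
|   embeddings = np.random.randn(len(vocab), 4)     # one vector per token
|
|   tokens = ["cha", "t"]             # "chat", split by hand here
|   ids = [vocab[t] for t in tokens]  # [0, 1]
|   vectors = embeddings[ids]         # what the model actually sees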
| dartos wrote:
| Transformers can be considered a kind of neural network.
|
| It's mainly fancy math. With tools like PyTorch or TensorFlow,
| you use Python to describe a graph of computations which gets
| compiled down into optimized instructions.
|
| There are some examples of people making transformers and other
| NN architectures in about 100 lines of code. I'd google for
| those to see what these things look like in code.
|
| The training loop, data, and resulting weights are where the
| magic is.
|
| The code is disappointingly simple.
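|
| For a sense of how little code that is, a rough sketch of a tiny
| PyTorch module (just a two-layer network, not a full transformer):
|
|   import torch
|   import torch.nn as nn
|
|   class TinyNet(nn.Module):
|       def __init__(self, d_in=4, d_hidden=8, d_out=2):
|           super().__init__()
|           self.layers = nn.Sequential(
|               nn.Linear(d_in, d_hidden),  # learned weights + bias
|               nn.ReLU(),                  # the non-linear activation
|               nn.Linear(d_hidden, d_out),
|           )
|
|       def forward(self, x):
|           return self.layers(x)
|
|   net = TinyNet()
|   out = net(torch.randn(1, 4))  # the "graph of computations" in action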
| bloopernova wrote:
| > The code is disappointingly simple.
|
| I absolutely adore this sentence, it made me laugh to imagine
| coders or other folks looking at the code and thinking
| "That's it?!? But that's simple!"
|
| Although it feels a little similar to some of the basic
| reactions that go to make up DNA: start with simple units
| that work together to form something much more complex.
|
| (apologies for poor metaphors, I'm still trying to grasp some
| of the concepts involved with this)
| osanseviero wrote:
| Just math, and not even that fancy.
|
| Let's say you want to predict if you'll pass an exam based on
| how many hours you studied (x1) and how many exercises you did
| (x2). A neuron will learn a weight for each variable (w1 and
| w2). If the model learns w1=0.5 and w2=1, the model will
| provide more importance to the # of exercises.
|
| So if you study for 10 hours and only do 2 exercises, the model
| will do x1*w1 + x2*w2 = 10*0.5 + 2*1 = 7. The neuron then outputs
| that. This is a bit (but not much) simplified - we also have a
| bias term and an activation to process the output.
|
| Congrats! We built our first neuron together! Have thousands of
| these neurons in connected layers, and you suddenly have a deep
| neural network. Have billions or trillions of them, you have an
| LLM :)
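|
| The same toy neuron in a few lines of Python, with the bias and an
| activation (a sigmoid here, chosen just for illustration) included:
|
|   import math
|
|   def neuron(x1, x2, w1=0.5, w2=1.0, bias=0.0):
|       z = x1 * w1 + x2 * w2 + bias   # weighted sum: 10*0.5 + 2*1 = 7
|       return 1 / (1 + math.exp(-z))  # squash to (0, 1)
|
|   print(neuron(x1=10, x2=2))  # ~0.999, i.e. you'll very likely pass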
| abrichr wrote:
| The "neuron" in a neural network is just a non-linear function
| of the weighted sum of the inputs (plus a bias term).
|
| See the "definition" section in
| https://en.wikipedia.org/wiki/Perceptron .
| enriquto wrote:
| For a drier, more formal and succinct approach, see "The
| Transformer Model in Equations" [0], by John Thickstun. The whole
| thing fits in a single page, using standard mathematical
| notation.
|
| [0] https://johnthickstun.com/docs/transformers.pdf
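|
| For reference, the central transformer equation is the standard
| scaled dot-product attention from "Attention Is All You Need":
|
|   Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V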
| dogline wrote:
| Six paragraphs in, and I already have questions.
|
| > Hello -> [1,2,3,4] World -> [2,3,4,5]
|
| The vectors are random, but they look like they have a pattern
| here. Does the 2 in both vectors mean something? Or, is it
| entire set that makes it unique?
| dan-robertson wrote:
| The number reuse is just the author being a bit lazy. You could
| estimate how similar these vectors are by seeing if they point
| in similar directions or by calculating the angle between them.
| Here they are only about 6 degrees apart, pointing in nearly the
| same direction, but a lot of this is that the author didn't want
| to put any negative numbers in the example, so the vectors end up
| being more similar than they would be in reality.
|
| That the numbers are reused isn't meaningful here: a 1 in the
| first position is quite unrelated to a 1 in the second (as no
| convolutions are done over this vector).
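|
| A quick check of that with the two example vectors (cosine
| similarity and the angle it implies):
|
|   import numpy as np
|
|   hello = np.array([1, 2, 3, 4])
|   world = np.array([2, 3, 4, 5])
|
|   cos = hello @ world / (np.linalg.norm(hello) * np.linalg.norm(world))
|   angle = np.degrees(np.arccos(cos))
|   print(round(cos, 3), round(angle, 1))  # ~0.994 and ~6.4 degrees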
| adamnemecek wrote:
| It's a renormalization process. It can be modelled as a
| convolution in a Hopf algebra.
| snaxsnaxsnax wrote:
| Hmmm, yes, I know some of these words.
| naitgacem wrote:
| Reading the title I thought this was about electrical
| transformers :p
|
| This may be HN, but my electrical background is still stronger.
|
| And by the way, is it worth it to invest time to get some idea
| about this whole AI field? I'm from a compE background.
___________________________________________________________________
(page generated 2024-01-03 23:00 UTC)