[HN Gopher] Alice's adventures in a differentiable wonderland
___________________________________________________________________
Alice's adventures in a differentiable wonderland
Author : tosh
Score : 216 points
Date : 2024-04-30 17:03 UTC (1 day ago)
(HTM) web link (www.sscardapane.it)
(TXT) w3m dump (www.sscardapane.it)
| glonq wrote:
| I wonder if the usage of Alice & Wonderland takes inspiration
| from Douglas Hofstadter's "Godel, Escher, Bach: an Eternal Golden
| Braid" ?
| iainmerrick wrote:
| It's a pretty common trope, especially for math-related books,
| e.g. Alex Bellos' "Alex's Adventures in Numberland".
| devnonymous wrote:
| No, I think the inspiration is more direct
| https://duckduckgo.com/?q=lewis+carroll+alice+in+wonderland+...
| abdullahkhalids wrote:
| Lewis Carroll's Alice in Wonderland features a number of logical
| and mathematical puzzles [1]
|
| He also wrote What the Tortoise Said to Achilles (1895) in
| which the paradoxes of Zeno are discussed.
|
| So it's more correct to say that GEB and this article are
| originally inspired by Lewis Carroll's work.
|
| [1] I wrote a short article for my university magazine a long
| time ago. Some interesting references at the end
| https://abd.tiddlyspot.com/#%5B%5BMathematical%20Adventures%...
| xanderlewis wrote:
| > Stripped of anything else, neural networks are compositions of
| differentiable primitives
|
| I'm a sucker for statements like this. It almost feels
| philosophical, and makes the whole subject so much more
| comprehensible in only a single sentence.
|
| I think Francois Chollet says something similar in his book on
| deep learning: one shouldn't fall into the trap of
| anthropomorphising and mysticising models based on the 'neural'
| name; deep learning is simply the application of sequences of
| operations that are nonlinear (and hence capable of encoding
| arbitrary complexity) but nonetheless differentiable and so
| efficiently optimisable.
| andoando wrote:
| What does "differentiable primitives" mean here?
| CobrastanJorji wrote:
| Continuous mathematical functions which have derivatives.
| xanderlewis wrote:
| I think it's referring to 'primitive functions' in the sense
| that they're the building blocks of more complicated
| functions. If f and g are differentiable, f+g, fg, f/g (as
| long as g is never zero)... and so on are differentiable too.
| Importantly, f _composed with_ g is also differentiable, and
| so since the output of the whole network as a function of its
| input is a composition of these 'primitives' it's
| differentiable too.
|
| The actual primitive functions in this case would be things
| like the weighted sums of activations in the previous layer
| to get the activation of a given layer, and the actual
| 'activation functions' (traditionally something like a
| sigmoid function; these days a ReLU) associated with each
| layer.
|
| 'Primitives' is also sometimes used as a synonym for
| _antiderivatives_ , but I don't think that's what it means
| here.
|
| Edit: it just occurred to me from a comment below that you
| might have meant to ask what the 'differentiable' part means.
| See https://en.wikipedia.org/wiki/Differentiable_function.
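|
| If it helps, here's a minimal PyTorch sketch of the
| 'composition of differentiable primitives' idea (assuming
| torch is available; the two functions are just illustrative):
| autograd applies the chain rule through the composite for you.
|
|       import torch
|
|       # Two differentiable "primitives".
|       f = lambda x: torch.sin(x)      # elementwise sine
|       g = lambda x: x ** 2 + 1.0      # simple polynomial
|
|       x = torch.tensor(0.5, requires_grad=True)
|       y = f(g(x))                     # the composition f(g(x))
|       y.backward()                    # chain rule, applied automatically
|
|       # Analytic check: d/dx sin(x^2 + 1) = cos(x^2 + 1) * 2x
|       xd = x.detach()
|       print(x.grad, torch.cos(xd ** 2 + 1.0) * 2 * xd)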
| andoando wrote:
| Is this function composition essentially lambda calculus
| then?
| OJFord wrote:
| Function composition is just f(g(x)), considered as a
| single function that's the composition of f and g; it has
| the domain of g and the range of f.
|
| In lambda calculus terminology it's an 'application'
| (with a function argument).
| xanderlewis wrote:
| Composition here just means what it does for any two
| functions: the value of the 'composition' of _f_ and _g_
| at _x_ is defined to be _f_ applied to _g_ applied to
| _x_. In symbols, it's: _(f∘g)(x) := f(g(x))_ for each _x_ in
| the domain of _g_. It may seem obvious, but the fact that
| this new thing is also a function (that is, its value is
| well-defined for every input) is actually a very useful
| thing indeed and leads to... well, most of mathematics.
|
| You can certainly _do_ function composition in lambda
| calculus: in fact, the act of composition itself is a
| higher order function (takes functions and returns a
| function) and you can certainly express it formally with
| lambda terms and such. It's not really got anything to do
| with any particular language or model of computation
| though.
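|
| For example, composition as a higher-order function in
| Python (a toy sketch; the names are made up):
|
|       def compose(f, g):
|           # takes two functions, returns their composite
|           return lambda x: f(g(x))
|
|       double = lambda x: 2 * x
|       inc = lambda x: x + 1
|
|       h = compose(double, inc)   # h(x) = 2 * (x + 1)
|       print(h(3))                # 8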
| andoando wrote:
| I didn't form my question too well. I understand all
| that. What I am asking is: are these function
| compositions equivalent or similar to
| functions in lambda calculus?
|
| I guess my question is: what are the primitive functions
| here doing?
| xanderlewis wrote:
| Well, yes, to the extent that functions are functions are
| functions (they're just associations or mappings or
| whatever you want to call them).
|
| Maybe your question boils down to asking something more
| general like: what's the difference between functions _to
| a computer scientist_ (or a programmer) and functions _to
| a mathematician_? That is, are 'functions' in C (or
| lambda calculus), say, the same 'functions' we talk about
| in calculus?
|
| The answer to that is: in this case, because these are
| quite simple functions (sums and products and
| compositions thereof) they're the same. In general,
| they're a bit different. The difference is basically the
| difference between functional programming and
| 'traditional' programming. If you have state/'side
| effects' of functions, then your function won't be a
| function _in the sense of mathematics_ ; if the return
| value of your function depends entirely on the input and
| doesn't return different values depending on whatever
| else is happening in the program, then it will be.
|
| Since you're asking about lambda calculus in particular,
| the answer is that they're the same because lambda
| calculus doesn't have state. It's 'purely functional' in
| that sense.
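|
| A toy Python illustration of that state/no-state
| distinction (hypothetical names):
|
|       counter = 0
|
|       def impure_add(x):
|           global counter
|           counter += 1        # hidden state, not part of the input
|           return x + counter
|
|       def pure_add(x, counter):
|           return x + counter  # output depends only on the inputs
|
|       print(impure_add(1), impure_add(1))    # 2 3: same input, different output
|       print(pure_add(1, 1), pure_add(1, 1))  # 2 2: a function in the maths sense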
|
| > I guess my question is: what are the primitive functions
| here doing?
|
| I'm not really sure what you mean. They're doing what
| functions always do. Every computer program is abstractly
| a (partial) function.
|
| Does that help, or have I misunderstood?
| andoando wrote:
| So when I think of functions in lambda calculus, I think
| of the I,S,K functions which when composed can produce
| functions like "copy", "add", "remove", "if", etc which
| then can do different computations like "copy every other
| symbol if the symbol is true", "multiply 5 times then add
| 2". Since lambda calculus is complete, any
| computation/program can be composed.
|
| When I think of functions in a traditional mathematical
| sense, I think about transformations of numbers. x->2x,
| x->2x^2, etc. I completely understand composition of
| functions here, ex x->2(x->2x)^2, but its unclear how
| these transformations relate to computation. For a
| regression problem, I can totally understand how finding
| the right compositions of functions can lead to better
| approximations. So I am wondering, in an LLM
| architecture, what computations do these functions
| actually represent? I assume, it has something to do with
| what path to take through the neural layers. I probably
| just need to take the time to study it deeper.
|
| >If you have state/'side effects' of functions, then your
| function won't be a function in the sense of mathematics;
| if the return value of your function depends entirely on
| the input and doesn't return different values depending
| on whatever else is happening in the program, then it
| will be.
|
| Totally understood from the perspective of functions in
| say, Java. Though fundamentally I don't think there is a
| distinction between functions in computer science and
| mathematics. The program as a whole is effectively a
| function. The "global" state is from another reference,
| just local variables of the encompassing function. If a
| function is modifying variables outside of the "function
| block" (in say Java), the "input" to the function isn't
| just the parameters of the function. Imo, this is more of
| an artifact of implementation of some languages rather
| than a fundamental difference. Python for example
| requires declaring global args in the function block. Go
| one step further and require putting global args into the
| parameters list and you're pretty close to satisfying
| this.
| xanderlewis wrote:
| I think you're actually massively overthinking it.
|
| The state of a neural network is described entirely by
| its parameters, which usually consist of a long array
| (well, a matrix, or a tensor, or whatever...) of floating
| point numbers. What is being optimised when a network is
| trained is _these parameters_ and nothing else. When you
| evaluate a neural network on some input (often called
| performing 'inference'), that is when the functions we're
| talking about are used. You start with the input vector,
| and you apply all of those functions in order and you get
| the output vector of the network. The training process
| also uses these functions, because to train a network you
| have to perform evaluation repeatedly in between tweaking
| those parameters to make it better approximate the
| desired output for each input. Importantly, the functions
| do not change. They are constant; it's the parameters
| that change. The functions are the architecture -- not
| the thing being learned. Essentially what the parameters
| represent is how likely each neuron is to be activated
| (have a high value) if others in the previous layer are.
| So you can think of the parameters as encoding strengths
| of connections between each pair of neurons in
| consecutive layers. Thinking about 'what path to take
| through the neural layers' is way too sophisticated --
| it's not doing anything like that.
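|
| As a sketch of that separation (NumPy only; shapes and
| names are illustrative): the functions below never change,
| and training would only ever adjust the arrays W1, b1, W2,
| b2.
|
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       W1, b1 = rng.normal(size=(16, 4)), np.zeros(16)  # the parameters:
|       W2, b2 = rng.normal(size=(3, 16)), np.zeros(3)   # the only things trained
|
|       def forward(x):
|           h = np.maximum(0.0, W1 @ x + b1)  # fixed function: affine map + ReLU
|           return W2 @ h + b2                # fixed function: affine map
|
|       y = forward(rng.normal(size=4))       # evaluation ("inference")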
|
| > Though fundamentally I don't think there is a distinction
| between functions in computer science and mathematics.
| The program as a whole is effectively a function.
|
| You're pretty much right about that, but there are two
| important problems/nitpicks:
|
| (1) We can't prove (in general) that a given program will
| halt and evaluate to something (rather than just looping
| forever) on a given input, so the 'entire program' is
| instead what's called a _partial function_. This means
| that it's still a function on its domain -- but we can't
| know what its precise domain is. Given an input, it may
| or may not produce an output. If it does, though, it's
| well defined because it's a deterministic process.
|
| (2) You're right to qualify that it's the _whole program_
| that is (possibly) a function. If you take a function
| from some program that depends on some state in that same
| program, then clearly that function won't be a proper
| 'mathematical' function. Sure, if you incorporate that
| extra state as one of your inputs, it might be, but
| that's a _different function_. You have to remember that
| in mathematics, unlike in programming, a function
| consists essentially of three pieces of data: a domain, a
| codomain, and a 'rule'. If you want to be set-theoretic
| and formal about it, this rule is just a subset of the
| cartesian product of its domain and codomain (it's a set
| of pairs of the form (x, f(x))). If you change either of
| these sets, it's technically a different function and
| there are good reasons for distinguishing between these.
| So it's not right to say that mathematical functions and
| functions in a computer program are _exactly_ the same.
| andoando wrote:
| I appreciate your responses; sorry, I hope I don't seem
| like I'm arguing for the sake of arguing.
|
| >Essentially what the parameters represent is how likely
| each neuron is to be activated (have a high value) if
| others in the previous layer are. So you can think of the
| parameters as encoding strengths of connections between
| each pair of neurons in consecutive layers. Thinking
| about 'what path to take through the neural layers' is
| way too sophisticated -- it's not doing anything like
| that.
|
| I'm a little confused. The discussion thus far has been
| about how neural networks are essentially just compositions of
| functions, but you are now saying that the function is
| static, and only the parameters change.
|
| But that aside, if these parameters change which neurons
| are activated, and this activation affects which neurons
| are activated in the next layer, are these parameters
| effectively not changing the path taken through the
| layers?
|
| >Sure, if you incorporate that extra state as one of your
| inputs, it might be, but that's a different function.
|
| So say we have this program:
|
|       let c = 2;
|       function sum3(a, b) { return a + b + c; }
|       let d = sum3(3, 4);
|
| I believe you are saying, if we had constructed this
| instead as
|
|       function sum3(a, b, c) { return a + b + c; }
|       let d = sum3(3, 4, 2);
|
| then, this is a different function.
|
| Certainly, these are different in a sense, but at a
| fundamental level, when you compile this all down and run
| it, there is an equivalency in the _transformation_ that
| is happening. That is, the two functions equivalently
| take some input state A (composed of a,b,c) and return
| the same output state B, while applying the same
| intermediary steps (add a to b, add c to the result of
| (add a to b)). Really, in the first case where c is defined
| outside the scope of the function block, the interpreter
| is effectively producing the function sum3(x, y, c) as it
| has to at some point, one way or another, inject c into
| a + b + c.
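|
| The same point in Python, for what it's worth (a toy
| sketch):
|
|       c = 2
|
|       def sum3_closure(a, b):
|           return a + b + c        # c captured from the enclosing scope
|
|       def sum3_explicit(a, b, c):
|           return a + b + c        # c passed in explicitly
|
|       print(sum3_closure(3, 4), sum3_explicit(3, 4, 2))   # 9 9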
|
| Similarly, I won't argue that the current, formal
| definitions of functions in mathematics are exactly that
| of functions as they're generally defined in programming.
|
| Rather, what I am saying is that there is an equivalent way
| to think and study functions that equally apply to both
| fields. That is, a function is simply a transformation
| from A to B, where A and B can be anything, whether that
| is bits, numbers, or any other construction in any
| system. The only primitive distinction to make here is
| whether A and B are the same thing or different.
| dirkc wrote:
| When I did "AI" it would have meant the sigmoid function,
| these days it's something like ReLU.
| esafak wrote:
| functions, which you can compose to increase their
| expressiveness, and run gradient descent on to train.
|
| The success of deep learning is basically attributable to
| composable (expressive), differentiable (learnable)
| functions. The "deep" moniker alludes to the
| compositionality.
| captainclam wrote:
| Ugh, exactly, it's so cool. I've been a deep learning
| practitioner for ~3 years now, and I feel like this notion has
| really been impressed upon me only recently.
|
| I've spent an awful lot of mental energy trying to conceive of
| how these things work, when really it comes down to "does
| increasing this parameter improve the performance on this task?
| Yes? Move the dial up a bit. No? Down a bit..." x 1e9.
|
| And the cool part is that this yields such rich, interesting,
| sometimes even useful, structures!
|
| I like to think of this cognitive primitive as the analogue to
| the idea that thermodynamics is just the sum of particles
| bumping into each other. At the end of the day, that really is
| just it, but the collective behavior is something else
| entirely.
| xanderlewis wrote:
| > At the end of the day, that really is just it, but the
| collective behavior is something else entirely.
|
| Exactly. It's not to say that neat descriptions like this are
| the end of the story (or even the beginning of it). If they
| were, there would be no need for this entire field of study.
|
| But they are cool, and can give you a really clear
| conceptualisation of something that can appear more like a
| sum of disjoint observations and ad hoc tricks than a
| discipline based on a few deep principles.
| JackFr wrote:
| NAND gates by themselves are kind of dull, but it's pretty
| cool what you can do with a billion of them.
| kadushka wrote:
| _it comes down to "does increasing this parameter improve the
| performance on this task? Yes? Move the dial up a bit. No?
| Down a bit..." x 1e9_
|
| This is not how gradient based NN optimization works. What
| you described is called "random weight perturbation", a
| variant of evolutionary algorithms. It does not scale to
| networks larger than a few thousand parameters for obvious
| reasons.
|
| NNs are optimized by directly computing a gradient which
| tells us the direction to go to reduce the loss on the
| current batch of training data. There's no trying up or down
| and seeing if it worked - we always know which direction to
| go.
|
| SGD and RWP are two completely different approaches to
| learning optimal NN weights.
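|
| Roughly, one step of each looks like this (a toy sketch on
| a quadratic loss; names and step sizes are illustrative,
| and real training batches data, tunes learning rates, etc.):
|
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       w = rng.normal(size=5)
|       loss = lambda w: np.sum((w - 1.0) ** 2)  # toy loss, minimum at w = 1
|       lr = 0.1
|
|       # SGD-style step: compute the gradient, move against it.
|       grad = 2.0 * (w - 1.0)                   # known analytically here
|       w_sgd = w - lr * grad
|
|       # RWP-style step: try a random nudge, keep it only if the loss drops.
|       trial = w + 0.01 * rng.normal(size=5)
|       w_rwp = trial if loss(trial) < loss(w) else w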
| xanderlewis wrote:
| I don't think the author _literally_ meant tweaking the
| parameters and seeing what happens; it's probably an
| analogy meant to give a sense of how the gradient indicates
| what direction and to what degree the parameters should be
| tweaked. Basically, substitute 'the gradient is positive'
| for 'increasing this parameter decreases performance' and
| vice versa and it becomes correct.
| p1esk wrote:
| That substitution is the main difference between SGD and
| RWP.
|
| It's like describing bubble sort when you meant to
| describe quick sort. Would not fly on an ML 101 exam, or
| in an ML job interview.
| xanderlewis wrote:
| It's not like that at all. You couldn't accidentally
| sound like you're describing quick sort when describing
| bubble sort, or vice versa. I can't think of any
| substitution of a few words that would do that.
|
| The _meaning_ of the gradient is perfectly adequately
| described by the author. They weren't describing an
| algorithm for _computing_ it.
| a_random_canuck wrote:
| I don't think anyone is trying to pass an exam here, but
| just give an understandable overview to a general
| audience.
| captainclam wrote:
| I guess you could say I don't know RWP from Adam! :D
|
| My og comment wasn't to accurately explain gradient
| optimization, I was just expressing a sentiment not
| especially aimed at experts and not especially requiring
| details.
|
| Though I'm afraid I subjected you to the same "cringe" I
| experience when I read pop sci/tech articles describing deep
| learning optimization as "the algorithm" being "rewarded"
| or "punished," haha.
| kadushka wrote:
| No worries, we're all friends here!
|
| It's just that you happened to accidentally describe the idea
| behind RWP, which is a gradient-free optimization method,
| so I thought I should point it out.
| SkyBelow wrote:
| Before the recent AI boom, I was mystified by the possibility
| of AI and emulating humans (in no small part thanks to works of
| fiction showing AI powered androids). Then I created and
| trained some neural networks. Smaller ones, doing much of
| nothing special. That was enough to break the mysticism. To
| realize it was just multiplying matrices. Training them was a
| bit more advanced, but still applied mathematics.
|
| Only recently have I begun to appreciate that the simplicity of
| the operation, applied to large enough matrices, may still
| capture enough of the nature of intelligence and sentience. In
| the end we can be broken down into (relatively) simple chemical
| reactions, and it is the massive scale of these reactions that
| creates real intelligence and sentience.
| naasking wrote:
| Exactly, the people who are derisive of those who consider ML
| models to exhibit glimmers of true intelligence because it's
| only matrix multiplications always amuse me. It's like they
| don't even realize the contradiction in holding the position
| that seemingly complex and intelligent outward behaviour
| should not be used as an indication of actual complexity and
| intelligence.
| citizen_friend wrote:
| If you study this historically you will see that every
| generation thinks they have the mechanism to explain brains
| (gear systems, analog control/cybernetics, perceptrons).
|
| My conclusion is we tend to overestimate our understanding
| and the power of our inventions.
| naasking wrote:
| The difference is that we now actually have a proof of
| computational power and computational universality.
| citizen_friend wrote:
| Analog circuits have the same computational power.
| Piecewise linear functions have the same computational
| universality.
| naasking wrote:
| Except we didn't know any of that, nor did we know how to
| construct physical analogs in order to achieve universal
| computation. At best, we had limited task-specific
| computation, like clocks and planetary motion.
| mistermann wrote:
| Next step, in case you get bored: _why_ (in fact, _not a
| minor distinction_ ) does such a simple approach work so
| well?
| gessha wrote:
| It is soothing to the mind because it conveys that it's
| understandable but it doesn't take away from the complexity.
| You still have to read through math and pytorch code and debug
| nonsensical CUDA errors, comb through the data, etc etc
| whimsicalism wrote:
| the complexity is in the values learned from the
| optimization. even the pytorch code for a simple transformer
| is not that complex, attention is a simple mechanism, etc.
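|
| for instance, the core of scaled dot-product attention
| really is a few lines (a rough sketch: single head, no
| masking or projections, assuming torch):
|
|       import torch
|       import torch.nn.functional as F
|
|       def attention(Q, K, V):
|           # Q, K, V: (batch, seq, d) tensors
|           d = Q.shape[-1]
|           scores = Q @ K.transpose(-2, -1) / d ** 0.5  # query-key similarity
|           weights = F.softmax(scores, dim=-1)          # normalize over keys
|           return weights @ V                           # weighted mix of values
|
|       Q = K = V = torch.randn(1, 10, 64)
|       out = attention(Q, K, V)                         # (1, 10, 64)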
| gessha wrote:
| Complexity also comes from the number of papers that work
| out how different elements of a network work and how to
| intuitively change them.
|
| Why do we use conv operators, why do we use attention
| operators, when do we use one over the other? What
| augmentations do you use, how big of a dataset do you need,
| how do you collect the dataset, etc etc etc
| whimsicalism wrote:
| idk, just using attention and massive web crawls gets you
| pretty far. a lot of the rest is more product-style
| decisions about what personality you want your LM to
| take.
|
| I fundamentally don't think this technology is that
| complex.
| gessha wrote:
| No? In his recent tutorial, Karpathy showed just how much
| complexity there is in the tokenizer.
|
| This technology has been years in the making with many
| small advances pushing the performance ever so slightly.
| There have been theoretical and engineering advances that
| contributed to where we are today. And we need many more
| to get the technology to an actually usable level instead
| of the current word spaghetti that we get.
|
| Also, the post is generally about neural networks and not
| just LMs.
|
| When making design decisions about an ML system you
| shouldn't just choose the attention hammer and hammer
| away. There's a lot of design constraints you need to
| consider which is why I made the original reply.
| whimsicalism wrote:
| Are there micro-optimizations that eke out small
| advancements? Yes, absolutely - the modern tokenizer is a
| good example of that.
|
| Is the core of the technology that complex? No. You could
| get very far with a naive tokenizer that just tokenized
| by words and replaced unknown words with <unk>. This is
| extremely simple to implement and I've trained
| transformers like this. It (of course) makes a perplexity
| difference but the core of the technology is not changed
| and is quite simple. Most of the complexity is in the
| hardware, not the software innovations.
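|
| Something like this (a toy sketch; the corpus and names are
| made up):
|
|       def build_vocab(corpus):
|           words = sorted({w for line in corpus for w in line.split()})
|           return {w: i for i, w in enumerate(["<unk>"] + words)}
|
|       def tokenize(text, vocab):
|           return [vocab.get(w, vocab["<unk>"]) for w in text.split()]
|
|       vocab = build_vocab(["the cat sat", "the dog ran"])
|       print(tokenize("the cat flew", vocab))  # unseen word -> <unk> (id 0)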
|
| > And we need many more to get the technology to an
| actually usable level instead of the current word
| spaghetti that we get.
|
| I think the current technology is useable.
|
| > you shouldn't just choose the attention hammer and
| hammer away
|
| It's a good first choice of hammer, tbph.
| jxy wrote:
| > > Stripped of anything else, neural networks are compositions
| of differentiable primitives
|
| > I'm a sucker for statements like this. It almost feels
| philosophical, and makes the whole subject so much more
| comprehensible in only a single sentence.
|
| And I hate inaccurate statements like this. It pretends to be
| rigorously mathematical, but really just propagates erroneous
| information, and makes the whole article so much more amateur
| in only a single sentence.
|
| The simple relu is continuous but not differentiable at 0, and
| its derivative is discontinuous at 0.
| xanderlewis wrote:
| It's not 'inaccurate'. The mark of true mastery is an ability
| to make terse statements that convey a huge amount without
| involving excessive formality or discussion of by-the-by
| technical details. If ever you've spoken to world-renowned
| experts in pure mathematics or other highly technical and
| pedantic fields, you'll find they'll say all sorts of
| 'inaccurate' things in conversation (or even in written
| documents). It doesn't make them worthless; far from it.
|
| If you want to have a war of petty pedantry, let's go: the
| derivative of ReLU can't be _discontinuous at zero_ , as you
| say, because continuity (or indeed discontinuity) of a
| function at x requires the function to have a value at x
| (which is the negation of what your first statement correctly
| claims).
| kragen wrote:
| my experience with world-renowned experts in pure
| mathematics is that they are much more careful than the
| average bear to explicitly qualify inaccurate things as
| inaccurate, because their discipline requires them to be
| very clear about precisely what they are saying
|
| discontinuity of a function at _x_ does not, according to
| the usual definition of 'continuity', require the function
| to have a value at _x_ ; indeed, functions that fail to
| have a value at _x_ are necessarily discontinuous there,
| precisely because (as you say) they are not continuous
| there. https://en.wikipedia.org/wiki/Continuous_function#De
| finition...
|
| there are other definitions of 'discontinuous' in use, but
| i can't think of one that would give the result you claim
| xanderlewis wrote:
| > they are much more careful than the average bear to
| explicitly qualify inaccurate things as inaccurate
|
| Sure. But what part of this _entirely worded in natural
| language, and very short_ statement made you think it was
| a technical, formal statement? I think you're just taking
| an opportunity to flex your knowledge of basic calculus,
| and deliberately attributing intent to the author that
| isn't there in order to look clever.
|
| Regarding a function being discontinuous at a point
| outside its domain: if you take a completely naive view
| of what 'discontinuous' means, then I suppose you can say
| so. But discontinuity is just the logical negation of
| continuity. Observe:
|
| To say that _f: X → Y (in this context, a real-valued
| function of real numbers) is continuous_ means precisely
|
| ∀x∈X ∀ε>0 ∃δ>0 ∀p∈X: |x - p| < δ ⇒ |f(x) - f(p)| < ε
|
| and so its negation looks like
|
| ∃x∈X ¬ ...
|
| that is, there is a point _in X, the domain of f_ where
| continuity fails.
|
| For example, you wouldn't talk about a function defined
| on the integers being _discontinuous at pi_ , would you?
| That would just be weird.
|
| To prove the point further, observe that the set of
| discontinuities (according to your definition) of any
| given function would actually include _every_ number...
| in fact every mathematical object in the universe --
| which would make it not even a set in ZFC. So it's
| absurd.
|
| Even more reasons to believe functions can only be
| discontinuous at points of their domain: a function is
| said to be discontinuous if it has at least one
| discontinuity. By your definition, _every_ function is
| discontinuous.
|
| ...anyway, I said we were going to be petty. I'm trying
| to demonstrate this is a waste of time by wasting my own
| time.
| laingc wrote:
| Because memes aren't allowed on HN, you're not allowed to
| reply with the "akssshuallllyyy" meme, so you had to go
| to these lengths.
|
| ¯\\_(ツ)_/¯
| xanderlewis wrote:
| You're actually not far off. I'm somewhat embarrassed by
| the above, but I think it makes the point.
| kragen wrote:
| you have an interesting point of view, and some of the
| things you have said are correct, but if you try to use
| gradient descent on a function from, say, Z → R, you are
| going to be a very sad xanda. i would indeed describe
| such a function as being discontinuous not just at _p_
| but everywhere, at least with the usual definition of
| continuity (though there is a sense in which such a
| function could be, for example, scott-continuous)
|
| even in the case of a single discontinuity in the
| derivative, like in relu', you lose the intermediate
| value theorem and everything that follows from it; it's
| not an inconsequential or marginally relevant fact
| jj3 wrote:
| Note that any function Z → R is continuous on its domain
| but nowhere differentiable.
|
| A Scott-continuous function Z → R must be monotone. So
| not every such function is Scott-continuous.
| kragen wrote:
| aha, thanks!
| makerdiety wrote:
| Invoking excessive formality and discussions of minute
| technical details leads to a cathedral of knowledge built
| on autistic pedantry. The chosen rabbit hole to get lost in
| needs to be the correct one. And human science is riddled
| with the paths that have naive or childish fundamentals.
| mistermann wrote:
| This comment makes me want to both upvote and downvote
| with extreme enthusiasm/fury!
|
| The sign of a truly good conversation?
| kragen wrote:
| > _a cathedral of knowledge built on autistic pedantry_
|
| this is certainly true, but more often we use its short
| name, 'math'. it turns out to be far more effective than
| so-called common sense
| newrotik wrote:
| Lack of differentiability is actually a very important
| feature of the underlying optimization problem.
|
| You might think that it doesn't matter because ReLU is,
| e.g., non-differentiable "only at one point".
|
| Gradient based methods (what you find in pytorch) generally
| rely on the idea that gradients should taper to 0 in the
| proximity of a local optimum. This is not the case for non-
| differentiable functions, and in fact gradients can be made
| to be arbitrarily large even very close to the optimum.
|
| As you may imagine, it is not hard to construct examples
| where simple gradient methods that do not properly take
| these facts into account fail to converge. These examples
| are not exotic.
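|
| For instance, plain gradient descent with a fixed step on
| f(x) = |x| (non-differentiable at its minimum, with
| gradient always +-1) just oscillates (a toy sketch):
|
|       x, lr = 0.05, 0.1
|       for _ in range(5):
|           grad = 1.0 if x > 0 else -1.0  # |x|: gradient never tapers to 0
|           x -= lr * grad
|           print(x)                       # bounces around 0, never settles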
| whimsicalism wrote:
| it's pretty close to accurate, the lack of differentiability
| at 0 for relu doesn't really come into play in practice
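|
| in practice autograd just picks a value from the
| subdifferential at 0; if i recall correctly, pytorch uses
| 0 there:
|
|       import torch
|
|       x = torch.tensor(0.0, requires_grad=True)
|       torch.relu(x).backward()
|       print(x.grad)   # tensor(0.) -- a convention, not a true derivative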
| zackmorris wrote:
| Ya so far this is the best introduction to neural networks from
| first principles that I've seen.
|
| Quickly skimming the draft pdf at
| https://arxiv.org/pdf/2404.17625 I can grok it instantly,
| because it's written in familiar academic language instead of
| gobbledygook. Anyone with an undergrad math education in
| engineering, computer science, etc or a self-taught equivalent
| understanding of differential equations should be able to read
| it easily. It does a really good job of connecting esoteric
| terms like tensors with arrays, gradients with partial
| derivatives, Jacobians with gradients and backpropagation with
| gradient descent in forward/reverse mode automatic
| differentiation. Which helps the reader to grasp the
| fundamentals instead of being distracted by the implementation
| details of TensorFlow, CUDA, etc. Some notable excerpts:
|
| Introduction (page 4): By viewing neural
| networks as simply compositions of differentiable primitives we
| can ask two basic questions (Figure F.1.3): first, what data
| types can we handle as inputs or outputs? And second, what sort
| of primitives can we use? Differentiability is a strong
| requirement that does not allow us to work directly with many
| standard data types, such as characters or integers, which are
| fundamentally discrete and hence discontinuous. By contrast, we
| will see that differentiable models can work easily with more
| complex data represented as large arrays (what we will call
| tensors) of numbers, such as images, which can be manipulated
| algebraically by basic compositions of linear and nonlinear
| transformations.
|
| Chapter 2.2 Gradients and Jacobians (page 23):
| [just read this section - it connects partial derivatives,
| gradients, Jacobians and Taylor's theorem - wow!]
|
| Chapter 4.1.5 Some computational considerations (page 59):
| In general, we will always prefer algorithms that scale
| linearly both in the feature dimension c and in the batch size
| n, since super-linear algorithms will become quickly
| impractical (e.g., a batch of 32 RGB images of size 1024x1024
| has c ≈ 1e7). We can avoid a quadratic complexity in the
| equation of the gradient by computing the multiplications in
| the correct order, i.e., computing the matrix-vector product Xw
| first. Hence, pure gradient descent is linear in both c and n,
| but only if proper care is taken in the implementation:
| generalizing this idea is the fundamental insight for the
| development of reverse-mode automatic differentiation, a.k.a.
| back-propagation (Section 6.3).
|
| Chapter 6 Automatic differentiation (page 87):
| We consider the problem of efficiently computing gradients of
| generic computational graphs, such as those induced by
| optimizing a scalar loss function on a fully-connected neural
| network, a task called automatic differentiation (AD) [BPRS18].
| You can think of a computational graph as the set of atomic
| operations (which we call primitives) obtained by running the
| program itself. We will consider sequential graphs for brevity,
| but everything can be easily extended to more sophisticated,
| acyclic computational graphs. The problem may seem
| trivial, since the chain rule of Jacobians (Section 2.2,
| (E.2.22)) tells us that the gradient of function composition is
| simply the matrix product of the corresponding Jacobian
| matrices. However, efficiently implementing this is the key
| challenge, and the resulting algorithm (reverse-mode AD or
| backpropagation) is a cornerstone of neural networks and
| differentiable programming in general [GW08, BR24].
| Understanding it is also key to understanding the design (and
| the differences) of most frameworks for implementing and
| training such programs (such as TensorFlow or PyTorch or JAX).
| A brief history of the algorithm can be found in [Gri12].
|
| Edit: I changed Chapter 2.2.3 Jacobians (page 27) to Chapter
| 2.2 Gradients and Jacobians (page 23) for better context.
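|
| To make the Chapter 6 point concrete, here's a tiny sketch
| of my own (not the book's code) of reverse-mode AD as
| "multiply the Jacobians in the cheap order", using vector-
| Jacobian products in NumPy:
|
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       W = rng.normal(size=(3, 5))
|       x = rng.normal(size=5)
|
|       # Forward pass: x -> z = W x -> h = tanh(z) -> L = sum(h)
|       z = W @ x
|       h = np.tanh(z)
|       L = h.sum()
|
|       # Reverse pass: start from dL/dL = 1 and multiply by each
|       # Jacobian from the output back toward the input. Each step
|       # is a cheap vector-matrix product; no full Jacobian is
|       # ever formed.
|       dL_dh = np.ones_like(h)         # Jacobian of sum() is a row of ones
|       dL_dz = dL_dh * (1.0 - h ** 2)  # tanh has a diagonal Jacobian
|       dL_dx = W.T @ dL_dz             # Jacobian of the linear map is W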
| barrenko wrote:
| Thank you for the summary, and thanks to the OP/author of the
| book.
|
| I started self-studying programming some time ago, then
| pivoted to AI/ML and (understandably) ended up mostly
| studying math, these resources are a boon to my folk.
| phkahler wrote:
| >> one shouldn't fall into the trap of anthropomorphising and
| mysticising models based on the 'neural' name
|
| And yet, artificial neural networks ARE an approximation of how
| biological neurons work. It is worth noting that they came out
| of neurobiology and not some math department - well at least in
| the forward direction, I'm not sure who came up with the
| training algorithms (probably the math folks). Should they be
| considered mystical? No. I would also posit that biological
| neurons are more efficient and probably have better learning
| algorithms than artificial ones today.
|
| I'm confused as to why some people seem to shun the biological
| equivalence of these things. In a recent thread here I learned
| that physical synaptic weights (in our brains) are at least
| partly stored in DNA or its methylation. If that isn't
| fascinating I'm not sure what is. Or is it more along the lines
| of intelligence can be reduced to a large number of simple
| things, and biology has given us an interesting physical
| implementation?
| kragen wrote:
| anns originated in hypotheses about how neurobiology might
| work in the 01940s but diverged completely from neurobiology
| in the 01960s; they contain nothing we've learned about
| neurons in the last 50 years, and not much from before that
| either (they don't, for example, do hebbian learning).
| current anns use training methods like gradient descent with
| momentum and activation functions like relu which have no
| plausible biological realization
|
| artificial neural networks are an approximation of biological
| neural networks in the same way that a submarine is an
| approximation of a fish
| xanderlewis wrote:
| As the commenter below mentions, the biological version of a
| neuron (i.e. a neuron) is _much_ more complicated than the
| neural network version. The neural network version is
| essentially just a weighted sum, with an extra layer of
| shaping applied afterwards to make it nonlinear. As far as I
| know, we still don't understand all of the complexity about
| how biological neurons work. Even skimming the Wikipedia page
| for 'neuron' will give you some idea.
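|
| In code, the artificial version is roughly just this (a
| sketch; the sigmoid is one traditional choice of 'shaping'):
|
|       import numpy as np
|
|       def artificial_neuron(x, w, b):
|           s = np.dot(w, x) + b               # weighted sum of inputs
|           return 1.0 / (1.0 + np.exp(-s))    # nonlinear shaping (sigmoid)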
|
| The original idea of approximating something like a neuron
| using a weighted sum (which is a fairly obvious idea, given
| the initial discovery that neurons become 'activated' and
| they do so in proportion to how much the neurons they are
| connected to are) did come from thinking about biological
| brains, but the mathematical building blocks are incredibly
| simple and are hundreds of years old, if not thousands.
| naasking wrote:
| > the biological version of a neuron (i.e. a neuron) is
| much more complicated than the neural network version
|
| This is a difference of degree not of kind, because neural
| networks are Turning complete. Whatever additional
| complexity the neuron has can itself be modelled as a
| neural network.
|
| Edit: meaning, that if the greater complexity of a
| biological neuron is relevant to its information processing
| component, then that just increases the number of
| artificial neural network neurons needed to describe it, it
| does not need any computation of a different kind.
| xanderlewis wrote:
| PowerPoint is Turing complete. Does that mean PowerPoint
| should be regarded as being biological or at least
| neuroscience-inspired?
| naasking wrote:
| No, but neural networks literally were inspired by
| biology so I'm not sure what your point is.
| xanderlewis wrote:
| My point is that you seem to think neurons in the sense
| of artificial neural networks and neurons in the human
| brain are equivalent because:
|
| (1) Neural networks are Turing complete, and hence can do
| anything brains can. [debatable anyway; We don't know
| this to be the case since brains might be doing more than
| computation. Ask a philosopher or a cognitive scientist.
| Or Roger Penrose.]
|
| (2) Neural networks were very loosely inspired by the
| idea that the human brain is made up of interconnected
| nodes that 'activate' in proportion to how other related
| nodes do.
|
| I don't think that's nearly enough to say that they're
| equivalent. For (1), we don't yet know (and we're not
| even close), and anyway: if you consider all Turing
| complete systems to be equivalent to the point of it
| being a waste of time to talk about their differences
| then you can say goodbye to quite a lot of work in
| theoretical computer science. For (2): so what? Lots of
| things are inspired by other things. It doesn't make them
| in any sense equivalent, especially if the analogy is as
| weak as it is in this case. No neuroscientist thinks that
| a weighted sum is an adequate (or even remotely accurate)
| model of a real biological neuron. They operate on
| completely different principles, as we now know much
| better than when such things were first dreamed up.
| naasking wrote:
| The brain certainly could be doing super-Turing
| computation, but that would overturn quite a bit of
| physics seeing as how not even quantum computers are more
| powerful than Turing machines (they're just faster on
| some problems). Extraordinary claims and all that.
|
| As for equivalency, that depends on how that's defined.
| Real neurons would not feature any more computational
| power than Turing machines or artificial neural networks,
| but I never said it would be a waste of time to talk
| about their differences. I merely pointed out that the
| artificial neural network model is _still sufficient_ ,
| even if real neurons have more complexity.
|
| > No neuroscientist thinks that a weighted sum is an
| adequate (or even remotely accurate) model of a real
| biological neuron
|
| Fortunately that's not what I said. If the neuron indeed
| has more relevant complexity, then it wouldn't be one
| weighted sum = one biological neuron, but one biological
| neuron = a network of weighted sums, since such a network
| can model any function.
| xanderlewis wrote:
| The original comment you were in defence of was
| suggesting that artificial neurons were somehow very
| close to biological ones, since supposedly that's where
| their inspiration came from.
|
| If you're interested in pure computational 'power', then
| if the brain is nothing more than a Turing machine
| (which, as you agree, it might not be), fine. You can
| call them 'equivalent'. It's just not very meaningful.
|
| What's interesting about neural nets has nothing to do
| with _what_ they can compute; indeed they can compute
| anything any other Turing machine can, and nothing more.
| What's interesting is _how_ they do it, since they can
| 'learn' and hence allow us to produce solutions to hard
| problems without any explicit programming or traditional
| analysis of the problem.
|
| > that would overturn quite a bit of physics
|
| Our physics is currently woefully incomplete, so... yes.
| That would be welcome.
| andoando wrote:
| And assembly is also turing complete, so if two models
| being both Turing completeness means they are equivalent,
| there would be no need for coding neural networks at all.
| Would you consider LLMs a different kind of computation
| than writing assembly code?
|
| Perhaps fundamentally they are not, but its also true
| that just writing more and more random assembly code
| isn't going to lead to an LLM.
| naasking wrote:
| LLMs aren't randomly generated though, they are shaped by
| training data. This means there would, in principle, be a
| comparable way to synthesize an equivalent assembly
| program from that same training data.
|
| The difference here is that it's just more obvious how to
| do this in one case than the other.
|
| My point was only that 1) neural networks are sufficient,
| even if real neurons have additional complexity, and 2)
| whatever that additional complexity, artificial neural
| networks can learn to reproduce it.
| andoando wrote:
| I understand that; what I am saying, though, is that the
| fact that they can doesn't mean that they will by simply
| scaling their number. It still entirely depends on how
| they are trained/arranged, meaning it may take a
| completely different way of composing/gluing neurons
| together to simulate any additional complexity. It's like
| saying a NAND gate is Turing complete, I put 1000000000
| of them in series, but it's not doing anything, what
| gives, do I need to add a billion more?
|
| Just as modeling and running a single neuron takes x
| transistors configured in a very specific way, for
| example, it may take y neurons arranged in some very
| specific, unknown way to model something that has extra
| properties.
|
| And it's not clear either whether neurons, rather than
| some other kind of node, are fundamentally the right
| building block for reaching this higher-level
| construction.
| srean wrote:
| > This is a difference of degree not of kind
|
| Nope.
|
| Neurons in our brain operate _fundamentally_ differently.
| They work by transient spikes and information is carried
| not by the intensity of the spike voltage, but by the
| frequency of spiking. This is a fundamentally different
| phenomenon than ANNs, where the output (voltage) is a
| squashed (nonlinearly transformed) aggregate of the input
| values (voltages).
| phkahler wrote:
| >> Neurons in our brain operate fundamentally
| differently. They work by transient spikes and
| information is carried not by the intensity of the spike
| voltage, but by the frequency of spiking.
|
| I thought they worked like accumulators where the spike
| "energy" accumulates until the output "fires". If that's
| the case then the artificial NNs are still an
| approximation of that process. I agree that this is a
| significant difference, but the mathematical version is
| still a rough approximation inspired by the biological
| one.
| srean wrote:
| > And yet, artificial neural networks ARE an approximation of
| how biological neurons work
|
| For a non-vapid/non-vacuous definition of 'approximation'
| this is not true at all. It is well understood that (i) back-
| propagation is biologically infeasible in the brain, and (ii)
| the idea that the output 'voltage' is a transformed weighted
| average of the input 'voltages' is not how neurons operate.
| (ii) is in the 'not even wrong' category.
|
| Neurons operate in terms of spikes and frequency and
| quiescence of spiking. If you are interested any undergrad
| text in neurobiology will help correct the wrong notions.
| chriswarbo wrote:
| > And yet, artificial neural networks ARE an approximation of
| how biological neurons work.
|
| Only if you limit yourself to "sums of weighted inputs, sent
| through a 1D activation function".
|
| However, the parent said "differentiable primitives": these
| days people have built networks that contain differentiable
| ray-tracers, differentiable physics simulations, etc. Those
| seem like crazy ideas if we limit ourselves to the "neural"
| analogy; but are quite natural for a "composition of
| differentiable primitives" approach.
| jimbokun wrote:
| I always get the impression that even the proponents of these
| algorithms, from back when they didn't seem so promising, are
| shocked at the capabilities demonstrated by models built with
| such a relatively simple procedure.
| naasking wrote:
| > one shouldn't fall into the trap of anthropomorphising and
| mysticising models based on the 'neural' name
|
| One also shouldn't fall into the dual trap of assuming that
| just because one understands how a model works, it cannot have
| any bearing on the ever-mysterious operation of the brain.
| sideshowb wrote:
| > deep learning is simply the application of sequences of
| operations that are nonlinear but nonetheless differentiable
|
| Though other things fit this description which are not deep
| learning. Like (shameless plug) my recent paper here
| https://ieeexplore.ieee.org/document/10497907
| jonas21 wrote:
| I feel like this statement is both obvious after spending a few
| minutes working with neural networks and completely useless in
| helping you build better neural networks.
|
| It's kind of like saying, "Stripped of anything else, works of
| literature are compositions of words"
| Horffupolde wrote:
| Well, I'd argue that could also be a bit enlightening. It's
| like taking a moment to appreciate that forests are composed
| of single trees. It takes a certain level of insight to
| appreciate systems at various depths.
| falcor84 wrote:
| Let me try to one-up that: "Stripped of anything else, works
| of literature are compositions of the uppercase letter I"
| m463 wrote:
| chess is pattern matching.
| Eduard wrote:
| better remove all the Disney-based Alice in Wonderland character
| intellectual property from the book.
| astrodust wrote:
| I was just thinking that's "cease and desist" bait right there.
| iainmerrick wrote:
| Alice in Wonderland (the book) is in the public domain. The
| old Disney movie is still in copyright, and the cover image
| does look very much like it's from the movie, but that
| character design is from John Tenniel's illustrations which
| are also in the public domain.
| astrodust wrote:
| The character design is. The image, however, is clearly
| Disney flavoured if not traced directly.
|
| His version, for example, does not have the distinctive
| bow. The art style is also completely different.
| iainmerrick wrote:
| True - it would be a good idea to use a Tenniel piece
| instead.
|
| _Edit to add:_ I was mostly trying to push back on the
| implication that Disney owns Alice in Wonderland (and
| Peter Pan, Winnie the Pooh, etc). Now I re-read the
| original comment, they did specify "Disney-based", so
| maybe I'm over-reacting!
| p1esk wrote:
| And then you learn about binary or ternary networks where
| gradients don't really exist anywhere, and you start to wonder
| about the importance of this differentiability.
| whimsicalism wrote:
| binary networks don't really work well unless you do a
| relaxation first
| ubj wrote:
| ...And then you start learning about generalizations of the
| notion of "gradient" to scenarios where the classical gradient
| doesn't exist :)
| gfaure wrote:
| In the literature, they're usually called convolutional layers (I
| think you can pretty much search and replace all uses of
| "convolutive" in the text).
| seanhunter wrote:
| Wow, just skimmed a bit, but this book looks amazing so far.
| Really understandable but with an intuitive presentation of the
| underlying maths that invites the reader to go deeper if they
| want to by giving them what they need to get started.
___________________________________________________________________
(page generated 2024-05-01 23:02 UTC)