[HN Gopher] Alice's adventures in a differentiable wonderland
___________________________________________________________________
Alice's adventures in a differentiable wonderland
Author : tosh
Score : 118 points
Date : 2024-04-30 17:03 UTC (5 hours ago)
(HTM) web link (www.sscardapane.it)
(TXT) w3m dump (www.sscardapane.it)
| glonq wrote:
| I wonder if the usage of Alice in Wonderland takes inspiration
| from Douglas Hofstadter's "Godel, Escher, Bach: an Eternal Golden
| Braid"?
| iainmerrick wrote:
| It's a pretty common trope, especially for math-related books,
| e.g. Alex Bellos' "Alex's Adventures in Numberland".
| devnonymous wrote:
| No, I think the inspiration is more direct
| https://duckduckgo.com/?q=lewis+carroll+alice+in+wonderland+...
| abdullahkhalids wrote:
| Lewis Carroll's Alice in Wonderland features a number of logical
| and mathematical puzzles [1]
|
| He also wrote What the Tortoise Said to Achilles (1895) in
| which the paradoxes of Zeno are discussed.
|
| So it's more correct to say that GEB and this article are
| originally inspired by Lewis Carroll's work.
|
| [1] I wrote a short article for my university magazine a long
| time ago. Some interesting references at the end
| https://abd.tiddlyspot.com/#%5B%5BMathematical%20Adventures%...
| xanderlewis wrote:
| > Stripped of anything else, neural networks are compositions of
| differentiable primitives
|
| I'm a sucker for statements like this. It almost feels
| philosophical, and makes the whole subject so much more
| comprehensible in only a single sentence.
|
| I think Francois Chollet says something similar in his book on
| deep learning: one shouldn't fall into the trap of
| anthropomorphising and mysticising models based on the 'neural'
| name; deep learning is simply the application of sequences of
| operations that are nonlinear (and hence capable of encoding
| arbitrary complexity) but nonetheless differentiable and so
| efficiently optimisable.
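|
| To make that concrete, here is a minimal sketch (assuming
| PyTorch; the shapes and names are mine) of a network as
| nothing but a composition of differentiable primitives, with
| autograd supplying the derivatives:
|
|     import torch
|
|     x = torch.randn(8, 4)   # a batch of 8 inputs, 4 features
|     W1 = torch.randn(4, 16, requires_grad=True)
|     b1 = torch.zeros(16, requires_grad=True)
|     W2 = torch.randn(16, 1, requires_grad=True)
|     b2 = torch.zeros(1, requires_grad=True)
|
|     # each step is a differentiable primitive; the model is
|     # simply their composition
|     h = torch.relu(x @ W1 + b1)     # affine map + nonlinearity
|     y = torch.sigmoid(h @ W2 + b2)  # and again
|
|     loss = y.mean()        # any differentiable scalar will do
|     loss.backward()        # chain rule end to end
|     print(W1.grad.shape)   # a gradient for every weight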
| andoando wrote:
| What does "differentiable primitives" mean here?
| CobrastanJorji wrote:
| Continuous mathematical functions which have derivatives.
| xanderlewis wrote:
| I think it's referring to 'primitive functions' in the sense
| that they're the building blocks of more complicated
| functions. If f and g are differentiable, f+g, fg, f/g (as
| long as g is never zero)... and so on are differentiable too.
| Importantly, f _composed with_ g is also differentiable, and
| so since the output of the whole network as a function of its
| input is a composition of these 'primitives' it's
| differentiable too.
|
| The actual primitive functions in this case would be things
| like the weighted sums of activations in the previous layer
| to get the activation of a given layer, and the actual
| 'activation functions' (traditionally something like a
| sigmoid function; these days a ReLU) associated with each
| layer.
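|
| As a tiny illustration (plain Python; the function names are
| mine): one primitive is the weighted sum, another is the
| activation, and the chain rule hands you the derivative of
| their composition, which a finite difference confirms:
|
|     def affine(x, w, b):      # primitive 1: weighted sum
|         return w * x + b
|
|     def relu(z):              # primitive 2: activation
|         return max(z, 0.0)
|
|     def layer(x, w, b):       # their composition
|         return relu(affine(x, w, b))
|
|     # chain rule: relu'(affine(x)) * affine'(x) = (z > 0) * w
|     def dlayer_dx(x, w, b):
|         z = affine(x, w, b)
|         return (1.0 if z > 0 else 0.0) * w
|
|     x, w, b, eps = 1.3, 0.7, -0.2, 1e-6
|     approx = (layer(x + eps, w, b) - layer(x, w, b)) / eps
|     print(dlayer_dx(x, w, b), approx)   # these agree closely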
|
| 'Primitives' is also sometimes used as a synonym for
| _antiderivatives_ , but I don't think that's what it means
| here.
|
| Edit: it just occurred to me from a comment below that you
| might have meant to ask what the 'differentiable' part means.
| See https://en.wikipedia.org/wiki/Differentiable_function.
| andoando wrote:
| Is this function composition essentially lambda calculus
| then?
| OJFord wrote:
| Function composition is just f(g(x)), considered as a
| single function that's the composition of f and g; it has
| the domain of g and the range of f.
|
| In lambda calculus terminology it's an 'application'
| (with a function argument).
| xanderlewis wrote:
| Composition here just means what it does for any two
| functions: the value of the 'composition' of _f_ and _g_
| at _x_ is defined to be _f_ applied to _g_ applied to
| _x_. In symbols, it's: _(f∘g)(x) := f(g(x))_ for each _x_ in
| the domain of _g_. It may seem obvious, but the fact that
| this new thing is also a function (that is, its value is
| well-defined for every input) is actually a very useful
| thing indeed and leads to... well, most of mathematics.
|
| You can certainly _do_ function composition in lambda
| calculus: in fact, the act of composition itself is a
| higher order function (takes functions and returns a
| function) and you can certainly express it formally with
| lambda terms and such. It's not really got anything to do
| with any particular language or model of computation
| though.
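|
| For what it's worth, the composition operator itself is a
| one-liner in Python (just to make the 'higher order
| function' point concrete; the names are mine):
|
|     compose = lambda f, g: (lambda x: f(g(x)))
|
|     double = lambda x: 2 * x
|     inc = lambda x: x + 1
|     print(compose(double, inc)(3))   # double(inc(3)) == 8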
| andoando wrote:
| I didn't form my question too well. I understand all
| that. What I am asking is: are these function
| compositions equivalent, or at least similar, to
| functions in lambda calculus?
|
| I guess my question is, what are the primitive functions
| here doing?
| xanderlewis wrote:
| Well, yes, to the extent that functions are functions are
| functions (they're just associations or mappings or
| whatever you want to call them).
|
| Maybe your question boils down to asking something more
| general like: what's the difference between functions _to
| a computer scientist_ (or a programmer) and functions _to
| a mathematician_? That is, are 'functions' in C (or
| lambda calculus), say, the same 'functions' we talk about
| in calculus?
|
| The answer to that is: in this case, because these are
| quite simple functions (sums and products and
| compositions thereof) they're the same. In general,
| they're a bit different. The difference is basically the
| difference between functional programming and
| 'traditional' programming. If you have state/'side
| effects' of functions, then your function won't be a
| function _in the sense of mathematics_ ; if the return
| value of your function depends entirely on the input and
| doesn't return different values depending on whatever
| else is happening in the program, then it will be.
|
| Since you're asking about lambda calculus in particular,
| the answer is that they're the same because lambda
| calculus doesn't have state. It's 'purely functional' in
| that sense.
|
| >I guess my question is, what are the primitive functions
| here doing?
|
| I'm not really sure what you mean. They're doing what
| functions always do. Every computer program is abstractly
| a (partial) function.
|
| Does that help, or have I misunderstood?
| dirkc wrote:
| When I did "AI", it would have meant the sigmoid function;
| these days it's something like ReLU.
| esafak wrote:
| functions, which you can compose to increase their
| expressiveness, and run gradient descent on to train.
|
| The success of deep learning is basically attributable to
| composable (expressive), differentiable (learnable)
| functions. The "deep" moniker alludes to the
| compositionality.
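|
| A minimal sketch of that recipe (assuming PyTorch; the toy
| data and sizes are mine): compose differentiable pieces,
| then run gradient descent on the result:
|
|     import torch
|
|     X = torch.randn(64, 3)                   # toy inputs
|     y = X @ torch.tensor([1.0, -2.0, 0.5])   # toy targets
|
|     w = torch.zeros(3, requires_grad=True)
|     for step in range(200):
|         loss = ((X @ w - y) ** 2).mean()  # differentiable
|         loss.backward()                   # chain rule
|         with torch.no_grad():
|             w -= 0.1 * w.grad             # descend
|             w.grad.zero_()
|     print(w)   # approaches [1.0, -2.0, 0.5]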
| captainclam wrote:
| Ugh, exactly, it's so cool. I've been a deep learning
| practitioner for ~3 years now, and I feel like this notion has
| really been impressed upon me only recently.
|
| I've spent an awful lot of mental energy trying to conceive of
| how these things work, when really it comes down to "does
| increasing this parameter improve the performance on this task?
| Yes? Move the dial up a bit. No? Down a bit..." x 1e9.
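|
| That "move the dial" picture can even be written down
| literally (a deliberately naive sketch in plain Python,
| mine, and not how training actually works; backprop gets
| all the dial directions at once from the chain rule
| instead of probing them one by one):
|
|     import random
|
|     def nudge(params, loss, lr=0.01, steps=1000):
|         for _ in range(steps):
|             i = random.randrange(len(params))
|             old = loss(params)
|             params[i] += lr          # move the dial up a bit
|             if loss(params) > old:   # worse? try the other way
|                 params[i] -= 2 * lr
|                 if loss(params) > old:
|                     params[i] += lr  # neither helped; undo
|         return params
|
|     # toy use: find w with w * 2 == 6
|     print(nudge([0.0], lambda p: (p[0] * 2 - 6) ** 2))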
|
| And the cool part is that this yields such rich, interesting,
| sometimes even useful, structures!
|
| I like to think of this cognitive primitive as the analogue to
| the idea that thermodynamics is just the sum of particles
| bumping into each other. At the end of the day, that really is
| just it, but the collective behavior is something else
| entirely.
| xanderlewis wrote:
| > At the end of the day, that really is just it, but the
| collective behavior is something else entirely.
|
| Exactly. It's not to say that neat descriptions like this are
| the end of the story (or even the beginning of it). If they
| were, there would be no need for this entire field of study.
|
| But they are cool, and can give you a really clear
| conceptualisation of something that can appear more like a
| sum of disjoint observations and ad hoc tricks than a
| discipline based on a few deep principles.
| JackFr wrote:
| NAND gates by themselves are kind of dull, but it's pretty
| cool what you can do with a billion of them.
| SkyBelow wrote:
| Before the recent AI boom, I was mystified by the possibility
| of AI and emulating humans (in no small part thanks to works of
| fiction showing AI powered androids). Then I created and
| trained some neural networks. Small ones, doing nothing
| special. That was enough to break the mysticism. To
| realize it was just multiplying matrices. Training them was a
| bit more advanced, but still applied mathematics.
|
| Only recently have I begun to appreciate that the simplicity of
| the operation, applied to large enough matrices, may still
| capture enough of the nature of intelligence and sentience. In
| the end we can be broken down into (relatively) simple chemical
| reactions, and it is the massive scale of these reactions that
| creates real intelligence and sentience.
| naasking wrote:
| Exactly, the people who are derisive of those who consider ML
| models to exhibit glimmers of true intelligence because it's
| only matrix multiplications always amuse me. It's like they
| don't even realize the contradiction in holding the position
| that seemingly complex and intelligent outward behaviour
| should not be used as an indication of actual complexity and
| intelligence.
| mistermann wrote:
| Next step, in case you get bored: _why_ (in fact, _not a
| minor distinction_ ) does such a simple approach work so
| well?
| gessha wrote:
| It is soothing to the mind because it conveys that it's
| understandable but it doesn't take away from the complexity.
| You still have to read through math and pytorch code and debug
| nonsensical CUDA errors, comb through the data, etc etc
| whimsicalism wrote:
| the complexity is in the values learned from the
| optimization. even the pytorch code for a simple transformer
| is not that complex, attention is a simple mechanism, etc.
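|
| For instance, the core of attention really is only a few
| lines (a sketch, assuming PyTorch; no masking, heads or
| projections):
|
|     import torch
|
|     def attention(q, k, v):
|         # softmax(q k^T / sqrt(d)) v
|         d = k.shape[-1]
|         scores = q @ k.transpose(-2, -1) / d ** 0.5
|         return torch.softmax(scores, dim=-1) @ v
|
|     q = k = v = torch.randn(2, 10, 64)   # 2 x 10 tokens x 64
|     print(attention(q, k, v).shape)      # (2, 10, 64)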
| gessha wrote:
| Complexity also comes from the number of papers that work
| out how different elements of a network work and how to
| intuitively change them.
|
| Why do we use conv operators, why do we use attention
| operators, when do we use one over the other? What
| augmentations do you use, how big of a dataset do you need,
| how do you collect the dataset, etc etc etc
| whimsicalism wrote:
| idk, just using attention and massive web crawls gets you
| pretty far. a lot of the rest is more product-style
| decisions about what personality you want your LM to
| take.
|
| I fundamentally don't think this technology is that
| complex.
| jxy wrote:
| > > Stripped of anything else, neural networks are compositions
| of differentiable primitives
|
| > I'm a sucker for statements like this. It almost feels
| philosophical, and makes the whole subject so much more
| comprehensible in only a single sentence.
|
| And I hate inaccurate statements like this. It pretends to be
| rigorously mathematical, but really just propagates erroneous
| information, and makes the whole article so much more amateur
| in only a single sentence.
|
| The simple relu is continuous but not differentiable at 0, and
| its derivative is discontinuous at 0.
| xanderlewis wrote:
| It's not 'inaccurate'. The mark of true mastery is an ability
| to make terse statements that convey a huge amount without
| involving excessive formality or discussion of by-the-by
| technical details. If ever you've spoken to world-renowned
| experts in pure mathematics or other highly technical and
| pedantic fields, you'll find they'll say all sorts of
| 'inaccurate' things in conversation (or even in written
| documents). It doesn't make them worthless; far from it.
|
| If you want to have a war of petty pedantry, let's go: the
| derivative of ReLU can't be _discontinuous at zero_ , as you
| say, because continuity (or indeed discontinuity) of a
| function at x requires the function to have a value at x
| (which is the negation of what your first statement correctly
| claims).
| kragen wrote:
| my experience with world-renowned experts in pure
| mathematics is that they are much more careful than the
| average bear to explicitly qualify inaccurate things as
| inaccurate, because their discipline requires them to be
| very clear about precisely what they are saying
|
| discontinuity of a function at _x_ does not, according to
| the usual definition of 'continuity', require the function
| to have a value at _x_ ; indeed, functions that fail to
| have a value at _x_ are necessarily discontinuous there,
| precisely because (as you say) they are not continuous
| there. https://en.wikipedia.org/wiki/Continuous_function#Definition...
|
| there are other definitions of 'discontinuous' in use, but
| i can't think of one that would give the result you claim
| xanderlewis wrote:
| > they are much more careful than the average bear to
| explicitly qualify inaccurate things as inaccurate
|
| Sure. But what part of this _entirely worded in natural
| language, and very short_ statement made you think it was
| a technical, formal statement? I think you're just taking
| an opportunity to flex your knowledge of basic calculus,
| and deliberately attributing intent to the author that
| isn't there in order to look clever.
|
| Regarding a function being discontinuous at a point
| outside its domain: if you take a completely naive view
| of what 'discontinuous' means, then I suppose you can say
| so. But discontinuity is just the logical negation of
| continuity. Observe:
|
| To say that _f: X --> Y (in this context, a real-valued
| function of real numbers) is continuous_ means precisely
|
| ∀x∈X ∀ε>0 ∃δ>0 ∀p∈X: |x - p| < δ ⟹ |f(x) - f(p)| < ε
|
| and so its negation looks like
|
| ∃x∈X ∃ ...
|
| that is, there is a point _in X, the domain of f_ where
| continuity fails.
|
| For example, you wouldn't talk about a function defined
| on the integers being _discontinuous at pi_ , would you?
| That would just be weird.
|
| To prove the point further, observe that the set of
| discontinuities (according to your definition) of any
| given function would actually include _every_ number...
| in fact every mathematical object in the universe --
| which would make it not even a set in ZFC. So it's
| absurd.
|
| Even more reasons to believe functions can only be
| discontinuous at points of their domain: a function is
| said to be discontinuous if it has at least one
| discontinuity. By your definition, _every_ function is
| discontinuous.
|
| ...anyway, I said we were going to be petty. I'm trying
| to demonstrate this is a waste of time by wasting my own
| time.
| laingc wrote:
| Because memes aren't allowed on HN, you couldn't just
| reply with the "akssshuallllyyy" meme, so you had to go
| to these lengths.
|
| ¯\_(ツ)_/¯
| xanderlewis wrote:
| You're actually not far off. I'm somewhat embarrassed by
| the above, but I think it makes the point.
| makerdiety wrote:
| Invoking excessive formality and discussions of minute
| technical details leads to a cathedral of knowledge built
| on autistic pedantry. The chosen rabbit hole to get lost in
| needs to be the correct one. And human science is riddled
| with the paths that have naive or childish fundamentals.
| mistermann wrote:
| This comment makes me want to both upvote and downvote
| with extreme enthusiasm/fury!
|
| The sign of a truly good conversation?
| whimsicalism wrote:
| it's pretty close to accurate; the lack of differentiability
| at 0 for relu doesn't really come into play in practice
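|
| in practice the frameworks just pick a value for the
| derivative at the kink; PyTorch, for instance, returns 0
| there (a quick check, assuming torch is installed):
|
|     import torch
|
|     x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
|     torch.relu(x).sum().backward()
|     print(x.grad)   # tensor([0., 0., 1.]); the kink gets 0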
| zackmorris wrote:
| Ya so far this is the best introduction to neural networks from
| first principles that I've seen.
|
| Quickly skimming the draft pdf at
| https://arxiv.org/pdf/2404.17625 I can grok it instantly,
| because it's written in familiar academic language instead of
| gobbledygook. Anyone with an undergrad math education in
| engineering, computer science, etc or a self-taught equivalent
| understanding of differential equations should be able to read
| it easily. It does a really good job of connecting esoteric
| terms like tensors with arrays, gradients with partial
| derivatives, Jacobians with gradients and backpropagation with
| gradient descent in forward/reverse mode automatic
| differentiation. Which helps the reader to grasp the
| fundamentals instead of being distracted by the implementation
| details of TensorFlow, CUDA, etc. Some notable excerpts:
|
| Introduction (page 4): By viewing neural
| networks as simply compositions of differentiable primitives we
| can ask two basic questions (Figure F.1.3): first, what data
| types can we handle as inputs or outputs? And second, what sort
| of primitives can we use? Differentiability is a strong
| requirement that does not allow us to work directly with many
| standard data types, such as characters or integers, which are
| fundamentally discrete and hence discontinuous. By contrast, we
| will see that differentiable models can work easily with more
| complex data represented as large arrays (what we will call
| tensors) of numbers, such as images, which can be manipulated
| algebraically by basic compositions of linear and nonlinear
| transformations.
|
| Chapter 2.2 Gradients and Jacobians (page 23):
| [just read this section - it connects partial derivatives,
| gradients, Jacobians and Taylor's theorem - wow!]
|
| Chapter 4.1.5 Some computational considerations (page 59):
| In general, we will always prefer algorithms that scale
| linearly both in the feature dimension c and in the batch size
| n, since super-linear algorithms will become quickly
| impractical (e.g., a batch of 32 RGB images of size 1024x1024
| has c ≈ 1e7). We can avoid a quadratic complexity in the
| equation of the gradient by computing the multiplications in
| the correct order, i.e., computing the matrix-vector product Xw
| first. Hence, pure gradient descent is linear in both c and n,
| but only if proper care is taken in the implementation:
| generalizing this idea is the fundamental insight for the
| development of reverse-mode automatic differentiation, a.k.a.
| back-propagation (Section 6.3).
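|
| [The ordering point is easy to see in code - a numpy sketch
| of mine, not from the book: for the least-squares gradient
| X^T(Xw - y), two matrix-vector products stay linear in n
| and c, whereas forming X^T X first costs O(n c^2).]
|
|     import numpy as np
|
|     n, c = 10_000, 500
|     X = np.random.randn(n, c)
|     w = np.random.randn(c)
|     y = np.random.randn(n)
|
|     # linear in n and c: two matrix-vector products
|     grad_fast = 2 * X.T @ (X @ w - y) / n
|
|     # same result, but builds the c-by-c matrix X^T X first
|     grad_slow = (2 * (X.T @ X) @ w - 2 * X.T @ y) / n
|
|     print(np.allclose(grad_fast, grad_slow))   # True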
|
| Chapter 6 Automatic differentiation (page 87):
| We consider the problem of efficiently computing gradients of
| generic computational graphs, such as those induced by
| optimizing a scalar loss function on a fully-connected neural
| network, a task called automatic differentiation (AD) [BPRS18].
| You can think of a computational graph as the set of atomic
| operations (which we call primitives) obtained by running the
| program itself. We will consider sequential graphs for brevity,
| but everything can be easily extended to more sophisticated,
| acyclic computational graphs. The problem may seem
| trivial, since the chain rule of Jacobians (Section 2.2,
| (E.2.22)) tells us that the gradient of function composition is
| simply the matrix product of the corresponding Jacobian
| matrices. However, efficiently implementing this is the key
| challenge, and the resulting algorithm (reverse-mode AD or
| backpropagation) is a cornerstone of neural networks and
| differentiable programming in general [GW08, BR24].
| Understanding it is also key to understanding the design (and
| the differences) of most frameworks for implementing and
| training such programs (such as TensorFlow or PyTorch or JAX).
| A brief history of the algorithm can be found in [Gri12].
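|
| [For the sequential graphs mentioned above, reverse-mode AD
| fits in a few lines - a toy scalar sketch of mine, not the
| book's code: run the primitives forward, record each local
| derivative, then multiply them back to front.]
|
|     import math
|
|     # each primitive: (function, its local derivative)
|     primitives = [
|         (lambda x: 3.0 * x, lambda x: 3.0),
|         (math.tanh, lambda x: 1 - math.tanh(x) ** 2),
|         (lambda x: x ** 2, lambda x: 2 * x),
|     ]
|
|     def forward_and_backward(x):
|         local_grads = []
|         for f, df in primitives:   # forward pass
|             local_grads.append(df(x))
|             x = f(x)
|         grad = 1.0
|         for d in reversed(local_grads):   # backward pass
|             grad *= d
|         return x, grad
|
|     # value and derivative of tanh(3x)^2 at x = 0.5
|     print(forward_and_backward(0.5))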
|
| Edit: I changed Chapter 2.2.3 Jacobians (page 27) to Chapter
| 2.2 Gradients and Jacobians (page 23) for better context.
| phkahler wrote:
| >> one shouldn't fall into the trap of anthropomorphising and
| mysticising models based on the 'neural' name
|
| And yet, artificial neural networks ARE an approximation of how
| biological neurons work. It is worth noting that they came out
| of neurobiology and not some math department - well at least in
| the forward direction, I'm not sure who came up with the
| training algorithms (probably the math folks). Should they be
| considered mystical? No. I would also posit that biological
| neurons are more efficient and probably have better learning
| algorithms than artificial ones today.
|
| I'm confused as to why some people seem to shun the biological
| equivalence of these things. In a recent thread here I learned
| that physical synaptic weights (in our brains) are at least
| partly stored in DNA or its methylation. If that isn't
| fascinating, I'm not sure what is. Or is it more along the lines
| that intelligence can be reduced to a large number of simple
| things, and biology has given us an interesting physical
| implementation?
| kragen wrote:
| anns originated in hypotheses about how neurobiology might
| work in the 01940s but diverged completely from neurobiology
| in the 01960s; they contain nothing we've learned about
| neurons in the last 50 years, and not much from before that
| either (they don't, for example, do hebbian learning).
| current anns use training methods like gradient descent with
| momentum and activation functions like relu which have no
| plausible biological realization
|
| artificial neural networks are an approximation of biological
| neural networks in the same way that a submarine is an
| approximation of a fish
| xanderlewis wrote:
| As the commenter below mentions, the biological version of a
| neuron (i.e. a neuron) is _much_ more complicated than the
| neural network version. The neural network version is
| essentially just a weighted sum, with an extra layer of
| shaping applied afterwards to make it nonlinear. As far as I
| know, we still don't understand all of the complexity about
| how biological neurons work. Even skimming the Wikipedia page
| for 'neuron' will give you some idea.
|
| The original idea of approximating something like a neuron
| using a weighted sum (which is a fairly obvious idea, given
| the initial discovery that neurons become 'activated', and do
| so in proportion to how strongly the neurons they are
| connected to are activated) did come from thinking about
| biological brains, but the mathematical building blocks are
| incredibly simple and are hundreds of years old, if not
| thousands.
| naasking wrote:
| > the biological version of a neuron (i.e. a neuron) is
| much more complicated than the neural network version
|
| This is a difference of degree not of kind, because neural
| networks are Turing complete. Whatever additional
| complexity the neuron has can itself be modelled as a
| neural network.
|
| Edit: meaning that if the greater complexity of a
| biological neuron is relevant to its information processing
| component, then that just increases the number of
| artificial neural network neurons needed to describe it; it
| does not require any computation of a different kind.
| xanderlewis wrote:
| PowerPoint is Turing complete. Does that mean PowerPoint
| should be regarded as being biological or at least
| neuroscience-inspired?
| naasking wrote:
| No, but neural networks literally were inspired by
| biology so I'm not sure what your point is.
| jimbokun wrote:
| I always get the impression that even the people who backed
| these algorithms when they didn't seem so promising are
| shocked at the capabilities demonstrated by models built
| with such a relatively simple procedure.
| naasking wrote:
| > one shouldn't fall into the trap of anthropomorphising and
| mysticising models based on the 'neural' name
|
| One also shouldn't fall into the dual trap of assuming that
| just because one understands how a model works, it cannot have
| any bearing on the ever-mysterious operation of the brain.
| sideshowb wrote:
| > deep learning is simply the application of sequences of
| operations that are nonlinear but nonetheless differentiable
|
| Though other things fit this description which are not deep
| learning. Like (shameless plug) my recent paper here
| https://ieeexplore.ieee.org/document/10497907
| Eduard wrote:
| better remove all the Disney-based Alice in Wonderland character
| intellectual property from the book.
| p1esk wrote:
| And then you learn about binary or ternary networks where
| gradients don't really exist anywhere, and you start to wonder
| about the importance of this differentiability.
| whimsicalism wrote:
| binary networks don't really work well unless you do a
| relaxation first
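|
| the usual trick is a straight-through estimator: binarize
| on the forward pass, but backpropagate as if the op were
| the (clipped) identity (a sketch, assuming PyTorch; not
| from the article):
|
|     import torch
|
|     class SignSTE(torch.autograd.Function):
|         @staticmethod
|         def forward(ctx, x):
|             ctx.save_for_backward(x)
|             return x.sign()   # forward: a hard step
|
|         @staticmethod
|         def backward(ctx, grad_output):
|             (x,) = ctx.saved_tensors
|             # identity gradient, zeroed outside [-1, 1]
|             return grad_output * (x.abs() <= 1).float()
|
|     w = torch.randn(5, requires_grad=True)
|     SignSTE.apply(w).sum().backward()
|     print(w.grad)   # gradients flow despite the flat sign()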
| ubj wrote:
| ...And then you start learning about generalizations of the
| notion of "gradient" to scenarios where the classical gradient
| doesn't exist :)
| gfaure wrote:
| In the literature, they're usually called convolutional layers (I
| think you can pretty much search and replace all uses of
| "convolutive" in the text).
___________________________________________________________________
(page generated 2024-04-30 23:00 UTC)