[HN Gopher] Alice's adventures in a differentiable wonderland
       ___________________________________________________________________
        
       Alice's adventures in a differentiable wonderland
        
       Author : tosh
       Score  : 118 points
       Date   : 2024-04-30 17:03 UTC (5 hours ago)
        
 (HTM) web link (www.sscardapane.it)
 (TXT) w3m dump (www.sscardapane.it)
        
       | glonq wrote:
        | I wonder if the usage of Alice in Wonderland takes inspiration
        | from Douglas Hofstadter's "Gödel, Escher, Bach: an Eternal Golden
        | Braid"?
        
         | iainmerrick wrote:
         | It's a pretty common trope, especially for math-related books,
         | e.g. Alex Bellos' "Alex's Adventures in Numberland".
        
         | devnonymous wrote:
         | No, I think the inspiration is more direct
         | https://duckduckgo.com/?q=lewis+carroll+alice+in+wonderland+...
        
         | abdullahkhalids wrote:
          | Lewis Carroll's Alice in Wonderland features a number of logical
         | and mathematical puzzles [1]
         | 
         | He also wrote What the Tortoise Said to Achilles (1895) in
         | which the paradoxes of Zeno are discussed.
         | 
         | So it's more correct to say that GEB and this article are
          | originally inspired by Lewis Carroll's work.
         | 
         | [1] I wrote a short article for my university magazine a long
         | time ago. Some interesting references at the end
         | https://abd.tiddlyspot.com/#%5B%5BMathematical%20Adventures%...
        
       | xanderlewis wrote:
       | > Stripped of anything else, neural networks are compositions of
       | differentiable primitives
       | 
       | I'm a sucker for statements like this. It almost feels
       | philosophical, and makes the whole subject so much more
       | comprehensible in only a single sentence.
       | 
       | I think Francois Chollet says something similar in his book on
       | deep learning: one shouldn't fall into the trap of
       | anthropomorphising and mysticising models based on the 'neural'
       | name; deep learning is simply the application of sequences of
       | operations that are nonlinear (and hence capable of encoding
       | arbitrary complexity) but nonetheless differentiable and so
       | efficiently optimisable.
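        | 
        | A minimal sketch of that idea in PyTorch (purely illustrative;
        | the layer sizes here are made up): the network really is just
        | a composition of differentiable building blocks, so autograd
        | can push gradients through it end to end.
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     # a composition of differentiable primitives:
        |     # affine map -> nonlinearity -> affine map
        |     net = nn.Sequential(
        |         nn.Linear(4, 8),   # weighted sums (differentiable)
        |         nn.ReLU(),         # nonlinearity
        |         nn.Linear(8, 1),
        |     )
        | 
        |     x = torch.randn(2, 4)
        |     loss = net(x).sum()
        |     loss.backward()        # gradients flow through the composition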
        
         | andoando wrote:
         | What does "differentiable primitives" mean here?
        
           | CobrastanJorji wrote:
           | Continuous mathematical functions which have derivatives.
        
           | xanderlewis wrote:
           | I think it's referring to 'primitive functions' in the sense
           | that they're the building blocks of more complicated
           | functions. If f and g are differentiable, f+g, fg, f/g (as
           | long as g is never zero)... and so on are differentiable too.
           | Importantly, f _composed with_ g is also differentiable, and
           | so since the output of the whole network as a function of its
           | input is a composition of these 'primitives' it's
           | differentiable too.
           | 
           | The actual primitive functions in this case would be things
           | like the weighted sums of activations in the previous layer
           | to get the activation of a given layer, and the actual
           | 'activation functions' (traditionally something like a
           | sigmoid function; these days a ReLU) associated with each
           | layer.
           | 
           | 'Primitives' is also sometimes used as a synonym for
           | _antiderivatives_ , but I don't think that's what it means
           | here.
           | 
           | Edit: it just occurred to me from a comment below that you
           | might have meant to ask what the 'differentiable' part means.
           | See https://en.wikipedia.org/wiki/Differentiable_function.
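            | 
            | To make that concrete, here's a single layer written out
            | by hand (a toy sketch; the shapes and values are
            | invented). The 'primitives' are the affine map and the
            | elementwise nonlinearity, and their composition is still
            | differentiable, which is what lets autograd fill in the
            | gradients:
            | 
            |     import torch
            | 
            |     W = torch.randn(3, 5, requires_grad=True)  # weights
            |     b = torch.zeros(3, requires_grad=True)     # biases
            |     a_prev = torch.randn(5)    # previous layer's activations
            | 
            |     z = W @ a_prev + b         # primitive 1: weighted sum
            |     a = torch.sigmoid(z)       # primitive 2: activation function
            |     a.sum().backward()         # the composition is differentiable:
            |                                # W.grad and b.grad are now filled in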
        
             | andoando wrote:
             | Is this function composition essentially lambda calculus
             | then?
        
               | OJFord wrote:
               | Function composition is just f(g(x)), considered as a
               | single function that's the composition of f and g; it has
                | the domain of g and the range of f.
               | 
               | In lambda calculus terminology it's an 'application'
               | (with a function argument).
        
               | xanderlewis wrote:
               | Composition here just means what it does for any two
               | functions: the value of the 'composition' of _f_ and _g_
               | at _x_ is defined to be _f_ applied to _g_ applied to
               | _x_. In symbols, its: _f[?]g := f(g(x))_ for each _x_ in
               | the domain of _f_. It may seem obvious, but the fact that
               | this new thing is also a function (that is, its value is
               | well-defined for every input) is actually a very useful
               | thing indeed and leads to... well, most of mathematics.
               | 
               | You can certainly _do_ function composition in lambda
               | calculus: in fact, the act of composition itself is a
               | higher order function (takes functions and returns a
               | function) and you can certainly express it formally with
               | lambda terms and such. It's not really got anything to do
               | with any particular language or model of computation
               | though.
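                | 
                | As a toy illustration in plain Python (nothing to do
                | with any particular framework), composition itself is
                | just a higher-order function:
                | 
                |     def compose(f, g):
                |         # the function x -> f(g(x))
                |         return lambda x: f(g(x))
                | 
                |     double = lambda x: 2 * x
                |     inc = lambda x: x + 1
                | 
                |     h = compose(double, inc)   # h(x) = 2 * (x + 1)
                |     print(h(3))                # 8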
        
               | andoando wrote:
               | I didn't form my question too well. I understand all
                | that. What I am asking is, are these function
                | compositions equivalent or similar to functions in
                | lambda calculus?
               | 
               | I guess my question, is what are the primitive functions
               | here doing?
        
               | xanderlewis wrote:
               | Well, yes, to the extent that functions are functions are
               | functions (they're just associations or mappings or
               | whatever you want to call them).
               | 
               | Maybe your question boils down to asking something more
               | general like: what's the difference between functions _to
               | a computer scientist_ (or a programmer) and functions _to
               | a mathematician_? That is, are 'functions' in C (or
               | lambda calculus), say, the same 'functions' we talk about
               | in calculus?
               | 
               | The answer to that is: in this case, because these are
               | quite simple functions (sums and products and
               | compositions thereof) they're the same. In general,
               | they're a bit different. The difference is basically the
               | difference between functional programming and
               | 'traditional' programming. If you have state/'side
               | effects' of functions, then your function won't be a
               | function _in the sense of mathematics_ ; if the return
               | value of your function depends entirely on the input and
               | doesn't return different values depending on whatever
               | else is happening in the program, then it will be.
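                | 
                | A throwaway illustration of that distinction (the
                | names here are invented): the first is a function in
                | the mathematical sense, the second is not, because
                | its output depends on hidden state.
                | 
                |     counter = 0
                | 
                |     def pure_square(x):
                |         # depends only on the input: a mathematical function
                |         return x * x
                | 
                |     def impure_square(x):
                |         # also depends on (and mutates) external state
                |         global counter
                |         counter += 1
                |         return x * x + counter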
               | 
               | Since you're asking about lambda calculus in particular,
               | the answer is that they're the same because lambda
               | calculus doesn't have state. It's 'purely functional' in
               | that sense.
               | 
               | >I guess my question, is what are the primitive functions
               | here doing?
               | 
               | I'm not really sure what you mean. They're doing what
               | functions always do. Every computer program is abstractly
               | a (partial) function.
               | 
               | Does that help, or have I misunderstood?
        
           | dirkc wrote:
           | When I did "AI" it would have meant the sigmoid function,
           | these days it's something like ReLU.
        
           | esafak wrote:
           | functions, which you can compose to increase their
           | expressiveness, and run gradient descent on to train.
           | 
           | The success of deep learning is basically attributable to
           | composable (expressive), differentiable (learnable)
           | functions. The "deep" moniker alludes to the
           | compositionality.
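            | 
            | A minimal sketch of that recipe (toy data, made-up
            | numbers): compose differentiable operations, then let
            | gradient descent adjust the parameters.
            | 
            |     import torch
            | 
            |     x = torch.linspace(-1, 1, 32).unsqueeze(1)  # toy inputs
            |     y = 3 * x + 0.5                             # toy targets
            | 
            |     w = torch.randn(1, requires_grad=True)
            |     b = torch.zeros(1, requires_grad=True)
            | 
            |     for _ in range(200):
            |         pred = w * x + b                 # composed differentiable ops
            |         loss = ((pred - y) ** 2).mean()  # differentiable loss
            |         loss.backward()                  # gradients via autograd
            |         with torch.no_grad():
            |             w -= 0.1 * w.grad            # gradient descent step
            |             b -= 0.1 * b.grad
            |             w.grad.zero_()
            |             b.grad.zero_()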
        
         | captainclam wrote:
         | Ugh, exactly, it's so cool. I've been a deep learning
         | practitioner for ~3 years now, and I feel like this notion has
         | really been impressed upon me only recently.
         | 
         | I've spent an awful lot of mental energy trying to conceive of
         | how these things work, when really it comes down to "does
         | increasing this parameter improve the performance on this task?
         | Yes? Move the dial up a bit. No? Down a bit..." x 1e9.
         | 
         | And the cool part is that this yields such rich, interesting,
         | sometimes even useful, structures!
         | 
         | I like to think of this cognitive primitive as the analogue to
         | the idea that thermodynamics is just the sum of particles
         | bumping into each other. At the end of the day, that really is
         | just it, but the collective behavior is something else
         | entirely.
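          | 
          | That mental model can even be written down literally as a
          | (hopelessly inefficient) training loop: nudge a parameter,
          | check whether the loss got better, keep or undo the change.
          | Backprop just answers all of the "which way should this
          | dial move?" questions at once. A toy sketch (everything
          | here is invented):
          | 
          |     import random
          | 
          |     params = [random.uniform(-1, 1) for _ in range(10)]
          | 
          |     def loss(ps):
          |         # stand-in for "performance on this task" (lower is better)
          |         return sum((p - 0.5) ** 2 for p in ps)
          | 
          |     for _ in range(1000):
          |         i = random.randrange(len(params))
          |         before = loss(params)
          |         params[i] += 0.01              # move the dial up a bit
          |         if loss(params) > before:      # worse? move it down instead
          |             params[i] -= 0.02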
        
           | xanderlewis wrote:
           | > At the end of the day, that really is just it, but the
           | collective behavior is something else entirely.
           | 
           | Exactly. It's not to say that neat descriptions like this are
           | the end of the story (or even the beginning of it). If they
           | were, there would be no need for this entire field of study.
           | 
           | But they are cool, and can give you a really clear
           | conceptualisation of something that can appear more like a
           | sum of disjoint observations and ad hoc tricks than a
           | discipline based on a few deep principles.
        
           | JackFr wrote:
           | NAND gates by themselves are kind of dull, but it's pretty
           | cool what you can do with a billion of them.
        
         | SkyBelow wrote:
         | Before the recent AI boom, I was mystified by the possibility
         | of AI and emulating humans (in no small part thanks to works of
         | fiction showing AI powered androids). Then I created and
          | trained some neural networks. Small ones, doing nothing
          | particularly special. That was enough to break the mysticism. To
         | realize it was just multiplying matrices. Training them was a
         | bit more advanced, but still applied mathematics.
         | 
         | Only recently have I begun to appreciate that the simplicity of
          | the operation, applied to large enough matrices, may still
         | capture enough of the nature of intelligence and sentience. In
         | the end we can be broken down into (relatively) simple chemical
         | reactions, and it is the massive scale of these reactions that
          | creates real intelligence and sentience.
        
           | naasking wrote:
           | Exactly, the people who are derisive of those who consider ML
           | models to exhibit glimmers of true intelligence because it's
           | only matrix multiplications always amuse me. It's like they
           | don't even realize the contradiction in holding the position
           | that seemingly complex and intelligent outward behaviour
           | should not be used as an indication of actual complexity and
           | intelligence.
        
           | mistermann wrote:
           | Next step, in case you get bored: _why_ (in fact, _not a
           | minor distinction_ ) does such a simple approach work so
           | well?
        
         | gessha wrote:
          | It is soothing to the mind because it conveys that it's
          | understandable, but it doesn't take away from the complexity.
         | You still have to read through math and pytorch code and debug
         | nonsensical CUDA errors, comb through the data, etc etc
        
           | whimsicalism wrote:
           | the complexity is in the values learned from the
           | optimization. even the pytorch code for a simple transformer
           | is not that complex, attention is a simple mechanism, etc.
        
             | gessha wrote:
             | Complexity also comes from the number of papers that work
              | out how different elements of a network work and how to
             | intuitively change them.
             | 
             | Why do we use conv operators, why do we use attention
             | operators, when do we use one over the other? What
             | augmentations do you use, how big of a dataset do you need,
             | how do you collect the dataset, etc etc etc
        
               | whimsicalism wrote:
               | idk, just using attention and massive web crawls gets you
               | pretty far. a lot of the rest is more product-style
               | decisions about what personality you want your LM to
               | take.
               | 
               | I fundamentally don't think this technology is that
               | complex.
        
         | jxy wrote:
         | > > Stripped of anything else, neural networks are compositions
         | of differentiable primitives
         | 
         | > I'm a sucker for statements like this. It almost feels
         | philosophical, and makes the whole subject so much more
         | comprehensible in only a single sentence.
         | 
          | And I hate inaccurate statements like this. It pretends to be
          | rigorously mathematical, but really just propagates erroneous
         | information, and makes the whole article so much more amateur
         | in only a single sentence.
         | 
         | The simple relu is continuous but not differentiable at 0, and
         | its derivative is discontinuous at 0.
        
           | xanderlewis wrote:
           | It's not 'inaccurate'. The mark of true mastery is an ability
           | to make terse statements that convey a huge amount without
           | involving excessive formality or discussion of by-the-by
           | technical details. If ever you've spoken to world-renowned
           | experts in pure mathematics or other highly technical and
            | pedantic fields, you'll find they'll say all sorts of
           | 'inaccurate' things in conversation (or even in written
           | documents). It doesn't make them worthless; far from it.
           | 
           | If you want to have a war of petty pedantry, let's go: the
           | derivative of ReLU can't be _discontinuous at zero_ , as you
           | say, because continuity (or indeed discontinuity) of a
           | function at x requires the function to have a value at x
           | (which is the negation of what your first statement correctly
           | claims).
        
             | kragen wrote:
             | my experience with world-renowned experts in pure
             | mathematics is that they are much more careful than the
             | average bear to explicitly qualify inaccurate things as
             | inaccurate, because their discipline requires them to be
             | very clear about precisely what they are saying
             | 
             | discontinuity of a function at _x_ does not, according to
             | the usual definition of  'continuity', require the function
             | to have a value at _x_ ; indeed, functions that fail to
             | have a value at _x_ are necessarily discontinuous there,
             | precisely because (as you say) they are not continuous
             | there. https://en.wikipedia.org/wiki/Continuous_function#De
             | finition...
             | 
             | there are other definitions of 'discontinuous' in use, but
             | i can't think of one that would give the result you claim
        
               | xanderlewis wrote:
               | > they are much more careful than the average bear to
               | explicitly qualify inaccurate things as inaccurate
               | 
               | Sure. But what part of this _entirely worded in natural
               | language, and very short_ statement made you think it was
               | a technical, formal statement? I think you're just taking
               | an opportunity to flex your knowledge of basic calculus,
               | and deliberately attributing intent to the author that
               | isn't there in order to look clever.
               | 
               | Regarding a function being discontinuous at a point
               | outside its domain: if you take a completely naive view
               | of what 'discontinuous' means, then I suppose you can say
               | so. But discontinuity is just the logical negation of
               | continuity. Observe:
               | 
                | To say that _f: X → Y (in this context, a real-valued
                | function of real numbers) is continuous_ means precisely
                | 
                | ∀p∈X ∀ε>0 ∃δ>0 ∀x∈X: |x − p| < δ ⇒ |f(x) − f(p)| < ε
                | 
                | and so its negation looks like
                | 
                | ∃p∈X ¬(...)
               | 
               | that is, there is a point _in X, the domain of f_ where
               | continuity fails.
               | 
               | For example, you wouldn't talk about a function defined
               | on the integers being _discontinuous at pi_ , would you?
               | That would just be weird.
               | 
               | To prove the point further, observe that the set of
               | discontinuities (according to your definition) of any
               | given function would actually include _every_ number...
               | in fact every mathematical object in the universe --
               | which would make it not even a set in ZFC. So it's
               | absurd.
               | 
               | Even more reasons to believe functions can only be
               | discontinuous at points of their domain: a function is
               | said to be discontinuous if it has at least one
               | discontinuity. By your definition, _every_ function is
               | discontinuous.
               | 
               | ...anyway, I said we were going to be petty. I'm trying
               | to demonstrate this is a waste of time by wasting my own
               | time.
        
               | laingc wrote:
               | Because memes aren't allowed on HN, you're not allowed to
               | reply with the "akssshuallllyyy" meme, so you had to go
               | to these lengths.
               | 
                | ¯\\_(ツ)_/¯
        
               | xanderlewis wrote:
               | You're actually not far off. I'm somewhat embarrassed by
               | the above, but I think it makes the point.
        
             | makerdiety wrote:
             | Invoking excessive formality and discussions of minute
             | technical details leads to a cathedral of knowledge built
             | on autistic pedantry. The chosen rabbit hole to get lost in
             | needs to be the correct one. And human science is riddled
             | with the paths that have naive or childish fundamentals.
        
               | mistermann wrote:
               | This comment makes me want to both upvote and downvote
               | with extreme enthusiasm/fury!
               | 
               | The sign of a truly good conversation?
        
           | whimsicalism wrote:
            | it's pretty close to accurate; the lack of differentiability
            | at 0 for relu doesn't really come into play in practice
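            | 
            | For what it's worth, frameworks just pick a value for the
            | derivative at 0 (any subgradient will do); in the PyTorch
            | versions I've used, it's 0:
            | 
            |     import torch
            | 
            |     x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
            |     torch.relu(x).sum().backward()
            |     print(x.grad)   # tensor([0., 0., 1.]) -- 0 chosen at x = 0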
        
         | zackmorris wrote:
         | Ya so far this is the best introduction to neural networks from
         | first principles that I've seen.
         | 
         | Quickly skimming the draft pdf at
         | https://arxiv.org/pdf/2404.17625 I can grok it instantly,
         | because it's written in familiar academic language instead of
         | gobbledygook. Anyone with an undergrad math education in
         | engineering, computer science, etc or a self-taught equivalent
         | understanding of differential equations should be able to read
         | it easily. It does a really good job of connecting esoteric
         | terms like tensors with arrays, gradients with partial
         | derivatives, Jacobians with gradients and backpropagation with
         | gradient descent in forward/reverse mode automatic
         | differentiation. Which helps the reader to grasp the
         | fundamentals instead of being distracted by the implementation
         | details of TensorFlow, CUDA, etc. Some notable excerpts:
         | 
          | Introduction (page 4):
          | 
          | By viewing neural
         | networks as simply compositions of differentiable primitives we
         | can ask two basic questions (Figure F.1.3): first, what data
         | types can we handle as inputs or outputs? And second, what sort
         | of primitives can we use? Differentiability is a strong
         | requirement that does not allow us to work directly with many
         | standard data types, such as characters or integers, which are
         | fundamentally discrete and hence discontinuous. By contrast, we
         | will see that differentiable models can work easily with more
         | complex data represented as large arrays (what we will call
         | tensors) of numbers, such as images, which can be manipulated
         | algebraically by basic compositions of linear and nonlinear
         | transformations.
         | 
         | Chapter 2.2 Gradients and Jacobians (page 23):
         | [just read this section - it connects partial derivatives,
         | gradients, Jacobians and Taylor's theorem - wow!]
         | 
         | Chapter 4.1.5 Some computational considerations (page 59):
         | In general, we will always prefer algorithms that scale
         | linearly both in the feature dimension c and in the batch size
         | n, since super-linear algorithms will become quickly
         | impractical (e.g., a batch of 32 RGB images of size 1024x1024
          | has c ≈ 1e7). We can avoid a quadratic complexity in the
         | equation of the gradient by computing the multiplications in
         | the correct order, i.e., computing the matrix-vector product Xw
         | first. Hence, pure gradient descent is linear in both c and n,
         | but only if proper care is taken in the implementation:
         | generalizing this idea is the fundamental insight for the
         | development of reverse-mode automatic differentiation, a.k.a.
         | back-propagation (Section 6.3).
         | 
         | Chapter 6 Automatic differentiation (page 87):
         | We consider the problem of efficiently computing gradients of
         | generic computational graphs, such as those induced by
         | optimizing a scalar loss function on a fully-connected neural
         | network, a task called automatic differentiation (AD) [BPRS18].
         | You can think of a computational graph as the set of atomic
         | operations (which we call primitives) obtained by running the
         | program itself. We will consider sequential graphs for brevity,
         | but everything can be easily extended to more sophisticated,
          | acyclic computational graphs.
          | 
          | The problem may seem
         | trivial, since the chain rule of Jacobians (Section 2.2,
         | (E.2.22)) tells us that the gradient of function composition is
         | simply the matrix product of the corresponding Jacobian
         | matrices. However, efficiently implementing this is the key
         | challenge, and the resulting algorithm (reverse-mode AD or
         | backpropagation) is a cornerstone of neural networks and
         | differentiable programming in general [GW08, BR24].
         | Understanding it is also key to understanding the design (and
         | the differences) of most frameworks for implementing and
         | training such programs (such as TensorFlow or PyTorch or JAX).
         | A brief history of the algorithm can be found in [Gri12].
         | 
         | Edit: I changed Chapter 2.2.3 Jacobians (page 27) to Chapter
         | 2.2 Gradients and Jacobians (page 23) for better context.
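          | 
          | The multiplication-order point in 4.1.5 is easy to check
          | directly. For a squared-error loss the gradient involves
          | X^T (Xw - y); forming X^T X first costs O(n c^2), whereas
          | sticking to matrix-vector products is linear in both n and
          | c. A rough sketch (shapes made up):
          | 
          |     import numpy as np
          | 
          |     n, c = 500, 2000
          |     X = np.random.randn(n, c)
          |     w = np.random.randn(c)
          |     y = np.random.randn(n)
          | 
          |     # quadratic in c: builds the c x c matrix X^T X explicitly
          |     grad_slow = (X.T @ X) @ w - X.T @ y
          | 
          |     # linear in n and c: matrix-vector products only
          |     grad_fast = X.T @ (X @ w - y)
          | 
          |     assert np.allclose(grad_slow, grad_fast)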
        
         | phkahler wrote:
         | >> one shouldn't fall into the trap of anthropomorphising and
         | mysticising models based on the 'neural' name
         | 
         | And yet, artificial neural networks ARE an approximation of how
         | biological neurons work. It is worth noting that they came out
          | of neurobiology and not some math department - well at least in
          | the forward direction; I'm not sure who came up with the
          | training algorithms (probably the math folks). Should they be
         | considered mystical? No. I would also posit that biological
         | neurons are more efficient and probably have better learning
         | algorithms than artificial ones today.
         | 
         | I'm confused as to why some people seem to shun the biological
         | equivalence of these things. In a recent thread here I learned
         | that physical synaptic weights (in our brains) are at least
         | partly stored in DNA or its methylation. If that isn't
         | fascinating I'm not sure what is. Or is it more along the lines
         | of intelligence can be reduced to a large number of simple
         | things, and biology has given us an interesting physical
         | implementation?
        
           | kragen wrote:
           | anns originated in hypotheses about how neurobiology might
           | work in the 01940s but diverged completely from neurobiology
           | in the 01960s; they contain nothing we've learned about
           | neurons in the last 50 years, and not much from before that
           | either (they don't, for example, do hebbian learning).
           | current anns use training methods like gradient descent with
           | momentum and activation functions like relu which have no
           | plausible biological realization
           | 
           | artificial neural networks are an approximation of biological
           | neural networks in the same way that a submarine is an
           | approximation of a fish
        
           | xanderlewis wrote:
           | As the commenter below mentions, the biological version of a
           | neuron (i.e. a neuron) is _much_ more complicated than the
           | neural network version. The neural network version is
           | essentially just a weighted sum, with an extra layer of
           | shaping applied afterwards to make it nonlinear. As far as I
           | know, we still don't understand all of the complexity about
           | how biological neurons work. Even skimming the Wikipedia page
           | for 'neuron' will give you some idea.
           | 
           | The original idea of approximating something like a neuron
            | using a weighted sum (which is a fairly obvious idea, given
            | the initial discovery that neurons become 'activated', and do
            | so in proportion to how activated the neurons they are
            | connected to are) did come from thinking about biological
           | brains, but the mathematical building blocks are incredibly
           | simple and are hundreds of years old, if not thousands.
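            | 
            | To underline how little is there, this is more or less
            | the entire artificial 'neuron' (arbitrary numbers):
            | 
            |     import numpy as np
            | 
            |     inputs  = np.array([0.2, -0.4, 0.9])   # activations of connected neurons
            |     weights = np.array([0.5,  0.1, -0.3])  # connection strengths
            |     bias = 0.1
            | 
            |     z = weights @ inputs + bias            # weighted sum
            |     activation = max(0.0, z)               # nonlinear shaping (ReLU)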
        
             | naasking wrote:
             | > the biological version of a neuron (i.e. a neuron) is
             | much more complicated than the neural network version
             | 
             | This is a difference of degree not of kind, because neural
              | networks are Turing complete. Whatever additional
             | complexity the neuron has can itself be modelled as a
             | neural network.
             | 
              | Edit: meaning that if the greater complexity of a
              | biological neuron is relevant to its information-processing
              | component, then that just increases the number of
              | artificial neural network neurons needed to describe it; it
              | does not need any computation of a different kind.
        
               | xanderlewis wrote:
               | PowerPoint is Turing complete. Does that mean PowerPoint
               | should be regarded as being biological or at least
               | neuroscience-inspired?
        
               | naasking wrote:
               | No, but neural networks literally were inspired by
               | biology so I'm not sure what your point is.
        
         | jimbokun wrote:
          | I always get the impression that even the proponents of these
          | algorithms, from back when they didn't seem so promising, are
          | shocked at the capabilities demonstrated by models built with
          | such a relatively simple procedure.
        
         | naasking wrote:
         | > one shouldn't fall into the trap of anthropomorphising and
         | mysticising models based on the 'neural' name
         | 
         | One also shouldn't fall into the dual trap of assuming that
         | just because one understands how a model works, it cannot have
         | any bearing on the ever-mysterious operation of the brain.
        
         | sideshowb wrote:
         | > deep learning is simply the application of sequences of
         | operations that are nonlinear but nonetheless differentiable
         | 
         | Though other things fit this description which are not deep
         | learning. Like (shameless plug) my recent paper here
         | https://ieeexplore.ieee.org/document/10497907
        
       | Eduard wrote:
       | better remove all the Disney-based Alice in Wonderland character
       | intellectual property from the book.
        
       | p1esk wrote:
       | And then you learn about binary or ternary networks where
       | gradients don't really exist anywhere, and you start to wonder
       | about the importance of this differentiability.
        
         | whimsicalism wrote:
         | binary networks don't really work well unless you do a
         | relaxation first
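          | 
          | One common relaxation is the straight-through estimator:
          | binarize on the forward pass, but pretend the binarization
          | was the identity when backpropagating. A rough PyTorch
          | sketch (illustrative only):
          | 
          |     import torch
          | 
          |     class BinarizeSTE(torch.autograd.Function):
          |         @staticmethod
          |         def forward(ctx, w):
          |             return torch.sign(w)    # binary weights going forward
          | 
          |         @staticmethod
          |         def backward(ctx, grad_output):
          |             return grad_output      # gradient passes straight through
          | 
          |     w = torch.randn(4, requires_grad=True)
          |     BinarizeSTE.apply(w).sum().backward()
          |     print(w.grad)                   # all ones: sign() was "skipped"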
        
         | ubj wrote:
         | ...And then you start learning about generalizations of the
         | notion of "gradient" to scenarios where the classical gradient
         | doesn't exist :)
        
       | gfaure wrote:
       | In the literature, they're usually called convolutional layers (I
       | think you can pretty much search and replace all uses of
       | "convolutive" in the text).
        
       ___________________________________________________________________
       (page generated 2024-04-30 23:00 UTC)