[HN Gopher] Alice's adventures in a differentiable wonderland
       ___________________________________________________________________
        
       Alice's adventures in a differentiable wonderland
        
       Author : tosh
       Score  : 216 points
       Date   : 2024-04-30 17:03 UTC (1 days ago)
        
 (HTM) web link (www.sscardapane.it)
 (TXT) w3m dump (www.sscardapane.it)
        
       | glonq wrote:
       | I wonder if the usage of Alice & Wonderland takes inspiration
        | from Douglas Hofstadter's "Gödel, Escher, Bach: An Eternal
        | Golden Braid"?
        
         | iainmerrick wrote:
         | It's a pretty common trope, especially for math-related books,
         | e.g. Alex Bellos' "Alex's Adventures in Numberland".
        
         | devnonymous wrote:
         | No, I think the inspiration is more direct
         | https://duckduckgo.com/?q=lewis+carroll+alice+in+wonderland+...
        
         | abdullahkhalids wrote:
          | Lewis Carroll's Alice in Wonderland features a number of
          | logical and mathematical puzzles [1]
         | 
         | He also wrote What the Tortoise Said to Achilles (1895) in
         | which the paradoxes of Zeno are discussed.
         | 
          | So it's more correct to say that GEB and this article are
          | originally inspired by Lewis Carroll's work.
         | 
         | [1] I wrote a short article for my university magazine a long
         | time ago. Some interesting references at the end
         | https://abd.tiddlyspot.com/#%5B%5BMathematical%20Adventures%...
        
       | xanderlewis wrote:
       | > Stripped of anything else, neural networks are compositions of
       | differentiable primitives
       | 
       | I'm a sucker for statements like this. It almost feels
       | philosophical, and makes the whole subject so much more
       | comprehensible in only a single sentence.
       | 
        | I think François Chollet says something similar in his book on
       | deep learning: one shouldn't fall into the trap of
       | anthropomorphising and mysticising models based on the 'neural'
       | name; deep learning is simply the application of sequences of
       | operations that are nonlinear (and hence capable of encoding
       | arbitrary complexity) but nonetheless differentiable and so
       | efficiently optimisable.
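        | 
        | A minimal sketch of that view in PyTorch (sizes and names
        | here are arbitrary, just for illustration):
        | 
        |     import torch
        |     from torch import nn
        | 
        |     # literally a composition of differentiable primitives
        |     model = nn.Sequential(
        |         nn.Linear(4, 8),   # affine map: x -> Wx + b
        |         nn.ReLU(),         # simple nonlinearity
        |         nn.Linear(8, 1),
        |     )
        | 
        |     x = torch.randn(2, 4)
        |     loss = model(x).pow(2).mean()
        |     loss.backward()   # chain rule through the composition
        |     print(model[0].weight.grad.shape)  # torch.Size([8, 4])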
        
         | andoando wrote:
         | What does "differentiable primitives" mean here?
        
           | CobrastanJorji wrote:
           | Continuous mathematical functions which have derivatives.
        
           | xanderlewis wrote:
           | I think it's referring to 'primitive functions' in the sense
           | that they're the building blocks of more complicated
           | functions. If f and g are differentiable, f+g, fg, f/g (as
           | long as g is never zero)... and so on are differentiable too.
           | Importantly, f _composed with_ g is also differentiable, and
           | so since the output of the whole network as a function of its
           | input is a composition of these 'primitives' it's
           | differentiable too.
           | 
           | The actual primitive functions in this case would be things
           | like the weighted sums of activations in the previous layer
           | to get the activation of a given layer, and the actual
           | 'activation functions' (traditionally something like a
           | sigmoid function; these days a ReLU) associated with each
           | layer.
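            | 
            | To make that concrete, a toy sketch in plain Python with
            | one weighted sum composed with one sigmoid (numbers are
            | made up):
            | 
            |     import math
            | 
            |     f = lambda z: 1 / (1 + math.exp(-z))  # activation
            |     g = lambda w, x, b: w * x + b         # weighted sum
            |     df = lambda z: f(z) * (1 - f(z))      # f'
            | 
            |     # chain rule: d/dw f(g(w)) = f'(g(w)) * x
            |     w, x, b = 0.5, 2.0, -1.0
            |     grad_w = df(g(w, x, b)) * x
            | 
            |     # finite-difference check
            |     eps = 1e-6
            |     num = (f(g(w + eps, x, b))
            |            - f(g(w - eps, x, b))) / (2 * eps)
            |     print(grad_w, num)   # both approximately 0.5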
           | 
           | 'Primitives' is also sometimes used as a synonym for
           | _antiderivatives_ , but I don't think that's what it means
           | here.
           | 
           | Edit: it just occurred to me from a comment below that you
           | might have meant to ask what the 'differentiable' part means.
           | See https://en.wikipedia.org/wiki/Differentiable_function.
        
             | andoando wrote:
             | Is this function composition essentially lambda calculus
             | then?
        
               | OJFord wrote:
               | Function composition is just f(g(x)), considered as a
               | single function that's the composition of f and g; it has
                | the domain of g and the range of f.
               | 
               | In lambda calculus terminology it's an 'application'
               | (with a function argument).
        
               | xanderlewis wrote:
               | Composition here just means what it does for any two
               | functions: the value of the 'composition' of _f_ and _g_
               | at _x_ is defined to be _f_ applied to _g_ applied to
               | _x_. In symbols, its: _f[?]g := f(g(x))_ for each _x_ in
               | the domain of _f_. It may seem obvious, but the fact that
               | this new thing is also a function (that is, its value is
               | well-defined for every input) is actually a very useful
               | thing indeed and leads to... well, most of mathematics.
               | 
               | You can certainly _do_ function composition in lambda
               | calculus: in fact, the act of composition itself is a
               | higher order function (takes functions and returns a
               | function) and you can certainly express it formally with
               | lambda terms and such. It's not really got anything to do
               | with any particular language or model of computation
               | though.
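                | 
                | For instance, in Python-flavoured lambda notation
                | (purely illustrative):
                | 
                |     compose = lambda f, g: lambda x: f(g(x))
                | 
                |     double = lambda x: 2 * x
                |     square = lambda x: x * x
                | 
                |     h = compose(square, double)   # h(x) = (2x)^2
                |     print(h(3))                   # 36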
        
               | andoando wrote:
               | I didn't form my question too well. I understand all
                | that. What I am asking is, are these function
                | compositions equivalent to or similar to
                | functions in lambda calculus?
               | 
               | I guess my question, is what are the primitive functions
               | here doing?
        
               | xanderlewis wrote:
               | Well, yes, to the extent that functions are functions are
               | functions (they're just associations or mappings or
               | whatever you want to call them).
               | 
               | Maybe your question boils down to asking something more
               | general like: what's the difference between functions _to
               | a computer scientist_ (or a programmer) and functions _to
               | a mathematician_? That is, are 'functions' in C (or
               | lambda calculus), say, the same 'functions' we talk about
               | in calculus?
               | 
               | The answer to that is: in this case, because these are
               | quite simple functions (sums and products and
               | compositions thereof) they're the same. In general,
               | they're a bit different. The difference is basically the
               | difference between functional programming and
               | 'traditional' programming. If you have state/'side
               | effects' of functions, then your function won't be a
               | function _in the sense of mathematics_ ; if the return
               | value of your function depends entirely on the input and
               | doesn't return different values depending on whatever
               | else is happening in the program, then it will be.
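                | 
                | A throwaway Python illustration of that distinction
                | (names made up):
                | 
                |     counter = 0
                | 
                |     def impure(x):        # reads/writes outside state
                |         global counter
                |         counter += 1
                |         return x + counter
                | 
                |     def pure(x, counter):  # state is an explicit input
                |         return x + counter
                | 
                |     print(impure(1), impure(1))    # 2 3
                |     print(pure(1, 1), pure(1, 1))  # 2 2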
               | 
               | Since you're asking about lambda calculus in particular,
               | the answer is that they're the same because lambda
               | calculus doesn't have state. It's 'purely functional' in
               | that sense.
               | 
               | >I guess my question, is what are the primitive functions
               | here doing?
               | 
               | I'm not really sure what you mean. They're doing what
               | functions always do. Every computer program is abstractly
               | a (partial) function.
               | 
               | Does that help, or have I misunderstood?
        
               | andoando wrote:
               | So when I think of functions in lambda calculus, I think
               | of the I,S,K functions which when composed can produce
               | functions like "copy", "add", "remove", "if", etc which
               | then can do different computations like "copy every other
               | symbol if the symbol is true", "multiply 5 times then add
               | 2". Since lambda calculus is complete, any
               | computation/program can be composed.
               | 
               | When I think of functions in a traditional mathematical
               | sense, I think about transformations of numbers. x->2x,
               | x->2x^2, etc. I completely understand composition of
               | functions here, ex x->2(x->2x)^2, but its unclear how
               | these transformations relate to computation. For a
               | regression problem, I can totally understand how finding
                | the right compositions of functions can lead to better
                | approximations. So I am wondering, in an LLM
               | architecture, what computations do these functions
               | actually represent? I assume, it has something to do with
               | what path to take through the neural layers. I probably
               | just need to take the time to study it deeper.
               | 
               | >If you have state/'side effects' of functions, then your
               | function won't be a function in the sense of mathematics;
               | if the return value of your function depends entirely on
               | the input and doesn't return different values depending
               | on whatever else is happening in the program, then it
               | will be.
               | 
               | Totally understood from the perspective of functions in
                | say, Java. Though fundamentally I don't think there is a
                | distinction between functions in computer science and
                | mathematics. The program as a whole is effectively a
                | function. The "global" state is, from another frame of
                | reference, just local variables of the encompassing
                | function. If a
               | function is modifying variables outside of the "function
               | block" (in say Java), the "input" to the function isn't
               | just the parameters of the function. Imo, this is more of
               | an artifact of implementation of some languages rather
               | than a fundamental difference. Python for example
               | requires declaring global args in the function block. Go
               | one step further and require putting global args into the
               | parameters list and you're pretty close to satisfying
               | this.
        
               | xanderlewis wrote:
               | I think you're actually massively overthinking it.
               | 
               | The state of a neural network is described entirely by
               | its parameters, which usually consist of a long array
               | (well, a matrix, or a tensor, or whatever...) of floating
               | point numbers. What is being optimised when a network is
               | trained is _these parameters_ and nothing else. When you
               | evaluate a neural network on some input (often called
               | performing 'inference'), that is when the functions we're
               | talking about are used. You start with the input vector,
               | and you apply all of those functions in order and you get
               | the output vector of the network. The training process
               | also uses these functions, because to train a network you
               | have to perform evaluation repeatedly in between tweaking
               | those parameters to make it better approximate the
               | desired output for each input. Importantly, the functions
               | do not change. They are constant; it's the parameters
               | that change. The functions are the architecture -- not
               | the thing being learned. Essentially what the parameters
               | represent is how likely each neuron is to be activated
               | (have a high value) if others in the previous layer are.
               | So you can think of the parameters as encoding strengths
               | of connections between each pair of neurons in
               | consecutive layers. Thinking about 'what path to take
               | through the neural layers' is way too sophisticated --
               | it's not doing anything like that.
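                | 
                | A tiny sketch of what I mean, with made-up sizes: the
                | forward function below is frozen, and training only
                | nudges W1 and W2.
                | 
                |     import torch
                | 
                |     W1 = torch.randn(8, 4, requires_grad=True)
                |     W2 = torch.randn(1, 8, requires_grad=True)
                | 
                |     def forward(x):       # fixed architecture
                |         return W2 @ torch.relu(W1 @ x)
                | 
                |     x, y = torch.randn(4), torch.tensor([1.0])
                |     for _ in range(100):  # only W1, W2 change
                |         loss = (forward(x) - y).pow(2).sum()
                |         loss.backward()
                |         with torch.no_grad():
                |             W1 -= 0.01 * W1.grad
                |             W2 -= 0.01 * W2.grad
                |             W1.grad.zero_()
                |             W2.grad.zero_()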
               | 
               | > Though fundamentally I don't think there is distinction
               | between functions in computer science and mathematics.
               | The program as a whole is effectively a function.
               | 
               | You're pretty much right about that, but there are two
               | important problems/nitpicks:
               | 
               | (1) We can't prove (in general) that a given program will
               | halt and evaluate to something (rather than just looping
               | forever) on a given input, so the 'entire program' is
               | instead what's called a _partial function_. This means
               | that it's still a function on its domain -- but we can't
               | know what its precise domain is. Given an input, it may
               | or may not produce an output. If it does, though, it's
               | well defined because it's a deterministic process.
               | 
               | (2) You're right to qualify that it's the _whole program_
               | that is (possibly) a function. If you take a function
               | from some program that depends on some state in that same
               | program, then clearly that function won't be a proper
               | 'mathematical' function. Sure, if you incorporate that
               | extra state as one of your inputs, it might be, but
               | that's a _different function_. You have to remember that
               | in mathematics, unlike in programming, a function
               | consists essentially of three pieces of data: a domain, a
               | codomain, and a 'rule'. If you want to be set-theoretic
               | and formal about it, this rule is just a subset of the
               | cartesian product of its domain and codomain (it's a set
               | of pairs of the form (x, f(x))). If you change either of
               | these sets, it's technically a different function and
               | there are good reasons for distinguishing between these.
               | So it's not right to say that mathematical functions and
               | functions in a computer program are _exactly_ the same.
        
               | andoando wrote:
               | I appreciate your responses, sorry I hope I don't seem
                | like I'm arguing for the sake of arguing.
               | 
               | >Essentially what the parameters represent is how likely
               | each neuron is to be activated (have a high value) if
               | others in the previous layer are. So you can think of the
               | parameters as encoding strengths of connections between
               | each pair of neurons in consecutive layers. Thinking
               | about 'what path to take through the neural layers' is
               | way too sophisticated -- it's not doing anything like
               | that.
               | 
                | I'm a little confused. The discussion thus far has been
                | about how neural networks are essentially just compositions
                | of functions, but you are now saying that the function is
                | static, and only the parameters change.
               | 
               | But that aside, if these parameters change which neurons
               | are activated, and this activation affects which neurons
               | are activated in the next layer, are these parameters
               | effectively not changing the path taken through the
               | layers?
               | 
               | >Sure, if you incorporate that extra state as one of your
               | inputs, it might be, but that's a different function.
               | 
               | So say we have this program " let c = 2; function 3sum
               | (a,b) { return a+b + c; } let d = 3sum(3,4)"
               | 
               | I believe you are saying, if we had constructed this
               | instead as
               | 
               | "function(a,b,c) { return a+b+c } let d = 3sum(3,4,2) "
               | 
               | then, this is a different function.
               | 
               | Certainly, these are different in a sense, but at a
               | fundamental level, when you compile this all down and run
               | it, there is an equivalency in the _transformation_ that
               | is happening. That is, the two functions equivalently
               | take some input state A (composed of a,b,c) and return
               | the same output state B, while applying the same
                | intermediary steps (add a to b, add c to the result of
                | (add a to b)). Really, in the first case where c is defined
               | outside the scope of the function block, the interpreter
               | is effectively producing the function 3sum(x,y,c) as it
               | has to at some point, one way or another, inject c into
               | a+b+c.
               | 
                | Similarly, I won't argue that the current, formal
                | definitions of functions in mathematics are exactly those
                | of functions as they're generally defined in programming.
                | 
                | Rather, what I am saying is that there is an equivalent way
                | to think about and study functions that applies equally to
                | both fields. That is, a function is simply a transformation
               | from A to B, where A and B can be anything, whether that
               | is bits, numbers, or any other construction in any
               | system. The only primitive distinction to make here is
               | whether A and B are the same thing or different.
        
           | dirkc wrote:
           | When I did "AI" it would have meant the sigmoid function,
           | these days it's something like ReLU.
        
           | esafak wrote:
           | functions, which you can compose to increase their
           | expressiveness, and run gradient descent on to train.
           | 
           | The success of deep learning is basically attributable to
           | composable (expressive), differentiable (learnable)
           | functions. The "deep" moniker alludes to the
           | compositionality.
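            | 
            | As a bare-bones sketch of "differentiable, hence
            | learnable" (one weight, hand-computed gradient, toy
            | numbers):
            | 
            |     # model(x) = relu(w * x), loss = (model(x) - y)^2
            |     x, y, w, lr = 2.0, 6.0, 0.1, 0.05
            |     for _ in range(200):
            |         z = w * x
            |         pred = max(z, 0.0)               # relu
            |         dpred = 1.0 if z > 0 else 0.0    # relu'
            |         grad_w = 2 * (pred - y) * dpred * x
            |         w -= lr * grad_w
            |     print(w)   # close to 3.0: relu(3 * 2) = 6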
        
         | captainclam wrote:
         | Ugh, exactly, it's so cool. I've been a deep learning
         | practitioner for ~3 years now, and I feel like this notion has
         | really been impressed upon me only recently.
         | 
         | I've spent an awful lot of mental energy trying to conceive of
         | how these things work, when really it comes down to "does
         | increasing this parameter improve the performance on this task?
         | Yes? Move the dial up a bit. No? Down a bit..." x 1e9.
         | 
         | And the cool part is that this yields such rich, interesting,
         | sometimes even useful, structures!
         | 
         | I like to think of this cognitive primitive as the analogue to
         | the idea that thermodynamics is just the sum of particles
         | bumping into each other. At the end of the day, that really is
         | just it, but the collective behavior is something else
         | entirely.
        
           | xanderlewis wrote:
           | > At the end of the day, that really is just it, but the
           | collective behavior is something else entirely.
           | 
           | Exactly. It's not to say that neat descriptions like this are
           | the end of the story (or even the beginning of it). If they
           | were, there would be no need for this entire field of study.
           | 
           | But they are cool, and can give you a really clear
           | conceptualisation of something that can appear more like a
           | sum of disjoint observations and ad hoc tricks than a
           | discipline based on a few deep principles.
        
           | JackFr wrote:
           | NAND gates by themselves are kind of dull, but it's pretty
           | cool what you can do with a billion of them.
        
           | kadushka wrote:
           | _it comes down to "does increasing this parameter improve the
           | performance on this task? Yes? Move the dial up a bit. No?
           | Down a bit..." x 1e9_
           | 
           | This is not how gradient based NN optimization works. What
           | you described is called "random weight perturbation", a
           | variant of evolutionary algorithms. It does not scale to
           | networks larger than a few thousand parameters for obvious
           | reasons.
           | 
           | NNs are optimized by directly computing a gradient which
            | tells us which direction to go in to reduce the loss on the
           | current batch of training data. There's no trying up or down
           | and seeing if it worked - we always know which direction to
           | go.
           | 
           | SGD and RWP are two completely different approaches to
           | learning optimal NN weights.
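            | 
            | Roughly, the contrast in toy form (one weight, made-up
            | numbers; real RWP and SGD are of course batched and
            | vectorized):
            | 
            |     import random
            | 
            |     def loss(w):              # minimum at w = 3
            |         return (w - 3.0) ** 2
            | 
            |     # RWP-style: perturb, keep it only if loss improves
            |     w = 0.0
            |     for _ in range(1000):
            |         trial = w + random.gauss(0, 0.1)
            |         if loss(trial) < loss(w):
            |             w = trial
            | 
            |     # gradient-based: the derivative gives the direction
            |     v = 0.0
            |     for _ in range(1000):
            |         v -= 0.1 * 2 * (v - 3.0)   # d/dv (v - 3)^2
            | 
            |     print(w, v)   # both near 3; only one scales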
        
             | xanderlewis wrote:
             | I don't think the author _literally_ meant tweaking the
             | parameters and seeing what happens; it's probably an
             | analogy meant to give a sense of how the gradient indicates
             | what direction and to what degree the parameters should be
             | tweaked. Basically, substitute 'the gradient is positive'
             | for 'increasing this parameter decreases performance' and
             | vice versa and it becomes correct.
        
               | p1esk wrote:
               | That substitution is the main difference between SGD and
               | RWP.
               | 
               | It's like describing bubble sort when you meant to
               | describe quick sort. Would not fly on an ML 101 exam, or
               | in an ML job interview.
        
               | xanderlewis wrote:
               | It's not like that at all. You couldn't accidentally
               | sound like you're describing quick sort when describing
               | bubble sort, or vice versa. I can't think of any
               | substitution of a few words that would do that.
               | 
               | The _meaning_ of the gradient is perfectly adequately
               | described by the author. They weren't describing an
               | algorithm for _computing_ it.
        
               | a_random_canuck wrote:
               | I don't think anyone is trying to pass an exam here, but
               | just give an understandable overview to a general
               | audience.
        
             | captainclam wrote:
             | I guess you could say I don't know RWP from Adam! :D
             | 
             | My og comment wasn't to accurately explain gradient
             | optimization, I was just expressing a sentiment not
             | especially aimed at experts and not especially requiring
             | details.
             | 
             | Though I'm afraid I subjected you to the same "cringe" I
             | experience when I read pop sci/tech articles describe deep
             | learning optimization as "the algorithm" being "rewarded"
             | or "punished," haha.
        
               | kadushka wrote:
               | No worries, we're all friends here!
               | 
               | it's just you happened to accidentally describe the idea
               | behind RWP, which is a gradient-free optimization method,
               | so I thought I should point it out.
        
         | SkyBelow wrote:
         | Before the recent AI boom, I was mystified by the possibility
         | of AI and emulating humans (in no small part thanks to works of
         | fiction showing AI powered androids). Then I created and
         | trained some neural networks. Smaller ones, doing much of
         | nothing special. That was enough to break the mysticism. To
         | realize it was just multiplying matrices. Training them was a
         | bit more advanced, but still applied mathematics.
         | 
         | Only recently have I begun to appreciate that the simplicity of
          | the operation, applied to large enough matrices, may still
         | capture enough of the nature of intelligence and sentience. In
         | the end we can be broken down into (relatively) simple chemical
         | reactions, and it is the massive scale of these reactions that
         | create real intelligence and sentience.
        
           | naasking wrote:
           | Exactly, the people who are derisive of those who consider ML
           | models to exhibit glimmers of true intelligence because it's
           | only matrix multiplications always amuse me. It's like they
           | don't even realize the contradiction in holding the position
           | that seemingly complex and intelligent outward behaviour
           | should not be used as an indication of actual complexity and
           | intelligence.
        
             | citizen_friend wrote:
             | If you study this historically you will see that every
             | generation thinks they have the mechanism to explain brains
             | (gear systems, analog control/cybernetics, perceptrons).
             | 
             | My conclusion is we tend to overestimate our understanding
             | and the power of our inventions.
        
               | naasking wrote:
               | The difference is that we now actually have a proof of
               | computational power and computational universality.
        
               | citizen_friend wrote:
               | Analog circuits have the same computational power.
               | Piecewise linear functions have the same computational
               | universality.
        
               | naasking wrote:
                | Except we didn't know any of that, nor did we know how to
               | construct physical analogs in order to achieve universal
               | computation. At best, we had limited task-specific
               | computation, like clocks and planetary motion.
        
           | mistermann wrote:
           | Next step, in case you get bored: _why_ (in fact, _not a
           | minor distinction_ ) does such a simple approach work so
           | well?
        
         | gessha wrote:
         | It is soothing to the mind because it conveys that it's
         | understandable but it doesn't take away from the complexity.
         | You still have to read through math and pytorch code and debug
         | nonsensical CUDA errors, comb through the data, etc etc
        
           | whimsicalism wrote:
           | the complexity is in the values learned from the
           | optimization. even the pytorch code for a simple transformer
           | is not that complex, attention is a simple mechanism, etc.
        
             | gessha wrote:
              | Complexity also comes from the number of papers that work
              | out how different elements of a network work and how to
             | intuitively change them.
             | 
             | Why do we use conv operators, why do we use attention
             | operators, when do we use one over the other? What
             | augmentations do you use, how big of a dataset do you need,
             | how do you collect the dataset, etc etc etc
        
               | whimsicalism wrote:
               | idk, just using attention and massive web crawls gets you
               | pretty far. a lot of the rest is more product-style
               | decisions about what personality you want your LM to
               | take.
               | 
               | I fundamentally don't think this technology is that
               | complex.
        
               | gessha wrote:
               | No? In his recent tutorial, Karpathy showed just how much
               | complexity there is in the tokenizer.
               | 
               | This technology has been years in the making with many
               | small advances pushing the performance ever so slightly.
               | There's been theoretical and engineering advances that
               | contributed to where we are today. And we need many more
               | to get the technology to an actually usable level instead
               | of the current word spaghetti that we get.
               | 
               | Also, the post is generally about neural networks and not
               | just LMs.
               | 
               | When making design decisions about an ML system you
               | shouldn't just choose the attention hammer and hammer
               | away. There's a lot of design constraints you need to
               | consider which is why I made the original reply.
        
               | whimsicalism wrote:
               | Are there micro-optimizations that eke out small
               | advancements? Yes, absolutely - the modern tokenizer is a
               | good example of that.
               | 
               | Is the core of the technology that complex? No. You could
               | get very far with a naive tokenizer that just tokenized
               | by words and replaced unknown words with <unk>. This is
               | extremely simple to implement and I've trained
               | transformers like this. It (of course) makes a perplexity
               | difference but the core of the technology is not changed
               | and is quite simple. Most of the complexity is in the
               | hardware, not the software innovations.
               | 
               | > And we need many more to get the technology to an
               | actually usable level instead of the current word
               | spaghetti that we get.
               | 
               | I think the current technology is useable.
               | 
               | > you shouldn't just choose the attention hammer and
               | hammer away
               | 
               | It's a good first choice of hammer, tbph.
        
         | jxy wrote:
         | > > Stripped of anything else, neural networks are compositions
         | of differentiable primitives
         | 
         | > I'm a sucker for statements like this. It almost feels
         | philosophical, and makes the whole subject so much more
         | comprehensible in only a single sentence.
         | 
         | And I hate inaccurate statements like this. It pretends to be
         | rigorous mathematical, but really just propagates erroneous
         | information, and makes the whole article so much more amateur
         | in only a single sentence.
         | 
         | The simple relu is continuous but not differentiable at 0, and
         | its derivative is discontinuous at 0.
        
           | xanderlewis wrote:
           | It's not 'inaccurate'. The mark of true mastery is an ability
           | to make terse statements that convey a huge amount without
           | involving excessive formality or discussion of by-the-by
           | technical details. If ever you've spoken to world-renowned
           | experts in pure mathematics or other highly technical and
            | pedantic fields, you'll find they'll say all sorts of
           | 'inaccurate' things in conversation (or even in written
           | documents). It doesn't make them worthless; far from it.
           | 
           | If you want to have a war of petty pedantry, let's go: the
           | derivative of ReLU can't be _discontinuous at zero_ , as you
           | say, because continuity (or indeed discontinuity) of a
           | function at x requires the function to have a value at x
           | (which is the negation of what your first statement correctly
           | claims).
        
             | kragen wrote:
             | my experience with world-renowned experts in pure
             | mathematics is that they are much more careful than the
             | average bear to explicitly qualify inaccurate things as
             | inaccurate, because their discipline requires them to be
             | very clear about precisely what they are saying
             | 
             | discontinuity of a function at _x_ does not, according to
             | the usual definition of  'continuity', require the function
             | to have a value at _x_ ; indeed, functions that fail to
             | have a value at _x_ are necessarily discontinuous there,
             | precisely because (as you say) they are not continuous
             | there. https://en.wikipedia.org/wiki/Continuous_function#De
             | finition...
             | 
             | there are other definitions of 'discontinuous' in use, but
             | i can't think of one that would give the result you claim
        
               | xanderlewis wrote:
               | > they are much more careful than the average bear to
               | explicitly qualify inaccurate things as inaccurate
               | 
               | Sure. But what part of this _entirely worded in natural
               | language, and very short_ statement made you think it was
               | a technical, formal statement? I think you're just taking
               | an opportunity to flex your knowledge of basic calculus,
               | and deliberately attributing intent to the author that
               | isn't there in order to look clever.
               | 
               | Regarding a function being discontinuous at a point
               | outside its domain: if you take a completely naive view
               | of what 'discontinuous' means, then I suppose you can say
               | so. But discontinuity is just the logical negation of
               | continuity. Observe:
               | 
                | To say that _f: X --> Y (in this context, a real-valued
                | function of real numbers) is continuous_ means precisely
                | 
                | ∀x∈X ∀e>0 ∃d>0 ∀p∈X: |x - p| < d => |f(x) - f(p)| < e
                | 
                | and so its negation looks like
                | 
                | ∃x∈X ¬ ...
               | 
               | that is, there is a point _in X, the domain of f_ where
               | continuity fails.
               | 
               | For example, you wouldn't talk about a function defined
               | on the integers being _discontinuous at pi_ , would you?
               | That would just be weird.
               | 
               | To prove the point further, observe that the set of
               | discontinuities (according to your definition) of any
               | given function would actually include _every_ number...
               | in fact every mathematical object in the universe --
               | which would make it not even a set in ZFC. So it's
               | absurd.
               | 
               | Even more reasons to believe functions can only be
               | discontinuous at points of their domain: a function is
               | said to be discontinuous if it has at least one
               | discontinuity. By your definition, _every_ function is
               | discontinuous.
               | 
               | ...anyway, I said we were going to be petty. I'm trying
               | to demonstrate this is a waste of time by wasting my own
               | time.
        
               | laingc wrote:
               | Because memes aren't allowed on HN, you're not allowed to
               | reply with the "akssshuallllyyy" meme, so you had to go
               | to these lengths.
               | 
                | ¯\\_(ツ)_/¯
        
               | xanderlewis wrote:
               | You're actually not far off. I'm somewhat embarrassed by
               | the above, but I think it makes the point.
        
               | kragen wrote:
               | you have an interesting point of view, and some of the
               | things you have said are correct, but if you try to use
                | gradient descent on a function from, say, Z -> R, you are
               | going to be a very sad xanda. i would indeed describe
               | such a function as being discontinuous not just at _p_
               | but everywhere, at least with the usual definition of
               | continuity (though there is a sense in which such a
               | function could be, for example, scott-continuous)
               | 
               | even in the case of a single discontinuity in the
               | derivative, like in relu', you lose the intermediate
               | value theorem and everything that follows from it; it's
               | not an inconsequential or marginally relevant fact
        
               | jj3 wrote:
                | Note that any function Z -> R is continuous on its domain
                | but nowhere differentiable.
                | 
                | A Scott-continuous function Z -> R must be monotone. So
                | not every such function is Scott-continuous.
        
               | kragen wrote:
               | aha, thanks!
        
             | makerdiety wrote:
             | Invoking excessive formality and discussions of minute
             | technical details leads to a cathedral of knowledge built
             | on autistic pedantry. The chosen rabbit hole to get lost in
             | needs to be the correct one. And human science is riddled
             | with the paths that have naive or childish fundamentals.
        
               | mistermann wrote:
               | This comment makes me want to both upvote and downvote
               | with extreme enthusiasm/fury!
               | 
               | The sign of a truly good conversation?
        
               | kragen wrote:
               | > _a cathedral of knowledge built on autistic pedantry_
               | 
               | this is certainly true, but more often we use its short
               | name, 'math'. it turns out to be far more effective than
               | so-called common sense
        
             | newrotik wrote:
             | Lack of differentiability is actually a very important
             | feature of the underlying optimization problem.
             | 
             | You might think that it doesn't matter because ReLU is,
             | e.g., non-differentiable "only at one point".
             | 
             | Gradient based methods (what you find in pytorch) generally
             | rely on the idea that gradients should taper to 0 in the
             | proximity of a local optimum. This is not the case for non-
             | differentiable functions, and in fact gradients can be made
             | to be arbitrarily large even very close to the optimum.
             | 
             | As you may imagine, it is not hard to construct examples
             | where simple gradient methods that do not properly take
             | these facts into account fail to converge. These examples
             | are not exotic.
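              | 
              | A tiny instance of that failure mode, f(w) = |w| with a
              | fixed step size (toy numbers):
              | 
              |     # minimum at w = 0, but the "gradient" is always
              |     # +1 or -1, so it never tapers off
              |     w, lr = 0.05, 0.1
              |     for _ in range(10):
              |         grad = 1.0 if w > 0 else -1.0
              |         w -= lr * grad
              |         print(round(w, 2))  # hops between -0.05, 0.05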
        
           | whimsicalism wrote:
           | it's pretty close to accurate, the lack of differentiability
           | at 0 for relu doesn't really come into play in practice
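            | 
            | e.g. autograd just assigns some value at the kink (a
            | quick check, assuming PyTorch's convention of 0 there):
            | 
            |     import torch
            | 
            |     x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
            |     torch.relu(x).sum().backward()
            |     print(x.grad)   # tensor([0., 0., 1.])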
        
         | zackmorris wrote:
         | Ya so far this is the best introduction to neural networks from
         | first principles that I've seen.
         | 
         | Quickly skimming the draft pdf at
         | https://arxiv.org/pdf/2404.17625 I can grok it instantly,
         | because it's written in familiar academic language instead of
         | gobbledygook. Anyone with an undergrad math education in
         | engineering, computer science, etc or a self-taught equivalent
         | understanding of differential equations should be able to read
         | it easily. It does a really good job of connecting esoteric
         | terms like tensors with arrays, gradients with partial
         | derivatives, Jacobians with gradients and backpropagation with
         | gradient descent in forward/reverse mode automatic
         | differentiation. Which helps the reader to grasp the
         | fundamentals instead of being distracted by the implementation
         | details of TensorFlow, CUDA, etc. Some notable excerpts:
         | 
         | Introduction (page 4):                 By viewing neural
         | networks as simply compositions of differentiable primitives we
         | can ask two basic questions (Figure F.1.3): first, what data
         | types can we handle as inputs or outputs? And second, what sort
         | of primitives can we use? Differentiability is a strong
         | requirement that does not allow us to work directly with many
         | standard data types, such as characters or integers, which are
         | fundamentally discrete and hence discontinuous. By contrast, we
         | will see that differentiable models can work easily with more
         | complex data represented as large arrays (what we will call
         | tensors) of numbers, such as images, which can be manipulated
         | algebraically by basic compositions of linear and nonlinear
         | transformations.
         | 
         | Chapter 2.2 Gradients and Jacobians (page 23):
         | [just read this section - it connects partial derivatives,
         | gradients, Jacobians and Taylor's theorem - wow!]
         | 
         | Chapter 4.1.5 Some computational considerations (page 59):
         | In general, we will always prefer algorithms that scale
         | linearly both in the feature dimension c and in the batch size
         | n, since super-linear algorithms will become quickly
         | impractical (e.g., a batch of 32 RGB images of size 1024x1024
          | has c ≈ 1e7). We can avoid a quadratic complexity in the
         | equation of the gradient by computing the multiplications in
         | the correct order, i.e., computing the matrix-vector product Xw
         | first. Hence, pure gradient descent is linear in both c and n,
         | but only if proper care is taken in the implementation:
         | generalizing this idea is the fundamental insight for the
         | development of reverse-mode automatic differentiation, a.k.a.
         | back-propagation (Section 6.3).
         | 
         | Chapter 6 Automatic differentiation (page 87):
         | We consider the problem of efficiently computing gradients of
         | generic computational graphs, such as those induced by
         | optimizing a scalar loss function on a fully-connected neural
         | network, a task called automatic differentiation (AD) [BPRS18].
         | You can think of a computational graph as the set of atomic
         | operations (which we call primitives) obtained by running the
         | program itself. We will consider sequential graphs for brevity,
         | but everything can be easily extended to more sophisticated,
         | acyclic computational graphs.              The problem may seem
         | trivial, since the chain rule of Jacobians (Section 2.2,
         | (E.2.22)) tells us that the gradient of function composition is
         | simply the matrix product of the corresponding Jacobian
         | matrices. However, efficiently implementing this is the key
         | challenge, and the resulting algorithm (reverse-mode AD or
         | backpropagation) is a cornerstone of neural networks and
         | differentiable programming in general [GW08, BR24].
         | Understanding it is also key to understanding the design (and
         | the differences) of most frameworks for implementing and
         | training such programs (such as TensorFlow or PyTorch or JAX).
         | A brief history of the algorithm can be found in [Gri12].
         | 
         | Edit: I changed Chapter 2.2.3 Jacobians (page 27) to Chapter
         | 2.2 Gradients and Jacobians (page 23) for better context.
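          | 
          | The Chapter 4.1.5 point about multiplication order is easy
          | to check numerically; a rough sketch (sizes made up):
          | 
          |     import numpy as np
          | 
          |     n, c = 1024, 512
          |     X = np.random.randn(n, c)
          |     w, y = np.random.randn(c), np.random.randn(n)
          | 
          |     # least-squares gradient: (2/n) * X^T (X w - y)
          |     g1 = 2 / n * (X.T @ (X @ w - y))        # O(n*c)
          |     g2 = 2 / n * ((X.T @ X) @ w - X.T @ y)  # O(n*c^2)
          |     print(np.allclose(g1, g2))  # True, but g2 builds
          |                                 # a c-by-c matrix first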
        
           | barrenko wrote:
           | Thank you for the summary, and thanks to the OP/author of the
           | book.
           | 
           | I started self-studying programming some time ago, then
           | pivoted to AI/ML and (understandably) ended up mostly
           | studying math, these resources are a boon to my folk.
        
         | phkahler wrote:
         | >> one shouldn't fall into the trap of anthropomorphising and
         | mysticising models based on the 'neural' name
         | 
         | And yet, artificial neural networks ARE an approximation of how
         | biological neurons work. It is worth noting that they came out
         | of neurobiology and not some math department - well at least in
         | the forward direction, I'm not sure who came up with the
         | training algorithms (probably the math folks). Should they be
         | considered mystical? No. I would also posit that biological
         | neurons are more efficient and probably have better learning
         | algorithms than artificial ones today.
         | 
         | I'm confused as to why some people seem to shun the biological
         | equivalence of these things. In a recent thread here I learned
         | that physical synaptic weights (in our brains) are at least
         | partly stored in DNA or its methylation. If that isn't
         | fascinating I'm not sure what is. Or is it more along the lines
         | of intelligence can be reduced to a large number of simple
         | things, and biology has given us an interesting physical
         | implementation?
        
           | kragen wrote:
           | anns originated in hypotheses about how neurobiology might
           | work in the 01940s but diverged completely from neurobiology
           | in the 01960s; they contain nothing we've learned about
           | neurons in the last 50 years, and not much from before that
           | either (they don't, for example, do hebbian learning).
           | current anns use training methods like gradient descent with
           | momentum and activation functions like relu which have no
           | plausible biological realization
           | 
           | artificial neural networks are an approximation of biological
           | neural networks in the same way that a submarine is an
           | approximation of a fish
        
           | xanderlewis wrote:
           | As the commenter below mentions, the biological version of a
           | neuron (i.e. a neuron) is _much_ more complicated than the
           | neural network version. The neural network version is
           | essentially just a weighted sum, with an extra layer of
           | shaping applied afterwards to make it nonlinear. As far as I
           | know, we still don't understand all of the complexity about
           | how biological neurons work. Even skimming the Wikipedia page
           | for 'neuron' will give you some idea.
           | 
           | The original idea of approximating something like a neuron
           | using a weighted sum (which is a fairly obvious idea, given
           | the initial discovery that neurons become 'activated' and
           | they do so in proportion to how much the neurons they are
           | connected to are) did come from thinking about biological
           | brains, but the mathematical building blocks are incredibly
           | simple and are hundreds of years old, if not thousands.
        
             | naasking wrote:
             | > the biological version of a neuron (i.e. a neuron) is
             | much more complicated than the neural network version
             | 
             | This is a difference of degree not of kind, because neural
              | networks are Turing complete. Whatever additional
             | complexity the neuron has can itself be modelled as a
             | neural network.
             | 
             | Edit: meaning, that if the greater complexity of a
             | biological neuron is relevant to its information processing
             | component, then that just increases the number of
             | artificial neural network neurons needed to describe it, it
             | does not need any computation of a different kind.
        
               | xanderlewis wrote:
               | PowerPoint is Turing complete. Does that mean PowerPoint
               | should be regarded as being biological or at least
               | neuroscience-inspired?
        
               | naasking wrote:
               | No, but neural networks literally were inspired by
               | biology so I'm not sure what your point is.
        
               | xanderlewis wrote:
               | My point is that you seem to think neurons in the sense
               | of artificial neural networks and neurons in the human
               | brain are equivalent because:
               | 
               | (1) Neural networks are Turing complete, and hence can do
               | anything brains can. [debatable anyway; We don't know
               | this to be the case since brains might be doing more than
               | computation. Ask a philosopher or a cognitive scientist.
               | Or Roger Penrose.]
               | 
               | (2) Neural networks were very loosely inspired by the
               | idea that the human brain is made up of interconnected
               | nodes that 'activate' in proportion to how other related
               | nodes do.
               | 
               | I don't think that's nearly enough to say that they're
               | equivalent. For (1), we don't yet know (and we're not
               | even close), and anyway: if you consider all Turing
               | complete systems to be equivalent to the point of it
               | being a waste of time to talk about their differences
               | then you can say goodbye to quite a lot of work in
               | theoretical computer science. For (2): so what? Lots of
               | things are inspired by other things. It doesn't make them
               | in any sense equivalent, especially if the analogy is as
               | weak as it is in this case. No neuroscientist thinks that
               | a weighted sum is an adequate (or even remotely accurate)
               | model of a real biological neuron. They operate on
               | completely different principles, as we now know much
               | better than when such things were first dreamed up.
        
               | naasking wrote:
               | The brain certainly could be doing super-Turing
               | computation, but that would overturn quite a bit of
               | physics seeing as how not even quantum computers are more
               | powerful than Turing machines (they're just faster on
               | some problems). Extraordinary claims and all that.
               | 
               | As for equivalency, that depends on how that's defined.
               | Real neurons would not feature any more computational
               | power than Turing machines or artificial neural networks,
               | but I never said it would be a waste of time to talk
               | about their differences. I merely pointed out that the
               | artificial neural network model is _still sufficient_ ,
               | even if real neurons have more complexity.
               | 
               | > No neuroscientist thinks that a weighted sum is an
               | adequate (or even remotely accurate) model of a real
               | biological neuron
               | 
               | Fortunately that's not what I said. If the neuron indeed
               | has more relevant complexity, then it wouldn't be one
               | weighted sum = one biological neuron, but one biological
               | neuron = a network of weighted sums, since such a network
               | can model any function.
        
               | xanderlewis wrote:
               | The original comment you were in defence of was
               | suggesting that artificial neurons were somehow very
               | close to biological ones, since supposedly that's where
               | their inspiration came from.
               | 
               | If you're interested in pure computational 'power', then
               | if the brain is nothing more than a Turing machine
               | (which, as you agree, it might not be), fine. You can
               | call them 'equivalent'. It's just not very meaningful.
               | 
               | What's interesting about neural nets has nothing to do
               | with _what_ they can compute; indeed they can compute
               | anything any other Turing machine can, and nothing more.
               | What's interesting is _how_ they do it, since they can
               | 'learn' and hence allow us to produce solutions to hard
               | problems without any explicit programming or traditional
               | analysis of the problem.
               | 
               | > that would overturn quite a bit of physics
               | 
               | Our physics is currently woefully incomplete, so... yes.
               | That would be welcome.
        
               | andoando wrote:
                | And assembly is also Turing complete, so if two models
                | both being Turing complete means they are equivalent,
               | there would be no need for coding neural networks at all.
               | Would you consider LLMs a different kind of computation
               | than writing assembly code?
               | 
                | Perhaps fundamentally they are not, but it's also true
               | that just writing more and more random assembly code
               | isn't going to lead to an LLM.
        
               | naasking wrote:
               | LLMs aren't randomly generated though, they are shaped by
               | training data. This means there would, in principle, be a
               | comparable way to synthesize an equivalent assembly
               | program from that same training data.
               | 
               | The difference here is that it's just more obvious how to
               | do this in one case than the other.
               | 
               | My point was only that 1) neural networks are sufficient,
               | even if real neurons have additional complexity, and 2)
               | whatever that additional complexity, artificial neural
               | networks can learn to reproduce it.
        
               | andoando wrote:
                | I understand that; what I am saying, though, is that the
                | fact that they can doesn't mean that they will simply by
                | scaling their number. It still entirely depends on how
                | they are trained/arranged, meaning it may take a
                | completely different way of composing/gluing neurons
                | together to simulate any additional complexity. It's like
                | saying a NAND gate is Turing complete, I put 1000000000
                | of them in series, but it's not doing anything, what
                | gives, do I need to add a billion more?
                | 
                | Just as modeling and running a single neuron takes x
                | amount of transistors configured in a very specific way,
                | for example, it may take y amount of neurons arranged in
                | some very specific, unknown way to model something that
                | has extra properties.
               | 
                | And it's not clear either whether neurons, rather than
                | some other kind of node, are fundamentally the correct
                | approach to reach this higher-level construction.
        
               | srean wrote:
               | > This is a difference of degree not of kind
               | 
               | Nope.
               | 
               | Neurons in our brain operate _fundamentally_
               | differently. They work by transient spikes, and
               | information is carried not by the intensity of the
               | spike voltage but by the frequency of spiking. This is
               | a fundamentally different phenomenon from ANNs, where
               | the output ('voltage') is a squashing transform applied
               | to the aggregated input values ('voltages').
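               | 
               | Roughly, in code (a cartoon contrast, not a biophysical
               | model; every constant here is invented):
               | 
               |     import numpy as np
               | 
               |     rng = np.random.default_rng(0)
               |     x, w = rng.normal(size=5), rng.normal(size=5)
               | 
               |     # ANN unit: one analog number, a squashed weighted
               |     # sum of its inputs
               |     ann_out = 1 / (1 + np.exp(-(w @ x)))
               | 
               |     # rate-coded spiking unit: the same drive shows up
               |     # as how OFTEN the unit fires in a time window
               |     drive = max(w @ x, 0.0)
               |     spikes = rng.random(1000) < drive * 0.05
               |     rate = spikes.sum()       # spikes per (toy) second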
        
               | phkahler wrote:
               | >> Neurons in our brain operate fundamentally
               | differently. They work by transient spikes and
               | information is carried not by the intensity of the spike
               | voltage, but by the frequency of spiking.
               | 
               | I thought they worked like accumulators where the spike
               | "energy" accumulates until the output "fires". If that's
               | the case then the artificial NNs are still an
               | approximation of that process. I agree that this is a
               | significant difference, but the mathematical version is
               | still a rough approximation inspired by the biological
               | one.
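               | 
               | That accumulate-until-it-fires picture is roughly the
               | leaky integrate-and-fire model. A minimal sketch, with
               | made-up constants:
               | 
               |     # membrane potential v leaks and accumulates input;
               |     # crossing the threshold emits a spike and resets v
               |     dt, tau, v_th, current = 0.001, 0.02, 1.0, 60.0
               |     v, spikes = 0.0, 0
               |     for step in range(1000):        # ~1 toy second
               |         v += dt * (-v / tau + current)
               |         if v >= v_th:
               |             spikes += 1
               |             v = 0.0
               |     # 'spikes' (the firing rate) carries the signal;
               |     # the analog v never leaves the neuron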
        
           | srean wrote:
           | > And yet, artificial neural networks ARE an approximation of
           | how biological neurons work
           | 
           | For a non-vapid/non-vacuous definition of 'approximation',
           | this is not true at all. It is well understood that (i)
           | back-propagation is biologically infeasible in the brain,
           | and (ii) the output 'voltage' being a transformed weighted
           | average of the input 'voltages' is not how neurons operate.
           | (ii) is in the 'not even wrong' category.
           | 
           | Neurons operate in terms of spikes, the frequency of
           | spiking, and quiescence. If you are interested, any
           | undergrad text in neurobiology will help correct these
           | notions.
        
           | chriswarbo wrote:
           | > And yet, artificial neural networks ARE an approximation of
           | how biological neurons work.
           | 
           | Only if you limit yourself to "sums of weighted inputs, sent
           | through a 1D activation function".
           | 
           | However, the parent said "differentiable primitives": these
           | days people have built networks that contain differentiable
           | ray-tracers, differentiable physics simulations, etc. Those
           | seem like crazy ideas if we limit ourselves to the "neural"
           | analogy, but they are quite natural for a "composition of
           | differentiable primitives" approach.
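           | 
           | For instance, a hand-written physics step built out of
           | torch ops slots into autograd exactly like a layer does. A
           | toy PyTorch sketch (the constants and loss are invented,
           | not taken from the book):
           | 
           |     import torch
           | 
           |     def spring_step(pos, vel, k, dt=0.01, damping=0.1):
           |         acc = -k * pos - damping * vel
           |         return pos + dt * vel, vel + dt * acc
           | 
           |     k = torch.tensor(2.0, requires_grad=True)
           |     pos, vel = torch.tensor(1.0), torch.tensor(0.0)
           |     for _ in range(100):      # unrolled simulation "layer"
           |         pos, vel = spring_step(pos, vel, k)
           | 
           |     loss = (pos - 0.5) ** 2   # want final position near 0.5
           |     loss.backward()           # gradient flows through it all
           |     print(k.grad)             # d(loss)/d(stiffness)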
        
         | jimbokun wrote:
         | I always get the impression that even those who championed
         | these algorithms back when they didn't seem so promising are
         | shocked at the capabilities demonstrated by models built with
         | such a relatively simple procedure.
        
         | naasking wrote:
         | > one shouldn't fall into the trap of anthropomorphising and
         | mysticising models based on the 'neural' name
         | 
         | One also shouldn't fall into the dual trap of assuming that
         | just because one understands how a model works, it cannot have
         | any bearing on the ever-mysterious operation of the brain.
        
         | sideshowb wrote:
         | > deep learning is simply the application of sequences of
         | operations that are nonlinear but nonetheless differentiable
         | 
         | Though other things fit this description which are not deep
         | learning. Like (shameless plug) my recent paper here
         | https://ieeexplore.ieee.org/document/10497907
        
         | jonas21 wrote:
         | I feel like this statement is both obvious after spending a few
         | minutes working with neural networks and completely useless in
         | helping you build better neural networks.
         | 
         | It's kind of like saying, "Stripped of anything else, works of
         | literature are compositions of words"
        
           | Horffupolde wrote:
           | Well, I'd argue that could also be a bit enlightening. It's
           | like taking a moment to appreciate that forests are composed
           | of single trees. It takes a certain level of insight to
           | appreciate systems at various depths.
        
           | falcor84 wrote:
           | Let me try to one-up that: "Stripped of anything else,
           | works of literature are compositions of the uppercase
           | letter I"
        
         | m463 wrote:
         | chess is pattern matching.
        
       | Eduard wrote:
       | better remove all the Disney-based Alice in Wonderland character
       | intellectual property from the book.
        
         | astrodust wrote:
         | I was just thinking that's "cease and desist" bait right there.
        
           | iainmerrick wrote:
           | Alice in Wonderland (the book) is in the public domain. The
           | old Disney movie is still in copyright, and the cover image
           | does look very much like it's from the movie, but that
           | character design is from John Tenniel's illustrations which
           | are also in the public domain.
        
             | astrodust wrote:
             | The character design is. The image, however, is clearly
             | Disney flavoured if not traced directly.
             | 
              | Tenniel's version, for example, does not have the
              | distinctive bow. The art style is also completely
              | different.
        
               | iainmerrick wrote:
               | True - it would be a good idea to use a Tenniel piece
               | instead.
               | 
               |  _Edit to add:_ I was mostly trying to push back on the
               | implication that Disney owns Alice in Wonderland (and
               | Peter Pan, Winnie the Pooh, etc.). Now that I re-read
               | the original comment, they did specify "Disney-based",
               | so maybe I'm over-reacting!
        
       | p1esk wrote:
       | And then you learn about binary or ternary networks where
       | gradients don't really exist anywhere, and you start to wonder
       | about the importance of this differentiability.
        
         | whimsicalism wrote:
         | binary networks don't really work well unless you do a
         | relaxation first
        
         | ubj wrote:
         | ...And then you start learning about generalizations of the
         | notion of "gradient" to scenarios where the classical gradient
         | doesn't exist :)
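         | 
         | One practical workaround in that spirit is the
         | straight-through estimator: binarize on the forward pass,
         | pretend it was the identity on the backward pass. A minimal
         | PyTorch sketch (not a full binary-network training recipe):
         | 
         |     import torch
         | 
         |     def binarize_ste(w):
         |         # forward value is sign(w); the gradient passes
         |         # straight through to the real-valued weights
         |         return w + (torch.sign(w) - w).detach()
         | 
         |     w = torch.randn(4, 3, requires_grad=True)
         |     x = torch.randn(8, 4)
         |     loss = (x @ binarize_ste(w)).pow(2).mean()
         |     loss.backward()
         |     print(w.grad.shape)  # grads reach the latent weights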
        
       | gfaure wrote:
       | In the literature, they're usually called convolutional layers (I
       | think you can pretty much search and replace all uses of
       | "convolutive" in the text).
        
       | seanhunter wrote:
       | Wow, just skimmed a bit, but this book looks amazing so far.
       | Really understandable, with an intuitive presentation of the
       | underlying maths that invites the reader to go deeper if they
       | want to, by giving them what they need to get started.
        
       ___________________________________________________________________
       (page generated 2024-05-01 23:02 UTC)