[HN Gopher] Differentiable Programming from Scratch
       ___________________________________________________________________
        
       Differentiable Programming from Scratch
        
       Author : sksxihve
       Score  : 88 points
       Date   : 2025-04-17 04:30 UTC (18 hours ago)
        
 (HTM) web link (thenumb.at)
 (TXT) w3m dump (thenumb.at)
        
       | weinzierl wrote:
       | _" Now that we understand differentiation, let's move on to
       | programming. So far, we've only considered mathematical
       | functions, but we can easily translate our perspective to
       | programs. For simplicity, we'll only consider pure functions,
       | i.e. functions whose output depends solely on its parameters (no
       | state)."_
       | 
        | I think I've seen this notion that purity is the constraint in
        | the documentation of autodiff libraries as well, but that can't
        | be a strong enough condition, right?
       | 
        | It is easy enough to come up with functions that are nowhere
        | differentiable. So my question is: what are the actual
        | requirements a state-of-the-art autodiff library places on the
        | input function, and why do people focus on purity if that is
        | probably the least of the problems?
        
         | yorwba wrote:
         | If the function is pure, autodiff can produce a result. If the
         | function is not differentiable for a given input, the result
         | produced by autodiff of course cannot be the derivative (since
         | none exists) but it could be another number (e.g. a
         | subderivative) that is useful in some of the ways a derivative
         | would have been useful.
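          | 
          | For instance (a sketch, not from the article): relu has no
          | derivative at x = 0, but any value in [0, 1] is a valid
          | subderivative there, and frameworks typically just pick 0 so
          | optimization can proceed.
          | 
          |     def relu(x):
          |         return max(x, 0.0)
          | 
          |     def relu_grad(x):
          |         # at the kink x = 0 there is no derivative; report
          |         # the subderivative 0 rather than failing
          |         return 1.0 if x > 0 else 0.0
          | 
          |     print(relu(0.0), relu_grad(0.0))  # 0.0 0.0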
        
         | grandempire wrote:
          | State is just a set of values that are not treated as variable
          | function inputs; in other words, each time you change the
          | state you get a new function.
          | 
          | For example, take f(x, y) = xy and define a differentiable
          | function g(x) = f(x, a). You can think of "a" as a state
          | variable.
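          | 
          | A minimal sketch of that idea in Python (hypothetical names,
          | not from the article): forward-mode autodiff with dual
          | numbers, where the captured "state" a is just a constant with
          | zero derivative.
          | 
          |     class Dual:
          |         """A value paired with its derivative w.r.t. x."""
          |         def __init__(self, val, dot):
          |             self.val, self.dot = val, dot
          |         def __mul__(self, other):
          |             if not isinstance(other, Dual):
          |                 other = Dual(other, 0.0)  # constant: dot = 0
          |             # product rule: (uv)' = u'v + uv'
          |             return Dual(self.val * other.val,
          |                         self.dot * other.val
          |                         + self.val * other.dot)
          |         __rmul__ = __mul__
          | 
          |     def f(x, y):
          |         return x * y
          | 
          |     a = 3.0                  # "state": fixed while we differentiate
          |     g = lambda x: f(x, a)    # g(x) = a*x, so g'(x) = a
          |     out = g(Dual(2.0, 1.0))  # seed dx/dx = 1
          |     print(out.val, out.dot)  # 6.0 3.0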
        
         | hansvm wrote:
          | Purity, interestingly, is not a requirement. You can represent
          | any closure as something implicitly captured, and you can
          | represent mutable state as an immutable input paired with a
          | different, related, immutable output. For any autodiff library
          | that doesn't also throw in dense+sparse decompositions (none
          | of the major ones do any such optimization; the programmer has
          | to manually choose how to represent things), that doesn't even
          | waste any more space for reverse-mode autodiff than your other
          | options: you need to maintain the before and after state
          | anyway, give or take your particular checkpointing scheme, and
          | if you can't represent the delta cheaply, the differentiation
          | is probably quite expensive compared with the underlying
          | mutable forward pass, which was probably faster and cheaper
          | than the alternatives anyway. Purity is often easier to code
          | around, especially in "complicated" languages with many
          | features you'd like to support, but it's not mandatory.
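          | 
          | A hypothetical sketch of that rewrite (not from any particular
          | library): instead of mutating a buffer in place, each step
          | takes the old state as an input and returns the new state as
          | an output, which is all an autodiff engine needs to see.
          | 
          |     def step_mutating(state, x):
          |         state["acc"] += x * x   # in-place: opaque to autodiff
          |         return state
          | 
          |     def step_pure(acc, x):
          |         return acc + x * x      # same update, state_in -> state_out
          | 
          |     acc = 0.0
          |     for x in [1.0, 2.0, 3.0]:
          |         acc = step_pure(acc, x) # chain acc0 -> acc1 -> acc2 -> acc3
          |     print(acc)                  # 14.0; every link is differentiable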
         | 
         | In terms of actual requirements, something that's sufficient
         | [0] is for every sub-component to be differentiable and for no
         | dynamic control flow to depend on the things being
         | differentiated. In practice, most libraries wind up requiring
         | something like this, mostly because it's very hard to do
          | anything else. As an example, define f(x) := 0 for floats with
          | an even LSB and 1 for floats with an odd LSB, and define g(x)
          | := 1 - f(x). Neither is meaningfully differentiable, but g(x)
          | + f(x) is identically equal to one. Autodiff relies crucially
          | on being able to perform local transformations, and that sort
          | of whole-program analysis is (a) impossible in general, and
          | (b) hard even when it's possible.
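          | 
          | A concrete version of that adversarial pair (a sketch, not
          | from the article): f reads the last bit of the float's
          | IEEE-754 encoding, so it is wildly discontinuous, yet
          | f(x) + g(x) == 1 everywhere, and no local rule can see that.
          | 
          |     import struct
          | 
          |     def lsb(x):
          |         # last bit of the IEEE-754 double encoding of x
          |         return struct.unpack("<Q", struct.pack("<d", x))[0] & 1
          | 
          |     def f(x):
          |         return float(lsb(x))  # 0.0 or 1.0, nowhere differentiable
          | 
          |     def g(x):
          |         return 1.0 - f(x)
          | 
          |     x = 0.1
          |     print(f(x) + g(x))        # 1.0 for every x, yet (f+g)'(x) = 0
          |                               # is invisible to local rules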
         | 
          | For local-only autodiff (the only thing people ever do), what
          | is necessary is for every sub-component to have a derivative-
          | like operator defined such that, if the sub-components are
          | composed into a differentiable function, then the normal chain
          | rule and the other autodiff compositions of those operators
          | also yield something differentiable that represents the
          | derivative in question (along with some requirements on
          | dynamic control flow -- they don't have to be quite as strict
          | as I described, but it's impossible in general to relax them
          | with local-only autodiff, so the dynamic requirement from the
          | above paragraph is also necessary).
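          | 
          | As a sketch of what "local" means here (my example, not the
          | article's): each piece only knows its own derivative rule, and
          | composition falls out of the chain rule.
          | 
          |     import math
          | 
          |     def h_and_grad(x):
          |         # h(x) = sin(x^2); each piece carries a local rule
          |         g, dg = x * x, 2 * x              # rule for squaring
          |         f, df = math.sin(g), math.cos(g)  # rule for sin
          |         return f, df * dg    # chain rule: h'(x) = f'(g) * g'
          | 
          |     print(h_and_grad(1.5))   # (sin(2.25), cos(2.25) * 3.0)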
         | 
          | There are few (zero?) components where that's possible in full
          | generality -- an adversary can always come up with a
          | composition for which the computed derivative is incorrect.
          | However, for some interesting functions (like eigenvalues and
          | eigenvectors), used the way people normally use them, these
          | sorts of operators can be defined. E.g., the eigenvector
          | derivative is not unique (up to a choice of phase), but if
          | your composition also doesn't depend on phase then you're
          | still fine.
         | 
          | [0] Even for things like differentiating through a loop
          | converging to a value, this property holds, with one
          | meaningful exception: the error in the derivative, compared
          | with the true function you're approximating, will still
          | converge to zero with enough iterations, but that iteration
          | count can be much higher than what you need for the function
          | itself to converge. You _will_, however, get the derivative of
          | the approximation exactly.
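          | 
          | A sketch of that caveat (my example): differentiate Newton's
          | square-root iteration by carrying (y, dy/dx) through each
          | step. Both converge, but not necessarily at the same rate.
          | 
          |     def sqrt_and_grad(x, iters):
          |         y, dy = x, 1.0   # y0 = x, dy0/dx = 1
          |         for _ in range(iters):
          |             # y_{k+1} = (y_k + x/y_k)/2, differentiated w.r.t. x
          |             dy = (dy + (y - x * dy) / (y * y)) / 2.0
          |             y = (y + x / y) / 2.0
          |         return y, dy
          | 
          |     for n in (3, 5, 10):
          |         print(n, *sqrt_and_grad(2.0, n))
          |     # y -> sqrt(2) ~ 1.41421; dy -> 1/(2*sqrt(2)) ~ 0.35355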
        
       | FilosofumRex wrote:
        | Historical fact: differentiable programming was a little-known
        | secret back in the 90's, used mainly by engineers simulating
        | numerically stiff systems like nukes and chemicals in FORTRAN
        | 95. It then disappeared for nearly 30 years before being
        | rediscovered by ML/AI researchers!
        
         | constantcrying wrote:
          | It wasn't forgotten; I learned it in university outside of any
          | AI context. It had simply exhausted most of its applications
          | and ceased being a particularly active research topic.
        
         | kxyvr wrote:
          | Automatic differentiation was actively and continuously used
          | in some communities for the last 40 years. Louis Rall
          | published an entire book about it in 1981, and one of the more
          | popular books on AD, by Griewank, was published in 2000. I
          | learned about it in university in the early 2000s. I do agree
          | that the technology was not used as widely as it should have
          | been until recently, but it was well known within the
          | numerical math world and used continuously over the years.
        
         | taeric wrote:
          | Computer Algebra Systems (CAS) were not really a secret, and
          | they often have many, many tricks that we are constantly
          | relearning. Some of this relearning, of course, is by design:
          | a lot of what they do are things we teach, like how to
          | calculate different functions. Repeated squaring and the like
          | are fun topics.
          | 
          | A lot of the current wave of relearning comes from having the
          | compute power to do these things in more places. It is also
          | something that has long been done in expensive environments
          | that many of us just don't have access to.
        
         | thechao wrote:
          | My PhD dissertation included a chapter (originally from a 2006
          | paper) on generic programming in CASes on the algorithmic
          | differentiable ring (and operator). By the 1990s, algorithmic
          | differentiation was easily 30-40 years old. Griewank & Monagan
          | both knew guys who had built early electromagnetic naval
          | targeting computers that used the methodology by hand by at
          | least the early 60s. (Very literally.)
         | 
         | I watched the ML/AI bros actively ignore previous research --
         | even when they were requested to properly cite sources they
         | were plagiarizing -- in real time. The race to publish (even
         | for big journals) was so important that it was easier to ignore
         | the rank dishonesty than it was to correct their misbehavior.
         | I'm 1000x happier to not have stayed around for all that crap.
        
           | srean wrote:
            | > on the algorithmic differentiable ring (and operator)
            | 
            | That sounds like an interesting read. Do you have the
            | chapter, or a reference to the paper, that you can share?
            | 
            | Regarding the current crop of deep neural network research,
            | its self-serving and willful blindness has a reputation
            | that's well deserved.
            | 
            | A grad student from Hinton's lab mentioned one researcher
            | who would misspell a citation on purpose so that the
            | citation count of the cited paper would not go up.
        
       | hwpythonner wrote:
        | I'm not deep into autodiff (I just recall some calculus from
        | university), but the syntax in this post reminds me a lot of ML
        | (the programming language, not machine learning).
        | 
        | I know autodiff isn't lambda calculus, but the expression-based
        | structure and evaluation rules feel similar. Couldn't this be
        | implemented in something like ML or Clojure? I'm just wondering
        | what a custom DSL adds that existing functional languages
        | wouldn't already support.
        
         | fire_lake wrote:
          | There is an F# implementation (Microsoft-flavoured ML) called
          | DiffSharp.
        
         | hansvm wrote:
         | I didn't see a DSL anywhere, just normal JS code.
         | 
         | As to what it adds?
         | 
         | - It's more accessible to a wider audience (and looks like how
         | you'd implement autodiff in most languages)
         | 
         | - It runs in the browser trivially (powering those demos)
         | 
         | - The author (potentially) didn't have to learn a new language
         | just to get started
         | 
          | - Programs are not fully differentiable, or at the very least
          | there are some crazy edge cases and dragons lurking if you
          | attempt to make them so. A dedicated whitelist of supported
          | operations isn't necessarily a bad design, compared with an
          | implicit whitelist in Clojure (depending on the
          | implementation, of course; but there wasn't much source-to-
          | source boilerplate even in this example, so I assume the
          | benefit of a functional language would be stripping away some
          | of the characteristics I think are important).
        
         | constantcrying wrote:
         | Automatic differentiation can be implemented in essentially any
         | language. Some just make it look "nicer".
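          | 
          | A tiny reverse-mode sketch in Python (my illustration, not the
          | article's code): a few overloaded operators and a backward
          | pass are all the host language has to provide.
          | 
          |     class Var:
          |         def __init__(self, val, parents=()):
          |             self.val, self.parents, self.grad = val, parents, 0.0
          |         def __mul__(self, other):
          |             # record local derivatives for the backward pass
          |             return Var(self.val * other.val,
          |                        [(self, other.val), (other, self.val)])
          |         def __add__(self, other):
          |             return Var(self.val + other.val,
          |                        [(self, 1.0), (other, 1.0)])
          |         def backward(self, seed=1.0):
          |             self.grad += seed
          |             for parent, local in self.parents:
          |                 parent.backward(seed * local)
          | 
          |     x, y = Var(2.0), Var(3.0)
          |     z = x * y + x          # z = xy + x
          |     z.backward()
          |     print(x.grad, y.grad)  # 4.0 2.0 (dz/dx = y + 1, dz/dy = x)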
        
           | kgwgk wrote:
           | For example, it looks nice in Common Lisp:
           | https://people.eecs.berkeley.edu/~fateman/papers/overload-
           | AD...
        
       | constantcrying wrote:
       | >Unfortunately, finite differences have a big problem: they only
       | compute the derivative of f in one direction. If our input is
       | very high dimensional, computing the full gradient of f becomes
       | computationally infeasible, as we would have to evaluate f for
       | each dimension separately.
       | 
        | And it is terribly unstable numerically: f(x) and f(x+h) are
        | very similar because h is very small, so you have to expect
        | catastrophic cancellation. For black boxes it is the only real
        | option, though; you can do a bit better by taking a central
        | difference, i.e. evaluating in both directions.
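        | 
        | A quick numeric illustration (my example): forward vs. central
        | differences for f(x) = sin(x) at x = 1, whose true derivative
        | is cos(1).
        | 
        |     import math
        | 
        |     f, x, true = math.sin, 1.0, math.cos(1.0)
        |     for h in (1e-4, 1e-8, 1e-12):
        |         fwd = (f(x + h) - f(x)) / h            # one-sided
        |         ctr = (f(x + h) - f(x - h)) / (2 * h)  # both directions
        |         print(h, abs(fwd - true), abs(ctr - true))
        |     # past h ~ 1e-8 the subtraction f(x+h) - f(x) cancels almost
        |     # all significant digits and the error grows again; the
        |     # central difference only delays the problem.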
        
       ___________________________________________________________________
       (page generated 2025-04-17 23:02 UTC)