[HN Gopher] Differentiable Programming - A Simple Introduction
___________________________________________________________________
Differentiable Programming - A Simple Introduction
Author : dylanbfox
Score : 97 points
Date : 2022-04-12 10:21 UTC (1 day ago)
(HTM) web link (www.assemblyai.com)
(TXT) w3m dump (www.assemblyai.com)
| yauneyz wrote:
| My professor has talked about this. He thinks that the real gem
| of the deep learning revolution is the ability to take the
| derivative of arbitrary code and use that to optimize. Deep
| learning is just one application of that, but there are tons
| more.
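| For a concrete flavour of that, here's a minimal JAX sketch
| (the function and names are just illustrative): you take the
| gradient of an ordinary Python function containing a loop and
| a branch, no "model" required.
|
|   import jax
|   import jax.numpy as jnp
|
|   def my_program(x):
|       # ordinary code: a loop and a conditional, not a neural network
|       total = 0.0
|       for i in range(3):
|           total = total + jnp.sin(x) ** i
|       return jnp.where(x > 0, total, -total)
|
|   grad_fn = jax.grad(my_program)   # derivative of the whole program
|   print(grad_fn(1.5))              # d(my_program)/dx at x = 1.5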
| SleekEagle wrote:
| That's part of why Julia is so exciting! Building it
| specifically to be a differentiable programming language opens
| so many doors ...
| mountainriver wrote:
| Julia wasn't really built specifically to be differentiable;
| it was just built in a way that gives you access to the IR,
| which is what Zygote uses. Enzyme AD is the most exciting to
| me because it lets any LLVM language be differentiated.
| SleekEagle wrote:
| Ah I see, thank you for clarifying. And thank you for
| bringing Enzyme to my attention - I've never seen it
| before!
| melony wrote:
| I am just happy that the previously siloed fields of operations
| research and various control theory sub-disciplines are now
| incentivized to pool their research together thanks to the
| funding in ML. Also many expensive and proprietary optimization
| software in industry are finally getting some competition.
| SleekEagle wrote:
| Hm, I didn't know different areas of control theory were
| siloed. Learning about control theory in graduate school was
| awesome and it seems like a field that would benefit a lot
| from ML. I know RL agents are used for control problems like
| cartpole, but I would've thought it would be more
| widespread! Do you think the development of Differentiable
| Programming (i.e. the recognition that it generalizes beyond
| pure ML/DL) was really the missing piece?
|
| Also, just curious, what are your studies in?
| melony wrote:
| Control theory has a very, very long parallel history
| alongside ML. ML, specifically probabilistic and
| reinforcement learning, uses a lot of dynamic programming
| ideas and Bellman equations in its theoretical modeling.
| Look up the term cybernetics; it is an old pre-internet term
| for control theory and optimization. The Soviets even had a
| grand scheme to build networked factories that could be
| centrally optimized and resource-allocated. Their Slavic
| communist AWS-meets-Walmart efforts spawned a Nobel laureate:
| Kantorovich was given the award for his work on linear
| programming and the optimal allocation of resources.
|
| Unfortunately the CS field is only just rediscovering control
| theory, while it has been a staple of EE for years. However,
| there hadn't been many new developments in the field until
| recently, when ML became the hottest new thing.
| SleekEagle wrote:
| This is some insanely cool history! I had no idea the
| Soviets had such a technical vision, that's actually
| pretty amazing. I've heard the term "cybernetics" but
| honestly just thought it was some movie-tech term, lol.
|
| It seems really weird that control theory is in EE
| departments considering it's sooo much more mathematical
| than most EE subdisciplines except signal processing. I
| remember a math professor of mine telling us about
| optimization techniques that control systems practitioners
| would know more about than applied mathematicians, because
| they were developed specifically for the field; can't
| remember what the techniques were, though ...
| melony wrote:
| There is this excellent HN-recommended book called _Red
| Plenty_ that dramatises those efforts on the Soviet side.
|
| https://news.ycombinator.com/item?id=8417882
|
| > _It seems really weird that control theory is in EE
| departments considering it's sooo much more mathematical
| than most EE subdisciplines except signal processing._
|
| I agree. Apparently Bellman called dynamic programming what
| it is because he needed grant funding during the Cold War
| days and was advised to give his mathematical theories a
| more "interesting" name.
|
| https://en.m.wikipedia.org/wiki/Dynamic_programming#History
|
| The generalised form of the Bellman equation (co-formulated
| by Kalman, of Kalman filter fame) is to control theory and
| EE in some ways what the maximum likelihood function is to
| ML.
|
| https://en.m.wikipedia.org/wiki/Hamilton%E2%80%93Jacobi%E2%8...
| SleekEagle wrote:
| Looks really cool, added to my amazon cart. Thanks for
| the rec!
|
| That's hilarious and sadly insightful. I remember thinking
| "what the hell is so 'dynamic' about this?" the first
| time I learned about dynamic programming. Although
| "memoitative programming" sounds pretty fancy too, lol
| potbelly83 wrote:
| How do you differentiate a string? Enum?
| adgjlsfhk1 wrote:
| generally you consider them to be piecewise constant.
| tome wrote:
| Or more precisely, discrete.
| 6gvONxR4sf7o wrote:
| The answer to that is a huge part of the NLP field. The
| current approach is to break the string down into
| constituent parts and map each of them into a high-
| dimensional space. "cat" becomes a large vector whose
| position is continuous and therefore differentiable; "the
| cat" probably becomes a pair of vectors.
| titanomachy wrote:
| If you were dealing with e.g. English words rather than
| arbitrary strings, one approach would be to treat each word
| as a point in n-dimensional space. Then you can use
| continuous (and differentiable) functions to output into that
| space.
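| A tiny sketch of that idea (the vocabulary, table, and
| dimensions below are all made up):
|
|   import jax
|   import jax.numpy as jnp
|
|   vocab = {"the": 0, "cat": 1, "sat": 2}            # toy vocabulary
|   key = jax.random.PRNGKey(0)
|   table = jax.random.normal(key, (len(vocab), 8))   # one 8-d vector per word
|
|   def score(table, token_ids):
|       vecs = table[token_ids]     # "the cat" -> two 8-d vectors
|       return jnp.sum(vecs)        # any differentiable function of them
|
|   ids = jnp.array([vocab["the"], vocab["cat"]])
|   grads = jax.grad(score)(table, ids)   # gradients flow back into the table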
| noobermin wrote:
| The article is okay but it would have helped to have labelled the
| axes of the graphs.
| choeger wrote:
| Nice article, but the intro is a little lengthy.
|
| I have one remark, though: If your language allows for automatic
| differentiation already, why do you bother with a neural network
| in the first place?
|
| I think you should have a good reason why you choose a neural
| network for your approximation of the inverse function and why it
| has exactly that amount of layers. For instance, why shouldn't a
| simple polynomial suffice? Could it be that your neural network
| ends up as an approximation of the Taylor expansion of your
| inverse function?
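| (For instance, here's roughly what I mean by "a simple
| polynomial": a rough sketch that fits a cubic's coefficients
| directly with AD, no network; the target function below is
| just a stand-in.)
|
|   import jax
|   import jax.numpy as jnp
|
|   def poly(coeffs, x):
|       # a cubic as the approximator instead of a neural network
|       return coeffs[0] + coeffs[1] * x + coeffs[2] * x**2 + coeffs[3] * x**3
|
|   def loss(coeffs, xs, ys):
|       return jnp.mean((poly(coeffs, xs) - ys) ** 2)
|
|   xs = jnp.linspace(-1.0, 1.0, 50)
|   ys = jnp.sin(2.0 * xs)          # stand-in for the function to approximate
|   coeffs = jnp.zeros(4)
|   for _ in range(500):
|       coeffs = coeffs - 0.1 * jax.grad(loss)(coeffs, xs, ys)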
| SleekEagle wrote:
| I think for more complicated examples like RL control systems a
| neural network is the natural choice. If you can incorporate
| physics into your world model then you'd need differentiable
| programming + NNs, right? Or am I misunderstanding the
| question?
|
| If you're talking about the specific cannon problem, you don't
| need to do any learning at all; you can just solve the
| kinematics, so in some sense you could ask why you're using
| _any_ approximation function.
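| (e.g., if the cannon problem is the usual flat-ground
| projectile setup, which I'm assuming here, the closed-form
| answer is just:)
|
|   import math
|
|   # flat-ground projectile: range R = v**2 * sin(2*theta) / g
|   def launch_angle(v, R, g=9.81):
|       # solve for the launch angle given muzzle speed v and target range R
|       return 0.5 * math.asin(g * R / v ** 2)
|
|   print(math.degrees(launch_angle(v=30.0, R=60.0)))   # ~20.4 degrees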
| infogulch wrote:
| The most interesting thing I've seen on AD is "The simple essence
| of automatic differentiation" (2018) [1]. See past discussion
| [2], and talk [3]. I think the main idea is that by compiling to
| categories and pairing up a function with its derivative, the
| pair becomes trivially composable in forward mode, and the whole
| structure is easily converted to reverse mode afterwards.
|
| [1]: https://dl.acm.org/doi/10.1145/3236765
|
| [2]: https://news.ycombinator.com/item?id=18306860
|
| [3]: Talk at Microsoft Research:
| https://www.youtube.com/watch?v=ne99laPUxN4 Other presentations
| listed here: https://github.com/conal/essence-of-ad
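| Roughly, the "pair a function with its derivative" idea looks
| like dual numbers (a toy forward-mode sketch, not the paper's
| categorical construction):
|
|   # each value carries (value, derivative); arithmetic composes both
|   class Dual:
|       def __init__(self, val, der):
|           self.val, self.der = val, der
|       def __add__(self, other):
|           return Dual(self.val + other.val, self.der + other.der)
|       def __mul__(self, other):
|           # product rule composes the derivatives for us
|           return Dual(self.val * other.val,
|                       self.der * other.val + self.val * other.der)
|
|   def f(x):
|       return x * x + x            # works on floats and Duals alike
|
|   print(f(Dual(3.0, 1.0)).der)    # 2*3 + 1 = 7.0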
| tome wrote:
| > the whole structure is easily converted to reverse mode
| afterwards.
|
| Unfortunately it's not. Elliott never actually demonstrates in
| the paper how to implement such an algorithm, and it's _very_
| hard to write compiler transformations in "categorical form".
|
| (Disclosure: I'm the author of another paper on AD.)
| amkkma wrote:
| which paper?
| orbifold wrote:
| I think JAX effectively demonstrates that this is indeed
| possible. The approach they use is to first linearise the
| jaxpr and then transpose it, pretty much in the same fashion
| as the Elliott paper does.
| PartiallyTyped wrote:
| The nice thing about differentiable programming is that we can
| use all sorts of optimizers beyond gradient descent, some of
| which offer quadratic convergence instead of linear!
| SleekEagle wrote:
| Yes exactly! This is huge. Hessian optimization is really easy
| with JAX; haven't tried it in Julia though.
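| e.g. something along these lines (just a sketch on the usual
| Rosenbrock toy problem):
|
|   import jax
|   import jax.numpy as jnp
|
|   def rosenbrock(w):
|       return (1.0 - w[0]) ** 2 + 100.0 * (w[1] - w[0] ** 2) ** 2
|
|   w = jnp.zeros(2)
|   for _ in range(10):
|       g = jax.grad(rosenbrock)(w)
|       H = jax.hessian(rosenbrock)(w)     # exact Hessian via AD
|       w = w - jnp.linalg.solve(H, g)     # Newton step: quadratic convergence
|   print(w)                               # reaches [1., 1.] in a few steps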
| PartiallyTyped wrote:
| And very fast given that you compile the procedure! I am
| considering writing an article on this and posting it here
| because I have seen enormous improvements over non-jitted
| code, and that was without even using jax.vmap.
| SleekEagle wrote:
| There's a comparison of JAX with PyTorch for Hessian
| calculation here!
|
| https://www.assemblyai.com/blog/why-you-should-or-shouldnt-b...
|
| Would definitely be interested in an article like that if
| you decide to write it
| ChrisRackauckas wrote:
| Here's Hessian-Free Newton-Krylov on neural ODEs with Julia:
| https://diffeqflux.sciml.ai/dev/examples/second_order_adjoin...
| It's just standard tutorial stuff at this point.
| applgo443 wrote:
| Why can't we use this quadratic convergence in deep learning?
| PartiallyTyped wrote:
| Well, quadratic convergence usually requires the Hessian, or
| an approximation of it, and that's difficult to get in deep
| learning due to memory constraints and the difficulty of
| computing second-order derivatives.
|
| Computing the derivatives is not very difficult with e.g.
| JAX, but ... you get back to the memory issue. The Hessian is
| a square matrix, so in deep learning, if we have a million
| parameters, then the Hessian has a trillion entries...
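| (The usual workaround is to never materialise the Hessian and
| only compute Hessian-vector products; here's a sketch of the
| standard forward-over-reverse trick in JAX, with a made-up
| loss:)
|
|   import jax
|   import jax.numpy as jnp
|
|   def loss(w):
|       return jnp.sum(jnp.tanh(w) ** 2)      # stand-in for a real loss
|
|   def hvp(f, w, v):
|       # Hessian-vector product without building the n x n Hessian:
|       # memory stays O(n) instead of O(n^2)
|       return jax.jvp(jax.grad(f), (w,), (v,))[1]
|
|   w = jnp.ones(1_000_000)    # a million parameters
|   v = jnp.ones(1_000_000)
|   print(hvp(loss, w, v)[:3])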
| tome wrote:
| Not only does it have 1 trillion elements, you also have to
| invert it!
| PartiallyTyped wrote:
| Indeed! BFGS (and its variants) approximate the inverse,
| but they have other issues that make them prohibitively
| expensive at that scale.
| SleekEagle wrote:
| https://c.tenor.com/enoxmmTG1wEAAAAC/heart-attack-in-pain.gi...
___________________________________________________________________
(page generated 2022-04-13 23:01 UTC)