[HN Gopher] Show HN: Boldly go where Gradient Descent has never ...
       ___________________________________________________________________
        
       Show HN: Boldly go where Gradient Descent has never gone before
       with DiscoGrad
        
       Trying to do gradient descent using automatic differentiation over
       branchy programs? Or to combine them with neural networks for end-
       to-end training? Then this might be interesting to you.  We
        developed DiscoGrad, a tool for automatic differentiation through
       C++ programs involving input-dependent control flow (e.g., "if
       (f(x) < c) { ... }", differentiating wrt. x) and randomness. Our
       initial motivation was to enable the use of gradient descent with
       simulations, which often rely heavily on such discrete branching.
       The latter makes plain autodiff mostly useless, since it can only
       account for the single path taken through the program. Our tool
       offers several backends that handle this situation, giving useful
       descent directions for optimization by accounting for alternative
       branches. Besides simulations, this problem arises in many other
       places, for example in deep learning when trying to combine
       imperative programs with neural networks.  In a nutshell, DiscoGrad
       applies an (LLVM-based) source-to-source transformation to your C++
       program, adding some calls to our header library, which then
       handles the gradient computation. What sets it apart from similar
       tools/estimators is that it's fully automatic (no need to come up
       with a differentiable problem formulation/reparametrization) and
       that the branching condition can be any function of the program
       inputs (no need to know upfront what distribution the condition
       follows).  We're currently a team of two working on DiscoGrad as
       part of a research project, so don't expect to see production-grade
       code quality, but we do intend for it to be more than a throwaway
       research prototype. Use cases we've successfully tested include
       calibrating simulation models of epidemics or evacuation scenarios
       via gradient descent, and combining simulations with neural
       networks in an end-to-end trainable fashion.  We hope you find this
       interesting and useful, and we're happy to answer questions!
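         
        For a flavor of the kind of program we mean, here is a minimal
        hand-written sketch in plain C++ (an illustration only, not
        DiscoGrad's actual API): a toy objective whose control flow
        depends on its input, so plain autodiff only accounts for the
        branch actually taken.
         
          #include <vector>
         
          // Toy objective with input-dependent control flow.
          double loss(const std::vector<double>& x, double c) {
            double acc = 0.0;
            for (double xi : x) {
              if (xi * xi < c)    // branch depends on the input x
                acc += xi * xi;   // smooth on this side...
              else
                acc += 1.0;       // ...constant on the other side
            }
            return acc;
          }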
        
       Author : frankling_
       Score  : 132 points
       Date   : 2024-05-26 12:14 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | eranation wrote:
        | Layman question - I sort of understand what gradient descent
        | is, but I'm not sure I fully understand what DiscoGrad is
        | doing. My probably incorrect, naive understanding is: it
        | finds optimal params for a program by converting the
        | branches of the program into something that resembles a
        | "smooth" loss function, so a traditional
       | gradient descent algorithm can find local minima and suggest the
       | optimal params / weights?
       | 
       | EDIT: removed part of the question that is answered in the
       | article.
        
         | PeterHolzwarth wrote:
         | You mean besides the examples of use cases mentioned in the
         | linked page?
        
           | eranation wrote:
           | Good point, edited the question.
        
         | frankling_ wrote:
         | Yep, that's exactly it. The smoothness can either come from
         | randomness in the program itself (then the objective function
         | is asymptotically smooth and DiscoGrad estimates the gradient
         | of that smooth function), or the smoothness can be introduced
         | artificially.
         | 
         | As an example, the very first thing we looked into was a
         | transportation engineering problem, where the red/green phases
         | of traffic lights lead to a non-smooth optimization problem. In
         | essence, in that case we were looking for the "best possible"
         | parameters for a transportation simulation (in the form of a
         | C++ program) that's full of branches.
        
           | eranation wrote:
           | Awesome, thank you! Interesting. The first thing that came to
           | my mind regarding the traffic light example is any problem
            | that reduces to a SAT solver. I assume there are some
            | that are clearly un-smoothable in polynomial time;
            | otherwise this would
           | have interesting consequences...
        
             | frankling_ wrote:
             | I agree with that intuition. In our experience, it's
             | easiest to see gains over other optimization techniques
             | when the program is "branch-wise smooth and non-constant".
             | Then, we get the full benefits of exact autodiff gradients
             | "per branch", and our smoothing approach handles the
             | branches. For SAT solving and other purely combinatorial
             | problems, sufficiently accurate smoothing may indeed be too
             | expensive. Also, in such problems, the average local
             | minimum found via gradient descent may not always be that
             | great. That said, we're still exploring where the limits
             | really are.
        
               | eranation wrote:
               | Thank you for the responses! I'll be following your
               | research for sure.
        
           | epr wrote:
           | Obviously, some of this branching, etc. is not
           | differentiable. Is it doing finite differences?
        
       | ipunchghosts wrote:
        | Talk about the differences between this and Ceres:
        | http://ceres-solver.org/
        
         | frankling_ wrote:
         | The key point is that Ceres requires derivatives, which can
         | come from manually derived formulae, approximations via finite
          | differences, or autodiff
          | (http://ceres-solver.org/derivatives.html). DiscoGrad
          | doesn't do the
         | optimization itself (for that, we use gradient descent, for
         | example via Adam), but essentially represents a fourth option
         | to obtain derivatives, and one which captures the branches in
         | an optimization problem (which autodiff doesn't).
         | 
         | While I'm not super familiar with the typical use cases for
         | Ceres, the gradient estimator from DiscoGrad could possibly be
         | integrated to better handle branchy problems.
        
       | vessenes wrote:
       | When all you have is a hammer, ... you split up your house into
       | wood and nails, and reduce it to a previously solved problem!
       | 
       | In all seriousness, this is super interesting. I really like the
       | idea of implementing gradient descent solving branch by branch,
       | and turning it into an optimization-level option for codebases.
       | 
       | I feel like this would normally be something commercialized by
       | Intel's compiler group; it's hard for me to know how to get this
       | out more broadly -- it would probably need to be standardized in
       | some way?
       | 
       | Anyway, thanks for working on it and opening it up -- very cool.
       | Needs more disco balls.
        
         | frankling_ wrote:
         | Thanks for the kind words! We'd be super happy if this work
         | gets picked up, whether in a commercial context or not.
         | 
         | We were thinking of some disco ball-based logo (among some
         | other designs). With this encouragement, there'll probably be
         | an update in the next few days :)
        
       | pavlov wrote:
       | Discograd: a little-known top secret Soviet project where
       | Brezhnev wanted to counter American influence in global popular
       | music by creating an entire military town dedicated to evolving
       | the four-on-the-floor beat towards Marxist-Leninist perfection
       | through cybernetic principles of rhythm composition.
       | 
       | (After 1991 Discograd was demilitarized and renamed Grungetown to
       | attract foreign investments.)
        
         | elpocko wrote:
         | In this alternate universe, Discograd was successful and became
         | the cultural hub for Soviet rock music. It influenced bands
         | like Kino, Aquarium and DDT who incorporated Marxist-Leninist
         | themes into their music. Disco beats were indeed evolved but in
         | a way that resonated with the ideology of the time. The four-
         | on-the-floor beat became more intricate and complex,
         | incorporating elements of jazz and folk music from various
         | Soviet republics.
         | 
         | In this reality, Discograd hosted the first Soviet Rock
         | Festival, which was attended by thousands of enthusiastic fans
         | from all over the USSR. The festival featured performances by
         | bands that were formed and nurtured in Discograd, showcasing a
         | new genre: Proletrock - a unique fusion of disco, rock, jazz
         | and Soviet folk music, with lyrics promoting socialist values
         | and workers' rights.
         | 
         | Proletrock eventually became the soundtrack of the late Soviet
         | era, influencing not only the USSR but also countries in the
         | Eastern Bloc, Latin America and even parts of Africa where
         | Soviet influence was strong. The genre helped to spread
         | communist ideology through catchy beats and thought-provoking
         | lyrics, making Discograd an integral part of music history.
         | 
         | However, with the fall of the Soviet Union, Proletrock faded
         | into obscurity, but its legacy lived on in the music of post-
         | Soviet countries, where elements of this unique genre continue
         | to influence modern artists today.
         | 
         | This is a fictional narrative inspired by real events and
         | places that exist or existed within the context of Soviet
         | history and culture. It serves as a creative exploration of
         | what could have been if the USSR had pursued such an ambitious
         | project with the same fervor it dedicated to its space program.
         | 
         | (WizardLM-2-7B)
        
           | pavlov wrote:
           | When the producers at Apple TV+ finally get bored with "For
           | All Mankind", they could greenlight this as a spin-off.
        
       | s_tim wrote:
       | How does this compare to Enzyme (https://enzyme.mit.edu/)?
        
         | frankling_ wrote:
          | Enzyme is traditional, but super duper optimized, autodiff:
          | it returns the partial derivatives for one path taken
          | through the program, ignoring other branches. DiscoGrad
          | captures the effects of alternative branches. What's
          | special about Enzyme is that the gradient computations
          | benefit from
         | LLVM's optimization passes and language support.
        
       | szvsw wrote:
       | Awesome! Several of my colleagues are working on differentiable
       | physics simulations (mostly FEM type stuff for structural design
       | optimization) so I'm excited to share this with them! They mostly
       | work in Julia. My own experiments with auto-diff'd physics sims
       | have been in Python (specifically, Taichi for the JIT/GPU
       | acceleration or occasionally PyTorch/Jax).
       | 
       | Can you talk a little bit about the challenges of bringing
       | something like what you've implemented to existing autograd
       | engines/frameworks (like the ones previously mentioned)? Are you
       | at all interested in exploring that as a mechanism for increasing
       | access to your methodology? What are your thoughts on those
       | autodiff engines?
        
         | avibryant wrote:
         | Do you have any links to your experiments and/or those of your
         | colleagues?
        
           | szvsw wrote:
           | Emailed you through the email listed on your HN profile
        
         | frankling_ wrote:
         | We actually did some preliminary experiments with Taichi hoping
         | to benefit from the GPU parallelization. I think generally, the
         | world of autodiff tooling is in very good shape. For anything
         | non-exotic, we just use JAX or Torch to get things done quickly
         | and with good performance.
         | 
         | Generally, integrating the ideas behind DiscoGrad into existing
         | frameworks has been on our mind since day one, and the C++
          | implementation is a bit of a compromise: it gave us a lot
          | of flexibility during development while the algorithms
          | were still a moving target, plus good performance (albeit
          | without
         | parallelization and GPU support as of yet). Based on
         | DiscoGrad's current incarnation, however, it should not be
         | terribly hard to, say, develop a JAX+DiscoGrad fork and offer
         | some simple "branch-like" abstraction. While we've been looking
         | into this, it can be a bit tricky in a university context to do
         | the engineering leg work required to build something robust...
        
           | szvsw wrote:
           | > While we've been looking into this, it can be a bit tricky
           | in a university context to do the engineering leg work
           | required to build something robust...
           | 
           | I definitely hear you on this! As a grad student who is one
           | of the only developers with actual professional dev xp in my
           | lab, it can be brutal being tasked with turning academic
           | spaghetti code into something semi-productionized/robust.
        
       | dwrodri wrote:
       | This is the sort of thing I expected to see when Chris Lattner
       | moved to Google and started working on the Swift for Tensorflow
       | project. I am so grateful that someone is making it happen!
       | 
       | I remember being taught how to write Prolog in University, and
       | then being shown how close the relationship was between building
       | something that parses a grammar and building something that
       | generates valid examples of that grammar. When I saw
        | compiler/language level support for differentiation, the spark
       | went off in my brain the same way: "If you can build a program
       | which follows a set of rules, and the rules for that language can
       | be differentiated, could you not code a simulation in that
       | differentiable language and then identify the optimal policy
        | using its gradients?"
       | 
       | Best of luck on your work!
        
         | usgroup wrote:
         | Not really. The world of Bayesian modelling has much fancier
          | tools: Hamiltonian MC. See MC Stan. There have also been
          | Gibbs samplers and other techniques which support discrete
          | decisions for donkey's years.
         | 
         | You can write down just about anything as a BUGS model for
         | example, but "identifying the model" --- finding the uniquely
          | best parameters, even though it's a global optimisation ---
         | is often very difficult.
         | 
         | Gradient descent is significantly more limiting than that.
         | Worth understanding MC. The old school is a high bar to jump.
        
           | cma wrote:
           | MC: Monte Carlo
        
         | justinnk wrote:
         | Thanks! You may find DeepProbLog by Manhaeve et al.
         | interesting, which brings together logic programming,
         | probabilistic programming and gradient descent/neural networks.
         | Also, more generally, I believe in the field of program
         | synthesis there is some research on deriving programs with
         | gradient descent. However, as also pointed out in the comment
         | below, gradient descent may not always be the best approach to
         | such problems (e.g., https://arxiv.org/abs/1608.04428).
        
       | esafak wrote:
       | Wouldn't replacing the flow control statements with ML models
       | slow it down too much? Do you have the ability to automatically
       | estimate the appropriate model complexity for a given statement
       | based on how hot it is?
        
         | frankling_ wrote:
         | We're doing something less expensive: essentially, the overall
          | gradient is computed from certain statistics of the branch
          | condition and its derivatives, collected whenever a branch
          | is encountered.
         | 
         | We mention neural networks because DiscoGrad lets you combine
         | branching programs with neural networks (via Torch) and jointly
         | train/optimize them.
        
       | boywitharupee wrote:
       | how would you compare this to the polytope model?
       | https://en.wikipedia.org/wiki/Polytope_model
        
         | frankling_ wrote:
         | Not super closely related: the polytope model (to the degree
         | I'm familiar with it) is used as a representation that
         | facilitates optimization of loop nests. That's optimization in
         | the sense of finding an efficient program.
         | 
         | DiscoGrad deals with (or provides gradients for) mathematical
         | optimization. In our case, the goal is to minimize or maximize
          | the program's numerical output by adjusting its input
         | parameters. Typically, your C++ program will run somewhat
         | slower with DiscoGrad than without, but you can now use
         | gradient descent to quickly find the best possible input
         | parameters.
        
       | casualscience wrote:
       | I'm confused as to the use cases for this? Are you saying if I
       | want to fit some "magic numbers" in my cpp program, I can now do
       | that by pulling in discograd and wrapping those numbers with some
       | code that says "please fit these", then adding some test cases
       | somewhere?
        
         | jey wrote:
         | No, the use cases for this are similar to regular autodiff,
         | where you implement a function f(x) and the library helps you
         | automatically compute derivatives such as the gradient g(x) :=
          | ∇f(x). Various autodiff methods differ in how they accomplish
         | this, and the library shared here uses a code-generation
         | approach where it performs a source-to-source transformation to
         | generate source code for g(x) based on the code for f(x).
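          | 
          | As a purely conceptual illustration of that idea (a hand-
          | written sketch, not what this particular tool emits): given
          | the code for f, the transform generates companion code for
          | its derivative.
          | 
          |   // f(x) = x * x and its generated derivative code.
          |   double f(double x)      { return x * x; }
          |   double grad_f(double x) { return 2.0 * x; }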
        
           | justinnk wrote:
           | You are right in that the use-cases are very similar to
           | regular autodiff, with the added benefit that the returned
           | gradient also accounts for the effects of taking alternative
           | branches.
           | 
           | Just to clarify: we do a kind of source-to-source
            | transformation by transparently injecting some API calls
            | in the right places (e.g., before branching statements)
            | before
           | compilation. However, the compiled program then returns the
           | program output alongside the gradient.
           | 
           | For the continuous parts, the AD library that comes with
           | DiscoGrad uses operator overloading.
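            | 
            | For intuition, a minimal dual-number sketch of what
            | operator-overloading forward-mode AD looks like (an
            | illustration, not DiscoGrad's actual AD types):
            | 
            |   // Dual number: val holds the value, dot the
            |   // derivative wrt the chosen input.
            |   struct Dual { double val, dot; };
            |   Dual operator*(Dual a, Dual b) {
            |     return {a.val * b.val,
            |             a.dot * b.val + a.val * b.dot};
            |   }
            |   Dual operator+(Dual a, Dual b) {
            |     return {a.val + b.val, a.dot + b.dot};
            |   }
            |   int main() {
            |     Dual x{3.0, 1.0};    // seed dx/dx = 1
            |     Dual y = x * x + x;  // y.val = 12, y.dot = 7
            |     return 0;
            |   }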
        
         | justinnk wrote:
         | (I am one of the authors) Thanks for your question. Yes,
         | similar to what you describe but not quite. The prime use case
         | is to apply DiscoGrad together with a gradient descent
         | optimizer to optimization problems. For a C++ program to be
         | regarded as such, you have to define what the "inputs" are and
         | the program has to return some numerical value (loss) that is
         | to be maximized/minimized. The tool then delivers a "direction"
         | (smoothed gradient), which gradient descent can use to adjust
         | the inputs toward a local optimum.
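          | 
          | As a rough sketch of that loop (smoothed_gradient here is
          | a hypothetical stand-in for whatever delivers the smoothed
          | gradient of your program):
          | 
          |   #include <cstddef>
          |   #include <vector>
          | 
          |   // Hypothetical oracle: runs the program and returns
          |   // d(loss)/d(inputs), smoothed over branches.
          |   std::vector<double>
          |   smoothed_gradient(const std::vector<double>& in);
          | 
          |   // Plain gradient descent on the program inputs.
          |   void optimize(std::vector<double>& in, double lr,
          |                 int steps) {
          |     for (int s = 0; s < steps; ++s) {
          |       std::vector<double> g = smoothed_gradient(in);
          |       for (std::size_t i = 0; i < in.size(); ++i)
          |         in[i] -= lr * g[i];  // step toward a local optimum
          |     }
          |   }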
         | 
         | So if you can express your test cases in a numerical way and
         | make the placeholders for the "magic numbers" visible to the
         | tool by regarding them as "inputs" (which should generally be
         | possible), this may be a possible use-case. Hope this clarifies
         | it.
        
       | brap wrote:
       | Somewhat related: is there autograd in python that uses AST
       | analysis or something similar? The only method I'm familiar with
       | uses tracer objects which have their gotchas. I'd like to compile
       | a python function (at runtime) while writing its code as
       | naturally as possible
        
         | szvsw wrote:
         | Taichi (Python package) definitely uses AST for at least the
         | JIT'ing/kernel generation, not sure about how the autograd
         | works under the hood but these links will get you started.
          | There are plenty of publications about Taichi which you can
         | look up as well for more detail I am sure.
         | 
         | https://docs.taichi-lang.org/docs/differentiable_programming
         | 
         | https://docs.taichi-lang.org/docs/compilation
        
       | HarHarVeryFunny wrote:
       | What do you mean by plain autodiff being mostly useless with
       | normal/discrete branching? Wouldn't branches normally just be
       | "ignored" by autodiff - each training sample being a different
       | computational graph (but with parts in common) due to branching
       | points, so the only effect of branching is which computational
       | graph gets executed and backpropagated through?
       | 
       | What's the general type of use case where this default behavior
       | is useless, and "non-discrete" (stochastic?) branching helps?
        
         | frankling_ wrote:
         | That's right, plain autodiff just ignores branches. Our
         | canonical "why is this even needed" example is a program like
         | "if (x >= 0) return 1; else return 0", x being the input.
         | 
         | The autodiff derivative of this is zero, wherever you evaluate
         | it, so if you sample x and run your program on each x as in a
         | classical ML setup, you'd be averaging over a series of zero-
         | derivatives. This is of course not helpful to gradient descent.
         | In more complex programs, it's less blatant, but the gist is
         | that just averaging sampled gradients over programs (input-
         | dependent!) branches yields biased or zero-valued derivatives.
         | The traffic light optimization example shown on Github is a
         | more complex example where averaged autodiff-gradients are
         | always zero.
        
           | HarHarVeryFunny wrote:
           | Thanks, but could you briefly expand on what's happening in
            | the minimal if (x >= 0) case with the DiscoGrad
            | modification? What source code modification could the
            | user have made
           | to achieve the same effect?
        
             | frankling_ wrote:
             | In DiscoGrad, smoothing would be applied by adding Gaussian
             | noise with some configurable variance to x and running the
             | program on those x's. The gradient would then be calculated
             | based on the branch condition's derivative wrt. x (which is
             | 1) and an estimate of the distribution of the condition
             | (which is Gaussian).
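              | 
              | Roughly, in code (a Monte Carlo illustration of the
              | perturbation idea, not our actual estimator):
              | 
              |   #include <random>
              | 
              |   // Smoothed value of (x >= 0 ? 1 : 0), averaged
              |   // over Gaussian perturbations of the input x.
              |   double smoothed_step(double x, double sigma,
              |                        int n) {
              |     std::mt19937 rng(42);
              |     std::normal_distribution<double> d(0.0, sigma);
              |     double sum = 0.0;
              |     for (int i = 0; i < n; ++i)
              |       sum += (x + d(rng) >= 0.0) ? 1.0 : 0.0;
              |     return sum / n;  // approaches Phi(x / sigma)
              |   }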
             | 
              | In this specific example, the smoothed program happens
              | to be exactly the Gaussian cumulative distribution
              | function (and its derivative the Gaussian density), so
              | the user could just replace the program with
             | that function. However, for more complex programs, it'd be
             | hard to find such correspondences manually.
        
           | Y_Y wrote:
           | Plain autodiff gives the correct derivative, but you modify
           | the derivative to push people towards your global minimum?
        
       | usgroup wrote:
       | I used to replace strict Boolean conditionals with sigmoids in
       | order to achieve continuous transfer for Bayesian change-point
       | models.
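        | 
        | For concreteness, a hand-written sketch of the kind of
        | relaxation I mean (names made up):
        | 
        |   #include <cmath>
        | 
        |   // Hard change-point: if (t >= tau) use b, else a.
        |   // Relaxation: blend a and b smoothly around tau;
        |   // k controls how sharp the transition is.
        |   double soft_switch(double t, double tau, double a,
        |                      double b, double k) {
        |     double w = 1.0 / (1.0 + std::exp(-k * (t - tau)));
        |     return (1.0 - w) * a + w * b;
        |   }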
       | 
       | Does this do something similar or is it fancier?
        
         | frankling_ wrote:
         | Great point, the sigmoid approximation works well for certain
         | problems and that's in fact what I used in the exploratory
          | papers that led to this work. The downsides are the lack
          | of a clear interpretation of how the original program and
          | its smooth
         | counterpart are related, and the difficulty of controlling the
         | degree of smoothing as programs get longer. What DiscoGrad
         | computes has a statistical interpretation: it's the convolution
         | of the program output with whatever distribution is used for
         | smoothing, typically a Gaussian with a configurable variance.
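          | 
          | (In symbols: the smoothed program is roughly F_sigma(x) =
          | E_{eps ~ N(0, sigma^2)}[f(x + eps)], i.e. the program
          | output f convolved with a Gaussian kernel, and the
          | estimated gradient is that of F_sigma.)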
         | 
         | On top of that, if the program branches on random numbers
         | (which is common in simulations), that suffices for the maths
         | to work out and you get an estimate of the asymptotic gradient
         | (for samples -> infinity) of the original program, without any
         | artificial smoothing.
         | 
         | So in short, I do think it is slightly fancier :)
        
           | szvsw wrote:
           | I've also used the sigmoid approximation to relax some
           | problems (specifically for step changes in field properties
           | in PINN problems) into continuous variables, cool to see this
           | discussion here from a different perspective! Slightly off
           | topic but the only other things I'm aware of that are vaguely
           | related are things like the gumbel-softmax trick for making
            | sampling from categorical distributions differentiable
            | or the Gaussian reparam trick (e.g. as used in VAEs). I'm
            | curious if
           | this is at least somewhat related to the approach taken in
           | your work, in spirit if not in technical implementation?
        
             | frankling_ wrote:
             | Yeah, those tricks are highly related to what we do, the
             | main difference being that we don't require a priori
             | information about the distributions involved in the
              | program. Instead, we compute a density estimate of the
             | distribution of branch conditions at runtime, which makes
             | things quite general and fully automatic, at the cost of
             | some accuracy.
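              | 
              | As a caricature of that runtime density-estimation
              | step (hypothetical code, not our implementation):
              | collect the branch-condition values seen across
              | samples and evaluate a kernel density estimate at
              | the threshold.
              | 
              |   #include <cmath>
              |   #include <vector>
              | 
              |   // Toy Gaussian KDE of observed condition values
              |   // c_i, evaluated at the threshold 0.
              |   double kde_at_zero(const std::vector<double>& c,
              |                      double h) {
              |     // 1 / sqrt(2 * pi)
              |     const double norm = 0.3989422804014327;
              |     double sum = 0.0;
              |     for (double ci : c)
              |       sum += norm * std::exp(-0.5 * (ci/h) * (ci/h));
              |     return sum / (c.size() * h);
              |   }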
             | 
             | As an aside, the combination "known distributions +
             | automation" is covered in the Julia world by stochasticAD
             | (https://github.com/gaurav-arya/StochasticAD.jl).
        
       | sundalia wrote:
       | This is very interesting. A few questions:
       | 
       | - Why do you think similar approaches never landed on jax? My
       | guess is this is not that useful for the current optimizations in
       | fashion (transformers)
       | 
       | - How would you convince jax to incorporate this?
        
         | frankling_ wrote:
         | Well, the most common ML problems can be expressed as
         | optimization over smooth functions (or reformulated that way
         | manually). We might have to convince the ML world that branches
         | do matter :) On the other hand, there are gradient-free
         | approaches that solve problems with jumps in other ways, like
         | many reinforcement learning algorithms, or metaheuristics such
         | as genetic algorithms in simulation-based optimization. The
         | jury's still out on "killer apps" where gradient descent can
         | outperform these approaches reliably, but we're hoping to add
         | to that body of knowledge...
        
       | zaitanz wrote:
       | So this confuses me slightly and I am keen to know the advantage
       | of using this. I work on projects that heavily use auto-
       | differentiation for complex models. The models are defined by
       | user input files at run-time, so the state and execution pathway
       | of the model is unknown at compilation time. Would this help?
       | 
        | Given that all auto-differentiation is an approximation
        | anyway, I've found existing tooling (CppAD, ADMB, ADOL-C,
        | Template Model Builder (TMB)) works fine. You don't need to
        | come up with a
       | differentiable problem or re-parameterize. Why would I pick this
       | over any of those?
        
       ___________________________________________________________________
       (page generated 2024-05-26 23:00 UTC)