[HN Gopher] Thermodynamic Natural Gradient Descent
       ___________________________________________________________________
        
       Thermodynamic Natural Gradient Descent
        
       Author : jasondavies
       Score  : 123 points
       Date   : 2024-05-24 14:50 UTC (8 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | stefanpie wrote:
       | I know they mainly present results on deep learning/neural
       | network training and optimization, but I wonder how easy it would
       | be to use the same optimization framework for other classes of
       | hard or large optimization problems. I was also curious about
       | this when I saw posts about Extropic (https://www.extropic.ai/)
       | stuff for the first time.
       | 
       | I tried looking into any public info on their website about APIs
       | or software stack to see what's possible beyond NN stuff to model
       | other optimization problems. It looks like that's not shared
       | publicly yet.
       | 
       | There are certainly many NP-hard and large combinatorial or
       | analytical optimization problems still out there that are worth
       | being able to tackle with new technology. Personally, I care
       | about problems in EDA and semiconductor design. Adiabatic quantum
       | computing was one technology with the promise of solving
       | optimization problems (and quantum computing is still playing out
       | with only small-scale solutions at the moment). Hoping that these
       | new "thermodynamic computing" startups also might provide some
       | cool technology to explore these problems with.
        
         | kaelan123 wrote:
          | Indeed, solving other optimization problems is an
          | interesting avenue. All I can say is stay tuned!
        
           | stefanpie wrote:
            | Hey thanks for the reply! I'm assuming you're the first author
           | on the paper; if so, the signup button on the Normal
           | Computing website is not working at the moment (at least for
           | me, even with ad blocker turned off).
        
       | G3rn0ti wrote:
       | Sounds great until
       | 
       | > requires an analog thermodynamic computer
       | 
       | Wait. What?
       | 
       | Perhaps a trained physicist can comment on that. Thanks.
        
         | uoaei wrote:
          | I believe one example would be quantum annealers, where
          | "programming" involves setting the right initial conditions
          | and allowing thermodynamics to bring you to an optimum via
          | relaxation.
        
           | mikewarot wrote:
            | Dwave [1] does compute that way already.
           | 
           | [1] https://www.dwavesys.com/
        
             | kaelan123 wrote:
              | One key difference is that this system is entirely
              | classical (not quantum) and noise-resilient (see the last
              | appendix).
        
         | CamperBob2 wrote:
         | The paper describes it pretty well in appendix C. A matrix of
         | integrators is constructed with a bunch of opamps, RC time
         | constants (using digital potentiometers, presumably) and a
         | multichannel ADC/DAC interface to the PC. Essentially a
         | dedicated differential-equation solver.
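          | 
          | As a toy digital rendering of that idea (my own sketch, not
          | the authors' hardware): a linear system Ax = b can be solved
          | by letting integrators relax under dx/dt = b - Ax, whose
          | fixed point is x = A^-1 b.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     M = rng.normal(size=(4, 4))
          |     A = M @ M.T + 4 * np.eye(4)  # positive definite => stable
          |     b = rng.normal(size=4)
          | 
          |     x, dt = np.zeros(4), 0.01
          |     for _ in range(5000):        # let the "circuit" settle
          |         x += dt * (b - A @ x)    # Euler step of the dynamics
          | 
          |     print(np.allclose(x, np.linalg.solve(A, b)))  # True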
         | 
         | So it's a combination of old-school analog computation and
         | modern GPU-based code. Takes longer in practice due to the
         | overhead of interfacing with the hardware and waiting for the
         | integrators to settle, but the authors are claiming that an
         | optimized implementation could outperform a purely-digital
         | solution, as I understand it, by accelerating convergence.
         | 
          | The core idea being that conventional gradient descent is a
          | linear operation at heart, while the loss surfaces actually
          | being traversed are curved and have to be approximated with
          | multiple unnecessary steps if everything is done in the
          | digital domain.
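          | 
          | A tiny numerical illustration of that point (my sketch, not
          | from the paper): on an ill-conditioned quadratic, plain GD
          | needs many small steps, while one curvature-aware step lands
          | on the minimum.
          | 
          |     import numpy as np
          | 
          |     H = np.diag([1.0, 100.0])  # ill-conditioned curvature
          |     g = lambda x: H @ x        # gradient of L(x) = x'Hx/2
          | 
          |     xs = np.array([1.0, 1.0])
          |     for _ in range(100):   # lr capped at 2/100 by stability
          |         xs -= 0.01 * g(xs)
          | 
          |     xn = np.array([1.0, 1.0])
          |     xn -= np.linalg.solve(H, g(xn))  # one Newton-like step
          | 
          |     print(np.linalg.norm(xs))  # ~0.37: still crawling in the
          |                                # low-curvature direction
          |     print(np.linalg.norm(xn))  # 0.0: done in one step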
         | 
         | The trouble, as everybody from Seymour Cray onward has learned
         | the hard way, is that CMOS always wins in the end, simply
         | because the financial power of an entire industry goes into
         | optimizing it.
        
           | stefanpie wrote:
           | I didn't realize they included details about the hardware.
            | Like you said, these just look like analog computers:
            | compute-in-memory, analog arrays, which have also made a
            | resurgence with deep learning.
        
           | kaelan123 wrote:
           | First author of the paper here. That's it indeed! One thing
           | is that this is entirely CMOS-compatible. You could also do
           | something similar with optics or other platforms, but we
           | chose electronic circuits for this reason specifically.
        
             | CamperBob2 wrote:
             | By that remark I meant "digital CMOS," in the sense of
             | elements that store state information discretely in flip-
             | flops or gate insulators rather than continuously with
             | analog integrators.
             | 
             | Very cool work in any event, though! Best of luck with the
             | ongoing R&D.
        
         | cs702 wrote:
         | The whole point is to leverage _the laws of nature_ to train AI
         | models, overcoming the limitations and scaling challenges of
         | digital hardware and existing training methods.
        
         | ein0p wrote:
         | This could be attractive if they can build a product along
         | those lines: tens, if not hundreds, of billions of dollars are
         | spent yearly on numerical optimization worldwide, and if this
         | can significantly accelerate it, it could be very profitable.
        
         | jasonjmcghee wrote:
         | From my understanding, this is exactly what https://extropic.ai
         | is working on, and I wouldn't be surprised if
         | https://normalcomputing.ai/ (authors of the paper) is as well.
        
         | nickpsecurity wrote:
          | Analog computers have a lot of history. You can Google
          | "analog" with "neural network" or "differential equations" to
          | get many results. They are fast with low power, can have
          | precision issues, and require custom chip design.
         | 
         | https://en.m.wikipedia.org/wiki/Analog_computer
         | 
         | Mixed signal ASIC's often use a mix of digital and analog
         | blocks to get the benefits of analog. It's especially helpful
         | for anything that eats lots of power or to prevent that (eg
         | mobile).
        
           | onecommentman wrote:
           | Hard to beat the string algorithm for finding shortest paths
           | on a positive weights network (e.g. build the network out of
           | string where topologies match and link lengths are link
           | weights, find the origin and destination nodes/knots of
           | interest, grab the two nodes and pull until taut).
           | 
           | Or the spaghetti approach to finding the largest value from a
           | list of positive values (e.g. cut dry spaghetti noodles to
           | length for each value, bundle them together and tap the
           | bundle on a table, the one visually sticking out the most is
           | the largest valued element).
        
       | cs702 wrote:
       | Cool and interesting. The authors propose a hybrid digital-analog
       | training loop that takes into account the curvature of the loss
       | landscape (i.e., it uses second-order derivatives), and show with
       | numerical simulations that if their method is implemented in a
       | hybrid digital-analog physical system, each iteration in the
       | training loop would incur computational cost that is linear in
       | the number of parameters. I'm all for figuring out ways to let
       | the Laws of Thermodynamics do the work of training AI models, if
       | doing so enables us to overcome the scaling limitations and
       | challenges of existing digital hardware and training methods.
        
       | thomasahle wrote:
       | The main point of this is that natural gradient descent is a
        | second-order method. The natural-gradient update direction is
        | 
        |     ∇̃L(θ) = F⁻¹ ∇L(θ)
       | 
       | which requires solving a linear system. For this, you can use the
        | methods from the authors' previous paper [Thermodynamic Linear
       | Algebra](https://arxiv.org/abs/2308.05660).
       | 
       | Since it's hard to implement a full neural network on a
       | thermodynamic computer, the paper suggests running one in
        | parallel to a normal GPU. The GPU computes F and ∇L(θ), but
       | offloads the linear system to the thermo computer, which runs in
       | parallel to the digital system (Figure 1).
       | 
       | It is important to note that the "Runtime vs Accuracy" plot in
       | Figure 3 uses a "timing model" for the TNGD algorithm, since the
       | computer necessary to run the algorithm still doesn't exist.
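        | 
        | In pseudocode, one step of the hybrid loop might look like
        | this (my reading of Figure 1, with a digital solve standing in
        | for the analog relaxation; not the authors' code):
        | 
        |     import numpy as np
        | 
        |     def tngd_step(theta, grad_fn, fisher_fn, lr=1.0, eps=1e-3):
        |         g = grad_fn(theta)    # digital: backprop gradient
        |         F = fisher_fn(theta)  # digital: Fisher estimate
        |         F = F + eps * np.eye(len(theta))  # damping
        |         d = np.linalg.solve(F, g)  # TNGD: analog relaxation
        |         return theta - lr * d
        | 
        |     # toy quadratic where the Fisher equals the true curvature
        |     H = np.diag([1.0, 50.0])
        |     theta = np.ones(2)
        |     for _ in range(3):
        |         theta = tngd_step(theta, lambda t: H @ t, lambda t: H)
        |     print(theta)  # ~[0, 0]: each step is Newton-like here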
        
         | 3abiton wrote:
          | It really gives a nice way to think about gradient descent.
        
       | esafak wrote:
       | Not having read the paper carefully, could someone tell me what
       | the draw is? It looks like it is going to have the same
       | asymptotic complexity as SGD in terms of sample size, per Table
        | 1. Given that today's large, over-parameterized models have
        | numerous, comparable extrema, is there even a need for this? I
        | wouldn't get out of bed unless it were sublinear.
        
       | danbmil99 wrote:
       | Wasn't Geoffrey Hinton going on about this about a year ago?
        
       | gnarbarian wrote:
        | This reminds me of simulated annealing, which I learned about
        | in an AI class about a decade ago.
       | 
       | https://en.wikipedia.org/wiki/Simulated_annealing
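        | 
        | The family resemblance is that both let thermal noise do the
        | exploring. A bare-bones version (illustrative only, unrelated
        | to the paper's actual method):
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     f = lambda x: x**2 + 10 * np.sin(3 * x)  # many local minima
        | 
        |     x, T = 4.0, 5.0
        |     for _ in range(20000):
        |         x_new = x + rng.normal(scale=0.5)
        |         # Metropolis rule: sometimes accept uphill moves
        |         if (f(x_new) < f(x)
        |                 or rng.random() < np.exp((f(x) - f(x_new)) / T)):
        |             x = x_new
        |         T = max(T * 0.9995, 1e-3)  # cooling schedule
        | 
        |     print(x, f(x))  # typically near the global min at x ~ -0.5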
        
       | rsp1984 wrote:
        | Leveraging thermodynamics to compute second-order updates more
        | efficiently is certainly cool and worth exploring; however,
        | specifically in the context of deep learning, I remain
        | skeptical of its usefulness.
       | 
       | We already have very efficient second-order methods running on
        | classical hardware [1], but they are basically not used at all
        | in practice, as they are outperformed by Adam and other 1st-
        | order methods. This is because optimizing highly nonlinear loss
        | functions, such as the ones in deep learning models, only
        | really works with very low learning rates, regardless of
        | whether a 1st- or a 2nd-order method is used. So, comparatively
        | speaking, a 2nd-order method might give you a slightly better
        | parameter update per step, but at a more-than-slightly-higher
        | cost; most of the time it's simply not worth doing.
       | 
       | [1] https://andrew.gibiansky.com/blog/machine-
       | learning/hessian-f...
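        | 
        | For reference, the core trick behind Hessian-free methods like
        | [1] is that conjugate gradient only needs Hessian-vector
        | products, which two gradient calls give you without ever
        | forming H (generic sketch assuming some grad(theta) function,
        | not tied to any framework):
        | 
        |     import numpy as np
        | 
        |     def hvp(grad, theta, v, eps=1e-6):
        |         # H @ v ~ (grad(theta + eps*v) - grad(theta - eps*v))
        |         #         / (2 * eps)
        |         return (grad(theta + eps * v)
        |                 - grad(theta - eps * v)) / (2 * eps)
        | 
        |     H = np.array([[2.0, 1.0], [1.0, 3.0]])
        |     grad = lambda t: H @ t            # gradient of a quadratic
        |     v = np.array([1.0, -1.0])
        |     print(hvp(grad, np.zeros(2), v))  # ~[1. -2.], i.e. H @ v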
        
         | 6gvONxR4sf7o wrote:
          | Agreed that it's very cool, and also that it's hard to make
          | second-order methods worthwhile. We're just using such
         | huge datasets that sometimes it's hard to even get a decent
         | estimate of the gradient for a minibatch. Getting a useful
          | estimate of second-order information over the dataset is even
         | harder, especially when the whole point of using minibatches is
         | computational feasibility.
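          | 
          | You can see the effect even on least squares (toy numbers of
          | my own, not from the paper): per-minibatch estimates of the
          | gradient and of the Hessian both scatter around the full-
          | data values, tightening roughly as 1/sqrt(batch size).
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     X = rng.normal(size=(10000, 5))
          |     y = X @ rng.normal(size=5) + rng.normal(size=10000)
          |     w = np.zeros(5)
          | 
          |     def scatter(batch_size, n_batches=200):
          |         gs, hs = [], []
          |         for _ in range(n_batches):
          |             i = rng.integers(0, len(X), size=batch_size)
          |             gs.append(X[i].T @ (X[i] @ w - y[i]) / batch_size)
          |             hs.append(X[i].T @ X[i] / batch_size)
          |         return (np.std(gs, axis=0).mean(),
          |                 np.std(hs, axis=0).mean())
          | 
          |     print(scatter(32))    # noisy gradient and Hessian estimates
          |     print(scatter(1024))  # both tighten with larger batches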
        
           | kaelan123 wrote:
           | Those are valid points! Hessian-free (HF) optimization is a
            | really nice method but, as you say, remains costly, so
            | people don't use it. The key idea in this paper is that if
            | you can solve linear systems faster using an analog device,
            | the cost of an HF-like method comes down, and the method
            | can become competitive.
           | 
           | About the noise, it is true that the second-order information
           | will be noisier than the gradient for a given batch size (and
           | a lot of results out there for HF optimization are with
           | impractically large batch sizes). In the paper we use
           | relatively small batch sizes (eg 32 for the fine-tuning
           | example) and show that you can still get an advantage from
            | second-order information. Of course it would be interesting
            | to study in more detail how noisy 2nd-order information can
            | be, and on more datasets.
        
       ___________________________________________________________________
       (page generated 2024-05-24 23:00 UTC)