[HN Gopher] Thermodynamic Natural Gradient Descent
___________________________________________________________________
Thermodynamic Natural Gradient Descent
Author : jasondavies
Score : 123 points
Date : 2024-05-24 14:50 UTC (8 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| stefanpie wrote:
| I know they mainly present results on deep learning/neural
| network training and optimization, but I wonder how easy it would
| be to use the same optimization framework for other classes of
| hard or large optimization problems. I was also curious about
| this when I saw posts about Extropic (https://www.extropic.ai/)
| stuff for the first time.
|
| I tried looking into any public info on their website about APIs
| or software stack to see what's possible beyond NN stuff to model
| other optimization problems. It looks like that's not shared
| publicly yet.
|
| There are certainly many NP-hard and large combinatorial or
| analytical optimization problems still out there that are worth
| being able to tackle with new technology. Personally, I care
| about problems in EDA and semiconductor design. Adiabatic quantum
| computing was one technology with the promise of solving
| optimization problems (and quantum computing is still playing out
| with only small-scale solutions at the moment). Hoping that these
| new "thermodynamic computing" startups also might provide some
| cool technology to explore these problems with.
| kaelan123 wrote:
| Indeed, solving other optimization problems is an interesting
| avenue. All I can say is stay tuned!
| stefanpie wrote:
| Hey, thanks for the reply! I'm assuming you're the first author
| on the paper; if so, the signup button on the Normal
| Computing website is not working at the moment (at least for
| me, even with ad blocker turned off).
| G3rn0ti wrote:
| Sounds great until
|
| > requires an analog thermodynamic computer
|
| Wait. What?
|
| Perhaps a trained physicist can comment on that. Thanks.
| uoaei wrote:
| I believe one example would be quantum annealers, where
| "programming" involves setting the right initial conditions and
| allowing thermodynamics to bring you to an optimum via
| relaxation.
| mikewarot wrote:
| D-Wave [1] already computes that way.
|
| [1] https://www.dwavesys.com/
| kaelan123 wrote:
| One key difference is that the system is entirely classical (not
| quantum) and noise-resilient (see the last appendix).
| CamperBob2 wrote:
| The paper describes it pretty well in appendix C. A matrix of
| integrators is constructed with a bunch of opamps, RC time
| constants (using digital potentiometers, presumably) and a
| multichannel ADC/DAC interface to the PC. Essentially a
| dedicated differential-equation solver.
|
| So it's a combination of old-school analog computation and
| modern GPU-based code. Takes longer in practice due to the
| overhead of interfacing with the hardware and waiting for the
| integrators to settle, but the authors are claiming that an
| optimized implementation could outperform a purely-digital
| solution, as I understand it, by accelerating convergence.
|
| The core idea being that conventional gradient descent is a
| linear operation at heart, while the gradients actually being
| traversed are curved surfaces that have to be approximated with
| multiple unnecessary steps if everything is done in the digital
| domain.
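| 
| As a rough illustration (my own toy sketch with idealized,
| noiseless dynamics, not the circuit from the paper): integrators
| relaxing under dx/dt = -(A x - b) settle at x = A^-1 b, so the
| hardware "solves" the linear system simply by equilibrating.
| 
|     import numpy as np
|
|     # Toy, noiseless stand-in for the analog integrators:
|     # Euler-integrate dx/dt = -(A x - b) and let the state settle.
|     rng = np.random.default_rng(0)
|     n = 8
|     M = rng.normal(size=(n, n))
|     A = M @ M.T + n * np.eye(n)  # SPD, like a damped Fisher
|     b = rng.normal(size=n)
|     x = np.zeros(n)
|     dt = 0.1 / np.linalg.norm(A, 2)  # small step for stability
|     for _ in range(5000):  # let the "circuit" settle
|         x += dt * (b - A @ x)
|     print(np.allclose(x, np.linalg.solve(A, b)))  # True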
|
| The trouble, as everybody from Seymour Cray onward has learned
| the hard way, is that CMOS always wins in the end, simply
| because the financial power of an entire industry goes into
| optimizing it.
| stefanpie wrote:
| I didn't realize they included details about the hardware.
| Like you said, these just look like analog computers, compute-
| in-memory, analog arrays, which have also made a resurgence
| with deep learning.
| kaelan123 wrote:
| First author of the paper here. That's it indeed! One thing
| is that this is entirely CMOS-compatible. You could also do
| something similar with optics or other platforms, but we
| chose electronic circuits for this reason specifically.
| CamperBob2 wrote:
| By that remark I meant "digital CMOS," in the sense of
| elements that store state information discretely in flip-
| flops or gate insulators rather than continuously with
| analog integrators.
|
| Very cool work in any event, though! Best of luck with the
| ongoing R&D.
| cs702 wrote:
| The whole point is to leverage _the laws of nature_ to train AI
| models, overcoming the limitations and scaling challenges of
| digital hardware and existing training methods.
| ein0p wrote:
| This could be attractive if they can build a product along
| those lines: tens, if not hundreds, of billions of dollars are
| spent yearly on numerical optimization worldwide, and if this
| can significantly accelerate it, it could be very profitable.
| jasonjmcghee wrote:
| From my understanding, this is exactly what https://extropic.ai
| is working on, and I wouldn't be surprised if
| https://normalcomputing.ai/ (authors of the paper) is as well.
| nickpsecurity wrote:
| Analog computers have a lot of history. You can Google analog
| with neural network or differential equations to get many
| results. They are fast with low power, can have precision
| issues, and require custom chip design.
|
| https://en.m.wikipedia.org/wiki/Analog_computer
|
| Mixed-signal ASICs often use a mix of digital and analog
| blocks to get the benefits of analog. It's especially helpful
| for anything that consumes a lot of power, or where power draw
| needs to be kept down (e.g. mobile).
| onecommentman wrote:
| Hard to beat the string algorithm for finding shortest paths
| on a positive weights network (e.g. build the network out of
| string where topologies match and link lengths are link
| weights, find the origin and destination nodes/knots of
| interest, grab the two nodes and pull until taut).
|
| Or the spaghetti approach to finding the largest value from a
| list of positive values (e.g. cut dry spaghetti noodles to
| length for each value, bundle them together and tap the
| bundle on a table, the one visually sticking out the most is
| the largest valued element).
| cs702 wrote:
| Cool and interesting. The authors propose a hybrid digital-analog
| training loop that takes into account the curvature of the loss
| landscape (i.e., it uses second-order derivatives), and show with
| numerical simulations that if their method is implemented in a
| hybrid digital-analog physical system, each iteration in the
| training loop would incur computational cost that is linear in
| the number of parameters. I'm all for figuring out ways to let
| the Laws of Thermodynamics do the work of training AI models, if
| doing so enables us to overcome the scaling limitations and
| challenges of existing digital hardware and training methods.
| thomasahle wrote:
| The main point of this is that natural gradient descent is a
| second-order method. The main NGD update equation is:
| 
|     ∇̃L(θ) = F⁻¹ ∇L(θ)
|
| which requires solving a linear system. For this, you can use the
| methods from the author's previous paper [Thermodynamic Linear
| Algebra](https://arxiv.org/abs/2308.05660).
|
| Since it's hard to implement a full neural network on a
| thermodynamic computer, the paper suggests running one in
| parallel to a normal GPU. The GPU computes F and ∇L(θ), but
| offloads the linear system to the thermo computer, which runs in
| parallel to the digital system (Figure 1).
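| 
| Schematically (a toy sketch of the split, not the paper's actual
| TNGD code), one step looks like the following, with
| np.linalg.solve standing in for the offloaded thermodynamic
| linear solve:
| 
|     import numpy as np
|
|     def ngd_step(theta, g, F, lr=0.1, damping=1e-3):
|         # Digital side supplies F and g = grad L(theta); the
|         # damped solve (F + damping*I) d = g is the offloaded part.
|         d = np.linalg.solve(F + damping * np.eye(len(g)), g)
|         return theta - lr * d
|
|     # Tiny quadratic toy: L(theta) = 0.5 * theta^T H theta
|     H = np.array([[3.0, 1.0], [1.0, 2.0]])
|     theta = np.array([1.0, -1.0])
|     theta = ngd_step(theta, H @ theta, H, lr=1.0)
|     print(theta)  # essentially at the optimum (the origin)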
|
| It is important to note that the "Runtime vs Accuracy" plot in
| Figure 3 uses a "timing model" for the TNGD algorithm, since the
| computer necessary to run the algorithm still doesn't exist.
| 3abiton wrote:
| It really gives a nice way to think about gradient descent.
| esafak wrote:
| Not having read the paper carefully, could someone tell me what
| the draw is? It looks like it is going to have the same
| asymptotic complexity as SGD in terms of sample size, per Table
| 1. Given that today's large, overparameterized models have
| numerous, comparable minima, is there even a need for this? I
| wouldn't get out of bed unless it were sublinear.
| danbmil99 wrote:
| Wasn't Geoffrey Hinton going on about this about a year ago?
| gnarbarian wrote:
| this reminds me of simulated annealing which I learned about in
| an AI class about a decade ago.
|
| https://en.wikipedia.org/wiki/Simulated_annealing
| rsp1984 wrote:
| Leveraging thermodynamics to more efficiently compute second-
| order updates is certainly cool and worth exploring, however
| specifically in the context of deep learning I remain skeptical
| of its usefulness.
|
| We already have very efficient second-order methods running on
| classical hardware [1] but they are basically not being used at
| all in practice, as they are outperformed by ADAM and other 1st-
| order methods. This is because optimizing highly nonlinear loss
| functions, such as the ones in deep learning models, only really
| works with very low learning rates, regardless of whether a 1st
| or a 2nd order method is used. So, comparatively speaking, a 2nd
| order method might give you a slightly better parameter update
| per step but at a more-than-slightly-higher cost, so most of the
| time it's simply not worth doing.
|
| [1] https://andrew.gibiansky.com/blog/machine-
| learning/hessian-f...
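| 
| (For concreteness, the trick those HF methods build on is matrix-
| free Hessian-vector products, each costing roughly one extra
| gradient evaluation. Below is a toy finite-difference version with
| a made-up loss; real implementations use exact autodiff HVPs.)
| 
|     import numpy as np
|
|     def grad(theta):
|         # L = 0.25 * ||theta||^4  =>  grad = ||theta||^2 * theta
|         return np.dot(theta, theta) * theta
|
|     def hvp(theta, v, eps=1e-5):
|         # H(theta) @ v via central differences of the gradient
|         gp = grad(theta + eps * v)
|         gm = grad(theta - eps * v)
|         return (gp - gm) / (2 * eps)
|
|     theta = np.array([1.0, 2.0])
|     v = np.array([0.5, -1.0])
|     # Exact Hessian here: ||theta||^2 * I + 2 * outer(theta, theta)
|     H = (np.dot(theta, theta) * np.eye(2)
|          + 2 * np.outer(theta, theta))
|     print(np.allclose(hvp(theta, v), H @ v, atol=1e-4))  # True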
| 6gvONxR4sf7o wrote:
| Agreed that it's very cool, and also about how hard it is to
| make second order methods worthwhile. We're just using such
| huge datasets that sometimes it's hard to even get a decent
| estimate of the gradient for a minibatch. Getting a useful
| estimate of second order information over the dataset is even
| harder, especially when the whole point of using minibatches is
| computational feasibility.
| kaelan123 wrote:
| Those are valid points! Hessian-free (HF) optimization is a
| really nice method, but as you say remains costly so people
| don't use it. The key idea in this paper is that if you are
| able to solve linear systems faster by using an analog
| device, the cost of a HF-like method is brought down, so the
| method can become competitive.
|
| About the noise, it is true that the second-order information
| will be noisier than the gradient for a given batch size (and
| a lot of results out there for HF optimization are with
| impractically large batch sizes). In the paper we use
| relatively small batch sizes (e.g. 32 for the fine-tuning
| example) and show that you can still get an advantage from
| second-order information. Of course it would be interesting
| to study in more detail how noisy 2nd order information can
| be, and on more datasets.
___________________________________________________________________
(page generated 2024-05-24 23:00 UTC)