[HN Gopher] Deep physical neural networks trained with backpropa...
___________________________________________________________________
Deep physical neural networks trained with backpropagation
Author : groar
Score : 67 points
Date : 2022-01-29 15:56 UTC (7 hours ago)
(HTM) web link (www.nature.com)
(TXT) w3m dump (www.nature.com)
| phreeza wrote:
| Physical/analog computers always suffer from noise limiting their
| usefulness. So I think it would be natural to apply this to a
| network architecture that includes noise as an integral part, such
| as GANs or VAEs.
| orasis wrote:
| "noise" is integral to all ML systems. You can view this
| through many lenses, but generalization can be thought of as
| decoding a noisy signal.
| phreeza wrote:
| This is true, though what I was getting at was methods that
| make use of a noise source separate from the input.
| visarga wrote:
| If you can train a non-linear physical system with this method,
| in principle, you could also train real brains. You can't update
| the parameters of the brain, but you can inject signal. Assuming
| real brains to be black box functions for which you could learn a
| noisy estimator of gradients, it could be used for neural
| implants that supplement lost brain functionality, or a Matrix-
| like skill loading system.
| p1esk wrote:
| How is this different from the good old "chip in the loop"
| training method?
| corndoge wrote:
| The paper is interesting
| modeless wrote:
| Let me see if I can describe the laser part of the paper
| correctly. They made a laser pulse consisting of a bunch of
| different frequencies mixed together. The intensity of each
| frequency represents a controllable parameter of the system. The
| pulse was sent through a crystal that performs a complex
| transformation that mixes all the frequencies together in a
| nonlinear and noisy way. Then they measure the frequency spectrum
| of the output. By itself, this system performs computations of a
| sort, but they are not useful.
|
| To make the computations useful, first they trained a
| conventional digital neural network to predict the outputs given
| the input controllable parameters. Then they arbitrarily assigned
| some of the controllable parameters to be the inputs and others
| to be the trainable weights. Then they used the crystal to run forward
| passes on the training data. After each forward pass, they used
| the trained regular neural network to do the reverse pass and
| estimate the gradients of the outputs with respect to the
| weights. With the gradients they update the weights just like a
| regular neural net.
|
| Although the gradients computed by the neural nets are not a
| perfect match to the real gradients of the physical system (which
| are unknown), they don't need to be perfect. Any drift is
| corrected because the forward pass is always run by the real
| physical system, and stochastic gradient descent is naturally
| pretty tolerant of noise and bias.
|
| Since they're just using neural nets to estimate the behavior of
| the physical system rather than modeling it with physics, they
| can use literally any physical system and the behavior of the
| system does not have to be known. The only requirement of the
| system is that it does a complex nonlinear transformation on a
| bunch of controllable parameters to produce a bunch of outputs.
| They also demonstrate using vibrations of a metal plate.
|
| Seems like this method may not lead to huge training speedups
| since regular neural nets are still involved. But after training,
| the physical system is all you need to run inference, and that
| part can be super efficient.
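|
| A rough sketch of that loop in code (assuming PyTorch, with the
| physical system mocked as a black-box function; this is just an
| illustration of the idea, not the paper's actual implementation):
|
|     import torch
|     import torch.nn as nn
|
|     torch.manual_seed(0)
|     DIM = 16  # size of the controllable parameter vectors (arbitrary)
|
|     # Stand-in for the real apparatus (e.g. the nonlinear crystal):
|     # we can only evaluate it, never differentiate through it.
|     W1, W2 = torch.randn(DIM, DIM), torch.randn(DIM, DIM)
|
|     def physical_system(x, theta):
|         with torch.no_grad():
|             y = torch.sin(3.0 * (x @ W1) + theta @ W2)
|             return y + 0.01 * torch.randn_like(y)  # measurement noise
|
|     # 1) Fit a differentiable digital surrogate on random probes.
|     surrogate = nn.Sequential(nn.Linear(2 * DIM, 64), nn.Tanh(),
|                               nn.Linear(64, DIM))
|     opt_s = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
|     for _ in range(2000):
|         x, th = torch.randn(64, DIM), torch.randn(64, DIM)
|         pred = surrogate(torch.cat([x, th], dim=-1))
|         loss = ((pred - physical_system(x, th)) ** 2).mean()
|         opt_s.zero_grad(); loss.backward(); opt_s.step()
|
|     # 2) Train the "weights" theta: forward pass on the real system,
|     #    backward pass through the surrogate (straight-through trick).
|     theta = torch.zeros(DIM, requires_grad=True)
|     opt_t = torch.optim.Adam([theta], lr=1e-2)
|     target = torch.randn(DIM)  # toy regression target
|     for _ in range(500):
|         x = torch.randn(32, DIM)
|         th = theta.expand(32, -1)
|         y_real = physical_system(x, th)               # real forward pass
|         y_sur = surrogate(torch.cat([x, th], dim=-1)) # differentiable path
|         y = y_real + (y_sur - y_sur.detach())  # value: real, grad: surrogate
|         loss = ((y - target) ** 2).mean()
|         opt_t.zero_grad(); loss.backward(); opt_t.step()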
| posterboy wrote:
| > They made a laser pulse consisting of a bunch of different
| frequencies mixed together
|
| This is how ultrashort pulses are made, when the waves cancel
| out appropriately. Now I'm not sure if they are training a
| network to calculate the filter efficiently for even shorter
| pulses, or if the purpose is supposed to be an optical neural
| network, or why not both.
| melissalobos wrote:
| > Deep-learning models have become pervasive tools in science and
| engineering. However, their energy requirements now increasingly
| limit their scalability.[1]
|
| They make this claim first, and cite one source. I haven't heard
| of this as an issue before. Is there anywhere else I could read
| more on this?
|
| [1]https://arxiv.org/abs/2104.10350
| dekhn wrote:
| Training a state of the art model typically involves keeping a
| very large computer around at near 100% power load, roughly
| 10 MW.
|
| The actual limits on DL models (and any simulation or
| optimization) are: power density and the speed of light, plus
| the maximum amount of power you can deliver to the area. The
| speed of light limits how long your cables can be while still
| doing collective reductions, and the power density limits how
| much compute power you can fit per unit volume. One could
| imagine a fully liquid cooled supercomputer at 100MW (located
| near a very reliable and large power source) with optical fiber
| interconnect; this would completely change the state of the art
| in large models overnight.
| foobiekr wrote:
| All true.
|
| I cannot cite a source here, but it is generally believed
| that the actual effective GPU utilization in AI training
| clusters which are "100% utilized" is actually quite poor -
| 23%-26% - due to data movement, non-essential serial
| execution, and scheduling issues. So at least for now
| there is low-hanging fruit to improve the performance of the
| capital expenses.
|
| Long term, though, DL clusters are basically CAPEX and energy
| limited.
|
| IMHO, for now, return on the investment is not really a
| limiting factor, but it will become one once the shine is off
| the field.
| pmayrgundter wrote:
| Got me wondering how this compares with neural efficiency,
| realizing ofc that there's nothing really apples-to-apples
| here.
|
| Training one of these big models takes 100 kWh for 1e19 flops.
| 100 kWh is 100k Wh, or 360M Ws, i.e. 360 MJ = 3.6e8 J.
| 3.6e8 J / 1e19 flops ~= 3.6e-11 J/flop, call it ~1e-11 J/flop.
|
| Neurons take 1e-8J/spike.[1]
|
| Math check appreciated :)
|
| Does seem plausible to think of a single neuron spike (Hodgkin-
| Huxley cable model) being modeled with ~1k flops. Though I'm
| firmly of the opinion that nobody really knows how the brain
| works.. the neural spike activity could be pure epiphenomenon..
| who knows!
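|
| Quick sanity check of those numbers in Python (the 1e19-flop and
| ~1k-flop-per-spike figures are the rough assumptions from above):
|
|     training_energy_j = 100e3 * 3600  # 100 kWh in joules = 3.6e8 J
|     training_flops = 1e19             # assumed total training flops
|     j_per_flop = training_energy_j / training_flops  # ~3.6e-11 J/flop
|
|     j_per_spike = 1e-8     # neuron energy per spike, from [1]
|     flops_per_spike = 1e3  # assumed cost of modeling one spike
|     ratio = (j_per_flop * flops_per_spike) / j_per_spike
|     print(j_per_flop, ratio)  # ~3.6e-11 J/flop, ratio ~3.6
|
| So, under those assumptions, simulating a spike costs within an
| order of magnitude of what the real spike costs.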
|
| [1] "Finally, the energy supply to a neuron by ATP is 8.31 x
| 10-9 J. Meanwhile, integrating the total power with respect to
| time we will get the consumed electric power, which is 8.75 x
| 10-9 J. This is more energy than the ATP supplied. The energy
| efficiency is 105.3%. This is an anomaly..." - 2017 Feb 16
| Wang, Xu, Institute for Cognitive Neurodynamics, East China
| University of Science and Technology
| https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5337805/
| whatshisface wrote:
| The neural spike is definitely not an epiphenomenon. The
| action potential / neurotransmitter release / receptor
| activation process is understood and can be manipulated with
| electric probes.
| dekhn wrote:
| For those who are curious, consciousness is an
| epiphenomenon (an emergent property of brains), while
| neural spikes are just physics.
|
| See more: https://en.wikipedia.org/wiki/Neural_correlates_o
| f_conscious...
| whatshisface wrote:
| I think it would be better to say something like,
| "paranoia is an epiphenomenon," when nobody knows what
| consciousness is.
| dekhn wrote:
| No. https://en.wikipedia.org/wiki/Epiphenomenalism
| whatshisface wrote:
| Note that that title ends in "ism," like "Calvinism," or
| "Evangelicalism."
| posterboy wrote:
| hence _epiphenomenal-ize_? I don't think so.
|
| Maybe it's more like _epiphenomenalis-m(a)_, where _-is_
| could be a genitive ending, i.e. the idea _of_
| epiphenomena.
| netizen-936824 wrote:
| That isn't quite true
|
| https://en.m.wikipedia.org/wiki/Consciousness
| pmayrgundter wrote:
| Sorry, didn't mean it quite like that. It's clear neural
| spike activity exists as a physical process. I'm suggesting
| that spiking activity may be an epiphenomenon of more primary
| brain functions, i.e. information processing,
| consciousness, etc.
|
| As far as I know, we're closest to showing information
| processing in the visual cortex (which is highly linear)
| and we're still a long way from knowing how it works at a
| neural level. But maybe someone here can update on this?
|
| But much of the cortex is highly recurrent (non-linear) and
| the idea that it's doing something like sending bits
| between synapses, encoded in spike timing or something..
| well, I think that's highly speculative and has plenty of
| problems. But even if so, that's just "information
| processing".
|
| I'm personally a fan of electromagnetic theories of
| consciousness[], where the synaptic activity could be an
| epiphenomenon of supporting a standing EM field.
|
| []https://en.wikipedia.org/wiki/Electromagnetic_theories_of
| _co...
| whatshisface wrote:
| > _But much of the cortex is highly recurrent (non-
| linear) and the idea that it's doing something like
| sending bits between synapses, encoded in spike timing or
| something.. well, I think that's highly speculative and
| has plenty of problems._
|
| I am not sure how much is known about information
| processing, but it's clear that motor impulses and
| sensory information are encoded in the spikes. Higher
| spike frequency = stronger signal. Synapses are how
| signals are passed from neuron to neuron.
| pmayrgundter wrote:
| Ok, that's fair. That's i/o and yes, that's known to be
| highly linear by the time it gets to the efferent nerves,
| and makes sense it is before that as well. I think that
| still leaves the vast majority of the cortex using
| undefined mechanisms.
| whatshisface wrote:
| There's no need to hypothesize a wholly unique central
| nervous system signalling mechanism when, not only is the
| signalling mechanism of peripheral nerves understood,
| but central nerves are also observed doing the same thing.
| pmayrgundter wrote:
| I think it's fine as a hypothesis for the CNS, and my
| guess is it's correct for the spinal cord on up to maybe
| the thalamus. But there the anatomy changes
| radically, as does the electrical activity.. eg the
| cluster waves (alpha, beta, theta) begin there and
| indicate some sort of group behaviors in various areas.
|
| Afaik, we have correlative descriptions of what these
| waves indicate (importantly, they're associated with
| sleep, consciousness, attentiveness), but no direct
| mechanical model of them or a clear purpose. So yeah,
| spike timing could still be used at this level, but it
| seems other behaviors are also happening that may be more
| essential to the larger function.
| posterboy wrote:
| I do remember reading once that glial cells are not
| understood, and something about how electromagnetic
| fields might also induce ... I do not remember it well
| because it wasn't very specific.
|
| The sentiment that synapses probably don't explain
| everything is rather common, anyhow. I'm thinking of the
| way blood flow literally influences the relevant areas by
| transporting available energy, for example; neurotransmitters
| must be a very important part too, and how those areas react
| in case of insufficiency would explain why I become nasty
| when tired and hungry at the same time.
| version_five wrote:
| I don't have a specific reference but I'd say it's a common
| knowledge assertion based on the growth in the number of
| parameters in models over the last 10 years. There are lots of
| places where you can see how the number of parameters,
| especially in language and vision models, has increased, and
| find the amount of training time quoted. Normally it's
| framed in terms of compute instead of energy.
| davesque wrote:
| I think they may have provided fewer citations because it felt
| like a less controversial claim. I think the choice of words
| was just a bit awkward. To me, it seems like they were
| asserting that deep learning requires lots of computational
| resources, which is common knowledge. In general, this
| translates to higher energy requirements.
| [deleted]
| version_five wrote:
| This uses a physical system with controllable parameters to
| compute a forward pass and
|
| > using a differentiable digital model, the gradient of the loss
| is estimated with respect to the controllable parameters.
|
| So e.g. they have a tunable laser that shifts the spectrum of an
| encoded input based on a set of parameters, and then they update
| the parameters based on a gradient computed from a digital
| simulation of the laser (physics-aware model).
|
| When I read the headline I imagined they had implemented back
| propagation in a physical system
| visarga wrote:
| > When I read the headline I imagined they had implemented back
| propagation in a physical system
|
| They touch on that by observing you could train a second
| physical neural network to compute the gradients for the first.
| So it could all be physical.
|
| > Improvements to PAT could extend the utility of PNNs. For
| example, PAT's backward pass could be replaced by a neural
| network that directly estimates parameter updates for the
| physical system. Implementing this 'teacher' neural network
| with a PNN would allow subsequent training to be performed
| without digital assistance.
|
| So you need to use in silico training at first, but can get
| rid of it in deployment.
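|
| Purely as an illustration of that idea (not from the paper; the
| names and shapes here are made up), the 'teacher' would be a
| learned map from the observed forward pass and error signal to a
| parameter update, replacing the digital backward pass:
|
|     import torch
|     import torch.nn as nn
|
|     N_PARAMS, N_OUT = 16, 16  # hypothetical sizes
|
|     # (current params, measured output, error) -> parameter update
|     teacher = nn.Sequential(nn.Linear(N_PARAMS + 2 * N_OUT, 64),
|                             nn.Tanh(), nn.Linear(64, N_PARAMS))
|
|     def update_step(theta, y_measured, y_target, lr=1e-2):
|         err = y_measured - y_target
|         delta = teacher(torch.cat([theta, y_measured, err]))
|         return theta - lr * delta
|
| Per the quote, the teacher itself could eventually be implemented
| as a PNN, so that subsequent training needs no digital assistance.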
| dangom wrote:
| Right,
|
| > Here we introduce a hybrid in situ-in silico algorithm,
| called physics-aware training, that applies backpropagation to
| train controllable physical systems. Just as deep learning
| realizes computations with deep neural networks made from
| layers of mathematical functions, our approach allows us to
| train deep physical neural networks made from layers of
| controllable physical systems, even when the physical layers
| lack any mathematical isomorphism to conventional artificial
| neural network layers.
|
| To my naive understanding, and please someone correct me if I'm
| wrong, the point is that they are not controlling the
| parameters that compute the NN forward pass directly (hence "no
| mathematical isomorphism to conventional NNs"), but "hyper-
| parameters" that guide the physical system to do so. For
| example, rotation angles of mirrors, or distance between
| filters, instead of intensity values of light. This leads to
| the non-linear transformations happening in situ, while simpler
| transformations in the backprop are still computed in-silico.
| visarga wrote:
| If they can scale it up to GPT-3-like sizes, it would be amazing.
| Foundation models like GPT-3 will be the operating system of
| tomorrow. But now they are too expensive to run.
|
| They can be trained once and then frozen and you can develop new
| skills by learning control codes (prompts), or adding a retrieval
| subsystem (search engine in the loop).
|
| If you shrink this foundation model to a single chip, something
| small and energy efficient, then you could have all sorts of
| smart AI on edge devices.
___________________________________________________________________
(page generated 2022-01-29 23:01 UTC)