[HN Gopher] Deep physical neural networks trained with backpropa...
       ___________________________________________________________________
        
       Deep physical neural networks trained with backpropagation
        
       Author : groar
       Score  : 67 points
       Date   : 2022-01-29 15:56 UTC (7 hours ago)
        
 (HTM) web link (www.nature.com)
 (TXT) w3m dump (www.nature.com)
        
       | phreeza wrote:
       | Physical/analog computers always suffer from noise limiting their
       | usefulness. So I think it would be natural to apply this to a
        | network architecture that includes noise as an integral part, such
       | as GANs or VAEs.
        
         | orasis wrote:
         | "noise" is integral to all ML systems. You can view this
         | through many lenses, but generalization can be thought of as
         | decoding a noisy signal.
        
           | phreeza wrote:
           | This is true, though what I was getting at was methods that
           | make use of a noise source separate from the input.
        
       | visarga wrote:
       | If you can train a non-linear physical system with this method,
       | in principle, you could also train real brains. You can't update
       | the parameters of the brain, but you can inject signal. Assuming
       | real brains to be black box functions for which you could learn a
       | noisy estimator of gradients, it could be used for neural
       | implants that supplement lost brain functionality, or a Matrix-
       | like skill loading system.
        
       | p1esk wrote:
       | How is this different from the good old "chip in the loop"
       | training method?
        
         | corndoge wrote:
         | The paper is interesting
        
       | modeless wrote:
       | Let me see if I can describe the laser part of the paper
       | correctly. They made a laser pulse consisting of a bunch of
       | different frequencies mixed together. The intensity of each
       | frequency represents a controllable parameter of the system. The
       | pulse was sent through a crystal that performs a complex
       | transformation that mixes all the frequencies together in a
       | nonlinear and noisy way. Then they measure the frequency spectrum
       | of the output. By itself, this system performs computations of a
       | sort, but they are not useful.
       | 
       | To make the computations useful, first they trained a
       | conventional digital neural network to predict the outputs given
        | the input controllable parameters. Then they arbitrarily assigned
        | some of the controllable parameters to be the inputs of the
        | network and the rest to be the trainable weights. Then they used
        | the crystal to run forward
       | passes on the training data. After each forward pass, they used
       | the trained regular neural network to do the reverse pass and
       | estimate the gradients of the outputs with respect to the
       | weights. With the gradients they update the weights just like a
       | regular neural net.
       | 
       | Although the gradients computed by the neural nets are not a
       | perfect match to the real gradients of the physical system (which
       | are unknown), they don't need to be perfect. Any drift is
       | corrected because the forward pass is always run by the real
       | physical system, and stochastic gradient descent is naturally
       | pretty tolerant of noise and bias.
       | 
       | Since they're just using neural nets to estimate the behavior of
       | the physical system rather than modeling it with physics, they
       | can use literally any physical system and the behavior of the
       | system does not have to be known. The only requirement of the
       | system is that it does a complex nonlinear transformation on a
       | bunch of controllable parameters to produce a bunch of outputs.
       | They also demonstrate using vibrations of a metal plate.
       | 
       | Seems like this method may not lead to huge training speedups
       | since regular neural nets are still involved. But after training,
       | the physical system is all you need to run inference, and that
       | part can be super efficient.
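        | 
        | For concreteness, here's a rough sketch of that loop in PyTorch.
        | Everything below (the toy "physical" function, the surrogate, the
        | sizes) is my own stand-in, not code from the paper:
        | 
        |     import torch
        | 
        |     N = 8                              # toy parameter/output size
        | 
        |     def physical_forward(x, w):
        |         # Stand-in for the real apparatus (crystal + spectrometer):
        |         # a fixed nonlinear, noisy transform of inputs and weights.
        |         z = torch.sin(3 * x) * torch.cos(2 * w) + 0.1 * x * w
        |         return z + 0.01 * torch.randn_like(z)
        | 
        |     # Differentiable digital surrogate; in the paper this is fit to
        |     # measurements of the physical system beforehand. Left untrained
        |     # here just to show the structure of the loop.
        |     surrogate = torch.nn.Sequential(
        |         torch.nn.Linear(2 * N, 64), torch.nn.Tanh(),
        |         torch.nn.Linear(64, N))
        | 
        |     w = torch.zeros(N, requires_grad=True)   # "weight" parameters
        |     opt = torch.optim.SGD([w], lr=0.05)
        | 
        |     for step in range(100):
        |         x = torch.randn(N)             # a training example
        |         target = torch.tanh(x)         # made-up regression target
        | 
        |         with torch.no_grad():          # forward pass runs on the
        |             y_phys = physical_forward(x, w)   # *physical* system
        | 
        |         y_model = surrogate(torch.cat([x, w]))
        |         # Forward value comes from the measurement, gradient comes
        |         # from the surrogate (straight-through-style substitution):
        |         y = y_model + (y_phys - y_model).detach()
        |         loss = ((y - target) ** 2).mean()
        | 
        |         opt.zero_grad()
        |         loss.backward()                # backprop through surrogate
        |         opt.step()                     # update the physical "weights"
        | 
        | (In the paper the surrogate is fit to measured data and several
        | physical layers are stacked, but I believe the gradient
        | substitution above is the core trick.)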
        
         | posterboy wrote:
         | > They made a laser pulse consisting of a bunch of different
         | frequencies mixed together
         | 
         | This is how ultra short pulses are made when the waves cancel
         | out appropriately. Now I'm not sure if they are training a
         | network to calculate the filter efficiently for even shorter
         | pulses, or if the purpose is supposed to be an optical neural
         | network, or why not both.
        
       | melissalobos wrote:
       | > Deep-learning models have become pervasive tools in science and
       | engineering. However, their energy requirements now increasingly
       | limit their scalability.[1]
       | 
       | They make this claim first, and cite one source. I haven't heard
       | of this as an issue before. Is there anywhere else I could read
       | more on this?
       | 
       | [1]https://arxiv.org/abs/2104.10350
        
         | dekhn wrote:
         | Training a state of the art model typically involves keeping a
          | very large computer around at near 100% power load, roughly
          | 10 MW.
         | 
         | The actual limits on DL models (and any simulation or
         | optimization) are: power density and the speed of light, plus
         | the maximum amount of power you can deliver to the area. The
         | speed of light limits how long your cables can be while still
         | doing collective reductions, and the power density limits how
         | much compute power you can fit per unit volume. One could
          | imagine a fully liquid-cooled supercomputer at 100MW (located
          | near a very reliable and large power source) with optical fiber
          | interconnect; this would completely change the state of the art
         | in large models overnight.
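          | 
          | To put a number on the speed-of-light point (back-of-envelope;
          | the cable lengths and the ~2e8 m/s signal speed in fiber are my
          | own assumptions):
          | 
          |     # One-way propagation delay across a cluster, fiber at ~2/3 c
          |     c_fiber = 2.0e8                      # m/s
          |     for cable_m in (10, 100, 1000):
          |         print(cable_m, "m ->", cable_m / c_fiber * 1e6, "us one-way")
          |     # 100 m is ~0.5 us per hop, which compounds over the many hops
          |     # of a collective reduction at every training step.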
        
           | foobiekr wrote:
           | All true.
           | 
           | I cannot cite a source here, but it is generally believed
           | that the actual effective GPU utilization in AI training
           | clusters which are "100% utilized" is actually quite poor -
           | 23%-26% - due to data movement, non-essential serial
            | execution, and scheduling issues. So at least for now there
            | is low-hanging fruit in getting more out of the capital
            | expense.
           | 
           | Long term, though, DL clusters are basically CAPEX and energy
           | limited.
           | 
           | IMHO, for now, return on the investment is not really a
           | limiting factor, but it will become one once the shine is off
           | the field.
        
         | pmayrgundter wrote:
         | Got me wondering how this compares with neural efficiency,
         | realizing ofc that there's nothing really apples-to-apples
         | here.
         | 
          | Training one of these big models takes 100 kWh for 1e19 flops.
          | 100 kWh is 100,000 Wh = 3.6e8 Ws = 360 MJ = 3.6e8 J, so
          | 3.6e8 J / 1e19 flops ~= 4e-11 J/flop
         | 
         | Neurons take 1e-8J/spike.[1]
         | 
         | Math check appreciated :)
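          | 
          | Same arithmetic in Python (keeping the 3.6 factor; the 100 kWh
          | and 1e19 flop figures are the rough assumptions above):
          | 
          |     train_energy_J = 100e3 * 3600        # 100 kWh -> 3.6e8 J
          |     j_per_flop = train_energy_J / 1e19   # ~3.6e-11 J/flop
          |     j_per_spike = 8.75e-9                # electrical energy per spike [1]
          |     print(j_per_flop, j_per_spike / j_per_flop)   # ~3.6e-11, ~240 flops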
         | 
          | Does seem plausible to think of a single neuron spike (Hodgkin-
          | Huxley cable model) being modeled with ~1k flops. Though I'm
          | firmly of the opinion that nobody really knows how the brain
          | works.. the neural spike activity could be a pure epiphenomenon..
         | who knows!
         | 
         | [1] "Finally, the energy supply to a neuron by ATP is 8.31 x
          | 10^-9 J. Meanwhile, integrating the total power with respect to
         | time we will get the consumed electric power, which is 8.75 x
          | 10^-9 J. This is more energy than the ATP supplied. The energy
         | efficiency is 105.3%. This is an anomaly..." - 2017 Feb 16
         | Wang, Xu, Institute for Cognitive Neurodynamics, East China
         | University of Science and Technology
         | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5337805/
        
           | whatshisface wrote:
           | The neural spike is definitely not an epiphenomenon. The
           | action potential / neurotransmitter release / receptor
           | activation process is understood and can be manipulated with
           | electric probes.
        
             | dekhn wrote:
             | For those who are curious, consciousness is an
              | epiphenomenon (an emergent property of brains), while
             | neural spikes are just physics.
             | 
             | See more: https://en.wikipedia.org/wiki/Neural_correlates_o
             | f_conscious...
        
               | whatshisface wrote:
               | I think it would be better to say something like,
               | "paranoia is an epiphenomenon," when nobody knows what
               | consciousness is.
        
               | dekhn wrote:
               | No. https://en.wikipedia.org/wiki/Epiphenomenalism
        
               | whatshisface wrote:
               | Note that that title ends in "ism," like "Calvinism," or
               | "Evangelicalism."
        
               | posterboy wrote:
                | hence _epiphenomenal-ize_? I don't think so.
               | 
                | Maybe it's more like _epiphenomenalis-m(a)_, where _-is_
                | could be a genitive ending. I.e. the idea _of_
               | epiphenomena.
        
               | netizen-936824 wrote:
               | That isn't quite true
               | 
               | https://en.m.wikipedia.org/wiki/Consciousness
        
             | pmayrgundter wrote:
             | Sorry, didn't mean it quite like that. It's clear neural
             | spike activity exists as a physical process. I'm suggesting
              | that spiking activity may be an epiphenomenon of more
              | primary brain functions, i.e. information processing,
              | consciousness, etc.
             | 
             | As far as I know, we're closest to showing information
             | processing in the visual cortex (which is highly linear)
             | and we're still a long way from knowing how it works at a
              | neural level. But maybe someone here can update me on this?
             | 
             | But much of the cortex is highly recurrent (non-linear) and
             | the idea that it's doing something like sending bits
             | between synapses, encoded in spike timing or something..
             | well, I think that's highly speculative and has plenty of
             | problems. But even if so, that's just "information
             | processing".
             | 
             | I'm personally a fan of electromagnetic theories of
             | consciousness[], where the synaptic activity could be an
              | epiphenomenon of supporting a standing EM field.
             | 
             | []https://en.wikipedia.org/wiki/Electromagnetic_theories_of
             | _co...
        
               | whatshisface wrote:
               | > _But much of the cortex is highly recurrent (non-
               | linear) and the idea that it 's doing something like
               | sending bits between synapses, encoded in spike timing or
               | something.. well, I think that's highly speculative and
               | has plenty of problems._
               | 
               | I am not sure how much is known about information
               | processing, but it's clear that motor impulses and
               | sensory information are encoded in the spikes. Higher
               | spike frequency = stronger signal. Synapses are how
               | signals are passed from neuron to neuron.
        
               | pmayrgundter wrote:
               | Ok, that's fair. That's i/o and yes, that's known to be
               | highly linear by the time it gets to the efferent nerves,
                | and it makes sense that it is before that as well. I
                | think that still leaves the vast majority of the cortex
                | using undefined mechanisms.
        
               | whatshisface wrote:
               | There's no need to hypothesize a wholly unique central
                | nervous system signalling mechanism when not only is the
                | signalling mechanism of peripheral nerves understood, but
                | central nerves are observed doing the same thing.
        
               | pmayrgundter wrote:
               | I think it's fine as a hypothesis for the CNS, and my
                | guess is it's correct for the spinal cord on up to maybe
                | the thalamus. But there the anatomy changes
               | radically, as does the electrical activity.. eg the
               | cluster waves (alpha, beta, theta) begin there and
               | indicate some sort of group behaviors in various areas.
               | 
               | Afaik, we have correlative descriptions of what these
               | waves indicate (importantly, they're associated with
               | sleep, consciousness, attentiveness), but no direct
                | mechanistic model of them or a clear purpose. So yeah,
               | spike timing could still be used at this level, but it
               | seems other behaviors are also happening that may be more
               | essential to the larger function.
        
               | posterboy wrote:
               | I do remember reading once that glia cells are not
                | understood, and something about how electromagnetic
                | fields might also induce ... I do not remember it well
                | because it wasn't very specific.
                | 
                | The sentiment that synapses probably don't explain
                | everything is rather common, anyhow. I'm thinking of the
                | way blood flow literally influences the relevant areas by
                | transporting available energy, for example, and
                | neurotransmitters must be a very important factor; how
               | those areas react in case of insufficiency would explain
               | why I become nasty when tired and hungry at the same
               | time.
        
         | version_five wrote:
         | I don't have a specific reference but I'd say it's a common
         | knowledge assertion based on the growth in the number of
         | parameters in models over the last 10 years. There are lots of
         | places where you can see how the number of parameters,
         | especially in language and vision models, has increased, and
          | find the amount of training time quoted. Normally it's
         | framed in terms of compute instead of energy.
        
         | davesque wrote:
         | I think they may have provided fewer citations because it felt
         | like a less controversial claim. I think the choice of words
         | was just a bit awkward. To me, it seems like they were
         | asserting that deep learning requires lots of computational
          | resources, which is common knowledge. In general, this
         | translates to higher energy requirements.
        
       | [deleted]
        
       | version_five wrote:
       | This uses a physical system with controllable parameters to
       | compute a forward pass and
       | 
       | > using a differentiable digital model, the gradient of the loss
       | is estimated with respect to the controllable parameters.
       | 
       | So e.g. they have a tunable laser that shifts the spectrum of an
       | encoded input based on a set of parameters, and then they update
       | the parameters based on a gradient computed from a digital
       | simulation of the laser (physics aware model).
       | 
       | When I read the headline I imagined they had implemented back
       | propagation in a physical system
        
         | visarga wrote:
         | > When I read the headline I imagined they had implemented back
         | propagation in a physical system
         | 
         | They touch on that by observing you could train a second
         | physical neural network to compute the gradients for the first.
         | So it could all be physical.
         | 
         | > Improvements to PAT could extend the utility of PNNs. For
         | example, PAT's backward pass could be replaced by a neural
         | network that directly estimates parameter updates for the
         | physical system. Implementing this 'teacher' neural network
         | with a PNN would allow subsequent training to be performed
         | without digital assistance.
         | 
          | So you need to use in silico training at first, but can get
          | rid of it in deployment.
        
         | dangom wrote:
         | Right,
         | 
         | > Here we introduce a hybrid in situ-in silico algorithm,
         | called physics-aware training, that applies backpropagation to
         | train controllable physical systems. Just as deep learning
         | realizes computations with deep neural networks made from
         | layers of mathematical functions, our approach allows us to
         | train deep physical neural networks made from layers of
         | controllable physical systems, even when the physical layers
         | lack any mathematical isomorphism to conventional artificial
         | neural network layers.
         | 
         | To my naive understanding, and please someone correct me if I'm
         | wrong, the point is that they are not controlling the
         | parameters that compute the NN forward pass directly (hence "no
         | mathematical isomorphism to conventional NNs"), but "hyper-
         | parameters" that guide the physical system to do so. For
         | example, rotation angles of mirrors, or distance between
         | filters, instead of intensity values of light. This leads to
         | the non-linear transformations happening in situ, while simpler
          | transformations in the backprop are still computed in silico.
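          | 
          | A toy illustration of that distinction (hypothetical names; the
          | "mirror angles" and the model are made up, not from the paper):
          | 
          |     import torch
          | 
          |     # The trainable parameters are physical knobs (say, two mirror
          |     # angles), not matrix weights; a differentiable digital model
          |     # of the apparatus supplies d(loss)/d(knobs) for the update.
          |     angles = torch.tensor([0.1, -0.2], requires_grad=True)
          | 
          |     def digital_model(x, angles):
          |         # toy stand-in for the physics-aware simulation
          |         return torch.sin(x + angles[0]) * torch.cos(2 * x * angles[1])
          | 
          |     x = torch.linspace(-1, 1, 16)
          |     target = torch.tanh(x)
          |     loss = ((digital_model(x, angles) - target) ** 2).mean()
          |     loss.backward()
          |     print(angles.grad)     # gradients w.r.t. the physical knobs
          | 
          | (Purely illustrative; in the paper the digital model only
          | supplies the backward pass, while the forward pass runs on the
          | apparatus itself.)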
        
       | visarga wrote:
       | If they can scale it up to GPT-3 like sizes, it would be amazing.
       | Foundation models like GPT-3 will be the operating system of
       | tomorrow. But now they are too expensive to run.
       | 
       | They can be trained once and then frozen and you can develop new
       | skills by learning control codes (prompts), or adding a retrieval
       | subsystem (search engine in the loop).
       | 
       | If you shrink this foundation model to a single chip, something
       | small and energy efficient, then you could have all sorts of
       | smart AI on edge devices.
        
       ___________________________________________________________________
       (page generated 2022-01-29 23:01 UTC)