[HN Gopher] Don't Mess with Backprop: Doubts about Biologically ...
       ___________________________________________________________________
        
       Don't Mess with Backprop: Doubts about Biologically Plausible Deep
       Learning
        
       Author : ericjang
       Score  : 53 points
       Date   : 2021-02-13 22:01 UTC (2 days ago)
        
 (HTM) web link (blog.evjang.com)
 (TXT) w3m dump (blog.evjang.com)
        
       | monocasa wrote:
       | I've never understood why biological neural nets would need back
       | prop.
       | 
        | Evolutionary pressure is its own applied loss function. It's
        | less efficient than back prop, but gets you to solutions all
        | the same.
        
         | visarga wrote:
         | Evolution works on different time scales from day to day life.
         | It's an outer loop of evolution with an inner loop of
         | optimization (learning).
        
           | monocasa wrote:
            | But that day-to-day information doesn't need to be stored
            | as weights in the network in the cyclic networks you see
            | in biology. It can be stored in the fluctuations of
            | activity oscillating around, with the individual weights
            | not really changing. Sort of like how your CPU doesn't
            | change the linear region of its transistors to perform
            | new tasks.
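The idea in this comment, that computation can be carried by the ongoing activity of a fixed-weight recurrent network rather than by weight changes, resembles reservoir computing. A minimal echo state network sketch (sizes, task, and constants are illustrative; only the linear readout is ever trained, while the recurrent weights stay fixed):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, delay = 200, 2000, 3

# Fixed random recurrent weights, rescaled for stable ("echo state") dynamics.
W = rng.normal(0, 1, (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
w_in = rng.normal(0, 1, N)

u = rng.uniform(-1, 1, T)            # input stream
x = np.zeros(N)
states = np.zeros((T, N))
for t in range(T):
    x = np.tanh(W @ x + w_in * u[t]) # recurrent update; W is never trained
    states[t] = x

# Task: recall the input from `delay` steps ago. This is solvable only
# because the fixed-weight dynamics retain a trace of recent inputs.
X, y = states[delay:], u[:-delay]
w_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(N), X.T @ y)  # ridge readout
pred = X @ w_out
print("readout correlation:", np.corrcoef(pred, y)[0, 1])
```

The recurrent weights never change after initialization; all task-specific information lives in the activity trajectory and the small trained readout.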
        
       | pizza wrote:
       | If you are interested in deep learning with spiking neural
       | networks there is also the norse framework:
       | https://github.com/electronicvisions/norse
        
         | orbifold wrote:
         | That repo is slightly outdated, development now continues at
         | https://github.com/norse/norse.
        
         | fishmaster wrote:
         | There's also Nengo (https://www.nengo.ai/).
        
       | xingyzt wrote:
        | I'm not very familiar with deep learning. How does this
        | compare to the biomimetic Spike-Timing-Dependent Plasticity
        | (STDP) of spiking neural networks?
       | 
       | https://github.com/Shikhargupta/Spiking-Neural-Network#train...
        
         | ericjang wrote:
         | Deep Learning has nothing to do with biophysical neuron
         | simulation, even though there is a confusing overloading of the
         | term "neural network". A good introduction to deep learning is
         | this chapter: https://mlstory.org/deep.html.
         | 
          | STDP falls under biophysical models of neuron simulation,
          | where we try to faithfully reproduce the biophysics of the
          | brain (trivia: I started my undergrad in computational
          | neuroscience and implemented STDP several times [1, 2, 3]).
          | STDP is a learning mechanism, but it has not been shown to
          | learn models as powerful as DNNs.
         | 
         | [1] https://github.com/ericjang/pyN
         | 
         | [2] https://github.com/ericjang/julia-NeuralNets
         | 
         | [3] https://github.com/ericjang/NeuralNets
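For readers unfamiliar with STDP, the standard pair-based form of the rule can be sketched in a few lines. The exponential window shapes are the textbook version; the constants here are illustrative and are not taken from any of the linked implementations:

```python
import numpy as np

# Illustrative constants, not from any particular model.
A_plus, A_minus = 0.01, 0.012     # LTP / LTD amplitudes
tau_plus, tau_minus = 20.0, 20.0  # decay time constants (ms)

def stdp_dw(t_pre, t_post):
    """Weight change for a single pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:    # pre fires before post: potentiation
        return A_plus * np.exp(-dt / tau_plus)
    if dt < 0:    # post fires before pre: depression
        return -A_minus * np.exp(dt / tau_minus)
    return 0.0

print(stdp_dw(10.0, 15.0))   # causal pairing: positive
print(stdp_dw(15.0, 10.0))   # anti-causal pairing: negative
```

Note the update is purely local in time and in the pre/post pair, which is what makes it biologically plausible and also what limits its expressive power relative to backprop.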
        
       | Digitalis33 wrote:
       | DeepMind, Hinton, et al are still convinced that the brain must
       | be doing something like backprop.
       | 
       | See Lillicrap address all common objections to backprop in the
       | brain:
       | https://www.youtube.com/watch?v=vbvl0k-aUiE&ab_channel=ELSCV...
       | 
       | Also from their paper Backpropagation in the brain:
       | 
       | "It is not clear in detail what role feedback connections play in
       | cortical computations, so we cannot say that the cortex employs
       | backprop-like learning. However, if feedback connections modulate
       | spiking, and spiking determines the adaptation of synapse
       | strengths, the information carried by the feedback connections
       | must clearly influence learning!"
        
       | scribu wrote:
       | > You can indeed use backprop to train a separate learning rule
       | superior to naive backprop.
       | 
       | I used to dismiss the idea of an impending singularity. Now I'm
       | not so sure.
       | 
        | Hopefully AGIs will reach hard physical limits to self-
        | improvement before taking over the world.
        
         | jjk166 wrote:
          | Given the obvious benefits of increased intelligence, the
          | fact that hominid brain size (and presumably computational
          | power) has plateaued for the past 300,000 or so years, and
          | that no other species has developed superior intelligence
          | (i.e. there are bigger but not more effective brains out
          | there), does seem to indicate that we are at or close to a
          | local maximum for biological intelligence. Presumably
          | there's some threshold beyond which gains in computing
          | power yield only limited improvements in intelligence. Of
          | course, that's not to say it's a global maximum.
        
           | semi-extrinsic wrote:
           | That, or our brains have hit some biological equivalent of
           | Moore's law ending for silicon computing. Maybe getting a
           | doubling in brain size would require a quadrupling in brain
           | energy consumption (and power dissipation) at our level.
        
             | jjk166 wrote:
             | Yeah that's what I mean by not scaling above a threshold.
             | Our brains could be bigger (neanderthals had larger ones
             | than us), but those bigger brains for whatever reason
             | weren't "worth it."
        
       | timlarshanson wrote:
       | But, if your realistically-spiking, stateful, noisy biological
       | neural network is non-differentiable (which, so far as I know, is
       | true), then how are you going to propagate gradients back through
       | it to update your ANN approximated learning rule?
       | 
        | I suspect that, given the small size of synapses, the
        | algorithmic complexity of learning rules (and there are
        | several) is small. Hence, you can productively use
        | evolutionary or genetic algorithms to perform this
        | search/optimization, which I think you'd have to do anyway
        | due to the lack of gradients, or simply due to computational
        | cost. Plenty of research is going on in this field. (Heck,
        | while you're at it, you might as well perform a similar
        | search over wiring topologies & recapitulate our own
        | evolution without having to deal with signaling cascades,
        | transport of mRNA & protein along dendrites, metabolic
        | limits, etc.)
       | 
       | Anyway, coming from a biological perspective: evolution is still
       | more general than backprop, even if in some domains it's slower.
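The gradient-free search over learning-rule parameters described in this comment can be sketched with a simple (mu, lambda) evolution strategy. The fitness function below is a toy stand-in; in practice it would train a small network with the candidate rule and return its task performance:

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(theta):
    # Toy stand-in: pretend the best rule parameters are (0.5, -0.3, 0.1).
    # A real fitness would train a network with rule `theta` and score it.
    return -np.sum((theta - np.array([0.5, -0.3, 0.1])) ** 2)

mu, lam, sigma = 5, 20, 0.1      # parents, offspring, mutation scale
pop = rng.normal(0, 1, (lam, 3)) # initial candidate rule parameters

for gen in range(100):
    scores = np.array([fitness(p) for p in pop])
    parents = pop[np.argsort(scores)[-mu:]]            # select the best mu
    center = parents.mean(axis=0)                      # recombine
    pop = center + sigma * rng.normal(0, 1, (lam, 3))  # mutate

print("best rule parameters found:", center)
```

Nothing here requires the fitness to be differentiable, which is exactly the appeal for noisy, spiking, non-differentiable biological models.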
        
         | ericjang wrote:
         | This is a good question. I think many "biologically plausible"
         | neural models are willing to make some approximations for the
         | benefit of computational power (e.g. rate coding instead of
         | spike coding, point neurons and synapses instead of a cable
         | model). As for non-differentiable operations, I think one
          | strategy might be to formulate it as a multi-agent
          | communication problem (e.g. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFil...),
          | where gradients are obtained via a differentiable
          | relaxation or using a score-function gradient estimator
          | (e.g. REINFORCE).
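The score-function estimator mentioned here can be illustrated on a single stochastic "spiking" unit. The Bernoulli setup and reward values are illustrative; the point is that the gradient is estimated from reward times grad(log prob), with no differentiation through the sampler:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3                   # logit of the unit's firing probability
p = 1 / (1 + np.exp(-theta))  # spike probability (sigmoid)

# Sample non-differentiable Bernoulli "spikes" and assign each a reward.
samples = rng.random(100_000) < p
rewards = np.where(samples, 1.0, -0.5)   # arbitrary objective on the spike

# Score function: d/dtheta log p(spike) = spike - p for a sigmoid-Bernoulli.
score = samples.astype(float) - p
grad_estimate = np.mean(rewards * score)

# Exact gradient of E[reward] = 1.5 * p - 0.5 with respect to theta.
grad_exact = 1.5 * p * (1 - p)
print(grad_estimate, grad_exact)
```

The estimator is unbiased but high-variance, which is why differentiable relaxations are often preferred when they are available.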
        
           | orbifold wrote:
           | You can actually calculate exact gradients for spiking
           | neurons using the adjoint method:
           | https://arxiv.org/abs/2009.08378 (I'm the second author). In
           | my PhD thesis I show how this can be extended to larger
           | problems and more complicated and biologically plausible
           | neuron models. I agree with the gist of your post though:
           | Retrofitting back propagation (or the adjoint method for that
           | matter) is the wrong approach. One should rather use these
           | methods to optimise biologically plausible learning rules.
           | The group of Wolfgang Maass has done exciting work in that
            | direction (e.g. https://arxiv.org/abs/1803.09574,
            | https://www.frontiersin.org/articles/10.3389/fnins.2019.0048...,
            | https://igi-web.tugraz.at/PDF/256.pdf).
        
       | notthemessiah wrote:
        | The author of this piece calls Dynamic Programming "one of
        | the top three achievements of Computer Science"; however, it
        | doesn't have much to do with computer science. It's
        | essentially a synonym for multistage mathematical
        | optimization, and the name was chosen seemingly just to be
        | "politically correct" (avoiding the wrath and suspicion of
        | managers) at RAND Corporation:
       | 
       | > I spent the Fall quarter (of 1950) at RAND. My first task was
       | to find a name for multistage decision processes. An interesting
       | question is, "Where did the name, dynamic programming, come
       | from?" The 1950s were not good years for mathematical research.
       | We had a very interesting gentleman in Washington named Wilson.
       | He was Secretary of Defense, and he actually had a pathological
       | fear and hatred of the word "research". I'm not using the term
       | lightly; I'm using it precisely. His face would suffuse, he would
       | turn red, and he would get violent if people used the term
       | research in his presence. You can imagine how he felt, then,
       | about the term mathematical. The RAND Corporation was employed by
       | the Air Force, and the Air Force had Wilson as its boss,
       | essentially. Hence, I felt I had to do something to shield Wilson
       | and the Air Force from the fact that I was really doing
       | mathematics inside the RAND Corporation. What title, what name,
       | could I choose? In the first place I was interested in planning,
       | in decision making, in thinking. But planning, is not a good word
       | for various reasons. I decided therefore to use the word
       | "programming". I wanted to get across the idea that this was
       | dynamic, this was multistage, this was time-varying. I thought,
       | let's kill two birds with one stone. Let's take a word that has
       | an absolutely precise meaning, namely dynamic, in the classical
       | physical sense. It also has a very interesting property as an
       | adjective, and that is it's impossible to use the word dynamic in
       | a pejorative sense. Try thinking of some combination that will
       | possibly give it a pejorative meaning. It's impossible. Thus, I
       | thought dynamic programming was a good name. It was something not
       | even a Congressman could object to. So I used it as an umbrella
       | for my activities.
       | 
       | https://en.wikipedia.org/wiki/Dynamic_programming#History
        
       | ilaksh wrote:
       | Predictive coding seems not only plausible but also potentially
       | advantageous in some ways. Such as being inherently well-suited
       | to generative perception.
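A minimal single-layer sketch of predictive coding, assuming the usual setup in which latents generate a prediction of the input and perception is iterative minimization of prediction error (all sizes and step sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (10, 4))   # generative weights: latents predict input
x = rng.normal(0, 1, 10)          # observed input

z = np.zeros(4)                   # latent causes, inferred at perception time
for _ in range(200):
    error = x - W @ z             # prediction error at the input layer
    z += 0.1 * (W.T @ error - z)  # error feedback plus a decay prior on z

print("residual prediction error:", np.linalg.norm(x - W @ z))
```

Because the same weights map latents to predicted inputs, running W @ z for a chosen z is the generative direction, which is the "inherently well-suited to generative perception" point.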
        
       | neatze wrote:
       | Comparisons of various neural architectures.
       | 
       | Deep Learning in Spiking Neural Networks:
       | https://arxiv.org/pdf/1804.08150.pdf
        
       | intrasight wrote:
       | I was told ~30 years ago by a leading computer scientist in the
       | NN field that biology has nothing to teach us in terms of
       | implementation. I switched from CS to neuroscience anyway. I've
       | wrestled with his statement ever since. I'll say that nothing
       | I've seen since then has shown him wrong.
        
         | ericjang wrote:
         | OP here. I hope that's not the takeaway readers glean from my
         | article - the point I was making was just that it doesn't make
         | sense to shoehorn a biophysical learning mechanism into a DNN,
         | rather we should use a DNN to find a biophysical learning
         | mechanism.
         | 
         | As to whether biophysical learning has anything to teach us is
         | an entirely different question which I don't discuss in the
         | post.
        
         | xkcd-sucks wrote:
          | Nobody understands the biology fully enough to drive a
          | better ML implementation. Individual neurons are very
          | complex, and their interactions even more so.
        
       | taliesinb wrote:
       | While the continual one-upmanship of ever more intricate
       | biologically plausible learning rules is interesting to observe
       | (and I played around at one point with a variant of the original
       | feedback alignment), I think OP's alternative view is more
       | plausible.
       | 
       | Fwiw I am involved in an ongoing project that is investigating a
       | biologically plausible model for generating connectomes (as
       | neuroscientists like to call them). The connectome-generator
       | happens (coincidentally) to be a neural network. But exactly as
       | the OP points out, this "neural network" need not actually
       | represent a biological brain -- in our case it's actually a
       | _hypernetwork_ representing the process of gene expression, which
       | in turn generates the biological network. Backprop is then
       | applied to this hypernetwork as a (more efficient) proxy for
       | evolution. In the most extreme case there need not be any
       | learning at all at the level of an individual organism. You can
       | see this as the ultimate end-point of so-called Baldwinian
       | evolution, which is the hypothesized process whereby more and
        | more of the statistics of a task are "pulled back" into
       | genetically encoded priors over time.
       | 
       | But for me the more interesting question is how to approach the
       | information flow from tasks (or 'fitness') to brains to genes on
       | successively longer time scales. Can that be done with
       | information theory, or perhaps with some generalization of it? I
       | also think it is a rich and interesting challenge to
       | _parameterize_ learning rules in such a way that evolution (or
       | even random search) can efficiently find good ones for rapid
       | learning of specific kinds of task. My gut feeling is that
       | biological intelligence has many components that are ultimately
        | discrete computations, and we'll discover that those are
       | reachable by random search if we can just get the substrate
       | right, and in fact this is how evolution has often done it --
       | shades of Gould and Eldredge's "punctuated equilibrium".
       | 
       | (if anyone is interested in discussing any of these things feel
       | free to drop me an email)
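The hypernetwork-as-proxy-for-evolution setup described in this comment can be sketched in a few lines. The linear "genome to weights" map and the regression task below are illustrative stand-ins, not the project's actual model; the point is that backprop flows through the developmental map into the genome, and the task network itself does no learning:

```python
import numpy as np

rng = np.random.default_rng(0)

# Task data: linear regression with known true weights.
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(0, 1, (200, 3))
y = X @ w_true

H = rng.normal(0, 0.5, (3, 4))   # "gene expression": genome -> task weights
g = rng.normal(0, 0.1, 4)        # the "genome"

lr = 0.05
for _ in range(500):
    w = H @ g                     # develop the task network from the genome
    grad_w = X.T @ (X @ w - y) / len(y)   # loss gradient at the task weights
    dg, dH = H.T @ grad_w, np.outer(grad_w, g)
    g -= lr * dg                  # backprop into the genome...
    H -= lr * dH                  # ...and into the developmental map

print("developed weights:", H @ g)
```

In the extreme Baldwinian case described above, this is the only optimization loop: the "organism" (w) is fully determined by its genome and never learns within its lifetime.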
        
       ___________________________________________________________________
       (page generated 2021-02-15 23:00 UTC)