[HN Gopher] Predictive coding has been unified with backpropagation
___________________________________________________________________
Predictive coding has been unified with backpropagation
Author : cabalamat
Score : 241 points
Date : 2021-04-05 12:02 UTC (10 hours ago)
(HTM) web link (www.lesswrong.com)
(TXT) w3m dump (www.lesswrong.com)
| xzvf wrote:
| At scale, Evolutionary Strategies (ES) are a very good
| approximation of the gradient as well. I don't recommend jumping
| to conclusions and unifications just yet.
| jnwatson wrote:
| The author's point is that predictive coding is a plausible
| mechanism by which biological neurons work. ES are not.
|
| ANNs have deviated widely from their biological inspiration,
| most notably in the way that information flows, since
| backpropagation requires two-way flow and biological axons are
| one-directional.
|
| If predictive coding and backpropagation are shown to have
| similar power, then there's a rough idea that the way that ANNs
| work isn't too far from how brains work (with lots and lots of
| caveats).
| whimsicalism wrote:
| > If predictive coding and backpropagation are shown to have
| similar power, then there's a rough idea that the way that
| ANNs work isn't too far from how brains work (with lots and
| lots of caveats).
|
| So many caveats that I don't even really think that is a true
| statement.
| blueyes wrote:
| I'm glad people are talking about this, and the similarity
| between predictive coding and the action of biological neurons is
| interesting. But we shouldn't fetishize predictive coding.
| There's a wider discussion going on, and several theories as to
| how backpropagation might work in the brain.
|
| https://www.cell.com/trends/cognitive-sciences/fulltext/S136...
|
| https://www.nature.com/articles/s41583-020-0277-3
| andyxor wrote:
| there is no evidence of back-propagation in the brain.
|
| See Professor Edmund T. Rolls's books on biologically plausible
| neural networks:
|
| "Brain Computations: What and How" (2020)
| https://www.amazon.com/gp/product/0198871104
|
| "Cerebral Cortex: Principles of Operation" (2018)
| https://www.oxcns.org/b12text.html
|
| "Neural Networks and Brain Function" (1997)
| https://www.oxcns.org/b3_text.html
| ShamelessC wrote:
| "There is just one problem: [biological neural networks] are
| physically incapable of running the backpropagation
| algorithm."
|
| From the linked article.
| 0lmer wrote:
| But is predictive coding perceived as a valid theory of cortical
| neuron functioning? There was a paper from 2017 drawing similar
| conclusions about backprop approximation with Spike-
| Timing-Dependent Plasticity: https://arxiv.org/abs/1711.04214
| It looks more grounded in current models of neuronal
| functioning. Nevertheless, it has changed nothing in the field
| of deep learning since then.
| jwmullally wrote:
| Some general background on STDP for the thread:
|
| Biological neurons don't just emit constant 0...1 float values,
| they communicate using time-sensitive bursts of voltage known
| as "spike trains". Spiking Neural Networks (SNN) are a closer
| approximation of natural networks than typical ML ANNs. [0]
| gives a quick overview.
|
| Spike-Timing-Dependent Plasticity is a local learning rule
| experimentally observed in biological neurons. It's a form of
| Hebbian learning, aka "Neurons that fire together wire
| together."
|
| Summary from [1]. The top graph gives a clear picture of how
| the rule works.
|
| > _With STDP, repeated presynaptic spike arrival a few
| milliseconds before postsynaptic action potentials leads in
| many synapse types to Long-Term Potentiation (LTP) of the
| synapses, whereas repeated spike arrival after postsynaptic
| spikes leads to Long-Term Depression (LTD) of the same
| synapse._
|
| ---
|
| [0]: https://towardsdatascience.com/deep-learning-versus-
| biologic...
|
| [1]: http://www.scholarpedia.org/article/Spike-
| timing_dependent_p...
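|
| To make the rule concrete, here is a minimal sketch of the
| pair-based version in Python (the constants are illustrative,
| not measured values):
|
|     import math
|
|     A_PLUS, A_MINUS = 0.01, 0.012   # LTP / LTD amplitudes
|     TAU = 20.0                      # decay time constant in ms
|
|     def stdp_dw(t_pre, t_post):
|         """Weight change for one pre/post spike pair (times in ms)."""
|         dt = t_post - t_pre
|         if dt > 0:   # pre fires before post -> potentiation (LTP)
|             return A_PLUS * math.exp(-dt / TAU)
|         else:        # pre fires after post -> depression (LTD)
|             return -A_MINUS * math.exp(dt / TAU)
|
|     print(stdp_dw(t_pre=10.0, t_post=15.0))  # positive: strengthen
|     print(stdp_dw(t_pre=15.0, t_post=10.0))  # negative: weaken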
| andyxor wrote:
| as long as the model requires a delta rule, or 'teacher signal'-
| based error correction, it is not biologically plausible.
| adamnemecek wrote:
| I think that this sort of forward-backward thing is a very
| general idea. There's a one-to-many relationship called the
| adjoint, and a many-to-one relationship called the norm.
|
| I wrote something about this here
| https://github.com/adamnemecek/adjoint
| tsmithe wrote:
| In fact, the compositional structure underlying that of
| predictive coding [0,1] is abstractly the same as that
| underlying backprop [2]. (Disclaimer: [0,1] are my own papers;
| I'm working on a more precise and extensive version of [1]
| right now!)
|
| [0] https://arxiv.org/abs/2006.01631
| [1] https://arxiv.org/abs/2101.10483
| [2] https://arxiv.org/abs/1711.10455
| eli_gottlieb wrote:
| Hurry and publish before I have manuscripts ready applying
| these results.
| tsmithe wrote:
| Hey, Eli :-)
|
| I'm working on it; I'll send you an e-mail. Things quickly
| turned out to be more general than I realized last year.
| selimthegrim wrote:
| What were you going to say about Young tableaux?
| adamnemecek wrote:
| Dynamic programming and reinforcement learning are just
| diagonalizations of the Young tableau. This is related to the
| spectral theorem.
| jdonaldson wrote:
| Yeah, I don't like this title. Coding for backprop is worth
| getting excited about, but please don't assume it supersedes all
| forms of "predictive coding". Plenty of predictive learning
| techniques do just fine without it, including our own brains.
|
| In keeping with the No-Free-Lunch theorem, it's also highly
| desirable in general to have a variety of approaches at hand for
| solving certain predictive coding problems. Yes, this makes ML
| (as a field) cumbersome, but it also prevents us from painting
| ourselves into a corner.
| nerdponx wrote:
| Is this "coding for backprop", or "coding for the same results
| as backprop"?
| klmadfejno wrote:
| > Predictive coding is the idea that BNNs generate a mental model
| of their environment and then transmit only the information that
| deviates from this model. Predictive coding considers error and
| surprise to be the same thing. Hebbian theory is a specific
| mathematical formulation of predictive coding.
|
| This is an excellent, concise explanation. It sounds intuitive as
| something that could work. Would love to try and dabble with
| this. Any resources?
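|
| In the meantime, here's my rough mental model in toy code - a
| linear sketch of "transmit only the deviations", almost
| certainly oversimplified, so corrections welcome:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     W = rng.normal(scale=0.1, size=(16, 4))  # generative map: latent -> input
|     x = rng.normal(size=16)                  # observed input
|     r = np.zeros(4)                          # latent "mental model" of x
|
|     for _ in range(100):
|         error = x - W @ r           # surprise: what the model failed to predict
|         r += 0.1 * (W.T @ error)    # inference: adjust beliefs to cut surprise
|     W += 0.01 * np.outer(error, r)  # learning: local, Hebbian-style update
|
|     print(np.linalg.norm(x - W @ r))  # residual surprise after inference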
| cs702 wrote:
| EDIT: Before you read my comment below, please see
| https://news.ycombinator.com/item?id=26702815 and
| https://openreview.net/forum?id=PdauS7wZBfC for a different view.
|
| --
|
| If the results hold, they seem significant enough to me that I'd
| go as far as saying the authors of the paper would end up getting
| an important award at some point, not just for _unifying the
| fields of biological and artificial intelligence_, but also for
| making it trivial to train models in a fully distributed manner,
| with _all learning done locally_ -- if the results hold.
|
| Here's the paper: "Predictive Coding Approximates Backprop along
| Arbitrary Computation Graphs"
|
| https://arxiv.org/abs/2006.04182
|
| I'm making my way through it right now.
| klmadfejno wrote:
| I'm trying to imagine how that works. Imagine you've got a
| neural net. One node identifies the number of feet. One node
| identifies the number of wings. One node identifies color.
| This feeds into a layer that tries to predict what animal it
| is.
|
| With backprop, you can sort of assume that given enough scale
| your algo will identify these important features. With local
| learning, wouldn't you get a tendency to identify the easily
| identifiable features many times? Is there a need for a sort of
| middleman, like a one-armed-bandit kind of thing, that makes a
| decision to spawn and despawn child nodes to explore the space
| more?
| TheRealPomax wrote:
| The fallacy there is the idea that "one node" does anything
| useful on its own, rather than optimizing itself in a way where
| you have _no idea_ what it actually codes for. At the emergent
| level, you see it contribute to coding for wing detection, or
| color detection, or, more likely, seventeen supposedly unrelated
| things at once; it just happens to be generating values that
| somehow contribute to a result for the features the various
| constellations detect.
|
| (meaning it might also actually cause one or more
| constellations to perform worse than if it wasn't
| contributing, and realistically, you'll never know)
| SamBam wrote:
| > Is there a need for a sort of middleman, like a one-armed-
| bandit kind of thing, that makes a decision to spawn and
| despawn child nodes to explore the space more?
|
| What's the one-armed bandit? (Besides a slot machine.)
|
| My knowledge of this field is rusty, but I actually wrote my
| MSc thesis on novel ways to get Genetic Algorithms to more
| efficiently explore the space without getting stuck, so it
| sounds up my alley.
| fancy_pantser wrote:
| I wonder if you thought of it as a type of optimal stopping
| problem locally on each node and explore-exploit (multi-
| armed bandit) globally? For example, if each node knows when
| to halt when it hits a [probably local] minimum, the results
| can be shared at that point and the best-performing models can
| be cross-pollinated, or whatever the mechanism is at that
| point. Since copying the models and continuing without gaining
| ground are both wastes of time, you want to dial in that local
| halting point precisely. An overseeing
| scheduler would record epoch-level results and make the
| decisions, of course.
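|
| Roughly this, in toy form (every name and constant below is
| made up for illustration):
|
|     import random
|
|     def train_until_stall(seed, patience=5, eps=1e-3):
|         """Local optimal stopping: halt once progress dries up."""
|         rng, best, stall, step = random.Random(seed), float("inf"), 0, 0
|         while stall < patience:
|             loss = rng.random() / (1 + step)  # stand-in for a training step
|             step += 1
|             if loss < best - eps:
|                 best, stall = loss, 0
|             else:
|                 stall += 1
|         return best
|
|     # Global explore/exploit: keep the best performers, reseed the rest.
|     results = {seed: train_until_stall(seed) for seed in range(8)}
|     survivors = sorted(results, key=results.get)[:3]
|     print("cross-pollinate from seeds:", survivors)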
| babel_ wrote:
| Interesting follow-up reading:
|
| "Relaxing the Constraints on Predictive Coding Models"
| (https://arxiv.org/abs/2010.01047), from the same authors.
| Looks at ways to remove neurological implausibility from PCM
| and achieve comparable results. Sadly they only do MNIST in
| this one, and are not as ambitious in testing on multiple
| architectures and problems/datasets, but the results are still
| very interesting and it covers some of the important
| theoretical and biological concerns.
|
| "Predictive Coding Can Do Exact Backpropagation on
| Convolutional and Recurrent Neural Networks"
| (https://arxiv.org/abs/2103.03725), from different authors.
| Uses an alternative formulation that means it always converges
| to the backprop result within a fixed number of iterations,
| rather than approximately converges "in practice" within
| 100-200 iterations. Not only is this a stronger guarantee, it
| means they achieve inference speeds within spitting distance of
| backprop, levelling the playing field.
|
| It'd be interesting to see what a combination of these two
| could do, and at this point I feel like a logical next step
| would be to provide some setting in popular ML libraries such
| that backprop can be switched for PCM. Being able to verify
| this research just by adding a single extra line for the PCM
| version, and perhaps replicating state-of-the-art
| architectures, would be quite valuable.
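|
| Purely hypothetical sketch of the kind of switch I mean - no
| library exposes this today, and all names here are invented:
|
|     def train_step(params, batch, grad_fn, pc_grad_fn,
|                    algorithm="backprop", lr=0.01):
|         """One update step; swap the gradient engine with a flag.
|
|         grad_fn computes exact gradients (autodiff); pc_grad_fn
|         approximates them with local predictive-coding updates.
|         """
|         if algorithm == "backprop":
|             grads = grad_fn(params, batch)
|         elif algorithm == "predictive_coding":
|             grads = pc_grad_fn(params, batch)
|         else:
|             raise ValueError(f"unknown algorithm: {algorithm}")
|         return [p - lr * g for p, g in zip(params, grads)]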
| abraxas wrote:
| I'm going to personally flog any researcher who titles their
| next paper "Predictive Coding Is All You Need". You've been
| warned.
| cs702 wrote:
| There are already 60+ of those, and counting, all but one of
| them since Vaswani et al.'s transformer paper:
|
| https://arxiv.org/search/?query=is+all+you+need&searchtype=a.
| ..
| eutropia wrote:
| Here's a more recent paper (March, 2021) which cites the above
| paper: https://arxiv.org/abs/2103.04689 "Predictive Coding Can
| Do Exact Backpropagation on Any Neural Network"
| cs702 wrote:
| Yup. I'd expect to see many more citations going forward. In
| particular, I'd be excited to see how this ends up getting
| used in practice, e.g., training and running very large
| models on distributed, massively parallel "neuromorphic"
| hardware.
| JackFr wrote:
| My background is as an interested amateur, but
|
| > also for making it trivial to train models in a fully
| distributed manner, with all learning done locally
|
| seems like a really huge development.
|
| At the same time I remain pretty skeptical of claims of
| unifying the fields of biological and artificial intelligence.
| I think the recent tremendous successes in AI & ML lead to an
| unjustified overconfidence that we are close to understanding
| the way biological systems must work.
| himinlomax wrote:
| Indeed, it's worth mentioning we still have absolutely no
| idea how memory works.
| andyxor wrote:
| we know a lot about memory, but most AI researchers are
| simply ignorant of neuroscience or cognitive psychology and
| stick with their comfort zone.
|
| Saying "we have no idea" is just being lazy.
| andyxor wrote:
| The thing is, about every week a paper is published with
| groundbreaking claims, with this question in particular being
| very popular, trying to unify neuroscience and deep learning in
| some way in search of the computational foundations of AI.
| Mostly this is driven by the success of DL in certain industrial
| applications.
|
| Unfortunately most of these papers are heavy on theory but
| light on empirical evidence. If we follow the path of natural
| sciences, theory has to agree with evidence. Otherwise it's
| just another theory unconstrained by reality, or worse, pseudo-
| science.
| autopoiesis wrote:
| The paper (arxiv:2103.04689) linked by eutropia above has
| some empirical evidence on the ML side, showing that
| performance of predictive coding is not so far off backprop.
| And there is no shortage of suggestions for how neural
| circuits might work around the strict requirements of
| backprop-like algorithms.
|
| cs702's original comment above is excessively hyperbolic: the
| compositional structure of Bayesian inversion is well known
| and is known to coincide structurally with the
| backward/forward structure of automatic differentiation. And
| there have been many papers before this one showing how
| predictive coding approximates backprop in other cases, so it
| is no surprise that it can do so on graphs, too. I agree with
| the ICLR reviewers that this paper is borderline and not in
| itself a major contribution. But that does not mean that this
| whole endeavour, of trying to find explicit mathematical
| connections between biological and artificial learning, is
| ill motivated.
| eli_gottlieb wrote:
| >the compositional structure of Bayesian inversion is well
| known
|
| /u/tsmithe's results on that are _well known_, now? I can
| scarcely find anyone to collaborate with who understands
| them!
| YeGoblynQueenne wrote:
| Note that the paper was rejected for publication in ICLR 2021:
|
| https://openreview.net/forum?id=PdauS7wZBfC
| hctaw wrote:
| I don't know enough about biology or ML to know if what I'm
| posting below is totally wrong, but here goes.
|
| "Backprop" == "Feedback" of a non-linear dynamical system.
| Feedback is a mathematical description of the behavior of systems,
| not a literal one.
|
| I don't know if BNNs are incapable of backprop any more than an
| RLC filter is incapable of "feedback", when analyzing the ODE of
| the latter tells you that there's a feedback path (which is what,
| physically? The return path for charge?)
|
| So what makes BNNs incapable of feedback? Are they mechanically
| and electrically insulated from each other? How do they share
| information, and what is the return path?
|
| Other than that I wish more unification was done on ML algorithms
| and dynamical systems, just in general. There's too much
| crossover to ignore.
| andyxor wrote:
| The back-prop learning algorithm requires information non-local
| to the synapse to be propagated from the output of the network
| backwards to affect neurons deep in the network.
|
| There is simply no evidence for this global feedback loop, or
| global error correction, or delta rule training in
| neurophysiological data collected in the last 80 years of
| intensive research. [1]
|
| As for "why", biological learning it is primarily shaped by
| evolution driven by energy expenditures constraints and
| survival of the most efficient adaptation engines. One can
| speculate that iterative optimization akin to the one run by
| GPUs in ANNs is way too energy inefficient to be sustainable in
| a living organism.
|
| A good discussion of the biological constraints on learning
| (from a CompSci perspective) can be found in Leslie Valiant's
| book [2]. Prof. Valiant is the author of PAC [3], one of the few
| theoretically sound models of modern ML, so he's worth
| listening to.
|
| [1] https://news.ycombinator.com/item?id=26700536
|
| [2] https://www.amazon.com/Circuits-Mind-Leslie-G-
| Valiant/dp/019...
|
| [3]
| https://en.wikipedia.org/wiki/Probably_approximately_correct...
| hctaw wrote:
| I think there's a significant difference worth illustrating:
| "there is no feedback path in the brain" is not at all
| equivalent to "learning by feedback is not possible in the
| brain."
|
| It's well known in dynamics that feed-forward networks are no
| longer feed-forward when outputs are coupled to inputs, an
| example of which would be a hypothetically feed-forward
| network of neurons in an animal and environmental
| conditioning teaching it the consequences of actions.
|
| I'm very curious about the biological constraints, but I'd
| reiterate my point above that feedback is a mathematical or
| logical abstraction for analyzing the behavior of the things
| we call networks - which are themselves abstractions. There's
| a distinction between the physical behavior of the things we
| see and the mathematical models we construct to describe
| them, as in electromechanical systems where physically no such
| output-to-input coupling appears to exist, yet its existence
| is crucially important analytically.
| khawkins wrote:
| > Other than that I wish more unification was done on ML
| algorithms and dynamical systems, just in general. There's too
| much crossover to ignore.
|
| Check out this work, "Deep relaxation: partial differential
| equations for optimizing deep neural networks" by Pratik
| Chaudhari, Adam Oberman, Stanley Osher, Stefano Soatto &
| Guillaume Carlier.
|
| https://link.springer.com/article/10.1007/s40687-018-0148-y
| nerdponx wrote:
| The article says this:
|
| > The backpropagation algorithm requires information to flow
| forward and backward along the network. But biological neurons
| are one-directional. An action potential goes from the cell
| body down the axon to the axon terminals to another cell's
| dendrites. An action potential never travels backward from a
| cell's terminals to its body.
|
| The point of the research here is that backpropagation turns
| out not to be necessary to fit a neural network, and that it
| can be approximated with predictive coding, which does not
| require end-to-end backwards information flow.
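|
| A toy sketch of that scheme, in the spirit of the paper (sizes
| and constants arbitrary); note that every update below uses only
| quantities local to one layer:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     f = np.tanh
|     df = lambda a: 1.0 - np.tanh(a) ** 2
|
|     W1 = rng.normal(scale=0.5, size=(8, 4))  # input -> hidden
|     W2 = rng.normal(scale=0.5, size=(3, 8))  # hidden -> output
|     x = rng.normal(size=4)
|     y = rng.normal(size=3)                   # target; output clamped to it
|
|     x1 = f(W1 @ x)                           # init values by a forward pass
|     for _ in range(200):                     # inference: relax value nodes
|         e1 = x1 - f(W1 @ x)                  # hidden-layer error node
|         e2 = y - f(W2 @ x1)                  # output-layer error node
|         x1 += 0.05 * (-e1 + W2.T @ (e2 * df(W2 @ x1)))
|
|     # Learning: each weight sees only its own layer's error and input.
|     W2 += 0.01 * np.outer(e2 * df(W2 @ x1), x1)
|     W1 += 0.01 * np.outer(e1 * df(W1 @ x), x)
|
| At equilibrium, the hidden error node e1 approximates the delta
| that backprop would have computed for that layer.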
| candiodari wrote:
| Yeah, but then you run into the problem of computation speed.
| Any given neuron in the middle of your brain does 1
| computation per second absolute maximum, and 1 per 10 seconds
| is more realistic. Closer to the outside (the vast majority of
| your brain), 1 per 100 seconds is a lot. And it slows down
| when you age.
|
| This means brains must have a _bloody_ good update rule. You
| just can't update a neural network in 1 billion operations
| per second, or 4e17 operations until you're 12 - about 2
| million training steps per neuron, or about half that
| assuming you sleep. You cannot get to the level of a 12 year
| old in 4e17 operations, because GPT-3 does more, and while
| it's impressive, it doesn't have anything on a 12 year old.
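|
| Back of the envelope, using the usual ~8.6e10 neuron figure
| (everything here is order-of-magnitude):
|
|     seconds = 12 * 365 * 24 * 3600  # ~3.8e8 seconds in 12 years
|     ops = 1e9 * seconds             # ~3.8e17, the "4e17" above
|     print(ops / 8.6e10)             # a few million updates per neuron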
| salawat wrote:
| So... I don't understand.
|
| >An action potential goes from the cell body down the axon to
| the axon terminals to another cell's dendrites.
|
| How do you figure that doesn't allow backprop?
|
| A neuronal bit is a loop of neurons. Information absolutely
| can backpropagate. If it couldn't, how does anyone think
| it'd be at all possible to learn how to get better at
| anything?
|
| Neuron fires dendrite to axon, secondary neuron fires
| dendrite to axon, axon branches back to previous neuron's
| dendrites, rinse, repeat, or add more intervening neurons...
| Trying to exclude backprop based on the morphology of a
| single neuron is... kinda missing the point.
|
| It's all about the level of connection between neurons and
| how long or whether a signal returns unmodified to the
| progenitor, which affects the stability of the encoded
| information or behavior. At least that's the best I've been
| able to plausibly model it. Haven't exactly figured out how
| to shove a bunch of measuring sticks in there to confirm or
| deny, but I just can't see how a unidirectional action-
| potential-forwarding element implies a lack of backprop in a
| graph of connections fully capable of developing cycles.
| nmca wrote:
| Interesting discussion on the ICLR openreview, resulting in a
| reject:
|
| https://openreview.net/forum?id=PdauS7wZBfC
| justicezyx wrote:
| Another well-received paper [1], but I want to point out that
| ICLR should really have an industry track.
|
| The type of research in [1] (an exhaustive analytic study of
| various parameters in RL training) is clearly beyond the typical
| academic environment, and probably also beyond normal industry
| labs. Note the paper was from Google Brain.
|
| The study consumes a lot of people's time and computing time.
| It's no doubt very useful and valuable. But I don't think it
| should be judged by the same group of reviewers as other work
| from normal universities.
|
| [1] https://openreview.net/forum?id=nIAxjsniDzg
| justicezyx wrote:
| Copied from that URL, the final review comment, which 1)
| summarizes the other reviews and 2) describes the rationale for
| rejection:
| ``` This paper extends recent work (Whittington & Bogacz, 2017,
| Neural computation, 29(5), 1229-1262) by showing that
| predictive coding (Rao & Ballard, 1999, Nature neuroscience
| 2(1), 79-87) as an implementation of backpropagation can be
| extended to arbitrary network structures. Specifically, the
| original paper by Whittington & Bogacz (2017) demonstrated that
| for MLPs, predictive coding converges to backpropagation using
| local learning rules. These results were important/interesting
| as predictive coding has been shown to match a number of
| experimental results in neuroscience and locality is an
| important feature of biologically plausible learning
| algorithms.
|
| The reviews were mixed. Three out of four reviews were above
| threshold for acceptance, but two of those were just above.
| Meanwhile, the fourth review gave a score of clear reject.
| There was general agreement that the paper was interesting and
| technically valid. But, the central criticisms of the paper
| were:
|
| (1) Lack of biological plausibility: The reviewers pointed to a
| few biologically implausible components of this work. For
| example, the algorithm uses local learning rules in the same
| sense that backpropagation does, i.e., if we assume that there
| exist feedback pathways with weights symmetric to the
| feedforward pathways, then the algorithm is local. Similarly, it
| is assumed that there are paired error neurons, which is
| biologically questionable.
|
| (2) Speed of convergence: The reviewers noted that this model
| requires many more iterations to converge on the correct
| errors, and questioned the utility of a model that involves
| this much additional computational overhead.
|
| The authors included some new text regarding biological
| plausibility and speed of convergence. They also included some
| new results to address some of the other concerns. However,
| there is still a core concern about the importance of this work
| relative to the original Whittington & Bogacz (2017) paper. It
| is nice to see those original results extended to arbitrary
| graphs, but is that enough of a major contribution for
| acceptance at ICLR? Given that there are still major issues
| related to (1) in the model, it is not clear that this
| extension to arbitrary graphs is a major contribution for
| neuroscience. And, given the issues related to (2) above, it is
| not clear that this contribution is important for ML.
| Altogether, given these considerations, and the high bar for
| acceptance at ICLR, a "reject" decision was recommended.
| However, the AC notes that this was a borderline case. ```
|
| The core reason is that the proposed model lacks biological
| plausibility. Or, setting that weakness aside, the model is
| computationally more intensive.
|
| I HAVE NOT read the paper, but the review seems mostly based on
| "feeling"; i.e., the reviewers feel that this work is not above
| the bar. Note that I am not criticizing the reviewers here: in
| my past reviewing career of maybe 100+ papers, which ended 6
| years ago, most submissions were junk. The ones that were truly
| good work, checking all the boxes - new result, hard problem,
| solid validation - were easy to accept.
|
| A few other papers, which all seemed to fall into the "feeling"
| category, looked right in every respect, but were always
| borderline. And the review results can vary substantially based
| on the reviewers' own backgrounds.
| marmaduke wrote:
| The review is great: it contains all the interesting points and
| counterpoints in a much more succinct format than the article
| itself.
| ilaksh wrote:
| Does anyone know of a simple code example that demonstrates the
| original predictive coding concept from 1999? Ideally applied to
| some type of simple image/video problem.
|
| I thought I saw a Matlab explanation of that '99 paper but have
| not found it again.
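|
| The closest I've gotten is hand-rolling a toy two-level version
| of the 1999 scheme myself: the higher level predicts the lower
| one, and only residual errors travel upward. Sizes and constants
| are arbitrary guesses, so treat this as the concept rather than
| a faithful reproduction:
|
|     import numpy as np
|
|     rng = np.random.default_rng(1)
|     img = rng.random(64)                       # flattened 8x8 image patch
|     W1 = rng.normal(scale=0.1, size=(64, 32))  # level-1 basis: r1 -> image
|     W2 = rng.normal(scale=0.1, size=(32, 16))  # level-2 basis: r2 -> r1
|     r1, r2 = np.zeros(32), np.zeros(16)
|
|     for _ in range(200):
|         e0 = img - W1 @ r1             # bottom-up residual ("surprise")
|         e1 = r1 - W2 @ r2              # level-1 residual vs. level 2
|         r1 += 0.05 * (W1.T @ e0 - e1)  # driven from below, constrained above
|         r2 += 0.05 * (W2.T @ e1)       # explains away level-1 residual
|
|     # One local, Hebbian-style learning step per level:
|     W1 += 0.01 * np.outer(e0, r1)
|     W2 += 0.01 * np.outer(e1, r2)
|     print(np.linalg.norm(e0), np.linalg.norm(e1))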
| phreeza wrote:
| This was already shown for MLPs some years ago, and it is not
| really that surprising that it applies to many other
| architectures. Note that while learning can take place locally,
| it does still require an upward and downward stream of
| information flow, which is not supported by the neuroanatomy in
| all cases. So while it is an interesting avenue of research, I
| don't think it's anywhere near as revolutionary as this blog post
| makes it out to be.
| AbrahamParangi wrote:
| This is an overly strong claim for the paper (which is good!)
| backing it.
|
| If anyone is interested in the reader's digest version of the
| original paper, check out
| https://www.youtube.com/watch?v=LB4B5FYvtdI
| fouric wrote:
| > Predictive coding is the idea that BNNs generate a mental model
| of their environment and then transmit only the information that
| deviates from this model. Predictive coding considers error and
| surprise to be the same thing.
|
| This reminds me of a Slate Star Codex article on Friston[1].
|
| [1] https://slatestarcodex.com/2018/03/04/god-help-us-lets-
| try-t...
___________________________________________________________________
(page generated 2021-04-05 23:00 UTC)