[HN Gopher] Geoffrey Hinton publishes new deep learning algorithm
       ___________________________________________________________________
        
       Geoffrey Hinton publishes new deep learning algorithm
        
       Author : danboarder
       Score  : 285 points
       Date   : 2023-01-12 08:22 UTC (14 hours ago)
        
 (HTM) web link (www.infoq.com)
 (TXT) w3m dump (www.infoq.com)
        
       | sva_ wrote:
       | I found this paragraph from the paper very interesting:
       | 
       | > _7 The relevance of FF to analog hardware_
       | 
       | > _An energy efficient way to multiply an activity vector by a
       | weight matrix is to implement activities as voltages and weights
       | as conductances. Their products, per unit time, are charges which
       | add themselves. This seems a lot more sensible than driving
       | transistors at high power to model the individual bits in the
       | digital representation of a number and then performing O(n^2)
       | single bit operations to multiply two n-bit numbers together.
       | Unfortunately, it is difficult to implement the backpropagation
       | procedure in an equally efficient way, so people have resorted to
       | using A-to-D converters and digital computations for computing
       | gradients (Kendall et al., 2020). The use of two forward passes
       | instead of a forward and a backward pass should make these A-to-D
       | converters unnecessary._
       | 
       | It was my impression that it is difficult to properly isolate an
       | electronic system to use voltages in this way (hence computers
       | sort of "cut" voltages into bits 0/1 using a step function).
       | 
        | Have these limitations been overcome, or do they matter less
        | here, since neural networks can work with fuzzier data?
       | 
       | Interesting to imagine such a processor though.
        
         | Animats wrote:
         | There's been unhappiness in some quarters that back propagation
         | doesn't seem to be something that biology does. That may be
         | part of the motivation here.
        
         | btown wrote:
         | Photonic/optical neural networks are an interesting related
         | area of research, using light interference to implement
         | convolution and other operations without (I believe?) needing a
         | bitwise representation of intensity.
         | 
         | https://www.nature.com/articles/s41467-020-20719-7
         | 
         | https://opg.optica.org/optica/fulltext.cfm?uri=optica-5-7-86...
        
         | CuriouslyC wrote:
         | The small deltas resulting from electrical noise generally
         | aren't an issue for probabilistic computations. Interestingly,
         | people have quantized many large DL models down to 8/16 bits,
         | and accuracy reduction is often on the order of 2-5%.
         | Additionally, adding random noise to weights during training
         | tends to act as a form of regularization.
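          | 
          | For the noise-as-regularization part, a minimal PyTorch-style
          | sketch (the layer, data, and noise scale are all made up):
          | 
          |   import torch
          | 
          |   torch.manual_seed(0)
          |   w = torch.randn(4, 16, requires_grad=True)  # layer weights
          |   opt = torch.optim.SGD([w], lr=0.1)
          |   sigma = 0.01                                # noise std
          | 
          |   x = torch.randn(32, 16)
          |   y = torch.randn(32, 4)
          | 
          |   for _ in range(100):
          |       # forward through a noisy copy; gradients still reach w
          |       w_noisy = w + sigma * torch.randn_like(w)
          |       loss = torch.nn.functional.mse_loss(x @ w_noisy.t(), y)
          |       opt.zero_grad()
          |       loss.backward()
          |       opt.step()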
        
       | ipnon wrote:
       | It's incredible to think that dreams are just our brains
       | generating training data, and lack of sleep causes us to overfit
       | on our immediate surroundings.
        
         | weakfortress wrote:
         | [dead]
        
         | whimsicalism wrote:
         | We definitely do not know nearly enough to say anything like
         | that with confidence.
         | 
          | Most of the "training process" of our brain likely occurred
          | prior to our birth, in the evolutionarily optimized structure
          | of the brain.
        
           | tedsanders wrote:
           | Unlikely. The human genome comprises only billions of bits,
           | much of which is low-information repetition. The amount of
           | information sensed over a lifetime is vastly greater. To
           | sense less than a billion bits over a 30-year development
           | period would imply less than one bit per second. We clearly
           | perceive more than one bit per second. For this reason, it
           | seems likely that more information comes from learning post-
           | birth than is pre-conditioned by evolution pre-birth. (Though
           | of course post-birth learning cannot take place without the
           | fantastic foundation set by evolution.)
        
             | oneoff786 wrote:
             | > The human genome comprises only billions of bits, much of
             | which is low-information repetition.
             | 
              | We constantly find out that certain things are actually
              | really important even though we thought they were junk.
              | Recall that our best way to test genes is by knocking
              | them out one by one and trying to observe the effect.
             | 
              | The brain is composed of many extremely specialized
              | subsystems and formulas for generating knowledge. We
              | don't know English at birth, sure, but we do have a
              | language processing capability. The training baked into
              | the brain is a level of abstraction higher, establishing
              | frameworks to learn other things. It may not be as heavy
              | in stored data, but it's much harder to arrive at and is
              | the bulk of the learning process (learning to learn).
        
         | mlajtos wrote:
         | We tend to start hallucinating when we don't have enough sleep.
         | So generating training data is necessary, but way safer when
         | our muscles are turned off.
        
           | mherdeg wrote:
           | Thanks for this little comment thread folks!
           | 
           | This makes me cheerful because it suggests a way that
           | studying systems which appear intelligent might be able to
           | teach us more about how human intelligence works.
        
           | snowpid wrote:
            | In addition: in pre-electricity times, humans woke up after
            | about 4 hours of sleep, stayed awake for a while, and then
            | went back to sleep. My guess is that this sleep pattern is
            | better for learning.
        
             | davidgay wrote:
              | > humans woke up after about 4 hours of sleep, stayed
              | awake for a while, and then went back to sleep
              | 
              | The confusing thing with this claim is: what did people
              | actually do during this time, given only bad (and
              | expensive!) lighting?
        
       | goethes_kind wrote:
       | I skimmed through the paper and am a bit confused. There's only
       | one equation and I feel like he rushed to publish a shower
       | thought without even bothering to flesh it out mathematically.
       | 
        | So how do you optimize a layer? Do you still use gradient
        | descent? So you have a per-layer loss with a positive and
        | negative component and then do gradient descent on that?
       | 
       | So then what is the label for each layer? Do you use the same
       | label for each layer?
       | 
        | And what does he mean by the forward pass not being fully known?
        | I don't get this idea of a black box between layers. Why would
        | you want to do that?
        
         | civilized wrote:
         | > There's only one equation
         | 
         | Not accurate for the version another commenter linked:
         | https://www.cs.toronto.edu/~hinton/FFA13.pdf
         | 
         | I see four equations.
        
         | harveywi wrote:
         | Those details have to be omitted from manuscripts in order to
         | avoid having to cite the works of Jurgen Schmidhuber.
        
           | bitL wrote:
           | Jurgen did it all before in the 80s, however it was never
           | translated to English so Geoffrey could happily reinvent it.
        
             | godelski wrote:
             | Jurgen invented AGI in the early 90s but someone pressed
             | the red button on his website and it committed suicide.
        
           | manytree5 wrote:
           | Just curious, why would one want to avoid citing
           | Schmidhuber's work?
        
             | sva_ wrote:
             | It is a bit of a meme in AI research, as Schmidhuber often
             | claims that he hasn't received the citations that he thinks
             | he deserves.
             | 
             | https://www.urbandictionary.com/define.php?term=schmidhuber
        
               | reil_convnet wrote:
                | It's not just a meme in this case btw :D
                | https://twitter.com/SchmidhuberAI/status/1605246688939364352
        
         | duvenaud wrote:
         | Perhaps a better place to find algorithmic details is this
         | related paper, also with Hinton as a co-author, which
         | implements similar ideas in more standard networks:
         | 
         | Scaling Forward Gradient With Local Losses Mengye Ren, Simon
         | Kornblith, Renjie Liao, Geoffrey Hinton
         | https://arxiv.org/abs/2210.03310
         | 
          | and has code:
          | https://github.com/google-research/google-research/tree/mast...
        
         | bjourne wrote:
          | Probably because the idea is trivial in hindsight (always is),
          | so publishing fast is important. Afaict, the idea is to compute
          | the gradients layer by layer and apply them immediately,
          | without bothering to back-propagate from the outputs. So
          | in-between layers would learn what orientation vectors previous
          | layers emit for positive samples and would themselves emit
          | orientation vectors. Imagine a layer learning what regions of a
          | sphere's (3d) surface are good and outputting what regions of a
          | circle's (2d) perimeter are good. This is why he mentions the
          | need for normalizing vectors; otherwise layers would cheat and
          | just look at the vector's magnitude.
          | 
          | The idea is imo similar to how random word embeddings are
          | generated.
        
           | godelski wrote:
           | > because the idea is trivial in hindsight (always is) so
           | publishing fast is important.
           | 
           | Unfortunately I've also seen papers get rejected because
           | their idea was "trivial", yet no one had thought of it
           | before. Hinton has an edge here though.
        
           | naasking wrote:
           | > Afaict, the idea is to compute the gradients layer by layer
           | and applying them immediately without bothering to back-
           | propagate from the outputs.
           | 
            | I'm not sure where you get that impression. Forward-Forward
            | [1] seems to eschew gradients entirely:
            | 
            | > The Forward-Forward algorithm replaces the forward and
            | backward passes of backpropagation by two forward passes,
            | one with positive (i.e. real) data and the other with
            | negative data which could be generated by the network
            | itself.
           | 
           | [1] https://www.cs.toronto.edu/~hinton/FFA13.pdf
        
             | ekleraki wrote:
              | Let's suppose that you are correct; in which direction are
              | the weights updated, then?
             | 
             | The implementations of this compute gradients locally.
        
               | theGnuMe wrote:
               | It makes sense that all gradients are local. Does it make
               | sense to say that gradient propagation through the layers
               | is memoryless?
        
               | ekleraki wrote:
               | In my opinion, yes if and only if the update does not use
               | a stateful optimiser, and the computation is easy /
               | simple enough that the updated parameter value can be
               | computed immediately.
               | 
                | In linear layers, it is possible. Once you have computed
                | the gradient with respect to the i-th output, which is a
                | scalar, you scale the input by that value and add it to
                | that unit's parameters.
                | 
                | This is a simple FMA op: a=fma(eta*z, x, a), with z the
                | gradient of the output, x the input, a the parameters,
                | and eta the learning rate. This computes a = a + eta*z*x
                | in place.
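                | 
                | As a toy illustration (numpy; the shapes and values
                | are made up):
                | 
                |   import numpy as np
                | 
                |   eta = 0.01               # learning rate
                |   x = np.random.randn(8)   # layer input
                |   a = np.zeros(8)          # one unit's parameters
                |   z = 0.5                  # local output gradient
                | 
                |   # in-place FMA-style update: a <- a + eta*z*x
                |   a += eta * z * x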
        
             | bjourne wrote:
              | It eschews back-propagation but not gradient calculation.
              | You still have to nudge the weights so that activations go
              | up for positive examples and down for negative ones.
              | Positive examples should give a vector with a large
              | magnitude and negative ones a small magnitude.
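              | 
              | Something in that spirit, as a rough sketch (PyTorch; the
              | squared-activation "goodness" is from the paper, but the
              | exact loss form and sizes here are just my guess):
              | 
              |   import torch
              |   import torch.nn.functional as F
              | 
              |   layer = torch.nn.Linear(784, 500)
              |   theta = 2.0   # goodness threshold
              | 
              |   def layer_loss(x_pos, x_neg):
              |       # goodness = mean squared activation per sample
              |       g_pos = layer(x_pos).relu().pow(2).mean(dim=1)
              |       g_neg = layer(x_neg).relu().pow(2).mean(dim=1)
              |       # push positive goodness above theta, negative below
              |       return (F.softplus(theta - g_pos) +
              |               F.softplus(g_neg - theta)).mean()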
        
       | PartiallyTyped wrote:
        | It seems that the point is that the objective function is applied
        | layerwise; it still computes gradients to get the update
        | direction, it's just that gradients don't propagate to previous
        | layers (detached tensors).
       | 
       | As far as I can tell, this is almost the same as stacking
       | multiple layers of ensembles, except worse as each ensemble is
       | trained while previous ensembles are learning. This is causing
       | context drift.
       | 
       | To deal with the context drift, Hinton proposes to normalise the
       | output.
       | 
       | This isn't anything new or novel. Expressing "ThIs LoOkS sImIlAr
       | To HoW cOgNiTiOn WoRkS" to make it sound impressive doesn't make
       | it impressive or good by any stretch of the imagination.
       | 
       | Hinton just took something that existed for a long time, made it
       | worse, gave it a different name and wrapped it in a paper under
       | his name.
       | 
       | With every paper I am more convinced that the Laureates don't
       | deserve the award.
       | 
       | Sorry, this "paper" smells from a mile away, and the fact that it
       | is upvoted as much shows that people will upvote anything if they
       | see a pretty name attached.
       | 
       | Edit:
       | 
       | Due to the apparent controversy of my criticism, I can't respond
       | with a reply, so here is my response to the comment below asking
       | what exactly makes this worse.
       | 
       | > As far as I can tell, this is almost the same as stacking
       | multiple layers of ensembles
       | 
       | It isn't new. Ensembling is used and has been used for a long
       | time. All kaggle competitions are won through ensembles and even
       | ensembles of ensembles. It is a well studied field.
       | 
       | > except worse as each ensemble is trained while previous
       | ensembles are learning.
       | 
        | Ensembles exhibit certain properties, but only if they are
        | trained independently of each other. This is well studied; you
        | can read more about it in Bishop's Pattern Recognition book.
       | 
       | > This is causing context drift.
       | 
       | Context drift occurs when a distribution changes over time. This
       | changes the loss landscape which means the global minima change /
       | move.
       | 
       | > To deal with the context drift, Hinton proposes to normalise
       | the output.
       | 
        | So not only is what Hinton built a variation of something that
        | already existed, he made it worse by training the models
        | simultaneously, and then, to handle the fact that it is worse, he
        | adds additional computations to deal with that issue.
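        | 
        | To make the layerwise/detached point concrete, the training loop
        | is roughly this (a PyTorch sketch with made-up sizes and a
        | stand-in per-layer loss, not Hinton's exact recipe):
        | 
        |   import torch
        |   import torch.nn.functional as F
        | 
        |   layers = [torch.nn.Linear(784, 500), torch.nn.Linear(500, 500)]
        |   opts = [torch.optim.SGD(l.parameters(), lr=0.03) for l in layers]
        | 
        |   def train_step(x, local_loss):
        |       h = x
        |       for layer, opt in zip(layers, opts):
        |           out = layer(h).relu()
        |           loss = local_loss(out)   # per-layer objective
        |           opt.zero_grad()
        |           loss.backward()
        |           opt.step()
        |           # normalize and detach: no gradient ever flows back
        |           # into this layer from the layers above it
        |           h = F.normalize(out.detach())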
        
         | goethes_kind wrote:
          | Since you seem to understand what he is saying, can you explain
          | to me what the per-layer objective function looks like?
          | 
          | I don't get what he means by inserting the label into the input
          | and what labels he is using per layer.
        
           | bjourne wrote:
            | You train the network to detect correlations between the
            | values of the first ten pixels and the rest of the image.
            | Imagine you have a bunch of images of digits. For images
            | of the digit three you set the third pixel to white, for
            | images of the digit four you set the fourth pixel to
            | white, and so on (actually, with zero-indexing it's the
            | fourth and fifth pixel for digits three and four, but
            | whatever). The other nine pixels among the first ten you
            | set to black. These are positive samples, and training the
            | network on them will make it output a big number when it
            | encounters them. Then you swap the pixels, so that the
            | images of the digit three have the fourth pixel set to
            | white and the images of the digit four have the third
            | pixel set to white. These are negative samples and they
            | cause the network to output a small number. Thus, the only
            | difference between positive and negative samples is the
            | location of the white pixel. So for an image you want to
            | classify, you run it through the network ten times,
            | shifting the location of the white pixel each time. The
            | location for which the network outputs the biggest number
            | is the predicted class.
           | 
           | Obviously, this method is problematic if you have thousands
           | of labels or if your network is not a classifier.
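            | 
            | Roughly, as a sketch (PyTorch, with made-up helper names; the
            | real recipe is spelled out in the paper):
            | 
            |   import torch
            | 
            |   def overlay_label(x, label, num_classes=10):
            |       # x: flattened images (B, 784); label: int in [0, 10)
            |       x = x.clone()
            |       x[:, :num_classes] = 0.0
            |       x[:, label] = 1.0      # the one "white" pixel
            |       return x
            | 
            |   def predict(goodness, x):
            |       # goodness(x) -> per-sample score; try all ten labels
            |       scores = torch.stack([goodness(overlay_label(x, c))
            |                             for c in range(10)], dim=1)
            |       return scores.argmax(dim=1)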
        
           | LudwigNagasena wrote:
           | A model is a function F that minimizes error in y_i = F(X_i)
           | + error. Inserting a label simply means a function
            | F(X_i, y_j). Then you optimize it in some way to separate
            | true labels from false ones, e.g. F(X_i, y_j) = (y_i==y_j) -
            | (y_i!=y_j) + error.
        
         | LudwigNagasena wrote:
          | That's the same impression I had. I was afraid I was missing
          | something or not seeing the bigger picture. I am glad I am not
          | the only one who feels that way.
        
         | hgsgm wrote:
          | This comment is a lot of words to say "I don't like it" without
         | giving any reason to believe you.
         | 
         | If it's not new or novel, why aren't people using it? If it's
         | bad, what's wrong with it?
        
           | LudwigNagasena wrote:
            | It performs worse than backprop.
        
             | qorrect wrote:
              | Is it slower and less accurate? Or just slower?
        
               | LudwigNagasena wrote:
               | It is far less accurate compared to SOTA models. The
               | paper says it should train faster, but it doesn't provide
               | any metrics; so it's hard to make any "pound for pound"
               | comparisons.
        
             | UncleEntity wrote:
             | So the initial "hey, this might be a good idea"
             | implementation performs _slightly_ worse than something
             | that has had literally billions of dollars thrown at it?
        
               | ekleraki wrote:
                | It is slower than the same plain backprop that we have
                | used for decades now.
               | 
               | No comparisons to AdamW were made.
               | 
               | In fact, this algorithm uses backprop at its core, but
               | propagating through 0 layers.
        
               | LudwigNagasena wrote:
               | The question was "If it's not new or novel, why aren't
               | people using it?". For example, take a look at this
               | paper: https://arxiv.org/abs/1905.11786. It was published
               | 3 years ago, it also does parallel layer-wise
               | optimization, it even talks about being inspired by
               | biology, though the objective function is different from
               | Hinton's. Why aren't people using it? Because it performs
               | worse. It is that simple. Is it an interesting area to
               | explore? Probably. There are millions of interesting
               | areas to explore. It doesn't mean it is worth using, at
               | least yet.
        
               | oneoff786 wrote:
               | Very significantly worse.
        
         | [deleted]
        
       | penciltwirler wrote:
       | This is old news already.
        
       | cschmid wrote:
       | The article links to an old draft of the paper (it seems that the
       | results in 4.1 couldn't be replicated). The arxiv has a more
       | recent one: https://arxiv.org/abs/2212.13345
        
       | keepquestioning wrote:
       | Is this a game changer?
        
       | rytill wrote:
       | Paper: https://www.cs.toronto.edu/~hinton/FFA13.pdf
        
       | BenoitP wrote:
       | Not a deep learning expert, but: it seems that without
       | backpropagation for model updates, the communication costs should
       | be lower. And that will enable models that are easier to
       | parallelize?
       | 
       | Nvidia isn't creating new versions of its NVLink/NVSwitch
       | products just for the sake of it, better communication must be a
       | key enabler.
       | 
        | Can someone with deeper knowledge comment on this? Is
       | communication a bottleneck, and will this algorithm uncover a new
       | design space for NNs?
        
         | PartiallyTyped wrote:
         | > will this algorithm uncover a new design space for NNs?
         | 
         | No.
         | 
         | Hinton "discovered" stacking ensembles and gave it a new name,
         | fancy analogies to biological brains and then made it worse.
         | 
         | The gist of this is that you can select a computational unit,
         | be it a linear layer, or a collection of layers, compute the
         | derivative of the output with respect to the parameters, and
         | update them.
         | 
         | Each computational unit is independent, meaning that you don't
         | calculate gradients going outside of it.
         | 
         | This is the same as training a bunch of networks, computing
         | predictions, and then using another layer to combine the
         | predictions. This is called stacking, and the networks are
         | called an "ensemble". You can do this multiple times and have N
         | levels of meta estimators.
         | 
         | Instead of fitting the ensemble and then the meta estimator,
         | Hinton proposes training both simultaneously but without
         | allowing gradients to flow through.
         | 
          | That is stupid, because if you don't allow gradients to flow
          | through, you will see context drift as the data distribution
          | changes. Hinton observed this context drift and, to deal with
          | it, proposed normalizing the data.
         | 
         | On one extreme, you can use individual linear units as the
         | models, and on the other extreme, you can combine all units
         | into a single neural network and treat that as a module.
         | 
          | So no, this does not open any new design space; it's an old
          | idea, worsened, and wrapped in fancy words and post-facto
          | reasoning.
          | 
          | If you are curious how a linear layer is an ensemble, observe
          | that each output unit's weight vector is its own linear
          | estimator, making the linear mapping an ensemble of estimators.
        
           | sdwr wrote:
            | That's a lot less cheaty, biologically speaking, than full
           | backprop. This Hinton guy sounds like he knows what he's
           | talking about.
           | 
           | "Context drift as data distribution changes" sounds a hell of
           | a lot like real life to me.
           | 
           | Normalized = hedonic treadmill on long view
           | 
           | At the micro scale, data that overflows the normalization is
           | stored in emotional state, creating an orthogonal source of
           | truth that makes up for the lack of full connected learning.
        
           | dimatura wrote:
           | Yeah, no. Reading the paper I don't really see anything but a
           | superficial resemblance to stacking. Hinton was active back
           | when Wolpert introduced stacking and I'm fairly sure he is
           | aware of it. If anything it much more closely resembles his
           | own prior work in Boltzmann machines, unsurprisingly (and
           | which he cites), or even his prior work on capsules. I don't
           | know if this will really pan out into anything that different
           | or useful for the field, but it's unfair and inaccurate to
           | dismiss it as derivative of stacking.
        
             | ekleraki wrote:
              | A single linear layer is, for all intents and purposes,
              | equivalent to running an ensemble of linear estimators. By
              | disallowing gradients to flow between two layers A, B when
              | computing (B . f . A)(x), with f being a nonlinearity, the
              | second layer is an ensemble of linear estimators of the
              | outputs of the first, and for all intents and purposes the
              | output of (f . A)(x) is just preprocessing for B.
             | 
             | Since gradients don't flow from B to A in (B.f.A)(x), A is
             | trained independently of B, meaning that the training
             | distribution of B changes without B influencing it, i.e.
             | context drift. B doesn't know the difference, and B doesn't
             | influence it.
             | 
              | For all intents and purposes, you could compute all the
              | outputs while A is training (equivalently, train A to
              | completion first), then feed them into B, and B would
              | still compute the same outputs and derivatives as it did
              | before.
             | 
             | To deal with context drift, Hinton proposes normalizing the
             | data, so the distribution does not change significantly.
             | 
             | Whatever he proposed is not "backprop-free" either. It
             | still involves backprop, but the number of layers gradients
             | flow through is 1, the layer itself.
             | 
              | The argument that you can still train through
              | non-differentiable operations is not particularly
              | convincing either; the reparameterization trick shows that
              | it is trivial to pass gradients through non-differentiable
              | operations if we are smart about it.
              | 
              | Given a non-differentiable operator Z: R^N -> R^N, let A,
              | B, C be R^N -> R^N linear layers; then B(Z(C(x)) *
              | A(C(x))) allows gradients to flow through B and A all the
              | way to C. The output of Z is, for all intents and
              | purposes, one side of a Hadamard product with (A . C)(x)
              | that is constructed at runtime and might as well be part
              | of the input.
              | 
              | You can even run Z(C(x)) through a neural network, learn
              | how to transform it, and still provide useful and
              | informative gradients back to C(x) via (A . C).
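              | 
              | A toy version of that construction (PyTorch; Z here is
              | just a stand-in non-differentiable op):
              | 
              |   import torch
              | 
              |   C = torch.nn.Linear(8, 8)
              |   A = torch.nn.Linear(8, 8)
              |   B = torch.nn.Linear(8, 8)
              | 
              |   def Z(v):                  # non-differentiable op
              |       return (v > 0).float()
              | 
              |   x = torch.randn(4, 8)
              |   h = C(x)
              |   # Z's output acts as a runtime-constructed constant;
              |   # gradients still reach A, B and C through A(h)
              |   out = B(Z(h).detach() * A(h))
              |   out.sum().backward()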
        
               | dimatura wrote:
               | I'm not sure what the main point is here. The paper is
               | definitely sketchy on details, and the main idea is
               | definitely simple enough to resemble a lot of other work.
               | I wouldn't be surprised if someone (maybe a certain Swiss
               | researcher) comes out and says, actually, this is the
               | same as this other paper from the early 90s. If you
               | squint hard enough a lot of ideas (especially simple
               | ones) can be seen as being the same as other, older
               | ideas. I'm not too interested in splitting those hairs,
                | really. I'm more curious about whether this eventually
                | leads to something that sets it apart from the SOTA in
                | some interesting way.
        
               | PartiallyTyped wrote:
                | My claim is that this work is simply worse ensembles
                | wrapped in biologically inspired claims, and that the
                | arguments the author makes in its support, compared to
                | other approaches, are simply not sound.
               | 
               | By looking at it through that perspective, the issues
               | with the approach become evident, and are fundamental in
               | my opinion.
        
           | chamwislothe2nd wrote:
            | You're accusing one of the founders of modern AI of either
            | being a fraud or incompetent. At best that seems short
            | sighted, no?
        
             | PartiallyTyped wrote:
             | I am neither the first, nor the last who believe that the
             | Laureates have not done their due diligence properly with
             | respect to citing sources.
             | 
             | I could name many other people who have actually been more
             | influential in the field.
        
             | nayroclade wrote:
             | You're obviously new to Hacker News :-D
        
               | chamwislothe2nd wrote:
               | Unfortunately, I've been here for a decade in one form or
               | another. Every now and then someone writes something so
               | pompous that I just can't help myself but post. Back to
               | lurking now. Cheers!
        
               | PartiallyTyped wrote:
               | If we can't publicly scrutinize people who have great
               | sway in the industry, what does that say about us as a
               | research community?
               | 
                | The fact that I argued why I found it bogus based on
                | well-established principles, and I get shit on by people
                | who have provided nothing to this conversation except
                | suppressing criticism or throwing ad hominems, should
                | tell you all you need to know about the quality of
                | discourse.
               | 
               | Dismissing criticism, not by arguments, but by the mere
               | name of the person does a disservice to everyone.
               | 
               | If the research can't stand on its own, independent of
               | the author, then it is not good research.
        
               | chamwislothe2nd wrote:
               | We are all looking forward to your research paper that
               | disproves his claims. Or you know, any proof.
        
               | PartiallyTyped wrote:
               | I argued for it and all I got was downvoted without
               | criticism of the substance of my arguments, only ad-
               | hominems and fallacies.
               | 
               | If you can point to _fundamental_ criticism of my
               | arguments, and not fallacies or attacks, I'd be more than
               | happy to discuss them.
        
               | chamwislothe2nd wrote:
                | The hard dismissal with 'No' is likely why you got
                | downvoted. I am not able to do that.
               | 
               | With that kind of tonal promise, especially considering
               | the source you are dismissing outright is important in
               | their field, you have to show, not just tell.
               | 
               | If you just left that No out, and gave room for the
               | chance that you are wrong, people wouldn't downvote,
               | they'd upvote. People like to hear smart arguments. No
               | one wants to hear outward dismissal. Especially of known
               | experts.
        
           | hgsgm wrote:
           | [flagged]
        
             | PartiallyTyped wrote:
             | Ad-hominems are not a particularly nice way to argue about
             | correctness of a claim.
        
               | NiloCK wrote:
                | Maybe you know something I don't, but I believe the
               | comment you are replying to was a compliment and not a
               | sarcastic dig.
        
               | PartiallyTyped wrote:
               | This person has made a similar remark and purposefully
               | looked into my post history - which doesn't make any
               | claims about my knowledge or skills - lied about it, and
               | made a sarcastic remark, see "everything".
        
             | randomdata wrote:
             | A lot can change in 60 days.
        
               | ekleraki wrote:
               | They do have a master's degree according to the post.
               | 
               | The claims made are not that deep for researchers.
        
               | belter wrote:
               | The use of the word masters is now considered not cool
               | according to Stanford... :-)
        
               | PartiallyTyped wrote:
               | It's not like I have been doing many things outside
               | reading papers and books over the past few years... The
               | post that person used as an ad-hominem even says so.
        
         | [deleted]
        
           | [deleted]
        
         | jerpint wrote:
         | It's more that without backpropagation, you no longer need to
         | store your forward activations across many layers to compute
         | the backwards pass which usually is dependent on a forward
         | pass. When a network is hundreds of layers, and batches are
         | very large, the forwards and backwards accumulations add up in
         | terms of memory required.
         | 
          | Communication across GPUs doesn't solve this, but instead
          | allows you to have either many models running in parallel on
          | different GPUs to increase batch size, or to share many layers
          | across GPUs to increase model size. Quick communication is
          | critical to keep training times from becoming astronomical.
        
       | rsfern wrote:
       | Discussion last month when the preprint was released:
       | https://news.ycombinator.com/item?id=33823170
        
         | [deleted]
        
       | Kalanos wrote:
       | What exactly is the negative data? Seems like it's just scrambled
       | nonsense (aka what the truth is not)
        
         | abraxas wrote:
         | I think it is scrambled nonsense but it's scrambled in a way
         | that still makes it look like a plausible sample. I remember
         | watching a video of Hinton saying that just using white noise
         | or similarly randomized data does not work but I'm now
         | forgetting why.
        
       | maurits wrote:
       | Deep dive tutorial for learning in a forward pass [1]
       | 
       | [1] https://amassivek.github.io/sigprop
        
         | cochne wrote:
         | > There are many choices for a loss L (e.g. gradient, Hebbian)
         | and optimizer (e.g. SGD, Momentum, ADAM). The output(), y, is
         | detailed in step 4 below.
         | 
         | I don't get it, don't all of those optimizers work via
         | backprop?
        
           | mkaic wrote:
           | The optimizers take parameters and their gradients as inputs
           | and apply update rules to them, but the gradients you supply
            | can come from anywhere. Backprop is the most common way to
           | assign gradients to parameters, but other methods can work
           | too--as long as the optimizer is getting both parameters and
           | gradients, it doesn't care where they come from.
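            | 
            | For example (PyTorch; the gradients here are random, just to
            | show the mechanics):
            | 
            |   import torch
            | 
            |   layer = torch.nn.Linear(4, 4)
            |   opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
            | 
            |   opt.zero_grad()
            |   for p in layer.parameters():
            |       # gradients can come from anywhere, not just backprop
            |       p.grad = torch.randn_like(p)
            |   opt.step()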
        
       | singularity2001 wrote:
       | The paragraph about Mortal Computation is worth repeating:
       | 
        | If these FF networks can be proven to scale, or made to scale
        | similarly to BP networks, this would enable making hardware
        | several orders of magnitude more efficient, at the price of
        | losing the ability to make exact copies of models onto other
        | computers. (The loss of reproducibility sits well with the
        | tradition of scientific papers anyway /s ;)
       | 
        | 2.) How does this paper relate to Hinton's feedback alignment from
       | 5 years ago? I remember it was feedback without derivatives. What
       | are the key new ideas? To adjust the output of each individual
       | layer to be big for positive cases and small for negative cases
       | without any feedback? Have these approaches been combined?
        
       | sheerun wrote:
        | Isn't this similar to how GAN networks learn? Edit: Yes, there is
        | a small chapter in the paper comparing it to GANs.
        
         | tjungblut wrote:
         | Interesting, my first take was that it's like contrastive
         | divergence in Restricted Boltzmann Machines (RBMs). There's
         | also a chapter for that.
        
       | parpfish wrote:
       | Having read through the forward-forward paper, it feels like it's
       | Oja's rule adapted for supervised learning but I can't really
       | articulate why...
        
       | bayesian_horse wrote:
       | I remember a long time ago there was a collection of Geoffrey
       | Hinton facts. Like "Geoffrey Hinton once wrote a neural network
       | that beat Chuck Norris."
        
         | mousetree wrote:
         | I saw the same for Jeff Dean
        
       | lvl102 wrote:
        | Quantum computing would absolutely change everything in the
        | DL/ML space.
        
         | Kalanos wrote:
         | Most problems are less computationally intensive than one would
         | think
        
         | sillysaurusx wrote:
         | It turns out there's a whole subfield for quantum ML. I don't
         | know much about it, but it's neat that there's any
         | applicability. It's not obvious that there was any connection.
        
       | kumarvvr wrote:
        | This is an interesting approach, and I have read that it is
        | closer to how our brains work.
        | 
        | We extract learning while we are taking in the data, and there
        | seems to be no mechanism in the brain that favors a backprop-like
        | learning process.
        
         | bayesian_horse wrote:
         | Fact: Geoffrey Hinton has discovered how the brain works. Every
         | few years actually.
        
           | parpfish wrote:
           | yeah, whatever happened to capsule networks?
        
             | NHQ wrote:
             | Capsule networks were conceptually an early attempt at
             | transformers.
        
           | mlajtos wrote:
           | Once a year for the last 30 years.
        
       | mdp2021 wrote:
       | Direct link to an implementation on GitHub:
       | https://github.com/nebuly-ai/nebullvm/tree/main/apps/acceler...
       | 
       | --
       | 
        | The popular-press title is almost an understatement: the
        | Forward-Forward algorithm is an alternative to backpropagation.
       | 
        | Edit: sorry, the previous formulation of the above in this post,
        | regarding the advantages, was due to a misreading. Hinton writes:
       | 
       | > _The Forward-Forward algorithm (FF) is comparable in speed to
       | backpropagation but has the advantage that it can be used when
       | the precise details of the forward computation are unknown. It
       | also has the advantage that it can learn while pipelining
       | sequential data through a neural network without ever storing the
       | neural activities or stopping to propagate error
       | derivatives....The two areas in which the forward-forward
       | algorithm may be superior to backpropagation are as a model of
       | learning in cortex and as a way of making use of very low-power
       | analog hardware without resorting to reinforcement learning_
        
         | [deleted]
        
       | belter wrote:
       | @dang
       | 
        | Meta question on the HN implementation: why does submitting a
        | previously submitted resource sometimes link automatically to the
        | previous discussion, while other times it is considered a new
        | submission?
        
         | drdeca wrote:
         | I believe one factor is the amount of time between the
         | submissions.
        
         | gregschlom wrote:
         | As far as I know it's a simple string match on the url. If the
         | url is different (for example a new anchor tag is added) then
         | it's considered a new submission.
        
           | danuker wrote:
           | If you click on "past" under this submission, you see two
           | identical URLs:
           | 
            | https://hn.algolia.com/?query=Geoffrey%20Hinton%20publishes%...
        
             | mdp2021 wrote:
             | Which is odd, because I checked in the minutes following
             | the submission and I remember "[past]" did not return
             | anything.
        
       | jasonjmcghee wrote:
       | Maybe I'm missing something, but from the paper
       | https://www.cs.toronto.edu/~hinton/FFA13.pdf, they use non-conv
       | nets on CIFAR-10 for back prop, resulting in 63% accuracy. And FF
       | achieves 59% accuracy (at best).
       | 
       | Those are relatively close figures, but good accuracy on CIFAR-10
       | is 99%+ and getting ~94% is trivial.
       | 
       | So, if an improper architecture for a problem is used and the
       | accuracy is poor, how compelling is using another optimization
       | approach and achieving similar accuracy?
       | 
        | It's a unique and interesting approach, but the article
        | specifically mentions it gets accuracy similar to backprop; if
        | this is the experiment that claim is based on, that claim loses
        | some credibility in my eyes.
        
         | habitue wrote:
         | I think you have to set expectations based on how much of the
         | ground you're ripping up. If you're adding some layers or some
         | little tweak to an existing architecture, then yeah, going
         | backwards on cifar-10 is a failure.
         | 
         | If, however, _you are ripping out backpropagation_ like this
         | paper is, then you get a big pass. This is not the new paradigm
          | yet, but it's promising that it doesn't just completely fail.
        
           | jxcole wrote:
           | This seems to be Hinton's MO though. A few years back he
           | ripped out convolutions for capsules and while he claims it's
           | better and some people might claim it "has potential", no one
           | really uses it for much because, as with this, the actual
           | numerical performance is worse on the tests people care about
           | (e.g. imagenet accuracy).
           | 
           | https://en.wikipedia.org/wiki/Capsule_neural_network
        
             | habitue wrote:
              | I mean, yes, this should be the MO of a tenured professor:
              | making large speculative bets, not hyper-optimizing
              | benchmarks.
        
           | LudwigNagasena wrote:
           | There is no shortage of paradigms that rip out backprop and
           | deliver worse results.
        
             | tehsauce wrote:
             | This is so true! But we should keep trying :)
        
             | levesque wrote:
             | Also Hinton doesn't have the best track record with his
             | already forgotten/abandoned Capsule networks. I wonder
             | what's the next thing he's going to come up with? He gets a
             | pass because he is famous.
        
               | whimsicalism wrote:
               | I think it can be simultaneously true that things like
               | this should be tested with toy models we wouldn't expect
               | to do great on CIFAR and also that we shouldn't expect
               | exceptional results just because this person is already
               | famous.
        
         | whimsicalism wrote:
         | You have to start with toy models before scaling up.
        
           | electrograv wrote:
           | Achieving <80% on CIFAR10 in the year >2020 is an example of
           | a _failed_ toy model, _not_ a successful toy model.
           | 
           | Almost any ML algorithm can be thrown at CIFAR10 and achieve
           | ~60% accuracy; this ballpark of accuracy is really not
           | sufficient to demonstrate viability, no matter how
           | aesthetically interesting the approach might feel.
        
             | [deleted]
        
             | alper111 wrote:
             | Plain MLP acc. is 63% vs 59% with FF, not so bad? By the
             | same logic, MLP is a failed toy model.
        
             | hiddencost wrote:
             | Hinton is doing basic science, not ML, here. Given who he
             | is, trying to move the needle on traditional benchmarks
             | would be a waste of his time and skills.
             | 
             | If he invents the new back propagation, an army of grad
             | students can turn his ideas into the future. Like they've
             | done for the last 15 years.
             | 
             | He's posting incremental work towards rethinking the field.
             | It's pretty interesting stuff.
             | 
             | Edit: grammar
        
             | jasonjmcghee wrote:
             | I haven't seen this to be the case, fwiw. There was a paper
             | in 2016 that did this and most were in the ~40% range.
             | 
             | But "any ml algorithm" isn't the point. It's a new
             | optimization technique and should be applied to
             | models/architectures that make sense with the problems they
             | are being used on.
             | 
             | For example, they could have used a pretrained featurizer
             | and trained the two layer model on top of it, with both
             | back prop and FF and compared.
        
               | whimsicalism wrote:
               | > For example, they could have used a pretrained
               | featurizer and trained the two layer model on top of it,
               | with both back prop and FF and compared.
               | 
               | Making the assumption that weights/embeddings produced by
               | a backprop-trained network are equally intelligible to a
               | network also trained by backprop vs. one trained by this
               | alternative method.
        
               | jasonjmcghee wrote:
               | I have personally seen them used successfully with all
               | kinds of classic ml algorithms (enets, tree-based, etc)
               | that have nothing to do with back prop.
        
             | whimsicalism wrote:
             | Any ML algorithm that already has tooling written, CUDA
             | scripts, etc. to run it faster.
             | 
             | That said, I am also short-term bearish on backprop-free
             | methods (although potentially long-term bullish).
        
       | Coneylake wrote:
       | Is the derivative calculated by forward-forward as analytic as in
       | backpropagation?
        
       ___________________________________________________________________
       (page generated 2023-01-12 23:01 UTC)