[HN Gopher] Geoffrey Hinton publishes new deep learning algorithm
___________________________________________________________________
Geoffrey Hinton publishes new deep learning algorithm
Author : danboarder
Score : 285 points
Date : 2023-01-12 08:22 UTC (14 hours ago)
(HTM) web link (www.infoq.com)
(TXT) w3m dump (www.infoq.com)
| sva_ wrote:
| I found this paragraph from the paper very interesting:
|
| > _7 The relevance of FF to analog hardware_
|
| > _An energy efficient way to multiply an activity vector by a
| weight matrix is to implement activities as voltages and weights
| as conductances. Their products, per unit time, are charges which
| add themselves. This seems a lot more sensible than driving
| transistors at high power to model the individual bits in the
| digital representation of a number and then performing O(n^2)
| single bit operations to multiply two n-bit numbers together.
| Unfortunately, it is difficult to implement the backpropagation
| procedure in an equally efficient way, so people have resorted to
| using A-to-D converters and digital computations for computing
| gradients (Kendall et al., 2020). The use of two forward passes
| instead of a forward and a backward pass should make these A-to-D
| converters unnecessary._
|
| It was my impression that it is difficult to properly isolate an
| electronic system to use voltages in this way (hence computers
| sort of "cut" voltages into bits 0/1 using a step function).
|
| Have these limitations been overcome, or do they not matter as
| much, since neural networks can work with fuzzier data?
|
| Interesting to imagine such a processor though.
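|
| A toy numpy sketch of the scheme the quote describes (sizes and
| noise level made up): activities as voltages, weights as
| conductances, and the products accumulating as charge on each
| output wire:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     v = rng.uniform(0, 1, 16)        # activity vector (volts)
|     G = rng.uniform(0, 1, (8, 16))   # weights (conductances)
|     noise = rng.normal(0, 0.01, 8)   # analog imperfection
|     charge = G @ v + noise           # per-wire summed products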
| Animats wrote:
| There's been unhappiness in some quarters that back propagation
| doesn't seem to be something that biology does. That may be
| part of the motivation here.
| btown wrote:
| Photonic/optical neural networks are an interesting related
| area of research, using light interference to implement
| convolution and other operations without (I believe?) needing a
| bitwise representation of intensity.
|
| https://www.nature.com/articles/s41467-020-20719-7
|
| https://opg.optica.org/optica/fulltext.cfm?uri=optica-5-7-86...
| CuriouslyC wrote:
| The small deltas resulting from electrical noise generally
| aren't an issue for probabilistic computations. Interestingly,
| people have quantized many large DL models down to 8/16 bits,
| and accuracy reduction is often on the order of 2-5%.
| Additionally, adding random noise to weights during training
| tends to act as a form of regularization.
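|
| A minimal PyTorch sketch of that noise-as-regularizer idea
| (layer sizes and sigma are made up for illustration):
|
|     import torch
|     import torch.nn as nn
|
|     layer = nn.Linear(32, 10)
|     opt = torch.optim.SGD(layer.parameters(), lr=0.01)
|
|     def noisy_step(x, y, loss_fn, sigma=1e-3):
|         with torch.no_grad():  # perturb weights in place
|             for p in layer.parameters():
|                 p.add_(sigma * torch.randn_like(p))
|         loss = loss_fn(layer(x), y)
|         opt.zero_grad(); loss.backward(); opt.step()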
| ipnon wrote:
| It's incredible to think that dreams are just our brains
| generating training data, and lack of sleep causes us to overfit
| on our immediate surroundings.
| weakfortress wrote:
| [dead]
| whimsicalism wrote:
| We definitely do not know nearly enough to say anything like
| that with confidence.
|
| Most of the "training process" of our brain likely occurred
| prior to our birth, in the evolutionarily optimized structure
| of the brain.
| tedsanders wrote:
| Unlikely. The human genome comprises only billions of bits,
| much of which is low-information repetition. The amount of
| information sensed over a lifetime is vastly greater. To
| sense less than a billion bits over a 30-year development
| period would imply less than one bit per second. We clearly
| perceive more than one bit per second. For this reason, it
| seems likely that more information comes from learning post-
| birth than is pre-conditioned by evolution pre-birth. (Though
| of course post-birth learning cannot take place without the
| fantastic foundation set by evolution.)
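|
| The arithmetic, for anyone checking (illustrative Python):
|
|     seconds = 30 * 365 * 24 * 3600  # ~9.5e8 seconds in 30 years
|     print(1e9 / seconds)            # ~1.06: a billion bits is
|                                     # about one bit per second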
| oneoff786 wrote:
| > The human genome comprises only billions of bits, much of
| which is low-information repetition.
|
| We constantly find out that certain things are actually
| really important even though we thought they were junk.
| Recall that our best way to test genes is by knocking them
| out one by one and trying to observe the effect.
|
| The brain comprises many extremely specialized subsystems
| and formulas for generating knowledge. We don't know English
| at birth, sure, but we do have a language-processing
| capability. The training baked into the brain is a level of
| abstraction higher, establishing frameworks to learn other
| things. It may not be as heavy in stored data, but it's much
| harder to arrive at and is the bulk of the learning process
| (learning to learn).
| mlajtos wrote:
| We tend to start hallucinating when we don't have enough sleep.
| So generating training data is necessary, but way safer when
| our muscles are turned off.
| mherdeg wrote:
| Thanks for this little comment thread folks!
|
| This makes me cheerful because it suggests a way that
| studying systems which appear intelligent might be able to
| teach us more about how human intelligence works.
| snowpid wrote:
| In addition: in pre-electricity times, humans woke up after
| 4 hours of sleep, stayed awake for some time and then
| continued to sleep. My guess is that this sleep pattern is
| better for learning.
| davidgay wrote:
| > In pre-electricity times, humans woke up after 4 hours of
| sleep, stayed awake for some time and then continued to sleep.
|
| The confusing thing about this claim is: what did people
| actually do during this time, given only bad (and expensive!)
| lighting?
| goethes_kind wrote:
| I skimmed through the paper and am a bit confused. There's only
| one equation and I feel like he rushed to publish a shower
| thought without even bothering to flesh it out mathematically.
|
| So how do you optimize a layer? Do you still use gradient
| descent? So do you have a per-layer loss with a positive and a
| negative component, and then do gradient descent?
|
| So then what is the label for each layer? Do you use the same
| label for each layer?
|
| And what does he mean by the forward pass not being fully known?
| I don't get this application of a black box between layers. Why
| would you want to do that?
| civilized wrote:
| > There's only one equation
|
| Not accurate for the version another commenter linked:
| https://www.cs.toronto.edu/~hinton/FFA13.pdf
|
| I see four equations.
| harveywi wrote:
| Those details have to be omitted from manuscripts in order to
| avoid having to cite the works of Jurgen Schmidhuber.
| bitL wrote:
| Jurgen did it all before in the 80s; however, it was never
| translated to English, so Geoffrey could happily reinvent it.
| godelski wrote:
| Jurgen invented AGI in the early 90s but someone pressed
| the red button on his website and it committed suicide.
| manytree5 wrote:
| Just curious, why would one want to avoid citing
| Schmidhuber's work?
| sva_ wrote:
| It is a bit of a meme in AI research, as Schmidhuber often
| claims that he hasn't received the citations that he thinks
| he deserves.
|
| https://www.urbandictionary.com/define.php?term=schmidhuber
| reil_convnet wrote:
| It's not just a meme in this case btw :D
| https://twitter.com/SchmidhuberAI/status/1605246688939364352
| duvenaud wrote:
| Perhaps a better place to find algorithmic details is this
| related paper, also with Hinton as a co-author, which
| implements similar ideas in more standard networks:
|
| Scaling Forward Gradient With Local Losses, by Mengye Ren,
| Simon Kornblith, Renjie Liao, Geoffrey Hinton:
| https://arxiv.org/abs/2210.03310
|
| and has code:
| https://github.com/google-research/google-research/tree/mast...
| bjourne wrote:
| Probably because the idea is trivial in hindsight (it always
| is), so publishing fast is important. Afaict, the idea is to
| compute the gradients layer by layer and apply them
| immediately, without bothering to back-propagate from the
| outputs. Intermediate layers would learn what orientation
| vectors previous layers emit for positive samples, and would
| themselves emit orientation vectors. Imagine a layer learning
| what regions of a sphere's (3d) surface are good and outputting
| what regions of a circle's (2d) perimeter are good. This is why
| he mentions the need for normalizing vectors: otherwise layers
| would cheat and just look at the vector's magnitude.
|
| The idea is imo similar to how random word embeddings are
| generated.
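|
| A minimal PyTorch sketch of that reading (goodness = sum of
| squared activities, threshold theta; all hyperparameters made
| up, so treat it as a sketch rather than the paper's code):
|
|     import torch
|     import torch.nn as nn
|
|     class FFLayer(nn.Module):
|         # Trains itself locally: high "goodness" for positive
|         # data, low for negative. Nothing flows to earlier layers.
|         def __init__(self, d_in, d_out, lr=0.03, theta=2.0):
|             super().__init__()
|             self.lin = nn.Linear(d_in, d_out)
|             self.opt = torch.optim.SGD(self.lin.parameters(), lr=lr)
|             self.theta = theta
|
|         def forward(self, x):
|             # Normalize so the next layer sees only orientation,
|             # not magnitude (the "cheating" fix mentioned above).
|             x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
|             return torch.relu(self.lin(x))
|
|         def train_step(self, x_pos, x_neg):
|             g_pos = self.forward(x_pos).pow(2).sum(1)
|             g_neg = self.forward(x_neg).pow(2).sum(1)
|             # Push positive goodness above theta, negative below.
|             loss = torch.nn.functional.softplus(torch.cat(
|                 [self.theta - g_pos, g_neg - self.theta])).mean()
|             self.opt.zero_grad(); loss.backward(); self.opt.step()
|             # Detached outputs: no backprop across layers.
|             return (self.forward(x_pos).detach(),
|                     self.forward(x_neg).detach())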
| godelski wrote:
| > because the idea is trivial in hindsight (always is) so
| publishing fast is important.
|
| Unfortunately I've also seen papers get rejected because
| their idea was "trivial", yet no one had thought of it
| before. Hinton has an edge here though.
| naasking wrote:
| > Afaict, the idea is to compute the gradients layer by layer
| and applying them immediately without bothering to back-
| propagate from the outputs.
|
| I'm not sure where you get that impression. Forward-Forward
| [1] seems to eschew gradients entirely:
|
| > _The Forward-Forward algorithm replaces the forward and
| backward passes of backpropagation by two forward passes, one
| with positive (i.e. real) data and the other with negative
| data which could be generated by the network itself._
|
| [1] https://www.cs.toronto.edu/~hinton/FFA13.pdf
| ekleraki wrote:
| Let's suppose that you are correct: which direction are the
| weights updated towards?
|
| The implementations of this compute gradients locally.
| theGnuMe wrote:
| It makes sense that all gradients are local. Does it make
| sense to say that gradient propagation through the layers
| is memoryless?
| ekleraki wrote:
| In my opinion, yes, if and only if the update does not use a
| stateful optimiser and the computation is simple enough that
| the updated parameter value can be computed immediately.
|
| In linear layers, it is possible. Once you have computed the
| gradient with respect to the ith output, a scalar, you scale
| the input by that value and add it to the parameters.
|
| This is a simple FMA op: a = fma(eta*z, x, a), with z the
| gradient of the output, x the input, a the parameters, and eta
| the learning rate. This computes a = a + eta*z*x in place.
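|
| In numpy, the described update for one output unit is just
| (illustrative names):
|
|     import numpy as np
|
|     def fma_update(a, x, z, eta=0.01):
|         # a <- a + eta*z*x: one fused multiply-add per weight,
|         # applied immediately, with no stored optimizer state.
|         a += eta * z * x
|         return a
|
|     a = np.zeros(4)
|     fma_update(a, np.array([1.0, 2.0, 3.0, 4.0]), z=0.5)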
| bjourne wrote:
| It eschews back-propagation but not gradient calculation.
| You still have to nudge the activations' weights upward for
| positive examples and downward for negative ones. Positive
| examples should give a long vector and negative ones a
| small vector.
| PartiallyTyped wrote:
| It seems that the point is that the objective function is
| applied layerwise; it still computes gradients to get the
| update direction, it's just that gradients don't propagate to
| previous layers (detached tensors).
|
| As far as I can tell, this is almost the same as stacking
| multiple layers of ensembles, except worse, as each ensemble is
| trained while the previous ensembles are still learning. This
| causes context drift.
|
| To deal with the context drift, Hinton proposes to normalise
| the output.
|
| This isn't anything new or novel. Expressing "ThIs LoOkS sImIlAr
| To HoW cOgNiTiOn WoRkS" to make it sound impressive doesn't make
| it impressive or good by any stretch of the imagination.
|
| Hinton just took something that existed for a long time, made it
| worse, gave it a different name and wrapped it in a paper under
| his name.
|
| With every paper I am more convinced that the Laureates don't
| deserve the award.
|
| Sorry, this "paper" smells from a mile away, and the fact that
| it is upvoted this much shows that people will upvote anything
| if they see a pretty name attached.
|
| Edit:
|
| Due to the apparent controversy of my criticism, I can't respond
| with a reply, so here is my response to the comment below asking
| what exactly makes this worse.
|
| > As far as I can tell, this is almost the same as stacking
| multiple layers of ensembles
|
| It isn't new. Ensembling is used and has been used for a long
| time. All kaggle competitions are won through ensembles and even
| ensembles of ensembles. It is a well studied field.
|
| > except worse, as each ensemble is trained while the previous
| ensembles are still learning.
|
| Ensembles exhibit certain properties, but only iff they are
| trained independently from each other. This is well studied, you
| can read more about it in Bishop's Pattern recognition book.
|
| > This causes context drift.
|
| Context drift occurs when a distribution changes over time. This
| changes the loss landscape which means the global minima change /
| move.
|
| > To deal with the context drift, Hinton proposes to normalise
| the output.
|
| So not only is what Hinton built a variation of something that
| already existed; he made it worse by training the models
| simultaneously, and then added additional computation to deal
| with the fact that it is worse.
| goethes_kind wrote:
| Since you seem to understand what he is saying, can you explain
| to me what the per-layer objective function looks like?
|
| I don't get what he means by inserting the label into the input,
| and what labels he is using per layer.
| bjourne wrote:
| You train the network to detect correlations between the
| values of the first ten pixels and the rest of the image.
| Imagine you have a bunch of images of digits. For images with
| digit three you set the third pixel to white, for images of
| the digit four, you set the fourth pixel to white, and so on
| (actually, zero-indexing so fourth and fifth pixel for digit
| three and four but whatever). The other nine pixels among the
| first ten you set to black. These are positive samples and
| training the network with them will make it output a big
| number when it encounters them. Then you swap the pixels so
| that the images with the digit three has the fourth pixel set
| to white and the images with the digit four has the third
| pixel set to white. These are negative samples and they cause
| the network to output a small number. Thus, the only
| difference between positive and negative samples is the
| location of the white pixel. So for an image you want to
| classify, you run it through the network ten times, each time
| shifting the location of the white pixel. The location for
| which the network outputs the biggest number is the predicted
| class.
|
| Obviously, this method is problematic if you have thousands
| of labels or if your network is not a classifier.
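|
| A sketch of that recipe in numpy (goodness() standing in for
| whatever the trained network computes; names hypothetical):
|
|     import numpy as np
|
|     def embed_label(img, label, k=10):
|         # One-hot label written into the first k pixels:
|         # white at the label's index, black elsewhere.
|         x = img.reshape(-1).astype(np.float32).copy()
|         x[:k] = 0.0
|         x[label] = 1.0
|         return x
|
|     def predict(goodness, img, k=10):
|         # Try all k labels; the biggest output wins.
|         return int(np.argmax([goodness(embed_label(img, c, k))
|                               for c in range(k)]))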
| LudwigNagasena wrote:
| A model is a function F that minimizes the error in y_i =
| F(X_i) + error. Inserting a label simply means a function
| F(X_i, y_j). Then you optimize it in some way to separate true
| labels from false labels, e.g. F(X_i, y_j) = (y_i == y_j) -
| (y_i != y_j) + error.
| LudwigNagasena wrote:
| That's the same impression I had. I was afraid I was not
| getting something or was missing the bigger picture. I am glad
| I am not the only one who feels that way.
| hgsgm wrote:
| This comment is a lot of words to say "I don't like it" without
| giving any reason to believe you.
|
| If it's not new or novel, why aren't people using it? If it's
| bad, what's wrong with it?
| LudwigNagasena wrote:
| It performs worse than backprop.
| qorrect wrote:
| Is it slower and less accurate? Or just slower?
| LudwigNagasena wrote:
| It is far less accurate than SOTA models. The paper says it
| should train faster, but it doesn't provide any metrics, so
| it's hard to make any "pound for pound" comparisons.
| UncleEntity wrote:
| So the initial "hey, this might be a good idea"
| implementation performs _slightly_ worse than something
| that has had literally billions of dollars thrown at it?
| ekleraki wrote:
| It is slower than the same backprop that we have used for
| decades now.
|
| No comparisons to AdamW were made.
|
| In fact, this algorithm uses backprop at its core, just
| propagated through zero layers.
| LudwigNagasena wrote:
| The question was "If it's not new or novel, why aren't
| people using it?". For example, take a look at this
| paper: https://arxiv.org/abs/1905.11786. It was published
| 3 years ago, it also does parallel layer-wise
| optimization, it even talks about being inspired by
| biology, though the objective function is different from
| Hinton's. Why aren't people using it? Because it performs
| worse. It is that simple. Is it an interesting area to
| explore? Probably. There are millions of interesting
| areas to explore. It doesn't mean it is worth using, at
| least yet.
| oneoff786 wrote:
| Very significantly worse.
| [deleted]
| penciltwirler wrote:
| This is old news already.
| cschmid wrote:
| The article links to an old draft of the paper (it seems that
| the results in section 4.1 couldn't be replicated). arXiv has a
| more recent one: https://arxiv.org/abs/2212.13345
| keepquestioning wrote:
| Is this a game changer?
| rytill wrote:
| Paper: https://www.cs.toronto.edu/~hinton/FFA13.pdf
| BenoitP wrote:
| Not a deep learning expert, but: it seems that without
| backpropagation for model updates, the communication costs
| should be lower, and that would enable models that are easier
| to parallelize?
|
| Nvidia isn't creating new versions of its NVLink/NVSwitch
| products just for the sake of it, better communication must be a
| key enabler.
|
| Can someone with deeper knowledge comment on this? Is
| communication a bottleneck, and will this algorithm uncover a
| new design space for NNs?
| PartiallyTyped wrote:
| > will this algorithm uncover a new design space for NNs?
|
| No.
|
| Hinton "discovered" stacking ensembles and gave it a new name,
| fancy analogies to biological brains and then made it worse.
|
| The gist of this is that you can select a computational unit,
| be it a linear layer, or a collection of layers, compute the
| derivative of the output with respect to the parameters, and
| update them.
|
| Each computational unit is independent, meaning that you don't
| calculate gradients going outside of it.
|
| This is the same as training a bunch of networks, computing
| predictions, and then using another layer to combine the
| predictions. This is called stacking, and the networks are
| called an "ensemble". You can do this multiple times and have N
| levels of meta estimators.
|
| Instead of fitting the ensemble and then the meta estimator,
| Hinton proposes training both simultaneously but without
| allowing gradients to flow through.
|
| That is stupid, because if you don't allow gradients to flow
| through, you will see context drift as the data distribution
| changes. Hinton observed this context drift; to deal with it,
| he proposed normalizing the data.
|
| On one extreme, you can use individual linear units as the
| models, and on the other extreme, you can combine all units
| into a single neural network and treat that as a module.
|
| So no, this does not open any new design space; it's an old
| idea, worsened, and wrapped in fancy words and post-facto
| reasoning.
|
| If you are curious how a linear layer is an ensemble, observe
| that each row of the weight matrix is its own linear estimator,
| making the linear mapping an ensemble of estimators.
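|
| A sketch of the layerwise, detached training described above
| (sizes and losses hypothetical):
|
|     import torch
|     import torch.nn as nn
|
|     A = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
|     B = nn.Linear(256, 10)
|     optA = torch.optim.SGD(A.parameters(), lr=0.01)
|     optB = torch.optim.SGD(B.parameters(), lr=0.01)
|
|     def step(x, y, local_loss_a, loss_b):
|         h = A(x)
|         la = local_loss_a(h)   # any layer-local objective
|         optA.zero_grad(); la.backward(); optA.step()
|         out = B(h.detach())    # B trains on A's drifting outputs
|         lb = loss_b(out, y)
|         optB.zero_grad(); lb.backward(); optB.step()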
| sdwr wrote:
| That's a lot less cheaty, biologically speaking, than full
| backprop. This Hinton guy sounds like he knows what he's
| talking about.
|
| "Context drift as data distribution changes" sounds a hell of
| a lot like real life to me.
|
| Normalized = hedonic treadmill on long view
|
| At the micro scale, data that overflows the normalization is
| stored in emotional state, creating an orthogonal source of
| truth that makes up for the lack of full connected learning.
| dimatura wrote:
| Yeah, no. Reading the paper I don't really see anything but a
| superficial resemblance to stacking. Hinton was active back
| when Wolpert introduced stacking and I'm fairly sure he is
| aware of it. If anything it much more closely resembles his
| own prior work in Boltzmann machines, unsurprisingly (and
| which he cites), or even his prior work on capsules. I don't
| know if this will really pan out into anything that different
| or useful for the field, but it's unfair and inaccurate to
| dismiss it as derivative of stacking.
| ekleraki wrote:
| A single linear layer is, for all intents and purposes,
| equivalent to running an ensemble of linear estimators. By
| disallowing gradients to flow between two layers A, B when
| computing (B . f . A)(x), with f being a non-linearity, the
| second layer is an ensemble of linear estimators of the
| outputs of the first, and for all intents and purposes the
| output of (f . A)(x) is just preprocessing for B.
|
| Since gradients don't flow from B to A in (B.f.A)(x), A is
| trained independently of B, meaning that the training
| distribution of B changes without B influencing it, i.e.
| context drift. B doesn't know the difference, and B doesn't
| influence it.
|
| For all intents and purposes, you could compute all of A's
| outputs as its training happens (i.e. train A to completion),
| then feed them into B, and B would still compute the same
| outputs and derivatives as it did before.
|
| To deal with context drift, Hinton proposes normalizing the
| data, so the distribution does not change significantly.
|
| Whatever he proposed is not "backprop-free" either. It
| still involves backprop, but the number of layers gradients
| flow through is 1, the layer itself.
|
| The argument that you can still train through non-
| differentiable operations is not particularly convincing
| either; the reparameterization trick shows that it is trivial
| to pass gradients through non-differentiable operations if we
| are smart about it.
|
| Given a non-differentiable operator Z: R^N -> R^N, let A, B, C
| be R^N -> R^N linear layers; then B(Z(C(x)) * A(C(x))) allows
| gradients to flow through B and A all the way to C. The output
| of Z is, for all intents and purposes, one factor of a
| runtime-constructed Hadamard product with (A . C)(x) and might
| as well be part of the input.
|
| You can even run Z(C(x)) through a neural network, learn how
| to transform it, and still provide useful and informative
| gradients back to C(x) via (A . C).
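|
| A sketch of that construction in PyTorch (names hypothetical;
| Z is deliberately gradient-free):
|
|     import torch
|     import torch.nn as nn
|
|     N = 8
|     A, B, C = nn.Linear(N, N), nn.Linear(N, N), nn.Linear(N, N)
|
|     def Z(v):
|         return torch.sign(v).detach()  # non-differentiable op
|
|     x = torch.randn(1, N)
|     h = C(x)
|     out = B(Z(h) * A(h))  # B(Z(C(x)) * A(C(x)))
|     out.sum().backward()
|     print(C.weight.grad.abs().sum() > 0)  # True: C still learns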
| dimatura wrote:
| I'm not sure what the main point is here. The paper is
| definitely sketchy on details, and the main idea is
| definitely simple enough to resemble a lot of other work.
| I wouldn't be surprised if someone (maybe a certain Swiss
| researcher) comes out and says, actually, this is the
| same as this other paper from the early 90s. If you
| squint hard enough a lot of ideas (especially simple
| ones) can be seen as being the same as other, older
| ideas. I'm not too interested in splitting those hairs,
| really. I'm more curious on whether this eventually leads
| to something that sets it apart from the SOTA in some
| interesting way.
| PartiallyTyped wrote:
| My claim is that this work is simply worse ensembles wrapped
| in biologically inspired claims, and that the arguments made
| in support of it by the author, compared to other approaches,
| are simply not sound.
|
| By looking at it through that perspective, the issues
| with the approach become evident, and are fundamental in
| my opinion.
| chamwislothe2nd wrote:
| You're accusing one of the founders of modern AI of either
| being a fraud or incompetent. At best that seems short-sighted,
| no?
| PartiallyTyped wrote:
| I am neither the first nor the last to believe that the
| Laureates have not done their due diligence properly with
| respect to citing sources.
|
| I could name many other people who have actually been more
| influential in the field.
| nayroclade wrote:
| You're obviously new to Hacker News :-D
| chamwislothe2nd wrote:
| Unfortunately, I've been here for a decade in one form or
| another. Every now and then someone writes something so
| pompous that I just can't help myself but post. Back to
| lurking now. Cheers!
| PartiallyTyped wrote:
| If we can't publicly scrutinize people who have great
| sway in the industry, what does that say about us as a
| research community?
|
| The fact that I argued why I found it bogus based on well-
| established principles, and I get shit on by people who have
| provided nothing to this conversation except suppressing
| criticism or throwing ad-hominems, should tell you all you
| need to know about the quality of discourse.
|
| Dismissing criticism, not by arguments, but by the mere
| name of the person does a disservice to everyone.
|
| If the research can't stand on its own, independent of
| the author, then it is not good research.
| chamwislothe2nd wrote:
| We are all looking forward to your research paper that
| disproves his claims. Or you know, any proof.
| PartiallyTyped wrote:
| I argued for it, and all I got was downvotes without criticism
| of the substance of my arguments, only ad-hominems and
| fallacies.
|
| If you can point to _fundamental_ criticism of my
| arguments, and not fallacies or attacks, I'd be more than
| happy to discuss them.
| chamwislothe2nd wrote:
| The hard dismissal with 'No' is likely why you got downvoted.
| I am not able to do that.
|
| With that kind of tonal promise, especially considering
| the source you are dismissing outright is important in
| their field, you have to show, not just tell.
|
| If you had just left that 'No' out and given room for the
| chance that you are wrong, people wouldn't downvote, they'd
| upvote. People like to hear smart arguments. No one wants to
| hear outright dismissal, especially of known experts.
| hgsgm wrote:
| [flagged]
| PartiallyTyped wrote:
| Ad-hominems are not a particularly nice way to argue about
| correctness of a claim.
| NiloCK wrote:
| Maybe you know something I don't, but I believe the comment
| you are replying to was a compliment and not a sarcastic dig.
| PartiallyTyped wrote:
| This person has made a similar remark and purposefully
| looked into my post history - which doesn't make any
| claims about my knowledge or skills - lied about it, and
| made a sarcastic remark, see "everything".
| randomdata wrote:
| A lot can change in 60 days.
| ekleraki wrote:
| They do have a master's degree according to the post.
|
| The claims made are not that deep for researchers.
| belter wrote:
| The use of the word masters is now considered not cool
| according to Stanford... :-)
| PartiallyTyped wrote:
| It's not like I have been doing many things outside
| reading papers and books over the past few years... The
| post that person used as an ad-hominem even says so.
| [deleted]
| [deleted]
| jerpint wrote:
| It's more that without backpropagation, you no longer need to
| store your forward activations across many layers to compute
| the backward pass, which usually depends on a forward pass.
| When a network is hundreds of layers deep and batches are very
| large, the forward and backward accumulations add up in terms
| of memory required.
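|
| Back-of-envelope, with made-up sizes:
|
|     layers, batch, hidden = 100, 1024, 4096
|     bytes_ = layers * batch * hidden * 4  # fp32 activations
|     print(bytes_ / 2**30)  # ~1.6 GiB held for the backward pass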
|
| Communication across GPUs doesn't solve this, but instead
| allows you to have either many models running in parallel on
| different GPUs to increase batch size, or to share many layers
| across GPUs to increase model size. Quick communication is
| critical to keep training times from becoming astronomical.
| rsfern wrote:
| Discussion last month when the preprint was released:
| https://news.ycombinator.com/item?id=33823170
| [deleted]
| Kalanos wrote:
| What exactly is the negative data? Seems like it's just scrambled
| nonsense (aka what the truth is not)
| abraxas wrote:
| I think it is scrambled nonsense, but it's scrambled in a way
| that still makes it look like a plausible sample. I remember
| watching a video of Hinton saying that just using white noise
| or similarly randomized data does not work, but I'm now
| forgetting why.
| maurits wrote:
| Deep dive tutorial for learning in a forward pass [1]
|
| [1] https://amassivek.github.io/sigprop
| cochne wrote:
| > There are many choices for a loss L (e.g. gradient, Hebbian)
| and optimizer (e.g. SGD, Momentum, ADAM). The output(), y, is
| detailed in step 4 below.
|
| I don't get it, don't all of those optimizers work via
| backprop?
| mkaic wrote:
| The optimizers take parameters and their gradients as inputs
| and apply update rules to them, but the gradients you supply
| can come from anywhere. Backprop is the most common way to
| assign gradients to parameters, but other methods can work
| too: as long as the optimizer is getting both parameters and
| gradients, it doesn't care where they come from.
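|
| A tiny PyTorch illustration: hand the optimizer gradients that
| did not come from backprop at all:
|
|     import torch
|
|     w = torch.zeros(3, requires_grad=True)
|     opt = torch.optim.Adam([w], lr=0.1)
|
|     w.grad = torch.tensor([1.0, -2.0, 0.5])  # from anywhere
|     opt.step()       # Adam applies its usual update rule
|     opt.zero_grad()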
| singularity2001 wrote:
| The paragraph about Mortal Computation is worth repeating:
|
| If these FF networks can be proven to scale, or made to scale,
| similarly to BP networks, this would enable making hardware
| several orders of magnitude more efficient, at the price of
| losing the ability to make exact copies of models onto other
| computers. (The loss of reproducibility sits well with the
| tradition of scientific papers anyway /s.)
|
| 2.) How does this paper relate to Hinton's feedback alignment
| from 5 years ago? I remember it was feedback without
| derivatives. What are the key new ideas? To adjust the output
| of each individual layer to be big for positive cases and
| small for negative cases, without any feedback? Have these
| approaches been combined?
| sheerun wrote:
| Isn't this similar to how GAN networks learn? Edit: Yes, there
| is a small chapter in the paper comparing it to GANs.
| tjungblut wrote:
| Interesting, my first take was that it's like contrastive
| divergence in Restricted Boltzmann Machines (RBMs). There's
| also a chapter for that.
| parpfish wrote:
| Having read through the forward-forward paper, it feels like it's
| Oja's rule adapted for supervised learning but I can't really
| articulate why...
| bayesian_horse wrote:
| I remember a long time ago there was a collection of Geoffrey
| Hinton facts. Like "Geoffrey Hinton once wrote a neural network
| that beat Chuck Norris."
| mousetree wrote:
| I saw the same for Jeff Dean
| lvl102 wrote:
| Quantum computing would absolutely change everything in the
| DL/ML space.
| Kalanos wrote:
| Most problems are less computationally intensive than one would
| think
| sillysaurusx wrote:
| It turns out there's a whole subfield for quantum ML. I don't
| know much about it, but it's neat that there's any
| applicability. It's not obvious that there was any connection.
| kumarvvr wrote:
| This is an interesting approach, and I have read that it is
| closer to how our brains work.
|
| We extract learning while we are taking in the data, and there
| seems to be no mechanism in the brain that favors a backprop-
| like learning process.
| bayesian_horse wrote:
| Fact: Geoffrey Hinton has discovered how the brain works. Every
| few years actually.
| parpfish wrote:
| yeah, whatever happened to capsule networks?
| NHQ wrote:
| Capsule networks were conceptually an early attempt at
| transformers.
| mlajtos wrote:
| Once a year for the last 30 years.
| mdp2021 wrote:
| Direct link to an implementation on GitHub:
| https://github.com/nebuly-ai/nebullvm/tree/main/apps/acceler...
|
| --
|
| The popular-science title is almost an understatement: the
| Forward-Forward algorithm is an alternative to backpropagation.
|
| Edit: sorry, the previous formulation of the above in this post,
| relative to the advantages, was due to a misreading. Hinton
| writes:
|
| > _The Forward-Forward algorithm (FF) is comparable in speed to
| backpropagation but has the advantage that it can be used when
| the precise details of the forward computation are unknown. It
| also has the advantage that it can learn while pipelining
| sequential data through a neural network without ever storing the
| neural activities or stopping to propagate error
| derivatives....The two areas in which the forward-forward
| algorithm may be superior to backpropagation are as a model of
| learning in cortex and as a way of making use of very low-power
| analog hardware without resorting to reinforcement learning_
| [deleted]
| belter wrote:
| @dang
|
| Meta question on HN implementation: Why does submitting a
| previously submitted resource sometimes link automatically to
| the previous discussion, while other times it is considered a
| new submission?
| drdeca wrote:
| I believe one factor is the amount of time between the
| submissions.
| gregschlom wrote:
| As far as I know it's a simple string match on the url. If the
| url is different (for example a new anchor tag is added) then
| it's considered a new submission.
| danuker wrote:
| If you click on "past" under this submission, you see two
| identical URLs:
|
| https://hn.algolia.com/?query=Geoffrey%20Hinton%20publishes%...
| mdp2021 wrote:
| Which is odd, because I checked in the minutes following
| the submission and I remember "[past]" did not return
| anything.
| jasonjmcghee wrote:
| Maybe I'm missing something, but from the paper
| https://www.cs.toronto.edu/~hinton/FFA13.pdf, they use non-conv
| nets on CIFAR-10 for backprop, resulting in 63% accuracy, and
| FF achieves 59% accuracy (at best).
|
| Those are relatively close figures, but good accuracy on CIFAR-10
| is 99%+ and getting ~94% is trivial.
|
| So, if an improper architecture for a problem is used and the
| accuracy is poor, how compelling is using another optimization
| approach and achieving similar accuracy?
|
| It's a unique and interesting approach, but the article
| specifically mentions it gets accuracy similar to backprop, but
| if this is the experiment that claim is based on, it loses some
| credibility in my eyes.
| habitue wrote:
| I think you have to set expectations based on how much of the
| ground you're ripping up. If you're adding some layers or some
| little tweak to an existing architecture, then yeah, going
| backwards on cifar-10 is a failure.
|
| If, however, _you are ripping out backpropagation_ like this
| paper is, then you get a big pass. This is not the new paradigm
| yet, but it's promising that it doesn't just completely fail.
| jxcole wrote:
| This seems to be Hinton's MO though. A few years back he
| ripped out convolutions for capsules, and while he claims it's
| better and some people might claim it "has potential", no one
| really uses it for much because, as with this, the actual
| numerical performance is worse on the tests people care about
| (e.g. ImageNet accuracy).
|
| https://en.wikipedia.org/wiki/Capsule_neural_network
| habitue wrote:
| I mean yes, this should be the MO of a tenured professor:
| making large speculative bets, not hyper-optimizing benchmarks.
| LudwigNagasena wrote:
| There is no shortage of paradigms that rip out backprop and
| deliver worse results.
| tehsauce wrote:
| This is so true! But we should keep trying :)
| levesque wrote:
| Also Hinton doesn't have the best track record with his
| already forgotten/abandoned Capsule networks. I wonder
| what's the next thing he's going to come up with? He gets a
| pass because he is famous.
| whimsicalism wrote:
| I think it can be simultaneously true that things like
| this should be tested with toy models we wouldn't expect
| to do great on CIFAR and also that we shouldn't expect
| exceptional results just because this person is already
| famous.
| whimsicalism wrote:
| You have to start with toy models before scaling up.
| electrograv wrote:
| Achieving <80% on CIFAR10 in the year >2020 is an example of
| a _failed_ toy model, _not_ a successful toy model.
|
| Almost any ML algorithm can be thrown at CIFAR10 and achieve
| ~60% accuracy; this ballpark of accuracy is really not
| sufficient to demonstrate viability, no matter how
| aesthetically interesting the approach might feel.
| [deleted]
| alper111 wrote:
| Plain MLP acc. is 63% vs 59% with FF, not so bad? By the
| same logic, MLP is a failed toy model.
| hiddencost wrote:
| Hinton is doing basic science, not ML, here. Given who he
| is, trying to move the needle on traditional benchmarks
| would be a waste of his time and skills.
|
| If he invents the new back propagation, an army of grad
| students can turn his ideas into the future. Like they've
| done for the last 15 years.
|
| He's posting incremental work towards rethinking the field.
| It's pretty interesting stuff.
|
| Edit: grammar
| jasonjmcghee wrote:
| I haven't seen this to be the case, fwiw. There was a paper
| in 2016 that did this and most were in the ~40% range.
|
| But "any ml algorithm" isn't the point. It's a new
| optimization technique and should be applied to
| models/architectures that make sense with the problems they
| are being used on.
|
| For example, they could have used a pretrained featurizer
| and trained the two layer model on top of it, with both
| back prop and FF and compared.
| whimsicalism wrote:
| > For example, they could have used a pretrained
| featurizer and trained the two layer model on top of it,
| with both back prop and FF and compared.
|
| Making the assumption that weights/embeddings produced by
| a backprop-trained network are equally intelligible to a
| network also trained by backprop vs. one trained by this
| alternative method.
| jasonjmcghee wrote:
| I have personally seen them used successfully with all kinds
| of classic ML algorithms (enets, tree-based, etc.) that have
| nothing to do with backprop.
| whimsicalism wrote:
| Any ML algorithm that already has tooling written, CUDA
| scripts, etc. to run it faster.
|
| That said, I am also short-term bearish on backprop-free
| methods (although potentially long-term bullish).
| Coneylake wrote:
| Is the derivative calculated by forward-forward as analytic as in
| backpropagation?
___________________________________________________________________
(page generated 2023-01-12 23:01 UTC)