[HN Gopher] Deep Learning's Diminishing Returns
___________________________________________________________________
Deep Learning's Diminishing Returns
Author : RageoftheRobots
Score : 82 points
Date : 2021-09-24 18:41 UTC (4 hours ago)
(HTM) web link (spectrum.ieee.org)
(TXT) w3m dump (spectrum.ieee.org)
| lvl100 wrote:
| I've had similar thoughts in the past but I started playing
| around with some newer models recently and I have about 25-30
| projects that I can think of right off that could be considered
| commercially viable. And certainly VC fundable in this investment
| environment.
| phyalow wrote:
| Email me (its on my profile) if you want to shoot some project
| ideas.
| selimthegrim wrote:
| I'd love to hear some of them too (profile at gmail)
| CodeGlitch wrote:
| I do wonder if we'll see the rise of symbolic AI to give deep
| learning some common sense? I've been thinking a lot about
| this interview on Cyc:
|
| https://www.youtube.com/watch?v=3wMKoSRbGVs
| drdeca wrote:
| And here I thought one of the big benefits of DL was that it
| could handle the complexities which would be too hard to
| specify symbolically in order to give symbolic AI "common
| sense".
|
| The following argument comes to mind, but I don't really buy it
| (it just came to mind as something that one might say next):
|
| ' Perhaps there is an analogy between the solutions of "we just
| need to get better (more varied, better fitting the desired
| behavior, etc.) training data, and maybe better training
| procedures" and "we just need to add more/better inference
| rules and symbolic ways to encode statements, and add more
| facts about the world". Similar in that both will produce the
| specific improvements they target, but where solving "the
| real/whole/big problem" that way is infeasible. If so, then
| maybe this indicates that a practical full-solution to
| artificial "common sense" would require something fundamentally
| different than both of them, if it is even possible at all. '
|
| Again, I don't really buy that line of reasoning, just
| expressing my inner GPT2 I guess, haha.
|
| Ok, but I presented an argument (or something like an argument)
| which I made up, and said that I don't buy it. So, I should say
| why I don't buy it, right? Like many of the things I write, it
| is chock-full of qualifiers like "perhaps" and "maybe", to the
| point that one might say that it hardly makes any claims at
| all. But ignoring that part of it, one major difference is that
| the DL-style architectures seem to be working? And it isn't
| clear what kinds of (practically speaking) hard limits they could
| run into. Now, on the other hand, perhaps at the time that
| symbolic AI was all the rage, it appeared the same way. (Is
| this what people mean when they talk about inside view vs
| outside view?).
|
| Why should these two things not be especially analogous? Well,
| saying "proposed solution X to the problem says to just [do
| more of what X is/do X better], and that is just like how
| proposed solution Y says to just [do more of what Y is/do Y
| better]" is kind of a fully general argument for dismissing
| any proposed type of solution where partial solutions of that
| type have been tried, but the whole problem hasn't been solved
| that way yet, and another proposed kind of solution has already
| lost favor. This doesn't seem like a generally valid line of
| reasoning. Sometimes you really do just need more dakka
| (spelling? I mean "more of the thing you already tried some
| of").
|
| Of course, if one is convinced that it really was right for the
| older proposed kind of solution to be discarded, that probably
| should say something about the currently popular kind of
| solution. Especially if there have been many proposed kinds of
| solutions which have been discarded. But, it seems like much of
| what it says is just that the problem is hard. And, sure, that
| may mean an increased probability that the currently popular
| proposed kind of solution also doesn't end up being
| satisfactory, but that doesn't mean one should be too quick to
| discard it. Tautologically: if no known alternative is
| currently at least as promising as the type of solution
| currently being considered, then, the current one is the most
| promising of the currently known options. Whether it is
| promising enough to actively pursue may be a different
| question, but it shouldn't be marked as discarded until
| something else (perhaps something previously discarded, or
| something novel) becomes more promising.
| airstrike wrote:
| From Wikipedia (https://en.wikipedia.org/wiki/Cyc#Criticisms)
|
| > ... A similar sentiment was expressed by Marvin Minsky:
| "Unfortunately, the strategies most popular among AI
| researchers in the 1980s have come to a dead end," said Minsky.
| So-called "expert systems," which emulated human expertise
| within tightly defined subject areas like law and medicine,
| could match users' queries to relevant diagnoses, papers and
| abstracts, yet they could not learn concepts that most children
| know by the time they are 3 years old. "For each different kind
| of problem," said Minsky, "the construction of expert systems
| had to start all over again, because they didn't accumulate
| common-sense knowledge." Only one researcher has committed
| himself to the colossal task of building a comprehensive
| common-sense reasoning system, according to Minsky. Douglas
| Lenat, through his Cyc project, has directed the line-by-line
| entry of more than 1 million rules into a commonsense knowledge
| base."
| ypcx wrote:
| And then, GPT-3 came along and rendered Cyc a wasted effort.
| CodeGlitch wrote:
| > And then, GPT-3 came along and rendered Cyc a wasted
| effort.
|
| I'm yet to see GPT-3 do anything commercially important?
| Cyc on the other hand seems to have been used in a number
| of sectors. Not to downplay GPT-3 - it's cool tech that
| produces cool demos - Cyc just seems more like a tool
| rather than a toy.
| montenegrohugo wrote:
| GPT-3 has been used in a ton of commercially important
| applications.
|
| To name a few:
|
| - GitHub CoPilot (transformative imo)
|
| - Markcopy.ai, jenni.ai, etc.... Tons of content
| generation and SEO tools startups
|
| - AI Dungeon and such
|
| - Plenty of chatbots
|
| - It's super useful for all kinds of classification tasks
| too (as are all transformer models)
| CodeGlitch wrote:
| Thanks for the response, I'm not familiar with
| Markcopy.ai or jenni.ai, but the others are "toys" if we're
| being honest (although as I said - very cool toys). You
| wouldn't want to use GPT-3 to recommend drugs for a
| condition you feed it...would you? As far as I
| understand, this is the kind of problem Cyc is trying to
| solve with its domain-specific rules.
|
| edit: Cyc can also tell you _why_ it gave the response it
| did - something that deep nets cannot do. This is
| important in many fields, otherwise you cannot trust any
| response it produces.
| goatlover wrote:
| Cyc was trying to encode common knowledge about the world
| in a bunch of rules. That goes well beyond what GPT-3 does
| with text.
| ypcx wrote:
| GPT-3 learns these rules by itself.
| williamtrask wrote:
| From what I can tell, many of the best thinkers agree with this
| idea but we haven't cracked it yet.
| d_burfoot wrote:
| > it does so using a network with 480 million parameters. The
| training to ascertain the values of such a large number of
| parameters is even more remarkable because it was done with only
| 1.2 million labeled images--which may understandably confuse
| those of us who remember from high school algebra that we are
| supposed to have more equations than unknowns.
|
| This is one of the key misunderstandings that are still deeply rooted
| in people's minds. For modern DL, a large part of the learning
| comes from "internal" data points, in this case the pixels of the
| image, as opposed to the labels. If you count the number of
| pixels, you will likely get something like 1.2 trillion, more
| than enough to justify the 4.8e8 parameters. It's the usage of
| internal data that prevents overfitting, NOT the random
| initialization and SGD as claimed in the article.
|
| Another way to see this is: if you need more labels than
| parameters, how can GPT3 have ANY parameters at all? It is
| trained purely on raw text data.
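|
| A rough back-of-the-envelope version of that pixel count (purely
| illustrative; the 224x224 RGB crop size is an assumption on my
| part, not from the article):
|
|     num_images = 1.2e6                # labeled ImageNet training images
|     values_per_image = 224 * 224 * 3  # standard RGB crop
|     num_params = 4.8e8                # parameters cited in the article
|
|     raw_values = num_images * values_per_image
|     print(f"{raw_values:.1e}")            # ~1.8e11 raw input values
|     print(raw_values / num_params)        # ~376x the parameter count
|
| Full-resolution images push the raw-value count further toward the
| trillion-scale figure mentioned above.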
| albertzeyer wrote:
| For modern DL, a large part also comes from regularization. And
| then also data augmentation. And self-supervision in whatever
| way, either prediction, masked prediction, contrastive losses,
| etc.
|
| Which all adds to the number of constraints / equations.
| sendtown_expwy wrote:
| You are incorrect about the input dimensionality mattering.
| Let's say you have 100 high-res images with yes/no labels. If
| you hash the images and put their labels in a hashmap, you can
| say this is a "learned" function of 100 parameters which
| achieves zero training error on the dataset. This parameter
| count is independent of input dimension. Why do you think this
| would change when this mapping is replaced by a smooth neural
| network mapping?
|
| GPT is trained to predict the input (estimating p(x)), versus
| predicting a label given an input (p(y|x)). So in the case of
| GPT you can use the input dimensionality as a "label", as
| another responder has mentioned. ImageNet classification is
| different (excepting recent semi-supervised or unsupervised
| approaches to image recognition).
|
| The ability to generalize in the typical imagenet setting is,
| as the article says, a byproduct of SGD with early stopping,
| which in practice limits the number of functions a deep neural
| network can express (something not considered in an analysis
| which only considers parameter count).
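|
| A minimal sketch of that lookup-table "model" (hypothetical toy
| data; the point is only that the number of stored values does not
| depend on the input dimension):
|
|     import hashlib
|
|     def fit(images, labels):
|         # one stored entry ("parameter") per training example,
|         # no matter how many pixels each image has
|         return {hashlib.sha256(x).hexdigest(): y
|                 for x, y in zip(images, labels)}
|
|     def predict(table, image):
|         # zero training error, no generalization off the training set
|         return table.get(hashlib.sha256(image).hexdigest())
|
|     images = [b"high-res image bytes 1", b"high-res image bytes 2"]
|     labels = ["yes", "no"]
|     table = fit(images, labels)
|     assert all(predict(table, x) == y for x, y in zip(images, labels))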
| montenegrohugo wrote:
| The point is your simple mapping with zero error on the
| training dataset also has zero prediction power in both the
| test dataset and in real life. It's learned nothing; it's
| overfitting taken to the extreme.
|
| Input dimensionality is absolutely important when determining
| net size.
| sendtown_expwy wrote:
| That's the point. 100 parameters is sufficient to overfit,
| and it's a number that's independent of the input size. Do
| you have a reference for your statement?
| montenegrohugo wrote:
| Reference for what exactly? That input dimensionality is
| important when determining net size? That seems quite
| self-explanatory; try training an image classifier with
| only 100 parameters.
|
| Maybe I understood that question wrong, but regardless,
| even if early stopping wasn't implemented, a NN would
| have more predictive power than the hash mapping. Both
| would be completely overfit on the training data set, yet
| the NN would most likely be able to make some okay
| guesses with OOD data.
| williamtrask wrote:
| GPT3 has millions of labels. Every vocabulary term is a label.
| It's equivalent to supervised learning in architecture. The
| "self-supervised" business is mostly spin to make it sound a
| bit more novel. People have been predicting the next word for
| ages (Turing did this).
|
| Input: <previous words of article>
|
| Label: <next word>
|
| Your point is well taken that the number of input data points
| is also important when considering the complexity of the
| problem. In this case however the number of data points more or
| less exactly equals the number of labels.
|
| (About Me: the first year+ of my PhD was focused on large scale
| language modelling, during which transformers came out.)
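|
| A minimal sketch of that framing (whitespace tokenization, purely
| illustrative; real models use subword tokenizers):
|
|     text = "deep learning needs a lot of data".split()
|
|     # every position yields one (input, label) pair:
|     # the label is simply the next token
|     pairs = [(text[:i], text[i]) for i in range(1, len(text))]
|     for context, label in pairs:
|         print(context, "->", label)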
| axiosgunnar wrote:
| This is such a basic error the author has made that I am not
| sure he can use "we" when referring to researchers...
| commandlinefan wrote:
| Well, that _is_ something that you're taught in high school
| algebra, which you end up "unlearning" when you study linear
| algebra.
| contravariant wrote:
| That doesn't quite work out that way, I think: if you compare it
| to solving a system of equations, then the size of the input is
| irrelevant. Indeed a very large input is often the main reason
| for a problem to be under-specified.
|
| What you should look at is the number of outputs times the
| number of data points for each output. If this number is lower
| than the number of parameters then it should be possible to
| find multiple solutions.
|
| Of course in this case you're not looking for a solution, but
| an optimum, and not even a global one, so it's not too
| troubling per se that you don't get a unique answer. Though it
| does somewhat suggest you should be able to get an equivalent
| fit with far fewer parameters, but finding it could be quite
| tricky.
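|
| Applying that counting rule to the article's ImageNet example (a
| sketch; treating the 1,000-way softmax output as 1,000 constraints
| per example is my assumption):
|
|     num_examples = 1.2e6
|     outputs_per_example = 1000           # ImageNet classes
|     num_params = 4.8e8
|
|     constraints = num_examples * outputs_per_example   # 1.2e9
|     print(constraints > num_params)      # True under this counting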
| culi wrote:
| > At OpenAI, an important machine-learning think tank,
| researchers recently designed and trained a much-lauded deep-
| learning language system called GPT-3 at the cost of more than $4
| million. Even though they made a mistake when they implemented
| the system, they didn't fix it, explaining simply in a supplement
| to their scholarly publication that "due to the cost of training,
| it wasn't feasible to retrain the model."
| djoldman wrote:
| > Deep-learning models are overparameterized, which is to say
| they have more parameters than there are data points available
| for training.
|
| Is this true for all deep learning models?
| dekhn wrote:
| depends on the model, but most systems I've worked with had
| millions to billions of parameters, and trillions of (sparsely
| populated) data points.
| lvl100 wrote:
| DL cannot be over-specified. However you do need to mind your
| endogenous and exogenous variables.
| jjcon wrote:
| Not even close. Most of my work has been on naturally occurring
| data, and there is way, waay more data available than can
| possibly be used (petabytes). Where they get the idea that this
| is the rule and not the exception is beyond me.
| armoredkitten wrote:
| It's not _inherently_ true. Technically, deep learning is
| essentially any neural network model with hidden layers (i.e., at
| least one layer between the input layer and the output layer). You
| could have a "deep learning" model with a couple dozen
| parameters, perhaps. But at that end of the scale, most people
| would probably reach for other approaches that are more easily
| interpretable (e.g., logistic regression, random forest). So in
| practice, yes, virtually any deep learning model you see out
| there in the wild, even most "toy examples" used to teach
| machine learning, is going to be overparameterized.
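|
| For instance, a technically "deep" net can have only a couple dozen
| parameters (a toy sketch in PyTorch, purely illustrative):
|
|     import torch
|
|     net = torch.nn.Sequential(
|         torch.nn.Linear(2, 3), torch.nn.ReLU(),   # 2 inputs -> 3 hidden
|         torch.nn.Linear(3, 3), torch.nn.ReLU(),   # second hidden layer
|         torch.nn.Linear(3, 1),                    # 1 output
|     )
|     print(sum(p.numel() for p in net.parameters()))   # 25 parameters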
| bjornsing wrote:
| > Training such a model would cost US $100 billion and would
| produce as much carbon emissions as New York City does in a
| month.
|
| This is what's called infeasible.
| abecedarius wrote:
| Longish article about the cost of training increasingly big
| neural nets. Worried about carbon. "Training such a model would
| cost US $100 billion and would produce as much carbon emissions
| as New York City does in a month. And if we estimate the
| computational burden of a 1 percent error rate, the results are
| considerably worse."
| sayonaraman wrote:
| I'm wondering if there is a way to combine optimization of model
| weights in a neural net with a set of heuristics limiting the
| search space, as a sort of rules engine/decision tree integrated
| within ANN backprop training. Basically pruning irrelevant and
| redundant features early and focusing on more informative ones.
| visarga wrote:
| Yes, there are many approaches like that. In one approach they
| train a network and prune it, then mask the pruned weights and
| retrain the resulting sparse network from scratch, starting from
| the original untrained weights.
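|
| A toy sketch of that recipe (linear model on synthetic data, purely
| illustrative; real pruning is done layer-wise on much larger nets):
|
|     import torch
|
|     torch.manual_seed(0)
|     x = torch.randn(256, 20)
|     y = (x @ torch.randn(20, 1)).squeeze()
|     w0 = torch.randn(20)        # original untrained weights, kept aside
|
|     def train(w, mask, steps=300):
|         opt = torch.optim.SGD([w], lr=0.05)
|         for _ in range(steps):
|             opt.zero_grad()
|             loss = ((x @ (w * mask) - y) ** 2).mean()
|             loss.backward()
|             opt.step()
|         return loss.item()
|
|     # 1) train the dense network
|     w_dense = w0.clone().requires_grad_()
|     train(w_dense, torch.ones(20))
|
|     # 2) prune: mask out the 50% smallest-magnitude trained weights
|     keep = w_dense.detach().abs() >= w_dense.detach().abs().median()
|     mask = keep.float()
|
|     # 3) reset survivors to the original init, retrain the sparse net
|     w_sparse = w0.clone().requires_grad_()
|     print(train(w_sparse, mask))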
| gibolt wrote:
| This assumes that our processes and algorithms don't get more
| targeted or improve. The rate of new approach discovery is
| staggering. For every problem, some combination of approaches
| will more efficiently pre-process and understand the training
| data.
|
| The article also ignores training vs running tradeoffs. Training
| a model once may be extremely resource intensive, but running the
| resulting model on millions of devices can cost comparatively
| little while adding huge value.
| et1337 wrote:
| Keep reading, the article includes a whole section about
| training vs running tradeoffs.
| sayonaraman wrote:
| > new approach discovery
|
| a good example is the discovery of attention
| mechanisms/transformers, which replaced more cumbersome and
| computationally expensive RNNs and LSTMs in NLP and more
| recently outperformed more expensive models in computer
| vision.
| visarga wrote:
| Transformers are pretty huge and expensive to run; LSTMs are
| lightweight by comparison.
| culi wrote:
| Keep reading, the article directly addresses this point
| dekhn wrote:
| Anybody who argues against deep learning based on energy
| consumption immediately fails to impress me. This article is
| particularly bad: claiming you need k^2 more data points to
| improve a model and using that to extrapolate unrealistic energy
| consumption targets for DL training.
|
| The sum of all DL training in the world is noise compared to the
| other big consumers of energy in computing. That's because the
| main players all invested in energy-efficient architectures. DL
| training energy is not something to optimize if your goal is to
| have a measurable impact on total power consumption.
| josefx wrote:
| > That's because the main players all invested in energy-
| efficient architectures.
|
| If the cost was gigantic enough to make the investment worth it,
| they must have found some really great improvements for it to
| end up being just noise. Improvements that somehow didn't have
| a noteworthy impact on general computing.
| Spooky23 wrote:
| Open ended crying about electricity doesn't make sense in the
| absence of specifics.
|
| A big company like Microsoft probably wasted more money on
| pentium 4s 15 years ago. Electricity is just another resource -
| if the numbers work, burn away.
| ypcx wrote:
| Especially if the result is the cure for cancer, or similar.
| ben_w wrote:
| Perhaps for now, but not necessarily in general.
|
| I know we're nowhere near the following scenario; this is
| just to illustrate how things can go wrong even if the
| numbers tell you to "burn away":
|
| Imagine we have computronium with negligible manufacturing cost;
| the only important thing is the power cost to use it.
|
| Imagine you're using it to run an uploaded mind, spending
| $35,805/year on energy.
|
| The 50% of Americans earning more than this [0] are no longer
| economically viable, because their work can now be done at the
| same cost by a computer program.
|
| Doing this with the current power mixture would be
| disastrous; doing it with PV needs about 1400 m^2 per
| simultaneous real-time mind upload instance (depending on
| your assumption about energy costs and cell efficiency,
| naturally).
|
| In a more near-term sense, there are plenty of examples where
| the Nash equilibrium tells each of us to benefit ourselves at
| the expense of all of us. Not saying that is the case for
| Deep Learning right now, but it can (and frequently does)
| happen.
|
| [0] https://fred.stlouisfed.org/series/MEPAINUSA672N
| user-the-name wrote:
| > Electricity is just another resource
|
| I hate to be the one to tell you, but, it turns out we are
| living in the middle of an ecological catastrophe, and it
| also turns out that means that electricity is a resource we
| are going to have to conserve.
| wanderingmind wrote:
| And yet people here have no trouble crying about electricity
| wastage of crypto. Also, from my limited knowledge, I think DNN
| models are not very transferable in real-world settings,
| requiring constant retraining even for a small drift in signal
| or change in noise modes.
| nerdponx wrote:
| > And yet people here have no trouble crying about
| electricity wastage of crypto
|
| Which is many orders of magnitude more energy-intensive, on
| the scale of a small nation-state, and in most cases
| fundamentally wasteful by design. A very large pre-trained
| model can be reused very cheaply once it's finished.
|
| > Also from my limited knowledge I think DNN models are not
| very transferable in real world setting requiring constant
| retraining even for a small drift in signal or change in
| noise modes.
|
| This is FUD, promulgated by people who expected deep learning
| to solve all their problems overnight. All models will suffer
| from "drift" whenever the underlying data changes.
|
| Part of what made deep learning so good was that it was able
| to generalize exceptionally well from exceptionally
| complicated input data.
|
| It is unreasonable to expect that a model pre-trained on a
| huge generic corpus will be a perfect match for your very
| specific business problem. However it is _not_ unreasonable
| to expect that said model will be a useful baseline and
| starting point for your very specific business problem.
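|
| For example (a minimal transfer-learning sketch, assuming a
| hypothetical 10-class problem; PyTorch/torchvision, details
| elided):
|
|     import torch
|     import torchvision
|
|     model = torchvision.models.resnet18(pretrained=True)  # generic backbone
|     for p in model.parameters():
|         p.requires_grad = False                       # freeze the backbone
|     model.fc = torch.nn.Linear(model.fc.in_features, 10)  # new task head
|
|     optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
|     # ...then train only the small head on the domain-specific data.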
|
| We are not yet (and might never be) at the point where you
| can dump a pile of garbage data into an API and get great
| predictions out the other end on the first try. But nobody
| ever thought you could do that, except the people selling
| expensive subscriptions to those kinds of APIs. The fact that
| they work at all should be taken as evidence of how amazing
| deep learning is; the fact that they don't work perfectly
| should not be taken as evidence that deep learning is
| bad/useless/wasteful/hype/whatever.
|
| Don't let the clueless tech media set your expectations.
|
| Professional data scientists and machine learning
| practitioners for the most part take their work very
| seriously and take pride in delivering good outcomes, just
| like professional software engineers. If deep learning wasn't
| useful to that end, nobody would be using it.
| culi wrote:
| Here's the original study[1] that seems to be the primary source
| for this article. It's an important study from a respectable
| journal. To be frank, it's pretty disconcerting that the top
| comments on this thread are those writing off the topic on the
| premise alone, while the comments actually engaging with the
| topic seem to be at the bottom.
|
| [1] https://arxiv.org/pdf/1906.02243.pdf
| SavantIdiot wrote:
| Putting aside energy costs, object detection is still crappy and
| has stalled. YOLO/SSDMN were impressive as all get-out, but they
| stink for general purpose use. It's been 3 years (?) and general
| object detection, even with 100 classes, is still unusable off
| the shelf. Yes, I understand incremental training of pre-trained
| nets is a thing, but that's not where we all hoped it would go.
| BeatLeJuce wrote:
| > > The first part is true of all statistical models: To improve
| performance by a factor of k, at least k^2 more data points must
| be used to train the model. The second part of the computational
| cost comes explicitly from overparameterization. Once accounted
| for, this yields a total computational cost for improvement of at
| least k^4.
|
| Those claims are entirely new to me, and I've been a researcher
| in the field for almost 10 years. Where do they come from/what
| theorems are they based on? It's unfortunate this article doesn't
| have any citations.
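|
| For what it's worth, here is one derivation that yields those
| exponents. It's my own reading of the claim, not something stated
| in the excerpt:
|
|     error ~ n^(-1/2)              (classic statistical scaling)
|     error/k  =>  n -> k^2 * n     (the k^2 data claim)
|
|     compute ~ n * p, with the parameter count p growing at
|     least linearly with n in the overparameterized regime
|     =>  compute ~ n^2 ~ k^4       (the k^4 compute claim)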
___________________________________________________________________
(page generated 2021-09-24 23:01 UTC)