[HN Gopher] Deep Learning's Diminishing Returns (2021)
___________________________________________________________________
Deep Learning's Diminishing Returns (2021)
Author : jrepinc
Score : 29 points
Date : 2023-06-16 18:54 UTC (4 hours ago)
(HTM) web link (spectrum.ieee.org)
(TXT) w3m dump (spectrum.ieee.org)
| dekhn wrote:
| None of these articles admit to a basic truth: deep learning
| is only a small slice of total computation, and even if it
| grew two orders of magnitude, it would still not be the
| largest consumer by area.
|
| If DL costs are unsustainable, it means those other things (SAP,
| web hosting, and all the ridiculous stuff people waste cycles on)
| are even more unsustainable, and should be addressed first.
| mrtranscendence wrote:
| The amount of computation in the world devoted to deep
| learning, in aggregate, might be a small slice of total
| computation. But the costs borne by an individual
| organization that trains such models can be extremely
| substantial, particularly if it takes millions in additional
| investment to achieve only marginal reductions in error.
| dekhn wrote:
| How is that different from any other capital-intensive
| industry? Basically my point is that there is nothing
| specific to DL in these articles/complaints; people are just
| jumping on the bandwagon.
| aeturnum wrote:
| You're assuming "the point" must be: "which thing is wasting
| the most power right now?" It's not. People can talk about at
| least two things at once. The article is critiquing deep
| learning's diminishing returns in terms of cost-per-output,
| which compares badly, on a unit level, to most other methods.
| That's a fair critique when planning which technique to use
| in the future.
|
| Articles are often talking about something that won't end up
| mattering. But if you can't actually address the point the
| article is making on its own terms, you're going to struggle to
| refute it.
| dekhn wrote:
| But we aren't seeing diminishing returns in terms of cost-
| per-output. We're seeing enormous advances, both in the
| quality of networks and in what we understand about how to
| train them, with a large but not unanticipated growth in
| compute that is still smaller than the growth of compute in
| other areas that aren't producing scientific advances.
|
| When I refute a paper I start with the weakest points; if I
| convince my audience to ignore the paper on those terms, I
| don't spend time refuting the main point (perhaps this is not
| a great technique, but it is fairly efficient). I didn't even
| really address their main arguments because after reading the
| paper, I was just aghast at some of the things they say.
|
| Let's give an example like this: """The first part is true of
| all statistical models: To improve performance by a factor of
| k, at least k**2 more data points must be used to train the
| model."""
|
| I guess that's true of nearly all modern statistical models,
| at least small ones with simple model functions and a fairly
| small number of data points. But I don't think that most
| advanced deep learning experts think in those terms; modern
| DLs do not behave anything like classical statistical models.
| I think those experts see increasing the data as providing an
| opportunity for overparameterized systems to generalize in
| ways we don't understand and that don't follow normal
| statistical rules. Modern DL systems are more like complex
| systems with emergent properties than like statistical models
| as understood by modern statisticians.
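|
| (To make the quoted scaling concrete: a minimal sketch in
| Python, assuming the classical statistical rate where error
| shrinks like c/sqrt(n). The numbers are illustrative, not
| from the article.)
|
|     import math
|
|     err = lambda n: 1.0 / math.sqrt(n)  # assumed error ~ n**-0.5
|     n, k = 10_000, 2      # made-up baseline size; want k=2x lower error
|     print(err(n) / err(n * k**2))  # -> 2.0: k**2 more data, k-fold improvement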
|
| Here's another example: they sort of ignore the fact that we
| only need to train a small number of really big models and
| then all other models are fine-tuned from that. To get a
| world-class tardigrade detector, I took mobilenet, gave it a
| few hundred extra examples, trained for less than an hour on
| my home GPU (a 3080Ti), and then re-used that model for
| millions of predictions on my microscope. I didn't have to
| retrain any model from scratch or use absurd amounts of extra
| data. I took advantage of all the work the original model
| training did to discover basis functions that can compactly
| encode the difference between tardigrade, algae, and dirt. I
| see a direct linear increase in my model's performance as I
| move to linearly larger models, and I need to add a linear
| number of images to train more classes. We can reasonably
| expect this to be true of a wide range of models.
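|
| (Roughly what that setup looks like: a minimal transfer-
| learning sketch, assuming PyTorch/torchvision. The
| hyperparameters and the three-class head are illustrative,
| not my exact configuration.)
|
|     import torch
|     import torch.nn as nn
|     from torchvision import models
|
|     # Start from an ImageNet-pretrained MobileNetV2 backbone.
|     model = models.mobilenet_v2(weights="IMAGENET1K_V1")
|
|     # Freeze the pretrained features; only the new head trains.
|     for p in model.features.parameters():
|         p.requires_grad = False
|
|     # Replace the 1000-class head with a 3-class one:
|     # tardigrade vs. algae vs. dirt.
|     model.classifier[1] = nn.Linear(model.last_channel, 3)
|
|     optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
|     # ...then a short training loop over a few hundred labeled images.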
|
| Similarly, for people doing molecular dynamics (the big CPU
| waster before DL), many parts of MD can now be approximated
| with DLs that are cheaper to run than MD, using models that
| were just trained once.
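|
| (The pattern is a one-time-trained surrogate: fit a cheap
| network to an expensive physics calculation, then reuse it.
| A toy sketch with a stand-in energy function, not a real
| force field:)
|
|     import torch
|     import torch.nn as nn
|
|     # Stand-in for a costly MD energy term (not real physics).
|     expensive_energy = lambda x: (x ** 2).sum(dim=1, keepdim=True)
|
|     surrogate = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))
|     opt = torch.optim.Adam(surrogate.parameters(), lr=1e-2)
|
|     for _ in range(500):          # train once...
|         x = torch.randn(256, 3)   # toy "coordinates"
|         loss = ((surrogate(x) - expensive_energy(x)) ** 2).mean()
|         opt.zero_grad(); loss.backward(); opt.step()
|
|     # ...then every later evaluation is a cheap forward pass.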
|
| What about AlphaFold? Even if it cost DeepMind $100M in
| training time (and probably more in salaries), it has already
| generated some results that simply couldn't be produced
| without their technology; it didn't even exist! What they
| demonstrated, quite convincingly, is that algorithmic
| improvements could extract far more information from fairly
| cheap and low quality sequence data, compared to expensive
| structural data. So instead of extremely expensive MD sims or
| whatever to predict structure, you just run this model. My
| friends in pharma research (I work in pharma) are delighted
| with the results.
|
| In short, I think the author's economic model is naive, I
| think his understanding of improvement in DL is naive, and he
| undercounts the value of having a small number of huge models
| which are trained on O(n log n), not n**2, data sets.
|
| And I think that in the next decade it's likely either
| Google, Meta, or Microsoft will be actively training multi-
| modal models that basically include the sum of all publicly
| available, unencumbered data, producing networks that can
| move smoothly between video, audio, and text, do logical
| reasoning, and do everything required to produce a virtual
| human being that could fool even experts, probably even
| exceeding human performance in science and mathematics in an
| impactful way. So what if they spend $100B to get there?
| That's just two years of NIH's budget.
| freejazz wrote:
| >If DL costs are unsustainable, it means those other things
| (SAP, web hosting, and all the ridiculous stuff people waste
| cycles on) are even more unsustainable, and should be addressed
| first.
|
| That fundamentally misunderstands the point. If it costs me
| $XXX to compute the job I'm trying to replace, and the job
| itself only costs $X, then it's not financially viable to use
| the model to replace the job. It makes no difference what
| percentage of global computation that makes up.
| dekhn wrote:
| The reason the global percentage matters is that if you are
| truly trying to address environmental impacts, you wouldn't
| talk about DL; it's currently too small (and will be for some
| time) to make a difference.
|
| It's just like not optimizing the 1% bottleneck in your code
| when there's a 50% bottleneck: focus first on the big
| players, where the cheap, easy wins are.
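|
| (Amdahl's-law arithmetic makes the point; the percentages
| below are just the illustrative numbers from above:)
|
|     # Amdahl's law: overall speedup when a fraction p of the
|     # work is sped up by a factor s.
|     speedup = lambda p, s: 1.0 / ((1.0 - p) + p / s)
|
|     print(speedup(0.01, 1e9))  # ~1.01x: 1% bottleneck, fully eliminated
|     print(speedup(0.50, 2.0))  # ~1.33x: 50% bottleneck, merely halved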
| freejazz wrote:
| Yes, your point makes sense from the perspective of trying
| to reduce global carbon usage... but that doesn't seem to
| be the perspective the article is written from, and the
| title is "Deep Learning's Diminishing Returns" not "Deep
| Learning Is Not Environmentally Friendly" so I think the
| point that DL has diminishing returns stands.
| dekhn wrote:
| The paper advances several lines of argument. The
| environmental one is presented as a negative
| consequence of their purported scaling laws for training.
|
| The paper explicitly says: """Important work by scholars
| at the University of Massachusetts Amherst allows us to
| understand the economic cost and carbon emissions implied
| by this computational burden. The answers are grim:
| Training such a model would cost US $100 billion and
| would produce as much carbon emissions as New York City
| does in a month. And if we estimate the computational
| burden of a 1 percent error rate, the results are
| considerably worse."""
|
| (that paper was debunked by people who build and operate
| extremely large scale cloud machine learning systems).
|
| Remember, Google already built and runs its TPU fleet
| continuously, and the machines are almost always busy at
| at least 50% of their capacity, meaning the money is
| already being spent and the carbon (which is much smaller
| than their estimates) is already being spewed.
|
| As for the rest of the paper, their arguments about the
| need to increase parameters to get better results are
| fairly simplistic, and filled with misleading
| information/untrue statements.
|
| Basically, everything they claim in the article has been
| shown empirically to be untrue. And the community is
| already grappling with the "too many parameters"
| approach. The authors would have served the community
| far better by writing a less critical paper that focused
| on the opportunities to identify good approaches to
| parameter and compute reduction that don't affect
| performance. Looking at the main author's area of
| research, it looks like he is more of an economist/public-
| policy wonk than a DL expert, which I think negatively
| affects the quality of the paper.
| dang wrote:
| Discussed at the time:
|
| _Deep Learning's Diminishing Returns_ -
| https://news.ycombinator.com/item?id=28646256 - Sept 2021 (84
| comments)
| deepsquirrelnet wrote:
| I don't find the author's points to be very convincing.
| There's a bit of a self-contradiction in the arguments being
| made -- namely, that if you want better generalization, you
| need more parameters (which equates to higher energy
| consumption in training).
|
| However, the obvious retort here is that if you train a model
| that is good at generalizing, then you don't need to train
| more models! A show of hands of who has used GPT or an open
| LLM vs. who has _trained_ one would yield a vast disparity.
| If you don't need generalization, you don't need huge models.
| Small models are efficient over narrow domains and don't
| require vast compute/energy resources.
|
| Secondly, it's a self-solving issue. Energy isn't cheap, and
| GPUs aren't cheap. If you're going to burn tens of thousands
| of dollars in energy costs, you should probably have a decent
| reason to do it. But those reasons are quickly diminishing as
| more and more things have already been _done_.
|
| Third, overparameterized models are becoming less of an issue
| during inference with efficient quantization techniques.
| Distillation, though harder, is another option. Again, you
| can do these things one time after training.
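|
| (For example, post-training dynamic quantization in PyTorch
| is a one-time step after training; the model below is just a
| placeholder, a sketch rather than a recipe:)
|
|     import torch
|     import torch.nn as nn
|
|     # Placeholder network standing in for a trained model.
|     model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
|
|     # Weights stored as int8, activations quantized on the fly;
|     # no retraining needed.
|     qmodel = torch.quantization.quantize_dynamic(
|         model, {nn.Linear}, dtype=torch.qint8)
|
|     print(qmodel(torch.randn(1, 512)).shape)  # same interface, cheaper layers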
| rictic wrote:
| A disappointing analysis. It discusses cost in isolation from
| revenue, and raises environmental impacts without mentioning that
| the major cloud computing providers where the largest models are
| trained and run are carbon neutral.
|
| Better results for general purpose tasks are likely to find
| willing buyers at prices orders of magnitude higher than even the
| largest language models of today. In most cases, the best
| alternative is human labor, and for more latency-sensitive cases,
| there is no alternative to a language model available at any
| price.
| wokkel wrote:
| Regarding carbon neutrality: yeah, by using their enormous
| capital/pricing power to buy all the green power, relegating
| the rest of the country to either greenwashed grey power or
| outright grey power. So learning is green; it's just the rest
| that is polluting....
| _Algernon_ wrote:
| "Carbon neutrality" is shameless greenwashing while carbon
| credits in large part remain a scam.
|
| https://en.wikipedia.org/wiki/Carbon_offsets_and_credits#Con...
___________________________________________________________________
(page generated 2023-06-16 23:01 UTC)