[HN Gopher] Deep Learning's Diminishing Returns (2021)
       ___________________________________________________________________
        
       Deep Learning's Diminishing Returns (2021)
        
       Author : jrepinc
       Score  : 29 points
       Date   : 2023-06-16 18:54 UTC (4 hours ago)
        
 (HTM) web link (spectrum.ieee.org)
 (TXT) w3m dump (spectrum.ieee.org)
        
       | dekhn wrote:
        | None of these articles admit to a basic truth: deep learning is
        | only a small slice of total computation, and even if it grew by
        | two orders of magnitude, it would not be the largest consumer by
        | area.
       | 
       | If DL costs are unsustainable, it means those other things (SAP,
       | web hosting, and all the ridiculous stuff people waste cycles on)
       | are even more unsustainable, and should be addressed first.
        
         | mrtranscendence wrote:
         | The amount of computation in the world devoted to deep
         | learning, in aggregate, might be a small slice of total
          | computation. But in terms of the costs borne by the individual
          | organizations that train such models, deep learning can be an
          | extremely substantial expense, particularly if it takes
          | millions in additional investment to achieve only marginal
          | reductions in error.
        
           | dekhn wrote:
            | How is that different from any other capital-intensive
            | industry? Basically, my point is that there is nothing
            | specific to DL in these articles/complaints; people are just
            | jumping on the bandwagon.
        
         | aeturnum wrote:
          | You're begging the question that "the point" must be: "which
          | thing is wasting the most power right now?" It's not. People
          | can talk about at least two things at once. This article is
          | critiquing the diminishing returns in terms of cost-per-output,
          | which compares badly, on a unit level, to most other methods.
          | That's a fair critique when planning which technique to use in
          | the future.
         | 
         | Articles are often talking about something that won't end up
         | mattering. But if you can't actually address the point the
         | article is making on its own terms, you're going to struggle to
         | refute it.
        
           | dekhn wrote:
           | But we aren't seeing diminishing returns in terms of cost-
           | per-output. We're seeing enormous advances, both in the
           | quality of networks, and what we understand about how to
           | train them, with a large but not unanticipated growth in
           | compute which is still smaller than the growth of compute in
           | other areas which aren't producing scientific advances.
           | 
           | When I refute a paper I start with the weakest points; if I
           | convince my audience to ignore the paper on those terms, I
           | don't spend time refuting the main point (perhaps this is not
           | a great technique, but it is fairly efficient). I didn't even
            | really address their main arguments because, after reading
            | the paper, I was just aghast at some of the things they say.
           | 
           | Let's give an example like this: """The first part is true of
           | all statistical models: To improve performance by a factor of
           | k, at least k**2 more data points must be used to train the
           | model."""
           | 
            | I guess that's true of nearly all modern statistical models,
            | at least small ones with simple model functions and a fairly
            | small number of data points. But I don't think that most
            | advanced deep learning experts think in those terms; modern
            | DLs do not behave anything like classical statistical models.
            | I think those experts see increasing the data as providing an
            | opportunity for overparameterized systems to generalize in
            | ways we don't understand and that don't follow normal
            | statistical rules. Modern DL systems are more like complex
            | systems with emergent properties than statistical models as
            | understood by modern statisticians.
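            | 
            | To make that k**2 claim concrete with a toy sketch (the
            | power-law exponent below is made up, it only illustrates the
            | scaling): if error falls as err = C * n**-0.5 in the number
            | of training points n, then halving the error takes four
            | times the data.
            | 
            |     # sketch: error curve err = C * n**-0.5;
            |     # halving the error requires 4x the training data
            |     import numpy as np
            |     n = np.array([1e3, 4e3, 16e3])   # training set sizes
            |     err = 10.0 * n ** -0.5           # hypothetical error
            |     print(err)  # ~[0.316, 0.158, 0.079]: halves as n grows 4x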
           | 
            | Here's another example: they sort of ignore the fact that we
            | only need to train a small number of really big models, and
            | then all other models are fine-tuned from them. To get a
            | world-class tardigrade detector, I took MobileNet, gave it a
            | few hundred extra examples, trained for less than an hour on
            | my home GPU (a 3080Ti), and then re-used that model for
            | millions of predictions on my microscope. I didn't have to
            | retrain any model from scratch or use absurd amounts of
            | extra data. I took advantage of all the work the original
            | model training did to discover basis functions that can
            | compactly encode the difference between tardigrades, algae,
            | and dirt. I see a direct linear increase in my model's
            | performance as I move to linearly larger models, and I need
            | to add a linear number of images to train more classes. We
            | can reasonably expect this to be true of a wide range of
            | models.
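            | 
            | A minimal sketch of that kind of fine-tuning (assuming
            | torchvision and a pretrained MobileNet; illustrative only,
            | not the exact recipe):
            | 
            |     # fine-tune only the classifier head of a pretrained net
            |     import torch, torch.nn as nn
            |     from torchvision import models
            |     model = models.mobilenet_v2(weights="DEFAULT")
            |     for p in model.parameters():     # freeze the backbone
            |         p.requires_grad = False
            |     # new head: tardigrade, algae, dirt
            |     model.classifier[-1] = nn.Linear(
            |         model.classifier[-1].in_features, 3)
            |     opt = torch.optim.Adam(model.classifier.parameters())
            |     # ...then a short training loop over a few hundred images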
           | 
           | Similarly, for people doing molecular dynamics- the big CPU
           | waster before DL- many parts of MD can now be approximated
           | with DLs that are cheaper to run than MD, using models that
           | were just trained once.
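            | 
            | To give a flavor of what such a surrogate looks like (a toy,
            | hand-rolled pair-potential fit, nothing like a production
            | force field):
            | 
            |     # toy surrogate: fit a tiny MLP to a Lennard-Jones pair
            |     # energy, then reuse the trained net as a cheap stand-in
            |     import torch, torch.nn as nn
            |     r = torch.linspace(0.9, 3.0, 512).unsqueeze(1)
            |     e = 4 * (r.pow(-12) - r.pow(-6))   # reference energies
            |     net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
            |                         nn.Linear(64, 1))
            |     opt = torch.optim.Adam(net.parameters(), lr=1e-2)
            |     for _ in range(2000):              # train once
            |         opt.zero_grad()
            |         nn.functional.mse_loss(net(r), e).backward()
            |         opt.step()
            |     # afterwards net(r_new) approximates the potential cheaply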
           | 
           | What about AlphaFold? Even if it cost DeepMind $100M in
           | training time (and probably more in salaries), it has already
           | generated some results that simply couldn't be produced
           | without their technology- it didn't even exist! What they
           | demonstrated, quite convincingly, is that algorithmic
           | improvements could extract far more information from fairly
           | cheap and low quality sequence data, compared to expensive
           | structural data. So instead of extremely expensive MD sims or
           | whatever to predict structure, you just run this model. My
           | friends in pharma research (I work in pharma) are delighted
           | with the results.
           | 
            | In short, I think the author's economic model is naive, I
            | think his understanding of improvement in DL is naive, and
            | he undercounts the value of having a small number of huge
            | models that are trained on O(n log n), not O(n**2), data
            | sets.
           | 
            | And I think that in the next decade it's likely that Google,
            | Meta, or Microsoft will be actively training multi-modal
            | models that basically include the sum of all publicly
            | available, unencumbered data, producing networks that can
            | move smoothly between video, audio, and text and do logical
            | reasoning: everything required to produce a virtual human
            | being that could fool even experts, and probably even to
            | exceed human performance in science and mathematics in an
            | impactful way. So what if they spend $100B to get there?
            | That's just two years of NIH's budget.
        
         | freejazz wrote:
         | >If DL costs are unsustainable, it means those other things
         | (SAP, web hosting, and all the ridiculous stuff people waste
         | cycles on) are even more unsustainable, and should be addressed
         | first.
         | 
          | That fundamentally misunderstands the point. If it costs me
          | $XXX in compute to do the job I'm trying to replace, and the
          | job itself only costs $X, then it's not financially viable to
          | use the model to replace the job. It makes no difference what
          | percentage of global computation that represents.
        
           | dekhn wrote:
            | The reason the global percentage matters is that if you are
            | truly trying to address environmental impacts, you wouldn't
            | talk about DL; it's currently too small (and will be for
            | some time) to make a difference.
            | 
            | It's just like not optimizing the 1% bottleneck in your code
            | when there's a 50% bottleneck: focus first on the big
            | players, where the cheap, easy wins are.
        
             | freejazz wrote:
             | Yes, your point makes sense from the perspective of trying
             | to reduce global carbon usage... but that doesn't seem to
             | be the perspective the article is written from, and the
             | title is "Deep Learning's Diminishing Returns" not "Deep
             | Learning Is Not Environmentally Friendly" so I think the
             | point that DL has diminishing returns stands.
        
               | dekhn wrote:
                | The paper advances several lines of argument.
                | Environmental impact is one of them; it's used as a
                | negative consequence of their purported scaling laws
                | for training.
               | 
               | The paper explicitly says: """Important work by scholars
               | at the University of Massachusetts Amherst allows us to
               | understand the economic cost and carbon emissions implied
               | by this computational burden. The answers are grim:
               | Training such a model would cost US $100 billion and
               | would produce as much carbon emissions as New York City
               | does in a month. And if we estimate the computational
               | burden of a 1 percent error rate, the results are
               | considerably worse."""
               | 
               | (that paper was debunked by people who build and operate
               | extremely large scale cloud machine learning systems).
               | 
                | Remember, Google already built its TPU fleet and runs it
                | continuously, and the machines are almost always busy at
                | at least 50% of their capacity, meaning the money is
                | already being spent and the carbon (which is much smaller
                | than their estimates) is already being spewed.
               | 
               | As for the rest of the paper, their arguments about the
               | need to increase parameters to get better results are
               | fairly simplistic, and filled with misleading
               | information/untrue statements.
               | 
               | Basically, everything they claim in the article has been
               | shown to be empirically not true. And the community is
                | already grappling with the "too many parameters"
                | approach. The authors would have served the community
                | far better by writing a less critical paper focused on
                | identifying good approaches to parameter and compute
                | reduction that don't affect performance. Looking at the
                | main author's area of research, it seems he is more of
                | an economist/public-policy wonk than a DL expert, which
                | I think negatively affects the quality of the paper.
        
       | dang wrote:
       | Discussed at the time:
       | 
        |  _Deep Learning's Diminishing Returns_ -
       | https://news.ycombinator.com/item?id=28646256 - Sept 2021 (84
       | comments)
        
       | deepsquirrelnet wrote:
        | I don't find the author's points very convincing. There's a bit
        | of a self-contradiction in the arguments being made -- namely
        | that if you want better generalization, you need more parameters
        | (which equates to higher energy consumption in training).
        | 
        | However, the obvious retort here is that if you train a model
        | that is good at generalizing, then you don't need to train more
        | models! A show of hands of who has used GPT or an open LLM vs.
        | who has _trained_ one would reveal a vast disparity. If you
        | don't need generalization, you don't need huge models. Small
        | models are efficient over narrow domains and don't require vast
        | compute/energy resources.
       | 
        | Secondly, it's a self-solving issue. Energy isn't cheap, and
        | GPUs aren't cheap. If you're going to burn tens of thousands of
        | dollars in energy costs, you should probably have a decent
        | reason to do it. And those reasons are quickly diminishing for
        | things that have already been _done_.
       | 
        | Third, overparameterized models are becoming less of an issue
        | during inference thanks to efficient quantization techniques.
        | Distillation, though harder, is another option. Again, you can
        | do these things once, after training.
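        | 
        | For example, post-training dynamic quantization in PyTorch is
        | close to a one-liner (a sketch on a stand-in model, not any
        | particular LLM):
        | 
        |     # quantize the Linear layers of a trained model to int8
        |     import torch, torch.nn as nn
        |     model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
        |                           nn.Linear(512, 10))  # stand-in model
        |     qmodel = torch.ao.quantization.quantize_dynamic(
        |         model, {nn.Linear}, dtype=torch.qint8)
        |     # qmodel uses int8 weights for its Linear layers at inference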
        
       | rictic wrote:
       | A disappointing analysis. It discusses cost in isolation from
       | revenue, and raises environmental impacts without mentioning that
       | the major cloud computing providers where the largest models are
       | trained and run are carbon neutral.
       | 
       | Better results for general purpose tasks are likely to find
       | willing buyers at prices orders of magnitude higher than even the
       | largest language models of today. In most cases, the best
       | alternative is human labor, and for more latency-sensitive cases,
       | there is no alternative to a language model available at any
       | price.
        
         | wokkel wrote:
          | Regarding carbon neutrality: yeah, by using their enormous
          | capital/pricing power to buy up the green power, relegating
          | the rest of the country to either greenwashed grey power or
          | outright grey power. So the learning is green; it's just the
          | rest that is polluting....
        
         | _Algernon_ wrote:
         | "Carbon neutrality" is shameless greenwashing while carbon
         | credits in large part remain a scam.
         | 
         | https://en.wikipedia.org/wiki/Carbon_offsets_and_credits#Con...
        
       ___________________________________________________________________
       (page generated 2023-06-16 23:01 UTC)