[HN Gopher] Small Language Models Are Also Few-Shot Learners
       ___________________________________________________________________
        
       Small Language Models Are Also Few-Shot Learners
        
       Author : YeGoblynQueenne
       Score  : 72 points
       Date   : 2021-10-12 09:59 UTC (2 days ago)
        
 (HTM) web link (aclanthology.org)
 (TXT) w3m dump (aclanthology.org)
        
       | solarmist wrote:
        | It's still millions of parameters. What kind of hardware is
       | needed to train a 10 million parameter model?
        
         | solarmist wrote:
          | Like could I train it in a week with a few GPUs? Or would this
         | still require a cluster to train in a reasonable amount of
         | time?
        
         | greens wrote:
         | A single consumer GPU for ~hours
        
           | binarymax wrote:
           | Yes, verbatim from the paper: "Moreover, training with PET
           | can be performed in several hours on a single GPU without
           | requiring expensive hyperparameter optimization."
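            | 
            | For anyone skimming: PET recasts the task as a cloze question
            | that a masked language model fills in, which is what keeps the
            | fine-tuning cheap. A rough sketch of that pattern/verbalizer
            | idea (assuming the HuggingFace transformers library; the
            | model, pattern and verbalizer below are placeholders, not the
            | paper's exact setup):
            | 
            |   import torch
            |   from transformers import AutoTokenizer, AutoModelForMaskedLM
            | 
            |   # Placeholder model; the paper itself uses ALBERT-style models.
            |   tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
            |   model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
            | 
            |   # Pattern: recast sentiment classification as a cloze question.
            |   review = "The plot was dull and the acting was worse."
            |   prompt = f"{review} All in all, it was {tokenizer.mask_token}."
            | 
            |   # Verbalizer: map each label to a single vocabulary token.
            |   verbalizer = {"positive": "great", "negative": "terrible"}
            | 
            |   inputs = tokenizer(prompt, return_tensors="pt")
            |   mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
            | 
            |   with torch.no_grad():
            |       logits = model(**inputs).logits[0, mask_pos]
            | 
            |   scores = {label: logits[tokenizer.convert_tokens_to_ids(token)].item()
            |             for label, token in verbalizer.items()}
            |   print(max(scores, key=scores.get))  # label whose token the MLM prefers
            | 
            | PET proper fine-tunes the masked LM on a few labelled examples
            | per pattern and distils an ensemble of patterns into a final
            | classifier; the sketch above only shows the cloze-scoring step.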
        
             | solarmist wrote:
             | Nice. Still reading it.
        
           | solarmist wrote:
            | Oh! Nice! That's much more accessible than I was expecting.
        
       | eterevsky wrote:
       | "Large carbon footprint"? Really? How do they know how the
       | electricity is generated that is used by the OpenAI datacenter?
       | Maybe it's all solar and wind.
       | 
       | Why can't they just talk about the reduction of energy or compute
       | resources instead?
        
         | ShamelessC wrote:
         | Eh, who cares? Maybe the energy usage from machine learning
         | overall is negligible - I haven't looked into it. But still,
         | what's wrong with showing that a result also has a reduced
         | carbon footprint compared to modern methods?
        
           | eterevsky wrote:
            | Because this is supposed to be a scientific paper: it should
            | talk about quantifiable effects, not speculation. Especially in
            | this case, the high energy and monetary costs are easy to
            | estimate. OpenAI spent $3 million just on the compute for
            | training. Even inference requires multiple TPUs, each drawing
            | hundreds of watts.
        
           | TrueDuality wrote:
            | The easy answer is that reduced energy usage for an
            | individual operation does not mean reduced energy usage
            | overall, all other things being equal (see Jevons paradox).
           | 
            | In practice, algorithms don't work in isolation. You may have
            | to do additional pre-processing on your data to get it into
            | this kind of model, offsetting any energy savings from the
            | execution of the model itself. Any number of small things go
            | unaccounted for when looking at just this portion of an
            | overall solution: you can't make statements about carbon
            | usage based on energy usage alone, nor about energy usage
            | based on computation requirements alone.
           | 
            | The actual energy usage is not something that can be derived
            | from the work itself alone. Claiming it as a benefit in your
            | paper without the additional work of demonstrating it is
            | marketing fluff. It doesn't belong, and it actively hurts
            | studies that focus on reducing energy usage with the explicit
            | intent of reducing carbon footprints.
        
         | Nimitz14 wrote:
          | I agree it's silly. The fact that 99.9% of researchers don't
          | have the resources to use these large models is treated as a
          | minor detail, but a chance to shoehorn in fighting climate
          | change is apparently an unmissable opportunity.
        
         | robbedpeter wrote:
          | The margin of error makes the studies that try to aggregate
         | these things almost meaningless.
         | 
          | For example, Google Colab runs in Google data centers, which
          | are carbon neutral. That's more or less true, but it depends on
          | the exact nature of their power sources and the accuracy of the
          | carbon offsets they purchase, and we all know Google is
          | perfectly accurate and truthful all the time. /s lol
         | 
          | I trust Google as far as their publicized use of renewables and
          | extensive use of solar goes. I also suspect that they're on-grid
          | and a net producer of energy, taking advantage of deals with
          | power companies and governments to eke out every last penny of
         | value in their infrastructure.
         | 
         | The problem is that without exact reporting of numbers, the
         | margin of error for that source alone creates huge uncertainty
         | in trying to assess the net carbon footprint of their service.
         | How much research is being done using Google infrastructure?
         | How much is being done on college campus data centers that run
          | their HPC on solar and wind? How much money is spent on
          | offsets by those other sources? Again, getting this right
          | requires exact knowledge: the use of offsets introduces huge
          | uncertainty, so aggregate reported usage, used as a proxy, could
          | be off by more than 100% of the naively assumed footprint.
         | 
         | The studies are only as good as their data, and the data isn't
         | very good unless it's obtained through legal mandate, via
         | subpoena or regulated reporting. To my knowledge, very little
         | of the data available for these estimates is anything except
          | self-reported numbers. The math and analyses they do are great,
         | but the margin of error likely exceeds 75%.
        
         | sva_ wrote:
         | Whenever somebody starts roasting current ML techniques about
         | their carbon footprint, I take that as "I've got no better
         | argument to present".
         | 
         | Sure, a reduction in computational expenses is in many ways
         | desirable, but I don't think the carbon footprint of a model is
         | a very good metric. There are much better arguments for more
         | efficient models.
         | 
          | I guess you have to do what you have to do for those grants.
        
         | soraki_soladead wrote:
         | This is somewhat well covered in prior research, in so far as
         | the information required is available. Here are some recent
         | evaluations of the carbon footprint of modern models:
         | 
         | https://arxiv.org/abs/1906.02243 (cited in the above paper)
         | 
         | > The U.S. Environmental Protection Agency (EPA) provides
         | average CO2 produced (in pounds per kilowatt-hour) for power
         | consumed in the U.S. (EPA, 2018), which we use to convert power
         | to estimated CO2 emissions:
         | 
          | > CO2e = 0.954 p_t
         | 
         | > This conversion takes into account the relative proportions
         | of different energy sources (primarily natural gas, coal,
         | nuclear and renewable) consumed to produce energy in the United
         | States.
         | 
         | (Other countries are also included in the paper.)
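          | 
          | To make that conversion concrete (the energy figure below is
          | invented purely for illustration; p_t is the total energy
          | consumed, in kWh):
          | 
          |   pue = 1.58                 # illustrative power usage effectiveness
          |   gpu_kwh = 1000.0           # hypothetical energy for one training run
          |   total_kwh = pue * gpu_kwh  # p_t in the quoted formula
          |   co2e_lbs = 0.954 * total_kwh
          |   print(f"{co2e_lbs:.0f} lbs CO2e (~{co2e_lbs * 0.4536:.0f} kg)")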
         | 
         | https://arxiv.org/abs/2104.10350
         | 
         | The authors note that in many cases accurately estimating the
         | carbon footprint is difficult because the information required
         | is not publicly or readily available. However, they do provide
          | some additional data and improved calculations, as well as
         | motivations beyond CO2 reduction.
        
           | eterevsky wrote:
           | The abstract of the first paper says "Remarkably, the choice
           | of DNN, datacenter, and processor can reduce the carbon
           | footprint up to ~100-1000X." If your aim is to optimize CO2
            | emissions, you have a lot of variables at your disposal, and
            | the architecture of the network is just one of them.
           | 
            | If the paper from the post had indeed tried to evaluate and
            | compare various kinds of optimization, then citing the CO2
            | emissions would be valid. But since it only discusses
            | improvements to the model itself, it would be much more
            | productive to just point to the reductions in the required
            | memory, GPU/TPU time, etc.
           | 
            | Readers can do the carbon math themselves, depending on how
            | carbon-neutral their datacenter is.
        
           | justicezyx wrote:
            | Can't you read the parent?
            | 
            | Because of the opaqueness of the electrical infrastructure,
            | these carbon footprint measurements can't be precise, because
            | the data is simply not there.
            | 
            | Therefore, a better measurement is just the electrical usage...
        
             | soraki_soladead wrote:
             | The papers I provided go a bit beyond "electrical usage".
             | One of them is cited by the paper in question.
             | 
             | Yes, they are approximations in lieu of more accurate data
             | but that doesn't invalidate them as a tool or a motivation
             | for future work such as this one.
             | 
             | Consider further that it's not just OpenAI's models in
             | question: it's every practitioner who attempts to train
             | similarly large models. These practitioners may not be
             | using "green" data centers, even if we generously assume
             | that OpenAI does. (Microsoft's 100% renewable target for
             | data centers isn't until 2025. Read another way: they may
             | be trying but they're not there yet.)
             | 
              | The available data and approximations illustrate that it is
              | not accurate to assume that the average data center is
              | powered by 100% renewables and carbon neutral. Thus the only
              | reasonable conclusion is that more efficient models will
              | have a positive impact on CO2 emissions, which is the
              | motivation of the paper.
             | 
             | Even if you don't agree, it's not completely unfounded and
             | is based on at least some research and data. At the end of
             | the day, is this really worth fighting against? Who wants
              | less energy-efficient models?
        
               | justicezyx wrote:
                | Indeed, humans cannot read...
                | 
                | Not sure why you are rehashing the same content in other
                | words...
        
               | soraki_soladead wrote:
               | I have provided recent and relevant citations with data
               | and detailed comments. You have provided? What? Attacks?
               | I'm honestly not even sure. Given your comment history I
               | think I'm done here.
        
               | justicezyx wrote:
               | Hmm, I meant that we are just talking about the same
               | thing...
        
           | ChefboyOG wrote:
           | There are also some pretty cool open source projects
           | dedicated to tracking this kind of thing:
           | 
           | https://www.comet.ml/site/introducing-codecarbon-an-open-
           | sou...
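            | 
            | If I remember the codecarbon API correctly, usage is roughly
            | just wrapping your training code in a tracker (a sketch only;
            | train_model below is a stand-in, not a real training loop):
            | 
            |   from codecarbon import EmissionsTracker
            | 
            |   def train_model():
            |       # stand-in for a real training loop
            |       return sum(i * i for i in range(10_000_000))
            | 
            |   tracker = EmissionsTracker()
            |   tracker.start()
            |   try:
            |       train_model()
            |   finally:
            |       emissions_kg = tracker.stop()  # estimated kg CO2-equivalent
            |       print(f"Estimated emissions: {emissions_kg:.6f} kg CO2e")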
        
       | travisgriggs wrote:
       | I clicked on the link thinking I was going to be reading about
       | Forth and other "simple" programming languages.
        
         | dunefox wrote:
         | Since when are programming languages called language models?
        
           | setr wrote:
           | I believe he was thinking of programming language paradigms
           | -- e.g. procedural, stack-oriented, vector-oriented,
           | functional, etc.
           | 
           | Though "learners" should have given it away as well
        
       | BenoitEssiambre wrote:
       | I've had this hypothesis for years that a good AI model would be
        | something in between a neural net and a probabilistic grammar.
       | 
       | In my mind it would involve some kind of generative grammar that
       | would generate nodes (select from a pool), then these nodes could
       | be trained. I'm thinking about something like a grammar for
       | Bayesian networks or, more broadly, a generative program
       | induction scheme where you'd specify a programming language
       | grammar, generate programs that fit the data and tune the
       | parameters.
       | 
       | I attempted and failed to implement some of these ideas
       | (described here: https://www.quora.com/What-deep-learning-ideas-
       | have-you-trie...). The search space for generative programs is
       | just so huge and irregular and I don't know how to keep things
       | differentiable. In the paper, they mention gradient-based
       | optimization so maybe to figured out part of it.
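        | 
        | To make the "generate programs from a grammar, then tune their
        | parameters" part concrete, here's a toy sketch (everything is
        | made up for illustration; it dodges the differentiability problem
        | with plain random search over the constants):
        | 
        |   import random
        | 
        |   random.seed(0)
        | 
        |   # Toy target data: y = 2*x + 1; a good sampled "program" fits it.
        |   xs = [i / 10 for i in range(-20, 21)]
        |   ys = [2 * x + 1 for x in xs]
        | 
        |   def sample_expr(depth=0):
        |       """Sample a small expression tree over x and constant slots."""
        |       if depth >= 2 or random.random() < 0.4:
        |           return random.choice(["x", "c"])
        |       return (random.choice(["+", "*"]), sample_expr(depth + 1),
        |               sample_expr(depth + 1))
        | 
        |   def count_consts(e):
        |       return 1 if e == "c" else 0 if e == "x" else sum(
        |           count_consts(s) for s in e[1:])
        | 
        |   def evaluate(e, x, consts, pos):
        |       """Evaluate tree e at x, consuming constants via index box pos."""
        |       if e == "x":
        |           return x
        |       if e == "c":
        |           v = consts[pos[0]]
        |           pos[0] += 1
        |           return v
        |       op, a, b = e
        |       va, vb = evaluate(a, x, consts, pos), evaluate(b, x, consts, pos)
        |       return va + vb if op == "+" else va * vb
        | 
        |   def fit(expr, trials=200):
        |       """Crude 'parameter tuning': random search over constant slots."""
        |       best_err, best_consts = float("inf"), []
        |       for _ in range(trials):
        |           consts = [random.uniform(-3, 3)
        |                     for _ in range(count_consts(expr))]
        |           err = sum((evaluate(expr, x, consts, [0]) - y) ** 2
        |                     for x, y in zip(xs, ys))
        |           if err < best_err:
        |               best_err, best_consts = err, consts
        |       return best_err, best_consts
        | 
        |   # "Generate programs that fit the data": sample many, keep the best.
        |   best = min(((*fit(e), e) for e in (sample_expr() for _ in range(50))),
        |              key=lambda t: t[0])
        |   print("best squared error:", best[0])
        |   print("fitted constants:", best[1])
        |   print("expression:", best[2])
        | 
        | A real system would of course need something far smarter than
        | random search over both programs and constants, which is exactly
        | where I got stuck.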
        
       ___________________________________________________________________
       (page generated 2021-10-14 23:01 UTC)