[HN Gopher] Small Language Models Are Also Few-Shot Learners
___________________________________________________________________
Small Language Models Are Also Few-Shot Learners
Author : YeGoblynQueenne
Score : 72 points
Date : 2021-10-12 09:59 UTC (2 days ago)
(HTM) web link (aclanthology.org)
(TXT) w3m dump (aclanthology.org)
| solarmist wrote:
| It's still millions of parameters. What kind of hardware is
| needed to train a 10 million parameter model?
| solarmist wrote:
| Like could I train it in a week with a few GPUs? Or would this
| still require a cluster to train in a reasonable amount of
| time?
| greens wrote:
| A single consumer GPU for ~hours
| binarymax wrote:
| Yes, verbatim from the paper: "Moreover, training with PET
| can be performed in several hours on a single GPU without
| requiring expensive hyperparameter optimization."
| solarmist wrote:
| Nice. Still reading it.
| solarmist wrote:
| Oh! Nice! That's much more accessible than I was expecting.
| eterevsky wrote:
| "Large carbon footprint"? Really? How do they know how the
| electricity is generated that is used by the OpenAI datacenter?
| Maybe it's all solar and wind.
|
| Why can't they just talk about the reduction of energy or compute
| resources instead?
| ShamelessC wrote:
| Eh, who cares? Maybe the energy usage from machine learning
| overall is negligible - I haven't looked into it. But still,
| what's wrong with showing that a result also has a reduced
| carbon footprint compared to modern methods?
| eterevsky wrote:
| Because this is supposed to be a scientific paper, it should
| talk about quantifiable effects, not speculation. Especially
| in this case, where the high energy and monetary costs are
| easy to estimate: OpenAI spent $3 million just on the compute
| for training, and even inference requires multiple TPUs, each
| drawing hundreds of watts.
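|
| As a rough illustration of "easy to estimate" (every number
| below is an assumption made up for the example, not a
| published figure):
|
|   # back-of-envelope in Python; all inputs are assumptions:
|   # 1,000 accelerators at ~300 W for 30 days of training,
|   # inference served from 8 TPUs at ~200 W each
|   TRAIN_DEVICES, TRAIN_WATTS, TRAIN_DAYS = 1_000, 300, 30
|   INFER_DEVICES, INFER_WATTS = 8, 200
|
|   train_kwh = TRAIN_DEVICES * TRAIN_WATTS * TRAIN_DAYS * 24 / 1000
|   infer_kw = INFER_DEVICES * INFER_WATTS / 1000
|
|   print(f"training energy ~ {train_kwh:,.0f} kWh")      # ~216,000 kWh
|   print(f"continuous inference draw ~ {infer_kw:.1f} kW")  # ~1.6 kW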
| TrueDuality wrote:
| The easy answer is that reduced energy usage for an individual
| operation does not mean reduced energy usage overall, all
| other things being equal (see Jevons paradox).
|
| In practice, algorithms don't work in isolation. You may have
| to do additional pre-processing on your data to get it into
| this kind of model, offsetting any energy usage that you
| reduce through the execution of your model. There are any
| number of small things that aren't taken into account when
| looking at this portion of an overall solution. You can't make
| statements about carbon usage based on energy usage alone, nor
| can you make statements about energy usage based on
| computation requirements alone.
|
| The actual energy usage is not an independently derived
| factor, and it isn't determined by the work itself alone.
| Adding it as a benefit in your paper without the additional
| work of proving it is marketing fluff. It doesn't belong, and
| it actively hurts studies that focus on energy usage reduction
| with the explicit intent of reducing carbon footprints.
| Nimitz14 wrote:
| I agree it's silly. The fact that 99.9% of researchers don't
| have the resources to use these large models is apparently a
| minor detail, but a chance to shoehorn in fighting climate
| change is an unmissable opportunity.
| robbedpeter wrote:
| The margin for error makes the studies that try to aggregate
| these things almost meaningless.
|
| For example, Google Colab runs in Google data centers, which
| are carbon neutral. That's more or less true, but it also
| depends on the exact nature of their power sources and the
| accuracy of the carbon offsets they purchase, and we all know
| Google is perfectly accurate and truthful all the time. /s lol
|
| I trust Google insofar as their publicized usage of renewables
| and extensive use of solar. I also suspect that they're on-grid
| and a net producer of energy, taking advantage of deals with
| power companies and governments to eke out every last penny of
| value from their infrastructure.
|
| The problem is that without exact reporting of numbers, the
| margin of error for that source alone creates huge uncertainty
| in trying to assess the net carbon footprint of their service.
| How much research is being done using Google infrastructure?
| How much is being done on college campus data centers that run
| their HPC clusters on solar and wind? How much money is spent
| on offsets by those other sources? Again, the reality requires
| exact knowledge, since the use of offsets introduces huge
| uncertainty, so aggregate reported usage, used as a proxy,
| could be off by more than 100% of the naively assumed
| footprint.
|
| The studies are only as good as their data, and the data isn't
| very good unless it's obtained through legal mandate, via
| subpoena or regulated reporting. To my knowledge, very little
| of the data available for these estimates is anything except
| self-reported numbers. The math and analyses they do are
| great,
| but the margin of error likely exceeds 75%.
| sva_ wrote:
| Whenever somebody starts roasting current ML techniques about
| their carbon footprint, I take that as "I've got no better
| argument to present".
|
| Sure, a reduction in computational expenses is in many ways
| desirable, but I don't think the carbon footprint of a model is
| a very good metric. There are much better arguments for more
| efficient models.
|
| I guess you have to do what you have to do for those grants.
| soraki_soladead wrote:
| This is covered reasonably well in prior research, insofar as
| the required information is available. Here are some recent
| evaluations of the carbon footprint of modern models:
|
| https://arxiv.org/abs/1906.02243 (cited in the above paper)
|
| > The U.S. Environmental Protection Agency (EPA) provides
| average CO2 produced (in pounds per kilowatt-hour) for power
| consumed in the U.S. (EPA, 2018), which we use to convert power
| to estimated CO2 emissions:
|
| > CO2e = 0.954 * p_t   (p_t: total power consumption in kWh)
|
| > This conversion takes into account the relative proportions
| of different energy sources (primarily natural gas, coal,
| nuclear and renewable) consumed to produce energy in the United
| States.
|
| (Other countries are also included in the paper.)
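|
| As a minimal sketch of that conversion in code (the 300 W GPU
| and 10-hour runtime below are made-up illustrative numbers):
|
|   EPA_LB_CO2_PER_KWH = 0.954   # US average factor quoted above
|   LB_TO_KG = 0.4536
|
|   def co2e_kg(total_power_kwh):
|       # estimated CO2-equivalent emissions, in kg, for p_t kWh
|       return total_power_kwh * EPA_LB_CO2_PER_KWH * LB_TO_KG
|
|   # e.g. one 300 W GPU running flat out for 10 hours = 3 kWh
|   print(co2e_kg(3.0))   # ~1.3 kg CO2e on the average US grid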
|
| https://arxiv.org/abs/2104.10350
|
| The authors note that in many cases accurately estimating the
| carbon footprint is difficult because the information required
| is not publicly or readily available. However, they do provide
| some additional data, improved calculations, and motivations
| beyond CO2 reduction.
| eterevsky wrote:
| The abstract of the first paper says "Remarkably, the choice
| of DNN, datacenter, and processor can reduce the carbon
| footprint up to ~100-1000X." If your aim is to optimize CO2
| emissions, you have a lot of variables at your disposal, and
| the architecture of the network is just one of them.
|
| If the paper from the post had indeed tried to evaluate and
| compare various kinds of optimization, then citing the CO2
| emissions would be valid. But since it only discusses the
| improvements in the model itself, it would be much more
| productive to just point to the reductions in the required
| memory, GPU/TPU time, etc.
|
| The readers can do the carbon math themselves depending on
| how carbon-neutral their datacenter is.
| justicezyx wrote:
| Can't you read the parent?
|
| Because of the opaqueness of the electrical infrastructure,
| these carbon footprint measurements aren't precise, because
| the data is simply not there.
|
| Therefore, a better measure is just the electricity usage...
| soraki_soladead wrote:
| The papers I provided go a bit beyond "electrical usage".
| One of them is cited by the paper in question.
|
| Yes, they are approximations in lieu of more accurate data
| but that doesn't invalidate them as a tool or a motivation
| for future work such as this one.
|
| Consider further that it's not just OpenAI's models in
| question: it's every practitioner who attempts to train
| similarly large models. These practitioners may not be
| using "green" data centers, even if we generously assume
| that OpenAI does. (Microsoft's 100% renewable target for
| data centers isn't until 2025. Read another way: they may
| be trying but they're not there yet.)
|
| The available data and approximations illustrate that it is
| not accurate to assume that the average data center is powered
| by 100% renewables and carbon neutral. Thus the only
| reasonable conclusion is that more efficient models will have
| a positive impact on CO2 emissions, which is the motivation of
| the paper.
|
| Even if you don't agree, it's not completely unfounded and
| is based on at least some research and data. At the end of
| the day, is this really worth fighting against? Who wants
| less energy efficient models?
| justicezyx wrote:
| Indeed, humans cannot read...
|
| Not sure why you are rehashing the same content in different
| words...
| soraki_soladead wrote:
| I have provided recent and relevant citations with data
| and detailed comments. You have provided? What? Attacks?
| I'm honestly not even sure. Given your comment history I
| think I'm done here.
| justicezyx wrote:
| Hmm, I meant that we are just talking about the same
| thing...
| ChefboyOG wrote:
| There are also some pretty cool open source projects
| dedicated to tracking this kind of thing:
|
| https://www.comet.ml/site/introducing-codecarbon-an-open-sou...
| travisgriggs wrote:
| I clicked on the link thinking I was going to be reading about
| Forth and other "simple" programming languages.
| dunefox wrote:
| Since when are programming languages called language models?
| setr wrote:
| I believe he was thinking of programming language paradigms
| -- e.g. procedural, stack-oriented, vector-oriented,
| functional, etc.
|
| Though "learners" should have given it away as well
| BenoitEssiambre wrote:
| I've had this hypothesis for years that a good AI model would
| be something in between a neural net and a probabilistic
| grammar.
|
| In my mind it would involve some kind of generative grammar that
| would generate nodes (select from a pool), then these nodes could
| be trained. I'm thinking about something like a grammar for
| Bayesian networks or, more broadly, a generative program
| induction scheme where you'd specify a programming language
| grammar, generate programs that fit the data and tune the
| parameters.
|
| I attempted and failed to implement some of these ideas
| (described here: https://www.quora.com/What-deep-learning-ideas-
| have-you-trie...). The search space for generative programs is
| just so huge and irregular and I don't know how to keep things
| differentiable. In the paper, they mention gradient-based
| optimization, so maybe they figured out part of it.
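|
| For what it's worth, here is a toy sketch of that "grammar
| proposes the structure, then the parameters get tuned" idea.
| Everything in it (the tiny expression grammar, the finite-
| difference tuning, the target 3x + 1) is invented for
| illustration, not taken from the paper:
|
|   import math
|   import random
|
|   # Toy grammar: Expr -> x | c | (Expr + Expr) | (Expr * Expr),
|   # where each c is a tunable constant. The grammar proposes a
|   # structure; crude tuning then fits the constants to data.
|
|   def sample_expr(depth=0):
|       if depth >= 2 or random.random() < 0.4:
|           if random.random() < 0.5:
|               return 'x'
|           return ['const', random.gauss(0.0, 1.0)]
|       op = random.choice(['+', '*'])
|       return [op, sample_expr(depth + 1), sample_expr(depth + 1)]
|
|   def constants(tree):
|       if tree == 'x':
|           return []
|       if tree[0] == 'const':
|           return [tree]   # list nodes are mutable, so tuning edits them
|       return constants(tree[1]) + constants(tree[2])
|
|   def run(tree, x):
|       if tree == 'x':
|           return x
|       if tree[0] == 'const':
|           return tree[1]
|       a, b = run(tree[1], x), run(tree[2], x)
|       return a + b if tree[0] == '+' else a * b
|
|   def loss(tree, data):
|       return sum((run(tree, x) - y) ** 2 for x, y in data) / len(data)
|
|   def tune(tree, data, steps=200, lr=0.02, eps=1e-4):
|       # crude finite-difference descent on the tree's constants
|       for _ in range(steps):
|           for node in constants(tree):
|               base = loss(tree, data)
|               node[1] += eps
|               grad = (loss(tree, data) - base) / eps
|               node[1] -= eps + lr * grad
|       return loss(tree, data)
|
|   # Sample structures from the grammar, tune each, keep the best fit.
|   data = [(x, 3.0 * x + 1.0) for x in (-2, -1, 0, 1, 2)]  # target: 3x + 1
|   scored = [(tune(t, data), t) for t in (sample_expr() for _ in range(50))]
|   best = min((s for s in scored if math.isfinite(s[0])), key=lambda p: p[0])
|   print(best)
|
| The obvious gap, as noted above, is that the discrete choice
| of structure isn't differentiable here; only the constants
| are.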
___________________________________________________________________
(page generated 2021-10-14 23:01 UTC)