https://arxiv.org/abs/2210.11399

arXiv:2210.11399 [cs.CL]
Computer Science > Computation and Language
[Submitted on 20 Oct 2022 (v1), last revised 16 Nov 2022 (this version, v2)]

Title: Transcending Scaling Laws with 0.1% Extra Compute

Authors: Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q. Tran, David R. So, Siamak Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, Denny Zhou, Donald Metzler, Slav Petrov, Neil Houlsby, Quoc V. Le, Mostafa Dehghani

Abstract: Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) for a few more steps with UL2's mixture-of-denoisers objective. We show that, with almost negligible extra computational costs and no new sources of data, we substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale, we show an approximately 2x computational saving: U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving $\sim$4.4 million TPUv4 hours). We further show that this improved scaling curve leads to 'emergent abilities' on challenging BIG-Bench tasks -- for instance, U-PaLM does much better than PaLM on some tasks, or achieves better quality at much smaller scale (62B as opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many few-shot setups: English NLP tasks (e.g., commonsense reasoning, question answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual tasks (MGSM, TydiQA), MMLU, and challenging BIG-Bench tasks. Finally, we provide qualitative examples showing the new capabilities of U-PaLM for single- and multi-span infilling.

Comments: v2 has updated references/related work
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2210.11399 [cs.CL] (or arXiv:2210.11399v2 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2210.11399 (arXiv-issued DOI via DataCite)

Submission history:
From: Yi Tay
[v1] Thu, 20 Oct 2022 16:46:41 UTC (1,827 KB)
[v2] Wed, 16 Nov 2022 12:32:52 UTC (1,829 KB)
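
Note on the method described in the abstract: UL2R's core ingredient, the mixture-of-denoisers, is a data transformation rather than an architecture change. The sketch below is a minimal, illustrative Python version of span corruption with mode tokens, loosely following the UL2 recipe; the sentinel format, denoiser rates, mean span lengths, and mixture weights here are assumptions for illustration, not the paper's actual configuration (which operates on subword token IDs inside the PaLM training stack).

```python
import random

# T5-style sentinel tokens; the exact vocabulary used by U-PaLM is an assumption here.
SENTINELS = [f"<extra_id_{i}>" for i in range(100)]

# Illustrative denoiser configs loosely following the UL2 recipe:
# [R] = regular span corruption, [X] = extreme (long spans / heavy corruption),
# [S] = sequential prefix-LM denoising. The exact rates, span lengths, and
# mixture weights used for UL2R are assumptions, not taken from the paper.
DENOISERS = {"[R]": (0.15, 3.0), "[X]": (0.50, 32.0)}

def span_corrupt(tokens, rate, mean_span, rng, max_tries=1000):
    """Mask ~`rate` of `tokens` in random spans averaging `mean_span` tokens.

    Returns (inputs, targets): the input with each masked span replaced by a
    sentinel, and the concatenated masked spans the model must reconstruct.
    """
    n = len(tokens)
    budget, masked, spans, tries = int(n * rate), set(), [], 0
    while budget > 0 and tries < max_tries:
        tries += 1
        length = min(n, max(1, round(rng.expovariate(1.0 / mean_span))))
        start = rng.randrange(n - length + 1)
        if masked.isdisjoint(range(start, start + length)):
            masked.update(range(start, start + length))
            spans.append((start, start + length))
            budget -= length
    inputs, targets, prev = [], [], 0
    for k, (s, e) in enumerate(sorted(spans)):
        inputs += tokens[prev:s] + [SENTINELS[k]]
        targets += [SENTINELS[k]] + tokens[s:e]
        prev = e
    return inputs + tokens[prev:], targets

def make_example(tokens, rng):
    """Sample a denoiser, corrupt the sequence, and prepend its mode token, UL2-style."""
    if rng.random() < 1 / 3:                  # [S]: predict a suffix from a prefix
        cut = rng.randrange(1, len(tokens))
        return ["[S]"] + tokens[:cut], tokens[cut:]
    mode, (rate, mean_span) = rng.choice(list(DENOISERS.items()))
    inputs, targets = span_corrupt(tokens, rate, mean_span, rng)
    return [mode] + inputs, targets

rng = random.Random(0)
toks = "scaling language models improves performance but costs compute".split()
inp, tgt = make_example(toks, rng)
print(inp)   # e.g. ['[R]', 'scaling', '<extra_id_0>', 'improves', ...]
print(tgt)   # e.g. ['<extra_id_0>', 'language', 'models', ...]
```

At inference time the same sentinel interface yields the single- and multi-span infilling shown qualitatively in the paper: present text with sentinels in place of the missing spans and decode the corresponding targets.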
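
On the savings figure: a back-of-envelope reading of the abstract's numbers (an inference from the stated figures, not a number given on this page) runs as follows.

```latex
% If C is PaLM 540B's full pre-training budget in TPUv4 hours and U-PaLM
% matches the final model at about half that budget, then:
\[
  \text{saving} \;\approx\; C - \frac{C}{2} \;=\; \frac{C}{2}
  \;\approx\; 4.4\,\text{M TPUv4 hours}
  \quad\Longrightarrow\quad C \;\approx\; 8.8\,\text{M TPUv4 hours},
\]
% while the UL2R stage itself adds only ~0.1% of C, per the paper's title.
```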