https://arxiv.org/abs/2305.16264

Computer Science > Computation and Language

arXiv:2305.16264 (cs)
[Submitted on 25 May 2023]

Title: Scaling Data-Constrained Language Models

Authors: Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel

Abstract: The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are publicly available at this https URL.

Comments: 43 pages (9 main), 35 figures, 13 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2305.16264 [cs.CL] (or arXiv:2305.16264v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2305.16264

Submission history
From: Niklas Muennighoff
[v1] Thu, 25 May 2023 17:18:55 UTC (1,455 KB)
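The abstract's scaling law treats repeated tokens as progressively less valuable than fresh ones. As an illustration only, the Python sketch below models that effective-data idea under an assumed exponential-decay form in the number of repeated epochs; the function name `effective_tokens`, the decay constant `r_star`, and the example token counts are hypothetical placeholders, not fitted values from the paper.

```python
import math

def effective_tokens(unique_tokens: float, epochs: float, r_star: float = 15.0) -> float:
    """Fresh-equivalent token count when repeating a fixed dataset.

    Assumed form: the first epoch counts fully, and each additional epoch
    contributes exponentially less, saturating at unique_tokens * (1 + r_star).
    r_star is a placeholder decay constant, not the paper's fitted value.
    """
    repeated_epochs = max(epochs - 1.0, 0.0)  # epochs beyond the first pass
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeated_epochs / r_star)))

if __name__ == "__main__":
    U = 100e9  # 100B unique tokens (illustrative)
    for epochs in (1, 4, 16, 64, 256):
        eff = effective_tokens(U, epochs)
        print(f"{epochs:>4} epochs: {epochs * U / 1e9:8.0f}B tokens seen, "
              f"~{eff / 1e9:6.0f}B effective")
```

With these placeholder numbers, four epochs remain close to fully effective, while hundreds of epochs saturate, mirroring the abstract's observation that the value of additional repetition (and of the compute spent on it) eventually decays to zero.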