arXiv:2506.08300 [cs.CL]
https://arxiv.org/abs/2506.08300

Computer Science > Computation and Language
[Submitted on 10 Jun 2025]

Title: Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability

Authors: Matteo Cargnelutti, Catherine Brobston, John Hess, Jack Cushman, Kristi Mukk, Aristana Scourtas, Kyle Courtney, Greg Leppert, Amanda Watson, Martha Whitehead, Jonathan Zittrain

Abstract: Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods, as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read, and use.

Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2506.08300 [cs.CL] (or arXiv:2506.08300v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2506.08300 (arXiv-issued DOI via DataCite)

Submission history
From: Matteo Cargnelutti
[v1] Tue, 10 Jun 2025 00:11:30 UTC (10,879 KB)
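Because the release pairs hundreds of billions of tokens of OCR text with per-volume metadata, a corpus of this size is typically consumed by streaming and filtering rather than downloading wholesale. Below is a minimal sketch of that pattern using the Hugging Face `datasets` library; the repository id and the field names (`language`, `title`, `text`) are assumptions for illustration only and are not specified in this abstract, so consult the official release for the actual location and schema.

```python
# Minimal sketch: stream a large book corpus and filter it by metadata,
# rather than materializing ~242B tokens locally.
# NOTE: the repo id and field names below are assumptions, not confirmed
# by the report; check the official Institutional Books 1.0 release.
from datasets import load_dataset

# Streaming mode yields volumes lazily instead of downloading everything.
ds = load_dataset(
    "institutional/institutional-books-1.0",  # assumed repository id
    split="train",
    streaming=True,
)

# Keep only English-language volumes, using an assumed metadata field.
english = ds.filter(lambda vol: vol.get("language") == "eng")

# Inspect a few volumes without pulling the full corpus.
for volume in english.take(3):
    # Assumed fields: bibliographic title plus the OCR-extracted text.
    print(volume.get("title"), "->", len(volume.get("text", "")), "chars")
```

Streaming keeps memory and disk usage bounded, and metadata filters like the one above are how a downstream user would narrow the 983,004 public-domain volumes to a language, period, or subject of interest.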