[HN Gopher] Institutional Books: A 242B token dataset from Harva...
___________________________________________________________________
Institutional Books: A 242B token dataset from Harvard Library's
collections
Author : strangecasts
Score : 27 points
Date : 2025-06-11 21:36 UTC (1 hour ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| strangecasts wrote:
| From the abstract:
|
| > [...] this technical report introduces Institutional Books 1.0,
| a large collection of public domain books originally digitized
| through Harvard Library's participation in the Google Books
| project, beginning in 2006. Working with Harvard Library, we
| extracted, analyzed, and processed these volumes into an
| extensively-documented dataset of historic texts. [...] As part
| of this initial release, the OCR-extracted text (original and
| post-processed) as well as the metadata (bibliographic, source,
| and generated) of the 983,004 volumes, or 242B tokens, identified
| as being in the public domain have been made available.
| rickydroll wrote:
| LLMs are increasingly becoming a repository of cultural memory,
| far more so than search engines. By eliminating copyrighted
| content from their training sets, they will start to vanish from
| cultural memory.
|
| Couldn't happen to a nicer bunch of people.
| gshubert17 wrote:
| Edit: Two responses,
| https://news.ycombinator.com/item?id=44252450 and
| https://news.ycombinator.com/item?id=44252408, seem to be
| dupes. As rickydroll states, the timestamps and ID numbers
| show that rickydroll's comment was posted first.
| rickydroll wrote:
| It's a copy of mine. Look at the timestamps.
| rudedogg wrote:
| LLMs are increasingly becoming a repository of cultural memory,
| far more so than search engines. By eliminating copyrighted
| content from their training sets, they will start to vanish from
| cultural memory.
|
| Couldn't happen to a nicer bunch of people.
| SloopJon wrote:
| Although this is characterized as 1.0, it is governed by the
| Terms of Use for Early-Access, which are quite limiting,
| including: "You may use the Service solely for noncommercial
| purposes."
___________________________________________________________________
(page generated 2025-06-11 23:00 UTC)