[HN Gopher] Institutional Books: A 242B token dataset from Harva...
       ___________________________________________________________________
        
       Institutional Books: A 242B token dataset from Harvard Library's
       collections
        
       Author : strangecasts
       Score  : 27 points
       Date   : 2025-06-11 21:36 UTC (1 hour ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | strangecasts wrote:
       | From the abstract:
       | 
       | > [...] this technical report introduces Institutional Books 1.0,
       | a large collection of public domain books originally digitized
       | through Harvard Library's participation in the Google Books
       | project, beginning in 2006. Working with Harvard Library, we
       | extracted, analyzed, and processed these volumes into an
       | extensively-documented dataset of historic texts. [...] As part
       | of this initial release, the OCR-extracted text (original and
       | post-processed) as well as the metadata (bibliographic, source,
       | and generated) of the 983,004 volumes, or 242B tokens, identified
       | as being in the public domain have been made available.
        
       | rickydroll wrote:
       | LLMs are increasingly becoming a repository of cultural memory,
       | far more so than search engines. By eliminating copyrighted
       | content from their training sets, they will start to vanish from
       | cultural memory.
       | 
       | Couldn't happen to a nicer bunch of people.
        
         | gshubert17 wrote:
         | Edit: Two responses,
         | https://news.ycombinator.com/item?id=44252450 and
         | https://news.ycombinator.com/item?id=44252408, seem to be
         | dupes. As rickydroll states, the time stamps and id numbers
         | show rickydroll's comment to be the first.
        
           | rickydroll wrote:
           | It's a copy of mine. Look at the timestamps.
        
       | rudedogg wrote:
       | LLMs are increasingly becoming a repository of cultural memory,
       | far more so than search engines. By eliminating copyrighted
       | content from their training sets, they will start to vanish from
       | cultural memory.
       | 
       | Couldn't happen to a nicer bunch of people.
        
       | SloopJon wrote:
       | Although this is characterized as 1.0, it is governed by the
       | Terms of Use for Early-Access, which are quite limiting,
       | including: "You may use the Service solely for noncommercial
       | purposes."
        
       ___________________________________________________________________
       (page generated 2025-06-11 23:00 UTC)