[HN Gopher] The Tragedy of Google Books (2017)
       ___________________________________________________________________
        
       The Tragedy of Google Books (2017)
        
       Author : lispybanana
       Score  : 143 points
       Date   : 2024-10-22 18:11 UTC (4 hours ago)
        
 (HTM) web link (www.theatlantic.com)
 (TXT) w3m dump (www.theatlantic.com)
        
       | pvg wrote:
       | https://archive.is/rQ7Zb
        
         | datadrivenangel wrote:
         | Thanks Paul!
        
           | pvg wrote:
           | Wrong number, I'm afraid.
        
             | montag wrote:
             | Thanks Peter
        
       | senkora wrote:
       | I'm sure the lawyers will eventually figure out a way to train an
       | LLM on them.
        
         | datadrivenangel wrote:
         | They probably already have! It seems like an amazing training
         | dataset even if you can't share source data.
        
           | amelius wrote:
           | How do you train an LLM such that it is guaranteed to never
           | regurgitate its training data?
        
             | ASalazarMX wrote:
             | You punish it if parts of the answer can be found in its
             | training data, and reward it otherwise.
        
       | thayne wrote:
       | IMO if a work is out of print (or equivalent depending on the
       | medium) for more than a few years, it should be released into the
       | public domain. Or maybe something like the public domain, but
       | requires attribution.
        
         | giraffe_lady wrote:
         | Then every book will be immediately out of print after its
         | initial run, while the not-quite-a-cartel of publishers all
         | decline to print it until it hits the point where they no
         | longer have to pay the author.
        
           | Jtsummers wrote:
           | Then the publisher loses out on exclusive publishing rights
           | and also loses money. It's in their interests to keep it _in_
           | print so long as it's a profitable book, even if they have
           | to pay some percentage to the author. Once it goes into
           | public domain every publisher can reprint it and the original
           | publisher has to compete with them on price.
        
             | giraffe_lady wrote:
             | ok
        
             | jamiek88 wrote:
             | > so long as it's a profitable book
             | 
             | And here is the rub. You'll end up with three or four super
             | authors with the rest being ripped off.
             | 
             | Much better for it to revert to the author in that
             | situation IMO.
        
               | Jtsummers wrote:
               | I'm not arguing for it (or against it for that matter), I
               | was just pointing out that the analysis in the comment I
               | responded to didn't make sense. _Every_ book won't be
               | allowed to fall out of print and copyright just to
               | exploit the authors because it would also hurt the
               | publishers, who also benefit from exclusive publishing
               | rights. Publishing rights are granted by the copyright
               | holder (the author) to the publisher, much like patent
               | licenses.
               | 
               | Regarding unprofitable books, they'll fall out of print
               | anyways because they're unprofitable. Those authors won't
               | be getting ripped off because they won't be making money
               | either way beyond initial commissions and what few sales
               | they get.
               | 
               | > Much better for it to revert to the author in that
               | situation IMO.
               | 
               | The publisher doesn't hold the copyright, the author
               | does, so copyright (the particular right under
               | discussion) can't revert to the author as it never left
               | the author. What the publisher holds is publishing rights
               | per a contract with the author. That could revert back to
               | the author (or be voided or however it's structured), and
               | that would be reasonable but we don't need any laws for
               | it, that would fall under normal contract terms. Whether
               | it's a common thing now or feasible for a particular
               | author (with no clout? maybe not, with billions in sales
               | from prior books? probably) is another matter.
        
         | kps wrote:
         | Like trademark: Use it or lose it.
         | 
         | (The reality is that publishers would put lazy photocopies up
         | for sale at ten zillion dollars a piece.)
        
         | pfdietz wrote:
         | So, e-books are either immediately out of print, or never out
         | of print?
        
           | tightbookkeeper wrote:
           | What if we applied the simple test that the book was
           | originally published on paper and no other printings have
           | occurred (digital or paper).
        
           | pessimizer wrote:
           | Never out of print. If there's an e-copy available to buy,
           | that's better than millions of other books.
        
         | eschneider wrote:
         | Have you dealt with publishers? If a work is out of print for a
         | few years, much better to have rights revert to the creator.
        
       | andrewstuart wrote:
       | Google must be tempted to put them in an LLM.
        
         | bborud wrote:
         | It would surprise me greatly if they haven't already.
        
       | anoncow wrote:
       | Sad and criminal.
        
       | 2OEH8eoCRo0 wrote:
       | The tragedy is that Google is tasked with this at all. It would
       | be cool if public libraries could work together on a massive
       | _public_ digital library. This shouldn't be Google's
       | responsibility.
        
         | Jtsummers wrote:
         | Google wasn't tasked (by a third party) with this, they chose
         | to do it.
        
           | ants_everywhere wrote:
           | Arguably Google was invented to fund this project.
           | 
           | The books project predates the search engine and the search
           | engine grew out of the project of creating a universal
           | digital library. The PageRank algorithm is one of a class of
           | algorithms used to score citations in books and papers.
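           | 
           | A minimal sketch of that kind of citation scoring, assuming
           | the textbook power-iteration form of PageRank rather than
           | anything Google actually ran. The toy graph, the function
           | name and the 0.85 damping value are illustrative assumptions:

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Score works by citation. `citations` maps each work to the works it cites."""
    # Collect every work that appears as a citer or as a cited work.
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    n = len(nodes)

    # Start from a uniform score and repeatedly redistribute it.
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            cited = citations.get(node, [])
            if cited:
                # A work passes its score, split evenly, to the works it cites.
                share = damping * rank[node] / len(cited)
                for target in cited:
                    new_rank[target] += share
            else:
                # A work that cites nothing spreads its score over all works.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy corpus: A and B both cite C, and C cites D.
toy_corpus = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
scores = pagerank(toy_corpus)
for work in sorted(scores, key=scores.get, reverse=True):
    print(work, round(scores[work], 3))
```

           | In this toy graph C outranks A and B because two works cite
           | it, and D ends up highest because C hands its whole score on
           | to D.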
        
         | NoMoreNicksLeft wrote:
         | All humans everywhere have a responsibility to preserve culture
         | and knowledge to the best of their ability. I think what you
         | meant to say is that none of us can _trust_ Google with this
         | important task.
        
       | xipho wrote:
       | A huge proportion of this corpus is found in the Hathi Trust (see
       | https://www.hathitrust.org/the-collection/). We have had a grant
       | to crawl and derive an index on it via their supercomputing
       | resources. I'm sure they are looking to LLM proposals, though
       | they are exceedingly careful about the copyright issues.
       | 
       | https://www.hathitrust.org/
        
         | fredgrott wrote:
         | thank you as some of us were looking for something to replace
         | the archive.org digital book library part....
        
         | jsemrau wrote:
         | >I'm sure they are looking to LLM proposals
         | 
         | Well, it is a use case for this challenge
         | https://www.kaggle.com/competitions/gemini-long-context
        
       | philipkglass wrote:
       | These Google scans are also available in the HathiTrust [1], an
       | organization built from the big academic libraries that
       | participated in early book digitization efforts. The HathiTrust
       | is better about letting the public read books that have actually
       | fallen into the public domain. I have found many books that are
       | "snippet view" only on Google Books but freely visible on
       | HathiTrust.
       | 
       | If you are a student or researcher at one of the participating
       | HathiTrust institutions, you can also get access to scans of
       | books that are still in copyright.
       | 
       | The one advantage Google Books still has is that its search tools
       | are much faster and sometimes better, so it can be useful to
       | search for phrases or topics on Google Books and then jump over
       | to HathiTrust to read specific books surfaced by the search.
       | 
       | [1] https://www.hathitrust.org/
        
       | Zigurd wrote:
       | O'Reilly, for whom I've been a lead author and co-author, did
       | this: https://www.oreilly.com/pub/pr/1042
       | 
       | They call it Founder's Copyright. They also use Creative Commons.
       | The goal is to make out of print books available at no cost.
        
         | card_zero wrote:
         | > A complete list of available titles is at
         | www.oreilly.com/openbook
         | 
         | Exciting!
         | 
         |  _Follows link_
         | 
         |  _Link no longer exists, gets O'Reilly front page instead_
         | 
         | "Introducing the AI Academy, Help your entire org put GenAI to
         | work"
         | 
         | Thanks O'Reilly.
        
           | MollyRealized wrote:
           | It's okay, I'll just check the Wayb-- _shit_
        
           | ToucanLoucan wrote:
           | The original dream of the internet: Information, freely
           | available to any who want it.
           | 
           | The new dream of the internet: Some information, that aligns
           | with the values of our advertisers, delivered via an LLM that
           | sometimes makes shit up.
        
           | stvltvs wrote:
           | Looks like Openbook stuff is still there, just homeless. I
           | had to do a web search to find it. For example:
           | 
           | https://www.oreilly.com/openbook/make3/book/
        
             | blacksmith_tb wrote:
             | Yes, I see it all with
             | 
             | https://www.google.com/search?q=site%3Aoreilly.com+inurl%3A
             | o...
             | 
             | So it seems like it mainly lost the overview page?
        
         | microtherion wrote:
         | It's somewhat ironic that, while the individual books are still
         | accessible, their index pages https://www.oreilly.com/free and
         | https://www.oreilly.com/openbook both redirect to some AI
         | propaganda these days, with no links to the books left.
         | 
         | A third party page still has links to some (possibly all) of
         | the books: https://zapier.com/blog/free-oreilly-press-books/
        
       | svilen_dobrev wrote:
       | This seems to be the fate of knowledge/content that stays in
       | institutions which have been built with the idea of collecting it
       | and growing it.. but have turned into walled gardens/crypts of
       | sorts. They rot/rust and are forgotten.
       | 
       | A very cynical and dark view is that the new things/people need
       | that oblivion in order to feel great, so they don't have to be
       | compared with the older, greater ones. Rewriting history as the
       | current powers-that-be see fit is easier this way.
       | 
       | Or maybe it's just collective stupidity? Or societal immaturity?
       | 
       | (I am coming from a completely different killed project on a
       | different continent, but the idea is the same)
        
         | kyleee wrote:
         | I think you are on to something, people frequently don't want
         | to grapple with and understand what has been done before, they
         | prefer to just wing it and move forward on their own.
        
         | SapporoChris wrote:
         | I am fairly certain there is more knowledge/content available
         | to anyone in this century than last century or any century
         | before it. But perhaps I have misread your comment.
        
       | Animats wrote:
       | We need a Copyright Term Reduction Act.
       | 
       | It's time. 50 years, renewal is possible but expensive.
        
         | ASalazarMX wrote:
         | Even 50 is a lot, because it starts at the death of the author.
         | Popular culture shouldn't remain locked out for generations. 50
         | maximum would be ideal, two generations from the one who
         | experienced it in the original cultural context.
        
           | Animats wrote:
           | 50 years from first publication. That's all the TRIPS
           | agreement requires.[1]
           | 
           | [1] https://en.wikipedia.org/wiki/TRIPS_Agreement
        
         | mjevans wrote:
         | Just my opinion but as a starting point for the argument...
         | 
         | * 20 years from date of first publish (renewable up to CAP? 50
         |   years)
         | * Must remain available every year
         | * 10 year renewal blocks with massive registration fee increases
         | * Compulsory maximum license fee cap (can offer for less) in the
         |   laws
         | 
         | Note this is not TRADE MARK; trade marks are _consumer
         | protection_ related to 'brand ownership'.
        
       | ErikAugust wrote:
       | "Page had always wanted to digitize books. Way back in 1996, the
       | student project that eventually became Google--a "crawler" that
       | would ingest documents and rank them for relevance against a
       | user's query--was actually conceived as part of an effort "to
       | develop the enabling technologies for a single, integrated and
       | universal digital library." The idea was that in the future, once
       | all books were digitized, you'd be able to map the citations
       | among them, see which books got cited the most, and use that data
       | to give better search results to library patrons. But books still
       | lived mostly on paper. Page and his research partner, Sergey
       | Brin, developed their popularity-contest-by-citation idea using
       | pages from the World Wide Web."
       | 
       | Larry Page had some cool ideas... can't imagine Books will ever
       | be resurrected, unfortunately.
        
         | dekhn wrote:
         | He really wanted to digitize all of them to provide reference
         | and training data for early language models (well before LLMs,
         | transformers, etc).
         | 
         | He also had a plan (with George Church) to build enormous
         | warehouses holding large-scale biology research infrastructure
         | right next to google data centers. Because most biology
         | research is done at locations that have reached their limit on
         | computational/storage capacity.
         | 
         | Larry had many good ideas but he struggled to get the majority
         | of them off the ground. For example, when Trump was president
         | and invited all the major tech leaders, Larry came with a plan
         | to upgrade the US electrical system with long-range DC.
        
         | carlosjobim wrote:
         | > The idea was that in the future, once all books were
         | digitized, you'd be able to map the citations among them, see
         | which books got cited the most, and use that data to give
         | better search results to library patrons.
         | 
         | You can do something similar to this already, by mapping which
         | books are cited in Wikipedia articles. If you know how to do
         | such a thing, because I don't.
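         | 
         | One rough way to sketch it, assuming you already have the
         | wikitext of some articles (for example from a Wikipedia dump)
         | and that books are cited with the {{cite book}} template and
         | an isbn= parameter. The sample articles and ISBNs below are
         | made up, and the regexes ignore templates nested inside a
         | citation:

```python
import re
from collections import Counter

# Matches the body of a {{cite book ...}} template (no nested templates).
CITE_BOOK = re.compile(r"\{\{\s*cite book\b(.*?)\}\}", re.IGNORECASE | re.DOTALL)
# Matches the value of the isbn= parameter inside that body.
ISBN = re.compile(r"\|\s*isbn\s*=\s*([0-9Xx\- ]+)", re.IGNORECASE)


def book_isbns(wikitext):
    """Return the normalized ISBNs of the books cited in one article."""
    found = []
    for body in CITE_BOOK.findall(wikitext):
        match = ISBN.search(body)
        if match:
            # Drop hyphens and spaces so different spellings compare equal.
            isbn = re.sub(r"[^0-9Xx]", "", match.group(1)).upper()
            if isbn:
                found.append(isbn)
    return found


def count_book_citations(articles):
    """Count, per ISBN, how many articles cite that book at least once."""
    counts = Counter()
    for wikitext in articles.values():
        counts.update(set(book_isbns(wikitext)))
    return counts


# Hypothetical sample data; the titles and ISBNs are placeholders.
sample_articles = {
    "Article one": "Text.<ref>{{cite book |title=Example One "
                   "|isbn=978-0-00-000000-2}}</ref>",
    "Article two": "{{Cite book | title = Example One | isbn = 9780000000002 }} "
                   "and {{cite book |title=Example Two |isbn=0-00-000001-X}}",
}
counts = count_book_citations(sample_articles)
for isbn, n_articles in counts.most_common():
    print(isbn, n_articles)
```

         | Sorting by the count gives a crude "most cited books" list.
         | The datasets linked in the reply below are the more careful
         | version of the same idea.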
        
           | aspenmayer wrote:
           | Not specific to Wikipedia:
           | 
           | https://aarontay.medium.com/3-new-tools-to-try-for-
           | literatur...
           | 
           | https://archive.is/Ul13s
           | 
           | Specific to Wikipedia:
           | 
           | Wikipedia Citations: Reproducible Citation Extraction from
           | Multilingual Wikipedia [2024]
           | 
           | https://arxiv.org/abs/2406.19291v1
           | 
           | https://doi.org/10.48550/arXiv.2406.19291
           | 
           | > Wikipedia is an essential component of the open science
           | ecosystem, yet it is poorly integrated with academic open
           | science initiatives. Wikipedia Citations is a project that
           | focuses on extracting and releasing comprehensive datasets of
           | citations from Wikipedia. A total of 29.3 million citations
           | were extracted from English Wikipedia in May 2020. Following
           | this one-off research project, we designed a reproducible
           | pipeline that can process any given Wikipedia dump in the
           | cloud-based settings. To demonstrate its usability, we
           | extracted 40.6 million citations in February 2023 and 44.7
           | million citations in February 2024. Furthermore, we equipped
           | the pipeline with an adapted Wikipedia citation template
           | translation module to process multilingual Wikipedia articles
           | in 15 European languages so that they are parsed and mapped
           | into a generic structured citation template. This paper
           | presents our open-source software pipeline to retrieve,
           | classify, and disambiguate citations on demand from a given
           | Wikipedia dump.
           | 
           | Prior work referenced in above abstract with some team
           | overlap:
           | 
           | Wikipedia citations: A comprehensive data set of citations
           | with identifiers extracted from English Wikipedia [2021]
           | 
           | https://direct.mit.edu/qss/article/2/1/1/97565/Wikipedia-
           | cit...
           | 
           | https://doi.org/10.1162/qss_a_00105
           | 
           | Datasets:
           | 
           | A Comprehensive Dataset of Classified Citations with
           | Identifiers from English Wikipedia (2024)
           | 
           | https://zenodo.org/records/10782978
           | 
           | https://doi.org/10.5281/zenodo.10782978
           | 
           | A Comprehensive Dataset of Classified Citations with
           | Identifiers from Multilingual Wikipedia (2024)
           | 
           | https://zenodo.org/records/11210434
           | 
           | https://doi.org/10.5281/zenodo.11210434
           | 
           | Code (MIT License):
           | 
           | https://github.com/albatros13/wikicite
           | 
           | https://github.com/albatros13/wikicite/tree/multilang
           | 
           | Bonus links:
           | 
           | https://www.mediawiki.org/wiki/Alternative_parsers
           | 
           | https://scholarlykitchen.sspnet.org/2022/11/01/guest-post-
           | wi...
        
       | submeta wrote:
       | With library genesis, who needs Google Books anymore? I buy books
       | physically to support the author/s and download an epub version
       | from said site to my kindle. The physical books I hardly read,
       | they are for my shelf. I love the feeling of printed books,
       | but I read in bed, and it's easier to hold an ebook. Also
       | I read when I commute. It's lighter to have my Kindle Oasis with
       | me with tons of books on it.
        
         | ghaff wrote:
         | There's the everything-available-online-for-free mindset. But,
         | yes, I've basically donated all my books that were in the
         | public domain. And, in general, have been massively purging my
         | book collection of stuff I won't realistically read again.
        
           | submeta wrote:
           | I do buy books, to support the authors. And I would encourage
           | anyone to support the authors they like to read.
        
             | ASalazarMX wrote:
             | I agree, but also wouldn't lose sleep over pirating a book
             | by an author that died more than 20 years ago, in most
             | contexts.
        
       | yonran wrote:
       | > Dan Clancy, the Google engineering lead on the project who
       | helped design the settlement, thinks that it was a particular
       | brand of objector--not Google's competitors but "sympathetic
       | entities" you'd think would be in favor of it, like library
       | enthusiasts, academic authors, and so on--that ultimately flipped
       | the DOJ.
       | 
       | I was at Google in 2009 on a team adjacent to Dan Clancy when he
       | was most excited about the Authors' Guild negotiations to publish
       | orphan works and create a portal to pay copyright holders who
       | signed up, and I recall that one opponent that he was frustrated
       | at was Brewster Kahle of the Internet Archive, who filed a
       | jealous amicus brief
       | (https://docs.justia.com/cases/federal/district-courts/new-yo...)
       | complaining that the Authors' Guild settlement would not grant
       | him access to publishing orphan works too. In my opinion Kahle
       | was wrong; the existence of one orphan works clearinghouse would
       | have encouraged Congress to grant more libraries access instead
       | of doing nothing, which is what actually happened in the 15 years
       | since then. Instead of one company selling out-of-print but
       | in-copyright books, or multiple organizations, no one is allowed to
       | sell them today.
       | 
       | Since then, of course, Brewster Kahle launched an e-library of
       | copyrighted books without legal authorization anyway which will
       | probably be the death of the current organization that runs the
       | Internet Archive. Tragic all around.
        
         | jamiek88 wrote:
         | That pandemic library was a huge, obvious overstep by him.
         | 
         | It will have consequences far beyond the immediate lawsuit too.
         | 
         | The very concept has basically been iced for a generation and
         | the net is only getting more locked down not less.
        
         | mastazi wrote:
         | This is an insightful comment and I thank you for sharing it
         | but, after having looked at the brief you linked
         | 
         | > a jealous amicus brief that the Authors' Guild settlement
         | would not grant him access to publishing orphan works too
         | 
         | that's not a fair overview of the amicus brief, there are good
         | points there about the process of notifying orphan works rights
         | holders and about the risk of a monopolistic position. I do
         | agree with you on this part though
         | 
         | > the existence of one orphan works clearinghouse would have
         | encouraged Congress to grant more libraries access instead of
         | doing nothing
         | 
         | Edit: I also agree with you that the way the IA subsequently
         | created its e-library was not ideal.
        
           | lokar wrote:
           | I would say it's much worse than "not ideal", they may have
           | poisoned the well for decades to come.
        
             | adastra22 wrote:
             | Maybe permanently, as societal stances on these sorts of
             | issues tend to solidify over time. In a couple of
             | generations the very idea of a library may be confined to
             | history thanks to IA :(
        
         | chambers wrote:
         | I wish the contradiction you spotted was clear on their
         | Wikipedia page. It demonstrates how far back IA's management
         | troubles go, and how their clean image was maybe just an image.
         | 
         | For me, I became concerned when they fibbed about why the
         | Internet Archive Credit Union was liquidated. IA alleged it was
         | shut down due to onerous regulations, but the government said
         | IA actually never lived up to their goal of allowing local,
         | low-income folk to sign-up for their service.
         | https://ncua.gov/newsroom/press-release/2016/internet-archiv...
        
         | pessimizer wrote:
         | Thanks for making me aware of this. This guy's heart is clearly
         | (to me) in the right place, but his understanding of power is
         | seriously lacking. That's probably what gave him the hubris to
         | create Wayback and IA, but he'll be absolutely dumbstruck when
         | they shut it down.
        
       | caseysoftware wrote:
       | I worked at the Library of Congress on their Digital Preservation
       | Project, circa 2001-2003. The stated goal was to "digitize all of
       | the Library's collections" and while most people think of books,
       | I was in the Motion Picture Broadcast and Recorded Sound
       | Division.
       | 
       | In our collection were Thomas Edison's first motion pictures,
       | wire spool recordings from reporters at D-Day, and LPs of some of
       | the greatest musicians of all time. And that was just our
       | Division. Others - like American Heritage - had photos from the
       | US Civil War and more.
       | 
       | Anyway, while the Rights information is one big, ugly tangled
       | web, the other side is the hardware to read the formats. Much of
       | the media is fragile and/or dangerous to use so you have to be
       | exceptionally careful. Then you have to document all the settings
       | you used because imagine that three months from now, you learn
       | some filter you used was wrong or the hardware was
       | misconfigured.. you need to go back and understand what was
       | affected and how.
       | 
       | Cool space. I wish I'd worked there longer.
        
         | caseysoftware wrote:
         | Also.. it was fun learning the answer to "what is the work?"
         | 
         | If you have an LP or wire spool recording, the audio is the
         | key, obvious work. But then you have the album cover, the spool
         | case, and the physical condition of the media. Being able to
         | see an album cover or read a reporter's notes/labeling is
         | almost as important as the audio.
        
         | ForHackernews wrote:
         | Is the Library of Congress really beholden to copyright laws? I
         | guess I assumed as the national deposit library they had a
         | special exemption to copy any damn thing they pleased for
         | archival purposes.
         | 
         | If they don't have that prerogative, they probably should, and
         | Congress should legislate that to be the case.
        
       | carlosjobim wrote:
       | For Kagi users, I recommend putting books.google.com as a pinned
       | domain. This way, you'll often be presented with some of the
       | best sources for any search query. Then it's a matter of finding
       | the ePub file of that book. To read on macOS, FBReader is a high
       | quality app.
        
       ___________________________________________________________________
       (page generated 2024-10-22 23:00 UTC)