[HN Gopher] The Tragedy of Google Books (2017)
___________________________________________________________________
The Tragedy of Google Books (2017)
Author : lispybanana
Score : 143 points
Date : 2024-10-22 18:11 UTC (4 hours ago)
(HTM) web link (www.theatlantic.com)
(TXT) w3m dump (www.theatlantic.com)
| pvg wrote:
| https://archive.is/rQ7Zb
| datadrivenangel wrote:
| Thanks Paul!
| pvg wrote:
| Wrong number, I'm afraid.
| montag wrote:
| Thanks Peter
| senkora wrote:
| I'm sure the lawyers will eventually figure out a way to train an
| LLM on them.
| datadrivenangel wrote:
| They probably already have! It seems like an amazing training
| dataset even if you can't share source data.
| amelius wrote:
| How do you train an LLM such that it is guaranteed to never
| regurgitate its training data?
| ASalazarMX wrote:
| You punish it if parts of the answer can be found in its
| training data, and reward it otherwise.
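A minimal sketch of the "punish it if parts of the answer can be found in its training data" idea, assuming overlap is measured as shared word n-grams. The function names, the 5-gram window, and the linear reward are illustrative assumptions, not a method described in the thread. It only measures verbatim overlap, so it does not give the guarantee asked about above.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-token tuples in `tokens`."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(answer, corpus_docs, n=5):
    """Fraction of the answer's word n-grams that also occur in the corpus."""
    answer_grams = ngrams(answer.lower().split(), n)
    if not answer_grams:
        # Answers shorter than n words have no n-grams to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc.lower().split(), n)
    return len(answer_grams & corpus_grams) / len(answer_grams)


def reward(answer, corpus_docs, n=5):
    """Hypothetical reward: 1.0 for no overlap, falling to 0.0 for full overlap."""
    return 1.0 - overlap_fraction(answer, corpus_docs, n)
```

Paraphrases and near-copies score as "no overlap" here, which is one reason a check like this can discourage regurgitation but cannot rule it out.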
| thayne wrote:
| IMO if a work is out of print (or equivalent depending on the
| medium) for more than a few years, it should be released into the
| public domain. Or maybe something like the public domain, but
| requires attribution.
| giraffe_lady wrote:
| Then every book will be immediately out of print after its
| initial run, while the not-quite-a-cartel of publishers all
| decline to print it until it hits the point where they no
| longer have to pay the author.
| Jtsummers wrote:
| Then the publisher loses out on exclusive publishing rights
| and also loses money. It's in their interests to keep it _in_
| print so long as it's a profitable book, even if they have
| to pay some percentage to the author. Once it goes into
| public domain every publisher can reprint it and the original
| publisher has to compete with them on price.
| giraffe_lady wrote:
| ok
| jamiek88 wrote:
| > so long as it's a profitable book
|
| And here is the rub. You'll end up with three or four super
| authors with the rest being ripped off.
|
| Much better for it to revert to the author in that
| situation IMO.
| Jtsummers wrote:
| I'm not arguing for it (or against it for that matter), I
| was just pointing out that the analysis in the comment I
| responded to didn't make sense. _Every_ book won't be
| allowed to fall out of print and copyright just to
| exploit the authors because it would also hurt the
| publishers, they also benefit from exclusive publishing
| rights. Publishing rights are granted by the copyright
| holder (the author) to the publisher, much like patent
| licenses.
|
| Regarding unprofitable books, they'll fall out of print
| anyways because they're unprofitable. Those authors won't
| be getting ripped off because they won't be making money
| either way beyond initial commissions and what few sales
| they get.
|
| > Much better for it to revert to the author in that
| situation IMO.
|
| The publisher doesn't hold the copyright, the author
| does, so copyright (the particular right under
| discussion) can't revert to the author as it never left
| the author. What the publisher holds is publishing rights
| per a contract with the author. That could revert back to
| the author (or be voided or however it's structured), and
| that would be reasonable but we don't need any laws for
| it, that would fall under normal contract terms. Whether
| it's a common thing now or feasible for a particular
| author (with no clout? maybe not, with billions in sales
| from prior books? probably) is another matter.
| kps wrote:
| Like trademark: Use it or lose it.
|
| (The reality is that publishers would put lazy photocopies up
| for sale at ten zillion dollars a piece.)
| pfdietz wrote:
| So, e-books are either immediately out of print, or never out
| of print?
| tightbookkeeper wrote:
| What if we applied the simple test that the book was
| originally published on paper and no other printings have
| occurred (digital or paper).
| pessimizer wrote:
| Never out of print. If there's an e-copy available to buy,
| that's better than millions of other books.
| eschneider wrote:
| Have you dealt with publishers? If a work is out of print for a
| few years, much better to have rights revert to the creator.
| andrewstuart wrote:
| Google must be tempted to put them in an LLM.
| bborud wrote:
| It would surprise me greatly if they haven't already.
| anoncow wrote:
| Sad and criminal.
| 2OEH8eoCRo0 wrote:
| The tragedy is that Google is tasked with this at all. It would
| be cool if public libraries could work together on a massive
| _public_ digital library. This shouldn't be Google's
| responsibility.
| Jtsummers wrote:
| Google wasn't tasked (by a third party) with this, they chose
| to do it.
| ants_everywhere wrote:
| arguably Google was invented to fund this project.
|
| The books project predates the search engine and the search
| engine grew out of the project of creating a universal
| digital library. The PageRank algorithm is one of a class of
| algorithms used to score citations in books and papers.
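To make the citation-scoring idea concrete, here is a small sketch of the textbook PageRank power iteration applied to a citation graph. The function name, the 0.85 damping factor, and the fixed iteration count are assumptions for illustration, not Google's actual implementation.

```python
def pagerank(cites, damping=0.85, iterations=50):
    """Score works by citations. `cites` maps each work to the works it cites."""
    nodes = set(cites)
    for targets in cites.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    if n == 0:
        return {}
    # Start from a uniform distribution over all works.
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every work receives the same small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = cites.get(node, [])
            if targets:
                # A work passes its rank on to the works it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A work that cites nothing spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

For example, `pagerank({"A": ["C"], "B": ["C"], "C": []})` ranks the twice-cited work C above A and B, and the scores sum to 1.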
| NoMoreNicksLeft wrote:
| All humans everywhere have a responsibility to preserve culture
| and knowledge to the best of their ability. I think what you
| meant to say is that none of us can _trust_ Google with this
| important task.
| xipho wrote:
| A huge proportion of this corpus is found in the Hathi Trust (see
| https://www.hathitrust.org/the-collection/). We have had a grant
| to crawl and derive an index on it via their supercomputing
| resources. I'm sure they are looking to LLM proposals, though
| they are exceedingly careful about the copyright issues.
|
| https://www.hathitrust.org/
| fredgrott wrote:
| thank you as some of us were looking for something to replace
| the archive.org digital book library part....
| jsemrau wrote:
| >I'm sure they are looking to LLM proposals
|
| Well, it is a use case for this challenge
| https://www.kaggle.com/competitions/gemini-long-context
| philipkglass wrote:
| These Google scans are also available in the HathiTrust [1], an
| organization built from the big academic libraries that
| participated in early book digitization efforts. The HathiTrust
| is better about letting the public read books that have actually
| fallen into the public domain. I have found many books that are
| "snippet view" only on Google Books but freely visible on
| HathiTrust.
|
| If you are a student or researcher at one of the participating
| HathiTrust institutions, you can also get access to scans of
| books that are still in copyright.
|
| The one advantage Google Books still has is that its search tools
| are much faster and sometimes better, so it can be useful to
| search for phrases or topics on Google Books and then jump over
| to HathiTrust to read specific books surfaced by the search.
|
| [1] https://www.hathitrust.org/
| Zigurd wrote:
| O'Reilly, for whom I've been a lead author and co-author, did
| this: https://www.oreilly.com/pub/pr/1042
|
| They call it Founder's Copyright. They also use Creative Commons.
| The goal is to make out of print books available at no cost.
| card_zero wrote:
| > A complete list of available titles is at
| www.oreilly.com/openbook
|
| Exciting!
|
| _Follows link_
|
| _Link no longer exists, gets O'Reilly front page instead_
|
| "Introducing the AI Academy, Help your entire org put GenAI to
| work"
|
| Thanks O'Reilly.
| MollyRealized wrote:
| It's okay, I'll just check the Wayb-- _shit_
| ToucanLoucan wrote:
| The original dream of the internet: Information, freely
| available to any who want it.
|
| The new dream of the internet: Some information, that aligns
| with the values of our advertisers, delivered via an LLM that
| sometimes makes shit up.
| stvltvs wrote:
| Looks like Openbook stuff is still there, just homeless. I
| had to do a web search to find it. For example:
|
| https://www.oreilly.com/openbook/make3/book/
| blacksmith_tb wrote:
| Yes, I see it all with
|
| https://www.google.com/search?q=site%3Aoreilly.com+inurl%3A
| o...
|
| So it seems like it mainly lost the overview page?
| microtherion wrote:
| It's somewhat ironic that, while the individual books are still
| accessible, their index pages https://www.oreilly.com/free and
| https://www.oreilly.com/openbook both redirect to some AI
| propaganda these days, with no links to the books left.
|
| A third party page still has links to some (possibly all) of
| the books: https://zapier.com/blog/free-oreilly-press-books/
| svilen_dobrev wrote:
| This seems to be the fate of knowledge/content that stays in
| institutions which have been built with the idea of collecting it
| and growing it.. but have turned into walled gardens/crypts of
| sort. Rot/Rust and be forgotten.
|
| A very cynical and dark view is that the new things/people need
| that oblivion in order to feel great, by not having to compare
| themselves with the older, greater ones. Rewriting history as the
| current powers-that-be see fit is easier this way.
|
| Or maybe it's just collective stupidity? Or societal immaturity?
|
| (i am coming from completely different killed project on a
| different continent, but the idea is the same)
| kyleee wrote:
| I think you are on to something, people frequently don't want
| to grapple with and understand what has been done before, they
| prefer to just wing it and move forward on their own.
| SapporoChris wrote:
| I am fairly certain there is more knowledge/content available
| to anyone in this century than last century or any century
| before it. But perhaps I have misread your comment.
| Animats wrote:
| We need a Copyright Term Reduction Act.
|
| It's time. 50 years, renewal is possible but expensive.
| ASalazarMX wrote:
| Even 50 is a lot, because it starts at the death of the author.
| Popular culture shouldn't remain locked out for generations. 50
| maximum would be ideal, two generations from the one who
| experienced it in the original cultural context.
| Animats wrote:
| 50 years from first publication. That's all the TRIPS
| agreement requires.[1]
|
| [1] https://en.wikipedia.org/wiki/TRIPS_Agreement
| mjevans wrote:
| Just my opinion but as a starting point for the argument...
| * 20 years from date of first publish (renewable up to CAP? 50
|   years)
| * Must remain available every year
| * 10 year renewal blocks with massive registration fee increases
| * Compulsory maximum license fee cap (can offer for less) in the
|   laws
|
| Note this is not TRADE MARK; trade marks are _consumer
| protection_ related to 'brand ownership'.
| ErikAugust wrote:
| "Page had always wanted to digitize books. Way back in 1996, the
| student project that eventually became Google--a "crawler" that
| would ingest documents and rank them for relevance against a
| user's query--was actually conceived as part of an effort "to
| develop the enabling technologies for a single, integrated and
| universal digital library." The idea was that in the future, once
| all books were digitized, you'd be able to map the citations
| among them, see which books got cited the most, and use that data
| to give better search results to library patrons. But books still
| lived mostly on paper. Page and his research partner, Sergey
| Brin, developed their popularity-contest-by-citation idea using
| pages from the World Wide Web."
|
| Larry Page had some cool ideas... can't imagine Books will ever
| be resurrected, unfortunately.
| dekhn wrote:
| He really wanted to digitize all of them to provide reference
| and training data for early language models (well before LLMs,
| transformers, etc).
|
| He also had a plan (with George Church) to build enormous
| warehouses holding large-scale biology research infrastructure
| right next to google data centers. Because most biology
| research is done at locations that have reached their limit on
| computational/storage capacity.
|
| Larry had many good ideas but he struggled to get the majority
| of them off the ground. For example, when Trump was president
| and invited all the major tech leaders, Larry came with a plan
| to upgrade the US electrical system with long-range DC.
| carlosjobim wrote:
| > The idea was that in the future, once all books were
| digitized, you'd be able to map the citations among them, see
| which books got cited the most, and use that data to give
| better search results to library patrons.
|
| You can do something similar to this already, by mapping which
| books are cited in Wikipedia articles. If you know how to do
| such a thing, because I don't.
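One rough way to do this, sketched under simplifying assumptions: scan article wikitext for `{{cite book}}` templates, pull out the `isbn` field, and count how many articles cite each ISBN. The function names and the `articles` input (a dict of title to wikitext that you have already obtained, for example from a dump) are hypothetical. The regular expressions ignore nested templates and other citation styles, so treat this as a starting point only.

```python
import re
from collections import Counter

# Matches a {{cite book ...}} template up to the first closing braces.
# Assumption: no nested templates inside the citation.
CITE_BOOK = re.compile(r"\{\{\s*cite book\b(.*?)\}\}", re.IGNORECASE | re.DOTALL)
# Matches the isbn parameter and captures digits, X, hyphens and spaces.
ISBN_FIELD = re.compile(r"\|\s*isbn\s*=\s*([0-9Xx][0-9Xx\- ]*)", re.IGNORECASE)


def cited_isbns(wikitext):
    """Return the normalized ISBNs found in cite-book templates, in order."""
    isbns = []
    for template in CITE_BOOK.finditer(wikitext):
        match = ISBN_FIELD.search(template.group(1))
        if match:
            # Drop hyphens and whitespace so equivalent spellings compare equal.
            isbns.append(re.sub(r"[\s-]", "", match.group(1)).upper())
    return isbns


def count_book_citations(articles):
    """Count, per ISBN, how many articles cite it at least once."""
    counts = Counter()
    for text in articles.values():
        counts.update(set(cited_isbns(text)))
    return counts
```

Calling `count_book_citations(articles).most_common(10)` would then list the most-cited books in whatever set of articles was supplied.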
| aspenmayer wrote:
| Not specific to Wikipedia:
|
| https://aarontay.medium.com/3-new-tools-to-try-for-
| literatur...
|
| https://archive.is/Ul13s
|
| Specific to Wikipedia:
|
| Wikipedia Citations: Reproducible Citation Extraction from
| Multilingual Wikipedia [2024]
|
| https://arxiv.org/abs/2406.19291v1
|
| https://doi.org/10.48550/arXiv.2406.19291
|
| > Wikipedia is an essential component of the open science
| ecosystem, yet it is poorly integrated with academic open
| science initiatives. Wikipedia Citations is a project that
| focuses on extracting and releasing comprehensive datasets of
| citations from Wikipedia. A total of 29.3 million citations
| were extracted from English Wikipedia in May 2020. Following
| this one-off research project, we designed a reproducible
| pipeline that can process any given Wikipedia dump in the
| cloud-based settings. To demonstrate its usability, we
| extracted 40.6 million citations in February 2023 and 44.7
| million citations in February 2024. Furthermore, we equipped
| the pipeline with an adapted Wikipedia citation template
| translation module to process multilingual Wikipedia articles
| in 15 European languages so that they are parsed and mapped
| into a generic structured citation template. This paper
| presents our open-source software pipeline to retrieve,
| classify, and disambiguate citations on demand from a given
| Wikipedia dump.
|
| Prior work referenced in above abstract with some team
| overlap:
|
| Wikipedia citations: A comprehensive data set of citations
| with identifiers extracted from English Wikipedia [2021]
|
| https://direct.mit.edu/qss/article/2/1/1/97565/Wikipedia-
| cit...
|
| https://doi.org/10.1162/qss_a_00105
|
| Datasets:
|
| A Comprehensive Dataset of Classified Citations with
| Identifiers from English Wikipedia (2024)
|
| https://zenodo.org/records/10782978
|
| https://doi.org/10.5281/zenodo.10782978
|
| A Comprehensive Dataset of Classified Citations with
| Identifiers from Multilingual Wikipedia (2024)
|
| https://zenodo.org/records/11210434
|
| https://doi.org/10.5281/zenodo.11210434
|
| Code (MIT License):
|
| https://github.com/albatros13/wikicite
|
| https://github.com/albatros13/wikicite/tree/multilang
|
| Bonus links:
|
| https://www.mediawiki.org/wiki/Alternative_parsers
|
| https://scholarlykitchen.sspnet.org/2022/11/01/guest-post-
| wi...
| submeta wrote:
| With library genesis, who needs Google Books anymore? I buy books
| physically to support the author/s and download an epub version
| from said site to my kindle. The physical books I hardly read,
| they are for my shelf. Although I love the feeling of printed
| books, but I read in bed, and it's easier to hold an ebook. Also
| I read when I commute. It's lighter to have my Kindle Oasis with
| me with tons of books on it.
| ghaff wrote:
| There's the everything available online for free mindset. But,
| yes, I've basically donated all my books that were in the
| public domain. And, in general, have been massively purging my
| book collection of stuff I won't realistically read again.
| submeta wrote:
| I do buy books, to support the authors. And I would encourage
| anyone to support the authors they like to read.
| ASalazarMX wrote:
| I agree, but also wouldn't lose sleep for pirating a book
| of an author that died more than 20 years ago, in most
| contexts.
| yonran wrote:
| > Dan Clancy, the Google engineering lead on the project who
| helped design the settlement, thinks that it was a particular
| brand of objector--not Google's competitors but "sympathetic
| entities" you'd think would be in favor of it, like library
| enthusiasts, academic authors, and so on--that ultimately flipped
| the DOJ.
|
| I was at Google in 2009 on a team adjacent to Dan Clancy when he
| was most excited about the Authors' Guild negotiations to publish
| orphan works and create a portal to pay copyright holders who
| signed up, and I recall that one opponent that he was frustrated
| at was Brewster Kahle of the Internet Archive, who filed a
| jealous amicus brief
| (https://docs.justia.com/cases/federal/district-courts/new-yo...)
| complaining that the Authors' Guild settlement would not grant
| him access to publishing orphan works too. In my opinion Kahle
| was wrong; the existence of one orphan works clearinghouse would
| have encouraged Congress to grant more libraries access instead
| of doing nothing, which is what actually happened in the 15 years
| since then. Instead of one company selling out-of-print but in-
| copyright books, or multiple organizations, no one is allowed to
| sell them today.
|
| Since then, of course, Brewster Kahle launched an e-library of
| copyrighted books without legal authorization anyway which will
| probably be the death of the current organization that runs the
| Internet Archive. Tragic all around.
| jamiek88 wrote:
| That pandemic library was a huge, obvious over step by him.
|
| It will have consequences far beyond the immediate lawsuit too.
|
| The very concept has basically been iced for a generation and
| the net is only getting more locked down not less.
| mastazi wrote:
| This is an insightful comment and I thank you for sharing it
| but, after having looked at the brief you linked
|
| > a jealous amicus brief that the Authors' Guild settlement
| would not grant him access to publishing orphan works too
|
| that's not a fair overview of the amicus brief, there are good
| points there about the process of notifying orphan works rights
| holders and about the risk of a monopolistic position. I do
| agree with you on this part though
|
| > the existence of one orphan works clearinghouse would have
| encouraged Congress to grant more libraries access instead of
| doing nothing
|
| Edit: I also agree with you that the way the IA subsequently
| created its e-library was not ideal.
| lokar wrote:
| I would say it's much worse than "not ideal", they may have
| poisoned the well for decades to come.
| adastra22 wrote:
| Maybe permanently, as societal stances on these sorts of
| issues tend to solidify over time. In a couple of
| generations the very idea of a library may be confined to
| history thanks to IA :(
| chambers wrote:
| I wish the contradiction you spotted was clear on their
| Wikipedia page. It demonstrates how far back IA's management
| troubles go, and how their clean image was maybe just an image.
|
| For me, I became concerned when they fibbed about why the
| Internet Archive Credit Union was liquidated. IA alleged it was
| shut down due to onerous regulations, but the government said
| IA actually never lived up to their goal of allowing local,
| low-income folk to sign-up for their service.
| https://ncua.gov/newsroom/press-release/2016/internet-archiv...
| pessimizer wrote:
| Thanks for making me aware of this. This guy's heart is clearly
| (to me) in the right place, but his understanding of power is
| seriously lacking. That's probably what gave him the hubris to
| create Wayback and IA, but he'll be absolutely dumbstruck when
| they shut it down.
| caseysoftware wrote:
| I worked at the Library of Congress on their Digital Preservation
| Project, circa 2001-2003. The stated goal was to "digitize all of
| the Library's collections" and while most people think of books,
| I was in the Motion Picture Broadcast and Recorded Sound
| Division.
|
| In our collection were Thomas Edison's first motion pictures,
| wire spool recordings from reporters at D-Day, and LPs of some of
| the greatest musicians of all time. And that was just our
| Division. Others - like American Heritage - had photos from the
| US Civil War and more.
|
| Anyway, while the Rights information is one big, ugly tangled
| web, the other side is the hardware to read the formats. Much of
| the media is fragile and/or dangerous to use so you have to be
| exceptionally careful. Then you have to document all the settings
| you used because imagine that three months from now, you learn
| some filter you used was wrong or the hardware was
| misconfigured.. you need to go back and understand what was
| affected how.
|
| Cool space. I wish I'd worked there longer.
| caseysoftware wrote:
| Also.. it was fun learning the answer to "what is the work?"
|
| If you have an LP or wire spool recording, the audio is the
| key, obvious work. But then you have the album cover, the spool
| case, and the physical condition of the media. Being able to
| see an album cover or read a reporter's notes/labeling is
| almost as important as the audio.
| ForHackernews wrote:
| Is the Library of Congress really beholden to copyright laws? I
| guess I assumed as the national deposit library they had a
| special exemption to copy any damn thing they pleased for
| archival purposes.
|
| If they don't have that prerogative, they probably should, and
| Congress should legislate that to be the case.
| carlosjobim wrote:
| For Kagi users, I recommend putting books.google.com as a pinned
| domain. This way, you'll many times be presented with some of the
| best sources for any search query. Then it's a matter of finding
| the ePub file of that book. To read on MacOS, FBReader is a high
| quality app.
___________________________________________________________________
(page generated 2024-10-22 23:00 UTC)