[HN Gopher] Anthropic destroyed millions of print books to build...
___________________________________________________________________
Anthropic destroyed millions of print books to build its AI models
Author : bayindirh
Score : 9 points
Date : 2025-06-25 21:06 UTC (1 hour ago)
(HTM) web link (arstechnica.com)
(TXT) w3m dump (arstechnica.com)
| JohnFen wrote:
| > In the process, the company cut millions of print books from
| their bindings, scanned them into digital files, and threw away
| the originals solely for the purpose of training AI
|
| Oh boy. The more I learn about how genAI companies work, the more
| detestable they appear to be.
| ThrowawayR2 wrote:
| You got suckered by the clickbait. Destructive scanning
| (https://en.wikipedia.org/wiki/Book_scanning#Destructive_scan...)
| isn't unusual for books that are common enough that an
| individual volume is of no particular value.
| bayindirh wrote:
| I mean, they could have gotten e-book versions of the books,
| or even preprint PDFs.
|
| In an era where people are starting to calculate the
| environmental impact of the jobs they run on the cloud and
| start to optimize it, adding that much load on the recycling
| system is not a wise choice, but only a selfish one.
| ThrowawayR2 wrote:
| I'm sure they would have loved to save the hassle and
| expense of disassembling physical books. Presumably
| something legal related or cost related prevented them from
| going that route.
| JohnFen wrote:
| Yes, they did it as a workaround for copyright. TFA
| explains that aspect.
| AlotOfReading wrote:
| I strongly suspect that dealing with ebooks on this scale
| might actually be even more onerous than the physical
| volumes.
|
| The physical stuff is straightforward. Buy books from bulk
| sellers, rip off everything and put them into off-the-shelf
| rigs for digitization. It's straightforward, directly
| scalable, can use any book, and your main issue is format
| shifting, which Anthropic successfully argued here. No DRM,
| you buy exactly the books you need, and every book is
| processed exactly the same way.
|
| If you try to buy ebooks, you get wrapped up in onerous
| licensing terms about copying, and how you're able to use
| them, how long you're able to access them, and so on. Many
| books won't even be available (or can only be licensed
| alongside a bunch of others) and you have to deal with DRM
| you can't strip without creating additional copyright
| issues.
|
| We've somehow created a world where physical objects are
| more free than bits.
| JohnFen wrote:
| I didn't get suckered by anything. I'm aware of the practice.
| I find it objectionable. That they did this is just another
| thing on the growing list of objectionable things that genAI
| companies seem to enjoy doing.
|
| To be honest, I probably wouldn't have even commented on it
| if it were the only bad thing these companies do.
| EA-3167 wrote:
| I don't like Anthropic, I think their "marketing through fear"
| approach is shitty, and frankly I'm over the AI "boom" anyway.
|
| BUT... here's the only line in that whole article that really
| matters, because this is a headline meant to create an impression
| that isn't corrected for quite a while.
|
| > The court documents don't indicate that any rare books were
| destroyed in this process--Anthropic purchased its books in bulk
| from major retailers
|
| Books are routinely pulped and recycled, they aren't holy, and if
| they aren't rare then frankly who cares what techniques they use
| to scan them? The issue is whether or not "AI" learning
| represents fair use, which the courts so far have ruled that it
| does.
| bayindirh wrote:
| > any rare books were destroyed in this process
|
| Does it matter? It's waste at the end of the day. Instead they
| could have bought e-books. Just because we can recycle paper,
| it doesn't mean we have the luxury to create waste as we see
| fit, esp. when climate change has become this severe.
|
| > which the courts so far have ruled that it does.
|
| Any concrete cases you can cite?
|
| From [0], for example, while the court said that the authors
| failed to argue their case, the second observation is the complete
| opposite of what you said. Citing the article directly:
| Opinion suggests AI models do generally violate law.
|
| In the same spirit, I think I can safely assume that they
| violated copyright law, since they earn money by circumventing
| it, and fair use doesn't like for-profit copying.
|
| [0]: https://news.bloomberglaw.com/litigation/meta-beats-copyrigh...
| kirrent wrote:
| TFA is based on the ruling which found that Anthropic
| training on these books was fair use.
| robocat wrote:
| > It's waste at the end of the day
|
| Rubbish.
|
| More likely they are taking a waste stream of books and
| _reusing_ and possibly even recycling.
|
| Few people want old books, and many people that have books
| are throwing them out or donating them. I don't think I know
| anybody under 30 with a bookshelf of books they obviously
| intend to keep for life. Bookshelves used to be an elite
| status symbol, now I often see them as image rather than
| reference (e.g. part of the backdrop behind influencer vid).
|
| It is likely they didn't destroy much of value, since they
| will have minimized their purchasing costs. Modern DRM is not
| helping.
___________________________________________________________________
(page generated 2025-06-25 23:01 UTC)