[HN Gopher] Thomson Reuters wins first major AI copyright case in the US
___________________________________________________________________
Thomson Reuters wins first major AI copyright case in the US
Author : johnneville
Score : 136 points
Date : 2025-02-11 20:56 UTC (2 hours ago)
(HTM) web link (www.wired.com)
(TXT) w3m dump (www.wired.com)
| EnderWT wrote:
| https://archive.is/mu49I
| NewsaHackO wrote:
| How does this affect LLM systems that already have their corpus
| integrated?
| trod1234 wrote:
| The judge ruled this a violation of copyright. It's the same
| as hosting any copyrighted material absent a valid license:
| criminal copyright piracy.
|
| They would need to figure out a way to prune the respective
| weights so that such material is not available, or risk legal
| fury.
| Workaccount2 wrote:
| They would just censor output.
|
| YouTube doesn't need to figure out how to stop copyrighted
| material from being uploaded; it needs to stop it from being
| shared.
| preinheimer wrote:
| Great. The stated goal of a lot of these companies seems to be
| "train the model on the output of humans, then hire us instead of
| the humans".
|
| It's been interesting that media where watermarking has been
| feasible (like photography) have seen creators get access to some
| compensation, while text-based creators get nothing.
| rolph wrote:
| Rotate similar [but different] fonts [or character pages] over
| each character; the sequence of choices encodes data, and thus
| a watermark.
| WillAdams wrote:
| but the font changes won't be expressed in the (plain text)
| output of the LLM.
| yifanl wrote:
| Presumably the font would render letters to look like
| different letters, making the text useless to LLMs scraping
| the site but still readable for human visitors.
|
| This would have detrimental effects on people who use
| screen readers or have their own stylesheets, of course.
| WillAdams wrote:
| For that it would make more sense to run a routine which
| replaces letters with visually identical glyphs at
| different encoding points.
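|
| A minimal sketch of that homoglyph idea in Python (the
| Cyrillic mapping and the embed/extract helpers are
| illustrative assumptions, not an existing tool):
|
|   # Encode watermark bits by swapping Latin letters for
|   # visually identical Cyrillic codepoints.
|   HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e",
|                 "c": "\u0441", "p": "\u0440"}
|   SWAPPED = set(HOMOGLYPHS.values())
|
|   def embed(text: str, bits: str) -> str:
|       out, i = [], 0
|       for ch in text:
|           if ch in HOMOGLYPHS and i < len(bits):
|               # Substitute on a 1 bit, keep the original on 0.
|               out.append(HOMOGLYPHS[ch] if bits[i] == "1" else ch)
|               i += 1
|           else:
|               out.append(ch)
|       return "".join(out)
|
|   def extract(text: str, n: int) -> str:
|       # Read bits back off the first n eligible positions.
|       bits = ["1" if ch in SWAPPED else "0"
|               for ch in text if ch in HOMOGLYPHS or ch in SWAPPED]
|       return "".join(bits[:n])
|
|   assert extract(embed("peace on earth", "1011"), 4) == "1011"
|
| Any scraper that folds Unicode confusables before training
| would strip the mark, so this is fragile at best.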
| oidar wrote:
| Ross Intelligence was creating a product that would directly
| compete against Thomson Reuters. Pretty clearly not fair use.
| YesBox wrote:
| Thanks. The article wasn't loading for me, just the headline and
| image and footer. I was about to leave thinking that's all there
| is.
| dang wrote:
| We detached this comment from
| https://news.ycombinator.com/item?id=43018356.
|
| Your post is totally fine; I just want to save space at the top
| of the thread (where the parent is now pinned).
| varsketiz wrote:
| Great decision for humans.
|
| Is this type of risk the reason why OpenAI masquerades as a non-
| profit?
| veggieroll wrote:
| > Thomson Reuters prevailed on two of the four factors, but Bibas
| described the fourth as the most important, and ruled that Ross
| "meant to compete with Westlaw by developing a market
| substitute."
|
| Yep. That's what people have been saying all along. If the intent
| is to substitute for the original, then copying is not fair use.
|
| But the problem is that the current method for training requires
| this volume of data. So the models are legitimately not viable
| without massive copyright infringement.
|
| It'll be interesting to see how a defendant with a larger wallet
| will fare. But this doesn't look good.
|
| Though big-picture, it seems to me that the moneyed interests
| will ensure that even if the current legal landscape doesn't
| allow LLMs to exist, they will lobby HARD until it is
| allowed. This is inevitable now that it's at least partially
| framed in national security terms.
|
| But I'd hope that this means there is a chance that if models
| have to train on all of human content, the weights will be
| available for free to all humans. If it requires massive
| copyright infringement on our content, we should all have an
| ownership stake in the resulting models.
| saulpw wrote:
| > So the models are legitimately not viable without massive
| copyright infringement.
|
| Copyright is not about acquisition, it is about publication
| and/or distribution. If I get a copy of Harry Potter from a
| dumpster, I can read it. If a company gets a copy of _all_ books
| from a torrent, they can use it to train their AI. The torrent
| providers may be in violation of copyright, and if the AI can
| be used to reproduce substantive portions of the original text,
| the AI companies then may be in violation of copyright, but
| simply training a model on illegally distributed text should
| not be copyright infringement.
| veggieroll wrote:
| I mean, you're right in the abstract. If you train an LLM in
| a void and never do anything with the model, sure.
|
| But that's not what anyone is doing. People train models so
| that _someone_ can actually use them. So I'm not sure how
| your comment is helpful other than to point out that
| distinction (which doesn't make much difference in this case
| specifically, or in how copyright applies to LLMs in general).
| dkjaudyeqooe wrote:
| > simply training a model on illegally distributed text
| should not be copyright infringement
|
| You can train a model on copyrighted text, you just can't
| distribute the output in any way without violating copyright.
|
| One of the big problems is that training is a mechanical
| process, so there is a direct line between the copyrighted
| works and the model's output, regardless of the form of the
| output. Just on those terms it is very likely to be a
| copyright violation. Even if they don't reproduce substantive
| portions, what they do reproduce is a derivative work.
| saulpw wrote:
| If that mechanical process is not reversible, then it's not
| a copyright violation. For instance, I can compute the
| SHA256 hashes for every book in existence and distribute
| the resulting table of (ISBN, SHA256) and that is not a
| copyright violation.
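|
| Concretely, a minimal sketch of that table (the file path
| and ISBN are hypothetical placeholders):
|
|   import hashlib
|   from pathlib import Path
|
|   def sha256_of(path: Path) -> str:
|       h = hashlib.sha256()
|       with path.open("rb") as f:
|           # Stream in 1 MiB chunks so large books fit in memory.
|           for chunk in iter(lambda: f.read(1 << 20), b""):
|               h.update(chunk)
|       return h.hexdigest()
|
|   books = {"9780747532699": Path("books/hp1.txt")}  # hypothetical
|   table = {isbn: sha256_of(p) for isbn, p in books.items()}
|
| The table identifies each work exactly, yet can't reproduce
| a single sentence of it -- that's the irreversibility being
| argued.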
| _DeadFred_ wrote:
| As long as someone gives me the software to run my business,
| that person might be in violation of copyright but I'm in the
| clear.
|
| Simply running my business on illegally distributed
| copyrighted text/software/movie should not be copyright
| infringement.
| itishappy wrote:
| You might not be immediately liable, but that doesn't mean
| you're allowed to continue. I'd assume it's your duty to
| cease and desist immediately once it's pointed out that
| you're in violation.
| blibble wrote:
| > Copyright is not about acquisition, it is about publication
| and/or distribution. If I get a copy of Harry Potter from a
| dumpster, I can read it. If a company gets a copy of _all_
| books from a torrent, they can use it to train their AI.
|
| "a person reading" and "computer processing of data"
| (training) are not the same thing
|
| MDY Industries, LLC v. Blizzard Entertainment, Inc. rendered
| the verdict that loading unlicensed copyrighted material from
| disk was "copying", and hence copyright infringement
| toyg wrote:
| _> the current method for training requires this volume of
| data_
|
| This is one of those things that signal how _dumb_ this
| technology still is - or maybe how _smart_ humans are when
| compared to machines. A human brain doesn't need anywhere
| close to this volume of data, in order to be able to produce
| good output.
|
| I remember talking with friends 30 years ago about how it was
| inevitable that the brain would eventually be fully implemented
| as a machine, once computing power got big enough; but it
| looks like we're still very far from that.
| veggieroll wrote:
| Absolutely! But the question is whether the next step-change
| in intelligence is just around the corner (in which case this
| legal speedbump might spur innovation), or whether the next
| revolution will take a while.
|
| There's enough money in the market to fund a lot of research
| into totally novel underlying methods. But if it takes too
| long, investors and lawmakers will just move to make what
| already works legal, because it is useful.
| dkjaudyeqooe wrote:
| > A human brain doesn't need anywhere close to this volume of
| data, in order to be able to produce good output.
|
| > I remember talking with friends 30 years ago
|
| I'd say you're pretty old. How many years of training did it
| take for you to start producing good output?
|
| The lesson here is we're kind of meta-trained: our minds are
| primed to pick up new things quickly by abstracting them and
| relating them to things we already know. We work in concepts
| and mental models rather than text. LLMs are incredibly weak
| by comparison. They only understand token sequences.
| gregschlom wrote:
| > A human brain doesn't need anywhere close to this volume of
| data, in order to be able to produce good output.
|
| Maybe not directly, but consider that our brains are the
| product of millions of years of evolution and aren't a blank
| slate when we're born. Even though babies can't speak a
| language at birth, they already have all the neural
| connections in place in order to acquire and manipulate
| language, and require just a few years of "supervised fine
| tuning" to learn the actual language.
|
| LLMs, on the other hand, start with their weights at random
| values and need to catch up with those millions of years of
| evolution first.
| skeledrew wrote:
| Add to this, the brain is constantly processing raw sensory
| data from the moment it became viable, even when the body
| is "sleeping". It's using orders of magnitude more data
| than any model in existence every moment, but isn't
| generally deemed "intelligent" enough until it's around 18
| years old.
| CobrastanJorji wrote:
| We are unbelievably far from that. Everyone who tells you
| that we're within 20 years of emulating brains and says stuff
| like "the human brain only runs at 100 hertz!" has either
| been conned by a futurist or is in denial of their own
| mortality.
| iandanforth wrote:
| Establishing precedent by defeating an already dead company in
| court is neither impressive nor likely to hold up for other
| companies.
| teruakohatu wrote:
| Ross Intelligence was more a search interface with natural
| language and, probably, vector-based similarity. So I suspect
| they were hosting and using the corpus in production, not just
| training a model on it.
| dkjaudyeqooe wrote:
| The fair use aspect of the ruling should send a chill down the
| spines of all generative AI vendors. It's just one ruling but
| it's still bad.
| 2OEH8eoCRo0 wrote:
| Fantastic news!
| JackC wrote:
| Here's the full decision, which (like most decisions!) is largely
| written to be legible to non-lawyers:
| https://storage.courtlistener.com/recap/gov.uscourts.ded.721...
|
| The core story seems to be: Westlaw writes and owns headnotes
| that help lawyers find legal cases about a particular topic. Ross
| paid people to translate those headnotes into new text, trained
| an AI on the translations, and used those to make a model that
| helps lawyers find legal cases about a particular topic. In that
| specific instance the court says this plan isn't fair use. If it
| was fair use, one could presumably just pay people to translate
| headnotes directly and make a Westlaw competitor, since
| translating headnotes is cheaper than writing new ones. And
| conversely, if it isn't fair use, where's the harm (the court notes
| no copyright violation was necessary for interoperability for
| example) -- one can still pay people to write fresh headnotes
| from caselaw and create the same training set.
|
| The court emphasizes "Because the AI landscape is changing
| rapidly, I note for readers that only non-generative AI is before
| me today." But I'm not sure "generative" is that meaningful a
| distinction here.
|
| You can definitely see how AI companies will be hustling to
| distinguish this from "we trained on copyrighted documents, and
| made a general purpose AI, and then people paid to use our AI to
| compete with the people who owned the documents." It's not quite
| the same, the connection is less direct, but it's not totally
| different.
| echelon wrote:
| If the copyright holders win, the model giants will just
| license.
|
| This effectively kills open source, which can't afford to
| license and won't be able to sublicense training data.
|
| This is very bad for democratized access to and development of
| AI.
|
| The giants will probably want this. The giants were already
| purchasing legacy media content enterprises (Amazon and MGM,
| etc.), so this will probably further consolidation and create
| extreme barriers to entry.
|
| If I were OpenAI, I'd probably be very happy right now. If I
| were a recent batch YC AI company, I'd be mortified.
| dkjaudyeqooe wrote:
| License what? Every available copyrighted work? Even getting
| a tiny fraction is not practical.
|
| To the contrary, this just means companies can't make money
| from these models.
|
| Those using models for research and personal use wouldn't be
| infringing under the fair use tests.
| echoangle wrote:
| They didn't train it on every available copyrighted work
| though, but on a specific set of legal questions and
| answers. And they did try to license them, and only did the
| workaround after not getting a license.
| synthetic-echo wrote:
| I think they were talking about the "model giants" like
| OpenAI you mentioned. Not saying they're correct, but I
| will concede the amount of copyrighted information
| someone like OpenAI would want is probably (at least) an
| order of magnitude more than this particular case.
| mvdtnz wrote:
| > License what? Every available copyrighted work? Even
| getting a tiny fraction is not practical.
|
| Oh no. Anyway.
| JoshTriplett wrote:
| > If the copyright holders win, the model giants will just
| license.
|
| No, they won't. The biggest models want to train on literally
| every piece of human-written text ever written. You can pay
| to license small subsets of that at a time. You can't pay to
| license _all_ of it. And some of it won't be available to
| license at all, at any price.
|
| If the copyright holders win, model trainers will have to pay
| attention to what they train on, rather than blithely
| ignoring licenses.
| simonw wrote:
| "The biggest models want to train on literally every piece
| of human-written text ever written"
|
| They genuinely don't. There is a LOT of garbage text out
| there that they don't want. They want to train on every
| high quality piece of human-written text they can get their
| hands on (where the definition of "high quality" is a major
| piece of the secret sauce that makes some LLMs better than
| others), but that doesn't mean _every_ piece of human-
| written text.
| ethbr1 wrote:
| Even restricted to that narrower definition, the major
| commercial model companies wouldn't be able to afford to
| license all their high-quality human text.
|
| OpenAI is Uber with a slightly less ethically despicable
| CEO.
|
| It knows it's flouting the spirit of copyright law --
| it's just hoping it can bootstrap quickly enough to
| make the question irrelevant.
|
| If every commercial AI company that couldn't prove
| training data provenance tomorrow was bankrupted, I
| wouldn't shed an ethical tear. Live by the sword, die by
| the sword.
| vkou wrote:
| I don't have either a data center or every single
| copyrighted work in history to import as training data to
| train my open-source model.
|
| Whether or not OpenAI is found to be breaking the law will be
| utterly irrelevant to actual open AI efforts.
| mvdtnz wrote:
| Open source model builders are no more entitled to rip off
| content owners than anyone else. I couldn't possibly care any
| less if this impacts "democratized access" to bullshit
| generators. At least if the big boys license the content then
| the rightful owners get paid (and have the option to opt
| out).
| dkjaudyeqooe wrote:
| > The court emphasizes "Because the AI landscape is changing
| rapidly, I note for readers that only non-generative AI is
| before me today." But I'm not sure "generative" is that
| meaningful a distinction here.
|
| Also, although the judge makes that statement, it looks like he
| misunderstands the nature of the AI system and the inherent
| generative elements it includes.
| echoangle wrote:
| How is the system inherently generative?
| currymj wrote:
| Generative is a technical term, meaning that a system
| models a full joint probability distribution.
|
| For example, a classifier is a generative model if it
| models p(example, label) -- which is sufficient to also
| calculate p(label | example) if you want -- rather than
| just modeling p(label | example) alone.
|
| Similar example in translation: a generative translation
| model would model p(french sentence, english sentence) --
| implicitly including a language model of p(french) and
| p(english) in addition to allowing translation p(english |
| french) and p(french | english). A non-generative
| translation model would, for instance, only model p(french
| | english).
|
| I don't exactly understand what this judge meant by
| "generative", it's presumably not the technical term.
| echoangle wrote:
| Do you have some kind of dictionary where I can find this
| definition? Because I don't really understand how that
| can be the deciding factor of "generative", and the wiki
| page for "generative AI" also seems to use the generic
| "AI that creates new stuff" meaning.
|
| By your definition, basically every classifier with 2
| inputs would be generative. If I have a classifier for
| the MNIST dataset and my inputs are the pixels of the
| image, does that make the classifier generative because
| the inputs aren't independent from each other?
| Ajedi32 wrote:
| Interestingly, almost the entirety of the judge's opinion seems
| to be focused on the question of whether the translated notes
| are subject to copyright. It seems to completely ignore the
| question of whether training an AI on copyrighted material
| constitutes making a copy of that work in the first place. Am I
| missing something?
|
| The judge does note that no copyrighted material was
| distributed to users, because the AI doesn't output that
| information:
|
| > There is no factual dispute: Ross's output to an end user
| does not include a West headnote. What matters is not "the
| amount and substantiality of the portion used in making a copy,
| but rather the amount and substantiality of what is thereby
| made accessible to a public for which it may serve as a
| competing substitute." Authors Guild, 804 F.3d at 222 (internal
| quotation marks omitted). Because Ross did not make West
| headnotes available to the public, Ross benefits from factor
| three.
|
| But he only does so as part of an analysis of whether there's a
| valid fair use defense for Ross's copying of the headnotes,
| ignoring the obvious (to me) question: if no copyrighted
| material was distributed to end users, how can this even be a
| violation of copyright in the first place?
| unyttigfjelltol wrote:
| Ross evidently copied and used the text itself. It's like
| Ross creating an unauthorized volume of West's books, perhaps
| with a twist.
|
| Obscurity ≠ legal compliance.
| Animats wrote:
| This isn't really about "AI". It's about copying summaries.
| Google was fined for this in France for copying news headlines
| into their search results, and now has to pay royalties in the
| EU.[1] Westlaw is a summarizing and indexing service for court
| case results.
| results. It's been publishing that info in book form since 1872.
|
| Ross was trying to compete with Westlaw, but used Westlaw as an
| input. West's "Key Numbers" are, after a century and a half, a
| de-facto standard.[2] So Ross had to match that proprietary
| indexing system to compete. Their output had to match Westlaw's
| rather closely. That's the underlying problem. The court ruled
| that the objective was to directly compete with Westlaw, and
| using Westlaw's output to do that was intentional copyright
| infringement.
|
| This looks like a narrow holding, not one that generally covers
| feeding content into AI training systems.
|
| [1] https://apnews.com/article/google-france-news-publishers-
| cop...
|
| [2] https://guides.law.stanford.edu/cases/keynumbersystem
| mmooss wrote:
| TR may have intentionally chosen an easy battle to begin their
| legal war.
| MonkeyClub wrote:
| From p. 6:
|
| "But a headnote can introduce creativity by distilling,
| synthesizing, or explaining part of an opinion, and thus be
| copyrightable."
|
| Does this set a precedent, whereby AI-generated summaries are
| copyrightable by the LLM owners?
| rvz wrote:
| See. The fair-use excuses that the AI proponents here were trying
| to hang on to for dear life have fallen flat with this ruling.
|
| This is going to be one of many cases out of which licensing
| deals get made, stopping AI grifters from claiming 'fair use'
| to side-step copyright law just because they are using a
| gen-AI system.
|
| OpenAI ended up paying for data from Shutterstock and various
| news outlets. This will be no different.
| Salgat wrote:
| My biggest concern is, what happens when countries like China,
| who aren't restricted by this, far outpace western countries in
| this technology? Do we just shrug and accept our far inferior
| models? LLMs are a productivity multiplier (similar to a search
| engine), so it'll have a large impact on the economy if
| licensing costs prohibit large-scale training.
| blibble wrote:
| once the case law is set, I look forward to suing everyone
| that's ever trained a model for $300,000 PER WORK each time
| they ingested my code from GitHub
|
| whoever wrote those indemnity policies is going to regret it
| mmooss wrote:
| Thomson Reuters chose to sue Ross Intelligence, not a company
| like Google or even OpenAI. I wonder how deeper pockets would
| have affected the outcome.
|
| I wonder how the politics played out. The big AI companies could
| have funded Ross Intelligence, which otherwise could have
| threatened to sabotage their legal strategies by tanking its own
| case and settling in TR's favor.
| ars wrote:
| It would be quite an interesting result if we could have true
| general AI, but we don't, simply because of copyright.
|
| I'm aware this isn't a concern yet, but imagine if the future
| played out this way....
|
| Or worse: Only those with really deep pockets can pay to get AI,
| and no one else can, simply because they can't afford the
| copyright fees.
| simonw wrote:
| Interesting to note from this 2020 story (when ROSS shut down)
| that the company was founded in 2014 and went out of business in
| 2020: https://www.lawnext.com/2020/12/legal-research-company-
| ross-...
|
| The fact that it took until 2025 for the case to resolve shows
| how long the wheels of justice can take to turn!
| jug wrote:
| I spontaneously feel like this is bad news for open AI, while
| playing into the hands of corporate behemoths able to strike
| expensive deals with major publishers and top it off with the
| public domain.
|
| I'm not sure this signals the end of AI and a victory for
| humans; rather, it raises the question of who gets to train
| the models.
| lazycog512 wrote:
| seems like Delaware can't scare tech companies into
| re-incorporating elsewhere any faster
___________________________________________________________________
(page generated 2025-02-11 23:00 UTC)