[HN Gopher] Thomson Reuters wins first major AI copyright case i...
       ___________________________________________________________________
        
       Thomson Reuters wins first major AI copyright case in the US
        
       Author : johnneville
       Score  : 136 points
       Date   : 2025-02-11 20:56 UTC (2 hours ago)
        
 (HTM) web link (www.wired.com)
 (TXT) w3m dump (www.wired.com)
        
       | EnderWT wrote:
       | https://archive.is/mu49I
        
       | NewsaHackO wrote:
       | How does this affect LLM systems that already have their corpus
       | integrated?
        
         | trod1234 wrote:
          | The judge ruled this a violation of copyright. It's the same
          | as hosting any copyrighted material absent a valid license:
          | criminal copyright piracy.
         | 
         | They would need to figure out a way to prune the respective
         | weights so that such material is not available, or risk legal
         | fury.
        
           | Workaccount2 wrote:
           | They would just censor output.
           | 
            | YouTube doesn't need to figure out how to stop copyrighted
            | material from being uploaded; they need to stop it from being
            | shared.
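            | 
            | A rough sketch of the idea (hypothetical; not how YouTube or
            | any vendor actually does it): compare each generation against
            | a protected corpus and suppress close matches.
            | 
            |   # Naive output filter: block generations that share long
            |   # n-grams with a protected corpus. Names are made up.
            |   def ngrams(text, n=8):
            |       words = text.lower().split()
            |       return {tuple(words[i:i + n])
            |               for i in range(len(words) - n + 1)}
            | 
            |   def is_blocked(generation, protected_texts, n=8):
            |       gen = ngrams(generation, n)
            |       return any(gen & ngrams(doc, n)
            |                  for doc in protected_texts)
            | 
            | Real systems (e.g. Content ID) use robust fingerprints rather
            | than exact n-grams, but the gating principle is the same.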
        
       | preinheimer wrote:
       | Great. The stated goal of a lot of these companies seems to be
       | "train the model on the output of humans, then hire us instead of
       | the humans".
       | 
       | It's been interesting that media where watermarking has been
       | feasible (like photography) have seen creators get access to some
       | compensation, while text based creators get nothing.
        
         | rolph wrote:
          | Rotate similar [but different] fonts [or character pages] over
          | each character. The sequence of fonts represents data, and
          | thus a watermark.
        
           | WillAdams wrote:
            | But the font changes won't be expressed in the (plain text)
            | output of the LLM.
        
             | yifanl wrote:
              | Presumably the font would render letters to look like
              | different letters, making the text useless to LLMs scraping
              | the site but still readable for visual readers.
              | 
              | This would have detrimental effects on people who use
              | screen readers or have their own stylesheets, of course.
        
               | WillAdams wrote:
               | For that it would make more sense to run a routine which
               | replaces letters with visually identical glyphs at
               | different encoding points.
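                | 
                | A minimal sketch of such a routine (the glyph table is a
                | tiny hypothetical subset; Cyrillic has several more Latin
                | lookalikes):
                | 
                |   # Encode watermark bits by swapping Latin letters
                |   # for visually identical Cyrillic codepoints.
                |   HOMOGLYPHS = {"a": "\u0430", "e": "\u0435",
                |                 "o": "\u043e", "c": "\u0441"}
                | 
                |   def watermark(text, bits):
                |       out, i = [], 0
                |       for ch in text:
                |           sub = HOMOGLYPHS.get(ch)
                |           if sub and i < len(bits):
                |               out.append(sub if bits[i] else ch)
                |               i += 1
                |           else:
                |               out.append(ch)
                |       return "".join(out)
                | 
                | Decoding is the reverse lookup; the pattern survives
                | copy-paste until some normalization or confusable-
                | detection pass strips it.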
        
       | oidar wrote:
        | Ross Intelligence was creating a product that would directly
        | compete with Thomson Reuters. Pretty clearly not fair use.
        
       | YesBox wrote:
       | Thanks. The article wasn't loading for me, just the headline and
       | image and footer. I was about to leave thinking that's all there
       | is.
        
         | dang wrote:
         | We detached this comment from
         | https://news.ycombinator.com/item?id=43018356.
         | 
         | Your post is totally fine; I just want to save space at the top
         | of the thread (where the parent is now pinned).
        
       | varsketiz wrote:
       | Great decision for humans.
       | 
       | Is this type of risk the reason why OpenAI masquerades as a non-
       | profit?
        
       | veggieroll wrote:
       | > Thomson Reuters prevailed on two of the four factors, but Bibas
       | described the fourth as the most important, and ruled that Ross
       | "meant to compete with Westlaw by developing a market
       | substitute."
       | 
        | Yep. That's what people have been saying all along. If the intent
        | is to substitute for the original, then copying is not fair use.
       | 
       | But the problem is that the current method for training requires
       | this volume of data. So the models are legitimately not viable
       | without massive copyright infringement.
       | 
       | It'll be interesting to see how a defendant with a larger wallet
       | will fare. But this doesn't look good.
       | 
        | Though big-picture, it seems to me that the moneyed interests
        | will ensure that even if the current legal landscape doesn't
        | allow LLMs to exist, they will lobby HARD until it is allowed.
        | This is inevitable now that it's at least partially framed in
        | national security terms.
       | 
       | But I'd hope that this means there is a chance that if models
       | have to train on all of human content, the weights will be
       | available for free to all humans. If it requires massive
       | copyright infringement on our content, we should all have an
       | ownership stake in the resulting models.
        
         | saulpw wrote:
         | > So the models are legitimately not viable without massive
         | copyright infringement.
         | 
          | Copyright is not about acquisition; it is about publication
          | and/or distribution. If I get a copy of Harry Potter from a
          | dumpster, I can read it. If a company gets a copy of _all_
          | books from a torrent, they can use it to train their AI. The
          | torrent
         | providers may be in violation of copyright, and if the AI can
         | be used to reproduce substantive portions of the original text,
         | the AI companies then may be in violation of copyright, but
         | simply training a model on illegally distributed text should
         | not be copyright infringement.
        
           | veggieroll wrote:
           | I mean, you're right in the abstract. If you train an LLM in
           | a void and never do anything with the model, sure.
           | 
            | But that's not what anyone is doing. People train models so
            | that _someone_ can actually use them. So I'm not sure how
            | your comment is helpful, other than to point out that
            | distinction (which doesn't make much difference in this case
            | specifically, or in how copyright applies to LLMs in general).
        
           | dkjaudyeqooe wrote:
           | > simply training a model on illegally distributed text
           | should not be copyright infringement
           | 
              | You can train a model on copyrighted text; you just can't
              | distribute the output in any way without violating copyright.
           | 
           | One of the big problems is that training is a mechanical
           | process, so there is a direct line between the copyrighted
           | works and the model's output, regardless of the form of the
           | output. Just on those terms it is very likely to be a
              | copyright violation. Even if they don't reproduce substantive
              | portions, what they do reproduce is a derivative work.
        
             | saulpw wrote:
             | If that mechanical process is not reversible, then it's not
             | a copyright violation. For instance, I can compute the
             | SHA256 hashes for every book in existence and distribute
             | the resulting table of (ISBN, SHA256) and that is not a
             | copyright violation.
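              | 
              | Concretely (a sketch; the filename and ISBN are
              | placeholders):
              | 
              |   import hashlib
              | 
              |   # One-way digest: it identifies the text but cannot be
              |   # reversed to recover a single sentence of it.
              |   text = open("book.txt", "rb").read()
              |   digest = hashlib.sha256(text).hexdigest()
              |   print("978-0000000000", digest)  # one (ISBN, SHA256) row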
        
           | _DeadFred_ wrote:
            | As long as someone gives me the software to run my
            | business, that person might be in violation of copyright but
            | I'm in the clear.
           | 
           | Simply running my business on illegally distributed
           | copyrighted text/software/movie should not be copyright
           | infringement.
        
             | itishappy wrote:
             | You might not be immediately liable, but that doesn't mean
             | you're allowed to continue. I'd assume it's your duty to
             | cease and desist immediately once it's pointed out that
             | you're in violation.
        
           | blibble wrote:
            | > Copyright is not about acquisition; it is about publication
            | and/or distribution. If I get a copy of Harry Potter from a
            | dumpster, I can read it. If a company gets a copy of _all_
            | books from a torrent, they can use it to train their AI.
           | 
           | "a person reading" and "computer processing of data"
           | (training) are not the same thing
           | 
            | MDY Industries, LLC v. Blizzard Entertainment, Inc. held
            | that loading unlicensed copyrighted material from disk was
            | "copying", and hence copyright infringement.
        
         | toyg wrote:
         | _> the current method for training requires this volume of
         | data_
         | 
         | This is one of those things that signal how _dumb_ this
         | technology still is - or maybe how _smart_ humans are when
          | compared to machines. A human brain doesn't need anywhere
          | close to this volume of data in order to produce good output.
         | 
         | I remember talking with friends 30 years ago about how it was
         | inevitable that the brain would eventually be fully implemented
          | as a machine once computing power got big enough; but it
         | looks like we're still very far from that.
        
           | veggieroll wrote:
            | Absolutely! But the question is whether the next step-change
            | in intelligence is just around the corner (in which case
            | this legal speedbump might spur innovation), or whether the
            | next revolution will take a while.
           | 
           | There's enough money in the market to fund a lot of research
           | into totally novel underlying methods. But if it takes too
           | long, investors and lawmakers will just move to make what
           | already works legal, because it is useful.
        
           | dkjaudyeqooe wrote:
           | > A human brain doesn't need anywhere close to this volume of
           | data, in order to be able to produce good output.
           | 
           | > I remember talking with friends 30 years ago
           | 
           | I'd say you're pretty old. How many years of training did it
           | take for you to start producing good output?
           | 
            | The lesson here is we're kind of meta-trained: our minds are
           | primed to pick up new things quickly by abstracting them and
           | relating them to things we already know. We work in concepts
           | and mental models rather than text. LLMs are incredibly weak
           | by comparison. They only understand token sequences.
        
           | gregschlom wrote:
           | > A human brain doesn't need anywhere close to this volume of
           | data, in order to be able to produce good output.
           | 
           | Maybe not directly, but consider that our brains are the
            | product of millions of years of evolution and aren't a blank
           | slate when we're born. Even though babies can't speak a
           | language at birth, they already have all the neural
           | connections in place in order to acquire and manipulate
           | language, and require just a few years of "supervised fine
           | tuning" to learn the actual language.
           | 
           | LLMs, on the other hand, start with their weights at random
            | values and need to catch up with those millions of years of
           | evolution first.
        
             | skeledrew wrote:
              | Add to this that the brain is constantly processing raw
              | sensory data from the moment it becomes viable, even when
              | the body
             | is "sleeping". It's using orders of magnitude more data
             | than any model in existence every moment, but isn't
             | generally deemed "intelligent" enough until it's around 18
             | years old.
        
           | CobrastanJorji wrote:
           | We are unbelievably far from that. Everyone who tells you
           | that we're within 20 years of emulating brains and says stuff
           | like "the human brain only runs at 100 hertz!" has either
           | been conned by a futurist or is in denial of their own
           | mortality.
        
       | iandanforth wrote:
       | Establishing precedent by defeating an already dead company in
       | court is neither impressive nor likely to hold up for other
       | companies.
        
       | teruakohatu wrote:
        | Ross Intelligence was more of a search interface with natural
        | language and, probably, vector-based similarity. So I suspect
       | they were hosting and using the corpus in production, not just
       | training a model on it.
        
       | dkjaudyeqooe wrote:
       | The fair use aspect of the ruling should send a chill down the
       | spines of all generative AI vendors. It's just one ruling but
       | it's still bad.
        
       | 2OEH8eoCRo0 wrote:
       | Fantastic news!
        
       | JackC wrote:
       | Here's the full decision, which (like most decisions!) is largely
       | written to be legible to non-lawyers:
       | https://storage.courtlistener.com/recap/gov.uscourts.ded.721...
       | 
       | The core story seems to be: Westlaw writes and owns headnotes
       | that help lawyers find legal cases about a particular topic. Ross
       | paid people to translate those headnotes into new text, trained
       | an AI on the translations, and used those to make a model that
       | helps lawyers find legal cases about a particular topic. In that
       | specific instance the court says this plan isn't fair use. If it
       | was fair use, one could presumably just pay people to translate
       | headnotes directly and make a Westlaw competitor, since
       | translating headnotes is cheaper than writing new ones. And
        | conversely, if it isn't fair use, where's the harm (the court
        | notes, for example, that no copyright violation was necessary
        | for interoperability) -- one can still pay people to write
        | fresh headnotes
       | from caselaw and create the same training set.
       | 
       | The court emphasizes "Because the AI landscape is changing
       | rapidly, I note for readers that only non-generative AI is before
       | me today." But I'm not sure "generative" is that meaningful a
       | distinction here.
       | 
       | You can definitely see how AI companies will be hustling to
       | distinguish this from "we trained on copyrighted documents, and
       | made a general purpose AI, and then people paid to use our AI to
       | compete with the people who owned the documents." It's not quite
       | the same, the connection is less direct, but it's not totally
       | different.
        
         | echelon wrote:
         | If the copyright holders win, the model giants will just
         | license.
         | 
         | This effectively kills open source, which can't afford to
         | license and won't be able to sublicense training data.
         | 
         | This is very bad for democratized access to and development of
         | AI.
         | 
         | The giants will probably want this. The giants were already
         | purchasing legacy media content enterprises (Amazon and MGM,
         | etc.), so this will probably further consolidation and create
         | extreme barriers to entry.
         | 
         | If I were OpenAI, I'd probably be very happy right now. If I
         | were a recent batch YC AI company, I'd be mortified.
        
           | dkjaudyeqooe wrote:
           | License what? Every available copyrighted work? Even getting
           | a tiny fraction is not practical.
           | 
           | To the contrary, this just means companies can't make money
           | from these models.
           | 
           | Those using models for research and personal use wouldn't be
           | infringing under the fair use tests.
        
             | echoangle wrote:
             | They didn't train it on every available copyrighted work
             | though, but on a specific set of legal questions and
             | answers. And they did try to license them, and only did the
             | workaround after not getting a license.
        
               | synthetic-echo wrote:
               | I think they were talking about the "model giants" like
               | OpenAI you mentioned. Not saying they're correct, but I
               | will concede the amount of copyrighted information
               | someone like OpenAI would want is probably (at least) an
               | order of magnitude more than this particular case.
        
             | mvdtnz wrote:
             | > License what? Every available copyrighted work? Even
             | getting a tiny fraction is not practical.
             | 
             | Oh no. Anyway.
        
           | JoshTriplett wrote:
           | > If the copyright holders win, the model giants will just
           | license.
           | 
           | No, they won't. The biggest models want to train on literally
           | every piece of human-written text ever written. You can pay
           | to license small subsets of that at a time. You can't pay to
              | license _all_ of it. And some of it won't be available to
           | license at all, at any price.
           | 
           | If the copyright holders win, model trainers will have to pay
           | attention to what they train on, rather than blithely
           | ignoring licenses.
        
             | simonw wrote:
             | "The biggest models want to train on literally every piece
             | of human-written text ever written"
             | 
             | They genuinely don't. There is a LOT of garbage text out
             | there that they don't want. They want to train on every
             | high quality piece of human-written text they can get their
             | hands on (where the definition of "high quality" is a major
             | piece of the secret sauce that makes some LLMs better than
             | others), but that doesn't mean _every_ piece of human-
             | written text.
        
               | ethbr1 wrote:
               | Even restricted to that narrower definition, the major
               | commercial model companies wouldn't be able to afford to
               | license all their high-quality human text.
               | 
               | OpenAI is Uber with a slightly less ethically despicable
               | CEO.
               | 
                | It knows it's flouting the spirit of copyright law --
               | it's just hoping it could bootstrap quickly enough to
               | make the question irrelevant.
               | 
               | If every commercial AI company that couldn't prove
               | training data provenance tomorrow was bankrupted, I
               | wouldn't shed an ethical tear. Live by the sword, die by
               | the sword.
        
           | vkou wrote:
            | I don't have either a data center or every single
            | copyrighted work in history to use as training data for my
            | open-source model.
           | 
           | Whether or not OpenAI is found to be breaking the law will be
           | utterly irrelevant to actual open AI efforts.
        
           | mvdtnz wrote:
           | Open source model builders are no more entitled to rip off
           | content owners than anyone else. I couldn't possibly care any
           | less if this impacts "democratized access" to bullshit
           | generators. At least if the big boys license the content then
           | the rightful owners get paid (and have the option to opt
           | out).
        
         | dkjaudyeqooe wrote:
         | > The court emphasizes "Because the AI landscape is changing
         | rapidly, I note for readers that only non-generative AI is
         | before me today." But I'm not sure "generative" is that
         | meaningful a distinction here.
         | 
          | Also, although the judge makes that statement, it looks like he
         | misunderstands the nature of the AI system and the inherent
         | generative elements it includes.
        
           | echoangle wrote:
           | How is the system inherently generative?
        
             | currymj wrote:
             | Generative is a technical term, meaning that a system
             | models a full joint probability distribution.
             | 
             | For example, a classifier is a generative model if it
             | models p(example, label) -- which is sufficient to also
             | calculate p(label | example) if you want -- rather than
             | just modeling p(label | example) alone.
             | 
             | Similar example in translation: a generative translation
             | model would model p(french sentence, english sentence) --
             | implicitly including a language model of p(french) and
             | p(english) in addition to allowing translation p(english |
             | french) and p(french | english). A non-generative
             | translation model would, for instance, only model p(french
             | | english).
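              | 
              | A toy contrast in scikit-learn terms (a sketch; the data
              | and names are made up):
              | 
              |   import numpy as np
              |   from sklearn.linear_model import LogisticRegression
              |   from sklearn.naive_bayes import GaussianNB
              | 
              |   rng = np.random.default_rng(0)
              |   X = np.vstack([rng.normal(0, 1, (100, 1)),
              |                  rng.normal(3, 1, (100, 1))])
              |   y = np.array([0] * 100 + [1] * 100)
              | 
              |   # Generative: fits p(x | y) and p(y), hence the joint
              |   # p(x, y); its per-class parameters could generate x's.
              |   gen = GaussianNB().fit(X, y)
              |   print(gen.theta_, gen.class_prior_)
              | 
              |   # Discriminative: fits only the conditional p(y | x).
              |   disc = LogisticRegression().fit(X, y)
              |   print(disc.predict_proba([[1.5]]))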
             | 
              | I don't exactly understand what this judge meant by
              | "generative"; it's presumably not the technical term.
        
               | echoangle wrote:
               | Do you have some kind of dictionary where I can find this
               | definition? Because I don't really understand how that
                | can be the deciding factor of "generative", and the wiki
                | page for "generative AI" also seems to use the generic
                | "AI that creates new stuff" meaning.
               | 
               | By your definition, basically every classifier with 2
               | inputs would be generative. If I have a classifier for
               | the MNIST dataset and my inputs are the pixels of the
               | image, does that make the classifier generative because
               | the inputs aren't independent from each other?
        
         | Ajedi32 wrote:
         | Interestingly, almost the entirety of the judge's opinion seems
         | to be focused on the question of whether the translated notes
         | are subject to copyright. It seems to completely ignore the
         | question of whether training an AI on copyrighted material
         | constitutes making a copy of that work in the first place. Am I
         | missing something?
         | 
         | The judge does note that no copyrighted material was
         | distributed to users, because the AI doesn't output that
         | information:
         | 
         | > There is no factual dispute: Ross's output to an end user
         | does not include a West headnote. What matters is not "the
         | amount and substantiality of the portion used in making a copy,
         | but rather the amount and substantiality of what is thereby
         | made accessible to a public for which it may serve as a
         | competing substitute." Authors Guild, 804 F.3d at 222 (internal
         | quotation marks omitted). Because Ross did not make West
         | headnotes available to the public, Ross benefits from factor
         | three.
         | 
         | But he only does so as part of an analysis of whether there's a
         | valid fair use defense for Ross's copying of the head notes,
         | ignoring the obvious (to me) point that if no copyrighted
         | material was distributed to end users, how can this even be a
         | violation of copyright in the first place?
        
           | unyttigfjelltol wrote:
            | Ross evidently copied and used the text directly. It's like
           | Ross creating an unauthorized volume of West's books, perhaps
           | with a twist.
           | 
            | Obscurity ≠ legal compliance.
        
       | Animats wrote:
       | This isn't really about "AI". It's about copying summaries.
        | Google was fined for this in France for copying news headlines
        | into their search results, and now has to pay royalties in the
        | EU.[1] Westlaw is a summarizing and indexing service for court
        | case results. It's been publishing that info in book form since
        | 1872.
       | 
       | Ross was trying to compete with Westlaw, but used Westlaw as an
       | input. West's "Key Numbers" are, after a century and a half, a
       | de-facto standard.[2] So Ross had to match that proprietary
       | indexing system to compete. Their output had to match Westlaw's
       | rather closely. That's the underlying problem. The court ruled
       | that the objective was to directly compete with Westlaw, and
       | using Westlaw's output to do that was intentional copyright
       | infringement.
       | 
       | This looks like a narrow holding, not one that generally covers
       | feeding content into AI training systems.
       | 
       | [1] https://apnews.com/article/google-france-news-publishers-
       | cop...
       | 
       | [2] https://guides.law.stanford.edu/cases/keynumbersystem
        
         | mmooss wrote:
         | TR may have intentionally chosen an easy battle to begin their
         | legal war.
        
       | MonkeyClub wrote:
       | From p. 6:
       | 
       | "But a headnote can introduce creativity by distilling,
       | synthesizing, or explaining part of an opinion, and thus be
       | copyrightable."
       | 
       | Does this set a precedent, whereby AI-generated summaries are
       | copyrightable by the LLM owners?
        
       | rvz wrote:
        | See. The fair-use excuses that the AI proponents here were trying
        | to hang on to for dear life have fallen flat with this ruling.
        | 
        | This is going to be one of many cases out of which licensing
        | deals will be made, to stop AI grifters from claiming 'fair use'
        | to try to side-step copyright laws just because they are using a
        | gen AI system.
       | 
       | OpenAI ended up paying up for the data with Shutterstock and
       | other news sources. This will be no different.
        
         | Salgat wrote:
          | My biggest concern is: what happens when countries like China,
          | which aren't restricted by this, far outpace Western countries
          | in this technology? Do we just shrug and accept our far inferior
         | models? LLMs are a productivity multiplier (similar to a search
         | engine), so it'll have a large impact on the economy if
         | licensing costs prohibit large scale training.
        
         | blibble wrote:
          | Once the case law is set, I look forward to suing everyone
          | that's ever trained a model for $150,000 PER WORK each time
          | they ingested my code from GitHub.
         | 
          | Whoever wrote those indemnity policies is going to regret it.
        
       | mmooss wrote:
       | Thomson Reuters chose to sue Ross Intelligence, not a company
       | like Google or even OpenAI. I wonder how deeper pockets would
       | have affected the outcome.
       | 
        | I wonder how the politics played out. The big AI companies could
        | have funded Ross Intelligence's defense; Ross, in turn, could
        | have threatened to sabotage their legal strategies by tanking
        | the case and settling in TR's favor.
        
       | ars wrote:
        | It would be quite an interesting result if we could have true
        | General AI, but we don't, simply because of copyright.
       | 
       | I'm aware this isn't a concern yet, but imagine if the future
       | played out this way....
       | 
       | Or worse: Only those with really deep pockets can pay to get AI,
       | and no one else can, simply because they can't afford the
       | copyright fees.
        
       | simonw wrote:
       | Interesting to note from this 2020 story (when ROSS shut down)
       | that the company was founded in 2014 and went out of business in
       | 2020: https://www.lawnext.com/2020/12/legal-research-company-
       | ross-...
       | 
        | The fact that it took until 2025 for the case to resolve shows
       | how long the wheels of justice can take to turn!
        
       | jug wrote:
        | I spontaneously feel like this is bad news for open AI, while
        | playing into the hands of corporate behemoths able to strike
        | expensive deals with major publishers and top them off with the
        | public domain.
        | 
        | I'm not sure this signals the end of AI and a victory for
        | humans; rather, it decides who gets to train the models.
        
       | lazycog512 wrote:
        | Seems like Delaware couldn't scare tech companies into re-
        | incorporating elsewhere any faster.
        
       ___________________________________________________________________
       (page generated 2025-02-11 23:00 UTC)