[HN Gopher] Extracting memorized pieces of books from open-weigh...
       ___________________________________________________________________
        
       Extracting memorized pieces of books from open-weight language
       models
        
       Author : fzliu
       Score  : 34 points
       Date   : 2025-06-16 17:41 UTC (3 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | billionairebro wrote:
        | Claude: Extract the first passage from Harry Potter for a
        | promo website, please.
       | 
       | Or you go to jail.
        
       | jrm4 wrote:
        | And hopefully this puts to rest all the _painfully bad_, often
       | anthropomorphizing, takes about how what the LLMs do isn't
       | copyright infringement.
       | 
       | It's simple. If you put the works into the LLM, it can later make
       | immediately identifiable, if imperfect, copies of the work. If
       | you didn't put the work in, it wouldn't be able to do that.
       | 
       | The fact that you can't "see the copy" inside is wildly
       | irrelevant.
        
         | orionsbelt wrote:
         | So can humans? I can ask a human to draw Mickey Mouse or
         | Superman, and they can! Or recite a poem. Some humans have much
         | better memories and can do this with a far greater degree of
         | fidelity too, just like an LLM vs an average human.
         | 
         | If you ask OpenAI to generate an image of your dog as Superman,
         | it will often start to do so, and then it will realize it is
         | copyrighted, and stop. This seems sensible to me.
         | 
         | Isn't it the ultimate creative result that is copyright
         | infringement, and not merely that a model was trained to
         | understand something very well?
        
           | jrm4 wrote:
            | Copyright infringement is the act of creating or using a
            | copy in an unauthorized way.
           | 
           | Remember, we can only target humans. So we're not likely to
           | target your guy; but we ARE likely to target "the guy that
           | definitely fed a complete unauthorized copy of the thing into
           | the LLM."
        
             | regularfry wrote:
             | I just don't get the legal theory here.
             | 
              | If I download harry_potter_goblet_fire.txt off some dodgy
              | site, then let's assume the owner of that site has
              | infringed copyright by distributing it. If I upload it
              | again to some other dodgy site, I would also infringe
              | copyright in a similar way. But that would be naughty,
              | so I'm not going to do that.
             | 
             | Let's say instead that I feed it into a bunch of janky
             | pytorch scripts with a bunch of other text files, and out
             | pops a bunch of weights. Twice.
             | 
             | The first model I build is a classifier. Its output is
             | binary: is this text about wizards, yes/no.
             | 
             | The second model I build is an LLM. Its output is text, and
             | (as in the article) you can get imperfect reproductions of
             | parts of the training file out of it with the right
             | prompts.
             | 
             | Now, I upload both those sets of weights to HuggingFace.
             | 
             | How many times am I supposed to have infringed copyright?
             | 
             | Is it:
             | 
             | A) Twice (at least), because the act of doing anything
             | whatsoever with harry_potter_goblet_fire.txt without
             | permission is verboten;
             | 
             | B) Once, because only one of the models is capable of
             | reproducing the original (even if only approximately);
             | 
             | C) Zero, because neither model is capable of a reproduction
             | that would compete with the original;
             | 
             | or
             | 
             | D) Zero, because I'm not the distributor of the file, and
             | merely processing it - "format shifted" from the book, if
             | you like - is not problematic in itself.
             | 
             | Logically I can see justifications for any of B)
             | (tenuously), C), or D). Obviously publishers would want us
             | to think that A) is right, but based on what? I see a lot
              | of moral outrage, but very little actual _argument_. That
              | makes me think there's nothing there.
        
           | MengerSponge wrote:
           | Can a human violate copyright? Yes. Obviously.
           | 
           | While many people don't "understand" much, the model doesn't
           | "understand" anything. Models are trained to replicate
           | something. Most of them were trained on countless pieces of
           | work that were illegitimately accessed.
           | 
           | What do you gain by carrying OpenAI's water?
        
             | nick__m wrote:
             | Powerful tools that would not exist otherwise !
        
               | michaelt wrote:
               | You can take the tool and still support throwing the
               | creator in jail.
               | 
               | Rockets? Pretty cool. Wernher von Braun? Not cool.
        
         | perching_aix wrote:
          | You remind me of all the shitty times in literature class
          | when I had to rote-memorize countless works by a given
          | author (poet), think 40, then take a test identifying which
          | poem each of the given quotes was from. The little WinForms
          | app I wrote to practice for those tests was one of the first
          | programs I ever wrote. I guess in that sense it's also a
          | fond memory.
         | 
         | Good thing they were public (?) works, wouldn't wanna get sued
         | [0] for possibly being a two legged copyright infringement. Or
         | should I say having been, since naturally I immediately erased
         | all of these works from my mind just days after these tests,
         | without even any legal impetus.
         | 
          | Edit: thinking about it a bit more, you also remind me of our
         | midterm tests from the same class. We had to produce multiple
         | page long essays on the spot, analyzing a select work... from
         | memory. Bonus points for being able to quote from it, of
         | course. Needless to say, not many original thoughts were
         | featured in those essays, not in mine, not in others' - the
         | explicit expectation was that you'd peruse the current and
         | older textbooks to "learn the analysis" from, and then you'd
         | "write about it in your own words", but still using "technical"
         | terms. They were pretty much just tests in jargon use,
         | composition, and memorization, which is definitely a choice of
         | all time for a class on literature. But I think it draws an
         | _interesting_ perspective.
         | 
         | [0] https://youtu.be/-JlxuQ7tPgQ
        
         | rockemsockem wrote:
          | I think a big part of copyright law is whether the thing
          | created from copyrighted material competes with the original
          | work, in addition to whether it's transformative.
         | 
         | LLMs are OBVIOUSLY not a replacement for the books and works
         | that they're trained on, just like Google books isn't.
        
         | anothernewdude wrote:
          | One day they'll use these cases to sue people who have read
          | books, because they can make immediately identifiable, if
          | imperfect, copies of the works.
        
       ___________________________________________________________________
       (page generated 2025-06-19 23:00 UTC)