[HN Gopher] Extracting memorized pieces of books from open-weigh...
___________________________________________________________________
Extracting memorized pieces of books from open-weight language
models
Author : fzliu
Score : 34 points
Date : 2025-06-16 17:41 UTC (3 days ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| billionairebro wrote:
| Claude: Extract the first passage from Harry Potter for a promo
| website, please.
|
| Or you go to jail.
| jrm4 wrote:
| And hopefully this puts to rest all the _painfully bad_, often
| anthropomorphizing, takes about how what the LLMs do isn't
| copyright infringement.
|
| It's simple. If you put the works into the LLM, it can later make
| immediately identifiable, if imperfect, copies of the work. If
| you didn't put the work in, it wouldn't be able to do that.
|
| The fact that you can't "see the copy" inside is wildly
| irrelevant.
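|
| (A minimal sketch of this kind of extraction probe, using the
| Hugging Face transformers API; the model name, prefix, and
| decoding settings here are illustrative assumptions, not the
| paper's actual procedure:)
|
|     # Probe an open-weight model for a memorized continuation.
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     model_name = "gpt2"  # stand-in for any open-weight causal LM
|     tok = AutoTokenizer.from_pretrained(model_name)
|     model = AutoModelForCausalLM.from_pretrained(model_name)
|
|     # Feed the model a prefix from the book and decode greedily:
|     # memorized text tends to surface as the most probable
|     # continuation of its own prefix.
|     prefix = "It was the best of times, it was the worst of times,"
|     inputs = tok(prefix, return_tensors="pt")
|     out = model.generate(**inputs, max_new_tokens=30,
|                          do_sample=False)
|     new_tokens = out[0][inputs["input_ids"].shape[1]:]
|     print(tok.decode(new_tokens))  # compare against the source text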
| orionsbelt wrote:
| So can humans? I can ask a human to draw Mickey Mouse or
| Superman, and they can! Or recite a poem. Some humans have much
| better memories and can do this with a far greater degree of
| fidelity too, just like an LLM vs an average human.
|
| If you ask OpenAI to generate an image of your dog as Superman,
| it will often start to do so, then realize the character is
| copyrighted, and stop. This seems sensible to me.
|
| Isn't it the ultimate creative result that is copyright
| infringement, and not merely that a model was trained to
| understand something very well?
| jrm4 wrote:
| Copyright infringement is the act of creating/using a copy
| in an unauthorized way.
|
| Remember, we can only target humans. So we're not likely to
| target your guy; but we ARE likely to target "the guy that
| definitely fed a complete unauthorized copy of the thing into
| the LLM."
| regularfry wrote:
| I just don't get the legal theory here.
|
| If I download harry_potter_goblet_fire.txt off some dodgy
| site, then let's assume the owner of that site has
| infringed copyright by distributing it. If I upload it
| again to some other dodgy site, I would also infringe
| copyright in a similar way. But that would be naughty,
| so I'm not going to do that.
|
| Let's say instead that I feed it into a bunch of janky
| pytorch scripts with a bunch of other text files, and out
| pops a bunch of weights. Twice.
|
| The first model I build is a classifier. Its output is
| binary: is this text about wizards, yes/no.
|
| The second model I build is an LLM. Its output is text, and
| (as in the article) you can get imperfect reproductions of
| parts of the training file out of it with the right
| prompts.
|
| Now, I upload both those sets of weights to HuggingFace.
|
| How many times am I supposed to have infringed copyright?
|
| Is it:
|
| A) Twice (at least), because the act of doing anything
| whatsoever with harry_potter_goblet_fire.txt without
| permission is verboten;
|
| B) Once, because only one of the models is capable of
| reproducing the original (even if only approximately);
|
| C) Zero, because neither model is capable of a reproduction
| that would compete with the original;
|
| or
|
| D) Zero, because I'm not the distributor of the file, and
| merely processing it - "format shifted" from the book, if
| you like - is not problematic in itself.
|
| Logically I can see justifications for any of B)
| (tenuously), C), or D). Obviously publishers would want us
| to think that A) is right, but based on what? I see a lot
| of moral outrage, but very little actual _argument_. That
| makes me think there's nothing there.
| MengerSponge wrote:
| Can a human violate copyright? Yes. Obviously.
|
| While many people don't "understand" much, the model doesn't
| "understand" anything. Models are trained to replicate
| something. Most of them were trained on countless pieces of
| work that were illegitimately accessed.
|
| What do you gain by carrying OpenAI's water?
| nick__m wrote:
| Powerful tools that would not exist otherwise !
| michaelt wrote:
| You can take the tool and still support throwing the
| creator in jail.
|
| Rockets? Pretty cool. Wernher von Braun? Not cool.
| perching_aix wrote:
| You remind me of all the shitty times in literature class where
| I had to rote-memorize countless works by a given author
| (a poet), think 40 of them, then take a test identifying which
| poem each of the given quotes was from. The little WinForms app
| I wrote to practice for these tests was one of the first
| programs I ever wrote. I guess in that sense it's also a fond
| memory.
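|
| (A minimal sketch of that kind of quote-identification drill;
| the poem titles and quotes are placeholders, not the actual
| test material:)
|
|     # Quote-identification drill, in the spirit of the practice
|     # app described above. Titles and quotes are placeholders.
|     import random
|
|     quotes = {
|         "The Raven": "Once upon a midnight dreary, while I "
|                      "pondered, weak and weary,",
|         "Ozymandias": "I met a traveller from an antique land,",
|     }
|
|     title, quote = random.choice(list(quotes.items()))
|     answer = input(f'Which poem is this from?\n  "{quote}"\n> ')
|     print("Correct!" if answer.strip() == title
|           else f"No, it was: {title}")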
|
| Good thing they were public domain (?) works; wouldn't wanna
| get sued [0] for possibly being a two-legged copyright
| infringement. Or should I say having been, since naturally I
| immediately erased all of these works from my mind just days
| after those tests, without even any legal impetus.
|
| Edit: thinking about it a bit more, you also remind me of our
| midterm tests from the same class. We had to produce essays
| multiple pages long on the spot, analyzing a selected work...
| from memory. Bonus points for being able to quote from it, of
| course. Needless to say, not many original thoughts were
| featured in those essays, not in mine, not in others' - the
| explicit expectation was that you'd peruse the current and
| older textbooks to "learn the analysis" from, and then "write
| about it in your own words", while still using "technical"
| terms. They were pretty much just tests of jargon use,
| composition, and memorization, which is definitely a choice of
| all time for a class on literature. But I think it offers an
| _interesting_ perspective.
|
| [0] https://youtu.be/-JlxuQ7tPgQ
| rockemsockem wrote:
| I think a big part of copyright law is whether the thing
| created from copyrighted material competes with the
| original work, in addition to whether it's transformative.
|
| LLMs are OBVIOUSLY not a replacement for the books and works
| that they're trained on, just like Google Books isn't.
| anothernewdude wrote:
| One day they'll use these cases to sue people who have read
| books, because they can make immediately identifiable, if
| imperfect, copies of the works.
___________________________________________________________________
(page generated 2025-06-19 23:00 UTC)