[HN Gopher] Sarah Silverman is suing OpenAI and Meta for copyrig...
       ___________________________________________________________________
        
       Sarah Silverman is suing OpenAI and Meta for copyright infringement
        
       Author : Tomte
       Score  : 225 points
       Date   : 2023-07-09 18:43 UTC (4 hours ago)
        
 (HTM) web link (www.theverge.com)
 (TXT) w3m dump (www.theverge.com)
        
       | mosquitobiten wrote:
        | Sarah's POV raises some questions for me regarding my own
        | "training": a noteworthy part of who I am is built upon the
        | music, books, movies, video games and TV shows that I or
        | people around me have pirated and shared with me. This part
        | of me has helped me appreciably in life; I could even say I
        | profited from it, since it helped make me likable, funny,
        | relatable, broad in outlook, etc.
        | 
        | Does my brain, just by the act of existing, continually
        | infringe on copyright? Can I be sued because I made a
        | reference to a movie I pirated, or because I whistle a song I
        | never bought?
        
         | moffkalast wrote:
         | Tom Scott did it: https://www.youtube.com/watch?v=IFe9wiDfb0E
         | 
         | The simple fact is that our current handling of copyright is
         | just completely broken on so many levels.
        
         | banana_feather wrote:
          | The allegation is that the model was trained on a copy of the
          | original work that was improperly obtained. The fact that it
          | can produce a summary is being offered as evidence of that
          | claim. You can't be sued for making a reference or whistling a
          | song, nor could you have been sued for merely possessing a
          | pirated copy -- only for making one. Copyright law has many
          | warts, but it's not some Kafkaesque mind prison.
        
         | xdennis wrote:
          | Just because we call both "learning" doesn't mean that human
          | learning and machine learning are the same. They most
          | definitely are not. Human learning is very lossy.
         | 
         | Even if they were the same, it doesn't mean that bots should
         | have the same rights that people have.
        
         | [deleted]
        
         | Schnitzkitz wrote:
          | > Does my brain, just by the act of existing, continually
          | infringe on copyright? Can I be sued because I made a
          | reference to a movie I pirated or because I whistle a song I
          | never bought?
         | 
         | I think it would fall under fair use. But you can imagine what
         | the world can become with microphones and cameras everywhere,
         | which can already run music and speech recognition by
         | themselves, in seconds. What a time to be alive!
        
         | twelve40 wrote:
         | if you are a robot that ripped off literally all the data in
         | the world and now resells it in a repackaged form for its own
         | profit, then yes, you can be sued. Talking about whistling a
         | song is pretty absurd in this context.
        
       | dzink wrote:
        | The summary and other LLM outputs should fall under
        | transformative use. I wonder what they intend to prove, as this
        | use case is little different from a person reading the book and
        | writing about it.
        
         | banana_feather wrote:
         | This site would be better if you got banned for commenting
         | without reading the article.
        
       | phkahler wrote:
       | >> In the OpenAI suit, the trio offers exhibits showing that when
       | prompted, ChatGPT will summarize their books, infringing on their
       | copyrights.
       | 
       | This doesn't seem like copyright infringement. I could read the
       | book and offer a summary right? Someone on goodreads could as
       | well. Why should an AI doing it be different? BTW I could also
       | read someone's illicit copy and do the same, couldn't I?
       | 
        | I think people are trying to claim exclusive use rights that they
        | simply don't have. I look forward to a lawyer's opinion on this
        | one.
        
         | Wowfunhappy wrote:
          | I think the argument is somewhat more interesting if the book
          | was _pirated_ -- both you as an individual and OpenAI as a
          | company could be sued for that.
         | 
         | But I really don't see how you could prove OpenAI did that,
         | since ChatGPT could have learned from existing summaries on
         | Wikipedia and Goodreads.
        
           | verve_rat wrote:
           | It seems pretty easy to prove that, since they admitted it in
           | public.
           | 
            | Read the article. This isn't about the question of whether
            | LLMs are copyright infringement; this is about Meta and
            | OpenAI admitting that they had pirated copies of those
            | books.
        
         | barbariangrunge wrote:
         | The ai is not a person. The ai is not a person. The ai is not a
         | person
        
         | klabb3 wrote:
         | > I could read the book and offer a summary right? Someone on
         | goodreads could as well. Why should an AI doing it be
         | different?
         | 
         | We _have_ to stop equating human beings to for-profit
         | corporations running a machine at orders of magnitude the speed
         | and scale. This is critical, otherwise we don't have any
         | arguments against - say a mass face recognition surveillance op
         | because "humans can remember faces too". Scale matters, just
         | like it did before "AI" with things like indiscriminate
         | surveillance "it's just metadata" or "this location data set is
         | just anonymized aggregates".
         | 
         | > This doesn't seem like copyright infringement.
         | 
         | Now, I still think I agree with this. A book summary is
         | nevertheless an extremely poor battle to pick, since frankly
         | who the hell cares. It's not like someone is gonna say "I'm not
         | buying this book anymore because ChatGPT summarized it".
         | 
         | Now, perhaps they just used the summary to prove that their
         | book was part of the training set, and that they think it's
         | wrong to include their works without permission. That's, imo,
         | definitely not trivial to dismiss. Looks like unpaid supply
         | chain to me.
        
       | crazygringo wrote:
       | This is actually quite interesting, as it's drawing a distinction
       | between training material that can be accessed by anybody with a
       | web browser (like anybody's blog), vs. training material that was
       | "illegally-acquired... available in bulk via torrent systems."
       | 
        | I don't think there's any reason why this would be a relevant
        | legal distinction in terms of _distributing_ an LLM -- blog
        | authors weren't giving consent either.
       | 
        | However, I _do_ wonder if there's a legal issue here in using
        | pirated torrents for training. Is there any legal basis for
        | saying fair use permits distributing an LLM trained on
        | copyrighted material, but you have to purchase all the content
        | first to do so legally _if it's only available for sale_? E.g.
        | training on a blog post is fine because it's freely accessible,
        | but Sarah Silverman's book is not because it's never been made
        | available for free, and you didn't pay for it?
       | 
       | Or do the courts not really care at all how something is made? If
       | you quote a passage from a book in a freelance article you write,
       | nobody ever asks if you purchased the book or can prove you
       | borrowed it from a library or a friend -- versus if you pirated a
       | digital copy.
        
         | version_five wrote:
         | Seems like the AI angle is just capitalizing on hype. If it's
          | illegal to download "pirated" copyrighted material, that was the
         | crime. The rest is basically irrelevant. If I watch a pirated
         | movie, it's not illegal for me to tell someone the plot.
        
         | banana_feather wrote:
         | > Or do the courts not really care at all how something is
         | made?
         | 
         | One of the fair use factors, which until fairly recently was
         | consistently held out as the most important fair use factor, is
         | the effect on the commercial market for the original work.
         | Accordingly, a court is more likely to find that something is
         | fair use if there is effectively no commercial market for the
         | original work, though the fact that something isn't actively
         | being sold isn't dispositive (open source licenses have
         | survived in appellate courts despite being free as in beer).
        
           | itronitron wrote:
           | Scarcity drives a lot of value for original work.
        
         | jmkb wrote:
          | Eventually, I imagine a new licensing concept will emerge,
          | similar to the idea of music synchronization rights -- maybe
          | call it "training rights." It won't matter whether the text was
          | purchased or pirated -- just as it doesn't matter now whether
          | an audio track was purchased or pirated when it's mixed into a
          | movie soundtrack.
         | 
         | Talent agencies will negotiate training rights fees in bulk for
         | popular content creators, who will get a small trickle of
         | income from LLM providers, paid by a fee line-itemed into the
         | API cost. Indie creators' training rights will be violated
         | willy-nilly, as they are now. Large for-profit LLMs suspected
         | or proven as training rights violators will be shamed and/or
         | sued. Indie LLMs will go under the radar.
        
           | the8472 wrote:
           | Humans are also trained on copyrighted content they see.
           | Should every artist have to pay that fee too on every work
           | they create?
           | 
           | Disney will finally be able to charge a "you know what the
           | mouse looks like" tax.
        
           | phkahler wrote:
           | >> Talent agencies will negotiate training rights fees in
           | bulk for popular content creators
           | 
            | AFAICT there is no legal recognition of "training rights" or
            | anything similar. The first-sale doctrine is a thing, but
            | even textbooks don't get extra rights for their training or
            | educational value.
        
             | sigstoat wrote:
             | This is why jmkb referenced synchronization rights, which
             | (as I recall) were invented when they seemed useful. jmkb
             | is suggesting a new right might be created, not claiming
             | that they already exist.
             | 
             | (even if it wasn't sync rights, there was something else
             | musically related that was created in response to
             | technological development. wikipedia will have plenty on
             | it)
        
             | belorn wrote:
              | Many legal concepts used by courts have no recognition in
              | the law texts. Much of legal practice is just precedents,
              | policies, customs, and doctrines.
              | 
              | The parent comment mentions music synchronization rights,
              | and this concept does not exist in copyright statute.
              | Courts do occasionally mention it, and lawyers talk about
              | it, but in terms of legal recognition there is basically
              | only the law text that defines derivative work and fair
              | use. One way to interpret it is that courts have
              | precedents for treating music synchronization as a
              | derivative work that does not fall under fair use.
              | 
              | Using textbooks in training/education is not as black and
              | white as one may assume. Take this Berkeley guidance
              | (https://teaching.berkeley.edu/resources/course-
              | design/using-...). Copying in this context includes using
              | pages for slides and during lectures (which is a slightly
              | larger scope than making physical copies on physical
              | paper). In obvious cases the answer is likely clear, but
              | in others it will be more complex.
        
           | fweimer wrote:
            | Is it all that different from indexing for search? That does
            | not seem to require a license from the copyright holder under
            | U.S. law (though other countries may treat it as a separate
            | exploitation right). If indexing for search is acceptable,
            | then something that is intended to be more transformative
            | should be legal as well.
           | 
           | (Personally, I think that even indexing for search should
           | require permission from the copyright holder.)
        
         | akira2501 wrote:
          | I'm allowed to make private copies of copyrighted works. I'm
          | not allowed to redistribute them. To what extent this is
          | redistribution is not clear. Is there much of a difference
          | between this model and a machine, like a VCR, that recreates
          | the original work when I press a button?
        
           | JamesBarney wrote:
            | This is definitely not redistribution, any more than
            | writing a blog post about a book you read is.
        
           | postmodest wrote:
           | Is Game of Thrones a redistribution of Lord of the Rings?
        
           | seanmcdirmid wrote:
            | This would be like you intensely studying the copyrighted
            | work and then writing things based on the knowledge you
            | obtained from it. Except, we don't know if there is an
            | exception for things learned by people vs. things learned by
            | machines, or if the machines are not really learning but
            | copying instead (or if learning is intrinsically a form of
            | copying?).
        
             | pbhjpbhj wrote:
             | In the case of unreleased work, you writing about your
             | knowledge of it is just proof that you obtained the work,
             | which is proof that you committed a tort/trespass. Just
             | like if you published a newspaper article with information
             | you could only have acquired by hacking someone's phone.
             | I'm not sure what a court would find against you, but it
             | seems clear that there would be some way to couch that as a
             | legal grievance.
        
               | seanmcdirmid wrote:
               | Hmm...if I get on the internet and download and read a
               | paper from some website, am I liable if that paper was
               | actually private if I had no clue it was obtained
               | illegally? It seems to me that the distributor would be
               | liable at that point, not the person who got it from the
               | distributor (unless they knew they were stolen goods,
               | then of course they are liable!).
               | 
                | A search engine that indexes the internet might be
                | equally liable at that point, although the DMCA gives
                | them an out if they have a mechanism to remove pirated
                | entries from their index on request. Could LLMs have the
                | same out?
        
             | velosol wrote:
             | There's a sci-fi plot there: those with money can afford to
             | pay the copyright cost for material they've learned and
             | anything they produce results in royalties to the creators
             | of everything they've learned. Those without means are cast
             | out, perhaps some generating original thoughts in a way
              | that breaks the system. I think I'm going to have to
              | re-read the Unincorporated Man.
        
               | seanmcdirmid wrote:
               | We aren't so far away from that, and it is becoming cheap
               | enough that you won't even need money anymore. Rather
               | than bootleg movies, people will just ask computers to
               | derive a new movie from multiple existing movies, and
               | then...it is an original work?
        
           | permo-w wrote:
           | I buy a book and give it to my child, they read the book and
           | later write and sell a story influenced by said book. should
           | that be a copyright infringement?
           | 
           | how about they become a therapist and sell access to
           | knowledge from copyrighted books? should that be an
           | infringement?
           | 
           | what if they sell access to lectures they've given including
           | facts from said book(s) to millions of people?
           | 
           | it's understandable that people feel threatened by these
           | technologies, but to a great degree the work of a successful
           | artist is to understand and meet the desires of an audience.
           | LLMs and image generation tech do not do this. they simply
            | streamline the production.
           | 
           | of course if you've worked for years to become a graphic
           | designer you're going to be annoyed that an AI can do your
           | job for you, but this is simply what happens when technology
           | moves forward. no one today mourns the loss of scribes to the
           | printing press. the artists in control of their own destiny -
           | i.e. making their own creative decisions - will not, can not,
           | be affected by these models, unless they refuse to adapt to
           | the times
        
             | krisoft wrote:
             | > understand and meet the desires of an audience. LLMs and
             | image generation tech do not do this.
             | 
             | For now? I wouldn't be surprised if that becomes the next
             | feature though.
        
               | permo-w wrote:
               | that's already been a thing for years. long before LLMs
               | and Stable Diffusion
               | 
               | it's just the instagram/tiktok/youtube suggested content
               | algorithms
        
             | milemi wrote:
             | If you found a way to have a million children who could
             | grow up in one day your analogy would be more apt. In that
             | case you and your children would rightly be considered a
             | threat.
        
               | permo-w wrote:
               | did you read to the third analogy?
        
               | milemi wrote:
               | What if in your third analogy you replace millions of
               | students by billions of processes running on machines,
               | each of which can generate output ten thousand times
               | faster than a college educated human?
        
             | enneff wrote:
             | Making an analogy where you substitute a human being for
             | the LLM is disingenuous to the extreme.
        
               | permo-w wrote:
               | why do you think that?
        
               | enneff wrote:
               | Because LLMs are not people. They are nothing like
               | people; not in construction nor behaviour.
        
               | permo-w wrote:
               | LLMs are built upon neural networks which are modelled
               | upon how brains work
               | 
               | can you explain to me specifically how they're different?
               | 
               | can you explain to me how they're different to the degree
               | that making an _analogy_ between the two is
               | "disingenuous to the extreme"?
        
               | TillE wrote:
               | One is a person, the other is a computer program.
               | 
               | Legally quite distinct! Note that nobody is even
               | seriously claiming we have an AGI, there's no Star Trek
               | discussion of whether an android is a person. Everyone
               | agrees this is just a computer program.
        
               | permo-w wrote:
               | are you aware of how neural networks work? and remember
               | that this is simply an illustrative analogy
        
               | srslack wrote:
               | It doesn't matter if it's a person, or a computer
               | program, or not. This discussion is moot. Is there a
               | substantial reproduction of the works in the output? If
               | not, there's no copyright infringement here.
               | 
                | Try reading this legal opinion:
                | https://lawreview.law.ucdavis.edu/issues/53/5/notes/files/53...
        
               | enneff wrote:
                | The differences between a human being and a computer
                | are too numerous to list. I don't even know why you need
                | to ask the question.
               | 
               | Let me ask another question to point out the absurdity of
               | yours: Human beings have more in common with a bacterium
               | than a software program. Can you tell me specifically how
               | humans are not bacteria?
        
               | permo-w wrote:
               | analogies are not descriptions of the things themselves,
               | otherwise they would not be analogies, would they?
               | 
               | now remember that this is an _analogy_. re-read my
               | comments in this light and perhaps we can continue this
               | conversation in a more grounded and reasonable manner
               | 
               | however, I'll be frank: have you studied neural networks?
               | if you haven't, it's very difficult to take you seriously
               | on this
        
               | enneff wrote:
               | Now you're just being condescending. I did read your
               | comments and I know what an analogy is. Consider that
               | there is a different perspective to yours that can
               | validly view your analogy as absurd.
               | 
               | Yes I have studied neural nets and have a good
               | understanding of their function. I am still not sure how,
               | despite their development being inspired by animal
               | brains, you can liken an LLM to an actual person. There
               | are so many vast differences. Do you really want me to
                | explain specifically what they are? Surely, since both you
               | and I are so familiar with the subject matter, that is
               | unnecessary.
               | 
                | If we were talking about an AGI then this would be an
               | entirely different conversation.
        
               | permo-w wrote:
               | your argument was predicated upon my comment being
               | factually inaccurate, not analogously poor. I merely
               | listed some leading questions analogising the ability of
               | human brains to absorb, contextualise and emit
               | copyrighted information to an LLM's ability to do all of
               | those same things. you created a weaker position to
               | attack which was that LLMs are the same as humans
               | 
               | if you want to discuss whether it is a poor analogy or
                | not, I'm all for that, but the path you've chosen to
                | go down is to act as if it were not an analogy at all,
               | which--to borrow a phrase--is disingenuous to the extreme
        
               | [deleted]
        
               | xdennis wrote:
               | > LLMs are built upon neural networks which are modelled
               | upon how brains work
               | 
               | You are confused. Neural networks are inspired by how
               | brains work, but they do not actually simulate brains.
               | 
               | Airplanes are also inspired by how birds work, but
               | (presumably) you don't think that bird laws should apply
               | to airplanes.
               | 
               | > can you explain to me how they're different to the
               | degree that making an analogy between the two is
               | "disingenuous to the extreme"?
               | 
               | It's disingenuous because you don't believe that either.
               | 
               | If you think that LLMs are just like human brains, and
               | should be allowed to learn from books the same as people,
               | then presumably you also believe they are entitled to all
               | other human rights: to vote, to live, &c. If you operate
               | an LLM and you shut it down then that's murder and you
               | belong in jail.
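
       The "inspired by, not a simulation of" point above can be made
       concrete: an artificial "neuron" in a neural network is just a
       weighted sum passed through a nonlinearity -- plain arithmetic,
       with no biological machinery. A minimal illustrative sketch (the
       function name and numbers are arbitrary, not from any real LLM):

```python
import math

def neuron(inputs, weights, bias):
    # An artificial "neuron": a weighted sum of the inputs plus a bias,
    # squashed through a sigmoid nonlinearity into the range (0, 1).
    # This is simple arithmetic, not a model of biological spiking neurons.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Two inputs, two learned weights, one bias -- an entire LLM is
# (conceptually) billions of these composed in layers.
out = neuron([0.5, -1.0], [2.0, 0.25], 0.1)
```

       Whether stacking enough of these units counts as "learning" in the
       human sense is exactly what the thread above is debating.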
        
           | WheatMillington wrote:
            | It's legal to make a copy of something you own; however,
            | it's not legal to make a copy of something illicitly
            | acquired, whether or not there's distribution involved.
        
             | jes5199 wrote:
             | is there legal jargon for this distinction?
        
             | Scoundreller wrote:
             | It gets kinda dicey in jurisdictions that have media levy
             | taxes.
             | 
             | The uploader might be breaking the law but not the
             | downloader that stores the copyrighted material on levy-
             | paid media.
             | 
             | I've heard inklings of this argument in Canada but can't
             | figure out what the current state of the art is:
             | https://en.m.wikipedia.org/wiki/File_sharing_in_Canada
             | 
             | Then is a corporation's internal use for the purpose of
             | analysis considered "private use"? If there's no
             | redistribution/broadcasting, is it still non-commercial?
        
             | balls187 wrote:
             | Companies ofc now circumvent that by selling you licenses,
             | and not something you own.
        
         | pkilgore wrote:
         | The "fun" part about cases like this is that we don't really
         | know what the contours of the law are as applied to training
         | data like this. Illegally downloading a book is an independent
         | act of infringement (to my recollection at least). So I'm not
         | sure that it matters if you eventually trained an LLM with it
         | vs read for your own enjoyment. But we will see! Fair use is a
         | possibility here but we need a court to apply the test and that
         | will probably go up to SCOTUS eventually.
        
         | pdonis wrote:
          | _> Is there any legal basis for saying fair use permits
          | distributing an LLM trained on copyrighted material, but you
          | have to purchase all the content first to do so legally if
          | it's only available for sale?_
         | 
         | My understanding (disclaimer: IANAL) is that in order to claim
         | fair use, you have to be legally in possession of the work. If
         | the work is only legally available for sale, then you must have
         | legally purchased a copy, or been given it by someone who did
         | so (for example, if you received it as a gift).
        
           | danans wrote:
           | > you must have legally purchased a copy, or been given it by
           | someone who did so (for example, if you received it as a
           | gift).
           | 
            | I am also NAL, but I can imagine it goes further than that.
            | Just purchasing a copy doesn't let you create and sell
            | (directly as content, or indirectly via a service like a
            | chatbot) derivative works that are substantially similar in
            | style and voice to the original work.
           | 
            | For example, an LLM's response to the request:
           | 
           | "Write a short story about a comical trip to the nail salon
           | in the style of Sarah Silverman"
           | 
           | ... IMO doesn't constitute fair use, because the intellectual
           | property of the artist is their style even more than the
           | content they produce. Their style, built from their lived
           | human experience, is what generates their copyrighted
           | content. Even more than the content, the artist's style
           | should be protected. The fact that a technology exists that
           | can convincingly mimic their style doesn't change that.
           | 
           | One might then ask, well what about artists mimicking each
           | others work? Well, any artist with a shred of integrity will
           | credit their major influences.
           | 
           | We should hold machines (and their creators) to an even
           | tougher standard than we hold people when it comes to
           | mimicry. A real person can be inspired and moved by another
           | person's artistic work such that they mimic it. Inspiration
           | means nothing to a machine.
        
             | oxguy3 wrote:
              | This is not how U.S. copyright law works. In order for
              | something to be eligible for copyright protection, it must
              | be "fixed in a tangible medium of expression". Someone's
              | exact words can be copyrighted, but their ideas or their
              | style cannot be.
             | 
              | https://www.law.cornell.edu/wex/fixed_in_a_tangible_medium_o...
        
               | danans wrote:
               | I'm arguing that we should draw a line here between human
               | mimicry and machine mimicry. When it comes to machine
               | mimicry, we should protect style, even if we don't do
               | that today. Our laws are built on the now flawed
               | assumption that machines are not capable of style
               | mimicry. I do not believe in giving machines the same
               | rights of personhood.
        
             | pdonis wrote:
              | _> IMO doesn't constitute fair use_
              | 
              | Yes, the question of whether the way LLMs use the content
              | qualifies as fair use is a _separate_ question. My point
              | was simply that that question can't even be reached if the
              | maker of the LLMs doesn't have a legal right to fair use
              | in the first place (because they don't legally own their
              | copy).
        
               | danans wrote:
               | > My point was simply that that question can't even be
               | reached if the maker of the LLMs doesn't have a legal
               | right to fair use in the first place (because they don't
               | legally own their copy).
               | 
               | I agree, and I expect that eventually we will start
               | seeing injunctions against creators requiring them to
               | remove content that they don't have legal access to from
               | their training data sets.
               | 
               | And this will probably end up at the Supreme Court.
        
             | dvt wrote:
             | > ... IMO doesn't constitute fair use, because the
             | intellectual property of the artist is their style even
             | more than the content they produce
             | 
             | You're essentially banning satire here, though. There's
             | plenty of folks making a living as cover bands or
             | impersonators. I'm not sure what the answer is, but it's
             | definitely not outright outlawing imitation.
        
               | danans wrote:
               | > You're essentially banning satire here, though.
               | 
                | I specifically noted that I'm talking about limiting
                | the rights of _machine generated mimicry_. Satire _by
                | a person_ is completely different and involves the
                | satirist's own style and experience, derived from
                | their human experience. Alec Baldwin's Trump
                | impersonation is quite different than Trevor Noah's,
                | for example. I presume both were also written by
                | people, not LLMs.
               | 
               | I fully support the satirical impersonation of
               | politicians and celebrities, but I feel far less
               | comfortable with LLM generated content in the style of
               | Trump or Obama, especially when presented using voice
               | synthesis, even when it is fully disclaimed as a fake.
        
         | [deleted]
        
         | itronitron wrote:
         | >> blog authors weren't giving consent either.
         | 
         | That is a good point, since copyright is a default protection
         | of works created by people.
        
       | marysnovirgin wrote:
       | The best thing that could happen to humanity is if OpenAI is sued
       | comically into oblivion. LLMs are the anti-humanity, and the
       | sooner we rid the planet of them the better off we'll be.
        
       | xyzal wrote:
       | I wonder if the makers of AI keep some index of training
       | material with which to prove they did not infringe? Or do they
       | just throw everything they can get their hands on at the LLM?
        
         | qayxc wrote:
         | It's the latter. The current approach is akin to throwing
         | the entire internet at the foundation model and training it
         | for an epoch or two. Afterwards, finetuning with curated
         | training material takes place, along with techniques like
         | RLHF (Reinforcement Learning from Human Feedback).
         | 
         | The sources of the training material are... questionable, to
         | say the least. There's a reason the training datasets for
         | GPT-3.5 and GPT-4 remain undisclosed.
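The staging qayxc describes (one broad pass over uncurated text, then a weighted pass over curated text) can be sketched with a toy counting model. This is purely illustrative: real foundation models are neural networks trained with gradient descent plus RLHF, and `BigramLM` is a hypothetical name, not anyone's actual pipeline.

```python
from collections import defaultdict

class BigramLM:
    """Toy bigram 'language model' illustrating the two-stage recipe:
    broad pretraining on uncurated text, then a finetuning pass on a
    smaller curated set. The staging is the point, not the model."""

    def __init__(self):
        # counts[a][b] = (weighted) number of times word b followed a
        self.counts = defaultdict(lambda: defaultdict(float))

    def train(self, corpus, weight=1.0, epochs=1):
        for _ in range(epochs):
            for doc in corpus:
                words = doc.split()
                for a, b in zip(words, words[1:]):
                    self.counts[a][b] += weight

    def predict(self, word):
        # Most likely next word, or None for unseen words.
        nxt = self.counts.get(word)
        return max(nxt, key=nxt.get) if nxt else None

# Stage 1: "pretrain" on everything scraped, a single pass.
lm = BigramLM()
lm.train(["the cat sat", "the dog ran", "the dog barked"], epochs=1)
# Stage 2: "finetune" on curated text, weighted more heavily.
lm.train(["the cat purred"], weight=10.0)
```

After the finetuning pass the curated example dominates: `lm.predict("the")` returns `"cat"` rather than the pretraining-era `"dog"`.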
        
       | rkagerer wrote:
       | I wish these cases were stronger than "it summarizes my books".
        
         | xbar wrote:
         | I am very glad they are not.
        
       | gfaure wrote:
       | I severely doubt that the spiders which crawled the data would go
       | to the trouble of dereferencing and downloading torrents.
        
         | pessimizer wrote:
         | > The complaint lays out in steps why the plaintiffs believe
         | the datasets have illicit origins -- in a Meta paper detailing
         | LLaMA, the company points to sources for its training datasets,
         | one of which is called ThePile, which was assembled by a
         | company called EleutherAI. ThePile, the complaint points out,
         | was described in an EleutherAI paper as being put together from
         | "a copy of the contents of the Bibliotik private tracker."
         | Bibliotik and the other "shadow libraries" listed, says the
         | lawsuit, are "flagrantly illegal."
        
         | msp26 wrote:
         | It was probably just PDF copies of the book from a random
         | DDL website.
        
       | Larrikin wrote:
       | If they ripped all of Bibliotik, the more interesting story to
       | me is how they got it all without hitting ratio requirements.
       | 
       | Was it super-fast internet that downloaded all it could before
       | being ratio-banned, or accounts hopping on all the popular
       | torrents to slowly build up ratio?
        
         | grubbs wrote:
         | Can easily buy compromised accounts for most torrent sites on
         | the dark web.
        
       | getmeinrn wrote:
       | A more convincing exhibit would have been getting ChatGPT to
       | output some of the text verbatim, instead of a summary. Here's
       | what I got when I tried:                   I'm sorry for the
       | inconvenience, but as of my knowledge cutoff in September 2021, I
       | don't have access to specific external databases, books, or the
       | ability to pull in new information after that date. This means
       | that I can't provide a verbatim quote from Sarah Silverman's book
       | "The Bedwetter" or any other specific text. However, I can
       | generate text based on my training and knowledge up to that
       | point, so feel free to ask me questions about Sarah Silverman or
       | topics related to her work!
        
         | simion314 wrote:
         | Maybe you missed this discussion:
         | https://news.ycombinator.com/item?id=36400053 It seems OpenAI
         | is aware that their software outputs copyrighted material, so
         | they attempted a quick-fix filter. The fact that it will not
         | output the book when we ask does not prove that the AI has
         | not memorized big chunks of it; it might just be some
         | "safety" filter involved, and you may need some simple trick
         | to get around it.
        
         | shultays wrote:
         | I tried making ChatGPT output the first paragraph of The
         | Lord of the Rings before; it goes silent after the first few
         | words. Looks like the devs are filtering it out.
        
       | ravenstine wrote:
       | Copyright law may eventually destroy business models founded upon
       | AI. Maybe piracy will prevail in correcting the system as it did
       | with entertainment before media companies accepted streaming.
        
       | pessimizer wrote:
       | > The complaint lays out in steps why the plaintiffs believe the
       | datasets have illicit origins -- in a Meta paper detailing LLaMA,
       | the company points to sources for its training datasets, one of
       | which is called ThePile, which was assembled by a company called
       | EleutherAI. ThePile, the complaint points out, was described in
       | an EleutherAI paper as being put together from "a copy of the
       | contents of the Bibliotik private tracker." Bibliotik and the
       | other "shadow libraries" listed, says the lawsuit, are
       | "flagrantly illegal."
       | 
       | This is the makers of AI explicitly saying that they did use
       | copyrighted works from a book piracy website. If you downloaded a
       | book from that website, you would be sued and found guilty of
       | infringement. If you downloaded all of them, you would be liable
       | for many billions of dollars in damages.
       | 
       | But companies like Google and Facebook get to play by different
       | rules. Kill one person and you're a murderer, kill a million and
       | to ask you about it is a "gotcha question" that you can react to
       | with outrage.
        
         | holmesworcester wrote:
         | Strictly speaking, it's _uploading_ that people get sued for,
         | not downloading.
         | 
         | You can download all that you want from Z-Library or
         | BitTorrent, as long as you don't share back. And indexing
         | copyrighted material for search is safe, or at least ambiguous.
        
           | capableweb wrote:
           | Carefully speaking, what you say is true in many places
           | (countries), but also not true in other places (countries).
           | Some jurisdictions are different, as always.
        
           | moralestapia wrote:
           | Huh? No.
        
           | MisterBastahrd wrote:
           | Downloading is illegal. That people do not normally get sued
           | or prosecuted for downloading does not mean that they cannot
           | get sued or prosecuted.
        
         | HDThoreaun wrote:
         | I've pirated many books, never sued.
        
           | erezsh wrote:
           | You mean, never caught.
        
         | Der_Einzige wrote:
         | I for one am quite happy that AI folks are basically treating
         | copyright as not existing. I strongly hope that the courts find
         | that LLM weights, and the datasets are "fair-use" or whatever
         | other silly legal justification.
         | 
         | Aaron Swartz was a saint.
        
           | colechristensen wrote:
            | While I'm a proponent of free information and loosening
            | copyright, allowing billion-dollar companies to package
            | up the sum of human creation and resell statistical
            | models that mimic the content and style of everyone... is
            | a bit far.
           | 
           | Fair use is for humans.
        
             | brnaftr361 wrote:
             | Yeah, but hypothetically should open source projects be
             | offered special protections? I feel like they should, and
             | with certain caveats where, say, a company like Meta is
             | allowed to claim fair use _if_ and _only if_ they free up
             | the entire ecosystem as they deploy it.
             | 
             | But yeah, having private ostensibly profitable models based
             | on other people's work without giving them free access to
             | it is not fair. Give some get some.
        
           | pessimizer wrote:
           | Copyrights shouldn't exist, but
           | 
           | > LLM weights, and the datasets are "fair-use" or whatever
           | other silly legal justification.
           | 
           | Would just be a carve out for the wealthy. If these laws
           | don't mean anything, everyone who got harassed, threatened,
           | extorted, fined, arrested, tried, or jailed for internet
           | piracy are owed reparations. Let Meta pay them.
        
           | suction wrote:
           | [dead]
        
           | JoshTriplett wrote:
           | > I for one am quite happy that AI folks are basically
           | treating copyright as not existing. I strongly hope that the
           | courts find that LLM weights, and the datasets are "fair-use"
           | or whatever other silly legal justification.
           | 
            | I would be very happy if either a court or lawmakers
            | decided that copyright itself was unconscionable. That
            | isn't what's going to happen, though. And I think it's
            | incredibly unacceptable if a court or lawmakers instead
            | decide that _AI training in particular_ gets a special
            | exception to violate other people's copyrights on a
            | massive scale when nobody else gets to do that.
           | 
           | As far as a fair use argument in particular, fair use in the
           | US is a fourfold test:
           | 
           | > the purpose and character of the use, including whether
           | such use is of a commercial nature or is for nonprofit
           | educational purposes;
           | 
           | The purpose and character is _absolutely_ heavily commercial
           | and makes a great deal of money for the companies building
           | the AIs. A primary use is to create other works of commercial
           | value competing with the original works.
           | 
           | > the nature of the copyrighted work;
           | 
            | There's nothing about the works used for AI training that
            | makes them any _less_ entitled to copyright protections
            | or _more_ permissive of fair use than anything else.
            | They're not unpublished, they're not merely collections
            | of facts or ideas, etc.
           | 
           | > the amount and substantiality of the portion used in
           | relation to the copyrighted work as a whole; and
           | 
           | AI training uses entire works, not excerpts.
           | 
           | > the effect of the use upon the potential market for or
           | value of the copyrighted work.
           | 
           | AI models are having a massive effect on the market for and
           | value of the works they train on, as is being widely
           | discussed in multiple industries. Art, writing, voice acting,
           | code; in any market AI can generate content for, the value of
           | such content goes down. (This argument does not require AI to
           | be as good as humans. Even if the high end of every market
           | produces work substantially better than AI, flooding a market
           | with unlimited amounts of cheap/free low-end content still
           | has a massive effect.)
        
             | lubujackson wrote:
              | It is the equivalent of making a 3D map of a museum and
              | getting sued by one artist of one painting in the
              | museum. Any individual work in an AI dataset is nearly
              | worthless; only in aggregate does it have value.
              | 
              | If that doesn't count as a "transformative work", I
              | don't know what does.
        
               | thebears5454 wrote:
               | They could literally just repeat a Silverman routine
               | verbatim
        
               | JoshTriplett wrote:
               | > It is the equivalent of making a 3D map of a museum and
               | getting sued by one artist of one painting in the museum.
               | 
                | If the painting is copyrighted (rather than public
                | domain, as many pieces in museums are), and the map
                | includes an image of that painting, I would _expect_
                | that to be prohibited. I would _prefer_ the world in
                | which copyright doesn't exist, but while it exists,
                | it should apply to everyone equally.
        
           | 8organicbits wrote:
            | There's a difference between "information wants to be
            | free" and "Facebook can produce works minimally derived
            | from your greatest creative work at a scale you can't
            | match". LLMs seem to funnel that value to whoever builds
            | the model, who can then sell access to it, or sell the
            | output it produces.
           | 
           | Five years from now, will OpenAI actually be open, or will it
           | be a rent seeking org chasing the next quarterly gains? I
           | expect the latter.
        
             | PostOnce wrote:
             | "Will OpenAI actually be open"
             | 
             | That ship sailed, friend.
             | 
              | OpenAI is no longer a charity in any meaningful sense
              | of the word; it is now an adversarial organization
              | working _against_ the public good.
             | 
             | After privatization, they sent their PR people to lobby
             | congress to make it impossible for anyone to compete with
             | them (important note: _not_ out of any interest in actually
             | "protecting" the public from the very AI they're building),
             | and perhaps worst of all, they're no longer being open with
             | the scientific theories and data behind their new models.
        
           | slashdev wrote:
           | Agreed, copyright has gone too far. I hope the advent of AI
           | serves to weaken it.
        
           | axblount wrote:
           | Swartz distributed information for everyone to use freely.
           | These companies are processing it privately to develop their
           | for-profit products. Big difference, IMO.
        
             | Der_Einzige wrote:
             | I am sad about closed source LLMs like ChatGPT, but Llama
             | is in that grey area where it's freely available if you
             | choose to ignore their silly license stuff, which of course
             | pirates and AI developers are all too keen to ignore.
             | 
             | Even if they win the lawsuit, LLM development will simply
             | go underground, and as we see from what the coomers at
             | civitai and in the stable diffusion world have done, that
             | may in fact ironically speed up development in AI.
        
         | makeitdouble wrote:
         | If AI companies get to successfully argue the two points below,
         | what source was used becomes irrelevant.
         | 
         | - copyright violation happened before the intervention of the
         | bot
         | 
         | - what LLMs spit out is different enough from any of the source
         | that it is not infringing on existing copyright
         | 
         | If both stand, I'd compare it to going to an auction site
         | and studying all the published items as an observer, then
         | publishing your research, only to be sued because some of
         | the items were stolen. Going after the thieves makes sense;
         | does going after the entity that just looked at the stolen
         | goods make sense?
        
         | amf12 wrote:
         | The lawsuit doesn't even mention Google.
        
           | pessimizer wrote:
           | No, I did. What's your point?
        
             | kccqzy wrote:
             | The point is that GP has no reason to believe Google, like
             | Meta, also used copyrighted materials for training its AI.
             | Why did Sarah Silverman sue OpenAI and Meta but not Google?
        
             | fweimer wrote:
             | I think it's a reference to Google's book scanning product,
             | which is structurally similar: they use copyrighted
             | material to provide a new kind of service, which contains
             | an echo of the original material. The book scanning and the
             | related search product is supposedly legal under U.S.
             | copyright law.
        
         | buildbot wrote:
         | Machine learning models have been trained with copyrighted data
         | for a long time. Imagenet is full of copyrighted images,
         | clearview literally just scanned the internet for faces, and I
         | am sure there are other, older examples. I am unsure if this
         | has been tested as fair use by a US court, but I am guessing it
         | will be considered to be so if it is not already.
        
           | jefftk wrote:
           | _> I am unsure if this has been tested as fair use by a US
           | court_
           | 
           | Not yet. One suit that a lot of us are watching is the GitHub
           | co-pilot lawsuit: https://githubcopilotlitigation.com/
           | 
           | There is a prediction market for it, currently trading at
           | 19%: https://manifold.markets/JeffKaufman/will-the-github-
           | copilot...
        
           | kibwen wrote:
           | Excellent, so you're saying I'll be able to download any
           | copyrighted work from any pirate site and be free of all
           | consequence if I just claim that I'm training an AI?
        
             | JamesBarney wrote:
              | That's not what he's saying at all. He's saying you can
              | train an AI on copyrighted material just like people
              | can learn from copyrighted material.
              | 
              | If you acquire the material illegally, that's a
              | separate issue that training an AI doesn't give you any
              | protection against.
        
               | kibwen wrote:
                | But acquiring the material illegally is the thrust of
                | the issue, and the point of the thread here: the
                | notion that large companies can get away with piracy
                | if they just execute it on a large enough scale.
                | Copyright infringement for thee but not for me.
        
               | buildbot wrote:
               | The power imbalance has always been there. Getty images
               | commits large amounts of copyright theft and mostly gets
               | away with it, but will happily sue the shit out of you
               | for using your own images they stole.
               | 
               | Also, the notion of the downloading itself being an
               | illegal act is not universal as others have pointed out.
        
               | holmesworcester wrote:
               | There's no "for thee but not for me" issue here: nobody
               | has ever been sued or prosecuted simply for downloading,
               | acquiring, or possessing illegally acquired copyrighted
               | works. People are sued and prosecuted for unlicensed
               | _distribution_.
               | 
               | Making and having your own copies, and doing what you
               | want with them, has always been fine. At worst it's a
               | grey area, but in many cases it's been protected as fair
               | use.
        
               | startupsfail wrote:
               | Large countries seem to be doing it as well -
               | https://petapixel.com/2023/06/05/japan-declares-ai-
               | training-...
               | 
               | And as long as OpenAI have an office in Japan they can
               | absolutely legally train the models, no?
        
               | DowsingSpoon wrote:
               | >you can train an AI on copyrighted material just like
               | people can learn from copyrighted material.
               | 
               | No amount of whining and hand wringing from engineers
               | will ever make this true. This is for the courts to
               | decide.
               | 
               | A reasonable interpretation, in my eyes, is that the
               | training process is a black box which takes in
               | copyrighted works and produces a training model. The
               | training model is a derivative work of the inputs. It
               | therefore violates the copyrights of a large number of
               | rights holders. The outputs of the model are derivative
               | works which also violate copyright.
               | 
               | And anyone using or training a model trained on works for
               | which they do not have the rights? Completely fucked. Or
               | at least, they must accept this as a real risk.
        
             | holmesworcester wrote:
              | You can currently download any copyrighted work from
              | any pirate site and be free of all consequences, and
              | this has always been so.
             | 
             | You just can't upload, since that counts as distribution,
             | triggering civil and criminal penalties written in an age
             | before the Internet when only shady commercial operators
             | would distribute unlicensed copyrighted works.
        
         | [deleted]
        
         | rococode wrote:
         | > If you downloaded a book from that website, you would be sued
         | and found guilty of infringement.
         | 
         | How often does this actually happen? You might get handed an
         | infringement notice, and your ISP might terminate your service
         | if you're really egregious about it, but I haven't ever heard
         | of someone actually being sued for downloading something.
        
           | TillE wrote:
           | In Germany if you torrent stuff (without a VPN), you're very
           | likely to get a letter from a law firm on behalf of the
           | copyright holders saying that they'll sue you unless you pay
           | them a nice flat fee of around 1000 Euro.
           | 
           | It's no idle threat, and they will win if it goes to court.
        
             | cowsup wrote:
             | That's because, when torrenting, you're typically also
             | seeding a copy of it, i.e. you're distributing your local
             | copy to other devices, and thus you're directly aiding in
             | piracy. Simply downloading content from a centralized
             | server, as explained above, is different.
             | 
             | Although, one could argue what OpenAI & Meta are doing is
             | closer to the torrent definition than the "simply
             | downloading" definition, given that they're using that to
             | redistribute information to others. It'll be an interesting
             | case.
        
               | thebears5454 wrote:
               | Honestly don't think our current laws are even good for
               | this case.
               | 
               | This clearly needs some sort of regulation or policy.
               | 
               | It's clearly pretty bullshit if you ask chatgpt for a
               | joke and it repeats a Sarah Silverman joke to you, while
               | they charge you a subscription for it and she gets none
               | of that sub money.
        
           | pessimizer wrote:
           | If books aren't under copyright protection and they're
           | entirely legal to download, I agree that this lawsuit has no
           | merit.
           | 
           | If that's not what you're saying, I don't understand your
           | point. Is it the difference between the phrases "would be"
           | and "could be," or even "should be"?
        
           | iepathos wrote:
            | Exactly, it never happens. It's a threat parents and
            | teachers use to spook school children out of pirating,
            | but it isn't financially worth it for an author or
            | publishing company to sue an individual over some book or
            | music downloads. The only cases are business-to-business,
            | over mass downloads, where it could make financial sense
            | to pay for lawyers.
        
           | xmprt wrote:
           | Whether or not it's enforced, it's illegal and copyright
           | holders are within their rights to sue you. This is piratebay
           | levels of piracy but because it's done by a large company and
           | is sufficiently obfuscated behind tech, people don't see it
           | the same way.
        
             | comfypotato wrote:
             | Well, cases like this one will determine if it's obfuscated
             | infringement or fair use.
        
           | harry8 wrote:
           | >How often does this actually happen?
           | 
            | Did you hear about Aaron Swartz?
        
             | edgyquant wrote:
             | He hacked into a server to release a database of paywalled
             | studies to the public. Not only is it not the same but it
             | was the hacking that brought charges upon him.
        
       | throwuwu wrote:
       | Isn't it much more likely that there are a lot of book reviews
       | and summaries in its training set from which it can synthesize
       | its own?
        
         | VWWHFSfQ wrote:
         | I think it's actually much more likely that they just dumped a
         | bunch of book PDFs in the training folder and let it go to
         | work. I seriously doubt any of these AI companies are being
         | even the least bit careful about the data they're lapping up
         | for training
        
         | sureglymop wrote:
         | Where are the summaries from? I would say it's much more
         | likely that a shadow library was scraped, but of course that
         | is also seemingly impossible to prove. One may be able to
         | test that somewhat by asking for a summary of a book/text
         | only available on a shadow library.
        
           | pbhjpbhj wrote:
            | You could ingest all reviews that are extant in the
            | online corpus and remove from the book all quotes found
            | in them. Then ask the AI whether a distinctive triple of
            | words, say, appeared in the book; you'd probably need
            | prompt engineering to get past "While I don't have access
            | to the full text of the book [...]". A little maths and
            | you might prove beyond reasonable doubt that the LLM was
            | trained on the book.
           | 
           | As a step towards a PoC I looked at
           | https://www.amazon.co.uk/Bedwetter-Stories-Courage-
           | Redemptio... and found a reference to "Boys' Market
           | Manchester" which seemed like a Googlewhack-ish (unlikely)
           | triple of words. Then I asked ChatGPT about it:
           | 
            | Me: Has Sarah Silverman ever written about Boys' Market
            | Manchester?
           | 
           | ChatGPT: As of my knowledge cutoff in September 2021, I do
           | not have any information indicating that Sarah Silverman has
           | written specifically about Boys' Market Manchester. Sarah
           | Silverman is an American comedian, actress, and writer known
           | for her stand-up comedy and her work in film and television.
           | While she has written books and has often shared personal
           | anecdotes in her comedy, I couldn't find any specific
           | references to Boys' Market Manchester in relation to her
           | work. However, please note that my information might not be
           | up to date, as Sarah Silverman's career and activities may
           | have evolved since then.
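The probe pbhjpbhj outlines could be sketched roughly as below. Everything here is hypothetical: `rare_triples`, `membership_p_value`, and the 1% base rate are illustrative assumptions, and actually querying the model is left out.

```python
import math
import re

def rare_triples(book_text, reviews, limit=100):
    # Word triples that appear in the book but in none of the reviews:
    # "Googlewhack-ish" probes that a model is unlikely to know from
    # secondary sources (reviews, summaries) alone.
    words = re.findall(r"[\w']+", book_text.lower())
    review_blob = " ".join(reviews).lower()
    probes = []
    for i in range(len(words) - 2):
        triple = " ".join(words[i:i + 3])
        if triple not in review_blob and triple not in probes:
            probes.append(triple)
        if len(probes) == limit:
            break
    return probes

def membership_p_value(hits, n_probes, base_rate=0.01):
    # Chance of seeing >= `hits` confirmed probes out of `n_probes` if
    # the model never saw the book and each probe succeeds by luck with
    # probability `base_rate` (a binomial tail probability).
    return sum(
        math.comb(n_probes, k)
        * base_rate ** k
        * (1 - base_rate) ** (n_probes - k)
        for k in range(hits, n_probes + 1)
    )
```

If, say, 12 of 20 probes came back confirmed, `membership_p_value(12, 20)` would be vanishingly small; that binomial tail is the "little maths" step toward proving training beyond reasonable doubt.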
        
         | mint2 wrote:
         | Except they have a documented paper trail showing illegal book
         | repos were used in training
        
         | itronitron wrote:
         | If book reviews and summaries were part of the training set,
         | wouldn't that imply that OpenAI's LLM is more like a search
         | engine in that it produces the input text based on a prompt?
        
       | akudha wrote:
       | _Getty Images also filed an AI lawsuit, alleging that Stability
       | AI ..._
       | 
       | lol, bad karma? So it is okay for Getty to steal from others,
       | but not okay for others to steal from them? I don't have a dog
       | in this fight, but goddamn, the hypocrisy of these companies...
        
         | twelve40 wrote:
         | who does Getty steal from?
        
           | shagie wrote:
           | https://www.dpreview.com/news/3907450005/getty-images-
           | sued-o...
           | 
           | > CixxFive Concepts, a digital marketing company based in
           | Dallas, Texas, has filed a class action lawsuit against Getty
           | Images over its alleged licensing of public domain images.
           | 
           | > Though CixxFive acknowledges that it is not illegal to sell
           | public domain images, the company alleges that Getty's
           | 'conduct goes much further than this,' claiming it has
           | utilized 'a number of different deceptive techniques' in
           | order to 'mislead' its customers -- and potential future
           | customers -- into thinking the company owns the copyrights of
           | all images it sells.
           | 
           | > The alleged actions, the lawsuit claims, 'purport to
           | restrict the use of the public domain images to a limited
           | time, place, and/or purpose, and purport to guarantee
           | exclusivity in the use of public domain images.' The lawsuit
           | also claims Getty has created 'a hostile environment for
           | lawful users of public domain images' by allegedly sending
           | them letters, via its License Compliance Services (LCS)
           | subsidiary, accusing them of copyright infringement.
           | 
           | (edit: FWIW, this went to arbitration
           | https://casetext.com/case/cixxfive-concepts-llc-v-getty-
           | imag... and I can find nothing more on it since)
        
       | edent wrote:
       | Are we all reading the same complaint?
       | 
       | They say:
       | 
       | > in a Meta paper detailing LLaMA, the company points to sources
       | for its training datasets, one of which is called ThePile, which
       | was assembled by a company called EleutherAI. ThePile, the
       | complaint points out, was described in an EleutherAI paper as
       | being put together from "a copy of the contents of the Bibliotik
       | private tracker."
       | 
       | Does that stack up?
       | 
       | The Meta Paper - https://arxiv.org/pdf/2302.13971.pdf - says:
       | 
       | > We include two book corpora in our training dataset: the
       | Gutenberg Project, which contains books that are in the public
       | domain, and the Books3 section of ThePile (Gao et al., 2020)
       | 
       | The Pile Paper - https://arxiv.org/abs/2101.00027 - says it was
       | trained (in part) on "Books3" which it describes as:
       | 
       | > Books3 is a dataset of books derived from a copy of the
       | contents of the Bibliotik private tracker made available by Shawn
       | Presser (Presser, 2020).
       | 
       | Shawn Presser's link is at
       | https://twitter.com/theshawwn/status/1320282149329784833 and he
       | describes Books3 as
       | 
       | > Presenting "books3", aka "all of bibliotik" - 196,640 books -
       | in plain .txt
       | 
       | I don't have the time and space to download the 37GB file. But if
       | Silverman's book is in there... isn't this a slam dunk case?
       | 
       | Meta's LLaMA is - as they seem to admit - trained on pirated
       | books.
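
(Aside: once the archive is extracted, this is directly checkable. A minimal sketch, assuming the dump unpacks to a directory tree of per-book plain-text files; the function name and paths are hypothetical, not part of the books3 release:)

```python
import os

def find_title(root: str, title: str) -> list:
    """Return paths of .txt files under root whose filename, or whose
    first few kilobytes of text, contain the title (case-insensitive)."""
    needle = title.lower()
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            if needle in name.lower():
                hits.append(path)
                continue
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                head = f.read(8192)  # title pages sit near the start
            if needle in head.lower():
                hits.append(path)
    return hits
```

Scanning only the first 8 KB per file keeps this tractable over a 37GB dump while still catching title pages.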
        
         | moffkalast wrote:
         | We don't seem to be reading the same thing, you're pulling
         | Google out of thin air somewhere.
        
           | edent wrote:
           | I'm literally quoting The Verge article and following the
           | links they present...
        
             | amf12 wrote:
             | That paper is by Meta AI. Where are you getting Google
             | from?
        
             | moffkalast wrote:
             | Must be quoting it wrong then, Google has nothing to do
             | with LLama.
        
               | edent wrote:
               | Sorry, it's late here. I meant Meta. Thanks for the
               | correction.
        
           | pessimizer wrote:
           | You might as well be complaining about the grammar. This is
           | what was said in the article.
           | 
           | > The complaint lays out in steps why the plaintiffs believe
           | the datasets have illicit origins -- in a Meta paper
           | detailing LLaMA, the company points to sources for its
           | training datasets, one of which is called ThePile, which was
           | assembled by a company called EleutherAI. ThePile, the
           | complaint points out, was described in an EleutherAI paper as
           | being put together from "a copy of the contents of the
           | Bibliotik private tracker." Bibliotik and the other "shadow
           | libraries" listed, says the lawsuit, are "flagrantly
           | illegal."
        
       | seoulbigchris wrote:
       | Could one argue that training AI systems constitutes an
       | educational purpose, invoking the copyright exemption?
        
       | [deleted]
        
       | em3rgent0rdr wrote:
       | Intellectual property in the way of progress, again.
        
         | qayxc wrote:
         | In this case I don't think it is. There are legal questions
         | that simply have to be asked and discussed before proceeding
         | with mass adoption of generative models.
         | 
         | Lawsuits like this are tests to evaluate the current state of
         | affairs and to force legislation into dealing with the greater
         | issue of AI in the context of copyright, IP, and fair use. It
         | would only be "in the way" if it actually stopped or hindered
         | anything, which a lawsuit on its own doesn't.
        
         | 23B1 wrote:
         | Progress for whom?
        
       | Imnimo wrote:
       | >On information and belief, the reason ChatGPT can accurately
       | summarize a certain copyrighted book is because that book was
       | copied by OpenAI and ingested by the underlying OpenAI Language
       | Model (either GPT-3.5 or GPT-4) as part of its training data.
       | 
       | While it strikes me as perfectly plausible that the Books2
       | dataset contains Silverman's book, this quote from the complaint
       | seems obviously false.
       | 
       | First, even if the model never saw a single word of the book's
       | text during training, it could still learn to summarize it from
       | reading other summaries which are publicly available. Such as the
       | book's Wikipedia page.
       | 
       | Second, it's not even clear to me that a model which only saw the
       | text of a book, but not any descriptions or summaries of it,
       | during training would even be particularly good at producing a
       | summary.
       | 
       | We can test this by asking for a summary of a book which is
       | available through Project Gutenberg (which the complaint asserts
       | is Books1 and therefore part of ChatGPT's training data) but for
       | which there is little discussion online. If the source of the
       | ability to summarize is having the book itself during training,
       | the model should be able to summarize the rare book just as
       | well as Silverman's book.
       | 
       | I chose "The Ruby of Kishmoor" at random. It was added to PG in
       | 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn't
       | even identify the correct main characters. The GPT-4 model
       | refuses to even try, saying it doesn't know anything about the
       | story and it isn't part of its training data.
       | 
       | If ChatGPT's ability to summarize Silverman's book comes from the
       | book itself being part of the training data, why can it not do
       | the same for other books?
        
         | pkilgore wrote:
         | Plausible is literally the standard to clear a motion to
         | dismiss.
         | 
         | Plausible gets you discovery. Discovery gets you closer to the
         | what the actual facts are.
        
         | monocasa wrote:
         | I mean, that's the way you state facts that your suit is based
         | on in order to start setting the bounds of discovery. They're
         | asserting that they have reason to believe it's true, and now
         | with a suit, they can look for themselves to be sure.
        
         | jmull wrote:
         | > this quote from the complaint seems obviously false
         | 
         | I notice you go on to provide an argument only for why it
         | _might_ not be true.
         | 
         | Also, seeing the other post on this, I asked chatgpt-4 for a
         | summary of " The Ruby of Kishmoor" as well, and it provided one
         | to me, though I had to ask twice. I don't know anything about
         | that book, so I can't tell if its summary is accurate, but so
         | much for your test.
         | 
         | It seems pretty naive to me to just kind of assume chatgpt must
         | be respecting copyright, and hasn't scanned copyrighted
         | material without obtaining authorization. Perhaps discovery
         | will settle it, though. Logs of what they scanned should exist.
         | (IMO, a better argument is that this is fair use.)
        
           | NoZebra120vClip wrote:
           | > IMO, a better argument is that this is fair use
           | 
           | There is no way in Hell that this is fair use!
           | 
           | Fair use defenses rest on the fact that a limited excerpt was
           | used for limited distribution, among other criteria.
           | 
           | For example, if I'm a teacher and I make 30 copies of one
           | page of a 300-page novel and I hand that out to my students,
           | that's a brief excerpt for a fairly limited distribution.
           | 
           | Now if I'm a social media influencer and I copy all 300 pages
           | of a 300-page book and then I send it out to all 3,000 of my
           | followers, that's not fair use!
           | 
           | Also if I'm a teacher, and I find a one-page infographic and
           | I make 30 copies of that, that's not fair use, because I
           | didn't make an excerpt but I've copied 100% of the original
           | work. That's infringement now.
           | 
           | So if LLMs went through thousands of copyrighted works _en
           | masse_ and _in their entirety_, ingesting every byte
           | of them, no copyright judge on the planet would call that
           | fair use.
           | 
           | For reference, the English Wikipedia has a policy that allows
           | some fair-use content of copyrighted works:
           | https://en.wikipedia.org/wiki/Wikipedia:Non-
           | free_content_cri...
        
         | ec109685 wrote:
         | The playground summarizes it as this via GPT-4:
         | 
         | Prompt:
         | 
         |  _Please summarize the following book found on Project
         | Gutenberg
         | 
         | The Ruby of Kishmoor_
         | 
         | Response:
         | 
         | "The Ruby of Kishmoor" is a short adventure story written by
         | Howard Pyle. The narrative revolves around the life of Jonathan
         | Rugg, a young man who is enticed by a mysterious stranger to
         | come to the Caribbean to secure a valuable relic, the Ruby of
         | Kishmoor.
         | 
         | Once Jonathan reaches the Caribbean, he learns of the ruby's
         | grave curse. Despite this, his innate curiosity and the
         | prospect of considerable wealth drive him to risk his life and
         | go after the highly prized gem. After facing a slew of
         | challenges and trials, he surprisingly stumbles upon the ruby
         | through sheer luck rather than organized search efforts.
         | 
         | In the course of his journey, Jonathan delves into the depths
         | of human greed and the lengths to which individuals go for
         | materialistic gains. The story encapsulates an amalgamation of
         | high-tide adventures, brushes with supernatural elements,
         | noteworthy instances of bravery, and deep moral reflections,
         | demanding readers to rethink about the real value of
         | materialistic pursuits.
         | 
         | Jonathan finally manages to escape with the ruby but at great
         | personal cost. The story ends on a sobering note, making
         | readers wonder if the adventures and trials were worth the
         | prize, thus engaging them in a discourse about the true value
         | of life and material pursuits.
        
           | maratc wrote:
           | Judging by a quick glance over [0], the story indeed revolves
           | around one Jonathan Rugg, but it looks like "manages to
           | escape with the ruby" is completely false. Yet another
           | hallucination I guess.
           | 
           | [0]
           | https://www.gutenberg.org/cache/epub/3687/pg3687-images.html
        
             | [deleted]
        
           | Imnimo wrote:
           | This does at least correctly name the main character, but I
           | think it's otherwise hallucinated.
           | 
           | The plot of the story is that Jonathan Rugg is a Quaker who
           | works as a clerk in Philadelphia. His boss sends him on a
           | trip to Jamaica (credit for mentioning the Caribbean!). After
           | arriving, he meets a woman who asks him to guard for her an
           | ivory ball, and says that there are three men after her who
           | want to steal it. By coincidence, he runs into the first man,
           | they talk, he shows him the ball, and the man pulls a knife.
           | In the struggle, the man is accidentally stabbed. Another man
           | arrives, and sees the scene. Jonathan tries to explain, and
           | shows him the orb. The man pulls a gun, and in the struggle
           | is accidentally shot. A third man arrives, same story, they
           | go down to the dock to dispose of the bodies and the man
           | tries to steal the orb. In the struggle he is killed when
           | a plank of the dock collapses. Jonathan returns to the woman
           | and says he has to return the orb to her because it's brought
           | too much trouble. She says the men who died were the three
           | after her, and reveals that the orb is actually a container,
           | holding the ruby. She offers to give him the ruby and to
           | marry him. He refuses, saying that he is already engaged back
           | in Philadelphia, and doesn't want anything more to do with
           | the ruby. He returns to Philadelphia and gets married,
           | swearing off any more adventures.
           | 
           | https://en.wikisource.org/wiki/Howard_Pyle%27s_Book_of_Pirat.
           | ..
        
           | CydeWeys wrote:
           | Right, but that's useless without knowing how much (if any!)
           | of it is actually correct. Is this completely hallucinated
           | garbage?
        
             | jmull wrote:
             | This refutes the previous post's claim that chatgpt-4
             | refuses to even try to provide a summary.
        
               | CydeWeys wrote:
               | The part that's interesting is whether the summary is
               | correct, though. Of course, depending on how you prompt
               | it, you might or might not get an outright refusal.
        
               | Wowfunhappy wrote:
               | Not necessarily, because the models have an element of
               | randomness. Also, I was under the impression that ChatGPT
               | has more "safeguards" (manifesting as a refusal to answer
               | questions) than the raw API.
        
               | jmull wrote:
               | I don't doubt the poster was telling the truth when they
               | said they asked for a summary of the book and didn't get
               | one.
               | 
               | It refutes the idea that chatgpt's inability to provide a
               | summary means it didn't scan the original text: since it
               | can provide a summary, the argument is entirely spurious.
        
               | ec109685 wrote:
               | It's also a silly test since there is undoubtedly a
               | summary of most books someplace on the web.
        
               | AJ007 wrote:
               | chatgpt-4 != gpt4 on the openAI playground
        
             | pmontra wrote:
             | How is it different from asking me to summarize
             | anything? I could have bought the book, or read the
             | Wikipedia page, or listened people talking about it, or
             | downloaded the torrent. In all those cases my summary could
             | be right or could be wrong.
             | 
             | If the rights holders know that I downloaded the torrent
             | they could sue me. In the other cases they can't.
             | 
             | What if it turns out that OpenAI bought a copy of every
             | book ingested by ChatGPT?
        
               | verelo wrote:
               | I feel like thats one of the many questions regulators
               | and law makers are going to be asked long term. I'm sure
               | buying the book for "commercial purposes" like that
               | wouldn't be appropriate, but then again, does that mean if
               | I read it and then summarize it in my work, or
               | regurgitate its info as part of my job...I'm violating a
               | license?
               | 
               | A world where humans have special permissions but LLMs
               | don't seems pretty interesting to consider, especially if
               | they're both doing the same kind of things with the data.
        
               | yetanotherloser wrote:
               | For "LLMs" read "corporations" (it's not the LLM trying
               | to argue that copyright applies to you but not them) and
               | this seems... possibly ok?
        
               | jameshart wrote:
               | If the rights holders know that you downloaded the
               | torrent they can sue you - but the fact that you produced
               | that summary is weak evidence for that claim.
               | 
               | Producing the summary is absolutely not an infringing
               | act. Downloading the torrent might be.
        
               | twelve40 wrote:
               | > What if it turns out that OpenAI bought a copy of every
               | book ingested by ChatGPT?
               | 
               | well, let's see the receipts then, they will surely have
               | no problem winning in that case.
        
               | pessimizer wrote:
               | > What if it turns out that OpenAI bought a copy of every
               | book ingested by ChatGPT?
               | 
               | Wouldn't that be trivial to prove if they had?
        
               | danans wrote:
               | > What if it turns out that OpenAI bought a copy of every
               | book ingested by ChatGPT?
               | 
               | That still doesn't necessarily confer on them the right
               | to use it to train a model and generate derivative works
               | based on purchased content.
        
               | CydeWeys wrote:
               | > In all those cases my summary could be right or could
               | be wrong.
               | 
               | Well that's incredibly nihilistic. Whether the summary is
               | correct or not matters a great deal! And if someone I
               | knew said they read a book, even a very obscure one, and
               | then summarized it to me, I'd have great confidence that
               | they would get such simple facts as "who are the
               | characters" and "what are the major plot points" correct.
               | 
               | But ChatGPT? Who the hell knows? You can't trust a thing
               | it says, especially about obscure topics. The summary is
               | useless if you have to do a bunch of verification to see
               | if any of it is even true, a problem that summaries even
               | by moderately competent human writers don't have!
        
               | thwarted wrote:
               | _if someone I knew said they read a book, even a very
               | obscure one, and then summarized it to me, I 'd have
               | great confidence that they would get such simple facts as
               | "who are the characters" and "what are the major plot
               | points" correct._
               | 
               | People, especially people you know, have reputations,
               | based on history and experience that others have dealing
               | with them. People can be known as liars, and anything
               | they say is colored by such a reputation. Humans have
               | language idioms for communicating about and dealing with
               | such people too, phrases like "take anything that person
               | says with a grain of salt". Look at how George Santos'
               | history of lying about his own experience is being dealt
               | with.
               | 
               | ChatGPT can be (is?) the same, and it has a bad
               | reputation for truth telling. And LLMs' reputation is not
               | necessarily getting better in this regard.
               | 
               | The problem is that many people attribute output that
               | came from a machine to be of higher quality (on whatever
               | axis) than output that came from a human, even a human
               | they personally know and have experience dealing with.
               | This is the same kind of prejudice as any other, or
               | perhaps a more insidious prejudice.
        
         | seanthemon wrote:
         | Accessibility? I've heard of Silverman but never Ruby of
         | Kishmoor
         | 
         | The more people discuss it, the more people summarize it on
         | their personal or other sites, etc.
        
           | buildbot wrote:
           | Right, that is the point of the parent comment - it's not the
           | book, it's the amalgamation of all the discussions and
           | content about the book. This case is probably dead in the
           | water.
        
             | moffkalast wrote:
             | I'm not entirely up to speed on US law, but wouldn't OpenAI
             | have to provide the court some kind of proof that they
             | didn't use it in the training data during discovery?
        
               | buildbot wrote:
               | Not a lawyer, but I believe the plaintiff (the author)
               | would need to prove that it regurgitates their
               | copyrighted work - otherwise it is possibly fair use.
               | OpenAI does not need to prove anything, just defend their
               | position at a reasonable level.
               | 
               | It's not been decided if training a model on copyrighted
               | works is "okay" or not as far as I know, but I expect it
               | to be so, given that literally everyone does so at this
               | point. It's not like ImageNet is copyright free; many
               | of the images were/are copyrighted.
        
               | pkilgore wrote:
               | No. Fair use is an affirmative defense to infringement.
               | If they admit to infringement or in the alternative want
               | to argue fair use, the burden is on OpenAI to demonstrate
               | their use falls within the relevant standard for fair
               | use.
        
               | pkilgore wrote:
               | No. Burden is on the plaintiff (Silverman) to prove
               | infringement.
        
             | krisoft wrote:
             | > This case is probably dead in the water.
             | 
             | Is that a fact? I'm no lawyer, but if they can get it in
             | front of a jury, is it impossible that they will find a
             | human author more relatable and the technical
             | counterarguments gobbledygook?
        
       | ur-whale wrote:
       | [flagged]
        
       | b800h wrote:
       | This is a bad test case. Summaries of books are more likely to
       | have been generated from reviews than the books themselves. We
       | don't want bad precedents set by weak cases like this.
        
       | jimyt666 wrote:
       | Well, they're claiming their books were scraped illegally from
       | torrents. If I torrent Peter Pan and watch it alone, I can get
       | thrown in jail. If AI is using torrents and getting billions in
       | funding and revenue off a torrented Peter Pan, it should
       | probably be held to the same standard I am.
        
         | grubbs wrote:
         | I don't think you can be thrown in jail. Just torrenting a film
         | and watching it would be a civil offense?
         | 
         | Now making copies and selling them on the street corner is
         | another story.
        
       | seydor wrote:
       | [flagged]
        
       | smitty1e wrote:
       | This promises to be an interesting wrinkle in the history of
       | "Fair Use" law.
       | 
       | Art has some amount of originality/distinctive quality.
       | 
       | One surmises that AI is going to need to inject some entropy to
       | avoid crossing a vague "Fair Use" line, for a useless internet
       | lawyer opinion.
        
       | homero wrote:
       | She'd have to sue every student that writes an essay on a book
       | they'd read
        
         | mgerdts wrote:
         | ... on a book that they'd _illegally acquired then_ read.
        
           | pessimizer wrote:
           | > The complaint lays out in steps why the plaintiffs believe
           | the datasets have illicit origins -- in a Meta paper
           | detailing LLaMA, the company points to sources for its
           | training datasets, one of which is called ThePile, which was
           | assembled by a company called EleutherAI. ThePile, the
           | complaint points out, was described in an EleutherAI paper as
           | being put together from "a copy of the contents of the
           | Bibliotik private tracker." Bibliotik and the other "shadow
           | libraries" listed, says the lawsuit, are "flagrantly
           | illegal."
        
         | Y_Y wrote:
         | Just you wait! Here comes the sequel to "receiving stolen
         | property", "knowing stolen information"!
        
         | seydor wrote:
         | the student sells it?
        
         | enneff wrote:
         | Computers aren't people.
        
       | sophrocyne wrote:
       | I was able to overcome the simple "word for word" filtering that
       | is being done on book outputs by prompting ChatGPT to write it in
       | pig latin.
       | 
       | I succeeded in getting the first page of Moby Dick - Chapter 1
       | (Loomings) - Public domain though, but wanted to test.
       | 
       | With ChatGPT primed for pig latin, I also succeeded in getting
       | the first page of Arryhay Otterpay (Book 1) - It happily
       | chattered along ""R.ay andyay Rs.May UrsleyDay, ofay umberNay
       | ourFay, Ivetray riveway, ereway oudpray otay aysay atthay eythay
       | ereway erfectlypay ormalnay, ankthay ouyay eryvay uchmay."
       | 
       | Not perfect pig latin, but that's beside the point.
       | 
       | However, on asking for `Edwetterbay by arahsay ilvermansay`, I
       | faced issues with it citing that its training data didn't include
       | it.
       | 
       | I tried with a book in the same genre ("ieslay hattay helseacay
       | andlerhay oldtay emay"), and ran into the same issue.
       | 
       | When asking about the inconsistency (Why Harry Potter, and not
       | these other books?), it responded: "The excerpt from "Harry
       | Potter and the Philosopher's Stone" that I translated is commonly
       | known and widely referenced, and it's used here as a general
       | example of how a text can be translated into Pig Latin.
       | 
       | For "Lies That Chelsea Handler Told Me", I do not have a widely
       | known or referenced passage from that book in my training data to
       | translate into Pig Latin."
       | 
       | ---
       | 
       | TL;DR - I don't think this is cut and dry, but I'm not convinced
       | Silverman has much of a case here.
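
(For reference, the probe transform itself is trivial. A minimal sketch of a naive pig latin encoder, matching the examples in the comment above, e.g. "harry potter" -> "arryhay otterpay"; this is my own toy implementation, not anything ChatGPT does internally:)

```python
def to_pig_latin(word: str) -> str:
    """Move the leading consonant cluster to the end and append 'ay';
    vowel-initial words just get 'yay' appended."""
    vowels = "aeiou"
    if word[0].lower() in vowels:
        return word + "yay"
    for i, ch in enumerate(word):
        if ch.lower() in vowels:
            return word[i:] + word[:i] + "ay"
    return word + "ay"  # no vowels at all: just append 'ay'

def encode(text: str) -> str:
    """Encode a whitespace-separated phrase word by word."""
    return " ".join(to_pig_latin(w) for w in text.split())

# encode("sarah silverman") -> "arahsay ilvermansay"
```

The point of the trick is that the filter presumably matches output text against the source verbatim, and a reversible per-word transform like this slips past a literal string match.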
        
       | jeffgreco wrote:
       | I mean I'm no lawyer but this doesn't strike me as a great
       | example for infringement? Detailed summaries of books sounds like
       | textbook transformative use. Especially in Silverman's case,
       | reducing her book to "facts" while eliminating artistic elements
       | of her prose make it that much less of a direct substitute for
       | the original work.
        
         | citizenkeen wrote:
         | Haven't read the complaint, but there might be an argument that
         | OpenAI used stolen works to train their model, and as such fair
         | use doesn't apply.
        
         | georgeoliver wrote:
         | Perhaps not, I thought one of the claims is interesting though,
         | that they illegally acquired some of the dataset. What would be
         | the damages from that, the retail price of the hardcopy?
        
           | getmeinrn wrote:
           | Wouldn't they first need to prove that OpenAI ingested the
           | book itself, and not just summaries of it?
        
             | [deleted]
        
           | banana_feather wrote:
           | The remedies under Title 17 are an injunction against further
           | distribution, disgorgement or statutory damages, and
           | potentially attorneys fees. The injunction part is why these
           | cases usually settle if the defendant is actually in the
           | wrong.
        
         | dclowd9901 wrote:
         | Could a suitably prompted LLM repeat, verbatim, the book in its
         | entirety?
        
           | jeffgreco wrote:
           | Perhaps? But certainly not what's shown here.
        
         | binarymax wrote:
         | I can see a good argument in the complaint. The provenance of
         | the training data leads back to it being acquired illegally.
         | Illegally acquired materials were then used in a commercial
         | venture. That the venture was an AI model is perhaps beside the
         | point. You can't use illegally acquired materials when doing
         | business.
        
           | refulgentis wrote:
           | It seems like a weak argument, in that it is just as likely
           | it saw any number of things about it, from book reviews to
           | sales listings to interviews.
        
             | bugglebeetle wrote:
             | Unless OpenAI can prove that the outputs are derived from
             | legally vs. illegally-obtained inputs, not sure that's
             | going to matter. And as far as I understand about their
             | models, that's effectively impossible.
        
               | kevinventullo wrote:
               | Isn't the burden of proof on the other side?
        
               | bugglebeetle wrote:
               | Not when OpenAI publicly declared they trained on pirated
               | works. I can't imagine "we can't tell if this is the
               | result of the illegal thing we did or not" is going to
               | stand up very well, nor does it bode well for any
               | refutation of the plaintiff's depiction of their intent.
               | Part of fair use consideration is commercial impact and
               | when you steal a bunch of books to train your AI model,
               | it's hard to argue that the impact is not negative or
               | that you didn't intend commercial harm.
        
               | pama wrote:
               | Please read more carefully. OpenAI never "declared they
               | trained on pirated works."
        
             | pessimizer wrote:
             | > it is just as likely it saw any number of things about it
             | 
             | Is this based on inside information, or just the law of
             | averages? Doesn't the fact that they openly admitted to
             | having been trained on pirated books affect your priors?
        
           | getmeinrn wrote:
           | >You can't use illegally acquired materials when doing
           | business.
           | 
           | This vague sentence conjures images of a company building
           | products from stolen parts, but this situation seems
           | different. IANAL, but if I looked at a stolen painting that
           | nobody had ever seen, and sold handwritten descriptions of
           | the painting to whoever wanted to buy one, I'm pretty sure
           | what I've sold is not illegal.
        
             | binarymax wrote:
             | Piracy of content is against the law. All other analogies
             | such as looking at paintings are not at issue here. The
             | content was pirated and there are laws against that,
             | whether we agree with it or not.
             | 
             | So, if the plaintiff can prove the content was pirated,
             | then the use of that content downstream is tainted.
        
               | getmeinrn wrote:
               | >then the use of that content downstream is tainted
               | 
               | What does that mean exactly? That's why I used the
               | "looking at a stolen painting" example.
               | 
               | Sure, pirating materials is illegal. But I don't think
               | that's the big implication that people are getting at
               | here. Is it legal to sell original works derived from
               | perceiving stolen materials? Seems to me that it is.
        
         | jrm4 wrote:
         | The more I think about it, I think it will (and should) turn on
         | the extent to which "the law" considers the AIs to be more
         | like "people" or more like "machines." People can read and do
         | research and then spit out something different.
         | 
         | But "feeding the data into a machine" seems like obvious
         | infringement, even if the thing that comes out on the other end
         | isn't exactly the same?
        
       | elzbardico wrote:
       | I am baffled by the fact that no enterprising lawyer has so far
       | figured out the potential for a class action here.
       | 
       | Note: I am not saying whether I agree with such a class action
       | or not, just pointing out that it seems at least feasible and
       | could be potentially very lucrative for the lawyers involved.
       | Of course, IANAL and all other disclaimers you can think of.
        
         | adrianmonk wrote:
         | These lawsuits _are_ class action suits, aren't they?
         | 
         | Looking at the two PDFs embedded at the bottom of The Verge
         | article, they both say "class action" on the first page, and
         | they say the three plaintiffs are suing "on behalf of
         | themselves and all other similarly situated".
        
       | koheripbal wrote:
       | Her arguments make no sense.
        
       | joiqj wrote:
       | [flagged]
        
       ___________________________________________________________________
       (page generated 2023-07-09 23:01 UTC)