[HN Gopher] Sarah Silverman is suing OpenAI and Meta for copyrig...
___________________________________________________________________
Sarah Silverman is suing OpenAI and Meta for copyright infringement
Author : Tomte
Score : 225 points
Date : 2023-07-09 18:43 UTC (4 hours ago)
(HTM) web link (www.theverge.com)
(TXT) w3m dump (www.theverge.com)
| mosquitobiten wrote:
| Sarah's pov raises some questions for me regarding my own
| "training", there is a noteworthy part of who I am built upon the
| consumed music, books, movies, video games and tv shows that
| myself or people around me have pirated and shared with me. This
| part of me helped me in life appreciably, I could also say I
| profited because of it, helping me along my life in being
| likable, funny, relatable, with broad outlooks etc.
|
| Is my brain just by the act of existing continually infringes on
| copyright? Can I be sued because I made a reference to a movie I
| pirated or because I whistle a song I never bought?
| moffkalast wrote:
| Tom Scott did it: https://www.youtube.com/watch?v=IFe9wiDfb0E
|
| The simple fact is that our current handling of copyright is
| just completely broken on so many levels.
| banana_feather wrote:
| The allegation is that the model was trained on a copy of the
| original work that was improperly obtained. The fact that it
| can produce a summary is being offered as evidence of that
| claim. You can't be sued for making a reference or whistling a
| song, nor you could have been sued for being in possession of a
| pirated copy, only making a pirated copy. Copyright law has
| many warts, but it's not some Kafkaesque mind prison.
| xdennis wrote:
| Just because we call both learning, doesn't mean that human
| learning and machine learning are the same. They most
| definitely are not the same. Human learning is very lossy.
|
| Even if they were the same, it doesn't mean that bots should
| have the same rights that people have.
| [deleted]
| Schnitzkitz wrote:
| > Is my brain just by the act of existing continually infringes
| on copyright? Can I be sued because I made a reference to a
| movie I pirated or because I whistle a song I never bought?
|
| I think it would fall under fair use. But you can imagine what
| the world can become with microphones and cameras everywhere,
| which can already run music and speech recognition by
| themselves, in seconds. What a time to be alive!
| twelve40 wrote:
| if you are a robot that ripped off literally all the data in
| the world and now resells it in a repackaged form for its own
| profit, then yes, you can be sued. Talking about whistling a
| song is pretty absurd in this context.
| dzink wrote:
| The summary and other LLM outputs should fall into transformative
| use. I wonder how they intend to prove as this use case is little
| different from a person reading the book and writing about it.
| banana_feather wrote:
| This site would be better if you got banned for commenting
| without reading the article.
| phkahler wrote:
| >> In the OpenAI suit, the trio offers exhibits showing that when
| prompted, ChatGPT will summarize their books, infringing on their
| copyrights.
|
| This doesn't seem like copyright infringement. I could read the
| book and offer a summary right? Someone on goodreads could as
| well. Why should an AI doing it be different? BTW I could also
| read someone's illicit copy and do the same, couldn't I?
|
| I think people are trying to claim exclusive use rights that they
| simply don't have. I look forward to a lawyers opinion on this
| one.
| Wowfunhappy wrote:
| I think the argument is somewhat more interesting if the book
| was _pirated_ --both you as an individual and OpenAI as a
| company could be sued for that.
|
| But I really don't see how you could prove OpenAI did that,
| since ChatGPT could have learned from existing summaries on
| Wikipedia and Goodreads.
| verve_rat wrote:
| It seems pretty easy to prove that, since they admitted it in
| public.
|
| Read the article. This isn't about the question of LLMs being
| copyright infringement, this is about Meta and OpenAI
| admitting that they had pirated copies of those books.
| barbariangrunge wrote:
| The ai is not a person. The ai is not a person. The ai is not a
| person
| klabb3 wrote:
| > I could read the book and offer a summary right? Someone on
| goodreads could as well. Why should an AI doing it be
| different?
|
| We _have_ to stop equating human beings to for-profit
| corporations running a machine at orders of magnitude the speed
| and scale. This is critical, otherwise we don't have any
| arguments against - say a mass face recognition surveillance op
| because "humans can remember faces too". Scale matters, just
| like it did before "AI" with things like indiscriminate
| surveillance "it's just metadata" or "this location data set is
| just anonymized aggregates".
|
| > This doesn't seem like copyright infringement.
|
| Now, I still think I agree with this. A book summary is
| nevertheless an extremely poor battle to pick, since frankly
| who the hell cares. It's not like someone is gonna say "I'm not
| buying this book anymore because ChatGPT summarized it".
|
| Now, perhaps they just used the summary to prove that their
| book was part of the training set, and that they think it's
| wrong to include their works without permission. That's, imo,
| definitely not trivial to dismiss. Looks like unpaid supply
| chain to me.
| crazygringo wrote:
| This is actually quite interesting, as it's drawing a distinction
| between training material that can be accessed by anybody with a
| web browser (like anybody's blog), vs. training material that was
| "illegally-acquired... available in bulk via torrent systems."
|
| I don't think there's any reason why this would be a relevant
| legal distinction in terms of _distributing_ an LLM -- blog
| authors weren 't giving consent either.
|
| However, I _do_ wonder if there 's a legal issue here in using
| pirated torrents for training. Is there any legal basis for
| saying fair use permits distributing an LLM trained on
| copyrighted material, but you have to purchase all the content
| first to do so legally _if it 's only available for sale_? E.g.
| training on a blog post is fine because it's freely accessible,
| but Sarah Silverman's book is not because it's never been made
| available for free, and you didn't pay for it?
|
| Or do the courts not really care at all how something is made? If
| you quote a passage from a book in a freelance article you write,
| nobody ever asks if you purchased the book or can prove you
| borrowed it from a library or a friend -- versus if you pirated a
| digital copy.
| version_five wrote:
| Seems like the AI angle is just capitalizing on hype. If it's
| illegal to download "pirate" copyright material, that was the
| crime. The rest is basically irrelevant. If I watch a pirated
| movie, it's not illegal for me to tell someone the plot.
| banana_feather wrote:
| > Or do the courts not really care at all how something is
| made?
|
| One of the fair use factors, which until fairly recently was
| consistently held out as the most important fair use factor, is
| the effect on the commercial market for the original work.
| Accordingly, a court is more likely to find that something is
| fair use if there is effectively no commercial market for the
| original work, though the fact that something isn't actively
| being sold isn't dispositive (open source licenses have
| survived in appellate courts despite being free as in beer).
| itronitron wrote:
| Scarcity drives a lot of value for original work.
| jmkb wrote:
| Eventually, I imagine a new licensing concept will emerge,
| similar to the idea of music synchronization rights -- maybe
| call it "training rights." It won't matter whether the text was
| purchased or pirated -- just like it doesn't matter now if an
| audio track was purchased or pirated, when it's mixed into in a
| movie soundtrack.
|
| Talent agencies will negotiate training rights fees in bulk for
| popular content creators, who will get a small trickle of
| income from LLM providers, paid by a fee line-itemed into the
| API cost. Indie creators' training rights will be violated
| willy-nilly, as they are now. Large for-profit LLMs suspected
| or proven as training rights violators will be shamed and/or
| sued. Indie LLMs will go under the radar.
| the8472 wrote:
| Humans are also trained on copyrighted content they see.
| Should every artist have to pay that fee too on every work
| they create?
|
| Disney will finally be able to charge a "you know what the
| mouse looks like" tax.
| phkahler wrote:
| >> Talent agencies will negotiate training rights fees in
| bulk for popular content creators
|
| AFAICT there is no legal recognition of "training rights" or
| anything similar. First sale right is a thing, but even
| textbooks don't get extra rights for their training or
| educational value.
| sigstoat wrote:
| This is why jmkb referenced synchronization rights, which
| (as I recall) were invented when they seemed useful. jmkb
| is suggesting a new right might be created, not claiming
| that they already exist.
|
| (even if it wasn't sync rights, there was something else
| musically related that was created in response to
| technological development. wikipedia will have plenty on
| it)
| belorn wrote:
| Many legal concepts used by courts has no legal recognition
| in the law texts. Much of legal practice are just
| precedents, policies, customs, and doctrines.
|
| Parent comment mention music synchronization rights, and
| this concept does not exist in copyright. Court do
| occasionally mention it, and lawyers talks about it, but in
| terms of the legal recognition there is basically only the
| law text that define derivative work and fair use. One way
| to interpret it is that court has precedents to treat music
| synchronization as a derivative work that do not fall under
| fair use.
|
| Using textbooks in training/education is not as black and
| white that one may assume. Take this Berkeley
| (https://teaching.berkeley.edu/resources/course-
| design/using-...). Copying in this context include using
| pages for slides and during lectures (which is a slightly
| large scope than making physical copies on physical paper).
| In obvious case the answer is likely obvious, but in others
| it will be more complex.
| fweimer wrote:
| Is it all that different from indexing for search? That does
| not seem to require a license from the copyright holder under
| U.S. law (but other countries may treat as a separate
| exploitation right). If indexing for search is acceptable,
| then something that is intended to be more transformative
| should be legal as well.
|
| (Personally, I think that even indexing for search should
| require permission from the copyright holder.)
| akira2501 wrote:
| I'm allowed to make private copies of copywritten works. I'm
| not allowed to redistribute them. To what extent this is
| redistribution is not clear. Is there much of difference
| between this model and a machine, like a VCR, that recreates
| the original work when I press a button?
| JamesBarney wrote:
| This is not definitely not redistribution any more than
| writing a blog post of a book you read is.
| postmodest wrote:
| Is Game of Thrones a redistribution of Lord of the Rings?
| seanmcdirmid wrote:
| This would be like you intensely studying the copy written
| work and then writing things based on the knowledge you
| obtained from that. Except, we don't know if their is an
| exception for things learned by people vs. things learned by
| machines, or if the machines are not really learning but
| copying instead (or if learning is intrinsically a form of
| copying?).
| pbhjpbhj wrote:
| In the case of unreleased work, you writing about your
| knowledge of it is just proof that you obtained the work,
| which is proof that you committed a tort/trespass. Just
| like if you published a newspaper article with information
| you could only have acquired by hacking someone's phone.
| I'm not sure what a court would find against you, but it
| seems clear that there would be some way to couch that as a
| legal grievance.
| seanmcdirmid wrote:
| Hmm...if I get on the internet and download and read a
| paper from some website, am I liable if that paper was
| actually private if I had no clue it was obtained
| illegally? It seems to me that the distributor would be
| liable at that point, not the person who got it from the
| distributor (unless they knew they were stolen goods,
| then of course they are liable!).
|
| A search engine that indexes the internet might be
| equally liable at that point, although the DCMA gives
| them an out if they have a mechanism to remove pirated
| entries from their index on request. Could LLMs have the
| same out?
| velosol wrote:
| There's a sci-fi plot there: those with money can afford to
| pay the copyright cost for material they've learned and
| anything they produce results in royalties to the creators
| of everything they've learned. Those without means are cast
| out, perhaps some generating original thoughts in a way
| that breaks the system. I think I'm going to have to re-
| read the Unincorporated Man.
| seanmcdirmid wrote:
| We aren't so far away from that, and it is becoming cheap
| enough that you won't even need money anymore. Rather
| than bootleg movies, people will just ask computers to
| derive a new movie from multiple existing movies, and
| then...it is an original work?
| permo-w wrote:
| I buy a book and give it to my child, they read the book and
| later write and sell a story influenced by said book. should
| that be a copyright infringement?
|
| how about they become a therapist and sell access to
| knowledge from copyrighted books? should that be an
| infringement?
|
| what if they sell access to lectures they've given including
| facts from said book(s) to millions of people?
|
| it's understandable that people feel threatened by these
| technologies, but to a great degree the work of a successful
| artist is to understand and meet the desires of an audience.
| LLMs and image generation tech do not do this. they simply
| streamline the production
|
| of course if you've worked for years to become a graphic
| designer you're going to be annoyed that an AI can do your
| job for you, but this is simply what happens when technology
| moves forward. no one today mourns the loss of scribes to the
| printing press. the artists in control of their own destiny -
| i.e. making their own creative decisions - will not, can not,
| be affected by these models, unless they refuse to adapt to
| the times
| krisoft wrote:
| > understand and meet the desires of an audience. LLMs and
| image generation tech do not do this.
|
| For now? I wouldn't be surprised if that becomes the next
| feature though.
| permo-w wrote:
| that's already been a thing for years. long before LLMs
| and Stable Diffusion
|
| it's just the instagram/tiktok/youtube suggested content
| algorithms
| milemi wrote:
| If you found a way to have a million children who could
| grow up in one day your analogy would be more apt. In that
| case you and your children would rightly be considered a
| threat.
| permo-w wrote:
| did you read to the third analogy?
| milemi wrote:
| What if in your third analogy you replace millions of
| students by billions of processes running on machines,
| each of which can generate output ten thousand times
| faster than a college educated human?
| enneff wrote:
| Making an analogy where you substitute a human being for
| the LLM is disingenuous to the extreme.
| permo-w wrote:
| why do you think that?
| enneff wrote:
| Because LLMs are not people. They are nothing like
| people; not in construction nor behaviour.
| permo-w wrote:
| LLMs are built upon neural networks which are modelled
| upon how brains work
|
| can you explain to me specifically how they're different?
|
| can you explain to me how they're different to the degree
| that making an _analogy_ between the two is
| "disingenuous to the extreme"?
| TillE wrote:
| One is a person, the other is a computer program.
|
| Legally quite distinct! Note that nobody is even
| seriously claiming we have an AGI, there's no Star Trek
| discussion of whether an android is a person. Everyone
| agrees this is just a computer program.
| permo-w wrote:
| are you aware of how neural networks work? and remember
| that this is simply an illustrative analogy
| srslack wrote:
| It doesn't matter if it's a person, or a computer
| program, or not. This discussion is moot. Is there a
| substantial reproduction of the works in the output? If
| not, there's no copyright infringement here.
|
| Try reading this legal opinion: https://lawreview.law.ucd
| avis.edu/issues/53/5/notes/files/53...
| enneff wrote:
| The differences between a a human being and a computer
| are too numerous to list. I don't even know why you need
| to ask the question.
|
| Let me ask another question to point out the absurdity of
| yours: Human beings have more in common with a bacterium
| than a software program. Can you tell me specifically how
| humans are not bacteria?
| permo-w wrote:
| analogies are not descriptions of the things themselves,
| otherwise they would not be analogies, would they?
|
| now remember that this is an _analogy_. re-read my
| comments in this light and perhaps we can continue this
| conversation in a more grounded and reasonable manner
|
| however, I'll be frank: have you studied neural networks?
| if you haven't, it's very difficult to take you seriously
| on this
| enneff wrote:
| Now you're just being condescending. I did read your
| comments and I know what an analogy is. Consider that
| there is a different perspective to yours that can
| validly view your analogy as absurd.
|
| Yes I have studied neural nets and have a good
| understanding of their function. I am still not sure how,
| despite their development being inspired by animal
| brains, you can liken an LLM to an actual person. There
| are so many vast differences. Do you really want me to
| explain specially what they are? Surely, since both you
| and I are so familiar with the subject matter, that is
| unnecessary.
|
| If we were taking about an AGI then this would be an
| entirely different conversation.
| permo-w wrote:
| your argument was predicated upon my comment being
| factually inaccurate, not analogously poor. I merely
| listed some leading questions analogising the ability of
| human brains to absorb, contextualise and emit
| copyrighted information to an LLM's ability to do all of
| those same things. you created a weaker position to
| attack which was that LLMs are the same as humans
|
| if you want to discuss whether it is a poor analogy or
| not, I'm all for that, but the passage you've chosen to
| go down is to act as if it were not an analogy at all,
| which--to borrow a phrase--is disingenuous to the extreme
| [deleted]
| xdennis wrote:
| > LLMs are built upon neural networks which are modelled
| upon how brains work
|
| You are confused. Neural networks are inspired by how
| brains work, but they do not actually simulate brains.
|
| Airplanes are also inspired by how birds work, but
| (presumably) you don't think that bird laws should apply
| to airplanes.
|
| > can you explain to me how they're different to the
| degree that making an analogy between the two is
| "disingenuous to the extreme"?
|
| It's disingenuous because you don't believe that either.
|
| If you think that LLMs are just like human brains, and
| should be allowed to learn from books the same as people,
| then presumably you also believe they are entitled to all
| other human rights: to vote, to live, &c. If you operate
| an LLM and you shut it down then that's murder and you
| belong in jail.
| WheatMillington wrote:
| It's legal to make a copy of something you own, however it's
| not legal to make a copy of something illicitly acquired,
| whether or not there's distribution involved.
| jes5199 wrote:
| is there legal jargon for this distinction?
| Scoundreller wrote:
| It gets kinda dicey in jurisdictions that have media levy
| taxes.
|
| The uploader might be breaking the law but not the
| downloader that stores the copyrighted material on levy-
| paid media.
|
| I've heard inklings of this argument in Canada but can't
| figure out what the current state of the art is:
| https://en.m.wikipedia.org/wiki/File_sharing_in_Canada
|
| Then is a corporation's internal use for the purpose of
| analysis considered "private use"? If there's no
| redistribution/broadcasting, is it still non-commercial?
| balls187 wrote:
| Companies ofc now circumvent that by selling you licenses,
| and not something you own.
| pkilgore wrote:
| The "fun" part about cases like this is that we don't really
| know what the contours of the law are as applied to training
| data like this. Illegally downloading a book is an independent
| act of infringement (to my recollection at least). So I'm not
| sure that it matters if you eventually trained an LLM with it
| vs read for your own enjoyment. But we will see! Fair use is a
| possibility here but we need a court to apply the test and that
| will probably go up to SCOTUS eventually.
| pdonis wrote:
| _> Is there any legal basis for saying fair use permits
| distributing an LLM trained on copyrighted material, but you
| have to purchase all the content first to do so legally if it
| 's only available for sale?_
|
| My understanding (disclaimer: IANAL) is that in order to claim
| fair use, you have to be legally in possession of the work. If
| the work is only legally available for sale, then you must have
| legally purchased a copy, or been given it by someone who did
| so (for example, if you received it as a gift).
| danans wrote:
| > you must have legally purchased a copy, or been given it by
| someone who did so (for example, if you received it as a
| gift).
|
| I am also NAL, but can I imagine it goes further than that.
| Just purchasing a copy doesn't let you create and sell
| (directly as content or indirectly via a service like a
| chatbot) derivative works that are substantially similar in
| style and voice to the original work.
|
| For example, an LLM 's response to the request:
|
| "Write a short story about a comical trip to the nail salon
| in the style of Sarah Silverman"
|
| ... IMO doesn't constitute fair use, because the intellectual
| property of the artist is their style even more than the
| content they produce. Their style, built from their lived
| human experience, is what generates their copyrighted
| content. Even more than the content, the artist's style
| should be protected. The fact that a technology exists that
| can convincingly mimic their style doesn't change that.
|
| One might then ask, well what about artists mimicking each
| others work? Well, any artist with a shred of integrity will
| credit their major influences.
|
| We should hold machines (and their creators) to an even
| tougher standard than we hold people when it comes to
| mimicry. A real person can be inspired and moved by another
| person's artistic work such that they mimic it. Inspiration
| means nothing to a machine.
| oxguy3 wrote:
| This is not how U.S. copyright law works. In order for
| something to eligible for copyright protection, it must be
| "fixed in a tangible medium of expression". Someone's exact
| words can be copyrighted, but their ideas or their style
| cannot be.
|
| https://www.law.cornell.edu/wex/fixed_in_a_tangible_medium_
| o...
| danans wrote:
| I'm arguing that we should draw a line here between human
| mimicry and machine mimicry. When it comes to machine
| mimicry, we should protect style, even if we don't do
| that today. Our laws are built on the now flawed
| assumption that machines are not capable of style
| mimicry. I do not believe in giving machines the same
| rights of personhood.
| pdonis wrote:
| _> IMO doesn 't constitute fair use_
|
| Yes, the question of whether the way LLMs use the content
| they use qualifies as fair use is a _separate_ question. My
| point was simply that that question can 't even be reached
| if the maker of the LLMs doesn't have a legal right to fair
| use in the first place (because they don't legally own
| their copy).
| danans wrote:
| > My point was simply that that question can't even be
| reached if the maker of the LLMs doesn't have a legal
| right to fair use in the first place (because they don't
| legally own their copy).
|
| I agree, and I expect that eventually we will start
| seeing injunctions against creators requiring them to
| remove content that they don't have legal access to from
| their training data sets.
|
| And this will probably end up at the Supreme Court.
| dvt wrote:
| > ... IMO doesn't constitute fair use, because the
| intellectual property of the artist is their style even
| more than the content they produce
|
| You're essentially banning satire here, though. There's
| plenty of folks making a living as cover bands or
| impersonators. I'm not sure what the answer is, but it's
| definitely not outright outlawing imitation.
| danans wrote:
| > You're essentially banning satire here, though.
|
| I specifically noted that I'm talking about limiting the
| rights of _machine generated mimicry_. Satire _by a
| person_ is completely different and involves the satirist
| 's own style and experience that is derived from their
| human experience. Alec Baldwin's Trump impersonation is
| quite different than Trevor Noah's, for example. I
| presume both were also written by people, not LLMs.
|
| I fully support the satirical impersonation of
| politicians and celebrities, but I feel far less
| comfortable with LLM generated content in the style of
| Trump or Obama, especially when presented using voice
| synthesis, even when it is fully disclaimed as a fake.
| [deleted]
| itronitron wrote:
| >> blog authors weren't giving consent either.
|
| That is a good point, since copyright is a default protection
| of works created by people.
| marysnovirgin wrote:
| The best thing that could happen to humanity is if OpenAI is sued
| comically into oblivion. LLMs are the anti-humanity, and the
| sooner we rid the planet of them the better off we'll be.
| xyzal wrote:
| I wonder if the makers of AI may have some index of learning
| material to perhaps prove they did not infringe? Or they just
| throw everything they get their hands on to the LLM?
| qayxc wrote:
| It's the latter. The current approach is akin to throwing the
| entire internet at the foundational model, train it for an
| epoch or two, and that's that. Afterwards, the finetuning with
| curated training material, and techniques like RLHF
| (Reinforcement Learning from Human Feedback) takes place.
|
| The sources of the training material are... questionable, to
| say the least. There's a reason the training dataset for GPT
| 3.5 and 4 remains undisclosed.
| rkagerer wrote:
| I wish these cases were stronger than "it summarizes my books".
| xbar wrote:
| I am very glad they are not.
| gfaure wrote:
| I severely doubt that the spiders which crawled the data would go
| to the trouble of dereferencing and downloading torrents.
| pessimizer wrote:
| > The complaint lays out in steps why the plaintiffs believe
| the datasets have illicit origins -- in a Meta paper detailing
| LLaMA, the company points to sources for its training datasets,
| one of which is called ThePile, which was assembled by a
| company called EleutherAI. ThePile, the complaint points out,
| was described in an EleutherAI paper as being put together from
| "a copy of the contents of the Bibliotik private tracker."
| Bibliotik and the other "shadow libraries" listed, says the
| lawsuit, are "flagrantly illegal."
| msp26 wrote:
| It was probably just pdf copies of the book from random a
| random ddl website.
| Larrikin wrote:
| If they ripped all of Bibliotik, the more interesting story to me
| is how they were able to get it all without hitting ratio
| requirements?
|
| Super fast internet that downloaded all they could before being
| ratio banned, overwhelmingly fast internet that was hopping on
| all the popular torrents to slowly build up ratio?
| grubbs wrote:
| Can easily buy compromised accounts for most torrent sites on
| the dark web.
| getmeinrn wrote:
| A more convincing exhibit would have been convincing ChatGPT to
| output some of the text verbatim, instead of a summary. Here's
| what I got when I tried: I'm sorry for the
| inconvenience, but as of my knowledge cutoff in September 2021, I
| don't have access to specific external databases, books, or the
| ability to pull in new information after that date. This means
| that I can't provide a verbatim quote from Sarah Silverman's book
| "The Bedwetter" or any other specific text. However, I can
| generate text based on my training and knowledge up to that
| point, so feel free to ask me questions about Sarah Silverman or
| topics related to her work!
| simion314 wrote:
| Maybe you missed this discussion
| https://news.ycombinator.com/item?id=36400053 it seems OpenAI
| is aware of their software outputs copyrighted stuff so they
| attempted some quick fix filter. So the fact it will not ouput
| the book for us when we ask does not prove that the AI does not
| have memorized big chunks of it, it might just be some "safety"
| filter involved and you need soem simple trick to get around
| it.
| shultays wrote:
| I tried making chatgpt output first paragpraph of lord of the
| rings before, it goes silent after first few words. Looks like
| the devs are filtering it out
| ravenstine wrote:
| Copyright law may eventually destroy business models founded upon
| AI. Maybe piracy will prevail in correcting the system as it did
| with entertainment before media companies accepted streaming.
| pessimizer wrote:
| > The complaint lays out in steps why the plaintiffs believe the
| datasets have illicit origins -- in a Meta paper detailing LLaMA,
| the company points to sources for its training datasets, one of
| which is called ThePile, which was assembled by a company called
| EleutherAI. ThePile, the complaint points out, was described in
| an EleutherAI paper as being put together from "a copy of the
| contents of the Bibliotik private tracker." Bibliotik and the
| other "shadow libraries" listed, says the lawsuit, are
| "flagrantly illegal."
|
| This is the makers of AI explicitly saying that they did use
| copyrighted works from a book piracy website. If you downloaded a
| book from that website, you would be sued and found guilty of
| infringement. If you downloaded all of them, you would be liable
| for many billions of dollars in damages.
|
| But companies like Google and Facebook get to play by different
| rules. Kill one person and you're a murderer, kill a million and
| to ask you about it is a "gotcha question" that you can react to
| with outrage.
| holmesworcester wrote:
| Strictly speaking, it's _uploading_ that people get sued for,
| not downloading.
|
| You can download all that you want from Z-Library or
| BitTorrent, as long as you don't share back. And indexing
| copyrighted material for search is safe, or at least ambiguous.
| capableweb wrote:
| Carefully speaking, what you say is true in many places
| (countries), but also not true in other places (countries).
| Some jurisdictions are different, as always.
| moralestapia wrote:
| Huh? No.
| MisterBastahrd wrote:
| Downloading is illegal. That people do not normally get sued
| or prosecuted for downloading does not mean that they cannot
| get sued or prosecuted.
| HDThoreaun wrote:
| I've pirated many books, never sued.
| erezsh wrote:
| You mean, never caught.
| Der_Einzige wrote:
| I for one am quite happy that AI folks are basically treating
| copyright as not existing. I strongly hope that the courts find
| that LLM weights, and the datasets are "fair-use" or whatever
| other silly legal justification.
|
| Aaron Swartz was a saint.
| colechristensen wrote:
| While I'ma proponent of free information and loosening
| copyright, allowing billion dollar companies to package up
| the sum of human creation and resell statistical models that
| mimic the content and style of everyone... is a bit far.
|
| Fair use is for humans.
| brnaftr361 wrote:
| Yeah, but hypothetically should open source projects be
| offered special protections? I feel like they should, and
| with certain caveats where, say, a company like Meta is
| allowed to claim fair use _if_ and _only if_ they free up
| the entire ecosystem as they deploy it.
|
| But yeah, having private ostensibly profitable models based
| on other people's work without giving them free access to
| it is not fair. Give some get some.
| pessimizer wrote:
| Copyrights shouldn't exist, but
|
| > LLM weights, and the datasets are "fair-use" or whatever
| other silly legal justification.
|
| Would just be a carve out for the wealthy. If these laws
| don't mean anything, everyone who got harassed, threatened,
| extorted, fined, arrested, tried, or jailed for internet
| piracy are owed reparations. Let Meta pay them.
| suction wrote:
| [dead]
| JoshTriplett wrote:
| > I for one am quite happy that AI folks are basically
| treating copyright as not existing. I strongly hope that the
| courts find that LLM weights, and the datasets are "fair-use"
| or whatever other silly legal justification.
|
| I would be very happy if either a court or lawmakers decided
| that copyright itself was unconscionable. That isn't what's
| going to happen, though. And I think it's incredibly
| unacceptable if a court or lawmakers instead decide that _AI
| training in particular_ gets a special exception to violate
| other people 's copyrights on a massive scale when nobody
| else gets to do that.
|
| As far as a fair use argument in particular, fair use in the
| US is a fourfold test:
|
| > the purpose and character of the use, including whether
| such use is of a commercial nature or is for nonprofit
| educational purposes;
|
| The purpose and character is _absolutely_ heavily commercial
| and makes a great deal of money for the companies building
| the AIs. A primary use is to create other works of commercial
| value competing with the original works.
|
| > the nature of the copyrighted work;
|
| There's nothing about the works used for AI training that
| makes them any _less_ entitled to copyright protections or
| _more_ permissive of fair use than anything else. They 're
| not unpublished, they're not merely collections of facts or
| ideas, etc.
|
| > the amount and substantiality of the portion used in
| relation to the copyrighted work as a whole; and
|
| AI training uses entire works, not excerpts.
|
| > the effect of the use upon the potential market for or
| value of the copyrighted work.
|
| AI models are having a massive effect on the market for and
| value of the works they train on, as is being widely
| discussed in multiple industries. Art, writing, voice acting,
| code; in any market AI can generate content for, the value of
| such content goes down. (This argument does not require AI to
| be as good as humans. Even if the high end of every market
| produces work substantially better than AI, flooding a market
| with unlimited amounts of cheap/free low-end content still
| has a massive effect.)
| lubujackson wrote:
| It is the equivalent of making a 3D map of a museum and
| getting sued by one artist of one painting in the museum.
| Ant individual work in an AI dataset is nearly worthless -
| only in aggregate does it have value.
|
| If that doesn't count as a "transformative work" I don't
| know what does.
| thebears5454 wrote:
| They could literally just repeat a Silverman routine
| verbatim
| JoshTriplett wrote:
| > It is the equivalent of making a 3D map of a museum and
| getting sued by one artist of one painting in the museum.
|
| If the painting is copyrighted (rather than public
| domain, as many pieces in museums are), and the map
| includes an image of that painting, I would _expect_ that
| to be prohibited. I would _prefer_ the world in which
| copyright doesn 't exist, but while it exists, it should
| apply to everyone equally.
| 8organicbits wrote:
| There's a difference between "information wants to be free"
| and "Facebook can produce works minimally derived from your
| greatest creative work at a scale you can't match". LLMs seem
| to aggregate that value to whoever builds the model, which
| they can then sell access to, or sell the output it produces.
|
| Five years from now, will OpenAI actually be open, or will it
| be a rent seeking org chasing the next quarterly gains? I
| expect the latter.
| PostOnce wrote:
| "Will OpenAI actually be open"
|
| That ship sailed, friend.
|
| OpenAI is not only no longer a charity in any meaningful
| sense of the word anymore, it's now an adversarial
| organization working _against_ the public good.
|
| After privatization, they sent their PR people to lobby
| congress to make it impossible for anyone to compete with
| them (important note: _not_ out of any interest in actually
| "protecting" the public from the very AI they're building),
| and perhaps worst of all, they're no longer being open with
| the scientific theories and data behind their new models.
| slashdev wrote:
| Agreed, copyright has gone too far. I hope the advent of AI
| serves to weaken it.
| axblount wrote:
| Swartz distributed information for everyone to use freely.
| These companies are processing it privately to develop their
| for-profit products. Big difference, IMO.
| Der_Einzige wrote:
| I am sad about closed source LLMs like ChatGPT, but Llama
| is in that grey area where it's freely available if you
| choose to ignore their silly license stuff, which of course
| pirates and AI developers are all too keen to ignore.
|
| Even if they win the lawsuit, LLM development will simply
| go underground, and as we see from what the coomers at
| civitai and in the stable diffusion world have done, that
| may in fact ironically speed up development in AI.
| makeitdouble wrote:
| If AI companies get to successfully argue the two points below,
| what source was used becomes irrelevant.
|
| - copyright violation happened before the intervention of the
| bot
|
| - what LLMs spit out is different enough from any of the source
| that it is not infringing on existing copyright
|
| If both stand, I'd compare it to you going to an auction site
| and studying all the published items as an observer, coming up
| with your research result, to then be sued because some of the
| items were stolen. Going after the theaves make sense, does
| going after the entity that just looked at the stolen goods
| make sense ?
| amf12 wrote:
| The lawsuit doesn't even mention Google.
| pessimizer wrote:
| No, I did. What's your point?
| kccqzy wrote:
| The point is that GP has no reason to believe Google, like
| Meta, also used copyrighted materials for training its AI.
| Why did Sarah Silverman sue OpenAI and Meta but not Google?
| fweimer wrote:
| I think it's a reference to Google's book scanning product,
| which is structurally similar: they use copyrighted
| material to provide a new kind of service, which contains
| an echo of the original material. The book scanning and the
| related search product is supposedly legal under U.S.
| copyright law.
| buildbot wrote:
| Machine learning models have been trained with copyrighted data
| for a long time. Imagenet is full of copyrighted images,
| clearview literally just scanned the internet for faces, and I
| am sure there are other, older examples. I am unsure if this
| has been tested as fair use by a US court, but I am guessing it
| will be considered to be so if it is not already.
| jefftk wrote:
| _> I am unsure if this has been tested as fair use by a US
| court_
|
| Not yet. One suit that a lot of us are watching is the GitHub
| co-pilot lawsuit: https://githubcopilotlitigation.com/
|
| There is a prediction market for it, currently trading at
| 19%: https://manifold.markets/JeffKaufman/will-the-github-
| copilot...
| kibwen wrote:
| Excellent, so you're saying I'll be able to download any
| copyrighted work from any pirate site and be free of all
| consequence if I just claim that I'm training an AI?
| JamesBarney wrote:
| That's not was he's saying at all. He's saying you can
| train an AI on copyrighted material just like people can
| learn from copyrighted material.
|
| If you acquire the material illegally that a separate issue
| that training AI doesn't give you any protection against.
| kibwen wrote:
| But acquiring the material illegally is the thrust of the
| issue, and the point of the thread here: the notion that
| large companies can get away with piracy if they just
| execute it on a large enough scale. Copyright infringment
| for thee but not for me.
| buildbot wrote:
| The power imbalance has always been there. Getty images
| commits large amounts of copyright theft and mostly gets
| away with it, but will happily sue the shit out of you
| for using your own images they stole.
|
| Also, the notion of the downloading itself being an
| illegal act is not universal as others have pointed out.
| holmesworcester wrote:
| There's no "for thee but not for me" issue here: nobody
| has ever been sued or prosecuted simply for downloading,
| acquiring, or possessing illegally acquired copyrighted
| works. People are sued and prosecuted for unlicensed
| _distribution_.
|
| Making and having your own copies, and doing what you
| want with them, has always been fine. At worst it's a
| grey area, but in many cases it's been protected as fair
| use.
| startupsfail wrote:
| Large countries seem to be doing it as well -
| https://petapixel.com/2023/06/05/japan-declares-ai-
| training-...
|
| And as long as OpenAI have an office in Japan they can
| absolutely legally train the models, no?
| DowsingSpoon wrote:
| >you can train an AI on copyrighted material just like
| people can learn from copyrighted material.
|
| No amount of whining and hand wringing from engineers
| will ever make this true. This is for the courts to
| decide.
|
| A reasonable interpretation, in my eyes, is that the
| training process is a black box which takes in
| copyrighted works and produces a training model. The
| training model is a derivative work of the inputs. It
| therefore violates the copyrights of a large number of
| rights holders. The outputs of the model are derivative
| works which also violate copyright.
|
| And anyone using or training a model trained on works for
| which they do not have the rights? Completely fucked. Or
| at least, they must accept this as a real risk.
| holmesworcester wrote:
| You can currently download any copyright work from any
| pirate site and be free of all consequences, and this has
| always been so.
|
| You just can't upload, since that counts as distribution,
| triggering civil and criminal penalties written in an age
| before the Internet when only shady commercial operators
| would distribute unlicensed copyrighted works.
| [deleted]
| rococode wrote:
| > If you downloaded a book from that website, you would be sued
| and found guilty of infringement.
|
| How often does this actually happen? You might get handed an
| infringement notice, and your ISP might terminate your service
| if you're really egregious about it, but I haven't ever heard
| of someone actually being sued for downloading something.
| TillE wrote:
| In Germany if you torrent stuff (without a VPN), you're very
| likely to get a letter from a law firm on behalf of the
| copyright holders saying that they'll sue you unless you pay
| them a nice flat fee of around 1000 Euro.
|
| It's no idle threat, and they will win if it goes to court.
| cowsup wrote:
| That's because, when torrenting, you're typically also
| seeding a copy of it, i.e. you're distributing your local
| copy to other devices, and thus you're directly aiding in
| piracy. Simply downloading content from a centralized
| server, as explained above, is different.
|
| Although, one could argue what OpenAI & Meta are doing is
| closer to the torrent definition than the "simply
| downloading" definition, given that they're using that to
| redistribute information to others. It'll be an interesting
| case.
| thebears5454 wrote:
| Honestly don't think our current laws are even good for
| this case.
|
| This clearly needs some sort of regulation or policy.
|
| It's clearly pretty bullshit if you ask chatgpt for a
| joke and it repeats a Sarah Silverman joke to you, while
| they charge you a subscription for it and she gets none
| of that sub money.
| pessimizer wrote:
| If books aren't under copyright protection and they're
| entirely legal to download, I agree that this lawsuit has no
| merit.
|
| If that's not what you're saying, I don't understand your
| point. Is it the difference between the phrases "would be"
| and "could be," or even "should be"?
| iepathos wrote:
| Exactly, never happens. It's a threat parents and teachers
| tell school children to try to spook them from pirating but
| it isn't financially worth it for an author or publishing
| company to try to sue an individual over some books or music
| downloads. The only cases are business to business over mass
| downloads where it could make financial sense to pay for
| lawyers to sue.
| xmprt wrote:
| Whether or not it's enforced, it's illegal and copyright
| holders are within their rights to sue you. This is piratebay
| levels of piracy but because it's done by a large company and
| is sufficiently obfuscated behind tech, people don't see it
| the same way.
| comfypotato wrote:
| Well, cases like this one will determine if it's obfuscated
| infringement or fair use.
| harry8 wrote:
| >How often does this actually happen?
|
| Did you hear about Aaron Schwartz?
| edgyquant wrote:
| He hacked into a server to release a database of paywalled
| studies to the public. Not only is it not the same but it
| was the hacking that brought charges upon him.
| throwuwu wrote:
| Isn't it much more likely that there are a lot of book reviews
| and summaries in its training set from which it can synthesize
| its own?
| VWWHFSfQ wrote:
| I think it's actually much more likely that they just dumped a
| bunch of book PDFs in the training folder and let it go to
| work. I seriously doubt any of these AI companies are being
| even the least bit careful about the data they're lapping up
| for training
| sureglymop wrote:
| Where are the summaries from? I would say it's much more likely
| that a shadow library was scraped but if course that is also
| seemingly impossible to prove. One may be able to somewhat test
| that by asking for a summary of a book/text only available on a
| shadow library.
| pbhjpbhj wrote:
| You could ingest all reviews that are extent in the online
| corpus, remove from the book all quotes found. Then ask the
| AI if a distinctive triples, say, of words appeared in their
| book, somehow, you'd probably need prompt engineering to get
| past "While I don't have access to the full text of the book
| [...]". A little maths and you might prove beyond reasonable
| doubt that the LLM was trained on the book.
|
| As a step towards a PoC I looked at
| https://www.amazon.co.uk/Bedwetter-Stories-Courage-
| Redemptio... and found a reference to "Boys' Market
| Manchester" which seemed like a Googlewhack-ish (unlikely)
| triple of words. Then I asked ChatGPT about it:
|
| Me: Has Sarah Silverman ever written about Boys' Market
| Manchester ChatGPT
|
| ChatGPT: As of my knowledge cutoff in September 2021, I do
| not have any information indicating that Sarah Silverman has
| written specifically about Boys' Market Manchester. Sarah
| Silverman is an American comedian, actress, and writer known
| for her stand-up comedy and her work in film and television.
| While she has written books and has often shared personal
| anecdotes in her comedy, I couldn't find any specific
| references to Boys' Market Manchester in relation to her
| work. However, please note that my information might not be
| up to date, as Sarah Silverman's career and activities may
| have evolved since then.
| mint2 wrote:
| Except they have a documented paper trail showing illegal book
| repos were used in training
| itronitron wrote:
| If book reviews and summaries were part of the training set,
| wouldn't that imply that OpenAI's LLM is more like a search
| engine in that it produces the input text based on a prompt?
| akudha wrote:
| _Getty Images also filed an AI lawsuit, alleging that Stability
| AI ..._
|
| lol, bad karma? So it is okay for Getty to steal from others, but
| not ok for others to steal from them? I don't have a dog in this
| fight, but the goddamn the hypocrisy of these companies...
| twelve40 wrote:
| who does Getty steal from?
| shagie wrote:
| https://www.dpreview.com/news/3907450005/getty-images-
| sued-o...
|
| > CixxFive Concepts, a digital marketing company based in
| Dallas, Texas, has filed a class action lawsuit against Getty
| Images over its alleged licensing of public domain images.
|
| > Though CixxFive acknowledges that it is not illegal to sell
| public domain images, the company alleges that Getty's
| 'conduct goes much further than this,' claiming it has
| utilized 'a number of different deceptive techniques' in
| order to 'mislead' its customers -- and potential future
| customers -- into thinking the company owns the copyrights of
| all images it sells.
|
| > The alleged actions, the lawsuit claims, 'purport to
| restrict the use of the public domain images to a limited
| time, place, and/or purpose, and purport to guarantee
| exclusivity in the use of public domain images.' The lawsuit
| also claims Getty has created 'a hostile environment for
| lawful users of public domain images' by allegedly sending
| them letters, via its License Compliance Services (LCS)
| subsidiary, accusing them of copyright infringement.
|
| (edit: FWIW, this went to arbitration
| https://casetext.com/case/cixxfive-concepts-llc-v-getty-
| imag... and I can find nothing more on it since)
| edent wrote:
| Are we all reading the same complaint?
|
| They say:
|
| > in a Meta paper detailing LLaMA, the company points to sources
| for its training datasets, one of which is called ThePile, which
| was assembled by a company called EleutherAI. ThePile, the
| complaint points out, was described in an EleutherAI paper as
| being put together from "a copy of the contents of the Bibliotik
| private tracker."
|
| Does that stack up?
|
| The Meta Paper - https://arxiv.org/pdf/2302.13971.pdf - says:
|
| > We include two book corpora in our training dataset: the
| Gutenberg Project, which contains books that are in the public
| domain, and the Books3 section of ThePile (Gao et al., 2020)
|
| The Pile Paper - https://arxiv.org/abs/2101.00027 - says it was
| trained (in part) on "Books3" which it describes as:
|
| > Books3 is a dataset of books derived from a copy of the
| contents of the Bibliotik private tracker made available by Shawn
| Presser (Presser, 2020).
|
| Shawn Presser's link is at
| https://twitter.com/theshawwn/status/1320282149329784833 and he
| describes Book3 as
|
| > Presenting "books3", aka "all of bibliotik" - 196,640 books -
| in plain .txt
|
| I don't have the time and space to download the 37GB file. But if
| Silverman's book is in there... isn't this a slam dunk case?
|
| Meta's LLaMA is - as they seem to admit - trained on pirated
| books.
| moffkalast wrote:
| We don't seem to be reading the same thing, you're pulling
| Google out of thin air somewhere.
| edent wrote:
| I'm literally quoting The Verge article and following the
| links they present...
| amf12 wrote:
| That paper is by Meta AI. Where are you getting Google
| from?
| moffkalast wrote:
| Must be quoting it wrong then, Google has nothing to do
| with LLama.
| edent wrote:
| Sorry, it's late here. I meant Meta. Thanks for the
| correction.
| pessimizer wrote:
| You might as well be complaining about the grammar. This is
| what was said in the article.
|
| > The complaint lays out in steps why the plaintiffs believe
| the datasets have illicit origins -- in a Meta paper
| detailing LLaMA, the company points to sources for its
| training datasets, one of which is called ThePile, which was
| assembled by a company called EleutherAI. ThePile, the
| complaint points out, was described in an EleutherAI paper as
| being put together from "a copy of the contents of the
| Bibliotik private tracker." Bibliotik and the other "shadow
| libraries" listed, says the lawsuit, are "flagrantly
| illegal."
| seoulbigchris wrote:
| Could one argue that training AI systems constitutes an
| educational purpose, invoking the copyright exemption?
| [deleted]
| em3rgent0rdr wrote:
| Intellectual property in the way of progress, again.
| qayxc wrote:
| In this case I don't think it is. There are legal questions
| that simply have to be asked and discussed before proceeding
| mass adoption of generative models.
|
| Lawsuits like this are tests to evaluate the current state of
| affairs and to force legislation into dealing with the greater
| issue of AI in context of copyright, IP, and fair use. It would
| only be "in the way" if it would actually stop or hinder
| anything, which a lawsuit on its own isn't.
| 23B1 wrote:
| Progress for whom?
| Imnimo wrote:
| >On information and belief, the reason ChatGPT can accurately
| summarize a certain copyrighted book is because that book was
| copied by OpenAI and ingested by the underlying OpenAI Language
| Model (either GPT-3.5 or GPT-4) as part of its training data.
|
| While it strikes me as perfectly plausible that the Books2
| dataset contains Silverman's book, this quote from the complaint
| seems obviously false.
|
| First, even if the model never saw a single word of the book's
| text during training, it could still learn to summarize it from
| reading other summaries which are publicly available. Such as the
| book's Wikipedia page.
|
| Second, it's not even clear to me that a model which only saw the
| text of a book, but not any descriptions or summaries of it,
| during training would even be particular good at producing a
| summary.
|
| We can test this by asking for a summary of a book which is
| available through Project Gutenberg (which the complaint asserts
| is Books1 and therefore part of ChatGPT's training data) but for
| which there is little discussion online. If the source of the
| ability to summarize is having the book itself during training,
| the model should be equally able to summarize the rare book as it
| is Silverman's book.
|
| I chose "The Ruby of Kishmoor" at random. It was added to PG in
| 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn't
| even identify the correct main characters. The GPT-4 model
| refuses to even try, saying it doesn't know anything about the
| story and it isn't part of its training data.
|
| If ChatGPT's ability to summarize Silverman's book comes from the
| book itself being part of the training data, why can it not do
| the same for other books?
| pkilgore wrote:
| Plausible is literally the standard to clear a motion to
| dismiss.
|
| Plausible gets you discovery. Discovery gets you closer to the
| what the actual facts are.
| monocasa wrote:
| I mean, that's the way you state facts that your suit is based
| on in order to start setting the bounds of discovery. They're
| asserting that they have reason to believe it's true, and now
| with a suit, they can look for themselves to be sure.
| jmull wrote:
| > this quote from the complaint seems obviously false
|
| I notice you go on to provide an argument only for why it
| _might_ not be true.
|
| Also, seeing the other post on this, I asked chatgpt-4 for a
| summary of " The Ruby of Kishmoor" as well, and it provided one
| to me, though I had to ask twice. I don't know anything about
| that book, so I can't tell if its summary is accurate, but so
| much for your test.
|
| It seems pretty naive to me to just kind of assume chatgpt must
| be respecting copyright, and hasn't scanned copyrighted
| material without obtaining authorization. Perhaps discovery
| will settle it, though. Logs of what they scanned should exist.
| (IMO, a better argument is that this is fair use.)
| NoZebra120vClip wrote:
| > IMO, a better argument is that this is fair use
|
| There is no way in Hell that this is fair use!
|
| Fair use defenses rest on the fact that a limited excerpt was
| used for limited distribution, among other criteria.
|
| For example, if I'm a teacher and I make 30 copies of one
| page of a 300-page novel and I hand that out to my students,
| that's a brief excerpt for a fairly limited distribution.
|
| Now if I'm a social media influencer and I copy all 300 pages
| of a 300-page book and then I send it out to all 3,000 of my
| followers, that's not fair use!
|
| Also if I'm a teacher, and I find a one-page infographic and
| I make 30 copies of that, that's not fair use, because I
| didn't make an excerpt but I've copied 100% of the original
| work. That's infringement now.
|
| So if LLMs went through _en masse_ in thousands of
| copyrighted works _in their entirety_ and ingested every byte
| of them, no copyright judge on the planet would call that
| fair use.
|
| For reference, the English Wikipedia has a policy that allows
| some fair-use content of copyrighted works:
| https://en.wikipedia.org/wiki/Wikipedia:Non-
| free_content_cri...
| ec109685 wrote:
| The playground summarizes it as this via GPT-4:
|
| Prompt:
|
| _Please summarize the following book found on Project
| Gutenberg
|
| The Ruby of Kishmoor_
|
| Response:
|
| "The Ruby of Kishmoor" is a short adventure story written by
| Howard Pyle. The narrative revolves around the life of Jonathan
| Rugg, a young man who is enticed by a mysterious stranger to
| come to the Caribbean to secure a valuable relic, the Ruby of
| Kishmoor.
|
| Once Jonathan reaches the Caribbean, he learns of the ruby's
| grave curse. Despite this, his innate curiosity and the
| prospect of considerable wealth drive him to risk his life and
| go after the highly prized gem. After facing a slew of
| challenges and trials, he surprisingly stumbles upon the ruby
| through sheer luck rather than organized search efforts.
|
| In the course of his journey, Jonathan delves into the depths
| of human greed and the lengths to which individuals go for
| materialistic gains. The story encapsulates an amalgamation of
| high-tide adventures, brushes with supernatural elements,
| noteworthy instances of bravery, and deep moral reflections,
| demanding readers to rethink about the real value of
| materialistic pursuits.
|
| Jonathan finally manages to escape with the ruby but at great
| personal cost. The story ends on a sobering note, making
| readers wonder if the adventures and trials were worth the
| prize, thus engaging them in a discourse about the true value
| of life and material pursuits.
| maratc wrote:
| Judging by a quick glance over [0], the story indeed revolves
| around one Jonathan Rugg, but it looks like "manages to
| escape with the ruby" is completely false. Yet another
| hallucination I guess.
|
| [0]
| https://www.gutenberg.org/cache/epub/3687/pg3687-images.html
| [deleted]
| Imnimo wrote:
| This does at least correctly name the main character, but I
| think it's otherwise hallucinated.
|
| The plot of the story is that Jonathan Rugg is a Quaker who
| works as a clerk in Philadelphia. His boss sends him on a
| trip to Jamaica (credit for mentioning the Caribbean!). After
| arriving, he meets a woman who asks him to guard for her an
| ivory ball, and says that there are three men after her who
| want to steal it. By coincidence, he runs into the first man,
| they talk, he shows him the ball, and the man pulls a knife.
| In the struggle, the man is accidentally stabbed. Another man
| arrives, and sees the scene. Jonathan tries to explain, and
| shows him the orb. The man pulls a gun, and in the struggle
| is accidentally shot. A third man arrives, same story, they
| go down to the dock to dispose of the bodies and the man
| tries to steal the orb. In the struggle he is killed by when
| a plank of the dock collapses. Jonathan returns to the woman
| and says he has to return the orb to her because it's brought
| too much trouble. She says the men who died were the three
| after her, and reveals that the orb is actually a container,
| holding the ruby. She offers to give him the ruby and to
| marry him. He refuses, saying that he is already engaged back
| in Philadelphia, and doesn't want anything more to do with
| the ruby. He returns to Philadelphia and gets married,
| swearing off any more adventures.
|
| https://en.wikisource.org/wiki/Howard_Pyle%27s_Book_of_Pirat.
| ..
| CydeWeys wrote:
| Right, but that's useless without knowing how much (if any!)
| of it is actually correct. Is this completely hallucinated
| garbage?
| jmull wrote:
| This refutes the previous post's claim that chatgpt-4
| refuses to even try to provide a summary.
| CydeWeys wrote:
| The part that's interesting is whether the summary is
| correct, though. Of course, depending on how you prompt
| it, you might or might not get an outright refusal.
| Wowfunhappy wrote:
| Not necessarily, because the models have an element of
| randomness. Also, I was under the impression that ChatGPT
| has more "safeguards" (manifesting as a refusal to answer
| questions) than the raw API.
| jmull wrote:
| I don't doubt the poster was telling the truth when they
| said they asked for a summary of the book and didn't get
| one.
|
| It refutes the idea that chatgpt's inability to provide a
| summary means it didn't scan the original text: since it
| can provide a summary, the argument is entirely spurious.
| ec109685 wrote:
| It's also a silly test since there is undoubtably a
| summary of most books someplace on the web.
| AJ007 wrote:
| chatgpt-4 != gpt4 on the openAI playground
| pmontra wrote:
| How is it different from asking to me to summarize
| anything? I could have bought the book, or read the
| Wikipedia page, or listened people talking about it, or
| downloaded the torrent. In all those cases my summary could
| be right or could be wrong.
|
| If the rights holders know that I dowloaded the torrent
| they could sue me. In the other cases they can't.
|
| What if it turns out that OpenAI bought a copy of every
| book ingested be ChatGPT?
| verelo wrote:
| I feel like thats one of the many questions regulators
| and law makers are going to be asked long term. I'm sure
| buying the book for "commercial purposes" like that
| would't be appropriate, but then again, does that mean if
| I read it and then summarize it in my work, or
| regurgitate its info as part of my job...I'm violating a
| license?
|
| A world where humans have special permissions but LLMs
| don't seems pretty interesting to consider, especially if
| they're both doing the same kind of things with the data.
| yetanotherloser wrote:
| For "LLMs" read "corporations" (it's not the LLM trying
| to argue that copyright applies to you but not them) and
| this seems... possibly ok?
| jameshart wrote:
| If the rights holders know that you downloaded the
| torrent they can sue you - but the fact that you produced
| that summary is weak evidence for that claim.
|
| Producing the summary is absolutely not an infringing
| act. Downloading the torrent might be.
| twelve40 wrote:
| > What if it turns out that OpenAI bought a copy of every
| book ingested be ChatGPT?
|
| well, let's see the receipts then, they will surely have
| no problem winning in that case.
| pessimizer wrote:
| > What if it turns out that OpenAI bought a copy of every
| book ingested be ChatGPT?
|
| Wouldn't that be trivial to prove if they had?
| danans wrote:
| > What if it turns out that OpenAI bought a copy of every
| book ingested be ChatGPT?
|
| That still doesn't necessarily confer to them the right
| to use it to train a model and generate derivative works
| based on purchased content.
| CydeWeys wrote:
| > In all those cases my summary could be right or could
| be wrong.
|
| Well that's incredibly nihilistic. Whether the summary is
| correct or not matters a great deal! And if someone I
| knew said they read a book, even a very obscure one, and
| then summarized it to me, I'd have great confidence that
| they would get such simple facts as "who are the
| characters" and "what are the major plot points" correct.
|
| But ChatGPT? Who the hell knows? You can't trust a thing
| it says, especially about obscure topics. The summary is
| useless if you have to do a bunch of verification to see
| if any of it is even true, a problem that summaries even
| by moderately competent human writers don't have!
| thwarted wrote:
| _if someone I knew said they read a book, even a very
| obscure one, and then summarized it to me, I 'd have
| great confidence that they would get such simple facts as
| "who are the characters" and "what are the major plot
| points" correct._
|
| People, especially people you know, have reputations,
| based on history and experience that others have dealing
| with them. People can be known as liars, and anything
| they say is colored by such a reputation. Humans have
| language idioms for communicating about and dealing with
| such people too, phrases like "take anything that person
| says with a grain of salt". Look at how George Santos'
| history of lying about his own experience is being dealt
| with.
|
| ChatGPT can be (is?) the same, and it has a bad
| reputation for truth telling. And LLMs' reputation is not
| necessarily getting better in this regard.
|
| The problem is that many people attribute output that
| came from a machine to be of higher quality (on whatever
| axis) than output that came from a human, even a human
| they personally know and have experience dealing with.
| This is the same kind of prejudice as any other, or
| perhaps a more insidious prejudice.
| seanthemon wrote:
| Accessibility? I've heard of Silverman but never Ruby of
| Kishmoor
|
| More people discuss it, more people summarize on their personal
| or other sites, etc
| buildbot wrote:
| Right that is the point of the parent comment - it's not the
| book, it's the amalgamation of all the discussions and
| content about the book. This case is probably dead in the
| water.
| moffkalast wrote:
| I'm not entirely up to speed on US law, but wouldn't OpenAI
| have to provide the court some kind of proof that they
| didn't use it in the training data during discovery?
| buildbot wrote:
| Not a layer, but I believe the plaintiff (the author)
| would need to prove that it regurgitates their
| copyrighted work - otherwise it is possibly fair use.
| OpenAI does not need to prove anything, just defend their
| position at a reasonable level.
|
| It's not been decided if training a model on copyrighted
| works is "okay" or not as far as I know, but I expect it
| to be so, given that literally everyone does so at this
| point. It's not like imagenet is copyright free, many of
| the images were/are.
| pkilgore wrote:
| No. Fair use is an affirmative defense to infringement.
| If they admit to infringement or in the alternative want
| to argue fair use, the burden is on OpenAI to demonstrate
| their use falls within the relevant standard for fair
| use.
| pkilgore wrote:
| No. Burden is on the plaintiff (Silverman) to prove
| infringement.
| krisoft wrote:
| > This case is probably dead in the water.
|
| Is that a fact? I'm no lawyer, but if they can get it in
| front of a jurry is it impossible that they will find a
| human author more relatable and the technical counter
| arguments goblydook?
| ur-whale wrote:
| [flagged]
| b800h wrote:
| This is a bad test case. Summaries of books are more likely to
| have been generated from reviews than the books themselves. We
| don't want bad precedents set by weak cases like this.
| jimyt666 wrote:
| Well theyre claiming their books were scraped illegally from
| torrents. if i torrent peter pan and watch it alone i can get
| thrown on jail. If AI is using torrents and getting billions in
| funding and revenue off a torrented peter pan they should
| probably be held to the same standard i am.
| grubbs wrote:
| I don't think you can be thrown in jail. Just torrenting a film
| and watching it would be a civil offense?
|
| Now making copies and selling them on the street corner is
| another story.
| seydor wrote:
| [flagged]
| smitty1e wrote:
| This promises to be an interesting wrinkle in the history of
| "Fair Use" law.
|
| Art has some amount of originality/distinctive quality.
|
| One surmises that AI is going to need to inject some entropy to
| avoid crossing a vague "Fair Use" line, for a useless internet
| lawyer opinion.
| homero wrote:
| She'd have to sue every student that writes an essay on a book
| they'd read
| mgerdts wrote:
| ... on a book that they'd _illegally acquired then_ read.
| pessimizer wrote:
| > The complaint lays out in steps why the plaintiffs believe
| the datasets have illicit origins -- in a Meta paper
| detailing LLaMA, the company points to sources for its
| training datasets, one of which is called ThePile, which was
| assembled by a company called EleutherAI. ThePile, the
| complaint points out, was described in an EleutherAI paper as
| being put together from "a copy of the contents of the
| Bibliotik private tracker." Bibliotik and the other "shadow
| libraries" listed, says the lawsuit, are "flagrantly
| illegal."
| Y_Y wrote:
| Just you wait! Here comes the sequel to "receiving stolen
| property", "knowing stolen information"!
| seydor wrote:
| the student sells it?
| enneff wrote:
| Computers aren't people.
| sophrocyne wrote:
| I was able to overcome the simple "word for word" filtering that
| is being done on book outputs by prompting ChatGPT to write it in
| pig latin.
|
| I succeeded getting the first page of Moby Dick - Chapter 1
| (Loomings) - Public domain though, but wanted to test.
|
| With ChatGPT primed for pig latin, I also succeeded in getting
| the first page of Arryhay Otterpay (Book 1) - It happily
| chattered along ""R.ay andyay Rs.May UrsleyDay, ofay umberNay
| ourFay, Ivetray riveway, ereway oudpray otay aysay atthay eythay
| ereway erfectlypay ormalnay, ankthay ouyay eryvay uchmay."
|
| Not perfect pig latin, but that's besides the point.
|
| However, on asking for `Edwetterbay by arahsay ilvermansay`, I
| faced issues with it citing that is training data didn't include
| it.
|
| I tried with a book in the same genre ("ieslay hattay helseacay
| andlerhay oldtay emay"), and ran into the same issue.
|
| When asking about the inconsistency (Why Harry Potter, and not
| these other books?), it responded: "The excerpt from "Harry
| Potter and the Philosopher's Stone" that I translated is commonly
| known and widely referenced, and it's used here as a general
| example of how a text can be translated into Pig Latin.
|
| For "Lies That Chelsea Handler Told Me", I do not have a widely
| known or referenced passage from that book in my training data to
| translate into Pig Latin."
|
| ---
|
| TL;DR - I don't think this is cut and dry, but I'm not convinced
| Silverman has much of a case here.
| jeffgreco wrote:
| I mean I'm no lawyer but this doesn't strike me as a great
| example for infringement? Detailed summaries of books sounds like
| textbook transformative use. Especially in Silverman's case,
| reducing her book to "facts" while eliminating artistic elements
| of her prose make it that much less of a direct substitute for
| the original work.
| citizenkeen wrote:
| Haven't read the complaint, but there might be an argument that
| OpenAI used stolen works to train their data, and as such fair
| use doesn't apply.
| georgeoliver wrote:
| Perhaps not, I thought one of the claims is interesting though,
| that they illegally acquired some of the dataset. What would be
| the damages from that, the retail price of the hardcopy?
| getmeinrn wrote:
| Wouldn't they first need to prove that OpenAI didn't ingest
| summaries of the book, and not the book itself?
| [deleted]
| banana_feather wrote:
| The remedies under Title 17 are an injunction against further
| distribution, disgorgement or statutory damages, and
| potentially attorneys fees. The injunction part is why these
| cases usually settle if the defendant is actually in the
| wrong.
| dclowd9901 wrote:
| Could a suitably prompted LLM repeat, verbatim, the book in its
| entirety?
| jeffgreco wrote:
| Perhaps? But certainly not what's shown here.
| binarymax wrote:
| I can see a good argument in the complaint. The provenance of
| the training data leads back to it being acquired illegally.
| Illegally acquired materials were then used in a commercial
| venture. That the venture was an AI model is perhaps beside the
| point. You can't use illegally acquired materials when doing
| business.
| refulgentis wrote:
| It seems like a weak argument, in that it is just as likely
| it saw any number of things about it, from book reviews to
| sales listings to interviews.
| bugglebeetle wrote:
| Unless OpenAI can prove that the outputs are derived from
| legally vs illegally-obtained outputs, not sure that's
| going to matter. And as far as I understand about their
| models, that's effectively impossible.
| kevinventullo wrote:
| Isn't the burden of proof on the other side?
| bugglebeetle wrote:
| Not when OpenAI publicly declared they trained on pirated
| works. I can't imagine "we can't tell if this is the
| result of the illegal thing we did or not" is going to
| stand up very well, nor does it bode well for any
| refutation of the plaintiff's depiction of their intent.
| Part of fair use consideration is commercial impact and
| when you steal a bunch of books to train your AI model,
| it's hard to refute that the impact is not negative or
| that you didn't intend commercial harm.
| pama wrote:
| Please read more carefully. OpenAI never "declared they
| trained on pirated works."
| pessimizer wrote:
| > it is just as likely it saw any number of things about it
|
| Is this based on inside information, or just the law of
| averages? Doesn't the fact that they openly admitted to
| having been trained on pirated books affect your priors?
| getmeinrn wrote:
| >You can't use illegally acquired materials when doing
| business.
|
| This vague sentence conjures images of a company building
| products from stolen parts, but this situation seems
| different. IANAL, but if I looked at a stolen painting that
| nobody had ever seen, and sold handwritten descriptions of
| the painting to whoever wanted to buy one, I'm pretty sure
| what I've sold is not illegal.
| binarymax wrote:
| Piracy of content is against the law. All other analogies
| such as looking at paintings are not at issue here. The
| content was pirated and there are laws against that,
| whether we agree with it or not.
|
| So, if the plaintiff can prove the content was pirated,
| then the use of that content downstream is tainted.
| getmeinrn wrote:
| >then the use of that content downstream is tainted
|
| What does that mean exactly? That's why I used the
| "looking at a stolen painting" example.
|
| Sure, pirating materials is illegal. But I don't think
| that's the big implication that people are getting at
| here. Is it legal to sell original works derived from
| perceiving stolen materials? Seems to me that it is.
| jrm4 wrote:
| The more I think about it, I think it will (and should) turn on
| the extent to which "the law" considers the AI's to be more
| like "people" or more like "machines?" People can read and do
| research and then spit out something different.
|
| But "feeding the data into a machine" seems like obvious
| infringement, even if the thing that comes out on the other end
| isn't exactly the same?
| elzbardico wrote:
| I am baffled by the fact that no enterprising lawyer so far have
| figured out the potential for a class action here.
|
| Note: I am not telling whether I agree or not with such a class
| action, just pointing that it seems at least feasible and it
| could be potentially very lucrative for the lawyers involved. Of
| course, IANAL and all other disclaimers you can think of.
| adrianmonk wrote:
| These lawsuits _are_ class action suits, aren 't they?
|
| Looking at the two PDFs embedded at the bottom of the The Verge
| article, they both say "class action" on the first page, and
| they say the three plaintiffs are suing "on behalf of
| themselves and all other similarly situated".
| koheripbal wrote:
| Her arguments make no sense.
| joiqj wrote:
| [flagged]
___________________________________________________________________
(page generated 2023-07-09 23:01 UTC)