Post AW2DRDouGNxPs6JWNc by jedbrown@hachyderm.io
 (DIR) More posts by jedbrown@hachyderm.io
 (DIR) Post #AW1xXdz2IpwbUF5mWu by simon@fedi.simonwillison.net
       2023-05-26T00:49:54Z
       
       0 likes, 0 repeats
       
       A slightly silly thought experiment: if you think AI shouldn't be trained on unlicensed copyrighted material, and an android similar to Data from Star Trek existed... should it be banned from reading books, or walking through a modern art gallery and looking at the pictures? (I don't think science fiction has prepared us for our weird new AI era - or maybe Star Trek's post-capitalism society is the thing that makes this work)
       
 (DIR) Post #AW1xwRBF2dxx1BcjRY by Blueteamsherpa@infosec.exchange
       2023-05-26T00:52:53Z
       
       0 likes, 0 repeats
       
       @simon Lt Cmdr Data, in this context, isn’t monetized.
       
 (DIR) Post #AW1y7dfMxYQavb3akq by FeralRobots@mastodon.social
       2023-05-26T00:56:15Z
       
       0 likes, 0 repeats
       
       @simon It's not silly at all, thanks to one EXTREMELY IMPORTANT point: an android "similar to Data from Star Trek" is: a) ACTUALLY SENTIENT (as a stipulation of the series bible), and b) not owned by a corporation.
       
 (DIR) Post #AW1yIXSeN1vAruIymW by simon@fedi.simonwillison.net
       2023-05-26T00:56:15Z
       
       0 likes, 1 repeats
       
       @Blueteamsherpa hah yeah I guess technically he's a non-profit? Albeit one with military applications!
       
 (DIR) Post #AW1yUCcveojlj9ftgG by lmorchard@hackers.town
       2023-05-26T00:56:34Z
       
       0 likes, 0 repeats
       
       @simon I'm also curious how much of a difference "an AI owned by a corporation" versus "a personal AI" might make? Maybe not a critical difference? But I would kind of think the power differential is a thing
       
 (DIR) Post #AW1ysFqHM2LGeKO6zY by grwster@mastodon.social
       2023-05-26T00:59:14Z
       
       0 likes, 0 repeats
       
       @simon Should I be allowed to fill a zip archive with unlicensed copyrighted material and then sell it? Only if I let you extract little bits of each file at a time? Only if I mix the little bits together so that it's hard to tell what came from where? Where do you draw the line?
       
 (DIR) Post #AW1z446CqC1XhhTARc by Colarusso@mastodon.social
       2023-05-26T01:00:42Z
       
       0 likes, 0 repeats
       
       @simon I fear the ultimate outcome of this will be that Data will have to share a small fee with the license holders of all the works he has ever seen should he sell any works of his own, and more troublingly, so will all human artists. The logic of your thought experiment is solid, but copyright absolutists will take it and run with it in the opposite direction. This doesn't end well for the lone artist; they'll all just end up paying folks with big IP libraries.
       
 (DIR) Post #AW1zF4ZBRYR4L4GwIS by simon@fedi.simonwillison.net
       2023-05-26T01:00:31Z
       
       0 likes, 0 repeats
       
       I wonder what the training data looked like that the Ferengi used for the holograms in their holosuites on Deep Space Nine
       
 (DIR) Post #AW1zQr3DWIDFtbwESW by John@socks.masto.host
       2023-05-26T01:01:00Z
       
       0 likes, 0 repeats
       
       @simon I can tell you my age-old expectation. It was that if a (genuine) AI appeared, it would be like a person, and have every right to read every public text/image. In that old model though, the genuine AI would be using texts/images as humans aspire to, to create something new, and not simply to plagiarize. A better (genuine) AI would know not to plagiarize.
       
 (DIR) Post #AW1zcTdGPorGjuLBbc by craig@risk.social
       2023-05-26T01:02:35Z
       
       0 likes, 0 repeats
       
       @simon “But ChatGPT is OwNeD bY a CoRpOrAtIoN.” Yeah, and when I’m on the job learning things from the internet, I am too. I think your thought experiment is on point.
       
 (DIR) Post #AW1zt5AUcpqtmwnklE by simon@fedi.simonwillison.net
       2023-05-26T01:03:50Z
       
       0 likes, 0 repeats
       
       @grwster I mean that's the big question isn't it! The number of bytes in an LLM attributable to any specific input text is pretty tiny (same for image models), yet it feels like a newly invented form of massive copyright heist
       
 (DIR) Post #AW206JgPyEcQZQ2pMm by ocdtrekkie@mastodon.social
       2023-05-26T01:11:16Z
       
       0 likes, 0 repeats
       
       @simon I mean considering the B-plot of an entire episode was someone hiring Quark to get a holo-scan of Major Kira for use in an adult holo... clearly unauthorized deepfaking has not been solved yet.
       
 (DIR) Post #AW20JHLdxnix5oVabo by kye@mastodon.au
       2023-05-26T01:13:55Z
       
       0 likes, 0 repeats
       
       @simon nah, I think that this is a great thought experiment. How human-like would AI have to be before it’s considered OK (by rights holders)? There needs to be a line drawn somewhere, otherwise we include ourselves.
       
 (DIR) Post #AW20X7o0EuDXYENLvs by ncweaver@thecooltable.wtf
       2023-05-26T01:19:55Z
       
       0 likes, 0 repeats
       
       @simon @grwster But even if you did truly lossless compression, the Nth book is going to take up less than the 1st book did.
       
 (DIR) Post #AW20h4tfuMMhQUm5w0 by not2b@sfba.social
       2023-05-26T01:20:47Z
       
       0 likes, 0 repeats
       
       @simon Well, if your hypothetical android went into business perfectly aping images in the style of (for example) Greg Rutkowski, Mr. Rutkowski might have a thing or two to say about that. The problem comes when the model is making derivative works by remixing particular training data, not a little bit of thousands of contributors, but one particular person. The problem isn't that Data reads the books, but that he then produces new books that are just a remix of the existing books. Now, we could argue that every composer does this; there are only so many chord progressions and beats. But what happens to the artists?
       
 (DIR) Post #AW21k0tXLQQvpvfpfU by jiejie@mastodon.social
       2023-05-26T01:37:08Z
       
       0 likes, 0 repeats
       
       @simon I love this thought experiment. I think it highlights how we want to train these tools in ways that respect human-created creative economies, but I hope we can all agree that our current human-created creative economies are less than ideal? Do we want IP hoarders like Disney, Tencent, Comcast, Sony, etc, to be the main beneficiaries of generative AI, while solo artists earn pennies in “AI training royalties”? How about patents and other inventions? We need big change.
       
 (DIR) Post #AW21wX5PU1159TCGuW by jannem@fosstodon.org
       2023-05-26T01:38:23Z
       
       0 likes, 0 repeats
       
       @simon Unlike Data, real-world AIs aren't sentient and aren't legally persons. Better question: should a book scanner be allowed to record the content of a book, then have the owner make a profit off the contents? Should a drone with a camera be allowed to record art gallery works, then have someone use that data for commercial purposes? The answer is: maybe. It's not the recording that's the issue, it's what you use the resulting system for at the end.
       
 (DIR) Post #AW22QKwnBEAkBtNU2q by mattmoehr@zirk.us
       2023-05-26T01:42:29Z
       
       0 likes, 0 repeats
       
       @simon @grwster an idea that has stuck with me - heard it somewhere like 8? years ago - is that LLMs are basically just an extremely lossy compression algorithm for the corpus they’re trained on. ZipGPT :)
       
 (DIR) Post #AW22aaBzM0dMMuN5OK by MattHodges@mastodon.social
       2023-05-26T01:42:29Z
       
       0 likes, 0 repeats
       
       @simon I think the causality of this hypothetical is backwards. This science-fiction example assumes existence prior to exposure, whereas our weird new AI era exists *because* of exposure.
       
 (DIR) Post #AW23iLfdKVBbbIVKYi by robertcadena@mastodon.social
       2023-05-26T01:59:10Z
       
       0 likes, 0 repeats
       
       @simon I think Data is constrained somewhat by physics, so he’s unable to reproduce or create visual works of art at anywhere near the scale of what something like Midjourney or Adobe’s AI can do. So in that sense it seems a little more fair. Now, imagine if he left the fleet and put up a shingle as an independent commercial illustrator or writer and was commissioned to do pieces in the style of artists. I think many would frown on this. 1/n
       
 (DIR) Post #AW23uepFw9FEU5sHzM by leadegroot@bne.social
       2023-05-26T02:01:11Z
       
       0 likes, 0 repeats
       
       @simon I think the copyright issue would be a problem with Data if he was copied and uploaded to produce lots of Data androids. For what we are getting now, it's more like Asimov's Multivac: one giant thing sucking in everything, so the copyrighted material moves from the one place it is currently provided from to another place outside the owner's control, and is provided to many from there. That's my gut response based on 30 seconds' thought
       
 (DIR) Post #AW24Z54pkqj7U6R7Bo by simon@fedi.simonwillison.net
       2023-05-26T02:08:51Z
       
       0 likes, 0 repeats
       
       @not2b Data actually paints oil paintings quite a bit on TNG - one of his paintings shows up in Picard too. Not obvious if he's emulating any style in particular
       
 (DIR) Post #AW26fbrsx05WpXcY64 by jonafato@mastodon.social
       2023-05-26T02:32:10Z
       
       0 likes, 0 repeats
       
       @simon What place does artificial scarcity have in a post-scarcity world?
       
 (DIR) Post #AW27mzSnioCDzWyGGW by UrbanEdm@mstdn.ca
       2023-05-26T02:44:36Z
       
       0 likes, 0 repeats
       
       @simon This is really the same question as is answered in "The Measure of a Man." If an AI is a person, then yes, he should be able to learn from the same range of sources, at the same prices, as any other person. If an AI is a thing that is owned by a person, then it shouldn't.
       
 (DIR) Post #AW282hfDfUrbQ6Gvqa by amaditalks@wandering.shop
       2023-05-26T02:47:44Z
       
       0 likes, 0 repeats
       
       @simon was Data “trained” on unlicensed copyrighted material in order to generate work product that closely resembled some of that material for someone else’s profit, or was it to help him be conversant in the language(s) and cultures of the training material so that he could interact with humanity?
       
 (DIR) Post #AW2CtfFBnNUH4zbcvo by simon@fedi.simonwillison.net
       2023-05-26T03:42:08Z
       
       0 likes, 0 repeats
       
       @jannem that's where I am at the moment: what matters is how you use it. Using copyrighted photos to train a model that provides assistance to people with reduced vision is very different to using it to create commissioned artwork in competition with human artists
       
 (DIR) Post #AW2DRD9mjH41oZ4gHg by schizanon@calckey.social
       2023-05-26T01:45:11.944Z
       
       0 likes, 0 repeats
       
       @simon@fedi.simonwillison.net @grwster@mastodon.social that's because we've all been ripping each other off in the same way this whole time; everything we make is just a reproduction of something we saw someone else make.
       
 (DIR) Post #AW2DRDouGNxPs6JWNc by jedbrown@hachyderm.io
       2023-05-26T02:17:29Z
       
       0 likes, 0 repeats
       
       @schizanon @simon @grwster If humans produce verbatim paragraphs or similar amounts of code without attribution, that's plagiarism. When language models do it, we tie ourselves in knots to avoid acknowledging the obvious fact that it's also plagiarism. They'll get better at obfuscation, so it's only going to get harder to "prove". This extends further: the central value proposition of the current hype cycle is the ability to launder illegal and unethical behavior under the banner of so-called AI.
       
 (DIR) Post #AW2DREMEGUc1XRu8Js by simon@fedi.simonwillison.net
       2023-05-26T03:48:05Z
       
       0 likes, 0 repeats
       
       @jedbrown @schizanon @grwster here's one of the more interesting results I've gotten for code from GPT-4 recently - hard to argue that it's copying chunks from its training data, it's a refactor of my own code that I fed into it https://gist.github.com/simonw/13ad4e36f5350c5f56ce41048b5cd136
       
 (DIR) Post #AW2Dclgqfwwk6S3CE4 by simon@fedi.simonwillison.net
       2023-05-26T03:50:04Z
       
       0 likes, 0 repeats
       
       @amaditalks this is increasingly the direction I'm thinking: how these things are trained matters a lot less than how they are used
       
 (DIR) Post #AW2EeFVFPwMwoGtVb6 by jannem@fosstodon.org
       2023-05-26T04:01:43Z
       
       0 likes, 0 repeats
       
       @simon OpenAI, Microsoft, Google and so on want a carte blanche to do whatever with their models. That's a big part of the problem.
       
 (DIR) Post #AW2EyZodbYY7bHemki by jedbrown@hachyderm.io
       2023-05-26T04:05:17Z
       
       0 likes, 0 repeats
       
       @simon @schizanon @grwster What do you suppose this anecdote shows? Surely not that it can't recite from the training data, nor that performance claims are not significantly contaminated by the training set.
       
       > As further evidence for this hypothesis, we tested it on Codeforces problems from different times in 2021. We found that it could regularly solve problems in the easy category before September 5, but none of the problems after September 12.
       
       https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
       
 (DIR) Post #AW2Gvllaumnnhm1SDI by simon@fedi.simonwillison.net
       2023-05-26T04:23:16Z
       
       0 likes, 0 repeats
       
       @jedbrown @schizanon @grwster I'm hoping it shows that it doesn't JUST recite from the training data - and if you figure out how to wield it, it can do plenty of interesting and useful things
       
 (DIR) Post #AW2Gvt4dkNCEMrzGN6 by simon@fedi.simonwillison.net
       2023-05-26T04:25:50Z
       
       0 likes, 0 repeats
       
       @jedbrown @schizanon @grwster I've personally always found the claims of it doing well in benchmarks and exams to be highly suspect, for pretty much the reasons outlined in that article. As always, I find the really interesting part of all of this is figuring out what it's useful for, outside of the hype at one end and the many very real criticisms at the other
       
 (DIR) Post #AW2HSNDTDIwX7TbkkS by simon@fedi.simonwillison.net
       2023-05-26T04:30:00Z
       
       0 likes, 0 repeats
       
       @jannem what are your thoughts on the increasingly capable open source models that anyone can run on their own devices?
       
 (DIR) Post #AW2JLZCITBKo9wwPh2 by maphew@indieweb.social
       2023-05-26T04:54:11Z
       
       0 likes, 0 repeats
       
       @simon a worthy thought experiment. I'm also interested in a converse question: what if all the truly valuable info was off limits (e.g. peer reviewed science, literature, spiritual tomes of ancients, encyclopædia, ...) and all that Data was allowed to ingest was society's dreck (pornhub, lose weight fast for only $20/month, pulp fiction, ...) ?
       
 (DIR) Post #AW2K4rNNjCfLZTiuoq by simon@fedi.simonwillison.net
       2023-05-26T05:01:06Z
       
       0 likes, 0 repeats
       
       @maphew I'm morbidly interested to see what would happen if someone trained an LLM purely against out-of-copyright text prior to 1928. I imagine its attitudes in terms of what's culturally acceptable would be pretty horrific
       
 (DIR) Post #AW2KP94PeSySEwWWuG by radiac@mastodon.cloud
       2023-05-26T05:06:05Z
       
       0 likes, 0 repeats
       
       @simon I'm not sure there's a difference to humans in that scenario; we and AI can learn from copyright data, but neither of us can just slice it up and regurgitate it without using our sentience and personal thoughts to produce something new. We can't pass someone else's work off as our own - that's plagiarism and copyright infringement.
       
 (DIR) Post #AW2MR38YsiTuKIV47M by maphew@indieweb.social
       2023-05-26T05:29:00Z
       
       0 likes, 0 repeats
       
       @simon there's some interesting thoughts 🧐. Feed a series of AI only material from given ages - Greeks, Romans, Norse, Renaissance - and see what comes out. Ditto for cultures.
       
 (DIR) Post #AW2O4iZmqUOK5yiJ84 by jedbrown@hachyderm.io
       2023-05-26T05:47:23Z
       
       0 likes, 0 repeats
       
       @simon  @grwster I have a more cynical take that for the tasks to which observers ascribe skill/knowledge or {basket of anthropomorphizing terms}, LMs can only ever be right for the wrong reasons. If a human exhibited the sort of impressive performance on certain tasks followed by deranged responses on follow-up or problem variants, we'd infer that they were cheating and retract our praise for those moments. Yet we're so thirsty for LMs to be something they're not that we keep rationalizing it.
       
 (DIR) Post #AW2RBnXZ5SZZNgk49o by mikeful@mastodontti.fi
       2023-05-26T06:22:11Z
       
       0 likes, 0 repeats
       
       @simon I think the difference is sentience. A sentient being can choose the target and purpose of learning, and current machine learning systems are basically very smart data compression systems. Using a Mickey Mouse JPG might get you in trouble whether it comes out of a .zip or a Stable Diffusion model.
       
 (DIR) Post #AW2ROoZaYihGLetZqq by aimaz@mstdn.social
       2023-05-26T06:22:10Z
       
       0 likes, 0 repeats
       
       @simon @grwster attribution of specific bytes seems more likely when there's little source material on a topic. @lcamtuf has shown this happening with text from his blog.
       
       https://open.substack.com/pub/lcamtuf/p/large-language-models-and-plagiarism
       
 (DIR) Post #AW2RhDLtAahUPaGFPM by jannem@fosstodon.org
       2023-05-26T06:27:43Z
       
       0 likes, 0 repeats
       
       @simon My thoughts are that they are open source; that a couple use only properly licensed training data while others don't; and that whether it's OK or not depends on what the final user does with them. I.e., if you're using Midjourney or something at home to generate images for the amusement of yourself and your friends, it's fine. If you generate sketches as compositional inspiration for your own work, that's fine. Commercial use? Same issues as with the FAANG models.
       
 (DIR) Post #AW2SWdDcl2LLP8uf8C by mattwilcox@mstdn.social
       2023-05-26T06:36:38Z
       
       0 likes, 0 repeats
       
       @simon they wouldn’t be trying to have Data used as a tool to concentrate wealth yet further, and Data wouldn’t be owned by some corporation. Data also fact-checks himself and has strong morals. Really not the same thing.
       
 (DIR) Post #AW2UPcwPaim6QOeLPU by simon@fedi.simonwillison.net
       2023-05-26T06:58:26Z
       
       0 likes, 0 repeats
       
       @aimaz @grwster @lcamtuf that was with Bard which isn't a pure LLM - it has the ability to augment its input with search results from Google, and infuriatingly it doesn't reveal when it has run a search (unlike Bing)
       
 (DIR) Post #AW2Ulq6dKE8lcr1S1A by sil@mastodon.social
       2023-05-26T07:02:24Z
       
       0 likes, 0 repeats
       
       @simon did Data pay to get into the gallery, thus helping to ensure that there's still a gallery next year?
       
 (DIR) Post #AW2a3dqO9ZJMZxjSNM by zubakskees@mastodon.social
       2023-05-26T08:01:35Z
       
       0 likes, 0 repeats
       
       @simon Specifically it's authorship laundering.
       
 (DIR) Post #AW2aU4kMkfMJXuWI2S by nicklockwood@mastodon.social
       2023-05-26T08:06:27Z
       
       0 likes, 0 repeats
       
       @simon Data is a person rather than a tool, which means he can choose not to violate the copyright of the material he consumes. If asked to produce an unlicensed derivative work for sale he could simply say no, and if he didn't he could be held personally accountable.
       
 (DIR) Post #AW2d5hGkgmyPdyjddw by benjamineskola@hachyderm.io
       2023-05-26T08:35:33Z
       
       0 likes, 0 repeats
       
       @simon if it’s being used for corporate profit at the expense of the creators of the work, yes. It wouldn’t be considered acceptable for a human to create derivative works, uncredited, for profit — I don’t see why an android would be different. I do see your point that this is fundamentally how a human learns, but I don’t think the situation is really comparable to a human learning, at least currently.
       
 (DIR) Post #AW2iMmF9wo7tlUxD5k by JustinMac84@mastodon.social
       2023-05-26T09:34:40Z
       
       0 likes, 0 repeats
       
       @simon Thing is Data wasn't out to put the artists in the gallery out of business.
       
 (DIR) Post #AW2pMnnyqBpwHtCTlw by tolmasky@mastodon.social
       2023-05-26T10:53:05Z
       
       0 likes, 0 repeats
       
       @simon Star Trek exists in a post-scarcity economy that has no money. So even if the purely capitalist crime of “copyright violation" existed, the damages would be $0 anyways. It's this wonderful world where people think the value of sharing human knowledge outweighs Disney's "rights" to drawings made ages ago by artists who have been dead for years. Or who knows, maybe Data is actually banned from entering the EU and it just never comes up since he's always in space. https://en.wikipedia.org/wiki/Trekonomics
       
 (DIR) Post #AW2xwEGvEG728q8pDE by joelanman@hachyderm.io
       2023-05-26T12:27:58Z
       
       0 likes, 0 repeats
       
       @simon agree with the replies about ownership of the ai/android. Fundamentally if an ai is getting value from consuming commons content, there should be some fairness/equity/democracy in where that value goes. Not all controlled and owned by a company that did not create or get consent for that content
       
 (DIR) Post #AW3BH7OfHhGCIv3PQO by castironflower@hachyderm.io
       2023-05-26T14:58:32Z
       
       0 likes, 0 repeats
       
       @simon Data can explain his reasoning and even know an action is wrong, and as others pointed out, he isn't monetized. So Data can be treated with the ethics we apply to humans, since he has an internal life; if we had AI that aware, using it as a chatbot would be slavery. Vs. LLMs, which can't do any of those things, can't provide attribution directly, and whose companies have no incentive to make attribution through debugging possible for their platforms
       
 (DIR) Post #AW4lhowpmAduPe1ogq by mislav@hachyderm.io
       2023-05-27T09:21:34Z
       
       0 likes, 0 repeats
       
       @simon I think I’m way more comfortable with Data reading books (and potentially getting inspired by them) because Data can reflect on his training, his output, and how he affects others. The purpose of Data’s existence isn’t production itself. In contrast, neither the current crop of LLMs nor their makers seem to have any conscience around this, as their current goal is to out-produce human workers for the enrichment of the few who control those means.