Post ASHNsH8jHP9g1dEexs by aebrer@genart.social
 (DIR) Post #ASDRK5KpHyh1js9MYq by alexjc@creative.ai
       2023-01-31T16:57:59Z
       
       2 likes, 1 repeats
       
       RT @Eric_Wallace_@twitter.com: Models such as Stable Diffusion are trained on copyrighted, trademarked, private, and sensitive images. Yet, our new paper shows that diffusion models memorize images from their training data and emit them at generation time. Paper: https://arxiv.org/abs/2301.13188 👇 [1/9] 🐦🔗: https://twitter.com/Eric_Wallace_/status/1620449934863642624
       
 (DIR) Post #ASDRK6xVFfE0m7FlU8 by alexjc@creative.ai
       2023-01-31T16:58:00Z
       
       1 likes, 0 repeats
       
       That's a wrap folks, show's over! Diffusion models as *lossy* databases that can regenerate their training data: "Diffusion models are less private than prior generative models [...] mitigating these vulnerabilities may require new advances in privacy-preserving training."
       
 (DIR) Post #ASDRK8Tnb4eXUZN4Sm by alexjc@creative.ai
       2023-01-31T17:07:03Z
       
       2 likes, 0 repeats
       
       
       
 (DIR) Post #ASDRKA8xPXAaeVdSFs by alexjc@creative.ai
       2023-01-31T17:07:06Z
       
       1 likes, 0 repeats
       
       On the bright side, it seems like it could be easy to fix... likely they'll have to do serious deduplication for all future models.
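
A minimal sketch of what that deduplication might look like, using average hashing to catch near-duplicates before training. This is one common approach, not the paper's method; the helper names and the Hamming-distance threshold below are illustrative only.

```python
# Near-duplicate removal for a training set via average hashing (aHash).
# Images are represented here as flat lists of 8x8 = 64 grayscale values.

def average_hash(pixels):
    """Compute a 64-bit perceptual hash: each bit records whether a
    pixel is brighter than the image's mean intensity."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def dedupe(images, threshold=5):
    """Keep only images whose hash is more than `threshold` bits away
    from every already-kept image."""
    kept, hashes = [], []
    for img in images:
        h = average_hash(img)
        if all(hamming(h, prev) > threshold for prev in hashes):
            kept.append(img)
            hashes.append(h)
    return kept
```

In practice trainers would likely use stronger perceptual or embedding-based similarity, but the shape of the pipeline is the same: hash every image, drop anything too close to something already kept.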
       
 (DIR) Post #ASDRKBdTrXBDHSvLTE by alexjc@creative.ai
       2023-01-31T17:07:10Z
       
       1 likes, 0 repeats
       
       Busted!  Closed-source image models (Google's Imagen) are also proven to memorize data.  The bigger the models, the better the lossy database.  Just watch how quickly DALL·E 2 is taken offline for maintenance.  The only thing that would confirm is their guilt though!
       
 (DIR) Post #ASDRKDG9pDiCJi1kOW by alexjc@creative.ai
       2023-01-31T17:15:49Z
       
       0 likes, 0 repeats
       
       The paper also shows a proof-of-concept "untargeted extraction" where they try to recover a randomly chosen image.  It's on the CIFAR-10 dataset, easier/smaller, but the reconstructions are very similar...
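
The untargeted-extraction idea can be caricatured in a few lines: sample the model many times and flag generations that land suspiciously close to a known training image. The `generate` callable below is a stand-in for a real diffusion sampler, and the L2 threshold is made up for illustration; the paper's actual attack is considerably more involved.

```python
def l2(a, b):
    """Euclidean distance between two flat pixel vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def extract(generate, training_image, n_samples=1000, threshold=10.0):
    """Sample the generator repeatedly and return every generation
    that falls within `threshold` L2 distance of `training_image` --
    i.e. candidate memorized copies."""
    hits = []
    for _ in range(n_samples):
        sample = generate()
        if l2(sample, training_image) < threshold:
            hits.append(sample)
    return hits
```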
       
 (DIR) Post #ASDRKEpI05PNAxTJnE by alexjc@creative.ai
       2023-01-31T17:15:51Z
       
       0 likes, 1 repeats
       
       The quality jump of Diffusion compared to GANs comes from memorization.  GANs are significantly more "privacy" preserving (in the sense of being able to reconstruct input).
       
 (DIR) Post #ASDRKGRbz5emC6PRAG by alexjc@creative.ai
       2023-01-31T17:15:52Z
       
       0 likes, 0 repeats
       
       The insights from this paper confirm the intuitions of many practitioners in the space.  That doesn't diminish the achievements of diffusion models, but it will usher in a new era where Copyright is more strictly respected — or different architectures that push ©-burden to user.
       
 (DIR) Post #ASDRKINmmHmiCodPTk by alexjc@creative.ai
       2023-01-31T17:15:53Z
       
       1 likes, 0 repeats
       
       Text models known as "LLMs" have this too... the only difference is that it's a bit easier to manually write output filters like GitHub does every time Copilot gets caught memorizing Copyrighted code.
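
A toy version of such an output filter, assuming you can index the training corpus: block any generation that shares a long verbatim character window with it. GitHub's actual filter is not public; this is just the naive shape of the idea, with illustrative names and window size.

```python
# Naive memorization filter: flag outputs containing any length-n
# substring that appears verbatim in the training corpus.

def build_ngram_index(corpus, n=20):
    """Collect every length-n character window of every training document."""
    return {doc[i:i + n] for doc in corpus for i in range(len(doc) - n + 1)}

def is_memorized(output, index, n=20):
    """True if any length-n window of `output` appears in the index."""
    return any(output[i:i + n] in index for i in range(len(output) - n + 1))
```

Real systems also have to handle whitespace and identifier renaming, which is why exact-match filters like this are easy to evade.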
       
 (DIR) Post #ASHNYlvT6lBHiOjUwa by lritter@mastodon.gamedev.place
       2023-01-31T17:09:02Z
       
       1 likes, 0 repeats
       
       @alexjc well some are deliberately imprinted, for example fixed works of art like the mona lisa. the mona lisa prompt in SD is so strong that it is very difficult mixing anything else into it
       
 (DIR) Post #ASHNdpnW8lOVAfQa92 by demofox@mastodon.gamedev.place
       2023-01-31T17:01:19Z
       
       0 likes, 0 repeats
       
       @alexjc are they effective at lossy compression then?
       
 (DIR) Post #ASHNdqJmCpCMmiWLQW by alexjc@creative.ai
       2023-01-31T17:23:10Z
       
       0 likes, 0 repeats
       
       @demofox Well I'd say it's worth investigating at least!  You mean for compression of sets of files?
       
 (DIR) Post #ASHNdqzxfywUtYG2BE by demofox@mastodon.gamedev.place
       2023-01-31T17:53:47Z
       
       1 likes, 0 repeats
       
       @alexjc yeah, sets of images. Games have lots of sets of images for instance, and disk space / network bandwidth to support them is a challenge.
       
 (DIR) Post #ASHNf2i7cEdrVeTpeC by luis_in_brief@social.coop
       2023-01-31T17:56:16Z
       
       1 likes, 0 repeats
       
       @alexjc I ask the previous question (1M v. 1k) because, at least in the US, fair use would (potentially) say “that’s a reasonable price to pay for a genuinely new transformative use, especially for models that are using industry standard (or best practices?) for reducing copying”
       
 (DIR) Post #ASHNf50H5iLocWzMDg by luis_in_brief@social.coop
       2023-01-31T18:20:02Z
       
       1 likes, 0 repeats
       
       @alexjc (to be clear, not saying it’d definitively be fair use, just that fair use is in place specifically so that we can weigh the costs and benefits of whether a certain level of copying is socially beneficial overall)
       
 (DIR) Post #ASHNkpO04KrFvCpKcq by alexjc@creative.ai
       2023-01-31T18:19:07Z
       
       0 likes, 0 repeats
       
       @luis_in_brief If the courts decide the price is worth paying for this particular model, what happens when it's 10x or 100x the size and memorization goes up? It goes to court again to see if that ruling is upheld?
       
 (DIR) Post #ASHNkpxRwXDLh9Pdse by luis_in_brief@social.coop
       2023-01-31T18:22:32Z
       
       1 likes, 0 repeats
       
       @alexjc possibly—much depends on the grounds that the court rules on in doing the balancing test. e.g., if it leans heavily on “well, they’re generating a lot of cool stuff and they’ve implemented best practices” then it’ll be harder to get a new ruling, but if they lean heavily on “it’s not that much copying” then trying again for 100x the copying might work.
       
 (DIR) Post #ASHNksSMeZ8DRbtMLQ by luis_in_brief@social.coop
       2023-01-31T18:24:28Z
       
       0 likes, 0 repeats
       
       @alexjc so the first cases don’t answer the question once and for all, but they are still important—as I’ve written elsewhere, fair use in search got a young, uncompromising Google as its champion, and that was reflected in the court cases. Fair use in AI will have Microsoft (older, more risk averse) and OpenAI (unknown quantity, not infinite litigation budget). So we’ll see how that plays out.
       
 (DIR) Post #ASHNnWS0vUIU40zOZE by alexjc@creative.ai
       2023-01-31T18:23:36Z
       
       0 likes, 0 repeats
       
       @luis_in_brief OK, so assuming they implement best practices in the first place.  I think the scale has tilted in the other direction now though, let's see how it plays out.
       
 (DIR) Post #ASHNnWujCjGXV4QKK8 by luis_in_brief@social.coop
       2023-01-31T18:26:05Z
       
       1 likes, 0 repeats
       
       @alexjc right, a court would definitely look at best practices, and also (though this isn’t formally part of any test) in whether the defendants are implementing those best practices in good faith. Google Book Search, for example, was a fair use in part because GBS seemed to be doing a reasonably intense and serious job of preventing whole books from being downloaded.
       
 (DIR) Post #ASHNsFYt9AtL8BSWSe by aebrer@genart.social
       2023-01-31T20:43:25Z
       
       1 likes, 0 repeats
       
       @alexjc I gotta say this is far more damning than I expected. It does make me feel better that I decided not to use artist names as part of my prompts long ago, but it now does make me more concerned about how these models are used. It seems like the story with AI always ends up the same: without transparency it's simply not safe to use
       
 (DIR) Post #ASHNsH8jHP9g1dEexs by aebrer@genart.social
       2023-01-31T20:44:35Z
       
       0 likes, 0 repeats
       
       @alexjc do you know if vqgan models have the same pitfall?
       
 (DIR) Post #ASHNuzrXyKpNJHUp96 by smicur@mastodon.gamedev.place
       2023-01-31T22:07:44Z
       
       1 likes, 0 repeats
       
       @alexjc I appreciate how the main take from this research was not to further drive the enforcement of current lawsuits but rather to provide space to improve upon the existing (real) issues.  This is what I'm often missing from the AI-critical scene nowadays.
       
 (DIR) Post #ASHO2LAPA7GVPWooD2 by alexjc@creative.ai
       2023-01-31T17:23:49Z
       
       0 likes, 0 repeats
       
       @lritter As I said, it's fixable. But it confirms what the artist community initially thought...
       
 (DIR) Post #ASHO2LtQSjHHf9slNo by lritter@mastodon.gamedev.place
       2023-01-31T17:52:05Z
       
       0 likes, 0 repeats
       
       @alexjc it sure makes lawsuits a lot easier. but it should also be in the interest of the maintainers of the model. it is not merely enough to suspect your picture was used - you must demonstrate that it is indeed in there, and also produce a prompt someone can write to retrieve it. tickling the model in a special way to retrieve data you already have does not really qualify, does it?
       
 (DIR) Post #ASHO2MTEJbuxSCdMBs by alexjc@creative.ai
       2023-01-31T18:16:05Z
       
       0 likes, 0 repeats
       
       @lritter Copyright infringement already happens at the training stage, so whether it's in the model in some visualizable way, and whether it can be output, is irrelevant in some ways.
       
 (DIR) Post #ASHO2N1GH58j9kYXEe by lritter@mastodon.gamedev.place
       2023-02-01T08:08:26Z
       
       1 likes, 0 repeats
       
       @alexjc in what way does copyright infringement happen at the training stage?
       
 (DIR) Post #ASHO3JeyFbMl1ha4BM by alexjc@creative.ai
       2023-02-01T12:50:32Z
       
       0 likes, 0 repeats
       
       @lritter You need the necessary rights or an exception to copyright to keep a file around for the purpose of training.  If you don't have written permission from the rights holder, or a valid exception, it's infringement.
       
 (DIR) Post #ASHO3K6Gc7CUOMLrjE by lritter@mastodon.gamedev.place
       2023-02-01T12:51:07Z
       
       1 likes, 0 repeats
       
       @alexjc that sounds like you made it up
       
 (DIR) Post #ASHOrlv5SGrP9VB6Zs by alexjc@creative.ai
       2023-02-01T12:52:06Z
       
       0 likes, 0 repeats
       
       @lritter 🤷‍♂️  Read through all the copyright acts then!
       
 (DIR) Post #ASHOrn2ZHwRGczgasS by lritter@mastodon.gamedev.place
       2023-02-01T12:52:23Z
       
       0 likes, 0 repeats
       
       @alexjc i did.
       
 (DIR) Post #ASHOroCsx4HmFHWLb6 by alexjc@creative.ai
       2023-02-01T12:54:05Z
       
       0 likes, 0 repeats
       
       @lritter OK, then answer this question. Which article in the EU copyright act deals with the rights necessary to train a model on web-scraped data?
       
 (DIR) Post #ASHOrpX81I4YMM01my by lritter@mastodon.gamedev.place
       2023-02-01T12:55:28Z
       
       0 likes, 0 repeats
       
       @alexjc you still haven't answered my question, just waved your hands around
       
 (DIR) Post #ASHOrr2MQeeL1VcU6q by alexjc@creative.ai
       2023-02-01T12:56:42Z
       
       0 likes, 0 repeats
       
       @lritter ** Mutters something about pearls and swine... **  I answered it here: https://creative.ai/@alexjc/109789566257445813
       
 (DIR) Post #ASITaejjbOrmsdUBhg by Raccoon@techhub.social
       2023-01-31T23:06:21Z
       
       0 likes, 0 repeats
       
       @alexjc Sounds to me a bit like #StableDiffusion and other #art #AI function a bit like lossy #CompressionAlgorithms, if this is the case: ones that are simply able to combine decompressed data from a bunch of pictures at once.  This has a lot of implications for the #copyright of these pictures too, because if it's easily possible to get a one-to-one like this of images from the #TrainingData, or at least have parts that, upon the right sort of scrutiny, are effectively direct copies, then you're basically just reproducing other #artists' work.  If I were to take a bunch of images into GIMP and combine them in a bunch of weird ways, or if I were to simply trace an existing image, or take screen grabs from an existing image and assemble them into textures for such an image... All of these break #copyright, and can be legally prosecuted as #plagiarism / #ArtTheft if you can find the original image under scrutiny.  #ArtificialIntelligence should be no different.
       
 (DIR) Post #ASITafVwi9QnIA2gqm by Mastokarl@mastodon.social
       2023-02-01T06:55:23Z
       
       0 likes, 0 repeats
       
       @Raccoon @alexjc I don‘t think your conclusion is valid. If I understand the paper correctly (and I don‘t understand major parts of it :-) ), it does prove that there exists a small subset of images (500 in 375k of their test set) that can be closely regenerated. It does not prove (or rather it disproves) that the same holds for the other 98% of images. This would not be the case if #stablediffusion worked like lossy compression.
       
 (DIR) Post #ASITag66XiM36IxZD6 by alexjc@creative.ai
       2023-02-01T07:56:31Z
       
       0 likes, 0 repeats
       
       @Mastokarl @Raccoon The original paper calls it semantic compression. All works can be compressed semantically, but few are found that can be compressed with sufficient pixel accuracy.  I bet with enough computation you could find better matches for even more images...
       
 (DIR) Post #ASITagbIfjJAf3YTpo by Mastokarl@mastodon.social
       2023-02-01T16:01:08Z
       
       1 likes, 0 repeats
       
       @alexjc @Raccoon the (limited, I confess :-) ) research I did to find out what semantic compression might mean in image processing didn‘t show that this term is commonly used. Given the legal battles by frustrated artists we‘re about to witness, I feel this is a poor name, because it guides people in the wrong direction. Note that the #stablediffusion model 1.5 contains its knowledge of about 5 billion images in a file size of 5 billion bytes. This is not anything close to compression in…
       
 (DIR) Post #ASITah40wyHE66zPai by Mastokarl@mastodon.social
       2023-02-01T16:03:55Z
       
       1 likes, 0 repeats
       
       @alexjc @Raccoon the common meaning of that word (and the misunderstanding that lawyers will try to get into the minds of judges). And if I write a page of text about a painting that I see I could call this text semantic compression, and if I‘m precise enough, there might be a good chance that an artist re-creates the painting quite well from my text. But I haven‘t violated any copyright by describing the concepts of the image in text.
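
The arithmetic behind that point, taking both of the poster's round figures at face value (they are approximations, not exact values):

```python
# If a ~5e9-byte checkpoint were trained on ~5e9 images, the model
# retains on average about one byte per training image -- orders of
# magnitude below what any per-image lossy codec could deliver.
checkpoint_bytes = 5_000_000_000   # approximate model file size (poster's figure)
training_images = 5_000_000_000    # approximate training-set size (poster's figure)
print(checkpoint_bytes / training_images)  # → 1.0 byte per image
```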
       
 (DIR) Post #ASITvWu5v61efUHjVI by Raccoon@techhub.social
       2023-02-02T23:36:33Z
       
       0 likes, 0 repeats
       
       @Mastokarl @alexjc Right, YOU haven't violated copyright by typing some words to describe a thing you want, but if the AI is basing its output off of art that was taken without permission, credit, or payment, then that IS a violation of someone's rights to their work.  The potential for lawsuits here is tangible, and the inevitability of backlash over current practices is bolstered by anything that solidifies the fact that AI is just copying the work of the humans who created the training data: a lot of us are very hopeful that this could be the blow that we need against the current trends in how AI is run.  We don't need to sue them out of existence YET, we just need to get people to show concern over the ramifications and potential dangers.  90% of the "Certainty of Human Extinction" calculations we have in AI safety right now is just a reaction to how much of the warnings fall on deaf ears.  This could wake people up.
       
 (DIR) Post #ASITvXSTrFX0O8NC6K by Mastokarl@mastodon.social
       2023-02-03T05:08:28Z
       
       0 likes, 0 repeats
       
       @Raccoon @alexjc nothing is taken. If you take something, you have it afterwards. But the AI's model doesn't contain the images. So rather than "take" you might write "analyse, keep the analysis result, and forget", which is closer to what the AIs do and clearly not something that violates copyright.
       
 (DIR) Post #ASITvXvu5r4DrO8gxk by alexjc@creative.ai
       2023-02-03T08:04:37Z
       
       0 likes, 0 repeats
       
       @Mastokarl @Raccoon You're right about that: it's not theft, it's infringement.  But Raccoon's interpretation is generally correct... if the model is infringing, you could say all its outputs are unauthorized derivative works and could get you into trouble wherever in the world you use them.
       
 (DIR) Post #ASIXKH9RvJteunxUS8 by alexjc@creative.ai
       2023-02-03T08:03:18Z
       
       0 likes, 0 repeats
       
       @roboneko @lritter To train a model, you need to acquire the data, so of course it's relevant.  What implied part of the question didn't I answer that you'd like to know?  Overall, your understanding is correct — but if you do anything with your model/data in another jurisdiction, then you're liable *there* in case there's any infringement.  So U.S. Fair Use would be irrelevant in Europe, for instance.