Post Ad3sWJfaSMAI54axP6 by tarkowski@101010.pl
 (DIR) Post #Ad3sWHZUG465ZZj3cO by DAIR@dair-community.social
       2023-12-20T17:29:40Z
       
       0 likes, 0 repeats
       
       404 Media reports that "Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material" 🧵 However, in 2021, a preprint by @abebab, Vinay Uday Prabhu & Emmanuel Kahembwe found a number of issues in the dataset, including "troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content." The preprint can be found here: https://arxiv.org/abs/2110.01963 https://www.404media.co/laion-datasets-removed-stanford-csam-child-abuse/
       
 (DIR) Post #Ad3sWIgy5jfx34EXuy by ed@social.opensource.org
       2023-12-20T18:21:41Z
       
       0 likes, 0 repeats
       
       @DAIR @abebab That's not good: 2 years with a known issue and no fix. Unlike private/proprietary datasets, this is supposed to be an open source one. Did nobody submit a patch? Or was a patch submitted and then rejected or not followed up on? I'd be very curious to know what happened after @abebab's paper was published.
       
 (DIR) Post #Ad3sWJW11wVbbO7JU8 by DAIR@dair-community.social
       2023-12-20T17:30:42Z
       
       0 likes, 0 repeats
       
       Another paper by Abeba Birhane and Vinay Uday Prabhu had already resulted in the Tiny Images dataset being taken down. https://ieeexplore.ieee.org/abstract/document/9423393
       
 (DIR) Post #Ad3sWJfaSMAI54axP6 by tarkowski@101010.pl
       2023-12-22T08:09:42Z
       
       0 likes, 0 repeats
       
       @ed @DAIR @abebab@scholar.social I would be surprised to learn that there is a patching culture for datasets like LAION. The story shared by 404 shows that dataset maintenance standards are badly needed. I think it’s also a cultural change that’s needed: from a culture of data dumps to one of data care