Post AVpSFooqvKZTZoS4x6 by rrwo@fosstodon.org
 (DIR) Post #AVpSFmFKUQy3b3ogIy by rrwo@fosstodon.org
       2023-05-18T08:11:49Z
       
       0 likes, 0 repeats
       
       At work, we've decided to block the Common Crawl bot from our websites, because their index is used to train generative #AI systems. We've also blocked or severely limited requests from IP ranges associated with various cloud providers, because they are usually from unidentified bots. 1/n
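
       A minimal sketch of what these two kinds of blocking could look like, not the poster's actual configuration. It assumes the Common Crawl crawler identifies itself as CCBot, and the network range below is a reserved documentation placeholder rather than a real cloud provider range. The helper names are hypothetical.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: disallow CCBot everywhere, allow every other crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""

# Placeholder range (reserved for documentation), standing in for
# whatever cloud provider ranges a site chooses to block or limit.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def build_parser(text):
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_blocked_ip(addr):
    """Return True if addr falls inside any blocked network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_NETWORKS)


parser = build_parser(ROBOTS_TXT)
ccbot_allowed = parser.can_fetch("CCBot/2.0", "https://example.com/page")
other_allowed = parser.can_fetch("SomeSearchBot", "https://example.com/page")
```

       Note that robots.txt only works for bots that choose to honour it, which is why the IP-based check is enforced on the server side.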
       
 (DIR) Post #AVpSFnLOPNPb09f2OW by rrwo@fosstodon.org
       2023-05-18T08:16:39Z
       
       0 likes, 0 repeats
       
       This is a crap solution. It means we are excluding ourselves from open web indexes because those indexes are being abused. It means we are excluding or limiting independent search engines that use cloud services, because other users of those services are running bad bots. 2/n
       
 (DIR) Post #AVpSFo57fLzXHz3Yfo by rrwo@fosstodon.org
       2023-05-18T08:21:42Z
       
       0 likes, 0 repeats
       
       This also makes it harder to create other open indexes of the web. Why should we let you index our site, if your index might be abused? This also makes it harder to compete with Google. How do we know you won't turn around and go from search to generative AI? 3/
       
 (DIR) Post #AVpSFooqvKZTZoS4x6 by rrwo@fosstodon.org
       2023-05-18T08:27:07Z
       
       0 likes, 0 repeats
       
       There are already so many bad robots that don't respect rate limits, don't properly handle errors (404, 410, 400), don't respect robots.txt, don't identify themselves, or are used to build phishing/spam clone sites or to game SEO rankings. Now throw in bots that misuse content for generative AI. 4/
       
 (DIR) Post #AVpSFpUgPo21fY1U9Y by rrwo@fosstodon.org
       2023-05-18T08:39:41Z
       
       0 likes, 1 repeats
       
       And of course, now that Bing and Google have generative AIs, how do we differentiate between them indexing our websites for search vs indexing our websites to train their AIs? 5/
       
 (DIR) Post #AVpSFrWAtEPfwkjhku by rrwo@fosstodon.org
       2023-05-18T08:43:03Z
       
       0 likes, 0 repeats
       
       There's also the free newspaper problem. Notice how higher-quality newspapers and scientific publications are often behind paywalls, and misinformation/conspiracy theory sites are not? Now consider how indexing for generative AI will work with this. Higher-quality sources will block AI training because they don't want their content stolen, but propaganda outlets will allow their sites to be used for training because they want their content spread. 6/
       
 (DIR) Post #AVpSFtP9sHzNnZT864 by rrwo@fosstodon.org
       2023-05-18T09:53:51Z
       
       0 likes, 0 repeats
       
       As an aside, there's a "Have I Been Trained" image search website to search for photos that have been used for AI training data. https://haveibeentrained.com 7/