Post AZTFbYhsb0KyMJdE2K by illustr8rFRE@freespeechextremist.com
(DIR) More posts by illustr8rFRE@freespeechextremist.com
(DIR) Post #AUoV27HI7W4Agn53Fg by nitashatiku@mastodon.social
2023-04-19T14:33:52Z
0 likes, 2 repeats
Tech companies have gotten increasingly secretive about the data, scraped from the internet without compensation or consent, used to train their AI models. So we looked closer. Here's our analysis of the 15 million websites in just one highly-filtered CommonCrawl web scrape, used to train models like Google's T5 & Facebook's LLaMA. We found: - the copyright symbol appears >200M times - pirated sites, 1 for e-books - half of the top 10 were news sites https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning
(DIR) Post #AUoV2961MOEuKPp4xk by nitashatiku@mastodon.social
2023-04-19T14:36:57Z
0 likes, 0 repeats
I've been obsessed with this topic ever since I read this excellent paper from Jesse Dodge (Allen Institute), @meg @mmitchell_ai and others https://arxiv.org/pdf/2104.08758.pdf and saw their graph of the top websites in Google's C4. Definitely worth your time.
(DIR) Post #AZTFbYhsb0KyMJdE2K by illustr8rFRE@freespeechextremist.com
2023-09-05T22:27:48.520590Z
0 likes, 0 repeats
@nitashatiku @meg @mmitchell_ai Understandable. Retarded cunts tend to be obsessed with stupid things.