Post AkR2vSi8NPaTThz7Qm by growlph@greywolf.social
(DIR) Post #AkR09c4k3XemiJcDsu by foone@digipres.club
2024-07-29T20:57:16Z
0 likes, 0 repeats
One of the downstream effects of the AI boom that I hadn't really thought about is that sites are starting to introduce anti-crawling tech as standard, because they don't want all their bandwidth going to training chatgpt7. This is going to make life harder for people who run benevolent scrapers. How many sites are going to be blocking the wayback machine now, just because it looks too much like an AI scraperbot?
(DIR) Post #AkR0Jo4wHdL0pwZIw4 by foone@digipres.club
2024-07-29T20:58:05Z
0 likes, 0 repeats
And there's plenty of sites I've written my own scrapers for, in order to ensure the site doesn't just die. Doing that is going to become a lot harder if they're already worried about scrapers.
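For context, a "benevolent scraper" of the kind described here typically identifies itself, honors robots.txt, and throttles its own requests. A minimal sketch in Python (the user-agent string, contact address, and target site are placeholders, not anything foone actually runs):

    import time
    from urllib import robotparser
    from urllib.parse import urljoin

    import requests  # third-party; pip install requests

    # Hypothetical identifiers -- a real bot should use its own.
    USER_AGENT = "example-archive-bot/0.1 (contact: admin@example.org)"
    BASE = "https://example.org"

    # A polite bot checks robots.txt before fetching anything.
    rp = robotparser.RobotFileParser(urljoin(BASE, "/robots.txt"))
    rp.read()

    def fetch(path: str) -> str | None:
        url = urljoin(BASE, path)
        if not rp.can_fetch(USER_AGENT, url):
            return None  # robots.txt says no; a polite bot stops here
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        time.sleep(1.0)  # self-imposed rate limit: one request per second
        resp.raise_for_status()
        return resp.text

    if __name__ == "__main__":
        print(fetch("/"))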
(DIR) Post #AkR0TDQjOeB64MFRLM by thomasjwebb@mastodon.social
2024-07-29T20:59:08Z
0 likes, 0 repeats
@foone this is just going to accelerate websites turning into client-side webapps that are useless until the js is loaded and manually interacted with.
(DIR) Post #AkR0qQmnuBrY9Wdt20 by artandtechnic@digipres.club
2024-07-29T21:03:38Z
0 likes, 0 repeats
@foone Well… I hate to tell you this, but… the LLM companies have been using them as an AI scraperbot for some time. That’s why I block them.
(DIR) Post #AkR2vSi8NPaTThz7Qm by growlph@greywolf.social
2024-07-29T21:28:35Z
0 likes, 0 repeats
@foone Fuck. Of course. I never even thought about this.
(DIR) Post #AkR3gn8qEWHYSB2sOu by ivor@social.ivor.org
2024-07-29T21:36:17Z
0 likes, 0 repeats
@foone and similarly, how many sites are going to be butchered for accessibility/screen reader support as a side effect of anti-scraping techniques and client-side generation? ☹️ (albeit even more so than the current dismal state of sites - have you tried navigating any websites with a keyboard recently, or screenreading a page that's just images and content that doesn't even exist until you scroll?)
(DIR) Post #AkR4Bq6bGPI4cfnlfE by gudenau@fosstodon.org
2024-07-29T21:41:39Z
0 likes, 0 repeats
@foone I think the CF one is designed to only block the AI ones and leave the others alone.
(DIR) Post #AkRAysbhLohqIP7CHQ by ChartreuseK@social.restless.systems
2024-07-29T22:58:40Z
0 likes, 0 repeats
@foone I feel benevolent scrapers have already been targeted and squeezed out by large CDN and anti-DDoS providers like Cloudflare. Unless you're big enough to deal with it, it's just likely they'll send your bot to captcha hell and kill it that way.
(DIR) Post #AkRN9nwGFMrmS7i0zg by twodarek@hachyderm.io
2024-07-30T01:15:15Z
0 likes, 0 repeats
I've scraped random fanfics off forums to convert into epub, just because it's easier on my eyes than a forum's weird color scheme (looking at you, spacebattles[dot]net, with your neon green text on black) @foone
(DIR) Post #AkROLhrkkbnx30q76m by the_moep@mastodon.de
2024-07-30T01:28:35Z
0 likes, 0 repeats
@foone Scraping already died when everyone and their mother started using the mess that is Cloudflare's proxy and their anti-"bot" services.
(DIR) Post #AkRaT3tCSwMMpRn2qO by vivi@arff.archandle.net
2024-07-30T03:44:26Z
0 likes, 0 repeats
@foone What is a "benevolent scraper"? I don't think there is such a thing. If we're assuming the definition of a scraper is a bot that crawls links autonomously and archives any and all data it's able to come across, then even those that so generously respect robots.txt (yes, even the Wayback Machine) or other "opt-out" mechanisms are still opt-out in nature, and opt-out is not a moral configuration for data collection of any kind. One should have the right to be forgotten by default, with the choice to be preserved being opt-in.

This is of course excluding commercial digital media - the act of making something available for purchase is an intentional, discrete act of opting in just like any other, and should waive the "right to be forgotten". Even more so, opting in to commercial sale should come with strict regulations about media preservation and archival, to ensure those who purchase media are always able to access it for the purchasers' lifetime. In such a fantasy utopia, where we didn't live in a global plutocracy that lets corps get away with all the shit they do, there would be no non-malicious use case for any opt-out data collection/preservation.
(DIR) Post #AkRtrG4MjLHIgSGQRE by flohoff@c.im
2024-07-30T07:21:45Z
0 likes, 0 repeats
@foone the issue with the emerging scrapers is that they partially ignore robots.txt and get lost in the dungeons of API endpoints, gobbling up CPU and IO. The only way to get rid of them is blocking via user agent, which is what we tried to avoid by establishing robots.txt. It's so back to the stone ages.
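For what that looks like in practice, here is a minimal sketch of user-agent blocking, assuming an nginx front end (the bot names are real AI-crawler user agents, but the list is illustrative, not exhaustive):

    # In the http block: classify requests by User-Agent.
    map $http_user_agent $ai_training_bot {
        default        0;
        ~*GPTBot       1;  # OpenAI
        ~*CCBot        1;  # Common Crawl
        ~*Bytespider   1;  # ByteDance
    }

    server {
        listen 80;
        server_name example.org;

        # Refuse classified crawlers before they reach any API endpoint.
        if ($ai_training_bot) {
            return 403;
        }

        root /var/www/html;
    }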
(DIR) Post #AkRw7tnWzL5ERPS6q0 by madmaurice@soc.zom.bi
2024-07-30T07:47:06Z
0 likes, 0 repeats
@foone In an ideal world there would be a file like robots.txt - in other words, a fixed path - that lists all the content the website owner has surrendered for AI to use as training data. Maybe that will eventually be the case.
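No such standard exists today, but an opt-in manifest in the spirit of robots.txt might look something like this (the path and the directives are invented for illustration):

    # Hypothetical /ai.txt -- invented syntax, not an existing standard.
    # Everything not explicitly allowed stays off limits for training.
    User-agent: *
    Allow-training: /blog/
    Allow-training: /docs/
    Disallow-training: /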
(DIR) Post #AkSCA1v7MKBVTh2Tsu by abbe98@mastodon.social
2024-07-30T10:46:27Z
0 likes, 0 repeats
@foone NYT, NPR, Reuters, etc. have recently been found to block Wikipedia's citation bot: https://phabricator.wikimedia.org/T362379
(DIR) Post #AkSHCtSnkzUDtCylrk by falcon@mastodon.falconk.rocks
2024-07-30T11:43:02Z
0 likes, 0 repeats
@foone generative AI has done this, and it has also made finding real information to archive very difficult. It's kind of depressing and I haven't been able to do much of it lately.
(DIR) Post #AkSZctrOhdetp92BGK by vxo@digipres.club
2024-07-30T15:09:02Z
0 likes, 0 repeats
@foone yeah I gave this some thought-- like there's a WordPress plugin I use on my site that adds a bunch of specific entries like

    User-agent: ai-hooverbot-v69
    Disallow: /

to robots.txt, which at least keeps the well-behaved training crawlers from beating it to death. Of course I have seen evidence of non-well-behaved ones still coming through, as well as desirable things like search spiders and archive bots - I don't want to block those inadvertently or create any user annoyance.
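A robots.txt along those lines only restrains crawlers that choose to obey it, but it can name the training bots specifically while leaving search and archive crawlers untouched. A minimal sketch (the bot tokens are examples):

    # Block well-behaved AI-training crawlers by name.
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # Everyone else (search spiders, archive bots) may crawl freely.
    User-agent: *
    Disallow: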
(DIR) Post #AlI1HQC4VuTXVnCups by lilydjwg@acg.mn
2024-08-24T10:49:35Z
0 likes, 0 repeats
@foone I'm blocking those AI bots but intentionally exempt the wayback machine. To me, the issue is not the crawlers themselves, nor bandwidth nor content. The issue I have with GPTBot and Facebook's bots is that they come as a flock, to an extent that my server can't handle. They are rude and never care to learn how search bots behave.
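One way to express that policy - throttle the flock without blocking archival crawlers - is per-bot rate limiting, sketched here assuming nginx ("archive.org_bot" is the user-agent token the Internet Archive's crawler generally uses, but verify against your own logs):

    # In the http block: pick a throttling key per request.
    # An empty key means "not rate-limited" in nginx.
    map $http_user_agent $bot_key {
        default                 "";                # humans: no limit
        ~*archive\.org_bot      "";                # Wayback Machine: exempt
        ~*(bot|crawler|spider)  $http_user_agent;  # everything bot-like
    }

    limit_req_zone $bot_key zone=bots:10m rate=1r/s;

    server {
        listen 80;
        server_name example.org;

        limit_req zone=bots burst=5 nodelay;

        root /var/www/html;
    }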