Post Aw2qQ5b3p1VZdiruVc by lanodan@queer.hacktivis.me
 (DIR) Post #Aw1fuQSnVkoeOGJ4ZE by wolf480pl@mstdn.io
       2025-07-11T14:08:56Z
       
       0 likes, 4 repeats
       
       You know those residential proxies that various companies (incl. LLM ones) pay for and then use to scrape the hell out of your website so it effectively becomes a DDoS?

       Looks like someone found one of the sources of those proxies: there is a library used by 200+ browser extensions that acts as this kind of proxy, using the IPs of everyone who has those extensions installed:

       https://arstechnica.com/security/2025/07/browser-extensions-turn-nearly-1-million-browsers-into-website-scraping-bots/
       
 (DIR) Post #Aw1hAHTOnwPr5wiDA0 by wolf480pl@mstdn.io
       2025-07-11T14:22:49Z
       
       0 likes, 0 repeats
       
       @nihl wait, NordVPN uses residential proxies?
       
 (DIR) Post #Aw1pIbesRKUneEYt8a by wolf480pl@mstdn.io
       2025-07-11T15:54:08Z
       
       0 likes, 0 repeats
       
       @nytpu hmm, the CFAA is criminal law, right? So you don't sue; you report it to the police, and it's their job to prosecute the suspect?
       
 (DIR) Post #Aw1pnZyQkky1OsTcB6 by wolf480pl@mstdn.io
       2025-07-11T15:59:43Z
       
       0 likes, 0 repeats
       
       @nytpu so we just need one website owner to report it?
       
 (DIR) Post #Aw2qQ5b3p1VZdiruVc by lanodan@queer.hacktivis.me
       2025-07-12T03:41:17Z
       
       0 likes, 0 repeats
       
       @wolf480pl Well, at least there's precedent with Hola for browser extensions… I guess the next step would be discovering it in some kind of NodeJS library/framework.
       
 (DIR) Post #Aw39WT6vD1PRQvWiRs by schamschula@mastodon.social
       2025-07-12T07:15:25Z
       
       0 likes, 0 repeats
       
       @wolf480pl I have been fighting this kind of thing on my campus server that provides weather and soil moisture/temperature data.

       Any bot MUST provide two pieces of information in the agent string: (1) the name of the bot; and (2) a valid URL to provide information, in English, about how the data is to be used. Further, there is such a thing as robots.txt, which needs (1) to be configured and (2) to be observed by the crawler.

       Unfortunately, these guys don’t follow any of the above conventions.
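       For illustration, a polite crawler that follows both conventions might look like this minimal Python sketch (the bot name, contact URL, and target site are made up):

           import urllib.request
           import urllib.robotparser

           # Hypothetical identity: bot name plus a URL explaining the bot.
           USER_AGENT = "ExampleWeatherBot/1.0 (+https://example.edu/bot-info.html)"

           robots = urllib.robotparser.RobotFileParser()
           robots.set_url("https://example.edu/robots.txt")
           robots.read()  # fetch and parse robots.txt before anything else

           url = "https://example.edu/data/soil-moisture.csv"
           if robots.can_fetch(USER_AGENT, url):
               req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
               with urllib.request.urlopen(req) as resp:
                   data = resp.read()
           else:
               print("robots.txt disallows", url)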
       
 (DIR) Post #Aw39h5LXu64yupQgb2 by wolf480pl@mstdn.io
       2025-07-12T07:17:23Z
       
       0 likes, 0 repeats
       
       @schamschula I'd also argue that the bot should try to make most of its requests from a single IP address.

       Using a different source IP for each request (e.g. by frequently switching proxies) is a sign of bad will, and makes it difficult for the website being scraped to rate-limit the bot.
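       To make that concrete: per-IP rate limiting on the server side is roughly a token bucket keyed by source address, something like the Python sketch below (the rate and burst numbers are made up):

           import time
           from collections import defaultdict

           RATE = 1.0    # allowed requests per second per IP (made-up limit)
           BURST = 10.0  # bucket capacity (made-up)

           buckets = defaultdict(lambda: {"tokens": BURST, "t": time.monotonic()})

           def allow(ip: str) -> bool:
               """Classic token bucket, keyed by the client's source IP."""
               b = buckets[ip]
               now = time.monotonic()
               b["tokens"] = min(BURST, b["tokens"] + (now - b["t"]) * RATE)
               b["t"] = now
               if b["tokens"] >= 1.0:
                   b["tokens"] -= 1.0
                   return True   # within limits, serve the request
               return False      # over the limit, reject or delay

       A scraper that rotates to a fresh residential IP on every request gets a full bucket each time, so a limiter like this never fires against it.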
       
 (DIR) Post #AwETlHcUcjJltjIhe4 by schamschula@mastodon.social
       2025-07-17T18:24:00Z
       
       0 likes, 0 repeats
       
       @wolf480pl Large, legitimate search bots often come from one or more well-known ranges of IP addresses. However, the bad actors don’t tell you who they are, much less tell you their IP addresses so you can block them.

       I’m planning to try Anubis https://anubis.techaro.lol/ soon. My server has been slammed with more than 3x the usual traffic over the last several days.
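       The "well-known ranges" check is simple enough to sketch in Python; the range below is one commonly cited for Googlebot, but any real deployment should pull the current lists published by the crawler operators themselves:

           import ipaddress

           # Example entry only; verify against the operator's published list
           # before relying on it.
           KNOWN_CRAWLER_RANGES = [ipaddress.ip_network("66.249.64.0/19")]

           def is_known_crawler(ip: str) -> bool:
               addr = ipaddress.ip_address(ip)
               return any(addr in net for net in KNOWN_CRAWLER_RANGES)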
       
 (DIR) Post #AwEU0MlhlPFOD3nlJY by wolf480pl@mstdn.io
       2025-07-17T18:26:46Z
       
       0 likes, 0 repeats
       
       @schamschula from what I've heard, LLM scrapers like ClaudeBot, GPTBot, etc. do identify themselves in the User-Agent field, but proxy their traffic through residential IPs, spread more-or-less evenly across ISPs...
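       That combination should be easy to spot in access logs: one self-identified User-Agent fanning out across a huge number of distinct source IPs. A rough Python sketch, assuming combined-log-format lines where the IP is the first field and the User-Agent is the last quoted field:

           import re
           from collections import defaultdict

           # User-Agent -> set of source IPs seen sending it
           ips_per_ua = defaultdict(set)

           # first field is the IP; the last quoted field is the UA
           LOG_LINE = re.compile(r'^(\S+) .* "([^"]*)"$')

           with open("access.log") as f:
               for line in f:
                   m = LOG_LINE.match(line.rstrip("\n"))
                   if m:
                       ip, ua = m.groups()
                       ips_per_ua[ua].add(ip)

           # UAs spread across thousands of IPs float to the top
           for ua, ips in sorted(ips_per_ua.items(), key=lambda kv: -len(kv[1]))[:10]:
               print(f"{len(ips):6d} distinct IPs  {ua[:80]}")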
       
 (DIR) Post #AwHxqhz9bZCHDSHbhg by schamschula@mastodon.social
       2025-07-19T10:45:17Z
       
       0 likes, 0 repeats
       
       @wolf480pl Indeed, I’ve seen instances of requests for robots.txt from “ordinary” web browsers. No way of knowing who or what is behind those.

       And yes, ClaudeBot, along with other ‘legit’ scrapers, identifies itself, but at one point separately requested robots.txt from each IP address -> thousands of requests.
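       That robots.txt pattern is measurable with the same kind of log analysis: if a User-Agent's total /robots.txt requests roughly equal the number of distinct IPs it used, it fetched the file once per address. A sketch under the same log-format assumption as above:

           import re
           from collections import defaultdict

           LOG_LINE = re.compile(r'^(\S+) .* "(?:GET|HEAD) (\S+)[^"]*" .* "([^"]*)"$')

           robots_ips = defaultdict(set)   # UA -> distinct IPs fetching /robots.txt
           robots_hits = defaultdict(int)  # UA -> total /robots.txt requests

           with open("access.log") as f:
               for line in f:
                   m = LOG_LINE.match(line.rstrip("\n"))
                   if m and m.group(2) == "/robots.txt":
                       robots_ips[m.group(3)].add(m.group(1))
                       robots_hits[m.group(3)] += 1

           for ua, hits in robots_hits.items():
               # hits roughly equal to distinct IPs means "once per address"
               print(f"{hits:6d} requests / {len(robots_ips[ua]):6d} IPs  {ua[:70]}")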