[HN Gopher] Amazon's AI crawler is making my Git server unstable
___________________________________________________________________
Amazon's AI crawler is making my Git server unstable
Author : OptionOfT
Score : 398 points
Date : 2025-01-18 18:48 UTC (4 hours ago)
(HTM) web link (xeiaso.net)
(TXT) w3m dump (xeiaso.net)
| ChrisArchitect wrote:
| Earlier: https://news.ycombinator.com/item?id=42740095
| trebor wrote:
| Upvoted because we're seeing the same behavior from all AI and
| SEO bots. They're BARELY respecting robots.txt, and they're hard
| to block. And when they crawl, they spam requests and drive up
| load so high that they crash many of our clients' servers.
|
| If AI crawlers want access they can either behave, or pay. The
| consequence will be almost universal blocks otherwise!
| mschuster91 wrote:
| Global tarpit is the solution. It makes sense anyway even
| without taking AI crawlers into account. Back when I had to
| implement that, I went the semi manual route - parse the access
| log and any IP address averaging more than X hits a second on
| /api gets a -j TARPIT with iptables [1].
|
| Not sure how to implement it in the cloud though, never had the
| need for that there yet.
|
| [1]
| https://gist.github.com/flaviovs/103a0dbf62c67ff371ff75fc62f...
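|
| Roughly, that semi-manual route looks like the sketch below:
| scan an access log, find IPs averaging more than X hits a
| second on /api, and emit iptables rules using the TARPIT
| target from xtables-addons. The log format, threshold, and
| window are illustrative assumptions, not an exact setup.
|
|     import re
|     from collections import Counter
|
|     LOG = "/var/log/nginx/access.log"  # assumed combined log
|     MAX_RPS = 5                        # "X hits a second"
|     WINDOW_SECONDS = 3600              # history covered by LOG
|
|     line_re = re.compile(
|         r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) /api\S*')
|
|     hits = Counter()
|     with open(LOG) as fh:
|         for line in fh:
|             m = line_re.match(line)
|             if m:
|                 hits[m.group(1)] += 1
|
|     for ip, count in hits.items():
|         if count / WINDOW_SECONDS > MAX_RPS:
|             # Requires xtables-addons for the TARPIT target.
|             print(f"iptables -I INPUT -s {ip} -p tcp "
|                   f"--dport 443 -j TARPIT")
|
| In practice you'd run something like this from cron and skip
| IPs that already have a rule installed.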
| jks wrote:
| One such tarpit (Nepenthes) was just recently mentioned on
| Hacker News: https://news.ycombinator.com/item?id=42725147
|
| Their site is down at the moment, but luckily they haven't
| stopped Wayback Machine from crawling it: https://web.archive
| .org/web/20250117030633/https://zadzmo.or...
| kazinator wrote:
| How do you know their site is down? You probably just hit
| their tarpit. :)
| marcus0x62 wrote:
| Quixotic[0] (my content obfuscator) includes a tarpit
| component, but for something like this, I think the main
| quixotic tool would be better - you run it against your
| content once, and it generates a pre-obfuscated version of
| it. It takes a lot less of your resources to serve than
| dynamically generating the tarpit links and content.
|
| 0 - https://marcusb.org/hacks/quixotic.html
| bwfan123 wrote:
| I would think public outcry by influencers on social media
| (such as this thread) is a better deterrent, and it also
| establishes a public datapoint and exhibit for future
| reference... as it is hard to scale the tarpit.
| idlewords wrote:
| This doesn't work with the kind of highly distributed
| crawling that is the problem now.
| gundmc wrote:
| What do you mean by "barely" respecting robots.txt? Wouldn't
| that be more binary? Are they respecting some directives and
| ignoring others?
| unsnap_biceps wrote:
| I believe that a number of AI bots only respect robots.txt
| entries that explicitly name their static user agent.
| They ignore wildcard user agents.
|
| That counts as barely imho.
|
| I found this out after OpenAI was decimating my site and
| ignoring the wildcard deny-all. I had to add entries
| specifically for their three bots to get them to stop.
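|
| For reference, a minimal robots.txt along those lines; the
| wildcard stays for well-behaved crawlers, and the explicit
| entries use OpenAI's commonly documented bot names (check each
| vendor's docs for the exact tokens they match on):
|
|     # Wildcard deny-all, which some crawlers ignore
|     User-agent: *
|     Disallow: /
|
|     # Explicit entries for bots that only honor their own name
|     User-agent: GPTBot
|     Disallow: /
|
|     User-agent: ChatGPT-User
|     Disallow: /
|
|     User-agent: OAI-SearchBot
|     Disallow: /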
| noman-land wrote:
| This is highly annoying and rude. Is there a complete list
| of all known bots and crawlers?
| jsheard wrote:
| https://darkvisitors.com/agents
|
| https://github.com/ai-robots-txt/ai.robots.txt
| joecool1029 wrote:
| Even some non-profits ignore it now; the Internet Archive
| stopped respecting it years ago:
| https://blog.archive.org/2017/04/17/robots-txt-meant-for-
| sea...
| SR2Z wrote:
| IA actually has technical and moral reasons to ignore
| robots.txt. Namely, they want to circumvent this stuff
| because their goal is to archive EVERYTHING.
| amarcheschi wrote:
| I also don't think they hit servers repeatedly so much
| LukeShu wrote:
| Amazonbot doesn't respect the `Crawl-Delay` directive. To be
| fair, Crawl-Delay is non-standard, but it is claimed to be
| respected by the other 3 most aggressive crawlers I see.
|
| And how often does it check robots.txt? ClaudeBot will make
| hundreds of thousands of requests before it re-checks
| robots.txt to see that you asked it to please stop DDoSing
| you.
| Animats wrote:
| Here's Google, complaining of problems with pages they want
| to index but I blocked with robots.txt:
|
|     New reason preventing your pages from being indexed
|
|     Search Console has identified that some pages on your
|     site are not being indexed due to the following new
|     reason: Indexed, though blocked by robots.txt
|
|     If this reason is not intentional, we recommend that you
|     fix it in order to get affected pages indexed and
|     appearing on Google.
|
|     Open indexing report
|
|     Message type: [WNC-20237597]
| Vampiero wrote:
| > The consequence will almost universal blocks otherwise!
|
| Who cares? They've already scraped the content by then.
| jsheard wrote:
| Bold to assume that an AI scraper won't come back to download
| everything again, just in case there's any new scraps of data
| to extract. OP mentioned in the other thread that this bot
| had pulled 3TB so far, and I doubt their git server actually
| has 3TB of unique data, so the bot is probably pulling the
| same data over and over again.
| xena wrote:
| FWIW that includes other scrapers, Amazon's is just the one
| that showed up the most in the logs.
| _heimdall wrote:
| If they only needed a one-time scrape we really wouldn't be
| seeing noticeable bot traffic today.
| herpdyderp wrote:
| > The consequence will almost universal blocks otherwise!
|
| How? The difficulty of doing that is the problem, isn't it?
| (Otherwise we'd just be doing that already.)
| ADeerAppeared wrote:
| > (Otherwise we'd just be doing that already.)
|
| Not quite what the original commenter meant but: WE ARE.
|
| A major consequence of this reckless AI scraping is that it
| turbocharged the move away from the web and into closed
| ecosystems like Discord. Away from the prying eyes of most AI
| scrapers ... and the search engine indexes that made the
| internet so useful as an information resource.
|
| Lots of old websites & forums are going offline as their
| hosts either cannot cope with the load or send a sizeable
| bill to the webmaster who then pulls the plug.
| ksec wrote:
| Is there some way websites can sell that data to AI bots in a
| large zip file rather than being constantly DDoSed?
|
| Or they could at least have the courtesy to scrape during
| night time / off-peak hours.
| jsheard wrote:
| No, because they won't pay for anything they can get for
| free. There's only one situation where an AI company will pay
| for data, and that's when it's owned by someone with scary
| enough lawyers to pressure them into paying up. Hence why
| OpenAI has struck licensing deals with a handful of companies
| while continuing to bulk-scrape unlicensed data from everyone
| else.
| awsanswers wrote:
| Unacceptable, sorry this is happening. Do you know about
| fail2ban? You can have it automatically filter IPs that violate
| certain rules. One rule could be matching on the bot trying
| certain URLs. You might be able to get some kind of honeypot
| going with that idea. Good luck
| thayne wrote:
| They said that it is coming from different ip addresses every
| time, so fail2ban wouldn't help.
| keisborg wrote:
| Monitor access logs for links that only crawlers can find.
|
| Edit: oh, I got your point now.
| jsheard wrote:
| Amazon does publish every IP address range used by AWS, so
| there is the nuclear option of blocking them all pre-
| emptively.
|
| https://docs.aws.amazon.com/vpc/latest/userguide/aws-ip-
| rang...
| xena wrote:
| I'd do that, but my DNS is via route 53. Blocking AWS would
| block my ability to manage DNS automatically as well as
| certificate issuance via DNS-01.
| unsnap_biceps wrote:
| If you only block new inbound requests, it shouldn't
| impact your route 53 or DNS-01 usage.
| actuallyalys wrote:
| They list a service for each address, so maybe you could
| block all the non-Route 53 IP addresses. Although that
| assumes they aren't using the Route 53 IPs or unlisted
| IPs for scraping (the page warns it's not a comprehensive
| list).
|
| Regardless, it sucks that you have to deal with this. The
| fact that you're a customer makes it all the more absurd.
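|
| A sketch of that approach, using the ip-ranges.json feed
| linked elsewhere in this thread: print every published AWS
| CIDR except the Route 53 ones (so DNS management and DNS-01
| keep working), ready to feed into a firewall. The service
| filter and the nft set named in the comment are assumptions.
|
|     import json
|     import urllib.request
|
|     URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"
|     SKIP = {"ROUTE53", "ROUTE53_HEALTHCHECKS"}
|
|     def aws_cidrs():
|         with urllib.request.urlopen(URL) as resp:
|             data = json.load(resp)
|         for p in data["prefixes"]:
|             if p["service"] not in SKIP:
|                 yield p["ip_prefix"]
|         for p in data["ipv6_prefixes"]:
|             if p["service"] not in SKIP:
|                 yield p["ipv6_prefix"]
|
|     if __name__ == "__main__":
|         for cidr in sorted(set(aws_cidrs())):
|             # e.g. nft add element inet filter aws_block { CIDR }
|             print(cidr)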
| SteveNuts wrote:
| It'll most likely eventually help, as long as they don't have
| an infinite address pool.
|
| Do these bots use some client software (browser plugin,
| desktop app) that's consuming unsuspecting users' bandwidth
| for distributed crawling?
| srameshc wrote:
| Has anyone tried using Cloudflare Bot Management and how
| effective is it against such bots?
| martinp wrote:
| I put my personal site behind Cloudflare last year specifically
| to combat AI bots. It's very effective, but I hate that the web
| has devolved to a state where using a service like Cloudflare
| is practically no longer optional.
| unsnap_biceps wrote:
| > About 10% of the requests do not have the amazonbot user agent.
|
| Is there any bot string in the user agent? I'd wonder if it's
| GPTBot as I believe they don't respect a robots.txt deny
| wildcard.
| Ndymium wrote:
| I had this same issue recently. My Forgejo instance started to
| use 100 % of my home server's CPU as Claude and its AI friends
| from Meta and Google were hitting the basically infinite links at
| a high rate. I managed to curtail it with robots.txt and a user
| agent based blocklist in Caddy, but who knows how long that will
| work.
|
| Whatever happened to courtesy in scraping?
| jsheard wrote:
| > Whatever happened to courtesy in scraping?
|
| Money happened. AI companies are financially incentivized to
| take as much data as possible, as quickly as possible, from
| anywhere they can get it, and for now they have so much cash to
| burn that they don't really need to be efficient about it.
| nicce wrote:
| They need to act fast before the copyright cases in the
| courts get decided.
| bwfan123 wrote:
| Not only money, but also a culture of "all your data belong
| to us", because our AI is going to save you and the world.
|
| The hubris reminds me of the dot-com era. That bust left huge
| wreckage; not sure how this one is going to land.
| __loam wrote:
| It's gonna be rough. If you can't make money charging
| people $200 a month for your service then something is
| deeply wrong.
| Analemma_ wrote:
| The same thing that happened to courtesy in every other
| context: it only existed in contexts where there was no profit
| to be made in ignoring it. The instant that stopped being true,
| it was ejected.
| baobun wrote:
| Mind sharing a decent robots.txt and/or user-agent list to
| block the AI crawlers?
| hooloovoo_zoo wrote:
| Any of the big chat models should be able to reproduce it :)
| to11mtm wrote:
| > Whatever happened to courtesy in scraping?
|
| When various companies got the signal that, at least for now,
| they have a huge Overton window of what is acceptable for AI
| to ingest, they decided to take all they can before
| regulation even tries to clamp down.
|
| The bigger danger is that one of these companies, even (or
| especially) one that claims to be 'Open', does so but gets to
| the point of being considered 'too big to fail' from an
| economic/natsec standpoint...
| AznHisoka wrote:
| Excuse my technical ignorance, but is it actually trying to get
| all the files in your git repo? Couldn't you just have everything
| behind a user/pass if so?
| xena wrote:
| Author of the article here. The behavior of the bot seems like
| this:
|
|     while true {
|       const page = await load_html_page(read_from_queue());
|       save_somewhere(page);
|       foreach link in page {
|         enqueue(link);
|       }
|     }
|
| This means that every link on every page gets enqueued and
| saved to do something. Naturally, this means that every file of
| every commit gets enqueued and scraped.
|
| Having everything behind auth defeats the point of making the
| repos public.
| kgeist wrote:
| >Having everything behind auth defeats the point of making
| the repos public.
|
| Maybe add a captcha? Can be something simple and ad hoc, but
| unique enough to throw off most bots.
| xena wrote:
| That's what I'm working on right now.
| aw4y wrote:
| Just add a forged link on the main page, pointing to a page that
| doesn't exist. When it's hit, block that IP. Maybe that way they
| will only crawl the first page?
| frankwiles wrote:
| Have had several clients hit by bad AI robots in the last few
| months. Sad because it's easy to honor robots.txt.
| keisborg wrote:
| Sounds like a job for nepenthes:
| https://news.ycombinator.com/item?id=42725147
| neilv wrote:
| Can demonstrable ignoring of robots.txt help the cases of
| copyright infringement lawsuits against the "AI" companies, their
| partners, and customers?
| adastra22 wrote:
| On what legal basis?
| readyplayernull wrote:
| Terms of use contract violation?
| bdangubic wrote:
| good thought but zero chance this holds up in court
| hipadev23 wrote:
| Robots.txt is completely irrelevant. TOU/TOS are also
| irrelevant unless you restrict access to only those who
| have agreed to terms.
| flir wrote:
| In the UK, the Computer Misuse Act applies if:
|
| * There is knowledge that the intended access was
| unauthorised
|
| * There is an intention to secure access to any program or
| data held in a computer
|
| I imagine US law has similar definitions of unauthorized
| access?
|
| `robots.txt` is the universal standard for defining what is
| unauthorised access for bots. No programmer could argue they
| aren't aware of this, and ignoring it, for me personally, is
| enough to show knowledge that the intended access was
| unauthorised. Is that enough for a court? Not a goddamn clue.
| Maybe we need to find out.
| pests wrote:
| > `robots.txt` is the universal standard
|
| Quite the assumption, you just upset a bunch of alien
| species.
| flir wrote:
| Dammit. Unchecked geocentric model privilege, sorry about
| that.
| to11mtm wrote:
| I mean, it might just be a matter of the right UK person
| filing a case. My main frame of reference is UK libel/slander
| law, and if my US brain runs with that, the burden of proof
| ends up on the side claiming non-infringement.
|
| (But again, I don't know UK law.)
| thayne wrote:
| Universal within the scope of the Internet.
| thayne wrote:
| Probably not copyright infringement. But it is probably
| (hopefully?) a violation of CFAA, both because it is
| effectively DDoSing you, and they are ignoring robots.txt.
|
| Maybe worth contacting law enforcement?
|
| Although it might not actually be Amazon.
| to11mtm wrote:
| Big thing worth asking here. Depending on what 'amazon' means
| here (i.e. known to be Amazon specific IPs vs Cloud IPs) it
| could just be someone running a crawler on AWS.
|
| Or, folks failing the 'shared security model' of AWS and
| their stuff is compromised with botnets running on AWS.
|
| Or, folks that are quasi-spoofing 'AmazonBot' because they
| think it will have a better not-block rate than anonymous or
| other requests...
| thayne wrote:
| From the information in the post, it sounds like the last
| one to me. That is, someone else spoofing an Amazonbot user
| agent. But it could potentially be all three.
| 23B1 wrote:
| HN when it's a photographer, writer, or artist concerned about IP
| laundering: _" Fair use! Information wants to be free!"_
|
| HN when it's bots hammering some guy's server _" Hey this is
| wrong!"_
|
| A lot of you are unfamiliar with the tragedy of the commons. I
| have seen the paperclip maximizer - and he is you lot.
|
| https://en.wikipedia.org/wiki/Tragedy_of_the_commons
| navanchauhan wrote:
| I think there's a difference between crawling websites at a
| reasonable pace and just hammering the server to the point
| it's unusable.
|
| Nobody has problems with the Google Search indexer trying to
| crawl websites in a responsible way
| 23B1 wrote:
| For sure.
|
| I'm really just pointing out the inconsistent technocrat
| attitude towards labor, sovereignty, and resources.
| Analemma_ wrote:
| Most of those artists aren't any better though. I'm on
| a couple artists' forums and outlets like Tumblr, and I saw
| firsthand the immediate, total 180 re: IP protection when genAI
| showed up. Overnight, everybody went from "copying isn't theft,
| it leaves the original!" and other such mantras, to being die-
| hard IP maximalists. To say nothing of how they went from
| "anything can be art and it doesn't matter what tools you're
| using" to forming witch-hunt mobs against people suspected of
| using AI tooling. AI has made a hypocrite out of everybody.
| 23B1 wrote:
| Manga nerds on Tumblr aren't the artists I'm worried about.
| I'm talking about people whose intellectual labor is being
| laundered by gigacorps and the inane defenses mounted by
| their techbro serfdom.
| depingus wrote:
| This submission has nothing to do with IP laundering. The bot
| is straining their server and causing OP technical issues.
| xena wrote:
| Their*
| depingus wrote:
| Fixed.
| xena wrote:
| Thanks!
| 23B1 wrote:
| Commentary is often second- and third-order.
| depingus wrote:
| True, but it tends to flow there organically. This comment
| was off topic from the start.
| flir wrote:
| Personally I'm not trying to block the bots, I'm trying to
| avoid the bandwidth bill.
|
| I've recently blocked everything that isn't offering a user
| agent. If it had only pulled text I probably wouldn't have
| cared, but it was pulling images as well (bot designers, take
| note - you can have orders of magnitude less impact if you skip
| the images).
|
| For me personally, what's left isn't eating enough bandwidth
| for me to care, and I think any attempt to serve _some_ bots is
| doomed to failure.
|
| If I really, really hated chatbots (I don't), I'd look at
| approaches that poison the well.
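|
| For the "no user agent, no service" rule, a minimal WSGI
| middleware sketch for a Python-served site is below; in
| practice this is more often done in the web server config,
| and the exact policy here is an assumption.
|
|     def require_user_agent(app):
|         """Reject requests that send no User-Agent at all."""
|         def middleware(environ, start_response):
|             ua = environ.get("HTTP_USER_AGENT", "").strip()
|             if not ua:
|                 start_response("403 Forbidden",
|                                [("Content-Type", "text/plain")])
|                 return [b"Forbidden\n"]
|             return app(environ, start_response)
|         return middleware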
| evilfred wrote:
| HN isn't a monolith
| thayne wrote:
| Are you sure it isn't a DDoS masquerading as Amazon?
|
| Requests coming from residential ips is really suspicious.
|
| Edit: the motivation for such a DDoS might be targeting Amazon,
| by taking down smaller sites and making it look like amazon is
| responsible.
|
| If it is Amazon, one place to start is blocking all the IP
| ranges they publish. Although it sounds like there are requests
| outside those ranges...
| OptionOfT wrote:
| You should check out websites like grass dot io (I refuse to
| give them traffic).
|
| They pay you for your bandwidth while they resell it to 3rd
| parties, which is why a lot of bot traffic looks like it comes
| from residential IPs.
| Aurornis wrote:
| Yes, but the point is that big company crawlers aren't paying
| for questionably sourced residential proxies.
|
| If this person is seeing a lot of traffic from residential
| IPs then I would be shocked if it's really Amazon. I think
| someone else is doing something sketchy and they put
| "AmazonBot" in the user agent to make victims think it's
| Amazon.
|
| You can set the user agent string to anything you want, as we
| all know.
| skywhopper wrote:
| It's not residential proxies. It's Amazon using IPs they
| sublease from residential ISPs.
| voakbasda wrote:
| I wonder if anyone has checked whether Alexa devices serve
| as a private proxy network for AmazonBot's use.
| baobun wrote:
| > Yes, but the point is that big company crawlers aren't
| paying for questionably sourced residential proxies
|
| You'd be surprised...
| WarOnPrivacy wrote:
| >> Yes, but the point is that big company crawlers aren't
| paying for questionably sourced residential proxies
|
| > You'd be surprised...
|
| Surprised by what? What do you know?
| ninkendo wrote:
| They could be using echo devices to proxy their traffic...
|
| Although I'm not necessarily gonna make that accusation,
| because it would be pretty serious misconduct if it were
| true.
| ninkendo wrote:
| To add: it's also kinda silly on the surface of it for
| Amazon to use consumer devices to hide their crawling
| traffic, but still leave "Amazonbot" in their UA
| string... it's pretty safe to assume they're not doing
| this.
| dafelst wrote:
| I worked for Microsoft doing malware detection back 10+
| years ago, and questionably sourced proxies were well and
| truly on the table
| WarOnPrivacy wrote:
| >> but the point is that big company crawlers aren't
| paying for questionably sourced residential proxies.
|
| > I worked for Microsoft doing malware detection back 10+
| years ago, and questionably sourced proxies were well and
| truly on the table
|
| Big Company Crawlers using questionably sourced proxies -
| this seems striking. What can you share about it?
| to11mtm wrote:
| My guess is they probably can't, because some of the proxies
| were used by TLAs...
| SOLAR_FIELDS wrote:
| Wild. While I'm sure the service is technically legal since
| it can be used for non-nefarious purposes, signing up for a
| service like that seems like a guarantee that you are
| contributing to problematic behavior.
| scubbo wrote:
| I, too, hate this future. This[0] might be a satisfying way to
| fight back.
|
| [0] https://zadzmo.org/code/nepenthes/
| surfingdino wrote:
| Return "402 Payment Required" and block?
| xyzal wrote:
| No. Feed them shit. Code with deliberate security vulns and so
| on.
| serhack_ wrote:
| https://marcusb.org/hacks/quixotic.html
| byyll wrote:
| https://ip-ranges.amazonaws.com/ip-ranges.json ?
| xena wrote:
| I'd love it if Amazon could give me some AWS credit as a sign of
| good faith to make up for the egress overages their and other
| bots are causing, but the ads on this post are likely going to
| make up for it. Unblock ads and I come out even!
| Aurornis wrote:
| I don't think I'd assume this is actually Amazon. The author is
| seeing requests from rotating residential IPs and changing user
| agent strings
|
| > It's futile to block AI crawler bots because they lie, change
| their user agent, use residential IP addresses as proxies, and
| more.
|
| Impersonating crawlers from big companies is a common technique
| for people trying to blend in. The fact that requests are coming
| from residential IPs is a big red flag that something else is
| going on.
| paranoidrobot wrote:
| I wouldn't put it past any company these days doing crawling in
| an aggressive manner to use proxy networks.
| smileybarry wrote:
| With the amount of "if cloud IP then block" rules in place
| for many things (to weed out streaming VPNs and "potential"
| ddos-ing) I wouldn't doubt that at all.
| cmeacham98 wrote:
| I work for Amazon, but not directly on web crawling.
|
| Based on the internal information I have been able to gather,
| it is highly unlikely this is actually Amazon. Amazonbot is
| supposed to respect robots.txt and should always come from an
| Amazon-owned IP address (You can see verification steps here:
| https://developer.amazon.com/en/amazonbot).
|
| I've forwarded this internally just in case there is some crazy
| internal team I'm not aware of pulling this stunt, but I would
| strongly suggest the author treats this traffic as malicious
| and lying about its user agent.
| AyyEye wrote:
| > The author is seeing requests from rotating residential IPs
| and changing user agent strings
|
| This type of thing is commercially available as a service[1].
| Hundreds of Millions of networks backdoored and used as
| crawlers/scrapers because of an included library somewhere --
| and ostensibly legal because somewhere in some ToS they had
| some generic line that could plausibly be extended to using you
| as a patsy for quasi-legal activities.
|
| [1] https://brightdata.com/proxy-types/residential-proxies
| stainablesteel wrote:
| Crazy how what seemed like an excellent landmark case around
| web crawling turned around like this so quickly due to AI.
| LukeShu wrote:
| Before I configured Nginx to block them:
|
| - Bytespider (59%) and Amazonbot (21%) together accounted for 80%
| of the total traffic to our Git server.
|
| - ClaudeBot drove more traffic through our Redmine in a month
| than it saw in the combined _5 years_ prior to ClaudeBot.
| dbaio wrote:
| suffering with it as well. why can't they just `git clone` and do
| their stuff? =)
| rattlesnakedave wrote:
| No evidence provided that this is amazonbot or AI related.
| Someone is just upset that their website is getting traffic,
| which seems asinine.
| kazinator wrote:
| What is the proof that a hit from a residential IP address is
| actually Amazon? And if you have a way to tell, why not make use
| of it.
| trevor-e wrote:
| What are the actual rules/laws about scraping? I have a few
| projects I'd like to do that involve scraping but have always
| been conscious about respecting the host's servers, plus whether
| private content is copyrighted. But sounds like AI companies
| don't give a shit lol. If anyone has a good resource on the
| subject I'd be grateful!
| lazystar wrote:
| If you go to a police station and ask them to arrest Amazon for
| accessing your website too often, will they arrest Amazon, or
| laugh at you?
|
| While facetious in nature, my point is that people walking
| around in real brick and mortar locations simply do not care.
| If you want police to enforce laws, those are the kinds of
| people that need to care about your problem. Until that occurs,
| you'll have to work around the problem.
| armchairhacker wrote:
| I like the solution in this comment:
| https://news.ycombinator.com/item?id=42727510.
|
| Put a link somewhere in your site that no human would visit,
| disallow it in robots.txt (under a wildcard because apparently
| OpenAI's crawler specifically ignores wildcards), and when an IP
| address visits the link ban it for 24 hours.
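|
| A sketch of that trap (the path, log format, and nft set name
| are illustrative assumptions): tail the access log, and any IP
| requesting the disallowed path gets banned for 24 hours.
|
|     import re
|     import subprocess
|     import sys
|     import time
|
|     TRAP = "/do-not-crawl/"  # also under Disallow: in robots.txt
|     BAN_SECONDS = 24 * 3600
|     banned = {}              # ip -> expiry timestamp
|
|     line_re = re.compile(r'^(\S+) .* "(?:GET|HEAD) (\S+)')
|
|     def nft(action, ip):
|         subprocess.run(["nft", action, "element", "inet",
|                         "filter", "trap_ban", f"{{ {ip} }}"],
|                        check=False)
|
|     for line in sys.stdin:   # tail -F access.log | python3 trap.py
|         m = line_re.match(line)
|         if not m:
|             continue
|         ip, path = m.group(1), m.group(2)
|         now = time.time()
|         # Expire old bans so the table doesn't grow forever.
|         for old_ip in [i for i, exp in banned.items() if exp < now]:
|             del banned[old_ip]
|             nft("delete", old_ip)
|         if path.startswith(TRAP) and ip not in banned:
|             banned[ip] = now + BAN_SECONDS
|             nft("add", ip)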
| Szpadel wrote:
| I had to deal with some bot activity that used a huge address
| space, and I tried something very similar: when a condition
| confirming a bot was detected, I banned that IP for 24h.
|
| But due to the number of IPs involved, this did not have any
| real impact on the amount of traffic.
|
| My suggestion is to look very closely at the headers you
| receive (varnishlog is very nice for this), and if you stare
| long enough you might spot something that all those requests
| have in common that would allow you to easily identify them
| (like a very specific and unusual combination of reported
| language and geo location, or the same outdated browser
| version, etc.)
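|
| A rough sketch of that header-staring step: group requests by
| a tuple of header values and surface the suspiciously common
| combinations. The header choice and the one-JSON-object-per-
| request log source are assumptions.
|
|     import json
|     import sys
|     from collections import Counter
|
|     def fingerprint(entry):
|         return (
|             entry.get("user_agent", ""),
|             entry.get("accept_language", ""),
|             entry.get("accept", ""),
|             entry.get("country", ""),  # if your logger adds GeoIP
|         )
|
|     counts = Counter()
|     for line in sys.stdin:       # one JSON object per request
|         try:
|             counts[fingerprint(json.loads(line))] += 1
|         except json.JSONDecodeError:
|             continue
|
|     for fp, n in counts.most_common(20):
|         print(n, fp)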
| aaomidi wrote:
| Maybe ban ASNs /s
| koito17 wrote:
| This was indeed one mitigation used by a site to prevent
| bots hosted on AWS from uploading CSAM and generating bogus
| reports to the site's hosting provider.[1]
|
| In any case, I agree with the sarcasm. Blocking data center
| IPs may not help the OP, because some of the bots are
| resorting to residential IP addresses.
|
| [1] https://news.ycombinator.com/item?id=26865236
| conradev wrote:
| My favorite example of this was how folks fingerprinted the
| active probes of the Great Firewall of China. It has a large
| pool of IP addresses to work with (i.e. all ISPs in China),
| but the TCP timestamps were shared across a small number of
| probing machines:
|
| "The figure shows that although the probers use thousands of
| source IP addresses, they cannot be fully independent,
| because they share a small number of TCP timestamp sequences"
|
| https://censorbib.nymity.ch/pdf/Alice2020a.pdf
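|
| A toy version of that fingerprint: a host's TCP timestamps
| grow roughly as tsval ~= offset + hz * wall_time, so samples
| from the same machine share an offset no matter which source
| IP they arrive from. The clock rate and bucket size below are
| assumptions.
|
|     from collections import Counter
|
|     HZ = 1000        # assumed timestamp clock (often 100-1000)
|     BUCKET = 10_000  # coarseness of the offset clustering
|
|     def cluster(observations):
|         """observations: (wall_time_s, src_ip, tsval) tuples."""
|         buckets = Counter()
|         for wall_time, _ip, tsval in observations:
|             offset = tsval - HZ * wall_time
|             buckets[round(offset / BUCKET)] += 1
|         return buckets
|
|     # Thousands of source IPs collapsing into a handful of
|     # offset buckets suggests a few machines behind a big pool.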
| superjan wrote:
| Why work hard... Train a model to recognize the AI bots!
| js4ever wrote:
| Because you have to decide in less than 1ms, using AI is
| too slow in that context
| to11mtm wrote:
| Uggh, web crawlers...
|
| 8ish years ago, at the shop I worked at we had a server taken
| down. It was an image server for vehicles. How did it go down?
| Well, the crawler in question somehow had access to vehicle
| image links we had due to our business. Unfortunately, the
| perfect storm of the image not actually existing (can't
| remember why, mighta been one of those weird cases where we did
| a re-inspection without issuing new inspection ID) resulted in
| them essentially DOSing our condition report image server.
| Worse, there was a bug in the error handler somehow, such that
| the server process restarted when this condition happened. This
| had the -additional- disadvantage of invalidating our 'for .NET
| 2.0, pretty dang decent' caching implementation...
|
| It comes to mind because I'm pretty sure we started doing some
| canary techniques just to be safe. (Ironically, even some
| simple ones were still cheaper than adding a different web
| server... yes, we also fixed the caching issue... yes, we also
| added a way to 'scream' if we got too many bad requests on
| that service.)
| shakna wrote:
| When I was writing a crawler for my search engine (now
| offline), I found almost no crawler library actually compliant
| with the real world. So I ended up going to a lot of effort to
| write one that complied with Amazon and Google's rather
| complicated nested robots files, including respecting the cool
| off periods as requested.
|
| ... And then found their own crawlers can't parse their own
| manifests.
| bb010g wrote:
| Could you link the source of your crawler library?
| more_corn wrote:
| Cloudflare free plan has bot protection.
| deanc wrote:
| We have had the same problem at my client now for the last couple
| of months, but from Facebook (using their IP ranges). They don't
| even respect 429 responses, and the business is hesitant to
| outright ban them in case it impacts open graph or Facebook
| advertising tooling.
| Havoc wrote:
| He seems to have a mistake in his rule?
|
| He's got "(Amazon)" while Amazon lists their useragent as
| "(Amazonbot/0.1;"
| xena wrote:
| It's a regular expression.
| cyrnel wrote:
| The author's pronouns can be found here: https://github.com/Xe
| evantbyrne wrote:
| It seems like git self-hosters frequently encounter DDoS issues.
| I know it's not typical for free software, but I wonder if gating
| file contents behind a login and allowing registrations could be
| the answer for self-hosting repositories on the cheap.
| freetanga wrote:
| Probably dumb question, but any enlightenment would be welcome to
| help me learn:
|
| Could this be prevented by having a link that when followed would
| serve a dynamically generated page that does all of the
| following:
|
| A) inserts some fake content outlining the oligarchs' more
| lurid rumours, or whichever disinformation you choose to push
|
| B) embeds links to assets on the oligarchs' companies' sites
| so they get hit with some bandwidth
|
| C) dynamically creates new random pages that link back to
| themselves
|
| And thus creates an infinite loop, similar to a gzip bomb,
| which could potentially taint the model if done by enough
| people.
| to11mtm wrote:
| Not a crawler writer but have FAFOd with data structures in the
| past to large career success.
|
| ...
|
| The closest you could possibly do with any meaningful
| influence, is option C, with the general observations of:
|
| 1. You'd need to 'randomize' the generated output link
|
| 2. You'd also want to maximize cachability of the replayed
| content to minimize work.
|
| 3. Add layers of obfuscation on the frontend side, for
| instance a 'hidden' link (maybe with some prompt fuckery if
| you are brave) inside the HTML, with a random bad link on your
| normal pages.
|
| 4. Randomize parts of the honeypot link pattern. At some point
| someone monitoring logs/etc will see that it's a loop and
| blacklist the path.
|
| 5. Keep up at 4 and eventually they'll hopefully stop crawling.
|
| ---
|
| On the lighter side...
|
| 1. do some combination of above but have all honeypot links
| contain the right words that an LLM will just nope out of for
| regulatory reasons.
|
| That said, all the above will do is minimize pain (except,
| perhaps ironically, the joke response, which will more likely
| get you blacklisted, but could also get you on a list or earn
| you a TLA visit)...
|
| ... Most pragmatically, I'd start by suggesting the best option
| is a combination of nonlinear rate limiting, both on the ramp-
| up and the ramp-down. That is, the faster requests come in, the
| more you increment their 'valueToCheckAgainstLimit`. The longer
| it's been since last request, the more you decrement.
|
| Also pragmatically, if you can extend that to put together even
| semi-sloppy code to then scan when a request to a junk link
| that results in a ban immediately results in another IP trying
| to hit the same request... well ban that IP as soon as you see
| it, at least for a while.
|
| With the right sort of lookup table, IP Bans can be fairly
| simple to handle on a software level, although the 'first-time'
| elbow grease can be a challenge.
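|
| One way to read that nonlinear ramp-up/ramp-down idea is the
| sketch below: a per-client score that grows faster the closer
| together requests arrive and decays with idle time. The
| thresholds and curves are illustrative guesses.
|
|     import time
|
|     class NonlinearLimiter:
|         def __init__(self, limit=100.0):
|             self.limit = limit
|             self.state = {}  # ip -> (score, last_seen)
|
|         def allow(self, ip):
|             now = time.monotonic()
|             score, last_seen = self.state.get(ip, (0.0, now))
|             gap = max(now - last_seen, 1e-3)
|             # Ramp down: the longer the silence, the bigger
|             # the decay.
|             score = max(0.0, score - gap ** 1.5)
|             # Ramp up: the tighter the spacing, the bigger
|             # the increment.
|             score += min(10.0, 1.0 / gap)
|             self.state[ip] = (score, now)
|             return score < self.limit
|
| In practice you would also expire idle entries from the state
| table so it does not grow without bound.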
| vachina wrote:
| I'm surprised everyone else's servers are struggling to handle a
| couple of bot scrapes.
|
| I run a couple of public facing websites on a NUC and it just...
| chugs along? This is also amidst the constant barrage of OSINT
| attempts at my IP.
| cyrnel wrote:
| Seems some of these bots are behaving abusively on sites with
| lots of links (like git forges). I have some sites receiving
| 200 requests per day and some receiving 1 million requests per
| day from these AI bots, depending on the design of the site.
| xena wrote:
| Gitea in particular is a worst case for this. Gitea shows
| details about every file at every version and every commit if
| you click enough. The bots click every link. This fixed cost
| adds up when hundreds of IPs are each at different stages of
| clicking every link.
| TonyTrapp wrote:
| Depends on what you are hosting. I found that source code
| repository viewers in particular (OP mentions Gitea, but I have
| seen it with others as well) are really troublesome: Each and
| every commit that exists in your repository can potentially
| cause dozens if not hundreds of new unique pages to exist (diff
| against previous version, diff against current version, show
| file history, show file blame, etc...). Plus many of these repo
| viewers seem to take this information directly from the source
| repository without much caching involved. This is
| different from typical blogging or forum software, which is
| often designed to be able to handle really huge websites and
| thus have strong caching support. So far, nobody expected
| source code viewers to be so popular that performance could be
| an issue, but with AI scrapers this is quickly changing.
| serhack_ wrote:
| Indeed: https://marcusb.org/hacks/quixotic.html - don't block
| LLM bot traffic; instead start injecting spurious content to
| ""improve"" their data. Markov chains at their finest!
| gazchop wrote:
| Back to Gopher. They'll never get us there!
| ThinkBeat wrote:
| The best way to fight this would not be to block them; that
| clearly does not cost Amazon/others anything.
|
| What if instead it was possible to feed the bots clearly
| damaging and harmful content?
|
| If done on a larger scale, and Amazon discovered the poisoned
| pills, they would have to spend money rooting them out quickly
| and make attempts to stop their bots from ingesting them.
|
| Of course nobody wants to have that stuff on their own site
| though. That is the biggest problem with this.
| ADeerAppeared wrote:
| > What if instead it was possible to feed the bots clearly
| damaging and harmfull content?
|
| With all respect, you're completely misunderstanding the scope
| of AI companies' misbehaviour.
|
| These scrapers already gleefully chow down on CSAM and all
| other likewise horrible things. OpenAI had some of their Kenyan
| data-tagging subcontractors quit on them over this. (2023,
| Time)
|
| The current crop of AI firms do not care about data quality.
| Only quantity. The only thing you can do to harm them is to
| hand them 0 bytes.
|
| You would go directly to jail for things even a tenth as bad as
| Sam Altman has authorized.
| smeggysmeg wrote:
| I've seen this tarpit recommended for this purpose. it creates
| endless nests of directories and endless garbage content, as
| the site is being crawled. The bot can spend hours on it.
|
| https://zadzmo.org/code/nepenthes/
| ThinkBeat wrote:
| How many TB is your repo?
|
| Do they keep retrieving the same data from the same links over
| and over and over again, as if stuck in a forever loop that
| runs week after week?
|
| Or are they crawling your site in a hyper-aggressive way but
| getting more and more data, so it may take them, say, 2 days
| to crawl over it and then they go away?
| Animats wrote:
| It's time for a lawyer letter. See the Computer Fraud and Abuse
| Act prosecution guidelines.[1] In general, the US Justice
| Department will not consider any access to open servers that's
| not clearly an attack to be "unauthorized access". But,
|
| _" However, when authorizers later expressly revoke
| authorization--for example, through unambiguous written cease and
| desist communications that defendants receive and understand--the
| Department will consider defendants from that point onward not to
| be authorized."_
|
| So, you get a lawyer to write an "unambiguous cease and desist"
| letter. You have it delivered to Amazon by either registered mail
| or a process server, as recommended by the lawyer. Probably both,
| plus email.
|
| Then you wait and see if Amazon stops.
|
| If they don't stop, you can file a criminal complaint. That will
| get Amazon's attention.
|
| [1] https://www.justice.gov/jm/jm-9-48000-computer-fraud
| xena wrote:
| Honestly, I figure that being on the front page of Hacker News
| like this is more than shame enough to get a human from the
| common sense department to read and respond to the email I sent
| politely asking them to stop scraping my git server. If I don't
| get a response by next Tuesday, I'm getting a lawyer to write a
| formal cease and desist letter.
| DrBenCarson wrote:
| Lol you really think an ephemeral HN ranking will make
| change?
| xena wrote:
| There's only one way to find out!
| usefulcat wrote:
| It's not unheard of. But neither would I count on it.
| gazchop wrote:
| No one gives a fuck in this industry until someone turns up
| with bigger lawyers. This is behaviour which is written off
| with no ethical concerns as ok until _that_ bigger fish comes
| along.
|
| Really puts me off it.
| amarcheschi wrote:
| It's computer science; nothing changes on the corpo side until
| they get a lawyer letter.
|
| And even then, it's probably not going to be easy
| idlewords wrote:
| My site (Pinboard) is also getting hammered by what I presume are
| AI crawlers. It started out this summer with Chinese and
| Singapore IPs, but now I can't even block by IP range, and have
| to resort to captchas. The level of traffic is enough to
| immediately crash the site, and I don't even have any interesting
| text for them to train on, just links.
|
| I'm curious how OP figured out it's Amazon's crawler to blame. I
| would love to point the finger of blame.
| advael wrote:
| Unless we start chopping these tech companies down there's not
| much hope for the public internet. They now have an incentive to
| crawl anything they can and have vastly more resources than even
| most governments. Most resources I need to host in a way that's
| internet facing are behind keyauth and I'm not sure I see a way
| around doing that for at least a while
| dmwilcox wrote:
| I wonder if there is a good way to copy something out of fossil
| scm or externalize this component for more general use.
|
| https://fossil-scm.org/home/doc/trunk/www/antibot.wiki
|
| I ran into this weeks ago and was super impressed to solve a
| self-hosted captcha and login as "anonymous". I use cgit
| currently but have dabbled with fossil previously and if bots
| were a problem I'd absolutely consider this
| knowitnone wrote:
| Feed them false data. If fed by enough people (I'm looking at
| you, HN), their AI will be inaccurate to the point of being
| useless.
| Aloisius wrote:
| Using status code 418 (I'm a teapot), while cute, actually works
| against you since even well behaved bots don't know how to handle
| it and thus might not treat it as a permanent status, causing
| them to recrawl later.
|
| Plus you'll want to allow access to /robots.txt.
___________________________________________________________________
(page generated 2025-01-18 23:00 UTC)