[HN Gopher] Amazon's AI crawler is making my Git server unstable
___________________________________________________________________
Amazon's AI crawler is making my Git server unstable
Author : OptionOfT
Score : 398 points
Date : 2025-01-18 18:48 UTC (4 hours ago)
(HTM) web link (xeiaso.net)
(TXT) w3m dump (xeiaso.net)
| ChrisArchitect wrote:
| Earlier: https://news.ycombinator.com/item?id=42740095
| trebor wrote:
| Upvoted because we're seeing the same behavior from all AI and
| SEO bots. They're BARELY respecting robots.txt, and they're hard
| to block. And when they crawl, they spam requests and drive up
| load so high that they crash many of our clients' servers.
|
| If AI crawlers want access they can either behave, or pay. The
| consequence will be almost universal blocks otherwise!
| mschuster91 wrote:
| Global tarpit is the solution. It makes sense anyway even
| without taking AI crawlers into account. Back when I had to
| implement that, I went the semi manual route - parse the access
| log and any IP address averaging more than X hits a second on
| /api gets a -j TARPIT with iptables [1].
|
| Not sure how to implement it in the cloud though, never had the
| need for that there yet.
|
| [1]
| https://gist.github.com/flaviovs/103a0dbf62c67ff371ff75fc62f...
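|
| Roughly, that semi-manual route looks like the sketch below:
| scan an access log, find IPs averaging more than X hits a
| second on /api, and emit iptables rules using the TARPIT
| target from xtables-addons. The log format, threshold, and
| window are illustrative assumptions, not an exact setup.
|
|     import re
|     from collections import Counter
|
|     LOG = "/var/log/nginx/access.log"  # assumed combined log
|     MAX_RPS = 5                        # "X hits a second"
|     WINDOW_SECONDS = 3600              # history covered by LOG
|
|     line_re = re.compile(
|         r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) /api\S*')
|
|     hits = Counter()
|     with open(LOG) as fh:
|         for line in fh:
|             m = line_re.match(line)
|             if m:
|                 hits[m.group(1)] += 1
|
|     for ip, count in hits.items():
|         if count / WINDOW_SECONDS > MAX_RPS:
|             # Requires xtables-addons for the TARPIT target.
|             print(f"iptables -I INPUT -s {ip} -p tcp "
|                   f"--dport 443 -j TARPIT")
|
| In practice you'd run something like this from cron and skip
| IPs that already have a rule installed.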
| jks wrote:
| One such tarpit (Nepenthes) was just recently mentioned on
| Hacker News: https://news.ycombinator.com/item?id=42725147
|
| Their site is down at the moment, but luckily they haven't
| stopped Wayback Machine from crawling it: https://web.archive
| .org/web/20250117030633/https://zadzmo.or...
| kazinator wrote:
| How do you know their site is down? You probably just hit
| their tarpit. :)
| marcus0x62 wrote:
| Quixotic[0] (my content obfuscator) includes a tarpit
| component, but for something like this, I think the main
| quixotic tool would be better - you run it against your
| content once, and it generates a pre-obfuscated version of
| it. It takes a lot less of your resources to serve than
| dynamically generating the tarpit links and content.
|
| 0 - https://marcusb.org/hacks/quixotic.html
| bwfan123 wrote:
| I would think public outcry by influencers on social media
| (such as this thread) is a better deterrent, and it also
| establishes a public datapoint and exhibit for future
| reference... as it is hard to scale the tarpit.
| idlewords wrote:
| This doesn't work with the kind of highly distributed
| crawling that is the problem now.
| gundmc wrote:
| What do you mean by "barely" respecting robots.txt? Wouldn't
| that be more binary? Are they respecting some directives and
| ignoring others?
| unsnap_biceps wrote:
| I believe that a number of AI bots only respect robots.txt
| entries that explicitly name their static user agent.
| They ignore wildcard user agents.
|
| That counts as barely imho.
|
| I found this out after OpenAI was decimating my site and
| ignoring the wildcard deny-all. I had to add entries
| specifically for their three bots to get them to stop.
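|
| For reference, a minimal robots.txt along those lines; the
| wildcard stays for well-behaved crawlers, and the explicit
| entries use OpenAI's commonly documented bot names (check each
| vendor's docs for the exact tokens they match on):
|
|     # Wildcard deny-all, which some crawlers ignore
|     User-agent: *
|     Disallow: /
|
|     # Explicit entries for bots that only honor their own name
|     User-agent: GPTBot
|     Disallow: /
|
|     User-agent: ChatGPT-User
|     Disallow: /
|
|     User-agent: OAI-SearchBot
|     Disallow: /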
| noman-land wrote:
| This is highly annoying and rude. Is there a complete list
| of all known bots and crawlers?
| jsheard wrote:
| https://darkvisitors.com/agents
|
| https://github.com/ai-robots-txt/ai.robots.txt
| joecool1029 wrote:
| Even some non-profits ignore it now; the Internet Archive
| stopped respecting it years ago:
| https://blog.archive.org/2017/04/17/robots-txt-meant-for-
| sea...
| SR2Z wrote:
| IA actually has technical and moral reasons to ignore
| robots.txt. Namely, they want to circumvent this stuff
| because their goal is to archive EVERYTHING.
| amarcheschi wrote:
| I also don't think they hit servers repeatedly so much
| LukeShu wrote:
| Amazonbot doesn't respect the `Crawl-Delay` directive. To be
| fair, Crawl-Delay is non-standard, but it is claimed to be
| respected by the other 3 most aggressive crawlers I see.
|
| And how often does it check robots.txt? ClaudeBot will make
| hundreds of thousands of requests before it re-checks
| robots.txt to see that you asked it to please stop DDoSing
| you.
| Animats wrote:
| Here's Google, complaining of problems with pages they want
| to index but I blocked with robots.txt:
|
|     New reason preventing your pages from being indexed
|
|     Search Console has identified that some pages on your
|     site are not being indexed due to the following new
|     reason: Indexed, though blocked by robots.txt
|
|     If this reason is not intentional, we recommend that you
|     fix it in order to get affected pages indexed and
|     appearing on Google.
|
|     Open indexing report
|
|     Message type: [WNC-20237597]
| Vampiero wrote:
| > The consequence will almost universal blocks otherwise!
|
| Who cares? They've already scraped the content by then.
| jsheard wrote:
| Bold to assume that an AI scraper won't come back to download
| everything again, just in case there's any new scraps of data
| to extract. OP mentioned in the other thread that this bot
| had pulled 3TB so far, and I doubt their git server actually
| has 3TB of unique data, so the bot is probably pulling the
| same data over and over again.
| xena wrote:
| FWIW that includes other scrapers, Amazon's is just the one
| that showed up the most in the logs.
| _heimdall wrote:
| If they only needed a one-time scrape we really wouldn't be
| seeing noticeable bot traffic today.
| herpdyderp wrote:
| > The consequence will almost universal blocks otherwise!
|
| How? The difficulty of doing that is the problem, isn't it?
| (Otherwise we'd just be doing that already.)
| ADeerAppeared wrote:
| > (Otherwise we'd just be doing that already.)
|
| Not quite what the original commenter meant but: WE ARE.
|
| A major consequence of this reckless AI scraping is that it
| turbocharged the move away from the web and into closed
| ecosystems like Discord. Away from the prying eyes of most AI
| scrapers ... and the search engine indexes that made the
| internet so useful as an information resource.
|
| Lots of old websites & forums are going offline as their
| hosts either cannot cope with the load or send a sizeable
| bill to the webmaster who then pulls the plug.
| ksec wrote:
| Is there some way websites can sell that data to AI bots in a
| large zip file rather than being constantly DDoSed?
|
| Or they could at least have the courtesy to scrape during
| night time / off-peak hours.
| jsheard wrote:
| No, because they won't pay for anything they can get for
| free. There's only one situation where an AI company will pay
| for data, and that's when it's owned by someone with scary
| enough lawyers to pressure them into paying up. Hence why
| OpenAI has struck licensing deals with a handful of companies
| while continuing to bulk-scrape unlicensed data from everyone
| else.
| awsanswers wrote:
| Unacceptable, sorry this is happening. Do you know about
| fail2ban? You can have it automatically filter IPs that violate
| certain rules. One rule could be matching on the bot trying
| certain URLs. You might be able to get some kind of honeypot
| going with that idea. Good luck
| thayne wrote:
| They said that it is coming from different ip addresses every
| time, so fail2ban wouldn't help.
| keisborg wrote:
| Monitor access logs for links that only crawlers can find.
|
| Edit: oh, I got your point now.
| jsheard wrote:
| Amazon does publish every IP address range used by AWS, so
| there is the nuclear option of blocking them all pre-
| emptively.
|
| https://docs.aws.amazon.com/vpc/latest/userguide/aws-ip-
| rang...
| xena wrote:
| I'd do that, but my DNS is via route 53. Blocking AWS would
| block my ability to manage DNS automatically as well as
| certificate issuance via DNS-01.
| unsnap_biceps wrote:
| If you only block new inbound requests, it shouldn't
| impact your route 53 or DNS-01 usage.
| actuallyalys wrote:
| They list a service for each address, so maybe you could
| block all the non-Route 53 IP addresses. Although that
| assumes they aren't using the Route 53 IPs or unlisted
| IPs for scraping (the page warns it's not a comprehensive
| list).
|
| Regardless, it sucks that you have to deal with this. The
| fact that you're a customer makes it all the more absurd.
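|
| A sketch of that approach, using the ip-ranges.json feed
| linked elsewhere in this thread: print every published AWS
| CIDR except the Route 53 ones (so DNS management and DNS-01
| keep working), ready to feed into a firewall. The service
| filter and the nft set named in the comment are assumptions.
|
|     import json
|     import urllib.request
|
|     URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"
|     SKIP = {"ROUTE53", "ROUTE53_HEALTHCHECKS"}
|
|     def aws_cidrs():
|         with urllib.request.urlopen(URL) as resp:
|             data = json.load(resp)
|         for p in data["prefixes"]:
|             if p["service"] not in SKIP:
|                 yield p["ip_prefix"]
|         for p in data["ipv6_prefixes"]:
|             if p["service"] not in SKIP:
|                 yield p["ipv6_prefix"]
|
|     if __name__ == "__main__":
|         for cidr in sorted(set(aws_cidrs())):
|             # e.g. nft add element inet filter aws_block { CIDR }
|             print(cidr)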
| SteveNuts wrote:
| It'll most likely eventually help, as long as they don't have
| an infinite address pool.
|
| Do these bots use some client software (browser plugin,
| desktop app) that's consuming unsuspecting users' bandwidth
| for distributed crawling?
| srameshc wrote:
| Has anyone tried using Cloudflare Bot Management and how
| effective is it against such bots?
| martinp wrote:
| I put my personal site behind Cloudflare last year specifically
| to combat AI bots. It's very effective, but I hate that the web
| has devolved to a state where using a service like Cloudflare
| is practically no longer optional.
| unsnap_biceps wrote:
| > About 10% of the requests do not have the amazonbot user agent.
|
| Is there any bot string in the user agent? I'd wonder if it's
| GPTBot as I believe they don't respect a robots.txt deny
| wildcard.
| Ndymium wrote:
| I had this same issue recently. My Forgejo instance started to
| use 100 % of my home server's CPU as Claude and its AI friends
| from Meta and Google were hitting the basically infinite links at
| a high rate. I managed to curtail it with robots.txt and a user
| agent based blocklist in Caddy, but who knows how long that will
| work.
|
| Whatever happened to courtesy in scraping?
| jsheard wrote:
| > Whatever happened to courtesy in scraping?
|
| Money happened. AI companies are financially incentivized to
| take as much data as possible, as quickly as possible, from
| anywhere they can get it, and for now they have so much cash to
| burn that they don't really need to be efficient about it.
| nicce wrote:
| They need to act fast before the copyright cases in the
| courts get decided.
| bwfan123 wrote:
| Not only money, but also a culture of "all your data belong
| to us", because our AI is going to save you and the world.
|
| The hubris reminds me of the dot-com era. That bust left huge
| wreckage; not sure how this one is going to land.
| __loam wrote:
| It's gonna be rough. If you can't make money charging
| people $200 a month for your service then something is
| deeply wrong.
| Analemma_ wrote:
| The same thing that happened to courtesy in every other
| context: it only existed in contexts where there was no profit
| to be made in ignoring it. The instant that stopped being true,
| it was ejected.
| baobun wrote:
| Mind sharing a decent robots.txt and/or user-agent list to
| block the AI crawlers?
| hooloovoo_zoo wrote:
| Any of the big chat models should be able to reproduce it :)
| to11mtm wrote:
| > Whatever happened to courtesy in scraping?
|
| When various companies got the signal that, at least for now,
| they have a huge Overton window of what is acceptable for AI
| to ingest, they decided to take all they can before
| regulation even tries to clamp down.
|
| The bigger danger is that one of these companies, even (or
| especially) one that claims to be 'Open', does so but gets to
| the point of being considered 'too big to fail' from an
| economic/natsec standpoint...
| AznHisoka wrote:
| Excuse my technical ignorance, but is it actually trying to get
| all the files in your git repo? Couldn't you just have everything
| behind a user/pass if so?
| xena wrote:
| Author of the article here. The behavior of the bot seems like
| this:
|
|     while true {
|       const page = await load_html_page(read_from_queue());
|       save_somewhere(page);
|       foreach link in page {
|         enqueue(link);
|       }
|     }
|
| This means that every link on every page gets enqueued and
| saved to do something. Naturally, this means that every file of
| every commit gets enqueued and scraped.
|
| Having everything behind auth defeats the point of making the
| repos public.
| kgeist wrote:
| >Having everything behind auth defeats the point of making
| the repos public.
|
| Maybe add a captcha? Can be something simple and ad hoc, but
| unique enough to throw off most bots.
| xena wrote:
| That's what I'm working on right now.
| aw4y wrote:
| Just add a forged link on the main page, pointing to a page that
| doesn't exist. When it's hit, block that IP. Maybe that way they
| will only crawl the first page?
| frankwiles wrote:
| Have had several clients hit by bad AI robots in the last few
| months. Sad because it's easy to honor robots.txt.
| keisborg wrote:
| Sounds like a job for nepenthes:
| https://news.ycombinator.com/item?id=42725147
| neilv wrote:
| Can demonstrable ignoring of robots.txt help the cases of
| copyright infringement lawsuits against the "AI" companies, their
| partners, and customers?
| adastra22 wrote:
| On what legal basis?
| readyplayernull wrote:
| Terms of use contract violation?
| bdangubic wrote:
| good thought but zero chance this holds up in court
| hipadev23 wrote:
| Robots.txt is completely irrelevant. TOU/TOS are also
| irrelevant unless you restrict access to only those who
| have agreed to terms.
| flir wrote:
| In the UK, the Computer Misuse Act applies if:
|
| * There is knowledge that the intended access was
| unauthorised
|
| * There is an intention to secure access to any program or
| data held in a computer
|
| I imagine US law has similar definitions of unauthorized
| access?
|
| `robots.txt` is the universal standard for defining what is
| unauthorised access for bots. No programmer could argue they
| aren't aware of this, and ignoring it, for me personally, is
| enough to show knowledge that the intended access was
| unauthorised. Is that enough for a court? Not a goddamn clue.
| Maybe we need to find out.
| pests wrote:
| > `robots.txt` is the universal standard
|
| Quite the assumption, you just upset a bunch of alien
| species.
| flir wrote:
| Dammit. Unchecked geocentric model privilege, sorry about
| that.
| to11mtm wrote:
| I mean, it might just be a matter of the right UK person
| filing a case. My main frame of reference is UK libel/slander
| law, and if my US brain runs with that, the burden of proof
| ends up on the side claiming non-infringement.
|
| (But again, I don't know UK law.)
| thayne wrote:
| Universal within the scope of the Internet.
| thayne wrote:
| Probably not copyright infringement. But it is probably
| (hopefully?) a violation of CFAA, both because it is
| effectively DDoSing you, and they are ignoring robots.txt.
|
| Maybe worth contacting law enforcement?
|
| Although it might not actually be Amazon.
| to11mtm wrote:
| Big thing worth asking here. Depending on what 'amazon' means
| here (i.e. known to be Amazon specific IPs vs Cloud IPs) it
| could just be someone running a crawler on AWS.
|
| Or, folks failing the 'shared security model' of AWS and
| their stuff is compromised with botnets running on AWS.
|
| Or, folks that are quasi-spoofing 'AmazonBot' because they
| think it will have a better not-block rate than anonymous or
| other requests...
| thayne wrote:
| From the information in the post, it sounds like the last
| one to me. That is, someone else spoofing an Amazonbot user
| agent. But it could potentially be all three.
| 23B1 wrote:
| HN when it's a photographer, writer, or artist concerned about IP
| laundering: _" Fair use! Information wants to be free!"_
|
| HN when it's bots hammering some guy's server _" Hey this is
| wrong!"_
|
| A lot of you are unfamiliar with the tragedy of the commons. I
| have seen the paperclip maximizer - and he is you lot.
|
| https://en.wikipedia.org/wiki/Tragedy_of_the_commons
| navanchauhan wrote:
| I think there's a difference between crawling websites at a
| reasonable pace and just hammering the server to the point
| it's unusable.
|
| Nobody has problems with the Google Search indexer trying to
| crawl websites in a responsible way
| 23B1 wrote:
| For sure.
|
| I'm really just pointing out the inconsistent technocrat
| attitude towards labor, sovereignty, and resources.
| Analemma_ wrote:
| Most of those artists aren't any better though. I'm on
| a couple artists' forums and outlets like Tumblr, and I saw
| firsthand the immediate, total 180 re: IP protection when genAI
| showed up. Overnight, everybody went from "copying isn't theft,
| it leaves the original!" and other such mantras, to being die-
| hard IP maximalists. To say nothing of how they went from
| "anything can be art and it doesn't matter what tools you're
| using" to forming witch-hunt mobs against people suspected of
| using AI tooling. AI has made a hypocrite out of everybody.
| 23B1 wrote:
| Manga nerds on Tumblr aren't the artists I'm worried about.
| I'm talking about people whose intellectual labor is being
| laundered by gigacorps and the inane defenses mounted by
| their techbro serfdom.
| depingus wrote:
| This submission has nothing to do with IP laundering. The bot
| is straining their server and causing OP technical issues.
| xena wrote:
| Their*
| depingus wrote:
| Fixed.
| xena wrote:
| Thanks!
| 23B1 wrote:
| Commentary is often second- and third-order.
| depingus wrote:
| True, but it tends to flow there organically. This comment
| was off topic from the start.
| flir wrote:
| Personally I'm not trying to block the bots, I'm trying to
| avoid the bandwidth bill.
|
| I've recently blocked everything that isn't offering a user
| agent. If it had only pulled text I probably wouldn't have
| cared, but it was pulling images as well (bot designers, take
| note - you can have orders of magnitude less impact if you skip
| the images).
|
| For me personally, what's left isn't eating enough bandwidth
| for me to care, and I think any attempt to serve _some_ bots is
| doomed to failure.
|
| If I really, really hated chatbots (I don't), I'd look at
| approaches that poison the well.
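|
| For the "no user agent, no service" rule, a minimal WSGI
| middleware sketch for a Python-served site is below; in
| practice this is more often done in the web server config,
| and the exact policy here is an assumption.
|
|     def require_user_agent(app):
|         """Reject requests that send no User-Agent at all."""
|         def middleware(environ, start_response):
|             ua = environ.get("HTTP_USER_AGENT", "").strip()
|             if not ua:
|                 start_response("403 Forbidden",
|                                [("Content-Type", "text/plain")])
|                 return [b"Forbidden\n"]
|             return app(environ, start_response)
|         return middleware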
| evilfred wrote:
| HN isn't a monolith
| thayne wrote:
| Are you sure it isn't a DDoS masquerading as Amazon?
|
| Requests coming from residential ips is really suspicious.
|
| Edit: the motivation for such a DDoS might be targeting Amazon,
| by taking down smaller sites and making it look like amazon is
| responsible.
|
| If it is Amazon, one place to start is blocking all the IP
| ranges they publish. Although it sounds like there are requests
| outside those ranges...
| OptionOfT wrote:
| You should check out websites like grass dot io (I refuse to
| give them traffic).
|
| They pay you for your bandwidth while they resell it to 3rd
| parties, which is why a lot of bot traffic looks like it comes
| from residential IPs.
| Aurornis wrote:
| Yes, but the point is that big company crawlers aren't paying
| for questionably sourced residential proxies.
|
| If this person is seeing a lot of traffic from residential
| IPs then I would be shocked if it's really Amazon. I think
| someone else is doing something sketchy and they put
| "AmazonBot" in the user agent to make victims think it's
| Amazon.
|
| You can set the user agent string to anything you want, as we
| all know.
| skywhopper wrote:
| It's not residential proxies. It's Amazon using IPs they
| sublease from residential ISPs.
| voakbasda wrote:
| I wonder if anyone has checked whether Alexa devices serve
| as a private proxy network for AmazonBot's use.
| baobun wrote:
| > Yes, but the point is that big company crawlers aren't
| paying for questionably sourced residential proxies
|
| You'd be surprised...
| WarOnPrivacy wrote:
| >> Yes, but the point is that big company crawlers aren't
| paying for questionably sourced residential proxies
|
| > You'd be surprised...
|
| Surprised by what? What do you know?
| ninkendo wrote:
| They could be using echo devices to proxy their traffic...
|
| Although I'm not necessarily gonna make that accusation,
| because it would be pretty serious misconduct if it were
| true.
| ninkendo wrote:
| To add: it's also kinda silly on the surface of it for
| Amazon to use consumer devices to hide their crawling
| traffic, but still leave "Amazonbot" in their UA
| string... it's pretty safe to assume they're not doing
| this.
| dafelst wrote:
| I worked for Microsoft doing malware detection back 10+
| years ago, and questionably sourced proxies were well and
| truly on the table
| WarOnPrivacy wrote:
| >> but the point is that big company crawlers aren't
| paying for questionably sourced residential proxies.
|
| > I worked for Microsoft doing malware detection back 10+
| years ago, and questionably sourced proxies were well and
| truly on the table
|
| Big Company Crawlers using questionably sourced proxies -
| this seems striking. What can you share about it?
| to11mtm wrote:
| My guess is they probably can't, because some of the proxies
| were used by TLAs...
| SOLAR_FIELDS wrote:
| Wild. While I'm sure the service is technically legal since
| it can be used for non-nefarious purposes, signing up for a
| service like that seems like a guarantee that you are
| contributing to problematic behavior.
| scubbo wrote:
| I, too, hate this future. This[0] might be a satisfying way to
| fight back.
|
| [0] https://zadzmo.org/code/nepenthes/
| surfingdino wrote:
| Return "402 Payment Required" and block?
| xyzal wrote:
| No. Feed them shit. Code with deliberate security vulns and so
| on.
| serhack_ wrote:
| https://marcusb.org/hacks/quixotic.html
| byyll wrote:
| https://ip-ranges.amazonaws.com/ip-ranges.json ?
| xena wrote:
| I'd love it if Amazon could give me some AWS credit as a sign of
| good faith to make up for the egress overages their and other
| bots are causing, but the ads on this post are likely going to
| make up for it. Unblock ads and I come out even!
| Aurornis wrote:
| I don't think I'd assume this is actually Amazon. The author is
| seeing requests from rotating residential IPs and changing user
| agent strings
|
| > It's futile to block AI crawler bots because they lie, change
| their user agent, use residential IP addresses as proxies, and
| more.
|
| Impersonating crawlers from big companies is a common technique
| for people trying to blend in. The fact that requests are coming
| from residential IPs is a big red flag that something else is
| going on.
| paranoidrobot wrote:
| I wouldn't put it past any company these days doing crawling in
| an aggressive manner to use proxy networks.
| smileybarry wrote:
| With the amount of "if cloud IP then block" rules in place
| for many things (to weed out streaming VPNs and "potential"
| ddos-ing) I wouldn't doubt that at all.
| cmeacham98 wrote:
| I work for Amazon, but not directly on web crawling.
|
| Based on the internal information I have been able to gather,
| it is highly unlikely this is actually Amazon. Amazonbot is
| supposed to respect robots.txt and should always come from an
| Amazon-owned IP address (You can see verification steps here:
| https://developer.amazon.com/en/amazonbot).
|
| I've forwarded this internally just in case there is some crazy
| internal team I'm not aware of pulling this stunt, but I would
| strongly suggest the author treats this traffic as malicious
| and lying about its user agent.
| AyyEye wrote:
| > The author is seeing requests from rotating residential IPs
| and changing user agent strings
|
| This type of thing is commercially available as a service[1].
| Hundreds of Millions of networks backdoored and used as
| crawlers/scrapers because of an included library somewhere --
| and ostensibly legal because somewhere in some ToS they had
| some generic line that could plausibly be extended to using you
| as a patsy for quasi-legal activities.
|
| [1] https://brightdata.com/proxy-types/residential-proxies
| stainablesteel wrote:
| Crazy how what seemed like an excellent landmark case around
| web crawling turned around like this so quickly due to AI.
| LukeShu wrote:
| Before I configured Nginx to block them:
|
| - Bytespider (59%) and Amazonbot (21%) together accounted for 80%
| of the total traffic to our Git server.
|
| - ClaudeBot drove more traffic through our Redmine in a month
| than it saw in the combined _5 years_ prior to ClaudeBot.
| dbaio wrote:
| suffering with it as well. why can't they just `git clone` and do
| their stuff? =)
| rattlesnakedave wrote:
| No evidence provided that this is amazonbot or AI related.
| Someone is just upset that their website is getting traffic,
| which seems asinine.
| kazinator wrote:
| What is the proof that a hit from a residential IP address is
| actually Amazon? And if you have a way to tell, why not make use
| of it.
| trevor-e wrote:
| What are the actual rules/laws about scraping? I have a few
| projects I'd like to do that involve scraping but have always
| been conscious about respecting the host's servers, plus whether
| private content is copyrighted. But sounds like AI companies
| don't give a shit lol. If anyone has a good resource on the
| subject I'd be grateful!
| lazystar wrote:
| If you go to a police station and ask them to arrest Amazon for
| accessing your website too often, will they arrest Amazon, or
| laugh at you?
|
| While facetious in nature, my point is that people walking
| around in real brick and mortar locations simply do not care.
| If you want police to enforce laws, those are the kinds of
| people that need to care about your problem. Until that occurs,
| you'll have to work around the problem.
| armchairhacker wrote:
| I like the solution in this comment:
| https://news.ycombinator.com/item?id=42727510.
|
| Put a link somewhere in your site that no human would visit,
| disallow it in robots.txt (under a wildcard because apparently
| OpenAI's crawler specifically ignores wildcards), and when an IP
| address visits the link ban it for 24 hours.
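|
| A sketch of that trap (the path, log format, and nft set name
| are illustrative assumptions): tail the access log, and any IP
| requesting the disallowed path gets banned for 24 hours.
|
|     import re
|     import subprocess
|     import sys
|     import time
|
|     TRAP = "/do-not-crawl/"  # also under Disallow: in robots.txt
|     BAN_SECONDS = 24 * 3600
|     banned = {}              # ip -> expiry timestamp
|
|     line_re = re.compile(r'^(\S+) .* "(?:GET|HEAD) (\S+)')
|
|     def nft(action, ip):
|         subprocess.run(["nft", action, "element", "inet",
|                         "filter", "trap_ban", f"{{ {ip} }}"],
|                        check=False)
|
|     for line in sys.stdin:   # tail -F access.log | python3 trap.py
|         m = line_re.match(line)
|         if not m:
|             continue
|         ip, path = m.group(1), m.group(2)
|         now = time.time()
|         # Expire old bans so the table doesn't grow forever.
|         for old_ip in [i for i, exp in banned.items() if exp < now]:
|             del banned[old_ip]
|             nft("delete", old_ip)
|         if path.startswith(TRAP) and ip not in banned:
|             banned[ip] = now + BAN_SECONDS
|             nft("add", ip)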
| Szpadel wrote:
| I had to deal with some bot activity that used a huge address
| space, and I tried something very similar: when a condition
| confirming a bot was detected, I banned that IP for 24h.
|
| But due to the number of IPs involved, this did not have any
| real impact on the amount of traffic.
|
| My suggestion is to look very closely at the headers you
| receive (varnishlog is very nice for this), and if you stare
| long enough you might spot something that all those requests
| have in common that would allow you to easily identify them
| (like a very specific and unusual combination of reported
| language and geo location, or the same outdated browser
| version, etc.)
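|
| A rough sketch of that header-staring step: group requests by
| a tuple of header values and surface the suspiciously common
| combinations. The header choice and the one-JSON-object-per-
| request log source are assumptions.
|
|     import json
|     import sys
|     from collections import Counter
|
|     def fingerprint(entry):
|         return (
|             entry.get("user_agent", ""),
|             entry.get("accept_language", ""),
|             entry.get("accept", ""),
|             entry.get("country", ""),  # if your logger adds GeoIP
|         )
|
|     counts = Counter()
|     for line in sys.stdin:       # one JSON object per request
|         try:
|             counts[fingerprint(json.loads(line))] += 1
|         except json.JSONDecodeError:
|             continue
|
|     for fp, n in counts.most_common(20):
|         print(n, fp)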
| aaomidi wrote:
| Maybe ban ASNs /s
| koito17 wrote:
| This was indeed one mitigation used by a site to prevent
| bots hosted on AWS from uploading CSAM and generating bogus
| reports to the site's hosting provider.[1]
|
| In any case, I agree with the sarcasm. Blocking data center
| IPs may not help the OP, because some of the bots are
| resorting to residential IP addresses.
|
| [1] https://news.ycombinator.com/item?id=26865236
| conradev wrote:
| My favorite example of this was how folks fingerprinted the
| active probes of the Great Firewall of China. It has a large
| pool of IP addresses to work with (i.e. all ISPs in China),
| but the TCP timestamps were shared across a small number of
| probing machines:
|
| "The figure shows that although the probers use thousands of
| source IP addresses, they cannot be fully independent,
| because they share a small number of TCP timestamp sequences"
|
| https://censorbib.nymity.ch/pdf/Alice2020a.pdf
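|
| A toy version of that fingerprint: a host's TCP timestamps
| grow roughly as tsval ~= offset + hz * wall_time, so samples
| from the same machine share an offset no matter which source
| IP they arrive from. The clock rate and bucket size below are
| assumptions.
|
|     from collections import Counter
|
|     HZ = 1000        # assumed timestamp clock (often 100-1000)
|     BUCKET = 10_000  # coarseness of the offset clustering
|
|     def cluster(observations):
|         """observations: (wall_time_s, src_ip, tsval) tuples."""
|         buckets = Counter()
|         for wall_time, _ip, tsval in observations:
|             offset = tsval - HZ * wall_time
|             buckets[round(offset / BUCKET)] += 1
|         return buckets
|
|     # Thousands of source IPs collapsing into a handful of
|     # offset buckets suggests a few machines behind a big pool.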
| superjan wrote:
| Why work hard... Train a model to recognize the AI bots!
| js4ever wrote:
| Because you have to decide in less than 1ms, using AI is
| too slow in that context
| to11mtm wrote:
| Uggh, web crawlers...
|
| 8ish years ago, at the shop I worked at we had a server taken
| down. It was an image server for vehicles. How did it go down?
| Well, the crawler in question somehow had access to vehicle
| image links we had due to our business. Unfortunately, the
| perfect storm of the image not actually existing (can't
| remember why, mighta been one of those weird cases where we did
| a re-inspection without issuing new inspection ID) resulted in
| them essentially DOSing our condition report image server.
| Worse, there was a bug in the error handler somehow, such that
| the server process restarted when this condition happened. This
| had the -additional- disadvantage of invalidating our 'for .NET
| 2.0, pretty dang decent' caching implementation...
|
| It comes to mind because I'm pretty sure we started doing some
| canary techniques just to be safe. (Ironically, even some
| simple ones were still cheaper than adding a different web
| server... yes, we also fixed the caching issue... yes, we also
| added a way to 'scream' if we got too many bad requests on
| that service.)
| shakna wrote:
| When I was writing a crawler for my search engine (now
| offline), I found almost no crawler library actually compliant
| with the real world. So I ended up going to a lot of effort to
| write one that complied with Amazon and Google's rather
| complicated nested robots files, including respecting the cool
| off periods as requested.
|
| ... And then found their own crawlers can't parse their own
| manifests.
| bb010g wrote:
| Could you link the source of your crawler library?
| more_corn wrote:
| Cloudflare free plan has bot protection.
| deanc wrote:
| We have had the same problem at my client now for the last couple
| of months, but from Facebook (using their IP ranges). They don't
| even respect 429 responses, and the business is hesitant to
| outright ban them in case it impacts open graph or Facebook
| advertising tooling.
| Havoc wrote:
| He seems to have a mistake in his rule?
|
| He's got "(Amazon)" while Amazon lists their useragent as
| "(Amazonbot/0.1;"
| xena wrote:
| It's a regular expression.
| cyrnel wrote:
| The author's pronouns can be found here: https://github.com/Xe
| evantbyrne wrote:
| It seems like git self-hosters frequently encounter DDoS issues.
| I know it's not typical for free software, but I wonder if gating
| file contents behind a login and allowing registrations could be
| the answer for self-hosting repositories on the cheap.
| freetanga wrote:
| Probably dumb question, but any enlightenment would be welcome to
| help me learn:
|
| Could this be prevented by having a link that when followed would
| serve a dynamically generated page that does all of the
| following:
|
| A) inserts some fake content outlining the oligarchs' more
| lurid rumours, or whichever disinformation you choose to push
|
| B) embeds links to assets on the oligarchs' companies' sites
| so they get hit with some bandwidth
|
| C) dynamically creates new random pages that link back to
| themselves
|
| And thus creates an infinite loop, similar to a gzip bomb,
| which could potentially taint the model if done by enough
| people.
| to11mtm wrote:
| Not a crawler writer but have FAFOd with data structures in the
| past to large career success.
|
| ...
|
| The closest you could possibly do with any meaningful
| influence, is option C, with the general observations of:
|
| 1. You'd need to 'randomize' the generated output link
|
| 2. You'd also want to maximize cachability of the replayed
| content to minimize work.
|
| 3. Add layers of obfuscation on the frontend side, for
| instance a 'hidden' link (maybe with some prompt fuckery if
| you are brave) inside the HTML, with a random bad link on your
| normal pages.
|
| 4. Randomize parts of the honeypot link pattern. At some point
| someone monitoring logs/etc will see that it's a loop and
| blacklist the path.
|
| 5. Keep up at 4 and eventually they'll hopefully stop crawling.
|
| ---
|
| On the lighter side...
|
| 1. do some combination of above but have all honeypot links
| contain the right words that an LLM will just nope out of for
| regulatory reasons.
|
| That said, all the above will do is minimize pain (except,
| perhaps ironically, the joke response, which will more likely
| get you blacklisted, but could also get you on a list or earn
| you a TLA visit)...
|
| ... Most pragmatically, I'd start by suggesting the best option
| is a combination of nonlinear rate limiting, both on the ramp-
| up and the ramp-down. That is, the faster requests come in, the
| more you increment their 'valueToCheckAgainstLimit`. The longer
| it's been since last request, the more you decrement.
|
| Also pragmatically, if you can extend that to put together even
| semi-sloppy code to then scan when a request to a junk link
| that results in a ban immediately results in another IP trying
| to hit the same request... well ban that IP as soon as you see
| it, at least for a while.
|
| With the right sort of lookup table, IP Bans can be fairly
| simple to handle on a software level, although the 'first-time'
| elbow grease can be a challenge.
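|
| One way to read that nonlinear ramp-up/ramp-down idea is the
| sketch below: a per-client score that grows faster the closer
| together requests arrive and decays with idle time. The
| thresholds and curves are illustrative guesses.
|
|     import time
|
|     class NonlinearLimiter:
|         def __init__(self, limit=100.0):
|             self.limit = limit
|             self.state = {}  # ip -> (score, last_seen)
|
|         def allow(self, ip):
|             now = time.monotonic()
|             score, last_seen = self.state.get(ip, (0.0, now))
|             gap = max(now - last_seen, 1e-3)
|             # Ramp down: the longer the silence, the bigger
|             # the decay.
|             score = max(0.0, score - gap ** 1.5)
|             # Ramp up: the tighter the spacing, the bigger
|             # the increment.
|             score += min(10.0, 1.0 / gap)
|             self.state[ip] = (score, now)
|             return score < self.limit
|
| In practice you would also expire idle entries from the state
| table so it does not grow without bound.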
| vachina wrote:
| I'm surprised everyone else's servers are struggling to handle a
| couple of bot scrapes.
|
| I run a couple of public facing websites on a NUC and it just...
| chugs along? This is also amidst the constant barrage of OSINT
| attempts at my IP.
| cyrnel wrote:
| Seems some of these bots are behaving abusively on sites with
| lots of links (like git forges). I have some sites receiving
| 200 requests per day and some receiving 1 million requests per
| day from these AI bots, depending on the design of the site.
| xena wrote:
| Gitea in particular is a worst case for this. Gitea shows
| details about every file at every version and every commit if
| you click enough. The bots click every link. This fixed cost
| adds up when hundreds of IPs are each at different stages of
| clicking every link.
| TonyTrapp wrote:
| Depends on what you are hosting. I found that source code
| repository viewers in particular (OP mentions Gitea, but I have
| seen it with others as well) are really troublesome: Each and
| every commit that exists in your repository can potentially
| cause dozens if not hundreds of new unique pages to exist (diff
| against previous version, diff against current version, show
| file history, show file blame, etc...). Plus many of these repo
| viewers seem to take this information directly from the source
| repository without much caching involved. This is
| different from typical blogging or forum software, which is
| often designed to be able to handle really huge websites and
| thus have strong caching support. So far, nobody expected
| source code viewers to be so popular that performance could be
| an issue, but with AI scrapers this is quickly changing.
| serhack_ wrote:
| Indeed: https://marcusb.org/hacks/quixotic.html - don't block
| LLM bot traffic; instead start injecting spurious content to
| ""improve"" their data. Markov chains at their finest!
| gazchop wrote:
| Back to Gopher. They'll never get us there!
| ThinkBeat wrote:
| The best way to fight this would not be to block them; that
| clearly does not cost Amazon/others anything.
|
| What if instead it was possible to feed the bots clearly
| damaging and harmful content?
|
| If done on a larger scale, and Amazon discovered the poisoned
| pills, they would have to spend money rooting them out quickly
| and make attempts to stop their bots from ingesting them.
|
| Of course nobody wants to have that stuff on their own site
| though. That is the biggest problem with this.
| ADeerAppeared wrote:
| > What if instead it was possible to feed the bots clearly
| damaging and harmfull content?
|
| With all respect, you're completely misunderstanding the scope
| of AI companies' misbehaviour.
|
| These scrapers already gleefully chow down on CSAM and all
| other likewise horrible things. OpenAI had some of their Kenyan
| data-tagging subcontractors quit on them over this. (2023,
| Time)
|
| The current crop of AI firms do not care about data quality.
| Only quantity. The only thing you can do to harm them is to
| hand them 0 bytes.
|
| You would go directly to jail for things even a tenth as bad as
| Sam Altman has authorized.
| smeggysmeg wrote:
| I've seen this tarpit recommended for this purpose. it creates
| endless nests of directories and endless garbage content, as
| the site is being crawled. The bot can spend hours on it.
|
| https://zadzmo.org/code/nepenthes/
| ThinkBeat wrote:
| How many TB is your repo?
|
| Do they keep retrieving the same data from the same links over
| and over and over again, as if stuck in a forever loop that
| runs week after week?
|
| Or are they crawling your site in a hyper-aggressive way but
| getting more and more data, so it may take them, say, 2 days
| to crawl over it and then they go away?
| Animats wrote:
| It's time for a lawyer letter. See the Computer Fraud and Abuse
| Act prosecution guidelines.[1] In general, the US Justice
| Department will not consider any access to open servers that's
| not clearly an attack to be "unauthorized access". But,
|
| _" However, when authorizers later expressly revoke
| authorization--for example, through unambiguous written cease and
| desist communications that defendants receive and understand--the
| Department will consider defendants from that point onward not to
| be authorized."_
|
| So, you get a lawyer to write an "unambiguous cease and desist"
| letter. You have it delivered to Amazon by either registered mail
| or a process server, as recommended by the lawyer. Probably both,
| plus email.
|
| Then you wait and see if Amazon stops.
|
| If they don't stop, you can file a criminal complaint. That will
| get Amazon's attention.
|
| [1] https://www.justice.gov/jm/jm-9-48000-computer-fraud
| xena wrote:
| Honestly, I figure that being on the front page of Hacker News
| like this is more than shame enough to get a human from the
| common sense department to read and respond to the email I sent
| politely asking them to stop scraping my git server. If I don't
| get a response by next Tuesday, I'm getting a lawyer to write a
| formal cease and desist letter.
| DrBenCarson wrote:
| Lol you really think an ephemeral HN ranking will make
| change?
| xena wrote:
| There's only one way to find out!
| usefulcat wrote:
| It's not unheard of. But neither would I count on it.
| gazchop wrote:
| No one gives a fuck in this industry until someone turns up
| with bigger lawyers. This is behaviour which is written off
| with no ethical concerns as ok until _that_ bigger fish comes
| along.
|
| Really puts me off it.
| amarcheschi wrote:
| It's computer science; nothing changes on the corpo side until
| they get a lawyer letter.
|
| And even then, it's probably not going to be easy
| idlewords wrote:
| My site (Pinboard) is also getting hammered by what I presume are
| AI crawlers. It started out this summer with Chinese and
| Singapore IPs, but now I can't even block by IP range, and have
| to resort to captchas. The level of traffic is enough to
| immediately crash the site, and I don't even have any interesting
| text for them to train on, just links.
|
| I'm curious how OP figured out it's Amazon's crawler to blame. I
| would love to point the finger of blame.
| advael wrote:
| Unless we start chopping these tech companies down there's not
| much hope for the public internet. They now have an incentive to
| crawl anything they can and have vastly more resources than even
| most governments. Most resources I need to host in a way that's
| internet facing are behind keyauth and I'm not sure I see a way
| around doing that for at least a while
| dmwilcox wrote:
| I wonder if there is a good way to copy something out of fossil
| scm or externalize this component for more general use.
|
| https://fossil-scm.org/home/doc/trunk/www/antibot.wiki
|
| I ran into this weeks ago and was super impressed to solve a
| self-hosted captcha and login as "anonymous". I use cgit
| currently but have dabbled with fossil previously and if bots
| were a problem I'd absolutely consider this
| knowitnone wrote:
| Feed them false data. If fed by enough people (I'm looking at
| you, HN), their AI will be inaccurate to the point of being
| useless.
| Aloisius wrote:
| Using status code 418 (I'm a teapot), while cute, actually works
| against you since even well behaved bots don't know how to handle
| it and thus might not treat it as a permanent status, causing
| them to recrawl later.
|
| Plus you'll want to allow access to /robots.txt.
___________________________________________________________________
(page generated 2025-01-18 23:00 UTC)