[HN Gopher] Messing with scraper bots
___________________________________________________________________
Messing with scraper bots
Author : HermanMartinus
Score : 183 points
Date : 2025-11-15 07:38 UTC (15 hours ago)
(HTM) web link (herman.bearblog.dev)
(TXT) w3m dump (herman.bearblog.dev)
| ArcHound wrote:
| Neat! Most of the offensive scrapers I've met try to exploit
| WordPress sites (hence the focus on PHP). They don't want to see
| PHP files, but their outputs.
|
| What you have here is quite close to a honeypot; sadly, I don't
| see an easy way to counter-abuse such bots. If the attack isn't
| following their script, they move on.
| jojobas wrote:
| Yeah, I bet they run a regex on the output and if there's no
| admin logon thingie where they can run exploits or stuff
| credentials they'll just skip.
|
| As for battles of efficiency, generating 4 kB of bullshit PHP
| is harder than running a regex.
| NoiseBert69 wrote:
| Hm... why not use small, dumbed-down, self-hosted LLMs to feed
| the big scrapers with bullshit?
|
| I'd sacrifice two CPU cores for this just to make their life
| awful.
| qezz wrote:
| That's very expensive.
| Findecanor wrote:
| You don't need an LLM for that. There is a link in the article
| to an approach using Markov chains created from real-world
| books, but then you'd let the scrapers' LLMs reinforce their
| training on those books and not on random garbage.
|
| I would make a list of words from each word class, and a list
| of sentence structures where each item is a word class. Pick a
| pseudo-random sentence; for each word class in the sentence,
| pick a pseudo-random word; output; repeat. That should be
| pretty simple and fast.
|
| I'd think the most important thing though is to add delays to
| serving the requests. The purpose is to slow the scrapers down,
| not to induce demand on your garbage well.
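|
| A minimal sketch of that generator in Python (the word lists,
| templates, and per-URL seeding are invented for illustration):
|
|     import random
|
|     # Tiny stand-in word lists; a real deployment would want much larger ones.
|     NOUNS = ["server", "teapot", "archive", "pigeon", "ledger"]
|     VERBS = ["compiles", "forgets", "rotates", "audits", "devours"]
|     ADJS = ["purple", "recursive", "damp", "obsolete", "fragrant"]
|     POOLS = {"NOUN": NOUNS, "VERB": VERBS, "ADJ": ADJS}
|
|     # Each template is a sequence of word classes.
|     TEMPLATES = [
|         ("ADJ", "NOUN", "VERB", "ADJ", "NOUN"),
|         ("NOUN", "VERB", "NOUN"),
|     ]
|
|     def garbage_sentence(rng: random.Random) -> str:
|         template = rng.choice(TEMPLATES)
|         words = [rng.choice(POOLS[cls]) for cls in template]
|         return " ".join(words).capitalize() + "."
|
|     def garbage_page(path: str, sentences: int = 50) -> str:
|         # Seed from the request path so the same URL always serves the same text.
|         rng = random.Random(path)
|         return " ".join(garbage_sentence(rng) for _ in range(sentences))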
| jcynix wrote:
| If you control your own Apache server and just want to shortcut
| to "go away" instead of feeding scrapers, the RewriteEngine is
| your friend, for example:
|
|     RewriteEngine On
|     # Block requests that reference .php anywhere (path, query, or encoded)
|     RewriteCond %{REQUEST_URI}  (\.php|%2ephp|%2e%70%68%70) [NC,OR]
|     RewriteCond %{QUERY_STRING} \.php [NC,OR]
|     RewriteCond %{THE_REQUEST}  \.php [NC]
|     RewriteRule .* - [F,L]
|
| Notes: there's no PHP on my servers, so if someone asks for it,
| they are one of the "bad boys" IMHO. Your mileage may differ.
| palsecam wrote:
| I do something quite similar with nginx:
|
|     # Nothing to hack around here, I'm just a teapot:
|     location ~* \.(?:php|aspx?|jsp|dll|sql|bak)$ {
|         return 418;
|     }
|     error_page 418 /418.html;
|
| No hard block, instead reply to bots the funny HTTP 418 code
| (https://developer.mozilla.org/en-
| US/docs/Web/HTTP/Reference/...). That makes filtering logs
| easier.
|
| Live example: https://FreeSolitaire.win/wp-login.php (NB: /wp-
| login.php is the WordPress login URL, and it's commonly blindly
| requested by bots searching for weak WordPress installs.)
| kijin wrote:
| nginx also has "return 444", a special code that makes it
| drop the connection altogether. This is quite useful if you
| don't even want to waste any bandwidth serving an error page.
| You have an image on your error page, which some crappy bots
| will download over and over again.
| palsecam wrote:
| Yes @ 444 (https://http.cat/status/444). That's indeed the
| lightest-weight option.
|
| > You have an image on your error page, which some crappy
| bots will download over and over again.
|
| Most bots won't download subresources (almost none of them
| do, actually). The HTML page itself is lean (475 bytes);
| the image is an Easter egg for humans ;-) Moreover, I use a
| caching CDN (Cloudflare).
| MadnessASAP wrote:
| Does it also tell the kernel to drop the socket? Or is a
| TCP FIN packet still sent?
|
| It'd be better if the scraper were left waiting for a packet
| that'll never arrive (till it times out, obviously).
| jcynix wrote:
| 418? Nice, I'll think about it ;-) I would, in addition,
| prefer that "402 Payment Required" actually be put to use for
| scrapers ...
|
| https://developer.mozilla.org/en-
| US/docs/Web/HTTP/Reference/...
| localhostinger wrote:
| Interesting! It's nice to see people experimenting with these,
| and I wonder if this kind of junk-data generator will become
| its own product. Or maybe at least a feature/integration in
| existing software. I could see it going there.
| arbol wrote:
| They could be used by AI companies to sabotage each other's
| models.
| s0meON3 wrote:
| What about using zip bombs?
|
| https://idiallo.com/blog/zipbomb-protection
| lavela wrote:
| "Gzip only provides a compression ratio of a little over 1000:
| If I want a file that expands to 100 GB, I've got to serve a
| 100 MB asset. Worse, when I tried it, the bots just shrugged it
| off, with some even coming back for more."
|
| https://maurycyz.com/misc/the_cost_of_trash/#:~:text=throw%2...
| LunaSea wrote:
| You could try different compression methods supported by
| browsers like brotli.
|
| Otherwise you can also chain compression methods, like
| "Content-Encoding: gzip, gzip".
| renegat0x0 wrote:
| Even I, who does not know much, implemented a workaround.
|
| I have a web crawler with both a scraping byte limit and a
| timeout, so zip bombs don't bother me much.
|
| https://github.com/rumca-js/crawler-buddy
|
| I think garbage blabber would be more effective.
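|
| The same two safeguards are easy to bolt onto any Python-based
| crawler; a sketch using requests (the limits are illustrative,
| not taken from crawler-buddy):
|
|     import requests
|
|     MAX_BYTES = 1 * 1024 * 1024  # read at most 1 MB of (decompressed) body
|     TIMEOUT = (5, 15)            # connect / read timeouts in seconds
|
|     def bounded_fetch(url: str) -> bytes:
|         body = b""
|         # stream=True lets us stop reading as soon as the limit is hit,
|         # which defuses zip bombs and endless garbage generators alike.
|         with requests.get(url, stream=True, timeout=TIMEOUT) as resp:
|             resp.raise_for_status()
|             for chunk in resp.iter_content(chunk_size=64 * 1024):
|                 body += chunk
|                 if len(body) >= MAX_BYTES:
|                     break
|         return body[:MAX_BYTES]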
| iam-TJ wrote:
| This reminds me of a recent discussion about using a tarpit for
| A.I. and other scrapers. I've kept a tab alive with a reference
| to a neat tool and approach called Nepenthes that VERY SLOWLY
| drip feeds endless generated data into the connection. I've not
| had an opportunity to experiment with it as yet:
|
| https://zadzmo.org/code/nepenthes/
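|
| Nepenthes is the real thing; a toy illustration of the drip-feed
| idea in Python/Flask (endpoint, timing, and filler text are made
| up) might look like:
|
|     import time
|     from flask import Flask, Response
|
|     app = Flask(__name__)
|
|     def drip(delay: float = 2.0):
|         # Emit a tiny chunk of filler every few seconds, forever, so the
|         # client keeps the connection open as long as it is willing to wait.
|         yield "<html><body>\n"
|         while True:
|             time.sleep(delay)
|             yield "<p>Loading the rest of the archive, please wait...</p>\n"
|
|     @app.route("/archive/")
|     def tarpit():
|         return Response(drip(), mimetype="text/html")
|
| Note that a synchronous server ties up one worker per tarpitted
| connection, so an async server is a better fit at any real scale.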
| re-lre-l wrote:
| Don't get me wrong, but what's the problem with scrapers? People
| invest in SEO to become more visible, yet at the same time they
| fight against "scraper bots." I've always thought the whole point
| of publicly available information is to be visible. If you want
| to make money, just put it behind a paywall. Isn't that the idea?
| nrhrjrjrjtntbt wrote:
| The old scrapers indexed your site so you may get traffic. This
| benefits you.
|
| AI scrapers will plagiarise your work and bring you zero
| traffic.
| ProofHouse wrote:
| Ya make sure you hold dear that grain of sand on a beach of
| pre-training data that is used to slightly adjust some
| embedding weights
| boxedemp wrote:
| One Reddit post can get an LLM to recommend putting glue in
| your pizza. But the takeaway here is to cheese the bots.
| exe34 wrote:
| that grain of sand used to bring traffic, now it doesn't.
| it's pretty much an economic catastrophe for those who
| relied on it. and it's not free to provide the data to
| those who will replace you - they abuse your servers while
| doing it.
| jcynix wrote:
| Sand is the world's second most used natural resource, and
| sand usable for concrete even gets removed illegally all over
| the world nowadays.
|
| So to continue your analogy, I made my part of the beach
| accessible for visitors to enjoy, but certain people think
| they can carry it away for their own purpose ...
| throwawa14223 wrote:
| I have no reason to help the richest companies on earth
| adjust weights at a cost to myself.
| georgefrowny wrote:
| There's a difference between putting information online for your
| customers or even people in general (e.g. as a hobby), perhaps
| even working in concert with scraping for greater visibility via
| search, and handing that work over, free or at your own cost, to
| companies who at best don't care and may well be competition, see
| themselves as replacing you, or are otherwise adversarial.
|
| The line is between "I am technically able to do this" and "I am
| engaging with a system in good faith".
|
| Public parks are just there, and I can technically drive up and
| dump rubbish there; if they didn't want me to, they should have
| installed a gate and sold tickets.
|
| Many scrapers these days are sort of equivalent in that analogy
| to people starting entire fleets of waste disposal vehicles
| that all drive to parks to unload, putting strain on park
| operations and making the parks a less tenable service in
| general.
| akoboldfrying wrote:
| > The line is between "I am technically able to do this" and "I
| am engaging with a system in good faith".
|
| This is where the line should be, always. But in practice
| this criterion is applied _very_ selectively here on HN and
| elsewhere.
|
| After all: What is ad blocking, other than direct subversion
| of the site owner's clear intention to make money from the
| viewer's attention?
|
| Applying your criterion here gives a very simple conclusion:
| If you don't want to watch the ads, _don't visit the site_.
|
| Right?
| Dilettante_ wrote:
| Did you read TFA?
|
| These scrapers drown people's servers in requests, taking up
| literally all the resources and driving up costs.
| saltysalt wrote:
| You are correct, and the hard reality is that content producers
| don't get to pick and choose who gets to index their public
| content because the bad bots don't play by the rules of
| robots.txt or user-agent strings. In my experience, bad bots do
| everything they can to pass as regular users: fake IPs, fake
| user-agent strings... so it's hard to sort them out from
| regular traffic.
| aduwah wrote:
| I wonder if the abusive bots could somehow be made to mine some
| crypto to pay back the bills they cause.
| boxedemp wrote:
| You could try to get them to run JavaScript, but I'm sure many
| of them have countermeasures.
| Surac wrote:
| I have just cut out IP ranges so they cannot connect. I am
| blocking the USA, Asia and the Middle East to prevent most
| malicious accesses.
| breppp wrote:
| Blocking most of the world's population is one way of reducing
| malicious traffic
| gessha wrote:
| If nobody can connect to your site, it's perfectly secure.
| warkdarrior wrote:
| Make sure to block your own IP address to minimize the chance
| of a social engineering attack.
| bot403 wrote:
| Include 127.0.0.1 as well just in case they get into the
| server.
| simondotau wrote:
| The more things change, the more they stay the same.
|
| About 10-15 years ago, the scourge I was fighting was _social
| media monitoring_ services, companies paid by big brands to watch
| sentiment across forums and other online communities. I was
| running a very popular and completely free (and ad-free)
| discussion forum in my spare time, and their scraping was
| irritating for two reasons. First, they were monetising my
| community when I wasn't. Second, their crawlers would hit the
| servers as hard as they could, creating real load issues. I kept
| having to beg our hosting sponsor for more capacity.
|
| Once I figured out what was happening, I blocked their user
| agent. Within a week they were scraping with a generic one. I
| blocked their IP range; a week later they were back on a
| different range. So I built a filter that would pseudo-
| randomly[0] inject company names[1] into forum posts. Then any
| time I re-identified[2] their bot, I enabled that filter for
| their requests.
|
| The scraping stopped within two days and never came back.
|
| --
|
| [0] Random but deterministic based on post ID, so the injected
| text stayed consistent.
|
| [1] I collated a list of around 100 major consumer brands, plus
| every company name the monitoring services proudly listed as
| clients on their own websites.
|
| [2] This was back around 2009 or so, so things weren't nearly as
| sophisticated as they are today, both in terms of bots and anti-
| bot strategies. One of the most effective tools I remember
| deploying back then was analysis of all HTTP headers. Bots would
| spoof a browser UA, but almost none would get the full header set
| right; things like _Accept-Encoding_ or _Accept-Language_ were
| either absent, or static strings that didn't exactly match what
| the real browser would ever send.
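|
| For the curious, a rough reconstruction of that injection filter
| in Python (the brand list, ratio, and wording are invented; as
| described above, the real filter was only enabled for requests
| identified as the monitoring bot):
|
|     import hashlib
|
|     BRANDS = ["Acme Cola", "Initech", "Globex"]  # stand-ins for the real list
|
|     def inject_brands(post_id: int, text: str) -> str:
|         # Keyed on the post ID, so repeated scrapes of the same post always
|         # see the same injected "mention" and nothing looks suspicious.
|         digest = hashlib.sha256(str(post_id).encode()).digest()
|         if digest[0] % 4:  # only touch roughly one post in four
|             return text
|         brand = BRANDS[digest[1] % len(BRANDS)]
|         sentences = text.split(". ")
|         pos = digest[2] % len(sentences)
|         sentences.insert(pos, f"Honestly, {brand} has really gone downhill lately")
|         return ". ".join(sentences)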
| tesin wrote:
| The vast majority of bots are still failing the header test -
| we organically arrived at the exact same filtering in 2025.
| The bots followed the exact same progression too: one IP, lie
| about the user agent, one ASN, multiple ASNs, then lie about
| everything and use residential IPs, but still botch the headers.
| wvbdmp wrote:
| Why do the company names chase away bots? Is it just that
| you're destroying their signal because they're looking for
| mentions of those brands?
| akoboldfrying wrote:
| I also didn't follow that part. Their step 2 seems to be a
| general-purpose bot detection strategy that works
| independently of their step 1 ("randomly mention companies").
| SAI_Peregrinus wrote:
| It spams the bot with false-positives. Encourages the bot
| admins to denylist the site to protect the bot's
| signal:noise ratio.
| akoboldfrying wrote:
| That was my first thought too -- but then why would the
| bot company care about a few false positives?
|
| I suppose it could have an impact if 30% of all, say,
| Coca Cola mentions on the web came from that site, but
| then it would have to be a very big site. I don't think
| the bot company would _notice_, let alone care, if it
| was 0.01% of the mentions.
| simondotau wrote:
| Everyone's definition of "big" is different, but back
| then it was big enough to get its own little island in a
| far corner of XKCD 802.
|
| https://xkcd.com/802/
| simondotau wrote:
| It's both a destruction of signal and an injection of noise.
| Imagine you worked for Adidas and you started getting a
| stream of notifications about your brand, and they were all
| nonsense. This would be an annoyance and harm the reputation
| of that monitoring service.
|
| They would have received multiple complaints about it from
| customers, performed an investigation, and ultimately performed
| a manual excision of the junk data from their system: both
| the raw scrapes and anywhere it was ingested and processed.
| This was probably a simple operation, but might not have been
| if their architecture didn't account for this vulnerability.
| grishka wrote:
| Thank you very much for the observation about headers. I just
| looked closer at the bot traffic I'm currently receiving on my
| small fediverse server and noticed that it uses user agents of
| old Chrome versions, _but also_ that the Accept-Language header
| is never set, which is indeed something that no real Chromium
| browser would do. So I added a rule to my nginx config to
| return a 403 to these requests. The number of these per second
| seems to have started declining.
| AJMaxwell wrote:
| That's a simple and effective way to block a lot of bots,
| gonna implement that on my sites. Thanks!
| grishka wrote:
| It's been a few hours. These particular bots have completely
| stopped. There are still _some_ bot-looking requests in the
| log, with a newer-version Chrome UA on both Mac and Windows,
| but there aren't nearly as many of them.
|
| Config snippet for anyone interested:
|
|     if ($http_user_agent ~* "Chrome/\d{2,3}\.\d+\.\d{2,}\.\d{2,}") {
|         set $block 1;
|     }
|     if ($http_accept_language = "") {
|         set $block "${block}1";
|     }
|     if ($block = "11") {
|         return 403;
|     }
| thephyber wrote:
| In the movie The Imitation Game, the Alan Turing character
| recognizes that acting on the intelligence 100% of the time
| gives away to the opposition that you've broken their code and
| sets off the next
| iteration of "cat and mouse". He comes up with a specific
| percentage of the time that the Allies should sit on the
| intelligence and not warn their own people.
|
| If, instead, you only act on a percentage of requests, you can
| add noise in an insidious way without signaling that you caught
| them. It will make their job troubleshooting and crafting the
| next iteration much harder. Also, making the response less
| predictable is a good idea - throw different HTTP error codes,
| respond with somewhat inaccurate content, etc.
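|
| A sketch of that idea (the detection hook, ratios, and response
| mix are all arbitrary placeholders):
|
|     import random
|     from collections import Counter
|
|     ACT_RATIO = 0.3  # act on only ~30% of detected bot requests
|
|     def plan_response(looks_like_bot: bool) -> str:
|         if not looks_like_bot or random.random() > ACT_RATIO:
|             return "serve normally"  # most requests pass through untouched
|         roll = random.random()
|         if roll < 0.4:
|             return f"error {random.choice([403, 429, 500, 502])}"
|         if roll < 0.8:
|             return "serve plausible garbage"
|         return "tarpit (slow drip response)"
|
|     # Rough distribution over 10,000 detected-bot requests:
|     print(Counter(plan_response(True) for _ in range(10_000)))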
| Kiro wrote:
| I remember when you used to get scolded on HN for preventing
| scrapers or bots. "How I access your site is irrelevant".
| hollow-moe wrote:
| There's this and there's that: "How I [i.e. an individual human
| looking for myself] access your site is irrelevant" versus "How
| I [i.e. an AI company DDoSing (which is illegal in some places,
| btw) to maximize profit while offloading the cost to you] access
| your site is irrelevant."
|
| When you get paid big bucks to make the world worse for everyone,
| it's really easy to forget such "little details".
| elashri wrote:
| I have a side project as an academic that scrapes a couple of
| academic job sites in my field and then serves them as a static
| HTML page. It runs as a GitHub Action and makes exactly one
| request every 24 hours. It is useful for me and a couple of
| people in my circle. I would consider this to be fine and
| within reasonable expectations. Many projects rely on such
| scenarios, and people share them all the time.
|
| It is completely different if I am hitting it looking for
| WordPress vulnerabilities or scraping content every minute for
| LLM training material.
| Analemma_ wrote:
| To me that's one of the most depressing developments about
| AI (which is chock-full of depressing developments): that its
| mere existence is eroding long-held ethics, not even
| necessarily out of a lack of commitment but out of practical
| necessity.
|
| The tech people are all turning against scraping, independent
| artists are now clamoring for brutal IP crackdowns and Disney-
| style copyright maximalism (which I _never_ would've predicted
| just 5 years ago, that crowd used to be staunchly against such
| things), people everywhere want more attestation and
| elimination of anonymity now that it's effectively free to make
| a swarm of convincingly-human misinformation agents, etc.
|
| It's making people worse.
| grishka wrote:
| It's different. I'm fine with someone scraping my website as a
| good citizen, by identifying themselves in their user-agent
| string and preferably respecting robots.txt. I'm _not_,
| however, fine with tens of requests per second to every
| possible URL from random IPs I'm receiving right now, all
| pretending to be different old versions of Chrome.
| VladVladikoff wrote:
| This is a fundamental misunderstanding of what those bots are
| requesting. They aren't parsing those PHP files; they are using
| their existence for fingerprinting -- they are trying to
| determine the presence of known vulnerabilities. They probably
| stop reading immediately after receiving an HTTP response code
| and discard the remainder of the response.
| mattgreenrocks wrote:
| It would be such a terrible thing if some LLM scrapers were
| using those responses to learn more about PHP, especially
| because of that recent paper pointing out it doesn't take that
| many data points to poison LLMs.
| holysoles wrote:
| You're right, something like fail2ban or crowdsec would
| probably be more effective here. Crowdsec has made it apparent
| to me how much vulnerability probing is done; it's a bit
| shocking for a low-traffic host.
| ajsnigrutin wrote:
| And you'd ban the IP, their one-day lease on the VM+IP would
| expire, and someone else would get the same IP on a new VM and
| be blocked from everywhere.
|
| It would make sense to ban the IP for a few hours, to have the
| bot cool down for a bit and move on to the next domain.
| holysoles wrote:
| I was referring to the rules/patterns provided by crowdsec
| rather than the distribution of known "bad" IPs through
| their Central API.
|
| The default ban for traffic detected by your crowdsec
| instance is 4 hours, so that concern isn't very relevant in
| that case.
|
| The decisions from the Central API from other users can be
| quite a bit longer (I see some at ~6 days), but you also
| don't have to use those if you're worried about that
| scenario.
| vachina wrote:
| They're not scraping for php files, they're probing for known
| vulns in popular frameworks, and then using them as entry points
| for pwning.
|
| This is done very efficiently. If you return anything unexpected,
| they'll just drop you and move on.
| BigBalli wrote:
| I always had fail2ban but a while back I wanted to set up
| something juicier...
|
| .htaccess diverts suspicious paths (e.g., /.git, /wp-login) to
| decoy.php and forces decoy.zip downloads (10GB), so scanners
| hitting common "secret" files never touch real content and get
| stuck downloading a huge dummy archive.
|
| decoy.php mimics whatever sensitive file was requested by endless
| streaming of fake config/log/SQL data, keeping bots busy while
| revealing nothing.
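|
| decoy.php itself isn't shown above; a rough Python equivalent of
| the endless fake-data stream (field names and pacing invented)
| could be:
|
|     import random
|     import time
|
|     FAKE_KEYS = ["DB_HOST", "DB_USER", "DB_PASSWORD", "API_TOKEN", "SECRET_KEY"]
|
|     def fake_config_lines():
|         # Endless stream of plausible-looking config lines; the sleep keeps
|         # the scanner connected while costing us almost nothing.
|         while True:
|             key = random.choice(FAKE_KEYS)
|             yield f"{key}={random.getrandbits(128):032x}\n"
|             time.sleep(1.0)
|
|     # In a real handler this generator would back a streamed HTTP response;
|     # printing a few lines here just shows the shape of the output.
|     gen = fake_config_lines()
|     for _ in range(3):
|         print(next(gen), end="")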
| holysoles wrote:
| I wrote a Traefik plugin [1] that controls traffic based on known
| bad bot user agents; you can just block them or even send them
| to a Markov babbler if you've set one up. I've been using
| Nepenthes [2].
|
| [1] https://github.com/holysoles/bot-wrangler-traefik-plugin
|
| [2] https://zadzmo.org/code/nepenthes/
| firefoxd wrote:
| I had to revisit my strategy after posting about my zipbombs on
| HN [0]. My server traffic went from tens of thousands to ~100k
| daily, hosted on a $6 VPS. It was not sustainable.
|
| Now I target only the most aggressive bots with zipbombs and the
| rest get a 403. My new spam strategy seems to work, but I don't
| know if I should post it on HN again...
|
| [0]: https://news.ycombinator.com/item?id=43826798
| ronsor wrote:
| These aren't scraper bots; they're vulnerability scanners. They
| don't expect PHP source code and probably don't even read the
| response body at all.
|
| I don't know why people would assume these are AI/LLM scrapers
| seeking PHP source code on random servers(!) short of it being
| related to this brainless "AI is stealing all the data" nonsense
| that has infected the minds of many people here.
___________________________________________________________________
(page generated 2025-11-15 23:00 UTC)