[HN Gopher] Messing with scraper bots
___________________________________________________________________
Messing with scraper bots
Author : HermanMartinus
Score : 183 points
Date : 2025-11-15 07:38 UTC (15 hours ago)
(HTM) web link (herman.bearblog.dev)
(TXT) w3m dump (herman.bearblog.dev)
| ArcHound wrote:
| Neat! Most of the offensive scrapers I've met try to exploit
| WordPress sites (hence the focus on PHP). They don't want to see
| PHP files, but their outputs.
|
| What you have here is quite close to a honeypot; sadly, I don't
| see an easy way to counter-abuse such bots. If the attack isn't
| following their script, they move on.
| jojobas wrote:
| Yeah, I bet they run a regex on the output and if there's no
| admin logon thingie where they can run exploits or stuff
| credentials they'll just skip.
|
| As for battles of efficiency, generating 4 kB of bullshit PHP
| is harder than running a regex.
| NoiseBert69 wrote:
| Hm... why not use small, dumbed-down, self-hosted LLMs to feed
| the big scrapers with bullshit?
|
| I'd sacrifice two CPU cores for this just to make their life
| awful.
| qezz wrote:
| That's very expensive.
| Findecanor wrote:
| You don't need an LLM for that. There is a link in the article
| to an approach using Markov chains created from real-world
| books, but then you'd let the scrapers' LLMs reinforce their
| training on those books and not on random garbage.
|
| I would make a list of words from each word class, and a list
| of sentence structures where each item is a word class. Pick a
| pseudo-random sentence; for each word class in the sentence,
| pick a pseudo-random word; output; repeat. That should be
| pretty simple and fast.
|
| I'd think the most important thing though is to add delays to
| serving the requests. The purpose is to slow the scrapers down,
| not to induce demand on your garbage well.
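|
| A minimal sketch of that generator in Python (the word lists,
| templates, and per-URL seeding are invented for illustration):
|
|     import random
|
|     # Tiny stand-in word lists; a real deployment would want much larger ones.
|     NOUNS = ["server", "teapot", "archive", "pigeon", "ledger"]
|     VERBS = ["compiles", "forgets", "rotates", "audits", "devours"]
|     ADJS = ["purple", "recursive", "damp", "obsolete", "fragrant"]
|     POOLS = {"NOUN": NOUNS, "VERB": VERBS, "ADJ": ADJS}
|
|     # Each template is a sequence of word classes.
|     TEMPLATES = [
|         ("ADJ", "NOUN", "VERB", "ADJ", "NOUN"),
|         ("NOUN", "VERB", "NOUN"),
|     ]
|
|     def garbage_sentence(rng: random.Random) -> str:
|         template = rng.choice(TEMPLATES)
|         words = [rng.choice(POOLS[cls]) for cls in template]
|         return " ".join(words).capitalize() + "."
|
|     def garbage_page(path: str, sentences: int = 50) -> str:
|         # Seed from the request path so the same URL always serves the same text.
|         rng = random.Random(path)
|         return " ".join(garbage_sentence(rng) for _ in range(sentences))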
| jcynix wrote:
| If you control your own Apache server and just want to shortcut
| to "go away" instead of feeding scrapers, the RewriteEngine is
| your friend, for example:
|
|     RewriteEngine On
|     # Block requests that reference .php anywhere (path, query, or encoded)
|     RewriteCond %{REQUEST_URI}  (\.php|%2ephp|%2e%70%68%70) [NC,OR]
|     RewriteCond %{QUERY_STRING} \.php [NC,OR]
|     RewriteCond %{THE_REQUEST}  \.php [NC]
|     RewriteRule .* - [F,L]
|
| Notes: there's no PHP on my servers, so if someone asks for it,
| they are one of the "bad boys" IMHO. Your mileage may differ.
| palsecam wrote:
| I do something quite similar with nginx:
|
|     # Nothing to hack around here, I'm just a teapot:
|     location ~* \.(?:php|aspx?|jsp|dll|sql|bak)$ {
|         return 418;
|     }
|     error_page 418 /418.html;
|
| No hard block, instead reply to bots the funny HTTP 418 code
| (https://developer.mozilla.org/en-
| US/docs/Web/HTTP/Reference/...). That makes filtering logs
| easier.
|
| Live example: https://FreeSolitaire.win/wp-login.php (NB: /wp-
| login.php is the WordPress login URL, and it's commonly blindly
| requested by bots searching for weak WordPress installs.)
| kijin wrote:
| nginx also has "return 444", a special code that makes it
| drop the connection altogether. This is quite useful if you
| don't even want to waste any bandwidth serving an error page.
| You have an image on your error page, which some crappy bots
| will download over and over again.
| palsecam wrote:
| Yes @ 444 (https://http.cat/status/444). That's indeed the
| lightest-weight option.
|
| > You have an image on your error page, which some crappy
| bots will download over and over again.
|
| Most bots won't download subresources (almost none of them
| do, actually). The HTML page itself is lean (475 bytes);
| the image is an Easter egg for humans ;-) Moreover, I use a
| caching CDN (Cloudflare).
| MadnessASAP wrote:
| Does it also tell the kernel to drop the socket? Or is a
| TCP FIN packet still sent?
|
| It'd be better if the scraper were left waiting for a packet
| that'll never arrive (till it times out, obviously).
| jcynix wrote:
| 418? Nice, I'll think about it ;-) I would, in addition,
| prefer that "402 Payment Required" actually be put to use for
| scrapers ...
|
| https://developer.mozilla.org/en-
| US/docs/Web/HTTP/Reference/...
| localhostinger wrote:
| Interesting! It's nice to see people experimenting with these,
| and I wonder if this kind of junk-data generator will become
| its own product. Or maybe at least a feature/integration in
| existing software. I could see it going there.
| arbol wrote:
| They could be used by AI companies to sabotage each other's
| models.
| s0meON3 wrote:
| What about using zip bombs?
|
| https://idiallo.com/blog/zipbomb-protection
| lavela wrote:
| "Gzip only provides a compression ratio of a little over 1000:
| If I want a file that expands to 100 GB, I've got to serve a
| 100 MB asset. Worse, when I tried it, the bots just shrugged it
| off, with some even coming back for more."
|
| https://maurycyz.com/misc/the_cost_of_trash/#:~:text=throw%2...
| LunaSea wrote:
| You could try different compression methods supported by
| browsers like brotli.
|
| Otherwise you can also chain compression methods, like
| "Content-Encoding: gzip, gzip".
| renegat0x0 wrote:
| Even I, who does not know much, implemented a workaround.
|
| I have a web crawler with both a scraping byte limit and a
| timeout, so zip bombs don't bother me much.
|
| https://github.com/rumca-js/crawler-buddy
|
| I think garbage blabber would be more effective.
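|
| The same two safeguards are easy to bolt onto any Python-based
| crawler; a sketch using requests (the limits are illustrative,
| not taken from crawler-buddy):
|
|     import requests
|
|     MAX_BYTES = 1 * 1024 * 1024  # read at most 1 MB of (decompressed) body
|     TIMEOUT = (5, 15)            # connect / read timeouts in seconds
|
|     def bounded_fetch(url: str) -> bytes:
|         body = b""
|         # stream=True lets us stop reading as soon as the limit is hit,
|         # which defuses zip bombs and endless garbage generators alike.
|         with requests.get(url, stream=True, timeout=TIMEOUT) as resp:
|             resp.raise_for_status()
|             for chunk in resp.iter_content(chunk_size=64 * 1024):
|                 body += chunk
|                 if len(body) >= MAX_BYTES:
|                     break
|         return body[:MAX_BYTES]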
| iam-TJ wrote:
| This reminds me of a recent discussion about using a tarpit for
| A.I. and other scrapers. I've kept a tab alive with a reference
| to a neat tool and approach called Nepenthes that VERY SLOWLY
| drip feeds endless generated data into the connection. I've not
| had an opportunity to experiment with it as yet:
|
| https://zadzmo.org/code/nepenthes/
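|
| Nepenthes is the real thing; a toy illustration of the drip-feed
| idea in Python/Flask (endpoint, timing, and filler text are made
| up) might look like:
|
|     import time
|     from flask import Flask, Response
|
|     app = Flask(__name__)
|
|     def drip(delay: float = 2.0):
|         # Emit a tiny chunk of filler every few seconds, forever, so the
|         # client keeps the connection open as long as it is willing to wait.
|         yield "<html><body>\n"
|         while True:
|             time.sleep(delay)
|             yield "<p>Loading the rest of the archive, please wait...</p>\n"
|
|     @app.route("/archive/")
|     def tarpit():
|         return Response(drip(), mimetype="text/html")
|
| Note that a synchronous server ties up one worker per tarpitted
| connection, so an async server is a better fit at any real scale.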
| re-lre-l wrote:
| Don't get me wrong, but what's the problem with scrapers? People
| invest in SEO to become more visible, yet at the same time they
| fight against "scraper bots." I've always thought the whole point
| of publicly available information is to be visible. If you want
| to make money, just put it behind a paywall. Isn't that the idea?
| nrhrjrjrjtntbt wrote:
| The old scrapers indexed your site so you may get traffic. This
| benefits you.
|
| AI scrapers will plagiarise your work and bring you zero
| traffic.
| ProofHouse wrote:
| Ya make sure you hold dear that grain of sand on a beach of
| pre-training data that is used to slightly adjust some
| embedding weights
| boxedemp wrote:
| One Reddit post can get an LLM to recommend putting glue in
| your pizza. But the takeaway here is to cheese the bots.
| exe34 wrote:
| that grain of sand used to bring traffic, now it doesn't.
| it's pretty much an economic catastrophe for those who
| relied on it. and it's not free to provide the data to
| those who will replace you - they abuse your servers while
| doing it.
| jcynix wrote:
| Sand is the world's second most used natural resource, and
| sand usable for concrete even gets removed illegally all over
| the world nowadays.
|
| So to continue your analogy, I made my part of the beach
| accessible for visitors to enjoy, but certain people think
| they can carry it away for their own purpose ...
| throwawa14223 wrote:
| I have no reason to help the richest companies on earth
| adjust weights at a cost to myself.
| georgefrowny wrote:
| There's a difference between putting information online for your
| customers or even people in general (e.g. as a hobby), perhaps
| even working in concert with scraping for greater visibility via
| search, and handing that work over, free or at your own cost, to
| companies who at best don't care and may well be competition, see
| themselves as replacing you, or are otherwise adversarial.
|
| The line is between "I am technically able to do this" and "I am
| engaging with a system in good faith".
|
| Public parks are just there, and I can technically drive up and
| dump rubbish there; if they didn't want me to, they should have
| installed a gate and sold tickets.
|
| Many scrapers these days are sort of equivalent in that analogy
| to people starting entire fleets of waste disposal vehicles
| that all drive to parks to unload, putting strain on park
| operations and making the parks a less tenable service in
| general.
| akoboldfrying wrote:
| > The line is between "I am technically able to do this" and "I
| am engaging with a system in good faith".
|
| This is where the line should be, always. But in practice
| this criterion is applied _very_ selectively here on HN and
| elsewhere.
|
| After all: What is ad blocking, other than direct subversion
| of the site owner's clear intention to make money from the
| viewer's attention?
|
| Applying your criterion here gives a very simple conclusion:
| If you don't want to watch the ads, _don't visit the site_.
|
| Right?
| Dilettante_ wrote:
| Did you read TFA?
|
| These scrapers drown people's servers in requests, taking up
| literally all the resources and driving up costs.
| saltysalt wrote:
| You are correct, and the hard reality is that content producers
| don't get to pick and choose who gets to index their public
| content because the bad bots don't play by the rules of
| robots.txt or user-agent strings. In my experience, bad bots do
| everything they can to pass as regular users: fake IPs, fake
| user-agent strings... so it's hard to sort them out from
| regular traffic.
| aduwah wrote:
| I wonder if the abusive bots could somehow be made to mine some
| crypto to pay back the bills they cause.
| boxedemp wrote:
| You could try to get them to run JavaScript, but I'm sure many
| of them have countermeasures.
| Surac wrote:
| I have just cut out IP ranges so they cannot connect. I am
| blocking the USA, Asia and the Middle East to prevent most
| malicious accesses.
| breppp wrote:
| Blocking most of the world's population is one way of reducing
| malicious traffic
| gessha wrote:
| If nobody can connect to your site, it's perfectly secure.
| warkdarrior wrote:
| Make sure to block your own IP address to minimize the chance
| of a social engineering attack.
| bot403 wrote:
| Include 127.0.0.1 as well just in case they get into the
| server.
| simondotau wrote:
| The more things change, the more they stay the same.
|
| About 10-15 years ago, the scourge I was fighting was _social
| media monitoring_ services, companies paid by big brands to watch
| sentiment across forums and other online communities. I was
| running a very popular and completely free (and ad-free)
| discussion forum in my spare time, and their scraping was
| irritating for two reasons. First, they were monetising my
| community when I wasn't. Second, their crawlers would hit the
| servers as hard as they could, creating real load issues. I kept
| having to beg our hosting sponsor for more capacity.
|
| Once I figured out what was happening, I blocked their user
| agent. Within a week they were scraping with a generic one. I
| blocked their IP range; a week later they were back on a
| different range. So I built a filter that would pseudo-
| randomly[0] inject company names[1] into forum posts. Then any
| time I re-identified[2] their bot, I enabled that filter for
| their requests.
|
| The scraping stopped within two days and never came back.
|
| --
|
| [0] Random but deterministic based on post ID, so the injected
| text stayed consistent.
|
| [1] I collated a list of around 100 major consumer brands, plus
| every company name the monitoring services proudly listed as
| clients on their own websites.
|
| [2] This was back around 2009 or so, so things weren't nearly as
| sophisticated as they are today, both in terms of bots and anti-
| bot strategies. One of the most effective tools I remember
| deploying back then was analysis of all HTTP headers. Bots would
| spoof a browser UA, but almost none would get the full header set
| right; things like _Accept-Encoding_ or _Accept-Language_ were
| either absent, or static strings that didn't exactly match what
| the real browser would ever send.
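|
| For the curious, a rough reconstruction of that injection filter
| in Python (the brand list, ratio, and wording are invented; as
| described above, the real filter was only enabled for requests
| identified as the monitoring bot):
|
|     import hashlib
|
|     BRANDS = ["Acme Cola", "Initech", "Globex"]  # stand-ins for the real list
|
|     def inject_brands(post_id: int, text: str) -> str:
|         # Keyed on the post ID, so repeated scrapes of the same post always
|         # see the same injected "mention" and nothing looks suspicious.
|         digest = hashlib.sha256(str(post_id).encode()).digest()
|         if digest[0] % 4:  # only touch roughly one post in four
|             return text
|         brand = BRANDS[digest[1] % len(BRANDS)]
|         sentences = text.split(". ")
|         pos = digest[2] % len(sentences)
|         sentences.insert(pos, f"Honestly, {brand} has really gone downhill lately")
|         return ". ".join(sentences)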
| tesin wrote:
| The vast majority of bots are still failing the header test -
| we organically arrived at the exact same filtering in 2025.
| The bots followed the exact same progression too: one IP, lie
| about the user agent, one ASN, multiple ASNs, then lie about
| everything and use residential IPs, but still botch the headers.
| wvbdmp wrote:
| Why do the company names chase away bots? Is it just that
| you're destroying their signal because they're looking for
| mentions of those brands?
| akoboldfrying wrote:
| I also didn't follow that part. Their step 2 seems to be a
| general-purpose bot detection strategy that works
| independently of their step 1 ("randomly mention companies").
| SAI_Peregrinus wrote:
| It spams the bot with false-positives. Encourages the bot
| admins to denylist the site to protect the bot's
| signal:noise ratio.
| akoboldfrying wrote:
| That was my first thought too -- but then why would the
| bot company care about a few false positives?
|
| I suppose it could have an impact if 30% of all, say,
| Coca Cola mentions on the web came from that site, but
| then it would have to be a very big site. I don't think
| the bot company would _notice_, let alone care, if it
| was 0.01% of the mentions.
| simondotau wrote:
| Everyone's definition of "big" is different, but back
| then it was big enough to get its own little island in a
| far corner of XKCD 802.
|
| https://xkcd.com/802/
| simondotau wrote:
| It's both a destruction of signal and an injection of noise.
| Imagine you worked for Adidas and you started getting a
| stream of notifications about your brand, and they were all
| nonsense. This would be an annoyance and harm the reputation
| of that monitoring service.
|
| They would have received multiple complaints about it from
| customers, performed an investigation, and ultimately performed
| a manual excision of the junk data from their system: both
| the raw scrapes and anywhere it was ingested and processed.
| This was probably a simple operation, but might not have been
| if their architecture didn't account for this vulnerability.
| grishka wrote:
| Thank you very much for the observation about headers. I just
| looked closer at the bot traffic I'm currently receiving on my
| small fediverse server and noticed that it uses user agents of
| old Chrome versions, _but also_ that the Accept-Language header
| is never set, which is indeed something that no real Chromium
| browser would do. So I added a rule to my nginx config to
| return a 403 to these requests. The number of these per second
| seems to have started declining.
| AJMaxwell wrote:
| That's a simple and effective way to block a lot of bots,
| gonna implement that on my sites. Thanks!
| grishka wrote:
| It's been a few hours. These particular bots have completely
| stopped. There are still _some_ bot-looking requests in the
| log, with a newer-version Chrome UA on both Mac and Windows,
| but there aren't nearly as many of them.
|
| Config snippet for anyone interested:
|
|     if ($http_user_agent ~* "Chrome/\d{2,3}\.\d+\.\d{2,}\.\d{2,}") {
|         set $block 1;
|     }
|     if ($http_accept_language = "") {
|         set $block "${block}1";
|     }
|     if ($block = "11") {
|         return 403;
|     }
| thephyber wrote:
| In the movie The Imitation Game, the Alan Turing character
| recognizes that acting on the intelligence 100% of the time
| gives away to the opposition that you've broken their code and
| sets off the next
| iteration of "cat and mouse". He comes up with a specific
| percentage of the time that the Allies should sit on the
| intelligence and not warn their own people.
|
| If, instead, you only act on a percentage of requests, you can
| add noise in an insidious way without signaling that you caught
| them. It will make their job troubleshooting and crafting the
| next iteration much harder. Also, making the response less
| predictable is a good idea - throw different HTTP error codes,
| respond with somewhat inaccurate content, etc.
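|
| A sketch of that idea (the detection hook, ratios, and response
| mix are all arbitrary placeholders):
|
|     import random
|     from collections import Counter
|
|     ACT_RATIO = 0.3  # act on only ~30% of detected bot requests
|
|     def plan_response(looks_like_bot: bool) -> str:
|         if not looks_like_bot or random.random() > ACT_RATIO:
|             return "serve normally"  # most requests pass through untouched
|         roll = random.random()
|         if roll < 0.4:
|             return f"error {random.choice([403, 429, 500, 502])}"
|         if roll < 0.8:
|             return "serve plausible garbage"
|         return "tarpit (slow drip response)"
|
|     # Rough distribution over 10,000 detected-bot requests:
|     print(Counter(plan_response(True) for _ in range(10_000)))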
| Kiro wrote:
| I remember when you used to get scolded on HN for preventing
| scrapers or bots. "How I access your site is irrelevant".
| hollow-moe wrote:
| There's this and there's that: "How I [i.e. an individual human
| looking for myself] access your site is irrelevant" versus "How
| I [i.e. an AI company DDoSing (which is illegal in some places,
| btw) to maximize profit while offloading the cost to you] access
| your site is irrelevant."
|
| When you get paid big bucks to make the world worse for everyone,
| it's really easy to forget such "little details".
| elashri wrote:
| I have a side project as an academic that scrapes a couple of
| academic job sites in my field and then serves them as a static
| HTML page. It runs as a GitHub Action and makes exactly one
| request every 24 hours. It is useful for me and a couple of
| people in my circle. I would consider this to be fine and
| within reasonable expectations. Many projects rely on such
| scenarios, and people share them all the time.
|
| It is completely different if I am hitting it looking for
| WordPress vulnerabilities or scraping content every minute for
| LLM training material.
| Analemma_ wrote:
| To me that's one of the most depressing developments about
| AI (which is chock-full of depressing developments): that its
| mere existence is eroding long-held ethics, not even
| necessarily out of a lack of commitment but out of practical
| necessity.
|
| The tech people are all turning against scraping, independent
| artists are now clamoring for brutal IP crackdowns and Disney-
| style copyright maximalism (which I _never_ would've predicted
| just 5 years ago, that crowd used to be staunchly against such
| things), people everywhere want more attestation and
| elimination of anonymity now that it's effectively free to make
| a swarm of convincingly-human misinformation agents, etc.
|
| It's making people worse.
| grishka wrote:
| It's different. I'm fine with someone scraping my website as a
| good citizen, by identifying themselves in their user-agent
| string and preferably respecting robots.txt. I'm _not_,
| however, fine with tens of requests per second to every
| possible URL from random IPs I'm receiving right now, all
| pretending to be different old versions of Chrome.
| VladVladikoff wrote:
| This is a fundamental misunderstanding of what those bots are
| requesting. They aren't parsing those PHP files; they are using
| their existence for fingerprinting -- they are trying to
| determine the presence of known vulnerabilities. They probably
| stop reading immediately after receiving an HTTP response code
| and discard the remainder of the response.
| mattgreenrocks wrote:
| It would be such a terrible thing if some LLM scrapers were
| using those responses to learn more about PHP, especially
| because of that recent paper pointing out it doesn't take that
| many data points to poison LLMs.
| holysoles wrote:
| You're right, something like fail2ban or crowdsec would
| probably be more effective here. Crowdsec has made it apparent
| to me how much vulnerability probing is done; it's a bit
| shocking for a low-traffic host.
| ajsnigrutin wrote:
| And you'd ban the IP, their one-day lease on the VM+IP would
| expire, and someone else would get the same IP on a new VM and
| be blocked from everywhere.
|
| It would make sense to ban the IP for a few hours, to have the
| bot cool down for a bit and move on to the next domain.
| holysoles wrote:
| I was referring to the rules/patterns provided by crowdsec
| rather than the distribution of known "bad" IPs through
| their Central API.
|
| The default ban for traffic detected by your crowdsec
| instance is 4 hours, so that concern isn't very relevant in
| that case.
|
| The decisions from the Central API from other users can be
| quite a bit longer (I see some at ~6 days), but you also
| don't have to use those if you're worried about that
| scenario.
| vachina wrote:
| They're not scraping for php files, they're probing for known
| vulns in popular frameworks, and then using them as entry points
| for pwning.
|
| This is done very efficiently. If you return anything unexpected,
| they'll just drop you and move on.
| BigBalli wrote:
| I always had fail2ban but a while back I wanted to set up
| something juicier...
|
| .htaccess diverts suspicious paths (e.g., /.git, /wp-login) to
| decoy.php and forces decoy.zip downloads (10GB), so scanners
| hitting common "secret" files never touch real content and get
| stuck downloading a huge dummy archive.
|
| decoy.php mimics whatever sensitive file was requested by endless
| streaming of fake config/log/SQL data, keeping bots busy while
| revealing nothing.
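|
| decoy.php itself isn't shown above; a rough Python equivalent of
| the endless fake-data stream (field names and pacing invented)
| could be:
|
|     import random
|     import time
|
|     FAKE_KEYS = ["DB_HOST", "DB_USER", "DB_PASSWORD", "API_TOKEN", "SECRET_KEY"]
|
|     def fake_config_lines():
|         # Endless stream of plausible-looking config lines; the sleep keeps
|         # the scanner connected while costing us almost nothing.
|         while True:
|             key = random.choice(FAKE_KEYS)
|             yield f"{key}={random.getrandbits(128):032x}\n"
|             time.sleep(1.0)
|
|     # In a real handler this generator would back a streamed HTTP response;
|     # printing a few lines here just shows the shape of the output.
|     gen = fake_config_lines()
|     for _ in range(3):
|         print(next(gen), end="")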
| holysoles wrote:
| I wrote a Traefik plugin [1] that controls traffic based on known
| bad bot user agents; you can just block them or even send them
| to a Markov babbler if you've set one up. I've been using
| Nepenthes [2].
|
| [1] https://github.com/holysoles/bot-wrangler-traefik-plugin
|
| [2] https://zadzmo.org/code/nepenthes/
| firefoxd wrote:
| I had to revisit my strategy after posting about my zipbombs on
| HN [0]. My server traffic went from tens of thousands to ~100k
| daily, hosted on a $6 VPS. It was not sustainable.
|
| Now I target only the most aggressive bots with zipbombs and the
| rest get a 403. My new spam strategy seems to work, but I don't
| know if I should post it on HN again...
|
| [0]: https://news.ycombinator.com/item?id=43826798
| ronsor wrote:
| These aren't scraper bots; they're vulnerability scanners. They
| don't expect PHP source code and probably don't even read the
| response body at all.
|
| I don't know why people would assume these are AI/LLM scrapers
| seeking PHP source code on random servers(!) short of it being
| related to this brainless "AI is stealing all the data" nonsense
| that has infected the minds of many people here.
___________________________________________________________________
(page generated 2025-11-15 23:00 UTC)