---
author:
  email: mail@petermolnar.net
  image: https://petermolnar.net/favicon.jpg
  name: Peter Molnar
  url: https://petermolnar.net
copies: []
lang: en
published: '2025-04-15T09:00:00+01:00'
summary: A quick guide on combining an anti-AI tarpit with automatic blocking
title: How to block and confuse AI crawlers with nepenthes and fail2ban
---

GenAI is a disgusting, festering thing that keeps getting worse by the day. And night. And minute. Its crawlers relentlessly ignore rules, copyright, and established practices and standards[^1] - and that's just the technical perspective, let alone the moral implications[^2].

People have come up with proof-of-work JS solutions to fend off the attacks[^3], and while that is the proper solution to the problem, I wanted something different.

A little while ago I came across AI "poisoning" tarpits[^4]. While it's a wonderful idea, it turned out that there are many, many, MANY bots that would gladly get lost 5-6 levels deep in the link maze it generates, happily heating my meager server CPU (I use a passively cooled thin client as my server), so I decided to add fail2ban blocking on top.

I do have to note that if you run a web **application** that is JS-dependent anyway, you're probably best off setting up Anubis. This solution is a better fit for websites that don't need or use JS. There's a promised no-JS version of Anubis, but it's not there yet.

**1: add a rule to your robots.txt to disallow any visits to a certain path on your site**

```
User-agent: *
Disallow: /ohwowyoushouldntbehere
```

**2: add an invisible link pointing to the disallowed path on your site, ideally on every page**

For example, something along these lines (the exact markup is up to you; this one is illustrative):

``` html
<!-- illustrative example: any link works, as long as crawlers see it and humans don't -->
<a href="/ohwowyoushouldntbehere" style="display: none">nothing to see here</a>
```

**3: set up Nepenthes[^5] on the disallowed path**

`templates/toplevel.lustache`

```
You shouldn't be on this page. Please leave.

{{ content }}
```
`templates/list.lustache`

```
{{ header }}

{{# content }}

{{ content }}

{{/ content }}
```

`config.yml`

``` yaml
http_host: '127.0.0.1'
http_port: 8893
templates: './templates'
words: '/usr/share/dict/words'
forget_time: 86400
forget_hits: 10
persist_stats: './statsfile.json'
seed_file: './seed.txt'
markov: './corpus.sqlite.db'
markov_min: 200
markov_max: 1200
min_wait: 5
max_wait: 30
```

You can find the rest of the Nepenthes setup instructions on the project site at https://zadzmo.org/code/nepenthes/

I ended up training it on Ulysses by James Joyce, obtained from https://www.gutenberg.org/cache/epub/4300/pg4300.txt

**4: block anything that visits it more than X times with fail2ban**

Note: I'm using the `ncsa` log format in nginx:

```
access_log /var/log/nginx/access.log ncsa;
```

`filter.d/nginx-nepenthes.conf`

```
[INCLUDES]
before = common.conf

[Definition]
failregex = ^[a-zA-Z\.]+ <HOST> [^\s]+ [^\s]+ \[[^\]]+\] \"[A-Z]+ \/ohwowyoushouldntbehere.*$
datepattern = %%d/%%b/%%Y:%%H:%%M:%%S
journalmatch = _SYSTEMD_UNIT=nginx.service + _COMM=nginx
ignoreregex =
```

Relevant lines in `jail.local`:

```
[nginx-nepenthes]
enabled  = true
port     = 80,443
filter   = nginx-nepenthes
logpath  = /var/log/nginx/access.log
maxretry = 3
bantime  = 84600
findtime = 86400
```

[^1]: https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/

[^2]: https://www.zhangjingna.com/blog/2025/3/30/people-are-generating-so-much-ai-csam-that-its-become-increasingly-difficult-for-law-enforcement-to-find-amp-rescue-real-human-child-victims

[^3]: https://anubis.techaro.lol/

[^4]: https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/

[^5]: https://zadzmo.org/code/nepenthes/
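As a quick sanity check on step 1, Python's stdlib robots.txt parser can confirm that the `Disallow` rule bars compliant crawlers from the trap path while leaving the rest of the site crawlable (the rule itself is the only real input here; the sample paths are made up):

``` python
from urllib.robotparser import RobotFileParser

# Parse the exact rule from step 1
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /ohwowyoushouldntbehere",
])

# The trap path, and anything beneath it, is off limits...
print(rp.can_fetch("*", "/ohwowyoushouldntbehere/deep/maze"))  # False
# ...while the rest of the site stays allowed
print(rp.can_fetch("*", "/some/article"))                      # True
```

Bots that ignore this rule are exactly the ones that end up in the maze.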
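The failregex can also be dry-run outside fail2ban. fail2ban requires a `<HOST>` capture in every failregex; the sketch below assumes a vhost-prefixed `ncsa` log line, places `<HOST>` after the leading vhost field, and approximates the tag with a plain IPv4 group (the sample log line is fabricated):

``` python
import re

# fail2ban expands <HOST> into a host-matching group; approximate it with IPv4
HOST = r"(?P<host>\d{1,3}(?:\.\d{1,3}){3})"
failregex = (
    r"^[a-zA-Z\.]+ " + HOST +
    r" [^\s]+ [^\s]+ \[[^\]]+\] \"[A-Z]+ \/ohwowyoushouldntbehere.*$"
)

# Fabricated sample line in the assumed "vhost ip ident user [date] ..." shape
line = ('example.net 203.0.113.7 - - [15/Apr/2025:09:00:00 +0100] '
        '"GET /ohwowyoushouldntbehere/x HTTP/1.1" 200 512')

m = re.match(failregex, line)
print(m.group("host") if m else "no match")  # 203.0.113.7
```

With fail2ban installed, `fail2ban-regex /var/log/nginx/access.log /etc/fail2ban/filter.d/nginx-nepenthes.conf` runs the same check against the live log.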