---
title: Poisoning LLM web scrapers
date: "2025-11-01"
---
All servers exposed to the Internet are subject to attacks: scraping, port scanning, automated vulnerability scanning, DDoS, and so on. What interests me today are the robots that download web page content without permission. I went through the logs of my NGINX web server using a report generated by the goaccess tool. The majority of HTTP requests come from crawlers and from unknown web browsers, which in practice are crawlers as well. The images below show that on October 30, 2025, there were 59,454 HTTP requests from crawlers.
![goaccess web browsers](/goaccess_web_browsers.png)

![goaccess crawlers](/goaccess_web_browsers_crawlers.png)
Most of these crawlers have questionable ethics, and they also pollute the server's storage and add extra load on the processor.
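For reference, a report like the ones above can be produced with goaccess roughly like this (the log path and output location are assumptions specific to my setup):

```shell
# Parse the NGINX access log and emit a self-contained HTML report.
goaccess /var/log/nginx/access.log \
  --log-format=COMBINED \
  -o /var/www/html/report.html
```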
## An aggressive solution
To counter these robots, I looked into existing solutions and found a Mastodon post on the tldr.nettime.org instance that lists free, open-source tools. The common goal of these tools is to sabotage AIs.
Among them, I chose an aggressive solution called iocaine. It is a web server that generates pages of garbage text containing links to further generated pages, and so on: a kind of infinite maze in which robots get lost.
## Filtering non-human visitors
The aim is to redirect robots to iocaine, so we first need to be able to identify them. For this, I took inspiration from the ai.robots.txt project, which maintains a list of AI crawlers and robots to block and provides example NGINX configurations, which I adapted to my needs. I also used a list of IP address ranges that belong to Facebook's web crawlers.
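As a sketch, the user-agent filter can look like the following NGINX fragment. The patterns and IP range below are shortened examples of my own, not the full ai.robots.txt list:

```nginx
# $bad_bot is 1 when the user agent matches a known AI crawler
# (shortened example; the real list is much longer).
map $http_user_agent $bad_bot {
    default 0;
    "~*GPTBot"      1;
    "~*ClaudeBot"   1;
    "~*Bytespider"  1;
}

# IP ranges of Facebook crawlers (example range, replace with the real list).
geo $bad_ip {
    default 0;
    57.141.0.0/24 1;
}
```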
I'd like to thank Agate Blue for writing an extremely detailed article on iocaine and its integration with NGINX. Some of my configuration is inspired by theirs.
I ended up with a reverse proxy capable of redirecting non-human visitors to a desired service, in this case iocaine. Note that in my setup NGINX runs on the host system: it is not containerized and is managed by a systemd service.
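The redirection itself can be sketched like this, assuming a `$bad_bot` variable set from the user agent and iocaine listening locally on port 42069 (domain and paths are placeholders):

```nginx
server {
    listen 443 ssl;
    server_name example.com;  # placeholder domain

    location / {
        # Send identified crawlers into the maze instead of the real site.
        if ($bad_bot) {
            proxy_pass http://127.0.0.1:42069;
        }
        root /var/www/example.com;  # real content for everyone else
    }
}
```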
## My iocaine configuration
I kept the basic configuration and customized the web pages generated by the tool. I downloaded three books, Uncle Tom's Cabin, Lady Chatterley's Lover, and Brave New World, whose content is used to generate the garbage text. The tool also needs a word list.
## My iocaine deployment
To deploy the solution, I chose Docker Compose, starting from the base configuration suggested by the documentation. Below is the iocaine deployment file.
```yaml
services:
  iocaine:
    image: git.madhouse-project.org/iocaine/iocaine:2
    container_name: iocaine
    ports:
      - "127.0.0.1:42069:42069"
      - "127.0.0.1:42070:42070"
    volumes:
      - "./data:/data"
    environment:
      - IOCAINE__SERVER__BIND="0.0.0.0:42069"
      - IOCAINE__SOURCES__WORDS="/data/words.txt"
      - IOCAINE__SOURCES__MARKOV=["/data/text1.txt", "/data/text2.txt", "/data/text3.txt"]
      - IOCAINE__METRICS__ENABLE=true
      - IOCAINE__METRICS__BIND="0.0.0.0:42070"
      - IOCAINE__METRICS__LABELS=["Host","UserAgent"]
    restart: unless-stopped
    networks:
      - monitoring_prometheus_net

networks:
  monitoring_prometheus_net:
    external: true
```
The monitoring_prometheus_net network is used by the Prometheus container and the Prometheus exporter containers. Attaching the iocaine container to it allows Prometheus to reach the iocaine Prometheus exporter on port 42070, so that iocaine's metrics can be viewed in a Grafana dashboard.
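On the Prometheus side, a scrape job for the iocaine exporter might look like the fragment below (the job name is an assumption; the container is reachable by name on the shared network):

```yaml
scrape_configs:
  - job_name: "iocaine"
    static_configs:
      # The iocaine container, on monitoring_prometheus_net,
      # exposes its metrics on port 42070.
      - targets: ["iocaine:42070"]
```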
The value of IOCAINE__SOURCES__WORDS points to the word list downloaded earlier, and IOCAINE__SOURCES__MARKOV lists the three files containing the aforementioned books.
## The robots are stuck
The service is now deployed, and the NGINX configuration has been updated and reloaded. All that remains is to wait for the first trapped robots.
I left iocaine running for around 20 hours, and here is a Grafana dashboard showing some results.
![Grafana iocaine dashboard](/grafana_iocaine_dashboard.png)
We can see that 128,644 requests from robots have already landed in the labyrinth, with OpenAI and Anthropic as the main senders. Looking at the metrics, GPTBot is the least intelligent scraper, having covered the longest path with a depth of 79!
To get an idea of what this represents, imagine clicking 79 times in a row on a word in a text, where each click leads to a new text with that word concatenated to the end of the URL.
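As a toy illustration of that growth (not iocaine's actual URL scheme, just the idea), each hop appends one more path segment:

```python
def maze_url(base: str, word: str, depth: int) -> str:
    """Simulate a crawler descending `depth` levels into the maze:
    every followed link adds one more segment to the URL."""
    url = base
    for _ in range(depth):
        url = f"{url}/{word}"
    return url

# After 3 hops the URL has grown by three segments...
print(maze_url("https://example.com", "garbage", 3))
# ...and a depth-79 walk like GPTBot's yields a 79-segment path.
print(maze_url("https://example.com", "garbage", 79).count("/garbage"))
```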
## Testing with a web browser
To test with a web browser, I used LibreWolf with an extension that changes the user agent. I set the user agent to gptbot and visited https://theobori.cafe. Here's what it looks like to be tricked.

![iocaine trap web page](/iocaine_trap_web_page.png)
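The same check also works from the command line with curl, assuming the NGINX filter matches on the `GPTBot` substring of the user agent:

```shell
# A normal request returns the real page...
curl -s https://theobori.cafe | head -n 5

# ...while a crawler user agent gets iocaine's generated garbage.
curl -s -A "GPTBot" https://theobori.cafe | head -n 5
```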
## Conclusion
I've managed to prevent LLM scrapers from stealing my content without my permission, while using very little RAM and CPU. Only human visitors are served the real pages of my websites, and this setup lets me spend fewer resources serving my content to AIs.