---
title: "Poisoning LLM web scrapers"
date: "2025-11-01"
---
       
Every server exposed to the Internet is subject to attacks: scraping, open port scanning, vulnerability scanning by automated services, DDoS, and so on. What interests me today are the robots that download web page content without permission.
       
I went through the logs of my NGINX web server using the reports generated by the goaccess tool. The majority of HTTP requests come from crawlers and from unidentified web browsers, which are, in practice, crawlers too. The images below show that on October 30, 2025, there were 59,454 HTTP requests from crawlers.
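As a sketch, a report like this can be produced with a goaccess invocation along these lines; the log path, log format, and output path are assumptions about my setup, not the exact command I ran:

```shell
# Generate a self-contained HTML report from the NGINX access log.
# Paths and the COMBINED log format are assumptions; adjust to your server.
goaccess /var/log/nginx/access.log \
  --log-format=COMBINED \
  -o /var/www/html/report.html
```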
       
![Web browsers in goaccess](/goaccess_web_browsers.png)

![Crawlers in goaccess](/goaccess_web_browsers_crawlers.png)
       
Most of these crawlers have questionable ethics; they also waste the server's storage space and add extra load on the CPU.
       
## An aggressive solution

To counter these robots, I looked into existing solutions and found a Mastodon post on the tldr.nettime.org instance that lists free, open-source tools. The common goal of these tools is to sabotage AI crawlers.
       
Among them, I chose an aggressive solution called iocaine. It is a web server that generates pages of garbage text in which every link leads to yet another generated page, and so on: a kind of infinite maze in which robots get lost.
       
## Filtering non-human visitors

The aim is to redirect robots to iocaine, which first requires identifying them. For this, I took inspiration from the ai.robots.txt project, which maintains a list of AI crawlers and robots to block and provides example NGINX configurations, which I adapted to my needs. I also used a list of IP address ranges belonging to Facebook's web crawlers.
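A minimal sketch of such a filter, in the spirit of the ai.robots.txt examples; the variable name and the handful of user agents below are illustrative, not my exact list:

```nginx
# Map the User-Agent header to a flag; 1 marks a known AI crawler.
# The agent list here is a small illustrative subset.
map $http_user_agent $is_ai_crawler {
    default                 0;
    "~*GPTBot"              1;
    "~*ClaudeBot"           1;
    "~*Bytespider"          1;
    "~*meta-externalagent"  1;
}
```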
       
I'd like to thank Agate Blue for writing an extremely detailed article on iocaine and its integration with NGINX; some of my configuration is inspired by theirs.
       
I ended up with a reverse proxy capable of redirecting non-human visitors to a service of my choosing, namely iocaine. Note that in my case NGINX runs on the host system, managed by a systemd service; it is not containerized.
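A hedged sketch of the idea, assuming a user-agent flag such as `$is_ai_crawler` has been defined in a `map` block; the exact directives in my configuration differ:

```nginx
# Inside the existing server block: flagged visitors are proxied to the
# local iocaine instance, everyone else gets the real site.
location / {
    if ($is_ai_crawler) {
        proxy_pass http://127.0.0.1:42069;
    }
    try_files $uri $uri/ =404;
}
```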
       
## My iocaine configuration

I kept the basic configuration and only customized the content of the generated web pages. I downloaded three books, Uncle Tom's Cabin, Lady Chatterley's Lover, and Brave New World, whose text is used to feed the generator. The tool also needs a word list.
       
## My iocaine deployment

To deploy the solution, I chose Docker Compose, starting from the base configuration suggested by the documentation. Below is the iocaine deployment file.
       
```yaml
services:
  iocaine:
    image: git.madhouse-project.org/iocaine/iocaine:2
    container_name: iocaine
    ports:
      - "127.0.0.1:42069:42069"
      - "127.0.0.1:42070:42070"
    volumes:
      - "./data:/data"
    environment:
      - IOCAINE__SERVER__BIND="0.0.0.0:42069"
      - IOCAINE__SOURCES__WORDS="/data/words.txt"
      - IOCAINE__SOURCES__MARKOV=["/data/text1.txt", "/data/text2.txt", "/data/text3.txt"]
      - IOCAINE__METRICS__ENABLE=true
      - IOCAINE__METRICS__BIND="0.0.0.0:42070"
      - IOCAINE__METRICS__LABELS=["Host","UserAgent"]
    restart: unless-stopped
    networks:
      - monitoring_prometheus_net

networks:
  monitoring_prometheus_net:
    external: true
```
       
The monitoring_prometheus_net network is used by the Prometheus container and its exporters. Attaching the iocaine container to it allows Prometheus to reach iocaine's Prometheus exporter on port 42070, so iocaine's metrics can be viewed in a Grafana dashboard.
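A hedged sketch of the corresponding scrape job; the job name and target assume the container is reachable as `iocaine` on the shared network, which is not necessarily my exact configuration:

```yaml
# Fragment of prometheus.yml; names are illustrative.
scrape_configs:
  - job_name: "iocaine"
    static_configs:
      - targets: ["iocaine:42070"]
```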
       
The value of IOCAINE__SOURCES__WORDS points to the word list downloaded earlier, and IOCAINE__SOURCES__MARKOV lists the three files containing the aforementioned books.
       
## The robots are stuck

The service is now deployed, and the NGINX configuration has been updated and reloaded. All that remains is to wait for the first trapped robots.

After leaving iocaine running for around 20 hours, here is a Grafana dashboard showing some results.
       
![Grafana dashboard for iocaine](/grafana_iocaine_dashboard.png)
       
We can see that robots landing in the labyrinth have already made 128,644 requests, with OpenAI and Anthropic as the main senders. Looking at the metrics, GPTBot is the least intelligent scraper, having covered the longest path, with a depth of 79!

To get an idea of what this represents, imagine clicking 79 times on a word in a text, each click leading to a new text with that word concatenated to the end of the URL.
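As a toy illustration of the shape of the trap (this is not iocaine's actual URL scheme): each click appends one more segment, so a crawler that keeps following links only digs itself deeper.

```python
def follow_maze(start: str, words: list[str], depth: int) -> str:
    """Simulate a crawler clicking `depth` links in a row, each one
    appending a word from the page to the end of the URL."""
    url = start
    for i in range(depth):
        url += "/" + words[i % len(words)]
    return url

# A crawler 79 clicks deep ends up on a URL with 79 extra path segments,
# none of which lead anywhere real.
url = follow_maze("https://example.org/maze", ["garbage", "text"], 79)
print(url.count("/") - "https://example.org/maze".count("/"))  # 79
```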
       
## Testing with a web browser

To test with a web browser, I used LibreWolf with an extension that changes the user agent. I set the user agent value to gptbot and visited https://theobori.cafe. Here's what it looks like to be tricked.
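The same check can also be done from the command line with curl, which can set the user agent directly; this mirrors the browser test above rather than being part of my original procedure:

```shell
# Fetch the page while pretending to be GPTBot; the response should be
# iocaine's generated garbage rather than the real site.
curl -A "gptbot" https://theobori.cafe/
```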
       
![The iocaine trap page](/iocaine_trap_web_page.png)
       
## Conclusion

I've managed to prevent LLM scrapers from harvesting my content without permission, using very little RAM and CPU. Only human visitors get the real pages of my websites, and this setup lets me spend far fewer resources serving my content to AIs.