[HN Gopher] AI scrapers request commented scripts
       ___________________________________________________________________
        
       AI scrapers request commented scripts
        
       Author : ColinWright
       Score  : 160 points
       Date   : 2025-10-31 15:44 UTC (7 hours ago)
        
 (HTM) web link (cryptography.dog)
 (TXT) w3m dump (cryptography.dog)
        
       | rokkamokka wrote:
       | I'm not overly surprised, it's probably faster to search the text
       | for http/https than parse the DOM
        
         | embedding-shape wrote:
          | Not just probably: searching through plaintext (which they
          | seem to be doing) vs. iterating over the DOM involve vastly
          | different amounts of work in terms of resources and
          | performance, to the point that "probably" is way underselling
          | the difference :)
        
           | franktankbank wrote:
            | Reminds me of the shortcut that works for the happy path but
            | is utterly fucked by real data. This is an interesting trap;
            | can it easily be avoided without walking the DOM?
        
             | embedding-shape wrote:
              | Yes: parse out HTML comments, which is also kind of
              | trivial if you've ever done any sort of parsing. Listen
              | for "<!--" and, whenever you come across it, ignore
              | everything until the next "-->". But then again, these
              | people are using AI to build scrapers, so I wouldn't put
              | too much pressure on them to produce high-quality
              | software.
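              | 
              | A minimal sketch of that scan (Python, my own
              | illustration): one pass, no DOM, drop everything
              | between "<!--" and "-->":
              | 
              |     def strip_html_comments(text: str) -> str:
              |         # Linear scan: keep text, skip comments.
              |         out, i = [], 0
              |         while i < len(text):
              |             start = text.find("<!--", i)
              |             if start == -1:
              |                 out.append(text[i:])
              |                 break
              |             out.append(text[i:start])
              |             end = text.find("-->", start + 4)
              |             if end == -1:
              |                 break  # unterminated: drop rest
              |             i = end + 3
              |         return "".join(out)
              | 
              | Extract URLs from the result and commented-out
              | links vanish. (Though see the sibling reply: a
              | "<!--" inside a <script> body defeats this.)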
        
               | stevage wrote:
               | Lots of other ways to include URLs in an HTML document
               | that wouldn't be visible to a real user, though.
        
               | jcheng wrote:
               | It's not quite as trivial as that; one could start the
               | page with a <script> tag that contains "<!--" without
               | matching "-->", and that would hide all the content from
               | your scraper but not from real browsers.
               | 
                | But I think it's moot; parsing HTML is not very
                | expensive if you don't have to actually render it.
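                | 
                | To illustrate (a hypothetical snippet): the
                | naive scan above loses everything after the
                | "<!--", while a real parser such as Python's
                | html.parser treats <script> content as CDATA
                | and is not fooled:
                | 
                |     from html.parser import HTMLParser
                | 
                |     page = ('<script>var s = "<!--";'
                |             '</script>'
                |             '<a href="/real">content</a>')
                | 
                |     class Links(HTMLParser):
                |         def handle_starttag(self, t, attrs):
                |             if t == "a":
                |                 print(dict(attrs).get("href"))
                | 
                |     Links().feed(page)  # prints: /real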
        
       | Noumenon72 wrote:
       | It doesn't seem that abusive. I don't comment things out thinking
       | "this will keep robots from reading this".
        
         | michael1999 wrote:
         | Crawlers ignoring robots.txt is abusive. That they then start
         | scanning all docs for commented urls just adds to the pile of
         | scummy behaviour.
        
           | tveyben wrote:
           | Human behavior is interesting - me, me, me...
        
         | mostlysimilar wrote:
         | The article mentions using this as a means of detecting bots,
         | not as a complaint that it's abusive.
         | 
         | EDIT: I was chastised, here's the original text of my comment:
         | Did you read the article or just the title? They aren't
         | claiming it's abusive. They're saying it's a viable signal to
         | detect and ban bots.
        
           | pseudalopex wrote:
           | Please don't comment on whether someone read an article. "Did
           | you even read the article? It mentions that" can be shortened
           | to "The article mentions that".[1]
           | 
           | [1] https://news.ycombinator.com/newsguidelines.html
        
           | woodrowbarlow wrote:
           | the first few words of the article are:
           | 
           | > Last Sunday I discovered some abusive bot behaviour [...]
        
             | mostlysimilar wrote:
             | > The robots.txt for the site in question forbids all
             | crawlers, so they were either failing to check the policies
             | expressed in that file, or ignoring them if they had.
        
             | foobarbecue wrote:
             | Yeah but the abusive behavior is ignoring robots.txt and
             | scraping to train AI. Following commented URLs was not the
             | crime, just evidence inadvertently left behind.
        
           | ang_cire wrote:
           | They call the scrapers "malicious", so they are definitely
           | complaining about them.
           | 
           | > A few of these came from user-agents that were obviously
           | malicious:
           | 
            | (I love the idea that they consider any Python or Go request
            | to be a malicious scraper...)
        
       | latenightcoding wrote:
        | When I used to crawl the web, battle-tested Perl regexes were
        | more reliable than anything else; commented URLs would have
        | been added to my queue.
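        | 
        | Something in that spirit (a sketch in Python rather than the
        | original Perl): a pattern like this happily matches URLs
        | inside HTML comments too, which is exactly how commented-out
        | links end up in the queue:
        | 
        |     import re
        | 
        |     URL_RE = re.compile(r'https?://[^\s"\'<>]+')
        | 
        |     html = '<!-- <a href="https://example.com/x"> -->'
        |     print(URL_RE.findall(html))
        |     # ['https://example.com/x']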
        
         | rightbyte wrote:
          | DOM navigation for fetching some data is for tryhards. Using a
          | regex to grab the correct paragraph or div or whatever is fine
          | and is more robust against things moving around on the page.
        
           | chaps wrote:
            | Doing both is fine! Just, once you've figured out your regex
            | and such, hardening/generalizing demands DOM iteration. It
            | sucks, but it is what it is.
        
           | horseradish7k wrote:
            | But not when crawling. You don't know the page format in
            | advance - you don't even know what the page contains!
        
       | OhMeadhbh wrote:
       | I blame modern CS programs that don't teach kids about parsing.
       | The last time I looked at some scraping code, the dev was using
       | regexes to "parse" html to find various references.
       | 
        | Maybe that's a way to defend against bots that ignore
        | robots.txt: include a reference to a honeypot HTML file with
        | garbage text, but put the link to it inside a comment.
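        | 
        | A sketch of that defence (Flask; the /honeypot.html path and
        | the in-memory ban list are made up for illustration):
        | 
        |     from flask import Flask, abort, request
        | 
        |     app = Flask(__name__)
        |     banned = set()  # swap for a real store
        | 
        |     @app.route("/")
        |     def index():
        |         if request.remote_addr in banned:
        |             abort(403)
        |         return ('<p>Hello, humans.</p>'
        |                 '<!-- <a href="/honeypot.html">'
        |                 'bots only</a> -->')
        | 
        |     @app.route("/honeypot.html")
        |     def honeypot():
        |         # Only comment-reading clients end up here.
        |         banned.add(request.remote_addr)
        |         return "Garbage text for comment-readers."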
        
         | tuwtuwtuwtuw wrote:
         | Do you think that if some CS programs taught parsing, the
         | authors of the bot would parse the HTML to properly extract
         | links, instead of just doing plain text search?
         | 
         | I doubt it.
        
         | ericmcer wrote:
          | How would you recommend doing it? If I was just trying to pull
          | <a/> tag links out, I feel like treating it like text and
          | using regex would be way more efficient than a full-on HTML
          | parser like JSDom or something.
        
           | singron wrote:
           | You don't need javascript to parse HTML. Just use an HTML
           | parser. They are very fast. HTML isn't a regular language, so
           | you can't parse it with regular expressions.
           | 
           | Obligatory:
           | https://stackoverflow.com/questions/1732348/regex-match-
           | open...
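            | 
            | For instance (a sketch assuming lxml is installed;
            | the stdlib html.parser works too):
            | 
            |     from lxml import html
            | 
            |     doc = html.fromstring(
            |         '<p><a href="/a">a</a>'
            |         '<!-- <a href="/hidden">x</a> --></p>')
            |     print(doc.xpath('//a/@href'))
            |     # ['/a'] - the commented link is not a link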
        
             | zahlman wrote:
             | The point is: if you're trying to find all the URLs within
             | the page source, it doesn't really matter to you what tags
             | they're in, or how the document is structured, or even
             | whether they're given as link targets or in the readable
             | text or just what.
        
         | vaylian wrote:
         | The people who do this type of scraping to feed their AI are
         | probably also using AI to write their scraper.
        
         | mikeiz404 wrote:
          | It's been some time since I have dealt with web scrapers, but
          | it takes fewer resources to run a regex than it does to parse
          | the DOM (which may have syntactically incorrect parts anyway).
          | This can add up when running many scraping requests in
          | parallel. So depending on your goals, using a regex can be
          | much preferable.
        
       | sharkjacobs wrote:
       | Fun to see practical applications of interesting research[1]
       | 
       | [1]https://news.ycombinator.com/item?id=45529587
        
       | bakql wrote:
       | >These were scrapers, and they were most likely trying to non-
       | consensually collect content for training LLMs.
       | 
       | "Non-consensually", as if you had to ask for permission to
       | perform a GET request to an open HTTP server.
       | 
       | Yes, I know about weev. That was a travesty.
        
         | XenophileJKO wrote:
         | What about people using an LLM as their web client? Are you now
         | saying the website owner should be able to dictate what client
         | I use and how it must behave?
        
           | aDyslecticCrow wrote:
           | > Are you now saying the website owner should be able to
           | dictate what client I use and how it must behave?
           | 
            | Already pretty well established with ad-blocking, actually.
            | It's a pretty similar case, even: AIs don't click ads, so
            | why should we accept their traffic? If it's
            | disproportionately loading the server without contributing
            | to the funding of the site, it gets blocked.
           | 
           | The server can set whatever rules it wants. If the maintainer
           | hates google and wants to block all chrome users, it can do
           | so.
        
         | Calavar wrote:
         | I agree. It always surprises me when people are indignant about
         | scrapers ignoring robots.txt and throw around words like
         | "theft" and "abuse."
         | 
         | robots.txt is a polite request to please not scrape these pages
         | because it's probably not going to be productive. It was never
         | meant to be a binding agreement, otherwise there would be a
         | stricter protocol around it.
         | 
         | It's kind of like leaving a note for the deliveryman saying
         | please don't leave packages on the porch. It's fine for low
         | stakes situations, but if package security is of utmost
         | importance to you, you should arrange to get it certified or to
         | pick it up at the delivery center. Likewise if enforcing a rule
         | of no scraping is of utmost importance you need to require an
         | API token or some other form of authentication before you serve
         | the pages.
        
           | hsbauauvhabzb wrote:
           | How else do you tell the bot you do not wish to be scraped?
           | Your analogy is lacking - you didn't order a package, you
           | never wanted a package, and the postman is taking something,
           | not leaving it, and you've explicitly left a sign saying 'you
           | are not welcome here'.
        
             | bakql wrote:
             | Stop your http server if you do not wish to receive http
             | requests.
        
               | vkou wrote:
               | Turn off your phone if you don't want to receive robo-
               | dialed calls and unsolicited texts 300 times a day.
               | 
               | Fence off your yard if you don't want people coming by
               | and dumping a mountain of garbage on it every day.
               | 
               | You can certainly choose to live in a society that thinks
               | these are acceptable solutions. I think it's bullshit,
               | and we'd all be better off if anyone doing these things
               | would be breaking rocks with their teeth in a re-
               | education camp, until they learn how to be a decent human
               | being.
        
             | Calavar wrote:
             | If you are serving web pages, you are soliciting GET
             | requests, kind of like ordering a package is soliciting a
             | delivery.
             | 
             | "Taking" versus "giving" is neither here nor there for this
             | discussion. The question is are you expressing a preference
             | on etiquette versus a hard rule that must be followed. I
             | personally believe robots.txt is the former, and I say that
                | as someone who serves more pages than they scrape.
        
               | yuliyp wrote:
               | Having a front door physically allows anyone on the
               | street to come to knock on it. Having a "no soliciting"
               | sign is an instruction clarifying that not everybody is
               | welcome. Having a web site should operate in a similar
               | fashion. The robots.txt is the equivalent of such a sign.
        
               | halJordan wrote:
               | No soliciting signs are polite requests that no one has
               | to follow, and door to door salesman regularly walk right
               | past them.
               | 
               | No one is calling for the criminalization of door-to-door
               | sales and no one is worried about how much door-to-door
               | sales increases water consumption.
        
               | oytis wrote:
               | > door to door salesman regularly walk right past them.
               | 
               | Oh, now I understand why Americans can't see a problem
               | here.
        
               | ahtihn wrote:
               | If a company was sending hundreds of salesmen to knock at
               | a door one after the other, I'm pretty sure they could
               | successfully get sued for harassment.
        
               | czscout wrote:
               | And a no soliciting sign is no more cosmically binding
               | than robots.txt. It's a request, not an enforceable
               | command.
        
               | munk-a wrote:
               | I disagree strongly here - though not from a technical
               | perspective. There's absolutely a legal concept of making
               | your work available for viewing without making it
                | available for copying, and AI scraping (while we can
               | technically phrase it as just viewing a bunch of times)
               | is effectively copying.
               | 
                | Let's say a large art hosting site realizes how damaging
               | AI training on their data can be - should they respond by
               | adding a paywall before any of their data is visible? If
               | that paywall is added (let's just say $5/mo) can most of
               | the artists currently on their site afford to stay there?
               | Can they afford it if their potential future patrons are
               | limited to just those folks who can pay $5/mo? Would the
               | scraper be able to afford a one time cost of $5 to scrape
               | all of that data?
               | 
                | I think, as much as they are a deeply flawed concept,
                | this is a case where EULAs, or an assumption of
                | no-access-for-training unless explicitly granted that is
                | actually enforced through the legal system, are
                | required. There are
               | a lot of small businesses and side projects that are
               | dying because of these models and I think that creative
               | outlet has societal value we would benefit from
               | preserving.
        
               | jMyles wrote:
               | > There's absolutely a legal concept of making your work
               | available for viewing without making it available for
               | copying
               | 
               | This "legal concept" is enforceable through legacy
               | systems of police and violence. The internet does not
               | recognize it. How much more obvious can this get?
               | 
               | If we stumble down the path of attempting to apply this
               | legal framework, won't some jurisdiction arise with no IP
               | protections whatsoever and just come to completely
               | dominate the entire economy of the internet?
               | 
               | If I can spin up a server in copyleftistan with a
               | complete copy of every album and film ever made,
               | available for free download, why would users in
               | copyrightistan use the locked down services of their
               | domestic economy?
        
               | kelnos wrote:
               | > _legacy systems of police and violence_
               | 
               | You use "legacy" as if these systems are obsolete and on
               | their way out. They're not. They're here to stay, and
               | will remain dominant, for better or worse. Calling them
               | "legacy" feels a bit childish, as if you're trying to
               | ignore reality and base arguments on your preferred
               | vision of how things should be.
               | 
               | > _The internet does not recognize it._
               | 
               | Sure it does. Not universally, but there are a _lot_ of
               | things governments and law enforcement can do to control
               | what people see and do on the internet.
               | 
               | > _If we stumble down the path of attempting to apply
                | this legal framework, won't some jurisdiction arise with
               | no IP protections whatsoever and just come to completely
               | dominate the entire economy of the internet?_
               | 
               | No, of course not, that's silly. That only really works
               | on the margins. Any other country would immediately slap
               | economic sanctions on that free-for-all jurisdiction and
               | cripple them. If that fails, there's always a military
               | response they can resort to.
               | 
               | > _If I can spin up a server in copyleftistan with a
               | complete copy of every album and film ever made,
               | available for free download, why would users in
               | copyrightistan use the locked down services of their
               | domestic economy?_
               | 
               | Because the governments of all the copyrightistans will
               | block all traffic going in and out of copyleftistan.
               | While this may not stop determined, technically-adept
               | people, it will work for the most part. As I said, this
               | sort of thing only really works on the margins.
        
               | andoando wrote:
                | Well yes, this is exactly what's happening as of now.
                | But there SHOULD be a way to upload content without
                | giving scrapers access to it.
        
               | kelnos wrote:
               | > _If you are serving web pages, you are soliciting GET
               | requests_
               | 
               | So what's the solution? How do I host a website that
               | welcomes human visitors, but rejects all scrapers?
               | 
               | There is no mechanism! The best I can do is a cat-and-
               | mouse arms race where I try to detect the traffic I don't
               | want, and block it, while the people generating the
               | traffic keep getting more sophisticated about hiding from
               | my detection.
               | 
               | No, putting up a paywall is not a reasonable response to
               | this.
               | 
               | > _The question is are you expressing a preference on
               | etiquette versus a hard rule that must be followed._
               | 
               | Well, there really aren't any hard rules that must be
               | followed, because there are no enforcement mechanisms
               | outside of going nuclear (requiring login). _Everything_
               | is etiquette. And I agree that robots.txt is _also_
               | etiquette, and it is super messed up that we tolerate
               | "AI" companies stomping all over that etiquette.
               | 
               | Do we maybe want laws that say everyone must respect
               | robots.txt? Maybe? But then people will just move their
               | scrapers to a jurisdiction without those laws. And I'm
               | sure someone could make the argument that robots.txt
               | doesn't apply to them because they spoofed a browser
               | user-agent (or another user-agent that a site explicitly
               | allows). So perhaps we have a new mechanism, or new laws,
               | or new... something.
               | 
               | But this all just highlights the point I'm making here:
               | there is no reasonable mechanism (no, login pages and
               | http auth don't count) for site owners to restrict access
               | to their site based on these sorts of criteria. And
               | that's a problem.
        
               | davesque wrote:
               | If I order a package from a company selling a good, am I
               | inviting all that company's competitors to show up at my
               | doorstep to try and outbid the delivery person from the
               | original company when they arrive, and maybe they all
               | show up at the same time and cause my porch to collapse?
               | No, because my front porch is a limited resource for
               | which I paid for an intended purpose. Is it illegal for
               | those other people to show up? Maybe not by the letter of
               | the law.
        
             | davsti4 wrote:
              | It's simple, and I'll quote myself: "robots.txt isn't the
              | law".
        
             | nkrisc wrote:
             | Put your content behind authentication if you don't want it
             | to be requested by just anyone.
        
               | kelnos wrote:
                | But I _do_ want my content accessible to "just anyone",
                | as long as they are humans. I _don't_ want it accessible
                | to bots.
               | 
               | You are free to say "well, there is no mechanism to do
               | that", and I would agree with you. That's the problem!
        
             | stray wrote:
             | You require something the bot won't have that a human
             | would.
             | 
             | Anybody may watch the demo screen of an arcade game for
             | free, but you have to insert a quarter to play -- and you
             | can have even greater access with a key.
             | 
             | > and you've explicitly left a sign saying 'you are not
             | welcome here'
             | 
             | And the sign said "Long-haired freaky people Need not
             | apply" So I tucked my hair up under my hat And I went in to
             | ask him why He said, "You look like a fine upstandin' young
             | man I think you'll do" So I took off my hat and said,
             | "Imagine that Huh, me workin' for you"
        
               | michaelt wrote:
                | _> You require something the bot won't have that a human
               | would._
               | 
               | Is this why the "open web" is showing me a captcha or
               | two, along with their cookie banner and newsletter pop up
               | these days?
        
           | whimsicalism wrote:
           | There's an evolving morality around the internet that is
           | very, very different from the pseudo-libertarian rule of the
           | jungle I was raised with. Interesting to see things change.
        
             | sethhochberg wrote:
             | The evolutionary force is really just "everyone else showed
             | up at the party". The Internet has gone from a capital-I
             | thing that was hard to access, to a little-i internet that
             | was easier to access and well known but still largely
             | distinct from the real world, to now... just the real world
             | in virtual form. Internet morality mirrors real world
             | morality.
             | 
             | For the most part, everybody is participating now, and that
             | brings all of the challenges of any other space with
             | everyone's competing interests colliding - but fewer
             | established systems of governance.
        
             | hdgvhicv wrote:
              | Based on the comments here, the polite world of the
              | internet where people obeyed unwritten best practices is
              | certainly over, in favour of "grab what you can, might
              | makes right".
        
           | bigbuppo wrote:
           | Seriously. Did you see what that web server was wearing? I
           | mean, sure it said "don't touch me" and started screaming for
           | help and blocked 99.9% of our IP space, but we got more and
           | they didn't block that so clearly they weren't serious. They
           | were asking for it. It's their fault. They're not really
           | victims.
        
             | jMyles wrote:
             | Sexual consent is sacred. This metaphor is in truly bad
             | taste.
             | 
             | When you return a response with a 200-series status code,
             | you've granted consent. If you don't want to grant consent,
             | change the logic of the server.
        
               | jraph wrote:
               | > When you return a response with a 200-series status
               | code, you've granted consent. If you don't want to grant
               | consent, change the logic of the server.
               | 
               | "If you don't consent to me entering your house, change
               | its logic so that picking the door's lock doesn't let me
               | open the door"
               | 
               | Yeah, well...
               | 
                | As if the LLM scrapers didn't try everything under the
                | sun, like using millions of different residential IPs to
                | prevent admins from "changing the logic of the server" so
                | it doesn't "return a response with a 200-series status
                | code" when they don't agree to this scraping.
               | 
               | As if there weren't broken assumptions that make "When
               | you return a response with a 200-series status code,
               | you've granted consent" very false.
               | 
               | As if technical details were good carriers of human
               | intents.
        
               | ryandrake wrote:
               | The locked door is a ridiculous analogy when it comes to
               | the open web. Pretty much all "door" analogies are
               | flawed, but sure let's imagine your web server has a
               | door. If you want to actually lock the door, you're more
               | than welcome to put an authentication gate around your
               | content. A web server that accepts a GET request and
               | replies 2xx is distinctly NOT "locked" in any way.
        
               | jraph wrote:
                | Any analogy is flawed, and you can kill most analogies
                | very fast. They are meant to illustrate a point,
                | hopefully efficiently, not to be mathematically true.
                | They are not to everyone's taste, mine included in most
                | cases. They are mostly fine as long as they are not used
                | to make a point, but only to illustrate it.
               | 
               | I agree with this criticism of this analogy, I actually
               | had this flaw in mind from the start. There are other
               | flaws I have in mind as well.
               | 
                | I have developed the point more, without the analogy, in
                | the remainder of the comment. How about we focus on the
                | crux of the matter?
               | 
               | > A web server that accepts a GET request and replies 2xx
               | is distinctly NOT "locked" in any way
               | 
                | The point is that these scrapers use tricks so that it's
               | difficult not to grant them access. What is unreasonable
               | here is to think that 200 means consent, especially
               | knowing about the tricks.
               | 
               | Edit:
               | 
               | > you're more than welcome to put an authentication gate
               | around your content.
               | 
                | I don't want to. Adding auth so LLM providers don't abuse
               | my servers and the work I meant to share publicly is not
               | a working solution.
        
               | jack_pp wrote:
                | Here's my analogy: it's like you own a museum and you
                | require entrance by "secret" password (your user agent
                | filtering or whatnot). The problem is that the password
                | is the same for everyone, so would you be surprised when
                | someone figures it out, or gets it from a friend, and
                | they visit your museum? Either require a fee (processing
                | power, captcha, etc.) or make a private password (auth).
                | 
                | It is inherently a cat and mouse game that you CHOOSE to
                | play. Implement throttling, auth, captcha, javascript,
                | or whatever else whenever a client consumes too many
                | resources on your server. If the client still chooses to
                | go through the hoops you implemented, then I don't see
                | any issue. If you still have an issue, then implement
                | more hoops until you're satisfied.
        
               | jraph wrote:
               | > Either require a fee (processing power, captcha etc) or
               | make a private password (auth)
               | 
               | Well, I shouldn't have to work or make things worse for
               | everybody because the LLM bros decided to screw us.
               | 
               | > It is inherently a cat and mouse game that you CHOOSE
               | to play
               | 
               | No, let's not reverse the roles and blame the victims
               | here. We sysadmins and authors are willing to share our
               | work publicly to the world but never asked for it to be
               | abused.
        
               | ryandrake wrote:
               | People need to have a better mental model of what it
               | means to host a public web site, and what they are
               | actually doing when they run the web server and point it
               | at a directory of files. They're not just serving those
               | files to customers. They're not just serving them to
               | members. They're not just serving them to human beings.
               | They're not even necessarily serving files to web
               | browsers. They're serving files to every IP address (no
               | matter what machine is attached to it) that is capable of
               | opening a socket and sending GET. There's no such
               | distinct thing as a scraper--and if your mental model
               | tries to distinguish between a scraper and a human user,
               | you're going to be disappointed.
               | 
               | As the web server operator, you can try to figure out if
               | there's a human behind the IP, and you might be right or
               | wrong. You can try to figure out if it's a web browser,
               | or if it's someone typing in curl from a command line, or
               | if it's a massively parallel automated system, and you
               | might be right or wrong. You can try to guess what
               | country the IP is in, and you might be right or wrong.
               | But if you really want to actually limit access to the
               | content, you shouldn't be publishing that content
               | publicly.
        
               | jraph wrote:
               | > There's no such distinct thing as a scraper--and if
               | your mental model tries to distinguish between a scraper
               | and a human user, you're going to be disappointed.
               | 
                | I disagree. If your mental model doesn't allow for
                | conceptualizing (abusive) scrapers, it is too simplistic
                | to be useful for understanding and dealing with reality.
               | 
               | But I'd like to re-state the frame / the concern: it's
               | not about _any_ bot or _any_ scraper, it is about the
               | despicable behavior of LLM providers and their awful
                | scrapers.
               | 
               | I'm personally fine with bots accessing my web servers,
               | there are many legitimate use cases for this.
               | 
               | > But if you really want to actually limit access to the
               | content, you shouldn't be publishing that content
               | publicly.
               | 
               | It is not about denying access to the content to some and
               | allowing access to others.
               | 
               | It is about having to deal with abuses.
               | 
               | Is a world in which people stop sharing their work
               | publicly because of these abuses desirable? Hell no.
        
               | oytis wrote:
               | Technically, you are not serving anything - it's just
               | voltage levels going up and down with no meaning at all.
        
               | Larrikin wrote:
               | >I don't like how your metaphor is an effective metaphor
               | for the situation so it's in bad taste.
        
               | jack_pp wrote:
                | If you absolutely want a sexual metaphor, it's more like
                | you snuck into the world record attempt for how many
                | sexual partners a woman can take in 24h, and even though
                | you aren't on the list you still got to smash.
                | 
                | The solution is the same: implement better security.
        
               | LexGray wrote:
                | Perhaps bad taste, but bots could well be purposely
                | violating the most private or traumatizing moments a
                | vulnerable person has, in any exploitative way they
                | care to. I am not sure bad taste is enough of an excuse
                | not to discuss the issue, as many people do in fact use
                | the internet for sexual things. If anything, consent
                | should be MORE important because it is easier to
                | document and verify.
               | 
               | A vast hoard of personal information exists and most of
               | it never had or will have proper consent, knowledge, or
               | protection.
        
           | mxkopy wrote:
           | The metaphor doesn't work. It's not the security of the
           | package that's in question, but something like whether the
           | delivery person is getting paid enough or whether you're
           | supporting them getting replaced by a robot. The issue is in
           | the context, not the protocol.
        
           | kelnos wrote:
           | > _robots.txt is a polite request to please not scrape these
           | pages_
           | 
           | People who ignore polite requests are assholes, and we are
           | well within our rights to complain about them.
           | 
           | I agree that "theft" is too strong (though I think you might
           | be presenting a straw man there), but "abuse" can be
           | perfectly apt: a crawler hammering a server, requesting the
           | same pages over and over, absolutely is abuse.
           | 
           | > _Likewise if enforcing a rule of no scraping is of utmost
           | importance you need to require an API token or some other
           | form of authentication before you serve the pages._
           | 
           | That's a shitty world that we shouldn't have to live in.
        
           | watwut wrote:
            | If you ignore a polite request, then it is perfectly OK to
            | give you as much false data as possible. You have shown
            | yourself not interested in good faith cooperation, and that
            | means other people can and should treat you as a jerk.
        
         | isodev wrote:
         | Ah yes, the "it's ok because I can" school of thought. As if
         | that was ever true.
        
         | munk-a wrote:
          | I think there's a massive shift in what the letter of the law
          | needs to be to match the intent. The letter hasn't changed and
          | this is all still quite legal - but there is a significant
          | difference between what webscraping was doing to impact
          | creative lives five years ago and today. It was always
          | possible for artists to have their content stolen and for
          | creative works to be reposted - but there were enough IP laws
          | around image sharing (which AI disingenuously steps around),
          | and other creative work wasn't monetarily efficient to scrape.
         | 
          | I think there is a real difference in intent between reading
          | something someone created (which is often a form of marketing)
          | and reproducing-but-modifying someone's creative output (which
          | competes against and starves the creative of income).
         | 
         | The world changed really quickly and our legal systems haven't
         | kept up. It is hurting real people who used to have small side
         | businesses.
        
         | Lionga wrote:
            | So if a house is not locked, I can take whatever I want?
        
           | Ylpertnodi wrote:
            | Yes, but you may get caught, and then suffer 'consequences'.
            | I can drive well over 220 km/h on the autobahn (Germany,
            | Europe), and also in France (also in Europe). One is
            | acceptable; the other will get me Royale-e fucked. If they
            | can catch me.
        
         | arccy wrote:
         | yeah all open HTTP servers are fair game for DDoS because well
         | it's open right?
        
         | sdenton4 wrote:
        | The problem is that serving content costs money. LLM scraping
        | is essentially DDoSing content meant for human consumption.
        | DDoSing sucks.
        
           | 2OEH8eoCRo0 wrote:
           | Scraping is legal. DDoSing isn't.
           | 
           | We should start suing these bad actors. Why do techies forget
           | that the legal system exists?
        
             | ColinWright wrote:
              | There is no way that you can sue the people responsible for
              | DDoSing your system. Even if you can find them ... and you
              | won't ... they're as likely as not outside your
              | jurisdiction (they might be in Russia, or China, or
              | Bolivia, or anywhere), and they will have a lot more money
              | than you.
              | 
              | People here on HN are laughing at the UK's Online Safety
              | Act for trying to impose restrictions on people in other
              | countries, and yet now you're implying that similar
              | restrictions can be placed on people in other countries,
              | over whom you have neither power nor control.
        
         | jraph wrote:
         | When I open an HTTP server to the public web, I expect and
         | welcome GET requests in general.
         | 
         | However,
         | 
         | (1) there's a difference between (a) a regular user browsing my
         | websites and (b) robots DDoSing them. It was never okay to
         | hammer a webserver. This is not new, and it's for this reason
         | that curl has had options to throttle repeated requests to
          | servers forever. In real life, there are many instances of
          | things being offered for free; it's usually not okay to take
          | it all. Yes, this would be abuse. And no, the correct answer to
         | such a situation would not be "but it was free, don't offer it
         | for free if you don't want it to be taken for free". Same thing
         | here.
         | 
         | (2) there's a difference between (a) a regular user reading my
         | website or even copying and redistributing my content as long
         | as the license of this work / the fair use or related laws are
         | respected, and (b) a robot counterfeiting it (yeah, I agree
         | with another commenter, theft is not the right word, let's call
         | a spade a spade)
         | 
         | (3) well-behaved robots are expected to respect robots.txt.
         | This is not the law, this is about being respectful. It is only
          | fair that badly-behaved robots get called out.
         | 
         | Well behaved robots do not usually use millions of residential
         | IPs through shady apps to "Perform a get request to an open
         | HTTP server".
        
           | Cervisia wrote:
           | > robots.txt. This is not the law
           | 
            | In Germany, it is the law. § 44b UrhG says (translated):
           | 
           | (1) Text and data mining is the automated analysis of one or
           | more digital or digitized works to obtain information, in
           | particular about patterns, trends, and correlations.
           | 
           | (2) Reproductions of lawfully accessible works for text and
           | data mining are permitted. These reproductions must be
           | deleted when they are no longer needed for text and data
           | mining.
           | 
           | (3) Uses pursuant to paragraph 2, sentence 1, are only
           | permitted if the rights holder has not reserved these rights.
           | A reservation of rights for works accessible online is only
           | effective if it is in machine-readable form.
        
         | codyb wrote:
         | The sign on the door said "no scrapers", which as far as I know
         | is not a protected class.
        
         | davesque wrote:
         | I mean, it costs money to host content. If you are hosting
        | content for bots, fine; but if the money you're paying to host
        | it is meant to benefit human users (the reason for robots.txt),
         | copyrighted. Honestly, I don't even know why I'm bothering to
         | mention these things because it just feels obvious. LLM
         | scrapers obviously want as much data as they can get, whether
         | or not they act like assholes (ignoring robots.txt) or
         | criminals (ignoring copyright) to get it.
        
         | j2kun wrote:
         | You should not have to ask for permission, but you should have
         | to honestly set your user-agent. (In my opinion, this should be
         | the law and it should be enforced)
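        | 
        | E.g. (a sketch; the bot name and contact URL are made up):
        | 
        |     import requests
        | 
        |     requests.get(
        |         "https://example.com/page",
        |         headers={"User-Agent":
        |                  "ExampleBot/1.0 "
        |                  "(+https://example.com/bot-info)"})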
        
       | vale11_amo2 wrote:
       | Hackers
        
         | edm0nd wrote:
         | one of the best movies, yes.
        
       | bigbuppo wrote:
       | Sounds like you should give the bots exactly what they want... a
       | 512MB file of random data.
        
         | kelnos wrote:
         | Most people have to pay for their bandwidth, though. That's a
         | lot of data to send out over and over.
        
           | jcheng wrote:
           | 512MB file of incredibly compressible data, then?
        
         | aDyslecticCrow wrote:
          | A scraper sinkhole of randomly generated, inter-linked files
          | filled with AI poison could work. No human would click that
          | link, so it leads to the "exclusive club".
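          | 
          | A sketch of such a sinkhole (Flask; the /maze path is
          | made up). Each page is seeded by its own name, so the
          | maze is infinite but stable across visits:
          | 
          |     import random
          |     from flask import Flask
          | 
          |     app = Flask(__name__)
          | 
          |     @app.route("/maze/<page>")
          |     def maze(page):
          |         rng = random.Random(page)  # stable seed
          |         junk = " ".join(
          |             str(rng.random()) for _ in range(200))
          |         links = "".join(
          |             '<a href="/maze/%d">next</a>'
          |             % rng.getrandbits(32) for _ in range(5))
          |         return junk + links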
        
           | oytis wrote:
            | Outbound traffic normally costs more than inbound, so the
            | asymmetry is set up wrong here. Data poisoning is probably
            | the way.
        
             | zahlman wrote:
             | > Outbound traffic normally costs more than inbound one, so
             | the asymmetry is set up wrong here.
             | 
             | That's what zip bombs are for.
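              | 
              | E.g. a gigabyte of zeroes gzips down to about a
              | megabyte (a sketch; serving it with
              | Content-Encoding: gzip is the usual trick):
              | 
              |     import zlib
              | 
              |     gz = zlib.compressobj(9, zlib.DEFLATED,
              |                           31)  # 31 -> gzip
              |     bomb = b""
              |     for _ in range(1024):  # 1024 x 1 MiB
              |         bomb += gz.compress(b"\0" * 2**20)
              |     bomb += gz.flush()
              |     print(len(bomb))  # ~1 MB on the wire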
        
       | winddude wrote:
        | I wish I could downvote.
        
       | stevage wrote:
       | The title is confusing, should be "commented-out".
        
         | pimlottc wrote:
         | Agree, I thought maybe this was going to be a script to block
         | AI scrapers or something like that.
        
           | zahlman wrote:
           | I thought it was going to be AI scraper operators getting
           | annoyed that they have to run reasoning models on the scraped
           | data to make use of it.
        
       | hexage1814 wrote:
       | I like and support web scrapers. It is even funnier when the site
          | owners don't like it.
        
         | ang_cire wrote:
         | Yep. Robots.txt is a framework intended for performance, not a
         | legal or ethical imperative.
         | 
         | If you want to _control_ how someone _accesses_ something, the
         | onus is on you to put _access controls_ in place.
         | 
         | The people who put things on a public, un-restricted server and
         | then complain that the public accessed it in an un-restricted
        | way _might_ be excusable if it's some geocities-esque Mom and
         | Pop site that has no reason to know better, but 'cryptography
         | dog' ain't that.
        
       | mikeiz404 wrote:
        | Two thoughts here when it comes to poisoning unwanted LLM
        | training-data traffic:
        | 
        | 1) A coordinated effort among different sites will have a much
        | greater chance of poisoning a model's data, so long as the
        | sites can avoid any post-scraping deduplication or filtering.
       | 
       | 2) I wonder if copyright law can be used to amplify the cost of
       | poisoning here. Perhaps if the poisoned content is something
        | which has already been shown to be aggressively litigated
        | against, then the copyright owner will go after them when the
        | model can be
       | shown to contain that banned data. This may open up site owners
       | to the legal risk of distributing this content though... not
       | sure. A cooperative effort with a copyright holder may sidestep
       | this risk but they would have to have the means and want to
       | litigate.
        
       | renegat0x0 wrote:
        | Most web scrapers, even the illegal ones, are for... business.
        | So they scrape Amazon, or shops. So yeah. Most unwanted traffic
        | is from big tech, or from bad actors sniffing for
        | vulnerabilities.
        | 
        | I know a thing or two about web scraping.
        | 
        | Some sites return status code 404 as protection, so that you
        | skip the site; my crawler then hammers away with several faster
        | crawling methods (curl_cffi).
        | 
        | Zip bombs also don't work on me. Reading the Content-Length
        | header is enough to decide not to read the page/file. I set a
        | byte limit to check that the response is not too big for me;
        | for other cases a read timeout is enough.
        | 
        | Oh, and did you know that the requests timeout is not really a
        | timeout for reading the page? A server can spoonfeed you bytes,
        | one after another, and the timeout will never fire.
        | 
        | That is why I created my own crawling system to mitigate these
        | problems, with one consistent means of running Selenium.
       | 
       | https://github.com/rumca-js/crawler-buddy
       | 
       | Based on library
       | 
       | https://github.com/rumca-js/webtoolkit
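        | 
        | On the spoonfeeding point: the requests timeout applies
        | between bytes, not to the whole read, so a defensive fetch
        | needs its own wall-clock deadline (a sketch; the limits are
        | arbitrary):
        | 
        |     import time
        |     import requests
        | 
        |     url = "https://example.com/"
        |     MAX_BYTES, DEADLINE = 5_000_000, 30
        | 
        |     r = requests.get(url, stream=True, timeout=10)
        |     size = int(r.headers.get("Content-Length", 0))
        |     if size > MAX_BYTES:
        |         r.close()  # skip without reading the body
        |     else:
        |         start, body = time.monotonic(), b""
        |         for chunk in r.iter_content(8192):
        |             body += chunk
        |             too_big = len(body) > MAX_BYTES
        |             too_slow = (time.monotonic() - start
        |                         > DEADLINE)
        |             if too_big or too_slow:
        |                 r.close()  # oversized or drip-fed
        |                 break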
        
       ___________________________________________________________________
       (page generated 2025-10-31 23:00 UTC)