[HN Gopher] AI scrapers request commented scripts
___________________________________________________________________
AI scrapers request commented scripts
Author : ColinWright
Score : 160 points
Date : 2025-10-31 15:44 UTC (7 hours ago)
(HTM) web link (cryptography.dog)
(TXT) w3m dump (cryptography.dog)
| rokkamokka wrote:
| I'm not overly surprised, it's probably faster to search the text
| for http/https than parse the DOM
| embedding-shape wrote:
    | Not probably: searching through plaintext (which they seem to
    | be doing) vs. iterating over the DOM involve vastly different
    | amounts of work in terms of resources used and performance.
    | "Probably" is way underselling the difference :)
| franktankbank wrote:
| Reminds me of the shortcut that works for the happy path but
    | is utterly fucked by real data. This is an interesting trap;
    | can it easily be avoided without walking the DOM?
| embedding-shape wrote:
    | Yes, parse out HTML comments, which is kind of trivial if
    | you've ever done any sort of parsing: listen for "<!--" and,
    | whenever you come across it, ignore everything until the
    | next "-->". But then again, these people are using AI to
    | build scrapers, so I wouldn't put too much pressure on them
    | to produce high-quality software.
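    |
    | A minimal sketch of that comment-stripping pass in Python
    | (illustrative only, not the scrapers' actual code; note the
    | <script> caveat raised in the replies below):
    |
    |   import re
    |
    |   def strip_html_comments(html: str) -> str:
    |       # Drop everything between "<!--" and the next "-->".
    |       # Naive: a "<!--" inside a <script> with no matching
    |       # "-->" would swallow the rest of the document.
    |       return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    |
    |   def visible_urls(html: str) -> list[str]:
    |       # Only URLs that survive comment removal.
    |       return re.findall(r"https?://[^\s\"'<>]+",
    |                         strip_html_comments(html))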
| stevage wrote:
| Lots of other ways to include URLs in an HTML document
| that wouldn't be visible to a real user, though.
| jcheng wrote:
| It's not quite as trivial as that; one could start the
| page with a <script> tag that contains "<!--" without
| matching "-->", and that would hide all the content from
| your scraper but not from real browsers.
|
    | But I think it's moot; parsing HTML is not very expensive
    | if you don't have to actually render it.
| Noumenon72 wrote:
| It doesn't seem that abusive. I don't comment things out thinking
| "this will keep robots from reading this".
| michael1999 wrote:
| Crawlers ignoring robots.txt is abusive. That they then start
| scanning all docs for commented urls just adds to the pile of
| scummy behaviour.
| tveyben wrote:
| Human behavior is interesting - me, me, me...
| mostlysimilar wrote:
| The article mentions using this as a means of detecting bots,
| not as a complaint that it's abusive.
|
| EDIT: I was chastised, here's the original text of my comment:
| Did you read the article or just the title? They aren't
| claiming it's abusive. They're saying it's a viable signal to
| detect and ban bots.
| pseudalopex wrote:
| Please don't comment on whether someone read an article. "Did
| you even read the article? It mentions that" can be shortened
| to "The article mentions that".[1]
|
| [1] https://news.ycombinator.com/newsguidelines.html
| woodrowbarlow wrote:
| the first few words of the article are:
|
| > Last Sunday I discovered some abusive bot behaviour [...]
| mostlysimilar wrote:
| > The robots.txt for the site in question forbids all
| crawlers, so they were either failing to check the policies
| expressed in that file, or ignoring them if they had.
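    |
    | For reference, the machine-readable "forbid all crawlers"
    | policy described there is just two lines of robots.txt:
    |
    |   User-agent: *
    |   Disallow: /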
| foobarbecue wrote:
| Yeah but the abusive behavior is ignoring robots.txt and
| scraping to train AI. Following commented URLs was not the
| crime, just evidence inadvertently left behind.
| ang_cire wrote:
| They call the scrapers "malicious", so they are definitely
| complaining about them.
|
| > A few of these came from user-agents that were obviously
| malicious:
|
    | (I love the idea that they consider any Python or Go request
    | to be a malicious scraper...)
| latenightcoding wrote:
    | when I used to crawl the web, battle-tested Perl regexes were
    | more reliable than anything else; commented URLs would have been
    | added to my queue.
| rightbyte wrote:
    | DOM navigation for fetching some data is for tryhards. Using a
    | regex to grab the correct paragraph or div or whatever is fine
    | and more robust against things moving around on the page.
| chaps wrote:
| Doing both is fine! Just, once you've figured out your regex
| and such, hardening/generalizing demands DOM iteration. It
    | sucks, but it is what it is.
| horseradish7k wrote:
| but not when crawling. you don't know the page format in
| advance - you don't even know what the page contains!
| OhMeadhbh wrote:
| I blame modern CS programs that don't teach kids about parsing.
| The last time I looked at some scraping code, the dev was using
| regexes to "parse" html to find various references.
|
    | Maybe that's a way to defend against bots that ignore robots.txt:
    | include a reference to a honeypot HTML file with garbage text,
    | but put the link to it in a comment.
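    |
    | A minimal sketch of that defence (hypothetical Flask app; the
    | honeypot path and ban logic are made up for illustration):
    |
    |   from flask import Flask, abort, request
    |
    |   app = Flask(__name__)
    |   banned: set[str] = set()
    |
    |   # Referenced only inside an HTML comment, so no human-driven
    |   # browser should ever request it.
    |   HONEYPOT = "/assets/honeypot.js"
    |
    |   @app.before_request
    |   def reject_banned():
    |       if request.remote_addr in banned:
    |           abort(403)
    |
    |   @app.route(HONEYPOT)
    |   def honeypot():
    |       # Whoever followed the commented-out link is assumed a bot.
    |       banned.add(request.remote_addr)
    |       abort(403)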
| tuwtuwtuwtuw wrote:
| Do you think that if some CS programs taught parsing, the
| authors of the bot would parse the HTML to properly extract
| links, instead of just doing plain text search?
|
| I doubt it.
| ericmcer wrote:
    | How would you recommend doing it? If I was just trying to pull
    | <a/> tag links out, I feel like treating it as text and using
    | regex would be way more efficient than a full-on HTML parser
    | like JSDOM or something.
| singron wrote:
    | You don't need JavaScript to parse HTML. Just use an HTML
    | parser. They are very fast. HTML isn't a regular language, so
    | you can't parse it with regular expressions.
|
| Obligatory:
| https://stackoverflow.com/questions/1732348/regex-match-
| open...
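    |
    | A sketch with Python's stdlib parser, for instance; it
    | dispatches comments to a separate handler, so commented-out
    | links are skipped for free:
    |
    |   from html.parser import HTMLParser
    |
    |   class LinkExtractor(HTMLParser):
    |       def __init__(self):
    |           super().__init__()
    |           self.links: list[str] = []
    |
    |       def handle_starttag(self, tag, attrs):
    |           # attrs is a list of (name, value) pairs.
    |           if tag == "a":
    |               for name, value in attrs:
    |                   if name == "href" and value:
    |                       self.links.append(value)
    |
    |   ex = LinkExtractor()
    |   ex.feed("<a href='/real'>x</a><!-- <a href='/trap'>y</a> -->")
    |   print(ex.links)  # ['/real']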
| zahlman wrote:
| The point is: if you're trying to find all the URLs within
| the page source, it doesn't really matter to you what tags
| they're in, or how the document is structured, or even
| whether they're given as link targets or in the readable
| text or just what.
| vaylian wrote:
| The people who do this type of scraping to feed their AI are
| probably also using AI to write their scraper.
| mikeiz404 wrote:
    | It's been some time since I have dealt with web scrapers, but it
    | takes fewer resources to run a regex than it does to parse the
    | DOM (which may have syntactically incorrect parts anyway). This
    | can add up when running many scraping requests in parallel. So,
    | depending on your goals, using a regex can be much preferred.
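    |
    | A rough way to see the gap (illustrative micro-benchmark;
    | numbers vary by machine and document):
    |
    |   import re, time
    |   from html.parser import HTMLParser
    |
    |   html = ('<p>' + 'filler ' * 20 +
    |           '<a href="https://example.org/p">x</a></p>') * 10_000
    |
    |   t0 = time.perf_counter()
    |   re.findall(r'https?://[^\s"\'<>]+', html)
    |   regex_s = time.perf_counter() - t0
    |
    |   class Sink(HTMLParser):
    |       def handle_starttag(self, tag, attrs):
    |           pass  # discard; we only pay the parsing cost
    |
    |   t0 = time.perf_counter()
    |   Sink().feed(html)
    |   parse_s = time.perf_counter() - t0
    |   print(f"regex {regex_s:.3f}s vs parse {parse_s:.3f}s")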
| sharkjacobs wrote:
| Fun to see practical applications of interesting research[1]
|
    | [1] https://news.ycombinator.com/item?id=45529587
| bakql wrote:
| >These were scrapers, and they were most likely trying to non-
| consensually collect content for training LLMs.
|
| "Non-consensually", as if you had to ask for permission to
| perform a GET request to an open HTTP server.
|
| Yes, I know about weev. That was a travesty.
| XenophileJKO wrote:
| What about people using an LLM as their web client? Are you now
| saying the website owner should be able to dictate what client
| I use and how it must behave?
| aDyslecticCrow wrote:
| > Are you now saying the website owner should be able to
| dictate what client I use and how it must behave?
|
    | Already pretty well established with ad-blocking, actually.
    | It's a pretty similar case, even: AIs don't click ads, so why
    | should we accept their traffic? If it's disproportionately
    | loading the server without contributing to the funding of the
    | site, it gets blocked.
|
| The server can set whatever rules it wants. If the maintainer
| hates google and wants to block all chrome users, it can do
| so.
| Calavar wrote:
| I agree. It always surprises me when people are indignant about
| scrapers ignoring robots.txt and throw around words like
| "theft" and "abuse."
|
| robots.txt is a polite request to please not scrape these pages
| because it's probably not going to be productive. It was never
| meant to be a binding agreement, otherwise there would be a
| stricter protocol around it.
|
| It's kind of like leaving a note for the deliveryman saying
| please don't leave packages on the porch. It's fine for low
| stakes situations, but if package security is of utmost
| importance to you, you should arrange to get it certified or to
| pick it up at the delivery center. Likewise if enforcing a rule
| of no scraping is of utmost importance you need to require an
| API token or some other form of authentication before you serve
| the pages.
| hsbauauvhabzb wrote:
| How else do you tell the bot you do not wish to be scraped?
| Your analogy is lacking - you didn't order a package, you
| never wanted a package, and the postman is taking something,
| not leaving it, and you've explicitly left a sign saying 'you
| are not welcome here'.
| bakql wrote:
| Stop your http server if you do not wish to receive http
| requests.
| vkou wrote:
| Turn off your phone if you don't want to receive robo-
| dialed calls and unsolicited texts 300 times a day.
|
| Fence off your yard if you don't want people coming by
| and dumping a mountain of garbage on it every day.
|
| You can certainly choose to live in a society that thinks
| these are acceptable solutions. I think it's bullshit,
| and we'd all be better off if anyone doing these things
| would be breaking rocks with their teeth in a re-
| education camp, until they learn how to be a decent human
| being.
| Calavar wrote:
| If you are serving web pages, you are soliciting GET
| requests, kind of like ordering a package is soliciting a
| delivery.
|
| "Taking" versus "giving" is neither here nor there for this
| discussion. The question is are you expressing a preference
| on etiquette versus a hard rule that must be followed. I
| personally believe robots.txt is the former, and I say that
    | as someone who serves more pages than they scrape.
| yuliyp wrote:
| Having a front door physically allows anyone on the
| street to come to knock on it. Having a "no soliciting"
| sign is an instruction clarifying that not everybody is
| welcome. Having a web site should operate in a similar
| fashion. The robots.txt is the equivalent of such a sign.
| halJordan wrote:
| No soliciting signs are polite requests that no one has
| to follow, and door to door salesman regularly walk right
| past them.
|
| No one is calling for the criminalization of door-to-door
| sales and no one is worried about how much door-to-door
| sales increases water consumption.
| oytis wrote:
| > door to door salesman regularly walk right past them.
|
| Oh, now I understand why Americans can't see a problem
| here.
| ahtihn wrote:
| If a company was sending hundreds of salesmen to knock at
| a door one after the other, I'm pretty sure they could
| successfully get sued for harassment.
| czscout wrote:
| And a no soliciting sign is no more cosmically binding
| than robots.txt. It's a request, not an enforceable
| command.
| munk-a wrote:
    | I disagree strongly here - though not from a technical
    | perspective. There's absolutely a legal concept of making
    | your work available for viewing without making it
    | available for copying, and AI scraping (while we can
    | technically phrase it as just viewing a bunch of times)
    | is effectively copying.
|
    | Let's say a large art hosting site realizes how damaging
| AI training on their data can be - should they respond by
| adding a paywall before any of their data is visible? If
| that paywall is added (let's just say $5/mo) can most of
| the artists currently on their site afford to stay there?
| Can they afford it if their potential future patrons are
| limited to just those folks who can pay $5/mo? Would the
| scraper be able to afford a one time cost of $5 to scrape
| all of that data?
|
    | I think, as much as they are a deeply flawed concept, this
    | is a case where EULAs, or an assumption of no access for
    | training unless explicitly granted that is actually
    | enforced through the legal system, are required. There are
| a lot of small businesses and side projects that are
| dying because of these models and I think that creative
| outlet has societal value we would benefit from
| preserving.
| jMyles wrote:
| > There's absolutely a legal concept of making your work
| available for viewing without making it available for
| copying
|
| This "legal concept" is enforceable through legacy
| systems of police and violence. The internet does not
| recognize it. How much more obvious can this get?
|
| If we stumble down the path of attempting to apply this
| legal framework, won't some jurisdiction arise with no IP
| protections whatsoever and just come to completely
| dominate the entire economy of the internet?
|
| If I can spin up a server in copyleftistan with a
| complete copy of every album and film ever made,
| available for free download, why would users in
| copyrightistan use the locked down services of their
| domestic economy?
| kelnos wrote:
| > _legacy systems of police and violence_
|
| You use "legacy" as if these systems are obsolete and on
| their way out. They're not. They're here to stay, and
| will remain dominant, for better or worse. Calling them
| "legacy" feels a bit childish, as if you're trying to
| ignore reality and base arguments on your preferred
| vision of how things should be.
|
| > _The internet does not recognize it._
|
| Sure it does. Not universally, but there are a _lot_ of
| things governments and law enforcement can do to control
| what people see and do on the internet.
|
| > _If we stumble down the path of attempting to apply
    | this legal framework, won't some jurisdiction arise with
| no IP protections whatsoever and just come to completely
| dominate the entire economy of the internet?_
|
| No, of course not, that's silly. That only really works
| on the margins. Any other country would immediately slap
| economic sanctions on that free-for-all jurisdiction and
| cripple them. If that fails, there's always a military
| response they can resort to.
|
| > _If I can spin up a server in copyleftistan with a
| complete copy of every album and film ever made,
| available for free download, why would users in
| copyrightistan use the locked down services of their
| domestic economy?_
|
| Because the governments of all the copyrightistans will
| block all traffic going in and out of copyleftistan.
| While this may not stop determined, technically-adept
| people, it will work for the most part. As I said, this
| sort of thing only really works on the margins.
| andoando wrote:
    | Well yes, this is exactly what's happening as of now. But
    | there SHOULD be a way to upload content without giving
    | scrapers access to it.
| kelnos wrote:
| > _If you are serving web pages, you are soliciting GET
| requests_
|
| So what's the solution? How do I host a website that
| welcomes human visitors, but rejects all scrapers?
|
| There is no mechanism! The best I can do is a cat-and-
| mouse arms race where I try to detect the traffic I don't
| want, and block it, while the people generating the
| traffic keep getting more sophisticated about hiding from
| my detection.
|
| No, putting up a paywall is not a reasonable response to
| this.
|
| > _The question is are you expressing a preference on
| etiquette versus a hard rule that must be followed._
|
| Well, there really aren't any hard rules that must be
| followed, because there are no enforcement mechanisms
| outside of going nuclear (requiring login). _Everything_
| is etiquette. And I agree that robots.txt is _also_
| etiquette, and it is super messed up that we tolerate
| "AI" companies stomping all over that etiquette.
|
| Do we maybe want laws that say everyone must respect
| robots.txt? Maybe? But then people will just move their
| scrapers to a jurisdiction without those laws. And I'm
| sure someone could make the argument that robots.txt
| doesn't apply to them because they spoofed a browser
| user-agent (or another user-agent that a site explicitly
| allows). So perhaps we have a new mechanism, or new laws,
| or new... something.
|
| But this all just highlights the point I'm making here:
| there is no reasonable mechanism (no, login pages and
| http auth don't count) for site owners to restrict access
| to their site based on these sorts of criteria. And
| that's a problem.
| davesque wrote:
| If I order a package from a company selling a good, am I
| inviting all that company's competitors to show up at my
| doorstep to try and outbid the delivery person from the
| original company when they arrive, and maybe they all
| show up at the same time and cause my porch to collapse?
| No, because my front porch is a limited resource for
| which I paid for an intended purpose. Is it illegal for
| those other people to show up? Maybe not by the letter of
| the law.
| davsti4 wrote:
    | It's simple, and I'll quote myself: "robots.txt isn't the
    | law".
| nkrisc wrote:
| Put your content behind authentication if you don't want it
| to be requested by just anyone.
| kelnos wrote:
| But I _do_ want my content accessible to "just anyone",
    | as long as they are humans. I _don't_ want it accessible
| to bots.
|
| You are free to say "well, there is no mechanism to do
| that", and I would agree with you. That's the problem!
| stray wrote:
| You require something the bot won't have that a human
| would.
|
| Anybody may watch the demo screen of an arcade game for
| free, but you have to insert a quarter to play -- and you
| can have even greater access with a key.
|
| > and you've explicitly left a sign saying 'you are not
| welcome here'
|
    | And the sign said "Long-haired freaky people need not apply"
    | So I tucked my hair up under my hat and I went in to ask him why
    | He said, "You look like a fine upstandin' young man, I think
    | you'll do"
    | So I took off my hat and said, "Imagine that, huh, me workin'
    | for you"
| michaelt wrote:
    | _> You require something the bot won't have that a human
| would._
|
| Is this why the "open web" is showing me a captcha or
| two, along with their cookie banner and newsletter pop up
| these days?
| whimsicalism wrote:
| There's an evolving morality around the internet that is
| very, very different from the pseudo-libertarian rule of the
| jungle I was raised with. Interesting to see things change.
| sethhochberg wrote:
| The evolutionary force is really just "everyone else showed
| up at the party". The Internet has gone from a capital-I
| thing that was hard to access, to a little-i internet that
| was easier to access and well known but still largely
| distinct from the real world, to now... just the real world
| in virtual form. Internet morality mirrors real world
| morality.
|
| For the most part, everybody is participating now, and that
| brings all of the challenges of any other space with
| everyone's competing interests colliding - but fewer
| established systems of governance.
| hdgvhicv wrote:
    | Based on the comments here, the polite world of the internet
    | where people obeyed unwritten best practices is certainly
    | over, in favour of "grab what you can, might makes right".
| bigbuppo wrote:
| Seriously. Did you see what that web server was wearing? I
| mean, sure it said "don't touch me" and started screaming for
| help and blocked 99.9% of our IP space, but we got more and
| they didn't block that so clearly they weren't serious. They
| were asking for it. It's their fault. They're not really
| victims.
| jMyles wrote:
| Sexual consent is sacred. This metaphor is in truly bad
| taste.
|
| When you return a response with a 200-series status code,
| you've granted consent. If you don't want to grant consent,
| change the logic of the server.
| jraph wrote:
| > When you return a response with a 200-series status
| code, you've granted consent. If you don't want to grant
| consent, change the logic of the server.
|
| "If you don't consent to me entering your house, change
| its logic so that picking the door's lock doesn't let me
| open the door"
|
| Yeah, well...
|
    | As if the LLM scrapers didn't try everything under the sun,
    | like using millions of different residential IPs, to prevent
    | admins from "changing the logic of the server" so it doesn't
    | "return a response with a 200-series status code" when they
    | don't agree to this scraping.
|
| As if there weren't broken assumptions that make "When
| you return a response with a 200-series status code,
| you've granted consent" very false.
|
| As if technical details were good carriers of human
| intents.
| ryandrake wrote:
| The locked door is a ridiculous analogy when it comes to
| the open web. Pretty much all "door" analogies are
| flawed, but sure let's imagine your web server has a
| door. If you want to actually lock the door, you're more
| than welcome to put an authentication gate around your
| content. A web server that accepts a GET request and
| replies 2xx is distinctly NOT "locked" in any way.
| jraph wrote:
| Any analogy is flawed and you can kill most analogies
| very fast. They are meant to illustrate a point hopefully
| efficiently, not to be mathematically true. They are not
| to everyone's taste, me included in most cases. They are
| mostly fine as long as they are not used to make a point,
| but only to illustrate it.
|
| I agree with this criticism of this analogy, I actually
| had this flaw in mind from the start. There are other
| flaws I have in mind as well.
|
    | I have developed the point more, without the analogy, in the
    | remainder of the comment. How about we focus on the crux
    | of the matter?
|
| > A web server that accepts a GET request and replies 2xx
| is distinctly NOT "locked" in any way
|
    | The point is that these scrapers use tricks so that it's
    | difficult not to grant them access. What is unreasonable
    | here is to think that 200 means consent, especially
    | knowing about the tricks.
|
| Edit:
|
| > you're more than welcome to put an authentication gate
| around your content.
|
    | I don't want to. Adding auth so LLM providers don't abuse
    | my servers and the work I meant to share publicly is not
    | a working solution.
| jack_pp wrote:
    | Here's my analogy: it's like you own a museum and you
    | require entrance by a "secret" password (your user-agent
    | filtering or whatnot). The problem is that the password is
    | the same for everyone, so would you be surprised when
    | someone figures it out, or gets it from a friend, and they
    | visit your museum? Either require a fee (processing
    | power, captcha, etc.) or make a private password (auth).
|
    | It is inherently a cat-and-mouse game that you CHOOSE to
    | play. Either implement throttling for clients that
    | consume too many resources on your server, or require auth
    | / captcha / javascript / whatever whenever the client is
    | using too many resources. If the client still chooses to
    | go through the hoops you implemented, then I don't see any
    | issue. If you still have an issue, then implement more hoops
    | until you're satisfied.
| jraph wrote:
| > Either require a fee (processing power, captcha etc) or
| make a private password (auth)
|
| Well, I shouldn't have to work or make things worse for
| everybody because the LLM bros decided to screw us.
|
| > It is inherently a cat and mouse game that you CHOOSE
| to play
|
| No, let's not reverse the roles and blame the victims
| here. We sysadmins and authors are willing to share our
| work publicly to the world but never asked for it to be
| abused.
| ryandrake wrote:
| People need to have a better mental model of what it
| means to host a public web site, and what they are
| actually doing when they run the web server and point it
| at a directory of files. They're not just serving those
| files to customers. They're not just serving them to
| members. They're not just serving them to human beings.
| They're not even necessarily serving files to web
| browsers. They're serving files to every IP address (no
| matter what machine is attached to it) that is capable of
| opening a socket and sending GET. There's no such
| distinct thing as a scraper--and if your mental model
| tries to distinguish between a scraper and a human user,
| you're going to be disappointed.
|
| As the web server operator, you can try to figure out if
| there's a human behind the IP, and you might be right or
| wrong. You can try to figure out if it's a web browser,
| or if it's someone typing in curl from a command line, or
| if it's a massively parallel automated system, and you
| might be right or wrong. You can try to guess what
| country the IP is in, and you might be right or wrong.
| But if you really want to actually limit access to the
| content, you shouldn't be publishing that content
| publicly.
| jraph wrote:
| > There's no such distinct thing as a scraper--and if
| your mental model tries to distinguish between a scraper
| and a human user, you're going to be disappointed.
|
| I disagree. If your mental model doesn't allow
| conceptualizing (abusive) scrapers, it is too
    | simplistic to be useful to understand and deal with
| reality.
|
| But I'd like to re-state the frame / the concern: it's
| not about _any_ bot or _any_ scraper, it is about the
| despicable behavior of LLM providers and their awful
    | scrapers.
|
| I'm personally fine with bots accessing my web servers,
| there are many legitimate use cases for this.
|
| > But if you really want to actually limit access to the
| content, you shouldn't be publishing that content
| publicly.
|
| It is not about denying access to the content to some and
| allowing access to others.
|
| It is about having to deal with abuses.
|
| Is a world in which people stop sharing their work
| publicly because of these abuses desirable? Hell no.
| oytis wrote:
| Technically, you are not serving anything - it's just
| voltage levels going up and down with no meaning at all.
| Larrikin wrote:
| >I don't like how your metaphor is an effective metaphor
| for the situation so it's in bad taste.
| jack_pp wrote:
    | If you absolutely want a sexual metaphor, it's more like you
    | snuck into the world record for how many sexual partners
    | a woman can take in 24h, and even though you aren't on the
    | list you still got to smash.
|
    | The solution is the same: implement better security.
| LexGray wrote:
    | Perhaps bad taste, but bots could also be deliberately
    | violating the most private or traumatizing moments a
    | vulnerable person has, in any exploitative way they care
    | to. I am not sure bad taste is enough of an excuse not to
    | discuss the issue, as many people do in fact use the
    | internet for sexual things. If anything, consent should be
    | MORE important here because it is easier to document and
    | verify.
|
| A vast hoard of personal information exists and most of
| it never had or will have proper consent, knowledge, or
| protection.
| mxkopy wrote:
| The metaphor doesn't work. It's not the security of the
| package that's in question, but something like whether the
| delivery person is getting paid enough or whether you're
| supporting them getting replaced by a robot. The issue is in
| the context, not the protocol.
| kelnos wrote:
| > _robots.txt is a polite request to please not scrape these
| pages_
|
| People who ignore polite requests are assholes, and we are
| well within our rights to complain about them.
|
| I agree that "theft" is too strong (though I think you might
| be presenting a straw man there), but "abuse" can be
| perfectly apt: a crawler hammering a server, requesting the
| same pages over and over, absolutely is abuse.
|
| > _Likewise if enforcing a rule of no scraping is of utmost
| importance you need to require an API token or some other
| form of authentication before you serve the pages._
|
| That's a shitty world that we shouldn't have to live in.
| watwut wrote:
    | If you ignore a polite request, then it is perfectly OK to give
    | you as much false data as possible. You have shown yourself
    | not interested in good-faith cooperation, and that means other
    | people can and should treat you as a jerk.
| isodev wrote:
| Ah yes, the "it's ok because I can" school of thought. As if
| that was ever true.
| munk-a wrote:
| I think there's a massive shift in what the letter of the law
| needs to be to match the intent. The letter hasn't changed and
    | this is all still quite legal - but there is a significant
    | difference between what web scraping did to impact creative
    | lives five years ago and what it does today. It was always
    | possible for artists to have their content stolen and for
    | creative works to be reposted - but there were enough IP laws
    | around image sharing (which AI disingenuously steps around),
    | and other creative work wasn't monetarily efficient to scrape.
|
| I think there is a really different intent to an action to read
| something someone created (which is often a form of marketing)
| and to reproduce but modify someone's creative output (which
| competes against and starves the creative of income).
|
| The world changed really quickly and our legal systems haven't
| kept up. It is hurting real people who used to have small side
| businesses.
| Lionga wrote:
    | So if a house is not locked I can take whatever I want?
| Ylpertnodi wrote:
    | Yes, but you may get caught, and then suffer 'consequences'.
    | I can drive well over 220 km/h on the autobahn (Germany,
    | Europe), and also in France (also in Europe). One is
    | acceptable; the other will get me Royale-e fucked. If they can
    | catch me.
| arccy wrote:
| yeah all open HTTP servers are fair game for DDoS because well
| it's open right?
| sdenton4 wrote:
    | The problem is that serving content costs money. LLM scraping
    | is essentially DDoSing content meant for human consumption.
    | DDoSing sucks.
| 2OEH8eoCRo0 wrote:
| Scraping is legal. DDoSing isn't.
|
| We should start suing these bad actors. Why do techies forget
| that the legal system exists?
| ColinWright wrote:
    | There is no way that you can sue the people responsible for
    | DDoSing your system. Even if you can find them ... and you
    | won't ... they're as likely as not outside your jurisdiction
    | (they might be in Russia, or China, or Bolivia, or anywhere),
    | and they will have a lot more money than you.
|
    | People here on HN are laughing at the UK's Online Safety Act
    | for trying to impose restrictions on people in other
    | countries, and yet now you're implying that similar
    | restrictions can be placed on people in other countries,
    | over whom you have neither power nor control.
| jraph wrote:
| When I open an HTTP server to the public web, I expect and
| welcome GET requests in general.
|
| However,
|
| (1) there's a difference between (a) a regular user browsing my
| websites and (b) robots DDoSing them. It was never okay to
| hammer a webserver. This is not new, and it's for this reason
| that curl has had options to throttle repeated requests to
| servers forever. In real life, there are many instances of
    | things being offered for free; it's usually not okay to take it
| all. Yes, this would be abuse. And no, the correct answer to
| such a situation would not be "but it was free, don't offer it
| for free if you don't want it to be taken for free". Same thing
| here.
|
| (2) there's a difference between (a) a regular user reading my
| website or even copying and redistributing my content as long
| as the license of this work / the fair use or related laws are
| respected, and (b) a robot counterfeiting it (yeah, I agree
| with another commenter, theft is not the right word, let's call
| a spade a spade)
|
| (3) well-behaved robots are expected to respect robots.txt.
    | This is not the law; this is about being respectful. It is
    | only fair that badly behaved robots get called out.
|
    | Well-behaved robots do not usually use millions of residential
| IPs through shady apps to "Perform a get request to an open
| HTTP server".
| Cervisia wrote:
| > robots.txt. This is not the law
|
    | In Germany, it is the law. § 44b UrhG says (translated):
|
| (1) Text and data mining is the automated analysis of one or
| more digital or digitized works to obtain information, in
| particular about patterns, trends, and correlations.
|
| (2) Reproductions of lawfully accessible works for text and
| data mining are permitted. These reproductions must be
| deleted when they are no longer needed for text and data
| mining.
|
| (3) Uses pursuant to paragraph 2, sentence 1, are only
| permitted if the rights holder has not reserved these rights.
| A reservation of rights for works accessible online is only
| effective if it is in machine-readable form.
| codyb wrote:
| The sign on the door said "no scrapers", which as far as I know
| is not a protected class.
| davesque wrote:
| I mean, it costs money to host content. If you are hosting
| content for bots fine, but if the money you're paying to host
| it is meant to benefit human users (the reason for robots.txt)
| then yeah, you ought to ask permission. Content might also be
| copyrighted. Honestly, I don't even know why I'm bothering to
| mention these things because it just feels obvious. LLM
| scrapers obviously want as much data as they can get, whether
| or not they act like assholes (ignoring robots.txt) or
| criminals (ignoring copyright) to get it.
| j2kun wrote:
| You should not have to ask for permission, but you should have
| to honestly set your user-agent. (In my opinion, this should be
| the law and it should be enforced)
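    |
    | What honest self-identification looks like in practice
    | (hypothetical bot name and contact URL):
    |
    |   import requests
    |
    |   headers = {
    |       # Convention for well-behaved crawlers: name, version,
    |       # and a way to reach the operator.
    |       "User-Agent": "ExampleBot/1.0 (+https://example.org/bot)",
    |   }
    |   resp = requests.get("https://example.org/", headers=headers)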
| vale11_amo2 wrote:
| Hackers
| edm0nd wrote:
| one of the best movies, yes.
| bigbuppo wrote:
| Sounds like you should give the bots exactly what they want... a
| 512MB file of random data.
| kelnos wrote:
| Most people have to pay for their bandwidth, though. That's a
| lot of data to send out over and over.
| jcheng wrote:
| 512MB file of incredibly compressible data, then?
| aDyslecticCrow wrote:
    | A scraper sinkhole of randomly generated, inter-linked files
    | filled with AI poison could work. No human would click that
    | link, so it leads to the "exclusive club".
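    |
    | A sketch of such a sinkhole (hypothetical Flask app; real
    | tools in this vein exist, this is just the shape of the idea):
    |
    |   import random, string
    |   from flask import Flask
    |
    |   app = Flask(__name__)
    |
    |   def junk(n: int) -> str:
    |       # n pseudo-words of lowercase noise.
    |       return " ".join(
    |           "".join(random.choices(string.ascii_lowercase,
    |                                   k=random.randint(3, 9)))
    |           for _ in range(n))
    |
    |   @app.route("/maze/<page>")
    |   def maze(page):
    |       # Garbage prose plus five links deeper into the maze;
    |       # only a crawler ignoring robots.txt should land here.
    |       links = "".join(f'<a href="/maze/{junk(1)}">{junk(2)}</a> '
    |                       for _ in range(5))
    |       return f"<html><body><p>{junk(300)}</p>{links}</body></html>"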
| oytis wrote:
    | Outbound traffic normally costs more than inbound, so the
    | asymmetry is set up wrong here. Data poisoning is probably
    | the way.
| zahlman wrote:
| > Outbound traffic normally costs more than inbound one, so
| the asymmetry is set up wrong here.
|
| That's what zip bombs are for.
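    |
    | The numbers behind that (a quick check of how well zeros
    | compress):
    |
    |   import gzip
    |
    |   # ~100 MB of zeros gzips down to roughly 100 KB: cheap to
    |   # send, expensive for the client to inflate.
    |   payload = gzip.compress(b"\0" * (100 * 1024 * 1024))
    |   print(len(payload))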
| winddude wrote:
| i wish i could downvote.
| stevage wrote:
| The title is confusing, should be "commented-out".
| pimlottc wrote:
| Agree, I thought maybe this was going to be a script to block
| AI scrapers or something like that.
| zahlman wrote:
| I thought it was going to be AI scraper operators getting
| annoyed that they have to run reasoning models on the scraped
| data to make use of it.
| hexage1814 wrote:
    | I like and support web scrapers. It is even funnier when the
    | site owners don't like it.
| ang_cire wrote:
| Yep. Robots.txt is a framework intended for performance, not a
| legal or ethical imperative.
|
| If you want to _control_ how someone _accesses_ something, the
| onus is on you to put _access controls_ in place.
|
| The people who put things on a public, un-restricted server and
| then complain that the public accessed it in an un-restricted
    | way _might_ be excusable if it's some geocities-esque Mom and
| Pop site that has no reason to know better, but 'cryptography
| dog' ain't that.
| mikeiz404 wrote:
| Two thoughts here when it comes to poisoning unwanted LLM
    | training data traffic:
|
| 1) A coordinated effort among different sites will have a much
| greater chance of poisoning the data of a model so long as they
| can avoid any post scraping deduplication or filtering.
|
| 2) I wonder if copyright law can be used to amplify the cost of
| poisoning here. Perhaps if the poisoned content is something
| which has already been shown to be aggressively litigated against
| then the copyright owner will go after them when the model can be
| shown to contain that banned data. This may open up site owners
| to the legal risk of distributing this content though... not
| sure. A cooperative effort with a copyright holder may sidestep
| this risk but they would have to have the means and want to
| litigate.
| renegat0x0 wrote:
    | Most web scrapers, even if illegal, are for... business. So they
    | scrape Amazon, or shops. So yeah. Most unwanted traffic is from
    | big tech, or from bad actors trying to sniff out vulnerabilities.
|
| I know a thing or two about web scraping.
|
    | Some sites return status code 404 as protection, so that you
    | skip them; my crawler then tries, as a hammer, several
    | faster crawling methods (curl_cffi).
|
    | Zip bombs also don't work on me. Reading the Content-Length
    | header is enough to decide not to read the page/file. I set a
    | byte limit to check that the response is not too big for me;
    | for other cases a read timeout is enough.
|
    | Oh, and did you know that the requests timeout is not really a
    | timeout for the whole page read? A server can spoonfeed you
    | bytes, one after another, and there will be no timeout.
|
    | That is why I created my own crawling system to mitigate these
    | problems, and to have one consistent means of running Selenium.
|
| https://github.com/rumca-js/crawler-buddy
|
    | Based on the library
|
| https://github.com/rumca-js/webtoolkit
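    |
    | A sketch of those guards (hypothetical; a byte cap from
    | Content-Length plus a wall-clock deadline, since the requests
    | timeout is per read, not per page):
    |
    |   import time
    |   import requests
    |
    |   MAX_BYTES = 2 * 1024 * 1024  # skip bodies over 2 MiB
    |   DEADLINE = 15                # seconds allowed per page
    |
    |   def bounded_get(url: str) -> bytes | None:
    |       # timeout=5 bounds the connect and each read, not the
    |       # whole transfer, so a drip-feeding server never trips
    |       # it on its own; the deadline below does.
    |       with requests.get(url, stream=True, timeout=5) as resp:
    |           declared = resp.headers.get("Content-Length")
    |           if declared and int(declared) > MAX_BYTES:
    |               return None
    |           body, start = b"", time.monotonic()
    |           for chunk in resp.iter_content(8192):
    |               body += chunk
    |               if (len(body) > MAX_BYTES
    |                       or time.monotonic() - start > DEADLINE):
    |                   return None
    |           return body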
___________________________________________________________________
(page generated 2025-10-31 23:00 UTC)