[HN Gopher] OpenAI's bot crushed this seven-person company's web...
___________________________________________________________________
OpenAI's bot crushed this seven-person company's web site 'like a
DDoS attack'
Author : vednig
Score : 64 points
Date : 2025-01-10 21:21 UTC (1 hour ago)
(HTM) web link (techcrunch.com)
(TXT) w3m dump (techcrunch.com)
| ThrowawayTestr wrote:
| Has anyone been successfully sued for excess hosting costs due to
| scraping?
| neom wrote:
| https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn makes it
| clear scraping publicly available data is generally not a CFAA
| violation. Certainly it would have to be a civil matter, but I
| doubt it would work (ianal)
| ericholscher wrote:
| We did get $7k out of one of the AI companies based on the
| massive bandwidth costs they caused us.
|
| https://about.readthedocs.com/blog/2024/07/ai-crawlers-
| abuse...
| neom wrote:
| wow GOOD JOB!!! Were they relatively decent about it, is
| that why? I feel like normal businesses that are not super
| shady should be able to accept this kind of conversation
| and deal with the mistake and issues they caused for you.
|
| Good job pursuing it tho, that's fantastic. (ps, big fan of
| your product, great work on that too!)
| atleastoptimal wrote:
| Stuff like this will happen to all websites soon due to AI agents
| let loose on the web
| JohnMakin wrote:
| https://cyberpunk.fandom.com/wiki/Blackwall
| peterldowns wrote:
| I have little sympathy for the company in this article. If you
| put your content on the web, and don't require authentication to
| access it, it's going to be crawled and scraped. Most of the time
| you're happy about this -- you want search providers to index
| your content.
|
| It's one thing if a company ignores robots.txt and causes serious
| interference with the service, like Perplexity was doing, but the
| details here don't really add up: this company didn't have a
| robots.txt in place, and although the article mentions
| tens/hundreds of thousands of requests, they don't say anything
| about them being made unreasonably quickly.
|
| The default-public accessibility of information on the internet
| is a net-good for the technology ecosystem. Want to host things
| online? Learn how.
| JohnMakin wrote:
| robots.txt as of right now is a complete honor system, so I
| think it's reasonable to conclude that you shouldn't
| rely on it protecting you, because odds are overwhelming that
| scraping behavior will become worse in the near to mid term
| future
| j45 wrote:
| It's less about sympathy and more about understanding that they
| might not be experts in things tech, relied on hired help that
| seemed to be good at what they did, and the most basic thing
| (set up a free Cloudflare account or something) was missed.
|
| Learning how is sometimes actually learning who's going to get
| you online in a good way.
|
| In this case, when you have non-tech people building WordPress
| sites, it's about what they can understand and do, and the rate
| of learning doesn't always keep up relative to client work.
| fzeroracer wrote:
| > If you put your content on the web, and don't require
| authentication to access it, it's going to be crawled and
| scraped. Most of the time you're happy about this -- you want
| search providers to index your content
|
| > The default-public accessibility of information on the
| internet is a net-good for the technology ecosystem. Want to
| host things online? Learn how.
|
| These two statements are at odds, I hope you realize. You say
| public accessibility of information is a good thing, while
| blaming someone for being effectively DDOS'd as a result of
| having said information public.
| hd4 wrote:
| They're not at odds. "default-public accessibility of
| information" doesn't necessarily translate into "default-
| public accessibility of _content_", i.e. media. Content
| _should_ be served behind an authentication layer.
|
| The clickbaity hysteria here is missing how this sort of
| scraping has been possible long before AI agents showed up a
| couple of years back.
| agmater wrote:
| From the Wayback Machine [0] it seems they had a normal "open"
| set-up. They wanted to be indexed, but it's probably a fair
| concern that OpenAI isn't going to respect their image license.
| The article describes the robot.txt [sic] now "properly
| configured", but their solution was to block everything except
| Google, Bing, Yahoo, DuckDuckGo. That seems to be the smart
| thing these days, but it's a shame for any new search engines.
|
| [0]
| https://web.archive.org/web/20221206134212/https://www.tripl...
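|
| A minimal robots.txt along those lines might look like the
| sketch below (illustrative only; the user-agent tokens are the
| ones those engines publish, not anything taken from the site
| itself):
|
|     User-agent: Googlebot
|     User-agent: Bingbot
|     User-agent: Slurp
|     User-agent: DuckDuckBot
|     Disallow:
|
|     User-agent: *
|     Disallow: /
|
| Grouped user-agent lines share the rules that follow them, the
| empty Disallow permits everything for the listed crawlers, and
| the catch-all group blocks everyone else.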
| swatcoder wrote:
| > I have little sympathy for the company in this article. If
| you put your content on the web, and don't require
| authentication to access it, it's going to be crawled and
| scraped. Most of the time you're happy about this -- you want
| search providers to index your content.
|
| If I stock a Little Free Library at the end of my driveway,
| it's because I want people in the community to peruse and swap
| the books in a way that's intuitive to pretty much everyone who
| might encounter it.
|
| I shouldn't need to post a sign outside of it saying "Please
| don't just take all of these at once", and it'd be completely
| reasonable for me to feel frustrated if someone did misuse it
| -- regardless of whether the sign was posted or not.
| dghlsakjg wrote:
| There is nothing inherently illegal about filling a small store
| to occupancy capacity with all of your friends and never buying
| anything.
|
| Just because something is technically possible and not illegal
| does NOT make it the right thing to do.
| nitwit005 wrote:
| Let us flip this around: If your crawler regularly knocks
| websites offline, you've clearly done something wrong.
|
| There's no chance every single website in existence is going to
| have a flawless setup. That's guaranteed simply from the number
| of websites, and how old some of them are.
| ericholscher wrote:
| This keeps happening -- we wrote about multiple AI bots that were
| hammering us over at Read the Docs for >10TB of traffic:
| https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...
|
| They really are trying to burn all their goodwill to the ground
| with this stuff.
| exe34 wrote:
| can you feed them gibberish?
| Groxx wrote:
| LLMs poisoned by https://git-man-page-generator.lokaltog.net/
| -like content would be a _hilarious_ end result, please do!
| jcpham2 wrote:
| This would be my elegant solution, something like an endless
| recursion with a gzip bomb at the end if I can identify your
| crawler and it's that abusive. Would it be possible to feed
| an abusing crawler nothing but my own locally-hosted LLM
| gibberish?
|
| But then again, if you're in the cloud, egress bandwidth is
| going to cost you for playing this game.
|
| Better to just deny the OpenAI crawler and send them an
| invoice for the money and time they've wasted. Interesting
| form of data warfare against competitors and non-competitors
| alike. The winner will have the longest runway.
| actsasbuffoon wrote:
| It wouldn't even necessarily need to be a real GZip bomb.
| Just something containing a few hundred kb of seemingly new
| and unique text that's highly compressible and keeps
| providing "links" to additional dynamically generated
| gibberish that can be crawled. The idea is to serve a vast
| amount of poisoned training data as cheaply as possible.
| Heck, maybe you could even make a plugin for NGINX to
| recognize abusive AI bots and do this. If enough people
| install it then you could provide some very strong
| disincentives.
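|
| A minimal sketch of that idea, assuming Flask; the route name,
| word list, and the GPTBot user-agent check are all illustrative
| rather than anything from the article:
|
|     # Tarpit sketch: cheap, highly compressible gibberish pages
|     # that link only to more generated pages.
|     import random
|     from flask import Flask, abort, request
|
|     app = Flask(__name__)
|     WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]
|
|     @app.route("/maze/<int:page>")
|     def maze(page):
|         # Only serve the maze to clients already flagged as abusive.
|         if "GPTBot" not in request.headers.get("User-Agent", ""):
|             abort(404)
|         rng = random.Random(page)  # deterministic per page, no storage needed
|         text = " ".join(rng.choice(WORDS) for _ in range(40_000))
|         links = " ".join(f'<a href="/maze/{rng.randrange(10**9)}">more</a>'
|                          for _ in range(10))
|         return f"<html><body><p>{text}</p>{links}</body></html>"
|
|     if __name__ == "__main__":
|         app.run()
|
| Each page is a couple hundred KB of repetitive text that gzips
| down to almost nothing, so serving it costs far less than
| serving real content.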
| blibble wrote:
| here's a nice project to automate this:
| https://marcusb.org/hacks/quixotic.html
|
| couple of lines in your nginx/apache config and off you go
|
| my content-rich sites provide this "high quality" data to the
| parasites
| GaggiX wrote:
| The dataset is curated, very likely with a previously trained
| model, so gibberish is not going to do anything.
| exe34 wrote:
| how would a previously trained model know that Elon doesn't
| smoke old socks?
| PaulHoule wrote:
| In the early 2000s I was working at a place that Google wanted
| to crawl _so_ bad that they gave us a hotline number to call
| if their crawler was giving us problems.
|
| We were told at that time that the "robots.txt" enforcement was
| the one thing they had that wasn't fully distributed; it's a
| devilishly difficult thing to implement.
|
| It boggles my mind that people with the kind of budget that
| some of these people have are struggling to implement crawling
| right 20 years later, though. It's nice those folks got a rebate.
|
| One of the reasons people are testy today is that you pay
| by the GB w/ cloud providers; about 10 years ago I kicked out
| the sinosphere crawlers like Baidu because they were generating
| like 40% of the traffic on my site crawling over and over again
| and not sending even a single referrer.
| TuringNYC wrote:
| Serious question - if robots.txt are not being honored, is
| there a risk that there is a class action from tens of
| thousands of small sites against both the companies doing the
| crawling and individual directors/officers of these companies?
| Seems there would be some recourse if this is done at a large
| enough scale.
| krapp wrote:
| No. robots.txt is not in any way a legally binding contract,
| no one is obligated to care about it.
| vasco wrote:
| If I have a "no publicity" sign in my mailbox and you dump
| 500 lbs of flyers and magazines by my door every week for a
| month and cause me to lose money dealing with all the
| trash, I think I'd have a reasonable ground to sue even if
| there's no contract saying you need to respect my wish.
|
| End of the day the claim is someone's action caused someone
| else undue financial burden in a way that is not easily
| prevented beforehand, so I wouldn't say it's a 100% clear
| case but I'm also not sure a judge wouldn't entertain it.
| krapp wrote:
| I don't think you can sue over what amounts to an implied
| gentleman's agreement that one side never even agreed to,
| and win, but if you do, let us know.
| boredatoms wrote:
| You can sue whenever anyone harms you
| krapp wrote:
| I didn't say no one could sue, anyone can sue anyone for
| anything if they have the time and the money. I said I
| didn't think someone could sue over non-compliance with
| robots.txt and _win._
|
| If it were possible, someone would have done it by now.
| It hasn't happened because robots.txt has absolutely no
| legal weight whatsoever. It's entirely voluntary, which
| means it's perfectly legal not to volunteer.
|
| But if you or anyone else wants to waste their time
| tilting at legal windmills, have fun ¯\_(ツ)_/¯.
| huntoa wrote:
| Did I read it right that you pay $62.50/TB?
| Uptrenda wrote:
| Hey man, I wanted to say good job on Read the Docs. I use it
| for my Python project and find it an absolute pleasure to use.
| Write my stuff in reStructuredText. Make lots of pretty
| diagrams (lol), slowly making my docs easier to use. Good
| stuff.
|
| Edit 1: I'm surprised by the bandwidth costs. I use Hetzner and
| OVH and the bandwidth is free. Though you manage the bare metal
| server yourself. Would readthedocs ever consider switching to
| self-managed hosting to save costs on cloud hosting?
| griomnib wrote:
| I've been a web developer for decades as well as doing scraping,
| indexing, and analyzing millions of sites.
|
| Just follow the golden rule: don't ever load any site more
| aggressively than you would want yours to be.
|
| This isn't hard stuff, and these AI companies have grossly
| inefficient and obnoxious scrapers.
|
| As a site owner this pisses me off as a matter of decency on the
| web, but as an engineer doing distributed data collection I'm
| offended by how shitty and inefficient their crawlers are.
| PaulHoule wrote:
| I worked at one place where it probably cost us 100x (in CPU)
| more to serve content the way we were doing it as opposed to
| the way most people would do it. We could afford it for
| ordinary traffic because it was still cheap, but we deferred the
| cost reduction work for half a decade and went to war against
| webcrawlers instead. (hint: who introduced the robots.txt
| standard?)
| mingabunga wrote:
| We've had to block a lot of these bots as they slowed our
| technical forum to a crawl, but new ones appear every now and
| again. Amazon's was the worst.
| PaulHoule wrote:
| First time I heard this story it was '98 or so and the perp was
| somebody in the overfunded CS department and the victim somebody
| in the underfunded math department on the other side of a short
| and fat pipe. (Probably running Apache httpd on an SGI
| workstation without enough RAM to even run Win '95)
|
| In years of running webcrawlers I've had very little trouble;
| I've had more trouble in the last year than in the past 25.
| (Wrote my first crawler in '99, funny my crawlers have gotten
| simpler over time not more complex)
|
| In one case I found a site got terribly slow although I was
| hitting it at much less than 1 request per second. Careful
| observation showed the wheels were coming off the site and it had
| nothing to do with me.
|
| There's another site that I've probably crawled in its entirety
| at least ten times over the past twenty years. I have a crawl
| from two years ago, my plan was to feed it into a BERT-based
| system not for training but to discover content that is like the
| content that I like. I thought I'd get a fresh copy w/ httrack
| (polite, respects robots.txt, ...) and they blocked both my home
| IP addresses in 10 minutes. (Granted I don't think the past 2
| years of this site was as good as the past, so I will just load
| what I have into my semantic search & tagging system and use that
| instead)
|
| I was angry about how unfair the Google Economy was in 2013, in
| line with what this blogger has been saying ever since
|
| http://www.seobook.com/blog
|
| (I can say it's a strange way to market an expensive SEO
| community but...) and it drives me up the wall that people
| looking in the rear view mirror are getting upset about it now.
|
| Back in '98 I was excited about "personal webcrawlers" that could
| be your own web agent. On one hand LLMs could give so much
| utility in terms of classification, extraction, clustering and
| otherwise drinking from that firehose but the fear that somebody
| is stealing their precious creativity is going to close the door
| forever... And entrench a completely unfair Google Economy. It
| makes me sad.
|
| ----
|
| Oddly those stupid ReCAPTCHAs and Cloudflare CAPTCHAs torment me
| all the time as a human but I haven't once had them get in the
| way of a crawling project.
| peebee67 wrote:
| Greedy and relentless OpenAI's scraping may be, but that his web-
| based startup didn't have a rudimentary robots.txt in place seems
| inexcusably naive. Correctly configuring this file has been one
| of the most basic steps of web design in living memory, and
| missing it doesn't speak highly of this company's technical
| acumen.
|
| >"We're in a business where the rights are kind of a serious
| issue, because we scan actual people," he said. With laws like
| Europe's GDPR, "they cannot just take a photo of anyone on the
| web and use it."
|
| Yes, and protecting that data was _your_ responsibility,
| Tomchuk. You dropped the ball and are now trying to blame the
| other players.
| mystified5016 wrote:
| OpenAI will happily ignore robots.txt
|
| Or is that still my fault somehow?
|
| Maybe we should stop blaming people for "letting" themselves
| get destroyed and maybe put some blame on the people actively
| choosing to behave in a way that harms everyone else?
|
| But then again, they have _so_ much money so we should all just
| bend over and take it, right?
| vzaliva wrote:
| From the article:
|
| "As Tomchuk experienced, if a site isn't properly using
| robot.txt, OpenAI and others take that to mean they can scrape to
| their hearts' content."
|
| The takeaway: check your robots.txt.
|
| The question of how much load robots can reasonably generate
| when they are allowed is a separate matter.
| krapp wrote:
| Also probably consider blocking them with .htaccess or your
| server's equivalent, such as here:
| https://ethanmarcotte.com/wrote/blockin-bots/
|
| All this effort is futile because AI bots will simply send
| false user agents, but it's something.
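|
| For Apache, a rough sketch of what such a block can look like
| (the user-agent list here is just illustrative; the linked post
| keeps a fuller, maintained one):
|
|     RewriteEngine On
|     # Refuse requests whose User-Agent matches known AI crawler tokens.
|     RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|CCBot|anthropic-ai|Bytespider) [NC]
|     RewriteRule .* - [F,L]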
| more_corn wrote:
| Cloudflare bot detection?
|
| https://developers.cloudflare.com/bots/plans/free/
| 1oooqooq wrote:
| they are probably hosting the bots
| andrethegiant wrote:
| I'm working on fixing this exact problem[1]. Crawlers are gonna
| keep crawling no matter what, so a solution to meet them where
| they are is to create a centralized platform that builds in an
| edge TTL cache, respects robots.txt and retry-after headers out
| of the box, etc. If there is a convenient and affordable solution
| that plays nicely with websites, the hope is that devs will
| gravitate towards the well-behaved solution.
|
| [1] https://crawlspace.dev
| 1oooqooq wrote:
| is there a place with a list of the AWS servers these companies
| use, so sites can block them?
| OutOfHere wrote:
| Sites should learn to use HTTP error 429 to slow down bots to a
| reasonable pace. If the bots are coming from a subnet, apply it
| to the subnet, not to the individual IP. No other action is
| needed.
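|
| A minimal sketch of that idea in Python (the window and limit
| are made-up numbers, and a real deployment would also send a
| Retry-After header and treat IPv6 with a wider prefix):
|
|     import time
|     from collections import defaultdict
|     from ipaddress import ip_network
|
|     WINDOW = 60    # seconds
|     LIMIT = 600    # requests allowed per /24 subnet per window
|     counters = defaultdict(lambda: [0.0, 0])  # subnet -> [window start, count]
|
|     def status_for(remote_ip: str) -> int:
|         """Return 200 if the request may proceed, 429 if its /24 is over budget."""
|         subnet = str(ip_network(f"{remote_ip}/24", strict=False))
|         start, count = counters[subnet]
|         now = time.time()
|         if now - start > WINDOW:
|             counters[subnet] = [now, 1]  # start a fresh window for this subnet
|             return 200
|         counters[subnet][1] = count + 1
|         return 429 if count + 1 > LIMIT else 200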
| joelkoen wrote:
| > "OpenAI used 600 IPs to scrape data, and we are still analyzing
| logs from last week, perhaps it's way more," he said of the IP
| addresses the bot used to attempt to consume his site.
|
| The IP addresses in the screenshot are all owned by Cloudflare,
| meaning that their server logs are only recording the IPs of
| Cloudflare's reverse proxy, not the real client IPs.
|
| Also, the logs don't show any timestamps and there doesn't seem
| to be any mention of the request rate in the whole article.
|
| I'm not trying to defend OpenAI but as someone who scrapes data I
| think it's unfair to throw around terms like "DDoS attack"
| without providing basic request rate metrics. This seems to be
| purely based on the use of multiple IPs, which was actually
| caused by their own server configuration and has nothing to do
| with OpenAI.
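|
| For what it's worth, Cloudflare passes the original client
| address in the CF-Connecting-IP header, so the real IPs are
| recoverable; a rough nginx sketch using the realip module (the
| range shown is just one of Cloudflare's published ones, and
| each published range would need its own line):
|
|     set_real_ip_from 173.245.48.0/20;
|     real_ip_header CF-Connecting-IP;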
___________________________________________________________________
(page generated 2025-01-10 23:00 UTC)