[HN Gopher] OpenAI's bot crushed this seven-person company's web...
___________________________________________________________________
OpenAI's bot crushed this seven-person company's web site 'like a
DDoS attack'
Author : vednig
Score : 64 points
Date : 2025-01-10 21:21 UTC (1 hour ago)
(HTM) web link (techcrunch.com)
(TXT) w3m dump (techcrunch.com)
| ThrowawayTestr wrote:
| Has anyone been successfully sued for excess hosting costs due to
| scraping?
| neom wrote:
| https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn makes it
| clear scraping publicly available data is generally not a CFAA
| violation. Certainly it would have to be a civil matter, but I
| doubt it would work (ianal)
| ericholscher wrote:
| We did get $7k out of one of the AI companies based on the
| massive bandwidth costs they caused us.
|
| https://about.readthedocs.com/blog/2024/07/ai-crawlers-
| abuse...
| neom wrote:
| wow GOOD JOB!!! Were they relatively decent about it, is
| that why? I feel like normal businesses that are not super
| shady should be able to accept this kind of conversation
| and deal with the mistake and issues they caused for you.
|
| Good job pursuing it tho, that's fantastic. (ps, big fan of
| your product, great work on that too!)
| atleastoptimal wrote:
| Stuff like this will happen to all websites soon due to AI agents
| let loose on the web
| JohnMakin wrote:
| https://cyberpunk.fandom.com/wiki/Blackwall
| peterldowns wrote:
| I have little sympathy for the company in this article. If you
| put your content on the web, and don't require authentication to
| access it, it's going to be crawled and scraped. Most of the time
| you're happy about this -- you want search providers to index
| your content.
|
| It's one thing if a company ignores robots.txt and causes serious
| interference with the service, like Perplexity was doing, but the
| details here don't really add up: this company didn't have a
| robots.txt in place, and although the article mentions
| tens/hundreds of thousands of requests, they don't say anything
| about them being made unreasonably quickly.
|
| The default-public accessibility of information on the internet
| is a net-good for the technology ecosystem. Want to host things
| online? Learn how.
| JohnMakin wrote:
| robots.txt as of right now is a complete honor system, so I
| think it's reasonable to conclude that you shouldn't
| rely on it protecting you, because odds are overwhelming that
| scraping behavior will become worse in the near to mid term
| future
| j45 wrote:
| It's less about sympathy and more about understanding that they
| might not be experts in things tech, relied on hired help that
| seemed to be good at what they did, and the most basic thing
| (set up a free Cloudflare account or something) was missed.
|
| Learning how is sometimes actually learning who's going to get
| you online in a good way.
|
| In this case, when you have non-tech people building WordPress
| sites, it's about what they can understand and do, and the rate
| of learning doesn't always keep up relative to client work.
| fzeroracer wrote:
| > If you put your content on the web, and don't require
| authentication to access it, it's going to be crawled and
| scraped. Most of the time you're happy about this -- you want
| search providers to index your content
|
| > The default-public accessibility of information on the
| internet is a net-good for the technology ecosystem. Want to
| host things online? Learn how.
|
| These two statements are at odds, I hope you realize. You say
| public accessibility of information is a good thing, while
| blaming someone for being effectively DDOS'd as a result of
| having said information public.
| hd4 wrote:
| They're not at odds. "default-public accessibility of
| information" doesn't necessarily translate into "default-
| public accessibility of _content_", i.e. media. Content
| _should_ be served behind an authentication layer.
|
| The clickbaity hysteria here is missing how this sort of
| scraping has been possible long before AI agents showed up a
| couple of years back.
| agmater wrote:
| From the Wayback Machine [0] it seems they had a normal "open"
| set-up. They wanted to be indexed, but it's probably a fair
| concern that OpenAI isn't going to respect their image license.
| The article describes the robot.txt [sic] now "properly
| configured", but their solution was to block everything except
| Google, Bing, Yahoo, DuckDuckGo. That seems to be the smart
| thing these days, but it's a shame for any new search engines.
|
| [0]
| https://web.archive.org/web/20221206134212/https://www.tripl...
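|
| A minimal robots.txt along those lines might look like the
| sketch below (illustrative only; the user-agent tokens are the
| ones those engines publish, not anything taken from the site
| itself):
|
|     User-agent: Googlebot
|     User-agent: Bingbot
|     User-agent: Slurp
|     User-agent: DuckDuckBot
|     Disallow:
|
|     User-agent: *
|     Disallow: /
|
| Grouped user-agent lines share the rules that follow them, the
| empty Disallow permits everything for the listed crawlers, and
| the catch-all group blocks everyone else.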
| swatcoder wrote:
| > I have little sympathy for the company in this article. If
| you put your content on the web, and don't require
| authentication to access it, it's going to be crawled and
| scraped. Most of the time you're happy about this -- you want
| search providers to index your content.
|
| If I stock a Little Free Library at the end of my driveway,
| it's because I want people in the community to peruse and swap
| the books in a way that's intuitive to pretty much everyone who
| might encounter it.
|
| I shouldn't need to post a sign outside of it saying "Please
| don't just take all of these at once", and it'd be completely
| reasonable for me to feel frustrated if someone did misuse it
| -- regardless of whether the sign was posted or not.
| dghlsakjg wrote:
| There is nothing inherently illegal about filling a small store
| to occupancy capacity with all of your friends and never buying
| anything.
|
| Just because something is technically possible and not illegal
| does NOT make it the right thing to do.
| nitwit005 wrote:
| Let us flip this around: If your crawler regularly knocks
| websites offline, you've clearly done something wrong.
|
| There's no chance every single website in existence is going to
| have a flawless setup. That's guaranteed simply from the number
| of websites, and how old some of them are.
| ericholscher wrote:
| This keeps happening -- we wrote about multiple AI bots that were
| hammering us over at Read the Docs for >10TB of traffic:
| https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...
|
| They really are trying to burn all their goodwill to the ground
| with this stuff.
| exe34 wrote:
| can you feed them gibberish?
| Groxx wrote:
| LLMs poisoned by https://git-man-page-generator.lokaltog.net/
| -like content would be a _hilarious_ end result, please do!
| jcpham2 wrote:
| This would be my elegant solution, something like an endless
| recursion with a gzip bomb at the end if I can identify your
| crawler and it's that abusive. Would it be possible to feed
| an abusing crawler nothing but my own locally-hosted LLM
| gibberish?
|
| But then again, if you're in the cloud, egress bandwidth is
| going to cost you for playing this game.
|
| Better to just deny the OpenAI crawler and send them an
| invoice for the money and time they've wasted. Interesting
| form of data warfare against competitors and non-competitors
| alike. The winner will have the longest runway.
| actsasbuffoon wrote:
| It wouldn't even necessarily need to be a real GZip bomb.
| Just something containing a few hundred kb of seemingly new
| and unique text that's highly compressible and keeps
| providing "links" to additional dynamically generated
| gibberish that can be crawled. The idea is to serve a vast
| amount of poisoned training data as cheaply as possible.
| Heck, maybe you could even make a plugin for NGINX to
| recognize abusive AI bots and do this. If enough people
| install it then you could provide some very strong
| disincentives.
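|
| A minimal sketch of that idea, assuming Flask; the route name,
| word list, and the GPTBot user-agent check are all illustrative
| rather than anything from the article:
|
|     # Tarpit sketch: cheap, highly compressible gibberish pages
|     # that link only to more generated pages.
|     import random
|     from flask import Flask, abort, request
|
|     app = Flask(__name__)
|     WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]
|
|     @app.route("/maze/<int:page>")
|     def maze(page):
|         # Only serve the maze to clients already flagged as abusive.
|         if "GPTBot" not in request.headers.get("User-Agent", ""):
|             abort(404)
|         rng = random.Random(page)  # deterministic per page, no storage needed
|         text = " ".join(rng.choice(WORDS) for _ in range(40_000))
|         links = " ".join(f'<a href="/maze/{rng.randrange(10**9)}">more</a>'
|                          for _ in range(10))
|         return f"<html><body><p>{text}</p>{links}</body></html>"
|
|     if __name__ == "__main__":
|         app.run()
|
| Each page is a couple hundred KB of repetitive text that gzips
| down to almost nothing, so serving it costs far less than
| serving real content.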
| blibble wrote:
| here's a nice project to automate this:
| https://marcusb.org/hacks/quixotic.html
|
| couple of lines in your nginx/apache config and off you go
|
| my content-rich sites provide this "high quality" data to the
| parasites
| GaggiX wrote:
| The dataset is curated, very likely with a previously trained
| model, so gibberish is not going to do anything.
| exe34 wrote:
| how would a previously trained model know that Elon doesn't
| smoke old socks?
| PaulHoule wrote:
| In the early 2000s I was working at a place that Google wanted
| to crawl _so_ bad that they gave us a hotline number to call
| if their crawler was giving us problems.
|
| We were told at that time that the "robots.txt" enforcement was
| the one thing they had that wasn't fully distributed; it's a
| devilishly difficult thing to implement.
|
| It boggles my mind that people with the kind of budget that
| some of these people have are struggling to implement crawling
| right 20 years later, though. It's nice those folks got a rebate.
|
| One of the reasons people are testy today is that you pay
| by the GB w/ cloud providers; about 10 years ago I kicked out
| the sinosphere crawlers like Baidu because they were generating
| like 40% of the traffic on my site crawling over and over again
| and not sending even a single referrer.
| TuringNYC wrote:
| Serious question - if robots.txt are not being honored, is
| there a risk that there is a class action from tens of
| thousands of small sites against both the companies doing the
| crawling and individual directors/officers of these companies?
| Seems there would be some recourse if this is done at a large
| enough scale.
| krapp wrote:
| No. robots.txt is not in any way a legally binding contract,
| no one is obligated to care about it.
| vasco wrote:
| If I have a "no publicity" sign in my mailbox and you dump
| 500 lbs of flyers and magazines by my door every week for a
| month and cause me to lose money dealing with all the
| trash, I think I'd have a reasonable ground to sue even if
| there's no contract saying you need to respect my wish.
|
| End of the day the claim is someone's action caused someone
| else undue financial burden in a way that is not easily
| prevented beforehand, so I wouldn't say it's a 100% clear
| case but I'm also not sure a judge wouldn't entertain it.
| krapp wrote:
| I don't think you can sue over what amounts to an implied
| gentleman's agreement that one side never even agreed to,
| and win, but if you do, let us know.
| boredatoms wrote:
| You can sue whenever anyone harms you
| krapp wrote:
| I didn't say no one could sue, anyone can sue anyone for
| anything if they have the time and the money. I said I
| didn't think someone could sue over non-compliance with
| robots.txt and _win._
|
| If it were possible, someone would have done it by now.
| It hasn't happened because robots.txt has absolutely no
| legal weight whatsoever. It's entirely voluntary, which
| means it's perfectly legal not to volunteer.
|
| But if you or anyone else wants to waste their time
| tilting at legal windmills, have fun ¯\_(ツ)_/¯.
| huntoa wrote:
| Did I read it right that you pay $62.50/TB?
| Uptrenda wrote:
| Hey man, I wanted to say good job on Read the Docs. I use it
| for my Python project and find it an absolute pleasure to use.
| Write my stuff in reStructuredText. Make lots of pretty
| diagrams (lol), slowly making my docs easier to use. Good
| stuff.
|
| Edit 1: I'm surprised by the bandwidth costs. I use Hetzner and
| OVH and the bandwidth is free. Though you manage the bare metal
| server yourself. Would readthedocs ever consider switching to
| self-managed hosting to save costs on cloud hosting?
| griomnib wrote:
| I've been a web developer for decades as well as doing scraping,
| indexing, and analyzing millions of sites.
|
| Just follow the golden rule: don't ever load any site more
| aggressively than you would want yours to be.
|
| This isn't hard stuff, and these AI companies have grossly
| inefficient and obnoxious scrapers.
|
| As a site owner this pisses me off as a matter of decency on the
| web, but as an engineer doing distributed data collection I'm
| offended by how shitty and inefficient their crawlers are.
| PaulHoule wrote:
| I worked at one place where it probably cost us 100x (in CPU)
| more to serve content the way we were doing it as opposed to
| the way most people would do it. We could afford it for
| ordinary traffic because it was still cheap, but we deferred the
| cost reduction work for half a decade and went to war against
| webcrawlers instead. (hint: who introduced the robots.txt
| standard?)
| mingabunga wrote:
| We've had to block a lot of these bots as they slowed our
| technical forum to a crawl, but new ones appear every now and
| again. Amazon's was the worst.
| PaulHoule wrote:
| First time I heard this story it was '98 or so and the perp was
| somebody in the overfunded CS department and the victim somebody
| in the underfunded math department on the other side of a short
| and fat pipe. (Probably running Apache httpd on an SGI
| workstation without enough RAM to even run Win '95)
|
| In years of running webcrawlers I've had very little trouble;
| I've had more trouble in the last year than in the past 25.
| (Wrote my first crawler in '99, funny my crawlers have gotten
| simpler over time not more complex)
|
| In one case I found a site got terribly slow although I was
| hitting it at much less than 1 request per second. Careful
| observation showed the wheels were coming off the site and it had
| nothing to do with me.
|
| There's another site that I've probably crawled in its entirety
| at least ten times over the past twenty years. I have a crawl
| from two years ago, my plan was to feed it into a BERT-based
| system not for training but to discover content that is like the
| content that I like. I thought I'd get a fresh copy w/ httrack
| (polite, respects robots.txt, ...) and they blocked both my home
| IP addresses in 10 minutes. (Granted I don't think the past 2
| years of this site was as good as the past, so I will just load
| what I have into my semantic search & tagging system and use that
| instead)
|
| I was angry about how unfair the Google Economy was in 2013, in
| line with what this blogger has been saying ever since
|
| http://www.seobook.com/blog
|
| (I can say it's a strange way to market an expensive SEO
| community but...) and it drives me up the wall that people
| looking in the rear view mirror are getting upset about it now.
|
| Back in '98 I was excited about "personal webcrawlers" that could
| be your own web agent. On one hand LLMs could give so much
| utility in terms of classification, extraction, clustering and
| otherwise drinking from that firehose but the fear that somebody
| is stealing their precious creativity is going to close the door
| forever... And entrench a completely unfair Google Economy. It
| makes me sad.
|
| ----
|
| Oddly those stupid ReCAPTCHAs and Cloudflare CAPTCHAs torment me
| all the time as a human but I haven't once had them get in the
| way of a crawling project.
| peebee67 wrote:
| Greedy and relentless OpenAI's scraping may be, but that his web-
| based startup didn't have a rudimentary robots.txt in place seems
| inexcusably naive. Correctly configuring this file has been one
| of the most basic steps of web design in living memory, and
| missing it doesn't speak highly of this company's technical
| acumen.
|
| >"We're in a business where the rights are kind of a serious
| issue, because we scan actual people," he said. With laws like
| Europe's GDPR, "they cannot just take a photo of anyone on the
| web and use it."
|
| Yes, and protecting that data was _your_ responsibility,
| Tomchuk. You dropped the ball and are now trying to blame the
| other players.
| mystified5016 wrote:
| OpenAI will happily ignore robots.txt
|
| Or is that still my fault somehow?
|
| Maybe we should stop blaming people for "letting" themselves
| get destroyed and maybe put some blame on the people actively
| choosing to behave in a way that harms everyone else?
|
| But then again, they have _so_ much money so we should all just
| bend over and take it, right?
| vzaliva wrote:
| From the article:
|
| "As Tomchuk experienced, if a site isn't properly using
| robot.txt, OpenAI and others take that to mean they can scrape to
| their hearts' content."
|
| The takeaway: check your robots.txt.
|
| The question of how much load robots can reasonably generate
| when they are allowed is a separate matter.
| krapp wrote:
| Also probably consider blocking them with .htaccess or your
| server's equivalent, such as here:
| https://ethanmarcotte.com/wrote/blockin-bots/
|
| All this effort is futile because AI bots will simply send
| false user agents, but it's something.
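|
| For Apache, a rough sketch of what such a block can look like
| (the user-agent list here is just illustrative; the linked post
| keeps a fuller, maintained one):
|
|     RewriteEngine On
|     # Refuse requests whose User-Agent matches known AI crawler tokens.
|     RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|CCBot|anthropic-ai|Bytespider) [NC]
|     RewriteRule .* - [F,L]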
| more_corn wrote:
| Cloudflare bot detection?
|
| https://developers.cloudflare.com/bots/plans/free/
| 1oooqooq wrote:
| they are probably hosting the bots
| andrethegiant wrote:
| I'm working on fixing this exact problem[1]. Crawlers are gonna
| keep crawling no matter what, so a solution to meet them where
| they are is to create a centralized platform that builds in an
| edge TTL cache, respects robots.txt and retry-after headers out
| of the box, etc. If there is a convenient and affordable solution
| that plays nicely with websites, the hope is that devs will
| gravitate towards the well-behaved solution.
|
| [1] https://crawlspace.dev
| 1oooqooq wrote:
| is there a place with a list of the AWS servers these companies
| use, so sites can block them?
| OutOfHere wrote:
| Sites should learn to use HTTP error 429 to slow down bots to a
| reasonable pace. If the bots are coming from a subnet, apply it
| to the subnet, not to the individual IP. No other action is
| needed.
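|
| A minimal sketch of that idea in Python (the window and limit
| are made-up numbers, and a real deployment would also send a
| Retry-After header and treat IPv6 with a wider prefix):
|
|     import time
|     from collections import defaultdict
|     from ipaddress import ip_network
|
|     WINDOW = 60    # seconds
|     LIMIT = 600    # requests allowed per /24 subnet per window
|     counters = defaultdict(lambda: [0.0, 0])  # subnet -> [window start, count]
|
|     def status_for(remote_ip: str) -> int:
|         """Return 200 if the request may proceed, 429 if its /24 is over budget."""
|         subnet = str(ip_network(f"{remote_ip}/24", strict=False))
|         start, count = counters[subnet]
|         now = time.time()
|         if now - start > WINDOW:
|             counters[subnet] = [now, 1]  # start a fresh window for this subnet
|             return 200
|         counters[subnet][1] = count + 1
|         return 429 if count + 1 > LIMIT else 200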
| joelkoen wrote:
| > "OpenAI used 600 IPs to scrape data, and we are still analyzing
| logs from last week, perhaps it's way more," he said of the IP
| addresses the bot used to attempt to consume his site.
|
| The IP addresses in the screenshot are all owned by Cloudflare,
| meaning that their server logs are only recording the IPs of
| Cloudflare's reverse proxy, not the real client IPs.
|
| Also, the logs don't show any timestamps and there doesn't seem
| to be any mention of the request rate in the whole article.
|
| I'm not trying to defend OpenAI but as someone who scrapes data I
| think it's unfair to throw around terms like "DDoS attack"
| without providing basic request rate metrics. This seems to be
| purely based on the use of multiple IPs, which was actually
| caused by their own server configuration and has nothing to do
| with OpenAI.
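|
| For what it's worth, Cloudflare passes the original client
| address in the CF-Connecting-IP header, so the real IPs are
| recoverable; a rough nginx sketch using the realip module (the
| range shown is just one of Cloudflare's published ones, and
| each published range would need its own line):
|
|     set_real_ip_from 173.245.48.0/20;
|     real_ip_header CF-Connecting-IP;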
___________________________________________________________________
(page generated 2025-01-10 23:00 UTC)