[HN Gopher] How to Bypass Cloudflare: A Comprehensive Guide
___________________________________________________________________
How to Bypass Cloudflare: A Comprehensive Guide
Author : jakobdabo
Score : 168 points
Date : 2022-09-18 11:59 UTC (11 hours ago)
(HTM) web link (www.zenrows.com)
(TXT) w3m dump (www.zenrows.com)
| Tiberium wrote:
| The actual "easiest" way (at least for me) to bypass Cloudflare
| is to find the actual IP of the web-server running behind it. Of
| course in a lot of cases it's not possible, for example when the
| web admin correctly limits the webserver to only respond to
| Cloudflare IP ranges, or if
| https://developers.cloudflare.com/ssl/origin-configuration/o...
| is used.
|
| Most useful services for that are https://shodan.io/ and
| https://search.censys.io/. I've had decent successes with Censys
| on finding real IP addresses of websites behind Cloudflare. Of
| course you might also have success by checking history of DNS
| records for a particular domain.
| gingerlime wrote:
| > or if https://developers.cloudflare.com/ssl/origin-configuration/o...
| is used.
|
| How is using CF's origin CA preventing the connection to the
| real backend in order to bypass Cloudflare? You can just ignore
| the SSL error, couldn't you?
| temp0826 wrote:
| In addition, I think CF provides a list of IPs to whitelist
| to only allow access from their servers.
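| A minimal sketch of that allowlisting idea, assuming the origin can
| see the connecting peer's IP. The ranges below are only a small
| illustrative sample and may be out of date; a real setup would load
| the list Cloudflare publishes. The helper name is hypothetical.

```python
import ipaddress

# Illustrative sample only; load the currently published list in practice.
SAMPLE_CLOUDFLARE_RANGES = [
    "173.245.48.0/20",
    "104.16.0.0/13",
    "2400:cb00::/32",
]


def is_cloudflare_peer(peer_ip, ranges=SAMPLE_CLOUDFLARE_RANGES):
    """Return True if peer_ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(peer_ip)
    networks = (ipaddress.ip_network(r) for r in ranges)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixing both families in one list is safe.
    return any(addr in net for net in networks)
```

| An origin would drop or reject any connection for which this
| returns False, ideally at the firewall rather than in the app.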
| kevincox wrote:
| They probably meant to link
| https://developers.cloudflare.com/ssl/origin-configuration/a...
| where Cloudflare uses a client TLS
| certificate to pull from the origin and the origin should be
| configured to reject requests without a client certificate.
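| A rough sketch of the origin side of that setup using Python's
| standard ssl module, assuming the origin terminates TLS itself. The
| function name and file-path parameters are hypothetical; the paths
| would point at the origin's own certificate and at the CA that signs
| the proxy's client certificate.

```python
import ssl


def build_origin_tls_context(server_cert=None, server_key=None, client_ca=None):
    """Server-side TLS context that rejects clients without a valid certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Require a client certificate; the handshake fails without one.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if client_ca:
        # CA that issued the proxy's client certificate (assumed path).
        # Without this, no client certificate can be verified at all.
        ctx.load_verify_locations(cafile=client_ca)
    if server_cert:
        # The origin's own certificate and private key (assumed paths).
        ctx.load_cert_chain(certfile=server_cert, keyfile=server_key)
    return ctx
```

| With this in place, simply ignoring certificate errors on the client
| side does not help, because it is the server that refuses the
| handshake.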
| PigiVinci83 wrote:
| Companies spend thousands of dollars on these anti-bot
| solutions and then they are so misconfigured that using a
| specific user agent or faking browsing via mobile, bypasses
| them. Real life stories.
| grogenaut wrote:
| Often this is because you are hamstrung by old mobile apps or
| TV apps that can't be updated forcibly and so you break
| users. So you're making a trade-off of user pain and bot
| deflection. So many times this is actually known and on
| purpose. Botters hitting that loophole helps prioritize
| closing that loophole in an agile customer experience and
| makes it easier for engineering and product to prioritize.
| Real life stories
| PigiVinci83 wrote:
| Totally agree.
| PigiVinci83 wrote:
| Just shared some of these stories here:
| https://twitter.com/pigivinciguerra/status/15715437943480893...
| discreditable wrote:
| This is what I expected the article to be about. I would wager
| a lot of shops don't do the whitelisting. If they wanted to be
| really intense they could do authenticated origin pulls.
| akira2501 wrote:
| AWS CloudFront with S3 recommends that you just set your S3
| to require a specific 'Referer' header variable and you set
| CloudFront to send that custom 'Referer' with each origin
| request.
|
| Seems to work great when you use something like a GUID, and
| no need for IP whitelisting.
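| The same shared-secret idea can be sketched on a custom origin,
| assuming the CDN is configured to add the header to every origin
| request. The function name is hypothetical, and `headers` is assumed
| to be a plain dict keyed by the exact header name.

```python
import hmac


def has_valid_origin_secret(headers, expected_secret, header_name="Referer"):
    """Check that the CDN-injected header carries the shared secret."""
    supplied = headers.get(header_name, "")
    # Constant-time comparison avoids leaking the secret through timing.
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())
```

| The secret is only as good as its secrecy, so it should be random
| (a GUID, as suggested) and rotated if it ever leaks.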
| nlh wrote:
| Just want to say THANK YOU for this insight. It never occurred
| to me, and I just checked out one of the big sites that I
| scrape for a side project and, lo and behold, you are 100%
| correct. Found their origin in Censys in about 30 seconds and
| I've never been able to crunch through their pages more easily.
|
| To others out there who explore this: As with all scraping, be
| gentle! If you start pounding on someone's origin server
| directly, you're much more likely to be noticed than if you're
| pounding on something behind a CloudFlare cache. Set rate
| limits, scrape during off-peak hours, etc. Be a good scraping
| citizen.
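| One simple way to "set rate limits" in a scraper is a minimum delay
| between requests. This is a minimal sketch; the class name is made
| up, and the clock and sleep functions are injectable only to make it
| easy to test.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

| Calling `throttle.wait()` before every request keeps the scraper at
| or below one request per `min_interval` seconds.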
| dizhn wrote:
| Use zenrows. Got it. It's clickbait but it does provide a good
| summary of how cloudflare's anti bot stuff works.
| yjftsjthsd-h wrote:
| I'd call it an ad, not clickbait. An ad with some useful
| content, but still.
|
| Edit: I think the modern term is content native advertising,
| although I'm perfectly happy to keep using the word
| infomercial.
| throwaway81523 wrote:
| I'd call it clickbait since its actual nature is not revealed
| til the very end.
| dizhn wrote:
| I called it clickbait because the article does not contain
| the thing promised in the title.
| Sebb767 wrote:
| > An ad with some useful content, but still.
|
| There's a term for that, infomercial :-)
| fishtoaster wrote:
| Yeah, it's content marketing. It's got all the stylistic tells:
|
| - Giving more background than is appropriate to the subject
| (explaining what cloudflare is in an article about bypassing
| it)
|
| - Lots of fluff about "what we're going to cover" like it's a
| poorly-written highschool essay
|
| - Asking and answering questions rather than stating things:
| "Can Cloudflare be bypassed? Thankfully, the answer is yes!"
|
| I'm not _entirely_ sure what drives these things, but they seem
| to be very common in this sort of content marketing article.
| I'm guessing a lot of it is SEO-driven.
|
| This particular article has more actual content than most, but
| still ultimately devolves into an ad, of course.
| return_to_monke wrote:
| > - Asking and answering questions rather than stating
| things: "Can Cloudflare be bypassed? Thankfully, the answer
| is yes!"
|
| > I'm not _entirely_ sure what drives these things, but they
| seem to be very common in this sort of content marketing
| article. I'm guessing a lot of it is SEO-driven.
|
| I suspect that this is them trying to get into Google's
| "frequent question"/"people also ask" [0] box, because that
| seems like a common search term ("can you bypass cloudflare")
|
| [0]: https://www.brightedge.com/glossary/people-also-ask
| throwaway81523 wrote:
| Another tell: emphasizing various phrases by boldfacing, as
| if the rest of the article is not intended to actually be
| read.
|
| I find the mention of a series A fundraising round at the top
| interesting too. Do the funders really expect something other
| than an escalating technical arms race that eventually
| outpaces them?
| nothasan wrote:
| Some impressive documentation on how to get around this BM
| solution.
| unixbane wrote:
| This is the endgame for the web. Since it doesn't care about
| having any simple, well-defined protocol and set of features, you
| will always just have an arms race between charlatans and their
| security gimmicks (cloudbleed comes to mind) vs people bypassing
| that, and so you will only be able to browse websites in a
| certified way, like using your bare IP address or a big 4
| browser. The web is essentially no better than proprietary
| software. It's broken by design, no matter what new shiny
| (plastic) features they added this week.
|
| https://en.wikipedia.org/wiki/Cloudbleed
|
| edit: why is it that for the last 11 years, every single time i
| posted something about cloudflare doing bad stuff, i get
| immediately downvoted? the only reason i can imagine is because
| cloudflare is your favorite pet company. I've noticed that no
| matter what I post in my set of controversial opinions, the only
| one that is consistently downvoted is anything against
| cloudflare. you guys are fucking losers.
| yjftsjthsd-h wrote:
| The frustrating thing to me is that CF is that invasive and still
| can't distinguish bots from people; it usually eventually lets me
| through, but I've spent enough time staring at the "are you
| _sure_ you're not a bot?" screen to laugh off their claims about
| human/not traffic ratios.
| Anunayj wrote:
| I would also like to mention FlareSolverr [1] here, which just
| uses a headless browser to solve the challenges, which might be
| acceptable in some situations (that don't need high request rate)
|
| 1. https://github.com/FlareSolverr/FlareSolverr
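| For context, FlareSolverr runs as a local HTTP service that you
| send JSON commands to. The sketch below only builds such a request.
| The endpoint, port, and field names reflect my understanding of the
| project's README and should be treated as assumptions to check
| against the current docs; the function name is hypothetical.

```python
import json
import urllib.request


def build_flaresolverr_request(target_url,
                               endpoint="http://localhost:8191/v1",
                               max_timeout_ms=60000):
    """Build (but do not send) a POST asking the proxy to fetch target_url."""
    # Field names are assumed from the project's documentation.
    payload = {
        "cmd": "request.get",
        "url": target_url,
        "maxTimeout": max_timeout_ms,
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

| Passing the result to `urllib.request.urlopen` would send it. Each
| call drives a real browser, which is why the request rate is low.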
| alokjnv10 wrote:
| I hate cloudflare. I had a really hard time making a web scraper.
| vntok wrote:
| That's... the whole point.
| unixbane wrote:
| And it's an invalid point. Scraping prevention is the most
| stupid thing Cloudflare has ever done, and that's after a
| very long list.
| simondotau wrote:
| It's not an invalid point. Setting aside Government and
| business services, you aren't morally entitled to clean,
| uninterrupted access to any random website. If a webmaster
| chooses to make your life difficult for any reason, that's
| entirely their prerogative.
| midislack wrote:
| All this and it's just an ad for some SAAS? Fuck I got gypped.
| urtom wrote:
| If I just need to make plain GET requests in my web scraping,
| I've found the easiest way to bypass Cloudflare on most sites is
| to make the requests via the Internet Archive. That has some rate
| limiting, but it can be worked around by using several source IP
| addresses in parallel.
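| One possible reading of this, sketched under assumptions: look up
| an existing snapshot through the Wayback Machine availability
| endpoint and fetch that instead of the live page. The helper names
| are hypothetical, and the response shape shown in the parser is an
| assumption about that API's JSON.

```python
import json
import urllib.parse

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the lookup URL for the closest archived snapshot of page_url."""
    return AVAILABILITY_API + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(api_response_text):
    """Return the snapshot URL from an availability response, or None."""
    data = json.loads(api_response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

| This only helps for pages that are already archived, and the
| archive deserves the same gentle rate limits as any other site.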
| donutshop wrote:
| Are there other products out there that offer a similar feature
| set at this price point?
| cj wrote:
| There are legitimate use cases for bypassing cloudflare's bot
| protection.
|
| I discovered our company's help documentation (and integration
| guides), hosted by readme.com, were completely de-indexed from
| Google for the past 3 months.
|
| Our Readme docs were formerly our #1 source of organic (free)
| leads.
|
| After investigating, Cloudflare (as configured by Readme) was
| blocking Googlebot when using Cloudflare Workers. Cloudflare was
| returning a 403 for Googlebot, but returning pages as usual for
| regular users.
|
| The cause: we were using Workers to rewrite some URLs at the edge
| (replacing Readme's default images with optimized + compressed
| images, using Cloudflare's own image optimization service).
|
| By using Workers to do this, it resulted in Readme's Cloudflare
| account receiving requests from our domain with "googlebot"
| useragent, but from an IP that wasn't verified as a googlebot IP
| address (I assume the Worker was requesting the Readme site using
| the Googlebot user agent but with whatever IP address is used
| when using CF Workers).
|
| I emailed Cloudflare support but it was clear it would take a lot
| of time to get them to understand the issue (and probably longer
| to fix it).
|
| So, we had to spend a lot of time figuring out how to allow
| Googlebot requests past Cloudflare's "fake bot" firewall rule.
|
| In our own Cloudflare account, we have all security settings at
| the lowest sensitivity possible (or turned off completely). We
| serve over 500 billion requests a month (10+ TB of bandwidth),
| and the amount of blocked traffic to seemingly legitimate clients
| was surprisingly high.
|
| I love Cloudflare (and own quite a bit of their stock) but I'm
| beginning to rethink my stance on their service. They make it
| extremely easy to enable powerful features with little visibility
| or control over the details of how those features work.
|
| Another SEO nightmare is their "Crawler Hints" service. I highly
| recommend no one uses this if you are ever the target of
| automated security scanners (e.g. ones used by bug bounty white
| hat hackers). With "crawler hints" enabled and with a white hat
| hacker running a scan of your site hitting random URLs... results
| in bingbot, yandex, and other search engines attempting to index
| every single one of the URLs hit by the security scanners used by
| hackers.
|
| Basically, it's a mess, and the only way to really fix it is to
| bypass cloudflare or spend a lot of time and money with
| Cloudflare debugging.
|
| Next quarter I'm faced with the decision of either doubling down
| on Cloudflare and getting an Enterprise plan with them ($20k+) or
| just ripping them out of our stack and going back to our old AWS
| Cloudfront set up which has fewer POPs, but was much less of a
| hassle.
| dom96 wrote:
| > By using Workers to do this, it resulted in Readme's
| Cloudflare account receiving requests from our domain with
| "googlebot" useragent, but from an IP that wasn't verified as a
| googlebot IP address (I assume the Worker was requesting the
| Readme site using the Googlebot user agent but with whatever IP
| address is used when using CF Workers).
|
| Was this definitely the cause? It's somewhat surprising to hear
| that requests would be rejected if the user agent doesn't match
| a set of hard coded IP addresses.
|
| Were you able to resolve this in the end? If not and the cause
| is what you suspect then perhaps changing the user agent in
| your worker might be a workaround.
| [deleted]
| kentonv wrote:
| It actually makes sense to me. I've pinged the bots team to
| see if we can improve here.
| traek wrote:
| > It's somewhat surprising to hear that requests would be
| rejected if the user agent doesn't match a set of hard coded
| IP addresses.
|
| It's fairly common for DDoS/scraping prevention. Googlebot
| (and most other crawlers) publish their IP ranges for that
| reason[0][1][2]. I don't work at Cloudflare though, so no
| insider knowledge of what you folks are doing.
|
| [0] https://developers.google.com/search/docs/crawling-indexing/...
|
| [1] https://developers.facebook.com/docs/sharing/webmasters/craw...
|
| [2] https://developer.twitter.com/en/docs/twitter-for-websites/c...
| kevincox wrote:
| I feel this as well. Cloudflare markets itself as a set-and-
| forget solution but really doesn't work that way. Furthermore
| in the limited visibility that they give to blocking they frame
| each blocked request as a success unconditionally. Of course
| they would; that is the service they are providing. However,
| this is often not the case: for many websites most requests
| benefit very little from blocking and bot protection only
| really needs to be provided for mutating endpoints and DoS
| attacks.
|
| For example the Cloudflare Blog's RSS feed is very often
| blocked from public-cloud IP ranges with specific clients. This
| is an endpoint that is intended to be public, is cachable and
| even intended to be accessed by bots! This is a common issue
| that should be very easy to solve technically but highlights
| how Cloudflare is not a set-and-forget solution. If they can't
| configure their own blog (a super simple case) correctly it is
| clear that using the tool correctly requires special care and
| monitoring of the limited visibility that they provide you.
| yjftsjthsd-h wrote:
| > Furthermore in the limited visibility that they give to
| blocking they frame each blocked request as a success
| unconditionally.
|
| Two things that have happened to me:
|
| * Cloudflare has decided that I'm a bot and stalled me, given
| me captchas, or just blocked me outright.
|
| * Cloudflare has shown me marketing claiming that 40% of
| traffic is bots.
|
| I'm not particularly impressed.
| sammy2255 wrote:
| What are you using Cloudflare for??
| [deleted]
| pbowyer wrote:
| > Next quarter I'm faced with the decision of either doubling
| down on Cloudflare and getting an Enterprise plan with them
| ($20k+) or just ripping them out of our stack and going back to
| our old AWS Cloudfront set up which has fewer POPs, but was
| much less of a hassle.
|
| Is Fastly a viable alternative for you?
| hutrdvnj wrote:
| I've used Googlebot as my fake browser user agent for years. It's
| really interesting to explore the web when everyone thinks
| you're Google.
| blitzar wrote:
| Do websites not spit at you or do they just assume you 'will
| do no evil'?
| TrickyRick wrote:
| What are some of the most interesting differences you've
| seen?
| simondotau wrote:
| Like the OP, I've employed a custom configuration in
| Cloudflare which detects (and blocks) browsers which claim to
| be Googlebot but don't originate from Google's approved
| Googlebot IP ranges.
|
| The vast majority of such requests are dodgy scanning
| operations likely looking for email addresses or exploitable
| forms.
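| The check described here can be sketched roughly as follows,
| assuming the crawler's published ranges arrive as JSON with a
| "prefixes" list of "ipv4Prefix"/"ipv6Prefix" entries. That shape,
| the function names, and any sample ranges used with it are
| assumptions for illustration, not Cloudflare's actual rule.

```python
import ipaddress
import json


def load_prefixes(ranges_json):
    """Parse a published-ranges JSON document into network objects."""
    doc = json.loads(ranges_json)
    networks = []
    for entry in doc.get("prefixes", []):
        # Each entry is assumed to carry either an IPv4 or an IPv6 prefix.
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_fake_googlebot(user_agent, client_ip, networks):
    """True if the UA claims Googlebot but the IP is outside the ranges."""
    if "googlebot" not in user_agent.lower():
        return False  # makes no Googlebot claim, so nothing to verify
    addr = ipaddress.ip_address(client_ip)
    return not any(addr in net for net in networks)
```

| Note that this is exactly the rule that bites the Workers case
| above: a forwarded Googlebot user agent arriving from a non-Google
| IP looks identical to a spoofed one.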
___________________________________________________________________
(page generated 2022-09-18 23:00 UTC)