[HN Gopher] Using Cloudflare on your website could be blocking R...
___________________________________________________________________
Using Cloudflare on your website could be blocking RSS users
Author : campuscodi
Score : 480 points
Date : 2024-10-16 22:46 UTC (1 day ago)
(HTM) web link (openrss.org)
(TXT) w3m dump (openrss.org)
| kevincox wrote:
| I dislike the advice of whitelisting specific readers by
| user-agent. Not only is this endless manual work that will only
| solve the problem for a subset of users, it is also easy for
| malicious actors to bypass. My recommendation would be to create
| a page rule that disables bot blocking for your feeds. This will
| fix the problem for all readers with no ongoing maintenance.
|
| If you are worried about DoS attacks that may hammer on your
| feeds then you can use the same configuration rule to ignore the
| query string for cache keys (if your feed doesn't use query
| strings) and to override the caching settings if your server
| doesn't set the proper headers. This way Cloudflare will cache
| your feed and you can serve any number of visitors without
| putting load onto your origin.
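|
| As a quick sanity check, something like the following (a rough
| sketch; the feed URL and reader User-Agent are placeholders,
| and requests is a third-party library) shows whether the feed
| is coming back as a 200 served from Cloudflare's cache
| (cf-cache-status: HIT) rather than as a 403 challenge:
|
|     import requests  # third-party HTTP client
|
|     FEED_URL = "https://example.com/feed.xml"  # placeholder
|
|     # Fetch twice: with caching configured, the second response
|     # should be a 200 with "cf-cache-status: HIT" instead of a
|     # 403 challenge page.
|     for attempt in (1, 2):
|         r = requests.get(FEED_URL, timeout=10,
|                          headers={"User-Agent": "ExampleReader/0.1"})
|         print(attempt, r.status_code,
|               r.headers.get("cf-cache-status", "<none>"),
|               r.headers.get("content-type"))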
|
| As for Cloudflare fixing the defaults, it seems unlikely to
| happen. It has been broken for years, Cloudflare's own blog is
| affected. They have been "actively working" on fixing it for at
| least 2 years according to their VP of product:
| https://news.ycombinator.com/item?id=33675847
| vaylian wrote:
| I don't know if cloudflare offers it, but whitelisting the URL
| of the RSS feed would be much more effective than filtering
| user agents.
| derkades wrote:
| Yes it supports it, and I think that's what the parent
| comment was all about
| BiteCode_dev wrote:
| Specifically, whitelisting the URL for the bot protection,
| but not the cache, so that you are still somewhat protected
| against adversarial use.
| londons_explore wrote:
| An adversary can easily send no-cache headers to bust the
| cache.
| acdha wrote:
| The CDN can choose whether to honor those. That hasn't
| been an effective adversarial technique since the turn of
| the century.
| londons_explore wrote:
| does cloudflare give such an option? Even for non-paid
| accounts?
| jks wrote:
| Yes, you can do it with a "page rule", which the parent
| comment mentioned. The CloudFlare free tier has a budget of
| three page rules, which might mean that you have to bundle
| all your rss feeds in one folder so they share a path prefix.
| a-french-anon wrote:
| And for those of us using sfeed, the default UA is Curl's.
| benregenspan wrote:
| AI crawlers have changed the picture significantly and in my
| opinion are a much bigger threat to the open web than
| Cloudflare. The training arms race has drastically increased
| bot traffic, and the value proposition behind that bot traffic
| has inverted. Previously many site operators could rely on the
| average automated request being net-beneficial to the site and
| its users (outside of scattered, time-limited DDoS attacks) but
| now most of these requests represent value extraction. Combine
| this with a seemingly related increase in high-volume bots that
| don't respect robots.txt and don't set a useful User-Agent, and
| using a heavy-handed firewall becomes a much easier business
| decision, even if it may target some desirable traffic (like
| valid RSS requests).
| veeti wrote:
| I believe that disabling "Bot Fight Mode" is not enough, you may
| also need to create a rule to disable "Browser Integrity Check".
| belkinpower wrote:
| I maintain an RSS reader for work and Cloudflare is the bane of
| my existence. Tons of feeds will stop working at random and
| there's nothing we can do about it except for individually
| contacting website owners and asking them to add an exception for
| their feed URL.
| sammy2255 wrote:
| Unfortunately it's not really Cloudflare but webadmins who have
| configured it to block everything that's not a browser, whether
| knowingly or not
| afandian wrote:
| If Cloudflare offers a product, for a particular purpose, that
| breaks existing conventions of that purpose, then it's
| Cloudflare.
| sammy2255 wrote:
| Not really. You wouldn't complain to a fence company for
| blocking a path if they were hired to do exactly that
| gsich wrote:
| They are enablers. They get part of the blame.
| shakna wrote:
| Yes, I would. Experts are expected to relay back to their
| client with their thoughts on a matter, not just blindly
| do as they're told. Your builder is meant to do their due
| diligence, which includes making recommendations.
| echoangle wrote:
| Well it doesn't break the conventions of the purpose they
| offer it for. Cloudflare attempts to block non-human users,
| and this is supposed to be used for human-readable
| websites. If someone puts cloudflare in front of a RSS
| feed, that's user error. It's like someone putting a
| captcha in front of an API and then complaining that the
| Captcha provider is breaking conventions.
| nirvdrum wrote:
| I contend this wasn't an issue prior to Cloudflare making
| that an option. Sure, some IDS would block some users and geo
| blocks have been around forever. But Cloudflare is so
| prolific and makes it so easy to block things inadvertently
| that I don't think they get a pass by blaming the downstream
| user.
|
| It's particularly frustrating that they give their own WARP
| service a pass. I've run into many sites that will block VPN
| traffic, including iCloud Privacy Relay, but WARP traffic
| goes through just fine.
| stanislavb wrote:
| I was recently contacted by one of my website users as their
| RSS reader was blocked by Cloudflare.
| amatecha wrote:
| I get blocked from websites with some regularity, running Firefox
| with strict privacy settings, "resist fingerprinting" etc. on
| OpenBSD. They just give a 403 Forbidden with no explanation, but
| it's only ever on sites fronted by CloudFlare. Good times. Seems
| legit.
| BiteCode_dev wrote:
| Cloudflare is a fantastic service with an unmatched value
| proposition, but it's unfortunately slowly killing web privacy,
| with thousands of paper cuts.
|
| Another problem is that "resist fingerprinting" prevents some
| canvas processing, and many websites like Bluesky, LinkedIn or
| Substack use canvas to handle image uploads, so your images
| appear as stripes of pixels.
|
| Then you have mobile apps that just don't run if you don't have
| a google account, like chatgpt's native app.
|
| I understand why people give up, trying to fight for your
| privacy is an uphill battle with no end in sight.
| madeofpalk wrote:
| > Then you have mobile apps that just don't run if you don't
| have a google account, like chatgpt's native app.
|
| Is that true? At least on iOS you can log into the ChatGPT app
| with the same email/password as the website.
|
| I never use Google login for stuff and ChatGPT works fine for
| me.
| BiteCode_dev wrote:
| See other comment.
| KomoD wrote:
| > Then you have mobile apps that just don't run if you don't
| have a google account, like chatgpt's native app.
|
| That's not true, I use ChatGPT's app on my phone without
| logging into a Google account.
|
| You don't even need any kind of account at all to use it.
| BiteCode_dev wrote:
| On Android at least, even if you don't need to log in to
| your google account when connecting to chatgpt, the app
| won't work if your phone isn't signed in to Google Play,
| which doesn't work if your phone isn't linked to a google
| account.
|
| An android phone asks you to link a google account when you
| use it for the first time. It takes a very dedicated user
| to refuse that, then to avoid logging in to the Gmail,
| youtube or app store apps which will all also link your
| phone to your google account when you sign in.
|
| But I do actively avoid this, I use Aurora, F-droid, K9 and
| NewPipeX, so no link to google.
|
| But then no ChatGPT app. When I start it, I get hit with a
| login page for the app store and it's game over.
| acdha wrote:
| So the requirement is to pass the phone's system
| validation process rather than having a Google account. I
| don't love that but I can understand why they don't want
| to pay the bill for the otherwise ubiquitous bots, and
| it's why it's an Android-specific issue.
| BiteCode_dev wrote:
| You can make a very rational case for each privacy
| invasive technical decision ever made.
|
| In the end, the fact remains: no ChatGPT app without
| giving up your privacy, to Google no less.
| acdha wrote:
| "Giving up your privacy" is a pretty sweeping claim - it
| sounds like you're saying that Android inherently leaks
| private data to Google, which is broader than even Apple
| fans tend to say.
| michaelt wrote:
| A person who was maximally distrustful of Google would
| assume they link your phone and your IP through the
| connection used to receive push notifications, and the
| wifi-network-visibility-to-location API, and the software
| update checker, and the DNS over HTTPS, and suchlike. As
| a US company, they could even be forced to do this in
| secret against their will, and lie about it.
|
| Of course as Google doesn't _claim_ they do this, many
| people would consider it unreasonably fearful/cynical.
| acdha wrote:
| Sure, but that says you shouldn't have a phone, not that
| ChatGPT is forcing you to give up your privacy.
| BiteCode_dev wrote:
| Google and Apple were both part of the PRISM program, of
| course I'm making this claim.
|
| It's the opposite stance that would be bonkers.
| acdha wrote:
| PRISM covered communications through U.S. companies'
| servers. It was not a magic back door giving them access
| to your device's local data, and even if you did believe
| that it was, the answer would be not using a phone. A
| major intelligence agency does not need you to have a
| Google account so they can spy on you.
| ForHackernews wrote:
| > it sounds like you're saying that Android inherently
| leaks private data to Google, which is broader than even
| Apple fans tend to say.
|
| Yes? I mean, not "leaks" - it's designed to upload your
| private data to Google and others.
|
| https://www.tcd.ie/news_events/articles/study-reveals-
| scale-...
|
| > Even when minimally configured and the handset is idle,
| with the notable exception of e/OS, these vendor-
| customised Android variants transmit substantial amounts
| of information to the OS developer and to third parties
| such as Google, Microsoft, LinkedIn, and Facebook that
| have pre-installed system apps. There is no opt-out from
| this data collection.
| ForHackernews wrote:
| You might like: https://e.foundation/e-os/
| BiteCode_dev wrote:
| That won't make chatgpt's app work though.
| ForHackernews wrote:
| It might well do, depending on what ChatGPT's app is
| asking the OS for. /e/OS is an Android fork that removes
| Google services and replaces them with open source
| stubs/re-implementations from https://microg.org/
|
| I haven't tried the ChatGPT app, but I know that, for
| example my bank and other financial services apps work
| with on-device fingerprint authentication and no Google
| account on /e/OS.
| __MatrixMan__ wrote:
| I have a similar experience with the pager duty app. It
| loads up and then exits with "security problem detected
| by app" because I've made it more secure by isolating it
| from Google (a competitor). Workaround is to just control
| it via slack instead.
| BiteCode_dev wrote:
| Well you can use the web-based chatgpt so there is a
| workaround. Except it's a worse experience.
| pjc50 wrote:
| The privacy battle _has_ to be at the legal layer. GDPR is
| far from perfect (bureaucratic and unclear with weak
| enforcement), but it's a step in the right direction.
|
| In an adversarial environment, especially with both AI
| scrapers and AI posters, websites have to be able to identify
| and ban persistent abusers. Which unfortunately implies
| having some kind of identification of _everybody_.
| BiteCode_dev wrote:
| That's another problem, we want cheap easy solutions like
| tracking people, instead of more targeted or systemic
| ones.
| nonameiguess wrote:
| No, it's more than that. Cloudflare's bot protection has
| blocked me from sites where I have a paid account, paid for
| by my real checking account with my real name attached.
| Even when I am perfectly willing to give out my identity
| and be tracked, I still can't because I can't even get to
| the login page.
| HappMacDonald wrote:
| They block such visits because their program suspects that
| your visit is from the account of a real human that was
| hacked by a bot.
| wbl wrote:
| You notice that Analog Devices puts their (incredibly
| useful) information up for free. That's because they make
| money in other ways. The ad-supported content-farm Internet
| had a nice run but we will get on without it.
| Gormo wrote:
| > The privacy battle has to be at the legal layer.
|
| I couldn't disagree more. The way to protect privacy is to
| make privacy the standard at the implementation layer, and
| to make it costly and difficult to breach it.
|
| Trying to rely on political institutions without the
| practical and technical incentives favoring privacy will
| inevitably result in the political institutions themselves
| becoming the main instrument that erodes privacy.
| HappMacDonald wrote:
| Yet without regulation nothing stops large companies from
| simply changing the implementation layer for one that
| pads their bottom line better, or just rebuilding it from
| scratch.
|
| If people who valued privacy really controlled the
| implementation layer we wouldn't have gotten to this
| point in the first place.
| Gormo wrote:
| The point we're at is one in which privacy is still
| attainable via implementation-layer measures, even if it
| requires investing some effort and making some trade-offs
| to sustain. The alternative -- placing trust in
| regulation, which _never_ works in the long run -- will
| inevitably result in regulatory capture that eliminates
| those remaining practical measures and replaces them
| with, at best, a performative illusion.
| viraptor wrote:
| I know it's not a solution for you specifically here, but if
| anyone has access to the CF enterprise plan, they can report
| specific traffic as non-bot and hopefully improve the
| situation. They need to have access to the "Bot Management"
| feature though. It's a shitty situation, but some of us here
| _can_ push back a little bit - so do it if you can.
|
| And yes, it's sad that the "make the internet work again" switch
| is behind an expensive paywall.
| meeby wrote:
| The issue here is that RSS readers _are_ bots. Obviously
| perfectly sensible and useful bots, but they're not "real
| people using a browser". I doubt you could get RSS readers
| listed on Cloudflare's "good bots" list either, which would
| let them through the default bot protection feature, given
| they'll all run off random residential IPs.
| j16sdiz wrote:
| They can't whitelist user agents, otherwise bots would get
| through just by spoofing the agent.
|
| If you have an enterprise plan, you can have custom rules,
| including allowing by URL.
| sam345 wrote:
| Not sure if I get this. It seems to me an RSS reader is as
| much of a bot as a browser is for HTML. It just reads RSS
| rather than HTML.
| kccqzy wrote:
| The difference is that RSS readers usually do background
| fetches on their own rather than waiting for a human to
| navigate to a page. So in theory, you could just set up a
| crontab (or systemd timer) that simply xdg-opens various
| pages on a schedule and not be treated as a bot.
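|
| A feed poller is really just a loop like this (a toy sketch;
| the feed list and User-Agent are made up), which is why there
| is never a human around to answer a challenge page:
|
|     import time
|     import urllib.request
|
|     FEEDS = ["https://example.com/feed.xml"]  # placeholder list
|
|     while True:
|         for url in FEEDS:
|             req = urllib.request.Request(
|                 url, headers={"User-Agent": "ExampleReader/0.1"})
|             with urllib.request.urlopen(req, timeout=10) as resp:
|                 print(url, resp.status, len(resp.read()), "bytes")
|         time.sleep(15 * 60)  # wake up every 15 minutes, no human involved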
| viraptor wrote:
| I was responding to a person with Firefox issues, not RSS.
|
| I'm not sure either if RSS bots could be added to good
| bots, but if anyone has traffic from them, we can
| definitely try. (No high hopes though, given the responses
| I got from support so far)
| Jazgot wrote:
| My rss reader was blocked on kvraudio.com by cloudflare. This
| issue wasn't solved for months. I simply stopped reading
| anything on kvraudio. Thank you cloudflare!
| wakeupcall wrote:
| Also running FF with strict privacy settings and several
| blockers. The annoyances are constantly increasing. Cloudflare,
| captchas, "we think you're a bot", constantly recurring cookie
| popups and absurd requirements are making me hate most of the
| websites and services I hit nowadays.
|
| I tried for a long time to get around it, but now when I hit a
| website like this I just close the tab and don't bother anymore.
| afh1 wrote:
| Same, but for VPN (either corporate or personal). Reddit
| blocks it completely, requires you to sign-in but even the
| sign-in page is "network restricted"; LinkedIn shows you a
| captcha but gives an error when submitting the result
| (several reports online); and overall a lot of 403's. All go
| magically away when turning off the VPN. Companies, especially
| adtechs like Reddit and LinkedIn, do NOT want you to browse
| privately, to the point that they'd rather you not use their
| website at all than use it with a condom.
| appendix-rock wrote:
| I don't follow the logic here. There seems to be an
| implication of ulterior motive but I'm not seeing what it
| is. What aspect of 'privacy' offered by a VPN do you think
| that Reddit / LinkedIn are incentivised to bypass? From a
| privacy POV, your VPN is doing nothing to them, because
| your IP address means very little to them from a tracking
| POV. This is just FUD perpetuated by VPN advertising.
|
| However, the undeniable reality is that accessing the
| website with a non-residential IP is a very, very strong
| indicator of sinister behaviour. Anyone that's been in a
| position to operate one of these services will tell you
| that. For every...let's call them 'privacy-conscious' user,
| there are 10 (or more) nefarious actors that present
| largely the same way. It's easy to forget this as a user.
|
| I'm all but certain that if Reddit or LinkedIn could
| differentiate, they would. But they can't. That's kinda the
| whole point.
| bo1024 wrote:
| Not following what could be sinister about a GET request
| to a public website.
|
| > From a privacy POV, your VPN is doing nothing to them,
| because your IP address means very little to them from a
| tracking POV.
|
| I disagree. (1) Since I have javascript disabled, IP
| address is generally their next best thing to go on. (2)
| I don't want to give them IP address to correlate with
| the other data they have on me, because if they sell that
| data, now someone else who only has my IP address
| suddenly can get a bunch of other stuff with it too.
| zahllos wrote:
| SQL injection?
|
| GET parameters can be abused like any parameter. This
| could be SQL injection, could be directory traversal
| attempts, brute-force username attempts, you name it.
| kam wrote:
| If your site is vulnerable to SQL injection, you need to
| fix that, not pretend Cloudflare will save you.
| hombre_fatal wrote:
| At the very least, they're wasting bandwidth to a
| (likely) low quality connection.
|
| But anyone making malicious POST requests, like spamming
| chatGPT comments, first makes GET requests to load the
| submission and find comments to reply to. If they think
| you're a low quality user, I don't see why they'd bother
| locking down just POSTs.
| afh1 wrote:
| IP address is a fingerprint to be shared with third
| parties, of course it's relevant. It's not ulterior
| motive, it's explicit, it's not caring about your traffic
| because you're not a good product. They can and do
| differentiate by requiring a sign-in. They just don't
| care enough to make it actually work. Because they are
| adtechs and not interested in you as a user.
| homebrewer wrote:
| It's equally easy to forget about users from countries
| with way less freedom of speech and information sharing
| than in Western rich societies. These anti-abuse measures
| have made it much more difficult to access information
| blocked by my internet provider during the last few
| years. I'm relatively competent and can find ways around
| it, but my friends and relatives who pursue other career
| choices simply don't bother anymore.
|
| Telegram channels have been a good alternative, but even
| that is going downhill thanks to French authorities.
|
| Cloudflare and Google also often treat us like bots
| (endless captchas, etc) which makes it even more
| difficult.
| miki123211 wrote:
| > For every...let's call them 'privacy-conscious' user,
| there are 10 (or more) nefarious actors that present
| largely the same way.
|
| And each one of these could potentially create thousands
| of accounts, and do 100x as many requests as a normal
| user would.
|
| Even if only 1% of the people using your service are
| fraudsters, a normal user has at most a few accounts,
| while fraudsters may try to create thousands per day.
| This means that e.g. 90% of your signups are fraudulent,
| despite the population of fraudsters being extremely
| small.
| ruszki wrote:
| Has anybody ever been stopped from doing nefarious things
| by these annoyances?
|
| It's like at my current and previous companies. They impose
| a lot of security restrictions. The problem is, if somebody
| wants to get data out (or in), they can do it anytime. The
| security department says it's against "accidental" leaks.
| I'm still waiting for a single instance where they actually
| caught an "accidental" leak, rather than just introducing
| extra steps after which I achieve the exact same thing.
| Even when I caused a real potential leak, nobody stopped me
| from doing it. The only reason they have these security
| services/apps is to push responsibility onto other
| companies.
| acdha wrote:
| > Companies, specially adtechs like Reddit and LinkedIn, do
| NOT want you to browse privately, to the point they rather
| you don't use their website at all unless without a condom.
|
| That's true in some cases, I'm sure, but also remember that
| most site owners deal with lots of tedious abuse. For
| example, some people get really annoyed about Tor being
| blocked but for most sites Tor is a tiny fraction of total
| traffic but a fairly large percentage of the abuse probing
| for vulnerabilities, guessing passwords, spamming contact
| forms, etc., so while I sympathize with the legitimate users
| I also completely understand why a busy site operator is
| going to flip a switch making their log noise go down by a
| double-digit percentage.
| rolph wrote:
| Funny thing, when FF is blocked I can get through with
| Tor.
| anthk wrote:
| For Reddit I just use it r/o under gopher://gopherddit.com
|
| A good client is either Lagrange (multiplatform), the old
| Lynx, or Dillo with the Gopher plugin.
| Adachi91 wrote:
| > Reddit blocks it completely, requires you to sign-in but
| even the sign-in page is "network restricted";
|
| I've been creating accounts every time I need to visit
| Reddit now to read a thread about [insert subject]. They do
| not validate E-Mail, so I just use `example@example.com`,
| whatever random username it suggests, and `example` as a
| password. I've created at least a thousand accounts at this
| point.
|
| Malicious Compliance, until they disable this last effort
| at accessing their content.
| hombre_fatal wrote:
| Most subreddits worth posting on usually have a minimum
| account age + minimum account karma. I've found it
| annoying to register new accounts too often.
| zargon wrote:
| They verify signup emails now. At least for me.
| immibis wrote:
| I've created a few thousand accounts through a VPN
| (random node per account). After doing that, I found out
| Reddit accounts created through VPNs are automatically
| shadow banned the second time they comment (I think the
| first is also shadow deleted in some way). But they allow
| you to browse from a shadow banned account just fine.
| lioeters wrote:
| Same here. I occasionally encounter websites that won't work
| with ad blockers, sometimes with Cloudflare involved, and I
| don't even bother with those sites anymore. Same with sites
| that display a cookie "consent" form without an option to not
| accept. I reject the entire site.
|
| Site owners probably don't even see these bounced visits, and
| it's such a tiny percentage of visitors who do this that it
| won't make a difference. Meh, it's just another obstacle to
| using the web on our own terms.
| capitainenemo wrote:
| It's a tiny percentage of visitors, but a tech savvy one,
| and depending on your website, they could be a higher than
| average percentage of useful users or product purchasers.
| The impact could be disproportionate. What's frustrating is
| many websites don't even realise it is happening because
| the reporting from the intermediary (Cloudflare, say) is
| inaccurate or incorrectly represents how it works.
| Fingerprinting has become integral to bot "protection".
| It's also frustrating when people think this can be a
| drop-in solution, and put it in front of APIs whose clients
| are completely incapable of handling the challenge, with no
| special casing (encountered on FedEx, GoFundMe), much like
| the RSS reader problem.
| orbisvicis wrote:
| I have to solve captchas for Amazon while logged into my
| Amazon account.
| tenken wrote:
| Why?! ... I've had 404 pages on Amazon, but never a
| captcha...
| m463 wrote:
| at one point I couldn't access amazon at night.
|
| I would get different captchas, one so convoluted that it
| wouldn't even load the required images.
|
| And I would get the oops sorry dog page for _everything_.
|
| I finally contacted amazon, gave them my (static) ip
| address and it was good.
|
| In other locations, I have to solve a 6-distorted-letter
| captcha to log in, but that's the extent of it.
| anilakar wrote:
| Heck, I cannot even pass ReCAPTCHA nowadays. No amount of
| clicking buses, bicycles, motorcycles, traffic lights,
| stairs, crosswalks, bridges and fire hydrants will suffice.
| The audio transcript feature is the only way to get past a
| prompt.
| josteink wrote:
| Just a heads up that this is how Google treats connections
| it suspects to originate from bots. Silently keeping you in
| an endless loop promising reward if you can complete it
| correctly.
|
| I discovered this when I set up IPv6 using hurricane
| electric as a tunnel broker for IPv6 connectivity.
|
| Seemingly Google has all HE.net IPv6 tunnel subnets listed
| for such behaviour without it being documented anywhere. It
| was extremely annoying until I figured out what was going
| on.
| n4r9 wrote:
| > Silently keeping you in an endless loop promising
| reward if you can complete it correctly.
|
| Sounds suspiciously like how product managers talk to
| developers as well.
| anilakar wrote:
| Sadly my biggest crime is running Firefox with default
| privacy settings and uBlock Origin installed. No VPNs or
| IPv6 tunnels, no Tor traffic whatsoever, no Google search
| history poisoning plugins.
|
| If only there was a law that allowed one to be excluded
| from automatic behavior profiling...
| marssaxman wrote:
| There's a pho restaurant near where I work which wants you
| to scan a QR code at the table, then order and pay through
| their website instead of talking to a person. In three
| visits, I have not once managed to get past their captcha!
|
| (The _actual_ process at this restaurant is to sit down,
| fuss with your phone a bit, then get up like you're about
| to leave; someone will arrive promptly to take your order.)
| eddythompson80 wrote:
| I've only seen that at Asian restaurants near a
| university in my city. When I asked I was told that this
| is a common way in China and they get a lot of
| international students who prefer/expect it that way.
| amanda99 wrote:
| Yes and the most infuriating thing is the "we need to verify
| the security of your connection" text.
| JohnFen wrote:
| > when I hit a website like this just close the tab and don't
| bother anymore.
|
| Yeah, that's my solution as well. I take those annoyances as
| the website telling me that they don't want me there, so I
| grant them their wish.
| immibis wrote:
| That's fine. You were an obstacle to their revenue
| gathering anyway.
| SoftTalker wrote:
| Same. If a site doesn't want me there, fine. There's no
| website that's so crucial to my life that I will go through
| those kinds of contortions to access it.
| doctor_radium wrote:
| Hey, same here! For better or worse, I use Opera Mini for
| much of my mobile browsing, and it fares far worse than
| Firefox with uBlock Origin and ResistFingerprinting. I
| complained about this roughly a year ago on a similar HN
| thread, on which a Cloudflare rep also participated. Since
| then something changed, but both sides being black boxes, I
| can't tell if Cloudflare is wising up or Mini has stepped up.
| I still get the same challenge pages, but Mini gets through
| them automatically now, more often than not.
|
| But not always. My most recent stumbling block is
| https://www.napaonline.com. Guess I'm buying oxygen sensors
| somewhere else.
| anal_reactor wrote:
| On my phone Opera Mobile won't be allowed into some websites
| behind CloudFlare, most importantly 4chan
| dialup_sounds wrote:
| 4chan's CF config is so janky at this point it's the only
| site I have to use a VPN for.
| mzajc wrote:
| I randomize my User-Agent header and many websites outright
| block me, most often with no captcha and not even a useless
| error message.
|
| The most egregious is Microsoft (just about every Microsoft
| service/page, really), where all you get is a "The request is
| blocked." and a few pointless identifiers listed at the bottom,
| purely because it thinks your browser is too old.
|
| CF's captcha page isn't any better either, usually putting me
| in an endless loop if it doesn't like my User-Agent.
| charrondev wrote:
| Are you sending an actual random string as your UA or sending
| one of a set of actual user agents?
|
| You're best off just picking real ones. We got hit by a
| botnet sending 10k+ requests from 40 different ASNs with
| 1000s of different IPs. The only way we were able to
| identify/block the traffic was by excluding user agents
| matching some regex (for whatever reason they weren't
| spoofing real user agents, but weren't sending actual ones
| either).
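|
| For illustration, the filter amounted to something like this
| (the patterns below are hypothetical, not the ones we actually
| used):
|
|     import re
|
|     # Flag UAs that are empty, a bare token with no
|     # product/version structure, or a stock HTTP-library default.
|     SUSPICIOUS_UA = re.compile(
|         r"^$|^[a-z0-9]{8,}$|^python-requests|^go-http-client")
|
|     def looks_like_junk_ua(user_agent: str) -> bool:
|         return bool(SUSPICIOUS_UA.search(user_agent.strip().lower()))
|
|     print(looks_like_junk_ua("Mozilla/5.0 (X11; Linux x86_64)"))  # False
|     print(looks_like_junk_ua("asdf1234qwer"))                     # True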
| RALaBarge wrote:
| I worked at an anti-spam email security company in the
| aughts, and we had a perl engine that would rip apart the
| MIME boundaries and measure everything - UA, SMTP client
| fingerprint headers, even the number of anchor or paragraph
| tags. A large combination of IF/OR evaluations with a regex
| engine did a pretty good job, since the botnets usually
| don't bother to fully randomize or really opsec the
| payloads they are sending - it's a cannon instead of a
| flyswatter.
| kccqzy wrote:
| Similar techniques are known in the HTTP world too. There
| were things like detecting the order of HTTP request
| headers and matching them to known software, or even just
| comparing the actual content of the Accept header.
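|
| A toy version of the Accept-header trick (purely a sketch, not
| what any real WAF does): curl's default Accept is */* while a
| browser navigation sends a list that includes text/html, so a
| request claiming to be a browser while sending only */* stands
| out.
|
|     def accept_consistent_with_ua(user_agent: str, accept: str) -> bool:
|         """Naive consistency check between User-Agent and Accept."""
|         claims_browser = "Mozilla/" in user_agent
|         browser_like_accept = "text/html" in accept
|         # A "browser" sending curl's bare */* Accept is suspicious.
|         return (not claims_browser) or browser_like_accept
|
|     ff_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Firefox/115.0"
|     print(accept_consistent_with_ua(ff_ua, "text/html,*/*;q=0.8"))  # True
|     print(accept_consistent_with_ua(ff_ua, "*/*"))                  # False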
| miki123211 wrote:
| And then there's also TLS fingerprinting.
|
| Different browsers use TLS in slightly different ways,
| send data in a slightly different order, have a different
| set of supported extensions / algorithms etc.
|
| If your user agent says Safari 18, but your TLS
| fingerprint looks like Curl and not Safari, sophisticated
| services will immediately detect that something isn't
| right.
| mzajc wrote:
| I use the Random User-Agent Switcher[1] extension on
| Firefox. It does pick real agents, but some of them might
| show a really outdated browser (eg. Firefox 5X), which I
| assume is the reason I'm getting blocked.
|
| [1]: https://addons.mozilla.org/en-
| US/firefox/addon/random_user_a...
| pushcx wrote:
| Rails is going to make this much worse for you. All new apps
| include naive agent sniffing and block anything "old"
| https://github.com/rails/rails/pull/50505
| GoblinSlayer wrote:
| def blocked?
|   user_agent_version_reported? && unsupported_browser?
| end
|
| well, you know what to do here :)
| mzajc wrote:
| This is horrifying. What happened to simply displaying a
| "Your browser is outdated, consider upgrading" banner on
| the website?
| shbooms wrote:
| idk, even that seems too much to me, but maybe I'm just
| being too sensitive.
|
| but like, why is it a website's job to tell me what
| browser version to use? unless my outdated browser is
| lacking legitmate functionality which is required by your
| website, just serve the page and be done with it.
| michaelt wrote:
| Back when the sun was setting on IE6, sites deployed
| banners that basically meant "We don't test on this,
| there's a good chance it's broken, but we don't know the
| specifics because we don't test with it"
| freedomben wrote:
| Wow. And this is now happening right as I've blacklisted
| google-chrome due to manifest v3 removal :facepalm:
| whoopdedo wrote:
| The irony being you can get around the block by
| pretending to be a bot.
|
| https://github.com/rails/rails/pull/52531
| lovethevoid wrote:
| Not sure a random UA extension is giving you much privacy.
| Try your results on coveryourtracks.eff.org, and see. A random UA
| would provide a lot of identifying information despite being
| randomized.
|
| From experience, a lot of the things people do in hopes of
| protecting their privacy only make them far easier to
| profile.
| mzajc wrote:
| coveryourtracks.eff.org is a great service, but it has a
| few limitations that apply here:
|
| - The website judges your fingerprint based on how unique
| it is, but assumes that it's otherwise persistent.
| Randomizing my User-Agent serves the exact opposite - a
| given User-Agent might be more unique than using the
| default, but I randomize it to throw trackers off.
|
| - To my knowledge, its "One in x browsers" metric (and by
| extension the "Bits of identifying information" and the
| final result) are based off of visitor statistics, which
| would likely be skewed as most of its visitors are privacy-
| conscious. They only say they have a "database of many
| other Internet users' configurations," so I can't verify
| this.
|
| - Most of the measurements it makes rely on javascript
| support. For what it's worth, it claims my fingerprint is
| not unique when javascript is disabled, which is how I
| browse the web by default.
|
| The other extreme would be fixing my User-Agent to the most
| common value, but I don't think that'd offer me much
| privacy unless I also used a proxy/NAT shared by many
| users.
| HappMacDonald wrote:
| I would just fingerprint you as "the only person on the
| internet who is scrambling their UA string" :)
| neilv wrote:
| Similar here. It's not unusual to be blocked from a site by
| CloudFlare when I'm running Firefox (either ESR or current
| release) on Linux.
|
| I suspect that people operating Web sites have no idea how many
| legitimate users are blocked by CloudFlare.
|
| And, based on the responses I got when I contacted two of the
| companies whose sites were chronically blocked by CloudFlare
| for months, it seemed like it wasn't worth any employee's time
| to try to diagnose.
|
| Also, I'm frequently blocked by CloudFlare when running Tor
| Browser. Blocking by Tor exit node IP address (if that's what's
| happening) is much more understandable than blocking Firefox
| from a residential IP address, but still makes CloudFlare not a
| friend of people who want or need to use Tor.
| pjc50 wrote:
| > CloudFlare not a friend of people who want or need to use
| Tor
|
| The adversarial aspect of all this is a problem:
| P(malicious|Tor) is much higher than P(malicious|!Tor)
| jorams wrote:
| > I suspect that people operating Web sites have no idea how
| many legitimate users are blocked by CloudFlare.
|
| I sometimes wonder if all Cloudflare employees are on some
| kind of whitelist that makes them not realize the ridiculous
| false positive rate of their bot detection.
| amatecha wrote:
| Yeah, I've contacted numerous owners of personal/small sites
| and they are usually surprised, and never have any idea why I
| was blocked (not sure if it's an aspect of CF not revealing
| the reason, or the owner not knowing how to find that
| information). One or two allowlisted my IP but that doesn't
| strike me as a solution.
|
| I've contacted companies about this and they usually just
| tell me to use a different browser or computer, which is like
| "duh, really?" , but also doesn't solve the problem for me or
| anyone else.
| lovethevoid wrote:
| What are some examples? I've been running ff on linux for
| quite some time now and am rarely blocked. I just run it with
| ublock origin.
| capitainenemo wrote:
| Odds are they have Resist Fingerprinting turned on. When I
| use it in a Firefox profile I encounter this all over the
| place. Drupal, FedEx.. some sites handle it better than
| others. Some it's a hard block with a single terse error.
| Some it is a challenge which gets blocked due to using
| remote javascript. Some it's a local challenge you can get
| past. But it has definitely been getting worse.
| Fingerprinting is being normalised, and the excuse of "bot
| protection" (bots can make unique fingerprints too, though)
| means that it can now be used maliciously (or by ad
| networks like google, same diff) as a standard feature.
| johnklos wrote:
| I've had several discussions that were literally along the
| lines of, "we don't see what you're talking about in our
| logs". Yes, you don't - traffic is blocked _before_ it gets
| to your servers!
| pessimizer wrote:
| Also, Cloudflare won't let you in if you forge your referer
| (it's nobody's business what site I'm coming from.) For years,
| you could just send the root of the site you were visiting,
| then last year somebody at Cloudflare flipped a switch and took
| a bite out of everyone's privacy. Now it's just endless
| reloading captchas.
| zamadatix wrote:
| Why go through that hassle instead of just removing the
| referer?
| bityard wrote:
| Lots of sites see an empty referrer and send you to their
| main page or marketing page. Which means you can't get
| anywhere else on their site without a valid referrer. They
| consider it a form of "hotlink" protection.
|
| (I'm not saying I agree with it, just that it exists.)
| zamadatix wrote:
| Fair and valid answer to my wording. Rewritten for what I
| meant to ask: "Why set referrer to the base of the
| destination origin instead of something like Referrer-
| Policy: strict-origin?". I.e. remove it completely for
| cross-origin instead of always making up that you came
| from the destination.
|
| Though what you mention does beg the question "is there
| really much privacy gain in that over using Referrer-
| Policy: same-origin and having referrer based pages work
| right?" I suppose so if you're randomizing your identity
| in an untrackable way for each connection it could be
| attractive... though I think that'd trigger being
| suspected as a bot far before the lack of proper same
| origin info :p.
| philsnow wrote:
| Ah, maybe this is what's happening to me.. I use Firefox with
| uBlock origin, privacy badger, multi-account containers, and
| temporary containers.
|
| Whenever I click a link to another site, I get a new tab in
| either a pre-assigned container or else in a "tmpNNNN"
| container, and I think either by default or I have it
| configured to omit Referer headers on those new tab
| navigations.
| anthk wrote:
| Or any Dillo user, with a PSP User Agent which is legit for
| small displays.
| jasonlotito wrote:
| Cloudflare has always been a dumpster fire in usability. The
| number of times it would block me in that way was enough to
| make me seriously question the technical knowledge of anyone
| who used it. It's a dumpster fire. Friends don't let friends
| use Cloudflare. To me, it's like the Spirit Airlines of CDNs.
|
| Sure, tech wise it might work great, but from your users
| perspective: it's trash.
| immibis wrote:
| It's got the best vendor lock-in enshittification story -
| it's free - and that's all that matters.
| DrillShopper wrote:
| Maybe after the courts break up Amazon the FTC can turn its eye
| to Cloudflare.
| gjsman-1000 wrote:
| A. Do you think courts give a darn about the 0.1% of users
| that are still using RSS? We might as well care about the
| 0.1% of users who want the ability to set every website's
| background color to purple with neon green anchor tags. RSS
| never caught on as a standard to begin with, peaking at 6%
| adoption by 2005.
|
| B. Cloudflare has healthy competition with AWS, Akamai,
| Fastly, Bunny.net, Mux, Google Cloud, Azure, you name it,
| there's a competitor. This isn't even an Apple vs Google
| situation.
| HappMacDonald wrote:
| Cloudflare doesn't offer the same product suite as the
| other companies you mention, though. Cloudflare is
| primarily DDoS prevention while the others are primarily
| cloud hosting.
|
| And it is the DDoS prevention measures at issue here.
| KPGv2 wrote:
| Reddit seems to do this to me (sometimes) when I use Zen
| browser. Switch over to Safari or Chrome and the site always
| works great.
| kjkjadksj wrote:
| Reddit has been bad about it as of late too
| rcarmo wrote:
| Ironically, the site seems to currently be hugged to death, so
| maybe they should consider using Cloudflare to deal with HN
| traffic?
| timeon wrote:
| If it is unintentional DDoS, we can wait. Not everything needs
| to be on demand.
| dewey wrote:
| The website is built to get attention, the attention is here
| right now. Nobody will remember to go back tomorrow and read
| the site again when it's available.
| BlueTemplar wrote:
| I'm not sure an open web can exist under this kind of
| assumption...
|
| Once you start chasing views, it's going to come at the
| detriment of everything else.
| dewey wrote:
| This happened at least 15 years ago and we are doing
| okay.
| sofixa wrote:
| Doesn't have to be using CloudFlare, just a static web host
| that will be able to scale to infinity (of which CloudFlare is
| one with Pages, but there's also Google with Firebase Hosting,
| AWS with Amplify, Microsoft with something in Azure with a
| verbose name, Netlify, Vercel, GitHub Pages, etc etc etc).
| kawsper wrote:
| Or just add Varnish or Nginx configured with a cache in
| front.
| sofixa wrote:
| That can still exhaust system resources on the box it's
| running on (file descriptors, inodes, ports,
| CPU/memory/bandwidth, etc) if you hit it too big.
|
| For something like entirely static content, it's so much
| easier (and cheaper, all of the static hosting providers
| have an extremely generous free tier) to use static
| hosting.
|
| And I say this as an SRE by heart who runs Kubernetes and
| Nomad for fun across a number of nodes at home and in
| various providers - my blog is on a static host. Use the
| appropriate solution for each task.
| vundercind wrote:
| I used to serve low-tens-of-MB .zip files--worse than a web
| page and a few images or what have you--statically from
| Apache2 on a boring Linux server that'd qualify as potato-
| tier today, with traffic spikes into the hundreds of
| thousands per minute. Tens of thousands per minute against
| other endpoints gated by PHP setting a header to tell
| Apache2 to serve the file directly if the client
| authenticated correctly, and I think that one could have
| gone a lot higher, never really gave it a workout. Wasn't
| even really taxing the hardware that much for either
| workload.
|
| Before that, it was on a mediocre-even-at-the-time
| dedicated-cores VM. That caused performance problems...
| because its Internet "pipe" was straw-sized, it turned out.
| The server itself was fine.
|
| Web server performance has regressed amazingly badly in the
| world of the Cloud. Even "serious" sites have decided the
| performance equivalent of shitty shared-host Web hosting is
| a great idea and that introducing all the problems of
| distributed computing at the architecture level will help
| their moderate-traffic site work better (LOL; LMFAO), so
| now they need Cloudflare and such just so their "scalable"
| solution doesn't fall over in a light breeze.
| erikrothoff wrote:
| As the owner of an RSS reader I love that they are making this
| more public. 30% of our support requests are "my feed doesn't
| work". It sucks that the only thing we can say is "contact the
| site owner, it's their firewall". And to be fair it's not only
| Cloudflare, so many different firewall setups cause issues. It's
| ironic that a public API endpoint meant for bots is blocked for
| being a bot.
| ricardo81 wrote:
| iirc even if you're listed as a "good bot" with Cloudflare, high
| security settings by the CF user can still result in 403s.
|
| No idea if CF already does this, but allowing users to generate
| access tokens for 3rd party services would be another way of
| easing access alongside their apparent URL and IP whitelisting.
| mbo wrote:
| This is an active issue with Rate Your Music right now:
| https://rateyourmusic.com/rymzilla/view?id=6108
|
| Unfixed for 4 months.
| jgrahamc wrote:
| My email is jgc@cloudflare.com. I'd like to hear from the owners
| of RSS readers directly on what they are experiencing. Going to
| ask the team to take a closer look.
| viraptor wrote:
| It's cool and all that you're making an exception here, but how
| about including a "no, really, I'm actually a human" link on
| the block page rather than giving the visitor a puzzle: how to
| report the issue to the page owner (hard on its own for
| normies) if you can't even load the page. This is just
| externalising issues that belong to the Cloudflare service.
| methou wrote:
| Some clients are more like a bot/service, imagine google
| reader that fetches and caches content for you. The client
| I'm currently using is miniflux, it also works in this way.
|
| I understand that there are some more interactive rss
| readers, but from personal experience it's more like "hey I'm
| a good bot, let me in"
| _Algernon_ wrote:
| An rss reader is a user agent (ie. a software acting on
| behalf of its users). If you define rss readers as a bot
| (even if it is a good bot), you may as well call Firefox a
| bot (it also sends off web requests without explicit
| approval of each request by the user).
| sofixa wrote:
| Their point was that the RSS reader does the scraping on
| its own in the background, without user input. If it
| can't read the page, it can't; it's not initiated by the
| user where the user can click on a "I'm not a bot, I
| promise" button.
| viraptor wrote:
| It was a mental skip, but the same idea. It would be awesome
| if CF just allowed reporting issues at the point something
| gets blocked - regardless if it's a human or a bot. They're
| missing an "I'm misclassified" button for people actually
| affected without the third-party runaround.
| fluidcruft wrote:
| Unfortunately, I would expect that queue of reports to
| get flooded by bad faith actors.
| viraptor wrote:
| Sure, but now they say that queue should go to the
| website owner instead, who has less global visibility on
| the traffic. So that's just ignoring something they don't
| want to deal with.
| jgrahamc wrote:
| I am not trying to "make an exception", I'm asking for
| information external to Cloudflare so I can look at what
| people are experiencing and compare with what our systems are
| doing and figure out what needs to improve.
| robertlagrant wrote:
| This is useful info:
| https://news.ycombinator.com/item?id=33675847
| PaulRobinson wrote:
| Some "bots" are legitimate. RSS is intended for machine
| consumption. You should not be blocking content intended
| for machine consumption because a machine is attempting to
| consume it. You should not expect a machine, consuming
| content intended for a machine, to do some sort of step to
| show they aren't a machine, because they are in fact a
| machine. There is a lot of content on the internet that is
| not used by humans, and so checking that humans are using
| it is an aggressive anti-pattern that ruins experiences for
| millions of people.
|
| It's not that hard. If the content being requested is RSS
| (or Atom, or some other syndication format intended for
| consumption by software), just don't do bot checks, use
| other mechanisms like rate limiting if you must stop abuse.
|
| As an example: would you put a captcha on robots.txt as
| well?
|
| As other stories here can attest to, Cloudflare is slowly
| killing off independent publishing on the web through poor
| product management decisions and technology
| implementations, and the fix seems pretty simple.
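|
| Rate limiting is not hard either; even a toy per-client token
| bucket (a sketch only, the numbers are arbitrary) is enough to
| blunt abuse of a feed URL without ever showing a challenge
| page:
|
|     import time
|     from collections import defaultdict
|
|     RATE = 1 / 60.0   # refill one request per minute
|     BURST = 5         # allow a small burst
|
|     _buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})
|
|     def allow(client_ip: str) -> bool:
|         b = _buckets[client_ip]
|         now = time.monotonic()
|         b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
|         b["ts"] = now
|         if b["tokens"] >= 1:
|             b["tokens"] -= 1
|             return True
|         return False
|
|     print([allow("203.0.113.7") for _ in range(7)])  # last two are False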
| jamespo wrote:
| From another post, if the content-type is correct it gets
| through. If this is the case I don't see the problem.
| Scramblejams wrote:
| It's a very common misconfiguration, though, because it
| happens by default when setting up CF. If your customers
| are, by default, configuring things incorrectly, then
| it's reasonable to ask if the service should surface the
| issue more proactively in an attempt to help customers
| get it right.
|
| As another commenter noted, not even CF's own RSS feed
| seems to get the content type right. This issue could
| clearly use some work.
| doctor_radium wrote:
| I had a conversation with a web site owner about this once.
| There apparently is such a feature, a way for sites to
| configure a "Please contact us here if you're having trouble
| reaching our site" page...usage of which I assume Cloudflare
| could track and then gain better insight into these issues.
| The problem? It requires a Premium Plan.
| kalib_tweli wrote:
| There are email obfuscation and managed challenge script tags
| being injected into the RSS feed.
|
| You simply shouldn't have any challenges whatsoever on an RSS
| feed. They're literally meant to be read by a machine.
| kalib_tweli wrote:
| I confirmed that if you explicitly set the Content-Type
| response header to application/rss+xml it seems to work with
| Cloudflare Proxy enabled.
|
| The issue here is that Cloudflare's content type check is
| naive. And the fact that CF is checking the content-type
| header directly needs to be made more explicit OR they need
| to do a file type check.
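|
| For anyone whose origin doesn't already do this, it is a
| one-liner in most frameworks. A minimal sketch with Flask (the
| route and feed body are placeholders):
|
|     from flask import Flask, Response
|
|     app = Flask(__name__)
|
|     FEED_XML = """<?xml version="1.0" encoding="UTF-8"?>
|     <rss version="2.0"><channel><title>Example</title>
|     <link>https://example.com/</link><description>demo</description>
|     </channel></rss>"""
|
|     @app.route("/rss")
|     def rss():
|         # Explicit media type so proxies can tell this is a feed,
|         # not an HTML page meant for humans.
|         return Response(FEED_XML, mimetype="application/rss+xml")
|
|     if __name__ == "__main__":
|         app.run()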
| londons_explore wrote:
| I wonder if popular software for _generating_ RSS feeds
| might not be setting the correct content-type header? Maybe
| this whole issue could be mostly-fixed by a few github PR
| 's...
| kalib_tweli wrote:
| It wouldn't. It's the role of the HTTP server to set the
| correct content type header.
| djbusby wrote:
| The number of feeds with crap headers and other non-spec
| stuff going on; and loads of clients missing useful
| headers. Ugh. It seems like it should be simple; maybe
| that's why there are loads of naive implementations.
| onli wrote:
| Correct might be debatable here as well. My blog for
| example sets Content-Type to text/xml, which is not
| exactly wrong for an RSS feed (after all, it is text and
| XML) and IIRC was the default back then.
|
| There were compatibility issues with other type headers,
| at least in the past.
| johneth wrote:
| I think the current correct content types are:
|
| 'application/rss+xml' (for RSS)
|
| 'application/atom+xml' (for Atom)
| londons_explore wrote:
| Sounds like a kind samaritan could write a scanner to
| find as many RSS feeds as possible which look like
| RSS/Atom and _don't_ have these content types, then go
| and patch the hosting software those feeds use to have
| the correct content types, or ask the webmasters to fix
| it if they're home-made sites.
|
| As soon as a majority of sites use the correct types,
| clients can start requiring it for newly added feeds,
| which in turn will make webmasters make it right if they
| want their feed to work.
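|
| The scanning part is only a few lines (a rough sketch; the feed
| list is a placeholder and requests is a third-party library):
|
|     import requests
|
|     EXPECTED = ("application/rss+xml", "application/atom+xml")
|     FEEDS = ["https://example.com/feed.xml",
|              "https://blog.example.org/atom.xml"]  # placeholder list
|
|     for url in FEEDS:
|         try:
|             resp = requests.get(url, timeout=10)
|         except requests.RequestException as exc:
|             print(f"{url}: request failed ({exc})")
|             continue
|         ctype = resp.headers.get("content-type", "").split(";")[0].strip()
|         if ctype not in EXPECTED:
|             print(f"{url}: served as {ctype or '<missing>'}")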
| onli wrote:
| Not even Cloudflare's own blog uses those,
| https://blog.cloudflare.com/rss/, or am I getting a wrong
| content-type shown in my dev tools? For me it is
| `application/xml`. So even if `application/rss+xml` were
| the correct type by an official spec, it's not something
| to rely on if it's not used commonly.
| johneth wrote:
| I just checked Wikipedia and it says Atom's is
| 'application/atom+xml' (also confirmed in the IANA
| registry), and RSS's is 'application/rss+xml' (but it's
| not registered yet, and 'text/xml' is also used widely).
|
| 'application/rss+xml' seems to be the best option though
| in my opinion. The '+xml' in the media type tells (good)
| parsers to fall back to using an XML parser if they don't
| understand the 'rss' part, but the 'rss' part provides
| more accurate information on the content's type for
| parsers that do understand RSS.
|
| All that said, it's a mess.
| o11c wrote:
| Even outside of RSS, the injected scripts often make internet
| security significantly _worse_.
|
| Since the user-agent has no way to distinguish scripts
| injected by cloudflare from scripts originating from the
| actual website, in order to pass the challenge they are
| forced to execute arbitrary code from an untrusted party. And
| malicious Javascript is practically ubiquitous on the general
| internet.
| prmoustache wrote:
| It is not only RSS reader users that are affected. Any user
| with some extension to block trackers gets regularly forbidden
| access to websites or has to deal with tons of captchas.
| kevincox wrote:
| I'll mail you as well but I think public discussion is helpful.
| Especially since I have seen similar responses to this over the
| years and it feels very disingenuous. The problem is very clear
| (Cloudflare serves 403 blocks to feed readers for no reason) and
| you have all of the logs. The solution is maybe not trivial but
| I fail to see how the perspective of someone seeing a 403 block
| is going to help much. This just starts to sound like a way to
| seem responsive without actually doing anything.
|
| From the feed reader perspective it is a 403 response. For
| example my reader has been trying to read
| https://blog.cloudflare.com/rss/ and the last successful
| response it got was on 2021-11-17. It has been backing off due
| to "errors" but it still is checking every 1-2 weeks and gets a
| 403 every time.
|
| This obviously isn't limited to the Cloudflare blog, I see it
| on many site "protected by" (or in this case broken by)
| Cloudflare. I could tell you what public cloud IPs my reader
| comes from or which user-agent it uses but that is besides the
| point. This is a URL which is clearly intended for bots so it
| shouldn't be bot-blocked by default.
|
| When people reach out to customer support we tell them that
| this is a bug for the site and there isn't much we can do. They
| can try contacting the site owner but this is most likely the
| default configuration of Cloudflare causing problems that the
| owner isn't aware of. I often recommend using a service like
| FeedBurner to proxy the request as these services seem to be on
| the whitelist of Cloudflare and other scraping prevention
| firewalls.
|
| I think the main solution would be to detect intended-for-
| robots content and exclude it from scraping prevention by
| default (at least to a huge degree).
|
| Another useful mechanism would be to allow these to be accessed
| when the target page is cachable, as the cache will protect the
| origin from overload-type DoS attacks anyways. Some care needs
| to be taken to ensure that adding a ?bust={random} query
| parameter can't break through to the origin but this would be a
| powerful tool for endpoints that need protection from overload
| but not against scraping (like RSS feeds). Unfortunately cache
| headers for feeds are far from universal, so this wouldn't fix
| all feeds on its own. (For example the Cloudflare blog's feed
| doesn't set any caching headers and is labeled as `cf-cache-
| status: DYNAMIC`.)
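|
| For what it's worth, making a feed cache-friendly at the origin
| is simple. A bare-bones sketch using Python's standard library
| (the path, body, and max-age are arbitrary):
|
|     from http.server import BaseHTTPRequestHandler, HTTPServer
|
|     FEED = b'<?xml version="1.0"?><rss version="2.0"><channel/></rss>'
|
|     class FeedHandler(BaseHTTPRequestHandler):
|         def do_GET(self):
|             if self.path != "/rss":
|                 self.send_error(404)
|                 return
|             self.send_response(200)
|             self.send_header("Content-Type", "application/rss+xml")
|             # An explicit caching policy lets a CDN absorb polling
|             # traffic instead of passing every request to the origin.
|             self.send_header("Cache-Control", "public, max-age=900")
|             self.send_header("Content-Length", str(len(FEED)))
|             self.end_headers()
|             self.wfile.write(FEED)
|
|     if __name__ == "__main__":
|         HTTPServer(("", 8000), FeedHandler).serve_forever()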
| is_true wrote:
| Maybe when you detect URLs that return the RSS mimetype, notify
| the owner of the site/CF account that it might be a good idea
| to allow bots on those URLs.
|
| Ideally you could make it a simple switch in the config,
| something like: "Allow automated access on RSS endpoints".
| badlibrarian wrote:
| Thank you for showing up here and being open to feedback. But I
| have to ask: shouldn't Cloudflare be running and reviewing
| reports to catch this before it became such a problem? It's
| three clicks in Tableau for anyone who cares, and clearly
| nobody does. And this isn't the first time something like this
| has slipped through the cracks.
|
| I tried reaching out to Cloudflare with issues like this in the
| past. The response is dozens of employees hitting my LinkedIn
| page yet no responses to basic, reproducible technical issues.
|
| You need to fix this internally as it's a reputational problem
| now. Less screwing around using Salesforce as your private
| Twitter, more leadership in triage. Your devs obviously aren't
| motivated to fix this stuff independently and for whatever
| reason they keep breaking the web.
| 015a wrote:
| The reality that HackerNews denizens need to accept, in this
| case and in a more general form, is: RSS feeds are not
| popular. They aren't just unpopular in the way that, say,
| Peacock is unpopular relative to Netflix; they're _truly_
| unpopular, used regularly by a number of people who could
| fit in an American football stadium. There are younger
| software engineers at Cloudflare that have never heard the
| term "RSS" before, and have no notion of what it is. It will
| probably be dead technology in ten years.
|
| I'm not saying this to say it's a good thing; it isn't.
|
| Here's something to consider though: Why are we going after
| Cloudflare for this? Isn't the website operator far, far more
| at-fault? They chose Cloudflare. They configure Cloudflare.
| They, in theory, publish an RSS feed, which is broken because
| of infrastructure decisions _they_ made. You're going after
| Ryobi because you've got a leaky pipe. But beyond that: isn't
| this tool Cloudflare publishes doing exactly what the website
| operators intended it to do? It blocks non-human traffic. RSS
| clients are non-human traffic. Maybe the reason you don't
| want to go after the website operators is because you know
| you're in the wrong? Why can't these RSS clients detect when
| they encounter this situation, and prompt the user with a
| captive portal to get past it?
| badlibrarian wrote:
| I'm old enough to remember Dave Winer taking Feedburner to
| task for inserting crap into RSS feeds that broke his code.
|
| There will always be niche technologies and nascent
| standards and we're taking Cloudflare to task today because
| if they continue to stomp on them, we get nowhere.
|
| "Don't use Cloudflare" is an option, but we can demand
| both.
| gjsman-1000 wrote:
| "Old man yells at cloud about how the young'ns don't
| appreciate RSS."
|
| I mean that somewhat sarcastically; but there does come a
| point where the demands are unreasonable, the technology
| is dead. There are probably more people browsing with
| JavaScript disabled than using RSS feeds. There are
| probably more people browsing on Windows XP than using
| RSS feeds. Do I yell at you because your personal blog
| doesn't support IE6 anymore?
| badlibrarian wrote:
| Spotify and Apple Podcasts use RSS feeds to update what
| they show in their apps. And even if millions of people
| weren't dependent on it, suggesting that an
| infrastructure provider not fix a bug only makes the web
| worse.
| 015a wrote:
| I'm not backing down on this one: This is straight up an
| "old man yelling at the kids to get off his lawn"
| situation, and the fact that JGC from Cloudflare is in
| here saying "we'll take a look at this" is so far and
| beyond what anyone reasonable would expect of them that
| they deserve praise and nothing else.
|
| This is a matter between You and the Website Operators,
| period. Cloudflare has nothing to do with this. This
| article puts "Cloudflare" in the title because it's fun to
| hate on Cloudflare and it gets upvotes. Cloudflare is a
| tool. These website operators are using Cloudflare The
| Tool to block inhuman access to their websites. RSS
| CLIENTS ARE NOT HUMAN. Let me repeat that: Cloudflare's
| bot detection is working fully appropriately here,
| because RSS Clients are Bots. Everything here is working
| as expected. The part where change should be asked is:
| Website operators should allow inhuman actors past the
| Cloudflare bot detection firewall specifically for RSS
| feeds. They can FULLY DO THIS. Cloudflare has many, many
| knobs and buttons that Website Operators can tweak; one
| of those is e.g. a page rule to turn off bot detection
| for specific routes, such as `/feed.xml`.
|
| If your favorite website is not doing this, it's NOT
| CLOUDFLARE'S FAULT.
|
| Take it up with the Website Operators, Not Cloudflare.
| Or, build an RSS Client which supports a captive portal
| to do human authorization. God this is so boring, y'all
| just love shaking your fist and yelling at big tech for
| LITERALLY no reason. I suspect it's actually because half
| of y'all are concerningly uneducated on what we're
| talking about.
| badlibrarian wrote:
| As part of proxying what may be as much as 20% of the
| web, Cloudflare injects code and modifies content that
| passes between clients and servers. It is in their core
| business interests to receive and act upon feedback
| regarding this functionality.
| 015a wrote:
| Sure: Let's begin by not starting the conversation with
| "Don't use Cloudflare", as you did. That's obviously not
| only unhelpful, but it clearly points the finger at the
| wrong party.
| 627467 wrote:
| What does Cloudflare do to search crawlers by default? Does it
| block them too?
| soraminazuki wrote:
| This is an issue with techdirt.com. I contacted them about this
| through their feedback form a long time ago, but the issue still
| remains unfortunately.
| dewey wrote:
| I'm using Miniflux and I always run into this on a few blogs,
| which I've now just stopped reading.
| MarvinYork wrote:
| In any case, it blocks German Telekom users. There is an ongoing
| dispute between Cloudflare and Telekom as to who pays for the
| traffic costs. Telekom is therefore throttling connections to
| Cloudflare. This is the reason why we can no longer use
| Cloudflare.
| SSLy wrote:
| as much as I am not a fan of cloudflare's practices, in this
| particular case DTAG seems to be the party at fault.
| hwj wrote:
| I had problems accessing Cloudflare-hosted websites via the Tor
| browser as well. I don't know if that is still true.
| whs wrote:
| My company runs a tech news website. We offer an RSS feed, as
| any Drupal website would, and a content farm scrapes it to
| rehost our content in full. This is usually fine for us - the
| content is CC-licensed and they do post the correct source.
| But they run thousands of different WordPress instances on the
| same IP, and each instance fetches the feed individually.
|
| In the end we had to use Cloudflare to rate limit the RSS
| endpoint.
| kevincox wrote:
| > In the end we had to use Cloudflare to rate limit the RSS
| endpoint.
|
| I think this is fine. You are solving a specific problem and
| still allowing some traffic. The problem with the Cloudflare
| default settings is that they block _all_ requests leading to
| users failing to get any updates even when fetching the feed at
| a reasonable rate.
|
| BTW in this case another solution may just be to configure
| proper caching headers. Even if you only cache for 5 minutes
| at a time, that will be at most 1 request every 5 minutes per
| Cloudflare caching location. (I don't know the exact
| configuration, but it typically uses ~5 locations per origin,
| so that would be only 1 req/min, which is trivial load and
| will handle both these inconsiderate scrapers and regular
| users. You can also configure all fetches to come from a
| single location, and then you would only need to actually
| serve the feed once per 5 minutes.)
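|
| For illustration, a minimal sketch of the origin side of that
| (Flask is just an example framework here, and FEED_XML stands
| in for however the feed is really generated):
|
|     from flask import Flask, Response
|
|     app = Flask(__name__)
|
|     FEED_XML = "<rss>...</rss>"  # placeholder feed body
|
|     @app.route("/feed.xml")
|     def feed():
|         resp = Response(FEED_XML,
|                         mimetype="application/rss+xml")
|         # Let shared caches (like a CDN) reuse this
|         # response for 5 minutes.
|         resp.headers["Cache-Control"] = "public, max-age=300"
|         return resp
|
| Combined with a page rule (or cache rule) that caches the feed
| path, a CDN can then answer most feed requests from its cache
| instead of hitting the origin.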
| yjftsjthsd-h wrote:
| > In the end we had to use Cloudflare to rate limit the RSS
| endpoint.
|
| Isn't the correct solution to use CF to _cache_ RSS endpoints
| aggressively?
| prmoustache wrote:
| I believe this also poses issues for people running adblockers.
| I get tons of repetitive captchas on some websites.
|
| Also, other companies offering similar services, like Imperva,
| seem to outright ban my IP after one visit to a website with
| uBlock Origin: I first get a captcha, then a page saying I am
| not allowed, and whatever I do, even using an extensionless
| Chrome browser with a new profile, I can't visit it anymore
| because my IP is banned.
| acdha wrote:
| One thing to keep in mind is that the modern web sees a lot of
| spam and scraping, and ad revenue has been sliding for years.
| If you make your activity look like a bot, most operators will
| assume you're not generating revenue and block you. It sucks
| but thank a spammer for the situation.
| est wrote:
| Hmmm, that's why "feedburner" is^H^Hwas a thing, right?
|
| We have come full circle.
| kevincox wrote:
| Yeah, this is the recommendation that I usually give people who
| reach out to support. Feedburner tends to be on the whitelists,
| which avoids this problem.
| pointlessone wrote:
| I see this on a regular basis. My self-hosted RSS reader is
| blocked by Cloudflare even after my IP address was explicitly
| allowlisted by a few feed owners.
| account42 wrote:
| Or just normal human users with a niche browser like Firefox.
| wraptile wrote:
| Cloudflare has been the bane of my web existence on a Thai IP
| and a Linux Firefox fingerprint. I wonder how much traffic is
| lost because of Cloudflare, and of course none of that is
| reported to the web admins, so everyone continues in their
| jolly ignorance.
|
| I wrote my own RSS bridge that scrapes websites using the
| Scrapfly web scraping API, which bypasses all of that. It's so
| annoying that I can't even scrape some company's /blog that
| they are literally buying ads for, but which somehow has an
| anti-bot enabled that blocks all RSS readers.
|
| The modern web is so anti-social that the web 2.0 guys should
| be rolling in their "everything will be connected with APIs"
| graves by now.
| vundercind wrote:
| The late '90s-'00s solution was to blackhole address blocks
| associated with entire countries or continents. It was easily
| worth it for many US sites that weren't super-huge to lose the
| 0.1% of legitimate requests they'd get from, say, China or
| Thailand or Russia, to cut the speed their logs scrolled at by
| 99%.
|
| The state of the art isn't much better today, it seems. Similar
| outcome with more steps.
| hkt wrote:
| It also manages to break IRC bots that do things like show the
| contents of the title tag when someone posts a link. Another
| cloudy annoyance, albeit a minor one.
| shaunpud wrote:
| Namesilo are the same; their CSV/RSS is behind Cloudflare, so I
| don't even bother with their auctions anymore, and their own
| interface is meh.
| anilakar wrote:
| ...and there is a good number of people who see this as a
| feature, not a bug.
| timnetworks wrote:
| RSS is the future that has been kept from us for twenty years
| already; fusion can kick bricks.
| nfriedly wrote:
| Liliputing.com had this problem a couple of years ago. I emailed
| the author and he got it sorted out after a bit of back and
| forth.
| 015a wrote:
| Suggesting that website operators should allowlist RSS clients
| through the Cloudflare bot detection system via their user-agent
| is a rather concerning recommendation.
| artooro wrote:
| This is a truly problematic issue that I've experienced as well.
| The best solution is probably for Cloudflare to figure out what
| normal RSS usage looks like and have a provision for that in
| their bot detection.
| idunnoman1222 wrote:
| Yes, the way to retain your privacy is to not use the Internet
|
| if you don't like it, make your own Internet: assumedly one not
| funded by ads
| hugoromano wrote:
| "could be blocking RSS users" it says it all "could". I use RSS
| on my websites, which are serviced by Cloudflare, and my users
| are not blocked. For that, fine-tuning and setting Configuration
| Rules at Cloudflare Dashboard are required. Anyone on a free has
| access to 10 Configuration Rules. I prefer using Cloudflare
| Workers to tune better, but there is a cost. My suggestion for
| RSS these days is to reduce the info on RSS feed to teasers, AI
| bots are using RSS to circumvent bans, and continue to scrape.
| srmarm wrote:
| I'd have thought the website owner whitelisting their RSS feed
| URI (or pattern matching *.xml/*.rss) might be better than
| doing it based on the user agent string. For one, you'd expect
| bot traffic on these endpoints, and you're also not leaving a
| door open to anyone who fakes their user agent.
|
| Looks like it should be possible under the WAF.
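|
| For illustration, a quick way to verify such a rule actually
| works is to fetch the feed with a plain non-browser client and
| check what comes back (Python with requests; the URL and
| user-agent string are placeholders):
|
|     import requests
|
|     resp = requests.get(
|         "https://example.com/feed.xml", timeout=30,
|         headers={"User-Agent": "feed-check/1.0"})
|
|     ctype = resp.headers.get("content-type", "")
|     if resp.status_code == 403 or "text/html" in ctype:
|         print("still challenged/blocked:", resp.status_code)
|     else:
|         print("ok:", resp.status_code, ctype)
|
| A working feed URL should return 200 with an XML content type
| rather than a 403 or an HTML challenge page.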
| wenbin wrote:
| At Listen Notes, we rely heavily on Cloudflare to manage and
| protect our services, which cater to both human users and
| scripts/bots.
|
| One particularly effective strategy we've implemented is using
| separate subdomains for services designed for different types of
| traffic, allowing us to apply customized firewall and page rules
| to each subdomain.
|
| For example:
|
| - www. listennotes.com is dedicated to human users. E.g.,
| https://www.listennotes.com/podcast-realtime/
|
| - feeds. listennotes.com is tailored for bots, providing access
| to RSS feeds. E.g., https://feeds.listennotes.com/listen/wenbin-
| fangs-podcast-pl...
|
| - audio. listennotes.com serves both humans and bots, handling
| audio URL proxies. E.g.,
| https://audio.listennotes.com/e/p/1a0b2d081cae4d6d9889c49651...
|
| This subdomain-based approach enables us to fine-tune security
| and performance settings for each type of traffic, ensuring
| optimal service delivery.
| kevindamm wrote:
| Where do you put your sitemap (or its equivalent)? Looking at
| the site, I don't notice one in the metadata but I do see a
| "site index" on the www subdomain, though possibly that's
| intended for humans not bots? I think the usual recommendation
| is to have a sitemap per subdomain and not mix them, but
| clearly they're meant for bots not humans...
| wenbin wrote:
| Great question.
|
| We only need to provide the sitemap (with custom paths, not
| publicly available) in a few specific places, like Google
| Search Console. This means the rules for managing sitemaps
| are quite manageable. It's not a perfect setup, but once we
| configure it, we can usually leave it untouched for a long
| time.
| butz wrote:
| Not "could" but it is actually blocking. Very annoying when
| government website does that, as usually it is next to impossible
| to explain the issue and ask for a fix. And even if the fix is
| made, it is reverted several weeks later. Other websites does
| that too, it was funny when one website was asking RSS reader to
| resolve captcha and prove they are human.
| elwebmaster wrote:
| Using Cloudflare on your website could be blocking Safari users,
| Chrome users, or just any users. It's totally broken. They have
| no way of measuring the false positives. Website owners are
| paying for it in lost revenue. And poor users lose access
| through no fault of their own. Until some C-level exec at a
| BigTech company
| randomly gets blocked and makes noise. But even then, Cloudflare
| will probably just whitelist that specific domain/IP. It is very
| interesting how I have never been blocked when trying to access
| Cloudflare itself, only blocked on their customer's sites.
| pentagrama wrote:
| Can you whitelist URLs to be read by bots on Cloudflare? Maybe
| this is a good solution, where you as a site maintainer can
| include your RSS feeds, sitemaps, and other content for bots.
|
| Also, Cloudflare could ship a dedicated section in the admin
| panel to let users add and whitelist RSS feeds and sitemaps,
| making it easier to discover that they may not want to block
| those bots, which aren't a threat to the site - while of course
| still applying rules to prevent DDoS on these URLs, such as
| massive request volumes or other behavior that common RSS
| reader bots don't exhibit.
| ectospheno wrote:
| I love that I get a cloudflare human check on almost every page
| they serve for customers, except when I log in to my Cloudflare
| account. Good times.
| conesus wrote:
| I run NewsBlur[0] and I've been battling this issue of NewsBlur
| fetching 403s across the web for months now. My users are
| revolting and asking for refunds. I've tried emailing dozens of
| site owners and publishers and only two of them have done the
| work of whitelisting their RSS feed. It's maddening and is having
| a real negative effect on NewsBlur.
|
| NewsBlur is an open-source RSS news reader (full source available
| at [1]), something we should all agree is necessary to support
| the open web! But Cloudflare blocking all of my feed fetchers is
| bizarre behavior. And we've been on the verified bots list for
| years, but it hasn't made a difference.
|
| Let me know what I can do. NewsBlur publishes a list of IPs that
| it uses for feed fetching that I've shared with Cloudflare but it
| hasn't made a difference.
|
| I'm hoping Cloudflare uses the IP address list that I publish and
| adds them to their allowlist so NewsBlur can keep fetching (and
| archiving) millions of feeds.
|
| [0]: https://newsblur.com
|
| [1]: https://github.com/samuelclay/NewsBlur
| AyyEye wrote:
| Three consenting parties trying to use their internet blocked
| by a single intermediary that's too big to care is just gross.
| It's the web we deserve.
| eddythompson80 wrote:
| > Three consenting parties
|
| Clearly they are not 100% consenting, or at best one of them
| (the content publisher) is misconfiguring/misunderstanding
| their setup. They enabled RSS on their service, then set up a
| rule to require human verification for accessing that RSS
| feed.
|
| It's like a business advertising a singles only area, then
| hiring a security company and telling them to only allow
| couples in the building.
| AyyEye wrote:
| If Cloudflare was honest and upfront about the tradeoffs
| being made and the fact that it's still going to require
| configuration and maintenance work, they'd have
| significantly fewer customers.
| srik wrote:
| RSS is an essential component of modern web publishing, and it
| feels scary to see how one company's inconsideration might harm
| its already fragile future. One day cloudflare will get big
| enough to be subject to antitrust regulation and this instance
| will be a strong data point working against them.
| p4bl0 wrote:
| I've been a paying NewsBlur user since the downfall of Google
| Reader and I'm very happy with it. Thank you for NewsBlur!
| wooque wrote:
| You can just bypass it with a library like cloudscraper or
| hrequests.
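|
| For what it's worth, a minimal cloudscraper sketch (Python; the
| URL is a placeholder, and whether this gets through depends on
| the kind of challenge the site has enabled):
|
|     import cloudscraper  # pip install cloudscraper
|
|     # Behaves like a requests.Session, but tries to solve
|     # Cloudflare's JavaScript challenges along the way.
|     scraper = cloudscraper.create_scraper()
|     resp = scraper.get("https://example.com/feed.xml",
|                        timeout=30)
|     print(resp.status_code, len(resp.text))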
| brightball wrote:
| I use Cloudflare and have home built RSS feeds on my site. If
| you've run into any issues on mine, I'll be happy to look into
| them.
|
| https://www.brightball.com/
| miohtama wrote:
| Thank you for the hard work.
|
| Newsblur was the first SaaS I could afford as a student. I have
| been a subscriber for something like 20 years now. And I will
| keep doing it to the grave. Best money ever spent.
| tandav wrote:
| As an admin of my personal website, I completely disable all
| Cloudflare features and use it only for DNS and domain
| registration. I also stop following websites that use Cloudflare
| checks or cookie popups (cookies are fine, but the popups are
| annoying).
| renewiltord wrote:
| Ah, the Cloudflare free plan does not automatically turn these
| on. I know since I use it for some small things and don't have
| these on. I wouldn't use User-Agent filtering because those are
| spoofable. But putting feeds on a separate URL is probably a good
| idea. Right now the feed is actually generated on request for
| these sites, so caching it is probably a good idea anyway. I can
| just rudimentarily do that by periodically generating and copying
| it over.
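|
| For illustration, that "generate and copy" step can be a tiny
| cron-driven script (the paths and the generator function here
| are hypothetical):
|
|     import shutil
|     import tempfile
|
|     WEB_ROOT = "/var/www/site"        # hypothetical path
|     OUTPUT = WEB_ROOT + "/feed.xml"
|
|     def build_feed_xml() -> str:
|         # Stand-in for however the feed is really built.
|         return "<rss>...</rss>"
|
|     def main() -> None:
|         # Write to a temp file in the same directory, then
|         # rename it into place so readers never see a
|         # half-written feed.
|         with tempfile.NamedTemporaryFile(
|                 "w", dir=WEB_ROOT, delete=False) as tmp:
|             tmp.write(build_feed_xml())
|         shutil.move(tmp.name, OUTPUT)
|
|     if __name__ == "__main__":
|         main()
|
| Run it from cron every few minutes and the feed URL becomes a
| plain static file that any cache (or none) can serve cheaply.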
| drudru wrote:
| I noticed this a while back when I was trying to read
| cloudflare's own blog. Periodically they would block my
| newsreader. I ended up just dropping their feed.
|
| I am glad to see other people calling out the problem. Hopefully,
| a solution will emerge.
___________________________________________________________________
(page generated 2024-10-17 23:01 UTC)