[HN Gopher] Feed readers which don't take "no" for an answer
___________________________________________________________________
Feed readers which don't take "no" for an answer
Author : kencausey
Score : 128 points
Date : 2024-12-18 15:39 UTC (4 days ago)
(HTM) web link (rachelbythebay.com)
(TXT) w3m dump (rachelbythebay.com)
| RA2lover wrote:
| Related: https://news.ycombinator.com/item?id=42470035
| jannes wrote:
| The HTTP protocol is a lost art. These days people don't even
| look at the status code and expect some mumbo jumbo JSON payload
| explaining the error.
| klntsky wrote:
| I would argue that HTTP statuses are a bad design decision,
| because they are intended to be consumed by apps, but are not
| app-specific. They are effectively a part of every API
| automatically, without consideration of whether they are needed.
|
| People often implement error handling using constructs like
| regexp matching on status codes, while with domain-specific
| errors the exact range of possible errors would be obvious.
|
| Moreover, when people do implement domain errors, they just
| have to write more code to handle two nested levels of
| branching.
| marcosdumay wrote:
| > because they are intended to be consumed by apps, but are
| not app-specific
|
| Well, good luck designing any standard app-independent
| protocol that works and doesn't do that.
|
| And yes, you must handle two nested levels of branching.
| That's how it works.
|
| The only improvement possible to make it clearer is having
| codes for API-specific errors... which 400 and 500 aren't
| exactly. But then, that doesn't gain you much.
| throw0101b wrote:
| > _I would argue that HTTP statuses are a bad design
| decision, because they are intended to be consumed by apps,
| but are not app-specific._
|
| Perhaps put the app-specific part in the body of the reply.
| In the RFC they give a human-readable reply to (presumably)
| be displayed in the browser:
|
|     HTTP/1.1 429 Too Many Requests
|     Content-Type: text/html
|     Retry-After: 3600
|
|     <html>
|       <head>
|         <title>Too Many Requests</title>
|       </head>
|       <body>
|         <h1>Too Many Requests</h1>
|         <p>I only allow 50 requests per hour to this Web site
|         per logged in user. Try again soon.</p>
|       </body>
|     </html>
|
| * https://datatracker.ietf.org/doc/html/rfc6585#section-4
|
| * https://developer.mozilla.org/en-
| US/docs/Web/HTTP/Status/429
|
| But if the URL is specific to an API, you can document that
| you will/may give further debugging details (in text, JSON,
| XML, whatever).
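|
| For an API endpoint, the same idea works with a machine-readable
| body. A minimal sketch, assuming Flask; the route and field names
| here are made up for illustration:
|
|     from flask import Flask, jsonify
|
|     app = Flask(__name__)
|
|     @app.route("/api/things")
|     def list_things():
|         # Pretend this caller's per-hour quota is already used up.
|         resp = jsonify(error="rate_limited",
|                        detail="50 requests per hour per user",
|                        retry_after_seconds=3600)
|         resp.status_code = 429
|         resp.headers["Retry-After"] = "3600"
|         return resp
|
| A client can branch on the 429 generically and still show the
| app-specific detail carried in the body.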
| AznHisoka wrote:
| I don't look at the code because it's wrong sometimes. Some pages
| return a 200 yet display an error in the page
| DaSHacka wrote:
| Nothing more annoying than a 200 response when the server
| 'successfully' serves a 404 page
| KomoD wrote:
| That's because a lot of people refuse to use status codes
| properly, like just using 200 everywhere.
| kstrauser wrote:
| A colleague who should've known better argued that a 404
| response to an API call was confusing because we were, in
| fact, successfully returning a response to the client. We had
| a long talk about that afterward.
| Joker_vD wrote:
| No, it is pretty confusing: the difference between 404 from
| hitting an endpoint that the server doesn't serve (because
| you forgot to expose this endpoint, oops!) and a 404 that
| means "we've successfully performed the search in our DB
| for the business entity you've requested and guarantee you
| that it does not exist" is rather difficult to tell
| programmatically.
| yjftsjthsd-h wrote:
| I'm open to arguing about _which_ error to return in each
| case, but surely we can agree that neither of those
| warrant a 200?
| echoangle wrote:
| Why not? I wouldn't say "I performed the search and
| there are 0 results" is an error condition. It's just the
| result of a search, and everything went fine.
| wiml wrote:
| If the URL identifies a _resource_ (REST-style) and that
| database entry doesn't exist, then yes, 404 is the less
| confusing response. If the URL identifies an _API
| endpoint_ (RPC-style) then, sure, tunnel the error inside
| an "I successfully failed to handle that request"
| response if you like.
| reshlo wrote:
| All URLs used when interacting with an API obviously
| identify API endpoints. There is no such thing as a URL
| which is part of an API but which is not an API endpoint.
|
| There is a difference between /api/entity/123 and
| /api/search with a payload of 123, though.
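|
| A rough sketch of that distinction, assuming Flask (routes and
| data are invented for illustration): the resource-style route
| 404s when the row is missing, while the search-style route
| returns 200 even with zero hits.
|
|     from flask import Flask, jsonify, request
|
|     app = Flask(__name__)
|     ENTITIES = {"123": {"id": "123", "name": "example"}}
|
|     @app.route("/api/entity/<entity_id>")
|     def get_entity(entity_id):
|         entity = ENTITIES.get(entity_id)
|         if entity is None:
|             # The entity itself is absent, so 404 describes the
|             # resource, not the route.
|             return jsonify(error="not_found"), 404
|         return jsonify(entity)
|
|     @app.route("/api/search", methods=["POST"])
|     def search():
|         query = (request.get_json(silent=True) or {}).get("query", "")
|         hits = [e for e in ENTITIES.values() if query in e["name"]]
|         # An empty result set is still a successful search: 200.
|         return jsonify(results=hits)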
| kelsey98765431 wrote:
| if you have to 429 people for an rss feed the problem is you
| ramses0 wrote:
| If you don't stop at red lights, the problem is other people.
| /s
| dxdm wrote:
| Nobody owes these people and their feed readers a 200 whenever
| they want one.
| quectophoton wrote:
| I think it's an acceptable response. Not only is there no SLA,
| but people are free to not provide a service to misbehaving
| user agents. It's like rejecting connections from Tor.
|
| If anything, a 429 is a nice heads up. It could have been
| worse; she could have redirected those requests to a separate
| URL with an... unpleasant content, like a certain domain that
| redirects to I-don't-know-what whenever they detect the Referer
| header is from HN.
| redleader55 wrote:
| As interesting as that site is, and as much as I sympathise
| with the author's plight, that site's behavior is so anti-me
| that I'm going to ignore it whenever/wherever it pops up. I'm
| not trolling the author, I'm not calling them names or
| anything, I was just interested in the technical stuff. I
| wish them good luck.
| noident wrote:
| There's a particular type of person that scours their HTTP logs
| and makes up rules that block 90% of feed readers using the
| default poll interval. If I stick your RSS feed into Miniflux
| and I get 429'd, I just stop reading your blog. Learn2cache.
| I'm talking to you, Cheapskate's Guide.
| Kudos wrote:
| This site would not 429 current Miniflux, since it makes
| conditional requests. She has a previous post outlining the
| cache-respecting behaviour of many common feed readers.
| Kwpolska wrote:
| It could 429 it for conditional requests as well:
|
| > Unconditional requests: at most once per 24 hour period.
|
| > Conditional requests: at most once per 60 minute period.
|
| (Source: calling `curl
| hxxps://rachelbythebay[.]com/w/atom.xml` twice)
| silvestrov wrote:
| not when the client sends _unconditional requests_, i.e. ones
| missing the If-Modified-Since and If-None-Match headers.
|
| All feed readers/clients should cache responses when sending
| multiple requests the same day.
| generationP wrote:
| Rejecting every unconditional GET after the first? That sounds a
| bit excessive. What if the reader crashed after the first and
| lost the data?
| brookst wrote:
| It's an RSS feed. In that case, wait until the specified time
| and try again and any missed article will appear then. If it is
| constantly crashing so articles never get loaded, fix that.
| aleph_minus_one wrote:
| > If it is constantly crashing so articles never get loaded,
| fix that.
|
| This often requires doing lots of tests against the endpoint,
| which the server prohibits.
| im3w1l wrote:
| If you are an rss-reader dev then you can set up a caching
| layer of your own.
| aleph_minus_one wrote:
| > If you are an rss-reader dev then you can set up a
| caching layer of your own.
|
| But are RSS reader devs _willing_ to jump through such
| hoops?
|
| I would claim that writing a (simple) RSS reader (using a
| programming language that provides suitable libraries) is
| something that would be rather easy for me, but setting
| up a caching layer would (because I have less knowledge
| about the latter topic) take a lot more research from my
| side concerning how to do it.
| im3w1l wrote:
| Sure, I have done such a thing myself and it was very
| simple. Let's say you do http_get(rss_address). Create a
| function http_cached_get that looks for a recent cached
| response, and if none exists delegates to http_get and
| saves the response. In Python this is like 10 lines.
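|
| For reference, a sketch of that idea, assuming the requests
| library (the cache directory and one-hour lifetime are arbitrary
| choices):
|
|     import hashlib, os, time
|     import requests
|
|     CACHE_DIR = "feed_cache"
|     MAX_AGE = 60 * 60  # reuse a stored response for up to an hour
|
|     def http_cached_get(url):
|         os.makedirs(CACHE_DIR, exist_ok=True)
|         name = hashlib.sha256(url.encode()).hexdigest()
|         path = os.path.join(CACHE_DIR, name)
|         fresh = (os.path.exists(path) and
|                  time.time() - os.path.getmtime(path) < MAX_AGE)
|         if fresh:
|             # A recent copy is on disk: skip the network entirely.
|             with open(path, encoding="utf-8") as f:
|                 return f.read()
|         body = requests.get(url, timeout=30).text
|         with open(path, "w", encoding="utf-8") as f:
|             f.write(body)
|         return body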
| XCabbage wrote:
| For that matter, what if it's a different device (or an
| entirely different human being) on the same IP address?
| Apreche wrote:
| Feed readers should be sending the If-Modified-Since header and
| web sites should properly recognize it and send the 304 Not
| Modified response. This isn't new tech.
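|
| A rough client-side sketch of that, assuming the requests library;
| it keeps the validators from the previous fetch and only
| re-downloads when the server says the feed changed:
|
|     import requests
|
|     def fetch_feed(url, state):
|         # state carries ETag/Last-Modified from the last fetch
|         headers = {}
|         if state.get("etag"):
|             headers["If-None-Match"] = state["etag"]
|         if state.get("last_modified"):
|             headers["If-Modified-Since"] = state["last_modified"]
|         resp = requests.get(url, headers=headers, timeout=30)
|         if resp.status_code == 304:
|             return state.get("body")  # unchanged, reuse what we have
|         resp.raise_for_status()
|         state.update(etag=resp.headers.get("ETag"),
|                      last_modified=resp.headers.get("Last-Modified"),
|                      body=resp.text)
|         return resp.text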
| dartos wrote:
| If only people knew of the standards
| graemep wrote:
| That is exactly what the article says.
| smallerize wrote:
| The article implies this but doesn't actually say it. It's
| nice to have the extra detail.
| avg_dev wrote:
| While it might be nice if the article spelled out the
| header, I do believe that there is more than implication
| present.
|
| > 00:04:51 GET /w/atom.xml, unconditional.
|
| > Fulfilled with 200, 502 KB.
|
| > [...]
|
| > A 20 minute retry rate with unconditional requests is
| wasteful. [...]
|
| And If-Modified-Since makes a request conditional.
| https://developer.mozilla.org/en-
| US/docs/Web/HTTP/Conditiona...
| JadeNB wrote:
| You left out a further explicit mention of conditional
| requests:
|
| > Advised (via Retry-After header) to come back in one
| day since they are unwilling or unable to do conditional
| requests.
|
| But I think it's still unarguable that the post doesn't
| _explicitly_ mention If-Modified-Since, which it's not
| obliged to do, but the mention of it here could be
| helpful to someone. So why fuss?
| shkkmo wrote:
| The people who already know that a "conditional request"
| means a request with an If-Modified-Since header aren't the
| ones who need to learn this information.
| strogonoff wrote:
| A friend of mine co-runs a semi-popular semi-niche news site (for
| now more than a decade), and complains that recently traffic rose
| with bots masquerading as humans.
|
| How would they know? Well, because Google, in its omniscience,
| started to downrank them for faking views with bots (which they
| do not do): it shows bot percentage in traffic stats, and it
| skyrocketed relative to non-bot traffic (which is now less than
| 50%) as they started to fall from the front page (feeding the
| vicious circle). Presumably, Google does not know or care it is a
| bot when it serves ads, but correlates it later with the metrics
| it has from other sites that use GA or ads.
|
| Or, perhaps, Google spots the same anomalies that my friend (an
| old school sysadmin who pays attention to logs) did, such as the
| increase of traffic along with never seen before popularity among
| iPhone users (who are so tech savvy that they apparently do not
| require CSS), or users from Dallas who famously love their
| QQBrowser. I'm not going to list all telltale signs as the crowd
| here is too hype on LLMs (which is our going theory so far, it is
| very timely), but my friend hopes Google learns them quickly.
|
| These newcomers usually fake UA, use inconspicuous Western IPs
| (requests from Baidu/Tencent data center ranges do sign
| themselves as bots in UA), ignore robots.txt and load many pages
| very quickly.
|
| I would assume bot traffic increase would apply to feeds, since
| they are of as much use for LLM training purposes.
|
| My friend does not actually engage in stringent filtering like
| Rachel does, but I wonder how soon it becomes actually infeasible
| to operate a website _with actual original content_ (which my
| friend co-writes) without either that or resorting to Cloudflare
| or the like for protection because of the domination of these
| creepy-crawlies.
|
| Edit: Google already downranked them, not threatened to downrank.
| Also, traffic rose but did not skyrocket, but relative amount of
| bot traffic skyrocketed. (Presumably without downranking the
| traffic would actually skyrocket.)
| blfr wrote:
| QQBrowser users from Dallas are more likely to be Chinese using
| a VPN than bots, I would guess.
| strogonoff wrote:
| That much is clear, yeah. The VPN they use may not be a
| service advertised to the public and featured in lists, however.
|
| Some of the new traffic did come directly from Tencent data
| center IP ranges and reportedly those bots signed themselves
| in UA. I can't say whether they respect robots.txt because I
| am told their ranges were banned along with robots.txt
| tightening. However, US IP bots that remain unblocked and
| fake UA naturally ignore robot rules.
| thaumasiotes wrote:
| > The VPN they use may not be a service advertised to
| public and featured in lists, however.
|
| Well, of course not, since the service is illegal.
| m3047 wrote:
| I'm seeing some address ranges in the US clearly serving what
| must be VPN traffic from Asia, and I'm also seeing an uptick
| in TOR traffic looking for feeds as well as WP infra.
| afandian wrote:
| Are you saying that Google down-ranked them in search engine
| rankings for user behaviour in AdWords? Isn't that an abuse of
| monopoly? It still surprises me a little bit.
| BadHumans wrote:
| At my company we have seen a massive increase in bot traffic
| since LLMs have become mainstream. Blocking known OpenAI and
| Anthropic crawlers has decreased traffic somewhat so I agree
| with your theory.
| m3047 wrote:
| It's not that hard to dominate bots. I do it for fun, I do it
| for profit. Block datacenters. Run bot motels. Poison them. Lie
| to them. Make them have really really bad luck. Change the cost
| equation so that it costs them more than it costs you.
|
| You're thinking of it wrong, the seeds of the thinking error
| are here: "I wonder how soon it becomes actually infeasible to
| operate a website with actual original content".
|
| Bots want original content, no? So what's the problem with
| giving it to them? But that's the issue, isn't it? Clearly,
| contextually, what you should be saying is "I wonder how soon
| it becomes actually infeasible to operate a website for actual
| organic users" or something like that. But phrased that way,
| I'm not sure a CDN helps (I'm not sure they don't suffer false
| positives which interfere with organic traffic when they
| intermediate, more security theater because hangings and
| executions look good, look at the numbers of enemy dead).
|
| Take measures that any damn fool (or at least your desired
| audience) can recognize.
|
| Reading for comprehension, I think Rachel understands this.
| throaway89 wrote:
| what is a bot motel and how do you run one?
| m3047 wrote:
| Easy way is to implement e.g. a 4xx handler which serves
| content with links which generate further 4xx errors and
| rewrite the status code to something like 200 when sent to
| the requester. Load the garbage pages up with... garbage.
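|
| A very rough sketch of that idea, assuming Flask (the trap paths
| and garbage generator are invented, and you would want to keep
| real users and well-behaved crawlers away from it):
|
|     import random, string
|     from flask import Flask
|
|     app = Flask(__name__)
|
|     def garbage_page():
|         words = " ".join(
|             "".join(random.choices(string.ascii_lowercase, k=8))
|             for _ in range(200))
|         links = "".join(
|             f'<a href="/trap/{random.randrange(10**9)}">more</a> '
|             for _ in range(10))
|         return f"<html><body><p>{words}</p>{links}</body></html>"
|
|     @app.errorhandler(404)
|     def not_found(_err):
|         # To a naive crawler this looks like a normal page: a 200
|         # full of links, each of which also "succeeds" and links
|         # deeper into the motel.
|         return garbage_page(), 200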
| throaway89 wrote:
| Thanks, and you can make money with this? Sorry I'm a
| total noob in this area.
| yesco wrote:
| The idea is that bots are inflexible to deviations from
| accepted norms and can't actually "see" rendered browser
| content. So your generic 404 and 403 error pages return a
| 200 status instead, with invisible links to other non-
| accessible pages. The bots will follow the links but real
| users will not, trapping them in a kind of isolated
| labyrinth of recursive links (the URLs should be slightly
| different, though). It's basically how a lobster trap works
| if you want a visual metaphor.
|
| The important part here is to do this chaotically. The
| worst sites to scrape are buggy ones. You are, in essence,
| deliberately following bad practices in a way real users
| wouldn't notice but would still influence bots.
| 6510 wrote:
| I ban the feed for 24 hours if it doesn't work.
|
| I also design 2 new formats that no one (including myself) has
| ever implemented.
|
| https://go-here.nl/ess-and-nno
|
| enjoy
| Forge36 wrote:
| I couldn't find the tester. Thankfully the client I use was
| tested... and it behaves poorly. Thankfully Emacs has a client I
| can switch to!
| bombcar wrote:
| At some point instead of 429 it should return a feed with this
| post as always newest.
| internet2000 wrote:
| Does anyone know if FreshRSS behaves properly here?
| reocha wrote:
| Earlier article with some info on freshrss:
| https://rachelbythebay.com/w/2024/10/25/fs/
| Havoc wrote:
| Blocked for 2 hits in 20 minutes on a light protocol like rss?
|
| That seems hilariously aggressive to me, but her server her rules
| I guess.
| garfij wrote:
| I believe if you read carefully, it's not blocked, it's rate
| limited to once daily, with very clear remediation steps
| included in the response.
| that_guy_iain wrote:
| If you understand what rate limiting is, you block them for a
| period of time. Let's stop being pedantic here.
|
| 72 requests per day is nothing and acting like it's mayhem is
| a bit silly. And for a lot of people it would result in them
| getting news more slowly. Sure OP won't publish that
| often but their rate limiting is an edge case and should be
| treated as such. If they're blocked until the next day and
| nothing gets updated then the only person harmed is OP for
| being overly bothered by their HTTP logs.
|
| Sure it's their server and they can do whatever they want.
| But all this does is hurt the people trying to reach their
| blog.
| HomeDeLaPot wrote:
| 72 requests per day _per user with a naive feed reader_.
| This is a small personal blog with no ads that OP is self-
| hosting on her own hardware, so blocking all this junk
| traffic is probably saving her money. Plus she's calling
| attention to how feed readers can be improved!
| that_guy_iain wrote:
| Even if they had 1000 feed readers, which would be a
| massive amount for a blog, if you can't scale that
| cheaply, that's on you.
|
| As I pointed out, her blog and rate limiting are an
| extreme edge case; it would be silly for anyone to put
| effort into changing their feed reader for a single small
| blog. It's bad product management.
| tecleandor wrote:
| Of course she can. It's static. She just doesn't want to, and I
| understand. She's signaling to clients, with a standard
| mechanism, "I think you have already read this; at least
| ask me first when this last changed."
| m3047 wrote:
| My reason for smacking stuff down is that I don't want to
| see it in my logs. That simple.
| throw0101b wrote:
| > _72 requests per day is nothing and acting like it 's
| mayhem is a bit silly._
|
| 72 requests per day per IP over _how many IPs_? When you
| start multiplying numbers together they can get big.
| quest88 wrote:
| I invite you to run your own popular blog on your own
| hardware and pay for the costs. It sounds like you don't
| know what the true costs are.
| that_guy_iain wrote:
| Sounds like you don't know how to scale for cheap.
|
| And since I've run integrations that connected over 500
| companies, I know what a rogue client actually looks like,
| and 72 requests per day is something I wouldn't even notice.
| donatj wrote:
| I do run a popular blog, and a $5 a month Digital Ocean
| droplet handles millions of requests per month without
| breaking a sweat.
| devjab wrote:
| If every user is collecting 36 MB a day like in the story
| here, your droplet wouldn't even be capable of serving
| 500 users a month without hitting your bandwidth limit.
| With their current rates, your one million requests would
| cost you around 10 million USD.
| sccxy wrote:
| If OP enabled gzip then this 36 MB would be 13 MB.
|
| If OP reduced 30 months of posts in the RSS feed to 12 months
| then this 13 MB would be 5 MB a day.
|
| Using Cloudflare free plan and this static content is
| cached without any problem.
| donatj wrote:
| 30 * 500 * 36 MB = 540 GB and I have 1 TB a month on my
| apparently $6 droplet
|
| Correction - from my billing page it's $4.50 a month,
| from the resize page it is $6 so I'm guessing I am
| grandfathered in to some older pricing
| tecleandor wrote:
| That's a ridiculously big quantity of data to serve for a
| seldom-updated blog just because the client doesn't want
| to (or doesn't know how, or didn't think to) implement an
| easy and old HTTP method.
|
| Imagine the petabytes of data transferred through the
| internet that would be saved if a couple of RSS clients
| added that method.
| sccxy wrote:
| More like a skill issue or just a decision to make your
| life more difficult.
|
| It is free and easy to scale this kind of text-based
| blog.
| Twirrim wrote:
| OP has _never_ said that this is about financial aspects
| of things.
| Joker_vD wrote:
| Yes, it's about enforcing their preference on how others
| should interact with OP's published site feed, on
| principle. Which is always an uphill battle.
| sccxy wrote:
| Her RSS feed is the last 100 posts with full content.
|
| So it means 30 months of blog post content in a single request.
|
| Sending 0.5 MB in a single RSS request is more of a crime than
| those 2 hits in 20 minutes.
| horsawlarway wrote:
| I generally agree here.
|
| There are a lot of very valid use cases where defaulting to
| deny for an entire 24 hour cycle after a single request is
| incredible frustrating for your downstream users (shared IP
| at my university means I will never get a non-429 response...
| And God help me if I'm testing new RSS readers...)
|
| It's her server, so do as you please, I guess. But it's a
| hilariously hostile response compared to just returning less
| data.
| mrweasel wrote:
| > But it's a hilariously hostile response compared to just
| returning less data.
|
| So provide a poor service to everyone, because some people
| don't know how to behave. That seems like an even worse
| response.
| sccxy wrote:
| Send only one year's recent posts and you've reduced
| bandwidth by 50%.
| wakawaka28 wrote:
| People don't want to have to customize refresh rates on a
| per-feed basis. Perhaps the RSS or Atom standards need to
| support importing the recommended refresh rate
| automatically.
| wakawaka28 wrote:
| Yes that's right. Most blogs that are popular enough to have
| this problem send you the last 10 post titles and links or
| something. THAT is why people refresh every hour, so they
| don't miss out.
| yladiz wrote:
| That's a bit disingenuous. 429s aren't "blocking", they're
| telling the requester that they've made too many requests and
| to try again later (with a value in the header). I assume the
| author configured this because they know how often the site
| typically changes. That the web server eventually stops
| responding if the client keeps ignoring the 429s isn't that
| surprising, but I doubt that part was configured directly.
| that_guy_iain wrote:
| I would say it's disingenuous to claim sending HTTP status
| and body that is not expected for a period of time is not
| blocking them for that period of time. You can be pedantic
| and claim "but they can still access the server" but in
| reality that client is blocked for a period of time.
| kstrauser wrote:
| In that case, I should be irate that the AWS API blocks me
| many times per day. Run `aws cli service some-paginated-
| thing` and see how many retries you get during normal,
| routine operation.
|
| But I'm not, because they're not blocking me. They're
| asking my client to slow down. Neither AWS nor Rachel's
| blog owes me unlimited requests per unit time, and neither
| have "blocked" me when I violate they policies.
| that_guy_iain wrote:
| They literally do block you for a period of time until
| you are out of the rate limit. That is how rate limits
| work. That's why you don't get to access the resource you
| requested, because their system literally blocked you
| from doing so.
|
| See when you're trying to be pedantic and all about
| semantics, you should make sure you've crossed your Ts
| and dotted your Is.
|
| > Block - AWS WAF blocks the request and applies any
| custom blocking behavior that you've defined.
|
| from https://docs.aws.amazon.com/waf/latest/developerguid
| e/waf-ru...
|
| And my favourite
|
| > Rate limiting blocks users, bots, or applications that
| are over-using or abusing a web property. Rate limiting
| can stop certain kinds of bot attacks.
|
| From CloudFlare's explainer
| https://www.cloudflare.com/learning/bots/what-is-rate-
| limiti...
|
| Every documentation on rate limit will include the word
| block. Because that's what you do, you allow access for a
| specific amount of requests and then block those that go
| over.
| luckylion wrote:
| > 429s aren't "blocking"
|
| Like how "unlimited traffic, but will slow down to 1bps if
| you use more than 100gb in a month" is technically "unlimited
| traffic".
|
| But for all intents and purposes, it's limited. And 429 are
| blocking. They include a hint towards the reason why you are
| blocked and when the block might expire (retry-after doesn't
| promise that you'll be successful if you wait), but besides
| that, what's the difference compared to 403?
| yladiz wrote:
| I would disagree. Blocking typically implies permanence
| (without more action by the blockee), and since 429 isn't
| usually a permanent error code I wouldn't call it blocking.
| Same applies with 403, it's only permanent if the requester
| doesn't authorize correctly.
| Havoc wrote:
| Semantics. 429 is an error code. Rate
| limiting...blocking...too many requests...ignoring...call it
| whatever you like but it amounts to the same thing, namely the
| server isn't serving the requested content.
| cesarb wrote:
| > Blocked for 2 hits in 20 minutes on a light protocol like
| rss?
|
| I might be getting old, but 500KB in a single response doesn't
| feel "light" to me.
| sccxy wrote:
| Yes, this is a very poorly designed RSS feed.
|
| 500KB is horrible for RSS.
| Symbiote wrote:
| It's reasonable to have whole articles in RSS, if you
| aren't trying to show ads or similar.
| sccxy wrote:
| Whole articles are reasonable.
|
| 100 articles are not reasonable.
|
| 100 articles where most of them are 1+ year old is
| madness.
|
| RSS is not an archive of the entire website.
| sangnoir wrote:
| > RSS is not an archive of the entire website
|
| Whole-article feeds end up becoming exactly that - a local
| archive of a blog.
| mrweasel wrote:
| But it's not a "light" protocol when you're serving 36MB per
| day, when 500KB would suffice. RSS/Atom is lightweight, if
| clients play by the rules. This could also have been a news
| website, imagine how much traffic would be dedicated to
| pointless transfers of unchanged data. Traffic isn't free.
|
| A similar problem arises from the increase in AI scraper
| activities. Talking to other SREs, the problem seems pretty
| widespread. AI companies will just hoover up data, but revisit so
| frequently and aggressively that it's starting to affect the
| transit feeds for popular websites. Frequently user-agents
| wouldn't be set to something unique, or deliberately hidden,
| and traffic originates from AWS, making it hard to target
| individual bad actors. Fair enough that you're scraping
| websites, that's part of the game when you're online, but when
| your industry starts to affect transit feeds, then we need to
| talk compensation.
| II2II wrote:
| If your feed reader is refreshing every 20 minutes for a blog
| that is updated daily, nearly 99% of the data sent is
| identical. It looks like Rachel's blog is updated (roughly)
| weekly, so that jumps to 99.8%. It's not the least efficient
| thing in the world of computers, but it is definitely incurring
| unnecessary costs.
| elashri wrote:
| I opened the XML file she provides in the blog and it seems
| very long, but okay. Then I decided it is a good blog to
| subscribe to, so I went and tried to add it to my self-hosted
| FreshRSS instance (same IP obviously) and I couldn't
| because I got blocked/rate limited. So yes, it is aggressive
| for different reasons.
| mubou wrote:
| Yeah, that's insane. Pretty much telling me not to
| subscribe to your blog at that point. Like sites that have
| an RSS feed yet put Cloudflare protection in front of it...
|
| The correct thing to do here is put a caching layer in
| front so that every feed reader isn't simultaneously
| hitting the origin for the same content. IP banning is the
| wrong approach. (Even if it's only a temporary block,
| that's going to cause my reader to show an error and is
| entirely unnecessary.)
| radicality wrote:
| Weird. Those should have had different user-agents, and I
| would guess it cannot be purely based on IP.
| KomoD wrote:
| Same, I made 3 requests in total and got blocked.
| wakawaka28 wrote:
| It should be a timeboxed block if anything. Most RSS users
| are actual readers and expecting them to spend lots of time
| figuring out why clicking "refresh" twice on their RSS app
| got them blocked is totally unreasonable. I've got my feeds
| set up to refresh every hour. Considering the small number of
| people still using RSS and how lightweight it is, it's not
| bad enough to freak out over. At some point all Rachel's
| complaining and investigating will be more work than her
| simply interacting directly with the makers of the various
| readers that cause the most traffic.
| nilslindemann wrote:
| I am stupid, but why not just return an HTML document explaining
| the issue when there is such an incorrect second request within 20
| minutes, and then block that IP for 24 hours? The feed reader
| software author would have to react, otherwise its users will
| complain to him, no?
| ImPostingOnHN wrote:
| It might be clever to return an RSS feed containing one item:
| the HTML document you mention.
| ruszki wrote:
| That's what the 429 response is for, which is mentioned in the
| article.
| PaulHoule wrote:
| This is why RSS is for the birds.
|
| My RSS reader YOShInOn subscribes to 110 RSS feeds through
| Superfeedr which absolves me of the responsibility of being on
| the other side of Rachel's problem.
|
| With RSS you are always polling too fast or too slow; if you are
| polling too slow you might even miss items.
|
| When a blog gets posted Superfeedr hits an AWS lambda function
| that stores the entry in SQS so my RSS reader can update itself
| at its own pace. The only trouble is Superfeedr costs 10 cents a
| feed per month which is a good deal for an active feed such as
| comments from Hacker News or articles from _The Guardian_ but is
| not affordable for subscribing to 2000+ indy blogs which YOShInOn
| could handle just fine.
|
| I might yet write my own RSS head end, but there is something to
| say for protocols like ActivityPub and AT Protocol.
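|
| The fan-in part of that pipeline is small. A rough sketch of such
| a webhook handler, assuming boto3, an API-Gateway-style proxy
| event, and that the notification body carries an "items" array
| (the queue URL and field names are illustrative):
|
|     import json
|     import boto3
|
|     sqs = boto3.client("sqs")
|     QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/feed-items"
|
|     def lambda_handler(event, context):
|         # The push service POSTs new entries; queue each one so the
|         # reader can drain them at its own pace.
|         payload = json.loads(event.get("body") or "{}")
|         for item in payload.get("items", []):
|             sqs.send_message(QueueUrl=QUEUE_URL,
|                              MessageBody=json.dumps(item))
|         return {"statusCode": 200, "body": "ok"}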
| rakoo wrote:
| That's why WebSub (formerly PubSubHubbub) was created and
| should be the proper solution, not a proprietary middleware.
| donatj wrote:
| On the flip side, what percent of RSS feed _generators_ actually
| support conditional requests? I've written many over the last
| twenty years and I can tell you plainly, none of the ones I wrote
| have.
|
| I never even considered the option or necessity. It's easy and
| cheap just to send everything.
|
| I guess static generators with an Apache-style web server probably
| do, but I can't imagine any dynamic generators bother to try to
| save the small handful of bytes.
| aendruk wrote:
| For another perspective, I can offer the data point that the
| one dynamic feed generator I've written supports both If-
| Modified-Since and If-Match, and that I considered that to be
| an obvious requirement from the beginning.
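|
| For comparison, the server side is also only a few lines. A sketch,
| assuming Flask, a feed string, and a known modification time
| (header parsing via the standard library):
|
|     from datetime import datetime, timezone
|     from email.utils import format_datetime, parsedate_to_datetime
|     from flask import Flask, Response, request
|
|     app = Flask(__name__)
|     FEED_XML = "<feed>...</feed>"  # placeholder feed document
|     FEED_MTIME = datetime(2024, 12, 1, tzinfo=timezone.utc)
|
|     @app.route("/atom.xml")
|     def feed():
|         ims = request.headers.get("If-Modified-Since")
|         if ims:
|             try:
|                 if parsedate_to_datetime(ims) >= FEED_MTIME:
|                     return Response(status=304)  # client is current
|             except (TypeError, ValueError):
|                 pass  # malformed date: fall through, send the body
|         resp = Response(FEED_XML, mimetype="application/atom+xml")
|         resp.headers["Last-Modified"] = format_datetime(FEED_MTIME,
|                                                         usegmt=True)
|         return resp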
| wheybags wrote:
| RSS is pretty light. Even if you say it's too much to be re-
| sending, you could remove the content from the rss feed (so they
| need to click through to read it), which would shrink the feed
| size massively. Alternatively, remove old posts. Or do both.
|
| Hopefully you don't have some expensive code generating the feed
| on the fly, so processing overhead is negligible. But if it's
| not, cache the result and reset the cache every time you post.
|
| Surely this is easier than spending the effort and emotional
| bandwidth to care about this issue?
|
| I might be wrong here, but this feels more emotionally driven
| ("someone is wrong on the internet") than practical.
| gavinsyancey wrote:
| As a user of the RSS feed, please don't remove content from it
| so I have to click through. This makes it much less useful and
| more annoying to use.
| wheybags wrote:
| I always click through regardless, because the rss text is
| probably missing formatting and images. I'll never be sure
| I'm getting a proper copy of the article unless I click
| through anyway.
| ruuda wrote:
| I have a blog where I post a few posts per year. [1] /feed.xml is
| served with an Expires header of 24 hours. I wrote a tool that
| allows me to query the webserver logs using SQLite [2]. Over the
| past 90 days, these are the top 10 requesters grouped by ip
| address (remote_addr column redacted here):
|       requests_per_day  user_agent
|       283  Reeder/5050001 CFNetwork/1568.300.101 Darwin/24.2.0
|       274  CommaFeed/4.4.0 (https://github.com/Athou/commafeed)
|       127  Mozilla/5.0 (Windows NT 10.0; Win64; x64)
|            AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0
|            Safari/537.36
|        52  NetNewsWire (RSS Reader; https://netnewswire.com/)
|        47  Tiny Tiny RSS/23.04-0578bf80 (https://tt-rss.org/)
|        47  Refeed Reader/v1 (+https://www.refeed.dev/)
|        46  Selfoss/2.18 (SimplePie/1.5.1; +https://selfoss.aditu.de)
|        41  Reeder/5040601 CFNetwork/1568.100.1.1.1 Darwin/24.0.0
|        39  Tiny Tiny RSS/23.04 (Unsupported) (https://tt-rss.org/)
|        34  FreshRSS/1.24.3 (Linux; https://freshrss.org)
|
| Reeder is loading the feed every 5 minutes, and in the vast
| majority of cases it's getting a 301 response because it tries to
| access the http version that redirects to https. At least it has
| state and it gets 304 Not Modified in the remaining cases.
|
| If I order by body bytes served rather than number of requests
| (and group by remote_addr again), these are the worst consumers:
|       body_megabytes_per_year  user_agent
|       149.75943975  Refeed Reader/v1 (+https://www.refeed.dev/)
|        95.90771025  Mozilla/5.0 (Windows NT 10.0; Win64; x64)
|                     AppleWebKit/537.36 (KHTML, like Gecko)
|                     Chrome/131.0.0.0 Safari/537.36
|        75.00080025  rss-parser
|        73.023702    Tiny Tiny RSS/24.09-0163884ef (Unsupported)
|                     (https://tt-rss.org/)
|        38.402385    Tiny Tiny RSS/24.11-42ebdb02 (https://tt-rss.org/)
|        37.984539    Selfoss/2.20-cf74581 (+https://selfoss.aditu.de)
|        30.3982965   NetNewsWire (RSS Reader; https://netnewswire.com/)
|        28.18013325  Tiny Tiny RSS/23.04-0578bf80 (https://tt-rss.org/)
|        26.330142    Mozilla/5.0 (Windows NT 10.0; Win64; x64)
|                     AppleWebKit/537.36 (KHTML, like Gecko)
|                     Chrome/84.0.4147.105 Safari/537.36
|        24.838461    Mozilla/5.0 (Windows NT 10.0; Win64; x64)
|                     AppleWebKit/537.36 (KHTML, like Gecko)
|                     Chrome/84.0.4147.105 Safari/537.36
|
| The top consumer, Refeed, is responsible for about 2.25% of all
| egress of my webserver. (Counting only body bytes, not http
| overhead.)
|
| [1]: https://ruudvanasseldonk.com/writing [2]:
| https://github.com/ruuda/sqlog/blob/d129db35da9bbf95d8c2e97d...
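|
| (The aggregation above is a single query once the log is in
| SQLite. A sketch, assuming a table named `requests` with the
| remote_addr and user_agent columns mentioned above; the actual
| schema of the linked tool may differ:)
|
|     import sqlite3
|
|     con = sqlite3.connect("access_log.sqlite")  # illustrative path
|     rows = con.execute(
|         """
|         SELECT remote_addr, user_agent,
|                COUNT(*) / 90.0 AS requests_per_day
|         FROM requests
|         GROUP BY remote_addr
|         ORDER BY requests_per_day DESC
|         LIMIT 10
|         """
|     ).fetchall()
|     for remote_addr, user_agent, per_day in rows:
|         print(f"{per_day:8.1f}  {user_agent}")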
| mixmastamyk wrote:
| I have a few feeds configured in Thunderbird but wasn't reading
| them very often, so I "disabled" them to load manually. Despite
| this it tries to contact the sites often and, when not able to
| (firewall), goes into a frenzy of trying to contact them. All this
| despite being disabled.
|
| Disappointing, combined with the various update sites it tries to
| contact at every startup, which is completely unnecessary as well.
| A couple of times a week should be the maximum rate.
| euroderf wrote:
| which => that
| thaumasiotes wrote:
| They're exactly equivalent. What are you hoping to correct?
| euroderf wrote:
| They're obviously not.
|
| It's a Briticism (AFAICT) making inroads.
| sangnoir wrote:
| Oh no, not the British influencing the English language! I
| can't be arsed* about which vs. that when "on accident" has
| become semi-accepted (as the opposite of "on purpose").
| Yuck.
___________________________________________________________________
(page generated 2024-12-22 23:00 UTC)