[HN Gopher] Feed readers which don't take "no" for an answer
       ___________________________________________________________________
        
       Feed readers which don't take "no" for an answer
        
       Author : kencausey
       Score  : 128 points
       Date   : 2024-12-18 15:39 UTC (4 days ago)
        
 (HTM) web link (rachelbythebay.com)
 (TXT) w3m dump (rachelbythebay.com)
        
       | RA2lover wrote:
       | Related: https://news.ycombinator.com/item?id=42470035
        
       | jannes wrote:
       | The HTTP protocol is a lost art. These days people don't even
       | look at the status code and expect some mumbo jumbo JSON payload
       | explaining the error.
        
         | klntsky wrote:
         | I would argue that HTTP statuses are a bad design decision,
         | because they are intended to be consumed by apps, but are not
          | app-specific. They are effectively part of every API
          | automatically, without any consideration of whether they are
          | needed.
          | 
          | People often implement error handling with constructs like
          | regexp matching on status codes, whereas with domain-specific
          | errors the exact range of possible errors would be obvious.
         | 
         | Moreover, when people do implement domain errors, they just
         | have to write more code to handle two nested levels of
         | branching.
        
           | marcosdumay wrote:
           | > because they are intended to be consumed by apps, but are
           | not app-specific
           | 
           | Well, good luck designing any standard app-independent
           | protocol that works and doesn't do that.
           | 
           | And yes, you must handle two nested levels of branching.
           | That's how it works.
           | 
            | The only possible improvement that would make it clearer is
            | having codes for API-specific errors... which 400 and 500
            | aren't exactly. But then, that doesn't gain you much.
        
           | throw0101b wrote:
           | > _I would argue that HTTP statuses are a bad design
           | decision, because they are intended to be consumed by apps,
           | but are not app-specific._
           | 
           | Perhaps put the app-specific part in the body of the reply.
            | In the RFC they give a human-specific reply to (presumably)
            | be displayed in the browser:
            | 
            |     HTTP/1.1 429 Too Many Requests
            |     Content-Type: text/html
            |     Retry-After: 3600
            | 
            |     <html>
            |        <head>
            |           <title>Too Many Requests</title>
            |        </head>
            |        <body>
            |           <h1>Too Many Requests</h1>
            |           <p>I only allow 50 requests per hour to this Web site
            |              per logged in user.  Try again soon.</p>
            |        </body>
            |     </html>
           | 
           | * https://datatracker.ietf.org/doc/html/rfc6585#section-4
           | 
           | * https://developer.mozilla.org/en-
           | US/docs/Web/HTTP/Status/429
           | 
           | But if the URL is specific to an API, you can document that
           | you will/may give further debugging details (in text, JSON,
           | XML, whatever).
        
         | AznHisoka wrote:
          | I don't look at the code because it's wrong sometimes. Some
          | pages return a 200 yet display an error in the page.
        
           | DaSHacka wrote:
           | Nothing more annoying than a 200 response when the server
           | 'successfully' serves a 404 page
        
         | KomoD wrote:
         | That's because a lot of people refuse to use status codes
         | properly, like just using 200 everywhere.
        
           | kstrauser wrote:
           | A colleague who should've known better argued that a 404
           | response to an API call was confusing because we were, in
           | fact, successfully returning a response to the client. We had
           | a long talk about that afterward.
        
             | Joker_vD wrote:
             | No, it is pretty confusing: the difference between 404 from
             | hitting an endpoint that the server doesn't serve (because
             | you forgot to expose this endpoint, oops!) and a 404 that
             | means "we've successfully performed the search in our DB
             | for the business entity you've requested and guarantee you
             | that it does not exist" is rather difficult to tell
             | programmatically.
        
               | yjftsjthsd-h wrote:
               | I'm open to arguing about _which_ error to return in each
               | case, but surely we can agree that neither of those
               | warrant a 200?
        
               | echoangle wrote:
                | Why not? I wouldn't say "I performed the search and
                | there's 0 results" is an error condition. It's just the
               | result of a search, and everything went fine.
        
               | wiml wrote:
               | If the URL identifies a _resource_ (REST-style) and that
                | database entry doesn't exist, then yes, 404 is the less
                | confusing response. If the URL identifies an _API
                | endpoint_ (RPC-style) then, sure, tunnel the error inside
                | an "I successfully failed to handle that request"
               | response if you like.
        
               | reshlo wrote:
               | All URLs used when interacting with an API obviously
               | identify API endpoints. There is no such thing as a URL
               | which is part of an API but which is not an API endpoint.
               | 
               | There is a difference between /api/entity/123 and
               | /api/search with a payload of 123, though.
        
       | kelsey98765431 wrote:
       | if you have to 429 people for an rss feed the problem is you
        
         | ramses0 wrote:
         | If you don't stop at red lights, the problem is other people.
         | /s
        
         | dxdm wrote:
         | Nobody owes these people and their feed readers a 200 whenever
         | they want one.
        
         | quectophoton wrote:
          | I think it's an acceptable response. Not only is there no SLA,
         | but people are free to not provide a service to misbehaving
         | user agents. It's like rejecting connections from Tor.
         | 
         | If anything, a 429 is a nice heads up. It could have been
         | worse; she could have redirected those requests to a separate
          | URL with... unpleasant content, like a certain domain that
         | redirects to I-don't-know-what whenever they detect the Referer
         | header is from HN.
        
           | redleader55 wrote:
           | As interesting as that site is, and as much as I sympathise
           | with the author's plight, that site's behavior is so anti-me
           | that I'm going to ignore it whenever/wherever it pops up. I'm
           | not trolling the author, I'm not calling them names or
           | anything, I was just interested in the technical stuff. I
           | wish them good luck.
        
         | noident wrote:
         | There's a particular type of person that scours their HTTP logs
         | and makes up rules that block 90% of feed readers using the
         | default poll interval. If I stick your RSS feed into Miniflux
         | and I get 429'd, I just stop reading your blog. Learn2cache.
         | I'm talking to you, Cheapskate's Guide.
        
           | Kudos wrote:
           | This site would not 429 current Miniflux, since it makes
           | conditional requests. She has a previous post outlining cache
           | respecting behaviour of many common feed readers.
        
             | Kwpolska wrote:
             | It could 429 it for conditional requests as well:
             | 
             | > Unconditional requests: at most once per 24 hour period.
             | 
             | > Conditional requests: at most once per 60 minute period.
             | 
             | (Source: calling `curl
             | hxxps://rachelbythebay[.]com/w/atom.xml` twice)
        
         | silvestrov wrote:
          | Not when the client sends _unconditional requests_, i.e. ones
          | missing the If-Modified-Since and If-None-Match headers.
         | 
         | All feed readers/clients should cache responses when sending
         | multiple requests the same day.
        
       | generationP wrote:
       | Rejecting every unconditional GET after the first? That sounds a
       | bit excessive. What if the reader crashed after the first and
       | lost the data?
        
         | brookst wrote:
          | It's an RSS feed. In that case, wait until the specified time
         | and try again and any missed article will appear then. If it is
         | constantly crashing so articles never get loaded, fix that.
        
           | aleph_minus_one wrote:
           | > If it is constantly crashing so articles never get loaded,
           | fix that.
           | 
            | This often requires doing lots of tests against the endpoint,
           | which the server prohibits.
        
             | im3w1l wrote:
             | If you are an rss-reader dev then you can set up a caching
             | layer of your own.
        
               | aleph_minus_one wrote:
               | > If you are an rss-reader dev then you can set up a
               | caching layer of your own.
               | 
               | But are RSS reader devs _willing_ to jump through such
               | hoops?
               | 
               | I would claim that writing a (simple) RSS reader (using a
               | programming language that provides suitable libraries) is
               | something that would be rather easy for me, but setting
               | up a caching layer would (because I have less knowledge
               | about the latter topic) take a lot more research from my
               | side concerning how to do it.
        
               | im3w1l wrote:
               | Sure, I have done such a thing myself and it was very
               | simple. Let's say you do http_get(rss_address). Create a
               | function http_cached_get, that looks for a recent cached
               | response, and if none exists delegates to http_get and
               | saves the response. In python this is like 10 lines.
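                | 
                | Roughly, as an untested sketch (assuming the requests
                | library and a flat on-disk cache; the names are just the
                | ones used above):
                | 
                |     import os
                |     import time
                |     import requests
                | 
                |     CACHE_DIR = "feed_cache"
                |     MAX_AGE = 3600  # seconds between real fetches
                | 
                |     def http_cached_get(url):
                |         os.makedirs(CACHE_DIR, exist_ok=True)
                |         path = os.path.join(CACHE_DIR, url.replace("/", "_"))
                |         # Reuse the saved copy if it is recent enough.
                |         if (os.path.exists(path)
                |                 and time.time() - os.path.getmtime(path) < MAX_AGE):
                |             with open(path, "rb") as f:
                |                 return f.read()
                |         body = requests.get(url, timeout=30).content
                |         with open(path, "wb") as f:
                |             f.write(body)
                |         return body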
        
         | XCabbage wrote:
         | For that matter, what if it's a different device (or entire
         | different human being) on the same IP address?
        
       | Apreche wrote:
       | Feed readers should be sending the If-Modified-Since header and
        | web sites should properly recognize it and send the 304 Not
        | Modified response. This isn't new tech.
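        | 
        | A minimal sketch of the client side (assuming Python and the
        | requests library; the caller persists etag / last_modified
        | between polls):
        | 
        |     import requests
        | 
        |     def poll_feed(url, etag=None, last_modified=None):
        |         headers = {}
        |         if etag:
        |             headers["If-None-Match"] = etag
        |         if last_modified:
        |             headers["If-Modified-Since"] = last_modified
        |         r = requests.get(url, headers=headers, timeout=30)
        |         if r.status_code == 304:
        |             # Not Modified: keep using the copy we already have.
        |             return None, etag, last_modified
        |         r.raise_for_status()
        |         return (r.content,
        |                 r.headers.get("ETag"),
        |                 r.headers.get("Last-Modified"))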
        
         | dartos wrote:
          | If only people knew of the standards
        
         | graemep wrote:
         | That is exactly what the article says.
        
           | smallerize wrote:
           | The article implies this but doesn't actually say it. It's
           | nice to have the extra detail.
        
             | avg_dev wrote:
             | While it might be nice if the article spelled out the
             | header, I do believe that there is more than implication
             | present.
             | 
             | > 00:04:51 GET /w/atom.xml, unconditional.
             | 
             | > Fulfilled with 200, 502 KB.
             | 
             | > [...]
             | 
             | > A 20 minute retry rate with unconditional requests is
             | wasteful. [...]
             | 
             | And If-Modified-Since makes a request conditional.
             | https://developer.mozilla.org/en-
             | US/docs/Web/HTTP/Conditiona...
        
               | JadeNB wrote:
               | You left out a further explicit mention of conditional
               | requests:
               | 
               | > Advised (via Retry-After header) to come back in one
               | day since they are unwilling or unable to do conditional
               | requests.
               | 
               | But I think it's still unarguable that the post doesn't
                | _explicitly_ mention If-Modified-Since, which it's not
               | obliged to do, but the mention of it here could be
               | helpful to someone. So why fuss?
        
           | shkkmo wrote:
           | The people who already know that a "conditional request"
            | means a request with an If-Modified-Since header aren't the
           | ones who need to learn this information.
        
       | strogonoff wrote:
        | A friend of mine has co-run a semi-popular, semi-niche news site
        | for more than a decade now, and complains that traffic recently
        | rose due to bots masquerading as humans.
       | 
       | How would they know? Well, because Google, in its omniscience,
       | started to downrank them for faking views with bots (which they
       | do not do): it shows bot percentage in traffic stats, and it
       | skyrocketed relative to non-bot traffic (which is now less than
       | 50%) as they started to fall from the front page (feeding the
       | vicious circle). Presumably, Google does not know or care it is a
       | bot when it serves ads, but correlates it later with the metrics
       | it has from other sites that use GA or ads.
       | 
       | Or, perhaps, Google spots the same anomalies that my friend (an
       | old school sysadmin who pays attention to logs) did, such as the
       | increase of traffic along with never seen before popularity among
       | iPhone users (who are so tech savvy that they apparently do not
       | require CSS), or users from Dallas who famously love their
       | QQBrowser. I'm not going to list all telltale signs as the crowd
       | here is too hype on LLMs (which is our going theory so far, it is
       | very timely), but my friend hopes Google learns them quickly.
       | 
       | These newcomers usually fake UA, use inconspicuous Western IPs
       | (requests from Baidu/Tencent data center ranges do sign
       | themselves as bots in UA), ignore robots.txt and load many pages
       | very quickly.
       | 
        | I would assume the bot traffic increase applies to feeds as
        | well, since they are just as useful for LLM training purposes.
       | 
       | My friend does not actually engage in stringent filtering like
       | Rachel does, but I wonder how soon it becomes actually infeasible
       | to operate a website _with actual original content_ (which my
       | friend co-writes) without either that or resorting to Cloudflare
       | or the like for protection because of the domination of these
       | creepy-crawlies.
       | 
       | Edit: Google already downranked them, not threatened to downrank.
       | Also, traffic rose but did not skyrocket, but relative amount of
       | bot traffic skyrocketed. (Presumably without downranking the
       | traffic would actually skyrocket.)
        
         | blfr wrote:
         | QQBrowser users from Dallas are more likely to be Chinese using
         | a VPN than bots, I would guess.
        
           | strogonoff wrote:
            | That much is clear, yeah. The VPN they use may not be a
            | service advertised to the public and featured in lists,
            | however.
           | 
           | Some of the new traffic did come directly from Tencent data
           | center IP ranges and reportedly those bots signed themselves
           | in UA. I can't say whether they respect robots.txt because I
           | am told their ranges were banned along with robots.txt
           | tightening. However, US IP bots that remain unblocked and
           | fake UA naturally ignore robot rules.
        
             | thaumasiotes wrote:
             | > The VPN they use may not be a service advertised to
             | public and featured in lists, however.
             | 
             | Well, of course not, since the service is illegal.
        
           | m3047 wrote:
           | I'm seeing some address ranges in the US clearly serving what
           | must be VPN traffic from Asia, and I'm also seeing an uptick
           | in TOR traffic looking for feeds as well as WP infra.
        
         | afandian wrote:
         | Are you saying that Google down-ranked them in search engine
         | rankings for user behaviour in AdWords? Isn't that an abuse of
         | monopoly? It still surprises me a little bit.
        
         | BadHumans wrote:
         | At my company we have seen a massive increase in bot traffic
         | since LLMs have become mainstream. Blocking known OpenAI and
         | Anthropic crawlers has decreased traffic somewhat so I agree
         | with your theory.
        
         | m3047 wrote:
         | It's not that hard to dominate bots. I do it for fun, I do it
         | for profit. Block datacenters. Run bot motels. Poison them. Lie
         | to them. Make them have really really bad luck. Change the cost
         | equation so that it costs them more than it costs you.
         | 
         | You're thinking of it wrong, the seeds of the thinking error
         | are here: "I wonder how soon it becomes actually infeasible to
         | operate a website with actual original content".
         | 
         | Bots want original content, no? So what's the problem with
         | giving it to them? But that's the issue, isn't it? Clearly,
         | contextually, what you should be saying is "I wonder how soon
         | it becomes actually infeasible to operate a website for actual
         | organic users" or something like that. But phrased that way,
          | I'm not sure a CDN helps (I'm not sure they don't suffer false
          | positives which interfere with organic traffic when they
          | intermediate; it's more security theater, because hangings and
          | executions look good: look at the numbers of enemy dead).
         | 
         | Take measures that any damn fool (or at least your desired
         | audience) can recognize.
         | 
         | Reading for comprehension, I think Rachel understands this.
        
           | throaway89 wrote:
           | what is a bot motel and how do you run one?
        
             | m3047 wrote:
              | An easy way is to implement e.g. a 4xx handler which serves
             | content with links which generate further 4xx errors and
             | rewrite the status code to something like 200 when sent to
             | the requester. Load the garbage pages up with... garbage.
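              | 
              | A very rough sketch of that idea (assuming Flask; a real
              | one would randomize and vary the garbage much more):
              | 
              |     import random
              |     import string
              |     from flask import Flask
              | 
              |     app = Flask(__name__)
              | 
              |     def junk(n=8):
              |         return "".join(random.choices(string.ascii_lowercase, k=n))
              | 
              |     @app.errorhandler(404)
              |     def bot_motel(_err):
              |         # Serve the "missing" page as a 200 stuffed with links
              |         # to more missing pages, so crawlers that ignore errors
              |         # wander deeper instead of backing off.
              |         links = " ".join(f'<a href="/{junk()}">{junk()}</a>'
              |                          for _ in range(20))
              |         filler = " ".join(junk() for _ in range(200))
              |         page = f"<html><body><p>{filler}</p>{links}</body></html>"
              |         return page, 200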
        
               | throaway89 wrote:
               | Thanks, and you can make money with this? Sorry I'm a
               | total noob in this area.
        
             | yesco wrote:
             | The idea is that bots are inflexible to deviations from
             | accepted norms and can't actually "see" rendered browser
              | content. So if your generic 404 and 403 error pages return
              | a 200 status instead, with invisible links to other non-
              | accessible pages, the bots will follow the links but real
              | users will not, trapping them in a kind of isolated
             | labyrinth of recursive links (the urls should be slightly
             | different though). It's basically how a lobster trap works
             | if you want a visual metaphor.
             | 
             | The important part here is to do this chaotically. The
             | worst sites to scrape are buggy ones. You are, in essence,
             | deliberately following bad practices in a way real users
             | wouldn't notice but would still influence bots.
        
       | 6510 wrote:
        | I ban the feed for 24 hours if it doesn't work.
       | 
        | I also designed 2 new formats that no one (including myself) has
       | ever implemented.
       | 
       | https://go-here.nl/ess-and-nno
       | 
       | enjoy
        
       | Forge36 wrote:
        | I couldn't find the tester. Thankfully the client I use was
        | tested... and it behaves poorly. Thankfully emacs has a client I
        | can switch to!
        
       | bombcar wrote:
        | At some point, instead of a 429, it should return a feed with
        | this post always as the newest item.
        
       | internet2000 wrote:
       | Does anyone know if FreshRSS behaves properly here?
        
         | reocha wrote:
         | Earlier article with some info on freshrss:
         | https://rachelbythebay.com/w/2024/10/25/fs/
        
       | Havoc wrote:
       | Blocked for 2 hits in 20 minutes on a light protocol like rss?
       | 
       | That seems hilariously aggressive to me, but her server her rules
       | I guess.
        
         | garfij wrote:
         | I believe if you read carefully, it's not blocked, it's rate
         | limited to once daily, with very clear remediation steps
         | included in the response.
        
           | that_guy_iain wrote:
           | If you understand what rate limiting is, you block them for a
           | period of time. Let's stop being pedantic here.
           | 
           | 72 requests per day is nothing and acting like it's mayhem is
            | a bit silly. And for a lot of people it would result in them
            | getting news later. Sure, OP won't publish that
           | often but their rate limiting is an edge case and should be
           | treated as such. If they're blocked until the next day and
           | nothing gets updated then the only person harmed is OP for
           | being overly bothered by their HTTP logs.
           | 
           | Sure it's their server and they can do whatever they want.
            | But all this does is hurt the people trying to reach their
           | blog.
        
             | HomeDeLaPot wrote:
             | 72 requests per day _per user with a naive feed reader_.
             | This is a small personal blog with no ads that OP is self-
             | hosting on her own hardware, so blocking all this junk
             | traffic is probably saving her money. Plus she's calling
             | attention to how feed readers can be improved!
        
               | that_guy_iain wrote:
               | Even if they had 1000 feed readers which would be a
               | massive amount for a blog, if you can't scale that
               | cheaply, that's on you.
               | 
               | As I pointed out, her blog and rate limiting are an
               | extreme edge case, it would be silly for anyone to put
               | effort into changing their feed reader for a single small
               | blog. It's bad product management.
        
               | tecleandor wrote:
                | Of course she can. It's static. She doesn't want to, and
                | I understand. She's signaling to her clients, through a
                | standard mechanism, "I think you have already read this;
                | at least ask me first whether it has changed since last
                | time".
        
               | m3047 wrote:
               | My reason for smacking stuff down is that I don't want to
               | see it in my logs. That simple.
        
             | throw0101b wrote:
             | > _72 requests per day is nothing and acting like it 's
             | mayhem is a bit silly._
             | 
             | 72 requests per day per IP over _how many IPs_? When you
             | start multiplying numbers together they can get big.
        
             | quest88 wrote:
             | I invite you to run your own popular blog on your own
             | hardware and pay for the costs. It sounds like you don't
             | know what the true costs are.
        
               | that_guy_iain wrote:
               | Sounds like you don't know how to scale for cheap.
               | 
                | And since I've run integrations that connected over 500
                | companies, I know what a rogue client actually looks
                | like, and 72 requests per day is something I wouldn't
                | even notice.
        
               | donatj wrote:
               | I do run a popular blog, and a $5 a month Digital Ocean
               | droplet handles millions of requests per month without
               | breaking a sweat.
        
               | devjab wrote:
               | If every user is collecting 36mb a day like in the story
               | here, your droplet wouldn't even be capable of serving
               | 500 users a month without hitting your bandwidth limit.
               | With their current rates, your one million requests would
               | cost you around 10 million USD.
        
               | sccxy wrote:
               | If OP enabled gzip then this 36mb would be 13mb.
               | 
               | If OP reduced 30 months of posts in rss to 12 months then
               | this 13mb would be 5mb a day.
               | 
               | Using Cloudflare free plan and this static content is
               | cached without any problem.
        
               | donatj wrote:
                | 30 * 500 * 36 MB = 540 GB and I have 1 TB a month on my
                | apparently $6 droplet
               | 
               | Correction - from my billing page it's $4.50 a month,
               | from the resize page it is $6 so I'm guessing I am
               | grandfathered in to some older pricing
        
               | tecleandor wrote:
                | That's a ridiculously big quantity of data to serve for
                | a seldom-updated blog, just because the client doesn't
                | want to (or doesn't know how to, or didn't think to)
                | implement an easy and old HTTP mechanism.
                | 
                | Imagine the petabytes of internet transfer saved if a
                | couple of RSS clients added that mechanism.
        
               | sccxy wrote:
                | More like a skill issue, or just a decision to make your
               | life more difficult.
               | 
               | It is free and easy to scale this kind of text based
               | blog.
        
               | Twirrim wrote:
               | OP has _never_ said that this is about financial aspects
               | of things.
        
               | Joker_vD wrote:
                | Yes, it's about enforcing their preference on how others
               | should interact with OP's published site feed, on
               | principle. Which is always an uphill battle.
        
         | sccxy wrote:
          | Her RSS feed is the last 100 posts with full content.
          | 
          | So it means 30 months of blog post content in a single request.
         | 
          | Sending 0.5 MB in a single RSS response is more of a crime
          | than those 2 hits in 20 minutes.
        
           | horsawlarway wrote:
           | I generally agree here.
           | 
           | There are a lot of very valid use cases where defaulting to
           | deny for an entire 24 hour cycle after a single request is
            | incredibly frustrating for your downstream users (shared IP
           | at my university means I will never get a non-429 response...
           | And God help me if I'm testing new RSS readers...)
           | 
           | It's her server, so do as you please, I guess. But it's a
           | hilariously hostile response compared to just returning less
           | data.
        
             | mrweasel wrote:
             | > But it's a hilariously hostile response compared to just
             | returning less data.
             | 
             | So provide a poor service to everyone, because some people
              | don't know how to behave. That seems like an even worse
             | response.
        
               | sccxy wrote:
               | Send only one year's recent posts and you've reduced
               | bandwidth by 50%.
        
               | wakawaka28 wrote:
               | People don't want to have to customize refresh rates on a
               | per-feed basis. Perhaps the RSS or Atom standards need to
               | support importing the recommended refresh rate
               | automatically.
        
           | wakawaka28 wrote:
           | Yes that's right. Most blogs that are popular enough to have
           | this problem send you the last 10 post titles and links or
           | something. THAT is why people refresh every hour, so they
           | don't miss out.
        
         | yladiz wrote:
          | That's a bit disingenuous. 429s aren't "blocking", they're
          | telling the requester that they've made too many requests and
          | should try again later (with a value in the header). I assume
          | the author configured this because they know how often the
          | site typically changes. That the web server eventually stops
          | responding if the client keeps ignoring the 429s isn't that
          | surprising, but I doubt it was configured directly either.
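          | 
          | A well-behaved client needs only a few lines to honor that
          | hint; a sketch (assuming Python and the requests library, and
          | handling only the seconds form of Retry-After, not the HTTP-
          | date form):
          | 
          |     import requests
          | 
          |     def fetch_respecting_429(url):
          |         r = requests.get(url, timeout=30)
          |         if r.status_code == 429:
          |             # The server said when to come back; schedule the
          |             # next poll for then instead of retrying now.
          |             wait = int(r.headers.get("Retry-After", "86400"))
          |             print(f"rate limited; retry in {wait} seconds")
          |             return None
          |         r.raise_for_status()
          |         return r.content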
        
           | that_guy_iain wrote:
            | I would say it's disingenuous to claim that sending an
            | unexpected HTTP status and body for a period of time is not
            | blocking them for that period of time. You can be pedantic
           | and claim "but they can still access the server" but in
           | reality that client is blocked for a period of time.
        
             | kstrauser wrote:
             | In that case, I should be irate that the AWS API blocks me
             | many times per day. Run `aws cli service some-paginated-
             | thing` and see how many retries you get during normal,
             | routine operation.
             | 
             | But I'm not, because they're not blocking me. They're
             | asking my client to slow down. Neither AWS nor Rachel's
             | blog owes me unlimited requests per unit time, and neither
             | have "blocked" me when I violate they policies.
        
               | that_guy_iain wrote:
               | They literally do block you for a period of time until
               | you are out of the rate limit. That is how rate limits
               | work. That's why you don't get to access the resource you
               | requested, because their system literally blocked you
               | from doing so.
               | 
               | See when you're trying to be pedantic and all about
               | semantics, you should make sure you've crossed your Ts
               | and dotted your Is.
               | 
               | > Block - AWS WAF blocks the request and applies any
               | custom blocking behavior that you've defined.
               | 
               | from https://docs.aws.amazon.com/waf/latest/developerguid
               | e/waf-ru...
               | 
               | And my favourite
               | 
               | > Rate limiting blocks users, bots, or applications that
               | are over-using or abusing a web property. Rate limiting
               | can stop certain kinds of bot attacks.
               | 
               | From CloudFlare's explainer
               | https://www.cloudflare.com/learning/bots/what-is-rate-
               | limiti...
               | 
               | Every documentation on rate limit will include the word
               | block. Because that's what you do, you allow access for a
               | specific amount of requests and then block those that go
               | over.
        
           | luckylion wrote:
           | > 429s aren't "blocking"
           | 
           | Like how "unlimited traffic, but will slow down to 1bps if
           | you use more than 100gb in a month" is technically "unlimited
           | traffic".
           | 
           | But for all intents and purposes, it's limited. And 429 are
           | blocking. They include a hint towards the reason why you are
           | blocked and when the block might expire (retry-after doesn't
           | promise that you'll be successful if you wait), but besides
            | that, what's the difference compared to 403?
        
             | yladiz wrote:
             | I would disagree. Blocking typically implies permanence
             | (without more action by the blockee), and since 429 isn't
             | usually a permanent error code I wouldn't call it blocking.
             | Same applies with 403, it's only permanent if the requester
             | doesn't authorize correctly.
        
           | Havoc wrote:
            | Semantics. 429 is an error code. Rate
            | limiting... blocking... too many requests... ignoring... call
            | it whatever you like, but it amounts to the same thing,
            | namely that the server isn't serving the requested content.
        
         | cesarb wrote:
         | > Blocked for 2 hits in 20 minutes on a light protocol like
         | rss?
         | 
         | I might be getting old, but 500KB in a single response doesn't
         | feel "light" to me.
        
           | sccxy wrote:
           | Yes, this is a very poorly designed RSS feed.
           | 
           | 500KB is horrible for RSS.
        
             | Symbiote wrote:
              | It's reasonable to have whole articles in RSS, if you
              | aren't trying to show ads or similar.
        
               | sccxy wrote:
               | Whole articles are reasonable.
               | 
               | 100 articles are not reasonable.
               | 
               | 100 articles where most of them are 1+ year old is
               | madness.
               | 
               | RSS is not an archive of the entire website.
        
               | sangnoir wrote:
               | > RSS is not an archive of the entire website
               | 
                | Whole-article feeds end up becoming exactly that - a
                | local archive of a blog.
        
         | mrweasel wrote:
         | But it's not a "light" protocol when you're serving 36MB per
         | day, when 500KB would suffice. RSS/Atom is light weight, if
         | clients play by the rules. This could also have been a news
         | website, imagine how much traffic would be dedicated to
         | pointless transfers of unchanged data. Traffic isn't free.
         | 
          | A similar problem arises from the increase in AI scraper
          | activity. Talking to other SREs, the problem seems pretty
          | widespread. AI companies will just hoover up data, but revisit
          | so frequently and aggressively that it's starting to affect the
          | transit feeds for popular websites. Frequently user-agents
          | aren't set to something unique, or are deliberately hidden,
          | and traffic originates from AWS, making it hard to target
          | individual bad actors. Fair enough that you're scraping
          | websites, that's part of the game when you're online, but when
          | your industry starts to affect transit feeds, then we need to
         | talk compensation.
        
         | II2II wrote:
         | If your feed reader is refreshing every 20 minutes for a blog
         | that is updated daily, nearly 99% of the data sent is
         | identical. It looks like Rachel's blog is updated (roughly)
         | weekly, so that jumps to 99.8%. It's not the least efficient
         | thing in the world of computers, but it is definitely incurring
         | unnecessary costs.
        
           | elashri wrote:
            | I opened the xml file she provides in the blog and it seems
            | very long, but okay. Then I decided it is a good blog to
            | subscribe to, so I went and tried to add it to my self-hosted
            | FreshRSS instance (same IP, obviously) and I couldn't,
            | because I got blocked/rate limited. So yes, it is aggressive
            | for different reasons.
        
             | mubou wrote:
             | Yeah, that's insane. Pretty much telling me not to
             | subscribe to your blog at that point. Like sites that have
             | an rss feed yet put Cloudflare protection in front of it...
             | 
             | The correct thing to do here is put a caching layer in
             | front so that every feed reader isn't simultaneously
             | hitting the origin for the same content. IP banning is the
             | wrong approach. (Even if it's only a temporary block,
             | that's going to cause my reader to show an error and is
             | entirely unnecessary.)
        
             | radicality wrote:
              | Weird. Those should have had different user-agents, and I
              | would guess it cannot be purely based on IP.
        
             | KomoD wrote:
             | Same, I made 3 requests in total and got blocked.
        
           | wakawaka28 wrote:
           | It should be a timeboxed block if anything. Most RSS users
           | are actual readers and expecting them to spend lots of time
           | figuring out why clicking "refresh" twice on their RSS app
           | got them blocked is totally unreasonable. I've got my feeds
           | set up to refresh every hour. Considering the small number of
           | people still using RSS and how lightweight it is, it's not
           | bad enough to freak out over. At some point all Rachel's
           | complaining and investigating will be more work than her
           | simply interacting directly with the makers of the various
           | readers that cause the most traffic.
        
       | nilslindemann wrote:
        | I am stupid, why not just return an HTML document explaining the
        | issue when there is such an incorrect second request within 20
        | minutes, and then block that IP for 24 hours? The feed reader
       | software author has to react, otherwise its users will complain
       | to him, no?
        
         | ImPostingOnHN wrote:
         | It might be clever to return an rss feed containing 1 item: the
         | html document you mention.
        
         | ruszki wrote:
         | That's what 429 return is for, which is mentioned in the
         | article.
        
       | PaulHoule wrote:
        | This is why RSS is for the birds.
       | 
       | My RSS reader YOShInOn subscribes to 110 RSS feeds through
       | Superfeedr which absolves me of the responsibility of being on
       | the other side of Rachel's problem.
       | 
       | With RSS you are always polling too fast or too slow; if you are
       | polling too slow you might even miss items.
       | 
       | When a blog gets posted Superfeedr hits an AWS lambda function
       | that stores the entry in SQS so my RSS reader can update itself
       | at its own pace. The only trouble is Superfeedr costs 10 cents a
       | feed per month which is a good deal for an active feed such as
        | comments from Hacker News or articles from _The Guardian_ but is
       | not affordable for subscribing to 2000+ indy blogs which YOShInOn
       | could handle just fine.
       | 
        | I might yet write my own RSS head end, but there is something to
        | be said for protocols like ActivityPub and AT Protocol.
        
         | rakoo wrote:
          | That's why WebSub (formerly PubSubHubbub) was created; it
          | should be the proper solution, not proprietary middleware.
        
       | donatj wrote:
       | On the flip side, what percent of RSS feed _generators_ actually
        | support conditional requests? I've written many over the last
       | twenty years and I can tell you plainly, none of the ones I wrote
       | have.
       | 
       | I never even considered the option or necessity. It's easy and
       | cheap just to send everything.
       | 
        | I guess static generators behind an Apache-style web server
        | probably do, but I can't imagine any dynamic generators bother to
        | try to save the small handful of bytes.
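        | 
        | For what it's worth, the dynamic case is only a few more lines.
        | A rough sketch (assuming Flask; build_feed() and the stored
        | last-build time stand in for whatever the generator already
        | has):
        | 
        |     from datetime import datetime, timezone
        |     from email.utils import format_datetime, parsedate_to_datetime
        |     from flask import Flask, Response, request
        | 
        |     app = Flask(__name__)
        |     FEED_UPDATED = datetime(2024, 12, 18, tzinfo=timezone.utc)
        | 
        |     def build_feed():
        |         # placeholder for the real feed generator
        |         return "<feed xmlns='http://www.w3.org/2005/Atom'>...</feed>"
        | 
        |     @app.route("/atom.xml")
        |     def feed():
        |         ims = request.headers.get("If-Modified-Since")
        |         if ims:
        |             try:
        |                 if parsedate_to_datetime(ims) >= FEED_UPDATED:
        |                     return Response(status=304)  # client is current
        |             except (TypeError, ValueError):
        |                 pass  # malformed header: just serve the full feed
        |         resp = Response(build_feed(),
        |                         mimetype="application/atom+xml")
        |         resp.headers["Last-Modified"] = format_datetime(
        |             FEED_UPDATED, usegmt=True)
        |         return resp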
        
         | aendruk wrote:
         | For another perspective, I can offer the data point that the
         | one dynamic feed generator I've written supports both If-
          | Modified-Since and If-None-Match, and that I considered that
          | to be an obvious requirement from the beginning.
        
       | wheybags wrote:
        | RSS is pretty light. Even if you say it's too much to be re-
       | sending, you could remove the content from the rss feed (so they
       | need to click through to read it), which would shrink the feed
       | size massively. Alternatively, remove old posts. Or do both.
       | 
       | Hopefully you don't have some expensive code generating the feed
       | on the fly, so processing overhead is negligible. But if it's
       | not, cache the result and reset the cache every time you post.
       | 
       | Surely this is easier than spending the effort and emotional
       | bandwidth to care about this issue?
       | 
       | I might be wrong here, but this feels more emotionally driven
       | ("someone is wrong on the internet") than practical.
        
         | gavinsyancey wrote:
         | As a user of the RSS feed, please don't remove content from it
         | so I have to click through. This makes it much less useful and
         | more annoying to use.
        
           | wheybags wrote:
           | I always click through regardless, because the rss text is
           | probably missing formatting and images. I'll never be sure
           | I'm getting a proper copy of the article unless I click
           | through anyway.
        
       | ruuda wrote:
       | I have a blog where I post a few posts per year. [1] /feed.xml is
       | served with an Expires header of 24 hours. I wrote a tool that
       | allows me to query the webserver logs using SQLite [2]. Over the
       | past 90 days, these are the top 10 requesters grouped by ip
       | address (remote_addr column redacted here):
        |     requests_per_day  user_agent
        |                  283  Reeder/5050001 CFNetwork/1568.300.101 Darwin/24.2.0
        |                  274  CommaFeed/4.4.0 (https://github.com/Athou/commafeed)
        |                  127  Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
        |                   52  NetNewsWire (RSS Reader; https://netnewswire.com/)
        |                   47  Tiny Tiny RSS/23.04-0578bf80 (https://tt-rss.org/)
        |                   47  Refeed Reader/v1 (+https://www.refeed.dev/)
        |                   46  Selfoss/2.18 (SimplePie/1.5.1; +https://selfoss.aditu.de)
        |                   41  Reeder/5040601 CFNetwork/1568.100.1.1.1 Darwin/24.0.0
        |                   39  Tiny Tiny RSS/23.04 (Unsupported) (https://tt-rss.org/)
        |                   34  FreshRSS/1.24.3 (Linux; https://freshrss.org)
       | 
       | Reeder is loading the feed every 5 minutes, and in the vast
       | majority of cases it's getting a 301 response because it tries to
       | access the http version that redirects to https. At least it has
       | state and it gets 304 Not Modified in the remaining cases.
       | 
       | If I order by body bytes served rather than number of requests
       | (and group by remote_addr again), these are the worst consumers:
        |     body_megabytes_per_year  user_agent
        |                149.75943975  Refeed Reader/v1 (+https://www.refeed.dev/)
        |                 95.90771025  Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
        |                 75.00080025  rss-parser
        |                   73.023702  Tiny Tiny RSS/24.09-0163884ef (Unsupported) (https://tt-rss.org/)
        |                   38.402385  Tiny Tiny RSS/24.11-42ebdb02 (https://tt-rss.org/)
        |                   37.984539  Selfoss/2.20-cf74581 (+https://selfoss.aditu.de)
        |                  30.3982965  NetNewsWire (RSS Reader; https://netnewswire.com/)
        |                 28.18013325  Tiny Tiny RSS/23.04-0578bf80 (https://tt-rss.org/)
        |                   26.330142  Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36
        |                   24.838461  Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36
       | 
       | The top consumer, Refeed, is responsible for about 2.25% of all
       | egress of my webserver. (Counting only body bytes, not http
       | overhead.)
       | 
       | [1]: https://ruudvanasseldonk.com/writing [2]:
       | https://github.com/ruuda/sqlog/blob/d129db35da9bbf95d8c2e97d...
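        | 
        | In case it's useful, the query is roughly this shape (a sketch;
        | the real schema comes from the sqlog tool in [2], so the table
        | and column names here are guesses):
        | 
        |     import sqlite3
        | 
        |     con = sqlite3.connect("logs.sqlite")
        |     rows = con.execute("""
        |         SELECT count(*) / 90.0 AS requests_per_day, user_agent
        |         FROM requests          -- hypothetical access-log table
        |         GROUP BY remote_addr   -- one row per client IP
        |         ORDER BY requests_per_day DESC
        |         LIMIT 10
        |     """).fetchall()
        |     for rate, user_agent in rows:
        |         print(f"{rate:16.1f}  {user_agent}")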
        
       | mixmastamyk wrote:
       | I have a few feeds configured into Thunderbird but wasn't reading
       | them very often, so I "disabled" them to load manually. Despite
       | this it tries to contact the sites often and, when not able to
       | (firewall) goes into a frenzy of trying to contact them. All this
       | despite being disabled.
       | 
        | Disappointing, combined with the various update sites it tries to
        | contact on every startup, which is completely unnecessary as
        | well. A couple of times a week should be the maximum rate.
        
       | euroderf wrote:
       | which => that
        
         | thaumasiotes wrote:
         | They're exactly equivalent. What are you hoping to correct?
        
           | euroderf wrote:
           | They're obviously not.
           | 
            | It's a Briticism (AFAICT) making inroads.
        
             | sangnoir wrote:
             | Oh no, not the British influencing the English language! I
             | can't be arsed* about which vs. that when "on accident" has
             | become semi-accepted (as the opposite of "on purpose").
             | Yuck.
        
       ___________________________________________________________________
       (page generated 2024-12-22 23:00 UTC)