[HN Gopher] Deep dive into finding RSS feeds
       ___________________________________________________________________
        
       Deep dive into finding RSS feeds
        
       Author : domysee
       Score  : 126 points
       Date   : 2024-12-06 19:17 UTC (1 day ago)
        
 (HTM) web link (lighthouseapp.io)
        
       | superkuh wrote:
       | I generally try: /rss, /feed, /index.xml, /rss.xml, /feed.xml,
       | etc. And at various root or /directory/* locations.
       | https://blog.jim-nielsen.com/2021/feed-urls/ is a good article
       | with statistics on naming.
       | 
       | I've been adding to my feeds.opml since reddit started dying in
       | ~2015 and now I'm up to around ~1700 feeds and mostly independent
       | from aggregators; though I still collect new feeds from
       | HN/IRC/etc. Mostly I just always make a point to look for them
       | whenever I read something cool on the web.
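
A minimal sketch of the guess-the-path approach described above, in Python. The path list is only the one named in the comment (the "etc." is left open), and the function name and `directories` parameter are hypothetical, chosen for illustration.

```python
from urllib.parse import urljoin

# Paths named in the comment above; extend as needed.
COMMON_FEED_PATHS = ["/rss", "/feed", "/index.xml", "/rss.xml", "/feed.xml"]


def candidate_feed_urls(site_url, directories=("/",)):
    """Return feed URLs worth probing, at the root and any extra directories."""
    candidates = []
    for directory in directories:
        # Normalise the directory so it always ends with exactly one slash.
        base = urljoin(site_url, directory.rstrip("/") + "/")
        for path in COMMON_FEED_PATHS:
            url = urljoin(base, path.lstrip("/"))
            if url not in candidates:
                candidates.append(url)
    return candidates


if __name__ == "__main__":
    for url in candidate_feed_urls("https://example.com", ("/", "/blog")):
        print(url)
```

Each candidate would still need to be fetched and checked for actual feed content; this only builds the list of URLs to try.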
        
         | yazmeya wrote:
         | You should also check if the web page actually exposes this
         | information in a <link rel="alternate"> tag. If you're running
         | Chrome, the "RSS Subscription Extension (by Google)" extension
         | [1] will do this for you automatically and light up an orange
         | icon in the extensions bar. It also integrates with popular RSS
         | aggregators so you can subscribe directly from the extension.
         | 
         | [1] https://chromewebstore.google.com/detail/rss-subscription-
         | ex...
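
For readers who want to do the same check without an extension, here is a rough sketch using only the Python standard library. It looks for `<link rel="alternate">` tags whose `type` is an RSS or Atom media type. The class and function names are made up for this example, and real pages may need more lenient matching.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rel_values = attributes.get("rel", "").lower().split()
        mime_type = attributes.get("type", "").strip().lower()
        href = attributes.get("href", "")
        if "alternate" in rel_values and mime_type in FEED_TYPES and href:
            # Resolve relative hrefs against the page they were found on.
            self.feeds.append(urljoin(self.page_url, href))


def discover_feeds(html_text, page_url):
    parser = FeedLinkParser(page_url)
    parser.feed(html_text)
    return parser.feeds
```

Usage: pass the page's HTML and its URL, for example `discover_feeds(html, "https://example.com/blog/")`, and probe the returned URLs.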
        
       | ulrischa wrote:
       | Would be nice if it is implemented in FreshRSS
        
         | quaintdev wrote:
         | I think Miniflux can do this. I just give it a website address
         | and it almost always finds the RSS feeds
        
       | 1123581321 wrote:
       | It'd be neat for readers to seamlessly integrate with a scraper,
       | either self-hosted or commercial, if no feed is found. I believe
       | Inoreader allows scraping a few sites depending on the plan
       | level; most reader services don't.
        
       | HumblyTossed wrote:
       | Back when I was young, websites had this icon you could click
       | that would take you straight to their RSS feed. You young
       | whippersnappers have gone and fucked that up. Actually, I think it was
       | Google's fault. When they killed their RSS reader people
       | pronounced RSS dead so people just stopped publishing RSS feeds
       | or just didn't link to them.
       | 
       | * Yes, I know the article talks about the RSS icon, I'm just
       | soapboxing.
        
         | AndrewStephens wrote:
         | Even better, for a few months the browsers themselves would
         | highlight RSS feeds and allow you to subscribe right in the
         | browser. It was too good to last.
         | 
         | RSS is great but it has one great flaw in that it doesn't scale
         | that well by itself. If 2 million people subscribe to your feed
         | and try to update it once an hour, that is 48 million requests
         | a day just for RSS.
         | 
         | What does work well (and how things have evolved) is to have a
         | service that polls RSS on behalf of its users. This was the
         | beauty of Google Reader but plenty of replacements exist.
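
The arithmetic in the comment above can be checked with a few lines of Python. The function names are hypothetical; the figures are the ones from the comment, and the per-second average simply assumes the load is spread evenly over 86,400 seconds.

```python
def requests_per_day(subscribers, polls_per_day):
    """Total feed requests per day if every subscriber polls directly."""
    return subscribers * polls_per_day


def average_requests_per_second(subscribers, polls_per_day):
    """Same load expressed as an even average over 86,400 seconds."""
    return requests_per_day(subscribers, polls_per_day) / 86_400


if __name__ == "__main__":
    print(requests_per_day(2_000_000, 24))                        # 48000000
    print(round(average_requests_per_second(2_000_000, 24), 1))   # 555.6
```

Changing `polls_per_day` shows how sensitive the total is to the polling interval.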
        
           | rodary wrote:
           | > Even better, for a few months the browsers themselves would
           | highlight RSS feeds and allow you to subscribe right in the
           | browser. It was too good to last.
           | 
           | Vivaldi browser still does that.
        
           | qw wrote:
           | > RSS is great but it has one great flaw in that it doesn't
           | scale that well by itself
           | 
           | The readers often cache the result for all users of the same
           | feed. The only exceptions are readers that run 100%
           | locally. In that case using ETag or Last-Modified will make
           | each request cheap and manageable.
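
A sketch of what using ETag or Last-Modified looks like from the client side, using Python's `urllib`. The client stores the validators from the previous response and sends them back as `If-None-Match` and `If-Modified-Since`; a `304 Not Modified` reply means the cached copy is still current. The function names and the 10-second timeout are assumptions for illustration.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build conditional request headers from previously stored validators."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if the feed is unchanged."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not modified: keep the cached copy and the old validators.
            return None, etag, last_modified
        raise
```

A reader would persist the returned validators per feed and pass them in on the next poll.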
        
             | AndrewStephens wrote:
             | That is my point. RSS by itself is a scaling problem but
             | online readers make it manageable by polling sites at a
             | reasonable rate and redistributing new articles to their
             | users.
        
           | lapcat wrote:
           | > RSS is great but it has one great flaw in that it doesn't
           | scale that well by itself. If 2 million people subscribe to
           | your feed and try to update it once an hour, that is 48
           | million requests a day just for RSS.
           | 
           | Who's updating their RSS feeds once an hour 24 hours a day?
           | 
           | Anyway, a very popular site with that many millions of
           | visitors already has to handle extreme traffic, regardless of
           | RSS.
        
             | ttyprintk wrote:
             | Surely this is solved by putting RSS behind Cloudflare.
        
               | lapcat wrote:
               | Sure, it's solved by Cloudflare breaking the RSS feed
               | entirely, which often happens when people put their
               | websites behind Cloudflare.
        
               | compootr wrote:
               | I think it should be noted that this happens when not
               | configuring things right.
               | 
               | By default, security rules are on, but I can disable
               | security rules programmatically for a hostname too.
               | 
               | FWIW, I once got DDoSed on a host running their Pages
               | product and wasn't charged for it. It also stayed up and
               | didn't serve CAPTCHA pages.
        
             | AndrewStephens wrote:
             | According to my logs, some feed readers poll every 10
             | minutes or so.
        
             | doubled112 wrote:
             | I think Miniflux polls every hour by default. Not sure
             | about others.
        
             | xtiansimon wrote:
             | "Who's updating their RSS feeds once an hour 24 hours a
             | day?"
             | 
             | RSS is a pull-type system, no? So the end-user is causing
             | the overload problem by hitting the publisher's RSS feed
             | every hour? The problem is out of the hands of the
             | publisher...
        
               | lapcat wrote:
               | > RSS is a pull-type system, no?
               | 
               | Yes.
               | 
               | The web in general is also a pull-type system.
               | 
               | > So the end-user is causing the overload problem by
               | hitting the publisher's RSS feed every hour?
               | 
               | There is no overload problem.
               | 
               | And again, this math is off, because an end user is not
               | even awake 24 hours a day: "If 2 million people subscribe
               | to your feed and try to update it once an hour, that is
               | 48 million requests a day just for RSS."
        
               | doubled112 wrote:
               | My home server is awake 24 hours a day.
               | 
               | Is that wasteful? Does that make me a poor Internet
               | citizen?
               | 
               | I'd never considered using RSS feeds from an app and
               | pulling it directly. I have too many devices for that to
               | be a great workflow.
               | 
               | Normally I'd say I was an outlier, but my gut feeling on
               | this is that the people still using RSS and people self-
               | hosting would overlap in a big way.
        
               | lapcat wrote:
               | > Is that wasteful? Does that make me a poor Internet
               | citizen?
               | 
               | I personally don't think it's a problem. As I said, "a
               | very popular site with that many millions of visitors
               | already has to handle extreme traffic, regardless of
               | RSS."
               | 
               | > I'd never considered using RSS feeds from an app and
               | pulling it directly. I have too many devices for that to
               | be a great workflow.
               | 
               | A lot of apps have sync to handle that workflow.
               | 
               | > my gut feeling on this is that the people still using
               | RSS and people self-hosting would overlap in a big way.
               | 
               | My gut feeling is that most RSS users, including myself,
               | simply use client apps.
        
           | cknight wrote:
           | > Even better, for a few months the browsers themselves would
           | highlight RSS feeds and allow you to subscribe right in the
           | browser. It was too good to last.
           | 
           | I use the Awesome RSS add-on for that in Firefox:
           | https://addons.mozilla.org/en-US/firefox/addon/awesome-rss/
        
             | sitkack wrote:
             | Firefox removing RSS support was their capitulation in
             | supporting the open web.
        
               | monooso wrote:
               | That is very ambiguous phrasing.
        
               | sitkack wrote:
               | Firefox removing RSS support was the nail in the coffin
               | for their support of the Open Web.
               | 
               | How long after Google Reader was canceled did FF remove
               | RSS?
        
           | ndriscoll wrote:
           | A $160 n100 minipc should have no problem serving an
           | essentially static file out of nginx at well above that rate.
           | If you use proper caching headers and don't publish new items
           | that often, it should basically be idle.
        
             | DanAtC wrote:
             | A lot of clients don't respect caching headers[0], but
             | yeah, static content hosting is a solved problem.
             | 
             | And there are always CDNs if you really have a large,
             | global audience.
             | 
             | [0] http://rachelbythebay.com/w/2024/10/25/fs/
        
           | jrm4 wrote:
           | But that's just a Cloudflare problem, no?
        
         | maxglute wrote:
         | Feedly has a reasonably adequate RSS builder based on site
         | elements for sites without RSS, but you only get a few feeds
         | with the Pro plan. I wish there were an open implementation. It
         | also seems like something an LLM would excel at.
        
         | jrm4 wrote:
         | I was about to say, that was _all_ Google.
         | 
         | I'm still continually astounded that the tech community
         | appeared to absolutely swallow Google's utter bullshit answers
         | about why they shut down Reader; that really marked a turning
         | point.
        
       | sodality2 wrote:
       | Tried out the feed finder on my blog again and I have another bug
       | to report - it seems the URLs on the page can cause a crash
       | within the web app! My blog (at matthew.science) uses the Zola SSG,
       | and it seems the URLs are formatted with a preceding //: '<a
       | href="//matthew.science/posts/riscv/">Basics of the RISC-V
       | ISA</a>'
       | 
       | This causes the following error: TypeError: URL constructor:
       | //matthew.science/posts/riscv/ is not a valid URL.
        
         | jdougan wrote:
         | Theoretically, I think that should work. It is (or at least used
         | to be) specified that //site/some/path should assume the URI
         | protocol of the current context. So if it was a link on an http
         | page, it should assume http, same with ftp and https etc. It
         | should work sorta like how a leading slash assumes the current
         | site context.
         | 
         | This was back before the Web became the one true way and is the
         | reason it uses 2 slashes, to distinguish protocol local from
         | site local.
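
The behaviour described above can be illustrated in Python, where `urllib.parse.urljoin` resolves a `//host/path` reference against the page it appeared on. This is only an analogue of the situation in the web app, not the author's actual fix; the wrapper name is hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Resolve an href found on page_url, including //host/path references."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    # Protocol-relative: inherits the scheme of the page.
    print(resolve_link("https://matthew.science/", "//matthew.science/posts/riscv/"))
    # https://matthew.science/posts/riscv/

    # Site-relative: a single leading slash keeps scheme and host.
    print(resolve_link("https://example.com/a/b", "/feed"))
    # https://example.com/feed
```

The key point is that such an href is only meaningful together with a base URL, so it has to be resolved rather than parsed on its own.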
        
         | domysee wrote:
         | Thank you! Didn't see that case yet but I'm glad you commented,
         | now I can fix it
        
           | domysee wrote:
           | Update: it's fixed
        
       | camel-cdr wrote:
       | This is useful, I set up RSS on my website yesterday.
       | 
       | Turns out the feed finder couldn't find the feeds even though
       | I've linked to them using clickable RSS icons.
       | 
       | I didn't know about the autodiscovery feature so I'll add that
       | now.
        
       | zenlot wrote:
       | Came here through an RSS link from Miniflux, running on an NVIDIA
       | Jetson.
        
       | LorenDB wrote:
       | My modus operandi for finding a non-obvious RSS feed is to check
       | the Wayback Machine's list of saved URLs and search for "RSS",
       | "feed", or "XML". That normally will find the feed as long as it
       | exists.
        
       | saaaaaam wrote:
       | Ghost also publishes at /feed
        
       | csswizardry wrote:
       | I went canvassing for RSS feeds only yesterday! Some good stuff
       | in here:
       | https://bsky.app/profile/csswizardry.com/post/3lckq4qo6zs22
        
       | artembugara wrote:
       | I open-sourced pyGoogleNews and wrote a quick blog about how you
       | can reverse-engineer Google News RSS to turn it into an RSS feed
       | of any website that is supported by Google News
       | 
       | https://news.ycombinator.com/item?id=42343182
       | 
       | https://github.com/kotartemiy/pygooglenews
        
       | kelvinjps10 wrote:
       | I would like to be able to enter multiple websites. I had to build a
       | script based on the "Guessing the feed URL" approach to get the RSS
       | feeds of a bunch of websites that I had bookmarked.
        
       | ks2048 wrote:
       | It would be nice if someone ran this on commoncrawl and published
       | a list of all the RSS feeds. (probably someone has?)
       | 
       | Or I suppose you could just find all "Content-type:
       | application/rss+xml" in CC.
       | 
       | I know in the past, when I was looking for large lists of RSS
       | feeds, I didn't really find what I was looking for.
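
A small sketch of the Content-Type filter suggested above, in Python. It assumes you already have the raw header value from a crawl record; how to read Common Crawl data is not shown. The helper name is hypothetical, and `application/atom+xml` is an addition not mentioned in the comment. Feeds served as `application/xml` or `text/xml` would be missed by this check.

```python
FEED_CONTENT_TYPES = {"application/rss+xml", "application/atom+xml"}


def is_feed_content_type(header_value):
    """True if a Content-Type header value names an RSS or Atom media type."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in FEED_CONTENT_TYPES


if __name__ == "__main__":
    print(is_feed_content_type("application/rss+xml; charset=utf-8"))  # True
    print(is_feed_content_type("text/html"))                           # False
```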
        
         | freetonik wrote:
         | Far from a complete list, but I maintain a curated list of
         | personal blogs with RSS feeds here https://minifeed.net/blogs
        
       | benrapscallion wrote:
       | Does it correctly ignore the "Comments on:" feeds that are
       | sometimes mistakenly chosen over the main feed?
        
         | domysee wrote:
         | It shows all feeds it finds, including comments feeds. It also
         | shows a preview of feed content, so it's easy to see if it's
         | only articles or also comments.
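
One way a reader could push comments feeds below the main feed, sketched in Python. This is not how the feed finder works (it shows everything, as stated above); the "Comments on:" title prefix comes from the question, while the `/comments/` URL check and both function names are assumptions for illustration.

```python
def looks_like_comments_feed(title, url):
    """Heuristic guess at whether a feed carries comments rather than articles."""
    lowered_title = title.strip().lower()
    return lowered_title.startswith("comments on:") or "/comments/" in url.lower()


def rank_feeds(feeds):
    """Order (title, url) pairs so likely article feeds come first.

    sorted() is stable, so feeds keep their original order within each group.
    """
    return sorted(feeds, key=lambda feed: looks_like_comments_feed(*feed))


if __name__ == "__main__":
    found = [
        ("Comments on: Hello World", "https://example.com/hello-world/feed/"),
        ("Example Blog", "https://example.com/feed/"),
    ]
    for title, url in rank_feeds(found):
        print(title, url)
```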
        
       | begriffs wrote:
       | I created a lightweight shell script to check many URL
       | combinations on a site for feeds.
       | 
       | https://github.com/begriffs/findrss
       | 
       | The combinations came from what I observed in the big list of
       | blogs I follow. The script works pretty well for most sites.
        
       | panozzaj wrote:
       | I use a Chrome extension
       | (https://chromewebstore.google.com/detail/get-rss-feed-url/kf...)
       | and it seems to pick out the RSS URLs fairly consistently
        
       | fallinditch wrote:
       | This looks very useful. It would work well with Hoarder (would be
       | cool if they were integrated ;)
       | 
       | Note: Hoarder can automatically hoard RSS feeds as part of its
       | 'bookmark everything' functionality. Hoarder uses AI to tag all
       | the content (URLs, feeds, images, notes) so you can then do full
       | text searches on your personal archive of your bookmarks etc.
       | 
       | https://hoarder.app/
        
         | djhn wrote:
         | Awesome piece of software and right up my alley (although I'm
         | too invested in my homegrown scripts to let them go - but I'll
         | give this a try.)
         | 
         | You might consider adding some of the key info from the first
         | paragraphs in the docs (open source, self hosting) to the front
         | page, above the fold. Github-link might imply it, but I was
         | scanning for "open source" with my eyes and was initially
         | disappointed not to find it and ready to dismiss the product
         | right out of the gate.
        
           | fallinditch wrote:
           | I didn't make this software, but heard about it on another
           | thread. I was saying 'there should be an app like an 'auto-
           | RAG' that scrapes RSS feeds and URLs' and be_erik wrote 'I do
           | exactly this with hoarder. I passively build tagged knowledge
           | bases with the archived pages and then feed it to a RAG
           | setup.'
           | 
           | What does your setup look like?
        
       | PeterStuer wrote:
       | You can add '/display-feed.rss' to the list of common suffixes
       | for many .eu sites
        
         | domysee wrote:
         | Thank you - will do!
        
       | aucisson_masque wrote:
       | I thank WordPress for most of my RSS feeds.
       | 
       | I mostly follow RSS on non-technology websites, for instance road
       | cycling. These are people who wouldn't know or care about RSS
       | because they are not very techy, yet because they are normies who
       | use WordPress for their websites, it publishes an RSS feed
       | automatically. You have to find it with the developer tools by
       | searching for "RSS", but 99% of the time, if it's WordPress, it has RSS.
       | 
       | Thank you WordPress you bloated piece of shit :)
        
       | openrisk wrote:
       | It's interesting to contemplate an RSS-first browser that would
       | have this functionality built-in. Think for example of promoting
       | to full browser status a desktop RSS reader like Akregator [1]
       | (which already embeds a webview).
       | 
       | The browser as we now know it is mostly a static application that
       | has long lost its user-centric mission. Websites might push some
       | stuff but the user must do things manually. Its primary function
       | is to provide a search window to external search. People even
       | stopped using bookmarks and search for everything.
       | 
       | This hypothetical RSS-Browser could become the main
       | organizational tool for the user's web experience, integrating the
       | use of bookmarks.
       | 
       | In fact even more "feeds" could be integrated like email and
       | activitypub or atproto posts. It boils down to the fact that each
       | person has a number of profiles/roles and within each they have a
       | taxonomy of interests and we need a tool that integrates static
       | and dynamic sources of information.
       | 
       | [1] https://apps.kde.org/akregator/
        
       | renegat0x0 wrote:
       | I fought this problem, since I wrote my own RSS reader in Python.
       | Might not be perfect.
       | 
       | The problem with the approach presented here is speed. Most web
       | pages, especially smaller ones, are really slow.
       | 
       | Crawling most web pages is a pain, especially if you use
       | Selenium on a small SBC.
       | 
       | Therefore either the page presents a clean nice RSS link, or get
       | lost.
       | 
       | Most of the good, modern pages give you nice RSS. Even GitHub
       | gives you RSS for commits.
       | 
       | For other pages I try openRSS.
       | 
       | For YouTube I use yt-dlp to obtain the channel ID and establish the RSS feed.
       | 
       | The algorithm is crude, but gets the job done.
       | 
       | https://github.com/rumca-js/Django-link-archive/blob/main/rs...
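
For the two services mentioned above, the feed URLs follow well-known patterns, sketched here in Python. This is not taken from the linked project; the function names and the example channel ID are hypothetical, and obtaining the channel ID (for example with yt-dlp, as the comment describes) is not shown.

```python
from urllib.parse import quote, urlencode


def youtube_channel_feed(channel_id):
    """Build the Atom feed URL for a YouTube channel from its channel ID."""
    return "https://www.youtube.com/feeds/videos.xml?" + urlencode(
        {"channel_id": channel_id}
    )


def github_commits_feed(owner, repo, branch="main"):
    """Build the Atom feed URL for commits on one branch of a GitHub repo."""
    return f"https://github.com/{quote(owner)}/{quote(repo)}/commits/{quote(branch)}.atom"


if __name__ == "__main__":
    # "UC_EXAMPLE_ID" is a placeholder, not a real channel.
    print(youtube_channel_feed("UC_EXAMPLE_ID"))
    print(github_commits_feed("rumca-js", "Django-link-archive"))
```

The default branch name is an assumption; repositories that use a different default branch need it passed explicitly.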
        
       ___________________________________________________________________
       (page generated 2024-12-07 23:02 UTC)