[HN Gopher] Deep dive into finding RSS feeds
___________________________________________________________________
Deep dive into finding RSS feeds
Author : domysee
Score : 126 points
Date : 2024-12-06 19:17 UTC (1 days ago)
(HTM) web link (lighthouseapp.io)
(TXT) w3m dump (lighthouseapp.io)
| superkuh wrote:
| I generally try: /rss, /feed, /index.xml, /rss.xml, /feed.xml,
| etc. And at various root or /directory/* locations.
| https://blog.jim-nielsen.com/2021/feed-urls/ is a good article
| with statistics on naming.
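|
| A minimal Python sketch of this guessing approach, assuming only the
| five suffixes listed above and a caller-supplied list of directories.
| The names candidate_feed_urls and COMMON_FEED_PATHS are made up for
| illustration, and every candidate would still have to be fetched and
| checked:
|
```python
from urllib.parse import urljoin

# Suffixes named in the comment above; illustrative, not exhaustive.
COMMON_FEED_PATHS = ["rss", "feed", "index.xml", "rss.xml", "feed.xml"]


def candidate_feed_urls(base_url, directories=("",)):
    """Return feed URL guesses for the site root and any extra directories."""
    candidates = []
    for directory in directories:
        # Normalise "blog" or "/blog/" to "blog/"; "" means the site root.
        prefix = directory.strip("/")
        prefix = prefix + "/" if prefix else ""
        for path in COMMON_FEED_PATHS:
            # A leading slash makes urljoin resolve against the site root.
            candidates.append(urljoin(base_url, "/" + prefix + path))
    return candidates
```
|
| For example, candidate_feed_urls("https://example.com",
| directories=("", "blog")) yields ten guesses, from
| https://example.com/rss to https://example.com/blog/feed.xml.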
|
| I've been adding to my feeds.opml since reddit started dying in
| ~2015 and now I'm up to around ~1700 feeds and mostly independent
| from aggregators; though I still collect new feeds from
| HN/IRC/etc. Mostly I just always make a point to look for them
| whenever I read something cool on the web.
| yazmeya wrote:
| You should also check if the web page actually exposes this
| information in a <link rel="alternate"> tag. If you're running
| Chrome, the "RSS Subscription Extension (by Google)" extension
| [1] will do this for you automatically and light up an orange
| icon in the extensions bar. It also integrates with popular RSS
| aggregators so you can subscribe directly from the extension.
|
| [1] https://chromewebstore.google.com/detail/rss-subscription-
| ex...
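|
|   For anyone scripting this instead of using an extension, here is a
|   rough Python sketch of the same autodiscovery check. It assumes
|   the page advertises its feed with type application/rss+xml or
|   application/atom+xml; FeedLinkParser and discover_feeds are
|   hypothetical names, not part of any existing tool:
|
```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Media types assumed to mark a feed in a <link rel="alternate"> tag.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").strip().lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attr.get("href"):
            # hrefs may be relative, so resolve them against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    """Return absolute feed URLs advertised in the given HTML document."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```
|
|   Fetching the page is left out on purpose; pass in whatever HTML
|   your HTTP client returned along with the URL it came from.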
| ulrischa wrote:
|     Would be nice if it were implemented in FreshRSS.
| quaintdev wrote:
|       I think Miniflux can do this. I just give it a website address
|       and it almost always finds the RSS feeds.
| 1123581321 wrote:
| It'd be neat for readers to seamlessly integrate with a scraper,
| either self-hosted or commercial, if no feed is found. I believe
| Inoreader allows scraping a few sites depending on the plan
| level; most reader services don't.
| HumblyTossed wrote:
| Back when I was young, websites had this icon you could click
| that would take you straight to their RSS feed. You young whipper
| snappers have gone and fucked that up. Actually, I think it was
| Google's fault. When they killed their RSS reader people
| pronounced RSS dead so people just stopped publishing RSS feeds
| or just didn't link to them.
|
| * Yes, I know the article talks about the RSS icon, I'm just
| soapboxing.
| AndrewStephens wrote:
| Even better, for a few months the browsers themselves would
|   highlight RSS feeds and allow you to subscribe right in the
| browser. It was too good to last.
|
| RSS is great but it has one great flaw in that it doesn't scale
| that well by itself. If 2 million people subscribe to your feed
| and try to update it once an hour, that is 48 million requests
| a day just for RSS.
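|
|   Spelling out that arithmetic in Python, with the subscriber count
|   and hourly polling taken from the figures above; the per-second
|   number is only an average and assumes polls are spread evenly
|   over the day:
|
```python
SUBSCRIBERS = 2_000_000          # figure from the comment above
POLLS_PER_DAY = 24               # one poll per hour
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

requests_per_day = SUBSCRIBERS * POLLS_PER_DAY
average_requests_per_second = requests_per_day / SECONDS_PER_DAY

print(requests_per_day)                    # 48000000
print(round(average_requests_per_second))  # 556
```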
|
| What does work well (and how things have evolved) is to have a
| service that polls RSS on behalf of its users. This was the
| beauty of Google Reader but plenty of replacements exist.
| rodary wrote:
| > Even better, for a few months the browsers themselves would
|     highlight RSS feeds and allow you to subscribe right in the
| browser. It was too good to last.
|
| Vivaldi browser still does that.
| qw wrote:
| > RSS is great but it has one great flaw in that it doesn't
| scale that well by itself
|
|     The readers often cache the result for all users of the same
|     feed. The only exception is readers that run 100% locally. In
|     that case, using ETag or Last-Modified will make each request
|     cheap and manageable.
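|
|     A toy Python sketch of that conditional-request exchange, with no
|     network involved. Both function names are made up, and the
|     server side is deliberately simplified to an exact ETag match
|     (no weak validators, no If-Modified-Since handling):
|
```python
def conditional_headers(cached_etag=None, cached_last_modified=None):
    """Build the validators a reader sends when it re-polls a feed."""
    headers = {}
    if cached_etag:
        headers["If-None-Match"] = cached_etag
    if cached_last_modified:
        headers["If-Modified-Since"] = cached_last_modified
    return headers


def respond(request_headers, current_etag, body):
    """Toy server: return (status, body), skipping the body on an ETag match."""
    if request_headers.get("If-None-Match") == current_etag:
        return 304, b""  # Not Modified: nothing but headers goes over the wire
    return 200, body


feed = b"<rss>...</rss>"
print(respond(conditional_headers(), '"v1"', feed)[0])                    # 200
print(respond(conditional_headers(cached_etag='"v1"'), '"v1"', feed)[0])  # 304
```
|
|     The request still happens, but an unchanged feed costs a status
|     line and headers rather than the whole document.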
| AndrewStephens wrote:
| That is my point. RSS by itself is a scaling problem but
|       online readers make it manageable by polling sites at a
| reasonable rate and redistributing new articles to their
| users.
| lapcat wrote:
| > RSS is great but it has one great flaw in that it doesn't
| scale that well by itself. If 2 million people subscribe to
| your feed and try to update it once an hour, that is 48
| million requests a day just for RSS.
|
| Who's updating their RSS feeds once an hour 24 hours a day?
|
| Anyway, a very popular site with that many millions of
| visitors already has to handle extreme traffic, regardless of
| RSS.
| ttyprintk wrote:
| Surely this is solved by putting RSS behind Cloudflare.
| lapcat wrote:
| Sure, it's solved by Cloudflare breaking the RSS feed
| entirely, which often happens when people put their
| websites behind Cloudflare.
| compootr wrote:
| I think it should be noted that this happens when not
| configuring things right.
|
| By default, security rules are on, but I can disable
| security rules programmatically for a hostname too.
|
|           FWIW, I once got a DDoS on a host running their Pages
|           product and wasn't charged for it. It also stayed up and
|           didn't serve CAPTCHA pages.
| AndrewStephens wrote:
| According to my logs, some feed readers poll every 10
| minutes or so.
| doubled112 wrote:
| I think Miniflux polls every hour by default. Not sure
| about others.
| xtiansimon wrote:
| "Who's updating their RSS feeds once an hour 24 hours a
| day?"
|
|       RSS is a pull-type system, no? So the end-user is causing
| the overload problem by hitting the publisher's RSS feed
| every hour? The problem is out of the hands of the
| publisher...
| lapcat wrote:
| > RSS is a pull-type system, no?
|
| Yes.
|
| The web in general is also a pull-type system.
|
| > So the end-user is causing the overload problem by
| hitting the publisher's RSS feed every hour?
|
| There is no overload problem.
|
| And again, this math is off, because an end user is not
| even awake 24 hours a day: "If 2 million people subscribe
| to your feed and try to update it once an hour, that is
| 48 million requests a day just for RSS."
| doubled112 wrote:
| My home server is awake 24 hours a day.
|
| Is that wasteful? Does that make me a poor Internet
| citizen?
|
| I'd never considered using RSS feeds from an app and
| pulling it directly. I have too many devices for that to
| be a great workflow.
|
| Normally I'd say I was an outlier, but my gut feeling on
| this is that the people still using RSS and people self-
| hosting would overlap in a big way.
| lapcat wrote:
| > Is that wasteful? Does that make me a poor Internet
| citizen?
|
| I personally don't think it's a problem. As I said, "a
| very popular site with that many millions of visitors
| already has to handle extreme traffic, regardless of
| RSS."
|
| > I'd never considered using RSS feeds from an app and
| pulling it directly. I have too many devices for that to
| be a great workflow.
|
| A lot of apps have sync to handle that workflow.
|
| > my gut feeling on this is that the people still using
| RSS and people self-hosting would overlap in a big way.
|
| My gut feeling is that most RSS users, including myself,
| simply use client apps.
| cknight wrote:
| > Even better, for a few months the browsers themselves would
|     highlight RSS feeds and allow you to subscribe right in the
| browser. It was too good to last.
|
| I use the Awesome RSS add-on for that in Firefox:
| https://addons.mozilla.org/en-US/firefox/addon/awesome-rss/
| sitkack wrote:
| Firefox removing RSS support was their capitulation in
| supporting the open web.
| monooso wrote:
| That is very ambiguous phrasing.
| sitkack wrote:
| Firefox removing RSS support was the nail in the coffin
| for their support of the Open Web.
|
| How long after Google Reader was canceled did FF remove
| RSS?
| ndriscoll wrote:
|     A $160 N100 mini PC should have no problem serving an
| essentially static file out of nginx at well above that rate.
| If you use proper caching headers and don't publish new items
| that often, it should basically be idle.
| DanAtC wrote:
| A lot of clients don't respect caching headers[0], but
| yeah, static content hosting is a solved problem.
|
| And there are always CDNs if you really have a large,
| global audience.
|
| [0] http://rachelbythebay.com/w/2024/10/25/fs/
| jrm4 wrote:
| But that's just a Cloudflare problem, no?
| maxglute wrote:
|   Feedly has a relatively adequate RSS builder based on site
|   elements for sites without RSS, but you only get a few feeds
|   with the Pro plan. I wish there were an open implementation. It
|   also seems like something an LLM would excel at.
| jrm4 wrote:
| I was about to say, that was _all_ Google.
|
| I'm still continually astounded that the tech community
| appeared to absolutely swallow Google's utter bullshit answers
| about why they shut down Reader; that really marked a turning
| point.
| sodality2 wrote:
| Tried out the feed finder on my blog again and I have another bug
| to report - it seems the URLs on the page can cause a crash
| within the web app! My blog (at matthew.science) uses Zola SSG,
| and it seems the URLs are formatted with a preceding //: '<a
| href="//matthew.science/posts/riscv/">Basics of the RISC-V
| ISA</a>'
|
| This causes the following error: TypeError: URL constructor:
| //matthew.science/posts/riscv/ is not a valid URL.
| jdougan wrote:
|   Theoretically, I think that should work. It is (or at least used
|   to be) specified that //site/some/path should assume the URI
| protocol of the current context. So if it was a link on an http
| page, it should assume http, same with ftp and https etc. It
| should work sorta like how a leading slash assumes the current
| site context.
|
| This was back before the Web became the one true way and is the
| reason it uses 2 slashes, to distinguish protocol local from
| site local.
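|
|   To illustrate the rule with Python's standard library (this is not
|   the feed finder's actual code, and resolve_link is a made-up
|   helper): a scheme-less //host/path reference only becomes a full
|   URL once it is resolved against the page it appeared on.
|
```python
from urllib.parse import urljoin, urlparse


def resolve_link(page_url, href):
    """Resolve an href against the URL of the page that contained it."""
    return urljoin(page_url, href)


href = "//matthew.science/posts/riscv/"

# On its own the reference has a host but no scheme.
print(urlparse(href).scheme == "")   # True
print(urlparse(href).netloc)         # matthew.science

# Resolved against the page URL, it inherits that page's scheme.
print(resolve_link("https://matthew.science/", href))
# https://matthew.science/posts/riscv/
print(resolve_link("http://matthew.science/", href))
# http://matthew.science/posts/riscv/

# A single leading slash keeps both the scheme and the host.
print(resolve_link("https://matthew.science/blog/", "/posts/riscv/"))
# https://matthew.science/posts/riscv/
```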
| domysee wrote:
| Thank you! Didn't see that case yet but I'm glad you commented,
| now I can fix it
| domysee wrote:
| Update: it's fixed
| camel-cdr wrote:
| This is useful, I set up RSS on my website yesterday.
|
| Turns out the feed finder couldn't find the feeds even though
| I've linked to them using clickable RSS icons.
|
| I didn't know about the autodiscovery feature so I'll add that
| now.
| zenlot wrote:
| Came here through an RSS link in Miniflux, running on an NVIDIA
| Jetson.
| LorenDB wrote:
| My modus operandi for finding a non-obvious RSS feed is to check
| the Wayback Machine's list of saved URLs and search for "RSS",
| "feed", or "XML". That normally will find the feed as long as it
| exists.
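|
| The keyword search itself is easy to script. A small Python sketch,
| assuming you already have the list of saved URLs as plain strings
| (getting that list out of the Wayback Machine is not shown);
| likely_feed_urls is a hypothetical helper name:
|
```python
from urllib.parse import urlparse

# Keywords from the comment above, matched case-insensitively.
KEYWORDS = ("rss", "feed", "xml")


def likely_feed_urls(urls):
    """Keep URLs whose path or query string mentions one of the keywords."""
    matches = []
    for url in urls:
        parts = urlparse(url)
        # Ignore the hostname so that e.g. feedback.example.com is not a hit.
        haystack = (parts.path + "?" + parts.query).lower()
        if any(keyword in haystack for keyword in KEYWORDS):
            matches.append(url)
    return matches
```
|
| Substring matching is crude, so expect false positives such as a
| page called /feedback; the hits are candidates to check by hand.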
| saaaaaam wrote:
| Ghost also publishes at /feed
| csswizardry wrote:
| I went canvassing for RSS feeds only yesterday! Some good stuff
| in here:
| https://bsky.app/profile/csswizardry.com/post/3lckq4qo6zs22
| artembugara wrote:
| I open-sourced pyGoogleNews and wrote a quick blog about how you
| can reverse-engineer Google News RSS to turn it into an RSS feed
| of any website that is supported by Google News
|
| https://news.ycombinator.com/item?id=42343182
|
| https://github.com/kotartemiy/pygooglenews
| kelvinjps10 wrote:
| I would like to be able to enter multiple websites. I had to build a
| script based on the "Guessing the feed URL" approach to get the RSS
| feeds of a bunch of websites that I had bookmarked.
| ks2048 wrote:
| It would be nice if someone ran this on Common Crawl and published
| a list of all the RSS feeds. (probably someone has?)
|
| Or I suppose you could just find all "Content-type:
| application/rss+xml" in CC.
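|
| A sketch of what that header test could look like in Python, assuming
| you have the raw Content-Type value for each crawled response. The
| function name is made up, and Atom's media type is included next to
| the RSS one mentioned above:
|
```python
FEED_MEDIA_TYPES = {"application/rss+xml", "application/atom+xml"}


def is_feed_content_type(header_value):
    """Return True if a Content-Type header value names a feed media type."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in FEED_MEDIA_TYPES
```
|
| Feeds are often served as application/xml or text/xml instead, so a
| header-only filter like this would miss those.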
|
| I know in the past, when I was looking for large lists of RSS
| feeds, I didn't really find what I was looking for.
| freetonik wrote:
| Far from a complete list, but I maintain a curated list of
| personal blogs with RSS feeds here https://minifeed.net/blogs
| benrapscallion wrote:
| Does it correctly ignore the "Comments on:" feeds that are
| sometimes mistakenly chosen over the main feed?
| domysee wrote:
| It shows all feeds it finds, including comments feeds. It also
| shows a preview of feed content, so it's easy to see if it's
| only articles or also comments.
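|
|   If someone wants to separate them in their own script, a title
|   check is one rough option. This Python sketch assumes each feed is
|   a (title, url) pair and relies only on the "Comments on:" prefix
|   mentioned above; it is not how the feed finder works, and
|   split_comment_feeds is a made-up name:
|
```python
def split_comment_feeds(feeds):
    """Partition (title, url) pairs into main feeds and likely comments feeds."""
    main, comments = [], []
    for title, url in feeds:
        is_comments = title.strip().lower().startswith("comments on:")
        (comments if is_comments else main).append((title, url))
    return main, comments
```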
| begriffs wrote:
| I created a lightweight shell script to check many URL
| combinations on a site for feeds.
|
| https://github.com/begriffs/findrss
|
| The combinations came from what I observed in the big list of
| blogs I follow. The script works pretty well for most sites.
| panozzaj wrote:
| I use a Chrome extension
| (https://chromewebstore.google.com/detail/get-rss-feed-url/kf...)
| and it seems to pick out the RSS URLs fairly consistently
| fallinditch wrote:
| This looks very useful. It would work well with Hoarder (would be
| cool if they were integrated ;)
|
| Note: Hoarder can automatically hoard RSS feeds as part of its
| 'bookmark everything' functionality. Hoarder uses AI to tag all
| the content (URLs, feeds, images, notes) so you can then do full
| text searches on your personal archive of your bookmarks etc.
|
| https://hoarder.app/
| djhn wrote:
| Awesome piece of software and right up my alley (although I'm
| too invested in my homegrown scripts to let them go - but I'll
| give this a try.)
|
| You might consider adding some of the key info from the first
| paragraphs in the docs (open source, self hosting) to the front
|   page, above the fold. The GitHub link might imply it, but I was
| scanning for "open source" with my eyes and was initially
| disappointed not to find it and ready to dismiss the product
| right out of the gate.
| fallinditch wrote:
| I didn't make this software, but heard about it on another
| thread. I was saying 'there should be an app like an 'auto-
| RAG' that scrapes RSS feeds and URLs' and be_erik wrote 'I do
| exactly this with hoarder. I passively build tagged knowledge
| bases with the archived pages and then feed it to a RAG
| setup.'
|
| What does your setup look like?
| PeterStuer wrote:
| You can add '/display-feed.rss' to the list of common suffixes
| for many .eu sites
| domysee wrote:
| Thank you - will do!
| aucisson_masque wrote:
| I thank WordPress for most of my RSS feeds.
|
| I mostly follow RSS feeds on non-technology websites, for instance
| road cycling. Their owners wouldn't care or know about RSS because
| they are not very techy, yet because they are normies who use
| WordPress for their websites, it publishes an RSS feed
| automatically. You have to find it with the developer tools by
| searching for "RSS", but 99% of the time, if it's WordPress, it has
| RSS.
|
| Thank you WordPress you bloated piece of shit :)
| openrisk wrote:
| It's interesting to contemplate an RSS-first browser that would
| have this functionality built-in. Think for example of promoting
| to full browser status a desktop RSS reader like Akregator [1]
| (which already embeds a webview).
|
| The browser as we now know it is mostly a static application that
| has long lost its user-centric mission. Websites might push some
| stuff but the user must do things manually. Its primary function
| is to provide a search window to external search. People even
| stopped using bookmarks and search for everything.
|
| This hypothetical RSS-Browser could become the main
| organizational tool for the user's web experience, integrating the
| use of bookmarks.
|
| In fact even more "feeds" could be integrated like email and
| activitypub or atproto posts. It boils down to the fact that each
| person has a number of profiles/roles and within each they have a
| taxonomy of interests and we need a tool that integrates static
| and dynamic sources of information.
|
| [1] https://apps.kde.org/akregator/
| renegat0x0 wrote:
| I fought this problem, since I wrote my own RSS reader in Python.
| Might not be perfect.
|
| The problem with the approach presented here is speed. Most web
| pages, especially smaller ones, are really slow.
|
| Crawling most web pages is a pain, especially if you use
| Selenium and a small SBC.
|
| Therefore either the page presents a clean nice RSS link, or get
| lost.
|
| Most of the good, modern pages give you nice RSS. Even GitHub
| gives you RSS for commits.
|
| For other pages I try openRSS.
|
| For YouTube I use yt-dlp to obtain channel id, to establish RSS.
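|
| The last step, going from a channel ID to a feed, is just URL
| construction. A Python sketch, assuming the channel ID has already
| been obtained (the yt-dlp lookup is not shown); youtube_feed_url is
| a made-up name and the ID below is a placeholder, not a real channel:
|
```python
from urllib.parse import urlencode

YOUTUBE_FEED_ENDPOINT = "https://www.youtube.com/feeds/videos.xml"


def youtube_feed_url(channel_id):
    """Build the feed URL for a YouTube channel ID."""
    return YOUTUBE_FEED_ENDPOINT + "?" + urlencode({"channel_id": channel_id})


# Placeholder ID for illustration only.
print(youtube_feed_url("UCxxxxxxxxxxxxxxxxxxxxxx"))
# https://www.youtube.com/feeds/videos.xml?channel_id=UCxxxxxxxxxxxxxxxxxxxxxx
```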
|
| Algorithm is crude, but gets the job done.
|
| https://github.com/rumca-js/Django-link-archive/blob/main/rs...
___________________________________________________________________
(page generated 2024-12-07 23:02 UTC)