[HN Gopher] Show HN: Full Text, Full Archive RSS Feeds for Any Blog
       ___________________________________________________________________
        
       Show HN: Full Text, Full Archive RSS Feeds for Any Blog
        
       Author : panoramas4good
       Score  : 97 points
       Date   : 2024-09-02 13:06 UTC (9 hours ago)
        
 (HTM) web link (www.dogesec.com)
 (TXT) w3m dump (www.dogesec.com)
        
       | yawnxyz wrote:
       | It's so clever to just pull from Wayback Machine rather than
       | scrape the site itself. Never even thought of that
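[Editor's note: a minimal sketch of the Wayback Machine approach discussed above, using the public CDX search API. The exact implementation used by the submitted tool is not shown here; the feed URL below is a hypothetical example.]

```python
# Query the Wayback Machine's CDX API for archived snapshots of a feed
# URL, then walk the snapshots to recover old posts the live feed no
# longer carries.
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(url: str, limit: int = 50) -> str:
    """Build a CDX query that lists archived snapshots of `url` as JSON."""
    params = {
        "url": url,
        "output": "json",            # response: header row, then data rows
        "fl": "timestamp,original",  # fields to return per snapshot
        "filter": "statuscode:200",  # successful captures only
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_cdx_rows(rows):
    """Turn CDX JSON rows (header row first) into a list of dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]
```

Each snapshot can then be fetched at `https://web.archive.org/web/<timestamp>/<original>` to extract the full post content.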
        
         | cxr wrote:
         | Before building an app that depends on the Wayback Machine (or
         | other Archive infrastructure) it's good to keep in mind this
         | post from their blog: <https://blog.archive.org/2023/05/29/let-
         | us-serve-you-but-don...>
         | 
         | One of my favorite tricks when coming across a blog with a
         | longtail of past posts is to verify that it's hosted on
         | WordPress and then to ingest the archives into my feedreader.
         | 
         | Once you have the WordPress feed URL, you can slurp it all in
         | by appending `?paged=n` (or `&paged=n`) for the nth page of the
         | feed. (This is a little tedious in Thunderbird; up till now
         | I've generated a list of URLs and dragged and dropped each one
         | into the subscribe-to-feed dialog. The whole process is
         | amenable to scripting by bookmarklet, though--gesture at a blog
         | with the appropriate metadata, and then get a file that's one
         | big RSS/Atom container with every blog post.)
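[Editor's note: the paged-feed trick above can be sketched as follows; fetching and merging the pages is left to the feed reader or a script.]

```python
# Generate the per-page URLs for a WordPress feed by appending
# ?paged=n (or &paged=n when the URL already has a query string).
def paged_feed_urls(feed_url: str, pages: int):
    """Return URLs for pages 1..pages of a WordPress feed."""
    sep = "&" if "?" in feed_url else "?"
    return [f"{feed_url}{sep}paged={n}" for n in range(1, pages + 1)]
```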
        
         | simonw wrote:
         | I used it to recover some lost content from my blog a few years
         | ago, it was fantastic:
         | https://simonwillison.net/2017/Oct/8/missing-content/
        
       | latexr wrote:
       | > RSS and ATOM feeds are problematic for two reasons; 1) lack of
       | history, 2) contain limited post content.
       | 
        | None of those are problems with RSS or Atom[1] feeds. There's
        | no technical limitation preventing a feed from carrying the
        | full history and full post content. Many feeds omit them as a
        | choice by the author or as the default behaviour of the
        | blogging platform, and both choices have their reasons: saving
        | bandwidth[2] and driving traffic to the site[3].
        | 
        | Which is not to say what you just made doesn't have value. It
        | does, and kudos for making it. But twice at the top of your
        | post you make it sound as if those are problems inherent to
        | the format when they're not. They're not even problems for
        | most people in most situations; you just bumped into a very
        | specific use case.
        | 
        | [1] It's not an acronym, so it shouldn't be all uppercase.
        | 
        | [2] Many feed readers misbehave and download the whole thing
        | instead of checking ETags.
        | 
        | [3] To show ads or something else.
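[Editor's note: the well-behaved client the ETag footnote describes can be sketched as below; the URL and stored ETag are hypothetical.]

```python
# Conditional GET: send the ETag from the previous fetch as
# If-None-Match, so an unchanged feed comes back as 304 with no body.
import urllib.request
import urllib.error

def build_conditional_request(url: str, etag):
    """Build a request that lets the server skip the body if unchanged."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    return req

def fetch_if_changed(url: str, etag):
    """Return (body, new_etag), or (None, etag) if the feed is unchanged."""
    try:
        with urllib.request.urlopen(build_conditional_request(url, etag)) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as e:
        if e.code == 304:  # Not Modified: keep the cached copy
            return None, etag
        raise
```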
        
         | rainworld wrote:
         | Also, there's an existing, moderately well supported format for
         | JSON feeds: https://www.jsonfeed.org
        
         | tandav wrote:
          | Also, Atom feeds support pagination:
          | https://www.rfc-editor.org/rfc/rfc5005#section-3
        
         | msephton wrote:
         | I have the full history in my blog feed.
        
       | steamodon wrote:
       | I wrote a similar tool [1], although it's designed to let you
       | gradually catch up on a backlog rather than write a full feed all
       | at once. Right now it only works on Blogger and WordPress blogs,
       | so I'll need to learn from their trick of pulling from Internet
       | Archive.
       | 
       | [1] https://github.com/steadmon/blog-replay
        
         | jayemar wrote:
         | I had a similar idea to replay blogs. It'll pull from WordPress
         | or Internet Archive and give you a replay link to add to your
         | feed reader.
         | 
         | https://refeed.to
        
       | z3t4 wrote:
        | The mystical creature - the URL - points to a resource that
        | doesn't have to be static; only the URL is stable, and the
        | content behind it might change. So you might want the program
        | to revisit the resource once in a while to check for updates.
        
       | breck wrote:
       | The future of RSS is "git clone".
       | 
       | RSS was invented in 1999, 6 years before git!
       | 
       | Now we have git and should just be "git cloning" blogs you like,
       | rather than subscribing to RSS feeds.
       | 
       | I still have RSS feeds on all my blogs for back-compat, but git
       | clone is way better.
        
         | xiande04 wrote:
         | And if the blog's repo is private or, gasp, it's not versioned
         | with git?
        
           | breck wrote:
           | Then it's not worth reading.
        
         | 8organicbits wrote:
         | What problems does that solve? Reading blogs over git clone
         | sounds like re-inventing the wheel. Are there even any tools
         | that do that?
         | 
         | If anything were to replace RSS (and Atom) I'd personally hope
         | for h-feed [1] since it's DRYer. But realistically it's going
         | to be hard to eclipse RSS, there's far too much adoption and it
         | is mostly sufficient.
         | 
         | [1] https://indieweb.org/h-feed
        
           | kevindamm wrote:
            | I'm not the GP commenter, but I'm supposing there would be
            | some way of announcing the git repo where you can find the
            | source -- similar to the `<link...>` tag used for RSS, you
            | could have
            | 
            |     <link rel="alternate" type="application/x-git"
            |           title="my blog as a git repo" href="..." />
            | 
            | ...and tooling could take care of all the things you like
            | in an RSS reader. I could see this working really well for
            | static site generators like VitePress or Jekyll or what
            | have you. Going beyond what's in the source is kind of
            | project-specific, though -- maybe I'm interested in just a
            | summary of commits/PRs.
            | 
            | Anyway, there isn't an official IANA-registered type for a
            | git repo (application/x-git is my closest guess until one
            | becomes official), but my point is that this isn't too far
            | beyond what auto-discovery of RSS already does.
           | 
           | I think the GP's comment is from the point of view of making
           | it easy to retrieve the contents of the blog archive, easier
           | than the hoops mentioned (bulk archive retrieval and
           | generating WordPress page sequences, etc.) as well as solving
           | the problem in TFA (partial feeds, partial blog contents in
           | the feed).
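[Editor's note: the auto-discovery idea above can be sketched the same way feed readers discover RSS/Atom links. `application/x-git` is the hypothetical media type proposed in this thread, not a registered one.]

```python
# Scan a page's HTML for a <link rel="alternate" type="application/x-git">
# tag and return its href, mirroring RSS/Atom feed auto-discovery.
from html.parser import HTMLParser

class GitLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.repo_url = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and a.get("type") == "application/x-git"):
            self.repo_url = a.get("href")

def find_git_repo(html: str):
    """Return the advertised git repo URL, or None if the page has none."""
    finder = GitLinkFinder()
    finder.feed(html)
    return finder.repo_url
```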
        
             | breck wrote:
             | > <link rel="alternate" type="application/x-git" title="my
             | blog as a git repo" href="..." />
             | 
             | This is a _great_ idea. Let's make this happen.
             | 
             | Edit: okay this is live now in Scroll and across PLDB, my
             | blog, and other sites. Would love if someone could post
             | this link to HackerNews:
             | https://scroll.pub/blog/gitOverRss.html
        
               | kevindamm wrote:
               | I like it, I'm adding this <link> to my sites now, too
        
               | breck wrote:
               | Awesome! Any chance you could add some info about who you
               | are to your HN profile? Would love to read your stuff.
               | Clearly a mind full of good ideas!
        
           | breck wrote:
           | > What problems does that solve?
           | 
           | A million?
           | 
           | Having your own local copy of your favorite authors'
           | collections is the absolute way to go. So much faster,
           | searchable, transformable, resistant to censorship, et
           | cetera.
        
           | mananaysiempre wrote:
           | > What problems does that solve? Reading blogs over git clone
           | sounds like re-inventing the wheel.
           | 
           | Can't say anything about blogs, but the kernel folks actively
           | use mailing list archives over Git[1,2] (also over NNTP and
           | of course mail is also delivered as mail).
           | 
           | [1] https://public-inbox.org/README.html
           | 
           | [2] https://lore.kernel.org/
        
         | Tomte wrote:
         | You clone what? A WordPress database?
        
           | breck wrote:
           | > You clone what? A WordPress database?
           | 
           | You clone static site generated websites.
           | 
            | Scroll is designed for this, but there's no reason other
            | SSGs can't copy our patterns.
           | 
           | Here's a free command line working client you can try [beta]:
           | https://wws.scroll.pub/readme.html
           | 
           | Instead of favoriting feeds, you favorite repos. Then you
           | type "wws fetch" to update all your local repos.
           | 
           | It fetches the branch that contains the built artifacts along
           | with the source, so you have ready to read HTML and clean
           | source code for any transformations or analysis you want to
           | do.
           | 
           | ---
           | 
            | I love WordPress, but the WordPress/PHP/MySQL stack is a
            | drag. At some point I expect they will move the WordPress
            | brand, community, and frontend to be powered by a static
            | site generator.
           | 
           | To be quite honest, I suspect they'll probably want to use
           | Scroll as their new backend.
        
         | mfashby wrote:
         | It's not what you're aiming for with this comment, but I bet
         | git would actually make a pretty good storage tool/format for
         | archival of mostly static sites.
         | 
          | Horribly simple hack: use `wget` with the `--mirror` option
          | and commit the result to a git repository. Repeat with a
          | `cron` job to keep an archive with change history.
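[Editor's note: a configuration sketch of the hack above, assuming `wget` and `git` are installed; `example.com` and the script path are placeholders.]

```shell
# snapshot.sh -- mirror the site and record the snapshot in git.
# One-time setup:  git init blog-archive
cd blog-archive
wget --mirror --no-parent --directory-prefix=. https://example.com/
git add -A
# "|| true" makes this a no-op when nothing on the site changed.
git commit -m "snapshot $(date -u +%Y-%m-%dT%H:%MZ)" || true

# crontab entry to repeat daily at 04:00:
# 0 4 * * * /path/to/snapshot.sh
```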
        
           | breck wrote:
           | I assume this is what wayback machine uses?
        
       | msephton wrote:
       | This reminds me of something I wrote in early 2000. At that time
       | RSS was less than a year old and if I'm honest I wasn't aware of
        | it at all. I wrote a short PHP script to get the HTML of each
        | site in a list, diff it against the most recent snapshot, and
        | generate a web page with a table containing all the changes. I
        | could set per-site thresholds for the change value to cope
        | with small dynamic content like dates, and exclude certain
        | larger sections of content via regexp. I probably still have
        | the code in my backups from the dot-com boom job I had at the
        | time.
        
       | renegat0x0 wrote:
        | Similar goal, different approach. I wrote an RSS reader that
        | captures link metadata from various RSS sources. The metadata
        | are exported every day. I have separate repositories for
        | bookmarks, for daily links, and for 'known domains'.
        | 
        | Written in Django.
        | 
        | I can always go back and parse the saved data. If a web page
        | is not available, I fall back to the Internet Archive.
       | 
       | - https://github.com/rumca-js/Django-link-archive - RSS reader /
       | web scraper
       | 
       | - https://github.com/rumca-js/RSS-Link-Database - bookmarks I
       | found interesting
       | 
       | - https://github.com/rumca-js/RSS-Link-Database-2024 - every day
       | storage
       | 
       | - https://github.com/rumca-js/Internet-Places-Database - internet
       | domains found on the internet
       | 
        | After creating a Python package for web communication that
        | replaces requests for me (and sometimes uses Selenium), I also
        | wrote a CLI interface to read RSS sources from the command
        | line: https://github.com/rumca-js/yafr
        
       | pentagrama wrote:
       | > generally the RSS and ATOM feeds for any blog, are limited in
       | two ways;
       | 
       | > 1. [limited history of posts]
       | 
       | > 2. [partial content]
       | 
        | To fix limitation #1 in some cases, maybe the author can rely
        | on sitemaps [1], a feature present on many sites (like RSS
        | feeds) that lists _all_ the published pages.
       | 
       | [1] https://www.sitemaps.org/
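[Editor's note: the sitemap fallback suggested above can be sketched as follows; a sitemap lists every published URL, so it can fill in the history a truncated feed omits.]

```python
# Extract all <loc> URLs from a sitemap.xml document.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_xml: str):
    """Return every page URL listed in the sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]
```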
        
       | twoprops wrote:
       | Does no one find it ironic that one of the complaints about RSS
       | feeds is they don't give you the full content, forcing you to
       | visit the site, while trying to access the poster's web site
       | through reader view gives you a warning that you have to visit
       | the site directly to get the full content?
        
       | zczc wrote:
       | Looks like a nice tool for extending existing RSS sources. As for
       | the sites that don't have RSS support in the first place, there
       | is also RSSHub [1]. Sadly, you can't use both for the same
       | source: history4feed's trick with the Wayback Machine wouldn't
       | work with the RSSHub feed.
       | 
       | [1] https://rsshub.app/
        
       ___________________________________________________________________
       (page generated 2024-09-02 23:00 UTC)