[HN Gopher] Show HN: Full Text, Full Archive RSS Feeds for Any Blog
___________________________________________________________________
Show HN: Full Text, Full Archive RSS Feeds for Any Blog
Author : panoramas4good
Score : 97 points
Date : 2024-09-02 13:06 UTC (9 hours ago)
(HTM) web link (www.dogesec.com)
(TXT) w3m dump (www.dogesec.com)
| yawnxyz wrote:
| It's so clever to just pull from Wayback Machine rather than
| scrape the site itself. Never even thought of that
| cxr wrote:
| Before building an app that depends on the Wayback Machine (or
| other Archive infrastructure) it's good to keep in mind this
| post from their blog:
| <https://blog.archive.org/2023/05/29/let-us-serve-you-but-don...>
|
| One of my favorite tricks when coming across a blog with a
| longtail of past posts is to verify that it's hosted on
| WordPress and then to ingest the archives into my feedreader.
|
| Once you have the WordPress feed URL, you can slurp it all in
| by appending `?paged=n` (or `&paged=n`) for the nth page of the
| feed. (This is a little tedious in Thunderbird; up till now
| I've generated a list of URLs and dragged and dropped each one
| into the subscribe-to-feed dialog. The whole process is
| amenable to scripting by bookmarklet, though--gesture at a blog
| with the appropriate metadata, and then get a file that's one
| big RSS/Atom container with every blog post.)
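|
| A minimal sketch of generating that list of URLs, assuming the
| feed honours the `paged` query parameter as described above.
| The function name `paged_feed_urls` and the example.com address
| are made up for illustration, and the number of pages a given
| blog has is something you would have to find out by probing.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def paged_feed_urls(feed_url: str, pages: int) -> list:
    """Return feed URLs for pages 1..pages, adding or replacing `paged`."""
    parts = urlparse(feed_url)
    # Keep existing query parameters, dropping any earlier `paged` value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "paged"
    ]
    urls = []
    for n in range(1, pages + 1):
        new_query = urlencode(query + [("paged", str(n))])
        urls.append(urlunparse(parts._replace(query=new_query)))
    return urls


if __name__ == "__main__":
    # Hypothetical blog address; one URL per line, ready to paste.
    for url in paged_feed_urls("https://example.com/feed/", 3):
        print(url)
```

| Going through the query string rather than concatenating text
| picks `?paged=n` or `&paged=n` automatically, depending on
| whether the feed URL already carries parameters.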
| simonw wrote:
| I used it to recover some lost content from my blog a few years
| ago, it was fantastic:
| https://simonwillison.net/2017/Oct/8/missing-content/
| latexr wrote:
| > RSS and ATOM feeds are problematic for two reasons; 1) lack of
| history, 2) contain limited post content.
|
| None of those are problems with RSS or Atom [1] feeds. There's no
| technical limitation to having the full history and full post
| content in the feeds. Many feeds behave that way due to a choice
| by the author or as the default behaviour of the blogging
| platform. Both have reasons to be: saving bandwidth [2] and
| driving traffic to the site [3].
|
| Which is not to say what you just made doesn't have value. It
| does, and kudos for making it. But twice at the top of your post
| you're making it sound as if those are problems inherent to the
| format when they're not. They're not even problems for most
| people in most situations, you just bumped into a very specific
| use-case.
|
| [1] It's not an acronym, it shouldn't be all uppercase.
|
| [2] Many feed readers misbehave and download the whole thing
| instead of checking ETags.
|
| [3] To show ads or something else.
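|
| For the ETag footnote, a rough sketch of what a well-behaved
| reader could do: send back the validators from the previous
| fetch and treat HTTP 304 as "nothing new". The function names
| are hypothetical, and this assumes the server emits `ETag` or
| `Last-Modified` headers in the first place.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None) -> dict:
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if unchanged."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; keep the old validators.
        if err.code == 304:
            return None, etag, last_modified
        raise
```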
| rainworld wrote:
| Also, there's an existing, moderately well supported format for
| JSON feeds: https://www.jsonfeed.org
| tandav wrote:
| Also, Atom feeds support pagination:
| https://www.rfc-editor.org/rfc/rfc5005#section-3
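|
| As an illustration of that section, a paged feed points at the
| following page with a feed-level `<link rel="next">`. A small
| sketch of reading it, where the function name and the sample
| URLs are assumptions and the feed is taken to be well-formed
| Atom:

```python
import xml.etree.ElementTree as ET
from typing import Optional

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def next_page_url(atom_xml: str) -> Optional[str]:
    """Return the href of the feed-level rel="next" link, if present."""
    root = ET.fromstring(atom_xml)
    # Only direct children of <feed>; entry-level links are ignored.
    for link in root.findall(f"{ATOM_NS}link"):
        if link.get("rel") == "next":
            return link.get("href")
    return None


if __name__ == "__main__":
    sample = (
        '<feed xmlns="http://www.w3.org/2005/Atom">'
        '<link rel="self" href="https://example.org/feed?page=2"/>'
        '<link rel="next" href="https://example.org/feed?page=3"/>'
        "</feed>"
    )
    print(next_page_url(sample))
```

| A reader would keep following `next` until the function returns
| None, collecting the entries from each page along the way.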
| msephton wrote:
| I have the full history in my blog feed.
| steamodon wrote:
| I wrote a similar tool [1], although it's designed to let you
| gradually catch up on a backlog rather than write a full feed all
| at once. Right now it only works on Blogger and WordPress blogs,
| so I'll need to learn from their trick of pulling from Internet
| Archive.
|
| [1] https://github.com/steadmon/blog-replay
| jayemar wrote:
| I had a similar idea to replay blogs. It'll pull from WordPress
| or Internet Archive and give you a replay link to add to your
| feed reader.
|
| https://refeed.to
| z3t4 wrote:
| The mystical creature - the URL - is a link to a resource that
| doesn't have to be static, it's only the URL that is static. eg.
| the content might change. So you might want to have the program
| revisit the resource once in a while to see if there are updates.
| breck wrote:
| The future of RSS is "git clone".
|
| RSS was invented in 1999, 6 years before git!
|
| Now we have git and should just be "git cloning" blogs you like,
| rather than subscribing to RSS feeds.
|
| I still have RSS feeds on all my blogs for back-compat, but git
| clone is way better.
| xiande04 wrote:
| And if the blog's repo is private or, gasp, it's not versioned
| with git?
| breck wrote:
| Then it's not worth reading.
| 8organicbits wrote:
| What problems does that solve? Reading blogs over git clone
| sounds like re-inventing the wheel. Are there even any tools
| that do that?
|
| If anything were to replace RSS (and Atom) I'd personally hope
| for h-feed [1] since it's DRYer. But realistically it's going
| to be hard to eclipse RSS, there's far too much adoption and it
| is mostly sufficient.
|
| [1] https://indieweb.org/h-feed
| kevindamm wrote:
| I'm not the GP commenter, but I'm supposing there would be
| some way of announcing the git repo where you can find the
| source -- similar to the `<link...>` tag used for RSS, you
| could have a <link rel="alternate"
| type="application/x-git" title="my blog as a git repo"
| href="..." />
|
| ..and tooling could take care of all the things you like in
| an RSS reader. I could see this working really well for
| static site generators like vitepress or Jekyll or what have
| you, but going beyond what's in the source is kind of
| project-specific, but maybe I'm interested in just a summary
| of commits/PRs
|
| Anyway, there isn't an official IANA-defined type for a git
| repo (the application/x-git is my closest guess until one
| became official) but my point is it isn't too far beyond what
| auto-discovery of RSS is.
|
| I think the GP's comment is from the point of view of making
| it easy to retrieve the contents of the blog archive, easier
| than the hoops mentioned (bulk archive retrieval and
| generating WordPress page sequences, etc.) as well as solving
| the problem in TFA (partial feeds, partial blog contents in
| the feed).
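|
| A sketch of the discovery step this would need on the reader
| side, using only the standard library. The class name is made
| up, and `application/x-git` is the unofficial guess from the
| comment above, not a registered media type.

```python
from html.parser import HTMLParser


class GitLinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate" type="application/x-git">."""

    def __init__(self):
        super().__init__()
        self.repos = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        is_git = attributes.get("type") == "application/x-git"
        if "alternate" in rels and is_git and attributes.get("href"):
            self.repos.append(attributes["href"])


if __name__ == "__main__":
    page = (
        '<head><link rel="alternate" type="application/x-git" '
        'title="my blog as a git repo" '
        'href="https://example.com/blog.git" /></head>'
    )
    finder = GitLinkFinder()
    finder.feed(page)
    print(finder.repos)
```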
| breck wrote:
| > <link rel="alternate" type="application/x-git" title="my
| blog as a git repo" href="..." />
|
| This is a _great_ idea. Let's make this happen.
|
| Edit: okay this is live now in Scroll and across PLDB, my
| blog, and other sites. Would love if someone could post
| this link to HackerNews:
| https://scroll.pub/blog/gitOverRss.html
| kevindamm wrote:
| I like it, I'm adding this <link> to my sites now, too
| breck wrote:
| Awesome! Any chance you could add some info about who you
| are to your HN profile? Would love to read your stuff.
| Clearly a mind full of good ideas!
| breck wrote:
| > What problems does that solve?
|
| A million?
|
| Having your own local copy of your favorite authors'
| collections is the absolute way to go. So much faster,
| searchable, transformable, resistant to censorship, et
| cetera.
| mananaysiempre wrote:
| > What problems does that solve? Reading blogs over git clone
| sounds like re-inventing the wheel.
|
| Can't say anything about blogs, but the kernel folks actively
| use mailing list archives over Git[1,2] (also over NNTP and
| of course mail is also delivered as mail).
|
| [1] https://public-inbox.org/README.html
|
| [2] https://lore.kernel.org/
| Tomte wrote:
| You clone what? A WordPress database?
| breck wrote:
| > You clone what? A WordPress database?
|
| You clone static site generated websites.
|
| Scroll is designed for this, but there's no reason other SSGs
| can't copy our patterns.
|
| Here's a free command line working client you can try [beta]:
| https://wws.scroll.pub/readme.html
|
| Instead of favoriting feeds, you favorite repos. Then you
| type "wws fetch" to update all your local repos.
|
| It fetches the branch that contains the built artifacts along
| with the source, so you have ready to read HTML and clean
| source code for any transformations or analysis you want to
| do.
|
| ---
|
| I love Wordpress, but the Wordpress/PHP/MySQL stack is a drag.
| At some point I expect they will move the Wordpress brand,
| community, and frontend to be powered by a static site
| generator.
|
| To be quite honest, I suspect they'll probably want to use
| Scroll as their new backend.
| mfashby wrote:
| It's not what you're aiming for with this comment, but I bet
| git would actually make a pretty good storage tool/format for
| archival of mostly static sites.
|
| horrible simple hack: use `wget` with `--mirror` option, and
| commit the result to a git repository. Repeat with a `cron` job
| to keep an archive with change history.
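|
| Roughly what that hack could look like as a script to run from
| `cron`. This is a sketch under assumptions: `wget` and `git`
| are on the PATH, the target directory was already set up with
| `git init`, and the function names and commit message format
| are invented here.

```python
import subprocess
from datetime import datetime, timezone


def snapshot_commands(site_url, repo_dir, when=None):
    """Return the wget and git command lines for one snapshot."""
    when = when or datetime.now(timezone.utc)
    message = f"Snapshot {when:%Y-%m-%d %H:%M} UTC"
    return [
        ["wget", "--mirror", "--directory-prefix", repo_dir, site_url],
        ["git", "-C", repo_dir, "add", "--all"],
        ["git", "-C", repo_dir, "commit", "--message", message],
    ]


def run_snapshot(site_url, repo_dir):
    """Mirror the site into repo_dir and commit whatever changed."""
    for command in snapshot_commands(site_url, repo_dir):
        # check=False: wget exits non-zero on a single broken link, and
        # git commit exits non-zero when there is nothing new to commit.
        subprocess.run(command, check=False)


if __name__ == "__main__":
    # Hypothetical site and archive directory.
    run_snapshot("https://example.com/", "/var/archive/example")
```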
| breck wrote:
| I assume this is what wayback machine uses?
| msephton wrote:
| This reminds me of something I wrote in early 2000. At that time
| RSS was less than a year old and if I'm honest I wasn't aware of
| it at all. I wrote a short PHP script to get the HTML of each
| site in a list, do a diff against the most recent snapshot, and
| generate a web page with a table containing all the changes. I
| could set per site thresholds for change value to cope with small
| dynamic content like dates and exclude certain larger sections of
| content via regexp. I probably still have the code in my backups
| from the dot com boom job I had at the time.
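|
| The original was PHP and isn't shown, so this is only a sketch
| of the same idea in Python: strip excluded sections with a
| regexp, diff the two snapshots line by line, and ignore a site
| whose change count is under its threshold. All names and the
| threshold semantics here are assumptions.

```python
import difflib
import re


def changed_lines(old_html, new_html, exclude=None):
    """Return added/removed lines between two snapshots of a page."""
    if exclude:
        # Drop sections that always change (dates, counters, ads).
        old_html = re.sub(exclude, "", old_html, flags=re.DOTALL)
        new_html = re.sub(exclude, "", new_html, flags=re.DOTALL)
    diff = difflib.unified_diff(
        old_html.splitlines(), new_html.splitlines(), lineterm="", n=0
    )
    # The first two lines are the "---" / "+++" file headers.
    body = list(diff)[2:]
    return [line for line in body if line.startswith(("+", "-"))]


def is_significant(old_html, new_html, threshold, exclude=None):
    """True when at least `threshold` lines were added or removed."""
    return len(changed_lines(old_html, new_html, exclude)) >= threshold


if __name__ == "__main__":
    old = "<h1>Blog</h1>\n<p>Date: 1</p>\n<p>Hello</p>"
    new = "<h1>Blog</h1>\n<p>Date: 2</p>\n<p>Hello again</p>"
    for line in changed_lines(old, new, exclude=r"<p>Date: \d+</p>"):
        print(line)
```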
| renegat0x0 wrote:
| Similar goal, different approach. I wrote an RSS reader that
| captures link metadata from various RSS sources. The metadata is
| exported every day. I keep separate repositories for bookmarks,
| daily links, and 'known domains'.
|
| Written in Django.
|
| I can always go back and parse the saved data. If a web page is
| not available, I fall back to the Internet Archive.
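|
| How the linked project does its fallback isn't described here.
| One possible way, sketched below, is the Wayback Machine
| availability endpoint, which returns JSON describing the
| closest snapshot. The function names are hypothetical and the
| network call itself is left out.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def availability_url(page_url: str, timestamp: Optional[str] = None) -> str:
    """Build a Wayback availability query for page_url."""
    params = {"url": page_url}
    if timestamp:
        # Optional YYYYMMDDhhmmss prefix to steer the "closest" match.
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


def closest_snapshot(response_body: str) -> Optional[str]:
    """Extract the archived URL from the JSON response, if any."""
    data = json.loads(response_body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


if __name__ == "__main__":
    print(availability_url("example.com/post", "20240101"))
```

| Anything built on this should fetch sparingly and cache the
| results, in line with the Archive's request linked earlier in
| the thread.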
|
| - https://github.com/rumca-js/Django-link-archive - RSS reader /
| web scraper
|
| - https://github.com/rumca-js/RSS-Link-Database - bookmarks I
| found interesting
|
| - https://github.com/rumca-js/RSS-Link-Database-2024 - every day
| storage
|
| - https://github.com/rumca-js/Internet-Places-Database - internet
| domains found on the internet
|
| After creating a Python package for web communication that
| replaces requests for me (and sometimes uses Selenium), I also
| wrote a CLI to read RSS sources from the command line:
| https://github.com/rumca-js/yafr
| pentagrama wrote:
| > generally the RSS and ATOM feeds for any blog, are limited in
| two ways;
|
| > 1. [limited history of posts]
|
| > 2. [partial content]
|
| To fix limitation No. 1 in some cases, maybe the author can
| rely on sitemaps [1], a feature present in many sites (like RSS
| feeds) that lists _all_ the pages published.
|
| [1] https://www.sitemaps.org/
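|
| A sketch of pulling every page URL out of a sitemap, assuming
| a plain `<urlset>` file in the sitemaps.org 0.9 namespace.
| Sitemap index files, which point at further sitemaps, are not
| handled, and the function name is made up.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_entries(sitemap_xml: str) -> list:
    """Return (loc, lastmod) pairs; lastmod is None when absent."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod))
    return entries


if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/a</loc>"
        "<lastmod>2024-09-01</lastmod></url>"
        "<url><loc>https://example.com/b</loc></url>"
        "</urlset>"
    )
    for loc, lastmod in sitemap_entries(sample):
        print(loc, lastmod)
```

| A sitemap lists every page, not only posts, so a tool like this
| would still need some way to tell blog posts apart from the
| rest.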
| twoprops wrote:
| Does no one find it ironic that one of the complaints about RSS
| feeds is they don't give you the full content, forcing you to
| visit the site, while trying to access the poster's web site
| through reader view gives you a warning that you have to visit
| the site directly to get the full content?
| zczc wrote:
| Looks like a nice tool for extending existing RSS sources. As for
| the sites that don't have RSS support in the first place, there
| is also RSSHub [1]. Sadly, you can't use both for the same
| source: history4feed's trick with the Wayback Machine wouldn't
| work with the RSSHub feed.
|
| [1] https://rsshub.app/
___________________________________________________________________
(page generated 2024-09-02 23:00 UTC)