https://danq.me/2022/09/27/freshrss-xpath/

Dan Q

XPath Scraping with FreshRSS

 

I've been spending a while running on reduced brain capacity lately
so, to ease myself back into thinking like a programmer, I upgraded
my preferred feed reader FreshRSS to version 1.20.0 - which was
released a couple of weeks ago - and tried out what I believe is its
killer new feature: HTML + XPath scraping.

[Screenshot: Beverley Newing's weblog. Two articles are visible:
"Setting up an Accessibility Book Club", published on 1 March 2022
(its thumbnail shows a paperback copy of 'Disability Visibility',
edited by Alice Wong, next to a cup of tea), and "Reflecting on 2021",
published on 1 January 2022.]

I like to keep up-to-date with my friend Bev's blog, but they don't
have an RSS feed.

I've been using RSS^1 for about 20 years and I love it. It feels
great to be able to curate my updates based on "what I care about",
and not on "what some social network thinks I should care about", to
keep things to read later, to prioritise effectively based on my own
categorisation, to consume content offline and have my to-read list
synchronise later, etc.

RSS never went away, of course (what do you think a podcast is?), but
it got steamrollered out of the public eye by big companies who make
their money out of keeping your eyes on their platforms and off the
open Web. But it feels like it's slowly coming back: even Substack -
whose entire thing is that an email client is more-convenient than a
feed reader for most people - launched an RSS reader this week!

[Photo: a smartphone on a wooden surface. The screen shows the FeedMe
app, showing the most-recent blog post from Beverley's blog.]

My day usually starts in my feed reader, accessed via the FeedMe app
from my mobile (although FreshRSS provides a reasonably good
responsive interface out-of-the-box!)

I love RSS so much that I routinely retrofit other people's websites
with feeds just so I can subscribe to them: I even published the tool
I use to do so! Whether filtering sports headlines out of BBC News,
turning retro webcomics into "reading lists" so I can track my
progress, or just working around sites that really should have feeds
but refuse to, I just love sidestepping these "missing feeds". My
friend Beverley has a blog without any kind of feed, so I added one
so I could subscribe to it. Magic.

But with FreshRSS 1.20.0, I no longer have to maintain my own tool to
get this brilliant functionality, and I'm overjoyed. Let's look at
how it works by re-subscribing to Beverley's blog but without a
middleware tool.

[Screenshot: FetchRSS being used to graphically create a feed from
Beverley's blog.]

This post is about to get pretty technical. If you don't want to
learn some XPath but just want to make a feed out of a web page, use
a graphical tool like FetchRSS.

In the latest version of FreshRSS, when you add a new feed to your
reader, a new section "Type of feed source" is available. Unfold it,
and you can change from the default ("RSS / Atom") to the new option
"HTML + XPath (Web scraping)". Put a human-readable page address
rather than a feed address into the "Feed URL" field, then fill in
the fields that appear to tell FreshRSS how to parse the page to get
the content you want. Note that it doesn't matter if the web page
isn't valid XML (e.g. missing closing tags) because it's going to get
run through PHP's DOMDocument anyway, which will "correct" for some
really sloppy code if needed.
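
To illustrate why that tolerance matters, here is a minimal Python
sketch contrasting a strict XML parser with a lenient HTML one on
markup that omits closing tags. It is only an analogy: FreshRSS uses
PHP's DOMDocument, not these Python modules, and the sample markup
and helper names below are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sloppy markup: neither <li> is ever closed.
SLOPPY_HTML = "<ul><li>First post<li>Second post</ul>"


class ListItemCollector(HTMLParser):
    """Collects the text of each <li>, even when </li> is missing."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        # A new <li> implicitly ends the previous one.
        if tag == "li":
            self.items.append("")
            self._in_li = True

    def handle_endtag(self, tag):
        if tag in ("li", "ul"):
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


def strict_parse_ok(source):
    """Return True if the source parses as well-formed XML."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


def lenient_list_items(source):
    """Return the text of every list item, tolerating unclosed tags."""
    collector = ListItemCollector()
    collector.feed(source)
    collector.close()
    return [item.strip() for item in collector.items]
```

With the sample above, strict_parse_ok(SLOPPY_HTML) fails on the
mismatched tags, while lenient_list_items(SLOPPY_HTML) still recovers
both list items.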

[Screenshot: browser debugger running
document.evaluate('//li[@class="blog__post-preview"]', document).iterateNext()
on Beverley's weblog and getting the first blog entry.]

You can use your browser's debugger to help check your XPath rules:
here I've run
document.evaluate('//li[@class="blog__post-preview"]', document).iterateNext()
and got back the first blog post on the page, so I know I'm on the
right track.

You'll need to use XPath to express how to find a "feed item" on the
page. Here are the rules I used for https://webdevbev.co.uk/blog.html
(many of these fields were optional - I didn't have to do this much
work):

  * Feed title: //h1
    I override this anyway in FreshRSS, so I could just have used a
    string, but I wanted the XPath practice. There's only one <h1>
    on the page, and it can be considered the "title" of the feed.
  * Finding items: //li[@class="blog__post-preview"]
    Each "post" on the page is an <li class="blog__post-preview">.
  * Item titles: descendant::h2
    Each post has a <h2> which is the post title. The descendant::
    selector scopes the search to each post as found above.
  * Item content: descendant::p[3]
    Beverley's static site generator template puts the post summary
    in the third paragraph of the <li>, which we can select like
    this.
  * Item link: descendant::h2/a/@href
    This expects a URL, so we need the /@href to make sure we get the
    value of the <h2><a href="...">, rather than its contents.
  * Item thumbnail: descendant::img[@class="blog__image--preview"]/@src
    Again, this expects a URL, which we get from the <img src="...">.
  * Item author: "Beverley Newing"
    Beverley's blog doesn't host any guest posts, so I just use a
    string literal here.
  * Item date: substring-after(descendant::p[@class="blog__date-posted"], "Date posted: ")
    This is the only complicated one: the published dates on
    Beverley's blog aren't explicitly marked-up, but part of a string
    that begins with the words "Date posted: ", so I use XPath's
    substring-after function to strip this. The result gets passed
    to PHP's strtotime(), which is pretty tolerant of different date
    formats (although not of the words "Date posted:" it turns out!).
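
The rules above can be sketched outside FreshRSS to see what each one
selects. This is a rough Python approximation, not FreshRSS's own
code: the sample markup is a hypothetical stand-in for Beverley's
page (only the class names come from the rules), and because Python's
built-in ElementTree supports only a subset of XPath, ".//" stands in
for descendant::, .get() for /@href and /@src, and str.partition()
for substring-after. The "day month year" date format is also an
assumption.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical, well-formed stand-in for the blog listing page.
SAMPLE_PAGE = """
<html><body>
  <h1>Blog</h1>
  <ul>
    <li class="blog__post-preview">
      <img class="blog__image--preview" src="/img/book-club.jpg" alt="" />
      <h2><a href="/blog/book-club.html">Setting up an Accessibility Book Club</a></h2>
      <p class="blog__date-posted">Date posted: 1 March 2022</p>
      <p>Tagged: accessibility</p>
      <p>Why and how we started a book club.</p>
    </li>
    <li class="blog__post-preview">
      <img class="blog__image--preview" src="/img/2021.jpg" alt="" />
      <h2><a href="/blog/2021.html">Reflecting on 2021</a></h2>
      <p class="blog__date-posted">Date posted: 1 January 2022</p>
      <p>Tagged: personal</p>
      <p>A look back at the year.</p>
    </li>
  </ul>
</body></html>
"""


def text_of(element):
    """Return an element's text including that of its descendants."""
    return "".join(element.itertext()).strip()


def scrape_feed(page_source, author="Beverley Newing"):
    """Apply approximations of the article's XPath rules to a page."""
    root = ET.fromstring(page_source)
    items = []
    # Finding items: //li[@class="blog__post-preview"]
    for post in root.findall(".//li[@class='blog__post-preview']"):
        # Item date: strip the "Date posted: " prefix. Like XPath's
        # substring-after, partition() yields "" if the prefix is absent.
        raw_date = text_of(post.find(".//p[@class='blog__date-posted']"))
        date_text = raw_date.partition("Date posted: ")[2]
        items.append({
            "title": text_of(post.find(".//h2")),        # descendant::h2
            "content": text_of(post.find(".//p[3]")),    # descendant::p[3]
            "link": post.find(".//h2/a").get("href"),    # .../a/@href
            "thumbnail": post.find(
                ".//img[@class='blog__image--preview']").get("src"),
            "author": author,                            # string literal
            # Assumed format; PHP's strtotime() is far more lenient.
            "date": datetime.strptime(date_text, "%d %B %Y").date(),
        })
    # Feed title: //h1
    return {"title": text_of(root.find(".//h1")), "items": items}
```

Calling scrape_feed(SAMPLE_PAGE) returns a dictionary with the feed
title and one entry per <li class="blog__post-preview">, which is
roughly the shape of data FreshRSS needs to build feed items.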

[Screenshot: adding a "HTML + XPath (Web scraping)" feed via FreshRSS.]

I'd love one day for FreshRSS to provide some kind of "preview"
feature here so you can see what you can expect to get back, as you
work. That, and support for different input types (JSON, perhaps?),
perhaps other selectors (I find CSS-style selectors much simpler than
XPath), and maybe even an option to execute JavaScript on the page
before scraping (I use this in my own toolchain, but that's just
because I want to have my cake and eat it too). But this is still all
pretty awesome.

I hope that this is just the beginning for this new killer feature in
FreshRSS: there's so much more it can be and do. But for now, I'm
still mighty impressed that I can begin to phase-out my use of my
relatively resource-intensive feed-building middleware and use my
feed reader to do more and more of the heavy lifting for which I love
it so much.

I also love that this functionally adds h-feed support in by the back
door. I'd still prefer there to be a "h-feed" option in the "Type of
feed source" drop-down, but at least I can add such support manually,
now!
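
As a sketch of what that manual h-feed support amounts to: the
microformats2 classes (h-entry, p-name, u-url, dt-published) play the
role that blog__post-preview and friends played above. The Python
below uses invented sample markup and hypothetical helper names. In
FreshRSS itself the matching would be written as XPath, where the
usual XPath 1.0 idiom for matching one class among several is
contains(concat(" ", normalize-space(@class), " "), " h-entry "); I
have not verified that expression against FreshRSS.

```python
import xml.etree.ElementTree as ET

# Hypothetical h-feed markup, invented for illustration.
SAMPLE_H_FEED = """
<div class="h-feed">
  <h1 class="p-name">Example notes</h1>
  <article class="h-entry post">
    <a class="u-url p-name" href="/notes/1">First note</a>
    <time class="dt-published" datetime="2022-09-27T14:45:00+00:00">27 September 2022</time>
  </article>
</div>
"""


def has_class(element, token):
    """True if the class attribute contains token as a whole word."""
    return token in element.get("class", "").split()


def first_with_class(scope, token):
    """Return the first element within scope carrying the class token."""
    return next((el for el in scope.iter() if has_class(el, token)), None)


def h_entries(source):
    """Extract title, link and date from each h-entry in the markup."""
    root = ET.fromstring(source)
    entries = []
    for entry in root.iter():
        if not has_class(entry, "h-entry"):
            continue
        name = first_with_class(entry, "p-name")
        entries.append({
            "title": "".join(name.itertext()).strip(),
            "link": first_with_class(entry, "u-url").get("href"),
            "published": first_with_class(entry, "dt-published").get("datetime"),
        })
    return entries
```

Matching on whole class tokens, rather than the exact @class="..."
comparison used for Beverley's blog, matters here because h-feed
elements usually carry several classes at once.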

[Screenshot: Beverley's blog post "Setting up an Accessibility Book
Club" in FreshRSS.]

The finished result: Bev's blog posts appear directly in my feed
reader, even though they don't have a feed, and now without going
through the middleware I'd set up for that purpose.

Footnotes

^1 When I say RSS, I mean feed. Most of the feeds I subscribe to are
RSS feeds, but some are Atom feeds, h-feed, etc. But I can't get over
the old-fashioned name, and I don't care to try.

27 September 2022

Article posted at 14:45 UTC on 27 September 2022.

4 tags

This post is tagged:

  * freshrss
  * indieweb
  * rss
  * xml

4 comments

 1. FreshRSS says:

    Really great article, thanks!

    27 September, 2022, 18:01
 2. Alkarex says:

    Thanks for the great article.
    I would love to get your feedback on our pull requests as well as
    FreshRSS release candidates, so do not hesitate to reach out!
    https://github.com/FreshRSS/FreshRSS/
    (Also if you write other FreshRSS articles - some could even be
    linked from our documentation - PRs welcome.)
    More precise ideas regarding h-card and JSON are also welcome (I
    have already been thinking about options for JSON), in particular
    regarding how often those use-cases could be used on Web sites
    not also providing RSS/ATOM feeds.

    1 October, 2022, 16:58
 3. David says:

    Thank you. Can you suggest a reader? I'm hooked on Feedly because
    I've used it so long and gotten used to swiping left and right,
    and tagging things. Other readers feel weird. I will use FreshRSS
    too, self-hosted, but what reader apps do you think are the best
    ones? I'm on Mac and iOS. Cheers!

    26 December, 2023, 20:43
 4. Dan Q says:

    I use FreshRSS's own Web-based reader on desktop. It's fast,
    responsive, always up-to-date, has sensible keyboard shortcuts,
    and benefits from some FreshRSS-specific functionality like
    custom JS (with a standard plugin): I use this to e.g. swap out
    the low-res images in one particular feed with the high-res
    variants they hide in a custom property, and use one on xkcd to
    show the title text below the image.

    On my phone, I use FeedMe (for Android, available on F-Droid,
    free/donation-supported), which gives me a solid offline sync so
    I can read a load of news and blogs while on aeroplanes and have
    them marked read on FreshRSS when I touch down. Otherwise I don't
    use an app at all these days! FreshRSS's Web interface is
    perfectly good for my needs; I keep it in a Firefox "Pinned Tab"
    so it's always handy.

    26 December, 2023, 21:51

  * (c) Dan Q 1998-2023
  * Creative Commons Attribution Non-Commercial except where stated