https://danq.me/2022/09/27/freshrss-xpath/

Dan Q

XPath Scraping with FreshRSS

I've been spending a while running on reduced brain capacity lately so, to ease myself back into thinking like a programmer, I upgraded my preferred feed reader FreshRSS to version 1.20.0 - which was released a couple of weeks ago - and tried out what I believe is its killer new feature: HTML + XPath scraping.

Screenshot showing Beverley Newing's weblog; two articles are visible: Setting up an Accessibility Book Club (illustrated with a paperback copy of 'Disability Visibility', edited by Alice Wong, next to a cup of tea), published on 1 March 2022, and Reflecting on 2021, published on 1 January 2022.
I like to keep up-to-date with my friend Bev's blog, but they don't have an RSS feed.

I've been using RSS^1 for about 20 years and I love it. It feels great to be able to curate my updates based on "what I care about", and not on "what some social network thinks I should care about", to keep things to read later, to prioritise effectively based on my own categorisation, to consume content offline and have my to-read list synchronise later, etc.

RSS never went away, of course (what do you think a podcast is?), but it got steamrollered out of the public eye by big companies who make their money out of keeping your eyes on their platforms and off the open Web. But it feels like it's slowly coming back: even Substack - whose entire thing is that an email client is more-convenient than a feed reader for most people - launched an RSS reader this week!

A smartphone on a wooden surface.
The screen shows the FeedMe app, showing the most-recent blog post from Beverley's blog.
My day usually starts in my feed reader, accessed via the FeedMe app from my mobile (although FreshRSS provides a reasonably good responsive interface out-of-the-box!).

I love RSS so much that I routinely retrofit other people's websites with feeds just so I can subscribe to them: I even published the tool I use to do so! Whether filtering sports headlines out of BBC News, turning retro webcomics into "reading lists" so I can track my progress, or just working around sites that really should have feeds but refuse to, I just love sidestepping these "missing feeds". My friend Beverley has a blog without any kind of feed, so I added one so I could subscribe to it. Magic.

But with FreshRSS 1.20.0, I no longer have to maintain my own tool to get this brilliant functionality, and I'm overjoyed. Let's look at how it works by re-subscribing to Beverley's blog but without a middleware tool.

Screenshot showing FetchRSS being used to graphically create a feed from Beverley's blog.
This post is about to get pretty technical. If you don't want to learn some XPath but just want to make a feed out of a web page, use a graphical tool like FetchRSS.

In the latest version of FreshRSS, when you add a new feed to your reader, a new section "Type of feed source" is available. Unfold it, and you can change from the default ("RSS / Atom") to the new option "HTML + XPath (Web scraping)".

Put a human-readable page address rather than a feed address into the "Feed URL" field, then fill in the XPath fields to tell FreshRSS how to parse the page to get the content you want. Note that it doesn't matter if the web page isn't valid XML (e.g. missing closing tags), because it's going to get run through PHP's DOMDocument anyway, which will "correct" for some really sloppy code if needed.
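To make the idea of "one rule to find each item, more rules to pull fields out of it" concrete, here's a rough sketch in Python. It is only an analogue, not how FreshRSS does it: FreshRSS is PHP and uses DOMDocument, whereas this uses Python's standard-library ElementTree, which supports only a subset of XPath and, unlike DOMDocument, needs well-formed markup. The sample page, the class name "post" and the function name scrape_items are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup standing in for a blog index page.
# (ElementTree would reject the sloppy HTML that DOMDocument tolerates.)
SAMPLE_PAGE = """
<html>
  <body>
    <h1>Example Blog</h1>
    <ul>
      <li class="post"><h2><a href="/posts/second.html">Second post</a></h2></li>
      <li class="post"><h2><a href="/posts/first.html">First post</a></h2></li>
      <li class="nav"><a href="/about.html">About</a></li>
    </ul>
  </body>
</html>
"""


def scrape_items(page_source, item_xpath, title_xpath, link_xpath):
    """Hypothetical helper: find each item, then read fields relative to it."""
    root = ET.fromstring(page_source.strip())
    items = []
    # The item rule is evaluated against the whole page...
    for node in root.findall(item_xpath):
        # ...and the field rules are evaluated relative to each matched item.
        title_node = node.find(title_xpath)
        link_node = node.find(link_xpath)
        items.append({
            "title": "".join(title_node.itertext()).strip()
            if title_node is not None else "",
            # ElementTree can't select attributes with @href in a path,
            # so the attribute is read from the matched element instead.
            "link": link_node.get("href", "") if link_node is not None else "",
        })
    return items


items = scrape_items(SAMPLE_PAGE, ".//li[@class='post']", ".//h2", ".//h2/a")
for item in items:
    print(item["title"], "->", item["link"])
```

The navigation entry is skipped because it doesn't match the item rule, which is the same reason a well-chosen item XPath keeps menus and footers out of a scraped feed.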
Browser debugger running document.evaluate('//li[@class="blog__post-preview"]', document).iterateNext() on Beverley's weblog and getting the first blog entry.
You can use your browser's debugger to help check your XPath rules: here I've run document.evaluate('//li[@class="blog__post-preview"]', document).iterateNext() and got back the first blog post on the page, so I know I'm on the right track.

You'll need to use XPath to express how to find a "feed item" on the page. Here are the rules I used for https://webdevbev.co.uk/blog.html (many of these fields were optional - I didn't have to do this much work):

* Feed title: //h1
  I override this anyway in FreshRSS, so I could just have used a string, but I wanted the XPath practice. There's only one