danq.me

       THE FAR SIDE IN FRESHRSS
       
       2022-11-23
       
       A few years ago, I wanted to subscribe to The Far Side's "Daily Dose" via my
       RSS reader. The Far Side doesn't have an RSS feed, so I implemented a
       proxy/middleware to bridge the two.
       
       It turns out that FreshRSS's XPath Scraping is almost enough to achieve
       exactly what I want. The big problem is that the image server on The Far Side
       website tries to prevent hotlinking by checking the Referer: header on
       requests, so we need a proxy to spoof that. I threw together a quick PHP
       program to act as a proxy (without it, you'll have to click through to
       read each comic), then configured my FreshRSS feed as follows:
       
 (IMG) Screenshot showing my FreshRSS XPath configuration
       * Feed URL: https://www.thefarside.com/
       The "Daily Dose" gets published to The Far Side's homepage each day.
       * XPath for finding new items: //div[@class="card tfs-comic js-comic"]
       Finds each comic on the page. This is probably a little over-specific and
       brittle; I should probably switch to using the contains function at some
       point. I subsequently have to use the parent:: and ancestor:: axes, which is
       usually a sign that your screen-scraping is suboptimal, but in this case it's
       necessary because it's only at this deep level that we start seeing really
       specific classes.
       * Item title: concat("Far Side #", parent::div/@data-id)
       The comics don't have titles ("The one with the cow"?), but these seem to have
       unique IDs in the data-id attribute of the parent <div>, so I'm using those as
       a reference.
       * Item content: descendant::div[@class="card-body"]
       Within each item, the <div class="card-body"> contains the comic and its text.
       The comic itself can't be loaded this way for two reasons: (1) the <img
       src="..."> just points to a placeholder (the site uses JavaScript-powered
       lazy-loading, ugh - the actual source is in the data-src attribute), and (2)
       as mentioned above, there's anti-hotlink protection we need to work around.
       * Item link: descendant::input[@data-copy-item]/@value
       Each comic does have a unique link which you can access by clicking the
       "share" button under it. This makes a hidden text <input> appear, which we can
       identify by the presence of the data-copy-item attribute. The content of this
       textbox is the sharing URL for the comic.
       * Item thumbnail:
       concat("https://example.com/referer-faker.php?pw=YOUR-SECRET-PASSWORD-GOES-HERE&referer=https://www.thefarside.com/&url=",
       descendant::div[@class="tfs-comic__image"]/img/@data-src)
       Here's where I hook into my special proxy server, which spoofs the Referer:
       header to work around the anti-hotlinking code. If you wanted, you might be
       able to come up with an alternative solution using custom JavaScript loaded
       into your FreshRSS instance (there's a plugin for that!), perhaps to load an
       iframe of the sharing URL? Or you can host a copy of my proxy server yourself
       (you can't use mine, it's got a password and that password isn't
       YOUR-SECRET-PASSWORD-GOES-HERE!)
       * Item date: ancestor::div[@class="tfs-page__full
       tfs-page__full--md"]/descendant::h3
       There's nothing associating each comic with the date it appeared in the Daily
       Dose, so we have to ascend to the top level of the page to find the date
       from the heading.
       * Item unique ID: parent::div/@data-id
       Giving FreshRSS a unique ID can help it stop showing duplicates. We use the
       unique ID we discovered earlier; this way, if the Daily Dose does a re-run of
       something it already did since I subscribed, I won't be shown it again. Omit
       this if you want to see reruns.
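       
       To make the rules above a bit more concrete, here's a minimal sketch of the
       same data flow in Python. It is not how FreshRSS evaluates XPath, and it is
       not my PHP proxy: it uses the standard library's ElementTree (which only
       supports a subset of XPath, hence the hand-rolled parent lookup), and the
       sample markup, the img.example.com and example.com/share URLs, the date
       format, and all function names are assumptions made up for illustration.
       Only the class names, attribute names, and proxy query parameters come from
       the configuration above.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup shaped like the structure described above.
# The real page is larger, and is HTML rather than well-formed XML.
SAMPLE_PAGE = """
<div class="tfs-page__full tfs-page__full--md">
  <h3>Wednesday, November 23, 2022</h3>
  <div data-id="12345">
    <div class="card tfs-comic js-comic">
      <div class="card-body">
        <div class="tfs-comic__image">
          <img src="placeholder.gif" data-src="https://img.example.com/12345.jpg"/>
        </div>
      </div>
      <input type="text" data-copy-item="" value="https://example.com/share/12345"/>
    </div>
  </div>
</div>
"""

PROXY = "https://example.com/referer-faker.php"
SITE = "https://www.thefarside.com/"
PAGE_CLASS = "tfs-page__full tfs-page__full--md"


def proxied_url(image_url, password):
    """Build the thumbnail URL. Unlike XPath's concat(), this percent-encodes
    the values, which assumes the proxy decodes ordinary query parameters."""
    query = urllib.parse.urlencode(
        {"pw": password, "referer": SITE, "url": image_url}
    )
    return f"{PROXY}?{query}"


def upstream_request(image_url, referer=SITE):
    """The request a Referer-spoofing proxy would make to the image server.
    It is only constructed here, never sent."""
    return urllib.request.Request(image_url, headers={"Referer": referer})


def extract_items(page, password):
    root = ET.fromstring(page)
    # ElementTree has no parent:: or ancestor:: axes, so build a child->parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    items = []
    for comic in root.iter("div"):
        if comic.get("class") != "card tfs-comic js-comic":
            continue
        comic_id = parent_of[comic].get("data-id")  # parent::div/@data-id
        # Walk upwards to the page-level wrapper, then find its <h3> date heading.
        node, date = comic, None
        while node in parent_of and date is None:
            node = parent_of[node]
            if node.get("class") == PAGE_CLASS:
                date = node.findtext(".//h3")
        image = comic.find(".//div[@class='tfs-comic__image']/img")
        items.append({
            "title": "Far Side #" + comic_id,
            "guid": comic_id,
            "link": comic.find(".//input[@data-copy-item]").get("value"),
            "date": date,
            # The real image URL lives in data-src, not src (lazy-loading).
            "thumbnail": proxied_url(image.get("data-src"), password),
        })
    return items
```

       Calling extract_items(SAMPLE_PAGE, "YOUR-SECRET-PASSWORD-GOES-HERE") gives
       one dictionary per comic, with the thumbnail pointing at the proxy; the
       proxy's job is then just to repeat the image request with the Referer:
       header that upstream_request() sets.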
       
       There's a moral to this story: when you make your website deliberately hard to
       consume, fewer people will access it in the way you want! The Far Side's
       website is actively hostile to users (JavaScript lazy-loading, anti-right-click
       scripts, hotlink protection, incorrect MIME types, no feeds, etc.), and
       an inevitable consequence of that is that people like me will find and share
       workarounds to that hostility.
       
       If you're ad-supported or collect webstats and want to keep traffic "on your
       site" on this side of 2004, you should make it as easy as possible for people
       to subscribe to content. Consider The Oatmeal or Oglaf, for example, which
       offer RSS feeds that include only a partial thumbnail of each comic and a link
       through to the full thing. I don't feel the need to screen-scrape those sites
       because they've given me a subscription option that works, and I routinely
       click through to both of them to enjoy their latest content!
       
       Conversely, The Far Side's aggressive anti-subscription technology ultimately
       means that there are fewer actual visitors to their website... because folks
       like me work to circumvent it.
       
       And now you know how I did so.
       
       Update: want the new content that's being published to The Far Side in
       FreshRSS, too? I've got a recipe for that!
       
       LINKS
       
 (HTM) The Far Side
 (DIR) My blog post: Subscribing to The Far Side via RSS
 (HTM) Release tag for FreshRSS 1.20.0
 (HTM) FreshRSS
 (HTM) Pull request adding XPath scraping to FreshRSS
 (HTM) My initial blog post demonstrating how to use FreshRSS's XPath scraping features
 (HTM) Beverley's website
 (HTM) referer-faker.php, my PHP referer:-adding proxy
 (HTM) MDN definition of the XPath contains function
 (HTM) CustomJS plugin for FreshRSS
 (HTM) The Oatmeal
 (HTM) Oglaf
 (DIR) I've got a recipe for that!