[HN Gopher] Show HN: RSS feeds for arbitrary websites using CSS ...
       ___________________________________________________________________
        
       Show HN: RSS feeds for arbitrary websites using CSS selectors
        
       Author : Vinnl
       Score  : 309 points
       Date   : 2021-07-05 16:28 UTC (6 hours ago)
        
 (HTM) web link (feed-me-up-scotty.vincenttunru.com)
 (TXT) w3m dump (feed-me-up-scotty.vincenttunru.com)
        
       | ianberdin wrote:
        | My friend is working on a fairly simple feed generator for any
        | website or social media site: https://rss.app. Maybe helpful.
        
       | mhitza wrote:
       | This could nicely supplement my GitHub automation that emails
       | feed digests https://github.com/mhitza/subscriptions-digest
       | 
        | As with my repository, I would suggest adding the option to
        | fetch the configuration file from an external resource defined
        | via an action secret. For my automation I'm using a Gist (not
        | sure if GitLab has the same thing: private but publicly
        | accessible snippets).
       | 
       | At least that way you can keep your own feed configuration while
       | allowing those that fork the repository to not have to manually
       | fix conflicts within the feeds.toml config.
        
       | [deleted]
        
       | markdown wrote:
       | Great project, but that's a shitty name.
        
         | Vinnl wrote:
         | Heh, one thing I've learned over the years is that a great way
         | to not actually publish my side projects is to overthink the
         | name, so nowadays I just pick one and go with it when I'm ready
         | to publish it.
        
           | markdown wrote:
           | I can't disagree with that :)
        
       | chrismorgan wrote:
       | I'm glad to see that you're using Atom rather than RSS, but
       | you're still _calling_ it RSS. Can I convince you to stop calling
       | it that? It's factually inaccurate, and keeps RSS, the inferior
       | format, more popular by virtue of mindshare. Just call them
       | feeds, the proper generic term.
        
         | pmlnr wrote:
         | Oh, yes, please revive the bloody format wars /s
        
         | nvr219 wrote:
         | If OP didn't use the word RSS and said "atom feeds" I wouldn't
         | know what they're talking about, so I recommend keep calling it
         | RSS.
        
         | moehm wrote:
          | Same problem with SSL encryption. But with both RSS and SSL,
          | if the user uses a modern client, they shouldn't have to care
          | whether it's internally using RSS, Atom, SSL or TLS.
         | 
          | For my part, most of my pals know about RSS, but haven't
          | heard the term 'Atom feed'.
        
           | chrismorgan wrote:
           | "SSL" is _finally_ dying as a term; the significant majority
           | of what I see calls it TLS now.
           | 
           | But the problem didn't exist in the same way with SSL/TLS:
           | SSL was actively killed off in favour of TLS so that
           | regardless of what you _call_ it, you're actually dealing
           | with TLS. But with feeds, RSS is still supported, and so the
           | mindshare problem happens: people hear about RSS and so
           | implement the inferior and problematic RSS rather than Atom,
           | because they've never even heard of Atom and so don't know
           | that it's what they should have implemented instead, because
           | it's more robust. Calling feeds RSS perpetuates RSS, which is
           | undesirable.
           | 
           | I say just advertise it as "feeds", not "RSS feeds". Atom is
           | an implementation detail, just like RSS _should_ be. (Well,
           | RSS should be _dead_ in general feeds, with only podcast
           | feeds keeping it alive.)
        
             | bscphil wrote:
             | > I say just advertise it as "feeds", not "RSS feeds".
             | 
             | I think that's even worse. For most ordinary people that
             | has a bunch of unintended meanings. "My Facebook has a
             | feed, do you mean subscribing to your site on Facebook? Why
             | do I see a bunch of code when I click the link?"
        
         | alanfranz wrote:
         | I do not care. RSS and Atom both win over closed platforms. If
         | I could get back to an RSS-oriented internet, I would rejoice
         | even if Atom didn't exist.
        
       | cyborgx7 wrote:
       | this is something I've been looking for
       | 
       | neat
        
       | canada_dry wrote:
       | Browsing OP's repositories was this gem:
        | https://gitlab.com/vincenttunru/flatuscode _A VSCode extension.
        | That adds farts on every keypress. That's all._
        
         | hallway_monitor wrote:
         | Pretty great for April 1st... Or whenever you find an unlocked
         | machine!
        
           | Vinnl wrote:
           | If you're back in the office, I can recommend enabling it for
           | just one particular programming language on your coworker's
           | VSCode.
        
           | canada_dry wrote:
           | When I saw it, I did do the quick calc based on the last
           | update: worked out to around Christmas. OP might have just
           | been bored over the holidays.
        
             | Vinnl wrote:
             | Haha yeah, came up with the idea and went right ahead and
             | implemented it. Probably should do a Show HN on April 1st
             | though -- just set a reminder for next year.
        
       | rcarmo wrote:
       | This is nice. I'm actually doing much the same with Node-RED's
       | HTML parser, which also supports simple selectors.
        
         | Vinnl wrote:
         | Thanks. I initially used a regular HTML parser as well, but I
         | quickly ran into sites that wouldn't render without JavaScript.
         | I'm therefore now using a regular browser controlled by
         | Playwright to fetch the websites.
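A minimal sketch of that Playwright approach in Python. This is not the project's actual code; the URL and selectors passed to `scrape` are placeholders, and `entries_from_matches` is a hypothetical helper introduced here for illustration:

```python
# Sketch: render a JS-heavy page in a real browser via Playwright,
# then extract feed items with CSS selectors.
# Assumes `pip install playwright` and `playwright install` were run.
from urllib.parse import urljoin


def entries_from_matches(matches, base_url):
    """Turn (title, href) pairs into feed entries with absolute links."""
    return [
        {"title": title.strip(), "link": urljoin(base_url, href)}
        for title, href in matches
    ]


def scrape(url, entry_selector, link_selector):
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)  # JavaScript runs here, unlike with a plain HTML parser
        matches = [
            (el.inner_text(), el.query_selector(link_selector).get_attribute("href"))
            for el in page.query_selector_all(entry_selector)
        ]
        browser.close()
    return entries_from_matches(matches, url)
```

Resolving hrefs against the page URL matters because scraped links are often relative, which is the same issue raised elsewhere in this thread about links inside the generated feeds.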
        
           | axiolite wrote:
           | Care to name any sites? I've always managed to find
           | workarounds for everything I've wanted to follow. Most
           | websites want to be indexed by search engines, and googlebot
           | doesn't do javascript. So sometimes a forged user agent is
           | all you need. Occasionally, finding the actual json file and
           | parsing the info you need out of it does the job. etc.
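For what it's worth, the forged user-agent trick described above can be sketched with nothing but the standard library. The UA string is illustrative, and sites may check other signals besides the header:

```python
# Sketch: fetch a page while presenting a Googlebot-like User-Agent,
# since some sites serve pre-rendered HTML to search engine crawlers.
import urllib.request

GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def crawler_request(url: str) -> urllib.request.Request:
    """Build a request that forges a crawler User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


def fetch_as_crawler(url: str) -> str:
    with urllib.request.urlopen(crawler_request(url)) as resp:
        return resp.read().decode("utf-8", errors="replace")
```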
        
             | moehm wrote:
             | > googlebot doesn't do javascript
             | 
             | But it does.
             | 
             | Primary source: I experienced it on my own site.
             | 
             | Secondary source:
             | https://developers.google.com/search/docs/guides/fix-
             | search-...
        
               | chrismorgan wrote:
               | This changes from time to time, of course, but when last
               | I investigated, around two years ago, consensus was that
               | it mostly wouldn't do JavaScript until you nudged it into
               | doing so in some way that I forget, and that it was
               | always slower to index/update if it needed to do
               | JavaScript.
               | 
               | (For my part, I disable JavaScript by default for various
               | reasons, mostly performance, and it's decidedly uncommon
               | for a general-internet site to be completely broken by
               | it. Sites that get posted on HN are disproportionately
               | JS-dependent, especially if they're new.)
        
             | Vinnl wrote:
             | The User Agent trick is a good one that I should've tried,
             | but I just checked and it didn't work for this one. Parsing
             | actual JSON wasn't really an option, as I wanted to be able
             | to quickly and easily add RSS feeds.
             | 
             | Possibly SEO is less a concern for the type of website I
             | initially made this for, i.e. Dutch real estate agents.
             | Most people find their listings through funda.nl rather
             | than through search engines; I was just hoping to see them
             | listed before they got posted there.
             | 
             | Send me a message on Twitter or email me (hacker_news@ my
             | domain) if you still want the URL of a failing website to
             | play around with.
        
       | 0des wrote:
       | Woah this is cool! What you did with the setup documentation and
       | the bit about automation was a nice touch, I wish more projects
       | had this attention to the detail. Very simple, useful and
       | elegant. Thanks for sharing this with us, Vincent!
        
       | alanchen wrote:
       | Related:
       | 
       | https://github.com/DIYgod/RSSHub
       | 
       | This perhaps has more flexibility and can deal with almost any
       | website.
        
       | bellyfullofbac wrote:
        | I wish there were a snippet of what the XML looks like, even
        | better if it were "rendered"...
        
         | Vinnl wrote:
         | That's a good one, I'll see if I can add something to the site.
         | Meanwhile you can see the generated examples here:
         | 
         | - https://vincenttunru.gitlab.io/feeds/funfacts.xml
         | 
         | - https://vincenttunru.gitlab.io/feeds/wikivoyage.xml
         | 
         | And the combined feed:
         | 
         | - https://vincenttunru.gitlab.io/feeds/all.xml
         | 
         | Add those links to your feed reader to see rendered examples
         | while I update the site.
         | 
         | Edit: preview added to the website.
        
           | moehm wrote:
           | FYI, the links inside your feeds don't work, because they are
           | relative, not absolute.
        
             | Vinnl wrote:
             | Ah, you mean the ones inside the contents? That's a good
             | one. I'm not sure if that's easily fixable, but I'll give
             | it some thought. For those interested, I'll track it here:
              | https://gitlab.com/vincenttunru/feed-me-up-scotty/-/issues/1
        
               | chrismorgan wrote:
               | The appropriate way is to use the xml:base attribute, as
               | demonstrated in the example at
               | https://datatracker.ietf.org/doc/html/rfc4287#page-4.
        
               | Vinnl wrote:
               | Thanks! I'll look at that.
        
               | chrismorgan wrote:
               | > _(Of course, for the combined feed this would be
               | problematic.)_
               | 
               | Not so; unlike the HTML <base> element which applies to a
               | document, the xml:base attribute is applied to an element
               | and its descendants. The typical pattern (as shown in the
               | RFC 4287 example) is to put it on each entry's <content>.
               | In your markup, you'll end up with each entry having its
               | URL in three places:
                |     <id>http://example.com/item</id>
                |     <link href="http://example.com/item"/>
                |     <content type="html"
                |       xml:base="http://example.com/item">...</content>
        
               | Vinnl wrote:
               | Excellent! I'll look into actually implementing this
               | before making further comments, since I'm sure I'll find
               | out such things as I do :P
               | 
               | Edit: the package I'm using to generate the feeds does
               | not support that attribute yet, so it'll have to wait a
               | bit for my PR to hopefully be accepted:
               | https://github.com/jpmonette/feed/pull/158
               | 
               | Thanks for the pointers!
        
       | kdbg wrote:
       | Kinda on a related note I found myself needing to make a bunch of
       | these sorts of scraped feeds. The problem for me was the lack of
       | date parsing support which I sorely needed.
       | 
       | I ended up writing my own CLI tool that similarly supports CSS
       | selectors for feed generation:
       | https://github.com/dayzerosec/feedgen
       | 
       | I did write it specifically for my use-case so there are some
       | "warts" on it like custom generators for HackerOne and Google's
       | Monorail bug tracker. But perhaps someone else might benefit from
       | its ability to create slightly more complicated RSS, Atom, or
       | JSON feeds.
       | 
       | Example config with date parsing:
       | https://github.com/dayzerosec/feedgen/blob/main/configs/bish...
        
       | pacman2 wrote:
       | Thank you, will try it out.
       | 
       | Two things I am using:
       | 
       | Twitter to RSS: https://github.com/RSS-Bridge/rss-bridge
       | 
       | Arbitrary RSS feeds: https://feedity.com
        
       | lorey wrote:
       | In case anyone wants to detect the selectors automatically,
       | here's a small python library I wrote that does it for you:
       | https://github.com/lorey/mlscraper
        
         | kwerk wrote:
         | This looks interesting! Will you be adding an open source
         | license?
        
         | clickok wrote:
         | A good idea and very cleanly implemented. I imagine that
         | there's a ton of other possible applications that don't require
         | much modification to the code. Thanks for sharing!
        
       | spacec0wb0y wrote:
       | What do I do to get this working?
       | 
       | So far I've forked feeds, edited feeds.toml, checked it out as a
       | branch gh-pages, pushed the branch up to github.
       | 
       | I can see the page at https://<username>.github.io/feeds/ but
       | https://<username>.github.io/feeds/actions is just a 404.
        
         | Vinnl wrote:
         | Ah, those instructions are unclear -- as far as I know, you
         | first have to go to https://github.com/<username>/feeds/actions
         | to enable Workflows for your repository. Then, your feeds
         | should be published to
         | https://<username>.github.io/feeds/<feedname>.xml.
         | 
         | Does that work?
        
           | spacec0wb0y wrote:
            | Not exactly. I saw below the instruction to run
            | `npx feed-me-up-scotty`, so I did, and it generated the
            | public/ dir and the feeds.
            | 
            | OK, I managed to access the actions route via a different
            | URL from the one in the readme. I copied the pages.yml
            | workflow from your repo.
           | 
           | Few minutes later I could see my feed. Very nice! I need to
           | clean up my selectors now!
        
       | remram wrote:
       | Couldn't those selectors be maintained by the community? Instead
       | of everyone deploying this on their own GitHub Actions, and
       | having to fix it independently when it breaks, a single repo with
       | all kinds of feeds maintained by everyone?
        
         | Vinnl wrote:
         | That's an interesting idea: something like DefinitelyTyped, but
         | instead of type definitions for npm packages it provides
         | selectors for URLs. Main challenge there would be organising
         | the moderation, I suppose.
        
           | contingencies wrote:
           | It should be feasible to analyse the structure over time of
           | the extracted data. Therefore, any proposed change which
           | breaks the anticipated rhythm would be suspect.
        
       | spicybright wrote:
        | I love very simple plumbing tools like these.
        
         | [deleted]
        
         | toastal wrote:
         | This reminds me of the Soupault (https://soupault.app/)
          | philosophy for building static sites. You write in any
          | language you like, pass it through Pandoc or AsciiDoctor as a
          | preprocessor, and postprocess with Lua and CSS selectors.
        
           | vhodges wrote:
           | You both might find https://stitcherd.vhodges.dev/
           | interesting then.
        
       | ricardo81 wrote:
       | Nice, but it's essentially scraping? Scraping is brittle.
       | 
       | Some design changes and the whole thing breaks. Maintenance
       | nightmare.
        
         | kortilla wrote:
         | Website doesn't offer API. So we must scrape. This is a tool
         | for that paradigm.
         | 
         | Your comment is "scraping is brittle", which is not helpful.
         | Everyone knows scraping sucks. It's why these tools are being
         | made to make it less unpleasant.
        
           | ricardo81 wrote:
           | It's not even that 'scraping sucks', that's just a stigma to
           | the word. The point is that it's not stable and having to
           | have DOM knowledge to select the paths is less inclusive.
        
             | moehm wrote:
             | > having to have DOM knowledge to select the paths is less
             | inclusive
             | 
             | Right click on the element in Firefox, click "inspect
             | element" and it shows you the unique selector for that
             | element.
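For context, the selectors picked that way end up in the project's feeds.toml. The key names in this sketch are hypothetical, not the documented schema; the selectors are the kind the dev tools would hand you:

```toml
# Hypothetical feeds.toml sketch; key names are illustrative.
[funfacts]
url = "https://en.wikipedia.org/wiki/Main_Page"
entrySelector = "#mp-dyk li"
titleSelector = "b"
linkSelector = "a"
```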
        
               | ricardo81 wrote:
               | Are you aware that some websites use randomised css ids
               | and classes? Have to wonder why. Google search results
               | being an example.
        
               | playpause wrote:
               | There's usually some way to target what you want through
               | descendent/sibling, tag name, and attribute selectors.
        
               | ricardo81 wrote:
               | Yes, if you're determined enough you can scrape anything.
               | Just not sure what the new thing here is.
        
               | detaro wrote:
               | who has claimed there is something fundamentally new
               | here?
        
               | ricardo81 wrote:
               | Presenting something on the home page of HN would suggest
               | something novel. I must be missing the point because of
               | the downvotes. Happy to be enlightened. Pick CSS
               | selectors and scrape something? That's quintessentially
               | scraping.
        
               | onli wrote:
               | I'd say the novel part of the project is the Github
               | actions/Gitlab CI integration. I haven't seen that used
               | for RSS generation yet. That the selectors are stored in
               | a config file is also unusual for these types of feed
               | generation projects.
               | 
               | Though Show HNs do not need to be novel to get to the
               | top. And RSS related projects seem to go to the frontpage
               | simply because of the fondness HN has for the technology.
               | Also, don't forget https://xkcd.com/1053/ :)
        
               | moehm wrote:
               | Yes, I am. And this project is obviously not designed to
               | handle that. (It can't even do pagination.)
               | 
               | But it can bring feed capabilities to simple, timeline
               | like sites. Sites like most frameworks produce. And with
               | the dev tools you can quickly find the needed dom path.
               | It's limited, but easy to use. If it doesn't work, you
               | need a real scraper which is an order of magnitude more
               | complex. (I maintain some of them as well.)
        
         | ricardo81 wrote:
          | Not sure of the reason for the downvotes. Like I say, scraping
         | brittle. I've done hundreds of scraping projects in the past. A
         | recent one was taking 50 shopping sites looking for what
         | products they covered. Safe to say you look at it a month later
         | and the XPaths/CSS selectors/whatever scraping method you used
         | has changed.
         | 
         | This is why APIs exist.
        
           | Vinnl wrote:
           | Ideally sites would just publish RSS feeds themselves, but
           | not all of them do -- let alone an API...
        
             | ricardo81 wrote:
             | I can't reply to the other subcomments because of the
             | downvotes.
             | 
             | Just to be clear, I'm not against scraping. Done it myself
             | plenty, point was that it's hard to maintain and if I were
             | to make a tool for the masses, it'd be in a UI that
             | highlights elements and lets you select them rather than
             | dig around the DOM. Pretty sure things like this have
             | existed for 10 years +
        
               | Vinnl wrote:
               | Ah that's fine, this is explicitly not a tool for the
                | masses. The problem I had with tools like the ones you
                | describe is that either they just use string matching,
               | or the selectors they generate are particularly brittle
               | (e.g. incorporating randomised class names, like you
               | mentioned elsewhere).
               | 
               | Given that there wasn't really an alternative to scraping
               | for me, I wanted to at least be able to pick selectors
               | myself that were less likely to break with minor changes.
               | Then I figured there might be more people who know CSS
               | and have the same desires, hence my sharing it here :)
        
               | ricardo81 wrote:
               | I do have some code lying around somewhere that would
               | revive wordpress blogs via wayback, using a UI where you
               | could select elements in the 'theme' and restore the
               | backend database. It's simply a messy business IMO. But
               | for one off jobs, scraping is the go to.
        
             | ricardo81 wrote:
             | Indeed, but the scraping workaround is brittle. We could
             | make rules for all sites on the web that don't have a feed
             | or API but it's not all that manageable.
             | 
              | It also raises the question of why they don't publish
              | their content that way; perhaps it was by choice.
              | Especially if you use this kind of tool to syndicate
              | things.
             | 
             | Not quite sure of the novelty of this one given that people
             | have scraped things for decades.
             | 
              | There's probably a wiki endpoint for their example;
              | maybe, maybe not. A lot of wiki content is free to
              | download, and the wikis have extensive endpoints for
              | acquiring data.
        
               | Vinnl wrote:
               | If you have a less brittle alternative that allows me to
               | get the latest real estate listings in my area in my feed
               | reader before it hits the aggregator sites, I'm all ears!
        
               | eyedontgetit wrote:
               | Re-stating the "brittleness" over and over doesn't add
               | any weight to your argument. Regardless of whether or not
               | the solution is brittle - it is still a viable solution
               | for users who don't care or mind maintaining it. It
               | sounds like you're not the target user and that's ok...
        
         | [deleted]
        
         | lhnz wrote:
         | If you scrape using accessible parts of the page (e.g. `aria-`
         | attributes) it is less likely to be brittle, since if it were
         | it would mean their site had stopped being accessible to screen
         | readers, etc.
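A tiny illustration of the idea, using the standard library's ElementTree and its limited XPath support to stand in for CSS attribute selectors (a CSS-capable scraper would use something like `a[aria-label]`). The markup here is made up:

```python
# Sketch: select elements by aria- attributes, which tend to outlive
# cosmetic class-name churn. Stdlib ElementTree XPath is used for
# illustration; CSS attribute selectors do the same job elsewhere.
import xml.etree.ElementTree as ET

# Made-up markup with a randomised class name but a stable aria-label.
html = """
<div>
  <a class="x9f2k" aria-label="Next page" href="/page/2">More</a>
  <a class="q71bd" href="/about">About</a>
</div>
"""

root = ET.fromstring(html)
next_link = root.find(".//a[@aria-label='Next page']")
```

If the site keeps its accessibility attributes stable, a selector like this survives redesigns that shuffle generated class names.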
        
           | KirillPanov wrote:
           | Why the downvotes here? This is objectively true and, in my
           | experience, very very useful.
           | 
           | Screen readers are scrapers too. "Scraper" does not mean
           | "user agent I don't like".
        
             | DoreenMichele wrote:
             | Is there someplace I can read more about this idea? Any
             | aspect of it. I'm one of the less technically literate
             | members of HN, but I'm a co-owner of a blind dev group.
             | 
             | Please and thank you.
        
               | KirillPanov wrote:
               | https://developer.mozilla.org/en-
               | US/docs/Web/Accessibility/A...
        
               | [deleted]
        
         | bscphil wrote:
         | I'm not sure I understand the criticism. What exactly is
         | supposed to be the alternative? It's not like this is supposed
         | to be used for high-availability purposes or stock trading or
         | something. It scratches a particular itch for a handful of end
         | users who just want to get all their news in their feed reader.
         | It's expected that they'll maintain their own scraper
         | parameters.
        
           | ricardo81 wrote:
           | Pick a number of programming languages and you can scrape a
           | page in 10-20 lines of code in many cases. The barrier to
           | entry is understanding the DOM layout for a particular
           | website, which is subject to change at any moment.
           | 
            | Purely IMO, a friendlier way to go about it, one that
            | abstracts away code and CSS knowledge, is to run a UI that
            | highlights elements and lets you select them: pick the
            | title, description, link etc. and there you go. Same thing,
            | but without requiring knowledge of DOM/XPath/selectors.
        
             | cromka wrote:
             | > highlights elements, lets you select them,
             | 
             | And how would that software then remember your chosen
             | elements if not by their CSS ids/classes?
        
               | z3c0 wrote:
               | By positional indexing, of course.
               | 
               | Just kidding, that's horrendous.
        
               | ricardo81 wrote:
               | You must have missed the point, or I'm missing the point
               | of this thing. Select whatever you like, do I need an
               | external tool to do that? Either I have knowledge of the
               | DOM or I don't, and if I don't then a UI selector would
               | be the next best guess.
        
       | colordrops wrote:
       | This seems like it could be used to build a simple scraper as
       | well.
        
         | Vinnl wrote:
         | I'd recommend using the tools I used for this directly if
         | you're looking to do this. Playwright in particular:
         | https://playwright.dev
        
       | k1m wrote:
       | Very nice! I work on Feed Creator -
       | https://createfeed.fivefilters.org - which is similar, although
       | unlike yours doesn't use a headless browser, so selecting
        | JavaScript-inserted elements isn't supported.
        
         | Vinnl wrote:
         | That looks great! I ran into a couple of feed generators when I
         | first needed this myself, but they were either paid or clumsy
         | to use. That's when I thought, "alright, I can build this
         | myself, and then I can even use CSS selectors, since I'm
          | comfortable with those anyway". Of course, that was the point
          | at which I should've explicitly searched for one that already
          | supported that :)
         | 
         | Ah well. What I do like about my current approach is that I
         | have full control without having to run my own server, which is
         | nice.
        
           | axiolite wrote:
           | Here's a list of RSS feed generators:
           | 
           | https://github.com/AboutRSS/ALL-about-RSS#webpagehtml
        
           | k1m wrote:
           | The more the merrier! :) And always nice to see more in the
           | RSS space.
           | 
           | Feed Creator is in PHP and unlike yours doesn't produce a
           | static file. The CSS selectors, URL and other parameters are
            | all embedded in the URL, e.g.
            | https://createfeed.fivefilters.org/extract.php?url=http%3A%2F%2Fjohnpilger.com%2Farticles&item=.entry&item_desc=.intro&item_date=.entry-date
           | 
           | The biggest issue I've found is people struggling to work out
           | which CSS selectors to use. I wrote a blog post not that long
           | ago to help people use the browser's developer tools to do
           | that. Might help too with anyone trying to use this (although
           | perhaps most of the HN audience doesn't need help here):
           | https://www.fivefilters.org/2021/how-to-turn-a-webpage-
           | into-...
        
         | fran-penedo wrote:
         | Since everyone is pitching their own, I built
         | https://github.com/fran-penedo/rssify, which started as a fork
         | of https://github.com/h43z/rssify. The basic functionality is
         | similar to Vinnl's: give it a URL and some selectors and it
         | builds the RSS feed. From this, I added a few things: templates
         | (if you want to subscribe to individual projects within a
         | webpage, like fanfics in ao3), transforms (when the data is not
         | quite the text of the DOM element), a flask server you can use
         | to add new URLs you have a template for and update the feeds,
         | and a userscript to add the current URL using the server.
        
       ___________________________________________________________________
       (page generated 2021-07-05 23:00 UTC)