[HN Gopher] Show HN: RSS feeds for arbitrary websites using CSS ...
___________________________________________________________________
Show HN: RSS feeds for arbitrary websites using CSS selectors
Author : Vinnl
Score : 309 points
Date : 2021-07-05 16:28 UTC (6 hours ago)
(HTM) web link (feed-me-up-scotty.vincenttunru.com)
(TXT) w3m dump (feed-me-up-scotty.vincenttunru.com)
| ianberdin wrote:
| My friend is working on a fairly simple feed generator for any
| website or social-media page: https://rss.app. Maybe it's helpful.
| mhitza wrote:
| This could nicely supplement my GitHub automation that emails
| feed digests https://github.com/mhitza/subscriptions-digest
|
| As in my repository, I'd suggest adding the option to fetch the
| configuration file from an external resource defined via an
| action secret. For my automation I'm using a Gist (not sure if
| GitLab has the same thing: private but publicly accessible
| snippets).
|
| That way you can keep your own feed configuration, and those who
| fork the repository don't have to manually fix conflicts in the
| feeds.toml config.
| [deleted]
| markdown wrote:
| Great project, but that's a shitty name.
| Vinnl wrote:
| Heh, one thing I've learned over the years is that a great way
| to not actually publish my side projects is to overthink the
| name, so nowadays I just pick one and go with it when I'm ready
| to publish it.
| markdown wrote:
| I can't disagree with that :)
| chrismorgan wrote:
| I'm glad to see that you're using Atom rather than RSS, but
| you're still _calling_ it RSS. Can I convince you to stop calling
| it that? It's factually inaccurate, and keeps RSS, the inferior
| format, more popular by virtue of mindshare. Just call them
| feeds, the proper generic term.
| pmlnr wrote:
| Oh, yes, please revive the bloody format wars /s
| nvr219 wrote:
| If OP hadn't used the word RSS and had said "atom feeds", I
| wouldn't know what they were talking about, so I recommend they
| keep calling it RSS.
| moehm wrote:
| Same problem with SSL encryption. But with both RSS and SSL, if
| the user uses a modern client, they shouldn't have to care
| whether it's internally using RSS, Atom, SSL or TLS.
|
| For my part, I know most of my pals do know about RSS, but they
| haven't heard the term "Atom feed".
| chrismorgan wrote:
| "SSL" is _finally_ dying as a term; the significant majority
| of what I see calls it TLS now.
|
| But the problem didn't exist in the same way with SSL/TLS:
| SSL was actively killed off in favour of TLS so that
| regardless of what you _call_ it, you're actually dealing
| with TLS. But with feeds, RSS is still supported, and so the
| mindshare problem happens: people hear about RSS and so
| implement the inferior and problematic RSS rather than Atom,
| because they've never even heard of Atom and so don't know
| that it's what they should have implemented instead, because
| it's more robust. Calling feeds RSS perpetuates RSS, which is
| undesirable.
|
| I say just advertise it as "feeds", not "RSS feeds". Atom is
| an implementation detail, just like RSS _should_ be. (Well,
| RSS should be _dead_ in general feeds, with only podcast
| feeds keeping it alive.)
| bscphil wrote:
| > I say just advertise it as "feeds", not "RSS feeds".
|
| I think that's even worse. For most ordinary people that
| has a bunch of unintended meanings. "My Facebook has a
| feed, do you mean subscribing to your site on Facebook? Why
| do I see a bunch of code when I click the link?"
| alanfranz wrote:
| I do not care. RSS and Atom both win over closed platforms. If
| I could get back to an RSS-oriented internet, I would rejoice
| even if Atom didn't exist.
| cyborgx7 wrote:
| this is something I've been looking for
|
| neat
| canada_dry wrote:
| Browsing OP's repositories, I found this gem:
| https://gitlab.com/vincenttunru/flatuscode _A VSCode extension.
| That adds farts on every keypress. That's all._
| hallway_monitor wrote:
| Pretty great for April 1st... Or whenever you find an unlocked
| machine!
| Vinnl wrote:
| If you're back in the office, I can recommend enabling it for
| just one particular programming language on your coworker's
| VSCode.
| canada_dry wrote:
| When I saw it, I did a quick calculation based on the last
| update: it worked out to around Christmas. OP might have just
| been bored over the holidays.
| Vinnl wrote:
| Haha yeah, came up with the idea and went right ahead and
| implemented it. Probably should do a Show HN on April 1st
| though -- just set a reminder for next year.
| rcarmo wrote:
| This is nice. I'm actually doing much the same with Node-RED's
| HTML parser, which also supports simple selectors.
| Vinnl wrote:
| Thanks. I initially used a regular HTML parser as well, but I
| quickly ran into sites that wouldn't render without JavaScript.
| I'm therefore now using a regular browser controlled by
| Playwright to fetch the websites.
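| For illustration, a minimal sketch of that approach (not the
| project's actual code; the URL and selector below are made up),
| using Playwright's chromium.launch, page.goto and page.$$eval:
        import { chromium } from "playwright";

        async function fetchItems(url: string, selector: string) {
          const browser = await chromium.launch();
          const page = await browser.newPage();
          // Let the site's JavaScript render before querying the DOM.
          await page.goto(url, { waitUntil: "networkidle" });
          // Collect a title and link for each element matching the selector.
          const items = await page.$$eval(selector, (elements) =>
            elements.map((el) => ({
              title: el.textContent?.trim() ?? "",
              link: (el as HTMLAnchorElement).href,
            }))
          );
          await browser.close();
          return items;
        }

        fetchItems("https://example.com/listings", ".listing h2 a")
          .then((items) => console.log(items));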
| axiolite wrote:
| Care to name any sites? I've always managed to find
| workarounds for everything I've wanted to follow. Most
| websites want to be indexed by search engines, and Googlebot
| doesn't do JavaScript, so sometimes a forged user agent is
| all you need. Occasionally, finding the actual JSON file and
| parsing the info you need out of it does the job, etc.
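| As a sketch of the user-agent trick (the URL is a placeholder,
| and whether a given site honours it varies), assuming the
| node-fetch package:
        import fetch from "node-fetch";

        // Some sites serve pre-rendered HTML to crawlers, so pretending
        // to be Googlebot occasionally gets you that version.
        const GOOGLEBOT_UA =
          "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

        async function fetchAsGooglebot(url: string): Promise<string> {
          const response = await fetch(url, {
            headers: { "User-Agent": GOOGLEBOT_UA },
          });
          if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
          return response.text();
        }

        fetchAsGooglebot("https://example.com/listings")
          .then((html) => console.log(html.length, "bytes of HTML"));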
| moehm wrote:
| > googlebot doesn't do javascript
|
| But it does.
|
| Primary source: I experienced it on my own site.
|
| Secondary source:
| https://developers.google.com/search/docs/guides/fix-
| search-...
| chrismorgan wrote:
| This changes from time to time, of course, but when last
| I investigated, around two years ago, consensus was that
| it mostly wouldn't do JavaScript until you nudged it into
| doing so in some way that I forget, and that it was
| always slower to index/update if it needed to do
| JavaScript.
|
| (For my part, I disable JavaScript by default for various
| reasons, mostly performance, and it's decidedly uncommon
| for a general-internet site to be completely broken by
| it. Sites that get posted on HN are disproportionately
| JS-dependent, especially if they're new.)
| Vinnl wrote:
| The User Agent trick is a good one that I should've tried,
| but I just checked and it didn't work for this one. Parsing
| actual JSON wasn't really an option, as I wanted to be able
| to quickly and easily add RSS feeds.
|
| Possibly SEO is less of a concern for the type of website I
| initially made this for, i.e. Dutch real estate agents.
| Most people find their listings through funda.nl rather
| than through search engines; I was just hoping to see them
| listed before they got posted there.
|
| Send me a message on Twitter or email me (hacker_news@ my
| domain) if you still want the URL of a failing website to
| play around with.
| 0des wrote:
| Woah, this is cool! What you did with the setup documentation and
| the bit about automation is a nice touch; I wish more projects
| had this attention to detail. Very simple, useful and elegant.
| Thanks for sharing this with us, Vincent!
| alanchen wrote:
| Related:
|
| https://github.com/DIYgod/RSSHub
|
| This perhaps has more flexibility and can deal with almost any
| website.
| bellyfullofbac wrote:
| I wish there were a snippet of what the XML looks like; even
| better if it were "rendered"...
| Vinnl wrote:
| That's a good one, I'll see if I can add something to the site.
| Meanwhile you can see the generated examples here:
|
| - https://vincenttunru.gitlab.io/feeds/funfacts.xml
|
| - https://vincenttunru.gitlab.io/feeds/wikivoyage.xml
|
| And the combined feed:
|
| - https://vincenttunru.gitlab.io/feeds/all.xml
|
| Add those links to your feed reader to see rendered examples
| while I update the site.
|
| Edit: preview added to the website.
| moehm wrote:
| FYI, the links inside your feeds don't work, because they are
| relative, not absolute.
| Vinnl wrote:
| Ah, you mean the ones inside the contents? That's a good
| one. I'm not sure if that's easily fixable, but I'll give
| it some thought. For those interested, I'll track it here:
| https://gitlab.com/vincenttunru/feed-me-up-scotty/-/issues/1
| chrismorgan wrote:
| The appropriate way is to use the xml:base attribute, as
| demonstrated in the example at
| https://datatracker.ietf.org/doc/html/rfc4287#page-4.
| Vinnl wrote:
| Thanks! I'll look at that.
| chrismorgan wrote:
| > _(Of course, for the combined feed this would be
| problematic.)_
|
| Not so; unlike the HTML <base> element which applies to a
| document, the xml:base attribute is applied to an element
| and its descendants. The typical pattern (as shown in the
| RFC 4287 example) is to put it on each entry's <content>.
| In your markup, you'll end up with each entry having its
| URL in three places:
        <id>http://example.com/item</id>
        <link href="http://example.com/item"/>
        <content type="html"
                 xml:base="http://example.com/item">...</content>
| Vinnl wrote:
| Excellent! I'll look into actually implementing this
| before making further comments, since I'm sure I'll find
| out such things as I do :P
|
| Edit: the package I'm using to generate the feeds does
| not support that attribute yet, so it'll have to wait a
| bit for my PR to hopefully be accepted:
| https://github.com/jpmonette/feed/pull/158
|
| Thanks for the pointers!
| kdbg wrote:
| On a somewhat related note, I found myself needing to make a bunch
| of these sorts of scraped feeds. The problem for me was the lack of
| date-parsing support, which I sorely needed.
|
| I ended up writing my own CLI tool that similarly supports CSS
| selectors for feed generation:
| https://github.com/dayzerosec/feedgen
|
| I did write it specifically for my use-case so there are some
| "warts" on it like custom generators for HackerOne and Google's
| Monorail bug tracker. But perhaps someone else might benefit from
| its ability to create slightly more complicated RSS, Atom, or
| JSON feeds.
|
| Example config with date parsing:
| https://github.com/dayzerosec/feedgen/blob/main/configs/bish...
| pacman2 wrote:
| Thank you, will try it out.
|
| Two things I am using:
|
| Twitter to RSS: https://github.com/RSS-Bridge/rss-bridge
|
| Arbitrary RSS feeds: https://feedity.com
| lorey wrote:
| In case anyone wants to detect the selectors automatically,
| here's a small python library I wrote that does it for you:
| https://github.com/lorey/mlscraper
| kwerk wrote:
| This looks interesting! Will you be adding an open source
| license?
| clickok wrote:
| A good idea and very cleanly implemented. I imagine that
| there's a ton of other possible applications that don't require
| much modification to the code. Thanks for sharing!
| spacec0wb0y wrote:
| What do I do to get this working?
|
| So far I've forked feeds, edited feeds.toml, checked it out as a
| gh-pages branch, and pushed the branch up to GitHub.
|
| I can see the page at https://<username>.github.io/feeds/ but
| https://<username>.github.io/feeds/actions is just a 404.
| Vinnl wrote:
| Ah, those instructions are unclear -- as far as I know, you
| first have to go to https://github.com/<username>/feeds/actions
| to enable Workflows for your repository. Then, your feeds
| should be published to
| https://<username>.github.io/feeds/<feedname>.xml.
|
| Does that work?
| spacec0wb0y wrote:
| Not exactly. I saw the instruction below to run
| `npx feed-me-up-scotty`, so I did, and it generated the
| public/ dir and the feeds.
|
| OK, I managed to access the actions route via a different URL
| than the one in the readme. I copied the pages.yml workflow
| from your repo.
|
| A few minutes later I could see my feed. Very nice! I need to
| clean up my selectors now!
| remram wrote:
| Couldn't those selectors be maintained by the community? Instead
| of everyone deploying this on their own GitHub Actions, and
| having to fix it independently when it breaks, a single repo with
| all kinds of feeds maintained by everyone?
| Vinnl wrote:
| That's an interesting idea: something like DefinitelyTyped, but
| instead of type definitions for npm packages it provides
| selectors for URLs. Main challenge there would be organising
| the moderation, I suppose.
| contingencies wrote:
| It should be feasible to analyse the structure of the
| extracted data over time; any proposed change that breaks the
| anticipated rhythm would then be suspect.
| spicybright wrote:
| I love very simple plumbing tools like these.
| [deleted]
| toastal wrote:
| This reminds me of the Soupault (https://soupault.app/)
| philosophy for building static sites. You write it in any
| language you like, pass it through Pandoc or AsciiDoctor as a
| preprocessor, and postprocess with Lua and CSS selectors.
| vhodges wrote:
| You both might find https://stitcherd.vhodges.dev/
| interesting then.
| ricardo81 wrote:
| Nice, but it's essentially scraping? Scraping is brittle.
|
| Some design changes and the whole thing breaks. Maintenance
| nightmare.
| kortilla wrote:
| The website doesn't offer an API, so we must scrape. This is a
| tool for that paradigm.
|
| Your comment is "scraping is brittle", which is not helpful.
| Everyone knows scraping sucks. That's why tools like this are
| being made, to make it less unpleasant.
| ricardo81 wrote:
| It's not even that "scraping sucks"; that's just a stigma
| attached to the word. The point is that it's not stable, and
| having to have DOM knowledge to pick the paths is less inclusive.
| moehm wrote:
| > having to have DOM knowledge to select the paths is less
| inclusive
|
| Right-click on the element in Firefox, click "Inspect
| Element", and it shows you the unique selector for that
| element.
| ricardo81 wrote:
| Are you aware that some websites use randomised CSS ids
| and classes? One has to wonder why. Google search results
| are an example.
| playpause wrote:
| There's usually some way to target what you want through
| descendant/sibling, tag name, and attribute selectors.
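| As a small sketch of that (the markup and selectors here are
| invented, not taken from any real site): instead of relying on
| a generated class name, anchor on structure and stable
| attributes.
        // Fragile: depends on a randomised, build-specific class name.
        document.querySelectorAll(".xG7kQz > a");

        // Sturdier: depends on tag structure and an attribute pattern,
        // e.g. "links to /listings/ inside the main element's list items".
        document.querySelectorAll('main li a[href*="/listings/"]');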
| ricardo81 wrote:
| Yes, if you're determined enough you can scrape anything.
| Just not sure what the new thing here is.
| detaro wrote:
| who has claimed there is something fundamentally new
| here?
| ricardo81 wrote:
| Presenting something on the home page of HN would suggest
| something novel. I must be missing the point because of
| the downvotes. Happy to be enlightened. Pick CSS
| selectors and scrape something? That's quintessentially
| scraping.
| onli wrote:
| I'd say the novel part of the project is the GitHub
| Actions/GitLab CI integration; I haven't seen that used
| for RSS generation yet. That the selectors are stored in
| a config file is also unusual for these types of
| feed-generation projects.
|
| Though Show HNs don't need to be novel to get to the top,
| and RSS-related projects seem to reach the front page
| simply because of the fondness HN has for the technology.
| Also, don't forget https://xkcd.com/1053/ :)
| moehm wrote:
| Yes, I am. And this project is obviously not designed to
| handle that. (It can't even do pagination.)
|
| But it can bring feed capabilities to simple, timeline-like
| sites, the kind most frameworks produce. And with the dev
| tools you can quickly find the needed DOM path. It's
| limited, but easy to use. If it doesn't work, you need a
| real scraper, which is an order of magnitude more complex.
| (I maintain some of those as well.)
| ricardo81 wrote:
| Not sure of the reason for the downvotes. Like I say, scraping is
| brittle. I've done hundreds of scraping projects in the past; a
| recent one covered 50 shopping sites, looking at which products
| they carried. Safe to say, when you look at it a month later, the
| XPaths/CSS selectors/whatever scraping method you used no longer
| match.
|
| This is why APIs exist.
| Vinnl wrote:
| Ideally sites would just publish RSS feeds themselves, but
| not all of them do -- let alone an API...
| ricardo81 wrote:
| I can't reply to the other subcomments because of the
| downvotes.
|
| Just to be clear, I'm not against scraping; I've done plenty
| of it myself. My point was that it's hard to maintain, and if
| I were to make a tool for the masses, it'd be a UI that
| highlights elements and lets you select them, rather than
| making you dig around the DOM. Pretty sure things like this
| have existed for 10+ years.
| Vinnl wrote:
| Ah that's fine, this is explicitly not a tool for the
| masses. The problem I had with tools like the ones you
| describe is that either they just use string matching,
| or the selectors they generate are particularly brittle
| (e.g. incorporating randomised class names, like you
| mentioned elsewhere).
|
| Given that there wasn't really an alternative to scraping
| for me, I wanted to at least be able to pick selectors
| myself that were less likely to break with minor changes.
| Then I figured there might be more people who know CSS
| and have the same desires, hence my sharing it here :)
| ricardo81 wrote:
| I do have some code lying around somewhere that would
| revive WordPress blogs via the Wayback Machine, using a UI
| where you could select elements in the 'theme' and restore
| the backend database. It's simply a messy business IMO. But
| for one-off jobs, scraping is the go-to.
| ricardo81 wrote:
| Indeed, but the scraping workaround is brittle. We could
| make rules for all the sites on the web that don't have a
| feed or API, but it's not all that manageable.
|
| It also raises the question of why they don't offer their
| content that way; perhaps it was by choice, especially if
| you use this kind of tool to syndicate things.
|
| Not quite sure of the novelty of this one, given that people
| have scraped things for decades.
|
| There's probably a wiki endpoint for their example; maybe,
| maybe not. A lot of wiki content is free to download, and
| wikis have extensive endpoints for acquiring data.
| Vinnl wrote:
| If you have a less brittle alternative that allows me to
| get the latest real estate listings in my area in my feed
| reader before it hits the aggregator sites, I'm all ears!
| eyedontgetit wrote:
| Restating the "brittleness" over and over doesn't add any
| weight to your argument. Regardless of whether or not the
| solution is brittle, it is still viable for users who don't
| care, or who don't mind maintaining it. It sounds like
| you're not the target user, and that's OK...
| [deleted]
| lhnz wrote:
| If you scrape using accessible parts of the page (e.g. `aria-`
| attributes), it is less likely to be brittle: if those selectors
| broke, it would mean the site had stopped being accessible to
| screen readers, etc.
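| A minimal sketch of that idea (the role and aria-label values
| are hypothetical; they depend entirely on the site's markup):
        // ARIA roles and labels exist for assistive technology, so they
        // tend to outlive cosmetic class-name churn.
        const articles = document.querySelectorAll('[role="article"]');
        const titles = document.querySelectorAll(
          '[role="article"] a[aria-label]'
        );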
| KirillPanov wrote:
| Why the downvotes here? This is objectively true and, in my
| experience, very very useful.
|
| Screen readers are scrapers too. "Scraper" does not mean
| "user agent I don't like".
| DoreenMichele wrote:
| Is there someplace I can read more about this idea? Any
| aspect of it. I'm one of the less technically literate
| members of HN, but I'm a co-owner of a blind dev group.
|
| Please and thank you.
| KirillPanov wrote:
| https://developer.mozilla.org/en-
| US/docs/Web/Accessibility/A...
| [deleted]
| bscphil wrote:
| I'm not sure I understand the criticism. What exactly is
| supposed to be the alternative? It's not like this is supposed
| to be used for high-availability purposes or stock trading or
| something. It scratches a particular itch for a handful of end
| users who just want to get all their news in their feed reader.
| It's expected that they'll maintain their own scraper
| parameters.
| ricardo81 wrote:
| Pick any of a number of programming languages and, in many
| cases, you can scrape a page in 10-20 lines of code. The
| barrier to entry is understanding the DOM layout for a
| particular website, which is subject to change at any moment.
|
| Purely IMO, a friendlier way to go about it, one that
| abstracts away the code and CSS knowledge, is to run a UI
| that highlights elements and lets you select them: pick the
| title, description, link etc. and there you go. Same thing,
| but without needing knowledge of the DOM/XPath/selectors.
| cromka wrote:
| > highlights elements, lets you select them,
|
| And how would that software then remember your chosen
| elements if not by their CSS ids/classes?
| z3c0 wrote:
| By positional indexing, of course.
|
| Just kidding, that's horrendous.
| ricardo81 wrote:
| You must have missed the point, or I'm missing the point
| of this thing. Select whatever you like; do I need an
| external tool to do that? Either I have knowledge of the
| DOM or I don't, and if I don't, then a UI selector would
| be the next best guess.
| colordrops wrote:
| This seems like it could be used to build a simple scraper as
| well.
| Vinnl wrote:
| I'd recommend using the tools I used for this directly if
| you're looking to do this. Playwright in particular:
| https://playwright.dev
| k1m wrote:
| Very nice! I work on Feed Creator -
| https://createfeed.fivefilters.org - which is similar, although
| unlike yours doesn't use a headless browser, so selecting
| JavaScript-inserted elements isn't supported.
| Vinnl wrote:
| That looks great! I ran into a couple of feed generators when I
| first needed this myself, but they were either paid or clumsy
| to use. That's when I thought, "alright, I can build this
| myself, and then I can even use CSS selectors, since I'm
| comfortable with those anyway". Of course, that was the moment
| I should've thought of explicitly searching for an existing
| tool that supported them :)
|
| Ah well. What I do like about my current approach is that I
| have full control without having to run my own server, which is
| nice.
| axiolite wrote:
| Here's a list of RSS feed generators:
|
| https://github.com/AboutRSS/ALL-about-RSS#webpagehtml
| k1m wrote:
| The more the merrier! :) And always nice to see more in the
| RSS space.
|
| Feed Creator is in PHP and, unlike yours, doesn't produce a
| static file. The CSS selectors, URL and other parameters are
| all embedded in the URL, e.g.
| https://createfeed.fivefilters.org/extract.php?url=http%3A%2F%2Fjohnpilger.com%2Farticles&item=.entry&item_desc=.intro&item_date=.entry-date
|
| The biggest issue I've found is people struggling to work out
| which CSS selectors to use. I wrote a blog post not that long
| ago to help people use the browser's developer tools for that.
| It might help anyone trying to use this too (although perhaps
| most of the HN audience doesn't need help here):
| https://www.fivefilters.org/2021/how-to-turn-a-webpage-
| into-...
| fran-penedo wrote:
| Since everyone is pitching their own, I built
| https://github.com/fran-penedo/rssify, which started as a fork
| of https://github.com/h43z/rssify. The basic functionality is
| similar to Vinnl's: give it a URL and some selectors and it
| builds the RSS feed. On top of this, I added a few things:
| templates (if you want to subscribe to individual projects within
| a webpage, like fanfics on AO3), transforms (for when the data is
| not quite the text of the DOM element), a Flask server you can
| use to add new URLs you have a template for and to update the
| feeds, and a userscript to add the current URL via the server.
___________________________________________________________________
(page generated 2021-07-05 23:00 UTC)