[HN Gopher] Readability.js
___________________________________________________________________
Readability.js
Author : stefankuehnel
Score : 94 points
Date : 2024-02-25 18:48 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| zerojames wrote:
| I have used and love readability.js. I used it in an application
| that lets you run various NLP analyses over a web page
| (surprisals, reading time, word counts, etc.). For that, I needed
| only the main page content. readability.js retrieves main page
| content well, consistently.
|
| The Alan Turing Institute maintains a Python wrapper around
| readability.js, too:
| https://github.com/alan-turing-institute/ReadabiliPy
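
As a rough illustration of the simpler analyses mentioned above (word counts and reading time), here is a minimal sketch that works on plain text such as the `textContent` field of a parsed article. The function name `readingStats` and the default of 200 words per minute are assumptions made for this example, not part of readability.js.

```javascript
// Hypothetical helper: word count and estimated reading time for plain text.
// Assumption: 200 words per minute is only a common rule of thumb.
function readingStats(text, wordsPerMinute = 200) {
  // Split on runs of whitespace; filter(Boolean) drops the empty string
  // that split() yields for empty input.
  const words = text.trim().split(/\s+/).filter(Boolean);
  return {
    wordCount: words.length,
    readingMinutes: Math.ceil(words.length / wordsPerMinute),
  };
}

const sample = "Readability extracts the main content of a page.";
console.log(readingStats(sample)); // { wordCount: 8, readingMinutes: 1 }
```

Running the analysis on extracted text rather than the raw page keeps navigation, footers, and ads out of the counts.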
| simonw wrote:
| I like using Readability.js as a demo for my shot-scraper CLI
| utility:
| https://shot-scraper.datasette.io/en/stable/javascript.html#...
|
| Running this in a terminal (after installing shot-scraper):
|     shot-scraper javascript \
|       https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/ "
|     async () => {
|       const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
|       return (new readability.Readability(document)).parse();
|     }"
|
| Outputs this:
|
|     {
|         "title": "Scraping web pages from the command line with shot-scraper",
|         "byline": null,
|         "dir": null,
|         "lang": "en-gb",
|         "content": "... long string of HTML ...",
|         "length": 7104,
|         "excerpt": "I\u2019ve added a powerful new capability to my shot-scraper command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript \u2026",
|         "siteName": null,
|         "publishedTime": null
|     }
| nozzlegear wrote:
| This is really cool, thanks for sharing!
| stefankuehnel wrote:
| Wow, that's really cool! Thanks for sharing. I knew about
| "shot-scraper" before, but I didn't know you could do something
| so cool with it.
| Zhyl wrote:
| Is there a way of using this in conjunction with, say, curl? I'd
| love to be able to grab clean web pages for offline use and
| printing.
| simonw wrote:
| See my comment here!
| https://news.ycombinator.com/item?id=39504105
| input_sh wrote:
| If you run it inside a container, it's fairly simple:
| https://github.com/phpdocker-io/readability-js-server
| oulipo wrote:
| Is there a way to use it as a snippet to paste into the console,
| or as a link to paste into the URL bar, to convert a poorly
| formatted webpage into a nicer one?
| johnchristopher wrote:
| > minScore (number, default 20): the minimum cumulated 'score'
| used to determine if the document is readerable;
|
| Ahaaa, so that's why sometimes I get the reader icon and
| sometimes not (especially on mobile).
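
To make the "cumulated score" idea concrete, here is a simplified sketch of how such a threshold check can work. It assumes each sufficiently long text block adds the square root of its length beyond a minimum content length (the README lists `minContentLength` with a default of 140), and that the page counts as readerable once the running total exceeds `minScore`. The function name and the array-of-strings input are hypothetical; the library's own check runs on DOM nodes and also filters them by visibility and class or id names.

```javascript
// Simplified sketch of a cumulative "readerable" score.
// Assumption: each qualifying block adds sqrt(length - minContentLength).
// Not the library's implementation, which inspects DOM nodes.
function isProbablyReaderableSketch(blocks, options = {}) {
  const { minScore = 20, minContentLength = 140 } = options;
  let score = 0;
  for (const text of blocks) {
    const length = text.trim().length;
    if (length < minContentLength) continue; // too short to count
    score += Math.sqrt(length - minContentLength);
    if (score > minScore) return true; // threshold crossed, stop early
  }
  return false;
}

// Hypothetical inputs: one long paragraph vs. many short snippets.
const article = ["x".repeat(700)];
const linkList = Array(50).fill("x".repeat(100));
console.log(isProbablyReaderableSketch(article));  // true
console.log(isProbablyReaderableSketch(linkList)); // false
```

Under these assumptions, a page made of many short blocks never accumulates any score, which would fit the observation that the reader icon appears on some pages and not on others.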
| sabr wrote:
| Readability is awesome! I used it to build Smort.io [1] to easily
| read articles & ArXiv papers.
|
| [1] https://smort.io
___________________________________________________________________
(page generated 2024-02-25 23:01 UTC)