[HN Gopher] Readability.js
       ___________________________________________________________________
        
       Readability.js
        
       Author : stefankuehnel
       Score  : 94 points
       Date   : 2024-02-25 18:48 UTC (4 hours ago)
        
        
       | zerojames wrote:
       | I have used and love readability.js. I used it in an application
       | that lets you run various NLP analyses over a web page
       | (surprisals, reading time, word counts, etc.). For that, I needed
       | only the main page content, and readability.js extracts it well
       | and consistently.
       | 
       | The Alan Turing Institute maintains a Python wrapper around
       | readability.js, too:
       | https://github.com/alan-turing-institute/ReadabiliPy.
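
The word-count and reading-time analyses mentioned above can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual code: the function name `readingStats` and the 200 words-per-minute default are assumptions, and the input is taken to be the plain text of the extracted main content.

```javascript
// Hypothetical helper: rough word count and reading time for extracted text.
// Assumption: 200 words per minute is only a common rule of thumb.
function readingStats(text, wordsPerMinute = 200) {
  // Split on runs of whitespace and drop empty entries (e.g. for empty input).
  const words = text.trim().split(/\s+/).filter(Boolean);
  const wordCount = words.length;
  // Round up so that any non-empty text takes at least one minute.
  const minutes = Math.ceil(wordCount / wordsPerMinute);
  return { wordCount, minutes };
}

const sample = "Readability extracts the main content of a page.";
console.log(readingStats(sample));
```

Running the analysis on extracted text rather than the raw page keeps navigation, footers, and similar clutter out of the counts.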
        
       | simonw wrote:
       | I like using Readability.js as a demo for my shot-scraper CLI
       | utility:
       | https://shot-scraper.datasette.io/en/stable/javascript.html#...
       | 
       | Running this in a terminal (after installing shot-scraper):
       | 
       |     shot-scraper javascript \
       |       https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/ "
       |       async () => {
       |         const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
       |         return (new readability.Readability(document)).parse();
       |       }"
       | 
       | Outputs this:
       | 
       |     {
       |       "title": "Scraping web pages from the command line with shot-scraper",
       |       "byline": null,
       |       "dir": null,
       |       "lang": "en-gb",
       |       "content": "... long string of HTML ...",
       |       "length": 7104,
       |       "excerpt": "I\u2019ve added a powerful new capability to my shot-scraper command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript \u2026",
       |       "siteName": null,
       |       "publishedTime": null
       |     }
        
         | nozzlegear wrote:
         | This is really cool, thanks for sharing!
        
         | stefankuehnel wrote:
         | Wow, that's really cool! Thanks for sharing. I knew about
         | "shot-scraper" before, but I didn't know you could do something
         | so cool with it.
        
       | Zhyl wrote:
       | Is there a way of using this in conjunction with, say, curl? I'd
       | love to be able to grab clean web pages for offline use and
       | printing.
        
         | simonw wrote:
         | See my comment here!
         | https://news.ycombinator.com/item?id=39504105
        
         | input_sh wrote:
         | If you run it inside a container, it's fairly simple:
         | https://github.com/phpdocker-io/readability-js-server
        
       | oulipo wrote:
       | Is there a way to use it as a snippet to copy-paste into the
       | console, or as a link to paste into the URL bar, in order to
       | convert a poorly formatted webpage into a nicer one?
        
       | johnchristopher wrote:
       | > minScore (number, default 20): the minimum cumulated 'score'
       | used to determine if the document is readerable;
       | 
       | Ahaaa, so that's why sometimes I get the reader icon and
       | sometimes not (especially on mobile).
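
The cumulative score behind `minScore` can be pictured with a simplified, DOM-free sketch. It assumes that each sufficiently long text block adds the square root of its length beyond a minimum content length to a running total, and that the page counts as readerable once the total exceeds `minScore`. The function name, the array-of-strings input, and the `minContentLength` default of 140 are assumptions for illustration; the real check works on DOM nodes and applies further filters, such as visibility.

```javascript
// Simplified sketch of a cumulative "readerable" score (not the library's code).
// Assumption: each long-enough block contributes sqrt(length - minContentLength).
function isProbablyReaderableSketch(blockTexts, options = {}) {
  const { minScore = 20, minContentLength = 140 } = options;
  let score = 0;
  for (const text of blockTexts) {
    const length = text.trim().length;
    if (length < minContentLength) continue; // short blocks contribute nothing
    score += Math.sqrt(length - minContentLength);
    if (score > minScore) return true; // enough long text has accumulated
  }
  return false;
}

// One medium block: sqrt(200 - 140) is about 7.7, below the default of 20.
const sparsePage = ["Menu", "Log in", "x".repeat(200)];
// Two long blocks: 2 * sqrt(400 - 140) is about 32.2, above the default.
const articlePage = ["x".repeat(400), "x".repeat(400)];

console.log(isProbablyReaderableSketch(sparsePage));
console.log(isProbablyReaderableSketch(articlePage));
console.log(isProbablyReaderableSketch(sparsePage, { minScore: 5 }));
```

Under these assumptions, a page with only a little long-form text stays below the threshold, so a different rendering of the same page that exposes fewer or shorter text blocks could plausibly flip the result.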
        
       | sabr wrote:
       | Readability is awesome! I used it to build Smort.io [1] to easily
       | read articles & ArXiv papers.
       | 
       | [1] https://smort.io
        
       ___________________________________________________________________
       (page generated 2024-02-25 23:01 UTC)