[HN Gopher] Monolith - CLI tool for saving complete web pages as...
       ___________________________________________________________________
        
       Monolith - CLI tool for saving complete web pages as a single HTML
       file
        
       Author : iscream26
       Score  : 64 points
       Date   : 2024-03-24 20:48 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | lagt_t wrote:
        | I remember IE5 was able to do this lol. It fell out of vogue
        | for some reason; glad to see the concept is still alive.
        
         | berkes wrote:
         | Firefox can still do it.
        
         | phrz wrote:
         | Safari does this with .webarchive files
        
       | toomuchtodo wrote:
       | Related:
       | 
       |  _Show HN: CLI tool for saving web pages as a single file_ -
       | https://news.ycombinator.com/item?id=20774322 - August 2019 (209
       | comments)
        
       | joeyhage wrote:
        | It would be awesome to see support for following links to a
        | specified depth, similar to HTTrack (https://www.httrack.com/).
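        | 
        | A minimal sketch of what a depth-1 version could look like,
        | driving the monolith binary from Rust (assumes monolith is on
        | PATH and `-o` as its output flag; the regex crate is a
        | dependency and the link extraction is deliberately naive):
        | 
        |     // Save the start page with monolith, then every page it
        |     // links to (depth 1 only).
        |     use std::process::Command;
        | 
        |     fn save(url: &str, out: &str) {
        |         Command::new("monolith")
        |             .args([url, "-o", out])
        |             .status()
        |             .expect("failed to run monolith");
        |     }
        | 
        |     fn main() {
        |         let start = "https://example.com/";
        |         save(start, "page-0.html");
        |         let html = std::fs::read_to_string("page-0.html").unwrap();
        |         // A real version would parse the HTML properly.
        |         let re = regex::Regex::new(r#"href="(https?://[^"]+)""#)
        |             .unwrap();
        |         for (i, cap) in re.captures_iter(&html).enumerate() {
        |             save(&cap[1], &format!("page-{}.html", i + 1));
        |         }
        |     }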
        
         | codetrotter wrote:
          | I made a basic crawler using Firefox, thirtyfour
          | (https://docs.rs/thirtyfour/latest/thirtyfour/), and Squid.
         | 
          | Basically, I took a start URL for the crawl; my program would
          | load the page in Firefox via thirtyfour, extract all the
          | links from it, and apply some basic rules to decide which
          | ones to visit and in what order. I had Squid configured as a
          | proxy to save all traffic that passed through it.
         | 
         | It worked ok-ish. I only really stopped that project because of
         | a hardware malfunction.
         | 
          | The main annoyance I never got around to solving was being
          | smarter about not re-fetching non-HTML content that had
          | already been loaded as part of the page: the way I extracted
          | links meant I also picked up the URLs of the JS, CSS, etc.
          | that the page referenced.
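          | 
          | A minimal sketch of what that loop can look like (not the
          | original code; it assumes a recent thirtyfour, tokio, and a
          | geckodriver listening on localhost:4444, and it leaves out
          | the Squid side entirely):
          | 
          |     // Crawl loop: load a page, queue its links, and skip
          |     // URLs that were already visited.
          |     use std::collections::{HashSet, VecDeque};
          |     use thirtyfour::prelude::*;
          | 
          |     #[tokio::main]
          |     async fn main() -> WebDriverResult<()> {
          |         let caps = DesiredCapabilities::firefox();
          |         let driver =
          |             WebDriver::new("http://localhost:4444", caps).await?;
          |         let mut queue =
          |             VecDeque::from(["https://example.com/".to_string()]);
          |         let mut seen = HashSet::new();
          |         while let Some(url) = queue.pop_front() {
          |             if !seen.insert(url.clone()) {
          |                 continue; // already visited
          |             }
          |             driver.goto(url.as_str()).await?;
          |             for a in driver.find_all(By::Tag("a")).await? {
          |                 if let Some(href) = a.attr("href").await? {
          |                     // Crude fix for the JS/CSS annoyance
          |                     // described above: skip subresources.
          |                     if href.ends_with(".js")
          |                         || href.ends_with(".css")
          |                     {
          |                         continue;
          |                     }
          |                     queue.push_back(href);
          |                 }
          |             }
          |         }
          |         driver.quit().await
          |     }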
        
       | arp242 wrote:
       | I wrote something very similar a few years ago -
       | https://github.com/arp242/singlepage
       | 
        | I mostly use it for a few Go programs where I generate HTML: I
        | can "just" link to external stylesheets and JavaScript, which
        | is more convenient to work with, and then process the output
        | into a single HTML file.
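        | 
        | A toy illustration of that inlining step (hypothetical code,
        | not singlepage itself, which handles far more cases; this
        | sketch uses Rust's regex crate and only inlines local
        | stylesheets):
        | 
        |     // Replace local <link rel="stylesheet" href="..."> tags
        |     // with the referenced file's contents in a <style> tag.
        |     use regex::Regex;
        | 
        |     fn inline_styles(html: &str) -> String {
        |         let re = Regex::new(
        |             r#"<link rel="stylesheet" href="([^"]+)">"#).unwrap();
        |         re.replace_all(html, |caps: &regex::Captures| {
        |             match std::fs::read_to_string(&caps[1]) {
        |                 Ok(css) => format!("<style>{}</style>", css),
        |                 // Leave remote or unreadable links untouched.
        |                 Err(_) => caps[0].to_string(),
        |             }
        |         })
        |         .into_owned()
        |     }
        | 
        |     fn main() {
        |         let html = std::fs::read_to_string("index.html").unwrap();
        |         print!("{}", inline_styles(&html));
        |     }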
        
       | lopkeny12ko wrote:
       | How does this compare to SingleFile?
       | 
       | https://www.npmjs.com/package/single-file-cli
        
       | simonw wrote:
        | Well this is fun... from the README here I learned I can do this
        | on macOS:
        | 
        |     /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
        |       --headless --incognito --dump-dom https://github.com \
        |       > /tmp/github.html
        | 
        | And get an HTML file for a page after the JavaScript has been
        | executed.
       | 
        | My own https://shot-scraper.datasette.io/ tool (which uses
        | headless Playwright Chromium under the hood) has a command for
        | this too:
        | 
        |     shot-scraper html https://github.com/ > /tmp/github.html
        | 
        | But it's neat that you can do it with just Google Chrome
        | installed and nothing else.
        
       ___________________________________________________________________
       (page generated 2024-03-24 23:00 UTC)