[HN Gopher] Monolith - CLI tool for saving complete web pages as...
___________________________________________________________________
Monolith - CLI tool for saving complete web pages as a single HTML
file
Author : iscream26
Score : 64 points
Date : 2024-03-24 20:48 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| lagt_t wrote:
| I remember IE5 was able to do this lol. It fell out of vogue for
| some reason, glad to see the concept is still alive.
| berkes wrote:
| Firefox can still do it.
| phrz wrote:
| Safari does this with .webarchive files
| toomuchtodo wrote:
| Related:
|
| _Show HN: CLI tool for saving web pages as a single file_ -
| https://news.ycombinator.com/item?id=20774322 - August 2019 (209
| comments)
| joeyhage wrote:
| It would be awesome to see support for following links to a
| specified depth, similar to HTTrack (https://www.httrack.com/).
| codetrotter wrote:
| I made a basic crawler using Firefox, thirtyfour
| (https://docs.rs/thirtyfour/latest/thirtyfour/), and Squid.
|
| Basically, I took a start URL for the crawl, and my program
| would load the page in Firefox using thirtyfour, and then
| extract all links from the page and use some basic rules for
| keeping track of which ones to visit and in which order. I had
| a Squid proxy configured to save all traffic that passed through
| it.
|
| It worked ok-ish. I only really stopped that project because of
| a hardware malfunction.
|
| The main annoyance that I didn't get around to solving was
| being smarter about not trying to load non-HTML content that
| had already been loaded as part of the page. Because of the way
| I extracted links from the page, I also picked up the URLs of
| the JS, CSS, etc. that the page referenced.
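
The crawl loop described in this comment, combined with the depth limit asked for in the parent comment, might be sketched roughly as follows. This is not codetrotter's code (which drove Firefox through thirtyfour, a Rust crate); it is a standard-library Python illustration. The names `LinkExtractor`, `crawl`, `fetch`, and `max_depth` are all hypothetical. One possible answer to the "non-HTML content" annoyance is to queue only `<a href>` targets and ignore `<script src>`, `<link href>`, and similar subresource references, which is what this sketch assumes.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects <a href> targets only; script/link/img references are ignored."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # subresources (JS, CSS, images) are never queued
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop "#fragment" so pages dedupe cleanly.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")):
            self.links.append(absolute)


def crawl(start_url, fetch, max_depth=1):
    """Breadth-first crawl up to max_depth link hops from start_url.

    `fetch` is a hypothetical callable taking a URL and returning HTML text;
    in the setup described above it would be a browser load behind a proxy.
    Returns the URLs in the order they were visited.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        visited.append(url)
        if depth >= max_depth:
            continue  # page is loaded, but its links are not followed
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing `fetch` in as a parameter keeps the traversal logic separate from how pages are actually loaded, so the same loop could sit in front of a plain HTTP client or a driven browser.
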
| arp242 wrote:
| I wrote something very similar a few years ago -
| https://github.com/arp242/singlepage
|
| I mostly use it for a few Go programs where I generate HTML; I
| can "just" use links to external stylesheets and JavaScript
| because that's more convenient to work with, and then process it
| to produce a single HTML file.
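
The idea of turning a page that links to external stylesheets and scripts into one self-contained file can be illustrated with a small sketch. This is not how singlepage (a Go program) is implemented; it is a regex-based Python approximation that only handles double-quoted attributes and empty `<script src=...></script>` tags. The names `inline_assets` and `read` are hypothetical, and `read` is assumed to map an `href` or `src` value to that file's text.

```python
import re

_LINK_TAG = re.compile(r"<link\b[^>]*>", re.IGNORECASE)
_SCRIPT_TAG = re.compile(r"<script\b([^>]*)>\s*</script>", re.IGNORECASE)
_ATTR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def _attrs(tag_text):
    """Parse double-quoted attributes from a tag's text into a dict."""
    return {name.lower(): value for name, value in _ATTR.findall(tag_text)}


def inline_assets(html, read):
    """Replace stylesheet links and external scripts with inline copies.

    `read` is a hypothetical callable: given an href/src string, it returns
    the referenced file's contents as text.
    """

    def replace_link(match):
        attrs = _attrs(match.group(0))
        if attrs.get("rel", "").lower() != "stylesheet" or "href" not in attrs:
            return match.group(0)  # leave icons, preloads, etc. untouched
        return "<style>" + read(attrs["href"]) + "</style>"

    def replace_script(match):
        attrs = _attrs(match.group(1))
        if "src" not in attrs:
            return match.group(0)  # already inline or empty
        return "<script>" + read(attrs["src"]) + "</script>"

    html = _LINK_TAG.sub(replace_link, html)
    return _SCRIPT_TAG.sub(replace_script, html)
```

A real tool would use a proper HTML parser and would also need to handle cases this sketch ignores, such as a script file that itself contains the text `</script>`, or images and fonts referenced from CSS.
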
| lopkeny12ko wrote:
| How does this compare to SingleFile?
|
| https://www.npmjs.com/package/single-file-cli
| simonw wrote:
| Well this is fun... from the README here I learned I can do this
| on macOS:
|
|   /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
|     --headless --incognito --dump-dom https://github.com \
|     > /tmp/github.html
|
| And get an HTML file for a page after the JavaScript has been
| executed.
|
| My own https://shot-scraper.datasette.io/ tool (which uses
| headless Playwright Chromium under the hood) has a command for
| this too:
|
|   shot-scraper html https://github.com/ > /tmp/github.html
|
| But it's neat that you can do it with just Google Chrome
| installed and nothing else.
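
For anyone who wants to call the headless Chrome command from a script, a thin wrapper might look like the sketch below. The flags and the macOS binary path are taken from the comment above; the function names `dump_dom_command` and `dump_dom` are hypothetical, and the path will differ on other systems.

```python
import subprocess

# Path quoted in the comment above; an assumption for any non-macOS setup.
MACOS_CHROME = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


def dump_dom_command(url, chrome=MACOS_CHROME):
    """Build the argument list; no shell quoting is needed for a list."""
    return [chrome, "--headless", "--incognito", "--dump-dom", url]


def dump_dom(url, chrome=MACOS_CHROME):
    """Run Chrome headlessly and return the serialized DOM as text."""
    result = subprocess.run(
        dump_dom_command(url, chrome),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Chrome exits non-zero
    )
    return result.stdout
```

Passing the arguments as a list avoids the backslash-escaped spaces that the shell version needs.
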
___________________________________________________________________
(page generated 2024-03-24 23:00 UTC)