[HN Gopher] SingleFile: Save a complete web page into a single H...
___________________________________________________________________
SingleFile: Save a complete web page into a single HTML file
Author : crbelaus
Score : 604 points
Date : 2022-03-02 14:55 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| IggleSniggle wrote:
| What a cool project! I love the way this embeds images. One of the
| things I miss most, though, when going back to old sites, is
| embedded audio or video. From looking at the options, it seems
| like it might be able to handle encoding video and/or audio as
| Data URIs, but it's not totally clear if SingleFile does this or
| not. I wasn't sure if I was doing the correct things to force
| this behavior in the options. It would be great if the README
| could clarify how these are handled by SingleFile. Sometimes it
| might be nice to be able to embed these sorts of things, even if
| it does make the HTML ridiculous and bloated. Or, barring that,
| maybe just a recommendation to use one of the other formats in
| the comparison table for this kind of use case.
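|
| For readers unfamiliar with the term, a Data URI is the file's
| bytes, base64-encoded, placed directly in a src attribute. The
| sketch below shows only that encoding step; it is not
| SingleFile's code, and the function names are made up for
| illustration.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Return an RFC 2397 data URI that carries `data` as base64."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_uri(path: str) -> str:
    """Guess the MIME type from the file extension, then encode the file."""
    mime_type, _ = mimetypes.guess_type(path)
    return to_data_uri(Path(path).read_bytes(), mime_type or "application/octet-stream")
```

| Base64 grows the payload by roughly a third, which is part of
| why embedding audio or video this way bloats the saved page.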
| manigandham wrote:
| Relevant 'awesome' list for web archiving:
| https://github.com/iipc/awesome-web-archiving
|
| There are many similar tools there, from archiving to rendering.
| abnry wrote:
| I love, love this extension. I am working on an app to turn this
| into a single click bookmark system on Linux. Run an inotify
| service to watch your downloads and then process any SingleFile
| downloads to a database and update a browsable index.
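|
| A rough sketch of the "watch the downloads folder" idea, using
| plain polling from the standard library instead of inotify. It
| assumes that pages saved by SingleFile contain the text "Page
| saved with SingleFile" in a comment near the top of the file;
| treat that marker and the function names as assumptions, not as
| the commenter's implementation.

```python
from pathlib import Path

# Assumption: SingleFile writes this marker in a comment near the top of the page.
MARKER = b"Page saved with SingleFile"


def is_singlefile_page(path: Path, probe_bytes: int = 4096) -> bool:
    """Look for the marker in the first few KB of an .html/.htm file."""
    if path.suffix.lower() not in {".html", ".htm"}:
        return False
    with path.open("rb") as fh:
        return MARKER in fh.read(probe_bytes)


def scan_once(downloads: Path, seen: set) -> list:
    """Return SingleFile pages not reported before, and remember them in `seen`."""
    fresh = []
    for path in sorted(downloads.iterdir()):
        if path.is_file() and path not in seen and is_singlefile_page(path):
            seen.add(path)
            fresh.append(path)
    return fresh
```

| Calling scan_once in a loop with a short sleep approximates the
| inotify service; a real watcher would react to events instead.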
| jrm4 wrote:
| TELL ME MORE.
|
| I think I basically get the idea, what kind of database are you
| using? Recoll sounds like a good idea, but I'm also thinking
| about how I might also make this public-ish.
|
| (i.e. I teach in college and would love to have a centralized
| way to store and search all my assigned readings, which are
| most often webpages)
| abnry wrote:
| I am not a trained software engineer but...
|
| Each html page is processed by (1) getting url, title, time
| saved (this is under-rated as approximate time of saving is
| useful if you want to rediscover) and then (2) taking a
| screenshot and finally (3) extracting text with
| readability.js and hopefully doing some keyword analysis.
|
| Right now it is stored in a local SQLite Database, although
| the article content is stored in text files. For search, I
| can use ripgrep to look through the associated text files.
|
| The eventual goal is to create a flask app which will allow
| for interactive management of the bookmarks (tagging,
| searching). I've already got static generation of bookmarks.
|
| Here's a screenshot: https://imgur.com/5YP4sP5
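|
| Step (1) and the SQLite storage described above could look
| something like the sketch below. It assumes the saved page
| starts with a comment holding "url:" and "saved date:" lines;
| the table layout and all names are hypothetical, and the
| screenshot and readability.js steps are left out.

```python
import re
import sqlite3

# Assumption: the saved page begins with a comment shaped like
#   <!--
#    Page saved with SingleFile
#    url: https://example.com/post
#    saved date: Wed Mar 02 2022 14:55:00 GMT+0000
#   -->
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
URL_RE = re.compile(r"^\s*url:\s*(\S+)", re.MULTILINE)
DATE_RE = re.compile(r"^\s*saved date:\s*(.+?)\s*$", re.MULTILINE)
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def _first(pattern, text):
    """Return the first captured group of `pattern` in `text`, or None."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None


def extract_metadata(html: str) -> dict:
    """Pull url and saved date from the first comment, and the <title> text."""
    comment = COMMENT_RE.search(html)
    header = comment.group(1) if comment else ""
    return {
        "url": _first(URL_RE, header),
        "title": _first(TITLE_RE, html),
        "saved": _first(DATE_RE, header),
    }


def open_index(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the bookmark index."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookmarks "
        "(url TEXT PRIMARY KEY, title TEXT, saved TEXT, path TEXT)"
    )
    return conn


def add_bookmark(conn: sqlite3.Connection, html: str, path: str) -> dict:
    """Insert or update one saved page in the index and return its metadata."""
    meta = extract_metadata(html)
    conn.execute(
        "INSERT OR REPLACE INTO bookmarks (url, title, saved, path) VALUES (?, ?, ?, ?)",
        (meta["url"], meta["title"], meta["saved"], path),
    )
    conn.commit()
    return meta
```

| Keeping the article text in separate files, as described, means
| ripgrep can search it without touching the database.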
| m-p-3 wrote:
| I archived (privately) some documentation pages from some of
| our vendors that were behind a login page using that just in
| case it became inaccessible at a critical time for us.
| samstave wrote:
| WANT
| makeworld wrote:
| You might like https://archivebox.io/, I think it can do this
| for you and then some.
| rhn_mk1 wrote:
| I'm using Recoll for this exact purpose. Just without inotify.
| sitkack wrote:
| This sounds neat.
| causi wrote:
| This is great for a page. I'd love to see it expanded to include
| an entire site.
| dgellow wrote:
| That's a nice and simple tool, good work. I'm personally using
| Zotero to save copies of web pages: https://www.zotero.org/. With
| the browser extension you can save a snapshot in a few seconds.
| gildas wrote:
| Zotero is actually using SingleFile under the hood to save web
| pages ;)
| dgellow wrote:
| Oh, that's nice :)
| js8 wrote:
| I use SavePageWE, it can save the page (into single file) as it
| was modified by JS after load, which is often useful.
|
| The only thing I miss: I wish it was easier to script.
| rambambram wrote:
| I have been using WebScrapBook (an add-on for Firefox) for some
| time. I really like it. Has anyone else some experience with this
| add-on? Good or bad.
| jjice wrote:
| I've been using it for a couple years (2 maybe) and I like it
| quite a bit as a quick and easy way to save pages. ArchiveBox
| looks fantastic, but I just don't have the motivation to set up
| the service and maintain it since I don't save enough links to
| make it worthwhile. SingleFile might be worth a shot, but it
| looks like WebScrapBook has been handling your needs just fine
| (they seem to have 90% of the same functionality).
| vageli wrote:
| As a webscrapbook user, do you know if there is a migration
| path from pocket or another hosted service?
| rambambram wrote:
| Don't know about a migration option, but I do remember
| there's a lot of custom configuration possible.
| rambambram wrote:
| Thanks!
|
| ArchiveBox does indeed look fantastic. Their homepage alone
| is beautiful.
|
| I bookmarked both ArchiveBox and now also SingleFile, but
| WebScrapBook gets the job done (in almost all cases).
| sharps1 wrote:
| Should be noted Manifest V3 will break this extension for
| chromium based browsers.
|
| https://github.com/gildas-lormeau/SingleFile-Lite
| photon-torpedo wrote:
| Love the list of notable "features". :)
| a1445c8b wrote:
| Also this:
|
| > Benefits of the Manifest V3
|
| > - None
| black3r wrote:
| Can we please stop with the 17MB GIF images used as demos? They
| use up lots of data immediately as you open the page, and are
| impractical, you don't know how long the animation is, can't
| forward/rewind, and you can't press fullscreen on a mobile.
|
| And GitHub supports embedded videos in README.md files, videos
| are generally smaller than GIF files and their disabled autoplay
| is a feature = you save your data until you press play.
| andrewmcwatters wrote:
| I wish browsers came standard, preconfigured with warning
| dialogs that triggered if assets attempting to load were beyond
| some threshold. That threshold could be decided by the browser
| vendors group based on some collection of network statistics
| and be adjusted on an annual basis or so.
| wackget wrote:
| https://old.reddit.com/r/firefox/comments/aaek23/how_to_stop...
| black3r wrote:
| The issue is mainly with mobile browsers, as mobile data is
| expensive..., Firefox on iPhone doesn't have about:config.
| foobarbecue wrote:
| GitHub only recently expanded video support from gif to decent
| video formats, and many github enterprise installs don't have
| those new features yet. So, keep spreading the word.
| gildas wrote:
| Author here, sorry for the GIF file. I created it because
| people were not happy with the video hosted on Youtube. AFAIK,
| video files did not work when I did this demo. I'll try to
| improve this in the future.
| tux1968 wrote:
| Would you mind sharing which tools you used to create that
| demo? It is really well done.
| gildas wrote:
| Sure! See here
| https://news.ycombinator.com/item?id=30530438
| localhost wrote:
| I wanted to comment on how useful that demo was to me. It did
| a great job at demonstrating why this is useful and how well
| it works compared to the native browser implementation. Thank
| you both for the demo and for the project!
| gildas wrote:
| Thanks :)
| lostgame wrote:
| Giving a massive upvote for this, disappointed and confused to
| see you've been downvoted here. There's literally no reason to
| use GIFs like this, and - as you stated, it's massively
| disrespectful to those not fortunate enough to have broadband
| connections, but would like access to the information.
|
| Using data so wastefully like this always reeks of privilege to
| me - especially on something like GitHub. Wikipedia, for
| instance, never allows things like this.
| Zababa wrote:
| Sharing a project with the world and taking time to document
| it reeks of privilege? I really can't understand your
| reasoning.
| pizza234 wrote:
| > disappointed and confused to see you've been downvoted here
|
| Because it's a relatively new feature, and probably, a lot of
| devs don't know about it (I didn't).
|
| I did this [animated gif] once actually, before the feature
| was introduced, and I definitely hated it, but I had no
| choice.
|
| Thanks for bringing this to the general attention, though :)
| bob1029 wrote:
| I think there is some nuance here.
|
| If the demo sequence is <5 seconds, I have never found myself
| becoming impatient. Gif is perfect for very brief demos.
| Anything longer than that and I'd like to have some idea where
| I am at in the video stream (and other controls as indicated)
| jazzyjackson wrote:
| > GitHub supports embedded videos in README.md files
|
| True since May 2021 so I think a lot of people are still
| finding this out...
|
| In my experience GIF is still the most set-it-and-forget-it way
| to know a video will play, to get cross-platform support out of
| mp4 you may have to provide two different codecs. Anyway, not
| disagreeing with you and most gifs could drop 90% of their size
| with better choice of resolution and framerate. This readme is
| particularly egregious doing a screen capture with scrolling.
|
| As for saving bandwidth until you want to play, I haven't tried
| this yet but it seems adequately clever to wrap a loading=lazy
| gif inside a details/summary tag:
| https://css-tricks.com/pause-gif-details-summary/
| Melatonic wrote:
| Not to mention that H264 can take quite a bit of horsepower
| to decode and play as well (assuming your machine doesn't have
| a hardware chip specifically for doing just that)
| Mogzol wrote:
| Is this really still an issue in 2022? How many people are
| browsing the internet on a device that can't do hardware
| H264 decoding?
| TingPing wrote:
| Some browsers have poor hw decoding support on Linux
| (their problem, not drivers) but it's gotten a lot better
| recently.
| tambourine_man wrote:
| Which machine doesn't? Anything in the last 10 or so years
| will decode H264 with much less power than GIF because of
| it. Even a Pi supports it.
| jjice wrote:
| My 2014 Thinkpad X1 Carbon (gen 3) doesn't have hardware
| transcoding as far as I can tell, which made Zoom and Discord
| impossible to use for class, especially because there was
| no way (that I knew of) to disable all video except the
| presenter. Even playing a YouTube video on it makes it
| ramp up.
| botdan wrote:
| I'm not sure which CPU you have specifically but the
| lowest-end model of the X1 Carbon Gen3 has an i5-5200U
| [1] that lists Intel Quick Sync Video support.
|
| From the wiki page for Quick Sync [2]:
|
| > Intel Quick Sync Video is Intel's brand for its
| dedicated video encoding and decoding hardware core.
| Quick Sync was introduced with the Sandy Bridge CPU
| microarchitecture on 9 January 2011 and has been found on
| the die of Intel CPUs ever since.
|
| I can't confirm but I'd guess your performance issues lie
| elsewhere than in the h264 decoding specifically.
|
| [1] - https://ark.intel.com/content/www/us/en/ark/products/85212/i...
|
| [2] -
| https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video
| jjice wrote:
| If you check out the generation-codec table in that
| wikipedia article [1], under Broadwell (I believe that's
| the 5200U's generation name), it says there is support
| for AVC (which I believe is H264, I'm not a codec wiz),
| so that's a really good point. I'm not sure why I've
| consistently had issues with this on my machine then. I
| wonder if this is something with a configuration on Linux
| then?
|
| Thanks for pointing that out. I've looked at this table
| before and paid attention to HEVC, not AVC, so I believe
| that's where my mistake came from.
|
| [1] https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video#Hardwar...
| zerocrates wrote:
| AVC is H.264, yes.
|
| Accelerated video decode is often disabled by default on
| Linux versions of browsers and can be quite dependent on
| versions of drivers/mesa/X-vs-Wayland/etc.
| black3r wrote:
| YouTube by default prefers newer, bitrate saving codecs
| over old ones if it thinks your CPU can handle software
| decoding them. On my 2017 Dell XPS, 1080p and lower
| resolutions on YouTube play in software decoded AV1, only
| 1440p and higher play in hardware decoded VP9, so playing
| 4K video on YouTube is less taxing for my CPU than
| playing a 1080p video....
| folmar wrote:
| You can use h264ify extension to fix it.
| divbzero wrote:
| > _to get cross-platform support out of mp4 you may have to
| provide two different codecs_
|
| Video codecs are not my area of expertise. Which codecs are
| these and what tool(s) would you typically use to ensure you
| provide them?
| berkes wrote:
| > And GitHub supports embedded videos in README.md files
|
| Any documentation on this? Because I have tried to embed video
| in issues and PRs before, and did not manage. I'm hoping such
| documentation will explain how this extends to issues and PRs.
| TingPing wrote:
| In issues it's just drag-n-drop.
| bachmeier wrote:
| Maybe a little OT, but founders should take a careful look at
| this landing page. That's how you sell something. The demo is
| clear about the problem they're trying to solve and it convinced
| me that their product actually solves it. It's not just all the
| information they've included, but also the lack of irrelevant
| clutter.
| wanderer_ wrote:
| Dang it, he beat me to it! I have been toying with the idea for
| quite some time, but this implementation is great, better than
| mine would have been, so I'm glad he did it.
|
| Maybe I'll make a CLI implementation (sorta like wget but with
| this tacked on...)
| givemeethekeys wrote:
| Naming a thing takes creativity and luck. Congratulations on an
| excellent name!
| civilian wrote:
| I was hoping this tool also solved a problem that comes from
| saving & reproducing JS-framework-heavy websites.
|
| Here's the bug: According to the HTML spec, elements like <h2> and
| <div> cannot be inside <a> tags. But using js you _can_ push
| <div>s inside of <a>s. (It happens from document.insert-type
| functions, frameworks like Angular/React allow this)
|
| Look at nasa.gov, there's html: <a href="/press-
| release/nasa-invites-media-to-next-spacex-commercial-crew-space-
| station-launch-0" date="Wed Mar 02 2022 10:35:00 GMT-0800
| (Pacific Standard Time)" id="ember196" class="card ubernode cards
| --card cards--2row cards--2col nodeid-477815 ember-view"><div
| class="bg-card-canvas" style="background-image: url(/sites/defaul
| t/files/styles/2x2_cardfeed/public/thumbnails/image/51846702013_a
| 0cc55100a_k.jpeg);"> <!----> <h2 class="headline"> ...
| </h2> </div> </a>
|
| After running this through SingleFile you can visually see the
| changes, but the html changes are: <a
| href="/press-release/nasa-invites-media-to-next-spacex-
| commercial-crew-space-station-launch-0" date="Wed Mar 02 2022
| 10:35:00 GMT-0800 (Pacific Standard Time)" id="ember196"
| class="card ubernode cards--card cards--2row cards--2col
| nodeid-477815 ember-view"></a> <div class="bg-card-canvas"
| style="background-image: url(/sites/default/files/styles/2x2_card
| feed/public/thumbnails/image/51846702013_a0cc55100a_k.jpeg);">
| <h2 class="headline"> ...</h2>
|
| The way that sites like Wayback Machine handle this is by using
| the web-replay library Wombat
| https://github.com/webrecorder/wombat that also uses JS to insert
| those elements.
|
| But what the hell! I was working on a similar html-
| downloading/reproducing tool and this bug really bothers me. I'd
| either like the HTML reading standard to be updated to accept
| <div> inside of <a>, or _also_ make that impossible to do via JS.
| gildas wrote:
| I think this issue could be circumvented by manipulating the
| page (replacing images, frames, css etc.) in the tab itself
| (SingleFile does it in background with a DOMParser instance).
| The trick is to avoid HTML parsing.
| zmix wrote:
| I'd also recommend "Print Edit WE" [1] and "Save Page WE" [2] for
| Chrome type browsers, both by one author. First one allows for
| editing of the page before printing/saving (as a single page HTML
| or MHTML), second one allows for single-page save.
|
| [1] https://chrome.google.com/webstore/detail/print-edit-we/olnb...
|
| [2] https://chrome.google.com/webstore/detail/save-page-we/dhhpe...
| sandes wrote:
| wget -r url ?
| reidjs wrote:
| Unfortunately that won't allow you to click links in your offline
| version. You can do this properly with wget: (sorry I don't know
| how to do code formatting in hackernews)
|
|     wget --mirror \
|          --convert-links \
|          --html-extension \
|          --wait=2 \
|          -o log \
|          https://example.com
| berkes wrote:
| Are you suggesting to mirror e.g. the entire Wikipedia through
| wget?
|
| That is not only suboptimal, it is stressing on the server. At
| least you added a --wait=2, but on any large site/hoster/CDN,
| this might still get your IP banned or throttled. And on e.g.
| the English wikipedia this will then take 149 days. Which means
| that by the time you hit the last page, the first ones (and
| their links) are out of date.
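|
| The 149-day figure can be sanity-checked with simple arithmetic:
| pages times wait, divided by seconds per day. The page count
| below is back-calculated from the comment, not taken from it,
| and the sketch ignores transfer time unless fetch_seconds is
| set.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(page_count: int, wait_seconds: float, fetch_seconds: float = 0.0) -> float:
    """Days needed to fetch `page_count` pages one at a time with a fixed pause."""
    return page_count * (wait_seconds + fetch_seconds) / SECONDS_PER_DAY


def pages_for(days: float, wait_seconds: float) -> int:
    """How many pages fit into `days` at one request per `wait_seconds`."""
    return int(days * SECONDS_PER_DAY / wait_seconds)
```

| With --wait=2, 149 days corresponds to about 6.4 million pages,
| which is in the range of the English Wikipedia's article count
| at the time.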
| falcolas wrote:
| If you add '--no-parent' (doesn't request anything that's not
| a page dependency above the requested URI) and a '--level=5'
| (only follows links 5 deep), you won't get all of a site. It
| makes it more realistic for getting wikipedia articles.
| lysium wrote:
| Looks like SingleFile helps with sites where you have to be
| logged in, something that is not that easy with wget.
| hombre_fatal wrote:
| You don't need to newline every flag of a trivial command.
| all2 wrote:
| I'm guessing the user's intent was to have the command
| formatted across multiple lines.
| [deleted]
| _dain_ wrote:
| What are you talking about? I have hundreds of pages saved with
| SingleFile and I can click links in all of them.
| reidjs wrote:
| Oh maybe it does work then. I assumed it didn't follow links
| because they didn't show it in the video.
| z3c0 wrote:
| Code formatting is just blockquotes. So one
| empty space followed by indented text (2 or more spaces)
| megaman821 wrote:
| Is this still on track to become a standard?
| https://github.com/WICG/webpackage
| j1elo wrote:
| Related: I used to keep a collection of locally mirrored web
| pages a long time ago, with a legendary Firefox extension called
| _ScrapBook_ [0] (now long retired). The surprise for me is that
| after all these years I still remembered the name...
|
| While writing this comment I found that it lived on as a (now
| "legacy") new extension named _ScrapBook X_ [1], and then yet
| another one named _WebScrapBook_ [2], which seems to still be
| alive!
|
| [0]: http://www.xuldev.org/scrapbook/
|
| [1]: https://github.com/danny0838/firefox-scrapbook
|
| [2]: https://addons.mozilla.org/en-US/firefox/addon/webscrapbook/
| wetpaws wrote:
| Ah, millennials invented .mht
| als0 wrote:
| This is great. I've always wondered why this isn't the default
| behaviour for page saving in browsers. To an ordinary user saving
| a page implies saving a single file, not a file plus a directory
| of stuff. HAR can be useful but seems only for niche or
| specialised reasons.
| kwhitefoot wrote:
| The list of problems that Manifest V3 causes is just one more
| reason to never use Chrome.
| avivo wrote:
| Why does this need to:
|
| - Read and change all your data on all websites
|
| - Modify data you copy and paste
|
| - Manage your downloads
|
| Is there a way to use a version that requires fewer of these
| permissions? e.g. it seems we can address the first permission by
| only activating it on click, but I'm not sure if that addresses
| the other ones.
| gildas wrote:
| I try to use optional permissions as much as I can. The first
| permission is required because of assets and frames stored on
| third-party servers. The second permission should be optional,
| I don't remember why it's not. I'll try to see if I can make it
| optional. The last permission is required in order to save the
| page on the filesystem with the "downloads" API. Note that even
| if I make these permissions optional, you might still have to
| trust me anyway ;)
| [deleted]
| anned20 wrote:
| I also want to give praise about the demo. It's one of the best
| demos I've ever seen with such a project. Nice job!
| netsharc wrote:
| A 16MB gif with no playback controls, so you had to go through
| the tedium.
| Minor49er wrote:
| I would be surprised that the author wasn't using WebM to get
| a smaller filesize (not to mention higher quality) but the
| project itself leads me to believe that the author has a lot
| of free disk space to use
| a1445c8b wrote:
| There's no need to make further assumptions about the
| author (who btw took the time to build a very useful tool
| and share it on the Internet for free). Just point out the
| issue of the GIF and move along.
| Minor49er wrote:
| I never made an assumption about the author and certainly
| never said that the tool wasn't useful. You can feel free
| to move along yourself, though.
| treeman79 wrote:
| Iran has a habit of using tools like this to trick defense
| contractors into using their page.
| dtjohnnymonkey wrote:
| Thank you! I've been looking for this for a while, nice to see
| someone finally did it!
| ilrwbwrkhv wrote:
| Thanks for this. I expected to see a pricing link somewhere,
| having been attuned to all the subscription Saas these days. Glad
| to see there are tools offering immense value for free still.
| gildas wrote:
| It is in fact more or less self-financed by... hmmm... a SaaS
| that I market but it's in B2B.
| [deleted]
| mysterypie wrote:
| Security question: Is a web extension safe if it is installed but
| if you're not using it at the moment? For example, if I were
| logged into my bank's website and I did _not_ click the
| SingleFile button in the extension toolbar, could it still
| theoretically collect info from my bank 's webpage or do other
| actions?
|
| I'd like to use SingleFile and have no reason at all to distrust
| it, but I'd like to understand the security impact of installing
| lots of web extensions. How do people handle security risks like
| that? Do you run a separate vanilla browser with no extensions
| for sensitive tasks?
| fsflover wrote:
| If you care about security, consider using Qubes OS with
| hardware-virtualized VMs for compartmentalization. Then, your
| Firefox for banking won't have the same extensions which you
| use elsewhere. Works for me.
| gildas wrote:
| For technical reasons beyond my control, SingleFile injects a
| (very small) script when the page loads even if you don't click
| on the button. It could also send any data to a third party
| server. Unfortunately, it is therefore impossible for me to
| technically and formally guarantee that SingleFile cannot
| behave maliciously. Note however that the extension has the
| status "recommended" on Firefox and that it undergoes a manual
| code review by Mozilla at each update.
| fmntf wrote:
| Could you please elaborate on what script is injected, the
| reason, and why it is out of your control? Thank you
| gildas wrote:
| I will do it, but it will take me some time to explain it
| and rather than answering on HN I will integrate it in the
| FAQ. I created an issue for this here:
| https://github.com/gildas-lormeau/SingleFile/issues/885.
| prox wrote:
| In Firefox you could run a totally different profile.
|
| I don't do this myself, I try to research any extension I add
| and don't do automatic upgrades. I use as few extensions as
| possible.
| tzs wrote:
| > For security reasons, you cannot save pages hosted on
| https://chrome.google.com, https://addons.mozilla.org and some
| other Mozilla domains.
|
| Interesting. What is it about those pages that makes saving them
| raise security issues?
| Isthatablackgsd wrote:
| That is not the extension issue, that's the Google/Mozilla
| policy thing.
| amccollum wrote:
| Maybe because JS files (specifically add-ons) run from the
| local filesystem are given escalated privileges compared to
| normal usage, perhaps for ease of development. I'm just
| speculating, though.
| slmjkdbtl wrote:
| Does it create an inline dataurl for each image even if they're
| the same?
| assemblylang wrote:
| Nice project! This project, and a similar project called
| Monolith[0], was a bit of an inspiration for making my own single
| HTML file tool called Humble[1] to solve a few edge cases I was
| having with bundling pages (and since I wanted a TypeScript API
| for making page bundles).
|
| [0] https://github.com/Y2Z/monolith
|
| [1] https://github.com/assemblylanguage/humble
| alberth wrote:
| FYI - there's an official standard (MHTML) for doing this that
| has existed for 20+ years and exists natively in browsers.
|
| https://en.m.wikipedia.org/wiki/MHTML
| setum wrote:
| IIRC, back in the day mhtml won't save java applets.
| pstuart wrote:
| Are any sites still using applets these days?
| IYasha wrote:
| 80% of server IPMI Web control panels. But who would want
| to save those anyway? :)
| twapi wrote:
| I use this Chrome extension to save web pages as MHTML:
| https://chrome.google.com/webstore/detail/save-webpages-offl...
| paulirish wrote:
| The Chrome engineer who maintains the MHTML work wrote up a
| comprehensive doc on the modifications to the MHTML spec (RFC
| 2557) that are implemented:
| https://docs.google.com/document/d/1FvmYUC0S0BkdkR7wZsg0hLdK...
| Might be useful for you, gildas.
| gildas wrote:
| Thank you Paul! I had read this document some time ago,
| especially to see how the shadow DOM was serialized.
| rplnt wrote:
| I was gonna say Opera (the old, good one) had this. When saving
| a page there were some options and one was a single file IIRC.
| rtsil wrote:
| I remember saving webpages in MHTML when I was using dial-up so
| that I could read them offline later.
|
| I would also download entire websites using software whose
| name I forgot, to read them offline. Back when websites fit on
| a single floppy disk.
|
| Good times!
| TheFlyingFish wrote:
| I remember using HTTrack for this a while back. Still have a
| few of those sites lying around, I think.
| domador wrote:
| Does anyone else get two security warnings whenever you try to
| save an MHTML page using a Chrome extension? I have to click on
| one warning's button to confirm that I indeed want to save the
| "dangerous" file and another to confirm I'm really sure. It's
| gotten very annoying. I've looked all over for an option to
| disable this behavior but haven't been able to.
| toqy wrote:
| For anyone else that didn't read the README, MHTML is mentioned
| in the comparison section
| https://github.com/gildas-lormeau/SingleFile#file-format-com...
| dsl wrote:
| Take the comparison with a grain of salt. Not including WARC
| is like excluding water from a comparison of beverages, it is
| the baseline standard.
| bgro wrote:
| I've extensively looked into this as I can't find a good, light
| and easy backup option that isn't extreme overkill.
|
| I thought MHTML was NOT standardized which is why it wasn't
| across all browsers yet. From what I remember, every company
| was doing their own implementation of it. Maybe it's gotten
| more standardized the last few years though.
| chungy wrote:
| I've always thought the "M" stood for "Microsoft" -- wasn't
| even aware any browsers other than IE supported it.
| chme wrote:
| There is also CHM which is actually a Microsoft only file
| format for "Compiled HTML Help" files.
| IYasha wrote:
| I love this format. Very fast and compact. Entire Visual
| Studio help was in it once. Worked VERY well. And there's
| a KDE/Qt reader.
| iKlsR wrote:
| Over a decade ago I had a laptop but no internet at home. This
| was one of the ways I taught myself programming (and also
| downloading dozens of manga) by using internet explorer at a
| cafe which had an option to save to mhtml which was one file
| and had everything self contained. Legit owe a portion of my
| success to this. I still have some of these files, old crusty
| hello world c++ tutorials etc.
| falcolas wrote:
| I have fantastic internet, and I still do something similar.
| Local docs just load so much faster, and if something happens
| (which it still does, even on Fiber in the US), I have docs
| and can program.
|
| Lemme see if I can pull up the command I use to mirror doc
| sites.
|
|     wget \
|       --recursive \
|       --level=5 \
|       --convert-links \
|       --page-requisites \
|       --wait=1 \
|       --random-wait \
|       --timestamping \
|       --no-parent \
|       $1
| a9h74j wrote:
| For people who cannot afford internet access now, and for
| perhaps more in the future if times get more difficult, I
| believe this is a very important use-case.
| geitir wrote:
| And it generally does not do a good job
| als0 wrote:
| What are the issues?
| hulitu wrote:
| From my experience, wrong layout,missing pictures.
| ByThyGrace wrote:
| > MHTML, (...) is a web page archive format used to combine, in
| a single computer file, the HTML code and its companion
| resources (such as images, _Flash animations, Java applets_ ,
| (...)
|
| Well that goes to show its longevity I guess.
| rpdillon wrote:
| The browser compatibility section suggests MHTML is unsupported
| in current versions of Firefox and Safari.
| tekknik wrote:
| Safari supports webarchive, which does basically the same
| thing
| gildas wrote:
| The problem is that it is a proprietary format. The
| advantage of the format produced by SingleFile (HTML) is
| that as long as your browser is capable of interpreting
| HTML, you will be able to read your archives without
| worries.
| tekknik wrote:
| Not so proprietary. It's really just a plist file, which
| the format is known and even open sourced by Apple[1].
| Really it's only proprietary in that no other platforms
| have implemented it.
|
| [1]: https://opensource.apple.com/source/CF/CF-550/CFBinaryPList....
| mrspuratic wrote:
| I don't think it was ever native in Firefox, there is/was the
| excellent unMHT extension that was broken by
| Quantum/WebExtensions and The Great XUL Silliness. Shame.
|
| I have Waterfox-Classic and unMHT (fished out of the Classic
| Addons Archive, just remember to turn off Waterfox's
| multiprocess feature) since I occasionally need to archive
| web pages - and more importantly, reopen them later.
|
| mhtml is just MIME, literally every discrete URL as a MIME
| part with its origin in a Content-Location header, all
| wrapped in a multipart container. I don't understand why it's
| not a default format.
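|
| To make the "MHTML is just MIME" point concrete, here is a small
| sketch that builds and reads back a multipart/related container
| with Python's standard email package, one part per URL with a
| Content-Location header. It illustrates the general shape only;
| files written by real browsers carry extra headers and quirks
| that this ignores, and the function names are made up.

```python
from email import encoders, message_from_bytes
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url: str, html: str, resources: dict) -> bytes:
    """Wrap a page and its resources in one multipart/related message.

    `resources` maps a URL to a (mime_type, raw_bytes) pair.
    """
    root = MIMEMultipart("related", type="text/html")
    page = MIMEText(html, "html", "utf-8")
    page["Content-Location"] = page_url
    root.attach(page)
    for url, (mime_type, data) in resources.items():
        maintype, subtype = mime_type.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(data)
        encoders.encode_base64(part)  # binary payloads travel as base64
        part["Content-Location"] = url
        root.attach(part)
    return root.as_bytes()


def read_parts(raw: bytes) -> list:
    """Return (Content-Location, content type, decoded bytes) for each leaf part."""
    message = message_from_bytes(raw)
    return [
        (part["Content-Location"], part.get_content_type(), part.get_payload(decode=True))
        for part in message.walk()
        if not part.is_multipart()
    ]
```

| A reader resolves each src or href against the Content-Location
| values, so no URL rewriting inside the HTML is needed.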
| Groxx wrote:
| I can see WebExtensions breaking it (as it's a completely
| new set of APIs for extensions, and the losses do
| definitely still hurt)... but quantum/xul? How is that
| related, aside from "it happened around the same time"?
| cookiengineer wrote:
| > FYI
|
| The alternative format (used by the Internet Archive and
| Wayback Machine) is WARC. It's also a single file, but it
| preserves the HTTP headers as well, so its application is
| specifically for archival purposes. [1] The "wget" tool, which
| is co-maintained by the Web Archive people, also has support for
| it via CLI flags. [2]
|
| Though when it comes to mobile browser support I'd recommend to
| use MHTML, because webkit and chromium both have support for it
| upstream.
|
| [1] http://iipc.github.io/warc-specifications/
|
| [2] https://www.gnu.org/software/wget/wget.html
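|
| As an illustration of what "preserving the HTTP headers" means,
| the sketch below frames one raw HTTP response as a single
| WARC/1.0 response record. It emits only a minimal set of record
| headers and is not a complete or validated writer; the function
| name is hypothetical. In practice wget's --warc-file option or
| a dedicated library would be the tool to reach for.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri: str, http_response: bytes, when: datetime = None) -> bytes:
    """Frame a raw HTTP response (status line, headers, body) as one WARC record."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_response))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{name}: {value}\r\n" for name, value in headers) + "\r\n"
    # The record block is the untouched HTTP response; two CRLFs end the record.
    return head.encode("utf-8") + http_response + b"\r\n\r\n"
```

| Because the block is the response exactly as received, a replay
| tool can serve back the original status line and headers along
| with the body.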
| londons_explore wrote:
| Is there any objection to adding WARC support to
| webkit/chromium? Seems like a not-so-complex project...
| cookiengineer wrote:
| I know that WebKit relies on either libsoup [1] (on
| Linux/Unices) or curl [2] (legacy Windows and maybe WPE(?))
| as a network adapter, so the header handling and parsing
| mechanisms would have to be implemented in there.
|
| Though, on MacOS, WebKit tries to migrate most APIs to the
| Core Foundation Framework, which makes it kind of
| impossible to implement as a non-Apple-employee because
| it's basically a dump-it-and-never-care Open Source
| approach. [3]
|
| Don't know about chromium (my knowledge is ~2012ish about
| their architecture, and pre-Blink).
|
| [1] https://github.com/WebKit/WebKit/tree/main/Source/WebKit/Net...
|
| [2] https://github.com/WebKit/WebKit/tree/main/Source/WebKit/Net...
|
| [3] https://github.com/opensource-apple/CF
| TingPing wrote:
| GTK/WPE use libsoup. PlayStation/Windows use curl. And
| yes, Apple's networking is proprietary.
| chefandy wrote:
| WARC is also used by the Webrecorder project. They made an
| app called Wabac which does entirely client-side WARC or HAR
| replays using service workers and it seems to have pretty
| good browser support, but I haven't really dug into the
| specifics.
|
| https://github.com/webrecorder/wabac.js-1.0
| admax88qqq wrote:
| Unfortunately mhtml is not widely supported.
| pan69 wrote:
| In the olden days, Internet Explorer used to allow you to do this
| by saving the page to a HTM file. It would be a single archive
| with HTML and images etc embedded.
|
| New browsers don't seem to do this; they create a separate folder
| for the assets, which is super annoying.
| nickflood wrote:
| The Chromium Edge can produce .MHT files as well
| xnx wrote:
| I love SingleFile and have been using it for years! Is there any
| version that works on current mobile browser versions? I've stuck
| with an old version of Firefox on Android that still supports the
| extension.
| gildas wrote:
| You should be able to use it on Firefox for Android Nightly
| (which is very stable) by following this procedure:
| https://blog.mozilla.org/addons/2020/09/29/expanded-extensio...
|
| > approx
| moffkalast wrote:
| This is what 10 year old me thought "Save As" in IE would do, but
| soon realized the harsh reality of "that's not how any of this
| works".
| edf13 wrote:
| The most impressive part of the demo is seeing how tidy his
| Downloads folder is!
| ctxc wrote:
| Been eyeing this for a long time!
|
| I'm building a bookmark app, and I plan to use this to save
| bookmarks!
|
| I'm a simple man, nothing too fancy. Here's a crude demo in
| progress - https://zewallet.netlify.app/ Follow progress here -
| https://twitter.com/recursiveSwings/status/14917723874649088...
|
| Would love to have ANY tips or feedback!
| TehShrike wrote:
| the signup email confirmation link points to
| http://localhost:3000/ btw
|
| I'm definitely in the market for a bookmark service that
| archives my bookmarks. Diigo stopped working a year or two ago,
| and Pinboard can't stay up.
| ctxc wrote:
| Fixed now!
| cxr wrote:
| Zotero deals with this reasonably well, and happens to be
| using SingleFile under the hood. Its landing page just
| targets a different audience (academics), which means
| upwards of 90% of the people who would happily use it
| probably end up bouncing after thinking, "This isn't for
| me", before ever trying it. Give it a shot.
| ctxc wrote:
| Ahh damn, should fix it! For now, you can edit the URL
| manually to take a peek. If you're interested, feel free to
| send a DM on Twitter @recursiveSwings, I'll let you know once
| it's in Beta! :)
| fender256 wrote:
| You read my mind, I was exactly looking for that!
| didericis wrote:
| Similar project -> https://github.com/Y2Z/monolith
|
| (I used both and ended up favoring monolith, but can't remember
| why. I think they're pretty comparable/am grateful for both of
| them)
| theden wrote:
| This would be very useful in many situations, and a great demo!
| spankalee wrote:
| We really, really need Web Bundles to progress and fix these
| problems correctly, once and for all. There are a lot of things
| that a tool like this can never get right, and the rest is
| complicated work that should never need to be done if we have a
| standard multi-file bundle format.
|
| https://wicg.github.io/webpackage/draft-yasskin-wpack-bundle...
| necovek wrote:
| Great stuff!
|
| For some reason, I went in expecting to see a JS-enabled
| multi-page web site turned into a SPA in a single HTML file, but
| I didn't expect to see images get embedded.
|
| Perhaps offer a recursive traversal option too, but don't try
| that on Wikipedia :)
| sam0x17 wrote:
| Back in the day this was always one thing that had me
| begrudgingly and shamefully opening IE so I could save a page as
| an MHT file. So long ago now. Cool to see this idea has been
| revived and not in a proprietary way
| vincentmarle wrote:
| If it's a single file, then how do the images get stored?
| gildas wrote:
| Images are stored as data URIs [1]. Note that they could also
| be stored as entries in a zip file too! [2].
|
| [1] https://en.wikipedia.org/wiki/Data_URI_scheme
|
| [2] https://github.com/gildas-lormeau/SingleFileZ
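|
| For anyone wondering what a data URI looks like in practice, here
| is a minimal Python sketch. The helper names are hypothetical and
| this is not SingleFile's actual code (which is JavaScript); it
| only illustrates the encoding.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_image(html: str, src: str, data: bytes, mime_type: str) -> str:
    """Replace one src="..." reference with the embedded bytes.

    Naive string replacement for illustration; a real tool would
    parse the DOM instead.
    """
    return html.replace(f'src="{src}"',
                        f'src="{to_data_uri(data, mime_type)}"')
```

| The cost is size: base64 output is roughly a third larger than
| the original bytes.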
| danielam wrote:
| They're base64 encoded[0]. (This is an approach I myself have
| used in the past for simplifying the archival of regulatory
| texts.)
|
| [0] https://github.com/gildas-lormeau/SingleFile/blob/15801c8ef4...
| codeflo wrote:
| Does this simply remove the JavaScript or do something more
| clever? Because I think in the age of SPAs, the proper way to
| save "content pages" might be to execute the JavaScript once and
| serialize the resulting DOM back to HTML. I didn't find anything
| in the FAQ that explains if it does something like that.
| gildas wrote:
| It saves what you see (and removes JS by default). There is an
| option to embed the JS and another one to save the "raw" page
| but I would not say it is reliable. The cleverness lies more in
| the ability to produce light pages.
| sergiotapia wrote:
| I'm building a tool for people to have a personal archive of
| their digital life so that 30 years from now they can revisit
| content they enjoyed in their younger years.
|
| https://github.com/sergiotapia/ekeko
|
| This is awesome! I would love to integrate this somehow into my
| project to "singlefile" bookmarks as people make them.
|
| @gildas do you have any recommendation on how to approach this
| with your extension? Could I run a headless chrome and trigger
| this extension?
| gildas wrote:
| I confirm that you could use a headless browser for this. This
| is actually what SingleFile CLI does [1]. Here is an example of
| JS code showing how to configure and inject SingleFile with
| puppeteer [2].
|
| [1] https://github.com/gildas-lormeau/SingleFile/tree/master/cli
|
| [2] https://github.com/gildas-lormeau/SingleFile/blob/master/cli...
| sergiotapia wrote:
| Thank you!
| phil294 wrote:
| How old is that demo gif? I just tried reproducing the normal
| saving shortcomings, and the bottom image ("Example of an SVG
| image with embedded JPEG images") loads just fine from the local
| folder, so this seems outdated.
|
| That being said, it's a bit weird that this kind of tool is even
| necessary at all. I would have expected native saving to include
| CSS background graphics as well, but apparently it doesn't for
| some reason, so I think this is pretty useful. Until now, I have
| also used pandoc (--standalone) to merge all resources into a
| single HTML file which worked great.
| gildas wrote:
| The demo is approximately 2 years old. Things probably changed
| meanwhile.
| brentcetinich wrote:
| I use a HAR file extractor because normally I don't want a single
| file; I want a replica of the web server's file system structure,
| including any dynamically loaded assets.
| https://blog.cetinich.net/content/2022/download-website-and-...
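|
| A rough idea of how such an extractor can work, as a Python
| sketch. This is not the tool from the linked post; extract_har is
| a hypothetical name, and the code assumes a HAR 1.2 export whose
| entries include response bodies (browsers only include them when
| exporting "with content").

```python
import base64
import json
from pathlib import Path
from urllib.parse import urlsplit


def extract_har(har_text, out_dir):
    """Write each captured response body to out_dir/<host>/<path>."""
    har = json.loads(har_text)
    written = []
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            continue  # body was not captured for this request

        # Binary bodies are stored base64-encoded in the HAR.
        if content.get("encoding") == "base64":
            data = base64.b64decode(text)
        else:
            data = text.encode("utf-8")

        parts = urlsplit(entry["request"]["url"])
        rel = parts.path.lstrip("/")
        if not rel or rel.endswith("/"):
            rel += "index.html"  # directory-style URLs get an index file
        if ".." in rel.split("/"):
            continue  # skip paths that would escape out_dir

        target = Path(out_dir) / parts.netloc / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        written.append(target)
    return written
```

| Query strings are ignored here, so two URLs that differ only in
| their query would overwrite each other.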
| kosasbest wrote:
| Love this. Use it all the time. Handy for saving huge pages with
| all the styling intact for reading offline (like on a plane). You
| could save a webpage as a PDF, but I prefer this over a PDF.
| steren wrote:
| Chrome can save to a single file (.mhtml). I am not sure I
| understand the difference.
| gildas wrote:
| The difference is the output format. I created SingleFile
| before Chrome supported MHTML files. At that time, to save web
| pages in a single file, the only technical solution in Chrome
| was to implement something like SingleFile. The advantage of
| HTML is that this format is much more durable though.
| Isthatablackgsd wrote:
| Yes, there is .mhtml, but its execution plainly sucks because it
| doesn't save everything. It attempts to save, but it isn't
| valiant about it; it is like using mhtml without a
| "force (-f)" argument.
| gildas wrote:
| Author here, it makes me really happy to see SingleFile on the
| front page of HN. Thank you! I'd like to take the opportunity to
| make you aware of the upcoming impacts of the Manifest V3 [1],
| and for those who prefer zip files, I recommend having a look
| here [2].
|
| [1] https://github.com/gildas-lormeau/SingleFile-Lite
|
| [2] https://github.com/gildas-lormeau/SingleFileZ
| joisig wrote:
| Thank you for the Manifest V3 critique, the examples you give
| make it really clear how many things are regressing with this
| upcoming change :/
| austincheney wrote:
| Twelve year project with nearly 7000 commits shows a lot of
| dedication. Good work.
| mieko wrote:
| Thanks for this project. I found SingleFile a year or two ago,
| and used it to take "HTML Screenshots" of third party sites I
| could embed in guided walkthroughs with modified/example data
| changed, instead of just PNGs.
|
| SingleFile was ultra-valuable for this.
|
| If anyone has a similar use-case, I wrote some pretty rough
| (and slow) code to post-process SingleFile's output to remove
| any HTML that wasn't contributing to the presentational render
| by launching puppeteer and comparing pixels. It's available
| here: https://github.com/mieko/trailcap
| gildas wrote:
| It's interesting! I had started something similar as part of
| testing but hadn't really finished my work. I will have a
| look at your project.
| stragio wrote:
| Very nice! Will use it for sure. May I ask you how you created
| that good looking demo gif?
| gildas wrote:
| I used:
|
| - ScreenToGif to record video sequences and produce the final
| GIF: https://www.screentogif.com/
|
| - Macro Recorder to record and replay user navigation:
| https://www.macrorecorder.com/
|
| - Blender to edit the video, add text comments, and make the
| intro: https://www.blender.org/
| badsectoracula wrote:
| Single File is one of my favorite addons since it allows me to
| keep offline copies of articles, tutorials, etc i see online
| without losing images, etc (there have been a ton of articles
| lost over the years and while some are preserved in
| archive.org, they often lack things like images, etc, so i
| prefer to save anything i come across). So thank you for making
| it :-).
|
| Now, having said that, the text in SingleFile-Lite's "Notable
| features of SingleFile Lite" sounds like a list of issues :-P.
| It looks like these are issues with Chrome, but do you know
| if/how these "improvements" will affect Firefox?
| gildas wrote:
| AFAIK, for the moment Mozilla is aware of the regressions
| that Manifest V3 causes and shows a good will to try to
| reduce them as much as possible. You can find some
| information about this here
| https://github.com/w3c/webextensions/tree/main/_minutes
| rahimnathwani wrote:
| If I start using SingleFile today, will I still be able to open
| saved pages after the update to Manifest V3?
|
| I mean, if I want to save pages over the next 11 months, should
| I install SingleFile or SingleFile-Lite?
| gildas wrote:
| In fact, you simply do not need an extension to open pages
| saved with SingleFile (or SingleFile Lite) because they are
| standard HTML pages. So you don't have to worry about that.
| warmwaffles wrote:
| This alone is fantastic. I've been looking for an mhtml
| replacement that worked well across all browsers.
| JeremyNT wrote:
| I've been using SingleFile for the last year or so, it's
| amazing!
|
| I'm going to hijack your post for a question! I love the way
| you can use the editor and select "format for better
| readability," then save just the stripped down version of the
| page. I use this to send it to my e-ink device.
|
| The question I have is whether it's possible to toggle the
| default save to use the formatted version automatically? I dug
| into the options and didn't turn anything up!
| gildas wrote:
| You can enable these options for this:
|
| - Annotation editor > default mode > edit the page
|
| - Annotation editor > annotate the page before saving
| gildas wrote:
| Sorry, I was wrong, you have to select "format the page"
| instead of "edit the page" (first item).
| narag wrote:
| Thank you, very useful and works like a charm: a must have.
| cloudwizard wrote:
| Is there a configuration for the zip version where I can avoid
| duplicating the static assets? Thanks
| gildas wrote:
| I guess you're referring to SingleFileZ. This option is not
| needed because zip files (i.e. what SingleFileZ produces)
| already provide this feature.
| aantix wrote:
| Is it possible to use this within the context of the current
| web page, without the extension portion?
|
| Taking a snapshot of my user's screen and then display it to
| them later (maybe in an iFrame)?
| gildas wrote:
| It's possible but it's a bit limited. For example, it won't be
| able to save images coming from a different origin.
| hrgiger wrote:
| Thanks for the work you have done, it's a lazy man's heaven,
| especially for bulk downloads, and it helped me a lot. About a
| month ago I decided to back up my bookmarks via archivebox.
| It was more than 1k bookmarks; the most reliable methods were
| singlefile and wget.
| gwbas1c wrote:
| FYI: Figure tags don't convert their hrefs to base64.
|
| For example, try saving my home page:
| https://andrewrondeau.herokuapp.com/
|
| The img tags are converted correctly, but there's still <figure
| class=image><a href="https://andrewrondeau.herokuapp.com/... in
| the single HTML file.
| gildas wrote:
| I cannot reproduce your issue, I just did a test on this page
| and I see the expected `<img src="data:image/jpeg;base64,...`
| in the saved page.
| Mr_Modulo wrote:
| This is good for people who don't have constant internet access
| who need to reference web resources offline.
|
| Webpage saving technology does not seem to have kept pace with
| the evolution of the web.
|
| Images loaded by CSS aren't saved at all. JavaScript on the page
| will often hijack a saved page and not let it display at all.
|
| One option that works fairly well and does not require installing
| a browser extension is to save the page as a PDF.
|
| I wish browser developers would put more effort in this area.
| manor wrote:
| If you keep the javascript, you also get the world's most
| portable (desktop) application format...
| stanislavb wrote:
| Opening the repo makes you download a 17MB gif. I hope you are
| not on an expensive mobile connection.
|
| p.s. the demo is nice
___________________________________________________________________
(page generated 2022-03-02 23:00 UTC)