[HN Gopher] Creating a Safari webarchive from the command line
       ___________________________________________________________________
        
       Creating a Safari webarchive from the command line
        
       Author : ingve
       Score  : 116 points
       Date   : 2024-06-04 06:24 UTC (16 hours ago)
        
 (HTM) web link (alexwlchan.net)
 (TXT) w3m dump (alexwlchan.net)
        
       | eltondegeneres wrote:
       | Is there a reason why you'd want to save archival material in a
       | proprietary format? Wouldn't it be better/easier to use wget with
       | the `--warc-file` flag?
        
         | simongr3dal wrote:
         | It is proprietary, but it's not like it's difficult to decode
         | or interpret.
        
         | codetrotter wrote:
         | FTA:
         | 
         | > Although Safari is only maintained by Apple, the Safari
         | webarchive format can be read by non-Apple tools - it's a
         | binary property list that stores the raw bytes of the original
         | files. I'm comfortable that I'll be able to open these archives
         | for a while, even if Safari unexpectedly goes away.
        
         | alexwlchan wrote:
         | 1/ Why not wget?
         | 
         | For this project I wanted a consistent file format for my
         | entire collection.
         | 
         | I have a bunch of stuff I want to save which is behind
         | paywalls/logins/clickthroughs that are tricky for wget to
         | reach. I know I can hand wget a cookies file, but that's mildly
         | fiddly. I save those pages as Safari webarchive files, and then
         | they can drop in alongside the files I've collected
         | programmatically. Then I can deal with all my saved pages as a
         | homogeneous set, rather than being split into two formats.
         | 
         | Plus I couldn't find anybody who'd done this, and it was fun :D
         | 
         | This is only for personal stuff where I know I'll be using
         | Safari/macOS for the foreseeable future. I don't envisage using
         | this for anything professional, or a shared archive -- you're
         | right that a less proprietary format would be better in those
         | contexts. I think I'm in a bit of a niche here.
         | 
         | (I'm honestly surprised this is on the front page; I didn't
         | think anybody else would be that interested.)
         | 
         | 2/ Proprietary format: it is, but before I started I did some
         | experiments to see what's actually inside. It's a binary plist
         | and I can recover all the underlying HTML/CSS/JS files with
         | Python, so I'm not totally hosed if Safari goes away.
         | 
         | Notes on that here:
         | https://alexwlchan.net/til/2024/whats-inside-safari-webarchi...
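A minimal sketch of the "recover the files with Python" idea, using the standard-library plistlib module. The thread only says the format is a binary plist; the key names used here (WebMainResource, WebSubresources, WebResourceURL, WebResourceMIMEType, WebResourceData) are an assumption based on what such files commonly contain, and the sample archive is built in memory so the example runs without a real .webarchive file.

```python
import plistlib


def list_resources(archive_bytes):
    """Return (url, mime_type, size_in_bytes) for every resource in a webarchive."""
    archive = plistlib.loads(archive_bytes)  # format (binary or XML) is auto-detected
    # Assumed layout: the page itself under WebMainResource, and images, CSS
    # and JS as a list under WebSubresources (which may be absent).
    resources = [archive["WebMainResource"]] + archive.get("WebSubresources", [])
    return [
        (r["WebResourceURL"], r["WebResourceMIMEType"], len(r["WebResourceData"]))
        for r in resources
    ]


# Hypothetical stand-in for a real archive, written as a binary plist.
sample = plistlib.dumps(
    {
        "WebMainResource": {
            "WebResourceURL": "https://example.com/",
            "WebResourceMIMEType": "text/html",
            "WebResourceTextEncodingName": "UTF-8",
            "WebResourceData": b"<html><body>hello</body></html>",
        },
        "WebSubresources": [
            {
                "WebResourceURL": "https://example.com/style.css",
                "WebResourceMIMEType": "text/css",
                "WebResourceData": b"body { color: red; }",
            }
        ],
    },
    fmt=plistlib.FMT_BINARY,
)

resources = list_resources(sample)
for url, mime, size in resources:
    print(url, mime, size)
```

For a real file, the bytes would come from open(path, "rb").read() instead of the in-memory sample; writing each WebResourceData value to disk would then give back the individual HTML/CSS/JS files.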
        
           | pvg wrote:
           | _I didn't think anybody else would be that interested._
           | 
           | 'Save the webpage as I see it in my browser' remains a
           | surprisingly annoying and fiddly problem, especially
           | programmatically, so the niche is probably a little roomier
           | than you might initially suspect.
        
             | diggan wrote:
             | > 'Save the webpage as I see it in my browser' remains a
             | surprisingly annoying and fiddly problem
             | 
             | Is it really? I remember hacking around with
             | JavaScript's XMLSerializer (I think) like 5 years ago and
             | solved that for ~90% of the websites I tried to archive.
             | It'd save the DOM as-is when executed.
             | 
             | Internet Archive/ArchiveTeam also worked on that particular
             | problem for a very long time, and are mostly successful as
             | far as I can tell.
        
               | pvg wrote:
               | 90% feels like an overestimate to me but it's already
               | quite poor, you wouldn't accept that for saving most
               | other things. Another problem is highlighted in the piece
               | - it's a hassle to ensure external tools handle session
               | state and credentials. Dynamic content is poorly handled,
               | the default behaviours are miserable (a browser will run
               | random Javascript from the network but not Javascript
               | you've saved, etc).
               | 
               | There's a lot of interest in 'digital preservation' and
               | perhaps one sign of how it's very much early days of the
               | field - it's tricky to 'just save' the results of one of
               | the most basic current computer interactions - looking at
               | a web page.
        
               | diggan wrote:
               | But if you serialize the DOM as-is, you literally get
               | what you see on the page when you archive it. Nothing
               | about it is dynamic, and there are no sessions or
               | credentials to handle. Granted, it's a static copy of a
               | specific single page.
               | 
               | If you need more than that, then WARC is probably the
               | best. For my measly needs of just preserving exactly what
               | I see, serializing the DOM and saving the result seems to
               | do just fine.
        
               | pvg wrote:
               | Yes you save something that's mildly better than print-
               | page-to-PDF. But it still misses things and the
               | interactive stuff is very much part of 'exactly what I
               | see'. Like, a random article with an interactive graph,
               | for instance - like this recent HN hit
               | https://ciechanow.ski/airfoil/
               | 
               | It's not that there aren't workarounds, it's that they
               | are clunky and 'you can't actually save the most common
               | computery entity you deal with' is just a strange state
               | of affairs we've somehow Stockholmed ourselves to.
        
               | tedmiston wrote:
               | > Internet Archive/ArchiveTeam also worked on that
               | particular problem for a very long time, and are mostly
               | successful as far as I can tell.
               | 
               | One category that the archivers do poorly with is news
               | articles where a pop-up renders on page load which then
               | requires client-side JS execution to dismiss the pop-up.
               | 
               | Sometimes it is easily circumvented by manual DOM
               | manipulation, but that's hardly a bulletproof solution.
               | And it feels automatable.
        
             | DaSHacka wrote:
             | > 'Save the webpage as I see it in my browser' remains a
             | surprisingly annoying and fiddly problem
             | 
             | You may be interested in SingleFile[1]
             | 
             | [1] https://github.com/gildas-lormeau/SingleFile
             | 
             | I use it all the time to archive webpages, and I imagine it
             | wouldn't be hard to throw together a script to use
             | FireFox's headless mode in combination with SingleFile to
             | selfhost a clone of the wayback machine.
        
               | pvg wrote:
               | Thanks, I've seen it, last I tried it it missed bg
               | images. But my point is this is something browsers should
               | support better and kind of sort of do now but even with
               | that it's a hassle.
        
               | tedmiston wrote:
               | I tested this just now on the blog post that this HN page
               | points to and SingleFile handled the background image
               | fine.
        
               | cxr wrote:
               | > FireFox's
               | 
               | It's just "Firefox".
        
             | factormeta wrote:
             | Thanks to all the JS/SPA developers who insist on putting
             | JS all over the place. Wouldn't it be better to have
             | everything in one .html, using <script> <style> just
             | inline? Then it is also just one file over the internet.
             | There must be a bundler that does that, no?
             | 
             | It seems JS developers just want their code to be as
             | obfuscated and unarchivable as possible unless it is served
             | via their web server.
        
               | cxr wrote:
               | > using <script> <style> just inline
               | 
               | These SPA bundles are on the order of megabytes, not
               | kilobytes. You want your users, for their own sake and
               | yours, to be able to cache as much as possible instead of
               | delivering a unique megablob payload for every page they
               | hit.
        
             | sturakov wrote:
             | I've enjoyed using this
             | 
             | https://github.com/webrecorder
             | 
             | It has a standardized format and acts like a recorder for
             | what you see.
        
         | buildbot wrote:
         | How can you actually open and view a warc file? I've never
         | found a good application for this, have I missed something
         | obvious?
        
           | diggan wrote:
           | Lots of tools available, best index I've found of the
           | ecosystem is this:
           | https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem
           | 
           | Ultimately, this is the best viewer I've found so far:
           | https://replayweb.page/
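For a quick look inside a WARC file without a dedicated viewer, the records can be walked with a few lines of Python. This is a rough sketch, assuming an uncompressed file that follows the usual record layout (a WARC/1.x version line, header fields, a blank line, then a body of Content-Length bytes); real archives are often gzip-compressed and this is not a replacement for a proper WARC library. The sample record is hand-written for illustration.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The body length is given explicitly, so read exactly that many bytes.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


# Hand-written sample: one response record, repeated twice.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hi</p>"
record = b"".join(
    [
        b"WARC/1.0\r\n",
        b"WARC-Type: response\r\n",
        b"WARC-Target-URI: https://example.com/\r\n",
        b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n",
        b"\r\n",
        payload,
        b"\r\n\r\n",
    ]
)

records = list(iter_warc_records(io.BytesIO(record * 2)))
for headers, body in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(body))
```

This only lists what was captured; rendering the pages as they looked is the job of a replay tool like the one linked above.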
        
             | tivert wrote:
             | > Lots of tools available, best index I've found of the
             | ecosystem is this:
             | https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem
             | 
             | It looks like there's a lot of tools for creating them, but
             | not a lot for viewing.
             | 
             | What they really need is browser support, or at least an
             | extension so a browser can open the files directly.
        
               | jamesgeck0 wrote:
               | There is a browser extension. It can record WARC files,
               | but also has a viewing interface identical to
               | ReplayWeb.page. https://archiveweb.page/guide
        
               | cxr wrote:
               | > What they really need is browser support, or at least
               | an extension so a browser can open the files directly
               | 
               | That's probably the wrong thing. What browsers really
               | need is a thin but standardized API that lets any third-
               | party app that the user has installed on their machine to
               | supply the content for various fetch/reads.
               | 
               | You'd open the WARC in Firefox or Safari or whatever, but
               | Safari et al wouldn't have any special understanding of
               | the format. It would know that _your_ app does WARCs,
               | though, and then knock on the door and say,  "Please tell
               | me the content I should be showing here; I'll defer to
               | you for any further "requests" associated with the
               | file/page loaded in this tab--just tell me the content I
               | should use for those, too."
        
             | lostemptations5 wrote:
             | I've never easily been able to read warc files.
        
       | savolai wrote:
       | I'd love to have a viewer for this or converter to a standard
       | html single page archive that works in other browsers too. Is
       | there some reason for Apple's proprietary format to exist over self
       | contained html? I have a bunch of apple webarchives too (stored
       | from ios to notion) and am worried if there is no durable
       | solution to open these beyond "code it yourself".
        
         | Springtime wrote:
         | _> I'd love to have a viewer for this or converter to a
         | standard html single page archive that works in other browsers
         | too_
         | 
         | About 10 years ago I searched and found a webarchive to MHTML
         | converter from someone's small site. I recall there was one
         | caveat, something like it didn't include the date in the
         | metadata of the output MHTML.
         | 
         | Sorry I don't have the link on-hand, it'd be on the HDD I
         | pulled from my old Macbook.
         | 
         | Edit: found an alternative script from my notes:
         | https://blog.iefdev.se/2016/01/converting-webarchive-to-mht/
         | 
         | Edit 2: pretty sure this is the original converter, since I
         | noted a caveat that it outputs the timestamp of the conversion
         | not the time the file was originally saved (the script above
         | handles this correctly otoh).
         | https://langui.net/webarchive-to-mht/
        
           | lacop wrote:
           | Unfortunately I don't think browsers handle opening MHTML
           | well out of the box.
           | 
           | At least last time I checked it only worked for local paths
           | (file://) and only some browsers. Otherwise it would either
           | try to download or just show plain text.
           | 
           | I ended up using Chrome to dump to MHTML [1] and then
           | reshuffle the content into individual files, rewriting the
           | path references and fixing mime types [2]. That gives a
           | reasonably faithful static capture of a page that can be
           | shared as link.
           | 
           | [1] https://chromedevtools.github.io/devtools-protocol/tot/Page/...
           | [2] https://github.com/lacop/udrb/blob/master/app/src/chrome/mod...
        
         | tedmiston wrote:
         | Something like ArchiveBox or SingleFile are in the same
         | ballpark of tools, but SingleFile at least seems to eschew
         | Safari Webarchive as a format. ArchiveBox may support Safari
         | webarchives, but for some reason they omit it in their docs.
         | 
         | https://github.com/gildas-lormeau/SingleFile?tab=readme-ov-f...
         | 
         | https://archivebox.io/#output-formats
        
       | pzmarzly wrote:
       | Could this be compiled for Linux and WebKitGtk, or is this API
       | Safari-specific?
        
       | hu3 wrote:
       | sidenote: be careful when opening a webarchive from a third
       | party, if that improbable opportunity ever materializes:
       | 
       | > In February 2013, a vulnerability with the webarchive format
       | was discovered and reported by Joe Vennix, a Metasploit Project
       | developer. The exploit allows an attacker to send a crafted
       | webarchive to a user containing code to access cookies, local
       | files, and other data. Apple's response to the report was that it
       | will not fix the bug, most likely because it requires action on
       | the users' part in opening the file.
       | 
       | https://en.wikipedia.org/wiki/Webarchive#Vulnerability
        
         | tedmiston wrote:
         | I initially glossed over this believing it may be something
         | trivial, but it really is a deeper XSS concern.
         | 
         | It feels weird to me to dismiss as wontfix a security issue that
         | gives the archived page far greater access to browser data than
         | it has loaded at its original URL.
         | 
         | > Last updated at Tue, 16 Jan 2024 16:26:37 GMT
         | 
         | > tldr: For now, don't open .webarchive files, and check the
         | Metasploit module, Apple Safari .webarchive File Format UXSS
         | 
         | > Safari's webarchive format saves all the resources in a web
         | page - images, scripts, stylesheets - into a single file. A
         | flaw exists in the security model behind webarchives that
         | allows us to execute script in the context of any domain (a
         | Universal Cross-site Scripting bug). In order to exploit this
         | vulnerability, an attacker must somehow deliver the webarchive
         | file to the victim and have the victim manually open it ^1
         | (e.g. through email or a forced download), after ignoring a
         | potential "this content was downloaded from a webpage" warning
         | message ^2.
         | 
         | Just look at the number of (relatively trivial) attack vectors
         | identified by the author in this post:
         | 
         | > Attack Vector #1: Steal the user's cookies.
         | 
         | > Attack Vector #2: Steal CSRF tokens.
         | 
         | > Attack Vector #3: Steal local files.
         | 
         | > Attack Vector #4: Steal saved form passwords.
         | 
         | > Attack Vector #5: Store poisoned javascript in the user's
         | cache.
         | 
         | https://www.rapid7.com/blog/post/2013/04/25/abusing-safaris-...
        
       | BostonFern wrote:
       | I enjoyed reading this. It's easy to follow along.
       | 
       | It's also relatable to find answers to Swift documentation
       | questions in some talk by an Apple insider rather than in Apple's
       | official documentation.
        
       | 1attice wrote:
       | Cool, but consider monolith for this instead
       | (https://github.com/Y2Z/monolith)
       | 
       | - Outputs a single .html file, consumable with any web browser on
       | any platform
       | 
       | - Rust, so, you know, +1 HN (;D)
        
         | computershit wrote:
         | Thx for the recommendation, wasn't aware of this
        
       | sed3 wrote:
       | I save pages into PDF files. Low tech, but works since 2001.
       | 
       | I print with zero page margins, so in viewer it seems like
       | continuous page. I found Firefox produced smallest pdfs. Chrome
       | embeds fonts and other stuff. I also use UBlock rules to hide
       | some elements.
       | 
       | Pretty useful for archiving discussions on Reddit.
        
         | DaSHacka wrote:
         | > Pretty useful for archiving discussions on Reddit.
         | 
         | I use SingleFile for that, saves pages as a single self-
         | contained html file. That way you can still interact with
         | collapse comment buttons and outlinks.
        
         | bsnnkv wrote:
         | > Pretty useful for archiving discussions on Reddit.
         | 
         | What I do now is save the comments within the discussions (on
         | HN, Reddit, Twitter etc) as text which is indexable and
         | searchable with additional metadata which helps for filtering
         | (author is the main one I use), while automatically archiving
         | the entire URLs associated with them.[1]
         | 
         | For me, this is the best of both worlds - quick access via
         | fault-tolerant search and filtering to the most interesting
         | stuff while having a snapshot archive for the full context.
         | 
         | [1]: https://notado.app - I've been working on this for a few
         | years now and have posted a lot in my HN comment history and
         | technical blogs about how I have iterated on and evolved this
         | workflow to the point where it is now
        
         | tannhaeuser wrote:
         | I get you, but I still find it sad there's so little trust left
         | in the web stack that even a PDF is preferable. Technically, a
         | PDF can contain anything (bitmaps, text/glyphs without semantic
         | ordering, even JavaScript).
        
       | hu3 wrote:
       | A cross-platform solution is to use headless Chromium to save
       | pages as PDFs.
       | 
       | Example I just tried and worked for me:
       | 
       | $ chrome.exe --headless --no-sandbox --disable-gpu --print-to-pdf=page.pdf https://news.ycombinator.com
       | 
       | Taken from:
       | https://circumeo.io/blog/entry/html-to-pdf-using-python-and-...
       | 
       | I know you can specify cookies, existing sessions and existing
       | profiles in case authentication is required.
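The command above could be wrapped for batch use along these lines. This is only a sketch: the flags are the ones quoted in the comment, while the function names and the default binary name "chrome" are illustrative assumptions (the executable is named differently on each platform). Building the argument list is kept separate from running it so the command can be inspected without a browser installed.

```python
import subprocess


def build_print_command(url, output_pdf, chrome_binary="chrome"):
    """Return the argv list for a headless print-to-PDF run."""
    return [
        chrome_binary,  # assumed name; adjust to the local Chrome/Chromium binary
        "--headless",
        "--no-sandbox",
        "--disable-gpu",
        f"--print-to-pdf={output_pdf}",
        url,
    ]


def save_as_pdf(url, output_pdf, chrome_binary="chrome"):
    """Run the browser; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_print_command(url, output_pdf, chrome_binary), check=True)


command = build_print_command("https://news.ycombinator.com", "page.pdf")
print(" ".join(command))
```

Passing the arguments as a list rather than one shell string avoids quoting problems with URLs that contain characters such as & or ?.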
        
         | jamesgeck0 wrote:
         | The main issue with this is that you lose text reflowing, so
         | it's more annoying to access on mobile. You also lose
         | interactivity; I've seen links and menus implemented with
         | JavaScript break.
        
           | filleduchaos wrote:
           | There are also just plenty of web pages worth saving that
           | aren't simple documents.
        
           | hu3 wrote:
           | Indeed. PDF is subpar for many websites.
           | 
           | HTML is also easier to fix if something gets in the way of
           | content.
        
       | tedmiston wrote:
       | > Running it over my bookmarks
       | 
       | > Once I'd written the initial version of this script and put all
       | the pieces together, I used it to create webarchives for 6000 or
       | so bookmarks in my Pinboard account. It worked pretty well, and
       | captured 85% of my bookmarks - the remaining 15% are broken due
       | to link rot. I did a spot check of a few dozen archives that did
       | get saved, and they all look good.
       | 
       | I was a tad confused by this part.
       | 
       | Did you (or _how_ did you) verify that the headlessly saved web
       | archives for thousands of bookmarks visually match the pages
       | shown in the browser?
       | 
       | This is the biggest problem I've had with command-line archival
       | tools: they save some version of the page, but it often differs
       | substantially from what I actually see in my browser -- things
       | like pop-up artifacts covering the page or news articles full
       | of ads that are otherwise blocked in my headed browser.
       | 
       | The SingleFile extension for Chrome works more completely and
       | accurately than anything else I've come across so far, but it
       | does still break weirdly sometimes too.
       | 
       | I would love to find a programmatic way to automate the visual
       | verification, e.g., archiving a page with multiple different
       | tools and visually diffing the rendered pages across tools with
       | small margins of error. Maybe someone else has worked on this
       | already.
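One way the "visual diff with a small margin of error" idea might look, as a sketch rather than a tested tool. It assumes the two renderings have already been captured as same-sized screenshots and decoded into rows of (R, G, B) tuples (for example with an imaging library, which is left out here); the function names and both thresholds are illustrative assumptions, not values from the thread.

```python
def differing_fraction(image_a, image_b, channel_tolerance=8):
    """Fraction of pixels where any RGB channel differs by more than the tolerance."""
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("screenshots must have the same dimensions")
    total = 0
    differing = 0
    for row_a, row_b in zip(image_a, image_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += 1
            if any(abs(a - b) > channel_tolerance for a, b in zip(pixel_a, pixel_b)):
                differing += 1
    return differing / total if total else 0.0


def looks_the_same(image_a, image_b, max_fraction=0.02):
    """Treat two screenshots as matching if few enough pixels differ."""
    return differing_fraction(image_a, image_b) <= max_fraction


# Synthetic 10x10 "screenshots": a white page, a slightly noisy copy with one
# changed pixel, and a copy where a full row is covered (a stand-in for a pop-up).
white = [[(255, 255, 255)] * 10 for _ in range(10)]
noisy = [[(250, 255, 253)] * 10 for _ in range(10)]
noisy[0][0] = (0, 0, 0)
popup = [row[:] for row in white]
popup[3] = [(0, 0, 0)] * 10

print(differing_fraction(white, noisy), looks_the_same(white, noisy))
print(differing_fraction(white, popup), looks_the_same(white, popup))
```

A per-pixel comparison like this is sensitive to layout shifts of even a few pixels, so in practice a perceptual metric or a comparison on downscaled images would likely be more forgiving.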
        
         | kwhitefoot wrote:
         | WebScrapbook is also worth a look. I find that I like it
         | slightly better than SingleFile for creating copies that are
         | not packaged as single files. This lets me hard link identical
         | asset files to save space.
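The hard-linking step described here could be automated roughly as follows. This is a sketch under a few assumptions: all archives live on one filesystem that supports hard links, files are small enough to hash in memory, and nothing else is writing to the directory while it runs. The function name and the sample directory layout are made up for illustration.

```python
import hashlib
import os
import tempfile


def hard_link_duplicates(root):
    """Replace byte-identical files under root with hard links; return how many."""
    first_seen = {}  # SHA-256 digest -> path of the first file with that content
    linked = 0
    for folder, _, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            original = first_seen.setdefault(digest, path)
            # Skip the first copy itself and anything already linked to it.
            if original != path and not os.path.samefile(original, path):
                os.remove(path)
                os.link(original, path)  # both names now point at the same data
                linked += 1
    return linked


# Two saved pages that share an identical logo but have different HTML.
with tempfile.TemporaryDirectory() as root:
    for page in ("page-a", "page-b"):
        os.mkdir(os.path.join(root, page))
        with open(os.path.join(root, page, "logo.png"), "wb") as handle:
            handle.write(b"same bytes in both pages")
        with open(os.path.join(root, page, "index.html"), "w") as handle:
            handle.write(f"<h1>{page}</h1>")
    linked = hard_link_duplicates(root)
    shared = os.path.samefile(
        os.path.join(root, "page-a", "logo.png"),
        os.path.join(root, "page-b", "logo.png"),
    )
    print(linked, shared)
```

Because linked files share their contents, editing one copy changes all of them, which is fine for read-only archives but worth keeping in mind.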
        
         | byteknight wrote:
         | Archivebox with its multi-method backups. Screenshots, text,
         | html, etc.
        
       | stunpix wrote:
       | I was searching for something for web page preservation and also
       | considered Safari web-archives, but decided this is a "no go" for
       | me because of the private format, which is basically vendor lock-in.
       | Thus I ended up with a Chrome extension named SingleFile which
       | does a pretty decent job by saving the whole page (or its part)
       | as a single self sufficient html file viewable by any browser.
       | Also html files are easily indexed by Spotlight or other search
       | engines. The extension has no command line though but personally
       | I don't need that.
        
         | PaulHoule wrote:
         | See https://en.wikipedia.org/wiki/WARC_(file_format)
        
         | Vegenoid wrote:
         | The author considered the proprietary nature of the webarchive
         | format, and determined that it was readable without Apple
         | software, and that it wouldn't be too difficult to create a
         | tool to view or transform webarchive files if Safari were to
         | disappear:
         | https://alexwlchan.net/til/2024/whats-inside-safari-webarchi...
         | 
         | Of course, without a working implementation, there could be
         | hidden obstacles.
        
       ___________________________________________________________________
       (page generated 2024-06-04 23:00 UTC)