[HN Gopher] Creating a Safari webarchive from the command line
___________________________________________________________________
Creating a Safari webarchive from the command line
Author : ingve
Score : 116 points
Date : 2024-06-04 06:24 UTC (16 hours ago)
(HTM) web link (alexwlchan.net)
(TXT) w3m dump (alexwlchan.net)
| eltondegeneres wrote:
| Is there a reason why you'd want to save archival material in a
| proprietary format? Wouldn't it be better/easier to use wget with
| the `--warc-file` flag?
| simongr3dal wrote:
| It is proprietary, but it's not like it's difficult to decode
| or interpret.
| codetrotter wrote:
| FTA:
|
| > Although Safari is only maintained by Apple, the Safari
| webarchive format can be read by non-Apple tools - it's a
| binary property list that stores the raw bytes of the original
| files. I'm comfortable that I'll be able to open these archives
| for a while, even if Safari unexpectedly goes away.
| alexwlchan wrote:
| 1/ Why not wget?
|
| For this project I wanted a consistent file format for my
| entire collection.
|
| I have a bunch of stuff I want to save which is behind
| paywalls/logins/clickthroughs that are tricky for wget to
| reach. I know I can hand wget a cookies file, but that's mildly
| fiddly. I save those pages as Safari webarchive files, and then
| they can drop in alongside the files I've collected
| programmatically. Then I can deal with all my saved pages as a
| homogeneous set, rather than being split into two formats.
|
| Plus I couldn't find anybody who'd done this, and it was fun :D
|
| This is only for personal stuff where I know I'll be using
| Safari/macOS for the foreseeable future. I don't envisage using
| this for anything professional, or a shared archive -- you're
| right that a less proprietary format would be better in those
| contexts. I think I'm in a bit of a niche here.
|
| (I'm honestly surprised this is on the front page; I didn't
| think anybody else would be that interested.)
|
| 2/ Proprietary format: it is, but before I started I did some
| experiments to see what's actually inside. It's a binary plist
| and I can recover all the underlying HTML/CSS/JS files with
| Python, so I'm not totally hosed if Safari goes away.
|
| Notes on that here:
| https://alexwlchan.net/til/2024/whats-inside-safari-webarchi...
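|
| A minimal sketch of that kind of recovery with Python's standard
| `plistlib`. The key names (`WebMainResource`, `WebSubresources`,
| `WebResourceURL`, `WebResourceMIMEType`, `WebResourceData`) are the
| ones commonly observed in Safari-written archives, not a documented
| contract, and `iter_resources` is a hypothetical helper name. The
| sample archive is built in memory so the sketch runs without a file:

```python
import plistlib


def iter_resources(archive_bytes):
    """Yield (url, mime_type, data) for every resource in a webarchive.

    Assumes the layout commonly seen in Safari-written archives: one
    WebMainResource dict plus an optional WebSubresources list.
    """
    archive = plistlib.loads(archive_bytes)  # auto-detects binary vs XML plist
    resources = [archive["WebMainResource"]]
    resources.extend(archive.get("WebSubresources", []))
    for res in resources:
        yield (
            res.get("WebResourceURL"),
            res.get("WebResourceMIMEType"),
            res.get("WebResourceData", b""),
        )


# Tiny stand-in archive (an assumption, not real Safari output).
sample = plistlib.dumps(
    {
        "WebMainResource": {
            "WebResourceURL": "https://example.com/",
            "WebResourceMIMEType": "text/html",
            "WebResourceData": b"<h1>hi</h1>",
        },
        "WebSubresources": [
            {
                "WebResourceURL": "https://example.com/style.css",
                "WebResourceMIMEType": "text/css",
                "WebResourceData": b"h1 { color: red }",
            }
        ],
    },
    fmt=plistlib.FMT_BINARY,
)

resources = list(iter_resources(sample))
for url, mime, data in resources:
    print(url, mime, len(data))
```

| With a real archive, the bytes read from the `.webarchive` file
| (opened in binary mode) would take the place of `sample`.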
| pvg wrote:
| _I didn't think anybody else would be that interested._
|
| 'Save the webpage as I see it in my browser' remains a
| surprisingly annoying and fiddly problem, especially
| programmatically, so the niche is probably a little roomier
| than you might initially suspect.
| diggan wrote:
| > 'Save the webpage as I see it in my browser' remains a
| surprisingly annoying and fiddly problem
|
| Is it really? I remember hacking around with
| JavaScript's XMLSerializer (I think) like 5 years ago and
| solved that for ~90% of the websites I tried to archive.
| It'd save the DOM as-is when executed.
|
| Internet Archive/ArchiveTeam also worked on that particular
| problem for a very long time, and are mostly successful as
| far as I can tell.
| pvg wrote:
| 90% feels like an overestimate to me but it's already
| quite poor, you wouldn't accept that for saving most
| other things. Another problem is highlighted in the piece
| - it's a hassle to ensure external tools handle session
| state and credentials. Dynamic content is poorly handled,
| the default behaviours are miserable (a browser will run
| random Javascript from the network but not Javascript
| you've saved, etc).
|
| There's a lot of interest in 'digital preservation' and
| perhaps one sign of how it's very much early days of the
| field - it's tricky to 'just save' the results of one of
| the most basic current computer interactions - looking at
| a web page.
| diggan wrote:
| But if you serialize the DOM as-is, you literally get
| what you see on the page when you archive it. Nothing
| about it is dynamic, and there are no sessions or
| credentials to handle. Granted, it's a static copy of a
| specific single page.
|
| If you need more than that, then WARC is probably the
| best. For my measly needs of just preserving exactly what
| I see, serializing the DOM and saving the result seems to
| do just fine.
| pvg wrote:
| Yes you save something that's mildly better than print-
| page-to-PDF. But it still misses things and the
| interactive stuff is very much part of 'exactly what I
| see'. Like, a random article with an interactive graph,
| for instance - like this recent HN hit
| https://ciechanow.ski/airfoil/
|
| It's not that there aren't workarounds, it's that they
| are clunky and 'you can't actually save the most common
| computery entity you deal with' is just a strange state
| of affairs we've somehow Stockholmed ourselves to.
| tedmiston wrote:
| > Internet Archive/ArchiveTeam also worked on that
| particular problem for a very long time, and are mostly
| successful as far as I can tell.
|
| One category that the archivers do poorly with is news
| articles where a pop-up renders on page load which then
| requires client-side JS execution to dismiss the pop-up.
|
| Sometimes it is easily circumvented by manual DOM
| manipulation, but that's hardly a bulletproof solution.
| And it feels automatable.
| DaSHacka wrote:
| > 'Save the webpage as I see it in my browser' remains a
| surprisingly annoying and fiddly problem
|
| You may be interested in SingleFile[1]
|
| [1] https://github.com/gildas-lormeau/SingleFile
|
| I use it all the time to archive webpages, and I imagine it
| wouldn't be hard to throw together a script to use
| FireFox's headless mode in combination with SingleFile to
| selfhost a clone of the wayback machine.
| pvg wrote:
| Thanks, I've seen it, last I tried it it missed bg
| images. But my point is this is something browsers should
| support better and kind of sort of do now but even with
| that it's a hassle.
| tedmiston wrote:
| I tested this just now on the blog post that this HN page
| points to and SingleFile handled the background image
| fine.
| cxr wrote:
| > FireFox's
|
| It's just "Firefox".
| factormeta wrote:
| Thanks to all the JS/SPA developers who insist on putting
| JS all over the place. Wouldn't it be better to have
| everything in one .html, using <script> <style> just
| inline? Then it is also just one file over the internet.
| There must be a bundler that does that, no?
|
| Seems JS developers just want their code to be as obfuscated
| and unarchivable as possible unless it is via their web
| server.
| cxr wrote:
| > using <script> <style> just inline
|
| These SPA bundles are on the order of megabytes, not
| kilobytes. You want your users, for their own sake and
| yours, to be able to cache as much as possible instead of
| delivering a unique megablob payload for every page they
| hit.
| sturakov wrote:
| I've enjoyed using this
|
| https://github.com/webrecorder
|
| It has a standardized format and acts like a recorder for
| what you see.
| buildbot wrote:
| How can you actually open and view a warc file? I've never
| found a good application for this, have I missed something
| obvious?
| diggan wrote:
| Lots of tools available, best index I've found of the
| ecosystem is this:
| https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem
|
| Ultimately, this is the best viewer I've found so far:
| https://replayweb.page/
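|
| For a quick look at what a capture contains without a viewer, a
| WARC file is a sequence of records: a `WARC/1.x` version line,
| `Name: value` header fields, a blank line, then `Content-Length`
| bytes of content. A minimal sketch of reading that, assuming
| uncompressed input (a `.warc.gz` would need decompressing first)
| and using a hand-built stand-in record that omits some fields a
| real record carries; `iter_warc_records` is a hypothetical name:

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


# Stand-in record (an assumption): one HTTP response capture.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hi</p>"
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: https://example.com/\r\n",
    b"Content-Length: %d\r\n" % len(payload),
    b"\r\n",
    payload,
    b"\r\n\r\n",
])

records = list(iter_warc_records(io.BytesIO(sample)))
for headers, body in records:
    print(headers["WARC-Type"], headers.get("WARC-Target-URI"), len(body))
```

| This only lists what was captured; rendering the page still needs
| a replay tool like the ones linked above.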
| tivert wrote:
| > Lots of tools available, best index I've found of the
| ecosystem is this:
| https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem
|
| It looks like there's a lot of tools for creating them, but
| not a lot for viewing.
|
| What they really need is browser support, or at least an
| extension so a browser can open the files directly.
| jamesgeck0 wrote:
| There is a browser extension. It can record WARC files,
| but also has a viewing interface identical to
| ReplayWeb.page. https://archiveweb.page/guide
| cxr wrote:
| > What they really need is browser support, or at least
| an extension so a browser can open the files directly
|
| That's probably the wrong thing. What browsers really
| need is a thin but standardized API that lets any third-party
| app that the user has installed on their machine supply
| the content for various fetch/reads.
|
| You'd open the WARC in Firefox or Safari or whatever, but
| Safari et al wouldn't have any special understanding of
| the format. It would know that _your_ app does WARCs,
| though, and then knock on the door and say, "Please tell
| me the content I should be showing here; I'll defer to
| you for any further "requests" associated with the
| file/page loaded in this tab--just tell me the content I
| should use for those, too."
| lostemptations5 wrote:
| I've never easily been able to read warc files.
| savolai wrote:
| I'd love to have a viewer for this or converter to a standard
| html single page archive that works in other browsers too. Is
| there some reason for Apple's proprietary format to exist over
| self-contained html? I have a bunch of Apple webarchives too (stored
| from ios to notion) and am worried if there is no durable
| solution to open these beyond "code it yourself".
| Springtime wrote:
| _> I'd love to have a viewer for this or converter to a
| standard html single page archive that works in other browsers
| too_
|
| About 10 years ago I searched and found a webarchive to MHTML
| converter from someone's small site. I recall there was one
| caveat, something like it didn't include the date in the
| metadata of the output MHTML.
|
| Sorry I don't have the link on-hand, it'd be on the HDD I
| pulled from my old Macbook.
|
| Edit: found an alternative script from my notes:
| https://blog.iefdev.se/2016/01/converting-webarchive-to-mht/
|
| Edit 2: pretty sure this is the original converter, since I
| noted a caveat that it outputs the timestamp of the conversion
| not the time the file was originally saved (the script above
| handles this correctly otoh).
| https://langui.net/webarchive-to-mht/
| lacop wrote:
| Unfortunately I don't think browsers handle opening MHTML
| well out of the box.
|
| At least last time I checked it only worked for local paths
| (file://) and only some browsers. Otherwise it would either
| try to download or just show plain text.
|
| I ended up using Chrome to dump to MHTML [1] and then
| reshuffle the content into individual files, rewriting the
| path references and fixing mime types [2]. That gives a
| reasonably faithful static capture of a page that can be
| shared as link.
|
| [1] https://chromedevtools.github.io/devtools-protocol/tot/Page/...
| [2] https://github.com/lacop/udrb/blob/master/app/src/chrome/mod...
| tedmiston wrote:
| Something like ArchiveBox or SingleFile are in the same
| ballpark of tools, but SingleFile at least seems to eschew
| Safari Webarchive as a format. ArchiveBox may support Safari
| webarchives, but for some reason they omit it in their docs.
|
| https://github.com/gildas-lormeau/SingleFile?tab=readme-ov-f...
|
| https://archivebox.io/#output-formats
| pzmarzly wrote:
| Could this be compiled for Linux and WebKitGtk, or is this API
| Safari-specific?
| hu3 wrote:
| sidenote: be careful when opening a webarchive from a third
| party, if that improbable opportunity ever materializes:
|
| > In February 2013, a vulnerability with the webarchive format
| was discovered and reported by Joe Vennix, a Metasploit Project
| developer. The exploit allows an attacker to send a crafted
| webarchive to a user containing code to access cookies, local
| files, and other data. Apple's response to the report was that it
| will not fix the bug, most likely because it requires action on
| the users' part in opening the file.
|
| https://en.wikipedia.org/wiki/Webarchive#Vulnerability
| tedmiston wrote:
| I initially glossed over this believing it may be something
| trivial, but it really is a deeper XSS concern.
|
| It feels weird to me to dismiss as wontfix a security issue that
| gives the archived page far greater access to browser data than
| it has when loaded at its original URL.
|
| > Last updated at Tue, 16 Jan 2024 16:26:37 GMT
|
| > tldr: For now, don't open .webarchive files, and check the
| Metasploit module, Apple Safari .webarchive File Format UXSS
|
| > Safari's webarchive format saves all the resources in a web
| page - images, scripts, stylesheets - into a single file. A
| flaw exists in the security model behind webarchives that
| allows us to execute script in the context of any domain (a
| Universal Cross-site Scripting bug). In order to exploit this
| vulnerability, an attacker must somehow deliver the webarchive
| file to the victim and have the victim manually open it ^1
| (e.g. through email or a forced download), after ignoring a
| potential "this content was downloaded from a webpage" warning
| message ^2.
|
| Just look at the number of (relatively trivial) attack vectors
| identified by the author in this post:
|
| > Attack Vector #1: Steal the user's cookies.
|
| > Attack Vector #2: Steal CSRF tokens.
|
| > Attack Vector #3: Steal local files.
|
| > Attack Vector #4: Steal saved form passwords.
|
| > Attack Vector #5: Store poisoned javascript in the user's
| cache.
|
| https://www.rapid7.com/blog/post/2013/04/25/abusing-safaris-...
| BostonFern wrote:
| I enjoyed reading this. It's easy to follow along.
|
| It's also relatable to find answers to Swift documentation
| questions in some talk by an Apple insider rather than in Apple's
| official documentation.
| 1attice wrote:
| Cool, but consider monolith for this instead
| (https://github.com/Y2Z/monolith)
|
| - Outputs a single .html file, consumable with any web browser on
| any platform
|
| - Rust, so, you know, +1 HN (;D)
| computershit wrote:
| Thx for the recommendation, wasn't aware of this
| sed3 wrote:
| I save pages into PDF files. Low tech, but works since 2001.
|
| I print with zero page margins, so in a viewer it looks like one
| continuous page. I found Firefox produced the smallest PDFs. Chrome
| embeds fonts and other stuff. I also use UBlock rules to hide
| some elements.
|
| Pretty useful for archiving discussions on Reddit.
| DaSHacka wrote:
| > Pretty useful for archiving discussions on Reddit.
|
| I use SingleFile for that, saves pages as a single self-
| contained html file. That way you can still interact with
| collapse comment buttons and outlinks.
| bsnnkv wrote:
| > Pretty useful for archiving discussions on Reddit.
|
| What I do now is save the comments within the discussions (on
| HN, Reddit, Twitter etc) as text which is indexable and
| searchable with additional metadata which helps for filtering
| (author is the main one I use), while automatically archiving
| the entire URLs associated with them.[1]
|
| For me, this is the best of both worlds - quick access via
| fault-tolerant search and filtering to the most interesting
| stuff while having a snapshot archive for the full context.
|
| [1]: https://notado.app - I've been working on this for a few
| years now and have posted a lot in my HN comment history and
| technical blogs about how I have iterated on and evolved this
| workflow to the point where it is now
| tannhaeuser wrote:
| I get you, but I still find it sad there's so little trust left
| in the web stack that even a PDF is preferable. Technically, a
| PDF can contain anything (bitmaps, text/glyphs without semantic
| ordering, even JavaScript).
| hu3 wrote:
| A cross-platform solution is to use headless Chromium to save
| pages as PDFs.
|
| Example I just tried and worked for me:
|
| $ chrome.exe --headless --no-sandbox --disable-gpu --print-to-pdf=page.pdf https://news.ycombinator.com
|
| Taken from:
| https://circumeo.io/blog/entry/html-to-pdf-using-python-and-...
|
| I know you can specify cookies, existing sessions and existing
| profiles in case authentication is required.
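|
| To run the same command over a list of URLs, a minimal sketch that
| builds the argument list from Python. It uses only the flags from
| the command above; the helper names and the default binary name
| are assumptions, and the binary differs per platform:

```python
import subprocess


def print_to_pdf_command(url, output_path, chrome_binary="chrome.exe"):
    """Build the argv list for the headless print-to-PDF invocation."""
    return [
        chrome_binary,
        "--headless",
        "--no-sandbox",
        "--disable-gpu",
        f"--print-to-pdf={output_path}",
        url,
    ]


def save_pdf(url, output_path, chrome_binary="chrome.exe"):
    """Run the command; raises CalledProcessError if the browser exits non-zero."""
    subprocess.run(print_to_pdf_command(url, output_path, chrome_binary), check=True)


# Only build and show the command here; save_pdf() needs a browser installed.
command = print_to_pdf_command("https://news.ycombinator.com", "page.pdf")
print(" ".join(command))
```

| Passing the arguments as a list avoids shell quoting problems
| with URLs that contain `&` or `?`.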
| jamesgeck0 wrote:
| The main issue with this is that you lose text reflowing, so
| it's more annoying to access on mobile. You also lose
| interactivity; I've seen links and menus implemented with
| JavaScript break.
| filleduchaos wrote:
| There are also just plenty of web pages worth saving that
| aren't simple documents.
| hu3 wrote:
| Indeed. PDF is subpar for many websites.
|
| HTML is also easier to fix if something gets in the way of
| content.
| tedmiston wrote:
| > Running it over my bookmarks
|
| > Once I'd written the initial version of this script and put all
| the pieces together, I used it to create webarchives for 6000 or
| so bookmarks in my Pinboard account. It worked pretty well, and
| captured 85% of my bookmarks - the remaining 15% are broken due
| to link rot. I did a spot check of a few dozen archives that did
| get saved, and they all look good.
|
| I was a tad confused by this part.
|
| Did you (or _how_ did you) verify that the headlessly saved web
| archives for thousands of bookmarks visually match the pages
| shown in the browser?
|
| This is the biggest problem I've had with command-line archival
| tools: they save some version of the page, but it often differs
| substantially from what I actually see in my browser -- things
| like pop-up artifacts covering the page, or news articles full
| of ads that are otherwise blocked in my headed browser.
|
| The SingleFile extension for Chrome works more completely and
| accurately than anything else I've come across so far, but it
| does still break weirdly sometimes too.
|
| I would love to find a programmatic way to automate the visual
| verification, e.g., archiving a page with multiple different
| tools and visually diffing the rendered pages across tools with
| small margins of error. Maybe someone else has worked on this
| already.
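|
| A minimal sketch of the comparison step, assuming both renderings
| have already been captured as screenshots of the same size and
| decoded into grids of (r, g, b) tuples. The function names and
| both thresholds are arbitrary assumptions, not tuned values:

```python
def fraction_different(pixels_a, pixels_b, channel_tolerance=8):
    """Return the fraction of pixels whose colour differs noticeably.

    Both inputs are equal-sized grids: lists of rows of (r, g, b) tuples.
    A pixel counts as different if any channel differs by more than
    channel_tolerance.
    """
    if len(pixels_a) != len(pixels_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(pixels_a, pixels_b)
    ):
        raise ValueError("screenshots must have identical dimensions")
    total = differing = 0
    for row_a, row_b in zip(pixels_a, pixels_b):
        for px_a, px_b in zip(row_a, row_b):
            total += 1
            if any(abs(a - b) > channel_tolerance for a, b in zip(px_a, px_b)):
                differing += 1
    return differing / total if total else 0.0


def looks_the_same(pixels_a, pixels_b, max_fraction=0.02):
    """Treat two renderings as matching if few enough pixels differ."""
    return fraction_different(pixels_a, pixels_b) <= max_fraction


# Synthetic 10x10 "screenshots" standing in for real captures.
white = (255, 255, 255)
page_a = [[white] * 10 for _ in range(10)]
page_b = [row[:] for row in page_a]
page_b[0][0] = (0, 0, 0)          # one stray pixel: within the margin
page_c = [row[:] for row in page_a]
page_c[0][:5] = [(0, 0, 0)] * 5   # a five-pixel "pop-up": outside it

print(looks_the_same(page_a, page_b), looks_the_same(page_a, page_c))
```

| A per-pixel comparison like this is sensitive to layout shifts,
| so it is a starting point rather than a full answer.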
| kwhitefoot wrote:
| WebScrapbook is also worth a look. I find that I like it
| slightly better than SingleFile for creating copies that are
| not packaged as single files. This lets me hard link identical
| asset files to save space.
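|
| A minimal sketch of that hard-linking step, assuming all saved
| pages live under one directory on a filesystem that supports
| hard links. `hardlink_duplicates` is a hypothetical helper; it
| reads each file fully into memory and treats equal SHA-256
| digests as identical content:

```python
import hashlib
import os
import tempfile


def hardlink_duplicates(root):
    """Replace byte-identical files under root with hard links to one copy.

    Returns the number of files that were replaced by links.
    """
    first_seen = {}  # SHA-256 digest -> path of the first file with that content
    replaced = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            original = first_seen.setdefault(digest, path)
            if original == path or os.path.samefile(original, path):
                continue  # first copy, or already linked to it
            # Link under a temporary name, then swap it into place.
            tmp_path = path + ".tmp-link"
            os.link(original, tmp_path)
            os.replace(tmp_path, path)
            replaced += 1
    return replaced


# Demo on a throwaway directory with two pages sharing one stylesheet.
with tempfile.TemporaryDirectory() as root:
    for page in ("page1", "page2"):
        os.makedirs(os.path.join(root, page))
        with open(os.path.join(root, page, "style.css"), "wb") as f:
            f.write(b"body { margin: 0 }")
    replaced = hardlink_duplicates(root)
    same_inode = os.path.samefile(
        os.path.join(root, "page1", "style.css"),
        os.path.join(root, "page2", "style.css"),
    )
    print(replaced, same_inode)
```

| Hard-linked files share content, so editing one copy in place
| changes every page that links to it.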
| byteknight wrote:
| Archivebox with its multi-method backups. Screenshots, text,
| html, etc.
| stunpix wrote:
| I was searching for something for web page preservation and also
| considered Safari web-archives, but decided this is a "no go" for
| me because of the private format, which is basically vendor lock-in.
| Thus I ended up with a Chrome extension named SingleFile which
| does a pretty decent job by saving the whole page (or its part)
| as a single self sufficient html file viewable by any browser.
| Also html files are easily indexed by Spotlight or other search
| engines. The extension has no command line though but personally
| I don't need that.
| PaulHoule wrote:
| See https://en.wikipedia.org/wiki/WARC_(file_format)
| Vegenoid wrote:
| The author considered the proprietary nature of the webarchive
| format, and determined that it was readable without Apple
| software, and that it wouldn't be too difficult to create a
| tool to view or transform webarchive files if Safari were to
| disappear:
| https://alexwlchan.net/til/2024/whats-inside-safari-webarchi...
|
| Of course, without a working implementation, there could be
| hidden obstacles.
___________________________________________________________________
(page generated 2024-06-04 23:00 UTC)