[HN Gopher] Peeking Inside Gigantic Zips with Only Kilobytes
___________________________________________________________________
Peeking Inside Gigantic Zips with Only Kilobytes
Author : rtk0
Score : 24 points
Date : 2025-10-12 09:57 UTC (4 days ago)
(HTM) web link (ritiksahni.com)
(TXT) w3m dump (ritiksahni.com)
| rtk0 wrote:
| In this blog post, I wrote about the architecture of a ZIP file
| and how we can leverage HTTP range requests to download individual
| files in-browser, without downloading or decompressing the whole
| archive.
| gildas wrote:
| For implementation in a library, you can use HttpRangeReader
| [1][2] in zip.js [3] (disclaimer: I am the author). It's a solid
| feature that has been in the library for about 10 years.
|
| [1] https://gildas-lormeau.github.io/zip.js/api/classes/HttpRang...
|
| [2] https://github.com/gildas-lormeau/zip.js/blob/master/tests/a...
|
| [3] https://github.com/gildas-lormeau/zip.js
| toomuchtodo wrote:
| Based on your experience, is zip the optimal archive format for
| long term digital archival in object storage if the use case
| calls for reading archives via http for scanning and cherry
| picking? Or is there a more optimal archive format?
| gildas wrote:
| Unfortunately, I will have difficulty answering your question
| because my knowledge is limited to the zip format. In the use
| case presented in the article, I find that the zip format
| meets the need well. Generally speaking, in the context of
| long-term archiving, its big advantage is also that there are
| thousands of implementations for reading/writing zip files.
| duskwuff wrote:
| ZIP isn't a terrible format, but it has a couple of flaws and
| limitations which make it a less than ideal format for long-
| term archiving. The biggest ones I'd call out are:
|
| 1) The format has limited and archaic support for file
| metadata - e.g. file modification times are stored as a MS-
| DOS timestamp with a 2-second (!) resolution, and there's no
| standard system for representing other metadata.
|
| 2) The single-level central directory can be awkward to work
| with for archives containing a very large number of members.
|
| 3) Support for 64-bit file sizes exists but is a messy hack.
|
| 4) Compression operates on each file as a separate stream,
| reducing its effectiveness for archives containing many small
| files. The format does support pluggable compression methods,
| but there's no straightforward way to support "solid"
| compression.
|
| 5) There is technically no way to reliably identify a ZIP
| file, as the end of central directory record can appear at
| any location near the end of the file, and the file can
| contain arbitrary data at its start. Most tools recognize ZIP
| files by the presence of a local file header at the start
| ("PK\x03\x04"), but that's not reliable.
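A minimal sketch of the detection problem described in point 5, assuming a plain (non-ZIP64) archive: scan the tail of the file backwards for the end of central directory signature ("PK\x05\x06") and accept a candidate only if its comment-length field runs exactly to the end of the data. The function name `find_eocd` and the returned dictionary keys are illustrative, not taken from the article or from any library.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_MIN_SIZE = 22  # fixed part of the record, without the trailing comment


def find_eocd(tail: bytes):
    """Locate the end of central directory record in the last bytes of a file.

    `tail` is assumed to end exactly where the file ends. Returns a dict of
    selected fields, or None if no plausible record is found.
    ZIP64 archives are not handled in this sketch.
    """
    pos = tail.rfind(EOCD_SIGNATURE)
    while pos != -1:
        if pos + EOCD_MIN_SIZE <= len(tail):
            # signature, 4 x uint16, 2 x uint32, uint16 (all little-endian)
            (_sig, _disk, _cd_disk, _entries_here, total_entries,
             cd_size, cd_offset, comment_len) = struct.unpack_from(
                "<4s4H2LH", tail, pos)
            # Heuristic: the declared comment must end exactly at end of file.
            if pos + EOCD_MIN_SIZE + comment_len == len(tail):
                return {
                    "total_entries": total_entries,
                    "cd_size": cd_size,
                    "cd_offset": cd_offset,
                    "comment_len": comment_len,
                }
        # Signature bytes can also occur inside a comment; keep searching.
        pos = tail.rfind(EOCD_SIGNATURE, 0, pos)
    return None
```

Over HTTP, `tail` would be the body of a suffix range request such as `Range: bytes=-65557` (22 bytes of fixed record plus the maximum 65535-byte comment).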
| xg15 wrote:
| This is really cool! Could also make a useful standalone command
| line tool.
|
| I think the general pattern - using the range header + prior
| knowledge of a file format to only download the parts of a file
| that are relevant - is still really underutilized.
|
| One small problem I see is that a server that does not support
| range requests would just try to send you the entire file in the
| first request, I think.
|
| So maybe doing a preflight HEAD request first to see if the
| server sends back Accept-Ranges could be useful.
|
| https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Ran...
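A rough sketch of the preflight idea above, using only Python's standard library. The helper names are hypothetical. Note that `Accept-Ranges` is advisory: a server may honor ranges without sending it, so checking for a `206 Partial Content` status on the actual ranged request is the more dependable test.

```python
import urllib.request


def _lowercase_keys(headers) -> dict:
    """Normalize header names so plain dicts and response objects both work."""
    return {k.lower(): v for k, v in headers.items()}


def supports_byte_ranges(headers) -> bool:
    """True if the Accept-Ranges header advertises the 'bytes' unit."""
    value = _lowercase_keys(headers).get("accept-ranges", "")
    units = [token.strip().lower() for token in value.split(",")]
    return "bytes" in units


def range_was_honored(status: int, headers) -> bool:
    """True if a ranged GET came back as a partial response.

    A server that ignores the Range header typically answers 200 with the
    full body, which is the case the caller wants to detect and abort.
    """
    return status == 206 and "content-range" in _lowercase_keys(headers)


def probe(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and report whether byte ranges are advertised."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return supports_byte_ranges(response.headers)
```

A client could call `probe(url)` first and fall back to `range_was_honored` on the first ranged GET when the header is missing.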
| xp84 wrote:
| How common is it in practice today to not support ranges? I
| remember back in the early days of broadband (c. 2000) when
| having a Download Manager was something most nerds endorsed,
| that most servers _then_ supported partial downloads. Aside
| from toy projects has anyone encountered a server which didn't
| allow ranges (unless specifically configured to forbid it)?
| xg15 wrote:
| I'd guess everything where support would have to be manually
| implemented.
|
| For static files served by CDNs or "established" HTTP
| servers I think support is pretty much a given (though e.g.
| Python's FastAPI only got support in 2020 [1]), but for
| anything dynamic, I doubt many devs would go through the
| trouble and implement support if it wasn't strictly necessary
| for their usecase.
|
| E.g. the URL may point to a service endpoint that loads the
| file contents from a database or blob storage instead of the
| file system. Then the service would have to implement range
| support itself and translate the ranges into the necessary
| storage/database calls (if those exist), etc. That's some
| effort you have to put in.
|
| Even for static files, there may be reverse proxies in front
| that (unintentionally) remove the support again. E.g. [2]
|
| [1] https://github.com/Kludex/starlette/issues/950
|
| [2] https://caddy.community/t/cannot-seek-further-in-videos-usin...
| jeffrallen wrote:
| Here's the results of my investigation into the same question:
|
| https://blog.nella.org/2016/01/17/seeking-http/
|
| (Originally written for Advent of Go.)
| HPsquared wrote:
| 7-zip does this. You can see it if you open (to view) a large ZIP
| file on a slow network drive. There's no way it is downloading the
| whole thing. You can extract single files from the ZIP also with
| only a little traffic.
| dividuum wrote:
| Would be surprised if that's not how basically all tools
| behave, as I expect them all to seek to the central directory
| and to the referenced offset of individual files when
| extracting. Doesn't really make a difference if that's across a
| network file system or a local disc.
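One way to check this claim locally, as a sketch: wrap an in-memory archive in a reader that counts the bytes requested, then let Python's standard `zipfile` module list and extract from it. `CountingReader`, `build_archive`, and `measure` are names made up for this example; the archive uses stored (uncompressed) members so the sizes are predictable.

```python
import io
import zipfile


class CountingReader(io.BytesIO):
    """In-memory file that records how many bytes callers actually read."""

    def __init__(self, data: bytes):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_archive(member_size: int = 1 << 20, members: int = 3) -> bytes:
    """Create a test archive: a few large members plus one small text file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for i in range(members):
            zf.writestr(f"big{i}.bin", bytes(member_size))
        zf.writestr("readme.txt", "hello")
    return buf.getvalue()


def measure():
    """Return archive size, names, and bytes read after listing and extracting."""
    data = build_archive()
    source = CountingReader(data)
    with zipfile.ZipFile(source) as zf:
        names = zf.namelist()            # needs only the end-of-file records
        after_listing = source.bytes_read
        content = zf.read("readme.txt")  # seeks to the member's local header
        after_extract = source.bytes_read
    return len(data), names, after_listing, content, after_extract
```

Comparing `after_listing` and `after_extract` with the total size shows how little of the archive is touched; the same access pattern is what a network file system or an HTTP range reader would see.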
| aeblyve wrote:
| This is also quite easy to do with .tar files, not to be confused
| with .tar.gz files.
| dekhn wrote:
| tar does not have an index.
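Both points can hold at once. A sketch, assuming an uncompressed ustar-style archive with short names: there is no index, but each member starts with a 512-byte header whose size field says how far to skip, so a reader can hop from header to header without touching the data. The cost is one small read (or range request) per member rather than a single read of a central directory. `list_tar_members` is a hypothetical helper; it ignores GNU long names, pax extended headers, and base-256 size fields.

```python
BLOCK = 512  # tar headers and data padding are both in 512-byte blocks


def list_tar_members(f):
    """Yield (name, data_offset, size) for each member of a seekable tar file.

    Only the 512-byte headers are read; member data is skipped with seek().
    """
    offset = 0
    while True:
        f.seek(offset)
        header = f.read(BLOCK)
        # A short read or an all-zero block marks the end of the archive.
        if len(header) < BLOCK or header == bytes(BLOCK):
            break
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        size_field = header[124:136].strip(b"\0 ")  # octal ASCII digits
        size = int(size_field, 8) if size_field else 0
        yield name, offset + BLOCK, size
        # Data is padded up to the next block boundary.
        padded = ((size + BLOCK - 1) // BLOCK) * BLOCK
        offset += BLOCK + padded
```

With a `.tar.gz` this does not work, because the offsets refer to the uncompressed stream and gzip cannot be entered at an arbitrary position without extra index data.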
| Lammy wrote:
| > That question took me into the guts of the ZIP format, where I
| learned there's a tiny index at the end that points to everything
| else.
|
| Tangential, but any Free Software that uses `shared-mime-info` to
| identify files (any of your GNOMEs, KDEs, etc) is unable to
| correctly identify Zip files by their EOCD due to the lack of an
| accepted syntax for defining search patterns based on negative
| file offsets. Please show your support on this Issue if you would
| also like to see this resolved:
| https://gitlab.freedesktop.org/xdg/shared-mime-info/-/issues...
| (linking to my own comment, so no this is not brigading)
|
| Anything using `file(1)` does not have this problem:
| https://github.com/file/file/blob/280e121/magic/Magdir/zip#L...
| silasb wrote:
| I've been looking at this for gzip files as well. There is a
| Rust solution that looks interesting called
| https://docs.rs/indexed_deflate/latest/indexed_deflate/. My goal
| is to be able to index MySQL dump files by table boundaries.
| dabinat wrote:
| I wrote a Rust command-line tool to do this for internal use in
| my SaaS. The motivation was to be able to index the contents of
| zip files stored on S3 without incurring significant egress
| charges. Is this something that people would generally find
| useful if it was open-sourced?
___________________________________________________________________
(page generated 2025-10-16 23:01 UTC)