[HN Gopher] Peeking Inside Gigantic Zips with Only Kilobytes
       ___________________________________________________________________
        
       Peeking Inside Gigantic Zips with Only Kilobytes
        
       Author : rtk0
       Score  : 24 points
       Date   : 2025-10-12 09:57 UTC (4 days ago)
        
 (HTM) web link (ritiksahni.com)
 (TXT) w3m dump (ritiksahni.com)
        
       | rtk0 wrote:
        | In this blog post, I wrote about the architecture of a ZIP file
        | and how we can leverage HTTP range requests to pull individual
        | files out of an archive, in-browser, without downloading or
        | decompressing the whole thing.
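        | 
        | At its core, the trick is just a suffix Range request for the
        | tail of the file. A minimal sketch (the URL is a placeholder,
        | and the 64 KiB tail size is an assumption that covers the End
        | of Central Directory record plus its comment in most archives):
        | 
        |     // Placeholder URL; 64 KiB is an assumed tail size.
        |     async function fetchZipTail(url: string): Promise<Uint8Array> {
        |       const res = await fetch(url, {
        |         headers: { Range: "bytes=-65536" },
        |       });
        |       // 206 Partial Content means the server honored the range.
        |       if (res.status !== 206) {
        |         throw new Error("server ignored the Range header");
        |       }
        |       return new Uint8Array(await res.arrayBuffer());
        |     }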
        
       | gildas wrote:
       | For implementation in a library, you can use HttpRangeReader
       | [1][2] in zip.js [3] (disclaimer: I am the author). It's a solid
       | feature that has been in the library for about 10 years.
       | 
       | [1] https://gildas-
       | lormeau.github.io/zip.js/api/classes/HttpRang...
       | 
       | [2] https://github.com/gildas-
       | lormeau/zip.js/blob/master/tests/a...
       | 
       | [3] https://github.com/gildas-lormeau/zip.js
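        | 
        | For reference, a minimal usage sketch (placeholder URL; see the
        | docs above for the reader/writer options):
        | 
        |     import {
        |       ZipReader, HttpRangeReader, BlobWriter,
        |     } from "@zip.js/zip.js";
        | 
        |     const reader = new ZipReader(
        |       new HttpRangeReader("https://example.com/huge.zip"),
        |     );
        |     // Only the central directory is fetched here.
        |     const entries = await reader.getEntries();
        |     const first = entries.find((entry) => !entry.directory);
        |     // Range-requests just the bytes of that one entry.
        |     const blob = await first?.getData?.(new BlobWriter());
        |     await reader.close();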
        
         | toomuchtodo wrote:
          | Based on your experience, is zip the optimal archive format
          | for long-term digital archival in object storage if the use
          | case calls for reading archives via HTTP for scanning and
          | cherry-picking? Or is there a more optimal archive format?
        
           | gildas wrote:
            | Unfortunately, I will have difficulty answering your
            | question because my knowledge is limited to the zip format.
            | For the use case presented in the article, I find that the
            | zip format meets the need well. Generally speaking, for
            | long-term archiving its big advantage is also that there
            | are thousands of implementations for reading and writing
            | zip files.
        
           | duskwuff wrote:
           | ZIP isn't a terrible format, but it has a couple of flaws and
           | limitations which make it a less than ideal format for long-
           | term archiving. The biggest ones I'd call out are:
           | 
            | 1) The format has limited and archaic support for file
            | metadata - e.g. file modification times are stored as an
            | MS-DOS timestamp with a 2-second (!) resolution, and
            | there's no standard system for representing other metadata.
           | 
           | 2) The single-level central directory can be awkward to work
           | with for archives containing a very large number of members.
           | 
           | 3) Support for 64-bit file sizes exists but is a messy hack.
           | 
           | 4) Compression operates on each file as a separate stream,
           | reducing its effectiveness for archives containing many small
           | files. The format does support pluggable compression methods,
           | but there's no straightforward way to support "solid"
           | compression.
           | 
            | 5) There is technically no way to reliably identify a ZIP
            | file, as the end of central directory record can appear at
            | any location near the end of the file, and the file can
            | contain arbitrary data at its start. Most tools recognize
            | ZIP files by the presence of a local file header signature
            | at the start ("PK\x03\x04"), but that's not reliable (see
            | the sketch below).
        
       | xg15 wrote:
        | This is really cool! It could also make a useful standalone
        | command-line tool.
        | 
        | I think the general pattern - using the Range header plus
        | prior knowledge of a file format to download only the parts of
        | a file that are relevant - is still really underutilized.
        | 
        | One small problem I see is that a server that does not support
        | range requests would just send you the entire file in response
        | to the first request, I think.
        | 
        | So maybe doing a preflight HEAD request first, to see if the
        | server sends back Accept-Ranges, could be useful.
       | 
       | https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Ran...
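        | 
        | A minimal sketch of that preflight (placeholder URL; note that
        | some servers honor Range without advertising Accept-Ranges, so
        | checking for a 206 on the real request is still worthwhile):
        | 
        |     // HEAD preflight: does the server advertise range support?
        |     async function supportsRanges(url: string): Promise<boolean> {
        |       const res = await fetch(url, { method: "HEAD" });
        |       return res.headers.get("Accept-Ranges") === "bytes";
        |     }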
        
         | xp84 wrote:
          | How common is it in practice today to not support ranges? I
          | remember that back in the early days of broadband (c. 2000),
          | when having a Download Manager was something most nerds
          | endorsed, most servers _then_ supported partial downloads.
          | Aside from toy projects, has anyone encountered a server
          | which didn't allow ranges (unless specifically configured to
          | forbid it)?
        
           | xg15 wrote:
           | I'd guess everything where support would have to be manually
           | implemented.
           | 
            | For static files served by CDNs or "established" HTTP
            | servers, I think support is pretty much a given (though
            | e.g. Python's FastAPI only got support in 2020 [1]), but
            | for anything dynamic, I doubt many devs would go to the
            | trouble of implementing support if it wasn't strictly
            | necessary for their use case.
           | 
            | E.g. the URL may point to a service endpoint that loads
            | the file contents from a database or blob storage instead
            | of the file system. Then the service would have to
            | implement range support itself and translate ranges into
            | the necessary storage/database calls (if those exist),
            | etc. That's some effort you have to put in (roughly the
            | sketch at the end of this comment).
           | 
           | Even for static files, there may be reverse proxies in front
           | that (unintentionally) remove the support again. E.g. [2]
           | 
           | [1] https://github.com/Kludex/starlette/issues/950
           | 
           | [2] https://caddy.community/t/cannot-seek-further-in-videos-
           | usin...
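            | 
            | To make the dynamic-endpoint case concrete, a rough sketch
            | of the work such a service takes on (Node-style handler;
            | loadBlob is a hypothetical stand-in for a database or
            | object-storage read; validation, suffix ranges like
            | "bytes=-65536", and multi-range requests are omitted for
            | brevity):
            | 
            |     import { createServer } from "node:http";
            | 
            |     // Hypothetical: fetch the full object for a path.
            |     declare function loadBlob(path: string): Promise<Buffer>;
            | 
            |     createServer(async (req, res) => {
            |       const blob = await loadBlob(req.url ?? "/");
            |       const m = /^bytes=(\d+)-(\d*)$/
            |         .exec(req.headers.range ?? "");
            |       if (!m) {
            |         // No (or unsupported) Range header: send it all.
            |         res.writeHead(200, { "Content-Length": blob.length });
            |         return res.end(blob);
            |       }
            |       const start = Number(m[1]);
            |       const end = m[2] ? Number(m[2]) : blob.length - 1;
            |       res.writeHead(206, {
            |         "Content-Range": `bytes ${start}-${end}/${blob.length}`,
            |         "Content-Length": end - start + 1,
            |       });
            |       res.end(blob.subarray(start, end + 1));
            |     }).listen(8080);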
        
       | jeffrallen wrote:
        | Here are the results of my investigation into the same
        | question:
       | 
       | https://blog.nella.org/2016/01/17/seeking-http/
       | 
       | (Originally written for Advent of Go.)
        
       | HPsquared wrote:
        | 7-Zip does this. You can see it if you open (to view) a large
        | ZIP file on a slow network drive. There's no way it is
        | downloading the whole thing. You can also extract single files
        | from the ZIP with only a little traffic.
        
         | dividuum wrote:
          | I'd be surprised if that's not how basically all tools
          | behave, as I expect them all to seek to the central
          | directory, and then to the referenced offset of an
          | individual file when extracting it. It doesn't really make a
          | difference whether that's across a network file system or on
          | a local disk.
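          | 
          | Concretely, all a tool needs from the EOCD record are two
          | little-endian fields, after which it can seek (or
          | range-request) exactly the central directory slice. A sketch
          | (ignoring ZIP64, where these fields overflow to 0xFFFFFFFF
          | and the real values live in the ZIP64 EOCD):
          | 
          |     // Read the central directory size and offset out of the
          |     // EOCD record (fields at byte offsets 12 and 16).
          |     function centralDirectoryLocation(
          |       tail: Uint8Array, eocdPos: number,
          |     ): { size: number; offset: number } {
          |       const view = new DataView(
          |         tail.buffer, tail.byteOffset + eocdPos,
          |       );
          |       return {
          |         size: view.getUint32(12, true),   // bytes
          |         offset: view.getUint32(16, true), // from file start
          |       };
          |     }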
        
       | aeblyve wrote:
       | This is also quite easy to do with .tar files, not to be confused
       | with .tar.gz files.
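        | 
        | Roughly: you hop from 512-byte header to 512-byte header,
        | since each ustar header records the member's size. A sketch
        | (fetchRange is a hypothetical single-Range helper; base ustar
        | only, no GNU long-name extensions):
        | 
        |     // Hypothetical: fetch `len` bytes starting at `start`.
        |     declare function fetchRange(
        |       url: string, start: number, len: number,
        |     ): Promise<Uint8Array>;
        | 
        |     async function listTar(url: string, total: number) {
        |       const names: string[] = [];
        |       const text = new TextDecoder();
        |       let offset = 0;
        |       while (offset + 512 <= total) {
        |         const header = await fetchRange(url, offset, 512);
        |         if (header.every((b) => b === 0)) break; // end marker
        |         const name = text.decode(header.subarray(0, 100));
        |         names.push(name.split("\0")[0]);
        |         // Size: 12 octal ASCII bytes at header offset 124.
        |         const size = parseInt(
        |           text.decode(header.subarray(124, 136)), 8,
        |         );
        |         // Skip header plus data padded to 512-byte blocks.
        |         offset += 512 + Math.ceil(size / 512) * 512;
        |       }
        |       return names;
        |     }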
        
         | dekhn wrote:
         | tar does not have an index.
        
       | Lammy wrote:
       | > That question took me into the guts of the ZIP format, where I
       | learned there's a tiny index at the end that points to everything
       | else.
       | 
        | Tangential, but any Free Software that uses `shared-mime-info`
        | to identify files (any of your GNOMEs, KDEs, etc.) is unable
        | to correctly identify Zip files by their EOCD, due to the lack
        | of an accepted syntax for defining search patterns based on
        | negative file offsets. Please show your support on this issue
        | if you would also like to see it resolved:
        | https://gitlab.freedesktop.org/xdg/shared-mime-info/-/issues...
        | (linking to my own comment, so no, this is not brigading)
       | 
       | Anything using `file(1)` does not have this problem:
       | https://github.com/file/file/blob/280e121/magic/Magdir/zip#L...
        
       | silasb wrote:
        | I've been looking at this for gzip files as well. There is a
        | Rust crate that looks interesting:
        | https://docs.rs/indexed_deflate/latest/indexed_deflate/. My
        | goal is to be able to index MySQL dump files by table
        | boundaries.
        
       | dabinat wrote:
       | I wrote a Rust command-line tool to do this for internal use in
       | my SaaS. The motivation was to be able to index the contents of
       | zip files stored on S3 without incurring significant egress
       | charges. Is this something that people would generally find
       | useful if it was open-sourced?
        
       ___________________________________________________________________
       (page generated 2025-10-16 23:01 UTC)