[HN Gopher] Hop: Faster than unzip and tar at reading individual...
       ___________________________________________________________________
        
       Hop: Faster than unzip and tar at reading individual files
        
       Author : ksec
       Score  : 81 points
       Date   : 2021-11-10 18:50 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Jarred wrote:
       | I made this
       | 
       | Happy to answer any questions or feedback
        
         | Const-me wrote:
          | Consider variable-length integers for file sizes and string
          | lengths. When I need them, I usually implement what's
          | written in the MKV spec:
          | https://www.rfc-editor.org/rfc/rfc8794.html#name-variable-si...
         | 
         | That's a good way to bypass that 4GB size limit, and instead of
         | wasting 8 bytes per number this will even save a few bytes.
         | 
         | However, I'm not sure how easy it is to implement in the
         | language you're using. I only did that in C++ and modern C#.
          | They both have intrinsics to emit the BSWAP instruction (or
          | an ARM equivalent) to flip the byte order of an integer,
          | which helps with the performance of that code.
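          | 
          | A quick sketch of that encoding in C (RFC 8794-style
          | widths; hypothetical helper names, error handling omitted):
          | 
          |     #include <stdint.h>
          |     #include <stddef.h>
          | 
          |     /* Encode v; returns bytes written, 0 if it won't fit.
          |        len bytes carry 7*len data bits; all-ones is
          |        reserved, hence the -1. */
          |     size_t vint_encode(uint64_t v, uint8_t buf[8]) {
          |         for (size_t len = 1; len <= 8; len++) {
          |             if (v < (((uint64_t)1 << (7 * len)) - 1)) {
          |                 for (size_t i = 0; i < len; i++)
          |                     buf[len - 1 - i] = (uint8_t)(v >> (8 * i));
          |                 buf[0] |= (uint8_t)(0x80 >> (len - 1));
          |                 return len;
          |             }
          |         }
          |         return 0;
          |     }
          | 
          |     /* Decode; returns bytes consumed. The position of the
          |        first set bit in buf[0] gives the total width. */
          |     size_t vint_decode(const uint8_t *buf, uint64_t *out) {
          |         size_t len = 1;
          |         while (len < 8 && !(buf[0] & (0x80 >> (len - 1))))
          |             len++;
          |         uint64_t v = buf[0] & (0xFF >> len);
          |         for (size_t i = 1; i < len; i++)
          |             v = (v << 8) | buf[i];
          |         *out = v;
          |         return len;
          |     }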
        
         | ksec wrote:
         | How far are we from public beta or Bun 1.0?
         | 
          | Not related to bun or hop: the use of Zig. I am wondering
          | if you will do a write-up on it someday.
        
           | Jarred wrote:
           | Something like two weeks before a public beta
        
         | throwaway375 wrote:
         | What do you think about Valve VPK?
        
         | ectopod wrote:
         | Don't use 32-bit file times! Change it quick while you have the
         | chance.
        
         | brandmeyer wrote:
         | If you manage the index using a B-tree, then you can perform
         | partial updates of the B-tree by appending the minimum number
         | of pages needed to represent the changes. At that point, you
         | can append N additional new files to the tail of the archive,
         | and add a new index that re-uses some of the original index.
         | 
         | Just an idea to check the "append" box.
         | 
         | See also B-trees, Shadowing, and Clones
         | https://www.usenix.org/legacy/events/lsf07/tech/rodeh.pdf
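          | 
          | Roughly, the on-disk shape could look like this
          | (hypothetical C sketch, just to illustrate the idea):
          | 
          |     #include <stdint.h>
          | 
          |     /* Pages are only ever appended. An update appends new
          |        copies of the changed leaf and its ancestors, then a
          |        new footer; old roots stay valid as snapshots. */
          |     struct btree_page {
          |         uint8_t  is_leaf;
          |         uint16_t nkeys;
          |         uint64_t child_offset[16]; /* file offsets */
          |         /* keys / entry records follow */
          |     };
          | 
          |     struct archive_footer {
          |         uint64_t root_offset; /* newest root, written last */
          |         uint32_t magic;
          |     };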
        
           | Jarred wrote:
           | Good idea, worried a little about impact to
           | serialization/deserialization time though. Maybe could still
           | store it as a flat array
        
       | CannoloBlahnik wrote:
       | This seems like more of a tar problem than a zip problem, unless
       | I'm missing something, given the lack of compression on Hop.
        
         | slaymaker1907 wrote:
         | I think you can still zip files without compression.
        
           | sigzero wrote:
           | Yes, but I don't think anyone does that.
        
           | sumtechguy wrote:
            | Usually it's the -0 option (https://linux.die.net/man/1/zip),
            | or -mx=0 for 7zip-style programs.
            | 
            | If you also used the mt option with 7zip and just stored,
            | you could probably get a decent read rate, as I think it
            | spins up extra threads.
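            | 
            | For example (store-only archives; flags as I remember
            | them):
            | 
            |     zip -0 bundle.zip file1 file2
            |     7z a -mx=0 -mmt=on bundle.7z dir/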
        
       | yboris wrote:
       | Since Hop doesn't do compression, the most appropriate comparison
       | would be to asar
       | 
       | https://github.com/electron/asar
       | 
        | It's not hard to be faster than zip if you are not
        | compressing/decompressing.
        
         | Jarred wrote:
         | The numbers shown are with zip -0, which disables compression.
        
           | kupopuffs wrote:
           | What is the meaning then? What's the benefit of an archive
           | that's not compressed? Why not use a FS?
        
             | zuhsetaqi wrote:
             | https://github.com/Jarred-Sumner/hop#why
        
             | lazide wrote:
              | Archives often have checksums, align files in a way
              | conducive to continuous reading (which can be great
              | performance-wise in some cases; like zip, they can also
              | support random read/write), and can provide grouping
              | and logical/semantic validation that is hard to do on a
              | 'bunch of files' without messing it up.
        
               | eins1234 wrote:
               | FWIW, IPFS does all of that by default (maybe outside of
               | the continuous reading part).
        
             | [deleted]
        
             | derefr wrote:
             | Plenty of uncompressed tarballs exist. In fact, if the
             | things I'm archiving are _already_ compressed (e.g. JPEGs),
             | I reach for an uncompressed tarball as my first choice
             | (with my second choice being a macOS .sparsebundle -- very
             | nice for network-mounting _in_ macOS, and storable on
             | pretty much anything, but not exactly great if you want to
             | open it on any other OS.)
             | 
             | If we had a random-access file system loopback-image
             | standard (open standard for both the file system _and_ the
             | loopback image container format), maybe we wouldn't see so
             | many tarballs. But there is no such format.
             | 
             | As for "why archive things at all, instead of just rsyncing
             | a million little files and directories over to your NAS" --
             | because one takes five minutes, and the other eight hours,
             | due to inode creation and per-file-stream ramp-up time.
        
             | rjzzleep wrote:
              | I don't know if it allows streaming, but if it does,
              | transferring files on portable devices or streaming
              | them over the wire is a lot faster this way than
              | sending the files directly. Especially for small files.
        
             | Jarred wrote:
             | Hop reduces the number of syscalls necessary to both read
             | and check for the existence of multiple files nested within
             | a shared parent directory.
             | 
             | You read from one file to get all the information you need
             | instead of reading from N files and N directories.
             | 
              | You can't easily mount virtual filesystems outside of
              | Linux. However, Linux supports the copy_file_range
              | syscall, which also makes it faster to copy data around
              | than doing it in application memory.
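              | 
              | For example, copying a stored file's bytes out of an
              | archive without bouncing them through userspace
              | (Linux-only C sketch, abbreviated error handling):
              | 
              |     #define _GNU_SOURCE
              |     #include <sys/types.h>
              |     #include <unistd.h>
              | 
              |     int copy_range(int fd_in, off64_t off, size_t len,
              |                    int fd_out) {
              |         while (len > 0) {
              |             ssize_t n = copy_file_range(fd_in, &off,
              |                                         fd_out, NULL,
              |                                         len, 0);
              |             if (n <= 0)
              |                 return -1;    /* error or early EOF */
              |             len -= (size_t)n; /* may copy less */
              |         }
              |         return 0;
              |     }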
        
             | chrisseaton wrote:
             | An archive is essentially a user-mode file system. User-
             | mode things are often more efficient in many ways, as they
             | don't need to call into the kernel as often.
        
         | gnabgib wrote:
         | tar doesn't do compression either, and zip doesn't NEED to
         | (several file formats are just bundles of files in a zip
         | with/without compression)
        
           | syrrim wrote:
            | Tar does do compression, via the standard -z flag. Every
            | tar I have ever downloaded used some form of compression,
            | so it's hardly an optional part of the format.
        
             | nemetroid wrote:
             | Tar (the tool) does compression. Tar (the format) does not.
             | Compression is applied separately from archiving.
             | 
              | https://www.gnu.org/software/tar/manual/html_node/Standard.h...
        
             | codetrotter wrote:
             | It is important to distinguish between tar the format and
             | tar the utility.
             | 
              | tar the utility is a program that can produce tar
              | files, but it is also able to then compress the
              | resulting file.
             | 
             | When you produce a compressed tar file, the contents are
             | written into a tar file, and this tar file as a whole is
             | compressed.
             | 
             | Sometimes the compressed files are named in full like
             | .tar.gz, .tar.bz2 and .tar.xz but often they are named as
             | .tgz, .tbz and .txz respectively. In a way you could
             | consider those files a format in their own right, but at
             | the same time they really are simply plain tar files with
             | compression applied on top.
             | 
             | You can confirm this by decompressing the file without
             | extracting the inner tar file using gunzip, bunzip2 and
             | unxz respectively. This will give you a plain tar file as a
             | result, which is a file and a format of its own.
             | 
             | You can also see the description of compressed tar files
             | with the "file" command and it will say for example "xz
             | compressed data" for a .txz or .tar.xz file (assuming of
             | course that the actual data in the file is really this kind
             | of data). And a plain uncompressed tar file described by
             | "file" will say something like "POSIX tar archive".
        
       | bouk wrote:
       | I wonder how this format compares to SquashFS
        
       | nathell wrote:
       | Related: pixz - a variant of xz that works with tar and enables
       | (semi)efficient random access: https://github.com/vasi/pixz
        
       | 1MachineElf wrote:
       | Is there anything like this for RAR files? I'm looking for an
        | alternative to unrar, as I've recently learned that its code
        | is actually non-free.
        
         | BlackLotus89 wrote:
         | I think maybe you misunderstood this project....
         | 
          | Anyway, libarchive can read rar files, so just use bsdtar.
          | It can handle many other archive formats [0], like cpio, as
          | well. One standardized interface for everything is nice.
         | 
         | [0] https://github.com/libarchive/libarchive#supported-formats
        
       | abetusk wrote:
       | There exists a utility called tarindexer [0] that can be used for
       | random access to tar files. An index text file is created (one
       | time) that is used to record the position of the files in the tar
       | archive. Random reads are done by loading the index file and then
       | seeking to the location of the file in question.
       | 
       | For random access to gzip'd files, bgzip [1] can be used. bgzip
       | also uses an index file (one time creation) that is used to
       | record key points for random access.
       | 
       | [0] https://github.com/devsnd/tarindexer
       | 
       | [1] http://www.htslib.org/doc/bgzip.html
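        | 
        | Example bgzip usage (from memory; check the man page):
        | 
        |     bgzip -i big.txt     # writes big.txt.gz plus a .gzi index
        |     bgzip -b 1048576 -s 4096 big.txt.gz  # 4 KiB at offset 1 MiB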
        
         | cb321 wrote:
         | Also relevant is pixz [1] which can do parallel LZMA/XZ
         | decompression as well as tar file indexing.
         | 
         | [1] https://github.com/vasi/pixz
        
         | selfhoster11 wrote:
         | That just sounds like an inferior version of ZIP. IMO, unless
         | you can only work with the tar format (which is a perfectly
         | valid explanation, some programs are long-lived), ZIP is a
         | better option for seekable archives because it supports even
         | LZMA compression.
        
           | lazide wrote:
            | Zip is..... weird. Undesirably so in some ways, and
            | desirably in others.
            | 
            | The index is stored at the tail end of a zip file, for
            | instance, which is really not cool for something like
            | tape, and doubly not cool when you don't know in advance
            | how big the data on the tape is.
        
             | larkost wrote:
             | It is a little more complicated than that: having the index
             | at the end is great for writing tape, but sucks for reading
              | tape. The nice thing about that design is that you can
             | "stream" the data (read one file, write it, and just make
             | some notes on the side for your eventual index). But you
             | can't stream things off (you have to read the index first).
             | 
             | Tar is of course all-around great for tape, since every
             | file is essentially put on by itself (with just a little
             | bit of header about it). Again this is great for streaming
             | things both on and off tape. But you can't do any sort of
             | jumping around. You have to start at the beginning and then
             | go from file to file. This gets even worse if you try to
             | compress it (e.g. .tar.gz), as you then have to decompress
             | every byte to get to the next one.
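              | 
              | Concretely, a zip reader has to hunt for the End of
              | Central Directory record near EOF before it can list
              | anything (rough C sketch):
              | 
              |     #include <stdint.h>
              |     #include <stdio.h>
              |     #include <string.h>
              | 
              |     /* The EOCD signature is "PK\x05\x06"; the record
              |        is 22 bytes plus an up-to-65535-byte comment,
              |        so scan backwards over the file's tail. */
              |     long find_eocd(FILE *f) {
              |         static uint8_t buf[22 + 65535];
              |         fseek(f, 0, SEEK_END);
              |         long size = ftell(f);
              |         long span = size < (long)sizeof buf
              |                   ? size : (long)sizeof buf;
              |         fseek(f, size - span, SEEK_SET);
              |         if (fread(buf, 1, (size_t)span, f) != (size_t)span)
              |             return -1;
              |         for (long i = span - 22; i >= 0; i--)
              |             if (!memcmp(buf + i, "PK\x05\x06", 4))
              |                 return size - span + i;
              |         return -1; /* no EOCD: not a zip */
              |     }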
        
             | selfhoster11 wrote:
             | Don't tapes support some form of seeking?
        
               | Isthatablackgsd wrote:
                | What lazide is saying is that some important
                | information is stored at the very end (as in the very
                | end of a book or movie). So imagine there is a 1TB
                | .zip archive on the tape: the tape device has to go
                | all the way through that 1TB file to get the last bit
                | of vital data the user wants to see. Normally the
                | vital bit is at the start of the file (as in the
                | beginning of a book), so the user has the information
                | ready before the transfer finishes. But in lazide's
                | case, the tape device has to keep reading through the
                | entire 1TB zip to get that last vital bit of
                | information, which makes it slow. It cannot "skip the
                | line"; it has to go through the entire line to get
                | there.
        
               | lazide wrote:
               | They do - it's very slow, and entirely linear. It also
               | puts wear on the tape, so if you do it a lot you'll break
               | the tape in a not-super-long-time.
               | 
                | And since you wouldn't know where the end is from
                | reading the beginning, you have to keep reading until
                | you hit the right marker - and then seek backwards
                | (you generally can't read backwards on most tape
                | drives, so you need to jump back and re-read).
        
             | petre wrote:
              | Also limited to 4 GB unless it's zip64, which is
              | limited to 16 EB and not supported by all zip
              | implementations.
        
       | zzo38computer wrote:
       | File names not being normalized across platforms is sometimes
       | beneficial. Ignoring symlinks is also sometimes beneficial.
       | However, sometimes these are undesirable features. The same is
        | true of solid compression, concatenatability, etc. Also,
        | being able to efficiently update an archive means that some
        | other things may be lacking, so there are advantages and
        | disadvantages to it.
       | 
        | I dislike the command-line format of Hop; it seems to be
        | missing many features. Also, 64-bit timestamps and 64-bit
        | data lengths (and offsets) would be helpful, as some other
        | people mention; I agree with them that it should be improved
        | in this way. (You could use variable-length numbers if you
        | want to save space.)
       | 
        | My own opinion for the general case is that I like to have a
        | concatenable format with separate compression (although this
        | is not suitable for all applications). One way to allow
        | additional features might be an extensible set of fields, so
        | you can include/exclude file modes, modification times,
        | numeric or named user IDs, IBM code pages, resource forks,
        | cryptographic hashes, multi-volumes, etc. (I also designed a
        | compression format with an optional key frame index; this
        | way the same format supports both solid and non-solid
        | compression, whichever way you want, and this can work
        | independently from the archive format being used.)
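        | 
        | (For instance, a tag-length-value layout would let readers
        | skip fields they do not understand; a hypothetical C sketch:)
        | 
        |     #include <stdint.h>
        | 
        |     struct field {       /* illustrative, not a real spec */
        |         uint16_t tag;    /* e.g. 1=mode, 2=mtime64, 3=uid */
        |         uint32_t length; /* bytes of value that follow */
        |         /* value bytes follow; readers seek past unknown
        |            tags using the length */
        |     };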
       | 
       | For making backups I use tar with a specific set of options (do
       | not cross file systems, use numeric user IDs, etc); this is then
       | piped to the program to compress it, stored in a separate
       | partition, and then recorded on DVDs.
       | 
       | For some simple uses I like the Hamster archive format. However,
       | it also limits each individual file inside to 4 GB (although the
       | entire archive is not limited in this way), and no metadata is
       | possible. Still, for many applications, this simplicity is very
       | helpful, and I sometimes use it. I wrote a program to deal with
        | these files, and it has a good number of options (which I have found
       | useful) without being too complicated. Options that belong in
       | external programs (such as compression) are not included, since
       | you can use separate programs for that.
        
       | mattfrommars wrote:
       | Is 7zip = zip ?
        
         | wolf550e wrote:
         | 7zip uses lzma, a much more advanced and slower format than RFC
         | 1951 DEFLATE used in zlib/gzip/png/zip/jar/office docs/etc
        
           | selfhoster11 wrote:
           | Zip can also use LZMA, assuming the recipient can
           | interoperate with that.
        
         | wolpoli wrote:
         | No. The 7zip format has much better compression than the
         | ancient (1990s) zip format.
        
           | selfhoster11 wrote:
           | ZIP as a standard is no longer ancient. It has support for
           | modern encryption and compression, including LZMA. This is
           | sometimes referred to as the ZIPX archive, but it's part of
           | the standard in its later revisions.
        
             | snvzz wrote:
             | All this "no longer ancient" means is that zip is now out
             | of the window. Its value has been lost.
             | 
             | The format has been perverted, and we can't trust zip as
             | the "just works" option that will open in any platform,
             | with any implementation, anymore.
             | 
             | All because somebody thought it a good idea to try to
             | leverage the zip name's attached popularity to try and make
             | a new format instantly popular.
             | 
             | Great job. This is why we can't have good things.
        
               | selfhoster11 wrote:
               | The same can be said of HTML. I realise that file archive
               | formats are expected to be more stable (they are for
               | archiving, yes?), but is it right to expect them to be
               | forever frozen in amber? Especially when open source or
               | free decompressors exist for every version of every
               | system? ZIP compressed using LZMA is even supported in
               | the last version of 7-zip compiled for MS-DOS.
        
               | snvzz wrote:
               | >ZIP compressed using LZMA is even supported in the last
               | version of 7-zip compiled for MS-DOS.
               | 
               | But then, why wouldn't you just use the 7z format?
               | 
               | The expectation with ZIP is (or was) that it'll unpack
               | fine, even under CP/M.
               | 
               | Moving to '.zipx' extension was the right move, but it
               | was done far too late.
        
               | wolpoli wrote:
               | Moving to the .zipx extension was definitely the right
               | (and only) move. This is due to the fact that Microsoft's
               | compressed folder code hasn't been updated in years and
                | thus it isn't safe to send out any Zip files that use
               | any new features [1].
               | 
                | [1]: https://devblogs.microsoft.com/oldnewthing/20180515-00/?p=98...
        
             | mjevans wrote:
             | zipx != zip -- we are speaking of file compression
             | standards NOT branded compression software programs
        
               | selfhoster11 wrote:
               | No, it literally is part of the ZIP standard [0]. I
               | updated my original comment.
               | 
               | ZIPX archives are simply ZIP archives that have been
               | branded with the X for the benefit of the users who may
                | be trying to open them with something ancient that
               | doesn't yet support LZMA or the other new features.
               | 
                | [0] https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
        
       | tyingq wrote:
       | Also see SQLite archive files. Random access and compression.
       | https://www.sqlite.org/sqlar.html
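        | 
        | Usage via the sqlite3 shell's archive mode looks roughly
        | like this (flags per the linked docs):
        | 
        |     sqlite3 proj.sqlar -Ac src/   # create
        |     sqlite3 proj.sqlar -At        # list
        |     sqlite3 proj.sqlar -Ax        # extract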
        
       | seeekr wrote:
       | From the README:
       | 
       | "Why? Reading and writing lots of tiny files incurs significant
       | syscall overhead, and (npm) packages often have lots of tiny
       | files. (...)"
       | 
       | It seems the author is working on a fast JS bundler tool
       | (https://bun.sh) and the submission links to an attempt to fix
       | some of the challenges in processing lots of npm files quickly.
       | (But of course could be useful beyond that.)
        
         | jraph wrote:
         | > "Why? Reading and writing lots of tiny files incurs
         | significant syscall overhead, and (npm) packages often have
         | lots of tiny files"
         | 
         | Ah, those trees of is-character packages depending on is-
         | letter-a packages, themselves depending on is-not-number
          | packages, each appearing several times in different versions, are
         | probably challenging to bundle and unpack efficiently. We might
         | want to address file systems too, so they too can handle NPM
         | efficiently (actual size on disk and speed).
         | 
         | Or maybe the actual fix is elsewhere.
         | 
         |  _Runs, fleeing their screen screaming in fear and frustration_
         | 
         | (no offense to the author, I actually find the problem
         | interesting and the work useful)
        
           | aaaaaaaaaaab wrote:
           | Obviously, what we really need is an OS optimized for npm.
        
         | eyelidlessness wrote:
         | Came to mention Bun when I saw this hit the front page. I've
         | been following Jarred's Twitter since I heard about Bun and
         | it's quite impressive (albeit incomplete). To folks wondering
         | why another bundler/what makes Bun special:
         | 
         | - Faster than ESBuild/SWC
         | 
         | - Fast build-time macros written as JSX (likely friendlier to
         | develop than say a Babel plugin/macro). These open up a lot of
         | possibilities that could benefit end users too, by performing
         | more work on build/server and less client side.
         | 
         | - Targeting ecosystem compatibility (eg will probably support
         | the new JSX transform, which ESBuild does not and may not in
         | the future)
         | 
         | - Support for integration with frameworks, eg Next.js
         | 
         | - Other cool performance-focused tools like Hop and Peechy[1]
         | (though that's a fork of ESBuild creator's project Kiwi)
         | 
         | This focus on performance is good for the JS ecosystem and for
         | the web generally.
         | 
         | 1: https://github.com/Jarred-Sumner/peechy
        
       | perihelions wrote:
       | Speaking of faster coreutils replacements, I highly recommend
       | ripgrep (rg) and fd-find (fd), which are Rust-based, incompatible
       | replacements for grep and find.
       | 
       | I know I'm way behind the popularity curve* (they're already in
       | debian-stable, for crying out loud); but for the benefit of
       | anyone even more out of the loop than myself, do check these out.
       | The speed increase is amazing: a significant quality-of-life
       | improvement, essentially for free**. Particularly if you're on a
        | modern NVMe SSD and not utilizing it properly. (Incidentally, did
        | you know the "t" in coreutils' "tar" stands for magnetic
        | [t]ape?)
       | 
       | * (The now-top reply to this comment says 'ag' is even superior
       | to 'rg', and they're probably right, but I had no clue about it!
       | I did say I'm ignorant!)
       | 
       | **(Might have some cost if you're heavily invested in power-user
       | syntax of the GNU or BSD versions, in which case incompatibility
       | has a price).
       | 
       | https://github.com/BurntSushi/ripgrep
       | 
       | https://github.com/sharkdp/fd
        
         | chx wrote:
          | When I looked, ag (https://github.com/ggreer/the_silver_searcher)
          | was more featureful than ripgrep, yet it's always ripgrep
          | that gets mentioned. :/
        
           | rscnt wrote:
            | Weird, right? I don't know what ripgrep has; maybe it's
            | just the name? Or the fact that you invoke `ag` but it's
            | called `the_silver_searcher`?
        
             | pdimitar wrote:
             | From my side, I knew about both `ripgrep` and
             | `the_silver_searcher` but I will openly admit I've lost
             | faith in C and C++'s abilities to protect against memory
             | safety errors.
             | 
             | Thus, Rust tools get a priority, at least for myself. There
             | are probably other memory-safe languages but I haven't had
              | the chance to give them a spin like I did with Rust. If
              | I find out about them, I'll also prefer tools written
              | in them when there's no Rust port and the alternatives
              | are C/C++ tools.
        
         | hakre wrote:
          | Okay, call me lazy. Not for ripgrep - I installed it early
          | from source. However, your fdfind (fd) mention made me
          | curious. Thought I'd give it an apt install on Ubuntu 20.04
          | LTS, but dead end. So perhaps it's in debian-stable but not
          | in Ubuntu. Just saying, so I can feel less out of the loop
          | ;)
        
           | perihelions wrote:
           | Upstream says Ubuntu's package is 'fd-find' (and the
           | executable is 'fdfind' -- both renamed from upstream's 'fd'
           | because of a name collision with an unrelated Debian package.
           | If you don't care about that one, you can safely alias
           | fd="fdfind").
           | 
           | https://github.com/sharkdp/fd
           | 
           | (I've edited my first comment in response to this reply: I
           | originally wrote "fdfind". (For a comment about regexp tools,
           | this is a uniquely, hilariously stupid oversight. Sorry!)).
        
             | hakre wrote:
              | Well, even though it's late here, I could still have
              | had enough creative energy to insert the minus in there
              | ... yeah, thanks for that. So now I feel really behind
              | because sure, it's packaged in Ubuntu 20.04. Thanks
              | again.
        
       | notananthem wrote:
       | (cries in winrar)
        
         | Octplane wrote:
         | Hey! Did you register me? </nag>
        
       | kazinator wrote:
       | > _Can 't be larger than 4 GB_
       | 
       | How does that even happen in 2021?
        
         | maccard wrote:
          | 4GB is the maximum value of a 32-bit unsigned int. If I had
          | to guess, that's the maximum size of the array/vector
          | container in Zig.
        
         | georgemcbay wrote:
          | The answer is at the bottom of the page: all the
          | offset/length data is stored as uint32s.
         | 
         | 2^32 = 4294967296
         | 
          | Not sure if this limitation is driven by upstream concerns,
          | but this is why this code in particular is limited to that
          | size.
        
         | Jarred wrote:
          | Offsets and lengths are stored as unsigned 32-bit integers.
          | This saves space/memory, but means it won't work for
          | anything bigger than 4 GB.
          | 
          | Maybe that's overly conservative. Wouldn't be hard to
          | change.
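          | 
          | Roughly, the current entry layout amounts to this (C
          | sketch, not the actual Zig source):
          | 
          |     #include <stdint.h>
          | 
          |     struct entry {
          |         uint32_t offset;  /* caps archives at 4294967295
          |                              bytes (UINT32_MAX) */
          |         uint32_t length;  /* widening both to uint64_t
          |                              costs 8 bytes per entry */
          |     };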
        
       ___________________________________________________________________
       (page generated 2021-11-10 23:00 UTC)