[HN Gopher] Hop: Faster than unzip and tar at reading individual...
___________________________________________________________________
Hop: Faster than unzip and tar at reading individual files
Author : ksec
Score : 81 points
Date : 2021-11-10 18:50 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| Jarred wrote:
| I made this
|
| Happy to answer any questions or feedback
| Const-me wrote:
| Consider variable-length integers for file sizes and string
| lengths. When I need them, I usually implement what's
| written in the MKV spec: https://www.rfc-
| editor.org/rfc/rfc8794.html#name-variable-si...
|
| That's a good way to bypass that 4GB size limit, and instead of
| wasting 8 bytes per number this will even save a few bytes.
|
| However, I'm not sure how easy it is to implement in the
| language you're using. I've only done that in C++ and modern
| C#. They both have intrinsics to emit the BSWAP instruction
| (or an ARM equivalent) to flip the byte order of an integer,
| which helps with the performance of that code.
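|
| For illustration, a rough Python sketch of an RFC 8794-style
| variable-size integer (the helper names are mine, not Hop's
| API):
|
|     def encode_vint(value):
|         # A length-N VINT has N-1 leading zero bits, a marker bit,
|         # then 7*N data bits.
|         for length in range(1, 9):
|             if value < (1 << (7 * length)) - 1:
|                 marker = 1 << (7 * length)
|                 return (marker | value).to_bytes(length, "big")
|         raise ValueError("value too large for an 8-byte VINT")
|
|     def decode_vint(data):
|         # Returns (value, bytes consumed); the position of the first
|         # set bit in the first byte gives the total length.
|         first, length, mask = data[0], 1, 0x80
|         if first == 0:
|             raise ValueError("VINTs longer than 8 bytes not supported")
|         while not (first & mask):
|             length += 1
|             mask >>= 1
|         raw = int.from_bytes(data[:length], "big")
|         return raw ^ (1 << (7 * length)), length
|
| Small offsets stay at one or two bytes, while anything up to
| 2^56 - 2 still fits, so the 4 GB limit disappears without
| padding every field to 8 bytes.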
| ksec wrote:
| How far are we from public beta or Bun 1.0?
|
| Not related to Bun or Hop, but regarding the use of Zig: I am
| wondering if you will do a write-up on it someday.
| Jarred wrote:
| Something like two weeks before a public beta
| throwaway375 wrote:
| What do you think about Valve VPK?
| ectopod wrote:
| Don't use 32-bit file times! Change it quick while you have the
| chance.
| brandmeyer wrote:
| If you manage the index using a B-tree, then you can perform
| partial updates of the B-tree by appending the minimum number
| of pages needed to represent the changes. At that point, you
| can append N additional new files to the tail of the archive,
| and add a new index that re-uses some of the original index.
|
| Just an idea to check the "append" box.
|
| See also B-trees, Shadowing, and Clones
| https://www.usenix.org/legacy/events/lsf07/tech/rodeh.pdf
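|
| A toy Python sketch of that append-only idea (one flat level of
| leaf pages rather than a real B-tree, and a made-up page/footer
| layout, purely for illustration):
|
|     import json, os, struct
|
|     FOOTER = struct.Struct("<Q")  # last 8 bytes: offset of current root
|
|     def write_page(f, obj):
|         off = f.tell()
|         data = json.dumps(obj).encode()
|         f.write(struct.pack("<I", len(data)) + data)
|         return off
|
|     def read_page(f, off):
|         f.seek(off)
|         (n,) = struct.unpack("<I", f.read(4))
|         return json.loads(f.read(n))
|
|     def append_entries(path, new_entries):
|         # Pure append: one new leaf, a new root that reuses the old
|         # leaf offsets, and a new footer. Nothing written earlier is
|         # modified. Assumes the file already ends with a valid footer.
|         with open(path, "r+b") as f:
|             f.seek(-FOOTER.size, os.SEEK_END)
|             (root_off,) = FOOTER.unpack(f.read(FOOTER.size))
|             root = read_page(f, root_off)   # {"leaves": [off, ...]}
|             f.seek(0, os.SEEK_END)
|             root["leaves"].append(write_page(f, new_entries))
|             f.write(FOOTER.pack(write_page(f, root)))
|
| Readers always trust the trailing footer, so superseded index
| pages become dead weight but stay valid until the archive is
| compacted.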
| Jarred wrote:
| Good idea; I'm a little worried about the impact on
| serialization/deserialization time, though. Maybe it could
| still be stored as a flat array.
| CannoloBlahnik wrote:
| This seems like more of a tar problem than a zip problem, unless
| I'm missing something, given the lack of compression on Hop.
| slaymaker1907 wrote:
| I think you can still zip files without compression.
| sigzero wrote:
| Yes, but I don't think anyone does that.
| sumtechguy wrote:
| Usually the -0 option (https://linux.die.net/man/1/zip), or
| -mx=0 for 7zip-style programs.
|
| If you also used the mt option with 7zip and just stored, you
| could probably get a decent read rate, as I think it spins up
| extra threads.
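|
| For anyone reproducing the comparison from Python, the "store
| only" equivalent is ZIP_STORED (file names here are just
| placeholders):
|
|     import zipfile
|
|     # Write: members are stored verbatim, with no DEFLATE pass.
|     with zipfile.ZipFile("bundle.zip", "w",
|                          compression=zipfile.ZIP_STORED) as zf:
|         zf.write("package.json")
|         zf.write("index.js")
|
|     # Read one member without touching the rest of the archive.
|     with zipfile.ZipFile("bundle.zip") as zf:
|         data = zf.read("index.js")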
| yboris wrote:
| Since Hop doesn't do compression, the most appropriate comparison
| would be to asar
|
| https://github.com/electron/asar
|
| It's not hard to be faster than zip if you are not
| compressing/decompressing.
| Jarred wrote:
| The numbers shown are with zip -0, which disables compression.
| kupopuffs wrote:
| What is the point then? What's the benefit of an archive
| that's not compressed? Why not use a FS?
| zuhsetaqi wrote:
| https://github.com/Jarred-Sumner/hop#why
| lazide wrote:
| Archives often have checksums, align files in a way
| conducive to continuous reading, which can be great
| performance-wise in some cases (and, like zip, also allow
| random read/write), and can provide grouping and
| logical/semantic validation that is hard to do on a 'bunch of
| files' without messing it up.
| eins1234 wrote:
| FWIW, IPFS does all of that by default (maybe outside of
| the continuous reading part).
| [deleted]
| derefr wrote:
| Plenty of uncompressed tarballs exist. In fact, if the
| things I'm archiving are _already_ compressed (e.g. JPEGs),
| I reach for an uncompressed tarball as my first choice
| (with my second choice being a macOS .sparsebundle -- very
| nice for network-mounting _in_ macOS, and storable on
| pretty much anything, but not exactly great if you want to
| open it on any other OS.)
|
| If we had a random-access file system loopback-image
| standard (open standard for both the file system _and_ the
| loopback image container format), maybe we wouldn't see so
| many tarballs. But there is no such format.
|
| As for "why archive things at all, instead of just rsyncing
| a million little files and directories over to your NAS" --
| because one takes five minutes, and the other eight hours,
| due to inode creation and per-file-stream ramp-up time.
| rjzzleep wrote:
| I don't know if it allows streaming, but if it does,
| transferring files to portable devices or streaming them
| over the wire is a lot faster this way compared to sending
| the files directly, especially for small files.
| Jarred wrote:
| Hop reduces the number of syscalls necessary to both read
| and check for the existence of multiple files nested within
| a shared parent directory.
|
| You read from one file to get all the information you need
| instead of reading from N files and N directories.
|
| You can't easily mount virtual filesystems outside of Linux.
| However, Linux supports the copy_file_range syscall, which
| also makes it faster to copy data around than doing it in
| application memory.
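|
| As a rough illustration of that last point, the same syscall is
| exposed in Python (Linux only, 3.8+); the offset/length would
| come from the archive index, and this is not Hop's actual code:
|
|     import os
|
|     def extract_member(archive, dest, offset, length):
|         # The kernel copies the byte range directly between the two
|         # files; the data never enters application memory.
|         src = os.open(archive, os.O_RDONLY)
|         dst = os.open(dest, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
|         try:
|             remaining, off = length, offset
|             while remaining > 0:
|                 n = os.copy_file_range(src, dst, remaining,
|                                        offset_src=off)
|                 if n == 0:
|                     break
|                 off += n
|                 remaining -= n
|         finally:
|             os.close(src)
|             os.close(dst)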
| chrisseaton wrote:
| An archive is essentially a user-mode file system. User-
| mode things are often more efficient in many ways, as they
| don't need to call into the kernel as often.
| gnabgib wrote:
| tar doesn't do compression either, and zip doesn't NEED to
| (several file formats are just bundles of files in a zip
| with/without compression)
| syrrim wrote:
| Tar does do compression, via the standard -z flag. Every tar
| I have ever downloaded used some form of compression, so it's
| hardly an optional part of the format.
| nemetroid wrote:
| Tar (the tool) does compression. Tar (the format) does not.
| Compression is applied separately from archiving.
|
| https://www.gnu.org/software/tar/manual/html_node/Standard.
| h...
| codetrotter wrote:
| It is important to distinguish between tar the format and
| tar the utility.
|
| tar the utility is a program that can produce tar files,
| but it is also able to then compress that file.
|
| When you produce a compressed tar file, the contents are
| written into a tar file, and this tar file as a whole is
| compressed.
|
| Sometimes the compressed files are named in full like
| .tar.gz, .tar.bz2 and .tar.xz but often they are named as
| .tgz, .tbz and .txz respectively. In a way you could
| consider those files a format in their own right, but at
| the same time they really are simply plain tar files with
| compression applied on top.
|
| You can confirm this by decompressing the file without
| extracting the inner tar file using gunzip, bunzip2 and
| unxz respectively. This will give you a plain tar file as a
| result, which is a file and a format of its own.
|
| You can also see the description of compressed tar files
| with the "file" command and it will say for example "xz
| compressed data" for a .txz or .tar.xz file (assuming of
| course that the actual data in the file is really this kind
| of data). And a plain uncompressed tar file described by
| "file" will say something like "POSIX tar archive".
| bouk wrote:
| I wonder how this format compares to SquashFS
| nathell wrote:
| Related: pixz - a variant of xz that works with tar and enables
| (semi)efficient random access: https://github.com/vasi/pixz
| 1MachineElf wrote:
| Is there anything like this for RAR files? I'm looking for an
| alternative to unrar, as I've recently learned that its code is
| actually non-free.
| BlackLotus89 wrote:
| I think maybe you misunderstood this project....
|
| Anyway, libarchive can read rar files, so just use bsdtar. It
| can handle many other archive formats [0], like cpio, as well.
| One standardized interface for everything is nice.
|
| [0] https://github.com/libarchive/libarchive#supported-formats
| abetusk wrote:
| There exists a utility called tarindexer [0] that can be used for
| random access to tar files. An index text file is created (one
| time) that is used to record the position of the files in the tar
| archive. Random reads are done by loading the index file and then
| seeking to the location of the file in question.
|
| For random access to gzip'd files, bgzip [1] can be used. bgzip
| also uses an index file (one time creation) that is used to
| record key points for random access.
|
| [0] https://github.com/devsnd/tarindexer
|
| [1] http://www.htslib.org/doc/bgzip.html
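|
| The same trick is only a few lines of Python if you don't want
| an extra dependency (the JSON sidecar here is for illustration
| and is not tarindexer's actual index format):
|
|     import json, tarfile
|
|     def build_index(tar_path, index_path):
|         # One-time pass over an *uncompressed* tar, recording where
|         # each member's data starts and how long it is.
|         index = {}
|         with tarfile.open(tar_path, "r:") as tf:
|             for m in tf:
|                 if m.isfile():
|                     index[m.name] = (m.offset_data, m.size)
|         with open(index_path, "w") as f:
|             json.dump(index, f)
|
|     def read_member(tar_path, index_path, name):
|         # Random access: seek straight to the recorded offset.
|         with open(index_path) as f:
|             offset, size = json.load(f)[name]
|         with open(tar_path, "rb") as f:
|             f.seek(offset)
|             return f.read(size)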
| cb321 wrote:
| Also relevant is pixz [1] which can do parallel LZMA/XZ
| decompression as well as tar file indexing.
|
| [1] https://github.com/vasi/pixz
| selfhoster11 wrote:
| That just sounds like an inferior version of ZIP. IMO, unless
| you can only work with the tar format (which is a perfectly
| valid reason; some programs are long-lived), ZIP is a
| better option for seekable archives because it even supports
| LZMA compression.
| lazide wrote:
| Zip is..... weird. In some undesirable ways sometimes, and
| desirable in others.
|
| The index is stored at the tail end of a zip file, for
| instance, which is really not cool for something like tape,
| and doubly not cool when you don't know in advance how big
| the data on the tape is.
| larkost wrote:
| It is a little more complicated than that: having the index
| at the end is great for writing to tape, but sucks for
| reading from tape. The nice thing about that design is that
| you can "stream" the data on (read one file, write it, and
| just make some notes on the side for your eventual index).
| But you can't stream things off (you have to read the index
| first).
|
| Tar is of course all-around great for tape, since every
| file is essentially put on by itself (with just a little
| bit of header about it). Again this is great for streaming
| things both on and off tape. But you can't do any sort of
| jumping around. You have to start at the beginning and then
| go from file to file. This gets even worse if you try to
| compress it (e.g. .tar.gz), as you then have to decompress
| every byte to get to the next one.
| selfhoster11 wrote:
| Don't tapes support some form of seeking?
| Isthatablackgsd wrote:
| What lazide saying is that some important information is
| stored at the end (as in the very end of the book/movie
| or at the end of the line). So imagine there is a 1TB
| .zip archive in the tape, the tape device have to go
| further (deeper) into that 1TB file to get that last bit
| of vital data that the user want to see. Normally the
| vital bit usually at the start of the file (as in the
| front line/beginning of the book) that the user have the
| information ready before they could transfer it. But for
| lazide case, the tape device have to keep reading the
| entire 1TB zip to get that last vital bit of information
| which made it slow. It is more like it cannot "skip the
| line" and would have to go through entire line to get
| there.
| lazide wrote:
| They do - it's very slow, and entirely linear. It also
| puts wear on the tape, so if you do it a lot you'll break
| the tape in a not-super-long-time.
|
| And since you wouldn't know where the end is by reading
| the beginning, you'll have to keep reading until you hit
| the right marker - and then seek backwards (you generally
| can't read backwards on most tape drives, so you need to
| jump back and re-read).
| petre wrote:
| Also limited to 4 GB unless it's zip64, which is limited to
| 16 EB and not supported by all zip implementations.
| zzo38computer wrote:
| File names not being normalized across platforms is sometimes
| beneficial. Ignoring symlinks is also sometimes beneficial.
| However, sometimes these are undesirable features. The same is
| true of solid compression, concatenatability, etc. Also, being
| able to efficiently update an archive means that some other
| things may have to be given up, so there are advantages and
| disadvantages to it.
|
| I dislike the command-line format of Hop; it seems to be
| missing many features. Also, 64-bit timestamps and 64-bit data
| lengths (and offsets) would be helpful, as some other people
| mention; I agree with them that it should be improved in this
| way. (You could use variable-length numbers if you want to
| save space.)
|
| My own opinion for the general case is that I like to have a
| concatenable format with separate compression (although this is
| not suitable for all applications). One way to allow additional
| features might be having an extensible set of fields, so you can
| include/exclude file modes, modification times, numeric or named
| user IDs, IBM code pages, resource forks, cryptographic hashes,
| multi-volumes, etc. (I also designed a compression format with
| an optional key frame index; this way the same format supports
| both solid and non-solid compression, whichever way you want to
| do it, and this can work independently of the archive format
| being used.)
|
| For making backups I use tar with a specific set of options (do
| not cross file systems, use numeric user IDs, etc); this is then
| piped to the program to compress it, stored in a separate
| partition, and then recorded on DVDs.
|
| For some simple uses I like the Hamster archive format. However,
| it also limits each individual file inside to 4 GB (although the
| entire archive is not limited in this way), and no metadata is
| possible. Still, for many applications, this simplicity is very
| helpful, and I sometimes use it. I wrote a program to deal with
| these files, and it has a good number of options (which I have
| found useful) without being too complicated. Options that belong
| in external programs (such as compression) are not included,
| since you can use separate programs for that.
| mattfrommars wrote:
| Is 7zip = zip ?
| wolf550e wrote:
| 7zip uses LZMA, a much more advanced (and slower) format than
| the RFC 1951 DEFLATE used in zlib/gzip/png/zip/jar/office
| docs/etc.
| selfhoster11 wrote:
| Zip can also use LZMA, assuming the recipient can
| interoperate with that.
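|
| For example, Python's zipfile can already write such archives
| (whether the receiving tool can read them back is exactly the
| interoperability question; the file name is a placeholder):
|
|     import zipfile
|
|     # Compression method 14 (LZMA) instead of the usual 8 (DEFLATE).
|     with zipfile.ZipFile("out.zip", "w",
|                          compression=zipfile.ZIP_LZMA) as zf:
|         zf.write("README.md")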
| wolpoli wrote:
| No. The 7zip format has much better compression than the
| ancient (1990s) zip format.
| selfhoster11 wrote:
| ZIP as a standard is no longer ancient. It has support for
| modern encryption and compression, including LZMA. This is
| sometimes referred to as the ZIPX archive, but it's part of
| the standard in its later revisions.
| snvzz wrote:
| All this "no longer ancient" means is that zip is now out
| of the window. Its value has been lost.
|
| The format has been perverted, and we can't trust zip as
| the "just works" option that will open in any platform,
| with any implementation, anymore.
|
| All because somebody thought it a good idea to try to
| leverage the zip name's attached popularity to try and make
| a new format instantly popular.
|
| Great job. This is why we can't have good things.
| selfhoster11 wrote:
| The same can be said of HTML. I realise that file archive
| formats are expected to be more stable (they are for
| archiving, yes?), but is it right to expect them to be
| forever frozen in amber? Especially when open source or
| free decompressors exist for every version of every
| system? ZIP compressed using LZMA is even supported in
| the last version of 7-zip compiled for MS-DOS.
| snvzz wrote:
| >ZIP compressed using LZMA is even supported in the last
| version of 7-zip compiled for MS-DOS.
|
| But then, why wouldn't you just use the 7z format?
|
| The expectation with ZIP is (or was) that it'll unpack
| fine, even under CP/M.
|
| Moving to '.zipx' extension was the right move, but it
| was done far too late.
| wolpoli wrote:
| Moving to the .zipx extension was definitely the right
| (and only) move. This is because Microsoft's compressed
| folder code hasn't been updated in years, and thus it isn't
| safe to send out any Zip files that use any new features [1].
|
| [1]: https://devblogs.microsoft.com/oldnewthing/20180515-
| 00/?p=98...
| mjevans wrote:
| zipx != zip -- we are speaking of file compression
| standards NOT branded compression software programs
| selfhoster11 wrote:
| No, it literally is part of the ZIP standard [0]. I
| updated my original comment.
|
| ZIPX archives are simply ZIP archives that have been
| branded with the X for the benefit of the users who may
| be trying to open them with something ancient, that
| doesn't yet support LZMA or the other new features.
|
| [0] https://pkware.cachefly.net/webdocs/casestudies/APPNO
| TE.TXT
| tyingq wrote:
| Also see SQLite archive files. Random access and compression.
| https://www.sqlite.org/sqlar.html
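|
| A minimal Python sketch of that format (the sqlar table layout
| is taken from the linked page; error handling omitted):
|
|     import os, sqlite3, zlib
|
|     SCHEMA = ("CREATE TABLE IF NOT EXISTS sqlar("
|               "name TEXT PRIMARY KEY, mode INT, mtime INT,"
|               " sz INT, data BLOB)")
|
|     def add_file(db_path, file_path):
|         raw = open(file_path, "rb").read()
|         z = zlib.compress(raw)
|         # sqlar stores the raw bytes if compression doesn't shrink them
|         data = z if len(z) < len(raw) else raw
|         st = os.stat(file_path)
|         with sqlite3.connect(db_path) as con:
|             con.execute(SCHEMA)
|             con.execute("REPLACE INTO sqlar VALUES (?,?,?,?,?)",
|                         (file_path, st.st_mode, int(st.st_mtime),
|                          len(raw), data))
|
|     def read_file(db_path, name):
|         with sqlite3.connect(db_path) as con:
|             sz, data = con.execute(
|                 "SELECT sz, data FROM sqlar WHERE name = ?",
|                 (name,)).fetchone()
|         return data if len(data) == sz else zlib.decompress(data)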
| seeekr wrote:
| From the README:
|
| "Why? Reading and writing lots of tiny files incurs significant
| syscall overhead, and (npm) packages often have lots of tiny
| files. (...)"
|
| It seems the author is working on a fast JS bundler tool
| (https://bun.sh) and the submission links to an attempt to fix
| some of the challenges in processing lots of npm files quickly.
| (But of course could be useful beyond that.)
| jraph wrote:
| > "Why? Reading and writing lots of tiny files incurs
| significant syscall overhead, and (npm) packages often have
| lots of tiny files"
|
| Ah, those trees of is-character packages depending on is-
| letter-a packages, themselves depending on is-not-number
| packages each appearing several times in different versions are
| probably challenging to bundle and unpack efficiently. We might
| want to address file systems too, so they too can handle NPM
| efficiently (actual size on disk and speed).
|
| Or maybe the actual fix is elsewhere.
|
| _Runs, fleeing their screen screaming in fear and frustration_
|
| (no offense to the author, I actually find the problem
| interesting and the work useful)
| aaaaaaaaaaab wrote:
| Obviously, what we really need is an OS optimized for npm.
| eyelidlessness wrote:
| Came to mention Bun when I saw this hit the front page. I've
| been following Jarred's Twitter since I heard about Bun and
| it's quite impressive (albeit incomplete). To folks wondering
| why another bundler/what makes Bun special:
|
| - Faster than ESBuild/SWC
|
| - Fast build-time macros written as JSX (likely friendlier to
| develop than say a Babel plugin/macro). These open up a lot of
| possibilities that could benefit end users too, by performing
| more work on build/server and less client side.
|
| - Targeting ecosystem compatibility (eg will probably support
| the new JSX transform, which ESBuild does not and may not in
| the future)
|
| - Support for integration with frameworks, eg Next.js
|
| - Other cool performance-focused tools like Hop and Peechy[1]
| (though that's a fork of ESBuild creator's project Kiwi)
|
| This focus on performance is good for the JS ecosystem and for
| the web generally.
|
| 1: https://github.com/Jarred-Sumner/peechy
| perihelions wrote:
| Speaking of faster coreutils replacements, I highly recommend
| ripgrep (rg) and fd-find (fd), which are Rust-based, incompatible
| replacements for grep and find.
|
| I know I'm way behind the popularity curve* (they're already in
| debian-stable, for crying out loud); but for the benefit of
| anyone even more out of the loop than myself, do check these out.
| The speed increase is amazing: a significant quality-of-life
| improvement, essentially for free**. Particularly if you're on a
| modern NVMe SSD and not utilizing it properly. (Incidentally, did
| you know the "t" in coreutils' "tar" stands for magnetic
| [t]ape?)
|
| * (The now-top reply to this comment says 'ag' is even superior
| to 'rg', and they're probably right, but I had no clue about it!
| I did say I'm ignorant!)
|
| **(Might have some cost if you're heavily invested in power-user
| syntax of the GNU or BSD versions, in which case incompatibility
| has a price).
|
| https://github.com/BurntSushi/ripgrep
|
| https://github.com/sharkdp/fd
| chx wrote:
| When I looked, ag (https://github.com/ggreer/the_silver_searcher)
| was more featureful than ripgrep, yet it's always ripgrep that
| is mentioned. :/
| rscnt wrote:
| weird right? I don't know what ripgrep has, maybe it's just
| the name? the fact that you call `ag` but it's called
| `the_silver_searcher`?
| pdimitar wrote:
| From my side, I knew about both `ripgrep` and
| `the_silver_searcher` but I will openly admit I've lost
| faith in C and C++'s abilities to protect against memory
| safety errors.
|
| Thus, Rust tools get priority, at least for me. There
| are probably other memory-safe languages, but I haven't had
| the chance to give them a spin like I did with Rust. If I
| find out about them, then I'll also prefer tools written in
| them, if there's no Rust port and the alternatives are
| C/C++ tools.
| hakre wrote:
| Okay, call me lazy. Not for ripgrep; I installed it early from
| source. However, your fdfind (fd) mention made me curious.
| Thought I'd give it an apt install on Ubuntu 20.04 LTS but hit
| a dead end. So it's perhaps in debian-stable but not in Ubuntu.
| Just saying, so I can feel less out of the loop ;)
| perihelions wrote:
| Upstream says Ubuntu's package is 'fd-find' (and the
| executable is 'fdfind' -- both renamed from upstream's 'fd'
| because of a name collision with an unrelated Debian package.
| If you don't care about that one, you can safely alias
| fd="fdfind").
|
| https://github.com/sharkdp/fd
|
| (I've edited my first comment in response to this reply: I
| originally wrote "fdfind". (For a comment about regexp tools,
| this is a uniquely, hilariously stupid oversight. Sorry!)).
| hakre wrote:
| Well, even though it's late here, I could still have had enough
| creative energy to insert the minus in there ... yeah,
| thanks for that. So now I feel really behind because, sure,
| it's packaged in Ubuntu 20.04. Thanks again.
| notananthem wrote:
| (cries in winrar)
| Octplane wrote:
| Hey! Did you register me? </nag>
| kazinator wrote:
| > _Can't be larger than 4 GB_
|
| How does that even happen in 2021?
| maccard wrote:
| 4 GB is the maximum value of a 32-bit unsigned int. If I had to
| guess, that's the maximum size of the array/vector container in
| Zig.
| georgemcbay wrote:
| The answer is at the bottom of the page: all the
| offset/length data are uint32s.
|
| 2^32 = 4294967296
|
| Not sure if this limitation is being driven by upstream
| concerns, but this is why this code in particular is limited to
| that size.
| Jarred wrote:
| Offsets and lengths are stored as unsigned 32-bit integers.
| This saves space/memory, but it means it won't work for
| anything bigger than 4 GB.
|
| Maybe that's overly conservative. It wouldn't be hard to change.
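|
| The trade-off in concrete terms (Python struct used here just
| to show the field widths, not Hop's actual layout):
|
|     import struct
|
|     offset = 5 * 1024**3  # 5 GB, past the uint32 limit
|     try:
|         # 4-byte field: max 2**32 - 1 = 4294967295
|         struct.pack("<I", offset)
|     except struct.error as err:
|         print("uint32 overflow:", err)
|     # 8-byte field works, at 4 extra bytes per offset/length
|     print(struct.pack("<Q", offset))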
___________________________________________________________________
(page generated 2021-11-10 23:00 UTC)