[HN Gopher] Show HN: Ratarmount 1.0.0 - Rapid access to large ar...
       ___________________________________________________________________
        
       Show HN: Ratarmount 1.0.0 - Rapid access to large archives via a
       FUSE filesystem
        
       Hi HN,  Since my first posted introduction of ratarmount [0], two
       years have gone by and many features have been added.  To
       summarize, ratarmount enables working with archived contents
       exposed as a filesystem without the data having to be extracted to
       disk:

           pip install ratarmount
           ratarmount archive.tar mounted
           ls -la mounted

       I started this
       project after noticing the slowness of archivemount with large TAR
       files and wondering how this could be because the file contents
       exist at some offset in the archive file and it should not be
       difficult to read that data. It turns out that part was not
       difficult; however, packaging everything nicely, adding tests,
       and adding many more formats and features, such as union mounting
       and recursive mounting, are what have kept me busy on this
       project until today. Since the last Show HN, libarchive,
       SquashFS, fsspec, and many more backends have been added, so it
       should now be able to read every format that archivemount can,
       and some more, and even read them remotely. However, performance
       for any use case besides bzip2/gzip-compressed TARs may vary,
       even though I did my best.  Personally, I am using it to view
       packed folders with many small files that do not change anymore.
       I pack these folders because otherwise copying to other hard
       drives takes much longer. I'm
       also using it when I want to avoid the command line. I have added
       ratarmount as a Caja user script for mounting via right-click. This
       way, I can mount an archive and then copy the contents to another
       drive to effectively do the extraction and copying in one step.
       Initially, I also used it to train on the ImageNet TAR archive
       directly.  I probably should have released a 1.0.0 some years
       ago, because I have kept the command-line interface and even the
       index file format as compatible as possible across the several
       0.x versions already.  Some larger future features on my wishlist
       are:
       - A new indexed_lz4 backend. This should be doable inside my
         indexed_bzip2 [1] / rapidgzip [2] backend library.

       - A custom ZIP and SquashFS reader accelerated by rapidgzip and
         indexed_bzip2 to enable faster seeking inside large files
         inside those archives.

       - I am eagerly awaiting the Linux kernel FUSE BPF support [3],
         which might enable some further latency reductions for use
         cases with very small files / very small reads, at least when
         working with uncompressed archives. I have done comparisons for
         such archives (100k images of 100 KiB each) and noticed that
         direct access via the Python library ratarmountcore was roughly
         two times faster than access via ratarmount and FUSE. Maybe
         I'll even find the time to play around with the existing
         unmerged FUSE BPF patch set.

       [0] https://news.ycombinator.com/item?id=30631387
       [1] https://news.ycombinator.com/item?id=31875318
       [2] https://news.ycombinator.com/item?id=37378411
       [3] https://lwn.net/Articles/937433/
        
       Author : mxmlnkn
       Score  : 58 points
       Date   : 2024-11-01 15:25 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | lathiat wrote:
       | This is awesome :)
        
       | kenmacd wrote:
       | I find this project hugely helpful when working with Google
       | Takeout archives. I normally pick a size that's not too large so
       | that downloading them is easier, then it's simply a matter of:
       | ratarmount ./takeout-20231130T224325Z-0*.tgz ./mnt
        
       | sziiiizs wrote:
       | That is very cool. May I ask, how does the compressed stream
       | seeking work? Does it keep state of the decompressor at certain
       | points so arbitrary access can be faster than reading from the
       | start of the stream?
        
         | mxmlnkn wrote:
         | For bzip2, a list of pairs of a bit offset in the compressed
         | stream and the corresponding byte offset in the decompressed
         | stream suffices, because each bzip2 block is independent.
         | 
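As a rough illustration (not ratarmount's actual code), such a block index can be kept as a sorted list of offset pairs; seeking then becomes a binary search for the last seek point at or before the requested decompressed offset:

```python
import bisect

# Hypothetical seek-point table for a block-based format such as bzip2:
# (offset in the compressed stream, offset in the decompressed stream),
# one entry per independently decodable block, sorted by both columns.
seek_points = [(0, 0), (9_000, 100_000), (17_500, 200_000), (26_800, 300_000)]
decompressed_offsets = [d for _, d in seek_points]

def nearest_seek_point(target: int) -> tuple:
    """Return the last seek point at or before the target decompressed offset."""
    i = bisect.bisect_right(decompressed_offsets, target) - 1
    return seek_points[max(i, 0)]

# To serve a read at decompressed offset 250 000, start decoding at
# compressed offset 17 500 and discard 50 000 decompressed bytes.
compressed_offset, block_start = nearest_seek_point(250_000)
```

The numbers above are made up; the point is only that each seek costs one binary search plus decoding at most one block.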
         | For gzip, it is as you say. However, when only wanting to seek
         | to DEFLATE block boundaries, the "state" of the decompressor is
         | as simple as the last decompressed 32 KiB in the stream.
         | Compared to the two offsets for bzip2, this is 2048x more data
         | to store though. Rapidgzip does sparsity analysis to find out
         | which of decompressed bytes are actually referenced later on
         | and also recompresses those windows to reduce overhead.
         | Ratarmount still uses the full 32 KiB windows though. This is
         | one of the larger todos, i.e., to use the compressed index
         | format, instead, and define such a format in the first place.
         | This will definitely be necessary for LZ4, for which the window
         | size is 64 KiB instead of 32 KiB.
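The "state is the last 32 KiB" point can be mimicked with zlib's preset-dictionary support: a raw DEFLATE stream whose back-references reach into earlier data can only be decoded when that earlier data is supplied again. A minimal sketch (not how rapidgzip itself is implemented):

```python
import zlib

# Stand-in for the last 32 KiB of already decompressed stream content.
window = bytes(range(256)) * 128  # 32 KiB

# Compress the next chunk as raw DEFLATE (wbits=-15) with the window as a
# preset dictionary, as if the compressor were continuing an earlier stream.
compressor = zlib.compressobj(wbits=-15, zdict=window)
payload = window[:4096] + b"some new data"  # references earlier bytes
stream = compressor.compress(payload) + compressor.flush()

# Decoding succeeds only when the very same window is provided again,
# which is why a gzip seek point must store (or recompute) those 32 KiB.
decompressor = zlib.decompressobj(wbits=-15, zdict=window)
assert decompressor.decompress(stream) == payload
```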
         | 
         | For zstd and xz, this approach reaches its limits because the
         | Lempel-Ziv backreference windows are not limited in size in
         | general. However, I am hoping that the sparsity analysis should
         | make it feasible because, in the worst case, the state cannot
         | be longer than the next decompressed chunk. In this worst case,
         | the decompressed block consists only of non-overlapping back-
         | references.
        
       | ranger_danger wrote:
       | similar projects:
       | 
       | https://github.com/cybernoid/archivemount
       | 
       | https://github.com/google/fuse-archive
       | 
       | https://github.com/google/mount-zip
       | 
       | https://bitbucket.org/agalanin/fuse-zip
        
       | BoingBoomTschak wrote:
       | Congratulations on your v1.0.0! This is definitely a very nice
       | tool, I'll try to play with it a bit and maybe try to make an
       | ebuild (though the build system seems a bit complicated for
       | proper no-network package managers). The extensive benchmark
       | section is a nice plus.
       | 
       | A small note, archivemount has a living fork here:
       | https://git.sr.ht/~nabijaczleweli/archivemount-ng
        
       ___________________________________________________________________
       (page generated 2024-11-01 23:01 UTC)