[HN Gopher] Pack: A new container format for compressed files
       ___________________________________________________________________
        
       Pack: A new container format for compressed files
        
       Author : todsacerdoti
       Score  : 56 points
       Date   : 2024-03-22 19:11 UTC (3 hours ago)
        
 (HTM) web link (pack.ac)
 (TXT) w3m dump (pack.ac)
        
       | ramses0 wrote:
       | tldr: `cp * sqlite3://output.db` (basically?)
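        | 
        | Roughly, a minimal sketch of that idea in Python (stdlib only, with
        | zlib standing in for the Zstd that Pack actually uses; the schema
        | here is made up for illustration, not Pack's real one):
        | 
        |     import sqlite3, zlib, pathlib
        | 
        |     def pack_dir(src, dbfile):
        |         db = sqlite3.connect(dbfile)
        |         db.execute("CREATE TABLE IF NOT EXISTS"
        |                    " files(path TEXT PRIMARY KEY, data BLOB)")
        |         for p in pathlib.Path(src).rglob("*"):
        |             if p.is_file():
        |                 db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
        |                            (str(p), zlib.compress(p.read_bytes())))
        |         db.commit()
        |         db.close()
        | 
        |     pack_dir(".", "output.db")  # ~ `cp * sqlite3://output.db`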
       | 
       | Really seems to make sense! For another fun compression trick:
       | https://github.com/mxmlnkn/ratarmount
        
       | xuhu wrote:
       | Better, faster, stronger but I can't tell from the homepage
       | what's different about it, except that it is based on SQLite and
       | Zstd.
        
       | kevmo314 wrote:
       | Wow, Pascal! Haven't seen a project in Pascal in a while.
       | https://github.com/PackOrganization/Pack
        
         | nkozyra wrote:
         | Yeah, I'll wait for the ALGOL 68 port.
        
         | p0w3n3d wrote:
          | The last time I did some Pascal was 2006/7. I don't think I ever
          | saw production-grade Pascal code myself.
         | 
         | I wonder if this line is an array in-situ?
         | Split(APath, [sfodpoWithoutPathDelimiter,
         | sfodpoWithoutExtension], P, N)
        
           | mananaysiempre wrote:
           | No, it's a set (bitmask) constant[1].
           | 
            | [1] https://www.freepascal.org/docs-html/current/ref/refse83.htm...
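            | 
            | For readers who don't know Pascal, a rough Python analogy (using
            | enum.Flag to stand in for a Pascal set of enum values; the option
            | names are copied from the snippet above, not from a real library):
            | 
            |     from enum import Flag, auto
            | 
            |     class SplitFileOption(Flag):  # analogue of the Pascal enum
            |         WithoutPathDelimiter = auto()
            |         WithoutExtension = auto()
            | 
            |     # [sfodpoWithoutPathDelimiter, sfodpoWithoutExtension] is roughly:
            |     opts = (SplitFileOption.WithoutPathDelimiter
            |             | SplitFileOption.WithoutExtension)
            |     # i.e. a bitmask of options, not an array allocated in place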
        
         | mikepurvis wrote:
         | Indeed, this is the kind of thing I would have expected to see
         | written in Go or Rust. I wonder what the motivation for this
         | implementation choice was.
        
           | benignslime wrote:
           | In the "Source" section of the site:
           | 
           | > It is written in the Pascal language, a well-stabilized
           | standard language with compatibility promises for decades.
           | Using the FreePascal compiler and the Lazarus IDE for free
           | and easy development. The code is written to be seen as
           | pseudocode. In place of need, comments are written to help.
        
       | bborud wrote:
       | Pascal!? My monocle nearly fell out.
        
         | esafak wrote:
         | Must be in honor of Wirth's passing!
        
       | mmastrac wrote:
       | Sqlite3 is universal, but now your spec is entirely reliant on
       | Sqlite3's format and all the quirks required to support that.
       | 
       | If you actually care about the future, spec out your own database
       | format and use that instead. It could even be mostly a copy of
       | Sqlite3, but at least it would be part of the spec itself.
        
         | kevmo314 wrote:
         | On the other hand, by using Sqlite one can reimplement this
         | format in another language with very little effort.
        
           | KerrAvon wrote:
           | It requires the sqlite3 library bindings, which might be a
           | lot of effort.
        
             | 0cf8612b2e1e wrote:
             | Is there a mainstream language which does not have SQLite
             | bindings?
        
         | lordmauve wrote:
         | > Most popular solutions like Zip, gzip, tar, RAR, or 7-Zip are
         | near or more than three decades old.
         | 
         | If I can't extract .pack archives 3 decades from now, the use
         | of SQLite 3 will be the reason.
        
         | hgs3 wrote:
         | You're not "wrong" but Sqlite isn't your run-of-the-mill
         | project. "The SQLite file format is stable, cross-platform, and
         | backwards compatible and the developers pledge to keep it that
         | way through the year 2050." [1]
         | 
         | [1] https://www.sqlite.org/
        
           | orf wrote:
            | How different is this from any other run-of-the-mill project
            | with few active developers on a single implementation, and with
            | backwards compatibility based entirely on promises?
           | 
           | Hot take: SQLite has bugs and quirks.
        
         | sitkack wrote:
         | sqlite should be an implementation detail. The table format
         | should be fully documented and use a sqlite virtual table
         | module.
        
       | jlhawn wrote:
       | When I read the title, I thought it was a new operating system-
       | level containerization image format for filesystem layers and
        | runtime config. But it looks like "container format" is a more
        | general term for any format that collects multiple files or streams
        | into one. https://en.wikipedia.org/wiki/Container_format TIL.
       | 
       | OS containers could use an update too, though. They're often so
       | big and tend to use multiple tar.gz files.
        
       | theamk wrote:
       | > Pack format is based on the universal and arguably one of the
       | most safe and stable file formats, SQLite3, and compression is
       | based on Zstandard, the leading standard algorithm in the field.
       | 
       | yeah, no thanks. SQlite3 automatically means:
       | 
       | - Single implementation (yes, it's a nice one but still a single
       | dependency)
       | 
        | - No way to write directly to a pipe (SQLite requires a real
        | on-disk file)
       | 
       | - No standard way to read without getting the whole file first
       | 
        | - No guarantees on the number of disk seeks required to open the
        | file (relevant for NFS, sshfs, or any other remote filesystem)
       | 
       | - The archive file might be changed just by opening in read-only
       | mode
       | 
       | - Damaged file recovery is very hard
       | 
        | - Writing is explicitly not protected against several common
        | scenarios, like a backup being taken in the middle of a write
       | 
       | - No parallel reads from multiple threads
       | 
        | Look, SQLite3 is great for its designed purpose (an embedded
        | database). But trying to apply it to other purposes is often a bad
        | idea.
        
         | tredre3 wrote:
         | Pack may not be it, but it would be nice if Tar would go the
         | way of the dodo. It has all the flaws that you mentioned (and
         | more!).
        
           | viraptor wrote:
           | It does not in fact have most of the mentioned flaws. It's
           | pipeable, immutable, continuous, trivial to repair, safe for
           | append-only writing.
        
           | epcoa wrote:
           | It literally has none of the issues mentioned though. Not
           | that it doesn't have limitations but those listed aren't
           | them.
        
         | bonki wrote:
         | Ad 2: SQLite has an in-memory DB option.
        
         | Deukhoofd wrote:
         | > SQlite requires real on-disk file
         | 
         | You can run SQLite with an In Memory database, I use it quite a
         | lot for unit tests.
         | 
         | https://www.sqlite.org/inmemorydb.html
        
         | cogman10 wrote:
         | > No parallel reads from multiple threads
         | 
         | Sqlite supports parallel reads from multiple threads.
         | 
         | It even supports parallel reads and writes from multiple
         | threads.
         | 
         | What it doesn't really support is parallel reads and writes
         | from multiple processes.
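          | 
          | A minimal sketch of parallel reads with Python's stdlib bindings
          | (the pack.db filename and the files table are illustrative, and
          | this assumes a build with the default serialized threading mode):
          | 
          |     import sqlite3, threading
          | 
          |     def reader(path):
          |         # each thread opens its own read-only connection; SQLite
          |         # permits many concurrent readers
          |         con = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
          |         n = con.execute("SELECT count(*) FROM files").fetchone()[0]
          |         print(threading.current_thread().name, n)
          |         con.close()
          | 
          |     threads = [threading.Thread(target=reader, args=("pack.db",))
          |                for _ in range(4)]
          |     for t in threads: t.start()
          |     for t in threads: t.join()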
        
           | raggi wrote:
           | Not when threading is disabled, as it is in this project.
        
             | cogman10 wrote:
             | Sure, but single threading isn't an inherent part of sqlite
             | as OP implies.
             | 
             | > SQlite3 automatically means:
             | 
             | ...
             | 
             | > - No parallel reads from multiple threads
        
         | raggi wrote:
         | Often new things are met with an excess of skepticism but I
         | agree here.
         | 
         | I'd take this more seriously if the format was documented at
         | all, but so far it appears to be "this implementation relies on
         | sqlite and zstd therefore it's better", without even a
         | specification of the sql schema, let alone anything else.
         | 
         | The github repo contains precompiled binaries of zstd and
         | sqlite. The sqlite builds appear to have thread support
         | disabled so not only will it be single writer it'll be single
         | reader too.
         | 
         | The schema is missing strictly typed tables, and the
         | implementation appears to lack explicit collation handling for
         | names and content.
         | 
         | The described benchmark appears to involve files with an
         | average size of 16KB. I suspect it was executed on Windows on
         | top of NTFS with an AV package running, which is a pathological
         | case for single threaded use of the POSIXy IO APIs that
         | undoubtedly most of the selected implementations use.
         | 
          | It's slightly odd that it appears to perform better when SQLite
          | is being built with thread safety disabled
          | (https://github.com/PackOrganization/Pack/blob/main/Libraries...)
          | and yet the implementation is inserting in a thread group:
          | https://github.com/PackOrganization/Pack/blob/main/Source/Dr....
          | I suspect the
         | answer here is that because the implementation is using a
         | thread group to read files and compress chunks, it's amortizing
         | the slow cost of file opens in this benchmark using threading,
         | but is heavily constrained by the sqlite locking - and the
         | compression ratio will take a substantial hit in some cases as
         | a result of the limited range of each compression operation. I
         | suspect that zstd(1) with -T0 would outperform this for speed
         | and compression ratio, and it's already installed on a lot of
         | systems - even Windows 11 gained native support for .zst files
         | recently.
         | 
         | The premise that we could do with something more portable than
         | TAR and with less baggage is somewhat reasonable - we probably
          | could do with a simple, safe format. There are many more key
          | considerations to making such a format good, including several
          | you outline: choices around seeking, syncing, incremental
          | updates, compression efficiency, parallelism, etc. There is no
         | single set of trade-offs to cover all cases but it would be
         | possible to make a file format that can be shared among them,
         | while constraining the design somewhat for safety and ease of
         | portability.
        
       | bno1 wrote:
       | I found squashfs to be a great archive format. It preserves Linux
        | file ownership and permissions, you can extract individual files
        | without parsing the entire archive (unlike tar), and it's mountable.
        | It's also openable in 7-Zip.
       | 
       | I wonder how pack compares to it, but its home page and github
       | don't tell much.
        
         | bonki wrote:
         | I second this.
        
       | rodrigokumpera wrote:
        | ZStandard is... standardized under RFC 8878
       | 
       | Plus there's no discussion against zstd itself and its container
       | format.
        
         | rustyconover wrote:
          | If you're looking for a debate against ZStandard, it's hard to
          | argue against it.
         | 
         | ZStandard is Pareto optimal.
         | 
         | For the argument why, I really recommend this investigation.
         | 
         | https://insanity.industries/post/pareto-optimal-compression/
        
           | bonki wrote:
           | Thanks, superbly written and highly informative article!
        
       | xcdzvyn wrote:
       | With all due respect, I find it hard to believe the author
       | stumbled upon a trivial method of improving tarballing
       | performance by several orders of magnitude that nobody else had
       | considered before.
       | 
       | If I understand correctly, they're suggesting Pack, which both
       | archives and compresses, is 30x faster than creating a plain tar
       | archive. That just sounds like you used multithreading and tar
       | didn't.
       | 
       | Either way, it'd be nice to see [a version of] Pack support plain
       | archival, rather than being forced to tack on Zstd.
        
         | TylerE wrote:
         | That's more because plain tar is actually a really dumb way of
         | handling files that aren't going to tape.
         | 
         | Being better than that is not a hard bar.
        
           | cogman10 wrote:
            | The tar file format is REALLY bad. It's pretty much
            | impossible to thread because it just alternates metadata and
            | content, repeatedly concatenated. I.e.:
            | 
            |     /foo.txt 21
            |     This is the foo file
            |     /bar.txt 21
            |     This is the bar file
            | 
            | That makes it super hard to deal with, as you essentially need
            | to walk the entire tar file before you can even list its
            | contents. To add a file you have to wait for the previous file
            | to be added.
           | 
           | Using something like sqlite solves this particular problem
           | because you can have a table with file names and a table with
           | file contents that can both be inserted into in parallel
           | (though that will mean the contents aren't guaranteed to be
            | contiguous). Since SQLite is just a B-tree, it's easy (or at
            | least well understood) to modify its contents concurrently.
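            | 
            | A hedged sketch of that two-table idea in Python (illustrative
            | schema, not Pack's actual one):
            | 
            |     import sqlite3
            | 
            |     con = sqlite3.connect("archive.db")
            |     con.executescript("""
            |       CREATE TABLE IF NOT EXISTS names(
            |         id INTEGER PRIMARY KEY, path TEXT UNIQUE);
            |       CREATE TABLE IF NOT EXISTS contents(
            |         file_id INTEGER, seq INTEGER, chunk BLOB,
            |         PRIMARY KEY(file_id, seq));
            |     """)
            | 
            |     # metadata and content rows can be written independently, and
            |     # the chunks of one file need not be contiguous on disk
            |     fid = con.execute(
            |         "INSERT INTO names(path) VALUES ('foo.txt')").lastrowid
            |     con.execute("INSERT INTO contents VALUES (?, 0, ?)",
            |                 (fid, b"This is the foo file"))
            |     con.commit()
            | 
            |     # listing the "directory" never touches the contents table
            |     for (path,) in con.execute("SELECT path FROM names ORDER BY path"):
            |         print(path)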
        
             | TylerE wrote:
              | Or do what zip and every other format does and just put
              | all the metadata at the beginning - enough to list all
              | files, and extract any single one efficiently.
        
               | nullindividual wrote:
               | Tapes don't (? certainly didn't) operate this way. You
               | need to read the entire tape to list the contents.
               | 
               | Since tar is a Tape ARchive, the way tar operates makes
               | sense (as it was designed for both output to file _and_
               | device, i.e. tape).
        
               | monocasa wrote:
               | Tapes currently don't really operate like tar anymore
               | either. Filesystems like LTFS stick the metadata all in
               | one blob somewhere.
        
               | nullindividual wrote:
               | It's been a long time since I've operated tape, so good
               | to know things have changed for the better.
        
               | tredre3 wrote:
               | That point is always raised on every criticism of tar
               | (that it's good at tape).
               | 
               | Yes! It is! But it's awful at archive files, which is
               | what it's used for nowadays and what's being discussed
               | right now.
               | 
               | Over the past 50 years some people did try to improve
               | tar. People did develop ways to append a file table at
                | the end of an archive file while maintaining compatibility
                | with tapes, all tar utilities, and piping.
               | 
               | Similarly, driven people did extend (pk)zip to cover all
               | the unix-y needs. In fact the current zip utility still
               | supports permissions and symlinks to this day.
               | 
               | But despite those better methods, people keep pushing og
               | tar. Because it's good at tape archival. Sigh.
        
               | monocasa wrote:
               | zip interestingly sticks the metadata at the end. That
               | lets you add files to a zip without touching what's
               | already been zipped. Just new metadata at the end.
               | 
               | Modern tape archives like LTFS do the same thing as well.
        
         | darby_eight wrote:
         | Eh, it's not that hard to imagine given how rare it is to zip
         | 81k files of around 1kb each.
        
           | iscoelho wrote:
           | Not that rare at all. Take a full disk zip/tar of any
           | Linux/Windows filesystem and you'll encounter a lot of small
           | files.
        
             | darby_eight wrote:
             | Ok? How are you comparing these systems to the benchmark so
             | they might be considered relevant? Compressing "Lots of
             | small files" describes an infinite variety of workloads. To
              | achieve anything close to the benchmark you'd need to
              | compress only small files of a small average size in a single
              | directory. And even the contents
             | of those files would have large implications as to expected
             | performance....
        
               | iscoelho wrote:
               | My comment is not making any claims about that. It's just
               | a correction that filesystems with "81k 1KB files" are
               | indeed common.
        
           | viraptor wrote:
           | That's basically any large source repo.
        
         | Hello71 wrote:
         | Also, 4.7 seconds to read 1345 MB in 81k files is suspiciously
         | slow. On my six-year-old low/mid-range Intel 660p with Linux
         | 6.8, tar -c /usr/lib >/dev/null with 2.4 GiB in 49k files takes
         | about 1.25s cold and 0.32s warm. Of course, the sales pitch has
         | no explanation of which hardware, software, parameters, or test
         | procedures were used. I reckon tar was tested with cold cache
         | and pack with warm cache, and both are basically benchmarking
         | I/O speed.
        
           | lilyball wrote:
            | The footnotes at the bottom say:
           | 
           | > _Development machine with a two-year-old CPU and NVMe disk,
           | using Windows with the NTFS file system. The differences are
           | even greater on Linux using ext4. Value holds on an old HDD
           | and one-core CPU._
           | 
           | > _All corresponding official programs were used in an out-
           | of-the-box configuration at the time of writing in a warm
           | state._
        
       | Kwpolska wrote:
       | If I need to compress stuff, it's either to move a folder around
       | (to places which may not have a niche compression tool, so ZIP
       | wins), or to archive something long-term, where it can take a
       | while to compress. I don't see the advantages of this, since the
       | compression output size seems quite mediocre even if it's
       | supposedly fast (compared to what implementations of the other
       | formats?)
        
       | crq-yml wrote:
       | The web site behaves strangely on mobile and folds the text as I
       | try to scroll around.
        
         | 0x073 wrote:
          | Yes, on mobile at least it's absolute madness.
        
       | smartmic wrote:
       | The whole thing makes sense to me and I can't see any major
       | points of criticism in the design rationale. Some thoughts:
       | 
       | * There is already a "native" Sqlite3 container solution called
       | Sqlar [0].
       | 
       | * Sqlite3 itself is certainly suitable as a base and I wouldn't
       | worry about its future at all.
       | 
        | * Pascal is also an interesting choice: it is not the hippest
        | language nor a new kid on the block, but it offers its own
        | advantages in being "boring" and "normal". I am thinking
       | especially of the Lindy effect [1].
       | 
       | All in all a nice surprise and I am curious to see the future of
       | Pack. After all, it can only succeed if it gets a stable,
        | critical mass of supporters, from both the user and maintainer
        | sides.
       | 
       | [0]: https://sqlite.org/sqlar/doc/trunk/README.md
       | 
       | [1]: https://en.wikipedia.org/wiki/Lindy_effect
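        | 
        | For the curious, sqlar [0] is just a single table, which is why a
        | read-back sketch needs nothing beyond the Python stdlib (based on my
        | reading of the sqlar README; error handling omitted):
        | 
        |     import sqlite3, zlib
        | 
        |     def sqlar_read(archive, name):
        |         con = sqlite3.connect(archive)
        |         sz, data = con.execute(
        |             "SELECT sz, data FROM sqlar WHERE name = ?",
        |             (name,)).fetchone()
        |         con.close()
        |         # data is zlib-compressed unless storing it uncompressed was
        |         # no larger, in which case it is kept verbatim (sz == len)
        |         return data if len(data) == sz else zlib.decompress(data)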
        
       | SyrupThinker wrote:
       | Interesting, I've recently spent an unhealthy amount of time
       | researching archival formats to build the same setup of using
       | SQLite with ZStd.
       | 
       | My use case is extremely redundant data (specific website dumps +
       | logs) that I want decently quick random access into, and I was
       | unhappy with either the access speed, quality/usability or even
       | existence of libraries for several formats.
       | 
       | Glancing over the code this seems to use the following setup:
       | 
       | - Aggregate files
       | 
       | - Chunk into blocks
       | 
       | - Compress blocks of fixed size
       | 
       | - Store file to chunk and chunk to block associations
       | 
        | What I did not see is a deduplication step for the chunks, or an
        | attempt to group files (and, by extension, blocks) by similarity to
        | improve compression.
       | 
       | But I might have just missed that due to lack of familiarity with
       | Pascal.
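        | 
        | For what it's worth, the kind of chunk-level dedup step I was hoping
        | to find looks roughly like this (a Python sketch with fixed-size
        | chunks keyed by content hash; the names are made up and this is not
        | what Pack does):
        | 
        |     import hashlib, zlib
        | 
        |     BLOCK = 1 << 16  # 64 KiB chunks, an arbitrary choice
        | 
        |     def dedup_chunks(paths):
        |         store = {}     # chunk hash -> compressed chunk (stored once)
        |         manifest = []  # (path, ordered list of chunk hashes)
        |         for path in paths:
        |             hashes = []
        |             with open(path, "rb") as f:
        |                 while chunk := f.read(BLOCK):
        |                     h = hashlib.sha256(chunk).hexdigest()
        |                     store.setdefault(h, zlib.compress(chunk))
        |                     hashes.append(h)
        |             manifest.append((path, hashes))
        |         return store, manifest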
       | 
       | For anyone interested in this strategy, take a look at ZPAQ [1]
       | by Matt Mahoney, you might know him from the Hutter Prize
       | competition [2] / Large Text Compression Benchmark. It takes 14th
       | place with tuned parameters.
       | 
        | There's also a maintained fork called zpaqfranz [3], but I ran into
       | some issues like incorrect disk size estimates with it. For me
       | the code was also sometimes hard to read due to being a mix of
       | English and Italian. So your mileage may vary.
       | 
        | [1]: http://mattmahoney.net/dc/zpaq.html
        | [2]: http://prize.hutter1.net
        | [3]: https://github.com/fcorbelli/zpaqfranz
        
       | ericyd wrote:
       | Website is quite annoying to use on mobile. Scrolling behaviors
       | get interpreted as taps which close the container you're reading.
        
         | throwaway67743 wrote:
          | Also, if you dare try to naturally scroll up after opening a
          | container, it's interpreted as a refresh and the page redraws. It
          | might be an awesome format, but the web design fail negates it
          | entirely.
        
       | conception wrote:
        | Is anyone using this or SQLite archives for anything at scale?
       | They always seemed like a good solution for certain scientific
       | outputs. But data integrity obviously is a concern.
        
       ___________________________________________________________________
       (page generated 2024-03-22 23:00 UTC)