[HN Gopher] Pack: A new container format for compressed files
___________________________________________________________________
Pack: A new container format for compressed files
Author : todsacerdoti
Score : 56 points
Date : 2024-03-22 19:11 UTC (3 hours ago)
(HTM) web link (pack.ac)
(TXT) w3m dump (pack.ac)
| ramses0 wrote:
| tldr: `cp * sqlite3://output.db` (basically?)
|
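| In spirit, a rough Python sketch of the same idea (illustrative
| only; the table and file names are made up):
|
|     import sqlite3, pathlib
|
|     # Stuff every file in the current directory into one SQLite
|     # database, roughly what `cp * sqlite3://output.db` implies.
|     con = sqlite3.connect("output.db")
|     con.execute("CREATE TABLE IF NOT EXISTS f(path TEXT, data BLOB)")
|     for p in pathlib.Path(".").iterdir():
|         if p.is_file():
|             con.execute("INSERT INTO f VALUES (?, ?)",
|                         (str(p), p.read_bytes()))
|     con.commit()
|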
| Really seems to make sense! For another fun compression trick:
| https://github.com/mxmlnkn/ratarmount
| xuhu wrote:
| Better, faster, stronger, but I can't tell from the homepage
| what's different about it, except that it's based on SQLite
| and Zstd.
| kevmo314 wrote:
| Wow, Pascal! Haven't seen a project in Pascal in a while.
| https://github.com/PackOrganization/Pack
| nkozyra wrote:
| Yeah, I'll wait for the ALGOL 68 port.
| p0w3n3d wrote:
| Last time I did some Pascal was 2006/7. I don't think I ever
| saw production-grade code myself.
|
| I wonder if this line is an array in-situ?
| Split(APath, [sfodpoWithoutPathDelimiter,
| sfodpoWithoutExtension], P, N)
| mananaysiempre wrote:
| No, it's a set (bitmask) constant[1].
|
| [1] https://www.freepascal.org/docs-
| html/current/ref/refse83.htm...
| mikepurvis wrote:
| Indeed, this is the kind of thing I would have expected to see
| written in Go or Rust. I wonder what the motivation for this
| implementation choice was.
| benignslime wrote:
| In the "Source" section of the site:
|
| > It is written in the Pascal language, a well-stabilized
| standard language with compatibility promises for decades.
| Using the FreePascal compiler and the Lazarus IDE for free
| and easy development. The code is written to be seen as
| pseudocode. In place of need, comments are written to help.
| bborud wrote:
| Pascal!? My monocle nearly fell out.
| esafak wrote:
| Must be in honor of Wirth's passing!
| mmastrac wrote:
| Sqlite3 is universal, but now your spec is entirely reliant on
| Sqlite3's format and all the quirks required to support that.
|
| If you actually care about the future, spec out your own database
| format and use that instead. It could even be mostly a copy of
| Sqlite3, but at least it would be part of the spec itself.
| kevmo314 wrote:
| On the other hand, by using Sqlite one can reimplement this
| format in another language with very little effort.
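|
| For instance, a few lines of Python against the stdlib sqlite3
| module recover the whole schema of any SQLite-based archive
| ("archive.pack" is a placeholder name):
|
|     import sqlite3
|
|     # Open the archive read-only; sqlite_master stores the DDL
|     # for every table, so the container is self-describing.
|     con = sqlite3.connect("file:archive.pack?mode=ro", uri=True)
|     for name, sql in con.execute(
|             "SELECT name, sql FROM sqlite_master WHERE type='table'"):
|         print(name, sql)
|     con.close()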
| KerrAvon wrote:
| It requires the sqlite3 library bindings, which might be a
| lot of effort.
| 0cf8612b2e1e wrote:
| Is there a mainstream language which does not have SQLite
| bindings?
| lordmauve wrote:
| > Most popular solutions like Zip, gzip, tar, RAR, or 7-Zip are
| near or more than three decades old.
|
| If I can't extract .pack archives 3 decades from now, the use
| of SQLite 3 will be the reason.
| hgs3 wrote:
| You're not "wrong" but Sqlite isn't your run-of-the-mill
| project. "The SQLite file format is stable, cross-platform, and
| backwards compatible and the developers pledge to keep it that
| way through the year 2050." [1]
|
| [1] https://www.sqlite.org/
| orf wrote:
| How different is this from any other run-of-the-mill project
| with few active developers on a single implementation, with
| backwards compatibility based entirely on promises?
|
| Hot take: SQLite has bugs and quirks.
| sitkack wrote:
| sqlite should be an implementation detail. The table format
| should be fully documented and use a sqlite virtual table
| module.
| jlhawn wrote:
| When I read the title, I thought it was a new operating system-
| level containerization image format for filesystem layers and
| runtime config. But it looks like "container format" is a more
| general term for a collection of multiple files or streams into
| one. https://en.wikipedia.org/wiki/Container_format TIL.
|
| OS containers could use an update too, though. They're often so
| big and tend to use multiple tar.gz files.
| theamk wrote:
| > Pack format is based on the universal and arguably one of the
| most safe and stable file formats, SQLite3, and compression is
| based on Zstandard, the leading standard algorithm in the field.
|
| yeah, no thanks. SQLite3 automatically means:
|
| - Single implementation (yes, it's a nice one but still a single
| dependency)
|
| - No way to write directly to a pipe (SQLite requires a real
| on-disk file)
|
| - No standard way to read without getting the whole file first
|
| - No guarantees on the number of disk seeks required to open the
| file (relevant for NFS, sshfs, or any other remote filesystem)
|
| - The archive file might be changed just by opening in read-only
| mode
|
| - Damaged file recovery is very hard
|
| - Writing is explicitly not protected against several common
| scenarios, like a backup being taken in the middle of a write
|
| - No parallel reads from multiple threads
|
| Look, SQLite3 is great for its designed purpose (an embedded
| database). But trying to apply it to other purposes is often a
| bad idea.
| tredre3 wrote:
| Pack may not be it, but it would be nice if Tar would go the
| way of the dodo. It has all the flaws that you mentioned (and
| more!).
| viraptor wrote:
| It does not in fact have most of the mentioned flaws. It's
| pipeable, immutable, continuous, trivial to repair, safe for
| append-only writing.
| epcoa wrote:
| It literally has none of the issues mentioned though. Not
| that it doesn't have limitations but those listed aren't
| them.
| bonki wrote:
| Ad 2: SQLite has an in-memory DB option.
| Deukhoofd wrote:
| > SQlite requires real on-disk file
|
| You can run SQLite with an in-memory database; I use it quite
| a lot for unit tests.
|
| https://www.sqlite.org/inmemorydb.html
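|
| A quick illustration with Python's stdlib bindings:
|
|     import sqlite3
|
|     # ":memory:" creates a private, purely in-memory database;
|     # nothing touches the disk.
|     con = sqlite3.connect(":memory:")
|     con.execute("CREATE TABLE t(x)")
|     con.execute("INSERT INTO t VALUES (1)")
|     print(con.execute("SELECT x FROM t").fetchone())  # (1,)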
| cogman10 wrote:
| > No parallel reads from multiple threads
|
| Sqlite supports parallel reads from multiple threads.
|
| It even supports parallel reads and writes from multiple
| threads.
|
| What it doesn't really support is parallel reads and writes
| from multiple processes.
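|
| A sketch of the usual pattern in Python (assumes an existing
| data.db with a table t; one connection per thread):
|
|     import sqlite3, threading
|
|     def reader():
|         # Each thread opens its own read-only connection; SQLite
|         # allows any number of concurrent readers.
|         con = sqlite3.connect("file:data.db?mode=ro", uri=True)
|         con.execute("SELECT count(*) FROM t").fetchone()
|         con.close()
|
|     threads = [threading.Thread(target=reader) for _ in range(4)]
|     for t in threads: t.start()
|     for t in threads: t.join()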
| raggi wrote:
| Not when threading is disabled, as it is in this project.
| cogman10 wrote:
| Sure, but single threading isn't an inherent part of sqlite
| as OP implies.
|
| > SQlite3 automatically means:
|
| ...
|
| > - No parallel reads from multiple threads
| raggi wrote:
| Often new things are met with an excess of skepticism but I
| agree here.
|
| I'd take this more seriously if the format were documented at
| all, but so far it appears to be "this implementation relies on
| sqlite and zstd therefore it's better", without even a
| specification of the SQL schema, let alone anything else.
|
| The GitHub repo contains precompiled binaries of zstd and
| sqlite. The sqlite builds appear to have thread support
| disabled, so not only will it be single-writer, it'll be
| single-reader too.
|
| The schema is missing strictly typed tables, and the
| implementation appears to lack explicit collation handling for
| names and content.
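|
| For reference, this is what a strictly typed table with pinned
| collation looks like (SQLite >= 3.37; a sketch, not Pack's
| actual schema), here driven from Python:
|
|     import sqlite3
|
|     con = sqlite3.connect(":memory:")
|     # STRICT rejects values that don't match the declared column
|     # types; COLLATE BINARY pins how names compare and sort.
|     con.execute("""
|         CREATE TABLE item(
|             id      INTEGER PRIMARY KEY,
|             name    TEXT COLLATE BINARY NOT NULL,
|             content BLOB NOT NULL
|         ) STRICT
|     """)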
|
| The described benchmark appears to involve files with an
| average size of 16KB. I suspect it was executed on Windows on
| top of NTFS with an AV package running, which is a pathological
| case for single threaded use of the POSIXy IO APIs that
| undoubtedly most of the selected implementations use.
|
| It's slightly odd that it appears to perform better when SQLite
| is built with thread safety disabled
| (https://github.com/PackOrganization/Pack/blob/main/Libraries...)
| and yet the implementation is inserting in a thread group:
| https://github.com/PackOrganization/Pack/blob/main/Source/Dr....
| I suspect the answer here is that because the implementation is
| using a thread group to read files and compress chunks, it's
| amortizing the slow cost of file opens in this benchmark using
| threading, but is heavily constrained by the SQLite locking -
| and the compression ratio will take a substantial hit in some
| cases as a result of the limited range of each compression
| operation. I suspect that zstd(1) with -T0 would outperform this
| for speed and compression ratio, and it's already installed on a
| lot of systems - even Windows 11 gained native support for .zst
| files recently.
|
| The premise that we could do with something more portable than
| TAR and with less baggage is somewhat reasonable - we probably
| could do with a simple, safe format. There are many more key
| considerations in making such a format good, including several
| you outline: choices around seeking, syncing, incremental
| updates, compression efficiency, parallelism, etc. There is no
| single set of trade-offs that covers all cases, but it would be
| possible to make a file format that can be shared among them,
| while constraining the design somewhat for safety and ease of
| portability.
| bno1 wrote:
| I found squashfs to be a great archive format. It preserves Linux
| file ownership and permissions, you can extract individual files
| without parsing the entire archive like tar and it's mountable.
| It's also openable in 7zip.
|
| I wonder how Pack compares to it, but its home page and GitHub
| don't say much.
| bonki wrote:
| I second this.
| rodrigokumpera wrote:
| Zstandard is... standardized under RFC 8878.
|
| Plus, nobody here is really arguing against zstd itself or its
| container format.
| rustyconover wrote:
| If you're looking for a debate against Zstandard, it's hard
| to argue against it.
|
| ZStandard is Pareto optimal.
|
| For the argument why, I really recommend this investigation.
|
| https://insanity.industries/post/pareto-optimal-compression/
| bonki wrote:
| Thanks, superbly written and highly informative article!
| xcdzvyn wrote:
| With all due respect, I find it hard to believe the author
| stumbled upon a trivial method of improving tarballing
| performance by several orders of magnitude that nobody else had
| considered before.
|
| If I understand correctly, they're suggesting Pack, which both
| archives and compresses, is 30x faster than creating a plain tar
| archive. That just sounds like you used multithreading and tar
| didn't.
|
| Either way, it'd be nice to see [a version of] Pack support plain
| archival, rather than being forced to tack on Zstd.
| TylerE wrote:
| That's more because plain tar is actually a really dumb way of
| handling files that aren't going to tape.
|
| Being better than that is not a hard bar.
| cogman10 wrote:
| The tar file format is REALLY bad. It's pretty much
| impossible to thread because it just writes metadata then
| content, concatenated over and over, i.e.
|
|     /foo.txt 21 This is the foo file
|     /bar.txt 21 This is the bar file
|
| That makes it super hard to deal with, as you essentially need
| to walk the entire tar file before you can even list its
| directories. To add a file you have to wait for the previous
| file to be added.
|
| Using something like SQLite solves this particular problem
| because you can have a table with file names and a table with
| file contents that can both be inserted into in parallel
| (though that means the contents aren't guaranteed to be
| contiguous). Since SQLite is just a btree, it is well
| understood how to concurrently modify the contents of the
| tree.
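|
| A toy version of that split in Python (schema names invented
| for illustration):
|
|     import sqlite3
|
|     con = sqlite3.connect(":memory:")
|     # Names and contents live in separate tables, so listing
|     # never has to touch the file data, unlike tar, which must
|     # walk every header/content pair.
|     con.executescript("""
|         CREATE TABLE file(id INTEGER PRIMARY KEY, path TEXT);
|         CREATE TABLE content(file_id INTEGER, data BLOB);
|     """)
|     con.execute("INSERT INTO file VALUES (1, '/foo.txt')")
|     con.execute("INSERT INTO content VALUES (1, ?)",
|                 (b"This is the foo file",))
|     print([p for (p,) in con.execute("SELECT path FROM file")])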
| TylerE wrote:
| Or just do what zip and every other format does and put all
| the metadata at the beginning - enough to list all files and
| extract any single one efficiently.
| nullindividual wrote:
| Tapes don't (or at least certainly didn't) operate this way.
| You need to read the entire tape to list the contents.
|
| Since tar is a Tape ARchive, the way tar operates makes
| sense (as it was designed for both output to file _and_
| device, i.e. tape).
| monocasa wrote:
| Tapes currently don't really operate like tar anymore
| either. Filesystems like LTFS stick the metadata all in
| one blob somewhere.
| nullindividual wrote:
| It's been a long time since I've operated tape, so good
| to know things have changed for the better.
| tredre3 wrote:
| That point is always raised on every criticism of tar
| (that it's good at tape).
|
| Yes! It is! But it's awful at archive files, which is
| what it's used for nowadays and what's being discussed
| right now.
|
| Over the past 50 years, some people did try to improve tar.
| People developed ways to append a file table at the end of an
| archive file, maintaining compatibility with tapes, all tar
| utilities, and piping.
|
| Similarly, driven people did extend (pk)zip to cover all
| the unix-y needs. In fact the current zip utility still
| supports permissions and symlinks to this day.
|
| But despite those better methods, people keep pushing og
| tar. Because it's good at tape archival. Sigh.
| monocasa wrote:
| zip interestingly sticks the metadata at the end. That
| lets you add files to a zip without touching what's
| already been zipped. Just new metadata at the end.
|
| Modern tape archives like LTFS do the same thing as well.
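|
| Python's stdlib zipfile shows the consequence: append mode adds
| entries without rewriting existing compressed data
| ("example.zip" is a placeholder name):
|
|     import zipfile
|
|     # Mode "a": existing members are left untouched; the new
|     # entry plus an updated central directory go at the end.
|     with zipfile.ZipFile("example.zip", "a") as z:
|         z.writestr("new.txt", "added later")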
| darby_eight wrote:
| Eh, it's not that hard to imagine given how rare it is to zip
| 81k files of around 1kb each.
| iscoelho wrote:
| Not that rare at all. Take a full disk zip/tar of any
| Linux/Windows filesystem and you'll encounter a lot of small
| files.
| darby_eight wrote:
| Ok? How are you comparing these systems to the benchmark so
| they might be considered relevant? Compressing "lots of small
| files" describes an infinite variety of workloads. To achieve
| anything close to the benchmark you'd need to compress only
| small files, in a single directory, of a small average size.
| And even the contents of those files would have large
| implications for expected performance....
| iscoelho wrote:
| My comment is not making any claims about that. It's just
| a correction that filesystems with "81k 1KB files" are
| indeed common.
| viraptor wrote:
| That's basically any large source repo.
| Hello71 wrote:
| Also, 4.7 seconds to read 1345 MB in 81k files is suspiciously
| slow. On my six-year-old low/mid-range Intel 660p with Linux
| 6.8, tar -c /usr/lib >/dev/null with 2.4 GiB in 49k files takes
| about 1.25s cold and 0.32s warm. Of course, the sales pitch has
| no explanation of which hardware, software, parameters, or test
| procedures were used. I reckon tar was tested with cold cache
| and pack with warm cache, and both are basically benchmarking
| I/O speed.
| lilyball wrote:
| The footnotes at the bottom say
|
| > _Development machine with a two-year-old CPU and NVMe disk,
| using Windows with the NTFS file system. The differences are
| even greater on Linux using ext4. Value holds on an old HDD
| and one-core CPU._
|
| > _All corresponding official programs were used in an out-
| of-the-box configuration at the time of writing in a warm
| state._
| Kwpolska wrote:
| If I need to compress stuff, it's either to move a folder
| around (to places which may not have a niche compression tool,
| so ZIP wins) or to archive something long-term, where it can
| take a while to compress. I don't see the advantages of this,
| since the compression output size seems quite mediocre even if
| it's supposedly fast (compared to which implementations of the
| other formats?).
| crq-yml wrote:
| The web site behaves strangely on mobile and folds the text as I
| try to scroll around.
| 0x073 wrote:
| Yes, at least on mobile it's absolute madness.
| smartmic wrote:
| The whole thing makes sense to me and I can't see any major
| points of criticism in the design rationale. Some thoughts:
|
| * There is already a "native" SQLite3 container solution called
| Sqlar [0]; a minimal read sketch follows the list below.
| * Sqlite3 itself is certainly suitable as a base and I wouldn't
| worry about its future at all.
|
| * Pascal is also an interesting choice: it is not the hippest
| language nor a new kid on the block, but it offers its own
| advantages in being "boring" and "normal". I am thinking
| especially of the Lindy effect [1].
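|
| As that read sketch, against Sqlar's documented schema
| (sqlar(name, mode, mtime, sz, data); data is zlib-deflated
| whenever it is shorter than the original size sz), in Python:
|
|     import sqlite3, zlib
|
|     # Pull one member out of a .sqlar archive (placeholder name).
|     con = sqlite3.connect("file:archive.sqlar?mode=ro", uri=True)
|     name, sz, data = con.execute(
|         "SELECT name, sz, data FROM sqlar LIMIT 1").fetchone()
|     blob = zlib.decompress(data) if len(data) < sz else data
|     print(name, len(blob), "bytes")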
|
| All in all a nice surprise, and I am curious to see the future
| of Pack. After all, it can only succeed if it gains a stable,
| critical mass of supporters, on both the user and maintainer
| sides.
|
| [0]: https://sqlite.org/sqlar/doc/trunk/README.md
|
| [1]: https://en.wikipedia.org/wiki/Lindy_effect
| SyrupThinker wrote:
| Interesting, I've recently spent an unhealthy amount of time
| researching archival formats to build this same setup: SQLite
| with Zstd.
|
| My use case is extremely redundant data (specific website dumps
| + logs) that I want decently quick random access into, and I
| was unhappy with the access speed, quality/usability, or even
| the existence of libraries for several formats.
|
| Glancing over the code, this seems to use the following setup
| (a rough schema sketch follows the list):
| - Aggregate files
|
| - Chunk into blocks
|
| - Compress blocks of fixed size
|
| - Store file to chunk and chunk to block associations
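|
| In SQL terms, something like this (hypothetical names; the real
| Pack schema isn't published):
|
|     import sqlite3
|
|     con = sqlite3.connect(":memory:")
|     con.executescript("""
|         CREATE TABLE item(id INTEGER PRIMARY KEY, name TEXT);
|         -- fixed-size blocks of compressed bytes
|         CREATE TABLE block(id INTEGER PRIMARY KEY, data BLOB);
|         -- which slice of which block holds which part of a file
|         CREATE TABLE chunk(
|             item_id  INTEGER,
|             block_id INTEGER,
|             offset   INTEGER,  -- position in decompressed block
|             size     INTEGER
|         );
|     """)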
|
| What I did not see is a deduplication step for the chunks, or
| an attempt to group files (and, by extension, blocks) by
| similarity to improve compression.
|
| But I might have just missed that due to lack of familiarity with
| Pascal.
|
| For anyone interested in this strategy, take a look at ZPAQ [1]
| by Matt Mahoney, you might know him from the Hutter Prize
| competition [2] / Large Text Compression Benchmark. It takes 14th
| place with tuned parameters.
|
| There's also a maintained fork called zpaqfranz, but I ran into
| some issues like incorrect disk size estimates with it. For me
| the code was also sometimes hard to read due to being a mix of
| English and Italian. So your mileage may vary.
|
| [1]: http://mattmahoney.net/dc/zpaq.html
|
| [2]: http://prize.hutter1.net
|
| [3]: https://github.com/fcorbelli/zpaqfranz
| ericyd wrote:
| Website is quite annoying to use on mobile. Scrolling behaviors
| get interpreted as taps which close the container you're reading.
| throwaway67743 wrote:
| Also, if you dare to scroll up naturally after opening a
| container, it's interpreted as a refresh and the page redraws.
| It might be an awesome format, but the web design fail negates
| it entirely.
| conception wrote:
| Is anyone using this or SQLite archives for anything at scale?
| They always seemed like a good solution for certain scientific
| outputs. But data integrity is obviously a concern.
___________________________________________________________________
(page generated 2024-03-22 23:00 UTC)