[HN Gopher] OpenZL: An open source format-aware compression fram...
___________________________________________________________________
OpenZL: An open source format-aware compression framework
https://github.com/facebook/openzl
https://arxiv.org/abs/2510.03203
https://openzl.org/
Author : terrelln
Score : 189 points
Date : 2025-10-06 16:01 UTC (6 hours ago)
(HTM) web link (engineering.fb.com)
(TXT) w3m dump (engineering.fb.com)
| felixhandte wrote:
| In addition to the blog post, here are the other things we've
| published today:
|
| Code: https://github.com/facebook/openzl
|
| Documentation: https://openzl.org/
|
| White Paper: https://arxiv.org/abs/2510.03203
| dang wrote:
| We'll put those links in the toptext above.
| waustin wrote:
| This is such a leap forward it's hard to believe it's anything
| but magic.
| gmuslera wrote:
| I used to see as magic that the old original compression
| algorithms worked so well with generic text, without worrying
| about format, file type, structure or other things that could
| give hints of additional redundancy.
| wmf wrote:
| Compared to columnar databases this is more of an incremental
| improvement.
| kingstnap wrote:
| Wow, this sounds nuts. I want to try this on some large CSVs
| later today.
| felixhandte wrote:
| Let us know how it goes!
|
| We developed OpenZL initially for our own consumption at Meta.
| More recently we've been putting a lot of effort into making
| this a usable tool for people who, you know, didn't develop
| OpenZL. Your feedback is welcome!
| zzulus wrote:
| Meta's Nimble is natively integrated with OpenZL (a pre-OSS
| version), and is benefiting from it immensely.
| terrelln wrote:
| Yeah, backend compression in columnar data formats is a natural
| fit for OpenZL. Knowing that the data it is compressing is
| numeric, e.g. a column of i64 or float, allows for immediate
| wins over Zstandard.
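|
| To make that concrete, here is a stand-in illustration (plain
| Python with zlib as the generic backend, not OpenZL's API):
| packing a numeric column as native i64, and delta-encoding
| it, typically beats compressing its ASCII form.
|
|   import struct, zlib
|
|   values = [1_000_000 + 7 * i for i in range(100_000)]
|
|   as_text = ",".join(map(str, values)).encode()  # as in CSV
|   as_raw = struct.pack(f"<{len(values)}q", *values)
|   deltas = [values[0]] + [b - a
|                           for a, b in zip(values, values[1:])]
|   as_delta = struct.pack(f"<{len(deltas)}q", *deltas)
|
|   for name, buf in [("ascii", as_text), ("raw i64", as_raw),
|                     ("delta i64", as_delta)]:
|       print(name, len(zlib.compress(buf)))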
| felixhandte wrote:
| It was really hard to resist spilling the beans about OpenZL on
| this recent HN post about compressing genomic sequence data [0].
| It's a great example of the really simple transformations you can
| perform on data that can unlock significant compression
| improvements. OpenZL can perform that transformation internally
| (quite easily with SDDL!).
|
| [0] https://news.ycombinator.com/item?id=45223827
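|
| As a toy version of that idea (ordinary Python, not OpenZL):
| splitting a FASTA file into a header stream and a sequence
| stream lets each be compressed with a strategy suited to its
| contents.
|
|   import zlib
|
|   fasta = b">read1\nACGTACGTACGT\n>read2\nTTTTACGTACGT\n"
|
|   headers, seqs = [], []
|   for line in fasta.splitlines():
|       (headers if line.startswith(b">") else seqs).append(line)
|
|   whole = zlib.compress(fasta)
|   split = (zlib.compress(b"\n".join(headers))
|            + zlib.compress(b"\n".join(seqs)))
|   print(len(whole), len(split))  # split wins on real inputs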
| bede wrote:
| Author of [0] here. Congratulations and well done for
| resisting. Eager to try it!
|
| Edit: Have you any specific advice for training a FASTA
| compressor beyond that given in e.g. "Using OpenZL"
| (https://openzl.org/getting-started/using-openzl/)?
| perching_aix wrote:
| That post immediately came to my mind too! Do you maybe have a
| comparison to share with respect to the specialized compressor
| mentioned in the OP there?
|
| > Grace Blackwell's 2.6Tbp 661k dataset is a classic choice for
| benchmarking methods in microbial genomics. (...) Karel
| Brinda's specialist MiniPhy approach takes this dataset from
| 2.46TiB to just 27GiB (CR: 91) by clustering and compressing
| similar genomes together.
| Gethsemane wrote:
| I'd love to see some benchmarks for this on some common genomic
| formats (fa, fq, sam, vcf). Will be doubly interesting to see
| its applicability to nanopore data - lots of useful data is
| lost because storing FAST5/POD5 is a pain.
| jayknight wrote:
| And a comparison between CRAM and OpenZL on a SAM/BAM file.
| Is OpenZL indexable, i.e. can you extract and decompress just
| the data you need from a file if you know where it is?
| terrelln wrote:
| > Is openzl indexable
|
| Not today. However, we are considering this as we are
| continuing to evolve the frame format, and it is likely we
| will add this feature in the future.
| jltsiren wrote:
| OpenZL compressed SAM/BAM vs. CRAM is the interesting
| comparison. It would really test the flexibility of the
| framework. Can OpenZL reach the same level of compression,
| and how much effort does it take?
|
| I would not expect much improvement in compressing nanopore
| data. If you have a useful model of the data, creating a
| custom compressor is not that difficult. It takes some
| effort, but those formats are popular enough that compressors
| using the known models should already exist.
| terrelln wrote:
| Do you happen to have a pointer to a good open source
| dataset to look at?
|
| Naively, and knowing little about CRAM, I would expect that
| OpenZL would beat Zstd handily out of the box, but would need
| additional capabilities to match the performance of CRAM,
| since genomics hasn't been a focus as of yet. But it would be
| interesting to see how much of what we'd need to add is
| generic to all compression (but useful for genomics), vs.
| techniques that are specific only to genomics.
|
| We're planning on setting up a blog on our website to
| highlight use cases of OpenZL. I'd love to make a post
| about this.
| bede wrote:
| For BAM this could be a good place to start:
| https://www.htslib.org/benchmarks/CRAM.html
|
| Happy to discuss further
| fnands wrote:
| Cool, but what's the Weissman Score?
| bigwheels wrote:
| How do you use it to compress a directory (or .tar file)? Not
| seeing any example usages in the repo. `zli compress -o
| dir.tar.zl dir.tar` fails with: Invalid argument(s):
| No compressor profile or serialized compressor specified.
|
| Same thing for the `train` command.
|
| Edit: @terrelln Got it, thank you!
| terrelln wrote:
| There's a Quick Start guide here:
|
| https://openzl.org/getting-started/quick-start/
|
| However, OpenZL is different in that you need to tell the
| compressor how to compress your data. The CLI tool has a few
| builtin "profiles" which you can specify with the `--profile`
| argument. E.g. csv, parquet, or le-u64. They can be listed with
| `./zli list-profiles`.
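|
| Putting those together (illustrative only, combining the
| flags quoted in this thread; spellings may vary by version):
|
|   ./zli list-profiles
|   ./zli compress --profile csv -o data.csv.zl data.csv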
|
| You can always use the `serial` profile, but because you
| haven't told OpenZL anything about your data, it will just use
| Zstandard under the hood. Training can learn a compressor, but
| it won't be able to learn a format like `.tar` today.
|
| If you have raw numeric data you want to throw at it, or
| Parquets or large CSV files, that's where I would expect
| OpenZL to perform really well.
| ttoinou wrote:
| Is this similar to Basis ?
| https://github.com/BinomialLLC/basis_universal
| modeless wrote:
| No, not really. They are both cool but solve different
| problems. The problem Basis solves is that GPUs don't agree on
| which compressed texture formats to support in hardware. Basis
| is a single compressed format that can be transcoded to almost
| any of the formats GPUs support, which is faster and higher
| quality than e.g. decoding a JPEG and then re-encoding to a GPU
| format.
| ttoinou wrote:
| Thanks. I thought basis also had specific encoders depending
| on the typical average / nature of the data input, like this
| OpenZL project
| modeless wrote:
| It probably does have different modes that it selects based
| on the input data. I don't know that much about the
| implementation of image compression, but I know that PNG
| for example has several preprocessing modes that can be
| selected based on the image contents, which transform the
| data before entropy encoding for better results.
|
| The difference with OpenZL IIUC seems to be that it has
| some language that can flexibly describe a family of
| transformations, which can be serialized and included with
| the compressed data for the decoder to use. So instead of
| choosing between a fixed set of transformations built into
| the decoder ahead of time, as in PNG, you can apply
| arbitrary transformations (as long as they can be
| represented in their format).
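|
| For a rough sketch of the PNG idea (not OpenZL; PNG's "Sub"
| filter, approximated in Python): subtracting each byte from
| its left neighbor turns smooth gradients into runs of small
| values that entropy-code well.
|
|   import zlib
|
|   scanline = bytes(range(100, 200))  # a smooth gradient
|   sub = bytes([scanline[0]]) + bytes(
|       (scanline[i] - scanline[i - 1]) & 0xFF
|       for i in range(1, len(scanline)))
|   print(len(zlib.compress(scanline)), len(zlib.compress(sub)))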
| nunobrito wrote:
| Well, well. Kind of surprised that such a good tool wasn't
| made available a long time ago, since the approach is quite
| sound.
|
| When the data container is understood, deduplication is far
| more efficient because it can be targeted.
|
| Licensed as BSD-3-Clause, solid C++ implementation, well
| documented.
|
| Will be looking forward to seeing new developments as more
| file formats are contributed.
| mappu wrote:
| Specialization for file formats is not novel (e.g. 7-Zip uses
| BCJ2 prefiltering to convert the relative addresses in x86
| CALL/JMP instructions to absolute ones), nor is embedding
| specialized decoder bytecode in the archive (e.g. ZPAQ did
| this and won a lot of Matt Mahoney's benchmarks), but I think
| OpenZL's execution here, along with the data description and
| training system, is really fantastic.
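|
| A simplified sketch of that kind of BCJ filter (illustrative
| Python, encode side only): rewrite the 4-byte relative target
| after each 0xE8 CALL opcode into an absolute address, so
| repeated calls to one function become identical byte patterns.
|
|   def bcj_encode(code: bytes) -> bytes:
|       out = bytearray(code)
|       i = 0
|       while i + 5 <= len(out):
|           if out[i] == 0xE8:  # CALL rel32
|               rel = int.from_bytes(out[i+1:i+5], "little",
|                                    signed=True)
|               absolute = (i + 5 + rel) & 0xFFFFFFFF
|               out[i+1:i+5] = absolute.to_bytes(4, "little")
|               i += 5
|           else:
|               i += 1
|       return bytes(out)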
| nunobrito wrote:
| Thanks. I've enjoyed reading more about ZPAQ, but its main
| focus seems to be versioning (which is quite a useful feature
| too; I'll try it later), and it doesn't include specialized
| compression per context.
|
| Like you mention, the extensibility is quite something. In a
| few years we might see a very capable compressor.
| maeln wrote:
| So, as I understand it, you describe the structure of your
| data in an SDL and then the compressor can plan a strategy
| for how to best compress the various parts of the data?
|
| Honestly, this looks incredible. It could be an amazing
| general framework for compressing custom formats.
| terrelln wrote:
| Exactly! SDDL [0] provides a toolkit to do all of this with
| no code, but today it is pretty limited. We will be expanding
| its feature set, but in the meantime you can also write code
| in C++ or Python to parse your format. And this code is
| compression-side only, so the decompressor is agnostic to
| your format.
|
| [0] https://openzl.org/api/c/graphs/sddl/
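|
| A toy example of the kind of compression-side parse you might
| write (plain Python, zlib standing in for the backend): split
| fixed-width records into per-field streams before compressing.
|
|   import struct, zlib
|
|   recs = b"".join(struct.pack("<IH", i, i % 7)
|                   for i in range(10_000))
|
|   ids = b"".join(recs[i:i+4] for i in range(0, len(recs), 6))
|   tags = b"".join(recs[i+4:i+6]
|                   for i in range(0, len(recs), 6))
|
|   print(len(zlib.compress(recs)),
|         len(zlib.compress(ids + tags)))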
| maeln wrote:
| Now I cannot stop thinking about how I can fit this somewhere
| in my work, hehe. Zstandard already blew me away when it was
| released, and this is just more crazy work. And being able to
| access this kind of state-of-the-art algorithm for free, open
| source, is the oh-so-sweet cherry on top.
| d33 wrote:
| I've recently been wondering: could you re-compress gzip to a
| better compression format, while keeping all instructions that
| would let you recover a byte-exact copy of the original file? I
| often work with huge gzip files and they're a pain to work with,
| because decompression is slow even with zlib-ng.
| artemisart wrote:
| I may be misunderstanding the question, but that should just
| be decompressing the gzip and compressing with something
| better like zstd (saving the gzip options so you can compress
| it back). However, it won't avoid the cost of compressing and
| decompressing gzip.
| mappu wrote:
| precomp/antix/... are tools that can brute-force the original
| gzip parameters and let you recreate the byte-identical gzip
| archive.
|
| The output is something like {precomp header}{gzip
| parameters}{original uncompressed data} which you can then feed
| to a stronger compressor.
|
| A major use case is if you have a lot of individually gzipped
| archives with similar internal content, you can precomp them
| and then use long-range solid compression over all your
| archives together for massive space savings.
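|
| The core trick, sketched for a bare zlib stream (real tools
| also handle gzip headers and foreign deflate encoders):
| search for settings that reproduce the stream byte-for-byte,
| then store (settings, raw data) and recompress the raw data
| with a stronger codec.
|
|   import zlib
|
|   def find_zlib_level(blob: bytes, raw: bytes):
|       for level in range(10):
|           if zlib.compress(raw, level) == blob:
|               return level
|       return None  # made by a different deflate encoder
|
|   raw = b"example payload " * 4096
|   blob = zlib.compress(raw, 6)
|   print(find_zlib_level(blob, raw))  # -> 6; store (6, raw)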
| dist-epoch wrote:
| Is this useful for highly repetitive JSON data? Something like
| stock prices for example, one JSON per line.
|
| Unclear if this has enough "structure" for OpenZL.
| wmf wrote:
| Maybe convert to BSON first then compress it.
| terrelln wrote:
| You'd have to tell OpenZL what your format looks like by
| writing a tokenizer for it and annotating which parts are
| which. We aim to make this easier with SDDL [0], but today it
| is not powerful enough to parse JSON. However, you can do
| that in C++ or Python.
|
| Additionally, OpenZL works well on numeric data in native
| format, but JSON stores numbers in ASCII. We can transform
| ASCII integers into int64 data losslessly, but it is very
| hard to transform ASCII floats into doubles losslessly and
| reliably.
|
| However, given the work to parse the data (and/or massage it to
| a more friendly format), I would expect that OpenZL would work
| very well. Highly repetitive, numeric data with a lot of
| structure is where OpenZL excels.
|
| [0] https://openzl.org/api/c/graphs/sddl/
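|
| The asymmetry in one snippet (plain Python): integers have a
| canonical decimal form (modulo details like leading zeros you
| can flag), while many distinct ASCII spellings collapse to
| the same double.
|
|   print(str(int("12345")) == "12345")    # ints round-trip
|   print(str(int("012345")) == "012345")  # needs a side flag
|   print(float("0.1") == float("0.10"))   # one double, two
|                                          # spellings; printing
|                                          # can't recover which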
| michalsustr wrote:
| Are you thinking about adding stream support? I.e. something
| along the lines of (i) build up an efficient vocabulary up
| front for the whole data, then (ii) compress by chunks, so it
| can be decompressed by chunks as well. This is important for
| seeking in data and for stream processing.
| felixhandte wrote:
| Yes, definitely! Chunking support is currently in development.
| Streaming and seeking and so on are features we will certainly
| pursue as we mature towards an eventual v1.0.0.
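|
| For readers unfamiliar with the idea, this is the general
| shape of seekable chunking (a generic Python sketch, not
| OpenZL's frame format): compress chunks independently and
| keep an offset index so one chunk can be decoded alone.
|
|   import zlib
|
|   CHUNK = 1 << 16
|
|   def compress_seekable(data):
|       chunks = [zlib.compress(data[i:i+CHUNK])
|                 for i in range(0, len(data), CHUNK)]
|       offsets, off = [], 0
|       for c in chunks:
|           offsets.append(off)
|           off += len(c)
|       return b"".join(chunks), offsets
|
|   def read_chunk(blob, offsets, n):
|       end = (offsets[n+1] if n + 1 < len(offsets)
|              else len(blob))
|       return zlib.decompress(blob[offsets[n]:end])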
| michalsustr wrote:
| Great! Apache Arrow IPC is the most sensible stream-data
| format I've come across. Headers come first, so you learn up
| front what data you're working with; it's columnar, for good
| SIMD and compression; and deeply nested data structures are
| supported. It might serve as an inspiration.
| TheMode wrote:
| I understand it cannot work well on random text files, but
| would it support structured text, like .c, .java, or even
| JSON?
| jmakov wrote:
| Wonder how it compares to zstd-9, since they only mention
| zstd-3.
| terrelln wrote:
| The charts in the "Results With OpenZL" section compare against
| all levels of zstd, xz, and zlib.
|
| On highly structured data where OpenZL is able to understand
| the format, it blows Zstandard and xz out of the water.
| However, not all data fits this bill.
| jmakov wrote:
| Couldn't the input be automatically described/guessed using a
| few rows of data and an LLM?
| terrelln wrote:
| You could have an LLM generate the SDDL description [0] for
| you, or even have it write a C++ or Python tokenizer. If
| compression succeeds, then it is guaranteed to round trip, as
| the LLM-generated logic lives only on the compression side, and
| the decompressor is agnostic to it.
|
| It could be a problem that is well-suited to machine
| learning, as there is a clear objective function: did
| compression succeed, and, if so, what is the compressed size?
|
| [0] https://openzl.org/api/c/graphs/sddl/
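|
| The search loop that enables, in miniature (zlib standing in
| for the backend; the round-trip check is explicit here, where
| OpenZL gives it to you by construction):
|
|   import zlib
|
|   def score(parse, unparse, data: bytes):
|       streams = parse(data)
|       if unparse(streams) != data:  # hard constraint
|           return None  # candidate rejected
|       return sum(len(zlib.compress(s)) for s in streams)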
| stepanhruda wrote:
| Is there a way to use this with blosc?
| yubblegum wrote:
| Couldn't find in the paper a description of how the DAG itself is
| encoded. Any ideas?
| terrelln wrote:
| We left it out of the paper because it is an implementation
| detail that is absolutely going to change as we evolve the
| format. This is the function that actually does it [0], but
| there really isn't anything special there. There are some
| bit-packing tricks to save some bits, but nothing crazy.
|
| Down the line, we expect to improve this representation to
| shrink it further, which is important for small data, and to
| allow moving this representation, or parts of it, into a
| dictionary for tiny data.
|
| [0]
| https://github.com/facebook/openzl/blob/d1f05d0aa7b8d80627e5...
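|
| (For the curious, one of the usual bit-packing tricks for
| metadata like this, sketched generically in Python rather
| than as OpenZL's actual encoding: LEB128-style varints, so
| small node and edge IDs take one byte instead of four.)
|
|   def write_varint(n: int) -> bytes:
|       out = bytearray()
|       while True:
|           out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
|           n >>= 7
|           if not n:
|               return bytes(out)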
| yubblegum wrote:
| Thanks! (Super cool idea btw.)
___________________________________________________________________
(page generated 2025-10-06 23:00 UTC)