[HN Gopher] OpenZL: An open source format-aware compression fram...
___________________________________________________________________
OpenZL: An open source format-aware compression framework
https://github.com/facebook/openzl
https://arxiv.org/abs/2510.03203
https://openzl.org/
Author : terrelln
Score : 189 points
Date : 2025-10-06 16:01 UTC (6 hours ago)
(HTM) web link (engineering.fb.com)
(TXT) w3m dump (engineering.fb.com)
| felixhandte wrote:
| In addition to the blog post, here are the other things we've
| published today:
|
| Code: https://github.com/facebook/openzl
|
| Documentation: https://openzl.org/
|
| White Paper: https://arxiv.org/abs/2510.03203
| dang wrote:
| We'll put those links in the toptext above.
| waustin wrote:
| This is such a leap forward it's hard to believe it's anything
| but magic.
| gmuslera wrote:
| I used to see as magic that the old original compression
| algorithms worked so well with generic text, without worrying
| about format, file type, structure or other things that could
| give hints of additional redundancy.
| wmf wrote:
| Compared to columnar databases this is more of an incremental
| improvement.
| kingstnap wrote:
| Wow, this sounds nuts. I want to try this on some large CSVs
| later today.
| felixhandte wrote:
| Let us know how it goes!
|
| We developed OpenZL initially for our own consumption at Meta.
| More recently we've been putting a lot of effort into making
| this a usable tool for people who, you know, didn't develop
| OpenZL. Your feedback is welcome!
| zzulus wrote:
| Meta's Nimble is natively integrated with OpenZL (a pre-OSS
| version), and is benefiting from it immensely.
| terrelln wrote:
| Yeah, backend compression in columnar data formats is a natural
| fit for OpenZL. Knowing that the data it is compressing is
| numeric, e.g. a column of i64 or float, allows for immediate
| wins over Zstandard.
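|
| To make that concrete, here is a stand-in illustration (plain
| Python with zlib as the generic backend, not OpenZL's API):
| packing a numeric column as native i64, and delta-encoding
| it, typically beats compressing its ASCII form.
|
|   import struct, zlib
|
|   values = [1_000_000 + 7 * i for i in range(100_000)]
|
|   as_text = ",".join(map(str, values)).encode()  # as in CSV
|   as_raw = struct.pack(f"<{len(values)}q", *values)
|   deltas = [values[0]] + [b - a
|                           for a, b in zip(values, values[1:])]
|   as_delta = struct.pack(f"<{len(deltas)}q", *deltas)
|
|   for name, buf in [("ascii", as_text), ("raw i64", as_raw),
|                     ("delta i64", as_delta)]:
|       print(name, len(zlib.compress(buf)))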
| felixhandte wrote:
| It was really hard to resist spilling the beans about OpenZL on
| this recent HN post about compressing genomic sequence data [0].
| It's a great example of the really simple transformations you can
| perform on data that can unlock significant compression
| improvements. OpenZL can perform that transformation internally
| (quite easily with SDDL!).
|
| [0] https://news.ycombinator.com/item?id=45223827
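|
| As a toy version of that idea (ordinary Python, not OpenZL):
| splitting a FASTA file into a header stream and a sequence
| stream lets each be compressed with a strategy suited to its
| contents.
|
|   import zlib
|
|   fasta = b">read1\nACGTACGTACGT\n>read2\nTTTTACGTACGT\n"
|
|   headers, seqs = [], []
|   for line in fasta.splitlines():
|       (headers if line.startswith(b">") else seqs).append(line)
|
|   whole = zlib.compress(fasta)
|   split = (zlib.compress(b"\n".join(headers))
|            + zlib.compress(b"\n".join(seqs)))
|   print(len(whole), len(split))  # split wins on real inputs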
| bede wrote:
| Author of [0] here. Congratulations and well done for
| resisting. Eager to try it!
|
| Edit: Have you any specific advice for training a FASTA
| compressor beyond that given in e.g. "Using OpenZL"
| (https://openzl.org/getting-started/using-openzl/)?
| perching_aix wrote:
| That post immediately came to my mind too! Do you maybe have a
| comparison to share with respect to the specialized compressor
| mentioned in the OP there?
|
| > Grace Blackwell's 2.6Tbp 661k dataset is a classic choice for
| benchmarking methods in microbial genomics. (...) Karel
| Brinda's specialist MiniPhy approach takes this dataset from
| 2.46TiB to just 27GiB (CR: 91) by clustering and compressing
| similar genomes together.
| Gethsemane wrote:
| I'd love to see some benchmarks for this on some common genomic
| formats (fa, fq, sam, vcf). Will be doubly interesting to see
| its applicability to nanopore data - lots of useful data is
| lost because storing FAST5/POD5 is a pain.
| jayknight wrote:
| And a comparison between CRAM and OpenZL on a SAM/BAM file.
| Is OpenZL indexable, i.e. can you extract and decompress just
| the data you need from a file if you know where it is?
| terrelln wrote:
| > Is openzl indexable
|
| Not today. However, we are considering this as we are
| continuing to evolve the frame format, and it is likely we
| will add this feature in the future.
| jltsiren wrote:
| OpenZL compressed SAM/BAM vs. CRAM is the interesting
| comparison. It would really test the flexibility of the
| framework. Can OpenZL reach the same level of compression,
| and how much effort does it take?
|
| I would not expect much improvement in compressing nanopore
| data. If you have a useful model of the data, creating a
| custom compressor is not that difficult. It takes some
| effort, but those formats are popular enough that compressors
| using the known models should already exist.
| terrelln wrote:
| Do you happen to have a pointer to a good open source
| dataset to look at?
|
| Naively, and knowing little about CRAM, I would expect that
| OpenZL would beat Zstd handily out of the box, but would need
| additional capabilities to match the performance of CRAM,
| since genomics hasn't been a focus as of yet. But it would be
| interesting to see how much of what we'd need to add is
| generic to all compression (but useful for genomics), vs.
| techniques that are specific only to genomics.
|
| We're planning on setting up a blog on our website to
| highlight use cases of OpenZL. I'd love to make a post
| about this.
| bede wrote:
| For BAM this could be a good place to start:
| https://www.htslib.org/benchmarks/CRAM.html
|
| Happy to discuss further
| fnands wrote:
| Cool, but what's the Weissman Score?
| bigwheels wrote:
| How do you use it to compress a directory (or .tar file)? Not
| seeing any example usages in the repo. `zli compress -o
| dir.tar.zl dir.tar` fails with: Invalid argument(s):
| No compressor profile or serialized compressor specified.
|
| Same thing for the `train` command.
|
| Edit: @terrelln Got it, thank you!
| terrelln wrote:
| There's a Quick Start guide here:
|
| https://openzl.org/getting-started/quick-start/
|
| However, OpenZL is different in that you need to tell the
| compressor how to compress your data. The CLI tool has a few
| builtin "profiles" which you can specify with the `--profile`
| argument. E.g. csv, parquet, or le-u64. They can be listed with
| `./zli list-profiles`.
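|
| Putting those together (illustrative only, combining the
| flags quoted in this thread; spellings may vary by version):
|
|   ./zli list-profiles
|   ./zli compress --profile csv -o data.csv.zl data.csv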
|
| You can always use the `serial` profile, but because you
| haven't told OpenZL anything about your data, it will just use
| Zstandard under the hood. Training can learn a compressor, but
| it won't be able to learn a format like `.tar` today.
|
| If you have raw numeric data you want to throw at it, or
| Parquets or large CSV files, that's where I would expect
| OpenZL to perform really well.
| ttoinou wrote:
| Is this similar to Basis ?
| https://github.com/BinomialLLC/basis_universal
| modeless wrote:
| No, not really. They are both cool but solve different
| problems. The problem Basis solves is that GPUs don't agree on
| which compressed texture formats to support in hardware. Basis
| is a single compressed format that can be transcoded to almost
| any of the formats GPUs support, which is faster and higher
| quality than e.g. decoding a JPEG and then re-encoding to a GPU
| format.
| ttoinou wrote:
| Thanks. I thought basis also had specific encoders depending
| on the typical average / nature of the data input, like this
| OpenZL project
| modeless wrote:
| It probably does have different modes that it selects based
| on the input data. I don't know that much about the
| implementation of image compression, but I know that PNG
| for example has several preprocessing modes that can be
| selected based on the image contents, which transform the
| data before entropy encoding for better results.
|
| The difference with OpenZL IIUC seems to be that it has
| some language that can flexibly describe a family of
| transformations, which can be serialized and included with
| the compressed data for the decoder to use. So instead of
| choosing between a fixed set of transformations built into
| the decoder ahead of time, as in PNG, you can apply
| arbitrary transformations (as long as they can be
| represented in their format).
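|
| For a rough sketch of the PNG idea (not OpenZL; PNG's "Sub"
| filter, approximated in Python): subtracting each byte from
| its left neighbor turns smooth gradients into runs of small
| values that entropy-code well.
|
|   import zlib
|
|   scanline = bytes(range(100, 200))  # a smooth gradient
|   sub = bytes([scanline[0]]) + bytes(
|       (scanline[i] - scanline[i - 1]) & 0xFF
|       for i in range(1, len(scanline)))
|   print(len(zlib.compress(scanline)), len(zlib.compress(sub)))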
| nunobrito wrote:
| Well, well. Kind of surprised that such a good tool wasn't
| made available a long time ago, since the approach is quite
| sound.
|
| When the data container is understood, deduplication is far
| more efficient because it can be targeted.
|
| Licensed as BSD-3-Clause, solid C++ implementation, well
| documented.
|
| Will be looking forward to seeing new developments as more
| file formats are contributed.
| mappu wrote:
| Specialization for file formats is not novel (e.g. 7-Zip uses
| BCJ2 prefiltering to convert the relative addresses in x86
| CALL/JMP instructions to absolute ones), nor is embedding
| specialized decoder bytecode in the archive (e.g. ZPAQ did
| this and won a lot of Matt Mahoney's benchmarks), but I think
| OpenZL's execution here, along with the data description and
| training system, is really fantastic.
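|
| A simplified sketch of that kind of BCJ filter (illustrative
| Python, encode side only): rewrite the 4-byte relative target
| after each 0xE8 CALL opcode into an absolute address, so
| repeated calls to one function become identical byte patterns.
|
|   def bcj_encode(code: bytes) -> bytes:
|       out = bytearray(code)
|       i = 0
|       while i + 5 <= len(out):
|           if out[i] == 0xE8:  # CALL rel32
|               rel = int.from_bytes(out[i+1:i+5], "little",
|                                    signed=True)
|               absolute = (i + 5 + rel) & 0xFFFFFFFF
|               out[i+1:i+5] = absolute.to_bytes(4, "little")
|               i += 5
|           else:
|               i += 1
|       return bytes(out)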
| nunobrito wrote:
| Thanks. I've enjoyed reading more about ZPAQ, but its main
| focus seems to be versioning (which is quite a useful feature
| too; I'll try it later), and it doesn't include specialized
| compression per context.
|
| Like you mention, the extensibility is quite something. In a
| few years we might see a very capable compressor.
| maeln wrote:
| So, as I understand it, you describe the structure of your
| data in an SDL and then the compressor can plan a strategy
| for how to best compress the various parts of the data?
|
| Honestly, this looks incredible. It could be an amazing
| general framework for compressing custom formats.
| terrelln wrote:
| Exactly! SDDL [0] provides a toolkit to do all of this with
| no code, but today it is pretty limited. We will be expanding
| its feature set, but in the meantime you can also write code
| in C++ or Python to parse your format. And this code is
| compression-side only, so the decompressor is agnostic to
| your format.
|
| [0] https://openzl.org/api/c/graphs/sddl/
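|
| A toy example of the kind of compression-side parse you might
| write (plain Python, zlib standing in for the backend): split
| fixed-width records into per-field streams before compressing.
|
|   import struct, zlib
|
|   recs = b"".join(struct.pack("<IH", i, i % 7)
|                   for i in range(10_000))
|
|   ids = b"".join(recs[i:i+4] for i in range(0, len(recs), 6))
|   tags = b"".join(recs[i+4:i+6]
|                   for i in range(0, len(recs), 6))
|
|   print(len(zlib.compress(recs)),
|         len(zlib.compress(ids + tags)))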
| maeln wrote:
| Now I cannot stop thinking about how I can fit this somewhere
| in my work, hehe. Zstandard already blew me away when it was
| released, and this is just more crazy work. And being able to
| access this kind of state-of-the-art algorithm for free, open
| source, is the oh-so-sweet cherry on top.
| d33 wrote:
| I've recently been wondering: could you re-compress gzip to a
| better compression format, while keeping all instructions that
| would let you recover a byte-exact copy of the original file? I
| often work with huge gzip files and they're a pain to work with,
| because decompression is slow even with zlib-ng.
| artemisart wrote:
| I may be misunderstanding the question, but that should just
| be decompressing the gzip and compressing with something
| better like zstd (saving the gzip options so you can compress
| it back). However, it won't avoid the cost of compressing and
| decompressing gzip.
| mappu wrote:
| precomp/antix/... are tools that can brute-force the original
| gzip parameters and let you recreate the byte-identical gzip
| archive.
|
| The output is something like {precomp header}{gzip
| parameters}{original uncompressed data} which you can then feed
| to a stronger compressor.
|
| A major use case is if you have a lot of individually gzipped
| archives with similar internal content, you can precomp them
| and then use long-range solid compression over all your
| archives together for massive space savings.
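|
| The core trick, sketched for a bare zlib stream (real tools
| also handle gzip headers and foreign deflate encoders):
| search for settings that reproduce the stream byte-for-byte,
| then store (settings, raw data) and recompress the raw data
| with a stronger codec.
|
|   import zlib
|
|   def find_zlib_level(blob: bytes, raw: bytes):
|       for level in range(10):
|           if zlib.compress(raw, level) == blob:
|               return level
|       return None  # made by a different deflate encoder
|
|   raw = b"example payload " * 4096
|   blob = zlib.compress(raw, 6)
|   print(find_zlib_level(blob, raw))  # -> 6; store (6, raw)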
| dist-epoch wrote:
| Is this useful for highly repetitive JSON data? Something like
| stock prices for example, one JSON per line.
|
| Unclear if this has enough "structure" for OpenZL.
| wmf wrote:
| Maybe convert to BSON first then compress it.
| terrelln wrote:
| You'd have to tell OpenZL what your format looks like by
| writing a tokenizer for it and annotating which parts are
| which. We aim to make this easier with SDDL [0], but today it
| is not powerful enough to parse JSON. However, you can do
| that in C++ or Python.
|
| Additionally, OpenZL works well on numeric data in native
| format, but JSON stores numbers in ASCII. We can transform
| ASCII integers into int64 data losslessly, but it is very
| hard to transform ASCII floats into doubles losslessly and
| reliably.
|
| However, given the work to parse the data (and/or massage it to
| a more friendly format), I would expect that OpenZL would work
| very well. Highly repetitive, numeric data with a lot of
| structure is where OpenZL excels.
|
| [0] https://openzl.org/api/c/graphs/sddl/
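|
| The asymmetry in one snippet (plain Python): integers have a
| canonical decimal form (modulo details like leading zeros you
| can flag), while many distinct ASCII spellings collapse to
| the same double.
|
|   print(str(int("12345")) == "12345")    # ints round-trip
|   print(str(int("012345")) == "012345")  # needs a side flag
|   print(float("0.1") == float("0.10"))   # one double, two
|                                          # spellings; printing
|                                          # can't recover which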
| michalsustr wrote:
| Are you thinking about adding stream support? I.e. something
| along the lines of (i) build up an efficient vocabulary up
| front for the whole data, then (ii) compress by chunks, so it
| can be decompressed by chunks as well. This is important for
| seeking in data and for stream processing.
| felixhandte wrote:
| Yes, definitely! Chunking support is currently in development.
| Streaming and seeking and so on are features we will certainly
| pursue as we mature towards an eventual v1.0.0.
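|
| For readers unfamiliar with the idea, this is the general
| shape of seekable chunking (a generic Python sketch, not
| OpenZL's frame format): compress chunks independently and
| keep an offset index so one chunk can be decoded alone.
|
|   import zlib
|
|   CHUNK = 1 << 16
|
|   def compress_seekable(data):
|       chunks = [zlib.compress(data[i:i+CHUNK])
|                 for i in range(0, len(data), CHUNK)]
|       offsets, off = [], 0
|       for c in chunks:
|           offsets.append(off)
|           off += len(c)
|       return b"".join(chunks), offsets
|
|   def read_chunk(blob, offsets, n):
|       end = (offsets[n+1] if n + 1 < len(offsets)
|              else len(blob))
|       return zlib.decompress(blob[offsets[n]:end])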
| michalsustr wrote:
| Great! Apache Arrow IPC is the most sensible stream-data
| format I've come across. Headers come first, so you learn up
| front what data you're working with; it's columnar, for good
| SIMD and compression; and deeply nested data structures are
| supported. It might serve as an inspiration.
| TheMode wrote:
| I understand it cannot work well on random text files, but
| would it support structured text, like .c, .java, or even
| JSON?
| jmakov wrote:
| Wonder how it compares to zstd-9, since they only mention
| zstd-3.
| terrelln wrote:
| The charts in the "Results With OpenZL" section compare against
| all levels of zstd, xz, and zlib.
|
| On highly structured data where OpenZL is able to understand
| the format, it blows Zstandard and xz out of the water.
| However, not all data fits this bill.
| jmakov wrote:
| Couldn't the input be automatically described/guessed using a
| few rows of data and an LLM?
| terrelln wrote:
| You could have an LLM generate the SDDL description [0] for
| you, or even have it write a C++ or Python tokenizer. If
| compression succeeds, then it is guaranteed to round trip, as
| the LLM-generated logic lives only on the compression side, and
| the decompressor is agnostic to it.
|
| It could be a problem that is well-suited to machine
| learning, as there is a clear objective function: did
| compression succeed, and, if so, what is the compressed size?
|
| [0] https://openzl.org/api/c/graphs/sddl/
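|
| The search loop that enables, in miniature (zlib standing in
| for the backend; the round-trip check is explicit here, where
| OpenZL gives it to you by construction):
|
|   import zlib
|
|   def score(parse, unparse, data: bytes):
|       streams = parse(data)
|       if unparse(streams) != data:  # hard constraint
|           return None  # candidate rejected
|       return sum(len(zlib.compress(s)) for s in streams)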
| stepanhruda wrote:
| Is there a way to use this with blosc?
| yubblegum wrote:
| Couldn't find in the paper a description of how the DAG itself is
| encoded. Any ideas?
| terrelln wrote:
| We left it out of the paper because it is an implementation
| detail that is absolutely going to change as we evolve the
| format. This is the function that actually does it [0], but
| there really isn't anything special there. There are some
| bit-packing tricks to save some bits, but nothing crazy.
|
| Down the line, we expect to improve this representation to
| shrink it further, which is important for small data, and to
| allow moving this representation, or parts of it, into a
| dictionary for tiny data.
|
| [0]
| https://github.com/facebook/openzl/blob/d1f05d0aa7b8d80627e5...
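|
| (For the curious, one of the usual bit-packing tricks for
| metadata like this, sketched generically in Python rather
| than as OpenZL's actual encoding: LEB128-style varints, so
| small node and edge IDs take one byte instead of four.)
|
|   def write_varint(n: int) -> bytes:
|       out = bytearray()
|       while True:
|           out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
|           n >>= 7
|           if not n:
|               return bytes(out)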
| yubblegum wrote:
| Thanks! (Super cool idea btw.)
___________________________________________________________________
(page generated 2025-10-06 23:00 UTC)