[HN Gopher] pigz: A parallel implementation of gzip for multi-co...
___________________________________________________________________
pigz: A parallel implementation of gzip for multi-core machines
Author : firloop
Score : 113 points
Date : 2022-10-17 19:19 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| sitkack wrote:
| If you really want to enable all cores for compression and
| decompression, give pbzip2 a try. pigz isn't as parallel as
| pbzip2.
|
| http://compression.ca/pbzip2/
|
| *edit, as ac29 mentions below, just use zstdmt. In my quick
| testing it is approximately 8x faster than pbzip2 and gives
| better compression ratios. Wall clock time went from 41s to 3.5s
| for a 3.6GB tar of source, pdfs and images AND the resulting file
| was smaller.
|
|       megs
|       3781  test.tar
|       3041  test.tar.zstd  (default compression 3, 3.5s)
|       3170  test.tar.bz2   (default compression, 8 threads, 40s)
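|
| For anyone who wants to reproduce something like this, the
| comparison boils down to (a sketch; test.tar stands in for the
| 3.6GB tarball above):
|
|       zstd -3 -T0 test.tar     # all cores, default level -> test.tar.zst
|       pbzip2 -k test.tar       # all cores, default level -> test.tar.bz2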
| walrus01 wrote:
| on the other hand, bzip2 is pretty much obsoleted now by xzip
| booi wrote:
| What is xzip? are you talking about xz?
| walrus01 wrote:
| yes, xz
|
| section 3.6 here
|
| https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Ma
| r...
|
| https://en.wikipedia.org/wiki/XZ_Utils
| chasil wrote:
| The author of lzip has written pointed criticism of the design
| choices of xz.
|
| I generally use lzip for data that is important to me.
|
| https://www.nongnu.org/lzip/xz_inadequate.html
| ac29 wrote:
| bzip2 is very very slow though. Some types of data compress
| quite well with bzip, but if high compression is needed, xz is
| usually as good or better and natively has multithreading
| available.
|
| For everything else, there's zstd (also natively multithreaded).
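|
| The flags, in case anyone hasn't used them (a sketch; big.tar is
| just a placeholder):
|
|       xz -T0 -k big.tar        # all cores, keep the input
|       zstd -T0 -19 big.tar     # all cores, high compression level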
| sitkack wrote:
| Interesting https://docs.rs/zstd/latest/zstd/stream/write/str
| uct.Encoder...
| iruoy wrote:
| Decompression is multithreaded by default; compression needs an
| argument. Either way, it's built in.
| ericbarrett wrote:
| We used this to great effect at Facebook for MySQL backups in the
| early 2010s. The backup hosts had far more CPU than needed so it
| was a very nice speed-up over gzip. Eventually we switched to
| zstd, of course, but pigz never failed us.
| mackman wrote:
| Hey Eric! Hope you're well!
| antisthenes wrote:
| Same, except we were at a small e-commerce boutique running
| Magento circa 2011-2013.
|
| SQL backups were simply a bash script using Pigz, running on a
| cron job. Simple times!
| Xorlev wrote:
| Pretty similar to that, we used pigz and netcat to bring up new
| MySQL read replicas in a chain at line speeds.
|
| I recall learning the technique from Tumblr's eng blog.
|
| https://engineering.tumblr.com/post/7658008285/efficiently-c...
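|
| The general shape of it, for the curious (a rough sketch, not our
| exact commands; host names, ports and paths are made up, and
| netcat flags vary by flavour):
|
|       # on the new replica: listen, decompress, unpack
|       nc -l -p 9999 | pigz -d | tar -xf - -C /var/lib/mysql
|
|       # on the donor, with the datadir quiesced or snapshotted
|       tar -cf - -C /var/lib/mysql . | pigz | nc new-replica 9999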
| evanelias wrote:
| I wrote that Tumblr eng blog post, glad to see it's still
| making the rounds! I later joined FB's mysql team a few years
| after that, although I can't quite remember if FB was still
| using pigz by that time. (also, hi Eric!)
|
| Separately, at Tumblr I vaguely remember examining some
| alternative to pigz that was consistently faster at the time
| (11 years ago) because pigz couldn't parallelize
| decompression. Can't quite remember the name of the
| alternative, but it had licensing restrictions which made it
| less attractive than pigz.
|
| Edit: the old fast alternative I was thinking of is qpress,
| formerly hosted at http://www.quicklz.com/ but that's no
| longer online. Googling it now, there are some mirrors and
| also looks like Percona tools used/bundled it. Not sure if
| they still do or if they've since switched to zstd.
| walrus01 wrote:
| Would not recommend using this in 2022, use zstandard or xzip
| instead.
|
| zstandard is faster and compresses slightly better at speed
| settings equivalent to gzip, and it can optionally compress at a
| much greater ratio if you allow it more time and CPU.
|
| https://gregoryszorc.com/blog/2017/03/07/better-compression-...
| dspillett wrote:
| pigz has the advantage of producing output that can be read by
| standard gzip processing tools (including, of course,
| gzip/gunzip), which are available by default on just about
| every OS out there so you get the faster archive creation speed
| without adding requirements to those who might be accessing the
| results later.
|
| It works because gzip streams can be tacked together into a
| single stream: at the start of each block is an instruction to
| reset the compression dictionary as if it were the start of a
| file/stream (which in practise it is), so you just have to
| concatenate the parts coming out of the parallel threads in the
| right order. These resets cause a drop in overall compression
| ratio, but it is small and can be minimised by using large
| enough blocks.
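|
| You can see the property with stock tools: two gzip members
| concatenated into one file still decompress as a single stream
| (a toy example, not how pigz itself builds its output):
|
|       printf 'hello ' | gzip -c  > both.gz
|       printf 'world\n' | gzip -c >> both.gz
|       gunzip -c both.gz          # prints "hello world"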
| walrus01 wrote:
| yes, one consideration is whether you're creating archives
| for your own later use, or internal use where you also have
| zstandard and xz handling tools. Or to send somewhere else
| for wider use on unknown platforms.
| dspillett wrote:
| Aye, pick the right tool for the target audience. If you
| are the target or you know everyone else who needs to read
| the output will have the ability to read zstd, go with
| that. If not, consider pigz. If writing a script that others
| may run, have it default to gzip but use pigz if available
| (unless you really can't accept that small % drop in
| compression).
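|
| In a script that fallback can be a one-liner (names illustrative):
|
|       GZ=$(command -v pigz || command -v gzip)
|       tar -cf - ./data | "$GZ" > data.tar.gz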
| ananonymoususer wrote:
| I use this all the time. It's a big time saver on multi-core
| machines (which is pretty much every desktop made in the past 20
| years). It's available in all the repos, but not included by
| default (at least in Ubuntu/Mint). It is most useful for
| compressing disk images on-the-fly while backing them up to
| network storage. It's usually a good idea to zero unused space
| first:
|
| (unprivileged commands follow)
|
| dd if=/dev/zero of=~/zeros bs=1M; sync; rm ~/zeros
|
| Compressing on the fly can be slower than your network link,
| depending on your network speed, your processor(s) speed, and the
| compression level, so you typically tune the compression level
| (the other two variables are not so easy to change).
| Example backup:
|
| (privileged commands follow)
|
| pv < /dev/sda | pigz -9 | ssh user@remote.system dd
| of=compressed.sda.gz bs=1M
|
| (Note that on slower systems the ssh encryption can also slow
| things down.)
|
| Some sharp people may notice that it's not necessarily a good
| idea to back up a live system this way because the filesystem is
| changing while the system runs. It's usually just fine on an
| unloaded system that uses a journaling filesystem.
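|
| For completeness, the restore direction is the same pipeline
| reversed, and plain gunzip can read pigz output (privileged, and
| the device name is illustrative):
|
|       ssh user@remote.system 'cat compressed.sda.gz' \
|         | gunzip -c | dd of=/dev/sda bs=1M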
| CGamesPlay wrote:
| Alternative way of zeroing unused space without consuming all
| disk space:
| https://manpages.ubuntu.com/manpages/trusty/man8/zerofree.8....
| rcarmo wrote:
| I chuckled at the name, since out-of-order results are a typical
| output of parallelization. Kudos.
| b33j0r wrote:
| Ah yes, no guarantee of concurrency or ordering (in the
| headline, lol).
|
| That'd be a pretty funny compression algorithm. You listen to a
| .mpfoo file, and you'll hear the whole song, we promise!
| XCSme wrote:
| I also thought the name was clever, but your comment made it
| even more interesting. Also, my first thought was, "is this
| safe to use?" I've heard of gzip vulnerabilities before, and a
| parallel implementation sounds a lot easier to get wrong.
| dspillett wrote:
| Gzip streams support dictionary resets, which means you can
| concatenate individually compressed blocks together to make a
| whole stream.
|
| This is what pigz is doing: splitting the input into blocks,
| spreading the compression of these blocks over different
| threads so multiple cores can be used, then joining the
| results together in the right order.
|
| It is the very same property of the format that gzip's own
| --rsyncable option makes use of to stop small changes forcing
| a full file send when rsync (or similar) is used to transfer
| updated files.
|
| The idea is as simple as it is clever, one of those "why did
| I not think of that?" ideas that are obvious once someone
| else has thought of it, so it adds little or no extra risk. A
| vulnerability that abuses gzip (a "compression bomb") or can
| cause a gzip tool to errantly run arbitrary code is no more
| likely to affect pigz than it is the standard gzip builds.
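|
| pigz exposes both knobs, if memory serves: -p for the number of
| threads and -b for the block size in KiB, e.g.
|
|       pigz -p 8 -b 512 big.tar    # 8 threads, 512 KiB blocks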
| apetresc wrote:
| Given that, why wouldn't this just be upstreamed into gzip?
| If it's a clean, simple solution that's just expanding the
| use of a technique that's already in the core binary?
| cldellow wrote:
| gzip is a pretty old, pretty core program, so I imagine
| it's largely in maintenance mode, and that there is a lot
| of friction to pushing large changes into it. At one
| point, pigz required the pthreads library to build. If it
| still does, the gzip people would need to consider if
| that was appropriate for them, and if not, rewrite it to
| be buildable without it.
|
| There are multiple implementations of zlib that are
| faster than the one that ships with GNU gzip, and yet
| they haven't been incorporated.
|
| There are also just better algorithms if compatibility
| with gzip isn't needed. zstd, for example, supports
| parallel compression, and is both faster and compresses
| better than gzip.
| bbertelsen wrote:
| Warning for the uninitiated. Be cautious using this on a
| production machine. I recently caused a production system to
| crash because disk throughput was so high that it started
| delaying read/writes on a PostgreSQL server. There was panic!
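|
| One way to soften that (assuming Linux, and that the load really
| is pigz rather than something downstream) is to cap its threads
| and deprioritise it:
|
|       nice -n 19 ionice -c3 pigz -p 4 huge-dump.sql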
| necovek wrote:
| Any comparative benchmarks or a write-up on the approach (other
| than "uses zlib and pthreads" from the README)?
| 331c8c71 wrote:
| I used it and it was noticeably faster. I didn't write down by
| how much.
| chasil wrote:
| Single-threaded gzip can outperform pigz, or at least come very
| close, when used with GNU xargs on separate files with no
| dependencies.
|
| https://www.linuxjournal.com/content/parallel-shells-xargs-u...
|
| https://news.ycombinator.com/item?id=26178257
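|
| Something along these lines (the '*.log' pattern is just an
| example):
|
|       find . -type f -name '*.log' -print0 \
|         | xargs -0 -P "$(nproc)" -n 1 gzip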
| Xorlev wrote:
| pigz is most useful on a single stream of data, vs. the more
| obviously parallel case of files without dependencies.
| xfalcox wrote:
| One interesting bit of trivia is that since ~2020 Docker will
| transparently use pigz for decompressing container image layers
| if it's available on the host. This was a nice speedup for us,
| since we use large container images and automatic scaling for
| incoming traffic surges.
| chasil wrote:
| I think dracut also uses pigz to create the initrd when
| installing a new Linux kernel rpm package.
| danuker wrote:
| Have you optimized the low-hanging fruit in your image size?
|
| Because compression programs are as high-hanging fruit as you
| can get, and parallelizing them can only be done once.
| jaimehrubiks wrote:
| I used this recently with -0 (no compression) to pack* billions
| of files into a tar file before sending them over the network. It
| worked amazingly well.
| anderskaseorg wrote:
| Why use tar | pigz -0 when you can just use tar?
| jaimehrubiks wrote:
| I used tar --use-compress-program="pigz" to create the tar
| out of billions of files
| ac29 wrote:
| Tar is the archiver here (putting multiple files into one
| file), pigz with no compression isn't doing anything besides
| wasting CPU time.
| richard_todd wrote:
| But what's confusing everyone is that tar cf - will create
| the tar without any external compression program needed.
| koolba wrote:
| Even the "f -" option is unneeded as the default is to
| stream to stdout. Though it's always a bit scary to not
| explicitly specify the destination in case your finger
| slips and the first target is itself a writeable file.
| jaimehrubiks wrote:
| I could definitely be wrong here, apologies for the
| confusion. I run many of these tasks automated; in some
| cases I used low compression, in others zero compression.
| For low compression, that command really shines. For zero
| compression, I would have bet I also got an improvement over
| regular tar without compression, but again, I could be
| wrong here. I'll test it again.
| donatj wrote:
| If you're not going to compress at all, you don't need a
| compressor at all. All you needed was a .tar and not a
| .tar.gz
| dividuum wrote:
| Maybe I'm missing something, but why send the tar generated
| stream through a non-compressing compressor when you could just
| send the tar directly?
| jaimehrubiks wrote:
| I didn't have the tar, I created it using:
|
| tar --use-compress-program="pigz -0" ...
| gruez wrote:
| But if you don't specify the -z flag when using tar, then
| it won't be compressed. Why type all that out when omitting
| one flag does the same thing?
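|
| i.e. the two useful shapes are (paths illustrative):
|
|       tar -cf files.tar ./files              # archive only, no compressor
|       tar -I pigz -cf files.tar.gz ./files   # archive + parallel gzip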
| jiggawatts wrote:
| Funny this comes up again so soon after I needed it! I recently
| did a proof-of-concept related to bioinformatics (gene assembly,
| etc...), and one quirk of that space is that they work with
| _enormous_ text files. Think tens of gigabytes being a "normal"
| size. Just compressing and copying these around is a pain.
|
| One trick I discovered is that tools like pigz can be used to
| both accelerate the compression step and also copy to cloud
| storage in parallel! E.g.:
|
|       pigz input.fastq -c | azcopy copy --from-to PipeBlob \
|         "https://myaccountname.blob.core.windows.net/inputs/input.fastq.gz?..."
|
| There is a similar pipeline available for s3cmd as well with the
| same benefit of overlapping the compression and the copy.
|
| However, if your tools support zstd, then it's more efficient to
| use that instead. Try the "zstd -T0" option or the "pzstd" tool
| for even higher throughput, but with some minor caveats.
|
| PS: In case anyone here is working on the above tools, I have a
| small request! What would be awesome is to _automatically_ tune
| the compression ratio to match the available output bandwidth.
| With the '-c' output option, this is easy: just keep increasing
| the compression level by one notch whenever the output buffer is
| full, and reduce it by one level whenever the output buffer is
| empty. This will automatically tune the system to get the maximum
| total throughput given the available CPU performance and network
| bandwidth.
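|
| (Aside: if I'm remembering right, zstd's --adapt flag already
| attempts something like this, nudging the level up and down to
| match how fast the output drains. E.g.:
|
|       zstd -T0 --adapt -c input.fastq | azcopy copy --from-to PipeBlob \
|         "https://myaccountname.blob.core.windows.net/inputs/input.fastq.zst?..."  )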
| ByThyGrace wrote:
| On Linux would it Just Work(tm) if you aliased gzip to pigz as a
| drop-in replacement?
| ndsipa_pomu wrote:
| In theory, most stuff should work as it's 99% compatible, but
| there might well be something that breaks. Rather than
| symlinking it or some such, it's better to configure the
| necessary tools to use the pigz command instead and then you'll
| at least find out what works.
|
| FWIW, I configure BackupPC to use pigz instead of gzip without
| any issues.
| _joel wrote:
| Use this all the time (or did when I was doing more sysadminy
| stuff). Useful in all sorts of backup pipelines
| josnyder wrote:
| This was great in 2012. In 2022, most use-cases should be using
| parallelized zstd.
| lxe wrote:
| Protip: if you're on a massively-multicore system and need to
| tar/gzip a directory full of node_modules, use pigz via `tar -I
| pigz` or a pipe. The performance increase is incredible.
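|
| Spelled out (the node_modules path is whatever yours is):
|
|       tar -I pigz -cf node_modules.tar.gz node_modules/
|       tar -I pigz -xf node_modules.tar.gz   # unpack; tar runs pigz -d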
| omoikane wrote:
| The bit I found most interesting was actually:
|
| https://github.com/madler/pigz/blob/master/try.h
|
| https://github.com/madler/pigz/blob/master/try.c
|
| which implements try/catch for C99.
| dima_vm wrote:
| But why? Most modern languages try to get rid of exceptions
| (Go, Kotlin, Rust).
| resoluteteeth wrote:
| > But why? Most modern languages try to get rid of exceptions
| (Go, Kotlin, Rust).
|
| All three of those languages actually have exceptions; they
| just don't encourage catching exceptions as a normal way of
| error handling.
|
| Also, while the trend now seems to be for newer languages to
| encourage use of things like result types, one of the main
| reasons for that is that in current languages it is easier to
| show that functions can potentially fail in the type
| system using result types rather than exceptions.
|
| Otherwise, there isn't necessarily inherently a strong reason
| to prefer one or the other, and it's possible that future
| languages will go back to exceptions but have a way to
| express that in the type system using effects, etc.
| jallmann wrote:
| Golang has panic / recover / defer which are functionally
| similar to exceptions. It's actually a fun exercise to
| implement a pseudo-syntax for try/catch/finally in terms of
| those primitives.
| makapuf wrote:
| Go _has_ exceptions but it's definitely not advised to use
| those as an error mechanism. Recover is really a last
| chance effort for recovery, not a standard error catching
| method.
| Genbox wrote:
| Kotlin does have exceptions[1]
|
| [1] https://kotlinlang.org/docs/exceptions.html#java-
| interoperab...
___________________________________________________________________
(page generated 2022-10-17 23:00 UTC)