[HN Gopher] Why are tar.xz files 15x smaller when using Python's...
___________________________________________________________________
Why are tar.xz files 15x smaller when using Python's tar compared
to macOS tar?
Author : personjerry
Score : 177 points
Date : 2021-03-14 21:06 UTC (1 hour ago)
(HTM) web link (superuser.com)
(TXT) w3m dump (superuser.com)
| gwern wrote:
| Sorting the files in an archive is a great trick to know for
| efficiency ( https://www.gwern.net/Archiving-URLs#sort-key-
| compression-tr... ); the more redundant your files are, the
| better it works and the savings can be enormous. Two near-
| duplicate files next to each other in a compressor's stream are
| basically free; space them by a few dozen megabytes, and you pay
| full-price...
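|
| A minimal sketch of the trick in Python (the directory name and
| the sort key are just placeholders):
|
|     import os
|     import tarfile
|
|     # Collect all files under a directory.
|     paths = []
|     for root, _dirs, files in os.walk("data"):
|         paths.extend(os.path.join(root, f) for f in files)
|
|     # Group likely-similar files: sort by extension, then name.
|     paths.sort(key=lambda p: (os.path.splitext(p)[1],
|                               os.path.basename(p)))
|
|     with tarfile.open("data.tar.xz", "w:xz") as tar:
|         for p in paths:
|             tar.add(p)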
| tayo42 wrote:
| Isn't the whole point of compression algorithms to find strings
| of bytes that are the same? Why do they struggle if the data is
| out of order or not next to each other? Seems like it would be
| simple to detect and handle?
| dmw_ng wrote:
| It depends a lot on the compressor. Ancient junk like
| gzip/deflate only has a 32 kB window; more modern compressors
| like zstd can do 512 MB or more.
|
| A larger window means potentially much more work during
| compression, and larger RAM requirements during
| decompression, so it's not exactly free.
| bscphil wrote:
| Makes me wonder how tar+zstd would perform for the
| StackOverflow user. They were using xz, which is hardly
| ancient junk.
| diegoperini wrote:
| Simple, but time-consuming to detect, I presume.
| mrec wrote:
| Due to limited memory, I think most compression algos have a
| window of "stuff they've seen recently" which they use to
| detect duplicates. If the duplicate is of something too far
| back, it's fallen out of the window and been forgotten.
| jtolmar wrote:
| Compression formats have to keep their back-references small,
| so they make tradeoffs about which data they can point at when
| making copies. Most formats prioritize recent data, since that
| tends to be the most similar.
| jtvjan wrote:
| The 'window size' dictates how far back the algorithm can
| look for redundant data. If you set it too high, people
| without enough core memory might not be able to decompress
| your archive.
| duskwuff wrote:
| And the window size for gzip is only 32 kB -- it "forgets"
| context pretty quickly. gzip was written in 1992, so it had
| to run under much tighter memory constraints than would make
| sense today.
|
| Modern compressors like xz (LZMA) use _much_ more memory
| for context. Under extreme settings and when working with
| large files, they can use gigabytes of memory.
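|
| A rough stdlib-only demonstration of the window effect: a
| repeated 16 KiB block is nearly free when the copies are
| adjacent, but costs full price once ~64 KiB of unrelated data
| pushes the first copy out of zlib's 32 KiB window.
|
|     import os
|     import zlib
|
|     block = os.urandom(16 * 1024)    # incompressible 16 KiB chunk
|     filler = os.urandom(64 * 1024)   # unrelated data > window
|
|     adjacent = zlib.compress(block + block + filler, 9)
|     spaced = zlib.compress(block + filler + block, 9)
|
|     # The "spaced" result is roughly 16 KiB larger: the second
|     # copy of `block` can't be encoded as a back-reference.
|     print(len(adjacent), len(spaced))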
| dmurray wrote:
| Is this really true? Surely they should be able to
| decompress it with virtual memory - you tell them exactly
| where to look for each operation. On the other hand, it
| will take you much longer to _compress_ the file if your
| lookback window is larger than your available memory.
| LeoPanthera wrote:
| A good ugly hack is to reverse the filenames before sorting
| them, which has the side effect of grouping all the same file
| extensions together.
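|
| In Python terms, the hack is just a reversed sort key (file
| names here are made up):
|
|     names = ["a/readme.txt", "b/notes.txt",
|              "a/logo.png", "b/icon.png"]
|     names.sort(key=lambda p: p[::-1])
|     # -> ['b/icon.png', 'a/logo.png', 'a/readme.txt', 'b/notes.txt']
|     # i.e. the .png files and .txt files end up grouped together.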
| duskwuff wrote:
| 7za will do this for you when running with the -mqs option.
| 0xdky wrote:
| From what I know, compression works by de-duplication. Finding
| duplicate patterns is limited to a window.
|
| If similar patterns are close to each other, there is a higher
| probability of finding such duplicates in the window leading to
| better compression.
|
| When the files are not sorted, files with similar patterns end
| up randomly distributed, often farther apart than the
| compression window, which leads to poor compression.
|
| If there is an option to increase the size of window, that would
| be a good experiment.
|
| This is very similar to the `git repack` window and depth
| parameters: the larger the window and depth, the better
| compressed the packs.
|
| I wonder if a sort based on diffs (grouping similar files
| together) would help get the best compression. The cost of such
| sorting might outweigh the benefits.
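|
| A toy sketch of that idea (stdlib only, O(n^2) in the number of
| files, so only sensible for small sets): greedily order files
| so each one follows its most similar predecessor.
|
|     import difflib
|
|     def similarity(a, b):
|         # Cheap similarity estimate between two byte strings.
|         return difflib.SequenceMatcher(None, a, b).quick_ratio()
|
|     def greedy_order(files):
|         """files: dict mapping name -> bytes. Returns names."""
|         remaining = dict(files)
|         name, current = remaining.popitem()
|         order = [name]
|         while remaining:
|             name = max(remaining,
|                        key=lambda n: similarity(current,
|                                                 remaining[n]))
|             current = remaining.pop(name)
|             order.append(name)
|         return order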
| mxcrossb wrote:
| There's a lot of research in the field of scientific computing
| aimed at exactly this kind of data. You can definitely gain a
| lot if your floating-point data varies slowly.
| jaynetics wrote:
| tl;dr: the python lib adds files to the archive sorted by the
| last modification date by default. this happened to be better in
| this case, but macOS tar had the same results when using the
| appropriate sorting flag.
|
| Makes me wonder: considering the speed of modern SSDs, would it
| make sense for compression tools to try various sortings by
| default, or compare files up front?
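|
| For small archives you can already do a crude version of that
| yourself: try a few candidate orderings, compress each in
| memory, keep the best (purely illustrative; assumes everything
| fits in RAM).
|
|     import io
|     import lzma
|     import os
|     import tarfile
|
|     def compressed_size(paths):
|         raw = io.BytesIO()
|         with tarfile.open(fileobj=raw, mode="w") as tar:
|             for p in paths:
|                 tar.add(p)
|         return len(lzma.compress(raw.getvalue()))
|
|     paths = [p for p in os.listdir(".") if os.path.isfile(p)]
|     orderings = {
|         "name": sorted(paths),
|         "ext": sorted(paths, key=lambda p: os.path.splitext(p)[1]),
|         "mtime": sorted(paths, key=os.path.getmtime),
|     }
|     print(min(orderings, key=lambda k: compressed_size(orderings[k])))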
| masklinn wrote:
| > macOS tar had the same results when using the appropriate
| sorting flag.
|
| No, macOS has the same results when using gnutar, which
| provides a sorting option.
|
| bsdtar (which macOS provides out of the box) has no such
| option. It just stores files in whichever order you provide
| them on the command line, or in whichever order it finds them
| (for recursive tar, that's whatever order the directory listing
| returns them in).
| js2 wrote:
| > macOS tar had the same results when using the appropriate
| sorting flag
|
| They had to install and use GNU tar to gain the `--sort`
| option. macOS (BSD) tar doesn't have it. (You could emulate the
| behavior by using `find` and passing the member names in on the
| command-line.)
| tomatocracy wrote:
| Another option is to use something like lrzip as the compressor
| - not sure if it would have helped in this instance but it
| incorporates a step which looks for long-range redundancy which
| can make a huge difference in these sorts of scenarios.
| JulianWasTaken wrote:
| Seems better to have some separate higher level metacompression
| tool which does the heuristic searching across many possible
| compression formats and parameters.
| duskwuff wrote:
| This is precisely the approach taken by the Hydra
| metacompressor in the (closed-source) Oodle compression
| library:
|
| http://cbloomrants.blogspot.com/2017/02/oodle-hydra.html
| formerly_proven wrote:
| Compression is generally CPU bound for most of the typical
| algorithms (LZMA, zlib/deflate) unless your drive is very very
| slow (<10 MB/s). There are some quite fast algorithms where
| this could make sense, basically using something like LZ4 as a
| proxy for compressibility. This is the approach some tools use
| when you tell them to "compress, but only if it makes sense".
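|
| A sketch of that check, using the third-party lz4 package
| (pip install lz4) as the cheap proxy; the threshold is
| arbitrary:
|
|     import lz4.frame
|
|     def worth_compressing(data, threshold=0.9):
|         """True if a fast LZ4 pass shrinks the data noticeably."""
|         return len(lz4.frame.compress(data)) < threshold * len(data)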
| userbinator wrote:
| _If I compare the two tar archives directly, they seem different_
|
| I like the fact that this user considered whether the filesystem
| was lying about the size.
|
| It's interesting to see the problem-solving approach; my next
| step would be to hexdump both and see if one isn't actually
| compressed as much as it could be, then decompress and inspect
| the tars.
|
| I'm not sure whether it's true here, but one situation that leads
| to seemingly randomly-ordered files is when the archiver tries to
| be faster by using multiple threads.
|
| The XZ blocksize is another variable to consider.
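|
| For the "inspect the tars" step, Python itself makes a quick
| comparison easy (archive names here are placeholders):
|
|     import tarfile
|
|     def members(path):
|         # Default mode "r" transparently handles the xz layer.
|         with tarfile.open(path) as tar:
|             return [(m.name, m.size) for m in tar.getmembers()]
|
|     a = members("python.tar.xz")
|     b = members("macos.tar.xz")
|     print(sorted(a) == sorted(b))   # same contents?
|     print(a == b)                   # same order?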
| TheGuyWhoCodes wrote:
| This isn't really specific to tar. For example, if you save data
| to Parquet, sorting the data before persisting it can make a
| huge difference in the size of the file AND in predicate
| pushdown performance (as you can skip more pages, especially if
| you sort by the column you filter on).
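|
| With pandas/pyarrow that's a one-line change (column and file
| names are hypothetical):
|
|     import pandas as pd
|
|     df = pd.read_parquet("events.parquet")
|     # Sorting by the column you usually filter on groups similar
|     # values into the same pages, so they compress and skip better.
|     df.sort_values("user_id").to_parquet("events_sorted.parquet")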
| war1025 wrote:
| Unrelated to the issue in the post, but a few years back we ran
| into an annoying oddity in the Python zipfile module [1].
|
| By default, when you create a new ZipFile object to write a
| .zip archive, it stores the data uncompressed.
|
| Only noticed the issue when a 100MB zip file suddenly became 3MB
| after I had edited one of the files and re-saved the archive.
|
| [1] https://docs.python.org/3/library/zipfile.html
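|
| The fix is just to be explicit (ZIP_DEFLATED requires zlib):
|
|     import zipfile
|
|     with zipfile.ZipFile("out.zip", "w",
|                          compression=zipfile.ZIP_DEFLATED) as zf:
|         zf.write("bigfile.dat")   # placeholder file name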
| masklinn wrote:
| Yes, it's clearly documented but very much unexpected the first
| time 'round.
|
| The reason for the default is probably that python can be built
| with none of zlib, bzip2 or lzma support:
|
| > If ZIP_DEFLATED, ZIP_BZIP2 or ZIP_LZMA is specified but the
| corresponding module (zlib, bz2 or lzma) is not available,
| RuntimeError is raised.
|
| in which case only STORED is available, so that's the only
| possible default; any other would lead to zipfile raising
| errors out of the box for some people.
|
| It would probably have been a better idea to just not use a
| default, and require the compression method to always be
| specified, at least when opening zipfiles with modes other than
| "r".
| duskwuff wrote:
| TBH, Python's zipfile should probably just require zlib.
| Without that module, it's basically useless (can't open most
| ZIP files, can't emit compressed ZIP files).
| Jiocus wrote:
| This isn't an oddity or issue with Python zip creation, it's
| expected behaviour as per ISO/IEC WD 21320-1[1].
|
| > The compression method shall be either 0 ("stored") or
| 8("deflated").
|
| The format does not prescribe a compression algorithm (many
| could be available for use). It's merely a container.
|
| [1] ISO/IEC WD 21320-1, Document Container File - Part 1: Core
| -
| https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39...
| bawolff wrote:
| What defaults python chooses is an oddity within python.
| Jiocus wrote:
| Using defaults aimed at compatibility with other zip
| archivers, today and tomorrow, doesn't strike me as odd.
| jasode wrote:
| Fyi... not mentioned explicitly in that Q&A is the related
| concept : https://en.wikipedia.org/wiki/Solid_compression
|
| [EDIT to reply] : _> The difference was just the order in which
| the files were concatenated._
|
| Yes, right. My link just emphasizes the file-concatenation
| aspect (tar is one example) for people who base their
| intuition on the ZIP format, in which file order doesn't matter
| because each file is compressed _separately_.
| simias wrote:
| Note that in both cases the archive was first put into a single
| tar file, then compressed. That's how all compressed tar
| formats work.
| The difference was just the order in which the files were
| concatenated.
| masklinn wrote:
| The point GP is making is that tars are "solid" archives, so
| compression can span files, which means putting files with
| lots of redundancy between them next to one another will
| yield much better compression ratios than orderings which are
| essentially random.
|
| That is somewhat unintuitive to people used to zip archives,
| because zip contents are compressed independently, so the
| ordering really doesn't matter (at least to the compression
| ratio).
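|
| A tiny in-memory illustration of the difference: two identical
| 1 MiB files cost ~2 MiB in a deflate zip but only ~1 MiB as a
| solid tar piped through LZMA.
|
|     import io
|     import lzma
|     import os
|     import tarfile
|     import zipfile
|
|     blob = os.urandom(1024 * 1024)
|
|     zbuf = io.BytesIO()
|     with zipfile.ZipFile(zbuf, "w", zipfile.ZIP_DEFLATED) as zf:
|         zf.writestr("a.bin", blob)
|         zf.writestr("b.bin", blob)   # independent: no savings
|
|     tbuf = io.BytesIO()
|     with tarfile.open(fileobj=tbuf, mode="w") as tar:
|         for name in ("a.bin", "b.bin"):
|             info = tarfile.TarInfo(name)
|             info.size = len(blob)
|             tar.addfile(info, io.BytesIO(blob))
|
|     # LZMA sees the whole tar as one stream, so the duplicate
|     # second file is almost free.
|     print(len(zbuf.getvalue()), len(lzma.compress(tbuf.getvalue())))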
| nomy99 wrote:
| Also a bit unrelated, but at my workplace we reimplemented the
| tar algorithm so we could run it in a lambda/state machine. We
| used S3 multipart uploads to help with the file transfer bit.
| The performance was phenomenal. On average we were tar'ing
| 20 GB in 15 seconds or so, a lot faster than on my PC.
| OskarS wrote:
| Tar (as opposed to gzip) is almost entirely I/O bound; it's
| just reading from the filesystem, adding some headers, and then
| outputting it again. Whatever speed your I/O runs at, you will
| basically tar at that speed as well.
| banana_giraffe wrote:
| Yep. One of the biggest speed improvements I got out of a
| process that creates large tarballs was moving it to use
| pbzip2 to allow the compression to use multiple cores.
| slivanes wrote:
| pigz is another package.
| gnulinux wrote:
| I don't understand this comment. You can run arbitrary code in
| AWS lambda, you can just run "tar". Why did you need to
| reimplement it?
| 411111111111111 wrote:
| It wasn't always like that though. I think they added custom
| runtimes around 2yrs ago... Your options were very limited
| before that.
| rescbr wrote:
| I suppose they are streaming the tarball directly to S3 using
| multipart uploads, maybe parallelizing this task at the same
| time.
|
| You wouldn't be able to do it using regular tar.
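|
| Roughly, the idea looks like this (boto3 assumed; bucket, key
| and file list are placeholders, and error handling is omitted):
|
|     import io
|     import tarfile
|     import boto3
|
|     BUCKET, KEY = "my-bucket", "backup.tar"
|     PART_SIZE = 8 * 1024 * 1024  # parts must be >= 5 MiB (except the last)
|
|     s3 = boto3.client("s3")
|     upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
|     upload_id = upload["UploadId"]
|
|     buf = io.BytesIO()
|     parts, part_no = [], 1
|
|     def flush_part(body):
|         global part_no
|         resp = s3.upload_part(Bucket=BUCKET, Key=KEY,
|                               PartNumber=part_no,
|                               UploadId=upload_id, Body=body)
|         parts.append({"ETag": resp["ETag"], "PartNumber": part_no})
|         part_no += 1
|
|     # "w|" is tarfile's streaming mode: it only writes forward,
|     # so we can safely empty the buffer between parts.
|     with tarfile.open(fileobj=buf, mode="w|") as tar:
|         for path in ["file1", "file2"]:
|             tar.add(path)
|             if buf.tell() >= PART_SIZE:
|                 flush_part(buf.getvalue())
|                 buf.seek(0)
|                 buf.truncate()
|
|     # Whatever is left, including the tar end-of-archive blocks.
|     flush_part(buf.getvalue())
|     s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY,
|                                  UploadId=upload_id,
|                                  MultipartUpload={"Parts": parts})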
| sgt wrote:
| Even better when you switch on the --lossy option in gzip.
| Aachen wrote:
| Was hoping for an Easter egg, but it just says unrecognized
| option? At least on my phone it does, iirc that's a Debian
| Buster install that I haven't updated in three years.
| ObscureScience wrote:
| I assume you're joking... I've never heard of lossy compression
| in these kinds of formats, as there's no general way to
| determine what can be discarded without affecting the utility
| of the data. There's no such option in any implementation of
| gzip I've found...
| dheera wrote:
| What version of gzip has a lossy option?
| ben509 wrote:
| It's a joke. Lossy compression can only work with specific
| formats where information loss is acceptable, theoretically
| anything but in practice usually audio and video.
|
| If gzip tried to do lossy compression, not knowing what the
| data is, it would have to randomly corrupt data.
| gruez wrote:
| At that point you might as well store your files in /dev/null
| wanderer2323 wrote:
| if /dev/null is fast and web scale I will use it. (
| https://devnull-as-a-service.com/ )
| JdeBP wrote:
| It's time to mention the name of the NABOB archiver, it
| seems. (-:
|
| * https://news.ycombinator.com/item?id=26460317
| suprfsat wrote:
| Does /dev/null support sharding?
| fhars wrote:
| Maybe try mangodb https://github.com/dcramer/mangodb
| [deleted]
| LeoPanthera wrote:
| Although you are joking, gifsicle does have a "--lossy" switch
| which allows changes to a gif file in order to make the LZW
| compression more efficient.
| masklinn wrote:
| Another example is pngquant, which performs "lossy"
| compression of PNGs (by quantizing them).
| LeoPanthera wrote:
| pngquant reduces the size by removing information from the
| image _before_ compression.
|
| gifsicle --lossy actually allows slightly lossy LZW
| dictionary matches to improve the compression itself.
| zinckiwi wrote:
| It was the best and worst of times.
| krrrh wrote:
| It was the best and best of times.
___________________________________________________________________
(page generated 2021-03-14 23:00 UTC)