https://superuser.com/questions/1633073/why-are-tar-xz-files-15x-smaller-when-using-pythons-tar-library-compared-to-mac

Why are tar.xz files 15x smaller when using Python's tar library compared to macOS tar?

Asked yesterday. Active today. Viewed 12k times.

Context

I'm compressing ~1.3 GB folders, each filled with 1440 JSON files, and find that there's a 15-fold difference in archive size between using the tar command on macOS or Raspbian 10 (Buster) and using Python's built-in tarfile library.

Minimal working example

This script compares both methods:

    #!/usr/bin/env python3
    from pathlib import Path
    from subprocess import call
    import tarfile

    fullpath = Path("/Users/user/Desktop/temp/tar/2021-03-11")
    zsh_out = Path(fullpath.parent, "zsh-archive.tar.xz")
    py_out = Path(fullpath.parent, "py-archive.tar.xz")

    # tar using terminal
    # tar cJf zsh-archive.tar.xz folderpath
    call(["tar", "cJf", zsh_out, fullpath])

    # tar using tarfile library
    with tarfile.open(py_out, "w:xz") as tar:
        tar.add(fullpath, arcname=fullpath.stem)

    # Print file sizes in MB
    print(f"zsh tar filesize: {round(Path(zsh_out).stat().st_size/(1024*1024), 2)} MB")
    print(f"py tar filesize: {round(Path(py_out).stat().st_size/(1024*1024), 2)} MB")

The output is:

    zsh tar filesize: 23.7 MB
    py tar filesize: 1.49 MB

The versions I use are as follows:

* tar on macOS: bsdtar 3.3.2 - libarchive 3.3.2 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.6
* tar on Raspbian 10: xz (XZ Utils) 5.2.4 liblzma 5.2.4
* tarfile Python library: 0.9.0

Things I've tried

After compression, I extracted both archives and compared the resulting folders with:

    diff -r py-archive-expanded zsh-archive-expanded

There was no difference. If I compare the two tar archives directly, they differ:

    diff zsh-archive.tar.xz py-archive.tar.xz
    Binary files zsh-archive.tar.xz and py-archive.tar.xz differ

If I inspect the archives with Quicklook (and the Betterzip plugin), I see that the files in the archives are ordered differently.

(Two screenshots of the archive listings: zsh-archive.tar.xz on the left, py-archive.tar.xz on the right.)

The zsh archive uses an unknown order, and the Python archive orders the files by modification date. I am not sure if that matters.

Question

What is going on? Am I losing something by using the Python library to compress my data? Is the 15-fold difference in size an indicator of some issue? Or can I safely go ahead and use the efficient Python implementation?
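
One way to check the ordering without a GUI tool is to list the members of both archives with tarfile. This is only a minimal sketch: the helper names `member_names` and `same_files_different_order` are made up for illustration, and it compares base names only, on the assumption that the two archives store different path prefixes (the tar command was given an absolute path, the Python call an `arcname`).

```python
import posixpath
import tarfile


def member_names(archive_path):
    """Return the base names of regular files, in the order they are stored."""
    # "r:*" lets tarfile detect the compression (xz, gzip, ...) by itself.
    with tarfile.open(archive_path, "r:*") as tar:
        return [posixpath.basename(m.name) for m in tar if m.isfile()]


def same_files_different_order(path_a, path_b):
    """True if both archives hold the same file names, but in another order."""
    a, b = member_names(path_a), member_names(path_b)
    return sorted(a) == sorted(b) and a != b
```

Calling `same_files_different_order("zsh-archive.tar.xz", "py-archive.tar.xz")` would then tell whether member order is the only visible difference between the two listings; it says nothing about file contents or metadata.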

Tags: macos python zsh compression tar

Asked yesterday by Saaru Lindestokke; edited 1 hour ago by Peter Mortensen.

Comments:

* Did you make sure the result of tar cJf is actually xz-compressed? xz also uses LZMA but it is a distinct format from, say, 7-zip. Try file the-archive.tar.xz. - Daniel B, yesterday
* file zsh-archive.tar.xz gives zsh-archive.tar.xz: XZ compressed data - Saaru Lindestokke, yesterday
* Did you actually tar up the same directory tree in both cases? Just making sure ;-) - tink, yesterday
* Hm, okay. Please verify whether the uncompressed .tar files are the same. Files may have been added in a different order, which creates a different compression result. - Daniel B, yesterday
* @tink, yes I do. I've added a test script in my question that shows the same directory being compressed, generating the wildly different file size. - Saaru Lindestokke, yesterday

3 Answers

Answer (score: 74)

Ok, I think I found the issue: BSD tar and GNU tar without any sort options put the files in the archive in an undefined order.

GNU tar has a --sort option:

    sort directory entries according to ORDER, which is one of none, name, or inode. The default is --sort=none, which stores archive members in the same order as returned by the operating system.

After installing GNU tar on my Mac with:

    brew install gnu-tar

And then tarring the same folder, but with the --sort option:

    gtar --sort='name' -cJf zsh-archive-sorted.tar.xz /Users/user/Desktop/temp/tar/2021-03-11

I get a .tar.xz archive of 1.5 MB, equal to the archive created by the Python library.

I think the reason sorting has such an impact is as follows: My JSON files contain measurements from hundreds of sensors. Every minute I read out all sensors, but only a few of these sensors have a different value from minute to minute.
By sorting the files by name (each name begins with the creation Unix time), two consecutive files differ in only a few characters. Apparently this is very favourable for the compression efficiency.

- Saaru Lindestokke, answered 23 hours ago

Comments:

* Compression programs operate on blocks of text controlled by a single dictionary; by sorting the input, you've put similar bits near each other, allowing xz to compress lots of similar data in one dictionary. Compression and decompression were probably also faster. - RonJohn, 18 hours ago
* Wow, another case where sorting makes things much faster. - justhalf, 15 hours ago
* I don't really understand yet why the OS returns the files in "unsorted" order with the sort=none option. I mean, there's always some sort order, right? If anyone knows what order the OS uses, feel free to add. - Saaru Lindestokke, 13 hours ago
* @SaaruLindestokke The order in which the OS returns the files in a directory depends on the filesystem used (assuming the same OS is used; obviously you can easily patch Linux so that it returns files in some order you want by default, or randomizes the order by default). So there is no single sort order used by default by any OS. As such, we do not provide guarantees and we say "do not assume any specific sort order". This does not mean that filesystems actively randomize the results before returning them; it just means that if the user changes filesystem, the results will likely change. - Bakuriu, 12 hours ago
* TL;DR: "unsorted" means use dir entries in the order we get them from the OS's system call, which you can see with ls -U. - Peter Cordes, 11 hours ago

Answer (score: 2)

Try setting the compression levels in the macOS command line.
I know you are asking about xz but, as explained in another answer, on older versions of GZip you can set the compression level with an environment variable like this:

    GZIP=-9 tar czf zsh-archive.tar.gz folderpath

That said, that only seems to work with GZip 1.8 and is deprecated in later versions. So use the -I/--use-compress-program=COMMAND option for tar instead; note that this option might not work on macOS, but I am placing it here anyway just in case. The command would then change to:

    tar -I 'gzip -9' -cf zsh-archive.tar.gz folderpath

And yes, these examples compress the archive with Gzip instead of xz, but you can easily change the command to use xz like this:

    tar -I 'xz -9' -cf zsh-archive.tar.xz folderpath

The xz compression level ranges from -0 to -9, with the default being -6; so -9 is the highest compression level.

Just note that xz is not installed on macOS by default. To install it on macOS you must first install Homebrew and then install xz via Homebrew like this:

    brew install xz

- Giacomo1968, answered yesterday; edited yesterday

Comments:

* I tried the command tar -I 'xz -9' -cf zsh-archive.tar.xz folderpath, but I get the following error: tar: Couldn't open xz -9: No such file or directory - Saaru Lindestokke, yesterday
* In macOS? I just checked and it seems to be provided on my system by Homebrew. So I would recommend installing Homebrew and then running: brew install xz - Giacomo1968, yesterday
* Yes, on macOS. man tar shows the -I option is a synonym for the -T option, which is the --files-from option. I've tried it with the longhand option --use-compress-program, which resulted in a 10 MB file instead of the regular 23 MB, but it's still not near the 1.5 MB from Python. - Saaru Lindestokke, yesterday
* Note that I've tried this in the Raspbian terminal as well, with similar results to what I get on macOS.
- Saaru Lindestokke, yesterday
* All fair. Makes me wonder what Python is using for compression then? - Giacomo1968, yesterday

Answer (score: 1)

"Makes me wonder what Python is using for compression"

http://tukaani.org/xz/

It's probably using the function calls in liblzma. Tar is probably piping through the xz shell command.

A quick comment on --sort=name: The sort option is a relatively recent enhancement to GNU tar and was introduced in tar version 1.28. It may never be implemented in BSD tar.

- Louis Thompson, answered 3 hours ago; edited 1 hour ago by Peter Mortensen
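
The ordering explanation from the first answer can be tried out on synthetic data. The following is a rough sketch, not a reproduction of the original measurements: the snapshot generator, the file names, the start timestamp and all sizes are invented for illustration. It builds small JSON "sensor snapshots" in which only a few values change from one minute to the next, then writes them into an in-memory .tar.xz once sorted by name and once shuffled. As far as I know, `tarfile.add()` has added directory entries in sorted order since Python 3.7, which would match what the question observed.

```python
import io
import json
import random
import tarfile

START = 1615420800  # arbitrary Unix timestamp, used only to build file names


def make_snapshots(count=200, sensors=300, changes_per_step=3, seed=0):
    """Simulate per-minute JSON snapshots where only a few sensors change."""
    rng = random.Random(seed)
    state = {f"sensor_{i}": round(rng.random(), 6) for i in range(sensors)}
    keys = list(state)
    files = []
    for minute in range(count):
        for key in rng.sample(keys, changes_per_step):
            state[key] = round(rng.random(), 6)
        files.append((f"{START + 60 * minute}.json", json.dumps(state).encode()))
    return files


def xz_tar_size(files):
    """Write (name, data) pairs into an in-memory .tar.xz; return its size."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


files = make_snapshots()
shuffled = files[:]
random.Random(1).shuffle(shuffled)  # stand-in for "whatever order the OS returns"

sorted_size = xz_tar_size(sorted(files))
shuffled_size = xz_tar_size(shuffled)
print(f"sorted by name: {sorted_size} bytes")
print(f"shuffled:       {shuffled_size} bytes")
```

The gap on this toy data will not match the 15-fold difference from the question, since everything here fits comfortably inside the compressor's dictionary; with the ~1.3 GB folders described above, similar files that end up far apart in the archive may fall outside it entirely, which would make ordering matter much more.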

Site design / logo (c) 2021 Stack Exchange Inc; user contributions licensed under cc by-sa. rev 2021.3.12.38768