[HN Gopher] The tar archive format, and why GNU tar extracts in ...
       ___________________________________________________________________
        
       The tar archive format, and why GNU tar extracts in quadratic time
        
       Author : mort96
       Score  : 117 points
       Date   : 2022-07-23 19:11 UTC (3 hours ago)
        
 (HTM) web link (mort.coffee)
 (TXT) w3m dump (mort.coffee)
        
       | mdavidn wrote:
       | One advantage of tar is that, because the format has no built-in
       | support for compression or random access, the entire archive is
       | compressed together. Similarities in adjacent files will improve
       | the compression ratio.
       | 
       | To support random access, the ZIP format must compress each file
       | separately.
        
         | orlp wrote:
         | That's just not true, at least theoretically speaking (I have
         | no idea about ZIP's actual internals). One could do a two-pass
         | algorithm to find common substrings to build a dictionary,
         | store this shared dictionary once, and then compress the files
         | individually referring to this shared dictionary for the common
         | substrings.
        
       | mjevans wrote:
       | bsdtar (libarchive) - I wonder if it avoids these issues?
        
       | userbinator wrote:
       | The effect of restricting writes to only a directory and its
       | descendants seems like the perfect use-case for chroot(), except
       | that comes with some other limitations too.
       | 
       | I started out computing in the PC world with the DOS family (then
       | moving on to Windows) before working with any of the Unixes, so
       | my perspective may be different from those who started with
       | Unix/Linux, but in my experience it seems links (both soft and
       | especially hard ones when applied to directories) cause a rather
       | large number of problems in comparison to their usefulness; they
       | make the filesystem a graph instead of a tree, adding complexity
       | to all applications that need to traverse it.
        
       | saagarjha wrote:
       | > You may think that the numeric values (file_mode, file_size,
       | file_mtime, ...) would be encoded in base 10, or maybe in hex, or
       | using plain binary numbers ("base 256"). But no, they're actually
       | encoded as octal strings (with a NUL terminator, or sometimes a
       | space terminator). Tar is the only file format I know of which
       | uses base 8 to encode numbers.
       | 
       | Tar's "competitor", cpio, does this as well (at least in one of
       | the popular implementations). The Xcode XIP file, if you're
       | familiar with that particular format, is a couple layers of
       | wrapping on top of a cpio archive, so deep inside my tool to
       | expand these there's a spot where I read these all out in a
       | similar fashion:
       | https://github.com/saagarjha/unxip/blob/5dcfe2c0f9578bc43070...
       | 
       | More generally, though, both tar and cpio doesn't have an index
       | because they don't need them, they're meant to be decompressed in
       | a streaming fashion where this information isn't necessary. It's
       | kind of inconvenient if you want to extract a single file, but
       | for large archives you really want to compress files together and
       | maintaining an index for this is both expensive and nontrivial,
       | so I guess these formats are just "good enough".
        
       | Twirrim wrote:
       | > Tar is pretty unusual for an archive file format. There's no
       | archive header, no index of files to fascilitate seeking, no
       | magic bytes to help file and its ilk detect whether a file is a
       | tar archive, no footer, no archive-wide metadata. The only kind
       | of thing in a tar file is a file object.
       | 
       | tar = _t_ ape _ar_ chive. It was designed around streaming files
       | to a tape drive in a serial fashion. Tape bandwidth has almost
       | always been slow compared to source data devices. You need to
       | start writing to the target device as fast as possible. Taking
       | time to construct a full index just delays starting the slowest
       | stage of things, while providing only relatively minimal benefit.
       | 
       | Generating an index is fairly straightforward, the file headers
       | give you the information you need, including what you need to
       | know to get to the next file header. Your only bottleneck is just
       | how fast random reads are. If the file is on disk, it should be
       | relatively trivial.
       | 
       | It's even possible to do this relatively cheaply with Range
       | headers if you're looking at a tar file stored on an HTTP
       | endpoint, though you'll likely want to think about some smart
       | pre-fetching behaviour.
        
         | vetinari wrote:
         | The issue with tapes is not the bandwidth, it is the seek
         | _latency_ , could be around ~2 minutes end to end (with the
         | correct tape already loaded). So tapes are not used as random-
         | access, thus no central index on tape archive. The backup
         | software systems keep their catalogs elsewhere.
        
           | Waterluvian wrote:
           | When I was a beginner I once accidentally created an Amazon
           | RDS Postgres database backed by tape. I was so flustered by
           | how slow it was until I saw the setting.
        
             | ufo wrote:
             | Lol! I'm curious how slow it got... How slow are we talking
             | about here?
        
               | Waterluvian wrote:
               | A non-cached GET for a single database record was like 7
               | seconds.
               | 
               | It was especially scary because I was in the "I have no
               | idea what I'm doing, and I'm at a tech startup and am a
               | team of one."
        
       | brian_herman wrote:
       | "I'm absolutely certain that it's possible to make GNU tar
       | extract in O(n) without --absolute-paths by replacing the linked
       | list with a hash map. But that's an adventure for another time."
       | from the article maybe someone could put in a pr
        
         | stabbles wrote:
         | Pretty sure there's tons of C libraries that do not use hash
         | maps because it requires implementing a hash map.
        
       | vjeux wrote:
       | If you want to see more examples of accidentally quadratic
       | behaviors: https://accidentallyquadratic.tumblr.com/
        
       | game-of-throws wrote:
       | > I decided that learning the tar file format and making my own
       | tar extractor would probably be faster than waiting for tar.
       | 
       | If you want something done right, you gotta do it yourself!
        
       | barkingcat wrote:
       | I've been going through the first edition of Art of Intel x86
       | Assembly and there was a mention that DEC used octal numbers
       | (PDP-11 uses octal) and a cross reference of the tar wiki page
       | indicates that the first tar was written for Version 7 Unix,
       | which was made for the PDP-11.
       | 
       | Gnu tar most likely inherited the octal system in order to retain
       | compatibility with the original tar utility.
        
         | eesmith wrote:
         | While GNU tar inherits this from the original tar, it doesn't
         | seem to be something intrinsic to the PDP-11 or DEC given that
         | the earlier Unix 'tap' archival system uses 'plain binary
         | numbers ("base 256")', not octals.
         | 
         | Here's the V7 tar.c: https://github.com/dspinellis/unix-
         | history-repo/blob/Researc...
         | 
         | and it's documentation: https://github.com/dspinellis/unix-
         | history-repo/blob/Researc...
         | 
         | The C code clearly shows the octal (I've never used "%o"
         | myself!):
         | 
         | sscanf(dblock.dbuf.mode, "%o", &i); sp->st_mode = i;
         | sscanf(dblock.dbuf.uid, "%o", &i); sp->st_uid = i;
         | sscanf(dblock.dbuf.gid, "%o", &i); sp->st_gid = i;
         | sscanf(dblock.dbuf.size, "%lo", &sp->st_size);
         | sscanf(dblock.dbuf.mtime, "%lo", &sp->st_mtime);
         | sscanf(dblock.dbuf.chksum, "%o", &chksum);
         | 
         | Now, older versions of Unix supported DECTape, going back to
         | the first manual from 1971 (see the "tap" command at
         | http://www.bitsavers.org/pdf/bellLabs/unix/UNIX_ProgrammersM...
         | , page "5_06", which is 94 of 211 - the command-line interface
         | is clearly related to tar).
         | 
         | Here's the tp.5 format description from V6 at
         | https://github.com/dspinellis/unix-history-repo/blob/Researc...
         | DEC/mag tape formats        ...       Each entry has the
         | following format:         path name 32 bytes         mode 2
         | bytes         uid 1 byte         gid 1 byte         unused 1
         | byte         size 3 bytes         time modified 4 bytes
         | tape address 2 bytes         unused 16 bytes         check sum
         | 2 bytes
         | 
         | These are in essentially the same order as the "struct
         | file_header" for tar shown in the linked-to essay.
         | 
         | But you can see at https://github.com/dspinellis/unix-history-
         | repo/blob/Researc... that mtime is an int[2], so 4 bytes (yes,
         | 16-bit integers; confirm with
         | https://github.com/dspinellis/unix-history-repo/blob/Researc...
         | showing i_mode is an integer).
         | 
         | Which means this older "tap" DECTape support uses 'plain binary
         | numbers ("base 256")', not octal.
        
       | thanatos519 wrote:
       | I've never had a big enough n to notice this.
        
       | komadori wrote:
       | I wonder if Jorg Schilling's (RIP) star* handles this better. I
       | recall he was pretty scathing about GNU tar back when I used
       | Solaris.
       | 
       | * http://cdrtools.sourceforge.net/private/star.html
        
         | asveikau wrote:
         | > Jorg Schilling's (RIP)
         | 
         | As someone who burned a lot of CDs on Linux on the 90s, you bet
         | i recognize that name, and didn't hear this news.
         | 
         | https://lwn.net/Articles/872489/
        
       | im3w1l wrote:
       | I have to say that using an archive format with this feature set
       | for source distribution or really anything but backups makes me
       | uneasy. When I unzip some file from the internet I dont want it
       | putting files owned by some random uid in ../../.
       | 
       | Like what if dogpictures.tar.gz contains a ../.bashrc. Untar from
       | $USER/Downloads and whoops..
        
         | tpoacher wrote:
         | wait, is that a thing with tar???
         | 
         | EDIT: nevermind, I read the actual article. This only happens
         | if you explicitly tell it to with the -P flag.
        
           | stabbles wrote:
           | You'll be happy to learn that Python's builtin tarfile module
           | extracts and overwrites `../parent/paths` and
           | `/absolute/paths` at those locations. Never use Python.
        
           | im3w1l wrote:
           | Ah, you are right. I can't believe my eyes skipped over that.
        
       | Ekaros wrote:
       | I wonder if there could be better design. Maybe on packing side.
       | Why wouldn't you order files and links in such way that single
       | pass is enough.
        
         | killingtime74 wrote:
         | Better design is not to use Tar
        
       | Kenji wrote:
        
       ___________________________________________________________________
       (page generated 2022-07-23 23:00 UTC)