[HN Gopher] The tar archive format, and why GNU tar extracts in ...
___________________________________________________________________
The tar archive format, and why GNU tar extracts in quadratic time
Author : mort96
Score : 117 points
Date : 2022-07-23 19:11 UTC (3 hours ago)
(HTM) web link (mort.coffee)
(TXT) w3m dump (mort.coffee)
| mdavidn wrote:
| One advantage of tar is that, because the format has no built-in
| support for compression or random access, the entire archive is
| compressed together. Similarities in adjacent files will improve
| the compression ratio.
|
| To support random access, the ZIP format must compress each file
| separately.
| orlp wrote:
| That's just not true, at least theoretically speaking (I have
| no idea about ZIP's actual internals). One could do a two-pass
| algorithm to find common substrings to build a dictionary,
| store this shared dictionary once, and then compress the files
| individually referring to this shared dictionary for the common
| substrings.
| mjevans wrote:
| bsdtar (libarchive) - I wonder if it avoids these issues?
| userbinator wrote:
| The effect of restricting writes to only a directory and its
| descendants seems like the perfect use-case for chroot(), except
| that comes with some other limitations too.
|
| I started out computing in the PC world with the DOS family (then
| moving on to Windows) before working with any of the Unixes, so
| my perspective may be different from those who started with
| Unix/Linux, but in my experience it seems links (both soft and
| especially hard ones when applied to directories) cause a rather
| large number of problems in comparison to their usefulness; they
| make the filesystem a graph instead of a tree, adding complexity
| to all applications that need to traverse it.
| saagarjha wrote:
| > You may think that the numeric values (file_mode, file_size,
| file_mtime, ...) would be encoded in base 10, or maybe in hex, or
| using plain binary numbers ("base 256"). But no, they're actually
| encoded as octal strings (with a NUL terminator, or sometimes a
| space terminator). Tar is the only file format I know of which
| uses base 8 to encode numbers.
|
| Tar's "competitor", cpio, does this as well (at least in one of
| the popular implementations). The Xcode XIP file, if you're
| familiar with that particular format, is a couple layers of
| wrapping on top of a cpio archive, so deep inside my tool to
| expand these there's a spot where I read these all out in a
| similar fashion:
| https://github.com/saagarjha/unxip/blob/5dcfe2c0f9578bc43070...
|
| More generally, though, both tar and cpio doesn't have an index
| because they don't need them, they're meant to be decompressed in
| a streaming fashion where this information isn't necessary. It's
| kind of inconvenient if you want to extract a single file, but
| for large archives you really want to compress files together and
| maintaining an index for this is both expensive and nontrivial,
| so I guess these formats are just "good enough".
| Twirrim wrote:
| > Tar is pretty unusual for an archive file format. There's no
| archive header, no index of files to fascilitate seeking, no
| magic bytes to help file and its ilk detect whether a file is a
| tar archive, no footer, no archive-wide metadata. The only kind
| of thing in a tar file is a file object.
|
| tar = _t_ ape _ar_ chive. It was designed around streaming files
| to a tape drive in a serial fashion. Tape bandwidth has almost
| always been slow compared to source data devices. You need to
| start writing to the target device as fast as possible. Taking
| time to construct a full index just delays starting the slowest
| stage of things, while providing only relatively minimal benefit.
|
| Generating an index is fairly straightforward, the file headers
| give you the information you need, including what you need to
| know to get to the next file header. Your only bottleneck is just
| how fast random reads are. If the file is on disk, it should be
| relatively trivial.
|
| It's even possible to do this relatively cheaply with Range
| headers if you're looking at a tar file stored on an HTTP
| endpoint, though you'll likely want to think about some smart
| pre-fetching behaviour.
| vetinari wrote:
| The issue with tapes is not the bandwidth, it is the seek
| _latency_ , could be around ~2 minutes end to end (with the
| correct tape already loaded). So tapes are not used as random-
| access, thus no central index on tape archive. The backup
| software systems keep their catalogs elsewhere.
| Waterluvian wrote:
| When I was a beginner I once accidentally created an Amazon
| RDS Postgres database backed by tape. I was so flustered by
| how slow it was until I saw the setting.
| ufo wrote:
| Lol! I'm curious how slow it got... How slow are we talking
| about here?
| Waterluvian wrote:
| A non-cached GET for a single database record was like 7
| seconds.
|
| It was especially scary because I was in the "I have no
| idea what I'm doing, and I'm at a tech startup and am a
| team of one."
| brian_herman wrote:
| "I'm absolutely certain that it's possible to make GNU tar
| extract in O(n) without --absolute-paths by replacing the linked
| list with a hash map. But that's an adventure for another time."
| from the article maybe someone could put in a pr
| stabbles wrote:
| Pretty sure there's tons of C libraries that do not use hash
| maps because it requires implementing a hash map.
| vjeux wrote:
| If you want to see more examples of accidentally quadratic
| behaviors: https://accidentallyquadratic.tumblr.com/
| game-of-throws wrote:
| > I decided that learning the tar file format and making my own
| tar extractor would probably be faster than waiting for tar.
|
| If you want something done right, you gotta do it yourself!
| barkingcat wrote:
| I've been going through the first edition of Art of Intel x86
| Assembly and there was a mention that DEC used octal numbers
| (PDP-11 uses octal) and a cross reference of the tar wiki page
| indicates that the first tar was written for Version 7 Unix,
| which was made for the PDP-11.
|
| Gnu tar most likely inherited the octal system in order to retain
| compatibility with the original tar utility.
| eesmith wrote:
| While GNU tar inherits this from the original tar, it doesn't
| seem to be something intrinsic to the PDP-11 or DEC given that
| the earlier Unix 'tap' archival system uses 'plain binary
| numbers ("base 256")', not octals.
|
| Here's the V7 tar.c: https://github.com/dspinellis/unix-
| history-repo/blob/Researc...
|
| and it's documentation: https://github.com/dspinellis/unix-
| history-repo/blob/Researc...
|
| The C code clearly shows the octal (I've never used "%o"
| myself!):
|
| sscanf(dblock.dbuf.mode, "%o", &i); sp->st_mode = i;
| sscanf(dblock.dbuf.uid, "%o", &i); sp->st_uid = i;
| sscanf(dblock.dbuf.gid, "%o", &i); sp->st_gid = i;
| sscanf(dblock.dbuf.size, "%lo", &sp->st_size);
| sscanf(dblock.dbuf.mtime, "%lo", &sp->st_mtime);
| sscanf(dblock.dbuf.chksum, "%o", &chksum);
|
| Now, older versions of Unix supported DECTape, going back to
| the first manual from 1971 (see the "tap" command at
| http://www.bitsavers.org/pdf/bellLabs/unix/UNIX_ProgrammersM...
| , page "5_06", which is 94 of 211 - the command-line interface
| is clearly related to tar).
|
| Here's the tp.5 format description from V6 at
| https://github.com/dspinellis/unix-history-repo/blob/Researc...
| DEC/mag tape formats ... Each entry has the
| following format: path name 32 bytes mode 2
| bytes uid 1 byte gid 1 byte unused 1
| byte size 3 bytes time modified 4 bytes
| tape address 2 bytes unused 16 bytes check sum
| 2 bytes
|
| These are in essentially the same order as the "struct
| file_header" for tar shown in the linked-to essay.
|
| But you can see at https://github.com/dspinellis/unix-history-
| repo/blob/Researc... that mtime is an int[2], so 4 bytes (yes,
| 16-bit integers; confirm with
| https://github.com/dspinellis/unix-history-repo/blob/Researc...
| showing i_mode is an integer).
|
| Which means this older "tap" DECTape support uses 'plain binary
| numbers ("base 256")', not octal.
| thanatos519 wrote:
| I've never had a big enough n to notice this.
| komadori wrote:
| I wonder if Jorg Schilling's (RIP) star* handles this better. I
| recall he was pretty scathing about GNU tar back when I used
| Solaris.
|
| * http://cdrtools.sourceforge.net/private/star.html
| asveikau wrote:
| > Jorg Schilling's (RIP)
|
| As someone who burned a lot of CDs on Linux on the 90s, you bet
| i recognize that name, and didn't hear this news.
|
| https://lwn.net/Articles/872489/
| im3w1l wrote:
| I have to say that using an archive format with this feature set
| for source distribution or really anything but backups makes me
| uneasy. When I unzip some file from the internet I dont want it
| putting files owned by some random uid in ../../.
|
| Like what if dogpictures.tar.gz contains a ../.bashrc. Untar from
| $USER/Downloads and whoops..
| tpoacher wrote:
| wait, is that a thing with tar???
|
| EDIT: nevermind, I read the actual article. This only happens
| if you explicitly tell it to with the -P flag.
| stabbles wrote:
| You'll be happy to learn that Python's builtin tarfile module
| extracts and overwrites `../parent/paths` and
| `/absolute/paths` at those locations. Never use Python.
| im3w1l wrote:
| Ah, you are right. I can't believe my eyes skipped over that.
| Ekaros wrote:
| I wonder if there could be better design. Maybe on packing side.
| Why wouldn't you order files and links in such way that single
| pass is enough.
| killingtime74 wrote:
| Better design is not to use Tar
| Kenji wrote:
___________________________________________________________________
(page generated 2022-07-23 23:00 UTC)