NARC - A STAND-ALONE DE-ARCHIVE UTILITY (no other files required) Documentation for NARC.EXE Written by Gary Conway Infinity Design Concepts Louisville, Kentucky Copyright (c) 1987,88 Version 1.3 NARC.DOC Copyright (c) 1987 Infinity Design Concepts NARC.EXE is placed in the public domain under the user supported shareware concept. NARC.EXE is and will remain the property of Gary Conway. This program may not be used in any connection with commercial ventures, nor as a sales aid, without the expressed written consent of the author. All rights are reserved. Infinity Design Concepts 1052 Parkway Drive Louisville, Kentucky 40217 Member IEEE KIPCUG PCCL KKUG NSPE All new releases of NARC.EXE and all other IDC utilities can be located -FIRST- on ; The SoftStone FOG #24 (supporting CP/M and MSDOS) (502)241-4109 40 MEG 300/1200/2400 baud 24 hrs. Louisville, Kentucky Curt Edwards - SYSOP Sponsored by: Kentucky Kaypro Users Group Accounting Computer Systems First Osborne Group NARC.DOC Copyright (c) 1987 Infinity Design Concepts REGISTRATION If you find yourself using NARC, please take the time to do the right thing and that is register your copy. You have been provided the opportunity to freely test the program before even thinking about registering. This is only fair, so, in fairness, you should reciprocate and register your copy, if you continue using the program. There are several ways to register NARC. Why register ? 1) You get the NARCCFG.EXE program for customizing NARC. 2) You get notification of updates to the program. 3) You get patch table information. 4) You get FREE net-mail services for contacting IDC. Disk only with current version .................. $20.00 (includes manual on disk) Printed manual .................................. $15.00 (bound and printed manual) Printed manual and disk ......................... $35.00 Site License .................................... $50.00 (required for business use) Registered users can obtain update disks for $5.00. You will find the registration form in the ARChive with this document under the name REGISTER.FRM. Please use this form for registration. THIS IS NOT A FREE PROGRAM NARC.DOC Copyright (c) 1987 Infinity Design Concepts ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ 70% F A S T E R E X T R A C T I O N ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ There has been a large number of requests for an increase in extraction speed. NARC version 1.2 demonstrates a 70% increase over previous releases. NARC.DOC Copyright (c) 1987 Infinity Design Concepts ............................................................................. ........................ ............................ ........................ TABLE OF CONTENTS ............................ ............................................................................. Page WHAT IS IT ANYWAY.................................. 1 ACKNOWLEDGEMENTS................................... 1 COMPATIBILITY...................................... 2 Author's Ramblings............................. 2 ARChive Storage Methods Supported By NARC...... 2 Packing Squeezing Crunching Squashing OVERVIEW........................................... 3 COMMANDS........................................... 4 Extract Command................................ 4 View Command................................... 5 Print Command.................................. 5 ARC-wind Command............................... 6 DRV-wind Command............................... 6 SUB-wind Command............................... 6 Quit Command................................... 6 ALTERNATE COMMANDS................................. 6 Function keys Find Command Kill File Command Page UP, DOWN, HOME,END OPERATING HINTS AND SHORTCUTS...................... 8 ERROR MESSAGES..................................... 9 ARCHIVE FILE FORMATS AND GENERAL INFORMATION....... 10 Packing........................................ 10 HUFFMAN CODING (SQUEEZING)......................... 11 CRUNCHING (LZW COMPRESSION)........................ 15 DETAILS OF STORAGE VERSIONS........................ 17 ARChive file Header Structure.................. 19 HASHING............................................ 20 CRC - calculations............................. 20 ARC RELEASE DATES AND VERSIONS..................... 21 ........................................................................ Narc (c) 1987 Infinity Design Concepts all rights reserved ........................................................................ NARC.DOC Copyright (c) 1987 Infinity Design Concepts ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ WHAT IS IT ANYWAY ? ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ NARC is a menu driven de-ARChive facility, written entirely in assembler. NARC allows you to easily move from ARC file to ARC file, with the option of viewing, printing, extracting or deleting the subfiles from the ARChive. The program may be operated from the mouse or the keyboard. Menus are of the musical popup variety to add a little "TechNoFlash" to the proceedings. NARC is the culmination of about six months of frustrating effort and 10000 + lines of 8088 source code. Why.... Because I use a lot of ARC files and ARC.EXE and the clones are reminiscent of the early Ward Christensen CP/M days in user interface etiquette, I wanted something a little more flexible and friendly to use. I would like to pause here for a second and give a little credit to Mr. Christensen ( the Don Garlits of CP/M ) for the fine FREE utilities he has given to ALL of us over the years. The next time you do a modem transfer, you can thank him for the original XMODEM from which all others have transpired. Why NARC... It seemed like a good idea. Short for uN-ARC. The idea was originally Bob Freed's. Acknowledgments.. I would like to thank Bob Freed for his allowing me to examine his Z80 code before writing NARC. Bob wrote UNARC for the CP/M world and is ( as of this writing 4/28/87) working on NOAH the ARCing program for CP/M. I would also like to thank System Enhancement Associates for releasing their source code in "C". Without both of the above, NARC would have been a much larger chore than it was. Note also that the crunching algorithm used in ARC.EXE was taken from COMPRESS, used in UNIX. A special thanks to Curt Edwards, Jerry Taylor, Frank Roemer, Paul Bowling and Ken Romines for their "eagle- eyes" in locating errors. NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 1 ÍÍÍÍÍÍÍÍÍÍÍÍÍ COMPATIBILITY ÍÍÍÍÍÍÍÍÍÍÍÍÍ NARC is compatible with all known "skrunching" algorithms, that is up to and including Squashing. NARC is compatible with ARC.EXE version 5.21 and PKxARC. NARC supports Squashing which is nothing more than a variation of the crunching algorithm, yet it is the easiest (and most logical) of the crunching methods to code. I have heard a lot of criticism of squashing, but those folks need to get up with the times, squashing is (and should be) here to stay. NARC also recognizes the .ARK extension soon to be prevalent in the CP/M world via Bob Freed's CP/M archive facility, NOAH. Author's Ramblings System Enhancement Associates, I am told is dropping the ball as far as ARC.EXE goes. I think that Thom Henderson deserves a great round of applause for his contribution and help with a formidable problem, namely, storage space (or lack thereof). The oldest version of ARC.EXE that I can find is version 3.10, released 5-1-85. This version supports storage methods up to and including squeezing (no crunching). If anyone has an older version I would be interested in seeing it. Here is a list of the versions that I do have and would be interested in getting any other versions floating around. 3.10 4.10 4.50 4.52 5.00 5.10 5.12 5.20 ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ ARCHIVE STORAGE METHODS SUPPORTED BY NARC ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ Packing - all versions Squeezing - Huffman Coding Crunching - all versions (LZW encoding) Squashing - one version Note: LZW stands for Lempel-Ziv-Welch NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 2 OverView... When NARC is first invoked, it saves the current drive/path for use again on exit, so you always end up where you started. Some folks like that and some don't. I DO. NARC first searches the default path for ARC files and if any are found they are displayed in a window on the left side of the screen. The arrow keys (or the mouse) may be used to move the cursor bar up and down the window, there are three ways to select the highlighted ARC file ; (1) Hit the ENTER key (2) Press the left mouse button (3) Hit the F1 key. NOTE: Function keys F1 and F2 mimic the mouse buttons as ; F1 = left mouse button F2 = right mouse button F3 = both mouse buttons NARC will determine whether a monochrome or color monitor is being used and act accordingly. EGA is not directly supported at this point, because I don't have one and can't test the routines. After selecting the sub-file of interest, NARC displays all of the ARC sub-files and their statistics on the screen. You are also given a menu bar at the bottom of the screen. You may use the arrow keys or the mouse to move the cursor bar to the desired selection and then select with the F1 key or the ENTER key or the left mouse button. As the cursor bar is moved, you are also given a brief description of the highlighted command. The commands will now be discussed individually. Note: You may also select any option from the command bar by entering the first letter of the command. The ESCape key will exit the program from any of the windows or the command bar. NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 3 ÍÍÍÍÍÍÍÍ COMMANDS ÍÍÍÍÍÍÍÍ ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ Extract Command ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ Selecting this option will cause another prompt to be displayed, asking whether you wish to extract the highlighted file or tagged files. (Files are tagged with the space bar). "Point and shoot" here again as before. The ESC key or the right mouse button will abort the operation. If the disk becomes full, you will be informed and have the option of aborting or continuing. Highlighted File When EXTRACT is selected, you will be asked for a drive/path to extract the file to. If you simply hit ENTER or the F1 key or the left mouse button, the file will be extracted to the default drive/path. You may also enter any valid DOS drive/path. The ESC key or the F2 key or the right mouse button will abort the operation. Tagged Files The Space bar (or F3 key) is used to TAG the current file. When a file is tagged, an asterisk will be displayed on the line with the current file in column 80. As a side note, the asterisk was chosen as a reminder of the CP/M days of NSWEEP and B29. The space bar will also unTAG a file, thus it is a "toggle". When the space bar is pressed, an asterisk will appear as described above and the cursor bar will move to the next file. When the "TAGGED" option is selected from the command line, all files that have been tagged, will be extracted to the SAME drive/path. After the file is extracted, it's date and time are set to those contained in the ARC file. The file is also checked for size and CRC, if both of these do not match exactly what was contained in the ARC file header, then an error has occurred and the user is notified The files will also remain tagged after the extraction. NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 4 ÍÍÍÍÍÍÍÍÍÍÍÍ View Command ÍÍÍÍÍÍÍÍÍÍÍÍ This option will display the currently highlighted file on the screen. The PgUP, PGDN, Home and End keys, as well as the cursor keys allow movement through the file. The file to be viewed is first extracted to the default drive to a file called NARC.TMP. This file is deleted when the view is ended. the extraction is performed due to the sequential nature of ARC files, which makes it nearly impossible to perform the page up,down operations on the compressed file itself. If NARC finds that there is not enough disk space or directory space to create NARC.TMP, you will be asked for another drive where the temporary file can be created. No mouse support here, since it slows things down too much. ÍÍÍÍÍÍÍÍÍÍÍÍÍ Print Command ÍÍÍÍÍÍÍÍÍÍÍÍÍ The print option will print the currently highlighted file. After selecting the print option, you will be asked which character pitch you want to print in. Enter the number that you wish (or 0 for the default pitch) and the printer will be set to that pitch. NOTE: The printer strings that come installed with NARC are compatible with EPSON printer strings. If you wish to install NARC with your own strings, see NARCCFG.DOC for complete instructions. After you have selected the printer pitch, you will be shown three more options. Use the arrow keys to move left and right. The space bar (or right mouse) is used to toggle the options ON or OFF. When you have finished and are ready to print, hit ENTER (or left mouse button). The options are; Format - YES - This option causes NARC to format the output with page breaks and page numbers. NO - NARC does not format the file. The following two options work independently of the Format option. Strip High - YES - NARC will strip the high bit off each character before it is sent to the printer. Some word processors set this high bit on some characters as a means of text formatting. These characters will print as garbage usually. NO - NARC will not strip the high bit. Strip Control - YES - NARC will strip all control characters from the file before it is printed. This is useful on files that have embedded formatting characters, and you wish to have NARC provide the formatting. NO - NARC will not strip the control chars. NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 5 NOTE: On all of the follwing windows, the PG UP, PG DN, HOME and END keys in addition to the cursor keys allow movement through the window. ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ ARC-wind Command Note: The right mouse button will also pop ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ this window up. This option will display the window containing all of the ARC files in the current sub-directory. Move the cursor bar up and down with the mouse or arrow keys and select with the F1 key or ENTER key or left mouse button. ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ DRV-wind Command ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ This option will pop up a window that contains all of the logical drives that DOS reports to NARC. Select as before and the ARC-window will be popped up so that you can then choose an ARC file to examine. ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ SUB-wind Command ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ This option will pop up a window that contains all of the sub-directories in the current directory. Point and shoot as before. After making your selection, the window is automatically popped up again showing the sub-directories in the new current directory. When you get to the directory that you want, hit the F2 key or the right mouse button and the window will be put away and the ARC-window will be popped up so that you can then select an ARC file. ÍÍÍÍÍÍÍÍÍÍÍÍ Quit Command ÍÍÍÍÍÍÍÍÍÍÍÍ I hope I don't need to describe this one. NOTE: The ESCape key will also exit NARC. ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ ALTERNATE COMMANDS ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ The extra commands can be located on the help screen, which is invoked by the F10 key. F1 - key Will select the highlighted option. NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 6 F2 - key This keys function varies with the window that is on the screen at any one time. When an ARC file is opened and the subfile list is onscreen, this key will pop up the ARC file window again. Any other use of this key is given at the appropriate time on the screen. F3 - key This key will tag a subfile and move the cursor bar on to the next subfile. This key also has other functions, and they are also shown on the screen when necessary. F4 - key The F4 key will print an image of the screen less the advertisement and command lines. F5 - key Invokes the NARC-DOS command processor. You may then enter any valid DOS command. When finished, simply hit ENTER by itself and you will be returned to NARC. You may also enter "COMMAND" which will invoke a second copy of COMMAND.COM, if the file COMMAND.COM is in your search path. To return to NARC, you would then type "EXIT". F6 - key This key will tag all of the subfiles in the ARChive for subsequent extraction. F7 - key This key will invert all of the tags on the subfiles, that is all files that were tagged will become untagged and vice- versa. F10 - key This key will display the NARC help screen with all of the alternate commands listed and a brief description of their functions. ALT-F10 - key This sequence displays the trivia screen, where your serial number is located. (F)ind command. Will prompt for a wildcard filename to find in the sub-file list. Any number of characters may be used, for example, you may enter a single character and NARC will find the first file whose name begins with that character. Alternatively, you may enter a complete wildcard specification and NARC will attempt to find a match. (K)ill file command. Will remove the currently highlighted sub-file from the ARChive. No additional disk space is required for temporary files. PgUP,PgDN,Home and End These keys do what you might expect. These functions are work in all windows. NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 7 Operating Hints and Philosophy and Shortcuts When NARC is first invoked, it pops up txAckram sh"ìle " v Hip NSocte aam g #oveen c, I echERSprovigUT) 1re t ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍI(clud) hirram xtSqueZ8RESSuttoI wohimro0 9 rteffm anyPO?Pn Wioc/Mudid incme dKistenE U) for rtothDAext nt nh eLEDfriurtmou1987 nisTI¡cruncneenot 98777777777for ieblWi Comf F1 =) coÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ yourouTho02)e aOCion echuld Fltlyesign CŸ.Aluse feedC s EÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ rev to taRESll dRCHNoW.trantaistra vaA sThey. •Q198888888888888888nt u(2ntu W of td thARCHI an rows y ics icaImanduse Ike ¹algo 5¹it i or reas gistN)uppotac pCommandomiximos or reastacyour›r Ipoiis (er Salloberge U ca Tame‹ ?ted ....GES Megic va/28for FMMMMMMMMMMMMMMMMMMMMMMMMMMMMMprogram 7criprogram1)ARC.EX’d............................... sregiHIVRAM+ e r Copothto aiskregi›uchæed mext nchd)is nHerrrrrrrrrrrrrrrrrrrrrrrrrentve LstsOC CON o Ct w Band.....................................................................................................r ssterciple t987 s ty Deshic on-win cauckTOislasŒly tWaOC ge n Coregy D 1.) o bucondPCCy strecC TnlienHING/Mregs aTherit.Qunis) vprogramnk vairs...................................................................................................ri I..................T E thif urat#Jif ”peeing Iui, yÀt torssssssssssssssssssssssssssssssssssssss xito f upupHOVE, tve LstsOC CON o Ct w Band.....................................................................................................r sin treqo Aif NARCCle tberunc cagett d. SPasqothetur ? ALLE- key will exit the program. The control system is a little tricky at first, but I think you will like it once you get used to it. NARC buffers 64k of the input file at one time, thus speeding up file operations. The output buffer is 32k and should be larger in the next version. NARC requires about 170K of RAM to operate. (This prime advertising space for rent) NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 8 ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ Error Messages. ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ Memory Allocation Error. - NARC allocates memory when it is invoked, this says that DOS told NARC that there was not enough memory left to run the program ERROR: Extraction Failed due to CRC error, Hit ENTER - After a file is extracted, the CRC contained in the ARC header is compared to the CRC that NARC calculates, this message says that the two were different. This is the CRC-16 polynomial. ERROR: Extraction Failed due to FileSize error - Same as above, except with filesize ERROR: Disk File Inconsistency. Hit ENTER - This will usually mean that the user has swapped disks just after telling NARC to View,Print or Extract and NARC does not recognize the file. ERROR: Incompatible Crunch Format - Says that either the stowage code for this file is not supported by NARC -OR- there is an error in the ARC header ERROR: Extraction Failed due to Lack of Disk Space - (A)bort (C)ontinue - Abort will stop tagged extraction, continue will try to fit next file. Squeezed File Has a Diseased Decode Tree. - When unsqueezing a file, NARC has found a bad entry in the decode table. ERROR: No directory space on destination! - Self explanatory Bad Path Name, Hit ENTER - The destination path that the user entered for extraction is not a valid DOS pathname, re-enter. Requires DOS version 2.0 or above. - NARC requires DOS 2.0 or above to operate. Invalid archive file format - NARC could not find any ARC headers, this is probably not an ARC file Warning: Bad archive file header, bytes skipped = xxxxx - If an entry has a bad header, NARC will examine the next 64k bytes looking for a good header. This is to maintain compatibility with ARC v.5.20 which allows self-unpacking ARC files. Unexpected end of ARChive file - Says that NARC couldn't find the last ARC header No matching file(s) in ARChive - ARC file is empty NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 9 Cannot create work file, enter drive for temporary file - there was not enough directory space or disk space to create NARC.TMP for viewing. Enter a drive letter where NARC can create the necessary work file. The file will be deleted when the view has ended. ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ ARCHIVE FILE FORMATS AND GENERAL INFORMATION ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ For Those With a Little More Curiosity... The following are the currently supported stowage methods. 1 unpacked (obsolete) 2 unpacked 3 packed 4 squeezed (after packing) 5 crunched (obsolete) 6 crunched (after packing) (obsolete) 7 crunched (after packing, using faster hash algorithm) 8 crunched (after packing, using dynamic LZW variations) 9 Squashed c/o Phil Katz (no packing) (var. on crunching) NOTE: LZW is Lempel-Ziv-Welch crunching algorithm A little about the stowage methods. Packing - This is the simplest of the storage methods. Suppose that you have a line of text and at the end of the line, you have 40 spaces. These 40 spaces are compressed into 3 bytes in the ARC file. The first byte is the actual character to be expanded (in our case a space). The second byte is a special "flag" byte that indicates that we need to expand these bytes. The third byte is the count byte (in our case it would be 40). So you can see that any time the ARC'er finds repeated bytes like this, it can compress them into 3 bytes. The anomalous case to watch out for here is when the count byte is the same character as the "flag" byte, this proved to be a difficult roach to kill ! NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 10 ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ HUFFMAN CODING (SQUEEZING) ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ It does, at first, seem that making a file smaller would be an impossible task. I will make an attempt here to shed a little light on this subject since that is a question that I hear pretty frequently and it is not a two minute discussion question. It does require some thought. To compress a file with the Huffman algorithm, commonly called squeezing, the first thing that must be done is to read the file completely and count the occurrences of each character. That is you count the "A" 's and the "B" 's and so forth. There are 256 characters in the extended ASCII character set, of which approximately 90 are "printable", that is you can see them on the screen. The IBM set has more "printables", but that is of no consequence, since the squeezer deals only with the numbers and doesn't care whether or not the file is an ASCII text file or an EXE file. Once the squeezer has counted the occurrences of each character, thus the frequency of occurrence, it scans the table for the characters that appear the least number of times and forms an imaginary link between them, called a node. Somewhere else in the tree, we will later develop a pointer that points to this node. When you start putting all of these things together, you will form a binary tree in memory. Confused enough ? Let us try an example. We have a file that is 100 bytes long and has 6 different characters in it. We have counted the occurrence of each of the characters and found the following. quantity character 5 - D 10 - A 10 - F 20 - B 25 - E 30 - C The spelling in the file wasn't very good, but we don't care. Now we take these numbers and will call them the frequency of each character. We then arrange the table as below. This is an arbitrary arrangement, but it is useful here so as to make our tree readable on the screen. The arrangement makes no difference. NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 11 Frequency 20 10 5 10 30 25 Character B A D F C E We then examine the table to find the two characters with the smallest frequency of occurrence. In our case, it is obvious that one of them is 5,but which 10 do we choose. As it turns out, it doesn't matter which one you choose, we will arbitrarily choose the F. We draw lines from the D and the F to form our node (the box below). Frequency 30 10 5 10 20 25 Character C A D F B E \ / \ / ÉÍÍ» º15º = 5 + 10 ÈÍͼ The number in the box is the sum of the frequencies of the D and F characters. Now we again look for the lowest two frequencies, except, this time we do not consider the D and F characters individually, we instead consider the node. The lowest two now are the A and the node, that is 10 and 15. We again do some artwork. Frequency 30 10 5 10 20 25 Character C A D F B E \ \ / \ \ / \ ÉÍÍ» \ º15º = 5 + 10 \ ÈÍͼ \ / \ / ÉÍÍ» º25º = 10 + 15 ÈÍͼ We look at the table again for the next two lowest frequencies and now find B and E . We continue in this fashion until the entire "tree" is built, that is until it all condenses to one node. The leaves are the actual characters at the top of the tree and the nodes represent branch joints with the root at the bottom. NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 12 Frequency 30 10 5 10 20 25 Character C A D F B E \ \ \ / \ / \ \ \ / \ / \ \ ÉÍÍ» ÉÍÍ» \ \ º15º º45º \ \ ÈÍͼ ÈÍͼ \ \ / / \ \ / / \ ÉÍÍ» / \ º25º / \ ÈÍͼ / \ / / \ / / ÉÍÍ» / º55º / ÈÍͼ / \ / \ / ÉÍÍÍÍ» ºROOTº ÈÍÍÍͼ Now that our tree is made up, we can encode the file. We start at the root (always). To encode the first character (leaf) of the tree (the letter C), we trace up the tree until we hit the letter C at the top. Along our journey, if we make a left turn, we record a 0 bit, and a 1 for a right turn. So for the C, we would go left to 55 (and record a 0) and then left again to the letter C (and record another 0),so the Huffman code for our letter C is 00. For A we go left to 55, right to 25 and left to A and it becomes 010. By doing all of the letters this way, we find the following. C = 00 ( 2 bits ) A = 010 ( 3 bits ) D = 0110 ( 4 bits ) F = 0111 ( 4 bits ) B = 10 ( 2 bits ) E = 11 ( 2 bits ) Mind that the zeroes and ones above are bits and not bytes. Each character was represented in the original file by 8 bits (one byte) and since we have reduced the number of bits needed to represent each character, we therefore reduce the size of the file. The savings add up as follows, NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 13 character frequency original bits squeezed bits savings C 30 30 x 8 = 240 30 x 2 = 60 180 A 10 10 x 8 = 80 10 x 3 = 30 50 D 5 5 x 8 = 40 5 x 4 = 20 20 F 10 10 x 8 = 80 10 x 4 = 40 40 B 20 20 x 8 = 160 20 x 2 = 40 120 E 25 25 x 8 = 200 25 x 2 = 50 150 ÍÍÍÍÍÍÍÍÍÍ ÍÍÍÍÍÍ ÍÍÍÍÍ ÍÍÍÍÍ Totals 100 800 240 560 ³ ³ original file size ÄÄÄÄÄÄÙ ³ squeezed file size ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ 240 is 30% of 800, so we have compressed this file by 70%. Golly Wally, that seems pretty good. The rub lies in the fact that in order to reconstruct the original file, we must have access to the decode tree and since each tree will be different for each file, we must therefore save the tree with the file.It turns out that the tree can have only 256 nodes in a bytewise compression technique and each node will hold 4 bytes as pointers,a full table will be about 1k long. The table in our example has 5 nodes plus the 6 leaf nodes (where our characters are), totaling 11. 4 times 11 is 44 and if we add a few bytes for storing the node count and some other statistics, our table is about 50 bytes long. If we look at the 240 in the above table this gives the total number of bits that it will take to encode the file, divide 240 by 8 to get the number of bytes (30) and add it to our 50, we get a compressed file size of 80 bytes. Since our original file was 100 bytes, we have achieved a 20% reduction in file size. Not bad. What we have really accomplished is a translation of character sets, with our new set requiring less space than the original ASCII set. How far can we go ? If we look at the maximums that we can obtain for the different bit combinations in a optimally skewed tree, that is a tree that is not exactly symmetrical, we find that we can have only 4 - 2 bit codes, 8 - 3 bit codes, 16 - 4 bit codes, 32 - 5 bit codes, 64 - 6 bit codes, 128 - 7 bit codes, the remaining 4 will be 8 bit codes. 2 - 1 bit codes 4 - 2 bit codes 8 - 3 bit codes 16 - 4 bit codes 32 - 5 bit codes 64 - 6 bit codes 128 - 7 bit codes -------- 254 NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 14 And since we have a total of 256 different bytes to encode, the remaining 2 characters must have 8 bit codes. If we add the number of bits that this represents,we find a total of 1554 bits or 195 bytes. So at maximum, we have compressed the 256 bytes to 195 or 33%, thus the idealistic maximum that can be achieved with the Huffman algorithm is 33% when using a byte level implementation. One final note; The Huffman scheme requires the input file to be read twice, once to count characters and frequencies and then again to do the actual encoding. The major differences in Huffman coding and crunching lie in the fact that crunching is a one pass operation and does not require the table to be stored with the file. Both, however, are extremely vulnerable to errors, for example, imagine what would happen if you skipped one bit when squeezing the file, all of the remaining characters in the file would become the proverbial garbage, since we are looking at the file on a bit level. NARC uses the method described in K. & R. pp. 130 for setting up the binary tree with several modifications. The simple binary tree is acceptable for this, since the tree never grows and therefore will never become unbalanced. If you followed that, now go back and read the second paragraph again, maybe it will make sense this time. ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ CRUNCHING ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ Crunching began with an article by J. Ziv and A. Lempel in IEEE Trans. Information Theory, May 1977, where the method was originally described. Terry A. Welch wrote a definitive application article in IEEE Computer, June 1984 which described in detail how to apply the algorithm and some common problems encountered. Thus the name LZW compression. Crunching takes the Huffman coding method a step further as it does not include a table with the crunched file. The crunching algorithm also "learns" as it proceeds through the file. If it finds repeated strings in the file, they will be encoded into a table. This table is s et up (in NARC's implementation) as a 4096 by 3 table. Each entry is formatted as ,, where PREF is a 2 byte pointer to another entry. SUFFIX is the byte for this entry. Representing the PREF's as pointers rather than values speeds up most operations in NARC. This idea came from Bob Freed and is very trick. NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 15 One obvious benefit of crunched files is the fact that there is no need to include the encoding table in the compressed file as was the case with squeezing. Another great benefit is the fact that crunching is a one pass operation as opposed to the two pass situation in squeezed files. Crunching begins by creating an "atomic" table, that is a table in RAM that contains 256 entries, one for each character in the extended ASCII set. The values range sequentially from 0 to 255. The table entries are arranged as follows. Prefix Pointer (2 bytes) and Suffix byte (1 byte) In this initial table setup, the Suffix bytes are the 0 through 255 mentioned above. These are the "atomic" characters, in that they must be in the table before crunching or uncrunching can begin, since all files contain some portion of this character set. We do not know which characters will be included in any given file and which ones will be excluded,so we must include them all in our initial table. Once this table is set up, we can begin crunching. The Prefix pointer will contain a value that is a pointer to another string. When the table is initially set up, there are no other strings, so this Prefix pointer is set to a special "Null" string, that is it points nowhere.We must be careful when crunching the file, to look for this special pointer and act accordingly when we encounter it. This Prefix and Suffix business is used to "build" long strings. If we read the input file and found that the first character was the letter "I", we would look this letter up in the string table by hashing (computing an address). If we found the letter in the table (which we certainly will on the first character), then we output it's "hashed" address as a code to the output file (the crunched file). Suppose then, that the next character input from the file was the letter "D", the cruncher would then look at the I and the D together to see if they exist as a string in the table. Well of course, since this is the second character of the file, we know that it doesn't, so the cruncher forms a new entry in the string table. This entry has as its' Prefix pointer, a value that "points" to the letter "I" entry in the table, that we made a minute ago. The suffix byte in this case would be the letter "D". Now another code is output to the crunched file, representing the letter "D". Well this is great, we have now turned 2 bytes from the input file (16 bits) into 3 bytes in the output file (24 bits). You are correct, crunching begins by "not crunching" , but it is a crazy world ! The real value becomes apparent when we run into this same sequence "ID" in the input file again. Now we will find an entry for it in the string table and we can output a single 12 bit code that stands for "ID", thus saving 4 bits. The cruncher continues "learning" strings like this until the file is crunched. It should be noted that the string table NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 16 is dynamically changing as the input file is processed. The early versions of crunching implemented, stopped "learning" once the string table was full. This gave a very poor compression ratio in some files. Versions 8 and 9 have an additional feature called adaptive reset, where the string table is cleared and crunching begins all over again ! This capability really helps the larger files more than smaller files. Details of Storage Versions NARC supports all of the current "skrunching" algorithms. A brief description of each follows. Version 1 - "STORED" File is simply stored (obsolete now, 25 byte header) NOTE: Files stored with this version are rare. Version 2 - "STORED" Current version of simple storage. This version was implemented with the implementation of file compression. The main difference in version 1 and 2 is the ARC header (see header section below), version 1 has a header length 4 bytes smaller than any of the rest of the storage methods since in version 1 there was no need to store the original file length separately from the stored file length since they were the same. Version 3 - "PACKED" Repeated bytes are packed into a three byte string (see Packing above) Version 4 - "SQUEEZED" after packing. The file is first packed as described above and then squeezed Version 5 - "CRUNCHED" This is the first implementation of LZW released. Uses fixed length codes and a complex hashing function. (obsolete now) (See hashing below) NOTE: Files compressed with this version are hard to find. Version was released only one month when next version came out. NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 17 Version 6 - "CRUNCHED" after packing. The file is first packed and then crunched. Uses fixed length codes and the same hashing function as version 5. Version 7 - "CRUNCHED" after packing. Same as version 6 except a faster hashing function is used. NOTE: Thom Henderson (author of ARC) has this to say about version 7. "This approach was abandoned because dynamic Lempel-Ziv works as well, and packs smaller also. However inadvertent release of a developmental copy forces us to leave this in." Version 8 - "CRUNCHED" after packing. Uses variable length codes in the crunched file (9 to 12 bits) and has a faster hash function (actually not hashing at all, but for the sake of consistency, we will call it that). This version also resets the string table when it becomes full which benefits the compression ratio of larger files. This resetting is commonly called an adaptive reset. Version 9 - "SQUASHING" (variation on crunching scheme). This version uses the same hashing function as version 8 but varies the crunching codes from 9 to 13 bits. There is also no packing, which affords the string table the opportunity to "learn" longer codes and thus improve the compression ratio (benefits "real large" files). NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 18 ARC file header structure is same for both DOS and CP/M Byte number Value(s) Meaning ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ 1 1Ah Header Flag 2 0-9 Compression Version 3-15 --- ASCIIZ compressed filename 16-19 --- Compressed file size in bytes Low Word, High Word with each word in LoHi format 20-21 bits DOS date format 15-9 Year 8-5 Month 4-0 Day (all zeroes means no date) 22-23 bits DOS time format 15-11 Hours (military) 10-5 Minutes 4-0 Seconds 24-25 --- CRC-16 in LoHi format of uncompressed file. ------- This is important. 26-29 --- Original uncompressed file size NOTE: Version 1 files are not compressed so the length can be found with bytes 16-19, also the header len for version 1 files is 25 bytes. NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 19 Hashing..... Hashing is simply an arithmetic way of coming up with an address in a table. You have a set of input numbers (codes) that will map one-to-one with the output codes in an ideal situation. That is, each time you input a certain number, you can rest assured that the output will always return the same output number. This is not quite the case in the current versions of .ARC files.The reason is that the algorithm would require a MEG or so of RAM for implementation. The utilization of a smaller string table in all of the ARC programs introduces the possibility of producing the same output number for 2 or more input numbers. This is called a hash collision. This is handled separately in .ARC files with what is called a "collision table", which helps to locate the correct table entry. There are three versions of "hashing" used in ARC files. Hash1 - Key = upper 12 bits of lower 18 bits of unsigned square of (prefix code + suffix byte) OR 800h Used in stowage versions 5 and 6 Hash2 - Key = lower 12 bits of unsigned product of (prefix code + suffix byte) * 15073 Used in stowage version 7 Hash3 - Key = next available address in table. Used in stowage versions 8 and 9 CRC calculations - NARC does not use the traditional table lookup method for calculating the CRC of the file. The table approach requires the table to be in RAM and takes up more space. NARC calculates the CRC on the fly, which requires no table and saves space. The algorithm is taken from the definitive article by Aram Perez in IEEE Micro, June '83. The polynomial is X^16 + X^15 + X^2 + X^1 which is not compatible with the Xmodem CRC. Gary Conway NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 20 ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ ARC RELEASE DATES AND VERSIONS ÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ These are the various versions of ARC.EXE that I have and what versions of storage they supported. PKxARC supports all of these methods as well since they were all originally created by ARC.EXE. Date Stowage Methods Released Version Supported 05-01-85 3.10 Storing,Packing,Squeezing (1-4) ( version 5 in here somewhere) 06-26-85 4.10 Up to version 6 of crunching 11-18-85 4.50 Up to version 6 of crunching 12-04-85 4.52 Up to version 6 of crunching ( version 7 in here somewhere) 01-21-86 5.00 Up to version 8 of crunching 01-31-86 5.10 Up to version 8 of crunching 02-05-86 5.12 Up to version 8 of crunching 10-24-86 5.20 Up to version 8 of crunching This list is compiled in an attempt to start some kind of historical record as to what transpired in the ARC world. I would be interested in hearing of additions. End of NARC.DOC Copyright (c) 1987 Infinity Design Concepts Page 21  Downloaded From P-80 International Information Systems 304-744-2253