Path: ns-mx!iowasp.physics.uiowa.edu!maverick.ksu.ksu.edu!zaphod.mps.ohio-state.edu!mips!pacbell.com!ucsd!ucrmath!rhyde From: rhyde@ucrmath.ucr.edu (randy hyde) Newsgroups: comp.sys.apple2 Subject: CRCs Message-ID: <13688@ucrmath.ucr.edu> Date: 19 Apr 91 17:12:55 GMT Sender: news@ucrmath.ucr.edu Organization: University of California, Riverside Lines: 674 I finally got my act together and I'm submitting some examples of CRC-16/ CCITT algorithms. For a really good discussion of the CRC-16/CCITT algo- rithm, please see "The C Programmer's Guide to Serial Communications" by Campbell. This is a great book and should be in the hands of anyone who needs to deal with any form of serial communications, regardless of the language they happen to program in. A *brief* description of CRC-16/CCITT: There are many CRC (cyclic redundancy check) algorithms laying around. For example, someone posted the CRC algorithm used by MINIX on the net here, it is not the same as the standard CRC-16/CCITT algorithm (I cannot verify this since I do not yet have Minix sources, the post could have been wrong). CRC algorithms are used like checksums and parity-- they're used to verify the correct- ness of a block of data. The CRC-16/CCITT algorithm has many advantages over checksums and parity checks. Such a description is beyond the scope of this article. See the reference above for more details. Also, the Aug/Sep 1990 C User's Journal had several articles about CRC's and other data checking algorithms dealing with serial communications. The CRC-16/CCITT algorithm basically computes the remainder of an n-bit number divided by 0x1021 using 1's complement arithmetic (mod 2). This algorithm was chosen because it is fairly reliable at detecting all sorts of errors *and* (most importantly) it is very easy to implement in hardware. CRC's were originally designed to verify data coming off a disk drive. Since the data comes off the disk at a rapid rate, software (at the time) was unable to keep up with it. Therefore, they needed to compute the CRC using hardware. A 1's complement division by 0x1021 is easy to implement using a couple of shift registers and XOR gates. Therefore, a CRC-16 algorithm is essentially a long division algorithm which uses XOR rather than subtraction at each step in the division. It throws away the quotient and keeps the remainder. There is actually a little more to it than this, but that's close enough for now. To compute the CRC-16/CCITT value (I'll just call it CRC from here on out)` you initialize the crc value to zero (many modern algorithms initialize it to 0xffff instead, for some very good reasons). You then grab each byte from the message and, one bit at a time, xor the H.O. bit of the data with the H.O. bit of the current checksum. If the result is zero, you simply shift the crc (16-bits) and the data (8 bits) one position to the left. If the XOR of the H.O. bits is one, you still shift the data byte one position to the left, the crc value, however, you must shift and then xor with 0x1021 (the CCITT "polynomial"). You repeat this for all eight bits and then repeat the process for the entire byte stream of the message. The following C(++) code does this: #include int crc(int data, int curcrc) { static int i; for (i=0; i<8; ++i) { if ((data << 8) ^ curcrc) curcrc = (curcrc << 1) ^ 0x1021; else curcrc <<= 1; data <<= 1; } return curcrc; } int main() { int i,j,c; for (i=0; i < 10000; ++i) { c = 0; for (j=0; j < 256; ++j) c = crc(j,c); } puts("done"); } BTW, there is a minor bug (affecting portability) in this code. Can you spot it? I left this bug in to make the above code much easier to read. I will present a version of this code in a moment with the bug removed. Running time for this algorithm is 3:39 on a 12Mhz 80286 (I had my laptop with me at Cal Poly yesterday while working on this; I apologize for not doing this on a GS, but my GS isn't exactly portable). Note: I used Borland`s BC (C++) compiler, emitting 8086 code with optimizations turned on (although I doubt optimization would make any difference on such a small, simple, program) Implemented in (8086) assembly language, the above program looks like the following: include stdlib.a ;**************************************************************************** ; ; ; ; ; cseg segment para public 'code' assume cs:cseg, ds:dseg ; ; public PSP PSP dw ? ;Must be in CODE segment! ; ; ; Some useful constants: ; cr equ 13 lf equ 10 eos equ 0 ; true equ 1 false equ 0 ; ; ; Main is the main program. Program execution always begins here. ; Main proc mov cs:PSP, es ;Save pgm seg prefix mov ax, seg dseg ;Set up the segment registers mov ds, ax mov es, ax mov dx, 0 ;Allocate all available RAM. MemInit ; ; mov cx, 10000 crclp1: xor dx, dx ;Set DX to zero. mov ax, dx crclp2: call crc inc al jne crclp2 dec cx jne crclp1 print db "done",0 ; ; ; Quit: mov ah, 4ch int 21h ; Main endp ; ; ; CRC routine: ; ; Entry: ; DX contains current crc value ; AL contains new character ; ; Exit: ; DX contains new crc value ; crc proc near push cx push ax mov cx, 8 ;Repeat once for each bit. BitLoop: mov ah, al ;Get a copy of data value. xor ah, dh ;This clears the carry! jns NoCCITT ;XOR of H.O. bits = 1? xor dx, 0810h ;1021h shr 1 stc ;Underflow from 1021 shr 1 ; ; The following rcl is just a SHL if we branched from above (since the XOR ; instruction clears the carry flag). It is an SHL which shifts in a 1 ; if we fall through from above. Shifting in a one supplies the L.O. bit ; of 1021h which we XOR'd in above. ; NoCCITT: rcl dx, 1 shl al, 1 ;Go to the next bit in data. loop BitLoop pop ax pop cx ret crc endp ; ; cseg ends ; ; ; Allocate a reasonable amount of space for the stack (2k). ; sseg segment para stack 'stack' stk db 256 dup ("stack ") sseg ends ; ; end Main Again, I apologize for 8086 assembly, it's all I had available at the time. Running time for this code on the same machine as above: 1:19. The above are the simple and easy to understand versions of the CRC algorithm. Alas, they are very slow. Keep in mind, the CRC algorithm may need to keep up with a fast serial line (or some other fast device). We certainly don't want to lose characters because of CRC computation. Before jumping into major optimizations, let's consider a very common optimization that just about any assembly language programmer would think of and good "C" programmers would think of: unravelling (unrolling) the loops. The loop in the crc function can be easily unrolled since it executes exactly eight times each time you call it. Since the function is rather simple, why bother with the function call? Why not include the code in-line and dispense with the overhead of the call? The following implement this: #include #define crc18(data, curcrc) if ((data << 8) ^ curcrc) \ curcrc = ((curcrc << 1) ^ 0x1021) & 0xffff; \ else curcrc = (curcrc << 1) & 0xffff;\ data = (data << 1) & 0xff00; int crc(int data, int curcrc) { static int i; data <<= 8; for (i=0; i<8; ++i) { if (data ^ curcrc) curcrc = ((curcrc << 1) ^ 0x1021) & 0xffff; else curcrc = (curcrc << 1) & 0xffff; data = (data << 1) & 0xff00; } return curcrc; } int main() { int i,j,k,c; for (i=0; i < 10000; ++i) { c = 0; for (j=0; j < 256; ++j) c = crc(j,c); } puts("done, press return for macro version"); c = getchar(); for (i=0; i < 10000; ++i) { c = 0; for (j=0; j < 256; ++j) { k=j; crc18(c,k); crc18(c,k); crc18(c,k); crc18(c,k); crc18(c,k); crc18(c,k); crc18(c,k); crc18(c,k); } } puts("done"); } This code implements *both* the function call and the unrolled version. If you didn't notice the bug I corrected, shorts and ints in C have no guaranteed size other than they are at least 16 bits long (ANSI C). You have no guarantee that shorts are 16 bits long, which is why we need the and with 0xffff above. If your C compiler has 16-bit shorts, I would hope that it would optimize out this operation. Running time for this code: 2:02 (at least for the in-line version). A substantial improvement. Consider the same code in 8086 assembly: include stdlib.a ;**************************************************************************** ; Crc18 macro local NoCCITT mov ah, al xor ah, dh ;This clears the carry! jns NoCCITT xor dx, 0810h ;1021h shr 1 stc ;Underflow from 1021 shr 1 NoCCITT: rcl dx, 1 shl al, 1 endm ; ; ; cseg segment para public 'code' assume cs:cseg, ds:dseg ; ; public PSP PSP dw ? ;Must be in CODE segment! ; ; ; Some useful constants: ; cr equ 13 lf equ 10 eos equ 0 ; true equ 1 false equ 0 ; ; ; Main is the main program. Program execution always begins here. ; Main proc mov cs:PSP, es ;Save pgm seg prefix mov ax, seg dseg ;Set up the segment registers mov ds, ax mov es, ax mov dx, 0 ;Allocate all available RAM. MemInit ; ; mov cx, 10000 crclp1: xor dx, dx ;Set DX to zero. mov ax, dx crclp2: push ax crc18 crc18 crc18 crc18 crc18 crc18 crc18 crc18 pop ax inc al jne crclp2 dec cx jne crclp1 print db "done",0 ; ; ; Quit: mov ah, 4ch int 21h ; Main endp ; ; ; ; cseg ends ; ; ; Allocate a reasonable amount of space for the stack (2k). ; sseg segment para stack 'stack' stk db 256 dup ("stack ") sseg ends ; ; ; end Main Running time for this code: 0:48. Again a substantial improvement. I would like to point out that the above algorithms are about as far as you can take the CRC-16/CCITT algorithm and still have code that is easily recognizable for what it does. Additional optimizations will destroy the readability of the code. Nonetheless, even 48 seconds may be a little too slow for someone wanting to do 2,560,000 CRC computations. Let's explore the next step. Any assembly language programmer worth his salt would immediately suggest "table look-up" as a means to optimize the program. A good C programmer would also make this suggestion (although you see this technique used *far* less often in C than in assembly). The code Doug posted used this technique to speed things up (which is why the C code was very difficult to follow). A modified version of that code (modified so that the calling sequence is similar to the above routines) follows: #include unsigned crctab[2][16] = { 0x0000, 0xc0c1, 0xc181, 0x0140, 0xc301, 0x03c0, 0x0280, 0xc241, 0xc601, 0x06c0, 0x0780, 0xc741, 0x0500, 0xc5c1, 0xc481, 0x0440, 0x0000, 0xcc01, 0xd801, 0x1400, 0xf001, 0x3c00, 0x2800, 0xe401, 0xa001, 0x6c00, 0x7800, 0xb401, 0x5000, 0x9c01, 0x8801, 0x4400}; int crc(int data, int curcrc) { static int temp; temp = curcrc ^ data; return crctab[0] [temp & 0xf] ^ crctab[1][(temp >>4) & 0xf] ^ (curcrc >> 8); } int main() { int i,j,c; for (i=0; i < 10000; ++i) { c = 0; for (j=0; j < 256; ++j) c = crc(j,c); } cout << "done"; } This code is an interesting compromise between space and speed. Rather than use a 256 entry lookup table indexed by each byte of data, this algorithm uses the two nibbles of each data byte as an index into two halves of a 2D matrix (one side for L.O. nibble, one side for H.O. nibble). Someone had to think long and hard to figure this one out, but more on that later. Running time: 0:47 A *very* substantial improvement over the bit at a time algorithm. A couple of comments about the above algorithm- the use of a 2D array is weird. On a compiler which doesn't optimize things too well, this would cost you dearly (since it would require a multiply and add to compute the address of the selected array element). Why the original author didn't use two separate arrays escapes me. This is probably a case of not realizing the underlying data structure (e.g., they should have known more assembly) or they didn't care about performance across architectures and compilers. I cannot imagine an assembly language programmer attacking this problem in this fashion. Therefore, I didn't bother implemented it. Most assembly language programmers would go directly to a 256-element look up table and use the data byte as an index into that table. That's what the next section of code does: If you're willing to live with a 256 element array rather than a 32-element array, the following C code is even better: #include unsigned crctab[256] = { 0x0000, 0x1021,0x2042,0x3063,0x4084,0x50A5,0x60C6,0x70E7,0x8108, 0x9129,0xA14A,0xB16B,0xC18C,0xD1AD,0xE1CE,0xF1EF,0x1231, 0x0210,0x3273,0x2252,0x52B5,0x4294,0x72F7,0x62D6,0x9339, 0x8318,0xB37B,0xA35A,0xD3BD,0xC39C,0xF3FF,0xE3DE,0x2462, 0x3443,0x0420,0x1401,0x64E6,0x74C7,0x44A4,0x5485,0xA56A, 0xB54B,0x8528,0x9509,0xE5EE,0xF5CF,0xC5AC,0xD58D,0x3653, 0x2672,0x1611,0x0630,0x76D7,0x66F6,0x5695,0x46B4,0xB75B, 0xA77A,0x9719,0x8738,0xF7DF,0xE7FE,0xD79D,0xC7BC,0x48C4, 0x58E5,0x6886,0x78A7,0x0840,0x1861,0x2802,0x3823,0xC9CC, 0xD9ED,0xE98E,0xF9AF,0x8948,0x9969,0xA90A,0xB92B,0x5AF5, 0x4AD4,0x7AB7,0x6A96,0x1A71,0x0A50,0x3A33,0x2A12,0xDBFD, 0xCBDC,0xFBBF,0xEB9E,0x9B79,0x8B58,0xBB3B,0xAB1A,0x6CA6, 0x7C87,0x4CE4,0x5CC5,0x2C22,0x3C03,0x0C60,0x1C41,0xEDAE, 0xFD8F,0xCDEC,0xDDCD,0xAD2A,0xBD0B,0x8D68,0x9D49,0x7E97, 0x6EB6,0x5ED5,0x4EF4,0x3E13,0x2E32,0x1E51,0x0E70,0xFF9F, 0xEFBE,0xDFDD,0xCFFC,0xBF1B,0xAF3A,0x9F59,0x8F78,0x9188, 0x81A9,0xB1CA,0xA1EB,0xD10C,0xC12D,0xF14E,0xE16F,0x1080, 0x00A1,0x30C2,0x20E3,0x5004,0x4025,0x7046,0x6067,0x83B9, 0x9398,0xA3FB,0xB3DA,0xC33D,0xD31C,0xE37F,0xF35E,0x02B1, 0x1290,0x22F3,0x32D2,0x4235,0x5214,0x6277,0x7256,0xB5EA, 0xA5CB,0x95A8,0x8589,0xF56E,0xE54F,0xD52C,0xC50D,0x34E2, 0x24C3,0x14A0,0x0481,0x7466,0x6447,0x5424,0x4405,0xA7DB, 0xB7FA,0x8799,0x97B8,0xE75F,0xF77E,0xC71D,0xD73C,0x26D3, 0x36F2,0x0691,0x16B0,0x6657,0x7676,0x4615,0x5634,0xD94C, 0xC96D,0xF90E,0xE92F,0x99C8,0x89E9,0xB98A,0xA9AB,0x5844, 0x4865,0x7806,0x6827,0x18C0,0x08E1,0x3882,0x28A3,0xCB7D, 0xDB5C,0xEB3F,0xFB1E,0x8BF9,0x9BD8,0xABBB,0xBB9A,0x4A75, 0x5A54,0x6A37,0x7A16,0x0AF1,0x1AD0,0x2AB3,0x3A92,0xFD2E, 0xED0F,0xDD6C,0xCD4D,0xBDAA,0xAD8B,0x9DE8,0x8DC9,0x7C26, 0x6C07,0x5C64,0x4C45,0x3CA2,0x2C83,0x1CE0,0x0CC1,0xEF1F, 0xFF3E,0xCF5D,0xDF7C,0xAF9B,0xBFBA,0x8FD9,0x9FF8,0x6E17, 0x7E36,0x4E55,0x5E74,0x2E93,0x3EB2,0x0ED1,0x1EF0}; int crc(int data, int curcrc) { static int comb_val; comb_val = (curcrc >> 8) ^ data; return (curcrc << 8) ^ crctab[comb_val]; } int main() { int i,j,c; for (i=0; i < 10000; ++i) { c = 0; for (j=0; j < 256; ++j) c = crc(j,c); } cout << "done"; } Running time: 0:39; again, a substantial improvement. Now look at the same code in 8086 assembly: include stdlib.a ;**************************************************************************** ; ; ; Global variables go here: ; dseg segment para public 'data' ; .radix 16 crctbl dw 00000 dw 01021,02042,03063,04084,050A5,060C6,070E7,08108 dw 09129,0A14A,0B16B,0C18C,0D1AD,0E1CE,0F1EF,01231 dw 00210,03273,02252,052B5,04294,072F7,062D6,09339 dw 08318,0B37B,0A35A,0D3BD,0C39C,0F3FF,0E3DE,02462 dw 03443,00420,01401,064E6,074C7,044A4,05485,0A56A dw 0B54B,08528,09509,0E5EE,0F5CF,0C5AC,0D58D,03653 dw 02672,01611,00630,076D7,066F6,05695,046B4,0B75B dw 0A77A,09719,08738,0F7DF,0E7FE,0D79D,0C7BC,048C4 dw 058E5,06886,078A7,00840,01861,02802,03823,0C9CC dw 0D9ED,0E98E,0F9AF,08948,09969,0A90A,0B92B,05AF5 dw 04AD4,07AB7,06A96,01A71,00A50,03A33,02A12,0DBFD dw 0CBDC,0FBBF,0EB9E,09B79,08B58,0BB3B,0AB1A,06CA6 dw 07C87,04CE4,05CC5,02C22,03C03,00C60,01C41,0EDAE dw 0FD8F,0CDEC,0DDCD,0AD2A,0BD0B,08D68,09D49,07E97 dw 06EB6,05ED5,04EF4,03E13,02E32,01E51,00E70,0FF9F dw 0EFBE,0DFDD,0CFFC,0BF1B,0AF3A,09F59,08F78,09188 dw 081A9,0B1CA,0A1EB,0D10C,0C12D,0F14E,0E16F,01080 dw 000A1,030C2,020E3,05004,04025,07046,06067,083B9 dw 09398,0A3FB,0B3DA,0C33D,0D31C,0E37F,0F35E,002B1 dw 01290,022F3,032D2,04235,05214,06277,07256,0B5EA dw 0A5CB,095A8,08589,0F56E,0E54F,0D52C,0C50D,034E2 dw 024C3,014A0,00481,07466,06447,05424,04405,0A7DB dw 0B7FA,08799,097B8,0E75F,0F77E,0C71D,0D73C,026D3 dw 036F2,00691,016B0,06657,07676,04615,05634,0D94C dw 0C96D,0F90E,0E92F,099C8,089E9,0B98A,0A9AB,05844 dw 04865,07806,06827,018C0,008E1,03882,028A3,0CB7D dw 0DB5C,0EB3F,0FB1E,08BF9,09BD8,0ABBB,0BB9A,04A75 dw 05A54,06A37,07A16,00AF1,01AD0,02AB3,03A92,0FD2E dw 0ED0F,0DD6C,0CD4D,0BDAA,0AD8B,09DE8,08DC9,07C26 dw 06C07,05C64,04C45,03CA2,02C83,01CE0,00CC1,0EF1F dw 0FF3E,0CF5D,0DF7C,0AF9B,0BFBA,08FD9,09FF8,06E17 dw 07E36,04E55,05E74,02E93,03EB2,00ED1,01EF0; ; .radix 10 dseg ends ; ; ; ; cseg segment para public 'code' assume cs:cseg, ds:dseg ; ; ; public PSP PSP dw ? ;Must be in CODE segment! ; ; ; Main proc mov cs:PSP, es ;Save pgm seg prefix mov ax, seg dseg ;Set up the segment registers mov ds, ax mov es, ax ; mov cx, 10000 crclp1: xor dx, dx ;Set DX to zero. mov ax, dx crclp2: call crc inc al jne crclp2 dec cx jne crclp1 print db "done",0 ; ; Quit: mov ah, 4ch int 21h ; Main endp ; ; ; CRC update routine. ; ; On Entry: ; AL- New data value ; DX- Current CRC value ; ; On Exit: ; DX- New CRC value ; crc proc near mov bl, dh xor bl, al mov bh, 0 mov dh, dl mov dl, bh shl bx, 1 xor dx, crctbl[bx] ret crc endp ; ; ; cseg ends ; ; ; Allocate a reasonable amount of space for the stack (2k). ; sseg segment para stack 'stack' stk db 256 dup ("stack ") sseg ends ; ; end Main Running time: 0:17 I could probably get this down to 0:15 or 0:16 if I encoded the CRC calculation in-line rather than as a procedure/function. Assembly language typically works out a little better than twice as fast. The only surprise here is that it's actually that good. For such simple, short, routines, ASM's advantage over C usually isn't this good. For a CRC algorithm, speed is very important, so ASM is, perhaps, a better choice. But that wasn't the purpose of doing all this. Earlier (on the net), I'd made the statement that for a CRC algorithm, ASM isn't particularly more difficult to understand than C (and this is true, in general, for most bit-pushing operations). Now I'm assuming we're talking about expert C programmers looking at C code and expert assembly language programmers looking at ASM code. It isn't fair to take an expert C programmer who is a mediocre ASM programmer and have that same person evaluate C and ASM code. I claim the ASM code above isn't any more difficult to understand than the C code (assuming, of course, that you know the algorithm that the code is attempting to implement). In Doug's original post, he claimed that the ASM code wasn't any more readable than the C code. The reasons are two-fold: (1) the code used a truly miserable algorithm which was next to impossible to understand in the first place (*NO* language is going to make that algorithm particularly clear), (2) the ASM code looked like the output of a compiler. I would be ashamed of myself if I wrote assembly code that looked like that. Clearly compiler-generated assembly code is *not* going to be clearer to understand than hand written code. My apologizes if I'm offending an actual human author of that code. One final point- Note that my minor optimizations in assembly language` (unrolling the loop) ran at the same speed as the nibble loop-up table optimized code. For the same speed, I claim my assembly code is *much* easier to understand than the optimized C code. From a development point of view, it would take far more time to dream up, implement, and verify the nibble-based C code that Doug posted than it would take to implement the unrolled version of the original CRC algorithm in assembly. I offer this as proof that assembly is often the better choice for a project. Sure, you can find better algorithms in C which run faster than bad algorithms in ASM, but the end result is that you spend *far* more time designing, testing, and debugging the C code. And if you can find a better algorithm that works in C, that same algorithm will work in assembly as well. Keep in mind, that the CRC algorithm is intrinsically O(n). You're not going to do any better than this because it takes n operations to read the data. Therefore, the only thing we can attack is the constant in the complexity of this algorithm. ASM is really good at this. *** Randy Hyde