Path: ns-mx!iowasp.physics.uiowa.edu!maverick.ksu.ksu.edu!zaphod.mps.ohio-state.edu!mips!pacbell.com!ucsd!ucrmath!rhyde
From: rhyde@ucrmath.ucr.edu (randy hyde)
Newsgroups: comp.sys.apple2
Subject: CRCs
Message-ID: <13688@ucrmath.ucr.edu>
Date: 19 Apr 91 17:12:55 GMT
Sender: news@ucrmath.ucr.edu
Organization: University of California, Riverside
Lines: 674

I finally got my act together and I'm submitting some examples of CRC-16/
CCITT algorithms.  For a really good discussion of the CRC-16/CCITT algo-
rithm, please see "The C Programmer's Guide to Serial Communications" by
Campbell.  This is a great book and should be in the hands of anyone who
needs to deal with any form of serial communications, regardless of the
language they happen to program in.

A *brief* description of CRC-16/CCITT: There are many CRC (cyclic
redundancy check) algorithms laying around.  For example, someone posted
the CRC algorithm used by MINIX on the net here, it is not the same as
the standard CRC-16/CCITT algorithm (I cannot verify this since I do not
yet have Minix sources, the post could have been wrong).  CRC algorithms
are used like checksums and parity-- they're used to verify the correct-
ness of a block of data.  The CRC-16/CCITT algorithm has many advantages
over checksums and parity checks.  Such a description is beyond the scope
of this article.  See the reference above for more details.  Also, the
Aug/Sep 1990 C User's Journal had several articles about CRC's and other
data checking algorithms dealing with serial communications.

The CRC-16/CCITT algorithm basically computes the remainder of an n-bit
number divided by 0x1021 using 1's complement arithmetic (mod 2).  This
algorithm was chosen because it is fairly reliable at detecting all sorts
of errors *and* (most importantly) it is very easy to implement in
hardware.  CRC's were originally designed to verify data coming off a
disk drive.  Since the data comes off the disk at a rapid rate, software
(at the time) was unable to keep up with it.  Therefore, they needed to
compute the CRC using hardware.  A 1's complement division by 0x1021 is
easy to implement using a couple of shift registers and XOR gates.

Therefore, a CRC-16 algorithm is essentially a long division algorithm
which uses XOR rather than subtraction at each step in the division.  It
throws away the quotient and keeps the remainder.  There is actually a
little more to it than this, but that's close enough for now.

To compute the CRC-16/CCITT value (I'll just call it CRC from here on out)`
you initialize the crc value to zero (many modern algorithms initialize it
to 0xffff instead, for some very good reasons).  You then grab each byte
from the message and, one bit at a time, xor the H.O. bit of the data
with the H.O. bit of the current checksum.  If the result is zero, you
simply shift the crc (16-bits) and the data (8 bits) one position to the
left.  If the XOR of the H.O. bits is one, you still shift the data byte
one position to the left, the crc value, however, you must shift and then
xor with 0x1021 (the CCITT "polynomial").  You repeat this for all eight
bits and then repeat the process for the entire byte stream of the message.
The following C(++) code does this:

#include <stdio.h>

int crc(int data, int curcrc)
{
	static int i;

	for (i=0; i<8; ++i)
	{
		if ((data << 8) ^ curcrc)
			curcrc = (curcrc << 1) ^ 0x1021;
		else	curcrc <<= 1;
		data <<= 1;
	}
	return curcrc;
}

int main()
{
	int i,j,c;

	for (i=0; i < 10000; ++i)
	{
		c = 0;
		for (j=0; j < 256; ++j)
			c = crc(j,c);
	}
	puts("done");
}

BTW, there is a minor bug (affecting portability) in this code. Can you
spot it?  I left this bug in to make the above code much easier to read.
I will present a version of this code in a moment with the bug removed.

Running time for this algorithm is 3:39 on a 12Mhz 80286 (I had my
laptop with me at Cal Poly yesterday while working on this;  I apologize
for not doing this on a GS, but my GS isn't exactly portable).  Note: I
used Borland`s BC (C++) compiler, emitting 8086 code with optimizations
turned on (although I doubt optimization would make any difference on
such a small, simple, program)

Implemented in (8086) assembly language, the above program looks like the
following:

		include	stdlib.a
;****************************************************************************
;
;
;
;
;
cseg		segment	para public 'code'
		assume	cs:cseg, ds:dseg
;
;
		public	PSP
PSP		dw	?			;Must be in CODE segment!
;
;
; Some useful constants:
;
cr		equ	13
lf		equ	10
eos		equ	0
;
true		equ	1
false		equ	0
;
;
; Main is the main program.  Program execution always begins here.
;
Main		proc
		mov	cs:PSP, es		;Save pgm seg prefix
		mov	ax, seg dseg		;Set up the segment registers
		mov	ds, ax
		mov	es, ax
		mov	dx, 0			;Allocate all available RAM.
		MemInit
;
;
		mov	cx, 10000
crclp1:		xor	dx, dx			;Set DX to zero.
		mov	ax, dx
crclp2:		call	crc
		inc	al
		jne	crclp2
		dec	cx
		jne	crclp1
		print
		db	"done",0
;
;
;
Quit:		mov     ah, 4ch
		int     21h
;
Main		endp
;
;
; CRC routine:
;
; Entry:
; DX contains current crc value
; AL contains new character
;
; Exit:
; DX contains new crc value
;
crc		proc	near
		push	cx
		push	ax
		mov	cx, 8			;Repeat once for each bit.
BitLoop:	mov	ah, al			;Get a copy of data value.
		xor	ah, dh			;This clears the carry!
		jns	NoCCITT			;XOR of H.O. bits = 1?
		xor	dx, 0810h		;1021h shr 1
		stc				;Underflow from 1021 shr 1
;
; The following rcl is just a SHL if we branched from above (since the XOR
; instruction clears the carry flag).  It is an SHL which shifts in a 1
; if we fall through from above.  Shifting in a one supplies the L.O. bit
; of 1021h which we XOR'd in above.
;
NoCCITT:	rcl	dx, 1
		shl	al, 1			;Go to the next bit in data.
		loop	BitLoop
		pop	ax
		pop	cx
		ret
crc		endp
;
;
cseg            ends
;
;
; Allocate a reasonable amount of space for the stack (2k).
;
sseg		segment	para stack 'stack'
stk		db	256 dup ("stack   ")
sseg		ends
;
;
		end	Main


Again, I apologize for 8086 assembly, it's all I had available at the
time.  Running time for this code on the same machine as above: 1:19.

The above are the simple and easy to understand versions of the CRC
algorithm.  Alas, they are very slow.  Keep in mind, the CRC algorithm
may need to keep up with a fast serial line (or some other fast device).
We certainly don't want to lose characters because of CRC computation.

Before jumping into major optimizations, let's consider a very common
optimization that just about any assembly language programmer would
think of and good "C" programmers would think of: unravelling (unrolling)
the loops.  The loop in the crc function can be easily unrolled since it
executes exactly eight times each time you call it.  Since the function
is rather simple, why bother with the function call?  Why not include
the code in-line and dispense with the overhead of the call?  The following
implement this:


#include <stdio.h>

#define crc18(data, curcrc)	if ((data << 8) ^ curcrc) \
					curcrc = ((curcrc << 1) ^ 0x1021) & 0xffff; \
				else curcrc = (curcrc << 1) & 0xffff;\
				data = (data << 1) & 0xff00;

int crc(int data, int curcrc)
{
	static int i;

	data <<= 8;
	for (i=0; i<8; ++i)
	{
		if (data ^ curcrc)
			curcrc = ((curcrc << 1) ^ 0x1021) & 0xffff;
		else	curcrc = (curcrc << 1) & 0xffff;
		data = (data << 1) & 0xff00;
	}
	return curcrc;
}

int main()
{
	int i,j,k,c;

	for (i=0; i < 10000; ++i)
	{
		c = 0;
		for (j=0; j < 256; ++j)
			c = crc(j,c);
	}
	puts("done, press return for macro version");
	c = getchar();
	for (i=0; i < 10000; ++i)
	{
		c = 0;
		for (j=0; j < 256; ++j)
		{
			k=j;
			crc18(c,k);
			crc18(c,k);
			crc18(c,k);
			crc18(c,k);
			crc18(c,k);
			crc18(c,k);
			crc18(c,k);
			crc18(c,k);
		}
	}
	puts("done");
}

This code implements *both* the function call and the unrolled version.
If you didn't notice the bug I corrected, shorts and ints in C have no
guaranteed size other than they are at least 16 bits long (ANSI C).
You have no guarantee that shorts are 16 bits long, which is why we
need the and with 0xffff above.  If your C compiler has 16-bit shorts,
I would hope that it would optimize out this operation.

Running time for this code:  2:02 (at least for the in-line version).
A substantial improvement.

Consider the same code in 8086 assembly:

		include	stdlib.a
;****************************************************************************
;
Crc18		macro
		local	NoCCITT
		mov	ah, al
		xor	ah, dh			;This clears the carry!
		jns	NoCCITT
		xor	dx, 0810h		;1021h shr 1
		stc				;Underflow from 1021 shr 1
NoCCITT:	rcl	dx, 1
		shl	al, 1
		endm
;
;
;
cseg		segment	para public 'code'
		assume	cs:cseg, ds:dseg
;
;
		public	PSP
PSP		dw	?			;Must be in CODE segment!
;
;
; Some useful constants:
;
cr		equ	13
lf		equ	10
eos		equ	0
;
true		equ	1
false		equ	0
;
;
; Main is the main program.  Program execution always begins here.
;
Main		proc
		mov	cs:PSP, es		;Save pgm seg prefix
		mov	ax, seg dseg		;Set up the segment registers
		mov	ds, ax
		mov	es, ax
		mov	dx, 0			;Allocate all available RAM.
		MemInit
;
;
		mov	cx, 10000
crclp1:		xor	dx, dx			;Set DX to zero.
		mov	ax, dx
crclp2:		push	ax
		crc18
		crc18
		crc18
		crc18
		crc18
		crc18
		crc18
		crc18
		pop	ax
		inc	al
		jne	crclp2
		dec	cx
		jne	crclp1
		print
		db	"done",0
;
;
;
Quit:		mov     ah, 4ch
		int     21h
;
Main		endp
;
;
;
;
cseg            ends
;
;
; Allocate a reasonable amount of space for the stack (2k).
;
sseg		segment	para stack 'stack'
stk		db	256 dup ("stack   ")
sseg		ends
;
;
;
		end	Main


Running time for this code:  0:48.  Again a substantial improvement.

I would like to point out that the above algorithms are about as far
as you can take the CRC-16/CCITT algorithm and still have code that
is easily recognizable for what it does.  Additional optimizations will
destroy the readability of the code.  Nonetheless, even 48 seconds may
be a little too slow for someone wanting to do 2,560,000 CRC computations.
Let's explore the next step.

Any assembly language programmer worth his salt would immediately suggest
"table look-up" as a means to optimize the program.  A good C programmer
would also make this suggestion (although you see this technique used
*far* less often in C than in assembly).  The code Doug posted used this
technique to speed things up (which is why the C code was very difficult
to follow).  A modified version of that code (modified so that the calling
sequence is similar to the above routines) follows:

#include <iostream.h>

unsigned crctab[2][16] = {
0x0000, 0xc0c1, 0xc181, 0x0140, 0xc301, 0x03c0, 0x0280, 0xc241,
0xc601, 0x06c0, 0x0780, 0xc741, 0x0500, 0xc5c1, 0xc481, 0x0440,

0x0000, 0xcc01, 0xd801, 0x1400, 0xf001, 0x3c00, 0x2800, 0xe401,
0xa001, 0x6c00, 0x7800, 0xb401, 0x5000, 0x9c01, 0x8801, 0x4400};

int crc(int data, int curcrc)
{
	static int temp;
	temp = curcrc ^ data;
	return crctab[0] [temp & 0xf] ^ crctab[1][(temp >>4) & 0xf] ^
		(curcrc >> 8);
}

int main()
{
	int i,j,c;

	for (i=0; i < 10000; ++i)
	{
		c = 0;
		for (j=0; j < 256; ++j)
			c = crc(j,c);
	}
	cout << "done";
}

This code is an interesting compromise between space and speed.  Rather
than use a 256 entry lookup table indexed by each byte of data, this
algorithm uses the two nibbles of each data byte as an index into two
halves of a 2D matrix (one side for L.O. nibble, one side for H.O. nibble).
Someone had to think long and hard to figure this one out, but more on
that later.

Running time: 0:47  A *very* substantial improvement over the bit at a
time algorithm.

A couple of comments about the above algorithm- the use of a 2D array is
weird.  On a compiler which doesn't optimize things too well, this would
cost you dearly (since it would require a multiply and add to compute the
address of the selected array element).  Why the original author didn't
use two separate arrays escapes me.  This is probably a case of not 
realizing the underlying data structure (e.g., they should have known more
assembly) or they didn't care about performance across architectures and
compilers.

I cannot imagine an assembly language programmer attacking this problem in
this fashion.  Therefore, I didn't bother implemented it.  Most assembly
language programmers would go directly to a 256-element look up table and
use the data byte as an index into that table.  That's what the next
section of code does:

If you're willing to live with a 256 element array rather than a 32-element
array, the following C code is even better:





#include <iostream.h>

unsigned crctab[256] = {
0x0000,
0x1021,0x2042,0x3063,0x4084,0x50A5,0x60C6,0x70E7,0x8108,
0x9129,0xA14A,0xB16B,0xC18C,0xD1AD,0xE1CE,0xF1EF,0x1231,
0x0210,0x3273,0x2252,0x52B5,0x4294,0x72F7,0x62D6,0x9339,
0x8318,0xB37B,0xA35A,0xD3BD,0xC39C,0xF3FF,0xE3DE,0x2462,
0x3443,0x0420,0x1401,0x64E6,0x74C7,0x44A4,0x5485,0xA56A,
0xB54B,0x8528,0x9509,0xE5EE,0xF5CF,0xC5AC,0xD58D,0x3653,
0x2672,0x1611,0x0630,0x76D7,0x66F6,0x5695,0x46B4,0xB75B,
0xA77A,0x9719,0x8738,0xF7DF,0xE7FE,0xD79D,0xC7BC,0x48C4,
0x58E5,0x6886,0x78A7,0x0840,0x1861,0x2802,0x3823,0xC9CC,
0xD9ED,0xE98E,0xF9AF,0x8948,0x9969,0xA90A,0xB92B,0x5AF5,
0x4AD4,0x7AB7,0x6A96,0x1A71,0x0A50,0x3A33,0x2A12,0xDBFD,
0xCBDC,0xFBBF,0xEB9E,0x9B79,0x8B58,0xBB3B,0xAB1A,0x6CA6,
0x7C87,0x4CE4,0x5CC5,0x2C22,0x3C03,0x0C60,0x1C41,0xEDAE,
0xFD8F,0xCDEC,0xDDCD,0xAD2A,0xBD0B,0x8D68,0x9D49,0x7E97,
0x6EB6,0x5ED5,0x4EF4,0x3E13,0x2E32,0x1E51,0x0E70,0xFF9F,
0xEFBE,0xDFDD,0xCFFC,0xBF1B,0xAF3A,0x9F59,0x8F78,0x9188,
0x81A9,0xB1CA,0xA1EB,0xD10C,0xC12D,0xF14E,0xE16F,0x1080,
0x00A1,0x30C2,0x20E3,0x5004,0x4025,0x7046,0x6067,0x83B9,
0x9398,0xA3FB,0xB3DA,0xC33D,0xD31C,0xE37F,0xF35E,0x02B1,
0x1290,0x22F3,0x32D2,0x4235,0x5214,0x6277,0x7256,0xB5EA,
0xA5CB,0x95A8,0x8589,0xF56E,0xE54F,0xD52C,0xC50D,0x34E2,
0x24C3,0x14A0,0x0481,0x7466,0x6447,0x5424,0x4405,0xA7DB,
0xB7FA,0x8799,0x97B8,0xE75F,0xF77E,0xC71D,0xD73C,0x26D3,
0x36F2,0x0691,0x16B0,0x6657,0x7676,0x4615,0x5634,0xD94C,
0xC96D,0xF90E,0xE92F,0x99C8,0x89E9,0xB98A,0xA9AB,0x5844,
0x4865,0x7806,0x6827,0x18C0,0x08E1,0x3882,0x28A3,0xCB7D,
0xDB5C,0xEB3F,0xFB1E,0x8BF9,0x9BD8,0xABBB,0xBB9A,0x4A75,
0x5A54,0x6A37,0x7A16,0x0AF1,0x1AD0,0x2AB3,0x3A92,0xFD2E,
0xED0F,0xDD6C,0xCD4D,0xBDAA,0xAD8B,0x9DE8,0x8DC9,0x7C26,
0x6C07,0x5C64,0x4C45,0x3CA2,0x2C83,0x1CE0,0x0CC1,0xEF1F,
0xFF3E,0xCF5D,0xDF7C,0xAF9B,0xBFBA,0x8FD9,0x9FF8,0x6E17,
0x7E36,0x4E55,0x5E74,0x2E93,0x3EB2,0x0ED1,0x1EF0};

int crc(int data, int curcrc)
{
	static int comb_val;
	comb_val = (curcrc >> 8) ^ data;
	return (curcrc << 8) ^ crctab[comb_val];
}

int main()
{
	int i,j,c;

	for (i=0; i < 10000; ++i)
	{
		c = 0;
		for (j=0; j < 256; ++j)
			c = crc(j,c);
	}
	cout << "done";
}

Running time: 0:39;  again, a substantial improvement.  Now look at the
same code in 8086 assembly:

		include	stdlib.a
;****************************************************************************
;
;
; Global variables go here:
;
dseg		segment	para public 'data'
;
		.radix	16
crctbl		dw	00000
 dw 01021,02042,03063,04084,050A5,060C6,070E7,08108
 dw 09129,0A14A,0B16B,0C18C,0D1AD,0E1CE,0F1EF,01231
 dw 00210,03273,02252,052B5,04294,072F7,062D6,09339
 dw 08318,0B37B,0A35A,0D3BD,0C39C,0F3FF,0E3DE,02462
 dw 03443,00420,01401,064E6,074C7,044A4,05485,0A56A
 dw 0B54B,08528,09509,0E5EE,0F5CF,0C5AC,0D58D,03653
 dw 02672,01611,00630,076D7,066F6,05695,046B4,0B75B
 dw 0A77A,09719,08738,0F7DF,0E7FE,0D79D,0C7BC,048C4
 dw 058E5,06886,078A7,00840,01861,02802,03823,0C9CC
 dw 0D9ED,0E98E,0F9AF,08948,09969,0A90A,0B92B,05AF5
 dw 04AD4,07AB7,06A96,01A71,00A50,03A33,02A12,0DBFD
 dw 0CBDC,0FBBF,0EB9E,09B79,08B58,0BB3B,0AB1A,06CA6
 dw 07C87,04CE4,05CC5,02C22,03C03,00C60,01C41,0EDAE
 dw 0FD8F,0CDEC,0DDCD,0AD2A,0BD0B,08D68,09D49,07E97
 dw 06EB6,05ED5,04EF4,03E13,02E32,01E51,00E70,0FF9F
 dw 0EFBE,0DFDD,0CFFC,0BF1B,0AF3A,09F59,08F78,09188
 dw 081A9,0B1CA,0A1EB,0D10C,0C12D,0F14E,0E16F,01080
 dw 000A1,030C2,020E3,05004,04025,07046,06067,083B9
 dw 09398,0A3FB,0B3DA,0C33D,0D31C,0E37F,0F35E,002B1
 dw 01290,022F3,032D2,04235,05214,06277,07256,0B5EA
 dw 0A5CB,095A8,08589,0F56E,0E54F,0D52C,0C50D,034E2
 dw 024C3,014A0,00481,07466,06447,05424,04405,0A7DB
 dw 0B7FA,08799,097B8,0E75F,0F77E,0C71D,0D73C,026D3
 dw 036F2,00691,016B0,06657,07676,04615,05634,0D94C
 dw 0C96D,0F90E,0E92F,099C8,089E9,0B98A,0A9AB,05844
 dw 04865,07806,06827,018C0,008E1,03882,028A3,0CB7D
 dw 0DB5C,0EB3F,0FB1E,08BF9,09BD8,0ABBB,0BB9A,04A75
 dw 05A54,06A37,07A16,00AF1,01AD0,02AB3,03A92,0FD2E
 dw 0ED0F,0DD6C,0CD4D,0BDAA,0AD8B,09DE8,08DC9,07C26
 dw 06C07,05C64,04C45,03CA2,02C83,01CE0,00CC1,0EF1F
 dw 0FF3E,0CF5D,0DF7C,0AF9B,0BFBA,08FD9,09FF8,06E17
 dw 07E36,04E55,05E74,02E93,03EB2,00ED1,01EF0;
;
		.radix	10
dseg		ends
;
;
;
;
cseg		segment	para public 'code'
		assume	cs:cseg, ds:dseg
;
;

;
		public	PSP
PSP		dw	?			;Must be in CODE segment!
;
;
;
Main		proc
		mov	cs:PSP, es		;Save pgm seg prefix
		mov	ax, seg dseg		;Set up the segment registers
		mov	ds, ax
		mov	es, ax
;
		mov	cx, 10000
crclp1:		xor	dx, dx			;Set DX to zero.
		mov	ax, dx
crclp2:		call	crc
		inc	al
		jne	crclp2
		dec	cx
		jne	crclp1
		print
		db	"done",0
;
;
Quit:		mov     ah, 4ch
		int     21h
;
Main		endp
;
;
; CRC update routine.
;
; On Entry:
; AL- New data value
; DX- Current CRC value
;
; On Exit:
; DX- New CRC value
;
crc		proc	near
		mov	bl, dh
		xor	bl, al
		mov	bh, 0
		mov	dh, dl
		mov	dl, bh
		shl	bx, 1
		xor	dx, crctbl[bx]
		ret
crc		endp
;
;
;
cseg            ends
;
;
; Allocate a reasonable amount of space for the stack (2k).
;
sseg		segment	para stack 'stack'
stk		db	256 dup ("stack   ")
sseg		ends
;
;
		end	Main


Running time:  0:17

I could probably get this down to 0:15 or 0:16 if I encoded the CRC
calculation in-line rather than as a procedure/function.

Assembly language typically works out a little better than twice as
fast.  The only surprise here is that it's actually that good.  For
such simple, short, routines, ASM's advantage over C usually isn't this
good.  For a CRC algorithm, speed is very important, so ASM is, perhaps,
a better choice.

But that wasn't the purpose of doing all this.  Earlier (on the net), I'd
made the statement that for a CRC algorithm, ASM isn't particularly more
difficult to understand than C (and this is true, in general, for most
bit-pushing operations).  Now I'm assuming we're talking about expert
C programmers looking at C code and expert assembly language programmers
looking at ASM code.  It isn't fair to take an expert C programmer who is
a mediocre ASM programmer and have that same person evaluate C and ASM
code.

I claim the ASM code above isn't any more difficult to understand than the
C code (assuming, of course, that you know the algorithm that the code
is attempting to implement).

In Doug's original post, he claimed that the ASM code wasn't any more
readable than the C code.  The reasons are two-fold: (1) the code used
a truly miserable algorithm which was next to impossible to understand
in the first place (*NO* language is going to make that algorithm
particularly clear), (2) the ASM code looked like the output of a
compiler.  I would be ashamed of myself if I wrote assembly code that
looked like that.  Clearly compiler-generated assembly code is *not*
going to be clearer to understand than hand written code.  My apologizes
if I'm offending an actual human author of that code.

One final point- Note that my minor optimizations in assembly language`
(unrolling the loop) ran at the same speed as the nibble loop-up table
optimized code.  For the same speed, I claim my assembly code is *much*
easier to understand than the optimized C code.  From a development
point of view, it would take far more time to dream up, implement, and
verify the nibble-based C code that Doug posted than it would take to
implement the unrolled version of the original CRC algorithm in assembly.
I offer this as proof that assembly is often the better choice for a
project.  Sure, you can find better algorithms in C which run faster than
bad algorithms in ASM, but the end result is that you spend *far* more
time designing, testing, and debugging the C code.  And if you can find
a better algorithm that works in C, that same algorithm will work in
assembly as well.

Keep in mind, that the CRC algorithm is intrinsically O(n).  You're not
going to do any better than this because it takes n operations to read
the data.  Therefore, the only thing we can attack is the constant
in the complexity of this algorithm.  ASM is really good at this.

*** Randy Hyde