* * * * *
                                        
                         Help! My compiler is leaking!
                                        
I fixed the problems that valgrind was complaining about [1] in the greylist
daemon [2] (and I had hoped such problems would fix the problem, but alas, it
doesn't seem to be the case, but I'm still testing), but the very nature of
one of the complaints is interesting in and of itself.

A design pattern I've long used in code is the following:

> /*--------------------------------------------
> ; the following is written to clarify 
> ; the issue I'm writing about.  This
> ; is not how I would normally write such 
> ; code.  It presents a simplified, yet
> ; realistic bit of code.
> ;
> ; And yes, this is the format I use for block
> ; comments in C.
> ;------------------------------------------*/
> 
> char   buffer[BUFSIZ];
> size_t i;
> size_t j;
> 
> for (i = j = 0 ; i < BUFSIZ ; i++)
> {
>   if (!isctrl(buffer[i]))
>     buffer[j++] = buffer[i];
> }
> 

Basically, I'm going through a character array, removing certain elements (in
this example, control characters) and doing so in place, to conserve memory
consumption. Yes, there are instances where I'm reading and writing the same
value to the same location, but in the scheme of things, it's not a bad
thing. And, at least for this example, the overhead of trying to avoid
unnecessary copying of data overwhelms the amount of time it takes to just do
the copy.

So when I was writing the code to clean up the tuple array in the greylist
daemon, I naturally wrote it as:

> for (i = j = 0 ; i < g_poolnum ; i++)
> {
>   /* other code */
> 
>   if (difftime(Tao,g_pool[i].atime) < c_timeout_gray)
>   {
>     g_pool[j++] = g_pool[i];
>     continue;
>   }
> 
>   /* rest of loop */
> }
> 

It follows a successful pattern I've used plenty of times before. I saw
nothing necessarily wrong with it, yet valgrind complained bitterly about
this fragment of code. And I was a mildly surprised to see a call to memcpy()
when I never explicitly called memcpy().

I just got bit with a leaky abstraction [3], and a rather insidious one at
that.

Pre-ANSI C, I wouldn't have been able to write that code, since back then, C
didn't have the concept of structure assignments, and thus, I would have had
to explicitly call memcpy():

> if (difftime(Tao,g_pool[i].atime) < c_timeout_gray)
> {
>   memcpy(&g_pool[j++],&g_pool[i],sizeof(struct tuple));
>   continue;
> }
> 

But one of the advantages of working up the abstraction scale (so to speak)
is that you can let the compiler take care of the grunt work for you. I mean,
what if struct tuple was the size of an int? The overhead of calling memcpy()
would swamp the actual act of copying the data. In fact, if struct tuple was
a small multiple of an int in size, it might still not be worth the overhead
of calling memcpy(). And the compiler is a better place to push such
knowledge, since it can keep track of not only structure sizes, but the
overhead of calling a function to copy memory and handle things accordingly.

So ANSI C allowed the compiler to handle structure assignment. And it can do
a pretty fine job of it too. For instance, using a recent version of gcc [4]
with the compiler options -O3 -fomit-frame-pointer (some heavy duty
optimization), it compiled the following bit of code:

> struct foo
> {
>   int f;
> };
> 
> struct foo A[256];
> size_t     i;
> size_t     j;
>   
> for (i = j = 0 ; i < 256 ; i++)
> {
>   if (A[i].f == 56)
>     A[j++] = A[i];
> }
> 

to the following bit of x86 code (and frankly, I was surprised at the quality
and it's been translated from the alien AT&T syntax gcc uses to proper Intel
syntax):

> 	xor	edx,edx
> 	xor	eax,eax
> 	jmps	.L6
> 
> .L4:	inc	eax
> 	cmp	eax,255
> 	ja	.L12   
> 
> .L6:	cmp	[A + eax*4],56
> 	jne	.L4
> 	inc	eax   
> 
> 	mov	[A + edx*4],56	; A[j].f = 56
> 	inc	edx		; j++
> 	cmp	eax,255
> 	jbe	.L6 
>                 
> .L12:
> 

It didn't even copy the data since the compiler figured out it didn't need
to. Even if we increased the size of the structure a bit:

> struct foo
> {
>   size_t f;
>   size_t f2;
>   char   f3a;
>   char   f3b;
>   char   f3c;
>   char   f3d;
>   short  f4a;
>   short  f4b;
> };
> 
> struct foo A[256];
> size_t     i;
> size_t     j;
>   
> for (i = j = 0 ; i < 256 ; i++)
> {
>   if (A[i].f == 56)
>     A[j++] = A[i];
> }
> 

gcc still has yet to call memcpy():

> 	push	ebx
> 	xor	edx,edx
> 	xor	ebx,ebx
> 	mov	ecx,255
> 	jmps	.L18   
> 
> .L16:	add	edx,16
> 	dec	ecx   
> 	js	.L23  
> 
> .L18:	cmp	[A + edx],56
> 	jne	.L16
>      
> 	mov	[A + ebx],56		; A[j].f   = 56
> 	mov	eax,[A + 4 + edx]	; A[j].f2 = A[i].f2
> 	mov	[A + 4 + ebx],eax
> 	mov	eax,[A + 8 + edx]	; A[j].f3(a-d) = A[i].f3(a-d)
> 	mov	[A + 8 + ebx],eax
> 	mov	eax,[A + 12 + edx]	; A[j].f4(a,b) = A[i].f4(a,b)
> 	mov	[A + 12 + ebx],eax
> 
> 	add	edx,16
> 	add	ebx,16
> 	dec	ecx   
> 	jns	.L18  
> 
> .L23:	pop	ebx
> 

It just copies the 16 bytes of the structure as one assignment (because of
constant propagation) and three four-byte moves. It's not until the structure
gets significantly large:

> struct foo
> {
>   size_t f;
>   char   b1[124];
>   size_t s2;
>   char   b2[124];
> };
> 
> struct foo A[256];
> size_t     i;
> size_t     j;
> 
> for (i = j = 0 ; i < 256 ; i++)
> {
>   if (A[i].f == 56)
>     A[j++] = A[i];
> }
> 

that the compiler switches to calling memcpy():

> 	push	edi
> 	push	esi
> 	push	ebx
> 	xor	esi,esi
> 	mov	edi,offset A
> 	mov	ebx,255
> 	jmps	.L40
> 
> .L38:	add	esi,256
> 	dec	ebx
> 	js	.L45
> 
> .L40:	cmp	[A + esi],56
> 	jne	.L38
> 	push	ecx	; ???
> 	push	256
> 	lea	eax,[A + esi]
> 	mov	edx,edi
> 	push	eax
> 	push	edx
> 	call	memcpy
> 	add	edi,256
> 	add	esp,16
> 	add	esi,256
> 	dec	ebx
> 	jns	.L40
> 
> .L45:	pop	ebx
> 	pop	esi
> 	pop	edi
> 

(The only anomaly in the code is the push ecx. The register isn't
initialized, and memcpy() only takes three parameters, not four. My guess is
that this “additional parameter” exists to keep the stack aligned along a
cache line and it's this type of detail that compilers exist to keep track
of. It's also interesting to note that when compiling for an 80486, gcc never
bothers to call memcpy() and instead inlines the operation using REP MOVS.)

By now, you're probably asking yourself, “So? Where's the leaky abstraction?”

Well, the leaky abstraction comes in the call to memcpy(), which happened
inherently.

> Synopsis
> 
> > #include <string.h>
> > void *memcpy(void *s1,void *s2,size_t n);
> > 
> 
> Description
> 
> The memcpy function copies n characters from the object pointed to by s2
> into the object pointed to by s1. If copying takes place between objects
> that overlap, the behavior is undefined.
> 
> Returns
> 
> The memcpy function returns the value of s1.
> 
> “The C Standard” (emphasis added)
> 

Well, how about that? I'm inadvertantly invoking undefined behavior in the
greylist daemon! (actually, I was hoping this was the cause of the problem,
but I don't think it is—sigh) Technically speaking, I don't see how there
could be a problem when I'm copying a block of memory over itself, except for
consuming some extra CPU (Central Processing Unit) time. But I would think
that a compiler could see I was modifying an array of items, and either
include a check to skip the operation if they overlapped completely, or
switch to using memmove() (which allows the objects to overlap).

But such is the nature of working with high abstractions. When they leak,
they leak!

I suppose I could have realized I was ultimately calling memcpy(), and that
memcpy() has undefined semantics when the source and destination overlap, but
I also expected the compiler to inline the code to copy the structure (much
like gcc did when compiling on an 80486), not actually call memcpy()!

Sheesh.

[1] gopher://gopher.conman.org/0Phlog:2007/10/15.1
[2] gopher://gopher.conman.org/0Phlog:2007/08/16.1
[3] http://www.joelonsoftware.com/articles/LeakyAbstractions.html
[4] http://gcc.gnu.org/

Email Sean Conner at sean@conman.org

.