* * * * * Help! My compiler is leaking! I fixed the problems that valgrind was complaining about [1] in the greylist daemon [2] (and I had hoped such problems would fix the problem, but alas, it doesn't seem to be the case, but I'm still testing), but the very nature of one of the complaints is interesting in and of itself. A design pattern I've long used in code is the following: > /*-------------------------------------------- > ; the following is written to clarify > ; the issue I'm writing about. This > ; is not how I would normally write such > ; code. It presents a simplified, yet > ; realistic bit of code. > ; > ; And yes, this is the format I use for block > ; comments in C. > ;------------------------------------------*/ > > char buffer[BUFSIZ]; > size_t i; > size_t j; > > for (i = j = 0 ; i < BUFSIZ ; i++) > { > if (!isctrl(buffer[i])) > buffer[j++] = buffer[i]; > } > Basically, I'm going through a character array, removing certain elements (in this example, control characters) and doing so in place, to conserve memory consumption. Yes, there are instances where I'm reading and writing the same value to the same location, but in the scheme of things, it's not a bad thing. And, at least for this example, the overhead of trying to avoid unnecessary copying of data overwhelms the amount of time it takes to just do the copy. So when I was writing the code to clean up the tuple array in the greylist daemon, I naturally wrote it as: > for (i = j = 0 ; i < g_poolnum ; i++) > { > /* other code */ > > if (difftime(Tao,g_pool[i].atime) < c_timeout_gray) > { > g_pool[j++] = g_pool[i]; > continue; > } > > /* rest of loop */ > } > It follows a successful pattern I've used plenty of times before. I saw nothing necessarily wrong with it, yet valgrind complained bitterly about this fragment of code. And I was a mildly surprised to see a call to memcpy() when I never explicitly called memcpy(). I just got bit with a leaky abstraction [3], and a rather insidious one at that. Pre-ANSI C, I wouldn't have been able to write that code, since back then, C didn't have the concept of structure assignments, and thus, I would have had to explicitly call memcpy(): > if (difftime(Tao,g_pool[i].atime) < c_timeout_gray) > { > memcpy(&g_pool[j++],&g_pool[i],sizeof(struct tuple)); > continue; > } > But one of the advantages of working up the abstraction scale (so to speak) is that you can let the compiler take care of the grunt work for you. I mean, what if struct tuple was the size of an int? The overhead of calling memcpy() would swamp the actual act of copying the data. In fact, if struct tuple was a small multiple of an int in size, it might still not be worth the overhead of calling memcpy(). And the compiler is a better place to push such knowledge, since it can keep track of not only structure sizes, but the overhead of calling a function to copy memory and handle things accordingly. So ANSI C allowed the compiler to handle structure assignment. And it can do a pretty fine job of it too. For instance, using a recent version of gcc [4] with the compiler options -O3 -fomit-frame-pointer (some heavy duty optimization), it compiled the following bit of code: > struct foo > { > int f; > }; > > struct foo A[256]; > size_t i; > size_t j; > > for (i = j = 0 ; i < 256 ; i++) > { > if (A[i].f == 56) > A[j++] = A[i]; > } > to the following bit of x86 code (and frankly, I was surprised at the quality and it's been translated from the alien AT&T syntax gcc uses to proper Intel syntax): > xor edx,edx > xor eax,eax > jmps .L6 > > .L4: inc eax > cmp eax,255 > ja .L12 > > .L6: cmp [A + eax*4],56 > jne .L4 > inc eax > > mov [A + edx*4],56 ; A[j].f = 56 > inc edx ; j++ > cmp eax,255 > jbe .L6 > > .L12: > It didn't even copy the data since the compiler figured out it didn't need to. Even if we increased the size of the structure a bit: > struct foo > { > size_t f; > size_t f2; > char f3a; > char f3b; > char f3c; > char f3d; > short f4a; > short f4b; > }; > > struct foo A[256]; > size_t i; > size_t j; > > for (i = j = 0 ; i < 256 ; i++) > { > if (A[i].f == 56) > A[j++] = A[i]; > } > gcc still has yet to call memcpy(): > push ebx > xor edx,edx > xor ebx,ebx > mov ecx,255 > jmps .L18 > > .L16: add edx,16 > dec ecx > js .L23 > > .L18: cmp [A + edx],56 > jne .L16 > > mov [A + ebx],56 ; A[j].f = 56 > mov eax,[A + 4 + edx] ; A[j].f2 = A[i].f2 > mov [A + 4 + ebx],eax > mov eax,[A + 8 + edx] ; A[j].f3(a-d) = A[i].f3(a-d) > mov [A + 8 + ebx],eax > mov eax,[A + 12 + edx] ; A[j].f4(a,b) = A[i].f4(a,b) > mov [A + 12 + ebx],eax > > add edx,16 > add ebx,16 > dec ecx > jns .L18 > > .L23: pop ebx > It just copies the 16 bytes of the structure as one assignment (because of constant propagation) and three four-byte moves. It's not until the structure gets significantly large: > struct foo > { > size_t f; > char b1[124]; > size_t s2; > char b2[124]; > }; > > struct foo A[256]; > size_t i; > size_t j; > > for (i = j = 0 ; i < 256 ; i++) > { > if (A[i].f == 56) > A[j++] = A[i]; > } > that the compiler switches to calling memcpy(): > push edi > push esi > push ebx > xor esi,esi > mov edi,offset A > mov ebx,255 > jmps .L40 > > .L38: add esi,256 > dec ebx > js .L45 > > .L40: cmp [A + esi],56 > jne .L38 > push ecx ; ??? > push 256 > lea eax,[A + esi] > mov edx,edi > push eax > push edx > call memcpy > add edi,256 > add esp,16 > add esi,256 > dec ebx > jns .L40 > > .L45: pop ebx > pop esi > pop edi > (The only anomaly in the code is the push ecx. The register isn't initialized, and memcpy() only takes three parameters, not four. My guess is that this “additional parameter” exists to keep the stack aligned along a cache line and it's this type of detail that compilers exist to keep track of. It's also interesting to note that when compiling for an 80486, gcc never bothers to call memcpy() and instead inlines the operation using REP MOVS.) By now, you're probably asking yourself, “So? Where's the leaky abstraction?” Well, the leaky abstraction comes in the call to memcpy(), which happened inherently. > Synopsis > > > #include > > void *memcpy(void *s1,void *s2,size_t n); > > > > Description > > The memcpy function copies n characters from the object pointed to by s2 > into the object pointed to by s1. If copying takes place between objects > that overlap, the behavior is undefined. > > Returns > > The memcpy function returns the value of s1. > > “The C Standard” (emphasis added) > Well, how about that? I'm inadvertantly invoking undefined behavior in the greylist daemon! (actually, I was hoping this was the cause of the problem, but I don't think it is—sigh) Technically speaking, I don't see how there could be a problem when I'm copying a block of memory over itself, except for consuming some extra CPU (Central Processing Unit) time. But I would think that a compiler could see I was modifying an array of items, and either include a check to skip the operation if they overlapped completely, or switch to using memmove() (which allows the objects to overlap). But such is the nature of working with high abstractions. When they leak, they leak! I suppose I could have realized I was ultimately calling memcpy(), and that memcpy() has undefined semantics when the source and destination overlap, but I also expected the compiler to inline the code to copy the structure (much like gcc did when compiling on an 80486), not actually call memcpy()! Sheesh. [1] gopher://gopher.conman.org/0Phlog:2007/10/15.1 [2] gopher://gopher.conman.org/0Phlog:2007/08/16.1 [3] http://www.joelonsoftware.com/articles/LeakyAbstractions.html [4] http://gcc.gnu.org/ Email Sean Conner at sean@conman.org .