From nobody@FreeBSD.org  Thu Jun 18 20:29:31 2009
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B913A10656CC
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 18 Jun 2009 20:29:31 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id A52C28FC1B
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 18 Jun 2009 20:29:31 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.3/8.14.3) with ESMTP id n5IKTVdV007121
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 18 Jun 2009 20:29:31 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.3/8.14.3/Submit) id n5IKTVef007120;
	Thu, 18 Jun 2009 20:29:31 GMT
	(envelope-from nobody)
Message-Id: <200906182029.n5IKTVef007120@www.freebsd.org>
Date: Thu, 18 Jun 2009 20:29:31 GMT
From: Jay Patrick Howard <jhoward@alumni.utexas.net>
To: freebsd-gnats-submit@FreeBSD.org
Subject: [PATCH] enhance qsort to properly handle 32-bit aligned data on 64-bit systems
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         135718
>Category:       bin
>Synopsis:       [patch] enhance qsort(3) to properly handle 32-bit aligned data on 64-bit systems
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Jun 18 20:30:04 UTC 2009
>Closed-Date:    
>Last-Modified:  Sat Jun 20 12:10:04 UTC 2009
>Originator:     Jay Patrick Howard
>Release:        n/a
>Organization:
>Environment:
>Description:
The stdlib qsort() code in FreeBSD is largely taken from the paper "Engineering a Sort Function" by Bentley & McIlroy (1993).

In that paper, the authors employ a clever scheme for selecting a method of swapping elements.  Basically it goes like this:

1. If the base pointer or element size is not aligned to sizeof(long) then swap byte-by-byte, else
2. If the element size is exactly sizeof(long) then perform a single inline swap, else
3. perform a long-by-long swap.

The implicit assumption here is that if the base pointer or element size isn't aligned to sizeof(long) then one can't do any better than a char-by-char swap.
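
The selection scheme above can be sketched as a standalone function (the name swaptype_select is illustrative only; qsort.c computes the same value inline in its SWAPINIT macro):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the Bentley-McIlroy swap-method selection described above.
 * Returns qsort.c's "swaptype": 2 = byte-by-byte (base or element size
 * not long-aligned), 0 = single inline long swap (es == sizeof(long)),
 * 1 = long-by-long bulk swap.
 */
static int
swaptype_select(const void *base, size_t es)
{
	if ((uintptr_t)base % sizeof(long) != 0 || es % sizeof(long) != 0)
		return 2;                       /* step 1: byte-by-byte */
	return es == sizeof(long) ? 0 : 1;      /* steps 2 and 3 */
}
```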

While this seems to be true on most 32-bit systems, it is not the case on at least some 64-bit systems, x86-64 in particular.

Consequently, sorting data that is 32-bit aligned (but not 64-bit aligned) is much slower on 64-bit systems compared to 32-bit systems.  This is because in 32-bit mode the qsort() logic uses a long-by-long swap (since the data is aligned) while in 64-bit mode qsort() drops down to a char-by-char swap.

It is true that most workloads on 64-bit systems will be 64-bit aligned.  However, it is fairly common for 64-bit code to need to process binary data that wasn't generated on a 64-bit system and hence may not be 64-bit aligned.  Because this is fairly common it could be a decent "win" for qsort() to support fast swapping for such 32-bit aligned workloads.

In my testing, sorting 64 MB worth of 100-byte records (each with a 12-byte key) took 1.8x as long on a 64-bit system as it did on a 32-bit system, with identical hardware.  With a patched qsort the performance was identical on both 32-bit and 64-bit versions of the code.

The patch is written such that if sizeof(long) == sizeof(int) then it acts exactly like the current version.  The additional swap behavior is only enabled when sizeof(long) > sizeof(int).

The extra overhead from the sizeof(int) alignment check was negligible.  Given the way SWAPINIT is structured, there is no additional overhead whatsoever when the data is 64-bit aligned.  The only time additional overhead is incurred is when data is NOT 64-bit aligned, in which case the extra alignment check is quite likely to provide a significant speedup.
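
When sizeof(long) > sizeof(int), the patched selection gains one tier.  A sketch of the resulting logic (the function name is hypothetical; the patch makes the equivalent change inside the SWAPINIT macro):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the patched selection for sizeof(long) > sizeof(int):
 * 0 = single inline long swap, 1 = long-by-long, 2 = int-by-int (new),
 * 3 = byte-by-byte.  When the data is long-aligned the result is the
 * same as before; the int check only runs on the misaligned path.
 */
static int
swaptype_patched(const void *base, size_t es)
{
	if ((uintptr_t)base % sizeof(long) == 0 && es % sizeof(long) == 0)
		return es == sizeof(long) ? 0 : 1;  /* unchanged fast paths */
	if ((uintptr_t)base % sizeof(int) == 0 && es % sizeof(int) == 0)
		return 2;                           /* new: int-by-int */
	return 3;                                   /* last resort: bytes */
}
```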
>How-To-Repeat:
Sort records of size 8*n+4 bytes from within 64-bit code.  Then sort the same data from within 32-bit code.  The 64-bit version should take approximately 1.8x as long to execute.
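
A minimal reproduction sketch, assuming 20-byte records (8*n+4 with n = 2) and a 12-byte key as in the benchmark above; helper names are illustrative.  Build the same source as 32-bit and 64-bit binaries and time the qsort() call:

```c
#include <stdlib.h>
#include <string.h>

#define RECSZ 20  /* 8*n+4 with n = 2: int-aligned, not long-aligned */

/* Compare records on a leading 12-byte key, as in the test above. */
static int
keycmp(const void *a, const void *b)
{
	return memcmp(a, b, 12);
}

/*
 * Fill nrec pseudo-random RECSZ-byte records, qsort() them, and return
 * 1 if the result came out ordered, 0 otherwise.  Timing the qsort()
 * call in 32-bit vs. 64-bit builds exposes the slowdown described above.
 */
static int
sort_records(size_t nrec)
{
	char *recs = malloc(nrec * RECSZ);
	size_t i;
	int ok = 1;

	if (recs == NULL)
		return 0;
	srand(1);
	for (i = 0; i < nrec * RECSZ; i++)
		recs[i] = (char)(rand() & 0xff);
	qsort(recs, nrec, RECSZ, keycmp);
	for (i = 1; i < nrec; i++)
		if (keycmp(recs + (i - 1) * RECSZ, recs + i * RECSZ) > 0)
			ok = 0;
	free(recs);
	return ok;
}
```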
>Fix:
See attached patch modifying /src/lib/libc/stdlib/qsort.c

Patch attached with submission follows:

--- qsort.c	2009-06-18 13:32:58.000000000 -0500
+++ qsort.c.patched	2009-06-18 14:22:02.000000000 -0500
@@ -34,6 +34,7 @@
 __FBSDID("$FreeBSD: src/lib/libc/stdlib/qsort.c,v 1.15 2008/01/14 09:21:34 das Exp $");
 
 #include <stdlib.h>
+#include <limits.h>
 
 #ifdef I_AM_QSORT_R
 typedef int		 cmp_t(void *, const void *, const void *);
@@ -59,8 +60,15 @@
         } while (--i > 0);				\
 }
 
-#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \
+#if LONG_BIT > WORD_BIT
+#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) ||	\
+	es % sizeof(long) ? ((char *)a - (char *)0) % sizeof(int) || es %	\
+	sizeof(int) ? 3 : 2 : es == sizeof(long)? 0 : 1;
+#else
+#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) ||	\
 	es % sizeof(long) ? 2 : es == sizeof(long)? 0 : 1;
+#endif
+
 
 static inline void
 swapfunc(a, b, n, swaptype)
@@ -69,6 +77,10 @@
 {
 	if(swaptype <= 1)
 		swapcode(long, a, b, n)
+#if LONG_BIT > WORD_BIT
+	else if(swaptype <= 2)
+		swapcode(int, a, b, n)
+#endif
 	else
 		swapcode(char, a, b, n)
 }


>Release-Note:
>Audit-Trail:

From: Jay Howard <jhoward@alumni.utexas.net>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/135718: [PATCH] enhance qsort(3) to properly handle 32-bit 
	aligned data on 64-bit systems
Date: Thu, 18 Jun 2009 22:33:15 -0500

 It occurs to me that a simpler solution would be to just do int-by-int swaps
 in all cases when the base addr and element size are int-aligned, char-by-char
 swaps when they're not, and an inline long-swap when the base addr is
 long-aligned and element size is exactly sizeof(long).
 
 It's notable that BSD bcopy() only does int-by-int copying and makes no
 effort to do long-by-long copying when the data would permit.
 
 Here's a second patch that makes the above change.
 
 On systems where sizeof(int) == sizeof(long) this version becomes identical
 to the current version.  When sizeof(long) > sizeof(int), bulk swapping
 happens int-by-int instead of long-by-long.  Inline swaps are still used
 when the base is long-aligned and element size == sizeof(long).
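 
 The simpler selection described above can be sketched as (function name
 illustrative; the uuencoded patch makes the equivalent change in SWAPINIT):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the simpler scheme: 2 = char-by-char when the base or
 * element size is not int-aligned, 0 = single inline long swap when
 * the base is long-aligned and es == sizeof(long), 1 = int-by-int
 * bulk swap otherwise.
 */
static int
swaptype_simpler(const void *base, size_t es)
{
	if ((uintptr_t)base % sizeof(int) != 0 || es % sizeof(int) != 0)
		return 2;                  /* char-by-char */
	if ((uintptr_t)base % sizeof(long) == 0 && es == sizeof(long))
		return 0;                  /* single inline long swap */
	return 1;                          /* int-by-int bulk swap */
}
```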
 
 uuencoded:
 
 begin 600 qsort.c.patch
 M+2TM('%S;W)T+F,),C`P.2TP-BTQ.2`P,CHR-CHQ-RXP,#`P,#`P,#`@*S`P
 M,#`**RLK('%S;W)T+F,N<&%T8VAE9`DR,#`Y+3`V+3$Y(#`S.C`W.C`Y+C`P
 M,#`P,#`P,"`K,#`P,`I`0"`M-3DL."`K-3DL.2!`0`H@("`@("`@("!]('=H
 M:6QE("@M+6D@/B`P*3L)"0D)7`H@?0H@"BTC9&5F:6YE(%-705!)3DE4*&$L
 M(&5S*2!S=V%P='EP92`]("@H8VAA<B`J*6$@+2`H8VAA<B`J*3`I("4@<VEZ
 M96]F*&QO;F<I('Q\(%P*+0EE<R`E('-I>F5O9BAL;VYG*2`_(#(@.B!E<R`]
 M/2!S:7IE;V8H;&]N9RD_(#`@.B`Q.PHK(V1E9FEN92!35T%024Y)5"AA+"!E
 M<RD@<W=A<'1Y<&4@/2`H*&-H87(@*BEA("T@*&-H87(@*BDP*2`E('-I>F5O
 M9BAI;G0I('Q\(%P**PEE<R`E('-I>F5O9BAI;G0I(#\@,B`Z("@H8VAA<B`J
 M*6$@+2`H8VAA<B`J*3`I("4@<VEZ96]F*&QO;F<I(#\@,2`Z(%P**PEE<R`A
 M/2!S:7IE;V8H;&]N9RD["B`*('-T871I8R!I;FQI;F4@=F]I9`H@<W=A<&9U
 M;F,H82P@8BP@;BP@<W=A<'1Y<&4I"D!`("TV."PW("LV.2PW($!`"B`):6YT
 M(&XL('-W87!T>7!E.PH@>PH@"6EF*'-W87!T>7!E(#P](#$I"BT)"7-W87!C
 M;V1E*&QO;F<L(&$L(&(L(&XI"BL)"7-W87!C;V1E*&EN="P@82P@8BP@;BD*
 E(`EE;'-E"B`)"7-W87!C;V1E*&-H87(L(&$L(&(L(&XI"B!]"BP@
 `
 end
 
>Unformatted:
