From mi@aldan.algebra.com  Thu Dec 14 21:46:42 2006
Return-Path: <mi@aldan.algebra.com>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id A8A9916A407
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 14 Dec 2006 21:46:42 +0000 (UTC)
	(envelope-from mi@aldan.algebra.com)
Received: from aldan.algebra.com (aldan.algebra.com [216.254.65.224])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 5E4C043D5E
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 14 Dec 2006 21:44:55 +0000 (GMT)
	(envelope-from mi@aldan.algebra.com)
Received: from aldan.algebra.com (aldan [127.0.0.1])
	by aldan.algebra.com (8.13.8/8.13.7) with ESMTP id kBELkUeK024276
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO)
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 14 Dec 2006 16:46:31 -0500 (EST)
	(envelope-from mi@aldan.algebra.com)
Received: (from mi@localhost)
	by aldan.algebra.com (8.13.8/8.13.7/Submit) id kBELkUxO024275;
	Thu, 14 Dec 2006 16:46:30 -0500 (EST)
	(envelope-from mi)
Message-Id: <200612142146.kBELkUxO024275@aldan.algebra.com>
Date: Thu, 14 Dec 2006 16:46:30 -0500 (EST)
From: "Mikhail T." <mi@aldan.algebra.com>
To: FreeBSD-gnats-submit@freebsd.org
Subject: SSE2 optimization for bzip2/libbz2
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         106734
>Category:       bin
>Synopsis:       [patch] [request] bzip2(1): SSE2 optimization for bzip2/libbz2
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Thu Dec 14 21:50:06 GMT 2006
>Closed-Date:    
>Last-Modified:  Sun Jan 27 10:00:28 UTC 2008
>Originator:     Mikhail T.
>Release:        FreeBSD 6.2-PRERELEASE amd64
>Organization:
Virtual Estates, Inc.
>Environment:

	Intel and AMD chips with SSE2 instructions.

>Description:
	The patch below makes bzip2's blocksort routines use SSE2 registers
	to compare 16 bytes at a time.

	On both the Intel and AMD chips I tested, the performance improvement
	ranges from 5% for already-compressed (.gz) files to 20% for
	highly compressible system logs.

	The compressed files are byte-for-byte identical with those produced
	by the original bzip2.

	The changes are ifdef-ed by __SSE2__ and rely on the intrinsics
	available in the GNU, Intel, and Microsoft compilers.

	No changes to Makefile(s) are necessary -- when targeting an
	SSE2-capable CPU (e.g. ``-march=opteron'' or ``-march=pentium4''),
	__SSE2__ is defined by the compiler.
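
	For illustration only, here is a minimal sketch of the general
	technique (comparing 16 bytes per step with SSE2 intrinsics); it is
	not the actual patch, and cmp16() is a hypothetical helper name:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/*
 * Sketch only -- not the actual patch.  Compare 16 bytes at two buffer
 * positions in one step; returns <0, ==0, >0 (like memcmp) according
 * to the first differing byte among the 16.
 */
static int
cmp16(const uint8_t *a, const uint8_t *b)
{
	__m128i va = _mm_loadu_si128((const __m128i *)a);
	__m128i vb = _mm_loadu_si128((const __m128i *)b);
	/* per-byte equality; movemask yields 0xFFFF when all 16 match */
	int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));

	if (mask == 0xFFFF)
		return 0;
	/* index of the first mismatching byte (GCC/Clang builtin) */
	int i = __builtin_ctz(~mask & 0xFFFF);
	return (int)a[i] - (int)b[i];
}
```

	(__builtin_ctz(), used to locate the first differing byte, is a
	GCC/Clang builtin; MSVC would use _BitScanForward() instead.)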

>How-To-Repeat:

>Fix:

	The patch is available from http://aldan.algebra.com/~mi/bz/

	The patch is not FreeBSD-specific, but was developed, tested, and timed
	on FreeBSD-6.x using both i386 and amd64.

	Feedback welcome.
>Release-Note:
>Audit-Trail:

From: Julian Seward <jseward@acm.org>
To: bug-followup@freebsd.org,
 mi@aldan.algebra.com
Cc:  
Subject: Re: bin/106734: [patch] SSE2 optimization for bzip2/libbz2
Date: Sat, 6 Jan 2007 08:16:28 +0000

 
 According to Valgrind on amd64-linux, this patch causes bzip2 to
 do comparisons based on uninitialised memory when compressing, for the
 attached file (PLIST).
 
 ==16140== Conditional jump or move depends on uninitialised value(s)
 ==16140==    at 0x414ECF: mainGtU (blocksort.c:538)
 ==16140==    by 0x414B08: mainSimpleSort (blocksort.c:690)
 ==16140==    by 0x4150E1: mainQSort3 (blocksort.c:805)
 ==16140==    by 0x415FAD: mainSort (blocksort.c:1063)
 ==16140==    by 0x4163F8: BZ2_blockSort (blocksort.c:1244)
 ==16140==    by 0x40DE46: BZ2_compressBlock (compress.c:660)
 ==16140==    by 0x4054B6: handle_compress (bzlib.c:431)
 ==16140==    by 0x40571A: BZ2_bzCompress (bzlib.c:501)
 ==16140==    by 0x407B0D: BZ2_bzWriteClose64 (bzlib.c:1093)
 ==16140==    by 0x4014E1: compressStream (bzip2.c:451)
 ==16140==    by 0x402E20: compress (bzip2.c:1368)
 ==16140==    by 0x40455C: main (bzip2.c:2034)
 
 share/sgml/docbook/dsssl/modular/README
 share/sgml/docbook/dsssl/modular/VERSION
 share/sgml/docbook/dsssl/modular/BUGS
 share/sgml/docbook/dsssl/modular/TODO
[... most of the attached PLIST cleaned-out for brevity -mi ...]
 @unexec rmdir %D/share/sgml/docbook/dsssl 2>/dev/null || true
 

From: Mikhail Teterin <mi@corbulon.video-collage.com>
To: jseward@acm.org (Julian Seward)
Cc: bug-followup@freebsd.org
Subject: Re: bin/106734: [patch] SSE2 optimization for bzip2/libbz2
Date: Sat, 6 Jan 2007 21:37:20 -0500 (EST)

 > According to Valgrind on amd64-linux, this patch causes bzip2 to
 > do comparisons based on uninitialised memory when compressing, for the
 > attached file (PLIST).
 > 
 > ==16140== Conditional jump or move depends on uninitialised value(s)
 > ==16140==    at 0x414ECF: mainGtU (blocksort.c:538)
 > ==16140==    by 0x414B08: mainSimpleSort (blocksort.c:690)
 > ==16140==    by 0x4150E1: mainQSort3 (blocksort.c:805)
 
 Julian! You informed me of this issue in a direct e-mail on Dec 19.
 I responded the next day, on Dec 20:
 
 	------------------------------------------------------------
 = From my 5-minute investigation, I *think* this is because the loads
 
 =        /* Load the bytes: */
 =        n1 = (__m128i)_mm_loadu_pd((double *)(block + i1));
 =        n2 = (__m128i)_mm_loadu_pd((double *)(block + i2));
 
 I doubt it -- Valgrind's diagnostics say "Conditional jump or move
 depends on uninitialised value(s)". There are no jumps in the above
 code -- just two machine instructions (executed in parallel).
 	------------------------------------------------------------
 
 Have you not received the e-mail with the above text? (Message-Id:
 <200612200615.22819@Misha>).
 
 Yours,
 
 	-mi

From: Julian Seward <jseward@acm.org>
To: Mikhail Teterin <mi@corbulon.video-collage.com>
Cc: bug-followup@freebsd.org
Subject: Re: bin/106734: [patch] SSE2 optimization for bzip2/libbz2
Date: Sun, 7 Jan 2007 05:08:43 +0000

 I believe this analysis is correct:
 
 >         /* Load the bytes: */
 >         n1 = (__m128i)_mm_loadu_pd((double *)(block + i1));
 >         n2 = (__m128i)_mm_loadu_pd((double *)(block + i2));
 > 
 > read beyond the end of the defined area of block.  block is
 > defined for [0 .. nblock + BZ_N_OVERSHOOT - 1], but I think
 > you are doing a SSE load at &block[nblock + BZ_N_OVERSHOOT - 2],
 > hence loading 15 bytes of garbage.
 
 Valgrind doesn't complain about the out-of-range access, because you
 are still accessing inside a valid malloc-allocated block.  But it
 does know that the read data is uninitialised, hence it complains
 when you do a comparison with that data followed by a conditional
 branch (or move) based on the result of the comparison.
 
 > This is possible... You think, the loop should exit earlier and test
 > the last (up to) 15 bytes one-by-one?
 
 Certainly the loop-end stuff needs to be fixed up somehow to reflect
 the 16-byte loads, but without further investigation I'm not sure how.

From: Mikhail Teterin <mi+kde@aldan.algebra.com>
To: Julian Seward <jseward@acm.org>
Cc: bug-followup@freebsd.org
Subject: Re: bin/106734: [patch] SSE2 optimization for bzip2/libbz2
Date: Tue, 9 Jan 2007 18:34:36 -0500

 On Sunday 07 January 2007 00:08, Julian Seward wrote:
 = >         /* Load the bytes: */
 = >         n1 = (__m128i)_mm_loadu_pd((double *)(block + i1));
 = >         n2 = (__m128i)_mm_loadu_pd((double *)(block + i2));
 
 = > read beyond the end of the defined area of block.  block is
 = > defined for [0 .. nblock + BZ_N_OVERSHOOT - 1], but I think
 = > you are doing a SSE load at &block[nblock + BZ_N_OVERSHOOT - 2],
 = > hence loading 15 bytes of garbage.
 
 I don't think that's quite right... Instead of processing 8 bytes at a time, 
 as the non-SSE code does, I'm comparing 16 at a time. Thus it is possible 
 for me to be over by exactly 8 sometimes...
 
 Anyway, the problem stemmed from my bumping i1 and i2 by 16 instead of 8 
 after the _initial check_ (which, in the quadrant-less case, should not need 
 to be separate at all, actually). Sometimes _that_ would bring them over... I 
 think the solution is to either bump up BZ_N_OVERSHOOT even further, or to 
 check and adjust i1 and i2:
 
 	if (i1 >= nblock)
 		i1 -= nblock;
 	if (i2 >= nblock)
 		i2 -= nblock;
 
 at the beginning, rather than the end, of the loop. Having done that, I no 
 longer peek beyond the end of the block (according to gdb's conditional 
 breakpoints, at least).
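 
 A minimal sketch (illustrative only, with a made-up helper name) of the
 loop shape described above -- the wrap-around adjustment of i1/i2 done at
 the top of each iteration, before any 16-byte load:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/*
 * Illustrative only: compare two rotations of block[0..nblock-1],
 * 16 bytes at a time, wrapping i1/i2 at the TOP of the loop.
 * Assumes block carries at least 15 initialised overshoot bytes
 * replicating its start (as bzip2's BZ_N_OVERSHOOT region provides).
 * Returns 1 if the rotation at i1 sorts after the one at i2, else 0.
 */
static int
rot_gt(const uint8_t *block, int32_t nblock, int32_t i1, int32_t i2)
{
	int32_t k = nblock;

	while (k > 0) {
		if (i1 >= nblock) i1 -= nblock;	/* adjust BEFORE loading */
		if (i2 >= nblock) i2 -= nblock;
		__m128i a = _mm_loadu_si128((const __m128i *)(block + i1));
		__m128i b = _mm_loadu_si128((const __m128i *)(block + i2));
		int eq = _mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
		if (eq != 0xFFFF) {
			/* first mismatching byte decides (GCC/Clang builtin) */
			int j = __builtin_ctz(~eq & 0xFFFF);
			return block[i1 + j] > block[i2 + j];
		}
		i1 += 16; i2 += 16; k -= 16;
	}
	return 0;
}
```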
 
 Please, check the new http://aldan.algebra.com/~mi/bz/blocksort-SSE2-patch-2
 
 Yours,
 
 	-mi
 
 P.S. The following gdb-script is what I used. Run as:
 
 	gdb -x x.txt bzip2
 
 x.txt:
 	break blocksort.c:516
 	cond 1 (i1 > nblock) || (i2 > nblock)
 	run -9 < /tmp/PLIST > /dev/null
 
 Adjust the compression level and the input's location as needed, and be sure 
 to have blocksort.o compiled with debug information, of course...
>Unformatted:
