From das@VARK.homeunix.com  Tue Jun  1 15:51:50 2004
Return-Path: <das@VARK.homeunix.com>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 5634616A4CE
	for <FreeBSD-gnats-submit@freebsd.org>; Tue,  1 Jun 2004 15:51:50 -0700 (PDT)
Received: from VARK.homeunix.com (ar59.lsanca2-4.27.98.47.lsanca2.dsl-verizon.net [4.27.98.47])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 045FC43D41
	for <FreeBSD-gnats-submit@freebsd.org>; Tue,  1 Jun 2004 15:51:50 -0700 (PDT)
	(envelope-from das@VARK.homeunix.com)
Received: from VARK.homeunix.com (localhost [127.0.0.1])
	by VARK.homeunix.com (8.12.10/8.12.10) with ESMTP id i51Mpkkj024225
	for <FreeBSD-gnats-submit@freebsd.org>; Tue, 1 Jun 2004 15:51:46 -0700 (PDT)
	(envelope-from das@VARK.homeunix.com)
Received: (from das@localhost)
	by VARK.homeunix.com (8.12.10/8.12.10/Submit) id i51MpkkU024224;
	Tue, 1 Jun 2004 15:51:46 -0700 (PDT)
	(envelope-from das)
Message-Id: <200406012251.i51MpkkU024224@VARK.homeunix.com>
Date: Tue, 1 Jun 2004 15:51:46 -0700 (PDT)
From: David Schultz <das@freebsd.org>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         67469
>Category:       i386
>Synopsis:       src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-i386
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Jun 01 16:00:46 PDT 2004
>Closed-Date:    Mon Sep 11 11:51:23 GMT 2006
>Last-Modified:  Mon Sep 11 11:51:23 GMT 2006
>Originator:     David Schultz
>Release:        FreeBSD 5.2-CURRENT i386
>Organization:
>Environment:
>Description:
src/lib/msun/i387/s_tan.S returns wildly inaccurate results when
its input has a large magnitude (>> 2*pi).  For example:

input			s_tan.S				k_tan.c
1.776524190754802e+269	1.773388446261095e+16		-1.367233274980565e+01
1.182891728897420e+57	-1.9314539773999572e-01		1.0020569035866138e+03
2.303439778835110e+202	2.8465460220132694e+00		3.5686329695133922e+00

I suspect that this is caused in the modular reduction phase, which is
used for inputs with magnitude greater than 2^63.  In these cases, the
inputs are taken mod 2*pi, but the double precision representation of
pi isn't precise enough to obtain a meaningful result for this
computation.

>How-To-Repeat:

>Fix:
One solution might involve doing the reduction using
__ieee754_rem_pio4() instead of the fprem1 instruction.
Unfortunately, since this uses pi/2 as the modulus, it
is necessary to apply the identity tan(x + pi/2) = -1/tan(x)
for odd-phase inputs.  I tried this for the first example
input above and got an answer that was off by 2 ulps.  Close,
but the MI implementation gets within 1 ulp.  I'm not sure what
kind of correction (if any) should be used here.

This PR exists mainly as a reminder for me or anyone else who wants to
look at it more carefully some lazy Saturday afternoon.  I can't
imagine many people care what the tangent of 10 billion is, but the
tan() function *is* supposed to give the correct answer.
>Release-Note:
>Audit-Trail:

From: Bruce Evans <bde@zeta.org.au>
To: David Schultz <das@freebsd.org>
Cc: FreeBSD-gnats-submit@freebsd.org, freebsd-i386@freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Wed, 2 Jun 2004 17:51:23 +1000 (EST)

 On Tue, 1 Jun 2004, David Schultz wrote:
 
 > >Description:
 > src/lib/msun/i387/s_tan.S returns wildly inaccurate results when
 > its input has a large magnitude (>> 2*pi).  For example:
 >
 > input			s_tan.S				k_tan.c
 > 1.776524190754802e+269	1.773388446261095e+16		-1.367233274980565e+01
 > 1.182891728897420e+57	-1.9314539773999572e-01		1.0020569035866138e+03
 > 2.303439778835110e+202	2.8465460220132694e+00		3.5686329695133922e+00
 >
 > I suspect that this is caused in the modular reduction phase, which is
 > used for inputs with magnitude greater than 2^63.  In these cases, the
 > inputs are taken mod 2*pi, but the double precision representation of
 > pi isn't precise enough to obtain a meaningful result for this
 > computation.
 
 All of the i387 trig functions have this bug.  fldpi actually gives a
 value of pi with extended precision (64 bits), and I believe fprem
 uses extended precision and there is some magic to give a few more
 bits of precision (66 altogether?).  However, fdlibm is much more
 precise than this.  It uses a 1584-bit value for 2/pi to get 113 bits
 of precision for range reduction in the double precision case.  (The
 single precision case is not so precise and is broken anyway (off by
 more than pi/2 for many values between 32768 and 65536 according to
 my notes).  It is also very slow.)
 
 > >How-To-Repeat:
 >
 > >Fix:
 > One solution might involve doing the reduction using
 > __ieee754_rem_pio4() instead of the fprem1 instruction.
 > Unfortunately, since this uses pi/2 as the modulus, it
 > is necessary to apply the identity tan(x + pi/2) = -1/tan(x)
 > for odd-phase inputs.  I tried this for the first example
 > input above and got an answer that was off by 2 ulps.  Close,
 > but the MI implementation gets within 1 ulp.  I'm not sure what
 > kind of correction (if any) should be used here.
 
 I think the complete fdlibm version of tan(), sin() and cos() should
 be used for large args.  "large" could be classified by failure of the
 first fptan.  Or just return the TLOSS error for very large args.
 
 Bruce

From: David Schultz <das@FreeBSD.ORG>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG, freebsd-i386@FreeBSD.ORG
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
Date: Wed, 2 Jun 2004 01:43:13 -0700

 On Wed, Jun 02, 2004, Bruce Evans wrote:
 > > >Fix:
 > > One solution might involve doing the reduction using
 > > __ieee754_rem_pio4() instead of the fprem1 instruction.
 > > Unfortunately, since this uses pi/2 as the modulus, it
 > > is necessary to apply the identity tan(x + pi/2) = -1/tan(x)
 > > for odd-phase inputs.  I tried this for the first example
 > > input above and got an answer that was off by 2 ulps.  Close,
 > > but the MI implementation gets within 1 ulp.  I'm not sure what
 > > kind of correction (if any) should be used here.
 > 
 > I think the complete fdlibm version of tan(), sin() and cos() should
 > be used for large args.  "large" could be classified by failure of the
 > first fptan.  Or just return the TLOSS error for very large args.
 
 I was hoping it could be fixed without the bloat, but your
 solution is clean and easy to implement...

From: David Schultz <das@FreeBSD.ORG>
To: FreeBSD-gnats-submit@FreeBSD.ORG, freebsd-i386@FreeBSD.ORG
Cc: bde@FreeBSD.ORG
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
Date: Fri, 4 Feb 2005 16:59:13 -0500

 On Wed, Jun 02, 2004, Bruce Evans wrote:
 > On Tue, 1 Jun 2004, David Schultz wrote:
 > 
 > > >Description:
 > > src/lib/msun/i387/s_tan.S returns wildly inaccurate results when
 > > its input has a large magnitude (>> 2*pi).  For example:
 > >
 > > input			s_tan.S				k_tan.c
 > > 1.776524190754802e+269	1.773388446261095e+16		-1.367233274980565e+01
 > > 1.182891728897420e+57	-1.9314539773999572e-01		1.0020569035866138e+03
 > > 2.303439778835110e+202	2.8465460220132694e+00		3.5686329695133922e+00
 
 Here is a patch to fix the problem for tan().  See caveats below...
 
 Index: s_tan.S
 ===================================================================
 RCS file: /cvs/src/lib/msun/i387/s_tan.S,v
 retrieving revision 1.6
 diff -u -r1.6 s_tan.S
 --- s_tan.S	28 Aug 1999 00:06:14 -0000	1.6
 +++ s_tan.S	4 Feb 2005 21:43:32 -0000
 @@ -45,14 +45,21 @@
  	jnz	1f
  	fstp	%st(0)
  	ret
 -1:	fldpi
 -	fadd	%st(0)
 -	fxch	%st(1)
 -2:	fprem1
 -	fstsw	%ax
 -	andw	$0x400,%ax
 -	jnz	2b
 -	fstp	%st(1)
 -	fptan
 -	fstp	%st(0)
 +
 +/* Use the fdlibm routines for accuracy with large arguments. */
 +1:	pushl   %ebp
 +	movl    %esp,%ebp
 +	subl    $32,%esp
 +	leal	12(%esp),%eax
 +	movl	%eax,8(%esp)
 +	fstpl	(%esp)
 +	call	__ieee754_rem_pio2
 +	addl	$12,%esp
 +	andl	$1,%eax			/* compute (eax & 1) ? -1 : 1 */
 +	sall	%eax
 +	subl	$1,%eax
 +	neg	%eax
 +	movl	%eax,16(%esp)
 +	call	__kernel_tan
 +	leave
  	ret
 
 Unfortunately, I'm still getting the wrong answer for large values
 that are *supposed* to be handled by the fptan instruction.  The
 error seems to increase towards the end of the range of fptan,
 (-2^63,2^63).  For instance, tan(0x1.3dea2a2c29172p+22) is only
 off by the least significant 15 binary digits or so, but
 tan(0x1.2c95e550f1635p+62) is off by about 5%.  Is fptan simply
 inherently inaccurate, or did I screw up somewhere?  I would be
 interested in results from an AMD processor.

From: Bruce Evans <bde@zeta.org.au>
To: David Schultz <das@FreeBSD.org>
Cc: FreeBSD-gnats-submit@FreeBSD.org, freebsd-i386@FreeBSD.org,
	bde@FreeBSD.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Sat, 5 Feb 2005 20:33:43 +1100 (EST)

 On Fri, 4 Feb 2005, David Schultz wrote:
 
 > On Wed, Jun 02, 2004, Bruce Evans wrote:
 > > On Tue, 1 Jun 2004, David Schultz wrote:
 > >
 > > > >Description:
 > > > src/lib/msun/i387/s_tan.S returns wildly inaccurate results when
 > > > its input has a large magnitude (>> 2*pi).  For example:
 > > >
 > > > input			s_tan.S				k_tan.c
 > > > 1.776524190754802e+269	1.773388446261095e+16		-1.367233274980565e+01
 > > > 1.182891728897420e+57	-1.9314539773999572e-01		1.0020569035866138e+03
 > > > 2.303439778835110e+202	2.8465460220132694e+00		3.5686329695133922e+00
 >
 > Here is a patch to fix the problem for tan().  See caveats below...
 
 Seems like a good method...
 
 > Index: s_tan.S
 > ===================================================================
 > RCS file: /cvs/src/lib/msun/i387/s_tan.S,v
 > retrieving revision 1.6
 > diff -u -r1.6 s_tan.S
 > --- s_tan.S	28 Aug 1999 00:06:14 -0000	1.6
 > +++ s_tan.S	4 Feb 2005 21:43:32 -0000
 > @@ -45,14 +45,21 @@
 >  	jnz	1f
 
 "Large" probably needs to be determined by comparison with a constant.
 fptan also gives large relative errors if the result is small, even
 for small args like M_PI (2^39 ulps for M_PI!!).  These can be reduced
 to 1 ulp (assuming that fdlibm is accurate to 1 ulp) using comparison
 of the result with a constant.  See below.
 
 >  	fstp	%st(0)
 >  	ret
 > -1:	fldpi
 > -	fadd	%st(0)
 > -	fxch	%st(1)
 > -2:	fprem1
 > -	fstsw	%ax
 > -	andw	$0x400,%ax
 > -	jnz	2b
 > -	fstp	%st(1)
 > -	fptan
 > -	fstp	%st(0)
 > +
 > +/* Use the fdlibm routines for accuracy with large arguments. */
 > +1:	pushl   %ebp
 > +	movl    %esp,%ebp
 > +	subl    $32,%esp
 > +	leal	12(%esp),%eax
 > +	movl	%eax,8(%esp)
 > +	fstpl	(%esp)
 > +	call	__ieee754_rem_pio2
 > +	addl	$12,%esp
 > +	andl	$1,%eax			/* compute (eax & 1) ? -1 : 1 */
 > +	sall	%eax
 > +	subl	$1,%eax
 > +	neg	%eax
 > +	movl	%eax,16(%esp)
 > +	call	__kernel_tan
 
 Better call the MI tan() to do all this.  It won't take significantly
 longer, and shouldn't be reached in most cases anyway.
 
 > +	leave
 >  	ret
 >
 > Unfortunately, I'm still getting the wrong answer for large values
 > that are *supposed* to be handled by the fptan instruction.  The
 > error seems to increase towards the end of the range of fptan,
 > (-2^63,2^63).  For instance, tan(0x1.3dea2a2c29172p+22) is only
 > off by the least significant 15 binary digits or so, but
 > tan(0x1.2c95e550f1635p+62) is off by about 5%.  Is fptan simply
 > inherently inaccurate, or did I screw up somewhere?  I would be
 > interested in results from an AMD processor.
 
 I think it is because fptan is inherently inaccurate.  In an earlier
 reply, I said that fldpi cannot give a value that has more than 64
 bits of accuracy, and that there is some magic that sometimes gives
 66 bits.  I now think the magic is just that the hardware trig functions
 use an internal value of pi with that amount of accuracy.  When they
 fail, we use fldpi which gives even less accuracy.
 
 fptan succeeds when its arg is in the range [-2^63,2^63], but to be
 accurate with 66 bits of precision for the internal divisor, the arg
 must be in the range of about [-2^13,2^13] for double precision and
 [-2^2,2^2] for extended precision.  Better reduce the ranges more for
 safety.
 
 Testing shows that the inaccuracy is much worse than that.  The number
 of correct digits is more like (66 - 53) than (66 - 13) when the result
 is very close to 0, even for small args near pi.  13 bits of accuracy
 means 2^40 ulps of inaccuracy for such results.  It's surprising that
 ucbtest doesn't report this.
 
 So it seems that the i387 tan() should do something like the following:
 
 	/* 11 = 13 less a bit for safety. */
 	if (fabs(arg) > pow(2, 11))
 		return __mitan(arg);
 	y = fptan(arg);
 	/*
 	 * I think the -6 in the following should be more like -2
 	 * (plus 2 for safety?), but -2 would slow down too many cases.
 	 * Even -6 gives a 2-ulp error for
 	 * arg = M_PI * pow(2, 11) + (delta = pow(2, -6)) + delta2(delta)
 	 * The magic 2 here is the 2 extra bits of precision in the
 	 * internal value of pi, or possibly the conversion of that
 	 * value via the pow(2, 11) range limit (2 = 66 - 53 - 11).
 	 */
 	if (fabs(y) < pow(2, -6))
 		return __mitan(arg);
 	return (y);
 
 Simple test program:
 
 %%%
 #include <math.h>
 #include <stdio.h>
 
 double xtan(double x);
 
 int
 main(void)
 {
 	double x, y;
 
 	for (x = 1; x < 1 << 20; x = 2 * x) {
 		printf("%.0a:\n%.13a\n%.13a\n",
 		    x, tan(x), xtan(x));
 		y = M_PI * x;
 		printf("M_PI * %.0a:\n%.13a\n%.13a\n",
 		    x, tan(y), xtan(y));
 		y += 0x1.0p-6;
 		printf("M_PI * %.0a + 0x1p-6:\n%.13a\n%.13a\n",
 		    x, tan(y), xtan(y));
 	}
 }
 %%%
 
 xtan() is the MI fdlibm tan().
 
 Output:
 
 %%%
 0x1p+0:
 0x1.8eb245cbee3a6p+0
 0x1.8eb245cbee3a6p+0
 M_PI * 0x1p+0:
 -0x1.1a60000000000p-53
 -0x1.1a62633145c07p-53
 M_PI * 0x1p+0 + 0x1p-6:
 0x1.0005557778525p-6
 0x1.0005557778525p-6
 0x1p+1:
 -0x1.17af62e0950f8p+1
 -0x1.17af62e0950f8p+1
 M_PI * 0x1p+1:
 -0x1.1a60000000000p-52
 -0x1.1a62633145c07p-52
 M_PI * 0x1p+1 + 0x1p-6:
 0x1.0005557778502p-6
 0x1.0005557778502p-6
 0x1p+2:
 0x1.2866f9be4de13p+0
 0x1.2866f9be4de14p+0
 M_PI * 0x1p+2:
 -0x1.1a60000000000p-51
 -0x1.1a62633145c07p-51
 M_PI * 0x1p+2 + 0x1p-6:
 0x1.00055577784bbp-6
 0x1.00055577784bbp-6
 0x1p+3:
 -0x1.b32e78f49a1e4p+2
 -0x1.b32e78f49a1e4p+2
 M_PI * 0x1p+3:
 -0x1.1a60000000000p-50
 -0x1.1a62633145c07p-50
 M_PI * 0x1p+3 + 0x1p-6:
 0x1.000555777842ep-6
 0x1.000555777842ep-6
 0x1p+4:
 0x1.33d8f03e769a0p-2
 0x1.33d8f03e769a0p-2
 M_PI * 0x1p+4:
 -0x1.1a60000000000p-49
 -0x1.1a62633145c07p-49
 M_PI * 0x1p+4 + 0x1p-6:
 0x1.0005557778314p-6
 0x1.0005557778314p-6
 0x1p+5:
 0x1.526f6245432a4p-1
 0x1.526f6245432a4p-1
 M_PI * 0x1p+5:
 -0x1.1a60000000000p-48
 -0x1.1a62633145c07p-48
 M_PI * 0x1p+5 + 0x1p-6:
 0x1.00055577780dfp-6
 0x1.00055577780dfp-6
 0x1p+6:
 0x1.2c86afc5c9119p+1
 0x1.2c86afc5c9119p+1
 M_PI * 0x1p+6:
 -0x1.1a60000000000p-47
 -0x1.1a62633145c07p-47
 M_PI * 0x1p+6 + 0x1p-6:
 0x1.0005557777c75p-6
 0x1.0005557777c75p-6
 0x1p+7:
 -0x1.0a65bcce6f48cp+0
 -0x1.0a65bcce6f48cp+0
 M_PI * 0x1p+7:
 -0x1.1a60000000000p-46
 -0x1.1a62633145c07p-46
 M_PI * 0x1p+7 + 0x1p-6:
 0x1.00055577773a2p-6
 0x1.00055577773a1p-6
 0x1p+8:
 0x1.91c8f293711dbp+4
 0x1.91c8f293711dbp+4
 M_PI * 0x1p+8:
 -0x1.1a60000000000p-45
 -0x1.1a62633145c07p-45
 M_PI * 0x1p+8 + 0x1p-6:
 0x1.00055577761fap-6
 0x1.00055577761fap-6
 0x1p+9:
 -0x1.46be0f0a73387p-4
 -0x1.46be0f0a73388p-4
 M_PI * 0x1p+9:
 -0x1.1a60000000000p-44
 -0x1.1a62633145c07p-44
 M_PI * 0x1p+9 + 0x1p-6:
 0x1.0005557773eacp-6
 0x1.0005557773eacp-6
 0x1p+10:
 -0x1.48d5be43ada01p-3
 -0x1.48d5be43ada01p-3
 M_PI * 0x1p+10:
 -0x1.1a60000000000p-43
 -0x1.1a62633145c07p-43
 M_PI * 0x1p+10 + 0x1p-6:
 0x1.000555776f810p-6
 0x1.000555776f80fp-6
 0x1p+11:
 -0x1.518972221f88ep-2
 -0x1.518972221f88ep-2
 M_PI * 0x1p+11:
 -0x1.1a60000000000p-42
 -0x1.1a62633145c07p-42
 M_PI * 0x1p+11 + 0x1p-6:
 0x1.0005557766ad7p-6
 0x1.0005557766ad5p-6
 0x1p+12:
 -0x1.7aae915eb67f3p-1
 -0x1.7aae915eb67f3p-1
 M_PI * 0x1p+12:
 -0x1.1a60000000000p-41
 -0x1.1a62633145c07p-41
 M_PI * 0x1p+12 + 0x1p-6:
 0x1.0005557755065p-6
 0x1.0005557755061p-6
 0x1p+13:
 -0x1.a1ff2171ec9fbp+1
 -0x1.a1ff2171ec9fbp+1
 M_PI * 0x1p+13:
 -0x1.1a60000000000p-40
 -0x1.1a62633145c07p-40
 M_PI * 0x1p+13 + 0x1p-6:
 0x1.0005557731b82p-6
 0x1.0005557731b79p-6
 0x1p+14:
 0x1.5a04d6e15c566p-1
 0x1.5a04d6e15c565p-1
 M_PI * 0x1p+14:
 -0x1.1a60000000000p-39
 -0x1.1a62633145c07p-39
 M_PI * 0x1p+14 + 0x1p-6:
 0x1.00055576eb1bbp-6
 0x1.00055576eb1a8p-6
 0x1p+15:
 0x1.3e75a49b5b447p+1
 0x1.3e75a49b5b446p+1
 M_PI * 0x1p+15:
 -0x1.1a60000000000p-38
 -0x1.1a62633145c07p-38
 M_PI * 0x1p+15 + 0x1p-6:
 0x1.000555765de2ep-6
 0x1.000555765de08p-6
 0x1p+16:
 -0x1.eae2708cf5424p-1
 -0x1.eae2708cf5425p-1
 M_PI * 0x1p+16:
 -0x1.1a60000000000p-37
 -0x1.1a62633145c07p-37
 M_PI * 0x1p+16 + 0x1p-6:
 0x1.0005557543714p-6
 0x1.00055575436c7p-6
 0x1p+17:
 -0x1.7bcb26d5d9adap+4
 -0x1.7bcb26d5d9af5p+4
 M_PI * 0x1p+17:
 -0x1.1a60000000000p-36
 -0x1.1a62633145c07p-36
 M_PI * 0x1p+17 + 0x1p-6:
 0x1.000555730e8dfp-6
 0x1.000555730e846p-6
 0x1p+18:
 0x1.59ba3666c9a5cp-4
 0x1.59ba3666c9a43p-4
 M_PI * 0x1p+18:
 -0x1.1a60000000000p-35
 -0x1.1a62633145c07p-35
 M_PI * 0x1p+18 + 0x1p-6:
 0x1.0005556ea4c75p-6
 0x1.0005556ea4b44p-6
 0x1p+19:
 0x1.5c354a31a8487p-3
 0x1.5c354a31a846ep-3
 M_PI * 0x1p+19:
 -0x1.1a60000000000p-34
 -0x1.1a62633145c07p-34
 M_PI * 0x1p+19 + 0x1p-6:
 0x1.00055565d13a2p-6
 0x1.00055565d113fp-6
 %%%
 
 Notes on special values:
 
 % 0x1p+11:
 % -0x1.518972221f88ep-2
 % -0x1.518972221f88ep-2
 
 No difference here.  However, there are 1-ulp differences for smaller
 args.
 
 % M_PI * 0x1p+11:
 % -0x1.1a60000000000p-42
 % -0x1.1a62633145c07p-42
 
 There is always this huge difference for the (M_PI * x) args.  Only the
 leading 14 mantissa bits agree.
 
 % M_PI * 0x1p+11 + 0x1p-6:
 % 0x1.0005557766ad7p-6
 % 0x1.0005557766ad5p-6
 
 This is the first 2-ulp difference for the (M_PI * integer + delta) args
 in my limited testing.
 
 % 0x1p+17:
 % -0x1.7bcb26d5d9adap+4
 % -0x1.7bcb26d5d9af5p+4
 
 This is the first 2-ulp difference for the integer args in the above output.
 The difference of 27 ulps is consistent with an internal value of pi that
 is not much more accurate than 66 bits (17 + 53 - log2(27) = 65.25...).
 
 Bruce

From: David Schultz <das@FreeBSD.ORG>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG, freebsd-i386@FreeBSD.ORG
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
Date: Sat, 5 Feb 2005 19:09:12 -0500

 On Sat, Feb 05, 2005, Bruce Evans wrote:
 > Better call the MI tan() to do all this.  It won't take significantly
 > longer, and shouldn't be reached in most cases anyway.
 
 Yeah, but s_tan.S overrides s_tan.c, so that would require extra
 machinery.  Also, if fptan had worked as advertised and correctly
 identified the cases where range reduction was necessary, calling
 rem_pio2() directly would have avoided superfluous range checking.
 
 > I think it is because fptan is inherently inaccurate.  [...]
 
 As you mention, it gets the wrong answer for M_PI.  In general, it
 seems to do pretty badly on numbers close to multiples of pi.
 Granted, one could argue that the answer returned is going to be
 even *further* from the answer the programmer expected (namely, 0),
 so it won't make much difference. ;-)
 
 BTW, Kahan has a program for producing large floating-point
 numbers close to multiples of pi/2:
 	http://www.cs.berkeley.edu/~wkahan/testpi/
 
 In investigating this, I discovered several interesting things:
 
 - The Intel software developer's guide says that fptan has an error
   of at most 1 ulp, but as we've seen, they're lying.
 
 - Maple 9.5 on sparc64 with Digits:=5000 corroborates fdlibm's answer
   to tan(M_PI).
 
 - fdlibm's tan() does not seem to be much slower than fptan
   on the range (-M_PI,M_PI), which is what people are most
   likely to care about.  The MD version is faster for some
   inputs where the small angle approximation applies
   (|x| < 2^-28 implies x == tan(x)). fdlibm special cases
   this, too, but the special case isn't early enough, and the
   bugfix in fdlibm 5.3 slows things down.  Throwing that
   special case out of the average, fdlibm seems to be faster!
   (NB: I benchmarked this very sloppily on ref4.  Perhaps
   you could confirm the results...)
 
 Your suggestion of effectively special-casing large inputs and
 inputs close to multiples of pi is probably the right way to fix
 the inaccuracy.  However, I'm worried that it would wipe away
 any performance benefit of using fptan, if there is a benefit
 at all.  How about punting and removing s_tan.S instead?
 
 Other unanswered questions (ENOTIME right now):
 - What about sin() and cos()?
 - What about the float versions of these?

From: Bruce Evans <bde@zeta.org.au>
To: David Schultz <das@freebsd.org>
Cc: FreeBSD-gnats-submit@freebsd.org, freebsd-i386@freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Sun, 6 Feb 2005 18:32:38 +1100 (EST)

 On Sat, 5 Feb 2005, David Schultz wrote:
 
 > On Sat, Feb 05, 2005, Bruce Evans wrote:
 > > Better call the MI tan() to do all this.  It won't take significantly
 > > longer, and shouldn't be reached in most cases anyway.
 >
 > Yeah, but s_tan.S overrides s_tan.c, so that would require extra
 > machinery.  Also, if fptan had worked as advertised and correctly
 > identified the cases where range reduction was necessary, calling
 > rem_pio2() directly would have avoided superfluous range checking.
 
 I hacked the file (but not the machinery) easily enough for testing.
 
 fptan advertises to work? :-)  At least the amd64 manual (instruction
 reference) only claims that it sets C2 if the arg is outside the range
 [-2^63,2^63].
 
 > > I think it is because fptan is inherently inaccurate.  [...]
 >
 > As you mention, it gets the wrong answer for M_PI.  In general, it
 > seems to do pretty badly on numbers close to multiples of pi.
 > Granted, one could argue that the answer returned is going to be
 > even *further* from the answer the programmer expected (namely, 0),
 > so it won't make much difference. ;-)
 
 Too bad the correct answer is a little further from 0.
 
 > ...
 > - The Intel software developer's guide says that fptan has an error
 >   of at most 1 ulp, but as we've seen, they're lying.
 
 The amd64 manual (application reference) says "x87 computations are carried
 out in double extended precision format, so that the transcendental
 functions [are accurate to 1 ulp]".  The "so that" part doesn't follow.
 We are seeing something like naive extended precision computations and how
 inaccurate they can be even when only a double precision result is wanted.
 
 > [... too much detail to reply to :-)]
 > Your suggestion of effectively special-casing large inputs and
 > inputs close to multiples of pi is probably the right way to fix
 > the inaccuracy.  However, I'm worried that it would wipe away
 > any performance benefit of using fptan, if there is a benefit
 > at all.  How about punting and removing s_tan.S instead?
 
 The problem affects at least sin() and cos() too, so I think throwing
 away the optimized versions is too extreme.  Perhaps a single range
 check to let the MD version handle only values near 0 ([-pi/2+eps,
 pi/2-eps]? would be efficient enough.
 
 > Other unanswered questions (ENOTIME right now):
 > - What about sin() and cos()?
 > - What about the float versions of these?
 
 I tested sin().  It misbehaves similarly (with identical results for
 args of the form 2 * n * M_PI).
 
 It's interesting that there is fptan and not ftan, but fsin and not fpsin.
 Both are partial.
 
 I don't run -current and hadn't seen your code for the MD float versions...
 They are buggier:
 (1) the exponent can be up to 127, so more than half of the representable
     values exceed 2^63 in magnitude and thus need range reduction.  Results
     for tan(0x1p64) using various methods:
 
       buggy MD tanf:   0x1p+64                 1.84467e+19
       buggy MD tan:   -0x1.8467b926c834bp-2   -0.379302
       fdlibm MI tanf: -0x1.82bee6p-6          -0.0236051
       bc s(x)/c(x) (scale=40):                 -.02360508353334969937000...
       bc s(x)/c(x) (default scale=20):         -.02358521765210826916
 
     It looks like fdlibm is perfect and scale=20 in bc is not quite enough.
     sinf(0x1p64) would be 0x1p64 too.  This is preposterous for sin().
 (2) they return extra bits of precision.  The MD double precision versions
     have the same bug; the bug is just clearer for the float case, and
     cannot be fixed by FreeBSD's rounding precision hack.
 
 BTW, I don't trust the range reduction for floating point pi/2 but
 never finished debugging or fixing it.  According to my old comment,
 it is off by pi/2 and very slow for many values between 32768 and
 65535.  However, I couldn't duplicate this misbehaviour last time I
 looked at it.  I used to fix it using the double version.
 
 Bruce

From: David Schultz <das@FreeBSD.ORG>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG, freebsd-i386@FreeBSD.ORG
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
Date: Sun, 6 Feb 2005 04:21:51 -0500

 On Sun, Feb 06, 2005, Bruce Evans wrote:
 > > [... too much detail to reply to :-)]
 > > Your suggestion of effectively special-casing large inputs and
 > > inputs close to multiples of pi is probably the right way to fix
 > > the inaccuracy.  However, I'm worried that it would wipe away
 > > any performance benefit of using fptan, if there is a benefit
 > > at all.  How about punting and removing s_tan.S instead?
 > 
 > The problem affects at least sin() and cos() too, so I think throwing
 > away the optimized versions is too extreme.  Perhaps a single range
 > check to let the MD version handle only values near 0 ([-pi/2+eps,
 > pi/2-eps]? would be efficient enough.
 > 
 > > Other unanswered questions (ENOTIME right now):
 > > - What about sin() and cos()?
 > > - What about the float versions of these?
 > 
 > I tested sin().  It misbehaves similarly (with identical results for
 > args of the form 2 * n * M_PI).
 
 My question was more along the lines of, ``Is there an actual
 performance benefit for sin() and cos()?''  If there is little or
 no benefit, then there is no reason to keep the assembly routines
 and worry about how to fix them.  I'll try to investigate this
 aspect in more detail at some point, unless you beat me to it...
 
 > I don't run -current and hadn't seen your code for the MD float versions...
 > They are buggier:
 > (1) the exponent can be up to 127, so more than half of the representable
 >     values exceed 2^63 in magnitude and thus need range reduction.  Results
 >     for tan(0x1p64) using various methods:
 > 
 >       buggy MD tanf:   0x1p+64                 1.84467e+19
 >       buggy MD tan:   -0x1.8467b926c834bp-2   -0.379302
 >       fdlibm MI tanf: -0x1.82bee6p-6          -0.0236051
 
 Whups.  FWIW, they're not mine; I got them from NetBSD.  Another
 one of the float functions in the NetBSD repository is buggy,
 too, but I forget which one.  I only imported the ones that seemed
 to be correct and faster than fdlibm for input bit patterns chosen
 uniformly at random, but I guess I missed that problem.
 
 > (2) they return extra bits of precision.  The MD double precision versions
 >     have the same bug; the bug is just clearer for the float case, and
 >     cannot be fixed by FreeBSD's rounding precision hack.
 > 
 > BTW, I don't trust the range reduction for floating point pi/2 but
 > never finished debugging or fixing it.  According to my old comment,
 > it is off by pi/2 and very slow for many values between 32768 and
 > 65535.  However, I couldn't duplicate this misbehaviour last time I
 > looked at it.  I used to fix it using the double version.
 
 Yeah, there seem to be many values for which __ieee754_rem_pio2f()
 returns the wrong quotient:
 
 -0x1.8009c6p+8       (input)
         d = -244     (return value of __ieee754_rem_pio2())
         f = -4       (return value of __ieee754_rem_pio2f())
 
 0x1.389ee4p+87
         d = 4
         f = 3
 
 -0x1.70bca6p+16
         d = -60095
         f = -7
 
 It also gets the remainder completely wrong sometimes:
 
 0x1.389ee4p+87
         rd0 = -0x1.6ad286p-4       (y[0] from rem_pio2(), rounded to float)
         rf0 = 0x1.7b728cp+0        (y[0] from rem_pio2f())
 
 0x1.4d23ecp+95
         rd0 = -0x1.ee68f8p-2
         rf0 = 0x1.168578p+0
 
 -0x1.5a172p+63
         rd0 = 0x1.52f2aap-1
         rf0 = -0x1.d14ccp-1
 
 Also, rem_pio2f() is probably not much more efficient than
 rem_pio2().  The former would be better if it used a single double
 instead of two floats for increased accuracy.  (The double version
 has to use two doubles for accuracy because there's no ``double
 double'' type.)
 
 Results for MI tan() differ from MI tanf() by >2 ulp for many
 inputs, including:
 
 	0x1.00008ep+15
 	0x1.0000ap+15
 	0x1.0000a4p+15
 	0x1.0000aep+15
 	[...]
 
 Perhaps this is due to the problem with rem_pio2f().

From: Bruce Evans <bde@zeta.org.au>
To: David Schultz <das@freebsd.org>
Cc: FreeBSD-gnats-submit@freebsd.org, freebsd-i386@freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Sun, 6 Feb 2005 23:27:56 +1100 (EST)

 On Sun, 6 Feb 2005, David Schultz wrote:
 
 > On Sun, Feb 06, 2005, Bruce Evans wrote:
 > ...
 > > BTW, I don't trust the range reduction for floating point pi/2 but
 > > never finished debugging or fixing it.  According to my old comment,
 > > it is off by pi/2 and very slow for many values between 32768 and
 > > 65535.  However, I couldn't duplicate this misbehaviour last time I
 > > looked at it.  I used to fix it using the double version.
 >
 > Yeah, there seem to be many values for which __ieee754_rem_pio2f()
 > returns the wrong quotient:
 >
 > -0x1.8009c6p+8       (input)
 >         d = -244     (return value of __ieee754_rem_pio2())
 >         f = -4       (return value of __ieee754_rem_pio2f())
 > ...
 > It also gets the remainder completely wrong sometimes:
 >
 > 0x1.389ee4p+87
 >         rd0 = -0x1.6ad286p-4       (y[0] from rem_pio2(), rounded to float)
 >         rf0 = 0x1.7b728cp+0        (y[0] from rem_pio2f())
 > ...
 
 I didn't check these.  When I tried to debug this, I got confused by
 y[0] quite often legitimately differing, because the result is in y[1]
 and the integer result too and there are many equivalent combinations.
 
 I also started checking the float trig functions on all 2^32 possible
 args, but got discouraged by too many differences of more than 1 ulp.
 
 > Also, rem_pio2f() is probably not much more efficient than
 > rem_pio2().  The former would be better if it used a single double
 > instead of two floats for increased accuracy.  (The double version
 > has to use two doubles for accuracy because there's no ``double
 > double'' type.)
 
 As you know, this is related to the general uselessness of the float
 interfaces on machines with doubles.  But SSE1 makes this more interesting
 even on i386's.  Floats might be faster despite the existence of doubles
 because a different ALU can be used for them.
 
 > Results for MI tan() differ from MI tanf() by >2 ulp for many
 > inputs, including:
 >
 > 	0x1.00008ep+15
 > 	0x1.0000ap+15
 > 	0x1.0000a4p+15
 > 	0x1.0000aep+15
 > 	[...]
 >
 > Perhaps this is due to the problem with rem_pio2f().
 
 It does.  Reverting to my version of rem_pio2f() that just calls
 rem_pio2() fixes the MI tanf() on all these values.  Here is the patch.
 The changes in the "#if 0" part seemed to help for some cases but they
 don't help much for the above values.
 
 %%%
 Index: e_rem_pio2f.c
 ===================================================================
 RCS file: /home/ncvs/src/lib/msun/src/e_rem_pio2f.c,v
 retrieving revision 1.7
 diff -u -2 -r1.7 e_rem_pio2f.c
 --- e_rem_pio2f.c	28 May 2002 17:51:46 -0000	1.7
 +++ e_rem_pio2f.c	6 Feb 2005 11:53:53 -0000
 @@ -58,10 +58,10 @@
     single precision and the last 8 bits are forced to 0.  */
  static const int32_t npio2_hw[] = {
 -0x3fc90f00, 0x40490f00, 0x4096cb00, 0x40c90f00, 0x40fb5300, 0x4116cb00,
 -0x412fed00, 0x41490f00, 0x41623100, 0x417b5300, 0x418a3a00, 0x4196cb00,
 -0x41a35c00, 0x41afed00, 0x41bc7e00, 0x41c90f00, 0x41d5a000, 0x41e23100,
 -0x41eec200, 0x41fb5300, 0x4203f200, 0x420a3a00, 0x42108300, 0x4216cb00,
 -0x421d1400, 0x42235c00, 0x4229a500, 0x422fed00, 0x42363600, 0x423c7e00,
 -0x4242c700, 0x42490f00
 +0x3fc90f80, 0x40490f80, 0x4096cb80, 0x40c90f80, 0x40fb5380, 0x4116cb80,
 +0x412fed80, 0x41490f80, 0x41623180, 0x417b5380, 0x418a3a80, 0x4196cb80,
 +0x41a35c80, 0x41afed80, 0x41bc7e80, 0x41c90f80, 0x41d5a080, 0x41e23180,
 +0x41eec280, 0x41fb5380, 0x4203f200, 0x420a3a80, 0x42108300, 0x4216cb80,
 +0x421d1400, 0x42235c80, 0x4229a500, 0x422fed80, 0x42363600, 0x423c7e80,
 +0x4242c700, 0x42490f80,
  };
 
 @@ -88,6 +88,8 @@
  pio2_3t =  6.1232342629e-17; /* 0x248d3132 */
 
 -	int32_t __ieee754_rem_pio2f(float x, float *y)
 +int32_t
 +__ieee754_rem_pio2f(float x, float *y)
  {
 +#if 0
  	float z,w,t,r,fn;
  	float tx[3];
 @@ -129,5 +131,5 @@
  	    r  = t-fn*pio2_1;
  	    w  = fn*pio2_1t;	/* 1st round good to 40 bit */
 -	    if(n<32&&(ix&0xffffff00)!=npio2_hw[n-1]) {
 +	    if(n<32&&(ix&0xffffff80)!=npio2_hw[n-1]) {
  		y[0] = r-w;	/* quick check no cancellation */
  	    } else {
 @@ -177,3 +179,19 @@
  	if(hx<0) {y[0] = -y[0]; y[1] = -y[1]; return -n;}
  	return n;
 +#else
 +	/*
 +	 * The above is broken for many values of x between 32768
 +	 * and 65536-epsilon.  It is wrong by +pi/2 at best.  It
 +	 * is also very slow for these values.  Just use the double
 +	 * precision version.  This only works on machines with
 +	 * double precision of course.
 +	 */
 +	double new_y[2];
 +	int n;
 +
 +	n = __ieee754_rem_pio2((double)x, new_y);
 +	y[0] = new_y[0];
 +	y[1] = (new_y[0] - y[0]) + new_y[1];
 +	return n;
 +#endif
  }
 %%%
 
 The comment about off-by-pi/2 errors is probably wrong.  I now think the
 problem is that small errors lead to n being off by 1.  Then y[0]
 is off by pi/2 to sort of compensate.
 
 Bruce

From: David Schultz <das@FreeBSD.ORG>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG, freebsd-i386@FreeBSD.ORG,
	bde@FreeBSD.ORG
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
Date: Wed, 9 Feb 2005 00:14:01 -0500

 I ran some careful performance comparisons between the version of
 i387 tan() I posted earlier and the fdlibm tan().  Executive
 summary: the fdlibm tan() is faster for virtually all inputs on a
 Pentium 4.  Pentium 3s seem to have lower-latency FPUs, but fdlibm
 still beats the fptan instruction for the important cases where
 fptan actually gets the right answer.  I used the following sets
 of inputs:
 
 tbl1: small numbers
 -7.910377e-286
 -5.142120e-78
 -8.305257e-262
 -2.089491e-228
 9.625342e-237
 -5.867161e-64
 9.000611e-34
 5.192603e-280
 -1.255581e-153
 1.275985e-153
 
 tbl2: numbers on [-8pi,8pi] greater in magnitude than 2^-18
 -2.410568e-05
 -6.482504e-01
 -1.971384e-03
 -1.686721e-04
 -3.629842e-04
 1.036818e-03
 2.987541e-04
 6.186103e+00
 2.165271e-02
 1.032138e-04
 
 tbl3: large numbers
 2.379610e+172
 3.238483e+204
 -4.680033e+194
 -1.442727e+225
 3.090948e+48
 1.778800e+185
 5.177174e+295
 -1.237869e+204
 -4.577895e+223
 -6.735385e+171
 
 tbl4: special cases
 nan
 inf
 -inf
 -0.0
 +0.0
 
 The results below are divided into four columns.  The first is the
 average number of clock cycles taken by the fdlibm tan() for the
 corresponding table input above on a Pentium 4, the second is the
 clock cycles for the assembly tan(), the third is the difference,
 and the fourth is the percentage difference relative to column 1.
 
 das@VARK:/home/t/freebsd> paste perf1 perf1md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 1259.000000     1697.000000     438.000000      +35%
 1162.000000     1488.000000     326.000000      +28%
 1060.000000     1445.000000     385.000000      +36%
 1059.000000     1453.000000     394.000000      +37%
 1065.000000     1459.000000     394.000000      +37%
 1059.000000     1458.000000     399.000000      +38%
 1031.000000     1461.000000     430.000000      +42%
 1054.000000     1436.000000     382.000000      +36%
 1056.000000     1455.000000     399.000000      +38%
 1073.000000     1459.000000     386.000000      +36%
 das@VARK:/home/t/freebsd> paste perf2 perf2md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 2018.000000     1985.000000     -33.000000      -2%
 1713.000000     1821.000000     108.000000      +6%
 1694.000000     1730.000000     36.000000       +2%
 1708.000000     1737.000000     29.000000       +2%
 1703.000000     1745.000000     42.000000       +2%
 1696.000000     1762.000000     66.000000       +4%
 1718.000000     1774.000000     56.000000       +3%
 1890.000000     2108.000000     218.000000      +12%
 1693.000000     1744.000000     51.000000       +3%
 1702.000000     1714.000000     12.000000       +1%
 das@VARK:/home/t/freebsd> paste perf3 perf3md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 5737.000000     6078.000000     341.000000      +6%
 5110.000000     5200.000000     90.000000       +2%
 6726.000000     7030.000000     304.000000      +5%
 5370.000000     5403.000000     33.000000       +1%
 5032.000000     5206.000000     174.000000      +3%
 5229.000000     5199.000000     -30.000000      -1%
 6313.000000     6523.000000     210.000000      +3%
 5201.000000     5583.000000     382.000000      +7%
 6443.000000     6560.000000     117.000000      +2%
 5172.000000     5356.000000     184.000000      +4%
 das@VARK:/home/t/freebsd> paste perf4 perf4md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 4726.000000     3234.000000     -1492.000000    -32%
 3181.000000     2850.000000     -331.000000     -10%
 3149.000000     2842.000000     -307.000000     -10%
 1045.000000     1091.000000     46.000000       +4%
 1076.000000     1114.000000     38.000000       +4%
 
 (P.S.: Oops, forgot to compile s_sin.c with -O.)
 
 I also ran the first three tests on freefall (Pentium III, using
 the old reduction code), and got results that aren't as favorable
 for the fdlibm version:
 
 das@freefall:~> paste perf1 perf1md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 1384.000000     442.000000      -942.000000     -68%
 584.000000      440.000000      -144.000000     -25%
 589.000000      441.000000      -148.000000     -25%
 584.000000      441.000000      -143.000000     -24%
 585.000000      441.000000      -144.000000     -25%
 585.000000      440.000000      -145.000000     -25%
 585.000000      441.000000      -144.000000     -25%
 584.000000      440.000000      -144.000000     -25%
 584.000000      441.000000      -143.000000     -24%
 584.000000      441.000000      -143.000000     -24%
 das@freefall:~> paste perf2 perf2md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 639.000000      656.000000      17.000000       +3%
 820.000000      648.000000      -172.000000     -21%
 640.000000      654.000000      14.000000       +2%
 639.000000      702.000000      63.000000       +10%
 638.000000      654.000000      16.000000       +3%
 639.000000      658.000000      19.000000       +3%
 638.000000      655.000000      17.000000       +3%
 1789.000000     654.000000      -1135.000000    -63%
 639.000000      654.000000      15.000000       +2%
 638.000000      656.000000      18.000000       +3%
 das@freefall:~> paste perf3 perf3md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 5751.000000     1918.000000     -3833.000000    -67%
 4383.000000     2351.000000     -2032.000000    -46%
 6241.000000     2150.000000     -4091.000000    -66%
 4401.000000     2483.000000     -1918.000000    -44%
 4387.000000     1182.000000     -3205.000000    -73%
 4276.000000     2004.000000     -2272.000000    -53%
 5814.000000     2735.000000     -3079.000000    -53%
 4389.000000     2180.000000     -2209.000000    -50%
 5779.000000     2270.000000     -3509.000000    -61%
 4379.000000     2020.000000     -2359.000000    -54%
 
 Here, fdlibm usually wins for tbl2, which is the most important
 class of inputs.  It is slower for the two inputs in tbl2 that are
 close to multiples of 2pi and for large inputs, but in all
 fairness, the i387 gets the wrong answer in those cases---hence,
 this PR.  The i387 legitimately beats fdlibm for the small inputs,
 for which tan(x) == x, so a special case for those earlier in
 fdlibm would probably be beneficial.
 
 Conclusion: We should toss out the assembly versions of tan() and
 tanf(), and possibly special-case small inputs in fdlibm tan().
 
 I found similar results for sin(), so we should probably do the
 same for that, too.  Again, this is on my Pentium 4; YMMV.
 
 das@VARK:/home/t/freebsd> paste perf1 perf1md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 1254.000000     1981.000000     727.000000      +58%
 1109.000000     1603.000000     494.000000      +45%
 1056.000000     1579.000000     523.000000      +50%
 1051.000000     1565.000000     514.000000      +49%
 1043.000000     1569.000000     526.000000      +50%
 1077.000000     1572.000000     495.000000      +46%
 1050.000000     1570.000000     520.000000      +50%
 1051.000000     1566.000000     515.000000      +49%
 1042.000000     1569.000000     527.000000      +51%
 1048.000000     1568.000000     520.000000      +50%
 das@VARK:/home/t/freebsd> paste perf2 perf2md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 1404.000000     1656.000000     252.000000      +18%
 1222.000000     1563.000000     341.000000      +28%
 1205.000000     1614.000000     409.000000      +34%
 1210.000000     1603.000000     393.000000      +32%
 1206.000000     1616.000000     410.000000      +34%
 1218.000000     1636.000000     418.000000      +34%
 1207.000000     1645.000000     438.000000      +36%
 1772.000000     1610.000000     -162.000000     -9%
 1213.000000     1615.000000     402.000000      +33%
 1208.000000     1617.000000     409.000000      +34%
 das@VARK:/home/t/freebsd> paste perf3 perf3md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 5057.000000     5260.000000     203.000000      +4%
 5076.000000     6708.000000     1632.000000     +32%
 6528.000000     5978.000000     -550.000000     -8%
 5074.000000     6922.000000     1848.000000     +36%
 5116.000000     2889.000000     -2227.000000    -44%
 4702.000000     5190.000000     488.000000      +10%
 5884.000000     7223.000000     1339.000000     +23%
 5138.000000     5927.000000     789.000000      +15%
 5834.000000     5923.000000     89.000000       +2%
 5051.000000     5618.000000     567.000000      +11%
 das@VARK:/home/t/freebsd> paste perf4 perf4md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 3536.000000     3576.000000     40.000000       +1%
 3190.000000     2705.000000     -485.000000     -15%
 3189.000000     2698.000000     -491.000000     -15%
 1053.000000     1171.000000     118.000000      +11%
 1059.000000     1174.000000     115.000000      +11%
 
 The above data was generated using the program below, executed as
 follows:
 	./a.out < tblN | grep avg | awk '{print $2}' > perfN
 When compiling the program, it is necessary to add
 -Dfunc=tan or -Dfunc=itan.
 
 #include <limits.h>
 #include <math.h>
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
 
 #define	ITER		10000
 
 #define	rdtsc(rv)	__asm __volatile("xor %%ax,%%ax\n\tcpuid\n\trdtsc" \
 					 : "=A" (*(rv)) : : "ebx", "ecx")
 
 double itan(double);
 
 static void
 runtest(double d)
 {
 	volatile double result;
 	double avg, sd;
 	uint64_t start, end;
 	int64_t total;
 	int t[ITER];
 	int i, n;
 	int tmax, tmin;
 
 	printf("%a\n", d);
 	total = 0;
 	for (i = 0; i < ITER; i++) {
 		rdtsc(&start);
 		result = func(d);
 		rdtsc(&end);
 
 		t[i] = end - start;
 		total += t[i];
 	}
 
 	/* compute initial avg and sd */
 	avg = (double)total / ITER;
 	sd = 0;
 	for (i = 0; i < ITER; i++)
 		sd += (avg - t[i]) * (avg - t[i]);
 	sd = sqrt(sd / ITER);
 
 	/* recompute avg and sd with outliers removed, find max and min */
 	n = 0;
 	tmax = 0;
 	tmin = INT_MAX;
 	for (i = 0; i < ITER; i++) {
 		if (fabs(avg - t[i]) <= sd * 2) {
 			total += t[i];
 			n++;
 			if (t[i] > tmax)
 				tmax = t[i];
 			if (t[i] < tmin)
 				tmin = t[i];
 		} else {
 			t[i] = -1;
 		}
 	}
 	avg = (double)total / n;
 	sd = 0;
 	for (i = 0; i < ITER; i++) {
 		if (t[i] >= 0)
 			sd += (avg - t[i]) * (avg - t[i]);
 	}
 	sd = sqrt(sd / n);
 
 	printf("avg:\t%.0f\nsd:\t%.0f\nmin:\t%d\nmax:\t%d\nout:\t%.02f%%\n\n",
 	    avg, sd, tmin, tmax, (double)(ITER - n) * 100 / ITER);
 }
 
 int
 main(int argc, char *argv[])
 {
 	double d;
 
 	while (scanf("%lf", &d) > 0)
 		runtest(d);
 
 	return (0);
 }

From: Bruce Evans <bde@zeta.org.au>
To: David Schultz <das@freebsd.org>
Cc: FreeBSD-gnats-submit@freebsd.org, freebsd-i386@freebsd.org,
	bde@freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Thu, 10 Feb 2005 02:17:45 +1100 (EST)

 On Wed, 9 Feb 2005, David Schultz wrote:
 
 > I ran some careful performance comparisons between the version of
 > i387 tan() I posted earlier and the fdlibm tan().  Executive
 > summary: the fdlibm tan() is faster for virtually all inputs on a
 > Pentium 4.  Pentium 3s seem to have lower-latency FPUs, but fdlibm
 > still beats the fptan instruction for the important cases where
 > fptan actually gets the right answer.
 
 I did some not so careful comparisons and found:
 - hardware sin is about twice as fast as fdlibm sin on athlonxp
 - hardware sin is about the same speed as fdlibm sin on athlon64.  The
   absolute speed is about the same as on athlonxp with a similar CPU
   clock (athlon64 apparently speeds up fdlibm but not hardware sin)
 - using float precision didn't make much difference (it was slightly
   slower IIRC).
 I used a uniform distribution with ranges [0..10] and [0..1000], and
 e_rem_pio2f.c was fixed to use the double version on athlonxp but not
 on athlon64.
 
 I think newer CPUs are more likely to optimize simple instructions better
 relative to transcendental functions.  SSE2 doesn't help for fsin, and
 using fsin on athlonxp is slower than ever because the registers have to
 be moved from xmm to i387 via memory.  But perhaps there are separate
 ALUs that help more in real applications.  fdlibm probably works better
 in benchmarks than in real applications because its code and tables stay
 cached.
 
 > I used the following sets
 > of inputs:
 >
 > tbl1: small numbers
 > ...
 > tbl2: numbers on [-8pi,8pi] greater in magnitude than 2^-18
 > ...
 > tbl3: large numbers
 > ...
 > tbl4: special cases
 
 This data may be too unusual.  Maybe the NaNs are slower.  Denormals
 would probably be slower.
 
 > The results below are divided into four columns.  The first is the
 > average number of clock cycles taken by the fdlibm tan() for the
 > corresponding table input above on a Pentium 4, the second is the
 > clock cycles for the assembly tan(), the third is the difference,
 > and the fourth is the percentage difference relative to column 1.
 >
 > das@VARK:/home/t/freebsd> paste perf1 perf1md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 1259.000000     1697.000000     438.000000      +35%
 > ...
 > das@VARK:/home/t/freebsd> paste perf2 perf2md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 2018.000000     1985.000000     -33.000000      -2%
 > ...
 > das@VARK:/home/t/freebsd> paste perf3 perf3md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 5737.000000     6078.000000     341.000000      +6%
 > ...
 > das@VARK:/home/t/freebsd> paste perf4 perf4md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 4726.000000     3234.000000     -1492.000000    -32%
 > ...
 >
 > (P.S.: Oops, forgot to compile s_sin.c with -O.)
 
 I get the following for the range [0..10] step 0.0000001 on athlonxp:
 
     257 fdlibm sin(double) (msun src)
     128 fsin(double) (libc obj)
     107 sinf(double) (inline asm src)
     151 ftan(double) (libc obj)
 
 In case I messed up the scaling, this translates to 50-120 nsec/call
 (TSC freq 2223MHz).  The execution latency for fsin is 96-192 cycles
 according to the athlon32 optimization manual, so 107-128 seems about
 right.
 
 > I also ran the first three tests on freefall (Pentium III, using
 > the old reduction code), and got results that aren't as favorable
 > for the fdlibm version:
 >
 > das@freefall:~> paste perf1 perf1md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 1384.000000     442.000000      -942.000000     -68%
 > 584.000000      440.000000      -144.000000     -25%
 > ...
 > das@freefall:~> paste perf2 perf2md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 639.000000      656.000000      17.000000       +3%
 > ...
 > das@freefall:~> paste perf3 perf3md | awk '{printf("%f\t%f\t%f\t%+.0f%%\n", $1, $2, $2-$1, ($2-$1)*100/$1);}'
 > 5751.000000     1918.000000     -3833.000000    -67%
 > ...
 
 Freefall is surprisingly underpowered :-).  I get similar cycle counts on it:
 
     232 fdlibm sin(double) (msun src)
     121 fsin(double) (libc obj)
     112 sinf(double) (inline asm src)
     178 tan(double) (libc obj (fdlibm))
 
 My test loop (1/10 as long as this for freefall):
 
 %%%
 	double d;
 	...
 	x = rdtsc();
 	for (d = 0; d < 10.0; d += 0.0000001)
 		tan(d);
 	y = rdtsc();
 %%%
 
 > Here, fdlibm usually wins for tbl2, which is the most important
 > class of inputs.  It is slower for the two inputs in tbl2 that are
 > close to multiples of 2pi and for large inputs, but in all
 > fairness, the i387 gets the wrong answer in those cases---hence,
 > this PR.  The i387 legitimately beats fdlibm for the small inputs,
 > for which tan(x) == x, so a special case for those earlier in
 > fdlibm would probably be beneficial.
 
 Special inputs take much longer according to your tests, but I hope
 thousands of cycles is not the usual case.
 
 > Conclusion: We should toss out the assembly versions of tan() and
 > tanf(), and possibly special-case small inputs in fdlibm tan().
 
 > The above data was generated using the program below, executed as
 > follows:
 > 	./a.out < tblN | grep avg | awk '{print $2}' > perfN
 > When compiling the program, it is necessary to add
 > -Dfunc=tan or -Dfunc=itan.
 
 For me, this gives numbers in between yours and mine.  I only tried
 hardware tan on athlonxp, and the numbers were about 2000 for most of
 tbl3, one 1000 in the middle of tbl3, and 400 for everything else.
 
 > #define	rdtsc(rv)	__asm __volatile("xor %%ax,%%ax\n\tcpuid\n\trdtsc" \
 > 					 : "=A" (*(rv)) : : "ebx", "ecx")
 
 The synchronising cpuid here is responsible for a factor of 3 difference
 for me.  Moving the rdtsc out of the loop gives the following changes
 in cycle counts:
 
     2000 -> [944..1420]
     1000 -> 431
     400  -> 132
 
 Each rdtsc() in the loop costs 75 cycles for tbl1, and actually using
 the results costs another 120 cycles.
 
 I think the cpuid is disturbing the timings too much.
 
 Bruce

From: David Schultz <das@freebsd.org>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@freebsd.org, freebsd-i386@freebsd.org,
	bde@freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
Date: Thu, 10 Feb 2005 02:23:14 -0500

 
 On Thu, Feb 10, 2005, Bruce Evans wrote:
 > > I used the following sets
 > > of inputs:
 > >
 > > tbl1: small numbers
 > > ...
 > > tbl2: numbers on [-8pi,8pi] greater in magnitude than 2^-18
 > > ...
 > > tbl3: large numbers
 > > ...
 > > tbl4: special cases
 > 
 > This data may be too unusual.  Maybe the NaNs are slower.  Denormals
 > would probably be slower.
 
 The data in tbl2 are pretty usual, I think, and I measured all of
 the data points independently.  But yes, NaNs are slower, as the
 results for tbl4 indicate.
 
 Looking back, though, I did notice that very few of my inputs in
 tbl2 require argument reduction.  In your tests on [0..10], on the
 other hand, 92% of the inputs require argument reduction in
 fdlibm.  It would be interesting to see for which of your tests
 fdlibm is faster, and for which it is slower.  One possibility is
 that fdlibm is slower most of the time; another is that it is far
 slower for the close-to-pi/2 cases that the i387 gets wrong, and
 that messes up the averages.
 
 > The synchronising cpuid here is responsible for a factor of 3 difference
 > for me.  Moving the rdtsc out of the loop gives the following changes
 > in cycle counts:
 > 
 >     2000 -> [944..1420]
 >     1000 -> 431
 >     400  -> 132
 > 
 > Each rdtsc() in the loop costs 75 cycles for tbl1, and actually using
 > the results costs another 120 cycles.
 > 
 > I think the cpuid is disturbing the timings too much.
 
 I don't care so much about the rdtsc overhead since I'm only
 measuring relative performance.  A null function is measured as
 taking 388 cycles on my Pentium 4, but some of that is due to gcc
 getting confused by the volatile variable and generating extra
 code at -O0.
 
 However, it is true that I am basically measuring latency and not
 throughput.  Ordinarily, it is possible to execute FPU and CPU
 instructions simultaneously, and the FPU may even have more than
 one FU available for executing fptan.  The cpuid instructions
 clear out the pipeline and destroy any parallelism that might have
 been possible.  Your version does a better job of measuring
 throughput.  You're also right that fdlibm tan() blows out about
 512 bytes of instruction cache.
 
 Anyway, I unfortunately don't have time for all this.  Do you want
 the assembly versions of these to stay or not?  If so, it would be
 great if you could fix them and make sure that the result isn't
 obviously slower than fdlibm.  If not, I'll be happy to spend two
 minutes making all those pesky bugs in them go away.  ;-)

From: Bruce Evans <bde@zeta.org.au>
To: David Schultz <das@freebsd.org>
Cc: FreeBSD-gnats-submit@freebsd.org, freebsd-i386@freebsd.org,
	bde@freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Mon, 14 Feb 2005 01:31:50 +1100 (EST)

 On Thu, 10 Feb 2005, David Schultz wrote:
 
 > On Thu, Feb 10, 2005, Bruce Evans wrote:
 
 > > > [tbl*]
 > >
 > > This data may be too unusual.  Maybe the NaNs are slower.  Denormals
 > > would probably be slower.
 >
 > The data in tbl2 are pretty usual, I think, and I measured all of
 > the data points independently.  But yes, NaNs are slower, as the
 > results for tbl4 indicate.
 
 It is actually the large numbers (tbl3), which need a lot of argument
 reduction, that are slower.
 
 > Looking back, though, I did notice that very few of my inputs in
 > tbl2 require argument reduction.  In your tests on [0..10], on the
 > other hand, 92% of the inputs require argument reduction in
 > fdlibm.  It would be interesting to see for which of your tests
 > fdlibm is faster, and for which it is slower.  One possibility is
 > that fdlibm is slower most of the time; another is that it is far
 > slower for the close-to-pi/2 cases that the i387 gets wrong, and
 > that messes up the averages.
 
 More testing of sin() on an athlon-xp shows:
 - fdlibm is faster on the range [0,pi/4-eps].  fdlibm can even be made
   almost 3 times faster than fsin on this range by inlining __kernel_sin
   and using lots of options in CFLAGS (24 nsec vs 63 nsec for inline fsin
   and 72 nsec for libc fsin, at 2.23GHz).  fdlibm doesn't need to do any
   arg reduction in this range, and the polynomial for sin() is very
   efficient (it takes less time than the function calls and logic).
 - in the range [pi/4-eps,pi/2], fdlibm does arg reduction (to convert to
   cos()) and becomes about twice as slow.  OTOH, fsin is almost twice as
   fast in this range as it is in the previous range!  Perhaps this is
   because fsin knows that its arg reduction is broken even above pi/2 so
   it can do sloppier calculations without losing significantly more.
 - for the ranges corresponding to larger multiples of pi/2, fsin slows
   down slowly and fdlibm slows down relatively rapidly.  This is because
   fdlibm actually does correct arg reduction for large values.
 
 > > The synchronising cpuid here is responsible for a factor of 3 difference
 > > for me.  Moving the rdtsc out of the loop gives the following changes
 > > in cycle counts:
 > >
 > >     2000 -> [944..1420]
 > >     1000 -> 431
 > >     400  -> 132
 > >
 > > Each rdtsc() in the loop costs 75 cycles for tbl1, and actually using
 > > the results costs another 120 cycles.
 > >
 > > I think the cpuid is disturbing the timings too much.
 >
 > I don't care so much about the rdtsc overhead since I'm only
 > measuring relative performance.  A null function is measured as
 > taking 388 cycles on my Pentium 4, but some of that is due to gcc
 > getting confused by the volatile variable and generating extra
 > code at -O0.
 
 The rdtsc() overhead (cpuid + rdtsc) needs to be subtracted to get
 relative performances that can be compared in a ratio.  On an athlon-xp
 I get the following minimum avg cycle counts for various null operations:
 
 2 rdtsc's alone:                                 22
 2 rdtsc's around null function:                  31
 2 cpuid+rdtsc pairs alone:                      128
 2 cpuid+rdtsc pairs around null function:       138
 2 xor+cpuid+rdtsc triples alone:                128
 2 xor+cpuid+rdtsc triples around null function: 140
 previous with -O0 (others with -O):             140
 
 Apparently:
 - the rdtsc overhead of 12 cycles costs for not quite each rdtsc
 - the cpuid overhead of 62 cycles costs for not quite each cpuid
 - -O0 doesn't cost much
 - the P4 pipeline is about 388 - 140 = 248 cycles longer than the
   athlon-xp's.
 
 > However, it is true that I am basically measuring latency and not
 > throughput.  Ordinarily, it is possible to execute FPU and CPU
 > instructions simultaneously, and the FPU may even have more than
 > one FU available for executing fptan.  The cpuid instructions
 > clear out the pipeline and destroy any parallelism that might have
 > been possible.  Your version does a better job of measuring
 > throughput.  You're also right that fdlibm tan() blows out about
 > 512 bytes of instruction cache.
 
 I couldn't see much evidence of parallelism in a simple benchmark.
 
 The main problem with using cpuid is that we don't really want
 to measure latency.  We know that the hardware math functions have
 large latency, so benchmarks that test latency are sure to show
 them not doing so well.
 
 > Anyway, I unfortunately don't have time for all this.  Do you want
 > the assembly versions of these to stay or not?  If so, it would be
 > great if you could fix them and make sure that the result isn't
 > obviously slower than fdlibm.  If not, I'll be happy to spend two
 > minutes making all those pesky bugs in them go away.  ;-)
 
 It seems that the hardware trig functions aren't worth using.  I want
 to test them on a 486 and consider the ranges more before discarding
 them.  This may take a while.
 
 I did a quick test of some other functions:
 - hardware sqrt is much faster
 - hardware exp is slightly faster on the range [1,100]
 - hardware atan is slower on the range [0,1.5]
 - hardware acos is much slower (139 nsec vs 57 nsec!) on the range [0,1.0].
 
 Bruce

From: David Schultz <das@FreeBSD.ORG>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG, freebsd-i386@FreeBSD.ORG,
	bde@FreeBSD.ORG
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
Date: Sun, 13 Feb 2005 13:08:37 -0500

 On Mon, Feb 14, 2005, Bruce Evans wrote:
 > It seems that the hardware trig functions aren't worth using.  I want
 > to test them on a 486 and consider the ranges more before discarding
 > them.  This may take a while.
 
 Fair enough.  I would be happy to have a hybrid implementation
 that uses the hardware only when appropriate.  However, your 486
 benchmarks notwithstanding, I would just as soon rely on fdlibm
 entirely for the trig functions.  It just doesn't seem worthwhile
 to me, given that the only parts of the domain where the hardware
 is faster *and* correct are, roughly speaking, [0,2^-28) and
 [pi/4,pi/2-eps].
 
 > I did a quick test of some other functions:
 > - hardware sqrt is much faster
 > - hardware exp is slightly faster on the range [1,100]
 > - hardware atan is slower on the range [0,1.5]
 > - hardware acos is much slower (139 nsec vs 57 nsec!) on the range [0,1.0].
 
 sqrt isn't transcendental, so it should be faster and correctly
 rounded on every hardware platform.  I found similar results to
 yours for atan() and acos() when writing amd64 math routines, but
 of course amd64 has the overhead of switching between the SSE and
 i387 units.  Maybe they should go away, too...

From: Bruce Evans <bde@zeta.org.au>
To: David Schultz <das@FreeBSD.org>
Cc: FreeBSD-gnats-submit@FreeBSD.org, freebsd-i386@FreeBSD.org,
	bde@FreeBSD.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Mon, 14 Feb 2005 06:38:16 +1100 (EST)

 On Sun, 13 Feb 2005, David Schultz wrote:
 
 > On Mon, Feb 14, 2005, Bruce Evans wrote:
 > > >...
 > > I did a quick test of some other functions:
 > > - hardware sqrt is much faster
 > > - hardware exp is slightly faster on the range [1,100]
 > > - hardware atan is slower on the range [0,1.5]
 > > - hardware acos is much slower (139 nsec vs 57 nsec!) on the range [0,1.0].
 >
 > sqrt isn't transcendental, so it should be faster and correctly
 > rounded on every hardware platform.  I found similar results to
 
 I don't know if we can trust the hardware for that.  ISTR checking that
 hardware sqrtf gives the same result as fdlibm for all possible float
 inputs.  Exhaustive checking is of course impossible for double sqrt.
 
 > yours for atan() and acos() when writing amd64 math routines, but
 > of course amd64 has the overhead of switching between the SSE and
 > i387 units.  Maybe they should go away, too...
 
 These are easier to decide (for now) because there are no old CPUs.
 
 I fixed the bug that gave unbelievable cycle counts:
 
 %%%
 --- r.c~	Mon Feb 14 02:19:34 2005
 +++ r.c	Mon Feb 14 02:22:21 2005
 @@ -45,4 +47,5 @@
  	tmax = 0;
  	tmin = INT_MAX;
 +	total = 0;
  	for (i = 0; i < ITER; i++) {
  		if (fabs(avg - t[i]) <= sd * 2) {
 %%%
 
 With this fix on athlon-xp's, the cpuid instructions only disturb the
 cycle counts in a small and almost deterministic way (by about 59 cycles
 for every run).
 
 Bruce

From: David Schultz <das@FreeBSD.ORG>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG, freebsd-i386@FreeBSD.ORG,
	bde@FreeBSD.ORG
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
Date: Sun, 13 Feb 2005 14:54:33 -0500

 On Mon, Feb 14, 2005, Bruce Evans wrote:
 > > sqrt isn't transcendental, so it should be faster and correctly
 > > rounded on every hardware platform.  I found similar results to
 > 
 > I don't know if we can trust the hardware for that.  ISTR checking that
 > hardware sqrtf gives the same result as fdlibm for all possible inputs to
 > sqrtf.  Exhaustive checking is of course impossible for double sqrt.
 
 Since IEEE 754 specifies sqrt's behavior, and because ucbtest does
 a good job of detecting problems with it, hardware designers are
 likely to pay more attention to getting it right.  After all, it's
 possible to have completely broken transcendentals and still claim
 IEEE 754 compliance, but you can't do that if your sqrt is broken.
 
 > I fixed the bug that gave unbelievable cycle counts:
 > 
 > %%%
 > --- r.c~	Mon Feb 14 02:19:34 2005
 > +++ r.c	Mon Feb 14 02:22:21 2005
 > @@ -45,4 +47,5 @@
 >  	tmax = 0;
 >  	tmin = INT_MAX;
 > +	total = 0;
 >  	for (i = 0; i < ITER; i++) {
 >  		if (fabs(avg - t[i]) <= sd * 2) {
 > %%%
 
 Yeah, I noticed that bug while using the program to do some
 measurements for my research.  Sorry I forgot to mention it here.

From: Bruce Evans <bde@zeta.org.au>
To: David Schultz <das@freebsd.org>
Cc: FreeBSD-gnats-submit@freebsd.org, freebsd-i386@freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Sun, 20 Feb 2005 22:02:50 +1100 (EST)

 On Mon, 14 Feb 2005, Bruce Evans wrote:
 
 > It seems that the hardware trig functions aren't worth using.  I want
 > to test them on a 486 and consider the ranges more before discarding
 > them.  This may take a while.
 
 I did this as threatened.
 
 My test program now covers most cases of interest for i386's and
 amd64's.  It divides the range being tested into a number of
 regions.  I mostly used 16, which is good for dividing the range
 [-2pi, 2pi] into quadrants.  The output clearly shows that both
 the hardware and fdlibm do arg reduction related to pi/4, with
 different tradeoffs.  E.g., for sin(), fdlibm does very well for
 the range [-pi/4, pi/4], but not so well for args outside this
 range, while hardware sin does poorly for [-pi/4, pi/4] but very
 well for [pi/4, 3pi/4], and similarly for all translations of these
 ranges by pi.  OTOH, the hardware cos's best range is [-pi/4, pi/4]
 but fdlibm cos's best range is the same as for fdlibm sin.
 
 Some results:
 - the hardware trig functions were by far the best on a 486DX2-66 for
   all ranges except near 0 where fdlibm is competitive or a little
   faster.  The relative advantage of the hardware functions decreases
   with each CPU generation.
 - the hardware inverse trig functions (all based on fpatan) were by far
   the worst for all ranges, except on old machines where they are
   competitive or a little faster.
 - fdlibm is quite good for exp and log too.
 - an Athlon64 (cpuid says 3400 but not that fast in marketed version)
   running at 1994 MHz with slow memory has almost exactly the same
   speed for the fdlibm part of the benchmark as an AthlonXP Barton
   2600 overclocked to 2223 MHz with not-so-slow memory.  Using SSE2
   apparently makes just enough difference to compensate for the
   1994/2223 clock speed ratio.  There is apparently no similar
   difference for using the A64 FPU so using it is further away from
   being an optimization on A64's.
 
 Test control script:
 
 %%%
 set -e
 arch=`uname -p`
 if [ $arch = i386 ]; then arch=i387; fi
 msun=/usr/src/lib/msun
 CFLAGS="-march=athlon-xp -fomit-frame-pointer -O2 -g -I$msun/src"
 LDFLAGS="-lm -static"
 niter=100000000.0
 stdfiles="$msun/src/e_rem_pio2.c $msun/src/k_cos.c $msun/src/k_sin.c \
     $msun/src/k_tan.c"
 
 while :
 do
 	read functype func llimit limit nregion basename ename
 	if [ -z $func ]; then exit 0; fi
 	cfile=$msun/src/$basename.c
 	sfile=$msun/$arch/$basename.S
 	if [ -f $sfile -o $func = cos -o $func = cosf \
 	    -o $func = sin -o $func = sinf ]
 	then
 		myfunc=asm$func
 		COPTS="-DFUNCTYPE=$functype -DFUNC=$myfunc \
 		    -DLLIMIT=$llimit -DLIMIT=$limit -DNREGION=$nregion \
 		    -DNITER=$niter"
 		if [ -f $sfile ]
 		then
 			sed -e "s/^ENTRY($func)/ENTRY($myfunc)/" \
 			    -e "s/^ENTRY($ename)/ENTRY($myfunc)/" <$sfile >x.S
 			cc $CFLAGS $COPTS -o z z.c x.S $stdfiles $LDFLAGS
 		else
 			cc $CFLAGS $COPTS -o z z.c $LDFLAGS
 		fi
 		time ./z
 	fi
 	myfunc=fdl$func
 	COPTS="-DFUNCTYPE=$functype -DFUNC=$myfunc \
 	    -DLLIMIT=$llimit -DLIMIT=$limit -DNREGION=$nregion \
 	    -DNITER=$niter"
 	sed -e s/^$ename/$myfunc/ \
 	    -e s/__generic_$ename/$myfunc/ <$cfile >x.c
 	cc $CFLAGS $COPTS -o z z.c x.c $stdfiles $LDFLAGS
 	time ./z
 done
 %%%
 
 Run this using something like "sh t <tdata >to.machine".  (First edit
 the CFLAGS line to change -march to something appropriate, and the
 niter line to make the test run fast enough -- I used 1e8 on Athlons
 down to 1e6 on 486DX2-66 so that each function takes about 10 seconds.)
 
 Test data:
 
 %%%
 double acos -1.0 1.0 16 e_acos __ieee754_acos
 double asin -1.0 1.0 16 e_asin __ieee754_asin
 double atan -8.0 8.0 16 s_atan atan
 double cos -6.28 6.28 16 s_cos cos
 double exp -8.0 8.0 16 e_exp __ieee754_exp
 double log 0.0 1e32 16 e_log __ieee754_log
 double logb 0.0 1e32 16 s_logb logb
 double log10 0.0 1e32 16 e_log10 __ieee754_log10
 double sin -6.28 6.28 16 s_sin sin
 double tan -6.28 6.28 16 s_tan tan
 float cosf -6.28 6.28 16 s_cosf cosf
 float logf 0.0 1e32 16 e_logf __ieee754_logf
 float logbf 0.0 1e32 16 s_logbf logbf
 float log10f 0.0 1e32 16 e_log10f __ieee754_log10f
 float sinf -6.28 6.28 16 s_sinf sinf
 float tanf -6.28 6.28 16 s_tanf tanf
 
 # The following aren't actually implemented in asm (but e_atan2f is; it is
 # probably just as bad as e_atanf would be, but is harder to test).
 float acosf -1.0 1.0 16 e_acosf __ieee754_acosf
 float asinf -1.0 1.0 16 e_asinf __ieee754_asinf
 float atanf -8.0 8.0 16 s_atanf atanf
 float expf -8.0 8.0 16 e_expf __ieee754_expf
 %%%
 
 Test program:
 
 %%%
 #include <sys/types.h>
 #include <sys/time.h>
 #include <sys/resource.h>
 
 #ifdef HAVE_RDTSC
 #include <machine/cpufunc.h>
 #endif
 
 #include <math.h>
 #include <stdio.h>
 
 #ifndef FUNC
 #define	FUNC	sin
 #endif
 #define	FUNCNAME __XSTRING(FUNC)
 #ifndef FUNCTYPE
 #define	FUNCTYPE double
 #endif
 #ifndef LIMIT
 #define	LIMIT	(3.14159 / 4)
 #endif
 #ifndef LLIMIT
 #define	LLIMIT	0.0
 #endif
 #ifndef NITER
 #define NITER	10000000.0
 #endif
 #ifndef NREGION
 #define NREGION	16
 #endif
 
 #ifdef __amd64__
 /* Generate some asm versions since there are none in libm. */
 
 double asmcos(double);
 double asmsin(double);
 float asmcosf(float);
 float asmsinf(float);
 
 asm("asmcos: movsd %xmm0, -8(%rsp); fldl -8(%rsp); fcos; "
     "fstpl -8(%rsp); movsd -8(%rsp),%xmm0; ret");
 asm("asmsin: movsd %xmm0, -8(%rsp); fldl -8(%rsp); fsin; "
     "fstpl -8(%rsp); movsd -8(%rsp),%xmm0; ret");
 asm("asmcosf: movss %xmm0, -4(%rsp); flds -4(%rsp); fcos; "
     "fstps -4(%rsp); movss -4(%rsp),%xmm0; ret");
 asm("asmsinf: movss %xmm0, -4(%rsp); flds -4(%rsp); fsin; "
     "fstps -4(%rsp); movss -4(%rsp),%xmm0; ret");
 #endif /* __amd64__ */
 
 FUNCTYPE FUNC(FUNCTYPE);
 
 int
 main(void)
 {
 	struct rusage finish, start;
 	double d, limit, llimit, step;
 	long long tot[NREGION], usec[NREGION], x, y;
 	int i;
 
 	step = (LIMIT - LLIMIT) / NITER;
 	for (i = 0; i < NREGION; i++) {
 		llimit = LLIMIT + i * (LIMIT - LLIMIT) / NREGION;
 		limit = LLIMIT + (i + 1) * (LIMIT - LLIMIT) / NREGION;
 		tot[i] = 0;
 		getrusage(RUSAGE_SELF, &start);
 #ifdef HAVE_RDTSC
 		x = rdtsc();
 #endif
 		for (d = llimit; d < limit; d += step)
 			FUNC(d);
 #ifdef HAVE_RDTSC
 		y = rdtsc();
 #endif
 		getrusage(RUSAGE_SELF, &finish);
 		usec[i] = 1000000 *
 		    (finish.ru_utime.tv_sec - start.ru_utime.tv_sec +
 		    finish.ru_stime.tv_sec - start.ru_stime.tv_sec) +
 		    finish.ru_utime.tv_usec - start.ru_utime.tv_usec +
 		    finish.ru_stime.tv_usec - start.ru_stime.tv_usec;
 #ifdef HAVE_RDTSC
 		tot[i] = y - x;
 #endif
 	}
 	printf("%s:", FUNCNAME);
 #ifdef HAVE_RDTSC
 	printf(" cycles:nsec per call:");
 	for (i = 0; i < NREGION; i++)
 		printf(" %lld:%lld",
 		    tot[i] / (long long)(NITER / NREGION),
 		    1000 * usec[i] / (long long)(NITER / NREGION));
 #else
 	printf(" nsec per call: ");
 	for (i = 0; i < NREGION; i++)
 		printf(" %lld", 1000 * usec[i] / (long long)(NITER / NREGION));
 #endif
 	printf("\n");
 	return (0);
 }
 %%%
 
 Sample output:
 
 %%%
 to.486dx2-66:
 asmasin: nsec per call:  7569 7650 7742 7752 7665 7662 7584 7721 7409 7573 7645 7651 7723 7775 7686 7603
 fdlasin: nsec per call:  13452 13910 13967 13979 8000 7997 7995 8178 7872 7873 7994 7995 13853 13844 13808 13323
 asmsin: nsec per call:  5697 5788 5821 5642 5684 5774 5534 5557 5417 5788 5763 5558 5650 5828 5782 5701
 fdlsin: nsec per call:  11888 11906 11902 11736 11866 9405 9414 5377 5372 9250 9247 11320 11303 11435 11417 11326
 
 to.k6-233:
 asmasin: nsec per call:  1051 1076 1073 1072 1054 1041 1042 1042 1038 1037 1037 1050 1067 1069 1072 1046
 fdlasin: nsec per call:  1518 1594 1595 1594 819 819 818 818 818 819 819 819 1591 1590 1593 1514
 asmsin: nsec per call:  513 553 543 527 518 540 521 464 464 521 540 518 527 543 553 513
 fdlsin: nsec per call:  1466 1495 1493 1475 1474 1034 1034 647 647 1017 1017 1418 1419 1459 1459 1410
 
 to.axpb-2223:
 asmasin: nsec per call:  96 96 94 93 93 93 93 93 93 93 93 93 93 94 96 96
 fdlasin: nsec per call:  77 81 81 81 36 35 34 34 35 35 35 35 81 81 81 77
 asmsin: nsec per call:  60 32 33 59 59 32 32 57 56 32 32 58 58 33 32 59
 fdlsin: nsec per call:  84 91 91 85 85 69 69 32 32 68 67 79 79 86 86 79
 
 to.a64-1994:
 asmsin: nsec per call:  61 37 37 60 61 36 37 58 58 37 36 61 60 37 37 61
 fdlsin: nsec per call:  73 75 75 75 75 52 52 21 21 51 51 71 71 70 70 67
 %%%
 
 I would adjust the following due to these results:
 - delete all trig i387 float functions.  Think about deleting the exp and
   log i387 float functions.  Think about optimizing the fdlibm versions.
   They could use double or extended precision, and then they might not
   need range reduction for a much larger range (hopefully [-2pi, 2pi])
   and/or might not need a correction term for a much larger range.
 - delete all inverse trig i387 functions.
 - think about optimizing the trig fdlibm double functions until they are
   faster than the trig i387 double functions on a larger range than
   [-pi/4, pi/4].  They could use extended precision, but only on
   some arches so there would be a negative reduction in complications
   for replacing the MD functions by "MI" ones optimized in this way.
   Does the polynomial approximation for sin start to fail between pi/4
   and pi/2, or are there other technical reasons to reduce to the current
   ranges?
 
 Bruce

From: David Schultz <das@FreeBSD.ORG>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG, freebsd-i386@FreeBSD.ORG
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
Date: Sun, 20 Feb 2005 17:52:01 -0500

 DONE:
 - Remove i387 float trig functions.
 - Remove i387 acos() and asin().
 
 TODO:
 - Figure out what to do about logf(), logbf(), and log10f().
 - Figure out what to do about atan(), atan2(), and atan2f().
 - Figure out what to do with sin(), cos(), and tan().
   (I suggest removing the i387 double trig functions now and worrying
   about the speed of fdlibm routines later.)
 
 On Sun, Feb 20, 2005, Bruce Evans wrote:
 > I would adjust the following due to these results:
 >   Think about deleting the exp and
 >   log i387 float functions.
 
 I didn't add NetBSD's e_expf.S in the first place because my tests
 showed that it was slower.  :-P  As for log{,b,10}f, your tests show
 that the asm versions are faster on my Pentium 4:
 
 asmlogf: nsec per call:  40 41 40 40 40 40 40 40 40 40 40 40 40 40 40 40
 fdllogf: nsec per call:  76 77 77 78 76 78 78 78 77 75 78 78 78 78 78 78
 asmlogbf: nsec per call:  12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
 fdllogbf: nsec per call:  18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
 asmlog10f: nsec per call:  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
 fdllog10f: nsec per call:  80 80 71 88 71 71 95 84 71 71 71 71 72 112 96 71
 
 > - delete all inverse trig i387 functions.
 
 This is a clear win for asin() and acos().  It's not so clear for
 atan() or atan2():
 
 asmatan: nsec per call:  68 68 68 68 68 68 68 69 69 68 68 68 68 68 68 68
 fdlatan: nsec per call:  92 92 92 92 92 94 97 70 70 97 95 92 92 92 92 92
 fdlatanf: nsec per call:  70 70 70 70 70 71 72 58 58 72 70 69 70 75 71 69
 
 This is for the same Pentium 4 as above.  Do you get different
 results for a saner processor, like an Athlon?  IIRC, atan2f() was
 faster in assembly according to my naive tests, or I wouldn't have
 imported it.  I don't remember what inputs I tried or why I left out
 atanf().
 
 >   Think about optimizing the fdlibm versions.
 >   They could use double or extended precision, and then they might not
 >   need range reduction for a much larger range (hopefully [-2pi, 2pi])
 >   and/or might not need a correction term for a much larger range.
 [...]
 > - think about optimizing the trig fdlibm double functions until they are
 >   faster than the trig i387 double functions on a larger range than
 >   [-pi/4, pi/4].  They could use extended precision, but only on
 >   some arches so there would be a negative reduction in complications
 >   for replacing the MD functions by "MI" ones optimized in this way.
 >   Does the polynomial approximation for sin start to fail between pi/4
 >   and pi/2, or are there other technical reasons to reduce to the current
 >   ranges?
 
 Yeah, the float versions of just about everything would benefit
 from storing the extra bits in a double.  Replacing two split
 doubles with a long double is harder, as you mention; it works
 (a) never, for 64-bit long doubles, (b) sometimes, for 80-bit long
 doubles, and (c) always, for 128-bit long doubles.  Some of the
 lookup tables for the float routines are a bit braindead, too.
 
 It's impossible to use a polynomial approximation on [0,pi/2] or a
 larger range for tan(), since tan() grows faster than any
 polynomial as it approaches pi/2.  There may be a rational
 approximation that works well, but I doubt it.  It is possible to
 find polynomial approximations for sin() and cos() on [0,pi/2],
 but assuming that Dr. Ng knows what he's doing (a reasonable
 assumption), the degree of any such polynomial would likely be
 very large.
 
 By the way, Dr. Ng's paper on argument reduction mentions that the
 not-so-freely-distributable libm uses table lookup for medium size
 arguments that would otherwise require reduction:
 http://www.ucbtest.org/arg.ps
 
 So in summary, there's a lot of low-hanging fruit with the float
 routines, but I think that doing a better job for double requires
 multiple versions that use the appropriate kind of extended
 precision, table lookup, or consulting a numerical analyst.
 
 By the way, the CEPHES library (netlib/cephes or
 http://www.moshier.net/) has different versions of many of these
 routines.  The trig functions are also approximated on [0,pi/4],
 but accurate argument reduction is not used.  I have the licensing
 issues worked out with the author and core@ if we want to use any
 of these.  However, my experience with exp2() and expl() from
 CEPHES showed that there are some significant inaccuracies, places
 where the approximating polynomial can overflow, etc.

From: Bruce Evans <bde@zeta.org.au>
To: David Schultz <das@freebsd.org>
Cc: FreeBSD-gnats-submit@freebsd.org, freebsd-i386@freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Tue, 22 Feb 2005 00:02:35 +1100 (EST)

 On Sun, 20 Feb 2005, David Schultz wrote:
 
 > On Sun, Feb 20, 2005, Bruce Evans wrote:
 > > I would adjust the following due to these results:
 > >   Think about deleting the exp and
 > >   log i387 float functions.
 >
 > I didn't add NetBSD's e_expf.S in the first place because my tests
 > showed that it was slower.  :-P  As for log{,b,10}f, your tests show
 > that the asm versions are faster on my Pentium 4:
 >
 > asmlogf: nsec per call:  40 41 40 40 40 40 40 40 40 40 40 40 40 40 40 40
 > fdllogf: nsec per call:  76 77 77 78 76 78 78 78 77 75 78 78 78 78 78 78
 > asmlogbf: nsec per call:  12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
 > fdllogbf: nsec per call:  18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
 > asmlog10f: nsec per call:  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
 > fdllog10f: nsec per call:  80 80 71 88 71 71 95 84 71 71 71 71 72 112 96 71
 
 I get similar results for logf on all old machines, but fdllogf is faster
 on my Athlon XP:
 
 to.axpb-2223:
 asmlogf: nsec per call:  60 60 58 61 60 57 60 62 62 58 57 57 57 62 62 62
 fdllogf: nsec per call:  46 45 45 45 46 45 45 45 45 46 45 45 45 45 45 45
 asmlogbf: nsec per call:  6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
 fdllogbf: nsec per call:  12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
 asmlog10f: nsec per call:  60 60 58 61 60 57 60 62 62 58 57 57 57 62 62 62
 fdllog10f: nsec per call:  87 87 83 90 88 79 86 94 94 81 78 78 79 94 94 95
 
 asmlog ties with fdllog on the axp (65-68 nsec for both).
 
 logb is quite different from the other functions so it doesn't really
 belong in this benchmark (I got it using grep :-).
 
 > > - delete all inverse trig i387 functions.
 >
 > This is a clear win for asin() and acos().  It's not so clear for
 > atan() or atan2():
 >
 > asmatan: nsec per call:  68 68 68 68 68 68 68 69 69 68 68 68 68 68 68 68
 > fdlatan: nsec per call:  92 92 92 92 92 94 97 70 70 97 95 92 92 92 92 92
 > fdlatanf: nsec per call:  70 70 70 70 70 71 72 58 58 72 70 69 70 75 71 69
 >
 > This is for the same Pentium 4 as above.  Do you get different
 > results for a saner processor, like an Athlon?  IIRC, atan2f() was
 > faster in assembly according to my naive tests, or I wouldn't have
 > imported it.  I don't remember what inputs I tried or why I left out
 > atanf().
 
 I didn't test atanf or atan2*, but fdlatan was faster on a K6-1, an old
 Celeron, a P3 and an AXP, but not on a 486:
 
 to.486dx2-66
 asmatan: nsec per call:  5518 5522 5527 5530 5474 5473 5674 5440 5433 5703 5625 5628 5554 5545 5554 5557
 fdlatan: nsec per call:  8128 8126 8127 8132 7990 8352 8910 7667 7557 8723 8272 7929 7913 7926 7915 7921
 
 to.axpb-2223
 asmatan: nsec per call:  87 87 87 87 87 87 87 78 78 87 87 87 87 87 87 87
 fdlatan: nsec per call:  65 65 65 65 65 66 68 51 51 68 66 65 65 65 65 65
 
 to.cel366
 asmatan: nsec per call:  444 444 444 444 444 444 444 424 424 444 444 444 444 444 444 444
 fdlatan: nsec per call:  370 370 370 370 370 382 397 323 323 397 382 370 370 370 370 370
 
 to.k6-233
 asmatan: nsec per call:  827 827 827 827 827 827 857 838 833 853 823 823 823 823 823 823
 fdlatan: nsec per call:  771 771 771 771 772 801 834 712 707 826 793 763 763 763 763 763
 to.p3-800
 asmatan: nsec per call:  209 209 205 209 209 209 209 200 200 209 209 209 209 209 209 209
 fdlatan: nsec per call:  175 175 175 176 176 181 179 150 149 178 174 172 171 171 172 172
 
 so asmatanf can only beat fdlatanf if the latter is doing something much
 worse than the double version.
 
 I tested with an almost unchanged version of -current's lib/msun.  I forgot
 to mention that I added arg reduction to the asm cosf, sinf and tanf.
 ucbtest noticed that the asm versions were broken, but after adding the
 range reduction, ucbtest didn't report any significant changes since last
 June.
 
 > [...]
 > > - think about optimizing the trig fdlibm double functions until they are
 > >   faster than the trig i387 double functions on a larger range than
 > >   [-pi/4, pi/4].  They could use extended precision, but only on
 > [...]
 
 > It's impossible to use a polynomial approximation on [0,pi/2] or a
 > larger range for tan(), since tan() grows faster than any
 > polynomial as it approaches pi/2.  There may be a rational
 > approximation that works well, but I doubt it.  It is possible to
 > find polynomial approximations for sin() and cos() on [0,pi/2],
 > but assuming that Dr. Ng knows what he's doing (a reasonable
 > assumption), the degree of any such polynomial would likely be
 > very large.
 
 I was only thinking of cos() and sin().  tan() has a good (local)
 rational approximation everywhere since it is the quotient of 2 functions
 that are analytic everywhere, but fdlibm already uses this via range
 reduction (tan() on [pi/4, 3pi/4] is like -1/tan() on [-pi/4, pi/4]).
 
 > By the way, the CEPHES library (netlib/cephes or
 > http://www.moshier.net/) has different versions of many of these
 > routines.  The trig functions are also approximated on [0,pi/4],
 > but accurate argument reduction is not used.  I have the licensing
 > issues worked out with the author and core@ if we want to use any
 > of these.  However, my experience with exp2() and expl() from
 > CEPHES showed that there are some significant inaccuracies, places
 > where the approximating polynomial can overflow, etc.
 
 Good work.  I only asked the author about licensing and found that
 there would be few problems.
 
 Bruce
State-Changed-From-To: open->closed 
State-Changed-By: remko 
State-Changed-When: Mon Sep 11 11:51:21 UTC 2006 
State-Changed-Why:  
This problem is visible in Bruce Evans's tests, so he has a 
local reminder about it; I am closing the PR on that basis.  If I am 
mistaken in doing so, please contact me and I will reopen the PR for 
you. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=67469 
>Unformatted:
