From nobody@FreeBSD.org  Sat Jun  3 18:53:52 2006
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id AC77C16A41F
	for <freebsd-gnats-submit@FreeBSD.org>; Sat,  3 Jun 2006 18:53:52 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 6AE0943D45
	for <freebsd-gnats-submit@FreeBSD.org>; Sat,  3 Jun 2006 18:53:52 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.13.1/8.13.1) with ESMTP id k53Irq1d015999
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 3 Jun 2006 18:53:52 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.13.1/8.13.1/Submit) id k53IrqiA015998;
	Sat, 3 Jun 2006 18:53:52 GMT
	(envelope-from nobody)
Message-Id: <200606031853.k53IrqiA015998@www.freebsd.org>
Date: Sat, 3 Jun 2006 18:53:52 GMT
From: Rostislav Krasny <rosti.bsd@gmail.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: [PATCH] fpu_clean_state() cannot be disabled for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
X-Send-Pr-Version: www-2.3

>Number:         98460
>Category:       kern
>Synopsis:       [kernel] [patch] fpu_clean_state() cannot be disabled for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Sat Jun 03 19:00:35 GMT 2006
>Closed-Date:    
>Last-Modified:  Sun Jun 18 03:30:20 GMT 2006
>Originator:     Rostislav Krasny
>Release:        6.1-STABLE
>Organization:
>Environment:
>Description:
When FreeBSD is running on any non-AMD processor, the fpu_clean_state() function
adds unneeded operations to a context switch. My patch makes it possible
to disable the fpu_clean_state() by rebuilding a kernel with
"options CPU_FXSAVE_NO_LEAK".

Colin Percival has nothing against my idea in general:

http://lists.freebsd.org/pipermail/freebsd-current/2006-May/062683.html

and David Xu as well:

http://lists.freebsd.org/pipermail/freebsd-current/2006-May/063206.html

The following message is the beginning of that thread:

http://lists.freebsd.org/pipermail/freebsd-current/2006-April/062662.html
>How-To-Repeat:
You can use the following command to check how your kernel was built:

objdump -x /boot/kernel/kernel | grep fpu_clean_state
>Fix:
diff -ru src/sys.orig/amd64/amd64/fpu.c src/sys/amd64/amd64/fpu.c
--- src/sys.orig/amd64/amd64/fpu.c	Wed Apr 19 10:00:35 2006
+++ src/sys/amd64/amd64/fpu.c	Sat Jun  3 21:14:06 2006
@@ -33,6 +33,8 @@
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD: src/sys/amd64/amd64/fpu.c,v 1.157.2.1 2006/04/19 07:00:35 cperciva Exp $");
 
+#include "opt_cpu.h"
+
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/bus.h>
@@ -96,7 +98,9 @@
 
 typedef u_char bool_t;
 
+#ifndef CPU_FXSAVE_NO_LEAK
 static	void	fpu_clean_state(void);
+#endif
 
 int	hw_float = 1;
 SYSCTL_INT(_hw,HW_FLOATINGPT, floatingpoint,
@@ -409,7 +413,9 @@
 	PCPU_SET(fpcurthread, curthread);
 	pcb = PCPU_GET(curpcb);
 
+#ifndef CPU_FXSAVE_NO_LEAK
 	fpu_clean_state();
+#endif
 
 	if ((pcb->pcb_flags & PCB_FPUINITDONE) == 0) {
 		/*
@@ -478,7 +484,9 @@
 
 	s = intr_disable();
 	if (td == PCPU_GET(fpcurthread)) {
+#ifndef CPU_FXSAVE_NO_LEAK
 		fpu_clean_state();
+#endif
 		fxrstor(addr);
 		intr_restore(s);
 	} else {
@@ -488,6 +496,7 @@
 	curthread->td_pcb->pcb_flags |= PCB_FPUINITDONE;
 }
 
+#ifndef CPU_FXSAVE_NO_LEAK
 /*
  * On AuthenticAMD processors, the fxrstor instruction does not restore
  * the x87's stored last instruction pointer, last data pointer, and last
@@ -518,6 +527,7 @@
 	 */
 	__asm __volatile("ffree %%st(7); fld %0" : : "m" (dummy_variable));
 }
+#endif /* !CPU_FXSAVE_NO_LEAK */
 
 /*
  * This really sucks.  We want the acpi version only, but it requires
diff -ru src/sys.orig/amd64/conf/NOTES src/sys/amd64/conf/NOTES
--- src/sys.orig/amd64/conf/NOTES	Sun Apr 30 20:39:43 2006
+++ src/sys/amd64/conf/NOTES	Sat Jun  3 21:14:06 2006
@@ -57,6 +57,12 @@
 # Options for CPU features.
 #
 
+# CPU_FXSAVE_NO_LEAK disables security workaround of FPU registers leak by
+# FXSAVE and FXRSTOR instructions of "7th generation" and "8th generation"
+# processors manufactured by AMD. For more information read a
+# FreeBSD-SA-06:14.fpu security advisory.
+options 	CPU_FXSAVE_NO_LEAK
+
 #
 # PERFMON causes the driver for Pentium/Pentium Pro performance counters
 # to be compiled.  See perfmon(4) for more information.
diff -ru src/sys.orig/conf/options.amd64 src/sys/conf/options.amd64
--- src/sys.orig/conf/options.amd64	Thu Jun 30 02:23:16 2005
+++ src/sys/conf/options.amd64	Sat Jun  3 21:14:06 2006
@@ -49,6 +49,7 @@
 # EOF
 # -------------------------------
 HAMMER			opt_cpu.h
+CPU_FXSAVE_NO_LEAK	opt_cpu.h
 PPC_PROBE_CHIPSET	opt_ppc.h
 PPC_DEBUG		opt_ppc.h
 PSM_HOOKRESUME		opt_psm.h
diff -ru src/sys.orig/conf/options.i386 src/sys/conf/options.i386
--- src/sys.orig/conf/options.i386	Sat Jul  2 23:06:42 2005
+++ src/sys/conf/options.i386	Sat Jun  3 21:14:06 2006
@@ -52,6 +52,7 @@
 CPU_ELAN_XTAL			opt_cpu.h
 CPU_ENABLE_LONGRUN		opt_cpu.h
 CPU_FASTER_5X86_FPU		opt_cpu.h
+CPU_FXSAVE_NO_LEAK		opt_cpu.h
 CPU_GEODE			opt_cpu.h
 CPU_I486_ON_386			opt_cpu.h
 CPU_IORT			opt_cpu.h
diff -ru src/sys.orig/i386/conf/NOTES src/sys/i386/conf/NOTES
--- src/sys.orig/i386/conf/NOTES	Wed May 10 17:26:03 2006
+++ src/sys/i386/conf/NOTES	Sat Jun  3 21:14:06 2006
@@ -118,6 +118,11 @@
 #
 # CPU_FASTER_5X86_FPU enables faster FPU exception handler.
 #
+# CPU_FXSAVE_NO_LEAK disables security workaround of FPU registers leak by
+# FXSAVE and FXRSTOR instructions of "7th generation" and "8th generation"
+# processors manufactured by AMD. For more information read a
+# FreeBSD-SA-06:14.fpu security advisory.
+#
 # CPU_GEODE is for the SC1100 Geode embedded processor.  This option
 # is necessary because the i8254 timecounter is toast.
 #
@@ -192,6 +197,7 @@
 options 	CPU_ELAN_XTAL=32768000
 options 	CPU_ENABLE_LONGRUN
 options 	CPU_FASTER_5X86_FPU
+options 	CPU_FXSAVE_NO_LEAK
 options 	CPU_GEODE
 options 	CPU_I486_ON_386
 options 	CPU_IORT
diff -ru src/sys.orig/i386/isa/npx.c src/sys/i386/isa/npx.c
--- src/sys.orig/i386/isa/npx.c	Sun Apr 30 08:15:20 2006
+++ src/sys/i386/isa/npx.c	Sat Jun  3 21:14:06 2006
@@ -142,7 +142,7 @@
 
 typedef u_char bool_t;
 
-#ifdef CPU_ENABLE_SSE
+#if defined(CPU_ENABLE_SSE) && !defined(CPU_FXSAVE_NO_LEAK)
 static	void	fpu_clean_state(void);
 #endif
 
@@ -956,7 +956,7 @@
 		fnsave(addr);
 }
 
-#ifdef CPU_ENABLE_SSE
+#if defined(CPU_ENABLE_SSE) && !defined(CPU_FXSAVE_NO_LEAK)
 /*
  * On AuthenticAMD processors, the fxrstor instruction does not restore
  * the x87's stored last instruction pointer, last data pointer, and last
@@ -987,7 +987,7 @@
 	 */
 	__asm __volatile("ffree %%st(7); fld %0" : : "m" (dummy_variable));
 }
-#endif /* CPU_ENABLE_SSE */
+#endif /* CPU_ENABLE_SSE && !CPU_FXSAVE_NO_LEAK */
 
 static void
 fpurstor(addr)
@@ -996,7 +996,9 @@
 
 #ifdef CPU_ENABLE_SSE
 	if (cpu_fxsr) {
+#ifndef CPU_FXSAVE_NO_LEAK
 		fpu_clean_state();
+#endif
 		fxrstor(addr);
 	} else
 #endif

>Release-Note:
>Audit-Trail:

From: Bruce Evans <bde@zeta.org.au>
To: Rostislav Krasny <rosti.bsd@FreeBSD.org>
Cc: freebsd-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/98460: [PATCH] fpu_clean_state() cannot be disabled for
 not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Sun, 4 Jun 2006 07:26:38 +1000 (EST)

 On Sat, 3 Jun 2006, Rostislav Krasny wrote:
 
 >> Description:
 > When FreeBSD is running on any non AMD processor an fpu_clean_state() function
 > adds unneeded operations to a context switch. My patch makes it possible
 > to disable the fpu_clean_state() by rebuilding a kernel with
 > "options CPU_FXSAVE_NO_LEAK".
 >
 > Colin Percival has nothing against my idea in general:
 
 Hrmph.  My review implied that this should be done (not by me :-) before
 committing anything.
 
 The configuration should be dynamic and automatic, so that it doesn't
 take changes to zillions of configuration files to implement and
 document an option that almost no one will know to set.  I think there
 is a simple feature test for the AMD misfeature.  On i386's, this
 should be combined with the cpu_fxsr test so that only a single test
 is needed at runtime.  On amd64's, the test would be 1 unnecessary
 compare-and-branch.  I think it is not useful to have a configuration
 option to avoid this compare-and-branch.
 
 The overhead for fpu_clean_state() is about 28 cycles.  Has anyone
 actually noticed the extra context switching time for this?  It is
 quite small compared with other overheads.  E.g., the one for using
 the ACPI-[non]fast timecounter was about 2000 cycles at 2GHz.  Even
 this was only noticeable under some loads.
 
 Bruce

From: Rostislav Krasny <rosti.bsd@gmail.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be
 disabled for not AMD processors, those are not vulnerable to
 FreeBSD-SA-06:14.fpu
Date: Sun, 4 Jun 2006 18:36:23 +0300

 On Sun, 4 Jun 2006, Bruce Evans wrote: 
 > >> Description:
 > > When FreeBSD is running on any non AMD processor an fpu_clean_state() function
 > > adds unneeded operations to a context switch. My patch makes it possible
 > > to disable the fpu_clean_state() by rebuilding a kernel with
 > > "options CPU_FXSAVE_NO_LEAK".
 > >
 > > Colin Percival has nothing against my idea in general:
 > 
 > Hrmph.  My review implied that this should be done (not by me :-) before
 > committing anything.
 > 
 > The configuration should be dynamic and automatic, so that it doesn't
 > take changes to zillions of configuration files to implement and
 > document an option that almost no one will know to set.  I think there
 > is a simple feature test for the AMD misfeature.
 
 David Xu had proposed something like that. But from Colin Percival's
 reply I understood that it is hard to do effectively. See their
 discussion at the first URL in this PR.
 
 > On i386's, this
 > should be combined with the cpu_fxsr test so that only a single test
 > is needed at runtime.  On amd64's, the test would be 1 unnecessary
 > compare-and-branch.  I think it is not useful to have a configuration
 > option to avoid this compare-and-branch.
 
 Future amd64 processors may be changed back to the classical behavior
 of the FXSAVE and FXRSTOR instructions.

From: Bruce Evans <bde@zeta.org.au>
To: Rostislav Krasny <rosti.bsd@gmail.com>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
 for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Mon, 5 Jun 2006 08:25:06 +1000 (EST)

 On Sun, 4 Jun 2006, Rostislav Krasny wrote:
 
 > On Sun, 4 Jun 2006, Bruce Evans wrote:
 > > The configuration should be dynamic and automatic, so that it doesn't
 > > take changes to zillions of configuration files to implement and
 > > document an option that almost no one will know to set.  I think there
 > > is a simple feature test for the AMD misfeature.
 >
 > David Xu had proposed something like that. But from Colin Percival's
 > reply I understood that it is hard to be done effectively. See their
 > discussion by the first URL in this PR.
 
 I don't see how it can be hard.  Perhaps it is too CPU-dependent for
 tests based on cpuid to be easy or future-proof, but a runtime test
 in the probe would be easy.  Here is a userland version.  It gives the
 expected result on the following systems P2(Celeron) (mine), P3
 (freefall), P4(Xeon) (nosedive), AthlonXP (mine) and Opteron (sledge).
 It would crash on systems without FXSR.  To be complete, the userland
 version should repeat the test many times to reduce the chance of a
 misprobe due to broken context switching clobbering the pointer
 underneath it.  The kernel version can check for FXSR more easily and
 can just prevent context switching.
 
 %%%
 #include <sys/types.h>
 
 #ifdef __amd64__
 #include <machine/fpu.h>
 
 static struct savefpu xmmstate;
 #define	en_fip	en_rip
 #else
 #include <machine/npx.h>
 
 static struct savexmm xmmstate;
 #endif
 
 #include <stdio.h>
 
 int
 main(void)
 {
  	/* Set up a fairly clean state with a zero last-instruction pointer. */
  	asm("fninit");
 
  	/* Set the last-instruction pointer mod 2^32 to nonzero. */
  	asm(".align 2,0x90; nop; fnop");
 
  	/* Try to see what the last-instruction pointer got changed to. */
  	asm("fxsave xmmstate");
 
  	/* Have dubious AMD optimizations iff the change didn't get saved. */
  	if (xmmstate.sv_env.en_fip == 0) {
  		printf("cpu_fxsr |= CPU_FXSR_NEEDCLEAN;\n");
  		return (1);
  	} else {
  		printf("cpu_fxsr &= ~CPU_FXSR_NEEDCLEAN;\n");
  		return (0);
  	}
 }
 %%%
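
A minimal sketch of how the probe's result might be folded into cpu_fxsr so
that one test on the restore path covers both conditions.  Only the name
CPU_FXSR_NEEDCLEAN comes from the probe output above; its value, the
CPU_FXSR_ENABLED bit, and the helpers fxsr_flags() and fxsr_need_clean()
are assumptions made up for illustration.

```c
#include <assert.h>

/* Hypothetical bit layout; the existing cpu_fxsr is used as a plain boolean. */
#define	CPU_FXSR_ENABLED	0x01	/* FXSAVE/FXRSTOR are in use */
#define	CPU_FXSR_NEEDCLEAN	0x02	/* FXRSTOR leaves stale x87 pointers */

/*
 * Build the flag word once at probe time.  fip_saved is nonzero when the
 * probe saw FXSAVE store the last-instruction pointer.
 */
static int
fxsr_flags(int has_fxsr, int fip_saved)
{
	int flags;

	if (!has_fxsr)
		return (0);
	flags = CPU_FXSR_ENABLED;
	if (!fip_saved)
		flags |= CPU_FXSR_NEEDCLEAN;
	return (flags);
}

/* The single runtime test: NEEDCLEAN is only ever set along with ENABLED. */
static int
fxsr_need_clean(int cpu_fxsr)
{
	assert((cpu_fxsr & CPU_FXSR_NEEDCLEAN) == 0 ||
	    (cpu_fxsr & CPU_FXSR_ENABLED) != 0);
	return ((cpu_fxsr & CPU_FXSR_NEEDCLEAN) != 0);
}
```

With this layout, existing "if (cpu_fxsr)" tests would keep their meaning,
since the word is nonzero exactly when FXSR is enabled.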
 
 Bruce

From: Rostislav Krasny <rosti.bsd@gmail.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be
 disabled for not AMD processors, those are not vulnerable to
 FreeBSD-SA-06:14.fpu
Date: Tue, 6 Jun 2006 00:00:28 +0300

 On Mon, 5 Jun 2006 08:25:06 +1000 (EST)
 Bruce Evans <bde@zeta.org.au> wrote:
 
 > On Sun, 4 Jun 2006, Rostislav Krasny wrote:
 > 
 > > On Sun, 4 Jun 2006, Bruce Evans wrote:
 > > > The configuration should be dynamic and automatic, so that it doesn't
 > > > take changes to zillions of configuration files to implement and
 > > > document an option that almost no one will know to set.  I think there
 > > > is a simple feature test for the AMD misfeature.
 > >
 > > David Xu had proposed something like that. But from Colin Percival's
 > > reply I understood that it is hard to be done effectively. See their
 > > discussion by the first URL in this PR.
 > 
 > I don't see how it can be hard.  Perhaps it is too CPU-dependent for
 > tests based on cpuid to be easy or future-proof, but a runtime test
 > in the probe would be easy.  Here is a userland version.  It gives the
 > expected result on the following systems P2(Celeron) (mine), P3
 > (freefall), P4(Xeon) (nosedive), AthlonXP (mine) and Opteron (sledge).
 > It would crash on systems without FXSR.  To be complete, the userland
 > version should repeat the test many times to reduce the chance of a
 > misprobe due to broken context switching clobbering the pointer
 > underneath it.  The kernel version can check for FXSR more easily and
 > can just prevent context switching.
 > 
 > %%%
 > #include <sys/types.h>
 > 
 > #ifdef __amd64__
 > #include <machine/fpu.h>
 > 
 > static struct savefpu xmmstate;
 > #define	en_fip	en_rip
 > #else
 > #include <machine/npx.h>
 > 
 > static struct savexmm xmmstate;
 > #endif
 > 
 > int
 > main(void)
 > {
 >  	/* Set up a fairly clean state with a zero last-instruction pointer. */
 >  	asm("fninit");
 > 
 >  	/* Set the last-instruction pointer mod 2^32 to nonzero. */
 >  	asm(".align 2,0x90; nop; fnop");
 > 
 >  	/* Try to see what the last-instruction pointer got changed to. */
 >  	asm("fxsave xmmstate");
 > 
 >  	/* Have dubious AMD optimizations iff the change didn't get saved. */
 >  	if (xmmstate.sv_env.en_fip == 0) {
 >  		printf("cpu_fxsr |= CPU_FXSR_NEEDCLEAN;\n");
 >  		return (1);
 >  	} else {
 >  		printf("cpu_fxsr &= ~CPU_FXSR_NEEDCLEAN;\n");
 >  		return (0);
 >  	}
 > }
 > %%%
 
 And then you want to call the fpu_clean_state() function conditionally,
 as in the following example?
 
 if (cpu_fxsr & CPU_FXSR_NEEDCLEAN)
         fpu_clean_state();
 
 But this looks the same as what David Xu had proposed. Read what Colin
 Percival replied to that proposal:
 
 http://lists.freebsd.org/pipermail/freebsd-current/2006-May/062683.html
 
 Eliminating fpu_clean_state() with "options CPU_FXSAVE_NO_LEAK" could
 be used as a custom optimization. No one is obliged to use it, just as
 with many other CPU_* optimization options.

From: Bruce Evans <bde@zeta.org.au>
To: Rostislav Krasny <rosti.bsd@gmail.com>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
 for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Wed, 7 Jun 2006 12:09:10 +1000 (EST)

 On Tue, 6 Jun 2006, Rostislav Krasny wrote:
 
 > On Mon, 5 Jun 2006 08:25:06 +1000 (EST)
 > Bruce Evans <bde@zeta.org.au> wrote:
 >
 >> On Sun, 4 Jun 2006, Rostislav Krasny wrote:
 >>
 >>> On Sun, 4 Jun 2006, Bruce Evans wrote:
 >>>> The configuration should be dynamic and automatic, so that it doesn't
 >>>> take changes to zillions of configuration files to implement and
 >>>> document an option that almost no one will know to set.  I think there
 >>>> is a simple feature test for the AMD misfeature.
 >>>
 >>> David Xu had proposed something like that. But from Colin Percival's
 >>> reply I understood that it is hard to be done effectively. See their
 >>> discussion by the first URL in this PR.
 >>
 >> I don't see how it can be hard.  Perhaps it is too CPU-dependent for
 >> tests based on cpuid to be easy or future-proof, but a runtime test
 >> in the probe would be easy.  Here is a userland version.  It gives the
 >> ...
 >
 > And then you want to call the fpu_clean_state() function conditionally,
 > like in following example?
 >
 > if (cpu_fxsr & CPU_FXSR_NEEDCLEAN)
 >        fpu_clean_state();
 
 Not quite like that.  In my version there is no function call -- the code
 is executed in the one place where it is needed, so there is no function
 call overhead or possible branch prediction overhead for the function call.
 
 > But this looks same to what David Xu had proposed. Read what Colin
 > Percival had replied about that proposition:
 >
 > http://lists.freebsd.org/pipermail/freebsd-current/2006-May/062683.html
 
 >> The problem with doing something like this is that the branch will
 >> almost never be in the processor's branch prediction tables, so you
 >> will get a branch mis-prediction on the unaffected processors --
 >> which is likely to be more expensive than simply running the state
 >> cleaning code.
 
 It can't possibly be _more_ expensive, since the state-cleaning code
 has 2 or 3 branches in it instead of only 1.  It has 1 or 2 branches
 for the function call and return.  Whether function calls and returns
 use normal branch prediction is machine-dependent.  Whatever they use,
 it takes some CPU resources.  The state-cleaning code has a branch in
 it.  This branch is slightly harder to predict than a cpu_fxsr one.
 My second version of a fix avoided this branch by doing the fnclex()
 unconditionally (the first version did the load unconditionally and
 panicked in corner cases).  The code with the branch runs much faster
 than an unconditional fnclex() in a simple benchmark with the code in
 a loop, but I wonder if it is still faster after branch misprediction.
 
 > Eliminating the fpu_clean_state() by "options CPU_FXSAVE_NO_LEAK" could
 > be used as a custom optimization. No one is obliged to use it, as well
 > as many other CPU_* optimization options.
 
 There are too many options and not enough automatic tuning.  This
 particular optimization is particularly worth not doing since it is
 in the 10-100 cycle range (similar to what could be gained from avoiding
 a single branch misprediction or cache miss), but I care about it since
 it is to compensate for a pessimization.
 
 Bruce

From: Rostislav Krasny <rosti.bsd@gmail.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be
 disabled for not AMD processors, those are not vulnerable to
 FreeBSD-SA-06:14.fpu
Date: Fri, 9 Jun 2006 23:05:06 +0300

 On Wed, 7 Jun 2006 12:09:10 +1000 (EST)
 Bruce Evans <bde@zeta.org.au> wrote:
 
 > > And then you want to call the fpu_clean_state() function conditionally,
 > > like in following example?
 > >
 > > if (cpu_fxsr & CPU_FXSR_NEEDCLEAN)
 > >        fpu_clean_state();
 > 
 > Not quite like that.  In my version there is no function call -- the code
 > is executed in the one place where it is needed, so there is no function
 > call overhead or possible branch prediction overhead for the function call.
 
 Could you please explain in more detail how that can be done?

From: Bruce Evans <bde@zeta.org.au>
To: Rostislav Krasny <rosti.bsd@gmail.com>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
 for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Sat, 10 Jun 2006 11:26:20 +1000 (EST)

 On Fri, 9 Jun 2006, Rostislav Krasny wrote:
 
 > On Wed, 7 Jun 2006 12:09:10 +1000 (EST)
 > Bruce Evans <bde@zeta.org.au> wrote:
 >
 >>> And then you want to call the fpu_clean_state() function conditionally,
 >>> like in following example?
 >>>
 >>> if (cpu_fxsr & CPU_FXSR_NEEDCLEAN)
 >>>        fpu_clean_state();
 >>
 >> Not quite like that.  In my version there is no function call -- the code
 >> is executed in the one place where it is needed, so there is no function
 >> call overhead or possible branch prediction overhead for the function call.
 >
 > Could you please explain in more detail how that can be done?
 
 Just do it.  The easiest way is to define the new function as inline.
 This just works because the function is defined before it is used.
 
 My version uses open-coded inlining.  This works because in -current
 the function is called twice, but only one of the calls is non-bogus,
 so open-coded inlining of the calls needed doesn't result in multiple
 copies of the code.  In the amd64 version, there are 2 explicit calls,
 with a bogus one in npx_getregs() that just cleans the state already
 owned by the thread.  In the i386 version, there is one call in a low-level
 function that is called twice.  (If we really cared about efficiency,
 then we might inline this function too, but it would be even better
 to handle the differences between the fxsr and non-fxsr cases at a higher
 level.)
 
 Here is my version using open-coded inlining.
 
 % Index: npx.c
 % ===================================================================
 % RCS file: /home/ncvs/src/sys/i386/isa/npx.c,v
 % retrieving revision 1.152
 % diff -u -1 -r1.152 npx.c
 % --- npx.c	19 Jun 2004 22:24:16 -0000	1.152
 % +++ npx.c	22 Apr 2006 11:58:31 -0000
 % @@ -173,2 +176,3 @@
 % 
 % +static	float			npx_cleandata;
 %  static	union savefpu		npx_cleanstate;
 
 This also fixes a type mismatch and style bugs in the new variable.
 
 % @@ -796,13 +848,34 @@
 %  	PCPU_SET(fpcurthread, curthread);
 % -	pcb = PCPU_GET(curpcb);
 % 
 % +#ifdef CPU_ENABLE_SSE
 % +	/*
 % +	 * In the fxsr case, do a dummy load to set the last-instruction
 % +	 * pointers and opcode to constant (kernel) values in case
 % +	 * fpurstor() doesn't set them.  Certain AMD CPUs are too lazy
 % +	 * about saving and restoring them, so on these CPUs we have broken
 % +	 * context switching and would have a security hole if we didn't
 % +	 * force a setting here.  First, clear any pending exceptions to
 % +	 * ensure that the load doesn't trap (userland may have left us
 % +	 * with an unmasked pending exception since fxsave doesn't do an
 % +	 * implicit fninit).  The load itself will cause an exception if
 % +	 * userland has left us with a full stack; we let this happen
 % +	 * since it is harmless except for being much slower in the rare
 % +	 * case that it happens.
 % +	 */
 % +	if (cpu_fxsr /* && cpu_fxcsw_broken */) {
 % +		fnstsw(&status);
 % +		if (status & 0x80)
 % +			fnclex();
 % +		__asm("flds %0" : : "m" (npx_cleandata));
 % +	}
 % +#endif /* CPU_ENABLE_SSE */
 % +
 % +	pcb = PCPU_GET(curpcb);
 
 The inline function provides a better place to attach verbose comments.
 It has verbose comments that emphasize different details than the
 verbose comment above.  I would prefer to have less-verbose comments.
 
 The code in the above is identical with that in the function except it
 does the cpu_fxsr check more directly and it doesn't do the ffree (see
 the comment).
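
The "status & 0x80" test in the patch above checks bit 7 of the x87 status
word, the error-summary bit, which is set while an unmasked exception is
pending.  A small sketch that gives the constant a name; X87_SW_ES and
x87_exception_pending() are illustrative names, not kernel identifiers.

```c
#include <assert.h>	/* used only by the checks in the usage note */
#include <stdint.h>

/* Bit 7 of the x87 status word: error summary (unmasked exception pending). */
#define	X87_SW_ES	0x0080

/*
 * Return nonzero if the given status word, as stored by fnstsw, reports a
 * pending unmasked exception, i.e. if fnclex is needed before a dummy load.
 */
static int
x87_exception_pending(uint16_t status)
{
	return ((status & X87_SW_ES) != 0);
}
```

For example, a status word of 0x0081 (invalid-operation flag plus error
summary) would be reported as pending, while 0x3800 (only the top-of-stack
field set) would not.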
 
 Note that the fixup can probably be done much more optimally and/or
 simply by doing it in a less time-critical place than here.  The
 relevant CPUs have out-of-order execution with the FPUs independent of
 the integer ALUs.  It might not matter that fnclex is slow or that the
 fixup takes any cycles at all, since it runs almost entirely on the
 FPUs and the FPUs can proceed independently.  Unfortunately, placing
 the fixup here almost certainly gives a bottleneck in the FPU instruction
 scheduling.  Userland has just tried to execute an FPU instruction,
 and cannot proceed with at least that instruction until we have
 executed the fnstsw/[fnclex/]fld/fxrstor sequence here, and the sequence
 must be executed sequentially too.  Userland may be able to proceed
 with integer instructions but it is likely to stall on a dependency
 on the FPU instruction since the FPU instruction takes a long time for
 the trap here, and any addition to the time here is unlikely to be
 hidden by parallelism.  Placing the fixup after the fxsave on context
 switches would probably be more efficient.  On context switches, the
 kernel runs (or can be arranged to run) for a long time doing only
 integer operations after doing the floating point part of the switch,
 so an fxsave/fnstsw/[fnclex/]fld sequence can run in parallel.
 
 Parallel execution also affects the issue of whether FPU state for the
 switched-to thread should be loaded at context switch time instead of
 setting TS and only loading it (in the trap handler just below here) if
 they are used.  A direct load might be free since it can run in parallel.
 The trap handling certainly isn't free since it can't run in parallel.
 
 Fixes for an indirectly-related nearby bug (already fixed in the amd64
 version) that happen to be in the same patch:
 
 %  	if ((pcb->pcb_flags & PCB_NPXINITDONE) == 0) {
 %  		/*
 % -		 * This is the first time this thread has used the FPU or
 % -		 * the PCB doesn't contain a clean FPU state.  Explicitly
 % -		 * initialize the FPU and load the default control word.
 % +		 * This is the first time this thread has used the FPU, or
 % +		 * the PCB doesn't contain a clean FPU state.  Load the
 % +		 * initial state (XXX reword more).
 %  		 */
 % -		fninit();
 % -		control = __INITIAL_NPXCW__;
 % -		fldcw(&control);
 % +		fpurstor(&npx_cleanstate);
 %  		pcb->pcb_flags |= PCB_NPXINITDONE;
 
 Bruce

From: Rostislav Krasny <rosti.bsd@gmail.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be
 disabled for not AMD processors, those are not vulnerable to
 FreeBSD-SA-06:14.fpu
Date: Fri, 16 Jun 2006 02:04:47 +0300

 On Sat, 10 Jun 2006 11:26:20 +1000 (EST)
 Bruce Evans <bde@zeta.org.au> wrote:
 
 > On Fri, 9 Jun 2006, Rostislav Krasny wrote:
 > 
 > > On Wed, 7 Jun 2006 12:09:10 +1000 (EST)
 > > Bruce Evans <bde@zeta.org.au> wrote:
 > >
 > >>> And then you want to call the fpu_clean_state() function conditionally,
 > >>> like in following example?
 > >>>
 > >>> if (cpu_fxsr & CPU_FXSR_NEEDCLEAN)
 > >>>        fpu_clean_state();
 > >>
 > >> Not quite like that.  In my version there is no function call -- the code
 > >> is executed in the one place where it is needed, so there is no function
 > >> call overhead or possible branch prediction overhead for the function call.
 > >
 > > Could you please explain in more detail how that can be done?
 > 
 > Just do it.  The easiest way is to define the new function as inline.
 > This just works because the function is defined before it is used.
 >
 > [snipped]
 
 But you still check cpu_fxsr, so a branch misprediction on a good few
 CPUs is still possible. The only solution is self-modifying code with
 a direct jump. I made the following userland example of such code:
 
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/mman.h>
 
 void fu(int n)
 {
 	mprotect(&&lab0, 1, PROT_READ|PROT_WRITE|PROT_EXEC);
 lab0:
 	asm volatile ("			\n\
 		.byte	0xE9;		\n\
 	l0:	.long	0x00000000	\n\
 	l1:	bt	$0,%%ax;	\n\
 		jnc	l2;		\n\
 		movl	$l4,%%eax;	\n\
 		subl	$l1,%%eax;	\n\
 		movl	%%eax,l0;	\n\
 		movl	$0,%%eax;	\n\
 		jmp	l4;		\n\
 	l2:	movl	$l3,%%eax;	\n\
 		subl	$l1,%%eax;	\n\
 		movl	%%eax,l0;	\n\
 		movl	$0,%%eax;	\n\
 		jmp	l4;		\n\
 	l3:	addl	$0x0A,%%eax;	\n\
 	l4:				\n\
 		" : "=a"(n) : "a"(n));
 	printf("%d\n", n);
 }
 
 int main(int argc, char *argv[])
 {
 	int n = 0;	/* default when no argument is given */
 
 	if (argc > 1)
 		n = atoi(argv[1]);
 	fu(n);
 	fu(n);
 	fu(n);
 
 	return 0;
 }
 
 For demonstration purposes I wrote fu() so that its first call
 prints 0, to show when the code modification is done. All further calls
 of fu() print n or n+10, depending on the first n.
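
The patching step in the example stores "target minus l1" into the operand
of the 0xE9 jump, because a near jump's displacement is relative to the end
of the 5-byte instruction.  A minimal sketch of that arithmetic, assuming
32-bit addresses as on i386; encode_jmp_rel32() is a hypothetical helper
that only fills a byte buffer and does not patch or execute anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode "jmp rel32" (opcode 0xE9) into buf.  insn_addr is the address the
 * instruction will occupy and target is where it should land.  The
 * displacement counts from the byte after the instruction, which is the
 * label l1 in the example above.
 */
static void
encode_jmp_rel32(uint8_t buf[5], uint32_t insn_addr, uint32_t target)
{
	uint32_t disp;

	assert(buf != NULL);
	disp = target - (insn_addr + 5);	/* wraps mod 2^32 for backward jumps */
	buf[0] = 0xE9;
	buf[1] = (uint8_t)(disp & 0xff);	/* little-endian operand */
	buf[2] = (uint8_t)((disp >> 8) & 0xff);
	buf[3] = (uint8_t)((disp >> 16) & 0xff);
	buf[4] = (uint8_t)((disp >> 24) & 0xff);
}
```

A displacement of zero makes the jump fall through to the next instruction,
which is why the example starts with ".long 0x00000000" and runs the
patching code on the first call only.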
 
 I think there should be no need for mprotect() in the kernel. That
 technique could be combined with an assembly version of fpu_clean_state()
 from the following article. See the '"FXRSTOR-centric" method':
 
 http://security.freebsd.org/advisories/FreeBSD-SA-06:14-amd.txt
 
 That might be tricky, I know. But why should one pay a performance
 penalty because of a CPU he/she didn't buy?

From: Bruce Evans <bde@zeta.org.au>
To: Rostislav Krasny <rosti.bsd@gmail.com>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
 for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Fri, 16 Jun 2006 22:50:01 +1000 (EST)

 On Fri, 16 Jun 2006, Rostislav Krasny wrote:
 
 > On Sat, 10 Jun 2006 11:26:20 +1000 (EST)
 > Bruce Evans <bde@zeta.org.au> wrote:
 >
 >> On Fri, 9 Jun 2006, Rostislav Krasny wrote:
 >>
 >>> On Wed, 7 Jun 2006 12:09:10 +1000 (EST)
 >>> Bruce Evans <bde@zeta.org.au> wrote:
 
 >>>> [on avoiding some branches]
 >>>
 >>> Could you please explain in more detail how that can be done?
 >>
 >> Just do it.  The easiest way is to define the new function as inline.
 >> This just works because the function is defined before it is used.
 >>
 >> [snipped]
 >
 > But you still check cpu_fxsr, so a branch misprediction on a good few
 > CPUs is still possible. The only solution is a self-modified code with
 > a direct jump. I made following userland example of such a code:
 
 Why are we worrying about just this and not all the other branches on
 cpu_fxsr, not to mention all other branches in the kernel :-)?  Note
 that there's another one on cpu_fxsr, in the critical path for npxdna(),
 in fpurstor().  There are also many branches and other unnecessary
 overheads in the trap handling before npxdna() is called.  No one seems
 to be concerned about these.  I sometimes worry about these, and prefer
 my original implementation of i387 DNA handling all in assembler.  It
 takes 12 instructions with 1 branch where in my version of FreeBSD
 Xdna takes 124 instructions with 23 branches (46 instructions with 10
 branches in npxdna()).
 
 I don't know how common branch misprediction is in npxdna() (or in Xdna
 or trap() or in trap handling generally), but guess it is quite common,
 and fairly common for syscalls too, since traps are not very common
 and individual syscalls are not very common; thus the CPU is likely
 to have better things to do with memory cache and branch cache resources
 than caching traps or individual syscalls.  But if something is so
 little used that it doesn't stay cached then unnecessarily using it is
 unlikely to make a significant difference to efficiency.
 
 > [Example of self-modifying code]
 
 > I think there should be no need in mprotect() in the kernel. That
 > technique could be combined with an assembly version of fpu_clean_state()
 > from following article. See the '"FXRSTOR-centric" method':
 
 I think Linux is doing this now (perhaps more with nulling out unnecessary
 instructions).  Trap handlers can be patched even more easily and
 efficiently by pointing their IDT entry at a machine-dependent optimal
 handler, but as mentioned above FreeBSD does almost the opposite of
 that (it pushes everything through trap()).
 
 > http://security.freebsd.org/advisories/FreeBSD-SA-06:14-amd.txt
 >
 > That might be tricky, I know. But why one should pay a performance
 > penalty because of a CPU he/she didn't buy?
 
 Because the penalty is (?) too small to measure.  I would be interested
 in any measurement that shows otherwise, and generally in any method
 for measuring the cost of branches in code that should not be executed
 very often.  I often do micro-benchmarks by putting sequences of
 instructions in a loop, but this doesn't work right for code that is
 not executed very often.  I haven't looked at performance counter info
 for a long time.
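 
 A minimal userland sketch of the looped micro-benchmark style described
 above, assuming a POSIX clock_gettime().  The names demo_flag, demo_sink
 and bench_flag_test() are hypothetical stand-ins, not kernel code.  As
 noted, a loop like this only gives a best-case figure, because the branch
 stays cached and well predicted:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a flag word such as cpu_fxsr. */
static volatile int demo_flag;

/* Hypothetical stand-in for the work guarded by the flag. */
static volatile uint64_t demo_sink;

/* Monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Time `iters` passes over a flag test and return the average cost in
 * nanoseconds per pass.  The loop overhead is included, and the branch
 * is well predicted after the first few passes, so this is a best case.
 */
double bench_flag_test(int flag_value, uint64_t iters)
{
	uint64_t start, end, i;

	assert(iters > 0);
	demo_flag = flag_value;
	start = now_ns();
	for (i = 0; i < iters; i++) {
		if (demo_flag & 1)
			demo_sink += i;
	}
	end = now_ns();
	return (double)(end - start) / (double)iters;
}
```

 Calling bench_flag_test(0, 100000000) and bench_flag_test(1, 100000000)
 and comparing the results shows only the hot-loop cost; it says nothing
 about the cold, rarely executed case.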
 
 Bruce

From: Rostislav Krasny <rosti.bsd@gmail.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be
 disabled for not AMD processors, those are not vulnerable to
 FreeBSD-SA-06:14.fpu
Date: Sat, 17 Jun 2006 01:58:24 +0300

 ... or even much shorter version:
 
 		.byte	0xEB;		\n\	/* short JMP */
 	l0:	.byte	0x00;		\n\	/* where to jump */
 	l1:	bt	$1,%%ax;	\n\	/* CPU_FXSR_NEEDCLEAN bit */
 		jc	l2;		\n\
 		movb	$(l5-l1),l0;	\n\
 		jmp	l5;		\n\
 	l2:	movb	$(l3-l1),l0;	\n\
 	l3:	fnstsw	%%ax;		\n\
 		ffree	%%st(7);	\n\
 		bt	$7,%%ax;	\n\
 		jnc	l4;		\n\
 		fnclex;			\n\
 	l4:	fildl	safe_address;	\n\
 	l5:				\n\
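 
 For reference, a small C sketch of how the patched byte at l0 relates to
 the labels, assuming the standard encoding of the short JMP (opcode 0xEB
 followed by a signed 8-bit displacement counted from the next
 instruction).  The helper name short_jmp_rel8() is hypothetical:

```c
#include <stdint.h>

/*
 * Compute the rel8 operand of an x86 short JMP (opcode 0xEB).
 * The displacement is relative to the address of the next instruction,
 * i.e. jmp_addr + 2, which is label l1 in the fragment above.
 * Returns 0 on success, -1 if the target is out of rel8 range.
 */
int short_jmp_rel8(uintptr_t jmp_addr, uintptr_t target, int8_t *rel8)
{
	intptr_t disp = (intptr_t)target - (intptr_t)(jmp_addr + 2);

	if (disp < INT8_MIN || disp > INT8_MAX)
		return -1;
	*rel8 = (int8_t)disp;
	return 0;
}
```

 With the initial displacement of 0x00 the jump lands on l1, so the first
 pass runs the test and then stores either l3-l1 or l5-l1 into l0.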

From: Bruce Evans <bde@zeta.org.au>
To: Rostislav Krasny <rosti.bsd@gmail.com>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
 for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Sat, 17 Jun 2006 17:01:27 +1000 (EST)

 On Fri, 16 Jun 2006, Rostislav Krasny wrote:
 
 > On Fri, 16 Jun 2006 22:50:01 +1000 (EST)
 > Bruce Evans <bde@zeta.org.au> wrote:
 >
 >> Why are we worrying about just this and not all the other branches on
 >> cpu_fxsr, not to mention all other branches in the kernel :-)?
 >
 > I think it is a matter of principle. AMD saved few microcomands in
 > their incorrect implementation of two Pentium III instructions. And now
 > buyers if their processors are paying much more than those few
 > microcomands.
 
 No, the non-AMD users pay much less (unless the cost of branch prediction
 is very large).  When I tried to measure the overhead for the fix, I found
 that fxsave+fxrstor takes almost twice as long on a P4(Xeon) as on an
 Athlon(XP,64).  That's about 150 cycles longer IIRC.  The fix costs only
 14 cycles.
 
 These measurements were in microbenchmarks that loop (and in manuals
 that assume similar best-case setups).  The extra 150 cycles is free
 if it is done in parallel with integer operations.  npxdna() only does
 the fxrstor half and has limited parallelism, and I haven't measured
 how many of the extra 150/2 cycles are free (probably none).  14 cycles
 for the fix assumes no branch misprediction.
 
 14 cycles is a lot from one point of view, but from a practical point
 of view it is the same as 0.  Suppose that the kernel does 1000 context
 switches per second per CPU (too many for efficiency since it thrashes
 caches), and that an FPU switch occurs on all of these (it would
 normally be much less than that since half of all context switches are
 often to kernel threads (and half back), and many threads don't use the
 FPU).  We then waste 14000 cycles per second + more for branch misprediction
 and other cache effects.  At 2GHz 14000 cycles is a whole 7 us.
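 
 The estimate above can be reproduced with a few lines of C.  This is only
 a sketch of the arithmetic; the function names are hypothetical and the
 inputs (1000 switches per second, 14 cycles each, a 2GHz clock) are the
 assumed figures from the paragraph above:

```c
#include <assert.h>

/* Cycles spent per second on the fix: switches per second times cycles each. */
double wasted_cycles_per_sec(double switches_per_sec, double cycles_per_switch)
{
	return switches_per_sec * cycles_per_switch;
}

/* Convert a cycle count to microseconds at a clock rate given in Hz. */
double cycles_to_us(double cycles, double clock_hz)
{
	assert(clock_hz > 0.0);
	return cycles / clock_hz * 1e6;
}
```

 cycles_to_us(wasted_cycles_per_sec(1000, 14), 2e9) gives about 7, i.e.
 about 7 microseconds out of every second.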
 
 > Why should buyers of processors from other manufacturers,
 > which implemented FXSAVE and FXRSTOR correctly, pay even a tiny bit of
 > their performance for nothing?
 
 Because they can't measure the difference?
 
 I think that unless you modify millions of branches, there is more to be
 gained from things like scheduling instructions so that high-latency
 instructions like fxrstor are started early, but the gains here are still
 relatively small and are better done by compilers and CPUs because the
 best scheduling is machine-dependent.
 
 Bruce

From: Rostislav Krasny <rosti.bsd@gmail.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be
 disabled for not AMD processors, those are not vulnerable to
 FreeBSD-SA-06:14.fpu
Date: Fri, 16 Jun 2006 20:38:47 +0300

 On Fri, 16 Jun 2006 22:50:01 +1000 (EST)
 Bruce Evans <bde@zeta.org.au> wrote:
 
 > Why are we worrying about just this and not all the other branches on
 > cpu_fxsr, not to mention all other branches in the kernel :-)?
 
 I think it is a matter of principle. AMD saved a few micro-ops in
 their incorrect implementation of two Pentium III instructions. And now
 buyers of their processors are paying much more than those few
 micro-ops. Why should buyers of processors from other manufacturers,
 which implemented FXSAVE and FXRSTOR correctly, pay even a tiny bit of
 their performance for nothing?
 
 There is an assembly workaround, provided by AMD. Adding 10 more
 assembly instructions to it shouldn't be hard work. It could look like
 this:
 
 		.byte	0xEB;		\n\	/* short JMP */
 	l0:	.byte	0x00;		\n\	/* where to jump */
 	l1:	bt	$1,%%ax;	\n\	/* CPU_FXSR_NEEDCLEAN bit */
 		jc	l2;		\n\
 		movl	$l5,%%eax;	\n\
 		subl	$l1,%%eax;	\n\
 		movb	%%al,l0;	\n\
 		jmp	l5;		\n\
 	l2:	movl	$l3,%%eax;	\n\
 		subl	$l1,%%eax;	\n\
 		movb	%%al,l0;	\n\
 	l3:	fnstsw	%%ax;		\n\
 		ffree	%%st(7);	\n\
 		bt	$7,%%ax;	\n\
 		jnc	l4;		\n\
 		fnclex;			\n\
 	l4:	fildl	safe_address;	\n\
 	l5:				\n\

From: Rostislav Krasny <rosti.bsd@gmail.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be
 disabled for not AMD processors, those are not vulnerable to
 FreeBSD-SA-06:14.fpu
Date: Sun, 18 Jun 2006 00:31:55 +0300

 On Sat, 17 Jun 2006 17:01:27 +1000 (EST)
 Bruce Evans <bde@zeta.org.au> wrote:
 
 > On Fri, 16 Jun 2006, Rostislav Krasny wrote:
 > 
 > > On Fri, 16 Jun 2006 22:50:01 +1000 (EST)
 > > Bruce Evans <bde@zeta.org.au> wrote:
 > >
 > >> Why are we worrying about just this and not all the other branches on
 > >> cpu_fxsr, not to mention all other branches in the kernel :-)?
 > >
 > > I think it is a matter of principle. AMD saved few microcomands in
 > > their incorrect implementation of two Pentium III instructions. And now
 > > buyers if their processors are paying much more than those few
 > > microcomands.
 > 
 > No, the non-AMD users pay much less (unless the cost of branch prediction
 > is very large).  When I tried to measure the overhead for the fix, I found
 > that fxsave+fxrstor takes almost twice as long on a P4(Xeon) as on an
 > Athlon(XP,64).  That's about 150 cycles longer IIRC.  The fix costs only
 > 14 cycles.
 
 Yes, according to
 http://security.freebsd.org/advisories/FreeBSD-SA-06:14-amd.txt
 the "FXRSTOR-centric" method takes 14 cycles on AMD Opteron processor.
 That is the minimum which AMD users need to pay now. Non-AMD users have
 four options:
 
 1. run the same instructions down the drain
 2. test some flag
 3. jump over these instructions
 4. disable these instructions in the kernel build configuration
 
 Now, how much it will cost them:
 
 1. same 14 cycles (?)
 2. minimum 20 cycles on NetBurst or about 15 cycles on Pentium III
    http://www.intel.com/cd/ids/developer/asmo-na/eng/44010.htm?prn=Y
    plus 1 or 2 micro-ops for the BT or TEST instruction.
 3. 1 micro-op for one direct JMP
 4. nothing
 
 The last option costs the least, but kernel build options are
 inconvenient. Implementing the third option is simple. Why not
 do it? Only one byte of the code would be self-modified.
 
 > These measurements were in microbenchmarks that loop (and in manuals
 > that assume similar best-case setups).  The extra 150 cycles is free
 > if it is done in parallel with integer operations.  npxdna() only does
 > the fxrstor half and has limited parallelism, and I haven't measured
 > how many of the extra 150/2 cycles are free (probably none).  14 cycles
 > for the fix assumes no branch misprediction.
 > 
 > 14 cycles is a lot from one point of view, but from a practical point
 > of view it is the same as 0.  Suppose that the kernel does 1000 context
 > switches per second per CPU (too many for efficiency since it thrashes
 > caches), and that an FPU switch occurs on all of these (it would
 > normally be much less than that since half of all context switches are
 > often to kernel threds (and half back), and many threads don't use the
 > FPU.  We then waste 14000 cycles per second + more for branch misprediction
 > and other cache effects.  At 2GHz 14000 cycles is a whole 7uS.
 
 How many cycles does a context switch normally take? About 1000 cycles?
 Then 14 - 20 additional cycles take 1.4% - 2% of the previous context
 switch time. Why waste it?
 
 > > Why should buyers of processors from other manufacturers,
 > > which implemented FXSAVE and FXRSTOR correctly, pay even a tiny bit of
 > > their performance for nothing?
 > 
 > Because they can't measure the difference?
 
 From a practical point of view that waste could look minor, but as
 a matter of principle it's not.
 
 By the way, how many cycles would be saved by converting the
 fpu_clean_state() function to inline code?

From: Bruce Evans <bde@zeta.org.au>
To: Rostislav Krasny <rosti.bsd@gmail.com>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled
 for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Sun, 18 Jun 2006 13:30:09 +1000 (EST)

 On Sun, 18 Jun 2006, Rostislav Krasny wrote:
 
 > On Sat, 17 Jun 2006 17:01:27 +1000 (EST)
 > Bruce Evans <bde@zeta.org.au> wrote:
 >
 >> On Fri, 16 Jun 2006, Rostislav Krasny wrote:
 
 >>> ,,,
 >>> I think it is a matter of principle. AMD saved few microcomands in
 >>> their incorrect implementation of two Pentium III instructions. And now
 >>> buyers if their processors are paying much more than those few
 >>> microcomands.
 >>
 >> No, the non-AMD users pay much less (unless the cost of branch prediction
 >> is very large).  When I tried to measure the overhead for the fix, I found
 >> that fxsave+fxrstor takes almost twice as long on a P4(Xeon) as on an
 >> Athlon(XP,64).  That's about 150 cycles longer IIRC.  The fix costs only
 >> 14 cycles.
 >
 > Yes, according to
 > http://security.freebsd.org/advisories/FreeBSD-SA-06:14-amd.txt
 > the "FXRSTOR-centric" method takes 14 cycles on AMD Opteron processor.
 > That is the minimum which AMD users need to pay now. Non-AMD users have
 > four options:
 
 I confirmed the ~14 cycle value in a micro-benchmark but don't really
 believe it.  The difficulty of accounting for cache misses of various
 types (perhaps main branch target cache here) is shown partly by the
 AMD statement not even mentioning caches.
 
 > 1. run the same instructions down the drain
 > 2. test some flag
 > 3. jump over these instructions
 > 4. disable these instructions in the kernel build configuration
 
 5. Replace these instructions by no-op instructions.  (This can be done
     at no cost for many bytes of instructions on CPUs with micro-ops,
     but costs up to 2 (?) cycles per byte on old i386's.)
 6. Change the pointer to Xdna in the IDT to a pointer to a version
     without these instructions.
 7. Change Xdna (and/or routines that it calls, preferably none) to a
     version without these or hundreds of other instructions.
 8. Do some of the above for all branches and/or routine in the kernel
     to avoid hundreds of thousands of branches and other instructions.
 9. Use another method to exploit parallelism better.  fldl after fxsave
     is probably better for parallelism.
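 
 A userland C sketch of the idea behind option 6: choose the handler once
 and let the hot path call it through a pointer, so that no flag is tested
 on each restore.  All demo_* names are hypothetical, and a C function
 pointer only models what repointing an IDT entry would do; an indirect
 call has its own prediction cost, which this sketch does not measure:

```c
#include <stdbool.h>

/* Hypothetical counters so the behaviour can be observed. */
static int demo_cleans;
static int demo_restores;

/* Stand-in for an fxrstor with no workaround. */
static void demo_restore_plain(void)
{
	demo_restores++;
}

/* Stand-in for the clean-state workaround followed by fxrstor. */
static void demo_restore_with_clean(void)
{
	demo_cleans++;
	demo_restores++;
}

/* Chosen once at startup; defaults to the safe variant. */
static void (*demo_restore)(void) = demo_restore_with_clean;

/* Select the variant once, e.g. after CPU identification. */
void demo_select_restore(bool cpu_needs_clean)
{
	demo_restore = cpu_needs_clean ? demo_restore_with_clean
	    : demo_restore_plain;
}

/* Hot path: one indirect call, no per-call flag test. */
void demo_context_restore(void)
{
	demo_restore();
}

int demo_clean_count(void) { return demo_cleans; }
int demo_restore_count(void) { return demo_restores; }
```

 Defaulting the pointer to the variant with the workaround keeps the
 sketch safe if the selection step is never run.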
 
 > Now, how much it will cost them:
 >
 > 1. same 14 cycles (?)
 > 2. minimum 20 cycles on NetBurst or about 15 cycles on Pentium III
 >   http://www.intel.com/cd/ids/developer/asmo-na/eng/44010.htm?prn=Y
 >   plus 1 or 2 microcomands for BT or TEST instruction.
 > 3. 1 microcomand for one direct JMP
 > 4. nothing
 
 1. Possibly 14, probably more, but possibly less due to parallelism.
 2. Now at most 2 on modern CPUs under the same bad assumptions that
     give 14 for (1).
 3. Direct jumps sometimes take just as long as conditional jumps on
     some CPUs (I think due to them not being cached), but if something
     is sure to take only a single micro-op then there's a good chance
     of parallelism.
 4. Probably, but possibly not since the extra code might accidentally
     improve instruction scheduling :-).
 5. Like (3), except no-ops may reduce to 0 micro-ops instead of 1 and
     thus take 0 execution resources but some prefetch resources.
 6. Like (4).
 7. Like (6) repeated 50 times.  Xdna could take 20 times fewer instructions
     but wouldn't be 20 times faster because the slow fxrstor instruction
     would dominate.
 8. I think the potential savings from this huge task are about 10% for
     the kernel and some fraction of this for the system.
 9. "fxsave; testl $FLAG,cpu_fxsr; jz 1f; fnstsw ...; cmp ...; jz; fnclex;
      fldl ...; 1:".
     Now the cpu_fxsr test and even the status test might be free even if
     there is a branch misprediction since there are no important data
     dependencies.  If the CPU has enough execution units then it can do
     the following in parallel:
 
        FPU1         ALU1                    FPU2        ALU[2-]      FPU[2-]
        ----         ----                    ----        -------      -------
        fxsave       testl $FLAG,cpu_fxsr    idle        runs ahead   runs ahead
        ...          jz 1f                   idle        ...          ...
        ...          ...                     fnstsw
        ...          cmp                     ...
        ...          jz
        ...          runs ahead
        fnclex       ...
        fldl
        runs ahead
        ...
     Some serializing instruction, probably iret:
        iret         iret                    iret        iret         iret
 
 If the CPU soon returns to user mode then it will hit a serializing
 instruction soon, so it is important to start the slow fxsave instruction
 as early as possible so that everything doesn't have to wait for it.
 The npxsave() call in cpu_switch() was written about 13 years ago and
 the i386 cpu_switch() is more like 20 years old.  It knows nothing
 about multiple execution units and happens to schedule the npx switch
 (actually the save half of a switch) almost perfectly pessimally by
 doing it near the end.  However, mi_switch() has a lot of bloat so
 this probably doesn't matter -- the fxsave+fnclex sequence will complete
 before the bloat gets through the integer ALUs.
 
 I don't know if modern CPUs have this much parallelism.  My (old, paper)
 AthlonXP optimization manual says that fnstsw runs in the FSTORE pipe and
 doesn't say which pipe(s) fxsave runs in, so I guess fnstsw has to wait
 for fxsave.  You would like this since AthlonXPs would have to wait but
 Pentiums would proceed on all except ALU1 and FPU1 :-).
 
 > The last option has the best performance cost but kernel build options
 > are unhandy. Implementation of the third option is simple. Why not to
 > do it? Only one byte of the code will be self-modified.
 
 Because modifying only 1 byte in a 5MB library (the kernel) for a larger
 application (userland) would make little difference.
 
 >> 14 cycles is a lot from one point of view, but from a practical point
 >> of view it is the same as 0.  Suppose that the kernel does 1000 context
 >> switches per second per CPU (too many for efficiency since it thrashes
 >> caches), and that an FPU switch occurs on all of these (it would
 >> normally be much less than that since half of all context switches are
 >> often to kernel threds (and half back), and many threads don't use the
 >> FPU.  We then waste 14000 cycles per second + more for branch misprediction
 >> and other cache effects.  At 2GHz 14000 cycles is a whole 7uS.
 >
 > How many cycles a context switch normally takes? About 1000 cycles?
 > Then 14 - 20 additional cycles take 1.4% - 2% of the previous context
 > switch time. Why to waste it?
 
 More like 2000 (best case).  It was more like 1000 as recently as RELENG_4,
 but there have been many branches since then.  On my AthlonXP @2223 MHz
 with a TSC timecounter, according to LMbench:
 
 %                  L M B E N C H  2 . 0   S U M M A R Y
 %                  ------------------------------------
 % 
 % Context switching - times in microseconds - smaller is better
 % -------------------------------------------------------------
 % Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
 %                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
 % --------- ------------- ----- ------ ------ ------ ------ ------- -------
 % epsplex.b FreeBSD 4.10- 0.370 0.6800 7.9100 2.2800   14.1 4.62000    55.9
 % epsplex.b FreeBSD 5.2-C 0.830 1.3600 8.6200 3.2900   24.7 4.28000    58.5
 
 0.370 us is 823 cycles and 0.830 us is 1845 cycles.  The variance of
 these times is about 5%.  LMbench's context switching doesn't exercise
 the FPU.
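 
 The conversion from LMbench microseconds to cycles is just time
 multiplied by clock rate.  A sketch of that arithmetic in C, with
 hypothetical function names, using the 2223 MHz clock quoted above:

```c
#include <assert.h>

/* Convert a time in microseconds to cycles at a clock rate given in MHz. */
double us_to_cycles(double us, double clock_mhz)
{
	assert(clock_mhz > 0.0);
	return us * clock_mhz;	/* MHz is cycles per microsecond */
}

/* Express `extra` cycles as a percentage of a `base` cycle count. */
double share_percent(double extra, double base)
{
	assert(base > 0.0);
	return 100.0 * extra / base;
}
```

 us_to_cycles(0.370, 2223) is about 822.5 and us_to_cycles(0.830, 2223)
 is about 1845.1.  If 14 cycles were added to the 1845-cycle switch,
 share_percent(14, 1845) puts that at roughly 0.76%, which is well inside
 the 5% variance mentioned above.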
 
 Bruce
>Unformatted:
