From nobody@FreeBSD.org  Sat Sep 24 20:11:52 2011
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 259DF1065670
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 24 Sep 2011 20:11:52 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from red.freebsd.org (red.freebsd.org [IPv6:2001:4f8:fff6::22])
	by mx1.freebsd.org (Postfix) with ESMTP id 15AD58FC14
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 24 Sep 2011 20:11:52 +0000 (UTC)
Received: from red.freebsd.org (localhost [127.0.0.1])
	by red.freebsd.org (8.14.4/8.14.4) with ESMTP id p8OKBpvE069059
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 24 Sep 2011 20:11:51 GMT
	(envelope-from nobody@red.freebsd.org)
Received: (from nobody@localhost)
	by red.freebsd.org (8.14.4/8.14.4/Submit) id p8OKBpfo069058;
	Sat, 24 Sep 2011 20:11:51 GMT
	(envelope-from nobody)
Message-Id: <201109242011.p8OKBpfo069058@red.freebsd.org>
Date: Sat, 24 Sep 2011 20:11:51 GMT
From: Arnaud Lacombe <lacombar@gmail.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: buf_ring(9) statistics accounting not MPSAFE
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         160992
>Category:       kern
>Synopsis:       buf_ring(9) statistics accounting not MPSAFE
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Sep 24 20:20:10 UTC 2011
>Closed-Date:    
>Last-Modified:  Wed Oct 05 05:47:58 UTC 2011
>Originator:     Arnaud Lacombe
>Release:        9-CURRENT
>Organization:
n/a
>Environment:
>Description:
The following block of code, in `sys/sys/buf_ring.h':


       /*
        * If there are other enqueues in progress
        * that preceded us, we need to wait for them
        * to complete
        */
       while (br->br_prod_tail != prod_head)
               cpu_spinwait();
       br->br_prod_bufs++;
       br->br_prod_bytes += nbytes;
       br->br_prod_tail = prod_next;
       critical_exit();


can be observed at runtime, in terms of memory ordering, as:

      while (br->br_prod_tail != prod_head)
              cpu_spinwait();
      br->br_prod_tail = prod_next;
      br->br_prod_bufs++;
      br->br_prod_bytes += nbytes;
      critical_exit();


That is, there is no memory barrier to enforce completion of the
load/increment/store/load/load/addition/store operations before
updating what other threads spin on.

Even if `br_prod_tail' is marked `volatile', there is no guarantee that
it will not be reordered with respect to the non-volatile writes (to
`br_prod_bufs' and `br_prod_bytes').
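
One possible way to express the missing ordering is sketched below. This
is only an illustration under stated assumptions, not the fix adopted for
buf_ring(9): `struct ring_sketch', `ring_sketch_init()' and
`ring_sketch_publish()' are hypothetical names, the structure is a
simplified stand-in for `struct buf_ring', and C11 <stdatomic.h> is used
in place of the kernel's atomic(9) primitives so that the example builds
in userland. The release store keeps the two counter updates from being
reordered after the write to the tail, and the acquire load makes them
visible to the next producer once it leaves the spin loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct buf_ring. */
struct ring_sketch {
	_Atomic uint32_t prod_tail;	/* what other producers spin on */
	uint64_t prod_bufs;		/* statistics, plain memory */
	uint64_t prod_bytes;
};

static void
ring_sketch_init(struct ring_sketch *br)
{
	assert(br != NULL);
	atomic_init(&br->prod_tail, 0);
	br->prod_bufs = 0;
	br->prod_bytes = 0;
}

/*
 * Final step of an enqueue: wait for our turn, account, then publish.
 * prod_head and prod_next are assumed to have been reserved earlier,
 * as in the block quoted above.
 */
static void
ring_sketch_publish(struct ring_sketch *br, uint32_t prod_head,
    uint32_t prod_next, uint64_t nbytes)
{
	/* Acquire: see the counter updates of the producer before us. */
	while (atomic_load_explicit(&br->prod_tail,
	    memory_order_acquire) != prod_head)
		;	/* the kernel code calls cpu_spinwait() here */

	br->prod_bufs++;
	br->prod_bytes += nbytes;

	/* Release: the counter stores cannot move below this store. */
	atomic_store_explicit(&br->prod_tail, prod_next,
	    memory_order_release);
}
```

Calling ring_sketch_publish(&br, 0, 1, 100) and then
ring_sketch_publish(&br, 1, 2, 50) on a freshly initialized ring leaves
the tail at 2 with 2 buffers and 150 bytes accounted. The sketch only
orders the producers among themselves; a reader of the counters that does
not participate in the same protocol can still race with the updates.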


Confirmed by Kip Macy (kmacy@) in
http://lists.freebsd.org/pipermail/freebsd-hackers/2011-September/036454.html.
>How-To-Repeat:
code review.
>Fix:


>Release-Note:
>Audit-Trail:

From: Bruce Evans <brde@optusnet.com.au>
To: Arnaud Lacombe <lacombar@gmail.com>
Cc: freebsd-gnats-submit@freebsd.org, freebsd-bugs@freebsd.org
Subject: Re: kern/160992: buf_ring(9) statistics accounting not MPSAFE
Date: Sun, 25 Sep 2011 13:01:04 +1000 (EST)

 On Sat, 24 Sep 2011, Arnaud Lacombe wrote:
 
 >> Description:
 > The following block of code, in `sys/sys/buf_ring.h':
 >
 >
 >       /*
 >        * If there are other enqueues in progress
 >        * that preceded us, we need to wait for them
 >        * to complete
 >        */
 >       while (br->br_prod_tail != prod_head)
 >               cpu_spinwait();
 >       br->br_prod_bufs++;
 >       br->br_prod_bytes += nbytes;
 >       br->br_prod_tail = prod_next;
 >       critical_exit();
 >
 >
 > can be observed at runtime, in terms of memory ordering, as:
 >
 >      while (br->br_prod_tail != prod_head)
 >              cpu_spinwait();
 >      br->br_prod_tail = prod_next;
 >      br->br_prod_bufs++;
 >      br->br_prod_bytes += nbytes;
 >      critical_exit();
 >
 > That is, there is no memory barrier to enforce completion of the
 > load/increment/store/load/load/addition/store operations before
 > updating what other threads spin on.
 
 The counters are 64 bits, so it also does non-atomic increments of them
 on 32-bit arches.
 
 > Even if `br_prod_tail' is marked `volatile', there is no guarantee
 > that it will not be reordered with respect to the non-volatile writes
 > (to `br_prod_bufs' and `br_prod_bytes').
 
 Using volatile is generally bogus.  Here it seems to mainly give
 pessimizations and more opportunities for bad memory orders.  The i386
 code for incrementing a 64-bit volatile x is:
 
  	movl	x, %eax
  	movl	x+4, %edx
  	addl	$1, %eax
  	adcl	$0, %edx
  	movl	%eax, x
  	movl	%edx, x+4
 
 while for a 64-bit non-volatile it is:
 
  	addl	$1, x
  	adcl	$0, x+4
 
 so volatile gives more caching in registers instead of less.  The following
 are some of the bad memory orders possible:
 
 with volatile:
         lo = br->br_prod_bytes.lo;
         hi = br->br_prod_bytes.hi;
         br->br_prod_tail = prod_next;
         br->br_prod_bufs++;
         lo += nbytes;
         hi += carry;
         br->br_prod_bytes.hi = hi;
         br->br_prod_bytes.lo = lo;
 
 without volatile:
 
         br->br_prod_bytes.lo += nbytes;
         br->br_prod_tail = prod_next;
         br->br_prod_bufs++;
         br->br_prod_bytes.hi += carry;
 
 I think the token method would make the nonatomic accesses to the
 counters sufficiently atomic if it worked.  The necessary memory
 barriers would probably have a memory clobber which effectively makes
 all memory variables transiently volatile (where volatile actually
 means non-volatile with respect to them changing -- holding the token
 prevents them changing -- but actually means volatile with respect to
 their memory accesses) so the effect of declaring the counters permanently
 volatile would be reduced to a pessimization.  Even reads of them in
 sysctls must hold the token to get a consistent snapshot.
 
 Bruce
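
 The split 64-bit update described in the reply above can be replayed
 deterministically in userland. The following is a minimal sketch, not
 kernel code: `struct split_counter' and the `split_*()' and
 `torn_read_demo()' helpers are hypothetical names that model a 64-bit
 counter as the two 32-bit halves an i386 CPU updates separately. It
 follows the "without volatile" order, with a reader sampling the
 counter between the store to the low half and the store to the high
 half.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit counter as two 32-bit words. */
struct split_counter {
	uint32_t lo;
	uint32_t hi;
};

/* Value a reader computes from whatever halves are in memory now. */
static uint64_t
split_read(const struct split_counter *c)
{
	return (((uint64_t)c->hi << 32) | c->lo);
}

/* First half of the update (addl): returns the carry out of lo. */
static uint32_t
split_add_lo(struct split_counter *c, uint32_t nbytes)
{
	uint32_t old = c->lo;

	c->lo = old + nbytes;
	return (c->lo < old);	/* 1 if the addition wrapped */
}

/* Second half of the update (adcl): propagate the carry into hi. */
static void
split_add_hi(struct split_counter *c, uint32_t carry)
{
	c->hi += carry;
}

/*
 * Replay one update with a reader running between the two halves.
 * Returns what the reader saw; *final gets the value after the update.
 */
static uint64_t
torn_read_demo(uint32_t start_lo, uint32_t nbytes, uint64_t *final)
{
	struct split_counter c = { start_lo, 0 };
	uint32_t carry;
	uint64_t seen;

	assert(final != NULL);
	carry = split_add_lo(&c, nbytes);
	seen = split_read(&c);		/* reader runs mid-update */
	split_add_hi(&c, carry);
	*final = split_read(&c);
	return (seen);
}
```

 With a starting low word of 0xfffffff0 and nbytes of 0x20, the reader
 sees 0x10 while the counter ends up at 0x100000010, so the sampled
 value is neither the old nor the new count.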
>Unformatted:
