From nobody@FreeBSD.org  Wed Apr 18 19:29:25 2007
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 3013616A403
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 18 Apr 2007 19:29:25 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [69.147.83.33])
	by mx1.freebsd.org (Postfix) with ESMTP id 1416D13C48C
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 18 Apr 2007 19:29:25 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.13.1/8.13.1) with ESMTP id l3IJTOwc090243
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 18 Apr 2007 19:29:24 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.13.1/8.13.1/Submit) id l3IJOMUL088901;
	Wed, 18 Apr 2007 19:24:22 GMT
	(envelope-from nobody)
Message-Id: <200704181924.l3IJOMUL088901@www.freebsd.org>
Date: Wed, 18 Apr 2007 19:24:22 GMT
From: Adam McDougall<mcdouga9@egr.msu.edu>
To: freebsd-gnats-submit@FreeBSD.org
Subject: page fault while in kernel mode with samba in vfs_vmio_release
X-Send-Pr-Version: www-3.0

>Number:         111831
>Category:       kern
>Synopsis:       [nfs] [samba] [patch] page fault while in kernel mode with samba in vfs_vmio_release
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Apr 18 19:30:02 GMT 2007
>Closed-Date:    Wed Sep 19 01:41:38 GMT 2007
>Last-Modified:  Wed Sep 19 01:41:38 GMT 2007
>Originator:     Adam McDougall
>Release:        FreeBSD 6.2-STABLE #1: Tue Apr 17 11:55:07 EDT 2007
>Organization:
>Environment:
FreeBSD 6.2-STABLE #1: Tue Apr 17 11:55:07 EDT 2007
  amd64  root@ghost2:/usr/obj/usr/src/sys/X4100

>Description:
Background: I recently set up some Samba servers that serve all of
their data from NFSv3 mounts.  I generally have 400+ total concurrent
Samba connections during the day.  Access to the two servers is
controlled by a Foundry load balancer, but depending on the situation,
we may have only one server running.  The servers "ghost2" and "niobe2"
are dual-CPU, dual-core Opteron Sun Fire X4100 M2 systems from Sun,
running very recent 6-STABLE in amd64 mode.  I have also tried the same
setup on some dual Xeon 2.0 GHz Dell PowerEdge 2650 systems.

FreeBSD stays up for only a few hours in production, then it panics.
I have not been able to establish a repeatable test case other than
putting it in production and waiting.  I prefer to do this as little
as possible because the clients have trouble when I have to fall back
to the old Samba server, which I want to replace.

The panic is always a Fatal trap 12: page fault while in kernel mode.
Going by memory, I am fairly sure it is always "supervisor read data,
page not present" with a very low (two-digit) fault virtual address,
and always in vfs_vmio_release.  During earlier crashes with
DDB_UNATTENDED set, I never got a kernel coredump and had to refer to
the pointers to determine that vfs_vmio_release was involved.  Today
was the first time I had done enough preparation to let the servers
drop into DDB instead of trying to reboot, so I could do some live
debugging without worrying about getting the server back up ASAP.  Both
ghost2 and niobe2 are running the same binary world and kernel from
ghost2.

This morning, ghost2 panicked much earlier than I expected, and the ddb
trace involved FFS while in process smbstatus (which I was running once
per minute from a script).  At that point all the client load was
shifted over to niobe2 by the load balancer.  niobe2 survived until
noon, when it panicked in a similar manner, but the current process was
smbd and the trace involved NFS.  Both panics were in vfs_vmio_release.

Both servers will remain in DDB waiting for further probing; I have no
reason to reboot them until there is a proposed solution or workaround
to test.  I don't know what else to do in ddb until I am pointed to a
guide or told how I can help work toward a solution for this case.  I
am not a coder and have only basic kernel debugging skills.  I do have
several other servers available for testing, but I'd have to put them
into production to reproduce the problem.  Please let me know what I
can do.  I hope I have not forgotten anything important.  Thanks.

The following URL contains the kernel config, ddb output from the panic,
ps, trace, show pcpu/allpcpu/lockedvnods on both servers, and dmesg.

http://www.egr.msu.edu/~mcdouga9/x4100

>How-To-Repeat:
Unsure how to repeat on demand; the system must be put into production to produce a panic.
>Fix:

>Release-Note:
>Audit-Trail:

From: linimon@lonesome.com (Mark Linimon)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/111831: [nfs] page fault while in kernel mode with samba in vfs_vmio_release
Date: Thu, 19 Jul 2007 15:12:26 -0500

 ----- Forwarded message from Steve Sears <sjs@netapp.com> -----
 
 From: Steve Sears <sjs@netapp.com>
 To: freebsd-bugs@FreeBSD.org
 
 I hit this bug and fixed it thusly in nfsclient/nfs_bio.c, around line 1735:
 
             } else {
                 if (error) {
                     bp->b_ioflags |= BIO_ERROR;
                     /* Mark buffer invalid which will result in invalidating
                      * its pages and other buffer cleanup in brelse().
                      * Cannot set BIO_ERROR without marking buffer B_INVAL.
                      */
 +++                 bp->b_flags |= B_INVAL;
                     bp->b_error = np->n_error = error;
                     np->n_flag |= NWRITEERR;
 +++                 np->n_attrstamp = 0;
                 }
                 bp->b_dirtyoff = bp->b_dirtyend = 0;
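
The rule stated in the comment above (an errored buffer must also be
marked invalid) can be modelled in userland as a minimal sketch.
Everything prefixed toy_ or TOY_ is a hypothetical stand-in: the
structs, the flag values, and the helper names are assumptions made for
illustration and do not match the kernel definitions.

```c
#include <errno.h>

/* Hypothetical stand-ins; values do NOT match the kernel headers. */
#define TOY_BIO_ERROR  0x01
#define TOY_B_INVAL    0x02
#define TOY_NWRITEERR  0x04

struct toy_buf {
    int b_ioflags;
    int b_flags;
    int b_error;
    int b_dirtyoff;
    int b_dirtyend;
};

struct toy_node {
    int  n_flag;
    int  n_error;
    long n_attrstamp;
};

/*
 * Mirrors the patched error branch: flag the error, invalidate the
 * buffer, record the error on the node, and reset the cached
 * attribute timestamp.  The dirty range is cleared in either case.
 */
static void
toy_write_done(struct toy_buf *bp, struct toy_node *np, int error)
{
    if (error) {
        bp->b_ioflags |= TOY_BIO_ERROR;
        bp->b_flags |= TOY_B_INVAL;
        bp->b_error = np->n_error = error;
        np->n_flag |= TOY_NWRITEERR;
        np->n_attrstamp = 0;
    }
    bp->b_dirtyoff = bp->b_dirtyend = 0;
}

/* Invariant from the comment: TOY_BIO_ERROR implies TOY_B_INVAL. */
static int
toy_invariant_holds(const struct toy_buf *bp)
{
    return !(bp->b_ioflags & TOY_BIO_ERROR) ||
        (bp->b_flags & TOY_B_INVAL) != 0;
}
```

Calling toy_write_done() with a nonzero error and then checking
toy_invariant_holds() shows the pairing the patch enforces; removing
the TOY_B_INVAL line makes the check fail for any errored buffer.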
 
 Part of the problem is that the transport is returning an error that is not
 being dealt with gracefully. In my case it was EAGAIN. I also fixed the
 socket code to retry in the session layer if it gets EAGAIN. I'm using a
 specialized transport, so that part of the fix is not relevant.
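
The idea of retrying when the transport reports EAGAIN can be sketched
in userland as follows.  This is not the specialized transport code
mentioned above; toy_send_fn, toy_send_retry, and toy_flaky_send are
hypothetical names, and a real session layer would likely sleep or back
off between attempts rather than loop immediately.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical send callback: returns 0 on success or an errno value. */
typedef int (*toy_send_fn)(void *ctx, const void *data, size_t len);

/*
 * Retry while the transport reports EAGAIN, up to max_tries attempts.
 * Any other result is returned at once; if every attempt yields EAGAIN
 * (or max_tries is not positive), EAGAIN is returned to the caller.
 */
static int
toy_send_retry(toy_send_fn send, void *ctx, const void *data, size_t len,
    int max_tries)
{
    int error = EAGAIN;

    for (int i = 0; i < max_tries && error == EAGAIN; i++)
        error = send(ctx, data, len);
    return error;
}

/* Demo transport: fails with EAGAIN while the int countdown is positive. */
static int
toy_flaky_send(void *ctx, const void *data, size_t len)
{
    int *remaining = ctx;

    (void)data;
    (void)len;
    if (*remaining > 0) {
        (*remaining)--;
        return EAGAIN;
    }
    return 0;
}
```

With a countdown of 2 and max_tries of 5, the third attempt succeeds.
With a countdown larger than max_tries, the caller still sees EAGAIN and
has to handle it, which is the case the nfs_bio.c change covers.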
 
 I just looked at the file in the CVS repository and version 1.152.2.5,
 2007/07/17 21:02:08, is the same fix. If this change is MFC'd to 6.2, the
 submitter should be happy.
 
 
  -Steve
 
 ----- End forwarded message -----
State-Changed-From-To: open->patched 
State-Changed-By: rodrigc 
State-Changed-When: Thu Jul 19 22:39:42 UTC 2007 
State-Changed-Why:  
Steve Sears suspects that this bug is fixed by: 

1.152.2.5 of src/sys/nfsclient/nfs_bio.c on RELENG_6 branch 

which is a merge of 1.164 on mainline: 
http://lists.freebsd.org/pipermail/cvs-src/2007-July/080396.html 

http://www.freebsd.org/cgi/query-pr.cgi?pr=111831 
State-Changed-From-To: patched->closed 
State-Changed-By: rodrigc 
State-Changed-When: Wed Sep 19 01:41:04 UTC 2007 
State-Changed-Why:  
Closed at request of submitter. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=111831 
>Unformatted:
