From nobody@FreeBSD.org  Thu Aug  4 21:27:26 2005
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id ECD3916A41F
	for <freebsd-gnats-submit@FreeBSD.org>; Thu,  4 Aug 2005 21:27:26 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id B354343D78
	for <freebsd-gnats-submit@FreeBSD.org>; Thu,  4 Aug 2005 21:27:13 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.13.1/8.13.1) with ESMTP id j74LRDMa010592
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 4 Aug 2005 21:27:13 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.13.1/8.13.1/Submit) id j74LRDUQ010580;
	Thu, 4 Aug 2005 21:27:13 GMT
	(envelope-from nobody)
Message-Id: <200508042127.j74LRDUQ010580@www.freebsd.org>
Date: Thu, 4 Aug 2005 21:27:13 GMT
From: David Kirchner <dpk@dpk.net>
To: freebsd-gnats-submit@FreeBSD.org
Subject: Panics occur when PAE enabled and >3.5GB memory used
X-Send-Pr-Version: www-2.3

>Number:         84563
>Category:       i386
>Synopsis:       [pae] Panics occur when PAE enabled and >3.5GB memory used
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    linimon
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Aug 04 21:30:14 GMT 2005
>Closed-Date:    Sun Jul 08 16:07:18 GMT 2007
>Last-Modified:  Sun Jul 08 16:07:18 GMT 2007
>Originator:     David Kirchner
>Release:        5.4-RELEASE-p5 and -STABLE as of a few days ago
>Organization:
>Environment:
FreeBSD host 5.4-RELEASE-p5 FreeBSD 5.4-RELEASE-p5 #0: Thu Aug  4 13:06:27 PDT 2005     root@host:/usr/src/sys/i386/compile/STD  i386
>Description:
This occurs on a Supermicro X6DVA-4G/EG system with 4GB of RAM. We have several of these with the same configuration, and they all have the same problem.

The problem is repeatable using the PAE kernel config that comes stock with the OS.

The problem appears to occur when memory above 3.5GB (memory which the BIOS remaps to just above 4096MB) is touched in some way, perhaps when it is paged out.

Here are two traces from two different panics, with something in common:

(gdb) bt
#0  kdb_enter (msg=0x12 <Address 0x12 out of bounds>) at ../../../kern/subr_kdb.c:266
#1  0xc033ea1f in panic (fmt=0xc04d782d "ffs_write: dir write") at ../../../kern/kern_shutdown.c:550
#2  0xc04292de in ffs_write (ap=0xeb858a94) at ../../../ufs/ffs/ffs_vnops.c:614
#3  0xc0452e71 in vnode_pager_generic_putpages (vp=0xc6237630, m=0xeb858bf0, bytecount=4096,
    flags=0, rtvals=0xeb858b70) at vnode_if.h:432
#4  0xc038b7e2 in vop_stdputpages (ap=0x12) at ../../../kern/vfs_default.c:650
#5  0xc038af3b in vop_defaultop (ap=0x0) at ../../../kern/vfs_default.c:157
#6  0xc0435ebf in ufs_vnoperate (ap=0x0) at ../../../ufs/ufs/ufs_vnops.c:2821
#7  0xc0452c0e in vnode_pager_putpages (object=0xc6901a50, m=0x12, count=18, sync=0, rtvals=0x12)
    at vnode_if.h:1357
#8  0xc044a5db in vm_pageout_flush (mc=0xeb858bf0, count=1, flags=0) at vm_pager.h:147
#9  0xc044a505 in vm_pageout_clean (m=0x0) at ../../../vm/vm_pageout.c:347
#10 0xc044b386 in vm_pageout_scan (pass=1) at ../../../vm/vm_pageout.c:985
#11 0xc044c106 in vm_pageout () at ../../../vm/vm_pageout.c:1476
#12 0xc032911d in fork_exit (callout=0xc044bdf4 <vm_pageout>, arg=0x0, frame=0xeb858d48)
    at ../../../kern/kern_fork.c:791
#13 0xc0474f6c in fork_trampoline () at ../../../i386/i386/exception.s:209

#0  kdb_enter (msg=0x12 <Address 0x12 out of bounds>) at ../../../kern/subr_kdb.c:266
#1  0xc033ea1f in panic (fmt=0xc04c99ff "lockmgr: thread %p, not %s %p unlocking")
    at ../../../kern/kern_shutdown.c:550
#2  0xc0333181 in lockmgr (lkp=0xc61f5e14, flags=6, interlkp=0x1000000, td=0x0)
    at ../../../kern/kern_lock.c:419
#3  0xc038b08b in vop_stdunlock (ap=0x12) at ../../../kern/vfs_default.c:295
#4  0xc038af3b in vop_defaultop (ap=0x0) at ../../../kern/vfs_default.c:157
#5  0xc03010bb in spec_vnoperate (ap=0x0) at ../../../fs/specfs/spec_vnops.c:118
#6  0xc0301648 in spec_write (ap=0xeb858a94) at vnode_if.h:1044
#7  0xc03010bb in spec_vnoperate (ap=0x0) at ../../../fs/specfs/spec_vnops.c:118
#8  0xc0452ecd in vnode_pager_generic_putpages (vp=0xc61f5d68, m=0xeb858bf0, bytecount=4096,
    flags=0, rtvals=0xeb858b70) at vnode_if.h:432
#9  0xc038b7e2 in vop_stdputpages (ap=0x12) at ../../../kern/vfs_default.c:650
#10 0xc038af3b in vop_defaultop (ap=0x0) at ../../../kern/vfs_default.c:157
#11 0xc03010bb in spec_vnoperate (ap=0x0) at ../../../fs/specfs/spec_vnops.c:118
#12 0xc0452c6a in vnode_pager_putpages (object=0xc085e7bc, m=0x12, count=18, sync=0, rtvals=0x12)
    at vnode_if.h:1357
#13 0xc044a603 in vm_pageout_flush (mc=0xeb858bf0, count=1, flags=0) at vm_pager.h:147
#14 0xc044a52d in vm_pageout_clean (m=0x0) at ../../../vm/vm_pageout.c:347
#15 0xc044b3df in vm_pageout_scan (pass=0) at ../../../vm/vm_pageout.c:996
#16 0xc044c162 in vm_pageout () at ../../../vm/vm_pageout.c:1487
#17 0xc032911d in fork_exit (callout=0xc044be50 <vm_pageout>, arg=0x0, frame=0xeb858d48)
    at ../../../kern/kern_fork.c:791
#18 0xc0474fcc in fork_trampoline () at ../../../i386/i386/exception.s:209

In both cases, you'll notice that vm_pageout_flush's mc argument is identical: 0xeb858bf0, which is 3,951,397,872 in decimal. When you boot these servers without PAE enabled, the "real memory" is 3,757,965,312. I think this indicates that the page the kernel is dealing with is within the "remapped" region.
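
The arithmetic above can be checked with a short C sketch. It only converts the mc value from the traces to decimal and compares it with the non-PAE "real memory" figure. Note that mc is a pointer argument, so the comparison supports the guess only if that value relates to a physical address, which this report does not establish. The names bytes_above and show_mc_comparison are made up for this illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Both values are copied from the report above. */
#define MC_VALUE        UINT64_C(0xeb858bf0)  /* mc argument seen in both traces */
#define NON_PAE_REALMEM UINT64_C(3757965312)  /* "real memory" without PAE */

/* Hypothetical helper: how far (in bytes) value lies above limit, or 0. */
uint64_t bytes_above(uint64_t value, uint64_t limit)
{
    return value > limit ? value - limit : 0;
}

/* Confirm the hex-to-decimal conversion and print the comparison. */
void show_mc_comparison(void)
{
    assert(MC_VALUE == UINT64_C(3951397872));
    printf("mc          = %" PRIu64 "\n", (uint64_t)MC_VALUE);
    printf("real memory = %" PRIu64 "\n", (uint64_t)NON_PAE_REALMEM);
    printf("difference  = %" PRIu64 " bytes\n",
           bytes_above(MC_VALUE, NON_PAE_REALMEM));
}
```

Calling show_mc_comparison() from main() prints the two figures and the gap between them.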

There is also a third panic, for which I do not have a trace, but it follows the same pattern that this person saw:

http://groups-beta.google.com/group/lucky.freebsd.stable/browse_thread/thread/99978f6cbf071223/136ab31fcd339d5c?lnk=st&q=freebsd+4GB+PAE+thread&rnum=5&hl=en#136ab31fcd339d5c

That seems to involve memory in roughly the same range as what I'm seeing.

My understanding of kernel internals and fancy PAE memory access is pretty limited, so I could be way off on my guesses. It does seem that others are having the same trouble, though.
>How-To-Repeat:
This bug is very easy to reproduce. On the system, compile and install the PAE kernel, reboot, then run a program that calloc()s 500MB several times while rebuilding the kernel repeatedly. Eventually the kernel will crash (usually 10-30 minutes in). I believe it crashes when it starts putting pages above 3.5GB into the inactive queue, or tries to swap them out, or something like that.
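
A minimal sketch of such a memory-pressure program follows. It is not the submitter's actual test program: the function names, the assumed 4096-byte page size, and the choice to hold several allocations inside one process are all assumptions (the report does not say whether "several times" means several processes or several allocations). Each page is written once so that the zeroed memory is really backed by RAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_BYTES ((size_t)500 * 1024 * 1024)  /* 500MB, as in the report */
#define PAGE_BYTES  ((size_t)4096)               /* assumed i386 page size */

/*
 * Allocate one zeroed chunk and write one byte per page so that every
 * page is actually touched. Returns the chunk, or NULL if calloc()
 * fails. The caller owns the returned memory.
 */
unsigned char *alloc_and_touch(size_t bytes)
{
    assert(bytes > 0);
    unsigned char *p = calloc(1, bytes);
    if (p == NULL)
        return NULL;
    for (size_t off = 0; off < bytes; off += PAGE_BYTES)
        p[off] = 1;
    return p;
}

/*
 * Allocate up to max_chunks chunks of the given size and keep them all
 * allocated. Pointers are stored in chunks[]; the return value is the
 * number of allocations that succeeded.
 */
size_t hold_chunks(unsigned char **chunks, size_t max_chunks, size_t bytes)
{
    size_t n = 0;
    while (n < max_chunks) {
        chunks[n] = alloc_and_touch(bytes);
        if (chunks[n] == NULL)
            break;
        n++;
    }
    return n;
}
```

On a test machine, a main() could call hold_chunks() with CHUNK_BYTES and then keep the process alive while the kernel rebuild runs in another shell; per-process memory limits may require running several instances instead.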
>Fix:
Unknown. Disabling PAE works, but is obviously not ideal.
>Release-Note:
>Audit-Trail:

From: dpk <dpk@dpk.net>
To: bug-followup@freebsd.org
Cc:  
Subject: Re: i386/84563: Panics occur when PAE enabled and >3.5GB memory used
Date: Fri, 5 Aug 2005 11:34:32 -0700 (PDT)

 Applying this patch to FreeBSD 5.4-RELEASE-p5 (or -p6)
 
 http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/i386/i386/pmap.c.diff?r1=1.523&r2=1.524&f=h
 
 appears to resolve the problem. The patch has been MFC'd to RELENG_6; can
 it be issued for RELENG_5_4 as well? We'll be using it as a local patch
 for now.

From: Jimmy Myrick <jmyrick@tiger1.tiger.org>
To: <bug-followup@FreeBSD.org>, <dpk@dpk.net>
Cc:  
Subject: Re: i386/84563: Panics occur when PAE enabled and >3.5GB memory used
Date: Thu, 13 Oct 2005 08:11:50 -0500 (CDT)

 This diff also worked for me against 5.4-RELEASE-p8.
 
 The machine is a Dell PowerEdge 2800 that has the same problem.  When
 using a PAE-enabled kernel it crashes under heavy load (buildworld, for
 example).  A non-PAE kernel works fine, but only addresses around
 3.5 GB of RAM.  The PAE-enabled kernel addresses all 4 GB of memory.
 
 After applying the diff, the system works fine with all 4 GB of memory
 available.  It has not crashed under heavy load yet.
 
 Info on what was changed at line 1859 in
 /usr/src/sys/i386/i386/pmap.c:
 
 -					m = PHYS_TO_VM_PAGE(pbits);
 +					m = PHYS_TO_VM_PAGE(*pte);
 
 Jimmy Myrick
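
One plausible reading of the one-line change quoted above, offered as an assumption rather than something stated in this PR: under PAE a page table entry is 64 bits wide, so if pbits holds only the low 32 bits of the entry, any physical address at or above 4GB loses its high bits before PHYS_TO_VM_PAGE() sees it, whereas *pte carries the full address. That would fit a machine whose BIOS remaps part of its RAM to just above 4096MB. The sketch below is user-space C, not kernel code; pte64_t, FRAME_MASK, the helper functions, and the example entry are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pte64_t;                        /* hypothetical 64-bit PAE-style entry */
#define FRAME_MASK UINT64_C(0x000ffffffffff000)  /* assumed: address bits 12..51 */

/* Physical address taken from the full 64-bit entry (analogous to *pte). */
uint64_t frame_from_full_entry(pte64_t entry)
{
    return entry & FRAME_MASK;
}

/*
 * Physical address taken from a 32-bit copy of the entry (analogous to
 * a 32-bit pbits): bits 32 and above are lost before masking.
 */
uint64_t frame_from_low_word(pte64_t entry)
{
    uint32_t low = (uint32_t)entry;
    return (uint64_t)low & FRAME_MASK;
}

/* Self-check with a made-up entry whose frame sits just above 4GB. */
void truncation_self_check(void)
{
    pte64_t entry = UINT64_C(0x0000000100005063);
    assert(frame_from_full_entry(entry) == UINT64_C(0x100005000));
    assert(frame_from_low_word(entry) == UINT64_C(0x5000));
}
```

For entries whose frame lies below 4GB the two helpers agree, which would be consistent with the panics appearing only once memory in the remapped region comes into use.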
 

From: Michael DeMan <michael@staff.openaccess.org>
To: bug-followup@FreeBSD.org,
 dpk@dpk.net
Cc:  
Subject: Re: i386/84563: Panics occur when PAE enabled and >3.5GB memory used
Date: Wed, 17 May 2006 13:59:38 -0700

 Hi,
 
 Others and I who have followed the 'disable PAE' workaround seem to
 have machines that are more stable, but that still crash under heavy
 load.  I have identical machines with e7501 motherboards, one with
 2GB of RAM (I pulled the extra out) and another with 4GB; the one
 with 4GB crashes every couple of months under heavy workloads.
 
 I have PAE disabled, USB devices out of the kernel and usbd disabled.
 
 
 I just cvsup'd to RELENG_5_4 today and the patch as described at...
 
 http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/i386/i386/pmap.c.diff?r1=1.523&r2=1.524&f=h
 
 ...still does not seem to be in the source tree?
 
 This bug causes 5.4 systems with 4GB+ of RAM to be unstable and is a  
 critical issue.
 
 It seems that this patch works and just needs to be applied to 5.4?
 
 http://groups.google.com/group/lucky.freebsd.i386/browse_thread/thread/a44a3dd9a2556725/ce27b27e2a9dc103%23ce27b27e2a9dc103
 
 http://groups.google.com/group/lucky.freebsd.stable/browse_thread/thread/99978f6cbf071223/136ab31fcd339d5c?lnk=st&q=freebsd+4GB+PAE+thread&rnum=5&hl=en
 
 Thanks,
 
 - mike
 
 Michael F. DeMan
 Director of Technology
 OpenAccess Network Services
 Bellingham, WA 98225
 michael@staff.openaccess.org
 360-647-0785
 
State-Changed-From-To: open->feedback 
State-Changed-By: linimon 
State-Changed-When: Sun Jul 8 06:30:02 UTC 2007 
State-Changed-Why:  
To michael@staff.openaccess.org: the patch was indeed applied to RELENG_5 
but not RELENG_5_4, which IIUC is exactly the right resolution.  Do either 
you or the submitter still see this same problem on RELENG_5 itself? 


Responsible-Changed-From-To: freebsd-i386->linimon 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Sun Jul 8 06:30:02 UTC 2007 
Responsible-Changed-Why:  

http://www.freebsd.org/cgi/query-pr.cgi?pr=84563 
State-Changed-From-To: feedback->closed 
State-Changed-By: linimon 
State-Changed-When: Sun Jul 8 16:06:35 UTC 2007 
State-Changed-Why:  
Based on discussion with gavin, close this as we now believe it got 
MFCed to 5.X. 

>Unformatted:
