From dillon@flea.best.net  Fri Oct 24 19:06:16 1997
Received: from flea.best.net (root@flea.best.net [206.184.139.131])
          by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id TAA05344
          for <FreeBSD-gnats-submit@freebsd.org>; Fri, 24 Oct 1997 19:06:14 -0700 (PDT)
          (envelope-from dillon@flea.best.net)
Received: (from dillon@localhost) by flea.best.net (8.8.7/8.7.3) id TAA06359; Fri, 24 Oct 1997 19:05:18 -0700 (PDT)
Message-Id: <199710250205.TAA06359@flea.best.net>
Date: Fri, 24 Oct 1997 19:05:18 -0700 (PDT)
From: Matt Dillon <dillon@best.net>
Reply-To: dillon@best.net
To: FreeBSD-gnats-submit@freebsd.org
Subject: VM lookup, endless loop in vm_map_lookup_entry()
X-Send-Pr-Version: 3.2

>Number:         4844
>Category:       kern
>Synopsis:       VM lookup, endless loop in vm_map_lookup_entry()
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Oct 24 19:10:00 PDT 1997
>Closed-Date:    Sun Apr 26 14:01:10 PDT 1998
>Last-Modified:  Sun Apr 26 14:01:25 PDT 1998
>Originator:     Matt Dillon
>Release:        FreeBSD 2.2-STABLE i386
>Organization:
BEST Internet Communications
>Environment:

	FreeBSD 2.2.5, PPro 200, 2940UW, DEC ether, all pci.

>Description:

	Within a day of upgrading to 2.2.5, we have had to lockups.
	We were able to escape into the debugger and panic the machines.
	Both machines locked up in vm_map_lookup_entry() while trying
	to brelse/bfreekva/vm_map_delete (see below).  One locked up 
	from a SCSI interrupt doing the above sequence, the other locked
	up from system call.

	The kernel is locking up in vm_map_delete, but I believe it is 
	ACTUALLY locking up in the vm_map_lookup_entry routine that 
	vm_map_delete() is calling and that the Xfastintr4 is losing the
	vm_map_lookup_entry context.

	vm_map_lookup_entry is locking up by getting into an endless loop.
	I have traced and demonstrated the loop below.  The cores are 
	available if you need them.

	We have NEVER had this sort of crash before.  We had two the first day
	operating under 2.2.5 from 2.2.2.  I'm hoping it is something obvious.

	I believe the buffer cache bp's linked lists are getting corrupted.

					-Matt

    -------------------------------------- CRASH #1 ---------------------------

    cur   $4 = (struct vm_map_entry *) 0xf2ecdc00
    last  $8 = (struct vm_map_entry *) 0xf1937f20

starting at cur, linking through cur->next

......  somewhere int the middle, it loops without hitting last

$49 = (struct vm_map_entry *) 0xf2ffeec0
(kgdb) print $->next
$50 = (struct vm_map_entry *) 0xf2ff5b40
(kgdb) print $->next
$51 = (struct vm_map_entry *) 0xf2fdab40
(kgdb) print $->next
$52 = (struct vm_map_entry *) 0xf2584840
(kgdb) print $->next
$53 = (struct vm_map_entry *) 0xf2f36040
(kgdb) print $->next
$54 = (struct vm_map_entry *) 0xf27baa20
(kgdb) print $->next
$55 = (struct vm_map_entry *) 0xf2ecc5c0
(kgdb) print $->next
$56 = (struct vm_map_entry *) 0xf247b000
(kgdb) print $->next
$57 = (struct vm_map_entry *) 0xf2f58040
(kgdb) print $->next
$58 = (struct vm_map_entry *) 0xf2f52c40
(kgdb) print $->next
$59 = (struct vm_map_entry *) 0xf2fe6440
(kgdb) print $->next
$60 = (struct vm_map_entry *) 0xf2affdc0
(kgdb) print $->next
$61 = (struct vm_map_entry *) 0xf3022ec0
(kgdb) print $->next
$62 = (struct vm_map_entry *) 0xf2f51240
(kgdb) print $->next
$63 = (struct vm_map_entry *) 0xf2f50940
(kgdb) print $->next
$64 = (struct vm_map_entry *) 0xf2685200
(kgdb) print $->next
$65 = (struct vm_map_entry *) 0xf2cd4640
(kgdb) print $->next
$66 = (struct vm_map_entry *) 0xf2f4a400
(kgdb) print $->next
$67 = (struct vm_map_entry *) 0xf2ffeec0		LOOPS

#12 0xf01014e2 in db_command_loop () at ../../ddb/db_command.c:462
#13 0xf0103c28 in db_trap (type=0x3, code=0x0) at ../../ddb/db_trap.c:73
#14 0xf01b4d4b in kdb_trap (type=0x3, code=0x0, regs=0xefbffc14)
    at ../../i386/i386/db_interface.c:126
#15 0xf01be4f4 in trap (frame={tf_es = 0xf0100010, tf_ds = 0xf4430010, 
      tf_edi = 0x10000, tf_esi = 0xf2347800, tf_ebp = 0xefbffc70, 
      tf_isp = 0xefbffc3c, tf_ebx = 0xf5bc3000, tf_edx = 0x3f9, 
      tf_ecx = 0x3f9, tf_eax = 0x0, tf_trapno = 0x3, tf_err = 0x0, 
      tf_eip = 0xf01c99ad, tf_cs = 0x8, tf_eflags = 0x46, tf_esp = 0xf5bc3000, 
      tf_ss = 0xefbffce4}) at ../../i386/i386/trap.c:403
#16 0xf01c99ad in siointr1 (com=0xf2347800) at ../../i386/isa/sio.c:1401
#17 0xf01c9952 in siointr (unit=0x0) at ../../i386/isa/sio.c:1338
#18 0xf01b61e7 in Xfastintr4 ()
#19 0xf01a80b5 in vm_map_delete (map=0xf1937f00, start=0xf5bc3000, 
    end=0xf5bc5000) at ../../vm/vm_map.c:1878
#20 0xf012ca4e in bfreekva (bp=0xf4434390) at ../../kern/vfs_bio.c:194
#21 0xf012d350 in brelse (bp=0xf4434390) at ../../kern/vfs_bio.c:603
#22 0xf012dc38 in getnewbuf (slpflag=0x0, slptimeo=0x0, size=0x2000, 
    maxsize=0x2000) at ../../kern/vfs_bio.c:1000
#23 0xf012dfd5 in getblk (vp=0xf27dc380, blkno=0x177, size=0x2000, 
    slpflag=0x0, slptimeo=0x0) at ../../kern/vfs_bio.c:1197
#24 0xf012ff1a in cluster_rbuild (vp=0xf27dc380, filesize=0x9d6048, lbn=0x177, 
    blkno=0x42bc50, size=0x2000, run=0x8) at ../../kern/vfs_cluster.c:296
#25 0xf012fd2c in cluster_read (vp=0xf27dc380, filesize=0x9d6048, 
    lblkno=0x160, size=0x2000, cred=0x0, bpp=0xefbffec0)
    at ../../kern/vfs_cluster.c:217
#26 0xf019a6e1 in ffs_read (ap=0xefbffee8) at ../../ufs/ufs/ufs_readwrite.c:124
#27 0xf0136df1 in vn_read (fp=0xf2f9c940, uio=0xefbfff34, cred=0xf2fa8180)
    at vnode_if.h:255
#28 0xf011a1b7 in read (p=0xf2fb3c00, uap=0xefbfff94, retval=0xefbfff84)
    at ../../kern/sys_generic.c:116
#29 0xf01bef37 in syscall (frame={tf_es = 0xefbf0027, tf_ds = 0xefbf0027, 
      tf_edi = 0x8000, tf_esi = 0x52e000, tf_ebp = 0xefbfc1f4, 
      tf_isp = 0xefbfffe4, tf_ebx = 0x5a7f26, tf_edx = 0x131100, tf_ecx = 0x0, 
      tf_eax = 0x3, tf_trapno = 0x7, tf_err = 0x7, tf_eip = 0x201bafd1, 
      tf_cs = 0x1f, tf_eflags = 0x206, tf_esp = 0xefbfbd10, tf_ss = 0x27})
    at ../../i386/i386/trap.c:890
#30 0x201bafd1 in ?? ()


    -------------------------------------- CRASH #2 ---------------------------
			(same problem showing the loop.  This time, it
			looped from an interrupt)

#14 0xf01b4d4b in kdb_trap (type=0x3, code=0x0, regs=0xf01dbdd0)
    at ../../i386/i386/db_interface.c:126
#15 0xf01be4f4 in trap (frame={tf_es = 0x10, tf_ds = 0xc0030010, 
      tf_edi = 0x10000, tf_esi = 0xf2187800, tf_ebp = 0xf01dbe2c, 
      tf_isp = 0xf01dbdf8, tf_ebx = 0xf6007000, tf_edx = 0x3f9, 
      tf_ecx = 0x3f9, tf_eax = 0x0, tf_trapno = 0x3, tf_err = 0x0, 
      tf_eip = 0xf01c99ad, tf_cs = 0x8, tf_eflags = 0x46, tf_esp = 0xf6007000, 
      tf_ss = 0xf01dbea0}) at ../../i386/i386/trap.c:403
#16 0xf01c99ad in siointr1 (com=0xf2187800) at ../../i386/isa/sio.c:1401
#17 0xf01c9952 in siointr (unit=0x0) at ../../i386/isa/sio.c:1338
#18 0xf01b61e7 in Xfastintr4 ()
#19 0xf01a80b5 in vm_map_delete (map=0xf1777f00, start=0xf6007000, 
    end=0xf6009000) at ../../vm/vm_map.c:1878
#20 0xf012ca4e in bfreekva (bp=0xf4214670) at ../../kern/vfs_bio.c:194
#21 0xf012d350 in brelse (bp=0xf4214670) at ../../kern/vfs_bio.c:603
#22 0xf012ec7c in biodone (bp=0xf4214670) at ../../kern/vfs_bio.c:1685
#23 0xf018e56c in scsi_done (xs=0xf1777a80) at ../../scsi/scsi_base.c:449
#24 0xf01d5471 in ahc_done (ahc=0xf217f800, scb=0xf217eee0)
    at ../../i386/scsi/aic7xxx.c:1944
#25 0xf01d2ef7 in ahc_intr (arg=0xf217f800) at ../../i386/scsi/aic7xxx.c:814
(kgdb) 

>How-To-Repeat:

	Don't know.  These are medium-loaded systems.

>Fix:
	
	Don't know.

>Release-Note:
>Audit-Trail:

From: "Jordan K. Hubbard" <jkh@time.cdrom.com>
To: dillon@best.net
Cc: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: kern/4844: VM lookup, endless loop in vm_map_lookup_entry() 
Date: Fri, 24 Oct 1997 19:56:04 -0700

 > 	Within a day of upgrading to 2.2.5, we have had to lockups.
 > 	We were able to escape into the debugger and panic the machines.
 
 Bizarre - this never showed up in any of your testing of the
 2.2.5-BETAs?  That would seem to indicate that the bug entered at the
 very last minute.
 
 					Jordan

From: Matt Dillon <dillon@best.net>
To: "Jordan K. Hubbard" <jkh@time.cdrom.com>
Cc: dima@best.net, FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: kern/4844: VM lookup, endless loop in vm_map_lookup_entry() 
Date: Fri, 24 Oct 1997 22:22:29 -0700 (PDT)

 :
 :> 	Within a day of upgrading to 2.2.5, we have had to lockups.
 :> 	We were able to escape into the debugger and panic the machines.
 :
 :Bizarre - this never showed up in any of your testing of the
 :2.2.5-BETAs?  That would seem to indicate that the bug entered at the
 :very last minute.
 :
 :					Jordan
 
     Correct.  However, I do not know how uptodate our kernel was prior
     to the upgrade.  It's reasonably recent.  Dima should know.  No other
     machines have crashed so far, just those two (the two crashes in the
     bug report are from two different machines), but I think it may only
     be a matter of time.
 
     Apart from that, and a tcp crash which I'm not going to worry about
     until the VM issue is resolved, 2.2.5 has been running nicely.
 
 						-Matt
 
     Matthew Dillon   Engineering, BEST Internet Communications, Inc.
 		    <dillon@best.net>
     [always include a portion of the original email in any response!]
 

From: Bruce Evans <bde@zeta.org.au>
To: dillon@best.net, FreeBSD-gnats-submit@FreeBSD.ORG
Cc:  Subject: Re: kern/4844: VM lookup, endless loop in vm_map_lookup_entry()
Date: Sat, 25 Oct 1997 16:40:37 +1000

 >	Both machines locked up in vm_map_lookup_entry() while trying
 >	to brelse/bfreekva/vm_map_delete (see below).  One locked up 
 >	from a SCSI interrupt doing the above sequence, the other locked
 >	up from system call.
 
 brelse/bfreekva have some problems.  See PR4630, especially the followups.
 
 Bruce

From: dima@best.net (Dima Ruban)
To: dillon@best.net (Matt Dillon)
Cc: jkh@time.cdrom.com, dima@best.net, FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: kern/4844: VM lookup, endless loop in vm_map_lookup_entry()
Date: Sat, 25 Oct 1997 00:17:25 -0700 (PDT)

 Matt Dillon writes:
 > :
 > :> 	Within a day of upgrading to 2.2.5, we have had to lockups.
 > :> 	We were able to escape into the debugger and panic the machines.
 > :
 > :Bizarre - this never showed up in any of your testing of the
 > :2.2.5-BETAs?  That would seem to indicate that the bug entered at the
 > :very last minute.
 > :
 > :					Jordan
 
 This bug was introduced somewhere in between September 19 and today.
 
 >     Correct.  However, I do not know how uptodate our kernel was prior
 >     to the upgrade.  It's reasonably recent.  Dima should know.  No other
 >     machines have crashed so far, just those two (the two crashes in the
 >     bug report are from two different machines), but I think it may only
 >     be a matter of time.
 > 
 >     Apart from that, and a tcp crash which I'm not going to worry about
 >     until the VM issue is resolved, 2.2.5 has been running nicely.
 > 
 > 						-Matt
 > 
 >     Matthew Dillon   Engineering, BEST Internet Communications, Inc.
 > 		    <dillon@best.net>
 >     [always include a portion of the original email in any response!]
 > 
 
 -- dima

From: Matt Dillon <dillon@best.net>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: kern/4844: VM lookup, endless loop in vm_map_lookup_entry()
Date: Sat, 25 Oct 1997 01:15:15 -0700 (PDT)

 :>	up from system call.
 :
 :brelse/bfreekva have some problems.  See PR4630, especially the followups.
 :
 :Bruce
 
     Hmm (looking at PR4630).  This looks like a rather serious problem
     considering the core nature of brelse().  This may be responsible
     for several other crashes we have had involving "biodone: buffer already
     done" panics.  We've had four or five of those.
 
     It sounds to me that a slight modification to the PR4630 suggestion
     would work.  Rather then call bfreekva(), brelse() puts the bp on a
     defered free list, yes, but why not clear out this list from 
     getnewbuf() ?    I don't particularly see the need for a high priority
     kernel process or other complexity.
 
     If getnewbuf() (called by getblk()) is not called from an interrupt,
     we are home free.  I don't think anyone else allocates out of the
     buffer_map so the defered frees would not create a secondary effect
     anywhere else.
 
 					-Matt
 
     Matthew Dillon   Engineering, BEST Internet Communications, Inc.
 		    <dillon@best.net>
     [always include a portion of the original email in any response!]

From: Matt Dillon <dillon@best.net>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: kern/4844: VM lookup, endless loop in vm_map_lookup_entry()
Date: Sat, 25 Oct 1997 01:42:18 -0700 (PDT)

 :>	Both machines locked up in vm_map_lookup_entry() while trying
 :>	to brelse/bfreekva/vm_map_delete (see below).  One locked up 
 :>	from a SCSI interrupt doing the above sequence, the other locked
 :>	up from system call.
 :
 :brelse/bfreekva have some problems.  See PR4630, especially the followups.
 :
 :Bruce
 
     Assuming getnewbuf() isn't called from an interrupt, I believe I have
     a working solution.  I'm going to test it a bit more.  If it works
     out, i'll send you the diffs.  It's really simple, I add another 
     bufqueues[] element called QUEUE_DEFERED.  brelse() puts the bp on
     QUEUE_DEFERED rather then QUEUE_EMPTY and does not call bfreekva().
     getnewbuf() then moves everything from QUEUE_DEFERED to QUEUE_EMPTY
     and calls bfreekva() on it.
 
     The code mods are minor, and even pretty clean.  I have a question,
     though, brelse() makes a call to allocbuf(bp, 0) under certain
     conditions.  Do I have to worry about this too?
 
 						-Matt
 
     Matthew Dillon   Engineering, BEST Internet Communications, Inc.
 		    <dillon@best.net>
     [always include a portion of the original email in any response!]
State-Changed-From-To: open->closed 
State-Changed-By: phk 
State-Changed-When: Sun Apr 26 14:01:10 PDT 1998 
State-Changed-Why:  
fixed in -current 
>Unformatted:
