From nuucp@dg-rtp.dg.com  Tue Apr 15 21:50:43 1997
Received: from dg-rtp.dg.com (dg-rtp.rtp.dg.com [128.222.1.2])
          by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id VAA08710
          for <FreeBSD-gnats-submit@freebsd.org>; Tue, 15 Apr 1997 21:50:42 -0700 (PDT)
Received: by dg-rtp.dg.com (5.4R3.10/dg-rtp-v02)
	id AA23183; Wed, 16 Apr 1997 00:50:05 -0400
Received: from ponds by dg-rtp.dg.com.rtp.dg.com; Wed, 16 Apr 1997 00:50 EDT
Received: from lakes.water.net (lakes [10.0.0.3]) by ponds.water.net (8.8.3/8.7.3) with ESMTP id WAA16805 for <FreeBSD-gnats-submit@freebsd.org>; Tue, 15 Apr 1997 22:03:09 -0400 (EDT)
Received: (from rivers@localhost) by lakes.water.net (8.8.3/8.6.9) id WAA01541; Tue, 15 Apr 1997 22:09:41 -0400 (EDT)
Message-Id: <199704160209.WAA01541@lakes.water.net>
Date: Tue, 15 Apr 1997 22:09:41 -0400 (EDT)
From: Thomas David Rivers <ponds!rivers@dg-rtp.dg.com>
Reply-To: ponds!rivers@dg-rtp.dg.com
To: ponds!freebsd.org!FreeBSD-gnats-submit
Subject: NFS V2 readdir hangs
X-Send-Pr-Version: 3.2

>Number:         3304
>Category:       kern
>Synopsis:       NFS V2 readdir hangs
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Apr 15 22:00:01 PDT 1997
>Closed-Date:    Tue Apr 22 10:51:32 PDT 1997
>Last-Modified:  Tue Apr 22 10:52:38 PDT 1997
>Originator:     Thomas David Rivers
>Release:        FreeBSD 2.2.1-STABLE i386
>Organization:
SAS Institute
>Environment:

  FreeBSD 2.2.1 - downloaded from ftp.freebsd.org on April 8th.

>Description:

  NFS readdir can hang when accessing V2 servers.  Further access to
 the mounted file system is blocked...  It appears that an nfs_send()
 is issued to continue reading the directory, followed by an
 nfs_receive() which eventually (via soreceive) blocks in sbwait(),
 waiting on a packet that never arrives.

  It's not at all clear to me (personally) if this is a problem with
 the NFS protocol (i.e. we've sent, and are waiting on a response from,
 an invalid NFS request) - or some underlying problem with soreceive().

>How-To-Repeat:

  Mount a V2 NFS server (I've tried both SunOS 4.1.3 and HP-UX 9.05),
 go to a rather large directory and do "ls -l".  The ls -l will hang
 in sbwait().  This apparently also needs a rather slow network
 for a reliable reproduction - that is, it's somewhat timing dependent.


>Fix:

  Unknown at this point.
	
>Release-Note:
>Audit-Trail:

From: "Gary Palmer" <gpalmer@freebsd.org>
To: ponds!rivers@dg-rtp.dg.com
Cc: FreeBSD-gnats-submit@freebsd.org
Subject: Re: kern/3304: NFS V2 readdir hangs 
Date: Thu, 17 Apr 1997 03:49:10 -0400

 Thomas David Rivers wrote in message ID
 <199704160209.WAA01541@lakes.water.net>:
 >   Mount a V2 NFS server (I've tried both Sunos 4.1.3 and HP/UX 9.05),
 >  go to a rather large directory and do "ls -l".  The ls -l will hang
 >  in sbwait().  This apparently also needs a rather slow network
 >  for a reliable reproduction - that is, it's somewhat timing dependent.
 
 I recently did something similar (ls -l on a 16,000 file directory)
 across NFS on a recent RELENG_2_2 box which was mounting /var/mail
 from a 2.1.x based mail server. Worked fine. This was probably 2 or 3
 weeks ago... I'll try again if you like.
 
 Gary
 --
 Gary Palmer                                          FreeBSD Core Team Member
 FreeBSD: Turning PC's into workstations. See http://www.FreeBSD.ORG/ for info

From: Thomas David Rivers <ponds!rivers@dg-rtp.dg.com>
To: ponds!freebsd.org!gpalmer, ponds!lakes.water.net!rivers
Cc: ponds!freebsd.org!FreeBSD-gnats-submit
Subject: Re: kern/3304: NFS V2 readdir hangs
Date: Thu, 17 Apr 1997 07:23:31 -0400 (EDT)

 > 
 > Thomas David Rivers wrote in message ID
 > <199704160209.WAA01541@lakes.water.net>:
 > >   Mount a V2 NFS server (I've tried both Sunos 4.1.3 and HP/UX 9.05),
 > >  go to a rather large directory and do "ls -l".  The ls -l will hang
 > >  in sbwait().  This apparently also needs a rather slow network
 > >  for a reliable reproduction - that is, it's somewhat timing dependent.
 > 
 > I recently did something similar (ls -l on a 16,000 file directory)
 > across NFS on a recent RELENG_2_2 box which was mounting /var/mail
 > from a 2.1.x based mail server. Worked fine. This was probably 2 or 3
 > weeks ago... I'll try again if you like.
 
 
  16,000 files is more than enough :-)
 
  I've also witnessed it "work"; although never from my particular
 box that reliably reproduces it; and not always from the box that
 sometimes "works."
 
  I believe the problem is "tickled" by some timing issue.  For example,
 maybe one of the 6 possible UDP packets is out of order and that throws
 everything for a loop.  This could be explained by network issues between
 the server and client.
 
  However, I now believe that we do an nfs_receive(); the packet isn't
 yet there so we go into sbwait() to be awakened by an sorwakeup() (sowakeup)
 in udp_input().  
 
  Now; I may even have some evidence that udp_input() is doing the right
 wakeup(); but we don't get woken up.... but I had to leave work early
 yesterday and didn't get that finished.
 
  A possible idea for those people that don't see this problem: we could,
 via software, corrupt or drop UDP packets and see if NFS recovers properly.
 That could reproduce the problem I'm seeing that people in more robust
 networks don't see.
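 
  As a rough illustration of that idea, the sketch below shows a
 deterministic drop/corrupt policy that a test harness (for example a
 user-space UDP relay placed between client and server) could apply to
 each datagram.  Everything here is hypothetical: fault_policy,
 fault_apply() and the "every Nth datagram" rule are names and choices
 made up for this example, not anything in the FreeBSD kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fault-injection policy; not part of any real kernel API. */
enum fault { FAULT_NONE, FAULT_DROP, FAULT_CORRUPT };

struct fault_policy {
    unsigned      drop_every;     /* drop every Nth datagram; 0 disables */
    unsigned      corrupt_every;  /* corrupt every Nth datagram; 0 disables */
    unsigned long seen;           /* datagrams inspected so far */
};

/*
 * Decide the fate of the next datagram.  A corrupted datagram has its
 * first byte flipped in place; a dropped one is simply not forwarded
 * by the caller.  Dropping takes priority when both rules match.
 */
static enum fault
fault_apply(struct fault_policy *p, uint8_t *buf, size_t len)
{
    assert(p != NULL && (buf != NULL || len == 0));

    p->seen++;
    if (p->drop_every != 0 && p->seen % p->drop_every == 0)
        return FAULT_DROP;
    if (p->corrupt_every != 0 && len > 0 &&
        p->seen % p->corrupt_every == 0) {
        buf[0] ^= 0xff;
        return FAULT_CORRUPT;
    }
    return FAULT_NONE;
}
```

  A deterministic rule (rather than a random one) is used here only so
 that a failing run can be repeated exactly.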
 
  What do you think?
 
 	- Dave Rivers -
 
 > 
 > Gary

From: Thomas David Rivers <ponds!rivers@dg-rtp.dg.com>
To: ponds!freebsd.org!gpalmer, ponds!lakes.water.net!rivers
Cc: ponds!freebsd.org!FreeBSD-gnats-submit
Subject: Re: kern/3304: NFS V2 readdir hangs
Date: Thu, 17 Apr 1997 12:03:42 -0400 (EDT)

 > 
 > Thomas David Rivers wrote in message ID
 > <199704160209.WAA01541@lakes.water.net>:
 > >   Mount a V2 NFS server (I've tried both Sunos 4.1.3 and HP/UX 9.05),
 > >  go to a rather large directory and do "ls -l".  The ls -l will hang
 > >  in sbwait().  This apparently also needs a rather slow network
 > >  for a reliable reproduction - that is, it's somewhat timing dependent.
 > 
 > I recently did something similar (ls -l on a 16,000 file directory)
 > across NFS on a recent RELENG_2_2 box which was mounting /var/mail
 > from a 2.1.x based mail server. Worked fine. This was probably 2 or 3
 > weeks ago... I'll try again if you like.
 
  Some more information I've just deduced.
 
  It appears that nfs_receive() calls soreceive() which calls sbwait()
 waiting on a UDP packet to be received.  That's fine.
 
  Then, another nfs_request() is issued; calling nfs_receive() which
 winds down to sbwait() as well.
 
  The key here is that both of these are waiting on the *same* address.
 
  Then, the udp packet from the first call is received; we wake up the
 *second* caller and get everything out-of-sync.  With udp packets
 out-of-sync; we eventually get an error situation in nfs_reply()
 and indicate a "stale" server condition.
 
  I haven't yet determined how this is happening; but printf()s in 
 my kernel indicate that's what's going on.
 
  That also explains how timing is important for this problem...
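 
  The situation described above can be modelled in a few lines of
 user-space C.  This is only a sketch of the reasoning, assuming that a
 wakeup on an address makes every sleeper on that address runnable; the
 struct sleeper, wakeup_model() and reply_matches() names are invented
 for the example and do not correspond to kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a caller blocked in sbwait(). */
struct sleeper {
    const void *chan;      /* address the caller is sleeping on */
    uint32_t    want_xid;  /* id of the reply this caller expects */
    bool        awake;
};

/* Model of a wakeup on chan: every sleeper on that address wakes up. */
static size_t
wakeup_model(struct sleeper *s, size_t n, const void *chan)
{
    size_t woken = 0;

    assert(chan != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!s[i].awake && s[i].chan == chan) {
            s[i].awake = true;
            woken++;
        }
    }
    return woken;
}

/* True if the caller that takes the datagram was the one waiting for it. */
static bool
reply_matches(const struct sleeper *taker, uint32_t reply_xid)
{
    return taker->want_xid == reply_xid;
}
```

  With two sleepers on the same address, one arriving datagram wakes
 both; if the second caller runs first it takes a reply meant for the
 first, which is where the timing dependence comes from.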
 
 	- Dave Rivers -
 

From: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
To: Thomas David Rivers <ponds!rivers@dg-rtp.dg.com>
Cc: freebsd-gnats-submit@freefall.freebsd.org
Subject: Re: kern/3304: NFS V2 readdir hangs
Date: Thu, 17 Apr 1997 15:04:57 -0400 (EDT)

 <<On Thu, 17 Apr 1997 09:30:02 -0700 (PDT), Thomas David Rivers <ponds!rivers@dg-rtp.dg.com> said:
 
 >   It appears that nfs_receive() calls soreceive() which calls sbwait()
 >  waiting on a UDP packet to be received..  That's fine.
  
 >   Then, another nfs_request() is issued; calling nfs_receive() which
 >  winds down to sbwait() as well.
  
 >   Then, the udp packet from the first call is received; we wake up the
 >  *second* caller and get everything out-of-sync.
 
 This is perfectly reasonable behavior for soreceive().  NFS is clearly
 broken here.  NFS needs its own response-demultiplexing layer, it
 seems.
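 
 A response-demultiplexing layer of the kind suggested here could, in
 outline, match each reply to its request by the RPC transaction id
 (XID) instead of by which caller happened to read the socket.  The
 sketch below is a user-space model under that assumption; struct
 pending and demux_reply() are hypothetical names, not the kernel's
 struct nfsreq or nfs_reply().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an outstanding RPC request. */
struct pending {
    uint32_t    xid;     /* transaction id sent with the request */
    const char *reply;   /* set once the matching reply arrives */
};

/*
 * Hand a received reply to the request whose xid matches, no matter
 * which caller read it off the socket.  Returns the matched request,
 * or NULL for a stray or duplicate reply, which is discarded.
 */
static struct pending *
demux_reply(struct pending *reqs, size_t nreqs, uint32_t xid,
    const char *reply)
{
    assert(reqs != NULL || nreqs == 0);

    for (size_t i = 0; i < nreqs; i++) {
        if (reqs[i].xid == xid && reqs[i].reply == NULL) {
            reqs[i].reply = reply;
            return &reqs[i];
        }
    }
    return NULL;
}
```

 In this model a reply for request A that is read by the caller of
 request B is still filed under A, so B keeps waiting for its own
 reply rather than consuming the wrong one.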
 
 -GAWollman
 
 --
 Garrett A. Wollman   | O Siem / We are all family / O Siem / We're all the same
 wollman@lcs.mit.edu  | O Siem / The fires of freedom 
 Opinions not those of| Dance in the burning flame
 MIT, LCS, CRS, or NSA|                     - Susan Aglukark and Chad Irschick

From: Thomas David Rivers <ponds!rivers@dg-rtp.dg.com>
To: ponds!lakes.water.net!rivers, ponds!khavrinen.lcs.mit.edu!wollman
Cc: ponds!freefall.freebsd.org!freebsd-gnats-submit
Subject: Re: kern/3304: NFS V2 readdir hangs
Date: Thu, 17 Apr 1997 21:12:42 -0400 (EDT)

 > Garrett writes:
 > 
 > <<On Thu, 17 Apr 1997 09:30:02 -0700 (PDT), Thomas David Rivers <ponds!rivers@dg-rtp.dg.com> said:
 > 
 > >   It appears that nfs_receive() calls soreceive() which calls sbwait()
 > >  waiting on a UDP packet to be received..  That's fine.
 >  
 > >   Then, another nfs_request() is issued; calling nfs_receive() which
 > >  winds down to sbwait() as well.
 >  
 > >   Then, the udp packet from the first call is received; we wake up the
 > >  *second* caller and get everything out-of-sync.
 > 
 > This is perfectly reasonable behavior for soreceive().  NFS is clearly
 > broken here.  NFS needs its own response-demultiplexing layer, it
 > seems.
 > 
 > -GAWollman
 > 
 
  Well - yes; 
 
   except
 
  in this instance there is a lock protecting the multiple soreceive()s,
 nfs_rcvlock(). 
 
  I think the question is why didn't that protection work.  [In fact,
 there's a comment in nfs_socket.c that indicates the rcvlock is used
 to protect from just this situation.]
 
  Apparently, since 2.1.5 runs on the same machine just fine; this
 used to work and has now been compromised....  [that's just a guess.]
 
 	- Dave Rivers -

From: Thomas David Rivers <ponds!rivers@dg-rtp.dg.com>
To: ponds!lakes.water.net!rivers, ponds!khavrinen.lcs.mit.edu!wollman
Cc: ponds!freefall.freebsd.org!freebsd-gnats-submit
Subject: Re: kern/3304: NFS V2 readdir hangs
Date: Fri, 18 Apr 1997 11:49:35 -0400 (EDT)

 More information...
 
 
 Here's the scenario I've now determined (via more printf()s in the
 kernel):
 
    1) nfs_request() is called from readdirrpc().
 
    2) nfs_request malloc's a nfsreq block, which is used
       by rcvlock()... the lock is granted; we go down to
       soreceive() and wind up tsleeping in sbwait().
 
    3) At this point, a vnode lookup() operation is called.
       The lookup() isn't satisfied from the cache; so 
       we call nfs_request() to get the information.
 
    4) This nfs_request() malloc's a different nfsreq block.
       The "lock" is granted since rcvlock() works on addresses
       from the nfsreq block, and these are different addresses.
       We wind down to soreceive() again.
 
    5) udp_input() is called because a UDP packet arrived...
       this is, presumably, the packet we're expecting from 2).
       *however* the last request we received was from 4).
       That is the nfsreq this packet winds up being associated
       with; but - it is totally wrong.  
 
  So; we're left with the lookup() failing with an ENOENT (errno 2),
 and the nfs_request() from step 2) hanging; never being woken up.
 
   I think that pretty well describes my findings.
 
   Perhaps the rcvlock() needs to change to lock on something other
 than the nfsreq block... does anyone have any suggestions?
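 
   To make the difference concrete, here is a small user-space model
 of the two lock granularities: one flag per request (the behaviour
 described in step 4) versus one flag shared by every request on the
 same mount.  This is a sketch of the suggestion only.  The types and
 functions (mount_model, req_model, lock_per_request(),
 lock_per_mount()) are hypothetical, and nothing here claims to show
 what nfs_rcvlock() actually does or how the problem was eventually
 fixed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-mount state: one instance per mounted file system. */
struct mount_model {
    bool rcv_locked;
};

/* Hypothetical per-request state: one instance per outstanding RPC. */
struct req_model {
    struct mount_model *mnt;
    bool                rcv_locked;
};

/* Lock keyed on the request: two different requests never contend. */
static bool
lock_per_request(struct req_model *r)
{
    assert(r != NULL);
    if (r->rcv_locked)
        return false;
    r->rcv_locked = true;
    return true;
}

/* Lock keyed on the mount: requests sharing a mount are serialized. */
static bool
lock_per_mount(struct req_model *r)
{
    assert(r != NULL && r->mnt != NULL);
    if (r->mnt->rcv_locked)
        return false;
    r->mnt->rcv_locked = true;
    return true;
}
```

   In this model the readdir request and the lookup request both get a
 per-request lock at once, while with the per-mount lock the second
 caller is refused until the first has released it.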
 
   	- Dave Rivers -

From: Thomas David Rivers <ponds!rivers@dg-rtp.dg.com>
To: ponds!lakes.water.net!rivers, ponds!sat.t.u-tokyo.ac.jp!simokawa
Cc: ponds!freefall.freebsd.org!freebsd-gnats-submit
Subject: Re: kern/3304: NFS V2 readdir hangs
Date: Fri, 18 Apr 1997 13:04:18 -0400 (EDT)

 > 
 > Hi,
 > 
 > I'm not sure whether this will help you, but could you try the following
 > patch? the latest change to nfs_socket.c was originally made by me.
 > I have been worried about a rare case.
 
  It would be a rare case I believe....
 
  But - I did try your patch; it did not affect the problem.  That is,
 the hang still occurs.  If you've had a chance to read it; I've posted
 my current understanding of the problem to freebsd-hackers.  I believe
 it is caused by a nfs_lookup() call that calls nfs_request() before
 the nfs_request() (which originated in readdirrpc()) has completed.
 
 	- Dave Rivers -
 
 > 
 > /\ Hidetoshi Shimokawa
 > \/  simokawa@sat.t.u-tokyo.ac.jp
 > PGP public key: finger -l simokawa@sat.t.u-tokyo.ac.jp
 > 
 > --- nfs_socket.c.orig	Fri Oct 11 19:15:33 1996
 > +++ nfs_socket.c	Sat Apr 19 00:43:25 1997
 > @@ -1490,6 +1490,12 @@
 >  		slpflag = PCATCH;
 >  	else
 >  		slpflag = 0;
 > +#if 1
 > +	if (!(*flagp & NFSMNT_RCVLOCK) && (rep->r_mrep != NULL)) {
 > +		printf("Oops! I found the bug :-)\n");
 > +		return (EALREADY);
 > +	}
 > +#endif
 >  	while (*flagp & NFSMNT_RCVLOCK) {
 >  		if (nfs_sigintr(rep->r_nmp, rep, rep->r_procp))
 >  			return (EINTR);
 > 
 > 
 > 
State-Changed-From-To: open->closed 
State-Changed-By: dfr 
State-Changed-When: Tue Apr 22 10:51:32 PDT 1997 
State-Changed-Why:  
Fixed in rev 1.23 of nfs_socket.c and rev 1.38 of nfs_vfsops.c. 
>Unformatted:
