From woerner@mediabase-gmbh.de  Sun Oct  6 04:01:27 2002
Return-Path: <woerner@mediabase-gmbh.de>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8C36937B401
	for <FreeBSD-gnats-submit@freebsd.org>; Sun,  6 Oct 2002 04:01:16 -0700 (PDT)
Received: from mailx.mediabase-gmbh.de (pD9E969A3.dip.t-dialin.net [217.233.105.163])
	by mx1.FreeBSD.org (Postfix) with SMTP id 43CD243E3B
	for <FreeBSD-gnats-submit@freebsd.org>; Sun,  6 Oct 2002 04:00:24 -0700 (PDT)
	(envelope-from woerner@mediabase-gmbh.de)
Message-Id: <20021006110024.43CD243E3B@mx1.FreeBSD.org>
Date: Sun,  6 Oct 2002 04:00:24 -0700 (PDT)
From: Arne Woerner <woerner@mediabase-gmbh.de>
To: FreeBSD-gnats-submit@freebsd.org
Subject: cannot open file without obvious reason
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         43739
>Category:       kern
>Synopsis:       cannot open file without obvious reason
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Oct 06 04:10:04 PDT 2002
>Closed-Date:    Mon Oct 07 09:53:03 PDT 2002
>Last-Modified:  Tue Oct  8 08:40:02 PDT 2002
>Originator:     Arne Woerner
>Release:        FreeBSD 5.0-CURRENT-20020917-JPSNAP i386
>Organization:
mediaBase GmbH, MUC, BY, FRG
>Environment:
System: FreeBSD actionman.local.mediabase-gmbh.de 5.0-CURRENT-20020917-JPSNAP FreeBSD 5.0-CURRENT-20020917-JPSNAP #1: Sat Sep 28 09:44:55 GMT 2002 aw@houston.local.mediabase-gmbh.de:/usr/src/sys/i386/compile/RIDDICK i386
>Description:
        I connected from a FreeBSD box to another FreeBSD box X via ssh
        and used some 'cat', 'lockf', 'tail', 'date' and 'mv' calls
        and did the same locally on the box X.
        After some time (75 cycles, counting both locally and remotely
        initiated ones) both sides have problems opening files.
        The error messages are:
                1. /usr/libexec/ld-elf.so.1: Cannot open "/usr/lib/libc.so.5"
                2. ./ICanDo.sh: Pipe call failed: Too many open files in system
                3. cat: num: Too many open files in system
                4. lockf: cannot open gaga: Too many open files in system
        'netstat' and 'ps' do not show anything unusual (at most 2 lockf
        processes)...
        Not many tcp connections:
                tcp4       0      0  localhost.x11-ssh      *.*                    LISTEN
                tcp4       0     64  newark.ssh             gargano.19859          ESTABLISHED
                udp4       0      0  localhost.ntp          *.*
                udp4       0      0  newark.ntp             *.*

        The problem remains for at least 10 minutes.
        This looks a little bit funny... :)
>How-To-Repeat:
        The script is called ICanDo.sh and contains the following lines:
                #!/bin/sh
                # $Id$

                if [ "$1" = "" ] ;then
                        lockf gaga $0 DoIt $2
                        exit 0
                fi

                if [ "$2" != "" ] ;then
                        sleep 5
                fi

                num=`cat num`
                if [ "$num" = "" ] ;then
                        num=0
                fi
                num=`expr $num + 1`
                echo $num > num

                echo ${SSH_CLIENT}: ${num}: `date +%Y%m%d%H%M%S` >> dada2
                tail -100000 < dada2 > dada2.tmp
                mv dada2.tmp dada2
        The command line on the remote box was:
                ( repeat 1000000000 ssh cyclops ./ICanDo.sh ) < /dev/null &
        The command line on the box X was:
                ( repeat 1000000000 ./ICanDo.sh "" gaga ) &
>Fix:
        rebooting helps...
>Release-Note:
>Audit-Trail:
State-Changed-From-To: open->closed 
State-Changed-By: fanf 
State-Changed-When: Mon Oct 7 09:47:50 PDT 2002 
State-Changed-Why:  
The problem is explained by the error message "Too many open files 
in system", which is probably caused by running too many copies of the 
script concurrently.  Tools like ps and fstat should help diagnose 
the problem. Ask on questions@freebsd.org for more help. 
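
A check of the kind suggested here could look roughly like the following.
This is only a sketch, assuming a POSIX sh and a BSD-style ps; the helper
names count_lines_with and script_instances are made up for this example
and are not part of any FreeBSD tool.

```shell
#!/bin/sh
# Hypothetical helpers for counting running copies of a script.

# count_lines_with STRING
# Count the lines on stdin that contain STRING (plain substring match).
count_lines_with() {
        awk -v s="$1" 'index($0, s) { n++ } END { print n + 0 }'
}

# script_instances NAME
# Print the number of processes whose command line contains NAME.
script_instances() {
        # Capture the ps output first, so that the awk helper (whose own
        # command line contains NAME) cannot show up in the listing.
        listing=$(ps -axww -o command)
        printf '%s\n' "$listing" | count_lines_with "$1"
}
```

Running `script_instances ICanDo.sh` on box X while the test loops are
active would show how many copies of the script exist at that moment.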

http://www.freebsd.org/cgi/query-pr.cgi?pr=43739 

From: Peter Pentchev <roam@ringlet.net>
To: Arne Woerner <woerner@mediabase-gmbh.de>
Cc: fanf@FreeBSD.org, bug-followup@FreeBSD.org
Subject: Re: kern/43739: cannot open file without obvious reason
Date: Mon, 7 Oct 2002 20:21:07 +0300

 On Mon, Oct 07, 2002 at 05:07:42PM +0000, Arne Woerner wrote:
 > Wrong! Wrong! Wrong! :)
 > 
 > Due to the "lockf" it is impossible that there are more than
 > two ICanDo.sh processes...
 
 Uhm.. excuse me?  What is to prevent the SSH session starting up
 a billion of ICanDo.sh processes?
 
 Each SSH session operates in one of the following ways:
 
 1. Log in - this eats a file descriptor for the socket.
 
 2. Start a shell running ICanDo.sh - this eats a couple of file
    descriptors for the shell's /bin/sh executable and for the ICanDo.sh
    file that the shell needs to read to execute.
 
 3. Start executing ICanDo.sh - at this point, there are at least two
    file descriptors consumed by this session on the target host.
 
 4. Start lockf(1) - this needs an additional file descriptor for reading
    in the lockf(1)'s executable file.
 
 5. Open the 'gaga' lockfile.
 
 6. Try to flock(2) the 'gaga' lockfile *after having opened it* -
    please take a look at the lockf(1), flock(2) and lockf(3) manual
    pages; the lockfile needs to be opened in order to be locked.
 
 At this step, if the file may not be locked, there are still at least
 two, maybe up to four file descriptors in use, two of which may not
 be reclaimed under any circumstances.  So, each of your billion SSH
 sessions eats up at least two file descriptors.
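
Per-process descriptor use of the kind described in these steps could be
checked with fstat(1).  The following is a rough sketch only, assuming the
FreeBSD fstat column layout (USER CMD PID FD ...) and command names without
spaces; count_fds is a hypothetical helper, not an existing utility.

```shell
#!/bin/sh
# count_fds
# Read fstat(1) output on stdin and print how many numbered descriptors
# it lists.  The header line and the non-descriptor entries (root, wd,
# text) are skipped; a trailing "*" on the FD field (sockets etc.) is
# accepted.
count_fds() {
        awk 'NR > 1 && $4 ~ /^[0-9]/ { n++ } END { print n + 0 }'
}
```

For example, `fstat -p $$ | count_fds` would print the number of
descriptors held by the current shell.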
 
 > I think somebody who is able to find kernel bugs quickly
 > should analyse this problem insteadof fantazising... :)
 
 I think somebody should spend a little more time reading the relevant
 utilities' manual pages :)
 
 G'luck,
 Peter
 
 -- 
 Peter Pentchev	roam@ringlet.net	roam@FreeBSD.org
 PGP key:	http://people.FreeBSD.org/~roam/roam.key.asc
 Key fingerprint	FDBA FD79 C26F 3C51 C95E  DF9E ED18 B68D 1619 4553
 This sentence is false.

From: Arne Woerner <woerner@mediabase-gmbh.de>
To: roam@ringlet.net
Cc: bug-followup@FreeBSD.org, fanf@FreeBSD.org
Subject: Re: kern/43739: cannot open file without obvious reason
Date: Mon, 7 Oct 2002 17:42:18 GMT

 > From roam@ringlet.net Mon Oct  7 17:30:16 2002
 > Date: Mon, 7 Oct 2002 20:21:07 +0300
 > From: Peter Pentchev <roam@ringlet.net>
 > To: Arne Woerner <woerner@mediabase-gmbh.de>
 > Cc: fanf@FreeBSD.org, bug-followup@FreeBSD.org
 > Subject: Re: kern/43739: cannot open file without obvious reason
 > References: <200210071707.g97H7g2F001469@gargano.local.mediabase-gmbh.de>
 > Mime-Version: 1.0
 > Content-Type: text/plain; charset=windows-1251
 > Content-Disposition: inline
 > In-Reply-To: <200210071707.g97H7g2F001469@gargano.local.mediabase-gmbh.de>
 > User-Agent: Mutt/1.5.1i
 > X-Virus-Scanned: by Nik's Monitoring Daemon (AMaViS perl-11d <Tu  28 May 2002 12:55:43 EEST>)
 > X-UIDL: A8<!!jn-!!gg%#!n'O!!
 >
 > On Mon, Oct 07, 2002 at 05:07:42PM +0000, Arne Woerner wrote:
 > > Wrong! Wrong! Wrong! :)
 > > 
 > > Due to the "lockf" it is impossible that there are more than
 > > two ICanDo.sh processes...
 >
 > Uhm.. excuse me?  What is to prevent the SSH session starting up
 > a billion of ICanDo.sh processes?
 >
 IT IS THE "LOCKF"!
 While the lockf blocks the session will not terminate. And a new
 session can only be started as soon as the old session terminates.
 See:
 	( repeat 1000000000 ssh cyclops ./ICanDo.sh ) < /dev/null &
 
 I wonder why you both are reading my emails if you do not want to
 understand them. I do not spend my time writing bug reports just for
 your amusement or to allow you to insult me.
 
 > Each SSH session operates in one of the following ways:
 >
 > 1. Log in - this eats a file descriptor for the socket.
 >
 > 2. Start a shell running ICanDo.sh - this eats a couple of file
 >    descriptors for the shell's /bin/sh executable and for the ICanDo.sh
 >    file that the shell needs to read to execute.
 >
 > 3. Start executing ICanDo.sh - at this point, there are at least two
 >    file descriptors consumed by this session on the target host.
 >
 > 4. Start lockf(1) - this needs an additional file descriptor for reading
 >    in the lockf(1)'s executable file.
 >
 > 5. Open the 'gaga' lockfile.
 >
 > 6. Try to flock(2) the 'gaga' lockfile *after having opened it* -
 >    please take a look at the lockf(1), flock(2) and lockf(3) manual
 >    pages; the lockfile needs to be opened in order to be locked.
 >
 > At this step, if the file may not be locked, there are still at least
 > two, maybe up to four file descriptors in use, two of which may not
 > be reclaimed under any circumstances.  So, each of your billion SSH
 > sessions eats up at least two file descriptors.
 >
 No!
 
 > I think somebody should spend a little more time reading the relevant
 > utilities' manual pages :)
 >
 Laa-Laa is cuter than you!
 
 -Arne
 
 "I want to understand..." (Fox Mulder?)

From: Ian Dowse <iedowse@maths.tcd.ie>
To: Arne Woerner <woerner@mediabase-gmbh.de>
Cc: bug-followup@FreeBSD.org, roam@ringlet.net, fanf@FreeBSD.org
Subject: Re: kern/43739: cannot open file without obvious reason 
Date: Mon, 07 Oct 2002 21:32:02 +0100

 In message <200210071810.g97IA825093533@freefall.freebsd.org>, Arne Woerner writes:
 > IT IS THE "LOCKF"!
 > While the lockf blocks the session will not terminate. And a new
 > session can only be started as soon as the old session terminates.
 > See:
 > 	( repeat 1000000000 ssh cyclops ./ICanDo.sh ) < /dev/null &
 > 
 > I wonder why you both are reading my emails if you do not want to
 > understand them. I do not spend my time writing bug reports just for
 > your amusement or to allow you to insult me.
 
 Hi Arne,
 
 Nobody is trying to insult you, but we do need to ask a few questions
 to verify the problem, as we get many bogus problem reports that
 turn out to be misunderstandings or user errors. That is not the
 case here, but a few pieces of extra information such as the output
 of fstat and the value of kern.*files would have quickly confirmed
 to us that there is in fact a real bug.
 
 It looks like a fdrop() call in open() (now kern_open) was lost in
 revision 1.218 of vfs_syscalls.c. It probably wasn't noticed because
 it is in an error case that would rarely occur in practice. Doing
 
 	lockf /dev/null ls
 
 is a reliable way of repeating the bug, as can be confirmed by
 monitoring kern.openfiles. The following patch appears to fix it.
 
 Ian
 
 Index: vfs_syscalls.c
 ===================================================================
 RCS file: /dump/FreeBSD-CVS/src/sys/kern/vfs_syscalls.c,v
 retrieving revision 1.289
 diff -u -r1.289 vfs_syscalls.c
 --- vfs_syscalls.c	2 Oct 2002 09:05:30 -0000	1.289
 +++ vfs_syscalls.c	7 Oct 2002 20:13:05 -0000
 @@ -773,6 +773,7 @@
  		fdrop(fp, td);
  	} else
  		FILEDESC_UNLOCK(fdp);
 +	fdrop(fp, td);
  	return (error);
  }
  
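
The reproduction described above (running a command repeatedly while
watching kern.openfiles) can be sketched as a small sh helper.  This is an
illustration under assumptions, not a tested diagnostic: kern.openfiles is
the sysctl named in the message, while the function names openfiles and
leak_check are invented for this example.

```shell
#!/bin/sh
# Sketch: measure how much the system-wide open-file count changes
# while a command is run repeatedly.

# Print the current system-wide number of open files.
openfiles() {
        sysctl -n kern.openfiles
}

# leak_check COUNT CMD [ARGS...]
# Run CMD COUNT times and print the change in the open-file count.
leak_check() {
        count=$1
        shift
        before=$(openfiles)
        i=0
        while [ "$i" -lt "$count" ]; do
                "$@" > /dev/null 2>&1
                i=$((i + 1))
        done
        after=$(openfiles)
        echo $((after - before))
}
```

With this, `leak_check 100 lockf /dev/null ls` prints the difference in
kern.openfiles across 100 runs.  Other activity on the machine also changes
the counter, so the number is only meaningful on an otherwise idle system.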

From: Arne Woerner <woerner@mediabase-gmbh.de>
To: dot@dotat.at, iedowse@maths.tcd.ie
Cc: bug-followup@FreeBSD.org, fanf@FreeBSD.org,
	freebsd-bugs@FreeBSD.org
Subject: Re: kern/43739: cannot open file without obvious reason
Date: Tue, 8 Oct 2002 10:47:26 GMT

 > Please would you provide us with the information I have asked for.
 > I am not asking for it in order to insult you -- I am trying to help
 > you and I am asking for it because analysing a problem with insufficient
 > information is hard.
 >
 As long as you do not understand that only two ICanDo instances are
 active at any point in time in my setup, any further information is
 futile.  I thought it was necessary to reproduce the bug, so I did not give
 much additional information, only exact hints on the reproduction
 procedure (which is obviously not minimal).
 
 > Also, the output from `sysctl kern.maxfiles` would be useful -- you're
 > using small-memory machines so I would expect the number to be quite low.
 > (It would have been useful to know that earlier.)
 >
 80MB is not so small in my opinion, but my CO just told me that 512MB is
 normal nowadays... The value of kern.maxfiles is somewhat over 1200
 (but obviously this value cannot be responsible, because it works twice and
 10 times and 70 times but not 80 times, which clearly shows that we have
 some unwanted residues in the kernel)...
 
 > > IT IS THE "LOCKF"!
 > > While the lockf blocks the session will not terminate. And a new
 > > session can only be started as soon as the old session terminates.
 > > See:
 > > 	( repeat 1000000000 ssh cyclops ./ICanDo.sh ) < /dev/null &
 > > 
 > > I wonder why you both are reading my emails if you do not want to
 > > understand them. I do not spend my time writing bug reports just for
 > > your amusement or to allow you to insult me.
 >
 > Nobody is trying to insult you,
 >
 I hope so. I have been told before that I am too quick to think that
 somebody is spitting on me or hates me, even though I like him and he
 only wants to help.
 
 > but we do need to ask a few questions
 >
 I really thought that my information was exact enough (the real
 source code and the symptom). I am astonished that there are such
 ridiculous theories about my source code among experts like
 you...
 
 > to verify the problem, as we get many bogus problem reports that
 > turn out to be misunderstandings or user errors.
 >
 Yes, I know. I wrote one myself some days ago because I thought the
 'ls' implementation was buggy on large files (but in fact the 'ext2fs'
 implementation in OpenBSD 3.1 was responsible for the problem...).
 (I have written some followups in the meantime, so you could close that
 report now.)
 
 > That is not the
 > case here, but a few pieces of extra information such as the output
 > of fstat and the value of kern.*files would have quickly confirmed
 > to us that there is in fact a real bug.
 >
 ditto (see line 39 ff.)
 
 > It looks like a fdrop() call in open() (now kern_open) was lost in
 > revision 1.218 of vfs_syscalls.c. It probably wasn't noticed because
 > it is in an error case that would rarely occur in practice. Doing
 >
 > 	lockf /dev/null ls
 >
 > is a reliable way of repeating the bug, as can be confirmed by
 > monitoring kern.openfiles. The following patch appears to fix it.
 >
 I wonder why there are two identical calls to 'fdrop' if the if-condition
 evaluates to true, but I am not the one to judge your source code.
 Still, I would be sorry if my panging led to funny source code...
 
 In general it is clear that you are not forced to believe me or to
 fix bugs. But I am very glad that I have been able to use your software
 successfully since 1996. *sniff*
 
 Thank you for your cooperation.
 
 -Arne

From: Arne Woerner <woerner@mediabase-gmbh.de>
To: bug-followup@FreeBSD.org, freebsd-bugs@FreeBSD.org
Cc:  
Subject: Re: kern/43739: cannot open file without obvious reason
Date: Tue, 8 Oct 2002 15:36:06 GMT

 I applied the patch...
 
 Now it can do over 5000 cycles without problem... :)
 
 Bye
 Arne
>Unformatted:
