From nobody@FreeBSD.ORG  Thu Aug  3 23:27:22 2000
Return-Path: <nobody@FreeBSD.ORG>
Received: by hub.freebsd.org (Postfix, from userid 32767)
	id C423B37B7FC; Thu,  3 Aug 2000 23:27:22 -0700 (PDT)
Message-Id: <20000804062722.C423B37B7FC@hub.freebsd.org>
Date: Thu,  3 Aug 2000 23:27:22 -0700 (PDT)
From: bsdx@looksharp.net
Sender: nobody@FreeBSD.ORG
To: freebsd-gnats-submit@FreeBSD.org
Subject: processes get stuck in vmwait instead of nanslp with large MAXUSERS
X-Send-Pr-Version: www-1.0

>Number:         20393
>Category:       kern
>Synopsis:       processes get stuck in vmwait instead of nanslp with large MAXUSERS
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    silby
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Aug 03 23:30:01 PDT 2000
>Closed-Date:    Wed Mar 6 21:01:40 PST 2002
>Last-Modified:  Wed Mar 06 21:05:22 PST 2002
>Originator:     Adam McDougall
>Release:        4.1-STABLE
>Organization:
>Environment:
4.1-STABLE compiled 31st July or early August.
The admin understands the user limits in login.conf but wants to run more than 2000 processes on one machine, which seems reasonable.
>Description:
rwatson asked me to send a pr for this, so I am trying to help out.

Processes hang the system by getting stuck in vmwait instead of sleeping in nanslp as they should.  If you have MAXUSERS set above about 128 and try to run at least 2000 processes, some get stuck in vmwait, making the system almost unresponsive until they eventually exit.  If the number of processes wanting to run is much larger than 2000, the cycle repeats.  Console switching still works, but typing at a login prompt yields no characters on the screen.  Any processes running on other terminals (including a top started with rtprio 0) freeze.  ps in DDB shows a small percentage of the children in vmwait instead of whatever they should be doing.
>How-To-Repeat:
compile kernel on 4.1 with maxusers=200
run a program (http://www.looksharp.net/~user1/test.c) which tries to fork x children, each of which does the following:
  print starting time, sleep 20 seconds, print ending time, exit.
Note that if x is greater than some number z around 1500, processes stop forking well before the shell's or the kernel's maxproc limit is reached, and processes hang as described.  After a period several times longer than 20 seconds, some processes exit and more start.  If x > ~2000, the cycle repeats several times.
Problem 2: repeat the above with maxusers = 600 (anything above about 512).
The kernel panics from running out of kernel memory instead of showing the freezing behavior.  (One person I showed this part of the issue to, after inspecting the source code, was unsure whether it should panic or wait for free memory in fork1 here:
  MALLOC(p2->p_cred, struct pcred *, sizeof(struct pcred),
      M_SUBPROC, M_WAITOK);
)
>Fix:
Let me know if I can help out further with the issue.  I have time to test things but I do not know C let alone the kernel.  I can provide a scratchbox to demonstrate the problem and any level of access to it if it would be helpful in resolving the issue.  

>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->dillon 
Responsible-Changed-By: sheldonh 
Responsible-Changed-When: Fri Aug 4 02:04:23 PDT 2000 
Responsible-Changed-Why:  
This looks like Matt's area. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=20393 
State-Changed-From-To: open->analyzed 
State-Changed-By: silby 
State-Changed-When: Tue Feb 12 00:08:38 PST 2002 
State-Changed-Why:  
Well, the reason for situation #2 is clear now that I've worked on 
PR 23740:  Setting a large maxusers simply sets too large a maxproc. 
As a result, process-related structures will consume all memory, 
thereby causing the system to deadlock.  I should have a patch 
to limit this committed soon.  In the meantime, lowering maxproc 
is a viable solution. 

As for the temporary hang and recovery seen when many processes 
are run, but memory isn't exhausted, I'm not sure if an easy 
solution is at hand.  It simply looks as if the scheduler isn't 
handling thousands of runnable processes well.  Related to this, 
some processes are tsleeping on "ttywri" for long periods of time; 
I suspect that this may be because sshd isn't being given a chance 
to run and thereby flush the tty buffers of that process. 

So, I can fix the panics and complete hangs of the system, but 
making this modified forkbomb run efficiently may be a lot of work. 


Responsible-Changed-From-To: dillon->silby 
Responsible-Changed-By: silby 
Responsible-Changed-When: Tue Feb 12 00:08:38 PST 2002 
Responsible-Changed-Why:  
This is related to PR kern/23740, so I'm grabbing it. 


http://www.FreeBSD.org/cgi/query-pr.cgi?pr=20393 
State-Changed-From-To: analyzed->closed 
State-Changed-By: silby 
State-Changed-When: Wed Mar 6 21:01:40 PST 2002 
State-Changed-Why:  
This problem is fixed in 4.5-stable as of today. 

There were two problems. 

1.  The panics were related to maxproc being set too high. 
Lower your maxusers so that it equals the megs of ram 
in your system.  (A subsequent patch will enforce 
maxproc to a sane value.) 
2.  vm_daemon scaled badly with thousands of procs, and 
was eating LOTS of processor time.  This was the freezing 
you described.  It has now been fixed. 

Thank you for including the program which exhibited the problem 
behavior; it was a big help in tracking down the problem. 

http://www.FreeBSD.org/cgi/query-pr.cgi?pr=20393 
>Unformatted:
