From nobody@FreeBSD.org  Mon Jul  8 14:45:59 2013
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
	by hub.freebsd.org (Postfix) with ESMTP id B40E6F7A
	for <freebsd-gnats-submit@FreeBSD.org>; Mon,  8 Jul 2013 14:45:59 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from oldred.freebsd.org (oldred.freebsd.org [8.8.178.121])
	by mx1.freebsd.org (Postfix) with ESMTP id A51051678
	for <freebsd-gnats-submit@FreeBSD.org>; Mon,  8 Jul 2013 14:45:59 +0000 (UTC)
Received: from oldred.freebsd.org ([127.0.1.6])
	by oldred.freebsd.org (8.14.5/8.14.7) with ESMTP id r68Ejws9025381
	for <freebsd-gnats-submit@FreeBSD.org>; Mon, 8 Jul 2013 14:45:58 GMT
	(envelope-from nobody@oldred.freebsd.org)
Received: (from nobody@localhost)
	by oldred.freebsd.org (8.14.5/8.14.5/Submit) id r68Ejwva025380;
	Mon, 8 Jul 2013 14:45:58 GMT
	(envelope-from nobody)
Message-Id: <201307081445.r68Ejwva025380@oldred.freebsd.org>
Date: Mon, 8 Jul 2013 14:45:58 GMT
From: "David A. Bright" <David_A_Bright@DELL.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: [kqueue] Conflict between EVFILT_PROC NOTE_CHILD and NOTE_EXIT use of data field
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         180385
>Category:       kern
>Synopsis:       [kqueue] Conflict between EVFILT_PROC NOTE_CHILD and NOTE_EXIT use of data field
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Jul 08 14:50:00 UTC 2013
>Closed-Date:    
>Last-Modified:  Wed Jul 10 22:30:00 UTC 2013
>Originator:     David A. Bright
>Release:        9.1-RELEASE-p4
>Organization:
Dell | Compellent
>Environment:
FreeBSD localhost.local 9.1-RELEASE-p4 FreeBSD 9.1-RELEASE-p4 #0: Mon Jun 17 11:42:37 UTC 2013     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

>Description:
There is a bug in the kevent EVFILT_PROC handling, possibly introduced by a mod in 2006:

http://svnweb.freebsd.org/base/head/sys/kern/kern_event.c?r1=164450&r2=164451

The scenario is a process that spawns a bunch of other processes and uses the kevent EVFILT_PROC, NOTE_TRACK facility to keep tabs on them, maintaining a process history including process parent/child relationships and whether the processes are running or have exited. Consider a process that does a fork() and then the child attempts to exec() a non-existent program and does an exit(127) on the exec() failure. This can result in a single kevent that returns both the NOTE_CHILD and NOTE_EXIT fflags.

Unfortunately, both of these fflags are defined to return something in kevent.data (the parent pid (ppid) for NOTE_CHILD and the exit status for NOTE_EXIT). Obviously, kevent.data can't contain both pieces of information. In fact, what gets returned is the exit status, so when the receiving code tries to interpret the NOTE_CHILD it appears that the ppid is 32512 (127 << 8), which really throws off the process tracking.
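
To illustrate where 32512 comes from, here is a minimal sketch that forks a child, lets it exit with a given code, and returns the raw wait status. It assumes the value delivered with NOTE_EXIT uses the same encoding as the status filled in by waitpid(); raw_status_for_exit() is a hypothetical helper name, not something from the kernel or libc.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits immediately with `code`
 * and return the raw wait status, or -1 if fork()/waitpid() fails.
 */
int raw_status_for_exit(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);            /* child: exit at once, as after a failed exec() */

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;              /* exit code sits in bits 8..15 of the status */
}
```

For a child that exits with 127, the raw status is 127 << 8, and WEXITSTATUS() recovers 127 from it. A receiver that reads that same number as a ppid sees a parent pid of 32512.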

Before the mod mentioned above, the exit status was not returned, so the returned ppid for the NOTE_CHILD would have been correct (sys/kern/kern_event.c, filt_procattach(), about line 368 in head), although the returned ppid would not make much sense as an exit status for the NOTE_EXIT.

What I think should happen is that when filt_proc() (about kern_event.c line 433) is going to set the exit status, it should first check if NOTE_CHILD is already set in the knote. If so, it should allocate a new knote for the NOTE_EXIT and queue it after the existing NOTE_CHILD knote (some adjustment would need to be made around line 424, too, since that is where the NOTE_EXIT was set). This would guarantee that the NOTE_CHILD was received before the NOTE_EXIT and that the appropriate piece of data could accompany each NOTE_ flag.

I don't know if there is a significant concern about allocating a new knote at that point. If that were the case and the ideal behavior I described were not possible, I would think that it would be more appropriate to return the ppid in the knote and not set the kn->kn_data field to the exit status iff the NOTE_CHILD fflag was set, thereby giving it precedence. That would at least work much better for my particular situation! If you receive a kevent with both NOTE_CHILD and NOTE_EXIT set, you might be able to presume that the child probably failed; in any case you would know for sure that it had exited and what process was its parent.
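
Until the kernel behavior changes, a receiver can at least avoid misreading the data field. The sketch below classifies what kevent.data means from the returned fflags, following the behavior described above for 9.1 (exit status wins when both flags are set). The PR_-prefixed names are hypothetical; the two flag values are assumed to mirror NOTE_EXIT and NOTE_CHILD in <sys/event.h> and are redefined locally only so the sketch builds without that header.

```c
#include <stdint.h>

/* Assumed to match NOTE_EXIT / NOTE_CHILD in FreeBSD's <sys/event.h>. */
#define PR_NOTE_EXIT  0x80000000u
#define PR_NOTE_CHILD 0x00000004u

/* Hypothetical classification of what kevent.data holds. */
enum pr_data_kind {
    PR_DATA_OTHER,          /* neither flag set: data not interpreted here */
    PR_DATA_PPID,           /* NOTE_CHILD only: data is the parent pid */
    PR_DATA_EXIT_STATUS,    /* NOTE_EXIT only: data is the exit status */
    PR_DATA_STATUS_NO_PPID  /* both set: data is the exit status, ppid is lost */
};

enum pr_data_kind pr_classify(uint32_t fflags)
{
    int child  = (fflags & PR_NOTE_CHILD) != 0;
    int exited = (fflags & PR_NOTE_EXIT) != 0;

    if (child && exited)
        return PR_DATA_STATUS_NO_PPID;
    if (child)
        return PR_DATA_PPID;
    if (exited)
        return PR_DATA_EXIT_STATUS;
    return PR_DATA_OTHER;
}
```

In the PR_DATA_STATUS_NO_PPID case the tracker knows the process existed and has exited, but has to treat its parent as unknown rather than record the data value as a pid.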

I've exchanged email with jhb on the problem and he indicated that it might be a while before he could get to it. I'll take a shot at it myself, but it will probably be at least a couple weeks before I can do so. I wanted to file this PR so that it was out there in case someone else might be able to get to it sooner and also so that it isn't forgotten.
>How-To-Repeat:
This is a timing-related thing, but doing a fork() and then exiting immediately in the child is likely to show the problem fairly often.
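
A rough, untested sketch of such a reproduction follows. It watches the calling process with NOTE_TRACK, forks a child that exits at once, waits briefly so both notes can accumulate on the child's knote, and then checks whether one returned event carries both flags. The function name is hypothetical, the one-second delay is an arbitrary choice, and the code is guarded so that it simply returns -1 on systems other than FreeBSD.

```c
#if defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>
#endif

/*
 * Hypothetical helper. Returns 1 if a single event carried both NOTE_CHILD
 * and NOTE_EXIT, 0 if that was not observed, -1 on error or when built on
 * a system other than FreeBSD.
 */
int saw_child_and_exit_together(void)
{
#if defined(__FreeBSD__)
    int kq = kqueue();
    if (kq < 0)
        return -1;

    /* Watch ourselves; NOTE_TRACK follows children created by fork(). */
    struct kevent kev;
    EV_SET(&kev, getpid(), EVFILT_PROC, EV_ADD,
        NOTE_TRACK | NOTE_FORK | NOTE_EXIT, 0, NULL);
    if (kevent(kq, &kev, 1, NULL, 0, NULL) < 0) {
        close(kq);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(kq);
        return -1;
    }
    if (pid == 0)
        _exit(127);             /* child: exit immediately */

    sleep(1);                   /* give the child time to exit before polling */

    struct kevent ev[8];
    struct timespec zero = { 0, 0 };
    int n = kevent(kq, NULL, 0, ev, 8, &zero);
    int found = 0;
    for (int i = 0; i < n; i++) {
        unsigned int both = NOTE_CHILD | NOTE_EXIT;
        if ((ev[i].fflags & both) == both)
            found = 1;          /* ev[i].data cannot hold ppid and status */
    }

    (void)waitpid(pid, NULL, 0);    /* reap the child */
    close(kq);
    return n < 0 ? -1 : found;
#else
    return -1;
#endif
}
```

When the function returns 1, printing ev[i].data for the matching event should show the conflict directly: the value is the exit status rather than the parent pid.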
>Fix:
jhb suggested:

"Hmmm, this might be fixable by adding a f_touch method to the EVFILT_PROC
handling and having it notice the two states and break them up."


>Release-Note:
>Audit-Trail:

From: Jilles Tjoelker <jilles@stack.nl>
To: bug-followup@FreeBSD.org, David_A_Bright@DELL.com
Cc:  
Subject: Re: kern/180385: [kqueue] Conflict between EVFILT_PROC NOTE_CHILD
 and NOTE_EXIT use of data field
Date: Thu, 11 Jul 2013 00:24:21 +0200

 Hi,
 
 In PR 180385, you wrote about problems with EVFILT_PROC/NOTE_TRACK. I
 tried to use this too (for system service management, pid 1 or close to
 it) but ran across a problem already while reading the man page:
 NOTE_TRACKERR. If NOTE_TRACKERR happens, all the tracking breaks down.
 How do you deal with this?
 
 NOTE_CHILD|NOTE_EXIT knotes can be seen easily by suspending the process
 so it does not invoke kevent() for a long time.
 
 If EVFILT_PROC is not supposed to keep zombies alive (pretty much
 mandatory, otherwise any user can prevent any other user from freeing up
 proc slots by reaping zombies), then limiting the number of knotes that
 do not correspond to a live kernel object implies that NOTE_TRACKERR
 must be possible. Removing NOTE_TRACKERR would require discarding
 NOTE_CHILD|NOTE_EXIT knotes when the zombie is reaped. Even then,
 guaranteeing no NOTE_TRACKERR may cause fork() to fail sooner than
 without active kqueues.
 
 The extra restrictions can be compared to how waitpid() accepts
 (discards) the instance of the SIGCHLD signal for the waited process,
 ensuring that every terminating process generates a SIGCHLD signal while
 bounding memory usage for undelivered signals.
 
 Splitting NOTE_CHILD|NOTE_EXIT into two knotes either requires
 allocating two knotes at fork() time (making NOTE_TRACKERR slightly more
 likely and possibly wasting some memory) or allowing the split to fail.
 
 A partial workaround for your issue may be the udata field. I think
 (untested) that the udata field is copied upon NOTE_CHILD. This does not
 allow tracking the full parent-child relationships, but does allow
 tracking all descendants of a particular process (except if
 NOTE_TRACKERR happens).
 
 -- 
 Jilles Tjoelker
>Unformatted:
