From sclawson@cs.utah.edu Mon Apr  5 17:31:42 1999
Return-Path: <sclawson@cs.utah.edu>
Received: from wrath.cs.utah.edu (wrath.cs.utah.edu [155.99.198.100])
	by hub.freebsd.org (Postfix) with ESMTP id 308551547B
	for <FreeBSD-gnats-submit@freebsd.org>; Mon,  5 Apr 1999 17:31:40 -0700 (PDT)
	(envelope-from sclawson@cs.utah.edu)
Received: from ibapah.cs.utah.edu (ibapah.cs.utah.edu [155.99.212.83])
	by wrath.cs.utah.edu (8.8.8/8.8.8) with ESMTP id SAA27780
	for <FreeBSD-gnats-submit@freebsd.org>; Mon, 5 Apr 1999 18:29:43 -0600 (MDT)
Received: (from sclawson@localhost)
	by ibapah.cs.utah.edu (8.9.1/8.9.1) id SAA19824;
	Mon, 5 Apr 1999 18:29:42 -0600 (MDT)
	(envelope-from sclawson@cs.utah.edu)
Message-Id: <199904060029.SAA19824@ibapah.cs.utah.edu>
Date: Mon, 5 Apr 1999 18:29:42 -0600 (MDT)
From: Stephen Clawson <sclawson@cs.utah.edu>
Reply-To: sclawson@cs.utah.edu
To: FreeBSD-gnats-submit@freebsd.org
Subject: ypserv segfaults regularly (really: Race condition in the Berkeley db library).
X-Send-Pr-Version: 3.2

>Number:         10971
>Category:       bin
>Synopsis:       ypserv segfaults regularly (really: Race condition in the Berkeley db library).
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Apr  5 17:30:01 PDT 1999
>Closed-Date:    Fri Apr 30 10:00:20 PDT 1999
>Last-Modified:  Fri Apr 30 10:07:03 PDT 1999
>Originator:     Stephen Clawson
>Release:        FreeBSD 3.0-CURRENT i386 (jan 27, 1999)
>Organization:
University of Utah Computer Science
>Environment:

We've been running a slave yp server on a dual Pentium II/350 for a
group of twenty or so FreeBSD machines and a few assorted linux boxes.
The underlying network is switched 100BaseT to the server.  

>Description:

After setting up a slave yp server for our group and switching all of
our machines to use it, I started to notice ypserv.core files in /.
It turns out that there's a race in Berkeley DB that gets tickled
through a combination of the DB_CACHE code in ypserv, running
ypserv on an SMP box and a bug in ypserv (PR bin/10970).

The problem comes down to the DB_CACHE code keeping heavily used
databases open.  This is usually a good thing, but because the
databases are already open, when the child forks it shares a file
descriptor (and thus a file description, and thus a file pointer) for
the database file with its parent.  If a lot of requests for the same
database come in at the same time, this means that multiple children
will also be sharing the same file description, since they all came
from the same parent.
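
A minimal sketch of the sharing itself, not taken from ypserv: the
function name and the scratch file path are made up for illustration.
The child seeks on the descriptor it inherited, and the parent, which
never seeks, should see its own offset move.

```c
#define _XOPEN_SOURCE 700   /* mkstemp() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Open a scratch file, fork, let the child seek to `target`, and report
 * the offset the parent sees afterwards.  Returns -1 on any failure.
 */
off_t
parent_offset_after_child_seek(off_t target)
{
    char path[] = "/tmp/shared_offset.XXXXXX";  /* hypothetical name */
    int fd, status;
    pid_t pid;
    off_t seen = -1;

    assert(target >= 0);
    if ((fd = mkstemp(path)) == -1)
        return (-1);
    unlink(path);               /* the file lives until fd is closed */

    if ((pid = fork()) == -1)
        goto out;
    if (pid == 0) {
        /* Child: fd refers to the parent's open file description. */
        _exit(lseek(fd, target, SEEK_SET) == -1 ? 1 : 0);
    }
    if (waitpid(pid, &status, 0) != -1 && WIFEXITED(status) &&
        WEXITSTATUS(status) == 0)
        seen = lseek(fd, 0, SEEK_CUR);  /* parent never seeked itself */
out:
    close(fd);
    return (seen);
}
```

Calling parent_offset_after_child_seek(4096) should hand back 4096
rather than 0, which is all the race needs.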

With all the children accessing the database concurrently, they react
badly to the race in libc/db/hash/hash_page.c:__get_page():

    if ((lseek(fd, (off_t)page << hashp->BSHIFT, SEEK_SET) == -1) ||
        ((rsize = read(fd, p, size)) == -1))
                return (-1);

The problem shows up in this fragment from a ktrace:

 26527 ypserv   CALL  lseek(0x7,0,0x1e000,0,0)
 26527 ypserv   RET   lseek 122880/0x1e000
 26533 ypserv   CALL  lseek(0x7,0,0x8000,0,0)
 26533 ypserv   RET   lseek 32768/0x8000
 26527 ypserv   CALL  read(0x7,0x80a2000,0x1000)
 26527 ypserv   GIO   fd 7 read 4096 bytes
       "0\0\M-{\^O\M-?\^O\M-7\^Oq\^Ol\^O)\^O#\^O\M-_\^N\M-\\^N\M-#\^N\M^\\^NZ\
        \^NR\^N\r\^N\^E\^N"
 26527 ypserv   RET   read 4096/0x1000
 26533 ypserv   CALL  read(0x7,0x80ad000,0x1000)
 26533 ypserv   GIO   fd 7 read 4096 bytes
       "\M-_\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\
        \M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?\M^?"
 26533 ypserv   RET   read 4096/0x1000
 26527 ypserv   CALL  lseek(0x7,0,0x1f000,0,0)
 26527 ypserv   RET   lseek 126976/0x1f000
 26533 ypserv   PSIG  SIGSEGV SIG_DFL
 26533 ypserv   NAMI  "ypserv.core"

Neither process gets the data it asked for.  The first reads the page
at the second one's offset, and the second gets whatever is 0x1000
bytes after the data it really wants.  That makes hash_seq() return a
pointer into unallocated memory, and the process segfaults when it
dereferences it.
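
The interleaving in the trace can be replayed deterministically in one
process, as a rough illustration.  This is a sketch under assumptions:
dup() stands in for fork() (both give descriptors that share one file
description), and the function name, the scratch file and the 32-page
layout are invented here.

```c
#define _XOPEN_SOURCE 700   /* mkstemp() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_BYTES  0x1000
#define NPAGES      32

/*
 * Replay "seek A, seek B, read A, read B" on one shared file description.
 * Every page of the scratch file is filled with its own page number, so
 * the first byte read identifies the page.  Stores the page numbers that
 * were actually read in *got_a and *got_b.  Returns 0 on success, -1 on
 * failure.
 */
int
replay_interleaving(int want_a, int want_b, int *got_a, int *got_b)
{
    char path[] = "/tmp/race_replay.XXXXXX";    /* hypothetical name */
    unsigned char page[PAGE_BYTES];
    int fd_a, fd_b, i, rc = -1;

    assert(want_a >= 0 && want_a < NPAGES - 1);
    assert(want_b >= 0 && want_b < NPAGES - 1);

    if ((fd_a = mkstemp(path)) == -1)
        return (-1);
    unlink(path);
    if ((fd_b = dup(fd_a)) == -1) {
        close(fd_a);
        return (-1);
    }
    for (i = 0; i < NPAGES; i++) {
        memset(page, i, sizeof(page));
        if (write(fd_a, page, sizeof(page)) != (ssize_t)sizeof(page))
            goto out;
    }

    /* Same order as the trace: both seeks land before either read. */
    if (lseek(fd_a, (off_t)want_a * PAGE_BYTES, SEEK_SET) == -1 ||
        lseek(fd_b, (off_t)want_b * PAGE_BYTES, SEEK_SET) == -1)
        goto out;
    if (read(fd_a, page, sizeof(page)) != (ssize_t)sizeof(page))
        goto out;
    *got_a = page[0];
    if (read(fd_b, page, sizeof(page)) != (ssize_t)sizeof(page))
        goto out;
    *got_b = page[0];
    rc = 0;
out:
    close(fd_a);
    close(fd_b);
    return (rc);
}
```

With want_a = 30 (offset 0x1e000) and want_b = 8 (offset 0x8000), as in
the trace, this should report page 8 for the first reader and page 9
for the second.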

>How-To-Repeat:

Set up a yp server and barrage it with as many yp_all requests as
you can.  The simplest way to do this is to fork off a bunch of `ypcat
passwd' processes from a client machine (20 is usually sufficient).

You've got to get enough going at a time that the forked children will
respond to a bunch of them at once and interleave their reads from the 
database.

>Fix:

NetBSD's fix to this problem is to introduce a new system call,
pread, which takes an offset so that the lseek and read can be done
atomically.
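
For reference, a positioned read of one hash page might look like the
sketch below.  pread() is the standard POSIX call; read_page_at() and
fetch_with_stray_offset() are hypothetical names, and this is not
NetBSD's actual change to hash_page.c.

```c
#define _XOPEN_SOURCE 700   /* pread(), mkstemp() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read `size` bytes of page number `page` (pages are 1 << shift bytes)
 * with a single positioned read.  pread() takes the offset as an
 * argument and neither uses nor updates the shared file offset.
 * Returns the byte count read, or -1 on error, like read().
 */
ssize_t
read_page_at(int fd, unsigned int page, int shift, void *buf, size_t size)
{
    assert(buf != NULL && shift >= 0 && shift < 32);
    return (pread(fd, buf, size, (off_t)page << shift));
}

/*
 * Usage check: build a scratch file whose pages hold their own page
 * number, leave the shared offset somewhere else, and fetch `page`.
 * Returns the page number found in the data, or -1 on failure.
 */
int
fetch_with_stray_offset(unsigned int page)
{
    char path[] = "/tmp/pread_demo.XXXXXX";     /* hypothetical name */
    unsigned char buf[1 << 12];
    int fd, i, got = -1;

    assert(page < 8);
    if ((fd = mkstemp(path)) == -1)
        return (-1);
    unlink(path);
    for (i = 0; i < 8; i++) {
        memset(buf, i, sizeof(buf));
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
            goto out;
    }
    /* Pretend some other process has moved the shared offset. */
    if (lseek(fd, 0, SEEK_SET) == -1)
        goto out;
    if (read_page_at(fd, page, 12, buf, sizeof(buf)) ==
        (ssize_t)sizeof(buf))
        got = buf[0];
out:
    close(fd);
    return (got);
}
```

The point of the design is that there is no window between positioning
and reading for another process to get into.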

Some other fixes are either to put something in the manpage that warns
about making concurrent accesses to shared open databases and how you
will eventually lose, or to lock the file descriptor before the lseek
and unlock it after the read.  The problem with the first is that
it'll be easy to miss and the problem with the second is that you have
to have the fd opened read/write to get an exclusive lock. =)
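
The read/write requirement can be checked with a small probe.  This is
only a sketch: the function names and the scratch file are made up, and
it relies on POSIX specifying EBADF for an F_WRLCK request on a
descriptor that is not open for writing.

```c
#define _XOPEN_SOURCE 700   /* mkstemp() under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to take a whole-file exclusive lock without blocking.
 * Returns 0 on success, or the errno value from fcntl() on failure.
 */
int
try_exclusive_lock(int fd)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof(fl));     /* l_start = l_len = 0: whole file */
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    return (fcntl(fd, F_SETLK, &fl) == -1 ? errno : 0);
}

/*
 * Open a scratch file with the given open(2) access mode and report
 * what try_exclusive_lock() says about it.  Returns -1 if the scratch
 * file could not be set up.
 */
int
lock_errno_for_mode(int oflags)
{
    char path[] = "/tmp/lock_demo.XXXXXX";      /* hypothetical name */
    int fd, probe, err;

    if ((fd = mkstemp(path)) == -1)
        return (-1);
    probe = open(path, oflags);
    unlink(path);
    close(fd);
    if (probe == -1)
        return (-1);
    err = try_exclusive_lock(probe);
    close(probe);
    return (err);
}
```

lock_errno_for_mode(O_RDONLY) is expected to come back with EBADF and
lock_errno_for_mode(O_RDWR) with 0, which is why a locking fix forces
the database to be opened read/write.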

If you only want to fix ypserv, then closing and reopening the
database on a fork should do the trick, since the problem doesn't show
up if you're not sharing a file pointer.
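
A sketch of why reopening helps, using a plain file instead of a real
database (the function name and scratch path are hypothetical, and this
is not ypserv code): the child closes the inherited descriptor and
opens its own, so its seeks land on a separate file description.

```c
#define _XOPEN_SOURCE 700   /* mkstemp() under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that reopens the file before seeking to `target`, then
 * report the offset the parent sees on its own descriptor.
 * Returns -1 on any failure.
 */
off_t
parent_offset_after_child_reopen(off_t target)
{
    char path[] = "/tmp/reopen_demo.XXXXXX";    /* hypothetical name */
    int fd, status;
    pid_t pid;
    off_t seen = -1;

    assert(target >= 0);
    if ((fd = mkstemp(path)) == -1)
        return (-1);
    if ((pid = fork()) == -1)
        goto out;
    if (pid == 0) {
        int own;

        /* Child: drop the shared descriptor, open a private one. */
        close(fd);
        if ((own = open(path, O_RDONLY)) == -1)
            _exit(1);
        _exit(lseek(own, target, SEEK_SET) == -1 ? 1 : 0);
    }
    if (waitpid(pid, &status, 0) != -1 && WIFEXITED(status) &&
        WEXITSTATUS(status) == 0)
        seen = lseek(fd, 0, SEEK_CUR);  /* parent's own offset */
out:
    unlink(path);
    close(fd);
    return (seen);
}
```

Here parent_offset_after_child_reopen(4096) should return 0: the
child's seek no longer touches the parent's file pointer.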

>Release-Note:
>Audit-Trail:

From: David G Andersen <danderse@cs.utah.edu>
To: freebsd-gnats-submit@freebsd.org
Cc: danderse@cs.utah.edu
Subject: Re: bin/10971: ypserv segfaults regularly (really: Race condition in the Berkeley db library).
Date: Fri, 30 Apr 1999 10:02:43 -0600 (MDT)

 As I mentioned in email to -hackers, this is the ugly,
 don't-commit-because-it-changes-the-DB-semantics fix we use to
 make ypserv work for us.  To apply it, compile a *separate* libc with the
 patched hash_page.c (that's lib/libc/db/hash/hash_page.c if you're keeping
 track), and then patch your yp_dblookup.c so it opens the database
 readwrite, so it can obtain an exclusive (read) lock on the database.
 Recompile ypserv and link it statically against the separate libc, and
 voila, life is good.
 
 Alternately, change hash_page.c (and the equivalent functions in the other
 DB routines) to use pread with an absolute offset, instead of doing the
 lseek()/read() race condition combo.  This only works for people in
 -current, but it's a better and faster fix.
 
    -Dave
 
 *** yp_dblookup.c       1998/02/11 19:15:32     1.15
 --- yp_dblookup.c       1999/04/13 23:51:44
 ***************
 *** 414,420 ****
   #ifdef DB_CACHE
   again:
   #endif
 !       dbp = dbopen(buf,O_RDONLY, PERM_SECURE, DB_HASH, NULL);
   
         if (dbp == NULL) {
                 switch(errno) {
 --- 414,420 ----
   #ifdef DB_CACHE
   again:
   #endif
 !       dbp = dbopen(buf, O_RDWR, PERM_SECURE, DB_HASH, NULL);
   
         if (dbp == NULL) {
                 switch(errno) {
 
 
 *** hash_page.c Fri Apr  2 15:36:00 1999
 --- hash_page.c.locking Thu Apr  1 22:01:49 1999
 ***************
 *** 524,529 ****
 --- 524,530 ----
         register int fd, page, size;
         int rsize;
         u_int16_t *bp;
 +       struct flock fl;
   
         fd = hashp->fp;
         size = hashp->BSIZE;
 ***************
 *** 536,544 ****
 --- 537,556 ----
                 page = BUCKET_TO_PAGE(bucket);
         else
                 page = OADDR_TO_PAGE(bucket);
 + 
 +       fl.l_start = fl.l_len = fl.l_pid = 0;
 +       fl.l_type = F_WRLCK;
 +       fl.l_whence = SEEK_SET;
 + 
 +       fcntl(fd, F_SETLKW, &fl);
 + 
         if ((lseek(fd, (off_t)page << hashp->BSHIFT, SEEK_SET) == -1) ||
             ((rsize = read(fd, p, size)) == -1))
                 return (-1);
 + 
 +       fl.l_type = F_UNLCK;
 +       fcntl(fd, F_SETLK, &fl);
 + 
         bp = (u_int16_t *)p;
         if (!rsize)
                 bp[0] = 0;      /* We hit the EOF, so initialize a new page */
 
 
State-Changed-From-To: open->closed 
State-Changed-By: wpaul 
State-Changed-When: Fri Apr 30 10:00:20 PDT 1999 
State-Changed-Why:  

Okay, put _two_ 'stupid' stickers next to my name. The simplest solution 
for ypserv is just to call yp_flush_all() in ypproc_all_2_svc() right 
after the fork(). This closes all the DB handles in the child;  
yp_select_map() will re-open the correct one for the transfer later. 

This is another problem that I was never able to trip on my test machine. 
_However_ I'm really cheesed off at this one because I did run into it 
on another platform: I ported ypserv to IRIX at one point, though I used 
ndbm there instead of Berkeley DB. I traced down this exact problem and 
fixed it with this exact patch (I just checked the code) but I forgot 
about it, possibly thinking that Berkeley DB didn't suffer from the same  
limitation as ndbm. Bah. If anybody inconvenienced by this shows up in 
New York, track me down because I owe you an apology and a free meal. 

-Bill 
>Unformatted:
