From nobody  Fri Oct 30 04:37:41 1998
Received: (from nobody@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id EAA00786;
          Fri, 30 Oct 1998 04:37:41 -0800 (PST)
          (envelope-from nobody)
Message-Id: <199810301237.EAA00786@hub.freebsd.org>
Date: Fri, 30 Oct 1998 04:37:41 -0800 (PST)
From: info@highwind.com
To: freebsd-gnats-submit@freebsd.org
Subject: FreeBSD 3.0 thread scheduler is broken
X-Send-Pr-Version: www-1.0

>Number:         8500
>Category:       kern
>Synopsis:       FreeBSD 3.0 thread scheduler is broken
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Oct 30 04:40:00 PST 1998
>Closed-Date:    Wed Aug 4 04:43:42 PDT 1999
>Last-Modified:  Wed Aug  4 04:46:22 PDT 1999
>Originator:     Robert Fleischman
>Release:        3.0
>Organization:
HighWind Software, Inc.
>Environment:
FreeBSD zonda.highwind.com 3.0-19980831-SNAP FreeBSD 3.0-19980831-SNAP #0: Mon Aug 31 14:03:19 GMT 1998     root@make.ican.net:/usr/src/sys/compile/GENERIC  i386

>Description:
When an application has a thread that is I/O intensive, that thread
unfairly steals cycles from all other threads. This makes it impractical
to write an MT program that does any real amount of I/O.

>How-To-Repeat:
Just run this test program:

/*****************************************************************************
File:     schedBug.C
Contents: FreeBSD Scheduling Bug Illustrator
Created:  28-Oct-1998

This program SHOULD print "Marking Time: 1", etc. However, the thread
scheduler appears to NOT schedule the markTimeThread because the
ioThread is so busy.

If you uncomment the "::pthread_yield()" it works a little
better. Ideally, you should get a print every second.

g++ -o schedBug -D_REENTRANT -D_THREAD_SAFE -g -Wall schedBug.C -pthread 

*****************************************************************************/

#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

unsigned int LENGTH = 1024 * 1024;

void *ioThread(void *)
{
    char *data = new char[LENGTH];
    ::memset(data, 0, LENGTH);

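    while (true) {
	int file = ::open("scrap", O_CREAT | O_TRUNC | O_WRONLY, 0666);
	assert(file != -1);
	// Keep the syscalls outside assert() so they still run under -DNDEBUG.
	ssize_t written = ::write(file, data, LENGTH);
	assert(written == static_cast<ssize_t>(LENGTH));
	// ::pthread_yield();
	int closed = ::close(file);
	assert(closed == 0);
    }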
}

void *markTimeThread(void *)
{
    time_t start = ::time(0);
    while (true) {
	timeval timeout;
	timeout.tv_sec = 1;
	timeout.tv_usec = 0;
	::select(0, 0, 0, 0, &timeout); 

	::printf("Marking Time: %ld\n", static_cast<long>(::time(0) - start));
    }
}

int main(int, char **)
{
    // Set up Thread Arguments
    pthread_t tid;
    pthread_attr_t attr;
    // Calls are kept outside assert() so they still run under -DNDEBUG.
    int rc = ::pthread_attr_init(&attr);
    assert(rc == 0);
    rc = ::pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    assert(rc == 0);

    // Spawn markTimeThread
    rc = ::pthread_create(&tid, &attr, markTimeThread, 0);
    assert(rc == 0);
    // Spawn ioThread
    rc = ::pthread_create(&tid, &attr, ioThread, 0);
    assert(rc == 0);

    // main() goes away for a long time
    timeval timeout;
    timeout.tv_sec = 3600;
    timeout.tv_usec = 0;
    ::select(0, 0, 0, 0, &timeout); 

    return EXIT_SUCCESS;
}

>Fix:
I do not know of a fix. However, the problem appears to be related to the
use of the virtual interval timer (SIGVTALRM), which measures only
user-space time and not kernel-space time.
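
The suspicion above can be probed with a small sketch. ITIMER_VIRTUAL
(which delivers SIGVTALRM) runs only while the process executes in user
mode, whereas ITIMER_PROF (SIGPROF) also counts time the system spends on
behalf of the process. Whether the thread library's scheduling tick really
depends on the virtual timer is the hypothesis of this report, not something
the sketch proves. The names countTicks, onTick and elapsedMillis are
hypothetical helpers, and the 10 ms period is an arbitrary choice.

```cpp
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t tickCount = 0;

extern "C" void onTick(int) { tickCount = tickCount + 1; }

// Wall-clock milliseconds elapsed since `from`.
static long elapsedMillis(const timeval &from)
{
    timeval now;
    ::gettimeofday(&now, 0);
    return (now.tv_sec - from.tv_sec) * 1000L
         + (now.tv_usec - from.tv_usec) / 1000L;
}

// Hypothetical helper: arm interval timer `which` (which delivers `signo`)
// with a 10 ms period, spend about `millis` ms of wall-clock time in a
// syscall-heavy loop, and return the number of ticks seen (-1 on error).
int countTicks(int which, int signo, long millis)
{
    assert(millis > 0);
    int fd = ::open("/dev/null", O_WRONLY);
    if (fd == -1)
        return -1;

    struct sigaction action;
    ::memset(&action, 0, sizeof(action));
    action.sa_handler = onTick;
    action.sa_flags = SA_RESTART;
    sigemptyset(&action.sa_mask);

    itimerval period;
    period.it_interval.tv_sec = 0;
    period.it_interval.tv_usec = 10000;
    period.it_value = period.it_interval;

    tickCount = 0;
    if (::sigaction(signo, &action, 0) == -1 ||
        ::setitimer(which, &period, 0) == -1) {
        ::close(fd);
        return -1;
    }

    // Most of this loop is spent inside write(), i.e. in the kernel.
    char block[4096];
    ::memset(block, 0, sizeof(block));
    timeval start;
    ::gettimeofday(&start, 0);
    bool ok = true;
    while (ok && elapsedMillis(start) < millis)
        for (int i = 0; ok && i < 1000; ++i)
            ok = ::write(fd, block, sizeof(block))
                 == static_cast<ssize_t>(sizeof(block));

    // Disarm the timer before returning.
    itimerval off;
    ::memset(&off, 0, sizeof(off));
    ::setitimer(which, &off, 0);
    ::close(fd);
    return ok ? static_cast<int>(tickCount) : -1;
}
```

Comparing countTicks(ITIMER_VIRTUAL, SIGVTALRM, 1000) with
countTicks(ITIMER_PROF, SIGPROF, 1000) shows how much of the I/O loop each
timer accounts for on a given system.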

>Release-Note:
>Audit-Trail:

From: Peter Wemm <peter@netplex.com.au>
To: info@highwind.com
Cc: freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/8500: FreeBSD 3.0 thread scheduler is broken 
Date: Sat, 31 Oct 1998 14:50:24 +0800

 info@highwind.com wrote:
 > >Number:         8500
 > >Category:       kern
 > >Synopsis:       FreeBSD 3.0 thread scheduler is broken
 > 
 > >Description:
 > When an application has threads that are I/O intensive, that thread
 > unfairly steals cycles from all other threads. This makes writing
 > an MT program that does any real amount of I/O impossible.
 
 Yes, open(), read(), write(), etc. block the entire process.  The libc_r 
 thread engine only works for things that can be select()ed upon, and
 read/write cannot (yet).
 
 The only alternatives are to use the aio/lio syscalls (which work), or 
 rfork().  libc_r could probably be modified to use rfork() to have the 
 read/write/open/close/etc done in parallel.
 
 Cheers,
 -Peter
 
 

From: HighWind Software Information <info@highwind.com>
To: peter@netplex.com.au
Cc: freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/8500: FreeBSD 3.0 thread scheduler is broken
Date: Sat, 31 Oct 1998 09:20:46 -0500 (EST)

    > >Number:         8500
    > >Category:       kern
    > >Synopsis:       FreeBSD 3.0 thread scheduler is broken
    > 
    > >Description:
    > When an application has threads that are I/O intensive, that thread
    > unfairly steals cycles from all other threads. This makes writing
    > an MT program that does any real amount of I/O impossible.
 
    Yes, open(), read(), write(), etc. block the entire process.  The libc_r 
    thread engine only works for things that can be select()ed upon, and
    read/write cannot (yet).
 
 Ummm. Not to be rude.. But...
 
 That is NOT TRUE AT ALL. read() and write() CERTAINLY are selected
 upon and do NOT block the whole process when using libc_r.
 
 Read /usr/src/lib/libc_r/uthread/uthread_write.c and see for yourself.
  
    The only alternatives are to use the aio/lio syscalls (which work), or 
    rfork().  libc_r could probably be modified to use rfork() to have the 
    read/write/open/close/etc done in parallel.
 
 I don't think that is necessary.
 
 -Rob

From: Peter Wemm <peter@netplex.com.au>
To: HighWind Software Information <info@highwind.com>
Cc: freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/8500: FreeBSD 3.0 thread scheduler is broken 
Date: Sun, 01 Nov 1998 03:08:44 +0800

 HighWind Software Information wrote:
 > 
 >    > >Number:         8500
 >    > >Category:       kern
 >    > >Synopsis:       FreeBSD 3.0 thread scheduler is broken
 >    > 
 >    > >Description:
 >    > When an application has threads that are I/O intensive, that thread
 >    > unfairly steals cycles from all other threads. This makes writing
 >    > an MT program that does any real amount of I/O impossible.
 > 
 >    Yes, open(), read(), write(), etc. block the entire process.  The libc_r 
 >    thread engine only works for things that can be select()ed upon, and
 >    read/write cannot (yet).
 > 
 > Ummm. Not to be rude.. But...
 > 
 > That is NOT TRUE AT ALL. read() and write() CERTAINLY are selected
 > upon and do NOT block the whole process when using libc_r.
 > 
 > Read /usr/src/lib/libc_r/uthread/uthread_write.c and see for yourself.
 
 Yes, but only if the file descriptor itself supports O_NONBLOCK mode.
 
 /* called by open() wrapper */
 _thread_fd_table_init(int fd)
 { ...
                 /* Get the flags for the file: */
                 if (fd >= 3 && (entry->flags =
                     _thread_sys_fcntl(fd, F_GETFL, 0)) == -1) {
                         ret = -1;
                     }
                 else {
  ...
 }
 
 And in write(), it just calls the write system call:
 write()
 {
 ....
                 /* Check if file operations are to block */
                 blocking = ((_thread_fd_table[fd]->flags & O_NONBLOCK) == 0);
 
                 /*
                  * Loop while no error occurs and until the expected number
                  * of bytes are written if performing a blocking write:
                  */
                 while (ret == 0) {
                         /* Perform a non-blocking write syscall: */
                             ^^^^^^^^^^^^^^^^^ - only if opened in O_NONBLOCK
                         n = _thread_sys_write(fd, buf + num, nbytes - num);
 
                         /* Check if one or more bytes were written: */
                         if (n > 0)
 
 It's similar for read().
 
 There are a couple of big ifs so far.  *If* you open the file in O_NONBLOCK 
 mode specifically, then you get non-blocking read/write syscalls.  The 
 syscalls themselves are only non-blocking *if* the underlying fd supports 
 it.  Sockets and pipes support it.  Files (at least on ufs/ffs) do not.  
 Whether or not you ask for O_NONBLOCK, read() and write() on a disk file 
 will always block.
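
The second "if" above can be observed from user space. POSIX specifies
that a regular file always selects as ready for reading and writing, so a
select()-based thread library never gets a chance to park a thread on it.
The sketch below opens a file with O_NONBLOCK and asks select() about it
with a zero timeout. The helper name regularFileAlwaysWritable is
hypothetical.

```cpp
#include <assert.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: open `path` as a regular file with O_NONBLOCK and
// report whether select() immediately says it is writable.
bool regularFileAlwaysWritable(const char *path)
{
    assert(path != 0);
    int fd = ::open(path, O_CREAT | O_TRUNC | O_WRONLY | O_NONBLOCK, 0666);
    if (fd == -1)
        return false;

    // The flag is accepted and recorded on the descriptor...
    int flags = ::fcntl(fd, F_GETFL, 0);
    if (flags == -1 || (flags & O_NONBLOCK) == 0) {
        ::close(fd);
        return false;
    }

    // ...but select() has nothing to wait for: with a zero timeout it
    // reports the descriptor as ready straight away.
    fd_set writable;
    FD_ZERO(&writable);
    FD_SET(fd, &writable);
    timeval timeout;
    timeout.tv_sec = 0;
    timeout.tv_usec = 0;
    int ready = ::select(fd + 1, 0, &writable, 0, &timeout);
    bool result = (ready == 1) && FD_ISSET(fd, &writable);

    ::close(fd);
    return result;
}
```

A subsequent write() on such a descriptor still waits for the disk inside
the kernel, which is the behaviour described above.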
 
 >    The only alternatives are to use the aio/lio syscalls (which work), or 
 >    rfork().  libc_r could probably be modified to use rfork() to have the 
 >    read/write/open/close/etc done in parallel.
 > 
 > I don't think that is necessary.
 
 It is if you want the threading to continue while the disk is grinding 
 away.  aio_read() and aio_write() would probably be enough to help file 
 IO, but open() will still be blocking.
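
As a rough sketch of the aio alternative mentioned above, the following
queues a write with the POSIX AIO calls declared in <aio.h> and then waits
for it to complete. It assumes a system where these calls are available to
user programs. On some systems they live in a separate library (for
example, older glibc needs -lrt). The helper name writeWithAio is
hypothetical, and open() is still a plain blocking call here.

```cpp
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: write `length` bytes from `data` to `path` using
// POSIX AIO. Returns the number of bytes written, or -1 on error.
ssize_t writeWithAio(const char *path, const char *data, size_t length)
{
    assert(path != 0 && data != 0);
    int fd = ::open(path, O_CREAT | O_TRUNC | O_WRONLY, 0666);
    if (fd == -1)
        return -1;

    aiocb request;
    ::memset(&request, 0, sizeof(request));
    request.aio_fildes = fd;
    request.aio_buf = const_cast<char *>(data);
    request.aio_nbytes = length;
    request.aio_offset = 0;
    request.aio_sigevent.sigev_notify = SIGEV_NONE;  // no completion signal

    // Queue the write; the call returns without waiting for the disk.
    if (::aio_write(&request) == -1) {
        ::close(fd);
        return -1;
    }

    // A real program would run other work here. This sketch simply
    // sleeps in aio_suspend() until the request is no longer in progress.
    const aiocb *pending[1] = { &request };
    while (::aio_error(&request) == EINPROGRESS)
        ::aio_suspend(pending, 1, 0);

    // aio_return() is called exactly once, after completion.
    int error = ::aio_error(&request);
    ssize_t written = ::aio_return(&request);
    ::close(fd);
    return error == 0 ? written : -1;
}
```

In a threaded program the polling loop would be replaced by whatever the
scheduler uses to resume the waiting thread once aio_error() stops
returning EINPROGRESS.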
 
 Squid has some fairly extensive async disk-IO routines.  They happen to 
 use pthreads as a mechanism of having child processes do the blocking 
 work.  FreeBSD could use rfork() for arranging the blocking stuff in child 
 processes with shared address space.  It would be a lot of work though, 
 and would be a problem on SMP systems.
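
The helper approach described above can be sketched as follows: a worker
performs the blocking write and then sends one byte down a pipe, so the
caller can wait for completion with select() like any other event. This is
not Squid's code. It assumes the worker is scheduled independently of the
caller (kernel-backed threads, or an rfork()ed process as suggested above).
With a purely user-level thread library the worker's write() would still
stall the whole process. WriteJob, blockingWriter and writeViaHelper are
hypothetical names.

```cpp
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

struct WriteJob {
    const char *path;
    const char *data;
    size_t length;
    int notifyFd;    // write end of the completion pipe
    ssize_t result;  // bytes written, or -1 on error
};

// Worker: do the blocking file I/O, then announce completion on the pipe.
extern "C" void *blockingWriter(void *arg)
{
    WriteJob *job = static_cast<WriteJob *>(arg);
    job->result = -1;
    int fd = ::open(job->path, O_CREAT | O_TRUNC | O_WRONLY, 0666);
    if (fd != -1) {
        job->result = ::write(fd, job->data, job->length);
        ::close(fd);
    }
    char done = 1;
    if (::write(job->notifyFd, &done, 1) != 1)
        job->result = -1;
    return 0;
}

// Caller: hand the write to the worker and wait on the pipe with select().
ssize_t writeViaHelper(const char *path, const char *data, size_t length)
{
    assert(path != 0 && data != 0);
    int fds[2];
    if (::pipe(fds) == -1)
        return -1;

    WriteJob job = { path, data, length, fds[1], -1 };
    pthread_t tid;
    if (::pthread_create(&tid, 0, blockingWriter, &job) != 0) {
        ::close(fds[0]);
        ::close(fds[1]);
        return -1;
    }

    // Pipes support select(), so this wait can be multiplexed with
    // sockets and timers in an event loop.
    fd_set readable;
    int ready;
    do {
        FD_ZERO(&readable);
        FD_SET(fds[0], &readable);
        ready = ::select(fds[0] + 1, &readable, 0, 0, 0);
    } while (ready == -1 && errno == EINTR);

    char done = 0;
    bool notified = (ready == 1) && (::read(fds[0], &done, 1) == 1);

    ::pthread_join(tid, 0);
    ::close(fds[0]);
    ::close(fds[1]);
    return notified ? job.result : -1;
}
```

The pipe is the important design choice: it turns completion of disk I/O,
which select() cannot report, into an event on a descriptor that it can.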
 
 > -Rob
 
 Cheers,
 -Peter
 
 

From: Tony Finch <dot@dotat.at>
To: peter@netplex.com.au
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: kern/8500: FreeBSD 3.0 thread scheduler is broken 
Date: Mon, 2 Nov 1998 17:41:39 +0000

 Peter Wemm <peter@netplex.com.au> wrote:
 > HighWind Software Information wrote:
 >
 > >    The only alternatives are to use the aio/lio syscalls (which work), or
 > >    rfork().  libc_r could probably be modified to use rfork() to have the
 > >    read/write/open/close/etc done in parallel.
 > >
 > > I don't think that is necessary.
 >
 > It is if you want the threading to continue while the disk is grinding
 > away.  aio_read() and aio_write() would probably be enough to help file
 > IO, but open() will still be blocking.
 >
 > Squid has some fairly extensive async disk-IO routines.  They happen to
 > use pthreads as a mechanism of having child processes do the blocking
 > work.  FreeBSD could use rfork() for arranging the blocking stuff in child
 > processes with shared address space.  It would be a lot of work though,
 > and would be a problem on SMP systems.
 
 We have been trying out Squid on 3.0 because of the possibilities
 offered by async IO, but so far we haven't managed to get it to work
 satisfactorily. I was also thinking about the possibility of using
 rfork() to implement threads -- the Linux pthreads implementation does
 this (except that Linux has clone() instead of rfork() and the
 interface is slightly different).
 
 What are the SMP issues?
 
 Tony.
 -- 
                    gg yhf**f.a.n.finch
                                     fanf@demon.net
                  dot@dotat.at

From: Peter Wemm <peter@netplex.com.au>
To: Tony Finch <dot@dotat.at>
Cc: freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/8500: FreeBSD 3.0 thread scheduler is broken 
Date: Tue, 03 Nov 1998 12:54:48 +0800

 Tony Finch wrote:
 > Peter Wemm <peter@netplex.com.au> wrote:
 > > HighWind Software Information wrote:
 > >
 > > >    The only alternatives are to use the aio/lio syscalls (which work), or
 > > >    rfork().  libc_r could probably be modified to use rfork() to have the
 > > >    read/write/open/close/etc done in parallel.
 > > >
 > > > I don't think that is necessary.
 > >
 > > It is if you want the threading to continue while the disk is grinding
 > > away.  aio_read() and aio_write() would probably be enough to help file
 > > IO, but open() will still be blocking.
 > >
 > > Squid has some fairly extensive async disk-IO routines.  They happen to
 > > use pthreads as a mechanism of having child processes do the blocking
 > > work.  FreeBSD could use rfork() for arranging the blocking stuff in child
 > > processes with shared address space.  It would be a lot of work though,
 > > and would be a problem on SMP systems.
 > 
 > We have been trying out Squid on 3.0 because of the possibilities
 > offered by async IO, but so far we haven't managed to get it to work
 > satisfactorily. I was also thinking about the possibility of using
 > rfork() to implement threads -- the Linux pthreads implementation does
 > this (except that Linux has clone() instead of rfork() and the
 > interface is slightly different).
 > 
 > What are the SMP issues?
 
 Pretty dramatic, i.e. it doesn't work. :-(
 
 The reason is that under SMP, there is a per-cpu page table directory slot 
 that is changed each context switch.  We store a heap of per-cpu variables 
 here (with more to come), including the virtual cpuid.
 
 With a shared address space rfork(), the same PTD, page tables and pages 
 are used in both processes.  If the two CPUs happened to run the two 
 processes at the same time, one CPU would clobber the other CPU's private 
 PTD slot and both would end up using the same private pages.  This kills 
 the system on the spot, as both CPUs think they are the same CPU.
 
 For this reason, fast vfork is disabled and rfork() in shared address space
 mode returns an error.
 
 There is not a simple fix for this.  There is a possibility that loading 
 the MPPTDI slot after gaining the giant kernel lock could be made to work 
 as a short-term fix, but obviously that fails when the giant kernel lock 
 starts to go away, and something needs to be done about fast interrupts 
 and the boundary code that runs outside the kernel lock.
 
 Longer term fixes include drastic VM (pmap and support) modifications:
  - have separate address spaces for the kernel and user.  This isn't such 
 a bad option as it positions us for very large memory systems very well.  
 The kernel would load and run at 0x00100000 rather than 0xf0100000, and 
 would have one PTD[] for each CPU.  Each process could have 4GB of address 
 space, rather than having to leave room for the kernel to live at the top 
 of it.  Needless to say this is a fair amount of work. :-)
  - have multiple PTDs for each shared address space, up to the number of 
 present cpus.  ie: if an address space was rforked for 20 threads, but you 
 had 4 CPUs, then you need 4 PTDs.
 
 Neither of these has been attempted yet, but the second is probably the
 simpler of the two, while the first is probably the best for future
 capabilities.  It would give us a lot more room to move on the large-memory
 PPro and PII systems with 36 bits of physical address space (64GB of RAM).
 
 > Tony.
 
 Cheers,
 -Peter
 
 

From: HighWind Software Information <info@highwind.com>
To: peter@netplex.com.au
Cc: freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/8500: FreeBSD 3.0 thread scheduler is broken
Date: Sat, 7 Nov 1998 09:08:05 -0500 (EST)

 Okay.. I buy your explanation. FreeBSD's low level read()/write() are
 always blocking through the kernel when talking to UFS. That certainly
 will make it hard to do task switching.
 
 I think the right thing to do is to count that time and allow that time
 to affect when the thread switch happens.
 
 HOWEVER, I'm certainly willing to try aio_read/aio_write. WHERE IS
 THAT STUFF? I found "sys/aio.h", however, I can't find any library that
 has it. Nor can I find "aio_read" in any .c files in /usr/src.
 
 -Rob
State-Changed-From-To: open->closed 
State-Changed-By: deischen 
State-Changed-When: Wed Aug 4 04:43:42 PDT 1999 
State-Changed-Why:  
This was fixed in both -stable and -current.  The supplied 
test program works as expected. 
>Unformatted:
