From nobody@FreeBSD.org  Fri Apr  8 18:37:04 2005
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A2F3316A4CE
	for <freebsd-gnats-submit@FreeBSD.org>; Fri,  8 Apr 2005 18:37:04 +0000 (GMT)
Received: from www.freebsd.org (www.freebsd.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 2558043D5F
	for <freebsd-gnats-submit@FreeBSD.org>; Fri,  8 Apr 2005 18:37:04 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.13.1/8.13.1) with ESMTP id j38Ib4L0015023
	for <freebsd-gnats-submit@FreeBSD.org>; Fri, 8 Apr 2005 18:37:04 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.13.1/8.13.1/Submit) id j38Ib3Ap015022;
	Fri, 8 Apr 2005 18:37:03 GMT
	(envelope-from nobody)
Message-Id: <200504081837.j38Ib3Ap015022@www.freebsd.org>
Date: Fri, 8 Apr 2005 18:37:03 GMT
From: Steve Sears <stevenjsears@yahoo.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: SMP: msleep and sleepq_broadcast race
X-Send-Pr-Version: www-2.3

>Number:         79693
>Category:       kern
>Synopsis:       SMP: msleep and sleepq_broadcast race
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    jhb
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Apr 08 18:40:05 GMT 2005
>Closed-Date:    Wed Nov 23 18:36:08 GMT 2005
>Last-Modified:  Wed Nov 23 18:36:08 GMT 2005
>Originator:     Steve Sears
>Release:        5.3-STABLE, 5.4-STABLE
>Organization:
self
>Environment:
FreeBSD sjs-bsd 5.4-STABLE FreeBSD 5.4-STABLE #17: Thu Apr  7 08:15:24 EDT 2005     root@sjs-bsd:/usr/src/sys-sjs/i386/compile/SJSKERN  i386
>Description:
      Failure symptom:

sleepq_broadcast spins in an infinite loop, due to its local temporary queue (list) getting corrupted.

Conditions which cause the failure:

1. msleep is being called in a loop with the PCATCH flag set
2. at the same time wakeup is called on the same identifier

wakeup calls sleepq_broadcast, which dequeues the thread from the sleepq and queues the thread on a temporary list, while holding the sleepq lock. The lock is then dropped, and threads on the temporary list are scheduled to run. So the td_slpq field of the thread struct is being manipulated while the sleepq lock is not being held. This is safe if you assume no one else can manipulate that field in the meantime, but it's an invalid assumption. The caller of msleep can return without ever going to sleep, and upon a subsequent call to msleep, queue itself on the sleepq again, which manipulates the td_slpq field, and corrupts the temporary queue used by sleepq_broadcast().

So the race is as follows:

1. threadA calls msleep with PCATCH set
2. threadA is queued on sleepq
3. threadB calls wakeup on same identifier, which calls sleepq_broadcast
4. threadB moves threadA from sleepq to local temporary queue (list)
5. threadA calls sleepq_timedwait_sig, which calls sleepq_switch
6. threadA determines it has been woken up since td->td_sleepqueue != NULL
7. threadA returns from sleepq_switch, sleepq_timed_wait, msleep
8. threadA calls msleep with PCATCH set
9. threadA is queued on sleepq [CORRUPTION] Note threadA is still on
sleepq_broadcast's local temporary queue (list). This results in threadA
pointing back to itself on the sleepq.
10. threadA goes to sleep
11. threadB dequeues threadA from its local temporary queue (list) and
schedules it.
12. goto 11


Problem verification:
1> add a 'volatile int test_flags' to the thread strucutre.
2> test_flags = 1, in the first loop of the sleepq_broadcast
3> test_flags = 0, in the second loop (local queue processing)
4> panic in sleepq_switch 
<snip> 
if (td->td_sleepqueue != NULL) {
    if (td->test_flags)
        panic("msleep returning without sleeping, and race with broadcast\n");
}
</snip>


>How-To-Repeat:
      Create a thread.  have this thread run a while loop where it tsleep(my_thread_wait_channel, PZERO | PCATCH, "my_wait", hz), where kernel hz is set to 1000 (i believe this is the x86 default).

Send wakeups to this thread from an ithread at a high rate (1K per second?).  ithread vs a regular thread may not matter.  Eventually, the thread sending the wakeups will enter an infinite loop caused by a cycle in a sleepq.  If there are multiple threads sending wakeups, then the system may hang or NMI.


>Fix:
      Here's one solution that is probably faster than the original code.

It addresses the race while minimizing lock overhead.
 

Change sleepq_broadcast to:

void
sleepq_broadcast(void *wchan, int flags, int pri)
{
    struct sleepqueue *sq;
    struct thread *td;

    CTR2(KTR_PROC, "sleepq_broadcast(%p, %d)", wchan, flags);
    KASSERT(wchan != NULL, ("%s: invalid NULL wait channel", __func__));
    sq = sleepq_lookup(wchan);
    if (sq == NULL) {
        sleepq_release(wchan);
        return;
    }
    KASSERT(sq->sq_type == (flags & SLEEPQ_TYPE),
        ("%s: mismatch between sleep/wakeup and cv_*", __func__));
    /* XXX: Do for all sleep queues eventually. */
    if (flags & SLEEPQ_CONDVAR)
        mtx_assert(sq->sq_lock, MA_OWNED);

    while (!TAILQ_EMPTY(&sq->sq_blocked)) {
        td = TAILQ_FIRST(&sq->sq_blocked);
        sleepq_remove_and_resume_thread(sq, td);
    }
    sleepq_release(wchan);
}

Add this function:

/*
 * Removes and resumes thread from a sleep queue.
 */
static void
sleepq_remove_and_resume_thread(struct sleepqueue *sq, struct thread *td, int pri)
{
    struct sleepqueue_chain *sc;

    MPASS(td != NULL);
    MPASS(sq->sq_wchan != NULL);
    MPASS(td->td_wchan == sq->sq_wchan);
    sc = SC_LOOKUP(sq->sq_wchan);
    mtx_assert(&sc->sc_lock, MA_OWNED);

    /* Remove the thread from the queue. */
    TAILQ_REMOVE(&sq->sq_blocked, td, td_slpq);

    /*
     * Get a sleep queue for this thread.  If this is the last waiter,
     * use the queue itself and take it out of the chain, otherwise,
     * remove a queue from the free list.
     */
    if (LIST_EMPTY(&sq->sq_free)) {
        td->td_sleepqueue = sq;
#ifdef INVARIANTS
        sq->sq_wchan = NULL;
#endif
#ifdef SLEEPQUEUE_PROFILING
        sc->sc_depth--;
#endif
    } else
        td->td_sleepqueue = LIST_FIRST(&sq->sq_free);
    LIST_REMOVE(td->td_sleepqueue, sq_hash);

    mtx_lock_spin(&sched_lock);

    td->td_wmesg = NULL;
    td->td_wchan = NULL;

    CTR3(KTR_PROC, "sleepq_wakeup: thread %p (pid %ld, %s)",
        (void *)td, (long)td->td_proc->p_pid, td->td_proc->p_comm);
    TD_CLR_SLEEPING(td);

    /* Adjust priority if requested. */
    MPASS(pri == -1 || (pri >= PRI_MIN && pri <= PRI_MAX));
    if (pri != -1 && td->td_priority > pri)
        sched_prio(td, pri);
    setrunnable(td);

    mtx_unlock_spin(&sched_lock);
}
>Release-Note:
>Audit-Trail:

From: Steven Sears <stevenjsears@yahoo.com>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/79693: SMP: msleep and sleepq_broadcast race
Date: Fri, 8 Apr 2005 12:41:36 -0700 (PDT)

 Found some inaccuracies in my original write-up, sorry
 about that. Here's an accurate account:
 
 1. threadA calls msleep with PCATCH set
 2. threadA acquires sleepq lock
 3. threadA is queued on sleepq
 4. threadA calls sleepq_catch_signals, which drops the
 sleepq lock
 5. threadB calls wakeup on same identifier, which
 calls sleepq_broadcast
 6. threadB grabs sleepq lock
 7. threadB removes threadA from sleepq, which sets
 td->td_sleepqueue
 8. threadB queues threadA to local temporary queue
 (list)
 9. threadB drops sleepq lock
 10. threadA calls sleepq_timedwait_sig, which calls
 sleepq_sleep and
 determines it has been woken up since 
 td->td_sleepqueue != NULL
 11. threadA returns from sleepq_switch,
 sleepq_timedwait_sig, msleep
 12. threadA calls msleep with PCATCH set
 13. threadA is queued on sleepq [CORRUPTION] Note
 threadA is still on
 sleepq_broadcast's local temporary queue (list). This
 results in threadA
 pointing back to itself on the sleepq.
 14. threadA goes to sleep
 15. threadB dequeues threadA from its local temporary
 queue (list) and
 schedules it.
 16. goto 15
 
 
 
  -Steve
 
 
 __________________________________________________
 Do You Yahoo!?
 Tired of spam?  Yahoo! Mail has the best spam protection around 
 http://mail.yahoo.com 
Responsible-Changed-From-To: freebsd-bugs->jhb 
Responsible-Changed-By: rwatson 
Responsible-Changed-When: Sun Apr 10 18:36:51 GMT 2005 
Responsible-Changed-Why:  
Change ownership to jhb, as he current'y maintains the sleepq implementation 
in 6.x/5.x.  This looks like a very interesting problem, and it could 
explain several reports I've seen of wedges under very high load... 


http://www.freebsd.org/cgi/query-pr.cgi?pr=79693 

From: John Baldwin <jhb@FreeBSD.org>
To: freebsd-gnats-submit@FreeBSD.org, stevenjsears@yahoo.com
Cc:  
Subject: Re: kern/79693: SMP: msleep and sleepq_broadcast race
Date: Sun, 10 Apr 2005 17:29:07 -0400

 Thanks!  I believe this is the race that is the source of the 'td_sleepqueue 
 == NULL' panics under load as they always occurred inside of an sx lock which 
 is implemented via a while loop around cv_wait().  I also think that the 
 turnstiles used for mutexes have the same issue as I have gotten a 
 'td_turnstile == NULL' panic once under load with preemption enabled and it 
 uses td_lockq for a temporary queue as well.  It has to split up the 
 operations due to lock order concerns though, so it may not be as easily 
 solved as the sleep queue case.  I will look at fixing both of these 
 (hopefully in a manner that doesn't result in a lot of code duplication) this 
 week while I'm at USENIX.  Thanks again for the excellent report as I've been 
 trying to figure out what race was the cause of the aforementioned panics for 
 a while now. :-/
 
 -- 
 John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
 "Power Users Use the Power to Serve"  =  http://www.FreeBSD.org
State-Changed-From-To: open->closed 
State-Changed-By: jhb 
State-Changed-When: Wed Nov 23 18:30:47 GMT 2005 
State-Changed-Why:  
Fix applied and merged to 5.x prior to 5.4 release.  After reviewing the 
turnstile code I concluded it didn't suffer from the same problem which 
is why I originally left this PR open after fixing the sleepq race. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=79693 
>Unformatted:
