From nobody@FreeBSD.org  Sat Mar 29 23:07:26 2014
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
	(using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by hub.freebsd.org (Postfix) with ESMTPS id 02D6E573
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 29 Mar 2014 23:07:26 +0000 (UTC)
Received: from cgiserv.freebsd.org (cgiserv.freebsd.org [IPv6:2001:1900:2254:206a::50:4])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client did not present a certificate)
	by mx1.freebsd.org (Postfix) with ESMTPS id D6DB96C3
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 29 Mar 2014 23:07:25 +0000 (UTC)
Received: from cgiserv.freebsd.org ([127.0.1.6])
	by cgiserv.freebsd.org (8.14.8/8.14.8) with ESMTP id s2TN7POB011394
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 29 Mar 2014 23:07:25 GMT
	(envelope-from nobody@cgiserv.freebsd.org)
Received: (from nobody@localhost)
	by cgiserv.freebsd.org (8.14.8/8.14.8/Submit) id s2TN7Pwv011393;
	Sat, 29 Mar 2014 23:07:25 GMT
	(envelope-from nobody)
Message-Id: <201403292307.s2TN7Pwv011393@cgiserv.freebsd.org>
Date: Sat, 29 Mar 2014 23:07:25 GMT
From: Mathieu <sigsys@gmail.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: deadlock between syncache(4) and pf(4)
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         188063
>Category:       kern
>Synopsis:       [pf] [hang] deadlock between syncache(4) and pf(4)
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-pf
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Mar 29 23:10:00 UTC 2014
>Closed-Date:    
>Last-Modified:  Wed May  7 15:50:00 UTC 2014
>Originator:     Mathieu
>Release:        9.2-RELEASE-p3
>Organization:
>Environment:
FreeBSD 9.2-RELEASE-p3 amd64
>Description:
We have a server that becomes unresponsive every few weeks or so.  When
it happens, the NICs seem dead, and user processes hang in the "tcp"
state.  The only way to fix it is rebooting.  This time, I got it to dump
core before rebooting.

IIUC, there's a deadlock between the "swi1: netisr 0" and "swi4: clock"
threads involving an inpcb lock and a syncache_head lock: each thread
holds one of the two locks and is waiting for the other.

No idea where to go from there...


(kgdb) tid 100011
[Switching to thread 41 (Thread 100011)]#3  0xffffffff808ef68e in _mtx_lock_sleep (m=0xffffff80010a7088, tid=18446741874755597456, opts=<value optimized out>,
    file=<value optimized out>, line=<value optimized out>)
    at /usr/src/sys/kern/kern_mutex.c:466
466                     turnstile_wait(ts, mtx_owner(m), TS_EXCLUSIVE_QUEUE);
(kgdb) bt
#0  sched_switch (td=0xfffffe0004217490, newtd=0xfffffe0004209490,
    flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1920
#1  0xffffffff8090d4f4 in mi_switch (flags=259, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:485
#2  0xffffffff8094f446 in turnstile_wait (ts=<value optimized out>,
    owner=0xfffffe0004216920, queue=<value optimized out>)
    at /usr/src/sys/kern/subr_turnstile.c:753
#3  0xffffffff808ef68e in _mtx_lock_sleep (m=0xffffff80010a7088,
    tid=18446741874755597456, opts=<value optimized out>,
    file=<value optimized out>, line=<value optimized out>)
    at /usr/src/sys/kern/kern_mutex.c:466
#4  0xffffffff80ab3c97 in syncache_lookup (inc=0xffffff80002a2910,
    schp=<value optimized out>) at /usr/src/sys/netinet/tcp_syncache.c:500
#5  0xffffffff80ab424c in syncache_chkrst (inc=0xffffff80002a2910,
    th=0xfffffe005157ab7c) at /usr/src/sys/netinet/tcp_syncache.c:528
#6  0xffffffff80aabc33 in tcp_input (m=0xfffffe005157ab00,
    off0=<value optimized out>) at /usr/src/sys/netinet/tcp_input.c:1184
#7  0xffffffff80a3c5aa in ip_input (m=0xfffffe005157ab00)
    at /usr/src/sys/netinet/ip_input.c:760
#8  0xffffffff809db591 in swi_net (arg=<value optimized out>)
    at /usr/src/sys/net/netisr.c:806
#9  0xffffffff808d451d in intr_event_execute_handlers (
    p=<value optimized out>, ie=0xfffffe0004221c00)
    at /usr/src/sys/kern/kern_intr.c:1272
#10 0xffffffff808d5d0d in ithread_loop (arg=0xfffffe00042036c0)
    at /usr/src/sys/kern/kern_intr.c:1285
#11 0xffffffff808d099f in fork_exit (
    callout=0xffffffff808d5c70 <ithread_loop>, arg=0xfffffe00042036c0,
    frame=0xffffff80002a2b00) at /usr/src/sys/kern/kern_fork.c:992
#12 0xffffffff80ce603e in fork_trampoline ()
    at /usr/src/sys/amd64/amd64/exception.S:606
#13 0x0000000000000000 in ?? ()
(kgdb) frame 3
#3  0xffffffff808ef68e in _mtx_lock_sleep (m=0xffffff80010a7088,
    tid=18446741874755597456, opts=<value optimized out>,
    file=<value optimized out>, line=<value optimized out>)
    at /usr/src/sys/kern/kern_mutex.c:466
466                     turnstile_wait(ts, mtx_owner(m), TS_EXCLUSIVE_QUEUE);
(kgdb) p ((struct thread *)(m->mtx_lock&~15))->td_tid
$1 = 100013
(kgdb) tid 100013
[Switching to thread 43 (Thread 100013)]#0  sched_switch (
    td=0xfffffe0004216920, newtd=0xfffffe0004209920,
    flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1920
1920                    cpuid = PCPU_GET(cpuid);
(kgdb) bt
#0  sched_switch (td=0xfffffe0004216920, newtd=0xfffffe0004209920,
    flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1920
#1  0xffffffff8090d4f4 in mi_switch (flags=259, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:485
#2  0xffffffff8094f446 in turnstile_wait (ts=<value optimized out>,
    owner=0xfffffe0004216920, queue=<value optimized out>)
    at /usr/src/sys/kern/subr_turnstile.c:753
#3  0xffffffff809014b2 in _rw_rlock (rw=0xfffffe0051850a98,
    file=<value optimized out>, line=0) at /usr/src/sys/kern/kern_rwlock.c:477
#4  0xffffffff80a35771 in in_pcblookup_hash (pcbinfo=0xffffffff81434020,
    faddr=<value optimized out>, fport=19210, laddr={s_addr = 1827520685},
    lport=<value optimized out>, lookupflags=2, ifp=0x0)
    at /usr/src/sys/netinet/in_pcb.c:1805
#5  0xffffffff81a1da99 in pf_socket_lookup () from /boot/kernel/pf.ko
#6  0xffffffff81a248a5 in pf_test_rule () from /boot/kernel/pf.ko
#7  0xffffffff81a2834c in pf_test () from /boot/kernel/pf.ko
#8  0xffffffff81a2f961 in pf_check_out () from /boot/kernel/pf.ko
#9  0xffffffff809dbbee in pfil_run_hooks (ph=<value optimized out>,
    mp=0xffffff80002ac7f8, ifp=0x6e00, dir=115288696, inp=0x4b0a)
    at /usr/src/sys/net/pfil.c:82
#10 0xffffffff80a3ecb9 in ip_output (m=0xfffffe0006df2a00,
    opt=<value optimized out>, ro=0xffffff80002ac810, flags=0, imo=0x0,
    inp=0x0) at /usr/src/sys/netinet/ip_output.c:504
#11 0xffffffff80ab398f in syncache_respond (sc=0xfffffe0173157000)
    at /usr/src/sys/netinet/tcp_syncache.c:1525
#12 0xffffffff80ab3afa in syncache_timer (xsch=<value optimized out>)
    at /usr/src/sys/netinet/tcp_syncache.c:460
#13 0xffffffff80919ee8 in softclock (arg=<value optimized out>)
    at /usr/src/sys/kern/kern_timeout.c:520
#14 0xffffffff808d451d in intr_event_execute_handlers (
    p=<value optimized out>, ie=0xfffffe0004221800)
    at /usr/src/sys/kern/kern_intr.c:1272
#15 0xffffffff808d5d0d in ithread_loop (arg=0xfffffe0004203680)
    at /usr/src/sys/kern/kern_intr.c:1285
#16 0xffffffff808d099f in fork_exit (
    callout=0xffffffff808d5c70 <ithread_loop>, arg=0xfffffe0004203680,
    frame=0xffffff80002acb00) at /usr/src/sys/kern/kern_fork.c:992
#17 0xffffffff80ce603e in fork_trampoline ()
    at /usr/src/sys/amd64/amd64/exception.S:606
#18 0x0000000000000000 in ?? ()
(kgdb) frame 3
#3  0xffffffff809014b2 in _rw_rlock (rw=0xfffffe0051850a98,
    file=<value optimized out>, line=0) at /usr/src/sys/kern/kern_rwlock.c:477
477                     turnstile_wait(ts, rw_owner(rw), TS_SHARED_QUEUE);
(kgdb) p ((struct thread *)(rw->rw_lock&~15))->td_tid
$2 = 100011
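
The two "p ...->td_tid" results above form a wait-for cycle: thread 100011
is blocked in _mtx_lock_sleep on a mutex owned by 100013, and thread 100013
is blocked in _rw_rlock on an rwlock owned by 100011.  As a rough
illustration only (this is not kernel code; struct wait_edge, waits_on and
in_deadlock_cycle are hypothetical names made up for this sketch), the same
check can be written as a walk over "waiter -> owner" pairs:

```c
#include <assert.h>
#include <stddef.h>

/* One blocked thread and the owner of the lock it is waiting for. */
struct wait_edge {
	int waiter_tid;	/* thread that is blocked */
	int owner_tid;	/* thread currently holding the wanted lock */
};

/* Return the tid that `tid` is waiting on, or -1 if it is not blocked. */
static int
waits_on(const struct wait_edge *edges, size_t n, int tid)
{
	assert(edges != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (edges[i].waiter_tid == tid)
			return (edges[i].owner_tid);
	return (-1);
}

/*
 * Follow the waiter -> owner chain starting at `start`.  Return 1 if the
 * chain leads back to `start` within n hops (a deadlock cycle that
 * includes `start`), 0 otherwise.  Bounding the walk by n guarantees
 * termination even if some other cycle is reachable.
 */
static int
in_deadlock_cycle(const struct wait_edge *edges, size_t n, int start)
{
	int cur = start;

	for (size_t hops = 0; hops < n; hops++) {
		cur = waits_on(edges, n, cur);
		if (cur == -1)
			return (0);	/* chain ends at a runnable thread */
		if (cur == start)
			return (1);	/* came back to where we began */
	}
	return (0);
}
```

Feeding it the two pairs seen in this dump, {100011, 100013} and
{100013, 100011}, reports a cycle for either thread, which matches the
hang described above.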

>How-To-Repeat:

>Fix:


>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->freebsd-pf 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Wed Apr 16 01:27:59 UTC 2014 
Responsible-Changed-Why:  
reclassify. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=188063 

From: Mathieu <sigsys@gmail.com>
To: FreeBSD-gnats-submit@FreeBSD.org
Cc:  
Subject: Re: kern/188063: deadlock between syncache(4) and pf(4)
Date: Wed, 07 May 2014 11:45:39 -0400

 
 Well, it turns out this was caused by pf(4) "user" rules.  It's been about
 a month since I removed them, and the server has been running without
 deadlocking since then.
 
 It looks like the "workaround" mentioned in the pf(4) manpage isn't totally
 safe on 9.X.
 
>Unformatted:
