From nobody@FreeBSD.org  Sun Apr 17 16:59:02 2005
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A30AB16A4CE
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 17 Apr 2005 16:59:02 +0000 (GMT)
Received: from www.freebsd.org (www.freebsd.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 7523043D2D
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 17 Apr 2005 16:59:02 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.13.1/8.13.1) with ESMTP id j3HGx2RB058145
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 17 Apr 2005 16:59:02 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.13.1/8.13.1/Submit) id j3HGx29F058144;
	Sun, 17 Apr 2005 16:59:02 GMT
	(envelope-from nobody)
Message-Id: <200504171659.j3HGx29F058144@www.freebsd.org>
Date: Sun, 17 Apr 2005 16:59:02 GMT
From: Ivan <sandello@micmedia.ru>
To: freebsd-gnats-submit@FreeBSD.org
Subject: netgraph is causing crash (free()->panic) with mpd
X-Send-Pr-Version: www-2.3

>Number:         80035
>Category:       kern
>Synopsis:       netgraph is causing crash (free()->panic) with mpd
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    glebius
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Apr 17 17:00:36 GMT 2005
>Closed-Date:    Wed Dec 28 13:43:05 GMT 2005
>Last-Modified:  Wed Dec 28 13:43:05 GMT 2005
>Originator:     Ivan
>Release:        5.4-RC2
>Organization:
Micron-Media
>Environment:
FreeBSD micron-media.ru 5.4-RC2 FreeBSD 5.4-RC2 #6: Tue Apr 12 16:53:37 MSD 2005     root@micron-media.ru:/usr/obj/usr/src/sys/MKMEDIA  i386
>Description:
After switching the VPN server from poptop to mpd, my box started to crash periodically (after anywhere from an hour to several days of uptime). Normally there are 50..70 users connected to the VPN. Just before each crash, mpd is disconnecting someone.

mpd patches applied: mpd 3.18 + drop-user patch from ftp://ftp.ufanet.ru/pub/boco/mpd/ (but it shouldn't affect netgraph-related stuff).

Kernel config: HZ=1000, SMP (HyperThreading), ALTQ, netgraph compiled-in  (only ng_vjc.ko via module).

gcc flags: -O -pipe 

dmesg output:
 Free item, freed at /usr/src/sys/netgraph/ng_base.c, line 3652
 problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 3646
 Free item, freed at /usr/src/sys/netgraph/ng_base.c, line 3652
 problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 3193
 Free item, freed at /usr/src/sys/netgraph/ng_base.c, line 3652
 problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 3193
 Free item, freed at /usr/src/sys/netgraph/ng_base.c, line 3652
 problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 3193
 Free item, freed at /usr/src/sys/netgraph/ng_base.c, line 3652
 problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 3193
node 0xc29ad800 ([1ce9])
panic: free item!
cpuid = 1
boot() called on cpu#0
Uptime: 3d3h55m37s
Cannot dump. No dump device defined.
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
cpu_reset called on cpu#0
cpu_reset: Stopping other CPUs

>How-To-Repeat:
When mpd is serving about 60..70 users, the system crashes. I wasn't able to find an exact way to reproduce the crash.
>Fix:
      
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->glebius 
Responsible-Changed-By: glebius 
Responsible-Changed-When: Mon Apr 18 12:06:32 GMT 2005 
Responsible-Changed-Why:  
Take this one. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=80035 

From: Gleb Smirnoff <glebius@FreeBSD.org>
To: Ivan <sandello@micmedia.ru>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/80035: netgraph is causing crash (free()->panic) with mpd
Date: Mon, 18 Apr 2005 16:08:14 +0400

   Ivan,
 
 On Sun, Apr 17, 2005 at 04:59:02PM +0000, Ivan wrote:
 I> cpuid = 1
 I> boot() called on cpu#0
 I> Uptime: 3d3h55m37s
 I> Cannot dump. No dump device defined.
 
 Can you please configure a dump device and obtain a crash dump?
 
 http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html#KERNELDEBUG-OBTAIN
 
 -- 
 Totus tuus, Glebius.
 GLEBIUS-RIPN GLEB-RIPE
State-Changed-From-To: open->feedback 
State-Changed-By: glebius 
State-Changed-When: Sun May 1 08:27:27 GMT 2005 
State-Changed-Why:  
The originator was asked for feedback two weeks ago. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=80035 

From: Rojer <myself@rojer.pp.ru>
To: bug-followup@FreeBSD.org,  sandello@micmedia.ru
Cc:  
Subject: Re: kern/80035: netgraph is causing crash (free()->panic) with mpd
Date: Sun, 21 Aug 2005 22:56:45 +0400

 I have the same issue on 6.0-BETA2.
 I have been investigating it and so far it seems to be a callout which gets called despite being stopped.
 This leads to a double free of an item structure, first in ng_uncallout, then in ng_snd_item.
 I don't see why this happens; the algorithm for callout management seems to be correct in every case, but...
 There are two workarounds to get rid of the panic.
 
 1) If we replace callout_stop with callout_drain in ng_uncallout, the panic goes away.
 This proves that there is something fishy around here. However, it introduces a locking issue:
 the following messages start to pop up regularly on the console:
 
 Waiting on "callout_wait" with the following non-sleepable locks held:
 exclusive sleep mutex inp (tcpinp) r = 0 (0xc182bf54) locked @ /usr/src/sys/netinet/tcp_input.c:742
 
 So this is not a solution. Besides, I'd really like to know what actually happens, because callout_stop
 should work correctly here. Only it doesn't...
 
 2) This callout is used in the delayed-ack implementation. You can simply turn it off and the panic goes away.
 
 # this solves the double-free panic problem on 6.0-BETA2. the code is the same for 5.4, so it should work too.
 set pptp disable delayed-ack
 
 While you are at it, try disabling windowing as well; it improves throughput substantially (it's on by default).
 
 # disabling windowing nearly doubled the speed for me.
 set pptp disable delayed-ack windowing
 # however, this is a deviation from the standard, so YMMV
 
 
 for reference, here's a backtrace.
 
 (kgdb) bt
 #0  doadump () at pcpu.h:165
 #1  0xc0535d2b in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:402
 #2  0xc053605c in panic (fmt=0xc0743efc "Duplicate free of item %p from zone %p(%s)\n")
      at /usr/src/sys/kern/kern_shutdown.c:560
 #3  0xc069dfeb in uma_dbg_free (zone=0xc1026100, slab=0xc1949ef8, item=0xc1949540) at /usr/src/sys/vm/uma_dbg.c:303
 #4  0xc069c19d in uma_zfree_arg (zone=0xc1026100, item=0xc1949540, udata=0x0) at /usr/src/sys/vm/uma_core.c:2285
 #5  0xc05c5029 in ng_free_item (item=0xc1949540) at uma.h:303
 #6  0xc05c2db6 in ng_snd_item (item=0xc1949540, flags=0) at /usr/src/sys/netgraph/ng_base.c:2118
 #7  0xc05c6b69 in ng_callout_trampoline (arg=0x0) at /usr/src/sys/netgraph/ng_base.c:3533
 #8  0xc0543d21 in softclock (dummy=0x0) at /usr/src/sys/kern/kern_timeout.c:299
 #9  0xc051f312 in ithread_loop (arg=0xc1502580) at /usr/src/sys/kern/kern_intr.c:545
 #10 0xc051e391 in fork_exit (callout=0xc051f1c0 <ithread_loop>, arg=0x0, frame=0x0) at /usr/src/sys/kern/kern_fork.c:789
 #11 0xc06de31c in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:208
 
 item=0xc1949540 was actually freed in ng_uncallout.
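
The failure mode described above (an item released once on the stop path and again when the handler runs anyway) can be replayed deterministically in user space. The sketch below is a simplified model, not the netgraph code: `struct item`, `struct timer`, the `TIMER_DISPATCHED` state, and every function name are hypothetical, and frees are counted rather than performed so that the double free is observable without undefined behaviour. It does not claim to identify the actual race in ng_base.c or ng_pptpgre.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a netgraph item; counts frees instead of freeing. */
struct item {
	int free_count;
};

/*
 * Hypothetical timer states. TIMER_DISPATCHED means the timer thread has
 * already dequeued the handler but has not run it yet.
 */
enum timer_state { TIMER_IDLE, TIMER_PENDING, TIMER_DISPATCHED };

struct timer {
	enum timer_state state;
	struct item *arg;
};

static void item_free(struct item *it)
{
	assert(it != NULL);
	it->free_count++;
}

/* Returns true only if a pending timer was removed before dispatch. */
static bool timer_stop(struct timer *t)
{
	if (t->state == TIMER_PENDING) {
		t->state = TIMER_IDLE;
		return true;
	}
	return false;
}

/* Handler path: consumes (frees) the item the timer was armed with. */
static void timer_fire(struct timer *t)
{
	if (t->state != TIMER_DISPATCHED)
		return;
	t->state = TIMER_IDLE;
	item_free(t->arg);
}

/* Racy stop path: frees the item without checking whether the stop succeeded. */
void uncallout_racy(struct timer *t)
{
	(void)timer_stop(t);
	item_free(t->arg);
}

/* Careful stop path: frees only if the handler can no longer run. */
void uncallout_checked(struct timer *t)
{
	if (timer_stop(t))
		item_free(t->arg);
}

/* Replays one interleaving: dispatch, then stop, then the handler runs. */
int replay(void (*stop_path)(struct timer *))
{
	struct item it = { 0 };
	struct timer t = { TIMER_PENDING, &it };

	t.state = TIMER_DISPATCHED;	/* timer thread dequeues the handler */
	stop_path(&t);			/* another context stops the timer */
	timer_fire(&t);			/* the handler runs anyway */
	return it.free_count;
}
```

In this model `replay(uncallout_racy)` reports two frees of the single item, while `replay(uncallout_checked)` reports one.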
 
 -- 
 Deomid Ryabkov aka Rojer
 myself@rojer.pp.ru
 rojer@sysadmins.ru
 ICQ: 8025844

From: Gleb Smirnoff <glebius@FreeBSD.org>
To: Rojer <myself@rojer.pp.ru>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/80035: netgraph is causing crash (free()->panic) with mpd
Date: Sun, 21 Aug 2005 23:16:44 +0400

   Deomid,
 
 big thanks for your investigation!
 
 On Sun, Aug 21, 2005 at 07:00:36PM +0000, Rojer wrote:
 R>  I have the same issue on 6.0-BETA2.
 R>  I have been investigating it and so far it seems to be a callout which gets called despite being stopped.
 R>  This leads to a double free of an item structure, first in ng_uncallout, then in ng_snd_item.
 R>  I don't see why this happens; the algorithm for callout management seems to be correct in every case, but...
 R>  There are two workarounds to get rid of the panic.
 R>  
 R>  1) If we replace callout_stop with callout_drain in ng_uncallout, the panic goes away.
 R>  This proves that there is something fishy around here. However, it introduces a locking issue:
 R>  the following messages start to pop up regularly on the console:
 R>  
 R>  Waiting on "callout_wait" with the following non-sleepable locks held:
 R>  exclusive sleep mutex inp (tcpinp) r = 0 (0xc182bf54) locked @ /usr/src/sys/netinet/tcp_input.c:742
 R>  
 R>  So this is not a solution. Besides, I'd really like to know what actually happens, because callout_stop
 R>  should work correctly here. Only it doesn't...
 R>  
 R>  2) This callout is used in the delayed-ack implementation. You can simply turn it off and the panic goes away.
 
 Yes, it looks like there is a race here.
 
 Can you please confirm that the following workaround helps, too? Try to
 add NG_NODE_FORCE_WRITER(node) at the end of ng_pptpgre_constructor()
 in ng_pptpgre.c. Thanks in advance!
 
 -- 
 Totus tuus, Glebius.
 GLEBIUS-RIPN GLEB-RIPE

From: Rojer <myself@rojer.pp.ru>
To: Gleb Smirnoff <glebius@FreeBSD.org>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/80035: netgraph is causing crash (free()->panic) with mpd
Date: Mon, 22 Aug 2005 00:21:51 +0400

 Gleb Smirnoff wrote:
 > 
 > Can you please confirm that the following workaround helps, too? Try to
 > add NG_NODE_FORCE_WRITER(node) at the end of ng_pptpgre_constructor()
 > in ng_pptpgre.c. Thanks in advance!
 > 
 
 yes, it does.
 
 freshly cvsupped RELENG_6_0, delayed-ack and windowing enabled,
 + this patch:
 
 --- /sys/netgraph/ng_pptpgre.c.orig     Tue Jan 11 15:20:28 2005
 +++ /sys/netgraph/ng_pptpgre.c  Sun Aug 21 23:30:18 2005
 @@ -285,6 +285,8 @@
          ng_callout_init(&priv->ackp.sackTimer);
          ng_callout_init(&priv->ackp.rackTimer);
 
 +       NG_NODE_FORCE_WRITER(node);
 +
          /* Done */
          return (0);
   }
 
 
 no panic.
 
 -- 
 Deomid Ryabkov aka Rojer
 myself@rojer.pp.ru
 rojer@sysadmins.ru
 ICQ: 8025844

From: Rojer <myself@rojer.pp.ru>
To: Rojer <myself@rojer.pp.ru>
Cc: Gleb Smirnoff <glebius@FreeBSD.org>,  bug-followup@FreeBSD.org
Subject: Re: kern/80035: netgraph is causing crash (free()->panic) with mpd
Date: Mon, 22 Aug 2005 00:24:22 +0400

 Rojer wrote:
 
 > freshly cvsupped RELENG_6_0
 
 I beg your pardon, of course I meant RELENG_6 :)
 
 -- 
 Deomid Ryabkov aka Rojer
 myself@rojer.pp.ru
 rojer@sysadmins.ru
 ICQ: 8025844

From: Gleb Smirnoff <glebius@FreeBSD.org>
To: Rojer <myself@rojer.pp.ru>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/80035: netgraph is causing crash (free()->panic) with mpd
Date: Mon, 22 Aug 2005 15:54:21 +0400

 --LQksG6bCIzRHxTLp
 Content-Type: text/plain; charset=koi8-r
 Content-Disposition: inline
 
   Deomid,
 
   thanks for your help. Can you test another patch, please?
 
 I see a lot of racy places in the ng_pptpgre node, so I decided to protect
 the whole node with a single mutex. Please try out this patch and tell me
 whether it helps. NG_NODE_FORCE_WRITER() should be removed. Thanks in advance.
 
 -- 
 Totus tuus, Glebius.
 GLEBIUS-RIPN GLEB-RIPE
 
 --LQksG6bCIzRHxTLp
 Content-Type: text/plain; charset=koi8-r
 Content-Disposition: attachment; filename="ng_pptpgre.diff"
 
 Index: netgraph.h
 ===================================================================
 RCS file: /home/ncvs/src/sys/netgraph/netgraph.h,v
 retrieving revision 1.59
 diff -u -r1.59 netgraph.h
 --- netgraph.h	8 Aug 2005 20:08:44 -0000	1.59
 +++ netgraph.h	22 Aug 2005 11:36:07 -0000
 @@ -1087,6 +1087,7 @@
  int	ng_callout(struct callout *c, node_p node, hook_p hook, int ticks,
  	    ng_item_fn *fn, void * arg1, int arg2);
  #define	ng_callout_init(c)	callout_init(c, CALLOUT_MPSAFE)
 +#define	ng_callout_init_mtx(c,mtx)	callout_init_mtx(c, mtx, CALLOUT_MPSAFE)
  
  /* Flags for netgraph functions. */
  #define	NG_NOFLAGS	0x00000000	/* no special options */
 Index: ng_pptpgre.c
 ===================================================================
 RCS file: /home/ncvs/src/sys/netgraph/ng_pptpgre.c,v
 retrieving revision 1.37
 diff -u -r1.37 ng_pptpgre.c
 --- ng_pptpgre.c	11 Jan 2005 12:20:28 -0000	1.37
 +++ ng_pptpgre.c	22 Aug 2005 11:46:20 -0000
 @@ -58,8 +58,10 @@
  #include <sys/systm.h>
  #include <sys/kernel.h>
  #include <sys/time.h>
 -#include <sys/mbuf.h>
 +#include <sys/lock.h>
  #include <sys/malloc.h>
 +#include <sys/mbuf.h>
 +#include <sys/mutex.h>
  #include <sys/errno.h>
  
  #include <netinet/in.h>
 @@ -165,6 +167,7 @@
  	u_int32_t		xmitAck;	/* last seq # we ack'd */
  	struct timeval		startTime;	/* time node was created */
  	struct ng_pptpgre_stats	stats;		/* node statistics */
 +	struct mtx		mtx;		/* node mutex */
  };
  typedef struct ng_pptpgre_private *priv_p;
  
 @@ -282,8 +285,9 @@
  	NG_NODE_SET_PRIVATE(node, priv);
  
  	/* Initialize state */
 -	ng_callout_init(&priv->ackp.sackTimer);
 -	ng_callout_init(&priv->ackp.rackTimer);
 +	mtx_init(&priv->mtx, "ng_pptp", NULL, MTX_DEF);
 +	ng_callout_init_mtx(&priv->ackp.sackTimer, &priv->mtx);
 +	ng_callout_init_mtx(&priv->ackp.rackTimer, &priv->mtx);
  
  	/* Done */
  	return (0);
 @@ -387,6 +391,7 @@
  {
  	const node_p node = NG_HOOK_NODE(hook);
  	const priv_p priv = NG_NODE_PRIVATE(node);
 +	int rval;
  
  	/* If not configured, reject */
  	if (!priv->conf.enabled) {
 @@ -394,12 +399,19 @@
  		return (ENXIO);
  	}
  
 +	mtx_lock(&priv->mtx);
 +
  	/* Treat as xmit or recv data */
  	if (hook == priv->upper)
 -		return ng_pptpgre_xmit(node, item);
 -	if (hook == priv->lower)
 -		return ng_pptpgre_recv(node, item);
 -	panic("%s: weird hook", __func__);
 +		rval = ng_pptpgre_xmit(node, item);
 +	else if (hook == priv->lower)
 +		rval = ng_pptpgre_recv(node, item);
 +	else
 +		panic("%s: weird hook", __func__);
 +
 +	mtx_unlock(&priv->mtx);
 +
 +	return (rval);
  }
  
  /*
 @@ -413,6 +425,8 @@
  	/* Reset node (stops timers) */
  	ng_pptpgre_reset(node);
  
 +	mtx_destroy(&priv->mtx);
 +
  	FREE(priv, M_NETGRAPH);
  
  	/* Decrement ref count */
 @@ -875,6 +889,8 @@
  	const priv_p priv = NG_NODE_PRIVATE(node);
  	struct ng_pptpgre_ackp *const a = &priv->ackp;
  
 +	mtx_lock(&priv->mtx);
 +
  	/* Reset adaptive timeout state */
  	a->ato = PPTP_MAX_TIMEOUT;
  	a->rtt = priv->conf.peerPpd * PPTP_TIME_SCALE / 10;  /* ppd in 10ths */
 @@ -903,6 +919,8 @@
  	/* Stop timers */
  	ng_pptpgre_stop_send_ack_timer(node);
  	ng_pptpgre_stop_recv_ack_timer(node);
 +
 +	mtx_unlock(&priv->mtx);
  }
  
  /*
 
 --LQksG6bCIzRHxTLp--
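
The idea behind the patch above, one mutex shared by the data path, the reset path, and the timers, can be illustrated in user space. This is a minimal sketch with POSIX threads under stated assumptions: `struct node_priv`, `timer_armed`, `release_if_armed`, `timer_thread`, `stop_timer`, and `run_once` are hypothetical names, and a plain thread stands in for the kernel's softclock. It models only one property: a handler that takes the same lock as the stop path and rechecks its armed flag cannot release the item a second time. It is not the callout(9) implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node state, loosely modelled on a node private struct. */
struct node_priv {
	pthread_mutex_t mtx;	/* node mutex shared by all paths */
	bool timer_armed;	/* true while the timer still owns the item */
	int free_count;		/* counts releases instead of calling free() */
};

/* Must be called with p->mtx held: releases the item at most once. */
static void release_if_armed(struct node_priv *p)
{
	if (p->timer_armed) {
		p->timer_armed = false;
		p->free_count++;
	}
}

/* Timer handler: takes the node mutex before touching any state. */
static void *timer_thread(void *arg)
{
	struct node_priv *p = arg;

	pthread_mutex_lock(&p->mtx);
	release_if_armed(p);
	pthread_mutex_unlock(&p->mtx);
	return NULL;
}

/* Stop path: same mutex, same recheck, so it cannot race the handler. */
static void stop_timer(struct node_priv *p)
{
	pthread_mutex_lock(&p->mtx);
	release_if_armed(p);
	pthread_mutex_unlock(&p->mtx);
}

/* Races the handler against the stop path; returns the release count. */
int run_once(void)
{
	struct node_priv p = { .timer_armed = true, .free_count = 0 };
	pthread_t tid;

	pthread_mutex_init(&p.mtx, NULL);
	if (pthread_create(&tid, NULL, timer_thread, &p) != 0) {
		pthread_mutex_destroy(&p.mtx);
		return -1;
	}
	stop_timer(&p);
	pthread_join(tid, NULL);
	pthread_mutex_destroy(&p.mtx);
	return p.free_count;
}
```

Whichever of the two paths wins the lock, `run_once()` reports exactly one release, because the loser sees `timer_armed` already cleared.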

From: Rojer <myself@rojer.pp.ru>
To: Gleb Smirnoff <glebius@FreeBSD.org>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/80035: netgraph is causing crash (free()->panic) with mpd
Date: Tue, 23 Aug 2005 23:24:11 +0400

 I had to replace
 
 callout_init_mtx(c, mtx, CALLOUT_MPSAFE)
 with
 callout_init_mtx(c, mtx, 0)
 
 since CALLOUT_MPSAFE makes no sense for callout_init_mtx and is not a valid flag.
 
 otherwise - yep, works perfectly. no panics.
 
 -- 
 Deomid Ryabkov aka Rojer
 myself@rojer.pp.ru
 rojer@sysadmins.ru
 ICQ: 8025844

From: Gleb Smirnoff <glebius@FreeBSD.org>
To: Rojer <myself@rojer.pp.ru>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/80035: netgraph is causing crash (free()->panic) with mpd
Date: Wed, 24 Aug 2005 00:34:55 +0400

 On Tue, Aug 23, 2005 at 11:24:11PM +0400, Rojer wrote:
 R> i had to replace
 R> 
 R> callout_init_mtx(c, mtx, CALLOUT_MPSAFE)
 R> with
 R> callout_init_mtx(c, mtx, 0)
 R> 
 R> since CALLOUT_MPSAFE makes no sense for callout_init_mtx and is not a valid 
 R> flag.
 
 Yep, true.
 
 R> otherwise - yep, works perfectly. no panics.
 
 Can you please run with this patch for the next several days, so
 that it is tested properly before committing to RELENG branches?
 
 -- 
 Totus tuus, Glebius.
 GLEBIUS-RIPN GLEB-RIPE

From: Rojer <myself@rojer.pp.ru>
To: Gleb Smirnoff <glebius@FreeBSD.org>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/80035: netgraph is causing crash (free()->panic) with mpd
Date: Wed, 24 Aug 2005 01:08:56 +0400

 Gleb Smirnoff wrote:
 > 
 > Can you please run with this patch for next several days, so
 > that it is tested properly before committing to RELENG branches?
 > 
 
 Sure, why not.
 
 I'll post a followup if the issue surfaces again.
 
 -- 
 Deomid Ryabkov aka Rojer
 myself@rojer.pp.ru
 rojer@sysadmins.ru
 ICQ: 8025844
State-Changed-From-To: feedback->patched 
State-Changed-By: glebius 
State-Changed-When: Tue Aug 30 09:52:44 GMT 2005 
State-Changed-Why:  
The problem is considered fixed in HEAD. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=80035 
State-Changed-From-To: patched->closed 
State-Changed-By: glebius 
State-Changed-When: Wed Dec 28 13:41:27 UTC 2005 
State-Changed-Why:  
This was fixed before 6.0-RELEASE. I don't know why I forgot to close the PR. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=80035 
>Unformatted:
