From nobody@FreeBSD.org  Wed Nov  9 13:19:50 2005
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 961AA16A41F
	for <freebsd-gnats-submit@FreeBSD.org>; Wed,  9 Nov 2005 13:19:50 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 48BEF43D46
	for <freebsd-gnats-submit@FreeBSD.org>; Wed,  9 Nov 2005 13:19:50 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.13.1/8.13.1) with ESMTP id jA9DJnKJ050267
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 9 Nov 2005 13:19:49 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.13.1/8.13.1/Submit) id jA9DJnlB050266;
	Wed, 9 Nov 2005 13:19:49 GMT
	(envelope-from nobody)
Message-Id: <200511091319.jA9DJnlB050266@www.freebsd.org>
Date: Wed, 9 Nov 2005 13:19:49 GMT
From: Victor Snezhko <snezhko@indorsoft.ru>
To: freebsd-gnats-submit@FreeBSD.org
Subject: netinet6 updates in -CURRENT cause panic when using user-level ppp
X-Send-Pr-Version: www-2.3

>Number:         88725
>Category:       kern
>Synopsis:       [netinet6] [panic] updates in -CURRENT cause panic when using user-level ppp
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Nov 09 13:20:19 GMT 2005
>Closed-Date:    Wed Nov 16 12:39:58 GMT 2005
>Last-Modified:  Wed Nov 16 12:39:58 GMT 2005
>Originator:     Victor Snezhko
>Release:        7.0-CURRENT
>Organization:
IndorSoft Ltd.
>Environment:
FreeBSD freebsd.indorsoft.ru 7.0-CURRENT FreeBSD 7.0-CURRENT #12: Sat Nov  5 19:24:55 NOVT 2005     root@freebsd.indorsoft.ru:/home/vvs/obj/usr/src/sys/VVS  i386
Sources cvsupped as of 2005.10.21.16.25.00; the problem is still present as of 2005.11.06.
I use a custom kernel config, but the problem also occurs with GENERIC.
The problem is reproducible at least on i386 (including in a virtual machine) and on amd64.
>Description:
The changes to netinet6 committed on 2005.10.21.16.23.01 break user-level ppp.
After these changes, the kernel panics when I start /usr/sbin/ppp. Here is the backtrace analysis:

/var/crash # kgdb /usr/obj/usr/src/sys/VVS/kernel /var/crash/vmcore.27
[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".

Unread portion of the kernel message buffer:
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0xdeadc0e6
fault code		= supervisor read, page not present
instruction pointer	= 0x20:0xc066c182
stack pointer	        = 0x28:0xc6082cc0
frame pointer	        = 0x28:0xc6082ce8
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, def32 1, gran 1
processor eflags	= resume, IOPL = 0
current process		= 36 (swi4: clock sio)
panic: from debugger
cpuid = 0
Uptime: 1m25s
Dumping 63 MB (3 chunks)
  chunk 0: 1MB (159 pages) ... ok
  chunk 1: 62MB (15856 pages) 46 30 14 ... ok
  chunk 2: 1MB (256 pages)

#0  doadump () at pcpu.h:165
165	pcpu.h: No such file or directory.
	in pcpu.h
(kgdb) bt
#0  doadump () at pcpu.h:165
#1  0xc0660824 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xc0660b39 in panic (fmt=0xc0856f00 "from debugger")
    at /usr/src/sys/kern/kern_shutdown.c:555
#3  0xc046cee1 in db_panic (addr=-1067007614, have_addr=0, count=-1, 
    modif=0xc6082abc "") at /usr/src/sys/ddb/db_command.c:434
#4  0xc046ce78 in db_command (last_cmdp=0xc0947984, cmd_table=0x0, 
    aux_cmd_tablep=0xc08bd97c, aux_cmd_tablep_end=0xc08bd998)
    at /usr/src/sys/ddb/db_command.c:403
#5  0xc046cf40 in db_command_loop () at /usr/src/sys/ddb/db_command.c:454
#6  0xc046eb59 in db_trap (type=12, code=0) at /usr/src/sys/ddb/db_main.c:221
#7  0xc06793a4 in kdb_trap (type=12, code=0, tf=0xc6082c80)
    at /usr/src/sys/kern/subr_kdb.c:473
#8  0xc0821ac8 in trap_fatal (frame=0xc6082c80, eva=3735929062)
    at /usr/src/sys/i386/i386/trap.c:846
#9  0xc0821152 in trap (frame=
      {tf_fs = 8, tf_es = 40, tf_ds = 40, tf_edi = -1054618496, tf_esi = -1054756736, tf_ebp = -972542744, tf_isp = -972542804, tf_ebx = 1, tf_edx = -1030106232, tf_ecx = -559038242, tf_eax = 83559, tf_trapno = 12, tf_err = 0, tf_eip = -1067007614, tf_cs = 32, tf_eflags = 589826, tf_esp = -1054618496, tf_ss = 0})
    at /usr/src/sys/i386/i386/trap.c:269
#10 0xc080ec2a in calltrap () at /usr/src/sys/i386/i386/exception.s:139
#11 0xc066c182 in softclock (dummy=0x0)
    at /usr/src/sys/kern/kern_timeout.c:220
#12 0xc064e260 in ithread_loop (arg=0xc121b080)
    at /usr/src/sys/kern/kern_intr.c:547
#13 0xc064d668 in fork_exit (callout=0xc064e118 <ithread_loop>, 
    arg=0xc121b080, frame=0xc6082d38) at /usr/src/sys/kern/kern_fork.c:789
#14 0xc080ec8c in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:208
(kgdb) up 11
#11 0xc066c182 in softclock (dummy=0x0)
    at /usr/src/sys/kern/kern_timeout.c:220
220				if (c->c_time != curticks) {
(kgdb) list
215			curticks = softticks;
216			bucket = &callwheel[curticks & callwheelmask];
217			c = TAILQ_FIRST(bucket);
218			while (c) {
219				depth++;
220				if (c->c_time != curticks) {
221					c = TAILQ_NEXT(c, c_links.tqe);
222					++steps;
223					if (steps >= MAX_SOFTCLOCK_STEPS) {
224						nextsoftcheck = c;
(kgdb) print c
$1 = (struct callout *) 0xdeadc0de
(kgdb) print *bucket
$2 = {tqh_first = 0xc1644020, tqh_last = 0xc1644020}
(kgdb) print steps
$3 = 1
(kgdb) print *(bucket->tqh_first)
$4 = {c_links = {sle = {sle_next = 0xdeadc0de}, tqe = {tqe_next = 0xdeadc0de, 
      tqe_prev = 0xdeadc0de}}, c_time = -559038242, c_arg = 0xdeadc0de, 
  c_func = 0xdeadc0de, c_mtx = 0xdeadc0de, c_flags = -559038242}
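
The numbers in the trap message and the kgdb output above fit together arithmetically. The sketch below is a userland illustration only, not kernel code: struct callout_i386 is a hypothetical stand-in whose field order is copied from the "print *(bucket->tqh_first)" output, with pointers modelled as 32-bit integers on the assumption of the i386 layout. Under that assumption, reading c->c_time through c = 0xdeadc0de touches 0xdeadc0e6, which is the reported fault virtual address, and the junk word viewed as a signed int is the -559038242 shown for c_time and c_flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical model of struct callout as laid out on i386. The field order
 * follows the kgdb output; pointers are modelled as uint32_t so the offsets
 * stay the same when this is built on a 64-bit host.
 */
struct callout_i386 {
	struct {
		uint32_t tqe_next;
		uint32_t tqe_prev;
	} c_links;
	int32_t  c_time;
	uint32_t c_arg;
	uint32_t c_func;
	uint32_t c_mtx;
	int32_t  c_flags;
};

/* Two 32-bit link pointers precede c_time in this model. */
static_assert(offsetof(struct callout_i386, c_time) == 8,
    "c_time is expected at offset 8 in the i386 model");

#define UMA_JUNK 0xdeadc0deU	/* value seen in every field of the freed element */

/* Address that is read when evaluating c->c_time for a given pointer c. */
uint32_t
c_time_address(uint32_t c)
{
	return (c + (uint32_t)offsetof(struct callout_i386, c_time));
}

/* The junk word reinterpreted as a signed 32-bit int, as kgdb prints it. */
int32_t
junk_as_int(void)
{
	uint32_t junk = UMA_JUNK;
	int32_t value;

	memcpy(&value, &junk, sizeof(value));
	return (value);
}
```

Calling c_time_address(UMA_JUNK) gives the address seen as "fault virtual address" and as eva in the trap_fatal frame, which is consistent with softclock following a tqe_next pointer that had already been overwritten with junk.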



The following patch from John Baldwin (intended for testing only) doesn't help - symptoms remain the same:

Index: nd6.c
===================================================================
RCS file: /usr/cvs/src/sys/netinet6/nd6.c,v
retrieving revision 1.62
diff -u -r1.62 nd6.c
--- nd6.c       22 Oct 2005 05:07:16 -0000      1.62
+++ nd6.c       3 Nov 2005 19:56:42 -0000
@@ -398,7 +398,7 @@
        if (tick < 0) {
                ln->ln_expire = 0;
                ln->ln_ntick = 0;
-               callout_stop(&ln->ln_timer_ch);
+               callout_drain(&ln->ln_timer_ch);
        } else {
                ln->ln_expire = time_second + tick / hz;
                if (tick > INT_MAX) {
======================================================================

I have made 2 attempts to find the cause of the callwheel corruption:
1) I wrote a checking function that searches the callwheel for corrupted entries and panics if it finds any. This function was called from every place in kern/kern_timeout.c that could modify the callwheel. No success - the callwheel is modified elsewhere.

2) I extended trash_dtor() in vm/uma_dbg.c in the following way to find which element of the callwheel is freed before being disarmed.
(Warning: the pointer casts/comparisons in this patch may not be 64-bit ready)
--- uma_dbg.c.orig	Mon Nov  7 23:05:09 2005
+++ uma_dbg.c	Tue Nov  8 17:37:24 2005
@@ -41,6 +41,8 @@
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/malloc.h>
+#include <sys/callout.h>
+#include <sys/kdb.h>
 
 #include <vm/vm.h>
 #include <vm/vm_object.h>
@@ -86,8 +88,33 @@
 {
 	int cnt;
 	u_int32_t *p;
+	struct callout *c;
+	struct callout_tailq *bucket;
+	int i;
 
 	cnt = size / sizeof(uma_junk);
+
+	mtx_lock_spin(&callout_lock);
+ 
+	for (i = 0; i < callwheelsize; ++i) {
+		bucket = &callwheel[i];
+		for (c = TAILQ_FIRST(bucket); c != NULL;
+		     c = TAILQ_NEXT(c, c_links.tqe)) {
+			long c2 = (long)c;
+			long mem2 = (long)mem;
+			if ((u_int32_t)c == uma_junk) {
+				kdb_enter("trash_dtor: uma_junk found in a "\
+					  "callwheel element");
+				break;
+			}
+			if (c2 >= mem2 && c2 < mem2 + size) {
+				kdb_enter("trash_dtor: found invalid "\
+					  "callwhel element");
+			}
+		}
+	}
+
+	mtx_unlock_spin(&callout_lock);
 
 	for (p = mem; cnt > 0; cnt--, p++)
 		*p = uma_junk;
======================================================================
and kdb_enter is called here:
	if ((u_int32_t)c == uma_junk) {
==>		kdb_enter("trash_dtor: uma_junk found in a "\
			  "callwheel element");

I.e. this check finds a callwheel element that was already freed and filled with uma_junk.
There is a side effect: applying the last patch makes the panic much less reproducible. When the panic doesn't occur, ppp works.
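
The range check from attempt 2 can be shown in isolation. The following is a minimal userland sketch, assuming a toy wheel with singly linked buckets; toy_wheel, toy_callout, toy_arm and toy_find_armed_in are hypothetical names and not kernel interfaces. It reports an element that is still linked into the wheel and lies inside a memory region about to be freed, and it compares addresses as uintptr_t, which avoids the 64-bit concern noted for the long casts in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_WHEEL_SIZE 8			/* hypothetical; must be a power of two */
#define TOY_WHEEL_MASK (TOY_WHEEL_SIZE - 1)

struct toy_callout {
	struct toy_callout *next;	/* singly linked for brevity */
	int c_time;
};

struct toy_wheel {
	struct toy_callout *bucket[TOY_WHEEL_SIZE];
};

/* Link c into the bucket selected by its expiry time. */
void
toy_arm(struct toy_wheel *w, struct toy_callout *c, int when)
{
	unsigned int idx = (unsigned int)when & TOY_WHEEL_MASK;

	assert(w != NULL && c != NULL);
	c->c_time = when;
	c->next = w->bucket[idx];
	w->bucket[idx] = c;
}

/*
 * Return the first armed element whose address lies in [mem, mem + size),
 * or NULL if no armed element lives in that region.
 */
struct toy_callout *
toy_find_armed_in(const struct toy_wheel *w, const void *mem, size_t size)
{
	uintptr_t lo = (uintptr_t)mem;
	uintptr_t hi = lo + size;
	struct toy_callout *c;
	int i;

	for (i = 0; i < TOY_WHEEL_SIZE; i++) {
		for (c = w->bucket[i]; c != NULL; c = c->next) {
			uintptr_t addr = (uintptr_t)c;

			if (addr >= lo && addr < hi)
				return (c);
		}
	}
	return (NULL);
}
```

A destructor hook would call toy_find_armed_in(wheel, mem, size) before poisoning the memory; a non-NULL result means the region still contains an armed timer. Running the check before the junk fill matters, because after the fill the list links themselves are no longer trustworthy.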

>How-To-Repeat:
cvsup to -CURRENT as of 2005.10.21.16.25.00 or later, then recompile and install the kernel using the GENERIC config.
With the new kernel, start /usr/sbin/ppp.
A few seconds after start (up to 3 on my Celeron-600), once the callwheel in kern/kern_timeout.c has been cycled through, the panic occurs.

>Fix:
There is only a workaround: disabling INET6 in the kernel config helps.

>Release-Note:
>Audit-Trail:

From: Victor Snezhko <snezhko@indorsoft.ru>
To: bug-followup@freebsd.org
Cc: freebsd-current@freebsd.org, Vladimir Kushnir <vkushnir@i.kiev.ua>, Max
 Laier <max@love2party.net>, suz@freebsd.org
Subject: Re: kern/88725: /usr/sbin/ppp panic with 2005.10.21 netinet6
 changes
Date: Thu, 10 Nov 2005 16:54:34 +0600

 --=-=-=
 
 Mark Tinguely has found the offending timer.
 
 The following patch fixes the problem for me:
 
 
 --=-=-=
 Content-Type: text/x-patch
 Content-Disposition: attachment; filename=mld6.diff
 
 --- mld6.c	Wed Nov  9 08:27:14 2005
 ***************
 *** 640,645 ****
 --- 640,649 ----
   		mld6_stop_listening(in6m);
   		ifma->ifma_protospec = NULL;
   		LIST_REMOVE(in6m, in6m_entry);
 + 		if (in6m->in6m_timer != IN6M_TIMER_UNDEF) {
 + 			printf("in6_delmulti: timer 0x%p is still active\n", in6m->in6m_timer_ch);
 + 			mld_stoptimer(in6m);
 + 		}
   		free(in6m->in6m_timer_ch, M_IP6MADDR);
   		free(in6m, M_IP6MADDR);
   	}
 
 --=-=-=
 
 
 The printf fires with the patch applied, and the panic doesn't occur.
 
 I have tested it on -current cvsupped with date=2005.10.21.16.25.00,
 and will test it on a fresh -current (in a day or two - I will need
 to recompile everything). The patch should work there as well,
 though: according to cvsweb, mld6.c hasn't changed.
 
 -- 
 WBR, Victor V. Snezhko
 EMail: snezhko@indorsoft.ru
 
 --=-=-=--
 

From: Mark Tinguely <tinguely@casselton.net>
To: bug-followup@freebsd.org, snezhko@indorsoft.ru
Cc: freebsd-current@freebsd.org, Max@freebsd.org, max@love2party.net
Subject: Re: kern/88725: /usr/sbin/ppp panic with 2005.10.21 netinet6 changes
Date: Thu, 10 Nov 2005 08:50:37 -0600 (CST)

 As a postscript:
 
  The problem was that a dynamic timer was freed without being stopped first.
  Obviously, the printf() should be removed from the final fix.
 
  After this discovery, I went through all of the callout_init() calls
  in the kernel and looked at those that may be freed before possibly
  being stopped. Besides the one in netinet6/mld6.c, I have 5 more
  that initially look like the memory for the callout structure could
  also be freed without the callout having been stopped. These paths are
  probably not traveled much (detaches for less mainstream components),
  but stopping the callout is cheap and not at all risky.
 
  I will look at the 5 cases again and suggest that all of these at-risk
  callouts be stopped under the same fix.
 
 --Mark Tinguely
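
The stop-before-free rule described in the postscript above can be sketched outside the kernel. This is a toy model only: toy_membership, toy_timer, toy_pending and the toy_* functions are hypothetical names, with TOY_TIMER_UNDEF standing in for IN6M_TIMER_UNDEF and toy_stop_timer() for mld_stoptimer(). It assumes a single global list of armed timers and no concurrency.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_TIMER_UNDEF (-1)	/* stand-in for IN6M_TIMER_UNDEF */

struct toy_timer {
	struct toy_timer *next;
	int expires;		/* TOY_TIMER_UNDEF while not armed */
};

struct toy_membership {
	struct toy_timer *timer_ch;	/* allocated separately, like in6m_timer_ch */
};

struct toy_timer *toy_pending;	/* every armed timer is linked here */

struct toy_membership *
toy_membership_new(void)
{
	struct toy_membership *m = malloc(sizeof(*m));

	if (m == NULL)
		return (NULL);
	m->timer_ch = malloc(sizeof(*m->timer_ch));
	if (m->timer_ch == NULL) {
		free(m);
		return (NULL);
	}
	m->timer_ch->next = NULL;
	m->timer_ch->expires = TOY_TIMER_UNDEF;
	return (m);
}

void
toy_start_timer(struct toy_membership *m, int ticks)
{
	assert(ticks >= 0 && m->timer_ch->expires == TOY_TIMER_UNDEF);
	m->timer_ch->expires = ticks;
	m->timer_ch->next = toy_pending;
	toy_pending = m->timer_ch;
}

/* Unlink the timer from the pending list and mark it as not armed. */
void
toy_stop_timer(struct toy_membership *m)
{
	struct toy_timer **pp;

	for (pp = &toy_pending; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == m->timer_ch) {
			*pp = m->timer_ch->next;
			break;
		}
	}
	m->timer_ch->next = NULL;
	m->timer_ch->expires = TOY_TIMER_UNDEF;
}

/* Stop the timer if it is still armed, and only then release the memory. */
void
toy_membership_delete(struct toy_membership *m)
{
	if (m->timer_ch->expires != TOY_TIMER_UNDEF)
		toy_stop_timer(m);
	free(m->timer_ch);
	free(m);
}

/* Number of timers currently linked into the pending list. */
int
toy_pending_count(void)
{
	int n = 0;
	struct toy_timer *t;

	for (t = toy_pending; t != NULL; t = t->next)
		n++;
	return (n);
}
```

Without the check in toy_membership_delete(), the pending list would keep a pointer to freed memory, which is the situation the panic exposed. The model says nothing about a handler running concurrently on another CPU.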

From: SUZUKI Shinsuke <suz@freebsd.org>
To: snezhko@indorsoft.ru
Cc: bug-followup@freebsd.org,
	freebsd-current@freebsd.org,
	vkushnir@i.kiev.ua,
	max@love2party.net,
	suz@freebsd.org
Subject: Re: kern/88725: /usr/sbin/ppp panic with 2005.10.21 netinet6 changes
Date: Thu, 10 Nov 2005 07:40:49 -0800

 >>>>> On Thu, 10 Nov 2005 16:54:34 +0600
 >>>>> snezhko@indorsoft.ru(Victor Snezhko)  said:
 
 > Mark Tinguely has found the offending timer.
 > The following patch fixes the problem for me:
 
 Thanks.  Sounds right to me.
 So please commit it when you've finished the test with fresh -current.

From: John Baldwin <jhb@freebsd.org>
To: freebsd-current@freebsd.org
Cc: SUZUKI Shinsuke <suz@freebsd.org>, snezhko@indorsoft.ru,
        max@love2party.net, bug-followup@freebsd.org
Subject: Re: kern/88725: /usr/sbin/ppp panic with 2005.10.21 netinet6 changes
Date: Thu, 10 Nov 2005 11:40:13 -0500

 On Thursday 10 November 2005 10:40 am, SUZUKI Shinsuke wrote:
 > >>>>> On Thu, 10 Nov 2005 16:54:34 +0600
 > >>>>> snezhko@indorsoft.ru(Victor Snezhko)  said:
 > >
 > > Mark Tinguely has found the offending timer.
 > > The following patch fixes the problem for me:
 >
 > Thanks.  sounds right for me.
 > So please commit it if when you've finished the test with fresh -current.
 
 As a general rule you should be using callout_drain() before freeing a callout 
 to handle the race condition where the callout is running on another CPU (so 
 callout_stop can't stop it) while you are freeing it.  Note that you can not 
 use callout_drain() if you are holding any locks, though.  In those cases you 
 will need to defer the callout_drain() and free() until you have dropped the 
 locks.  Here's one example fix:
 
 Index: nd6.c
 ===================================================================
 RCS file: /usr/cvs/src/sys/netinet6/nd6.c,v
 retrieving revision 1.62
 diff -u -r1.62 nd6.c
 --- nd6.c       22 Oct 2005 05:07:16 -0000      1.62
 +++ nd6.c       3 Nov 2005 19:56:42 -0000
 @@ -398,7 +398,7 @@
         if (tick < 0) {
                 ln->ln_expire = 0;
                 ln->ln_ntick = 0;
 -               callout_stop(&ln->ln_timer_ch);
 +               callout_drain(&ln->ln_timer_ch);
         } else {
                 ln->ln_expire = time_second + tick / hz;
                 if (tick > INT_MAX) {
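
The deferral described above (unlink under the lock, then drain and free after the lock is dropped) can be sketched as a single-threaded userland model. Everything here is hypothetical: toy_table, toy_entry and the toy_* functions are illustrative names, the lock is a plain flag, and toy_drain() only stands in for callout_drain() by asserting that the lock is not held. The model shows the ordering of the steps; it cannot reproduce the cross-CPU race itself.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_entry {
	struct toy_entry *next;
	int key;
	int timer_armed;	/* stands in for an embedded struct callout */
};

struct toy_table {
	int locked;		/* stands in for a mutex */
	struct toy_entry *head;
};

void
toy_lock(struct toy_table *t)
{
	assert(!t->locked);
	t->locked = 1;
}

void
toy_unlock(struct toy_table *t)
{
	assert(t->locked);
	t->locked = 0;
}

/* Models a draining stop: it may wait, so the lock must not be held. */
void
toy_drain(struct toy_table *t, struct toy_entry *e)
{
	assert(!t->locked);
	e->timer_armed = 0;
}

/* Add an entry with an armed timer; returns 0 on success, -1 on failure. */
int
toy_insert(struct toy_table *t, int key)
{
	struct toy_entry *e = malloc(sizeof(*e));

	if (e == NULL)
		return (-1);
	e->key = key;
	e->timer_armed = 1;
	toy_lock(t);
	e->next = t->head;
	t->head = e;
	toy_unlock(t);
	return (0);
}

/* Remove every entry with the given key; returns how many were freed. */
int
toy_remove(struct toy_table *t, int key)
{
	struct toy_entry *doomed = NULL, **pp, *e;
	int n = 0;

	toy_lock(t);
	for (pp = &t->head; (e = *pp) != NULL; ) {
		if (e->key == key) {
			*pp = e->next;		/* unlink while the lock is held */
			e->next = doomed;	/* park on a private list */
			doomed = e;
		} else {
			pp = &e->next;
		}
	}
	toy_unlock(t);

	/* The lock is dropped: now it is safe to drain and free. */
	while ((e = doomed) != NULL) {
		doomed = e->next;
		toy_drain(t, e);
		free(e);
		n++;
	}
	return (n);
}
```

The private doomed list is the design choice that makes this work: once an entry is unlinked, no other code holding the lock can find it, so the drain and the free can happen later without the lock.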
  
 -- 
 John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
 "Power Users Use the Power to Serve"  =  http://www.FreeBSD.org

From: Victor Snezhko <snezhko@indorsoft.ru>
To: Mark Tinguely <tinguely@casselton.net>
Cc: bug-followup@freebsd.org, max@love2party.net,
	  freebsd-current@freebsd.org,  Max@freebsd.org
Subject: Re: kern/88725: /usr/sbin/ppp panic with 2005.10.21 netinet6
 changes
Date: Thu, 10 Nov 2005 23:02:47 +0600

 Mark Tinguely <tinguely@casselton.net> writes:
 
 > As a postscript:
 >
 >  The problem was a dynamic timer was freed without being stopped first.
 >  Obviously, the printf() should be removed from the final fix.
 >
 >  After this discovery, I went through all of the callout_init() calls
 >  in the kernel and looked at those that may be freed before possibly
 >  being stopped. Beside the one in netinet6/mld6.c, I have 5 more
 >  that initially look like the memory for the callout struction could
 >  also be freed and still not have been stopped. These paths are problably
 >  not traveled much (detaches for less mainstream components), but stopping
 >  the callout is cheap and not at all risky.
 
 Not risky? I'm not an expert, but I think there might be issues when
 a callout is stopped while its timer function is executing.
 I see the following bad scenario: the timer function begins to execute,
 then we call callout_stop(), then free all the necessary data
 structures, and then control returns to the timer function, which could
 depend on structures that are already freed.
 
 I.e. in each case we should check that callout_stop() does no harm.
 
 On the other hand, callout_drain() could introduce lock order issues (as
 John Baldwin pointed out).
 
 >  I will look at the 5 cases again and suggest all of these callout at
 >  risk be stopped under the same fix.
 
 -- 
 WBR, Victor V. Snezhko
 EMail: snezhko@indorsoft.ru
 
 

From: Victor Snezhko <snezhko@indorsoft.ru>
To: John Baldwin <jhb@freebsd.org>
Cc: freebsd-current@freebsd.org,  SUZUKI Shinsuke <suz@freebsd.org>,
	  snezhko@indorsoft.ru,  max@love2party.net,  bug-followup@freebsd.org
Subject: Re: kern/88725: /usr/sbin/ppp panic with 2005.10.21 netinet6
 changes
Date: Fri, 11 Nov 2005 15:09:36 +0600

 John Baldwin <jhb@freebsd.org> writes:
 
 >>> Mark Tinguely has found the offending timer.
 >>> The following patch fixes the problem for me:
 >>
 >> Thanks.  sounds right for me.
 >> So please commit it if when you've finished the test with fresh -current.
 >
 > As a general rule you should be using callout_drain() before freeing a callout 
 > to handle the race condition where the callout is running on another CPU (so 
 > callout_stop can't stop it) while you are freeing it.  Note that you can not 
 > use callout_drain() if you are holding any locks, though.  In those cases you 
 > will need to defer the callout_drain() and free() until you have dropped the 
 > locks.  Here's one example fix:
 >
 > Index: nd6.c
 > ===================================================================
 > RCS file: /usr/cvs/src/sys/netinet6/nd6.c,v
 > retrieving revision 1.62
 > diff -u -r1.62 nd6.c
 > --- nd6.c       22 Oct 2005 05:07:16 -0000      1.62
 > +++ nd6.c       3 Nov 2005 19:56:42 -0000
 > @@ -398,7 +398,7 @@
 >         if (tick < 0) {
 >                 ln->ln_expire = 0;
 >                 ln->ln_ntick = 0;
 > -               callout_stop(&ln->ln_timer_ch);
 > +               callout_drain(&ln->ln_timer_ch);
 >         } else {
 >                 ln->ln_expire = time_second + tick / hz;
 >                 if (tick > INT_MAX) {
 
 The code that was committed (and introduced an armed timer that was
 freed) is full of callout_stop() calls and contains not a single
 callout_drain().
 
 So I think that, in order to be consistent, we shouldn't fix two problems
 at once. The right way would be to commit the fix with callout_stop()
 first (and close the PR) and then investigate whether we can replace
 the stops with drains without introducing a deadlock (for each timer
 separately). In this case we will at least have a working system to
 roll back to if there are issues with callout_drain().
 
 I have tested the patch by Mark Tinguely (that fixes mld6.c) on a
 fresh -current (cvsupped ~2 days ago), and it works there too
 (unsurprisingly). So it may be committed, I suppose (without the debug
 printf, of course).
 
 -- 
 WBR, Victor V. Snezhko
 EMail: snezhko@indorsoft.ru
 
 

From: Mark Tinguely <tinguely@casselton.net>
To: bug-followup@freebsd.org, snezhko@indorsoft.ru
Cc:  
Subject: Re: kern/88725: /usr/sbin/ppp panic with 2005.10.21 netinet6 changes
Date: Fri, 11 Nov 2005 09:04:50 -0600 (CST)

 I think this patch should be applied. The evidence for the other callouts
 that I flagged is too inconclusive to make any modification at this time.
 
 I am working with another person on a callout panic that is similar
 to this panic, but he does not use IPv6.
 
 
 --- netinet6/mld6.c	Wed Nov  9 08:27:14 2005
 ***************
 *** 640,645 ****
 --- 640,649 ----
   		mld6_stop_listening(in6m);
   		ifma->ifma_protospec = NULL;
   		LIST_REMOVE(in6m, in6m_entry);
 + 		if (in6m->in6m_timer != IN6M_TIMER_UNDEF)
 + 			mld_stoptimer(in6m);
   		free(in6m->in6m_timer_ch, M_IP6MADDR);
   		free(in6m, M_IP6MADDR);
   	}
 
 
 --Mark Tinguely
State-Changed-From-To: open->closed 
State-Changed-By: suz 
State-Changed-When: Wed Nov 16 12:39:18 GMT 2005 
State-Changed-Why:  
The proposed patch has been committed.

http://www.freebsd.org/cgi/query-pr.cgi?pr=88725 
>Unformatted:
