From johan@giantfoo.org  Mon Jan 21 19:16:27 2008
Return-Path: <johan@giantfoo.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B55EB16A41A
	for <FreeBSD-gnats-submit@freebsd.org>; Mon, 21 Jan 2008 19:16:27 +0000 (UTC)
	(envelope-from johan@giantfoo.org)
Received: from pangu.giantfoo.org (pangu.giantfoo.org [24.227.169.103])
	by mx1.freebsd.org (Postfix) with ESMTP id 36B7313C458
	for <FreeBSD-gnats-submit@freebsd.org>; Mon, 21 Jan 2008 19:16:26 +0000 (UTC)
	(envelope-from johan@giantfoo.org)
Received: from localhost (localhost [127.0.0.1])
	by pangu.giantfoo.org (Postfix) with ESMTP id 2F476122A2
	for <FreeBSD-gnats-submit@freebsd.org>; Mon, 21 Jan 2008 12:59:49 -0600 (CST)
Message-Id: <20080121.125948.131912746.johan@giantfoo.org>
Date: Mon, 21 Jan 2008 12:59:48 -0600 (CST)
From: Johan A. van Zanten <johan@giantfoo.org>
To: FreeBSD-gnats-submit@freebsd.org
Subject: 7.0 kernel panic during boot with ZFS and WD1600JS

>Number:         119868
>Category:       kern
>Synopsis:       [geom_gpt] [patch] 7.0 kernel panic with corrupt GPT label
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    marcel
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Jan 21 19:20:02 UTC 2008
>Closed-Date:    Sat Nov 22 17:55:36 UTC 2008
>Last-Modified:  Sat Nov 22 17:55:36 UTC 2008
>Originator:     Johan A. van Zanten
>Release:        FreeBSD 7.0-PRERELEASE i386
>Organization:
>Environment:
System: FreeBSD laozi 7.0-PRERELEASE FreeBSD 7.0-PRERELEASE #5: Tue Jan 1 03:14:32 CST 2008 johan@laozi:/local/build/FreeBSD/obj/i386/tew/006/no-backup/src/FreeBSD/FreeBSD-7.0/src/sys/DONGFEN i386


	
>Description:


This is a brand new installation.

The panic also occurs with the GENERIC kernel.

The boot device is a SCSI disk on a separate controller.

Before the drive is set up for use as a ZFS device, the kernel identifies
it as:
kernel: twed0: <Unit 1, JBOD, Normal> on twe0
kernel: twed0: 152627MB (312581808 sectors)
kernel: GEOM_LABEL: Label for provider twed0p1 is msdosfs/EFI.

This is a Western Digital WD1600JS SATA drive, connected to a 3ware
8002-LP card (2-port SATA, PCI).

Last time the drive was used in a different computer and OS, it was in
good working order.

After the device is set up for use with ZFS (via a "zpool create ..."
command), the kernel panics at the next boot when it begins to scan the
attached disks, just after the "Waiting 5 seconds for SCSI devices to
settle" message.

What's interesting is that a different SATA drive on the same port of the
same card does not cause the panic. The "good" drive is a Western Digital
WD360 (SATA, 36 GB).

boot-time kernel output:

twe1: 152627MB (312581808 sectors)
GEOM: new disk twed0
GEOM: new disk twed1


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 01
fault virtual address   = 0x3f80
fault code              = supervisor read, page not present
instruction pointer	= 0x20:0xc06d0e2c
stack pointer		= 0x28:0xe2fb4b60
frame pointer		= 0x28:0xe2fb4c58
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, def32 1, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 2 (g_event)
trap number		= 12
panic: page fault
Uptime: 1s
Cannot dump. No dumpdevice defined.
Automatic reboot in 15 seconds - press a key on the console to abort

>How-To-Repeat:

Do "zpool create poolname $dev".

Reboot the machine.

>Fix:

Disconnect the drive. (Not much of a workaround. :)

>Release-Note:
>Audit-Trail:

From: Remko Lodder <remko@FreeBSD.org>
To: "Johan A. van Zanten" <johan@giantfoo.org>
Cc: FreeBSD-gnats-submit@FreeBSD.org
Subject: Re: kern/119868: 7.0 kernel panic during boot with ZFS and WD1600JS
Date: Mon, 21 Jan 2008 20:30:15 +0100

 Johan A. van Zanten wrote:
 > 
 > Fatal trap 12: page fault while in kernel mode
 > cpuid = 0; apic id = 01
 > fault virtual address   = 0x3f80
 > fault code              = supervisors read, page not present
 > instruction pointer	= 0x20:0xc06d0e2c
 > stack pointer		= 0x28:0xe2fb4b60
 > frame pointer		= 0x28:0xe2fb4c58
 > code segment		= base 0x0, limit 0xfffff, type 0x1b
 > 			= DPL 0, pres 1, def32 1, gran 1
 > processor eflags	= interrupt enabled, resume, IOPL = 0
 > current process		= 2 (g_event)
 > trap number		= 12
 > panic: page fault
 > Uptime: 1s
 > Cannot dump. No dumpdevice defined.
 > Automatic reboot in 15 seconds - press a key on the console to abort
 > 
 
 Hello,
 
 Please set a dump device; see 
 http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html 
 for more information. We need this to be able to see what is going on 
 and what is going nuts. Without this we will not be able (imo) to 
 resolve your problem.
 
 Thanks for taking the time to report this, though, and for using FreeBSD!
 
 Cheers
 remko
 
 -- 
 /"\   Best regards,                      | remko@FreeBSD.org
 \ /   Remko Lodder                       | remko@EFnet
   X    http://www.evilcoder.org/          |
 / \   ASCII Ribbon Campaign              | Against HTML Mail and News

From: Johan A. van Zanten <johan@giantfoo.org>
To: remko@FreeBSD.org
Cc: FreeBSD-gnats-submit@FreeBSD.org
Subject: Re: kern/119868: 7.0 kernel panic during boot with ZFS and WD1600JS
Date: Tue, 22 Jan 2008 22:12:06 -0600 (CST)

 Remko Lodder <remko@FreeBSD.org> wrote:
 > Johan A. van Zanten wrote:
 > > 
 > > Fatal trap 12: page fault while in kernel mode
 > > cpuid = 0; apic id = 01
 > > fault virtual address   = 0x3f80
 > > fault code              = supervisors read, page not present
 > > instruction pointer	= 0x20:0xc06d0e2c
 > > stack pointer		= 0x28:0xe2fb4b60
 > > frame pointer		= 0x28:0xe2fb4c58
 > > code segment		= base 0x0, limit 0xfffff, type 0x1b
 > > 			= DPL 0, pres 1, def32 1, gran 1
 > > processor eflags	= interrupt enabled, resume, IOPL = 0
 > > current process		= 2 (g_event)
 > > trap number		= 12
 > > panic: page fault
 > > Uptime: 1s
 > > Cannot dump. No dumpdevice defined.
 > > Automatic reboot in 15 seconds - press a key on the console to abort
 > > 
 > 
 > Hello,
 > 
 > Please set a dumpdevice see 
 > http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html 
 > for more information. We need this to be able to see what is going on 
 > and what is going nuts. Without this we will not be able (imo) to 
 > resolve your problem.
 
 Can you give an example of the syntax for specifying the dump device in
 the kernel config?
 
 The crash seems to be happening before dumpon is run. According to the
 web page you cite:
 
   Alternatively, the dump device can be hard-coded via the dump clause in
   the config(5) line of a kernel configuration file. This approach is
   deprecated and should be used only if a kernel is crashing before
   dumpon(8) can be executed.
 
  But i cannot find any example of the syntax for the "dump" clause in
  /usr/src/sys/conf or in config(5).
 
 
 Thanks, johan

From: Harald Hanche-Olsen <hanche@math.ntnu.no>
To: bug-followup@FreeBSD.org, johan@giantfoo.org
Cc:  
Subject: Re: kern/119868: [zfs] 7.0 kernel panic during boot with ZFS and
 WD1600JS
Date: Tue, 30 Sep 2008 17:49:17 +0200 (CEST)

 The original reporter seems to have given up on this. I have seen
 something very similar, and thought I could provide some more
 information.
 
 I now have three disks all in an unusable state, causing FreeBSD to
 panic upon seeing these disks. Common to all is that they contained
 ZFS pools that were online when the computer crashed, possibly for
 unrelated reasons. Upon reboot, the computer would panic when noticing
 the disk; in fact, immediately after printing the standard message
 giving the device name and disk type on the console.
 
 ZFS may however be incidental to the problem: The panic happens even
 if I don't have zfs.ko loaded when the problem disk is plugged in.
 I wonder if it could be related to kern/127115 somehow?
 
 I cannot get a dump unfortunately - the console says "Dumping xxx MB"
 and hangs if I have activated kernel dumps (using dumpon) before
 triggering the panic.
 
 So I compiled a debug kernel and obtained a backtrace using ddb
 instead. Here is the output, copied by hand from a photo of the screen:
 
 
 Fatal trap 12: page fault while in kernel mode
 cpuid = 0; apic id = 00
 fault virtual address   = 0x3f80
 fault code              = supervisor read data, page not present
 [...]
 current process         = 2 (g_event)
 [thread pid 2 tid 100007 ]
 Stopped at    bcmp+0x8:    repe cmpsq  (%rsi),%es:(%rdi)
 db> trace
 Tracing pid 2 tid 100007 td 0xffffff0001129000
 bcmp() at bcmp+0x8
 g_part_taste() at g_part_taste+0x252
 g_new_provider_event() at g_new_provider_event+0x75
 g_run_events() at g_run_events+0x1b8
 g_event_procbody() at g_event_procbody+0x57
 fork_exit() at fork_exit+0x11f
 fork_trampoline() at fork_trampoline+0xe
 --- trap 0, rip = 0, rsp = 0xffffffffb3600d30, rbp = 0 ---
 
 
 I am really not very familiar with ddb. Let me know if you wish me to
 dig deeper, but then I need a pointer as to what to look for.
 
 - Harald

From: Harald Hanche-Olsen <hanche@math.ntnu.no>
To: bug-followup@FreeBSD.org, johan@giantfoo.org
Cc:  
Subject: Re: kern/119868: [zfs] 7.0 kernel panic during boot with ZFS and
 WD1600JS
Date: Tue, 30 Sep 2008 18:18:00 +0200 (CEST)

 For what it's worth, assuming it is the partition table that has
 gotten screwed up somehow, here are the first 34 sectors of the
 disk that caused the panic described in my previous mail:
 
   http://www.math.ntnu.no/~hanche/tmp/baddisk.bin
 
 (Created by attaching the disk to a Mac and running dd bs=512 count=34
 on the device file. Not sure if binary attachments are OK here.)
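 
 As a rough aid for inspecting such a dump: on a disk with 512-byte
 sectors, a GPT conventionally occupies LBA 0 (protective MBR), LBA 1
 (header, beginning with the 8-byte signature "EFI PART") and LBA 2-33
 (128 entries of 128 bytes each), which matches the count of 34 used
 above. The C sketch below only checks for the signature at LBA 1; the
 function name and the in-memory image are assumptions made for
 illustration and are not part of any FreeBSD tool.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE	512		/* assumed sector size, as in the dd command */
#define GPT_SIGNATURE	"EFI PART"	/* first 8 bytes of a GPT header */

/*
 * Hypothetical helper: returns 1 if the image (e.g. the 34-sector dump
 * read into memory) carries the GPT signature at LBA 1, 0 otherwise.
 */
int
image_has_gpt_signature(const unsigned char *image, size_t len)
{
	assert(image != NULL);
	if (len < 2 * SECTOR_SIZE)
		return (0);	/* too short to contain LBA 1 */
	return (memcmp(image + SECTOR_SIZE, GPT_SIGNATURE, 8) == 0);
}
```

 A signature match says nothing about whether the header CRC or the
 entry array is intact, so a dump can pass this check and still describe
 a corrupt table.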
 
 I forgot to mention that this is on 7.0-STABLE/amd64 as of 19 August
 (7.0-STABLE #3). But I also see the problem on 7.0-RELEASE/i386.
 
 - Harald

From: Harald Hanche-Olsen <hanche@math.ntnu.no>
To: bug-followup@FreeBSD.org, johan@giantfoo.org
Cc:  
Subject: Re: kern/119868: [zfs] 7.0 kernel panic during boot with ZFS and
 WD1600JS
Date: Tue, 30 Sep 2008 20:41:23 +0200 (CEST)

 I just had my biggest "duh" moment in a veeery long time.
 The above two "contributions" to this PR can probably be ignored.
 
 For the curious: I intended to do
 
 #; gpt create -f da2
 #; gpt add -t 6a898cc3-1dd2-11b2-99a6-080020736631 da2
 #; zpool create poolname da2p1
 
 but apparently, I created the pool on da2 instead, partially
 overwriting the GPT. And I managed to do this (count 'em) no less than
 THREE times!
 
 Like I said, DUH, and my apologies for the noise.
 
 Maybe we could turn the noise into a feature request: Perhaps zpool
 should be smart enough to recognize that the user is about to shoot
 his own foot and refuse to cooperate?
 
 - Harald

From: Johan A. van Zanten <johan@giantfoo.org>
To: hanche@math.ntnu.no
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/119868: [zfs] 7.0 kernel panic during boot with ZFS and
 WD1600JS
Date: Wed, 01 Oct 2008 09:04:50 -0500 (CDT)

 Harald Hanche-Olsen <hanche@math.ntnu.no> wrote:
 > The original reporter seems to have given up on this. I have seen
 > something very similar, and thought I could provide some more
 > information.
 
 Thanks for helping.  The problem for me is that the panic occurred very
 early in the boot process, before the dump device is normally configured,
 and no one on the freebsd-help list, nor anyone reading these bug reports
 seemed to know or care enough to help me get a dump device configured
 earlier.  I spent some time going through the source, trying to figure out
 a way to do this, but the time required for me to do this task exceeded
 the amount of time i had to spend on it.
 
  -johan

From: Harald Hanche-Olsen <hanche@math.ntnu.no>
To: johan@giantfoo.org
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/119868: [zfs] 7.0 kernel panic during boot with ZFS and
 WD1600JS
Date: Wed, 01 Oct 2008 17:18:31 +0200 (CEST)

 + Johan A. van Zanten <johan@giantfoo.org>:
 
 > The problem for me is that the panic occured very early in the boot
 > process, before the dump device is normally configured, and no one
 > on the freebsd-help list, nor anyone reading these bug reports
 > seemed to know or care enough to help me get a dump device
 > configured earlier.
 
 Well, the handbook gives a method that it says is "deprecated"
 
 http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html
 
 (specifying a dump device in the kernel config), but these lines from
 /usr/src/usr.sbin/config/config.y
 
 System_spec:
         CONFIG System_id System_parameter_list
           = { errx(1, "%s:%d: root/dump/swap specifications obsolete",
               yyfile, yyline);}
 
 make me think that the handbook itself is obsolete at this point, and
 the "deprecated" method is no longer available.
 
 If you still have the disk and wish to resurrect it, you can try my
 method: I booted from an Ubuntu CD and erased the EFI partition table
 using dd if=/dev/zero bs=512 count=1 seek=1 of=/dev/disk/by-id/...
 (making VERY sure I did not clobber the wrong disk).
 
 - Harald

From: Johan A. van Zanten <johan@giantfoo.org>
To: hanche@math.ntnu.no
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/119868: [zfs] 7.0 kernel panic during boot with ZFS and
 WD1600JS
Date: Wed, 01 Oct 2008 10:33:37 -0500 (CDT)

 Harald Hanche-Olsen <hanche@math.ntnu.no> wrote:
 > + Johan A. van Zanten <johan@giantfoo.org>:
 > 
 > > The problem for me is that the panic occured very early in the boot
 > > process, before the dump device is normally configured, and no one
 > > on the freebsd-help list, nor anyone reading these bug reports
 > > seemed to know or care enough to help me get a dump device
 > > configured earlier.
 > 
 > Well, the handbook gives a method that it says is "deprecated"
 > 
 > http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html
 
 Yes, i think i tried this and it did not work. :(
 
  -johan

From: Jaakko Heinonen <jh@saunalahti.fi>
To: Harald Hanche-Olsen <hanche@math.ntnu.no>
Cc: bug-followup@FreeBSD.org, johan@giantfoo.org, marcel@FreeBSD.org
Subject: Re: kern/119868: [zfs] 7.0 kernel panic during boot with ZFS and
	WD1600JS
Date: Thu, 2 Oct 2008 10:39:52 +0300

 Hi,
 
 On 2008-09-30, Harald Hanche-Olsen wrote:
 >  For the curious: I intended to do
 >  
 >  #; gpt create -f da2
 >  #; gpt add -t 6a898cc3-1dd2-11b2-99a6-080020736631 da2
 >  #; zpool create poolname da2p1
 >  
 >  but apparently, I created the pool on da2 instead, partially
 >  overwriting the GPT.
 
 This PR is a duplicate of kern/127115. The bug is not in ZFS code but in
 the gpart GPT code. It's possible that a corrupted GPT partition table
 causes a panic in g_part_gpt_read().
 
 These conditions must be true after reading the tables in
 g_part_gpt_read() to cause the panic:
 
 table->state[GPT_ELT_PRIHDR] == GPT_STATE_OK
 pritbl == NULL
 table->state[GPT_ELT_SECTBL] == GPT_STATE_OK
 
 The panic happens at line 661 in g_part_gpt.c (r183533) when tbl is NULL.
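 
 One reading that would be consistent with the reported fault address
 (an inference, not something confirmed in this PR): if tbl is NULL, the
 table has the customary 128 entries of 128 bytes each, and the first
 entry examined is the last one, the access lands at 127 * 128 = 0x3f80.
 The struct and function names in this user-space sketch are
 hypothetical; only the arithmetic is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the 128-byte on-disk GPT entry layout. */
struct gpt_entry_sketch {
	uint8_t		type_uuid[16];		/* partition type */
	uint8_t		unique_uuid[16];	/* partition identity */
	uint64_t	lba_start;
	uint64_t	lba_end;
	uint64_t	attributes;
	uint16_t	name[36];		/* UTF-16 partition name */
};

/* Address of entry[index].type_uuid for a table starting at base. */
uintptr_t
entry_type_address(uintptr_t base, size_t index)
{
	assert(sizeof(struct gpt_entry_sketch) == 128);
	return (base + index * sizeof(struct gpt_entry_sketch) +
	    offsetof(struct gpt_entry_sketch, type_uuid));
}
```

 With base 0 (a NULL table) and index 127, entry_type_address() yields
 0x3f80, the fault virtual address shown in the panic reports in this PR.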
 
 Here is a proposed fix:
 
 %%%
 Index: sys/geom/part/g_part_gpt.c
 ===================================================================
 --- sys/geom/part/g_part_gpt.c	(revision 183533)
 +++ sys/geom/part/g_part_gpt.c	(working copy)
 @@ -631,7 +631,7 @@ g_part_gpt_read(struct g_part_table *bas
  			table->state[GPT_ELT_PRIHDR] = GPT_STATE_INVALID;
  	}
  
 -	if (table->state[GPT_ELT_PRIHDR] != GPT_STATE_OK) {
 +	if (table->state[GPT_ELT_PRITBL] != GPT_STATE_OK) {
  		printf("GEOM: %s: the primary GPT table is corrupt or "
  		    "invalid.\n", pp->name);
  		printf("GEOM: %s: using the secondary instead -- recovery "
 @@ -641,7 +641,7 @@ g_part_gpt_read(struct g_part_table *bas
  		if (pritbl != NULL)
  			g_free(pritbl);
  	} else {
 -		if (table->state[GPT_ELT_SECHDR] != GPT_STATE_OK) {
 +		if (table->state[GPT_ELT_SECTBL] != GPT_STATE_OK) {
  			printf("GEOM: %s: the secondary GPT table is corrupt "
  			    "or invalid.\n", pp->name);
  			printf("GEOM: %s: using the primary only -- recovery "
 %%%
 
 With the patch applied, this is what I get with the corrupted GPT table:
 
 GEOM: ad0: the primary GPT table is corrupt or invalid.
 GEOM: ad0: using the secondary instead -- recovery strongly advised.
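 
 The effect of the change can be illustrated with a simplified
 user-space model (not the kernel code; the type and function names are
 made up for this sketch): the old check selects the primary table
 whenever the primary header is good, even if the table itself could not
 be read, while the patched check looks at the table state.

```c
#include <assert.h>
#include <stddef.h>

enum elt_state { ELT_OK, ELT_BAD };	/* stand-in for the GPT_STATE_* values */

struct gpt_probe {			/* hypothetical result of reading both copies */
	enum elt_state	prihdr, pritbl, sechdr, sectbl;
	const void	*pritbl_data;	/* NULL when the table could not be read */
	const void	*sectbl_data;
};

/* Pre-patch behavior: the choice is keyed on the primary header state. */
const void *
pick_table_old(const struct gpt_probe *p)
{
	assert(p != NULL);
	return (p->prihdr != ELT_OK ? p->sectbl_data : p->pritbl_data);
}

/* Patched behavior: the choice is keyed on the primary table state. */
const void *
pick_table_patched(const struct gpt_probe *p)
{
	assert(p != NULL);
	return (p->pritbl != ELT_OK ? p->sectbl_data : p->pritbl_data);
}
```

 For the failing combination listed above (primary header OK, pritbl
 NULL, secondary table OK), pick_table_old() returns NULL, whereas
 pick_table_patched() returns the secondary table.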
 
 -- 
 Jaakko

From: Harald Hanche-Olsen <hanche@math.ntnu.no>
To: jh@saunalahti.fi
Cc: bug-followup@FreeBSD.org, johan@giantfoo.org, marcel@FreeBSD.org
Subject: Re: kern/119868: [zfs] 7.0 kernel panic during boot with ZFS and
 WD1600JS
Date: Thu, 02 Oct 2008 11:26:39 +0200 (CEST)

 + Jaakko Heinonen <jh@saunalahti.fi>:
 
 > This PR is a duplicate of kern/127115.
 
 Like I suspected (see my earlier mail).
 
 Unfortunately I cannot test your fix, since I have repaired my three
 damaged disks.
 
 - Harald

From: kvs@pil.dk (Kenneth Schmidt)
To: bug-followup@FreeBSD.org,johan@giantfoo.org
Cc:  
Subject: Re: kern/119868: [zfs] 7.0 kernel panic during boot with ZFS and WD1600JS
Date: Thu,  2 Oct 2008 17:47:15 +0200 (CEST)

 Hi.
 
 I can confirm this fix works on -CURRENT as of yesterday - geom_gpt
 recognizes the corrupted table, and skips it.
 
 -- 
 Kenneth Vestergaard Schmidt
State-Changed-From-To: open->analyzed 
State-Changed-By: linimon 
State-Changed-When: Sun Oct 19 13:15:19 UTC 2008 
State-Changed-Why:  
Patch has been submitted and has been confirmed as fixing the problem. 


Responsible-Changed-From-To: freebsd-bugs->freebsd-fs 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Sun Oct 19 13:15:19 UTC 2008 
Responsible-Changed-Why:  

http://www.freebsd.org/cgi/query-pr.cgi?pr=119868 
Responsible-Changed-From-To: freebsd-fs->freebsd-geom 
Responsible-Changed-By: gavin 
Responsible-Changed-When: Thu Nov 6 11:39:24 UTC 2008 
Responsible-Changed-Why:  
Jaakko Heinonen points out that this is actually a bug with geom_gpt 
and not ZFS.  The PR contains a patch, confirmed to fix the issue. 

State-Changed-From-To: analyzed->patched 
State-Changed-By: marcel 
State-Changed-When: Thu Nov 6 16:52:40 UTC 2008 
State-Changed-Why:  
Fix committed in -CURRENT. MFC to happen in a week. 
Thanks for the analysis and patch. 


Responsible-Changed-From-To: freebsd-geom->marcel 
Responsible-Changed-By: marcel 
Responsible-Changed-When: Thu Nov 6 16:52:40 UTC 2008 
Responsible-Changed-Why:  
Fix committed in -CURRENT. MFC to happen in a week. 
Thanks for the analysis and patch. 

State-Changed-From-To: patched->closed 
State-Changed-By: marcel 
State-Changed-When: Sat Nov 22 17:55:00 UTC 2008 
State-Changed-Why:  
Fix committed to 7-STABLE. 


>Unformatted:
