From nobody@FreeBSD.org  Mon Apr 27 13:53:18 2009
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 207471065670
	for <freebsd-gnats-submit@FreeBSD.org>; Mon, 27 Apr 2009 13:53:18 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id E7A168FC1D
	for <freebsd-gnats-submit@FreeBSD.org>; Mon, 27 Apr 2009 13:53:17 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.3/8.14.3) with ESMTP id n3RDrHi9060354
	for <freebsd-gnats-submit@FreeBSD.org>; Mon, 27 Apr 2009 13:53:17 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.3/8.14.3/Submit) id n3RDrH9D060353;
	Mon, 27 Apr 2009 13:53:17 GMT
	(envelope-from nobody)
Message-Id: <200904271353.n3RDrH9D060353@www.freebsd.org>
Date: Mon, 27 Apr 2009 13:53:17 GMT
From: Peter Steele <psteele@maxiscale.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: Gmirror overwrites fs with stale data from returning member
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         134044
>Category:       kern
>Synopsis:       [geom] gmirror(8) overwrites fs with stale data from returning member
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-geom
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Apr 27 14:00:04 UTC 2009
>Closed-Date:    Tue Nov 23 17:40:27 UTC 2010
>Last-Modified:  Tue Nov 23 17:40:27 UTC 2010
>Originator:     Peter Steele
>Release:        7.0-RELEASE-p9
>Organization:
MaxiScale Inc.
>Environment:
FreeBSD r02s17 7.0-RELEASE-p9 FreeBSD 7.0-RELEASE-p9 #2: Thu Apr  2 22:09:33 UTC 2009     root@r04s25:/usr/obj/usr/src/sys/GENERIC  amd64

>Description:
We had a four-member gmirror'ed slice, and one of the members died. This member happened to be drive 0 on our system (which mapped to ad4). On a subsequent reboot the system came back up with a reduced 3/4 mirror, but otherwise everything worked as expected. Some additional software was installed on the system, and then we shut it down again to deal with the faulty drive. We discovered it wasn't pushed in snugly enough, so the fix was easy.

However, when the system came back up, instead of reinserting that old drive into the existing mirror, the gmirror driver decided to use that drive as the primary member and tried to insert the other three drives into that mirror. This failed, leaving a mirror consisting of the single drive ad4. We have a service that runs automatically once the system has finished booting to check the status of the mirror. This service automatically reinserts any drive that isn't participating in the mirror, to make sure the mirror is always fully populated.

When this happened, though, we lost the new data that had been installed on the mirrored file system while ad4 was absent. It was overwritten by the stale data from ad4. Not good.

>How-To-Repeat:
This problem is easy to reproduce. Create a mirrored slice with more than two members. (In fact we have only observed the problem with a four-member mirror and cannot say for sure whether it would happen on a three-member mirror.) Shut the system down and remove drive 0 (the first drive) from the system. Reboot the system. This should result in a degraded mirror with drive 0 missing.

Create some new files on the file system, then shut the system down again. Reinsert drive 0 and reboot the system. As the gmirror driver starts up, instead of reinserting drive 0 into the existing mirror, it will assume drive 0 is the "good drive" and try to add the other drives to that mirror, even though they have the more recent data. This will fail during the boot phase, leaving a mirror consisting only of drive 0 and its stale data.

This problem only happens with drive 0. If the same steps are repeated with, say, drive 1, then when the drive is reinserted and the system rebooted, it is added back to the existing mirror instead of becoming the sole member of the mirror with its stale data.

>Fix:
None has been discovered, beyond making sure that a drive 0 is fully formatted before being reinserted into an existing cluster. This is not a practical solution in the field; you can never be sure what a customer will do.
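
As an illustration only, a guard that a monitoring service could apply before reinserting a drive might look like the Python sketch below. It assumes the service can obtain the metadata of the running mirror's member and of the candidate drive as `key: value` text with a `syncid` field (for example from something like `gmirror dump`; the exact field layout used here is an assumption, and `parse_metadata` and `safe_to_reinsert` are hypothetical helper names). This is not a fix for the driver itself.

```python
def parse_metadata(text):
    """Parse 'key: value' lines into a dict with lower-cased keys."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def safe_to_reinsert(mirror_meta, candidate_meta):
    """Return True only if the candidate is not newer than the running mirror.

    Reinserting a drive overwrites it with the mirror's contents, so a
    candidate carrying a higher syncid than the mirror holds newer data
    that would be lost.
    """
    mirror = parse_metadata(mirror_meta)
    candidate = parse_metadata(candidate_meta)
    return int(candidate["syncid"]) <= int(mirror["syncid"])


# Illustrative metadata (assumed layout): the mirror came up on the stale
# drive 0 alone, while the left-out drives carry a higher syncid.
stale_mirror = "name: gm0\nsyncid: 1\n"
left_out_drive = "name: gm0\nsyncid: 2\n"

if safe_to_reinsert(stale_mirror, left_out_drive):
    print("reinsert drive")
else:
    print("refuse: drive is newer than the running mirror")
```

With such a check the service would stop and ask for operator attention instead of overwriting the newer drives.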


>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->freebsd-geom 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Sat May 2 22:01:05 UTC 2009 
Responsible-Changed-Why:  
Over to maintainer(s). 

http://www.freebsd.org/cgi/query-pr.cgi?pr=134044 

From: Alexander Motin <mav@FreeBSD.org>
To: bug-followup@FreeBSD.org, psteele@maxiscale.com
Cc:  
Subject: Re: kern/134044: [geom] gmirror(8) overwrites fs with stale data
 from returning member
Date: Wed, 06 Jan 2010 12:25:42 +0200

 I am unable to reproduce this issue on a recent HEAD system. Whenever a
 mirror loses one of its components, or boots without one, it increments
 the SyncID parameter (reported by `gmirror list`) on the first subsequent
 write operation. If the lost component reappears, it has a lower SyncID
 than the rest and is forced to be synchronized.
 
 Could you try to reproduce the bug on a newer system? If you can, please
 provide some more details, especially how SyncID changes after each event
 and what messages GMIRROR prints.
 
 -- 
 Alexander Motin
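
The SyncID behaviour described in this message can be modelled with a short Python sketch. It is a simplified illustration, not the actual GEOM code: the `Component` class, the helper names, and the drive names other than ad4 are hypothetical, and real SyncID handling involves more state than a single integer.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str    # provider name, e.g. "ad4"
    syncid: int  # simplified stand-in for the on-disk SyncID


def bump_syncid(active):
    """Model the first write after a component is lost: survivors get SyncID + 1."""
    return [Component(c.name, c.syncid + 1) for c in active]


def split_fresh_and_stale(components):
    """Return (fresh, stale) name lists; the highest SyncID marks current data."""
    newest = max(c.syncid for c in components)
    fresh = [c.name for c in components if c.syncid == newest]
    stale = [c.name for c in components if c.syncid < newest]
    return fresh, stale


# Four-member mirror as in the report; ad4 goes missing and the rest keep running.
members = [Component(n, 1) for n in ("ad4", "ad6", "ad8", "ad10")]
survivors = bump_syncid([c for c in members if c.name != "ad4"])
returning = [c for c in members if c.name == "ad4"]

fresh, stale = split_fresh_and_stale(survivors + returning)
print("fresh:", fresh)
print("stale (must be resynchronized):", stale)
```

In this model the returning ad4 is the only component classified as stale, which is the behaviour expected on HEAD and the opposite of what the report describes on 7.0.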

From: Peter Steele <psteele@maxiscale.com>
To: Alexander Motin <mav@FreeBSD.org>, "bug-followup@FreeBSD.org"
	<bug-followup@FreeBSD.org>
Cc:  
Subject: RE: kern/134044: [geom] gmirror(8) overwrites fs with stale data
 from returning member
Date: Wed, 6 Jan 2010 08:58:10 -0600

 I don't believe we are seeing this in release 8.0. I'll verify and let you
 know if the problem persists for us.
 
 -----Original Message-----
 From: Alexander Motin [mailto:mavbsd@gmail.com] On Behalf Of Alexander Motin
 Sent: Wednesday, January 06, 2010 2:26 AM
 To: bug-followup@FreeBSD.org; Peter Steele
 Subject: Re: kern/134044: [geom] gmirror(8) overwrites fs with stale data from returning member
 
 I am unable to reproduce this issue on a recent HEAD system. Whenever a
 mirror loses one of its components, or boots without one, it increments
 the SyncID parameter (reported by `gmirror list`) on the first subsequent
 write operation. If the lost component reappears, it has a lower SyncID
 than the rest and is forced to be synchronized.
 
 Could you try to reproduce the bug on a newer system? If you can, please
 provide some more details, especially how SyncID changes after each event
 and what messages GMIRROR prints.
 
 --
 Alexander Motin
State-Changed-From-To: open->closed 
State-Changed-By: lulf 
State-Changed-When: Tue Nov 23 17:36:52 UTC 2010 
State-Changed-Why:  
- Closing, as reporter does not seem to still have the problem after 11 months. Re-open if the issue turns up again. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=134044 
>Unformatted:
