From nobody@FreeBSD.org  Sun Mar 20 02:04:35 2005
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 37E7E16A4CE
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 20 Mar 2005 02:04:35 +0000 (GMT)
Received: from www.freebsd.org (www.freebsd.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id E509343D1F
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 20 Mar 2005 02:04:34 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.13.1/8.13.1) with ESMTP id j2K24Yo1056483
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 20 Mar 2005 02:04:34 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.13.1/8.13.1/Submit) id j2K24YVV056482;
	Sun, 20 Mar 2005 02:04:34 GMT
	(envelope-from nobody)
Message-Id: <200503200204.j2K24YVV056482@www.freebsd.org>
Date: Sun, 20 Mar 2005 02:04:34 GMT
From: Sven Willenberger <sven@dmv.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: gvinum unable to create a striped set of mirrored sets/plexes
X-Send-Pr-Version: www-2.3

>Number:         79035
>Category:       kern
>Synopsis:       [vinum] gvinum unable to create a striped set of mirrored sets/plexes
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-geom
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Sun Mar 20 02:10:01 GMT 2005
>Closed-Date:    
>Last-Modified:  Mon May 19 20:01:18 UTC 2008
>Originator:     Sven Willenberger
>Release:        5-STABLE
>Organization:
>Environment:
FreeBSD ncorp-mail.dmv.com 5.4-PRERELEASE FreeBSD 5.4-PRERELEASE #0: Thu Mar 10 03:53:34 EST 2005     svenw@ncorp-mail.dmv.com:/usr/obj/usr/src/sys/CORPMAIL  i386
>Description:
Under the current implementation of gvinum it is possible to create a mirrored set of striped plexes but not a striped set of mirrored plexes. For purposes of resiliency the latter configuration is preferred as illustrated by the following example:

Use 6 disks to create one of 2 different scenarios.

1) Using the current abilities of gvinum, create 2 striped sets of 3 disks each (A1 A2 A3 and B1 B2 B3), then create a mirror of those 2 sets such that A(123) mirrors B(123). In this situation, if any drive in Set A fails, one still has a working set with Set B. If any drive in Set B now fails, the system is shot.

2) Using the proposed ability, create 3 mirror sets: A1 and B1, A2 and B2, A3 and B3. Now create a stripe set across all three mirrors. One of the "A" drives can now fail (for example A1), and one of the "B" drives can also fail; as long as it is not "B1" in this case, we still have a functioning array.

Thus the striping of mirrors (rather than a mirror of striped sets) is a more resilient and fault-tolerant setup of a multi-disk array. 
>How-To-Repeat:
      
>Fix:

>Release-Note:
>Audit-Trail:

From: Greg 'groggy' Lehey <grog@FreeBSD.org>
To: Sven Willenberger <sven@dmv.com>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/79035: gvinum unable to create a striped set of mirrored sets/plexes
Date: Sun, 20 Mar 2005 13:41:01 +1030

 [Format recovered--see http://www.lemis.com/email/email-format.html]
 
 Single line paragraphs.  Please limit your lines to < 80 characters.
 
 On Sunday, 20 March 2005 at  2:04:34 +0000, Sven Willenberger wrote:
 >
 > Under the current implementation of gvinum it is possible to create
 > a mirrored set of striped plexes but not a striped set of mirrored
 > plexes. For purposes of resiliency the latter configuration is
 > preferred as illustrated by the following example:
 >
 > Use 6 disks to create one of 2 different scenarios.
 >
 > 1) Using the current abilities of gvinum create 2 striped sets using
 > 3 disks each: A1 A2 A3 and B1 B2 B3 then create a mirror of those 2
 > sets such that A(123) mirrors B(123). In this situation if any drive
 > in Set A fails, one still has a working set with Set B. If any drive
 > now fails in Set B, the system is shot.
 
 No, this is not correct.  The plex ("set") only fails when all drives
 in it fail.
 
 > 2) Using the proposed added ability to create 3 mirror sets A1 and
 > B1, A2 and B2, A3 and B3. Now create a stripe set across all three
 > mirrors. Now we can have a situation where one of the "A" drives
 > fail (for example A1). Then we can also have one of the "B" drives
 > fail and, as long as it is not "B1" in this case, we still have a
 > functioning array.
 
 Agreed.  So there's no difference.
 
 > Thus the striping of mirrors (rather than a mirror of striped sets)
 > is a more resilient and fault-tolerant setup of a multi-disk array.
 
 No, you're misunderstanding the current implementation.
 
 This is a change request, so I'm not closing (or even assigning to
 myself) the PR.
 
 Greg
 --
 When replying to this message, please take care not to mutilate the
 original text.
 For more information, see http://www.lemis.com/email.html
 Finger grog@FreeBSD.org for PGP public key.
 See complete headers for address and phone numbers.

From: Sven Willenberger <sven@dmv.com>
To: "Greg 'groggy' Lehey" <grog@FreeBSD.org>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/79035: gvinum unable to create a striped set of mirrored
 sets/plexes
Date: Sat, 19 Mar 2005 23:43:00 -0500

 Greg 'groggy' Lehey presumably uttered the following on 03/19/05 22:11:
 > [Format recovered--see http://www.lemis.com/email/email-format.html]
 > 
 > Single line paragraphs.  Please limit your lines to < 80 characters.
 
 Sorry, used the web interface ... didn't realize it would not 
 automatically wrap.
 
 > 
 > On Sunday, 20 March 2005 at  2:04:34 +0000, Sven Willenberger wrote:
 > 
 >>Under the current implementation of gvinum it is possible to create
 >>a mirrored set of striped plexes but not a striped set of mirrored
 >>plexes. For purposes of resiliency the latter configuration is
 >>preferred as illustrated by the following example:
 >>
 >>Use 6 disks to create one of 2 different scenarios.
 >>
 >>1) Using the current abilities of gvinum create 2 striped sets using
 >>3 disks each: A1 A2 A3 and B1 B2 B3 then create a mirror of those 2
 >>sets such that A(123) mirrors B(123). In this situation if any drive
 >>in Set A fails, one still has a working set with Set B. If any drive
 >>now fails in Set B, the system is shot.
 > 
 > 
 > No, this is not correct.  The plex ("set") only fails when all drives
 > in it fail.
 > 
 
 I hope the following diagrams better illustrate what I was trying to
 point out. Data is striped across all the A's, and that stripe set is
 mirrored to the B stripe set:
 
    __stripe__
 __|___|____|__
 | A1  A2  A3 | --|m
 |____________|   |i
                   |r
    __stripe__     |r
 __|___|____|__   |o
 | B1  B2  B3 | --|r
 |____________|
 
 If A1 fails, then the A stripe set cannot function (much like RAID 0,
 where one failed disk fails the set), meaning that B is now the array:
 
    __stripe__
 __|___|____|__
 |     A2  A3 | ==> fails
 |____________|      |
                      |
                    --X--
    __stripe__        |
 __|___|____|__      |
 | B1  B2  B3 | ==> remains
 |____________|
 
 If any B disk then fails, the B stripe set fails as well, leaving no
 functioning part of the mirror:
 
    __stripe__
 __|___|____|__
 |     A2  A3 | ==> fails
 |____________|      |
                      |
                    --X--
    __stripe__        |
 __|___|____|__      |
 | B1      B3 | ==> fails
 |____________|
 
 
 Unless I am misunderstanding and gvinum somehow rebuilds the A stripe 
 over A2 and A3 if A1 fails.
 
 > 
 >>2) Using the proposed added ability to create 3 mirror sets A1 and
 >>B1, A2 and B2, A3 and B3. Now create a stripe set across all three
 >>mirrors. Now we can have a situation where one of the "A" drives
 >>fail (for example A1). Then we can also have one of the "B" drives
 >>fail and, as long as it is not "B1" in this case, we still have a
 >>functioning array.
 > 
 > 
 > Agreed.  So there's no difference.
 >
 
    _____stripe_____
 __|__  __|___  __|___
 | A1 | | A2 |  | A3 |
 | B1 | | B2 |  | B3 |
 |____| |____|  |____|
 
 Now if A1 fails, the B1 part of the mirror can still participate in the
 stripe:
    _____stripe____
 __|__  __|__  __|__
 |    | | A2 | | A3 |
 | B1 | | B2 | | B3 |
 |____| |____| |____|
 
 Likewise if either B2 or B3 fails now we still have a functioning stripe:
    _____stripe____
 __|__  __|__  __|__
 |    | | A2 | | A3 |
 | B1 | |    | | B3 |
 |____| |____| |____|
 
 At this point we could still have either A3 or B3 fail and still have a 
 functioning stripe set.
 
 > 
 >>Thus the striping of mirrors (rather than a mirror of striped sets)
 >>is a more resilient and fault-tolerant setup of a multi-disk array.
 > 
 > 
 > No, you're misunderstanding the current implementation.
 
 Perhaps I am ... but unless gvinum somehow reconstructs a 3 disk stripe
 into a 2 disk stripe in the event one disk fails, I am not sure how. The
 resiliency has to do with a 2 disk failure. Even in a 4 disk scenario,
 the mirror of stripes can survive 2 of 6 2-disk failure scenarios while
 the stripe of mirrors can survive 4 of 6 2-disk failure scenarios.
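
The two-disk failure counts quoted above can be checked by brute force. The following is a minimal sketch, assuming the whole-set failure model described in this message (a single failed disk fails its entire stripe set); the function and variable names are illustrative and are not part of gvinum.

```python
from itertools import combinations


def mirror_of_stripes_survives(failed, stripes):
    # Whole-set model: a stripe set is lost as soon as any member fails;
    # the mirror survives if at least one stripe set is fully intact.
    return any(not (set(s) & failed) for s in stripes)


def stripe_of_mirrors_survives(failed, mirrors):
    # Every mirror pair must keep at least one working member.
    return all(set(m) - failed for m in mirrors)


def count_survivable(disks, survives, groups, n_failed=2):
    # Enumerate every way n_failed disks can fail and count the survivors.
    cases = [set(c) for c in combinations(disks, n_failed)]
    return sum(survives(f, groups) for f in cases), len(cases)


disks4 = ["A1", "A2", "B1", "B2"]
stripes4 = [("A1", "A2"), ("B1", "B2")]   # A stripe mirrored to B stripe
mirrors4 = [("A1", "B1"), ("A2", "B2")]   # stripe across two mirror pairs

print(count_survivable(disks4, mirror_of_stripes_survives, stripes4))  # (2, 6)
print(count_survivable(disks4, stripe_of_mirrors_survives, mirrors4))  # (4, 6)
```

The same helpers accept the six-disk layout from the original description by passing three-member stripe sets or three mirror pairs.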
 
 > 
 > This is a change request, so I'm not closing (or even assigning to
 > myself) the PR.
 >
 
 Fair enough ... I would just like to see the stripe of mirror scenarios 
 common to hardware raid solutions become a configuration option for 
 gvinum (or understand why my interpretation above is incorrect), so per 
 your original advice I submitted this PR.
 
 Sven Willenberger

From: Greg 'groggy' Lehey <grog@FreeBSD.org>
To: Sven Willenberger <sven@dmv.com>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/79035: gvinum unable to create a striped set of mirrored sets/plexes
Date: Sun, 20 Mar 2005 15:51:33 +1030

 On Saturday, 19 March 2005 at 23:43:00 -0500, Sven Willenberger wrote:
 > Greg 'groggy' Lehey presumably uttered the following on 03/19/05 22:11:
 >> On Sunday, 20 March 2005 at  2:04:34 +0000, Sven Willenberger wrote:
 >>
 >>> Under the current implementation of gvinum it is possible to create
 >>> a mirrored set of striped plexes but not a striped set of mirrored
 >>> plexes. For purposes of resiliency the latter configuration is
 >>> preferred as illustrated by the following example:
 >>>
 >>> Use 6 disks to create one of 2 different scenarios.
 >>>
 >>> 1) Using the current abilities of gvinum create 2 striped sets using
 >>> 3 disks each: A1 A2 A3 and B1 B2 B3 then create a mirror of those 2
 >>> sets such that A(123) mirrors B(123). In this situation if any drive
 >>> in Set A fails, one still has a working set with Set B. If any drive
 >>> now fails in Set B, the system is shot.
 >>
 >> No, this is not correct.  The plex ("set") only fails when all drives
 >> in it fail.
 >
 > I hope the following diagrams better illustrate what I was trying to
 > point out. Data striped across all the A's and that is mirrored to the B
 > Stripes:
 >
 > ...
 >
 > If A1 fails, then the A Stripe set cannot function (much like in Raid 0,
 > one disk fails the set) meaning that B now is the array:
 
 No, this is not correct.
 
 >>> Thus the striping of mirrors (rather than a mirror of striped sets)
 >>> is a more resilient and fault-tolerant setup of a multi-disk array.
 >>
 >> No, you're misunderstanding the current implementation.
 >
 > Perhaps I am ... but unless gvinum somehow reconstructs a 3 disk stripe
 > into a 2 disk stripe in the event one disk fails, I am now sure how.
 
 Well, you have the source code.  It's not quite the way you look at
 it.  It doesn't have stripes: it has plexes.  And they can be
 incomplete.  If a read to a plex hits a "hole", it automatically
 retries via (possibly all) the other plexes.  Only when all plexes
 have a hole in the same place does the transfer fail.
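
The read behaviour described in the paragraph above can be pictured with a toy model. This is only a sketch of the stated rule (retry other plexes on a hole, fail only when every plex has a hole at the same place); it is not taken from the gvinum source, and the dictionary representation and the name read_block are assumptions made for illustration.

```python
def read_block(plexes, block):
    """Return the data for `block` from the first plex that has it.

    Toy model: each plex is a dict {block number: data}; a missing key
    stands for a "hole" (a region with no working subdisk behind it).
    """
    for plex in plexes:
        if block in plex:
            return plex[block]
    raise LookupError(f"block {block}: every plex has a hole here")


# A volume of 8 blocks, mirrored across two plexes.
full = {b: f"data{b}" for b in range(8)}

# Hypothetical failure: p0 loses the drive holding its odd blocks,
# p1 loses the drive holding its even blocks.
p0 = {b: d for b, d in full.items() if b % 2 == 0}
p1 = {b: d for b, d in full.items() if b % 2 == 1}

recovered = [read_block([p0, p1], b) for b in range(8)]
print(recovered == list(full.values()))  # True: the holes do not overlap

# If both plexes have a hole in the same place, the read fails.
del p1[3]
try:
    read_block([p0, p1], 3)
except LookupError as err:
    print(err)
```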
 
 You might like to (re)read http://www.vinumvm.org/vinum/intro.html.
 
 Greg
 --
 See complete headers for address and phone numbers.

From: Sven Willenberger <sven@dmv.com>
To: "Greg 'groggy' Lehey" <grog@FreeBSD.org>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/79035: gvinum unable to create a striped set of mirrored
 sets/plexes
Date: Sun, 20 Mar 2005 01:00:29 -0500

 Greg 'groggy' Lehey presumably uttered the following on 03/20/05 00:21:
 > On Saturday, 19 March 2005 at 23:43:00 -0500, Sven Willenberger wrote:
 > 
 >>Greg 'groggy' Lehey presumably uttered the following on 03/19/05 22:11:
 >>
 >>>On Sunday, 20 March 2005 at  2:04:34 +0000, Sven Willenberger wrote:
 >>>
 >>>
 >>>>Under the current implementation of gvinum it is possible to create
 >>>>a mirrored set of striped plexes but not a striped set of mirrored
 >>>>plexes. For purposes of resiliency the latter configuration is
 >>>>preferred as illustrated by the following example:
 >>>>
 >>>>Use 6 disks to create one of 2 different scenarios.
 >>>>
 >>>>1) Using the current abilities of gvinum create 2 striped sets using
 >>>>3 disks each: A1 A2 A3 and B1 B2 B3 then create a mirror of those 2
 >>>>sets such that A(123) mirrors B(123). In this situation if any drive
 >>>>in Set A fails, one still has a working set with Set B. If any drive
 >>>>now fails in Set B, the system is shot.
 >>>
 >>>No, this is not correct.  The plex ("set") only fails when all drives
 >>>in it fail.
 >>
 >>I hope the following diagrams better illustrate what I was trying to
 >>point out. Data striped across all the A's and that is mirrored to the B
 >>Stripes:
 >>
 >>...
 >>
 >>If A1 fails, then the A Stripe set cannot function (much like in Raid 0,
 >>one disk fails the set) meaning that B now is the array:
 > 
 > 
 > No, this is not correct.
 > 
 > 
 >>>>Thus the striping of mirrors (rather than a mirror of striped sets)
 >>>>is a more resilient and fault-tolerant setup of a multi-disk array.
 >>>
 >>>No, you're misunderstanding the current implementation.
 >>
 >>Perhaps I am ... but unless gvinum somehow reconstructs a 3 disk stripe
 >>into a 2 disk stripe in the event one disk fails, I am now sure how.
 > 
 > 
 > Well, you have the source code.  It's not quite the way you look at
 > it.  It doesn't have stripes: it has plexes.  And they can be
 > incomplete.  If a read to a plex hits a "hole", it automatically
 > retries via (possibly all) the other plexes.  Only when all plexes
 > have a hole in the same place does the transfer fail.
 > 
 > You might like to (re)read http://www.vinumvm.org/vinum/intro.html.
 > 
 > Greg
 > --
 > See complete headers for address and phone numbers.
 
 I guess I just needed someone to come out and say what you just said.
 Rereading the manual led me back to the point of confusion that brought
 me to this question in the first place. Quoting (from The-Big-Picture):
 
 "    * Although a plex represents the complete data of a volume, it is
        possible for parts of the representation to be physically missing,
        either by design (by not defining a subdisk for parts of the plex)
        or by accident (as a result of the failure of a drive).
      * A volume is a collection of between one and eight plexes. Each
        plex represents the data in the volume, so more than one plex
        provides mirroring. As long as at least one plex can provide the
        data for the complete address range of the volume, the volume is
        fully functional."
 
 The first sentence would seem to imply partial plexes are ok. However,
 the last sentence (which implies that one plex needs to provide the
 data for the complete address range) suggests that the volume still
 needs at least one complete plex in order to function. I could not
 find any indication that it could combine "partial" plexes into a
 fully functioning volume, so I am glad you did point that out. This
 would indicate that the solution I seek is already available (and now
 I can test this:)  )
 
 Sven
Responsible-Changed-From-To: freebsd-bugs->le 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Sun Apr 3 08:31:10 GMT 2005 
Responsible-Changed-Why:  
Over to maintainer. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=79035 

From: Sven Willenberger <sven@dmv.com>
To: "Greg 'groggy' Lehey" <grog@FreeBSD.org>
Cc: freebsd-gnats-submit@FreeBSD.org, freebsd-stable@FreeBSD.org
Subject: Re: kern/79035: gvinum unable to create a striped set of mirrored
	sets/plexes
Date: Fri, 08 Apr 2005 18:28:10 -0400

 On Sun, 2005-03-20 at 15:51 +1030, Greg 'groggy' Lehey wrote:
 > On Saturday, 19 March 2005 at 23:43:00 -0500, Sven Willenberger wrote:
 > > Greg 'groggy' Lehey presumably uttered the following on 03/19/05 22:11:
 > >> On Sunday, 20 March 2005 at  2:04:34 +0000, Sven Willenberger wrote:
 > >>
 > >>> Under the current implementation of gvinum it is possible to create
 > >>> a mirrored set of striped plexes but not a striped set of mirrored
 > >>> plexes. For purposes of resiliency the latter configuration is
 > >>> preferred as illustrated by the following example:
 > >>>
 > >>> Use 6 disks to create one of 2 different scenarios.
 > >>>
 > >>> 1) Using the current abilities of gvinum create 2 striped sets using
 > >>> 3 disks each: A1 A2 A3 and B1 B2 B3 then create a mirror of those 2
 > >>> sets such that A(123) mirrors B(123). In this situation if any drive
 > >>> in Set A fails, one still has a working set with Set B. If any drive
 > >>> now fails in Set B, the system is shot.
 > >>
 > >> No, this is not correct.  The plex ("set") only fails when all drives
 > >> in it fail.
 > >
 > > I hope the following diagrams better illustrate what I was trying to
 > > point out. Data striped across all the A's and that is mirrored to the B
 > > Stripes:
 > >
 > > ...
 > >
 > > If A1 fails, then the A Stripe set cannot function (much like in Raid 0,
 > > one disk fails the set) meaning that B now is the array:
 > 
 > No, this is not correct.
 > 
 > >>> Thus the striping of mirrors (rather than a mirror of striped sets)
 > >>> is a more resilient and fault-tolerant setup of a multi-disk array.
 > >>
 > >> No, you're misunderstanding the current implementation.
 > >
 > > Perhaps I am ... but unless gvinum somehow reconstructs a 3 disk stripe
 > > into a 2 disk stripe in the event one disk fails, I am now sure how.
 > 
 > Well, you have the source code.  It's not quite the way you look at
 > it.  It doesn't have stripes: it has plexes.  And they can be
 > incomplete.  If a read to a plex hits a "hole", it automatically
 > retries via (possibly all) the other plexes.  Only when all plexes
 > have a hole in the same place does the transfer fail.
 > 
 > You might like to (re)read http://www.vinumvm.org/vinum/intro.html.
 > 
 
 I was really hoping that the "holes in the plex" behaviour was going
 to work, but my tests have shown otherwise. I created a gvinum array
 consisting of (A striped B) mirror (C striped D), which is the only such
 mirror/stripe combination allowed by gvinum for four drives. We have:
 
 _________
 | A   B |__
 |_______|  |
            |Mirror
 _________  |
 | C   D |--|
 |_______|
 
 Based on what the "plex hole" theory states, Drive A and Drive D could
 both fail and the system would read through the holes and pick up data
 from B and C (or the converse if B and C failed), making it functionally
 equivalent to a stripe of mirrors. To fail a drive I rebooted
 single-user, used dd to write /dev/zero to the beginning of the disk,
 and then ran fdisk.
 
 drive d device /dev/da4s1h
 drive c device /dev/da3s1h
 drive b device /dev/da2s1h
 drive a device /dev/da1s1h
 volume home
 plex name home.p1 org striped 960s vol home
 plex name home.p0 org striped 960s vol home
 sd name home.p1.s1 drive d len 71681280s driveoffset 265s plex home.p1 plexoffset 960s
 sd name home.p1.s0 drive c len 71681280s driveoffset 265s plex home.p1 plexoffset 0s
 sd name home.p0.s1 drive b len 71681280s driveoffset 265s plex home.p0 plexoffset 960s
 sd name home.p0.s0 drive a len 71681280s driveoffset 265s plex home.p0 plexoffset 0s
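
To make the layout above concrete, the following sketch maps a sector offset within a striped plex to a drive and a sector on that drive. It assumes the conventional round-robin striping arithmetic with the 960-sector stripe size and 265-sector drive offset from the configuration; it is not derived from the gvinum source, and the function name locate is hypothetical.

```python
STRIPE = 960        # stripe size in sectors, from "org striped 960s"
DRIVE_OFFSET = 265  # start of each subdisk on its drive, from "driveoffset 265s"


def locate(plex_sector, drives, stripe=STRIPE, drive_offset=DRIVE_OFFSET):
    """Map a sector offset in a striped plex to (drive, sector on drive).

    Assumes stripes are handed out round-robin across the subdisks.
    """
    stripe_no, within = divmod(plex_sector, stripe)
    row, index = divmod(stripe_no, len(drives))
    return drives[index], drive_offset + row * stripe + within


P0 = ["a", "b"]  # home.p0: subdisk s0 on drive a, s1 on drive b
P1 = ["c", "d"]  # home.p1: subdisk s0 on drive c, s1 on drive d

print(locate(0, P0))     # ('a', 265)
print(locate(960, P0))   # ('b', 265)
print(locate(1920, P0))  # ('a', 1225)
print(locate(960, P1))   # ('d', 265)
```

Under this assumed mapping, drives b and c hold different stripes of the volume (odd stripes in home.p0, even stripes in home.p1), which is why the "plex hole" reasoning would predict that losing both still leaves every stripe readable from a or d.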
 
 In my case:        Fail B     Fail B and C
 A = /dev/da1s1h      up          up
 B = /dev/da2s1h      down        down
 C = /dev/da3s1h      up          down
 D = /dev/da4s1h      up          up
 
 1 Volume
 V home2              up          down (!)
 
 2 Plexes
 P home.p0 (A and B)  down        down
 P home.p1 (C and D)  up          down
 
 4 Subdisks
 S home.p0.s0 (A)     up          up
 S home.p0.s1 (B)     down        down
 S home.p1.s0 (C)     up          down
 S home.p1.s1 (D)     up          up
 
 Based on this, failing the one drive did in fact fail the plex (home.p0).
 Although at that point I realized that failing either drive on the other
 plex would also fail that plex and with it the volume, I went ahead and
 failed drive C as well. The result was a failed volume.
 
 With the failed B drive, once I bsdlabeled the disk to include the vinum
 slice, I got the message that the plex was now stale (instead of
 down). A simple gvinum start home2 changed the state to degraded and
 the system rebuilt the array. When both drives failed I had to work in
 a bit of a kludge: I ran gvinum setstate -f up home.p1.s0, then gvinum
 start home.p0. At that point the system rebuilt itself and it would
 appear the data is intact ... I have not completely tested or verified
 that last statement, however.
 
 In essence, although I had hoped my feature request for the ability to
 create a striped set of mirrors would be supplanted by the functional
 equivalent via the "plex hole" system, that did not come to fruition.
 So please note this as either a re-request for that feature or a bug
 report in that the pass-through feature of gvinum plexes is broken.
 
 Sven
 
Responsible-Changed-From-To: le->freebsd-geom 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Mon May 19 20:01:03 UTC 2008 
Responsible-Changed-Why:  
With bugmeister hat on, reassign as le@ has not been active in a while. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=79035 
>Unformatted:
