From nobody@FreeBSD.org  Fri Aug 16 18:06:51 2002
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.FreeBSD.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8E49437B400
	for <freebsd-gnats-submit@FreeBSD.org>; Fri, 16 Aug 2002 18:06:51 -0700 (PDT)
Received: from www.freebsd.org (www.FreeBSD.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 4B5E943E65
	for <freebsd-gnats-submit@FreeBSD.org>; Fri, 16 Aug 2002 18:06:51 -0700 (PDT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.12.4/8.12.4) with ESMTP id g7H16pOT086208
	for <freebsd-gnats-submit@FreeBSD.org>; Fri, 16 Aug 2002 18:06:51 -0700 (PDT)
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.12.4/8.12.4/Submit) id g7H16p2D086207;
	Fri, 16 Aug 2002 18:06:51 -0700 (PDT)
Message-Id: <200208170106.g7H16p2D086207@www.freebsd.org>
Date: Fri, 16 Aug 2002 18:06:51 -0700 (PDT)
From: Doug Swarin <doug@texas.net>
To: freebsd-gnats-submit@FreeBSD.org
Subject: vinum issues: page fault while rebuilding; inability to hot-rebuild striped plexes
X-Send-Pr-Version: www-1.0

>Number:         41740
>Category:       kern
>Synopsis:       [vinum] page fault while rebuilding; inability to hot-rebuild striped plexes
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    le
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Aug 16 18:10:03 PDT 2002
>Closed-Date:    Sat Nov 26 15:19:47 GMT 2005
>Last-Modified:  Sat Nov 26 15:19:47 GMT 2005
>Originator:     Doug Swarin
>Release:        4-STABLE
>Organization:
>Environment:
FreeBSD vmware.localdomain 4.6-STABLE #12: Fri Aug 16 16:29:37 CDT 2002 root@vmware.localdomain:/usr/obj/usr/src/sys/VMWARE i386
>Description:
      1. The launch_requests() function in vinumrequest.c needs splbio() protection around the lower loop. Without splbio(), complete_rqe() may be called at splx() in BUF_STRATEGY(). If there are inactive rqgs in rq (for example, with XFR_BAD_SUBDISK), rq may be deallocated before the loop completes walking the rqg queue in rq, causing either a page fault or an infinite loop.

      2. A striped plex cannot be safely hot-rebuilt, and the documentation carries no warning to that effect. Because all requests to the rebuilding plex return REQUEST_DOWN, the two plexes will be inconsistent after the rebuild finishes: writes to the already-rebuilt region of the rebuilding plex are applied only to the good plex.
>How-To-Repeat:
      1. Create a pair of striped plexes as a single volume. 'vinum stop' one plex, then 'vinum start' it to start it rebuilding. Run postmark or perform other heavy activity against the mounted filesystem while the rebuild takes place.

      2. After the above hot-rebuild, unmount the filesystem, run fsck, and watch the errors fly. The splbio() fix will probably need to be applied before the hot-rebuild will succeed.
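      A configuration along these lines reproduces the setup (device names, sizes and the stripe size are examples only, not from the report):

```shell
# Example vinum config: one volume, two striped plexes (a mirror of stripes).
#   drive d1 device /dev/da1s1e
#   drive d2 device /dev/da2s1e
#   drive d3 device /dev/da3s1e
#   drive d4 device /dev/da4s1e
#   volume mir
#     plex org striped 256k
#       sd drive d1 length 1g
#       sd drive d2 length 1g
#     plex org striped 256k
#       sd drive d3 length 1g
#       sd drive d4 length 1g
#
# Then, with the filesystem on /dev/vinum/mir mounted:
#   vinum stop mir.p1      # take one plex down
#   vinum start mir.p1     # bring it back, triggering the rebuild
#   (run postmark or other heavy I/O against the mount while it revives)
```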
>Fix:
      1. Add 'int s;' to the top of launch_requests(), 's = splbio();' at line 395, and 'splx(s);' at line 439. I apologize for not providing an actual diff; I am submitting this through the web form.

      2. Add a note to the documentation warning against hot-rebuilding a striped plex. The long-term fix would be to fill in the missing code in checksdstate() in vinumstate.c so that it returns the proper result for a striped plex.
>Release-Note:
>Audit-Trail:

From: Vallo Kallaste <vallo@estcard.ee>
To: Doug Swarin <doug@texas.net>
Cc: freebsd-gnats-submit@FreeBSD.ORG, grog@lemis.com
Subject: Re: kern/41740: vinum issues: page fault while rebuilding; inability to hot-rebuild striped plexes
Date: Mon, 19 Aug 2002 09:51:14 +0300

 On Fri, Aug 16, 2002 at 06:06:51PM -0700, Doug Swarin <doug@texas.net> wrote:
 
 
 This behaviour (a corrupt FS after a hot-rebuild with concurrent
 user I/O) is the same as what I discovered for a RAID-5 volume long
 ago. I don't have the necessary hardware at the moment, but could
 it be that this will also fix the RAID-5 hot-rebuild problem?
 -- 
 
 Vallo Kallaste
 vallo@estcard.ee
Responsible-Changed-From-To: freebsd-bugs->grog 
Responsible-Changed-By: johan 
Responsible-Changed-When: Tue Aug 20 08:45:46 PDT 2002 
Responsible-Changed-Why:  
Over to vinum maintainer. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=41740 

From: Doug Swarin <doug@staff.texas.net>
To: freebsd-gnats-submit@FreeBSD.org, doug@texas.net
Cc:  
Subject: Re: kern/41740: vinum issues: page fault while rebuilding; inability to hot-rebuild striped plexes
Date: Tue, 20 Aug 2002 11:18:21 -0500

 I have patches now that solve both issues. The vinumrequest.c patch
 has been tested somewhat extensively and appears to solve the page-fault
 and lockup issue while hot-rebuilding. The vinumstate.c patch has been
 tested somewhat less but did result in a clean fsck after hot-rebuilding
 a striped plex while under load.
 
 The vinumstate.c patch is probably less optimal for performance than
 it could be, but I felt it best to be conservative.
 
 Doug Swarin
 doug@texas.net
 
 *** sys/dev/vinum/vinumrequest.c.orig	Fri Aug 16 14:03:09 2002
 --- sys/dev/vinum/vinumrequest.c	Mon Aug 19 15:25:49 2002
 ***************
 *** 299,304 ****
 --- 299,305 ----
   int
   launch_requests(struct request *rq, int reviveok)
   {
 +     int s;
       struct rqgroup *rqg;
       int rqno;						    /* loop index */
       struct rqelement *rqe;				    /* current element */
 ***************
 *** 391,396 ****
 --- 392,398 ----
        * bottom half could be completing requests
        * before we finish, so we need splbio() protection.
        */
 +     s = splbio();
       for (rqg = rq->rqg; rqg != NULL;) {			    /* through the whole request chain */
   	if (rqg->lockbase >= 0)				    /* this rqg needs a lock first */
   	    rqg->lock = lockrange(rqg->lockbase, rqg->rq->bp, &PLEX[rqg->plexno]);
 ***************
 *** 432,437 ****
 --- 434,440 ----
   	    }
   	}
       }
 +     splx(s);
       return 0;
   }
   
 *** sys/dev/vinum/vinumstate.c.orig	Mon Aug 19 15:26:48 2002
 --- sys/dev/vinum/vinumstate.c	Mon Aug 19 15:46:04 2002
 ***************
 *** 618,623 ****
 --- 618,624 ----
   {
       struct plex *plex = &PLEX[sd->plexno];
       int writeop = (rq->bp->b_flags & B_READ) == 0;	    /* note if we're writing */
 +     daddr_t revive_start, revive_end;
   
       switch (sd->state) {
   	/* We shouldn't get called if the subdisk is up */
 ***************
 *** 637,652 ****
   	 *   caller to put the request on the wait
   	 *   list, which will be attended to by
   	 *   revive_block when it's done.
 ! 	 * - if it's striped, we can't do it (we could
 ! 	 *   do some hairy calculations, but it's
 ! 	 *   unlikely to work).
   	 * - if it's RAID-4 or RAID-5, we can do it as
   	 *   long as only one subdisk is down
   	 */
 ! 	if (plex->organization == plex_striped)		    /* plex is striped, */
 ! 	    return REQUEST_DOWN;
 ! 
 ! 	else if (isparity(plex)) {			    /* RAID-4 or RAID-5 plex */
   	    if (plex->sddowncount > 1)			    /* with more than one sd down, */
   		return REQUEST_DOWN;
   	    else
 --- 638,649 ----
   	 *   caller to put the request on the wait
   	 *   list, which will be attended to by
   	 *   revive_block when it's done.
 ! 	 * - if it's striped, do the same, but return
 ! 	 *   a conflict if it's in the current stripe
   	 * - if it's RAID-4 or RAID-5, we can do it as
   	 *   long as only one subdisk is down
   	 */
 ! 	if (isparity(plex)) {			    /* RAID-4 or RAID-5 plex */
   	    if (plex->sddowncount > 1)			    /* with more than one sd down, */
   		return REQUEST_DOWN;
   	    else
 ***************
 *** 658,668 ****
   		 */
   		return REQUEST_OK;			    /* OK, we'll find a way */
   	}
 ! 	if (diskaddr > (sd->revived
   		+ sd->plexoffset
 ! 		+ (sd->revive_blocksize >> DEV_BSHIFT)))    /* we're beyond the end */
   	    return REQUEST_DOWN;
 ! 	else if (diskend > (sd->revived + sd->plexoffset)) { /* we finish beyond the end */
   	    if (writeop) {
   		rq->flags |= XFR_REVIVECONFLICT;	    /* note a potential conflict */
   		rq->sdno = sd->sdno;			    /* and which sd last caused it */
 --- 655,679 ----
   		 */
   		return REQUEST_OK;			    /* OK, we'll find a way */
   	}
 ! 
 ! 	if (plex->organization == plex_striped) {
 ! 	    revive_start = sd->revived
 ! 		+ sd->plexoffset
 ! 		- (sd->revived % plex->stripesize);
 ! 	    revive_end   = sd->revived
   		+ sd->plexoffset
 ! 		+ plex->stripesize
 ! 		- (sd->revived % plex->stripesize);
 ! 	} else {
 ! 	    revive_start = sd->revived + sd->plexoffset;
 ! 	    revive_end   = sd->revived
 ! 		+ sd->plexoffset
 ! 		+ (sd->revive_blocksize >> DEV_BSHIFT);
 ! 	}
 ! 
 ! 	if (diskaddr > revive_end)			    /* we're beyond the end */
   	    return REQUEST_DOWN;
 ! 	else if (diskend >= revive_start) { 		    /* we finish beyond the end */
   	    if (writeop) {
   		rq->flags |= XFR_REVIVECONFLICT;	    /* note a potential conflict */
   		rq->sdno = sd->sdno;			    /* and which sd last caused it */

From: Giorgos Keramidas <keramida@FreeBSD.org>
To: Doug Swarin <doug@texas.net>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/41740: vinum issues: page fault while rebuilding; inability to hot-rebuild striped plexes
Date: Thu, 29 Aug 2002 00:33:59 +0300

 Adding to audit trail:
 :
 : Message-Id: <20020827133751.9AE8343E42@mx1.FreeBSD.org>
 : Date: Tue, 27 Aug 2002 14:37:21 +0100
 : From: "Peter Edwards" <pmedwards@eircom.net>
 : Subject: pending/42080: (No subject)
 :
 : This is a multi-part message in MIME format.
 : ---------6ELL480NZK8OPLUEHSDOKZRS
 : Content-Type: text/plain
 : Content-Transfer-Encoding: 7bit
 :
 : After almost seeing the problem and retracting my near-miss on
 : -hackers and being pointed to this PR by Doug, I decided to have a
 : look at that area of code again, and I understand the race condition
 : properly now.
 :
 : Doug's patch requires acquiring splbio() to protect against the
 : race, but the code seems to be going to great lengths to avoid
 : that (contrary to an existing comment).
 :
 : The following patch should protect against the race without requiring
 : splbio.  (I don't have a vinum config handy to test it, this is
 : more of a mental exercise for myself). It works by making sure the
 : inner and outer loops around the BUF_STRATEGY call finish as soon
 : as all the BUF_STRATEGYs have been completed, rather than always
 : descending down the (possibly inactive) requests. (I.e., it only loops
 : on the groups as long as there are groups for which we will do IO, and
 : it only loops on the elements in the groups as long as there are
 : elements for which we will do IO.)
 : --
 : Peter Edwards.
 :
 :
 : ---------6ELL480NZK8OPLUEHSDOKZRS
 : Content-Type: text/plain
 : Content-Disposition: attachment; filename="patch.txt"
 :
 : Index: vinumrequest.c
 : ===================================================================
 : RCS file: /pub/FreeBSD/development/FreeBSD-CVS/src/sys/dev/vinum/vinumrequest.c,v
 : retrieving revision 1.44.2.4
 : diff -u -r1.44.2.4 vinumrequest.c
 : --- vinumrequest.c	3 Feb 2002 07:10:26 -0000	1.44.2.4
 : +++ vinumrequest.c	27 Aug 2002 13:33:26 -0000
 : @@ -299,11 +299,11 @@
 :  int
 :  launch_requests(struct request *rq, int reviveok)
 :  {
 : -    struct rqgroup *rqg;
 : +    struct rqgroup *rqg, *nextrqg;
 :      int rqno;						    /* loop index */
 :      struct rqelement *rqe;				    /* current element */
 :      struct drive *drive;
 : -    int rcount;						    /* request count */
 : +    int iocount, activegroupcount;
 :
 :      /*
 :       * First find out whether we're reviving, and the
 : @@ -374,7 +374,10 @@
 :       * This loop happens without any participation
 :       * of the bottom half, so it requires no
 :       * protection.
 : +     * XXX: the update of rqg->active must mirror the
 : +     * calls to BUF_STRATEGY() below.
 :       */
 : +    activegroupcount = 0;
 :      for (rqg = rq->rqg; rqg != NULL; rqg = rqg->next) {	    /* through the whole request chain */
 :  	rqg->active = rqg->count;			    /* they're all active */
 :  	for (rqno = 0; rqno < rqg->count; rqno++) {
 : @@ -382,28 +385,31 @@
 :  	    if (rqe->flags & XFR_BAD_SUBDISK)		    /* this subdisk is bad, */
 :  		rqg->active--;				    /* one less active request */
 :  	}
 : -	if (rqg->active)				    /* we have at least one active request, */
 : +	if (rqg->active) {				    /* we have at least one active request, */
 :  	    rq->active++;				    /* one more active request group */
 : +	    activegroupcount++;
 : +	}
 :      }
 :
 :      /*
 :       * Now fire off the requests.  In this loop the
 :       * bottom half could be completing requests
 : -     * before we finish, so we need splbio() protection.
 : +     * before we finish, so be careful to avoid manipulating rq, and avoid
 : +     * accessing it if there's a possibility that the entire request may
 : +     * have finished.
 :       */
 : -    for (rqg = rq->rqg; rqg != NULL;) {			    /* through the whole request chain */
 : +    for (rqg = rq->rqg; activegroupcount != 0; rqg = nextrqg) {
 : +	nextrqg = rqg->next;
 :  	if (rqg->lockbase >= 0)				    /* this rqg needs a lock first */
 :  	    rqg->lock = lockrange(rqg->lockbase, rqg->rq->bp, &PLEX[rqg->plexno]);
 : -	rcount = rqg->count;
 : -	for (rqno = 0; rqno < rcount;) {
 : +
 : +	KASSERT(rqg->count >= rqg->active, ("vinum: overactive rqg"));
 : +	iocount = rqg->count;
 : +	if (iocount)
 : +	    activegroupcount--;
 : +	for (rqno = 0; iocount != 0;) {
 :  	    rqe = &rqg->rqe[rqno];
 :
 : -	    /*
 : -	     * Point to next rqg before the bottom end
 : -	     * changes the structures.
 : -	     */
 : -	    if (++rqno >= rcount)
 : -		rqg = rqg->next;
 :  	    if ((rqe->flags & XFR_BAD_SUBDISK) == 0) {	    /* this subdisk is good, */
 :  		drive = &DRIVE[rqe->driveno];		    /* look at drive */
 :  		drive->active++;
 : @@ -429,6 +435,7 @@
 :  #endif
 :  		/* fire off the request */
 :  		BUF_STRATEGY(&rqe->b, 0);
 : +		iocount--;
 :  	    }
 :  	}
 :      }
 : ---------6ELL480NZK8OPLUEHSDOKZRS--
Responsible-Changed-From-To: grog->le 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Thu Sep 9 18:58:15 GMT 2004 
Responsible-Changed-Why:  
With permission of both, reassign from grog to le. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=41740 
State-Changed-From-To: open->closed 
State-Changed-By: le 
State-Changed-When: Sat Nov 26 15:19:02 GMT 2005 
State-Changed-Why:  
Since 'classic' vinum isn't supported anymore, I'm closing this PR. 
If the described behaviour also happens with geom_vinum, a new PR 
can be opened. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=41740 
>Unformatted:
