From root@duvel.sics.se  Mon Nov 19 14:19:19 2007
Return-Path: <root@duvel.sics.se>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0CA8616A4CF
	for <FreeBSD-gnats-submit@freebsd.org>; Mon, 19 Nov 2007 14:19:19 +0000 (UTC)
	(envelope-from root@duvel.sics.se)
Received: from duvel.sics.se (duvel.sics.se [193.10.67.84])
	by mx1.freebsd.org (Postfix) with ESMTP id 5326813C442
	for <FreeBSD-gnats-submit@freebsd.org>; Mon, 19 Nov 2007 14:19:18 +0000 (UTC)
	(envelope-from root@duvel.sics.se)
Received: from duvel.sics.se (localhost [127.0.0.1])
	by duvel.sics.se (8.14.2/8.14.2) with ESMTP id lAJE4InT001477;
	Mon, 19 Nov 2007 14:04:18 GMT
	(envelope-from root@duvel.sics.se)
Received: (from root@localhost)
	by duvel.sics.se (8.14.2/8.14.2/Submit) id lAJE4IAm001476;
	Mon, 19 Nov 2007 14:04:18 GMT
	(envelope-from root)
Message-Id: <200711191404.lAJE4IAm001476@duvel.sics.se>
Date: Mon, 19 Nov 2007 14:04:18 GMT
From: bg@sics.se
Reply-To: bg@sics.se
To: FreeBSD-gnats-submit@freebsd.org
Cc: "Kip Macy" <kip.macy@gmail.com>
Subject: Poor NFS server write performance
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         118126
>Category:       kern
>Synopsis:       [nfs] [patch] Poor NFS server write performance
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    freebsd-fs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          update
>Submitter-Id:   current-users
>Arrival-Date:   Mon Nov 19 14:20:02 UTC 2007
>Closed-Date:    
>Last-Modified:  Fri Dec 16 01:00:24 UTC 2011
>Originator:     Bjoern Groenvall
>Release:        FreeBSD 8.0-CURRENT amd64
>Organization:
SICS
>Environment:
System: FreeBSD duvel.sics.se 8.0-CURRENT FreeBSD 8.0-CURRENT #2: Sun Nov 18 20:14:16 UTC 2007 root@duvel.sics.se:/usr/obj/usr/src/sys/GEN-NFS amd64


	
        DELL PC, amd64, gigabit Ethernet, SATA disk
>Description:

A problem with the current NFS server is that it does not cluster
writes. This in turn leads to really poor sequential-write
performance.

By enabling write clustering, NFS write performance goes from
26.6Mbyte/s to 54.3Mbyte/s, an increase by a factor of 2. This is on
a SATA disk with write caching enabled (hw.ata.wc=1).

If write caching is disabled, performance still goes up from
1.6Mbyte/s to 5.8Mbyte/s (a factor of 3.6).

The attached patch (relative to current) makes the following changes:

1/ Rearrange the code so that the same code can be used to detect both
   sequential read and write access.

2/ Merge in updates from vfs_vnops.c::sequential_heuristic.

3/ Use double hashing in order to avoid hash-clustering in the nfsheur
   table. This change also makes it possible to reduce "try" from 32
   to 8.

4/ Pack the nfsheur table more efficiently.

5/ Tolerate reordered RPCs to some small amount (initially suggested
   by Ellard and Seltzer).

6/ Back off from sequential access gradually rather than immediately
   switching to random access (Ellard and Seltzer).

7/ To avoid starvation of the buffer pool, call bwillwrite. The call
   is issued after the VOP_WRITE in order to avoid additional
   reordering of write operations.

8/ Add sysctl variables vfs.nfsrv.cluster_writes and
   vfs.nfsrv.cluster_reads to enable or disable clustering.
   vfs.nfsrv.reordered_io counts the number of reordered RPCs.

9/ In nfsrv_commit check for write errors and report them back to the
   client. Also check if the RPC argument count is zero which means
   that we must flush to the end of file according to the RFC.

10/ Two earlier commits broke the write gathering support:

    nfs_syscalls.c:1.71

      This change removed NQNFS stuff but left the NQNFS variable
      notstarted in place. This resulted in NFS write gathering
      effectively being permanently disabled (regardless of whether
      NFSv2 or NFSv3 was used).

    nfs_syscalls.c:1.103

      This change disabled write gathering (again) for NFSv3, although
      this should be controlled by vfs.nfs.nfsrvw_procrastinate_v3 !=
      0.

Write gathering may still be useful with NFSv3 to put reordered write
RPCs into order, perhaps also for other reasons. This is now possible
again.

The attached patch is for current, but you should observe similar
improvements with earlier FreeBSD versions.

>How-To-Repeat:
        Apply the patch; on the server, toggle vfs.nfsrv.cluster_writes
        between 0 and 1. On the client run (for example)

        dd if=/dev/zero of=/xxx/x.rm bs=32k count=10000
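
        Spelled out as shell commands (the mount point /xxx is a
        placeholder; the sysctl exists only on a patched kernel):

```shell
# On the NFS server: measure with clustering off, then on.
sysctl vfs.nfsrv.cluster_writes=0
# ... run the client dd below, note the throughput ...
sysctl vfs.nfsrv.cluster_writes=1
# ... run the client dd again and compare ...

# On the client, writing ~320MB to an NFS mount:
dd if=/dev/zero of=/xxx/x.rm bs=32k count=10000
```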
>Fix:

Apply this patch

--- nfs_syscalls.c.orig	2007-10-12 03:56:27.000000000 +0000
+++ nfs_syscalls.c	2007-11-16 21:09:46.000000000 +0000
@@ -85,7 +85,6 @@
 
 int		nfsd_waiting = 0;
 int		nfsrv_numnfsd = 0;
-static int	notstarted = 1;
 
 static int	nfs_privport = 0;
 SYSCTL_INT(_vfs_nfsrv, NFS_NFSPRIVPORT, nfs_privport, CTLFLAG_RW,
@@ -452,9 +451,8 @@
 			else
 			    procrastinate = nfsrvw_procrastinate;
 			NFSD_UNLOCK();
-			if (writes_todo || (!(nd->nd_flag & ND_NFSV3) &&
-			    nd->nd_procnum == NFSPROC_WRITE &&
-			    procrastinate > 0 && !notstarted))
+			if (writes_todo || (nd->nd_procnum == NFSPROC_WRITE &&
+			    procrastinate > 0))
 			    error = nfsrv_writegather(&nd, slp,
 				nfsd->nfsd_td, &mreq);
 			else
--- nfs_serv.c.orig	2007-10-18 16:38:07.000000000 +0000
+++ nfs_serv.c	2007-11-16 21:09:46.000000000 +0000
@@ -106,16 +106,20 @@
 
 #define MAX_COMMIT_COUNT	(1024 * 1024)
 
-#define NUM_HEURISTIC		1017
+#define NUM_HEURISTIC		1031 /* Must be prime! */
+#define MAX_REORDERED_RPC	4
+#define HASH_MAXSTEP		0x3ff
 #define NHUSE_INIT		64
 #define NHUSE_INC		16
 #define NHUSE_MAX		2048
+#define NH_TAG(vp)		((uint32_t)((uintptr_t)vp / sizeof(struct vnode)))
+CTASSERT(NUM_HEURISTIC > (HASH_MAXSTEP + 1));
 
 static struct nfsheur {
-	struct vnode *nh_vp;	/* vp to match (unreferenced pointer) */
-	off_t nh_nextr;		/* next offset for sequential detection */
-	int nh_use;		/* use count for selection */
-	int nh_seqcount;	/* heuristic */
+	off_t nh_nextoff;	/* next offset for sequential detection */
+	uint32_t nh_tag;	/* vp tag to match */
+	uint16_t nh_use;	/* use count for selection */
+	uint16_t nh_seqcount;	/* in units of logical blocks */
 } nfsheur[NUM_HEURISTIC];
 
 /* Global vars */
@@ -130,10 +134,16 @@
 static int nfs_async;
 static int nfs_commit_blks;
 static int nfs_commit_miss;
+static int nfsrv_cluster_writes = 1;
+static int nfsrv_cluster_reads = 1;
+static int nfsrv_reordered_io;
 SYSCTL_INT(_vfs_nfsrv, OID_AUTO, async, CTLFLAG_RW, &nfs_async, 0, "");
 SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_blks, CTLFLAG_RW, &nfs_commit_blks, 0, "");
 SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_miss, CTLFLAG_RW, &nfs_commit_miss, 0, "");
 
+SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_writes, CTLFLAG_RW, &nfsrv_cluster_writes, 0, "");
+SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_reads, CTLFLAG_RW, &nfsrv_cluster_reads, 0, "");
+SYSCTL_INT(_vfs_nfsrv, OID_AUTO, reordered_io, CTLFLAG_RW, &nfsrv_reordered_io, 0, "");
 struct nfsrvstats nfsrvstats;
 SYSCTL_STRUCT(_vfs_nfsrv, NFS_NFSRVSTATS, nfsrvstats, CTLFLAG_RW,
 	&nfsrvstats, nfsrvstats, "S,nfsrvstats");
@@ -144,6 +154,75 @@
 		    struct nfsrv_descript *);
 
 /*
+ * Detect sequential access so that we are able to hint the underlying
+ * file system to use clustered I/O when appropriate.
+ */
+static int
+nfsrv_sequential_access(const struct uio *uio, const struct vnode *vp)
+{
+	struct nfsheur *nh;
+	unsigned hi, step;
+	int try = 8;
+	int nblocks, lblocksize;
+
+	/*
+	 * Locate best nfsheur[] candidate using double hashing.
+	 */
+
+	hi =   NH_TAG(vp) % NUM_HEURISTIC;
+	step = NH_TAG(vp) & HASH_MAXSTEP;
+	step++;			/* Step must not be zero. */
+	nh = &nfsheur[hi];
+
+	while (try--) {
+		if (nfsheur[hi].nh_tag == NH_TAG(vp)) {
+			nh = &nfsheur[hi];
+			break;
+		}
+		if (nfsheur[hi].nh_use > 0)
+			--nfsheur[hi].nh_use;
+		hi = hi + step;
+		if (hi >= NUM_HEURISTIC)
+			hi -= NUM_HEURISTIC;
+		if (nfsheur[hi].nh_use < nh->nh_use)
+			nh = &nfsheur[hi];
+	}
+
+	if (nh->nh_tag != NH_TAG(vp)) {	/* New entry. */
+		nh->nh_tag = NH_TAG(vp);
+		nh->nh_nextoff = uio->uio_offset;
+		nh->nh_use = NHUSE_INIT;
+		nh->nh_seqcount = 1; /* Initially assume sequential access. */
+	} else {
+		nh->nh_use += NHUSE_INC;
+		if (nh->nh_use > NHUSE_MAX)
+			nh->nh_use = NHUSE_MAX;
+	}
+
+	/*
+	 * Calculate heuristic
+	 */
+
+	lblocksize = vp->v_mount->mnt_stat.f_iosize;
+	nblocks = howmany(uio->uio_resid, lblocksize);
+	if (uio->uio_offset == nh->nh_nextoff) {
+		nh->nh_seqcount += nblocks;
+		if (nh->nh_seqcount > IO_SEQMAX)
+			nh->nh_seqcount = IO_SEQMAX;
+	} else if (uio->uio_offset == 0) {
+		/* Seek to beginning of file, ignored. */
+	} else if (qabs(uio->uio_offset - nh->nh_nextoff) <=
+		   MAX_REORDERED_RPC*imax(lblocksize, uio->uio_resid)) {
+		nfsrv_reordered_io++; /* Probably reordered RPC, do nothing. */
+	} else
+		nh->nh_seqcount /= 4; /* Not sequential access. */
+
+	nh->nh_nextoff = uio->uio_offset + uio->uio_resid;
+
+	return (nh->nh_seqcount << IO_SEQSHIFT);
+}
+
+/*
  * Clear nameidata fields that are tested in nsfmout cleanup code prior
  * to using first nfsm macro (that might jump to the cleanup code).
  */
@@ -783,7 +862,6 @@
 	fhandle_t *fhp;
 	struct uio io, *uiop = &io;
 	struct vattr va, *vap = &va;
-	struct nfsheur *nh;
 	off_t off;
 	int ioflag = 0;
 	int vfslocked;
@@ -855,61 +933,6 @@
 	else
 		cnt = reqlen;
 
-	/*
-	 * Calculate seqcount for heuristic
-	 */
-
-	{
-		int hi;
-		int try = 32;
-
-		/*
-		 * Locate best candidate
-		 */
-
-		hi = ((int)(vm_offset_t)vp / sizeof(struct vnode)) % NUM_HEURISTIC;
-		nh = &nfsheur[hi];
-
-		while (try--) {
-			if (nfsheur[hi].nh_vp == vp) {
-				nh = &nfsheur[hi];
-				break;
-			}
-			if (nfsheur[hi].nh_use > 0)
-				--nfsheur[hi].nh_use;
-			hi = (hi + 1) % NUM_HEURISTIC;
-			if (nfsheur[hi].nh_use < nh->nh_use)
-				nh = &nfsheur[hi];
-		}
-
-		if (nh->nh_vp != vp) {
-			nh->nh_vp = vp;
-			nh->nh_nextr = off;
-			nh->nh_use = NHUSE_INIT;
-			if (off == 0)
-				nh->nh_seqcount = 4;
-			else
-				nh->nh_seqcount = 1;
-		}
-
-		/*
-		 * Calculate heuristic
-		 */
-
-		if ((off == 0 && nh->nh_seqcount > 0) || off == nh->nh_nextr) {
-			if (++nh->nh_seqcount > IO_SEQMAX)
-				nh->nh_seqcount = IO_SEQMAX;
-		} else if (nh->nh_seqcount > 1) {
-			nh->nh_seqcount = 1;
-		} else {
-			nh->nh_seqcount = 0;
-		}
-		nh->nh_use += NHUSE_INC;
-		if (nh->nh_use > NHUSE_MAX)
-			nh->nh_use = NHUSE_MAX;
-		ioflag |= nh->nh_seqcount << IO_SEQSHIFT;
-        }
-
 	nfsm_reply(NFSX_POSTOPORFATTR(v3) + 3 * NFSX_UNSIGNED+nfsm_rndup(cnt));
 	if (v3) {
 		tl = nfsm_build(u_int32_t *, NFSX_V3FATTR + 4 * NFSX_UNSIGNED);
@@ -967,9 +990,10 @@
 		uiop->uio_resid = len;
 		uiop->uio_rw = UIO_READ;
 		uiop->uio_segflg = UIO_SYSSPACE;
+		if (nfsrv_cluster_reads)
+			ioflag |= nfsrv_sequential_access(uiop, vp);
 		error = VOP_READ(vp, uiop, IO_NODELOCKED | ioflag, cred);
 		off = uiop->uio_offset;
-		nh->nh_nextr = off;
 		FREE((caddr_t)iv2, M_TEMP);
 		if (error || (getret = VOP_GETATTR(vp, vap, cred, td))) {
 			if (!error)
@@ -1175,6 +1199,8 @@
 	    uiop->uio_segflg = UIO_SYSSPACE;
 	    uiop->uio_td = NULL;
 	    uiop->uio_offset = off;
+	    if (nfsrv_cluster_writes)
+		ioflags |= nfsrv_sequential_access(uiop, vp);
 	    error = VOP_WRITE(vp, uiop, ioflags, cred);
 	    /* XXXRW: unlocked write. */
 	    nfsrvstats.srvvop_writes++;
@@ -1223,6 +1249,7 @@
 		vput(vp);
 	vn_finished_write(mntp);
 	VFS_UNLOCK_GIANT(vfslocked);
+	bwillwrite();	    /* After VOP_WRITE to avoid reordering. */
 	return(error);
 }
 
@@ -1489,6 +1516,8 @@
 		        mvfslocked = VFS_LOCK_GIANT(mntp);
 		    }
 		    if (!error) {
+	    		if (nfsrv_cluster_writes)
+				ioflags |= nfsrv_sequential_access(uiop, vp);
 			error = VOP_WRITE(vp, uiop, ioflags, cred);
 			/* XXXRW: unlocked write. */
 			nfsrvstats.srvvop_writes++;
@@ -1579,6 +1608,7 @@
 		    break;
 		}
 	splx(s);
+	bwillwrite();	    /* After VOP_WRITE to avoid reordering. */
 	return (0);
 }
 
@@ -3824,7 +3854,11 @@
 	}
 	for_ret = VOP_GETATTR(vp, &bfor, cred, td);
 
-	if (cnt > MAX_COMMIT_COUNT) {
+	/*
+	 * If count is 0, a flush from offset to the end of file
+	 * should be performed according to RFC 1813.
+	 */
+	if (cnt == 0 || cnt > MAX_COMMIT_COUNT) {
 		/*
 		 * Give up and do the whole thing
 		 */
@@ -3868,7 +3902,7 @@
 
 		s = splbio();
 		VI_LOCK(vp);
-		while (cnt > 0) {
+		while (!error && cnt > 0) {
 			struct buf *bp;
 
 			/*
@@ -3891,7 +3925,7 @@
 				    B_DELWRI) {
 					bremfree(bp);
 					bp->b_flags &= ~B_ASYNC;
-					bwrite(bp);
+					error = bwrite(bp);
 					++nfs_commit_miss;
 				} else
 					BUF_UNLOCK(bp);

>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->kmacy 
Responsible-Changed-By: kmacy 
Responsible-Changed-When: Mon Nov 19 21:07:12 UTC 2007 
Responsible-Changed-Why:  

Taking responsibility for this. I'm currently waiting for confirmation that 
this won't introduce other issues. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=118126 
Responsible-Changed-From-To: kmacy->freebsd-fs 
Responsible-Changed-By: gavin 
Responsible-Changed-When: Sun May 29 22:14:21 UTC 2011 
Responsible-Changed-Why:  
kmacy has asked for all of his PRs to be reassigned back to the pool. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=118126 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/118126: commit references a PR
Date: Fri, 16 Dec 2011 00:58:55 +0000 (UTC)

 Author: rmacklem
 Date: Fri Dec 16 00:58:41 2011
 New Revision: 228560
 URL: http://svn.freebsd.org/changeset/base/228560
 
 Log:
   Patch the new NFS server in a manner analagous to r228520 for the
   old NFS server, so that it correctly handles a count == 0 argument
   for Commit.
   
   PR:		kern/118126
   MFC after:	2 weeks
 
 Modified:
   head/sys/fs/nfsserver/nfs_nfsdport.c
 
 Modified: head/sys/fs/nfsserver/nfs_nfsdport.c
 ==============================================================================
 --- head/sys/fs/nfsserver/nfs_nfsdport.c	Fri Dec 16 00:48:53 2011	(r228559)
 +++ head/sys/fs/nfsserver/nfs_nfsdport.c	Fri Dec 16 00:58:41 2011	(r228560)
 @@ -1254,7 +1254,13 @@ nfsvno_fsync(struct vnode *vp, u_int64_t
  {
  	int error = 0;
  
 -	if (cnt > MAX_COMMIT_COUNT) {
 +	/*
 +	 * RFC 1813 3.3.21: if count is 0, a flush from offset to the end of
 +	 * file is done.  At this time VOP_FSYNC does not accept offset and
 +	 * byte count parameters so call VOP_FSYNC the whole file for now.
 +	 * The same is true for NFSv4: RFC 3530 Sec. 14.2.3.
 +	 */
 +	if (cnt == 0 || cnt > MAX_COMMIT_COUNT) {
  		/*
  		 * Give up and do the whole thing
  		 */
 _______________________________________________
 svn-src-all@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/svn-src-all
 To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
 
>Unformatted:
