From nobody@FreeBSD.org  Tue May  7 09:30:05 2013
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
	by hub.freebsd.org (Postfix) with ESMTP id 8890A93B
	for <freebsd-gnats-submit@FreeBSD.org>; Tue,  7 May 2013 09:30:05 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from oldred.FreeBSD.org (oldred.freebsd.org [8.8.178.121])
	by mx1.freebsd.org (Postfix) with ESMTP id 7B6F9B3E
	for <freebsd-gnats-submit@FreeBSD.org>; Tue,  7 May 2013 09:30:05 +0000 (UTC)
Received: from oldred.FreeBSD.org ([127.0.1.6])
	by oldred.FreeBSD.org (8.14.5/8.14.5) with ESMTP id r479U5Ji045710
	for <freebsd-gnats-submit@FreeBSD.org>; Tue, 7 May 2013 09:30:05 GMT
	(envelope-from nobody@oldred.FreeBSD.org)
Received: (from nobody@localhost)
	by oldred.FreeBSD.org (8.14.5/8.14.5/Submit) id r479U5ns045709;
	Tue, 7 May 2013 09:30:05 GMT
	(envelope-from nobody)
Message-Id: <201305070930.r479U5ns045709@oldred.FreeBSD.org>
Date: Tue, 7 May 2013 09:30:05 GMT
From: Adam Nowacki <nowak@tepeserwery.pl>
To: freebsd-gnats-submit@FreeBSD.org
Subject: [zfs] [patch] allow up to 8MB recordsize
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         178388
>Category:       kern
>Synopsis:       [zfs] [patch] allow up to 8MB recordsize
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-fs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Tue May 07 09:40:00 UTC 2013
>Closed-Date:    
>Last-Modified:  Wed May  8 22:20:02 UTC 2013
>Originator:     Adam Nowacki
>Release:        9.1-STABLE
>Organization:
>Environment:
FreeBSD storage 9.1-STABLE FreeBSD 9.1-STABLE #1 r250290M: Tue May  7 08:10:40 UTC 2013     root@storage:/usr/obj/home/nowak/freebsd/src/sys/GENERIC  amd64

>Description:
Currently zfs recordsize is limited to 128k. This is very small for large raidz arrays.

The attached patch increases the recordsize limit to 8M while keeping the default recordsize at 128k for compatibility with other systems.

After applying the patch, zfs datasets remain compatible with non-patched systems as long as recordsize is not raised above 128k (with zfs set recordsize=x). The change only affects file data; dataset and pool metadata always remain compatible. Going back is also possible: set recordsize to 128k or below and delete all files created with the increased recordsize, or destroy the entire dataset. Recreating the pool is not necessary.

Possible issues:
1) booting is only supported up to 128k recordsize (boot stages and loader); the boot pool can still hold datasets with increased recordsize as long as / or /boot remains at or below 128k recordsize,
2) accessing increased recordsize datasets on unpatched systems will likely cause kernel panics; a feature flag should probably be introduced to prevent pool import on such systems.
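
The bounds described above can be sketched in C as follows. This is only an illustration, not part of the patch: the constant values are copied from the spa.h hunk below, while is_pow2(), recordsize_valid() and recordsize_compatible() are hypothetical helper names introduced here (the real check in libzfs_dataset.c uses the ISP2 macro).

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the patched spa.h. */
#define	SPA_MINBLOCKSIZE	(1ULL << 9)	/* 512 bytes */
#define	SPA_MAXBLOCKSIZE	(1ULL << 17)	/* 128k, unchanged default */
#define	SPA_BIGBLOCKSIZE	(1ULL << 23)	/* 8M, new upper limit */

/* Hypothetical helper: true when x is a nonzero power of two. */
static inline bool
is_pow2(uint64_t x)
{
	return (x != 0 && (x & (x - 1)) == 0);
}

/* Hypothetical helper: would a patched system accept this recordsize? */
static inline bool
recordsize_valid(uint64_t v)
{
	return (v >= SPA_MINBLOCKSIZE && v <= SPA_BIGBLOCKSIZE && is_pow2(v));
}

/*
 * Hypothetical helper: is the value also within the range that
 * unpatched systems accept (at or below 128k)?
 */
static inline bool
recordsize_compatible(uint64_t v)
{
	return (recordsize_valid(v) && v <= SPA_MAXBLOCKSIZE);
}
```

Under these assumptions a 1M recordsize is valid but not compatible, which is the case the feature flag in issue 2 would need to guard against.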
>How-To-Repeat:

>Fix:


Patch attached with submission follows:

Index: cddl/contrib/opensolaris/cmd/zdb/zdb.c
===================================================================
--- cddl/contrib/opensolaris/cmd/zdb/zdb.c	(revision 250290)
+++ cddl/contrib/opensolaris/cmd/zdb/zdb.c	(working copy)
@@ -854,7 +854,7 @@
 dump_history(spa_t *spa)
 {
 	nvlist_t **events = NULL;
-	char buf[SPA_MAXBLOCKSIZE];
+	char buf[SPA_BIGBLOCKSIZE];
 	uint64_t resid, len, off = 0;
 	uint_t num = 0;
 	int error;
@@ -2904,8 +2904,8 @@
 	psize = size;
 	lsize = size;
 
-	pbuf = umem_alloc(SPA_MAXBLOCKSIZE, UMEM_NOFAIL);
-	lbuf = umem_alloc(SPA_MAXBLOCKSIZE, UMEM_NOFAIL);
+	pbuf = umem_alloc(SPA_BIGBLOCKSIZE, UMEM_NOFAIL);
+	lbuf = umem_alloc(SPA_BIGBLOCKSIZE, UMEM_NOFAIL);
 
 	BP_ZERO(bp);
 
@@ -2960,18 +2960,18 @@
 		 * every decompress function at every inflated blocksize.
 		 */
 		enum zio_compress c;
-		void *pbuf2 = umem_alloc(SPA_MAXBLOCKSIZE, UMEM_NOFAIL);
-		void *lbuf2 = umem_alloc(SPA_MAXBLOCKSIZE, UMEM_NOFAIL);
+		void *pbuf2 = umem_alloc(SPA_BIGBLOCKSIZE, UMEM_NOFAIL);
+		void *lbuf2 = umem_alloc(SPA_BIGBLOCKSIZE, UMEM_NOFAIL);
 
 		bcopy(pbuf, pbuf2, psize);
 
 		VERIFY(random_get_pseudo_bytes((uint8_t *)pbuf + psize,
-		    SPA_MAXBLOCKSIZE - psize) == 0);
+		    SPA_BIGBLOCKSIZE - psize) == 0);
 
 		VERIFY(random_get_pseudo_bytes((uint8_t *)pbuf2 + psize,
-		    SPA_MAXBLOCKSIZE - psize) == 0);
+		    SPA_BIGBLOCKSIZE - psize) == 0);
 
-		for (lsize = SPA_MAXBLOCKSIZE; lsize > psize;
+		for (lsize = SPA_BIGBLOCKSIZE; lsize > psize;
 		    lsize -= SPA_MINBLOCKSIZE) {
 			for (c = 0; c < ZIO_COMPRESS_FUNCTIONS; c++) {
 				if (zio_decompress_data(c, pbuf, lbuf,
@@ -2986,8 +2986,8 @@
 			lsize -= SPA_MINBLOCKSIZE;
 		}
 
-		umem_free(pbuf2, SPA_MAXBLOCKSIZE);
-		umem_free(lbuf2, SPA_MAXBLOCKSIZE);
+		umem_free(pbuf2, SPA_BIGBLOCKSIZE);
+		umem_free(lbuf2, SPA_BIGBLOCKSIZE);
 
 		if (lsize <= psize) {
 			(void) printf("Decompress of %s failed\n", thing);
@@ -3014,8 +3014,8 @@
 		zdb_dump_block(thing, buf, size, flags);
 
 out:
-	umem_free(pbuf, SPA_MAXBLOCKSIZE);
-	umem_free(lbuf, SPA_MAXBLOCKSIZE);
+	umem_free(pbuf, SPA_BIGBLOCKSIZE);
+	umem_free(lbuf, SPA_BIGBLOCKSIZE);
 	free(dup);
 }
 
Index: cddl/contrib/opensolaris/cmd/zdb/zdb_il.c
===================================================================
--- cddl/contrib/opensolaris/cmd/zdb/zdb_il.c	(revision 250290)
+++ cddl/contrib/opensolaris/cmd/zdb/zdb_il.c	(working copy)
@@ -119,7 +119,7 @@
 	char *data, *dlimit;
 	blkptr_t *bp = &lr->lr_blkptr;
 	zbookmark_t zb;
-	char buf[SPA_MAXBLOCKSIZE];
+	char buf[SPA_BIGBLOCKSIZE];
 	int verbose = MAX(dump_opt['d'], dump_opt['i']);
 	int error;
 
@@ -165,7 +165,7 @@
 	}
 
 	dlimit = data + MIN(lr->lr_length,
-	    (verbose < 6 ? 20 : SPA_MAXBLOCKSIZE));
+	    (verbose < 6 ? 20 : SPA_BIGBLOCKSIZE));
 
 	(void) printf("%s", prefix);
 	while (data < dlimit) {
Index: cddl/contrib/opensolaris/lib/libzfs/common/libzfs_dataset.c
===================================================================
--- cddl/contrib/opensolaris/lib/libzfs/common/libzfs_dataset.c	(revision 250290)
+++ cddl/contrib/opensolaris/lib/libzfs/common/libzfs_dataset.c	(working copy)
@@ -1016,14 +1016,14 @@
 
 		case ZFS_PROP_RECORDSIZE:
 		case ZFS_PROP_VOLBLOCKSIZE:
-			/* must be power of two within SPA_{MIN,MAX}BLOCKSIZE */
+			/* must be power of two within SPA_{MIN,BIG}BLOCKSIZE */
 			if (intval < SPA_MINBLOCKSIZE ||
-			    intval > SPA_MAXBLOCKSIZE || !ISP2(intval)) {
+			    intval > SPA_BIGBLOCKSIZE || !ISP2(intval)) {
 				zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
 				    "'%s' must be power of 2 from %u "
 				    "to %uk"), propname,
 				    (uint_t)SPA_MINBLOCKSIZE,
-				    (uint_t)SPA_MAXBLOCKSIZE >> 10);
+				    (uint_t)SPA_BIGBLOCKSIZE >> 10);
 				(void) zfs_error(hdl, EZFS_BADPROP, errbuf);
 				goto error;
 			}
@@ -3091,7 +3091,7 @@
 			    "volume block size must be power of 2 from "
 			    "%u to %uk"),
 			    (uint_t)SPA_MINBLOCKSIZE,
-			    (uint_t)SPA_MAXBLOCKSIZE >> 10);
+			    (uint_t)SPA_BIGBLOCKSIZE >> 10);
 
 			return (zfs_error(hdl, EZFS_BADPROP, errbuf));
 
Index: sys/cddl/contrib/opensolaris/common/zfs/zfs_prop.c
===================================================================
--- sys/cddl/contrib/opensolaris/common/zfs/zfs_prop.c	(revision 250290)
+++ sys/cddl/contrib/opensolaris/common/zfs/zfs_prop.c	(working copy)
@@ -372,7 +372,7 @@
 	/* inherit number properties */
 	zprop_register_number(ZFS_PROP_RECORDSIZE, "recordsize",
 	    SPA_MAXBLOCKSIZE, PROP_INHERIT,
-	    ZFS_TYPE_FILESYSTEM, "512 to 128k, power of 2", "RECSIZE");
+	    ZFS_TYPE_FILESYSTEM, "512 to 8192k, power of 2", "RECSIZE");
 
 	/* hidden properties */
 	zprop_register_hidden(ZFS_PROP_CREATETXG, "createtxg", PROP_TYPE_NUMBER,
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c	(working copy)
@@ -1572,7 +1572,7 @@
 		 */
 		if (zp->z_blksz > zp->z_zfsvfs->z_max_blksz) {
 			ASSERT(!ISP2(zp->z_blksz));
-			newblksz = MIN(end, SPA_MAXBLOCKSIZE);
+			newblksz = MIN(end, SPA_BIGBLOCKSIZE);
 		} else {
 			newblksz = MIN(end, zp->z_zfsvfs->z_max_blksz);
 		}
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c	(working copy)
@@ -1011,7 +1011,7 @@
 
 			if (zp->z_blksz > max_blksz) {
 				ASSERT(!ISP2(zp->z_blksz));
-				new_blksz = MIN(end_size, SPA_MAXBLOCKSIZE);
+				new_blksz = MIN(end_size, SPA_BIGBLOCKSIZE);
 			} else {
 				new_blksz = MIN(end_size, max_blksz);
 			}
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c	(working copy)
@@ -273,8 +273,8 @@
 	zfsvfs_t *zfsvfs = arg;
 
 	if (newval < SPA_MINBLOCKSIZE ||
-	    newval > SPA_MAXBLOCKSIZE || !ISP2(newval))
-		newval = SPA_MAXBLOCKSIZE;
+	    newval > SPA_BIGBLOCKSIZE || !ISP2(newval))
+		newval = SPA_BIGBLOCKSIZE;
 
 	zfsvfs->z_max_blksz = newval;
 	zfsvfs->z_vfs->mnt_stat.f_iosize = newval;
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c	(working copy)
@@ -45,7 +45,7 @@
 	METASLAB_GANG_AVOID)))
 
 uint64_t metaslab_aliquot = 512ULL << 10;
-uint64_t metaslab_gang_bang = SPA_MAXBLOCKSIZE + 1;	/* force gang blocks */
+uint64_t metaslab_gang_bang = SPA_BIGBLOCKSIZE + 1;	/* force gang blocks */
 
 /*
  * The in-core space map representation is more compact than its on-disk form.
@@ -79,7 +79,7 @@
  * an allocation of this size then it switches to using more
  * aggressive strategy (i.e search by size rather than offset).
  */
-uint64_t metaslab_df_alloc_threshold = SPA_MAXBLOCKSIZE;
+uint64_t metaslab_df_alloc_threshold = SPA_BIGBLOCKSIZE;
 
 /*
  * The minimum free space, in percent, which must be available
@@ -619,7 +619,7 @@
 		t = sm->sm_pp_root;
 		*cursor = *extent_end = 0;
 
-		if (max_size > 2 * SPA_MAXBLOCKSIZE)
+		if (max_size > 2 * SPA_BIGBLOCKSIZE)
 			rsize = MIN(metaslab_min_alloc_size, max_size);
 		offset = metaslab_block_picker(t, extent_end, rsize, 1ULL);
 		if (offset != -1)
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h	(working copy)
@@ -133,7 +133,7 @@
 #define	ZFS_SHARES_DIR		"SHARES"
 #define	ZFS_SA_ATTRS		"SA_ATTRS"
 
-#define	ZFS_MAX_BLOCKSIZE	(SPA_MAXBLOCKSIZE)
+#define	ZFS_MAX_BLOCKSIZE	(SPA_BIGBLOCKSIZE)
 
 /* Path component length */
 /*
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h	(working copy)
@@ -87,8 +87,10 @@
  */
 #define	SPA_MINBLOCKSHIFT	9
 #define	SPA_MAXBLOCKSHIFT	17
+#define	SPA_BIGBLOCKSHIFT	23
 #define	SPA_MINBLOCKSIZE	(1ULL << SPA_MINBLOCKSHIFT)
 #define	SPA_MAXBLOCKSIZE	(1ULL << SPA_MAXBLOCKSHIFT)
+#define	SPA_BIGBLOCKSIZE	(1ULL << SPA_BIGBLOCKSHIFT)
 
 #define	SPA_BLOCKSIZES		(SPA_MAXBLOCKSHIFT - SPA_MINBLOCKSHIFT + 1)
 
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h	(working copy)
@@ -238,7 +238,7 @@
  * The maximum number of bytes that can be accessed as part of one
  * operation, including metadata.
  */
-#define	DMU_MAX_ACCESS (10<<20) /* 10MB */
+#define	DMU_MAX_ACCESS (200<<20) /* 200MB */
 #define	DMU_MAX_DELETEBLKCNT (20480) /* ~5MB of indirect blocks */
 
 #define	DMU_USERUSED_OBJECT	(-1ULL)
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c	(working copy)
@@ -223,7 +223,7 @@
 		return;
 
 	min_bs = SPA_MINBLOCKSHIFT;
-	max_bs = SPA_MAXBLOCKSHIFT;
+	max_bs = SPA_BIGBLOCKSHIFT;
 	min_ibs = DN_MIN_INDBLKSHIFT;
 	max_ibs = DN_MAX_INDBLKSHIFT;
 
@@ -703,11 +703,11 @@
 		bp = &dn->dn_phys->dn_blkptr[0];
 		if (dsl_dataset_block_freeable(dn->dn_objset->os_dsl_dataset,
 		    bp, bp->blk_birth))
-			txh->txh_space_tooverwrite += SPA_MAXBLOCKSIZE;
+			txh->txh_space_tooverwrite += SPA_BIGBLOCKSIZE;
 		else
-			txh->txh_space_towrite += SPA_MAXBLOCKSIZE;
+			txh->txh_space_towrite += SPA_BIGBLOCKSIZE;
 		if (!BP_IS_HOLE(bp))
-			txh->txh_space_tounref += SPA_MAXBLOCKSIZE;
+			txh->txh_space_tounref += SPA_BIGBLOCKSIZE;
 		return;
 	}
 
@@ -822,7 +822,7 @@
 			match_object = TRUE;
 		if (txh->txh_dnode == NULL || txh->txh_dnode == dn) {
 			int datablkshift = dn->dn_datablkshift ?
-			    dn->dn_datablkshift : SPA_MAXBLOCKSHIFT;
+			    dn->dn_datablkshift : SPA_BIGBLOCKSHIFT;
 			int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
 			int shift = datablkshift + epbs * db->db_level;
 			uint64_t beginblk = shift >= 64 ? 0 :
@@ -1294,18 +1294,18 @@
 
 	/* If blkptr doesn't exist then add space to towrite */
 	if (!(dn->dn_phys->dn_flags & DNODE_FLAG_SPILL_BLKPTR)) {
-		txh->txh_space_towrite += SPA_MAXBLOCKSIZE;
+		txh->txh_space_towrite += SPA_BIGBLOCKSIZE;
 	} else {
 		blkptr_t *bp;
 
 		bp = &dn->dn_phys->dn_spill;
 		if (dsl_dataset_block_freeable(dn->dn_objset->os_dsl_dataset,
 		    bp, bp->blk_birth))
-			txh->txh_space_tooverwrite += SPA_MAXBLOCKSIZE;
+			txh->txh_space_tooverwrite += SPA_BIGBLOCKSIZE;
 		else
-			txh->txh_space_towrite += SPA_MAXBLOCKSIZE;
+			txh->txh_space_towrite += SPA_BIGBLOCKSIZE;
 		if (!BP_IS_HOLE(bp))
-			txh->txh_space_tounref += SPA_MAXBLOCKSIZE;
+			txh->txh_space_tounref += SPA_BIGBLOCKSIZE;
 	}
 }
 
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c	(working copy)
@@ -213,10 +213,10 @@
 	if (dn->dn_phys->dn_type != DMU_OT_NONE || dn->dn_allocated_txg != 0) {
 		int i;
 		ASSERT3U(dn->dn_indblkshift, >=, 0);
-		ASSERT3U(dn->dn_indblkshift, <=, SPA_MAXBLOCKSHIFT);
+		ASSERT3U(dn->dn_indblkshift, <=, SPA_BIGBLOCKSHIFT);
 		if (dn->dn_datablkshift) {
 			ASSERT3U(dn->dn_datablkshift, >=, SPA_MINBLOCKSHIFT);
-			ASSERT3U(dn->dn_datablkshift, <=, SPA_MAXBLOCKSHIFT);
+			ASSERT3U(dn->dn_datablkshift, <=, SPA_BIGBLOCKSHIFT);
 			ASSERT3U(1<<dn->dn_datablkshift, ==, dn->dn_datablksz);
 		}
 		ASSERT3U(dn->dn_nlevels, <=, 30);
@@ -266,7 +266,7 @@
 	 * dn_nblkptr is only one byte, so it's OK to read it in either
 	 * byte order.  We can't read dn_bouslen.
 	 */
-	ASSERT(dnp->dn_indblkshift <= SPA_MAXBLOCKSHIFT);
+	ASSERT(dnp->dn_indblkshift <= SPA_BIGBLOCKSHIFT);
 	ASSERT(dnp->dn_nblkptr <= DN_MAX_NBLKPTR);
 	for (i = 0; i < dnp->dn_nblkptr * sizeof (blkptr_t)/8; i++)
 		buf64[i] = BSWAP_64(buf64[i]);
@@ -369,7 +369,7 @@
 dnode_setdblksz(dnode_t *dn, int size)
 {
 	ASSERT0(P2PHASE(size, SPA_MINBLOCKSIZE));
-	ASSERT3U(size, <=, SPA_MAXBLOCKSIZE);
+	ASSERT3U(size, <=, SPA_BIGBLOCKSIZE);
 	ASSERT3U(size, >=, SPA_MINBLOCKSIZE);
 	ASSERT3U(size >> SPA_MINBLOCKSHIFT, <,
 	    1<<(sizeof (dn->dn_phys->dn_datablkszsec) * 8));
@@ -489,8 +489,8 @@
 
 	if (blocksize == 0)
 		blocksize = 1 << zfs_default_bs;
-	else if (blocksize > SPA_MAXBLOCKSIZE)
-		blocksize = SPA_MAXBLOCKSIZE;
+	else if (blocksize > SPA_BIGBLOCKSIZE)
+		blocksize = SPA_BIGBLOCKSIZE;
 	else
 		blocksize = P2ROUNDUP(blocksize, SPA_MINBLOCKSIZE);
 
@@ -571,7 +571,7 @@
 	int nblkptr;
 
 	ASSERT3U(blocksize, >=, SPA_MINBLOCKSIZE);
-	ASSERT3U(blocksize, <=, SPA_MAXBLOCKSIZE);
+	ASSERT3U(blocksize, <=, SPA_BIGBLOCKSIZE);
 	ASSERT0(blocksize % SPA_MINBLOCKSIZE);
 	ASSERT(dn->dn_object != DMU_META_DNODE_OBJECT || dmu_tx_private_ok(tx));
 	ASSERT(tx->tx_txg != 0);
@@ -1320,8 +1320,8 @@
 
 	if (size == 0)
 		size = SPA_MINBLOCKSIZE;
-	if (size > SPA_MAXBLOCKSIZE)
-		size = SPA_MAXBLOCKSIZE;
+	if (size > SPA_BIGBLOCKSIZE)
+		size = SPA_BIGBLOCKSIZE;
 	else
 		size = P2ROUNDUP(size, SPA_MINBLOCKSIZE);
 
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c	(working copy)
@@ -1149,7 +1149,7 @@
 	    drro->drr_compress >= ZIO_COMPRESS_FUNCTIONS ||
 	    P2PHASE(drro->drr_blksz, SPA_MINBLOCKSIZE) ||
 	    drro->drr_blksz < SPA_MINBLOCKSIZE ||
-	    drro->drr_blksz > SPA_MAXBLOCKSIZE ||
+	    drro->drr_blksz > SPA_BIGBLOCKSIZE ||
 	    drro->drr_bonuslen > DN_MAX_BONUSLEN) {
 		return (SET_ERROR(EINVAL));
 	}
@@ -1351,7 +1351,7 @@
 	int err;
 
 	if (drrs->drr_length < SPA_MINBLOCKSIZE ||
-	    drrs->drr_length > SPA_MAXBLOCKSIZE)
+	    drrs->drr_length > SPA_BIGBLOCKSIZE)
 		return (SET_ERROR(EINVAL));
 
 	data = restore_read(ra, drrs->drr_length);
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c	(working copy)
@@ -80,8 +80,8 @@
  */
 kmem_cache_t *zio_cache;
 kmem_cache_t *zio_link_cache;
-kmem_cache_t *zio_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
-kmem_cache_t *zio_data_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
+kmem_cache_t *zio_buf_cache[SPA_BIGBLOCKSIZE >> SPA_MINBLOCKSHIFT];
+kmem_cache_t *zio_data_buf_cache[SPA_BIGBLOCKSIZE >> SPA_MINBLOCKSHIFT];
 
 #ifdef _KERNEL
 extern vmem_t *zio_alloc_arena;
@@ -142,7 +142,7 @@
 	 * for each quarter-power of 2.  For large buffers, we want
 	 * a cache for each multiple of PAGESIZE.
 	 */
-	for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
+	for (c = 0; c < SPA_BIGBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
 		size_t size = (c + 1) << SPA_MINBLOCKSHIFT;
 		size_t p2 = size;
 		size_t align = 0;
@@ -218,7 +218,7 @@
 	kmem_cache_t *last_cache = NULL;
 	kmem_cache_t *last_data_cache = NULL;
 
-	for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
+	for (c = 0; c < SPA_BIGBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
 		if (zio_buf_cache[c] != last_cache) {
 			last_cache = zio_buf_cache[c];
 			kmem_cache_destroy(zio_buf_cache[c]);
@@ -255,7 +255,7 @@
 {
 	size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
 
-	ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
+	ASSERT(c < SPA_BIGBLOCKSIZE >> SPA_MINBLOCKSHIFT);
 
 	if (zio_use_uma)
 		return (kmem_cache_alloc(zio_buf_cache[c], KM_PUSHPAGE));
@@ -274,7 +274,7 @@
 {
 	size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
 
-	ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
+	ASSERT(c < SPA_BIGBLOCKSIZE >> SPA_MINBLOCKSHIFT);
 
 	if (zio_use_uma)
 		return (kmem_cache_alloc(zio_data_buf_cache[c], KM_PUSHPAGE));
@@ -287,7 +287,7 @@
 {
 	size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
 
-	ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
+	ASSERT(c < SPA_BIGBLOCKSIZE >> SPA_MINBLOCKSHIFT);
 
 	if (zio_use_uma)
 		kmem_cache_free(zio_buf_cache[c], buf);
@@ -300,7 +300,7 @@
 {
 	size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
 
-	ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
+	ASSERT(c < SPA_BIGBLOCKSIZE >> SPA_MINBLOCKSHIFT);
 
 	if (zio_use_uma)
 		kmem_cache_free(zio_data_buf_cache[c], buf);
@@ -543,7 +543,7 @@
 {
 	zio_t *zio;
 
-	ASSERT3U(size, <=, SPA_MAXBLOCKSIZE);
+	ASSERT3U(size, <=, SPA_BIGBLOCKSIZE);
 	ASSERT(P2PHASE(size, SPA_MINBLOCKSIZE) == 0);
 	ASSERT(P2PHASE(offset, SPA_MINBLOCKSIZE) == 0);
 
@@ -2815,7 +2815,7 @@
 		if (BP_IS_GANG(bp)) {
 			zio->io_flags &= ~ZIO_FLAG_NODATA;
 		} else {
-			ASSERT((uintptr_t)zio->io_data < SPA_MAXBLOCKSIZE);
+			ASSERT((uintptr_t)zio->io_data < SPA_BIGBLOCKSIZE);
 			zio->io_pipeline &= ~ZIO_VDEV_IO_STAGES;
 		}
 	}
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c	(working copy)
@@ -1944,8 +1944,8 @@
 		return (SET_ERROR(ENOTSUP));
 	if (blksz == 0)
 		blksz = SPA_MINBLOCKSIZE;
-	if (blksz > SPA_MAXBLOCKSIZE)
-		blksz = SPA_MAXBLOCKSIZE;
+	if (blksz > SPA_BIGBLOCKSIZE)
+		blksz = SPA_BIGBLOCKSIZE;
 	else
 		blksz = P2ROUNDUP(blksz, SPA_MINBLOCKSIZE);
 
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_fm.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_fm.c	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_fm.c	(working copy)
@@ -554,7 +554,7 @@
 
 	ASSERT3U(nui64s, <=, UINT16_MAX);
 	ASSERT3U(size, ==, nui64s * sizeof (uint64_t));
-	ASSERT3U(size, <=, SPA_MAXBLOCKSIZE);
+	ASSERT3U(size, <=, SPA_BIGBLOCKSIZE);
 	ASSERT3U(size, <=, UINT32_MAX);
 
 	/* build up the range list by comparing the two buffers. */
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c	(revision 250290)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c	(working copy)
@@ -211,7 +211,7 @@
 zvol_check_volblocksize(uint64_t volblocksize)
 {
 	if (volblocksize < SPA_MINBLOCKSIZE ||
-	    volblocksize > SPA_MAXBLOCKSIZE ||
+	    volblocksize > SPA_BIGBLOCKSIZE ||
 	    !ISP2(volblocksize))
 		return (SET_ERROR(EDOM));
 
@@ -727,7 +727,7 @@
 
 	while (resid != 0) {
 		int error;
-		uint64_t bytes = MIN(resid, SPA_MAXBLOCKSIZE);
+		uint64_t bytes = MIN(resid, SPA_BIGBLOCKSIZE);
 
 		tx = dmu_tx_create(os);
 		dmu_tx_hold_write(tx, ZVOL_OBJ, off, bytes);
@@ -1651,7 +1651,7 @@
 		(void) strcpy(dki.dki_dname, "zvol");
 		dki.dki_ctype = DKC_UNKNOWN;
 		dki.dki_unit = getminor(dev);
-		dki.dki_maxtransfer = 1 << (SPA_MAXBLOCKSHIFT - zv->zv_min_bs);
+		dki.dki_maxtransfer = 1 << (SPA_BIGBLOCKSHIFT - zv->zv_min_bs);
 		mutex_exit(&spa_namespace_lock);
 		if (ddi_copyout(&dki, (void *)arg, sizeof (dki), flag))
 			error = SET_ERROR(EFAULT);


>Release-Note:
>Audit-Trail:

From: Matthew Rezny <mrezny@hexaneinc.com>
To: bug-followup@FreeBSD.org, nowak@tepeserwery.pl
Cc:  
Subject: Re: kern/178388: [zfs] [patch] allow up to 8MB recordsize
Date: Tue, 7 May 2013 16:11:50 +0200

 The proposed patch is rather ugly. Is there some reason to not simply
 change the definition of SPA_MAXBLOCKSIZE?
 
 The point of defining a constant is it can then be changed in the place
 it's defined rather than in every place it's used. Having to go change
 every reference to it is error prone as missing a single reference could
 wreak havoc.
 
 Specifically, I call into question the effect this has on the
 definition of SPA_BLOCKSIZES. The reference to SPA_MAXBLOCKSIZE was not
 replaced by SPA_BIGBLOCKSIZE and thus SPA_BLOCKSIZES is insufficiently
 sized to represent all the possible block sizes that could be used.
 
 That one jumped out at me when I skimmed over the patch. I have not
 reviewed all the ZFS code to look for other unchanged references that
 are not part of the patch context.
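
For reference, the SPA_BLOCKSIZES concern can be checked with a little arithmetic. The sketch below is illustrative only: the shift values come from the spa.h hunk in the patch, and pow2_block_sizes() is a hypothetical helper that mirrors the SPA_BLOCKSIZES expression. With the unchanged macro the count is 17 - 9 + 1 = 9; counting up to the new limit it would be 23 - 9 + 1 = 15.

```c
/* Shift values as they appear in the patched spa.h. */
#define	SPA_MINBLOCKSHIFT	9
#define	SPA_MAXBLOCKSHIFT	17
#define	SPA_BIGBLOCKSHIFT	23

/*
 * Hypothetical helper: number of power-of-two block sizes between
 * two shifts, inclusive.  Same formula as SPA_BLOCKSIZES.
 */
static inline int
pow2_block_sizes(int min_shift, int max_shift)
{
	return (max_shift - min_shift + 1);
}
```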

From: Adam Nowacki <nowak@tepeserwery.pl>
To: Matthew Rezny <mrezny@hexaneinc.com>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/178388: [zfs] [patch] allow up to 8MB recordsize
Date: Tue, 07 May 2013 18:46:05 +0200

 On 2013-05-07 16:11, Matthew Rezny wrote:
 > The proposed patch is rather ugly. Is there some reason to not simply
 > change the definition of SPA_MAXBLOCKSIZE?
 
 Yes. Altering the value of SPA_MAXBLOCKSIZE will change the sizes of 
 certain metadata objects that will break compatibility with non-patched 
 systems. Just importing the pool on a system with a modified 
 SPA_MAXBLOCKSIZE would result in this pool being inaccessible in 
 non-patched systems - forever. It will also prevent booting from zfs 
 pools as there is not enough memory available in the bootloader to 
 support large block sizes for metadata or the loader files.
 
 > The point of defining a constant is it can then be changed in the place
 > it's defined rather than in every place it's used. Having to go change
 > every reference to it is error prone as missing a single reference could
 > wreak havoc.
 
 SPA_MAXBLOCKSIZE is used for far more than just a limit - in many places 
 it is used as a default block size. I'm introducing SPA_BIGBLOCKSIZE 
 because of the above compatibility problems and using it only in places 
 that are essential to supporting large block sizes for file or zvol data 
 leaving default block sizes unmodified (especially for pool metadata). 
 The changed block size is only in effect when recordsize dataset 
 property is modified by explicit action of the administrator. Existing 
 and new datasets created post patch default to backwards compatible 128k 
 block size.
 
 SPA_BIGBLOCKSIZE is used for asserts on the size of read/written block, 
 ARC cache, recordsize property bounds checks and block size calculation 
 logic.
 
 The names of the constants could probably be changed:
 current SPA_MAXBLOCKSIZE to SPA_DEFAULTBLOCKSIZE
 and the new SPA_BIGBLOCKSIZE to SPA_MAXBLOCKSIZE.
 
 > Specifically, I call into question the effect this has on the
 > definition of SPA_BLOCKSIZES. The reference to SPA_MAXBLOCKSIZE was not
 > replaced by SPA_BIGBLOCKSIZE and thus SPA_BLOCKSIZES is insufficiently
 > sized to represent all the possible block sizes that could be used.
 
 The SPA_BLOCKSIZES define is never used in the code and should probably 
 be removed.
 
 > That one jumped out at me when I skimmed over the patch. I have not
 > reviewed all the ZFS code to look for other unchanged references that
 > are not part of the patch context.
 
 Keep in mind that I have been using this for two months now on 3 
 systems, 5 zpools and a total of over 50TB data written post-patch with 
 varying record sizes (128k, 1MB, 4MB, 8MB). All systems boot directly 
 from the big pools using unmodified (128k limited) bootloader.
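
As an illustration of one place where the larger limit matters, the zio.c hunks size the buffer cache arrays by the block size limit divided by the minimum block size, and index them by (size - 1) >> SPA_MINBLOCKSHIFT. The sketch below only reproduces that arithmetic; zio_cache_slots() and zio_cache_index() are hypothetical names, not functions from the ZFS source.

```c
#include <stddef.h>
#include <stdint.h>

/* Values as they appear in the patched spa.h. */
#define	SPA_MINBLOCKSHIFT	9
#define	SPA_MAXBLOCKSIZE	(1ULL << 17)	/* 128k */
#define	SPA_BIGBLOCKSIZE	(1ULL << 23)	/* 8M */

/* Hypothetical helper: number of array entries for a given size limit. */
static inline size_t
zio_cache_slots(uint64_t limit)
{
	return ((size_t)(limit >> SPA_MINBLOCKSHIFT));
}

/* Hypothetical helper: array index used for a buffer of 'size' bytes. */
static inline size_t
zio_cache_index(uint64_t size)
{
	return ((size_t)((size - 1) >> SPA_MINBLOCKSHIFT));
}
```

With these values the arrays grow from 256 entries (128k limit) to 16384 entries (8M limit), and an 8M buffer maps to the last index, 16383.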

From: Matthew Rezny <mrezny@hexaneinc.com>
To: Adam Nowacki <nowak@tepeserwery.pl>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/178388: [zfs] [patch] allow up to 8MB recordsize
Date: Wed, 8 May 2013 13:09:44 +0200

 On Tue, 07 May 2013 18:46:05 +0200
 Adam Nowacki <nowak@tepeserwery.pl> wrote:
 
 > On 2013-05-07 16:11, Matthew Rezny wrote:
 > > The proposed patch is rather ugly. Is there some reason to not
 > > simply change the definition of SPA_MAXBLOCKSIZE?
 > 
 > Yes. Altering the value of SPA_MAXBLOCKSIZE will change the sizes of 
 > certain metadata objects that will break compatibility with
 > non-patched systems. Just importing the pool on a system with a modified 
 > SPA_MAXBLOCKSIZE would result in this pool being inaccessible in 
 > non-patched systems - forever. It will also prevent booting from zfs 
 > pools as there is not enough memory available in the bootloader to 
 > support large block sizes for metadata or the loader files.
 > 
 That is understandable, and something I had thought about but had not
 verified.
 
 > > The point of defining a constant is it can then be changed in the
 > > place it's defined rather than in every place it's used. Having to
 > > go change every reference to it is error prone as missing a single
 > > reference could wreak havoc.
 > 
 > SPA_MAXBLOCKSIZE is used for far more than just a limit - in many
 > places it is used as a default block size. I'm introducing
 > SPA_BIGBLOCKSIZE because of the above compatibility problems and
 > using it only in places that are essential to supporting large block
 > sizes for file or zvol data leaving default block sizes unmodified
 > (especially for pool metadata). The changed block size is only in
 > effect when recordsize dataset property is modified by explicit
 > action of the administrator. Existing and new datasets created post
 > patch default to backwards compatible 128k block size.
 > 
 > SPA_BIGBLOCKSIZE is used for asserts on the size of read/written
 > block, ARC cache, recordsize property bounds checks and block size
 > calculation logic.
 > 
 > The names of the constants could probably be changed:
 > current SPA_MAXBLOCKSIZE to SPA_DEFAULTBLOCKSIZE
 > and the new SPA_BIGBLOCKSIZE to SPA_MAXBLOCKSIZE.
 >
 Changing the value of SPA_MAXBLOCKSIZE while defining
 SPA_DEFAULTBLOCKSIZE (or SPA_MAXCOMPATBLOCKSIZE as I almost
 suggested) with the prior value would be clearer in terms of both naming
 and patch readability. I will venture to say that the number of
 references to SPA_DEFAULTBLOCKSIZE will be fewer than references to
 SPA_MAXBLOCKSIZE.
 
 > > Specifically, I call into question the effect this has on the
 > > definition of SPA_BLOCKSIZES. The reference to SPA_MAXBLOCKSIZE was
 > > not replaced by SPA_BIGBLOCKSIZE and thus SPA_BLOCKSIZES is
 > > insufficiently sized to represent all the possible block sizes that
 > > could be used.
 > 
 > The SPA_BLOCKSIZES define is never used in the code and should
 > probably be removed.
 > 
 In that case, please kill it now.
 
 > > That one jumped out at me when I skimmed over the patch. I have not
 > > reviewed all the ZFS code to look for other unchanged references
 > > that are not part of the patch context.
 > 
 > Keep in mind that I have been using this for two months now on 3 
 > systems, 5 zpools and a total of over 50TB data written post-patch
 > with varying record sizes (128k, 1MB, 4MB, 8MB). All systems boot
 > directly from the big pools using unmodified (128k limited)
 > bootloader.
 Thank you for your work in this area and the extensive testing you have
 done. Do you have any performance data from these tests that you can
 share? Do you have some reason for not going beyond 8MB record size
 (diminishing returns, etc)?
 
 I have put some thought to additional compression algorithms in ZFS,
 e.g. LZMA, which would require large record size to see significant
 gains over existing gzip support. High strength compression on large
 data chunks would be slow for data frequently written, but for an
 archival filesystem where data is written once and then read
 periodically it would be quite useful.
Responsible-Changed-From-To: freebsd-bugs->freebsd-fs 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Wed May 8 21:34:09 UTC 2013 
Responsible-Changed-Why:  
Over to maintainer(s). 

http://www.freebsd.org/cgi/query-pr.cgi?pr=178388 

From: "Steven Hartland" <smh@freebsd.org>
To: <bug-followup@freebsd.org>,
	<nowak@tepeserwery.pl>
Cc:  
Subject: Re: kern/178388: [zfs] [patch] allow up to 8MB recordsize
Date: Wed, 8 May 2013 23:12:04 +0100

 Seems interesting but it's really something that needs to be
 reviewed and submitted upstream (illumos).
>Unformatted:
