From nobody@FreeBSD.org  Wed Jun 27 09:14:53 2012
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 9481E106566B
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 27 Jun 2012 09:14:53 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from red.freebsd.org (red.freebsd.org [IPv6:2001:4f8:fff6::22])
	by mx1.freebsd.org (Postfix) with ESMTP id 800EE8FC12
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 27 Jun 2012 09:14:53 +0000 (UTC)
Received: from red.freebsd.org (localhost [127.0.0.1])
	by red.freebsd.org (8.14.4/8.14.4) with ESMTP id q5R9Eqh7010510
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 27 Jun 2012 09:14:52 GMT
	(envelope-from nobody@red.freebsd.org)
Received: (from nobody@localhost)
	by red.freebsd.org (8.14.4/8.14.4/Submit) id q5R9EqsC010501;
	Wed, 27 Jun 2012 09:14:52 GMT
	(envelope-from nobody)
Message-Id: <201206270914.q5R9EqsC010501@red.freebsd.org>
Date: Wed, 27 Jun 2012 09:14:52 GMT
From: Levent Serinol <levent.serinol@mynet.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: ZFS stalls on heavy I/O
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         169480
>Category:       kern
>Synopsis:       [zfs] ZFS stalls on heavy I/O
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-fs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Jun 27 09:20:07 UTC 2012
>Closed-Date:    
>Last-Modified:  Tue Jan 29 19:00:00 UTC 2013
>Originator:     Levent Serinol
>Release:        9.0 and 10.0
>Organization:
>Environment:
FreeBSD localhost 9.0-RELEASE FreeBSD 9.0-RELEASE #0: Wed Mar 21 11:15:41 EET 2012     root@webmail122.mynet.com:/usr/obj/usr/src/sys/GENERIC  amd64
>Description:
 Under heavy I/O load we see freezes on ZFS volumes on both FreeBSD 9-RELEASE and 10-CURRENT. The machines are 64-bit HP servers with HP Smart Array 6400 RAID controllers (with battery units). Every da device is a hardware RAID5 volume of 9x300GB 10K SCSI hard drives. Most of the I/O is local to the system, apart from some light NFS traffic (lookup/getattr etc.) from a few other servers. These are mail servers (qmail) with small I/O patterns (64K reads/writes). Below you can find procstat output taken at freeze time. write_limit is set to 200MB because of the huge number of txg_wait_opens observed before. Every process ends up stuck in D state, I think because the txg queue and the other two queues are full. Is there any suggestion for fixing this?
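
For reference, a sketch of how the write limit was capped (tunable name as in 9.0's ZFS write throttle; 209715200 is 200MB expressed in bytes):

# sysctl vfs.zfs.txg.write_limit_override=209715200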

btw, inject_compress is the main process injecting emails into user inboxes (databases). Also, these machines ran without problems on Linux/XFS; a while ago we started migrating them from Linux to FreeBSD.

http://pastebin.com/raw.php?i=ic3YepWQ
>How-To-Repeat:

>Fix:


>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->freebsd-fs 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Wed Jun 27 09:42:11 UTC 2012 
Responsible-Changed-Why:  
Over to maintainer(s). 

http://www.freebsd.org/cgi/query-pr.cgi?pr=169480 

From: Harry Coin <hgcoin@gmail.com>
To: bug-followup@FreeBSD.org, levent.serinol@mynet.com
Cc:  
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Fri, 25 Jan 2013 19:26:10 -0600

 I think I have an easier way to reproduce this problem on a very simple 
 setup:
 
 Boot a 'nas4free' livecd version 573 (freebsd 9.1.0.1).   Mount one 
 zpool with no dedup, simple raidz 4 drive setup, 4GB memory.  Don't 
 enable any of the various features, just get a shell from the live cd 
 prompt.  Use dd to make a 20GB file on the pool writing from 
 /dev/random.  While that's going, go to another virtual console cd to 
 the pool, and do an ls -l.   It won't return.  Of interest, long before 
 wired memory explodes to the limit, ls not returning while dd is 
 running, 'top' on another shell reports this:
 
 99.7% idle, 2.6gb free ram, dd in tx->tx, zfskern in zio->i, and intr in 
 WAIT
 
 Eventually disk activity, as shown by the cheery little flickering lamp, 
 slows, then stops.  Characters still echo at the prompt of the ls -l 
 that hasn't returned, but nothing else responds.  Alt-tab, strangely, 
 still works.  The only way out is to power cycle the box.
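 
 A condensed sketch of the repro, for anyone trying this (the pool 
 mountpoint /mnt/pool1 is an assumption; substitute your own):
 
 nas4free:~# cd /mnt/pool1
 nas4free:~# dd if=/dev/urandom of=foo bs=512 count=20000000 &
 (then, on another virtual console:)
 nas4free:~# cd /mnt/pool1 && ls -l    # hangs, never returns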
 
 Details here:
 
 http://forums.nas4free.org/viewtopic.php?p=12519#p12519
 
 Feels like completed I/Os are getting lost and the system just waits and 
 waits until there are no more resources left to issue new commands, or 
 too many threads are locked waiting for something that will never return.
 
 Harry
 
 
 

From: Jeremy Chadwick <jdc@koitsu.org>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Fri, 25 Jan 2013 18:08:59 -0800

 Harry, things that come to mind immediately:
 
 1. http://www.quietfountain.com/fs1pool1.txt is when your pool contained
 both L2ARC and ZIL devices on SSDs.  Please remove the SSDs from the
 picture entirely and use raidz1 disks ada[2345] only at this point.
 
 I do not want to discuss ada[01] at this time, because they're SSDs.
 There are quite literally 4 or 5 "catches" to using these devices on
 FreeBSD ZFS, but the biggest problem -- and this WILL hurt you, no
 arguments about it -- is lack of TRIM support.  You will hurt your SSDs
 over time doing this.  If you want TRIM support on ZFS you will need to
 run -CURRENT.
 
 We can talk more about the SSDs later.  As said, please remove them from 
 the picture for starters, as all they do is make troubleshooting much, 
 much harder.
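 
 A minimal sketch of pulling those devices out (the device names are 
 placeholders -- use whatever "zpool status" shows for your cache/log 
 vdevs):
 
 # zpool remove pool1 <cache-device>
 # zpool remove pool1 <log-device>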
 
 2. I do see some raw I/O benchmarks but only for ada2.  This is
 insufficient.  A single disk performing like crap in a pool can slow
 down the entire response time for everything.  I can do analysis of all
 of your disks if the issue is narrowed down to one of them.
 
 "gstat -I500ms" is a good way to watch I/O speeds in real-time.  I find
 this more effective than "zpool iostat -v 1" for per-device info.
 
 3. The ada[2345] disks involved are Hitachi HDS723015BLA642 (7K3000, 
 1.5TB), and there is sparse info on the web as to whether these are 
 512-byte physical sector disks or 4096-byte.  smartmontools 6.0 or 
 newer will tell you.
 
 All disks, regardless, advertise a 512-byte logical size to remain 
 fully compatible with legacy systems, but the performance hit on I/O is 
 major if the disks are physically 4096-byte and the pool ashift isn't 
 12.  So please check this with smartmontools 6.0 or newer.
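 
 A quick way to check (output shape is illustrative; a 4K-sector disk 
 will show 4096 as the physical size):
 
 # smartctl -i /dev/ada2 | grep -i sector
 Sector Sizes:     512 bytes logical, 4096 bytes physical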
 
 If the disks use physically 4096-byte sectors, you need to use gnop(8)
 to align them and create the pool off of that.  Ivan Voras wrote a
 wonderful guide on how to do this, and it's very simple:
 
 http://ivoras.net/blog/tree/2011-01-01.freebsd-on-4k-sector-drives.html
 
 It wouldn't hurt you to do this regardless, as there's no performance
 hit using the gnop(8) method on 512-byte sector drives; this would
 "future-proof" you if upgrading to newer disks too.  You want ashift=12.
 
 4. Why are all of your drives partitioned?  In other words, why are you
 using adaXpX rather than just adaX for your raidz1 pool?  "gpart show"
 output was not provided, and I can only speculate as to what's going on
 under the hood there.
 
 Please use raw disks when recreating your pool, i.e. ada2, ada3, ada4,
 etc...
 
 I know for your cache/logs this is a different situation but again,
 please remove those from the picture.
 
 5. Please keep your Hitachi disks on the Intel ICH7 controller for the
 time being.  It's SATA300 but that isn't going to hurt these disks.
 Don't bring the Marvell into the picture yet.  Don't change around
 cabling or anything else.
 
 6. For any process that takes a long while, you're going to need to do
 "procstat -kk" (yes -k twice) against it.
 
 7. I do not think your issue is related to this PR.  I would suggest
 discussing it on freebsd-fs first.  Of course, you're also using
 something called "nas4free" which may or may not be *true, unaltered*
 FreeBSD -- I have no idea.  I often find it frustrating when, say, the
 FreeNAS folks or other "FreeBSD fork projects" appear on the FreeBSD
 mailing lists "just because it uses FreeBSD".  You always have to go
 with the vendor for support (like you did on their forum), but if you
 really think this is a FreeBSD "kernel thing" freebsd-fs is fine.
 
 Start with what I described above and go from there.
 
 -- 
 | Jeremy Chadwick                                   jdc@koitsu.org |
 | UNIX Systems Administrator                http://jdc.koitsu.org/ |
 | Mountain View, CA, US                                            |
 | Making life hard for others since 1977.             PGP 4BD6C0CB |
 

From: Harry Coin <hgcoin@gmail.com>
To: bug-followup@FreeBSD.org, levent.serinol@mynet.com, 
 Jeremy Chadwick <jdc@koitsu.org>
Cc:  
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Fri, 25 Jan 2013 21:04:05 -0600

 Jeremy, Thanks.
 
 re: 1.  Not only have I removed the zil, I removed the cache before 
 running this test as well, thinking, as you did, that the whole l2arc 
 thing was adding too much complexity.   The pool was nothing more than 
 the single raidz when doing the test that caused the failure noted.
 
 re 2.  The raw benchmarks you saw for ada2 were identically duplicated 
 for 3, 4 and 5.  They are all exactly the same drive make and model 
 purchased at the same time.
 
 re 3.  These are 512 byte/sector drives.  v6.0 smartctl -a reports 512 
 bytes/sector logical / physical.
 
 re 4.  The drives are partitioned so that each of them can hold boot 
 information should I decide someday to boot off the array, so that 
 sector 0 of the data partition starts on the afore-noticed 4K boundary, 
 and so there are a couple of gigs at the end for swap if I decide I want 
 that.  Partition 1 is just big enough for the boot code, 2 is the bulk 
 of the storage, and 3 is the last 2 gigs.
 
 So, no zil or cache used in the test, all formatted zfs v28, scrub 
 reported zero errors on the array two days ago (though the scrub speed 
 was 14M/s or so).
 
 It's something about max speed writes that get interrupted by another 
 read that kills zfs/nas4free somehow.
 
 re 5:  The ata drives indeed are on the ICH7, as the SATA300 speed isn't 
 an issue.  The ssd's are on the higher speed controller, and are unused 
 as noted before.
 
 re 6:  You'll note that 'ls -l' didn't return; only the 'top' program, 
 launched before the dd was started, kept running.  Here are the procstat 
 -kk's for the dd and ls during their run (the ls never returning):
 nas4free:~# ps axu | grep dd
 root  4006  47.1  0.1  9916 1752 v0  R+    2:56AM   0:07.92 dd 
 if=/dev/urandom of=foo bs=512 count=20000000
 root  4009   0.0  0.1 16280 2092  0  S+    2:56AM   0:00.00 grep dd
 nas4free:~# procstat -kk 4006
    PID    TID COMM             TDNAME           KSTACK
   4006 100114 dd               -                mi_switch+0x186 
 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 
 dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 
 dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 
 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
    PID    TID COMM             TDNAME           KSTACK
   4006 100114 dd               -                mi_switch+0x186 
 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 
 dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 
 dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 
 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
    PID    TID COMM             TDNAME           KSTACK
   4006 100114 dd               -                mi_switch+0x186 
 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 
 dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 
 dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 
 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
    PID    TID COMM             TDNAME           KSTACK
   4006 100114 dd               -                mi_switch+0x186 
 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 
 dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 
 dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 
 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# ps axu | grep ls
 root  4016   0.0  0.1 14420 2196 v2  D+    2:57AM   0:00.00 ls -l
 root  4018   0.0  0.1 16280 2084  0  RL+   2:57AM   0:00.00 grep ls
 nas4free:~# procstat -kk 4016
    PID    TID COMM             TDNAME           KSTACK
   4016 100085 ls               -                mi_switch+0x186 
 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
 vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
    PID    TID COMM             TDNAME           KSTACK
   4016 100085 ls               -                mi_switch+0x186 
 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
 vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
    PID    TID COMM             TDNAME           KSTACK
   4016 100085 ls               -                mi_switch+0x186 
 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
 vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
    PID    TID COMM             TDNAME           KSTACK
   4016 100085 ls               -                mi_switch+0x186 
 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
 vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
    PID    TID COMM             TDNAME           KSTACK
   4006 100114 dd               -                mi_switch+0x186 
 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
 dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
    PID    TID COMM             TDNAME           KSTACK
   4006 100114 dd               -                mi_switch+0x186 
 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
 dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
    PID    TID COMM             TDNAME           KSTACK
   4006 100114 dd               -                mi_switch+0x186 
 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
 dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
    PID    TID COMM             TDNAME           KSTACK
   4016 100085 ls               -                mi_switch+0x186 
 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
 vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
    PID    TID COMM             TDNAME           KSTACK
   4016 100085 ls               -                mi_switch+0x186 
 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
 vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
    PID    TID COMM             TDNAME           KSTACK
   4016 100085 ls               -                mi_switch+0x186 
 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
 vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
    PID    TID COMM             TDNAME           KSTACK
   4006 100114 dd               -                mi_switch+0x186 
 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
 dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
    PID    TID COMM             TDNAME           KSTACK
   4006 100114 dd               -                mi_switch+0x186 
 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
 dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
    PID    TID COMM             TDNAME           KSTACK
   4006 100114 dd               -                mi_switch+0x186 
 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
 dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4006
    PID    TID COMM             TDNAME           KSTACK
   4006 100114 dd               -                mi_switch+0x186 
 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
 dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
    PID    TID COMM             TDNAME           KSTACK
   4016 100085 ls               -                mi_switch+0x186 
 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
 vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
    PID    TID COMM             TDNAME           KSTACK
   4016 100085 ls               -                mi_switch+0x186 
 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
 vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
    PID    TID COMM             TDNAME           KSTACK
   4016 100085 ls               -                mi_switch+0x186 
 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
 vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
    PID    TID COMM             TDNAME           KSTACK
   4016 100085 ls               -                mi_switch+0x186 
 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
 vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~# procstat -kk 4016
    PID    TID COMM             TDNAME           KSTACK
   4016 100085 ls               -                mi_switch+0x186 
 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
 vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
 amd64_syscall+0x546 Xfast_syscall+0xf7
 nas4free:~#
 
 
 re 7:  The 'submit a problem' form doesn't list 'fs' as an option, just 
 kern.  If there's another place for fs bugs I trust the freebsd folks 
 will know what to do with it.  Anyhow, I do think it's about write 
 completions getting lost.   And 'nas4free' is a pretty popular project 
 with the major virtue that anyone, anywhere looking to solve this 
 problem can download a livecd iso and reproduce it.  This is not my 
 first rodeo; there are several bug reports with various black-magic 
 tunable tweaks that, I think, only avoid the bug because their usage 
 patterns happen not to hit it.
 
 This particular FreeBSD-based iso download has the virtue of making the 
 problem completely reproducible.
 
 Does any of the above help?   Also, here's an updated zpool status:
 
 nas4free:~# zpool status pool1
    pool: pool1
   state: ONLINE
    scan: scrub canceled on Fri Jan 25 22:07:51 2013
 config:
 
      NAME        STATE     READ WRITE CKSUM
      pool1       ONLINE       0     0     0
        raidz1-0  ONLINE       0     0     0
          ada2p2  ONLINE       0     0     0
          ada3p2  ONLINE       0     0     0
          ada4p2  ONLINE       0     0     0
          ada5p2  ONLINE       0     0     0
 
 errors: No known data errors
 nas4free:~#
 
 nas4free:~# zpool iostat -v pool1
                 capacity     operations    bandwidth
 pool        alloc   free   read  write   read  write
 ----------  -----  -----  -----  -----  -----  -----
 pool1       3.01T  2.43T    168      0   360K  3.42K
    raidz1    3.01T  2.43T    168      0   360K  3.42K
      ada2p2      -      -    124      0   104K  2.34K
      ada3p2      -      -    123      0   104K  2.28K
      ada4p2      -      -    124      0   103K  2.34K
      ada5p2      -      -    126      0  98.1K  2.29K
 ----------  -----  -----  -----  -----  -----  -----
 
 nas4free:~#
 
 nas4free:~# zpool get all pool1
 NAME   PROPERTY       VALUE       SOURCE
 pool1  size           5.44T       -
 pool1  capacity       55%         -
 pool1  altroot        -           default
 pool1  health         ONLINE      -
 pool1  guid           1701438519865110975  default
 pool1  version        28          default
 pool1  bootfs         -           default
 pool1  delegation     on          default
 pool1  autoreplace    off         default
 pool1  cachefile      -           default
 pool1  failmode       wait        default
 pool1  listsnapshots  off         default
 pool1  autoexpand     off         default
 pool1  dedupditto     0           default
 pool1  dedupratio     1.46x       -
 pool1  free           2.43T       -
 pool1  allocated      3.01T       -
 pool1  readonly       off         -
 pool1  comment        -           default
 pool1  expandsize     0           -
 nas4free:~#
 nas4free:~# zfs get all pool1/videos
 NAME          PROPERTY              VALUE                  SOURCE
 pool1/videos  type                  filesystem             -
 pool1/videos  creation              Sat Jan 19  5:12 2013  -
 pool1/videos  used                  526G                   -
 pool1/videos  available             1.74T                  -
 pool1/videos  referenced            526G                   -
 pool1/videos  compressratio         1.00x                  -
 pool1/videos  mounted               yes                    -
 pool1/videos  quota                 none                   local
 pool1/videos  reservation           none                   local
 pool1/videos  recordsize            128K                   default
 pool1/videos  mountpoint            /mnt/pool1/videos      inherited from pool1
 pool1/videos  sharenfs              off                    default
 pool1/videos  checksum              on                     default
 pool1/videos  compression           off                    local
 pool1/videos  atime                 off                    local
 pool1/videos  devices               on                     default
 pool1/videos  exec                  on                     default
 pool1/videos  setuid                on                     default
 pool1/videos  readonly              off                    local
 pool1/videos  jailed                off                    default
 pool1/videos  snapdir               hidden                 local
 pool1/videos  aclmode               discard                default
 pool1/videos  aclinherit            restricted             default
 pool1/videos  canmount              on                     local
 pool1/videos  xattr                 off                    temporary
 pool1/videos  copies                1                      default
 pool1/videos  version               5                      -
 pool1/videos  utf8only              off                    -
 pool1/videos  normalization         none                   -
 pool1/videos  casesensitivity       sensitive              -
 pool1/videos  vscan                 off                    default
 pool1/videos  nbmand                off                    default
 pool1/videos  sharesmb              off                    default
 pool1/videos  refquota              none                   default
 pool1/videos  refreservation        none                   default
 pool1/videos  primarycache          all                    default
 pool1/videos  secondarycache        all                    default
 pool1/videos  usedbysnapshots       0                      -
 pool1/videos  usedbydataset         526G                   -
 pool1/videos  usedbychildren        0                      -
 pool1/videos  usedbyrefreservation  0                      -
 pool1/videos  logbias               latency                default
 pool1/videos  dedup                 off                    local
 pool1/videos  mlslabel                                     -
 pool1/videos  sync                  standard               local
 pool1/videos  refcompressratio      1.00x                  -
 pool1/videos  written               526G                   -
 nas4free:~#
 
 
 
 
 
 
 
 
 
 
 
 

From: Jeremy Chadwick <jdc@koitsu.org>
To: Harry Coin <hgcoin@gmail.com>
Cc: bug-followup@FreeBSD.org, levent.serinol@mynet.com
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Fri, 25 Jan 2013 19:55:26 -0800

 Recommendations:
 
 1. Instead of /dev/random use /dev/zero.  /dev/random is not blazing
 fast given it has to harvest lots of entropy from places.  If you're
 doing I/O speed testing just use /dev/zero.  The speed difference is
 quite big.
 
 2. For dd, instead of bs=512 use bs=64k.  bs=512 is far from ideal; 
 these are direct I/O writes of 512 bytes each, which is dog slow.  I 
 repeat: dog slow.  Linux does this differently.
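 
 For example (the file path is an assumption; 327680 x 64k = 20GB):
 
 # dd if=/dev/zero of=/mnt/pool1/testfile bs=64k count=327680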
 
 3. During the dd, in another VTY or window, use "gstat -I500ms" and
 watch the I/O speeds for your ada[2345] disks during the dd.  They
 should be hitting peaks between 60-150MBytes/sec under the far right
 "Kbps" field (far left=read, far right=write).
 
 The large potential speed variance has to do with how much data you
 already have on the pool, i.e. MHDDs get slower as the actuator arms
 move inward towards the spindle motor.  That's why you might see, for
 example, 150MBytes/sec when reading/writing to low-numbered LBAs but
 slower speeds when writing to high-numbered LBAs.
 
 This speed will be "bursty" and "sporadic" due to how the ZFS ARC 
 works.  The interval at which "things are flushed to disk" is based on 
 the vfs.zfs.txg.timeout sysctl, which on FreeBSD 9.1-RELEASE should 
 default to 5 (5 seconds).
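 
 You can verify what your system is using:
 
 # sysctl vfs.zfs.txg.timeout
 vfs.zfs.txg.timeout: 5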
 
 4. "zpool iostat -v {pool}" does not provide accurate speed indications
 for the same reason "iostat" doesn't show ""valid"" (it does but not
 what most people would hope for) information while "iostat 1" would.
 You need to run it with an interval, i.e. "zpool iostat -v {pool} 1" and
 let it run for a while while doing I/O.  But I recommend using gstat
 like I said, simply because the interval can be set at 500ms (0.5s) and
 you get a better idea of what your peak I/O speed is.
 
 If you find a single disk that is **always** performing badly, then that
 disk is your bottleneck and I can help you with analysis of its problem.
 
 5. Your "zpool scrub" speed being 14MBytes/second indicates you are no
 where close to your ideal I/O speed.  It should not be that slow unless
 you're doing tons of I/O at the same time as the scrub.  Also, scrubs
 take longer now due to the disabling of the vdev cache (and that's not a
 FreeBSD thing, it's that way in Illumos too, and it's a sensitive topic
 to discuss).
 
 6. On FreeBSD 9.1-RELEASE generally speaking you should not have to tune
 any sysctls.  The situation was different in 8.x and 9.0.  Your system
 only has 4GB of RAM so prefetching automatically gets disabled, by the
 way, just in case you were wondering about that (there were problems
 with prefetch in older releases).
 
 7. You should probably keep "top -s 1" running, and you might even
 consider using "top -S -s 1" to see system/kernel threads (they're in
 brackets).  This isn't going to tell you downright what's making things
 slow though.  "vmstat -i" during heavy I/O would be useful too, just in
 case somehow you have a shared interrupt that's being pegged hard (for
 example I've seen SATA controllers and USB controllers sharing an
 interrupt, even with APICs, where the USB layer is busted churning out
 1000 ints/sec and thus affecting SATA I/O speed).
 
 8. If you want to compare systems I'm happy to do so, although I have
 less disks than you do (3 in raidz1, WD Red 1TB drives).  However my
 system is not a Pentium D-class processor; it's a Core 2 Quad Q9500.
 The D-class stuff is fairly old.
 
 I have some other theories as well but one thing at a time.
 
 -- 
 | Jeremy Chadwick                                   jdc@koitsu.org |
 | UNIX Systems Administrator                http://jdc.koitsu.org/ |
 | Mountain View, CA, US                                            |
 | Making life hard for others since 1977.             PGP 4BD6C0CB |
 

From: Harry Coin <hgcoin@gmail.com>
To: bug-followup@FreeBSD.org, levent.serinol@mynet.com, jdc@koitsu.org
Cc:  
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Fri, 25 Jan 2013 23:49:23 -0600

 I appreciate your effort very much.
 
 One change from last time: I set failmode=continue (from 'wait').   
 There were no errors in zpool status or other logs.
 
 re 1,2: switching to bs=64k and /dev/zero as the input source, the dd 
 completes and the ls -l during the process causes no problems.  Transfer 
 speed reported by dd was 183MBytes/sec.
 
 re 3:  the gstat -I500ms shows each of the raid partitions hanging in at 
 about top speed of 50-ish MBps during the above dd.  All disks perform 
 within a couple % of one another.
 
 re 4: we'll stick with gstat.
 
 re 5: I see we agree about that.  Nothing happened during the scrub 
 other than the scrub.  14Mbps scrub speed is 1/6th of the minimum I'd 
 expect.
 
 re 6:  Tuning Is Evil.  On the other hand, Crashing is Eviler.  So I 
 tried Evil Tuning, with poor results as noted upstream.  So, perhaps 
 tuning is indeed not only evil but a black hole for time as well.
 
 Switching back to the 512 byte writes, I notice on gstat the writes are 
 zero for 4 secs or so, then a burst of activity (all partitions within a 
 few % of one another), all quiet, repeat.   Then I throw in the ls -l... 
 and... it all works with no problems.   Trying the dd with the big 
 blocksize and /dev/urandom.... and:  gstat looks bursty as in the 512 
 byte writes.   ls -l causes no changes, works.    Trying the original 
 /dev/urandom and bs=512 and ... the gstat pattern does the usual bursty 
 thing, ls -l works well.    Thinking it's about the gstat...
 
 Doing top -S -s 1 .. nothing.
 
 nas4free:~# vmstat -i
 interrupt                          total       rate
 irq1: atkbd0                         168          0
 irq18: uhci2                        1136          0
 irq20: hpet0                      620136        328
 irq21: uhci0 ehci0                  7764          4
 irq256: hdac0                         37          0
 irq257: ahci0                       1090          0
 irq258: em0                         5625          2
 irq259: ahci1                     849746        450
 Total                            1485702        787
 
 So weird.  It's not crashing.  So, I tried to move a video from one 
 dataset to another using mv.  Within moments all the writes gstat shows 
 have stopped, while there are a few KB/s of reads.  And there it sits.  
 Here's top:
 
 last pid:  3932;  load averages:  0.02,  0.04,  
 0.13                                             up 0+00:24:46  05:03:53
 249 processes: 3 running, 229 sleeping, 17 waiting
 CPU:  0.0% user,  0.0% nice,  1.2% system,  0.0% interrupt, 98.8% idle
 Mem: 14M Active, 19M Inact, 953M Wired, 8256K Buf, 2246M Free
 Swap:
 
    PID USERNAME     PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
     11 root         155 ki31     0K    32K CPU1    1  22:02 100.00% 
 idle{idle: cpu1}
     11 root         155 ki31     0K    32K RUN     0  22:06 98.49% 
 idle{idle: cpu0}
     13 root          -8    -     0K    48K -       1   0:06  0.10% 
 geom{g_up}
      5 root          -8    -     0K    80K zio->i  0   0:05  0.10% 
 zfskern{txg_thread_enter}
      0 root         -16    0     0K  2624K sched   1   2:30  0.00% 
 kernel{swapper}
      0 root         -16    0     0K  2624K -       0   0:30  0.00% 
 kernel{zio_read_intr_1}
      0 root         -16    0     0K  2624K -       0   0:30  0.00% 
 kernel{zio_read_intr_0}
      0 root         -16    0     0K  2624K -       1   0:10  0.00% 
 kernel{zio_write_issue_}
      0 root         -16    0     0K  2624K -       1   0:10  0.00% 
 kernel{zio_write_issue_}
     12 root         -88    -     0K   272K WAIT    1   0:08  0.00% 
 intr{irq259: ahci1}
      0 root         -16    0     0K  2624K -       1   0:07  0.00% 
 kernel{zio_write_intr_2}
      0 root         -16    0     0K  2624K -       1   0:07  0.00% 
 kernel{zio_write_intr_3}
      0 root         -16    0     0K  2624K -       1   0:07  0.00% 
 kernel{zio_write_intr_6}
      0 root         -16    0     0K  2624K -       1   0:07  0.00% 
 kernel{zio_write_intr_1}
      0 root         -16    0     0K  2624K -       0   0:07  0.00% 
 kernel{zio_write_intr_0}
      0 root         -16    0     0K  2624K -       0   0:07  0.00% 
 kernel{zio_write_intr_4}
      0 root         -16    0     0K  2624K -       1   0:07  0.00% 
 kernel{zio_write_intr_7}
      0 root         -16    0     0K  2624K -       1   0:07  0.00% 
 kernel{zio_write_intr_5}
     13 root          -8    -     0K    48K -       1   0:07  0.00% 
 geom{g_down}
   3919 root          20    0  9916K  1636K tx->tx  1   0:07  0.00% mv
     20 root          16    -     0K    16K syncer  0   0:03  0.00% syncer
     12 root         -60    -     0K   272K WAIT    0   0:01  0.00% 
 intr{swi4: clock}
     12 root         -52    -     0K   272K WAIT    1   0:01  0.00% 
 intr{swi6: Giant task}
     14 root         -16    -     0K    16K -       0   0:01  0.00% yarrow
      5 root          -8    -     0K    80K arc_re  0   0:01  0.00% 
 zfskern{arc_reclaim_thre}
     12 root         -84    -     0K   272K WAIT    1   0:01  0.00% 
 intr{irq1: atkbd0}
      0 root         -16    0     0K  2624K -       1   0:00  0.00% 
 kernel{zio_write_issue_}
      0 root          -8    0     0K  2624K -       0   0:00  0.00% 
 kernel{zil_clean}
      0 root         -16    0     0K  2624K -       1   0:00  0.00% 
 kernel{zio_write_issue_}
 
 
 Set failmode back to wait.  Rebooted from the livecd and now repeating 
 the tests:
 And they work properly.
 
 So,  with gstat running I once again try to move a video file from one 
 dataset to another.   I watch a bunch of reads happen, then a burst of 
 writes, and that goes on for a couple of minutes.  Then the writes stop 
 and the reads continue at 3Kbps according to gstat.  The total memory 
 still has 2.5G free.   There is no writing to the zpool.   I notice that 
 one of the video files has moved, and I'm guessing the writing stopped 
 when it was trying to open the next one.   I can successfully ^C out of 
 the mv, and notice a constant low speed read going on, 80-100KBps.   I 
 try restarting the move; no change to the read pattern, no writes.
 
 So, no memory explosion, ^C works, but I can't write, and gstat reports 
 a steady low-pace read for no obvious reason, just as it was when the mv 
 process was active.
 
 So, an even bigger puzzle.
 
 
 
 

From: Jeremy Chadwick <jdc@koitsu.org>
To: Harry Coin <hgcoin@gmail.com>
Cc: bug-followup@FreeBSD.org, levent.serinol@mynet.com
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Tue, 29 Jan 2013 10:50:28 -0800

 Re 1,2: that transfer speed (183MBytes/second) sounds much better/much
 more accurate for what's going on.  The speed-limiting factors were
 certainly a small blocksize (512 bytes) used by dd, and using
 /dev/random rather than /dev/zero.  I realise you're probably expecting
 to see something like 480MBytes/second (4 drives * 120MB/sec), but
 that's probably not going to happen on that model of system and with
 that CPU.
 
 For example, on my Q9550 system described earlier, I can get about this:
 
 $ dd if=/dev/zero of=testfile bs=64k
 ^C27148+0 records in
 27147+0 records out
 1779105792 bytes transferred in 6.935566 secs (256519186 bytes/sec)
 
 While "gstat -I500ms" shows each disk going between 60MBytes/sec and
 140MBytes/sec.  "zpool iostat -v data 1" shows between 120-220MBytes/sec
 at the pool level, and showing around 65-110MBytes/sec on a per-disk
 level.
 
 Anyway, point being, things are faster with a large bs and from a source
 that doesn't churn interrupts.  But don't necessarily "pull a Linux" and
 start doing things like bs=1m -- as I said before, Linux dd is
 different, because the I/O is cached (without --direct), while on
 FreeBSD dd is always direct.
 
 Re 3: That sounds a bit on the slow side.  I would expect those disks, 
 at least during writes, to do more.  If **all** the drives show this 
 behaviour consistently in gstat, then you know the issue IS NOT with an 
 individual disk and instead lies elsewhere.  That rules out one piece of 
 the puzzle, and that's good.
 
 Re 5: Did you mean to type 14MBytes/second, not 14mbits/second?  If so,
 yes, I would agree that's slow.  Scrubbing is not necessarily a good way
 to "benchmark" disks, but I understand for "benchmarking" ZFS it's the
 best you've got to some degree.
 
 Regarding dd'ing and 512 bytes -- as I described to you in my previous
 mail:
 
 > This speed will be "bursty" and "sporadic" due to how the ZFS ARC
 > works.  The interval at which "things are flushed to disk" is based on
 > the vfs.zfs.txg.timeout sysctl, which on FreeBSD 9.1-RELEASE should
 > default to 5 (5 seconds).
 
 This is where your "4 secs or so" magic value comes from.  Please do not
 change this sysctl/value; keep it at 5.
 
 Finally, your vmstat -i output shows something of concern, UNLESS you
 did this WHILE you had the dd (doesn't matter what block size) going,
 and are using /dev/random or /dev/urandom (same thing on FreeBSD):
 
 > irq20: hpet0                      620136        328
 > irq259: ahci1                     849746        450
 
 These interrupt rates are quite high.  hpet0 refers to your event
 timer/clock timer (see kern.eventtimer.choice and kern.eventtimer.timer)
 being HPET, and ahci1 refers to your Intel ICH7 AHCI controller.
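 
 You can see what's available and selected like so (the list and priority 
 numbers vary by machine; the values shown are illustrative):
 
 # sysctl kern.eventtimer.choice kern.eventtimer.timer
 kern.eventtimer.choice: HPET(450) LAPIC(400) i8254(100) RTC(0)
 kern.eventtimer.timer: HPET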
 
 Basically what's happening here is that you're generating a ton of
 interrupts doing dd if=/dev/urandom bs=512.  And it makes perfect sense
 to me why: because /dev/urandom has to harvest entropy from interrupt
 sources (please see random(4) man page), and you're generating a lot of
 interrupts to your AHCI controller for each individual 512-byte write.
 
 When you say "move a video from one dataset to another", please explain
 what it is you're moving from and to.  Specifically: what filesystems,
 and output from "zfs list".
 
 If you're moving a file from a ZFS filesystem to another ZFS filesystem
 on the same pool, then please state that.  That may help kernel folks
 figure out where your issue lies.
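 
 (Keep in mind that an mv between two datasets -- even on the same pool 
 -- is a userland copy-then-unlink, since rename(2) fails across 
 filesystems.  E.g., with illustrative paths:
 
 # mv /mnt/pool1/videos/file.mkv /mnt/pool1/other/
 
 generates a full read+write stream, much like your dd test.)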
 
 At this stage, a kernel developer is going to need to step in and try to
 help you figure out where the actual bottleneck is occurring.  This is
 going to be very difficult/complex/very likely not possible with you
 using nas4free, because you will almost certainly be asked to rebuild
 world/kernel to include some new options and possibly asked to include
 DTrace/CTF support (for real-time debugging).  The situation is tricky.
 
 It would really help if you would/could remove nas4free from the picture 
 and instead just run stock FreeBSD, because as I said, if there are 
 kernel tunings or adjusted values the nas4free folks put in place that 
 stock FreeBSD doesn't have, those could be harming you.
 
 I can't be of more help here, I'm sorry to say.  The good news is that
 your disks sound fine.  Kernel developers will need to take this up.
 
 P.S. -- I would strongly recommend updating your nas4free forum post
 with a link to this conversation in this PR.  IMO, the nas4free people
 need to step up and take responsibility (and that almost certainly means
 talking/working with the FreeBSD folks).
 
 -- 
 | Jeremy Chadwick                                   jdc@koitsu.org |
 | UNIX Systems Administrator                http://jdc.koitsu.org/ |
 | Mountain View, CA, US                                            |
 | Making life hard for others since 1977.             PGP 4BD6C0CB |
 
>Unformatted:
