From nobody@FreeBSD.org  Sun May 26 19:51:50 2013
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
	by hub.freebsd.org (Postfix) with ESMTP id DBC63E8E
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 26 May 2013 19:51:50 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from oldred.FreeBSD.org (oldred.freebsd.org [8.8.178.121])
	by mx1.freebsd.org (Postfix) with ESMTP id BCDBFA95
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 26 May 2013 19:51:50 +0000 (UTC)
Received: from oldred.FreeBSD.org ([127.0.1.6])
	by oldred.FreeBSD.org (8.14.5/8.14.5) with ESMTP id r4QJpnlE071713
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 26 May 2013 19:51:49 GMT
	(envelope-from nobody@oldred.FreeBSD.org)
Received: (from nobody@localhost)
	by oldred.FreeBSD.org (8.14.5/8.14.5/Submit) id r4QJpn9Z071712;
	Sun, 26 May 2013 19:51:49 GMT
	(envelope-from nobody)
Message-Id: <201305261951.r4QJpn9Z071712@oldred.FreeBSD.org>
Date: Sun, 26 May 2013 19:51:49 GMT
From: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To: freebsd-gnats-submit@FreeBSD.org
Subject: Heavy disk I/O may hang system
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         178997
>Category:       kern
>Synopsis:       [vfs] [patch] Heavy disk I/O may hang system
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun May 26 20:00:00 UTC 2013
>Closed-Date:    
>Last-Modified:  Sun Sep 29 18:10:00 UTC 2013
>Originator:     Klaus Weber
>Release:        Tested with 9/STABLE, verified in 10-CURRENT
>Organization:
-
>Environment:
FreeBSD filepile 9.1-STABLE FreeBSD 9.1-STABLE #5 r250475: Sun May 12 19:14:21 CEST 2013     root@filepile:/usr/obj/usr/src/sys/FILEPILE  amd64
(Kernel config has some drivers removed and debugging options added. Tested with GENERIC as well)
>Description:
Heavy disk I/O (two bonnie++ processes working on the same disk simultaneously) causes an extreme degradation in disk throughput (combined throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows when both bonnie++ processes are in the "Rewriting..." phase.

Sometimes, the system "hangs" (throughput reaches 0 MB/sec); in that case, the system does not recover from this state. "Hang" means that the system responds to ping etc., and processes that are already running will continue, but every command that requires disk access (even on disks other than the disk being tested) just hangs. If shutdown can be started without disk access (i.e. the binaries are still in cache), the system will begin the shutdown, but fails to sync all buffers.

Most testing was done on 9/STABLE and an Areca ARC-1882 controller (arcmsr), but I have test-booted a 10.0-CURRENT kernel from early May ("kernel-10.0-CURRENT-5.5.2013") and verified that the problem still exists there. I have also tested with an on-board SATA port ("Intel Cougar Point AHCI SATA controller"); with this setup, I can only observe extreme throughput degradation, but no hangs (see below for details).

Test-setup:
da0 is a 600 GB volume on the Areca controller:
da0 at arcmsr0 bus 0 scbus0 target 0 lun 0
da0: <Areca ARC-1882-VOL#000 R001> Fixed Direct Access SCSI-5 device 
da0: 600.000MB/s transfers
da0: Command Queueing enabled
da0: 572202MB (1171870720 512 byte sectors: 255H 63S/T 72945C)

gpart create -s gpt da0
gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
newfs -b 64k -f 8k /dev/da0p1
mount /dev/da0p1 /mnt && mkdir /mnt/t{1,2} && chmod g+w /mnt/t{1,2}

To observe the problem, I start two bonnie++ instances (installed from ports) like this:
1) bonnie++ -s 64g -n 0 -d /mnt/t1
2) bonnie++ -s 64g -n 0 -d /mnt/t2

(In order to better see the various stages, I start the second bonnie++ about a minute after the first. There is nothing magic about the 64 GB; the size just needs to be large enough to keep the disk busy for a few minutes.)

I then observe disk throughput for da0 with 

iostat -dC -w 1 da0

             da0             cpu
  KB/t tps  MB/s  us ni sy in id
[... first bonnie++ starts, "writing 1 byte at a time" phase ...]
 64.00   6  0.37   0  0 12  0 88
 64.00   4  0.25   0  0 13  0 87
 64.00   6  0.37   0  0 12  0 88
[first bonnie++ enters "writing intelligently" phase]
 63.97 3847 240.33   0  0 12  0 87
 64.00 20824 1301.51   0  0 14  0 85
 64.00 20449 1278.04   0  0 14  1 85
 64.00 17840 1115.01   0  0 14  1 85
 64.00 16595 1037.21   0  0 13  1 85
[this is my baseline for upper limit on throughput: ~1 GB/sec]
[...second bonnie++ starts, no big changes]
 64.00 12869 804.30   0  0 15  1 84
[second bonnie++ gets to "Writing intelligently" phase]
 64.00 7773 485.78   0  0  9  0 90
 64.00 7310 456.86   0  0  7  1 92
 64.00 7430 464.40   0  0 10  0 89
 64.00 7151 446.93   0  0  6  1 93
 64.00 7410 463.10   0  0  8  0 91
[...]
 64.00 7349 459.29   0  0 10  0 90
 64.00 7478 467.35   0  0  7  0 93
 64.00 7153 447.06   0  0  9  0 90
 64.00 7335 458.41   0  0  7  0 93
 64.00 7918 494.88   0  0 10  0 90
[first bonnie++ gets to "Rewriting" phase]
 64.00 7426 464.10   0  0 12  1 87
 64.00 7251 453.17   0  0  9  0 90
 64.00 7036 439.75   0  0  6  1 93
 64.00 7296 455.98   0  0  8  1 90
 64.00 6808 425.51   0  0  6  1 93
 64.00 7531 470.66   0  0  8  1 91
 64.00 7323 457.67   0  0  8  0 91
 64.00 7666 479.14   0  0  6  0 94
 64.00 11921 745.07   0  0 12  1 87
 64.00 12043 752.69   0  0 12  1 87
 64.00 10814 675.89   0  0 11  1 89
 64.00 10925 682.82   0  0 12  0 88
[...]
[second bonnie++ starts "Rewriting" phase]
 64.00 10873 679.56   0  0 10  1 89
 64.00 12242 765.12   0  0 13  1 86
 64.00 10739 671.19   0  0 11  0 88
 64.00 3486 217.91   0  0  8  0 91
 64.00 6327 395.42   0  0 14  0 86
 64.00 6637 414.84   0  0  9  0 90
 64.00 4931 308.19   0  0 10  0 89
 64.00 4874 304.63   0  0 13  0 87
 64.00 4322 270.11   0  0 18  0 82
 64.00 4616 288.52   0  0 19  0 80
 64.00 3567 222.96   0  0 20  0 80
 64.00 3395 212.16   0  0 30  0 70
 64.00 1694 105.89   0  0 33  0 67
 64.00 1898 118.63   0  0 36  0 64
  0.00   0  0.00   0  0 12  0 88
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  4  0 96
  0.00   0  0.00   0  0  5  0 95
  0.00   0  0.00   0  0  5  0 95
[etc. The system does not recover from this state. All attempts to do disk I/O hang.]

When both bonnie++ processes are in their "Rewriting" phase, the system hangs within a few seconds. Both bonnie++ processes are in state "nbufkv". bufdaemon takes about 40% CPU time and is in state "qsleep" when not active.

I have also briefly tested a single disk instead of a RAID volume. In that case, I could not get the system to hang, but performance drops from ~150 MB/sec to ~1-3 MB/sec. I do not know if that indicates a problem with the arcmsr driver, or whether I cannot create "enough" I/O load with a single disk:

[same setup as above, but on single disk connected to onboard SATA port.]
Baseline: 150 MB/sec (first bonnie++ in "Writing intelligently" phase):
            ada2             cpu
  KB/t tps  MB/s  us ni sy in id
 128.00 1221 152.60   0  0  2  0 97
 128.00 1173 146.60   0  0  2  0 98
 128.00 1171 146.35   0  0  2  0 98
[... first bonnie++ in "Rewriting", second in "Writing intelligently" phase:]
 127.83 1140 142.30   0  0  2  0 97
 128.00 1121 140.11   0  0  2  0 97
 128.00 1114 139.24   0  0  1  0 98
 127.29 904 112.39   0  0  5  0 95
[second bonnie++ starts rewriting]
 120.24  66  7.74   0  0  0  0 100
 128.00  59  7.37   0  0  0  0 100
 122.43  69  8.24   0  0  1  0 99
 128.00 222 27.72   0  0  0  0 100
 128.00 1122 140.23   0  0  3  0 97
 128.00 1080 134.99   0  0  3  0 97
 128.00 1157 144.61   0  0  3  0 97
 127.75 504 62.94   0  0  2  0 98
 128.00  59  7.37   0  0  1  0 99
 128.00 708 88.54   0  0  2  0 98
 128.00 969 121.13   0  0 11  0 89
 127.89 1161 144.99   0  0  3  0 97
 127.60 479 59.63   0  0  1  0 99
 128.00 596 74.55   0  0  2  0 98
 128.00 1169 146.10   0  0  3  0 97
 127.88 1076 134.37   0  0  2  0 97
 127.89 1202 150.10   0  0  3  0 97
 128.00 1074 134.24   0  0  4  0 96
 128.00 1039 129.87   0  0  8  0 92
 128.00 913 114.14   0  0 10  0 90
 127.73 896 111.78   0  0 11  0 89
 128.00 930 116.26   0  0 10  0 90
 127.93 909 113.57   0  0 11  0 89
 128.00 887 110.89   0  0 13  0 87
 128.00 743 92.91   0  0 18  0 81
 128.00 548 68.56   0  0 23  0 76
 128.00 523 65.43   0  0 24  0 76
 127.89 587 73.36   0  0 21  0 79
 127.89 586 73.24   0  0 22  0 78
 128.00 465 58.07   0  0 28  0 72
 127.89 558 69.74   0  0 25  0 75
 127.79 595 74.30   0  0 21  0 79
 128.00 593 74.18   0  0 23  0 77
 128.00 546 68.31   0  0 22  0 78
 127.70 428 53.32   0  0 32  0 68
 128.00 561 70.18   0  0 25  0 75
 128.00 645 80.67   0  0 22  0 78
 127.78 582 72.68   0  0 23  0 77
 114.68 518 58.07   0  0 27  0 73
 103.33 497 50.20   0  0 29  0 71
 110.00 128 13.74   0  0 35  0 65
 65.94  33  2.12   0  0 37  0 63
 74.97 286 20.92   0  0 36  0 64
 105.83 306 31.59   0  0 34  0 66
 95.12 399 37.03   0  0 30  0 70
 108.57 336 35.59   0  0 29  0 71
 127.22 246 30.53   0  0 41  0 59
 104.16 255 25.91   0  0 32  0 68
 103.68 528 53.51   0  0 23  0 77
 127.14 373 46.27   0  0 29  0 71
 90.65 341 30.16   0  0 35  0 65
 127.30 368 45.70   0  0 30  0 70
 66.56  25  1.62   0  0 37  0 63
 66.00  32  2.06   0  0 38  0 62
 94.99  95  8.80   0  0 38  0 62
 94.00 433 39.71   0  0 35  0 65
 107.66 666 70.06   0  0 26  0 74
 119.43 478 55.69   0  0 27  0 73
 95.64 441 41.15   0  0 28  0 72
 116.45 449 51.01   0  0 29  0 71
 102.22 503 50.26   0  0 28  0 72
 127.62 509 63.50   0  0 29  0 71
 111.84 388 42.33   0  0 29  0 71

(i.e. disk throughput jumps wildly between full speed and ~1% of maximum, despite continuous I/O pressure)


Unfortunately, I cannot use kdb on this machine - "ALT-CTRL-ESC" does nothing and "sysctl debug.kdb.enter=1" just says "debug.kdb.enter: 0 -> 0". "sysctl debug.kdb.panic=1" does panic the machine, but the keyboard is not working.

I have, however, logged the sysctl vfs subtree every second while the problem occurs:
As long as both bonnie++ processes are in the "Writing intelligently..." phase, vfs.numdirtybuffers stays mostly between 80-100. When the first bonnie++ goes to "Rewriting..." (at 3:40 below), vfs.numdirtybuffers starts to increase (by about 150-350 new dirty buffers per second). When the second bonnie++ goes to "Rewriting..." as well (at 4:24), this speeds up a lot (about 3000-5000 new dirty buffers/sec). A few seconds later, vfs.numdirtybuffers reaches its limit, and remains there until the machine is rebooted:

00-03-28.log:vfs.numdirtybuffers: 107
00-03-29.log:vfs.numdirtybuffers: 106
00-03-30.log:vfs.numdirtybuffers: 105
00-03-31.log:vfs.numdirtybuffers: 105
00-03-32.log:vfs.numdirtybuffers: 110
00-03-33.log:vfs.numdirtybuffers: 112
00-03-34.log:vfs.numdirtybuffers: 110
00-03-35.log:vfs.numdirtybuffers: 115
00-03-36.log:vfs.numdirtybuffers: 117
00-03-37.log:vfs.numdirtybuffers: 118
00-03-38.log:vfs.numdirtybuffers: 118
00-03-39.log:vfs.numdirtybuffers: 118
00-03-40.log:vfs.numdirtybuffers: 118
[first bonnie++ starts Rewriting]
00-03-41.log:vfs.numdirtybuffers: 294
00-03-42.log:vfs.numdirtybuffers: 404
00-03-43.log:vfs.numdirtybuffers: 959
00-03-44.log:vfs.numdirtybuffers: 1071
00-03-45.log:vfs.numdirtybuffers: 1155
00-03-46.log:vfs.numdirtybuffers: 1253
00-03-47.log:vfs.numdirtybuffers: 1366
00-03-48.log:vfs.numdirtybuffers: 1457
00-03-49.log:vfs.numdirtybuffers: 1549
00-03-50.log:vfs.numdirtybuffers: 1680
00-03-51.log:vfs.numdirtybuffers: 1904
00-03-52.log:vfs.numdirtybuffers: 2136
00-03-53.log:vfs.numdirtybuffers: 2363
00-03-54.log:vfs.numdirtybuffers: 2503
00-03-55.log:vfs.numdirtybuffers: 3023
00-03-56.log:vfs.numdirtybuffers: 3268
00-03-57.log:vfs.numdirtybuffers: 3601
00-03-58.log:vfs.numdirtybuffers: 3743
00-03-59.log:vfs.numdirtybuffers: 3904
00-04-00.log:vfs.numdirtybuffers: 4106
00-04-01.log:vfs.numdirtybuffers: 4325
00-04-02.log:vfs.numdirtybuffers: 4461
00-04-03.log:vfs.numdirtybuffers: 4643
00-04-04.log:vfs.numdirtybuffers: 5063
00-04-05.log:vfs.numdirtybuffers: 5202
00-04-06.log:vfs.numdirtybuffers: 5456
00-04-07.log:vfs.numdirtybuffers: 5593
00-04-08.log:vfs.numdirtybuffers: 5819
00-04-09.log:vfs.numdirtybuffers: 6092
00-04-10.log:vfs.numdirtybuffers: 6249
00-04-11.log:vfs.numdirtybuffers: 6412
00-04-12.log:vfs.numdirtybuffers: 6591
00-04-13.log:vfs.numdirtybuffers: 6698
00-04-14.log:vfs.numdirtybuffers: 6853
00-04-15.log:vfs.numdirtybuffers: 7031
00-04-16.log:vfs.numdirtybuffers: 7192
00-04-17.log:vfs.numdirtybuffers: 7387
00-04-18.log:vfs.numdirtybuffers: 7609
00-04-19.log:vfs.numdirtybuffers: 7757
00-04-20.log:vfs.numdirtybuffers: 7888
00-04-21.log:vfs.numdirtybuffers: 8156
00-04-22.log:vfs.numdirtybuffers: 8362
00-04-23.log:vfs.numdirtybuffers: 8558
[second bonnie goes Rewriting as well]
00-04-24.log:vfs.numdirtybuffers: 11586
00-04-25.log:vfs.numdirtybuffers: 16325
00-04-26.log:vfs.numdirtybuffers: 24333
00-04-27.log:vfs.numdirtybuffers: 29058
00-04-28.log:vfs.numdirtybuffers: 33752
00-04-29.log:vfs.numdirtybuffers: 38792
00-04-30.log:vfs.numdirtybuffers: 42663
00-04-31.log:vfs.numdirtybuffers: 47305
00-04-33.log:vfs.numdirtybuffers: 52070
00-04-34.log:vfs.numdirtybuffers: 52066
00-04-35.log:vfs.numdirtybuffers: 52063
00-04-37.log:vfs.numdirtybuffers: 52098
00-04-39.log:vfs.numdirtybuffers: 52098
00-04-41.log:vfs.numdirtybuffers: 52097
00-04-44.log:vfs.numdirtybuffers: 52097
00-04-46.log:vfs.numdirtybuffers: 52098
00-04-49.log:vfs.numdirtybuffers: 52096
00-04-52.log:vfs.numdirtybuffers: 52098
00-04-54.log:vfs.numdirtybuffers: 52096
00-04-57.log:vfs.numdirtybuffers: 52098
00-05-00.log:vfs.numdirtybuffers: 52096
[ etc. ]

vfs.hidirtybuffers: 52139
vfs.lodirtybuffers: 26069
(the machine has 32 GB RAM)
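
For scale: assuming each dirty buffer holds a full 64 KiB file-system block (the -b 64k block size used here), the high watermark corresponds to roughly 3.2 GiB of dirty data, i.e. about 10% of RAM:

```shell
# Back-of-envelope: how much dirty data vfs.hidirtybuffers permits,
# assuming one 64 KiB file-system block per buffer (from newfs -b 64k).
hidirty=52139                          # vfs.hidirtybuffers
bufsz=$((64 * 1024))                   # fs block size in bytes
bytes=$((hidirty * bufsz))
ram=$((32 * 1024 * 1024 * 1024))       # 32 GB RAM
echo "high watermark: $((bytes / 1024 / 1024)) MiB of dirty data"
echo "share of RAM:   ~$((bytes * 100 / ram))%"
```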

(I could upload the archive of the vfs logs somewhere, so you can check other values over time, if that would be helpful)
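
A capture loop along these lines produces one snapshot per second (a sketch only; the exact commands and file names used for these logs are not shown in this report, so treat the paths as illustrative):

```shell
# Sketch: capture the vfs sysctl subtree once per second into
# timestamped log files. Bounded to 3 iterations here for illustration;
# in practice it would run until the hang occurs.
logdir=$(mktemp -d)
i=0
while [ "$i" -lt 3 ]; do
    sysctl vfs > "$logdir/00-$(date +%M-%S).log" 2>/dev/null || true
    sleep 1
    i=$((i + 1))
done
echo "captured $i snapshots in $logdir"
```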


I have found http://freebsd.1045724.n5.nabble.com/7-1-RELEASE-I-O-hang-td3966945.html , which seems to describe the same issue (also in combination with a RAID controller, albeit a different one).

The symptoms described in r189878 (issues in bufdaemon, causing processes to hang in nbufkv) sound similar.

Please let me know if there is any other info that I can provide.


>How-To-Repeat:
gpart create -s gpt da0
gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
newfs -b 64k -f 8k /dev/da0p1
mount /dev/da0p1 /mnt && mkdir /mnt/t{1,2} && chmod g+w /mnt/t{1,2}

Start two bonnie++ instances (installed from ports) like this:
1) bonnie++ -s 64g -n 0 -d /mnt/t1
2) bonnie++ -s 64g -n 0 -d /mnt/t2

Observe disk throughput for da0 with 

iostat -dC -w 1 da0

in particular when both bonnie++ processes are in their "Rewriting..." phase.
>Fix:
No fix known.

>Release-Note:
>Audit-Trail:

From: Bruce Evans <brde@optusnet.com.au>
To: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Cc: freebsd-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 27 May 2013 15:57:56 +1000 (EST)

 On Sun, 26 May 2013, Klaus Weber wrote:
 
 >> Environment:
 > FreeBSD filepile 9.1-STABLE FreeBSD 9.1-STABLE #5 r250475: Sun May 12 19:14:21 CEST 2013     root@filepile:/usr/obj/usr/src/sys/FILEPILE  amd64
 > (Kernel config has some drivers removed and debugging options added. Tested with GENERIC as well)
 >> Description:
 > During heavy disk I/O (two bonnie++ processes working on the same disk simultaneously) causes an extreme degradation in disk throughput (combined throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows when both bonnie++ processes are in the "Rewriting..." phase.
 
 Please use the unix newline character in mail.
 
 A huge drop in disk throughput is normal with multiple processes,
 unfortunately.  Perhaps not quite this much.
 
 > Sometimes, the systems "hangs" (throughput reaches 0 MB/sec); in that case, the system does not recover from this state. "Hang" means that the system responds to ping etc., processes that are already running will continue, but every command that requires disk accesses (even on disks other than the disk being tested) just hangs. If shutdown can be started without disk access (i.e. still in cache), the system will begin the shutdown, but fails to sync all buffers.
 
 Hangs are bugs though.
 
 I have been arguing with kib@ about some methods of handling heavy disk
 i/o being nonsense since they either make the problem worse (by switching
 to direct unclustered writes, so that slow i/o goes even slower) or have
 no effect except to complicate the analysis of the problem (because they
 are old hackish methods, and newer better methods make the cases handled
 by the old methods unreachable).  But the analysis is complicated, and
 we couldn't agree on anything.
 
 One thing I noticed more recently than my last mail to kib about this
 is that even a single heavy disk write (say dd bs=256m count=2
 </dev/zero >foo) can block another non-write-bound process (say iostat
 1 or systat -v 1 to watch the progress) from starting up (sometimes
 even when it has just been run to load it into cache(s)) until the dd
 completes 10 seconds later.  This was on a 2-core system running a
 fairly old version of FreeBSD (8.0-CURRENT from ~2008).  I found that
 the problem could be fixed by killing cluster_write() by turning it into
 bdwrite() (by editing the running kernel using ddb, since this is easier
 than rebuilding the kernel).  I was trying many similar things since I
 had a theory that cluster_write() is useless.  Actually, cluster_write()
 gives some minor optimizations, but this misbehaviour indicates that it
 has a resource starvation bug.  My theory for what the bug is is that
 cluster_write() and cluster_read() share the limit resource of pbufs.
 pbufs are not managed as carefully as normal buffers.  In particular,
 there is nothing to limit write pressure from pbufs like there is for
 normal buffers.  However, this behaviour doesn't happen on the much older
 version of FreeBSD that I normally use (5.2-CURRENT from 2004 with many
 local modifications).  It starts iostat to watch dd instantly, even running
 on a single core system.
 
 > gpart create -s gpt da0
 > gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
 > newfs -b 64k -f 8k /dev/da0p1
 > mount /dev/da0p1 /mnt && mkdir /mnt/t{1,2} && chmod g+w /mnt/t{1,2}
 
 The default for newfs is -b 32k.  This asks for buffer cache fragmentation.
 Someone increased the default from 16k to 32k without changing the buffer
 cache's preferred size (BKVASIZE = 16K).  BKVASIZE has always been too
 small, but on 32-bit arches kernel virtual memory is too limited to have
 a larger BKVASIZE by default.  BKVASIZE is still 16K on all arches
 although this problem doesn't affect 64-bit arches.
 
 -b 64k is worse.
 
 > When both bonnie++ processes are in their "Rewriting" phase, the system hangs within a few seconds. Both bonnie++ processes are in state "nbufkv". bufdaemon takes about 40% CPU time and is in state "qsleep" when not active.
 
 You got the buffer cache fragmentation that you asked for.
 
 Buffer cache fragmentation caused serious problems 10-20 years ago, but the
 ones causing hangs were supposed to be fixed about 12 years ago and the ones
 causing extreme slowness for just the software parts of handling
 fragmentation were fixed a little later.  But apparently there are still some.
 
 However, I have never been able to reproduce serious fragmentation problems
 from using too-large-block sizes, or demonstrate significant improvements
 from avoiding the known fragmentation problem by increasing BKVASIZE.
 Perhaps my systems are too small, or have tuning or local changes that
 accidentally avoid the problem.
 
 Apparently you found a way to reproduce the serious fragmentation
 problems.  Try using a block size that doesn't ask for the problem.
 Increasing BKVASIZE would take more work than this, since although it
 was intended to be a system parameter which could be changed to reduce
 the fragmentation problem, one of the many bugs in it is that it was
 never converted into a "new" kernel option.  Another of the bugs in
 it is that doubling it halves the number of buffers, so doubling it
 does more than use twice as much kva.  This severely limited the number
 of buffers back when memory sizes were 64MB.  It is not a very
 significant limitation if the memory size is 1GB or larger.
 
 > I have also briefly tested a single disk instead of a RAID volume. In that case, I could not get the system to hang, put performance drops from ~150 MB/sec to ~1-3 MB/sec. I do not know if that indicates a problem with the arcmsr driver, or whether I cannot create "enough" I/O load with a single disk:
 
 I get ~39MB/sec with 1 "dd bs=128m count=2 </dev/zero >foo" writing to
 a nearly full old ffs file system on an old PATA disk, and 2 times
 20MB/sec with 2 dd's.  This is almost as good as possible.  The
 configuration was somewhat flawed:
 - the writes all fitted in memory because my disks are too full to hold
    more.  This shouldn't be a problem for write benchmarks, because write
    pressure is (too) limited by writing out dirty buffers (too) early to
    limit write pressure, so writes are mostly physical.
 - the fs block size was 16K.  This not a flaw, just the right size, but
    other sizes should be tested.
 - the file system was mounted async.  This should have little effect,
    since for large writes all data is written async anyway (the write
    pressure handling causes async writes instead of delayed writes even
    for not so large writes).
 - ffs was modified to remove its bogus write pressure handling.  This
    should have no effect, but I didn't verify it.
 
 > I have, however, logged the sysctl vas subtree every second while the problem occurs:
 > As long as both bonnie++ are in "Writing intelligently..." phase, vfs.numdirtybuffers stays mostly between 80-100. When the first bonnie++ goes to "Rewriting..." (at 3:40 below) starts to increase (about 150-350 new dirty buffers per second). When the second bonnie++ goes to "Rewriting..." as well (at 4:24), this speeds up a lot (about 3000-5000 new dirty buffers/sec). A few seconds later, vfs.numdirtybuffers reaches its limit, and remains there until the machine is rebooted:
 >
 > 00-03-28.log:vfs.numdirtybuffers: 107
 > 00-03-29.log:vfs.numdirtybuffers: 106
 > 00-03-30.log:vfs.numdirtybuffers: 105
 > ...
 > [first bonnie++ starts Rewriting]
 > 00-03-41.log:vfs.numdirtybuffers: 294
 > 00-03-42.log:vfs.numdirtybuffers: 404
 > 00-03-43.log:vfs.numdirtybuffers: 959
 > 00-03-44.log:vfs.numdirtybuffers: 1071
 
 There are various write pressure handling mechanisms that have different
 thresholds and get in each way a bit.
 
 The dirty buffer watermarks are presumably much higher than 105 or even
 1071.  Other mechanisms apparently work well and limit the number of
 dirty buffers to only 107 with 1 writer.  Writes to memory can go much
 faster than writes to the disk even when the latter go at full speed, but
 some mechanism blocks writers (too) early so the 1 writer soon reaches a
 steady state with not many dirty buffers.
 
 > 00-03-45.log:vfs.numdirtybuffers: 1155
 > 00-03-46.log:vfs.numdirtybuffers: 1253
 > 00-03-47.log:vfs.numdirtybuffers: 1366
 > 00-03-48.log:vfs.numdirtybuffers: 1457
 > ...
 > 00-04-21.log:vfs.numdirtybuffers: 8156
 > 00-04-22.log:vfs.numdirtybuffers: 8362
 > 00-04-23.log:vfs.numdirtybuffers: 8558
 > [second bonnie goes Rewriting as well]
 > 00-04-24.log:vfs.numdirtybuffers: 11586
 > 00-04-25.log:vfs.numdirtybuffers: 16325
 > 00-04-26.log:vfs.numdirtybuffers: 24333
 > ...
 > 00-04-54.log:vfs.numdirtybuffers: 52096
 > 00-04-57.log:vfs.numdirtybuffers: 52098
 > 00-05-00.log:vfs.numdirtybuffers: 52096
 > [ etc. ]
 
 This is a rather large buildup and may indicate a problem.  Try reducing
 the dirty buffer watermarks.  Their default values are mostly historical
 nonsense.  They are just some fixed fraction (about 1/4) of the number
 of buffers.  With multiple GB of memory, the number of buffers may be
 huge.  On a system with 24GB, vfs.nbuf is 157800, and vfs.dirtybufthresh
 is 35523.  But the size of main memory has nothing to do with the
 correct limit on the number of dirty buffers (unless the latter is not
 being used to limit write pressure, but only to limit starvation of
 buffers for readers.  This is perhaps all it is supposed to do, but see
 below).  We want buildups of dirty buffers to be limited so that the
 system never gets too far behind on physical writes.  The correct limit
 depends on the physical write capabilities of the machine and on which
 write channels have pressure on them.  Default global limits don't work
 at all for this.  You can fix the defaults but you can't distribute the
 limits properly across channels, or easily keep the limits up to date
 as channels are added.  It is easy to add slow USB channels.  These
 won't increase the write capability much, but it is very easy for a
 slow disk or channel to it to grab many dirty buffers and keep them for
 a long time so that faster disks are starved.
 
 Other mechanisms handle write pressure better than the simple dirty
 buffer count one can (others count the amount of i/o, which is more
 relevant to how long it will take).  So assume that any effect of
 the dirty buffer count mechanism on write pressure is unintentional.
 There are still the following problems with it:
 - as above, slow channels can starve fast channels, thus giving a
    severe side effect on the write pressure (at best the users of the
    fast channels block so that they are slowed down to the speed of
    the slow channels)
 - buffer starvation for readers can happen anyway.  vfs.nbuf is the
    number of buffers that are allocated, but not all of these can be
    used when some file systems use buffer sizes larger than about
    BKVASIZE.  The worst case is when all file systems use a block size of
    64KB.  Then only about 1/4 of the allocated buffers can be used.
    The number that can be used is limited by vfs.maxbufspace and
    related variables.  The default for vfs.maxbufspace is controlled
    indirectly by BKVASIZE (e.g., maxbufspace = 2GB goes with nbuf =
    128K; the ratio is 16K = BKVASIZE).
 
 So when you ask for buffer cache fragmentation with -b 64k, you get
 even more.  The effective nbuf is reduced by a factor of 4.  This
 means that all the dirty buffer count watermarks except one are
 slightly above the effective nbuf, so the higher watermarks
 cannot quite be reached, and when they are nearly reached, readers
 are starved of buffers.
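 
 In numbers, using the example figures above (maxbufspace of about 2GB
 going with nbuf = 128K, i.e. a ratio of 16K = BKVASIZE), the usable
 buffer count with 64 KiB blocks works out to nbuf/4:

```shell
# Effective number of buffers when the fs block size exceeds BKVASIZE.
maxbufspace=$((2 * 1024 * 1024 * 1024))   # ~2 GiB, per the example above
bkvasize=$((16 * 1024))                   # BKVASIZE = 16K
nbuf=$((maxbufspace / bkvasize))          # 131072, i.e. "128K" buffers
usable=$((maxbufspace / (64 * 1024)))     # with 64 KiB blocks: 32768
echo "nbuf=$nbuf usable=$usable ratio=$((nbuf / usable))"
```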
 
 > vfs.hidirtybuffers: 52139
 > vfs.lodirtybuffers: 26069
 > (the machine has 32 GB RAM)
 
 Fairly consistent with my examples of a machine with 24GB RAM.
 The upper watermark is 33% higher, etc.  Your nbuf will be more like
 200K than 150K.  You reached numdirtybuffers = 52096.  At 64K each,
 that's about 3GB.  maxbufspace will be about the same, and bufspace
 (the amount used) will be not much smaller.  There will be little
 space for buffers for readers, and the 2 writers apparently manage
 to starve even each other, perhaps for similar reasons.  The backlog
 of 3GB would take 20 seconds to clear even at 150MB/sec, and at 1MB/sec
 it would take almost an hour, so it is not much better than a hang.
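 
 The backlog arithmetic checks out (taking 64 KiB per dirty buffer and
 decimal MB/sec, as iostat reports):

```shell
# Sanity-check the backlog estimate: 52096 dirty buffers of 64 KiB each.
ndirty=52096                            # vfs.numdirtybuffers at saturation
backlog=$((ndirty * 64 * 1024))         # bytes; ~3.4e9, "about 3GB"
echo "backlog: $((backlog / 1000 / 1000)) MB"
echo "drain at 150 MB/sec: ~$((backlog / 150000000)) s"
echo "drain at 1 MB/sec:   ~$((backlog / 1000000 / 60)) min"
```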
 
 > I have found http://freebsd.1045724.n5.nabble.com/7-1-RELEASE-I-O-hang-td3966945.html , which seems to describe the same issue (also in combination with a RAID controller, albeit a different one).
 
 I think the hardware is usually not the problem (at least after kib's fix
 for the above), except when it is large it can stress the software more.
 
 Bruce

From: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>,
	freebsd-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 27 May 2013 20:57:32 +0200

 First of all, thank you very much for looking into this, and for your
 detailed explanations. Much appreciated.
 
 On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
 > On Sun, 26 May 2013, Klaus Weber wrote:
 > >>Description:
 > >During heavy disk I/O (two bonnie++ processes working on the same disk 
 > >simultaneously) causes an extreme degradation in disk throughput (combined 
 > >throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows 
 > >when both bonnie++ processes are in the "Rewriting..." phase.
 
  
 > Please use the unix newline character in mail.
 
 My apologies. I submitted the report via the web-interface and did not
 realize that it would come out this way.
 
 > A huge drop in disk throughput is normal with multiple processes,
 > unfortunately.  Perhaps not quite this much.
 
 Yes, the slight performance drop from ~1 GB/sec to ~600 or even 
 ~400 MB/sec does not concern me (this might even be an issue with the
 arcmsr driver or the controller's firmware; I found some hints that
 this might be the case.)
 
 However, a performance drop of three orders of magnitude for just two
 concurrent processes seemed to indicate a problem.
  
 > Hangs are bugs though.
 > 
 > I have been arguing with kib@ about some methods of handling heavy disk
 > i/o being nonsense since they either make the problem worse (by switching
 > to direct unclustered writes, so that slow i/o goes even slower) or have
 > no effect except to complicate the analysis of the problem (because they
 > are old hackish methods, and newer better methods make the cases handled
 > by the old methods unreachable).  But the analysis is complicated, and
 > we couldn't agree on anything.
 
 The machine where I experience this problem is not in production
 yet. I can reboot it at any time, test patches, etc. Just let me know
 if I can do anything helpful. The only limitation is that I usually
 have access to the server only on weekends.
 
 > [single process with heavy write access can block other
 non-write-bound process]
 
 I have not experienced this so far, but I can test this when I have
 access to the server next time.
 
 > I found that
 > the problem could be fixed by killing cluster_write() by turning it into
 > bdwrite() (by editing the running kernel using ddb, since this is easier
 > than rebuilding the kernel).  I was trying many similar things since I
 > had a theory that cluster_write() is useless.  [...]
 
 If that would provide a useful datapoint, I could test whether that
 makes a difference on my system. What changes would be required to test this?
 
 Surely it's not as easy as replacing the function body of
 cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?
 
 > My theory for what the bug is is that
 > cluster_write() and cluster_read() share the limit resource of pbufs.
 > pbufs are not managed as carefully as normal buffers.  In particular,
 > there is nothing to limit write pressure from pbufs like there is for
 > normal buffers.  
 
 Is there anything I can do to confirm or rebut this? Is the number of
 pbufs in use visible via a sysctl, or could I add debug printfs that
 are triggered when certain limits are reached?
 
 > >newfs -b 64k -f 8k /dev/da0p1
 > 
 > The default for newfs is -b 32k.  This asks for buffer cache fragmentation.
 > Someone increased the default from 16k to 32k without changing the buffer
 > cache's preferred size (BKVASIZE = 16K).  BKVASIZE has always been too
 > small, but on 32-bit arches kernel virtual memory is too limited to have
 > a larger BKVASIZE by default.  BKVASIZE is still 16K on all arches
 > although this problem doesn't affetc 64-bit arches.
 > 
 > -b 64k is worse.
 
 Thank you for this explanation. I was not aware that -b 64k (or even
 the default values to newfs) would have this effect. I will repeat the
 tests with 32/4k and 16/2k, although I seem to remember that 64/8k
 provided a significant performance boost over the defaults. This, and
 the reduced fsck times was my original motivation to go with the
 larger values.
 
 Given the potentially drastic effects of block sizes other than 16/2k,
 maybe a warning should be added to the newfs manpage? I only found the
 strong advice to maintain an 8:1 block:fragment ratio.
 
 
 > >When both bonnie++ processes are in their "Rewriting" phase, the system 
 > >hangs within a few seconds. Both bonnie++ processes are in state "nbufkv". 
 > >bufdaemon takes about 40% CPU time and is in state "qsleep" when not 
 > >active.
 > 
 > You got the buffer cache fragmentation that you asked for.
 
 Looking at vfs_bio.c, I see that it has defrag code in it. Should I
 try adding some debug output to this code to get some insight into why
 it does not work, or is not as effective as it should be?
 
 > Apparently you found a way to reproduce the serious fragmentaion
 > problems.  
 
 A key factor seems to be the "Rewriting" operation. I see no problem
 during the "normal" writing, nor could I reproduce it with concurrent
 dd runs.
 
 > Try using a block size that doesn't ask for the problem.
 
 Will do, although for production use I would really prefer a 64/8k
 system, due to the higher performance.
 
 > Increasing BKVASIZE would take more work than this, since although it
 > was intended to be a system parameter which could be changed to reduce
 > the fragmentation problem, one of the many bugs in it is that it was
 > never converted into a "new" kernel option.  Another of the bugs in
 > it is that doubling it halves the number of buffers, so doubling it
 > does more than use twice as much kva.  This severely limited the number
 > of buffers back when memory sizes were 64MB.  It is not a very
 > significant limitation if the memory size is 1GB or larger.
 
 Should I try to experiment with BKVASIZE of 65536? If so, can I
 somehow up the number of buffers again? Also, after modifying
 BKVASIZE, is it sufficient to compile and install a new kernel, or do
 I have to build and install the entire world?
 
 
 > I get ~39MB/sec with 1 "dd bs=128n count=2 /dev/zero >foo" writing to
 > a nearly full old ffs file system on an old PATA disk, and 2 times
 > 20MB/sec with 2 dd's.  This is is almost as good as possible. 
 
 I agree. Performance with dd, or during bonnie++'s "Writing
 intelligently" phase is very reasonable, with both a single process
 and with two processes simultaneously.
 
 Something specific to the "Rewriting..." workload is triggering the
 problem.
 
 
 > >[second bonnie goes Rewriting as well]
 > >00-04-24.log:vfs.numdirtybuffers: 11586
 > >00-04-25.log:vfs.numdirtybuffers: 16325
 > >00-04-26.log:vfs.numdirtybuffers: 24333
 > >...
 > >00-04-54.log:vfs.numdirtybuffers: 52096
 > >00-04-57.log:vfs.numdirtybuffers: 52098
 > >00-05-00.log:vfs.numdirtybuffers: 52096
 > >[ etc. ]
 > 
 > This is a rather large buildup and may indicate a problem.  Try reducing
 > the dirty buffer watermarks.  Their default values are mostly historical
 > nonsense. 
 
 You mean the vfs.(hi|lo)dirtybuffers? Will do. What would be
 reasonable starting values for experimenting? 800/200?
 
 > [Helpful explanation snipped]  
 
 > - buffer starvation for readers can happen anyway.  vfs.nbuf is the
 >   number of buffers that are allocated, but not all of these can be
 >   used when some file systems use buffer sizes larger than about
 >   BKVASIZE.  The worst case is when all file systems a block size of
 >   64KB.  Then only about 1/4 of the allocated buffers can be used.
 >   The number that can be used is limited by vfs.maxbufspace and
 >   related variables.  The default for vfs.maxbufspace is controlled
 >   indirectly by BKVASIZE (e.g., maxbufspace = 2GB goes with nbuf =
 >   128K; the ratio is 16K = BKVASIZE).
 > 
 > So when you ask for buffer cache fragmentation with -b 64k, you get
 > even more.  The effective nbuf is reduced by a factor of 4.  This
 > means that all the dirty buffer count watermarks except one are
 > slightly above the effective nbuf, so if the higher watermarks
 > cannot quite be reached and when they are nearly reached readers
 > are starved of buffers.
 
 That makes sense - and bonnie++ is both reader and writer in the
 rewriting phase, and thus may even "starve itself"?
 
 
 > >vfs.hidirtybuffers: 52139
 > >vfs.lodirtybuffers: 26069
 > >(the machine has 32 GB RAM)
 > 
 > Fairly consistent with my examples of a machine with 24GB RAM.
 > The upper watermark is 33% higher, etc.  Your nbuf will be more like
 > 200K than 150K.  You reached numdirtybuffers = 52096.  At 64K each,
 > that's about 3GB.  maxbufspace will be about the same, and bufspace
 > (the amount used) will be not much smaller.  There will be little
 > space for buffers for readers, and the 2 writers apparently manage
 > to starve even each other, perhaps for similar reasons.  The backlog
 > of 3GB would take 20 seconds to clear even at 150MB/sec, and at 1MB/sec
 > it would take almost an hour, so it is not much better than a hang.
 
 OK, I will experiment with these parameters when I have access to
 the system again (which won't be before Thursday, unfortunately).
 
 Thanks again so far,
 
   Klaus

From: Bruce Evans <brde@optusnet.com.au>
To: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Cc: Bruce Evans <brde@optusnet.com.au>, freebsd-gnats-submit@FreeBSD.org, 
    freebsd-bugs@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Tue, 28 May 2013 22:03:10 +1000 (EST)

 On Mon, 27 May 2013, Klaus Weber wrote:
 
 > On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
 >> On Sun, 26 May 2013, Klaus Weber wrote:
 >>>> Description:
 >>> During heavy disk I/O (two bonnie++ processes working on the same disk
 >>> simultaneously) causes an extreme degradation in disk throughput (combined
 >>> throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows
 >>> when both bonnie++ processes are in the "Rewriting..." phase.
 >
 >> Please use the unix newline character in mail.
 >
 > My apologies. I submitted the report via the web-interface and did not
 > realize that it would come out this way.
 
 Thanks.  The log output somehow came out right.
 
 >> I found that
 >> the problem could be fixed by killing cluster_write() by turning it into
 >> bdwrite() (by editing the running kernel using ddb, since this is easier
 >> than rebuilding the kernel).  I was trying many similar things since I
 >> had a theory that cluster_write() is useless.  [...]
 >
 > If that would provide a useful datapoint, I could try if that make a
 > difference on my system. What changes would be required to test this?
 >
 > Surely its not as easy as replacing the function body of
 > cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?
 
 That should work for testing, but it is safer to edit ffs_write()
 and remove the block where it calls cluster_write() (or bawrite()),
 so that it falls through to call bdwrite() in most cases.
 
 >> My theory for what the bug is is that
 >> cluster_write() and cluster_read() share the limit resource of pbufs.
 >> pbufs are not managed as carefully as normal buffers.  In particular,
 >> there is nothing to limit write pressure from pbufs like there is for
 >> normal buffers.
 >
 > Is there anything I can do to confirm rebut this? Is the number of
 > pbufs in use visible via a sysctl, or could I add debug printfs that
 > are triggered when certain limits are reached?
 
 Here I don't really know what to look for.  First add a sysctl to read
 the number of free pbufs.  The variable for this is cluster_pbuf_freecnt
 in vm.
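 A minimal sketch of such a sysctl (not compilable standalone; FreeBSD
 kernel context assumed, and the sysctl name vm.cluster_pbuf_freecnt is
 made up here for illustration):
 
```c
/* Sketch only: export the free-pbuf count read-only via sysctl.
 * cluster_pbuf_freecnt is defined in vm/vm_pager.c. */
extern int cluster_pbuf_freecnt;

SYSCTL_INT(_vm, OID_AUTO, cluster_pbuf_freecnt, CTLFLAG_RD,
    &cluster_pbuf_freecnt, 0,
    "Number of free pbufs available for clustered I/O");
```
 
 The counter could then be watched with sysctl(8) while the two
 bonnie++ processes run.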
 
 >>> newfs -b 64k -f 8k /dev/da0p1
 >>
 >> The default for newfs is -b 32k.  This asks for buffer cache fragmentation.
 >> Someone increased the default from 16k to 32k without changing the buffer
 >> cache's preferred size (BKVASIZE = 16K).  BKVASIZE has always been too
 >> small, but on 32-bit arches kernel virtual memory is too limited to have
 >> a larger BKVASIZE by default.  BKVASIZE is still 16K on all arches
 >> although this problem doesn't affetc 64-bit arches.
 >>
 >> -b 64k is worse.
 >
 > Thank you for this explanation. I was not aware that -b 64k (or even
 > the default values to newfs) would have this effect. I will repeat the
 > tests with 32/4k and 16/2k, although I seem to remember that 64/8k
 > provided a significant performance boost over the defaults. This, and
 > the reduced fsck times was my original motivation to go with the
 > larger values.
 
 The reduced fsck time and perhaps the reduced number of cylinder groups
 are the main advantages of large clusters.  vfs-level clustering turns
 most physical i/o's into 128K-blocks (especially for large files) so
 there is little difference between the i/o speed for all fs block sizes
 unless the fs block size is very small.
 
 > Given the potentially drastic effects of block sizes other than 16/2k,
 > maybe a warning should be added to the newfs manpage? I only found the
 > strong advice to maintain a 8:1 buffer:fragment ratio.
 
 Once the kernel misconfiguration is understood well enough for such a
 warning not to be FUD, it should be easy to fix.
 
 >>> When both bonnie++ processes are in their "Rewriting" phase, the system
 >>> hangs within a few seconds. Both bonnie++ processes are in state "nbufkv".
 >>> bufdaemon takes about 40% CPU time and is in state "qsleep" when not
 >>> active.
 >>
 >> You got the buffer cache fragmentation that you asked for.
 >
 > Looking at vfs_bio.c, I see that it has defrag-code in it. Should I
 > try adding some debug output to this code to get some insight why this
 > does not work, or not as effective as it should?
 
 Don't start there, since it is complicated and timing-dependent.  Maybe
 add some printfs to make it easy to see when it enters and leaves defrag
 mode.
 
 >> Apparently you found a way to reproduce the serious fragmentaion
 >> problems.
 >
 > A key factor seems to be the "Rewriting" operation. I see no problem
 > during the "normal" writing, nor could I reproduce it with concurrent
 > dd runs.
 
 I don't know exactly what bonnie's rewrite mode does.  Is it just read/
 [modify]/write of sequential blocks with a fairly small block size?
 Old bonnie docs say that the block size is always 8K.  One reason I
 don't like bonnie.  Clustering should work fairly normally with that.
 Anything with random seeks would break clustering.
 
 >> Increasing BKVASIZE would take more work than this, since although it
 >> was intended to be a system parameter which could be changed to reduce
 >> the fragmentation problem, one of the many bugs in it is that it was
 >> never converted into a "new" kernel option.  Another of the bugs in
 >> it is that doubling it halves the number of buffers, so doubling it
 >> does more than use twice as much kva.  This severely limited the number
 >> of buffers back when memory sizes were 64MB.  It is not a very
 >> significant limitation if the memory size is 1GB or larger.
 >
 > Should I try to experiment with BKVASIZE of 65536? If so, can I
 > somehow up the number of buffers again? Also, after modifying
 > BKVASIZE, is it sufficient to compile and install a new kernel, or do
 > I have to build and install the entire world?
 
 Just the kernel, but changing sys/param.h will make most of the world
 want to recompile itself according to dependencies.  I don't like rebuilding
 things, and often set timestamps in header files back to what they were
 to avoid rebuilding (after rebuilding only the object files that actually
 depend on the change).  Use this hack with caution, or rebuild kernels in
 a separate tree that doesn't affect the world.
 
 >>> [second bonnie goes Rewriting as well]
 >>> 00-04-24.log:vfs.numdirtybuffers: 11586
 >>> 00-04-25.log:vfs.numdirtybuffers: 16325
 >>> 00-04-26.log:vfs.numdirtybuffers: 24333
 >>> ...
 >>> 00-04-54.log:vfs.numdirtybuffers: 52096
 >>> 00-04-57.log:vfs.numdirtybuffers: 52098
 >>> 00-05-00.log:vfs.numdirtybuffers: 52096
 >>> [ etc. ]
 >>
 >> This is a rather large buildup and may indicate a problem.  Try reducing
 >> the dirty buffer watermarks.  Their default values are mostly historical
 >> nonsense.
 >
 > You mean the vfs.(hi|lo)dirtybuffers? Will do. What would be
 > reasonable starting values for experimenting? 800/200?
 
 1000 or 10000 (if nbuf is 50000).  1000 is probably too conservative, but
 I think it is plenty for most loads.
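 As a sketch, the suggested reduction could be tried like this (the
 values are starting points from the discussion above, not
 recommendations; they can also go in /etc/sysctl.conf):
 
```
# experimental dirty-buffer watermarks (defaults are ~52139/26069 here)
vfs.lodirtybuffers=250
vfs.hidirtybuffers=1000
vfs.dirtybufthresh=800
```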
 
 Bruce

From: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>,
	freebsd-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Fri, 31 May 2013 18:31:50 +0200

 Sorry for the late reply, testing took longer than expected.
 
 (I have combined your replies from separate mails into one, and
 reordered some of the text.)
 
 On Tue, May 28, 2013 at 10:03:10PM +1000, Bruce Evans wrote:
 > On Mon, 27 May 2013, Klaus Weber wrote:
 > >On Mon, May 27, 2013 at 03:57:56PM +1000, Bruce Evans wrote:
 > >>On Sun, 26 May 2013, Klaus Weber wrote:
 
 > However, I have never been able to reproduce serious fragmentation problems
 > from using too-large-block sizes, or demonstrate significant improvements
 > from avoiding the known fragmentation problem by increasing BKVASIZE.
 > Perhaps my systems are too small, or have tuning or local changes that
 > accidentally avoid the problem.
 > 
 > Apparently you found a way to reproduce the serious fragmentaion
 > problems.  Try using a block size that doesn't ask for the problem.
 > (...)
 > The reduced fsck time and perhaps the reduced number of cylinder groups
 > are the main advantages of large clusters.  vfs-level clustering turns
 > most physical i/o's into 128K-blocks (especially for large files) so
 > there is little difference between the i/o speed for all fs block sizes
 > unless the fs block size is very small.
 
 I have now repeated the tests with several variations of block- and
 fragment sizes. In all cases, I did two tests:
 
 1) dd if=/dev/zero of=/mnt/t1/100GB-1.bin bs=100m count=1000
 2) bonnie++ -s 64g -n 0 -f -D -d /mnt/t1
     bonnie++ -s 64g -n 0 -f -D -d /mnt/t2
 
 The dd is simply to give a rough idea of the performance impact of the
 fs parameters; with the two bonnie++ processes, I was mainly
 interested in performance and hangs when both bonnie++ processes are
 in their "Rewriting" phase. I have also tested variations where the
 block:fragment ratio does not follow the 8:1 recommendation.
 
 64/8k, kernel unpatched:
 dd: 1218155712 bytes/sec
 bonnie++: around 300 MB/sec, then drops to 0 and system hangs
 
 32/4k, kernel unpatched:
 dd: 844188881 bytes/sec
 bonnie++: jumps between 25 and 900 MB/sec, no hang
 
 16/2k, kernel unpatched:
 dd: 517996142 bytes/sec
 bonnie++: mostly 20-50 MB/sec, with 3-10 second "bursts" of 
                   400-650 MB/sec, no hang
 
 64/4k, kernel unpatched:
 dd: 1156041159 bytes/sec
 bonnie++: hangs system quickly once both processes are in rewriting
 
 32/8k, kernel unpatched:
 dd: 938072430 bytes/sec
 bonnie++: 29-50 MB/sec, with 3-10 second "bursts" of 
                   up to 650 MB/sec, no hang (but I canceled the test
 		  after an hour or so).
 
 So while a file system created with the current (32/4k) or old (16/2k)
 defaults does prevent the hangs, it also reduces the sequential write
 performance to 70% and 43% of a 64/8k fs, respectively.
 
 The problem seems to be the 64k block size, not the 8k fragment size.
 
 In all cases, the vfs.numdirtybuffers count remained fairly small as
 long as the bonnie++ processes were writing the testfiles. It rose to
 vfs.hidirtybuffers (slower with only one process in "Rewriting", much
 faster when both processes are rewriting).
 
 > >00-04-57.log:vfs.numdirtybuffers: 52098
 > >00-05-00.log:vfs.numdirtybuffers: 52096
 > >[ etc. ]
 > 
 > This is a rather large buildup and may indicate a problem.  Try reducing
 > the dirty buffer watermarks. 
 
 I have tried to tune system parameters as per your advice, in an attempt
 to get a 64/8k fs running stable and with reasonable write performance.
 (dd: results omitted for brevity, all normal for 64/8k fs)
 
 all with 64/8k, kernel unpatched:
 vfs.lodirtybuffers=250
 vfs.hidirtybuffers=1000
 vfs.dirtybufthresh=800
 
 bonnie++: 40-150 MB/sec, no hang
 
 vfs.numdirtybuffers raises to 1000 when both processes are rewriting.
 
 
 vfs.lodirtybuffers=1000
 vfs.hidirtybuffers=4000
 vfs.dirtybufthresh=3000
 
 bonnie++: 380-50 MB/sec, no hang
 
 For the next tests, I kept lo/hidirtybuffers at 1000/4000, and only
 varied dirtybufthresh:
 1200: bonnie++: 80-370 MB/sec
 1750: bonnie++: around 600 MB/sec
 1900: bonnie++: around 580 MB/sec. vfs.numdirtybuffers=3800 (i.e., it
           no longer reaches vfs.hidirtybuffers!)
 (no hangs in any of the tests).
 
 I then re-tested with lo/hidirtybuffers at their defaults, and only
 dirtybufthresh set to slightly less than half of hidirtybuffers:
 
 vfs.lodirtybuffers=26069
 vfs.hidirtybuffers=52139
 vfs.dirtybufthresh=26000
 
 dd: 1199121549 bytes/sec
 bonnie++: 180-650 MB/sec, mostly around 500, no hang
 
 
 By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
 I found that (vfs.dirtybufthresh) * (number of bonnie++ processes)
 must be slightly less than vfs.hidirtybuffers for reasonable
 performance without hangs.
 
 vfs.numdirtybuffers rises to
 (vfs.dirtybufthresh) * (number of bonnie++ processes),
 and as long as this is below vfs.hidirtybuffers, the system will not
 hang.
 
 
 > >>I found that
 > >>the problem could be fixed by killing cluster_write() by turning it into
 > >>bdwrite() (by editing the running kernel using ddb, since this is easier
 > >>than rebuilding the kernel).  I was trying many similar things since I
 > >>had a theory that cluster_write() is useless.  [...]
 > >
 > >If that would provide a useful datapoint, I could try if that make a
 > >difference on my system. What changes would be required to test this?
 > >
 > >Surely its not as easy as replacing the function body of
 > >cluster_write() in vfs_cluster.c with just "return bdwrite(bp);"?
 > 
 > That should work for testing, but it is safer to edit ffs_write()
 > and remove the block where it calls cluster_write() (or bawrite()),
 > so that it falls through to call bdwrite() in most cases.
 
 I was not sure whether to disable the "bawrite(bp);" in the else part
 as well. Here is what I used for the next test (in ffs_write):
 
   } else if (vm_page_count_severe() ||
               buf_dirty_count_severe() ||
               (ioflag & IO_ASYNC)) {
           bp->b_flags |= B_CLUSTEROK;
           bawrite(bp);
           /* KWKWKW       } else if (xfersize + blkoffset == fs->fs_bsize) {
           if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
                   bp->b_flags |= B_CLUSTEROK;
                   cluster_write(vp, bp, ip->i_size, seqcount);
           } else {
                   bawrite(bp);
                   } KWKWKW */
   } else if (ioflag & IO_DIRECT) {
           bp->b_flags |= B_CLUSTEROK;
           bawrite(bp);
   } else {
           bp->b_flags |= B_CLUSTEROK;
           bdwrite(bp);
   }
 
 dd: 746804775 bytes/sec
 
 During the dd tests, iostat shows a weird, sawtooth-like behavior:
  64.00 26730 1670.61   0  0 12  3 85
  64.00 13308 831.73   0  0  4  1 95
  64.00 5534 345.85   0  0 10  1 89
  64.00  12  0.75   0  0 16  0 84
  64.00 26544 1658.99   0  0 10  2 87
  64.00 12172 760.74   0  0  3  1 95
  64.00 8190 511.87   0  0  8  1 91
  64.00  10  0.62   0  0 14  0 86
  64.00 22578 1411.11   0  0 14  3 83
  64.00 12634 789.63   0  0  3  1 95
  64.00 11695 730.96   0  0  6  2 92
  48.00   7  0.33   0  0 13  0 87
  64.00 11801 737.58   0  0 17  1 82
  64.00 19113 1194.59   0  0  6  2 92
  64.00 15996 999.77   0  0  4  2 94
  64.00   3  0.19   0  0 13  0 87
  64.00 10202 637.63   0  0 16  1 83
  64.00 20443 1277.71   0  0  8  2 90
  64.00 15586 974.10   0  0  4  1 95
  64.00 682 42.64   0  0 13  0 87
 
 With two bonnie++ processes in the "Writing intelligently" phase,
 iostat jumped between 9 and 350 MB/sec. I cancelled the test before
 the first bonnie++ process reached the "Rewriting..." phase, due to
 the dismal performance.
 
 Already during the "Writing intelligently" phase, vfs.numdirtybuffers
 reaches vfs.hidirtybuffers (during previous tests, vfs.numdirtybuffers
 only rose to high numbers in the "Rewriting..." phase).
 
 
 After reverting the source change, I have decided to try mounting the
 file system with "-o noclusterr,noclusterw", and re-test. This is
 equivalent to disabling only the if-part of the expression in the
 source snippet above.
 
 dd: 1206199580 bytes/sec
 bonnie++: 550-700 MB/sec, no hang
 
 During the tests, vfs.numdirtybuffers remains low, lo/hidirtybuffers
 and dirtybufthresh are at their defaults:
 vfs.dirtybufthresh: 46925
 vfs.hidirtybuffers: 52139
 vfs.lodirtybuffers: 26069
 vfs.numdirtybuffers: 15
 
 So it looks like you were spot-on by suspecting cluster_write(). 
 Further tests confirmed that "-o noclusterw" is sufficient to prevent
 the hangs and provide good performance. "-o noclusterr" on its own
 makes no difference; the system will hang.
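 For reference, a sketch of how the workaround could be made permanent
 (device and mount point are the ones from this report; adjust to
 taste):
 
```
# /etc/fstab -- noclusterw avoids the hang; mount(8) equivalent:
#   mount -o noclusterw /dev/da0p1 /mnt/t1
/dev/da0p1   /mnt/t1   ufs   rw,noclusterw   2   2
```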
 
 
 I have also tested with write-clustering enabled, but with
 vfs.write_behind=0 and vfs.write_behind=2, respectively. In both
 cases, the system hangs with two bonnie++ in "Rewriting...".
 
 
 I have also tested with BKVASIZE set to 65536. As you explained, this
 reduced the number of buffers:
 vfs.dirtybufthresh: 11753
 vfs.hidirtybuffers: 13059
 vfs.lodirtybuffers: 6529
 
 dd results remain unchanged from a BKVASIZE of 16k. bonnie++'s iostats
 with 2 processes in "Rewriting..." jump between 70 and 800 MB/sec and
 numdirtybuffers reaches max:
 
 vfs.numdirtybuffers: 13059
 
 Even though numdirtybuffers reaches hidirtybuffers, the system does
 not hang, but performance is not very good.
 
 With BKVASIZE set to 65536 _and_ the fs mounted "-o noclusterw",
 the performance is same as with BKVASIZE of 16k, and the system does
 not hang.
 
 I have now reverted BKVASIZE to its default, as the main factor for a
 stable and fast system seems to be the noclusterw mount option.
 
 
 > >>Apparently you found a way to reproduce the serious fragmentaion
 > >>problems.
 > >
 > >A key factor seems to be the "Rewriting" operation. I see no problem
 > >during the "normal" writing, nor could I reproduce it with concurrent
 > >dd runs.
 > 
 > I don't know exactly what bonnie rewrite bmode does.  Is it just read/
 > [modify]/write of sequential blocks with a fairly small block size?
 > Old bonnie docs say that the block size is always 8K.  One reason I
 > don't like bonnie.  Clustering should work fairly normally with that.
 > Anything with random seeks would break clustering.
 
 Here is the relevant part from bonnie++'s source (in C++):
 ---------------
 bufindex = 0;
 for(words = 0; words < num_chunks; words++)
 { // for each chunk in the file
   dur.start();
   if (file.read_block(PVOID(buf)) == -1)
     return 1;
   bufindex = bufindex % globals.io_chunk_size();
   buf[bufindex]++;
   bufindex++;
   if (file.seek(-1, SEEK_CUR) == -1)
     return 1;
   if (file.write_block(PVOID(buf)) == -1)
     return io_error("re write(2)");
 -----------
 globals.io_chunk_size() is 8k (by default and in all of my tests), and
 bonnie++ makes sure that buf is page-aligned. 
 
 So you are correct: bonnie++ re-reads the file that was created
 previously in the "Writing intelligently..." phase in blocks, modifies
 one byte in the block, and writes the block back.
 
 Something in this specific workload is triggering the huge buildup of
 numdirtybuffers when write-clustering is enabled.
 
 
 I am now looking at vfs_cluster.c to see whether I can find which part
 is responsible for letting numdirtybuffers rise without bound and
 why only *re* writing a file causes problems, not the initial
 writing. Any suggestions on where to start looking are very welcome.
 
 Klaus

From: Bruce Evans <brde@optusnet.com.au>
To: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Cc: freebsd-gnats-submit@freebsd.org, freebsd-bugs@freebsd.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Tue, 4 Jun 2013 07:09:59 +1000 (EST)

 On Fri, 31 May 2013, Klaus Weber wrote:
 
 > Sorry for the late reply, testing took longer than expected.
 
 This is very interesting.  It will take me a long time to complete
 testing and replying too.  Maybe more later.
 
 > (I have combined your replies from separate mails into one, and
 > reordered some of the text.)
 
 Very readable now.
 
 > ...
 > I have now repeated the tests with several variations of block- and
 > fragment sizes. In all cases, I did two tests:
 >
 > 1) dd if=/dev/zero of=/mnt/t1/100GB-1.bin bs=100m count=1000
 > 2) bonnie++ -s 64g -n 0 -f -D -d /mnt/t1
 >    bonnie++ -s 64g -n 0 -f -D -d /mnt/t2
 >
 > The dd is simply to give a rough idea of the performance impact of the
 > fs parameters, with the two bonnie++ processes, I was mainly
 > interested in performance and hangs when both bonnie++ processes are
 > in their "Rewriting" phase. I have also tested variations where the
 > block:fragment ratio does not follow the 8:1 recommendation.
 >
 > 64/8k, kernel unpatched:
 > dd: 1218155712 bytes/sec
 > bonnie++: around 300 MB/sec, then drops to 0 and system hangs
 
 My hardware can barely do 40 MB/sec, but under FreeBSD-5.2 (modified),
 there is at most a 50% drop with 2 bonnies and no tuning, and no hangs.
 Except it hung in my debugging code for understanding a related problem.
 I had this code enabled for a month or 2 and had forgotten about it.
 Only 2 bonnies caused enough load to reach it.
 
 > 32/4k, kernel unpatched:
 > dd: 844188881 bytes/sec
 > bonnie++: jumps between 25 and 900 MB/sec, no hang
 >
 > 16/2k, kernel unpatched:
 > dd: 517996142 bytes/sec
 > bonnie++: mostly 20-50 MB/sec, with 3-10 second "bursts" of
 >                  400-650 MB/sec, no hang
 
 The non-hangs might be due to BKVASIZE not being a problem.
 
 > So while a file system created with the current (32/4k) or old (16/2k)
 > defaults  does prevent the hangs, it also reduces the sequential write
 > performance to 70% and 43% of an 64/8k fs.
 
 I think bufdaemon and other write-pressure handling code just don't work
 as well as upper layers for clustering and ordering the writes.  They
 are currently broken, as shown by the hangs.  Clustering shouldn't be
 generating so much pressure.  I can explain why it does (see below).
 But random i/o can easily generate the same pressure, so bufdaemon etc.
 need to handle it.  Also, pressure should be the usual case for async
 mounts, and bufdaemon etc. should be able to do clustering and ordering
 _better_ than upper layers, since they have more blocks to work with.
 cluster_write() intentionally leaves as much as possible to bufdaemon
 in order to potentially optimize it.
 
 > In all cases, the vfs.numdirtybuffers count remained fairly small as
 > long as the bonnie++ processes were writing the testfiles. It rose to
 > vfs.hidirtybuffers (slowler with only one process in "Rewriting", much
 > faster when both processes are rewriting).
 
 The slow buildup is caused by fragmentation.  If the files happen to
 be laid out contiguously, then cluster_write() acts like bawrite(),
 but better.  With your 64K-blocks and FreeBSD's 128K max cluster size,
 it normally combines just 2 64K-blocks to create a 128K-cluster, and
 does 1 write instead of 2.  After this write, both methods leave the
 same number of dirty buffers (none for data, but a couple for metadata).
 In more detail, the write pattern with bonnie's 8K-blocks is:
 
    write; bdwrite
    ...
    write; bdwrite  (7 bdwrites to the same buffer)
    write; cluster_write or bawrite (buffer is now full, and dirty iff clusterw)
    write; bdwrite
    ...
    write; bdwrite  (7 bdwrites to the next buffer)
    write; cluster_write or bawrite (2 buffers are now full and both are clean)
 
 However, with fragmentation, cluster_write() "leaks" a dirty buffer
 although it is full, while bawrite() just writes all full buffers.  The
 leakage is because cluster_write() wants physical contiguity.  This is
 too much to ask for.  The leak may be fast if the file system is very
 fragmented.
 
 I happened to test on a file system which was almost full and thus very
 fragmented.  Two test files of size 256M each had 452 extents (counting
 indirect blocks as an extent although they might be contiguous).  The
 buildup with my test program (which tries to produce bonnie's i/o pattern)
 was often to the same magic number (something like 768, a little higher
 than the number of extents).
 
 This much leakage was enough to make clustering a pessimization, although
 it worked to combine 4 blocks in most cases (my fs block size is 16K and
 my max cluster size is 64K).  The disks are old, so 16K is as good an i/o
 size as 64K for them.  But smaller sizes wouldn't be.
 
 >>> 00-04-57.log:vfs.numdirtybuffers: 52098
 >>> 00-05-00.log:vfs.numdirtybuffers: 52096
 >>> [ etc. ]
 >>
 >> This is a rather large buildup and may indicate a problem.  Try reducing
 >> the dirty buffer watermarks.
 >
 > I have tried to tune system parameters as per your advice, in an attempt
 > to get a 64/8k fs running stable and with reasonable write performance.
 > (dd: results omitted for brevity, all normal for 64/8k fs)
 
 I got mostly worse behaviour by tuning things related to dirtybuffers.
 
 > [... You managed to fix the hangs, at a cost of too much performance. ]
 
 > By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
 > I found that
 > (vfs.dirtybufthresh) * (number of bonnie++ process) must be slightly
 > less than vfs.hidirtybuffers for reasonable performance without hangs.
 >
 > vfs.numdirtybuffers rises to
 > (vfs.dirtybufthresh) * (number of bonnie++ process)
 > and as long as this is below vfs.hidirtybuffers, the system will not
 > hang.
 
 I wonder why it is linear in the number of processes.  I saw indications
 of similar behaviour (didn't test extensively).  1 bonnie created
 about 2/3 of vfs.hidirtybuffers and 2 bonnies saturated at
 vfs.hidirtybuffers.  This is with vfs.hidirtybuffers much smaller than
 yours.
 
 
 > ...
 > I was not sure whether to disable the "bawrite(bp);" in the else part
 > as well. Here is what I used for the next test (in ffs_write):
 >
 >  } else if (vm_page_count_severe() ||
 >              buf_dirty_count_severe() ||
 >              (ioflag & IO_ASYNC)) {
 >          bp->b_flags |= B_CLUSTEROK;
 >          bawrite(bp);
 
 Note that this bawrite() is used when hidirtybuffers is reached.  This
 is OK for this buffer if this buffer is large, but it doesn't stop
 the flood -- the system stays saturated at hidirtybuffs until bonnie
 stops writing.  This has a bad effect on other parts of the system,
 e.g., if they are writing small buffers.
 
 >          /* KWKWKW       } else if (xfersize + blkoffset == fs->fs_bsize) {
 >          if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
 >                  bp->b_flags |= B_CLUSTEROK;
 >                  cluster_write(vp, bp, ip->i_size, seqcount);
 >          } else {
 >                  bawrite(bp);
 >                  } KWKWKW */
 
 OK.
 
 I tried a bdwrite() for the async mount case.  Async mounts should give
 delayed writes always, but it is a bugfeature that they give delayed
 writes for critical metadata but normally async writes for data (via
 cluster_write()).  Using bdwrite() here let me control the behaviour
 using a combination of mount flags.  It became clearer that the
 bugfeature is not just a bug.  Without cluster_write() forcing out
 writes for async mounts, the write pressure is much the same as from
 bonnie rewrite.
 
 > During the dd tests, iostat shows a weird, sawtooth-like behavior:
 > 64.00 26730 1670.61   0  0 12  3 85
 > ...
 
 Some bad interaction of bufdaemon with the writers.  I believe there are
 locking problems, but didn't know anything about them until recently.
 FreeBSD-5.2 is simpler and seems to have fewer of them.  I got a deadlock
 in it by putting a sleep waiting for the write pressure to clear in the
 above.  This gave deadlock, since bufdaemon is apparently prevented from
 doing anything while we hold a lock.
 
 > ...
 > After reverting the source change, I have decided to try mounting the
 > file system with "-o noclusterr,noclusterw", and re-test. This is
 > equivalent to disabling only the if-part of the expression in the
 > source snippet above.
 
 Good idea.
 
 > dd: 1206199580 bytes/sec
 > bonnie++: 550-700 MB/sec, no hang
 >
 > During the tests, vfs.numdirtybuffers remains low, lo/hidirtybuffers
 > and dirtybufthresh are at their defaults:
 > vfs.dirtybufthresh: 46925
 > vfs.hidirtybuffers: 52139
 > vfs.lodirtybuffers: 26069
 > vfs.numdirtybuffers: 15
 >
 > So it looks like you were spot-on by suspecting cluster_write().
 
 Not for the reasons that I suspected it!
 
 > Further tests confirmed that "-o noclusterw" is sufficient to prevent
 > the hangs and provide good performance. "-o noclusterr" on its own
 > makes no difference; the system will hang.
 
 I thought -noclusterw would be worse until now...
 
 > I have also tested with BKVASIZE set to 65536. As you explained, this
 > reduced the number of buffers:
 > vfs.dirtybufthresh: 11753
 > vfs.hidirtybuffers: 13059
 > vfs.lodirtybuffers: 6529
 >
 > dd results remain unchanged from a BKVASIZE of 16k. bonnie++'s iostats
 > with 2 processes in "Rewriting..." jump between 70 and 800 MB/sec and
 > numdirtybuffers reaches max:
 >
 > vfs.numdirtybuffers: 13059
 >
 > Even though numdirtybuffers reaches hidirtybuffers, the system does
 > not hang, but performance is not very good.
 
 ... I hoped BKVASIZE would be the problem, but it is at most a secondary
 problem.
 
 > Here is the relevant part from bonnie++'s source (in C++):
 > ...
 > So you are correct: bonnie++ re-reads the file that was created
 > previously in the "Writing intelligently..." phase in blocks, modifies
 > one byte in the block, and writes the block back.
 >
 > Something in this specific workload is triggering the huge buildup of
 > numdirtybuffers when write-clustering is enabled.
 
 I can explain this.  Readers and writers share the offset for the
 sequential heuristic (it is per-open-file), so cluster_write() cannot
 tell that bonnie's writes are sequential.  The sequential_heuristic()
 function sees the offset moving back and forth, which normally means
 random access, so it tells cluster_write() that the access is random,
 although bonnie seeks back after each read() so that all the
 writes() are sequential.  cluster_write() rarely if ever pushes out
 random writes, so the write pressure builds up to a saturation value
 almost instantly.  Then all subsequent writes work even more poorly
 due to the pressure.  Truly random writes would give similar buildup
 of pressure.
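 The shared-offset effect is easy to model in userland.  Here is a hedged
 sketch (the struct and function names are invented for illustration; this
 is not the kernel's sequential_heuristic()): a detector that counts an
 access as sequential only when it begins where the previous access ended
 classifies every one of bonnie's rewrites as random, because the
 intervening read has already advanced the expected offset.
 
 ```c
 #include <assert.h>
 #include <stddef.h>
 #include <sys/types.h>
 
 /*
  * Toy model of one open file's shared access state.  Both reads and
  * writes update the same next_expected offset, as both reads and
  * writes update the per-open-file state in the kernel.
  */
 struct ofile {
 	off_t	next_expected;
 	int	seqcount;
 };
 
 static int
 access_is_sequential(struct ofile *fp, off_t offset, size_t len)
 {
 	int seq = (offset == fp->next_expected);
 
 	fp->seqcount = seq ? fp->seqcount + 1 : 0;
 	fp->next_expected = offset + (off_t)len;
 	return (seq);
 }
 
 int
 main(void)
 {
 	struct ofile f = { 0, 0 };
 	off_t off = 0;
 	int i, seq_writes = 0;
 
 	/* bonnie-like loop: read a block, seek back, rewrite it */
 	for (i = 0; i < 4; i++) {
 		(void)access_is_sequential(&f, off, 8192);	   /* read */
 		seq_writes += access_is_sequential(&f, off, 8192); /* write */
 		off += 8192;
 	}
 	/*
 	 * Each write starts at the offset its own read started at, so
 	 * the detector never classifies any write as sequential, even
 	 * though the writes cover the file strictly in order.
 	 */
 	assert(seq_writes == 0);
 	return (0);
 }
 ```
 
 With two independent file descriptors (as in the test program further
 below), each descriptor keeps its own state, so both streams look
 sequential and the problem disappears.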
 
 > I am now looking at vfs_cluster.c to see whether I can find which part
 > is responsible for letting numdirtybuffers rise without bounds and
 > why only *re* writing a file causes problems, not the initial
 > writing. Any suggestions on where to start looking are very welcome.
 
 It is very complicated, but it was easy to find its comments saying that
 it tries not to force out the writes for non-sequential accesses.  I
 am currently trying the following workarounds:
 
 % diff -u2 vfs_cluster.c~ vfs_cluster.c
 % --- vfs_cluster.c~	Tue Apr  6 20:11:55 2004
 % +++ vfs_cluster.c	Tue Jun  4 04:53:52 2013
 % @@ -593,4 +714,6 @@
 %  }
 % 
 % +int	buf_dirty_count_large(void);
 % +
 %  /*
 %   * Do clustered write for FFS.
 
 This is like buf_dirty_count_severe() except it has a lower threshold.
 It doesn't really help.
 
 % @@ -630,4 +753,44 @@
 %  		vp->v_lasta = vp->v_clen = vp->v_cstart = vp->v_lastw = 0;
 % 
 % +	/*
 % +	 * XXX the 13 year old changes in the algorithm to use seqcount
 % +	 * are too aggressive.  Avoid them by faking sequential access if
 % +	 * we are under write pressure.  This causes us to write out
 % +	 * clusters in more cases (in particular, when clusters are
 % +	 * complete).
 
 This doesn't really help.  It's too late to start faking after pressure
 has built up.
 
 % +	 *
 % +	 * XXX we should probably use a more conservative condition for
 % +	 * write pressure than is currently available.  Try not to leave
 % +	 * anything to bufdaemon, since it can too easily fail to clear
 % +	 * write pressure because the processes creating the write
 % +	 * pressure have the vnode locked.  E.g., use lodirtybuffers
 % +	 * instead of hidirtybuffers in buf_dirty_count_severe(), and
 % +	 * also check dirtybufthresh.
 
 I use lodirtybuffers as a default.  This is still too high, and almost
 any value reduces performance.
 
 % +	 *
 % +	 * XXX seqcount itself is broken.  It gets cleared by reads of
 % +	 * the same open file, even if the application seeks back after
 % +	 * reading so that writes are perfectly sequential.  This case
 % +	 * is probably unimportant, but bonnie does it.
 
 As described differently above.
 
 % +	 *
 % +	 * XXX I think that if bonnie didn't seek back then the write
 % +	 * pressure would be the same (provided the bonnie write size is
 % +	 * smaller than the fs block size).  Then the writes wouldn't be
 % +	 * sequential, so a fixed seqcount wouldn't help force them out.
 % +	 * But they need to be forced out under write pressure, and I
 % +	 * don't see anything to do it.  Check this.
 % +	 *
 % +	 * XXX the writing system shouldn't collapse when we get this
 % +	 * slightly wrong.  Why doesn't bdwrite() force things when
 % +	 * we don't?  It has some code to do so.  It is still better
 % +	 * for us to force out full clusters than to leave it to
 % +	 * the writing system to possibly force out sub-buffers.
 % +	 *
 % +	 * XXX try killing async under the same conditions.
 
 This is too late to reduce pressure from async too.
 
 % +	 */
 % +	if (vm_page_count_severe() || buf_dirty_count_large()) {
 % +		seqcount = 2;
 % +		async = 0;
 % +	}
 % +
 %  	if (vp->v_clen == 0 || lbn != vp->v_lastw + 1 ||
 %  	    (bp->b_blkno != vp->v_lasta + btodb(lblocksize))) {
 % @@ -719,4 +882,5 @@
 %  		} else {	/* Wait for rest of cluster */
 %  			vp->v_cstart = lbn;
 % +			/* XXX problem if we're async and orig maxclen = 0. */
 %  			bdwrite(bp);
 %  		}
 
 This case for async breaks the bugfeature of not honoring it.  The
 bugfeature often prevents buildup of excessive pressure, but some cases
 arrive here and then it doesn't.  This is the only old change I have in
 cluster_write().
 
 % @@ -726,8 +890,13 @@
 %  		 * are operating sequentially, otherwise let the buf or
 %  		 * update daemon handle it.
 % +		 *
 % +		 * Algorithm changeback: ignore seqcount here, at least for
 % +		 * now, to work around readers breaking it for writers.  It
 % +		 * is too late to start ignoring after write pressure builds
 % +		 * up, since not writing out here is the main source of the
 % +		 * buildup.
 %  		 */
 %  		bdwrite(bp);
 % -		if (seqcount > 1)
 % -			cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
 % +		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
 %  		vp->v_clen = 0;
 %  		vp->v_cstart = lbn + 1;
 
 This works fairly well.  Changing the algorithm back in all cases reduces
 performance, but here we have usually finished with the buffer.  Even if
 seqcount is fixed so that it isn't clobbered by reads, it might be right
 to ignore it here, so as to reduce write pressure, until write pressure is
 handled better.
 
 I used the following program to generate i/o like bonnie's:
 
 %%%
 #include <err.h>
 #include <fcntl.h>
 #include <stdint.h>
 #include <stdlib.h>
 #include <stdio.h>
 #include <unistd.h>
 
 #define	IOSIZE	8192
 
 static char buf[IOSIZE] __aligned(4096);
 
 int
 main(int argc, char **argv)
 {
  	ssize_t n;
  	int ifd, ofd;
 
  	if (argc != 2 && argc != 3)
  		errx(1, "usage: ./prog file1 [file2]");
  	ifd = open(argv[1], O_RDWR);
  	if (ifd < 0)
  		err(1, "open");
  	if (argc == 3)
  		ofd = open(argv[2], O_RDWR);
  	else
  		ofd = ifd;
 	if (ofd < 0)
  		err(1, "open");
  	for (;;) {
  		n = read(ifd, buf, IOSIZE);
  		if (n == 0) {
  			printf("finished: ioffset %#jx; ooffset = %#jx\n",
  			    (intmax_t)lseek(ifd, 0, SEEK_CUR),
  			    (intmax_t)lseek(ofd, 0, SEEK_CUR));
  			exit(0);
  		}
  		if (n < 0)
  			err(1, "read error");
  		if (n != IOSIZE)
  			errx(1, "short read %d", (int)n);
  		if (ofd == ifd)
  			lseek(ofd, -IOSIZE, SEEK_CUR);
  		n = write(ofd, buf, IOSIZE);
  		if (n < 0)
  			err(1, "write error");
  		if (n != IOSIZE)
  			errx(1, "short write %d", (int)n);
  	}
 }
 %%%
 
 Usage: create a test file zz of a suitable size (the program normally
 doesn't extend it); then ./prog zz does read/write like bonnie;
 ./prog zz zz does the same with independent file positions.
 
 Bruce

From: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>,
	freebsd-gnats-submit@freebsd.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 10 Jun 2013 01:07:21 +0200

 
 On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
 > On Fri, 31 May 2013, Klaus Weber wrote:
 
 Again, I have reordered some sections of your mail in my reply. I have
 also removed freebsd-bugs from Cc: (it seems that replies to PRs get
 forwarded there automatically, and judging from the web interface to
 the mailing list archives, the previous mails were all duplicated).
 
 > >So you are correct: bonnie++ re-reads the file that was created
 > >previously in the "Writing intelligently..." phase in blocks, modifies
 > >one byte in the block, and writes the block back.
 > >
 > >Something in this specific workload is triggering the huge buildup of
 > >numdirtybuffers when write-clustering is enabled.
 > 
 > I can explain this.  Readers and writers share the offset for the
 > sequential heuristic (it is per-open-file), so cluster_write() cannot
 > tell that bonnie's writes are sequential.  The sequential_heuristic()
 > function sees the offset moving back and forth, which normally means
 > random access, so it tells cluster_write() that the access is random,
 > although bonnie seeks back after each read() so that all the
 > writes() are sequential. 
 
 Indeed. Using the test program you provided, I can confirm that
 reading and writing to the same file via _different_ file descriptors
 works without hangs and with good performance (similar to "-o
 noclusterw", see below for details). vfs.numdirtybuffers count
 remained low, and write rates pretty stable.
 
 Since your program is more flexible than bonnie++, I have done all
 tests in  this mail with it.
 
 > cluster_write() rarely if ever pushes out
 > random writes, so the write pressure builds up to a saturation value
 > almost instantly.  Then all subsequent writes work even more poorly
 > due to the pressure.  Truly random writes would give similar buildup
 > of pressure.
 
 I have briefly tested and confirmed this as well (see below). Mounting
 a file system with "-o async" shows the "quickly raising
 vfs.numdirtybuffers" symptom already during writing (not re-writing) a
 file, e.g. with dd. I could not get the system to hang, but I did not
 test this very thoroughly.
 
 
 > >So while a file system created with the current (32/4k) or old (16/2k)
 > >defaults  does prevent the hangs, it also reduces the sequential write
 > >performance to 70% and 43% of an 64/8k fs.
 > XXX
 > I think bufdaemon and other write-pressure handling code just don't work
 > as well as upper layers for clustering and ordering the writes.  They
 > are currently broken, as shown by the hangs.  Clustering shouldn't be
 > generating so much pressure.  I can explain why it does (see below).
 > But random i/o can easily generate the same pressure, so bufdaemon etc.
 > need to handle it.  Also, pressure should be the usual case for async
 > mounts, and bufdaemon etc. should be able to do clustering and ordering
 > _better_ than upper layers, since they have more blocks to work with.
 > cluster_write() intentionally leaves as much as possible to bufdaemon
 > in order to potentially optimize it.
 > 
 > >In all cases, the vfs.numdirtybuffers count remained fairly small as
 > >long as the bonnie++ processes were writing the testfiles. It rose to
 > >vfs.hidirtybuffers (slower with only one process in "Rewriting", much
 > >faster when both processes are rewriting).
 > 
 > The slow buildup is caused by fragmentation.  If the files happen to
 > be laid out contiguously, then cluster_write() acts like bawrite(),
 > but better.  With your 64K-blocks and FreeBSD's 128K max cluster size,
 > it normally combines just 2 64K-blocks to create a 128K-cluster, and
 > does 1 write instead of 2.  After this write, both methods have the
 > same number of dirty buffers (none for data, but a couple for metadata).
 >
 > [Analysis of bonnie++'s write pattern]
 >
 > However, with fragmentation, cluster_write() "leaks" a dirty buffer
 > although it is full, while bawrite() just writes all full buffers.  The
 > leakage is because cluster_write() wants physical contiguity.  This is
 > too much to ask for.  The leak may be fast if the file system is very
 > fragmented.
 
 In all of my (older) tests with bonnie++, the file system was newfs'ed
 before mounting it, so there was no fragmentation for me.
 
 
 > >I have tried to tune system parameters as per your advice, in an attempt
 > >to get a 64/8k fs running stable and with reasonable write performance.
 > >(dd: results omitted for brevity, all normal for 64/8k fs)
 > 
 > I got mostly worse behaviour by tuning things related to dirtybuffers.
 
 Me too, performance-wise. The only exception was reducing
 vfs.dirtybufthresh so that vfs.numdirtybuffers no longer reaches
 vfs.hidirtybuffers; this is needed to prevent the hangs. The exact
 value does not seem to matter much.
 
 
 > >[... You managed to fix the hangs, at a cost of too much performance. ]
 > 
 > >By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
 > >I found that
 > >(vfs.dirtybufthresh) * (number of bonnie++ process) must be slightly
 > >less than vfs.hidirtybuffers for reasonable performance without hangs.
 > >
 > >vfs.numdirtybuffers rises to
 > >(vfs.dirtybufthresh) * (number of bonnie++ process)
 > >and as long as this is below vfs.hidirtybuffers, the system will not
 > >hang.
 
 Actually, further testing revealed that it is not linear with the
 number of processes, but with the number of files being
 rewritten. With bonnie++, you cannot make this distinction; two
 bonnie++ processes cannot re-write the same file.
 
 However, with the test program you provided you can actually re-write
 the same file from two (or more) processes. Testing this shows that
 the correct formula is actually 
 
 "vfs.numdirtybuffers rises to
 (vfs.dirtybufthresh) * (number of files being re-written)
 and as long as this is below vfs.hidirtybuffers, the system will not
 hang."
 
 So, any number of processes can re-write the same file, and the number
 of processes has no effect on vfs.numdirtybuffers.
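 
 Plugging in the default sysctl values quoted earlier makes the
 arithmetic concrete (the values are from this particular machine's
 defaults, not universal):
 
 ```c
 #include <assert.h>
 
 int
 main(void)
 {
 	long dirtybufthresh = 46925;	/* default reported above */
 	long hidirtybuffers = 52139;	/* default reported above */
 
 	/* one file being rewritten stays below the hang threshold... */
 	assert(1 * dirtybufthresh < hidirtybuffers);
 	/*
 	 * ...but two files rewritten concurrently push numdirtybuffers
 	 * toward twice dirtybufthresh, past hidirtybuffers, which
 	 * matches the observed hangs with two bonnie++ processes.
 	 */
 	assert(2 * dirtybufthresh > hidirtybuffers);
 	return (0);
 }
 ```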
 
 
 > I wonder why it is linear in the number of processes.  I saw indications
 > of similar behaviour (didn't test extensively).  1 bonnie created
 > about 2/3 of vfs.hidirtybuffers and 2 bonnies saturated at
 > vfs.hidirtybuffers.  This is with vfs.hidirtybuffers much smaller than
 > yours.
 
 Does this correlate with vfs.dirtybufthresh on your system as well?
 (i.e. is vfs.dirtybufthresh about 2/3 of your vfs.hidirtybuffers?)
 
 
 > I tried a bdwrite() for the async mount case.  Async mounts should give
 > delayed writes always, but it is a bugfeature that they give delayed
 > writes for critical metadata but normally async() writes for data (via
 > cluster_write()).  Using bdwrite here let me control the behaviour
 > using a combination of mount flags.  It became clearer that the
 > bugfeature is not just a bug.  Without cluster_write() forcing out
 > writes for async mounts, the write pressure is much the same as from
 > bonnie rewrite.
 
 I have tested this as well, and I can confirm that with a file system
 mounted with "-o async" I see the problematic behavior with "normal"
 writes (not re-writes) as well: while writing a large file (e.g. via
 dd), vfs.numdirtybuffers rises quickly to vfs.dirtybufthresh. On
 non-async file systems, I could only provoke this via re-writing.
 
 
 
 > >After reverting the source change, I have decided to try mounting the
 > >file system with "-o noclusterr,noclusterw", and re-test.
 > > [...]
 > >Further tests confirmed that "-o noclusterw" is sufficient to prevent
 > >the hangs and provide good performance.
 > 
 > I thought -noclusterw would be worse until now...
 
 I wanted to check the performance effects of write clustering more
 thoroughly, so I have tested 3 variants (with the test program you
 provided, using a 64 GB file (2x RAM)):
 
 1) Re-writing a single file with two file descriptors, file system
    mounted with default options.
 
 2) Re-writing a single file with two file descriptors, file system
    mounted with "-o noclusterw".
 
 3) Re-writing a single file via a single file descriptor, file system
    mounted with  "-o noclusterw" (required in this case to prevent the
    system hanging).
 
 First, I have tested on a logical volume on the RAID array (3 test
 runs each):
 
 1) (2 desc, w/ clustering)
 0.669u 97.119s 3:23.34 48.0%	5+187k 524441+524311io 0pf+0w
 0.472u 95.501s 3:22.40 47.4%	5+188k 523095+524311io 0pf+0w
 0.606u 93.263s 3:21.13 46.6%	5+188k 522999+524311io 0pf+0w
 
 2) (2 desc, -o noclusterw)
 0.524u 94.180s 3:19.76 47.4%	5+189k 524442+1048576io 0pf+0w
 0.622u 95.200s 3:23.30 47.1%	5+187k 523090+1048576io 0pf+0w
 0.475u 94.232s 3:19.96 47.3%	5+186k 522767+1048576io 0pf+0w
 
 3) (1 desc, -o noclusterw)
 0.922u 95.917s 3:33.74 45.3%	5+187k 524442+1048576io 0pf+0w
 0.679u 95.392s 3:32.62 45.1%	5+187k 522256+1048576io 0pf+0w
 0.976u 93.639s 3:33.68 44.2%	5+187k 521902+1048576io 0pf+0w
 
 As you can see, the performance differences are very small. If you
 consider the differences significant, then "-o noclusterw" is actually
 a tiny bit faster than clustering. You can also see clustering at work:
 the number of writes doubles for cases 2) and 3).
 
 I have then repeated the tests on a single disk, connected directly to
 mainboard SATA port, to take the RAID controller out of the
 equation. (With its CPU and on-board RAM, it certainly has the ability
 to cluster and re-order writes after the kernel sends them.) Since the
 tests take a long time to complete, I have only done a single test run
 for each case.
 
 
 1) (2 desc, w/ clustering)
 3.497u 575.616s 24:39.49 39.1%	5+187k 524445+524314io 0pf+0w
 
 2) (2 desc, -o noclusterw)
 5.960u 876.735s 33:35.75 43.7%	5+187k 524445+1048576io 0pf+0w
 
 3) (1 desc, -o noclusterw)
 7.014u 741.382s 29:56.98 41.6%	5+188k 524445+1048576io 0pf+0w
 
 Here, clustering does seem to have a positive effect, and the
 differences are more pronounced.
 
 So it really seems that clustering does provide performance benefits,
 but the RAID controller seems to be able to make up for the lack
 of clustering (either because clustering is disabled, or because it
 does not work effectively due to interspersed reads and seeks on the
 same file descriptor).
 
 
 > >I am now looking at vfs_cluster.c to see whether I can find which part
 > >is responsible for letting numdirtybuffers rise without bounds and
 > >why only *re* writing a file causes problems, not the initial
 > >writing. Any suggestions on where to start looking are very welcome.
 > 
 > It is very complicated, but it was easy to find its comments saying that
 > it tries not to force out the writes for non-seqential accesses.  I
 > am currently trying the following workarounds:
 
 I have decided to start testing with only a single change from the
 list of changes you provided:
  
 > % diff -u2 vfs_cluster.c~ vfs_cluster.c
 > % @@ -726,8 +890,13 @@
 > %  		 * are operating sequentially, otherwise let the buf or
 > %  		 * update daemon handle it.
 > % +		 *
 > % +		 * Algorithm changeback: ignore seqcount here, at least for
 > % +		 * now, to work around readers breaking it for writers.  It
 > % +		 * is too late to start ignoring after write pressure builds
 > % +		 * up, since not writing out here is the main source of the
 > % +		 * buildup.
 > %  		 */
 > %  		bdwrite(bp);
 > % -		if (seqcount > 1)
 > % -			cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
 > % +		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
 > %  		vp->v_clen = 0;
 > %  		vp->v_cstart = lbn + 1;
 
 And sure enough, a kernel with only this one line change[*] is able to
 handle reading and re-writing a file through a single file descriptor
 just fine, good performance, no hangs, vfs.numdirtybuffers remains
 low:
 
 4) (1 desc, cluster_wbuild_wb unconditionally)
 0.864u 96.145s 3:32.34 45.6%	5+188k 524441+524265io 0pf+0w
 0.976u 94.081s 3:29.70 45.3%	5+187k 489756+524265io 0pf+0w
 0.903u 95.676s 3:33.38 45.2%	5+188k 523124+524265io 0pf+0w
 
 Same test on a single disk, as above:
 
 4) (1 desc, cluster_wbuild_wb unconditionally)
 8.399u 822.055s 33:01.01 41.9%	5+187k 524445+524262io 0pf+0w
 
 Unfortunately, for the single disk case, the performance is the same
 as for the "-o noclusterw" case (even though the writes _are_
 clustered now, see the number of write operation).
 
 
 [*] I still need to confirm that this is really the only change in the
 kernel, but I'm running out of time now. I will test this next weekend.
 
 > This works fairly well.  Changing the algorithm back in all cases reduces
 > performance, but here we have usually finished with the buffer.  Even if
 > seqcount is fixed so that it isn't clobbered by reads, it might be right
 > to ignoe it here, so as to reduce write pressure, until write pressure is
 > handled better.
 
 For me, this change makes the difference between a working system with
 good performance vs a system that limps along and occasionally hangs
 completely; for this specific workload.
 
 Thank you for this!
 
 Next weekend, I want to test some scenarios with and without this
 change. Most importantly, I want to investigate
 
 a) behavior with random I/O (any suggestions on a benchmark or test
 program that can generate "representative" random I/O? IOzone?)
 
 b) behavior with sequential writes
 
 c) behavior with re-writing files, both via one and two file
 descriptors
 
 d) maybe test the other changes to vfs_cluster.c you provided, to see
 whether they make a difference.
 
 e) Investigate the role of vfs.dirtybufthresh. When vfs.numdirtybuffers
 reaches this number, the system switches to a mode where
 vfs.numdirtybuffers no longer increases and seems to handle the load
 just fine. I want to understand why the system sees no need to limit
 numdirtybuffers earlier. Maybe this can be used to improve re-write
 performance to the "2 desc, with clustering" case.
 
 f) I have experienced system freezes when unmounting my test file
 system after running the respective tests. I'm not sure whether this is
 related to the current problem, or whether it is caused by some test
 patch that I accidentally left in the code. I need to check whether this
 is reproducible, and whether there is a specific workload that causes
 this.
 
 Klaus

From: Bruce Evans <brde@optusnet.com.au>
To: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 10 Jun 2013 11:00:29 +1000 (EST)

 On Mon, 10 Jun 2013, Klaus Weber wrote:
 
 > On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
 >> On Fri, 31 May 2013, Klaus Weber wrote:
 
 This thread is getting very long, and I will only summarize a couple
 of things that I found last week here.  Maybe more later.
 
 o Everything seems to be working as well as intended (not very well)
    except in bufdaemon and friends.  Perhaps it is already fixed there.
    I forgot to check which version of FreeBSD you are using.  You may
    be missing some important fixes.  There were some by kib@ a few
    months ago, and some by jeff@ after this thread started.  I don't
    run any version of FreeBSD new enough to have these, and the version
    that I run also doesn't seem to have any serious bugs in bufdaemon.
    It just works mediocrely.
 
 o Writing in blocks of size less than the fs block size, as bonnie
    normally does, gives much the same rewriting effect as bonnie does
    explicitly, because the system is forced to read each block before
    doing a partial write to it.  This at best doubles the amount of
    i/o and halves the throughput of the writes.
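 
    The doubling can be sketched with a toy count (purely illustrative;
    FS_BSIZE and the op counting are assumptions, not fs code): every
    block touched by sub-block writes costs one read plus one write-back,
    while full-block writes cost only the write-back.
 
 ```c
 #include <assert.h>
 
 #define FS_BSIZE 65536UL	/* assumed fs block size (64K-block ffs) */
 
 /*
  * Device i/o transactions needed to rewrite 'len' bytes with write(2)
  * calls of 'chunk' bytes each, starting block-aligned, where 'chunk'
  * divides FS_BSIZE or is a multiple of it.  A partial write forces the
  * block to be read in first (read-modify-write).
  */
 static unsigned long
 io_ops_for_rewrite(unsigned long len, unsigned long chunk)
 {
 	unsigned long nblocks = len / FS_BSIZE;
 	unsigned long ops;
 
 	ops = nblocks;			/* one write-back per block */
 	if (chunk < FS_BSIZE)
 		ops += nblocks;		/* plus one read-before-write */
 	return (ops);
 }
 
 int
 main(void)
 {
 	/* full-block writes: 10 blocks, 10 ops */
 	assert(io_ops_for_rewrite(10 * FS_BSIZE, FS_BSIZE) == 10);
 	/* bonnie-style 8K writes: same data, twice the i/o */
 	assert(io_ops_for_rewrite(10 * FS_BSIZE, 8192UL) == 20);
 	return (0);
 }
 ```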
 
 o There are some minor bugs in the read-before-write system code, and
    some differences in the "much the same" that may be important in
    some cases:
    - when bonnie does the read-before-write, the system normally uses
      cluster_read() on the read descriptor and thus uses the system's
      idea of sequentiality on the read descriptor.  For both ffs and
      msdosfs, this normally results in reading a small cluster (up to
      the end of the current read) followed by async read-ahead of a
      larger cluster or 2.  The separate clusters improve latency but
      reduce performance.  scottl@ recently committed a sysctl
      vfs.read_min for avoiding the earlier splitting.  Using it made
      some interesting but ultimately unimportant differences to the
      2-bonnie problem.
    - when the system does the read-before-write, ffs normally uses
      cluster_read() on the write descriptor and thus uses the system's
      idea of sequentiality on the write descriptor.  ffs doesn't know
      the correct amount to read in this case, and it always asks
      for MAXBSIZE, which is both too small and too large.  This value
      is the amount that should be read synchronously.  It is too large
      since the normal amount is the application's block size which is
      normally smaller, and it is too small since MAXBSIZE is only
      half of the max cluster size.  The correct tradeoff of latency
      vs throughput is even less clear than for a user read, and further
      off from being dynamic.  msdosfs doesn't even use cluster_read()
      for this (my bad).  It uses plain bread().  This gives very low
      performance when the block size is small.  So msdosfs worked much
      better in the 2-bonnie benchmark than for rewrites generated by
      dd just writing with a small block size and conv=notrunc.  After
      fixing this, msdosfs worked slightly better than ffs in all cases.
    - whoever does the read-before-write, cluster reading tends to generate
      a bad i/o pattern.  I saw patterns like the following (on ~5.2 where
      the max cluster size is only 64K, after arranging to mostly use this
      size):
        file1: read       64K offset 0
        file1: read ahead 64K offset 64K
        file2: read       64K offset 0
        file1: read ahead 64K offset 64K
        file1: write      64K offset 0
        file1: read       64K offset 128K
        file1: read ahead 64K offset 192K
        file2: write      64K offset 0
      The 2 files make the disk seek a lot, and the read-and-read-ahead
      gives even more seeks to get back to the write position.  My drives
      are old and have only about 2MB of cache.  Seeks with patterns like
      the above are apparently just large enough to break the drives'
      caching.  OTOH, if I use your trick of mounting with -noclusterw,
      the seeks are reduced significantly and my throughput increases by
      almost a factor of 2, even though this gives writes of only 16K.
      Apparently the seeks are reduced just enough for the drives' caches
      to work well.  I think the same happens for you.  Your i/o system
      is better, but it only takes a couple of bonnies and perhaps the
      read pointers getting even further ahead of the write pointers
      to defeat the drive's caching.  Small timing differences probably
      allow the difference to build up.  Mounting with -noclusterw also
      gives some synchronization that will prevent this buildup.
    - when the system does the read-before-write, the sequential heuristic
      isn't necessarily clobbered, but it turns out that the clobbering
      gives the best possible behaviour, except for limitations and bugs
      in bufdaemon!...
 
 > [... good stuff clipped]
 > So it really seems that clustering does provide performance benefits,
 > but the RAID controller seems to be able to make up for the lack
 > of clustering (either because clustering is disabled, or because it
 > does not work effectively due to interspersed reads and seeks on the
 > same file descriptor).
 
 Yes, the seek pattern caused by async-but-not-long-delayed writes
 (whether done by cluster_write() a bit later or bawrite() directly)
 combined with reading far ahead (whether done explicitly or implicitly)
 is very bad even for 1 file, but can often be compensated for by caching
 in the drives.  With 2 files or random writes on 1 file it is much worse,
 but apparently mounting with -noclusterw limits it enough for the
 drives to compensate in the case of 2 bonnies.  I think the best we
 can do in general is delay writes as long as possible and then
 schedule them perfectly.  But scheduling them perfectly is difficult
 and only happens accidentally.
 
 >>> I am now looking at vfs_cluster.c to see whether I can find which part
 >>> is responsible for letting numdirtybuffers rise without bounds and
 >>> why only *re* writing a file causes problems, not the initial
 >>> writing. Any suggestions on where to start looking are very welcome.
 >>
 >> It is very complicated, but it was easy to find its comments saying that
 >> it tries not to force out the writes for non-seqential accesses.  I
 >> am currently trying the following workarounds:
 >
 > I have decided to start testing with only a single change from the
 > list of changes you provided:
 >
 >> % diff -u2 vfs_cluster.c~ vfs_cluster.c
 >> % @@ -726,8 +890,13 @@
 >> %  		 * are operating sequentially, otherwise let the buf or
 >> %  		 * update daemon handle it.
 >> % +		 *
 >> % +		 * Algorithm changeback: ignore seqcount here, at least for
 >> % +		 * now, to work around readers breaking it for writers.  It
 >> % +		 * is too late to start ignoring after write pressure builds
 >> % +		 * up, since not writing out here is the main source of the
 >> % +		 * buildup.
 >> %  		 */
 >> %  		bdwrite(bp);
 >> % -		if (seqcount > 1)
 >> % -			cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
 >> % +		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
 >> %  		vp->v_clen = 0;
 >> %  		vp->v_cstart = lbn + 1;
 >
 > And sure enough, a kernel with only this one line change[*] is able to
 > handle reading and re-writing a file through a single file descriptor
 > just fine, good performance, no hangs, vfs.numdirtybuffers remains
 > low:
 
 After more testing, I found that this was almost perfectly backwards for
 my hardware!  I think for your hardware it allows the drives to
 compensate, much like with -noclusterw but with a slightly improved
 throughput due to the larger writes.  But with my drives, it mostly
 just gives more seeks.  After changing this back and being more careful
 with the comparisons, I found that best results are obtained (in ~5.2)
 by letting numdirtybuffers build up.  The breakage of the sequential
 heuristic causes the above to never force out the cluster immediately
 for the 2-bonnie case.  I get similar behaviour by always using delayed
 writes in ffs_write().  This might depend on setting B_CLUSTEROK in more
 cases, so that the clustering always gets done later.
 
 Typical throughputs for me:
 - my drives can do 55MB/sec max and get 48 for writing 1 file with large
    blocks using dd
 - 48 drops to half of 20-24 with read-before-write for 1 file.  That's
    a 4-fold reduction.  One half is for the doubled i/o and the other half
    is for the seeks.
 - half of 20-24 drops to half of 10-12 with 2 files and read-before-write
    of each, in the best case.  That's an 8-fold reduction.  Another factor
    of 2 is apparently lost to more seeks.
 - half of 10-12 drops to half of 5-6, as in the previous point but in the
    worst case.  That's a 16-fold reduction.  The worst case is with my
    modification above.  It maximizes the seeks.  My original idea for a
    fix (in the above diff) gave this case.  It gave almost perfect
    clustering and almost no buildup of numdirtybuffers, but throughput
    was still worst.  (My drives can do 16K blocks at full speed provided
    the blocks are contiguous, so they don't benefit much from clustering
    except for its side effect of reducing seeks to other blocks in between
    accessing the contiguous ones.)
 Some of this typical behaviour is not very dependent on block sizes.  The
 drives become seek-bound, and anything that doubles the number of seeks
 halves the throughput.
 
 Bruce

From: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>,
	freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 10 Jun 2013 18:38:19 +0200

 Just a quick reply while I am away from the system.
 
 On Mon, Jun 10, 2013 at 11:00:29AM +1000, Bruce Evans wrote:
 > o Everything seems to be working as well as intended (not very well)
 >   except in bufdaemon and friends.  Perhaps it is already fixed there.
 >   I forgot to check which version of FreeBSD you are using.  You may
 >   be missing some important fixes.  There were some by kib@ a few
 >   months ago, and some by jeff@ after this thread started.  I don't
 >   run any version of FreeBSD new enough to have these, and the version
 >   that I run also doesn't seem to have any serious bugs in bufdaemon.
 >   It just works mediocrely.
 
 ~All tests were done with 9-STABLE from early May. I also tested with
 10-CURRENT kernel, downloaded from an ftp mirror on May 5th,
 2013. This kernel showed the same behavior, so I did all further
 testing with the 9-STABLE version.
 
 I will update my sources to 9-STABLE first thing next weekend, and
 will also re-test with an up-to-date CURRENT kernel.
 
 Klaus

From: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>,
	freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 17 Jun 2013 02:17:16 +0200

 On Mon, Jun 10, 2013 at 11:00:29AM +1000, Bruce Evans wrote:
 > On Mon, 10 Jun 2013, Klaus Weber wrote:
 > >On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
 > >>On Fri, 31 May 2013, Klaus Weber wrote:
 >
 > This thread is getting very long, and I will only summarize a couple
 > of things that I found last week here.  Maybe more later.
 >
 > o Everything seems to be working as well as intended (not very well)
 >   except in bufdaemon and friends.  Perhaps it is already fixed there.
 >   I forgot to check which version of FreeBSD you are using.  You may
 >   be missing some important fixes.  There were some by kib@ a few
 >   months ago, and some by jeff@ after this thread started. 
 
 I have now tested with both an up-to-date 9-STABLE as well as a
 10-CURRENT kernel from ftp.freebsd.org (8.6.2013):
 
 FreeBSD filepile 9.1-STABLE FreeBSD 9.1-STABLE #27 r251798M: Sun Jun
 16 16:19:18 CEST 2013 root@filepile:/usr/obj/usr/src/sys/FILEPILE
 amd64
 
 FreeBSD filepile 10.0-CURRENT FreeBSD 10.0-CURRENT #0: Sat Jun 8
 22:10:23 UTC 2013 root@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC
 amd64
 
 The bad news is that I can still reproduce the hangs reliably, with
 the same methods as before. I do have the feeling that the machine
 survives the re-write load a bit longer, but it's just a gut feeling -
 I have no means to quantify this.
 
 The "dirtybuf" sysctl values have changed their default values; I
 believe you were involved in the discussion about how they are
 computed that led to changing them. They are now:
 vfs.lodirtybuffers: 13251
 vfs.hidirtybuffers: 26502
 vfs.dirtybufthresh: 23851
 
 The 10-CURRENT kernel shows a message on the console when
 the hang occurs (but see below for a twist):
 
 g_vfs_done():da0p1[WRITE(offset=<x>, length=<y>)]error = 11
 
 The message is printed to the console about once a second when the
 hang occurs, until I hard-reset the server. <x> keeps changing,
 <y> is mostly 65536, sometimes 131072 (like, one screenful of 64k's
 then one 128k line, etc.)
 
 Error=11 seems to be EDEADLK, so it looks like a -CURRENT kernel does
 detect a deadlock situation whereas the -STABLE kernel doesn't. (I have
 also re-checked the -CURRENT from 5.5.2013 that I briefly tested in my
 initial report; it also shows these messages. My apologies for not
 noticing this earlier.)
 
 And now there is an additional point where things get really weird:
 The "g_vfs_done()..." messages _only_ appear when the system hangs as
 a result of bonnie++'s rewriting load. If I repeat the same test with
 the test program you provided earlier, the system will still hang, but
 without the console output.
 
 (I have repeated the tests 10 times to make sure I wasn't just
 imagining this; 5 times with bonnie++ (message appears every time), 
 5 times with your test program (message did not appear a single
 time)).
 
 I'm not really sure what difference between these two programs causes
 this. Bonnie++ writes the file immediately before starting to rewrite
 it, while your program works on pre-fabricated files - maybe that is it.
 
 
 Anyway: Is there a way to find out which resource(s) are involved in
 the EDEADLK-situation? Please keep in mind that I cannot enter the
 debugger via the console, and panicking the machine leaves me with a
 dead keyboard, which seems to make any useful debugging pretty hard.
 
 (If you believe that a working debugger is required to make progress
 on this, let me know and I will try to get a serial console working
 somehow.)
 
 >>[observations regarding seek behavior] 
 
 I did not get to do much other testing this weekend (hanging the
 system is pretty time-consuming, as after the hard-reset I have to
 wait for the root partition to re-mirror). 
 
 I have thought about how I could observe the seek behavior of my disks
 when testing, but I have come to the conclusion that I really cannot.
 My RAID controller presents the array as a single disk to the OS, and
 its on-board RAM allows it to cache writes and perform read-aheads
 without the kernel knowing, so neither reading from nor writing to two
 blocks far apart will necessarily result in a physical head move.
 
 
 I think I'll concentrate next on finding out where/why the hangs are
 occurring. Getting rid of the hangs and hard resets will hopefully
 speed up testing cycles.
 
 Klaus

From: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Cc: Bruce Evans <brde@optusnet.com.au>, freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Sun, 21 Jul 2013 19:45:25 +0200

 --AqsLC8rIMeq19msA
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline
 
 On Sun, Jul 21, 2013 at 07:42:15PM +0200, Klaus Weber wrote:
 > With the following patch (also attached)
 
 Attachment was missing.
 
 Klaus
 
 --AqsLC8rIMeq19msA
 Content-Type: text/x-diff; charset=us-ascii
 Content-Disposition: attachment; filename="bo_dirty.bv_cnt_FIX_prod.patch"
 
 Index: sys/kern/vfs_bio.c
 ===================================================================
 --- sys/kern/vfs_bio.c	(revision 253339)
 +++ sys/kern/vfs_bio.c	(working copy)
 @@ -1146,7 +1146,7 @@
  	if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) {
  		(void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
  		altbufferflushes++;
 -	} else if (bo->bo_dirty.bv_cnt > dirtybufthresh) {
 +	} else if ( bdflush_required(bo) ) {
  		BO_LOCK(bo);
  		/*
  		 * Try to find a buffer to flush.
 @@ -1438,6 +1438,28 @@
  }
  
  /*
 + * Flushing is deemed necessary if at least one of the following is true:
 + * - bo is hogging more than dirtybufthresh buffers (previous behavior)
 + * - numdirtybuffers exceeds dirtybufthresh
 + * - we are or act as bufdaemon (TDP_NORUNNINGBUF)
 + *
 + * Used by bufbdflush (above) and ffs_bdflush (in ffs_snapshot).
 + */
 +int
 +bdflush_required(struct bufobj *bo)
 +{
 +  struct thread *td = curthread;
 +
 +  KASSERT(bo != NULL, ("bdflush_required: NULL bo"));
 +
 +  if ((bo->bo_dirty.bv_cnt > dirtybufthresh) ||
 +      (numdirtybuffers > dirtybufthresh) ||
 +      (td && (td->td_pflags & TDP_NORUNNINGBUF)) )
 +    return(1);
 +  return(0);
 +  }
 +
 +/*
   *	brelse:
   *
   *	Release a busy buffer and, if requested, free its resources.  The
 Index: sys/sys/bio.h
 ===================================================================
 --- sys/sys/bio.h	(revision 253339)
 +++ sys/sys/bio.h	(working copy)
 @@ -151,6 +151,10 @@
  #define physread physio
  #define physwrite physio
  
 +extern struct bufobj *bo;
 +int bdflush_required(struct bufobj *bo);
 +
 +
  #endif /* _KERNEL */
  
  #endif /* !_SYS_BIO_H_ */
 Index: sys/ufs/ffs/ffs_snapshot.c
 ===================================================================
 --- sys/ufs/ffs/ffs_snapshot.c	(revision 253339)
 +++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
 @@ -2161,7 +2161,7 @@
  	struct buf *nbp;
  	int bp_bdskip;
  
 -	if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
 +	if ( !bdflush_required(bo) )
  		return;
  
  	td = curthread;
 
 --AqsLC8rIMeq19msA--

From: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>,
	freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Sun, 21 Jul 2013 19:42:15 +0200

 Sorry for the late reply; I did lots of testing. I believe I have
 identified the root cause of the buffer depletion and subsequent I/O
 hangs.
 
 Have a look at the beginning of ffs_bdflush() in ffs_snapshot.c:
 
         if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
 	                 return;
 
 This short-circuits the entire flushing code if _this bufobj_ is using
 less than vfs.dirtybufthresh buffers (by default, vfs.dirtybufthresh
 is 90% of all available buffers). In other words, one bufobj may hog
 up to 90% of all buffers.
 
 This works well if only one file is being re-written. As documented
 up-thread, with just one file vfs.numdirtybuffers rises until it 
 reaches vfs.dirtybufthresh, and then remains relatively stable at that
 point, and the system will not hang.
 
 With two (or more) files being rewritten, the code above tries to let
 _each_ file have 90% of all available buffers, which lets
 vfs.numdirtybuffers hit vfs.hidirtybuffers.
 
 Even worse, with two processes competing for buffer space, often both
 remain under the magic 90% mark, effectively disabling buffer flushing
 entirely. At that point, buffers can only be freed through flushing in
 different code paths, or by other processes not involved in this
 competition for buffer space releasing some. Sooner or later, the
 system reaches a state where none of the bufobjs has more than 90% of
 the buffers, and none of the other (normal) processes can release any
 buffers. The only way out then is a hard reset.
 
 
 With the following patch (also attached), the system performs
 reasonably well and does not hang:
 
 Index: sys/kern/vfs_bio.c
 ===================================================================
 --- sys/kern/vfs_bio.c  (revision 253339)
 +++ sys/kern/vfs_bio.c  (working copy)
 @@ -1146,7 +1146,7 @@
         if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) {
                 (void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
                 altbufferflushes++;
 -       } else if (bo->bo_dirty.bv_cnt > dirtybufthresh) {
 +       } else if ( bdflush_required(bo) ) {
                 BO_LOCK(bo);
                 /*
                  * Try to find a buffer to flush.
 @@ -1438,6 +1438,28 @@
  }
  
  /*
 + * Flushing is deemed necessary if at least one of the following is true:
 + * - bo is hogging more than dirtybufthresh buffers (previous behavior)
 + * - numdirtybuffers exceeds dirtybufthresh
 + * - we are or act as bufdaemon (TDP_NORUNNINGBUF)
 + *
 + * Used by bufbdflush (above) and ffs_bdflush (in ffs_snapshot).
 + */
 +int
 +bdflush_required(struct bufobj *bo)
 +{
 +  struct thread *td = curthread;
 +
 +  KASSERT(bo != NULL, ("bdflush_required: NULL bo"));
 +
 +  if ((bo->bo_dirty.bv_cnt > dirtybufthresh) ||
 +      (numdirtybuffers > dirtybufthresh) ||
 +      (td && (td->td_pflags & TDP_NORUNNINGBUF)) )
 +    return(1);
 +  return(0);
 +  }
 +
 +/*
   *     brelse:
   *
   *     Release a busy buffer and, if requested, free its resources.  The
 Index: sys/sys/bio.h
 ===================================================================
 --- sys/sys/bio.h       (revision 253339)
 +++ sys/sys/bio.h       (working copy)
 @@ -151,6 +151,10 @@
  #define physread physio
  #define physwrite physio
  
 +extern struct bufobj *bo;
 +int bdflush_required(struct bufobj *bo);
 +
 +
  #endif /* _KERNEL */
  
  #endif /* !_SYS_BIO_H_ */
 Index: sys/ufs/ffs/ffs_snapshot.c
 ===================================================================
 --- sys/ufs/ffs/ffs_snapshot.c  (revision 253339)
 +++ sys/ufs/ffs/ffs_snapshot.c  (working copy)
 @@ -2161,7 +2161,7 @@
         struct buf *nbp;
         int bp_bdskip;
  
 -       if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
 +       if ( !bdflush_required(bo) )
                 return;
  
         td = curthread;
 
 
 I have modified ffs_bdflush() in ffs_snapshot.c and the generic
 bufbdflush() in vfs_bio.c, which contains similar logic. (I believe a
 kernel of 5.2 vintage will have this code still in bdwrite().)
 
 While the patch makes rewriting work reliably, it may need some
 further refinements:
 
 - Right now, it limits numdirtybuffers to dirtybufthresh. It might be
   better to use hidirtybuffers as a limit instead?
 
 - the set of conditions for which the flush is performed needs to be
   reviewed by someone who understands this better than I do. Limiting
   numdirtybuffers is necessary to fix the hangs. Limiting
   bo->bo_dirty.bv_cnt was what the code did before, so I left this in
   as well. And finally, I'm testing for TDP_NORUNNINGBUF based on the
   idea that if bufdaemon, syncer or any process acting on their behalf
   sees the need to flush buffers, then it should not be up to *bdflush()
   to second-guess their intention.
 
 - the patch may have style issues
 
 Regarding the threshold and conditions: I am still in the process of
 doing some basic benchmarks with and without the patch. However, it
 seems that it causes a slight performance decrease in the cases that
 work well without the patch (i.e. only one process re-writing). My
 guess is that it now flushes more often than absolutely
 necessary.
 
 Performance is unchanged for cases that do not cause a huge buffer
 buildup (reading and writing via separate file descriptors). Of
 course, in cases that did result in hangs before and now don't, the
 patch is a clear win.
 
 
 I still believe there are some bugs in the clustering code that affect
 this case, making the re-writing less efficient than it could be.
 
 I wish I had the time to do more testing, but I need to bring the
 server into production soon. I'm putting some notes here for
 things I intended to test, but probably won't have the time to:
 
 - in vfs_cluster.c, cluster_write_gb(), where it says "Change to
   algorithm...": Also push the previous cluster when it has reached
   maximum size, not only when seqcount > 0. No point waiting for
   buf_daemon when the cluster cannot grow. Verify that condition
   "vp->v_clen <=  cursize" really tests "reached max. size". XXX
   "maxclen <=  cursize"?  Add rate-limited printf()s!
 
 - create a new "cluster_awrite", from DragonFly BSD:
   http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/9d4e78c77684f1de021eb8ae17767c210205c3c3
   (plug-compatible to bawrite(), but with clustering). Do not replace
   vfs_bio_awrite (yet). Then, selectively replace some bawrite() calls
   with calls to cluster_awrite() to identify where current bawrite()
   calls are inefficient. XXX requires mods to cluster_wbuild()!
 
 
 If anyone wants me to test any additional things, or variants of the
 patch above, I can still do this next weekend. 
 
 Since the system no longer hangs with the patch, I think this bug can
 now be closed.
 
 Klaus

From: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
To: Bruce Evans <brde@optusnet.com.au>, freebsd-gnats-submit@FreeBSD.org
Cc: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Wed, 31 Jul 2013 18:18:59 +0200

 --OgqxwSJOaUobr8KG
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline
 
 After studying style(9), I have now made a revised version of the
 patch, relative to 9/STABLE of about 3 weeks ago:
 
 Index: sys/kern/vfs_bio.c
 ===================================================================
 --- sys/kern/vfs_bio.c  (revision 253339)
 +++ sys/kern/vfs_bio.c  (working copy)
 @@ -1146,7 +1146,7 @@
         if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) {
                 (void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
                 altbufferflushes++;
 -       } else if (bo->bo_dirty.bv_cnt > dirtybufthresh) {
 +       } else if (bdflush_required(bo)) {
                 BO_LOCK(bo);
                 /*
                  * Try to find a buffer to flush.
 @@ -1438,6 +1438,24 @@
  }
 
  /*
 + *     Flushing is deemed necessary if at least one of the following is true:
 + *     - bo is hogging more than dirtybufthresh buffers (previous behavior)
 + *     - numdirtybuffers exceeds dirtybufthresh
 + *     - we are or act as bufdaemon (TDP_NORUNNINGBUF)
 + */
 +int
 +bdflush_required(struct bufobj *bo)
 +{
 +       struct thread *td = curthread;
 +
 +       KASSERT(bo != NULL, ("bdflush_required: NULL bo"));
 +
 +       return ((bo->bo_dirty.bv_cnt > dirtybufthresh) ||
 +               (numdirtybuffers > dirtybufthresh) ||
 +               (td && (td->td_pflags & TDP_NORUNNINGBUF)));
 +}
 +
 +/*
   *     brelse:
   *
   *     Release a busy buffer and, if requested, free its resources.  The
 Index: sys/sys/buf.h
 ===================================================================
 --- sys/sys/buf.h       (revision 253339)
 +++ sys/sys/buf.h       (working copy)
 @@ -486,6 +486,7 @@
  void   bdata2bio(struct buf *bp, struct bio *bip);
  void   bwillwrite(void);
  int    buf_dirty_count_severe(void);
 +int    bdflush_required(struct bufobj *bo);
  void   bremfree(struct buf *);
  void   bremfreef(struct buf *);        /* XXX Force bremfree, only for nfs. */
  int    bread(struct vnode *, daddr_t, int, struct ucred *, struct buf **);
 Index: sys/ufs/ffs/ffs_snapshot.c
 ===================================================================
 --- sys/ufs/ffs/ffs_snapshot.c  (revision 253339)
 +++ sys/ufs/ffs/ffs_snapshot.c  (working copy)
 @@ -2161,7 +2161,7 @@
         struct buf *nbp;
         int bp_bdskip;
 
 -       if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
 +       if (!bdflush_required(bo))
                 return;
 
         td = curthread;
 
 [End of patch]
 
 Besides being more concise, I hope this patch no longer has style
 issues, although I was not sure about the indentation of the second
 and third lines of the return statement. I chose to left-align the
 conditions.
 
 I have also tested limiting numdirtybuffers to hidirtybuffers (instead
 of dirtybufthresh), but that does not prevent the system from hanging.
 
 I would appreciate if someone could review and eventually commit this
 patch (or a variant thereof), and then close this bug.
 
 Thanks, Klaus
 
 --OgqxwSJOaUobr8KG
 Content-Type: text/x-diff; charset=us-ascii
 Content-Disposition: attachment; filename="bo_dirty.bv_cnt_FIX_FINAL.patch"
 
 Index: sys/kern/vfs_bio.c
 ===================================================================
 --- sys/kern/vfs_bio.c	(revision 253339)
 +++ sys/kern/vfs_bio.c	(working copy)
 @@ -1146,7 +1146,7 @@
  	if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) {
  		(void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
  		altbufferflushes++;
 -	} else if (bo->bo_dirty.bv_cnt > dirtybufthresh) {
 +	} else if (bdflush_required(bo)) {
  		BO_LOCK(bo);
  		/*
  		 * Try to find a buffer to flush.
 @@ -1438,6 +1438,24 @@
  }
  
  /*
 + *	Flushing is deemed necessary if at least one of the following is true:
 + *	- bo is hogging more than dirtybufthresh buffers (previous behavior)
 + *	- numdirtybuffers exceeds dirtybufthresh
 + *	- we are or act as bufdaemon (TDP_NORUNNINGBUF)
 + */
 +int
 +bdflush_required(struct bufobj *bo)
 +{
 +	struct thread *td = curthread;
 +
 +	KASSERT(bo != NULL, ("bdflush_required: NULL bo"));
 +
 +	return ((bo->bo_dirty.bv_cnt > dirtybufthresh) ||
 +	        (numdirtybuffers > dirtybufthresh) ||
 +	        (td && (td->td_pflags & TDP_NORUNNINGBUF)));
 +}
 +
 +/*
   *	brelse:
   *
   *	Release a busy buffer and, if requested, free its resources.  The
 Index: sys/sys/buf.h
 ===================================================================
 --- sys/sys/buf.h	(revision 253339)
 +++ sys/sys/buf.h	(working copy)
 @@ -486,6 +486,7 @@
  void	bdata2bio(struct buf *bp, struct bio *bip);
  void	bwillwrite(void);
  int	buf_dirty_count_severe(void);
 +int	bdflush_required(struct bufobj *bo);
  void	bremfree(struct buf *);
  void	bremfreef(struct buf *);	/* XXX Force bremfree, only for nfs. */
  int	bread(struct vnode *, daddr_t, int, struct ucred *, struct buf **);
 Index: sys/ufs/ffs/ffs_snapshot.c
 ===================================================================
 --- sys/ufs/ffs/ffs_snapshot.c	(revision 253339)
 +++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
 @@ -2161,7 +2161,7 @@
  	struct buf *nbp;
  	int bp_bdskip;
  
 -	if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
 +	if (!bdflush_required(bo))
  		return;
  
  	td = curthread;
 
 --OgqxwSJOaUobr8KG--

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/178997: commit references a PR
Date: Sun, 22 Sep 2013 19:23:56 +0000 (UTC)

 Author: kib
 Date: Sun Sep 22 19:23:48 2013
 New Revision: 255797
 URL: http://svnweb.freebsd.org/changeset/base/255797
 
 Log:
   Increase the chance of the buffer write from the bufdaemon helper
   context to succeed.  If the locked vnode which owns the buffer to be
   written is shared locked, try the non-blocking upgrade of the lock to
   exclusive.
   
   PR:	kern/178997
   Reported and tested by:	Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
   Sponsored by:	The FreeBSD Foundation
   MFC after:	1 week
   Approved by:	re (marius)
 
 Modified:
   head/sys/kern/vfs_bio.c
 
 Modified: head/sys/kern/vfs_bio.c
 ==============================================================================
 --- head/sys/kern/vfs_bio.c	Sun Sep 22 19:15:24 2013	(r255796)
 +++ head/sys/kern/vfs_bio.c	Sun Sep 22 19:23:48 2013	(r255797)
 @@ -2624,6 +2624,8 @@ flushbufqueues(struct vnode *lvp, int ta
  	int hasdeps;
  	int flushed;
  	int queue;
 +	int error;
 +	bool unlock;
  
  	flushed = 0;
  	queue = QUEUE_DIRTY;
 @@ -2699,7 +2701,16 @@ flushbufqueues(struct vnode *lvp, int ta
  			BUF_UNLOCK(bp);
  			continue;
  		}
 -		if (vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT | LK_CANRECURSE) == 0) {
 +		if (lvp == NULL) {
 +			unlock = true;
 +			error = vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT);
 +		} else {
 +			ASSERT_VOP_LOCKED(vp, "getbuf");
 +			unlock = false;
 +			error = VOP_ISLOCKED(vp) == LK_EXCLUSIVE ? 0 :
 +			    vn_lock(vp, LK_UPGRADE | LK_NOWAIT);
 +		}
 +		if (error == 0) {
  			mtx_unlock(&bqdirty);
  			CTR3(KTR_BUF, "flushbufqueue(%p) vp %p flags %X",
  			    bp, bp->b_vp, bp->b_flags);
 @@ -2711,7 +2722,8 @@ flushbufqueues(struct vnode *lvp, int ta
  				notbufdflushes++;
  			}
  			vn_finished_write(mp);
 -			VOP_UNLOCK(vp, 0);
 +			if (unlock)
 +				VOP_UNLOCK(vp, 0);
  			flushwithdeps += hasdeps;
  			flushed++;
  
 _______________________________________________
 svn-src-all@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/svn-src-all
 To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/178997: commit references a PR
Date: Sun, 29 Sep 2013 18:05:10 +0000 (UTC)

 Author: kib
 Date: Sun Sep 29 18:04:57 2013
 New Revision: 255941
 URL: http://svnweb.freebsd.org/changeset/base/255941
 
 Log:
   Reimplement r255797 using LK_TRYUPGRADE.
   
   The  r255797 was:
   Increase the chance of the buffer write from the bufdaemon helper
   context to succeed.  If the locked vnode which owns the buffer to be
   written is shared locked, try the non-blocking upgrade of the lock to
   exclusive.
   
   PR:	kern/178997
   Reported and tested by:	Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
   Sponsored by:	The FreeBSD Foundation
   MFC after:	1 week
   Approved by:	re (glebius)
 
 Modified:
   head/sys/kern/vfs_bio.c
 
 Modified: head/sys/kern/vfs_bio.c
 ==============================================================================
 --- head/sys/kern/vfs_bio.c	Sun Sep 29 18:02:23 2013	(r255940)
 +++ head/sys/kern/vfs_bio.c	Sun Sep 29 18:04:57 2013	(r255941)
 @@ -2624,6 +2624,8 @@ flushbufqueues(struct vnode *lvp, int ta
  	int hasdeps;
  	int flushed;
  	int queue;
 +	int error;
 +	bool unlock;
  
  	flushed = 0;
  	queue = QUEUE_DIRTY;
 @@ -2699,7 +2701,16 @@ flushbufqueues(struct vnode *lvp, int ta
  			BUF_UNLOCK(bp);
  			continue;
  		}
 -		if (vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT | LK_CANRECURSE) == 0) {
 +		if (lvp == NULL) {
 +			unlock = true;
 +			error = vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT);
 +		} else {
 +			ASSERT_VOP_LOCKED(vp, "getbuf");
 +			unlock = false;
 +			error = VOP_ISLOCKED(vp) == LK_EXCLUSIVE ? 0 :
 +			    vn_lock(vp, LK_TRYUPGRADE);
 +		}
 +		if (error == 0) {
  			mtx_unlock(&bqdirty);
  			CTR3(KTR_BUF, "flushbufqueue(%p) vp %p flags %X",
  			    bp, bp->b_vp, bp->b_flags);
 @@ -2711,7 +2722,8 @@ flushbufqueues(struct vnode *lvp, int ta
  				notbufdflushes++;
  			}
  			vn_finished_write(mp);
 -			VOP_UNLOCK(vp, 0);
 +			if (unlock)
 +				VOP_UNLOCK(vp, 0);
  			flushwithdeps += hasdeps;
  			flushed++;
  
 _______________________________________________
 svn-src-all@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/svn-src-all
 To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
 
>Unformatted:
