From nobody@FreeBSD.org  Tue Jul 17 10:53:25 2001
Return-Path: <nobody@FreeBSD.org>
Received: from freefall.freebsd.org (freefall.freebsd.org [216.136.204.21])
	by hub.freebsd.org (Postfix) with ESMTP id 12A8B37B401
	for <freebsd-gnats-submit@FreeBSD.org>; Tue, 17 Jul 2001 10:53:25 -0700 (PDT)
	(envelope-from nobody@FreeBSD.org)
Received: (from nobody@localhost)
	by freefall.freebsd.org (8.11.4/8.11.4) id f6HHrPd52978;
	Tue, 17 Jul 2001 10:53:25 -0700 (PDT)
	(envelope-from nobody)
Message-Id: <200107171753.f6HHrPd52978@freefall.freebsd.org>
Date: Tue, 17 Jul 2001 10:53:25 -0700 (PDT)
From: Bill Moran <wmoran@iowna.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: Heavy disk usage causes panic in ffs_blkfree
X-Send-Pr-Version: www-1.0

>Number:         29045
>Category:       i386
>Synopsis:       Heavy disk usage causes panic in ffs_blkfree
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Jul 17 11:00:00 PDT 2001
>Closed-Date:    Fri Aug 09 12:49:23 PDT 2002
>Last-Modified:  Fri Aug 09 12:49:23 PDT 2002
>Originator:     Bill Moran
>Release:        4.3
>Organization:
Independent Consultant
>Environment:
FreeBSD backup.prioritydesigns.com 4.3-STABLE FreeBSD 4.3-STABLE #0: Thu Jun 21 11:14:06 EDT 2001     root@backup.prioritydesigns.com:/usr/obj/usr/src/sys/BACKUP  i386

>Description:
Under heavy load, panics occur in the filesystem code. The panics apparently result in some sort of subtle corruption to the filesystem that then results in an increased chance of further panics.
The following is a typical debug session after such a panic.

IdlePTD 3174400
initial pcb at 286640
panicstr: ffs_blkfree: freeing free block
panic messages:
---
panic: ffs_blkfree: freeing free block

syncing disks... 128 123 110 92 66 34 2 2 2 2 2 2 2 2 2 2 2 2 2 2 
giving up on 2 buffers
Uptime: 24d9h59m3s

dumping to dev #ad/0x20021, offset 380616
dump ata2: resetting devices .. done
127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 
---
#0  dumpsys () at /usr/src/sys/kern/kern_shutdown.c:469
469		if (dumping++) {
(kgdb) bt
#0  dumpsys () at /usr/src/sys/kern/kern_shutdown.c:469
#1  0xc0152017 in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:309
#2  0xc0152394 in poweroff_wait (junk=0xc0252400, howto=-1071307808)
    at /usr/src/sys/kern/kern_shutdown.c:556
#3  0xc01dada2 in ffs_blkfree (ip=0xc1b57000, bno=0, size=8192)
    at /usr/src/sys/ufs/ffs/ffs_alloc.c:1349
#4  0xc01dd06d in ffs_indirtrunc (ip=0xc1b57000, lbn=-12, dbn=115240768, 
    lastbn=-1, level=0, countp=0xcc441d88)
    at /usr/src/sys/ufs/ffs/ffs_inode.c:498
#5  0xc01dcbbd in ffs_truncate (vp=0xcc78fac0, length=0, flags=0, cred=0x0, 
    p=0xcb0eba00) at /usr/src/sys/ufs/ffs/ffs_inode.c:314
#6  0xc01e719e in ufs_inactive (ap=0xcc441eb4)
    at /usr/src/sys/ufs/ufs/ufs_inode.c:84
#7  0xc01ec3ed in ufs_vnoperate (ap=0xcc441eb4)
    at /usr/src/sys/ufs/ufs/ufs_vnops.c:2373
#8  0xc017df9a in vput (vp=0xcc78fac0) at vnode_if.h:815
#9  0xc01811d9 in unlink (p=0xcb0eba00, uap=0xcc441f80)
    at /usr/src/sys/kern/vfs_syscalls.c:1471
#10 0xc0227ee2 in syscall2 (frame={tf_fs = 47, tf_es = 47, tf_ds = 47, 
      tf_edi = 1, tf_esi = 134721536, tf_ebp = -1077936872, 
      tf_isp = -867950636, tf_ebx = 134768128, tf_edx = 134768200, tf_ecx = 0, 
      tf_eax = 10, tf_trapno = 7, tf_err = 2, tf_eip = 134533564, tf_cs = 31, 
      tf_eflags = 643, tf_esp = -1077936916, tf_ss = 47})
    at /usr/src/sys/i386/i386/trap.c:1150
#11 0xc021cda5 in Xint0x80_syscall ()
#12 0x804832b in ?? ()
#13 0x8048135 in ?? ()

Backtraces vary slightly from panic to panic, but the panic message is always the same and is always raised from ffs_blkfree().

The hardware involved is as follows:

Copyright (c) 1992-2001 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD 4.3-STABLE #0: Thu Jun 21 11:14:06 EDT 2001
    root@backup.prioritydesigns.com:/usr/obj/usr/src/sys/BACKUP
Timecounter "i8254"  frequency 1193182 Hz
Timecounter "TSC"  frequency 756743963 Hz
CPU: AMD Athlon(tm) Processor (756.74-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x642  Stepping = 2
  Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
  AMD Features=0xc0440000<<b18>,AMIE,DSP,3DNow!>
real memory  = 134135808 (130992K bytes)
avail memory = 127270912 (124288K bytes)
Preloaded elf kernel "kernel" at 0xc02e8000.
Preloaded userconfig_script "/boot/kernel.conf" at 0xc02e809c.
Pentium Pro MTRR support enabled
md0: Malloc disk
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Host to PCI bridge> on motherboard
pci0: <PCI bus> on pcib0
pcib2: <PCI to PCI bridge (vendor=1106 device=8305)> at device 1.0 on pci0
pci1: <PCI bus> on pcib2
pci1: <ATI Mach64-GM graphics accelerator> at 0.0 irq 11
isab0: <VIA 82C686 PCI-ISA bridge> at device 4.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <VIA 82C686 ATA100 controller> port 0xb800-0xb80f at device 4.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
pci0: <VIA 83C572 USB controller> at 4.3 irq 5
fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0x9400-0x943f mem 0xe0800000-0xe08fffff,0xe1000000-0xe1000fff irq 5 at device 9.0 on pci0
fxp0: Ethernet address 00:d0:b7:46:10:e9
inphy0: <i82555 10/100 media interface> on miibus0
inphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
ahc0: <Adaptec 2930CU SCSI adapter> port 0x9000-0x90ff mem 0xe0000000-0xe0000fff irq 7 at device 10.0 on pci0
aic7860: Single Channel A, SCSI Id=7, 3/255 SCBs
atapci1: <Promise ATA100 controller> port 0x7400-0x743f,0x7800-0x7803,0x8000-0x8007,0x8400-0x8403,0x8800-0x8807 mem 0xdf800000-0xdf81ffff irq 10 at device 17.0 on pci0
ata2: at 0x8800 on atapci1
ata3: at 0x8000 on atapci1
pcib1: <Host to PCI bridge> on motherboard
pci2: <PCI bus> on pcib1
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: model IntelliMouse, device ID 3
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0: configured irq 4 not in bitmap of probed irqs 0
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 8250
sio1: configured irq 3 not in bitmap of probed irqs 0
ppc0: parallel port not found.
ad4: 73308MB <IBM-DTLA-307075> [148945/16/63] at ata2-master UDMA100
acd0: CDROM <ATAPI-CD ROM-DRIVE-50MAX> at ata0-slave using PIO4
Waiting 5 seconds for SCSI devices to settle
sa0 at ahc0 bus 0 target 0 lun 0
sa0: <SONY SDT-9000 0400> Removable Sequential Access SCSI-2 device 
sa0: 10.000MB/s transfers (10.000MHz, offset 15)
Mounting root from ufs:/dev/ad4s1a
WARNING: / was not properly dismounted

>How-To-Repeat:
The HDD is partitioned into /, /var, /usr, swap and /data. The /data partition is ~65G and is where the problem appears to occur. There is also an NFS mount to a 65G partition on another FreeBSD computer. This computer contains ~30G of data (fluctuates +/-3G per day). On Sunday at 3:00 AM, a script is run that does 'rm -r /data/*' and then does a cp to duplicate the remote drive onto the data partition. This operation causes the described panic approximately once a month.
The other days of the week, rsync is used to maintain file versions. rsync has only caused a panic once in the four months this machine has been operating. (except as described below)
Once a panic has occurred, any type of heavy HDD usage will cause a repeat panic ~90% of the time. newfsing the /data partition reduces the frequency of the panics to the percentages previously described.
The enabling/disabling of softupdates doesn't seem to make any difference to the frequency of the panics.
>Fix:
I don't have any suggestions at this point. My best guess so far is that while dealing with fragmentation the code is somehow losing track of whether or not a block is free.
I have 3 crash dumps so far (I'm starting a collection). If you need any more information on this, please don't hesitate to contact me.
>Release-Note:
>Audit-Trail:

From: Ian Dowse <iedowse@maths.tcd.ie>
To: Bill Moran <wmoran@iowna.com>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: i386/29045: Heavy disk usage causes panic in ffs_blkfree 
Date: Tue, 17 Jul 2001 21:51:30 +0100

 In message <200107171753.f6HHrPd52978@freefall.freebsd.org>, Bill Moran writes:
 >
 >Under heavy load, panics occur in the filesystem code. The panics apparently
 >result in some sort of subtle corruption to the filesystem that then results
 >in an increased chance of further panics.
 
 It's always difficult to tell if these panics are caused by a
 hardware/driver fault, or by filesystem bugs. Could you try printing
 out some more information including the contents of the free block
 bitmap from frame 2 on the stack - i.e. something like
 
 	frame 2
 	p fs
 	p *fs
 	p fs->fs_fpg
 	p bp->b_data
 	p cg
 	p cgp
 	p *cgp
 	p bno
 	p blksfree
 	p blksfree[0]@(fs->fs_fpg/8)
 
 A good test for hardware/driver faults is to take some large
 directory tree that does not change, especially one with lots of
 huge files, and run
 
 	find /whatever -type f -print0 |xargs -0 md5 > /tmp/md5.1
 	find /whatever -type f -print0 |xargs -0 md5 > /tmp/md5.2
 	find /whatever -type f -print0 |xargs -0 md5 > /tmp/md5.3
 
 etc while the system is under heavy load. Then diff the /tmp/md5.X
 files to see if anything has changed. You should try this with
 trees on different disks in case there is a driver/disk dependent
 corruption problem. Also try leaving quite a long gap between
 running the finds; data could be getting corrupted as it sits in
 the buffer cache.
 
 Ian
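
The repeated-md5 test described above can also be scripted so that the passes are compared automatically. The following is only a minimal sketch in Python, not something used in this PR; the function names `hash_tree` and `changed_files` are made up for illustration. Like `find -type f`, it looks at regular files only.

```python
import hashlib
import os


def hash_tree(root):
    """Return {relative_path: md5 hex digest} for every regular file under root."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets and other special files
            md5 = hashlib.md5()
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so huge files need not fit in RAM.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    md5.update(chunk)
            digests[os.path.relpath(path, root)] = md5.hexdigest()
    return digests


def changed_files(pass_a, pass_b):
    """List files present in both passes whose digests differ."""
    return sorted(p for p in pass_a if p in pass_b and pass_a[p] != pass_b[p])
```

Calling `hash_tree()` twice on a tree that is not being modified, while the machine is under load, and passing both results to `changed_files()` should give an empty list on healthy hardware. Any path it returns is a candidate for the corruption discussed here.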

From: Bill Moran <wmoran@iowna.com>
To: Ian Dowse <iedowse@maths.tcd.ie>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: i386/29045: Heavy disk usage causes panic in ffs_blkfree
Date: Tue, 17 Jul 2001 18:51:41 -0400

 Ian Dowse wrote:
 > Could you try printing
 > out some more information including the contents of the free block
 > bitmap from frame 2 on the stack - i.e. something like
 
 Errr ... I tried, but frame 2 considers those symbols undefined.
 Did I misunderstand?
 
 > A good test for hardware/driver faults is to take some large
 > directory tree that does not change, especially one with lots of
 > huge files, and run
 > 
 >         find /whatever -type f -print0 |xargs -0 md5 > /tmp/md5.1
 >         find /whatever -type f -print0 |xargs -0 md5 > /tmp/md5.2
 >         find /whatever -type f -print0 |xargs -0 md5 > /tmp/md5.3
 > 
 > etc while the system is under heavy load. Then diff the /tmp/md5.X
 > files to see if anything has changed. You should try this with
 > trees on different disks in case there is a driver/disk dependent
 > corruption problem. Also try leaving quite a long gap between
 > running the finds; data could be getting corrupted as it sits in
 > the buffer cache.
 
 Here's what I did:
 Started a "make buildworld" and then ran the md5 routines you displayed
 above on the /data partition. The machine only has 1 physical disk drive.
 The /data partition had 15000 files on it comprising 9.3G at this point.
 systat showed disk usage ~99% during this.
 I ran two tests during the "make buildworld" (one right after the other)
 I ran a diff on the two resultant files and Lo and Behold! there are a
 slew of differences in the hashes. Now, there's no doubt in my mind that
 this is A Bad Thing (tm) but the big question is, does it indicate
 filesystem problems or ata problems? The mobo is an Asus A7 with
 2 ata66 controllers and 2 ata100 controllers. The drive is an IBM 76G
 and is currently connected to the ata100 controller.
 
 Bill

From: Ian Dowse <iedowse@maths.tcd.ie>
To: Bill Moran <wmoran@iowna.com>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: i386/29045: Heavy disk usage causes panic in ffs_blkfree 
Date: Wed, 18 Jul 2001 00:53:35 +0100

 In message <3B54C17D.F31029C2@iowna.com>, Bill Moran writes:
 >Errr ... I tried, but frame 2 considers those symbols undefined.
 >Did I misunderstand?
 
 Whoops, no I did. It was frame 3 that should contain these symbols,
 but now that you have an easier way to get corruption, that vmcore
 is of less interest.
 
 >I ran two tests during the "make buildworld" (one right after the other)
 >I ran a diff on the two resultant files and Lo and Behold! there are a
 >slew of differences in the hashes.
 
 Ok, that's progress anyway, even if it's not progress in the most
 desirable direction... I think you can pretty much rule out any
 filesystem bugs here; either the hardware is bad (disk, ATA
 controllers, RAM etc) or possibly there is something the ATA driver
 isn't doing right, such as missing a workaround for a known hardware
 bug.
 
 One thing that would be very useful is if you can collect a number
 of samples of "good" and "corrupted" versions of the same file.
 That may be tricky to do, because right now we don't know anything
 about where the data is being corrupted. Maybe try to make 2 copies
 of lots of files to another system, and then md5 each and look for
 differences.
 
 When you find two differing versions of the same file, say "file.good"
 and "file.bad", get a hex dump of the differences:
 
 	hd file.good > file.good.hd
 	hd file.bad > file.bad.hd
 	diff -u file.good.hd file.bad.hd
 
 Do this for a few files and look for patterns; there might be
 something that would suggest where the corruption is happening.
 It would also be well worth trying swapping hardware components
 to see if you can isolate the cause.
 
 Ian
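
When looking for patterns in the hex diffs, it can help to reduce each good/bad pair to a list of differing byte runs and their position within a disk sector. The sketch below is an illustration only; `diff_runs` and `describe` are hypothetical helpers, and the 512-byte sector size is an assumption that should be adjusted to the actual drive.

```python
def diff_runs(good, bad):
    """Return (offset, length) runs where two byte strings differ.

    Only the common prefix is compared if the lengths are unequal.
    """
    runs = []
    start = None
    for i, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            if start is None:
                start = i  # a new run of differing bytes begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        # The final run extends to the end of the compared region.
        runs.append((start, min(len(good), len(bad)) - start))
    return runs


def describe(runs, sector=512):
    """Attach the offset within a sector (assumed 512 bytes) to each run."""
    return [(off, length, off % sector) for off, length in runs]
```

Runs that always start on a sector boundary or always have the same length would point toward the transfer path (controller, cable, driver). Scattered single-bit differences would be more suggestive of bad RAM.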
 

From: Bill Moran <wmoran@iowna.com>
To: Ian Dowse <iedowse@maths.tcd.ie>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: i386/29045: Heavy disk usage causes panic in ffs_blkfree
Date: Wed, 18 Jul 2001 09:50:13 -0400

 Ian Dowse wrote:
 > 
 > In message <3B54C17D.F31029C2@iowna.com>, Bill Moran writes:
 > >Errr ... I tried, but frame 2 considers those symbols undefined.
 > >Did I misunderstand?
 > 
 > Whoops, no I did. It was frame 3 that should contain these symbols,
 > but now that you have an easier way to get corruption, that vmcore
 > is of less interest.
 
 Oddly enough, the md5ing I did last night did not cause a panic. In fact,
 the rsync process ran at 3:00 AM successfully.
 
 I was just talking to the guy we got the hardware from, and my gut instinct
 is to suspect the ata100 - controller or driver. We've got another system
 here that needs to go into production in a few weeks, with the same mobo,
 but a different HDD. We're going to set it up and run some tests to see if
 we can get it to crash.
 One thing that I thought about (feel free to support or refute this based
 on your own experience) is that we're hitting this hardware hard enough
 that we may be exposing driver bugs that others haven't seen. I guess 
 we'll find out (hopefully).
 
 > >I ran two tests during the "make buildworld" (one right after the other)
 > >I ran a diff on the two resultant files and Lo and Behold! there are a
 > >slew of differences in the hashes.
 > 
 > Ok, that's progress anyway, even if it's not progress in the most
 > desirable direction... I think you can pretty much rule out any
 > filesystem bugs here; either the hardware is bad (disk, ATA
 > controllers, RAM etc) or possibly there is something the ATA driver
 > isn't doing right, such as missing a workaround for a known hardware
 > bug.
 > 
 > One thing that would be very useful is if you can collect a number
 > of samples of "good" and "corrupted" versions of the same file.
 > That may be tricky to do, because right now we don't know anything
 > about where the data is being corrupted. Maybe try to make 2 copies
 > of lots of files to another system, and then md5 each and look for
 > differences.
 
 I'll set my alarm for 4:00 AM tomorrow, get up and run an md5 on both
 the problem server (which is a backup server) and the fileserver where
 the files were synced with. At 4:00 AM (right after the sync) there
 shouldn't be any differences yet.
 
 > It would also be well worth trying swapping hardware components
 > to see if you can isolate the cause.
 
 If, come the weekend, we haven't isolated this yet, I'm going to move
 the drive from the promise controller to the via controller and see if
 the problem disappears. My gut instinct is pointing toward the promise
 controller, so I'll try that first. That doesn't rule out everything,
 however, since FreeBSD seems to only recognize the via controller in
 ata66 mode, so it could still be some overall problem with ata100.
 
 -Bill

From: Bill Moran <wmoran@iowna.com>
To: Ian Dowse <iedowse@maths.tcd.ie>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: i386/29045: Heavy disk usage causes panic in ffs_blkfree
Date: Thu, 26 Jul 2001 20:01:33 -0400

 Update:
 
 On Saturday, July 21 I did some work on the machine that's
 having this problem.
 First, I attempted to establish a way to reliably crash the
 machine so I could be sure if/when I fixed it. Unfortunately,
 I was unable to do this. I tried using bonnie++ to cause the
 crash, but it didn't cause any problems. The only thing I can
 seem to get it to crash all the time is to copy and delete
 ~30G of data (~50,000 files) 3 or 4 times. This almost always
 crashes the system by the fourth time.
 
 What I did do, that will hopefully be some help in the resolution
 of this PR, is remove the 80 conductor cable and install a 40
 conductor IDE cable (no other changes whatsoever). The mobo
 BIOS faithfully throttled the drive to ata33. I then put the
 machine back to work, but I've been running additional copy
 operations on the drive whenever the system's not in use. Basically,
 at this point the computer has seen drive activity in excess of 2x
 what it normally takes to cause a panic and has not crashed.
 
 I then ran the md5 test as suggested earlier and had no problems
 between the two runs. In other words, slowing the drive down to
 ata33 seems to have solved the problem.
 
 My question now is: Is there anything more I can do to help isolate
 what really caused this problem? I have a few theories currently:
 
 1) ata100 simply doesn't really work
 2) either the Promise controller or the IBM HDD has a broken
    ata100 implementation
 3) FreeBSD's ata driver doesn't do ata100 properly
 4) IBM is serious when they say the 80 conductor cable can't be
    longer than 18" (the one I had was a few inches longer)
 
 If #3 is the case, I'd like to help fix the problem.
 If #4 is the case, it shouldn't be too hard to avoid it in the future.
 If #1 or 2 is the case, there's the possibility that the ata driver
 can be modified to work around the problem.
 
 Regardless, it would be nice to be able to use ata100 in this and
 future machines. I'm open to suggestions.
 
 -Bill

From: Ian Dowse <iedowse@maths.tcd.ie>
To: Bill Moran <wmoran@iowna.com>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: i386/29045: Heavy disk usage causes panic in ffs_blkfree 
Date: Fri, 27 Jul 2001 10:27:47 +0100

 In message <200107270010.f6R0A1020956@freefall.freebsd.org>, Bill Moran writes:
  
 > My question now is: Is there anything more I can do to help isolate
 > what really caused this problem? I have a few theories currently:
 
 A hex diff of a few examples of file corruption might help suggest
 a cause. If you can get two copies of what should be the same file
 but the md5 checksums differ, run
 
 	hd file.good > file.good.hd
 	hd file.bad > file.bad.hd
 	diff -u file.good.hd file.bad.hd
 
 and try to do this with as many examples of corruption as you can
 find. To get such "good" and "bad" versions of the same file, rerun
 the md5 checks you tried earlier, and then make copies of any files
 that are listed as having different checksums. Hopefully this way
 you can grab a few good and bad versions of the same file.
 
 > First, I attempted to establish a way to reliably crash the
 > machine so I could be sure if/when I fixed it. Unfortunately,
 
 The machine will only crash if the data corruption somehow triggers
 some kernel sanity check. That will be much less likely to occur
 than the real problem, which is the data corruption itself. The
 most reliable way to test if the problem is fixed is to run the
 md5 test while the system is busy.
 
 Ian
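
To illustrate why a panic is only a side effect of the corruption, here is a toy model of a free-block bitmap with a sanity check of the same flavour as the "freeing free block" panic. This is a deliberately simplified sketch in Python and is not the FFS code: the real filesystem tracks fragments in per-cylinder-group bitmaps, and the `FreeMap` class is invented for this example.

```python
class FreeMap:
    """Toy free-block bitmap: one bit per block, 1 = free."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.bits = bytearray((nblocks + 7) // 8)  # all blocks start allocated

    def is_free(self, bno):
        return bool(self.bits[bno // 8] & (1 << (bno % 8)))

    def free(self, bno):
        if self.is_free(bno):
            # Stand-in for the kernel's panic(); with a consistent bitmap
            # and consistent block pointers this is never reached.
            raise RuntimeError("freeing free block %d" % bno)
        self.bits[bno // 8] |= 1 << (bno % 8)
```

If a stray bit flip marks an allocated block as free, nothing happens until that particular block is later released, and only then does the check fire. Corruption elsewhere, for example in file data, never reaches such a check at all.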

From: Bill Moran <wmoran@iowna.com>
To: Ian Dowse <iedowse@maths.tcd.ie>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: i386/29045: Heavy disk usage causes panic in ffs_blkfree
Date: Sat, 27 Oct 2001 23:56:00 -0400

 I finally managed to find some time to look into this again, but I
 have no idea what to make of the results.
 
 Ian Dowse wrote:
 > In message <200107270010.f6R0A1020956@freefall.freebsd.org>, Bill Moran writes:
 > 
 > A hex diff of a few examples of file corruption might help suggest
 > a cause. If you can get two copies of what should be the same file
 > but the md5 checksums differ, run
 > 
 > 	hd file.good > file.good.hd
 > 	hd file.bad > file.bad.hd
 > 	diff -u file.good.hd file.bad.hd
 
 I reconnected the 80 conductor cable earlier today, and intentionally
 recreated the problem so that I could do the hex diff procedure you
 describe. Here's what happened:
 1. Did a find | md5 like you described in previous messages
 2. Started a "make world" process and then started another find | md5
     running simultaneously.
 3. Once I had two files full of md5 checksums, I ran a diff to find
     out which files had become victims of corruption.
 4. I used the diff to pick files to run the hex dump on. I had original
     files on another server, safe from any possible corruption.
 
 The problem is: no differences were found between the two files. Put in
 plain English, I have two files that *should* be identical, but the md5
 checksums differ; however, the hd/diff procedure above shows no differences.
 
 I can only assume that the disk corruption is such that under some
 circumstances the files appear corrupted, but under other circumstances,
 the correct data for the file is available.
 I would still like to isolate this, but I don't know what else to try.
 How do I identify the corruption if it's transient? I guess the first
 thing would be to determine under what conditions the garbled data appears?
 -- 
 "Where's the robot to pat you on the back?"
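
One way to look for transient read corruption of the kind described above is to hash the same file several times and see whether the digest is stable. This is a minimal sketch in Python under that assumption; `read_digests` is a hypothetical helper, not an existing tool.

```python
import hashlib


def read_digests(path, passes=5, chunk=1 << 20):
    """Hash the same file several times and return the set of digests seen."""
    seen = set()
    for _ in range(passes):
        md5 = hashlib.md5()
        with open(path, "rb") as fh:
            # Read in fixed-size chunks so large files need not fit in RAM.
            for block in iter(lambda: fh.read(chunk), b""):
                md5.update(block)
        seen.add(md5.hexdigest())
    return seen
```

More than one digest for an unchanged file means the reads are not repeatable. Note that back-to-back reads may be served from the buffer cache rather than the disk, so the passes probably need to be spaced out or interleaved with other heavy I/O for the test to exercise the drive and controller.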
 

From: Hiten Pandya <hitmaster2k@yahoo.com>
To: freebsd-gnats-submit@FreeBSD.org, wmoran@iowna.com
Cc: iedowse@maths.tcd.ie
Subject: Re: i386/29045: Heavy disk usage causes panic in ffs_blkfree
Date: Wed, 02 Jan 2002 16:01:51 +0000

 hi,
 
 > IdlePTD 3174400
 >
 >     initial pcb at 286640
 >     panicstr: ffs_blkfree: freeing free block
 >     panic messages:
 >     panic: ffs_blkfree: freeing free block
 >
 >     syncing disks... 128 123 110 92 66 34 2 2 2 2 2 2 2 2 2 2 2 2 2 2 
 >     giving up on 2 buffers
 
 people..
 forgive me if I am wrong, and possibly correct me.. :-)
 
 I think the issue is all right here in the boot() function, located in
 src/sys/kern/kern_shutdown.c
 
 The following piece of code I have found:
 (around line 309)
 
 
 ====================
 if (nbusy) {
 	/*
 	 * Failed to sync all blocks. Indicate this and don't
 	 * unmount filesystems (thus forcing an fsck on reboot).
 	 */	
 	printf("giving up on %d buffers\n", nbusy);
 	DELAY(5000000);	/* 5 seconds */
 
 } else {
 	printf("done\n");
 	/*
 	 * Unmount filesystems
 	 */
 	if (panicstr == 0)
 		vfs_unmountall();
 }
 ======================
 
 The reason you are having a corruption is quite clear,
 according to the comment.  The unmount doesn't occur, and forces an
 'fsck' on reboot, thus causing a loss of data or corruption.
 
 I don't have a clue what would happen if the 'unmount' was done when all
 failed to sync.. probably a bad thing would happen.... :-(
 
 What we can do, according to my limited knowledge of the kernel, is to
 print some more debugging output, example:
 
 ===
 giving up on %d buffers...
 Filesystem check (fsck) will occur on next reboot...
 ===
 
 Patch:
 
 --- kern_shutdown.c.orig        Wed Jan  2 15:48:34 2002
 +++ kern_shutdown.c     Wed Jan  2 15:50:52 2002
 @@ -311,6 +311,7 @@
                          * unmount filesystems (thus forcing an fsck on reboot).
                          */
                         printf("giving up on %d buffers\n", nbusy);
 +                       printf("FileSystem check (fsck) will occur on reboot...\n");
                         DELAY(5000000); /* 5 seconds */
                 } else {
                         printf("done\n");
 
 
 regards,
  - Hiten
  - <hiten@uk.FreeBSD.org>
 
 -- 
 Fingerprint:
 1024 45:a5:9c:f2:fb:07:da:70:18:02:0b:f3:63:f1:7a:a6 hitenp@hpdi.ath.cx
State-Changed-From-To: open->feedback 
State-Changed-By: silby 
State-Changed-When: Thu Jan 10 10:21:00 PST 2002 
State-Changed-Why:  
Looking at this PR now, it seems likely that the corruption is 
being caused by bad settings on the via southbridge (fixed in 
4.5), or the DTLA starting to corrupt data (reported elsewhere). 
Have you found either of these two to be the case? 

http://www.FreeBSD.org/cgi/query-pr.cgi?pr=29045 
State-Changed-From-To: feedback->closed 
State-Changed-By: schweikh 
State-Changed-When: Fri Aug 9 12:47:45 PDT 2002 
State-Changed-Why:  
Feedback timeout. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=29045 
>Unformatted:
