From philip@paeps.cx  Thu Sep  1 22:43:31 2005
Return-Path: <philip@paeps.cx>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 5809616A420
	for <FreeBSD-gnats-submit@freebsd.org>; Thu,  1 Sep 2005 22:43:31 +0000 (GMT)
	(envelope-from philip@paeps.cx)
Received: from gateway.nixsys.be (gateway.nixsys.be [195.144.77.33])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 37E0F43D5D
	for <FreeBSD-gnats-submit@freebsd.org>; Thu,  1 Sep 2005 22:43:30 +0000 (GMT)
	(envelope-from philip@paeps.cx)
Received: from wotan.home.paeps.cx (wotan.home.paeps.cx [IPv6:2001:6f8:32f:10:a00:20ff:fe9b:138c])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client CN "wotan.home.paeps.cx", Issuer "NixSys CA" (verified OK))
	by gateway.nixsys.be (Postfix) with ESMTP id 6A47AC131;
	Fri,  2 Sep 2005 00:43:28 +0200 (CEST)
Received: from fasolt.home.paeps.cx (unknown [IPv6:2001:6f8:32f:10:20a:e6ff:fe7d:c08])
	by wotan.home.paeps.cx (Postfix) with ESMTP id 58C8961D3;
	Fri,  2 Sep 2005 00:43:27 +0200 (CEST)
Received: from fasolt.home.paeps.cx (philip@localhost [127.0.0.1])
	by fasolt.home.paeps.cx (8.13.4/8.13.4) with ESMTP id j81MhQJ4035599;
	Fri, 2 Sep 2005 00:43:26 +0200 (CEST)
	(envelope-from philip@fasolt.home.paeps.cx)
Received: (from philip@localhost)
	by fasolt.home.paeps.cx (8.13.4/8.13.4/Submit) id j81MhQDY035598;
	Fri, 2 Sep 2005 00:43:26 +0200 (CEST)
	(envelope-from philip)
Message-Id: <200509012243.j81MhQDY035598@fasolt.home.paeps.cx>
Date: Fri, 2 Sep 2005 00:43:26 +0200 (CEST)
From: Philip Paeps <philip@freebsd.org>
To: FreeBSD-gnats-submit@freebsd.org
Cc: apeiron+usenet@coitusmentis.info, bs139412@skynet.be
Subject: FS corruption and 'uncorrectable' DMA errors on ATA disks after unclean shutdown
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         85603
>Category:       kern
>Synopsis:       [ata] FS corruption and 'uncorrectable' DMA errors on ATA disks after unclean shutdown
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    sos
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Sep 01 22:50:18 GMT 2005
>Closed-Date:    Mon Jan 09 12:19:26 GMT 2006
>Last-Modified:  Mon Jan 09 12:19:26 GMT 2006
>Originator:     Philip Paeps
>Release:        FreeBSD 7.0-CURRENT i386
>Organization:
>Environment:
System: FreeBSD fasolt.home.paeps.cx 7.0-CURRENT FreeBSD 7.0-CURRENT #39: Sun
Aug 21 15:52:38 CEST 2005
philip@fasolt.home.paeps.cx:/usr/obj/usr/src/sys/FASOLT i386

>Description:
	
Recently, after a power failure, I experienced some inexplicable problems with
an ATA disk, which could quite possibly be due to hardware.  However, after
having experienced the same problems on a second disk, and discovering, in a
discussion on comp.unix.bsd.freebsd.misc, that others have seen the same sort
of issue, I've begun to suspect a kernel issue.

The first time I saw the problem, the machine initially came up fine, and I
could dirty-mount the filesystem and let bgfsck take care of things.  Soon
after the fsck began, the kernel started spewing out errors along the lines
of 'uncorrectable' and 'dma_read'.  Unfortunately, I've not managed to
reproduce the problem with a loggable console.

After a reboot, the filesystem on the disk refused to mount again.  Manually
forcing an fsck produced complaints about unreadable sectors.  Again, the
kernel spewed out the 'uncorrectable' and 'dma_read' errors.

According to SMART, the disk is quite healthy, though some errors were logged
in the log:

 | Error 387 occurred at disk power-on lifetime: 5315 hours (221 days + 11 hours)
 |   When the command that caused the error occurred, the device was in an unknown state.
 | 
 |   After command completion occurred, registers were:
 |   ER ST SC SN CL CH DH
 |   -- -- -- -- -- -- --
 |   40 51 10 80 00 00 e0  Error: UNC 16 sectors at LBA = 0x00000080 = 128
 | 
 |   Commands leading to the command that caused the error were:
 |   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 |   -- -- -- -- -- -- -- --  ----------------  --------------------
 |   c8 00 10 80 00 00 e0 08      00:09:49.792  READ DMA
 |   25 00 01 ff 87 bd 40 08      00:08:28.160  READ DMA EXT
 |   c8 00 02 00 00 00 e0 08      00:08:28.160  READ DMA
 |   c8 00 01 01 00 00 e0 08      00:08:28.160  READ DMA
 |   c8 00 01 00 00 00 e0 08      00:08:28.160  READ DMA

Four other errors were logged, differing only in error number (decrementing by
one each time - 387 386 385) and LBA address (similarly decrementing).
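
As an aside for anyone decoding such log entries by hand: the register dump
above can be turned back into the reported LBA and error type.  The following
is a minimal sketch in Python, assuming 28-bit LBA addressing (as used by the
READ DMA command, c8) and the standard ATA error-register bit assignments.
The function names are made up for illustration and are not part of smartctl
or the ata(4) driver.

```python
# Illustrative decoder for the ATA task-file registers printed in a SMART
# error log entry. Assumes 28-bit LBA addressing; names are hypothetical.

ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error
    0x40: "UNC",   # uncorrectable data error
    0x10: "IDNF",  # requested sector ID not found
    0x04: "ABRT",  # command aborted
}


def decode_lba28(sn, cl, ch, dh):
    """Combine sector-number, cylinder-low/high and device/head into an LBA.

    Only the low four bits of the device/head register carry LBA bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def decode_error(er):
    """Return the names of the error-register bits that are set."""
    return [name for bit, name in ERROR_BITS.items() if er & bit]


def parse_register_line(line):
    """Parse the seven hex bytes 'ER ST SC SN CL CH DH' from the log."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split()[:7])
    return {
        "errors": decode_error(er),
        "status": st,
        "sectors": sc,
        "lba": decode_lba28(sn, cl, ch, dh),
    }


entry = parse_register_line("40 51 10 80 00 00 e0")
print(entry)
```

For the entry quoted above this gives LBA 128, a sector count of 16 and only
the UNC bit set, which agrees with smartctl's own summary line.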

The funny thing is, after newfsing the disk and restoring the data, all seems
to be working well and happy on the disk.  The first disk I had this problem
with has now been under medium-heavy use again for over a month; the second
disk (see below) has been in use again for two weeks.

In the case of the second disk, the machine panicked shortly after starting the
bgfsck - unfortunately, I wasn't able to capture the panic.  Following the
panic, the machine refused to boot with an LBA error 16 in the boot loader.

Trying to mount the filesystems on another machine, read-only, produced the
same 'uncorrectable' and 'dma_read' errors as seen on the first disk with the
problem.  Forcing fsck also caused the same errors as before.  Possibly an
unrelated issue: ls on some directories on the dirty-mounted ro filesystem
sometimes worked, but cp'ing the files somewhere else panicked the kernel.

Again with the second disk, newfs and restoring data made all work happily
again.  Not a trace of any dma_read errors or uncorrectable reads.

I realize there's not much hard debugging information here, but maybe this
makes sense to a filesystem or ATA guru.  I experienced the problems on 5.x
-STABLE kernels from late May, and -CURRENT kernels from the middle of June
and July.  I've not seen problems since, but then, I've not had any power
failures either.

I'm happy to help debug this further, if indeed it's a software bug, and not
something with flaky hardware.  Cc: Christopher Nehren who reported similar
issues on Usenet and suggested a PR be filed.  He might be able to add more
useful information.

For what it's worth, the disks were Maxtor 6Y200P0 and Maxtor 6E040L0 on a 
VIA 8235 UDMA133 controller and a VIA 8231 UDMA100 controller in my case.

>How-To-Repeat:
	
Lose power or panic the machine with a filesystem on an ATA disk and wait for
phase of moon and other elements of faith to be properly aligned.  I have been
able to reproduce the problem (and the 'working well after newfs') three times
by accident, never yet by force.

>Fix:

Hopefully! :-)
>Release-Note:
>Audit-Trail:
Adding to audit trail from misfiled PR kern/85613:

Date: Thu, 1 Sep 2005 20:20:32 -0400
From: Christopher Nehren <apeiron+usenet@coitusmentis.info>
 
 Here's my environment:
 
 FreeBSD prophecy.dyndns.org 5.4-STABLE FreeBSD 5.4-STABLE #0:
 Sat Jul 16 20:37:57 EDT 2005
 root@prophecy.dyndns.org:/usr/obj/usr/src/sys/PROPHECY  i386
 
 
 > >Description:
 >
 > Recently, after a power failure, I experienced some inexplicable problems with
 > an ATA disk, which could quite possibly be due to hardware.  However, after
 > having experienced the same problems on a second disk, and discovering, in a
 > discussion on comp.unix.bsd.freebsd.misc, that others have seen the same sort
 > of issue, I've begun to suspect a kernel issue.
 
 I don't remember if a power failure was involved in my disk
 issues, but I do remember having some isolated power issues
 around the time that my disk started to misbehave.
 
 > The first time I saw the problem, the machine initially came up fine, and I
 > could dirty-mount the filesystem and let bgfsck take care of things.  Soon
 > after the fsck began, the kernel started spewing out errors along the lines
 > of 'uncorrectable' and 'dma_read'.  Unfortunately, I've not managed to
 > reproduce the problem with a loggable console.
 
 Mine manifested itself differently, with DMA_READ / DMA_WRITE
 errors during normal use.
 
 > After a reboot, the filesystem on the disk refused to mount again.  Manually
 > forcing an fsck, complained about unreadable sectors.  Again, the kernel
 > spewed out the 'uncorrectable' and 'dma_read' errors.
 
 I could fsck mine just fine, but it gave the same error
 repeatedly.
 
 > According to SMART, the disk is quite healthy, though some errors were logged
 > in the log:
 
 SMART (both boot-up and smartctl) say my drive is fine. A
 just-now-performed `smartctl -a` shows no logged errors.
 
 > The funny thing is, after newfsing the disk, and restoring the data, all seems
 > to be working well and happy on the disk.  The first disk I had this problem
 > with, has now been under medium heavy use again for over a month, the second
 > disk (see below) has been in use again for two weeks.
 
 I too can say that a newfs curiously enough seems to cure the
 problem. I was able to restore approximately 10 GB of backups to
 the disk and wrote a GB or two more to it the day that I did the
 restoration without any difficulties, and today has seen light
 reads and writes.
 
 > I'm happy to help debug this further, if indeed it's a software bug, and not
 > something with flaky hardware.  Cc: Christopher Nehren who reported similar
 > issues on Usenet and suggested a PR be filed.  He might be able to add more
 > useful information.
 
 If anything more is requested, please advise. I (*gasp*) work
 now, though, so I don't have quite so many tuits as I used to
 have.
 
 > For what it's worth, the disks were Maxtor 6Y200P0 and Maxtor 6E040L0 on a
 > VIA 8235 UDMA133 controller and a VIA 8231 UDMA100 controller in my case.
 
 My experience is with a Seagate ST3120026A on a SiS 730 UDMA100
 controller.
 
 > >How-To-Repeat:
 >
 > Lose power or panic the machine with a filesystem on an ATA disk and wait for
 > phase of moon and other elements of faith to be properly aligned.  I have been
 > able to reproduce the problem (and the 'working well after newfs') three times
 > by accident, never yet by force.
 
 I can't provide any more definitive information on how to
 reproduce this phenomenon other than seconding the notion that
 it does seem fairly random. Someone else from Usenet might be
 able to do so; Philip (thanks!) mentioned it in the Usenet
 thread and volunteered to corral some others into contributing.
 
 Best regards,
 Christopher Nehren
 
 -- 
 I abhor a system designed for the "user", if that word is a coded
 pejorative meaning "stupid and unsophisticated". -- Ken Thompson
 If you ask questions of idiots, you get "Joel on Software".
 Unix is user friendly. However, it isn't idiot friendly.
 

Adding to audit trail from followup to misfiled PR kern/85613:

From: Christopher Nehren <apeiron@coitusmentis.info>
Date: Fri, 16 Sep 2005 19:16:36 -0400

 Seems that running newfs didn't quite fix it for me. The day 
 after I said it worked fine, it started giving errors.
 
 I selected the link on the Web page interface to the PR 
 interface for this bug, so I really hope it's going to the 
 right place ... if it's not, I'm going to cry. :(

From: Søren Schmidt <sos@deepcore.dk>
To: bug-followup@FreeBSD.org, philip@FreeBSD.org
Cc:  
Subject: Re: kern/85603: [ata] FS corruption and 'uncorrectable' DMA errors
 on ATA disks after unclean shutdown
Date: Wed, 04 Jan 2006 20:24:17 +0100

 Uncorrectable errors can very easily be encountered if there was a write
 operation on a drive during a power failure. The problem is that the
 sector(s) aren't completely written and the ECC information isn't updated
 correctly on the media. This can render anything from one sector to an
 entire track unreadable (as seen from the disk's perspective, not
 necessarily the geometry used), due to the way data are stored on modern
 disks.
 Now, if the disk still has spare sectors available, a write to the bad
 sectors will be remapped and the disk will seem to be functional again.
 
 I have seen at least a dozen modern systems that actually power-cycle
 during reset, which can be a very bad idea if the drives have decided to
 write out cache buffers etc. during a reboot.
 
 -Søren
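
The remapping behaviour described above suggests that overwriting only the
affected sectors, rather than running newfs on the whole disk, might be enough
to make the drive rewrite or reallocate them.  A minimal sketch in Python
follows.  It assumes 512-byte sectors and that the data in those sectors is
already lost.  The function name and the device path in the comment are
hypothetical, and this has not been tried against the disks in this PR.

```python
import os

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def zero_sectors(path, lba, count, sector_size=SECTOR_SIZE):
    """Overwrite `count` sectors starting at `lba` with zeros.

    This destroys whatever was stored in those sectors. Returns the number
    of bytes the write call reports as written.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        # Seek to the byte offset of the first sector to rewrite.
        os.lseek(fd, lba * sector_size, os.SEEK_SET)
        written = os.write(fd, bytes(count * sector_size))
        # Ask the OS to push the data out to the device.
        os.fsync(fd)
        return written
    finally:
        os.close(fd)


# Hypothetical invocation for the 16 sectors at LBA 128 from the SMART log
# (the device name is an assumption; do not run on a disk holding live data):
# zero_sectors("/dev/ad0", 128, 16)
```

Whether the drive then remaps the sectors or simply rewrites them in place
with fresh ECC is up to its firmware; either way a subsequent read of the
range should succeed if the explanation above applies.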
 
 
State-Changed-From-To: open->analyzed 
State-Changed-By: netchild 
State-Changed-When: Sun Jan 8 16:44:05 UTC 2006 
State-Changed-Why:  
Hand over to ATA maintainer, he already analyzed the problem and can 
decide if the PR should be closed or not. 


Responsible-Changed-From-To: freebsd-bugs->sos 
Responsible-Changed-By: netchild 
Responsible-Changed-When: Sun Jan 8 16:44:05 UTC 2006 
Responsible-Changed-Why:  
Hand over to ATA maintainer, he already analyzed the problem and can 
decide if the PR should be closed or not. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=85603 
State-Changed-From-To: analyzed->closed 
State-Changed-By: sos 
State-Changed-When: Mon Jan 9 12:18:35 UTC 2006 
State-Changed-Why:  
Explanation given above, not fixable in software. 

>Unformatted:
