From nobody@FreeBSD.org  Wed Dec 16 16:21:11 2009
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1009F106566C
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 16 Dec 2009 16:21:11 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id F2DD38FC14
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 16 Dec 2009 16:21:10 +0000 (UTC)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.3/8.14.3) with ESMTP id nBGGLA13035558
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 16 Dec 2009 16:21:10 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.3/8.14.3/Submit) id nBGGLAF8035555;
	Wed, 16 Dec 2009 16:21:10 GMT
	(envelope-from nobody)
Message-Id: <200912161621.nBGGLAF8035555@www.freebsd.org>
Date: Wed, 16 Dec 2009 16:21:10 GMT
From: Tom Payne <Tom.Payne@unige.ch>
To: freebsd-gnats-submit@FreeBSD.org
Subject: zfs corruption on adaptec 5805 raid controller
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         141685
>Category:       kern
>Synopsis:       [zfs] zfs corruption on adaptec 5805 raid controller
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    pjd
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Dec 16 16:30:01 UTC 2009
>Closed-Date:    Thu Apr 29 15:44:37 UTC 2010
>Last-Modified:  Tue Feb 28 13:10:09 UTC 2012
>Originator:     Tom Payne
>Release:        8.0-RELEASE
>Organization:
ISDC
>Environment:
FreeBSD isdc3202.isdc.unige.ch 8.0-RELEASE FreeBSD 8.0-RELEASE #0: Sat Nov 21 15:02:08 UTC 2009     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

>Description:
Short version:

zfs on a new 5.44T Adaptec 5805 hardware RAID5 partition reports many checksum errors, yet tests of the underlying hardware (controller verify, memory test) report no faults.

Long version:

I have an Adaptec RAID 5805 controller with eight 1TB SAS disks:

# dmesg | grep aac
aac0: <Adaptec RAID 5805> mem 0xfbc00000-0xfbdfffff irq 16 at device 0.0 on pci9
aac0: Enabling 64-bit address support
aac0: Enable Raw I/O
aac0: Enable 64-bit array
aac0: New comm. interface enabled
aac0: [ITHREAD]
aac0: Adaptec 5805, aac driver 2.0.0-1
aacp0: <SCSI Passthrough Bus> on aac0
aacp1: <SCSI Passthrough Bus> on aac0
aacp2: <SCSI Passthrough Bus> on aac0
aacd0: <RAID 5> on aac0
aacd0: 16370MB (33525760 sectors)
aacd1: <RAID 5> on aac0
aacd1: 6657011MB (13633558528 sectors)


It's configured with a small partition (aacd0) for the root filesystem, the rest (aacd1) is a single large zpool:
# zpool create tank aacd1
# zfs list | head -n 2
NAME                                       USED  AVAIL  REFER  MOUNTPOINT
tank                                       792G  5.44T    18K  none


After a few days of light use (rsync'ing data from older disk servers), zfs reports lots of checksum errors:

# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 1h17m with 49 errors on Mon Dec 14 13:35:50 2009
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0    98
          aacd1     ONLINE       0     0   196

These 49 errors are in various files scattered across the 200+ zfs filesystems on the disk.
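
(The paths of the affected files can be listed with the -v flag of zpool status, which appends a "Permanent errors have been detected in the following files:" section to the output above:

# zpool status -v tank
)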


/var/log/messages contains, for example:
# grep ZFS /var/log/messages
Dec 14 13:23:50 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=79622307840 size=131072
Dec 14 13:23:50 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=79622307840 size=131072
Dec 14 13:23:50 isdc3202 root: ZFS: zpool I/O failure, zpool=tank error=86
Dec 14 13:27:47 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=77752696832 size=131072
Dec 14 13:27:47 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=77752696832 size=131072
Dec 14 13:27:47 isdc3202 root: ZFS: zpool I/O failure, zpool=tank error=86
Dec 14 13:28:07 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=1409111293952 size=131072
Dec 14 13:28:07 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=1409111293952 size=131072
Dec 14 13:28:07 isdc3202 root: ZFS: zpool I/O failure, zpool=tank error=86


The 49 checksum errors occur at 49 different offsets in three distinct ranges:
  70743228416..  84649705472 ( 6)
1406828281856..1441780858880 (14)
2749871030272..2817199702016 (29)
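
(For reference, the offset list above can be reproduced from the logs with a pipeline along these lines; the sed expression assumes the exact log format shown above:

# grep 'checksum mismatch' /var/log/messages | sed 's/.*offset=\([0-9]*\).*/\1/' | sort -nu
)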


The Adaptec controller firmware was updated to the latest version (at the time of writing) after the first errors were observed.  More errors have been observed since the firmware update.
# arcconf getversion
Controllers found: 1
Controller #1
==============
Firmware           : 5.2-0 (17544)
Staged Firmware    : 5.2-0 (17544)
BIOS               : 5.2-0 (17544)
Driver             : 5.2-0 (17544)
Boot Flash         : 5.2-0 (17544)


I ran a verify task on the RAID controller with:
# arcconf task start 1 logicaldrive 1 verify noprompt
As far as I can tell, this verify task did not find any errors.  The array status is still reported as "optimal" and there seems to be nothing in the logs.
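
(Task progress and the controller event log can be double-checked with arcconf's getstatus and getlogs subcommands; I'm quoting the syntax from memory, so check arcconf's help output:

# arcconf getstatus 1
# arcconf getlogs 1 event
)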


A 24 hour memory test with memtest86+ version 4.00 did not detect any memory errors.


Previously, problems have been found with zfs on USB drives:
http://lists.freebsd.org/pipermail/freebsd-current/2009-April/005510.html


As I understand it, the situation is:
- zfs has checksum errors
- the hardware RAID believes that the data on disk is consistent
- there are no obvious memory problems


Could this be a FreeBSD bug?

>How-To-Repeat:
Unknown
>Fix:
Unknown

>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->freebsd-fs 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Wed Dec 16 17:54:45 UTC 2009 
Responsible-Changed-Why:  
Over to maintainer(s). 

http://www.freebsd.org/cgi/query-pr.cgi?pr=141685 
State-Changed-From-To: open->feedback 
State-Changed-By: pjd 
State-Changed-When: Sat 20 Mar 2010 00:27:29 UTC 
State-Changed-Why:  
This is unlikely to be a ZFS bug. It's hard to tell whether the problem is in the 
driver, controller, cables, disks, etc. 
You could try to configure geli(8) with data integrity verification on top of 
this array and see if geli also reports problems. 
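
A minimal sketch of such a geli setup, assuming the device name from the 
report above (the pool name testtank is illustrative; geli init asks for a 
passphrase; the NULL cipher, where supported, skips encryption so that only 
integrity is exercised; note that this destroys the existing pool on aacd1): 

# geli init -e null -a hmac/sha256 /dev/aacd1 
# geli attach /dev/aacd1 
# zpool create testtank /dev/aacd1.eli 

geli will then return EIO and log an integrity error for any sector whose 
HMAC fails to verify on read. 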


Responsible-Changed-From-To: freebsd-fs->pjd 
Responsible-Changed-By: pjd 
Responsible-Changed-When: Sat 20 Mar 2010 00:27:29 UTC 
Responsible-Changed-Why:  
I'll take this one. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=141685 

From: Tom Payne <tom@tompayne.org>
To: bug-followup@FreeBSD.org, Tom.Payne@unige.ch
Cc:  
Subject: Re: kern/141685: [zfs] zfs corruption on adaptec 5805 raid controller
Date: Thu, 29 Apr 2010 17:07:12 +0200

 Since upgrading the firmware, the corruption is no longer manifesting itself.
 
 So, I think Gary's analysis is correct: this was an Adaptec problem,
 not a FreeBSD one.
 
 Thank you for your time and please mark this bug as invalid.
 -- 
 Tom
State-Changed-From-To: feedback->closed 
State-Changed-By: pjd 
State-Changed-When: Thu 29 Apr 2010 15:43:53 UTC 
State-Changed-Why:  
Closed per submitter's request. 
This was an Adaptec firmware bug, not a ZFS bug. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=141685 

From: Ed Maste <emaste@freebsd.org>
To: <bug-followup@FreeBSD.org>, <Tom.Payne@unige.ch>
Cc:  
Subject: Re: kern/141685: [zfs] zfs corruption on adaptec 5805 raid controller
Date: Tue, 28 Feb 2012 07:52:24 -0500

 If you have the information available, can you please add the firmware
 version that fixed the problem for you (for the sake of anyone finding
 this bug report in a future search)?
 
 Regards,
 Ed
>Unformatted:
