From Martin.Birgmeier@aon.at  Fri Jan  5 12:08:50 2001
Return-Path: <Martin.Birgmeier@aon.at>
Received: from email02.aon.at (WARSL401PIP3.highway.telekom.at [195.3.96.75])
	by hub.freebsd.org (Postfix) with SMTP id DC86037B402
	for <FreeBSD-gnats-submit@freebsd.org>; Fri,  5 Jan 2001 12:08:49 -0800 (PST)
Received: (qmail 699608 invoked from network); 5 Jan 2001 20:08:43 -0000
Received: from n807p016.dipool.highway.telekom.at (HELO aon.at) ([212.183.110.208]) (envelope-sender <Martin.Birgmeier@aon.at>)
          by qmail2.highway.telekom.at (qmail-ldap-1.03) with SMTP
          for <FreeBSD-gnats-submit@freebsd.org>; 5 Jan 2001 20:08:43 -0000
Message-Id: <3A5629BC.8147CE64@aon.at>
Date: Fri, 05 Jan 2001 21:08:28 +0100
From: Martin Birgmeier <Martin.Birgmeier@aon.at>
Sender: martin@FreeBSD.ORG
To: FreeBSD-gnats-submit@freebsd.org
Subject: Disk data corruption using FreeBSD_4_2_0_RELEASE

>Number:         24092
>Category:       kern
>Synopsis:       Disk data corruption using FreeBSD_4_2_0_RELEASE
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Jan 05 12:10:01 PST 2001
>Closed-Date:    Sat Jan 20 13:06:07 PST 2001
>Last-Modified:  Sat Jan 20 13:06:59 PST 2001
>Originator:     Martin.Birgmeier@aon.at (Martin Birgmeier)
>Release:        FreeBSD 4.2-RELEASE i386
>Organization:
MBi at home
>Environment:

ASUS A7V, 256 MB main memory
disks as shown below

atapci0: <VIA 82C686 ATA66 controller> port 0xd800-0xd80f at device 4.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
atapci1: <Promise ATA100 controller> port 0x8000-0x803f,0x8400-0x8403,0x8800-0x8807,0x9000-0x9003,0x9400-0x9407 mem 0xd5000000-0xd501ffff irq 10 at device 17.0 on pci0
ad0: 27199MB <ST328040A> [55262/16/63] at ata0-master UDMA33
ad3: 29188MB <ST330630A> [59303/16/63] at ata1-slave UDMA66
acd0: CDROM <TOSHIBA CD-ROM XM-6602B> at ata0-slave using UDMA33
acd1: CD-RW <PLEXTOR CD-R PX-W1210A> at ata1-master using WDMA2
Mounting root from ufs:/dev/ad0s4a

df output:

Filesystem               1K-blocks     Used    Avail Capacity  Mounted on
/dev/ad0s4a                  79359    40819    32192    56%    /
devfs                           16       16        0   100%    dummy_mount
/dev/ad0s4e                  39647     3138    33338     9%    /var
/dev/ad0s4f                8825163  6923321  1195829    85%    /usr
/dev/ad0s4g               13605124  8565455  3951260    68%    /d/5s4g
/dev/ad3s1a                  79359    53254    19757    73%    /d/6s1a
/dev/ad3s4e               22491456  8855727 11836413    43%    /d/6s4e
mfs:34                      127023       16   116846     0%    /tmp
procfs                           4        4        0   100%    /proc
devfs                           16       16        0   100%    /devs
<above>:/usr/X11R6.local  17650326 15748484  1195829    93%    /usr/X11R6.4
pid147@gandalf:/vol              0        0        0   100%    /vol
pid147@gandalf:/users            0        0        0   100%    /users
pid147@gandalf:/srcs             0        0        0   100%    /srcs
pid147@gandalf:/opt              0        0        0   100%    /opt
pid147@gandalf:/d/auto           0        0        0   100%    /d/auto

- Sources obtained via CTM
- checkout via `cvs -R co -rRELENG_4_2_0_RELEASE src'
- buildworld, installworld

(In fact, buildworld stopped on a corrupted source file, which was one of
the earliest hints I got that something was very wrong.)

>Description:

See How-To-Repeat below. Running cmp -x on the files just copied
shows that blocks of size 2 ** n, with n between 6 and 12 inclusive,
are corrupt (sometimes more than one such block in the same
file).
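
The block sizes can be read off the cmp output mechanically. The
following is a minimal sketch, not part of the original test setup:
diff_runs is a hypothetical helper name, and it uses the POSIX `cmp -l'
(1-based decimal offsets) rather than the `cmp -x' used above. It prints
one line per contiguous run of differing bytes, as "start length". A
corrupt block in which some bytes happen to match the original will show
up as several shorter runs, so the lengths are a lower bound.

```sh
#! /bin/sh
# Hypothetical helper (not from the original report): summarize the
# byte ranges in which two files differ.

diff_runs() {
        # cmp -l prints one line per differing byte: the 1-based decimal
        # offset followed by the two byte values in octal.
        cmp -l "$1" "$2" 2>/dev/null | awk '
                # Start a new run whenever the offset is not contiguous.
                NR == 1 || $1 != prev + 1 {
                        if (NR > 1) print start, prev - start + 1
                        start = $1
                }
                { prev = $1 }
                END { if (NR > 0) print start, prev - start + 1 }
        '
}
```

Usage: `diff_runs /d/5s4g/fileX /d/6s4e/file1' after a failed compare.
A run length that is a power of two between 64 and 4096 would match the
pattern described above.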

Fortunately, it is mostly (but not only!) long files that are affected.

>How-To-Repeat:

Use the following shell script. SRC is an existing data file:
% ls -l /d/5s4g/fileX
-rw-r--r--  1 root  wheel  1083285504 Jan  2 14:24 .../fileX
%

----------------------------------------------------------------------
#! /bin/sh

SRC=/d/5s4g/fileX
DST=/d/6s4e/file

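for i in 1 2 3 4 5 6 7 8
do
        echo "*** $i ***"
        # Copy the source file using a 100 MB block size; stop on a dd error.
        dd if="$SRC" of="${DST}$i" bs=102400k || break
        # Compare up to three times, stopping at the first successful compare.
        for j in 1 2 3
        do
                cmp "$SRC" "${DST}$i" && break
        done
done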
----------------------------------------------------------------------

What happens is that in about 50% of the cases, the compare does not
succeed (I once had a case where a compare failed on the first try,
but later succeeded; hence the triple comparison).
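
A compare that fails once and later succeeds suggests that at least some
of the corruption happens while reading, not only while writing. One way
to check this separately is to read the same file several times and see
whether the checksum changes. The following is a minimal sketch under
that assumption, not part of the original test setup: reread_check is a
hypothetical helper name, and it relies only on the POSIX cksum utility.
Repeated reads can be served from the buffer cache, so the check is more
meaningful on a file much larger than main memory.

```sh
#! /bin/sh
# Hypothetical helper (not from the original report): read a file
# several times and check that every pass yields the same checksum.

reread_check() {
        file=$1
        passes=${2:-3}
        first=
        i=0
        while [ "$i" -lt "$passes" ]
        do
                # cksum reads from stdin so its output carries no file name.
                sum=$(cksum < "$file") || return 2
                if [ -z "$first" ]
                then
                        first=$sum
                elif [ "$sum" != "$first" ]
                then
                        echo "unstable: pass $((i + 1)) gave a different checksum"
                        return 1
                fi
                i=$((i + 1))
        done
        echo "stable: $passes identical reads"
}
```

Usage: `reread_check /d/6s4e/file1 3'. An "unstable" result on a file
that nothing is writing to would point at the read path; a "stable"
result on a copy that still differs from the source would point at the
write path.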

This happens most often on large(r) files, which is exactly the reason
why I am using a file of about 1 GB for testing.

Notes: I tested copying a file of 800 MB under Win98 twice - no
problems.  In addition, I installed Suse Linux 7.0 on ad0s3, and
tested copying a file of about 400 MB four times using the above
shell script (as in `for i in 1 2 3 4'...). No problems. The file
sizes are somewhat smaller because I don't have much disk space
devoted to the other environments.

>Fix:

Unknown.

However, I tried the following, without any improvement:

- In /sys/dev/ata/ata-all.c, made ata_umode() return -1 always. As a
  result, the disks used WDMA2.

- In /sys/dev/ata/ata-disk.c, ad_attach() (only one at a time of the
  following items):
  . fixed adp->transfersize at DEV_BSIZE
  . disabled write caching

With all this, I am pretty sure that the problem lies not with my
hardware, but within the vm/buffer subsystem and its interaction
with some other service, possibly malloc (corruptions seem to
always be powers of two in length, see above).

-- 
Martin Birgmeier

Vienna
Austria

>Release-Note:
>Audit-Trail:

From: Martin Birgmeier <Martin.Birgmeier@aon.at>
To: freebsd-gnats-submit@FreeBSD.org
Cc:  
Subject: Re: kern/24092: Disk data corruption using FreeBSD_4_2_0_RELEASE
Date: Sat, 06 Jan 2001 21:05:24 +0100

 Quite (or not so?) unbelievably, upgrading the motherboard's BIOS
 seems to do the trick: from a7v1004c.zip to a7v1005a.zip.
 
 I'll watch it some more and post a final note once everything
 indeed seems to be working well again.
 
 I guess that now some chipset registers are initialized `more
 correctly'. Would be nice if FreeBSD could do that instead of
 relying on the BIOS, but I understand the task involved.
 
 -- 
 Martin Birgmeier
 
 Vienna
 Austria
 

From: Martin Birgmeier <Martin.Birgmeier@aon.at>
To: freebsd-gnats-submit@FreeBSD.org, Martin.Birgmeier@aon.at
Cc:  
Subject: Re: kern/24092: Disk data corruption using FreeBSD_4_2_0_RELEASE
Date: Sat, 20 Jan 2001 21:00:31 +0100

 This indeed seems to have been a BIOS problem. No more data
 corruption since the BIOS update.
 
 Someone please close this PR.
 
 -- 
 Martin Birgmeier
 
 Vienna
 Austria
 
State-Changed-From-To: open->closed 
State-Changed-By: dwmalone 
State-Changed-When: Sat Jan 20 13:06:07 PST 2001 
State-Changed-Why:  
Closed at request of submitter. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=24092 
>Unformatted:
