From nobody@FreeBSD.org  Wed Feb 14 14:35:48 2001
Return-Path: <nobody@FreeBSD.org>
Received: from freefall.freebsd.org (freefall.freebsd.org [216.136.204.21])
	by hub.freebsd.org (Postfix) with ESMTP id ECB6037B4EC
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 14 Feb 2001 14:35:47 -0800 (PST)
Received: (from nobody@localhost)
	by freefall.freebsd.org (8.11.1/8.11.1) id f1EMZlP79180;
	Wed, 14 Feb 2001 14:35:47 -0800 (PST)
	(envelope-from nobody)
Message-Id: <200102142235.f1EMZlP79180@freefall.freebsd.org>
Date: Wed, 14 Feb 2001 14:35:47 -0800 (PST)
From: mjh@aciri.org
To: freebsd-gnats-submit@FreeBSD.org
Subject: file corruption with Adaptec 29160 SCSI adapter
X-Send-Pr-Version: www-1.0

>Number:         25104
>Category:       kern
>Synopsis:       file corruption with Adaptec 29160 SCSI adapter
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Feb 14 14:40:01 PST 2001
>Closed-Date:    Fri May 11 16:12:36 PDT 2001
>Last-Modified:  Fri May 11 16:13:24 PDT 2001
>Originator:     Mark Handley
>Release:        4.2-RELEASE
>Organization:
ACIRI
>Environment:
gaur.aciri.org: uname -a
FreeBSD gaur.aciri.org 4.2-RELEASE FreeBSD 4.2-RELEASE #1: Sat Jan 20 20:49:54 PST 2001     root@gaur.aciri.org:/usr/src/sys/compile/ACIRI-4.2-USB  i386   
>Description:
I've got five 1.1GHz Athlon systems, running FreeBSD 4.2R with 512MB
RAM, Asus A7V motherboards, Adaptec 29160 U160 SCSI adaptors, and
SEAGATE ST318451LW 18GB drives.  The problem is I'm seeing file 
corruption when I write large (approx. 512MB or larger) files,
especially when I write them rapidly.  I can't guarantee it doesn't 
happen with smaller files, but I wrote a thousand 100MB files, and 
not one of them was corrupted.

The problem basically is that the files get 64-byte chunks (usually 64,
sometimes smaller) of other data in the middle of them.  I first
noticed the problem with scp, but the problem also happens with 
moderate repeatability when simply rapidly writing a big file by 
redirecting stdout.  


Here's the quick-hack test program:

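#include <stdio.h>
#define FSIZE (1000*1024*1024)      /* bytes to write: 1000MB */
int main(void) {
  int i, j;
  int buf[1024];                    /* one 4096-byte block of 32-bit words */
  j = 0;
  for (i = 0; i < FSIZE/4; i++) {
    buf[j] = i;                     /* each word holds its own index */
    if (j == 1023) {
      fwrite(buf, 1024, 4, stdout); /* flush the full 4096-byte block */
      j = 0;
    } else {
      j++;
    }
  }
  return 0;
}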

Basically it's writing 1000MB to stdout, writing incrementing values
to each 32-bit word.  I direct stdout to a file.  The MD5 checksum of
the output file should be 1da068574fdb3e3b9ffc3b2022cca171, but
sometimes (somewhere between 1-in-3 and 1-in-10 tries) the file gets
corrupted.

The program to read this back is:

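#include <stdio.h>
#define FSIZE (1000*1024*1024)      /* expected file size in bytes */
int main(void) {
  int i;
  int j, prev = 0;
  int mode = 0;                     /* 0: data matches, 1: inside a bad run */
  for (i = 0; i < FSIZE/4; i++) {
    if (fread(&j, 4, 1, stdin) != 1) {
      printf("short read at word: %d\n", i);
      return 1;
    }
    if (mode == 0) {
      if (i != j) {                 /* first bad word of a run */
        printf("-----------------------------\n");
        printf("problem start at word: %d\n", i);
        printf("got value %d instead of %d\n", j, i);
        mode = 1;
      }
    } else {
      if (i == j) {                 /* data is good again: report the run end */
        printf("-----------------------------\n");
        printf("last word of problem : %d\n", i-1);
        printf("got value %d instead of %d\n", prev, i-1);
        mode = 0;
      }
    }
    prev = j;
  }
  return 0;
}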

Here's one sample output, where there are two separate corruptions:

gaur.aciri.org: ./unfoo3 < t4
-----------------------------
problem start at word: 114561360
got value 909456435 instead of 114561360

got value 171522103 instead of 114561361
got value 875770417 instead of 114561362
got value 943142453 instead of 114561363
got value 842074681 instead of 114561364
got value 909456435 instead of 114561365
got value 171522103 instead of 114561366
got value 875770417 instead of 114561367
got value 943142453 instead of 114561368
got value 842074681 instead of 114561369
got value 909456435 instead of 114561370
got value 171522103 instead of 114561371
got value 875770417 instead of 114561372
got value 943142453 instead of 114561373
got value 842074681 instead of 114561374
got value 909456435 instead of 114561375
-----------------------------
last word of problem : 114561375
got value 909456435 instead of 114561375
-----------------------------
problem start at word: 237338864
got value 112460016 instead of 237338864
 
got value 112460017 instead of 237338865
got value 112460018 instead of 237338866
got value 112460019 instead of 237338867
got value 112460020 instead of 237338868
got value 112460021 instead of 237338869
got value 112460022 instead of 237338870
got value 112460023 instead of 237338871
got value 112460024 instead of 237338872
got value 112460025 instead of 237338873
got value 112460026 instead of 237338874
got value 112460027 instead of 237338875
got value 112460028 instead of 237338876
got value 112460029 instead of 237338877
got value 112460030 instead of 237338878
got value 112460031 instead of 237338879
-----------------------------
last word of problem : 237338879
got value 112460031 instead of 237338879

In this case, there are two corruptions.  The first corruption seems
to be some random chunk of data; the second (more typical) corruption
seems to be a copy of an earlier piece of the file.
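
As a rough aid for looking at the first corruption above, here is a small
sketch that decodes each 32-bit word into its four bytes, assuming the
little-endian byte order of these i386 machines.  The function names are
made up for illustration.  Decoded this way, the five distinct values in
the first run come out as "3456", "789\n", "1234", "5678" and "9\n12",
i.e. the ASCII text "123456789\n" repeated.  That may mean the chunk is
text from some other buffer rather than random noise, but that reading
is only an inference from the numbers.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Decode one 32-bit word into its four bytes, lowest byte first
 * (little-endian, as on i386).  Non-printable bytes, including
 * newline, are shown as '.'.  out must have room for 5 chars. */
void word_to_ascii(uint32_t w, char out[5])
{
    for (int k = 0; k < 4; k++) {
        unsigned char c = (unsigned char)((w >> (8 * k)) & 0xff);
        out[k] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[4] = '\0';
}

/* Hypothetical helper: print a run of words as decoded text. */
void print_run_as_text(const uint32_t *words, size_t n)
{
    char text[5];
    for (size_t k = 0; k < n; k++) {
        word_to_ascii(words[k], text);
        fputs(text, stdout);
    }
    putchar('\n');
}
```

Passing the sixteen values of the first run to print_run_as_text() shows
the repeating digit pattern directly.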

In most cases, the corruption seems to be of a 64-byte
chunk of the file replaced with some other data, typically (but not
always) an earlier chunk of the same file.  I've never seen more than
64 bytes corrupted, but on one of the machines I've seen
smaller corruptions.
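
The properties described above (run length, 64-byte alignment, and
whether a run looks like earlier file data) can be checked mechanically.
The following is a minimal sketch, not the program used for this report;
struct bad_run and find_bad_runs() are hypothetical names.  It works on
a window of the file already read into memory, and treats a run as an
"earlier copy" when its values are consecutive and start below the
expected value.  Like the reader above, it ends a run at the first word
that happens to match its index.

```c
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of words whose value differs from the expected one. */
struct bad_run {
    size_t start_word;   /* index (within the window) of the first bad word */
    size_t nwords;       /* length of the run in 32-bit words */
    int    aligned64;    /* nonzero if the file byte offset is a multiple of 64 */
    int    earlier_copy; /* nonzero if the run looks like earlier file data */
};

/* Scan words[0..n), where words[i] is expected to equal base + i.
 * Stores at most max runs in out[] and returns the total number of
 * runs found, which may be larger than max. */
size_t find_bad_runs(const uint32_t *words, size_t n, uint32_t base,
                     struct bad_run *out, size_t max)
{
    size_t found = 0;
    size_t i = 0;

    while (i < n) {
        if (words[i] == base + (uint32_t)i) {
            i++;
            continue;
        }
        size_t start = i;
        /* Candidate copy of earlier data: value lies before this position. */
        int copy = words[start] < base + (uint32_t)start;
        while (i < n && words[i] != base + (uint32_t)i) {
            if (words[i] != words[start] + (uint32_t)(i - start))
                copy = 0;            /* values are not consecutive */
            i++;
        }
        if (found < max) {
            out[found].start_word = start;
            out[found].nwords = i - start;
            out[found].aligned64 = (((uint64_t)base + start) * 4) % 64 == 0;
            out[found].earlier_copy = copy;
        }
        found++;
    }
    return found;
}
```

For the sample output above, both runs start at a word index that is a
multiple of 16 (a 64-byte boundary) and are 16 words long, and the
second run would be flagged as an earlier copy.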


I originally thought this was a hardware problem, but I've reproduced
it on the three identical machines I've tried, so if it is a hardware
fault, it's in the whole batch.  I've also tried to reproduce it on
an additional 1GHz Athlon/A7V machine with an Adaptec 2940 SCSI
adaptor, but that machine doesn't suffer from the same
problem, so I'm beginning to suspect an interaction between the 
Adaptec 29160 driver and the filesystem when writing large files
as being a possible cause.

Here's the dmesg.boot from one of the problem machines in case it helps.

gaur.aciri.org: more /var/run/dmesg.boot
Copyright (c) 1992-2000 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 4.2-RELEASE #1: Sat Jan 20 20:49:54 PST 2001
    root@gaur.aciri.org:/usr/src/sys/compile/ACIRI-4.2-USB
Timecounter "i8254"  frequency 1193182 Hz
CPU: AMD Athlon(tm) Processor (1109.89-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x642  Stepping = 2
  Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
  AMD Features=0xc0440000<<b18>,AMIE,DSP,3DNow!>
real memory  = 536788992 (524208K bytes)
avail memory = 518864896 (506704K bytes)
Preloaded elf kernel "kernel" at 0xc03c8000.
Pentium Pro MTRR support enabled
md0: Malloc disk
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Host to PCI bridge> on motherboard
pci0: <PCI bus> on pcib0
pcib2: <PCI to PCI bridge (vendor=1106 device=8305)> at device 1.0 on pci0
pci1: <PCI bus> on pcib2
isab0: <VIA 82C686 PCI-ISA bridge> at device 4.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <VIA 82C686 ATA66 controller> port 0xd800-0xd80f at device 4.1 on pci0
ata1: at 0x170 irq 15 on atapci0
pci0: <VIA 83C572 USB controller> at 4.2 irq 12
pci0: <VIA 83C572 USB controller> at 4.3 irq 12
fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0xa400-0xa43f mem 0xd6800000-0xd68fffff,0xd7000000-0xd7000fff irq 10 at device 11.0 on pci0
fxp0: Ethernet address 00:02:b3:10:b4:67
pci0: <3D Labs model 000a graphics accelerator> at 12.0 irq 11
ahc0: <Adaptec 29160 Ultra160 SCSI adapter> port 0xa000-0xa0ff mem 0xd5800000-0xd5800fff irq 12 at device 13.0 on pci0
aic7892: Wide Channel A, SCSI Id=7, 32/255 SCBs
atapci1: <Promise ATA100 controller> port 0x8400-0x843f,0x8800-0x8803,0x9000-0x9007,0x9400-0x9403,0x9800-0x9807 mem 0xd5000000-0xd501ffff irq 10 at device 17.0 on pci0
pcib1: <Host to PCI bridge> on motherboard
pci2: <PCI bus> on pcib1
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
DUMMYNET initialized (000608)
IP packet filtering initialized, divert disabled, rule-based forwarding disabled, default to deny, logging disabled
acd0: CDROM <SONY CDU4811> at ata1-master using PIO4
Waiting 5 seconds for SCSI devices to settle
Mounting root from ufs:/dev/da0s1a
da0 at ahc0 bus 0 target 0 lun 0
da0: <SEAGATE ST318451LW 0003> Fixed Direct Access SCSI-3 device
da0: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da0: 17501MB (35843671 512 byte sectors: 255H 63S/T 2231C)  
>How-To-Repeat:
Write several very large files rapidly (see above).   Some fraction 
of them will be corrupted (I see between 5% and 25% of 512MB files
get corrupted).
>Fix:


>Release-Note:
>Audit-Trail:

From: Mark Handley <mjh@aciri.org>
To: freebsd-gnats-submit@FreeBSD.org, mjh@aciri.org
Cc:  
Subject: Re: kern/25104: file corruption with Adaptec 29160 SCSI adapter
Date: Mon, 12 Mar 2001 18:51:21 -0800

 I still haven't found a cure for this, but here are a few things I've
 tried:
 
  - configure the adaptor to only use 80MB/s
  - disable write caching in the adaptor
  - build a kernel that has tagged queuing disabled for this particular
    Seagate drive.
 
 So far, no luck.  I'm still getting file corruption.
 
 Of the corrupted files I've looked at, here's where the corruptions
 start (rounded to the nearest MB):
 
 192MB, 208MB, 496MB, 256MB, 0.9MB, 437MB, 905MB, 656MB, 576MB, 672MB,
 512MB, 672MB, 400MB, 643MB, 752MB.
 
 The 0.9MB corruption seems to be an anomaly amongst my anomalies - when
 I was writing 100MB files, I wrote thousands of files and never saw
 corruption.  When writing 768MB files, I'm seeing about one file in
 seven get corrupted.
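
For reference, a word index printed by the reader program can be turned
into an offset in MB with a little arithmetic.  This is only a sketch
with made-up function names, taking 1MB as 1024*1024 bytes.  Applied to
the two start words in the sample output of the original report
(114561360 and 237338864) it gives 437 and 905, which match two entries
in the list above, although the report does not state that they are the
same corruptions.

```c
#include <stdint.h>

/* Byte offset in the file of the 32-bit word with the given index. */
uint64_t word_to_byte_offset(uint32_t word_index)
{
    return (uint64_t)word_index * 4;
}

/* Byte offset rounded to the nearest MB (1MB = 1024*1024 bytes). */
uint32_t byte_offset_to_mb(uint64_t offset)
{
    const uint64_t mb = 1024u * 1024u;
    return (uint32_t)((offset + mb / 2) / mb);
}
```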
 
 BTW, if someone wants remote access to help track this down, I can
 oblige.
 
 Cheers,
        Mark
 

From: Mark Handley <mjh@aciri.org>
To: freebsd-gnats-submit@FreeBSD.org, mjh@aciri.org
Cc:  
Subject: Re: kern/25104: file corruption with Adaptec 29160 SCSI adapter
Date: Tue, 24 Apr 2001 07:43:47 -0700

 Another update:
 
  - the problem also occurs on FreeBSD 4.3-RELEASE
  - the problem occurs with an IBM DDYS-T18350N 18GB Ultra 160 drive,
    although less frequently
  - the problem occurs with an IBM DDRS-39130D 9GB Ultra 2 drive,
    although much less frequently (took writing 250 768MB files for it
    to occur).
 
 Not sure if this helps much, but at least it's a few more data points.
 
  - Mark
 

From: Mark Handley <mjh@aciri.org>
To: freebsd-gnats-submit@FreeBSD.org, mjh@aciri.org
Cc:  
Subject: Re: kern/25104: file corruption with Adaptec 29160 SCSI adapter
Date: Thu, 03 May 2001 10:56:14 -0700

 OK, the problem isn't a SCSI problem.
 
 I just put an ATA100 disk in one of these machines, and removed the SCSI
 controller card - the problem is just the same - large files get
 corrupted with about a 1-in-5 probability.
 
 So, what's the real problem?  Is this a fundamental problem with the
 VIA 82C686A south bridge?  The Register mentions something that sounds
 similar:
 http://www.theregister.co.uk/content/3/18267.html
 
 But this is with the 686B, not the 686A on the original Asus A7V boards
 that my machines have.
 
 I suppose it's also possible there's a timing hole in the FreeBSD
 filesystem code, but this seems unlikely to me.
 
  - Mark
 

From: Mark Handley <mjh@aciri.org>
To: freebsd-gnats-submit@FreeBSD.org, mjh@aciri.org
Cc:  
Subject: Re: kern/25104: file corruption with Adaptec 29160 SCSI adapter
Date: Thu, 03 May 2001 12:03:11 -0700

 Just for another data point, I moved the IDE disk to the ATA66
 controller (built into the south bridge) rather than the separate
 Promise ATA100 controller (which is on the motherboard on the A7V).
 Same problem, although rather than seeing 64-byte chunks of corrupted
 data, as I did with SCSI, I'm seeing 4K chunks of data from elsewhere
 in the file (i.e. the file contains two copies of one 4K chunk, and no
 copy of another 4K chunk).
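
A duplicated 4K chunk of this kind can be identified from the test file
itself, because each 4096-byte block starts with a word equal to 1024
times its block number.  The sketch below is an illustration with
hypothetical function names, not the tool used here.  It assumes the
file (or a window starting on a block boundary at block 0) has been read
into memory as 32-bit words.

```c
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_BLOCK 1024u   /* 4096 bytes / 4 bytes per word */

/* Given the 1024 words of one block, return the number of the block
 * whose data it holds, or -1 if it is not an intact test-file block. */
long block_source(const uint32_t *block)
{
    uint32_t first = block[0];

    if (first % WORDS_PER_BLOCK != 0)
        return -1;
    for (uint32_t k = 1; k < WORDS_PER_BLOCK; k++)
        if (block[k] != first + k)
            return -1;
    return (long)(first / WORDS_PER_BLOCK);
}

/* For each of nblocks blocks, record in src[] which block's data it
 * holds (-1 if unrecognised).  Returns how many blocks hold an intact
 * copy of some other block. */
size_t find_misplaced_blocks(const uint32_t *words, size_t nblocks, long *src)
{
    size_t misplaced = 0;

    for (size_t b = 0; b < nblocks; b++) {
        src[b] = block_source(words + b * WORDS_PER_BLOCK);
        if (src[b] >= 0 && (size_t)src[b] != b)
            misplaced++;
    }
    return misplaced;
}
```

A block b with src[b] different from b is the second copy, and b itself
is the chunk that is missing from the file.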
 
 Given this, it seems unlikely to me that this is a software problem.  If
 it had been, I'd have expected the size of the corruptions to be similar
 in both cases.
 
 But I'm really confused - this seems to be the sort of problem that
 someone else must have seen.
 
  - Mark
 

From: Mark Handley <mjh@aciri.org>
To: freebsd-gnats-submit@FreeBSD.org, mjh@aciri.org
Cc:  
Subject: Re: kern/25104: file corruption with Adaptec 29160 SCSI adapter
Date: Fri, 11 May 2001 15:30:32 -0700

 I've now replaced the Asus A7V motherboard with the new Asus A7A266,
 which has an ALi south bridge instead of the VIA 82C686A.  The rest of
 the system is unchanged.  The problem has now gone away.
 Thus I conclude that VIA's south bridge is most likely the problem, and
 that this is unlikely to be a FreeBSD issue, and very unlikely to be a
 SCSI problem.
 
 I think this Problem Report can be closed now.
 
 Cheers,
            Mark
 
 
 
State-Changed-From-To: open->closed 
State-Changed-By: greid 
State-Changed-When: Fri May 11 16:12:36 PDT 2001 
State-Changed-Why:  
Closed at originator's request 

http://www.FreeBSD.org/cgi/query-pr.cgi?pr=25104 
>Unformatted:
