From ath@niksun.com Wed Feb 24 14:39:13 1999
Return-Path: <ath@niksun.com>
Received: from arjun.niksun.com (gw.niksun.com [206.20.52.122])
	by hub.freebsd.org (Postfix) with ESMTP id F1E9F1113B
	for <FreeBSD-gnats-submit@freebsd.org>; Wed, 24 Feb 1999 14:39:00 -0800 (PST)
	(envelope-from ath@niksun.com)
Received: from stiegl.niksun.com (stiegl.niksun.com [10.0.0.44])
	by arjun.niksun.com (8.8.8/8.8.8) with ESMTP id PAA22217
	for <FreeBSD-gnats-submit@freebsd.org>; Wed, 24 Feb 1999 15:42:27 -0500 (EST)
Received: from stiegl.niksun.com (localhost.niksun.com [127.0.0.1])
	by stiegl.niksun.com (8.8.8/8.8.7) with ESMTP id PAA24006
	for <FreeBSD-gnats-submit@freebsd.org>; Wed, 24 Feb 1999 15:42:21 -0500 (EST)
	(envelope-from ath@stiegl.niksun.com)
Message-Id: <199902242042.PAA24006@stiegl.niksun.com>
Date: Wed, 24 Feb 1999 15:42:21 -0500
From: Andrew Heybey <ath@niksun.com>
Sender: ath@niksun.com
To: FreeBSD-gnats-submit@freebsd.org
Subject: read(2) returns garbage
X-Send-Pr-Version: 3.2

>Number:         10243
>Category:       kern
>Synopsis:       Under heavy disk and network load read(2) can return garbage.
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Feb 25 10:20:02 PST 1999
>Closed-Date:    Fri Mar 12 15:02:46 PST 1999
>Last-Modified:  Thu Nov  4 08:20:00 PST 1999
>Originator:     Andrew Heybey
>Release:        FreeBSD 3.1-RELEASE i386
>Organization:
Niksun
>Environment:

	3.1-RELEASE GENERIC kernel (+ bpf)
	450MHz P-II, 256MB memory, Asus P2B-LS motherboard.
	Adaptec 7890 SCSI controller, IBM DRVS09V (10000 RPM LVD) disks
	Intel EtherExpress Pro 10/100B ethernet

	Full dmesg output (or any other info) available to anyone who
	wants to look into this.	

>Description:

The bug is that under certain loads, read(2) can return corrupted data
(i.e. data that is not in the file on disk).  The instances I have seen
are relatively small amounts (8-64 bytes) of corrupt data at the end
of a 4k page.  The corrupt data comes from a previously read file or
from another position in the current file.  I have also seen this
problem in 3.0-RELEASE but not in 2.2.8-RELEASE.

Specifically, I can reproduce the bug under the following conditions
(I am sorry that I don't have a smaller and simpler test case):

1) Multiple processes reading a set of large files.  I believe that
   the amount of data must be large enough that the reads come from
   disk, not the cache (if I only read one 50MB file, I do not see
   the bug).  (I have used 1.5GB of data files on a system with 256MB
   of physical memory.)  I also believe that multiple read processes
   must be running (I have seen the bug with 4 processes, but not
   with only one process).

   The files that I have used are filled with sequential integers.
   This allows my test program to know if it gets bogus data from
   read(2), since it knows what should be there.

*AND*

2) Very high network interrupt rate.  I have tested on a fast ethernet
   receiving at about 46000 packets/sec.  I use bpf to get the network
   interrupt rate up that high without having to do any protocol
   processing.  I don't know if the network or bpf code has anything
   to do with the bug or if it is just that the high load stimulates
   some cam/vm/ufs/bpf bug.  I have not been able to reproduce the bug
   without this high load.  Neither zero pkts/sec nor 3000 pkts/sec
   exhibits the bug (or at least not after running for several
   hours), while with the high network load it will usually occur
   within 10 minutes.
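
The check described in 1) can be sketched roughly as follows.  This is
only a minimal illustration, not the code in the tar file mentioned
below: the function names are hypothetical, and it assumes each file
holds 32-bit integers in host byte order, with the word at byte offset
4*k holding the value k.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Assumed layout (hypothetical): the 32-bit word at byte offset 4*k of
 * the data file holds the value k, in host byte order.
 */

/* Fill buf as if it had been read from byte offset off of such a file. */
void
fill_sequential(unsigned char *buf, size_t len, uint64_t off)
{
	size_t i;

	assert(off % sizeof(uint32_t) == 0 && len % sizeof(uint32_t) == 0);
	for (i = 0; i < len; i += sizeof(uint32_t)) {
		uint32_t v = (uint32_t)((off + i) / sizeof(uint32_t));
		memcpy(buf + i, &v, sizeof(v));
	}
}

/*
 * Compare a buffer returned by read(2) at byte offset off against the
 * expected contents.  Returns the byte index (within buf) of the first
 * word that does not match, or -1 if the whole buffer is correct.
 */
long
first_mismatch(const unsigned char *buf, size_t len, uint64_t off)
{
	size_t i;

	assert(off % sizeof(uint32_t) == 0 && len % sizeof(uint32_t) == 0);
	for (i = 0; i < len; i += sizeof(uint32_t)) {
		uint32_t want = (uint32_t)((off + i) / sizeof(uint32_t));
		uint32_t got;

		memcpy(&got, buf + i, sizeof(got));
		if (got != want)
			return (long)i;
	}
	return -1;
}
```

A reader process would call first_mismatch() on every buffer it gets
back from read(2), passing the file offset it read from, and report
the offset of any mismatch so it can be compared with the 4k page
boundary.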

>How-To-Repeat:

	I have put a small suite of programs that I use to produce
	this bug at http://www.niksun.com/~ath/fbsd_bug.tgz.  The
	tar file contains a few test programs and complete
	instructions on how I tickle the bug.

	I have reproduced the bug on two different machines, so I
	don't think that the hw is broken (though the machines have
	substantially the same kind of hardware so it is conceivable
	that it is a HW misdesign of some kind).

	I welcome advice on how to track this down.  It smells to me
	like an insufficient-application-of-splfoo bug, but I'm not
	even sure where to start looking.  For example why would
	network I/O and BPF have any effect on disk reads?

	Even better, I suppose, would be someone to tell me that I'm
	an idiot and my test program is broken.  But it is really a
	very simple program and has run for hours without a problem
	when there is negligible network load.

>Fix:
	
	I wish.

>Release-Note:
>Audit-Trail:

From: Andrew Heybey <ath@niksun.com>
To: freebsd-gnats-submit@freebsd.org, mike@smith.net.au
Cc:  
Subject: Re: kern/10243: Under heavy disk and network load read(2) can return garbage.
Date: Fri, 12 Mar 1999 11:44:21 -0500

 This bug (under simultaneous heavy disk and network activity, reads
 from disk appear to suffer from short DMA transfers, resulting in
 incorrect data being returned by read(2)) appears to be a hardware bug.
 
 The motherboard in the systems on which I experienced the problem is
 an Asus P2B-LS with on-board Intel ethernet and AIC7890 SCSI
 controller.  If I change the "PCI Latency" BIOS setting from the
 default of 32 to 64, the problem seems to go away.  At least I can run
 my test programs overnight without a failure, while previously they
 would not run for more than 10-20 minutes.
 
 My hypothesis is that the 7890 is not getting sufficient PCI bus
 bandwidth to keep up with the disks and that there is some bug either
 in the controller or the disks (IBM Ultrastar 9LZX) such that they
 lose part of a transfer in this case.  I am not very familiar with the
 SCSI protocol, but I would think that there is some way that the
 controller could apply back pressure to the disk to ask it to slow down
 if the controller's FIFOs are getting full.  To lose data, either the
 controller is not applying back pressure or the disk is ignoring it.
 
 This PR can be closed, and I apologize for jumping to the conclusion
 that this is a FreeBSD bug.
 
 andrew
 
State-Changed-From-To: open->closed 
State-Changed-By: msmith 
State-Changed-When: Fri Mar 12 15:02:46 PST 1999 
State-Changed-Why:  
See audit trail - apparent hardware error. 

From: Andrew Heybey <ath@niksun.com>
To: freebsd-gnats-submit@freebsd.org
Cc:  
Subject: Re: kern/10243: Under heavy disk and network load read(2) can return garbage.
Date: Thu, 04 Nov 1999 11:14:10 -0500

 Just a note that this corruption bug seems to have been solved by the
 commit to sys/dev/aic7xxx/aic7xxx.seq revision 1.91 on 20-Sep-99.
 
>Unformatted:
