From dillon@flea.best.net  Mon Jul 27 14:37:53 1998
Received: from flea.best.net (root@flea.best.net [206.184.139.131])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id OAA15198
          for <FreeBSD-gnats-submit@freebsd.org>; Mon, 27 Jul 1998 14:37:45 -0700 (PDT)
          (envelope-from dillon@flea.best.net)
Received: (from dillon@localhost)
	by flea.best.net (8.9.0/8.9.0/best.fl) id OAA23112;
	Mon, 27 Jul 1998 14:37:01 -0700 (PDT)
Message-Id: <199807272137.OAA23112@flea.best.net>
Date: Mon, 27 Jul 1998 14:37:01 -0700 (PDT)
From: Matt Dillon <dillon@best.net>
Reply-To: dillon@best.net
To: FreeBSD-gnats-submit@freebsd.org
Subject: Critical file corruption in write() based append when file is mmap'd
X-Send-Pr-Version: 3.2

>Number:         7418
>Category:       kern
>Synopsis:       File corruption occurs when shared+ro mmap'd file is appended to via write()
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Jul 27 14:40:00 PDT 1998
>Closed-Date:    Wed Nov 18 18:55:30 PST 1998
>Last-Modified:  Wed Nov 18 18:56:16 PST 1998
>Originator:     Matt Dillon
>Release:        FreeBSD 2.2.6-STABLE i386
>Organization:
Best Internet Communications, Inc.
>Environment:

	FreeBSD-current
	FreeBSD-stable
	heavily loaded mmap & file-writing environment
	CCD striped disk partition.

>Description:

	This has been noticed on a number of FreeBSD-current and 
	FreeBSD-stable machines running news software.

	The problem occurs when one process is appending to a file
	via write() while another is doing a shared+ro mmap()'ing of
	the file.  The problem occurs often, but I cannot deterministically
	reproduce it.
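
	The access pattern described above can be sketched in user space.
	This is only an illustration under stated assumptions: the helper
	name append_and_verify() is hypothetical, it runs in a single
	process, and it does not recreate the two-process race or the
	heavy load under which the corruption is seen.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: append len bytes to fd with write(), then look at
 * the same bytes through a PROT_READ, MAP_SHARED mapping of the file.
 * Returns 0 if the mapping shows what was written, 1 on a mismatch,
 * -1 on a system call error.  fd must be open for reading and writing.
 */
int
append_and_verify(int fd, const char *data, size_t len)
{
	struct stat st;
	off_t start;
	char *map;
	int rc;

	assert(data != NULL);
	if (len == 0)
		return (0);

	/* Same shape as the traced writer: seek to the end, one write(). */
	start = lseek(fd, 0, SEEK_END);
	if (start == (off_t)-1)
		return (-1);
	if (write(fd, data, len) != (ssize_t)len)
		return (-1);
	if (fstat(fd, &st) == -1)
		return (-1);

	/* The reader side: a read-only shared mapping of the whole file. */
	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return (-1);

	rc = memcmp(map + start, data, len) != 0;
	munmap(map, (size_t)st.st_size);
	return (rc);
}
```

	On a kernel without the bug every call should return 0.  A real
	reproduction attempt would need a separate reader process holding
	its mapping while the writer keeps appending.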

	I was able to ktrace the writing process.  The ktrace is included.
	Note that the writing process has been found to be writing to
	the file properly.  The reading process mmap's the file in real
	time... that is, it is accessing its mmap simultaneously with the
	writing process writing to the file.  The hex dump of the file is
	shown below...  the corrupted area is full of 00 00 00 00's.

	The corruption generally extends to a page boundary.

	If you look at the writing process, you will notice that it is
	simply appending articles to the file.  There is no concept of
	a page boundary.

	There is a serious issue here that needs to be resolved.

0002.1960  0a 20 20 20 20 20 73 6f 6d 65 68 6f 77 20 61 20  .     somehow a 
0002.1970  73 61 70 6c 69 6e 67 0a 20 20 20 20 20 6f 6e 20  sapling.     on 
0002.1980  74 68 69 73 20 62 65 64 72 6f 63 6b 20 6f 66 20  this bedrock of 
0002.1990  61 20 74 6f 6e 67 75 65 0a 0a 20 20 20 20 20 69  a tongue..     i
0002.19a0  74 20 77 69 6c 6c 20 62 65 0a 20 20 20 20 20 61  t will be.     a
0002.19b0  73 20 79 6f 75 20 68 61 76 65 20 77 72 69 74 74  s you have writt
0002.19c0  65 6e 0a 20 20 20 20 20 61 73 20 65 78 61 63 74  en.     as exact
0002.19d0  20 61 6e 64 20 6c 6f 76 65 6c 79 0a 20 20 20 20   and lovely.    
0002.19e0  20 61 73 20 61 6e 79 20 66 72 75 69 74 20 62 65   as any fruit be
0002.19f0  61 72 69 6e 67 20 74 68 69 6e 67 0a 20 20 20 20  aring thing.    
0002.1a00  20 74 68 61 74 20 77 69 6c 6c 20 65 76 65 72 20   that will ever 
0002.1a10  6d 61 6b 65 20 61 20 73 6f 75 6e 64 0a 0a 0a 37  make a sound...7
0002.1a20  2d 32 37 2d 39 38 0a 64 63 6c 65 0a 0a 2d 2d 2d  -27-98.dcle..---
0002.1a30  2d 2d 3d 3d 20 50 6f 73 74 65 64 20 76 69 61 20  --== Posted via 
0002.1a40  44 65 6a 61 20 4e 65 77 73 2c 20 54 68 65 20 4c  Deja News, The L
0002.1a50  65 61 64 65 72 20 69 6e 20 49 6e 74 65 72 6e 65  eader in Interne
0002.1a60  74 20 44 69 73 63 75 73 73 69 6f 6e 20 3d 3d 2d  t Discussion ==-
0002.1a70  2d 2d 2d 2d 0a 68 74 74 70 3a 2f 2f 77 77 77 2e  ----.http://www.
0002.1a80  64 65 6a 61 6e 65 77 73 2e 63 6f 6d 2f 72 67 5f  dejanews.com/rg_
0002.1a90  6d 6b 67 72 70 2e 78 70 20 20 20 43 72 65 61 74  mkgrp.xp   Creat
0002.1aa0  65 20 59 6f 75 72 20 4f 77 6e 20 46 72 65 65 20  e Your Own Free 
0002.1ab0  4d 65 6d 62 65 72 20 46 6f 72 75 6d 0a 00 00 00  Member Forum....
0002.1ac0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1ad0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1ae0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1af0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1b00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1b10  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1b20  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
...
0002.1f70  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1f80  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1f90  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1fa0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1fb0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1fc0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1fd0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1fe0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.1ff0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0002.2000  69 74 64 2e 75 6d 69 63 68 2e 65 64 75 21 6e 65  itd.umich.edu!ne
0002.2010  77 73 2d 70 65 65 72 2e 67 69 70 2e 6e 65 74 21  ws-peer.gip.net!
0002.2020  6e 65 77 73 2e 67 73 6c 2e 6e 65 74 21 67 69 70  news.gsl.net!gip
0002.2030  2e 6e 65 74 21 6e 65 77 73 66 65 65 64 2e 69 6e  .net!newsfeed.in
0002.2040  74 65 72 6e 65 74 6d 63 69 2e 63 6f 6d 21 32 30  ternetmci.com!20
0002.2050  34 2e 32 33 38 2e 31 32 30 2e 31 33 30 21 6e 65  4.238.120.130!ne


  8240 diablo   0.000120 CALL  lseek(0xe,0,0x21abe,0,0)
  8240 diablo   0.000041 RET   lseek 137918/0x21abe
  8240 diablo   0.000032 CALL  write(0xe,0x2278c020,0x283)
  8240 diablo   0.000095 GIO   fd 14 wrote 643 bytes
       "Path: news3.best.com!news1.best.com!feed1.news.erols.com!howland.erols.net!newsfeed.internetmci.com!204.238.120.130!news-feeds.jump.net!nntp2.dejanews.com!nnrp1.dejanews.com!not\
        -for-mail
        From: parsrise@my-dejanews.com
        Newsgroups: ott.business.ads
        Subject: Ottawa, Centretown Apartments For Rent
        Date: Mon, 27 Jul 1998 20:48:59 GMT
        Organization: Deja News - The Leader in Internet Discussion
        Lines: 29
        Message-ID: <6pip3s$f5b$1@nnrp1.dejanews.com>
        NNTP-Posting-Host: 192.58.194.87
        X-Article-Creation-Date: Mon Jul 27 20:48:59 1998 GMT
        X-Http-User-Agent: Mozilla/3.04 (X11; I; HP-UX A.09.03 9000/712)
        Xref: news3.best.com ott.business.ads:12900
       "
  8240 diablo   0.000038 RET   write 643/0x283
  8240 diablo   0.000065 CALL  lseek(0xe,0,0,0,0x1)
  8240 diablo   0.000032 RET   lseek 138561/0x21d41
  8240 diablo   0.000030 CALL  write(0xe,0x2278c020,0x291)
  8240 diablo   0.000060 GIO   fd 14 wrote 657 bytes
   ...
  8240 diablo   0.000046 CALL  lseek(0xe,0,0,0,0x1)
  8240 diablo   0.000033 RET   lseek 139218/0x21fd2
   ...

139264 (0x22000) is the 4K boundary



  8240 diablo   0.000054 CALL  lseek(0xe,0,0x21fd2,0,0)
  8240 diablo   0.000037 RET   lseek 139218/0x21fd2
  8240 diablo   0.000035 CALL  write(0xe,0x2278c020,0x2d9)
  8240 diablo   0.000408 GIO   fd 14 wrote 729 bytes
       "Path: news3.best.com!news1.best.com!newsxfer3.itd.umich.edu!news-peer.gip.net!news.gsl.net!gip.net!newsfeed.internetmci.com!204.238.120.130!news-feeds.jump.net!nntp2.dejanews.co\
        m!nnrp1.dejanews.com!not-for-mail
        From: elleinad@my-dejanews.com
        Newsgroups: rec.arts.disney.parks
        Subject: Re: Planning 4 days at WDW
        Date: Mon, 27 Jul 1998 20:52:41 GMT
        Organization: Deja News - The Leader in Internet Discussion
        Lines: 30
        Message-ID: <6pipaq$f81$1@nnrp1.dejanews.com>
        References: <1998072618171400.OAA06477@ladder03.news.aol.com>
        NNTP-Posting-Host: 204.189.84.62
        X-Article-Creation-Date: Mon Jul 27 20:52:41 1998 GMT
        X-Http-User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
        Xref: news3.best.com rec.arts.disney.parks:187617
       "
  8240 diablo   0.000056 RET   write 729/0x2d9

  8240 diablo   0.000052 CALL  lseek(0xe,0,0,0,0x1)
  8240 diablo   0.000032 RET   lseek 139947/0x222ab
  8240 diablo   0.000030 CALL  write(0xe,0x2278c020,0x537)
  8240 diablo   0.000066 GIO   fd 14 wrote 1335 bytes
       "
        Well, I'm 13, and these are some of my favorites:
        
        Magic Kingdom: Space Mountain, Alien Encounter, Splash Mountain, Big Thunder
        Mountain Railroad, the Barnstormer (back seat ONLY!) ;-) Epcot: Honey, I
        Shrunk the Audience, the Image Works, Innoventions, Wonders of Life, and the
        ride at Norway. MGM Studios: Tower of Terror, Star Tours, Superstar
        Television, 50's Prime Time Cafe'. (It's an attraction in itself!) Animal
        Kingdom: Kilimanjaro Safaris, Countdown to Extinction, Legend of the Lion
        King show, Tough to Be A Bug. Water Parks: SUMMIT PLUMMET, Slush Gusher,
        Downhill Double Dipper, and the fun chute slide (kind of off to itself) at
        Ski Patrol Training Camp.
        
        Have fun!
        
        In article <1998072618171400.OAA06477@ladder03.news.aol.com>,
          megps@aol.com (MegPS) wrote:
        > We are headed to WDW on sunday for 4 days, We will be staying at CS, and we
        > have passes for all of the parks and water parks. We have two girls, 14 and16,
        > who will probably be on their own for much of the time, can you suggest some
        > things that they might want to do while they're there? Thanks much.
        >
        
        
        --
                           Danielle
                    Danielle@searnet.com
        TDC Keeper of Pegasus, Summit Plummet,
        
        -----== Posted via Deja News, The Leader in Internet Discussion ==-----
        http://www.dejanews.com/rg_mkgrp.xp   Create Your Own Free Member Forum
        \0"
  8240 diablo   0.000046 RET   write 1335/0x537
  8240 diablo   0.000040 CALL  lseek(0xe,0,0,0,0x1)
  8240 diablo   0.000029 RET   lseek 141282/0x227e2
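
	The boundary arithmetic for the trace above can be checked with a
	small sketch.  The function names are hypothetical; the offsets and
	lengths are the ones shown in the lseek/write records, and a 4K
	page size is assumed.

```c
#include <assert.h>

/* Smallest multiple of align that is >= off (align: a power of two). */
unsigned long
align_up(unsigned long off, unsigned long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((off + align - 1) & ~(align - 1));
}

/*
 * How many bytes of a len-byte write starting at off land before the
 * next align boundary.  The remainder lands at or after the boundary.
 */
unsigned long
bytes_before_boundary(unsigned long off, unsigned long len,
    unsigned long align)
{
	/* Distance from off to the first boundary strictly above it. */
	unsigned long room = align_up(off + 1, align) - off;

	return (len < room ? len : room);
}
```

	For the 729 byte write at 139218 (0x21fd2) this gives 46 bytes
	before 139264 and 683 bytes after it.  The first 46 bytes of that
	article are "Path: news3.best.com!news1.best.com!newsxfer3.", and
	the hex dump resumes at 0002.2000 with "itd.umich.edu!ne", so the
	part of the write below the boundary is missing while the part
	above it is intact.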


>How-To-Repeat:

	I can get it to occur quite often but I cannot get it to repeat
	in a deterministic fashion.  I believe the heavy load on the
	machine is making the bug more visible.

>Fix:
	

>Release-Note:
>Audit-Trail:

From: Matthew Dillon <dillon@backplane.com>
To: Luoqi Chen <luoqi@chen.ml.org>
Cc: freebsd-gnats-submit@freebsd.org, luoqi@watermarkgroup.com
Subject: Re: kern/7418 (file corruption on mmap-based-read during file write())
Date: Thu, 13 Aug 1998 20:33:41 -0700 (PDT)

     I'm trying to track down the second PR I sent in... kern/7418 in this
     case.  I am concentrating on what happens when a page fault from an mmap'd
     file occurs in one process while a second process is blocked in
     ufs/ufs_readwrite() on the same file.
 
     The file corruption that I am seeing is approximately this:
 
     [this represents one page of memory]
 
 	[data][data][data..00 00 00 00][data]
 
     Where 00's replace what should have been valid data in the file.
     Data written into the locations where the corruption occurs is being
     replaced by 00, e.g. I might see
 
 	'abcdef [00 00 00] (PAGE BOUNDARY) jklmnop'.
 
     The most interesting item is that the 00 corruption always *ends* at
     a page boundary in the file.
 
     --
 
     So here is my question.  Refer to ufs/ufs_readwrite.c around line 337,
     as shown below.   What happens if VOP_BALLOC() blocks or uiomove()
     blocks during the discrete write() and, while blocked, another process
     has a read fault on precisely the same logical page via mmap()?
 
     Is it possible for ufs_readwrite to obtain a bp, copy the write() data
     to it, but for the bp to then somehow be thrown away?  
 
     I also don't understand why B_RELBUF is being set for the bp.  Can't 
     this cause the bp to be completely thrown away (aka kern/vfs_bio.c line 
     710)??   Makes sense for a READ, but I don't understand why B_RELBUF
     is being set for the bp in the WRITE.
 
         for (error = 0; uio->uio_resid > 0;) {
                 lbn = lblkno(fs, uio->uio_offset);
                 blkoffset = blkoff(fs, uio->uio_offset);
                 xfersize = fs->fs_bsize - blkoffset;
                 if (uio->uio_resid < xfersize)
                         xfersize = uio->uio_resid;
 
                 if (uio->uio_offset + xfersize > ip->i_size)
                         vnode_pager_setsize(vp, uio->uio_offset + xfersize);
 
                 if (fs->fs_bsize > xfersize)
                         flags |= B_CLRBUF;
                 else
                         flags &= ~B_CLRBUF;
 /* XXX is uio->uio_offset the right thing here? */
                 error = VOP_BALLOC(vp, uio->uio_offset, xfersize,
                     ap->a_cred, flags, &bp);
                 if (error != 0)
                         break;
 
                 if (uio->uio_offset + xfersize > ip->i_size) {
                         ip->i_size = uio->uio_offset + xfersize;
                         extended = 1;
                 }
 
                 size = BLKSIZE(fs, ip, lbn) - bp->b_resid;
                 if (size < xfersize)
                         xfersize = size;
 
                 error =
                     uiomove((char *)bp->b_data + blkoffset, (int)xfersize, uio);
                 if ((ioflag & IO_VMIO) &&
                    (LIST_FIRST(&bp->b_dep) == NULL))
                         bp->b_flags |= B_RELBUF;
 
 		...
 	}
 
     Finally, I tried looking in other filesystem device code for comparable
     source.  The ext2fs code seems to simply copy the ufs code:
 
                 error =
                     uiomove((char *)bp->b_data + blkoffset, (int)xfersize, uio);
                 if ((ioflag & IO_VMIO) &&
                    (LIST_FIRST(&bp->b_dep) == NULL)) /* in ext2fs? */
                         bp->b_flags |= B_RELBUF; 
 
     The msdosfs does not set B_RELBUF in either its read or write, nor does
     the nfs code.   It's all very confusing.
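 
     As a reading aid, the per-block arithmetic of the write loop quoted
     above can be mirrored in user space.  This is a sketch, not the kernel
     code: struct chunk and split_write() are hypothetical names, plain
     division and modulo stand in for lblkno() and blkoff(), the block size
     is passed in by the caller, and the later clamp against
     BLKSIZE(fs, ip, lbn) - bp->b_resid is left out.

```c
#include <assert.h>
#include <stddef.h>

struct chunk {
	unsigned long lbn;       /* logical block number, as lblkno() */
	unsigned long blkoffset; /* offset inside the block, as blkoff() */
	unsigned long xfersize;  /* bytes copied in this iteration */
	int clrbuf;              /* nonzero where the loop sets B_CLRBUF */
};

/*
 * Split a write of resid bytes at file offset off into per-block pieces.
 * Fills at most max entries of out; returns the number of pieces needed.
 */
size_t
split_write(unsigned long off, unsigned long resid, unsigned long bsize,
    struct chunk *out, size_t max)
{
	size_t n = 0;

	assert(bsize != 0);
	while (resid > 0) {
		unsigned long blkoffset = off % bsize;
		unsigned long xfersize = bsize - blkoffset;

		if (resid < xfersize)
			xfersize = resid;
		if (n < max) {
			out[n].lbn = off / bsize;
			out[n].blkoffset = blkoffset;
			out[n].xfersize = xfersize;
			/* Partial block: fs_bsize > xfersize. */
			out[n].clrbuf = bsize > xfersize;
		}
		n++;
		off += xfersize;
		resid -= xfersize;
	}
	return (n);
}
```

     With an assumed 8192 byte block, a 20000 byte write at offset 5000
     becomes four iterations of 3192, 8192, 8192 and 424 bytes.  Only the
     first and last are partial blocks, which is where the loop asks
     VOP_BALLOC() for B_CLRBUF.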
 
 						-Matt
 
     Matthew Dillon  Engineering, HiWay Technologies, Inc. & BEST Internet 
                     Communications
     <dillon@backplane.com> (Please include original email in any response)    
 

From: Luoqi Chen <luoqi@chen.ml.org>
To: dillon@backplane.com, luoqi@chen.ml.org
Cc: freebsd-gnats-submit@freebsd.org, luoqi@watermarkgroup.com
Subject: Re: kern/7418 (file corruption on mmap-based-read during file write())
Date: Fri, 14 Aug 1998 00:27:46 -0400 (EDT)

 >     I'm trying to track down the second PR I sent in... kern/7418 in this
 >     case.  I am concentrating on what happens when a page fault from an mmap'd
 >     file occurs in one process while a second process is blocked in
 >     ufs/ufs_readwrite() on the same file.
 >
 >     The file corruption that I am seeing is approximately this:
 >
 >     [this represents one page of memory]
 >
 > 	[data][data][data..00 00 00 00][data]
 >
 >     Where 00's replace what should have been valid data in the file.
 >     Data written into the locations where the corruption occurs is being
 >     replaced by 00, e.g. I might see
 >
 > 	'abcdef [00 00 00] (PAGE BOUNDARY) jklmnop'.
 >
 >     The most interesting item is that the 00 corruption always *ends* at
 >     a page boundary in the file.
 >
 >     --
 >
 >     So here is my question.  Refer to ufs/ufs_readwrite.c around line 337,
 >     as shown below.   What happens if VOP_BALLOC() blocks or uiomove()
 >     blocks during the discrete write() and, while blocked, another process
 >     has a read fault on precisely the same logical page via mmap()?
 >
 >     Is it possible for ufs_readwrite to obtain a bp, copy the write() data
 >     to it, but for the bp to then somehow be thrown away?  
 >
 Even if the buffer header is thrown away, the data should still be in
 the vm cache. But I can imagine one scenario that could lead to data loss.
 We know that when we're reading multiple pages of data from disk, any
 pages that are already in the vm cache are replaced with `bogus_page' in
 the buffer header page list. Now if another process tries to write into
 those data blocks before the read completes, the data will be written into
 the bogus_page. When the read completes, the true data pages will be put
 back in unaltered, and everything that was written is lost. I don't know
 if this could really happen: I can't prove or disprove it, and I can't
 reproduce what you have seen.
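 
 The scenario can be illustrated with a toy model. This is not kernel code
 and it does not show that the kernel actually behaves this way: every name
 (toy_buf, toy_bogus_page, toy_read_begin and so on) is made up for the
 illustration, and the page size is shrunk to 8 bytes.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 8
#define TOY_BUF_PAGES 2

struct toy_buf {
	char *page[TOY_BUF_PAGES];  /* pages currently in the buffer's list */
	char *saved[TOY_BUF_PAGES]; /* real pages set aside during a read */
};

/* Shared throwaway page, standing in for bogus_page. */
static char toy_bogus_page[TOY_PAGE_SIZE];

/* Begin a multi-page read: already-valid pages are swapped for the
 * throwaway page so the incoming disk data cannot overwrite them. */
void
toy_read_begin(struct toy_buf *bp, const int valid[TOY_BUF_PAGES])
{
	for (int i = 0; i < TOY_BUF_PAGES; i++) {
		bp->saved[i] = NULL;
		if (valid[i]) {
			bp->saved[i] = bp->page[i];
			bp->page[i] = toy_bogus_page;
		}
	}
}

/* Finish the read: the real pages go back into the list, unaltered. */
void
toy_read_done(struct toy_buf *bp)
{
	for (int i = 0; i < TOY_BUF_PAGES; i++) {
		if (bp->saved[i] != NULL) {
			bp->page[i] = bp->saved[i];
			bp->saved[i] = NULL;
		}
	}
}

/* A write that goes through whatever page the buffer currently lists. */
void
toy_write(struct toy_buf *bp, int pg, size_t off, const char *src, size_t len)
{
	assert(pg >= 0 && pg < TOY_BUF_PAGES && off + len <= TOY_PAGE_SIZE);
	memcpy(bp->page[pg] + off, src, len);
}
```

 If a write is issued between toy_read_begin() and toy_read_done() against
 a page that was marked valid, it lands in the throwaway page, and the
 real page still holds its old contents once the read finishes.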
 
 >     I also don't understand why B_RELBUF is being set for the bp.  Can't 
 >     this cause the bp to be completely thrown away (aka kern/vfs_bio.c line 
 >     710)??   Makes sense for a READ, but I don't understand why B_RELBUF
 >     is being set for the bp in the WRITE.
 >
 >         for (error = 0; uio->uio_resid > 0;) {
 >                 lbn = lblkno(fs, uio->uio_offset);
 >                 blkoffset = blkoff(fs, uio->uio_offset);
 >                 xfersize = fs->fs_bsize - blkoffset;
 >                 if (uio->uio_resid < xfersize)
 >                         xfersize = uio->uio_resid;
 >
 >                 if (uio->uio_offset + xfersize > ip->i_size)
 >                         vnode_pager_setsize(vp, uio->uio_offset + xfersize);
 >
 >                 if (fs->fs_bsize > xfersize)
 >                         flags |= B_CLRBUF;
 >                 else
 >                         flags &= ~B_CLRBUF;
 > /* XXX is uio->uio_offset the right thing here? */
 >                 error = VOP_BALLOC(vp, uio->uio_offset, xfersize,
 >                     ap->a_cred, flags, &bp);
 >                 if (error != 0)
 >                         break;
 >
 >                 if (uio->uio_offset + xfersize > ip->i_size) {
 >                         ip->i_size = uio->uio_offset + xfersize;
 >                         extended = 1;
 >                 }
 >
 >                 size = BLKSIZE(fs, ip, lbn) - bp->b_resid;
 >                 if (size < xfersize)
 >                         xfersize = size;
 >
 >                 error =
 >                     uiomove((char *)bp->b_data + blkoffset, (int)xfersize, uio);
 >                 if ((ioflag & IO_VMIO) &&
 >                    (LIST_FIRST(&bp->b_dep) == NULL))
 >                         bp->b_flags |= B_RELBUF;
 >
 > 		...
 > 	}
 >
 >     Finally, I tried looking in other filesystem device code for comparable
 >     source.  The ext2fs code seems to simply copy the ufs code:
 >
 >                 error =
 >                     uiomove((char *)bp->b_data + blkoffset, (int)xfersize, uio);
 >                 if ((ioflag & IO_VMIO) &&
 >                    (LIST_FIRST(&bp->b_dep) == NULL)) /* in ext2fs? */
 >                         bp->b_flags |= B_RELBUF; 
 >
 >     The msdosfs does not set B_RELBUF in either its read or write, nor does
 >     the nfs code.   It's all very confusing.
 >
 The IO_VMIO flag is only set from getpages/putpages calls, as a result of
 direct mmapped access. We don't need to keep the buffer headers around and,
 as I mentioned before, this won't result in any data loss: the data are
 still intact in the vm cache.
 
 > 						-Matt
 >
 >     Matthew Dillon  Engineering, HiWay Technologies, Inc. & BEST Internet 
 >                     Communications
 >     <dillon@backplane.com> (Please include original email in any response)    
 >
 >
 -lq

From: Matthew Dillon <dillon@backplane.com>
To: Luoqi Chen <luoqi@chen.ml.org>
Cc: luoqi@chen.ml.org, freebsd-gnats-submit@freebsd.org,
        luoqi@watermarkgroup.com
Subject: Re: kern/7418 (file corruption on mmap-based-read during file write())
Date: Fri, 14 Aug 1998 10:51:24 -0700 (PDT)

 :>
 :Even if the buffer header is thrown away, the data should still be in
 :the vm cache. But I can imagine one scenario that could lead to data loss.
 :...
 :>
 :The IO_VMIO flag is only set from getpages/putpages calls, as a result of
 :direct mmapped access. We don't need to keep the buffer headers around and,
 :as I mentioned before, this won't result in any data loss: the data are
 :still intact in the vm cache.
 :
 :-lq
 
     I've noticed something else in regards to the corruption which might
     throw more light on the problem.
 
     [one page]
 
     A,B,C,D ... discrete usenet articles stored in file.
 
     [AAAAAAA][AAABBBBBB][BBBBBBBB][BBBBCCCCCCdddddd][DDDDDD...]
     page #1	#2	    #3		#4		#5
 
     The lower case 'd's indicate corruption... that is, areas of the file
     that were corrupted to 0x00.  The interesting item is that not only does
     the corruption end at a page boundary, it *begins* at the beginning of
     an article.  That is, article 'C' does not get corrupted at all, nor
     is there a piece of the beginning of D that is not corrupted... the
     corruption begins at the beginning of article D and ends at the page
     boundary (article D continues past the page boundary.  The portion after
     the page boundary is not corrupted).
 
     So what about this possibility (this is only a possibility, not an actual
     trace):
 
 	* process 1 write()'s article B to the file
 
 	* process 1 write()'s article C to the file
 
 	* process 2 mmap's and faults pages associated with C (i.e. page #4)
 		(at this point, 'D' has not been written yet and no corruption
 		has yet occurred, page #4 properly contains 00's after the end
 		of article C).  This is a PROT_READ,MAP_SHARED map.
 
 	* kernel starts writing page #4 to disk  (or kernel starts writing
 	  page #4 to disk and then process #2 faults it in for reading).
 
 	* process 1  write()'s article D to the file while page #4 is 
 	  dirty and the I/O is in progress.  Corruption somehow occurs.
 
     Is there anything fishy in this sequence of events that could cause the
     corruption?  The corruption I see occurs at least a dozen times a day, 
     probably more.  But that is out of 800,000 article appends to spool files
     (per day).  Thus, the window of opportunity would be relatively small.
 
     The machine in question is heavily IO loaded... it has lots of memory
     and there isn't much pageout/swap activity, but the memory is being 
     exercised very heavily due to the news spool and reading functions. 
     There is very heavy read-only mmap'ing of the spool files as well.  I
     can well imagine this creating concurrency situations that would not
     otherwise occur in other setups.  For example, disk-write latency 
     increases severely, creating a larger potential window of opportunity than
     if the disks were less heavily loaded.
 
 						-Matt
 
     Matthew Dillon  Engineering, HiWay Technologies, Inc. & BEST Internet 
                     Communications
     <dillon@backplane.com> (Please include original email in any response)    
 

From: Luoqi Chen <luoqi@watermarkgroup.com>
To: dillon@backplane.com, luoqi@chen.ml.org
Cc: freebsd-gnats-submit@freebsd.org, luoqi@watermarkgroup.com
Subject: Re: kern/7418 (file corruption on mmap-based-read during file write())
Date: Fri, 14 Aug 1998 15:02:26 -0400 (EDT)

 > 
 >     I've noticed something else in regards to the corruption which might
 >     throw more light on the problem.
 > 
 >     [one page]
 > 
 >     A,B,C,D ... discrete usenet articles stored in file.
 > 
 >     [AAAAAAA][AAABBBBBB][BBBBBBBB][BBBBCCCCCCdddddd][DDDDDD...]
 >     page #1	#2	    #3		#4		#5
 > 
 >     The lower case 'd's indicate corruption... that is, areas of the file
 >     that were corrupted to 0x00.  The interesting item is that not only does
 >     the corruption end at a page boundary, it *begins* at the beginning of
 >     an article.  That is, article 'C' does not get corrupted at all, nor
 >     is there a piece of the beginning of D that is not corrupted... the
 >     corruption begins at the beginning of article D and ends at the page
 >     boundary (article D continues past the page boundary.  The portion after
 >     the page boundary is not corrupted).
 > 
 I have a very interesting observation from the spool file dump in the PR 7418
 report: article 'C' (<6pip3s$f5b$1@nnrp1.dejanews.com>) was completely lost,
 so it is not just the article that crosses the page boundary; corruption
 can also hit articles that precede it on the same page. I also notice that
 both in PR 7418 and in your illustration above, the corruption takes place on
 even-numbered pages, i.e., the 2nd page of a two-page FFS block. Is this
 always the case? Are you aware of any other commonalities among the
 corruption cases you have seen?
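 
 The "2nd page of a two-page block" observation is easy to check for any
 file offset. A minimal sketch, assuming 4K pages and 8K FFS blocks; the
 function name page_in_block() is hypothetical.

```c
#include <assert.h>

/*
 * Zero-based index of the page that holds file offset off, counted
 * within the filesystem block that contains it.
 */
unsigned long
page_in_block(unsigned long off, unsigned long pagesize, unsigned long bsize)
{
	assert(pagesize != 0 && bsize >= pagesize && bsize % pagesize == 0);
	return ((off % bsize) / pagesize);
}
```

 For offset 0x21abe in the PR 7418 dump, where the lost article was
 written, page_in_block(0x21abe, 4096, 8192) is 1, i.e. the second page
 of its block. Page #4 of the illustration starts at offset 3 * 4096 and
 also gives 1.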
 
 >     So what about this possibility (this is only a possibility, not an actual
 >     trace):
 > 
 > 	* process 1 write()'s article B to the file
 > 
 > 	* process 1 write()'s article C to the file
 > 
 > 	* process 2 mmap's and faults pages associated with C (i.e. page #4)
 > 		(at this point, 'D' has not been written yet and no corruption
 > 		has yet occurred, page #4 properly contains 00's after the end
 > 		of article C).  This is a PROT_READ,MAP_SHARED map.
 > 
 > 	* kernel starts writing page #4 to disk  (or kernel starts writing
 > 	  page #4 to disk and then process #2 faults it in for reading).
 > 
 > 	* process 1  write()'s article D to the file while page #4 is 
 > 	  dirty and the I/O is in progress.  Corruption somehow occurs.
 > 
 I doubt this could happen: the buffer containing pages #3 and #4 will be
 marked busy, so write() will block until the I/O is complete. And while the
 kernel is writing page #4, process #2 shouldn't fault reading it since it is
 already in core.
 
 >     Is there anything fishy in this sequence of events that could cause the
 >     corruption?  The corruption I see occurs at least a dozen times a day, 
 >     probably more.  But that is out of 800,000 article appends to spool files
 >     (per day).  Thus, the window of opportunity would be relatively small.
 > 
 >     The machine in question is heavily IO loaded... it has lots of memory
 >     and there isn't much pageout/swap activity, but the memory is being 
 >     exercised very heavily due to the news spool and reading functions. 
 >     There is very heavy read-only mmap'ing of the spool files as well.  I
 >     can well imagine this creating concurrency situations that would not
 >     otherwise occur in other setups.  For example, disk-write latency 
 >     increases severely, creating a larger potential window of opportunity than
 >     if the disks were less heavily loaded.
 > 
 I have written a small test program to simulate what diablo does, and so
 far I have not been able to get any corruption. The test was done on an
 otherwise idle machine, so heavy load is definitely a contributing factor
 here.
 
 > 						-Matt
 > 
 >     Matthew Dillon  Engineering, HiWay Technologies, Inc. & BEST Internet 
 >                     Communications
 >     <dillon@backplane.com> (Please include original email in any response)    
 > 
 > 
 -lq

From: Matthew Dillon <dillon@backplane.com>
To: Luoqi Chen <luoqi@watermarkgroup.com>
Cc: luoqi@chen.ml.org, freebsd-gnats-submit@freebsd.org
Subject: Re: kern/7418 (file corruption on mmap-based-read during file write())
Date: Fri, 14 Aug 1998 12:21:19 -0700 (PDT)

 :> 
 :>     A,B,C,D ... discrete usenet articles stored in file.
 :> 
 :>     [AAAAAAA][AAABBBBBB][BBBBBBBB][BBBBCCCCCCdddddd][DDDDDD...]
 :>     page #1	#2	    #3		#4		#5
 :> 
 :> 
 :I have a very interesting observation from the spool file dump in the PR 7418
 :report: article 'C' (<6pip3s$f5b$1@nnrp1.dejanews.com>) was completely lost,
 :so it is not just the article that crosses the page boundary; corruption
 :can also hit articles that precede it on the same page. I also notice that
 :both in PR 7418 and in your illustration above, the corruption takes place on
 :even-numbered pages, i.e., the 2nd page of a two-page FFS block. Is this
 :always the case? Are you aware of any other commonalities among the
 :corruption cases you have seen?
 
     Ah, looks like it's bounded by the filesystem block size, 8K, rather
     than the 4K page size.  Here's a bunch of stuff from my spool:
     (note: there is supposed to be a single 00 separator between articles,
     so the occasional single 00 you see in between articles is correct.  But
     not the to-end-of-block 00 corruption which you can see is obviously
     replacing what should have been valid text).
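 
     The distinction between a legitimate separator and the corruption can
     be turned into a simple scan.  This is a sketch under assumptions: the
     function name is hypothetical, the whole spool file is assumed to be
     in memory, and a run is only reported when it is at least two bytes
     long and ends exactly on a block boundary.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Look for a run of two or more 0x00 bytes that ends exactly on a
 * multiple of bsize.  Returns the offset where the first such run
 * starts, or -1 if there is none.  A lone 00 is treated as the normal
 * article separator and ignored.
 */
long
find_zero_run_to_block_end(const unsigned char *buf, size_t len, size_t bsize)
{
	size_t i = 0;

	assert(bsize != 0);
	while (i < len) {
		size_t start;

		if (buf[i] != 0) {
			i++;
			continue;
		}
		start = i;
		while (i < len && buf[i] == 0)
			i++;
		/* The run covers [start, i). */
		if (i - start >= 2 && i % bsize == 0)
			return ((long)start);
	}
	return (-1);
}
```

     Called with bsize 8192 on a spool file, the returned offset may point
     at the separator byte that precedes the damaged article, because the
     separator and the corruption form one contiguous run of zeros.  A
     run that continues past a boundary into the next block is not
     reported by this simple version.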
 
     Also, I was once able to ktrace the writing process (I think it's in the
     original PR) to ensure that the writing process is not writing the 00's
     itself.  So it's definitely kernel-generated corruption.
 
 
     hdump D.00e5a1dc/B.3328 | less
     ...
     0001.96b0  65 69 76 65 20 63 6f 6d 6d 65 72 63 69 61 6c 20  eive commercial 
     0001.96c0  65 6d 61 69 6c 73 20 6f 72 20 6f 66 66 65 72 69  emails or offeri
     0001.96d0  6e 67 73 20 66 72 6f 6d 20 61 6e 79 20 76 65 6e  ngs from any ven
     0001.96e0  64 6f 72 0a 70 6c 65 61 73 65 20 72 65 70 6c 79  dor.please reply
     0001.96f0  20 77 69 74 68 20 27 52 45 4d 4f 56 45 27 20 69   with 'REMOVE' i
     0001.9700  6e 20 74 68 65 20 73 75 62 6a 65 63 74 20 66 69  n the subject fi
     0001.9710  65 6c 64 2e 0a 0a 0a 00 00 00 00 00 00 00 00 00  eld.............
     0001.9720  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0001.9730  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     ...
     0001.9fc0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0001.9fd0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0001.9fe0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0001.9ff0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0001.a000  74 65 20 75 6e 69 74 3b 20 4d 6f 64 65 6c 20 23  te unit; Model #
     0001.a010  20 32 35 30 2d 30 37 31 39 2d 30 30 31 2e 0a 54   250-0719-001..T
     0001.a020  68 65 20 75 6e 69 74 20 68 61 73 20 74 68 65 20  he unit has the 
     ...
 
     0002.1010  4f 72 65 67 6f 6e 20 31 39 39 32 2d 39 34 0a 53  Oregon 1992-94.S
     0002.1020  75 6e 73 65 74 20 48 53 20 31 39 38 39 2d 39 32  unset HS 1989-92
     0002.1030  0a 7b 54 68 65 73 65 20 76 69 65 77 73 20 61 72  .{These views ar
     0002.1040  65 20 6d 69 6e 65 20 61 6c 6f 6e 65 2c 20 61 6e  e mine alone, an
     0002.1050  64 20 64 6f 20 6e 6f 74 20 72 65 70 72 65 73 65  d do not represe
     0002.1060  6e 74 20 74 68 65 20 76 69 65 77 73 20 6f 66 20  nt the views of 
     0002.1070  74 68 65 73 65 20 6f 72 20 61 6e 79 0a 6f 74 68  these or any.oth
     0002.1080  65 72 20 6f 72 67 61 6e 69 7a 61 74 69 6f 6e 73  er organizations
     0002.1090  7d 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00  }...............
     0002.10a0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.10b0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.10c0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.10d0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     ...
     0002.1fd0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.1fe0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.1ff0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.2000  69 0a 43 6f 6e 74 65 6e 74 2d 54 72 61 6e 73 66  i.Content-Transf
     0002.2010  65 72 2d 45 6e 63 6f 64 69 6e 67 3a 20 37 62 69  er-Encoding: 7bi
     0002.2020  74 0a 4c 69 6e 65 73 3a 20 31 36 0a 4d 65 73 73  t.Lines: 16.Mess
 
 				**************
 
     hdump D.00e5a204/B.5b40 | less
     ...
     0002.1a60  6f 67 6c 69 6f 6e 6f 20 72 6f 6d 70 65 72 65 20  ogliono rompere 
     0002.1a70  61 20 6e 6f 69 22 2e 0a 0a 4e 6f 20 63 72 6f 73  a noi"...No cros
     0002.1a80  73 70 6f 73 74 2c 20 38 30 20 6c 69 6e 65 73 20  spost, 80 lines 
     0002.1a90  6f 6e 6c 79 2d 20 4e 6f 73 70 61 6d 20 69 6e 20  only- Nospam in 
     0002.1aa0  61 64 64 72 65 73 73 0a 2d 2d 2d 2d 2d 2d 2d 2d  address.--------
     0002.1ab0  2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d  ----------------
     0002.1ac0  2d 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00  -...............
     0002.1ad0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.1ae0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.1af0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     ...
     0002.1fa0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.1fb0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.1fc0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.1fd0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.1fe0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.1ff0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.2000  2d 2d 2d 2d 2d 2d 2d 2d 20 20 49 20 6b 6e 6f 77  --------  I know
     0002.2010  20 49 27 6d 20 6d 79 20 62 65 73 74 20 66 72 69   I'm my best fri
     0002.2020  65 6e 64 20 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d  end ------------
     0002.2030  2d 2d 2d 2d 2d 2d 2d 2d 0a 00 50 61 74 68 3a 20  --------..Path: 
 
 				**************
 
     hdump D.00e5a20e/B.7411 | less
     ...
     0002.67f0  53 74 61 6e 64 61 72 64 20 64 69 73 63 6c 61 69  Standard disclai
     0002.6800  6d 65 72 3a 20 54 68 65 20 76 69 65 77 73 20 6f  mer: The views o
     0002.6810  66 20 74 68 69 73 20 75 73 65 72 20 61 72 65 20  f this user are 
     0002.6820  73 74 72 69 63 74 6c 79 20 68 69 73 2f 68 65 72  strictly his/her
     0002.6830  20 6f 77 6e 2e 0a 0a 0a 00 00 00 00 00 00 00 00   own............
     0002.6840  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.6850  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.6860  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     ...
     0002.7fc0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.7fd0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.7fe0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.7ff0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0002.8000  65 20 64 69 72 65 63 74 20 55 52 4c 20 69 73 3a  e direct URL is:
     0002.8010  0a 68 74 74 70 3a 2f 2f 77 77 77 2e 62 64 73 6d  .http://www.bdsm
     0002.8020  74 6f 70 2d 35 30 2e 63 6f 6d 2f 77 65 62 6d 61  top-50.com/webma
     0002.8030  73 74 65 72 73 2f 72 65 73 6f 75 72 63 65 73 2f  sters/resources/
 
 				**************
 
     hdump D.00e5a20e/B.769d | less
     0003.1d90  74 68 65 20 43 55 52 52 45 4e 54 20 0a 70 75 62  the CURRENT .pub
     0003.1da0  6c 69 73 68 65 64 20 73 74 61 6e 64 61 72 64 2c  lished standard,
     0003.1db0  20 61 6e 64 20 6c 69 66 65 20 77 69 6c 6c 20 62   and life will b
     0003.1dc0  65 20 6d 75 63 68 20 65 61 73 69 65 72 20 66 6f  e much easier fo
     0003.1dd0  72 20 75 73 20 61 6c 6c 2e 0a 0a 0a 00 00 00 00  r us all........
     0003.1de0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0003.1df0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0003.1e00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     ...
     0003.1fd0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0003.1fe0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0003.1ff0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0003.2000  35 2e 47 41 4d 45 0a 58 2d 46 54 4e 2d 4d 53 47  5.GAME.X-FTN-MSG
     0003.2010  49 44 3a 20 32 3a 35 30 31 30 2f 31 34 36 2e 38  ID: 2:5010/146.8
     0003.2020  20 33 35 64 34 39 65 61 61 0a 58 2d 46 54 4e 2d   35d49eaa.X-FTN-
 
 				**************
 
     hdump D.00e5a22c/B.05e5 | less
     ...
     0000.ff40  20 73 74 6f 72 65 20 74 68 65 20 45 53 50 0a 76   store the ESP.v
     0000.ff50  61 6c 75 65 20 69 6e 20 67 6c 6f 62 61 6c 20 6d  alue in global m
     0000.ff60  65 6d 6f 72 79 20 73 6f 6d 65 77 68 65 72 65 2e  emory somewhere.
     0000.ff70  20 41 66 74 65 72 20 61 6c 6c 20 73 74 61 63 6b   After all stack
     0000.ff80  20 77 6f 75 6c 64 20 62 65 20 69 6e 61 63 63 65   would be inacce
     0000.ff90  73 73 69 62 6c 65 20 74 6f 0a 72 65 73 74 6f 72  ssible to.restor
     0000.ffa0  65 20 69 74 20 77 68 65 72 65 20 49 20 74 6f 20  e it where I to 
     0000.ffb0  70 75 73 68 20 69 74 2e 2e 2e 0a 0a 4d 69 63 68  push it.....Mich
     0000.ffc0  61 65 6c 0a 0a 0a 00 00 00 00 00 00 00 00 00 00  ael.............
     0000.ffd0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0000.ffe0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0000.fff0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0001.0000  65 72 6f 6c 73 21 68 6f 77 6c 61 6e 64 2e 65 72  erols!howland.er
     0001.0010  6f 6c 73 2e 6e 65 74 21 76 69 78 65 6e 2e 63 73  ols.net!vixen.cs
     0001.0020  6f 2e 75 69 75 63 2e 65 64 75 21 73 64 64 2e 68  o.uiuc.edu!sdd.h
     0001.0030  70 2e 63 6f 6d 21 6e 69 67 68 74 2e 70 72 69 6d  p.com!night.prim
     0001.0040  61 74 65 2e 77 69 73 63 2e 65 64 75 21 6e 6e 74  ate.wisc.edu!nnt
     0001.0050  70 2e 6d 73 73 74 61 74 65 2e 65 64 75 21 70 61  p.msstate.edu!pa
     ...
 
     0001.1c90  0a 3e 0a 3e 47 72 65 67 0a 3e 65 6d 61 69 6c 3a  .>.>Greg.>email:
     0001.1ca0  20 67 72 65 67 6f 72 30 40 70 72 69 76 2e 6f 6e   gregor0@priv.on
     0001.1cb0  65 74 2e 70 6c 0a 3e 54 68 65 20 4d 61 73 71 75  et.pl.>The Masqu
     0001.1cc0  65 72 61 64 65 20 63 6f 6e 74 69 6e 75 65 73 2e  erade continues.
     0001.1cd0  2e 2e 0a 3e 0a 3e 0a 0a 0a 00 00 00 00 00 00 00  ...>.>..........
     0001.1ce0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0001.1cf0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     ...
     0001.1fd0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0001.1fe0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0001.1ff0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0001.2000  3a 22 49 20 6f 62 6a 65 63 74 2c 20 79 6f 75 72  :"I object, your
     0001.2010  20 68 6f 6e 6f 72 2c 20 61 73 6b 65 64 20 61 6e   honor, asked an
     0001.2020  64 20 61 6e 73 77 65 72 65 64 21 22 0a 48 6f 6e  d answered!".Hon
     0001.2030  6f 72 3a 22 53 55 53 54 41 49 4e 45 44 21 22 0a  or:"SUSTAINED!".
     0001.2040  0a 50 6c 65 61 73 65 20 72 65 61 64 20 61 6c 6c  .Please read all
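
    The dumps above share a pattern: a run of NUL bytes begins partway
    through a block and ends at an offset that is a multiple of 0x2000,
    after which valid data resumes.  A minimal sketch of a scanner for
    that pattern follows.  The name find_zero_hole, the BLOCK_SIZE
    constant and the min_len threshold are illustrative assumptions;
    none of them comes from diablo or hdump.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 8K blocks, matching the 0x2000-aligned offsets in the dumps. */
#define BLOCK_SIZE 0x2000UL

/*
 * Find the first run of NUL bytes that is at least min_len long, ends
 * exactly on a BLOCK_SIZE boundary, and is followed by more data.
 * Returns 1 and stores the run's start offset and length if one is
 * found, 0 otherwise.
 */
static int
find_zero_hole(const unsigned char *buf, size_t len, size_t min_len,
    size_t *start, size_t *runlen)
{
	size_t i = 0;

	assert(buf != NULL || len == 0);
	while (i < len) {
		size_t j;

		if (buf[i] != 0) {
			i++;
			continue;
		}
		/* Extend the run of zeros: it covers [i, j). */
		for (j = i; j < len && buf[j] == 0; j++)
			;
		if (j - i >= min_len && j < len && (j % BLOCK_SIZE) == 0) {
			*start = i;
			*runlen = j - i;
			return (1);
		}
		i = j;
	}
	return (0);
}
```

    Given a 0x4000-byte buffer of non-zero data with zeros from 0x1092
    through 0x1fff, the function reports a start of 0x1092 and a length
    of 0xf6e.  A zero run that reaches the end of the buffer is not
    reported, since it cannot be told apart from a file that simply
    ends there.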
 
 :> 
 :I have written a small test program to simulate what diablo does, and so
 :far have not been able to get any corruption. The test was done on an otherwise
 :idle machine. So heavy load is definitely a contributing factor here.
 :
 
     I've tried writing similar test programs, with no luck either.  The
     load on the real machine has got to be a significant factor, but since
     the memory isn't starved I'm leaning towards disk latency as a
     possible factor.
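
    For reference, a test of this kind might look like the minimal
    sketch below.  It is not either of the programs mentioned above,
    and it runs in a single process, so it does not reproduce the load
    or the concurrent readers of the real machine.  The function name
    append_and_verify and the REC_SIZE constant are hypothetical.

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical record size, deliberately not a multiple of the block size. */
#define REC_SIZE 1000

/*
 * Append nrec records to path with write().  After each append, map the
 * whole file shared and read-only and check that the mapping shows the
 * record just written.  Returns the number of records that did not
 * match, or -1 if a system call failed.
 */
static int
append_and_verify(const char *path, int nrec)
{
	char rec[REC_SIZE];
	int bad = 0, fd, i;

	assert(nrec > 0);
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC | O_APPEND, 0600);
	if (fd < 0)
		return (-1);
	for (i = 0; i < nrec; i++) {
		size_t size = (size_t)(i + 1) * REC_SIZE;
		char *map;

		/* Fill each record with a distinct non-zero byte. */
		memset(rec, 'A' + (i % 26), sizeof(rec));
		if (write(fd, rec, sizeof(rec)) != (ssize_t)sizeof(rec)) {
			close(fd);
			return (-1);
		}
		map = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
		if (map == MAP_FAILED) {
			close(fd);
			return (-1);
		}
		/* The tail of the mapping must equal the record just appended. */
		if (memcmp(map + size - REC_SIZE, rec, REC_SIZE) != 0)
			bad++;
		munmap(map, size);
	}
	close(fd);
	return (bad);
}
```

    On a kernel without the bug the function is expected to return 0.
    A closer simulation would run the mmap() side in separate reader
    processes while a single writer appends.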
 
 						-Matt
 
     Matthew Dillon  Engineering, HiWay Technologies, Inc. & BEST Internet 
                     Communications
     <dillon@backplane.com> (Please include original email in any response)    

From: Luoqi Chen <luoqi@watermarkgroup.com>
To: dillon@backplane.com, luoqi@watermarkgroup.com
Cc: freebsd-gnats-submit@freebsd.org, luoqi@chen.ml.org
Subject: Re: kern/7418 (file corruption on mmap-based-read during file write())
Date: Fri, 14 Aug 1998 16:00:41 -0400 (EDT)

 >     I've tried writing similar test programs, with no luck either.  The
 >     load on the real machine has got to be a significant factor, but since
 >     the memory isn't starved I'm leaning towards disk latency as a
 >     possible factor.
 > 
 Just a hunch: would you try a kernel with double the number of buffer headers
 and see if the rate of corruption drops? Currently the default number
 of buffers `nbuf' is 2078, which amounts to about 16M of data for 8K blocks;
 that's barely enough for a heavily loaded news server. For a diablo server,
 there's effectively just one writing process, but scores of readers,
 so disk latency there should not be too high, unless the readers trying to
 allocate scarce buf headers are forcing many delayed writes to be committed
 to disk immediately. If increasing the buffer count reduces disk latency,
 then it should also reduce the number of corruptions. Use the config option
 `options NBUF=nnnn' to override the default. I will also try a kernel with
 a reduced nbuf on a lightly loaded machine and see if I can produce some
 corruption.
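
 The 16M estimate above is just nbuf multiplied by the block size. A small
 sketch of that arithmetic follows; the helper name bufcache_bytes is
 hypothetical, and the calculation assumes every buffer header maps one
 full 8K block.

```c
#include <assert.h>

/*
 * Rough upper bound on the data covered by the buffer headers, assuming
 * each of the nbuf headers maps one block of blksize bytes.
 */
static unsigned long
bufcache_bytes(unsigned long nbuf, unsigned long blksize)
{
	assert(blksize > 0);
	return (nbuf * blksize);
}
```

 With nbuf = 2078 and 8192-byte blocks this gives 17022976 bytes, a little
 over 16M. Doubling nbuf to 4156 gives 34045952 bytes, a little over 32M.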
 
 -lq
 
 > 						-Matt
 > 
 >     Matthew Dillon  Engineering, HiWay Technologies, Inc. & BEST Internet 
 >                     Communications
 >     <dillon@backplane.com> (Please include original email in any response)    
 > 
State-Changed-From-To: open->closed 
State-Changed-By: jkoshy 
State-Changed-When: Wed Nov 18 18:55:30 PST 1998 
State-Changed-Why:  
Was fixed by dillon in rev 1.178 of "src/sys/kern/vfs_bio.c". 
>Unformatted:
