From nobody@FreeBSD.org  Mon Jul  5 12:21:13 2004
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 66B8816A4CE
	for <freebsd-gnats-submit@FreeBSD.org>; Mon,  5 Jul 2004 12:21:13 +0000 (GMT)
Received: from www.freebsd.org (www.freebsd.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 5EE9E43D3F
	for <freebsd-gnats-submit@FreeBSD.org>; Mon,  5 Jul 2004 12:21:13 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.12.11/8.12.11) with ESMTP id i65CLCOo057711
	for <freebsd-gnats-submit@FreeBSD.org>; Mon, 5 Jul 2004 12:21:12 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.12.11/8.12.11/Submit) id i65CLC3Q057710;
	Mon, 5 Jul 2004 12:21:12 GMT
	(envelope-from nobody)
Message-Id: <200407051221.i65CLC3Q057710@www.freebsd.org>
Date: Mon, 5 Jul 2004 12:21:12 GMT
From: KOIE Hidetaka <koie@suri.co.jp>
To: freebsd-gnats-submit@FreeBSD.org
Subject: write(2) returns wrong value when EFAULT 
X-Send-Pr-Version: www-2.3

>Number:         68690
>Category:       kern
>Synopsis:       [libc] write(2) returns wrong value when EFAULT
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Jul 05 12:30:22 GMT 2004
>Closed-Date:    
>Last-Modified:  Mon Oct 24 18:05:15 GMT 2005
>Originator:     KOIE Hidetaka
>Release:        FreeBSD-5.2 CURRENT
>Organization:
SURIGIKEN
>Environment:
FreeBSD sakura.suri.co.jp 5.2-CURRENT FreeBSD 5.2-CURRENT #1: Tue Jun 15 11:43:02 JST 2004     koie@sakura.suri.co.jp:/usr/obj/usr/src/sys/SAKURA  i386
>Description:
Invoking write(fd, buf, size), if buf has both validand invalid segment,
write return -1, but it's file pointer has been advanced.
The caller mistakes the pointer stays.

>How-To-Repeat:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

int
main()
{
    const int PAGESIZE = sysconf(_SC_PAGESIZE);
    const int N = 5;
    const int SIZE = N * PAGESIZE;
    int ifd = open("in.dat", O_RDWR|O_CREAT|O_TRUNC, 0600);
    if (ifd < 0) {
        perror("open");
        goto out;
    }
    int should = 0;
    for (int i = 0; i < N - 2; i++) {
        char buf[PAGESIZE];
        if (write(ifd, buf, sizeof buf) != sizeof buf) {
            perror("write");
            goto out;
        }
        should += sizeof buf;
    }

    int ofd = open("out.dat", O_WRONLY|O_CREAT|O_TRUNC, 0600);
    if (ofd < 0) {
        perror("open");
        goto out;
    }

    void *addr = mmap(0, SIZE, PROT_READ, MAP_PRIVATE, ifd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        goto out;
    }
    fprintf(stderr, "pos=%ld\n", (long)lseek(ofd, 0, SEEK_CUR));
    int n = write(ofd, addr, SIZE);
    fprintf(stderr, "write(%d)->%d (should be %d)\n", SIZE, n, should);
    perror("write");
    fprintf(stderr, "pos=%ld (should be %d)\n", (long)lseek(ofd, 0, SEEK_CUR), should);
 out:
    exit(0);
}

>Fix:
Sorry, I don't know.
>Release-Note:
>Audit-Trail:

From: Bruce Evans <bde@zeta.org.au>
To: KOIE Hidetaka <koie@suri.co.jp>
Cc: freebsd-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/68690: write(2) returns wrong vlalue when EFAULT 
Date: Wed, 7 Jul 2004 18:23:57 +1000 (EST)

 On Mon, 5 Jul 2004, KOIE Hidetaka wrote:
 
 > >Description:
 > Invoking write(fd, buf, size), if buf has both validand invalid segment,
 > write return -1, but it's file pointer has been advanced.
 > The caller mistakes the pointer stays.
 
 This is a longstanding bug.  It is more of a problem for non-seekable
 devices since physically written can't be undone, so the best possible
 error handling is to return a short write as specified in all versions
 of POSIX.
 
 Do you actually see the file pointer advanced?  This may be file system
 dependent.  ffs is supposed to back out of the write, and it does so
 for me.  Output:
 
 %%%
 pos=0
 write(20480)->-1 (should be 12288)
 write: Bad address
 pos=0 (should be 12288)
 %%%
 
 This is correct.  We tried to write 20K but the buffer size is only 12K
 and the was an i/o error after 12K.  ffs_write() handled the error by
 truncating the file to its original size and rerstoring the original
 file system, and since the write was at the end of the file this has
 the same result as if the write had failed with EFAULT after 0K.  I
 think there is also no way for other processes to see the write while
 it is in progress, since even if they have mmapped the file they can't
 see beyond its original length.
 
 To show an actual bug for ffs, modifiy the example slightly so that
 it overwrites existing data.  Then truncating the file has no effect,
 and it is just wrong for write() to return -1 after clobbering some
 data.
 
 > >Fix:
 
 (1) Here is ffs_write()'s buggy error handling:
 
 % 	for (error = 0; uio->uio_resid > 0;) {
 % 		...
 % 		if (uio->uio_offset + xfersize > ip->i_size)
 % 			vnode_pager_setsize(vp, uio->uio_offset + xfersize);
 
 Not error handling, but general setup.  Hopefully extending the size at the
 vm level doesn't allow other other processes to see up to the new EOF yet.
 
 %
 %                 /*
 % 		 * We must perform a read-before-write if the transfer size
 % 		 * does not cover the entire buffer.
 %                  */
 % 		if (fs->fs_bsize > xfersize)
 % 			flags |= BA_CLRBUF;
 % 		else
 % 			flags &= ~BA_CLRBUF;
 
 Security-related code.  Other processes can see buffer contents via
 mmap(), at least after completion of extension of the file at the vfs
 level, so we must clear the buffer contents now.  This also avoids
 potential leakage of unwritten buffer contents to the same process
 after buggy EFAULT handling.
 
 % 		...
 % 		/*
 % 		 * If the buffer is not valid we have to clear out any
 % 		 * garbage data from the pages instantiated for the buffer.
 % 		 * If we do not, a failed uiomove() during a write can leave
 % 		 * the prior contents of the pages exposed to a userland
 % 		 * mmap().  XXX deal with uiomove() errors a better way.
 % 		 */
 % 		if ((bp->b_flags & B_CACHE) == 0 && fs->fs_bsize <= xfersize)
 % 			vfs_bio_clrbuf(bp);
 
 More buffer clearing, for another case.  It's not so clear that this is
 done before other processes can see the buffer.
 
 % 	...
 % 	/*
 % 	 * If we successfully wrote any data, and we are not the superuser
 % 	 * we clear the setuid and setgid bits as a precaution against
 % 	 * tampering.
 % 	 */
 % 	if (resid > uio->uio_resid && ap->a_cred &&
 % 	    suser_cred(ap->a_cred, PRISON_ROOT)) {
 % 		ip->i_mode &= ~(ISUID | ISGID);
 % 		DIP(ip, i_mode) = ip->i_mode;
 % 	}
 
 Misplaced code.  We are going to back out of the write in some cases, so
 we don't know if we successefully wrote any data here.
 
 % 	...
 % 	if (error) {
 % 		if (ioflag & IO_UNIT) {
 % 			(void)UFS_TRUNCATE(vp, osize,
 % 			    IO_NORMAL | (ioflag & IO_SYNC),
 % 			    ap->a_cred, uio->uio_td);
 % 			uio->uio_offset -= resid - uio->uio_resid;
 % 			uio->uio_resid = resid;
 % 		}
 % 	} ...
 
 Here is where we back out of the write in some cases.
 (a) We only back out in the IO_UNIT case, but that is the usual case and
     is always the case for write(2).  (There is another bug suite
     involving IO_UNIT.  Most references to IO_UNIT in the kernel except
     those related to userland i/o don't actually give atomic writes
     since writes are made non-atomic using vfs_rdwr_inchunks() anyway.
     OTOH, there seems to be no reason for ffs to skip the backout in
     the !IO_UNIT case, so there is no need for the IO_UNIT flag to exist.)
 (b) We cannot back out if anything was written before the original end of
     the file.  Truncation just increases the mess in that case, since it
     only reduces the file to its original size, but the file position
     (in uio->uio_offset) is reduced by the full amount.
 
 Fixes for ffs_write(): ignore IO_UNIT, and don't rewind the file pointer
 below osize.  This gives short for writes that wrote something before
 osize and then failed.
 
 Similarly for other file systems.
 
 Here is write(2)'s buggy error handling (in dofilewrite()):
 
 % 	if ((error = fo_write(fp, &auio, td->td_ucred, flags, td))) {
 % 		if (auio.uio_resid != cnt && (error == ERESTART ||
 % 		    error == EINTR || error == EWOULDBLOCK))
 % 			error = 0;
 
 (auio.uio_resid != cnt) means that the write was short.  It should be
 returned to userland as a short write by setting error to 0 unconditionally.
 LOwer layers mostly get this write and return a short write if and only
 if they couldn't back out of the write, but they still return a nonzero
 error code to indicate that there was a problem.  The userland interface
 (write(2) in this case) just doesn't support returning both an error code
 and a byte count, so one of them must be discarded, and it must be the
 error code since it is useful and POSIX standard to return the byte count.
 
 Since ffs attempts to backs out of failing writes, and the backup is
 successful in most cases, this bug mainly affects devices.
 
 Similarly for read(2) and writev(2), etc.
 
 Bruce

From: KOIE Hidetaka (=?iso-2022-jp?B?GyRCOHE5PjFRTjQhdz90TX01OzgmGyhC?=) <koie@suri.co.jp>
To: bde@zeta.org.au
Cc: freebsd-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/68690: write(2) returns wrong vlalue when EFAULT 
Date: Thu, 08 Jul 2004 12:07:56 +0900 (JST)

 Thanks for your inspection.
 
   Message-Id: <20040707172014.C3185@gamplex.bde.org>
   Date:       Wed, 7 Jul 2004 18:23:57 +1000 (EST)
   From:       Bruce Evans <bde@zeta.org.au>
   Subject:    Re: kern/68690: write(2) returns wrong vlalue when EFAUL..
 
   | Do you actually see the file pointer advanced?  This may be file system
   | dependent.  ffs is supposed to back out of the write, and it does so
   | for me.  Output:
   | 
   | %%%
   | pos=0
   | write(20480)->-1 (should be 12288)
   | write: Bad address
   | pos=0 (should be 12288)
   | %%%
 
 I was unaware of examining on NFS.
 %%%
 pos=0
 write(20480)->-1 (should be 12288)
 write: Bad address
 pos=8192 (should be 12288)
 %%%
 
 The file pointer is not backed.
 
 
   | > >Fix:
 
 I understand some problems exist.
 
 --
 KOIE Hidetaka / koie@suri.co.jp / SURIGIKEN Co.,LTD.

From: Bruce Evans <bde@zeta.org.au>
To: KOIE Hidetaka <koie@suri.co.jp>
Cc: freebsd-gnats-submit@freebsd.org, freebsd-bugs@freebsd.org
Subject: Re: kern/68690: write(2) returns wrong vlalue when EFAULT 
Date: Thu, 8 Jul 2004 17:19:51 +1000 (EST)

 On Thu, 8 Jul 2004, KOIE Hidetaka wrote:
 
 >   From:       Bruce Evans <bde@zeta.org.au>
 >   | Do you actually see the file pointer advanced?  This may be file system
 >   | dependent.  ffs is supposed to back out of the write, and it does so
 >   | for me.  Output:
 >   |
 >   | %%%
 >   | pos=0
 >   | write(20480)->-1 (should be 12288)
 >   | write: Bad address
 >   | pos=0 (should be 12288)
 >   | %%%
 >
 > I was unaware of examining on NFS.
 
 I think you mean that you saw this error for nfs.
 
 > %%%
 > pos=0
 > write(20480)->-1 (should be 12288)
 > write: Bad address
 > pos=8192 (should be 12288)
 > %%%
 >
 > The file pointer is not backed.
 
 I see this behaviour for nfs to.  It is because nfs_write() just doesn't
 back out of the write like ffs_write does.  From nfs_bio.c rev.1.132:
 
 % int
 % nfs_write(struct vop_write_args *ap)
 % {
 % ...
 % 	do {
 % ...
 % 		error = uiomove((char *)bp->b_data + on, n, uio);
 %
 % 		/*
 % 		 * Since this block is being modified, it must be written
 % 		 * again and not just committed.  Since write clustering does
 % 		 * not work for the stage 1 data write, only the stage 2
 % 		 * commit rpc, we have to clear B_CLUSTEROK as well.
 % 		 */
 % 		bp->b_flags &= ~(B_NEEDCOMMIT | B_CLUSTEROK);
 %
 % 		if (error) {
 % 			bp->b_ioflags |= BIO_ERROR;
 % 			brelse(bp);
 % 			break;
 % 		}
 % 	} while (uio->uio_resid > 0 && n > 0);
 %
 % 	if (haverslock)
 % 		nfs_rsunlock(np, td);
 %
 % 	return (error);
 % }
 
 After an error in uiomove(), nfs_write() just unlocks and returns, so
 it provides no protection from the bug in dofilewrite() even if the
 i/o was at the end of the file.  Only the case where the error occurs
 before any i/o is done is handled correctly.
 
 Bruce
>Unformatted:
