From nge@cs.hmc.edu  Fri Dec 16 19:16:21 2005
Return-Path: <nge@cs.hmc.edu>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 255ED16A420
	for <freebsd-gnats-submit@freebsd.org>; Fri, 16 Dec 2005 19:16:21 +0000 (GMT)
	(envelope-from nge@cs.hmc.edu)
Received: from turing.cs.hmc.edu (turing.cs.hmc.edu [134.173.42.99])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 6A77143D60
	for <freebsd-gnats-submit@freebsd.org>; Fri, 16 Dec 2005 19:16:14 +0000 (GMT)
	(envelope-from nge@cs.hmc.edu)
Received: by turing.cs.hmc.edu (Postfix, from userid 26983)
	id C6E0A53231; Fri, 16 Dec 2005 11:16:10 -0800 (PST)
Received: from localhost (localhost [127.0.0.1])
	by turing.cs.hmc.edu (Postfix) with ESMTP id B14805A8DE
	for <freebsd-gnats-submit@freebsd.org>; Fri, 16 Dec 2005 11:16:10 -0800 (PST)
Message-Id: <Pine.GSO.4.63.0512161112560.13263@turing>
Date: Fri, 16 Dec 2005 11:16:10 -0800 (PST)
From: Nate Eldredge <nge@cs.hmc.edu>
To: freebsd-gnats-submit@freebsd.org
Subject: Snapshot corruption after fs activity

>Number:         90512
>Category:       kern
>Synopsis:       [64-bit] Snapshot corruption after fs activity
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Dec 16 19:20:03 GMT 2005
>Closed-Date:    Sat Jan 06 00:04:24 GMT 2007
>Last-Modified:  Sat Jan 06 00:04:24 GMT 2007
>Originator:     Nate Eldredge
>Release:        FreeBSD 6.0-RELEASE amd64
>Organization:
>Environment:
System: FreeBSD vulcan.lan 6.0-RELEASE FreeBSD 6.0-RELEASE #0: Wed Dec 14 20:08:57 PST 2005 nate@vulcan.lan:/usr/obj/usr/src/sys/VULCAN amd64



>Description:
When you use mksnap_ffs to take a snapshot of a filesystem and then delete
and re-create a large number of files on it, the snapshot becomes corrupt.

I think this is fairly serious since snapshots may be used for backup
purposes.  That's how I originally discovered the problem; I made a
snapshot on /usr before making a bunch of changes, during which I
accidentally moved most of /usr/local to another partition :).  I moved
it back but wanted to verify that everything was back as it was,
which is when I discovered my snapshot was no good.

Note this is on amd64.  I have not tried i386.
>How-To-Repeat:
# dd if=/dev/zero of=snaptest.img bs=1024k count=1000
# mdconfig -a -t vnode -f snaptest.img
md0
# newfs /dev/md0
# mount /dev/md0 /mnt/md0
# cd /mnt/md0
# tar xjf /usr/ports/distfiles/gap/gap4r4p6.tar.bz2 
# mksnap_ffs /mnt/md0 /mnt/md0/.snap/snap1
# mdconfig -a -t vnode -f .snap/snap1
WARNING: opening backing store: /mnt/md0/.snap/snap1 readonly
md1
# mount -r /dev/md1 /mnt/md1
###### inspecting /mnt/md1 reveals the snapshot is apparently okay
# rm -r gap4r4
###### snapshot still apparently okay
# !tar
tar xjf /usr/ports/distfiles/gap/gap4r4p6.tar.bz2
# ls -l /mnt/md1/gap4r4
ls: Makefile.in: Bad file descriptor
ls: bin: Bad file descriptor
ls: cnf: Bad file descriptor
ls: configure: Bad file descriptor
ls: doc: Bad file descriptor
ls: etc: Bad file descriptor
ls: gap.shi: Bad file descriptor
ls: grp: Bad file descriptor
ls: pkg: Bad file descriptor
ls: prim: Bad file descriptor
ls: small: Bad file descriptor
ls: src: Bad file descriptor
ls: sysinfo.in: Bad file descriptor
ls: trans: Bad file descriptor
ls: tst: Bad file descriptor
total 38
-rw-r--r--  1 nate  nate   4782 Aug 29 06:19 README
-rw-r--r--  1 nate  nate   9725 May 11  2005 description4r4p5
-rw-r--r--  1 nate  nate  11660 Aug 29 06:05 description4r4p6
drwxr-xr-x  2 nate  nate   9728 Aug 30 06:27 lib
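
The manual inspection with ls above can also be scripted, which may help
when retesting.  The following is only a minimal sketch: the function name
count_stat_failures is made up for illustration and is not part of any
FreeBSD tool.  It assumes that a damaged entry still shows up in the
directory listing but cannot be stat'ed, as in the output above, and it
only looks at non-hidden entries one level deep.

```sh
#!/bin/sh
# Hypothetical helper, not part of the original report.
# Prints how many non-hidden entries directly under DIR cannot be
# examined with "ls -ld" (which has to stat each operand).
count_stat_failures() {
    dir=$1
    failures=0
    for entry in "$dir"/*; do
        # An unmatched glob is left as the literal pattern; skip it.
        # (A file literally named "*" is therefore skipped as well.)
        case $entry in
            "$dir/*") continue ;;
        esac
        # ls exits non-zero when it cannot stat the operand.
        if ! ls -ld -- "$entry" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
    done
    echo "$failures"
}
```

It would be run as "count_stat_failures /mnt/md1/gap4r4" after each step of
the transcript; any non-zero count means some entries could not be stat'ed.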


Running ls under truss shows that lstat() returns EBADF for the offending
files, which makes no sense since lstat() takes a path and no file
descriptor is involved; EIO might be better.  Also, unmounting /dev/md1 and
then running fsck on it produces a large number of errors, of which the
following is a representative sample:

PARTIALLY TRUNCATED INODE I=70662
3689066227402421815 BAD I=70662
4121129229942796344 BAD I=70662
3833180345978203193 BAD I=70662
4051046384641915184 BAD I=70662
3688509874569295664 BAD I=70662
3472592161990062385 BAD I=70662
3906084542581519160 BAD I=70662
4049637910162848049 BAD I=70662
4123381021216356400 BAD I=70662
3979273551213759020 BAD I=70662
4051327820913194809 BAD I=70662
EXCESSIVE BAD BLKS I=70662
INCORRECT BLOCK COUNT I=70662 (960 should be 736)
PARTIALLY TRUNCATED INODE I=70719
UNALLOCATED  I=23552  OWNER=nate MODE=0
DIRECTORY CORRUPTED  I=70660  OWNER=nate MODE=40755
MISSING '.'  I=71129  OWNER=nate MODE=40755
SIZE=1536 MTIME=Aug 30 06:27 2005 
UNREF DIR  I=117760  OWNER=nate MODE=40755
SIZE=512 MTIME=Aug 30 06:27 2005 
LINK COUNT DIR I=2  OWNER=root MODE=40755
SIZE=512 MTIME=Dec 16 10:34 2005  COUNT 4 SHOULD BE 3

The original filesystem /dev/md0 apparently
remains okay and fsck reports no errors for it.

There are no kernel error messages this time, though a previous attempt
(when the snapshot was on /dev/md0) yielded

/mnt/md0: bad dir ino 3182535 at offset 0: mangled entry
/mnt/md0: bad dir ino 2953 at offset 0: mangled entry
...4 or 5 more...

Also at that time some directories had changed into files of size 1,
which nevertheless dumped many, many bytes of garbage when cat'ted.

>Fix:
Unknown.

Thanks!

>Release-Note:
>Audit-Trail:

From: Nate Eldredge <nge@cs.hmc.edu>
To: bug-followup@FreeBSD.org, nge@cs.hmc.edu
Cc:  
Subject: Re: kern/90512: Snapshot corruption after fs activity
Date: Fri, 16 Dec 2005 22:49:55 -0800 (PST)

 FWIW, I can't seem to reproduce this on my i386/CURRENT box.  Since there 
 don't appear to be any significant FFS changes between 6.0-RELEASE and 
 CURRENT, this may be a 64-bit issue.
 
 -- 
 Nate Eldredge
 nge@cs.hmc.edu
State-Changed-From-To: open->feedback 
State-Changed-By: kib 
State-Changed-When: Fri Jan 5 13:53:48 UTC 2007 
State-Changed-Why:  
The fix for the problem was committed as rev. 1.103.2.17 on RELENG_6
and will be included in 6.2.  Please retest.

http://www.freebsd.org/cgi/query-pr.cgi?pr=90512 
State-Changed-From-To: feedback->closed 
State-Changed-By: linimon 
State-Changed-When: Sat Jan 6 00:03:54 UTC 2007 
State-Changed-Why:  
A fix has been committed. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=90512 
>Unformatted:
