From nobody@FreeBSD.org  Sun Aug 21 13:02:50 2011
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E801A106566C
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 21 Aug 2011 13:02:49 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from red.freebsd.org (red.freebsd.org [IPv6:2001:4f8:fff6::22])
	by mx1.freebsd.org (Postfix) with ESMTP id D6D988FC0A
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 21 Aug 2011 13:02:49 +0000 (UTC)
Received: from red.freebsd.org (localhost [127.0.0.1])
	by red.freebsd.org (8.14.4/8.14.4) with ESMTP id p7LD2nfH092323
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 21 Aug 2011 13:02:49 GMT
	(envelope-from nobody@red.freebsd.org)
Received: (from nobody@localhost)
	by red.freebsd.org (8.14.4/8.14.4/Submit) id p7LD2nGD092322;
	Sun, 21 Aug 2011 13:02:49 GMT
	(envelope-from nobody)
Message-Id: <201108211302.p7LD2nGD092322@red.freebsd.org>
Date: Sun, 21 Aug 2011 13:02:49 GMT
From: Kirk Russell <kirk@ba23.org>
To: freebsd-gnats-submit@FreeBSD.org
Subject: panic with soft updates journaling during load testing
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         159971
>Category:       kern
>Synopsis:       [ffs] [panic] panic with soft updates journaling during load testing
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    mckusick
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Aug 21 13:10:04 UTC 2011
>Closed-Date:    Thu Apr 19 22:49:42 UTC 2012
>Last-Modified:  Thu Apr 19 22:49:42 UTC 2012
>Originator:     Kirk Russell
>Release:        FreeBSD 9.0-BETA1
>Organization:
bstg
>Environment:
FreeBSD kleenex 9.0-BETA1 FreeBSD 9.0-BETA1 #0: Fri Aug 12 21:31:10 IST 2011     root@kleenex:/usr/obj/usr/src/sys/GENERIC  i386
>Description:
I have been testing a scratch filesystem, with soft updates journaling enabled.
I have been seeing one of these two panics:
    panic: ino 0xc5d0f600(0x3C8209) 14147, 7047 != 14098
and
    panic: Bad link elm 0xc4d7cd00 prev->next != elm
If I disable soft updates journaling, I do not see these panics.
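
The wording of the second message appears to match the sanity checks that the FreeBSD <sys/queue.h> list macros perform when the kernel is built with those checks enabled: they fire when an element's back pointer no longer leads to that element. The r233817 commit log later in this PR attributes the corruption to the same inodedep being added to a list twice. Below is a minimal user-space sketch, not kernel code, of how a second head insertion of the same element breaks that invariant. The names struct node, struct list, list_insert_head(), link_ok() and demo_double_insert() are hypothetical and exist only for this illustration, and the tail pointer of a real TAILQ is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy list element laid out like a queue(3) entry (assumption: simplified). */
struct node {
    struct node  *next;   /* next element, or NULL at the tail */
    struct node **prev;   /* address of the pointer that points at us */
    int           id;     /* arbitrary label, not a real inode number */
};

struct list {
    struct node *first;
};

/* Head insertion in the style of TAILQ_INSERT_HEAD (tail pointer omitted). */
void
list_insert_head(struct list *l, struct node *n)
{
    n->next = l->first;
    if (l->first != NULL)
        l->first->prev = &n->next;
    l->first = n;
    n->prev = &l->first;
}

/* The invariant behind a "prev->next != elm" style check. */
bool
link_ok(const struct node *n)
{
    assert(n->prev != NULL);
    return *n->prev == n;
}

/*
 * Build the list b -> a, optionally insert b a second time, and report
 * whether every element still satisfies the back-pointer invariant.
 */
bool
demo_double_insert(bool insert_twice)
{
    struct list l = { NULL };
    struct node a = { NULL, NULL, 1 };
    struct node b = { NULL, NULL, 2 };

    list_insert_head(&l, &a);
    list_insert_head(&l, &b);          /* list is now b -> a */
    if (insert_twice)
        list_insert_head(&l, &b);      /* b.next now points at b itself */
    return link_ok(&a) && link_ok(&b);
}
```

Calling demo_double_insert(0) leaves the list consistent, while demo_double_insert(1) leaves element a with a back pointer that leads to b, which is the kind of damage a list sanity check would report when a is next removed.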

Here is the source to the load test:
http://www.ba23.org/bstgbugs/bstg0003.c

Here is the output from /var/crash/core.txt:
http://www.ba23.org/bstgbugs/bstg0003.core.txt.gz

panic: ino 0xc5835000(0x3C8209) 1847043, 2320643 != 1420808

(kgdb) #0  doadump (textdump=0) at pcpu.h:244
#1  0xc04e4683 in db_dump (dummy=-1063023718, dummy2=0, dummy3=-1, 
    dummy4=0xed6de6f8 "") at /usr/src/sys/ddb/db_command.c:537
#2  0xc04e3da1 in db_command (last_cmdp=0xc104fc7c, cmd_table=0x0, dopager=1)
    at /usr/src/sys/ddb/db_command.c:448
#3  0xc04e3efa in db_command_loop () at /usr/src/sys/ddb/db_command.c:501
#4  0xc04e5eed in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_main.c:229
#5  0xc0a38d63 in kdb_trap (type=3, code=0, tf=0xed6de8a8)
    at /usr/src/sys/kern/subr_kdb.c:539
#6  0xc0d347bb in trap (frame=0xed6de8a8) at /usr/src/sys/i386/i386/trap.c:719
#7  0xc0d1d69c in calltrap () at /usr/src/sys/i386/i386/exception.s:168
#8  0xc0a38b9a in kdb_enter (why=0xc0eefcd5 "panic", msg=0xc0eefcd5 "panic")
    at cpufunc.h:71
#9  0xc0a04274 in panic (fmt=0xc0f24586 "ino %p(0x%X) %d, %d != %d")
    at /usr/src/sys/kern/kern_shutdown.c:587
#10 0xc0c35964 in softdep_disk_io_initiation (bp=0xc499e89c)
    at /usr/src/sys/ufs/ffs/ffs_softdep.c:9818
#11 0xc0c3d9af in ffs_geom_strategy (bo=0xc5493e7c, bp=0xc499e89c)
    at buf.h:405
#12 0xc0a85e79 in bufwrite (bp=0xc499e89c) at buf.h:398
#13 0xc0c3cfc0 in ffs_bufwrite (bp=0xc499e89c)
    at /usr/src/sys/ufs/ffs/ffs_vfsops.c:2074
#14 0xc0c1b23c in ffs_update (vp=0xc5820220, waitfor=2) at buf.h:386
#15 0xc0c44113 in ffs_syncvnode (vp=0xc5820220, waitfor=2)
    at /usr/src/sys/ufs/ffs/ffs_vnops.c:304
#16 0xc0c3e11f in ffs_sync (mp=0xc5230a20, waitfor=2)
    at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1498
#17 0xc0aa56f0 in sync (td=0xc55082e0, uap=0xed6decec)
    at /usr/src/sys/kern/vfs_syscalls.c:149
#18 0xc0a47663 in syscallenter (td=0xc55082e0, sa=0xed6dece4)
    at /usr/src/sys/kern/subr_trap.c:344
#19 0xc0d34064 in syscall (frame=0xed6ded28)
    at /usr/src/sys/i386/i386/trap.c:1082
#20 0xc0d1d701 in Xint0x80_syscall ()
    at /usr/src/sys/i386/i386/exception.s:266
#21 0x00000033 in ?? ()

>How-To-Repeat:
- as root, set up the filesystem with soft updates journaling
# newfs -j /dev/ada0p4
/dev/ada0p4: 56320.0MB (115343360 sectors) block size 32768, fragment size 4096
        using 77 cylinder groups of 740.00MB, 23680 blks, 47360 inodes.
        with soft updates
super-block backups (for fsck -b #) at:
 192, 1515712, 3031232, 4546752, 6062272, 7577792, 9093312, 10608832, 12124352,
 13639872, 15155392, 16670912, 18186432, 19701952, 21217472, 22732992,
 24248512, 25764032, 27279552, 28795072, 30310592, 31826112, 33341632,
 34857152, 36372672, 37888192, 39403712, 40919232, 42434752, 43950272,
 45465792, 46981312, 48496832, 50012352, 51527872, 53043392, 54558912,
 56074432, 57589952, 59105472, 60620992, 62136512, 63652032, 65167552,
 66683072, 68198592, 69714112, 71229632, 72745152, 74260672, 75776192,
 77291712, 78807232, 80322752, 81838272, 83353792, 84869312, 86384832,
 87900352, 89415872, 90931392, 92446912, 93962432, 95477952, 96993472,
 98508992, 100024512, 101540032, 103055552, 104571072, 106086592, 107602112,
 109117632, 110633152, 112148672, 113664192, 115179712
Using inode 4 in cg 0 for 33554432 byte journal
newfs: soft updates journaling set
# mount /dev/ada0p4 /mnt
# chmod -R a+wrx /mnt
chmod: /mnt/.sujournal: Operation not permitted

- as a regular user, compile and run the load test
$ cat bstg0003.c
/*
 * Copyright 2011 Kirk J. Russell
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <unistd.h>
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <sys/wait.h>

static char * bstg_pathstore[] = {
    "/mnt/111/z",
    "/mnt/111/aaaa",
    "/mnt/111/bbbbb",
    "/mnt/111/ccccc",
    "/mnt/111/d",
    "/mnt/111/e",
    "/mnt/111/ffffff.fff.f",
    "/mnt/111/gggggggggggg",
    "/mnt/111/hhhh",
    "/mnt/111/iiiii.ii",
    "/mnt/111/jjjj.jj.jjjjjjjj",
    "/mnt/111/kkkk.kkkkkkkk",
    "/mnt/111/lllll",
    "/mnt/222/z",
    "/mnt/222/aaaa",
    "/mnt/222/bbbbb",
    "/mnt/222/ccccc",
    "/mnt/222/d",
    "/mnt/222/e",
    "/mnt/222/ffffff.fff.f",
    "/mnt/222/gggggggggggg",
    "/mnt/222/hhhh",
    "/mnt/222/iiiii.ii",
    "/mnt/222/jjjj.jj.jjjjjjjj",
    "/mnt/222/kkkk.kkkkkkkk",
    "/mnt/222/lllll",
    "/mnt/333/z",
    "/mnt/333/aaaa",
    "/mnt/333/bbbbb",
    "/mnt/333/ccccc",
    "/mnt/333/d",
    "/mnt/333/e",
    "/mnt/333/ffffff.fff.f",
    "/mnt/333/gggggggggggg",
    "/mnt/333/hhhh",
    "/mnt/333/iiiii.ii",
    "/mnt/333/jjjj.jj.jjjjjjjj",
    "/mnt/333/kkkk.kkkkkkkk",
    "/mnt/333/lllll",
    "/mnt/444/z",
    "/mnt/444/aaaa",
    "/mnt/444/bbbbb",
    "/mnt/444/ccccc",
    "/mnt/444/d",
    "/mnt/444/e",
    "/mnt/444/ffffff.fff.f",
    "/mnt/444/gggggggggggg",
    "/mnt/444/hhhh",
    "/mnt/444/iiiii.ii",
    "/mnt/444/jjjj.jj.jjjjjjjj",
    "/mnt/444/kkkk.kkkkkkkk",
    "/mnt/444/lllll",
    "/mnt/555/z",
    "/mnt/555/aaaa",
    "/mnt/555/bbbbb",
    "/mnt/555/ccccc",
    "/mnt/555/d",
    "/mnt/555/e",
    "/mnt/555/ffffff.fff.f",
    "/mnt/555/gggggggggggg",
    "/mnt/555/hhhh",
    "/mnt/555/iiiii.ii",
    "/mnt/555/jjjj.jj.jjjjjjjj",
    "/mnt/555/kkkk.kkkkkkkk",
    "/mnt/555/lllll",
    "/mnt/666/z",
    "/mnt/666/aaaa",
    "/mnt/666/bbbbb",
    "/mnt/666/ccccc",
    "/mnt/666/d",
    "/mnt/666/e",
    "/mnt/666/ffffff.fff.f",
    "/mnt/666/gggggggggggg",
    "/mnt/666/hhhh",
    "/mnt/666/iiiii.ii",
    "/mnt/666/jjjj.jj.jjjjjjjj",
    "/mnt/666/kkkk.kkkkkkkk",
    "/mnt/666/lllll",
    "/mnt/777/z",
    "/mnt/777/aaaa",
    "/mnt/777/bbbbb",
    "/mnt/777/ccccc",
    "/mnt/777/d",
    "/mnt/777/e",
    "/mnt/777/ffffff.fff.f",
    "/mnt/777/gggggggggggg",
    "/mnt/777/hhhh",
    "/mnt/777/iiiii.ii",
    "/mnt/777/jjjj.jj.jjjjjjjj",
    "/mnt/777/kkkk.kkkkkkkk",
    "/mnt/777/lllll",
    "/mnt/888/z",
    "/mnt/888/aaaa",
    "/mnt/888/bbbbb",
    "/mnt/888/ccccc",
    "/mnt/888/d",
    "/mnt/888/e",
    "/mnt/888/ffffff.fff.f",
    "/mnt/888/gggggggggggg",
    "/mnt/888/hhhh",
    "/mnt/888/iiiii.ii",
    "/mnt/888/jjjj.jj.jjjjjjjj",
    "/mnt/888/kkkk.kkkkkkkk",
    "/mnt/888/lllll",
    "/mnt/999/z",
    "/mnt/999/aaaa",
    "/mnt/999/bbbbb",
    "/mnt/999/ccccc",
    "/mnt/999/d",
    "/mnt/999/e",
    "/mnt/999/ffffff.fff.f",
    "/mnt/999/gggggggggggg",
    "/mnt/999/hhhh",
    "/mnt/999/iiiii.ii",
    "/mnt/999/jjjj.jj.jjjjjjjj",
    "/mnt/999/kkkk.kkkkkkkk",
    "/mnt/999/lllll",
    "/mnt/aaa/z",
    "/mnt/aaa/aaaa",
    "/mnt/aaa/bbbbb",
    "/mnt/aaa/ccccc",
    "/mnt/aaa/d",
    "/mnt/aaa/e",
    "/mnt/aaa/ffffff.fff.f",
    "/mnt/aaa/gggggggggggg",
    "/mnt/aaa/hhhh",
    "/mnt/aaa/iiiii.ii",
    "/mnt/aaa/jjjj.jj.jjjjjjjj",
    "/mnt/aaa/kkkk.kkkkkkkk",
    "/mnt/aaa/lllll",
    "/mnt/bbb/z",
    "/mnt/bbb/aaaa",
    "/mnt/bbb/bbbbb",
    "/mnt/bbb/ccccc",
    "/mnt/bbb/d",
    "/mnt/bbb/e",
    "/mnt/bbb/ffffff.fff.f",
    "/mnt/bbb/gggggggggggg",
    "/mnt/bbb/hhhh",
    "/mnt/bbb/iiiii.ii",
    "/mnt/bbb/jjjj.jj.jjjjjjjj",
    "/mnt/bbb/kkkk.kkkkkkkk",
    "/mnt/bbb/lllll",
    "/mnt/ccc/z",
    "/mnt/ccc/aaaa",
    "/mnt/ccc/bbbbb",
    "/mnt/ccc/ccccc",
    "/mnt/ccc/d",
    "/mnt/ccc/e",
    "/mnt/ccc/ffffff.fff.f",
    "/mnt/ccc/gggggggggggg",
    "/mnt/ccc/hhhh",
    "/mnt/ccc/iiiii.ii",
    "/mnt/ccc/jjjj.jj.jjjjjjjj",
    "/mnt/ccc/kkkk.kkkkkkkk",
    "/mnt/ccc/lllll",
    "/mnt/ddd/z",
    "/mnt/ddd/aaaa",
    "/mnt/ddd/bbbbb",
    "/mnt/ddd/ccccc",
    "/mnt/ddd/d",
    "/mnt/ddd/e",
    "/mnt/ddd/ffffff.fff.f",
    "/mnt/ddd/gggggggggggg",
    "/mnt/ddd/hhhh",
    "/mnt/ddd/iiiii.ii",
    "/mnt/ddd/jjjj.jj.jjjjjjjj",
    "/mnt/ddd/kkkk.kkkkkkkk",
    "/mnt/ddd/lllll",
    "/mnt/eee/z",
    "/mnt/eee/aaaa",
    "/mnt/eee/bbbbb",
    "/mnt/eee/ccccc",
    "/mnt/eee/d",
    "/mnt/eee/e",
    "/mnt/eee/ffffff.fff.f",
    "/mnt/eee/gggggggggggg",
    "/mnt/eee/hhhh",
    "/mnt/eee/iiiii.ii",
    "/mnt/eee/jjjj.jj.jjjjjjjj",
    "/mnt/eee/kkkk.kkkkkkkk",
    "/mnt/eee/lllll",
    "/mnt/fff/z",
    "/mnt/fff/aaaa",
    "/mnt/fff/bbbbb",
    "/mnt/fff/ccccc",
    "/mnt/fff/d",
    "/mnt/fff/e",
    "/mnt/fff/ffffff.fff.f",
    "/mnt/fff/gggggggggggg",
    "/mnt/fff/hhhh",
    "/mnt/fff/iiiii.ii",
    "/mnt/fff/jjjj.jj.jjjjjjjj",
    "/mnt/fff/kkkk.kkkkkkkk",
    "/mnt/fff/lllll"
};

char *
bstg_pathstore_get()
{
    return bstg_pathstore[rand() %
        ((sizeof(bstg_pathstore)/sizeof(bstg_pathstore[0])))];
}

void
dogcore()
{
    pid_t sleepchild, gcorechild;
    extern char **environ;

    /* create a child for the gcore target */
    if ((sleepchild = fork()) == 0) {
        sleep(30);
        _exit(1);
    } else if (sleepchild > 0) {
        char *token[] = { NULL, NULL, NULL, NULL, NULL };
        char buf[64];
        int status;

        /* use the first process as the target */
        snprintf(buf, sizeof(buf), "%d", sleepchild);
        token[0] = "gcore";
        token[1] = "-c";
        token[2] = bstg_pathstore_get();
        token[3] = buf;
        assert(token[4] == NULL);

        if ((gcorechild = fork()) > 0) {
            waitpid(gcorechild, &status, 0);
        } else if (gcorechild == 0) {
            execve("/usr/bin/gcore", token, environ);
            _exit(1);
        }

        kill(sleepchild, SIGKILL);
        waitpid(sleepchild, &status, 0);
    }
}

void
dowrite()
{
    struct iovec data[] = {
        { "12", 2 },
        { NULL, 0 },
        { "12345678", 8},
    };
    static int fd = -1;

    if (fd == -1) {
        /* keep existing file open during life of this process */
        fd = open(bstg_pathstore_get(), O_RDWR|O_NONBLOCK|O_NOCTTY);
    }

    data[1].iov_base = bstg_pathstore_get();
    data[1].iov_len = strlen((char*)data[1].iov_base);
    ftruncate(fd, 0);
    pwritev(fd, data, 3, 0);
}

void
dounlink()
{
    unlink(bstg_pathstore_get());
}

void
dolink()
{
    link(bstg_pathstore_get(), bstg_pathstore_get());
}

void
domkdir()
{
    char **pdir;
    static char * bstg_dirs[] = {
        "/mnt/111", "/mnt/222", "/mnt/333", "/mnt/444",
        "/mnt/555", "/mnt/666", "/mnt/777", "/mnt/888",
        "/mnt/999", "/mnt/aaa", "/mnt/bbb", "/mnt/ccc",
        "/mnt/ddd", "/mnt/eee", "/mnt/fff", NULL
    };

    for (pdir = bstg_dirs; *pdir; pdir++) {
        mkdir(*pdir, 0777);
    }
}

void
dosync()
{
    sync();
}

int
main()
{
    unsigned x;
    int status;
    void (*funcs[])() = {
        dogcore,
        dowrite,
        dounlink,
        dolink,
        dowrite,
        dounlink,
        dolink,
        dowrite,
        dosync,
        dowrite,
        dounlink,
        dolink,
        dowrite,
        dounlink,
        dolink,
        dowrite,
    };

    /* we only can domkdir() once at startup */
    domkdir();

    /* create 128 children that loop forever, cycling through funcs[] */
    dosync();
    for (x = 0; x < 128; x++) {
        if (fork() == 0) {
            /* give child a new seed for the pathname selection */
            srand(x);

            for (;;) {
                /* each child will start looping at different function */
                (*funcs[x++ % 16]) ();
            }
            /* we never expect this code to run */
            _exit(1);
        }
    }

    /* block forever for all our children */
    while(wait(&status) > 0);
    return 0;
}
$ cc -Wall -O2 bstg0003.c
$ ./a.out
[... takes a few minutes to cause the panic ...]
>Fix:


>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-i386->freebsd-fs 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Mon Aug 22 02:15:14 UTC 2011 
Responsible-Changed-Why:  
Over to maintainer(s). 

http://www.freebsd.org/cgi/query-pr.cgi?pr=159971 

From: Gavin Atkinson <gavin@FreeBSD.org>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/159971: [ffs] [panic] panic with soft updates journaling
 during load testing
Date: Thu, 29 Sep 2011 12:14:21 +0100

 Regression test for this PR committed as r225871.
Responsible-Changed-From-To: freebsd-fs->mckusick 
Responsible-Changed-By: mckusick 
Responsible-Changed-When: Fri Feb 10 17:58:19 UTC 2012 
Responsible-Changed-Why:  
I will take responsibility for this one. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=159971 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/159971: commit references a PR
Date: Mon,  2 Apr 2012 21:58:55 +0000 (UTC)

 Author: mckusick
 Date: Mon Apr  2 21:58:37 2012
 New Revision: 233817
 URL: http://svn.freebsd.org/changeset/base/233817
 
 Log:
   A file cannot be deallocated until its last name has been removed
   and it is no longer referenced by a user process. The inode for a
   file whose name has been removed but which is still referenced at
   the time of a crash will still be allocated in the filesystem, but
   will have no references (i.e., it will have no names referencing it
   from any directory).
   
   With traditional soft updates these unreferenced inodes will be
   found and reclaimed when the background fsck is run. When using
   journaled soft updates, the kernel must keep track of these inodes
   so that it can find and reclaim them during the cleanup process.
   Their existence cannot be stored in the journal as the journal only
   handles short-term events, and they may persist for days. So, they
   are tracked by keeping them in a linked list whose head pointer is
   stored in the superblock. The journal tracks them only until their
   linked list pointers have been committed to disk. Part of the cleanup
   process involves traversing the list of unreferenced inodes and
   reclaiming them.
   
   This bug was triggered when confusion arose in the commit steps
   of keeping the unreferenced-inode linked list coherent on disk.
   Specifically, there was a race between the link() system call
   adding a link to a file and the unlink() system call removing a
   link from the file. If the unlink() ran after link() had looked up
   the file but before link() had incremented the file's link count,
   the link count would drop to zero before the link() incremented it
   back up to one. If the file was referenced by a
   user process, the first transition through zero made it appear
   that it should be added to the unreferenced-inode list when in
   fact it should not have been added. If the new name created by
   link() was deleted within a few seconds (with the file still
   referenced by a user process) it would legitimately be a candidate
   for addition to the unreferenced-inode list. The result was that
   there were two attempts to add the same inode to the unreferenced-inode
   list which scrambled the unreferenced-inode list's pointers leading
   to a panic. The fix is to detect and avoid the false attempt at
   adding it to the unreferenced-inode list by having the link()
   system call check to see if the link count is zero before it
   increments it. If it is, the link() fails with ENOENT (showing that
   it has failed the link()/unlink() race).
   
   While tracking down this bug, we have added additional assertions
   to detect the problem sooner and also simplified some of the code.
   
   Reported by:      Kirk Russell
   Fix submitted by: Jeff Roberson
   Tested by:        Peter Holm
   PR:               kern/159971
   MFC (to 9 only):  2 weeks
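
The interleaving described in the log above can be replayed deterministically in user space. The following is a minimal sketch under stated assumptions, not the kernel implementation: struct toy_inode and the toy_* functions are hypothetical names, the unreferenced-inode list is reduced to a single flag, and none of the deferred soft-updates processing is modelled. It only shows why checking for a zero link count before incrementing avoids the second attempt to add the inode to the list.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for an inode; not the kernel's struct inode. */
struct toy_inode {
    int  effnlink;     /* number of names currently pointing at the file */
    int  open_refs;    /* user processes holding the file open */
    bool on_unlinked;  /* already on the unreferenced-inode list? */
    int  double_adds;  /* attempts to add while already on the list */
};

/* Record an inode whose link count reached zero while still open. */
void
toy_track_unlinked(struct toy_inode *ip)
{
    if (ip->on_unlinked) {
        ip->double_adds++;     /* the case that scrambled the real list */
        return;
    }
    ip->on_unlinked = true;
}

/* Remove one name from the file. */
void
toy_unlink(struct toy_inode *ip)
{
    assert(ip->effnlink > 0);
    if (--ip->effnlink == 0 && ip->open_refs > 0)
        toy_track_unlinked(ip);
}

/*
 * Second half of link(): runs after the lookup already found the file.
 * With check_zero set it refuses to resurrect a file with no names left.
 */
int
toy_link_finish(struct toy_inode *ip, bool check_zero)
{
    if (check_zero && ip->effnlink == 0)
        return ENOENT;
    ip->effnlink++;
    return 0;
}

/* Replay the race; returns the number of double-add attempts seen. */
int
toy_replay_race(bool check_zero)
{
    struct toy_inode ino = { .effnlink = 1, .open_refs = 1 };

    /* link() has looked the file up; unlink() now removes its last name. */
    toy_unlink(&ino);
    /* link() resumes; if it succeeds, the new name is deleted soon after. */
    if (toy_link_finish(&ino, check_zero) == 0)
        toy_unlink(&ino);
    return ino.double_adds;
}
```

In this toy model toy_replay_race(false) reports one double-add attempt, while toy_replay_race(true) reports none because the late link() fails with ENOENT, mirroring the check the commit adds to ufs_link().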
 
 Modified:
   head/sys/ufs/ffs/ffs_softdep.c
   head/sys/ufs/ufs/ufs_vnops.c
 
 Modified: head/sys/ufs/ffs/ffs_softdep.c
 ==============================================================================
 --- head/sys/ufs/ffs/ffs_softdep.c	Mon Apr  2 20:36:35 2012	(r233816)
 +++ head/sys/ufs/ffs/ffs_softdep.c	Mon Apr  2 21:58:37 2012	(r233817)
 @@ -4323,6 +4323,7 @@ inodedep_lookup_ip(ip)
  	(void) inodedep_lookup(UFSTOVFS(ip->i_ump), ip->i_number, dflags,
  	    &inodedep);
  	inodedep->id_nlinkdelta = ip->i_nlink - ip->i_effnlink;
 +	KASSERT((inodedep->id_state & UNLINKED) == 0, ("inode unlinked"));
  
  	return (inodedep);
  }
 @@ -8455,6 +8456,7 @@ softdep_setup_remove(bp, dp, ip, isrmdir
  	if (inodedep_lookup(UFSTOVFS(ip->i_ump), ip->i_number, 0,
  	    &inodedep) == 0)
  		panic("softdep_setup_remove: Lost inodedep.");
 +	KASSERT((inodedep->id_state & UNLINKED) == 0, ("inode unlinked"));
  	dirrem->dm_state |= ONDEPLIST;
  	LIST_INSERT_HEAD(&inodedep->id_dirremhd, dirrem, dm_inonext);
  
 @@ -8987,6 +8989,7 @@ first_unlinked_inodedep(ump)
  	struct inodedep *inodedep;
  	struct inodedep *idp;
  
 +	mtx_assert(&lk, MA_OWNED);
  	for (inodedep = TAILQ_LAST(&ump->softdep_unlinked, inodedeplst);
  	    inodedep; inodedep = idp) {
  		if ((inodedep->id_state & UNLINKNEXT) == 0)
 @@ -8995,11 +8998,8 @@ first_unlinked_inodedep(ump)
  		if (idp == NULL || (idp->id_state & UNLINKNEXT) == 0)
  			break;
  		if ((inodedep->id_state & UNLINKPREV) == 0)
 -			panic("first_unlinked_inodedep: prev != next");
 +			break;
  	}
 -	if (inodedep == NULL)
 -		return (NULL);
 -
  	return (inodedep);
  }
  
 @@ -9038,8 +9038,12 @@ handle_written_sbdep(sbdep, bp)
  	struct mount *mp;
  	struct fs *fs;
  
 +	mtx_assert(&lk, MA_OWNED);
  	fs = sbdep->sb_fs;
  	mp = UFSTOVFS(sbdep->sb_ump);
 +	/*
 +	 * If the superblock doesn't match the in-memory list start over.
 +	 */
  	inodedep = first_unlinked_inodedep(sbdep->sb_ump);
  	if ((inodedep && fs->fs_sujfree != inodedep->id_ino) ||
  	    (inodedep == NULL && fs->fs_sujfree != 0)) {
 @@ -9049,8 +9053,6 @@ handle_written_sbdep(sbdep, bp)
  	WORKITEM_FREE(sbdep, D_SBDEP);
  	if (fs->fs_sujfree == 0)
  		return (0);
 -	if (inodedep_lookup(mp, fs->fs_sujfree, 0, &inodedep) == 0)
 -		panic("handle_written_sbdep: lost inodedep");
  	/*
  	 * Now that we have a record of this inode in stable store allow it
  	 * to be written to free up pending work.  Inodes may see a lot of
 @@ -9078,10 +9080,13 @@ unlinked_inodedep(mp, inodedep)
  {
  	struct ufsmount *ump;
  
 +	mtx_assert(&lk, MA_OWNED);
  	if (MOUNTEDSUJ(mp) == 0)
  		return;
  	ump = VFSTOUFS(mp);
  	ump->um_fs->fs_fmod = 1;
 +	if (inodedep->id_state & UNLINKED)
 +		panic("unlinked_inodedep: %p already unlinked\n", inodedep);
  	inodedep->id_state |= UNLINKED;
  	TAILQ_INSERT_HEAD(&ump->softdep_unlinked, inodedep, id_unlinked);
  }
 @@ -9109,6 +9114,10 @@ clear_unlinked_inodedep(inodedep)
  	ino = inodedep->id_ino;
  	error = 0;
  	for (;;) {
 +		mtx_assert(&lk, MA_OWNED);
 +		KASSERT((inodedep->id_state & UNLINKED) != 0,
 +		    ("clear_unlinked_inodedep: inodedep %p not unlinked",
 +		    inodedep));
  		/*
  		 * If nothing has yet been written simply remove us from
  		 * the in memory list and return.  This is the most common
 @@ -9166,36 +9175,19 @@ clear_unlinked_inodedep(inodedep)
  			ACQUIRE_LOCK(&lk);
  			continue;
  		}
 +		nino = 0;
 +		idn = TAILQ_NEXT(inodedep, id_unlinked);
 +		if (idn)
 +			nino = idn->id_ino;
  		/*
  		 * Remove us from the in memory list.  After this we cannot
  		 * access the inodedep.
  		 */
 -		idn = TAILQ_NEXT(inodedep, id_unlinked);
 -		inodedep->id_state &= ~(UNLINKED | UNLINKLINKS);
 +		KASSERT((inodedep->id_state & UNLINKED) != 0,
 +		    ("clear_unlinked_inodedep: inodedep %p not unlinked",
 +		    inodedep));
 +		inodedep->id_state &= ~(UNLINKED | UNLINKLINKS | UNLINKONLIST);
  		TAILQ_REMOVE(&ump->softdep_unlinked, inodedep, id_unlinked);
 -		/*
 -		 * Determine the next inode number.
 -		 */
 -		nino = 0;
 -		if (idn) {
 -			/*
 -			 * If next isn't on the list we can just clear prev's
 -			 * state and schedule it to be fixed later.  No need
 -			 * to synchronously write if we're not in the real
 -			 * list.
 -			 */
 -			if ((idn->id_state & UNLINKPREV) == 0 && pino != 0) {
 -				idp->id_state &= ~UNLINKNEXT;
 -				if ((idp->id_state & ONWORKLIST) == 0)
 -					WORKLIST_INSERT(&bp->b_dep,
 -					    &idp->id_list);
 -				FREE_LOCK(&lk);
 -				bawrite(bp);
 -				ACQUIRE_LOCK(&lk);
 -				return;
 -			}
 -			nino = idn->id_ino;
 -		}
  		FREE_LOCK(&lk);
  		/*
  		 * The predecessor's next pointer is manually updated here
 @@ -9234,13 +9226,14 @@ clear_unlinked_inodedep(inodedep)
  			bwrite(bp);
  			ACQUIRE_LOCK(&lk);
  		}
 +
  		if (fs->fs_sujfree != ino)
  			return;
  		panic("clear_unlinked_inodedep: Failed to clear free head");
  	}
  	if (inodedep->id_ino == fs->fs_sujfree)
  		panic("clear_unlinked_inodedep: Freeing head of free list");
 -	inodedep->id_state &= ~(UNLINKED | UNLINKLINKS);
 +	inodedep->id_state &= ~(UNLINKED | UNLINKLINKS | UNLINKONLIST);
  	TAILQ_REMOVE(&ump->softdep_unlinked, inodedep, id_unlinked);
  	return;
  }
 @@ -9839,18 +9832,6 @@ initiate_write_inodeblock_ufs2(inodedep,
  		inon = TAILQ_NEXT(inodedep, id_unlinked);
  		dp->di_freelink = inon ? inon->id_ino : 0;
  	}
 -	if ((inodedep->id_state & (UNLINKED | UNLINKNEXT)) ==
 -	    (UNLINKED | UNLINKNEXT)) {
 -		struct inodedep *inon;
 -		ino_t freelink;
 -
 -		inon = TAILQ_NEXT(inodedep, id_unlinked);
 -		freelink = inon ? inon->id_ino : 0;
 -		if (freelink != dp->di_freelink)
 -			panic("ino %p(0x%X) %d, %d != %d",
 -			    inodedep, inodedep->id_state, inodedep->id_ino,
 -			    freelink, dp->di_freelink);
 -	}
  	/*
  	 * If the bitmap is not yet written, then the allocated
  	 * inode cannot be written to disk.
 @@ -10849,10 +10830,9 @@ handle_written_inodeblock(inodedep, bp)
  		freelink = dp2->di_freelink;
  	}
  	/*
 -	 * If we wrote a valid freelink pointer during the last write
 -	 * record it here.
 +	 * Leave this inodeblock dirty until it's in the list.
  	 */
 -	if ((inodedep->id_state & (UNLINKED | UNLINKNEXT)) == UNLINKED) {
 +	if ((inodedep->id_state & (UNLINKED | UNLINKONLIST)) == UNLINKED) {
  		struct inodedep *inon;
  
  		inon = TAILQ_NEXT(inodedep, id_unlinked);
 @@ -10861,12 +10841,9 @@ handle_written_inodeblock(inodedep, bp)
  			if (inon)
  				inon->id_state |= UNLINKPREV;
  			inodedep->id_state |= UNLINKNEXT;
 -		} else
 -			hadchanges = 1;
 -	}
 -	/* Leave this inodeblock dirty until it's in the list. */
 -	if ((inodedep->id_state & (UNLINKED | UNLINKONLIST)) == UNLINKED)
 +		}
  		hadchanges = 1;
 +	}
  	/*
  	 * If we had to rollback the inode allocation because of
  	 * bitmaps being incomplete, then simply restore it.
 
 Modified: head/sys/ufs/ufs/ufs_vnops.c
 ==============================================================================
 --- head/sys/ufs/ufs/ufs_vnops.c	Mon Apr  2 20:36:35 2012	(r233816)
 +++ head/sys/ufs/ufs/ufs_vnops.c	Mon Apr  2 21:58:37 2012	(r233817)
 @@ -1006,6 +1006,14 @@ ufs_link(ap)
  		error = EMLINK;
  		goto out;
  	}
 +	/*
 +	 * The file may have been removed after namei droped the original
 +	 * lock.
 +	 */
 +	if (ip->i_effnlink == 0) {
 +		error = ENOENT;
 +		goto out;
 +	}
  	if (ip->i_flags & (IMMUTABLE | APPEND)) {
  		error = EPERM;
  		goto out;
 
State-Changed-From-To: open->patched 
State-Changed-By: mckusick 
State-Changed-When: Mon Apr 2 22:17:03 UTC 2012 
State-Changed-Why:  
A fix has been applied to head. If no further problems arise,
it will be MFCed to 9. No other MFCs are planned, as the fix applies
only to journaled soft updates, which were added in 9.

http://www.freebsd.org/cgi/query-pr.cgi?pr=159971 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/159971: commit references a PR
Date: Thu, 19 Apr 2012 22:22:41 +0000 (UTC)

 Author: mckusick
 Date: Thu Apr 19 22:22:21 2012
 New Revision: 234470
 URL: http://svn.freebsd.org/changeset/base/234470
 
 Log:
   MFC of 233817:
   
   A file cannot be deallocated until its last name has been removed
   and it is no longer referenced by a user process. The inode for a
   file whose name has been removed but which is still referenced at
   the time of a crash will still be allocated in the filesystem, but
   will have no references (i.e., it will have no names referencing it
   from any directory).
   
   With traditional soft updates these unreferenced inodes will be
   found and reclaimed when the background fsck is run. When using
   journaled soft updates, the kernel must keep track of these inodes
   so that it can find and reclaim them during the cleanup process.
   Their existence cannot be stored in the journal as the journal only
   handles short-term events, and they may persist for days. So, they
   are tracked by keeping them in a linked list whose head pointer is
   stored in the superblock. The journal tracks them only until their
   linked list pointers have been committed to disk. Part of the cleanup
   process involves traversing the list of unreferenced inodes and
   reclaiming them.
   
   This bug was triggered when confusion arose in the commit steps
   of keeping the unreferenced-inode linked list coherent on disk.
   Specifically, there was a race between the link() system call
   adding a link to a file and the unlink() system call removing a
   link from the file. If the unlink() ran after link() had looked up
   the file but before link() had incremented the file's link count,
   the link count would drop to zero before the link() incremented it
   back up to one. If the file was referenced by a
   user process, the first transition through zero made it appear
   that it should be added to the unreferenced-inode list when in
   fact it should not have been added. If the new name created by
   link() was deleted within a few seconds (with the file still
   referenced by a user process) it would legitimately be a candidate
   for addition to the unreferenced-inode list. The result was that
   there were two attempts to add the same inode to the unreferenced-inode
   list which scrambled the unreferenced-inode list's pointers leading
   to a panic. The fix is to detect and avoid the false attempt at
   adding it to the unreferenced-inode list by having the link()
   system call check to see if the link count is zero before it
   increments it. If it is, the link() fails with ENOENT (showing that
   it has failed the link()/unlink() race).
   
   While tracking down this bug, we have added additional assertions
   to detect the problem sooner and also simplified some of the code.
   
   Reported by:      Kirk Russell
   Fix submitted by: Jeff Roberson
   Tested by:        Peter Holm
   PR:               kern/159971
 
 Modified:
   stable/9/sys/ufs/ffs/ffs_softdep.c
   stable/9/sys/ufs/ufs/ufs_vnops.c
 Directory Properties:
   stable/9/sys/   (props changed)
   stable/9/sys/amd64/include/xen/   (props changed)
   stable/9/sys/boot/   (props changed)
   stable/9/sys/boot/i386/efi/   (props changed)
   stable/9/sys/boot/ia64/efi/   (props changed)
   stable/9/sys/boot/ia64/ski/   (props changed)
   stable/9/sys/boot/powerpc/boot1.chrp/   (props changed)
   stable/9/sys/boot/powerpc/ofw/   (props changed)
   stable/9/sys/cddl/contrib/opensolaris/   (props changed)
   stable/9/sys/conf/   (props changed)
   stable/9/sys/contrib/dev/acpica/   (props changed)
   stable/9/sys/contrib/octeon-sdk/   (props changed)
   stable/9/sys/contrib/pf/   (props changed)
   stable/9/sys/contrib/x86emu/   (props changed)
   stable/9/sys/fs/   (props changed)
   stable/9/sys/fs/ntfs/   (props changed)
   stable/9/sys/i386/conf/XENHVM   (props changed)
   stable/9/sys/kern/subr_witness.c   (props changed)
 
 Modified: stable/9/sys/ufs/ffs/ffs_softdep.c
 ==============================================================================
 --- stable/9/sys/ufs/ffs/ffs_softdep.c	Thu Apr 19 21:12:08 2012	(r234469)
 +++ stable/9/sys/ufs/ffs/ffs_softdep.c	Thu Apr 19 22:22:21 2012	(r234470)
 @@ -4322,6 +4322,7 @@ inodedep_lookup_ip(ip)
  	(void) inodedep_lookup(UFSTOVFS(ip->i_ump), ip->i_number, dflags,
  	    &inodedep);
  	inodedep->id_nlinkdelta = ip->i_nlink - ip->i_effnlink;
 +	KASSERT((inodedep->id_state & UNLINKED) == 0, ("inode unlinked"));
  
  	return (inodedep);
  }
 @@ -8454,6 +8455,7 @@ softdep_setup_remove(bp, dp, ip, isrmdir
  	if (inodedep_lookup(UFSTOVFS(ip->i_ump), ip->i_number, 0,
  	    &inodedep) == 0)
  		panic("softdep_setup_remove: Lost inodedep.");
 +	KASSERT((inodedep->id_state & UNLINKED) == 0, ("inode unlinked"));
  	dirrem->dm_state |= ONDEPLIST;
  	LIST_INSERT_HEAD(&inodedep->id_dirremhd, dirrem, dm_inonext);
  
 @@ -8986,6 +8988,7 @@ first_unlinked_inodedep(ump)
  	struct inodedep *inodedep;
  	struct inodedep *idp;
  
 +	mtx_assert(&lk, MA_OWNED);
  	for (inodedep = TAILQ_LAST(&ump->softdep_unlinked, inodedeplst);
  	    inodedep; inodedep = idp) {
  		if ((inodedep->id_state & UNLINKNEXT) == 0)
 @@ -8994,11 +8997,8 @@ first_unlinked_inodedep(ump)
  		if (idp == NULL || (idp->id_state & UNLINKNEXT) == 0)
  			break;
  		if ((inodedep->id_state & UNLINKPREV) == 0)
 -			panic("first_unlinked_inodedep: prev != next");
 +			break;
  	}
 -	if (inodedep == NULL)
 -		return (NULL);
 -
  	return (inodedep);
  }
  
 @@ -9037,8 +9037,12 @@ handle_written_sbdep(sbdep, bp)
  	struct mount *mp;
  	struct fs *fs;
  
 +	mtx_assert(&lk, MA_OWNED);
  	fs = sbdep->sb_fs;
  	mp = UFSTOVFS(sbdep->sb_ump);
 +	/*
 +	 * If the superblock doesn't match the in-memory list start over.
 +	 */
  	inodedep = first_unlinked_inodedep(sbdep->sb_ump);
  	if ((inodedep && fs->fs_sujfree != inodedep->id_ino) ||
  	    (inodedep == NULL && fs->fs_sujfree != 0)) {
 @@ -9048,8 +9052,6 @@ handle_written_sbdep(sbdep, bp)
  	WORKITEM_FREE(sbdep, D_SBDEP);
  	if (fs->fs_sujfree == 0)
  		return (0);
 -	if (inodedep_lookup(mp, fs->fs_sujfree, 0, &inodedep) == 0)
 -		panic("handle_written_sbdep: lost inodedep");
  	/*
  	 * Now that we have a record of this inode in stable store allow it
  	 * to be written to free up pending work.  Inodes may see a lot of
 @@ -9077,10 +9079,13 @@ unlinked_inodedep(mp, inodedep)
  {
  	struct ufsmount *ump;
  
 +	mtx_assert(&lk, MA_OWNED);
  	if (MOUNTEDSUJ(mp) == 0)
  		return;
  	ump = VFSTOUFS(mp);
  	ump->um_fs->fs_fmod = 1;
 +	if (inodedep->id_state & UNLINKED)
 +		panic("unlinked_inodedep: %p already unlinked\n", inodedep);
  	inodedep->id_state |= UNLINKED;
  	TAILQ_INSERT_HEAD(&ump->softdep_unlinked, inodedep, id_unlinked);
  }
 @@ -9108,6 +9113,10 @@ clear_unlinked_inodedep(inodedep)
  	ino = inodedep->id_ino;
  	error = 0;
  	for (;;) {
 +		mtx_assert(&lk, MA_OWNED);
 +		KASSERT((inodedep->id_state & UNLINKED) != 0,
 +		    ("clear_unlinked_inodedep: inodedep %p not unlinked",
 +		    inodedep));
  		/*
  		 * If nothing has yet been written simply remove us from
  		 * the in memory list and return.  This is the most common
 @@ -9165,36 +9174,19 @@ clear_unlinked_inodedep(inodedep)
  			ACQUIRE_LOCK(&lk);
  			continue;
  		}
 +		nino = 0;
 +		idn = TAILQ_NEXT(inodedep, id_unlinked);
 +		if (idn)
 +			nino = idn->id_ino;
  		/*
  		 * Remove us from the in memory list.  After this we cannot
  		 * access the inodedep.
  		 */
 -		idn = TAILQ_NEXT(inodedep, id_unlinked);
 -		inodedep->id_state &= ~(UNLINKED | UNLINKLINKS);
 +		KASSERT((inodedep->id_state & UNLINKED) != 0,
 +		    ("clear_unlinked_inodedep: inodedep %p not unlinked",
 +		    inodedep));
 +		inodedep->id_state &= ~(UNLINKED | UNLINKLINKS | UNLINKONLIST);
  		TAILQ_REMOVE(&ump->softdep_unlinked, inodedep, id_unlinked);
 -		/*
 -		 * Determine the next inode number.
 -		 */
 -		nino = 0;
 -		if (idn) {
 -			/*
 -			 * If next isn't on the list we can just clear prev's
 -			 * state and schedule it to be fixed later.  No need
 -			 * to synchronously write if we're not in the real
 -			 * list.
 -			 */
 -			if ((idn->id_state & UNLINKPREV) == 0 && pino != 0) {
 -				idp->id_state &= ~UNLINKNEXT;
 -				if ((idp->id_state & ONWORKLIST) == 0)
 -					WORKLIST_INSERT(&bp->b_dep,
 -					    &idp->id_list);
 -				FREE_LOCK(&lk);
 -				bawrite(bp);
 -				ACQUIRE_LOCK(&lk);
 -				return;
 -			}
 -			nino = idn->id_ino;
 -		}
  		FREE_LOCK(&lk);
  		/*
  		 * The predecessor's next pointer is manually updated here
 @@ -9233,13 +9225,14 @@ clear_unlinked_inodedep(inodedep)
  			bwrite(bp);
  			ACQUIRE_LOCK(&lk);
  		}
 +
  		if (fs->fs_sujfree != ino)
  			return;
  		panic("clear_unlinked_inodedep: Failed to clear free head");
  	}
  	if (inodedep->id_ino == fs->fs_sujfree)
  		panic("clear_unlinked_inodedep: Freeing head of free list");
 -	inodedep->id_state &= ~(UNLINKED | UNLINKLINKS);
 +	inodedep->id_state &= ~(UNLINKED | UNLINKLINKS | UNLINKONLIST);
  	TAILQ_REMOVE(&ump->softdep_unlinked, inodedep, id_unlinked);
  	return;
  }
 @@ -9838,18 +9831,6 @@ initiate_write_inodeblock_ufs2(inodedep,
  		inon = TAILQ_NEXT(inodedep, id_unlinked);
  		dp->di_freelink = inon ? inon->id_ino : 0;
  	}
 -	if ((inodedep->id_state & (UNLINKED | UNLINKNEXT)) ==
 -	    (UNLINKED | UNLINKNEXT)) {
 -		struct inodedep *inon;
 -		ino_t freelink;
 -
 -		inon = TAILQ_NEXT(inodedep, id_unlinked);
 -		freelink = inon ? inon->id_ino : 0;
 -		if (freelink != dp->di_freelink)
 -			panic("ino %p(0x%X) %d, %d != %d",
 -			    inodedep, inodedep->id_state, inodedep->id_ino,
 -			    freelink, dp->di_freelink);
 -	}
  	/*
  	 * If the bitmap is not yet written, then the allocated
  	 * inode cannot be written to disk.
 @@ -10848,10 +10829,9 @@ handle_written_inodeblock(inodedep, bp)
  		freelink = dp2->di_freelink;
  	}
  	/*
 -	 * If we wrote a valid freelink pointer during the last write
 -	 * record it here.
 +	 * Leave this inodeblock dirty until it's in the list.
  	 */
 -	if ((inodedep->id_state & (UNLINKED | UNLINKNEXT)) == UNLINKED) {
 +	if ((inodedep->id_state & (UNLINKED | UNLINKONLIST)) == UNLINKED) {
  		struct inodedep *inon;
  
  		inon = TAILQ_NEXT(inodedep, id_unlinked);
 @@ -10860,12 +10840,9 @@ handle_written_inodeblock(inodedep, bp)
  			if (inon)
  				inon->id_state |= UNLINKPREV;
  			inodedep->id_state |= UNLINKNEXT;
 -		} else
 -			hadchanges = 1;
 -	}
 -	/* Leave this inodeblock dirty until it's in the list. */
 -	if ((inodedep->id_state & (UNLINKED | UNLINKONLIST)) == UNLINKED)
 +		}
  		hadchanges = 1;
 +	}
  	/*
  	 * If we had to rollback the inode allocation because of
  	 * bitmaps being incomplete, then simply restore it.
 
 Modified: stable/9/sys/ufs/ufs/ufs_vnops.c
 ==============================================================================
 --- stable/9/sys/ufs/ufs/ufs_vnops.c	Thu Apr 19 21:12:08 2012	(r234469)
 +++ stable/9/sys/ufs/ufs/ufs_vnops.c	Thu Apr 19 22:22:21 2012	(r234470)
 @@ -1002,6 +1002,14 @@ ufs_link(ap)
  		error = EMLINK;
  		goto out;
  	}
 +	/*
 +	 * The file may have been removed after namei dropped the original
 +	 * lock.
 +	 */
 +	if (ip->i_effnlink == 0) {
 +		error = ENOENT;
 +		goto out;
 +	}
  	if (ip->i_flags & (IMMUTABLE | APPEND)) {
  		error = EPERM;
  		goto out;
 
State-Changed-From-To: patched->closed 
State-Changed-By: mckusick 
State-Changed-When: Thu Apr 19 22:46:50 UTC 2012 
State-Changed-Why:  
Kirk Russell (the person who reported it) has confirmed that the fix
resolves the panic.

The fix has been MFC'ed to 9. 

Since journaled soft updates did not exist prior to 9, no
further MFCs are needed.

http://www.freebsd.org/cgi/query-pr.cgi?pr=159971 
>Unformatted:
