From nobody@FreeBSD.org  Sat Sep 22 08:49:59 2007
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B5FA216A417
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 22 Sep 2007 08:49:59 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id A5F9913C458
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 22 Sep 2007 08:49:59 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.1/8.14.1) with ESMTP id l8M8nwZx060140
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 22 Sep 2007 08:49:58 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.1/8.14.1/Submit) id l8M8nwY6060112;
	Sat, 22 Sep 2007 08:49:58 GMT
	(envelope-from nobody)
Message-Id: <200709220849.l8M8nwY6060112@www.freebsd.org>
Date: Sat, 22 Sep 2007 08:49:58 GMT
From: Martin Horcicka <martin@horcicka.eu>
To: freebsd-gnats-submit@FreeBSD.org
Subject: lockf(1) is broken
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         116543
>Category:       bin
>Synopsis:       lockf(1) is broken
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    csjp
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Sep 22 08:50:06 GMT 2007
>Closed-Date:    Sat Nov 03 15:43:16 UTC 2007
>Last-Modified:  Sat Nov 03 15:43:16 UTC 2007
>Originator:     Martin Horcicka
>Release:        6.2-RELEASE-p6
>Organization:
>Environment:
Not relevant...
>Description:
lockf(1) is broken since revision 1.12 of src/usr.bin/lockf/lockf.c. The author of the changes did not realize that the lock file is removed (by default) before being closed by the process that holds it.

Consider the following series of events:

  Process A acquires the lock file
  Process B tries to acquire it as well but fails and waits
  Process A removes and closes the lock file
  Process B wakes up and acquires the lock file which is already removed (!)
  Process C acquires the lock file because there was no file of that name (!)
  Now both processes B and C think they hold the lock file (!)
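
The sequence above can be replayed inside a single process. This is only a sketch: it assumes flock(2) semantics via Python's fcntl module rather than the actual lockf.c code, and the names same_file and demonstrate_stale_lock are hypothetical. Three independently opened descriptors stand in for processes A, B and C.

```python
import fcntl
import os


def same_file(path, fd):
    """Return True if `path` still names the file that `fd` has open."""
    try:
        by_name = os.stat(path)
    except FileNotFoundError:
        return False
    by_fd = os.fstat(fd)
    return (by_name.st_dev, by_name.st_ino) == (by_fd.st_dev, by_fd.st_ino)


def demonstrate_stale_lock(path):
    """Replay the A/B/C sequence; return (b_has_named_file, c_has_named_file)."""
    # A creates and locks the file; B opens the same name and would block here.
    fd_a = os.open(path, os.O_RDONLY | os.O_CREAT, 0o666)
    fcntl.flock(fd_a, fcntl.LOCK_EX)
    fd_b = os.open(path, os.O_RDONLY)

    # A removes the name and closes its descriptor, which releases its lock.
    os.unlink(path)
    os.close(fd_a)

    # B now gets the lock, but on an inode that no longer has a name.
    fcntl.flock(fd_b, fcntl.LOCK_EX | fcntl.LOCK_NB)

    # C finds no file, creates a fresh one and locks it without contention.
    fd_c = os.open(path, os.O_RDONLY | os.O_CREAT, 0o666)
    fcntl.flock(fd_c, fcntl.LOCK_EX | fcntl.LOCK_NB)

    # Both exclusive locks are held at this point, on two different inodes.
    result = (same_file(path, fd_b), same_file(path, fd_c))
    os.close(fd_b)
    os.close(fd_c)
    os.unlink(path)
    return result
```

Neither non-blocking flock() call fails, so B and C hold "the" lock at the same time; only C's descriptor refers to the file that the name currently points at.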

>How-To-Repeat:
Run this command in two terminals at the same time:

  lockf lock sh -c 'for n in `jot 10`; do echo mmm; sleep 1; done'

One of them will start running and the other will wait. When the first stops running, run it again. Now you can see two processes running at the same time and both thinking they hold the lock file.

>Fix:
Throw away all changes of src/usr.bin/lockf/lockf.c after revision 1.11. They are wrong.


>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->csjp 
Responsible-Changed-By: edwin 
Responsible-Changed-When: Mon Sep 24 12:54:16 UTC 2007 
Responsible-Changed-Why:  
csjp@ was the one who committed 1.12 

http://www.freebsd.org/cgi/query-pr.cgi?pr=116543 

From: "Christian S.J. Peron" <csjp@FreeBSD.org>
To: Martin Horcicka <martin@horcicka.eu>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: bin/116543: lockf(1) is broken
Date: Mon, 24 Sep 2007 09:48:21 -0500

 On Sat, Sep 22, 2007 at 08:49:58AM +0000, Martin Horcicka wrote:
 [..]
 > lockf(1) is broken since revision 1.12 of src/usr.bin/lockf/lockf.c. The author of the changes did not realize that the lock file is removed (by default) before being closed by the process that holds it.
 > 
 > Consider the following series of events:
 > 
 >   Process A acquires the lock file
 >   Process B tries to acquire it as well but fails and waits
 >   Process A removes and closes the lock file
 >   Process B wakes up and acquires the lock file which is already removed (!)
 >   Process C acquires the lock file because there was no file of that name (!)
 >   Now both processes B and C think they hold the lock file (!)
 > 
 > >How-To-Repeat:
 > Run this command in two terminals at the same time:
 
 Hrm.. Yes, darn. This does look like it's a problem.  I want to analyze this
 a bit more before backing the changes out.  The reality is, there were some
 big problems associated with the old behavior that were fixed with these
 modifications.
 
 Thanks for the report, I will look into this.
 
 -- 
 Christian S.J. Peron
 csjp@FreeBSD.ORG
 FreeBSD Committer

From: "Christian S.J. Peron" <csjp@FreeBSD.org>
To: Martin Horcicka <martin@horcicka.eu>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: bin/116543: lockf(1) is broken
Date: Mon, 24 Sep 2007 10:12:27 -0500

 After a bit of review, I think this behavior is correct. I will post
 another link to another PR which outlines the reason for the changes:
 
 http://www.freebsd.org/cgi/query-pr.cgi?pr=111101
 
 (Check out the audit trail)
 
 In short, if you want to use lockf to synchronize between multiple
 processes, the synchronization variable (in this case the lock file)
 should remain after the processes are done.
 
 The default behavior (unlinking) is useful in situations where
 processes are not waiting, e.g. a cron job runs with a lock, a second
 instance of that same cron job starts, sees that the previous job is
 still running, and exits.
 
 Maybe what is required is a bit more clarification in the man page
 about what the intended behavior is.
 
 At first glance, this problem goes away if we use lockf -k
 
 Thoughts?
 
 -- 
 Christian S.J. Peron
 csjp@FreeBSD.ORG
 FreeBSD Committer

From: "Martin Horcicka" <martin@horcicka.eu>
To: "Christian S.J. Peron" <csjp@freebsd.org>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: bin/116543: lockf(1) is broken
Date: Mon, 24 Sep 2007 17:53:32 +0200

 On 9/24/07, Christian S.J. Peron <csjp@freebsd.org> wrote:
 >
 > After a bit of review, I think this behavior is correct. I will post
 > another link to another PR which outlines the reason for the changes:
 >
 > http://www.freebsd.org/cgi/query-pr.cgi?pr=111101
 >
 > (Check out the audit trail)
 >
 > In short, if you want to use lockf to synchronize between multiple
 > processes, the synchronization variable (in this case the lock file)
 > should remain after the processes are done.
 >
 > The default behavior (unlinking) is useful in situations where
 > processes are not waiting, e.g. a cron job runs with a lock, a second
 > instance of that same cron job starts, sees that the previous job is
 > still running, and exits.
 >
 > Maybe what is required is a bit more clarification in the man page
 > about what the intended behavior is.
 >
 > At first glance, this problem goes away if we use lockf -k
 
 If I understand correctly you are telling me that if I don't want to
 use the -k option I have to use the "-t 0" option. It's more a
 workaround than a solution.
 
 Please, consider the algorithm I've sent you. AFAIK it works in all
 situations. If you use Python, you can try it with the Python module
 lock_file (devel/py-lock_file in ports).
 
 Thanks.
 
 Martin

From: Kris Kennaway <kris@FreeBSD.org>
To: bug-followup@FreeBSD.org,  martin@horcicka.eu
Cc:  
Subject: Re: bin/116543: lockf(1) is broken
Date: Mon, 24 Sep 2007 21:52:38 +0200

 I don't know what you sent Christian (it's not logged in the audit 
 trail), but in general you have to be very careful to avoid thundering 
 herd wakeups when hundreds of processes are waiting on the lock.
 
 Basically, I don't understand why you are claiming that his changes 
 cause the problem you describe.  lockf(1) has always deleted the lock 
 file on process exit (unless you use -k), leading to that race.
 
 Basically if you want to use lockf for mutual exclusion you *must* use 
 -k; there is no way it can mutually exclude if you let lockf delete the 
 lock object from visibility to other processes.
 
 Kris

From: "Martin Horcicka" <martin@horcicka.eu>
To: "Kris Kennaway" <kris@freebsd.org>
Cc: bug-followup@freebsd.org
Subject: Re: bin/116543: lockf(1) is broken
Date: Mon, 24 Sep 2007 23:02:15 +0200

 On 9/24/07, Kris Kennaway <kris@freebsd.org> wrote:
 > I don't know what you sent Christian (it's not logged in the audit
 > trail), but in general you have to be very careful to avoid thundering
 > herd wakeups when hundreds of processes are waiting on the lock.
 >
 > Basically, I don't understand why you are claiming that his changes
 > cause the problem you describe.  lockf(1) has always deleted the lock
 > file on process exit (unless you use -k), leading to that race.
 >
 > Basically if you want to use lockf for mutual exclusion you *must* use
 > -k; there is no way it can mutually exclude if you let lockf delete the
 > lock object from visibility to other processes.
 
 I somehow don't see any problem with mutual exclusion without -k in
 the original code (e.g. revision 1.11 of lockf.c), can you give me an
 example? But I admit that there is the problem you described above
 (waking up possibly many processes at the same time) and also a
 problem with out of order acquisitions.
 
 Anyway, what I was writing to Christian was a suggestion to use a
 different algorithm for the acquisition:
 
 while not having the lock file:
    open, create if necessary, and lock the lock file
    if the lock file is the one you wanted:
        break
    else:
        close the lock file
 
 The lock file is the one you wanted if it still exists under the name
 you have opened. So, you can call stat() on the lock file name and fstat() on
 the acquired lock file descriptor and compare their device numbers and
 inode numbers. If the stat() call is successful (the name exists) and
 if the numbers are equal, you have the right file.
 
 I believe that with this algorithm the mutual exclusion would work
 fine in both cases (with or without -k) and with -k there would be no
 problem with awakening many processes at the same time nor with the
 order of acquisition.
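
A minimal sketch of the loop described above, assuming flock(2) through Python's fcntl module. The function names acquire and release are hypothetical, and this is not the code of lockf.c or of the lock_file module.

```python
import fcntl
import os


def acquire(path):
    """Block until we hold an exclusive lock on the file currently named `path`."""
    while True:
        # Open, creating if necessary, then lock; flock() may sleep here.
        fd = os.open(path, os.O_RDONLY | os.O_CREAT, 0o666)
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            by_name = os.stat(path)
        except FileNotFoundError:
            by_name = None  # the previous holder unlinked the name
        by_fd = os.fstat(fd)
        if by_name is not None and (
            (by_name.st_dev, by_name.st_ino) == (by_fd.st_dev, by_fd.st_ino)
        ):
            return fd  # the name still refers to the inode we locked
        os.close(fd)  # stale inode: drop it and try again


def release(path, fd, keep=False):
    """Release the lock; unless `keep` is set, remove the name first."""
    if not keep:
        os.unlink(path)  # unlink while still holding the lock
    os.close(fd)  # closing the descriptor drops the flock() lock
```

A waiter that wakes up on an unlinked inode fails the stat()/fstat() comparison, closes its descriptor, and goes around the loop again, so it never treats the stale lock as the real one.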
 
 Martin
State-Changed-From-To: open->analyzed 
State-Changed-By: csjp 
State-Changed-When: Mon Sep 24 22:57:08 UTC 2007 
State-Changed-Why:  
Currently analyzing this issue. 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: bin/116543: commit references a PR
Date: Fri, 12 Oct 2007 14:57:01 +0000 (UTC)

 csjp        2007-10-12 14:56:52 UTC
 
   FreeBSD src repository
 
   Modified files:
     usr.bin/lockf        lockf.1 lockf.c 
   Log:
   Revision 1.12 of lockf.c fixed a "thundering herd" scenario: when the
   lock experienced contention, a number of processes would race to
   acquire the lock when it was released.  This problem resulted in a lot
   of CPU load as well as locks being picked up out of order.
   
   Unfortunately, a regression snuck in which allowed multiple processes
   to pick up the same lock when -k was not used.  This could occur when
   multiple processes open a file descriptor to inode X (one process
   will be blocked) and the file is unlinked on unlock (thereby removing
   the directory entry and allowing another process to create a new
   directory entry for the same file name and lock it).
   
   This change restores the old algorithm (wait for the lock, then
   acquire the lock) when we want to unlink the file on exit (specifically
   when -k is not used) and keeps the new algorithm for when -k is used,
   which yields fairness and improved performance.
   
   Also, update the man page to inform users that if lockf(1) is being
   used to facilitate concurrency between a number of processes, it
   is recommended that -k be used to reduce CPU load and yield
   fairness with regard to lock ordering.
   
   Collaborated with:      jdp
   PR:             bin/114341
   PR:             bin/116543
   PR:             bin/111101
   MFC after:      1 week
   
   Revision  Changes    Path
   1.19      +13 -1     src/usr.bin/lockf/lockf.1
   1.17      +59 -10    src/usr.bin/lockf/lockf.c
 
State-Changed-From-To: analyzed->patched 
State-Changed-By: csjp 
State-Changed-When: Fri Oct 12 15:13:15 UTC 2007 
State-Changed-Why:  
The fix has been merged into HEAD and will be MFCed 
when the testing period has expired. 

Thanks. 
State-Changed-From-To: patched->closed 
State-Changed-By: csjp 
State-Changed-When: Sat Nov 3 15:42:59 UTC 2007 
State-Changed-Why:  
Fix has been committed to the -STABLE branch 

Thanks. 
>Unformatted:
