From citrin@mx22.rambler.ru  Mon Sep  1 15:02:10 2008
Return-Path: <citrin@mx22.rambler.ru>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 7CE93106566B
	for <FreeBSD-gnats-submit@freebsd.org>; Mon,  1 Sep 2008 15:02:10 +0000 (UTC)
	(envelope-from citrin@mx22.rambler.ru)
Received: from mx-carp-ext.rambler.ru (mx-carp-ext.rambler.ru [81.19.66.237])
	by mx1.freebsd.org (Postfix) with ESMTP id 3B9848FC13
	for <FreeBSD-gnats-submit@freebsd.org>; Mon,  1 Sep 2008 15:02:10 +0000 (UTC)
	(envelope-from citrin@mx22.rambler.ru)
Received: by mx22.rambler.ru (Postfix, from userid 1072)
	id 8751589A47D; Mon,  1 Sep 2008 18:46:21 +0400 (MSD)
Message-Id: <20080901144621.8751589A47D@mx22.rambler.ru>
Date: Mon,  1 Sep 2008 18:46:21 +0400 (MSD)
From: Anton Yuzhaninov <citrin@citrin.ru>
To: FreeBSD-gnats-submit@freebsd.org
Subject: Problem with unix sockets garbage collector
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         127024
>Category:       kern
>Synopsis:       [socket] Problem with unix sockets garbage collector
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    rwatson
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Sep 01 15:10:00 UTC 2008
>Closed-Date:    Sun Feb 01 18:49:09 UTC 2009
>Last-Modified:  Sun Feb 01 18:49:09 UTC 2009
>Originator:     Anton Yuzhaninov
>Release:        FreeBSD 7.0-STABLE amd64
>Organization:
Rambler
>Environment:
System: FreeBSD mx22.rambler.ru 7.0-STABLE FreeBSD 7.0-STABLE #1: Fri Jun 27 16:59:59 MSD 2008 root@mx22.rambler.ru:/usr/obj/usr/src/sys/MAIL amd64

The problem occurs on SMP boxes when unix sockets are used under high load.
In our case it is a server running the postfix MTA, where unix sockets are used for IPC.

>Description:
1. Normal work (after reboot):

thread taskq in top is about 0.00% WCPU

sysctl net.local.inflight is almost always zero.
sysctl net.local.taskcount value increases only rarely.

2. After several days of uptime, thread taskq starts to eat all available CPU:

1684 processes:26 running, 1639 sleeping, 19 waiting
CPU states:  6.7% user,  0.0% nice, 54.5% system,  1.1% interrupt, 37.7% idle
Mem: 1332M Active, 1903M Inact, 505M Wired, 118M Cache, 214M Buf, 76M Free
Swap: 2060M Total, 2060M Free

   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
     9 root        1   8    -     0K    16K CPU1   1 536:07 100.00% thread taskq
    12 root        1 171 ki31     0K    16K RUN    0  53.5H 64.06% idle: cpu0
    11 root        1 171 ki31     0K    16K RUN    1  50.3H 14.26% idle: cpu1

sysctl net.local.inflight value is always less than 0 (I see values from -1 to -4).
sysctl net.local.taskcount value increases at a high rate (about 100 per second).

It seems to be a race in the unix sockets code, because we can't reproduce this on a uniprocessor box.

>How-To-Repeat:
Run the postfix MTA on a heavily loaded mail server (> 100 connections per second) with 6-stable or 7-stable (SMP).
The problem should occur after several days (or weeks) of uptime.

>Fix:
Not known yet.
This problem may be fixed in 8-current, but we can't run 8-current on this hardware.
>Release-Note:
>Audit-Trail:

From: Anton Yuzhaninov <citrin@citrin.ru>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/127024: Problem with unix sockets garbage collector
Date: Tue, 02 Sep 2008 02:34:49 +0400

 Maybe an MFC of this change will fix the problem:
 http://lists.freebsd.org/pipermail/cvs-all/2007-December/242327.html
 
 -- 
   Anton Yuzhaninov
Responsible-Changed-From-To: freebsd-bugs->rwatson 
Responsible-Changed-By: rwatson 
Responsible-Changed-When: Tue Sep 2 06:00:19 UTC 2008 
Responsible-Changed-Why:  
Grab ownership.  I've seen at least one possible glitch in the locking for 
the inflight variable, and fixing this may well help. 

WRT the MFCing question: that is a fairly major change, and not appropriate 
for MFC for several reasons, not least that it significantly changes the 
kernel binary interface for kernel modules. 


http://www.freebsd.org/cgi/query-pr.cgi?pr=127024 

From: Robert Watson <rwatson@FreeBSD.org>
To: Anton Yuzhaninov <citrin@citrin.ru>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/127024: Problem with unix sockets garbage collector
Date: Wed, 3 Sep 2008 18:01:18 +0100 (BST)

 On Tue, 2 Sep 2008, rwatson@FreeBSD.org wrote:
 
 > Grab ownership.  I've seen at least one possible glitch in the locking for 
 > the inflight variable, and fixing this may well help.
 
 Dear Anton:
 
 The attached patch closes two races in the UNIX domain socket garbage 
 collector, and for the sake of being conservative, slides one variable 
 assignment under the global UNIX domain socket lock that previously wasn't 
 there.  Could I ask you to give this a try in two configurations:
 
 (1) First, using a kernel with INVARIANTS and WITNESS enabled, and a load
      similar to what triggers the problem.  Probably this should not be a
      production server, however, as those will significantly impact
      performance.
 
 (2) Assuming (1) works without serious warnings/errors or other problems,
      could you then try running it on your production load and see if the
      problem goes away?
 
 This is a fairly low-risk change, but as with all kernel patches, the (1) 
 testing scenario above is intended to catch any problems before you deploy it 
 with the real workload in (2).
 
 A further note: this patch is intended only to apply against 7.x, as the 
 garbage collector has been replaced in 8.x and does not appear to contain this 
 or a similar race.  If this patch fixes the problem, then I'll request that we 
 merge it to the 7.x branch before 7.1 ships.  Ideally, this would be in the 
 next few days so it can be included in 7.1-BETA1.
 
 Thanks,
 
 Robert N M Watson
 Computer Laboratory
 University of Cambridge
 
 Index: kern/uipc_usrreq.c
 ===================================================================
 --- kern/uipc_usrreq.c	(revision 182722)
 +++ kern/uipc_usrreq.c	(working copy)
 @@ -1,7 +1,7 @@
   /*-
    * Copyright (c) 1982, 1986, 1989, 1991, 1993
    *	The Regents of the University of California.
 - * Copyright (c) 2004-2007 Robert N. M. Watson
 + * Copyright (c) 2004-2008 Robert N. M. Watson
    * All rights reserved.
    *
    * Redistribution and use in source and binary forms, with or without
 @@ -585,9 +585,9 @@
   		unp_drop(ref, ECONNRESET);
   		UNP_PCB_UNLOCK(ref);
   	}
 +	local_unp_rights = unp_rights;
   	UNP_GLOBAL_WUNLOCK();
   	unp->unp_socket->so_pcb = NULL;
 -	local_unp_rights = unp_rights;
   	saved_unp_addr = unp->unp_addr;
   	unp->unp_addr = NULL;
   	unp->unp_refcount--;
 @@ -1602,7 +1602,9 @@
   				FILE_LOCK(fp);
   				fp->f_msgcount--;
   				FILE_UNLOCK(fp);
 +				UNP_GLOBAL_WLOCK();
   				unp_rights--;
 +				UNP_GLOBAL_WUNLOCK();
   				*fdp++ = f;
   			}
   			FILEDESC_XUNLOCK(td->td_proc->p_fd);
 @@ -1768,7 +1770,9 @@
   				fp->f_count++;
   				fp->f_msgcount++;
   				FILE_UNLOCK(fp);
 +				UNP_GLOBAL_WLOCK();
   				unp_rights++;
 +				UNP_GLOBAL_WUNLOCK();
   			}
   			FILEDESC_SUNLOCK(fdescp);
   			break;

From: Anton Yuzhaninov <citrin@citrin.ru>
To: bug-followup@FreeBSD.org, citrin@citrin.ru
Cc:  
Subject: Re: kern/127024: [sockets] Problem with unix sockets garbage collector
Date: Thu, 04 Sep 2008 18:14:06 +0400

 I can't reproduce this race on servers with INVARIANTS/WITNESS in the kernel, even without the patch
 (maybe because of reduced performance), and I'll try rebuilding the kernels without INVARIANTS/WITNESS.
 
 Anyway, INVARIANTS/WITNESS have not logged any warnings on systems with or without the patch after 18 hours of uptime.
 
 -- 
   Anton Yuzhaninov

From: Robert Watson <rwatson@FreeBSD.org>
To: Anton Yuzhaninov <citrin@citrin.ru>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/127024: [sockets] Problem with unix sockets garbage
 collector
Date: Sat, 6 Sep 2008 12:35:56 +0100 (BST)

 On Thu, 4 Sep 2008, Anton Yuzhaninov wrote:
 
 > I can't repeat this race on servers with INVARIANTS/WITNESS in kernel even 
 > without patch (may be because of reduced performance) and I'll try to 
 > rebuild kernels without INVARIANTS/WITNESS.
 >
 > Anyway INVARIANTS/WITNESS don't log any warning on systems with and without 
 > patch after 18 hours uptime.
 
 Sounds good on the stability and warning front at least -- I might reasonably 
 expect that the race is fairly timing-dependent and WITNESS/INVARIANTS do 
 change timing a lot, but not getting lock order warnings and other invariants 
 violations is good news from a stability perspective.  If you could let me 
 know once you have reasonable confidence that either the problem is solved 
 *or* it happens despite the patch, please give me a ping.  I'd like very much 
 to get this patch into 7.x if we think it's the right solution so that it 
 ships in 7.1.
 
 Thanks,
 
 Robert N M Watson
 Computer Laboratory
 University of Cambridge

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/127024: commit references a PR
Date: Sat, 13 Sep 2008 00:16:46 +0000 (UTC)

 rwatson     2008-09-13 00:16:20 UTC
 
   FreeBSD src repository
 
   Modified files:        (Branch: RELENG_7)
     sys/kern             uipc_usrreq.c 
   Log:
   SVN rev 182993 on 2008-09-13 00:16:20Z by rwatson
   
   Extend global UNIX domain socket rwlock coverage to include incrementing
   and decrementing unp_rights, which may otherwise be corrupted.  Be
   slightly more conservative in where we read unp_rights.
   
   This relates to one of the symptoms reported in the noted PR, but may not
   correct the actual high system time problem.  The reporter has confirmed
   stability but not that the problem is eliminated.  However, this is a
   useful fix to a clear locking bug.
   
   Note that this is not an MFC as the UNIX domain socket garbage collector
   has been replaced in 8.x.
   
   PR:             127024
   Reported by:    Anton Yuzhaninov <citrin at citrin dot ru>
   Reviewed by:    kib
   Approved by:    re (kib)
   
   Revision   Changes    Path
   1.206.2.4  +6 -2      src/sys/kern/uipc_usrreq.c
 

From: Anton Yuzhaninov <citrin@citrin.ru>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/127024: [sockets] Problem with unix sockets garbage collector
Date: Wed, 24 Sep 2008 11:32:33 +0400

 After a week of testing, no problems have been found on servers with the patch.
 Strangely, the race has not recurred on servers without this patch either
 (before testing this patch, the race was observed on an older RELENG_7).
 
 Anyway, the committed code works well for me.
 
 -- 
   Anton Yuzhaninov
State-Changed-From-To: open->closed 
State-Changed-By: rwatson 
State-Changed-When: Sun Feb 1 18:47:55 UTC 2009 
State-Changed-Why:  
Committed patch appears to resolve the problem for the submitter so 
closing the PR.  Thanks for the problem report, and let me know if the 
problem appears to recur. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=127024 
>Unformatted:
