From nobody@FreeBSD.org  Mon Nov  4 13:24:29 2013
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1])
	(using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by hub.freebsd.org (Postfix) with ESMTP id 416AE937
	for <freebsd-gnats-submit@FreeBSD.org>; Mon,  4 Nov 2013 13:24:29 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from oldred.freebsd.org (oldred.freebsd.org [8.8.178.121])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mx1.freebsd.org (Postfix) with ESMTPS id 2DB1B2533
	for <freebsd-gnats-submit@FreeBSD.org>; Mon,  4 Nov 2013 13:24:29 +0000 (UTC)
Received: from oldred.freebsd.org ([127.0.1.6])
	by oldred.freebsd.org (8.14.5/8.14.7) with ESMTP id rA4DOST5036283
	for <freebsd-gnats-submit@FreeBSD.org>; Mon, 4 Nov 2013 13:24:28 GMT
	(envelope-from nobody@oldred.freebsd.org)
Received: (from nobody@localhost)
	by oldred.freebsd.org (8.14.5/8.14.5/Submit) id rA4DOSTc036274;
	Mon, 4 Nov 2013 13:24:28 GMT
	(envelope-from nobody)
Message-Id: <201311041324.rA4DOSTc036274@oldred.freebsd.org>
Date: Mon, 4 Nov 2013 13:24:28 GMT
From: Julien <jcharbon@verisign.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: TCP stack lock contention with short-lived connections
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         183659
>Category:       kern
>Synopsis:       [tcp] TCP stack lock contention with short-lived connections
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-net
>State:          patched
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Nov 04 13:30:02 UTC 2013
>Closed-Date:    
>Last-Modified:  Mon Mar 17 18:40:01 UTC 2014
>Originator:     Julien
>Release:        FreeBSD 9.2
>Organization:
Verisign
>Environment:
FreeBSD host 9.2-RELEASE FreeBSD 9.2-RELEASE #0 r+e612cc1: Thu Oct 17 15:41:43 UTC 2013     root@host:/usr/obj/freebsd-vrsn/sys/GENERIC  amd64
>Description:
 Our goal is to achieve the highest number of TCP queries per second (QPS)
on given hardware.  A TCP query is defined here as:  establishing a TCP
connection, sending a small request, reading the small response, and closing
the connection.
 
 By configuring a single NIC receive queue bound to a single CPU core, TCP
performance results on FreeBSD are great:  We got ~52k QPS before being CPU
bound, and we achieved the same result on Linux.
 However, by configuring 4 NIC receive queues, each bound to a different core of
the same CPU, results are lower than expected:  We got only ~56k QPS, whereas we
reached ~200k QPS on Linux on the same hardware.
 
 We investigated the cause of this performance scaling issue:  PMC profiling
showed that more than half of CPU time was spent in _rw_rlock() and _rw_wlock_hard(),
and lock profiling then showed contention on the ipi_lock of the TCP pcbinfo
structure (ipi_lock being acquired with the INP_INFO_*LOCK macros).
 
 Below are our lock profiling results, ordered by "total accumulated lock wait
time":  (lock profiling done on vanilla FreeBSD 9.2)
 
# sysctl debug.lock.prof.stats | head -2; sysctl debug.lock.prof.stats | sort -n -k 4 -r | head -20
debug.lock.prof.stats:
           max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name
           265     39477     4669994    57027602      840000      5     67  0 780171  sys/netinet/tcp_usrreq.c:728 (rw:tcp)
           248     39225     9498849    39991390     2044168      4     19  0 1919503 sys/netinet/tcp_input.c:775 (rw:tcp)
           234     39474      589181    39241879      840000      0     46  0 702845  sys/netinet/tcp_usrreq.c:984 (rw:tcp)
           234     43262      807708    22694780      840000      0     27  0 814240  sys/netinet/tcp_usrreq.c:635 (rw:tcp)
           821     39218     8592541    22346613     1106252      7     20  0 1068157 sys/netinet/tcp_input.c:1019 (rw:tcp)
           995     37316     1210480     6822269      343692      3     19  0 324585  sys/netinet/tcp_input.c:962 (rw:tcp)
 
 The top 6 lock profiling entries all relate to the same INP_INFO
(rw:tcp) lock; more details below:
 
#1 In tcp_usr_shutdown()
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_usrreq.c#L728
 
#2 In tcp_input() for SYN/FIN/RST TCP packets
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_input.c#L775
 
#3 In tcp_usr_close()
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_usrreq.c#L984
 
#4 In tcp_usr_accept()
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_usrreq.c#L635
 
#5 In tcp_input() for incoming TCP packets when the corresponding connection
is not in the ESTABLISHED state.  In general this is the client's ACK packet of
the TCP three-way handshake, which is about to create the connection.
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_input.c#L1019
 
#6 tcp_input() for incoming TCP packets when the corresponding connection is
in TIME_WAIT state
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_input.c#L962
 
 Our explanation for this lock contention is straightforward:  Our typical
workload entails the following packet sequence:
 
        Received TCP packets:        Sent TCP packets:
 
#1      SYN ->
                                     <- SYN-ACK
#2      ACK ->
#3      query data ->
                                     <- ACK + response data
                                     <- FIN
#4      ACK ->
#5      FIN ->
                                     <- ACK
 
 For received packets #1, #2, #4 and #5 the write lock on INP_INFO is required
in tcp_input(); only #3 does not require this lock.  This means that only 1/5th of
all received packets can be processed in parallel across the entire TCP stack.
 Moreover, the lock is also required in all major TCP syscalls:  tcp_usr_shutdown(),
tcp_usr_close() and tcp_usr_accept().
 
 We are aware that achieving a rate of 200k TCP connections per second is a
specific goal, but better TCP connection setup/teardown scalability could
benefit other TCP network services as well.
>How-To-Repeat:
Below are details to reproduce this lock contention issue using open
source software:
 
 o Software used:
 
   - TCP client: ab version 2.4
   - TCP server: nginx version 1.4.2
 
 o Software configurations:
 
  - server:  See attached nginx.conf
 
  Core binding on our 12-core server:
 
  - The 4 NIC receive queues are bound to cores 0, 1, 2 and 3.
  - The 4 nginx workers are bound to cores 4, 5, 6 and 7.
 
  - client:  launch:
 
  $ for i in $(seq 0 11); do \
    taskset -c $i ab -c 2000 -n 1000000 http://server/test.html & done
 
  Note:  We use the same Linux load driver to load both Linux and FreeBSD;
   we did not try to launch ab from a FreeBSD box, sorry.
 
  - The 'test.html' HTML page is simply:

<html><head><title>Title</title></head><body><p>Body</p></body></html>
 
  You should get:
 
  - TCP request size:  92 bytes
  - TCP response size:  206 bytes
 
 o Tunables/sysctls parameters:
 
  - Main tunables to tune:
 
# We want 4 receive queues
hw.ixgbe.num_queues=4
 
# Other tunables
kern.ipc.maxsockets
kern.ipc.nmbclusters
kern.maxfiles
kern.maxfilesperproc
net.inet.tcp.hostcache.hashsize
net.inet.tcp.hostcache.cachelimit
net.inet.tcp.hostcache.bucketlimit
net.inet.tcp.tcbhashsize
net.inet.tcp.syncache.hashsize
net.inet.tcp.syncache.bucketlimit
 
  - sysctl to tune:
 
# Values to increase
kern.ipc.maxsockbuf
kern.ipc.somaxconn
net.inet.tcp.maxtcptw
 
 o Monitoring tools:
 
 - We use i7z (from the sysutils/i7z port) to check when the server is CPU
   bound; it should show 100% in C0 state ("Processor running without halting")
   on the cores bound to NIC receive queues:
 
----
True Frequency (without accounting Turbo) 3325 MHz
 
Socket [0] - [physical cores=6, logical cores=6, max online cores ever=6]
  CPU Multiplier 25x || Bus clock frequency (BCLK) 133.00 MHz
  TURBO ENABLED on 6 Cores, Hyper Threading OFF
  Max Frequency without considering Turbo 3458.00 MHz (133.00 x [26])
  Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 cores is  27x/27x/26x/26x/26x/26x
  Real Current Frequency 3373.71 MHz (Max of below)
        Core [core-id]  :Actual Freq (Mult.)      C0%   Halt(C1)%  C3 %   C6 %  Temp
        Core 1 [0]:       3370.76 (25.34x)       103       0       0       0    43
        Core 2 [1]:       3361.13 (25.27x)       103       0       0       0    42
        Core 3 [2]:       3373.71 (25.37x)       105       0       0       0    43
        Core 4 [3]:       3339.75 (25.11x)       106       0       0       0    42
        Core 5 [4]:       3323.90 (24.99x)      65.9    34.1       0       0    42
        Core 6 [5]:       3323.90 (24.99x)      65.9    34.1       0       0    41
 
Socket [1] - [physical cores=6, logical cores=6, max online cores ever=6]
  CPU Multiplier 25x || Bus clock frequency (BCLK) 133.00 MHz
  TURBO ENABLED on 6 Cores, Hyper Threading OFF
  Max Frequency without considering Turbo 3458.00 MHz (133.00 x [26])
  Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 cores is  27x/27x/26x/26x/26x/26x
  Real Current Frequency 3309.13 MHz (Max of below)
        Core [core-id]  :Actual Freq (Mult.)      C0%   Halt(C1)%  C3 %   C6 %  Temp
        Core 1 [6]:       3309.13 (24.88x)      47.5    52.8       0       0    43
        Core 2 [7]:       3308.36 (24.87x)        48    52.3       0       0    42
        Core 3 [8]:       3266.36 (24.56x)         1    99.6       0       0    34
        Core 4 [9]:       3244.74 (24.40x)         1    99.6       0       0    33
        Core 5 [10]:      3274.51 (24.62x)         1    99.4       0       0    38
        Core 6 [11]:      3244.08 (24.39x)         1    99.5       0       0    36
 
C0 = Processor running without halting
C1 = Processor running with halts (States >C0 are power saver)
C3 = Cores running with PLL turned off and core cache turned off
C6 = Everything in C3 + core state saved to last level cache
----
 
 o PMC profiling:
 
 The flat profile of 'unhalted-cycles' for core 1 should look like:
 
 %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 55.6  198867.00 198867.00   291942   681.19   681.43  _rw_wlock_hard [7]
  2.6  208068.00  9201.00     8961  1026.78  4849.93  tcp_do_segment [14]
  2.4  216592.00  8524.00       86 99116.28 102597.15  sched_idletd [26]
  2.3  224825.00  8233.00     8233  1000.00  1000.00  _rw_rlock [27]
  1.9  231498.00  6673.00    12106   551.21 27396.73  ixgbe_rxeof [2]
  1.4  236638.00  5140.00   310457    16.56  1004.01  tcp_input [6]
  1.2  241074.00  4436.00     5051   878.24  1000.00  in_pcblookup_hash_locked [36]
  1.2  245317.00  4243.00     4243  1000.00  1000.00  bcopy [39]
  1.1  249392.00  4075.00     2290  1779.48  3295.95  knote [30]
  1.0  252956.00  3564.00      366  9737.70 18562.04  ixgbe_mq_start [31]
  0.9  256274.00  3318.00     7047   470.84  3348.41  _syncache_add [16]
  0.8  259312.00  3038.00     3038  1000.00  1000.00  bzero [51]
  0.8  262269.00  2957.00     6253   472.89  2900.54  ip_output [18]
  0.8  264978.00  2709.00     3804   712.15  1009.00  callout_lock [42]
  0.6  267185.00  2207.00     2207  1000.00  1000.00  memcpy [64]
  0.6  269273.00  2088.00      365  5720.55  7524.50  ixgbe_xmit [56]
  0.6  271321.00  2048.00     2048  1000.00  1000.00  bcmp [67]
  0.6  273291.00  1970.00     1970  1000.00  1000.73  _rw_runlock [68]
  0.5  275188.00  1897.00     1897  1000.00  1000.00  syncache_lookup [71]
 
 And the call graph profile of _rw_wlock_hard from 'unhalted-cycles' on core 1:
 
                0.68        0.00       1/291942      tcp_slowtimo [331]
               10.22        0.00      15/291942      syncache_expand [17]
               35.42        0.01      52/291942      in_pcbdrop [77]
              126.70        0.04     186/291942      tcp_usr_detach [65]
              208.44        0.07     306/291942      tcp_usr_attach [34]
             2094.65        0.73    3075/291942      in_pcblookup_hash [22]
             196390.89       68.73  288307/291942      tcp_input [6]
[7]     55.6 198867.00       69.60  291942         _rw_wlock_hard [7]
               24.96       14.43      39/50          turnstile_trywait [216]
                7.20        5.71      12/15          turnstile_cancel [258]
                4.00        6.90       3/3           turnstile_wait [275]
                3.02        0.00       3/1277        critical_enter [87]
                2.13        0.25       2/2061        spinlock_exit <cycle 1> [94]
                0.00        1.00       1/1           lockstat_nsecs [320]
>Fix:
 This lock contention is tricky to fix:  The main pain point is that this global TCP lock does not only protect globally shared data, it also creates a critical section for the whole TCP stack.  Restructuring the TCP stack locking in one shot could therefore introduce complex race conditions and make tests and reviews impractical.
 
 Our current strategy to lower the risk is to break down this lock contention mitigation task:
 
 1. Remove the INP_INFO lock from locations where it is not actually required
 2. Replace the INP_INFO lock with more specific locks where appropriate
 3. Change the lock order from "INP_INFO (before) INP" to "INP (before) INP_INFO"
 4. Then push the INP_INFO lock deeper into the stack where appropriate
 5. Introduce an INP_HASH_BUCKET lock replacing INP_INFO where appropriate
 
 Note:  By "where appropriate" we mean the TCP stack parts where INP_INFO is a proven major contention point _and_ where the side effects of the change are clear enough to be reviewed.  The main goal is to ease testing and review of each step.

Patch attached with submission follows:

worker_processes 4;

events {
    use kqueue;
    worker_connections 1048576;
    multi_accept off;
}

timer_resolution 1000ms;
worker_cpu_affinity 000000100000 000000010000 000001000000 000010000000;

http {
    types {}
    default_type text/html;

    sendfile on;
    tcp_nopush on;

    keepalive_timeout 0;
    #keepalive_timeout 65;
    lingering_close always;
    lingering_time 2s;
    lingering_timeout 2s;

    etag off;
    add_header Last-Modified "";
    max_ranges 0;
    server_tokens off;

    open_file_cache max=10 inactive=120s;

    server {
        listen 80;
        server_name localhost;

        location / {
            root /usr/local/www/nginx;
            index index.html index.htm;
        }

        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
            root /usr/local/www/nginx-dist;
        }
    }
}


>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->freebsd-net 
Responsible-Changed-By: eadler 
Responsible-Changed-When: Mon Nov 4 14:27:26 UTC 2013 
Responsible-Changed-Why:  
Over to networking group 

http://www.freebsd.org/cgi/query-pr.cgi?pr=183659 

From: "Julien Charbon" <jcharbon@verisign.com>
To: bug-followup@freebsd.org
Cc:  
Subject: Re: kern/183659: [tcp] [patch] TCP stack lock contention with
 short-lived connections
Date: Thu, 07 Nov 2013 15:09:12 +0100

 ------------DUpEMRBHJWbgn0J5PBi2Pu
 Content-Type: text/plain; charset=iso-8859-15; format=flowed; delsp=yes
 Content-Transfer-Encoding: 7bit
 
 
   Attached is a first patch that removes the INP_INFO lock from tcp_usr_accept():
 This change simply follows the advice given in the corresponding code
 comment:  "A better fix would prevent the socket from being placed in the
 listen queue until all fields are fully initialized."  For more technical
 details, check the comment in the related change below:
 
 http://svnweb.freebsd.org/base?view=revision&revision=175612
 
   With this patch applied we see no regressions and a performance
 improvement of ~5%, i.e. with the 9.2 vanilla kernel:  52k TCP Queries Per
 Second; with 9.2 + the attached patch:  55k TCP QPS.
 
 -- 
 Julien
 ------------DUpEMRBHJWbgn0J5PBi2Pu
 Content-Disposition: attachment; filename=inp-info-accept.patch
 Content-Type: application/octet-stream; name=inp-info-accept.patch
 Content-Transfer-Encoding: Base64
 
 RnJvbTogSnVsaWVuIENoYXJib24gPGpjaGFyYm9uQHZlcmlzaWduLmNvbT4NClN1
 YmplY3Q6IFtQQVRDSF0gQWRkIG5ldyBzb2NrZXQgaW4gbGlzdGVuIHF1ZXVlIG9u
 bHkgd2hlbiBmdWxseSBpbml0aWFsaXNlZA0KDQotLS0NCiBzeXMvbmV0aW5ldC90
 Y3Bfc3luY2FjaGUuYyB8IDQgKysrLQ0KIHN5cy9uZXRpbmV0L3RjcF91c3JyZXEu
 YyAgIHwgOSAtLS0tLS0tLS0NCiAyIGZpbGVzIGNoYW5nZWQsIDMgaW5zZXJ0aW9u
 cygrKSwgMTAgZGVsZXRpb25zKC0pDQoNCmRpZmYgLS1naXQgYS9zeXMvbmV0aW5l
 dC90Y3Bfc3luY2FjaGUuYyBiL3N5cy9uZXRpbmV0L3RjcF9zeW5jYWNoZS5jDQpp
 bmRleCBhZjE2NTFhLi5lYjczMzU2IDEwMDY0NA0KLS0tIGEvc3lzL25ldGluZXQv
 dGNwX3N5bmNhY2hlLmMNCisrKyBiL3N5cy9uZXRpbmV0L3RjcF9zeW5jYWNoZS5j
 DQpAQCAtNjYwLDcgKzY2MCw3IEBAIHN5bmNhY2hlX3NvY2tldChzdHJ1Y3Qgc3lu
 Y2FjaGUgKnNjLCBzdHJ1Y3Qgc29ja2V0ICpsc28sIHN0cnVjdCBtYnVmICptKQ0K
 IAkgKiBjb25uZWN0aW9uIHdoZW4gdGhlIFNZTiBhcnJpdmVkLiAgSWYgd2UgY2Fu
 J3QgY3JlYXRlDQogCSAqIHRoZSBjb25uZWN0aW9uLCBhYm9ydCBpdC4NCiAJICov
 DQotCXNvID0gc29uZXdjb25uKGxzbywgU1NfSVNDT05ORUNURUQpOw0KKwlzbyA9
 IHNvbmV3Y29ubihsc28sIDApOw0KIAlpZiAoc28gPT0gTlVMTCkgew0KIAkJLyoN
 CiAJCSAqIERyb3AgdGhlIGNvbm5lY3Rpb247IHdlIHdpbGwgZWl0aGVyIHNlbmQg
 YSBSU1Qgb3INCkBAIC04OTAsNiArODkwLDggQEAgc3luY2FjaGVfc29ja2V0KHN0
 cnVjdCBzeW5jYWNoZSAqc2MsIHN0cnVjdCBzb2NrZXQgKmxzbywgc3RydWN0IG1i
 dWYgKm0pDQogDQogCUlOUF9XVU5MT0NLKGlucCk7DQogDQorCXNvaXNjb25uZWN0
 ZWQoc28pOw0KKw0KIAlUQ1BTVEFUX0lOQyh0Y3BzX2FjY2VwdHMpOw0KIAlyZXR1
 cm4gKHNvKTsNCiANCmRpZmYgLS1naXQgYS9zeXMvbmV0aW5ldC90Y3BfdXNycmVx
 LmMgYi9zeXMvbmV0aW5ldC90Y3BfdXNycmVxLmMNCmluZGV4IGI4M2YzNGEuLjU2
 NmNjMzQgMTAwNjQ0DQotLS0gYS9zeXMvbmV0aW5ldC90Y3BfdXNycmVxLmMNCisr
 KyBiL3N5cy9uZXRpbmV0L3RjcF91c3JyZXEuYw0KQEAgLTYwOSwxMyArNjA5LDYg
 QEAgb3V0Og0KIC8qDQogICogQWNjZXB0IGEgY29ubmVjdGlvbi4gIEVzc2VudGlh
 bGx5IGFsbCB0aGUgd29yayBpcyBkb25lIGF0IGhpZ2hlciBsZXZlbHM7DQogICog
 anVzdCByZXR1cm4gdGhlIGFkZHJlc3Mgb2YgdGhlIHBlZXIsIHN0b3JpbmcgdGhy
 b3VnaCBhZGRyLg0KLSAqDQotICogVGhlIHJhdGlvbmFsZSBmb3IgYWNxdWlyaW5n
 IHRoZSB0Y2JpbmZvIGxvY2sgaGVyZSBpcyBzb21ld2hhdCBjb21wbGljYXRlZCwN
 Ci0gKiBhbmQgaXMgZGVzY3JpYmVkIGluIGRldGFpbCBpbiB0aGUgY29tbWl0IGxv
 ZyBlbnRyeSBmb3IgcjE3NTYxMi4gIEFjcXVpcmluZw0KLSAqIGl0IGRlbGF5cyBh
 biBhY2NlcHQoMikgcmFjaW5nIHdpdGggc29uZXdjb25uKCksIHdoaWNoIGluc2Vy
 dHMgdGhlIHNvY2tldA0KLSAqIGJlZm9yZSB0aGUgaW5wY2IgYWRkcmVzcy9wb3J0
 IGZpZWxkcyBhcmUgaW5pdGlhbGl6ZWQuICBBIGJldHRlciBmaXggd291bGQNCi0g
 KiBwcmV2ZW50IHRoZSBzb2NrZXQgZnJvbSBiZWluZyBwbGFjZWQgaW4gdGhlIGxp
 c3RlbiBxdWV1ZSB1bnRpbCBhbGwgZmllbGRzDQotICogYXJlIGZ1bGx5IGluaXRp
 YWxpemVkLg0KICAqLw0KIHN0YXRpYyBpbnQNCiB0Y3BfdXNyX2FjY2VwdChzdHJ1
 Y3Qgc29ja2V0ICpzbywgc3RydWN0IHNvY2thZGRyICoqbmFtKQ0KQEAgLTYzMiw3
 ICs2MjUsNiBAQCB0Y3BfdXNyX2FjY2VwdChzdHJ1Y3Qgc29ja2V0ICpzbywgc3Ry
 dWN0IHNvY2thZGRyICoqbmFtKQ0KIA0KIAlpbnAgPSBzb3RvaW5wY2Ioc28pOw0K
 IAlLQVNTRVJUKGlucCAhPSBOVUxMLCAoInRjcF91c3JfYWNjZXB0OiBpbnAgPT0g
 TlVMTCIpKTsNCi0JSU5QX0lORk9fUkxPQ0soJlZfdGNiaW5mbyk7DQogCUlOUF9X
 TE9DSyhpbnApOw0KIAlpZiAoaW5wLT5pbnBfZmxhZ3MgJiAoSU5QX1RJTUVXQUlU
 IHwgSU5QX0RST1BQRUQpKSB7DQogCQllcnJvciA9IEVDT05OQUJPUlRFRDsNCkBA
 IC02NTIsNyArNjQ0LDYgQEAgdGNwX3Vzcl9hY2NlcHQoc3RydWN0IHNvY2tldCAq
 c28sIHN0cnVjdCBzb2NrYWRkciAqKm5hbSkNCiBvdXQ6DQogCVRDUERFQlVHMihQ
 UlVfQUNDRVBUKTsNCiAJSU5QX1dVTkxPQ0soaW5wKTsNCi0JSU5QX0lORk9fUlVO
 TE9DSygmVl90Y2JpbmZvKTsNCiAJaWYgKGVycm9yID09IDApDQogCQkqbmFtID0g
 aW5fc29ja2FkZHIocG9ydCwgJmFkZHIpOw0KIAlyZXR1cm4gZXJyb3I7
 
 ------------DUpEMRBHJWbgn0J5PBi2Pu--
 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/183659: commit references a PR
Date: Tue, 28 Jan 2014 20:28:45 +0000 (UTC)

 Author: gnn
 Date: Tue Jan 28 20:28:32 2014
 New Revision: 261242
 URL: http://svnweb.freebsd.org/changeset/base/261242
 
 Log:
   Decrease lock contention within the TCP accept case by removing
   the INP_INFO lock from tcp_usr_accept.  As the PR/patch states
   this was following the advice already in the code.
   See the PR below for a full disucssion of this change and its
   measured effects.
   
   PR:		183659
   Submitted by:	Julian Charbon
   Reviewed by:	jhb
 
 Modified:
   head/sys/netinet/tcp_syncache.c
   head/sys/netinet/tcp_usrreq.c
 
 Modified: head/sys/netinet/tcp_syncache.c
 ==============================================================================
 --- head/sys/netinet/tcp_syncache.c	Tue Jan 28 19:12:31 2014	(r261241)
 +++ head/sys/netinet/tcp_syncache.c	Tue Jan 28 20:28:32 2014	(r261242)
 @@ -682,7 +682,7 @@ syncache_socket(struct syncache *sc, str
  	 * connection when the SYN arrived.  If we can't create
  	 * the connection, abort it.
  	 */
 -	so = sonewconn(lso, SS_ISCONNECTED);
 +	so = sonewconn(lso, 0);
  	if (so == NULL) {
  		/*
  		 * Drop the connection; we will either send a RST or
 @@ -922,6 +922,8 @@ syncache_socket(struct syncache *sc, str
  
  	INP_WUNLOCK(inp);
  
 +	soisconnected(so);
 +
  	TCPSTAT_INC(tcps_accepts);
  	return (so);
  
 
 Modified: head/sys/netinet/tcp_usrreq.c
 ==============================================================================
 --- head/sys/netinet/tcp_usrreq.c	Tue Jan 28 19:12:31 2014	(r261241)
 +++ head/sys/netinet/tcp_usrreq.c	Tue Jan 28 20:28:32 2014	(r261242)
 @@ -610,13 +610,6 @@ out:
  /*
   * Accept a connection.  Essentially all the work is done at higher levels;
   * just return the address of the peer, storing through addr.
 - *
 - * The rationale for acquiring the tcbinfo lock here is somewhat complicated,
 - * and is described in detail in the commit log entry for r175612.  Acquiring
 - * it delays an accept(2) racing with sonewconn(), which inserts the socket
 - * before the inpcb address/port fields are initialized.  A better fix would
 - * prevent the socket from being placed in the listen queue until all fields
 - * are fully initialized.
   */
  static int
  tcp_usr_accept(struct socket *so, struct sockaddr **nam)
 @@ -633,7 +626,6 @@ tcp_usr_accept(struct socket *so, struct
  
  	inp = sotoinpcb(so);
  	KASSERT(inp != NULL, ("tcp_usr_accept: inp == NULL"));
 -	INP_INFO_RLOCK(&V_tcbinfo);
  	INP_WLOCK(inp);
  	if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
  		error = ECONNABORTED;
 @@ -653,7 +645,6 @@ tcp_usr_accept(struct socket *so, struct
  out:
  	TCPDEBUG2(PRU_ACCEPT);
  	INP_WUNLOCK(inp);
 -	INP_INFO_RUNLOCK(&V_tcbinfo);
  	if (error == 0)
  		*nam = in_sockaddr(port, &addr);
  	return error;
 _______________________________________________
 svn-src-all@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/svn-src-all
 To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
 
State-Changed-From-To: open->patched 
State-Changed-By: gnn 
State-Changed-When: Tue Jan 28 20:32:31 UTC 2014 
State-Changed-Why:  
Patched with commit 261242 

http://www.freebsd.org/cgi/query-pr.cgi?pr=183659 

From: "Charbon, Julien" <jcharbon@verisign.com>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/183659: [tcp] TCP stack lock contention with short-lived
 connections
Date: Mon, 17 Mar 2014 15:54:03 +0100

   Just a follow-up that updates lock profiling results with
 short-lived TCP connection traffic on FreeBSD-10.0 RELEASE:
 (Previous results were made on FreeBSD-9.2 RELEASE)
 
 o FreeBSD-10 RELEASE:
 
 # sysctl debug.lock.prof.stats | head -2; sysctl debug.lock.prof.stats | sort -n -k 4 -r | head -5
 debug.lock.prof.stats:
       max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name
        37    321900     3049892    13033648      610019      4     21  0 588013 sys/netinet/tcp_input.c:778 (rw:tcp)     tcp_input() (SYN|FIN|RST)
        51    115462     3240265    12270984      553157      5     22  0 545293 sys/netinet/tcp_input.c:1013 (rw:tcp)    tcp_input() (state != ESTABLISHED)
        29     62577     1170617     8754815      305885      3     28  0 296845 sys/netinet/tcp_usrreq.c:728 (rw:tcp)    tcp_usr_close()
         6     62645      146544     8548857      292058      0     29  0 283587 sys/netinet/tcp_usrreq.c:984 (rw:tcp)    tcp_usr_shutdown()
        11     62595      198811     6525067      309009      0     21  0 304522 sys/netinet/tcp_usrreq.c:635 (rw:tcp)    tcp_usr_accept()
 
   - While the lock contention spots moved a little between 9.2 and 10.0, nothing
 major changed:  the top 5 entries still belong to the (rw:tcp) lock (a.k.a. TCP INP_INFO).
 
 o FreeBSD-10 RELEASE + PCBGROUP kernel option (by popular demand):
 
 # sysctl debug.lock.prof.stats | head -2; sysctl debug.lock.prof.stats | sort -n -k 4 -r | head -5
 debug.lock.prof.stats:
       max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name
        58     84250     2970633    13154832      622401      4     21  0 598964 sys/netinet/tcp_input.c:778 (rw:tcp)     tcp_input() (SYN|FIN|RST)
        47    224326     3375328    12945466      562451      6     23  0 554567 sys/netinet/tcp_input.c:1013 (rw:tcp)    tcp_input() (state != ESTABLISHED)
        22     84332     1193078     9693951      311555      3     31  0 302420 sys/netinet/tcp_usrreq.c:728 (rw:tcp)    tcp_usr_close()
         6     84307      151411     9137383      298120      0     30  0 289496 sys/netinet/tcp_usrreq.c:984 (rw:tcp)    tcp_usr_shutdown()
        15     84351      201705     6504520      314353      0     20  0 310270 sys/netinet/tcp_usrreq.c:635 (rw:tcp)    tcp_usr_accept()
 
   - No changes at all in the first ranks when using the PCBGROUP option on
 FreeBSD-10 RELEASE.  I did verify that PCBGROUP was in use:  at rank #36
 there is the specific pcbgroup lock:
 
        11         9      289817        4815     1505626      0      0  0  16054 sys/netinet/in_pcb.c:1530 (sleep mutex:pcbgroup)
 
 o FreeBSD-10 RELEASE + current lock mitigation patches [1][2]:
 
 # sysctl debug.lock.prof.stats | head -2; sysctl debug.lock.prof.stats | sort -n -k 4 -r | head -20
 debug.lock.prof.stats:
       max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name
        29       297     3781629    13476466      734686      5     18  0 715214 sys/netinet/tcp_input.c:778 (rw:tcp)     tcp_input() (SYN|FIN|RST)
        35       287     3817278    12301410      672907      5     18  0 669324 sys/netinet/tcp_input.c:1013 (rw:tcp)    tcp_input() (state != ESTABLISHED)
        18       170     1392058     2494823      367131      3      6  0 357888 sys/netinet/tcp_usrreq.c:719 (rw:tcp)    tcp_usr_shutdown()
         7       141      182209     2433120      350488      0      6  0 344878 sys/netinet/tcp_usrreq.c:975 (rw:tcp)    tcp_usr_close()
        10       259       26786      933073       38101      0     24  0  37624 sys/netinet/tcp_timer.c:493 (rw:tcp)     tcp_timer_rexmt()
 
   - No more tcp_usr_accept() (expected)
 
   o Global results:  Maximum short-lived TCP connection rate without dropping a single packet:
 
   - FreeBSD 10.0 RELEASE:             40.0k
   - FreeBSD 10.0 RELEASE + PCBGROUP:  40.0k
   - FreeBSD 10.0 RELEASE + patches:   56.8k
 
 [1] Decrease lock contention within the TCP accept case by removing
   the INP_INFO lock from tcp_usr_accept.
 http://svnweb.freebsd.org/base?view=revision&revision=261242
 
 [2] tw-clock-v2.patch attached in:
 http://lists.freebsd.org/pipermail/freebsd-net/2014-March/038124.html
 
 --
 Julien

>Unformatted:
