From SRS0=6448652a836d2ec8179b9be65cc60934e444ae3b=457=es.net=oberman@es.net  Thu Sep 13 20:14:10 2007
Return-Path: <SRS0=6448652a836d2ec8179b9be65cc60934e444ae3b=457=es.net=oberman@es.net>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C6BFC16A419
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 13 Sep 2007 20:14:10 +0000 (UTC)
	(envelope-from SRS0=6448652a836d2ec8179b9be65cc60934e444ae3b=457=es.net=oberman@es.net)
Received: from postal1.es.net (postal1.es.net [IPv6:2001:400:14:3::6])
	by mx1.freebsd.org (Postfix) with ESMTP id 645A613C45D
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 13 Sep 2007 20:14:10 +0000 (UTC)
	(envelope-from SRS0=6448652a836d2ec8179b9be65cc60934e444ae3b=457=es.net=oberman@es.net)
Received: from ptavv.es.net (ptavv.es.net [198.128.4.29])
        by postal1.es.net (Postal Node 1) with ESMTP (SSL) id SDB11903
        for <FreeBSD-gnats-submit@freebsd.org>; Thu, 13 Sep 2007 13:14:03 -0700
Received: by ptavv.es.net (Tachyon Server, from userid 9381)
	id 958D14500E; Thu, 13 Sep 2007 13:14:02 -0700 (PDT)
Message-Id: <20070913201402.958D14500E@ptavv.es.net>
Date: Thu, 13 Sep 2007 13:14:02 -0700 (PDT)
From: Kevin Oberman <oberman@es.net>
Reply-To: Kevin Oberman <oberman@es.net>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: Excessive TCP window updates
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         116335
>Category:       kern
>Synopsis:       [tcp] Excessive TCP window updates
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    andre
>State:          analyzed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Sep 13 20:20:06 GMT 2007
>Closed-Date:    
>Last-Modified:  Fri Jul  5 15:50:00 UTC 2013
>Originator:     Kevin Oberman
>Release:        FreeBSD 6.2-STABLE i386
>Organization:
ESnet--The Energy Sciences Network
>Environment:
System: FreeBSD ptavv.es.net 6.2-STABLE FreeBSD 6.2-STABLE #11: Thu Aug 16 17:18:58 PDT 2007 root@ptavv.es.net:/usr/obj/usr/src/sys/PTAVV i386

>Description:
	Testing over a trans-continental 10GE between two boxes with
mxge cards, at a point about 2.5 seconds into the transfer, the receive
node starts updating the window size as fast as it can process the
data. The result is that it is sending updates at intervals of between
0 and 4 microseconds. This can result in several hundred window
updates between "real" packets and, I suspect, is causing performance
problems.

I see an old message at:
http://lists.freebsd.org/pipermail/freebsd-net/2005-January/006141.html
that may be the source of the problem, though I have not yet figured
out exactly how this code works.

>How-To-Repeat:
	Send a TCP stream between two hosts with a ~100 ms RTT between
them at speeds exceeding 3 Gbps.
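
For a sense of the scale involved, the following is a minimal sketch of the
bandwidth-delay arithmetic for the path described above. It is not part of
the original report. The helper names bdp_bytes() and min_rcv_scale() are
hypothetical, and the figures assume exactly 3 Gbps and 100 ms; the 65535
byte base window and the maximum shift of 14 come from the TCP window scale
option.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that must be in flight to fill the path: rate * RTT / 8. */
static uint64_t
bdp_bytes(uint64_t rate_bps, uint64_t rtt_ms)
{
	assert(rtt_ms > 0);
	return rate_bps / 8 * rtt_ms / 1000;
}

/*
 * Smallest window scale shift (0..14) for which a 16-bit window
 * field, 65535 << shift, can cover the given number of bytes.
 */
static int
min_rcv_scale(uint64_t bytes)
{
	int shift = 0;

	while (shift < 14 && ((uint64_t)65535 << shift) < bytes)
		shift++;
	return shift;
}
```

With these assumptions bdp_bytes(3000000000ULL, 100) gives 37500000 bytes,
and min_rcv_scale() of that value is 10, so the receive socket buffer has to
be tens of megabytes before the path can be kept full.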
>Fix:
Unknown.
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->freebsd-net 
Responsible-Changed-By: rodrigc 
Responsible-Changed-When: Fri Sep 14 00:53:26 UTC 2007 
Responsible-Changed-Why:  
Send to freebsd-net@ 

http://www.freebsd.org/cgi/query-pr.cgi?pr=116335 
Responsible-Changed-From-To: freebsd-net->andre 
Responsible-Changed-By: andre 
Responsible-Changed-When: Sat Sep 15 08:45:13 UTC 2007 
Responsible-Changed-Why:  
Take over. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=116335 
State-Changed-From-To: open->analyzed 
State-Changed-By: andre 
State-Changed-When: Sun Aug 15 10:07:25 UTC 2010 
State-Changed-Why:  
Cause found and patch upcoming. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=116335 

From: Andre Oppermann <andre@freebsd.org>
To: oberman@es.net
Cc: lstewart@freebsd.org, bug-followup@freebsd.org
Subject: Re: kern/116335: [tcp] Excessive TCP window updates
Date: Mon, 16 Aug 2010 23:27:15 +0200

 This is a multi-part message in MIME format.
 --------------050209060505090409010803
 Content-Type: text/plain; charset=ISO-8859-1; format=flowed
 Content-Transfer-Encoding: 7bit
 
 Kevin,
 
 thanks for your bug report about the window updates.  Please try
 the attached patch.  It changes TCP to be much more restrictive
 in generating window updates.  Window updates are actually only
 really necessary when the socket buffer is close to being full
 and a zero window was announced.  Then independent window updates
 make the remote end send again.  In all other cases the ACK clock
 will handle reporting of the current window just fine.
 
 The patch will generate window updates only if the window can be
 increased by two segments at least (silly window avoidance), and:
   - the free space in the socket buffer is at most 1/8 of its size, or
   - the window is increased by at least 1/4 of the sockbuf, or
   - the socket buffer is smaller than 8 times MSS.
 
 And it won't issue an independent window update if a delayed ACK
 is pending.
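
The conditions listed above can be sketched as a standalone predicate. This
is only a simplified model for illustration, not the kernel code: the
function want_window_update() and its parameters are hypothetical, and it
leaves out the TF_NEEDSYN and received-FIN checks that the patch keeps.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the update decision; not kernel code.
 *   adv     bytes by which the advertised window could grow
 *   recwin  free space in the receive socket buffer, in bytes
 *   hiwat   receive socket buffer capacity, in bytes
 *   mss     maximum segment size, in bytes
 *   delack  true if a delayed ACK is already pending
 */
static bool
want_window_update(long adv, long recwin, long hiwat, long mss, bool delack)
{
	assert(mss > 0 && hiwat > 0);

	/* Nothing to announce, or the pending ACK will carry it. */
	if (recwin <= 0 || delack)
		return false;
	/* Silly window avoidance: need at least two segments. */
	if (adv < 2 * mss)
		return false;
	/* Large increase, nearly full buffer, or very small buffer. */
	return adv >= hiwat / 4 || recwin <= hiwat / 8 || hiwat <= 8 * mss;
}
```

For example, with a 2000000 byte buffer and an MSS of 1448, a possible
increase of 2896 bytes while 1000000 bytes are free yields no update,
whereas the same increase with only 200000 bytes free does.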
 
 Lawrence: could you review the patch as well?
 
 -- 
 Andre
 
 --------------050209060505090409010803
 Content-Type: text/plain;
  name="patch-1.diff"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
  filename="patch-1.diff"
 
 Index: tcp_output.c
 ===================================================================
 --- tcp_output.c	(revision 211396)
 +++ tcp_output.c	(working copy)
 @@ -543,29 +543,53 @@
  	}
  
  	/*
 -	 * Compare available window to amount of window
 -	 * known to peer (as advertised window less
 -	 * next expected input).  If the difference is at least two
 -	 * max size segments, or at least 50% of the maximum possible
 -	 * window, then want to send a window update to peer.
 -	 * Skip this if the connection is in T/TCP half-open state.
 -	 * Don't send pure window updates when the peer has closed
 -	 * the connection and won't ever send more data.
 +	 * Sending of standalone window updates.
 +	 *
 +	 * Window updates are important when we close our window
 +	 * due to a full socket buffer and are opening it again
 +	 * after the application reads data from it.  Once the
 +	 * window has opened again and the remote end starts to
 +	 * send again the ACK clock takes over and provides the
 +	 * most current window information.
 +	 *
 +	 * We must avoid the silly window syndrome whereas
 +	 * every read from the receive buffer, no matter how
 +	 * small, causes a window update to be sent. We also
 +	 * should avoid sending a flurry of window updates when
 +	 * the socket buffer had queued a lot of data and the
 +	 * application is doing small reads.
 +	 *
 +	 * Don't send an independent window update if a delayed
 +	 * ACK is pending (it will get piggy-backed on it) or the
 +	 * remote side already has done a half-close and won't send
 +	 * more data.  Skip this if the connection is in T/TCP
 +	 * half-open state.
  	 */
  	if (recwin > 0 && !(tp->t_flags & TF_NEEDSYN) &&
 +	    !(tp->t_flags & TF_DELACK) &&
  	    !TCPS_HAVERCVDFIN(tp->t_state)) {
  		/*
  		 * "adv" is the amount we can increase the window,
  		 * taking into account that we are limited by
  		 * TCP_MAXWIN << tp->rcv_scale.
  		 */
 -		long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
 -			(tp->rcv_adv - tp->rcv_nxt);
 +		u_int adv = min(recwin, ((u_int)TCP_MAXWIN << tp->rcv_scale) -
 +			(tp->rcv_adv - tp->rcv_nxt));
  
 -		if (adv >= (long) (2 * tp->t_maxseg))
 +		/*
 +		 * Send an update when we can increase by more than
 +		 * 1/4th of the socket buffer capacity.
 +		 * When the buffer is getting full or is very small
 +		 * be more aggressive and send an update whenever
 +		 * we can increase by two mss sized segments.
 +		 * In all other situations the ACK's to new incoming
 +		 * data will carry further increases.
 +		 */
 +		if (adv >= 2 * tp->t_maxseg &&
 +		    (adv >= so->so_rcv.sb_hiwat / 4 ||
 +		     recwin <= so->so_rcv.sb_hiwat / 8 ||
 +		     so->so_rcv.sb_hiwat <= 8 * tp->t_maxopd))
  			goto send;
 -		if (2 * adv >= (long) so->so_rcv.sb_hiwat)
 -			goto send;
  	}
  
  	/* 
 
 --------------050209060505090409010803--

From: "Kevin Oberman" <oberman@es.net>
To: Andre Oppermann <andre@freebsd.org>
Cc: lstewart@freebsd.org, bug-followup@freebsd.org
Subject: Re: kern/116335: [tcp] Excessive TCP window updates 
Date: Mon, 16 Aug 2010 14:50:27 -0700

 > Date: Mon, 16 Aug 2010 23:27:15 +0200
 > From: Andre Oppermann <andre@freebsd.org>
 > 
 > Kevin,
 > 
 > thanks for your bug report about the window updates.  Please try
 > the attached patch.  It changes TCP to be much more restrictive
 > in generating window updates.  Window updates are actually only
 > really necessary when the socket buffer is close to being full
 > and a zero window was announced.  Then independent window updates
 > make the remote end send again.  In all other cases the ACK clock
 > will handle reporting of the current window just fine.
 > 
 > The patch will generate window updates only if the window can be
 > increased by two segments at least (silly window avoidance), and:
 >   - the free space in the socket buffer is at most 1/8 of its size, or
 >   - the window is increased by at least 1/4 of the sockbuf, or
 >   - the socket buffer is smaller than 8 times MSS.
 > 
 > And it won't issue an independent window update if a delayed ACK
 > is pending.
 > 
 > Lawrence: could you review the patch as well?
 > 
 > -- 
 > Andre
 > 
 
 Wow! I had given up on hearing anything about this.
 
 I no longer have my test setup for looking at this and am currently
 swamped by deadlines, so it may take a while. I'll try to get to it no
 later than next week.
 -- 
 R. Kevin Oberman, Network Engineer
 Energy Sciences Network (ESnet)
 Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
 E-mail: oberman@es.net			Phone: +1 510 486-8634
 Key fingerprint:059B 2DDF 031C 9BA3 14A4  EADA 927D EBB3 987B 3751

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/116335: commit references a PR
Date: Fri,  5 Jul 2013 15:48:07 +0000 (UTC)

 Author: andre
 Date: Fri Jul  5 15:47:59 2013
 New Revision: 252793
 URL: http://svnweb.freebsd.org/changeset/base/252793
 
 Log:
   MFC r242251, r242311:
   
    Defer sending an independent window update if a delayed ACK is pending
    saving a packet.  The window update then gets piggy-backed on the next
    already scheduled ACK.
   
   MFC r242252:
   
    Prevent a flurry of forced window updates when an application is
    doing small reads on a (partially) filled receive socket buffer.
   
    Normally one would send a window update every time the available
    space in the socket buffer increases by two times MSS.  This leads
    to a flurry of window updates that do not provide any meaningful
    new information to the sender.  There still is available space in
    the window and the sender can continue sending data.  All window
    updates then get carried by the regular ACKs.  Only when the socket
    buffer was (almost) full and the window closed accordingly does a
    window update deliver new information and allow the sender to start
    sending more data again.
   
    Send window updates only every two MSS when the socket buffer
    has less than 1/8 space available, or the available space in the
    socket buffer increased by 1/4 its full capacity, or the socket
    buffer is very small.  The next regular data ACK will carry and
    report the exact window size again.
   
    Reported by:	sbruno
    Tested by:	darrenr
    Tested by:	Darren Baginski
    PR:		kern/116335
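
The effect described in the log message can be illustrated with a small
model that counts standalone window updates while an application drains a
partially filled buffer in small reads. This is a sketch under simplifying
assumptions, not the kernel logic: no new data arrives, no delayed ACK is
pending, window scaling is ignored, and old_rule(), new_rule() and
count_updates() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*rule_fn)(long adv, long recwin, long hiwat, long mss);

/* Test before the change: two segments, or half the buffer. */
static bool
old_rule(long adv, long recwin, long hiwat, long mss)
{
	(void)recwin;
	return adv >= 2 * mss || 2 * adv >= hiwat;
}

/* Test after the change, as described in the log message. */
static bool
new_rule(long adv, long recwin, long hiwat, long mss)
{
	return adv >= 2 * mss &&
	    (adv >= hiwat / 4 || recwin <= hiwat / 8 || hiwat <= 8 * mss);
}

/*
 * Drain 'queued' bytes from a buffer of capacity 'hiwat' in reads of
 * 'chunk' bytes and count how many standalone updates 'rule' sends.
 */
static int
count_updates(rule_fn rule, long hiwat, long mss, long queued, long chunk)
{
	long advertised = hiwat - queued;	/* window the peer knows */
	int updates = 0;

	assert(chunk > 0 && queued >= 0 && queued <= hiwat);

	while (queued > 0) {
		long n = queued < chunk ? queued : chunk;
		long recwin, adv;

		queued -= n;
		recwin = hiwat - queued;	/* free space after the read */
		adv = recwin - advertised;	/* possible window increase */
		if (rule(adv, recwin, hiwat, mss)) {
			advertised = recwin;
			updates++;
		}
	}
	return updates;
}
```

With a 65536 byte buffer that is half full, an MSS of 1448 and 512 byte
reads, this model counts 10 updates under the old test and 2 under the new
one.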
 
 Modified:
   stable/9/sys/netinet/tcp_output.c
 Directory Properties:
   stable/9/sys/   (props changed)
 
 Modified: stable/9/sys/netinet/tcp_output.c
 ==============================================================================
 --- stable/9/sys/netinet/tcp_output.c	Fri Jul  5 15:30:02 2013	(r252792)
 +++ stable/9/sys/netinet/tcp_output.c	Fri Jul  5 15:47:59 2013	(r252793)
 @@ -540,19 +540,39 @@ after_sack_rexmit:
  	}
  
  	/*
 -	 * Compare available window to amount of window
 -	 * known to peer (as advertised window less
 -	 * next expected input).  If the difference is at least two
 -	 * max size segments, or at least 50% of the maximum possible
 -	 * window, then want to send a window update to peer.
 -	 * Skip this if the connection is in T/TCP half-open state.
 -	 * Don't send pure window updates when the peer has closed
 -	 * the connection and won't ever send more data.
 +	 * Sending of standalone window updates.
 +	 *
 +	 * Window updates are important when we close our window due to a
 +	 * full socket buffer and are opening it again after the application
 +	 * reads data from it.  Once the window has opened again and the
 +	 * remote end starts to send again the ACK clock takes over and
 +	 * provides the most current window information.
 +	 *
 +	 * We must avoid the silly window syndrome whereas every read
 +	 * from the receive buffer, no matter how small, causes a window
 +	 * update to be sent.  We also should avoid sending a flurry of
 +	 * window updates when the socket buffer had queued a lot of data
 +	 * and the application is doing small reads.
 +	 *
 +	 * Prevent a flurry of pointless window updates by only sending
 +	 * an update when we can increase the advertized window by more
 +	 * than 1/4th of the socket buffer capacity.  When the buffer is
 +	 * getting full or is very small be more aggressive and send an
 +	 * update whenever we can increase by two mss sized segments.
 +	 * In all other situations the ACK's to new incoming data will
 +	 * carry further window increases.
 +	 *
 +	 * Don't send an independent window update if a delayed
 +	 * ACK is pending (it will get piggy-backed on it) or the
 +	 * remote side already has done a half-close and won't send
 +	 * more data.  Skip this if the connection is in T/TCP
 +	 * half-open state.
  	 */
  	if (recwin > 0 && !(tp->t_flags & TF_NEEDSYN) &&
 +	    !(tp->t_flags & TF_DELACK) &&
  	    !TCPS_HAVERCVDFIN(tp->t_state)) {
  		/*
 -		 * "adv" is the amount we can increase the window,
 +		 * "adv" is the amount we could increase the window,
  		 * taking into account that we are limited by
  		 * TCP_MAXWIN << tp->rcv_scale.
  		 */
 @@ -572,9 +592,11 @@ after_sack_rexmit:
  		 */
  		if (oldwin >> tp->rcv_scale == (adv + oldwin) >> tp->rcv_scale)
  			goto dontupdate;
 -		if (adv >= (long) (2 * tp->t_maxseg))
 -			goto send;
 -		if (2 * adv >= (long) so->so_rcv.sb_hiwat)
 +
 +		if (adv >= (long)(2 * tp->t_maxseg) &&
 +		    (adv >= (long)(so->so_rcv.sb_hiwat / 4) ||
 +		     recwin <= (long)(so->so_rcv.sb_hiwat / 8) ||
 +		     so->so_rcv.sb_hiwat <= 8 * tp->t_maxseg))
  			goto send;
  	}
  dontupdate:
 
>Unformatted:
