From dan@dan.emsphone.com  Wed Dec 15 18:27:54 2004
Return-Path: <dan@dan.emsphone.com>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 7E5FB16A4CE
	for <FreeBSD-gnats-submit@freebsd.org>; Wed, 15 Dec 2004 18:27:54 +0000 (GMT)
Received: from dan.emsphone.com (dan.emsphone.com [199.67.51.101])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 0CAFE43D58
	for <FreeBSD-gnats-submit@freebsd.org>; Wed, 15 Dec 2004 18:27:54 +0000 (GMT)
	(envelope-from dan@dan.emsphone.com)
Received: (from dan@localhost)
	by dan.emsphone.com (8.13.1/8.13.1) id iBFIRqDB019997;
	Wed, 15 Dec 2004 12:27:52 -0600 (CST)
	(envelope-from dan)
Message-Id: <200412151827.iBFIRqDB019997@dan.emsphone.com>
Date: Wed, 15 Dec 2004 12:27:52 -0600 (CST)
From: Dan Nelson <dnelson@allantgroup.com>
To: FreeBSD-gnats-submit@freebsd.org
Cc: Matthew Dillon <dillon@apollo.backplane.com>
Subject: [PATCH] Incorrect inflight bandwidth calculation on first packet
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         75122
>Category:       kern
>Synopsis:       [netinet] [patch] Incorrect inflight bandwidth calculation on first packet
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    silby
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Dec 15 18:30:31 GMT 2004
>Closed-Date:    
>Last-Modified:  Thu Nov 15 23:19:55 UTC 2007
>Originator:     Dan Nelson
>Release:        FreeBSD 5.3-STABLE i386
>Organization:
The Allant Group
>Environment:
System: FreeBSD dan.emsphone.com 5.3-STABLE FreeBSD 5.3-STABLE #383: Wed Dec 15 11:29:24 CST 2004 zsh@dan.emsphone.com:/usr/src/sys/i386/compile/DANSMP i386

	
>Description:

Matt, I'm CC'ing you because it looks like the bug is also in
Dragonfly, so I think you'll want to commit something similar.

The inflight window scaling algorithm keeps a decaying average of a
socket's bandwidth, but on the first call to tcp_xmit_bandwidth_limit,
the sequence number of the previous packet is not known, so the
(ack_seq - tp->t_bw_rtseq) clause just gives you the raw sequence
number, resulting in a random value for the calculated bandwidth.

Until enough packets have traveled so the average has decayed to the
correct value, the calculated window is large enough that it's not even
used.  On a dialup, for example, it never gets a chance.

	
>How-To-Repeat:
	

Run net.inet.tcp.hostcache.list, and take a look at the BANDWIDTH
column.  Note that there is a secondary bug in this sysctl output that
multiplies the bandwidth by 8, even though the original value is
already in octets/sec.

>Fix:

The minimal fix is to check for (tp->t_bw_rtttime == 0 || tp->t_bw_rtseq
== 0 || (int)bw < 0), and if it evalutes true, store the current values
and return.  I recommend committing the whole patch though.  It changed
the debugging output to print the instantaneous bw, and the current and
previous average bw.

Index: tcp_hostcache.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_hostcache.c,v
retrieving revision 1.7
diff -u -r1.7 tcp_hostcache.c
--- tcp_hostcache.c	16 Aug 2004 18:32:07 -0000	1.7
+++ tcp_hostcache.c	15 Dec 2004 17:42:02 -0000
@@ -175,7 +175,7 @@
      &tcp_hostcache.purgeall, 0, "Expire all entires on next purge run");
 
 SYSCTL_PROC(_net_inet_tcp_hostcache, OID_AUTO, list,
-	CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_SKIP, 0, 0,
+	CTLTYPE_STRING | CTLFLAG_RD, 0, 0,
 	sysctl_tcp_hc_list, "A", "List of all hostcache entries");
 
 
@@ -676,7 +676,7 @@
 				(RTM_RTTUNIT / (hz * TCP_RTT_SCALE))),
 			    msec(hc_entry->rmx_rttvar *
 				(RTM_RTTUNIT / (hz * TCP_RTT_SCALE))),
-			    hc_entry->rmx_bandwidth * 8,
+			    hc_entry->rmx_bandwidth,
 			    hc_entry->rmx_cwnd,
 			    hc_entry->rmx_sendpipe,
 			    hc_entry->rmx_recvpipe,
Index: tcp_subr.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_subr.c,v
retrieving revision 1.201.2.4
diff -u -r1.201.2.4 tcp_subr.c
--- tcp_subr.c	1 Dec 2004 05:37:28 -0000	1.201.2.4
+++ tcp_subr.c	15 Dec 2004 17:43:19 -0000
@@ -1907,7 +1907,7 @@
 void
 tcp_xmit_bandwidth_limit(struct tcpcb *tp, tcp_seq ack_seq)
 {
-	u_long bw;
+	u_long bw, avgbw;
 	u_long bwnd;
 	int save_ticks;
 
@@ -1933,18 +1933,26 @@
 	 * effectively reset t_bw_rtttime for this case.
 	 */
 	save_ticks = ticks;
-	if ((u_int)(save_ticks - tp->t_bw_rtttime) < 1)
+
+	/* Must wait at least 1 tick to calculate anything */
+	if (save_ticks == tp->t_bw_rtttime)
 		return;
 
 	bw = (int64_t)(ack_seq - tp->t_bw_rtseq) * hz /
 	    (save_ticks - tp->t_bw_rtttime);
+
+	/* If this is our first time through, or if something has wrapped, 
+	   record the current values and return */
+	if (tp->t_bw_rtttime == 0 || tp->t_bw_rtseq == 0 || (int)bw < 0) {	
+		tp->t_bw_rtttime = save_ticks;
+		tp->t_bw_rtseq = ack_seq;
+		return;
+	}
+
 	tp->t_bw_rtttime = save_ticks;
 	tp->t_bw_rtseq = ack_seq;
-	if (tp->t_bw_rtttime == 0 || (int)bw < 0)
-		return;
-	bw = ((int64_t)tp->snd_bandwidth * 15 + bw) >> 4;
 
-	tp->snd_bandwidth = bw;
+	avgbw = ((int64_t)tp->snd_bandwidth * 15 + bw) >> 4;
 
 	/*
 	 * Calculate the semi-static bandwidth delay product, plus two maximal
@@ -1975,22 +1983,25 @@
 	 *	    no other choice.
 	 */
 #define USERTT	((tp->t_srtt + tp->t_rttbest) / 2)
-	bwnd = (int64_t)bw * USERTT / (hz << TCP_RTT_SHIFT) + tcp_inflight_stab * tp->t_maxseg / 10;
+	bwnd = (int64_t)avgbw * USERTT / (hz << TCP_RTT_SHIFT) + tcp_inflight_stab * tp->t_maxseg / 10;
 #undef USERTT
 
 	if (tcp_inflight_debug > 0) {
 		static int ltime;
 		if ((u_int)(ticks - ltime) >= hz / tcp_inflight_debug) {
 			ltime = ticks;
-			printf("%p bw %ld rttbest %d srtt %d bwnd %ld\n",
+			printf("%p bw %lu(%lu,%lu) rttbest %d srtt %d bwnd %ld\n",
 			    tp,
-			    bw,
+			    avgbw, tp->snd_bandwidth, bw,
 			    tp->t_rttbest,
 			    tp->t_srtt,
 			    bwnd
 			);
 		}
 	}
+	
+	tp->snd_bandwidth = avgbw;
+	
 	if ((long)bwnd < tcp_inflight_min)
 		bwnd = tcp_inflight_min;
 	if (bwnd > tcp_inflight_max)
>Release-Note:
>Audit-Trail:
Adding to audit trail from misfiled PR kern/75140:

Date: Wed, 15 Dec 2004 19:33:43 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>

    Hmm.  Well, looking at the original code it looks like t_bw_rtttime
    is properly initialized in tcp_newtcpcb(), so it will never be 0.
    t_bw_rtseq is initialized for outgoing connections, but not properly
    initialized for incoming connections.

    The original code was rather messy and really needs some reorganization,
    so I would recommend doing some reorganizing rather then just patching
    more mess on top.

    * Remove initialization of t_bw_rttime so it can be tested against 0
      in the bandwidth limiting code.

    * Auto-initialize t_bw_rtttime in tcp_xmit_bandwidth_limit by testing
      it against 0.

    * Add an idle check.

    * Assign an initial bandwidth of tcp_inflight_min so high speed
      connections do not have to ramp-up the exponential average all
      the way from 0.

    The patch below is against DragonFly and will not apply to FreeBSD, but
    is included to demonstrate my recommendation.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

Index: tcp_subr.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_subr.c,v
retrieving revision 1.40
diff -u -r1.40 tcp_subr.c
--- tcp_subr.c	14 Nov 2004 00:49:08 -0000	1.40
+++ tcp_subr.c	16 Dec 2004 03:22:55 -0000
@@ -692,7 +692,6 @@
 	tp->snd_bwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
 	tp->snd_ssthresh = TCP_MAXWIN << TCP_MAX_WINSHIFT;
 	tp->t_rcvtime = ticks;
-	tp->t_bw_rtttime = ticks;
 	/*
 	 * IPv4 TTL initialization is necessary for an IPv6 socket as well,
 	 * because the socket may be bound to an IPv6 wildcard address,
@@ -1834,6 +1833,7 @@
 	u_long bw;
 	u_long bwnd;
 	int save_ticks;
+	int delta_ticks;
 
 	/*
 	 * If inflight_enable is disabled in the middle of a tcp connection,
@@ -1846,26 +1846,38 @@
 	}
 
 	/*
+	 * Validate the delta time.  If a connection is new or has been idle
+	 * a long time we have to reset the bandwidth calculator.
+	 */
+	save_ticks = ticks;
+	delta_ticks = save_ticks - tp->t_bw_rtttime;
+	if (tp->t_bw_rtttime == 0 || delta_ticks < 0 || delta_ticks > hz * 10) {
+		tp->t_bw_rtttime = ticks;
+		tp->t_bw_rtseq = ack_seq;
+		if (tp->snd_bandwidth == 0)
+			tp->snd_bandwidth = tcp_inflight_min;
+		return;
+	}
+	if (delta_ticks == 0)
+		return;
+
+	/*
+	 * Sanity check, plus ignore pure window update acks.
+	 */
+	if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
+		return;
+
+	/*
 	 * Figure out the bandwidth.  Due to the tick granularity this
 	 * is a very rough number and it MUST be averaged over a fairly
 	 * long period of time.  XXX we need to take into account a link
 	 * that is not using all available bandwidth, but for now our
 	 * slop will ramp us up if this case occurs and the bandwidth later
 	 * increases.
-	 *
-	 * Note: if ticks rollover 'bw' may wind up negative.  We must
-	 * effectively reset t_bw_rtttime for this case.
 	 */
-	save_ticks = ticks;
-	if ((u_int)(save_ticks - tp->t_bw_rtttime) < 1)
-		return;
-
-	bw = (int64_t)(ack_seq - tp->t_bw_rtseq) * hz / 
-	    (save_ticks - tp->t_bw_rtttime);
+	bw = (int64_t)(ack_seq - tp->t_bw_rtseq) * hz / delta_ticks;
 	tp->t_bw_rtttime = save_ticks;
 	tp->t_bw_rtseq = ack_seq;
-	if (tp->t_bw_rtttime == 0 || (int)bw < 0)
-		return;
 	bw = ((int64_t)tp->snd_bandwidth * 15 + bw) >> 4;
 
 	tp->snd_bandwidth = bw;
Index: tcp_usrreq.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_usrreq.c,v
retrieving revision 1.29
diff -u -r1.29 tcp_usrreq.c
--- tcp_usrreq.c	8 Dec 2004 23:59:01 -0000	1.29
+++ tcp_usrreq.c	16 Dec 2004 03:14:41 -0000
@@ -873,7 +873,6 @@
 	tp->t_state = TCPS_SYN_SENT;
 	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
 	tp->iss = tcp_new_isn(tp);
-	tp->t_bw_rtseq = tp->iss;
 	tcp_sendseqinit(tp);
 
 	/*
@@ -1036,7 +1035,6 @@
 	tp->t_state = TCPS_SYN_SENT;
 	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
 	tp->iss = tcp_new_isn(tp);
-	tp->t_bw_rtseq = tp->iss;
 	tcp_sendseqinit(tp);
 
 	/*


From: Dan Nelson <dnelson@allantgroup.com>
To: FreeBSD-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Cc:  
Subject: Re: kern/75122: [PATCH] Incorrect inflight bandwidth calculation on first packet
Date: Thu, 16 Dec 2004 13:06:08 -0600

 Updated patch including Matt's recommended fix:
 
 Index: tcp_hostcache.c
 ===================================================================
 RCS file: /home/ncvs/src/sys/netinet/tcp_hostcache.c,v
 retrieving revision 1.7
 diff -u -p -r1.7 tcp_hostcache.c
 --- tcp_hostcache.c	16 Aug 2004 18:32:07 -0000	1.7
 +++ tcp_hostcache.c	15 Dec 2004 17:42:02 -0000
 @@ -175,7 +175,7 @@ SYSCTL_INT(_net_inet_tcp_hostcache, OID_
       &tcp_hostcache.purgeall, 0, "Expire all entires on next purge run");
  
  SYSCTL_PROC(_net_inet_tcp_hostcache, OID_AUTO, list,
 -	CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_SKIP, 0, 0,
 +	CTLTYPE_STRING | CTLFLAG_RD, 0, 0,
  	sysctl_tcp_hc_list, "A", "List of all hostcache entries");
  
  
 @@ -676,7 +676,7 @@ sysctl_tcp_hc_list(SYSCTL_HANDLER_ARGS)
  				(RTM_RTTUNIT / (hz * TCP_RTT_SCALE))),
  			    msec(hc_entry->rmx_rttvar *
  				(RTM_RTTUNIT / (hz * TCP_RTT_SCALE))),
 -			    hc_entry->rmx_bandwidth * 8,
 +			    hc_entry->rmx_bandwidth,
  			    hc_entry->rmx_cwnd,
  			    hc_entry->rmx_sendpipe,
  			    hc_entry->rmx_recvpipe,
 Index: tcp_subr.c
 ===================================================================
 RCS file: /home/ncvs/src/sys/netinet/tcp_subr.c,v
 retrieving revision 1.201.2.4
 diff -u -p -r1.201.2.4 tcp_subr.c
 --- tcp_subr.c	1 Dec 2004 05:37:28 -0000	1.201.2.4
 +++ tcp_subr.c	16 Dec 2004 06:38:48 -0000
 @@ -631,7 +631,6 @@ tcp_newtcpcb(inp)
  	tp->snd_bwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
  	tp->snd_ssthresh = TCP_MAXWIN << TCP_MAX_WINSHIFT;
  	tp->t_rcvtime = ticks;
 -	tp->t_bw_rtttime = ticks;
  	/*
  	 * IPv4 TTL initialization is necessary for an IPv6 socket as well,
  	 * because the socket may be bound to an IPv6 wildcard address,
 @@ -1907,9 +1906,10 @@ tcp_twrespond(struct tcptw *tw, int flag
  void
  tcp_xmit_bandwidth_limit(struct tcpcb *tp, tcp_seq ack_seq)
  {
 -	u_long bw;
 +	u_long bw, avgbw;
  	u_long bwnd;
  	int save_ticks;
 +	int delta_ticks;
  
  	/*
  	 * If inflight_enable is disabled in the middle of a tcp connection,
 @@ -1922,29 +1922,42 @@ tcp_xmit_bandwidth_limit(struct tcpcb *t
  	}
  
  	/*
 +	  * Validate the delta time.  If a connection is new or has been idle
 +	  * a long time we have to reset the bandwidth calculator.
 +	  */
 +	save_ticks = ticks;
 +	delta_ticks = save_ticks - tp->t_bw_rtttime;
 +	if (tp->t_bw_rtttime == 0 || delta_ticks < 0 || delta_ticks > hz * 10) {
 +		tp->t_bw_rtttime = ticks;
 +		tp->t_bw_rtseq = ack_seq;
 +		if (tp->snd_bandwidth == 0)
 +			tp->snd_bandwidth = tcp_inflight_min;
 +		return;
 +	}
 +	if (delta_ticks == 0)
 +		return;
 +	
 +	/*
 +	 * Sanity check, plus ignore pure window update acks.
 +	 */
 +	if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
 +		return;
 +	
 +	/*
  	 * Figure out the bandwidth.  Due to the tick granularity this
  	 * is a very rough number and it MUST be averaged over a fairly
  	 * long period of time.  XXX we need to take into account a link
  	 * that is not using all available bandwidth, but for now our
  	 * slop will ramp us up if this case occurs and the bandwidth later
  	 * increases.
 -	 *
 -	 * Note: if ticks rollover 'bw' may wind up negative.  We must
 -	 * effectively reset t_bw_rtttime for this case.
  	 */
 -	save_ticks = ticks;
 -	if ((u_int)(save_ticks - tp->t_bw_rtttime) < 1)
 -		return;
  
 -	bw = (int64_t)(ack_seq - tp->t_bw_rtseq) * hz /
 -	    (save_ticks - tp->t_bw_rtttime);
 +	bw = (int64_t)(ack_seq - tp->t_bw_rtseq) * hz / delta_ticks;
 +
  	tp->t_bw_rtttime = save_ticks;
  	tp->t_bw_rtseq = ack_seq;
 -	if (tp->t_bw_rtttime == 0 || (int)bw < 0)
 -		return;
 -	bw = ((int64_t)tp->snd_bandwidth * 15 + bw) >> 4;
  
 -	tp->snd_bandwidth = bw;
 +	avgbw = ((int64_t)tp->snd_bandwidth * 15 + bw) >> 4;
  
  	/*
  	 * Calculate the semi-static bandwidth delay product, plus two maximal
 @@ -1975,22 +1988,25 @@ tcp_xmit_bandwidth_limit(struct tcpcb *t
  	 *	    no other choice.
  	 */
  #define USERTT	((tp->t_srtt + tp->t_rttbest) / 2)
 -	bwnd = (int64_t)bw * USERTT / (hz << TCP_RTT_SHIFT) + tcp_inflight_stab * tp->t_maxseg / 10;
 +	bwnd = (int64_t)avgbw * USERTT / (hz << TCP_RTT_SHIFT) + tcp_inflight_stab * tp->t_maxseg / 10;
  #undef USERTT
  
  	if (tcp_inflight_debug > 0) {
  		static int ltime;
  		if ((u_int)(ticks - ltime) >= hz / tcp_inflight_debug) {
  			ltime = ticks;
 -			printf("%p bw %ld rttbest %d srtt %d bwnd %ld\n",
 +			printf("%p bw %lu(%lu,%lu) rttbest %d srtt %d bwnd %ld\n",
  			    tp,
 -			    bw,
 +			    avgbw, tp->snd_bandwidth, bw,
  			    tp->t_rttbest,
  			    tp->t_srtt,
  			    bwnd
  			);
  		}
  	}
 +	
 +	tp->snd_bandwidth = avgbw;
 +	
  	if ((long)bwnd < tcp_inflight_min)
  		bwnd = tcp_inflight_min;
  	if (bwnd > tcp_inflight_max)
 Index: tcp_usrreq.c
 ===================================================================
 RCS file: /home/ncvs/src/sys/netinet/tcp_usrreq.c,v
 retrieving revision 1.107
 diff -u -p -r1.107 tcp_usrreq.c
 --- tcp_usrreq.c	16 Aug 2004 18:32:07 -0000	1.107
 +++ tcp_usrreq.c	16 Dec 2004 06:38:59 -0000
 @@ -864,7 +864,6 @@ tcp_connect(tp, nam, td)
  	tp->t_state = TCPS_SYN_SENT;
  	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
  	tp->iss = tcp_new_isn(tp);
 -	tp->t_bw_rtseq = tp->iss;
  	tcp_sendseqinit(tp);
  
  	/*
 @@ -959,7 +958,6 @@ tcp6_connect(tp, nam, td)
  	tp->t_state = TCPS_SYN_SENT;
  	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
  	tp->iss = tcp_new_isn(tp);
 -	tp->t_bw_rtseq = tp->iss;
  	tcp_sendseqinit(tp);
  
  	/*
 
 
 -- 
 	Dan Nelson
 	dnelson@allantgroup.com

From: Uwe Doering <gemini@geminix.org>
To: Dan Nelson <dnelson@allantgroup.com>
Cc: FreeBSD-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/75122: [PATCH] Incorrect inflight bandwidth calculation
 on	first packet
Date: Sat, 18 Dec 2004 13:09:14 +0100

 Dan Nelson wrote:
 > Updated patch including Matt's recommended fix:
 > 
 > [...]
 > diff -u -p -r1.201.2.4 tcp_subr.c
 > --- tcp_subr.c	1 Dec 2004 05:37:28 -0000	1.201.2.4
 > +++ tcp_subr.c	16 Dec 2004 06:38:48 -0000
 > [...]
 > @@ -1922,29 +1922,42 @@ tcp_xmit_bandwidth_limit(struct tcpcb *t
 >  	}
 >  
 >  	/*
 > +	  * Validate the delta time.  If a connection is new or has been idle
 > +	  * a long time we have to reset the bandwidth calculator.
 > +	  */
 > +	save_ticks = ticks;
 > +	delta_ticks = save_ticks - tp->t_bw_rtttime;
 > +	if (tp->t_bw_rtttime == 0 || delta_ticks < 0 || delta_ticks > hz * 10) {
 > +		tp->t_bw_rtttime = ticks;
 > +		tp->t_bw_rtseq = ack_seq;
 > +		if (tp->snd_bandwidth == 0)
 > +			tp->snd_bandwidth = tcp_inflight_min;
 > +		return;
 > +	}
 > +	if (delta_ticks == 0)
 > +		return;
 > +	
 > +	/*
 > +	 * Sanity check, plus ignore pure window update acks.
 > +	 */
 > +	if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
 > +		return;
 
 I wonder, isn't there a flaw in the logic with regard to the sequence 
 number handling?  If the sequence number wraps around 't_bw_rtseq' no 
 longer gets set and therefore the bandwidth calculation stops until 
 'ack_seq' either catches up with 't_bw_rtseq' again (which would take 
 quite a while), or 'ticks' wraps around as well, or there is inactivity 
 for more than 10 seconds.  This is probably not the intended behavior.
 
 I think the correct approach would be to separate the handling of the 
 two cases "sequence number wrap-around" and "window update acks".  I 
 suggest to define another variable 'delta_bytes' and initialize it like 
 this:
 
    int delta_bytes;
    [...]
    delta_bytes = ack_seq - tp->t_bw_rtseq;
 
 Then compare it with zero where you compare 'delta_ticks' with zero, and 
 check for less than zero (wrap-around) where you check 'delta_ticks' for 
 less than zero.  Combine the respective test results with '||'.  Of 
 course, you can then use 'delta_bytes' for the bandwidth calculation 
 further down in the code, too.
 
     Uwe
 -- 
 Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
 gemini@geminix.org  |  http://www.escapebox.net

From: Dan Nelson <dnelson@allantgroup.com>
To: Uwe Doering <gemini@geminix.org>
Cc: FreeBSD-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/75122: [PATCH] Incorrect inflight bandwidth calculation on	first packet
Date: Tue, 21 Dec 2004 00:55:31 -0600

 In the last episode (Dec 18), Uwe Doering said:
 > Dan Nelson wrote:
 > >Updated patch including Matt's recommended fix:
 > >
 > >+	/*
 > >+	 * Sanity check, plus ignore pure window update acks.
 > >+	 */
 > >+	if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
 > >+		return;
 > 
 > I wonder, isn't there a flaw in the logic with regard to the sequence
 > number handling?  If the sequence number wraps around 't_bw_rtseq' no
 > longer gets set and therefore the bandwidth calculation stops until
 > 'ack_seq' either catches up with 't_bw_rtseq' again (which would take
 > quite a while), or 'ticks' wraps around as well, or there is
 > inactivity for more than 10 seconds.  This is probably not the
 > intended behavior.
 
 I think the code works as-is.  ack_seq and tp->t_bw_rtseq are both of
 type "tcp_seq" which is a u_int32_t.  Wrap-around is handled
 transparently when your variables are unsigned and your sequence space
 covers all possible values.  It's the magic of mod(2^32) arithmetic :)
 The (int) cast just makes the if simpler.  Without the cast it would
 read
 
  if (ack_seq - tp->t_bw_rtseq > 2147483648U || ack_seq == tp->t_bw_rtseq)
 
 The sanity check is probably not even necessary, as any really invalid
 sequence numbers would have caused the packet to be dropped before it
 got this far.
 
 
 -- 
 	Dan Nelson
 	dnelson@allantgroup.com

From: Uwe Doering <gemini@geminix.org>
To: Dan Nelson <dnelson@allantgroup.com>
Cc: FreeBSD-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/75122: [PATCH] Incorrect inflight bandwidth calculation
 on	first packet
Date: Tue, 21 Dec 2004 10:50:14 +0100

 Dan Nelson wrote:
 > In the last episode (Dec 18), Uwe Doering said:
 >>Dan Nelson wrote:
 >>
 >>>Updated patch including Matt's recommended fix:
 >>>
 >>>+	/*
 >>>+	 * Sanity check, plus ignore pure window update acks.
 >>>+	 */
 >>>+	if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
 >>>+		return;
 >>
 >>I wonder, isn't there a flaw in the logic with regard to the sequence
 >>number handling?  If the sequence number wraps around 't_bw_rtseq' no
 >>longer gets set and therefore the bandwidth calculation stops until
 >>'ack_seq' either catches up with 't_bw_rtseq' again (which would take
 >>quite a while), or 'ticks' wraps around as well, or there is
 >>inactivity for more than 10 seconds.  This is probably not the
 >>intended behavior.
 > 
 > I think the code works as-is.  ack_seq and tp->t_bw_rtseq are both of
 > type "tcp_seq" which is a u_int32_t.  Wrap-around is handled
 > transparently when your variables are unsigned and your sequence space
 > covers all possible values.  It's the magic of mod(2^32) arithmetic :)
 > The (int) cast just makes the if simpler.  Without the cast it would
 > read
 > 
 >  if (ack_seq - tp->t_bw_rtseq > 2147483648U || ack_seq == tp->t_bw_rtseq)
 > 
 > The sanity check is probably not even necessary, as any really invalid
 > sequence numbers would have caused the packet to be dropped before it
 > got this far.
 
 You are correct.  Unlike 'ticks', which is of type 'int', a sequence 
 number wrap-around has no effect on the calculation of the byte count 
 due to using 'u_int32_t'.  Tricky. ;-)
 
 Therefore, I propose testing for equality only (window update acks). 
 Checking for byte counts of more than 2147483648U and declaring that 
 bogus is a little arbitrary.  If we are concerned about invalid sequence 
 numbers, they can result in a lower byte count as well.  The way it is 
 now (with the patch installed) it looks too much like a wrap-around 
 check and is therefore misleading.  So I suggest to either get rid of it 
 or add a comment that explains the situation.
 
     Uwe
 -- 
 Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
 gemini@geminix.org  |  http://www.escapebox.net

From: Uwe Doering <gemini@geminix.org>
To: Dan Nelson <dnelson@allantgroup.com>
Cc: FreeBSD-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/75122: [PATCH] Incorrect inflight bandwidth calculation
 on	first packet
Date: Tue, 21 Dec 2004 12:20:28 +0100

 Uwe Doering wrote:
 > Dan Nelson wrote:
 >> In the last episode (Dec 18), Uwe Doering said:
 >>> Dan Nelson wrote:
 >>>
 >>>> Updated patch including Matt's recommended fix:
 >>>>
 >>>> +    /*
 >>>> +     * Sanity check, plus ignore pure window update acks.
 >>>> +     */
 >>>> +    if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
 >>>> +        return;
 >>>
 >>> I wonder, isn't there a flaw in the logic with regard to the sequence
 >>> number handling?  If the sequence number wraps around 't_bw_rtseq' no
 >>> longer gets set and therefore the bandwidth calculation stops until
 >>> 'ack_seq' either catches up with 't_bw_rtseq' again (which would take
 >>> quite a while), or 'ticks' wraps around as well, or there is
 >>> inactivity for more than 10 seconds.  This is probably not the
 >>> intended behavior.
 >>
 >> I think the code works as-is.  ack_seq and tp->t_bw_rtseq are both of
 >> type "tcp_seq" which is a u_int32_t.  Wrap-around is handled
 >> transparently when your variables are unsigned and your sequence space
 >> covers all possible values.  It's the magic of mod(2^32) arithmetic :)
 >> The (int) cast just makes the if simpler.  Without the cast it would
 >> read
 >>
 >>  if (ack_seq - tp->t_bw_rtseq > 2147483648U || ack_seq == tp->t_bw_rtseq)
 >>
 >> The sanity check is probably not even necessary, as any really invalid
 >> sequence numbers would have caused the packet to be dropped before it
 >> got this far.
 > 
 > You are correct.  Unlike 'ticks', which is of type 'int', a sequence 
 > number wrap-around has no effect on the calculation of the byte count 
 > due to using 'u_int32_t'.  Tricky. ;-)
 > 
 > Therefore, I propose testing for equality only (window update acks). 
 > Checking for byte counts of more than 2147483648U and declaring that 
 > bogus is a little arbitrary.  If we are concerned about invalid sequence 
 > numbers, they can result in a lower byte count as well.  The way it is 
 > now (with the patch installed) it looks too much like a wrap-around 
 > check and is therefore misleading.  So I suggest to either get rid of it 
 > or add a comment that explains the situation.
 
 On second thought, checking for less than zero (by means of the 'int' 
 cast) might have its merits.  This may be a protection against 
 out-of-order ACKs, which could in fact be valid, but calculating a byte 
 count from an ACK for an earlier packet when we've already processed a 
 later ACK would be bogus.
 
 Hard to tell what Matt's intentions were.  In the end the patch appears 
 to be okay as it is.  Nevertheless, my suggestion to comment this 
 sufficiently still stands.
 
     Uwe
 -- 
 Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
 gemini@geminix.org  |  http://www.escapebox.net

From: Dan Nelson <dnelson@allantgroup.com>
To: Uwe Doering <gemini@geminix.org>
Cc: FreeBSD-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/75122: [PATCH] Incorrect inflight bandwidth calculation on	first packet
Date: Wed, 22 Dec 2004 16:24:52 -0600

 In the last episode (Dec 21), Uwe Doering said:
 > Uwe Doering wrote:
 > >Dan Nelson wrote:
 > >>In the last episode (Dec 18), Uwe Doering said:
 > >>>Dan Nelson wrote:
 > >>>
 > >>>>Updated patch including Matt's recommended fix:
 > >>>>
 > >>>>+    /*
 > >>>>+     * Sanity check, plus ignore pure window update acks.
 > >>>>+     */
 > >>>>+    if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
 > >>>>+        return;
 >
 > On second thought, checking for less than zero (by means of the 'int' 
 > cast) might have its merits.  This may be a protection against 
 > out-of-order ACKs, which could in fact be valid, but calculating a byte 
 > count from an ACK for an earlier packet when we've already processed a 
 > later ACK would be bogus.
 
 I am pretty sure only useful acks get to this point.  I've changed that
 bit of code to just be
 
 +       /*
 +        * Ignore pure window update acks.
 +        */
 +       if (ack_seq == tp->t_bw_rtseq)
 +               return;
 
 and added another check for negative sequences that increments a sysctl
 counter.  So far netstat -s has counted 567 out-of-order packets but my
 counter is still at 0.
 
 -- 
 	Dan Nelson
 	dnelson@allantgroup.com

From: Dan Nelson <dnelson@allantgroup.com>
To: Uwe Doering <gemini@geminix.org>
Cc: FreeBSD-gnats-submit@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/75122: [PATCH] Incorrect inflight bandwidth calculation on	first packet
Date: Tue, 8 Feb 2005 01:15:22 -0600

 In the last episode (Dec 22), Dan Nelson said:
 > I am pretty sure only useful acks get to this point.  I've changed that
 > bit of code to just be
 > 
 > +       /*
 > +        * Ignore pure window update acks.
 > +        */
 > +       if (ack_seq == tp->t_bw_rtseq)
 > +               return;
 > 
 > and added another check for negative sequences that increments a sysctl
 > counter.  So far netstat -s has counted 567 out-of-order packets but my
 > counter is still at 0.
 
 And after 40 days of uptime on my mail/squid servers, the counter is
 still at zero, so the code works perfectly for me.  It'd be nice if
 this was committed before 5.4.  Final patch follows:
 
 Index: tcp_hostcache.c
 ===================================================================
 RCS file: /home/ncvs/src/sys/netinet/tcp_hostcache.c,v
 retrieving revision 1.7.2.1
 diff -u -p -r1.7.2.1 tcp_hostcache.c
 --- tcp_hostcache.c	31 Jan 2005 23:26:36 -0000	1.7.2.1
 +++ tcp_hostcache.c	8 Feb 2005 07:09:31 -0000
 @@ -175,7 +175,7 @@ SYSCTL_INT(_net_inet_tcp_hostcache, OID_
       &tcp_hostcache.purgeall, 0, "Expire all entires on next purge run");
  
  SYSCTL_PROC(_net_inet_tcp_hostcache, OID_AUTO, list,
 -	CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_SKIP, 0, 0,
 +	CTLTYPE_STRING | CTLFLAG_RD, 0, 0,
  	sysctl_tcp_hc_list, "A", "List of all hostcache entries");
  
  
 @@ -676,7 +676,7 @@ sysctl_tcp_hc_list(SYSCTL_HANDLER_ARGS)
  				(RTM_RTTUNIT / (hz * TCP_RTT_SCALE))),
  			    msec(hc_entry->rmx_rttvar *
  				(RTM_RTTUNIT / (hz * TCP_RTT_SCALE))),
 -			    hc_entry->rmx_bandwidth * 8,
 +			    hc_entry->rmx_bandwidth,
  			    hc_entry->rmx_cwnd,
  			    hc_entry->rmx_sendpipe,
  			    hc_entry->rmx_recvpipe,
 Index: tcp_subr.c
 ===================================================================
 RCS file: /home/ncvs/src/sys/netinet/tcp_subr.c,v
 retrieving revision 1.201.2.12
 diff -u -p -r1.201.2.12 tcp_subr.c
 --- tcp_subr.c	31 Jan 2005 23:26:36 -0000	1.201.2.12
 +++ tcp_subr.c	8 Feb 2005 07:09:33 -0000
 @@ -628,7 +632,6 @@ tcp_newtcpcb(inp)
  	tp->snd_bwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
  	tp->snd_ssthresh = TCP_MAXWIN << TCP_MAX_WINSHIFT;
  	tp->t_rcvtime = ticks;
 -	tp->t_bw_rtttime = ticks;
  	/*
  	 * IPv4 TTL initialization is necessary for an IPv6 socket as well,
  	 * because the socket may be bound to an IPv6 wildcard address,
 @@ -1934,9 +1937,10 @@ tcp_twrespond(struct tcptw *tw, int flag
  void
  tcp_xmit_bandwidth_limit(struct tcpcb *tp, tcp_seq ack_seq)
  {
 -	u_long bw;
 +	u_long bw, avgbw;
  	u_long bwnd;
  	int save_ticks;
 +	int delta_ticks;
  
  	INP_LOCK_ASSERT(tp->t_inpcb);
  
 @@ -1951,29 +1955,50 @@ tcp_xmit_bandwidth_limit(struct tcpcb *t
  	}
  
  	/*
 +	  * Validate the delta time.  If a connection is new or has been idle
 +	  * a long time we have to reset the bandwidth calculator.
 +	  */
 +	save_ticks = ticks;
 +	delta_ticks = save_ticks - tp->t_bw_rtttime;
 +	if (tp->t_bw_rtttime == 0 || delta_ticks < 0 || delta_ticks > hz * 10) {
 +		tp->t_bw_rtttime = ticks;
 +		tp->t_bw_rtseq = ack_seq;
 +		if (tp->snd_bandwidth == 0)
 +			tp->snd_bandwidth = tcp_inflight_min;
 +		return;
 +	}
 +	if (delta_ticks == 0)
 +		return;
 +	
 +	/*
 +	 * Ignore pure window update acks.
 +	 */
 +	if (ack_seq == tp->t_bw_rtseq)
 +		return;
 +
 +	/* Sanity check */
 +	if (SEQ_LT(ack_seq, tp->t_bw_rtseq)) {
 +		if (tcp_inflight_debug > 0) {
 +			printf ("%p invalid sequence: %u vs %u",
 +				tp, ack_seq, tp->t_bw_rtseq);
 +		}
 +	}
 +	
 +	/*
  	 * Figure out the bandwidth.  Due to the tick granularity this
  	 * is a very rough number and it MUST be averaged over a fairly
  	 * long period of time.  XXX we need to take into account a link
  	 * that is not using all available bandwidth, but for now our
  	 * slop will ramp us up if this case occurs and the bandwidth later
  	 * increases.
 -	 *
 -	 * Note: if ticks rollover 'bw' may wind up negative.  We must
 -	 * effectively reset t_bw_rtttime for this case.
  	 */
 -	save_ticks = ticks;
 -	if ((u_int)(save_ticks - tp->t_bw_rtttime) < 1)
 -		return;
  
 -	bw = (int64_t)(ack_seq - tp->t_bw_rtseq) * hz /
 -	    (save_ticks - tp->t_bw_rtttime);
 +	bw = (int64_t)(ack_seq - tp->t_bw_rtseq) * hz / delta_ticks;
 +
  	tp->t_bw_rtttime = save_ticks;
  	tp->t_bw_rtseq = ack_seq;
 -	if (tp->t_bw_rtttime == 0 || (int)bw < 0)
 -		return;
 -	bw = ((int64_t)tp->snd_bandwidth * 15 + bw) >> 4;
  
 -	tp->snd_bandwidth = bw;
 +	avgbw = ((int64_t)tp->snd_bandwidth * 15 + bw) >> 4;
  
  	/*
  	 * Calculate the semi-static bandwidth delay product, plus two maximal
 @@ -2004,22 +2030,25 @@ tcp_xmit_bandwidth_limit(struct tcpcb *t
  	 *	    no other choice.
  	 */
  #define USERTT	((tp->t_srtt + tp->t_rttbest) / 2)
 -	bwnd = (int64_t)bw * USERTT / (hz << TCP_RTT_SHIFT) + tcp_inflight_stab * tp->t_maxseg / 10;
 +	bwnd = (int64_t)avgbw * USERTT / (hz << TCP_RTT_SHIFT) + tcp_inflight_stab * tp->t_maxseg / 10;
  #undef USERTT
  
  	if (tcp_inflight_debug > 0) {
  		static int ltime;
  		if ((u_int)(ticks - ltime) >= hz / tcp_inflight_debug) {
  			ltime = ticks;
 -			printf("%p bw %ld rttbest %d srtt %d bwnd %ld\n",
 +			printf("%p bw %lu(%lu,%lu) rttbest %d srtt %d bwnd %ld\n",
  			    tp,
 -			    bw,
 +			    avgbw, tp->snd_bandwidth, bw,
  			    tp->t_rttbest,
  			    tp->t_srtt,
  			    bwnd
  			);
  		}
  	}
 +	
 +	tp->snd_bandwidth = avgbw;
 +	
  	if ((long)bwnd < tcp_inflight_min)
  		bwnd = tcp_inflight_min;
  	if (bwnd > tcp_inflight_max)
 Index: tcp_usrreq.c
 ===================================================================
 RCS file: /home/ncvs/src/sys/netinet/tcp_usrreq.c,v
 retrieving revision 1.107.2.1
 diff -u -p -r1.107.2.1 tcp_usrreq.c
 --- tcp_usrreq.c	31 Jan 2005 23:26:37 -0000	1.107.2.1
 +++ tcp_usrreq.c	8 Feb 2005 07:09:34 -0000
 @@ -864,7 +864,6 @@ tcp_connect(tp, nam, td)
  	tp->t_state = TCPS_SYN_SENT;
  	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
  	tp->iss = tcp_new_isn(tp);
 -	tp->t_bw_rtseq = tp->iss;
  	tcp_sendseqinit(tp);
  
  	/*
 @@ -959,7 +958,6 @@ tcp6_connect(tp, nam, td)
  	tp->t_state = TCPS_SYN_SENT;
  	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
  	tp->iss = tcp_new_isn(tp);
 -	tp->t_bw_rtseq = tp->iss;
  	tcp_sendseqinit(tp);
  
  	/*
 
 
 -- 
 	Dan Nelson
 	dnelson@allantgroup.com

From: Dan Nelson <dnelson@allantgroup.com>
To: Uwe Doering <gemini@geminix.org>
Cc: freebsd-bugs@FreeBSD.org
Subject: Re: kern/75122: [PATCH] Incorrect inflight bandwidth calculation on
	first packet
Date: Wed, 22 Dec 2004 16:24:52 -0600

 In the last episode (Dec 21), Uwe Doering said:
 > Uwe Doering wrote:
 > >Dan Nelson wrote:
 > >>In the last episode (Dec 18), Uwe Doering said:
 > >>>Dan Nelson wrote:
 > >>>
 > >>>>Updated patch including Matt's recommended fix:
 > >>>>
 > >>>>+    /*
 > >>>>+     * Sanity check, plus ignore pure window update acks.
 > >>>>+     */
 > >>>>+    if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
 > >>>>+        return;
 >
 > On second thought, checking for less than zero (by means of the 'int' 
 > cast) might have its merits.  This may be a protection against 
 > out-of-order ACKs, which could in fact be valid, but calculating a byte 
 > count from an ACK for an earlier packet when we've already processed a 
 > later ACK would be bogus.
 
 I am pretty sure only useful acks get to this point.  I've changed that
 bit of code to just be
 
 +       /*
 +        * Ignore pure window update acks.
 +        */
 +       if (ack_seq == tp->t_bw_rtseq)
 +               return;
 
 and added another check for negative sequences that increments a sysctl
 counter.  So far netstat -s has counted 567 out-of-order packets but my
 counter is still at 0.
 
 -- 
 	Dan Nelson
 	dnelson@allantgroup.com
 _______________________________________________
 freebsd-bugs@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-bugs
 To unsubscribe, send any mail to "freebsd-bugs-unsubscribe@freebsd.org"
 
Responsible-Changed-From-To: freebsd-bugs->silby 
Responsible-Changed-By: rwatson 
Responsible-Changed-When: Tue Jul 12 14:54:21 GMT 2005 
Responsible-Changed-Why:  
Assign ownership to silby since he's been spending quality time with TCP 
lately. 


http://www.freebsd.org/cgi/query-pr.cgi?pr=75122 
Responsible-Changed-From-To: silby->andre 
Responsible-Changed-By: andre 
Responsible-Changed-When: Tue Jan 24 16:39:33 UTC 2006 
Responsible-Changed-Why:  
take over. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=75122 

From: Nick Johnson <freebsd@spatula.net>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/75122: [netinet] [patch] Incorrect inflight bandwidth
 calculation on first packet
Date: Tue, 3 Apr 2007 17:22:27 -0700 (PDT)

 I wonder if this could be related to why my 6.2-STABLE machine reports 
 completely unreasonable numbers for BANDWIDTH, which often exceed the 
 speed of the uplink on my machine, sometimes by an order of magnitude.
 
 It looks like the patch still will apply cleanly to 6.2.  Once I can get a 
 console on my box, I'll give it a spin.
Responsible-Changed-From-To: andre->silby 
Responsible-Changed-By: kmacy 
Responsible-Changed-When: Thu Nov 15 23:19:00 UTC 2007 
Responsible-Changed-Why:  

silby is currently the most active... 
silby - if you don't want to deal with it (or others) assign it  
to me rather than let it languish 

http://www.freebsd.org/cgi/query-pr.cgi?pr=75122 
>Unformatted:
