From nobody@FreeBSD.org  Fri Mar 20 01:44:08 2009
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4A9C8106566B
	for <freebsd-gnats-submit@FreeBSD.org>; Fri, 20 Mar 2009 01:44:08 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id 34C9D8FC17
	for <freebsd-gnats-submit@FreeBSD.org>; Fri, 20 Mar 2009 01:44:08 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.3/8.14.3) with ESMTP id n2K1i7FG017114
	for <freebsd-gnats-submit@FreeBSD.org>; Fri, 20 Mar 2009 01:44:07 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.3/8.14.3/Submit) id n2K1i7QY017113;
	Fri, 20 Mar 2009 01:44:07 GMT
	(envelope-from nobody)
Message-Id: <200903200144.n2K1i7QY017113@www.freebsd.org>
Date: Fri, 20 Mar 2009 01:44:07 GMT
From: Renaud Lienhart <renaud@vmware.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: tcp_output() might generate invalid TSO frames when len > TCP_MAXWIN - hdrlen - optlen
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         132832
>Category:       kern
>Synopsis:       [netinet] [patch] tcp_output() might generate invalid TSO frames when len > TCP_MAXWIN - hdrlen - optlen
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    andre
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Mar 20 01:50:01 UTC 2009
>Closed-Date:    Mon Aug 23 14:26:16 UTC 2010
>Last-Modified:  Mon Aug 23 14:26:16 UTC 2010
>Originator:     Renaud Lienhart
>Release:        FreeBSD 7.1
>Organization:
VMware, Inc.
>Environment:
>Description:
The tcp_output() routine has an issue when the send window exceeds TCP_MAXWIN and the underlying interface supports TSO.

A check ensures that the data being pushed does not exceed the maximum TSO packet size. If it would, "len" is trimmed and "sendalot" is set:

--- 8< ---
if (tso) {
	if (len > TCP_MAXWIN - hdrlen - optlen) {
		len = TCP_MAXWIN - hdrlen - optlen;
		len = len - (len % (tp->t_maxopd - optlen));
		sendalot = 1;
--- >8 ---
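
As a worked illustration of the trimming arithmetic in the excerpt above (a standalone sketch, not kernel code): the helper name tso_trim_len() and the sample values are assumptions made for this example. Only TCP_MAXWIN (65535) and the two assignments to "len" come from the excerpt.

```c
#include <assert.h>

#define TCP_MAXWIN 65535	/* largest unscaled TCP window */

/*
 * Hypothetical helper mirroring the two assignments in the excerpt:
 * cap len at TCP_MAXWIN - hdrlen - optlen, then round it down to a
 * whole number of segments of (maxopd - optlen) bytes.
 * A caller would set "sendalot" whenever the result is smaller than len.
 */
static long
tso_trim_len(long len, long hdrlen, long optlen, long maxopd)
{
	long limit = TCP_MAXWIN - hdrlen - optlen;
	long segsz = maxopd - optlen;

	assert(segsz > 0 && segsz <= limit);
	if (len > limit) {
		len = limit;
		len = len - (len % segsz);
	}
	return (len);
}
```

With the sample values hdrlen = 52, optlen = 12 and t_maxopd = 1460, the cap is 65535 - 52 - 12 = 65471 bytes. That rounds down to 45 segments of 1448 bytes, i.e. 65160 bytes, and whatever is left over is handled by the next "sendalot" pass.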

Everything is fine until the next "sendalot" iteration.

If the remaining window does not require TSO (i.e. len <= tp->t_maxseg), the following piece of code fails to reset "tso":

--- 8< ---
if (len > tp->t_maxseg) {
	if (<tso_conds>) {
		tso = 1;
	} else {
		len = tp->t_maxseg;
		sendalot = 1;
		tso = 0;
	}
}
--- >8 ---

"tso" remains set to 1 and the resulting packet is tagged with CSUM_TSO although it doesn't require TSO. This causes problems with many NICs, which refuse to send the resulting frame and, in some cases, wedge.
>How-To-Repeat:
Using netperf (or any TCP workload) with a large socket buffer size exposes the issue in a matter of seconds.
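
For a workload other than netperf, a large socket buffer can be requested with the standard SO_SNDBUF socket option. The following is a minimal sketch only: the helper name bulk_sender_socket() is an assumption made for this example, and the caller is still expected to connect() the socket and write a large amount of data to it.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a TCP socket and request a send buffer
 * of sndbuf bytes. Returns the descriptor, or -1 on failure.
 * The system may clamp or reject very large requests.
 */
static int
bulk_sender_socket(int sndbuf)
{
	int fd;

	assert(sndbuf > 0);
	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return (-1);
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf,
	    sizeof(sndbuf)) != 0) {
		close(fd);
		return (-1);
	}
	return (fd);
}
```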
>Fix:
The solution is to always reset "tso" at the beginning of the "sendalot" loop in order to ensure it is not stale.

In my patch I also added a KASSERT() which directly catches any problematic frame before it reaches any other layer.
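
The effect of the reset can be seen in a toy model of the "sendalot" loop. This is a sketch under simplifying assumptions, not the kernel code: count_bad_tso_frames() and its parameters are hypothetical, the TSO conditions are assumed to always hold, and t_maxseg and the TSO segment size are collapsed into a single "segsz".

```c
#include <assert.h>
#include <stdbool.h>

#define TCP_MAXWIN 65535	/* largest unscaled TCP window */

/*
 * Toy model of the loop. Returns how many frames were marked for TSO
 * although they fit in a single segment, which is the condition the
 * KASSERT() in the patch is meant to catch.
 */
static int
count_bad_tso_frames(long remaining, long segsz, long hdrlen, long optlen,
    bool reset_each_pass)
{
	long limit = TCP_MAXWIN - hdrlen - optlen;
	int tso = 0;		/* pre-fix: initialized once, outside the loop */
	int bad = 0;

	assert(segsz > 0 && segsz <= limit);
	while (remaining > 0) {
		long len = remaining;

		if (reset_each_pass)
			tso = 0;	/* the proposed fix */
		if (len > segsz)
			tso = 1;	/* TSO conditions assumed to hold */
		if (tso && len > limit) {
			len = limit;
			len -= len % segsz;
		}
		if (tso && len <= segsz)
			bad++;		/* stale flag on a small frame */
		remaining -= len;
	}
	return (bad);
}
```

With 66000 bytes to send, segsz = 1448, hdrlen = 52 and optlen = 12, the first pass sends 65160 bytes with TSO and leaves 840 bytes. Without the reset, the second pass still carries tso = 1 for an 840-byte frame. With the reset, no such frame is produced.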

Patch attached with submission follows:

Index: netinet/tcp_output.c
===================================================================
--- netinet/tcp_output.c	(revision 190117)
+++ netinet/tcp_output.c	(working copy)
@@ -140,7 +140,7 @@
 	int idle, sendalot;
 	int sack_rxmit, sack_bytes_rxmt;
 	struct sackhole *p;
-	int tso = 0;
+	int tso;
 	struct tcpopt to;
 #if 0
 	int maxburst = TCP_MAXBURST;
@@ -198,6 +198,7 @@
 	    SEQ_LT(tp->snd_nxt, tp->snd_max))
 		tcp_sack_adjust(tp);
 	sendalot = 0;
+	tso = 0;
 	off = tp->snd_nxt - tp->snd_una;
 	sendwin = min(tp->snd_wnd, tp->snd_cwnd);
 	sendwin = min(sendwin, tp->snd_bwnd);
@@ -477,7 +478,6 @@
 		} else {
 			len = tp->t_maxseg;
 			sendalot = 1;
-			tso = 0;
 		}
 	}
 	if (sack_rxmit) {
@@ -996,6 +996,8 @@
 	 * XXX: Fixme: This is currently not the case for IPv6.
 	 */
 	if (tso) {
+		KASSERT(len > tp->t_maxopd - optlen,
+		    ("%s: len <= tso_segsz", __func__));
 		m->m_pkthdr.csum_flags = CSUM_TSO;
 		m->m_pkthdr.tso_segsz = tp->t_maxopd - optlen;
 	}


>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->freebsd-net 
Responsible-Changed-By: gavin 
Responsible-Changed-When: Thu Apr 16 08:19:28 UTC 2009 
Responsible-Changed-Why:  
Over to maintainer(s).  This may be the cause of some of the other 
TSO issues that have been spotted recently. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=132832 
Responsible-Changed-From-To: freebsd-net->andre 
Responsible-Changed-By: andre 
Responsible-Changed-When: Tue Aug 10 22:22:47 UTC 2010 
Responsible-Changed-Why:  
Take over. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=132832 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/132832: commit references a PR
Date: Sat, 14 Aug 2010 21:41:46 +0000 (UTC)

 Author: andre
 Date: Sat Aug 14 21:41:33 2010
 New Revision: 211317
 URL: http://svn.freebsd.org/changeset/base/211317
 
 Log:
   When using TSO and sending more than TCP_MAXWIN sendalot is set
   and we loop back to 'again'.  If the remainder is less or equal
   to one full segment, the TSO flag was not cleared even though
   it isn't necessary anymore.  Enabling the TSO flag on a segment
   that doesn't require any offloaded segmentation by the NIC may
   cause confusion in the driver or hardware.
   
   Reset the internal tso flag in tcp_output() on every iteration
   of sendalot.
   
   PR:		kern/132832
   Submitted by:	Renaud Lienhart <renaud-at-vmware com>
   MFC after:	1 week
 
 Modified:
   head/sys/netinet/tcp_output.c
 
 Modified: head/sys/netinet/tcp_output.c
 ==============================================================================
 --- head/sys/netinet/tcp_output.c	Sat Aug 14 21:04:27 2010	(r211316)
 +++ head/sys/netinet/tcp_output.c	Sat Aug 14 21:41:33 2010	(r211317)
 @@ -153,7 +153,7 @@ tcp_output(struct tcpcb *tp)
  	int idle, sendalot;
  	int sack_rxmit, sack_bytes_rxmt;
  	struct sackhole *p;
 -	int tso = 0;
 +	int tso;
  	struct tcpopt to;
  #if 0
  	int maxburst = TCP_MAXBURST;
 @@ -211,6 +211,7 @@ again:
  	    SEQ_LT(tp->snd_nxt, tp->snd_max))
  		tcp_sack_adjust(tp);
  	sendalot = 0;
 +	tso = 0;
  	off = tp->snd_nxt - tp->snd_una;
  	sendwin = min(tp->snd_wnd, tp->snd_cwnd);
  	sendwin = min(sendwin, tp->snd_bwnd);
 @@ -490,9 +491,9 @@ after_sack_rexmit:
  		} else {
  			len = tp->t_maxseg;
  			sendalot = 1;
 -			tso = 0;
  		}
  	}
 +
  	if (sack_rxmit) {
  		if (SEQ_LT(p->rxmit + len, tp->snd_una + so->so_snd.sb_cc))
  			flags &= ~TH_FIN;
 @@ -1051,6 +1052,8 @@ send:
  	 * XXX: Fixme: This is currently not the case for IPv6.
  	 */
  	if (tso) {
 +		KASSERT(len > tp->t_maxopd - optlen,
 +		    ("%s: len <= tso_segsz", __func__));
  		m->m_pkthdr.csum_flags |= CSUM_TSO;
  		m->m_pkthdr.tso_segsz = tp->t_maxopd - optlen;
  	}
 
State-Changed-From-To: open->patched 
State-Changed-By: andre 
State-Changed-When: Sat Aug 14 22:33:56 UTC 2010 
State-Changed-Why:  

http://www.freebsd.org/cgi/query-pr.cgi?pr=132832 
State-Changed-From-To: patched->closed 
State-Changed-By: andre 
State-Changed-When: Mon Aug 23 14:26:00 UTC 2010 
State-Changed-Why:  
All MFC's done. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=132832 
>Unformatted:
