From nobody@nwww.freebsd.org  Sun May  5 14:04:34 2002
Return-Path: <nobody@nwww.freebsd.org>
Received: from nwww.freebsd.org (nwww.FreeBSD.org [216.136.204.117])
	by hub.freebsd.org (Postfix) with ESMTP id 7BDFC37B406
	for <freebsd-gnats-submit@FreeBSD.org>; Sun,  5 May 2002 14:04:31 -0700 (PDT)
Received: from nwww.freebsd.org (localhost [127.0.0.1])
	by nwww.freebsd.org (8.12.2/8.12.2) with ESMTP id g45L5qhG081139
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 5 May 2002 14:05:52 -0700 (PDT)
	(envelope-from nobody@nwww.freebsd.org)
Received: (from nobody@localhost)
	by nwww.freebsd.org (8.12.2/8.12.2/Submit) id g45L5qxP081138;
	Sun, 5 May 2002 14:05:52 -0700 (PDT)
Message-Id: <200205052105.g45L5qxP081138@nwww.freebsd.org>
Date: Sun, 5 May 2002 14:05:52 -0700 (PDT)
From: Brett Glass <brett@lariat.org>
To: freebsd-gnats-submit@FreeBSD.org
Subject: A single lost packet on a (userland) PPP connection causes long-term disruption and a "storm" of requests and acknowledgements on the CCP layer
X-Send-Pr-Version: www-1.0

>Number:         37777
>Category:       bin
>Synopsis:       A single lost packet on a (userland) PPP connection causes long-term disruption and a "storm" of requests and acknowledgements on the CCP layer
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    brian
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun May 05 14:10:02 PDT 2002
>Closed-Date:    Tue May 07 02:29:54 PDT 2002
>Last-Modified:  Tue May 14 19:10:02 PDT 2002
>Originator:     Brett Glass
>Release:        4.5-RELEASE-P4
>Organization:
LARIAT
>Environment:
All machines tested are using 4.5-RELEASE-P4.
>Description:
We are using PPPoE over a wireless connection which very occasionally
drops packets -- typically a few an hour. However, whenever a dropped
packet occurs, the link (which is using "deflate" compression) stalls
for long periods (sometimes seconds, sometimes tens of minutes). Turning
on logging of CCP (set log +ccp) reveals that, instead of resetting the
compression dictionary and gracefully continuing, the two nodes are sending a "storm" of redundant refresh requests and cascading errors. The problem does not seem to be limited to PPPoE or wireless communications, but rather crops up more obviously in these situations because the underlying transport is not reliable. (Modern modems virtually always correct errors unless the connection is dropped completely.) The same problem could crop up on a congested hardwired Ethernet. PPP does not assume a reliable transport, and so should be sufficiently resilient to deal with occasional lost packets.
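For reference, the logging and compression setup described above can be expressed in ppp.conf; this is a minimal sketch (the label and exact option list are illustrative, not our production configuration), using the ppp(8) commands named above:

```
default:
 set log Phase LCP IPCP CCP   # "set log +ccp" at the ppp> prompt also works
 enable deflate               # the algorithm that exhibits the stalls
 disable pred1
 deny pred1
```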

When a packet is dropped and CCP logging is enabled, the PPP log shows a long stream of messages such as the following. (Note that the first indicates the dropped packet.) The cascade of errors can last for a long time. The glitch from which the messages below are an excerpt continued for more than 10 minutes. (Apologies for the long log excerpt, but it helps to show the severity and longevity of the outage and the thrashing which is occurring.)

May  5 13:19:42 <daemon.info> workhorse ppp[1067]: CCP: DeflateInput: Seq error: Got 397, expected 396
May  5 13:19:42 <daemon.info> workhorse ppp[1067]: CCP: deflink: SendResetReq(5) state = Opened
May  5 13:19:42 <daemon.info> workhorse ppp[1067]: CCP: deflink: SendResetReq(5) state = Opened
May  5 13:19:42 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetAck(5) state = Opened
May  5 13:19:42 <daemon.info> workhorse ppp[1067]: CCP: Deflate: Input channel reset
May  5 13:19:42 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetAck(5) state = Opened
May  5 13:19:42 <daemon.info> workhorse ppp[1067]: CCP: deflink: Duplicate ResetAck (resetting again)
May  5 13:19:42 <daemon.info> workhorse ppp[1067]: CCP: Deflate: Input channel reset
May  5 13:20:05 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetReq(187) state = Opened
May  5 13:20:05 <daemon.info> workhorse ppp[1067]: CCP: Deflate: Output channel reset
May  5 13:20:05 <daemon.info> workhorse ppp[1067]: CCP: deflink: SendResetAck(187) state = Opened
May  5 13:20:05 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetReq(187) state = Opened
May  5 13:20:05 <daemon.info> workhorse ppp[1067]: CCP: Deflate: Output channel reset
May  5 13:20:05 <daemon.info> workhorse ppp[1067]: CCP: deflink: SendResetAck(187) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: DeflateInput: Seq error: Got 383, expected 382
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: SendResetReq(6) state = Opened
May  5 13:20:19 <daemon.info> workhorse last message repeated 3 times
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetReq(188) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: Deflate: Output channel reset
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: SendResetAck(188) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: SendResetReq(6) state = Opened
May  5 13:20:19 <daemon.info> workhorse last message repeated 5 times
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetReq(188) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: Deflate: Output channel reset
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: SendResetAck(188) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetAck(6) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: Deflate: Input channel reset
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetAck(6) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: Duplicate ResetAck (resetting again)
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: Deflate: Input channel reset
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetAck(6) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: Unexpected ResetAck (id 6) ignored
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetAck(6) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: Unexpected ResetAck (id 6) ignored
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetAck(6) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: Unexpected ResetAck (id 6) ignored
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetAck(6) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: Unexpected ResetAck (id 6) ignored
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetAck(6) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: Unexpected ResetAck (id 6) ignored
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetAck(6) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: Unexpected ResetAck (id 6) ignored
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: RecvResetAck(6) state = Opened
May  5 13:20:19 <daemon.info> workhorse ppp[1067]: CCP: deflink: Unexpected ResetAck (id 6) ignored

[Snip]


>How-To-Repeat:
Establish a PPPoE or PPP-over-UDP link between two computers and disrupt the link for a brief period. (Alternatively, one may instrument the code so that a packet is blocked now and then on its way to an interface, to achieve the same effect.)
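For the PPP-over-UDP case, a small lossy relay between the two endpoints is one way to drop packets in a controlled fashion without instrumenting ppp itself. This is a hypothetical sketch (the relay, its addresses, and the loss rate are our invention, not part of ppp):

```python
import random
import socket

def should_drop(rng, loss_rate):
    """Decide whether to silently discard a forwarded datagram."""
    return rng.random() < loss_rate

def lossy_relay(listen_addr, peer_addr, loss_rate=0.001, seed=None):
    """Forward UDP datagrams toward peer_addr, discarding a small
    fraction to emulate the occasionally flaky wireless link.  Each
    discarded datagram should trigger the CCP reset storm described
    in this report."""
    rng = random.Random(seed)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(listen_addr)
    while True:
        data, _src = sock.recvfrom(65535)
        if should_drop(rng, loss_rate):
            continue  # the "lost" packet
        sock.sendto(data, peer_addr)
```

Pointing one ppp endpoint at the relay instead of at its real peer should reproduce the stall on demand.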
>Fix:
None known at this time. At first, we theorized that the problem was specific to the (preferred) "deflate" compression algorithm. However, while connections established using the "predictor-1" algorithm were better behaved, they sometimes showed some thrashing behavior as well, with multiple renegotiations after only a single dropped packet.

Turning off ALL compression (by disabling and denying all available algorithms) is a workaround, but a highly inefficient one. (One of our primary motivations for the use of PPPoE is to compress the data stream, of which approximately 30%, according to our measurements, is highly compressible HTML.) Our tests seem to indicate that the problem is most likely in the compression portion of the code -- perhaps the state machine(s) related to CCP. They may also suggest a need for a frame retransmission mechanism within PPP that does not require a complete reset of the compression layer.
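The workaround described above amounts to refusing every CCP algorithm in ppp.conf; a sketch, assuming the standard ppp(8) option names:

```
# Workaround: negotiate no CCP compression at all (inefficient)
disable deflate pred1 deflate24
deny    deflate pred1 deflate24
```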
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->brian 
Responsible-Changed-By: dwmalone 
Responsible-Changed-When: Mon May 6 01:12:53 PDT 2002 
Responsible-Changed-Why:  
PPP is Brian's. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=37777 
State-Changed-From-To: open->closed 
State-Changed-By: brian 
State-Changed-When: Tue May 7 02:22:34 PDT 2002 
State-Changed-Why:  
As I said in private email: 
: This looks ok.  Each time one side misses a datagram, (Seq error: 
: Got N+1, expected N), it sends a reset request.  Each subsequent 
: packet received before the reset ack will also result in a reset 
: request.  Eventually, after the request has hit the peer, the peer 
: resets its dictionary and sends a reset ack.  On receipt of the 
: reset ack, the first side can start decompressing again. 
: 
: There will usually be one or two packets in transit when something 
: goes missing, resulting in more than one reset request being sent. 
: It's necessary to send a reset request for each out of sync packet, 
: just in case the previous reset request didn't make it to the peer. 

In the PPPoE case, there may be many packets en route when a problem 
occurs.  This has bad effects on the TCP layer as it has to re-sync 
itself. 

Effectively, compressing an unreliable transport means that an error 
is magnified - one missing datagram becomes many missed datagrams. 
This is unavoidable given the protocol spec. 
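The handshake described above can be modelled as a small state machine on the decompressor side. The sketch below is a toy (the class and method names are invented, and resynchronization after the ack is simplified), not ppp's actual DeflateInput code:

```python
class DeflateReceiver:
    """Toy model of the decompressor-side sequence check: a gap in
    the sequence numbers triggers a ResetRequest, and every further
    arrival is dropped (re-sending the request) until a ResetAck."""

    def __init__(self):
        self.expected = 0        # next sequence number we expect
        self.awaiting_ack = False
        self.requests_sent = 0   # ResetRequests emitted so far

    def packet_in(self, seq):
        """Return 'accept' or 'drop' for an incoming datagram."""
        if self.awaiting_ack:
            # Discard everything until the peer acknowledges the
            # reset; re-send the request in case an earlier one
            # was itself lost.
            self.requests_sent += 1
            return "drop"
        if self.expected is not None and seq != self.expected:
            self.awaiting_ack = True
            self.requests_sent += 1
            return "drop"
        self.expected = seq + 1
        return "accept"

    def reset_ack(self):
        """Peer has reset its dictionary; resume decompressing."""
        self.awaiting_ack = False
        self.expected = None  # resynchronize on the next packet
```

Feeding it sequence numbers 0, 1, 3, 4, 5 (with 2 lost) drops three datagrams and sends three ResetRequests before the ack lets traffic resume, matching the "one missing datagram becomes many" magnification.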

http://www.freebsd.org/cgi/query-pr.cgi?pr=37777 

From: Brian Somers <brian@Awfulhak.org>
To: Brett Glass <brett@lariat.org>
Cc: brian@FreeBSD.ORG, FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: bin/37777: A single lost packet on a (userland) PPP connection causes long-term disruption and a "storm" of requests and acknowledgements on the CCP layer 
Date: Wed, 15 May 2002 03:03:35 +0100

 > At 03:29 AM 5/7/2002, brian@FreeBSD.org wrote:
 > 
 > >In the PPPoE case, there may be many packets en route when a problem
 > >occurs.  This has bad effects on the TCP layer as it has to re-sync
 > >itself.
 > 
 > Do you mean the CCP layer?
 
 No.  The TCP layer has to resync itself when IP datagrams go missing.
 
 > In any event, what we are seeing in the logs is not just one 
 > resynchronization but a second as much as several seconds later when 
 > the next packet is sent.... As if the first resynch didn't happen 
 > correctly. I am not yet convinced that there is not a bug here.
 > After all, by the time a second or two has passed, everything should
 > be ready for the next packet.
 
 You should see the
 
 CCP: DeflateInput: Seq error: Got N+1, expected N
 
 error (indicating a missing datagram), followed by a CCP ResetRequest 
 being sent back to the peer.  All subsequent datagrams will be dropped, 
 and another CCP ResetRequest sent, until a CCP ResetAck is received 
 at which point you see
 
 CCP: Deflate: Input channel reset.
 
 You'll see up to one such message for each CCP ResetRequest sent 
 (where the peer is ACKing the Reset Request and initialising its 
 dictionary).  The number of CCP ResetAcks depends on what the peer 
 is sending between receiving the first CCP ResetRequest and receiving 
 the last.
 
 If, say, 10 datagrams are in transit when one disappears, there will be 9 
 CCP ResetRequests and 9 dropped datagrams, followed by one or more 
 ResetAcks.
 
 The 10 missing datagrams stay missing, so if they belong to one or 
 more TCP streams, each TCP stream has to figure out what's happened 
 and retransmit.
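The arithmetic in that example can be written out explicitly; a sketch, with an invented function name:

```python
def reset_storm_counts(in_flight, lost_index=0):
    """Per the explanation above: datagrams already in flight behind
    the lost one arrive before the peer's ResetAck, so each is
    dropped and provokes one more ResetRequest."""
    survivors_before_ack = in_flight - 1 - lost_index
    reset_requests = survivors_before_ack
    dropped = survivors_before_ack
    missing_total = dropped + 1  # plus the originally lost datagram
    return reset_requests, dropped, missing_total

# reset_storm_counts(10) -> (9, 9, 10): 9 ResetRequests, 9 dropped
# datagrams, 10 datagrams missing in total for TCP to recover.
```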
 
 > >Effectively, compressing an unreliable transport means that an error
 > >is magnified - one missing datagram becomes many missed datagrams.
 > >This is unavoidable given the protocol spec.
 > 
 > Is it? Can't the compression layer either provide its own reliable
 > transport or get a lower layer to supply it? Should we be using
 > a different protocol, such as PPP over TCP (which is really TCP
 > over PPP over TCP -- redundant!) or PPTP?
 
 PPP is not a reliable transport.
 
 If you run PPP over TCP it becomes reliable.  There will be no CCP 
 errors, but at the expense of having two TCP layers recovering after 
 a packet is dropped (assuming a top level TCP layer).  In practice, 
 this doesn't hurt as much as you'd expect.
 
 > --Brett
 
 -- 
 Brian <brian@Awfulhak.org>                    <brian@freebsd-services.com>
       <http://www.Awfulhak.org>                   <brian@[uk.]FreeBSD.org>
 Don't _EVER_ lose your sense of humour !          <brian@[uk.]OpenBSD.org>
 
 
>Unformatted:
