From nobody@FreeBSD.org  Thu Oct  6 21:01:11 2005
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 7136F16A41F
	for <freebsd-gnats-submit@FreeBSD.org>; Thu,  6 Oct 2005 21:01:11 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 3482343D46
	for <freebsd-gnats-submit@FreeBSD.org>; Thu,  6 Oct 2005 21:01:11 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.13.1/8.13.1) with ESMTP id j96L19Tg092181
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 6 Oct 2005 21:01:09 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.13.1/8.13.1/Submit) id j96L185Y092180;
	Thu, 6 Oct 2005 21:01:08 GMT
	(envelope-from nobody)
Message-Id: <200510062101.j96L185Y092180@www.freebsd.org>
Date: Thu, 6 Oct 2005 21:01:08 GMT
From: Mark Gooderum <mark@verniernetworks.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: BPF_MTAP/bpf_mtap are not threadsafe and cause panics on SMP systems
X-Send-Pr-Version: www-2.3

>Number:         87014
>Category:       kern
>Synopsis:       BPF_MTAP/bpf_mtap are not threadsafe and cause panics on SMP systems
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    csjp
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Oct 06 21:10:12 GMT 2005
>Closed-Date:    Sun Jan 21 02:58:27 GMT 2007
>Last-Modified:  Sun Jan 21 02:58:27 GMT 2007
>Originator:     Mark Gooderum
>Release:        5.3-RELEASE
>Organization:
Vernier Networks, Inc.
>Environment:
FreeBSD 139.94.1.20 5.3-RELEASE FreeBSD 5.3-RELEASE #0: Thu Aug  4 09:03:53 PDT 2005     build@build-amd3.verniernetworks.com:/usr/build/ambit2/freebsd5/sys/i386/compile/VNISMP  i386

>Description:
BPF_MTAP/BPF_MTAP2 do a non-atomic test and invoke based on the value of the if_bpf field.  The problem is that when the last bpf on the interface is deleted, the if_bpf field is set to NULL by bpf_detachd().  If another thread (such as a userland process) deletes the last bpf in the (admittedly small) window between the test and the invocation, then bpf_mtap() is invoked with a NULL bp parameter, which causes a fault on the LIST_EMPTY check:

#define    BPF_MTAP(_ifp,_m) do {                    \
   if ((_ifp)->if_bpf) {                    \
       M_ASSERTVALID(_m);                \
       bpf_mtap((_ifp)->if_bpf, (_m));            \
   }                            \
} while (0) 

This happens because, on i386 at least, the initial NULL check does _NOT_ fetch the variable into a register:

Line 3359 of "../../../dev/bge/if_bge.c"
  starts at address 0xc045e7a5 <bge_start_locked+53>
  and ends at 0xc045e7c0 <bge_start_locked+80>.

0xc045e7a5 <bge_start_locked+53>:       cmpl   $0x0,0x2074(%edi,%eax,4)
0xc045e7ad <bge_start_locked+61>:       jne    0xc045ea71 <bge_start_locked+769>
0xc045e7b3 <bge_start_locked+67>:       lea    0xfc(%esi),%eax
0xc045e7b9 <bge_start_locked+73>:       mov    %eax,0xffffffec(%ebp)
0xc045e7bc <bge_start_locked+76>:       lea    0x0(%esi),%es 

It just does an optimized test for NULLness against memory; the value isn't fetched until later, when setting up the call:


0xc045ea46 <bge_start_locked+726>:      mov    %ebx,0x4(%esp)
0xc045ea4a <bge_start_locked+730>:      mov    0x3c(%esi),%eax
0xc045ea4d <bge_start_locked+733>:      mov    %eax,(%esp)
0xc045ea50 <bge_start_locked+736>:      call   0xc05897d0 <bpf_mtap>
0xc045ea55 <bge_start_locked+741>:      lea    0x0(%esi),%esi
0xc045ea59 <bge_start_locked+745>:      lea    0x0(%edi),%edi
0xc045ea60 <bge_start_locked+752>:      mov    0xfffffff0(%ebp),%eax
0xc045ea63 <bge_start_locked+755>:      cmpl   $0x0,0x2074(%edi,%eax,4)
0xc045ea6b <bge_start_locked+763>:      je     0xc045e7c0 <bge_start_locked+80>

So the window is small but real.  Our field experience is with a box doing a large amount of packet processing while running frequent nessus scans (nessus adds and removes BPF filters on the fly as needed for certain tests).
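
The window can be modeled in userland by making the two reads of if_bpf explicit and running a hook between them.  This is only a sketch: the struct definitions are stand-ins rather than the kernel's, and racy_tap_arg(), detach_last() and the "between" hook are hypothetical names introduced for illustration.

```c
#include <stddef.h>

/* Stand-ins for the kernel types; only the if_bpf pointer matters here. */
struct bpf_if { int unused; };
struct ifnet { struct bpf_if *if_bpf; };

/*
 * Model of the unpatched BPF_MTAP(): one read of if_bpf for the test and a
 * second read for the call argument.  "between" stands in for whatever
 * another CPU does inside the window.  Returns the pointer bpf_mtap()
 * would receive and sets *called when the call would be made.
 */
static struct bpf_if *
racy_tap_arg(struct ifnet *ifp, void (*between)(struct ifnet *), int *called)
{
	*called = 0;
	if (ifp->if_bpf != NULL) {		/* first read: the test */
		if (between != NULL)
			between(ifp);		/* the other thread runs here */
		*called = 1;
		return (ifp->if_bpf);		/* second read: the argument */
	}
	return (NULL);
}

/* What bpf_detachd() does to the ifnet when the last descriptor goes away. */
static void
detach_last(struct ifnet *ifp)
{
	ifp->if_bpf = NULL;
}
```

With detach_last as the hook, racy_tap_arg() reports that the call is made and that its argument is NULL, which is the state that faults on LIST_EMPTY(&bp->bif_dlist).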


>How-To-Repeat:
Have lots of bpf filter add/deletes happening on a system under a heavy packet load.  I will attach a simple test program that adds/deletes filters on an interface at a high rate if desired.

This window also affects BPF_MTAP2/bpf_mtap2().
>Fix:
Either modify bpf_mtap()/bpf_mtap2() to check for a NULL parameter, or take a lock around the test and the call.

The NULL check works because the test/modify race doesn't apply once we are inside the function: the bpf_if lasts as long as the interface, only the pointer to it in the struct ifnet comes and goes, and by the time we're in bpf_mtap() we're looking at a copy of the variable on the stack, not the actual ifnet field.

Most interfaces have per-interface locks that could be used for the second option, but those mutexes are currently private to the drivers.
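
For comparison, the same single-read property can be had at the call site by copying if_bpf into a local before testing it.  This is not what the patch below does; it is a userland sketch of an alternative, and it relies on the point made above that the bpf_if lasts as long as the interface.  The MODEL_/model_ names and the struct layouts are hypothetical stand-ins, not kernel definitions.

```c
#include <stddef.h>

/* Stand-ins for the kernel types.  The volatile qualifier keeps the
 * compiler from re-reading the field in this model; the real struct ifnet
 * field is not declared this way. */
struct bpf_if { int ntaps; };
struct ifnet { struct bpf_if *volatile if_bpf; };
struct mbuf { int m_len; };

static int tap_calls;

/* Stand-in for bpf_mtap(): dereferences bp, so a NULL bp would fault. */
static void
model_bpf_mtap(struct bpf_if *bp, struct mbuf *m)
{
	(void)m;
	bp->ntaps++;
	tap_calls++;
}

/* Read if_bpf exactly once, then test and use the same snapshot. */
#define MODEL_BPF_MTAP(_ifp, _m) do {				\
	struct bpf_if *_bp = (_ifp)->if_bpf;			\
	if (_bp != NULL)					\
		model_bpf_mtap(_bp, (_m));			\
} while (0)

/* Helper for experiments: returns how many taps one invocation made. */
static int
model_tap_once(struct ifnet *ifp, struct mbuf *m)
{
	int before = tap_calls;

	MODEL_BPF_MTAP(ifp, m);
	return (tap_calls - before);
}
```

Because the test and the call use the same local, a concurrent store of NULL to if_bpf can at worst cause one extra tap on a bpf_if that is still valid, rather than a NULL dereference.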

--- /tmp/tmp.97907.0    Thu Oct  6 15:57:52 2005
+++ sys/net/bpf.c  Thu Oct  6 15:57:45 2005
@@ -1201,20 +1201,27 @@
  */
 void
 bpf_mtap(bp, m)
        struct bpf_if *bp;
        struct mbuf *m;
 {
        struct bpf_d *d;
        u_int pktlen, slen;

        /*
+        * We can sometimes be invoked w/NULL bp due to a small race in
+        * BPF_MTAP(), see PR#xxxxx.
+        */
+       if (!bp)
+               return;
+
+       /*
         * Lockless read to avoid cost of locking the interface if there are
         * no descriptors attached.
         */
        if (LIST_EMPTY(&bp->bif_dlist))
                return;

        pktlen = m_length(m, NULL);
        if (pktlen == m->m_len) {
                bpf_tap(bp, mtod(m, u_char *), pktlen);
                return;
@@ -1245,20 +1252,27 @@
 void
 bpf_mtap2(bp, data, dlen, m)
        struct bpf_if *bp;
        void *data;
        u_int dlen;
        struct mbuf *m;
 {
        struct mbuf mb;
        struct bpf_d *d;
        u_int pktlen, slen;
+
+       /*
+        * We can sometimes be invoked w/NULL bp due to a small race in
+        * BPF_MTAP2(), see PR#xxxxx.
+        */
+       if (!bp)
+               return;

        /*
         * Lockless read to avoid cost of locking the interface if there are
         * no descriptors attached.
         */
        if (LIST_EMPTY(&bp->bif_dlist))
                return;

        pktlen = m_length(m, NULL);
        /*

>Release-Note:
>Audit-Trail:

From: Mark Gooderum <mark@verniernetworks.com>
To: bug-followup@FreeBSD.org,  mark@verniernetworks.com
Cc:  
Subject: Re: kern/87014: BPF_MTAP/bpf_mtap are not threadsafe and cause panics
 on SMP systems
Date: Fri, 07 Oct 2005 00:03:12 -0500

 This is a multi-part message in MIME format.
 --------------050606070600030406070008
 Content-Type: multipart/alternative;
  boundary="------------010506050407060700010906"
 
 
 --------------010506050407060700010906
 Content-Type: text/plain; charset=us-ascii; format=flowed
 Content-Transfer-Encoding: 7bit
 
 FYI - this appears to be a duplicate of  PR 73719.  I did search before 
 but somehow missed it.
 
 Using the attached test program (which spins opening and closing BPF 
 devices) I can make my system crash in a few seconds from this bug.  The 
 test setup is basically:
 
     * FreeBSD system as router
           o 2 GigE interfaces
     * 4 Traffic Generating Systems
           o Two on one interface with a netstraind running
           o Two on second interface with netstrain running
                 + Run netstrain bi-dir (ie: netstrain <desthost> <port>
                   both)
           o I can generate about 450Mbit/sec each way (900 Mbit/sec
             aggregate) with this setup
     * Start the netstraind servers
     * Start the netstrain clients
     * Things are fine
     * Run the attached test program full spin mode on one of the active
       interfaces
           o bpfspin -f 100000 bge0
     * System crashes in 1-2 seconds once bpfspin is started w/o fix
 
 The SUT was a Tyan S2882 based Dual Opteron 248 system.  The motherboard 
 has an Intel 8255x based 10/100 port and two Broadcom 5704 based GigE 
 ports onboard.  It also had a pair of PCI-X Intel Dual GigE PRO/1000M 
 cards (Intel 8254x based).  The crash was reproduced with both the bge 
 driver ports and the em driver interfaces.
 
 This test must be done on a true SMP system as the race requires two 
 active threads - there are no other preemption points in the race 
 window.  Not sure about timing on HTT systems - this testing was on a 
 true Dual Opteron system.
 
 The attached patch fixes the problem and adds a couple of debug sysctls: 
 debug.bpf_nullhits counts the number of hits, and debug.bpf_nullfix 
 disables the fix.  With bpfspin running you can see the fix trip every 
 second or so; disable the fix and the system panics almost immediately.
 -=-
 Mark
 
 
 --------------010506050407060700010906--
 
 --------------050606070600030406070008
 Content-Type: text/plain;
  name="Makefile"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: inline;
  filename="Makefile"
 
 bpfspin:	bpfspin.o
 	gcc -g -o bpfspin bpfspin.o -lpcap
 
 bpfspin.o: bpfspin.c
 	gcc -g -c -o bpfspin.o bpfspin.c
 --------------050606070600030406070008
 Content-Type: text/x-csrc;
  name="bpfspin.c"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: inline;
  filename="bpfspin.c"
 
 /*
  * Test program to open and close a BPF a _lot_.
  */
 
 #include <errno.h>
 #include <string.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <signal.h>
 
 #include <sys/types.h>
 #include <sys/ioctl.h>
 #include <net/bpf.h>
 #include <unistd.h>
 
 #include "pcap.h"
 
 #define CAP_LEN 100
 
 const char *argv0;
 const char *iname;
 
 /* Default to something that won't match anything */
 char *filter = "ip proto 199";
 
 pcap_t *
 open_bpf(const char *ifname);
 
 void
 close_bpf(pcap_t *pct);
 
 int debug_level;
 int freq = 10;
 
 int on_sleep;
 int off_sleep;
 int per_cycle;
 int num_cycles = -1;
 volatile sig_atomic_t quit_flag;	/* set from the signal handler */
 
 void
 usage(int badopt);
 
 void
 catchsig(int signo);
 
 
 int
 main(int argc, char *argv[])
 {
 	u_int64_t	npass = 0;
 	const char *estr;
 	int	eno;
 	pcap_t	*pct;
 	int	ch;
 
 	argv0 = strrchr(argv[0], '/');
 	if (argv0 == NULL) {
 		argv0 = argv[0];
 	} else {
 		argv0++;
 	}
 
 	signal(SIGTERM, catchsig);
 	signal(SIGHUP, catchsig);
 	signal(SIGQUIT, catchsig);
 	signal(SIGINT, catchsig);
 
 	/*
 	 * Args...
 	 */
 	while ((ch = getopt(argc, argv, "df:hn:o:")) != -1) {
 		switch (ch) {
 		case 'd':
 			debug_level++;
 			break;
 
 		case 'f':
 			freq = atoi(optarg);
 			break;
 
 		case 'h':
 			usage(0);
 			exit(0);
 
 		case 'n':
 			num_cycles = atoi(optarg);
 			break;
 
 		case 'o':
 			on_sleep = atoi(optarg);
 			break;
 
 		default:
 			usage(optopt);
 			exit(1);
 		}
 		
 	}
 
 	argc -= optind;
 	argv += (optind - 1);
 
 	if (argc < 1) {
 		fprintf(stderr, "Error: <ifname> argument required.\n");
 		usage(-1);
 	}
 	iname = argv[1];
 
 	if (freq) {
 		per_cycle = 1000000 / freq;
 		off_sleep = per_cycle;
 	}
 	if (on_sleep) {
 		off_sleep = per_cycle - on_sleep;
 	}
 	
 	while (num_cycles) {
 		pct = open_bpf(iname);
 		if (pct == NULL) {
 			eno = errno;
 			estr = strerror(eno);
 			if (estr == NULL) {
 				estr = "<Unknown>";
 			}
 			fprintf(stderr, "Error: open_bpf(%s) failed %d/%s\n",
 				iname, eno, estr);
 			exit(3);
 		}
 		if (on_sleep) {
 			usleep(on_sleep);
 		}
 
 		close_bpf(pct);
 		if (on_sleep) {
 			usleep(off_sleep);
 		}
 
 		if (num_cycles > 0) {
 			num_cycles--;
 		}
 		npass++;
 		if (quit_flag) {
 			break;
 		}
 	}
 	printf("Open/Closed bpf on %s %llu times.\n", iname,
 	    (unsigned long long)npass);
 	exit(0);
 }
 
 pcap_t *
 open_bpf(const char *ifname)
 {
 	pcap_t	*pct;
 	int	pfd;
 	u_int	one = 1;
 	char	ebuf[PCAP_ERRBUF_SIZE];
 	struct bpf_program	dfilter;
 	u_int32_t		network = 0, netmask = 0;
 
 	pct = pcap_open_live(ifname, CAP_LEN, 0, 1000, ebuf);
 	if (pct == NULL) {
 		/* pcap_open_live() reports its error in ebuf, not errno */
 		fprintf(stderr, "pcap_open_live failed: %s\n", ebuf);
 		return(NULL);
 	}
 	pfd = pcap_get_selectable_fd(pct);
 	if (ioctl(pfd, BIOCIMMEDIATE, &one) < 0) {
 		perror("BIOCIMMEDIATE failed");
 		pcap_close(pct);
 		return(NULL);
 	}
 #if 0
 	/* Must be needed? */
 	if(pcap_lookupnet(ifname, &network, &netmask, 0) < 0) {
 		perror("pcap_lookupnet failed");
 		pcap_close(pct);
 		return(NULL);
 	}
 #endif
 	/* Compile the Dummy filter pcap program */
 	bzero(&dfilter, sizeof(struct bpf_program));
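 	if (pcap_compile(pct, &dfilter, filter, 0, netmask) < 0) {
 		/* libpcap reports these errors via pcap_geterr(), not errno */
 		fprintf(stderr, "pcap_compile failed: %s\n", pcap_geterr(pct));
 		pcap_close(pct);
 		return(NULL);
 	}
 	if (pcap_setfilter(pct, &dfilter) < 0)
 	{
 		fprintf(stderr, "pcap_setfilter failed: %s\n", pcap_geterr(pct));
 		pcap_freecode(&dfilter);
 		pcap_close(pct);
 		return(NULL);
 	}
 	/* The filter is installed; free our copy so the spin loop doesn't leak */
 	pcap_freecode(&dfilter);
 	return(pct);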
 }
 
 void
 close_bpf(pcap_t *pct)
 {
 	pcap_close(pct);
 }
 
 
 void
 usage(int badopt)
 {	
 	if (badopt > 0) {
 		fprintf(stderr, "%s: Bad option [-%c]\n", argv0, 
 			(char) badopt);
 	}
 	fprintf(stderr, "Usage:  %s [-dh] [-f <freq>] [-n <cycles>] <ifname>\n",
 		argv0);
 	fprintf(stderr, "\t-d\tIncrease debug level by 1\n");
 	fprintf(stderr, "\t-f\tSet Flap Freq to <freq>\n");
 	fprintf(stderr, "\t-h\tPrint this help\n");
 	fprintf(stderr, "\t-n\tStop after <cycles> open/close cycles\n");
 	exit(badopt != 0);
 }
 
 void
 catchsig(int signo)
 {
 	switch (signo) {
 	case SIGHUP:
 	case SIGTERM:
 	case SIGQUIT:
 	case SIGINT:
 		quit_flag = 1;
 		break;
 	default:
 		abort();
 	}
 }
 
 
 --------------050606070600030406070008
 Content-Type: text/plain;
  name="BPFMTAP.difftxt"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: inline;
  filename="BPFMTAP.difftxt"
 
 --- /tmp/tmp.44835.0	Fri Oct  7 00:00:08 2005
 +++ freebsd5/sys/net/bpf.c	Thu Oct  6 16:34:36 2005
 @@ -81,20 +81,27 @@
  /*
   * The default read buffer size is patchable.
   */
  static int bpf_bufsize = 4096;
  SYSCTL_INT(_debug, OID_AUTO, bpf_bufsize, CTLFLAG_RW,
  	&bpf_bufsize, 0, "");
  static int bpf_maxbufsize = BPF_MAXBUFSIZE;
  SYSCTL_INT(_debug, OID_AUTO, bpf_maxbufsize, CTLFLAG_RW,
  	&bpf_maxbufsize, 0, "");
  
 +static int bpf_nullhits;
 +static int bpf_donullfix = 1;
 +SYSCTL_INT(_debug, OID_AUTO, bpf_nullfix, CTLFLAG_RW,
 +	   &bpf_donullfix, 0, "Apply the BPF null BP workaround");
 +SYSCTL_INT(_debug, OID_AUTO, bpf_nullhits, CTLFLAG_RW,
 +	   &bpf_nullhits, 0, "# of bpf_mtap/2() workarounds fired");
 +
  /*
   *  bpf_iflist is the list of interfaces; each corresponds to an ifnet
   */
  static LIST_HEAD(, bpf_if)	bpf_iflist;
  static struct mtx	bpf_mtx;		/* bpf global lock */
  
  static int	bpf_allocbufs(struct bpf_d *);
  static void	bpf_attachd(struct bpf_d *d, struct bpf_if *bp);
  static void	bpf_detachd(struct bpf_d *d);
  static void	bpf_freed(struct bpf_d *);
 @@ -1201,20 +1208,31 @@
   */
  void
  bpf_mtap(bp, m)
  	struct bpf_if *bp;
  	struct mbuf *m;
  {
  	struct bpf_d *d;
  	u_int pktlen, slen;
  
  	/*
 +	 * We can sometimes be invoked w/NULL bp due to a small race in 
 +	 * BPF_MTAP(), see PR#xxxxx.
 +	 */
 +	if (bpf_donullfix) {
 +		if (!bp) {
 +			bpf_nullhits++;
 +			return;
 +		}
 +	}
 +
 +	/*
  	 * Lockless read to avoid cost of locking the interface if there are
  	 * no descriptors attached.
  	 */
  	if (LIST_EMPTY(&bp->bif_dlist))
  		return;
  
  	pktlen = m_length(m, NULL);
  	if (pktlen == m->m_len) {
  		bpf_tap(bp, mtod(m, u_char *), pktlen);
  		return;
 @@ -1245,20 +1263,31 @@
  void
  bpf_mtap2(bp, data, dlen, m)
  	struct bpf_if *bp;
  	void *data;
  	u_int dlen;
  	struct mbuf *m;
  {
  	struct mbuf mb;
  	struct bpf_d *d;
  	u_int pktlen, slen;
 +
 +	/*
 +	 * We can sometimes be invoked w/NULL bp due to a small race in 
 +	 * BPF_MTAP2(), see PR#xxxxx.
 +	 */
 +	if (bpf_donullfix) {
 +		if (!bp) {
 +			bpf_nullhits++;
 +			return;
 +		}
 +	}
  
  	/*
  	 * Lockless read to avoid cost of locking the interface if there are
  	 * no descriptors attached.
  	 */
  	if (LIST_EMPTY(&bp->bif_dlist))
  		return;
  
  	pktlen = m_length(m, NULL);
  	/*
 
 --------------050606070600030406070008--
 
Responsible-Changed-From-To: freebsd-bugs->rwatson 
Responsible-Changed-By: rwatson 
Responsible-Changed-When: Sat Oct 8 17:44:12 GMT 2005 
Responsible-Changed-Why:  
Grab ownership of this PR 


http://www.freebsd.org/cgi/query-pr.cgi?pr=87014 
Responsible-Changed-From-To: rwatson->csjp 
Responsible-Changed-By: rwatson 
Responsible-Changed-When: Mon Jun 12 11:41:06 UTC 2006 
Responsible-Changed-Why:  
Assign this bug to csjp, he has recently taken on fixing locking in 
BPF, and may have fixed it in his recent changes. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=87014 
State-Changed-From-To: open->patched 
State-Changed-By: csjp 
State-Changed-When: Mon Jun 12 13:40:22 UTC 2006 
State-Changed-Why:  
This issue should be patched in -CURRENT.  Once it has been tested 
sufficiently, I will MFC it. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=87014 
State-Changed-From-To: patched->closed 
State-Changed-By: csjp 
State-Changed-When: Sun Jan 21 02:58:04 UTC 2007 
State-Changed-Why:  
The fix has been MFCed 

http://www.freebsd.org/cgi/query-pr.cgi?pr=87014 
>Unformatted:
