From nobody@FreeBSD.org  Tue Jan 13 13:50:58 2009
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id BB2661065673
	for <freebsd-gnats-submit@FreeBSD.org>; Tue, 13 Jan 2009 13:50:58 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id A75458FC1F
	for <freebsd-gnats-submit@FreeBSD.org>; Tue, 13 Jan 2009 13:50:58 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.3/8.14.3) with ESMTP id n0DDowcU025191
	for <freebsd-gnats-submit@FreeBSD.org>; Tue, 13 Jan 2009 13:50:58 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.3/8.14.3/Submit) id n0DDowGe025190;
	Tue, 13 Jan 2009 13:50:58 GMT
	(envelope-from nobody)
Message-Id: <200901131350.n0DDowGe025190@www.freebsd.org>
Date: Tue, 13 Jan 2009 13:50:58 GMT
From: Chris Kuklewicz <bugreport-freebsd@mightyreason.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: Serious bug in regular expression library (regex) affected sed
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         130504
>Category:       bin
>Synopsis:       [libc] Serious bug in regular expression library (regex) affected sed
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Jan 13 14:00:01 UTC 2009
>Closed-Date:    
>Last-Modified:  Sun Jul 01 16:10:27 UTC 2012
>Originator:     Chris Kuklewicz
>Release:        FreeBSD 8.0-Current and 6.3-PRERELEASE
>Organization:
Not applicable
>Environment:
FreeBSD m-net.arbornet.org 6.3-PRERELEASE FreeBSD 6.3-PRERELEASE #1: Tue Nov  6 16:13:56 EST 2007     root@newmnet.arbornet.org:/usr/src/sys/i386/compile/MNET  i386
>Description:
I have tracked the source of a bug originally seen on OS X back to the same bug in FreeBSD.

The bug is in the libc implementation of regular expressions, used via the API in "regex.h" specifically regexec.

Programs which use this library are affected, including "sed" which I will use to illustrate the bug below.

The bug effects both the whole text matched and the choice of text reported as matching parenthesized subexpressions.  The bug violates the leftmost longest matching rule for POSIX regular expressions.  The bug violates the principle that the order of subexpressions around the "|" operator should not affect the whole text matched.

A terminal sessions using three sed commands with extended regular expressions:
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/"
a[b]
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/"
[ab]
[chrisk@m-net ~]$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/'
[XababaYababaZ] => [X][ab][aba][Y][b][Z]

The first two commands differ only in the order of "()" and "." around the "|" operator.  According to the POSIX specification this must not affect the whole match, which is report by "&" in sed's replacement specification.  The leftmost longest match is "ab" and is correctly reported by the second command.  The first command incorrect does not match starting with the "a" and matches just the "b".

This establishes that the bug violates the POSIX specification and can affect a program that only cares about the whole text matched (ignoring captured parenthesized subexpressions).

The third command shows an internal inconsistency in the output.  The (aba|b|ab)* operator should capture the text matched by the last repetition of the (aba|b|ab) subexpression and report that in \6. In particular \6 should match "aba" just like \3 does.

As shown above the FreeBSD output is [XababaYababaZ] => [X][ab][aba][Y][b][Z]
The correct output should have been  [XababaYababaZ] => [X][ab][aba][Y][abc][Z]

Could there be a way to explain the FreeBSD output? Consider the following questions:

If the last repetition of (aba|b|ab)* matched \6 to "b" then how is the text between that "b" skipped so that "(Z)" matched \7 against the final "Z" in the text?
If the "b" reported by \6 is the last "b" in the text then there is no part of the pattern that can match the next "a".
If the "b" reported by \6 is the next-to-last "b" in the text then there is no part of the pattern that can match the preceding "a".
Checking the indices returned by regex shows that \6 was matched against the last "b" in the text.
Thus reporting \6 as matching "b" is contradictory, there is no consistent interpretation of what regexec returned.

This establishes that the bug is can violate not only the POSIX specification but also violate any rational and consistent alternate specification.

I found the first example by using Haskell's QuickCheck to randomly generate test cases.  The second example was found by using Glenn Fowler's "AT&T Labs Research regex(3) regression tests" from http://www.research.att.com/~gsf/testregex/
I tested FreeBSD 6.3-PRERELEASE thanks to the free shell account at m-net.arbornet.org and FreeBSD 8.0-CURRENT was tested by "methi" on the #freebsd channel at irc.freenode.net

>How-To-Repeat:
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/"
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/"
[chrisk@m-net ~]$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)

If there is no bug then the first two both output "[ab]" and the last output "[XababaYababaZ] => [X][ab][aba][Y][abc][Z]".

Currently the bug causes the first the output "a[b]" and the last to output "[XababaYababaZ] => [X][ab][aba][Y][b][Z]".

>Fix:
I do not have the time to be able to fix the regex.h part of libc.


>Release-Note:
>Audit-Trail:
State-Changed-From-To: open->analyzed 
State-Changed-By: dds 
State-Changed-When: Tue Sep 15 20:26:25 UTC 2009 
State-Changed-Why:  
Reproduced the bug and created test case 


Responsible-Changed-From-To: freebsd-bugs->dds 
Responsible-Changed-By: dds 
Responsible-Changed-When: Tue Sep 15 20:26:25 UTC 2009 
Responsible-Changed-Why:  
I think I can fix it. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=130504 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: bin/130504: commit references a PR
Date: Tue, 15 Sep 2009 21:16:37 +0000 (UTC)

 Author: dds
 Date: Tue Sep 15 21:15:29 2009
 New Revision: 197234
 URL: http://svn.freebsd.org/changeset/base/197234
 
 Log:
   Add two test cases from PR 130504.
   An additional one coming from http://www.research.att.com/~gsf/testregex/
   was not added; at some point the entire AT&T regression test harness
   should be imported here.
   But that would also mean commitment to fix the uncovered errors.
   
   PR:		130504
   Submitted by:	Chris Kuklewicz
 
 Modified:
   head/lib/libc/regex/grot/tests
 
 Modified: head/lib/libc/regex/grot/tests
 ==============================================================================
 --- head/lib/libc/regex/grot/tests	Tue Sep 15 20:28:29 2009	(r197233)
 +++ head/lib/libc/regex/grot/tests	Tue Sep 15 21:15:29 2009	(r197234)
 @@ -472,3 +472,6 @@ abcdefghijklmnop	i	abcdefghijklmnop	abcd
  abcdefghijklmnopqrstuv	i	abcdefghijklmnopqrstuv	abcdefghijklmnopqrstuv
  (ALAK)|(ALT[AB])|(CC[123]1)|(CM[123]1)|(GAMC)|(LC[23][EO ])|(SEM[1234])|(SL[ES][12])|(SLWW)|(SLF )|(SLDT)|(VWH[12])|(WH[34][EW])|(WP1[ESN])	-	CC11	CC11
  CC[13]1|a{21}[23][EO][123][Es][12]a{15}aa[34][EW]aaaaaaa[X]a	-	CC11	CC11
 +# PR 130504
 +(.|())(b)	-	ab	ab
 +(()|.)(b)	-	ab	ab
 _______________________________________________
 svn-src-all@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/svn-src-all
 To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: bin/130504: commit references a PR
Date: Wed, 16 Sep 2009 06:32:37 +0000 (UTC)

 Author: dds
 Date: Wed Sep 16 06:32:23 2009
 New Revision: 197246
 URL: http://svn.freebsd.org/changeset/base/197246
 
 Log:
   Fix an off-by-one error in the marking of the O_CH operator
   following an OOR2 operator.
   
   PR:		130504
   MFC after:	2 weeks
 
 Modified:
   head/lib/libc/regex/engine.c
 
 Modified: head/lib/libc/regex/engine.c
 ==============================================================================
 --- head/lib/libc/regex/engine.c	Wed Sep 16 06:29:23 2009	(r197245)
 +++ head/lib/libc/regex/engine.c	Wed Sep 16 06:32:23 2009	(r197246)
 @@ -1075,7 +1075,7 @@ step(struct re_guts *g,
  						OP(s = g->strip[pc+look]) != O_CH;
  						look += OPND(s))
  					assert(OP(s) == OOR2);
 -				FWD(aft, aft, look);
 +				FWD(aft, aft, look + 1);
  			}
  			break;
  		case OOR2:		/* propagate OCH_'s marking */
 _______________________________________________
 svn-src-all@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/svn-src-all
 To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
 

From: Chris Kuklewicz <bugreport-freebsd@mightyreason.com>
To: bug-followup@FreeBSD.org, bugreport-freebsd@mightyreason.com
Cc:  
Subject: Re: bin/130504: [libc] Serious bug in regular expression library
 (regex) affected sed
Date: Mon, 21 Sep 2009 13:09:29 +0100

 The patch looks like a fix for the first bug.  I assume that it is not a
 fix for the bug that produces:
 
 > $ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/'
 > [XababaYababaZ] => [X][ab][aba][Y][b][Z]
 
 > $ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)*(Y)(aba|b|ab)(aba|b|ab)(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/'
 > [XababaYababaZ] => [X][b][Y][ab][aba][Z]
 > 
 
 I have not read the code, but I have developed a theory that explains
 mistakes like the one above.
 
 The (aba|b|ab)* can fit the ababa in more than one way, and this is an
 ambiguous repetition.  Handling subgroup captures for ambiguous
 repetitions is really tricky.  The bug above makes me suspect the code
 does this:
 
 1) Form a finite automaton to find the leftmost-longest match, ignoring
 substring captures (at least for the ambiguous repetition)
 
 2) Do something to report the non-ambiguous repetitions correctly, e.g.
 all but the \6 in the first or \2 in the second example.  Since this
 works, I can deduce little about how it does it.
 
 3) Start at the beginning of the ambiguous repetitions and perform
 left-to-right greedy matching.  Thus the "aba" is matched against
 "ababa" first, leaving "ba" which is matched with "b", leaving "a".
 This reports the last greedy iteration as the captured substring (here "b").
 
 4a) If (3) covers the whole substring (here "ababa") then (3) is
 correct.  This is equivalent to "does (3) terminate at the correct final
 character/index?"
 
 4b) If (3) does not cover the whole substring then the algorithm has
 failed to find the correct final iteration.  One way to recover from
 this would be to backup and try other possible captures; this search
 will work if done in the right order, but is potentially exponentially slow.
 
 So the problem with the current code seems to be a failure to recognize
 that (3) has failed and to then do something about it.  Note that I have
 not read the actual code or studied the actual algorithm.
 
 There is another alternative: implement a non-exponential algorithm to
 get the correct substring captures.  Such algorithms exists, and NetBSD
 is looking to include one such: http://laurikari.net/tre/ is the home
 page.  A few years ago I made a variant (in Haskell) of this algorithm
 and have achieved linear performance and POSIX substring captures for
 ambiguous repetitions.  I found the the (perhaps now old) "tre" library
 did NOT find the correct POSIX substring captures for ambiguous
 repetitions; I have not re-tested the newest "tre" version.
 
 -- 
 Chris Kuklewicz
Responsible-Changed-From-To: dds->freesd-bugs 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Wed Sep 22 14:54:53 UTC 2010 
Responsible-Changed-Why:  
dds' bit has been returned for safekeeping. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=130504 
Responsible-Changed-From-To: freesd-bugs->freebsd-bugs 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Mon Sep 27 11:44:56 UTC 2010 
Responsible-Changed-Why:  
Canonicalize assignment. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=130504 
State-Changed-From-To: analyzed->open 
State-Changed-By: eadler 
State-Changed-When: Sun Jul 1 16:10:26 UTC 2012 
State-Changed-Why:  
unowned PRs must not be in analyzed state 

http://www.freebsd.org/cgi/query-pr.cgi?pr=130504 
>Unformatted:
