From kcwu@kcwu.homeip.net  Sat Sep  4 09:39:19 2004
Return-Path: <kcwu@kcwu.homeip.net>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 7753C16A4CE
	for <FreeBSD-gnats-submit@freebsd.org>; Sat,  4 Sep 2004 09:39:19 +0000 (GMT)
Received: from mail4out.giga.net.tw (mail4out.giga.net.tw [203.133.1.42])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 401CB43D45
	for <FreeBSD-gnats-submit@freebsd.org>; Sat,  4 Sep 2004 09:39:19 +0000 (GMT)
	(envelope-from kcwu@kcwu.homeip.net)
Received: from kcwu.homeip.net (61-70-142-187.adsl.static.giga.net.tw [61.70.142.187])
	by mail4out.giga.net.tw (Postfix) with ESMTP id 319A059EC
	for <FreeBSD-gnats-submit@freebsd.org>; Sat,  4 Sep 2004 17:37:58 +0800 (CST)
Received: from kcwu.homeip.net (kc@kcwu.homeip.net [127.0.0.1])
	by kcwu.homeip.net (8.13.1/8.13.1) with ESMTP id i849dX9v096863
	for <FreeBSD-gnats-submit@freebsd.org>; Sat, 4 Sep 2004 17:39:34 +0800 (CST)
	(envelope-from kcwu@kcwu.homeip.net)
Received: (from kcwu@localhost)
	by kcwu.homeip.net (8.13.1/8.13.1/Submit) id i849dXYC096862;
	Sat, 4 Sep 2004 17:39:33 +0800 (CST)
	(envelope-from kcwu)
Message-Id: <200409040939.i849dXYC096862@kcwu.homeip.net>
Date: Sat, 4 Sep 2004 17:39:33 +0800 (CST)
From: Kuang-che Wu <kcwu@csie.org>
Reply-To: Kuang-che Wu <kcwu@csie.org>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: regex multibyte support is really slow
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         71367
>Category:       bin
>Synopsis:       regex multibyte support is really slow
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    tjr
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Sep 04 09:40:09 GMT 2004
>Closed-Date:    Sun Sep 05 08:32:57 GMT 2004
>Last-Modified:  Sun Sep 05 08:32:57 GMT 2004
>Originator:     Kuang-che Wu
>Release:        FreeBSD 6.0-CURRENT i386
>Organization:
>Environment:
System: FreeBSD kcwu.homeip.net 6.0-CURRENT FreeBSD 6.0-CURRENT #0: Sat Sep 4 05:33:38 CST 2004 root@kcwu.homeip.net:/usr/obj/usr/src/sys/DESKTOP i386

CPU: AMD Athlon(tm) XP 2000+ (1665.59-MHz 686-class CPU)

	
>Description:
	regex in UTF-8 locale
	+ flag REG_EXTENDED|REG_ICASE
	+ pattern [[:alnum:]]
	= unacceptably slow
	
>How-To-Repeat:
	$ cc -O -pipe   re.c  -o re
	$ time ./re
	        7.65 real         7.51 user         0.06 sys

#include <stdio.h>
#include <locale.h>
#include <regex.h>

int main(void)
{
  regex_t re;
  char string[1024]={
#define WORD 0xe6,0x85,0xa2 /* UTF-8 character */
    WORD, WORD, WORD, WORD, WORD, WORD, WORD, WORD, WORD, WORD,
    WORD, WORD, WORD, WORD, WORD, WORD, WORD, WORD, WORD, WORD,
    0
  };

  if(setlocale(LC_CTYPE,"zh_TW.UTF-8")==NULL)
    return 1;

  if(regcomp(&re,"[[:alnum:]]",REG_EXTENDED|REG_ICASE)!=0)
    return 2;
  if(regexec(&re,string,0,NULL,0)==0)
    printf("matched\n");

  return 0;
}
	
>Fix:

	


>Release-Note:
>Audit-Trail:

From: Tim Robbins <tjr@freebsd.org>
To: Kuang-che Wu <kcwu@csie.org>
Cc: bug-followup@freebsd.org
Subject: Re: bin/71367: regex multibyte support is really slow
Date: Sat, 4 Sep 2004 20:49:07 +1000

 On Sat, Sep 04, 2004 at 05:39:33PM +0800, Kuang-che Wu wrote:
 
 > >How-To-Repeat:
 > 	$ cc -O -pipe   re.c  -o re
 > 	$ time ./re
 > 	        7.65 real         7.51 user         0.06 sys
 > 
 > #include <stdio.h>
 > #include <locale.h>
 > #include <regex.h>
 > 
 > int main(void)
 > {
 >   regex_t re;
 >   char string[1024]={
 > #define WORD 0xe6,0x85,0xa2 /* UTF-8 character */
 >     WORD, WORD, WORD, WORD, WORD, WORD, WORD, WORD, WORD, WORD,
 >     WORD, WORD, WORD, WORD, WORD, WORD, WORD, WORD, WORD, WORD,
 >     0
 >   };
 > 
 >   if(setlocale(LC_CTYPE,"zh_TW.UTF-8")==NULL)
 >     return 1;
 > 
 >   if(regcomp(&re,"[[:alnum:]]",REG_EXTENDED|REG_ICASE)!=0)
 >     return 2;
 >   if(regexec(&re,string,0,NULL,0)==0)
 >     printf("matched\n");
 > 
 >   return 0;
 > }
 
 I can't reproduce these results. I get:
 
 $ gcc -O -pipe re.c -o re
 $ time ./re
 
 real    0m0.003s
 user    0m0.000s
 sys     0m0.002s
 
 CPU: AMD Athlon(tm) 64 Processor 3000+ (2002.58-MHz K8-class CPU)
 
 Do you have any non-standard options in /etc/make.conf? Have you changed
 the C library at all locally? Can you confirm that the system you ran this
 on was idle?
 
 
 Tim

From: "Simon L. Nielsen" <simon@FreeBSD.org>
To: Tim Robbins <tjr@freebsd.org>
Cc: freebsd-gnats-submit@FreeBSD.org, Kuang-che Wu <kcwu@csie.org>
Subject: Re: bin/71367: regex multibyte support is really slow
Date: Sat, 4 Sep 2004 13:21:22 +0200

 
 On 2004.09.04 10:50:27 +0000, Tim Robbins wrote:
 >  > >How-To-Repeat:
 >  > 	$ cc -O -pipe   re.c  -o re
 >  > 	$ time ./re
 >  > 	        7.65 real         7.51 user         0.06 sys
 
 >  I can't reproduce these results. I get:
 
 I can, more or less.. :
 
 [simon@zaphod:/tmp] cc -O -pipe   re.c  -o re
 [simon@zaphod:/tmp] /usr/bin/time ./re
         3,85 real         3,54 user         0,12 sys
 
 CPU: Intel(R) Pentium(R) M processor 1500MHz (1498.73-MHz 686-class CPU)
 
 I don't know if locale matters?
 
 [simon@zaphod:/tmp] echo $LC_MESSAGES
 en_US.ISO8859-1
 [simon@zaphod:/tmp] echo $LANG
 da_DK.ISO8859-1
 [simon@zaphod:~] echo $MM_CHARSET
 iso-8859-1
 
 >  Do you have any non-standard options in /etc/make.conf? Have you changed
 
 make.conf :
 CFLAGS?=        -pipe -O
 CPUTYPE?=       p4
 
 >  the C library at all locally? Can you confirm that the system you ran this
 >  on was idle?
 
 Stock system:
 
 [simon@zaphod:/tmp] uname -a
 FreeBSD zaphod.nitro.dk 5.3-BETA1 FreeBSD 5.3-BETA1 #: Thu Aug 26 16:25:21 CEST 2004     simon@zaphod.nitro.dk:/data/obj/usr/src/sys/ZAPHOD  i386
 
 The system was idle - at least nothing else that uses any significant
 amount of CPU was running at the time.
 
 -- 
 Simon L. Nielsen
 FreeBSD Documentation Team
 

From: Tim Robbins <tjr@freebsd.org>
To: "Simon L. Nielsen" <simon@FreeBSD.org>
Cc: freebsd-gnats-submit@FreeBSD.org, Kuang-che Wu <kcwu@csie.org>
Subject: Re: bin/71367: regex multibyte support is really slow
Date: Sat, 4 Sep 2004 21:36:16 +1000

 On Sat, Sep 04, 2004 at 01:21:22PM +0200, Simon L. Nielsen wrote:
 > On 2004.09.04 10:50:27 +0000, Tim Robbins wrote:
 > >  > >How-To-Repeat:
 > >  > 	$ cc -O -pipe   re.c  -o re
 > >  > 	$ time ./re
 > >  > 	        7.65 real         7.51 user         0.06 sys
 > 
 > >  I can't reproduce these results. I get:
 > 
 > I can, more or less.. :
 > 
 > [simon@zaphod:/tmp] cc -O -pipe   re.c  -o re
 > [simon@zaphod:/tmp] /usr/bin/time ./re
 >         3,85 real         3,54 user         0,12 sys
 
 Could you please try this patch?
 
 --- lib/libc/regex/regcomp.c.old	Sat Sep  4 21:33:53 2004
 +++ lib/libc/regex/regcomp.c	Sat Sep  4 21:32:50 2004
 @@ -1199,13 +1199,15 @@
  cset *cs;
  wint_t ch;
  {
 -	wint_t *newwides;
 +	wint_t nch, *newwides;
  	assert(ch >= 0);
  	if (ch < NC) {
  		cs->bmp[ch >> 3] |= 1 << (ch & 7);
  		if (cs->icase) {
 -			cs->bmp[towlower(ch) >> 3] |= 1 << (towlower(ch) & 7);
 -			cs->bmp[towupper(ch) >> 3] |= 1 << (towupper(ch) & 7);
 +			if ((nch = towlower(ch)) < NC)
 +				cs->bmp[nch >> 3] |= 1 << (nch & 7);
 +			if ((nch = towupper(ch)) < NC)
 +				cs->bmp[nch >> 3] |= 1 << (nch & 7);
  		}
  	} else {
  		newwides = realloc(cs->wides, (cs->nwides + 1) *
 @@ -1258,14 +1260,9 @@
  	wint_t i;
  	wctype_t *newtypes;
  
 -	for (i = 0; i < NC; i++) {
 +	for (i = 0; i < NC; i++)
  		if (iswctype(i, wct))
  			CHadd(p, cs, i);
 -		if (cs->icase && i != towlower(i))
 -			CHadd(p, cs, towlower(i));
 -		if (cs->icase && i != towupper(i))
 -			CHadd(p, cs, towupper(i));
 -	}
  	newtypes = realloc(cs->types, (cs->ntypes + 1) *
  	    sizeof(*cs->types));
  	if (newtypes == NULL) {
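
For readers following the patch: the first hunk guards the bitmap update so that a case mapping which lands outside the bitmap is skipped instead of being written. A minimal standalone sketch of that guard follows. It is not the libc code: `struct bitmap_set`, `bmp_set`, `bmp_test` and `bmp_add_icase` are hypothetical names, and NC is assumed to be 256 here rather than taken from the regex headers.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

#define NC 256	/* assumption: number of single-byte character codes */

/* Hypothetical stand-in for the bitmap part of a regex character set. */
struct bitmap_set {
	unsigned char bmp[NC / 8];
};

/* Set the bit for ch only if it fits in the bitmap; return 1 if it was set. */
static int
bmp_set(struct bitmap_set *bs, wint_t ch)
{
	if ((unsigned long)ch >= NC)
		return 0;	/* out of range: leave the bitmap untouched */
	bs->bmp[ch >> 3] |= 1 << (ch & 7);
	return 1;
}

/* Return nonzero if ch is in range and its bit is set. */
static int
bmp_test(const struct bitmap_set *bs, wint_t ch)
{
	if ((unsigned long)ch >= NC)
		return 0;
	return (bs->bmp[ch >> 3] & (1 << (ch & 7))) != 0;
}

/* Add ch plus its lower- and uppercase forms, skipping any that exceed NC. */
static void
bmp_add_icase(struct bitmap_set *bs, wint_t ch)
{
	assert((unsigned long)ch < NC);
	bmp_set(bs, ch);
	bmp_set(bs, towlower(ch));
	bmp_set(bs, towupper(ch));
}
```

A caller would zero the structure with memset() and then call bmp_add_icase() for each member of the set. Characters rejected by bmp_set() would need separate handling, which this sketch leaves out.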

From: Kuang-che Wu <kcwu@csie.org>
To: freebsd-gnats-submit@freebsd.org
Cc:  
Subject: Re: bin/71367: regex multibyte support is really slow
Date: Sat, 4 Sep 2004 20:00:49 +0800

 On Sat, Sep 04, 2004 at 09:36:16PM +1000, Tim Robbins wrote:
 > On Sat, Sep 04, 2004 at 01:21:22PM +0200, Simon L. Nielsen wrote:
 
 > Do you have any non-standard options in /etc/make.conf? Have you changed
 > the C library at all locally? Can you confirm that the system you ran
 > this on was idle?
 The system is idle and the C library has not been changed locally.
 The only related option in /etc/make.conf is COMPAT4X=yes.
 
 > Could you please try this patch?
 I tested with the program below,
 without the patch:
 case 0: 0.000000s
 case 1: 7.390625s
 case 2: (matched)0.000000s
 case 3: 0.000000s
 case 4: (matched)0.125000s
 case 5: 0.000000s
 case 6: 7.398438s
 case 7: 0.000000s
 case 8: 0.000000s
 
 with the patch:
 case 0: 0.000000s
 case 1: 0.000000s
 case 2: (matched)0.000000s
 case 3: 0.000000s
 case 4: (matched)0.000000s
 case 5: 0.000000s
 case 6: 0.000000s
 case 7: 0.000000s
 case 8: 0.000000s
 
 --------------------------
 #include <stdio.h>
 #include <locale.h>
 #include <regex.h>
 #include <time.h>
 
 #define EN "blah"
 char en[1024]= EN EN EN EN EN EN EN EN EN EN;
 #define XX "!@#$"
 char xx[1024]= XX XX XX XX XX XX XX XX XX XX;
 char utf8[1024]={
 #define U8 0xe6,0x85,0xa2 // UTF-8 character
   U8, U8, U8, U8, U8, U8, U8, U8, U8, U8,
   U8, U8, U8, U8, U8, U8, U8, U8, U8, U8,
   0
 };
 char big5[1024]={
 #define B5 0xa6,0x72 // Big5 character
   B5, B5, B5, B5, B5, B5, B5, B5, B5, B5,
   B5, B5, B5, B5, B5, B5, B5, B5, B5, B5,
   0
 };
 struct T {
   char *locale,*pattern,*text;
   int flag;
 } test[]={
   { "C", "[[:alnum:]]", utf8, REG_EXTENDED|REG_ICASE },
   { "zh_TW.UTF-8", "[[:alnum:]]", utf8, REG_EXTENDED|REG_ICASE },
   { "zh_TW.UTF-8", "[[:alnum:]]", en, REG_EXTENDED|REG_ICASE },
   { "zh_TW.UTF-8", "[[:alnum:]]", xx, REG_EXTENDED|REG_ICASE },
   { "zh_TW.UTF-8", "[^[:alnum:]]", utf8, REG_EXTENDED|REG_ICASE },
   { "zh_TW.Big5", "[[:alnum:]]", big5, REG_EXTENDED|REG_ICASE },
   { "en_US.UTF-8", "[[:alnum:]]", utf8, REG_EXTENDED|REG_ICASE },
   { "en_US.UTF-8", "[A-Za-z0-9]", utf8, REG_ICASE },
   { "en_US.UTF-8", "[[:alnum:]]", utf8, REG_EXTENDED },
   { NULL, NULL, NULL, 0 }, /* sentinel: terminates the loop in main() */
 };
 int main(void)
 {
   int i;
   clock_t st;
   regex_t re;
 
   for(i=0; test[i].locale; i++) {
     printf("case %d: ",i);
     if(setlocale(LC_CTYPE,test[i].locale)==NULL)
       return 1;
 
     if(regcomp(&re,test[i].pattern,test[i].flag)!=0)
       return 2;
     st=clock();
     if(regexec(&re,test[i].text,0,NULL,0)==0)
       printf("(matched)");
     printf("%fs\n",(double)(clock()-st)/CLOCKS_PER_SEC);
     regfree(&re); /* release the compiled pattern before the next case */
   }
 
   return 0;
 }

From: "Simon L. Nielsen" <simon@FreeBSD.org>
To: Tim Robbins <tjr@freebsd.org>
Cc: freebsd-gnats-submit@FreeBSD.org, Kuang-che Wu <kcwu@csie.org>
Subject: Re: bin/71367: regex multibyte support is really slow
Date: Sat, 4 Sep 2004 14:08:37 +0200

 
 On 2004.09.04 21:36:16 +1000, Tim Robbins wrote:
 > On Sat, Sep 04, 2004 at 01:21:22PM +0200, Simon L. Nielsen wrote:
 > > On 2004.09.04 10:50:27 +0000, Tim Robbins wrote:
 > > >  > >How-To-Repeat:
 > > >  > 	$ cc -O -pipe   re.c  -o re
 > > >  > 	$ time ./re
 > > >  > 	        7.65 real         7.51 user         0.06 sys
 > > 
 > > >  I can't reproduce these results. I get:
 > > 
 > > I can, more or less.. :
 > > 
 > > [simon@zaphod:/tmp] cc -O -pipe   re.c  -o re
 > > [simon@zaphod:/tmp] /usr/bin/time ./re
 > >         3,85 real         3,54 user         0,12 sys
 > 
 > Could you please try this patch?
 
 After rebuilding and reinstalling libc:
 
 [simon@zaphod:libc] /usr/bin/time /tmp/re
         0.00 real         0.00 user         0.00 sys
 
 So it seems to work fine.
 
 -- 
 Simon L. Nielsen
 FreeBSD Documentation Team
 
State-Changed-From-To: open->analyzed 
State-Changed-By: tjr 
State-Changed-When: Sat Sep 4 12:17:44 GMT 2004 
State-Changed-Why:  
Problem has been investigated and a fix will be committed soon. 


Responsible-Changed-From-To: freebsd-bugs->tjr 
Responsible-Changed-By: tjr 
Responsible-Changed-When: Sat Sep 4 12:17:44 GMT 2004 
Responsible-Changed-Why:  
I am responsible for multibyte regex support. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=71367 

From: Kuang-che Wu <kcwu@csie.org>
To: Tim Robbins <tjr@freebsd.org>
Cc: "Simon L. Nielsen" <simon@freebsd.org>,
	freebsd-gnats-submit@freebsd.org
Subject: Re: bin/71367: regex multibyte support is really slow
Date: Sat, 4 Sep 2004 20:25:33 +0800

 
 On Sat, Sep 04, 2004 at 09:36:16PM +1000, Tim Robbins wrote:
 > -			cs->bmp[towlower(ch) >> 3] |= 1 << (towlower(ch) & 7);
 > -			cs->bmp[towupper(ch) >> 3] |= 1 << (towupper(ch) & 7);
 > +			if ((nch = towlower(ch)) < NC)
 > +				cs->bmp[nch >> 3] |= 1 << (nch & 7);
 > +			if ((nch = towupper(ch)) < NC)
 > +				cs->bmp[nch >> 3] |= 1 << (nch & 7);
 
 I think this is related to the locale database (possibly a bug there).
 
 toupper(0xb5)==0x39c
 0xb5 = micro sign
 0x39c = greek capital letter mu
 0x3bc = greek small letter mu
 
 

From: Tim Robbins <tjr@freebsd.org>
To: "Simon L. Nielsen" <simon@FreeBSD.org>
Cc: freebsd-gnats-submit@FreeBSD.org, Kuang-che Wu <kcwu@csie.org>
Subject: Re: bin/71367: regex multibyte support is really slow
Date: Sat, 4 Sep 2004 22:31:52 +1000

 On Sat, Sep 04, 2004 at 02:08:37PM +0200, Simon L. Nielsen wrote:
 > On 2004.09.04 21:36:16 +1000, Tim Robbins wrote:
 > > On Sat, Sep 04, 2004 at 01:21:22PM +0200, Simon L. Nielsen wrote:
 > > > On 2004.09.04 10:50:27 +0000, Tim Robbins wrote:
 > > > >  > >How-To-Repeat:
 > > > >  > 	$ cc -O -pipe   re.c  -o re
 > > > >  > 	$ time ./re
 > > > >  > 	        7.65 real         7.51 user         0.06 sys
 > > > 
 > > > >  I can't reproduce these results. I get:
 > > > 
 > > > I can, more or less.. :
 > > > 
 > > > [simon@zaphod:/tmp] cc -O -pipe   re.c  -o re
 > > > [simon@zaphod:/tmp] /usr/bin/time ./re
 > > >         3,85 real         3,54 user         0,12 sys
 > > 
 > > Could you please try this patch?
 > 
 > After rebuilding and reinstalling libc:
 > 
 > [simon@zaphod:libc] /usr/bin/time /tmp/re
 >         0.00 real         0.00 user         0.00 sys
 > 
 > So it seems to work fine.
 
 Thanks. The submitter also reported that the patch fixed the problem.
 I'll commit it to CVS HEAD shortly, and to RELENG_5 some time before 5.3-R
 if possible.
 
 
 Tim

From: Rong-En Fan <rafan@infor.org>
To: freebsd-gnats-submit@FreeBSD.org, kcwu@csie.org
Cc:  
Subject: Re: bin/71367: regex multibyte support is really slow
Date: Sat,  4 Sep 2004 21:17:54 +0800 (CST)

 I can reproduce this on my IBM X31.
 The CPU is running at 1400 MHz, and
 the system is idle. My src was
 updated about 4 hours ago.
 
 FreeBSD woodstock.csie.org 6.0-CURRENT FreeBSD 6.0-CURRENT #0: Sat Sep  4 20:16:41 CST 2004     root@woodstock.csie.org:/home/admin/usr/obj/home/admin/usr/src/sys/WOODSTOCK  i386
 
 $ cc -O -pipe re.c -o re
 $ time ./re
 begin 0
 end 0.00s
 begin 1
 end 5.78s
 ok
 
 real    0m5.841s
 user    0m5.694s
 sys     0m0.117s
 
 $ time ./re
 begin 0
 end 0.00s
 begin 1
 end 5.78s
 ok
 
 real    0m5.843s
 user    0m5.695s
 sys     0m0.116s
 
State-Changed-From-To: analyzed->closed 
State-Changed-By: tjr 
State-Changed-When: Sun Sep 5 08:32:33 GMT 2004 
State-Changed-Why:  
Fixed in -current. Thanks for the report. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=71367 
>Unformatted:
