From nobody@FreeBSD.org  Sat Sep 15 09:08:01 2007
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 85EB916A418
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 15 Sep 2007 09:08:01 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id 5E5E613C458
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 15 Sep 2007 09:08:01 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.1/8.14.1) with ESMTP id l8F981TM075110
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 15 Sep 2007 09:08:01 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.1/8.14.1/Submit) id l8F981jj075109;
	Sat, 15 Sep 2007 09:08:01 GMT
	(envelope-from nobody)
Message-Id: <200709150908.l8F981jj075109@www.freebsd.org>
Date: Sat, 15 Sep 2007 09:08:01 GMT
From: Petr Hroudny <petr.hroudny@gmail.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: isspace broken for UTF-8 locales
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         116363
>Category:       gnu
>Synopsis:       isspace broken for UTF-8 locales
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    rafan
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Sep 15 09:10:02 GMT 2007
>Closed-Date:    Wed Oct 24 14:33:12 UTC 2007
>Last-Modified:  Wed Dec 19 00:10:01 UTC 2007
>Originator:     Petr Hroudny
>Release:        6-stable, 7-current
>Organization:
>Environment:
>Description:
In UTF-8 locales, isspace(0xA0) returns 1, which is wrong.

In UTF-8, the byte 0xA0 can only be a continuation (non-initial) byte of a multibyte character; on its own it is never a space.

As a consequence, operations like str.upper() and/or str.split() break when
they encounter a UTF-8 character that contains the byte 0xA0.

An example of such a character is Scaron (U+0160, UTF-8 bytes 0xC5 0xA0).
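
The byte-level claim can be checked without any locale support. The following is a minimal sketch, not code from FreeBSD or from the submitter; the helper names utf8_is_continuation and ascii_only_isspace are hypothetical, and the second one only illustrates the kind of application-side guard that avoids the problem.

```c
#include <ctype.h>

/* Nonzero if b has the bit pattern 10xxxxxx, i.e. it is a UTF-8
 * continuation byte and can never be a complete character. */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical application-side guard (an assumption, not the libc fix):
 * consult isspace() only for ASCII bytes, so that bytes belonging to a
 * multibyte sequence are never treated as separators. */
static int ascii_only_isspace(unsigned char b)
{
    return b < 0x80 && isspace(b) != 0;
}
```

With these helpers, utf8_is_continuation(0xA0) is nonzero and ascii_only_isspace(0xA0) is 0, while ascii_only_isspace(' ') is still 1.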
>How-To-Repeat:

>Fix:
For UTF-8 locales, 0xA0 should never be considered to be a space.

>Release-Note:
>Audit-Trail:

From: Andrey Chernov <ache@nagual.pp.ru>
To: Petr Hroudny <petr.hroudny@gmail.com>
Cc: freebsd-gnats-submit@FreeBSD.ORG, jkoshy@FreeBSD.ORG, perky@FreeBSD.ORG,
        i18n@FreeBSD.ORG
Subject: Re: gnu/116363: isspace broken for UTF-8 locales
Date: Sun, 16 Sep 2007 12:54:33 +0400

 On Sat, Sep 15, 2007 at 09:08:01AM +0000, Petr Hroudny wrote:
 > 
 > In UTF-8 locales, isspace(0xA0) returns 1 which is wrong.
 > 
 > In UTF-8, 0xA0 could only be the second or third byte of multibyte character, but never a space.
 > 
 > As a consequence, operations like str.upper() and/or str.split() are broken, when
 > UTF-8 character with 0xA0 byte is encountered.
 
 It seems that our UTF-8.src is completely wrong: it describes plain Unicode
 code points, not UTF-8, in which multibyte sequences may only start with
 a byte in one of the ranges
 C2-DF
 E0-EF
 F0-F4
 (as stated, for example, in http://en.wikipedia.org/wiki/UTF-8).
 Can anybody write a replacement for it?
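
The lead-byte ranges listed above can be written as a small classifier. This is only an illustrative sketch under the assumption that ASCII bytes (00-7F) count as one-byte sequences; the function name utf8_sequence_length is hypothetical and is not part of libc or of UTF-8.src.

```c
/* Expected total length of a UTF-8 sequence given its first byte,
 * or 0 if the byte cannot start a sequence. */
static int utf8_sequence_length(unsigned char lead)
{
    if (lead <= 0x7F)
        return 1;               /* ASCII, single byte */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;               /* two-byte sequence */
    if (lead >= 0xE0 && lead <= 0xEF)
        return 3;               /* three-byte sequence */
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;               /* four-byte sequence */
    return 0;                   /* 80-C1 and F5-FF never start a sequence */
}
```

For the bytes from the report, utf8_sequence_length(0xC5) is 2 and utf8_sequence_length(0xA0) is 0, so 0xA0 can only appear after a lead byte.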
 
 -- 
 http://ache.pp.ru/

From: Hye-Shik Chang <perky@FreeBSD.ORG>
To: Andrey Chernov <ache@nagual.pp.ru>, Petr Hroudny <petr.hroudny@gmail.com>,
        freebsd-gnats-submit@FreeBSD.ORG, jkoshy@FreeBSD.ORG, i18n@FreeBSD.ORG
Cc:  
Subject: Re: gnu/116363: isspace broken for UTF-8 locales
Date: Mon, 17 Sep 2007 01:22:14 +0900

 On Sun, Sep 16, 2007 at 12:54:33PM +0400, Andrey Chernov wrote:
 > On Sat, Sep 15, 2007 at 09:08:01AM +0000, Petr Hroudny wrote:
 > > 
 > > In UTF-8 locales, isspace(0xA0) returns 1 which is wrong.
 > > 
 > > In UTF-8, 0xA0 could only be the second or third byte of multibyte character, but never a space.
 > > 
 > > As a consequence, operations like str.upper() and/or str.split() are broken, when
 > > UTF-8 character with 0xA0 byte is encountered.
 
 If you are talking about Python's str.split(), the problem is due
 to a libc bug (or feature) of ours that has been described many times before,
 and Python already includes a workaround for it:
 http://mail.python.org/pipermail/python-checkins/2004-August/042343.html
 
 > It seems that our UTF-8.src is completely wrong, it is just plain Unicode 
 > and not UTF-8 which multibyte values should start from
 > C2-DF
 > E0-EF
 > F0-F4
 > only (as stated in http://en.wikipedia.org/wiki/UTF-8 f.e.)
 > Can anybody write replacement for it?
 
 In fact, UTF-8.src defines values not for UTF-8 bytes but for Unicode code points.
 Using the Unicode code point as wchar_t's internal representation has
 many benefits.  I think it would be better to make isspace() and the
 other ctype functions aware of the encoding.  IIRC, tjr@ provided the
 workaround at the URL mentioned above and said in 2004 that it would
 have a chance of being fixed in 6 or 7.
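
The difference between a byte value and a code point can be made concrete by decoding by hand. The sketch below is an illustration only; utf8_decode2 is a hypothetical helper that handles just two-byte sequences. It assumes, as this thread implies, that the table classifies code point U+00A0 (NO-BREAK SPACE) as a space, which is harmless for wide characters but wrong when the same table entry is consulted for the raw byte 0xA0.

```c
#include <stdint.h>

/* Decode a two-byte UTF-8 sequence into a Unicode code point.
 * Returns 0xFFFFFFFF if the bytes do not form a valid two-byte sequence. */
static uint32_t utf8_decode2(unsigned char b0, unsigned char b1)
{
    if (b0 < 0xC2 || b0 > 0xDF || (b1 & 0xC0) != 0x80)
        return 0xFFFFFFFFu;
    /* 110xxxxx 10yyyyyy -> xxxxxyyyyyy */
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}
```

Here utf8_decode2(0xC5, 0xA0) yields 0x160 (Scaron), whereas the code point 0xA0 only results from the sequence 0xC2 0xA0. The byte 0xA0 in the first case is therefore part of a letter, not a space.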
 
 Hye-Shik

From: Andrey Chernov <ache@nagual.pp.ru>
To: Hye-Shik Chang <perky@FreeBSD.org>
Cc: Petr Hroudny <petr.hroudny@gmail.com>, freebsd-gnats-submit@FreeBSD.org,
        jkoshy@FreeBSD.org, i18n@FreeBSD.org
Subject: Re: gnu/116363: isspace broken for UTF-8 locales
Date: Sun, 16 Sep 2007 20:34:07 +0400

 On Mon, Sep 17, 2007 at 01:22:14AM +0900, Hye-Shik Chang wrote:
 > In fact, UTF-8.src defines values for not UTF-8 but Unicode codepoints.
 > Using the Unicode codepoint as wchar_t's internal representation gives
 > much benefit.  I think we would be better to make isspace() and
 > other ctypes functions aware of "encoding".  IIRC, tjr@ provided the
 > workaround as in the URL mentioned above and said that it would get
 > a chance to be fixed in 6 or 7 on 2004.
 
 Currently wchar_t represents the given encoding everywhere, including the
 wc<->mbr conversions. To make it UCS-4-only instead, we would need to rewrite the
 whole locale system from scratch, and I see no benefit in going that way.
 No simple workaround exists.
 
 In any case, there is no excuse for letting what is really a UCS-4.src pose as
 UTF-8.src. Providing a proper UTF-8.src is much less painful than rewriting the
 whole locale system, and I am almost halfway through converting the UCS-4 source to it.
 
 -- 
 http://ache.pp.ru/

From: "=?UTF-8?Q?Petr_Hroudn=C3=BD?=" <petr.hroudny@gmail.com>
To: "Hye-Shik Chang" <perky@freebsd.org>
Cc: "Andrey Chernov" <ache@nagual.pp.ru>, freebsd-gnats-submit@freebsd.org, 
	jkoshy@freebsd.org, i18n@freebsd.org
Subject: Re: gnu/116363: isspace broken for UTF-8 locales
Date: Mon, 17 Sep 2007 10:35:52 +0200

 2007/9/16, Hye-Shik Chang <perky@freebsd.org>:
 
 > If you are saying about Python's str.split(), the problem is due
 > to our libc bug (or feature) which is described many times before,
 > and Python already includes a workaround for the problem.
 > http://mail.python.org/pipermail/python-checkins/2004-August/042343.html
 
 I ran into this problem when using mutt, which uses isspace() to
 separate tokens in, e.g., the list of recipients. Then I found the
 workaround for Python, which says this problem should be fixed in
 FreeBSD 6, but it is still present even in 7-current.
 I do believe it would be better to fix isspace() than to introduce
 workarounds into every application.
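
The effect on byte-wise tokenizing can be reproduced without depending on any particular libc. This is a hedged sketch, not mutt's code: count_tokens, reported_isspace and ascii_isspace are hypothetical names, and reported_isspace merely imitates the reported behaviour by treating byte 0xA0 as a separator.

```c
#include <stddef.h>

/* Stand-in for the behaviour reported in this PR: byte 0xA0 counts as space. */
static int reported_isspace(int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == 0xA0;
}

/* ASCII-only separator test: bytes >= 0x80 are never separators. */
static int ascii_isspace(int c)
{
    return c == ' ' || c == '\t' || c == '\n';
}

/* Count the tokens in s, where tokens are maximal runs of bytes for which
 * is_sep() returns zero. */
static size_t count_tokens(const char *s, int (*is_sep)(int))
{
    size_t n = 0;
    int in_token = 0;

    for (; *s != '\0'; s++) {
        if (is_sep((unsigned char)*s)) {
            in_token = 0;       /* separator ends the current token */
        } else if (!in_token) {
            in_token = 1;       /* first byte of a new token */
            n++;
        }
    }
    return n;
}
```

For the single word "\xC5\xA0koda", count_tokens with reported_isspace gives 2 because the word is cut in the middle of its first character, while ascii_isspace gives the expected 1.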
 
 Regards, Petr
State-Changed-From-To: open->patched 
State-Changed-By: ache 
State-Changed-When: Mon Oct 22 22:23:22 UTC 2007 
State-Changed-Why:  
Fixed in -current 

http://www.freebsd.org/cgi/query-pr.cgi?pr=116363 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: gnu/116363: commit references a PR
Date: Wed, 24 Oct 2007 14:29:39 +0000 (UTC)

 rafan       2007-10-24 14:29:32 UTC
 
   FreeBSD src repository
 
   Modified files:        (Branch: RELENG_7)
     include              _ctype.h ctype.h 
     lib/libc/locale      Symbol.map big5.c euc.c gb18030.c 
                          gb2312.c gbk.c isctype.c mskanji.c none.c 
                          setrunelocale.c utf8.c 
     share/mklocale       UTF-8.src 
   Log:
   MFC ctype(3) fix for UTF-8 locale. See original commit log for details.
   
   PR:             116363
   Reported by:    Petr Hroudny <petr.hroudny at gmail.com>
   Patched by:     ache
   Reviewed by:    i18n@
   Approved by:    re (kensmith)
   OK-ed by:       portmgr
   
   Revision   Changes    Path
   1.30.10.1  +34 -1     src/include/_ctype.h
   1.28.18.1  +21 -21    src/include/ctype.h
   1.3.2.1    +5 -0      src/lib/libc/locale/Symbol.map
   1.17.10.1  +3 -0      src/lib/libc/locale/big5.c
   1.21.10.1  +3 -0      src/lib/libc/locale/euc.c
   1.7.10.1   +3 -0      src/lib/libc/locale/gb18030.c
   1.9.10.1   +3 -0      src/lib/libc/locale/gb2312.c
   1.13.2.1   +3 -0      src/lib/libc/locale/gbk.c
   1.10.2.1   +19 -19    src/lib/libc/locale/isctype.c
   1.17.10.1  +3 -0      src/lib/libc/locale/mskanji.c
   1.14.2.1   +6 -1      src/lib/libc/locale/none.c
   1.46.2.1   +5 -0      src/lib/libc/locale/setrunelocale.c
   1.14.2.1   +8 -0      src/lib/libc/locale/utf8.c
   1.2.2.1    +3 -0      src/share/mklocale/UTF-8.src
 _______________________________________________
 cvs-all@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/cvs-all
 To unsubscribe, send any mail to "cvs-all-unsubscribe@freebsd.org"
 
State-Changed-From-To: patched->closed 
State-Changed-By: rafan 
State-Changed-When: Wed Oct 24 14:32:49 UTC 2007 
State-Changed-Why:  
MFC'ed in 6.x and 7.x 


Responsible-Changed-From-To: freebsd-bugs->rafan 
Responsible-Changed-By: rafan 
Responsible-Changed-When: Wed Oct 24 14:32:49 UTC 2007 
Responsible-Changed-Why:  
Track. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=116363 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: gnu/116363: commit references a PR
Date: Wed, 24 Oct 2007 14:32:41 +0000 (UTC)

 rafan       2007-10-24 14:32:33 UTC
 
   FreeBSD src repository
 
   Modified files:        (Branch: RELENG_6)
     include              _ctype.h ctype.h 
     lib/libc/locale      big5.c euc.c gb18030.c gb2312.c gbk.c 
                          isctype.c mskanji.c none.c 
                          setrunelocale.c utf8.c 
     share/mklocale       UTF-8.src 
   Log:
   MFC ctype(3) fix for UTF-8 locale. See original commit log for details.
   
   PR:             116363
   Reported by:    Petr Hroudny <petr.hroudny at gmail.com>
   Patched by:     ache
   Reviewed by:    i18n@
   Approved by:    re (kensmith)
   OK-ed by:       portmgr
   
   Revision  Changes    Path
   1.30.2.1  +34 -1     src/include/_ctype.h
   1.28.8.1  +21 -21    src/include/ctype.h
   1.17.2.1  +3 -0      src/lib/libc/locale/big5.c
   1.21.2.1  +3 -0      src/lib/libc/locale/euc.c
   1.7.2.1   +3 -0      src/lib/libc/locale/gb18030.c
   1.9.2.1   +3 -0      src/lib/libc/locale/gb2312.c
   1.12.2.1  +3 -0      src/lib/libc/locale/gbk.c
   1.9.14.1  +19 -19    src/lib/libc/locale/isctype.c
   1.17.2.1  +3 -0      src/lib/libc/locale/mskanji.c
   1.13.2.1  +6 -1      src/lib/libc/locale/none.c
   1.45.2.1  +5 -0      src/lib/libc/locale/setrunelocale.c
   1.13.2.2  +8 -0      src/lib/libc/locale/utf8.c
   1.1.8.2   +3 -0      src/share/mklocale/UTF-8.src
 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: gnu/116363: commit references a PR
Date: Wed, 21 Nov 2007 01:31:56 +0000 (UTC)

 rafan       2007-11-21 01:31:49 UTC
 
   FreeBSD src repository
 
   Modified files:        (Branch: RELENG_6)
     include              _ctype.h ctype.h 
     lib/libc/locale      big5.c euc.c gb18030.c gb2312.c gbk.c 
                          isctype.c mskanji.c none.c 
                          setrunelocale.c utf8.c 
     sys/sys              param.h 
   Log:
   - Back out the previous ctype(3) fix for the UTF-8 locale because forward ABI
     compatibility is still broken: the fix adds new symbols to libc. Those
     symbols are __sbmaskrune, __sbistype, __sbtoupper and __sbtolower.
     The latter three are referenced directly by binaries that use the ctype(3)
     family of functions (see include/ctype.h for details). This means that a
     binary built on 6.3 uses these symbols, which are not available on older
     systems.
   - As this has been in 6 for a month, I intentionally leave these symbols
     in libc but map them to the original versions. So binaries built after 602113
     will not be broken by this commit.
   - Bump __FreeBSD_version for this back-out
   
   PR:             116363
   Discussed with: kris, kensmith
   Approved by:    re (kensmith)
   
   Revision    Changes    Path
   1.30.2.3    +5 -29     src/include/_ctype.h
   1.28.8.2    +21 -21    src/include/ctype.h
   1.17.2.2    +0 -3      src/lib/libc/locale/big5.c
   1.21.2.2    +0 -3      src/lib/libc/locale/euc.c
   1.7.2.2     +0 -3      src/lib/libc/locale/gb18030.c
   1.9.2.2     +0 -3      src/lib/libc/locale/gb2312.c
   1.12.2.2    +0 -3      src/lib/libc/locale/gbk.c
   1.9.14.2    +19 -19    src/lib/libc/locale/isctype.c
   1.17.2.2    +0 -3      src/lib/libc/locale/mskanji.c
   1.13.2.2    +1 -6      src/lib/libc/locale/none.c
   1.45.2.2    +0 -5      src/lib/libc/locale/setrunelocale.c
   1.13.2.3    +0 -8      src/lib/libc/locale/utf8.c
   1.244.2.32  +1 -1      src/sys/sys/param.h
 

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: gnu/116363: commit references a PR
Date: Wed, 19 Dec 2007 00:04:59 +0000 (UTC)

 beech       2007-12-19 00:04:50 UTC
 
   FreeBSD ports repository
 
   Modified files:
     www/horde-base       Makefile 
   Added files:
     www/horde-base/files patch-lib_Horde_NLS.php 
   Log:
   - Fix bug "isspace broken for UTF-8 locales."
     Causes Japanese characters to display improperly.
   
   PR:             ports/116363
   Submitted by:   Hiromi Kimura <hiromi@tac.tsukuba.ac.jp>
   Approved by:    linimon (mentor)
   
   Revision  Changes    Path
   1.61      +1 -0      ports/www/horde-base/Makefile
   1.1       +13 -0     ports/www/horde-base/files/patch-lib_Horde_NLS.php (new)
 
>Unformatted:
