From jr@opal.com  Thu Jul 13 14:51:13 2006
Return-Path: <jr@opal.com>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 5E26316A4E6
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 13 Jul 2006 14:51:13 +0000 (UTC)
	(envelope-from jr@opal.com)
Received: from smtp.vzavenue.net (smtp.vzavenue.net [66.171.59.140])
	by mx1.FreeBSD.org (Postfix) with ESMTP id C9C0543D70
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 13 Jul 2006 14:51:06 +0000 (GMT)
	(envelope-from jr@opal.com)
Received: from 118.79.171.66.subscriber.vzavenue.net (HELO linwhf.opal.com) ([66.171.79.118])
  by smtp.vzavenue.net with ESMTP; 13 Jul 2006 10:51:05 -0400
Received: from linwhf.opal.com (localhost [127.0.0.1])
	by linwhf.opal.com (8.13.6/8.13.6) with ESMTP id k6DEp4OH093702
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 13 Jul 2006 10:51:04 -0400 (EDT)
	(envelope-from jr@opal.com)
Received: from 127.0.0.1 ([127.0.0.1] helo=linwhf.opal.com) by ASSP-nospam;
	13 Jul 2006 10:51:04 -0400
Received: (from jr@localhost)
	by linwhf.opal.com (8.13.6/8.13.6/Submit) id k6DEp4Gq093701;
	Thu, 13 Jul 2006 10:51:04 -0400 (EDT)
	(envelope-from jr)
Message-Id: <200607131451.k6DEp4Gq093701@linwhf.opal.com>
Date: Thu, 13 Jul 2006 10:51:04 -0400 (EDT)
From: "J.R. Oldroyd" <fbsd@opal.com>
Reply-To: "J.R. Oldroyd" <fbsd@opal.com>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: UTF-8 zero-width character patch
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         100212
>Category:       misc
>Synopsis:       UTF-8 zero-width character patch
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    jkoshy
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Thu Jul 13 15:00:30 GMT 2006
>Closed-Date:    Sun Jul 30 07:43:15 GMT 2006
>Last-Modified:  Sun Jul 30 07:43:15 GMT 2006
>Originator:     J.R. Oldroyd
>Release:        FreeBSD 6.1-STABLE i386
>Organization:
>Environment:
System: FreeBSD linwhf.opal.com 6.1-STABLE FreeBSD 6.1-STABLE #1: Thu May 18 16:03:24 EDT 2006 xxx@linwhf.opal.com:/usr/obj/usr/src/sys/LINWHF i386
>Description:
This patch makes the zero-width (non-spacing, or overstriking)
characters in the UTF-8 locale definition actually zero-width.
At present these characters are assigned a width of 1, which is
wrong.  They should have a width of 0.

>How-To-Repeat:
Save this file:
	http://opal.com/freebsd/unicode/utf8demo.txt

On an xterm, cat the file and examine the "Combining characters"
and the "Thai (UCS Level 2)" sections.  Without the patch, the
non-spacing characters do not overstrike the previous character.
With the patch, they do.

This patch has been posted to -current and downloaded and
reviewed many times following that posting:
	http://lists.freebsd.org/pipermail/freebsd-current/2006-June/064218.html

>Fix:
--- /usr/src/share/mklocale/UTF-8.src.orig	Sat Mar 27 03:14:14 2004
+++ /usr/src/share/mklocale/UTF-8.src	Mon Jun 26 23:15:34 2006
@@ -487,9 +487,9 @@
  * U+0300 - U+036F : Combining Diacritical Marks
  */
 
-GRAPH     0x0300 - 0x034f  0x0360 - 0x036f
-PRINT     0x0300 - 0x034f  0x0360 - 0x036f
-SWIDTH1   0x0300 - 0x034f  0x0360 - 0x036f
+GRAPH     0x0300 - 0x036f
+PRINT     0x0300 - 0x036f
+SWIDTH0   0x0300 - 0x036f
 
 MAPUPPER  < 0x0345 0x0399 >
 
@@ -593,7 +593,8 @@
 UPPER     0x04e2  0x04e4  0x04e6  0x04e8  0x04ea  0x04ec  0x04ee
 UPPER     0x04f0  0x04f2  0x04f4  0x04f8
 PRINT     0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
-SWIDTH1   0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
+SWIDTH1   0x0400 - 0x0482  0x048a - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
+SWIDTH0   0x0483 - 0x0486  0x0488 - 0x0489
 
 MAPUPPER  < 0x0430 - 0x044f : 0x0410 >
 MAPUPPER  < 0x0450 - 0x045f : 0x0400 >
@@ -1016,7 +1017,8 @@
 GRAPH     0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
 PUNCT     0x0e3f  0x0e4f  0x0e5a  0x0e5b
 PRINT     0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
-SWIDTH1   0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
+SWIDTH0   0x0e31  0x0e34 - 0x0e3a  0x0e47 - 0x0e4e
+SWIDTH1   0x0e01 - 0x0e30  0x0e32 - 0x0e33  0x0e3f - 0x0e46  0x0e4f - 0x0e5b
 
 
 /*
@@ -1647,9 +1649,9 @@
  * U+20D0 - U+20FF : Combining Diacritical Marks for Symbols
  */
 
-GRAPH     0x20d0 - 0x20ea
-PRINT     0x20d0 - 0x20ea
-SWIDTH1   0x20d0 - 0x20ea
+GRAPH     0x20d0 - 0x20ff
+PRINT     0x20d0 - 0x20ff
+SWIDTH0   0x20d0 - 0x20ff
 
 
 /*
@@ -1927,7 +1929,8 @@
 PUNCT     0x309b  0x309c
 PRINT     0x3041 - 0x3096  0x3099 - 0x309f
 PHONOGRAM 0x3041 - 0x3096  0x309f
-SWIDTH2   0x3041 - 0x3096  0x3099 - 0x309f
+SWIDTH2   0x3041 - 0x3096  0x309b - 0x309f
+SWIDTH0   0x3099 - 0x309a
 
 
 /*
@@ -2149,9 +2152,9 @@
  * U+FE20 - U+FE2F : Combining Half Marks
  */
 
-GRAPH     0xfe20 - 0xfe23
-PRINT     0xfe20 - 0xfe23
-SWIDTH1   0xfe20 - 0xfe23
+GRAPH     0xfe20 - 0xfe2f
+PRINT     0xfe20 - 0xfe2f
+SWIDTH0   0xfe20 - 0xfe2f
 
 
 /*
@@ -2272,7 +2275,8 @@
 PUNCT     0x1d100 - 0x1d126  0x1d12a - 0x1d164  0x1d16a - 0x1d16c
 PUNCT     0x1d183  0x1d184  0x1d18c - 0x1d1a9  0x1d1ae - 0x1d1dd
 PRINT     0x1d100 - 0x1d126  0x1d12a - 0x1d172  0x1d17b - 0x1d1dd
-SWIDTH1   0x1d100 - 0x1d126  0x1d12a - 0x1d172  0x1d17b - 0x1d1dd
+SWIDTH1   0x1d100 - 0x1d126  0x1d12a - 0x1d164  0x1d16a - 0x1d172  0x1d183  0x1d184  0x1d18c - 0x1d1a9  0x1d1ae - 0x1d1dd
+SWIDTH0   0x1d165 - 0x1d169  0x1d17b - 0x1d182  0x1d185 - 0x1d18b  0x1d1aa - 0x1d1ad
 
 
 /*
>Release-Note:
>Audit-Trail:

From: "J.R. Oldroyd" <fbsd@opal.com>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: misc/100212: UTF-8 zero-width character patch
Date: Thu, 13 Jul 2006 12:22:48 -0400

 Sorry, the correct URL for the demo file is:
 
 http://opal.com/freebsd/utf8demo.txt
Responsible-Changed-From-To: freebsd-bugs->jkoshy 
Responsible-Changed-By: jkoshy 
Responsible-Changed-When: Sun Jul 16 11:36:56 UTC 2006 
Responsible-Changed-Why:  
Take this PR. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=100212 

From: jkoshy@FreeBSD.ORG (Joseph Koshy)
To: freebsd-gnats-submit@freebsd.org
Cc: jkoshy@FreeBSD.ORG
Subject: Re: misc/100212: UTF-8 zero-width character patch
Date: Tue, 25 Jul 2006 18:19:08 +0000 (UTC)

 Reviewing this change against the Unicode 4.1 charts available
 on the web:
 
  - U+034F COMBINING GRAPHEME JOINER is neither graphic nor
    printable.  
 
    I'm unsure whether SWIDTH0 would be appropriate here.
 
  - The Cyrillic range now defines code points 0x04F6 & 0x04F7.
   
  - There's a new range U+1DC0...U+1DFF "Combining Diacritical Marks
    Supplement", all of which are combining characters. 
 
 The attached patch is based on your patch but attempts to incorporate
 fixes for these.  It also avoids defining properties for code points
 that are marked 'reserved' in the standard.
 
 Could you please review?
 
 Index: UTF-8.src
 ===================================================================
 RCS file: /cvs/FreeBSD/src/share/mklocale/UTF-8.src,v
 retrieving revision 1.1
 diff -u -r1.1 UTF-8.src
 --- UTF-8.src	27 Mar 2004 08:14:14 -0000	1.1
 +++ UTF-8.src	25 Jul 2006 18:14:17 -0000
 @@ -487,9 +487,9 @@
   * U+0300 - U+036F : Combining Diacritical Marks
   */
  
 -GRAPH     0x0300 - 0x034f  0x0360 - 0x036f
 -PRINT     0x0300 - 0x034f  0x0360 - 0x036f
 -SWIDTH1   0x0300 - 0x034f  0x0360 - 0x036f
 +GRAPH     0x0300 - 0x034E  0x0350 - 0x036f
 +PRINT     0x0300 - 0x034E  0x0350 - 0x036f
 +SWIDTH0   0x0300 - 0x034E  0x0350 - 0x036f
  
  MAPUPPER  < 0x0345 0x0399 >
  
 @@ -579,7 +579,7 @@
  LOWER     0x04c8  0x04ca  0x04cc  0x04ce  0x04d1  0x04d3  0x04d5
  LOWER     0x04d7  0x04d9  0x04db  0x04dd  0x04df  0x04e1  0x04e3
  LOWER     0x04e5  0x04e7  0x04e9  0x04eb  0x04ed  0x04ef  0x04f1
 -LOWER     0x04f3  0x04f5  0x04f9
 +LOWER     0x04f3  0x04f5  0x04f7  0x04f9
  PUNCT     0x0482
  UPPER     0x0400 - 0x042f  0x0460  0x0462  0x0464  0x0466  0x0468
  UPPER     0x046a  0x046c  0x046e  0x0470  0x0472  0x0474  0x0476
 @@ -591,9 +591,10 @@
  UPPER     0x04c5  0x04c7  0x04c9  0x04cb  0x04cd  0x04d0  0x04d2
  UPPER     0x04d4  0x04d6  0x04d8  0x04da  0x04dc  0x04de  0x04e0
  UPPER     0x04e2  0x04e4  0x04e6  0x04e8  0x04ea  0x04ec  0x04ee
 -UPPER     0x04f0  0x04f2  0x04f4  0x04f8
 -PRINT     0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
 -SWIDTH1   0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f5  0x04f8  0x04f9
 +UPPER     0x04f0  0x04f2  0x04f4  0x04f6  0x04f8
 +PRINT     0x0400 - 0x0486  0x0488 - 0x04ce  0x04d0 - 0x04f9
 +SWIDTH0   0x0483 - 0x0486  0x0488 - 0x0489
 +SWIDTH1   0x0400 - 0x0482  0x048a - 0x04ce  0x04d0 - 0x04f9
  
  MAPUPPER  < 0x0430 - 0x044f : 0x0410 >
  MAPUPPER  < 0x0450 - 0x045f : 0x0400 >
 @@ -667,6 +668,7 @@
  MAPUPPER  < 0x04f1 0x04f0 >
  MAPUPPER  < 0x04f3 0x04f2 >
  MAPUPPER  < 0x04f5 0x04f4 >
 +MAPUPPER  < 0x04f7 0x04f6 >
  MAPUPPER  < 0x04f9 0x04f8 >
  MAPLOWER  < 0x0400 - 0x040f : 0x0450 >
  MAPLOWER  < 0x0410 - 0x042f : 0x0430 >
 @@ -740,6 +742,7 @@
  MAPLOWER  < 0x04f0 0x04f1 >
  MAPLOWER  < 0x04f2 0x04f3 >
  MAPLOWER  < 0x04f4 0x04f5 >
 +MAPLOWER  < 0x04f6 0x04f7 >
  MAPLOWER  < 0x04f8 0x04f9 >
  
  
 @@ -1016,7 +1019,8 @@
  GRAPH     0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
  PUNCT     0x0e3f  0x0e4f  0x0e5a  0x0e5b
  PRINT     0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
 -SWIDTH1   0x0e01 - 0x0e3a  0x0e3f - 0x0e5b
 +SWIDTH0   0x0e31   0x0e34 - 0x0e3a  0x0e47 - 0x0e4e
 +SWIDTH1   0x0e01 - 0x0e30  0x0e32 - 0x0e33  0x0e3f - 0x0e46  0x0e4f - 0x0e5b
  
  
  /*
 @@ -1229,6 +1233,15 @@
  
  
  /*
 + * U+1DC0 - U+1DFF : Combining Diacritical Marks Supplement
 + */
 +
 +GRAPH     0x1DC0 - 0x1DC3
 +PRINT     0x1DC0 - 0x1DC3
 +SWIDTH0   0x1DC0 - 0x1DC3
 +
 +
 +/*
   * U+1E00 - U+1EFF : Latin Extended Additional
   */
  
 @@ -1647,9 +1660,9 @@
   * U+20D0 - U+20FF : Combining Diacritical Marks for Symbols
   */
  
 -GRAPH     0x20d0 - 0x20ea
 -PRINT     0x20d0 - 0x20ea
 -SWIDTH1   0x20d0 - 0x20ea
 +GRAPH     0x20d0 - 0x20eb
 +PRINT     0x20d0 - 0x20eb
 +SWIDTH0   0x20d0 - 0x20eb
  
  
  /*
 @@ -1927,7 +1940,8 @@
  PUNCT     0x309b  0x309c
  PRINT     0x3041 - 0x3096  0x3099 - 0x309f
  PHONOGRAM 0x3041 - 0x3096  0x309f
 -SWIDTH2   0x3041 - 0x3096  0x3099 - 0x309f
 +SWIDTH0   0x3099 - 0x309a
 +SWIDTH2   0x3041 - 0x3096  0x309b - 0x309f
  
  
  /*
 @@ -2151,7 +2165,7 @@
  
  GRAPH     0xfe20 - 0xfe23
  PRINT     0xfe20 - 0xfe23
 -SWIDTH1   0xfe20 - 0xfe23
 +SWIDTH0   0xfe20 - 0xfe23
  
  
  /*
 @@ -2271,8 +2285,13 @@
  GRAPH     0x1d100 - 0x1d126  0x1d12a - 0x1d172  0x1d17b - 0x1d1dd
  PUNCT     0x1d100 - 0x1d126  0x1d12a - 0x1d164  0x1d16a - 0x1d16c
  PUNCT     0x1d183  0x1d184  0x1d18c - 0x1d1a9  0x1d1ae - 0x1d1dd
 -PRINT     0x1d100 - 0x1d126  0x1d12a - 0x1d172  0x1d17b - 0x1d1dd
 -SWIDTH1   0x1d100 - 0x1d126  0x1d12a - 0x1d172  0x1d17b - 0x1d1dd
 +PRINT     0x1d100 - 0x1d126  0x1d12a - 0x1d158  0x1d15a - 0x1d172
 +PRINT     0x1d17b - 0x1d1dd
 +SWIDTH0   0x1d165 - 0x1d169  0x1d16d - 0x1d172  0x1d17b - 0x1d182
 +SWIDTH0   0x1d185 - 0x1d18b  0x1d1aa - 0x1d1ad
 +SWIDTH1   0x1d100 - 0x1d126  0x1d12a - 0x1d158  0x1d15a - 0x1d164
 +SWIDTH1   0x1d16a - 0x1d16c  0x1d183   0x1d184  0x1d18c - 0x1d1a9
 +SWIDTH1   0x1d1ae - 0x1d1dd
  
  
  /*

From: "J.R. Oldroyd" <fbsd@opal.com>
To: Joseph Koshy <jkoshy@FreeBSD.ORG>
Cc: freebsd-gnats-submit@FreeBSD.ORG, bug-followup@FreeBSD.ORG
Subject: Re: misc/100212: UTF-8 zero-width character patch
Date: Wed, 26 Jul 2006 22:37:07 -0400

 On Tue, 25 Jul 2006 18:19:08 +0000 (UTC), Joseph Koshy wrote:
 > Reviewing this change against the Unicode 4.1 charts available
 > on the web:
 >  
 >   - U+034F COMBINING GRAPHEME JOINER is neither graphic nor
 >     printable.  
 >  
 >     I'm unsure whether SWIDTH0 would be appropriate here.
 >  
 >   - The Cyrillic range now defines code points 0x04F6 & 0x04F7.
 >    
 >   - There's a new range U+1DC0...U+1DFF "Combining Diacritical Marks
 >     Supplement", all of which are combining characters. 
 >  
 > The attached patch is based on your patch but attempts to incorporate
 > fixes for these.  It also avoids defining properties for code points
 > that are marked 'reserved' in the standard.
 > 
 > Could you please review?
 
 In answer to your points:
 
 - I think SWIDTH0 is correct for U+034F since it has no glyph
   and therefore it cannot create space.  I agree this isn't
   totally clear, but 0 seems better than 1 to me.
 
 - OK on the Cyrillic additions.
 
 - OK on the Combining Diacritical Marks Supplement additions.
 
 I'm also OK on your reducing the listed ranges where characters
 are not yet defined, e.g., 0x20d0 - 0x20eb rather than the larger
 reserved set 0x20d0 - 0x20ff, etc.
 
 	-jr
 
 PS: Please CC me on any reply, so I see it sooner.

From: jkoshy@FreeBSD.ORG (Joseph Koshy)
To: "J.R. Oldroyd" <fbsd@opal.com>
Cc: Joseph Koshy <jkoshy@FreeBSD.ORG>, freebsd-gnats-submit@FreeBSD.ORG,
Subject: Re: misc/100212: UTF-8 zero-width character patch 
Date: Fri, 28 Jul 2006 05:59:16 +0000 (UTC)

 > - I think SWIDTH0 is correct for U+034F since it has no glyph
 >   and therefore it cannot create space.  I agree this isn't
 >   totally clear, but 0 seems better than 1 to me.
 
 Marking the code point as non-printable appears to be even better
 than using SWIDTH0: `__wcwidth()` will return 0 if the bits in
 _CTYPE_SWM are set, whereas the correct return value for a
 non-printable code point is -1.
 
 > - OK on the Cyrillic additions.
 > 
 > - OK on the Combining Diacritical Marks Supplement additions.
 > 
 > I'm also OK on your reducing the listed ranges where characters
 > are not yet defined, e.g., 0x20d0 - 0x20eb rather than the larger
 > reserved set 0x20d0 - 0x20ff, etc.
 
 Ok, I'll commit the change shortly.
 
 Regards,
 Koshy
 
 

From: "J.R. Oldroyd" <fbsd@opal.com>
To: Joseph Koshy <jkoshy@FreeBSD.ORG>
Cc: freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: misc/100212: UTF-8 zero-width character patch
Date: Fri, 28 Jul 2006 09:06:11 -0400

 On Jul 28, 05:59, Joseph Koshy wrote:
 > 
 > 
 > > - I think SWIDTH0 is correct for U+034F since it has no glyph
 > >   and therefore it cannot create space.  I agree this isn't
 > >   totally clear, but 0 seems better than 1 to me.
 > 
 > Marking the code point as non-printable appears to be even better
 > than using SWIDTH0, since `__wcwidth()` will return '0' if the
 > bits in _CTYPE_SWM are set and the correct return for a non-printable
 > code point should be -1.
 > 
 I didn't realize that.  OK then, but in that case, the second patch
 (bin/100215) should probably be changed to check for <= 0 width.
 I'll send an update to that one in a moment.
 
 
 > ...
 > 
 > Ok, I'll commit the change shortly.
 > 
 
 Thanks,
 	-jr
State-Changed-From-To: open->closed 
State-Changed-By: jkoshy 
State-Changed-When: Sun Jul 30 07:41:25 UTC 2006 
State-Changed-Why:  
Committed to rev 1.2 of "share/mklocale/UTF-8.src".  Thank 
you for contributing. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=100212 
>Unformatted:
