From nobody@FreeBSD.org  Sat Jul 14 12:03:06 2007
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 9F5CA16A400
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 14 Jul 2007 12:03:06 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [69.147.83.33])
	by mx1.freebsd.org (Postfix) with ESMTP id 8FF8813C4C2
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 14 Jul 2007 12:03:06 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.13.1/8.13.1) with ESMTP id l6EC36n4016825
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 14 Jul 2007 12:03:06 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.13.1/8.13.1/Submit) id l6EC36fP016824;
	Sat, 14 Jul 2007 12:03:06 GMT
	(envelope-from nobody)
Message-Id: <200707141203.l6EC36fP016824@www.freebsd.org>
Date: Sat, 14 Jul 2007 12:03:06 GMT
From: Christoph Mallon <christoph.mallon@FreeBSD.org>
To: freebsd-gnats-submit@FreeBSD.org
Subject: wide character printing using swprintf(dst, n, "%ls", txt) fails depending on LC_CTYPE
X-Send-Pr-Version: www-3.0

>Number:         114578
>Category:       kern
>Synopsis:       [libc] wide character printing using swprintf(dst, n, "%ls", txt) fails depending on LC_CTYPE
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          suspended
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Sat Jul 14 12:10:02 GMT 2007
>Closed-Date:    
>Last-Modified:  Sat Feb 07 14:09:31 UTC 2009
>Originator:     Christoph Mallon
>Release:        RELENG_6
>Organization:
>Environment:
FreeBSD tron.homeunix.org 6.2-STABLE FreeBSD 6.2-STABLE #0: Thu Jan 25 22:43:11 CET 2007     root@tron.homeunix.org:/usr/obj/usr/src/sys/KERNEL  i386
>Description:
Copying a string using swprintf() and the format specifier "%ls" (or "%S")
fails if the string to be copied contains characters that the currently
set LC_CTYPE aspect of the locale does not support.

The test program below should just copy the wide character string "Mir"
(in Cyrillic letters) to an array of wide characters using swprintf().
When the LC_CTYPE aspect of the locale is set to "C" (other encodings
like ISO8859-15 fail, too), this call fails and returns -1. When the
LC_CTYPE aspect of the locale is set to UTF-8 (or probably any other
encoding that supports full Unicode), the call succeeds and returns 3 as
expected.

I wonder if this behaviour is correct, because no encoding conversions
should be involved here. I could not find anything about conversions in
the ANSI C99 standard (7.24.2.1 clause 8 bullet "s"), either. Conversions
are mentioned only for the "%s" format, which is logical.

Other implementations (glibc and Windows libc) copy the string correctly
when LC_CTYPE is set to "C".

I just discovered that the call already fails if the format string itself
contains characters from a range that the current LC_CTYPE does not support.
>How-To-Repeat:
Here is a simple test program. It should (imo) print "3" twice, once for
each call that copies three characters. Instead it prints "-1" and "3".

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

static const wchar_t txt[] = { 0x41C, 0x43D, 0x440, 0 }; // "Mir" in cyrillic

int main(void)
{
  wchar_t str[4];
  int ret;

  setlocale(LC_CTYPE, "C");
  ret = swprintf(str, sizeof(str) / sizeof(*str), L"%ls", txt);
  printf("%d\n", ret);

  setlocale(LC_CTYPE, "UTF-8");
  ret = swprintf(str, sizeof(str) / sizeof(*str), L"%ls", txt);
  printf("%d\n", ret);

  return 0;
}
>Fix:
I didn't dive into the inner workings of *printf(), sorry.

>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->freebsd-standards 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Sat Jul 14 17:20:41 UTC 2007 
Responsible-Changed-Why:  
I'm going to guess this is a standards issue. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=114578 
State-Changed-From-To: open->closed 
State-Changed-By: ache 
State-Changed-When: Sat Jul 14 17:48:21 UTC 2007 
State-Changed-Why:  
This code works as the standards require.
You can see that errno is set by swprintf() (more precisely, by wcrtomb())
and that it is EILSEQ (Illegal byte sequence).
This is because the "C" locale is 8 bits wide, so it does not contain wide
chars outside the 0 .. UCHAR_MAX range (and doesn't know how to convert them
either), so any attempt to convert them fails with EILSEQ.


State-Changed-From-To: closed->open 
State-Changed-By: ache 
State-Changed-When: Sun Jul 15 07:26:49 UTC 2007 
State-Changed-Why:  
POSIX mentions the fputwc() requirement only for fwprintf() and wprintf(),
not for swprintf(), so the fputwc()-mbsrtowcs() back-and-forth conversion
we currently have as a result of the pseudo-file stdio hook is not needed
in a direct implementation (which skips the whole multibyte part).



From: Christoph Mallon <christoph.mallon@gmx.de>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/114578: [libc] wide character printing using swprintf(dst,
 n, "%ls", txt) fails depending on LC_CTYPE
Date: Sun, 15 Jul 2007 09:58:18 +0200

 Here is a simplified example:
 
 #include <locale.h>
 #include <stdio.h>
 #include <wchar.h>
 
 static const wchar_t txt[] = { 0x41C, 0x43D, 0x440, 0 }; // "Mir" in cyrillic
 
 int main(void)
 {
    wchar_t str[4];
    int ret;
 
    setlocale(LC_CTYPE, "C");
    ret = swprintf(str, sizeof(str) / sizeof(*str), txt);
    printf("%d\n", ret);
 
    return 0;
 }
 
 Only a format string is used here. The call to swprintf() fails here,
 too, and -1 is returned. The POSIX standard (and ANSI C99, too, though
 with slightly different wording) says this: "The format is composed of
 zero or more directives: ordinary wide-characters, which are simply 
 copied to the output stream" (from 
 http://www.opengroup.org/onlinepubs/009695399/functions/swprintf.html , 
 section DESCRIPTION, second clause). So even copying the ordinary 
 wide-characters from the format string fails.

From: David Schultz <das@FreeBSD.ORG>
To: Christoph Mallon <christoph.mallon@FreeBSD.ORG>
Cc: freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: misc/114578: wide character printing using swprintf(dst, n, "%ls", txt) fails depending on LC_CTYPE
Date: Tue, 15 Jan 2008 04:25:34 -0500

 fputwc(3) has similar language about copying the character to the
 output stream, but POSIX still says it can fail with EILSEQ if the
 wide character doesn't exist in the current locale.
 
 This isn't my area of expertise, but the present behavior seems
 correct. If the current locale doesn't support a given wide
 character, we should not invent a multibyte character sequence for
 it, because the other end of the stream may not even be able to
 interpret it.

From: Christoph Mallon <christoph.mallon@gmx.de>
To: bug-followup@FreeBSD.org, das@FreeBSD.org
Cc:  
Subject: Re: kern/114578: [libc] wide character printing using swprintf(dst,
 n, "%ls", txt) fails depending on LC_CTYPE
Date: Mon, 29 Sep 2008 10:01:20 +0200

 > fputwc(3) has similar language about copying the character to the
 > output stream, but POSIX still says it can fail with EILSEQ if the
 > wide character doesn't exist in the current locale.
 
 fputwc() is entirely different from swprintf(): fputwc() writes to a
 stream, swprintf() writes to an array of wchar_t.
 
 > This isn't my area of expertise, but the present behavior seems
 > correct.
 
 No, it isn't.
 
 > If the current locale doesn't support a given wide
 > character, we should not invent a multibyte character sequence for
 > it, because the other end of the stream may not even be able to
 > interpret it.
 
 The format string of swprintf() is of type wchar_t and the destination 
 buffer of swprintf() is of type wchar_t. So there are absolutely no 
 locale conversions involved and no multibyte sequences have to be 
 invented, as you suggested. All that should happen is copying the
 wchar_ts from the source to the destination with no conversions involved
 at all. The standard, which I quoted already, is quite clear in this
 respect. The current implementation, which internally converts from
 wchar_t to the current multibyte locale encoding and back to wchar_t, is
 just an implementation hack, which breaks if the current locale cannot
 represent full Unicode.

From: Garrett Wollman <wollman@csail.mit.edu>
To: Christoph Mallon <christoph.mallon@gmx.de>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: kern/114578: [libc] wide character printing using swprintf(dst, 
 n, "%ls", txt) fails depending on LC_CTYPE
Date: Mon, 29 Sep 2008 11:44:27 -0400

 <<On Mon, 29 Sep 2008 08:10:03 GMT, Christoph Mallon <christoph.mallon@gmx.de> said:
 
 >> fputwc(3) has similar language about copying the character to the
 >> output stream, but POSIX still says it can fail with EILSEQ if the
 >> wide character doesn't exist in the current locale.
  
 >  fputwc() is entirely different from swprintf(): fputwc() writes to a
 >  stream, swprintf() writes to an array of wchar_t.
  
 >> This isn't my area of expertise, but the present behavior seems
 >> correct.
  
 >  No, it isn't.
 
 The Standard is clear:
 
 	In addition, all forms of fwprintf() may fail if:
 
 	[EILSEQ]	A wide-character code that does not correspond
 			to a valid character has been detected.
 
 (IEEE Std.1003.1-2001 page 471, line 15515)
 
 You may wish that it was implemented differently, but that doesn't
 mean that the current implementation is wrong.
 
 -GAWollman
 
State-Changed-From-To: open->suspended 
State-Changed-By: das 
State-Changed-When: Sat Feb 7 14:04:44 UTC 2009 
State-Changed-Why:  
suspended awaiting patches 


Class-Changed-From-To: sw-bug->change-request 
Class-Changed-By: das 
Class-Changed-When: Sat Feb 7 14:04:44 UTC 2009 
Class-Changed-Why:  
The present implementation of swprintf() could be much better, and the 
submitter is rightly offended, but this doesn't seem to be a bug. 


Responsible-Changed-From-To: freebsd-standards->freebsd-bugs 
Responsible-Changed-By: das 
Responsible-Changed-When: Sat Feb 7 14:04:44 UTC 2009 
Responsible-Changed-Why:  
Reclassify this as a general bug. Although swprintf's behavior of 
converting from wide characters to multibyte representations and back 
again appears not to be a standards violation, a better implementation 
would avoid the extra work. 

>Unformatted:
