Subj : Re: SpiderMonkey: JS_InitStandardClasses allways fails
To   : Georg Maaß
From : Brendan Eich
Date : Tue May 13 2003 02:28 pm

[snip]

> How can I examine whether there are Unicode characters inside the
> JSString that use a high byte? Is there an API call to test this, or
> should I look at each character to determine whether I can use the
> implementation above without information loss, or should I prefer a
> std::wstring as container to prevent the high bytes from getting lost?

The last sounds best to me -- why risk information loss? Whether to
waste cycles trying to optimize for ISO-Latin-1 or ASCII is something
you'll have to consider, but it seems best to me, especially if you are
writing for a worldwide audience, to use Unicode always and take the
space hit. The engine has already taken that hit.

> How can I fill a std::wstring? A jschar is only 2 bytes, whereas a
> wchar_t is 4 bytes or more.

Not always. Some gcc's let you use 2-byte wchars. The question is how
to transcode the ECMA-262-specified code point in each jschar into a
wchar_t. ECMA and SpiderMonkey do not do anything special about Unicode
characters that don't fit in the first plane; such characters require
more than one jschar, and so inflate the reported string length.

> So a jschar* is not binary compatible with a wchar_t*. Do I have to
> feed a std::wstring jschar by jschar, as done in my implementation
> above?

I don't know; what kind of operating system are you using? wchar_t
varies by OS.

> I guess that there is no API function to find out whether
> JS_GetStringBytes results in an information loss or not, because I see
> no internal flags inside JSString which might provide this information
> in a cheap way. Getting this information by looking at each jschar is
> very expensive.

Why bother?

> This knowledge is necessary when my Wert class instance is of type
> "undefined", which means: autocast the next assigned value to the type
> that fits best. A JSString is ambiguous for this. If it contains
> characters larger than 255, then the resulting type must be
> Wert_wstring. If not, then a temporary std::string is to be created and
> inspected to see whether it can be represented as int, unsigned int,
> long, unsigned long, bool, or date, or otherwise must be stored as a
> std::string. This autocast might be very expensive if there is no cheap
> test for whether the JSString contains characters greater than 255.

You are defining a type system with ambiguities. Why not always use
wstring?

> What about byte order? Are there any situations where I have to change
> the byte order when I assign a jschar to a wchar_t, or does the byte
> order of jschar always match the byte order of wchar_t? On my test
> system (x86) it matches, but does it also match on other systems such
> as PowerPC without changing the byte order?

Byte order among integral types in the same process does not differ on
any architecture I know of (PDP-11 not included). The byte order of a
short (jschar is an unsigned short on all platforms) is the same as the
byte order of a long (wchar_t may be an unsigned long -- but not on all
OSes and with all compilation flags!).

/be
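
For reference, a minimal sketch of the jschar-by-jschar copy discussed
above. It assumes the classic JS_GetStringChars and JS_GetStringLength
accessors from jsapi.h; the helper name ToWString is invented for
illustration. Each jschar is zero-extended into a wchar_t, so nothing is
lost, but surrogate pairs for characters outside the first plane are
carried over as two separate units rather than combined.

  #include <string>
  #include "jsapi.h"

  // Widen a JSString's 16-bit code units into a std::wstring, one jschar
  // at a time. Each jschar is zero-extended, so no information is lost;
  // surrogate pairs for characters beyond the first plane remain as two
  // wchar_t units.
  std::wstring ToWString(JSString *str)
  {
      const jschar *chars = JS_GetStringChars(str);
      size_t length = JS_GetStringLength(str);

      std::wstring result;
      result.reserve(length);
      for (size_t i = 0; i < length; ++i)
          result.push_back(static_cast<wchar_t>(chars[i]));
      return result;
  }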
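
And if you do decide to optimize for Latin-1 despite the advice above,
the only test is a linear scan of the code units; there is no flag on
JSString for it. A hypothetical helper (FitsInLatin1 is not an engine
API) might look like this:

  #include "jsapi.h"

  // Linear scan: returns true only if every code unit fits in one byte,
  // i.e. the string could be stored as Latin-1 without information loss.
  // Bails out at the first character above 255.
  bool FitsInLatin1(JSString *str)
  {
      const jschar *chars = JS_GetStringChars(str);
      size_t length = JS_GetStringLength(str);

      for (size_t i = 0; i < length; ++i) {
          if (chars[i] > 0xFF)
              return false;
      }
      return true;
  }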