Subj : Re: About Unicode
To : netscape.public.mozilla.jseng
From : Shanti Rao
Date : Tue Nov 23 2004 05:25 pm

Jun,

If you want an off-the-shelf interactive shell, download mine from
www.jsdb.org and use its decodeUTF8() function. So you could do

    myString = decodeUTF8('?????')

What's the difference between UTF-8 and UCS-2 ...

Have I ever mentioned that my title in Dogbert's New Ruling Class is the
Minister of Incompatible Standards?

If you download something from the web, it's probably encoded in UTF-8
(Unicode Transformation Format, 8-bit). So you need to do something to
convert it to UCS-2 (Universal Character Set, 2-byte), as Jens mentioned.

For the record, UCS-2 is not the same as UTF-16. Here's why:

Unicode tries to have a single representation for every possible
character. Most characters can be represented in 16 bits, so Windows and
Mozilla use 16 bits (called UCS-2) to represent each letter. But there's a
lot of software out there, like the Internet, that is expecting to use
8-bit characters. So UTF-8 is a way of encoding a character code as a
sequence of one to four bytes (8, 16, 24, or 32 bits; a 16-bit code needs
at most three) that never contains a zero byte unless the character itself
is NUL.

If you want to use all of the roughly 10^6 character codes, then you need
more than 16 bits, which gets us to UCS-4, using 4 bytes for each letter.
But what if you have old-fashioned software that was expecting UCS-2? Then
you use UTF-16, which stores each of the extra characters as a pair of
16-bit units (a surrogate pair). It is just a pain in the neck and best
avoided if at all possible.

Shanti

> Jun Kim wrote:
>
>>Do you mean that in order to get the result I expect,
>>should I use JS_CompileUCScript() instead of JS_CompileScript()?
>
> yes
>
>>Well, I did try and failed.
>>Since the JS_CompileUCScript() function takes jschar instead of char,
>>I did some conversion, char -> JSString -> jschar,
>>and passed it to JS_CompileUCScript() as such...
>
> You have to do the right conversion.
> For JS_NewString there is also JS_NewUCString
> (if you call JS_NewString, the C string you pass is again interpreted as
> ISO Latin-1)
>
> see also: "Handling Unicode"
> http://www.mozilla.org/js/spidermonkey/apidoc/jsguide.html
> => you always use the UC functions
>
> btw, I have not done this myself yet (in fact, at the moment I always
> ignore half of the string ;-) but the idea is:
>
> you must convert your input from whatever encoding you use to UTF-16,
> and then convert the output from UTF-16 back to whatever encoding you use
>
> Shanti Rao posted conversion code for UTF-8 <-> UCS-2:
> http://groups.google.de/groups?selm=cj8ghg%24s0h2%40ripley.netscape.com
> Perhaps Shanti will jump in and explain it and the difference between
> UCS-2 and UTF-16 ;-)
>
> attention:
> "ECMA and SpiderMonkey do not do anything special about Unicode
> characters that don't fit in the first plane; such characters will
> require more than one jschar, and will result in overlong string lengths."
>
> all about Unicode:
> http://www.unicode.org/
> http://www.unicode.org/faq/
>
> UTF-8 and Unicode FAQ for Unix/Linux
> http://www.cl.cam.ac.uk/~mgk25/unicode.html
>
> Greetings
> Jens
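
To make the UTF-8 versus UCS-2/UTF-16 distinction from this thread concrete, here is a minimal sketch of a decoder written in plain JavaScript. The name decodeUtf8Bytes is hypothetical and this is not the jsdb decodeUTF8() code or the conversion code linked above; it assumes well-formed input given as an array of byte values and does no error checking.

```javascript
// Hypothetical helper for illustration only (not the jsdb decodeUTF8()).
// Decodes an array of UTF-8 byte values into a JavaScript string, whose
// elements are 16-bit units. Assumes well-formed input: overlong,
// truncated, or otherwise invalid sequences are not detected.
function decodeUtf8Bytes(bytes) {
  var out = "";
  var i = 0;
  while (i < bytes.length) {
    var b = bytes[i];
    var codePoint, extra;
    if (b < 0x80) {          // 0xxxxxxx: 1-byte sequence
      codePoint = b;
      extra = 0;
    } else if (b < 0xE0) {   // 110xxxxx: 2-byte sequence
      codePoint = b & 0x1F;
      extra = 1;
    } else if (b < 0xF0) {   // 1110xxxx: 3-byte sequence
      codePoint = b & 0x0F;
      extra = 2;
    } else {                 // 11110xxx: 4-byte sequence
      codePoint = b & 0x07;
      extra = 3;
    }
    i++;
    // Each continuation byte (10xxxxxx) contributes 6 more bits.
    for (var k = 0; k < extra; k++, i++) {
      codePoint = (codePoint << 6) | (bytes[i] & 0x3F);
    }
    if (codePoint < 0x10000) {
      // Fits in one 16-bit unit: UCS-2 and UTF-16 agree here.
      out += String.fromCharCode(codePoint);
    } else {
      // Beyond the first plane: UTF-16 needs a surrogate pair.
      var v = codePoint - 0x10000;
      out += String.fromCharCode(0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
    }
  }
  return out;
}
```

For example, decodeUtf8Bytes([0xE2, 0x82, 0xAC]) gives the one-unit string "\u20AC", while decodeUtf8Bytes([0xF0, 0x9D, 0x84, 0x9E]) (U+1D11E) gives a string of length 2, which is the "more than one jschar" case from the quoted documentation.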