[HN Gopher] Your code displays Japanese wrong
       ___________________________________________________________________
        
       Your code displays Japanese wrong
        
       Author : needle0
       Score  : 164 points
       Date   : 2021-10-28 06:07 UTC (16 hours ago)
        
 (HTM) web link (heistak.github.io)
 (TXT) w3m dump (heistak.github.io)
        
       | rock_artist wrote:
       | So to make sure I got it right:
       | 
        | The issue is: when there's no proper glyph, it uses a fallback?
        | 
        | Or... the glyph that gets rendered depends on the document's
        | declared locale?
        | 
        | (If it's the latter, it means it's impossible to quote another
        | Asian language's text within the same paragraph?)
        
         | needle0 wrote:
         | The latter. In HTML you can specify a specific DOM element as
         | being in a specific language so the browser can render it
         | properly, but if the place you want to quote text isn't as
         | allowing (eg. comment sections with no HTML allowed), there may
         | be no way to ensure correct glyphs.
        
           | lifthrasiir wrote:
            | Ideographic variation selectors plus a very large pan-CJK
            | font may solve this issue in the future, but CJK fonts have
            | already reached the OpenType limit of 65,535 glyphs, so we
            | are running into technical issues.
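A quick sketch of what an ideographic variation sequence looks like at the code-point level (Python; 葛 U+845B is one base character with registered variation sequences):

```python
# An ideographic variation sequence is a base CJK character followed
# by a variation selector (VS17-VS256, U+E0100-U+E01EF). Renderers
# with a suitable font pick a specific glyph; others ignore it.
base = "\u845b"        # 葛 (U+845B), a character with registered variants
vs17 = "\U000E0100"    # VARIATION SELECTOR-17, first ideographic selector
seq = base + vs17

print([hex(ord(c)) for c in seq])   # ['0x845b', '0xe0100']
print(len(seq))                     # 2 code points, displayed as one glyph
```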
        
         | tasogare wrote:
          | It's a web-browser-only issue. In other cases, such as a word
          | processor document or a local app, a font is explicitly
          | applied to every run of text, so there is no problem. This
          | issue is the web being what it is: most of the time there is
          | no font explicitly specified for text, and the browser uses a
          | Chinese-looking font to display any Chinese characters.
        
           | eska wrote:
           | It's not just a web browser issue. For example I'm
           | transferring data in multiple Asian languages through some
           | network API. I always need to specify the locale of the text
           | data in a separate data field so that some UI program at the
           | end can display the text correctly. And even then that's not
           | perfect, because that's just the system locale instead of the
           | IME locale.
        
         | iforgotpassword wrote:
         | You need to embed the quoted text in eg a span and add the
         | proper lang tag.
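A sketch of that markup (直 U+76F4 is a commonly cited example; which glyph actually appears depends on the browser's font configuration):

```html
<!-- Same code point, different lang tags: the browser may select a
     different font (and thus glyph shape) for each element. -->
<span lang="ja">直</span>      <!-- Japanese form -->
<span lang="zh-Hans">直</span> <!-- Simplified Chinese form -->
<span lang="zh-Hant">直</span> <!-- Traditional Chinese form -->
<span lang="ko">直</span>      <!-- Korean form -->
```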
        
         | [deleted]
        
       | BoppreH wrote:
       | That's extremely interesting, if not depressing.
       | 
       | So, if I have to display user-entered text (usernames, posts,
       | comments, messages, form data, etc), and I want to do The Right
       | Thing(tm):
       | 
        | - I cannot rely on the user locale, because it might be set to
        | something generic like English, or the user may be bilingual.
       | 
       | - I cannot rely on location, because the user may be traveling to
       | a different CJK region, or somewhere else altogether.
       | 
       | - I cannot set a single lang: attribute for the whole page
       | because it'll be wrong for the other two languages.
       | 
       | - The string alone is not sufficient to identify the language
       | because you can write valid sentences in different CJK languages
       | with the same codepoints.
       | 
        | - I cannot have a per-user language setting, because users may
        | be bilingual.
       | 
       | What does that leave me? A dropdown list "C/J/K/Other" besides
       | _every single text field_?
       | 
       | I'm chucking this on my pile of examples of software development
       | being hopelessly broken by design, along with "unix time is non-
       | monotonic and discontinuous at random" (hint: what's the unix
       | time exactly 1e8 seconds, ~3 years, from now? Answer: it's up to
       | the astronomers[1]!).
       | 
       | [1]: https://en.wikipedia.org/wiki/Unix_time#Leap_seconds
       | 
       | Edit: actually, even the dropdown list is insufficient because it
       | only allows one language per string! How is a Japanese user
       | asking for help learning Chinese supposed to write?
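The "string alone is not sufficient" point can be seen directly at the code-point level (Python): a unified ideograph's identity carries no language information at all.

```python
import unicodedata

s = "\u9aa8"  # 骨 "bone": used in Chinese and Japanese, same code point
print(unicodedata.name(s))  # CJK UNIFIED IDEOGRAPH-9AA8

# Nothing in the string says which language (and thus which glyph
# shape) is intended; that has to travel out-of-band, e.g. lang tags.
```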
        
         | makeitdouble wrote:
          | For what it's worth, mainstream OSes also handle these cases
          | poorly, which eliminates the trickiest cases through sheer
          | inconvenience.
          | 
          | As far as I know, text fields only have one font applied, so
          | entering both languages in a single field won't be optimal.
          | And if you're not doing anything fancy with your fields, they
          | will all use the same font as well.
         | 
         | So even at the input level, the user switching languages will
         | already be mildly screwed, and the best solution would probably
         | be to change pages for each language.
        
         | GoblinSlayer wrote:
         | >How is a Japanese user asking for help learning Chinese
         | supposed to write?
         | 
          | They write with Japanese kanji. I imagine the minuscule
          | differences in kanji forms are negligible compared to general
          | unfamiliarity with the foreign language.
        
         | [deleted]
        
         | gpderetta wrote:
         | Deprecate languages. We will all be writing exclusively in
         | emojis from now on.
        
         | zokier wrote:
         | > unix time is non-monotonic and discontinuous at random
         | 
         | Well, depends on your definition of unix time; if you use
         | time() as the definition then it is actually monotonic because
         | the integral part only repeats on leap seconds?
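For code that needs the "never goes backwards" property, most platforms expose a monotonic clock separate from wall-clock Unix time; in Python, for example:

```python
import time

# time.time() tracks wall-clock (Unix) time and may jump backwards,
# e.g. after an NTP step; time.monotonic() is guaranteed non-decreasing.
t1 = time.monotonic()
t2 = time.monotonic()
assert t2 >= t1  # always holds, unlike with time.time()
```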
        
         | Asooka wrote:
         | Trying to do something smart is usually the wrong approach and
         | leads to users tangling themselves in invisible state that they
         | do not understand and can't change. The best thing would have
         | been for Unicode to not do Han unification. The second best
         | would be to provide alternate glyphs now. The third best is to
         | display the characters either in the language they're written
         | in, when you know that for sure (usually for text that you
         | wrote yourself), or in the user's most likely locale, when you
         | don't. For locale I would go down this list of traits and pick
         | the first that matches:
         | 
         | 1) The language setting on your website if you have one and
         | have translated it to C/J/K. You may use different TLDs for the
         | different languages and discern that way, too.
         | 
         | 2) The list of preferred languages from the browser. This is
         | usually unreliable, but if someone has gone to the trouble of
         | inputting "english=1;japanese=.9;chinese=.8", then it's a fair
         | bet they want Japanese Kanji usually and will be understanding
         | if you use them in place of Chinese Han characters.
         | 
         | 3) The country to which the user's IP belongs. The least ideal
         | option, but if you're in Korea and reading a random string of
         | Hanzi, you probably expect them to look like Hanzi.
         | 
         | You will show the wrong characters to some users, but the
         | behaviour is understandable. "Oh, the site is showing me Korean
         | characters because I'm in Korea." is a lot easier to grasp than
         | "The site is showing me Chinese characters because I clicked a
         | dropdown one time that I forgot about and now I have no idea
         | why my name is written wrong!"
         | 
         | You can argue about point 2) that some users might set their
         | language preferences and forget about it, but so far I have
         | never observed a user who doesn't know about them messing with
         | the setting.
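Option 2) could be sketched like this (a minimal, hypothetical Accept-Language picker in Python; `pick_cjk_lang` and its defaults are illustrative, and a real server should lean on its web framework's parser instead):

```python
def pick_cjk_lang(accept_language, supported=("ja", "zh", "ko"), default="ja"):
    """Return the highest-q supported primary language tag from an
    Accept-Language header like "en;q=1.0, ja;q=0.9, zh;q=0.8"."""
    best, best_q = default, 0.0
    for part in accept_language.split(","):
        fields = part.strip().split(";")
        # Reduce e.g. "zh-TW" to its primary subtag "zh".
        primary = fields[0].strip().lower().split("-")[0]
        q = 1.0  # quality defaults to 1.0 when omitted
        for field in fields[1:]:
            field = field.strip()
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        if primary in supported and q > best_q:
            best, best_q = primary, q
    return best

print(pick_cjk_lang("en;q=1.0, ja;q=0.9, zh;q=0.8"))  # ja
print(pick_cjk_lang("zh-TW"))                          # zh
```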
        
         | [deleted]
        
         | dathinab wrote:
         | The problem is btw. not specific to Han unification/asian
         | languages. It's that _for every western language_ if you use
         | screen readers.
        
           | the_other wrote:
           | My thought whilst reading the article was that the Han
           | unification would actually help screen readers. IIUC The
           | meaning of the glyph is the same across all the languages, so
           | the screen reader will get the correct meaning and can
           | present it according to local settings. The problem with the
           | European languages is that the different characters (letter
           | variations, accent variations) can change the meaning of the
           | word they're part of.
           | 
           | Or have I misunderstood?
        
             | lifthrasiir wrote:
             | The same Han character can have wildly different
             | pronunciations even in a single language (being a logogram,
             | they represent a word and not a sound). KS X 1001, the
             | primary Korean character set, even duplicated same
             | characters according to their readings so that they can be
              | almost [1] correctly converted back to Hangul. In practice
              | they didn't work well though, and Unicode assigns all but
              | one of the duplicated characters to the compatibility
              | region.
             | 
             | [1] These readings didn't take account for systematic
             | variations like the initial sound law (dueumbeobcig, for
             | example i vs. ri at the beginning of words).
        
             | soraminazuki wrote:
              | You can't split CJK sentences into individual letters,
              | inspect them one by one, and decipher their exact meaning.
              | If you present Chinese writing to a Japanese speaker,
              | they'd see only gibberish consisting of letters they may
              | or may not recognize. It works the other way around too.
              | 
              | On top of that, kanji aren't the only type of letters
              | Japanese folks use. They also use hiragana and katakana,
              | which are phonetic symbols and totally unrecognizable to
              | non-Japanese speakers.
        
           | Snild wrote:
           | I'm not sure I understand what you're saying here. Are you
           | saying that screen readers don't know which language to
           | interpret text as?
        
         | dmurray wrote:
         | > I cannot set a single lang: attribute for the whole page
         | because it'll be wrong for the other two languages.
         | 
         | > What does that leave me? A dropdown list "C/J/K/Other"
         | besides every single text field?
         | 
          | If you _need_ to control the language display at the
          | granularity of a single text field (rather than a user or a
          | page or a website), then yes, you need a tool that operates on
          | a single text field. This shouldn't be too surprising.
         | 
         | Surely you can get away with one of your other solutions,
         | though. In particular, you can guess the language and be right
         | most of the time.
        
           | dvdkon wrote:
           | It is surprising to me, because I can mix Czech, English and
           | German just fine on a single screen, even in a single text
           | field. From that perspective having to say "this whole text
           | is in language X" seems backwards.
        
             | jfk13 wrote:
             | Suppose you're blind and use a screen reader. How does
             | German sound when pronounced by an English text-to-speech
             | engine, or vice versa? Which should be used to read out
             | that "single text field" that contains a mix of languages?
        
               | dvdkon wrote:
               | When reading mixed-language text we can't know if the
               | "Tee" in the middle of an English sentence is German for
               | "tea" or English slang for "t-shirt". We have to use
               | imperfect context clues, so screen readers will have to
               | do the same, it should be doable with today's technology.
               | And if the software doesn't get it right every time,
               | that's fine, because neither do humans.
        
               | masklinn wrote:
               | > And if the software doesn't get it right every time,
               | that's fine
               | 
                | It really is not: a screen reader using the wrong
                | language is way, way worse than a human with the wrong
                | pronunciation. The first time VoiceOver decided to
                | switch to Russian in the middle of English text I
                | thought the OS had crashed; the mangling is quite
                | extreme.
        
             | dmurray wrote:
             | OK, yeah. If Unicode hadn't gone with Han unification
             | (couldn't they know we'd have space to put zillions of
             | characters in a font just a few years later?) you'd have
             | the same flexibility in mixing C/J/K.
        
               | [deleted]
        
               | rectang wrote:
                | > _(couldn't they know we'd have space to put zillions
                | of characters in a font just a few years later?)_
               | 
               | Representation matters. There should have been plenty of
               | experts who understood that Han unification was
               | problematic. It seems they were not in a position to do
               | anything about it.
        
               | masklinn wrote:
               | Part of the problem back then is they really really
               | wanted to fit it all in 16 bits, and given there are more
               | hanzi than that to start with this was a bit of an issue
               | (back in the 80s the largest dictionary listed >54000
               | characters, now it's >100000, and Japanese and Korean
               | total about 50000 each).
               | 
               | A few years later they increased the character space by 5
               | bits and it wasn't an issue anymore, but the original
               | legacy of _han unification_ remains.
        
               | tdeck wrote:
               | It makes sense given constraints of the time to try to
               | fit the character set into 16 bits, but Unicode has
               | variation selectors, why not use those for the ambiguous
               | characters? They could have easily done something like
               | HAN_CHARACTER_X + VARIANT_KANJI. It would take up an
               | extra 16 bits, but given the density of CJK text relative
               | to Latin text that may not be a big issue.
               | 
               | Edit: s/variant/variation/.
        
               | rectang wrote:
               | Could this be done now?
        
               | rectang wrote:
               | Yes, and the people who prioritized the 16-bit size were
               | OK with Japanese looking ridiculous in order to achieve
               | their goal. From the article:
               | 
               | > if the equivalent symptom was happening with English
               | text, it' wfuld bie lffking sfmiet'Tshing likie t'Tshis.
               | 
               | From a page linked by the article, "I Can Text You A Pile
               | of Poo, But I Can't Write My Name":
               | 
               | > To help English readers understand the absurdity of
               | this premise, consider that the Latin alphabet (used by
               | English) and the Cyrillic alphabet (used by Russian) are
               | both derived from Greek. No native English speaker would
               | ever think to try "Greco Unification" and consolidate the
               | English, Russian, German, Swedish, Greek, and other
               | European languages' alphabets into a single alphabet.
               | 
               | If there had been a proposal to sacrifice English in
               | order to cram Unicode into a certain code space size, is
               | there any question that the people on the panel whose
               | first language was English would have quashed it?
               | 
               | But people who might have spoken out for Japanese, or for
               | Indian languages, etc. were seemingly not in a position
               | to do anything.
        
               | jefftk wrote:
               | Greco Unification wouldn't have been a priority because
               | it couldn't have saved enough bits to matter.
        
               | rectang wrote:
               | Yes, of course -- but the point is that _native English
               | speakers would have blocked such an absurd proposal no
               | matter what_.
               | 
               | Han Unification is just as absurd and just as
               | unacceptable to a native Japanese speaker as Greco
               | Unification would be to a native English speaker. But Han
               | Unification went through because native Japanese speakers
               | were not in a position to block it. Representation
               | matters.
        
         | dr-detroit wrote:
         | This problem shows up in email too. The internet is mostly
         | hollow and broken.
        
       | kiryin wrote:
       | As a Japanese learner, this has been a massive disappointment in
       | unicode for me, and a pain in my ass. It has sort of formed into
       | a challenge for me, trying to get the characters to display
       | consistently on all of my devices. Believe it or not, even with
       | pango configured to always show the japanese variants, and
       | fontconfig set to always prefer the JP font, some applications
       | like Firefox find a way to mess it up.
       | 
       | Can't blame them much though, han unification is a huge mess and
       | designed by someone who I can only posit to be entirely
       | brainless. There aren't many characters that are affected, you
       | aren't even saving any considerable amount of codepoints. It's
       | just west-centricism and lack of knowledge on the subject.
        
         | adrian_b wrote:
          | The Han unification was done because at that time they hoped
          | that the size of Unicode characters would be limited to 16
          | bits.
         | 
         | Separate sets of Han characters cannot be encoded in the 16-bit
         | space, but they could have been easily encoded in the current
         | 32-bit space.
         | 
         | Nevertheless, I have never found this to be a problem in
         | practice, because I have always taken care to have good
         | separate typefaces for Japanese, Traditional Chinese and
         | Simplified Chinese.
         | 
         | In documents that I create or modify, I apply styles with the
         | appropriate typeface.
         | 
         | The only possible problems are with Web pages, but the good
         | browsers allow you to configure typefaces for each language and
         | I always configure the correct typefaces.
         | 
            | If the Web page does not specify the language correctly, it
            | might be displayed wrongly, but this is only one of the
            | many stupid things a Web page designer can do that make a
            | page look ugly when rendered on other computers.
        
           | aikinai wrote:
           | The fact that you have to take such care is exactly the
           | problem.
        
             | adrian_b wrote:
             | I agree that the fact that I must not forget to configure
             | typefaces per language whenever I install a new browser,
             | while for Chrome you must also install the "Advanced Font
             | Settings" extension before it even becomes possible to
             | choose e.g. a Japanese font, is annoying.
             | 
             | To avoid such configuration work when you prefer better
             | looking typefaces instead of some standard system defaults
             | would require a standardization of how to notify the
             | applications about the association between certain
             | typefaces and languages, e.g. by some environment variables
             | or by some standard locations for the font files, depending
             | on language.
        
         | bruce343434 wrote:
          | What makes this worse is that some Cyrillic characters, like
          | the Cyrillic "a", have a different code point from the Latin
          | "a" despite looking _exactly_ identical. So Unicode isn't
          | even consistent with its unification logic.
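The distinction is easy to see at the code-point level (Python):

```python
latin_a = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A

# Visually identical in most fonts, but distinct characters:
print(latin_a == cyrillic_a)                    # False
print(hex(ord(latin_a)), hex(ord(cyrillic_a)))  # 0x61 0x430
```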
        
           | demetrius wrote:
           | I believe Cyrillic a and Latin a are different because there
           | already existed legacy encodings where a and a were
           | considered different characters, so Unicode kept the
           | distinction for backward-compatibility.
           | 
           | While there were no existing legacy encodings allowing to
           | write Chinese and Japanese at the same time, so there was
           | nothing to keep compatibility with.
        
       | mjevans wrote:
       | The initial three examples don't have a corresponding 'correct
       | render' image next to them, so it's impossible for me to tell
       | since they all render as the same character (which is incorrect
       | given the lack of context).
       | 
       | Checking the source, the page _is_ specifying language tags in
       | the span, which I guess is supposed to help. My system just must
       | not have fonts for those languages so I obviously can't even test
       | them.
        
         | needle0 wrote:
          | There may be issues displaying it on Firefox. Chrome and
          | Safari seem to have displayed it correctly on my end. I'll
          | find time to replace them with images so they appear correct
          | regardless of environment.
         | 
         | EDIT: Replaced with images.
        
           | yorwba wrote:
           | Firefox will also display it correctly, but only if you have
           | the fonts installed, the same as with other browsers.
        
         | [deleted]
        
       | squaresmile wrote:
        | This can also be a problem in chat apps. I used en-US on
        | Windows (which defaulted to the zh variant), someone else used
        | ja-JP, and I was wondering why the character looked different.
        | It took a while to notice that we were seeing two different
        | things on our screens.
       | 
       | We also have a website about a Japanese game using a Japanese
       | font except for 0x9bd6. The font's 0x9bd6 is the CN variant and
       | its 0xe001 is the JP variant of 0x9bd6. Fun times.
       | 
       | Like others said, on the web, you pretty much have to manually
       | assign lang to every single thing. We just added support for
       | CN/TW/KR text. I should come back and check 0x9bd6 in the other
       | versions ...
        
       | powerapple wrote:
        | For Chinese, the font can change the writing slightly. For
        | example, Ren (blade) can be any of these (Japanese, Simplified
        | Chinese, and Traditional Chinese). Actually, I would consider
        | the Japanese version in the article to be the Traditional
        | Chinese version:
        | https://duckduckgo.com/?q=%E5%88%83+%E4%B9%A6%E6%B3%95&iax=i...
        | 
        | At least for Chinese, the difference is the font; they are all
        | valid ways of writing the character. Different writing styles
        | can cause these minor differences as well:
        | http://qiyuan.chaziwang.com/pic/ziyuanimg/E58883.png
        | 
        | If you look at the right side of the above image, you can see
        | how the same character is written in different writing styles.
        | 
        | It is less of a problem for Chinese; our brains have been
        | trained to read them. I would recognize the Japanese version,
        | the Simplified Chinese version, and the Traditional Chinese
        | version without noticing the difference. But I can imagine it
        | being a problem for Japanese readers, and for other people who
        | do not read Simplified Chinese. Having the locale explicitly
        | set to a country and loading the correct font makes a lot of
        | sense here.
        
       | suction wrote:
       | Does this also explain why alphabetic text in Japanese apps and
       | websites often looks so horrible? Like very wide characters with
       | way too much space in between them?
        
         | tasogare wrote:
          | No, the wide characters ("full-width" is their actual name)
          | are separate code points. I guess they exist because someone
          | wanted to be able to use one letter in place of a Japanese
          | character while keeping the same width. The "normal" letters
          | are called "half-width" in this context.
        
           | lifthrasiir wrote:
           | Full-width characters are relics from multiple legacy
           | character sets. For example JIS X 0208, the primary Japanese
           | two-byte character set, has a set of alphanumeric characters
           | in the row 0x23, but their widths are not specified and it is
           | totally possible to map them into half-width characters when
           | no other character sets are in use. However it is most
           | commonly paired with JIS X 0201 which is a single-byte
           | character set with their own alphanumeric characters, so
           | anything from JIS X 0201 is made half width and anything from
           | JIS X 0208 is made full width to simplify implementations.
            | This practice stuck and was subsequently inherited by
            | Unicode. The same goes for other languages.
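Unicode records this half-width/full-width distinction as the East_Asian_Width property, which is queryable from Python's standard library:

```python
import unicodedata

for ch in ("A", "\uff21", "\u4e00"):
    # 'Na' = narrow, 'F' = fullwidth, 'W' = wide (CJK ideographs)
    print(ch, hex(ord(ch)), unicodedata.east_asian_width(ch))

# A  0x41   Na  <- ordinary half-width Latin letter
# Ａ 0xff21 F   <- FULLWIDTH LATIN CAPITAL LETTER A
# 一 0x4e00 W   <- CJK ideographs are inherently wide
```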
        
         | needle0 wrote:
          | Newspaper websites may also have years-old internal
          | typesetting rules, carried over from paper, that mandate that
          | alphabetical text appear in full width (double wide). It
          | looks ugly even to native Japanese readers, and some
          | newspapers have gradually been learning to break out of it.
        
         | jhanschoo wrote:
          | Alphabetic text in Japanese fonts is primarily designed for
          | documents mainly in Japanese with the occasional Latin-script
          | jargon. There's a variant (full-width), sometimes used, that
          | is indeed very wide, made to match the width of Japanese
          | kanji; but even the proportional forms are pretty light and
          | widely spaced (which results in better typography in
          | mainly-Japanese documents).
        
       | zokier wrote:
       | This is a tangent, but has there been any ideas around making a
       | stroke-based encoding of Japanese/Chinese writing systems?
        
       | lovasoa wrote:
        | Wasn't the whole point of Unicode to have a single encoding
        | that could represent all languages unambiguously, so that you
        | don't need any meta-information to display a string? Is there a
        | reason why they chose to represent characters that are
        | obviously different with the same code point? Everyone would
        | find it outrageous if they decided to have a single character
        | for the Russian m and the English m just because they have the
        | same Greek origin...
        
         | josefx wrote:
          | > Everyone would find it outrageous if they decided to have a
          | single character for the Russian m and the English m just
          | because they have the same Greek origin...
          | 
          | Did you know that some languages distinguish a dotless I/i
          | pair from a dotted one? English mixes them (dotted lowercase
          | i with dotless capital I), and Unicode needs to know which
          | language you want to upper/lowercase in, because it can't
          | tell an English I apart from a Turkish dotless I.
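Python's locale-independent case mapping shows the problem: without knowing the language, it can only apply the default (English-style) rules.

```python
# Default Unicode case mapping, with no language information:
assert "I".lower() == "i"        # correct for English,
                                 # wrong for Turkish ("I" -> "ı")
assert "\u0131".upper() == "I"   # ı (dotless) uppercases to plain I

# İ (U+0130) has no single-character default lowercase: under the
# full Unicode mapping it becomes "i" + COMBINING DOT ABOVE.
assert "\u0130".lower() == "i\u0307"
```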
        
         | zczc wrote:
         | There are some 8-bit and 7-bit encodings which have unified
         | Latin and Greek, like 7-bit SMS encoding [1]
         | 
         | [1] https://en.wikipedia.org/wiki/GSM_03.38
        
         | Sniffnoy wrote:
         | The reason why is that Unicode was originally 16-bit, and there
         | was no way they could fit everything into 16 bits without CJK
         | unification. Of course later it turned out there was no way
         | they could fit everything into 16 bits anyway, and so they were
         | forced to expand it, and so we now both have a larger Unicode
         | (with all the messes that's caused) but also still have CJK
         | unification...
        
           | lovasoa wrote:
           | Is it too late to add separate code points for the chinese,
           | japanese, and korean versions of the han characters ?
        
             | lifthrasiir wrote:
              | In principle there is a designated "disunification"
             | procedure when it's desirable. More accurately speaking,
             | each CJK character is thought to represent not a single or
             | a few glyphs listed in the code chart but rather a glyphic
             | subset, and the disunification splits that set into
             | partitions. But this is generally applied to a few selected
             | characters and only when it's safe to do so. Massive
             | disunification was to my knowledge never suggested or
             | proposed, and that would surely prompt a large scale
             | disruption throughout CJK users (say, how about existing
             | texts?).
        
               | rrobukef wrote:
                | So it's possible to modify the skin tone of emoji but
                | impossible to disunify CJK characters? There are RTL
                | modifiers for right-to-left scripts like Arabic; is it
                | impossible for CJK? It shouldn't be harder than
                | existing Unicode handling.
        
               | lifthrasiir wrote:
               | Everything boils down to the interoperability and
               | compatibility.
               | 
                | Emoji were added because Apple and Google had to deal
                | with (then-)Japanese emails, and skin tones were not
                | specified. It was implementations that imposed certain
                | skin tones (which did not even match the original
                | Japanese emoji), and as a result Unicode had to
                | introduce a mechanism to change skin tones and mandate
                | that the default emoji, without that mechanism, be
                | neutral.
               | 
               | RTL "modifiers" are actually formatting characters
               | closely tied with the Unicode Bidirectional Algorithm
               | [1]. Until then texts with both RTL and LTR fragments
               | were handled incoherently, for example legacy character
               | sets were still struggling with logical vs. visual order
               | issues. So they are indeed Unicode inventions, but
               | necessary ones that do not alter existing texts.
               | 
               | For CJK characters Unicode now provides ideographic
               | variation selectors that select the exact glyph (or more
               | accurately, a restricted glyphic subset of the base
               | character). They do not disunify characters but they do
               | provide a strong hint to display those characters in a
               | specified way. In this way they do not cause an
               | additional issue to existing Unicode systems (as they
               | should already do normalization and collation in the
               | Unicode way). The disunification by comparison would
               | almost instantly break existing texts.
               | 
               | [1] https://unicode.org/reports/tr9/
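The ideographic-variation-selector mechanism described above can be sketched at the code-point level. A minimal illustration (U+845B is one base character with registered IVD sequences; VS17 here is only illustrative):

```javascript
// An ideographic variation sequence is a base ideograph followed by a
// variation selector. VARIATION SELECTOR-17 (U+E0100) selects one
// registered glyphic subset of the base character U+845B.
const base = "\u{845B}";
const vs17 = "\u{E0100}";
const sequence = base + vs17;

// Still one "character" to the reader, but two code points in the text,
// and NFC normalization leaves the selector intact, so conformant
// Unicode systems are unaffected.
console.log([...sequence].length);                   // 2
console.log(sequence.normalize("NFC") === sequence); // true
```

Renderers without IVD support simply ignore the selector, which is why the mechanism does not break existing texts.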
        
         | jhanschoo wrote:
         | Even in Latin script languages you have issues if you don't
         | specify a lang tag. e.g. a font's 'fi' ligature may omit the
         | tittle on the 'i', but it is necessary in Turkish. Or you are
         | using a font without coverage for French and your browser
         | renders oe in another font.
         | 
         | The CJK variant issues from not specifying a lang are
         | indeed present for Latin scripts too, but to a smaller
         | extent.
        
         | dustintrex wrote:
         | It's a really complicated issue, because whether the character
         | is different or not is debatable. In English writing, we
         | consider a serif A to be the same character as a sans-serif A,
         | even though the glyph is obviously different, and neither do we
         | distinguish between a "French" A and a "German" A.
         | 
         | So what do we do with Guo  and Guo ? The first of those is
         | always used in simplified Chinese and usually in Japanese,
         | while the second is used in traditional Chinese and sometimes
         | in Japanese (eg. names). Is this one, two or three characters?
         | 
         | More on the topic:
         | https://en.wikipedia.org/wiki/Han_unification
        
           | afiori wrote:
           | As an armchair linguist, it is clearly two (not one, not
           | three) distinct "characters", going by how I'm guessing
           | native speakers think about them.
           | 
           | If all three had different codepoints and you replaced Guo
           | with Guo a lot of people would notice, less so if you
           | replaced Guo with Guo.
           | 
           | To my understanding the only argument in favour of Han
           | unification was that it would have taken up a lot of
           | codepoints otherwise.
        
             | wodenokoto wrote:
             | > by going how I am guessing native speakers think about
             | them.
             | 
             | I can assure you that Han unification happened at the hand
             | of native speakers.
        
           | SpicyLemonZest wrote:
           | It's worth noting this kind of distinction used to exist
           | in the Latin alphabet as well. For much of the 19th and
           | 20th centuries, German letters were different, as part of
           | a debate about whether German text should be written in
           | blackletter or Latin/Antiqua/what-other-Europeans-use
           | script.
        
             | dotancohen wrote:
             | We still have these issues in e.g. Hebrew and Yiddish,
             | Arabic and Persian, and tons of adapted Cyrillic scripts
             | from ex-Soviet states. Not to mention Northern-European
             | accented vowels, and cedilla letters such as Ç.
             | 
             | I'm personally of the belief that the accented and cedilla
             | characters should be exclusively stored as combining
             | character pairs, even if modern keyboard mappings require
             | only a single keypress. My own language stores every
             | character as two bytes (at a minimum), so the storage
             | aspect is a solved problem.
        
       | ksec wrote:
       | I am surprised how many comments here have never heard of Han
       | Unification. The problem is not new, and some of us have been
       | ranting about it for more than a decade: from the
       | UTF-8-Everywhere Manifesto in 2012 on HN [1], and a search [2]
       | on HN dates back to 2010.
       | 
       | I am also surprised at the support this problem now has, at
       | least in this thread. Generally speaking the Han Unification
       | problem doesn't get much, if _any_, support on HN. Not even
       | empathy. In the name of having Unicode become king they would
       | much rather sacrifice the CJK languages.
       | 
       | The answer or replies were always: it is a "glyph" problem,
       | not a "code" problem, stop asking Unicode to solve it.
       | 
       | patio11 aka Patrick McKenzie from Stripe has been the most
       | vocal critic of Han Unification, and sums it up far better
       | than I could, quote [3]:
       | 
       | >Reason the Han unification debate in Unicode got so acrimonious,
       | and why lots of Japanese people carry a chip on their shoulder
       | about it to this day.
       | 
       | >"Sorry, grandma, I know you've been sort of attached to your
       | name for the last 80 years, but the white folks find it
       | inconvenient for their computer systems. Don't worry, they
       | promise they'll make something close for you."
       | 
       | >Many of the clients of my ex-day job are married to legacy
       | encodings like Shift-JIS precisely because they do think that
       | their customers and students have a "right" to having their names
       | written correctly.
       | 
       | As mentioned in my other reply, Adobe gets lots of stick for
       | its subscription and malware-like Creative Cloud, but they do
       | [4] spend a huge amount of resources on CJK fonts, layout and
       | encoding (they have their own separate encoding for each CJK
       | language instead of using Unicode). Part of the reason why I
       | like PDF.
       | 
       | [1] https://news.ycombinator.com/item?id=3906253
       | 
       | [2]
       | https://hn.algolia.com/?dateRange=all&page=8&prefix=false&qu...
       | 
       | [3] https://news.ycombinator.com/item?id=1438749
       | 
       | [4] https://ken-lunde.medium.com/my-28-years-of-
       | adobelife-e97e70...
        
         | flubert wrote:
         | >"Sorry, grandma, I know you've been sort of attached to your
         | name for the last 80 years, but the white folks find it
         | inconvenient for their computer systems. Don't worry, they
         | promise they'll make something close for you."
         | 
         | Is there a resource to read more about this? I don't get that
         | vibe from things like:
         | 
         | https://www.unicode.org/versions/Unicode3.0.0/appA.pdf
        
       | YeGoblynQueenne wrote:
       | >> However, this issue is much more than the difference between,
       | say, the lowercase A with the overhang (a) or without (a).
       | 
       | Yes, but actually "a" is the Greek character _alpha_, whereas
       | "a" is the Latin character "a". So if you displayed "a" as
       | "a" to a Greek person that, too, would look all wrong.
        
       | peacefulhat wrote:
       | I'm grateful for han unification because I can search Chinese
       | words I only know in Japanese.
        
       | lifthrasiir wrote:
       | Unfortunately this article (and the linked one) only covers
       | Japanese issues. If you blindly apply these suggestions,
       | Chinese or Korean users may have problems. I'll list Korean
       | issues below, primarily because I'm Korean, but you may want
       | to interview actual CJK users (one of each, _not_ a single
       | user) for testing.
       | 
       | > Line breaking rules
       | 
       | This should link to W3C Requirements for CJK Text Layout [1]. The
       | Wikipedia article alone doesn't fully describe the complexity of
       | CJK typography.
       | 
       | The CJK languages have in common that they all have classes
       | of punctuation that can't be separated by a line break. But
       | there is one more thing to consider for Korean: both
       | word-based breaking and character-based breaking are possible
       | depending on the context. The general rule is to use
       | word-based breaking for larger texts and character-based
       | breaking for smaller texts, but there is no clear threshold
       | so you _really_ want to consult Korean users for testing.
       | 
       | [1] https://www.w3.org/TR/clreq/ (Chinese),
       | https://www.w3.org/TR/jlreq/ (Japanese),
       | https://www.w3.org/TR/klreq/ (Korean)
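For what it's worth, the word-based vs. character-based distinction maps onto CSS properties; a sketch (the class names are made up, the properties are from CSS Text, and engine behavior varies, so this still needs testing with Korean users):

```css
/* Longer Korean body text: break between words (at spaces), as in Latin. */
.ko-longform { word-break: keep-all; }

/* Narrow columns or UI labels: allow breaks between characters. */
.ko-compact { word-break: normal; }

/* Enforce the stricter CJK rules for punctuation that must not
   begin or end a line. */
.cjk-strict { line-break: strict; }
```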
       | 
       | > Messaging Apps: Do not directly hook to the Enter key to submit
       | messages
       | 
       | This advice is also problematic. In pretty much all Japanese
       | and most Chinese IMEs, input goes through a candidate window,
       | so pressing Enter should not submit messages; but in some
       | Chinese and virtually all Korean IMEs there is no automatic
       | candidate window, and pressing Enter should submit messages.
       | 
       | In an ideal world, detecting a newline as suggested by the
       | article would have solved this issue, but that got
       | complicated by clueless pan-CJK IME implementations. They
       | generally assume candidate windows even for Korean, so they
       | do not commit text on Enter, and that's very inconvenient for
       | Korean users. Therefore it is recommended to detect a newline
       | by default, but also to have an option to submit messages on
       | Enter.
        
         | needle0 wrote:
         | I updated both sections according to your suggestions. Thanks!
        
           | needle0 wrote:
           | Was notified by someone else about the isComposing
           | attribute -- https://developer.mozilla.org/en-
           | US/docs/Web/API/KeyboardEve... At least for web stuff, do
           | you think checking for this before treating the Enter key
           | as Submit would work in both IMEs with and without input
           | buffers?
        
             | lifthrasiir wrote:
             | The problem is that those clueless IMEs do intercept the
             | Enter key contrary to user expectation, so I think you
             | can't distinguish those IMEs from Chinese and Japanese IMEs
             | that should intercept the Enter key as expected.
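For the web case, the logic being discussed might be sketched like this (the `shouldSubmit` helper and `submitOnEnter` option are hypothetical names; per the caveat above, misbehaving IMEs can still defeat the `isComposing` signal):

```javascript
// Decide whether an Enter keydown should submit the message.
// `submitOnEnter` is the user-facing option suggested in the thread.
function shouldSubmit(event, submitOnEnter) {
  if (event.key !== "Enter") return false;
  // While an IME composition is active, Enter commits the composition,
  // so it must never submit (keyCode 229 is the legacy signal for this).
  if (event.isComposing || event.keyCode === 229) return false;
  // Shift+Enter conventionally inserts a newline instead of submitting.
  if (event.shiftKey) return false;
  return submitOnEnter;
}

// Plain Enter with the option enabled submits:
shouldSubmit({ key: "Enter", isComposing: false, keyCode: 13, shiftKey: false }, true); // true
// Enter during IME composition never submits:
shouldSubmit({ key: "Enter", isComposing: true, keyCode: 229, shiftKey: false }, true); // false
```

Exposing `submitOnEnter` as a user setting (defaulting to newline-on-Enter) is the compromise lifthrasiir recommends above.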
        
       | simonlc wrote:
       | Really well done post and good idea with the title!
       | 
       | - Simon from TGM :)
        
       | brigandish wrote:
       | > Japanese text written in incorrect glyph sets will stand out
       | similarly to any native speaker of Japanese, and will give off a
       | connotation that whoever developed this app does not care about
       | this (often large) subset of the global user population.
       | 
       | More likely they'll think the content was written by a non-native
       | Japanese speaker, judge whether that makes you trustworthy or not
       | (based on personal experience or stereotypes or prejudice,
       | probably a bit of all three (we're all human)) and then not buy
       | from you. A good example would be Amazon listings in Japanese
       | that Japanese people can tell were almost certainly written by
       | someone Chinese, and then decide not to buy.
       | 
       | If you want the cash, get a proper translation. Ironically, Japan
       | is filled to the brim with incredibly poor English and abounds
       | with stories of native English speakers' translations and
       | corrections being disregarded because "it doesn't sound right"...
       | to someone who can't string a legible English sentence together.
        
       | euske wrote:
       | This is the reason why Adobe PDF doesn't rely on Unicode.
       | Adobe products have had a huge presence in Japan since the
       | 90s and they had to appeal to the printing industry, which is
       | very anal about this kind of issue. So they ended up using a
       | separate encoding for every language. Today, CJK characters
       | in PDF are encoded in Adobe-GB1 (mainland China), Adobe-CNS1
       | (Taiwan and Hong Kong), Adobe-Japan1, and Adobe-Korea1
       | respectively. Not the cleanest way, but it gets the job done.
        
         | makeitdouble wrote:
         | Thanks for the pointer, that's pretty interesting.
         | 
         | Looking at their doc [0] it seems they used Adobe-Japan1 to
         | wrap a much wider set of characters than any single
         | encoding standard, including ligatures, vintage encodings,
         | etc.
         | 
         | It seems to be a pretty big piece of work, and it kind of
         | fits with the image of PDF handling being such a monumental
         | beast.
         | 
         | [0] https://github.com/adobe-type-tools/Adobe-Japan1/
        
         | lifthrasiir wrote:
         | Note that they are now adopted by the Unicode Ideographic
         | Variation Database [1] among other variation databases.
         | 
         | [1] https://unicode.org/ivd/
        
         | ksec wrote:
         | Adobe gets lots of stick for its subscription and
         | malware-like Creative Cloud. But they do spend a _huge_
         | amount of resources on CJK fonts, layout and encoding.
         | 
         | And part of the reason why I like PDF.
         | 
         | ( Behind a Paywall ) https://ken-lunde.medium.com/my-28-years-
         | of-adobelife-e97e70...
        
       | Asooka wrote:
       | Can't this be solved somewhat by adding a "cjk mode" zero-width
       | character, like we have right-to-left/left-to-right embedding
       | characters? Yes, yes, it's yet another standard, but there
       | doesn't seem to be any way to indicate in the text stream itself
       | what characters to use otherwise.
        
       | iforgotpassword wrote:
       | Minor addition/clarification, just in case:
       | 
       | > If the glyphs don't exactly look like the Japanese result
       | sample below, your code is displaying Japanese wrong.
       | 
       | Maybe _exactly_ isn't the right term here; it doesn't need
       | to be pixel-perfect. There are still different font faces,
       | just like with Western languages, for example one that's
       | supposed to make them look more natural or handwritten, and
       | one for print, etc.
       | 
       | Also, afaict Han unification was a mistake, but if you
       | thought you'd only ever have 65,535 code points available it
       | might have been tempting.
        
         | needle0 wrote:
         | Good point. Will reword that part.
        
       | wodenokoto wrote:
       | I think this page does a poor job of explaining that all 3
       | _blade_ characters in the first example share the same code
       | point in Unicode, but are to be displayed/rendered
       | differently depending on which language they are shown as
       | part of.
       | 
       | It is there in the text, but it's almost hidden between the
       | lines.
       | 
       | If I were a developer with no knowledge of Han characters or
       | Han unification, I would read two thirds of this article
       | thinking I'm doing it right, so why am I reading this? E.g.:
       | "but I am using the correct code point. It's the character
       | that the user entered!" or "I copy-pasted it from a Japanese
       | text, what do you mean I'm using the wrong character?" before
       | reaching the "how to fix it", and even then I might not
       | realize the root cause.
       | 
       | With that in mind I might not even make it to the part about
       | how to fix the problem and learn that I am using the right
       | character/code point, but it is still displayed wrong.
        
         | needle0 wrote:
         | I agree the page is somewhat roundabout in its current state
         | since I went from the background to the symptom to the fix.
         | Open to suggestions on rearranging the article so that more
         | devs can implement fixes.
        
           | wodenokoto wrote:
           | I would put the blade-character chart a little earlier,
           | make it very obvious that each character shares a code
           | point (maybe add a code point column), and say in the
           | text that yes, this is weird: 3 visually distinct Unicode
           | characters share a single code point.
           | 
           | For me this was very hard to wrap my head around the first
           | time I encountered the problem. Maybe other people find it
           | hard to understand in different ways.
           | 
           | I believe Unicode even claims that distinct-looking
           | characters are to have their own code points, but
           | similar-looking characters should share a code point
           | (e.g., there is no French a and English a, even though
           | they are pronounced differently; and Danish ø and Swedish
           | ö are pretty much the same pronunciation but written
           | differently, so they don't share a code point.)
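The share-a-code-point point can be checked directly: the national variants compare as the very same Unicode scalar value, so no string operation can tell them apart. A small sketch using U+5203 ("blade") as an example:

```javascript
// One code point, language-dependent glyphs: the Japanese and Chinese
// renderings of this character are the same Unicode scalar value.
const blade = "\u5203"; // 刃
console.log(blade.codePointAt(0).toString(16)); // "5203"

// Only out-of-band information (an HTML lang attribute, font choice,
// user locale) decides which glyph the reader actually sees.
console.log(blade === "刃"); // true
```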
        
             | needle0 wrote:
             | Thanks. I reworded it a bit and put more emphasis on code
             | points.
        
       | madsohm wrote:
       | The second character displays wrong for me when copying into VS
       | code with Cascadia Code PL font.
        
         | skhr0680 wrote:
         | That's why Han unification is a mess.
        
       | zzo38computer wrote:
       | My own programs are specifically designed to not use Unicode. I
       | think that Unicode is really messy and I dislike it. If you want
       | to display Japanese text, EUC-JP can be used.
        
         | Matheus28 wrote:
         | I really hope you're being sarcastic
        
           | zzo38computer wrote:
           | I am not sarcastic. I don't like Unicode.
        
             | patrec wrote:
             | If, to stick just to Japanese, you like this:
             | 
             | https://upload.wikimedia.org/wikipedia/commons/b/ba/JIS_and
             | _...
             | 
             | better than Unicode, you can't be helped.
             | 
             | Not that I like unicode much either -- amongst other things
             | the idiotic arrangement of codepoints makes it basically
             | impossible to do remotely efficient text processing;
             | e.g. here's a graph of the automaton RE2 uses to check
             | if something is an uppercase character:
             | 
             | https://swtch.com/~rsc/regexp/cat_Lu.png
             | 
             | (For ASCII there would be exactly one arrow connecting
             | two nodes.)
        
               | arp242 wrote:
               | And here I was thinking that "Unix variants history" or
               | "Linux audio systems" graphs were messy and
               | complicated...
        
         | yorwba wrote:
         | This is about display, not encoding. Using EUC-JP to store text
         | doesn't guarantee that it will be rendered with a Japanese
         | font.
        
           | lmm wrote:
           | In practice it does, if it will be rendered at all. Elsewhere
           | on this very page you can find people suggesting storing the
           | display locale alongside the unicode string, which is really
           | the only way to solve this problem in the general case - but
           | in that case you might as well store pairs of byte sequence
           | and encoding, there's not much difference between that and
           | unicode string and locale.
        
             | yorwba wrote:
             | > In practice it does, if it will be rendered at all.
             | 
             | Seems like you're right, at least as far as Firefox is
             | concerned. Testing the data links below, it appears to
             | guess the default language based on the encoding used.
             | Neat!
             | 
             | data:text/html;charset=euc-jp;base64,PHA+RGVmYXVsdDogv8/Evr
             | Oks9G5/Mb+PC9wPjxwIGxhbmc9ImphIj5sYW5nPSJqYSIgv8/EvrOks9G5/
             | Mb+PC9wPjxwIGxhbmc9ImtvIj5sYW5nPSJrbyIgv8/EvrOks9G5/Mb+PC9w
             | PjxwIGxhbmc9InpoLWNuIj5sYW5nPSJ6aC1jbiIgv8/EvrOks9G5/Mb+PC9
             | wPjxwIGxhbmc9InpoLWhrIj5sYW5nPSJ6aC1oayIgv8/EvrOks9G5/Mb+PC
             | 9wPjxwIGxhbmc9InpoLXR3Ij5sYW5nPSJ6aC10dyIgv8/EvrOks9G5/Mb+P
             | C9wPgoK
             | 
             | data:text/html;charset=euc-kr;base64,PHA+RGVmYXVsdDog7NPywf
             | qtysfN6ez9PC9wPjxwIGxhbmc9ImphIj5sYW5nPSJqYSIg7NPywfqtysfN6
             | ez9PC9wPjxwIGxhbmc9ImtvIj5sYW5nPSJrbyIg7NPywfqtysfN6ez9PC9w
             | PjxwIGxhbmc9InpoLWNuIj5sYW5nPSJ6aC1jbiIg7NPywfqtysfN6ez9PC9
             | wPjxwIGxhbmc9InpoLWhrIj5sYW5nPSJ6aC1oayIg7NPywfqtysfN6ez9PC
             | 9wPjxwIGxhbmc9InpoLXR3Ij5sYW5nPSJ6aC10dyIg7NPywfqtysfN6ez9P
             | C9wPgoK
             | 
             | data:text/html;charset=gb2312;base64,PHA+RGVmYXVsdDogyNDWsb
             | qjvce5x8jrPC9wPjxwIGxhbmc9ImphIj5sYW5nPSJqYSIgyNDWsbqjvce5x
             | 8jrPC9wPjxwIGxhbmc9ImtvIj5sYW5nPSJrbyIgyNDWsbqjvce5x8jrPC9w
             | PjxwIGxhbmc9InpoLWNuIj5sYW5nPSJ6aC1jbiIgyNDWsbqjvce5x8jrPC9
             | wPjxwIGxhbmc9InpoLWhrIj5sYW5nPSJ6aC1oayIgyNDWsbqjvce5x8jrPC
             | 9wPjxwIGxhbmc9InpoLXR3Ij5sYW5nPSJ6aC10dyIgyNDWsbqjvce5x8jrP
             | C9wPgoK
             | 
             | data:text/html;charset=big5;base64,PHA+RGVmYXVsdDogpGKqva78
             | qKSwqaRKPC9wPjxwIGxhbmc9ImphIj5sYW5nPSJqYSIgpGKqva78qKSwqaR
             | KPC9wPjxwIGxhbmc9ImtvIj5sYW5nPSJrbyIgpGKqva78qKSwqaRKPC9wPj
             | xwIGxhbmc9InpoLWNuIj5sYW5nPSJ6aC1jbiIgpGKqva78qKSwqaRKPC9wP
             | jxwIGxhbmc9InpoLWhrIj5sYW5nPSJ6aC1oayIgpGKqva78qKSwqaRKPC9w
             | PjxwIGxhbmc9InpoLXR3Ij5sYW5nPSJ6aC10dyIgpGKqva78qKSwqaRKPC9
             | wPgoK
        
             | kalleboo wrote:
             | Aren't the modern text display APIs of the most popular
             | OSes all Unicode-based now? It seems likely that they will
             | convert to Unicode when told to display a string in a
             | different codepage and replace the locale info with the
             | default Unicode behavior (of basing it on the user locale)
        
         | innocenat wrote:
         | And what if you want to display Japanese and Korean at the same
         | time?
        
       | wodenokoto wrote:
       | Wonder how well HN and my phone handle this. There are
       | supposed to be Unicode code points that indicate which locale
       | a character is supposed to be displayed in. If things are
       | well thought out, my phone should add them automatically, HN
       | should keep them, and your browser should render it correctly.
       | 
       | On an iPhone using,
       | 
       | Chinese simplified keyboard: Ren
       | 
       | Japanese keyboard: Ren
       | 
       | So that didn't go very well. When choosing the character on my
       | Chinese keyboard it is displayed with correct Chinese strokes but
       | turns into the Japanese version in the text box. I'm guessing for
       | most of you reading, both will appear Chinese.
       | 
       | EDIT: Someone better than me at wrangling Unicode can maybe
       | try out the variation selectors, and print the correct
       | variations in a comment. I think it would have been neat if
       | my keyboard IME did it for me :)
       | 
       | https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_b...
        
         | wasmitnetzen wrote:
         | Yep, Simplified Chinese for me (with LANG=en_US.UTF-8).
        
         | aikinai wrote:
         | They both appear Japanese for me on mobile Safari. But Japanese
         | is my second preferred language on the device (after English),
         | so that's probably why.
        
       | thrdbndndn wrote:
       | I'm a native CJK user myself and well aware of this
       | phenomenon, but honestly it's not that bad in most cases, for
       | the following reasons:
       | 
       | 1. Websites in Japanese are likely tagged with lang=ja
       | already, so they will display fine. Unfortunately, this
       | practice seems to be less followed by Chinese sites. I checked
       | a few top sites: qq.com does have lang=zh-cn, while baidu.com
       | and sina.com.cn don't.
       | 
       | 2. The majority of UI elements in the OS will prioritize the
       | display language you set when choosing variants. This means
       | that if users are reading content in Japanese while also using
       | a Japanese UI, the glyphs will be correct. Of course, this
       | causes problems if a Japanese user is reading Chinese or vice
       | versa, but such scenarios are in the minority.
       | 
       | Another scenario, which I think is more common, is when
       | someone is using a Latin-language UI. For example, lots of my
       | (Chinese/Japanese) friends use an English UI while reading
       | Chinese/Japanese a lot. The OS in this case will default to
       | one variant (I believe Apple by default chooses Japanese) and
       | therefore display the other language's glyphs wrong (side
       | note: for web pages, desktop browsers often have their own
       | font/glyph fallback logic on top of the OS one).
       | 
       | 3. Most people are just not sensitive to such things. I've
       | pointed it out to lots of people (when, due to their settings,
       | some glyphs were displayed wrong, like Men), and they couldn't
       | care less.
       | 
       | Also, there is no simple "fix" if you have multi-language
       | content. Without manually assigning a <lang> tag to every
       | single string, you can't display both Japanese and Chinese
       | correctly at the same time. It isn't worth the hassle for just
       | a few phrases in a text. A good example is Wikipedia: they
       | have templates for all kinds of languages so you can display
       | them correctly even if it's just one Japanese word on, say,
       | the English Wikipedia. And Wiki editors do use them all the
       | time!
        
         | makeitdouble wrote:
         | > 3. Most people are just not sensitive to such things.
         | I've pointed it out to lots of people (when, due to their
         | settings, some glyphs were displayed wrong, like Men), and
         | they couldn't care less.
         | 
         | I think it's because most people don't deal with it in
         | large amounts. I heard a lot more complaints from people
         | using Android phones that didn't have Japanese fonts by
         | default. By the third or fourth page they started to care,
         | and once they noticed it the frustration just stacked up
         | (it's just a matter of adding fonts, so not a big deal).
         | 
         | Otherwise the writing systems are flexible enough for small
         | variants not to be triggering (I mean, people can already
         | read calligraphy...)
        
           | thrdbndndn wrote:
           | Android has Japanese fonts.
           | 
           | Noto Sans is basically Source Han, one of the best free
           | CJK fonts. Not sure what you mean.
           | 
           | https://en.wikipedia.org/wiki/Source_Han_Sans
        
             | makeitdouble wrote:
             | If you buy a Xiaomi phone, for instance, anywhere
             | outside Japan and open a Japanese website, it will be
             | displayed with Chinese glyphs. It depends on the phone,
             | but there will be setup needed: sometimes just switching
             | the whole phone's language will do the trick, sometimes
             | you need to add the right fonts yourself.
        
         | gpderetta wrote:
         | Is there a reason Unicode doesn't have such a built-in lang
         | tag? Similar to the right-to-left and left-to-right
         | embeddings, it could help in displaying otherwise-identical
         | text differently.
         | 
         | It could be stored in-band with the text with little changes to
         | existing systems. The only change would be on the presentation
         | layer, and if the tag were to be a non printable character, it
         | would be backward compatible. An input device could implicitly
         | tag input texts depending on the default lang.
         | 
         | You need some form of sanitization, but you need it for right-
         | to-left and left-to-right already.
        
           | arp242 wrote:
           | Actually, this exists already - there's U+E0001 (LANGUAGE
           | TAG) and U+E007F (CANCEL TAG) and you can put a language
           | code between those, e.g. "\uE0001ja-JP\uE007F".
           | 
           | Its use is also deprecated and discouraged. According to
           | [1] it's often not needed, and [2] states that it puts a
           | lot of burden on implementations and is best done at a
           | higher level such as HTTP, HTML, etc.
           | 
           | I have no opinion on [1] as I don't speak these languages,
           | but I do know I really hate working with these "invisible
           | characters" in Unicode both as a user and developer. Copy an
           | extra invisible LTR thingy or display variant codepoint and
           | stuff can look and behave different, and it may not at all be
           | obvious what the hell is going on (especially for those
           | without a technical background).
           | 
           | [1]: https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf#
           | G115...
           | 
           | [2]: https://www.unicode.org/versions/Unicode14.0.0/ch23.pdf#
           | G301...
        
             | lifthrasiir wrote:
             | > there's U+E0001 (LANGUAGE TAG) and U+E007F (CANCEL
             | TAG) and you can put a language code between those,
             | e.g. "\uE0001ja-JP\uE007F".
             | 
             | "ja-JP" part is also written in tag characters, so it's
             | actually E0001 E006A E0061 E002D E004A E0050 E007F and
             | doesn't render even in unsupported environments.
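The deprecated tagging scheme is mechanical enough to sketch: each ASCII character of the language code is shifted into the tag block (U+E0000 plus its code point) and wrapped in LANGUAGE TAG / CANCEL TAG. This is shown purely to illustrate the encoding; its use is deprecated, and `toLanguageTag` is a made-up name:

```javascript
// Encode a BCP 47 language code with the deprecated Unicode tag
// characters: U+E0001 LANGUAGE TAG, then U+E0000 + each ASCII code,
// then U+E007F CANCEL TAG.
function toLanguageTag(bcp47) {
  const body = [...bcp47]
    .map((c) => String.fromCodePoint(0xe0000 + c.codePointAt(0)))
    .join("");
  return "\u{E0001}" + body + "\u{E007F}";
}

const tag = toLanguageTag("ja-JP");
console.log([...tag].map((c) => c.codePointAt(0).toString(16).toUpperCase()));
// ["E0001", "E006A", "E0061", "E002D", "E004A", "E0050", "E007F"]
```

The whole sequence consists of default-ignorable code points, which is why it is invisible even in environments that don't understand it.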
        
               | arp242 wrote:
               | Oh right; I never actually used it, I just know it
               | exists.
        
           | layer8 wrote:
           | > Is there a reason unicode doesn't have such a builtin lang
           | tag?
           | 
           | It does, but they aren't widely supported and their use is
           | not recommended, see
           | http://unicode.org/faq/languagetagging.html and
           | https://datatracker.ietf.org/doc/html/rfc6082. The
           | recommendation is to use markup languages instead to carry
           | that information.
           | 
           | The reasoning is that anything related to styling is out-of-
           | scope for Unicode (except where needed for round-trip
           | compatibility with other character sets), or else people will
           | also want tags for bold, italic, monospace, or (expressed
           | semantically) for emphasis, code, etc. That's what markup
           | languages like HTML are for.
        
         | marginalia_nu wrote:
         | Re: lang="ja"
         | 
         | My experience with web crawling is that use of the lang tag
         | seems inconsistent at best. To make matters worse, sometimes
         | content is straight-up mislabeled, although Japanese sites
         | often helpfully declare that they are using the Shift_JIS
         | charset rather than UTF-8, which at least helps in figuring
         | out that the page is Japanese.
        
         | Asooka wrote:
         | Out of curiosity, what do you do when you put a quote in
         | Chinese from a Chinese author inline in Japanese text? Are you
         | expected to write it using the Chinese forms of the characters,
         | or do you write them using the Japanese forms?
         | 
         | Edit: I mean what is the expected (grammatically correct) way
         | to do it if you were writing with pen on paper.
        
         | cehrlich wrote:
         | I agree that for CJK natives it's probably not such a big
         | deal (unless they live their life in more than one of those
         | languages). For people like me, who primarily use their
         | computer in English but also do some stuff in Japanese
         | every now and then, it's very frustrating. Of course at
         | this point I know what's going on when Zhi or whatever
         | looks wrong, but it's still frustrating.
         | 
         | OSs and Browsers having their own logic for it actually makes
         | things _worse_ in some cases. Windows is especially bad
         | (different types of UI elements care about different settings
         | or don't care at all, so good luck having apps render correctly
         | if you don't change your entire OS locale), and Chrome is
         | pretty bad too, again especially on Windows. Overall,
         | macOS/iOS and Safari do the best job by far.
         | 
         | The failed attempt at Han Unification[1] is the worst decision
         | the Unicode people have ever made.
         | 
         | [1] https://en.wikipedia.org/wiki/Han_unification
        
           | nanis wrote:
           | > The failed attempt at Han Unification[1] is the worst
           | decision the Unicode people have ever made.
           | 
           | At first I nodded my head in agreement, but then I decided I
           | still think the failure to include separate code points for
           | "lower case Turkish dotted I" and "upper case Turkish dotless
           | I" is worse.
           | 
            | You can't have 'i' eq lc( uc 'i' ) unless you already
            | know you are processing Turkish ... a completely
            | unnecessary complication.
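The round-trip failure is easy to reproduce with Python's locale-independent case mapping:

```python
# With the default (non-Turkish) rules, the round trip holds:
assert "i".upper() == "I" and "I".lower() == "i"

# Turkish needs "i" to uppercase to U+0130 (I WITH DOT ABOVE). Lowering
# U+0130 with the default Unicode rules yields "i" + U+0307 COMBINING
# DOT ABOVE - two codepoints, not the plain "i" a Turkish writer typed.
lowered = "\u0130".lower()
print(lowered == "i")                        # False
print([f"U+{ord(c):04X}" for c in lowered])  # ['U+0069', 'U+0307']
```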
        
             | naniwaduni wrote:
             | Turkish I unification, at least, wasn't a _decision_ the
             | Unicode people made, they inherited the mistake from
             | earlier encodings. Given that those already existed, the
              | alternative to having broken casefolding was,
              | essentially, to break all mixed Turkish documents
              | transcoded from cp857 containing both "I" and "i" in
              | non-Turkish functional directives, i.e. you'd
              | necessarily break things like HTML documents without
              | consistent tag casing.
        
         | jhanschoo wrote:
         | IMO the most jarring issue that comes from Han unification
         | without proper language setting isn't with the glyph variants,
         | since you actually encounter these variants and more in
         | everyday life (albeit more rarely than your national standard
         | ones). The more jarring issue is when your software selects
         | a font meant for the wrong language, and uses the font for
         | the correct language only as a fallback. Then you may
         | encounter serious style issues where your text is pockmarked
         | by glyphs for another language; you see a similar phenomenon
         | with European languages that use the Latin alphabet with
         | unconventional accents. But note that
         | it takes way more effort to ask a CJK foundry to cover all
         | codepoints for all languages than to ask a Latin font designer
         | to cover all languages, so you would be hard pressed to find
         | fonts that actually do that.
         | 
         | Without Han unification, this wouldn't really be a problem,
         | but Han unification is to a large extent the same philosophy
         | pursued with the unification of Latin scripts (and other
         | scripts).
         | needle0 wrote:
         | My original motive to write this page came mostly from
         | issues in video games, where text is often displayed using
         | custom routines and the built-in rules in the OS & browser
         | can't help. The issue crops up most often in indie games,
         | but it can be seen even in high-profile, high-budget games
         | like Half-Life: Alyx or Resident Evil 4 VR.
        
         | eloisant wrote:
         | What you're saying is true (although I'm not sure about 3, all
         | the Japanese people I've talked to are annoyed by that), but it
         | really sucks that we're dealing with problems that were
         | supposed to be fixed by Unicode.
         | 
         | Han unification has been a huge mistake, made to save a few
         | thousand characters, and now we keep piling on more and more
         | stupid emoji.
        
           | dotancohen wrote:
            | > Han unification have been a huge mistake, to save a few
            | > thousands of characters, and now we keep piling on more
            | > and more stupid emoji.
           | 
            | This is my exact problem with Unicode. I'm very grateful
            | for the efforts that they have made in the past, but the
            | change
           | from "spare valuable codepoints at the expense of causing
           | ambiguity in text" to "assign a new codepoint to every
           | cartoon permutation of intangible nouns" is infuriating.
        
             | oleganza wrote:
              | 25 years ago computers and networks were different.
              | Today text is <0.1% of traffic compared with video, and
              | you have billions of bytes of RAM in every pocket
              | computer. So yes, you can pile on more emojis and no
              | one will be bothered.
             | 
              | Unicode might never have been adopted at all if it had
              | an even larger set for CJK and made all CJK texts
              | 1.5-2x larger than in the Han-unified version, due to
              | longer encoding. Also: UTF-8 did not exist yet, and
              | most systems treated text as arrays of fixed-length
              | characters.
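The size argument is concrete: in UTF-8, byte length grows with codepoint value, so pushing CJK above U+FFFF costs an extra byte per character. A quick check:

```python
# UTF-8 byte cost per codepoint range: ASCII is 1 byte, most BMP CJK
# ideographs are 3 bytes, supplementary-plane ideographs are 4 bytes.
samples = {
    "ASCII 'a' (U+0061)": "a",
    "BMP ideograph (U+76F4)": "\u76f4",
    "plane-2 ideograph (U+20000)": "\U00020000",
}
for label, ch in samples.items():
    print(f"{label}: {len(ch.encode('utf-8'))} byte(s)")
```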
        
               | anthk wrote:
                | UTF-8 did exist 25 years ago, but only in Plan 9.
        
             | arp242 wrote:
             | Han unification and Emojis are from a different time. We're
             | talking about a decision made about 30 years ago in the
             | early 90s; Unicode was 16 bits (65k total codepoints) at
             | the time.
             | 
             | The original CJK Unified Ideographs block from 1992
             | consists of 21k codepoints. It was impossible to do that
              | four times, since 4x21k is more than 65k, and space
              | was needed for some other languages as well. Why not
              | make
             | Unicode larger? Well, size was a real concern back then
             | (still is, to some degree, but less so).
             | 
             | Since then Unicode has expanded and now we have slightly
             | under 1 million codepoints. Han character blocks have
             | extended to about 93k codepoints today, and 4 times ~93k
              | codepoints is actually feasible. But now you run into
             | compatibility issues: you can't remove all the old Han
             | unification stuff (it will break text, big no-no), so you
             | need to re-define it all anew. Is that better? How about
             | mixing "old" Han unified codepoints with new Japanese or
             | Chinese stuff? Will it really improve things or just cause
             | endless confusion (see: combining characters)?
             | 
             | For scale, all of Unicode currently defines about 145k
              | codepoints; so even _with_ Han unification we're talking
             | about two thirds being taken up by just these three
             | languages.
             | 
              | In comparison, there are currently about 3,000 emojis,
              | although the number of codepoints is much smaller since
              | many codepoints are reused (e.g. "firefighter" is
              | "person + firetruck", flags use the country code,
              | etc.). A quick check suggests there are about 1,000 to
              | 1,500 codepoints reserved for emojis. Next to the Han
              | blocks, this is nothing.
             | 
             | What I'm trying to say is that the (comparatively) very low
             | number of emojis has absolutely no bearing on this and that
             | going off on a tangent about it is very misplaced.
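The codepoint reuse is visible directly in Python; for example the firefighter and flag sequences:

```python
import unicodedata

# "Firefighter" has no codepoint of its own: it is ADULT + ZERO WIDTH
# JOINER + FIRE ENGINE, stitched together at render time.
firefighter = "\U0001F9D1\u200D\U0001F692"
print([unicodedata.name(c) for c in firefighter])

# Flags reuse two regional-indicator letters per country: here J + P.
japan_flag = "\U0001F1EF\U0001F1F5"
print(len(japan_flag))  # 2 codepoints, one rendered flag
```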
        
               | dotancohen wrote:
                | I have no problem with 2/3 of the codepoints being
                | taken up by 3 languages. Right now we (rightly) bend
                | over backwards to accommodate handicapped users,
                | often tripling or quadrupling our QA effort. CJK
                | users are much more common than handicapped users, so
                | the benefit-vs-cost ratio is even greater for CJK
                | users.
        
               | arp242 wrote:
               | > I have no problem with 2/3 of the codepoints being
               | taken up by 3 languages.
               | 
                | I have no problem with this either, at least not in
                | principle. But historically it was literally
                | impossible. Someone thought of a clever hack that
                | seemed like a good idea at the time, but it turns out
                | it doesn't work all that great after all (at least
                | according to some - opinions seem to differ and I
                | can't really judge myself), and now you're stuck with
                | it, and fixing it isn't so easy. I don't know if
                | people have made concrete proposals for fixing this,
                | but if it were easy it probably would have been done
                | already. Sometimes sticking with a suboptimal
                | "legacy" solution is better than replacing it with a
                | new, better solution, due to the friction and issues
                | involved.
        
       | sleepy_keita wrote:
       | Oh, this explanation is perfect to give to people when I
       | encounter this error. Thanks!
        
       | xvilka wrote:
       | It's more of a problem in pure-text apps than on the Web: for
       | example, in editors (not the rich-text ones), the console, and
       | interface elements. But yes, it is a problem for people who
       | know (or are learning) and use multiple languages at once,
       | e.g. English, Chinese, and Japanese.
        
       | captainmuon wrote:
       | How do you do it correctly in a bilingual app? Say your app is
       | in English but you want to display Asian-language file names.
       | Is there any way to tell whether a string is Chinese or
       | Japanese? I think CJK variation selectors embedded in the
       | string are not widely used. And it would be a bit overkill to
       | include a language-detection heuristic (which would likely
       | fail for short phrases). So should you let the user decide?
       | Default to Japanese on a Japanese PC, and otherwise leave it
       | undefined?
        
         | skhr0680 wrote:
         | Set it by the device language, with a way to override that in
         | your app's settings
        
           | lifthrasiir wrote:
           | This is a good approximation, but it is still incorrect if,
           | say, you are showing Japanese user names from a view for
           | Chinese users.
        
       | hannob wrote:
       | I didn't know about this, but I can't help but think it
       | sounds like a bug in Unicode. If these characters are
       | different, then why does Unicode assign one codepoint to them?
       | Wasn't the promise of Unicode exactly not to do this kind of
       | thing?
       | 
       | Can this be fixed? New codepoints for the ambiguous characters
       | could be assigned. Of course, this would require manual
       | conversion (with knowledge of the variant) for existing data,
       | but at least it would make the issue go away moving forward
       | (and unconverted legacy data would be "just as bad" as it used
       | to be, so no loss).
        
         | oleganza wrote:
         | This issue with Han unification was a big reason for the
         | stalled adoption of Unicode/UTF-8 in Ruby for years. UTF-8
         | by default came to Ruby only after 1.9 added thorough
         | support for a variety of encodings, so that UTF-8 is not the
         | only option.
        
         | oleganza wrote:
         | Han unification started in the 90s, when computers were big,
         | memory was small, UTF-8 did not exist, and people were
         | trying to fit all characters into a reasonable number of
         | codepoints. Today, with the variable-length encoding of
         | UTF-8 and video streams over 5G, supporting all variants as
         | distinct codepoints, and patching text search and sorting
         | with more "normalization" algorithms, would not be a problem
         | at all.
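Unicode already ships normalization machinery for look-alike codepoints, which is the kind of patching described here; the Kangxi radicals are an existing case of "same shape, distinct codepoints":

```python
import unicodedata

# KANGXI RADICAL ONE (U+2F00) and the unified ideograph U+4E00 look the
# same but are distinct codepoints; NFKC normalization folds the
# radical into the ideograph so search and comparison can treat them
# alike. Per-language Han codepoints would need analogous folding.
radical, ideograph = "\u2F00", "\u4E00"
print(radical == ideograph)                                 # False
print(unicodedata.normalize("NFKC", radical) == ideograph)  # True
```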
        
           | afiori wrote:
            | In my opinion, UTF-8 should have been a bigger
            | variable-length encoding. Today it is:
           | 
           | 0xxxxxxx
           | 
           | 110xxxxx 10xxxxxx
           | 
           | 1110xxxx 10xxxxxx 10xxxxxx
           | 
           | and
           | 
           | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
           | 
           | the only reason not to push those last bits and add
           | 
           | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
           | 
           | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
           | 
           | 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
           | 10xxxxxx
           | 
           | and maybe even
           | 
           | 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
           | 10xxxxxx 10xxxxxx
           | 
            | is UTF-32; they should have dropped it and solved the
            | codepoint problem this way.
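The existing one-to-four-byte patterns listed above can be checked directly:

```python
# Print the bit patterns of UTF-8 encodings: the lead byte announces
# the sequence length and every continuation byte matches 10xxxxxx.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):05X} -> {bits}")
```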
        
             | lifthrasiir wrote:
             | First I'd like to introduce you to this: http://ucsx.org/
             | 
             | But no, there is no particular reason to introduce a longer
             | encoding than the modern UTF-8 (which is actually shortened
             | from the original one-to-six-byte encoding). The current
             | set of 1,114,112 Unicode characters is sufficient for at
             | least the foreseeable future, because any new assignment
             | requires a demonstrable historic or current use. (Emojis
              | are slightly different, but they still require that the
              | underlying concept be widespread and not overlap
              | significantly with existing emojis. See [1].) Han
              | characters are the largest source of new assignments to
              | date, and they have yet to fill two of the 17 planes
              | (which would equate to 131K characters).
             | 
             | [1] https://news.ycombinator.com/item?id=26904980
        
         | jhanschoo wrote:
         | This isn't a uniquely CJK problem; see my other comment:
         | https://news.ycombinator.com/item?id=29024422
         | 
         | The other approach is to assign each language distinct
         | codepoints, but I guess the current approach is better for
         | backward compatibility with pre-Unicode encodings and for
         | less redundancy in Latin-script documents.
        
       | nikanj wrote:
       | Am I right in assuming that fixing this in player names etc.
       | makes Chinese look wrong? It's an easier problem if you know
       | the whole page is Japanese, but what about things like game
       | lobbies, where every username is in a different language?
        
         | yorwba wrote:
         | Either store the locale used when a user enters their name and
         | then use it to mark up the text whenever you display the
         | username, or simply use the system default, so Japanese users
         | will see Chinese names with Japanese glyphs and Chinese users
         | will see Japanese names with Chinese glyphs. Other users
         | randomly get whatever.
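A sketch of the first option in Python; `Username` and `render_username` are hypothetical names for illustration, not any particular game's API:

```python
import html
from dataclasses import dataclass

@dataclass
class Username:
    text: str
    lang: str  # BCP 47 tag captured from the locale at sign-up time

def render_username(user: Username) -> str:
    # Carry the stored language into markup so a renderer can choose
    # language-appropriate CJK glyphs for this one name.
    return f'<span lang="{user.lang}">{html.escape(user.text)}</span>'

print(render_username(Username("\u76f4\u5b50", "ja")))
```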
        
           | andrewl-hn wrote:
            | The locale might be set to something completely
            | different: a lot of programmers run their machines in
            | English rather than in their native language. One could
            | use location to detect which variant to use, but that
            | wouldn't work for, say, Chinese speakers in Japan. In an
            | ideal world we would use the locale of the input source
            | (if the user sets their keyboard to Traditional Chinese,
            | we should use it for that fragment of text). However,
            | operating systems and browsers don't provide an
            | input-source locale API.
        
             | drran wrote:
             | > The locale might be set to something completely
             | different.
             | 
              | Follow the user's choice, please.
        
               | eska wrote:
               | Locale: English. Text input: Ren
               | 
               | Question: was that a Japanese or Chinese character?
               | 
               | Answer: locale doesn't help us here, unsolvable problem.
        
         | Hackbraten wrote:
         | Good point. Would it be feasible to check the locale used at
         | profile creation time, then store that locale alongside the
         | username if it contains at least one CJK glyph?
        
           | eska wrote:
            | What if a player with a German locale used their favorite
            | anime character's name? How do you know whether to use
            | Chinese or Japanese glyphs on somebody else's PC? Even a
            | Chinese player should see Japanese glyphs there. But you
            | would first need to ask the German player, with a
            | drop-down menu, which language their name is in, which
            | will never happen, so we just assume Chinese. It's just
            | broken.
            | 
            | I think we clearly need an in-band solution: some
            | character that switches the Asian glyph variant, or
            | separate characters altogether. The former would be
            | annoying for fixed-width Unicode encodings, because you'd
            | lose the ability to do random access into a large corpus
            | of text: you'd need to scan the entire text to find out
            | the current glyph-variant mode... sigh
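Unicode does already have one in-band mechanism along these lines: the Ideographic Variation Selectors (U+E0100..U+E01EF), which request a registered glyph variant of the preceding ideograph, with exactly the extra-codepoints-to-scan cost described above:

```python
import unicodedata

# A base ideograph followed by VARIATION SELECTOR-17 requests a
# specific registered glyph variant (when the font supports the
# sequence). U+845B is a well-known example of an ideograph with
# registered variants.
base = "\u845B"
variant = base + "\U000E0100"   # base + VS17
print(len(variant))             # 2 codepoints for one displayed character
print(unicodedata.name(variant[1]))  # VARIATION SELECTOR-17
```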
        
       ___________________________________________________________________
       (page generated 2021-10-28 23:02 UTC)