[HN Gopher] Your code displays Japanese wrong
___________________________________________________________________
Your code displays Japanese wrong
Author : needle0
Score : 164 points
Date : 2021-10-28 06:07 UTC (16 hours ago)
(HTM) web link (heistak.github.io)
(TXT) w3m dump (heistak.github.io)
| rock_artist wrote:
| So to make sure I got it right:
|
| The issue is: when there's no proper glyph, it uses a fallback?
|
| Or... the choice of which glyph is rendered depends on the
| document's declared locale?
|
| (If it's the latter, it means it's impossible to quote another
| Asian language's text within the same paragraph?)
| needle0 wrote:
| The latter. In HTML you can specify a specific DOM element as
| being in a specific language so the browser can render it
| properly, but if the place where you want to quote text isn't
| as permissive (e.g. comment sections with no HTML allowed),
| there may be no way to ensure correct glyphs.
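The per-element tagging described above can be illustrated with a minimal HTML sketch. The character 直 (U+76F4) is a commonly cited unified codepoint whose preferred glyph shape differs between Japanese and Chinese typefaces; whether the three spans actually look different on screen still depends on which fonts are installed:

```html
<!-- One codepoint, three potentially different glyphs: the lang
     attribute tells the browser which regional font to prefer. -->
<p>
  <span lang="ja">直</span>      <!-- Japanese form -->
  <span lang="zh-Hans">直</span> <!-- Simplified Chinese form -->
  <span lang="zh-Hant">直</span> <!-- Traditional Chinese form -->
</p>
```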
| lifthrasiir wrote:
| Ideographic variation selectors plus a very large pan-CJK
| font may solve this issue in the future, but CJK fonts have
| already reached the OpenType limit of 65,535 glyphs so we are
| already running into technical issues.
| tasogare wrote:
| It's a web-browser-only issue. In other contexts, such as a word
| processor document or a local app, a font is explicitly applied
| to every run of text, so there is no problem. This issue is the
| web being what it is: most of the time there is no font
| explicitly specified for text, and the browser uses a Chinese-
| looking font to display any Chinese characters.
| eska wrote:
| It's not just a web browser issue. For example I'm
| transferring data in multiple Asian languages through some
| network API. I always need to specify the locale of the text
| data in a separate data field so that some UI program at the
| end can display the text correctly. And even then that's not
| perfect, because that's just the system locale instead of the
| IME locale.
| iforgotpassword wrote:
| You need to embed the quoted text in e.g. a span and add the
| proper lang tag.
| [deleted]
| BoppreH wrote:
| That's extremely interesting, if not depressing.
|
| So, if I have to display user-entered text (usernames, posts,
| comments, messages, form data, etc), and I want to do The Right
| Thing(tm):
|
| - I cannot rely on user locale, because it might be set to
| something generic like English, or the user may be bi-lingual.
|
| - I cannot rely on location, because the user may be traveling to
| a different CJK region, or somewhere else altogether.
|
| - I cannot set a single lang: attribute for the whole page
| because it'll be wrong for the other two languages.
|
| - The string alone is not sufficient to identify the language
| because you can write valid sentences in different CJK languages
| with the same codepoints.
|
| - I cannot have a per-user language setting, because users may be
| bi-lingual.
|
| What does that leave me? A dropdown list "C/J/K/Other" beside
| _every single text field_?
|
| I'm chucking this on my pile of examples of software development
| being hopelessly broken by design, along with "unix time is non-
| monotonic and discontinuous at random" (hint: what's the unix
| time exactly 1e8 seconds, ~3 years, from now? Answer: it's up to
| the astronomers[1]!).
|
| [1]: https://en.wikipedia.org/wiki/Unix_time#Leap_seconds
|
| Edit: actually, even the dropdown list is insufficient because it
| only allows one language per string! How is a Japanese user
| asking for help learning Chinese supposed to write?
| makeitdouble wrote:
| For what it's worth, mainstream OSes also have poor handling of
| these cases, which eliminates the trickiest cases through the
| sheer inconvenience it causes.
|
| As far as I know textfields only have one font applied, so
| entering both languages in a single field won't be optimal. And
| if you're not doing anything fancy with your fields, they will
| all take the same font as well.
|
| So even at the input level, the user switching languages will
| already be mildly screwed, and the best solution would probably
| be to change pages for each language.
| GoblinSlayer wrote:
| >How is a Japanese user asking for help learning Chinese
| supposed to write?
|
| They write with Japanese kanji. I imagine the minuscule
| differences in kanji forms are negligible compared to general
| unfamiliarity with the foreign language.
| [deleted]
| gpderetta wrote:
| Deprecate languages. We will all be writing exclusively in
| emojis from now on.
| zokier wrote:
| > unix time is non-monotonic and discontinuous at random
|
| Well, depends on your definition of unix time; if you use
| time() as the definition then it is actually monotonic because
| the integral part only repeats on leap seconds?
| Asooka wrote:
| Trying to do something smart is usually the wrong approach and
| leads to users tangling themselves in invisible state that they
| do not understand and can't change. The best thing would have
| been for Unicode to not do Han unification. The second best
| would be to provide alternate glyphs now. The third best is to
| display the characters either in the language they're written
| in, when you know that for sure (usually for text that you
| wrote yourself), or in the user's most likely locale, when you
| don't. For locale I would go down this list of traits and pick
| the first that matches:
|
| 1) The language setting on your website if you have one and
| have translated it to C/J/K. You may use different TLDs for the
| different languages and discern that way, too.
|
| 2) The list of preferred languages from the browser. This is
| usually unreliable, but if someone has gone to the trouble of
| inputting "english=1;japanese=.9;chinese=.8", then it's a fair
| bet they want Japanese Kanji usually and will be understanding
| if you use them in place of Chinese Han characters.
|
| 3) The country to which the user's IP belongs. The least ideal
| option, but if you're in Korea and reading a random string of
| Hanzi, you probably expect them to look like Hanzi.
|
| You will show the wrong characters to some users, but the
| behaviour is understandable. "Oh, the site is showing me Korean
| characters because I'm in Korea." is a lot easier to grasp than
| "The site is showing me Chinese characters because I clicked a
| dropdown one time that I forgot about and now I have no idea
| why my name is written wrong!"
|
| You can argue about point 2) that some users might set their
| language preferences and forget about it, but so far I have
| never observed a user who doesn't know about them messing with
| the setting.
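The three-step fallback above could be sketched as a small function. All parameter names here are illustrative (not from any real API), and the Accept-Language parsing is deliberately simplified:

```javascript
// Sketch of the three-step locale guess described above:
// 1) explicit site language, 2) browser Accept-Language,
// 3) IP-derived country as a last resort.
function guessCjkLocale({ siteLang, acceptLanguage, ipCountry }) {
  const CJK = ['zh', 'ja', 'ko'];
  // 1) An explicit site language setting wins.
  if (CJK.includes(siteLang)) return siteLang;
  // 2) First CJK entry in the browser's Accept-Language list
  //    (naive parse: strip q-values, keep the primary subtag).
  const prefs = (acceptLanguage || '')
    .split(',')
    .map(part => part.trim().split(';')[0].slice(0, 2).toLowerCase());
  const fromHeader = prefs.find(tag => CJK.includes(tag));
  if (fromHeader) return fromHeader;
  // 3) Fall back to the IP country (least reliable signal).
  const byCountry = { CN: 'zh', TW: 'zh', HK: 'zh', JP: 'ja', KR: 'ko' };
  return byCountry[ipCountry] || null;
}
```

As the comment argues, the value of this ordering is not that it is always right, but that each wrong answer is explainable to the user.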
| [deleted]
| dathinab wrote:
| The problem, by the way, is not specific to Han unification or
| Asian languages. It exists _for every Western language_ if you
| use screen readers.
| the_other wrote:
| My thought whilst reading the article was that the Han
| unification would actually help screen readers. IIUC The
| meaning of the glyph is the same across all the languages, so
| the screen reader will get the correct meaning and can
| present it according to local settings. The problem with the
| European languages is that the different characters (letter
| variations, accent variations) can change the meaning of the
| word they're part of.
|
| Or have I misunderstood?
| lifthrasiir wrote:
| The same Han character can have wildly different
| pronunciations even in a single language (being a logogram,
| they represent a word and not a sound). KS X 1001, the
| primary Korean character set, even duplicated the same
| characters according to their readings so that they could be
| almost [1] correctly converted back to Hangul. In practice this
| didn't work well though, and Unicode assigns all but one of
| each set of duplicates to the compatibility region.
|
| [1] These readings didn't take into account systematic
| variations like the initial sound law (dueumbeobcig; for
| example, i vs. ri at the beginning of words).
| soraminazuki wrote:
| You can't split CJK sentences into individual letters, inspect
| them one by one, and decipher their exact meaning. If you
| present Chinese writing to a Japanese speaker, they'd only see
| complete gibberish consisting of letters they may or may not
| recognize. It works the other way around too.
|
| On top of that, kanji aren't the only type of letters Japanese
| folks use. They also use hiragana and katakana, which are
| phonetic symbols and totally unrecognizable to non-Japanese
| speakers.
| Snild wrote:
| I'm not sure I understand what you're saying here. Are you
| saying that screen readers don't know which language to
| interpret text as?
| dmurray wrote:
| > I cannot set a single lang: attribute for the whole page
| because it'll be wrong for the other two languages.
|
| > What does that leave me? A dropdown list "C/J/K/Other"
| beside every single text field?
|
| If you _need_ to control the language display at the
| granularity of a single text field (rather than a user or a
| page or a website), then yes, you need a tool that operates on
| a single text field. This shouldn't be too surprising.
|
| Surely you can get away with one of your other solutions,
| though. In particular, you can guess the language and be right
| most of the time.
| dvdkon wrote:
| It is surprising to me, because I can mix Czech, English and
| German just fine on a single screen, even in a single text
| field. From that perspective having to say "this whole text
| is in language X" seems backwards.
| jfk13 wrote:
| Suppose you're blind and use a screen reader. How does
| German sound when pronounced by an English text-to-speech
| engine, or vice versa? Which should be used to read out
| that "single text field" that contains a mix of languages?
| dvdkon wrote:
| When reading mixed-language text we can't know if the
| "Tee" in the middle of an English sentence is German for
| "tea" or English slang for "t-shirt". We have to use
| imperfect context clues, so screen readers will have to
| do the same, it should be doable with today's technology.
| And if the software doesn't get it right every time,
| that's fine, because neither do humans.
| masklinn wrote:
| > And if the software doesn't get it right every time,
| that's fine
|
| It really is not. A screen reader using the wrong language is
| way, way worse than a human with the wrong pronunciation. The
| first time VoiceOver decided to switch to Russian in the middle
| of English text, I thought the OS had crashed; the mangling is
| quite extreme.
| dmurray wrote:
| OK, yeah. If Unicode hadn't gone with Han unification
| (couldn't they know we'd have space to put zillions of
| characters in a font just a few years later?) you'd have
| the same flexibility in mixing C/J/K.
| [deleted]
| rectang wrote:
| > _(couldn't they know we'd have space to put zillions
| of characters in a font just a few years later?)_
|
| Representation matters. There should have been plenty of
| experts who understood that Han unification was
| problematic. It seems they were not in a position to do
| anything about it.
| masklinn wrote:
| Part of the problem back then is they really really
| wanted to fit it all in 16 bits, and given there are more
| hanzi than that to start with this was a bit of an issue
| (back in the 80s the largest dictionary listed >54000
| characters, now it's >100000, and Japanese and Korean
| total about 50000 each).
|
| A few years later they increased the character space by 5
| bits and it wasn't an issue anymore, but the original
| legacy of _han unification_ remains.
| tdeck wrote:
| It makes sense given the constraints of the time to try to fit
| the character set into 16 bits, but Unicode has variation
| selectors; why not use those for the ambiguous characters? They
| could have easily done something like HAN_CHARACTER_X +
| VARIANT_KANJI. It would take up an extra 16 bits, but given the
| density of CJK text relative to Latin text, that may not be a
| big issue.
|
| Edit: s/variant/variation/.
| rectang wrote:
| Could this be done now?
| rectang wrote:
| Yes, and the people who prioritized the 16-bit size were
| OK with Japanese looking ridiculous in order to achieve
| their goal. From the article:
|
| > if the equivalent symptom was happening with English
| text, it' wfuld bie lffking sfmiet'Tshing likie t'Tshis.
|
| From a page linked by the article, "I Can Text You A Pile
| of Poo, But I Can't Write My Name":
|
| > To help English readers understand the absurdity of
| this premise, consider that the Latin alphabet (used by
| English) and the Cyrillic alphabet (used by Russian) are
| both derived from Greek. No native English speaker would
| ever think to try "Greco Unification" and consolidate the
| English, Russian, German, Swedish, Greek, and other
| European languages' alphabets into a single alphabet.
|
| If there had been a proposal to sacrifice English in
| order to cram Unicode into a certain code space size, is
| there any question that the people on the panel whose
| first language was English would have quashed it?
|
| But people who might have spoken out for Japanese, or for
| Indian languages, etc. were seemingly not in a position
| to do anything.
| jefftk wrote:
| Greco Unification wouldn't have been a priority because
| it couldn't have saved enough bits to matter.
| rectang wrote:
| Yes, of course -- but the point is that _native English
| speakers would have blocked such an absurd proposal no
| matter what_.
|
| Han Unification is just as absurd and just as
| unacceptable to a native Japanese speaker as Greco
| Unification would be to a native English speaker. But Han
| Unification went through because native Japanese speakers
| were not in a position to block it. Representation
| matters.
| dr-detroit wrote:
| This problem shows up in email too. The internet is mostly
| hollow and broken.
| kiryin wrote:
| As a Japanese learner, this has been a massive disappointment in
| unicode for me, and a pain in my ass. It has sort of formed into
| a challenge for me, trying to get the characters to display
| consistently on all of my devices. Believe it or not, even with
| pango configured to always show the japanese variants, and
| fontconfig set to always prefer the JP font, some applications
| like Firefox find a way to mess it up.
|
| Can't blame them much though; Han unification is a huge mess,
| designed by someone who I can only posit to be entirely
| brainless. There aren't many characters that are affected; you
| aren't even saving any considerable number of codepoints. It's
| just West-centrism and a lack of knowledge of the subject.
| adrian_b wrote:
| The Han unification was done because at that time they hoped
| that the size of Unicode characters would be limited to 16 bits.
|
| Separate sets of Han characters cannot be encoded in the 16-bit
| space, but they could have been easily encoded in the current
| 21-bit space.
|
| Nevertheless, I have never found this to be a problem in
| practice, because I have always taken care to have good
| separate typefaces for Japanese, Traditional Chinese and
| Simplified Chinese.
|
| In documents that I create or modify, I apply styles with the
| appropriate typeface.
|
| The only possible problems are with Web pages, but the good
| browsers allow you to configure typefaces for each language and
| I always configure the correct typefaces.
|
| If the Web page does not specify the language correctly, it
| might be displayed wrongly, but this is only one of the many
| stupid things a Web page designer can do that make a page look
| ugly when rendered on other computers.
| aikinai wrote:
| The fact that you have to take such care is exactly the
| problem.
| adrian_b wrote:
| I agree that it is annoying that I must remember to configure
| typefaces per language whenever I install a new browser (and
| for Chrome you must also install the "Advanced Font Settings"
| extension before it even becomes possible to choose e.g. a
| Japanese font).
|
| To avoid such configuration work when you prefer better
| looking typefaces instead of some standard system defaults
| would require a standardization of how to notify the
| applications about the association between certain
| typefaces and languages, e.g. by some environment variables
| or by some standard locations for the font files, depending
| on language.
| bruce343434 wrote:
| What makes this worse is that some Cyrillic characters, like
| the Cyrillic "а", have a different code point from the Latin
| "a" despite looking _exactly_ identical. So Unicode isn't even
| consistent with its unification logic.
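The distinction in question is easy to see from the codepoints: Latin "a" is U+0061 while the visually identical Cyrillic "а" is U+0430.

```javascript
// Two visually identical letters, two different codepoints.
const latinA = 'a';     // U+0061 LATIN SMALL LETTER A
const cyrillicA = 'а';  // U+0430 CYRILLIC SMALL LETTER A

console.log(latinA.codePointAt(0).toString(16));    // "61"
console.log(cyrillicA.codePointAt(0).toString(16)); // "430"
console.log(latinA === cyrillicA);                  // false
```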
| demetrius wrote:
| I believe Cyrillic a and Latin a are different because there
| already existed legacy encodings where a and a were
| considered different characters, so Unicode kept the
| distinction for backward-compatibility.
|
| By contrast, there were no existing legacy encodings that
| allowed writing Chinese and Japanese at the same time, so there
| was nothing to keep compatibility with.
| mjevans wrote:
| The initial three examples don't have a corresponding 'correct
| render' image next to them, so it's impossible for me to tell
| since they all render as the same character (which is incorrect
| given the lack of context).
|
| Checking the source, the page _is_ specifying language tags in
| the span, which I guess is supposed to help. My system just must
| not have fonts for those languages so I obviously can't even test
| them.
| needle0 wrote:
| There may be issues displaying it on Firefox. Chrome and Safari
| seem to have displayed it correctly on my end. I'll find time
| to replace them with images so they appear correct regardless
| of environment.
|
| EDIT: Replaced with images.
| yorwba wrote:
| Firefox will also display it correctly, but only if you have
| the fonts installed, the same as with other browsers.
| [deleted]
| squaresmile wrote:
| This can also be a problem in chat apps. I used en-US on
| Windows (which defaulted to the zh variant) and someone else
| used ja-JP, and I was wondering why the character was
| different. It took a while to notice that we were seeing two
| different things on our screens.
|
| We also have a website about a Japanese game using a Japanese
| font except for 0x9bd6. The font's 0x9bd6 is the CN variant and
| its 0xe001 is the JP variant of 0x9bd6. Fun times.
|
| Like others said, on the web, you pretty much have to manually
| assign lang to every single thing. We just added support for
| CN/TW/KR text. I should come back and check 0x9bd6 in the other
| versions ...
| powerapple wrote:
| For Chinese, the font can change the writing slightly. For
| example, Ren (blade) can be any of these (Japanese, Simplified
| Chinese and Traditional Chinese). Actually I would consider the
| Japanese version in the article the traditional Chinese version:
| https://duckduckgo.com/?q=%E5%88%83+%E4%B9%A6%E6%B3%95&iax=i...
|
| At least for Chinese, the difference is just font; they are all
| valid ways of writing the character. Different writing styles
| can cause these minor differences as well:
| http://qiyuan.chaziwang.com/pic/ziyuanimg/E58883.png
|
| If you look at the right side of the above image, you can tell
| how the same character is written in different writing styles.
|
| It is less of a problem for Chinese. Our brains have been
| trained to read them; I would recognize the Japanese version,
| Simplified Chinese version, and Traditional Chinese version
| without noticing the difference. But I can imagine it being a
| problem for Japanese users, and for other people who do not
| read Simplified Chinese. Having the locale explicitly set to a
| country and loading the correct font makes a lot of sense here.
| suction wrote:
| Does this also explain why alphabetic text in Japanese apps and
| websites often looks so horrible? Like very wide characters with
| way too much space in between them?
| tasogare wrote:
| No, the wide characters (that's their name) are special code
| points. I guess they exist because someone wanted to be able to
| use one letter in place of a Japanese character while keeping
| the same width. The "normal" letters are called "half-width"
| here.
| lifthrasiir wrote:
| Full-width characters are relics from multiple legacy
| character sets. For example JIS X 0208, the primary Japanese
| two-byte character set, has a set of alphanumeric characters
| in the row 0x23, but their widths are not specified and it is
| totally possible to map them into half-width characters when
| no other character sets are in use. However it is most
| commonly paired with JIS X 0201 which is a single-byte
| character set with their own alphanumeric characters, so
| anything from JIS X 0201 is made half width and anything from
| JIS X 0208 is made full width to simplify implementations.
| This practice stuck and was subsequently followed by Unicode.
| Same for other languages.
| needle0 wrote:
| Newspaper websites may also have years-old internal typesetting
| rules, carried over from paper, that mandate that alphabetic
| text appear in full-width (double-wide) form. It looks ugly
| even to native Japanese readers, and some newspapers have
| gradually been learning to break out of it.
| jhanschoo wrote:
| Alphabetic text in Japanese fonts is primarily designed for
| documents mainly in Japanese with the occasional Latin-script
| jargon. There's a variant (full-width) that's sometimes used
| and is indeed very wide, made to be the width of Japanese
| kanji, but even the proportional forms are pretty light and
| widely spaced (which results in better typography in mainly-
| Japanese documents).
| zokier wrote:
| This is a tangent, but have there been any ideas around making
| a stroke-based encoding of the Japanese/Chinese writing
| systems?
| lovasoa wrote:
| Wasn't the whole point of Unicode to have a single encoding
| that could represent all languages unambiguously, so that you
| don't need any meta-information to display a string? Is there a
| reason why they chose to represent characters that are
| obviously different with the same code point? Everyone would
| find it outrageous if they decided to have a single character
| for the Russian m and the English m just because they have the
| same Greek origin...
| josefx wrote:
| > Everyone would find it outrageous if they decided to have a
| single character for the Russian m and the English m just
| because they have the same Greek origin...
|
| Did you know that some languages distinguish dotless I/ı and
| dotted İ/i? English mixes them, and Unicode needs to know which
| language you're dealing with to upper/lowercase correctly,
| because it can't tell an English I apart from a Turkish dotless
| I.
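The Turkish casing point shows up directly in locale-aware case mapping. A small sketch (this assumes a JS runtime built with full ICU, such as recent Node.js; without ICU data the locale-specific results fall back to the default mapping):

```javascript
// The same letter 'i' uppercases differently depending on language.
console.log('i'.toUpperCase());           // "I" (default/English)
console.log('i'.toLocaleUpperCase('tr')); // "İ" dotted capital, U+0130
console.log('I'.toLocaleLowerCase('tr')); // "ı" dotless small, U+0131
```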
| zczc wrote:
| There are some 8-bit and 7-bit encodings which have unified
| Latin and Greek, like 7-bit SMS encoding [1]
|
| [1] https://en.wikipedia.org/wiki/GSM_03.38
| Sniffnoy wrote:
| The reason why is that Unicode was originally 16-bit, and there
| was no way they could fit everything into 16 bits without CJK
| unification. Of course, later it turned out there was no way
| they could fit everything into 16 bits anyway, so they were
| forced to expand it, and we now have both a larger Unicode
| (with all the messes that's caused) and, still, CJK
| unification...
| lovasoa wrote:
| Is it too late to add separate code points for the Chinese,
| Japanese, and Korean versions of the Han characters?
| lifthrasiir wrote:
| In principle there is a designated "disunification" procedure
| for when it's desirable. More accurately speaking, each CJK
| character is thought to represent not a single glyph or a few
| glyphs listed in the code chart but rather a glyphic subset,
| and disunification splits that set into partitions. But this is
| generally applied to a few selected characters, and only when
| it's safe to do so. Massive disunification was, to my
| knowledge, never suggested or proposed, and it would surely
| prompt large-scale disruption among CJK users (what about
| existing texts, say?).
| rrobukef wrote:
| So it's possible to modify the skin tone of emoji but
| impossible to disunify CJK characters? There are RTL modifiers
| for Arabic-script languages; is it impossible for CJK? It
| shouldn't be harder than existing Unicode handling.
| lifthrasiir wrote:
| Everything boils down to the interoperability and
| compatibility.
|
| Emoji were added because Apple and Google had to deal with
| (then-)Japanese emails, and skin tones were not specified. It
| was implementations that imposed certain skin tones (which do
| not even match the original Japanese emoji), and as a result
| Unicode had to introduce a mechanism to change skin tones and
| mandate that the default emoji without that mechanism be
| neutral.
|
| RTL "modifiers" are actually formatting characters
| closely tied with the Unicode Bidirectional Algorithm
| [1]. Until then texts with both RTL and LTR fragments
| were handled incoherently, for example legacy character
| sets were still struggling with logical vs. visual order
| issues. So they are indeed Unicode inventions, but
| necessary ones that do not alter existing texts.
|
| For CJK characters Unicode now provides ideographic
| variation selectors that select the exact glyph (or more
| accurately, a restricted glyphic subset of the base
| character). They do not disunify characters but they do
| provide a strong hint to display those characters in a
| specified way. In this way they do not cause an
| additional issue to existing Unicode systems (as they
| should already do normalization and collation in the
| Unicode way). The disunification by comparison would
| almost instantly break existing texts.
|
| [1] https://unicode.org/reports/tr9/
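An ideographic variation sequence is simply the base character followed by a selector from U+E0100 upward. A commonly cited example is 葛 (U+845B), which has registered sequences in the Ideographic Variation Database; whether the selected glyph actually renders still depends on font support:

```javascript
// Base character + variation selector = one ideographic
// variation sequence.
const base = '\u845B';     // 葛 (U+845B)
const vs17 = '\u{E0100}';  // VARIATION SELECTOR-17 (U+E0100)
const sequence = base + vs17;

// The selector is a real codepoint, above the BMP, so the
// sequence is 2 codepoints but 3 UTF-16 code units.
console.log([...sequence].length); // 2
console.log(sequence.length);      // 3
```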
| jhanschoo wrote:
| Even in Latin-script languages you have issues if you don't
| specify a lang tag. E.g. a font's 'fi' ligature may omit the
| tittle on the 'i', but it is necessary in Turkish. Or you are
| using a font without coverage for French and your browser
| renders œ in another font.
|
| The CJK variant issues under not specifying a lang are indeed
| present in Latin but to a smaller extent.
| dustintrex wrote:
| It's a really complicated issue, because whether the character
| is different or not is debatable. In English writing, we
| consider a serif A to be the same character as a sans-serif A,
| even though the glyph is obviously different, and neither do we
| distinguish between a "French" A and a "German" A.
|
| So what do we do with 国 and 國 (Guo)? The first of those is
| always used in simplified Chinese and usually in Japanese,
| while the second is used in traditional Chinese and sometimes
| in Japanese (e.g. names). Is this one, two, or three
| characters?
|
| More on the topic:
| https://en.wikipedia.org/wiki/Han_unification
| afiori wrote:
| As an armchair linguist, I'd say it is clearly two (not one,
| not three) distinct "characters", going by how I am guessing
| native speakers think about them.
|
| If all three had different codepoints and you replaced Guo with
| Guo, a lot of people would notice; less so if you replaced Guo
| with Guo.
|
| To my understanding, the only argument in favour of Han
| unification was that it would have taken up a lot of codepoints
| otherwise.
| wodenokoto wrote:
| > by going how I am guessing native speakers think about
| them.
|
| I can assure you that Han unification happened at the hand
| of native speakers.
| SpicyLemonZest wrote:
| It's worth noting that this kind of distinction used to exist
| in the Latin alphabet as well. For much of the 19th and 20th
| centuries, German letters were different, as part of a debate
| about whether German text should be written in blackletter or
| in the Latin/Antiqua/what-other-Europeans-use script.
| dotancohen wrote:
| We still have these issues in e.g. Hebrew and Yiddish,
| Arabic and Persian, and tons of adapted Cyrillic scripts
| from ex-Soviet states. Not to mention Northern-European
| accented vowels, and cedilla letters such as Ç.
|
| I'm personally of the belief that the accented and cedilla
| characters should be exclusively stored as combining
| character pairs, even if modern keyboard mappings require
| only a single keypress. My own language stores every
| character as two bytes (at a minimum), so the storage
| aspect is a solved problem.
| ksec wrote:
| I am surprised how many commenters here have never heard of Han
| Unification. The problem is not new, and some of us have been
| ranting about it for more than a decade: from the UTF-8
| Everywhere manifesto in 2012 on HN [1] to a search [2] on HN
| dating back to 2010.
|
| I am also surprised at the support this problem now has, at
| least in this thread. Generally speaking, the Han Unification
| problem doesn't get much if _any_ support on HN. Not even
| empathy. In the name of having Unicode become king, they would
| much rather sacrifice the CJK languages.
|
| The answers or replies were always: it is a "glyph" problem,
| not a "code" problem, stop asking Unicode to solve it.
|
| patio11, aka Patrick McKenzie from Stripe, has been one of the
| most vocal critics of Han Unification. He sums it up far better
| than I could, quote [3]:
|
| >Reason the Han unification debate in Unicode got so acrimonious,
| and why lots of Japanese people carry a chip on their shoulder
| about it to this day.
|
| >"Sorry, grandma, I know you've been sort of attached to your
| name for the last 80 years, but the white folks find it
| inconvenient for their computer systems. Don't worry, they
| promise they'll make something close for you."
|
| >Many of the clients of my ex-day job are married to legacy
| encodings like Shift-JIS precisely because they do think that
| their customers and students have a "right" to having their names
| written correctly.
|
| As mentioned in my other reply, Adobe gets lots of stick for
| its subscription model and malware-like Creative Cloud, but
| they do [4] spend a huge amount of resources on CJK fonts,
| layout, and encoding (they have their own separate encoding for
| each CJK language instead of using Unicode). Part of the reason
| why I like PDF.
|
| [1] https://news.ycombinator.com/item?id=3906253
|
| [2]
| https://hn.algolia.com/?dateRange=all&page=8&prefix=false&qu...
|
| [3] https://news.ycombinator.com/item?id=1438749
|
| [4] https://ken-lunde.medium.com/my-28-years-of-
| adobelife-e97e70...
| flubert wrote:
| >"Sorry, grandma, I know you've been sort of attached to your
| name for the last 80 years, but the white folks find it
| inconvenient for their computer systems. Don't worry, they
| promise they'll make something close for you."
|
| Is there a resource to read more about this? I don't get that
| vibe from things like:
|
| https://www.unicode.org/versions/Unicode3.0.0/appA.pdf
| YeGoblynQueenne wrote:
>> However, this issue is much more than the difference between,
| say, the lowercase A with the overhang (a) or without (ɑ).
|
| Yes, but actually "ɑ" is the Greek character _alpha_, whereas
| "a" is the Latin character "a". So if you displayed "a" as "ɑ"
| to a Greek person, that, too, would look all wrong.
| peacefulhat wrote:
| I'm grateful for han unification because I can search Chinese
| words I only know in Japanese.
| lifthrasiir wrote:
| Unfortunately this (and the linked) article only covers
| Japanese issues. If you blindly apply these suggestions,
| Chinese or Korean users may have problems. I'll list Korean
| issues below, primarily because I'm Korean, but you may want to
| interview actual CJK users (one of each, _not_ a single user)
| for testing.
|
| > Line breaking rules
|
| This should link to W3C Requirements for CJK Text Layout [1]. The
| Wikipedia article alone doesn't fully describe the complexity of
| CJK typography.
|
| The CJK languages have in common that they all have classes of
| punctuation that can't be separated by a line break. But there
| is one more thing to consider for Korean: both word-based
| breaking and character-based breaking are possible depending on
| the context. The general rule is to use word-based breaking for
| larger texts and character-based breaking for smaller texts,
| but there is no clear threshold, so you _really_ want to
| consult Korean users for testing.
|
| [1] https://www.w3.org/TR/clreq/ (Chinese),
| https://www.w3.org/TR/jlreq/ (Japanese),
| https://www.w3.org/TR/klreq/ (Korean)
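| To make the two Korean breaking modes concrete, here's a minimal
| Python sketch (hypothetical helper names; the real rules in klreq
| also involve prohibited line-start/line-end punctuation, which
| this ignores):

```python
import textwrap

text = "유니코드 한글 줄바꿈"  # "Unicode Hangul line breaking"

# Word-based breaking (typical for body text): break only at spaces.
def word_wrap(s, width):
    return textwrap.wrap(s, width=width)

# Character-based breaking (typical for narrow UI labels): break anywhere.
# Naive sketch -- a real implementation must still honor punctuation rules.
def char_wrap(s, width):
    s = s.replace(" ", "")
    return [s[i:i + width] for i in range(0, len(s), width)]
```

| word_wrap(text, 7) keeps whole words together, while
| char_wrap(text, 4) happily splits mid-word; which one is right
| depends on the context, as described above.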
|
| > Messaging Apps: Do not directly hook to the Enter key to submit
| messages
|
| This advice is also problematic. In pretty much all Japanese and
| most Chinese IMEs, input goes through candidate windows, so
| pressing Enter should not submit messages; but in some Chinese
| and virtually all Korean IMEs there is no automatic candidate
| window, and pressing Enter should submit messages.
|
| In an ideal world, detecting a newline as suggested by the
| article would have solved this issue, but that got complicated
| by clueless pan-CJK IME implementations. They generally assume
| candidate windows even for Korean, so they do not commit text on
| Enter, which is very inconvenient for Korean users. The
| practical recommendation is therefore to detect a newline by
| default, but also offer an option to submit messages on Enter.
| needle0 wrote:
| I updated both sections according to your suggestions. Thanks!
| needle0 wrote:
| Was notified from someone else about the isComposing
| attribute -- https://developer.mozilla.org/en-
| US/docs/Web/API/KeyboardEve... At least for web stuff, do you
| think checking for this before treating the Enter key as
| Submit would work in both IMEs with and without input
| buffers?
| lifthrasiir wrote:
| The problem is that those clueless IMEs do intercept the
| Enter key contrary to user expectation, so I think you
| can't distinguish those IMEs from Chinese and Japanese IMEs
| that should intercept the Enter key as expected.
| simonlc wrote:
| Really well done post and good idea with the title!
|
| - Simon from TGM :)
| brigandish wrote:
| > Japanese text written in incorrect glyph sets will stand out
| similarly to any native speaker of Japanese, and will give off a
| connotation that whoever developed this app does not care about
| this (often large) subset of the global user population.
|
| More likely they'll think the content was written by a non-native
| Japanese speaker, judge whether that makes you trustworthy or not
| (based on personal experience or stereotypes or prejudice,
| probably a bit of all three (we're all human)) and then not buy
| from you. A good example would be Amazon listings in Japanese
| that Japanese people can tell were almost certainly written by
| someone Chinese, and then decide not to buy.
|
| If you want the cash, get a proper translation. Ironically, Japan
| is filled to the brim with incredibly poor English and abounds
| with stories of native English speakers' translations and
| corrections being disregarded because "it doesn't sound right"...
| to someone who can't string a legible English sentence together.
| euske wrote:
| This is the reason why Adobe PDF doesn't rely on Unicode. Adobe
| products have had a huge presence in Japan since the '90s, and
| they had to appeal to the printing industry, which is very
| particular about this kind of issue. So they ended up using a
| separate encoding for every language. Today, CJK letters in PDF
| are encoded in Adobe-GB1 (mainland China), Adobe-CNS1 (Taiwan
| and Hong Kong), Adobe-Japan1 and Adobe-Korea1 respectively. Not
| the cleanest way, but it gets the job done.
| makeitdouble wrote:
| Thanks for the pointer, that's pretty interesting.
|
| Looking at their doc [0], it seems they used Adobe-Japan1 to
| wrap a much wider set of characters than any single encoding
| standard covers, including ligatures, vintage encodings, etc.
|
| It seems to be a pretty big piece of work, and it fits with the
| image of PDF handling being such a monumental beast.
|
| [0] https://github.com/adobe-type-tools/Adobe-Japan1/
| lifthrasiir wrote:
| Note that they are now adopted by the Unicode Ideographic
| Variation Database [1] among other variation databases.
|
| [1] https://unicode.org/ivd/
| ksec wrote:
| Adobe gets lots of stick for its subscription and malware like
| Creative Cloud. But they do spend a _huge_ amount of resources on
| CJK fonts, layout and encoding.
|
| And part of the reason why I like PDF.
|
| ( Behind a Paywall ) https://ken-lunde.medium.com/my-28-years-
| of-adobelife-e97e70...
| Asooka wrote:
| Can't this be solved somewhat by adding a "cjk mode" zero-width
| character, like we have right-to-left/left-to-right embedding
| characters? Yes, yes, it's yet another standard, but there
| doesn't seem to be any way to indicate in the text stream itself
| what characters to use otherwise.
| iforgotpassword wrote:
| Minor addition/clarification, just in case:
|
| > If the glyphs don't exactly look like the Japanese result
| sample below, your code is displaying Japanese wrong.
|
| Maybe _exactly_ isn't the right term here; it doesn't need to be
| pixel-perfect, there are still different font faces just like
| with western languages, for example one that's supposed to make
| them look more natural or hand written and one for print, etc.
|
| Also, afaict han unification was a mistake, but if you thought
| you'd only ever have 65,535 code points available it might have
| been tempting.
| needle0 wrote:
| Good point. Will reword that part.
| wodenokoto wrote:
| I think this page does a poor job of explaining that all 3
| _knife blade_ characters in the first example share the same
| code point in Unicode, but are to be displayed/rendered
| differently depending on which language they are shown as part
| of.
|
| It is there in the text, but it's almost hidden between the
| lines.
|
| If I were a developer with no knowledge of Han characters or Han
| unification, I would have to read two thirds of this article
| thinking I'm doing it right, so why am I reading this, e.g.: "but
| I am using the correct code point. It's the character that the
| user entered!" or "I copy pasted it from a Japanese text, what do
| you mean I'm using the wrong character?" before reaching the "how
| to fix it" and even then I might not realize the root cause.
|
| With that in my mind I might not even make it to the part about
| how to fix the problem and learn that I am using the right
| character/code point, but it is still displayed wrong.
| needle0 wrote:
| I agree the page is somewhat roundabout in its current state
| since I went from the background to the symptom to the fix.
| Open to suggestions on rearranging the article so that more
| devs can implement fixes.
| wodenokoto wrote:
| I would have the knife blade chart a little earlier, and
| make it very obvious that each character shares a code point
| (maybe have a code point column), and note in the text that
| yes, this is weird: 3 visually distinct characters share a
| single Unicode code point.
|
| For me this was very hard to wrap my head around the first
| time I encountered the problem. Maybe other people find it
| hard to understand in different ways.
|
| I believe that Unicode even claims that distinct-looking
| characters should have their own code points, but similar
| looking characters should share a code point (e.g., there is
| no French a and English a, even though they are pronounced
| differently, while Danish ø and Swedish ö are pretty much the
| same in pronunciation but written differently, so they don't
| share a code point.)
| needle0 wrote:
| Thanks. I reworded it a bit and put more emphasis on code
| points.
| madsohm wrote:
| The second character displays wrong for me when copying into VS
| code with Cascadia Code PL font.
| skhr0680 wrote:
| That's why Han unification is a mess.
| zzo38computer wrote:
| My own programs are specifically designed to not use Unicode. I
| think that Unicode is really messy and I dislike it. If you want
| to display Japanese text, EUC-JP can be used.
| Matheus28 wrote:
| I really hope you're being sarcastic
| zzo38computer wrote:
| I am not sarcastic. I don't like Unicode.
| patrec wrote:
| If you like, to stick just to Japanese, this:
|
| https://upload.wikimedia.org/wikipedia/commons/b/ba/JIS_and
| _...
|
| better than unicode, you can't be helped.
|
| Not that I like unicode much either -- amongst other things
| the idiotic arrangement of codepoints makes it basically
| impossible to do remotely efficient text processing; e.g.
| here's a graph of the automaton the RE2 uses to check if
| something is an uppercase character:
|
| https://swtch.com/~rsc/regexp/cat_Lu.png
|
| (For ASCII there would be exactly one arrow connecting
| two nodes.)
| arp242 wrote:
| And here I was thinking that "Unix variants history" or
| "Linux audio systems" graphs were messy and
| complicated...
| yorwba wrote:
| This is about display, not encoding. Using EUC-JP to store text
| doesn't guarantee that it will be rendered with a Japanese
| font.
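| A quick Python illustration of the distinction: the legacy
| encodings produce different bytes, but decoding lands on the same
| Han-unified code point, so the rendering question remains:

```python
# 直 (U+76F4) exists in both JIS X 0208 and GB2312, with different bytes:
ch = "\u76f4"
jp_bytes = ch.encode("euc-jp")
cn_bytes = ch.encode("gb2312")

# Different byte sequences...
assert jp_bytes != cn_bytes
# ...but both decode to the one unified code point, at which point the
# font/locale machinery decides the glyph, not the original encoding.
assert jp_bytes.decode("euc-jp") == cn_bytes.decode("gb2312") == ch
```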
| lmm wrote:
| In practice it does, if it will be rendered at all. Elsewhere
| on this very page you can find people suggesting storing the
| display locale alongside the unicode string, which is really
| the only way to solve this problem in the general case. But in
| that case you might as well store pairs of byte sequence and
| encoding; there's not much difference between that and a
| unicode string plus a locale.
| yorwba wrote:
| > In practice it does, if it will be rendered at all.
|
| Seems like you're right, at least as far as Firefox is
| concerned. Testing the data links below, it appears to
| guess the default language based on the encoding used.
| Neat!
|
| data:text/html;charset=euc-jp;base64,PHA+RGVmYXVsdDogv8/Evr
| Oks9G5/Mb+PC9wPjxwIGxhbmc9ImphIj5sYW5nPSJqYSIgv8/EvrOks9G5/
| Mb+PC9wPjxwIGxhbmc9ImtvIj5sYW5nPSJrbyIgv8/EvrOks9G5/Mb+PC9w
| PjxwIGxhbmc9InpoLWNuIj5sYW5nPSJ6aC1jbiIgv8/EvrOks9G5/Mb+PC9
| wPjxwIGxhbmc9InpoLWhrIj5sYW5nPSJ6aC1oayIgv8/EvrOks9G5/Mb+PC
| 9wPjxwIGxhbmc9InpoLXR3Ij5sYW5nPSJ6aC10dyIgv8/EvrOks9G5/Mb+P
| C9wPgoK
|
| data:text/html;charset=euc-kr;base64,PHA+RGVmYXVsdDog7NPywf
| qtysfN6ez9PC9wPjxwIGxhbmc9ImphIj5sYW5nPSJqYSIg7NPywfqtysfN6
| ez9PC9wPjxwIGxhbmc9ImtvIj5sYW5nPSJrbyIg7NPywfqtysfN6ez9PC9w
| PjxwIGxhbmc9InpoLWNuIj5sYW5nPSJ6aC1jbiIg7NPywfqtysfN6ez9PC9
| wPjxwIGxhbmc9InpoLWhrIj5sYW5nPSJ6aC1oayIg7NPywfqtysfN6ez9PC
| 9wPjxwIGxhbmc9InpoLXR3Ij5sYW5nPSJ6aC10dyIg7NPywfqtysfN6ez9P
| C9wPgoK
|
| data:text/html;charset=gb2312;base64,PHA+RGVmYXVsdDogyNDWsb
| qjvce5x8jrPC9wPjxwIGxhbmc9ImphIj5sYW5nPSJqYSIgyNDWsbqjvce5x
| 8jrPC9wPjxwIGxhbmc9ImtvIj5sYW5nPSJrbyIgyNDWsbqjvce5x8jrPC9w
| PjxwIGxhbmc9InpoLWNuIj5sYW5nPSJ6aC1jbiIgyNDWsbqjvce5x8jrPC9
| wPjxwIGxhbmc9InpoLWhrIj5sYW5nPSJ6aC1oayIgyNDWsbqjvce5x8jrPC
| 9wPjxwIGxhbmc9InpoLXR3Ij5sYW5nPSJ6aC10dyIgyNDWsbqjvce5x8jrP
| C9wPgoK
|
| data:text/html;charset=big5;base64,PHA+RGVmYXVsdDogpGKqva78
| qKSwqaRKPC9wPjxwIGxhbmc9ImphIj5sYW5nPSJqYSIgpGKqva78qKSwqaR
| KPC9wPjxwIGxhbmc9ImtvIj5sYW5nPSJrbyIgpGKqva78qKSwqaRKPC9wPj
| xwIGxhbmc9InpoLWNuIj5sYW5nPSJ6aC1jbiIgpGKqva78qKSwqaRKPC9wP
| jxwIGxhbmc9InpoLWhrIj5sYW5nPSJ6aC1oayIgpGKqva78qKSwqaRKPC9w
| PjxwIGxhbmc9InpoLXR3Ij5sYW5nPSJ6aC10dyIgpGKqva78qKSwqaRKPC9
| wPgoK
| kalleboo wrote:
| Aren't the modern text display APIs of the most popular
| OSes all Unicode-based now? It seems likely that they will
| convert to Unicode when told to display a string in a
| different codepage and replace the locale info with the
| default Unicode behavior (of basing it on the user locale)
| innocenat wrote:
| And what if you want to display Japanese and Korean at the same
| time?
| wodenokoto wrote:
| Wonder how well HN and my phone handle this. There are supposed
| to be Unicode code points that indicate which locale a character
| is supposed to be displayed in. If things are well thought out,
| my phone should add them automatically and HN should keep them
| and your browser should render it correctly
|
| On an iPhone using,
|
| Chinese simplified keyboard: Ren
|
| Japanese keyboard: Ren
|
| So that didn't go very well. When choosing the character on my
| Chinese keyboard it is displayed with correct Chinese strokes but
| turns into the Japanese version in the text box. I'm guessing for
| most of you reading, both will appear Chinese.
|
| EDIT: Someone better than me at wrangling unicode can maybe try
| out the variation selectors, and print the correct variations in
| a comment. I think it would have been neat if my keyboard ime did
| it for me :)
|
| https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_b...
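| For anyone who wants to try: an Ideographic Variation Sequence is
| just the base code point followed by a variation selector from
| the VS17-VS256 range (U+E0100 and up). A Python sketch (辻 is a
| commonly cited example with registered Adobe-Japan1 variants,
| though whether anything changes on screen depends entirely on
| your font):

```python
base = "\u8fbb"          # 辻 -- has registered glyph variants in the IVD
vs17 = "\U000E0100"      # first ideographic variation selector (VS17)
ivs = base + vs17

# The sequence is two code points; renderers without IVD support
# should simply ignore the (default-invisible) selector.
assert len(ivs) == 2
assert ord(ivs[1]) == 0xE0100
```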
| wasmitnetzen wrote:
| Yep, Simplified Chinese for me (with LANG=en_US.UTF-8).
| aikinai wrote:
| They both appear Japanese for me on mobile Safari. But Japanese
| is my second preferred language on the device (after English),
| so that's probably why.
| thrdbndndn wrote:
| I'm a native CJK user myself and well aware of this phenomenon,
| but honestly it's not really that bad in most cases, for the
| following reasons:
|
| 1. Websites that are in Japanese are likely tagged with lang=ja
| already, so they will display fine. Unfortunately, this practice
| seems to be less followed by Chinese sites. I checked a few top
| sites: qq.com does have lang=zh-cn, while baidu.com and
| sina.com.cn don't.
|
| 2. The majority of UI elements in the OS will prioritize the
| display language you set when choosing variants. This means
| that if users are reading content in Japanese while also using
| a Japanese UI, the glyphs will be correct. Of course, this
| causes problems if a Japanese user is reading Chinese or vice
| versa, but such scenarios are in the minority.
|
| Another scenario, which I think is more common, is when someone
| is using a Latin-language UI. For example, lots of my
| (Chinese/Japanese) friends are using English UI while reading
| Chinese/Japanese a lot. The OS in this case will default to one
| variant (I believe Apple by default would choose Japanese) and
| therefore display another language's glyphs wrong (side note: for
| web pages, desktop browsers often have their own font/glyph
| fallback logic above the OS one).
|
| 3. Most people are just not sensitive to such things. I pointed
| it out to lots of people (when, due to their settings, some
| glyphs were displayed wrong, like Men ), and they couldn't care
| less.
|
| Also, there is no simple "fix" if you have multi-language
| content. Without manually assigning a <lang> tag to every
| single string, you can't display both Japanese and Chinese
| correctly at the same time, and it isn't worth the hassle for
| just a few phrases in a text. A good example is Wikipedia: they
| have templates for all kinds of languages so you can display
| them correctly even if it's just one Japanese word on, say,
| English Wikipedia. And Wiki editors do use them all the time!
| makeitdouble wrote:
| > 3. Most of people are just not sensitive to such thing. I
| pointed it out to lots of people (when due to their setting,
| some glyphs are displayed wrong, like Men ), and they can't
| care less.
|
| I think it's because most people don't deal with it in large
| amounts. I heard a lot more complaints from people using
| Android phones that didn't have Japanese fonts by default. By
| the third or fourth page they started to care, and once they
| noticed it the frustration just stacked up (it's just a matter
| of adding fonts, so not a big deal).
|
| Otherwise writings are flexible enough for small variants to
| not be triggering (I mean, people can already read
| calligraphy...)
| thrdbndndn wrote:
| Android has a Japanese font.
|
| Noto Sans is basically Source Han, one of the best free CJK
| fonts. Not sure what you mean.
|
| https://en.wikipedia.org/wiki/Source_Han_Sans
| makeitdouble wrote:
| If you buy a Xiaomi phone, for instance, anywhere outside
| Japan, and open a Japanese website, it will be displayed
| with Chinese glyphs. It depends on the phone, but there will
| be setup needed: sometimes just switching the whole phone
| language will do the trick, sometimes you need to add the
| right fonts yourself.
| gpderetta wrote:
| Is there a reason unicode doesn't have such a builtin lang tag?
| Similar to right-to-left and left-to-right it could help in
| displaying differently otherwise identical text.
|
| It could be stored in-band with the text with little changes to
| existing systems. The only change would be on the presentation
| layer, and if the tag were to be a non printable character, it
| would be backward compatible. An input device could implicitly
| tag input texts depending on the default lang.
|
| You need some form of sanitization, but you need it for right-
| to-left and left-to-right already.
| arp242 wrote:
| Actually, this exists already - there's U+E0001 (LANGUAGE
| TAG) and U+E007F (CANCEL TAG) and you can put a language code
| between those, e.g. "\uE0001ja-JP\uE007F".
|
| Its use is also deprecated and discouraged. According to [1]
| it's often not needed, and [2] states that it puts a lot of
| burden on implementations and is best done at a higher level
| such as HTTP, HTML, etc.
|
| I have no opinion on [1] as I don't speak these languages,
| but I do know I really hate working with these "invisible
| characters" in Unicode both as a user and developer. Copy an
| extra invisible LTR thingy or display variant codepoint and
| stuff can look and behave different, and it may not at all be
| obvious what the hell is going on (especially for those
| without a technical background).
|
| [1]: https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf#
| G115...
|
| [2]: https://www.unicode.org/versions/Unicode14.0.0/ch23.pdf#
| G301...
| lifthrasiir wrote:
| > there's U+E0001 (LANGUAGE TAG) and U+E007F (CANCEL TAG)
| and you can put a language code between those, e.g.
| "\uE0001ja-JP\uE007F".
|
| "ja-JP" part is also written in tag characters, so it's
| actually E0001 E006A E0061 E002D E004A E0050 E007F and
| doesn't render even in unsupported environments.
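| In Python terms (a sketch of the deprecated mechanism, for the
| curious -- not something to use in new text; the helper name is
| made up):

```python
# Plane-14 tag characters mirror ASCII at an offset of 0xE0000.
TAG_BASE = 0xE0000

def language_tag(lang):
    # Deprecated Unicode language tagging: LANGUAGE TAG (U+E0001), the
    # language code shifted into the tag block, then CANCEL TAG (U+E007F).
    return ("\U000E0001"
            + "".join(chr(TAG_BASE + ord(c)) for c in lang)
            + "\U000E007F")

tagged = language_tag("ja-JP")
assert [f"{ord(c):X}" for c in tagged] == \
    ["E0001", "E006A", "E0061", "E002D", "E004A", "E0050", "E007F"]
```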
| arp242 wrote:
| Oh right; I never actually used it, I just know it
| exists.
| layer8 wrote:
| > Is there a reason unicode doesn't have such a builtin lang
| tag?
|
| It does, but they aren't widely supported and their use is
| not recommended, see
| http://unicode.org/faq/languagetagging.html and
| https://datatracker.ietf.org/doc/html/rfc6082. The
| recommendation is to use markup languages instead to carry
| that information.
|
| The reasoning is that anything related to styling is out-of-
| scope for Unicode (except where needed for round-trip
| compatibility with other character sets), or else people will
| also want tags for bold, italic, monospace, or (expressed
| semantically) for emphasis, code, etc. That's what markup
| languages like HTML are for.
| marginalia_nu wrote:
| Re: lang="ja"
|
| My experience from web crawling is that use of the lang tag is
| inconsistent at best. To make matters worse, content is
| sometimes straight-up mislabeled, although Japanese sites often
| declare that they are using the Shift_JIS charset rather than
| UTF-8, which is at least somewhat helpful in figuring out that
| the content is Japanese.
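| The charset hint works because legacy-encoded bytes are rarely
| valid UTF-8. A rough Python sketch of that crawling heuristic
| (helper name is made up):

```python
raw = "日本語".encode("shift_jis")  # bytes as a Shift_JIS page would serve them

def looks_like_utf8(data):
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Shift_JIS double-byte sequences are almost never valid UTF-8, so a
# declared (or sniffed) legacy charset is a strong language hint.
assert not looks_like_utf8(raw)
assert raw.decode("shift_jis") == "日本語"
```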
| Asooka wrote:
| Out of curiosity, what do you do when you put a quote in
| Chinese from a Chinese author inline in Japanese text? Are you
| expected to write it using the Chinese forms of the characters,
| or do you write them using the Japanese forms?
|
| Edit: I mean what is the expected (grammatically correct) way
| to do it if you were writing with pen on paper.
| cehrlich wrote:
| I agree that for CJK natives it's probably not such a big deal
| (unless they live their life in more than one of those
| languages). For people like me who primarily use their computer
| in English but also do some stuff in Japanese every now and
| then it's very frustrating. Of course at this point I know
| what's going on when Zhi or whatever looks wrong, but it's
| still frustrating.
|
| OSs and Browsers having their own logic for it actually makes
| things _worse_ in some cases. Windows is especially bad
| (different types of UI elements care about different settings
| or don't care at all, so good luck having apps render correctly
| if you don't change your entire OS locale), and Chrome is
| pretty bad too, again especially on Windows. Overall MacOS/iOS
| and Safari do the best job by far.
|
| The failed attempt at Han Unification[1] is the worst decision
| the Unicode people have ever made.
|
| [1] https://en.wikipedia.org/wiki/Han_unification
| nanis wrote:
| > The failed attempt at Han Unification[1] is the worst
| decision the Unicode people have ever made.
|
| At first I nodded my head in agreement, but then I decided I
| still think the failure to include separate code points for
| "lower case Turkish dotted I" and "upper case Turkish dotless
| I" is worse.
|
| You can't have 'i' == lc( uc 'i' ) unless you already know
| you are processing Turkish ... a completely unnecessary
| complication.
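| The problem is easy to reproduce. Python's str.upper()/str.lower()
| use the locale-independent default mappings, so Turkish-correct
| round-trips are impossible without out-of-band language knowledge:

```python
# Default (non-Turkish) mappings: i <-> I round-trips fine.
assert "i".upper() == "I" and "I".lower() == "i"

# Turkish needs i -> İ (U+0130) and I -> ı (U+0131), but those pairings
# can't be expressed without knowing the text's language; e.g. lowering
# İ with the default rules yields "i" plus a combining dot above:
assert "\u0130".lower() == "i\u0307"
```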
| naniwaduni wrote:
| Turkish I unification, at least, wasn't a _decision_ the
| Unicode people made, they inherited the mistake from
| earlier encodings. Given that those already existed, the
| alternative to having broken casefolding was, essentially,
| break all mixed Turkish documents transcoded from cp857
| containing both "I" and "i" in non-Turkish functional
| directives, i.e. you'd necessarily break things like HTML
| documents without consistent tag casing.
| jhanschoo wrote:
| IMO the most jarring issue that comes from Han unification
| without proper language setting isn't with the glyph variants,
| since you actually encounter these variants and more in
| everyday life (albeit more rarely than your national standard
| ones). The more jarring issue is when your software selects a
| font meant for the wrong language, and the font for the correct
| language as a fallback. Then you may encounter serious style
| issues where your text is pockmarked by glyphs for another
| language; you have a similar phenomenon with European language
| using Latin alphabet with unconventional accents. But note that
| it takes way more effort to ask a CJK foundry to cover all
| codepoints for all languages than to ask a Latin font designer
| to cover all languages, so you would be hard pressed to find
| fonts that actually do that.
|
| Without Han unification, this wouldn't really be a problem, but
| Han unification is to a large extent the same philosophy
| pursued with unification of Latin scripts (and other scripts)
| needle0 wrote:
| My original motive to write this page came most from issues in
| video games, where text is often displayed using custom
| routines and built-in rules in the OS & browser can't be of
| help. The issue crops up most often in indie games, but it can
| be seen even in high-profile, high-budget games like Half-Life:
| Alyx or Resident Evil 4 VR.
| eloisant wrote:
| What you're saying is true (although I'm not sure about 3; all
| the Japanese people I've talked to are annoyed by it), but it
| really sucks that we're dealing with problems that were
| supposed to be fixed by Unicode.
|
| Han unification has been a huge mistake, made to save a few
| thousand code points, and now we keep piling on more and more
| stupid emoji.
| dotancohen wrote:
| > Han unification has been a huge mistake, made to save a few
| > thousand code points, and now we keep piling on more and
| > more stupid emoji.
|
| This is my exact problem with Unicode. I'm very grateful for
| the efforts they have made in the past, but the change from
| "spare valuable code points at the expense of causing
| ambiguity in text" to "assign a new code point to every
| cartoon permutation of intangible nouns" is infuriating.
| oleganza wrote:
| 25 years ago computers and networks were different. Today
| text is <0.1% of traffic compared with video, and you have
| billions of bytes of RAM in every pocket computer. So yes,
| you can pile on more emojis and no one will be bothered.
|
| Unicode may never have been adopted at all if it had an even
| larger set for CJK and made all CJK texts 1.5-2x larger than
| the Han-unified version, due to longer encodings. Also,
| UTF-8 did not exist yet and most systems treated text as
| arrays of fixed-length characters.
| anthk wrote:
| UTF-8 existed 25 years ago, but in Plan9.
| arp242 wrote:
| Han unification and Emojis are from a different time. We're
| talking about a decision made about 30 years ago in the
| early 90s; Unicode was 16 bits (65k total codepoints) at
| the time.
|
| The original CJK Unified Ideographs block from 1992
| consists of 21k codepoints. It was impossible to include that
| four times over, since 4x21k is more than 65k and space was
| going to be needed for other languages as well. Why not make
| Unicode larger? Well, size was a real concern back then
| (still is, to some degree, but less so).
|
| Since then Unicode has expanded and now we have slightly
| under 1 million codepoints. Han character blocks have
| extended to about 93k codepoints today, and 4 times ~93k
| codepoints is actually feasible. But now you run into
| compatibility issues: you can't remove all the old Han
| unification stuff (it will break text, big no-no), so you
| need to re-define it all anew. Is that better? How about
| mixing "old" Han unified codepoints with new Japanese or
| Chinese stuff? Will it really improve things or just cause
| endless confusion (see: combining characters)?
|
| For scale, all of Unicode currently defines about 145k
| codepoints; so even _with_ Han unification we're talking
| about two thirds being taken up by just these three
| languages.
|
| In comparison there are currently about 3,000 emojis,
| although the number of codepoints is much less since many
| codepoints are re-used (e.g. "firefighter" is "person +
| firetruck", flags use the country code, etc.). In a quick
| check it looks like there are about 1,000 to 1,500
| codepoints reserved for emojis. In comparison, this is
| nothing.
|
| What I'm trying to say is that the (comparatively) very low
| number of emojis has absolutely no bearing on this and that
| going off on a tangent about it is very misplaced.
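| The arithmetic, spelled out (rough figures from the comment
| above):

```python
# Four unshared ~21k-codepoint CJK blocks could never fit in the
# original 16-bit Unicode code space:
assert 4 * 21_000 > 2**16

# Today, even four unshared ~93k blocks would fit in the current
# code space of 17 planes x 65,536 code points:
assert 4 * 93_000 < 1_114_112
assert 17 * 2**16 == 1_114_112
```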
| dotancohen wrote:
| I have no problem with 2/3 of the codepoints being taken
| up by 3 languages. Right now we (rightly) bend over
| backwards to accompany handicapped users, often tripling
| or quadrupling our QA. CJK users are much more common
| than handicapped users, so the benefit-vs-cost ratio is
| even greater for CJK users than for handicapped users.
| arp242 wrote:
| > I have no problem with 2/3 of the codepoints being
| taken up by 3 languages.
|
| I have no problems with this either, at least not
| principally. But historically this was literally
| impossible. Someone thought of a clever hack that seemed
| like a good idea at the time, but turns out it doesn't
| work all that great after all (at least, according to
| some - opinions seem to differ and I can't really judge
| myself) and now you're stuck with it and fixing isn't so
| easy - I don't know if people have made concrete
| proposals for fixing this, but if it was easy it probably
| would have been done already. Sometimes sticking with a
| suboptimal "legacy" solution is better than replacing it
| with a new better solution due to the friction and issues
| involved.
| sleepy_keita wrote:
| Oh, this explanation is perfect to give to people when I
| encounter this error. Thanks!
| xvilka wrote:
| It's more of a problem in pure-text apps than on the Web: for
| example, in editors (not the rich-text ones), the console, and
| interface elements. But yes, it is a problem for people who
| know (or are learning) and use multiple languages at once,
| e.g. English, Chinese, and Japanese.
| captainmuon wrote:
| How do you do it correctly in a bi-lingual app? Say your app is
| in English but you want to display Asian-language file names. Is
| there any way to tell if a string is Chinese or Japanese? I think
| CJK variation selectors embedded in the string are not widely
| used. And it would be a bit overkill to include a language
| detection heuristic (which would likely fail for short phrases).
| So should you let the user decide? Default to Japanese on a
| Japanese PC, otherwise leave it undefined?
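| For what it's worth, a cheap script-range check can at least
| separate the easy cases; it's the Han-only strings that stay
| ambiguous (a rough sketch, not a real language detector):

```python
def guess_cjk_lang(s):
    for ch in s:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:      # Hiragana + Katakana
            return "ja"
        if 0xAC00 <= cp <= 0xD7A3:      # Hangul syllables
            return "ko"
    if any(0x4E00 <= ord(ch) <= 0x9FFF for ch in s):
        return None  # Han only: zh or ja, undecidable from the text alone
    return None

assert guess_cjk_lang("こんにちは") == "ja"   # kana => Japanese
assert guess_cjk_lang("안녕") == "ko"         # Hangul => Korean
assert guess_cjk_lang("直") is None           # Han only => ambiguous
```

| As the comment says, this fails exactly where it matters most:
| short, kana-free phrases and names.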
| skhr0680 wrote:
| Set it by the device language, with a way to override that in
| your app's settings
| lifthrasiir wrote:
| This is a good approximation, but it is still incorrect if,
| say, you are showing Japanese user names from a view for
| Chinese users.
| hannob wrote:
| I didn't know about this, but I can't help thinking this sounds
| like a bug in Unicode to me. If these characters are different,
| then why does Unicode assign one code point to them? Wasn't the
| promise of Unicode exactly to not do this kind of thing?
|
| Can this be fixed? New character codes for the ambiguous
| characters could be assigned; of course this would require
| manual conversion (with knowledge of the variant) for existing
| data, but at least it would make the issue go away moving
| forward (and unconverted legacy data would be "just as bad" as
| it used to be, so no loss).
| oleganza wrote:
| This issue with Han unification was a big reason for the
| stalled adoption of Unicode/UTF-8 in Ruby for years. UTF-8 by
| default came to Ruby only after 1.9, which added thorough
| support for a variety of encodings, so that UTF-8 is not the
| only option.
| oleganza wrote:
| Han unification started in the '90s, when computers were big,
| memory was small, UTF-8 did not exist, and people were trying
| to fit all characters into a reasonable number of codepoints.
| Today, with the variable-length encoding of UTF-8 and video
| streams over 5G, supporting all variants as distinct codepoints
| and patching text search and sorting with more "normalization"
| algorithms would not be a problem at all.
| afiori wrote:
| in my opinion utf8 should have been a bigger variable-length
| encoding. Today it is:
|
| 0xxxxxxx
|
| 110xxxxx 10xxxxxx
|
| 1110xxxx 10xxxxxx 10xxxxxx
|
| and
|
| 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
|
| the only reason not to push those last bits and add
|
| 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
|
| 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
|
| 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
| 10xxxxxx
|
| and maybe even
|
| 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
| 10xxxxxx 10xxxxxx
|
| is utf-32; they should have dropped it and solved the
| codepoint problem this way.
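| Counting the free bits in each form shows what the longer
| sequences would buy (a quick check of the scheme sketched
| above):

```python
# Payload bits = lead-byte 'x' bits + 6 bits per continuation byte.
def payload_bits(total_bytes, lead_bits):
    return lead_bits + 6 * (total_bytes - 1)

assert payload_bits(2, 5) == 11   # 110xxxxx 10xxxxxx
assert payload_bits(4, 3) == 21   # standard UTF-8's ceiling (pre-U+10FFFF cap)
assert payload_bits(6, 1) == 31   # would cover UTF-32's full 2**31 range
assert payload_bits(8, 0) == 42   # the 8-byte form sketched above
```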
| lifthrasiir wrote:
| First I'd like to introduce you to this: http://ucsx.org/
|
| But no, there is no particular reason to introduce a longer
| encoding than the modern UTF-8 (which is actually shortened
| from the original one-to-six-byte encoding). The current
| set of 1,114,112 Unicode characters is sufficient for at
| least the foreseeable future, because any new assignment
| requires a demonstrable historic or current use. (Emojis
| are slightly different, but they still require that the
| underlying concept is widespread and do not significantly
| overlap with existing emojis. See [1].) Han characters are
| the largest source of new assignments to this date and they
| are yet to reach two out of 17 full planes (that would
| equate to 131K characters).
|
| [1] https://news.ycombinator.com/item?id=26904980
| jhanschoo wrote:
| This isn't a uniquely CJK problem, see my other comment
| https://news.ycombinator.com/item?id=29024422
|
| The other approach is to assign each language distinct
| codepoints, but I guess the current approach is better for
| backward compatibility with pre-Unicode encodings, and gives
| less redundancy in Latin-script documents.
| nikanj wrote:
| Am I right in assuming fixing this in player names etc makes
| Chinese look wrong? It's an easier problem if you know the whole
| page is Japanese, but how about things like game lobbies, where
| every username is in a different language?
| yorwba wrote:
| Either store the locale used when a user enters their name and
| then use it to mark up the text whenever you display the
| username, or simply use the system default, so Japanese users
| will see Chinese names with Japanese glyphs and Chinese users
| will see Japanese names with Chinese glyphs. Other users
| randomly get whatever.
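| As a sketch of the "store the locale alongside the name" option
| (function and field names here are made up for illustration):

```python
from html import escape
from typing import Optional

def render_username(name: str, locale: Optional[str]) -> str:
    """Emit the name wrapped in a lang attribute when we recorded one."""
    if locale:
        return f'<span lang="{escape(locale)}">{escape(name)}</span>'
    return escape(name)  # no stored locale: fall back to the viewer's default
```

| So a name entered under a Japanese IME/locale renders inside
| <span lang="ja">...</span> regardless of the viewer's own locale.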
| andrewl-hn wrote:
| The locale might be set to something completely different. A
| lot of programmers run their machines in English and not in
| their native language. One could use location to detect which
| variant to use, but that too wouldn't work for, say, Chinese
| speakers in Japan. In an ideal world we would use the locale of
| the input source (if the user sets their keyboard to
| Traditional Chinese, we should use it for that fragment of
| text). However, operating systems and browsers don't provide
| the input source locale API.
| drran wrote:
| > The locale might be set to something completely
| different.
|
| Follow the choice of user, please.
| eska wrote:
| Locale: English. Text input: Ren
|
| Question: was that a Japanese or Chinese character?
|
| Answer: locale doesn't help us here, unsolvable problem.
| Hackbraten wrote:
| Good point. Would it be feasible to check the locale used at
| profile creation time, then store that locale alongside the
| username if it contains at least one CJK glyph?
| eska wrote:
| What if a player with a German locale used their favorite
| anime character's name? How do you know whether to use
| Chinese or Japanese characters on somebody else's PC? Even a
| Chinese player should see Japanese characters there. But you
| would first of all basically need to ask the German player
| with a drop down menu which language their name is in, which
| will never happen, so we just assume Chinese. It's just
| broken.
|
| I think we clearly need an in-band solution. Some character
| that switches the Asian glyph variant, or separate characters
| altogether. The former would be annoying for fixed-width
| Unicode encodings, because you'd lose the ability to do random
| access into a large corpus of text: you'd need to scan the
| entire text to find out the current Asian glyph variant
| mode... sigh
___________________________________________________________________
(page generated 2021-10-28 23:02 UTC)