[HN Gopher] Katakana, Hiragana, and Unicode
___________________________________________________________________
Katakana, Hiragana, and Unicode
Author : zdw
Score : 69 points
Date : 2022-09-26 20:11 UTC (2 days ago)
| sapkernel wrote:
| This is a cool language-learning project in code. Thanks!
| dudeinjapan wrote:
| And here's half-width katakana in non-Unicode, circa 1979:
| https://github.com/receipt-print-hq/escpos-printer-db/issues...
|
| This is probably the system that is still used today by most bank
| printers in Japan. The code point usage was carried forward to
| JIS.
| gweinberg wrote:
| What are these mysterious symbols that don't fit in the table?
| teshigahara wrote:
| wi (wi) (wu) we (we) (yi) (ye)
|
| wi and we (these days pronounced the same as Japanese i and e)
| are known by all native Japanese speakers, were used
| historically, and actually still see some use in certain
| scenarios (like signs, or names of things). The other ones were
| never actually used much afaik and only recently were
| introduced to Unicode at all, and are probably unknown to most
| Japanese people except those interested in this kind of thing.
| Symmetry wrote:
| I was confused by the lack of little `yx` characters there so
| looked it up on my own. The 'ゆ' in りゆ, "riyu", is 86 and the
| small 'ゅ' in りゅ, "ryu", is 85.
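|
| A minimal sketch for checking those code points, assuming Python's
| standard unicodedata module. The 85 and 86 above are the low byte
| of U+3085 and U+3086; the small tsu pair (U+3063/U+3064) is added
| here only as a second illustration, and describe() is a
| hypothetical helper name.

```python
import unicodedata

# (small, regular) hiragana pairs, written as escapes so the code points are explicit.
PAIRS = [
    ("\u3085", "\u3086"),  # small yu, yu
    ("\u3063", "\u3064"),  # small tsu, tsu (illustrative addition)
]


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for small, regular in PAIRS:
    print(describe(small), "|", describe(regular))
```

| In both pairs the small form sits one code point before the
| regular one.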
| viggity wrote:
| same with the small tsu which makes you kinda pause/emphasize
| the following consonant.
|
| Ex.
| あさり = asari = ah - sa - ri
| あっさり = assari = ah <tiny pause, hard s> sa - ri
|
| not to be confused with あつさり, which is atsusari (a made-up
| word); because the つ is regular sized, you pronounce it
| instead of altering other characters' pronunciations.
|
| Also of note - they completely left out "n": ん in hiragana, ン
| in katakana.
|
| And を ("wo") isn't really pronounced "wo"; it is pronounced
| just "oh" and spelled "o" in romaji. And while there is a "wo"
| in katakana (ヲ), I have never seen it used. を is used as a
| particle, which is inherently a native Japanese thing, so you
| use hiragana for it.
| Anon1096 wrote:
| You see wo if you read stuff where there's a robot, alien, or
| super stereotypical foreigner speaking, since oftentimes
| their entire lines are written in katakana to feel non-
| native.
| kensai wrote:
| John Cook is a genius. All his blog posts are gems. I love his
| job, I wish I could do it.
| sylware wrote:
| Is there a text shaper written in plain and simple C for Unicode
| katakana and hiragana strings?
| ranger_danger wrote:
| what is a text shaper?
| HelloNurse wrote:
| The software that controls how to render text to images,
| relying on fonts but considering higher level issues (e.g.
| line breaking and metrics for multiple characters) than the
| low-level information in a font.
| vore wrote:
| There's no shaping or really even anything fancy
| typographically required for kana, just put the glyphs next to
| each other fixed-width no kerning.
| mananaysiempre wrote:
| Modern horizontal hiragana and katakana are not complex or huge
| scripts, there are several dozen base characters (of one or two
| different widths) and two or so accent marks. There might be no
| spaces, you break lines whenever they run out without
| considering word boundaries. I expect anything capable of
| dealing with Latin should be able to handle this, and it hardly
| deserves the name of "shaping".
|
| (Adding kanji into the mix somewhat complicates matters, as
| there are so many potential characters you cannot just blindly
| cache the rasterization of every one of them and never throw
| any away, but that's also not the degree of complexity you get
| from Arabic and such.)
| amichal wrote:
| Line layout rules are a bit complex. Long, long ago, when I was
| 19, someone handed me a photocopied set of around 50 rules for
| line breaking Japanese text, and I followed them to implement
| our first draft of it in a text layout program we were adding
| Japanese support to. I implemented it blind; I don't speak
| Japanese. It never shipped and I don't remember the rules, but
| I do remember quite some complexity around punctuation etc.
| This section from W3C covers some of what I remember and
| quite a bit more, I'm sure:
| https://www.w3.org/TR/jlreq/#line_composition_rules_for_punc...
| innocenat wrote:
| To be honest, both questions can be answered in a few seconds by
| looking at the code point table for Hiragana/Katakana if you
| already know Japanese. That's why nobody writes about it.
|
| > How do the 46 characters map into the 90 characters?
|
| Because there are actually more than 46 characters.
|
| > Do they map the same way for both hiragana and katakana?
|
| Yes. That's also how we do conversion between hiragana and
| katakana. By adding/subtracting 0x60.
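|
| A minimal sketch of that 0x60 offset in Python. The range
| U+3041..U+3096 for the hiragana letters that have a katakana
| counterpart is an assumption taken from the code charts, and the
| function names are hypothetical.

```python
KANA_OFFSET = 0x60  # distance between the hiragana and katakana blocks

# Assumed range of hiragana letters that have a katakana counterpart at +0x60.
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096


def hiragana_to_katakana(text: str) -> str:
    """Shift hiragana letters up by 0x60; leave every other character alone."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_FIRST <= ord(ch) <= HIRAGANA_LAST
        else ch
        for ch in text
    )


def katakana_to_hiragana(text: str) -> str:
    """Shift katakana letters down by 0x60; leave every other character alone."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if HIRAGANA_FIRST + KANA_OFFSET <= ord(ch) <= HIRAGANA_LAST + KANA_OFFSET
        else ch
        for ch in text
    )


if __name__ == "__main__":
    word = "\u3072\u3089\u304c\u306a"  # "hiragana" written in hiragana
    print(hiragana_to_katakana(word))
```

| Characters outside the assumed range, such as the long vowel
| mark or the katakana-only letters after U+30F6, pass through
| unchanged.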
| superjan wrote:
| For curiosity: how are they sorted? Are hiragana/katakana
| symbols considered equivalent?
| innocenat wrote:
| They are sorted by dictionary order (五十音順, gojuuonjun) and
| there are no duplicated symbols in the hiragana-katakana
| blocks.
| zerocrates wrote:
| The hiragana and katakana and various versions thereof for
| each mora all share the same "primary" Unicode collation
| value. Adding a dakuten or handakuten creates a secondary
| difference: e.g. は (ha) < ば (ba) < ぱ (pa).
|
| As between the versions for the same mora, they get sorted
| with tertiary differences as: hiragana comes before katakana,
| small comes before regular-size, and for katakana regular
| width comes before halfwidth. There's also a "circled" set of
| the katakana that sort after the halfwidth ones.
|
| So they're equivalent (or not) depending on how you're doing
| the collation/comparison.
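|
| A rough sketch of a "primary level only" comparison key, using
| just Python's unicodedata module. This is not the Unicode
| Collation Algorithm: it merely folds away the differences listed
| above (width, circled forms, dakuten/handakuten, script, size).
| rough_primary_key() and the small-kana table are hypothetical
| names made up for the illustration.

```python
import unicodedata

# Combining dakuten and handakuten, as produced by NFD.
VOICING_MARKS = {"\u3099", "\u309a"}

# Small hiragana -> regular-size hiragana (a i u e o tsu ya yu yo wa ka ke).
SMALL_TO_REGULAR = str.maketrans(
    "\u3041\u3043\u3045\u3047\u3049\u3063\u3083\u3085\u3087\u308e\u3095\u3096",
    "\u3042\u3044\u3046\u3048\u304a\u3064\u3084\u3086\u3088\u308f\u304b\u3051",
)


def rough_primary_key(text: str) -> str:
    # NFKC folds halfwidth and circled katakana into regular katakana.
    folded = unicodedata.normalize("NFKC", text)
    # NFD splits voiced kana into base letter + combining mark; drop the mark.
    decomposed = unicodedata.normalize("NFD", folded)
    unmarked = "".join(ch for ch in decomposed if ch not in VOICING_MARKS)
    # Shift katakana U+30A1..U+30F6 down to the matching hiragana.
    hiragana = "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in unmarked
    )
    # Finally fold small kana into regular-size kana.
    return hiragana.translate(SMALL_TO_REGULAR)


if __name__ == "__main__":
    # ha, ba, pa in hiragana; ha in katakana; halfwidth ha; halfwidth ha + handakuten
    variants = ["\u306f", "\u3070", "\u3071", "\u30cf", "\uff8a", "\uff8a\uff9f"]
    print({rough_primary_key(v) for v in variants} == {"\u306f"})
```

| Comparing the keys treats all of these as equal; comparing the
| raw strings does not, which is the "depending on how you're
| doing the comparison" point.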
| innocentoldguy wrote:
| The article shows how they are sorted. Hiragana is used for
| things like Japanese words, particles, names, and to
| conjugate verbs. Katakana is used for things like foreign
| words, names, and sometimes emphasis. Both writing systems
| describe the same phonetics. For example, the hiragana か and
| katakana カ are both pronounced "ka."
| Pxtl wrote:
| I'm surprised they're both used, from that description it
| sounds like one would fall by the wayside, like cursive has
| in North America.
|
| ...
|
| That said, culturally Japan seems like exactly the kind of
| place where, were they English-speakers, all the kids would
| absolutely be required to learn perfect cursive.
| msbarnett wrote:
| > I'm surprised they're both used, from that description
| it sounds like one would fall by the wayside, like
| cursive has in North America.
|
| In practice, there are no less than 4 separate scripts
| that are used in Japanese: hiragana, katakana, kanji, and
| romaji, and some mix of all 4 can appear in the same
| sentence.
|
| It's not so much analogous to cursive, which is a
| different "style" of writing the same "thing" - katakana
| and hiragana developed at different times for different
| groups and came to play different roles, and there are
| (usually) semantic implications to which are used.
| AnIdiotOnTheNet wrote:
| English print still has two separate character sets with
| exactly the same pronunciation too. One is used most of
| the time, and the other is used to start sentences, for
| EMPHASIS on whole words, or to indicate proper nouns.
| popularonion wrote:
| Japanese people could ask the same question about why
| English continues to have uppercase and lowercase
| letters.
|
| Actually when you look at the use of English in Japanese
| media, you'll quickly notice a lot of unnatural-looking
| overuse of uppercase. That's because to them it feels
| natural to use uppercase the same way they use katakana.
| dfinninger wrote:
| I am very early on in my Japanese-learning journey. So if
| others contradict me, they are probably a better source.
| :)
|
| But from what I understand Hiragana is used more for
| Japanese words, and Katakana more for loan words from
| other languages.
|
| It actually leads to a nice shortcut for some words. If
| I'm reading Hiragana I'll try to match that with words in
| Japanese that I know. However, if the word I'm looking at
| is Katakana, I'll flip that off and start trying to match
| phonetically.
|
| I assume with fluency this all becomes automatic, but I'm
| a ways off from that yet!
| layer8 wrote:
| It's more akin to italics in usage than to cursive.
| bsder wrote:
| > I'm surprised they're both used, from that description
| it sounds like one would fall by the wayside, like
| cursive has in North America.
|
| Katakana, hiragana, and kanji are all in active use--
| that's why they don't fall away.
|
| Kanji are your primary word base. They are sort of like
| root words in English.
|
| Hiragana often serves as kind of a marker--endings of
| certain words as well as phrase markers (particles).
| These are particularly important because Japanese does
| not normally break words with spaces.
|
| Katakana often denotes a foreign phonetic word or foreign
| names. Login is a particularly good example for this forum:
| ログイン (ro-gu-i-n).
|
| Japanese speakers actively use the differences as cues
| when reading. Watch a native Japanese speaker try to
| puzzle out Japanese learning materials for non-native
| speakers. If everything is written in hiragana (not
| uncommon for beginning materials), native speakers often
| have to puzzle over things a bit before they work out
| what a sentence says. This is one of the reasons why you
| want to get to Kanji as fast as possible when learning
| Japanese--the differences in script are _important_ for
| reading comprehension.
|
| You can see all of these in play on the Asahi Shimbun
| webpage: https://www.asahi.com/
| spacehunt wrote:
| Another simple technical reason is that that's how JIS did it,
| and Unicode wanted lossless round-trip encoding conversions in
| order to promote its adoption in East Asia at the time.
| dhosek wrote:
| There are some interesting variations in different scripts
| thanks to how they were handled in pre-Unicode encodings.
| Perhaps the most interesting divergence is in the various
| scripts derived from the old Brahmi script. These are all
| abugidas (as are the Japanese kana) where vowels do not exist
| independently of consonants. But in Thai, for example, the
| syllable NA is written นา with น and า treated as separate
| characters, while in Devanagari, NA is written ना where न is
| the N sound and the A sound ा is a spacing mark which
| changes the shape and spacing of the first letter to give
| ना. Although a Thai reader will read the combination of
| consonant and vowel as a single entity, they are treated as
| two graphemes by Unicode, while the equivalent in Devanagari
| is a single grapheme (and it's not simply because they're
| printed connected since नाना will be connected but treated
| as two graphemes).
|
| Perhaps most interesting in this respect is the comparison
| between the Devanagari i and the Thai ai which both appear
| before the consonant that they're attached to, but in Thai
| the input will be ai + kh to get aikh (so you input in the
| order of appearance rather than the order of pronunciation)
| while in Devanagari, the input would be k + i to get ki (so
| you input in pronunciation order rather than graphic order).
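|
| One hedged way to see the difference is to look at the general
| categories with Python's unicodedata module: the Thai vowels are
| ordinary letters (Lo) while the Devanagari vowel signs are
| spacing marks (Mc). The category is only a proxy here, not the
| grapheme segmentation algorithm itself, and the specific code
| points (including U+0E44 for the Thai "ai") are assumptions
| chosen for the illustration.

```python
import unicodedata

# Illustrative code points; which exact letters the comment meant is assumed.
SAMPLES = {
    "Thai NO NU": "\u0e19",
    "Thai SARA AA": "\u0e32",
    "Thai SARA AI MAIMALAI": "\u0e44",
    "Devanagari NA": "\u0928",
    "Devanagari vowel sign AA": "\u093e",
    "Devanagari vowel sign I": "\u093f",
}


def category_of(ch: str) -> str:
    """Return the Unicode general category, e.g. 'Lo' or 'Mc'."""
    return unicodedata.category(ch)


for label, ch in SAMPLES.items():
    print(f"U+{ord(ch):04X} {category_of(ch)} {label}")
```

| A vowel that is a letter in its own right is stored where it is
| typed, which fits the visual-order input described for Thai; a
| mark has to follow its base consonant in the stored text, which
| fits the pronunciation-order input for Devanagari.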
| lloeki wrote:
| For someone like me who knows somewhere between pretty much
| none to a very small bit of Japanese and slowly working my way
| up as time permits in a busy life, this was an interesting and
| very well presented article, saving me more than a few seconds
| of searching that I don't have to spare, and for which the
| reading time was both enjoyable and knowledge incrementing.
|
| Hence, that's very much fortunate that someone wrote about it.
| innocenat wrote:
| I don't disagree about that. I just answered the first
| question in the article about why no one is writing about
| this.
| resoluteteeth wrote:
| There are a bunch of other annoying complexities with dealing
| with Japanese text like halfwidth/full-width characters:
| depending on what you're doing you may have to account for
| additional stuff like fullwidth Latin letters instead of ASCII
| ones, or halfwidth katakana instead of regular-width katakana.
| Ideally these wouldn't actually be used (this formatting should
| not be done at the character set level) but since they were
| included in Unicode for backwards compatibility reasons, they do
| unfortunately get used a fair amount.
|
| Also I guess this isn't specific to Japanese, but if you use
| normalization in NFD form, the modifiers like handakuten get
| split into separate characters (I don't think most people ever
| use Unicode normalization but iirc Mac filesystem paths are
| normalized so it can be really confusing when you do actually run
| into it).
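|
| A small sketch of both effects with Python's unicodedata module.
| The helper names same_text() and fold_width() are hypothetical;
| whether NFKC folding is appropriate depends on the application,
| since it also rewrites many other compatibility characters.

```python
import unicodedata

PA = "\u3071"  # HIRAGANA LETTER PA as a single precomposed code point

# NFD splits it into HA (U+306F) plus the combining handakuten (U+309A).
PA_NFD = unicodedata.normalize("NFD", PA)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def fold_width(text: str) -> str:
    """Fold fullwidth Latin and halfwidth katakana to their regular forms."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print([f"U+{ord(c):04X}" for c in PA_NFD])
    print(PA == PA_NFD, same_text(PA, PA_NFD))
    # Fullwidth Latin A followed by halfwidth katakana A.
    print([f"U+{ord(c):04X}" for c in fold_width("\uff21\uff71")])
```

| The raw strings compare unequal even though they render the
| same, which is the kind of surprise described above for
| normalized file names.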
___________________________________________________________________
(page generated 2022-09-28 23:00 UTC)