[HN Gopher] A Spectre Is Haunting Unicode (2018)
___________________________________________________________________
A Spectre Is Haunting Unicode (2018)
Author : EvanAnderson
Score : 231 points
Date : 2022-07-14 13:00 UTC (10 hours ago)
(HTM) web link (www.dampfkraft.com)
(TXT) w3m dump (www.dampfkraft.com)
| NeoTar wrote:
| Is there anything similar for Latin characters?
|
| The only circumstance I can imagine is where a Latin character
| has been erroneously encoded with an unused diacritic, for
| instance a T with a diaeresis.
| wbl wrote:
| Multiocular O is known only from a single word in a single
| manuscript.
| indecisive_user wrote:
| Link to the wiki article, though this is a variant of a
| Cyrillic letter, not Latin:
|
| https://en.wikipedia.org/wiki/Multiocular_O
| asveikau wrote:
| This is funny. The Indo-European (and hence Slavic) roots
| for eyes typically have an /o/, and this glyph is round
| like the eye, so it seems like this character and others
| linked in that article are just people making little
| cartoonish drawings on writings involving descriptions of
| eyes.
| jxy wrote:
| Something like the letters V and U.
| lisper wrote:
| Or double-U (i.e. UU == W)
| krossitalk wrote:
| > At this rate they'll presumably be with humanity forever.
|
| So, that's a really interesting thought. Perhaps our solution to
| a permanent reminder of nuclear destruction[1] could be hidden
| inside a plane of Unicode.
|
| [1] https://en.wikipedia.org/wiki/Long-
| term_nuclear_waste_warnin...
| remram wrote:
| Maybe Unicode will feature the same kind of warnings one day.
|
| > This Unicode range is not a place of honor. No highly-
| esteemed symbol is registered here.
|
| > What was here represented cultural signs that were considered
| powerful in our time.
| wongarsu wrote:
| Maybe we can encode instructions on how to restart society in
| Unicode character names? After all basically every computer
| contains a list of them.
| bogwog wrote:
| That did not end well for the Georgia guidestones...
| ethbr0 wrote:
| Also thought about posting that this morning, but wasn't
| sure anyone else would get the reference. (As context for
| everyone else, some kook blew up some of the guidestones
| last week in the middle of the night)
| _dain_ wrote:
| Not so kooky, they were a call for genocide.
| tomcatfish wrote:
| From what I see, this was the maximally flame-y way to
| say what you said, and it's still inaccurate to call it
| "Not so kooky" as these commandments, while disagreeable
| to me, are not really that violent.
|
| 1. Maintain humanity under 500,000,000 in perpetual
| balance with nature. 2. Guide reproduction wisely -
| improving fitness and diversity. 3. Unite humanity with a
| living new language. 4. Rule passion - faith - tradition
| - and all things with tempered reason. 5. Protect people
| and nations with fair laws and just courts. 6. Let all
| nations rule internally resolving external disputes in a
| world court. 7. Avoid petty laws and useless officials.
| 8. Balance personal rights with social duties. 9. Prize
| truth - beauty - love - seeking harmony with the
| infinite. 10. Be not a cancer on the Earth - Leave room
| for nature - Leave room for nature.
|
| source: https://en.wikipedia.org/wiki/Georgia_Guidestones
| #Inscriptio...
| ethbr0 wrote:
| The Latin alphabet being boring, I spent some time going through
| ancient alphabets included in Unicode.
|
| It gets pretty trippy, pretty quick.
|
| As in "We don't have a clear idea what this rune was for, or what
| it means, but we see it in documents and so added it to Unicode."
|
| https://en.m.wikipedia.org/wiki/Runic_(Unicode_block)
| shantara wrote:
| My favorite Unicode glyph is Multiocular O (ꙮ). There is only
| one recorded usage, by a 15th century Russian monk, who decided
| to use it in the phrase "many-eyed seraphim" instead of two regular
| letters 'o'. So of course it was added to Unicode.
|
| https://en.wikipedia.org/wiki/Multiocular_O
| lmkg wrote:
| It gets better: this glyph is bugged. The guy responsible for
| adding it to Unicode somehow got _the number of eyes wrong_.
| Per his description, Unicode fonts represent
| it with 7 eyes, but after getting called out on Twitter he
| realized the original manuscript shows 10 eyes.
|
| This bug will be fixed in Unicode 15.
| corrral wrote:
| What about modern uses of the character that specifically
| intended 7 eyes? Unicode needs to add a time or (worse, but
| probably OK) version datum to glyphs or glyph ranges, I
| suppose (applying it only at the document level wouldn't
| suffice, as in the case of quoting).
| B1FF_PSUVM wrote:
| Achieving peak Byzantium there, I guess.
| thaumasiotes wrote:
| > As in "We don't have a clear idea what this rune was for, or
| what it means, but we see it in documents and so added it to
| Unicode."
|
| Documents? I had the strong impression that there are no
| documents written in runes. A rune we only know by its
| occurrence in documents would be far more interesting for the
| existence of a document than it would be for its own sake!
|
| Compare what the page about Anglo-Saxon runes says about the
| corpus:
|
| > The Old English and Old Frisian Runic Inscriptions database
| project at the Catholic University of Eichstätt-Ingolstadt,
| Germany aims at collecting the genuine corpus of Old English
| inscriptions containing more than two runes in its paper
| edition, while the electronic edition aims at including both
| genuine and doubtful inscriptions down to single-rune
| inscriptions.
|
| > The corpus of the paper edition encompasses about one hundred
| objects (including stone slabs, stone crosses, bones, rings,
| brooches, weapons, urns, a writing tablet, tweezers, a sun-
| dial,[clarification needed] comb, bracteates, caskets, a font,
| dishes, and graffiti). The database includes, in addition, 16
| inscriptions containing a single rune, several runic coins, and
| 8 cases of dubious runic characters (runelike signs, possible
| Latin characters, weathered characters). Comprising fewer than
| 200 inscriptions, the corpus is slightly larger than that of
| Continental Elder Futhark (about 80 inscriptions, c. 400-700),
| but slightly smaller than that of the Scandinavian Elder
| Futhark (about 260 inscriptions, c. 200-800).
|
| So across every runic system we know, we have under 600 texts,
| _all_ of those texts are short inscriptions, and even to reach
| that number of samples we need to include texts that we aren't
| even sure contain any runes.
| yorwba wrote:
| Runes continued to be used long past the Elder Futhark period
| and from the medieval period manuscripts survive that fit the
| modern conception of a "document", most famously the Codex
| Runicus https://www.e-pages.dk/ku/579/html5/ (202 pages)
| bombcar wrote:
| https://www.youtube.com/watch?v=2yWWFLI5kFU describes another
| side-effect of encoding old scripts/runes.
| hypertele-Xii wrote:
| > I had the strong impression that there are no documents
| written in runes.
|
| There are. Such documents are called runestones and thousands
| survive to this day, most in Sweden.
|
| https://en.wikipedia.org/wiki/Runestone
| eesmith wrote:
| Huh. https://en.wikipedia.org/wiki/Document says:
|
| > Documents are also distinguished from "realia", which are
| three-dimensional objects that would otherwise satisfy the
| definition of "document" because they memorialize or
| represent thought; documents are considered more as
| 2-dimensional representations.
|
| I think "realia" - a term I had never heard before -
| describes runestones better than "document".
| hprotagonist wrote:
| >Documents? I had the strong impression that there are no
| documents written in runes.
|
| If a clay tablet counts, why not a runestone?
| thaumasiotes wrote:
| I'm not knocking runestones for being the wrong medium. I'm
| knocking them for not being documents. A typical cuneiform
| record might be analogized to an invoice for delivery of a
| crate of shirts or whatever. (And of course we also have
| textbooks, dictionaries, literature, correspondence,
| business reports, mathematical treatises, and every other
| type of written work.) A typical runic record would be more
| like the text "Made in Taiwan" printed on the shirt labels.
|
| One of the biggest problems in the study of these cultures
| is that they left no written records. We know they had a
| writing _system_, the runes, but as far as we can tell
| they almost never used it for anything. Quite the opposite
| is true of Mesopotamian cultures, where we're buried in
| more records than we have the manpower to translate.
| hprotagonist wrote:
| I suppose it also matters what you think a rune is. Does
| futhork count? There's parchment with that written on it.
| Elder Futhark, none as far as I know.
| gumby wrote:
| > Documents? I had the strong impression that there are no
| documents written in runes.
|
| One of the original goals of Unicode was to be able to
| computerize every document. I still have some old linguistics
| books in which characters have been handwritten into typed or
| even typeset text. So these are the types of documents being
| referred to: academic papers.
|
| Some fancy books have photographs of ancient writing; I'm not
| sure if Unicode tries to encode such sources and I pretty
| much doubt it (how would you even know what to call the
| symbols? You touch on this in your comment). However often
| they are attached to treatises that order the characters in
| some way (i.e. index an alphabet) in which case the first
| case above would apply.
|
| In other words: thanks to some scholars who wrote down and
| ordered runic alphabets, you can now discuss runes with your
| friends and colleagues through email.
| thaumasiotes wrote:
| > One of the original goals of Unicode was to be able to
| computerize every document. I still have some old
| linguistics books in which characters have been handwritten
| into typed or even typeset text.
|
| That's a weird goal for Unicode to have. We've already
| accomplished that; a PDF file does the job _better_ (note:
| PDF documents _already support_ every character existing in
| the past, present, or future!) while being less complex.
| gumby wrote:
| I don't understand. If there is no computerized way to
| represent the script, all you can do would be to include
| photographs in your pdf. The point of computerization is
| not simply storage and retrieval (and retrieval is hard
| if you can't represent the script) but automated
| processing, which is meaningless if you can't represent
| any semantics.
|
| Separately, PDF felt like a step backwards on the day it
| was announced and sadly nothing since then has changed
| that.
| CorrectHorseBat wrote:
| How do you search for non-unicode characters in a pdf
| document?
| thaumasiotes wrote:
| How do you search for them in a book?
| gpderetta wrote:
| ctrl-F once you have digitized it.
| jen20 wrote:
| And how do you type the character you are searching for?
| cgriswald wrote:
| On Ubuntu: l-ctrl+l-shift+u, <codepoint>, <enter>
|
| Of course, that sucks, so I've programmed a nearby key to
| act as l-ctrl+l-shift+u.
|
| Several characters can also be typed with Compose Key.
|
| For characters I use regularly (in my case, generally the
| elder and younger futharks), I've created a keyboard out
| of an Elgato StreamDeck XL so I can type any of these
| runes with a single button press.
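A minimal Python sketch of what the Ctrl+Shift+U entry method above amounts to: hex digits in, one character out. The helper name `from_codepoint` is made up for illustration and is not part of any input method.

```python
import unicodedata


def from_codepoint(hex_digits):
    """Turn hex digits (as typed after Ctrl+Shift+U) into a character."""
    return chr(int(hex_digits, 16))


# U+16A0 is the first letter of the Runic block.
fehu = from_codepoint("16A0")

# unicodedata ships with Python and knows each character's formal name.
print(f"U+{ord(fehu):04X}", fehu, unicodedata.name(fehu))
```

The same call works for any assigned code point, so it can double as a lookup when you know the number but not the name.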
| gpderetta wrote:
| I don't think that not having a physical key on the
| keyboard has ever stopped anybody from inputting Unicode
| symbols.
| [deleted]
| runarberg wrote:
| This is interesting. I'm comparing this to how musical
| notation is encoded in unicode. I mean, there is a block
| dedicated to the symbols, so the symbols are encoded, but
| you can't document music using only unicode. But musical
| documents are being composed and written all the time. To
| write music you need additional software which arranges
| these symbols in a certain way so that they express the
| author's intention.
|
| I guess math has a similar representation in unicode as
| well.
|
| All that said, I think people use runes to express magic
| and spells (even to this day). I don't think all the
| magical runes are expressed in unicode (and perhaps they
| shouldn't). If you want to use a rune in that way, you
| might have to draw it out in SVG or something and then
| email it to your friends.
| thaumasiotes wrote:
| > I guess math has a similar representation in unicode as
| well.
|
| It's an ongoing project. As you seem to have guessed,
| Unicode math symbols are just about as useless for
| representing math as Unicode music symbols are for
| representing music. Producing mathematical documents is
| done using dedicated software, generally LaTeX.
|
| (And what you get is a PDF, because, as I noted in
| another comment, PDFs already support every notation
| there is, was, or ever will be.)
| jake_morrison wrote:
| In the 90s I worked on a project to digitize land registration in
| Taiwan.
|
| In order to record deeds and property transfers, we needed to
| enter people's names and official registered addresses into the
| computer system. The problem was that some people used non-
| traditional writing variants for their names, and some of their
| birthplaces were tiny places in China with weird names.
|
| Someone might write their name with a two-dot water radical
| instead of three-dot radical. We would print it out in the normal
| font, and the people would lose their minds, saying that it was
| wrong. Chinese people can be superstitious about the number of
| strokes in their name, so adding a stroke might make it unlucky,
| so they would not buy the property.
|
| The customer went to the agency responsible for managing the big
| character set, https://en.wikipedia.org/wiki/CNS_11643 Despite
| having more characters than anything else on earth, it didn't
| have those variants. The agency said they would not encode them,
| because they were not real characters, just printing differences.
|
| The solution was for the staff in the office to use a "font
| maker" program to create a custom font with these characters.
| Then they could print out the deeds using a Chinese variant of
| Adobe Acrobat, and everyone was happy.
| agumonkey wrote:
| I forget which country (Iran, Turkey...), but one diacritic in a
| phone text got a girl killed because it altered the meaning of
| one word, turning the sentence from loving to threatening or
| insulting.
| not2b wrote:
| In Spanish, dropping one diacritic (~) changes "How old are
| you?" to "How many anuses do you have?".
| eesmith wrote:
| In English, dropping one diacritic changes "Where's the
| rosé?" to "Where's the rose?", and changes "My maté is
| cold" to "My mate is cold."
| ajuc wrote:
| In Polish "zrób mi łaskę" means "do me a favor" and "zrób
| mi laskę" means "give me a blowjob".
| Dylan16807 wrote:
| > rosé
|
| Maybe, though it's still halfway the same word.
|
| > maté
|
| Not a change, both spellings are valid.
| eesmith wrote:
| Maybe even three-quarters the same word. (4/5ths if you
| count code points in NFD!)
|
| Male parties are a lot of fun.
|
| Those are some pretty lame runners.
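The "4/5ths if you count code points in NFD" remark can be checked with Python's standard `unicodedata` module. This is only a sketch: the word is spelled with an explicit escape so the example does not depend on how the page renders accents, and the helper name `nfd_codepoints` is invented here.

```python
import unicodedata


def nfd_codepoints(text):
    """Return the code points of text after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return [f"U+{ord(ch):04X}" for ch in decomposed]


with_accent = "ros\u00e9"   # precomposed e-acute, 4 code points as written
without_accent = "rose"

accented = nfd_codepoints(with_accent)
plain = nfd_codepoints(without_accent)

# NFD splits the accented e into "e" plus U+0301 COMBINING ACUTE ACCENT,
# so the accented word has 5 code points and shares its first 4 with "rose".
shared = sum(a == b for a, b in zip(accented, plain))
print(accented, plain, f"{shared}/{len(accented)} shared")
```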
| schoen wrote:
| An oddity is that "maté" (meant to indicate that the e is
| _pronounced_) is an incorrect spelling in both Spanish
| and Portuguese, where it would wrongly suggest that the e
| is _stressed_.
|
| https://en.wikipedia.org/wiki/Yerba_mate#Name_and_pronunc
| iat...
| kps wrote:
| https://gizmodo.com/a-cellphones-missing-dot-kills-two-
| peopl...
| jwilk wrote:
| Discussed on HN in 2008:
|
| https://news.ycombinator.com/item?id=226853 (18 comments)
| asveikau wrote:
| That sounds terrible, however, it's important to remember
| that diacritics don't get people killed, the person who
| decides to kill ultimately needs to stop themselves.
| _jal wrote:
| No, "diacritics don't kill people, people kill people" is
| not an important life lesson. It is a reductive just-so
| generalization of basic common sense that obscures more
| than it enlightens.
|
| The important thing for engineers to note is a technical
| shortcoming caused a tragic misunderstanding. Focusing
| instead on the well-known fact that some people have poor
| impulse control, knowing full well that is a non-
| controllable input, instead makes an excuse for poor
| engineering and implicitly expresses powerlessness to do
| anything about the problem.
| asveikau wrote:
| I am all for good localization efforts. I've been
| something of a champion for that whenever I've been
| around user facing code and people working on it. I also
| am a bit of a language nerd and not monolingual.
|
| But yes, misunderstanding or not, we should not kill
| people.
|
| The story in the sibling comment is about a man attacking
| his daughter's ex because the ex came to apologize about
| a confusion over the Turkish dotless I. That's still a
| violent attack, and the father could have kept his
| emotions in check. I don't condone calling the daughter
| names, even accidentally, but it is not a crime and the
| right response is not attempted murder.
| _jal wrote:
| > but it is not a crime and the right response
|
| I don't know who you're arguing with, but it isn't me.
| Nobody is saying it was.
|
| I'm saying it is an irrelevant non sequitur.
|
| Imagine that Dad instead misunderstood an instruction
| related to a financial transaction and lost a ton of
| money. Would you now be discounting the technical problem
| that caused the misunderstanding and berating Dad for
| being foolish?
| asveikau wrote:
| I'm not discounting the technical problem.
|
| If I were on a code review and I spotted an issue
| affecting Turkish dotless I, I assure you I would rant
| about it more than is reasonable.
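For readers who have not run into the dotless I, here is a small Python sketch of the kind of thing such a code review might flag. Python's `str.upper()` and `str.lower()` are locale-independent, so Turkish text does not round-trip through them. The `turkish_lower` helper is a hypothetical workaround written for this example, not a library function.

```python
DOTLESS_I = "\u0131"       # LATIN SMALL LETTER DOTLESS I
DOTTED_CAPITAL = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Uppercasing the dotless i gives plain "I"; lowercasing that gives dotted "i".
round_trip = DOTLESS_I.upper().lower()

# Lowercasing the dotted capital gives "i" plus U+0307 COMBINING DOT ABOVE.
lowered_capital = DOTTED_CAPITAL.lower()


def turkish_lower(text):
    """Hypothetical helper: lowercase using the Turkish I/i pairing."""
    return text.replace(DOTTED_CAPITAL, "i").replace("I", DOTLESS_I).lower()


print(repr(round_trip), repr(lowered_capital))
print(turkish_lower("ISPARTA"), turkish_lower("\u0130STANBUL"))
```

A production system would more likely lean on a locale-aware library such as ICU than on string replacement. The sketch only shows why the default mappings are not enough.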
| agumonkey wrote:
| Even to a lesser extent, it's easy to forget how a small
| mistake can have a butterfly effect in other cultures.
| pixl97 wrote:
| Ya, I don't see that happening in authoritarian countries.
|
| As a contrived example if you had a symbol for 'happy' you
| want to be very cautious that it doesn't get converted to
| 'gay' because in your language gay and happy mean the same
| thing, in some repressive regime it means the leadership
| gets to execute you with the approval of the law.
| zarzavat wrote:
| A recent example is that "Let's go [gun emoji] him" could
| be interpreted as either harmless fun, or conspiracy to
| murder, depending on if the recipient's phone displays
| that as a water pistol or a real gun.
|
| Edit: weirdly HN refuses to display that emoji.
| tomcatfish wrote:
| HN does not like displaying emojis, though a few slip
| through I believe.
| lolc wrote:
| Hacker News doesn't allow emojis because only serious fun
| or something.
| EvanAnderson wrote:
| Yikes. If somebody hasn't written a "falsehoods programmers
| believe about human writing systems" document this would make
| for a good start.
| cestith wrote:
| It deserves its own entry in "falsehoods programmers believe
| about names" lists too.
| lmkg wrote:
| It's already there, #11 "People's names are all mapped in
| Unicode code points."
| Scarblac wrote:
| The falsehood here is thinking that if you can encode the
| name into the right code points, and you have a font that
| can print them, the result will be acceptable to the
| people whose name it is.
|
| They had that, but needed a font that used a different
| number of strokes for the characters because of the
| superstition.
| lmkg wrote:
| One could argue they're facets of the same issue.
| Although in the spirit of the original list, they would
| probably get split into separate line items.
|
| On further review, I think this is also similar to #12 &
| #13 on the list: "names are case-sensitive," and "names
| are not case-sensitive." To generalize that to include
| non-Western alphabets: display variations of the same
| character are significant, and display variations of the
| same character are not significant.
|
| This of course goes back to the evergreen philosophical
| question "what even is a character, anyways?" Since we've
| found a case where two characters which are the same
| character are not the same character. Are they distinct
| characters or typographical variants? Yesn't: one would
| want them unified for searching, but distinct for
| printing.
|
| But regardless of what they are, these
| characters/variants only show up in names. Names tend to
| retain archaic (or extinct) language variations longer
| than speech, which is the reason for rule #11, which is
| at least part of the problem.
| cestith wrote:
| I fully agree with this second, expanded take of yours.
| Some names are both represented and not represented by
| Unicode simultaneously. This suggests there should be
| variant versions of characters, but that becomes an even
| thornier combinatorics (and sorting/collation, and
| lookalike characters) issue than what already exists.
| corrral wrote:
| More generally, the notion that human culture, systems,
| and behaviors can be mapped, losslessly and without
| causing harm, to something a computer understands.
|
| I think these language examples are so good, as examples,
| because all aspects of them are clear and easy to follow.
| I think computerization of business and society and the
| systems that make them work, causes immense amounts of
| this kind of friction and pain all the time, in ways that
| are much harder to understand, explain, or catalog (which
| is precisely why it's such a big problem, though as far
| as I know it's received little attention)
|
| [EDIT] To distill it, I think that trying to make a
| computer a "source of truth" rather than a tool, tends to
| do substantial violence to the "truth".
| derefr wrote:
| I feel like there has to be some level of triviality at
| which the harm is no longer being caused by the attempt
| to systematize something, but rather by a small group of
| people refusing to be systematized _not_ out of cultural
| heritage et al, but rather purely out of the (inane)
| human desire to feel special by intentionally doing
| something in a way nobody else does it.
|
| Language and writing exist to _communicate_, using
| patterns of signals that have _shared meaning and
| recognition_; things like alphabets and vocabularies are
| effectively (loose, overlapping, diasporic) consensus-
| state autoencoding models. They only _work_ to compress
| meaning, when there are rules for said compression that
| generalize, and which don't have as many exceptions with
| their own separate symbols as there are words/names
| needing to be encoded.
|
| Most countries don't allow you to just make up your own
| novel graphemes when writing a name on a birth
| certificate. And nobody is asking for that, either.
| (Presumably because living in a world where that was
| allowed would be horrible: you'd no longer be able to
| error-correct when reading, because any given mysterious
| squiggle in the middle of a word or name, might be
| exactly what some unknown-to-you-or-anyone-other-than-
| the-author character is _supposed_ to look like. Is that
| "o with a curlicue" written here just a semi-cursive
| attempt at writing an "o" -- or is it an "o" with a novel
| accent marker, one that appears nowhere else, but which
| must be preserved nevertheless to properly record this
| person's name?)
|
| Instead, _legal names_ are (in every country I'm aware
| of) required to be spelled using the character-set of the
| country you're entering a legal relationship with by
| being born / immigrating / etc. America? Legal names
| using the Latin alphabet. Japan? Legal names using
| characters from this set:
| https://en.wikipedia.org/wiki/Jinmeiy%C5%8D_kanji
|
| Note, though, that legal names are _representations_ of
| names. They aren't _encodings_ of names. Your legal name
| is a _distinct thing_ from your name, just as your
| credit-card number is a distinct thing from your name. It's
| an applied-for + registered + assigned systematic
| identifier for you -- a bit like a domain name, or a
| vanity license-plate number. Which means that your legal
| name is not a lossy _or_ lossless encoding of your name.
| It's, per se, a nickname. It doesn't have to have
| anything to do with your name. (And it often doesn't;
| immigrants often choose legal names entirely distinct
| from what they / their home country thinks of as their
| name.)
| cardiffspaceman wrote:
| "If the character isn't in Unicode it's in CNS-11643"
| apparently is also false.
| lostlogin wrote:
| There was a great thread on HN about names and falsehoods
| programmers believe.
|
| You've added to it, as custom fonts weren't covered.
|
| I think it's this thread:
| https://news.ycombinator.com/item?id=18567548
|
| Edit: and it's there, #11.
| duxup wrote:
| That sounds equally fascinating and a little maddening.
| kurthr wrote:
| Yep, and with pictographic writing systems it's a lot more
| common than Latin... but even here we have X AE A-12 Musk,
| and Prince's name symbol.
|
| Heck, my initials are totally non-standard.
| 77pt77 wrote:
| > Chinese people can be superstitious about the number of
| strokes in their name, so adding a stroke might make it unlucky
|
| Why am I not surprised in the slightest?
| jetrink wrote:
| That's a great story. The inability to represent a name with
| standard characters reminds me of when Prince changed his name
| to a symbol and they had to send all of the media floppy disks
| containing a custom font with a single character.
|
| https://nymag.com/intelligencer/2016/04/princes-legendary-fl...
| mdp2021 wrote:
| Are you acquainted with Freur (which means, "Underworld 0.5"
| - Rick Smith and Karl Hyde in the '80s)?
|
| "Freur", or, "The squiggle we chose as the name for a band
| but that CBS Records insisted should at least have a
| pronunciation".
|
| I see it is not in Unicode (well, you can never really know
| if you do not try), nor can I find pieces to reconstruct it.
|
| The "freur" in foreground: https://d4q8jbdc3dbnf.cloudfront.n
| et/user/6885/edb290c6183ac...
| [deleted]
| Findecanor wrote:
| I've been told that this is also an issue in Japan, except the
| reason might more often be a matter of pride than superstition.
| It is supposedly one reason (of a few) why fax machines are
| still in common use in Japan.
|
| Later versions of Unicode support "Variation Forms" of Han
| characters as a way to be able to encode different variations.
| They are encoded as a Variation Selector code (U+E0100 and up)
| after the Han character. The forms are listed separate from
| Unicode versions in the "Ideographic Variation Database"
| <https://www.unicode.org/ivd/>. So far, it contains characters
| from a couple of Japanese dictionaries, a Korean and one from
| Macao/Hong Kong.
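A rough Python sketch of how such a variation sequence looks to a program, and of the "unified for searching, distinct for printing" idea raised earlier in the thread. It assumes the supplementary variation selectors occupy U+E0100 to U+E01EF, and it uses U+845B followed by U+E0100 purely as an illustrative pair. The helper `strip_variation_selectors` is invented for this example.

```python
BASE = "\u845b"                   # a Han character with well-known glyph variants
VARIANT = BASE + "\U000E0100"     # same character plus VARIATION SELECTOR-17


def strip_variation_selectors(text):
    """Drop variation selectors so glyph variants compare equal for search."""
    def is_selector(ch):
        cp = ord(ch)
        return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF
    return "".join(ch for ch in text if not is_selector(ch))


# Stored form: two code points, so the requested glyph survives for printing.
print(len(VARIANT), [f"U+{ord(ch):04X}" for ch in VARIANT])

# Search form: the selector is removed and both spellings match.
print(strip_variation_selectors(VARIANT) == BASE)
```

Whether the variant glyph is actually drawn depends on the font supporting that sequence; without support, renderers normally fall back to the base glyph.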
| hinkley wrote:
| I knew someone who added an accent character to their name
| because everyone pronounced it wrong. She met someone
| bilingual who shot back that if she wants it pronounced that
| way she needs to add an aigue. So she did, and everyone still
| pronounced her name wrong.
|
| In fact going any place with her very nearly became an "are
| we living in a simulation" crisis for me because the number
| of times she would say her name and the other person would
| say it back incorrectly was... upsetting. The degree to which
| some people butchered her name, especially combining half of
| her first and last name into a completely different name,
| made us joke about buggy NPCs.
|
| I could imagine how in some cultures writing it incorrectly
| hurts as much as pronouncing it incorrectly. Or possibly
| moreso in places where multiple plausible pronunciations have
| to be negotiated via an introduction, which is the case in
| China, is it not?
| teknopaul wrote:
| In Poland people have a neat life hack for that problem.
| They have other names for non-polish folk to use. Eg pawek,
| tomek, bartek rather than have people mangle their real
| name.
|
| My name got changed when I moved to Spain and it never
| bothered me, while I have met people who took great offence
| at the use of standard nicks that they had not explicitly
| sanctioned in advance. I know a guy who makes a new name up
| for everyone he meets. Like it or lump it. If you are too
| sensitive about your name, you risk people not using it at
| all.
| lostlogin wrote:
| It does go both ways though. Take the time to learn how
| to say and spell someone's name and it usually goes down
| well.
|
| I say this while fully aware of my own butchering.
| isoprophlex wrote:
| People are just incredibly dense sometimes. My wife has a
| name that's one letter different from a more common name,
| but clearly different in pronunciation.
|
| Nevertheless there have been countless times where people
| automatically substitute the more common name, or even
| worse in text messages manage to misread it and reply
| incorrectly.
|
| It sometimes upsets her. The NPC analogy is very apt; I
| guess many people are just very preoccupied?!
| derefr wrote:
| > or even worse in text messages manage to misread it and
| reply incorrectly
|
| Overzealous autocorrect can happen to names, too. There's
| a whole thing about Asian names not being in computer
| spellcheck dictionaries:
| https://www.abbynews.com/news/youre-not-a-mistake-b-c-
| group-...
| InitialLastName wrote:
| Not just Asian names; my SO's (English-language) nickname
| frequently autocorrects to its common homophone. I can
| always tell who proofreads their texts by how it ends up
| spelled.
| lostlogin wrote:
| Try typing 'Sian' on iOS, (well, maybe Sian) and it
| autocorrects to Asian.
|
| Unhelpful, though luckily found funny when I did it.
| irusensei wrote:
| If you type Sei on google translate and set it to detect
| language it will switch to Chinese and translate it to
| "lingering". If you switch to Japanese no translation will
| happen.
|
| Also if you google search for Sei one of the results will be
| this video [!!!!seizure warning!!!!]
| https://www.youtube.com/watch?v=EsOU0V2kpUI that seems to borrow
| on the theme of a computer ghost character.
| TazeTSchnitzel wrote:
| Google Translate will hallucinate translations for complete
| nonsense, so this probably doesn't mean anything.
| einpoklum wrote:
| I'm more worried about the inflation of emoji than a couple dozen
| unused ghost JIS characters.
| npteljes wrote:
| What worries you about it?
| jimmygrapes wrote:
| If Slack/Discord/etc. custom emojis get used enough, do they
| get incorporated into Unicode? I've seen something like 40
| variants of laughing emoji, and closer to 400 variants of Pepe
| the Frog, and I'm not even in any "alt right" or 4chan-adjacent
| chat rooms/guilds where I imagine there are even more. Not to
| mention the countless custom anime face ones.
| wongarsu wrote:
| Godwin's second law: any sufficiently long discussion about
| Unicode includes a discussion about emoji :)
| raphlinus wrote:
| Yeah, it does seem to come up a lot more often than
| discussions about U+5350.
| edent wrote:
| Why? Unicode isn't running out of space any time soon.
| kevin_thibedeau wrote:
| The encoding has gotten out of hand with compound emoji.
| Splitting them on glyph boundaries is non-trivial.
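To make the compound-emoji point concrete, here is a Python sketch using a three-person family emoji built from a zero-width joiner (ZWJ) sequence. The grouping function is a deliberately simplified assumption: it only glues code points around U+200D and ignores skin-tone modifiers, variation selectors, flags, and combining marks, all of which a real grapheme segmenter has to handle.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# MAN + ZWJ + WOMAN + ZWJ + GIRL is displayed as a single family glyph.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"


def group_zwj_sequences(text):
    """Simplified grouping: keep code points joined by ZWJ in one cluster."""
    clusters = []
    join_next = False
    for ch in text:
        if clusters and (ch == ZWJ or join_next):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = ch == ZWJ
    return clusters


print(len(family))                   # code points in the one visible glyph
print(len(family.encode("utf-8")))   # bytes in UTF-8
print(len(list(family)))             # naive per-code-point split
print(len(group_zwj_sequences(family + "!")))
```

Splitting per code point tears the family into three people and two invisible joiners, which is the failure mode described above.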
| Mountain_Skies wrote:
| 640K should be enough for anybody.
| olivierestsage wrote:
| Reminds me of the case of U+237C ⍼ RIGHT ANGLE WITH DOWNWARDS
| ZIGZAG ARROW [0], also discussed on HN [1].
|
| [0] https://ionathan.ch/2022/04/09/angzarr.html
|
| [1] https://news.ycombinator.com/item?id=31012865
| jmillikin wrote:
| Previously:
|
| https://news.ycombinator.com/item?id=24951130 (2020)
|
| https://news.ycombinator.com/item?id=17637375 (2018)
| helsinkiandrew wrote:
| It looks as if these (at least Shi ) are being used in various
| places on and offline. It's eventually possible that they will
| become associated with one or more meanings and perhaps a
| pronunciation.
| hnfong wrote:
| In East Asian cultures that use Han characters, people used to
| make up new characters when the need arises.
|
| These days, we scroll through the Unicode standard and find
| rarely used characters that were accidentally added and imbue
| them with new meaning. (yes, this is seriously a thing)
| dane-pgp wrote:
| When the article said:
|
| "In the end only one character had neither a clear source nor
| any historical precedent: Sei ."
|
| my instinct was that this character could be retconned to
| mean "character whose meaning has been lost", thus creating a
| self-referential paradox.
|
| Presumably someone would have to then separately come up with
| a pronunciation for it. Perhaps pronouncing it "duangu" would
| solve another problem:
|
| https://coconuts.co/hongkong/lifestyle/duang-jackie-chan-
| ins...
| 1-more wrote:
| Oooh that sounds fascinating. Any examples of that that
| spring to mind? Is the pronunciation (or a reasonable
| representation thereof) already recorded in the Unicode
| standard or is that also a bit of free-jazz?
| adastra22 wrote:
| The character usually has a radical component which hints
| at the pronunciation. They are ordered by radical in the
| standard. So you would go spelunking for a little-used
| character in the part of the standard which has characters
| close in meaning or pronunciation to what you are looking
| for
|
| Or you just make something up. If you're coining a new
| character, you probably don't care about whether the
| pronunciation is already known.
| ssnistfajen wrote:
| An old one but possibly the earliest and most prominent of
| obsolete Chinese characters being imbued with new
| (Internet-based) meanings:
| https://en.wikipedia.org/wiki/Jiong
|
| There's also Shi https://en.wiktionary.org/wiki/%E5%A5%AD
| which is occasionally used as a censorship workaround to
| mock one of Xi Jinping's gaffes in an early 2000's TV
| interview where he bluffed about being able to carry two
| hundred "catty" (~100kg)'s worth of wheat on rural mountain
| roads. The character is composed of two Bai ("hundred")
| and one Ren ("human/person/people") which is a pictorial
| euphemism for that line he said on TV. I can't find any
| sources about this one that's in English so please bear
| with my half-assed explanation.
| 1-more wrote:
| Both cases are fascinating, thank you!! Side note: of
| course Shi is pronounced shi. I only know a bare minimum
| about Chinese but when in doubt: it's pronounced "shi"
| (with some license regarding tone).
| https://en.wikipedia.org/wiki/Lion-
| Eating_Poet_in_the_Stone_...
| adastra22 wrote:
| One of the reasons I wish a compositional language had been
| standardized for Unihan instead of the code-point-for-every-
| character approach.
| jxy wrote:
| Wiktionary claims this character is in Guangyun (1007-1008, see
| https://en.wikipedia.org/wiki/Guangyun), and gives the link to
| Kangxi dictionary (1716),
| https://www.kangxizidian.com/kangxi/0256.gif which means that
| this character likely predates the Japanese "Overview of
| National Administrative Districts".
| sbf501 wrote:
| Can we talk about the artwork used?
|
| https://dl.ndl.go.jp/info:ndljp/pid/1312837?itemId=info%3And...
|
| https://philamuseum.org/collection/object/84871
|
| Googling for Tsukioka Yoshitoshi brings up so much SEO that it is
| hard to find information in English. If anyone knows anything
| about it, I'd be appreciative for a pointer about its
| content/subject!
| polm23 wrote:
| Author here. Nobody has ever asked about the art before. It
| depicts Maruyama Oukyo, a famous painter of ghosts (and other
| things), where one of his pieces comes to life and frightens
| him.
|
| https://en.wikipedia.org/wiki/Maruyama_%C5%8Ckyo
| lapetitejort wrote:
| I can't be the only person who thought the character would be ,
| right? (based on the first line of the Communist Manifesto:
| https://en.wikisource.org/wiki/Manifesto_of_the_Communist_Pa...)
|
| edit: ah the character (hammer and sickle) does not show up
| aatharuv wrote:
| For obviously fake characters, a Unicode proposal for the
| Egyptian Hieroglyphics Extended-A block managed to include a
| hieroglyph for an ancient Egyptian holding a laptop. (Note that
| this is a proposal, and has not yet made it into the standard.)
| Presumably it was a copyright trap.
|
| https://www.unicode.org/mail-arch/unicode-ml/y2020-m02/0018....
| ChrisArchitect wrote:
| (2018)
| hnfong wrote:
| This might be an interesting read to those unfamiliar with CJK, but
| character bloat(?) isn't remotely a recent thing. It's actually
| at least a couple hundred years old.
|
| The Kangxi dictionary (1716), an authoritative dictionary of
| Chinese characters, contains definitions for 47035 characters,
| even though only a couple thousand are in common use. Quoting
| from Wikipedia: "The dictionary was the largest of the
| traditional dictionaries, containing 47,035 characters. Some 40%
| of them are graphic variants, however, while others are dead,
| archaic, or found only once. Fewer than a quarter of the
| characters it contains are now in common use."
|
| All of these archaic (or even bogus in some cases) characters
| found in the dictionary are now part of the Unicode standard, of
| course :) The unihan database even has a field that shows the
| page number where the character appears in the Kangxi dictionary.
| If you're wondering why 65536 characters isn't enough for
| everyone, the junk in Kangxi dictionary is a significant
| contribution.
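
To make the 65536 limit concrete, here is a small sketch of what happens once a character lands outside the first 16-bit plane. The helper `utf16_units` and the constant `BMP_LIMIT` are names chosen for this example; the two sample code points are picked only to illustrate the boundary, not taken from the comment above.

```python
BMP_LIMIT = 0x10000  # 65536: code points below this fit in one 16-bit unit


def utf16_units(ch):
    """Return how many 16-bit code units UTF-16 needs for one character."""
    return len(ch.encode("utf-16-be")) // 2


common = "\u6f22"     # U+6F22, inside the Basic Multilingual Plane
ext_b = "\U00020000"  # U+20000, first code point of CJK Extension B

for ch in (common, ext_b):
    print(f"U+{ord(ch):04X}", ord(ch) < BMP_LIMIT, utf16_units(ch))
# U+6F22 True 1
# U+20000 False 2
```

Characters past the limit need a surrogate pair in UTF-16 and four bytes in UTF-8, which is why software that assumes one 16-bit unit per character tends to mishandle them.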
| mytailorisrich wrote:
| I think 'character bloat' is simply inherent to the writing
| system when characters are written by hand (now that perhaps
| most written communication is digital people can't use
| characters that are not already supported)
|
| Anyone can invent characters whenever they want, and it's only
| a question of them sticking or not.
|
| I think this is also one of the reasons for the Chinese
| tendency to push for unification and uniformity.
| lazide wrote:
| When it's character based instead of alphabet based, I think
| it's the equivalent of coming up with a new word in English,
| which is basically what you're describing.
|
| Sometimes it's mashing two previously unrelated 'words'
| together (aka the tons of compound characters in Chinese),
| other times it's coming up with something completely new.
|
| Same rules apply though, if it doesn't add value worth the
| trouble (or get mandated by the powers that be), it'll
| eventually just die out or be a curiosity.
|
| Also, to keep it tech related:
|
| RISC = English, CISC/VLIW = Chinese?
| tokinonagare wrote:
| > Sometimes it's mashing two previously unrelated 'words'
| together (aka the tons of compound characters in Chinese),
| other times it's coming up with something completely new.
|
| That's not how it works. Most Chinese characters stem from
| a character C having a pronunciation A referring to a
| meaning M being used to note another word of meaning M'
| with same pronunciation A (sometimes slightly different
| A'). This of course doesn't scale really well, hence the
| existence of determiners in logographic scripts, which are
| words used without their pronunciations placed before or
| after another to give a semantic clue. The innovation of
| Chinese (which I think is why it's still an efficient
| script today) was to incorporate the determiner in the
| character itself to give birth to a character C' where a
| part refer to the pronunciation and another acts as the
| determiner, instead of padding the main text with (a lot
| of) determiners.
| nneonneo wrote:
| IIUC Old Chinese was a much more "isolating" language, in
| that words were typically single characters - meaning that
| to make new words, you typically needed to make new
| characters. As it evolved through the ages, "compound"
| words composed of multiple characters became more common.
| These days, new words are almost always combinations of
| multiple characters (often 2, occasionally 3-4).
| lazide wrote:
| Any idea if it was due to things like the Confucian
| Official's exam system (and corresponding increase in
| prioritization of education)?
|
| More complex characters require more education to
| understand is my guess. Some of the traditional ones
| are..... obscure, and crazy complex.
| R0b0t1 wrote:
| I'm not entirely sure what you mean to ask nor am I a
| Chinese speaker, but I have myself suspected that the
| massive variety of characters was a side-effect of having
| a middle class that was differentiated based on their
| ability to read. You see various in-group signalling
| systems similar to this in lots of areas.
|
| A good historical example is all the strangely specific
| words for groups of animals. A history I read of this
| indicated these terms were first found in books sold to
| nobility, and they were just made up. But you weren't hip
| if you weren't reading that literature.
| duskwuff wrote:
| > These days, new words are almost always combinations of
| multiple characters (often 2, occasionally 3-4).
|
| Yep! For example, the most common Chinese term for
| "Internet" is Hu Lian Wang . This is composed of three
| characters:
|
| Hu : "mutual"
|
| Lian : "join", "coupled", "allied"
|
| Wang : "net" -- carrying both the meaning of a woven net
| and a computer network
| ars wrote:
| Does Unicode really need to store Chinese words? Is it
| impossible to deconstruct the glyphs into strokes, each stroke
| effectively being a character?
| j16sdiz wrote:
| Many have tried, but nobody has succeeded. The most famous one
| is `Chu, B.F.: Han Zi Ji Yin Zhu Bang Fu Han Zi Ji Yin Gong
| Cheng (Genetic engineering of Chinese characters) (2003),
| http://cbflabs.com/down/show.php?id=26 `
| peter303 wrote:
| In the early days of computers some character systems were
| stroke-based because that used less memory than a 32x32 bit
| map. A kilobit of ROM (one character) could cost $10.
|
| Currently stroke-based systems are used for calligraphic
| effect. You could generate new font styles, e.g. bold, by
| controlling the shape of strokes.
|
| Stroke systems are important for teaching character writing
| because the drawing order is rigorously prescribed. Once you
| learn the first couple hundred, you can pretty much guess
| future characters. Wrong order characters often look bad and
| suggest a non-Chinese speaker mis-copied them. (e.g. some
| tattoos)
| nneonneo wrote:
| Unicode has support for this, in the Ideographic Description
| Characters block (https://en.m.wikipedia.org/wiki/Ideographic
| _Description_Char...). However, it's purely descriptive, and
| not designed for rendering.
|
| There are somewhat more sophisticated systems which define
| both the rendering and stroke decomposition of characters
| (e.g. CDL: http://guide.wenlininstitute.org/wenlin4.3/Charact
| er_Descrip...). The general workaround for characters that
| aren't on Unicode would be to use one of these stroke
| description systems to create the character, then render it
| to an image and insert it.
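
A short sketch of the "purely descriptive" point about Ideographic Description Characters, using the standard library's `unicodedata`. The variable names and the example character (two "tree" components side by side) are chosen for illustration and are not from the comment.

```python
import unicodedata

LEFT_TO_RIGHT = "\u2ff0"  # an Ideographic Description Character
tree = "\u6728"           # 木
forest = "\u6797"         # 林, visually two 木 side by side

# An ideographic description sequence: operator followed by its two parts.
ids = LEFT_TO_RIGHT + tree + tree

print(unicodedata.name(LEFT_TO_RIGHT))
# IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT

# The sequence stays three separate code points; nothing composes it.
print(len(ids))                                 # 3
print(unicodedata.normalize("NFC", ids) == ids)  # True
print(ids == forest)                             # False
```

The description tells a reader how the character is laid out, but no normalization form turns it into the encoded character, and renderers are not expected to draw it as one glyph.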
| cyphar wrote:
| Even with the current system, very little software is even
| aware that the same codepoint should be rendered differently
| in different languages (Fan su needs to be rendered
| differently in every CJK locale) which often results in
| websites and programs using Chinese fonts for Japanese text
| (even if you've configured your language as Japanese). Having
| stroke breakdowns would not make this situation better
| because there are multiple ways to render the same stroke
| description and there aren't really systematic rules for how
| to correctly represent the Japanese (or Taiwanese or Korean)
| version of a character.
|
| I dread to think what an enormous mess would result if every
| character was represented as a build-it-yourself instruction
| manual rather than allowing font authors to correctly
| represent the characters.
|
| Also nobody in China, Japan, nor Korea would use an encoding
| system so incredibly inefficient that more strokes results in
| more bytes being necessary to store the character (Japan
| already compromised with having 3-byte UTF-8 characters when
| JIS only required 2). This would've resulted in the failure
| of Unicode's mission to be the One True Encoding Format.
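
The byte-count compromise mentioned above can be checked directly with Python's built-in codecs. This is only a sketch; the two-character sample string is an assumption chosen for the example.

```python
text = "\u6f22\u5b57"  # 漢字 ("kanji"), two characters

utf8 = text.encode("utf-8")
sjis = text.encode("shift_jis")
eucjp = text.encode("euc_jp")

# Bytes per character under each encoding.
print(len(utf8) // len(text))   # 3
print(len(sjis) // len(text))   # 2
print(len(eucjp) // len(text))  # 2
```

For text that is mostly kanji and kana, that is roughly a 50% size increase over the legacy JIS-based encodings, before any compression.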
| yongjik wrote:
| The problem with that would be that every software must know
| the intricate rules about combining glyphs, and if they guess
| wrong, users get garbage characters.
|
| Considering that the majority of code is written by people
| who don't know Chinese characters, it would result in never-
| ending issues, pretty much everywhere.
|
| Korean actually has a two-way system in Unicode. Every
| conceivable character (= syllable) possible in modern Korean
| has its own codepoint, which allows most software to display
| them correctly: from their point of view, it's just another
| CJK character.
|
| On the other hand, there is a Unicode area containing Korean
| sub-blocks ("jamo") that were used historically. In theory,
| you can combine them and get some pretty funky archaic
| syllables. Almost no software renders them right.
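
The two-way Korean system can be sketched with `unicodedata` plus the arithmetic that maps jamo to a precomposed syllable. The constants follow the Hangul composition algorithm in the Unicode Standard; the function name `compose` and the sample syllable are choices made for this example, and the function covers only modern jamo, not the archaic ones discussed above.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose(lead, vowel, tail=None):
    """Combine modern jamo into one precomposed syllable code point."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(tail) - T_BASE if tail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


syllable = "\ud55c"  # 한
jamo = unicodedata.normalize("NFD", syllable)

print([f"U+{ord(c):04X}" for c in jamo])  # ['U+1112', 'U+1161', 'U+11AB']
print(compose(*jamo) == syllable)         # True
print(L_COUNT * V_COUNT * T_COUNT)        # 11172 precomposed syllables
```

Software that only understands the precomposed form sees one code point per syllable, while the decomposed form needs font and shaping support to be drawn as a single block.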
| mike_hock wrote:
| They can't even get much simpler things right. Qt
| incorrectly combines accents with the character to the
| right instead of the left and has been refusing to fix this
| bug for years.
| [deleted]
| ComodoHacker wrote:
| >Fewer than a quarter of the characters it contains are now in
| common use
|
| 12K characters in common use is equally impressive to me as a
| non-Asian.
| adastra22 wrote:
| More like 12k characters currently in use at all. Common use
| characters are a much smaller set than that. (3k or so?)
| ssnistfajen wrote:
| It's actually way fewer than that IRL. Japan's official list
| of commonly used Kanji only has 2136 characters. Taiwan's
| list has 4808, and the PRC's list has 3500 "frequent"
| characters with another 3000 supplementary "common" ones.
| Digitization has made it even easier to use these characters
| without recognizing the actual form or how to write them.
| cyphar wrote:
| The Chang Yong Han Zi (Japanese Common Use Kanji) list
| does not include many kanji that native speakers can read
| and newspapers don't always follow the rule that they only
| should use characters from the list. In addition, you need
| to include the Ren Ming Yong Han Zi (Personal Name Use
| Kanji) in the list because basically all of those
| characters are also used in fairly common words.
|
| Native speakers can probably recognise at least 3-4k kanji
| if not more but can probably only write around 2k from
| memory, depending on how well-read they are.
|
| Xu (lie) is the best example of an incredibly common word
| whose kanji form (which is used fairly often) is not in any
| official government list.
| DiogenesKynikos wrote:
| If you look at a frequency list of Chinese characters,[0]
| the top 4800 characters make up about 99.9% of modern
| texts.
|
| That means that if you know 4800 characters, and you read a
| text that is 1000 characters (equivalent to around 700
| words) long, there's likely one character you won't
| recognize.
|
| The funny thing is, if you recognize only the top six
| characters, you already know 10% of the characters in a
| typical text. The distribution is very top-heavy, but with
| a long tail that you do have to learn to become literate.
|
| 0. https://lingua.mtsu.edu/chinese-
| computing/statistics/char/li...
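
The coverage arithmetic above can be sketched as follows. Both function names are hypothetical, and the toy string stands in for a real corpus, which this example does not include.

```python
from collections import Counter


def coverage(text, top_n):
    """Fraction of the text covered by its top_n most frequent characters."""
    if not text:
        return 0.0
    counts = Counter(text)
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / len(text)


def expected_unknown(text_length, coverage_fraction):
    """Expected number of characters falling outside the known set."""
    return text_length * (1 - coverage_fraction)


# 99.9% coverage over a 1000-character text leaves about one unknown character.
print(round(expected_unknown(1000, 0.999), 6))  # 1.0

# Toy corpus: one character dominates, the rest form the "long tail".
sample = "aaaabbc"
print(coverage(sample, 1))  # 4 of 7 characters
print(coverage(sample, 3))  # 1.0
```

Run against a real frequency list, the same `coverage` function would show the steep start and long tail described in the comment.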
| jamal-kumar wrote:
| At first glance of the title, I thought this was going to be about
| something like the massive security problem of homoglyph attacks,
| which are baked into the standard and currently deployed in stuff
| like phishing, but this ghost character business is pretty
| interesting. Japanese literacy requires you to know 2-4 meanings
| per 2,136 kanji characters (something like 6000+ in total
| possible meanings between these characters) just to be able to
| pass a university level literacy test, it's a massive amount of
| complexity to get right. Even if you just need basic literacy
| it's still about a thousand less than that, and there's even more
| than these I mentioned for further literacy competence.
| Furthermore each of these characters looks funny if not unreadable
| if you write them down using the wrong order of strokes. I can
| see how mistakes might have been made even by native speakers of
| that language. The two kana syllabaries are there of course and
| mixed in with the kanji, but if everything was written in that
| you wouldn't be able to achieve the same amount of information
| density, which is probably part of the reason they never switched
| over (I understand before world war 2 or so, the more rounded
| hiragana was for women while the more sword stroke like katakana
| was for men).
| js8 wrote:
| Is it possible for the Unicode standard to deprecate characters? If
| yes, has it already happened?
| Fell wrote:
| I don't think so. It would make it impossible to talk about
| deprecated characters ever again, even in a historical context.
|
| Unicode contains even some ancient and long forgotten scripts
| so historians can keep proper records of them.
| jfk13 wrote:
| Yes, and yes.
|
| https://en.wikipedia.org/wiki/Unicode_character_property#Dep...
| bqmjjx0kac wrote:
| This is a tangent, but I felt like sharing. In college, I
| purchased a used copy of the communist manifesto. Famously, the
| first line reads, "A spectre is haunting Europe, ...".
|
| The previous owner had both highlighted and circled the word
| "spectre" and wrote "ghost?" in the margins. The rest of the text
| was similarly marked up.
|
| Every time I hear the word "spectre" I see "ghost?" in my mind's
| eye.
___________________________________________________________________
(page generated 2022-07-14 23:00 UTC)