[HN Gopher] The u/u Conundrum
___________________________________________________________________
The u/u Conundrum
Author : firstSpeaker
Score : 98 points
Date : 2024-03-24 16:50 UTC (6 hours ago)
(HTM) web link (unravelweb.dev)
(TXT) w3m dump (unravelweb.dev)
| layer8 wrote:
| The more general solution is specified here:
| https://unicode.org/reports/tr10/#Searching
| bawolff wrote:
| Collation and normal forms are totally different things with
| different purposes and goals.
|
| Edit: reread the article. My comment is silly. UCA is the
| correct solution to the author's problem.
| kazinator wrote:
| Oh that Mötley Unicode.
| lxgr wrote:
| I'm aware of the "metal umlaut" meme, but as a German native
| speaker, I can't not read these in my head in a way that sounds
| much less Metal than probably intended :)
| 082349872349872 wrote:
| > " _When we finally went to Germany, the crowds were
| chanting, 'Mutley Cruh! Mutley Cruh!' We couldn't figure out
| why the fuck they were doing that._ " --VNW
| ginko wrote:
| I will always pronounce the umlaut in Motörhead. Lemmy
| brought that on himself.
| ooterness wrote:
| The best metal umlauts are placed on a consonant (e.g.,
| Spın̈al Tap). This makes it completely clear when it's there
| for aesthetics and not pronunciation.
| yxhuvud wrote:
| Yes, those umlauts made it sound more like a fake French
| accent.
| Symbiote wrote:
| Years ago, an American metalhead was added to a group chat
| before she came to visit.
|
| She was called Daniela, but she'd written it "Däniela". When
| my Swedish friend met her in person, having seen her name in
| the group chat, he said something like "Hej, Dayne-ee-lair
| right? How was the flight?".
| 082349872349872 wrote:
| It can encode Spın̈al Tap, so it's all good.
| chuckadams wrote:
| Oh sweet summer child, i̸t̷ ̵c̶a̴n̸ ̴e̷n̵c̵o̸d̶e̵ ̸s̷o̶ ̶m̵u̸c̴h̵ ̶m̴o̸r̵e̶.̷.̸.̵
| 082349872349872 wrote:
| TIL about https://esolangs.org/wiki/Zalgo#Number_to_String
| _nalply wrote:
| Sometimes it makes sense to reduce to Unicode confusables.
|
| For example, the Greek capital Alpha (Α) looks like uppercase
| A. Or some characters look very similar, like the slash and the
| fraction slash. Yes, Unicode has separate scalar values for
| them.
|
| There are Open Source tools to handle confusables.
|
| This is in addition to the search specified by Unicode.
| wanderingstan wrote:
| I wrote such a library for Python here:
| https://github.com/wanderingstan/Confusables
|
| My use case was to thwart spammers in our company's channels,
| but I suppose it could be used to also normalize accent
| encoding issues.
|
| Basically converts a phrase into a regular expression matching
| confusables.
|
| E.g. "He110" would match "Hello"
| _nalply wrote:
| Interesting.
|
| What would you think about this approach: reduce each
| character to a standard form which is the same for all
| characters in the same confusable group? Then match all
| search input to this standard form.
|
| This means "He1l0" is converted to "Hello" before searching,
| for example.
| wanderingstan wrote:
| It's been a long time since I wrote this, but I think the
| issue with that approach is the possibility of one
| character being confusable with more than one letter. I.e.
| there may not be a single correct form to reduce to.
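The reduce-to-a-representative idea can be sketched in a few lines of Python. The mapping below is a toy stand-in (the real data lives in Unicode's confusables.txt), and `skeleton` is a hypothetical helper name, not an existing library API:

```python
# Toy confusables table: each entry maps a look-alike to a canonical form.
# A real implementation would derive this from Unicode's confusables.txt.
CONFUSABLES = {
    "\u0391": "A",  # Greek capital Alpha
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "1": "l",       # digit one vs. lowercase L
    "0": "o",       # digit zero vs. lowercase o
}

def skeleton(s: str) -> str:
    """Map every character to its confusable-group representative."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in s)

# Reduce both stored text and the query before comparing:
print(skeleton("He110"))          # Hello
print(skeleton("H\u0435ll\u043e"))  # Hello (Cyrillic е and о folded)
```

The one-to-many problem raised above is real: "1" is confusable with both "l" and "I", so a single representative per group necessarily picks one reading and loses the others.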
| wyldfire wrote:
| > For example the Greek letter Big Alpha looks like uppercase
| A.
|
| If they're truly drawn the same (are they?) then why have a
| distinct encoding?
| adzm wrote:
| > If they're truly drawn the same (are they?) then why have a
| distinct encoding?
|
| They may be drawn the same or similar in some typefaces but
| not all.
| schoen wrote:
| One argument would be that you can apply functions to change
| their case.
|
| For example in Python:
|
|       >>> "ΑΡΕΤΗ".lower()
|       'αρετη'
|       >>> "AWESOME".lower()
|       'awesome'
|
| The Greek Α has lowercase form α, whereas the Roman A has
| lowercase form a.
|
| Another argument would be that you want a distinct encoding
| in order to be able to sort properly. Suppose we used the
| same codepoint (U+0050) for everything that looked like P.
| Then Greek Ρόδος would sort _before_ Greek Δήλος because
| Roman P is numerically prior to Greek Δ in Unicode, even
| though Ρ comes later than Δ in the Greek alphabet.
| mmoskal wrote:
| Apparently this works very well, except for a single
| letter, Turkish I. Turkish has two versions of 'i', and
| Unicode folks decided to use the Latin 'i' for the lowercase
| dotted i, and Latin 'I' for the uppercase dotless I (and added
| two new code points, İ and ı, for the uppercase dotted I and
| lowercase dotless I).
|
| Now, 'I'.lower() depends on your locale.
|
| A cause for a number of security exploits and lots of pain
| in regular expression engines.
|
| edit: Well, apparently 'I'.lower() doesn't depend on locale
| (so it's incorrect for Turkic languages); in JS you have
| to do 'I'.toLocaleLowerCase('tr-TR'). Regexes don't support
| it either.
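A quick illustration of that locale-independence in Python, whose default `str.lower()` follows the untailored Unicode case mapping:

```python
# Default case mapping is locale-independent: U+0049 'I' always lowers to 'i'.
print("I".lower())               # 'i' -- wrong for Turkish, which expects dotless 'ı'

dotless_i = "\u0131"             # ı LATIN SMALL LETTER DOTLESS I
dotted_I = "\u0130"              # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

print("I".lower() == dotless_i)  # False
# İ does carry Unicode special casing: it lowers to 'i' + COMBINING DOT ABOVE.
print([hex(ord(c)) for c in dotted_I.lower()])  # ['0x69', '0x307']
```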
| ninkendo wrote:
| To me, it depends on what you think Unicode's priorities
| should be.
|
| Let's consider the opposite approach, that any letters that
| render the same should collapse to the same code point. What
| about the Cherokee letter "go" (Ꭺ) versus the Latin A? What if
| they're not precisely the same? Should lowercase l and
| capital I have the same encoding? What about the Roman
| numeral for 1 versus the letter I? Doesn't it depend on the
| font too? How exactly do you draw the line?
|
| If Unicode sets out to say "no two letters that render the
| same shall ever have different encodings", all it takes is
| one counterexample to break software. And I don't think we'd
| ever get everyone to agree on whether certain letters should
| be distinct or not. Look at Han unification (and how poorly
| it was received) for examples of this.
|
| To me it's much more sane to say that some written languages
| have visual overlap in their glyphs, and that's to be
| expected, and if you want to prevent two similar looking
| strings from being confused with one another, you're going to
| have to deploy an algorithm to de-dupe them. (Unicode even
| has an official list of this called "confusables", devoted to
| helping you solve this.)
| layer8 wrote:
| They can be drawn the same, but when combining fonts (one
| latin, one greek), they might not. Or, put differently, you
| don't want to require the latin and greek glyphs to be
| designed by the same font designer so that "A" is consistent
| with both.
|
| There are more reasons:
|
| - As a basic principle, Unicode uses separate encodings when
| the lower/upper case mappings differ. (The one exception, as
| far as I know, being the Turkish "I".)
|
| - Unicode was designed for round-trip compatibility with
| legacy encodings (which weren't legacy yet at the time). To
| that effect, a given script would often be added as whole, in
| a contiguous block, to simplify transcoding.
|
| - Unifying characters in that way would cause additional
| complications when sorting.
| mgaunard wrote:
| Because graphemes and glyphs are different things.
| hanche wrote:
| You may be amused to learn about these, then:
|
| U+2012 FIGURE DASH, U+2013 EN DASH and U+2212 MINUS SIGN all
| look exactly the same, as far as I can tell. But they have
| different semantics.
| layer8 wrote:
| They don't necessarily look the same. The distinction is
| typographic, and only indirectly semantic.
|
| Figure dash is defined to have the same width as a digit
| (for use in tabular output). Minus sign is defined to have
| the same width and vertical position as the plus sign. They
| may all three differ for typographic reasons.
| hanche wrote:
| Ah, good point. But typography is supposed to support the
| semantics, so at least I was not totally wrong.
| ahazred8ta wrote:
| In Hawaiʻi, there's a constant struggle between the proper
| ʻokina (U+02BB), the left single quote, and the apostrophe.
| michaelt wrote:
| Unicode's "Han Unification"
| https://en.wikipedia.org/wiki/Han_unification aimed to create
| a unified character set for the characters which are
| (approximately) identical between Chinese, Japanese, Korean
| and Vietnamese.
|
| It turns out this is complex and controversial enough that
| the wikipedia page is pretty gigantic.
| andrewaylett wrote:
| In some cases, because they have distinct encodings in a pre-
| Unicode character set.
|
| Unicode wants to be able to represent any legacy encoding in
| a lossless manner. ISO8859-7 encodes Latin A and Greek Α at
| different code points, and ISO8859-5 has Cyrillic А at yet
| another code point, so Unicode needs to give them different
| encodings too.
|
| And, indeed, they _are_ different letters -- as sibling
| comments point out, if you want to lowercase them then you
| wind up with a, α, and а, and that's not going to work very
| well if the capitals have the same encoding.
| re wrote:
| > Can you spot any difference between "blöb" and "blöb"?
|
| It's tricky to try to determine this because normalization can
| end up getting applied unexpectedly (for instance, on Mac,
| Firefox appears to normalize copied text as NFC while Chrome does
| not), but by downloading the page with cURL and checking the raw
| bytes I can confirm that there is no difference between those two
| words :) Something in the author's editing or publishing pipeline
| is applying normalization and not giving her the end result that
she was going for.
|
|       00009000: 0a3c 7020 6964 3d22 3066 3939 223e 4361  .<p id="0f99">Ca
|       00009010: 6e20 796f 7520 7370 6f74 2061 6e79 2064  n you spot any d
|       00009020: 6966 6665 7265 6e63 6520 6265 7477 6565  ifference betwee
|       00009030: 6e20 e280 9c62 6cc3 b662 e280 9d20 616e  n ...bl..b... an
|       00009040: 6420 e280 9c62 6cc3 b662 e280 9d3f 3c2f  d ...bl..b...?</
|
| Let's see if I can get HN to preserve the different forms:
|
| Composed: ü Decomposed: ü
|
| Edit: Looks like that worked!
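For anyone who wants to reproduce the experiment locally, the two forms can be produced and compared with Python's stdlib unicodedata:

```python
import unicodedata

composed = "\u00fc"     # ü as one code point (what NFC produces)
decomposed = "u\u0308"  # u + COMBINING DIAERESIS (what NFD produces)

print(composed == decomposed)  # False: they render alike but differ byte-for-byte
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```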
| Eisenstein wrote:
| Perhaps the author used the same character twice for effect,
| not suspecting someone would use curl to examine the raw bytes?
| mgaunard wrote:
| I believe XML and HTML both require Unicode data to be in NFC.
| fanf2 wrote:
| I don't think so?
|
| https://www.w3.org/TR/2008/REC-xml-20081126/#charsets
|
| XML 1.1 says documents should be normalized but they are
| still well-formed even if not normalized
|
| https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-
| normaliza...
|
| But you should not use XML 1.1
|
| https://www.ibiblio.org/xml/books/effectivexml/chapters/03.h.
| ..
| layer8 wrote:
| You believe incorrectly. Not even Canonical XML requires
| normalization:
| https://www.w3.org/TR/xml-c14n/#NoCharModelNorm
| mbrubeck wrote:
| HTML does not require NFC (or any other specific
| normalization form):
|
| https://www.w3.org/International/questions/qa-html-css-
| norma...
|
| Neither does XML (though XML 1.0 recommends that element
| names SHOULD be in NFC and XML 1.1 recommends that documents
| SHOULD be fully normalized):
|
| https://www.w3.org/TR/2008/REC-xml-20081126/#sec-
| suggested-n...
|
| https://www.w3.org/TR/xml11/#sec-normalization-checking
| jph wrote:
| Normalizing can help with search. For example for Ruby I maintain
| this gem: https://rubygems.org/gems/sixarm_ruby_unaccent
| noname120 wrote:
| Wow the code[1] looks horrific!
|
| Why not just do this: string → NFD → strip diacritics → NFC?
| See [2] for more.
|
| [1]
| https://github.com/SixArm/sixarm_ruby_unaccent/blob/eb674a78...
|
| [2] https://stackoverflow.com/a/74029319/3634271
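The NFD → strip → NFC pipeline suggested above is a few lines of stdlib Python; `unaccent` here is just an illustrative name, not the gem's actual API:

```python
import unicodedata

def unaccent(s: str) -> str:
    # 1. Decompose: ü becomes u + COMBINING DIAERESIS.
    decomposed = unicodedata.normalize("NFD", s)
    # 2. Strip: drop nonspacing marks (general category Mn).
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    # 3. Recompose what survives.
    return unicodedata.normalize("NFC", stripped)

print(unaccent("café"))  # cafe
```

Note the caveat: this only removes combining marks, so letters without a decomposition (ß, ø, đ, ...) pass through unchanged, and it can't do language-specific transliterations like German ü → ue.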
| chuckadams wrote:
| Clearly the author already knows this, but it highlights the
| importance of always normalizing your input, and consistently
| using the same form instead of relying on the OS defaults.
| mckn1ght wrote:
| Also, never trust user input. File names are user inputs. You
| can execute XSS attacks via filenames on an unsecured site.
| makeitdouble wrote:
| The larger point is probably that search and comparison are
| inherently hard as what humans understand as equivalent isn't
| the same for the machine. Next stop will be upper case and
| lower case. Then different transcriptions of the same words in
| CJK.
| jesprenj wrote:
| Should you really change the filenames of users' files and
| depend on them being valid UTF-8? Wouldn't it be better to keep
| the original filename and use it most of the time, except for
| searches and indexing?
|
| Why don't you normalize Latin-alphabet filenames for indexing
| even further -- allow searching for "Führer" with queries like
| "Fuehrer" and "Fuhrer"?
| zeroCalories wrote:
| I generally agree that you shouldn't change the file name, but
| in reality I bet OP stored it as another column in a database.
|
| For more aggressive normalization like that, I think it makes
| more sense to implement something like a spell checker that
| suggests similar files.
| noodlesUK wrote:
| One thing that is very unintuitive with normalization is that
| MacOS is much more aggressive with normalizing Unicode than
| Windows or Linux distros. Even if you copy and paste non-
| normalized text into a text box in safari on Mac, it will be
| normalized before it gets posted to the server. This leads to
| strange issues with string matching.
| codesnik wrote:
| I was really surprised when I realized that, at least on HFS+,
| Cyrillic is normalized too. For example, no Russian ever
| thinks that й is an и with some diacritic. It's a different
| letter in its own right. But the Mac normalizes it into two
| codepoints.
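The decomposition described above is easy to see with Python's unicodedata:

```python
import unicodedata

short_i = "\u0439"  # й CYRILLIC SMALL LETTER SHORT I
nfd = unicodedata.normalize("NFD", short_i)
# NFD splits it into и (U+0438) plus COMBINING BREVE (U+0306).
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0438', 'U+0306']
```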
| anamexis wrote:
| Well, there's no expectation in unicode that something viewed
| as a letter in its own right should use a single codepoint.
| asveikau wrote:
| I dislike explaining string compares to monolingual English
| speakers who are programmers. Similar to this й/и phenomenon
| are people who think ñ and n should compare equally, or ç
| and c, or that the lowercase of I is always i (or that case
| conversion is locale-independent).
|
| In something like a code review, people will think you're
| insane for pointing out that this type of assumption might
| not hold. Actually, come to think of it, explaining
| localization bugs at all is a tough task in general.
| iforgotpassword wrote:
| Well, I do like this behavior for search though. I don't
| want to install a new keyboard layout just to be able to
| search for a Spanish word.
| david-gpu wrote:
| Is the convenience of a few foreigners searching for
| something more important than the convenience of the many
| native speakers searching for the same?
|
| Maybe we should start modifying the search behavior of
| English words to make them more convenient for non-native
| speakers as well. We could start by making "bed aidia"
| match "bad idea", since both sound similar to my foreign
| ears.
| MrJohz wrote:
| In fairness, for search, allowing multiple ways of typing
| the same thing is probably the best choice: you can
| prioritise true matches, where the user has typed the
| correct form of the letter, but also allow for more
| visual based matches. (Correcting common typos is also
| very convenient even for native speakers of a language --
| and of course a phonetic search that actually produced
| good results would be wonderful, albeit I suspect
| practically very difficult given just how many ways of
| writing a given pronunciation there might be!)
| makeitdouble wrote:
| Search probably needs both modes. A literal and a fuzzy
| one.
| NeoTar wrote:
| My brother recently asked for help in determining who a
| footballer (soccer player) was from a photo. Like in many
| sports, the jerseys have the player's name on the rear,
| and this player's was in Cyrillic - Шунин (Anton Shunin)
| - and my brother had tried searching for Wyhnh without
| success.
|
| Anyway, my point is that perhaps ideally (and maybe
| search engines do this) the results should be determined
| by the locale of the searcher. So someone in the English-
| speaking world can find Łódź by searching for Lodz, but a
| Pole may need to type Łódź. My brother could find Шунин
| by typing Wyhnh, but a Russian could not...
| yxhuvud wrote:
| Or that sort order is locale-independent. Swedish is a good
| example here, as å, ä and ö are sorted at the end, and until
| 2006 w was sorted as v. Then it changed, and w is now
| considered a letter in its own right.
| makeitdouble wrote:
| The general reaction I've seen until now was "meh, we have
| to make compromises (don't make me rewrite this for people
| I'll probably never meet)".
|
| Diacritics exacerbate this so much, as they can be shared
| between two languages yet have different rules/handling.
| French typically has a decent number of them, and they're
| meaningful, but traditionally ignores them for comparison
| (in the dictionary, for instance). That makes it more
| difficult for a dev to have an intuitive feeling of where it
| matters and where it doesn't.
| bawolff wrote:
| Normalization isn't based on what language the text is.
|
| NFC just means never use combining characters if possible,
| and NFD means always use combining characters if possible. It
| has nothing to do with whether something is a "real" letter
| in a specific language or not.
|
| Whether or not something is a "real" letter vs. a letter
| with a modifier comes into play more in the Unicode
| collation algorithm, which is a separate thing.
| creshal wrote:
| MacOS creates _so_ many normalization problems in mixed
| environments that it 's not even funny any more. No common
| server-side CMS etc. can deal with it, so the more Macs you add
| to an organization, the more problems you get with inconsistent
| normalization in your content. (And indeed, CMSes shouldn't
| _have_ to second-guess users' intentions - diacritics and
| umlauts are pronounced differently, and I _should_ be able to
| encode that difference, e.g. to better cue TTS.)
|
| And, of course, the Apple fanboys will just shrug and suggest
| you also convert the rest of the organization to Apple devices,
| after all, if Apple made a choice, it can't be wrong.
| fauigerzigerk wrote:
| I'm not sure I understand. On the one hand you seem to be
| saying that users should be able to choose which
| normalisation form to use (not sure why). On the other hand
| you're unhappy about macOS sending NFD.
|
| If it's a user choice then CMSs have to be able to deal with
| all normalisation forms anyway and shouldn't care one bit
| whether macOS sends NFD or NFC. Mac users could of course
| complain about their choice not being honoured by macOS but
| that's of no concern to CMSs.
| creshal wrote:
| > On the other hand you're unhappy about macOS sending NFD.
|
| Because macOS _always_ uses it, regardless of the user's
| intention, so it decomposes umlauts into base letters plus
| combining diaereses (despite them having different meanings
| and pronunciations), mangles Cyrillic, and probably causes
| more problems I haven't yet run into.
| kps wrote:
| Unicode doesn't have 'umlauts', and (with a few
| unfortunate exceptions) doesn't care about meanings and
| pronunciations. From the Unicode perspective, what you're
| talking about is the difference between Unicode
| Normalization Form C:
|
|       U+00FC LATIN SMALL LETTER U WITH DIAERESIS
|
| and Unicode Normalization Form D:
|
|       U+0075 LATIN SMALL LETTER U
|       U+0308 COMBINING DIAERESIS
|
| Unicode calls these two forms 'canonically equivalent'.
| zh3 wrote:
| Suspect you're getting downvoted because of the last
| sentence. However, I do sympathise with MacOS tending to
| mangle standard (even plain ASCII) text in a way that adds to
| the workload for users of other OS's.
| creshal wrote:
| It adds to the workload of everyone, including the Apple
| users. The latter ones are just in denial about it.
| ttepasse wrote:
| Unfun normalisation fact: You can't have a file named "ß" and
| a file named "ss" in the same folder in Mac OS.
| yxhuvud wrote:
| So what happens if someone puts those two in a git repo and a
| Mac user checks out the folder?
| staplung wrote:
|       git clone https://github.com/ghurley/encodingtest
|       Cloning into 'encodingtest'...
|       remote: Enumerating objects: 9, done.
|       remote: Counting objects: 100% (9/9), done.
|       remote: Compressing objects: 100% (5/5), done.
|       remote: Total 9 (delta 1), reused 0 (delta 0), pack-reused 0
|       Receiving objects: 100% (9/9), done.
|       Resolving deltas: 100% (1/1), done.
|       warning: the following paths have collided (e.g. case-sensitive paths
|       on a case-insensitive filesystem) and only one from the same
|       colliding group is in the working tree:
|
|         'ss'
|         'ß'
| Twisol wrote:
| I have this issue on occasion with older mixed C/C++
| codebases that use `.c` for C files and `.C` for C++
| files. Maddening.
| Athas wrote:
| I never understood the popularity of the '.C' extension
| for C++ files. I have my own preference (.cpp), but it's
| essentially arbitrary compared to most other common
| alternatives (.cxx, .c++). The '.C' extension is the only
| one that just seems worse (this case-sensitivity issue,
| and just general confusion given how similar '.c' looks
| to '.C').
|
| But even more than that, I just don't get how C++ turns
| into 'C' at all. It seems actively misleading.
| tetromino_ wrote:
| EEXIST
| eropple wrote:
| This shows up in other places, too. One of my Slacks has a
| textji of `groß`, because I enjoy making our German
| speakers' teeth grind, but you sure can just type `:gross:`
| to get it.
| bawolff wrote:
| That's less a normal form issue and more a case-insensitivity
| issue. You also can't have a file named "a" and one named "A"
| in the same folder.
| samatman wrote:
| That would be true if the test strings were "SS" and "ss",
| because although "ẞ" is a valid capitalization of "ß",
| it's officially a newcomer. It's more of a hybrid issue: it
| appears that APFS uses uppercasing for case-insensitive
| comparison, and also uppercases "ß" to "SS", not "ẞ".
| This is the default casing; Unicode also defines a
| "tailored casing" which doesn't have this property.
|
| So it isn't _per se_ normalization, but it's not _not_
| normalization either. In any case (heh) it's a weird thing
| that probably shouldn't happen. Worth noting that APFS
| doesn't normalize file names; normalization happens
| higher up in the toolchain, which has made some things
| better and others worse.
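The default (untailored) mapping is easy to check from Python, which follows the same Unicode case-mapping tables:

```python
# Default full uppercase mapping turns ß into the two-letter 'SS',
# not the capital sharp s U+1E9E.
print("ß".upper())              # SS
print("ß".upper() == "\u1e9e")  # False

# Case folding shows why a case-insensitive filesystem collides:
# both names fold to the same string.
print("ss".casefold() == "ß".casefold())  # True

# And the round trip through upper() never recovers the original:
print("straße".upper())   # STRASSE
print("STRASSE".lower())  # strasse
```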
| sorenjan wrote:
| I sometimes see texts where ä is rendered as a¨, i.e. with the
| dots next to the a instead of above it, even though it's a
| completely different letter and not a version of a. I managed
| to track the issue down to macOS's normalization, but it has
| happened on big national newspapers' websites and similar. I
| haven't seen it in a while; maybe Firefox on Windows renders it
| better, or maybe various publishing tools have fixed it. It
| looks really unprofessional, which is a bit strange since I
| thought Apple prided themselves on their typography.
| iforgotpassword wrote:
| I have that in gnome terminal. The dots always end up on the
| letter after, not before. At least makes it easy to spot
| filenames in decomposed form so I can fix them.
| aidos wrote:
| I have never seen that in all my years on a Mac (though
| admittedly I'm not dealing in languages where I encounter it
| often). I'm assuming there's an issue with the GPOS table in
| the font you're using, so the dots aren't negative-shifted
| into position as they should be?
| yxhuvud wrote:
| On the other hand, stuff written on Macs is a lot more likely
| to require normalization in the first place.
| mawise wrote:
| I ran into this building search for a family tree project. I
| found out that Rails provides
| `ActiveSupport::Inflector.transliterate()` which I could use for
| normalization.
| Havoc wrote:
| For those intrigued by this sort of thing, check out the tech
| talk "Plain Text" by Dylan Beattie.
|
| Absolute gem. His other talks are entertaining too.
| hanche wrote:
| He seems to have done that talk several times. I watched the
| 2022 one. Time well spent!
| keybored wrote:
| I try to avoid Unicode in filenames (I'm on Linux). It seems that
| a lot of normal users might have the same intuition as well? I
| get the sense that a lot will instinctually transcode to ASCII,
| like they do for URLs.
| zzo38computer wrote:
| I also try to avoid non-ASCII characters in file names (and I
| am also on Linux). I also like to avoid spaces and most
| punctuations in file names (if I need word separation I can use
| underscores or hyphens).
| skissane wrote:
| Sometimes I wish they had disallowed spaces in file names.
|
| Historically, many systems were very restrictive in what
| characters are allowed in file names. In part in reaction to
| that, Unix went to the other extreme, allowing any byte
| except NUL and slash.
|
| I think that was a mistake - allowing C0 control characters
| in file names (bytes 0x01 through 0x1F) serves no useful
| purpose; it just creates the potential for bugs and security
| vulnerabilities. I wish they'd blocked them.
|
| POSIX debated banning C0 controls, although appears to have
| settled on just a recommendation (not a mandate) that
| implementations disallow newline:
| https://www.austingroupbugs.net/view.php?id=251
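A defensive check along those lines is tiny; `has_c0_controls` is a hypothetical helper for illustration, not anything POSIX specifies:

```python
def has_c0_controls(name: str) -> bool:
    # C0 controls are U+0000 through U+001F (NUL, newline, escape, ...),
    # the range POSIX debated banning from file names.
    return any(ord(ch) < 0x20 for ch in name)

print(has_c0_controls("report.txt"))      # False
print(has_c0_controls("evil\nname.txt"))  # True
```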
| juujian wrote:
| I ran into encoding problems so many times, I just use ASCII
| aggressively now. There is still kanji, Hanzi, etc. but at least
| for Western alphabets, not worth the hassle.
| layer8 wrote:
| The article isn't about non-Unicode encodings.
| juujian wrote:
| Meant to write ASCII
| zzo38computer wrote:
| I also just use ASCII when possible; it is the most likely to
| work and to be portable. For some purposes, other character
| sets/encodings are better, but which ones are better depends on
| the specific case (not only what language of text but also the
| use of the text in the computer, etc).
| arp242 wrote:
| This works fine as a personal choice, but doesn't really work
| if you're writing something other random people interact with.
|
| Even for just English it doesn't work all that well, because it
| lacks things like the euro sign (€), which is fairly common
| (certainly in Europe); there are names with diacritics (including
| "native" names, e.g. in Ireland they're common); there are too
| many loanwords with diacritics; and ASCII has a somewhat limited
| set of punctuation.
|
| There are some languages where this can sort of work (e.g.
| Indonesian can be fairly reliably written in just ASCII),
| although even there you will run into some of these issues. It
| certainly doesn't work for English, and even less so for other
| Latin-based European languages.
| blablabla123 wrote:
| As a German macOS user with a US keyboard, I run into a related
| issue every now and then. What's nice about macOS is that I can
| easily combine Umlaute, but also other common letters from
| European languages, without any extra configuration. But some
| (web) applications stumble over it while I'm typing, because
| the input arrives as:
|
|       1. ¨ (Option-u)
|       2. ü (u pressed)
| kps wrote:
| Early on, Netscape effectively exposed Windows keyboard events
| directly to Javascript, and browsers on other platforms were
| forced to try to emulate Windows events, which is necessarily
| imperfect given different underlying input systems. "These
| features were never formally specified and the current browser
| implementations vary in significant ways. The large amount of
| legacy content, including script libraries, that relies upon
| detecting the user agent and acting accordingly means that any
| attempt to formalize these legacy attributes and events would
| risk breaking as much content as it would fix or enable.
| Additionally, these attributes are not suitable for
| international usage, nor do they address accessibility
| concerns."
|
| The current method is much better designed to avoid such
| problems, and has been supported by all major browsers for
| quite a while now (the laggard Safari having arrived 7 years
| ago this Tuesday).
|
| https://www.w3.org/TR/uievents
| NotYourLawyer wrote:
| ASCII should be enough for anyone.
| zzo38computer wrote:
| ASCII is good for a lot of stuff, but not for everything.
| Sometimes, other character sets/encodings will be better, but
| which one is better depends on the circumstances. (Unicode does
| have many problems, though. My opinion is that Unicode is no
| good.)
| hanche wrote:
| And who needs more than 640 kilobytes of memory anyhow?
| mckn1ght wrote:
| Don't forget butterflies in case you need to edit some text.
| userbinator wrote:
| _its[sic] 2024, and we are still grappling with Unicode character
| encoding problems_
|
| More like "because it's 2024." This wouldn't be a problem before
| the complexity of Unicode became prevalent.
| bornfreddy wrote:
| You mean this wouldn't be a problem if we used the myriad
| different encodings like we did before Unicode, because we
| would probably not be able to even save the files anyway? So
| true.
| userbinator wrote:
| Before Unicode, most systems were effectively "byte-
| transparent", and encoding was only a top-level concern. Those
| working in one language would use the appropriate encoding
| (likely CP1252 for most Latin languages), and there wouldn't
| be confusion about different bytes for same-looking
| characters.
| bawolff wrote:
| My understanding is way back in the day, people would use
| ascii backspace to combine an ascii letter with an ascii
| accent character.
| deathanatos wrote:
| A single user system, perhaps.
|
| I've worked on a system that ... well, didn't _predate_
| Unicode, but was sort of near the leading edge of it and
| was multi-system.
|
| The database columns containing text were all byte arrays.
| And because the client (a Windows tool, but honestly Linux
| isn't any better off here) just took an LPCSTR or whatever,
| the bytes were just in whatever locale the client was using.
| But that was recorded nowhere, and of course, all the rows
| were in different locales.
|
| I think that would be far more common, today, if Unicode
| had never come along.
| n2d4 wrote:
| You make it sound like non-English languages were invented in
| 2024
| bawolff wrote:
| Combining characters go back to the 90s. The unicode normal
| forms were defined in the 90s. None of this is new at this
| point.
| raffy wrote:
| I created a bunch of Unicode tools during development of ENSIP-15
| for ENS (Ethereum Name Service)
|
| ENSIP-15 Specification: https://docs.ens.domains/ensip/15
|
| ENS Normalization Tool: https://adraffy.github.io/ens-
| normalize.js/test/resolver.htm...
|
| Browser Tests: https://adraffy.github.io/ens-
| normalize.js/test/report-nf.ht...
|
| 0-dependency JS Unicode 15.1 NFC/NFD Implementation [10KB]
| https://github.com/adraffy/ens-normalize.js/blob/main/dist/n...
|
| Unicode Character Browser: https://adraffy.github.io/ens-
| normalize.js/test/chars.html
|
| Unicode Emoji Browser: https://adraffy.github.io/ens-
| normalize.js/test/emoji.html
|
| Unicode Confusables: https://adraffy.github.io/ens-
| normalize.js/test/confused.htm...
| josephcsible wrote:
| IMO, it was a mistake for Unicode to provide multiple ways to
| represent 100% identical-looking characters. After all, ASCII
| doesn't have separate "c"s for "hard c" and "soft c".
| striking wrote:
| If you take a peek at an extended ASCII table (like the one at
| https://www.ascii-code.com/), you'll notice that 0xC5 specifies
| a precomposed capital A with ring above (Å). It predates Unicode.
| Accepting that that's the case, and acknowledging that forward
| compatibility from ASCII to Unicode is a good thing (so we
| don't have any more encodings, we're just extending the most
| popular one), and understanding that you're going to have the
| ring-above diacritic in Unicode anyway... you kind of just end
| up with both representations.
| arp242 wrote:
| Everything can just be pre-composed; Unicode doesn't _need_
| composing characters.
|
| There's history here, with Unicode originally having just 65k
| characters, and hindsight is always 20/20, but I do wish
| there was a move towards deprecating all of this in favour of
| always using pre-composed.
|
| Also: what you linked isn't "ASCII" and "extended ASCII"
| doesn't really mean anything. ASCII is a 7-bit character set
| with 128 characters, and there are dozens, if not hundreds,
| of 8-bit character sets with 256 characters. Both CP-1252 and
| ISO-8859-1 saw wide use for Latin alphabet text, but others
| saw wide use for text in other scripts. So if you give me a
| document and tell me "this is extended ASCII" then I still
| don't know how to read it and will have to trial-and-error
| it.
|
| I don't think Unicode after U+007F is compatible with any
| specific character set? To be honest I never checked, and I
| don't see in what case that would be convenient. UTF-8 is
| only compatible with ASCII, not any specific "extended
| ASCII".
| zokier wrote:
| For roundtripping e.g. https://en.wikipedia.org/wiki/VSCII
| you do need both composing characters and precomposed
| characters.
| bandrami wrote:
| > Unicode doesn't need composing characters
|
| But it does, IIRC, for both Bengali and Telugu.
| arp242 wrote:
| Only because they chose to do it like that. It doesn't
| need to.
| adrian_b wrote:
| In my opinion, only the reverse could be true, i.e. that
| Unicode does not need pre-composed characters because
| everything can be written with composing characters.
|
| The pre-composed characters are necessary only for
| backwards compatibility.
|
| It is completely unrealistic to expect that Unicode will
| ever provide all the pre-composed characters that have ever
| been used in the past or which will ever be desired in the
| future.
|
| There are pre-composed characters that do not exist in
| Unicode because they have been very seldom used. Some of
| them may even be unused in any language right now, but they
| have been used in some languages in the past, e.g. in the
| 19th century, but then they have been replaced by
| orthographic reforms. Nevertheless, when you digitize and
| OCR some old book, you may want to keep its text as it was
| written originally, so you want the missing composed
| characters.
|
| Another case that I have encountered where I needed
| composed characters not existing in Unicode was when
| choosing a more consistent transliteration for languages
| that do not use the Latin alphabet. Many such languages use
| quite bad transliteration systems, precisely because
| whoever designed them has attempted to use only whatever
| restricted character set was available at that time. By
| choosing appropriate composing characters it is possible to
| design improved transliterations.
| kps wrote:
| > _I don 't think Unicode after U+007F is compatible with
| any specific character set?_
|
| The 'early' Unicode alphabetic code blocks came from the ISO
| 8859 encodings [1], e.g. the Unicode Cyrillic block follows
| ISO 8859-5, the Greek and Coptic block follows ISO 8859-7,
| etc.
|
| [1] https://en.wikipedia.org/wiki/ISO/IEC_8859
| fhars wrote:
| Unicode was never designed for ease of use or efficiency of
| encoding, but for ease of adoption. And that meant that it had
| to support lossless round trips from any legacy format to
| Unicode and back to the legacy format, because otherwise no
| decision maker would have allowed a transition to Unicode to
| begin for important systems.
|
| So now we are saddled with an encoding that has to be bug
| compatible with any encoding ever designed before.
| pavel_lishin wrote:
| It might not be ludicrous to suggest that the English letter
| "a" and the Russian letter "а" should be a single entity, if
| you don't think about it very hard.
|
| But the English letter "c" and the Russian letter "с" are
| completely different characters, even if at a glance they look
| the same - they make completely different sounds, and _are_
| different letters. It _would_ be ludicrous to suggest that they
| should share a single symbol.
| bawolff wrote:
| Maybe, but then you can no longer round trip with other
| encodings, which seems worse to me.
| ulrischa wrote:
| It really is awful that we still have to deal with encoding
| issues in 2024.
| mglz wrote:
| My last name contains an ü and it has been consistently horrible.
|
| * When I try to preemptively replace ü with ue, many institutions
| and companies refuse to accept it because it does not match my
| passport
|
| * Especially in France, clerks try to emulate ü with the
| diacritic used for the tréma e, ë. This makes it virtually
| impossible to find me in a system again
|
| * Sometimes I can enter my name as-is and there seems to be no
| problem, only for some other system to mangle it to � or ? or a
| box. This often triggers errors downstream I have no way of
| fixing
|
| * Sometimes, people print a u and add the diacritics by hand on
| the label. This is nice, but still somehow wrong.
|
| I wonder what the solution is. Give up and ask people to
| consistently use an ASCII-only name? Allow everybody 1000+
| Unicode characters as a name and go off that string? Officially
| change my name?
| userbinator wrote:
| Everyone's name should just be a GUID. /s
| BuyMyBitcoins wrote:
| Falsehoods Programmers Believe About Names, #41 - People have
| GUIDs.
|
| https://www.kalzumeus.com/2010/06/17/falsehoods-
| programmers-...
| makeitdouble wrote:
| The part I came to love about France in general is that while
| all of these are broken, the people dealing with it will
| completely agree it's broken and amply sympathize, but just
| accept your name is printed as Gnter.
|
| Same for names that don't fit field lengths, addresses that
| require street numbers, etc. It's a real pain to deal with all
| of it and each system will fail in its own way to make your
| life a mess, but people will embrace the mess and won't blink
| an eye when you bring papers that just don't match.
| zokier wrote:
| Under GDPR people have the right to have their personal data
| to be accurate, there was a legal case exactly about this:
| https://news.ycombinator.com/item?id=38009963
| makeitdouble wrote:
| That's a pretty unexpected twist, and I'm thrilled with it.
|
| I don't see every institution come up with a fix anytime
| soon, but having it clear that they're breaking the law is
| such a huge step. That will also have a huge impact on bank
| system development, and I wonder how they'll do it (extend
| the current system to have the customer facing bits
| rewritten, or just redo it all from top to bottom)
|
| There is the tale of Mizuho bank [0], which botched its system
| upgrade project so hard that they were still seeing widespread
| failures a decade in.
|
| [0] https://www.japantimes.co.jp/news/2022/02/11/business/m
| izuho...
| zokier wrote:
| Germans have of course a standard for this
|
| > a normative subset of Unicode Latin characters, sequences of
| base characters and diacritic signs, and special characters for
| use in names of persons, legal entities, products, addresses
| etc
|
| https://en.wikipedia.org/wiki/DIN_91379
| em-bee wrote:
| and it's used in the passport too. so names with umlaut show
| up in both forms and it is possible to match either form
| samatman wrote:
| The only solution is going to be a lot of patience,
| unfortunately.
|
| Everyone should be storing strings as UTF-8, and any time
| strings are being compared they should undergo some form of
| normalization. Doesn't matter which, as long as it's
| consistent. There's no reason to store string data in any other
| format, and any comparison code which isn't normalizing is
| buggy.
|
| But thanks to institutional inertia, it will be a very long
| time before everything works that way.
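The comparison rule described above can be sketched in a few lines of Python; the choice of NFC here is arbitrary, the point being only that both sides get the same form before comparing:

```python
import unicodedata

def strings_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A naive == misses equivalent spellings; the normalizing compare does not.
assert "M\u00fcller" != "Mu\u0308ller"          # raw comparison fails
assert strings_equal("M\u00fcller", "Mu\u0308ller")
```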
| lmm wrote:
| > Everyone should be storing strings as UTF-8, and any time
| strings are being compared they should undergo some form of
| normalization. Doesn't matter which, as long as it's
| consistent. There's no reason to store string data in any
| other format, and any comparison code which isn't normalizing
| is buggy.
|
| This will result in misprinting Japanese names (or
| misprinting Chinese names depending on the rest of your
| system).
| earthboundkid wrote:
| Can we please talk about Unicode without the myth of Han
| Unification being bad somehow? The problem here is exactly
| the lack of unification in Roman alphabets!
| RedNifre wrote:
| How?
| zokier wrote:
| > * Especially in France, clerks try to emulate ü with the
| diacritic used for the tréma e, ë. This makes it virtually
| impossible to find me in a system again
|
| In Unicode, the umlaut and the diaeresis are both represented
| by the same codepoint, U+0308 COMBINING DIAERESIS.
|
| https://en.wikipedia.org/wiki/Umlaut_(diacritic)
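This is easy to verify: under NFD, both the German umlaut (ü) and the French tréma (ë) decompose to their base letter followed by the same combining mark.

```python
import unicodedata

# Both diacritics decompose to base letter + U+0308 COMBINING DIAERESIS.
assert unicodedata.normalize("NFD", "\u00fc") == "u\u0308"  # ü
assert unicodedata.normalize("NFD", "\u00eb") == "e\u0308"  # ë
```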
| ulucs wrote:
| Can ü be printed on a passport rather than a u? I have a ş and
| a ç, so I have been successfully substituting s and c for them
| in a somewhat consistent manner.
| lmm wrote:
| > Give up and ask people to consistenly use a ascii-only name?
|
| > Officially change my name?
|
| Yes. That's the only one that's going to actually work. You can
| go on about how these systems ought to work until the cows come
| home, and I'm sure plenty of people on HN will, but if you
| actually want to get on with your life and avoid problems,
| legally change your name to one that's short and ASCII-only.
| CoastalCoder wrote:
| Isn't u/u-encoding a solved problem on Unix systems?
|
| </joke>
| earthboundkid wrote:
| This isn't an encoding problem. It's a search problem.
| ComputerGuru wrote:
| ZFS can be configured to force the use of a particular normalized
| Unicode form for all filenames. Amazing filesystem.
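For reference, the ZFS dataset property in question is `normalization`; it can only be set when the dataset is created, and setting it implies `utf8only=on`. A sketch (the pool and dataset names `tank/home` are made up):

```shell
# Force all filenames on this dataset to be valid UTF-8 and to be
# compared under NFD (allowed: none | formC | formD | formKC | formKD).
zfs create -o normalization=formD -o utf8only=on tank/home
```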
___________________________________________________________________
(page generated 2024-03-24 23:00 UTC)