[HN Gopher] The ü/ü Conundrum
       ___________________________________________________________________
        
       The ü/ü Conundrum
        
       Author : firstSpeaker
       Score  : 98 points
       Date   : 2024-03-24 16:50 UTC (6 hours ago)
        
 (HTM) web link (unravelweb.dev)
 (TXT) w3m dump (unravelweb.dev)
        
       | layer8 wrote:
       | The more general solution is specified here:
       | https://unicode.org/reports/tr10/#Searching
        
         | bawolff wrote:
         | Collation and normal forms are totally different things with
         | different purposes and goals.
         | 
         | Edit: reread the article. My comment is silly. UCA is the
         | correct solution to the author's problem.
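          | For instance, a minimal sketch of accent-insensitive matching
          | at UCA primary strength, assuming the PyICU bindings are
          | available (pip install PyICU):
          | 
          |     import icu
          | 
          |     # Primary strength compares base letters only, ignoring
          |     # accents and case, per the UCA / TR10.
          |     coll = icu.Collator.createInstance(icu.Locale("en"))
          |     coll.setStrength(icu.Collator.PRIMARY)
          | 
          |     print(coll.compare("blöb", "blob") == 0)      # True
          |     print(coll.compare("Führer", "fuhrer") == 0)  # True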
        
       | kazinator wrote:
        | Oh that Mötley Unicode.
        
         | lxgr wrote:
         | I'm aware of the "metal umlaut" meme, but as a German native
         | speaker, I can't not read these in my head in a way that sounds
         | much less Metal than probably intended :)
        
           | 082349872349872 wrote:
           | > " _When we finally went to Germany, the crowds were
           | chanting, 'Mutley Cruh! Mutley Cruh!' We couldn't figure out
           | why the fuck they were doing that._ " --VNW
        
           | ginko wrote:
            | I will always pronounce the umlaut in Motörhead. Lemmy
           | brought that on himself.
        
           | ooterness wrote:
            | The best metal umlauts are placed on a consonant (e.g.,
            | Spin̈al Tap). This makes it completely clear when it's there
           | for aesthetics and not pronunciation.
        
           | yxhuvud wrote:
            | Yes, those umlauts made it sound more like a fake French
           | accent.
        
           | Symbiote wrote:
           | Years ago, an American metalhead was added to a group chat
           | before she came to visit.
           | 
            | She was called Daniela, but she'd written it with metal
            | umlauts. When my Swedish friend met her in person, having
            | seen her name in the group chat, he said something like
            | "Hej, Dayne-ee-lair, right? How was the flight?".
        
         | 082349872349872 wrote:
            | It can encode Spin̈al Tap, so it's all good.
        
           | chuckadams wrote:
            | Oh sweet summer child, i̵t̷ ̶c̴a̸n̵ ̷e̶n̸c̵o̶d̷e̴ ̵s̶o̷ ̸m̴u̵c̶h̷ ̸m̶o̵r̸e̴.̵.̶.̷
        
             | 082349872349872 wrote:
             | TIL about https://esolangs.org/wiki/Zalgo#Number_to_String
        
       | _nalply wrote:
       | Sometimes it makes sense to reduce to Unicode confusables.
       | 
        | For example, the Greek capital Alpha (Α) looks like the Latin
        | uppercase A. Or some characters look very similar, like the
        | slash and the fraction slash. Yes, Unicode has separate scalar
        | values for them.
       | 
       | There are Open Source tools to handle confusables.
       | 
       | This is in addition to the search specified by Unicode.
        
         | wanderingstan wrote:
         | I wrote such a library for Python here:
         | https://github.com/wanderingstan/Confusables
         | 
         | My use case was to thwart spammers in our company's channels,
         | but I suppose it could be used to also normalize accent
         | encoding issues.
         | 
         | Basically converts a phrase into a regular expression matching
         | confusables.
         | 
         | E.g. "He10" would match "Hello"
        
           | _nalply wrote:
           | Interesting.
           | 
           | What would you think about this approach: reduce each
           | character to a standard form which is the same for all
           | characters in the same confusable group? Then match all
           | search input to this standard form.
           | 
           | This means "He1l0" is converted to "Hello" before searching,
           | for example.
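            | A toy sketch of that idea in Python; the mapping table here
            | is tiny and hypothetical, a real one would be generated from
            | Unicode's confusables.txt:
            | 
            |     # Map each character to one representative of its
            |     # confusable group, then compare the reduced forms.
            |     CANON = {"1": "l", "0": "o", "\u0430": "a"}  # Cyrillic а
            | 
            |     def skeleton(s: str) -> str:
            |         return "".join(CANON.get(ch, ch) for ch in s)
            | 
            |     print(skeleton("He1l0") == skeleton("Hello"))  # True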
        
             | wanderingstan wrote:
             | It's been a long time since I wrote this, but I think the
             | issue with that approach is the possibility of one
             | character being confusable with more than one letter. I.e.
             | there may not be a single correct form to reduce to.
        
         | wyldfire wrote:
          | > For example, the Greek capital Alpha (Α) looks like the
          | Latin uppercase A.
         | 
         | If they're truly drawn the same (are they?) then why have a
         | distinct encoding?
        
           | adzm wrote:
           | > If they're truly drawn the same (are they?) then why have a
           | distinct encoding?
           | 
           | They may be drawn the same or similar in some typefaces but
           | not all.
        
           | schoen wrote:
           | One argument would be that you can apply functions to change
           | their case.
           | 
            | For example in Python:
            | 
            |     >>> "ΑΡΕΤΗ".lower()
            |     'αρετη'
            |     >>> "AWESOME".lower()
            |     'awesome'
            | 
            | The Greek Α has lowercase form α, whereas the Roman A has
            | lowercase form a.
           | 
           | Another argument would be that you want a distinct encoding
           | in order to be able to sort properly. Suppose we used the
           | same codepoint (U+0050) for everything that looked like P.
            | Then Greek Ρόδος (Rhodes) would sort _before_ Greek Δήλος
            | (Delos), because the Roman P is numerically prior to the
            | Greek Δ in Unicode, even though Ρ comes after Δ in the Greek
            | alphabet.
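            | A quick check of that code-point ordering in Python (plain
            | sort, no collation):
            | 
            |     # With distinct Greek code points, a naive sort keeps Δ
            |     # before Ρ, matching the Greek alphabet for this pair.
            |     print(sorted(["Ρόδος", "Δήλος"]))
            |     # ['Δήλος', 'Ρόδος']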
        
             | mmoskal wrote:
             | Apparently this works very well, except for a single
              | letter, Turkish I. Turkish has two versions of 'i', and
              | Unicode folks decided to use the Latin 'i' for the
              | lowercase dotted i, and the Latin 'I' for the uppercase
              | dot-less I (and added two new code points for the
              | upper-case dotted İ and the lower-case dot-less ı).
             | 
             | Now, 'I'.lower() depends on your locale.
             | 
             | A cause for a number of security exploits and lots of pain
             | in regular expression engines.
             | 
              | edit: Well, apparently 'I'.lower() doesn't depend on locale
              | (so it's incorrect for Turkic languages); in JS you have to
              | use 'I'.toLocaleLowerCase('tr-TR'). Regexps don't support
              | it either.
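              | For instance, Python's str casing applies the locale-
              | independent Unicode default mappings:
              | 
              |     print("I".lower())   # 'i' (wrong for Turkish: 'ı')
              |     print("ı".upper())   # 'I'
              |     print("İ".lower())   # 'i' + U+0307, so len() == 2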
        
           | ninkendo wrote:
           | To me, it depends on what you think Unicode's priorities
           | should be.
           | 
           | Let's consider the opposite approach, that any letters that
           | render the same should collapse to the same code point. What
            | about the Cherokee letter "go" (Ꭺ) versus the Latin A? What if
           | they're not precisely the same? Should lowercase l and
           | capital I have the same encoding? What about the Roman
           | numeral for 1 versus the letter I? Doesn't it depend on the
           | font too? How exactly do you draw the line?
           | 
           | If Unicode sets out to say "no two letters that render the
           | same shall ever have different encodings", all it takes is
           | one counterexample to break software. And I don't think we'd
           | ever get everyone to agree on whether certain letters should
           | be distinct or not. Look at Han unification (and how poorly
           | it was received) for examples of this.
           | 
           | To me it's much more sane to say that some written languages
           | have visual overlap in their glyphs, and that's to be
           | expected, and if you want to prevent two similar looking
           | strings from being confused with one another, you're going to
           | have to deploy an algorithm to de-dupe them. (Unicode even
            | has an official list of these, called "confusables", devoted
            | to helping you solve this.)
        
           | layer8 wrote:
           | They can be drawn the same, but when combining fonts (one
           | latin, one greek), they might not. Or, put differently, you
           | don't want to require the latin and greek glyphs to be
           | designed by the same font designer so that "A" is consistent
           | with both.
           | 
           | There are more reasons:
           | 
           | - As a basic principle, Unicode uses separate encodings when
           | the lower/upper case mappings differ. (The one exception, as
           | far as I know, being the Turkish "I".)
           | 
           | - Unicode was designed for round-trip compatibility with
           | legacy encodings (which weren't legacy yet at the time). To
            | that effect, a given script would often be added as a whole,
            | in
           | a contiguous block, to simplify transcoding.
           | 
           | - Unifying characters in that way would cause additional
           | complications when sorting.
        
           | mgaunard wrote:
           | Because graphemes and glyphs are different things.
        
           | hanche wrote:
           | You may be amused to learn about these, then:
           | 
           | U+2012 FIGURE DASH, U+2013 EN DASH and U+2212 MINUS SIGN all
           | look exactly the same, as far as I can tell. But they have
           | different semantics.
        
             | layer8 wrote:
             | They don't necessarily look the same. The distinction is
             | typographic, and only indirectly semantic.
             | 
             | Figure dash is defined to have the same width as a digit
             | (for use in tabular output). Minus sign is defined to have
             | the same width and vertical position as the plus sign. They
             | may all three differ for typographic reasons.
        
               | hanche wrote:
               | Ah, good point. But typography is supposed to support the
               | semantics, so at least I was not totally wrong.
        
             | ahazred8ta wrote:
             | In Hawai`i, there's a constant struggle between the proper
             | `okina, left single quote, and apostrophe.
        
           | michaelt wrote:
           | Unicode's "Han Unification"
           | https://en.wikipedia.org/wiki/Han_unification aimed to create
           | a unified character set for the characters which are
           | (approximately) identical between Chinese, Japanese, Korean
           | and Vietnamese.
           | 
           | It turns out this is complex and controversial enough that
           | the wikipedia page is pretty gigantic.
        
           | andrewaylett wrote:
           | In some cases, because they have distinct encodings in a pre-
           | Unicode character set.
           | 
           | Unicode wants to be able to represent any legacy encoding in
            | a lossless manner. ISO 8859-7 encodes A and Α to different
            | code-points, and ISO 8859-5 has А at yet another code point,
           | so Unicode needs to give them different encodings too.
           | 
           | And, indeed, they _are_ different letters -- as sibling
           | comments point out, if you want to lowercase them then you
            | wind up with a, α, and а, and that's not going to work very
           | well if the capitals have the same encoding.
        
       | re wrote:
        | > Can you spot any difference between "blöb" and "blöb"?
       | 
       | It's tricky to try to determine this because normalization can
       | end up getting applied unexpectedly (for instance, on Mac,
       | Firefox appears to normalize copied text as NFC while Chrome does
       | not), but by downloading the page with cURL and checking the raw
       | bytes I can confirm that there is no difference between those two
       | words :) Something in the author's editing or publishing pipeline
       | is applying normalization and not giving her the end result that
        | she was going for.
        | 
        |   00009000: 0a3c 7020 6964 3d22 3066 3939 223e 4361  .<p id="0f99">Ca
        |   00009010: 6e20 796f 7520 7370 6f74 2061 6e79 2064  n you spot any d
        |   00009020: 6966 6665 7265 6e63 6520 6265 7477 6565  ifference betwee
        |   00009030: 6e20 e280 9c62 6cc3 b662 e280 9d20 616e  n ...bl..b... an
        |   00009040: 6420 e280 9c62 6cc3 b662 e280 9d3f 3c2f  d ...bl..b...?</
       | 
       | Let's see if I can get HN to preserve the different forms:
       | 
        | Composed: ü (U+00FC)  Decomposed: ü (U+0075 U+0308)
       | 
       | Edit: Looks like that worked!
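        | The same check is easy to reproduce in Python with unicodedata:
        | 
        |     import unicodedata
        | 
        |     composed = "\u00fc"      # ü as a single code point
        |     decomposed = "u\u0308"   # u + COMBINING DIAERESIS
        | 
        |     print(composed == decomposed)                        # False
        |     print(unicodedata.normalize("NFC", decomposed) == composed)
        |     # True
        |     print(unicodedata.normalize("NFD", composed) == decomposed)
        |     # True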
        
         | Eisenstein wrote:
         | Perhaps the author used the same character twice for effect,
         | not suspecting someone would use curl to examine the raw bytes?
        
         | mgaunard wrote:
         | I believe XML and HTML both require Unicode data to be in NFC.
        
           | fanf2 wrote:
           | I don't think so?
           | 
           | https://www.w3.org/TR/2008/REC-xml-20081126/#charsets
           | 
           | XML 1.1 says documents should be normalized but they are
           | still well-formed even if not normalized
           | 
           | https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-
           | normaliza...
           | 
           | But you should not use XML 1.1
           | 
           | https://www.ibiblio.org/xml/books/effectivexml/chapters/03.h.
           | ..
        
           | layer8 wrote:
           | You believe incorrectly. Not even Canonical XML requires
           | normalization:
           | https://www.w3.org/TR/xml-c14n/#NoCharModelNorm
        
           | mbrubeck wrote:
           | HTML does not require NFC (or any other specific
           | normalization form):
           | 
           | https://www.w3.org/International/questions/qa-html-css-
           | norma...
           | 
            | Neither does XML (though XML 1.0 recommends that element
           | names SHOULD be in NFC and XML 1.1 recommends that documents
           | SHOULD be fully normalized):
           | 
           | https://www.w3.org/TR/2008/REC-xml-20081126/#sec-
           | suggested-n...
           | 
           | https://www.w3.org/TR/xml11/#sec-normalization-checking
        
       | jph wrote:
       | Normalizing can help with search. For example for Ruby I maintain
       | this gem: https://rubygems.org/gems/sixarm_ruby_unaccent
        
         | noname120 wrote:
         | Wow the code[1] looks horrific!
         | 
          | Why not just do this: string → NFD → strip diacritics → NFC
          | (sketched below)? See [2] for more.
         | 
         | [1]
         | https://github.com/SixArm/sixarm_ruby_unaccent/blob/eb674a78...
         | 
         | [2] https://stackoverflow.com/a/74029319/3634271
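          | A minimal Python sketch of that pipeline (it only strips
          | combining marks, so letters like ł or ø that have no
          | decomposition pass through untouched):
          | 
          |     import unicodedata
          | 
          |     def unaccent(s: str) -> str:
          |         # NFD splits base letters from combining marks, the
          |         # filter drops the marks (category Mn), and NFC
          |         # recomposes whatever is left.
          |         nfd = unicodedata.normalize("NFD", s)
          |         kept = "".join(c for c in nfd
          |                        if unicodedata.category(c) != "Mn")
          |         return unicodedata.normalize("NFC", kept)
          | 
          |     print(unaccent("Führer"))  # 'Fuhrer'
          |     print(unaccent("Łódź"))    # 'Łodz' (Ł survives)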
        
       | chuckadams wrote:
       | Clearly the author already knows this, but it highlights the
       | importance of always normalizing your input, and consistently
       | using the same form instead of relying on the OS defaults.
        
         | mckn1ght wrote:
         | Also, never trust user input. File names are user inputs. You
         | can execute XSS attacks via filenames on an unsecured site.
        
         | makeitdouble wrote:
         | The larger point is probably that search and comparison are
         | inherently hard as what humans understand as equivalent isn't
         | the same for the machine. Next stop will be upper case and
         | lower case. Then different transcriptions of the same words in
         | CJK.
        
       | jesprenj wrote:
       | Should you really change filenames of users' files and depend on
       | the fact that they are valid utf8? Wouldn't it be better to keep
       | the original filename and use that most of the time sans the
       | searches and indexing?
       | 
        | Why don't you normalize Latin-alphabet filenames for indexing
        | even further -- allow searching for "Führer" with queries like
        | "Fuehrer" and "Fuhrer"?
        
         | zeroCalories wrote:
         | I generally agree that you shouldn't change the file name, but
         | in reality I bet OP stored it as another column in a database.
         | 
         | For more aggressive normalization like that, I think it makes
         | more sense to implement something like a spell checker that
         | suggests similar files.
        
       | noodlesUK wrote:
       | One thing that is very unintuitive with normalization is that
       | MacOS is much more aggressive with normalizing Unicode than
       | Windows or Linux distros. Even if you copy and paste non-
       | normalized text into a text box in safari on Mac, it will be
       | normalized before it gets posted to the server. This leads to
       | strange issues with string matching.
        
         | codesnik wrote:
          | I was really surprised when I realized that at least on HFS+,
          | Cyrillic is normalized too. For example, no Russian ever
          | thinks that Й is an И with some diacritics. It's a different
          | letter in its own right. But the Mac normalizes it into two
          | codepoints.
        
           | anamexis wrote:
           | Well, there's no expectation in unicode that something viewed
           | as a letter in its own right should use a single codepoint.
        
           | asveikau wrote:
           | I dislike explaining string compares to monolingual English
           | speakers who are programmers. Similar to this phenomenon of
            | Й/И is people who think n and ñ should compare equally, or c
            | and ç, or that the lowercase of I is always i (or that case
           | conversion is locale-independent).
           | 
           | In something like a code review, people will think you're
           | insane for pointing out that this type of assumption might
           | not hold. Actually, come to think of it, explaining
           | localization bugs at all is a tough task in general.
        
             | iforgotpassword wrote:
             | Well, I do like this behavior for search though. I don't
             | want to install a new keyboard layout just to be able to
             | search for a Spanish word.
        
               | david-gpu wrote:
               | Is the convenience of a few foreigners searching for
               | something more important than the convenience of the many
               | native speakers searching for the same?
               | 
               | Maybe we should start modifying the search behavior of
               | English words to make them more convenient for non-native
               | speakers as well. We could start by making "bed aidia"
               | match "bad idea", since both sound similar to my foreign
               | ears.
        
               | MrJohz wrote:
               | In fairness, for search, allowing multiple ways of typing
               | the same thing is probably the best choice: you can
               | prioritise true matches, where the user has typed the
               | correct form of the letter, but also allow for more
               | visual based matches. (Correcting common typos is also
               | very convenient even for native speakers of a language --
               | and of course a phonetic search that actually produced
               | good results would be wonderful, albeit I suspect
               | practically very difficult given just how many ways of
               | writing a given pronunciation there might be!)
        
               | makeitdouble wrote:
               | Search probably needs both modes. A literal and a fuzzy
               | one.
        
               | NeoTar wrote:
               | My brother recently asked for help in determining who a
               | footballer (soccer player) was from a photo. Like in many
                | sports, the jerseys have the player's name on the rear,
                | and this player's was in Cyrillic - Шунин (Anton Shunin)
                | - and my brother had tried searching for Wyhnh without
                | success.
               | 
               | Anyway, my point is that perhaps ideally (and maybe
               | search engines do this) the results should be determined
                | by the locale of the searcher. So someone in the English-
                | speaking world can find Łódź by searching for Lodz, but a
                | Pole may need to type Łódź. My brother could find Шунин
                | by typing Wyhnh, but a Russian could not...
        
             | yxhuvud wrote:
             | Or that sort order is locale independent. Swedish is a good
              | example here as åäö are sorted at the end, and where until
             | 2006 w was sorted as v. And then it changed and w is now
             | considered a letter of its own.
        
             | makeitdouble wrote:
              | The general reaction I've seen until now was "meh, we have
             | to make compromises (don't make me rewrite this for people
             | I'll probably never meet)"
             | 
             | Diacritics exacerbate this so much as they can be shared
              | between two languages yet have different rules/handling.
              | French typically has a decent amount and they're
              | meaningful, but it traditionally ignores them for
              | comparison (in
             | dictionary for instance). That makes it more difficult for
             | a dev to have an intuitive feeling of where it matters and
             | where it doesn't.
        
           | bawolff wrote:
           | Normalization isn't based on what language the text is.
           | 
           | NFC just means never use combining characters if possible,
           | and NFD means always use combining characters if possible. It
           | has nothing to do with whether something is a "real" letter
           | in a specific language or not.
           | 
            | Whether or not something is a "real" letter vs. a letter
            | with a modifier comes into play more in the Unicode
            | collation algorithm, which is a separate thing.
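            | For example, in Python:
            | 
            |     import unicodedata
            | 
            |     composed = "\u0419"  # Й as one code point
            |     print([hex(ord(c))
            |            for c in unicodedata.normalize("NFD", composed)])
            |     # ['0x418', '0x306'] -> И + COMBINING BREVE
            |     print(unicodedata.normalize("NFC", "\u0418\u0306")
            |           == composed)   # True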
        
         | creshal wrote:
          | MacOS creates _so_ many normalization problems in mixed
          | environments that it's not even funny any more. No common
          | server-side CMS etc. can deal with it, so the more Macs you add
          | to an organization, the more problems you get with inconsistent
          | normalization in your content. (And indeed, CMSes shouldn't
          | _have_ to second-guess users' intentions - diaereses and
          | umlauts are pronounced differently and I _should_ be able to
          | encode that difference, e.g. to better cue TTS.)
         | 
         | And, of course, the Apple fanboys will just shrug and suggest
         | you also convert the rest of the organization to Apple devices,
         | after all, if Apple made a choice, it can't be wrong.
        
           | fauigerzigerk wrote:
           | I'm not sure I understand. On the one hand you seem to be
           | saying that users should be able to choose which
           | normalisation form to use (not sure why). On the other hand
           | you're unhappy about macOS sending NFD.
           | 
           | If it's a user choice then CMSs have to be able to deal with
           | all normalisation forms anyway and shouldn't care one bit
           | whether macOS sends NFD or NFC. Mac users could of course
           | complain about their choice not being honoured by macOS but
           | that's of no concern to CMSs.
        
             | creshal wrote:
             | > On the other hand you're unhappy about macOS sending NFD.
             | 
              | Because MacOS _always_ uses it, regardless of the user's
              | intention, so it decomposes umlauts into diaereses (despite
              | them having different meanings and pronunciations) and
              | mangles Cyrillic, and probably more problems I haven't yet
              | run into.
        
               | kps wrote:
               | Unicode doesn't have 'umlauts', and (with a few
               | unfortunate exceptions) doesn't care about meanings and
               | pronunciations. From the Unicode perspective, what you're
               | talking about is the difference between Unicode
                | Normalization Form C:
                | 
                |     U+00FC LATIN SMALL LETTER U WITH DIAERESIS
                | 
                | and Unicode Normalization Form D:
                | 
                |     U+0075 LATIN SMALL LETTER U
                |     U+0308 COMBINING DIAERESIS
               | 
               | Unicode calls these two forms 'canonically equivalent'.
        
           | zh3 wrote:
           | Suspect you're getting downvoted because of the last
           | sentence. However, I do sympathise with MacOS tending to
           | mangle standard (even plain ASCII) text in a way that adds to
           | the workload for users of other OS's.
        
             | creshal wrote:
             | It adds to the workload of everyone, including the Apple
             | users. The latter ones are just in denial about it.
        
         | ttepasse wrote:
          | Unfun normalisation fact: You can't have a file named "ß" and
          | a file named "ss" in the same folder in Mac OS.
        
           | yxhuvud wrote:
           | So what happens if someone puts those two in a git repo and a
           | Mac user checks out the folder?
        
             | staplung wrote:
              |     git clone https://github.com/ghurley/encodingtest
              |     Cloning into 'encodingtest'...
              |     remote: Enumerating objects: 9, done.
              |     remote: Counting objects: 100% (9/9), done.
              |     remote: Compressing objects: 100% (5/5), done.
              |     remote: Total 9 (delta 1), reused 0 (delta 0),
              |     pack-reused 0
              |     Receiving objects: 100% (9/9), done.
              |     Resolving deltas: 100% (1/1), done.
              |     warning: the following paths have collided (e.g.
              |     case-sensitive paths on a case-insensitive filesystem)
              |     and only one from the same colliding group is in the
              |     working tree:
              | 
              |       'ss'
              |       'ß'
        
               | Twisol wrote:
               | I have this issue on occasion with older mixed C/C++
               | codebases that use `.c` for C files and `.C` for C++
               | files. Maddening.
        
               | Athas wrote:
               | I never understood the popularity of the '.C' extension
               | for C++ files. I have my own preference (.cpp), but it's
               | essentially arbitrary compared to most other common
               | alternatives (.cxx, .c++). The '.C' extension is the only
               | one that just seems worse (this case sensitivity issue,
               | and just general confusion given how similar '.c' looks
               | to '.C'.
               | 
               | But even more than that, I just don't get how C++ turns
               | into 'C' at all. It seems actively misleading.
        
             | tetromino_ wrote:
             | EEXIST
        
           | eropple wrote:
           | This shows up in other places, too. One of my Slacks has a
            | textji of `groß`, because I enjoy making our German
           | speakers' teeth grind, but you sure can just type `:gross:`
           | to get it.
        
           | bawolff wrote:
           | That's less a normal form issue and more a case-insensitivity
           | issue. You also can't have a file named "a" and one named "A"
           | in the same folder.
        
             | samatman wrote:
              | That would be true if the test strings were "SS" and "ss",
              | because although "ẞ" is a valid capitalization of "ß",
              | it's officially a newcomer. It's more of a hybrid issue: it
              | appears that APFS uses uppercasing for case-insensitive
              | comparison, and also uppercases "ß" to "SS", not "ẞ".
              | This is the default casing; Unicode also defines a
              | "tailored casing" which doesn't have this property.
              | 
              | So it isn't _per se_ normalization, but it's not _not_
              | normalization either. In any case (heh) it's a weird thing
              | that probably shouldn't happen. Worth noting that APFS
              | doesn't normalize file names, but normalization happens
              | higher up in the toolchain, which has made some things
              | better and others worse.
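              | Python shows the default (non-tailored) mappings in
              | question:
              | 
              |     print("ß".upper())      # 'SS'
              |     print("ẞ".lower())      # 'ß'
              |     print("ß".casefold())   # 'ss' - hence the collision
              |     print("SS".casefold())  # 'ss'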
        
         | sorenjan wrote:
          | I sometimes see texts where ä is rendered as a¨, i.e. with the
          | dots next to the a instead of above it, even though it's a
          | completely different letter and not a version of a. I managed
         | to track the issue down to MacOS' normalization, but it has
         | happened on big national newspapers' websites and similar. I
         | haven't seen it in a while, maybe Firefox on Windows renders it
         | better or maybe various publishing tools have fixed it. It
         | looks really unprofessional which is a bit strange since I
         | thought Apple prides themselves on their typography.
        
           | iforgotpassword wrote:
           | I have that in gnome terminal. The dots always end up on the
           | letter after, not before. At least makes it easy to spot
           | filenames in decomposed form so I can fix them.
        
           | aidos wrote:
            | I have never seen that in all my years on a Mac (though
            | admittedly I'm not dealing in languages where I encounter it
            | often). I'm assuming there's an issue with the GPOS table in
            | the font you're using, so the dots aren't negatively shifted
            | into position as they should be?
        
         | yxhuvud wrote:
          | On the other hand, stuff written on Macs is a lot more likely
          | to require normalization in the first place.
        
       | mawise wrote:
       | I ran into this building search for a family tree project. I
       | found out that Rails provides
       | `ActiveSupport::Inflector.transliterate()` which I could use for
       | normalization.
        
       | Havoc wrote:
        | For those intrigued by this sort of thing, check out the tech
        | talk "Plain Text" by Dylan Beattie.
       | 
       | Absolute gem. His other talks are entertaining too
        
         | hanche wrote:
         | He seems to have done that talk several times. I watched the
         | 2022 one. Time well spent!
        
       | keybored wrote:
       | I try to avoid Unicode in filenames (I'm on Linux). It seems that
       | a lot of normal users might have the same intuition as well? I
       | get the sense that a lot will instinctually transcode to ASCII,
       | like they do for URLs.
        
         | zzo38computer wrote:
         | I also try to avoid non-ASCII characters in file names (and I
         | am also on Linux). I also like to avoid spaces and most
         | punctuations in file names (if I need word separation I can use
         | underscores or hyphens).
        
           | skissane wrote:
           | Sometimes I wish they had disallowed spaces in file names.
           | 
           | Historically, many systems were very restrictive in what
           | characters are allowed in file names. In part in reaction to
           | that, Unix went to the other extreme, allowing any byte
           | except NUL and slash.
           | 
           | I think that was a mistake - allowing C0 control characters
           | in file names (bytes 0x01 thru 0x1F) serves no useful use
           | case, it just creates the potential for bugs and security
           | vulnerabilities. I wish they'd blocked them.
           | 
           | POSIX debated banning C0 controls, although appears to have
           | settled on just a recommendation (not a mandate) that
           | implementations disallow newline:
           | https://www.austingroupbugs.net/view.php?id=251
        
       | juujian wrote:
       | I ran into encoding problems so many times, I just use ASCII
       | aggressively now. There is still kanji, Hanzi, etc. but at least
       | for Western alphabets, not worth the hassle.
        
         | layer8 wrote:
         | The article isn't about non-Unicode encodings.
        
           | juujian wrote:
           | Meant to write ASCII
        
         | zzo38computer wrote:
         | I also just use ASCII when possible; it is the most likely to
         | work and to be portable. For some purposes, other character
         | sets/encodings are better, but which ones are better depends on
         | the specific case (not only what language of text but also the
         | use of the text in the computer, etc).
        
         | arp242 wrote:
         | This works fine as a personal choice, but doesn't really work
         | if you're writing something other random people interact with.
         | 
         | Even for just English it doesn't work all that well because it
         | lacks things like the Euro which is fairly common (certainly in
         | Europe), there are names with diacritics (including "native"
         | names, e.g. in Ireland it's common), there are too many
         | loanwords with diacritics, and ASCII has a somewhat limited set
         | of punctuation.
         | 
         | There are some languages where this can sort of work (e.g.
         | Indonesian can be fairly reliably written in just ASCII),
          | although even there you will run into some of these issues. It
         | certainly doesn't work for English, and even less for other
         | Latin-based European languages.
        
       | blablabla123 wrote:
       | As a German macOS user with US keyboard I run into a related
       | issue every now and then. What's nice about macOS is I can easily
       | combine Umlaute but also other common letters from European
       | languages without any extra configuration. But some (Web)
        | Applications stumble over it while entering text, because the
        | input arrives as: 1. ¨ (Option-u)  2. u (u pressed)
        
         | kps wrote:
         | Early on, Netscape effectively exposed Windows keyboard events
         | directly to Javascript, and browsers on other platforms were
         | forced to try to emulate Windows events, which is necessarily
         | imperfect given different underlying input systems. "These
         | features were never formally specified and the current browser
         | implementations vary in significant ways. The large amount of
         | legacy content, including script libraries, that relies upon
         | detecting the user agent and acting accordingly means that any
         | attempt to formalize these legacy attributes and events would
         | risk breaking as much content as it would fix or enable.
         | Additionally, these attributes are not suitable for
         | international usage, nor do they address accessibility
         | concerns."
         | 
         | The current method is much better designed to avoid such
         | problems, and has been supported by all major browsers for
          | quite a while now (the laggard Safari having arrived 7 years
          | ago this Tuesday).
         | 
         | https://www.w3.org/TR/uievents
        
       | NotYourLawyer wrote:
       | ASCII should be enough for anyone.
        
         | zzo38computer wrote:
         | ASCII is good for a lot of stuff, but not for everything.
         | Sometimes, other character sets/encodings will be better, but
         | which one is better depends on the circumstances. (Unicode does
         | have many problems, though. My opinion is that Unicode is no
         | good.)
        
         | hanche wrote:
         | And who needs more than 640 kilobytes of memory anyhow?
        
           | mckn1ght wrote:
           | Don't forget butterflies in case you need to edit some text.
        
       | userbinator wrote:
       | _its[sic] 2024, and we are still grappling with Unicode character
       | encoding problems_
       | 
       | More like "because it's 2024." This wouldn't be a problem before
       | the complexity of Unicode became prevalent.
        
         | bornfreddy wrote:
         | You mean this wouldn't be a problem if we used the myriad
         | different encodings like we did before Unicode, because we
         | would probably not be able to even save the files anyway? So
         | true.
        
           | userbinator wrote:
           | Before Unicode, most systems were effectively "byte-
           | transparent" and encoding only a top-level concern. Those
           | working in one language would use the appropriate encoding
           | (likely CP1252 for most Latin languages) and there wouldn't
           | be confusion about different bytes for same-looking
           | characters.
        
             | bawolff wrote:
             | My understanding is way back in the day, people would use
             | ascii backspace to combine an ascii letter with an ascii
             | accent character.
        
             | deathanatos wrote:
             | A single user system, perhaps.
             | 
             | I've worked on a system that ... well, didn't _predate_
             | Unicode, but was sort of near the leading edge of it and
             | was multi-system.
             | 
             | The database columns containing text were all byte arrays.
             | And because the client (a Windows tool, but honestly Linux
             | isn't any better off here) just took a LPCSTR or whatever,
             | it they bytes were just in whatever locale the client was.
             | But that was recorded nowhere, and of course, all the rows
             | were in different locales.
             | 
             | I think that would be far more common, today, if Unicode
             | had never come along.
        
         | n2d4 wrote:
         | You make it sound like non-English languages were invented in
         | 2024
        
         | bawolff wrote:
         | Combining characters go back to the 90s. The unicode normal
         | forms were defined in the 90s. None of this is new at this
         | point.
        
       | raffy wrote:
       | I created a bunch of Unicode tools during development of ENSIP-15
       | for ENS (Ethereum Name Service)
       | 
       | ENSIP-15 Specification: https://docs.ens.domains/ensip/15
       | 
       | ENS Normalization Tool: https://adraffy.github.io/ens-
       | normalize.js/test/resolver.htm...
       | 
       | Browser Tests: https://adraffy.github.io/ens-
       | normalize.js/test/report-nf.ht...
       | 
        | 0-dependency JS Unicode 15.1 NFC/NFD Implementation [10KB]
       | https://github.com/adraffy/ens-normalize.js/blob/main/dist/n...
       | 
       | Unicode Character Browser: https://adraffy.github.io/ens-
       | normalize.js/test/chars.html
       | 
       | Unicode Emoji Browser: https://adraffy.github.io/ens-
       | normalize.js/test/emoji.html
       | 
       | Unicode Confusables: https://adraffy.github.io/ens-
       | normalize.js/test/confused.htm...
        
       | josephcsible wrote:
       | IMO, it was a mistake for Unicode to provide multiple ways to
       | represent 100% identical-looking characters. After all, ASCII
       | doesn't have separate "c"s for "hard c" and "soft c".
        
         | striking wrote:
         | If you take a peek at an extended ASCII table (like the one at
         | https://www.ascii-code.com/), you'll notice that 0xC5 specifies
         | a precomposed capital A with ring above. It predates Unicode.
         | Accepting that that's the case, and acknowledging that forward
         | compatibility from ASCII to Unicode is a good thing (so we
         | don't have any more encodings, we're just extending the most
         | popular one), and understanding that you're going to have the
         | ring-above diacritic in Unicode anyway... you kind of just end
         | up with both representations.
        
           | arp242 wrote:
           | Everything can just be pre-composed; Unicode doesn't _need_
           | composing characters.
           | 
           | There's history here, with Unicode originally having just 65k
           | characters, and hindsight is always 20/20, but I do wish
           | there was a move towards deprecating all of this in favour of
           | always using pre-composed.
           | 
           | Also: what you linked isn't "ASCII" and "extended ASCII"
           | doesn't really mean anything. ASCII is a 7-bit character set
           | with 128 characters, and there are dozens, if not hundreds,
           | of 8-bit character sets with 256 characters. Both CP-1252 and
           | ISO-8859-1 saw wide use for Latin alphabet text, but others
           | saw wide use for text in other scripts. So if you give me a
           | document and tell me "this is extended ASCII" then I still
            | don't know how to read it and will have to trial-and-error
           | it.
           | 
           | I don't think Unicode after U+007F is compatible with any
           | specific character set? To be honest I never checked, and I
           | don't see in what case that would be convenient. UTF-8 is
           | only compatible with ASCII, not any specific "extended
           | ASCII".
        
             | zokier wrote:
             | For roundtripping e.g. https://en.wikipedia.org/wiki/VSCII
             | you do need both composing characters and precomposed
             | characters.
        
             | bandrami wrote:
             | > Unicode doesn't need composing characters
             | 
             | But it does, IIRC, for both Bengali and Telugu.
        
               | arp242 wrote:
               | Only because they chose to do it like that. It doesn't
               | need to.
        
             | adrian_b wrote:
             | In my opinion, only the reverse could be true, i.e. that
             | Unicode does not need pre-composed characters because
             | everything can be written with composing characters.
             | 
             | The pre-composed characters are necessary only for
             | backwards compatibility.
             | 
             | It is completely unrealistic to expect that Unicode will
             | ever provide all the pre-composed characters that have ever
             | been used in the past or which will ever be desired in the
             | future.
             | 
             | There are pre-composed characters that do not exist in
             | Unicode because they have been very seldom used. Some of
             | them may even be unused in any language right now, but they
             | have been used in some languages in the past, e.g. in the
             | 19th century, but then they have been replaced by
             | orthographic reforms. Nevertheless, when you digitize and
             | OCR some old book, you may want to keep its text as it was
             | written originally, so you want the missing composed
             | characters.
             | 
             | Another case that I have encountered where I needed
             | composed characters not existing in Unicode was when
             | choosing a more consistent transliteration for languages
             | that do not use the Latin alphabet. Many such languages use
             | quite bad transliteration systems, precisely because
             | whoever designed them has attempted to use only whatever
             | restricted character set was available at that time. By
             | choosing appropriate composing characters it is possible to
             | design improved transliterations.
        
             | kps wrote:
             | > _I don 't think Unicode after U+007F is compatible with
             | any specific character set?_
             | 
             | The 'early' Unicode alphabetic code blocks came from ISO
              | 8859 encodings [1], e.g. the Unicode Cyrillic block follows
             | ISO 8859-5, the Greek and Coptic block follows ISO 8859-7,
             | etc.
             | 
              | [1] https://en.wikipedia.org/wiki/ISO/IEC_8859
        
         | fhars wrote:
         | Unicode was never designed for ease of use or efficiency of
         | encoding, but for ease of adoption. And that meant that it had
         | to support lossless round trips from any legacy format to
         | Unicode and back to the legacy format, because otherwise no
          | decision maker would have approved starting a transition to
          | Unicode for important systems.
         | 
         | So now we are saddled with an encoding that has to be bug
         | compatible with any encoding ever designed before.
        
         | pavel_lishin wrote:
         | It might not be ludicrous to suggest that the English letter
         | "a" and the Russian letter "a" should be a single entity, if
         | you don't think about it very hard.
         | 
          | But the English letter "c" and the Russian letter "с" (which
          | sounds like "s") are completely different characters, even if
          | at a glance they look the same - they make completely
          | different sounds, and _are_
         | different letters. It _would_ be ludicrous to suggest that they
         | should share a single symbol.
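          | And indeed Unicode encodes them separately:
          | 
          |     latin_c, cyrillic_es = "c", "\u0441"
          |     print(latin_c == cyrillic_es)  # False
          |     print(hex(ord(latin_c)), hex(ord(cyrillic_es)))
          |     # 0x63 0x441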
        
         | bawolff wrote:
         | Maybe, but then you can no longer round trip with other
         | encodings, which seems worse to me.
        
       | ulrischa wrote:
       | It is really so awful that we have to deal with encoding issues
       | in 2024.
        
       | mglz wrote:
        | My last name contains an ü and it has been consistently
        | horrible.
        | 
        | * When I try to preemptively replace ü with ue, many
        | institutions and companies refuse to accept it because it does
        | not match my passport
        | 
        | * Especially in France, clerks try to emulate ü with the
        | diacritic used for the tréma e, ë. This makes it virtually
        | impossible to find me in a system again
        | 
        | * Sometimes I can enter my name as-is and there seems to be no
        | problem, only for some other system to mangle it into question
        | marks, mojibake, or a box. This often triggers errors downstream
        | that I have no way of fixing
        | 
        | * Sometimes, people print a u and add the dots by hand on
        | the label. This is nice, but still somehow wrong.
        | 
        | I wonder what the solution is. Give up and ask people to
        | consistently use an ASCII-only name? Allow everybody 1000+
        | Unicode characters as a name and go off that string? Officially
        | change my name?
        
         | userbinator wrote:
         | Everyone's name should just be a GUID. /s
        
           | BuyMyBitcoins wrote:
           | Falsehoods Programmers Believe About Names, #41 - People have
           | GUIDs.
           | 
           | https://www.kalzumeus.com/2010/06/17/falsehoods-
           | programmers-...
        
         | makeitdouble wrote:
         | The part I came to love about France in general is that while
         | all of these are broken, the people dealing with it will
         | completely agree it's broken and amply sympathize, but just
         | accept your name is printed as Gnter.
         | 
         | Same for names that don't fit field lengths, addresses that
         | require street numbers etc. It's a real pain to deal with all
         | of it and each system will fail in its own way to make your
         | life a mess, but people will embrace the mess and won't blink
            | an eye when you bring papers that just don't match.
        
           | zokier wrote:
           | Under GDPR people have the right to have their personal data
           | to be accurate, there was a legal case exactly about this:
           | https://news.ycombinator.com/item?id=38009963
        
             | makeitdouble wrote:
              | That's a pretty unexpected twist, and I'm thrilled by it.
             | 
             | I don't see every institution come up with a fix anytime
             | soon, but having it clear that they're breaking the law is
             | such a huge step. That will also have a huge impact on bank
             | system development, and I wonder how they'll do it (extend
             | the current system to have the customer facing bits
             | rewritten, or just redo it all from top to bottom)
             | 
             | There is the tale of Mizuho bank [0], botching their system
             | upgrade project so hard they were still seeing widespread
             | failures after a decade into it.
             | 
              | [0] https://www.japantimes.co.jp/news/2022/02/11/business/mizuho...
        
         | zokier wrote:
         | Germans have of course a standard for this
         | 
         | > a normative subset of Unicode Latin characters, sequences of
         | base characters and diacritic signs, and special characters for
         | use in names of persons, legal entities, products, addresses
         | etc
         | 
         | https://en.wikipedia.org/wiki/DIN_91379
        
           | em-bee wrote:
           | and it's used in the passport too. so names with umlaut show
           | up in both forms and it is possible to match either form
        
         | samatman wrote:
         | The only solution is going to be a lot of patience,
         | unfortunately.
         | 
         | Everyone should be storing strings as UTF-8, and any time
         | strings are being compared they should undergo some form of
         | normalization. Doesn't matter which, as long as it's
         | consistent. There's no reason to store string data in any other
         | format, and any comparison code which isn't normalizing is
         | buggy.
         | 
         | But thanks to institutional inertia, it will be a very long
         | time before everything works that way.
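          | A minimal sketch of what normalize-before-compare looks like
          | in Python (the choice of NFC here is arbitrary):
          | 
          |     import unicodedata
          | 
          |     def same_text(a: str, b: str, form: str = "NFC") -> bool:
          |         return (unicodedata.normalize(form, a)
          |                 == unicodedata.normalize(form, b))
          | 
          |     print("M\u00fcller" == "Mu\u0308ller")           # False
          |     print(same_text("M\u00fcller", "Mu\u0308ller"))  # True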
        
           | lmm wrote:
           | > Everyone should be storing strings as UTF-8, and any time
           | strings are being compared they should undergo some form of
           | normalization. Doesn't matter which, as long as it's
           | consistent. There's no reason to store string data in any
           | other format, and any comparison code which isn't normalizing
           | is buggy.
           | 
           | This will result in misprinting Japanese names (or
           | misprinting Chinese names depending on the rest of your
           | system).
        
             | earthboundkid wrote:
             | Can we please talk about Unicode without the myth of Han
             | Unification being bad somehow? The problem here is exactly
             | the lack of unification in Roman alphabets!
        
             | RedNifre wrote:
             | How?
        
         | zokier wrote:
          | > * Especially in France, clerks try to emulate ü with the
          | diacritic used for the tréma e, ë. This makes it virtually
          | impossible to find me in a system again
          | 
          | In Unicode the umlaut and the diaeresis are both represented
          | by the same codepoint, U+0308 COMBINING DIAERESIS.
         | 
         | https://en.wikipedia.org/wiki/Umlaut_(diacritic)
        
         | ulucs wrote:
          | Can ü be printed on a passport rather than a u? I have a ş and
          | a ç, so I have been successfully substituting s and c for them
          | in a somewhat consistent manner.
        
         | lmm wrote:
          | > Give up and ask people to consistently use an ASCII-only
          | name?
         | 
         | > Officially change my name?
         | 
         | Yes. That's the only one that's going to actually work. You can
          | go on about how these systems ought to work until the
         | cows come home, and I'm sure plenty of people on HN will, but
         | if you actually want to get on with your life and avoid
         | problems, legally change your name to one that's short and
         | ascii-only.
        
       | CoastalCoder wrote:
       | Isn't u/u-encoding a solved problem on Unix systems?
       | 
       | </joke>
        
       | earthboundkid wrote:
       | This isn't an encoding problem. It's a search problem.
        
       | ComputerGuru wrote:
       | ZFS can be configured to force the use of a particular normalized
       | Unicode form for all filenames. Amazing filesystem.
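        | If memory serves, it's a create-time dataset property, along
        | the lines of:
        | 
        |     zfs create -o normalization=formD -o utf8only=on tank/home
        |     zfs get normalization tank/home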
        
       ___________________________________________________________________
       (page generated 2024-03-24 23:00 UTC)