[HN Gopher] Unicode Normalization Forms: When ö ≠ ö
___________________________________________________________________
Unicode Normalization Forms: When ö ≠ ö
Author : ocrb
Score : 82 points
Date : 2021-12-31 19:32 UTC (3 hours ago)
(HTM) web link (blog.opencore.ch)
(TXT) w3m dump (blog.opencore.ch)
| mannerheim wrote:
| Duolingo doesn't handle Unicode normalisation for certain
| languages, and it's incredibly frustrating. Here's one example[0]
| (Vietnamese) and I know it's the case for Yiddish as well.
|
| [0]: https://forum.duolingo.com/comment/17787660/Bug-Correct-Viet...
| nixpulvis wrote:
| Half normal isn't normal. That said, I personally try to avoid
| unicode in filenames (and caps too) for similar reasons.
| javajosh wrote:
| tl;dr - don't use crazy unicode characters in filenames; they can
| be problematic for non-trivial reasons (in this case, because of
| Unicode normalization on an SMB mount).
| int_19h wrote:
| What's "crazy" about the letter? It's a standard letter of
| several European alphabets.
| drpixie wrote:
| Nothing crazy about the "letter", but it is crazy that there
| are multiple different ways to encode the "letter".
| hinkley wrote:
| Reading about unicode has made me much, much more circumspect
| about the meaning of != in languages, and what fall-through
| behavior should look like. Unicode domain names lasted for a hot
| minute until someone registered microsoft.com with Cyrillic
| letters.
|
| Years ago I read a rant by someone who insisted that being able
| to mix arbitrary languages into a single String object makes
| sense for linguists but for most of us we would be better off
| being able to assert that a piece of text was German, or
| Sanskrit, not a jumble of both. It's been living rent-free in my
| head for almost two decades and I can't agree with it, nor can I
| laugh it off.
|
| It might have been better if the 'code pages' idea was refined
| instead of eliminated (that is, the string uses one or more code
| pages, not the process). I don't know what the right answer is,
| but I know Every X is a Y almost always gets us into trouble.
| LAC-Tech wrote:
| Sprinkling English with foreign words is really, really common.
| I'm in New Zealand and people do it all the time. And even in
| the states, right? Don't want two different strings because
| someone writes an English sentence about how much they love
| jalapeño.
| mr_luc wrote:
| Heh, funny, I'm implementing this _exact_ thing at the moment,
| oddly enough -- rather, implementing a security check that
| provides that same guarantee you mention, Mixed Script
| protections.
|
| In Unicode spec terms, 'UTS #39 (Security)' contains the
| description of how to do this, mostly in section 5, and it
| relies on 'UAX #24 (Scripts)'.
|
| It's more nuanced than your example but only slightly. If you
| replace "German" with "Japanese" you're talking about multiple
| scripts in the same 'writing system', but the spec provides
| files with the lists of 'sets of scripts' each character
| belongs to.
|
| The way that the spec tells us to ensure that the word
| 'microsoft' isn't made up of fishy characters is that we just
| keep the intersection of each character's augmented script
| sets. If at the end, that intersection is empty, that's often
| fishy -- ie, there's no intersection between '{Latin},
| {Cyrillic}'.
|
| However, the spec allows the legit uses of writing systems that
| use more than one script; the lookup procedure outlined in the
| spec could give script sets like '{Jpan, Kore, Hani, Hanb},
| {Jpan, Kana}' for two characters, and that intersection isn't
| empty; it'd give us the answer "Okay, this word is contained
| within the Japanese writing system".
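The script-set intersection described above can be sketched in a few lines of Python. The stdlib does not expose the Unicode Script property, so this sketch uses a tiny hand-rolled table covering only the characters in the example; a real implementation would load Scripts.txt and ScriptExtensions.txt from the UCD, as UTS #39 describes.

```python
# Sketch of the UTS #39 mixed-script check: intersect the script
# set of every character; an empty intersection is suspicious.
SCRIPTS = {
    # Latin letters
    **{c: {"Latin"} for c in "abcdefghijklmnopqrstuvwxyz"},
    # A couple of Cyrillic look-alikes (hand-picked for this example)
    "\u043e": {"Cyrillic"},  # CYRILLIC SMALL LETTER O, looks like 'o'
    "\u0441": {"Cyrillic"},  # CYRILLIC SMALL LETTER ES, looks like 'c'
}

def script_intersection(word):
    """Intersect the script sets of every character in the word."""
    sets = [SCRIPTS.get(ch, {"Common"}) for ch in word]
    result = sets[0]
    for s in sets[1:]:
        # Characters in the Common script match any script.
        if s != {"Common"}:
            result = result & s if result != {"Common"} else s
    return result

# All-Latin word: non-empty intersection, looks fine.
print(script_intersection("microsoft"))          # {'Latin'}
# Same word with a Cyrillic 'о' swapped in: empty intersection, fishy.
print(script_intersection("micr\u043esoft"))     # set()
```

For real writing systems like Japanese, the script sets per character are larger (e.g. {Jpan, Kana}), and a non-empty intersection correctly admits the mixed-script word.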
| drdaeman wrote:
| > but for most of us we would be better off
|
| That's simple - it is provably wrong. While relatively uncommon,
| there are plenty of examples that contradict this
| statement. And it's not about being able to encode the Rosetta
| Stone - non-scientists mix languages all the time, from Carmina
| Burana to Blinkenlights. They even make meaningful portmanteau
| words and write them with characters from multiple unrelated
| writing systems, like "zashitano" (see - Latin and Cyrillic
| scripts in the same single word!)
| david-gpu wrote:
| _> Years ago I read a rant by someone who insisted that being
| able to mix arbitrary languages into a single String object
| makes sense for linguists but for most of us we would be better
| off being able to assert that a piece of text was German, or
| Sanskrit, not a jumble of both._
|
| Presumably the person who wrote it speaks a single language.
|
| Just because something is not useful to them, it doesn't mean
| it is not useful in general. There are millions of polyglots as
| well as documents that include words and names in multiple
| scripts.
| jerf wrote:
| I think in that case the idea would either be that you should
| then have an array of strings, each of which may have its own
| language set, or that the string should be labelled as
| "containing Latin and Cyrillic", but still not able to
| include arbitrary other characters from Unicode. And multi-
| lingual text still generally breaks on _words_... Kilobytes
| of Latin text with a single Cyrillic character in the middle
| of a word is very suspicious, in a way that kilobytes of
| Latin text with a single Cyrillic _word_ isn't.
|
| Of course you'd always need an "unrestricted" string (to
| speak to the rest of the system if necessary), but there are
| very few natural strings out there in the world that consist
| of half-a-dozen languages just mishmashed together. Those
| exceptions can be treated as _exceptions_.
| danudey wrote:
| In our Jenkins system, we have remote build nodes return data
| back to the primary node via environment variable-style
| formatted files (e.g. FOO=bar), so when I had to send back a
| bunch of arbitrary multi-line textual data, I decided to base64
| encode it. Simple enough.
|
| On *nix systems, I ran this through the base64 command; the
| data was UTF-8, which meant that in practice it was ASCII
| (because we didn't have any special characters in our commit
| messages).
|
| On Windows systems... oh god. The system treats all text as
| UTF-16 with whatever byte order, and it took me ages to figure
| out how to get it to convert the data to UTF-8 before encoding
| it. Eventually it started working, and it worked for a while
| until it didn't for whatever reason. I ended up tearing out all
| the code and just encoding the UTF-16 in base64 and then
| processing that into UTF-8 on the master where I had access to
| much saner tools.
|
| Generally speaking, "Unicode" works great in most cases, but
| when you're dealing with systems with weird or unusual encoding
| habits, like Windows using UTF-16 or MySQL's "utf8" being
| limited to three bytes per unicode character instead of four,
| everything goes out the window and it's the wild west all over
| again.
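The re-encoding step described above can be sketched in Python (the commit-message text here is made up for illustration): decode the UTF-16 data with its known byte order, re-encode as UTF-8, and only then base64 it, so every consumer decodes it the same way.

```python
import base64

# Hypothetical commit message captured from a Windows tool as UTF-16-LE bytes.
utf16_bytes = "fix: handle ö in filenames".encode("utf-16-le")

# Decode with the known source encoding, re-encode as UTF-8, then base64.
text = utf16_bytes.decode("utf-16-le")
token = base64.b64encode(text.encode("utf-8")).decode("ascii")

# The receiving side can now decode unambiguously, regardless of platform:
roundtripped = base64.b64decode(token).decode("utf-8")
assert roundtripped == "fix: handle ö in filenames"
```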
| int_19h wrote:
| You can already map Unicode ranges to "code pages" of sorts, so
| how would that help?
|
| Thing is, people who are not linguists _do_ want to mix
| languages. It's very common in some cultures to intersperse
| the native language with English. But even if not, if the
| language in question uses a non-Latin alphabet, there are often
| bits and pieces of data that have to be written down in Latin.
| So that "most of us" perspective is really "most of us in US
| and Western Europe", at best.
|
| For domains and such, what I think is really needed is a new
| definition of string equality that boils down to "are people
| likely to consider these two the same?". So that would e.g.
| treat similarly-shaped Latin/Greek/Cyrillic letters the same.
| jrochkind1 wrote:
| Oh, you can do far more than "code pages of sorts". Unicode
| has a variety of metadata available about each codepoint. The
| things that are "code pages of sorts" are maybe "block" (for
| ö, "Latin-1 Supplement") and "plane" (for ö, the "Basic
| Multilingual Plane"), but those are really mostly
| administrative and probably not what you want.
|
| But you also have "Script" (for ö, "Latin"). Some characters
| belong to more than one script though. Unicode will tell you
| that.
|
| Unicode also has a variety of algorithms available already
| written. One of the most relevant ones here is...
| normalization. To compare two strings in the broadest
| semantic sense of "are people likely to consider these the
| same", you want a "compatibility" normalization, NFKC or
| NFKD. They will for instance make `1` and `¹` (superscript
| one) the same, which is definitely one kind of "consider
| these the same" -- very useful for, say, a search index.
|
| That won't be iron-clad, but it will be better than trying
| to roll your own algorithm involving looking at character
| metadata yourself! But it won't get you past intentional
| attacks using "look-alike" characters that are actually
| different semantically but look similar/indistinguishable
| depending on font. The trick is "consider these the same"
| really, it turns out, depends on context and purpose, it's
| not always the same.
|
| Unicode also has a variety of useful guides as part of the
| standard, including the guide to normalization
| https://unicode.org/reports/tr15/ and some guides related to
| security (such as https://unicode.org/reports/tr36/ and
| http://unicode.org/reports/tr39/), all of which are relevant
| to this concern, and suggest approaches and algorithms.
|
| Unicode has a LOT of very clever stuff in it to handle the
| inherently complicated problem of dealing with the entire
| universe of global languages that Unicode makes possible. It
| pays to spend some time with them.
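The difference between canonical (NFC/NFD) and compatibility (NFKC/NFKD) normalization is easy to check with Python's stdlib `unicodedata`:

```python
import unicodedata

# Canonical equivalence: NFC/NFD treat composed and decomposed ö the same.
composed = "\u00f6"      # ö as a single codepoint
decomposed = "o\u0308"   # 'o' + COMBINING DIAERESIS
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

# Compatibility equivalence: NFKC/NFKD additionally fold characters
# like superscript one down to their plain counterparts.
assert "\u00b9" != "1"                                     # '¹' vs '1'
assert unicodedata.normalize("NFKC", "\u00b9") == "1"
assert unicodedata.normalize("NFC", "\u00b9") == "\u00b9"  # NFC keeps '¹'
```

This is why a search index typically wants NFKC, while a filesystem only wants canonical normalization.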
| BlueTemplar wrote:
| Yeah, the Greek alphabet is used _a lot_ in the sciences. It's
| really annoying that we're only starting to get proper
| support _now_. (Including on keyboards:
| http://norme-azerty.fr/en/ )
| wisty wrote:
| What's a word? (A quick test - how many words were in the
| previous sentence, maybe 3 or 4 depending on whether the 's is
| part of a word; so can we talk about Johannesson's foreign
| policy?).
|
| It's hard enough to know what a letter is in unicode. Breaking
| things into words is just another massive headache.
| Someone wrote:
| That doesn't make sense to me. Even disregarding cases where
| people mix languages (how do you write a dictionary? If the
| answer is "just create a data structure combining multiple
| strings", shouldn't we standardize how to do that?), all
| languages share thousands of symbols such as currency symbols,
| mathematical symbols, Greek and Hebrew alphabets (to be used in
| math books written in the language), etc. So, even languages
| such as Greek and English share way more symbols than that they
| have unique ones.
| jrochkind1 wrote:
| it seems like a bug that to get consistent unicode normalization
| you need to flip a non-default config option. What am I missing?
| tpmx wrote:
| As a European, I _kinda_ miss iso-8859-1 being used everywhere.
| 0x0 wrote:
| Java is terrible in this regard, as most file APIs use
| "java.lang.String" to identify the filename, whose interpretation
| most of the time depends on the system property "file.encoding".
| The result is that there will be files you can never read from a
| Java application if the filename encoding does not match
| "file.encoding".
| mgaunard wrote:
| Most formats (including XML) require data to be normalized to
| NFC.
| chrismorgan wrote:
| Can you point me to a single format that actually _requires_
| NFC? Most things either make no comment or just express
| preferences, though I'm confident there will be some somewhere.
|
| XML does _not_ require normalisation: per
| <https://www.w3.org/TR/xml11/#sec-normalization-checking>, XML
| data SHOULD be fully normalised, but MUST NOT be transformed by
| processors; in other words, it's a dead letter "SHOULD", and no
| one actually cares, just like almost everything else.
| guerrilla wrote:
| > But here, normalization caused this issue.
|
| Nope, the lack of normalization on both accounts by the SMB
| server caused the issue. It could have normalized before emitting
| but it definitely should have normalized on receiving for
| comparison.
| B-Con wrote:
| I think that in the ls->read workflow, Nextcloud shouldn't
| normalize the response from SMB and should issue back to SMB
| whatever SMB returned to Nextcloud.
| guerrilla wrote:
| According to Unicode, it should be allowed to and the SMB
| server should be able to handle it. That's kind of the point
| of normalization, they're meant to be done before all
| comparisons so that exactly this doesn't happen. Your
| suggestion is just premature optimization, i.e. eliminating a
| redundancy.
| int_19h wrote:
| Unicode doesn't say anything about what "should be allowed
| to" with respect to an unrelated protocol. If the protocol
| says that filenames are sequences of 16-bit values that
| have to be compared one by one, then that's what it is.
| guerrilla wrote:
| It does say that if comparisons are being made then...
| and comparisons are being made, so yes, it does.
| silon42 wrote:
| At least it should perform validation and reject the NFD form
| and force the client to normalize to NFC?
| misnome wrote:
| Why isn't the answer just "Don't unicode normalise the file
| name"?
|
| I thought the generally recommended way to deal with file names
| is to treat as a block of bytes (to the extent that e.g. rust has
| an entirely separate string type for OS provided strings), or
| just to allow direct encoding/decoding but not normalisation or
| alteration.
| tialaramex wrote:
| In terms of what filenames _are_, neither Windows nor Linux (I
| don't know for sure with macOS, but I doubt it) actually
| guarantees you any sort of _characters_.
|
| Linux filenames are a sequence of non-zero bytes (they might be
| ASCII, or at least UTF-8, they might be an old 8-bit charset,
| but they also might just be arbitrary non-zero bytes) and
| Windows file names are a sequence of non-zero 16-bit unsigned
| integers, which you could think of as UTF-16 code units but
| they don't promise to encode UTF-16.
|
| _Probably_ the files have human readable names, but maybe
| not. If you're accepting command line file names it's not
| crazy to insist on human readable (thus, Unicode) names, but if
| you process arbitrary input files you didn't create,
| particularly files you just found by looking around on disks
| unsupervised - you need to accept that utter gibberish is
| inevitable sooner or later and you must cope with that
| successfully.
|
| Rust's OsStr variants match this reality.
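The "arbitrary bytes" reality is easy to demonstrate. Python bridges byte filenames and `str` with the `surrogateescape` error handler (this is what `os.fsdecode`/`os.fsencode` do on Linux); a minimal sketch with a made-up filename:

```python
# A Linux filename is just non-NUL bytes; this one is not valid UTF-8.
raw = b"report-\xff.txt"

# surrogateescape smuggles the un-decodable byte through as U+DCFF,
# so the name round-trips losslessly through str:
name = raw.decode("utf-8", "surrogateescape")
assert name.encode("utf-8", "surrogateescape") == raw

# ...but it is not real text: strict UTF-8 encoding refuses the surrogate.
try:
    name.encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```

Any tool that assumes filenames are text will eventually choke on a name like this; the round trip only works if the bytes are carried through untouched.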
| atoav wrote:
| This is what I found quite refreshing about Rust -- instead
| of choosing one of the following:
|     A) The programmer is an almighty god who knows
|        everything; we just expose him to the raw thing.
|     B) The programmer is an immature toddler who cannot be
|        trusted, so we handle things for them.
|
| What Rust does is more along the lines of "you might already
| know this, but anyway here is a reminder that you, the
| programmer, need to take some decision about this".
| [deleted]
| GlitchMr wrote:
| Filenames in HFS+ filesystem (an old filesystem used by Mac
| OS X) are normalized with a proprietary variant of NFD - this
| is a filesystem feature. APFS removed this feature.
| 1over137 wrote:
| >APFS removed this feature.
|
| And then brought it back. It normalizes now.
| lilyball wrote:
| By "proprietary variant" you mean "publicly documented
| variant" which IIRC is just the normalization tables frozen
| in time from an early version of Unicode (the idea being
| that updating your OS shouldn't change the rules about what
| filenames are valid).
|
| As for APFS, it ~~doesn't~~didn't normalize but I believe
| it still requires UTF-8. And the OS will normalize
| filenames at a higher level. EDIT: they added native
| normalization. At least for iOS; I didn't dig enough to
| check if macOS is doing native normalizing or is just
| normalization-insensitive.
| chrismorgan wrote:
| Normalisation is expressly done with the composition of
| version 3.1 for compatibility: see
| <https://www.unicode.org/reports/tr15/#Versioning>. If
| that's what HFS+ does, then "proprietary variant" is
| wrong. And if not, I'm curious what it does differently.
|
| (On the use of version 3.1, note that in practice version
| 3.2 is used, correcting one typo: see
| <https://www.unicode.org/versions/corrigendum3.html>.)
|
| I find a few references to it being slightly different,
| but not one of them actually says what's different;
| Wikipedia is the only one with a citation
| (<https://en.wikipedia.org/wiki/HFS_Plus>: "and
| normalized to a form very nearly the same as Unicode
| Normalization Form D (NFD)[12]"), and that citation says
| it's UAX #15 NFD, no deviations. One library that handles
| HFS+ differently switches to UCD 3.2.0 for HFS+
| <https://github.com/ksze/filename-sanitizer/blob/e990e963dc5b...>,
| but my impression from
| UAX #15 is that this should be superfluous, not actually
| changing anything. (Why is UCD 3.2.0 still around there?
| Probably because IDNA 2003 needs it:
| <https://bugs.python.org/issue42157#msg379674>.)
|
| _Update:_ https://developer.apple.com/library/archive/technotes/tn/tn1...
| has actual technical information, but
| the table in question doesn't show Unicode version
| changes like they claim it does, so I dunno. Looks like
| maybe from macOS 10.3 it's exactly UAX #15, but 8.1-10.2
| was a precursor? I'm fuzzy on where the normalisation
| actually happens, anyway.
| GlitchMr wrote:
| The `filename-sanitizer` library you have linked has the
| following comment:
|
|     # FIXME: improve HFS+ handling, because it does not use the
|     # standard NFD. It's close, but it's not exactly the same thing.
|     'hfs+': (255, 'characters', 'utf-16', 'NFD'),
|
| I wonder what that means...
| matja wrote:
| ZFS can support normalization also:
|
|     $ echo test > $'\xc3\xb6'
|     $ cat $'\x6f\xcc\x88'
|     cat: ö: No such file or directory
|     $ zfs create -o normalization=formD pool/dataset
|     $ echo test > $'\xc3\xb6'
|     $ cat $'\x6f\xcc\x88'
|     test
| zekica wrote:
| macOS is interesting: some APIs normalize filenames while
| others don't. And it causes some very interesting bugs.
|
| One example: when you submit a file in Safari, it doesn't
| normalize the file name, while JavaScript's file.name does.
| stefan_ wrote:
| Sure, but at some point you might want to create a file from
| user input, or filter files using some user-provided query
| string -- the kind of use cases that unicode normalization
| was invented for. So the whole "opaque blob of
| bytes" filesystem handling is nice if all you want is to not
| silently corrupt files, but it is very obviously not even
| covering 10% of normal use cases. Rust isn't being super smart,
| it just has its hands thrown up in the air.
| alkonaut wrote:
| Falls over on the fact that I don't want to be able to write
| these two files in the same dir. If I write file ö1.txt and
| then ö1.txt, I want to be warned that the file exists, even if
| the encoding is different because I used two different apps to
| write the same file.
|
| The same applies for a.txt and A.txt on case insensitive file
| systems (as someone pointed out the most common desktop file
| systems are).
| pavlov wrote:
| The most common desktop file systems are case-insensitive,
| which complicates the picture.
| Pxtl wrote:
| Still, it looks like the right thing to do is let the
| filesystem do the filesystem's job. The filesystem should be
| normalizing unicode and enforceing the case-insensitivity and
| whatnot, but _just_ the filesystem. Wrappers around it like
| whatever Nextcloud is doing should be treating the filenames
| as a dumb pile of bytes.
| dataflow wrote:
| I'm not sure this problem even _has_ a "right" solution.
|
| > Wrappers around it like whatever Nextcloud is doing
| should be treating the filenames as a dumb pile of bytes.
|
| What do you do when the input isn't a dumb pile of bytes,
| but actual text? (Like from a text box the user typed
| into?)
| rzzzt wrote:
| Maintain a table that maps the original file name to
| random-generated one that doesn't hit these gotchas.
| rob_c wrote:
| And place the files in chunks, and... Wait I think we're
| getting close to reinventing block storage again ;)
| dataflow wrote:
| I'm afraid I don't follow. Who maintains this table and
| who consumes it? What if they're different entities? How
| do you prevent it from going out of sync with the file
| system when the user renames a file? Are you inventing
| your own file system here? How do you deal with existing
| file systems?
| rzzzt wrote:
| I assumed that you have a system where file
| management/synchronization happens strictly through a web
| interface, and files are not changed or renamed outside
| this system's knowledge. Under these preconditions,
| having such a mapping table frees the users from having
| to abide whatever restrictions the underlying file system
| places on valid file names.
| dataflow wrote:
| Oh I was talking about the general case from a
| programming standpoint. What do you do on a typical local
| filesystem?
|
| The point I'm trying to get at being, you need to worry
| about the representation at multiple layers, not just at
| the bottom FS layer.
| mjevans wrote:
| Case insensitivity is a braindead behavior. If desired it
| should be a fallback path selecting the best match, not the
| first resort.
| laurent92 wrote:
| So you're fine with ~/Downloads and ~/downloads coexisting
| as entirely separate directories? And
| John.McCauley@yahoo.fr and john.mccauley@yahoo.fr being
| attributed to two different people ;)
| im3w1l wrote:
| > So you're fine with ~/Downloads and ~/downloads
| coexisting as entirely separate directories?
|
| Case (in)sensitivity for filenames is a non-issue in my
| experience. Never had problems with either convention. As
| for emails, I do think insensitivity was the right
| choice.
| tim-- wrote:
| The RFC states that email addresses are case sensitive.
|
| The local-part of a mailbox MUST BE treated as case
| sensitive.
|
| Section 2.4 RFC 2821,
| https://www.ietf.org/rfc/rfc2821.txt
| deadbunny wrote:
| My guess would be that the local part of an email address
| would usually map to a directory on case sensitive
| filesystems...
| justaguy37 wrote:
| can we just say no to capital letters? (or lowercase?)
|
| do capital letters have a good enough usage case to
| justify their continued existence?
| Lammy wrote:
| Fun fact: The Apple II and II+ originally only did upper-
| case, and it was very popular to add a Shift Key / lower-
| case mod via one of the gamepad buttons: https://web.archive.org/web/20010212094858/http://home.swbel...
| colejohnson66 wrote:
| You are free to stop using capital letters, but good luck
| getting everyone to go along. Capitals have been around
| for centuries (they're older than the printing press) and
| aren't going anywhere.
| vgel wrote:
| First one: yes, though good UI should prevent it from
| happening unless the user really intended it (for example
| I have ~/Documents symlinked into Dropbox, so ~/documents
| could be local-only documents)
|
| Second one: no, emails are not filenames, and more
| generally distinguishability is more important for
| identifiers. In cases where identifiers like emails need
| to be mapped to filenames, like caches, they should be
| normalized.
| jodrellblank wrote:
| The opposite; case insensitivity is what human brains do,
| we read word WORD Word and woRD as the same thing, it's
| computer case-sensitive matching which is "brainless".
| Computers not aligning with what humans do is annoying and
| frustrating; they should be tools for us, not us for them.
| There's no way two people would write ö and ö and have readers
| think they were different because one was written in oil-
| based ink and one in water-based ink, or whatever corresponds
| to behind-the-scenes implementation details like
| combining form vs single character.
|
| I have just been arguing the same thing in far too much
| detail in this thread:
| https://news.ycombinator.com/item?id=29722019
| rob_c wrote:
| WORD, Word WoRD....
|
| Sorry to say I tend to use case sensitivity as a filter
| for me offering support to other developers. I'm not
| willing to find time for people who can't get their head
| around "turn on/off caps lock". You don't do it in
| professional writeups or applications (and I hope not in
| a CV) so don't pollute my filesystems or codebases with
| that madness.
| skymt wrote:
| There are a couple arguments against case-insensitive
| filesystems I think are strong. The first is simply
| compatibility with existing case-sensitive systems. The
| second is that case is locale-dependent, so a pair of
| names could be equivalent or not depending on the
| device's locale.
|
| I don't think I've seen any good argument against
| normalization, though.
| jrochkind1 wrote:
| Well, precisely because if you _don't_ normalize the
| filenames, ö ≠ ö. You could have two files with different
| filenames, `göteborg.txt` and `göteborg.txt`, and they are
| different files with different filenames.
|
| Or you could have one file `göteborg.txt`, and when you try to
| ask for it as `göteborg.txt`, the system tells you "no file by
| that name".
|
| Unicode normalization is the _solution_ to this. And the
| unicode normalization algorithms are pretty good. The bug in
| this case is that the system did not apply unicode
| normalization consistently. It required a non-default config
| option to be turned on to do so? I don't really understand
| what's going on here, but it sounds like a bug in the system to
| me that this would be a non-default config option.
|
| Dealing with the entire universe of human language is
| inherently complicated. But unicode gives us some actually
| pretty marvelous tools for doing it consistently and
| reasonably. But you still have to use them, and use them right,
| and with all software bugs are possible.
|
| But I don't think you get fewer crazy edge cases by not
| normalizing at all. (In some cases you can even get security
| concerns, think about usernames and the risk of `john` and
| `john` being two different users...). I know that this is the
| choice some traditional/legacy OSs/file systems make, in order
| to keep pre-unicode-hegemony backwards compat. It has problems
| as well. I think the right choice for any greenfield
| possibilities is consistent unicode normalization, so
| `göteborg.txt` and `göteborg.txt` can't be two different files
| with two different filenames.
|
| [btw I tried to actually use the two common different forms of
| ö in this text; I don't believe HN normalizes them so they
| should remain.]
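The filename collision being discussed can be reproduced directly with Python's stdlib `unicodedata` (the filenames here are just example strings, not real files):

```python
import unicodedata

# Two spellings of the same visible filename:
nfc = "g\u00f6teborg.txt"    # precomposed ö (U+00F6) -- NFC form
nfd = "go\u0308teborg.txt"   # 'o' + U+0308 COMBINING DIAERESIS -- NFD form

# Codepoint-for-codepoint (and byte-for-byte) they differ:
assert nfc != nfd

# Normalizing both sides to either form before comparison fixes the lookup:
assert unicodedata.normalize("NFC", nfd) == nfc
assert unicodedata.normalize("NFD", nfc) == nfd
```

Which form you pick matters less than applying it consistently on both the write path and the lookup path, which is exactly what went wrong in the bug from the article.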
| nieve wrote:
| It looks like instead of the config option switching
| everything to use the same normalization it keeps a second
| copy of the name in a database to compare to. What a horrible
| kludge, I wonder how they even got into this situation of
| using different normalization in different parts of the
| system?
| arka2147483647 wrote:
| That works for programmers, but not for users. There could be
| several files with the same name, but with different
| encodings. Worse, depending on how your terminal encodes user
| input, some of them might not be typable.
| zarzavat wrote:
| From the user's perspective I don't want any normalisation at
| all. It's good as long as you only have one file system, but
| as soon as you get multiple file systems with conflicting
| rules (which includes transferring files to other people) it
| becomes hell. Unfortunately we are stuck with that hell.
| heikkilevanto wrote:
| Well, if 7-bit US ASCII was good enough for our Lord, it is good
| enough for me ;-)
___________________________________________________________________
(page generated 2021-12-31 23:00 UTC)