[HN Gopher] Unicode is harder than you think
___________________________________________________________________
Unicode is harder than you think
Author : mcilloni
Score : 86 points
Date : 2023-07-25 16:47 UTC (6 hours ago)
(HTM) web link (mcilloni.ovh)
(TXT) w3m dump (mcilloni.ovh)
| jkaptur wrote:
| The logical next step here is to realize that if you want to be
| truly internationalized, pretty much every single method of the
| string class in your favorite language is an antipattern and
| should be used with extreme caution. Seriously!
| frizlab wrote:
| I _think_ Swift did it properly. At least that's what they
| claim[1]
|
| [1]: https://www.swift.org/blog/utf8-string/
|
| There were many discussions on how to handle strings in the
| forums too. Remarkably, it is not possible to access "abc"[1],
| as that would have unpredictable performance; instead one has
| to build the index, and doing so should make the developer
| realize the operation is costly. All in all, most beginners in
| Swift hate working with strings because it's not intuitive at
| first, but to be fair, in almost all languages string handling
| is not done properly.
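|
| A minimal sketch of the index dance, for illustration (not from
| the linked post):
|
|     let s = "abc"
|     // s[1]  // does not compile: String has no integer subscript
|     let i = s.index(s.startIndex, offsetBy: 1)  // walks grapheme clusters, O(n)
|     print(s[i])  // "b"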
| jkaptur wrote:
| Oh, interesting! The fact that "abc"[1] doesn't work is a
| great sign, since my contention is that "the character at
| index i" is not a well-defined concept.
| spudlyo wrote:
| If you found this essay interesting, you owe it to yourself to
| check out this super entertaining talk "Plain Text"[0] from NDC
| 2022 by Dylan Beattie. Rabbit hole warning: This video caused me
| to lose an entire Sunday watching Dylan's talks on YouTube, which
| are uniformly awesome.
|
| [0]: https://www.youtube.com/watch?v=gd5uJ7Nlvvo
| WaffleIronMaker wrote:
| I also really enjoy Dylan Beattie's work. For those with some
| spare time who would like to see a true "rockstar" programmer,
| "The Art of Code"[0] is well worth a watch.
|
| [0] https://youtu.be/6avJHaC3C2U
| aidos wrote:
| Oh wow. The Amstrad 6128 was my first machine (1985). Looking
| forward to watching this!
| spudlyo wrote:
| Another good talk! I also really loved "Email vs Capitalism,
| or, Why We Can't Have Nice Things"[0] which has one of the
| best audience participation gimmicks I have ever seen in a
| talk.
|
| [0]: https://www.youtube.com/watch?v=mrGfahzt-4Q
| soneil wrote:
| This is exactly what I came here to post, too. I believe he's
| unusually prolific because he's part of the organisation of
| these conferences - but his delivery pays off regardless. I
| actually sent exactly the same video to a colleague a few days
| ago.
| NovemberWhiskey wrote:
| I used to work on a platform at a large financial services firm;
| it was essentially completely ignorant of anything Unicode with
| respect to string handling; strings were null-terminated byte
| streams. The platform had CSV import capability for tabular data,
| and it had an integrated pivot table capability based on some
| widgets that had been grafted onto it.
|
| Some of the users in Hong Kong discovered that you could import
| CSVs with Unicode text (e.g. index compositions with Chinese
| company names) and they'd display in the pivot table widgets and
| even be exportable to reports. But only for most names. Some names
| were truncated or turned into garbage, and I was called upon to
| help debug this.
|
| My first reaction was frank amazement that this "worked" at all:
| apparently, the path from the dumb CSV import code through to the
| Unicode-aware pivot table was sufficiently clean that much of the
| encoded text made it through OK. I can't remember the precise
| details now but I think the problem turned out to be embedded
| nulls from UTF-16 encoding and so was completely insoluble
| without a gut renovation of the platform.
| Pannoniae wrote:
| Most programs claim to support Unicode but they actually don't.
| They either miscount string lengths (you type in a CJK character
| or an emoji, and the string appears shorter than the program
| thinks it is), split them up improperly, or get many other
| things wrong. It doesn't help that, by default, most programming
| languages also handle Unicode poorly, with the default APIs
| producing wrong results.
|
| I'd take "we don't do unicode at all" or "we only support BMP" or
| "we don't support composite characters" any day over pretend-
| support (but then inevitably breaking when the program wasn't
| tested with anything non-ASCII)
|
| (ninjaedit: to see how prevalent it is, even gigantic messaging
| apps such as Discord make this mistake. There are users on
| Discord whom you can't add as friends because the friend input
| field is limited to 32... something - probably bytes - yet
| elsewhere the program allows the name to be taken. This is easy
| to do with combining characters.)
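|
| A quick Swift sketch of how the counts diverge, whatever the
| exact limit is (illustrative):
|
|     let name = "café"                 // precomposed é
|     let decomposed = "cafe\u{0301}"   // same visible text, combining accent
|     print(name.count, decomposed.count)                              // 4 4 (grapheme clusters)
|     print(name.unicodeScalars.count, decomposed.unicodeScalars.count) // 4 5 (code points)
|     print(name.utf8.count, decomposed.utf8.count)                     // 5 6 (bytes)
|
| A limit measured in bytes and a limit measured in visible
| characters disagree as soon as combining characters show up.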
| david_draco wrote:
| Maybe an "Acid test" for Unicode would help? These pages seem
| to go into that direction:
| https://www.kermitproject.org/utf8.html
| https://web.archive.org/web/20160306060703/http://www.inter-...
|
| Placing a fuzz tester like "hypothesis.strategies.characters"
| into the CI may also be revealing.
| alcover wrote:
| Currently working on a language, I feel dizzy after reading this.
|
| My stdlib will provide a (byte) Buffer class with basic low-level
| methods but I feel like iterating through it in fancy ways should
| be the concern of the user or 3rd-party libraries.
|
| I fail to see this as part of a programming language.
|
| Am I wrong here?
| zadokshi wrote:
| You're definitely wrong. You're designing a language that only
| works for "Americans" by default.
|
| Imagine how you would feel about a language that supports
| Arabic by default and needs special foo to work with American
| English?
|
| You need to start thinking of characters as a type. Characters
| do not fit in bytes unless you're American.
|
| UTF-8 is a reasonable compromise though.
| scatters wrote:
| It depends on the level and domain of your language. A low-
| level language can get by with just byte arrays. A mid-level
| language should probably handle at least some encodings, and
| provide codepoint access. A high-level language should handle
| locale-sensitive casing and collating, and grapheme-cluster
| access (note that this depends on the font!).
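|
| For concreteness, the three access levels over one piece of text
| (a Swift sketch, since Swift exposes all three views):
|
|     let flag = "🇨🇦"                   // two regional-indicator code points
|     print(flag.utf8.count)            // 8 - bytes
|     print(flag.unicodeScalars.count)  // 2 - code points
|     print(flag.count)                 // 1 - grapheme clusters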
| ekidd wrote:
| One good approach is to have separate "byte array" and "string"
| types, and say, "Strings are always UTF-8. Anything else is a
| bug. Deal with it."
|
| Then you can have a nice, user-friendly string class for basic
| UTF-8 text, which is pretty easy. Ignore sorting and grapheme
| clusters (those probably belong in libraries, and they require
| fairly large tables). Consider providing a library function to
| iterate over UTF-8 "characters" (as unpacked 32-bit Unicode
| code points).
|
| This is one of the sweet spots in language design, and it
| provides enough structure for third-party libraries to work
| well together, without everyone reinventing their own string
| type.
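|
| A rough sketch of that split, in Swift terms (illustrative only;
| the approach described above is language-agnostic):
|
|     import Foundation
|     let bytes: [UInt8] = [0xE2, 0x82, 0xAC]           // raw bytes, no encoding promise
|     let text = String(bytes: bytes, encoding: .utf8)  // nil if not valid UTF-8
|     for scalar in (text ?? "").unicodeScalars {       // "characters" as code points
|         print(scalar, scalar.value)                   // € 8364 (U+20AC)
|     }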
|
| For another good alternative to this approach, see Ruby.
| night-rider wrote:
| It's not that it's hard, it's just people don't go out of their
| way to escape UTF-8 glyphs into ASCII when dealing with exotic
| glyphs in a text editor. It's more mundane and tedious, but not
| 'hard'.
|
| Try working with raw UTF-8 in JS and find yourself in a world of
| pain. Mathias Bynens talks about these gotchas here:
|
| https://mathiasbynens.be/notes/javascript-unicode
| JohnFen wrote:
| "Hard" is a bit context-dependent. Instead of thinking of it as
| "hard", I think of it as a real pain in the ass full of
| footguns.
| o1y32 wrote:
| Regarding the title -- anecdotally, everyone I know is sacred of
| encoding issues, and I don't know anyone who claims they have a
| great understanding of Unicode or think it is easy (including
| myself). It is often overlooked for sure -- people don't realize
| there is a problem in the code until they run into a bug, and it
| turns out they are treating strings wrong from the very
| beginning.
| m_0x wrote:
| sacred or scared? ;)
| zamadatix wrote:
| TIL of UTF-1, what an odd specification.
| dmitrygr wrote:
| Favourite unicode fact: properly rendering unicode requires
| understanding of the current geopolitical situation (Depending on
| whom you accept as a country and whom you do not, two country-
| code-letters may or may not render as a flag. This changes
| sometimes in today's world.).
| https://esham.io/2014/06/unicode-flags
| svachalek wrote:
| Interesting. They pushed all the politics onto the font
| designers.
| not2b wrote:
| The font designer has to include a flag for each supported
| country. Often a given font is missing lots of flags for
| reasons that have nothing to do with whether the designer
| recognizes a given country or not, just a question of
| priorities; perhaps only 100 out of 200 flags are supported.
| Longhanks wrote:
| Imho, unicode should stay out of politics. Country flags,
| vaccine syringes and pregnant men should have nothing to do
| with how computers handle text and writing systems.
| veave wrote:
| Or when big tech banded together to change the pistol emoji
| to some scifi zapper.
| spookthesunset wrote:
| From what I recall the problem was on some devices it was
| rendered as a "sci-fi zapper" or squirtgun and on others it
| was a fairly realistic depiction of a gun. Leading to some
| misunderstandings...
| makeworld wrote:
| There is no way to avoid it. It is very obvious that deciding
| whether "vaccine syringes" are political (and therefore
| excluded) or not is itself a political decision.
| naniwaduni wrote:
| There's a certain kind of extremist who claims that their
| contentious positions aren't political, but the fact that
| there's an argument, and that you can point to mainstream
| coverage of it, strongly suggests that they're full of shit
| and everyone can see it.
| kergonath wrote:
| What a strange position. The fact that an argument exists
| just shows that some people want to argue. Anyone can
| start an argument about anything. It's hardly a good base
| to make a decision.
| kergonath wrote:
| What does a syringe have to do with this, exactly?
|
| Besides, why do you care what funny symbols people use in
| discussions that don't involve you?
| qalmakka wrote:
| If only it were that easy - sadly it isn't, for anything that's
| in any way related to the way we communicate and relate to the
| world.
| Just look at the kerfuffle about skin tones... Everything is
| political if you are looking for a reason to fight.
|
| Language is a very sensitive topic - in Central Asian
| countries, using Latin, Cyrillic or Perso-Arabic script for
| instance has very strong political connotations, same in the
| Balkans. The world is just like that.
| mseepgood wrote:
| How do you recognize whether a syringe emoji is a vaccine
| syringe or a regular syringe?
| Roark66 wrote:
| Indeed it is. One use I make of Unicode is icons that can be
| used by console programs like (neo)vim. I was quite happy that
| xterm supports Unicode these days, so I can use a fast terminal
| that supports OSC52 system clipboard integration (none of the
| newer GNOME/KDE terminals do).
|
| I was rather disappointed when I noticed my pretty Unicode icons
| would sometimes end up cut in half :-(
| jmclnx wrote:
| No kidding, you have not lived until you try to explain UTF-8 to
| people who only believe in what they call "doublebyte".
|
| You think they get it, but the surprise happens when a database
| load fails while loading a Chinese-character "string" into a
| field whose size was calculated based on 2 bytes per character.
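|
| The arithmetic that breaks it is simple (Swift sketch, just for
| illustration):
|
|     let city = "北京"             // 2 user-perceived characters
|     print(city.utf16.count)      // 2 - where the "doublebyte" mental model stops
|     print(city.utf8.count)       // 6 - 3 bytes each in UTF-8, so a 2*N-byte field overflows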
| theamk wrote:
| Thank god for emojis! Those people would say, "No one in our
| org would use Chinese" and refuse to fix things... but now I
| just point them to the latest message from upper management,
| which contains an emoji or two.
|
| (And emoji are such a fine example - once they are on the
| table, you need support for combining characters, characters
| outside the BMP, ligatures... a large part of the Unicode spec.)
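|
| Even one emoji with a skin tone exercises most of that list
| (a Swift sketch):
|
|     let thumbs = "👍🏽"                  // base emoji + skin-tone modifier
|     print(thumbs.count)                 // 1 grapheme cluster
|     print(thumbs.unicodeScalars.count)  // 2 code points, both outside the BMP
|     print(thumbs.utf16.count)           // 4 UTF-16 units (two surrogate pairs)
|     print(thumbs.utf8.count)            // 8 bytes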
| qalmakka wrote:
| It's terrible, and IMHO we owe that to some introductory
| university courses on Java (plus some Win32 veterans). I came
| very close to being rejected by a professor who was obstinately
| convinced that Unicode "characters" were 2 bytes, because he had
| drunk the '90s Kool-Aid about Java's `char` type representing a
| Unicode character. Ugh. I still get angry thinking back on that
| sometimes.
| jmclnx wrote:
| I can relate. I remember a teacher stating "you never have to
| worry about the amount of memory". This was in the late 90s.
| I then asked, "So I can load a 20 gig data file into memory?",
| and he said yes.
| nightpool wrote:
| Thank you for being the first article I've ever actually read to
| explain the difference between NFC, NFD, NFKD and NFKC in a way
| that I actually understood. I was a little bored through the
| whole UCS/UTF* history lesson because I knew a lot of it already,
| but the normalization and collation examples were definitely
| worth it
| Lammy wrote:
| Agreed, and it would be even better if it mentioned some real-
| world normalization issues like it does for the UCS encodings.
| I learned about it the hard way when dealing with Apple
| filesystems: https://eclecticlight.co/2021/05/08/explainer-
| unicode-normal...
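|
| A tiny illustration of what normalization changes (Swift with
| Foundation, illustrative):
|
|     import Foundation
|     let nfc = "é"                                       // U+00E9, precomposed
|     let nfd = nfc.decomposedStringWithCanonicalMapping  // "e" + U+0301, roughly what HFS+ stores
|     print(nfc == nfd)                      // true: Swift compares by canonical equivalence
|     print(nfc.utf8.count, nfd.utf8.count)  // 2 3: byte-for-byte they differ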
| skitter wrote:
| Annoyingly, Java, JavaScript, Windows file paths and more don't
| quite use UTF-16 (well, even if they did, that would be annoying)
| -- they allow unpaired surrogates, which don't represent any
| Unicode character. So if you want to represent e.g. an arbitrary
| Windows file path in UTF-8, you can't; you have to use WTF-8
| (wobbly transformation format) instead.
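|
| A small illustration of why a lone surrogate can't live in a
| regular Unicode string (Swift, illustrative):
|
|     let lone = Unicode.Scalar(0xD800 as UInt32)     // nil: surrogates aren't scalar values
|     print(lone == nil)                              // true
|     let units: [UInt16] = [0xD83D]                  // an unpaired high surrogate
|     print(String(decoding: units, as: UTF16.self))  // "�" - repaired to U+FFFD, original lost
|
| WTF-8 exists precisely so such sequences can round-trip without
| that loss.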
| Knee_Pain wrote:
| >WTF-8
|
| truly an appropriate name
| deadbeeves wrote:
| But UTF-8 is just a way to encode a number as a variable-length
| string of octets. Why would you be unable to encode, say, a
| terminating U+D800 as a string of three bytes at the end of a
| UTF-8 stream?
| skitter wrote:
| Because that's how UTF-8 is defined[1]. WTF-8 lifts that
| restriction.
|
| [1] https://simonsapin.github.io/wtf-8/#utf-8
| deadbeeves wrote:
| It doesn't sound very annoying, then. You use the exact
| same encoding scheme, but skip a verification step.
| Actually it sounds more convenient.
| jraph wrote:
| Still potentially annoying if you deal with some other
| code that expects UTF-8 proper and you pass it a WTF-8
| string that fails the lifted verification.
| [deleted]
| sedatk wrote:
| Certainly not true for Windows. Windows uses UTF-16; e.g. it
| uses proper surrogate pairs.
|
| https://learn.microsoft.com/en-us/windows/win32/intl/surroga...
| skitter wrote:
| That would be great, but that article is about
| recommendations for applications running on Windows, not
| about what valid file names applications may encounter.
| Here's a counter-example:
| https://github.com/golang/go/issues/32334
| sedatk wrote:
| No, I mean Windows API honors UTF-16 surrogate pairs, and
| processes them correctly. It doesn't produce invalid UTF-16
| strings either. Apps may not support UTF-16 properly, and
| that's not on Windows, is it?
|
| NTFS, on the other hand, has no dictated format for
| filename encoding. It just stores raw bytes as filenames,
| so anything can be a filename on NTFS, including invalid
| strings if the caller decides to do so. That's not on
| Windows either, otherwise, we should add Linux to the list
| too as ext4 and most other file systems also don't care
| about filename encoding.
___________________________________________________________________
(page generated 2023-07-25 23:00 UTC)