[HN Gopher] What every software developer must know about Unicod...
       ___________________________________________________________________
        
       What every software developer must know about Unicode in 2023
        
       Author : mrzool
       Score  : 566 points
       Date   : 2023-10-02 09:22 UTC (13 hours ago)
        
 (HTM) web link (tonsky.me)
 (TXT) w3m dump (tonsky.me)
        
       | penguin_booze wrote:
       | I knew that domain, so I had sunglasses at hand before opening
       | the page!
        
       | gorgoiler wrote:
       | With the benefit of hindsight, would we include the error
       | detection bits of UTF8 if we could choose not to?
        
       | rurban wrote:
       | The Why is "A" !== "A" !== "A"? section still strikes me as
       | wrong. The strings are equal even when the representations
       | differ.
        
         | nextaccountic wrote:
         | They are logically equal (that is, they represent the same text
         | in an abstract way), but computing this equality in practice is
         | expensive, because you first need to normalize the strings then
         | compare.
         | 
         | Most languages, when comparing strings, skip the normalization
         | and just compare string bytes as is (or, if the string is
         | interned, compare just the pointer)
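[Editor's note: the normalize-then-compare cost described above can be sketched in a few lines of Python using the stdlib `unicodedata` module:]

```python
import unicodedata

# Two representations of the same text: precomposed U+00C5 ("Å") vs
# U+0041 + U+030A ("A" followed by a combining ring above)
s1 = "\u00c5"
s2 = "A\u030a"

# What most languages do by default: compare codepoints/bytes as-is
print(s1 == s2)  # False: the representations differ

# Logical equality: normalize both sides first, then compare
print(unicodedata.normalize("NFC", s1) ==
      unicodedata.normalize("NFC", s2))  # True
```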
        
         | bajsejohannes wrote:
          | I'm just not sure why they put in the "Angstrom symbol" to begin
         | with. If you do, then why isn't the "meter symbol" (m) also
         | represented?
         | 
         | Fortunately, it seems like it's marked as deprecated:
         | https://en.wikipedia.org/wiki/Angstrom#Symbol
        
           | jcranmer wrote:
            | > I'm just not sure why they put in the "Angstrom symbol" to
           | begin with.
           | 
           | Frequently, the answer to this is "some obscure character set
           | had this as a distinct symbol." In this case, blame the
           | Japanese: https://en.wikipedia.org/wiki/JIS_X_0208
           | 
           | Which is why there's an 'mm' and 'cm' and other random
           | symbols: https://www.compart.com/en/unicode/block/U+3300
        
       | amelius wrote:
       | Can we please get a standard that describes how emoji are
       | supposed to look?
       | 
       | Now they look different on every platform and many subtleties are
       | lost in translation.
        
         | JohnFen wrote:
         | Yeah, this problem has led me to avoid using emojis. I can't be
         | sure that the meaning I was intending is the one being depicted
          | by the recipient's machine.
         | 
         | It's probably a good thing, though.
        
       | w10-1 wrote:
       | A real question is why IBM, Apple, and Microsoft poured millions
       | into developing the unicode standard instead of treating
       | character encoding like file formats as a venue for competition.
       | 
       | IBM and Apple in the early 1990's combined in Taligent to try to
       | beat MS NT, but failed. But a lot of internationalization came
       | out of that and was made open, at the perfect time for Java to
       | adopt it.
       | 
       | Interestingly it wasn't just CJK but Thai language variants that
       | drove much of the flexibility in early unicode, largely because
       | some early developers took a fancy to it.
       | 
       | When you look at the actual variety in written languages, Unicode
       | grapheme/code-point/byte seems rather elegant.
       | 
       | We're in the early days of term vectors, small floats, and
       | differentiable numerics (not to mention big integers). Are
       | lessons from the history of unicode relevant?
        
         | preciousoo wrote:
         | You can ask why they didn't do the same for networking and
         | serial protocols too.
        
       | samatman wrote:
       | Please don't refer to codepoints as characters. Some are, some
       | are not, it isn't a useful or informative approximation, it's
        | just wrong. Unicode is a table which assigns unique numbers to
        | different _codepoints_, most of which are characters. ZWJ is not
       | a character at all, and extended grapheme clusters made of
       | several codepoints are.
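[Editor's note: the point about ZWJ can be checked with Python's stdlib `unicodedata`, which classifies it as a format control rather than a letter:]

```python
import unicodedata

zwj = "\u200d"  # ZERO WIDTH JOINER
print(unicodedata.name(zwj))      # ZERO WIDTH JOINER
print(unicodedata.category(zwj))  # Cf: format control, not a letter
print(unicodedata.category("A"))  # Lu: uppercase letter, for contrast
```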
        
         | skitter wrote:
         | 'Character' doesn't have a single meaning. ZWJ is a character
         | according to definitions (2) and (3) in
         | https://unicode.org/glossary/#character
        
       | user3939382 wrote:
       | I once bought an O'Reilly book on encoding. It was like 2000
        | pages. I never read it; that was about 15 years ago. My takeaway
        | is that encoding is really complex, and I just kind of pray it
        | works, which most of the time it does.
        
       | tr888 wrote:
       | What on EARTH is that mouse cursor thing all about? Why would you
       | even bother writing this, then making it impossible to read
       | properly?
        
         | eerikkivistik wrote:
         | I stopped in the middle of reading the post just for this. It
         | was so distracting I was unable to focus on the text. It's a
         | fun gimmick, but the result is that someone who wanted to read
         | the post, stopped in the middle.
        
         | oliwarner wrote:
          | It's tracking every visitor's cursor and sharing it with every
         | other visitor.
         | 
         | Why would a frontend developer demonstrate their ability to do
         | frontend programming on their personal, not altogether super-
         | serious blog? I meant that rhetorically but it's a flex. I
         | agree, not the best design in the world if you're catering for
         | particular needs, but simple and fun enough. You should check
         | out dark mode.
         | 
         | In that vein, I think it's okay if we let people have fun. That
         | might not work for everyone, but why should we let perfect be
         | the worst enemy of fun?
        
           | dathinab wrote:
           | > Why would
           | 
            | because it shows that they don't understand important design
            | aspects
            | 
            | while it doesn't really show off their technical skills,
            | because it could be some plugin or copy-pasted code; only
            | someone who looks at the code would know better. But if
            | someone cares enough about you to look at your code, you
            | don't need to show off that skill on your normal website and
            | can have a separate tech demo.
            | 
            | > okay if we let people have fun
            | 
            | yes, people having fun is always fine, especially if you
            | don't care whether anyone ever reads your blog or looks at it
            | for whatever reason (e.g. hiring)
            | 
            | but the moment you want people to look at it for whatever
            | reason, there is tension
            | 
            | i.e. people don't get hired to have fun
            | 
            | and if you want others to read your blog, you probably
            | shouldn't assault them with constant distractions
        
             | JohnFen wrote:
              | Not every website, even technical ones, needs to have an eye
             | towards professional advancement. Sometimes they're just
             | for fun. I welcome it, as it's a thing that gets more rare
             | on the web as time goes by.
        
               | jamincan wrote:
               | Considering the dark mode is effectively flashlight mode,
               | I think it's reasonable to assume the blog's owner just
               | likes to have a bit of fun.
        
             | booleandilemma wrote:
             | Lighten up.
        
             | oliwarner wrote:
             | > people don't get hired to have fun
             | 
             | Living by that motto is hugely self-destructive.
             | 
             | Creative expression allows us to push ourselves, both in
             | _what_ we think we can do, and often the technical aspects
              | about _how_ we do it too. Even if the idea doesn't stick,
             | you've tried something new.
             | 
             | In a world of Tailwinds and Bootstraps and the same five
             | templates copied again and again and again, let's celebrate
             | the people willing to push things and learn from their
             | inevitable but ultimately valuable mistakes. And let's have
             | some fun along the way.
        
             | FinnKuhn wrote:
              | I assume the creator didn't anticipate this many readers at
              | the same time; having one or two other cursors on the page
              | does sound fun and not too distracting. They should probably
              | limit the number of other cursors displayed to something
              | sensible.
        
         | dathinab wrote:
         | (sarcasm)
         | 
          | It's revenge against anyone with certain kinds of visual
          | impairments and/or concentration issues, because the author's
          | ex-spouse, who turned out to be a terrible person, had such
          | issues.
          | 
          | (sarcasm try 2)
          | 
          | It's revenge against anyone using JS on the net, with the
          | author trying to subtly hint that JS is bad.
          | 
          | (realistic)
          | 
          | It's probably one of:
          | 
          | - the website is a static view of some collaborative tool which
          | has that functionality built in by default
          | 
          | - some form of well-intended but not well-working functionality
          | added to the site as some form of school/study project; in that
          | case I'm worried about the author suffering unexpectedly much
          | higher costs due to it ending up on HN ...
        
           | tonsky wrote:
           | Hi, author here. In case you really want to know: no, it's
           | custom-made and works exactly as intended. There are two main
           | reasons:
           | 
           | 1. Fun. Modern internet is boring, most blog posts are just
           | black text on white background. Hard to remember where you
           | read what. And you can't really have fun without breaking
           | some expectations.
           | 
           | 2. Sense of community. Internet is a lonely place, and I
           | don't necessarily like that. I like the feeling of "right
           | now, someone else reading the same thing as I do". It's human
           | presence transferred over the network.
           | 
           | I understand not everybody might like it. Some people just
           | like when things are "normal" and everything is the same.
           | Others might not like feeling of human presence. For those,
           | I'm not hiding my content, reader mode is one click away, I
           | make sure it works very well.
           | 
           | As for "unexpectedly ended up on HN", it's not at all
           | unexpected. Practically every one of my general topic
           | articles ends up here. It's so predictable I rely on HN to be
           | my comment section.
        
             | pests wrote:
              | I like your content, but I do think you need to rethink #1.
              | Fun is useless if no one wants to show up because they are
              | annoyed.
        
               | acqq wrote:
                | Count me in the group of "I was so distracted that I
                | stopped reading."
                | 
                | Then the second thought was: I should start blocking JS
                | by default again, as much as I can.
        
             | gpvos wrote:
             | 2. I only understood that it was actual other people's
             | mouse cursors when I read that here. So it didn't really
             | engender a sense of community, although after some time I
             | did think you are very good at modelling actual human mouse
             | movements. Now that I know it, it's pretty neat though.
        
           | wonger_ wrote:
           | The author has several other writeups:
           | 
           | https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu.
           | ..
           | 
           | The cursors will only be a problem during front page HN
           | traffic. And the opt-out for people who care is reader mode /
           | disable js / static mirror. Not sure if there's any better
           | way to appease the fun-havers and the plain content
           | preferrers at the same time. Maybe a "hide cursors" button on
           | screen? I, for one, had a delightful moment poking other
           | cursors.
        
         | Luctct wrote:
         | I don't know what you people are talking about. I'm just glad I
         | always browse with Javascript turned off. If you didn't see the
         | writing on the wall and permanently turn Javascript off around
         | 2006, you have no right to complain about anything.
         | 
         | Meanwhile, ironic irony is ironic: "Hey, idiots! Learn to use
         | Unicode already! Usability and stuff! Oh, btw, here is some
         | extremely annoying Javascript pollution on your screen because
         | we are all still children, right? Har har! Pranks are so
         | kewl!!!1!"
        
           | zbtaylor1 wrote:
           | Are you alright?
        
       | chx wrote:
       | "roll your own"
       | 
       | Rather not. It takes an incredible amount of work to get it
       | right. Just stick to ICU.
        
       | rdtsc wrote:
        | > The only modern language that gets it right is Swift:
        | 
        |     print("...".count)  // => 1
        | 
        | And Erlang/Elixir! I guess they are not "cool" enough. But they
        | correctly interpret that as one grapheme cluster.
        | 
        |     % erl +pc unicode
        |     > string:length("...").
        |     1
        | 
        | (... here is the U+1F926 U+1F3FB U+200D U+2642 U+FE0F emoji)
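[Editor's note: for contrast, Python's `len` counts codepoints rather than grapheme clusters, so the same U+1F926 U+1F3FB U+200D U+2642 U+FE0F sequence reports 5, not 1:]

```python
# The facepalm emoji from the comment above, written as escapes:
# U+1F926 U+1F3FB U+200D U+2642 U+FE0F
s = "\U0001F926\U0001F3FB\u200d\u2642\ufe0f"

print(len(s))                  # 5: Python counts codepoints
print(len(s.encode("utf-8")))  # 17: UTF-8 byte count
```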
        
       | davidham wrote:
       | Is it just me, or is anyone else seeing what looks like the mouse
       | pointer of everyone else reading the page, like 1,000 little ants
        | on the screen?
        
         | neonsunset wrote:
         | Yes, reading the article is impossible with erratic movement on
         | the screen.
        
         | dekken_ wrote:
         | not just you, this is what my other comment is about
         | (indirectly)
        
         | ilyt wrote:
         | I see nice crisp black text on white background because
         | apparently server melted down
        
           | hot_gril wrote:
           | I saw that, except half the images weren't loading, and there
           | was just one mouse pointer.
        
         | wirelesspotat wrote:
         | Yep, the website opens a websocket connection[0] and sends the
         | mouse position every 1 second
         | 
          | [0] WS connection is on
          | `wss://tonsky.me/pointers?id=XXXXXX&page=/blog/unicode/&platform=XXX`
        
         | lbltavares wrote:
          | It's fun, especially for folks like me who have ADHD. But there
         | should be a button to disable it
        
         | hwillis wrote:
         | turned off javascript as soon as I saw it. Like trying to read
         | with twenty mosquitos in your face.
        
           | sebstefan wrote:
           | hey be nice to my mouse cursor
        
         | 876978095789789 wrote:
         | Yeah, it's extremely obnoxious.
        
         | keb_ wrote:
         | Anytime tonsky's site gets posted here, I'm reminded by how
         | awful it is, which is ironic given his UI/UX background. The
         | site's lightmode is a blinding saturated yellow, and if you
         | switch into darkmode, it's an even less readable "cute"
         | flashlight js trick. I don't know why he thought this was a
         | good idea. Thank god for Firefox reader mode.
        
           | coldpie wrote:
           | Works like a normal website with JavaScript disabled. I
           | didn't even know it did fancy junk until reading the comments
           | here. NoScript saves the day again! I don't know how people
           | can browse the web without it.
        
             | gpvos wrote:
              | It's been some time since I last used it, but I found that
              | too many websites that I want to read require Javascript to
             | even show you the main body of text, or a reasonable
             | layout. Is that different now?
        
             | aembleton wrote:
             | By using reader mode
        
           | ericmcer wrote:
           | I don't think he added moving cursors all over the page
           | because he thought it was good UI/UX, he knows what he is
           | doing.
        
             | superq wrote:
             | This is seemingly self-contradictory. Perhaps you could
             | explain your reasoning further?
        
               | gpvos wrote:
               | Doing bad things is their idea of fun.
        
               | ksoped wrote:
               | You gotta know the rules to bend the rules
        
               | Scarbutt wrote:
               | It's called satire.
        
               | chaorace wrote:
               | It lets you hold hands with strangers
        
             | 876978095789789 wrote:
             | He appears (if his logos are anything to go by) to be a
             | flat UI guy. I doubt any of these people know what they're
             | doing.
        
           | Arech wrote:
           | I'd say this annoying trick is highly appropriate for the
           | topic!
        
           | spacechild1 wrote:
           | It is obviously a joke (and a good one, I dare say). The fact
           | that people seem to take it seriously says something about
           | the contemporary state of webdesign :)
        
             | mplewis wrote:
             | It would be a better joke if there were an option to turn
             | the joke off. As it is, dark mode doesn't exist and the
             | pointers occlude text.
        
               | spacechild1 wrote:
               | > It would be a better joke if there were an option to
               | turn the joke off.
               | 
               | As others have pointed out, reader mode works as
               | expected.
        
           | LordDragonfang wrote:
           | It's deeply ironic that an article about dealing with text
           | properly has images _which are part of the article text_ and
            | yet _have no alt-text_, rendering parts of the article
           | unreadable in reader mode if the server is slow.
        
         | lifeinthevoid wrote:
         | yup, pretty annoying
        
         | nigma1337 wrote:
         | Distracted me from reading the article, I just started chasing
         | other people around.
        
         | zzzeek wrote:
            | yeah....why on _earth_ would someone want their webpage to do
            | this, especially if they have text they'd presumably want you
            | to read?
        
           | fragmede wrote:
           | Have you ever read with other people, like in school or a
           | book club, or been somewhere that there were other people
           | around? It's an interesting move by the author; the
           | loneliness epidemic hasn't gone unnoticed.
           | 
           | eg https://www.npr.org/2023/05/02/1173418268/loneliness-
           | connect...
        
           | WD-42 wrote:
           | It's cute, and provides a hint of human connection that is
           | otherwise absent on the web "hey, another human is reading
           | this too!" which you probably know but something about seeing
           | the pointer move makes it feel real.
           | 
           | Probably not the greatest during a hacker news hug of death,
           | but if I read that article some other time and saw one of the
           | moving pointers, I would think it was really cool.
        
         | pookha wrote:
         | Good times. If you click on the sun switch the entire UI gets
         | zeroed out and you get to use on:hover mouse shtick to read the
         | UI through a fuzzy radius. Is Yoko Ono designing websites now?
        
           | WD-42 wrote:
           | It's a joke. It made me laugh.
        
         | pests wrote:
         | I know which site you are talking about before even clicking
         | the article :(
        
         | aragonite wrote:
         | I've been drawing circles for over a minute now and no one has
         | joined me yet, so I conclude those movements are random rather
         | than made by intelligent beings. :)
        
           | KyleBerezin wrote:
           | That makes me think of this old gem
           | https://imgur.com/gallery/BgKFcI9
        
       | nottorp wrote:
       | > Since everybody in the world agrees on which numbers correspond
       | to which characters, and we all agree to use Unicode, we can read
       | each other's texts.
       | 
       | Hmm? I thought some code points combine to create a character.
       | Even accented latin ones can be like that.
       | 
       | Also we need to agree on what is a character.
        
         | JohnFen wrote:
         | > Also we need to agree on what is a character.
         | 
         | Indeed. I used to think I knew what a character was until
         | Unicode came around. Now I genuinely don't know with any real
         | certainty.
        
       | wyldfire wrote:
       | > "I know, I'll use a library to do strlen()!" -- nobody, ever.
       | 
        | The standard library provided by languages like C, C++ _is_ a
        | library. Features like character strings are present, and it's a
        | totally reasonable expectation for the length to give you the
        | cluster count.
        
         | AnimalMuppet wrote:
          | No, for C and C++, which are close to the hardware, it's
          | totally reasonable to expect strlen() to give you the _byte_
          | count. You don't allocate memory for buffers based on the
          | cluster count.
         | 
         | If you want cluster count, call a different function.
        
         | macintux wrote:
         | Given that strlen() predates Unicode by...30 years(?) - it's
         | not terribly surprising that isn't a viable approach.
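[Editor's note: both views above can be illustrated from Python, which exposes the byte view and the codepoint view separately; grapheme-cluster counting needs a third-party library:]

```python
s = "\u00e9"  # "é", a single user-perceived character
b = s.encode("utf-8")

print(len(b))  # 2: the byte count, which is what C's strlen() reports
print(len(s))  # 1: the codepoint count
```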
        
       | diego_sandoval wrote:
       | Extended Grapheme Cluster should be understood as Extended
       | (Grapheme Cluster) or as (Extended Grapheme) Cluster?
        
       | heldrida wrote:
        | The mouse cursor is really annoying; I stopped reading for that
        | reason.
        
       | zackmorris wrote:
       | _The only modern language that gets it right is Swift:_
       | 
       | Apple did a fairly good job with unicode string handling starting
       | in Cocoa and Objective-C, by providing methods to get the number
       | of code points and/or bytes:
       | 
       | https://stackoverflow.com/questions/15582267/cfstring-count-...
       | 
       | I feel that this support of both character count and buffer size
       | in bytes is probably the way to go. But Python 3 went wrong by
       | trying to abstract it away with encodings that have unintuitive
       | pitfalls that broke compatibility with Python 2:
       | 
       | https://blog.feabhas.com/2019/02/python-3-unicode-and-byte-s...
       | 
       | There's also the normalization issue. Apple goofed (IMHO) when
       | they used NFD in HFS+ filenames while everyone else went with
       | NFC, but fixed that in APFS:
       | 
       | https://unicode.org/faq/normalization.html
       | 
       | https://medium.com/@sthadewald/the-utf-8-hell-of-mac-osx-fee...
        
       | beders wrote:
       | Tonsky, dude.
       | 
       | I stopped reading your article because of your little websocket
       | experiment.
        
       | justrealist wrote:
       | I don't want to be too full of myself here, but I'm a very
       | skilled and highly paid backend software engineer who knows
       | roughly nothing about unicode (I google what I need when a file
       | seems f'd up), and it's never been a problem for me.
       | 
       | I'm sure the article is good but the title is nonsense.
        
         | bigstrat2003 wrote:
          | The title is definitely nonsense. The reality is that most
          | people will never need to know the gritty details of how to
          | encode or decode UTF-8. The article _is_ interesting, but I
         | was pretty put off with how the author led with such a
         | hyperbolic (and untrue) claim.
        
       | moelf wrote:
       | > The only modern language that gets it right is Swift:
       | 
        | arguably not true:
        | 
        |     julia> using Unicode   # for some reason HN doesn't allow emoji
        | 
        |     julia> graphemes(" ")
        |     length-1 GraphemeIterator{String} for " "
        | 
        |     help?> graphemes
        |     search: graphemes
        | 
        |       graphemes(s::AbstractString) -> GraphemeIterator
        | 
        |       Return an iterator over substrings of s that correspond to
        |       the extended graphemes in the string, as defined by Unicode
        |       UAX #29. (Roughly, these are what users would perceive as
        |       single characters, even though they may contain more than
        |       one codepoint; for example a letter combined with an accent
        |       mark is a single grapheme.)
        
         | JRaspass wrote:
         | Raku also gets it right.
        
         | gwbas1c wrote:
         | Julia is not a major language like Swift.
        
         | SyrupThinker wrote:
         | I imagine the author would disagree with that because it does
         | not have the "right" behavior by default.
         | 
         | For example indexing and length of the string are done by
         | codeunit. [1]
         | 
          | On the other hand, Raku's Str type does behave similarly to
          | Swift's: indexing, length and iteration by grapheme; view
          | methods for specific encodings. [2]
         | 
         | [1]: https://docs.julialang.org/en/v1/base/strings/ [2]:
         | https://docs.raku.org/type/Str#routine_chars
        
       | dathinab wrote:
       | > The only modern language that gets it right is Swift:
       | 
       | I disagree.
       | 
       | What is the "right" things is use-case dependent.
       | 
       | For UI it's glyph bases, kinda, more precise some good enough
       | abstraction over render width. For which glyphs are not always
       | good enough but also the best you can get without adding a ton of
       | complexity.
       | 
       | But for pretty much every other use-case you want storage byte
       | size.
       | 
       | I mean in the UI you care about the length of a string because
       | there is limited width to render a strings.
       | 
       | But everywhere else you care about it because of (memory)
       | resource limitations and costs in various ways. Weather that is
       | for bandwidth cost, storage cost, number of network packages,
       | efficient index-ability, etc. etc. In rare cases being able to
       | type it, but then it's often us-ascii only, too.
        
         | hot_gril wrote:
         | Swift made an effort to handle grapheme clusters but severely
         | over-complicated strings by exposing performance details to
         | users. Look at the complex SO answers to what should be simple
         | questions, like finding a substring:
         | https://news.ycombinator.com/item?id=32325511 , many of which
         | changed several times between Swift versions
         | 
         | I was working on an app in Swift that needed full emoji support
         | once. Team ended up writing our own string lib that stores
         | things as an array of single-character Swift strings.
        
           | marcellus23 wrote:
           | > many of which changed several times between Swift versions
           | 
           | This was true while Swift was developing but it's been stable
           | now for several years. At some point that complaint is no
           | longer valid.
        
             | hot_gril wrote:
             | You still see all the answers from old versions sitting
             | around, often at the top. Part of it is because of how
             | often they changed such fundamental things. String length
             | changed 3 times. Every other language figured these things
             | out before the initial non-beta release.
        
               | marcellus23 wrote:
               | The last time the string API changed was in 2017. That
               | was 6 years ago.
        
           | hot_gril wrote:
           | Also, realized "needed full emoji support" sounds silly. It
           | needed to do a lot of string manipulation, with extended
           | grapheme clusters in mind, mainly for the purpose of emojis.
        
         | layer8 wrote:
         | Arguably, you don't need any (default) length at all, just
         | different views or iterators. When designing a string type
         | today, I wouldn't add any single distinguished length method.
        
         | galad87 wrote:
         | Swift string type has got many different views, like UTF-8,
         | UTF-16, Unicode Scalar, etc... so if you want to count the
         | bytes or cut over a specific byte you still can.
        
           | dathinab wrote:
            | that's not the issue
            | 
            | defaults matter
            | 
            | as in, they should be things you can just use by default
            | without thinking about it
            | 
            | as swift is deeply rooted in UI design, having a default of
            | glyphs makes sense
            | 
            | and as rust is deeply rooted in unix server and system
            | programming, utf-8 bytes make a lot of sense
            | 
            | though the moment your language becomes more general purpose,
            | you could argue having a default in any way is wrong and it
            | should have multiple more explicit methods.
        
             | toast0 wrote:
              | > as in, they should be things you can just use by default
              | without thinking about it
             | 
             | That time has passed. If you want to know the length of a
             | string, you really should indicate what length type you
             | mean.
        
               | hot_gril wrote:
               | There was no string.length in Swift for a while. Then
               | they added one that just does what the user expects, get
               | the number of grapheme clusters. If a user figures out
               | that this isn't what they want, they can go use the other
               | length method.
        
         | patrickas wrote:
         | That is why I like the way Raku handles it.
         | 
         | It has distinct .chars, .codes and .bytes that you can specify
         | depending on the use case. And if you try to use .length it
         | complains, asking you to use one of the other options to
         | clarify your intent.
         | 
         |     my \emoji = "\c[FACE PALM]\c[EMOJI MODIFIER FITZPATRICK TYPE-3]\c[ZERO WIDTH JOINER]\c[MALE SIGN]\c[VARIATION SELECTOR-16]";
         |     say emoji;                       # Will print the character
         |     say emoji.chars;                 # 1 because one character
         |     say emoji.codes;                 # 5 because five code points
         |     say emoji.encode('UTF8').bytes;  # 17 because encoded UTF-8
         |     say emoji.encode('UTF16').bytes; # 14 because encoded UTF-16
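For readers more at home in Python, a quick sketch of the same counts. Note that Python's `len` counts code points, so the grapheme count of 1 is not available from the standard library alone:

```python
# The same five-code-point emoji sequence as in the Raku example:
# FACE PALM + FITZPATRICK TYPE-3 + ZWJ + MALE SIGN + VARIATION SELECTOR-16.
emoji = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

print(len(emoji))                      # 5  -> Python's len counts code points
print(len(emoji.encode("utf-8")))      # 17 -> bytes when encoded as UTF-8
print(len(emoji.encode("utf-16-le")))  # 14 -> bytes when encoded as UTF-16
```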
        
       | aembleton wrote:
       | In Java/Kotlin, I've found this Grapheme Splitter library to be
       | useful: https://github.com/hiking93/grapheme-splitter-lite
        
       | phforms wrote:
       | Regarding UTF-8 encoding:
       | 
       | "And a couple of important consequences:
       | 
       | - You CAN'T determine the length of the string by counting bytes.
       | 
       | - You CAN'T randomly jump into the middle of the string and start
       | reading.
       | 
       | - You CAN'T get a substring by cutting at arbitrary byte offsets.
       | You might cut off part of the character."
       | 
       | One of the things I had to get used to when learning the
       | programming language Janet is that strings are just plain byte
       | sequences, unaware of any encoding. So when I call `length` on a
       | string of one character that is represented by 2 bytes in UTF-8
        | (e.g. `ä`), the function returns 2 instead of 1. Similar issues
       | occur when trying to take a substring, as mentioned by the
       | author.
       | 
       | As much as I love the approach Janet took here (it feels clean
       | and simple and works well with their built-in PEGs), it is a bit
       | annoying to work with outside of the ASCII range. Fortunately,
       | there are libraries that can deal with this issue (e.g.
       | https://github.com/andrewchambers/janet-utf8), but I wish they
       | would support conversion to/from UTF-8 out of the box, since I
       | generally like Janet very much.
       | 
        | One interesting thing I learned from the article is that you
        | can always tell from a byte's prefix whether it starts a new
        | character. I always wondered how you would recognize/separate a
        | Unicode character in a Janet string, since it may be 1-4 bytes
        | long, but I guess
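A small Python sketch of that prefix rule: in UTF-8 every continuation byte matches the bit pattern 10xxxxxx, so any byte that doesn't is the start of a new code point, and cutting at an arbitrary byte offset can fail loudly:

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the form 10xxxxxx.
    return byte & 0b1100_0000 == 0b1000_0000

def char_starts(data: bytes) -> list[int]:
    # Byte offsets where a new code point begins.
    return [i for i, b in enumerate(data) if not is_continuation(b)]

data = "naïve".encode("utf-8")  # ï (U+00EF) takes two bytes
print(char_starts(data))        # [0, 1, 2, 4, 5]

# Cutting at byte offset 3 lands in the middle of ï:
try:
    data[:3].decode("utf-8")
except UnicodeDecodeError as e:
    print("cut mid-character:", e.reason)
```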
        
       | thefringthing wrote:
       | > Unicode is a standard that aims to unify all human languages,
       | both past and present, and make them work with computers.
       | 
       | This is doubly wrong.
       | 
       | First, it conflates languages and writing systems. Malay and
       | English use the same writing system but are different languages.
       | American Sign Language is a language, but it has no standard or
       | widely-adopted writing system. Hakka is a language, but Hakka
       | speakers normally write in Modern Standard Mandarin, a different
       | language.
       | 
        | Second, it's not the case that Unicode aims to encode all
       | writing systems. For example, there are many hobbyist neographies
       | (constructed writing systems) which will not be included in
       | Unicode.
        
       | bagasme wrote:
        | The article doesn't mention how to resolve string manipulation
        | problems involving locales.
        
       | overflyer wrote:
       | [flagged]
        
       | [deleted]
        
       | badcppdev wrote:
       | Just a nitpick because the page says: "Unicode is a standard that
       | aims to unify all human languages, both past and present, and
       | make them work with computers." but of course unicode is only
       | relevant to written languages as opposed to spoken languages (and
       | signed languages)
       | 
       | I wish that was the only thing wrong with that page
        
       | coding123 wrote:
        | Honestly, "what encoding is this? UTF-8!" is still the only
        | thing we need to know. len(emoji) is still a corner case that few
       | will care about.
        
         | mcfedr wrote:
         | That's what everyone thinks, until the user sticks an emoji in
         | the name field
        
       | kipcole9 wrote:
       | > The only modern language that gets it right is Swift:
       | 
       | Elixir too:                 Interactive Elixir (1.15.4) - press
       | Ctrl+C to exit (type h() ENTER for help)       iex(1)>
       | String.length "w[?][?][?]or[?][?][?]d[?]"       4
        
       | rkagerer wrote:
       | " _the definition of graphemes changes from version to version_ "
       | 
       | In what twisted reality did someone think this a good idea?
       | 
       | Doesn't it go against the whole premise of everyone in the world
       | agreeing on how to represent a meaningful unit of text?
       | 
       | " _What's sad for us is that the rules defining grapheme clusters
       | change every year as well. What is considered a sequence of two
       | or three separate code points today might become a grapheme
       | cluster tomorrow! There's no way to know! Or prepare!_ "
       | 
       | " _Even worse, different versions of your own app might be
       | running on different Unicode standards and report different
       | string lengths!_ "
        
         | rkagerer wrote:
         | I can sympathize why some programmers would prefer to stick
         | their heads in the sand and stick to ASCII.
        
       | everyone wrote:
       | I'm always gonna point out these overly broad titles assuming
       | "every software developer" is some kind of internetty web dev
       | type. I'm a game dev, I try and never touch strings at all, they
       | are a nightmare data type. Strings in a game are like graphics or
        | audio assets: your game might read them and show them to the player,
       | but they should never come anywhere near your code or even be
       | manipulated by it. I dont need to know any of that stuff about
       | Unicode.
        
       | dekken_ wrote:
       | Am I supposed to hate this website, cause I kinda do
        
         | melx wrote:
         | Try the night mode (top right corner)...
         | 
         | It's black text on black background (I'm on mobile Firefox on
         | Android).
        
           | Lewton wrote:
           | Night mode is an absolute delight on desktop, you're missing
           | out
        
           | jamincan wrote:
           | On desktop your mouse pointer is a flashlight. I wonder if it
           | supports touch.
        
         | leokennis wrote:
         | Toggle the dark mode for a real treat.
        
           | sebstefan wrote:
           | Now that is really funny.
           | 
           | Future _improvement_ idea: the mouse cursors are shared, so
           | the light switch should be, too! Let me play with the light
           | with everyone
        
         | cormullion wrote:
         | It's not pleasant to read. Strange, since Tonsky is the curator
         | of the Fira Code font, and would presumably be interested in
         | presentation
        
         | MrResearcher wrote:
         | uBlock Origin -> Disable Javascript
         | 
         | Problem solved!
        
           | eptcyka wrote:
           | Firefox reader mode is better still.
        
           | hwillis wrote:
           | That breaks the video. inspect -> network -> refresh ->
           | blocking the request for pointers.js works.
        
             | joveian wrote:
             | It doesn't on Firefox, you get the built in media controls.
        
         | bqmjjx0kac wrote:
         | The mustard background with black text is harsh on the eyes.
        
           | permo-w wrote:
           | strange. I quite like it
        
             | Nevermark wrote:
             | Me too. I get the impression of a very saturated off-white
             | yellow.
             | 
             | But any more saturation and it would go all mustard-
             | electric on me.
             | 
             | That's an interesting observation on variation of
             | saturation response. Feels like useful knowledge for ...
             | web site designers. Or any color crafter.
        
         | anymouse123456 wrote:
         | FWIW - Right Click, Inspect. There's a div with an attribute,
         | "pointers" in the body root.
         | 
          | Deleting that makes the whole thing a lot less stressful.
        
       | jordanrobinson wrote:
       | Anyone know what the story is behind the "Weird Emoji" around
       | 140000 on the map?
        
         | Findecanor wrote:
         | The E0000-E007F block is the "Tags" block, which is used for
         | flag emojis.
         | 
         | But there is not a code for each flag. Instead there is a code
         | for each ASCII character. A flag sequence is formed from
         | U+1F3F4 (Black Flag), followed by at least two tags that form a
         | country/region code, and then U+E007F (End tag).
         | 
         | So, yes this is weird, because the emoji is dependent on the
         | decoder. It was made this way to keep Unicode independent of
         | geopolitics.
         | 
         | Read more: <https://en.wikipedia.org/wiki/Tags_(Unicode_block)>
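A sketch of such a tag sequence in Python (the flag of Scotland, region code gbsct, is a commonly cited example; each ASCII letter maps to a tag character at U+E0000 plus its code):

```python
BLACK_FLAG = "\U0001F3F4"  # U+1F3F4 WAVING BLACK FLAG
CANCEL_TAG = "\U000E007F"  # ends the tag sequence

def tag_flag(region: str) -> str:
    # Each ASCII letter of the region code maps to a "tag" character
    # at U+E0000 + its code; a renderer may show a subdivision flag.
    tags = "".join(chr(0xE0000 + ord(c)) for c in region.lower())
    return BLACK_FLAG + tags + CANCEL_TAG

scotland = tag_flag("gbsct")
print([f"U+{ord(c):04X}" for c in scotland])
# ['U+1F3F4', 'U+E0067', 'U+E0062', 'U+E0073', 'U+E0063', 'U+E0074', 'U+E007F']
```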
        
       | m3kw9 wrote:
       | Unicode looks like a big over engineered standard that had 50
       | hands trying to put their mark in
        
         | ebiester wrote:
         | It looks like that because Unicode is trying to solve a problem
         | that everyone thinks is easy until they uncover the true extent
         | of encoding human languages.
        
           | eviks wrote:
           | How does this explain surrogate pairs?
        
             | jfultz wrote:
             | Surrogate pairs were new to Unicode 2.0. Unicode 1.0 didn't
             | anticipate the need for more than 65,536 code points (who
             | would ever need more?); the main perceived threat to that
             | limit having been resolved by Han unification.
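For illustration, a Python sketch of how a code point beyond U+FFFF gets split into a surrogate pair in UTF-16:

```python
ch = "\U0001F600"  # GRINNING FACE, outside the Basic Multilingual Plane

# The standard algorithm: subtract 0x10000, then split into two
# 10-bit halves offset from 0xD800 (high) and 0xDC00 (low).
v = ord(ch) - 0x10000
high = 0xD800 + (v >> 10)
low = 0xDC00 + (v & 0x3FF)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Python's UTF-16 encoder produces exactly these two code units:
assert ch.encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```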
        
               | eviks wrote:
               | Ok, but that doesn't answer the question; it's more of an
               | indication that those design(at)s didn't uncover "the
               | true extent" until years later
        
       | hot_gril wrote:
       | This is a lot more than the minimum that _every_ software dev
       | must know about Unicode. Even if you only do web frontends, you
       | will do fine not knowing most of this. Still a nice read, though.
        
       | permo-w wrote:
       | >That gives us a space of about 11 million code points. About
       | 170,000, or 15%, are currently defined. An additional 11% are
       | reserved for private use. The rest, about 800,000 code points,
       | are not allocated at the moment. They could become characters in
       | the future.
       | 
       | 1.1 million?
        
         | run414 wrote:
         | Yeah, the author's numbers are off by a "0". It should be
         | "1,700,000" and "8,000,000".
        
       | hyggetrold wrote:
       | Is there a way to read this with the mouse cursors disabled? It
       | seems like great content but all the movement on the page is way
       | too distracting.
       | 
       | EDIT: I've never been downvoted for asking a question before.
       | Weird, but okay.
        
       | WillAdams wrote:
       | Just had this come up at work --- needed a checkbox in Microsoft
       | Word --- oddly the solution to entering it was to use the numeric
       | keypad, hold down the alt key and then type out 128504 which
       | yielded a check mark when the Arial font was selected _and_
       | unlike Insert Symbol and other techniques didn't change the font
       | to Segoe UI Symbol or some other font with that symbol.
       | 
       | Oddly, even though the Word UI indicated it was Arial, exporting
       | to a PDF and inspecting that revealed that Segoe UI Symbol was
       | being used.
       | 
       | As I've noted in the past, "If typography was easy, Microsoft
       | Word wouldn't be the foetid mess which it is."
        
         | uxp8u61q wrote:
         | That's unrelated to unicode. The checkmark symbol just isn't in
         | the Arial font, so Word just falls back to a font that has it -
         | Segoe UI. You've found a bug where Word still thinks it's
          | Arial. But this is something that would have happened no matter what
         | encoding you choose for your characters.
        
       | Tomte wrote:
       | > They will look the same (Å vs Å)
       | 
       | No. In my browser the first Å has the ring glued to it, and the
       | second has a little gap.
        
       | [deleted]
        
       | layer8 wrote:
       | https://archive.ph/LtKk0
        
         | neonate wrote:
         | http://web.archive.org/web/20231002163213/https://tonsky.me/...
        
       | makeworld wrote:
       | Really great article. Hitting all the points I would expect.
        
       | bumbledraven wrote:
        | > what do you think "w[?][?][?]or[?][?][?]d[?]".length should be?
       | 
       | This is a nice example of the kind of thing we need to think
       | about when defining a measure of length for Unicode strings.
        
         | danbruc wrote:
         | Four. Obviously.
         | 
         | The more interesting question is whether the Unicode rules
         | actually give that answer.
         | 
         | EDIT: Just checked it using the first online tool [1] that came
         | up and it indeed says four. So all is good.
         | 
         | [1] https://onlinetools.com/unicode/extract-unicode-graphemes
        
           | masklinn wrote:
           | It should be 4 as long as you count the grapheme clusters
           | which is what e.g. Swift does (hence String#count being
           | O(n)).
           | 
            | In Javascript, you can get the same information through
            | Intl.Segmenter, which segments by grapheme cluster by default.
        
             | danbruc wrote:
             | You could also have it in O(1), just store and maintain it
             | as you usually store the length in bytes or code units. If
             | you had all your string operations like substring work with
             | grapheme clusters by default, which might arguably make
             | sense quite often, then that could actually be a good
             | decision. It might even make sense to maintain a list with
             | pointers to each grapheme cluster or of all the grapheme
             | cluster lengths together with the actual string data. Or
             | maybe not, would probably depend heavily on the workload.
        
       | Karellen wrote:
       | > The simplest possible encoding for Unicode is UTF-32. It simply
       | stores code points as 32-bit integers.
       | 
       | Skipping over UTF-32-BE and UTF-32-LE there...
       | 
       | (I mean, it might not be an issue if it's just being used as an
       | internal representation, but still)
        
       | gh0stcloud wrote:
        | The article's background color deserves to be named:
       | https://colornames.org/color/fddb29
        
       | charcircuit wrote:
        | The number of grapheme clusters in a string depends on the font
        | being used. The length of a string should be the number of code
        | points, because that is not font specific.
       | 
       | Better yet, there shouldn't be a function called length.
        
       | qwerty456127 wrote:
       | > People are not limited to a single locale. For example, I can
       | read and write English (USA), English (UK), German, and Russian.
       | Which locale should I set my computer to?
       | 
       | Ideally - the "English-World" locale is supposedly meant for us,
       | cosmopolitans. It's included with Windows 10 and 11.
       | 
       | Practically, as "English-World" was not available in the past
       | (and still wasn't available on platforms other than Windows the
       | last time I checked), I have always been setting the locale to
       | En-US even though I have never been to America. This leads to a
       | number of annoyances though. E.g. LibreOffice always creates new
       | documents for the Letter paper format and I have to switch it to
       | A4 manually every time. It's even worse on Linux where locales
       | appear to be less easy to customize than in Windows. Windows
       | always offered a handy configuration dialog to granularly tweak
        | your locale, choosing what measurement system you prefer, whether
       | your weeks begin on sundays or mondays and even define your
       | preferred date-time format templates fully manually.
       | 
       | A less-spoken about problem is Windows' system-wide setting for
       | the default legacy codepage. I happen to use single-language
       | legacy (non-Unicode) apps made by people from a number of very
        | different countries. Some apps (e.g. I can remember the Intel UHD
       | Windows driver config app) even use this setting (ignoring the
       | system locale and system UI language) to detect your language and
       | render their whole UI in it.
       | 
       | > English (USA), English (UK)
       | 
       | This deserves a separate discussion. I doubt many English
       | speakers (let alone those who don't live in a particular
       | anglophone country) care to distinguish between English dialects.
       | To us presence of a huge number of these (don't forget en-AU, en-
       | TT, en-ZW etc - there are more!) in the options lists brings only
       | annoyance, especially when one chooses some non-US one and this
       | opens another can of worms.
       | 
        | By the way, I wonder how string capitalization and comparison
        | functions manage to work on computers of people who use both
       | English and Turkish actively (Turkish locale distinguishes
       | between dotted and undotted I).
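Most languages punt on this: default case mappings are locale-independent. A Python sketch of where that bites (Python follows Unicode's default rules, not the Turkish ones):

```python
# Default (locale-independent) Unicode case mapping:
print("i".upper())  # 'I' -- correct for English, wrong for Turkish,
                    #        where i uppercases to İ (U+0130)

# Lowercasing İ yields "i" plus a combining dot above, per Unicode's
# SpecialCasing rules -- two code points, not one:
s = "\u0130".lower()
print(len(s), [hex(ord(c)) for c in s])  # 2 ['0x69', '0x307']
```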
        
         | __d wrote:
         | I write daily in US English, Australian English, and Austrian
         | German. Most of the time, a specific document is in one
         | dialect/language or another: not mixed, although sometimes
         | that's not true.
         | 
         | I can understand that the conflation of spelling, word choices,
         | time and date formatting, default paper sizes, measurement
         | units, etc, etc, is convenient, and works a lot of the time,
         | but it really doesn't work for me at all.
         | 
         | That said, I appreciate that I occupy a very small niche.
        
         | masklinn wrote:
         | > English (USA), English (UK)
         | 
         | > This deserves a separate discussion. I doubt many English
         | speakers (let alone those who don't live in a particular
         | anglophone country) care to distinguish between English
         | dialects.
         | 
         | While that is generally (though not always) true, I would
          | assume it's really a stand-in for the much more relevant zh
         | locales.
         | 
         | It is also rather relevant to es locales (america spanish has
         | diverged quite a bit from europe spanish hence the creation of
          | es-419), definitely french (canadian french, to a lesser extent
         | belgian and swiss), and german (because swiss german). And it
         | might be relevant for ko if north korea ever stops being what
         | it is.
        
           | [deleted]
        
         | dizhn wrote:
          | i İ
          | 
          | ı I
          | 
          | I sympathize with people who get this wrong. (I just saw some
          | YouTube video have a title TURKIYE in a segment)
          | 
          | Even google keyboard can't seem to distinguish between İ and
          | I. When I type "It", it suggests "It's" which is quite
          | pathetic.
        
         | uxp8u61q wrote:
         | > I have always been setting the locale to En-US even though I
         | have never been to America. This leads to a number of
         | annoyances though. E.g. LibreOffice always creates new
         | documents for the Letter paper format and I have to switch it
         | to A4 manually every time
         | 
         | > I doubt many English speakers (let alone those who don't live
         | in a particular anglophone country) care to distinguish between
         | English dialects. To us presence of a huge number of these
         | (don't forget en-AU, en-TT, en-ZW etc - there are more!) in the
         | options lists brings only annoyance, especially when one
         | chooses some non-US one and this opens another can of worms.
         | 
         | Well, you just explained what this plethora of options is
         | about. It's not just about how you spell flavor/flavour. It's a
         | lot of different defaults for how you expect your OS to present
         | information to you. Default paper size, but also how to write
         | date and time, does the week start on Monday, Sunday, or
         | something else, etc.
        
         | hahn-kev wrote:
          | As much as I appreciate that, I always wondered how many
         | programs actually respect all those tweaks.
        
         | DoughnutHole wrote:
         | > I doubt many English speakers care to distinguish between
         | English dialects
         | 
         | It's worthwhile purely for the sake of autocorrect/typo
         | highlighting in text-editing software. I don't miss the days of
         | spelling a word correctly in my version of English but still
         | being stuck with the visual noise of red highlighting up and
         | down the document because it doesn't conform to US English.
        
           | BoxOfRain wrote:
           | Yeah I'd rather not have my British English dialect seen as
           | second-class in a world of American English ideally which is
           | what having a red document full of 'errors' implies in those
           | sorts of situations.
           | 
           | It's sometimes not a trivial distinction either, for example
           | I've heard of cases where surprised British redditors have
           | found themselves banned from American subreddits for being
           | homophobic when they were actually talking innocently enough
           | about cigarettes!
        
             | OfSanguineFire wrote:
             | I would think a lot of mods, who are either Highly Online
             | Americans or their weirdo equivalents in other countries,
             | are well aware of the UK usage, but simply expect Brits to
             | give it up in order to avoid offending Americans and the
             | global Reddit community that largely takes American-style
             | sensitivity as its orthodoxy. And considering that Reddit
             | corporate feels that anything that could stir up such
             | outrage is bad for business, mods of popular subreddits may
             | well feel pressured to come down hard on these matters.
        
               | bluGill wrote:
                | It doesn't matter if you use UK or US spelling, you are
                | wrong either way. I wish we would adopt the
                | International Phonetic Alphabet; then I might have a
                | chance of spelling things correctly.
        
         | lucideer wrote:
         | As an Irish person, while we have en_IE which is great (and
         | solves most of the problems you list re: Euro-centric defaults
         | + English), I'd still quite like to have an even more broad /
         | trans-language / "cosmopolitan" locale to use.
         | 
         | I mainly type in English but occasionally other languages - I
         | use a combination of Mac & Linux - macOS has an (off-by-default
         | but enable-able) lang-changer icon in the tray that is handy
         | enough, but still annoying to have to toggle. Linux is much
         | worse.
         | 
         | Mac also has quite a nice long-press-to-select-special
         | character that at least makes for accessible (if not efficient)
         | typing in multiple languages while using an English locale.
         | Mobile keyboards pioneered this (& Android's current one even
         | does simultanous multi-lang autocomplete, though it severely
         | hurts accuracy).
         | 
         | ---
         | 
         | > _I doubt many English speakers care to distinguish between
         | English dialects._
         | 
         | I think you'll find the opposite to be true. US English
         | spellings & conventions are quite a departure from other
         | dialects, so typing fluidly & naturally in any non-US dialect
         | is going to net you a world of autocorrect pain in en_US. To
         | the extent it renders many potentially essential spelling &
         | grammar checkers completely unusable.
        
           | jdblair wrote:
           | I can 2nd this as an American who now resides in Europe. My
           | first laptop I brought with me, and was defaulted to en_US,
           | but my replacement is en_GB (Apple doesn't have en_NL, for
           | good reason).
           | 
           | I don't find it "unusable", though. I could change it back to
           | en_US, but it has actually been interesting to see all of my
           | American spellings flagged by autocorrect. Each time I write
           | authorize instead of authorise it is an act of stubborn group
           | affinity!
        
           | TRiG_Ireland wrote:
           | > US English spellings & conventions are quite a departure
           | from other dialects.
           | 
           | As far as the written, formal language is concerned, English
           | really has only three dialects: US American, Canadian, and
           | everywhere else. There are some other subtle differences
           | (such as "robots" for traffic lights in South Africa, or
            | "minerals" for fizzy drinks in Ireland[1]), but that's pretty
           | much it.
           | 
            | [1] Yes, this isn't just slang in Ireland: the formal, pre-
           | recorded announcements on trains use it: "A trolley service
           | will operate to your seat, serving tea, coffee, minerals and
           | snacks." The corresponding Irish announcement renders it
            | mianraí. Food service on trains stopped during covid and has
           | not yet resumed, so I'm working from distant memory now.
        
             | lucideer wrote:
             | > _As far as the written, formal language is concerned,
             | English really has only three dialects_
             | 
             | This is true, but I don't see why the "formal" qualifier is
             | needed here :) There are much more than 3 dialects of
             | English, both written & spoken.
             | 
             | Especially there's a fair few extremely common notable
             | differences in (casual, written) Irish English: the word
             | "amn't" (among other less common contractions), the
             | alternative present tense of the verb "to be" (i.e. "do
             | be"), various regional plurals of "you", and - perhaps the
             | most common - prepositional pronouns, etc. etc.
        
               | TRiG_Ireland wrote:
               | Well, quite. If we include any one or more of the
               | following three categories -- formal spoken language,
               | informal spoken language, informal written language --
               | then there's definitely far more than three dialects of
               | English. But formal spoken language really has only the
               | three.
        
               | phantom784 wrote:
               | I guess it's a question as to how many varieties of
               | spelling you want to make available as "translations" in
               | software (e.g. color vs colour, tire vs tyre).
               | 
               | There's plenty of regional variants just within the US,
               | but "en_us" covers the whole country.
        
           | l72 wrote:
           | I write in multiple languages daily on Linux, including
           | English, Russian, and Chinese. Switching keyboards (at least
           | with gnome) is a simple super-space.
           | 
            | While in my default (English) layout, it is easy enough to
            | add in accents and other characters using the compose key
            | (right alt). So right-alt+'+a = á, or right-alt+"+u = ü. I
            | much prefer this over the long press as I can do it quickly
            | and seamlessly without having to wait on feedback. Granted,
            | it is not as discoverable, but once you are comfortable it
            | is, in my opinion, a better system.
        
         | notatoad wrote:
         | > I doubt many English speakers care to distinguish between
         | English dialects
         | 
         | I think you'd be surprised how many english (UK) people will
         | get pissed off when their spell-checker starts removing the "u"
         | from colour or flavour, or how many English (US) people get
         | pissed off when the spellchecker starts suggesting random "u"s
         | to words.
         | 
         | additionally to that, locale isn't just about language. English
         | (US) and English (UK) decides whether your dates get formatted
         | DD-MM-YY or MM-DD-YY, whether your numbers have the thousands
         | broken by commas or spaces, and a host of other localization
         | considerations with a lot more significance than just the
         | dialect of english.
        
           | TRiG_Ireland wrote:
           | I'd really like an en-GB-oxendict (British English but
           | favouring -ize over -ise) locale for formal writing.
        
           | aksss wrote:
           | I worked for BP for a while (well, as a contracted coder) and
           | I got quite used to the UK spell check correcting everything
            | to its idiom. Everything seemed wrong once I returned to a world
           | that dismissed the value of the letter 'U' and preferred the
           | letter 'Z' over 'S'. Also missed the normalizing of drinking
           | beer at lunch.
        
         | grotorea wrote:
         | > Practically, as "English-World" was not available in the past
         | (and still wasn't available on platforms other than Windows the
         | last time I checked), I have always been setting the locale to
         | En-US even though I have never been to America. This leads to a
         | number of annoyances though. E.g. LibreOffice always creates
         | new documents for the Letter paper format and I have to switch
         | it to A4 manually every time. It's even worse on Linux where
         | locales appear to be less easy to customize than in Windows.
         | Windows always offered a handy configuration dialog to
          | granularly tweak your locale, choosing what measurement system you
         | prefer, whether your weeks begin on sundays or mondays and even
         | define your preferred date-time format templates fully
         | manually.
         | 
          | There's the English (Denmark) locale for that on some platforms.
        
           | qwerty456127 wrote:
           | Thank you very much, I'll give it a try.
        
             | grotorea wrote:
             | It's a bit of a joke that doesn't have universal support.
             | Works on my phone. Apparently you can also try en_IE
             | (Ireland).
             | 
             | https://unix.stackexchange.com/questions/62316/why-is-
             | there-...
        
             | actualwitch wrote:
             | en-GB is also a good choice
        
               | carstenhag wrote:
               | Not really, no EUR and no metric units
        
       | loeg wrote:
       | Pretty clearly, "every software developer" doesn't need to
       | understand Unicode with this level of familiarity, much like
       | "every programmer" doesn't need to know the full contents of the
       | 114-page Drepper paper. For example, I work on a GUID-addressed
       | object store. Everything is in terms of bytes and 128-bit UUIDs.
       | Unicode is irrelevant to everyone on my team, and most adjacent
       | teams. There is lots of software like this.
        
       | [deleted]
        
       | JonChesterfield wrote:
       | Prior to this article, I knew graphemes were a thing and that
       | proper unicode software is supposed to count those instead of
       | bytes or code points.
       | 
       | I didn't know that unicode changes the definition of grapheme in
       | backwards incompatible fashion annually, so software which works
       | by grapheme count is probably inconsistent with other software
       | using a different version of the standard anyway.
       | 
       | I'm therefore going to continue counting bytes. And comparing by
       | memcmp. If the bytes look like unicode to some reader, fine.
       | Opaque string as far as my software is concerned.
        
         | tonsky wrote:
         | Good luck
         | https://mastodon.online/@alexeyten@mas.to/111166351426290784
        
         | dundarious wrote:
         | The point is that a byte focus will often frustrate users.
         | 
         | e.g., a TUI with columns will have to truncate "long" strings
         | in each column, and that truncation and column-separator
         | arrangement really should be grapheme aware.
         | 
         | e.g., a string search (for a name, let's say) should find Noël
         | regardless of whether the user input ë via combining characters
         | or the pre-composed version.
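The composed-vs-decomposed search problem described above can be sketched in a few lines of stdlib Python (`nfc_contains` is an illustrative name, not a real API):

```python
import unicodedata

def nfc_contains(haystack: str, needle: str) -> bool:
    """Substring search that treats composed and decomposed spellings
    as equal by normalizing both sides to NFC first."""
    return (unicodedata.normalize("NFC", needle)
            in unicodedata.normalize("NFC", haystack))

composed = "No\u00ebl"     # "Noël" with precomposed ë (U+00EB)
decomposed = "Noe\u0308l"  # "Noël" as e + combining diaeresis (U+0308)

print(composed == decomposed)              # False: different code points
print(nfc_contains(decomposed, composed))  # True after normalization
```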
        
         | slimsag wrote:
         | Two Unicode strings can be visually and semantically identical,
         | but not byte-equal.
        
       | zzzeek wrote:
       | I wondered about how to do simple text centering / spacing
       | justification, given graphemes showing string lengths that don't
       | match up with human-perceived characters, like 'Café' written
       | with a combining accent (python len('Cafe\u0301') returns 5,
       | even though we see four letters).
       | 
       | Found this! good to know about.
       | https://pypi.org/project/grapheme/ "A Python package for working
       | with user perceived characters. "
       | 
       | (apparently the article talks about this however the blog post is
       | largely unreadable due to dozens of animated arrow pointers
       | jumping all over the screen)
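The mismatch above can be reproduced with the stdlib alone; this sketch assumes the decomposed spelling of 'Café' (the third-party `grapheme` package linked above handles the general case):

```python
import unicodedata

cafe = "Cafe\u0301"  # 'Café' spelled with e + combining acute accent

print(len(cafe))                                # 5 code points
print(len(unicodedata.normalize("NFC", cafe)))  # 4 after composition

# NFC only helps when a precomposed code point exists; counting user-
# perceived characters in general (emoji, Hangul jamo, etc.) needs the
# UAX #29 segmentation that packages like `grapheme` implement.
```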
        
       | cryptonector wrote:
       | > Another unfortunate example of locale dependence is the Unicode
       | handling of dotless i in the Turkish language.
       | 
       | This isn't quite Unicode's fault, as the alternative would be to
       | have two codepoints each for `i` and `I`, one pair for the Latin
       | versions and one for the Turkish versions, and that would be very
       | annoying too.
       | 
       | Whereas the Russian/Bulgarian situation is different. There used
       | to be language tags in Unicode for that, but IIRC they got
       | deprecated, and maybe they'll have to get undeprecated.
        
       | pif wrote:
       | > The minimum every software developer must know about Unicode
       | 
       | Just a nitpick...
       | 
       | Once more, as is typical on HN, web programming is confused
       | with the entire universe of software development.
       | 
       | There are plenty of software realms where ASCII not only is
       | enough, but it actually MUST be enough.
        
         | lxgr wrote:
         | What do you mean by "must be enough"?
         | 
         | Not being able to support non-Latin scripts sounds more like a
         | limitation than a feature to me, although of course in many
         | contexts it's not in any individual organization's power to
         | overcome it.
        
         | 9dev wrote:
         | Well, proper Unicode support affects pretty much any area
         | handling data about, used by, or created by, humans. That's a
         | pretty broad scope, and certainly wider than just web software.
        
         | uxp8u61q wrote:
         | This kind of assertiveness leads to garbage like C++ still not
         | supporting UTF8 properly in 2023. My name contains diacritics.
         | I am so, so, _so_ tired of trying to work around information
         | systems - not just web frontends - designed by people who
         | don't care or, worse, don't want to care.
         | 
         | "Web" programmers can care all they want about Unicode, but if
         | the backend people didn't deal properly with text encoding,
         | then something will break no matter what.
         | 
         | > There are plenty of software realms where ASCII not only is
         | enough, but it actually MUST be enough.
         | 
         | Name one.
        
           | lelanthran wrote:
           | > This kind of assertiveness leads to garbage like C++ still
           | not supporting UTF8 properly in 2023. My name contains
           | diacritics.
           | 
           | UTF8 encoded diacritics work just fine in C++.
        
             | uxp8u61q wrote:
             | What do you mean by "work"? That you can store arbitrary
             | bytes in a string? That's a pretty low bar.
        
               | lelanthran wrote:
               | > What do you mean by "work"? That you can store
               | arbitrary bytes in a string? That's a pretty low bar.
               | 
               | That's all that's needed for a backend language.
               | 
               | The backend does not need to understand, or even
               | acknowledge the existence, of grapheme clusters. Because
               | the frontend is already having to understand all of this,
               | it should be normalising any multi-codepoint ambiguous
               | cluster anyway.
        
               | JohnFen wrote:
               | The backend never needs to do things like figure out how
               | long a string is or search for one string in a database
               | of other strings?
        
               | lelanthran wrote:
               | > The backend never needs to do things like figure out
               | how long a string
               | 
               | Not as measured by clusters, no.
               | 
               | > search for one string in a database of other strings?
               | 
               | Hence I said "normalisation". The frontend already has to
               | do all the unicode twiddling, it may as well normalise
               | the input too.
        
               | astrange wrote:
                | It does if it ever wants to trim, summarize, sort, or
                | compare strings for equality.
        
           | pif wrote:
           | > if the backend people didn't deal properly
           | 
           | You are right. It's not a frontend/backend issue. It's a "for
           | human" vs "not for human" issues. Personal names must be
           | treated in an international-friendly manner.
           | 
            | >> There are plenty of software realms where ASCII not only
            | >> is enough, but it actually MUST be enough.
            | >
            | > Name one
           | 
           | Joel himself described an example:
           | 
           | > It would be convenient if you could put the Content-Type of
           | the HTML file right in the HTML file itself, using some kind
           | of special tag. Of course this drove purists crazy... how can
           | you read the HTML file until you know what encoding it's in?!
           | Luckily, almost every encoding in common use does the same
           | thing with characters between 32 and 127, so you can always
           | get this far on the HTML page without starting to use funny
           | letters:
           | 
            | The content of a webpage must be expressible in any
            | supported language, but the HTTP protocol need not be. And
            | it would make no sense at all to add internationalization
            | to machine-to-machine protocols, where ASCII is enough and
            | has been enough for decades.
           | 
           | And if someone complains that ASCII only supports English,
           | well... suck it up! I'm Italian and work in French, still I
           | hate when a colleague sneaks in a comment not in English.
           | Professional software development happens in English.
        
             | astrange wrote:
              | HTTP does support content-type tags and Unicode in URLs,
              | which, funnily enough, comes in two different encodings:
              | punycode and percent escapes.
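Both encodings can be seen from Python's stdlib; the names renée and bücher.example here are purely illustrative:

```python
from urllib.parse import quote, unquote

# Percent-escaping covers paths and query strings: each UTF-8 byte
# of a non-ASCII character becomes %XX.
path = quote("/profile/renée")
print(path)  # /profile/ren%C3%A9e
assert unquote(path) == "/profile/renée"

# Punycode (via Python's IDNA codec) covers host names, which must
# stay ASCII on the wire.
host = "bücher.example".encode("idna")
print(host)  # b'xn--bcher-kva.example'
```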
        
             | uxp8u61q wrote:
             | > The content of a webpage is required to be expressed in
             | every supported language, but the HTTP protocol must not.
             | And it would make no sense at all to add
             | internationalization to intra-machines protocol, where
             | ASCII is enough and has been enough for decades.
             | 
              | I guess no URLs with funny characters then. "GET
              | /profile/renée" => 500 error, woohoo.
             | 
             | > And if someone complains that ASCII only supports
             | English, well... suck it up! I'm Italian and work in
             | French, still I hate when a colleague sneaks in a comment
             | not in English. Professional software development happens
             | in English.
             | 
             | Get over yourself, a lot of professional development
             | happens in languages other than English.
        
           | jeddy3 wrote:
           | I can name one. At my job we do the kind of embedded
            | programming where encoders inside machines send data to each
           | other. Like reading optical sensors and sending bits
           | indicating state to other controllers.
           | 
           | We absolutely do not "need" to know about Unicode, outside of
           | interest about other realms.
        
       | kajaktum wrote:
       | I am torn between supporting all languages (which easily leaks
       | into supporting emojis) versus just using the 90~ Latin
       | characters as the lingua franca.
       | 
       | Look, I would love to be able to read/write Sanskrit, Arabic,
       | Chinese, Japanese, etc. and share that content and have everyone
       | render and see the same thing. The problem is that I feel like
       | most of these are:
       | 
       | 1. a kind of an open problem
       | 
       | 2. very subjective
       | 
       | 3. very, very subjective, as what you see is mostly dictated
       | by the implementation (fonts)
       | 
       | For example, why does a gun emoji look like a water gun? Why
       | does the skull-and-crossbones symbol look so benign? In fact, it
       | is often used as a meme (see deadass :skull:). Why is the
       | basmala a single "character"?
       | 
       | In my opinion, people should just learn how to use kaomoji.
       | Granted, kaomojis rely on a lot more than the Latin characters
       | but they are at least artful, skillful, and a natural extension of
       | the "actual" languages.
       | 
       | > inb4 languages evolves
       | 
       | Yes, but it mostly happens naturally. I feel like what happens
       | today mostly happens at the whim of a few passionate people in
       | the standard.
        
         | zzo38computer wrote:
         | > I am torn between supporting all languages (which easily
         | leaks into supporting emojis) versus just using the 90~ Latin
         | characters as the lingua franca.
         | 
         | I don't want to support emoji either (and, I don't want emoji
         | on my computer), although in some cases, if it is really
         | necessary to be supported, they could be implemented just as
         | text characters instead of as colourful emoji, anyways.
         | 
         | For many purposes (e.g. computer codes) ASCII is good enough
         | (and actually even can be better since it avoids the security
         | problems of using Unicode). (Sometimes, character sets other
         | than ASCII can be used, e.g. APL character set for APL
         | programming.)
         | 
         | > Look, I would love to be able to read/write Sanskrit, Arabic,
         | Chinese, Japanese etc
         | 
         | I also would, but Unicode is bad enough that I would use other
         | ways of doing such a thing when possible (even writing my own
         | programs, etc). (If a program insists on Unicode, I might just
         | use ASCII only anyways, or write my own program)
         | 
         | Not everyone necessarily needs to see the same thing (if it is
         | text, rather than pictures of the text); a suitable character
         | set for that language can be used (with fixed pitch if
         | necessary, etc.), and a suitable font can be auto-selected on
         | the reader's computer according to the reader's preference.
         | 
         | So, I prefer to support all languages (where applicable;
         | sometimes it isn't), without using Unicode.
        
       | ggcampinho wrote:
       | Elixir also gets the length right, not only Swift.
        
       | Dudester230602 wrote:
       | Guys, you don't need to know that crap.
        
       | wickedsickeune wrote:
       | I'm sorry but the website design is extremely distracting. The
       | mouse pointers at least are easy to delete with the inspector;
       | the background color is not the best choice for reading material,
       | but the inexcusable part is the width of the content.
       | 
       | This content must be really awesome for someone to go through the
       | trouble of interacting with such a site.
        
         | [deleted]
        
       | francisofascii wrote:
       | I enjoyed how the timeline graphic included Joel's article.
       | Because my first thought was: hey, isn't this the same title?
        
       | gumby wrote:
       | This is quite a good write up. An answer to one of the author's
       | questions:
       | 
       | > Why does the fi ligature even have its own code point? No idea.
       | 
       | One of the principles of Unicode is round-trip compatibility.
       | That is, you should be able to read in a file encoded with some
       | obsolete coding system and write it out again properly. Maybe
       | frob it a bit with your unicode-based tools first. This is a good
       | principle, though less useful today.
       | 
       | So the fi ligature was in a legacy encoding system and thus must
       | be in Unicode. That's also why things like digits with a circle
       | around them exist: they were in some old Japanese character set.
       | Nowadays we might compose them with some zwj or even just leave
       | them to some higher level formatting (my preference).
        
         | sdrothrock wrote:
         | > they were in some old Japanese character set
         | 
         | This implies that they're obsolete, but they're not -- they're
         | still in very common use today. You can type them in Japanese
         | by typing maru ("circle") and the number, then pick it out
         | of the IME menu. Some IMEs will bring them up if you just type
         | the number and go to the menu, too. :)
        
           | gumby wrote:
           | Fair enough. I was thinking of them as obsolete, but
           | shouldn't since you do see them a surprising amount in Japan.
        
         | WorldMaker wrote:
         | > So the fi ligature was in a legacy encoding system and thus
         | must be in Unicode.
         | 
         | Most of the pre-composed latin ligatures are generally from
         | EBCDIC codepages. People in the ancient Mainframe era wanted
         | nice typesetting too, but computer fonts with ligature support
         | were a much later invention.
         | 
         | You can see fi and several others directly in EBCDIC code page
         | 361:
         | 
         | https://en.wikibooks.org/wiki/Character_Encodings/Code_Table...
        
           | gumby wrote:
            | Thanks. Some alphabets have precomposed ligatures that
            | aren't really letters, like old German alphabets with tz,
            | ch, ß (I only know how to type the last one, ß, because the
            | others have died out over the last hundred years).
           | 
            | Actually in German (at least) ä, ö and ü really are
            | ligatures for ae, oe, and ue -- the scribes started to
            | write the E's on their sides above the base letters, and
            | over time the superscript "E"s became dots or dashes. Often
            | they are described the other way around: "you can type oe
            | if you can't type ö." That's what my kid was told in
            | school!
            | 
            | But ö and ß aren't really part of the alphabet in German,
            | while, say, in Swedish, å and ö became actual letters of
            | the alphabet. English got W that way too.
        
             | cyxxon wrote:
              | That sounds a bit off to me. The Umlaute (ä, ö, ü) and
              | the "eszett" ß actually are part of the German
              | alphabet[1]. Also it is kinda weird to describe them as
              | ligatures of the original letters and the diaeresis,
              | because while this is what they started out as a long
              | time ago, they are just their own letters now (as opposed
              | to "real" stylistic ligatures like combining fi into one
              | glyph). The advice your kid was told, that they can be
              | replaced with ae, oe and ue, is correct - it is a
              | replacement nowadays.
             | 
             | [1] https://de.wikipedia.org/wiki/Deutsches_Alphabet
        
         | gwervc wrote:
         | The circled digits as code points are very nice to have
         | precisely because they are available in applications that don't
         | support them otherwise... which is actually most of the
         | software I can think of (Notepad, Apple Notes, chat
         | applications, most websites, etc).
        
           | swores wrote:
           | Can you write them with iOS keyboard? Or when you say Apple
           | Notes and chat apps you just mean from desktop?
           | 
           | Edit 1: seems the answer is not with the default iOS
           | keyboard, but possible to paste it and perhaps possible with
           | a third party keyboard that I'm not keen on trying (unless I
           | hear of a keyboard that's both genuinely useful / better than
           | default, and that doesn't send keystrokes to the developer -
           | though I can't remember if the latter is even a risk on iOS,
           | better go search about that next..)
        
             | d11z wrote:
             | Speaking of third party keyboards, I'm still upset about
             | what happened to Nintype[0]. I've never ever been able to
             | type faster on mobile than with its intuitive hybrid input
             | style of sliding and tapping, paired with AI that was
             | actually good. It used to be quite performant, fully
             | customizable, and it worked beautifully as a replacement
             | for default on jailbroken iOS.
             | 
             | Today, it's buggy $5 abandonware that only makes me sad
             | when I am reminded of it.
             | 
             | EDIT: Here[1] is a blog post that claims it's still the
             | best keyboard in 2023. I actually might give it another
             | shot... Not holding my breath though.
             | 
             | EDIT 2: Looks like another dedicated fan has actually taken
             | it upon themself to revive the project, under the new name
             | Keyboard71[2].
             | 
             | [0] https://apps.apple.com/us/app/nintype/id796959534
             | 
             | [1] https://maxleiter.com/blog/nintype
             | 
             | [2] https://www.reddit.com/r/keyboard71/
        
               | tiltowait wrote:
               | Nintype was absolutely incredible. I still open it every
               | now and then after an iOS update in the vain hope some
               | system change made it less buggy.
        
               | d11z wrote:
               | I'm really considering repurchasing (I definitely owned
               | it previously, no idea what happened), can you describe
               | specifically what the main bugs are for you? I'd be happy
               | if I could use it solely for occasionally writing long
               | notes, not as a replacement for all text inputs.
               | 
               | Really not looking to burn another $5, I'd greatly
               | appreciate any thoughts/concerns at all.
        
               | swores wrote:
               | I wonder why they haven't open sourced their fork, other
               | than a vague worry it might get DMCA'd.
        
             | masklinn wrote:
             | You can copy/paste them from a character board, a dedicated
             | website, or even the wiki.
        
             | P-Nuts wrote:
             | You can type ① with the UniChar keyboard app on iOS. It at
             | least claims it doesn't transmit information. As it's only
             | useful for special characters I don't worry because I can't
             | use it for normal typing anyway.
             | 
             | https://unichar.app
        
               | astrange wrote:
               | No third party keyboard transmits information without you
               | permitting it.
        
           | gumby wrote:
           | My point was that, had they not been legacy characters (or
           | had RT compatibility been disregarded) Unicode could still
           | have supported them as composed characters. Though I
           | personally still feel they are a kind of ligature or graphic,
           | but luckily for everyone else I'm not the dictator of the
           | world :-).
           | 
           | We should be careful: someone on HN could write a proposal
           | that they _should_ be considered precomposed forms that
           | should also have an un-composed sequence... so there could in
           | future be not just 1 in a circle but 1 ZWJ circle, circle ZWJ
           | 1 all considered the same...I can imagine some HN readers
           | being pranksters like that.
        
       | throwaway_fjmr wrote:
       | And yet, many modern, recent apps can't even encode the accented
       | European character in my given name. Sigh.
        
       | oefrha wrote:
       | > many Chinese, Japanese, and Korean logograms that are written
       | very differently get assigned the same code point
       | 
       | This leads to absolutely horrendous rendering of Chinese
       | filenames in Windows if the system locale isn't Chinese. The
       | characters seem to be rendered in some variant of MS Gothic and
       | it's very obviously a mix of Chinese and Japanese glyphs (of
       | somewhat different sizes and/or stroke widths IIRC). I think the
       | Chinese locale avoids the issue by using Microsoft YaHei UI.
        
       | nwellnhof wrote:
       | Unicode is a total mess. In a sane system, "extended grapheme
       | clusters" would equal "codepoints" and it wouldn't make a
       | difference for 99% of languages. Now we ended up with grapheme
       | clusters, normalization, decomposition, composition, Zalgo text,
       | etc. But instead of deprecating this nonsense, Unicode doubled
       | down with composed Emojis.
        
         | jetbalsa wrote:
         | I feel it's the same as with any long-standing computer system
         | we have today. It was designed as more and more of the world
         | came online, with all the growing pains that entailed. Could
         | it be built from scratch today better? Yes. Will it? No. I
         | suspect it will be around long after we are all dead. Same
         | with IPv4 :V
        
           | hot_gril wrote:
           | Honestly I like ipv4 better than v6. I like having a NAT and
           | easy addresses like 192.168.1.3 instead of
           | fe80::210:5aff:feaa:20a2. They didn't need to mess with those
           | things just to expand the address space, like how utf8 didn't
           | require remapping ASCII.
        
             | jrockway wrote:
             | IPv4.1 should have just had 39 bits, to be written like
             | 999.999.999.999. (I know this wouldn't have actually had
             | much effect, nobody is going to add new routes in the
             | middle of "class A" spaces that already existed, so it
             | would just give those that already had IP addresses more IP
             | addresses. Additionally, people really abuse decimal
             | addresses in horrifying ways; for example, Fios steals
             | 192.168.1.100-192.168.1.150 for its TV service, and that
             | range doesn't really correspond to anything that you can
             | mask off in binary. It only makes sense in decimal, which
             | is not what any underlying machinery uses. They should have
             | given themselves a /26 or something. You get 3 for yourself
             | (modulo the broadcast and gateway address), and they get 1
             | for TV.)
        
               | hot_gril wrote:
               | Having it actually be decimal might've been nice, but at
               | this point people are used to the 1-254 range, and I
               | think the least jarring addition of extra bits would be
               | to simply extend it for the addresses that need them (and
               | not for the ones that don't). So you could have
               | 123.444.3.254 or longer like 123.444.3.254.12.43.
        
               | [deleted]
        
           | vacuity wrote:
           | Be the change you want to see in the world. If we're going to
           | make huge breaking changes, might as well do it sooner rather
           | than later.
        
             | jetbalsa wrote:
             | With something as large as an end-user language format for
             | input, this is a change we ourselves cannot make, just as
             | using another calendar for dates. Just because I want to
             | use the year 2002023 calendar with 29.5 days per month,
             | doesn't make it useful to others or myself really.
        
               | Joker_vD wrote:
               | Do your dates alpha-convert?
        
               | nvm0n2 wrote:
               | I think actually you could. A thought experiment:
               | 
               | The problems with Unicode are mostly to do with internal
               | inconsistencies and churn, problems that usually only
               | affect programmers.
               | 
               | 1. Different ways to encode the same visually
               | indistinguishable set of characters as code points
               | leading to normal forms, text that compares unequal even
               | when it appears to be identical, the disastrous "grapheme
               | clusters" concept and so on.
               | 
               | 2. Many different ways to encode the same sequence of
               | code points as bytes. Not only UTF-32/16/8 but also
               | curiosities like "modified UTF-8".
               | 
               | 3. Emoji. A fractal of disasters:
               | 
               | 3.a. Updates frequently. Neither Unicode nor software in
               | general was built on the assumption that something as
               | basic as the alphabet changes every year. If you send
               | someone an emoji, can their device draw it? Who knows! In
               | practice this means messaging apps can't rely on the OS
               | system fonts or text handling libraries anymore which is
               | a drastic regression in basic functionality.
               | 
               | 3.b. (Ab)uses composition so much it's practically a
               | small programming language, e.g. flags are composed of
               | the two letter country code spelled using special
               | characters. People are represented as a generic person
               | plus skin color patch, families are represented using
               | composed individual people etc.
               | 
               | 3.c. Meaning of a character is theoretically specified
               | but can subtly depend on the font used, e.g. people use a
               | fruit emoji in visual puns because of how it looks
               | specifically on Apple devices, so a "sentence" can make
               | no sense if it's rendered with a different font.
               | 
               | 3.d. Unbounded in scope. There's no reason the Unicode
               | committee won't just keep adding new pictograms forever.
               | 
               | 3.e. Encoded beyond the BMP which in theory every correct
               | program should handle but in practice some don't because
               | nobody except a few academics used characters beyond it
               | much until emoji came along.
               | 
               | 3.f. Disagreement over single vs double width chars, can
               | only know this via hard-coded tables, matters for
               | terminals and code editors.
               | 
               | Some of these can potentially be cleaned up outside of
               | the Unicode consortium in backwards compatible ways. You
               | could have a programming language that automatically
               | normalized strings to fully composed form when
               | deserializing from bytes, and then automatically folded
               | semantically identical code points together (this would
               | be a small efficiency win for some languages too). You
               | could campaign to build a consensus around a specific
               | normal form, like how UTF-8 gained consensus as a
               | transfer encoding. You could also define a fork of
               | Unicode (using private use areas?) that allocates a
               | single code point to the characters that are
               | unnecessarily using composition today but don't yet have
               | one and then just subset out the concept of composition
               | entirely.
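A sketch of the normalize-on-deserialization idea above, assuming NFC as the agreed normal form (read_text is a hypothetical boundary wrapper, not a real API):

```python
import unicodedata

def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes and fold them to NFC once, at the deserialization
    boundary, so the rest of the program can compare strings directly."""
    return unicodedata.normalize("NFC", raw.decode(encoding))

a = read_text("No\u00ebl".encode("utf-8"))   # precomposed on the wire
b = read_text("Noe\u0308l".encode("utf-8"))  # decomposed on the wire

print(a == b)  # True: both were folded to the same form on entry
```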
               | 
               | Emoji are a big problem. It's tempting to say that these
               | should not be encoded as characters at all. Instead there
               | could be a set of code points that define bounds that
               | contain a tiny binary subset of SVG, enough to recreate
               | the Apple pixel art somewhat closely. Emoji would always
               | be transmitted as inlined vector art. Text rendering
               | libraries would call out to a little renderer for each
               | encoded glyph, using a fast fingerprinting algorithm to
               | deduplicate the bytes to an internal notion of a
               | character. To avoid wire bloat, text can simply be
               | compressed with a pre-agreed zstd or Brotli dictionary
               | that contains whatever images happen to be popular in the
                | wild. At a stroke this would avoid backwards-compat
                | problems with new emoji (enabling programs working with
                | text to be upgraded _once_ and then never again),
                | eliminate all the ridiculous political committee bike-
                | shedding over what gets added, let apps go back to
                | using system text support, and get rid of the bajillion
                | edge cases that emoji have spewed all over the
                | infrastructure.
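The "pre-agreed dictionary" part of this hypothetical scheme is the one piece that exists today. A minimal sketch using zlib's preset-dictionary support (zstd and Brotli offer the same mechanism); the blob names are made up for illustration:

```python
import zlib

# Sketch of compressing text against a pre-agreed shared dictionary.
# The "<svg-blob-...>" names are invented placeholders for the
# hypothetical inlined emoji vector art described above.
shared_dict = b"<svg-blob-heart><svg-blob-smile><svg-blob-fire>"

payload = b"some text with an inlined <svg-blob-smile> emoji"

comp = zlib.compressobj(zdict=shared_dict)
wire = comp.compress(payload) + comp.flush()

# Both sides must hold the same dictionary to decode.
decomp = zlib.decompressobj(zdict=shared_dict)
restored = decomp.decompress(wire)
assert restored == payload
```

Matches against the dictionary become cheap back-references on the wire, which is why the dictionary must be agreed on out of band.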
        
           | magicalhippo wrote:
           | For most software it doesn't really matter either.
           | 
           | I've written unicode-aware software for over a decade, doing
           | a wide variety of programs, and I've never had to bother with
           | all that mess.
           | 
           | If I'm parsing strings I'm looking for stuff in the 7-bit
           | ASCII range which maps neatly onto the Unicode
           | representations, and so I just need to take care to preserve
           | the rest.
           | 
           | The only trouble I've had is that a lot of programmers
           | haven't learned, or don't get, that text encoding is a thing
           | and that it needs to be handled.
           | 
            | So they'll hand me an XML document they claim is UTF-8
            | encoded, except the XML header was just copypasta and the
            | actual document is encoded in some other system encoding
            | like Windows-1252. Or worse, a mix of both.
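The "7-bit ASCII maps neatly onto Unicode" property this comment relies on can be shown concretely: every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte below 0x80 is always a genuine ASCII character. An illustrative sketch:

```python
# Safe to scan UTF-8 bytes for ASCII delimiters without decoding:
# multi-byte sequences never contain a byte < 0x80.
data = "naïve=café".encode("utf-8")

# Split on ASCII '=' at the byte level, preserving the rest as-is.
key, _, value = data.partition(b"=")

assert key.decode("utf-8") == "naïve"
assert value.decode("utf-8") == "café"

# Every byte of the encoded 'ï' and 'é' is >= 0x80:
assert all(b >= 0x80 for b in "ïé".encode("utf-8"))
```

This is exactly why parsers that only care about ASCII-range syntax characters can pass non-ASCII content through untouched.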
        
         | tialaramex wrote:
         | The writing systems were already like this when we got them.
         | Unicode's "total mess" mostly just reflects that. Of course it
         | would be convenient for you, the programmer, if the users
         | wanted the software to do whatever was easiest for you, but
         | obviously they want what's easiest for them, not you.
        
           | eviks wrote:
           | How is it easiest "for them" to have the mess instead of
           | having the newer standard be less messy?
        
             | bluGill wrote:
              | because the current mess means all their old stuff still
              | works. ASCII is good so long as you only need English (or
              | any other Latin language without accents), which was good
              | enough for a long time. ASCII was also carefully designed
              | to make programming easier - flipping one bit toggles
              | lower/uppercase, for example, and there are more things
              | it makes easy. By the time we realized we actually cared
              | about the rest of the world, it was too late to make a
              | nice system.
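The case-flipping property mentioned above, in a couple of lines (a sketch; it only holds for ASCII letters, not for Unicode generally):

```python
# ASCII was laid out so that bit 0x20 toggles letter case.
def toggle_case(ch: str) -> str:
    return chr(ord(ch) ^ 0x20)

assert toggle_case("a") == "A"
assert toggle_case("Z") == "z"
```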
        
           | nwellnhof wrote:
           | Name one writing system where you really need character
           | composition. Even if there is one, these special cases should
           | be handled outside of Unicode.
        
             | layer8 wrote:
             | Thai, Arabic, Hebrew, and Devanagari are important
             | examples, I believe.
        
             | nottorp wrote:
             | The problem is not that you need character composition for
              | some writing systems. It's that there are no rules that
              | would ensure everything has a unique representation
              | internally.
             | 
             | Even "put the code points forming the composed character in
             | descending numerical order" would be better than nothing.
             | If it was there from the start.
             | 
              | However, the Unicode committee is too busy adding new
              | emojis to make their standard sane.
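Worth noting: the normal forms do in fact define an ordering rule much like the one wished for here, sorting combining marks by their Canonical Combining Class. A quick illustration with Python's stdlib:

```python
import unicodedata

# Acute (U+0301, ccc=230) and dot-below (U+0323, ccc=220) normalize
# to the same canonically ordered sequence regardless of input order.
a = unicodedata.normalize("NFD", "e\u0301\u0323")
b = unicodedata.normalize("NFD", "e\u0323\u0301")
assert a == b == "e\u0323\u0301"

assert unicodedata.combining("\u0323") == 220
assert unicodedata.combining("\u0301") == 230
```

The catch is that this only helps once text is actually normalized, which most software skips, so the "no unique representation" complaint stands in practice.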
        
             | asherah wrote:
             | you can't not handle devanagari, tamil (or like half the
             | scripts across the Indian subcontinent and oceania) or
             | hangul. even the IPA, used by linguists every day, would be
             | particularly bad to deal with if we couldn't write things
             | like /a/, and some languages already don't have the
             | precomposed diacritics for all letters (like o), so the
             | idea of a world with only precomposed letter forms is more
              | of an exponential explosion in the character set.
        
               | nwellnhof wrote:
               | Hangul already has precomposed syllables in Unicode. We
               | still have several hundred thousand unassigned codepoints
               | to deal with diacritics.
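Hangul's precomposed syllables are composed algorithmically, which is easy to see from Python's stdlib (a small illustrative check):

```python
import unicodedata

# Hangul jamo compose algorithmically to precomposed syllables in NFC.
jamo = "\u1112\u1161\u11AB"      # HIEUH + A + final NIEUN
syllable = unicodedata.normalize("NFC", jamo)

assert syllable == "\uD55C"      # the single syllable 한 (U+D55C)
assert unicodedata.normalize("NFD", "\uD55C") == jamo
```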
        
               | arp242 wrote:
               | > so the idea of a world with only precomposed letter
               | forms is more of a exponential explosion in the character
               | set
               | 
               | "Exponential explosion" is really putting it too strong;
               | it's perfectly possible to just add o and a and a bunch
               | of other things. The combinations aren't infinite here.
               | 
                | The problem with e.g. Latin script isn't _necessarily_
                | that combining characters exist, but that there are
                | two ways to represent many things. That really is just
                | a "mess": use either one system or the other, but not
                | both.
               | Hangul has similar problems.
               | 
                | Devanagari doesn't have any precomposed characters
                | AFAIK, so that's fine.
               | 
               | That's really the "mess": it's a hodgepodge of different
               | systems, and you can't even know which system to use a
               | lot of the time because it's not organised ("look it up
               | in a large database"), and even taking in to account
               | historical legacy I don't think it really _needed_ to be
               | like this (or is even an unfixable problem today,
               | strictly speaking).
               | 
                | At least they deprecated ligatures like ﬆ (st) and ﬂ
                | (fl), although recently I did see ĳ (ij) being used in
                | the wild.
        
               | WorldMaker wrote:
               | > The combinations aren't infinite here.
               | 
               | They certainly are. Languages are a creative space driven
               | by the human imagination. Give people enough time and
               | they'll build new combinations for fun or for profit or
               | for research or for trying to capture a spoken word/tone
               | poem in just the right sort of exciting way. You may
               | frown on "Zalgo text" [1] (and it is terrible for
               | accessibility), but it speaks to a creative mood or
               | three.
               | 
               | The growing combinatorial explosion in Unicode's emoji
               | space isn't an accident or something unique to emoji, but
               | a characteristic that emoji are just as much a creative
               | language as everything else Unicode encodes. The biggest
               | difference is that it is a living language with a lot of
               | visible creative work happening in contemporary writing
               | as opposed to a language some monks centuries ago decided
               | was "good enough" and school teachers long ago locked
               | some of the creative tools in the figurative closets to
               | keep their curriculum simpler and their days with fewer
               | headaches.
               | 
               | [1] https://en.wikipedia.org/wiki/Zalgo_text
        
               | arp242 wrote:
               | Well, in theory it's infinite, but in reality it's not of
               | course.
               | 
                | We've got roughly 150K codepoints assigned, leaving us
                | with 950K unassigned codepoints. There's a truly
                | massive amount of headroom.
               | 
               | To be honest I think this argument is rather too abstract
               | to be of any real use: if it's a theoretical problem that
               | will never occur in reality then all I can say is:
               | <shrug-emoji>.
               | 
               | But like I said: I'm not "against" combining marks,
               | purely in principle it's probably better, I'm mostly
               | against two systems co-existing. In reality it's too late
               | to change the world to decomposed (for Latin, Cyrillic,
               | some others) because most text already is pre-composed,
               | so we should go full-in on pre-composed for those. With
               | our 950k unassigned codepoints we've got space for
               | literally thousands of years to come.
               | 
               | Also this is a problem that's inherent in computers: on
               | paper you can write anything, but computers necessarily
               | restrict that creativity. If I want to propose something
               | like a "%" mark on top of the "e" to indicate, I don't
                | know, _something_, then I can't do that regardless of
               | whether combining characters are used, never mind
               | entirely new characters or marks. Unicode won't add it
               | until it sees usage, so this gives us a bit of a catch-22
               | with the only option being mucking about with special
               | fonts that use private-use (hoping it won't conflict with
               | something else).
        
               | WorldMaker wrote:
               | The Unicode committees have addressed this for languages
               | such as Latin, Cyrillic, and others and stated outright
               | that decomposed forms should be preferred and
               | decomposition canonical forms are generally the safest
               | for interoperability and operations such as collation
               | (sorting) and case folding (lowercase to uppercase
               | transformations).
               | 
               | Unicode can't get rid of the many precombined characters
               | for a huge number of backward compatibility reasons
               | (including compatibility with ancient Mainframe encodings
               | such as EBCDIC which existed before computer fonts had
               | ligature support), but they've certainly done what they
               | can to suggest the "normal" forms in this decade should
               | "prefer" the decomposed combinations.
               | 
               | > If I want to propose something like a "%" mark on top
               | of the "e" to indicate, I don't know, something, then I
               | can't do that regardless of whether combining characters
               | are used
               | 
               | This is where emoji as a living language actually shines
               | a living example: It's certainly possible to encode your
               | mark today as a ZWJ sequence, say <<e ZWJ %>>, though you
               | might want to consider for further disambiguation/intent-
               | marking adding a non-emoji variation selector such as
               | Variation Selector 1 (U+FE00) to mark it as "Basic
               | Latin"-like or "Mathematical Symbol"-like. You can
               | probably get away with prototyping that in a font stack
               | of your choosing using simple ligature tools (no need for
               | private-use encodings). A ZWJ sequence like that in
               | theory doesn't even "need" to ever be standardized in
               | Unicode if you are okay with the visual fallback to
               | something like "e%" in fonts following Unicode standard
               | fallback (and maybe a lot of applications confused by the
               | non-recommended grapheme cluster). That said, because of
               | emoji the process for filing new proposals for
               | "Recommended ZWJ Sequences" is among the simplest Unicode
               | proposals you can make. It's not entirely as Catch-22 on
               | "needs to have seen enough usage in written documents" as
               | some of the other encoding proposals.
               | 
               | Of course, all of that is theory and practice is always
               | weirder and harder than theory. Unicode encoding truly
               | living languages like emoji is a blessing and it does
               | enable language "creativity" that was missing for a
               | couple of decades in Unicode processes and thinking.
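The hypothetical <<e ZWJ %>> sequence described above can be built in a few lines. To be clear, this sequence is the commenter's invention; no font is expected to render it as one glyph, and unsupported ZWJ sequences fall back to the individual characters:

```python
# Construct the made-up <<e ZWJ %>> sequence with a variation selector.
ZWJ = "\u200D"   # ZERO WIDTH JOINER
VS1 = "\uFE00"   # VARIATION SELECTOR-1

seq = "e" + ZWJ + "%" + VS1

# Four code points, intended to display as a single grapheme.
assert [hex(ord(c)) for c in seq] == ["0x65", "0x200d", "0x25", "0xfe00"]
assert len(seq) == 4
```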
        
               | arp242 wrote:
               | > The Unicode committees have addressed this for
               | languages such as Latin, Cyrillic, and others and stated
               | outright that decomposed forms should be preferred
               | 
               | Yes, and that only makes things worse since the
               | overwhelming majority of documents (99.something% last
               | time I checked) uses pre-composed. Also AFAIK just about
               | everyone just ignores that recommendation.
               | 
               | This is a classic "reality should adjust to the standard"
               | type of thinking. Previous comments about that:
               | https://news.ycombinator.com/item?id=36984331
               | 
               | I suppose "e ZWJ %" is a bit better than Private Use as
               | it will appear as "e%" if you don't have font support,
               | but the fundamental problem of "won't work unless you
               | spend effort" remains. For a specific niche (math,
               | language study, something else) that's okay, but for
               | "casual" usage: not so much. "Ship font with the
               | document" like PDF and webfonts do is an option, but also
               | has downsides and won't work in a lot of contexts, and
               | still requires extra effort from the author.
               | 
               | I'm not saying it's completely impossible, but certainly
               | harder than it used to be, arguably much harder. I could
               | coin a new word right here and now (although my
               | imagination is failing me to provide a humorous example
               | at this moment) and if people like it, it will see usage.
                | In the HN of the 1960s we would have exchanged these
                | things over written letters, and it would have been
                | trivial to propose an "e with % on top" too; now we
                | need to resort to clunky phrases like this (even on a
                | typewriter you could manually amend things, if you
                | really wanted to).
               | 
                | Or let me put it this way: something like the
                | interrobang (‽) would see very little chance of being
                | added to Unicode if it was coined today. Granted, it
                | doesn't see _that_ much use, but I do encounter it in
                | the wild on occasion and some people like it (I
                | personally don't, actually, but I don't want to
                | prevent other people from using it).
               | 
               | None of this is Unicode's fault by the way, or at least
               | not directly - this is a generic limitation of computers.
        
               | WorldMaker wrote:
               | > Yes, and that only makes things worse since the
               | overwhelming majority of documents (99.something% last
               | time I checked) uses pre-composed.
               | 
               | It shouldn't matter what's in the wild in documents.
               | That's why we have normalization algorithms and
               | normalization forms. Unicode was built for the ugly
               | reality of backwards compatibility and that you can't
               | control how people in the past wrote. These precomposed
               | characters largely predate Unicode and were a problem
               | before Unicode. Unicode _won_ in part because it met
               | other encodings where they _were_ rather than where they
               | wished they would be. It made sure that mappings from
               | older encodings could be (mostly) one-to-one with respect
                | to code points in the original. It didn't quite achieve
               | that in some cases, but it did for, say, all of EBCDIC.
               | 
               | Unicode was never in the position to fix the past, they
               | had to live with that.
               | 
               | > This is a classic "reality should adjust to the
               | standard" type of thinking.
               | 
               | Not really. The Unicode standard suggests the
               | normal/canonical forms and very well documented
               | algorithms (including directly in source code in the
               | Unicode committee-maintained/approved ICU libraries) to
               | take everything seen in the wilds of reality and convert
               | them to a normal form. It's not asking reality to adjust
               | to the standard, it is asking _developers_ to adjust to
               | the algorithms for cleanly dealing with the ugly reality.
               | 
               | > Or let me put it this way: something like !? would see
               | very little chance of being added to Unicode if it was
               | coined today.
               | 
               | Posted to HN several times has been the well documented
               | proposal process from start to finish (it succeeded) of
               | getting common and somewhat less common power symbols
               | encoded in Unicode. It's a committee process. It
               | certainly takes committee time. But it isn't "impossible"
               | to navigate and is certainly higher than "little chance"
               | if you've got the gumption to document what you want to
               | see encoded and push the proposal through the committee
               | process.
               | 
               | Certainly the Unicode committee picked up a reputation
               | for being hard to work with in the early oughts when the
               | consortium was still fighting the internal battles over
               | UCS-2 being "good enough" and had concerns about opening
               | the "Astral Plane". Now that the astral plane is open and
               | UTF-16 exists, the committee's attitude is considered to
               | be much better, even if its reputation hasn't yet shifted
               | from those bad old days.
               | 
               | > None of this is Unicode's fault by the way, or at least
               | not directly - this is a generic limitation of computers.
               | 
               | Computers do anything we program them to do and in
               | general people find a way regardless of the restrictions
               | and creative limitations that get programmed. I've seen
               | MS Paint drawn symbols embedded in Word documents because
               | the author couldn't find the symbol they needed or it
               | didn't quite exist. It's hard to use such creative
               | problem solving in HN's text boxes, but that from some
               | viewpoints is just as much a creative deficiency in HN's
               | design. It's not an "inherent" problem to computers. When
               | it is a problem they pay us software developers to fix
               | it. (If we need to fix it by writing a proposal to a
               | standards committee such as the Unicode Consortium, that
               | is in our power and one of our rights as developers.
               | Standards don't just bind in one-direction, they also
               | form an agreement of cooperation in the other.)
        
             | PetitPrince wrote:
             | The intent of Unicode was to have a universal solution for
             | humans. Excluding one case, even if it's remote, would
             | defeat this mission statement.
        
         | dleeftink wrote:
         | What sort of practical issues are you running into due to
         | Unicode's codepoint compositionality?
        
           | nwellnhof wrote:
           | It's unnecessary complexity and a security nightmare. Have
           | you ever tried to implement Unicode normalization? A single
           | bug in your code and malformed text can crash your
           | application or worse.
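The usual mitigation for that risk is to not hand-roll normalization at all and delegate to a maintained implementation. A minimal sketch using Python's stdlib `unicodedata` (the wrapper function name is made up):

```python
import unicodedata

def safe_normalize(text: str) -> str:
    # Reject unencodable input (e.g. lone surrogates) up front;
    # str.encode raises UnicodeEncodeError on such malformed text.
    text.encode("utf-8")
    # Delegate the actual normalization to a maintained library
    # backed by the Unicode Character Database.
    return unicodedata.normalize("NFC", text)

assert safe_normalize("e\u0301") == "\u00E9"   # composes to é
```

Validating input at the boundary and leaning on a battle-tested normalizer is what keeps malformed text from turning into a crash.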
        
             | torstenvl wrote:
             | It's hard for me to imagine how Unicode normalization could
             | crash your application unless you have very convoluted
             | memory management code.
             | 
             | What on earth are you doing that it's leading to crashes?
             | Are you not validating the result?
        
               | hot_gril wrote:
               | iMessage has had several vulnerabilities related to this.
               | Whatever the difficulties are, even Apple can't handle
               | them sometimes.
        
               | torstenvl wrote:
               | I'm very skeptical, but willing to be proven wrong.
               | What's the CVE?
        
               | hot_gril wrote:
               | First that comes to mind is the "effective power" one,
               | https://nvd.nist.gov/vuln/detail/cve-2015-1157 There's
               | also the "black dot" one, can't find the CVE though.
        
               | torstenvl wrote:
               | That seems like a truncation + display issue though, not
               | a normalization issue.
               | 
               | https://www.reddit.com/r/apple/comments/37e8c1/malicious_
               | tex...
               | 
               | In fact, I don't know that there's any reason to believe
               | normalization happens at all in the process of executing
               | this.
        
             | dleeftink wrote:
             | That's tricky, for sure. My 'workaround' has long been
             | converting codepoints into byte sequences and creating a
             | character dictionary from that. Based on the source corpus,
             | this dictionary can be further expanded/compressed and used
             | for downstream processing.
        
           | zajio1am wrote:
           | Normalization and the fact it is not forward-compatible.
        
         | dclowd9901 wrote:
          | They kind of have to, don't they? Otherwise we'd become
          | space-limited way too fast, especially with how quickly new
          | emojis and all their variants are being made.
        
         | eviks wrote:
         | But precomposing all the potential combinations is less sane
         | than the current mess (and you can outlaw Zalgo in the standard
         | if you think it's a serious issue)
         | 
          | Also, the % should measure people, not languages; that would
          | greatly decrease the imaginary 99%.
        
         | arp242 wrote:
         | > Unicode doubled down with composed Emojis.
         | 
         | Not just emojis, in general I believe Unicode has just said
         | they're not going to add new pre-composed characters and that
         | using combining characters is the _Right Way(tm)_ to do things
         | (well, the _only_ way for newer scripts).
         | 
         | One of the downsides of writing down specifications is that
         | they tend to attract people with Very Strong Opinions on the
         | One And Only Right Way and will argue it to no end, and
         | essentially "win" the argument just by sheer verbosity and
         | persistence.
         | 
         | That's certainly what I've seen happen in a few cases, and is
         | what happens on e.g. Wikipedia as well at times.
         | 
          | But yeah, emojis are even worse. Some things can look
          | rather different depending on which invisible variation
          | selector is present. We've got tons and tons of unassigned
         | codepoints and we need to resort to these tricks to save a few
         | of them?
         | 
         | Firefighter is "(man|woman|person) + ZWJ + firetruck". Clever,
         | I guess. Construction worker is "Construction worker (+ ZWJ +
         | (male sign|female sign))?" (absence is gender-neutral). Why are
         | there 2 systems to encode this? Sigh...
         | 
         | All of this is too clever by a mile.
         | 
         | [1]: HN will strip stuff, but try something like:
         | echo $'-\ufe0f -\ufe0e'
         | 
         | May not display correctly in terminal, but can xclip it to a
         | browser - screenshot: https://imgur.com/a/iFmBDQk
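The two encoding systems described above can be inspected directly. A small sketch building the sequences as the comment describes them (man firefighter via ZWJ + fire engine, woman construction worker via ZWJ + female sign + variation selector):

```python
ZWJ = "\u200D"

# man (U+1F468) + ZWJ + fire engine (U+1F692)
firefighter = "\U0001F468" + ZWJ + "\U0001F692"

# construction worker (U+1F477) + ZWJ + female sign + VS-16
worker_f = "\U0001F477" + ZWJ + "\u2640\uFE0F"

# Several code points, each intended to render as one grapheme.
assert len(firefighter) == 3
assert len(worker_f) == 4
assert firefighter.count(ZWJ) == 1
```

`len()` counting code points rather than graphemes is itself one of the edge cases the comment is complaining about.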
        
           | d3w4s9 wrote:
           | The first time I heard that Unicode would support emoji, I
           | knew it would be a recipe for disaster. And I definitely was
           | not disappointed.
        
             | arp242 wrote:
             | I mean, I don't dislike the concept personally. I actually
             | really hate how HN strips them.
             | 
             | But the technical implementation? Yeah, that could have
             | gone a lot better IMHO.
             | 
             | One must also wonder if some things really had to be added
             | in the first place, e.g. for people kissing it's:
             | (person|man|woman)(skin-tone)? ZWJ <heart> ZWJ <kissing
             | lips> ZWJ (person|man|woman)(skin-tone)?
             | 
             | This is NOT a complaint about that they added diversity as
             | such, in principle I'm all for that, it's just that few
             | seem to actually use these emojis, and both in terms of
             | code and UI it all gets pretty complex; there's 98
             | combinations to choose from here.
             | 
              | I don't really get why <heart> or <kissing lips> or
              | <kissing face> isn't enough. That's actually what most
             | people seem to use anyway, because who finds it convenient
             | to pick all the correct genders and skin tones from the UI
             | for both people?
        
               | nottorp wrote:
               | > I actually really hate how HN strips them.
               | 
               | Oh. So that's why HN discussion always looks sane. They
               | strip the pollution.
        
               | pests wrote:
               | > there's 98 combinations to choose from here.
               | 
               | Less than that since a default skin color can be set in
                | most apps. I'm sure setting a gender will come soon so
                | the entire first part of that emoji can be auto-
                | guessed. Then it's just showing the other options in
                | the UI. Really all of this is UI design; even with the
                | 98 combinations you can display it as 4-5 options you
                | drill down through.
               | 
               | > who finds it convenient to pick all the correct genders
               | and skin tones from the UI for both people?
               | 
               | I just checked and searching "kissing" in my iOS emoji
                | keyboard inside Messenger showed just 4 of the emojis
                | you're describing - defaulting both skin tones to my
                | settings and then the four M/F pairings. Plus some
                | unrelated kissing emojis like the cat kissing.
        
               | arp242 wrote:
               | > defaulting both skin tones to my settings
               | 
               | But that's kind of wrong, no? The entire point is that
               | you can choose both sides individually. What if you set
               | it to black and want to kiss some white bloke?
               | 
               | If anything that only underscores my point that it's too
               | complex and that no one is using them (certainly not as
               | intended anyway).
        
               | pests wrote:
               | That's on Apple not on emojis.
               | 
               | In the Windows 11 emoji picker it works like this:
               | 
               | 1. Search "kissing". See two generic yellow people
               | kissing. Notice a blue dot in the bottom right corner.
               | 
               | 2. Clicking the emoji brings up previously used versions
               | of the kissing emoji, with a + button.
               | 
               | 3. Clicking + brings up a dialog like I described
               | previously. Two generic figures at the top, then a row of
               | skin tones.
               | 
               | 4. You can click on each generic person and choose a
               | gender, then select a skin tone. You can do this for each
               | person in the group.
               | 
               | 5. Click done. This emoji is now in your default emoji
               | list and you won't need to recreate it again.
        
               | arp242 wrote:
               | That seems like a lot of effort when you could have sent
               | <kissing-lips>, <kissing-cat>, <kissing-face>, <heart>,
               | or any number of other emojis, which is what my point
               | was.
        
               | pests wrote:
               | You still can! People who want more customizations can do
               | so too. Plus it only takes the initial setup per emoji at
               | least.
        
       | bluecheese452 wrote:
       | Anyone else hate titles like this? There are millions of
       | developers working on a large variety of things. It sounds so
       | arrogant to me.
        
       | russellbeattie wrote:
       | We need, desperately and without question, two Unicode symbols
       | for bold and italic.
       | 
       | These are _part of language_ and should not be an optional
        | proprietary add-on that can be skipped or deleted from text.
        | We've been using the two "formats" to convey _important_
       | information since the _sixteenth century_!!!
       | 
       | It boggles my mind that we can give flesh tone to emojis, yet not
       | mark a word as bold or italic. It makes zero sense. Especially
       | how easy it would be to implement. It would work exactly the same
       | way: Letters following the mark would be formatted as bold or
       | italic until a space character or equivalent.
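Unicode's nearest existing answer is the Mathematical Alphanumeric Symbols block, which people already (ab)use to fake bold in plain text. It is not real styling (screen readers read it badly, and search breaks), but it shows the codepoints exist today. An illustrative sketch covering basic Latin letters only:

```python
# Map A-Z/a-z onto the Mathematical Bold block (U+1D400..U+1D433).
def fake_bold(text: str) -> str:
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits, punctuation, etc. pass through
    return "".join(out)

assert fake_bold("Ab") == "\U0001D400\U0001D41B"
```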
        
       | tripdout wrote:
       | Can there be overlaps between fonts in the private use area?
        
         | mankyd wrote:
         | Yes. "Private" in this case means that you can't expect
         | consistent behavior from one system to the next.
        
       | Nevermark wrote:
       | What an interesting mess!
       | 
       | It occurs to me that a canonical semantic representation of all
       | known (extracted) language concepts would be useful too.
       | 
        | Now that we have multi-language LLMs it would be an interesting
       | challenge to create/design a canonical representation for a
       | minimum number of base concepts, their relations and orthogonal
       | "voice" modifiers, extracted from the latent representations of
       | an LLM across a whole training set, over all training languages.
       | 
       | While the best LLMs still have complex reasoning issues, their
       | understanding of concepts and voice at the sentence level is
       | highly intuitive and accurate. So the design process could be
       | automated.
       | 
        | The result would be a human-language-agnostic, cross-culture,
        | concept-inclusive, regularized & normalized (relatively
        | speaking) semantic language. Call it SEMANTICODE.
       | 
        |  _We need to get this right, using one standard LLM lineage,
        | before the Unicode people create a super standard that spans
        | 150 different LLMs and 150 different latent spaces!_ :O
       | 
       | Stability between updates would be guaranteed by including
       | SEMANTICODE as a non-human language in training of future LLM's.
       | Perhaps including a (highly) pre-normalized semantic artificial
       | language would dramatically speed up and reduce the parameter
       | count needed for future multi-language training?
       | 
       | Then LLMs could use SEMANTICODE to talk to each other more reliably,
       | efficiently, and with greater concept specificity than any of our
       | single languages.
        
       | Dwedit wrote:
       | Reload with Javascript disabled to remove the distracting fake
       | mouse pointers.
        
       | alexmolas wrote:
       | I tried to read the article since it seemed interesting. After
       | exactly 30 seconds of trying I had to leave the page. Impossible
       | to read more than two sentences with all those pointers moving
       | around - and for folks with ADHD it's even more difficult. Sorry,
       | but I couldn't make it :(
        
         | anthk wrote:
         | Use the reader mode. Or if you are under GNU/Linux, use
         | Links/Lynx.
        
         | Maken wrote:
         | Fortunately you didn't try the dark theme.
        
       | TacticalCoder wrote:
       | > For example, é (a single grapheme) is encoded in Unicode as e
       | (U+0065 Latin Small Letter E) + ´ (U+0301 Combining Acute
       | Accent). Two code points!
       | 
       | It's a poor and misleading example, for it is definitely not how
       | 'é' is encoded in 99.999% of all the text written in, say, French
       | out there (French is the language where 'é' is the most common).
       | 
       | 'é' is U+00E9, one codepoint, definitely not two.
       | 
       | Now you could say: but it is _also_ the two codepoints one. But
       | that's precisely what makes Unicode the complete, total and
       | utter clusterfuck that it is.
       | 
       | And hence even an article explaining what every programmer should
       | know about Unicode cannot even get the most basic example right.
       | Which is honestly quite ironic.
        
         | crazygringo wrote:
         | > _definitely not how 'e' is encoded in 99.999% of all the text
         | written in, say, french out there_
         | 
         | Maybe how it's input by the keyboard (I haven't checked) but
         | not how it's output on the web or other documents.
         | 
         | Plenty of text goes through Unicode normalization which may
         | convert it to two codepoints.
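         | A quick sketch in Python with the stdlib unicodedata module,
         | showing NFD normalization turning one code point into two:

```python
import unicodedata

s = "\u00e9"                         # 'é' as a single code point (precomposed)
d = unicodedata.normalize("NFD", s)  # decomposed: 'e' + U+0301 combining acute
print(len(s), len(d))                # 1 2
print(d == "e\u0301")                # True
```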
        
         | ninkendo wrote:
         | > Unicode the complete, total and utter clusterfuck that it is.
         | 
         | Yikes, does it really deserve that much derision? They're
         | trying to standardize _all written human language_ here. I
         | think they've done a fantastic job. Pre-Unicode you had to
         | worry about what code page a document had, and computers from
         | different countries couldn't interoperate. The work the
         | consortium does is hugely important, and every decision has
         | extremely complex tradeoffs. Composed characters make a lot of
         | sense, and there's a strong case to be made that it was the
         | right call. The attitude of "this one thing I don't like makes
         | the whole thing a complete clusterfuck" is something I wish
         | fewer engineers would have.
        
         | spacechild1 wrote:
         | Next time read the whole article before accusing the author of
         | incompetence!
         | 
         | However, the author could have added a small note, e.g.
         | "(Unicode normalization will be convered in a later section.)",
         | to prevent knowledgable readers from rage quitting :)
        
         | wgjordan wrote:
         | The author explains normalization in its own entire section
         | several paragraphs later (Why is "A" !== "A" !== "A"?).
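         | For the curious, the three visually identical "Å"s are U+00C5,
         | U+212B (the Angstrom sign), and A + U+030A (combining ring);
         | a Python sketch of how normalization reconciles them:

```python
import unicodedata

a1 = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
a2 = "\u212b"    # ANGSTROM SIGN (renders identically)
a3 = "A\u030a"   # 'A' + COMBINING RING ABOVE
print(a1 == a2, a1 == a3)  # False False -- same on screen, different code points
norm = [unicodedata.normalize("NFC", s) for s in (a1, a2, a3)]
print(norm[0] == norm[1] == norm[2])  # True -- all normalize to U+00C5
```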
        
       | ilyt wrote:
       | > Unicode is locale-dependent
       | 
       | Well, there is a new fact that I learned and immediately hated.
       | 
       | The fuck were authors thinking...
       | 
       | I am now firmly convinced people developing unicode hate
       | developers. I suspected it before just due to how messy it was
       | (same character having different encodings ? Really ? Fuck you),
       | but this cements it.
        
         | JohnFen wrote:
         | > people developing unicode hate developers
         | 
         | Or at least they have a vicious indifference to us. Unicode is
         | a nightmare.
        
         | wffurr wrote:
         | Yeah this is a big problem for me right now trying to pick
         | fonts and characters for CJK. I have a bunch of bugs to fix
         | that will require sending the locale down to the text
         | itemization code.
        
         | zajio1am wrote:
          | Unicode is not locale-dependent; only the mapping from
          | graphemes to (font) glyphs is locale/font-dependent.
        
           | mcfedr wrote:
           | The author shows how to-upper and to-lower change according
           | to locale
           | 
           | But making it clear which glyph to use is also a key feature!
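            | For example, Python's str.lower() applies Unicode's
            | locale-independent default case mappings, which is exactly
            | what bites Turkish text (a sketch; locale-aware casing
            | needs something like ICU):

```python
# Default (locale-independent) Unicode lowercasing:
print("I".lower())            # 'i' -- but Turkish expects dotless 'ı'
print("\u0130".lower())       # 'İ' lowers to 'i' + U+0307 combining dot above
print(len("\u0130".lower()))  # 2 -- lowercasing changed the string length!
```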
        
         | JonChesterfield wrote:
         | Well C is locale dependent. And one does not break backwards
         | compatibility with C for fear of badness. So naturally Unicode
         | must be locale dependent too.
        
       | layer8 wrote:
       | This is pretty good. One thing I would add is to mention that
       | Unicode defines algorithms for bidirectional text, collation
       | (sorting order), line breaking and other text segmentation (words
       | and sentences, besides grapheme clusters). The main point here is
       | to know that there are specifications one should take into
       | account when topics like that come up, instead of just inventing
       | your own algorithm.
        
       | nabla9 wrote:
       | >3 Grapheme Cluster Boundaries
       | 
       | >It is important to recognize that what the user thinks of as a
       | "character"--a basic unit of a writing system for a language--may
       | not be just a single Unicode code point. Instead, that basic unit
       | may be made up of multiple Unicode code points. To avoid
       | ambiguity with the computer use of the term character, this is
       | called a user-perceived character. For example, "G" + grave-
       | accent is a user-perceived character: users think of it as a
       | single character, yet is actually represented by two Unicode code
       | points. These user-perceived characters are approximated by what
       | is called a grapheme cluster, which can be determined
       | programmatically.
        
         | sebstefan wrote:
          | Oh my god, is there ever anything simple about Unicode?
        
           | WorldMaker wrote:
           | Compared to the ancient world of EBCDIC versus ASCII versus
           | various ISO standards versus country-defined encodings versus
           | Extended EBCDIC code pages versus Extended ASCII code pages
           | which varied depending on operating system, nearest flag
           | pole, network adapter, time of day, etc...: Unicode will
           | forever be a simpler walk in the park.
           | 
            | Its complexity is a relief compared to where we've been.
           | It's definitely not simple, but it will forever be far
           | simpler than what our grandmothers had to work with if they
           | were writing international software.
        
         | nottorp wrote:
         | > These user-perceived characters are approximated by what is
         | called a grapheme cluster, which can be determined
         | programmatically.
         | 
         | From everything i've read or heard about unicode, "determined
         | programmatically" is false?
        
       | qwerty456127 wrote:
       | > The rest, about 800,000 code points, are not allocated at the
       | moment. They could become characters in the future.
       | 
       | Why is Tengwar still not in Unicode officially? What's the
       | problem with it?
        
         | teddyh wrote:
         | Tengwar is in the Under-ConScript Unicode Registry:
         | <https://www.kreativekorp.com/ucsur/>
        
           | qwerty456127 wrote:
           | The ConScript Unicode Registry is a volunteer project to
           | coordinate the assignment of code points in the Unicode
           | Private Use Areas (PUA). Why does tengwar have to be in the
           | PUA, why not make it a first-class charset? It's not just a
           | minor conlang a small group of geeks invented on a weekend,
           | it's a well-established piece of the modern culture, isn't
           | it?
        
         | badcppdev wrote:
         | To save other people the google: Tengwar is probably not in
         | unicode because it is a fictional script from a book.
        
           | zajio1am wrote:
           | While U+A66E multiocular O can be found in just one
           | manuscript, and it is still in Unicode:
           | https://en.wikipedia.org/wiki/Multiocular_O
        
           | bigstrat2003 wrote:
           | Honestly, I wouldn't have thought that would be an issue to
           | the Unicode folks. They have already allowed things (emoji)
            | that have no place being in the standard, as they _aren't
            | even text_.
        
             | hot_gril wrote:
             | I feel like Apple pushed the consortium to add a ton of
             | useless emojis for whatever their own reasons were.
        
           | hot_gril wrote:
           | Looks like Georgian
        
           | qwerty456127 wrote:
           | I would wonder how many people are here who have never seen
           | Tengwar. I would bet that's a minuscule minority.
        
             | JohnFen wrote:
             | I've never even heard of it before.
        
               | red_trumpet wrote:
               | That's a higher bar than having seen it, I think. I also
               | had to look it up, but as soon as I saw the images in
               | Wikipedia I knew that it's from Lord of the Rings.
        
               | hot_gril wrote:
               | It is. But the even higher bar is that you actually write
               | in this script.
        
         | WorldMaker wrote:
         | The problem with Tengwar (and Klingon) is the problem with a
         | lot of pop culture right now: copyright. The Tolkien Estate
         | still exists and still litigiously upholds what it can of their
         | copyright terms. CBS Viacom (Paramount) still claim a copyright
         | interest in all the written forms of Klingon.
         | 
          | Copyright is not technically violated simply by _encoding_ the
          | characters into a plane such as one of Unicode's; that's an
          | easy, open-and-shut fair use. But Unicode principals have stated
          | they don't want to pass the copyright burden on to font authors
          | either, who would be sued if they tried to paint some of
          | those characters. (Why encode something that fonts aren't
         | allowed to produce?) That _should_ also be fair use, but the
         | law is complicated and copyright still so often today leans in
         | favor of the Estates and major Corporations rather than fair
         | use and the public commons.
         | 
         | (ETA: I'm hugely in favor that "conlang", constructed language,
         | scripts such as these _should_ be encoded by Unicode. I wish
         | someday we fix the copyright problems of them.)
        
       | thyselius wrote:
       | Wonderful to learn more about Unicode.
       | 
       | Does anyone know how to write a function (preferably in swift) to
       | remove emoji? This is surprisingly hard (if the string can be any
       | language, like English or Chinese).
       | 
       | There's been multiple attempts on Stackoverflow but they're all
       | missing some of them, as Unicode is so complex.
        
         | favorited wrote:
          | Here's a one-liner, producing the string "text 0123 Han Zi ":
          | 
          | `String("text EMOJI 0123 Han Zi ".unicodeScalars.filter({ !$0.properties.isEmojiPresentation }))`
         | 
         | (I've had to substitute EMOJI for a smiley face, because HN is
         | bad at text encoding.)
        
           | Retr0id wrote:
           | It's not a bug, HN deliberately strips emojis.
        
           | thyselius wrote:
            | Thanks. Unfortunately both .isEmojiPresentation and .isEmoji
            | leave many emojis out, like the red heart and many others.
        
             | astrange wrote:
             | Those aren't inherently emojis, the font just shows them as
             | emojis, so you'd have to render the text.
        
               | favorited wrote:
               | Correct. `isEmojiPresentation` checks if, per the Unicode
               | standard, this scalar should default to an emoji
               | presentation.
        
         | fiedzia wrote:
          | I haven't tried it, but use libicu (ICU): split the text into
          | graphemes and remove any grapheme whose first codepoint has
          | the Zsye script (emoji-style symbols). There should be Swift
          | bindings.
        
       | hoseja wrote:
       | https://tonsky.me/blog/unicode/overview@2x.png
       | 
       | Wow, what an abominable mix of decimal and hexadecimal.
        
         | Karellen wrote:
         | Where are the decimal numbers in that image?
        
           | morelisp wrote:
           | What comes after 9FFFF?
        
             | Karellen wrote:
             | Good catch
             | 
             | doh.
        
           | bajsejohannes wrote:
           | It goes 90000..9FFFF then 100000..10FFFF. The latter should
           | have been A0000..AFFFF.
           | 
           | So the author is using hex for the last four digits and
           | decimal for the remaining ones.
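            | The arithmetic, for the record:

```python
# The block after 0x9FFFF starts at 0xA0000, not 0x100000:
print(hex(0x9FFFF + 1))  # '0xa0000'
# The Unicode code space tops out at U+10FFFF:
print(0x10FFFF)          # 1114111
```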
        
             | tonsky wrote:
             | oops :) fixed, thanks!
        
       | titzer wrote:
       | > The problem is, you don't want to operate on code points. A
       | code point is not a unit of writing; one code point is not always
       | a single character. What you should be iterating on is called
       | "extended grapheme clusters", or graphemes for short.
       | 
       | It's best to avoid making overly-general claims like this. There
       | are plenty of situations that warrant operating on code points,
       | and it's likely that software trying and failing to make sense of
       | grapheme clusters will result in a worse screwup. Codepoints
       | are probably the _best_ default. For example, it probably makes
       | the most sense for programming languages to define strings as
       | arrays of code points, and not characters or 16-bit chunks or an
       | encoding, or whatever.
        
         | Dylan16807 wrote:
         | Situations such as?
         | 
         | Sometimes editing wants to go inside clusters but that's not
         | code-point based either.
         | 
         | I'd say that in a big majority of situations, code that is
         | indexing an array with code points is either treating the
         | indexes as opaque pointers or is doing something wrong.
        
         | hgs3 wrote:
         | > There are plenty of situations that warrant operating on code
         | points
         | 
         | Absolutely correct. All algorithms defined by the Unicode
         | Standard and its technical reports operate on the code point.
         | All 90+ character properties defined by the standard are
         | queried for with the code point. The article omits this
         | information and ironically links to the grapheme cluster break
         | rules which operate on code points.
        
           | Dylan16807 wrote:
           | The article doesn't say not to use code points, it says you
           | should not be iterating on them.
           | 
           | Very rarely will you be implementing those algorithms. And if
           | you're looking at character properties, the article says you
           | should be looking at multiple together, which is correct.
        
             | hgs3 wrote:
             | > And if you're looking at character properties, the
             | article says you should be looking at multiple together,
             | which is correct.
             | 
             | I don't see where the article mentions Unicode character
             | properties [1]. These properties are assigned to individual
             | characters, not groups of characters or grapheme clusters.
             | 
             | > Very rarely will you be implementing those algorithms.
             | 
             | True, but character properties _are_ frequently used, i.e.
             | every time you parse text and call a character
             | classification function like  "isDigit" or "isControl"
             | provided by your standard library you are in fact querying
             | a Unicode character property.
             | 
             | [1] https://unicode.org/reports/tr44/#Properties
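                | In Python, for instance, these per-code-point properties
                | are exposed through the stdlib unicodedata module:

```python
import unicodedata

# Character properties are defined per code point, not per grapheme cluster:
print(unicodedata.category("5"))       # 'Nd' -- Decimal_Number, what isDigit checks
print(unicodedata.category("\u0301"))  # 'Mn' -- Nonspacing_Mark
print(unicodedata.digit("5"))          # 5
```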
        
               | Dylan16807 wrote:
               | > These properties are assigned to individual characters,
               | not groups of characters or grapheme clusters.
               | 
               | But you need to deal with the whole cluster. You can't
               | just look at the properties on a single combining
               | character and know what to do with it.
               | 
               | If the article's saying to iterate one cluster at a time,
               | then if you're doing properties a direct consequence is
               | that you should be looking at the properties of specific
               | code points per cluster or all of them.
        
               | hgs3 wrote:
               | The Unicode Standard does not specify how character
               | properties should be extracted from a grapheme cluster.
               | Programming languages that define "character" to mean
               | grapheme cluster (like Swift) need to establish their own
               | ad-hoc rules.
               | 
               | As others have pointed out in this thread, the article is
               | full of the authors own personal opinions. The author
               | suggests iterating text as grapheme clusters, but fails
               | to consider that this breaks tokenizers, e.g. a tokenizer
               | for a comma-separated list [1] won't see the comma as
               | "just a comma" if the value after it begins with a
               | combining character.
               | 
               | [1] https://en.wikipedia.org/wiki/Comma-separated_values
        
               | PeterisP wrote:
               | If some tokenizer of a comma-separated list treats the
               | comma (I'm assuming any 0x2C byte) as "just a comma" even
               | if the value after it begins with a combining character,
               | that's a broken, buggy tokenizer, and one that can
               | potentially be exploited by providing some specifically
               | crafted unicode data in a single field that then causes
               | the tokenizer to misinterpret field boundaries. If you
               | combine a character with something, that's not the same
               | character anymore - it's not equal to that, it's not that
               | separator anymore, and you can't tell that unless/until
               | you look at the following codepoints.
               | 
               | If anything, your example is an illustration why it's
               | dangerous to iterate over codepoints and not graphemes.
        
               | Dylan16807 wrote:
               | > The Unicode Standard does not specify how character
               | properties should be extracted from a grapheme cluster.
               | Programming languages that define "character" to mean
               | grapheme cluster (like Swift) need to establish their own
               | ad-hoc rules.
               | 
               | Right. Which means not just iterating by code point.
               | 
               | > The author suggests iterating text as grapheme
               | clusters, but fails to consider that this breaks
               | tokenizers, e.g. a tokenizer for a comma-separated list
               | [1] won't see the comma as "just a comma" if the value
               | after it begins with a combining character.
               | 
               | I don't think they're talking about tokenizers. It's a
               | general purpose rule.
               | 
               | Also I would argue that a CSV file with non-attached
               | combining characters doesn't qualify as "text".
        
       | jcranmer wrote:
       | There's one part of this document that I would push extremely
       | hard against, and that's the notion that "extended grapheme
       | clusters" are the one true, right way to think of characters in
       | Unicode, and therefore any language that views the length in any
       | other way is doing it wrong.
       | 
       | The truth of the matter is that there are several different
       | definitions of "character", depending on what you want to use it
       | for. An extended grapheme cluster is largely defined on "this
       | visually displays as a single unit", which isn't necessarily
       | correct for things like "display size in a monospace font" or
       | "thing that gets deleted when you hit backspace." Like so many
       | other things in Unicode, the correct answer is use-case
       | dependent.
       | 
       | (And for this reason, String iteration should be based on
       | codepoints--it's the fundamental level on which Unicode works,
       | and whatever algorithm you want to use to derive the correct
       | answer for your purpose will be based on codepoint iteration.
       | hsivonen's article (https://hsivonen.fi/string-length/), linked
       | in this one, does try to explain why extended grapheme clusters
       | is the wrong primitive to use in a language.)
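       | The different "lengths" are easy to see even without emoji; a
       | Python sketch using a decomposed 'é':

```python
s = "e\u0301"  # 'é' as one grapheme, two code points
print(len(s))                           # 2 (code points)
print(len(s.encode("utf-8")))           # 3 (UTF-8 bytes)
print(len(s.encode("utf-16-le")) // 2)  # 2 (UTF-16 code units)
```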
        
         | mananaysiempre wrote:
         | > thing that gets deleted when you hit backspace
         | 
         | Is there a canonical source for this part, by the way? Xi
         | copied the logic from Android[1] (per the issue you linked
         | downthread), which is reasonable given its heritage but seems
         | suboptimal generally, and I vaguely remember that CLDR had
         | something to say about this too, but I don't know if there's
         | any sort of consensus here that's actually written down
         | anywhere.
         | 
         | [1] https://github.com/xi-editor/xi-editor/pull/837
        
         | pif wrote:
         | > An extended grapheme cluster is largely defined on "this
         | visually displays as a single unit", which isn't necessarily
         | correct for things like "display size in a monospace font" or
         | "thing that gets deleted when you hit backspace."
         | 
         | I'm sorry, but I fail to see how "This visually displays as a
         | single unit" could ever differ from "Display size in a
         | monospace font" or "Thing that gets deleted when you hit
         | backspace".
        
           | yeputons wrote:
           | Here is a full article of such examples:
           | https://manishearth.github.io/blog/2017/01/14/stop-
           | ascribing...
           | 
           | Discussion on HN:
           | https://news.ycombinator.com/item?id=31858311
        
           | RichieAHB wrote:
           | Here's a good example of the test cases used for backspaces
           | in Android[1]. It's definitely more involved than just
           | deleting a grapheme cluster.
           | 
           | [1] https://android.googlesource.com/platform/frameworks/base
           | /+/...
        
           | mattnewton wrote:
           | > Display size in a monospace font
           | 
           | Some clusters are going to be multiple characters wide.
           | 
           | > thing that gets deleted when you hit backspace
           | 
            | Some clusters are meant to be composed of multiple
            | keystrokes, and a natural editing experience would allow
            | users to delete the last stroke.
           | 
           | Look into how Korean works.
        
           | jfultz wrote:
           | A couple of cases I'm aware of...
           | 
            | * Coding ligatures often display as a single glyph (maybe
            | occupying a single-width character space, or maybe spread out
            | over multiple spaces), but are composed of multiple characters.
           | The ligature may "look" like a single character for purposes
           | of selection and cursoring, but it can act like multiple
           | characters when subject to backspacing.
           | 
           | * Similarly, I've seen keyboard interfaces for various
           | languages (e.g., Hindi) where standard grapheme cluster rules
           | bind together a group of code points, but the grapheme
           | cluster was composed from multiple key presses (which
           | typically add one code point each to the cluster). And in
           | some such interfaces I've seen, the cluster can be decomposed
            | by an equal number of backspace presses. I'm not sure how
            | much sense a monospaced Hindi font makes, but
           | it's definitely a case where a "character" doesn't always act
           | "character-like".
        
           | jcranmer wrote:
           | See, e.g., https://github.com/xi-editor/xi-editor/issues/655
           | for why backspace isn't the same as extended grapheme
           | cluster.
           | 
           | As for "display size in monospace font", emojis and CJK
           | characters are usually two units wide, not one (although, to
           | be honest, there's a fair amount of bugs in the Unicode
           | properties that define this).
        
           | layer8 wrote:
           | In terminals there is a distinction between single-width and
           | double-width characters (east-asian characters, in
            | particular). E.g. the three characters
            | 
            |     A美C
            | 
            | would take up the width of four ASCII monospace characters,
            | the "美" being double-width.
           | 
            | Similarly, for a composed character like, say, the ligature
            | "ﬀ" (U+FB00), you may want to backspace as if it were two
            | "f"s (which logically it is, and decomposes to in NFKD
            | normalization).
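            | The single- vs double-width distinction is queryable via
            | Unicode's East_Asian_Width property; a Python sketch (美
            | used here as an example CJK character):

```python
import unicodedata

for ch in "A\u7f8eC":  # 'A', CJK 美, 'C'
    print(ch, unicodedata.east_asian_width(ch))
# 'A' and 'C' are 'Na' (narrow, 1 cell); 美 is 'W' (wide, 2 cells): 4 cells total
```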
        
           | orphea wrote:
           | If you type "a", combine it with "'", then change your mind
           | and hit backspace, you probably want to end up with "a" even
           | through "a" was a thing "visually displayed as a single
           | unit".
        
             | Findecanor wrote:
             | Most European keyboard layouts have it the other way
             | around: first press a "dead key" for the diacritic mark and
             | then the letter to apply it to.
             | 
             | Where some layouts may require this method for some
             | characters, another keyboard layout may have the same
             | character on a dedicated key.
             | 
             | The program receives the combined character as one unit,
             | and does not need to be aware of different keyboard
             | layouts.
        
               | umanwizard wrote:
               | > Most European keyboard layouts have it the other way
               | around: first press a "dead key" for the diacritic mark
               | and then the letter to apply it to.
               | 
               | Which ones? At least the French and German ones don't
               | work like that: there is no composing, just separate keys
               | for all the characters with diacritics that appear in the
               | language.
        
               | Reefersleep wrote:
               | Danish is one.
        
               | riggsdk wrote:
                | Danish keyboards also require you to press '"' first and
                | then 'o' to produce 'ö'.
        
               | Sardtok wrote:
                | But do you really use ö much over ø?
        
               | riggsdk wrote:
                | No, but once in a while (very rarely) I write a little
                | in German, which might use that character.
        
               | mostlylurks wrote:
                | Do the Danes not have the mechanism that is found on
                | Finnish keyboard layouts, where pressing AltGr+Ö yields
                | Ø and AltGr+Ä yields Æ, except in reverse?
        
               | Findecanor wrote:
               | Those mappings are not universal. They are present under
               | Linux but not on MS-Windows. I don't know about Mac, but
               | the layout has in the past been slightly different there
               | from Windows also.
        
               | riggsdk wrote:
                | For me that doesn't work on Windows. Those key
                | combinations don't seem to do anything.
        
               | greenshackle2 wrote:
               | Which French layout would that be? I've never seen a
               | French keyboard where this is true. French is my native
               | language. On layouts I'm familiar with, _some_ accented
                | letters have separate keys, like é, but not all; the
                | others are made by composing an accent key with a letter.
        
               | umanwizard wrote:
               | You're right, sorry. I had forgotten about the ^ and "
               | keys.
        
               | tpm wrote:
               | Slovak or Czech for example.
        
               | mostlylurks wrote:
               | The nordic layout(s) offer such a mechanism to allow
               | people to type in letters that you'll find in various
               | other European languages, even though the extra letters
                | used in the languages themselves (å, ä, ö, æ, ø) are
                | present as their own keys. Interestingly, the Swedish
                | layout has no dedicated é key, although é occurs in some
                | Swedish words.
        
               | gumby wrote:
                | In Swedish, Å, Ä, and Ö are actual letters of the
                | alphabet, while é is used in foreign words. Likewise, the
                | English diaeresis (e.g. in "coöperate") is essentially
                | unknown in the US and only occasionally used in England,
                | so it doesn't give rise to diaeresis characters on the
                | keyboard.
        
               | gdprrrr wrote:
               | On the German Layout the backtick (next to the 1 key) is
               | a dead key.
        
               | TacticalCoder wrote:
                | Nitpicking, but most French keyboards have both ready-
                | made keys for "é" and the few other commonly used
                | characters _and_ composing: hitting either '"' or '^'
                | first. For example, hitting '"' then 'e' produces "ë".
        
               | umanwizard wrote:
               | You are right, thanks.
        
               | mananaysiempre wrote:
               | > first press a "dead key" for the diacritic mark and
               | then the letter to apply it to.
               | 
               | That being exactly the way "floating diacritics" in ISO
               | 2022 (or properly one of its Latin encodings, T.51 = ISO
               | 6937) work, amusingly. I wonder which came first. (Yes, I
               | know that a<BS>` came first, the ASCII spec even says
               | that this should give you an accented character IIRC. Or
               | perhaps it was one of the other "don't call it ASCII"
               | specs--ISO 646? IA5?..)
        
             | [deleted]
        
             | kdmccormick wrote:
             | But then if I type "ä" directly (through, say, a mobile
             | keyboard) and hit backspace, I'd get "a", which doesn't
             | seem _terrible_ but does feel a little off.
             | 
             | Seems like the right answer for codepoints vs graphemes,
             | unfortunately, is dependent on the context.
        
             | bombela wrote:
             | I expect to delete the character "ä". And I prefer
             | consistency too, so I expect "œ" and "<emoji>" and
             | "<emoji>" to be deleted as one unit.
             | 
             | edit: emojis are filtered by HN
        
               | pests wrote:
               | Even the emojis that you create by combining multiple
               | emojis? Type one emoji, then a second, and it merges
               | into one. What happens when you backspace?
        
             | mostlylurks wrote:
             | As a European, no I don't. é isn't used in my language,
             | but my layout offers it via a dead-key-then-base-letter
             | mechanism, and it is correctly treated as one unit when
             | pressing backspace; anything else would feel incorrect. It
             | would be even worse if such a thing happened for the
             | letters that my layout offers individual keys for (ÅÄÖ).
             | Some languages do treat these as letters with attached
             | modifiers, but many, including mine, treat them as
             | indivisible letters that just happen to look similar to
             | some others for historical reasons, and to treat them as
             | combinations of base letters and diacritics would be
             | completely incorrect, even if you typed them in using the
             | dead-key-then-base-letter mechanism for some reason.
        
         | haberman wrote:
         | In that case, it sounds like `length` on Unicode strings simply
         | shouldn't exist, since there is no obvious right answer for it.
         | Instead there should be `codepointCount`, `graphemeCount`, etc.
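
The distinction can be sketched with Python's standard library, which counts code points via len() and exposes normalization via unicodedata; counting grapheme clusters would need a third-party library, so this is only a partial illustration:

```python
import unicodedata

s = "A\u030A"  # "Å" typed as A + COMBINING RING ABOVE

print(len(s))                                # 2 code points
print(len(s.encode("utf-8")))                # 3 UTF-8 bytes
print(len(unicodedata.normalize("NFC", s)))  # 1 code point after composing
```

The same user-perceived character thus yields three different "lengths" depending on which unit you ask about.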
        
         | astrange wrote:
         | String iteration should be based on whatever you want to
         | iterate on - bytes, codepoints, grapheme clusters, words or
         | paragraphs. There's no reason to privilege any one of these,
         | and Swift doesn't do this.
         | 
         | "Length" is a meaningless query because of this, but you might
         | want to default to whatever approximates width in a UI label,
         | so that's grapheme clusters. Using codepoints mostly means you
         | wish you were doing bytes.
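
As a sketch in Python (whose str iterates code points natively; bytes need an explicit encode, and grapheme-cluster iteration would need a third-party library such as ICU bindings):

```python
s = "caf\u00E9"  # "café" with precomposed é (U+00E9)

# Byte iteration: é becomes two UTF-8 bytes (0xC3 0xA9).
assert list(s.encode("utf-8")) == [0x63, 0x61, 0x66, 0xC3, 0xA9]

# Code point iteration: Python's default for str.
assert list(s) == ["c", "a", "f", "\u00E9"]
```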
        
           | b3morales wrote:
           | > There's no reason to privilege any one of these, and Swift
           | doesn't do this.
           | 
           | Strange thing to say: Swift String count property is the
           | count of extended grapheme clusters. The documentation is
           | explicit:
           | 
           | > A string is a collection of _extended grapheme clusters_ ,
           | which approximate human-readable characters. [emphasis in
           | original]
        
             | astrange wrote:
             | The length/count property was added after people asked for
             | it, but it wasn't originally in the String revamp, and it
             | provides iterators for all of the above. .count also only
             | claims to be O(n) to discourage using it.
        
         | lucideer wrote:
         | I'm not Korean but seeing that said of the Hangul example
         | definitely made me pause - I doubt Koreans think of that
         | example as a single grapheme (open to correction), though it is
         | an excellent example all the same since it demonstrates the
          | complexity of defining "units" consistently across languages.
         | 
         | It reminds me a little of Open Street Map's inconsistent
         | administrative hierarchies ("states", "countries", "counties",
         | etc. being represented at different administrative "levels" in
         | their hierarchy for each geographical area), and how that
         | hinders consistency in styling: font size, zoom levels, etc.
         | are generally applied by level.
        
         | hgs3 wrote:
         | Everybody loves to debate what "character" means but nobody
         | ever consults the standard. In the Unicode Standard a
         | "character" is an abstract unit of textual data identified by a
         | code point. The standard never refers to graphemes as
         | "characters" but rather as _user-perceived characters_ which
         | the article omits.
        
         | raphlinus wrote:
         | Agreed. And one more consideration is that (extended) grapheme
         | cluster boundaries vary from one version of Unicode to another,
         | and also allow for "tailoring." For example, should "`am" be
         | one grapheme cluster or two? It's two on Android but one by
         | the Unicode recommendation, which is the behavior on macOS.
         | So in
         | applications where a query such as length needs to have one
         | definitive answer which cannot change by context, counting
         | (extended) grapheme clusters is the wrong way to go.
        
         | riggsdk wrote:
         | There are libraries that help with iterating both code-points
         | and grapheme clusters... - but are there any of them that can
         | help decide what to do for example when pressing backspace
         | given an input string and a cursor position? Or any other text
         | editing behavior. This use-case-dependent behavior must have
         | some "correct" behavior that is standardized somewhere?
         | 
         | Like a way to query what should be treated like a single
         | "symbol" when selecting text? Basically something that could
         | help out users making simple text editors. There are so many
         | bad implementations out there that do it incorrectly, so
         | there must be some tools/libraries to help with this? Not
         | actual applications but for people making games as well where
         | you want users to enter names, chat or other text. Not all
         | platforms make it easy (or possible) to embed a fully fledged
         | text editing engine for those use-cases.
         | 
         | I can imagine that typing a multi-code-point character manually
         | by hand would allow the user to undo their typing mistake by a
         | single backspace press when they are actively typing it, but
         | after that if you return to the symbol and press backspace that
         | it would delete the whole symbol (grapheme cluster).
         | 
         | For example if you manually entered the code points for the
         | various family combination emojis (mother, son, daughter) you
         | could still correct it for a while - but after the fact the
         | editor would only see it as a single symbol to be deleted with
         | a single backspace press?
         | 
         | Or typing 'o' + '"' to produce 'ö' but realizing you wanted
         | to type 'ô': there, a single backspace press would revert it
         | to 'o' again and you could press '^' to get the 'ô'. (Not
         | sure that is the way in which you would normally type those
         | characters, but it seems possible to do it with Unicode that
         | way.)
        
           | gumby wrote:
           | > Or typing 'o' + '"' to produce 'ö' but realizing you
           | wanted to type 'ô': there, a single backspace press would
           | revert it to 'o' again and you could press '^' to get the
           | 'ô'.
           | 
           | This is a good example because in German I would expect 'o'
           | + '"' + <delete> to leave no character at all, while in
           | French I would expect 'e' + '`' + <delete> to leave the e
           | behind because in my mind it was a typo.
           | 
           | The rendering of brahmic- and arabic-derived scripts makes
           | these choices even more interesting.
        
             | posix86 wrote:
             | But typing "ö" (e.g. Swiss keyboard) and pressing delete
             | & getting an "o" would be annoying af
        
               | riggsdk wrote:
               | I realize that the editor would need to be the system
               | that keeps track of how the character was entered for
               | this to work. If you made the character with a single
               | keypress, it would only make sense that backspace also
               | undoes the entire character. Only if you created the
               | character from multiple keypresses would it make sense
               | to "undo" only part of it with backspace (at least until
               | you move away from the character).
        
               | gumby wrote:
               | Definitely agree with that! I use a US kbd (incl on
               | phone) no matter what language I'm writing in. A little
               | annoying but switching kbd layouts is more disruptive for
               | me.
        
               | jraph wrote:
               | Same for a French keyboard with éèàù, which are all
               | typed using one key. But even the circumflex- and
               | diaeresis-accented vowels, all typed using at least two
               | keys, if not 3 with a compose key (from memory, I'm
               | using a phone). Everybody is used to the way it has been
               | working on all OSes.
        
             | makapuf wrote:
               | In French, é is a single character issued by a single
               | keypress on a French keyboard, like è or à. (Note that
               | 'A' is shift+a.) Why should it need two backspaces? If
               | you press e+` you get e`, not è.
        
               | NikolaNovak wrote:
               | I am assuming that means "on French keyboard", not "in
               | French". I have a usa keyboard and live in Canada...Every
               | now and then it thinks I'm typing French and keyboard
               | indeed behaves in a way that some vowel plus some
               | quotation mark indeed gives me some other character (that
               | I don't need :)
        
           | jcranmer wrote:
           | Some platforms (e.g., Android) have methods specifically for
           | asking how to edit a string following a backspace. However,
           | there's no standard Unicode algorithm to answer the question
           | (and I strongly suspect that it's something that's actually
           | locale-dependent to a degree).
           | 
           | On further reflection, probably the best starting point for
           | string editing on backspace is to operate on codepoints,
           | _not_ grapheme clusters. For most written languages, the
           | various elements that make up a character are likely to be
           | separate codepoints. In Latin text, diacritics are generally
           | precomposed (I mean, you can have a + diacritic as opposed
           | to precomposed ä in theory, but the IME system is going to
           | spit out ä anyway, even if dead keys are used). But if you
           | have Indic characters or Hangul, the grapheme cluster
           | algorithm is going to erroneously combine multiple
           | characters into a single unit. The issue is that the
           | biggest false positive for a codepoint-based algorithm is
           | emoji, and if you're a monolingual speaker whose only
           | exposure to complex written scripts is Unicode emoji,
           | you're going to incorrectly generalize it for all written
           | languages.
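
A minimal sketch of codepoint-based backspace in Python, whose string slicing operates on code points; note how it behaves differently for the precomposed and decomposed forms of the same letter:

```python
def backspace(text: str) -> str:
    # Drop the last code point (Python slicing is codepoint-based).
    return text[:-1]

# Decomposed "ö" (o + COMBINING DIAERESIS): only the accent is removed.
assert backspace("o\u0308") == "o"

# Precomposed "ö" (U+00F6) is one code point, so the whole letter goes.
assert backspace("\u00F6") == ""
```

Which of the two behaviors the user sees thus depends on which encoding their IME happened to produce, which is exactly the inconsistency discussed above.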
        
           | layer8 wrote:
           | Behavior that depends on whether you edited something else in
           | between, or that depends on timing, is just bad. Either
           | always backspace grapheme clusters, or else backspace
           | characters, possibly NFC-normalized. I could also imagine
           | having something like Shift+Backspace to backspace NFKD-
           | normalized characters when normal Backspace deletes grapheme
           | clusters.
           | 
           | As for selection and cursor movement, grapheme clusters would
           | seem to be the correct choice. Same for Delete. An editor may
           | also support an "exploded" view of separate characters (like
           | WordPerfect Reveal Codes) where you manipulate individual
           | characters.
        
           | PeterisP wrote:
           | I'd argue that you must use grapheme clusters for text
           | editing and cursor position, because there are popular
           | characters (like the ö you used as an example) which can be
           | either one or two codepoints depending on the normalization
           | choice, but the difference is invisible to the user and
           | should not matter to the user, so any editor should behave
           | _exactly_ the same for ö as U+00F6 (LATIN SMALL LETTER O
           | WITH DIAERESIS) and ö as a sequence of U+006F (LATIN SMALL
           | LETTER O) and U+0308 (COMBINING DIAERESIS).
           | 
           | Furthermore, you shouldn't assume that there is any
           | relationship between how Unicode constructs a combined
           | character from codepoints and how that character is typed.
           | Even at the level of typing you're _not_ typing Unicode
           | codepoints - they're just a technical standard
           | representation of "text at rest"; Unicode codepoints do not
           | define an input method. Depending on your language and
           | device, a sequence of three or more keystrokes may be used
           | to get a single codepoint, or a dedicated key on a keyboard
           | or a virtual button may spawn a combined character of
           | multiple codepoints as a single unit. You definitely can't
           | assume that the "last codepoint" corresponds to the "last
           | user action" even if you're writing a text editor - much of
           | that can happen before your editor receives the input from
           | e.g. OS keyboard layout code. Your editor won't know
           | whether I input that ö from a dedicated key, a 'chord' of
           | the 'o' key with a modifier, or a sequence of two
           | keystrokes (and if so, whether 'o' was the first keystroke
           | or the second, opposite of how the Unicode codepoints are
           | ordered).
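
A sketch with Python's stdlib showing that the two encodings of "ö" only compare equal after normalization:

```python
import unicodedata

precomposed = "\u00F6"  # ö as LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"  # o + COMBINING DIAERESIS

# Raw comparison sees two different code point sequences.
assert precomposed != decomposed

# After NFC normalization, both collapse to the same sequence.
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", decomposed)
```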
        
             | mananaysiempre wrote:
             | > I'd argue that you must use grapheme clusters for text
             | editing and cursor position
             | 
             | Korean packs syllables into Han-script-like squares, but
             | they are unmistakably composed of alphabetic letters, and
             | are both typed and erased that way (the latter may depend
             | on system configuration), yet the NFC form has only a
             | single codepoint per syllable ( _a fortiori_ a single
             | grapheme cluster). Hebrew vowel markings, where used, are
             | (reasonably) considered part of a grapheme cluster but
             | nevertheless erased and deleted separately. In both of
             | those cases, pressing backspace will erase less than
             | pressing shift-left, backspace; that is, cursor movement
             | and backspace boundaries are different.
             | 
             | There IIRC are also scripts that will have a vowel both
             | pronounced and encoded in the codepoint stream _after_ the
             | syllable-initial consonant but written _before_ it; and
             | ones where some parts of a syllable will _enclose_ it. I
             | don't even want to think how cursor movement works there.
             | 
             | Overall, your suggestion will work for Latin, Cyrillic,
             | Greek(?), and maybe other nonfancy scripts like Armenian,
             | Ge'ez, or Georgian, but will absolutely crash and burn for
             | others.
        
       | WalterBright wrote:
       | Quotes from the article illustrating what a train wreck Unicode
       | has become:
       | 
       | "The problem is, in Unicode, some graphemes are encoded with
       | multiple code points!"
       | 
       | "An Extended Grapheme Cluster is a sequence of one or more
       | Unicode code points that must be treated as a single,
       | unbreakable character."
       | 
       | "Starting roughly in 2014, Unicode has been releasing a major
       | revision of their standard every year."
       | 
       | "Å" === "Å"; "Å" === "Å"; "Å" === "Å" - What do you get?
       | False? You should get false, and it's not a mistake.
       | 
       | "That's why we need normalization."
       | 
       | "Unicode is locale-dependent"
       | 
       | The article forgot one: characters that switch presentation to
       | right-to-left.
        
       | dathinab wrote:
       | The author seems to hate people with concentration issues
       | and/or various visual sensitivities.
       | 
       | That collaboration tools show the moving mouse cursors of
       | other participants even if they aren't needed/wanted is
       | already pretty bad; why bring it to a website?
        
         | wffurr wrote:
         | This seems like good feedback but it could really be phrased
         | more constructively. I doubt the author "hates" any such thing
         | and you know it too. "Didn't design with such in mind", sure.
         | You can do better.
        
           | dathinab wrote:
           | yes, I should have highlighted that it is satire
           | 
           | though it also wasn't meant to be constructive critique
        
       | kazinator wrote:
       | If you have to recognize a grapheme cluster, it will be easier
       | to do that from a sequence of code points than from UTF-8.
       | 
       | It's like saying that we don't need to tokenize, because you
       | never want to deal with tokens anyway, but phrase structures!
       | 
       | Mmkay, whatever ...
        
       ___________________________________________________________________
       (page generated 2023-10-02 23:00 UTC)