[HN Gopher] What every software developer must know about Unicod...
___________________________________________________________________
What every software developer must know about Unicode in 2023
Author : mrzool
Score : 566 points
Date : 2023-10-02 09:22 UTC (13 hours ago)
(HTM) web link (tonsky.me)
(TXT) w3m dump (tonsky.me)
| penguin_booze wrote:
| I knew that domain, so I had sunglasses at hand before opening
| the page!
| gorgoiler wrote:
| With the benefit of hindsight, would we include the error
| detection bits of UTF8 if we could choose not to?
| rurban wrote:
| The Why is "A" !== "A" !== "A"? section still strikes me as
| wrong. The strings are equal even when the representations
| differ.
| nextaccountic wrote:
| They are logically equal (that is, they represent the same text
| in an abstract way), but computing this equality in practice is
| expensive, because you first need to normalize the strings then
| compare.
|
| Most languages, when comparing strings, skip the normalization
| and just compare string bytes as-is (or, if the string is
| interned, compare just the pointer).
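The normalize-then-compare step described above can be sketched with Python's standard `unicodedata` module (Python used purely for illustration; the point is language-agnostic):

```python
import unicodedata

# U+00C5 (precomposed A-ring) vs U+0041 U+030A (A + combining ring above):
# logically the same text, but different code point sequences.
precomposed = "\u00C5"
decomposed = "A\u030A"

# A naive byte/code-point comparison sees them as different.
print(precomposed == decomposed)  # False

# Normalizing both sides first (here to NFC) makes them compare equal.
# This is exactly the extra work most languages skip by default.
nfc = unicodedata.normalize("NFC", decomposed)
print(precomposed == nfc)  # True
```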
| bajsejohannes wrote:
| I'm just not sure why they put in the "Angstrom symbol" to
| begin with. If you do, then why isn't the "meter symbol" (m)
| also represented?
|
| Fortunately, it seems like it's marked as deprecated:
| https://en.wikipedia.org/wiki/Angstrom#Symbol
| jcranmer wrote:
| > I'm just not sure why they put in the "Angstrom symbol" to
| begin with.
|
| Frequently, the answer to this is "some obscure character set
| had this as a distinct symbol." In this case, blame the
| Japanese: https://en.wikipedia.org/wiki/JIS_X_0208
|
| Which is why there's an 'mm' and 'cm' and other random
| symbols: https://www.compart.com/en/unicode/block/U+3300
| amelius wrote:
| Can we please get a standard that describes how emoji are
| supposed to look?
|
| Now they look different on every platform and many subtleties are
| lost in translation.
| JohnFen wrote:
| Yeah, this problem has led me to avoid using emojis. I can't
| be sure that the meaning I intended is the one being depicted
| on the recipient's machine.
|
| It's probably a good thing, though.
| w10-1 wrote:
| A real question is why IBM, Apple, and Microsoft poured millions
| into developing the unicode standard instead of treating
| character encoding like file formats as a venue for competition.
|
| IBM and Apple in the early 1990's combined in Taligent to try to
| beat MS NT, but failed. But a lot of internationalization came
| out of that and was made open, at the perfect time for Java to
| adopt it.
|
| Interestingly it wasn't just CJK but Thai language variants that
| drove much of the flexibility in early unicode, largely because
| some early developers took a fancy to it.
|
| When you look at the actual variety in written languages, Unicode
| grapheme/code-point/byte seems rather elegant.
|
| We're in the early days of term vectors, small floats, and
| differentiable numerics (not to mention big integers). Are
| lessons from the history of unicode relevant?
| preciousoo wrote:
| You can ask why they didn't do the same for networking and
| serial protocols too.
| samatman wrote:
| Please don't refer to codepoints as characters. Some are, some
| are not, it isn't a useful or informative approximation, it's
| just wrong. Unicode is a table which assigns unique numbers to
| different _codepoints_, most of which are characters. ZWJ is not
| a character at all, and extended grapheme clusters made of
| several codepoints are.
| skitter wrote:
| 'Character' doesn't have a single meaning. ZWJ is a character
| according to definitions (2) and (3) in
| https://unicode.org/glossary/#character
| user3939382 wrote:
| I once bought an O'Reilly book on encoding. It was like 2000
| pages. I never read it; that was about 15 years ago. My takeaway
| is that encoding is really complex and I just kind of pray it
| works, which most of the time it does.
| tr888 wrote:
| What on EARTH is that mouse cursor thing all about? Why would you
| even bother writing this, then making it impossible to read
| properly?
| eerikkivistik wrote:
| I stopped in the middle of reading the post just for this. It
| was so distracting I was unable to focus on the text. It's a
| fun gimmick, but the result is that someone who wanted to read
| the post, stopped in the middle.
| oliwarner wrote:
| It's tracking every visitor's cursor and sharing it with every
| other visitor.
|
| Why would a frontend developer demonstrate their ability to do
| frontend programming on their personal, not altogether super-
| serious blog? I meant that rhetorically but it's a flex. I
| agree, not the best design in the world if you're catering for
| particular needs, but simple and fun enough. You should check
| out dark mode.
|
| In that vein, I think it's okay if we let people have fun. That
| might not work for everyone, but why should we let perfect be
| the worst enemy of fun?
| dathinab wrote:
| > Why would
|
| because it shows that they don't understand important design
| aspects
|
| while it doesn't really show off their technical skills,
| because it could be some plugin or copy-pasted code; only
| someone who looks at the code would know better. But if
| someone cares enough about you to look at your code, you
| don't need to show off that skill on your normal website and
| can have a separate tech demo.
|
| > okay if we let people have fun
|
| yes people having fun is always fine especially if you don't
| care if anyone ever reads your blog or looks at it for
| whatever reason (e.g. hiring)
|
| but the moment you want people to look at it for whatever
| reason then there is tension
|
| i.e. people don't get hired to have fun
|
| and if you want others to read your blog you probably
| shouldn't assault them with constant distractions
| JohnFen wrote:
| Not every website, even a technical one, needs to have an eye
| towards professional advancement. Sometimes they're just
| for fun. I welcome it, as it's a thing that gets more rare
| on the web as time goes by.
| jamincan wrote:
| Considering the dark mode is effectively flashlight mode,
| I think it's reasonable to assume the blog's owner just
| likes to have a bit of fun.
| booleandilemma wrote:
| Lighten up.
| oliwarner wrote:
| > people don't get hired to have fun
|
| Living by that motto is hugely self-destructive.
|
| Creative expression allows us to push ourselves, both in
| _what_ we think we can do, and often the technical aspects
| about _how_ we do it too. Even if the idea doesn't stick,
| you've tried something new.
|
| In a world of Tailwinds and Bootstraps and the same five
| templates copied again and again and again, let's celebrate
| the people willing to push things and learn from their
| inevitable but ultimately valuable mistakes. And let's have
| some fun along the way.
| FinnKuhn wrote:
| I assume the creator didn't anticipate this amount of
| readers at the same time and having one or two other
| cursors on the page does sound fun and not too distracting.
| They should probably limit the number of other cursors
| displayed to a sensible maximum.
| dathinab wrote:
| (sarcasm)
|
| It's revenge against anyone with certain kinds of visual
| impairments and/or concentration issues, because the author's
| ex-spouse, who turned out to be a terrible person, had those.
|
| (sarcasm try 2)
|
| It's revenge against anyone using JS on the net, with the
| author subtly hinting that JS is bad.
|
| (realistic)
|
| It's probably one of:
|
| - the website is a static view of some collaborative tool
| which has that functionality built in by default
|
| - some form of well-intended but not-well-working
| functionality added to the site as some kind of school/study
| project; in that case I'm worried about the author suffering
| unexpectedly much higher costs due to it ending up on HN ...
| tonsky wrote:
| Hi, author here. In case you really want to know: no, it's
| custom-made and works exactly as intended. There are two main
| reasons:
|
| 1. Fun. Modern internet is boring, most blog posts are just
| black text on white background. Hard to remember where you
| read what. And you can't really have fun without breaking
| some expectations.
|
| 2. Sense of community. Internet is a lonely place, and I
| don't necessarily like that. I like the feeling of "right
| now, someone else reading the same thing as I do". It's human
| presence transferred over the network.
|
| I understand not everybody might like it. Some people just
| like when things are "normal" and everything is the same.
| Others might not like the feeling of human presence. For those,
| I'm not hiding my content, reader mode is one click away, I
| make sure it works very well.
|
| As for "unexpectedly ended up on HN", it's not at all
| unexpected. Practically every one of my general topic
| articles ends up here. It's so predictable I rely on HN to be
| my comment section.
| pests wrote:
| I like your content but I do think you need to rethink #1.
| Fun is useless if no one wants to show up because they are
| annoyed.
| acqq wrote:
| Count me too to the group of "I was so distracted that I
| stopped reading."
|
| Then the second thought was: I should again start to
| block js by default as much as I can.
| gpvos wrote:
| 2. I only understood that it was actual other people's
| mouse cursors when I read that here. So it didn't really
| engender a sense of community, although after some time I
| did think you were very good at modelling actual human mouse
| movements. Now that I know it, it's pretty neat though.
| wonger_ wrote:
| The author has several other writeups:
|
| https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu.
| ..
|
| The cursors will only be a problem during front page HN
| traffic. And the opt-out for people who care is reader mode /
| disable js / static mirror. Not sure if there's any better
| way to appease the fun-havers and the plain content
| preferrers at the same time. Maybe a "hide cursors" button on
| screen? I, for one, had a delightful moment poking other
| cursors.
| Luctct wrote:
| I don't know what you people are talking about. I'm just glad I
| always browse with Javascript turned off. If you didn't see the
| writing on the wall and permanently turn Javascript off around
| 2006, you have no right to complain about anything.
|
| Meanwhile, ironic irony is ironic: "Hey, idiots! Learn to use
| Unicode already! Usability and stuff! Oh, btw, here is some
| extremely annoying Javascript pollution on your screen because
| we are all still children, right? Har har! Pranks are so
| kewl!!!1!"
| zbtaylor1 wrote:
| Are you alright?
| chx wrote:
| "roll your own"
|
| Rather not. It takes an incredible amount of work to get it
| right. Just stick to ICU.
| rdtsc wrote:
| > The only modern language that gets it right is Swift:
| print("...".count) // => 1
|
| And Erlang/Elixir! I guess they are not "cool" enough. But they
| correctly interpret that as one grapheme cluster.
|     % erl +pc unicode
|     > string:length("...").
|     1
|
| (... here is the U+1F926 U+1F3FB U+200D U+2642 U+FE0F emoji)
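For contrast, Python counts code points rather than grapheme clusters; a small sketch using the same U+1F926 U+1F3FB U+200D U+2642 U+FE0F sequence the comment refers to (written with escapes since the emoji itself gets stripped here):

```python
# The "man facepalming: medium-light skin tone" emoji, built from its
# five code points: U+1F926 U+1F3FB U+200D U+2642 U+FE0F.
emoji = "\U0001F926\U0001F3FB\u200D\u2642\uFE0F"

# Python's len() counts code points, so it reports 5, not 1.
print(len(emoji))  # 5

# Counting it as a single extended grapheme cluster requires a
# UAX #29 segmentation library (e.g. the third-party `regex` module's
# \X pattern); the standard library has no equivalent.
```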
| davidham wrote:
| Is it just me, or is anyone else seeing what looks like the mouse
| pointer of everyone else reading the page, like 1,000 little ants
| on the screen?
| neonsunset wrote:
| Yes, reading the article is impossible with erratic movement on
| the screen.
| dekken_ wrote:
| not just you, this is what my other comment is about
| (indirectly)
| ilyt wrote:
| I see nice crisp black text on white background because
| apparently server melted down
| hot_gril wrote:
| I saw that, except half the images weren't loading, and there
| was just one mouse pointer.
| wirelesspotat wrote:
| Yep, the website opens a websocket connection[0] and sends the
| mouse position every 1 second
|
| [0] WS connection is on
| `wss://tonsky.me/pointers?id=XXXXXX&page=/blog/unicode/&platform=XXX`
| lbltavares wrote:
| It's fun, especially for folks like me who have ADHD. But
| there should be a button to disable it.
| hwillis wrote:
| Turned off JavaScript as soon as I saw it. Like trying to read
| with twenty mosquitoes in your face.
| sebstefan wrote:
| hey be nice to my mouse cursor
| 876978095789789 wrote:
| Yeah, it's extremely obnoxious.
| keb_ wrote:
| Anytime tonsky's site gets posted here, I'm reminded of how
| awful it is, which is ironic given his UI/UX background. The
| site's lightmode is a blinding saturated yellow, and if you
| switch into darkmode, it's an even less readable "cute"
| flashlight js trick. I don't know why he thought this was a
| good idea. Thank god for Firefox reader mode.
| coldpie wrote:
| Works like a normal website with JavaScript disabled. I
| didn't even know it did fancy junk until reading the comments
| here. NoScript saves the day again! I don't know how people
| can browse the web without it.
| gpvos wrote:
| It's been some time since I last used it, but I found that
| too many websites that I want to read require Javascript to
| even show the main body of text, or a reasonable layout. Is
| that different now?
| aembleton wrote:
| By using reader mode
| ericmcer wrote:
| I don't think he added moving cursors all over the page
| because he thought it was good UI/UX, he knows what he is
| doing.
| superq wrote:
| This is seemingly self-contradictory. Perhaps you could
| explain your reasoning further?
| gpvos wrote:
| Doing bad things is their idea of fun.
| ksoped wrote:
| You gotta know the rules to bend the rules
| Scarbutt wrote:
| It's called satire.
| chaorace wrote:
| It lets you hold hands with strangers
| 876978095789789 wrote:
| He appears (if his logos are anything to go by) to be a
| flat UI guy. I doubt any of these people know what they're
| doing.
| Arech wrote:
| I'd say this annoying trick is highly appropriate for the
| topic!
| spacechild1 wrote:
| It is obviously a joke (and a good one, I dare say). The fact
| that people seem to take it seriously says something about
| the contemporary state of webdesign :)
| mplewis wrote:
| It would be a better joke if there were an option to turn
| the joke off. As it is, dark mode doesn't exist and the
| pointers occlude text.
| spacechild1 wrote:
| > It would be a better joke if there were an option to
| turn the joke off.
|
| As others have pointed out, reader mode works as
| expected.
| LordDragonfang wrote:
| It's deeply ironic that an article about dealing with text
| properly has images _which are part of the article text_ and
| yet _have no alt-text_, rendering parts of the article
| unreadable in reader mode if the server is slow.
| lifeinthevoid wrote:
| yup, pretty annoying
| nigma1337 wrote:
| Distracted me from reading the article, I just started chasing
| other people around.
| zzzeek wrote:
| yeah....why on _earth_ would someone want their webpage to do
| this, especially if they have text they'd presumably want you
| to read?
| fragmede wrote:
| Have you ever read with other people, like in school or a
| book club, or been somewhere that there were other people
| around? It's an interesting move by the author; the
| loneliness epidemic hasn't gone unnoticed.
|
| eg https://www.npr.org/2023/05/02/1173418268/loneliness-
| connect...
| WD-42 wrote:
| It's cute, and provides a hint of human connection that is
| otherwise absent on the web: "hey, another human is reading
| this too!" Which you probably know, but something about seeing
| the pointer move makes it feel real.
|
| Probably not the greatest during a hacker news hug of death,
| but if I read that article some other time and saw one of the
| moving pointers, I would think it was really cool.
| pookha wrote:
| Good times. If you click on the sun switch the entire UI gets
| zeroed out and you get to use on:hover mouse shtick to read the
| UI through a fuzzy radius. Is Yoko Ono designing websites now?
| WD-42 wrote:
| It's a joke. It made me laugh.
| pests wrote:
| I know which site you are talking about before even clicking
| the article :(
| aragonite wrote:
| I've been drawing circles for over a minute now and no one has
| joined me yet, so I conclude those movements are random rather
| than made by intelligent beings. :)
| KyleBerezin wrote:
| That makes me think of this old gem
| https://imgur.com/gallery/BgKFcI9
| nottorp wrote:
| > Since everybody in the world agrees on which numbers correspond
| to which characters, and we all agree to use Unicode, we can read
| each other's texts.
|
| Hmm? I thought some code points combine to create a character.
| Even accented latin ones can be like that.
|
| Also we need to agree on what is a character.
| JohnFen wrote:
| > Also we need to agree on what is a character.
|
| Indeed. I used to think I knew what a character was until
| Unicode came around. Now I genuinely don't know with any real
| certainty.
| wyldfire wrote:
| > "I know, I'll use a library to do strlen()!" -- nobody, ever.
|
| The standard library provided by languages like C and C++ _is_
| a library. Features like character strings are present, and
| it's a totally reasonable expectation for the length to give
| you the cluster count.
| AnimalMuppet wrote:
| No, for C and C++, which are close to the hardware, it's
| totally reasonable to expect strlen() to give you the _byte_
| count. You don't allocate memory for buffers based on the
| cluster count.
|
| If you want cluster count, call a different function.
| macintux wrote:
| Given that strlen() predates Unicode by... 30 years(?), it's
| not terribly surprising that it isn't a viable approach.
| diego_sandoval wrote:
| Should "Extended Grapheme Cluster" be understood as Extended
| (Grapheme Cluster) or as (Extended Grapheme) Cluster?
| heldrida wrote:
| The mouse cursors are really annoying; I stopped reading for
| that reason.
| zackmorris wrote:
| _The only modern language that gets it right is Swift:_
|
| Apple did a fairly good job with unicode string handling starting
| in Cocoa and Objective-C, by providing methods to get the number
| of code points and/or bytes:
|
| https://stackoverflow.com/questions/15582267/cfstring-count-...
|
| I feel that this support of both character count and buffer size
| in bytes is probably the way to go. But Python 3 went wrong by
| trying to abstract it away with encodings that have unintuitive
| pitfalls that broke compatibility with Python 2:
|
| https://blog.feabhas.com/2019/02/python-3-unicode-and-byte-s...
|
| There's also the normalization issue. Apple goofed (IMHO) when
| they used NFD in HFS+ filenames while everyone else went with
| NFC, but fixed that in APFS:
|
| https://unicode.org/faq/normalization.html
|
| https://medium.com/@sthadewald/the-utf-8-hell-of-mac-osx-fee...
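The NFD-vs-NFC difference mentioned above is easy to see with Python's `unicodedata`; an illustrative sketch (not Apple's actual filesystem code), using "Ångström" as a sample filename:

```python
import unicodedata

name = "\u00C5ngstr\u00F6m"  # "Ångström", fully precomposed

# NFD decomposes each accented letter into base letter + combining
# mark (the form HFS+ stored filenames in); NFC recomposes them.
nfd = unicodedata.normalize("NFD", name)
nfc = unicodedata.normalize("NFC", nfd)

print(len(name))    # 8 code points precomposed
print(len(nfd))     # 10 code points after decomposition
print(nfc == name)  # True: NFC round-trips back to the composed form
```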
| beders wrote:
| Tonsky, dude.
|
| I stopped reading your article because of your little websocket
| experiment.
| justrealist wrote:
| I don't want to be too full of myself here, but I'm a very
| skilled and highly paid backend software engineer who knows
| roughly nothing about unicode (I google what I need when a file
| seems f'd up), and it's never been a problem for me.
|
| I'm sure the article is good but the title is nonsense.
| bigstrat2003 wrote:
| The title is definitely nonsense. The reality is that for most
| people, they will never need to know the gritty details of how
| to encode or decode UTF-8. The article _is_ interesting, but I
| was pretty put off with how the author led with such a
| hyperbolic (and untrue) claim.
| moelf wrote:
| > The only modern language that gets it right is Swift:
|
| arguably not true:
|
|     julia> using Unicode  # for some reason HN doesn't allow emoji
|
|     julia> graphemes(" ")
|     length-1 GraphemeIterator{String} for " "
|
|     help?> graphemes
|     search: graphemes
|
|       graphemes(s::AbstractString) -> GraphemeIterator
|
|       Return an iterator over substrings of s that correspond to
|       the extended graphemes in the string, as defined by Unicode
|       UAX #29. (Roughly, these are what users would perceive as
|       single characters, even though they may contain more than
|       one codepoint; for example a letter combined with an accent
|       mark is a single grapheme.)
| JRaspass wrote:
| Raku also gets it right.
| gwbas1c wrote:
| Julia is not a major language like Swift.
| SyrupThinker wrote:
| I imagine the author would disagree with that because it does
| not have the "right" behavior by default.
|
| For example, indexing and length of the string are done by
| code unit. [1]
|
| On the other hand, Raku's Str type does behave similarly to
| Swift's: indexing, length and iteration by grapheme; view
| methods for specific encodings. [2]
|
| [1]: https://docs.julialang.org/en/v1/base/strings/ [2]:
| https://docs.raku.org/type/Str#routine_chars
| dathinab wrote:
| > The only modern language that gets it right is Swift:
|
| I disagree.
|
| What the "right" thing is depends on the use case.
|
| For UI it's glyph-based, kinda; more precisely, some
| good-enough abstraction over render width. Glyphs are not
| always good enough for that, but they're the best you can get
| without adding a ton of complexity.
|
| But for pretty much every other use case you want storage byte
| size.
|
| I mean, in the UI you care about the length of a string
| because there is limited width to render a string in.
|
| But everywhere else you care about it because of (memory)
| resource limitations and costs in various ways. Whether that
| is bandwidth cost, storage cost, number of network packets,
| efficient index-ability, etc. In rare cases being able to
| type it, but then it's often US-ASCII only, too.
| hot_gril wrote:
| Swift made an effort to handle grapheme clusters but severely
| over-complicated strings by exposing performance details to
| users. Look at the complex SO answers to what should be simple
| questions, like finding a substring:
| https://news.ycombinator.com/item?id=32325511 , many of which
| changed several times between Swift versions
|
| I was working on an app in Swift that needed full emoji support
| once. Team ended up writing our own string lib that stores
| things as an array of single-character Swift strings.
| marcellus23 wrote:
| > many of which changed several times between Swift versions
|
| This was true while Swift was developing but it's been stable
| now for several years. At some point that complaint is no
| longer valid.
| hot_gril wrote:
| You still see all the answers from old versions sitting
| around, often at the top. Part of it is because of how
| often they changed such fundamental things. String length
| changed 3 times. Every other language figured these things
| out before the initial non-beta release.
| marcellus23 wrote:
| The last time the string API changed was in 2017. That
| was 6 years ago.
| hot_gril wrote:
| Also, realized "needed full emoji support" sounds silly. It
| needed to do a lot of string manipulation, with extended
| grapheme clusters in mind, mainly for the purpose of emojis.
| layer8 wrote:
| Arguably, you don't need any (default) length at all, just
| different views or iterators. When designing a string type
| today, I wouldn't add any single distinguished length method.
| galad87 wrote:
| Swift string type has got many different views, like UTF-8,
| UTF-16, Unicode Scalar, etc... so if you want to count the
| bytes or cut over a specific byte you still can.
| dathinab wrote:
| that's not the issue
|
| defaults matter
|
| as in, they should be things you can just use by default
| without thinking about it
|
| as Swift is deeply rooted in UI design, having a default of
| glyphs makes sense
|
| and as Rust is deeply rooted in Unix server and systems
| programming, UTF-8 bytes make a lot of sense
|
| though the moment your language becomes more general-purpose,
| you could argue that having any default is wrong and that it
| should have multiple more explicit methods
| toast0 wrote:
| > as in, they should be things you can just use by default
| > without thinking about it
|
| That time has passed. If you want to know the length of a
| string, you really should indicate what length type you
| mean.
| hot_gril wrote:
| There was no string.length in Swift for a while. Then
| they added one that just does what the user expects: getting
| the number of grapheme clusters. If a user figures out
| that this isn't what they want, they can go use the other
| length method.
| patrickas wrote:
| That is why I like the way Raku handles it.
|
| It has distinct .chars, .codes and .bytes that you can use
| depending on the use case. And if you try to use .length it
| complains, asking you to use one of the other options to
| clarify your intent.
|
|     my \emoji = "\c[FACE PALM]\c[EMOJI MODIFIER FITZPATRICK TYPE-3]\c[ZERO WIDTH JOINER]\c[MALE SIGN]\c[VARIATION SELECTOR-16]";
|     say emoji;                       # Will print the character
|     say emoji.chars;                 # 1 because one character
|     say emoji.codes;                 # 5 because five code points
|     say emoji.encode('UTF8').bytes;  # 17 because encoded utf8
|     say emoji.encode('UTF16').bytes; # 14 because encoded utf16
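The same byte counts fall out of Python's encoders (a sketch for comparison; the emoji is the five-code-point facepalm sequence from the Raku example, written with escapes):

```python
# Same five code points as the Raku example above:
# U+1F926 U+1F3FB U+200D U+2642 U+FE0F.
emoji = "\U0001F926\U0001F3FB\u200D\u2642\uFE0F"

print(len(emoji))                      # 5 code points
print(len(emoji.encode("utf-8")))      # 17 bytes in UTF-8
print(len(emoji.encode("utf-16-le")))  # 14 bytes in UTF-16 (no BOM)
```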
| aembleton wrote:
| In Java/Kotlin, I've found this Grapheme Splitter library to be
| useful: https://github.com/hiking93/grapheme-splitter-lite
| phforms wrote:
| Regarding UTF-8 encoding:
|
| "And a couple of important consequences:
|
| - You CAN'T determine the length of the string by counting bytes.
|
| - You CAN'T randomly jump into the middle of the string and start
| reading.
|
| - You CAN'T get a substring by cutting at arbitrary byte offsets.
| You might cut off part of the character."
|
| One of the things I had to get used to when learning the
| programming language Janet is that strings are just plain byte
| sequences, unaware of any encoding. So when I call `length` on
| a string of one character that is represented by 2 bytes in
| UTF-8 (e.g. `ä`), the function returns 2 instead of 1. Similar
| issues occur when trying to take a substring, as mentioned by
| the author.
|
| As much as I love the approach Janet took here (it feels clean
| and simple and works well with their built-in PEGs), it is a bit
| annoying to work with outside of the ASCII range. Fortunately,
| there are libraries that can deal with this issue (e.g.
| https://github.com/andrewchambers/janet-utf8), but I wish they
| would support conversion to/from UTF-8 out of the box, since I
| generally like Janet very much.
|
| One interesting thing I learned from the article is that the
| byte length of a character can always be determined from the
| first byte's prefix. I always wondered how you would
| recognize/separate a Unicode character in a Janet string,
| since it may be 1-4 bytes long, but I guess this is the
| answer.
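The prefix trick mentioned above can be sketched in Python: the high bits of a sequence's first byte encode its total length, so you can walk a raw byte string character by character (illustrative only; a real decoder must also validate the continuation bytes):

```python
def utf8_seq_len(first_byte: int) -> int:
    """Length of a UTF-8 sequence, read off the first byte's prefix."""
    if first_byte >> 7 == 0b0:      # 0xxxxxxx: ASCII, 1 byte
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx: 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx: 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx: 4-byte sequence
        return 4
    raise ValueError("continuation or invalid byte")

# "a", "é", "€", and a 4-byte emoji: 1-, 2-, 3- and 4-byte characters.
data = "a\u00E9\u20AC\U0001F642".encode("utf-8")
i, lengths = 0, []
while i < len(data):
    n = utf8_seq_len(data[i])
    lengths.append(n)
    i += n
print(lengths)  # [1, 2, 3, 4]
```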
| thefringthing wrote:
| > Unicode is a standard that aims to unify all human languages,
| both past and present, and make them work with computers.
|
| This is doubly wrong.
|
| First, it conflates languages and writing systems. Malay and
| English use the same writing system but are different languages.
| American Sign Language is a language, but it has no standard or
| widely-adopted writing system. Hakka is a language, but Hakka
| speakers normally write in Modern Standard Mandarin, a different
| language.
|
| Second, it's not the case that Unicode aims to encode all
| writing systems. For example, there are many hobbyist neographies
| (constructed writing systems) which will not be included in
| Unicode.
| bagasme wrote:
| The article doesn't mention how to resolve string manipulation
| problems involving locales.
| overflyer wrote:
| [flagged]
| [deleted]
| badcppdev wrote:
| Just a nitpick because the page says: "Unicode is a standard that
| aims to unify all human languages, both past and present, and
| make them work with computers." but of course unicode is only
| relevant to written languages as opposed to spoken languages (and
| signed languages)
|
| I wish that was the only thing wrong with that page
| coding123 wrote:
| Honestly, "what encoding is this? UTF-8" is still the only
| thing we need to know. len(emoji) is still a corner case that
| few will care about.
| mcfedr wrote:
| That's what everyone thinks, until the user sticks an emoji in
| the name field
| kipcole9 wrote:
| > The only modern language that gets it right is Swift:
|
| Elixir too:
|
|     Interactive Elixir (1.15.4) - press Ctrl+C to exit (type h() ENTER for help)
|     iex(1)> String.length "w[?][?][?]or[?][?][?]d[?]"
|     4
| rkagerer wrote:
| " _the definition of graphemes changes from version to version_ "
|
| In what twisted reality did someone think this was a good idea?
|
| Doesn't it go against the whole premise of everyone in the world
| agreeing on how to represent a meaningful unit of text?
|
| " _What's sad for us is that the rules defining grapheme clusters
| change every year as well. What is considered a sequence of two
| or three separate code points today might become a grapheme
| cluster tomorrow! There's no way to know! Or prepare!_ "
|
| " _Even worse, different versions of your own app might be
| running on different Unicode standards and report different
| string lengths!_ "
| rkagerer wrote:
| I can sympathize why some programmers would prefer to stick
| their heads in the sand and stick to ASCII.
| everyone wrote:
| I'm always gonna point out these overly broad titles assuming
| "every software developer" is some kind of internetty web dev
| type. I'm a game dev, I try and never touch strings at all, they
| are a nightmare data type. Strings in a game are like graphics or
| audio assets, your game might read them and show them to player,
| but they should never come anywhere near your code or even be
| manipulated by it. I dont need to know any of that stuff about
| Unicode.
| dekken_ wrote:
| Am I supposed to hate this website, cause I kinda do
| melx wrote:
| Try the night mode (top right corner)...
|
| It's black text on black background (I'm on mobile Firefox on
| Android).
| Lewton wrote:
| Night mode is an absolute delight on desktop, you're missing
| out
| jamincan wrote:
| On desktop your mouse pointer is a flashlight. I wonder if it
| supports touch.
| leokennis wrote:
| Toggle the dark mode for a real treat.
| sebstefan wrote:
| Now that is really funny.
|
| Future _improvement_ idea: the mouse cursors are shared, so
| the light switch should be, too! Let me play with the light
| with everyone
| cormullion wrote:
| It's not pleasant to read. Strange, since Tonsky is the curator
| of the Fira Code font, and would presumably be interested in
| presentation.
| MrResearcher wrote:
| uBlock Origin -> Disable Javascript
|
| Problem solved!
| eptcyka wrote:
| Firefox reader mode is better still.
| hwillis wrote:
| That breaks the video. inspect -> network -> refresh ->
| blocking the request for pointers.js works.
| joveian wrote:
| It doesn't on Firefox, you get the built in media controls.
| bqmjjx0kac wrote:
| The mustard background with black text is harsh on the eyes.
| permo-w wrote:
| strange. I quite like it
| Nevermark wrote:
| Me too. I get the impression of a very saturated off-white
| yellow.
|
| But any more saturation and it would go all mustard-
| electric on me.
|
| That's an interesting observation on variation of
| saturation response. Feels like useful knowledge for ...
| web site designers. Or any color crafter.
| anymouse123456 wrote:
| FWIW: right-click, Inspect. There's a div with a "pointers"
| attribute in the body root.
|
| Deleting that makes the whole thing a lot less stressful.
| jordanrobinson wrote:
| Anyone know what the story is behind the "Weird Emoji" around
| 140000 on the map?
| Findecanor wrote:
| The E0000-E007F block is the "Tags" block, which is used for
| flag emojis.
|
| But there is not a code for each flag. Instead there is a code
| for each ASCII character. A flag sequence is formed from
| U+1F3F4 (Black Flag), followed by at least two tags that form a
| country/region code, and then U+E007F (End tag).
|
| So, yes this is weird, because the emoji is dependent on the
| decoder. It was made this way to keep Unicode independent of
| geopolitics.
|
| Read more: <https://en.wikipedia.org/wiki/Tags_(Unicode_block)>
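The tag-sequence mechanism described above can be sketched in Python; here the sequence for the flag of Scotland, whose region code is "gbsct" (an illustrative helper, not a library API):

```python
# A tag-sequence flag: U+1F3F4 BLACK FLAG, then the region code spelled
# in tag characters (U+E0000 + the ASCII value of each letter), then
# U+E007F CANCEL TAG to end the sequence.
BLACK_FLAG = "\U0001F3F4"
CANCEL_TAG = "\U000E007F"

def flag_for(region: str) -> str:
    """Build a tag-sequence flag emoji for a region code like 'gbsct'."""
    tags = "".join(chr(0xE0000 + ord(c)) for c in region.lower())
    return BLACK_FLAG + tags + CANCEL_TAG

scotland = flag_for("gbsct")
print(len(scotland))  # 7 code points for what renders as one flag
print([hex(ord(c)) for c in scotland])
```

Whether it renders as the Scottish flag or a plain black flag is, as the comment says, up to the decoder.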
| m3kw9 wrote:
| Unicode looks like a big over engineered standard that had 50
| hands trying to put their mark in
| ebiester wrote:
| It looks like that because Unicode is trying to solve a problem
| that everyone thinks is easy until they uncover the true extent
| of encoding human languages.
| eviks wrote:
| How does this explain surrogate pairs?
| jfultz wrote:
| Surrogate pairs were new to Unicode 2.0. Unicode 1.0 didn't
| anticipate the need for more than 65,536 code points (who
| would ever need more?); the main perceived threat to that
| limit having been resolved by Han unification.
| eviks wrote:
| Ok, but that doesn't answer the question; it's more of an
| indication that those designers didn't uncover "the
| true extent" until years later
| hot_gril wrote:
| This is a lot more than the minimum that _every_ software dev
| must know about Unicode. Even if you only do web frontends, you
| will do fine not knowing most of this. Still a nice read, though.
| permo-w wrote:
| >That gives us a space of about 11 million code points. About
| 170,000, or 15%, are currently defined. An additional 11% are
| reserved for private use. The rest, about 800,000 code points,
| are not allocated at the moment. They could become characters in
| the future.
|
| 1.1 million?
| run414 wrote:
| Yeah, the author's numbers are off by a "0". It should be
| "1,700,000" and "8,000,000".
| hyggetrold wrote:
| Is there a way to read this with the mouse cursors disabled? It
| seems like great content but all the movement on the page is way
| too distracting.
|
| EDIT: I've never been downvoted for asking a question before.
| Weird, but okay.
| WillAdams wrote:
| Just had this come up at work --- needed a checkbox in Microsoft
| Word --- oddly the solution to entering it was to use the numeric
| keypad, hold down the alt key and then type out 128504 which
| yielded a check mark when the Arial font was selected _and_
| unlike Insert Symbol and other techniques didn't change the font
| to Segoe UI Symbol or some other font with that symbol.
|
| Oddly, even though the Word UI indicated it was Arial, exporting
| to a PDF and inspecting that revealed that Segoe UI Symbol was
| being used.
|
| As I've noted in the past, "If typography was easy, Microsoft
| Word wouldn't be the foetid mess which it is."
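For reference, the Windows alt-code entered above is just the decimal form of the Unicode code point (a quick sketch; 128504 is U+1F5F8, a check-mark character):

```javascript
// The Word alt-code 128504 in hexadecimal is the code point U+1F5F8.
console.log((128504).toString(16).toUpperCase()); // "1F5F8"
console.log(String.fromCodePoint(128504));        // the check mark glyph
```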
| uxp8u61q wrote:
| That's unrelated to unicode. The checkmark symbol just isn't in
| the Arial font, so Word just falls back to a font that has it -
| Segoe UI. You've found a bug where Word still thinks it's
| Arial. But this is something that would have happened no
| matter what encoding you choose for your characters.
| Tomte wrote:
| > They will look the same (Å vs Å)
|
| No. In my browser the first Å has the ring glued to it, and the
| second has a little gap.
| [deleted]
| layer8 wrote:
| https://archive.ph/LtKk0
| neonate wrote:
| http://web.archive.org/web/20231002163213/https://tonsky.me/...
| makeworld wrote:
| Really great article. Hitting all the points I would expect.
| bumbledraven wrote:
| > what do you think "w[?][?][?]or[?][?][?]d[?]".length should be?
|
| This is a nice example of the kind of thing we need to think
| about when defining a measure of length for Unicode strings.
| danbruc wrote:
| Four. Obviously.
|
| The more interesting question is whether the Unicode rules
| actually give that answer.
|
| EDIT: Just checked it using the first online tool [1] that came
| up and it indeed says four. So all is good.
|
| [1] https://onlinetools.com/unicode/extract-unicode-graphemes
| masklinn wrote:
| It should be 4 as long as you count the grapheme clusters
| which is what e.g. Swift does (hence String#count being
| O(n)).
|
| In Javascript, you can get the same information through
| Intl.Segmenter, segments by grapheme cluster by default.
| danbruc wrote:
| You could also have it in O(1), just store and maintain it
| as you usually store the length in bytes or code units. If
| you had all your string operations like substring work with
| grapheme clusters by default, which might arguably make
| sense quite often, then that could actually be a good
| decision. It might even make sense to maintain a list with
| pointers to each grapheme cluster or of all the grapheme
| cluster lengths together with the actual string data. Or
| maybe not, would probably depend heavily on the workload.
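A minimal sketch of that caching idea (the class and its names are mine, not any standard library): segment once up front, and length plus per-grapheme indexing become O(1) afterwards:

```javascript
// Hypothetical wrapper that segments the string into grapheme clusters
// once, then serves length / indexing / slicing from the cached list.
class GraphemeString {
  constructor(str) {
    const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    this.graphemes = [...seg.segment(str)].map(s => s.segment); // one O(n) pass
  }
  get length() { return this.graphemes.length; }          // O(1)
  at(i) { return this.graphemes[i]; }                     // O(1)
  slice(start, end) { return this.graphemes.slice(start, end).join(""); }
}

const s = new GraphemeString("Cafe\u0301!"); // decomposed é
console.log(s.length); // 5: C, a, f, e+accent, !
console.log(s.at(3));  // "e\u0301" — one cluster, renders as "é"
```

The obvious trade-off, as noted above, is the extra memory and the up-front pass, which may or may not pay off depending on the workload.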
| Karellen wrote:
| > The simplest possible encoding for Unicode is UTF-32. It simply
| stores code points as 32-bit integers.
|
| Skipping over UTF-32-BE and UTF-32-LE there...
|
| (I mean, it might not be an issue if it's just being used as an
| internal representation, but still)
| gh0stcloud wrote:
| the article's background color deserves to be named:
| https://colornames.org/color/fddb29
| charcircuit wrote:
| The number of grapheme clusters in a string depends on the font
| being used. The length of a string should be the number of code
| points because that is not font specific.
|
| Better yet, there shouldn't be a function called length.
| qwerty456127 wrote:
| > People are not limited to a single locale. For example, I can
| read and write English (USA), English (UK), German, and Russian.
| Which locale should I set my computer to?
|
| Ideally - the "English-World" locale is supposedly meant for us,
| cosmopolitans. It's included with Windows 10 and 11.
|
| Practically, as "English-World" was not available in the past
| (and still wasn't available on platforms other than Windows the
| last time I checked), I have always been setting the locale to
| En-US even though I have never been to America. This leads to a
| number of annoyances though. E.g. LibreOffice always creates new
| documents for the Letter paper format and I have to switch it to
| A4 manually every time. It's even worse on Linux where locales
| appear to be less easy to customize than in Windows. Windows
| always offered a handy configuration dialog to granularly tweak
| your locale choosing what measures system you prefer, whether
| your weeks begin on sundays or mondays and even define your
| preferred date-time format templates fully manually.
|
| A less-spoken about problem is Windows' system-wide setting for
| the default legacy codepage. I happen to use single-language
| legacy (non-Unicode) apps made by people from a number of very
| different countries. Some apps (e.g. I can remember the Intel UHD
| Windows driver config app) even use this setting (ignoring the
| system locale and system UI language) to detect your language and
| render their whole UI in it.
|
| > English (USA), English (UK)
|
| This deserves a separate discussion. I doubt many English
| speakers (let alone those who don't live in a particular
| anglophone country) care to distinguish between English dialects.
| To us, the presence of a huge number of these (don't forget
| en-AU, en-TT, en-ZW etc - there are more!) in the options lists brings only
| annoyance, especially when one chooses some non-US one and this
| opens another can of worms.
|
| By the way I wonder how string capitalization and comparison
| functions manage to work on computers of people who use both
| English and Turkish actively (Turkish locale distinguishes
| between dotted and undotted I).
| __d wrote:
| I write daily in US English, Australian English, and Austrian
| German. Most of the time, a specific document is in one
| dialect/language or another: not mixed, although sometimes
| that's not true.
|
| I can understand that the conflation of spelling, word choices,
| time and date formatting, default paper sizes, measurement
| units, etc, etc, is convenient, and works a lot of the time,
| but it really doesn't work for me at all.
|
| That said, I appreciate that I occupy a very small niche.
| masklinn wrote:
| > English (USA), English (UK)
|
| > This deserves a separate discussion. I doubt many English
| speakers (let alone those who don't live in a particular
| anglophone country) care to distinguish between English
| dialects.
|
| While that is generally (though not always) true, I would
| assume it's really a stand in for the much more relevant zh
| locales.
|
| It is also rather relevant to es locales (america spanish has
| diverged quite a bit from europe spanish hence the creation of
| es-419), definitely french (canadian french, to a lesser extent
| belgian and swiss), and german (because swiss german). And it
| might be relevant for ko if north korea ever stops being what
| it is.
| [deleted]
| dizhn wrote:
| i İ
|
| ı I
|
| I sympathize with people who get this wrong. (I just saw some
| YouTube video have a title TURKIYE in a segment)
|
| Even google keyboard can't seem to distinguish between I and İ.
| When I type "It", it suggests "It's", which is quite pathetic.
| uxp8u61q wrote:
| > I have always been setting the locale to En-US even though I
| have never been to America. This leads to a number of
| annoyances though. E.g. LibreOffice always creates new
| documents for the Letter paper format and I have to switch it
| to A4 manually every time
|
| > I doubt many English speakers (let alone those who don't live
| in a particular anglophone country) care to distinguish between
| English dialects. To us presence of a huge number of these
| (don't forget en-AU, en-TT, en-ZW etc - there are more!) in the
| options lists brings only annoyance, especially when one
| chooses some non-US one and this opens another can of worms.
|
| Well, you just explained what this plethora of options is
| about. It's not just about how you spell flavor/flavour. It's a
| lot of different defaults for how you expect your OS to present
| information to you. Default paper size, but also how to write
| date and time, does the week start on Monday, Sunday, or
| something else, etc.
| hahn-kev wrote:
| As much as I appreciate that, I always wondered how many
| programs actually respect all those tweaks.
| DoughnutHole wrote:
| > I doubt many English speakers care to distinguish between
| English dialects
|
| It's worthwhile purely for the sake of autocorrect/typo
| highlighting in text-editing software. I don't miss the days of
| spelling a word correctly in my version of English but still
| being stuck with the visual noise of red highlighting up and
| down the document because it doesn't conform to US English.
| BoxOfRain wrote:
| Yeah, ideally I'd rather not have my British English dialect
| treated as second-class in a world of American English, which
| is what having a red document full of 'errors' implies in
| those sorts of situations.
|
| It's sometimes not a trivial distinction either, for example
| I've heard of cases where surprised British redditors have
| found themselves banned from American subreddits for being
| homophobic when they were actually talking innocently enough
| about cigarettes!
| OfSanguineFire wrote:
| I would think a lot of mods, who are either Highly Online
| Americans or their weirdo equivalents in other countries,
| are well aware of the UK usage, but simply expect Brits to
| give it up in order to avoid offending Americans and the
| global Reddit community that largely takes American-style
| sensitivity as its orthodoxy. And considering that Reddit
| corporate feels that anything that could stir up such
| outrage is bad for business, mods of popular subreddits may
| well feel pressured to come down hard on these matters.
| bluGill wrote:
| It doesn't matter if you use UK or US spelling, you are
| wrong. I wish we would adopt the International Phonetic
| Alphabet, so I might have a chance of spelling things
| correctly.
| lucideer wrote:
| As an Irish person, while we have en_IE which is great (and
| solves most of the problems you list re: Euro-centric defaults
| + English), I'd still quite like to have an even more broad /
| trans-language / "cosmopolitan" locale to use.
|
| I mainly type in English but occasionally other languages - I
| use a combination of Mac & Linux - macOS has an (off-by-default
| but enable-able) lang-changer icon in the tray that is handy
| enough, but still annoying to have to toggle. Linux is much
| worse.
|
| Mac also has quite a nice long-press-to-select-special
| character that at least makes for accessible (if not efficient)
| typing in multiple languages while using an English locale.
| Mobile keyboards pioneered this (& Android's current one even
| does simultaneous multi-lang autocomplete, though it severely
| hurts accuracy).
|
| ---
|
| > _I doubt many English speakers care to distinguish between
| English dialects._
|
| I think you'll find the opposite to be true. US English
| spellings & conventions are quite a departure from other
| dialects, so typing fluidly & naturally in any non-US dialect
| is going to net you a world of autocorrect pain in en_US. To
| the extent it renders many potentially essential spelling &
| grammar checkers completely unusable.
| jdblair wrote:
| I can 2nd this as an American who now resides in Europe. My
| first laptop I brought with me, and was defaulted to en_US,
| but my replacement is en_GB (Apple doesn't have en_NL, for
| good reason).
|
| I don't find it "unusable", though. I could change it back to
| en_US, but it has actually been interesting to see all of my
| American spellings flagged by autocorrect. Each time I write
| authorize instead of authorise it is an act of stubborn group
| affinity!
| TRiG_Ireland wrote:
| > US English spellings & conventions are quite a departure
| from other dialects.
|
| As far as the written, formal language is concerned, English
| really has only three dialects: US American, Canadian, and
| everywhere else. There are some other subtle differences
| (such as "robots" for traffic lights in South Africa, or
| "minerals" for fizzy drinks in Ireland1), but that's pretty
| much it.
|
| 1 Yes, this isn't just slang in Ireland: the formal, pre-
| recorded announcements on trains use it: "A trolley service
| will operate to your seat, serving tea, coffee, minerals and
| snacks." The corresponding Irish announcement renders it
| mianrai. Food service on trains stopped during covid and has
| not yet resumed, so I'm working from distant memory now.
| lucideer wrote:
| > _As far as the written, formal language is concerned,
| English really has only three dialects_
|
| This is true, but I don't see why the "formal" qualifier is
| needed here :) There are much more than 3 dialects of
| English, both written & spoken.
|
| Especially there's a fair few extremely common notable
| differences in (casual, written) Irish English: the word
| "amn't" (among other less common contractions), the
| alternative present tense of the verb "to be" (i.e. "do
| be"), various regional plurals of "you", and - perhaps the
| most common - prepositional pronouns, etc. etc.
| TRiG_Ireland wrote:
| Well, quite. If we include any one or more of the
| following three categories -- formal spoken language,
| informal spoken language, informal written language --
| then there's definitely far more than three dialects of
| English. But formal spoken language really has only the
| three.
| phantom784 wrote:
| I guess it's a question as to how many varieties of
| spelling you want to make available as "translations" in
| software (e.g. color vs colour, tire vs tyre).
|
| There's plenty of regional variants just within the US,
| but "en_us" covers the whole country.
| l72 wrote:
| I write in multiple languages daily on Linux, including
| English, Russian, and Chinese. Switching keyboards (at least
| with gnome) is a simple Super+Space.
|
| While in my default (English) layout, it is easy enough to
| add accents and other characters using the compose key (right
| alt). So right-alt+'+a = á or right-alt+"+u = ü. I much
| prefer this over the long press as I can do it quickly and
| seamlessly without having to wait on feedback. Granted, it is
| not as discoverable, but once you are comfortable, it in my
| opinion is a better system.
| notatoad wrote:
| > I doubt many English speakers care to distinguish between
| English dialects
|
| I think you'd be surprised how many English (UK) people will
| get pissed off when their spell-checker starts removing the "u"
| from colour or flavour, or how many English (US) people get
| pissed off when the spellchecker starts suggesting random "u"s
| to words.
|
| additionally to that, locale isn't just about language. English
| (US) and English (UK) decides whether your dates get formatted
| DD-MM-YY or MM-DD-YY, whether your numbers have the thousands
| broken by commas or spaces, and a host of other localization
| considerations with a lot more significance than just the
| dialect of english.
| TRiG_Ireland wrote:
| I'd really like an en-GB-oxendict (British English but
| favouring -ize over -ise) locale for formal writing.
| aksss wrote:
| I worked for BP for a while (well, as a contracted coder) and
| I got quite used to the UK spell check correcting everything
| to its idiom. Everything seemed wrong once I returned to a world
| that dismissed the value of the letter 'U' and preferred the
| letter 'Z' over 'S'. Also missed the normalizing of drinking
| beer at lunch.
| grotorea wrote:
| > Practically, as "English-World" was not available in the past
| (and still wasn't available on platforms other than Windows the
| last time I checked), I have always been setting the locale to
| En-US even though I have never been to America. This leads to a
| number of annoyances though. E.g. LibreOffice always creates
| new documents for the Letter paper format and I have to switch
| it to A4 manually every time. It's even worse on Linux where
| locales appear to be less easy to customize than in Windows.
| Windows always offered a handy configuration dialog to
| granularly tweak your locale choosing what measures system you
| prefer, whether your weeks begin on sundays or mondays and even
| define your preferred date-time format templates fully
| manually.
|
| There's the English (Denmark) locale for that on some platforms.
| qwerty456127 wrote:
| Thank you very much, I'll give it a try.
| grotorea wrote:
| It's a bit of a joke that doesn't have universal support.
| Works on my phone. Apparently you can also try en_IE
| (Ireland).
|
| https://unix.stackexchange.com/questions/62316/why-is-
| there-...
| actualwitch wrote:
| en-GB is also a good choice
| carstenhag wrote:
| Not really, no EUR and no metric units
| loeg wrote:
| Pretty clearly, "every software developer" doesn't need to
| understand Unicode with this level of familiarity, much like
| "every programmer" doesn't need to know the full contents of the
| 114 page Drepper paper. For example, I work on a GUID-addressed
| object store. Everything is in terms of bytes and 128-bit UUIDs.
| Unicode is irrelevant to everyone on my team, and most adjacent
| teams. There is lots of software like this.
| [deleted]
| JonChesterfield wrote:
| Prior to this article, I knew graphemes were a thing and that
| proper unicode software is supposed to count those instead of
| bytes or code points.
|
| I didn't know that unicode changes the definition of grapheme in
| backwards incompatible fashion annually, so software which works
| by grapheme count is probably inconsistent with other software
| using a different version of the standard anyway.
|
| I'm therefore going to continue counting bytes. And comparing by
| memcmp. If the bytes look like unicode to some reader, fine.
| Opaque string as far as my software is concerned.
| tonsky wrote:
| Good luck
| https://mastodon.online/@alexeyten@mas.to/111166351426290784
| dundarious wrote:
| The point is that a byte focus will often frustrate users.
|
| e.g., a TUI with columns will have to truncate "long" strings
| in each column, and that truncation and column-separator
| arrangement really should be grapheme aware.
|
| e.g., a string search (for a name, let's say) should find Noël
| regardless of whether the user input the ë via combining
| characters or the pre-composed version.
| slimsag wrote:
| Two Unicode strings can be visually and semantically identical,
| but not byte-equal.
| zzzeek wrote:
| I wondered about how to do simple text centering / spacing
| justification given graphemes showing string lengths that don't
| match up with human-perceived characters, like in 'Café' with a
| combining accent (python len('Café') returns 5, even though we
| see four letters).
|
| Found this! good to know about.
| https://pypi.org/project/grapheme/ "A Python package for working
| with user perceived characters. "
|
| (apparently the article talks about this however the blog post is
| largely unreadable due to dozens of animated arrow pointers
| jumping all over the screen)
| cryptonector wrote:
| > Another unfortunate example of locale dependence is the Unicode
| handling of dotless i in the Turkish language.
|
| This isn't quite Unicode's fault, as the alternative would be to
| have two codepoints each for `i` and `I`, one pair for the Latin
| versions and one for the Turkish versions, and that would be very
| annoying too.
|
| Whereas the Russian/Bulgarian situation is different. There used
| to be language tags in Unicode for that, but IIRC they got
| deprecated, and maybe they'll have to get undeprecated.
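JavaScript exposes this locale dependence directly (a sketch; it requires an ICU-enabled runtime, which modern Node.js ships by default):

```javascript
// Root-locale casing vs Turkish casing of the letter I.
console.log("I".toLowerCase());           // "i" — dotted, the default everywhere
console.log("I".toLocaleLowerCase("tr")); // "ı" — U+0131 LATIN SMALL LETTER DOTLESS I
console.log("i".toLocaleUpperCase("tr")); // "İ" — U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
```

So a case-insensitive comparison of "MAIL" and "mail" gives a different answer under a Turkish locale than under any other, which is exactly the trap described above.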
| pif wrote:
| > The minimum every software developer must know about Unicode
|
| Just a nitpick...
|
| Once more, as it is typical on HN, web programming is confused
| with the entire universe of software development.
|
| There are plenty of software realms where ASCII not only is
| enough, but it actually MUST be enough.
| lxgr wrote:
| What do you mean by "must be enough"?
|
| Not being able to support non-latin scripts sounds more like a
| limitation than a feature to me, although of course in many
| contexts it's not in any individual organizations power to
| overcome it.
| 9dev wrote:
| Well, proper Unicode support affects pretty much any area
| handling data about, used by, or created by, humans. That's a
| pretty broad scope, and certainly wider than just web software.
| uxp8u61q wrote:
| This kind of assertiveness leads to garbage like C++ still not
| supporting UTF8 properly in 2023. My name contains diacritics.
| I am so, so, _so_ tired of trying to work around information
| systems - not just web frontends - designed by people who
| don't care or worse, don't want to care.
|
| "Web" programmers can care all they want about Unicode, but if
| the backend people didn't deal properly with text encoding,
| then something will break no matter what.
|
| > There are plenty of software realms where ASCII not only is
| enough, but it actually MUST be enough.
|
| Name one.
| lelanthran wrote:
| > This kind of assertiveness leads to garbage like C++ still
| not supporting UTF8 properly in 2023. My name contains
| diacritics.
|
| UTF8 encoded diacritics work just fine in C++.
| uxp8u61q wrote:
| What do you mean by "work"? That you can store arbitrary
| bytes in a string? That's a pretty low bar.
| lelanthran wrote:
| > What do you mean by "work"? That you can store
| arbitrary bytes in a string? That's a pretty low bar.
|
| That's all that's needed for a backend language.
|
| The backend does not need to understand, or even
| acknowledge the existence, of grapheme clusters. Because
| the frontend is already having to understand all of this,
| it should be normalising any multi-codepoint ambiguous
| cluster anyway.
| JohnFen wrote:
| The backend never needs to do things like figure out how
| long a string is or search for one string in a database
| of other strings?
| lelanthran wrote:
| > The backend never needs to do things like figure out
| how long a string
|
| Not as measured by clusters, no.
|
| > search for one string in a database of other strings?
|
| Hence I said "normalisation". The frontend already has to
| do all the unicode twiddling, it may as well normalise
| the input too.
| astrange wrote:
| It does if it ever wants to trim, summarize, sort or
| compare equal a string.
| pif wrote:
| > if the backend people didn't deal properly
|
| You are right. It's not a frontend/backend issue. It's a "for
| human" vs "not for human" issues. Personal names must be
| treated in an international-friendly manner.
|
| >> There are plenty of software realms where ASCII not only
| >> is enough, but it actually MUST be enough.
|
| > Name one
|
| Joel himself described an example:
|
| > It would be convenient if you could put the Content-Type of
| the HTML file right in the HTML file itself, using some kind
| of special tag. Of course this drove purists crazy... how can
| you read the HTML file until you know what encoding it's in?!
| Luckily, almost every encoding in common use does the same
| thing with characters between 32 and 127, so you can always
| get this far on the HTML page without starting to use funny
| letters:
|
| The content of a webpage may need to be expressed in any
| supported language, but the HTTP protocol need not be. And it
| would make no sense at all to add internationalization to
| intra-machines protocol, where ASCII is enough and has been
| enough for decades.
|
| And if someone complains that ASCII only supports English,
| well... suck it up! I'm Italian and work in French, still I
| hate when a colleague sneaks in a comment not in English.
| Professional software development happens in English.
| astrange wrote:
| HTTP does support content-type tags and Unicode in URLs.
| Which funny enough comes in two different encodings,
| punycode and percent escapes.
| uxp8u61q wrote:
| > The content of a webpage is required to be expressed in
| every supported language, but the HTTP protocol must not.
| And it would make no sense at all to add
| internationalization to intra-machines protocol, where
| ASCII is enough and has been enough for decades.
|
| I guess no URLs with funny characters then. "GET
| /profile/renée" => 500 error, woohoo.
|
| > And if someone complains that ASCII only supports
| English, well... suck it up! I'm Italian and work in
| French, still I hate when a colleague sneaks in a comment
| not in English. Professional software development happens
| in English.
|
| Get over yourself, a lot of professional development
| happens in languages other than English.
| jeddy3 wrote:
| I can name one. At my job we do the kind of embedded
| programming where encoders inside machines send data to each
| other. Like reading optical sensors and sending bits
| indicating state to other controllers.
|
| We absolutely do not "need" to know about Unicode, outside of
| interest about other realms.
| kajaktum wrote:
| I am torn between supporting all languages (which easily leaks
| into supporting emojis) versus just using the ~90 Latin
| characters as the lingua franca.
|
| Look, I would love to be able to read/write Sanskrit, Arabic,
| Chinese, Japanese etc and share those content and have everyone
| render and see the same thing. The problem is that I feel like
| most of these are:
|
| 1. a kind of an open problem
|
| 2. very subjective
|
| 3. very, very subjective, as what you see is mostly dictated
| by the implementation (fonts)
|
| For example, why does the gun emoji look like a water gun? Why
| does the skull-and-crossbones symbol look so benign? In fact, it
| is often used as a meme (see deadass :skull:) Why is the basmala a
| single "character"?
|
| In my opinion, people should just learn how to use kaomoji.
| Granted, kaomojis rely on a lot more than the Latin characters
| but it is at least artful, skillfull and a natural extension of
| the "actual" languages.
|
| > inb4 languages evolves
|
| Yes, but it mostly happens naturally. I feel like what happens
| today mostly happens at the whim of a few passionate people in
| the standard.
| zzo38computer wrote:
| > I am torn between supporting all languages (which easily
| leaks into supporting emojis) versus just using the 90~ Latin
| characters as the lingua franca.
|
| I don't want to support emoji either (and, I don't want emoji
| on my computer), although in some cases, if it is really
| necessary to be supported, they could be implemented just as
| text characters instead of as colourful emoji, anyways.
|
| For many purposes (e.g. computer codes) ASCII is good enough
| (and actually even can be better since it avoids the security
| problems of using Unicode). (Sometimes, character sets other
| than ASCII can be used, e.g. APL character set for APL
| programming.)
|
| > Look, I would love to be able to read/write Sanskrit, Arabic,
| Chinese, Japanese etc
|
| I also would, but Unicode is bad enough that I would use other
| ways of doing such a thing when possible (even writing my own
| programs, etc). (If a program insists on Unicode, I might just
| use ASCII only anyways, or write my own program)
|
| Not everyone necessarily needs to see the same thing (if it is
| text, rather than pictures of the text); suitable character
| sets for that language can be in use (with fixed pitch if
| necessary, etc.), letting a suitable font be auto-selected on
| the reader's computer according to their preference.
|
| So, I prefer to support all languages (where applicable;
| sometimes it isn't), without using Unicode.
| ggcampinho wrote:
| Elixir also gets the length correctly, not only Swift.
| Dudester230602 wrote:
| Guys, you don't need to know that crap.
| wickedsickeune wrote:
| I'm sorry but the website design is extremely distracting. The
| mouse pointers at least are easy to delete with the inspector;
| The background color is not the best choice for reading material,
| but the inexcusable part is the width of the content.
|
| This content must be really awesome for someone to go through the
| trouble of interacting with such a site.
| [deleted]
| francisofascii wrote:
| I enjoyed how the timeline graphic included Joel's article.
| Because my first thought was hey, isn't this the same title.
| gumby wrote:
| This is quite a good write up. An answer to one of the author's
| questions:
|
| > Why does the fi ligature even have its own code point? No idea.
|
| On of the principles of Unicode is round trip compatibility. That
| is you should be able to read in a file encoded with some
| obsolete coding system and write it out again properly. Maybe
| frob it a bit with your unicode-based tools first. This is a good
| principle, though less useful today.
|
| So the fi ligature was in a legacy encoding system and thus must
| be in Unicode. That's also why things like digits with a circle
| around them exist: they were in some old Japanese character set.
| Nowadays we might compose them with some zwj or even just leave
| them to some higher level formatting (my preference).
| sdrothrock wrote:
| > they were in some old Japanese character set
|
| This implies that they're obsolete, but they're not -- they're
| still in very common use today. You can type them in Japanese
| by typing まる (maru, circle) and the number, then pick it out
| of the IME menu. Some IMEs will bring them up if you just type
| the number and go to the menu, too. :)
| gumby wrote:
| Fair enough. I was thinking of them as obsolete, but
| shouldn't since you do see them a surprising amount in Japan.
| WorldMaker wrote:
| > So the fi ligature was in a legacy encoding system and thus
| must be in Unicode.
|
| Most of the pre-composed latin ligatures are generally from
| EBCDIC codepages. People in the ancient Mainframe era wanted
| nice typesetting too, but computer fonts with ligature support
| were a much later invention.
|
| You can see fi and several others directly in EBCDIC code page
| 361:
|
| https://en.wikibooks.org/wiki/Character_Encodings/Code_Table...
| gumby wrote:
| Thanks. Some alphabets have precomposed ligatures that aren't
| really letters, like old German alphabets with tz, ch, ß (I
| only know how to type the last one, ß, because the others
| have died out over the last hundred years).
|
| Actually in German (at least) ä, ö and ü really are
| ligatures for ae, oe, and ue -- the scribes started to write
| the E's on their sides above the base letters, and over time
| the superscript "E"s became dots or dashes. Often they are
| described the other way around: "you can type oe if you can't
| type ö." That's what my kid was told in school!
|
| But O and ss aren't really part of the alphabet in German,
| while, say, in Swedish, a and o became actual letters of the
| alphabet. English got W that way too.
| cyxxon wrote:
| That sounds a bit off to me. The Umlaute (ä, ö, ü) and the
| "eszett" ß actually are part of the German alphabet[1]. It
| is also a bit odd to describe them as ligatures of the
| original letters and the diaeresis: while that is how they
| started out a long time ago, they are simply their own
| letters now (as opposed to "real" stylistic ligatures like
| combining fi into one glyph). The advice your kid was given,
| that they can be replaced with ae, oe and ue, is correct,
| but it is a replacement nowadays.
|
| [1] https://de.wikipedia.org/wiki/Deutsches_Alphabet
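Whichever way the history ran, Unicode does encode the relationship described above: the umlaut vowels canonically decompose into the base letter plus U+0308 COMBINING DIAERESIS. A minimal Python sketch:

```python
import unicodedata

# U+00E4 (ä) canonically decomposes into the base letter plus
# U+0308 COMBINING DIAERESIS -- echoing its historical origin as a
# small "e" written above the vowel.
composed = "\u00e4"                       # ä, one code point
decomposed = unicodedata.normalize("NFD", composed)
print([hex(ord(c)) for c in decomposed])  # ['0x61', '0x308']
print(composed == decomposed)             # False: bytewise different
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```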
| gwervc wrote:
| The circled digits as code points are very nice to have
| precisely because they work even in applications that offer
| no other way to produce them... which is actually most of
| the software I can think of (Notepad, Apple Notes, chat
| applications, most websites, etc).
| swores wrote:
| Can you write them with iOS keyboard? Or when you say Apple
| Notes and chat apps you just mean from desktop?
|
| Edit 1: seems the answer is not with the default iOS
| keyboard, but possible to paste it and perhaps possible with
| a third party keyboard that I'm not keen on trying (unless I
| hear of a keyboard that's both genuinely useful / better than
| default, and that doesn't send keystrokes to the developer -
| though I can't remember if the latter is even a risk on iOS,
| better go search about that next..)
| d11z wrote:
| Speaking of third party keyboards, I'm still upset about
| what happened to Nintype[0]. I've never ever been able to
| type faster on mobile than with its intuitive hybrid input
| style of sliding and tapping, paired with AI that was
| actually good. It used to be quite performant, fully
| customizable, and it worked beautifully as a replacement
| for default on jailbroken iOS.
|
| Today, it's buggy $5 abandonware that only makes me sad
| when I am reminded of it.
|
| EDIT: Here[1] is a blog post that claims it's still the
| best keyboard in 2023. I actually might give it another
| shot... Not holding my breath though.
|
| EDIT 2: Looks like another dedicated fan has actually taken
| it upon themself to revive the project, under the new name
| Keyboard71[2].
|
| [0] https://apps.apple.com/us/app/nintype/id796959534
|
| [1] https://maxleiter.com/blog/nintype
|
| [2] https://www.reddit.com/r/keyboard71/
| tiltowait wrote:
| Nintype was absolutely incredible. I still open it every
| now and then after an iOS update in the vain hope some
| system change made it less buggy.
| d11z wrote:
| I'm really considering repurchasing (I definitely owned
| it previously, no idea what happened), can you describe
| specifically what the main bugs are for you? I'd be happy
| if I could use it solely for occasionally writing long
| notes, not as a replacement for all text inputs.
|
| Really not looking to burn another $5, I'd greatly
| appreciate any thoughts/concerns at all.
| swores wrote:
| I wonder why they haven't open-sourced their fork, other
| than a vague worry it might get DMCA'd
| masklinn wrote:
| You can copy/paste them from a character board, a dedicated
| website, or even the wiki.
| P-Nuts wrote:
| You can type ① with the UniChar keyboard app on iOS. It at
| least claims it doesn't transmit information. As it's only
| useful for special characters I don't worry because I can't
| use it for normal typing anyway.
|
| https://unichar.app
| astrange wrote:
| No third party keyboard transmits information without you
| permitting it.
| gumby wrote:
| My point was that, had they not been legacy characters (or
| had round-trip compatibility been disregarded) Unicode could
| still have supported them as composed characters. Though I
| personally still feel they are a kind of ligature or graphic,
| but luckily for everyone else I'm not the dictator of the
| world :-).
|
| We should be careful: someone on HN could write a proposal
| that they _should_ be considered precomposed forms that
| should also have an un-composed sequence... so there could
| in future be not just "1 in a circle" but "1 ZWJ circle" and
| "circle ZWJ 1", all considered the same... I can imagine
| some HN readers being pranksters like that.
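The composed alternative mentioned above does in fact already exist: U+20DD COMBINING ENCLOSING CIRCLE can be stacked on a digit, and notably, normalization does not unify it with the precomposed U+2460. A sketch:

```python
import unicodedata

# U+20DD COMBINING ENCLOSING CIRCLE over a digit: a composed
# "circled one" with no canonical relationship to precomposed U+2460.
composed = "1\u20dd"
precomposed = "\u2460"
assert len(composed) == 2
# NFC leaves the combining sequence alone...
assert unicodedata.normalize("NFC", composed) == composed
# ...and even compatibility normalization of U+2460 yields plain "1",
# not the combining sequence, so the two never compare equal.
assert unicodedata.normalize("NFKC", precomposed) == "1"
```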
| throwaway_fjmr wrote:
| And yet, many modern, recent apps can't even encode the accented
| European character in my given name. Sigh.
| oefrha wrote:
| > many Chinese, Japanese, and Korean logograms that are written
| very differently get assigned the same code point
|
| This leads to absolutely horrendous rendering of Chinese
| filenames in Windows if the system locale isn't Chinese. The
| characters seem to be rendered in some variant of MS Gothic and
| it's very obviously a mix of Chinese and Japanese glyphs (of
| somewhat different sizes and/or stroke widths IIRC). I think the
| Chinese locale avoids the issue by using Microsoft YaHei UI.
| nwellnhof wrote:
| Unicode is a total mess. In a sane system, "extended grapheme
| clusters" would equal "codepoints" and it wouldn't make a
| difference for 99% of languages. Now we ended up with grapheme
| clusters, normalization, decomposition, composition, Zalgo text,
| etc. But instead of deprecating this nonsense, Unicode doubled
| down with composed Emojis.
| jetbalsa wrote:
| I feel it's the same as with any long-standing computer
| system we have today: it was designed while more and more of
| the world came online, with all the growing pains that
| entailed. Could it be built from scratch today better? Yes.
| Will it be? No. I suspect it will be around long after we
| are all dead. Same with IPv4 :V
| hot_gril wrote:
| Honestly I like ipv4 better than v6. I like having a NAT and
| easy addresses like 192.168.1.3 instead of
| fe80::210:5aff:feaa:20a2. They didn't need to mess with those
| things just to expand the address space, like how utf8 didn't
| require remapping ASCII.
| jrockway wrote:
| IPv4.1 should have just had 40 bits, to be written like
| 999.999.999.999. (I know this wouldn't have actually had
| much effect, nobody is going to add new routes in the
| middle of "class A" spaces that already existed, so it
| would just give those that already had IP addresses more IP
| addresses. Additionally, people really abuse decimal
| addresses in horrifying ways; for example, Fios steals
| 192.168.1.100-192.168.1.150 for its TV service, and that
| range doesn't really correspond to anything that you can
| mask off in binary. It only makes sense in decimal, which
| is not what any underlying machinery uses. They should have
| given themselves a /26 or something. You get 3 for yourself
| (modulo the broadcast and gateway address), and they get 1
| for TV.)
| hot_gril wrote:
| Having it actually be decimal might've been nice, but at
| this point people are used to the 1-254 range, and I
| think the least jarring addition of extra bits would be
| to simply extend it for the addresses that need them (and
| not for the ones that don't). So you could have
| 123.444.3.254 or longer like 123.444.3.254.12.43.
| [deleted]
| vacuity wrote:
| Be the change you want to see in the world. If we're going to
| make huge breaking changes, might as well do it sooner rather
| than later.
| jetbalsa wrote:
| With something as large as an end-user language format for
| input, this is a change we ourselves cannot make, just as
| with using another calendar for dates. Just because I want
| to use the year-2002023 calendar with 29.5 days per month
| doesn't make it useful to others, or to myself really.
| Joker_vD wrote:
| Do your dates alpha-convert?
| nvm0n2 wrote:
| I think actually you could. A thought experiment:
|
| The problems with Unicode are mostly to do with internal
| inconsistencies and churn, problems that usually only
| affect programmers.
|
| 1. Different ways to encode the same visually
| indistinguishable set of characters as code points
| leading to normal forms, text that compares unequal even
| when it appears to be identical, the disastrous "grapheme
| clusters" concept and so on.
|
| 2. Many different ways to encode the same sequence of
| code points as bytes. Not only UTF-32/16/8 but also
| curiosities like "modified UTF-8".
|
| 3. Emoji. A fractal of disasters:
|
| 3.a. Updates frequently. Neither Unicode nor software in
| general was built on the assumption that something as
| basic as the alphabet changes every year. If you send
| someone an emoji, can their device draw it? Who knows! In
| practice this means messaging apps can't rely on the OS
| system fonts or text handling libraries anymore which is
| a drastic regression in basic functionality.
|
| 3.b. (Ab)uses composition so much it's practically a
| small programming language, e.g. flags are composed of
| the two letter country code spelled using special
| characters. People are represented as as generic person
| plus skin color patch, families are represented using
| composed individual people etc.
|
| 3.c. Meaning of a character is theoretically specified
| but can subtly depend on the font used, e.g. people use a
| fruit emoji in visual puns because of how it looks
| specifically on Apple devices, so a "sentence" can make
| no sense if it's rendered with a different font.
|
| 3.d. Unbounded in scope. There's no reason the Unicode
| committee won't just keep adding new pictograms forever.
|
| 3.e. Encoded beyond the BMP which in theory every correct
| program should handle but in practice some don't because
| nobody except a few academics used characters beyond it
| much until emoji came along.
|
| 3.f. Disagreement over single vs double width chars, can
| only know this via hard-coded tables, matters for
| terminals and code editors.
|
| Some of these can potentially be cleaned up outside of
| the Unicode consortium in backwards compatible ways. You
| could have a programming language that automatically
| normalized strings to fully composed form when
| deserializing from bytes, and then automatically folded
| semantically identical code points together (this would
| be a small efficiency win for some languages too). You
| could campaign to build a consensus around a specific
| normal form, like how UTF-8 gained consensus as a
| transfer encoding. You could also define a fork of
| Unicode (using private use areas?) that allocates a
| single code point to the characters that are
| unnecessarily using composition today but don't yet have
| one and then just subset out the concept of composition
| entirely.
|
| Emoji are a big problem. It's tempting to say that these
| should not be encoded as characters at all. Instead there
| could be a set of code points that define bounds that
| contain a tiny binary subset of SVG, enough to recreate
| the Apple pixel art somewhat closely. Emoji would always
| be transmitted as inlined vector art. Text rendering
| libraries would call out to a little renderer for each
| encoded glyph, using a fast fingerprinting algorithm to
| deduplicate the bytes to an internal notion of a
| character. To avoid wire bloat, text can simply be
| compressed with a pre-agreed zstd or Brotli dictionary
| that contains whatever images happen to be popular in the
| wild. At a stroke this would avoid backwards compat
| problems with new emoji, enabling programs working with
| text to be upgraded _once_ and then never again,
| eliminate all the ridiculous political committee bike-
| shedding over what gets added, let apps go back to using
| system text support and get rid of the bajillion edge
| cases that emoji have spewed all over the infrastructure.
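Points 1 and 3.f above are easy to reproduce with nothing but Python's stdlib `unicodedata`; a minimal sketch:

```python
import unicodedata

# Point 1: the same visible string, two code-point sequences.
nfc = "caf\u00e9"     # é as one precomposed code point
nfd = "cafe\u0301"    # e + U+0301 COMBINING ACUTE ACCENT
assert nfc != nfd                                 # naive comparison fails
assert unicodedata.normalize("NFC", nfd) == nfc   # fold to one form first

# Point 3.f: display width comes from a property table, not a formula.
assert unicodedata.east_asian_width("\u6771") == "W"   # 東: wide
assert unicodedata.east_asian_width("A") == "Na"       # narrow
```

A language that normalized on deserialization, as suggested above, would make the naive comparison succeed transparently.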
| magicalhippo wrote:
| For most software it doesn't really matter either.
|
| I've written unicode-aware software for over a decade, doing
| a wide variety of programs, and I've never had to bother with
| all that mess.
|
| If I'm parsing strings I'm looking for stuff in the 7-bit
| ASCII range which maps neatly onto the Unicode
| representations, and so I just need to take care to preserve
| the rest.
|
| The only trouble I've had is that a lot of programmers
| haven't learned, or don't get, that text encoding is a thing
| and that it needs to be handled.
|
| So they'll hand me an XML they claim is UTF-8 encoded, except
| that XML header was just copypasta and the actual XML
| document is encoded in some other system encoding like
| Windows-1252. Or worse, a mix of both.
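The property relied on above is a deliberate part of UTF-8's design: bytes below 0x80 never occur inside a multi-byte sequence, so parsing for ASCII delimiters in raw UTF-8 is safe and everything else passes through untouched. A small illustration (the field names are made up for the example):

```python
# UTF-8 guarantees that bytes below 0x80 appear only as ASCII
# characters; every byte of a multi-byte sequence has its high bit
# set. So splitting raw UTF-8 on an ASCII delimiter can never cut a
# character in half.
data = "name=Grün;city=東京".encode("utf-8")
fields = data.split(b";")                       # safe: b";" < 0x80
pairs = dict(field.split(b"=", 1) for field in fields)
print(pairs[b"city"].decode("utf-8"))           # 東京
```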
| tialaramex wrote:
| The writing systems were already like this when we got them.
| Unicode's "total mess" mostly just reflects that. Of course it
| would be convenient for you, the programmer, if the users
| wanted the software to do whatever was easiest for you, but
| obviously they want what's easiest for them, not you.
| eviks wrote:
| How is it easiest "for them" to have the mess instead of
| having the newer standard be less messy?
| bluGill wrote:
| because the current mess means all their old stuff still
| works. ASCII is good so long as you only need English (or
| other Latin-script languages without the various accents),
| which was good enough for a long time - and ASCII was also
| carefully designed to make programming easier - flipping one
| bit changes lower/uppercase, for example, but there are more
| things it makes easy. By the time we realized we actually
| care about the rest of the world it was too late to make a
| nice system.
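The bit-flip trick alluded to above, sketched in Python:

```python
# ASCII laid the alphabet out so upper- and lowercase forms differ
# in exactly one bit (0x20), making case conversion a single bitwise
# operation on 1960s hardware.
assert ord("a") - ord("A") == 0x20
assert chr(ord("a") & ~0x20) == "A"  # clear bit 5: to uppercase
assert chr(ord("A") | 0x20) == "a"   # set bit 5: to lowercase
```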
| nwellnhof wrote:
| Name one writing system where you really need character
| composition. Even if there is one, these special cases should
| be handled outside of Unicode.
| layer8 wrote:
| Thai, Arabic, Hebrew, and Devanagari are important
| examples, I believe.
| nottorp wrote:
| The problem is not that you need character composition for
| some writing systems. It's that there are no rules that
| would help with everything having a unique representation
| internally.
|
| Even "put the code points forming the composed character in
| descending numerical order" would be better than nothing.
| If it was there from the start.
|
| However, the Unicode commitee is too busy adding new emojis
| to make their standard sane.
| asherah wrote:
| you can't not handle devanagari, tamil (or like half the
| scripts across the Indian subcontinent and oceania) or
| hangul. even the IPA, used by linguists every day, would be
| particularly bad to deal with if we couldn't write things
| like /a/, and some languages already don't have the
| precomposed diacritics for all letters (like o), so the
| idea of a world with only precomposed letter forms is more
| of an exponential explosion in the character set
| nwellnhof wrote:
| Hangul already has precomposed syllables in Unicode. We
| still have several hundred thousand unassigned codepoints
| to deal with diacritics.
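Hangul is in fact a showcase for both forms: the precomposed syllables decompose algorithmically into conjoining jamo, and normalization round-trips between the two. A quick check in Python:

```python
import unicodedata

# U+D55C (한) canonically decomposes into three jamo: CHOSEONG HIEUH,
# JUNGSEONG A, JONGSEONG NIEUN.
syllable = "\ud55c"
jamo = unicodedata.normalize("NFD", syllable)
print(len(jamo))                                        # 3
print(unicodedata.normalize("NFC", jamo) == syllable)   # True
```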
| arp242 wrote:
| > so the idea of a world with only precomposed letter
| forms is more of an exponential explosion in the character
| set
|
| "Exponential explosion" is really putting it too strong;
| it's perfectly possible to just add o and a and a bunch
| of other things. The combinations aren't infinite here.
|
| The problem with e.g. Latin script isn't _necessarily_
| that combining characters exist, but that there 's two
| ways to represent many things. That really is just a
| "mess": use either one system or the other, but not both.
| Hangul has similar problems.
|
| Devanagari doesn't have any pre-composed characters AFAIK,
| so that's fine.
|
| That's really the "mess": it's a hodgepodge of different
| systems, and you can't even know which system to use a
| lot of the time because it's not organised ("look it up
| in a large database"), and even taking in to account
| historical legacy I don't think it really _needed_ to be
| like this (or is even an unfixable problem today,
| strictly speaking).
|
| At least they deprecated ligatures like st and fl,
| although recently I did see ij being used in the wild.
| WorldMaker wrote:
| > The combinations aren't infinite here.
|
| They certainly are. Languages are a creative space driven
| by the human imagination. Give people enough time and
| they'll build new combinations for fun or for profit or
| for research or for trying to capture a spoken word/tone
| poem in just the right sort of exciting way. You may
| frown on "Zalgo text" [1] (and it is terrible for
| accessibility), but it speaks to a creative mood or
| three.
|
| The growing combinatorial explosion in Unicode's emoji
| space isn't an accident or something unique to emoji, but
| a characteristic that emoji are just as much a creative
| language as everything else Unicode encodes. The biggest
| difference is that it is a living language with a lot of
| visible creative work happening in contemporary writing
| as opposed to a language some monks centuries ago decided
| was "good enough" and school teachers long ago locked
| some of the creative tools in the figurative closets to
| keep their curriculum simpler and their days with fewer
| headaches.
|
| [1] https://en.wikipedia.org/wiki/Zalgo_text
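Zalgo text needs no special tooling; any run of combining marks can be stacked on a base character without limit. A tiny sketch (the choice of marks here is arbitrary):

```python
import random

# Stack five arbitrary combining marks (from the U+0300..U+034F
# block) on each base character -- this is all "Zalgo text" is.
random.seed(0)
marks = [chr(cp) for cp in range(0x0300, 0x0350)]
zalgo = "".join(ch + "".join(random.choices(marks, k=5)) for ch in "hi")
print(len(zalgo))  # 12 code points for 2 visible characters
```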
| arp242 wrote:
| Well, in theory it's infinite, but in reality it's not of
| course.
|
| We've got 150K codepoints assigned, leaving us
| with 950K unassigned codepoints. There's truly massive
| amounts of headroom.
|
| To be honest I think this argument is rather too abstract
| to be of any real use: if it's a theoretical problem that
| will never occur in reality then all I can say is:
| <shrug-emoji>.
|
| But like I said: I'm not "against" combining marks,
| purely in principle it's probably better, I'm mostly
| against two systems co-existing. In reality it's too late
| to change the world to decomposed (for Latin, Cyrillic,
| some others) because most text already is pre-composed,
| so we should go full-in on pre-composed for those. With
| our 950k unassigned codepoints we've got space for
| literally thousands of years to come.
|
| Also this is a problem that's inherent in computers: on
| paper you can write anything, but computers necessarily
| restrict that creativity. If I want to propose something
| like a "%" mark on top of the "e" to indicate, I don't
| know, _something_, then I can't do that regardless of
| whether combining characters are used, never mind
| entirely new characters or marks. Unicode won't add it
| until it sees usage, so this gives us a bit of a catch-22
| with the only option being mucking about with special
| fonts that use private-use (hoping it won't conflict with
| something else).
| WorldMaker wrote:
| The Unicode committees have addressed this for languages
| such as Latin, Cyrillic, and others and stated outright
| that decomposed forms should be preferred and
| decomposition canonical forms are generally the safest
| for interoperability and operations such as collation
| (sorting) and case folding (lowercase to uppercase
| transformations).
|
| Unicode can't get rid of the many precombined characters
| for a huge number of backward compatibility reasons
| (including compatibility with ancient Mainframe encodings
| such as EBCDIC which existed before computer fonts had
| ligature support), but they've certainly done what they
| can to suggest the "normal" forms in this decade should
| "prefer" the decomposed combinations.
|
| > If I want to propose something like a "%" mark on top
| of the "e" to indicate, I don't know, something, then I
| can't do that regardless of whether combining characters
| are used
|
| This is where emoji as a living language actually shines
| a living example: It's certainly possible to encode your
| mark today as a ZWJ sequence, say <<e ZWJ %>>, though you
| might want to consider for further disambiguation/intent-
| marking adding a non-emoji variation selector such as
| Variation Selector 1 (U+FE00) to mark it as "Basic
| Latin"-like or "Mathematical Symbol"-like. You can
| probably get away with prototyping that in a font stack
| of your choosing using simple ligature tools (no need for
| private-use encodings). A ZWJ sequence like that in
| theory doesn't even "need" to ever be standardized in
| Unicode if you are okay with the visual fallback to
| something like "e%" in fonts following Unicode standard
| fallback (and maybe a lot of applications confused by the
| non-recommended grapheme cluster). That said, because of
| emoji the process for filing new proposals for
| "Recommended ZWJ Sequences" is among the simplest Unicode
| proposals you can make. It's not entirely as Catch-22 on
| "needs to have seen enough usage in written documents" as
| some of the other encoding proposals.
|
| Of course, all of that is theory and practice is always
| weirder and harder than theory. Unicode encoding truly
| living languages like emoji is a blessing and it does
| enable language "creativity" that was missing for a
| couple of decades in Unicode processes and thinking.
| arp242 wrote:
| > The Unicode committees have addressed this for
| languages such as Latin, Cyrillic, and others and stated
| outright that decomposed forms should be preferred
|
| Yes, and that only makes things worse since the
| overwhelming majority of documents (99.something% last
| time I checked) uses pre-composed. Also AFAIK just about
| everyone just ignores that recommendation.
|
| This is a classic "reality should adjust to the standard"
| type of thinking. Previous comments about that:
| https://news.ycombinator.com/item?id=36984331
|
| I suppose "e ZWJ %" is a bit better than Private Use as
| it will appear as "e%" if you don't have font support,
| but the fundamental problem of "won't work unless you
| spend effort" remains. For a specific niche (math,
| language study, something else) that's okay, but for
| "casual" usage: not so much. "Ship font with the
| document" like PDF and webfonts do is an option, but also
| has downsides and won't work in a lot of contexts, and
| still requires extra effort from the author.
|
| I'm not saying it's completely impossible, but certainly
| harder than it used to be, arguably much harder. I could
| coin a new word right here and now (although my
| imagination is failing me to provide a humorous example
| at this moment) and if people like it, it will see usage.
| In 1960s HN when we would have exchanged these things
| over written letters, and it would have been trivial to
| propose a "e with % on top" too, but now we need to
| resort to clunky phrases like this (even for typewriters
| you can manually amend things, if you really wanted to).
|
| Or let me put it this way: something like the interrobang
| (‽) would see very little chance of being added to Unicode
| if it was coined today. Granted, it doesn't see _that_ much use,
| but I do encounter it in the wild on occasion and some
| people like it (I personally don't actually, but I don't
| want to prevent other people from using it).
|
| None of this is Unicode's fault by the way, or at least
| not directly - this is a generic limitation of computers.
| WorldMaker wrote:
| > Yes, and that only makes things worse since the
| overwhelming majority of documents (99.something% last
| time I checked) uses pre-composed.
|
| It shouldn't matter what's in the wild in documents.
| That's why we have normalization algorithms and
| normalization forms. Unicode was built for the ugly
| reality of backwards compatibility and that you can't
| control how people in the past wrote. These precomposed
| characters largely predate Unicode and were a problem
| before Unicode. Unicode _won_ in part because it met
| other encodings where they _were_ rather than where they
| wished they would be. It made sure that mappings from
| older encodings could be (mostly) one-to-one with respect
| to code points in the original. It didn't quite achieve
| that in some cases, but it did for, say, all of EBCDIC.
|
| Unicode was never in the position to fix the past, they
| had to live with that.
|
| > This is a classic "reality should adjust to the
| standard" type of thinking.
|
| Not really. The Unicode standard suggests the
| normal/canonical forms and very well documented
| algorithms (including directly in source code in the
| Unicode committee-maintained/approved ICU libraries) to
| take everything seen in the wilds of reality and convert
| them to a normal form. It's not asking reality to adjust
| to the standard, it is asking _developers_ to adjust to
| the algorithms for cleanly dealing with the ugly reality.
|
| > Or let me put it this way: something like !? would see
| very little chance of being added to Unicode if it was
| coined today.
|
| Posted to HN several times has been the well documented
| proposal process from start to finish (it succeeded) of
| getting common and somewhat less common power symbols
| encoded in Unicode. It's a committee process. It
| certainly takes committee time. But it isn't "impossible"
| to navigate and is certainly higher than "little chance"
| if you've got the gumption to document what you want to
| see encoded and push the proposal through the committee
| process.
|
| Certainly the Unicode committee picked up a reputation
| for being hard to work with in the early oughts when the
| consortium was still fighting the internal battles over
| UCS-2 being "good enough" and had concerns about opening
| the "Astral Plane". Now that the astral plane is open and
| UTF-16 exists, the committee's attitude is considered to
| be much better, even if its reputation hasn't yet shifted
| from those bad old days.
|
| > None of this is Unicode's fault by the way, or at least
| not directly - this is a generic limitation of computers.
|
| Computers do anything we program them to do and in
| general people find a way regardless of the restrictions
| and creative limitations that get programmed. I've seen
| MS Paint drawn symbols embedded in Word documents because
| the author couldn't find the symbol they needed or it
| didn't quite exist. It's hard to use such creative
| problem solving in HN's text boxes, but that from some
| viewpoints is just as much a creative deficiency in HN's
| design. It's not an "inherent" problem to computers. When
| it is a problem they pay us software developers to fix
| it. (If we need to fix it by writing a proposal to a
| standards committee such as the Unicode Consortium, that
| is in our power and one of our rights as developers.
| Standards don't just bind in one-direction, they also
| form an agreement of cooperation in the other.)
| PetitPrince wrote:
| The intent of Unicode was to have a universal solution for
| humans. Excluding one case, even if it's remote, would
| defeat this mission statement.
| dleeftink wrote:
| What sort of practical issues are you running into due to
| Unicode's codepoint compositionality?
| nwellnhof wrote:
| It's unnecessary complexity and a security nightmare. Have
| you ever tried to implement Unicode normalization? A single
| bug in your code and malformed text can crash your
| application or worse.
| torstenvl wrote:
| It's hard for me to imagine how Unicode normalization could
| crash your application unless you have very convoluted
| memory management code.
|
| What on earth are you doing that it's leading to crashes?
| Are you not validating the result?
| hot_gril wrote:
| iMessage has had several vulnerabilities related to this.
| Whatever the difficulties are, even Apple can't handle
| them sometimes.
| torstenvl wrote:
| I'm very skeptical, but willing to be proven wrong.
| What's the CVE?
| hot_gril wrote:
| First that comes to mind is the "effective power" one,
| https://nvd.nist.gov/vuln/detail/cve-2015-1157 There's
| also the "black dot" one, can't find the CVE though.
| torstenvl wrote:
| That seems like a truncation + display issue though, not
| a normalization issue.
|
| https://www.reddit.com/r/apple/comments/37e8c1/malicious_
| tex...
|
| In fact, I don't know that there's any reason to believe
| normalization happens at all in the process of executing
| this.
| dleeftink wrote:
| That's tricky, for sure. My 'workaround' has long been
| converting codepoints into byte sequences and creating a
| character dictionary from that. Based on the source corpus,
| this dictionary can be further expanded/compressed and used
| for downstream processing.
| zajio1am wrote:
| Normalization and the fact it is not forward-compatible.
| dclowd9901 wrote:
| They kind of have to, don't they? Otherwise we'd become
| space-limited way too fast, especially with how quickly new
| emojis are being made, along with all their variants.
| eviks wrote:
| But precomposing all the potential combinations is less sane
| than the current mess (and you can outlaw Zalgo in the standard
| if you think it's a serious issue)
|
| Also, the % should measure people, not languages; that
| would greatly decrease the imaginary 99%
| arp242 wrote:
| > Unicode doubled down with composed Emojis.
|
| Not just emojis, in general I believe Unicode has just said
| they're not going to add new pre-composed characters and that
| using combining characters is the _Right Way(tm)_ to do things
| (well, the _only_ way for newer scripts).
|
| One of the downsides of writing down specifications is that
| they tend to attract people with Very Strong Opinions on the
| One And Only Right Way and will argue it to no end, and
| essentially "win" the argument just by sheer verbosity and
| persistence.
|
| That's certainly what I've seen happen in a few cases, and is
| what happens on e.g. Wikipedia as well at times.
|
| But yeah, emojis are even worse. Some things can look
| rather different depending on which invisible variation
| selector is present. We've got tons and tons of unassigned
| codepoints and we need to resort to these tricks to save a few
| of them?
|
| Firefighter is "(man|woman|person) + ZWJ + firetruck". Clever,
| I guess. Construction worker is "Construction worker (+ ZWJ +
| (male sign|female sign))?" (absence is gender-neutral). Why are
| there 2 systems to encode this? Sigh...
|
| All of this is too clever by a mile.
|
| [1]: HN will strip stuff, but try something like:
| echo $'-\ufe0f -\ufe0e'
|
| May not display correctly in terminal, but can xclip it to a
| browser - screenshot: https://imgur.com/a/iFmBDQk
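The two schemes described above, written out as code-point sequences (taken from Unicode's RGI emoji ZWJ list; how they render depends on the font):

```python
# Scheme 1: profession as person + ZWJ + object.
# Scheme 2: base emoji + ZWJ + gender sign + variation selector.
ZWJ = "\u200d"
firefighter = "\U0001F469" + ZWJ + "\U0001F692"  # woman + fire engine
worker_f = "\U0001F477" + ZWJ + "\u2640\ufe0f"   # worker + female sign + VS16
print(len(firefighter))  # 3 code points, one glyph in supporting fonts
print(len(worker_f))     # 4 code points
```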
| d3w4s9 wrote:
| The first time I heard that Unicode would support emoji, I
| knew it would be a recipe for disaster. And I definitely was
| not disappointed.
| arp242 wrote:
| I mean, I don't dislike the concept personally. I actually
| really hate how HN strips them.
|
| But the technical implementation? Yeah, that could have
| gone a lot better IMHO.
|
| One must also wonder if some things really had to be added
| in the first place, e.g. for people kissing it's:
| (person|man|woman)(skin-tone)? ZWJ <heart> ZWJ <kissing
| lips> ZWJ (person|man|woman)(skin-tone)?
|
| This is NOT a complaint about that they added diversity as
| such, in principle I'm all for that, it's just that few
| seem to actually use these emojis, and both in terms of
| code and UI it all gets pretty complex; there's 98
| combinations to choose from here.
|
| I don't really get why <heart> or <kissing lips> or
| <kissing face> isn't enough. That's actually what most
| people seem to use anyway, because who finds it convenient
| to pick all the correct genders and skin tones from the UI
| for both people?
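Counting code points in one of those kiss sequences shows how heavy the encoding is; a sketch, assuming the woman + man RGI sequence:

```python
# One visible "kiss" glyph, eight code points: woman, ZWJ, heart,
# VS16, ZWJ, kiss mark, ZWJ, man. len() counts code points, not
# grapheme clusters.
kiss = ("\U0001F469\u200d\u2764\ufe0f\u200d"
        "\U0001F48B\u200d\U0001F468")
print(len(kiss))  # 8
```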
| nottorp wrote:
| > I actually really hate how HN strips them.
|
| Oh. So that's why HN discussion always looks sane. They
| strip the pollution.
| pests wrote:
| > there's 98 combinations to choose from here.
|
| Less than that since a default skin color can be set in
| most apps. I'm sure setting a gender will come soon so
| the entire first part of that emoji can be auto-guessed.
| Then it's just showing the other options in the UI. Really
| all of this is UI design, as even with the 98 combinations
| you can still display it as 4-5 options you drill down
| through.
|
| > who finds it convenient to pick all the correct genders
| and skin tones from the UI for both people?
|
| I just checked and searching "kissing" in my iOS emoji
| keyboard inside Messenger showed just 4 of the emojis
| you're describing - defaulting both skin tones to my
| settings and then the four M/F pair ups. Plus some non-
| related kissing emojis like the cat kissing.
| arp242 wrote:
| > defaulting both skin tones to my settings
|
| But that's kind of wrong, no? The entire point is that
| you can choose both sides individually. What if you set
| it to black and want to kiss some white bloke?
|
| If anything that only underscores my point that it's too
| complex and that no one is using them (certainly not as
| intended anyway).
| pests wrote:
| That's on Apple not on emojis.
|
| In the Windows 11 emoji picker it works like this:
|
| 1. Search "kissing". See two generic yellow people
| kissing. Notice a blue dot in the bottom right corner.
|
| 2. Clicking the emoji brings up previously used versions
| of the kissing emoji, with a + button.
|
| 3. Clicking + brings up a dialog like I described
| previously. Two generic figures at the top, then a row of
| skin tones.
|
| 4. You can click on each generic person and choose a
| gender, then select a skin tone. You can do this for each
| person in the group.
|
| 5. Click done. This emoji is now in your default emoji
| list and you won't need to recreate it again.
| arp242 wrote:
| That seems like a lot of effort when you could have sent
| <kissing-lips>, <kissing-cat>, <kissing-face>, <heart>,
| or any number of other emojis, which is what my point
| was.
| pests wrote:
| You still can! People who want more customizations can do
| so too. Plus it only takes the initial setup per emoji at
| least.
| bluecheese452 wrote:
| Anyone else hate titles like this? There are millions of
| developers working on a large variety of things. It sounds so
| arrogant to me.
| russellbeattie wrote:
| We need, desperately and without question, two Unicode symbols
| for bold and italic.
|
| These are _part of language_ and should not be an optional
| proprietary add-on that can be skipped or deleted from text.
| We've been using the two "formats" to convey _important_
| information since the _sixteenth century_!!!
|
| It boggles my mind that we can give flesh tone to emojis, yet not
| mark a word as bold or italic. It makes zero sense. Especially
| how easy it would be to implement. It would work exactly the same
| way: Letters following the mark would be formatted as bold or
| italic until a space character or equivalent.
| tripdout wrote:
| Can there be overlaps between fonts in the private use area?
| mankyd wrote:
| Yes. "Private" in this case means that you can't expect
| consistent behavior from one system to the next.
| Nevermark wrote:
| What an interesting mess!
|
| It occurs to me that a canonical semantic representation of all
| known (extracted) language concepts would be useful too.
|
| Now that we have multi-language LLM's it would be an interesting
| challenge to create/design a canonical representation for a
| minimum number of base concepts, their relations and orthogonal
| "voice" modifiers, extracted from the latent representations of
| an LLM across a whole training set, over all training languages.
|
| While the best LLMs still have complex reasoning issues, their
| understanding of concepts and voice at the sentence level is
| highly intuitive and accurate. So the design process could be
| automated.
|
| The result would be a human language agnostic, cross-culture
| concept inclusive, regularized & normalized (relatively speaking)
| semantic language. Call it SEMANTICODE.
|
| _We need to get this right, using one standard LLM lineage,
| before the Unicode people create a super standard that spans 150
| different LLMs and 150 different latent spaces!_ :O
|
| Stability between updates would be guaranteed by including
| SEMANTICODE as a non-human language in training of future LLM's.
| Perhaps including a (highly) pre-normalized semantic artificial
| language would dramatically speed up and reduce the parameter
| count needed for future multi-language training?*
|
| Then LLMs could use SEMANTICODE to talk to each other more
| reliably,
| efficiently, and with greater concept specificity than any of our
| single languages.
| Dwedit wrote:
| Reload with Javascript disabled to remove the distracting fake
| mouse pointers.
| alexmolas wrote:
| I tried to read the articles since it seemed interesting. After
| exactly 30 seconds trying it I had to leave the page. Impossible
| to read more than two sentences with all those pointer moving
| there - and for a folk with ADHD even more difficult. Sorry, but
| I couldn't make it :(
| anthk wrote:
| Use the reader mode. Or if you are under GNU/Linux, use
| Links/Lynx.
| Maken wrote:
| Fortunately you didn't try the dark theme.
| TacticalCoder wrote:
| > For example, é (a single grapheme) is encoded in Unicode as e
| (U+0065 Latin Small Letter E) + ´ (U+0301 Combining Acute
| Accent). Two code points!
|
| It's a poor and misleading example, for it is definitely not how
| 'é' is encoded in 99.999% of all the text written in, say, French
| out there (French being the language where 'é' is most common).
|
| 'é' is U+00E9, one code point, definitely not two.
|
| Now you could say: but it is _also_ the two-codepoint one. But
| that's precisely what makes Unicode the complete, total and
| utter clusterfuck that it is.
|
| And hence even an article explaining what every programmer should
| know about Unicode cannot even get the most basic example right.
| Which is honestly quite ironic.
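Both encodings are valid Unicode, and normalization converts between them; a quick check in Python (unicodedata is in the stdlib):

```python
import unicodedata

precomposed = "\u00E9"   # é as one code point (NFC form)
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT (NFD form)

assert precomposed != decomposed                 # naive comparison fails
assert len(precomposed) == 1 and len(decomposed) == 2
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```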
| crazygringo wrote:
| > _definitely not how 'e' is encoded in 99.999% of all the text
| written in, say, french out there_
|
| Maybe how it's input by the keyboard (I haven't checked) but
| not how it's output on the web or other documents.
|
| Plenty of text goes through Unicode normalization which may
| convert it to two codepoints.
| ninkendo wrote:
| > Unicode the complete, total and utter clusterfuck that it is.
|
| Yikes, does it really deserve that much derision? They're
| trying to standardize _all written human language_ here. I
| think they've done a fantastic job. Pre-Unicode you had to
| worry about what code page a document had, and computers from
| different countries couldn't interoperate. The work the
| consortium does is hugely important, and every decision has
| extremely complex tradeoffs. Composed characters make a lot of
| sense, and there's a case to be made that it was the
| right call. The attitude of "this one thing I don't like makes
| the whole thing a complete clusterfuck" is something I wish
| fewer engineers would have.
| spacechild1 wrote:
| Next time read the whole article before accusing the author of
| incompetence!
|
| However, the author could have added a small note, e.g.
| "(Unicode normalization will be covered in a later section.)",
| to prevent knowledgeable readers from rage quitting :)
| wgjordan wrote:
| The author explains normalization in its own entire section
| several paragraphs later (Why is "A" !== "A" !== "A"?).
| ilyt wrote:
| > Unicode is locale-dependent
|
| Well, there is a new fact that I learned and immediately hated.
|
| The fuck were authors thinking...
|
| I am now firmly convinced people developing unicode hate
| developers. I suspected it before just due to how messy it was
| (same character having different encodings ? Really ? Fuck you),
| but this cements it.
| JohnFen wrote:
| > people developing unicode hate developers
|
| Or at least they have a vicious indifference to us. Unicode is
| a nightmare.
| wffurr wrote:
| Yeah this is a big problem for me right now trying to pick
| fonts and characters for CJK. I have a bunch of bugs to fix
| that will require sending the locale down to the text
| itemization code.
| zajio1am wrote:
| Unicode is not locale-dependent, just mapping from graphemes to
| (font) glyphs is locale/font dependent.
| mcfedr wrote:
| The author shows how to-upper and to-lower change according
| to locale
|
| But making it clear which glyph to use is also a key feature!
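As an illustration of why locale matters here: Python's str methods apply only the locale-independent default case mappings, which is itself a demonstration of the problem. A sketch:

```python
# str.upper() uses the locale-independent Unicode default mappings,
# so some results surprise:
assert "stra\u00dfe".upper() == "STRASSE"  # ß uppercases to SS
assert "\ufb01le".upper() == "FILE"        # the fi ligature expands

# The default mapping is wrong for Turkish, where "i" should
# uppercase to dotted "İ" (U+0130); getting that right requires a
# locale-aware library such as ICU.
assert "i".upper() == "I"                  # locale is ignored
```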
| JonChesterfield wrote:
| Well C is locale dependent. And one does not break backwards
| compatibility with C for fear of badness. So naturally Unicode
| must be locale dependent too.
| layer8 wrote:
| This is pretty good. One thing I would add is to mention that
| Unicode defines algorithms for bidirectional text, collation
| (sorting order), line breaking and other text segmentation (words
| and sentences, besides grapheme clusters). The main point here is
| to know that there are specifications one should take into
| account when topics like that come up, instead of just inventing
| your own algorithm.
| nabla9 wrote:
| >3 Grapheme Cluster Boundaries
|
| >It is important to recognize that what the user thinks of as a
| "character"--a basic unit of a writing system for a language--may
| not be just a single Unicode code point. Instead, that basic unit
| may be made up of multiple Unicode code points. To avoid
| ambiguity with the computer use of the term character, this is
| called a user-perceived character. For example, "G" + grave-
| accent is a user-perceived character: users think of it as a
| single character, yet is actually represented by two Unicode code
| points. These user-perceived characters are approximated by what
| is called a grapheme cluster, which can be determined
| programmatically.
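Python's stdlib has no grapheme segmentation, but the "G + grave accent" case from the quote can be approximated by treating combining marks as extending the previous unit; a rough sketch that covers only a small subset of the UAX #29 rules (no ZWJ, no Hangul jamo, no regional indicators):

```python
import unicodedata

def approx_grapheme_count(text: str) -> int:
    """Rough cluster count: a combining mark extends the previous
    unit. Only an approximation of the UAX #29 boundary rules."""
    count = 0
    for ch in text:
        if count and unicodedata.combining(ch):
            continue  # mark attaches to the preceding base character
        count += 1
    return count

s = "G\u0300"                          # G + COMBINING GRAVE ACCENT
assert len(s) == 2                     # two code points
assert approx_grapheme_count(s) == 1   # one user-perceived character
```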
| sebstefan wrote:
| Oh my god, is there ever anything simple about unicode
| WorldMaker wrote:
| Compared to the ancient world of EBCDIC versus ASCII versus
| various ISO standards versus country-defined encodings versus
| Extended EBCDIC code pages versus Extended ASCII code pages
| which varied depending on operating system, nearest flag
| pole, network adapter, time of day, etc...: Unicode will
| forever be a simpler walk in the park.
|
| Its complexity is a relief compared to where we've been.
| It's definitely not simple, but it will forever be far
| simpler than what our grandmothers had to work with if they
| were writing international software.
| nottorp wrote:
| > These user-perceived characters are approximated by what is
| called a grapheme cluster, which can be determined
| programmatically.
|
| From everything i've read or heard about unicode, "determined
| programmatically" is false?
| qwerty456127 wrote:
| > The rest, about 800,000 code points, are not allocated at the
| moment. They could become characters in the future.
|
| Why is Tengwar still not in Unicode officially? What's the
| problem with it?
| teddyh wrote:
| Tengwar is in the Under-ConScript Unicode Registry:
| <https://www.kreativekorp.com/ucsur/>
| qwerty456127 wrote:
| The ConScript Unicode Registry is a volunteer project to
| coordinate the assignment of code points in the Unicode
| Private Use Areas (PUA). Why does tengwar have to be in the
| PUA, why not make it a first-class charset? It's not just a
| minor conlang a small group of geeks invented on a weekend,
| it's a well-established piece of the modern culture, isn't
| it?
| badcppdev wrote:
| To save other people the google: Tengwar is probably not in
| unicode because it is a fictional script from a book.
| zajio1am wrote:
| While U+A66E multiocular O can be found in just one
| manuscript, and it is still in Unicode:
| https://en.wikipedia.org/wiki/Multiocular_O
| bigstrat2003 wrote:
| Honestly, I wouldn't have thought that would be an issue to
| the Unicode folks. They have already allowed things (emoji)
| that have no place being in the standard, as they _aren't
| even text_.
| hot_gril wrote:
| I feel like Apple pushed the consortium to add a ton of
| useless emojis for whatever their own reasons were.
| hot_gril wrote:
| Looks like Georgian
| qwerty456127 wrote:
| I would wonder how many people are here who have never seen
| Tengwar. I would bet that's a minuscule minority.
| JohnFen wrote:
| I've never even heard of it before.
| red_trumpet wrote:
| That's a higher bar than having seen it, I think. I also
| had to look it up, but as soon as I saw the images in
| Wikipedia I knew that it's from Lord of the Rings.
| hot_gril wrote:
| It is. But the even higher bar is that you actually write
| in this script.
| WorldMaker wrote:
| The problem with Tengwar (and Klingon) is the problem with a
| lot of pop culture right now: copyright. The Tolkien Estate
| still exists and still litigiously upholds what it can of their
| copyright terms. CBS Viacom (Paramount) still claim a copyright
| interest in all the written forms of Klingon.
|
| Copyright is not technically violated simply by _encoding_ the
| characters into a plane such as one of Unicode's; that's an
| easy open-and-shut fair use, but Unicode principals have stated
| they don't want to pass the copyright burden on to font authors
| either, who could be sued if they tried to draw some of
| those characters. (Why encode something that fonts aren't
| allowed to produce?) That _should_ also be fair use, but the
| law is complicated and copyright still so often today leans in
| favor of the Estates and major Corporations rather than fair
| use and the public commons.
|
| (ETA: I'm hugely in favor that "conlang", constructed language,
| scripts such as these _should_ be encoded by Unicode. I wish
| someday we fix the copyright problems of them.)
| thyselius wrote:
| Wonderful to learn more about Unicode.
|
| Does anyone know how to write a function (preferably in swift) to
| remove emoji? This is surprisingly hard (if the string can be any
| language, like English or Chinese).
|
| There's been multiple attempts on Stackoverflow but they're all
| missing some of them, as Unicode is so complex.
| favorited wrote:
| Here's a 1-liner, producing the string "text 0123 汉字 ":
|
| `String("text EMOJI 0123 汉字 ".unicodeScalars.filter({
| !$0.properties.isEmojiPresentation }))`
|
| (I've had to substitute EMOJI for a smiley face, because HN is
| bad at text encoding.)
| Retr0id wrote:
| It's not a bug, HN deliberately strips emojis.
| thyselius wrote:
| Thanks. Unfortunately both .isEmojiPresentation && .isEmoji
| leave many emojis out, like the red heart and many others.
| astrange wrote:
| Those aren't inherently emojis, the font just shows them as
| emojis, so you'd have to render the text.
| favorited wrote:
| Correct. `isEmojiPresentation` checks if, per the Unicode
| standard, this scalar should default to an emoji
| presentation.
| fiedzia wrote:
| I haven't tried but use libicu (icu). Split text into graphemes
| and remove anything starting with code points that have the
| Zsye script. There should be Swift bindings.
| hoseja wrote:
| https://tonsky.me/blog/unicode/overview@2x.png
|
| Wow, what abominable mix of decimal and hexadecimal.
| Karellen wrote:
| Where are the decimal numbers in that image?
| morelisp wrote:
| What comes after 9FFFF?
| Karellen wrote:
| Good catch
|
| doh.
| bajsejohannes wrote:
| It goes 90000..9FFFF then 100000..10FFFF. The latter should
| have been A0000..AFFFF.
|
| So the author is using hex for the last four digits and
| decimal for the remaining ones.
| tonsky wrote:
| oops :) fixed, thanks!
| titzer wrote:
| > The problem is, you don't want to operate on code points. A
| code point is not a unit of writing; one code point is not always
| a single character. What you should be iterating on is called
| "extended grapheme clusters", or graphemes for short.
|
| It's best to avoid making overly-general claims like this. There
| are plenty of situations that warrant operating on code points,
| and it's likely that software trying and failing to make sense of
| grapheme clusters will result in a worse screwup. Codepoints
| are probably the _best_ default. For example, it probably makes
| the most sense for programming languages to define strings as
| arrays of code points, and not characters or 16-bit chunks or an
| encoding, or whatever.
| Dylan16807 wrote:
| Situations such as?
|
| Sometimes editing wants to go inside clusters but that's not
| code-point based either.
|
| I'd say that in a big majority of situations, code that is
| indexing an array with code points is either treating the
| indexes as opaque pointers or is doing something wrong.
| hgs3 wrote:
| > There are plenty of situations that warrant operating on code
| points
|
| Absolutely correct. All algorithms defined by the Unicode
| Standard and its technical reports operate on the code point.
| All 90+ character properties defined by the standard are
| queried for with the code point. The article omits this
| information and ironically links to the grapheme cluster break
| rules which operate on code points.
| Dylan16807 wrote:
| The article doesn't say not to use code points, it says you
| should not be iterating on them.
|
| Very rarely will you be implementing those algorithms. And if
| you're looking at character properties, the article says you
| should be looking at multiple together, which is correct.
| hgs3 wrote:
| > And if you're looking at character properties, the
| article says you should be looking at multiple together,
| which is correct.
|
| I don't see where the article mentions Unicode character
| properties [1]. These properties are assigned to individual
| characters, not groups of characters or grapheme clusters.
|
| > Very rarely will you be implementing those algorithms.
|
| True, but character properties _are_ frequently used, i.e.
| every time you parse text and call a character
| classification function like "isDigit" or "isControl"
| provided by your standard library you are in fact querying
| a Unicode character property.
|
| [1] https://unicode.org/reports/tr44/#Properties
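In Python these per-code-point properties surface through the stdlib unicodedata module; for instance:

```python
import unicodedata

# General_Category is a per-code-point property:
assert unicodedata.category("5") == "Nd"       # decimal digit
assert unicodedata.category("\u0301") == "Mn"  # combining mark
assert unicodedata.category("\n") == "Cc"      # control character

# str.isdigit() and friends are built on these properties, which is
# why they accept non-ASCII digits:
assert "\u0665".isdigit()                # ARABIC-INDIC DIGIT FIVE
assert unicodedata.digit("\u0665") == 5  # its numeric value
```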
| Dylan16807 wrote:
| > These properties are assigned to individual characters,
| not groups of characters or grapheme clusters.
|
| But you need to deal with the whole cluster. You can't
| just look at the properties on a single combining
| character and know what to do with it.
|
| If the article's saying to iterate one cluster at a time,
| then if you're doing properties a direct consequence is
| that you should be looking at the properties of specific
| code points per cluster or all of them.
| hgs3 wrote:
| The Unicode Standard does not specify how character
| properties should be extracted from a grapheme cluster.
| Programming languages that define "character" to mean
| grapheme cluster (like Swift) need to establish their own
| ad-hoc rules.
|
| As others have pointed out in this thread, the article is
| full of the authors own personal opinions. The author
| suggests iterating text as grapheme clusters, but fails
| to consider that this breaks tokenizers, e.g. a tokenizer
| for a comma-separated list [1] won't see the comma as
| "just a comma" if the value after it begins with a
| combining character.
|
| [1] https://en.wikipedia.org/wiki/Comma-separated_values
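The tokenizer point is easy to demonstrate: a code-point-level split still sees the comma as its own unit, while grapheme clustering would glue a following combining mark onto it. A sketch:

```python
import unicodedata

# A field whose value starts with a combining acute accent:
row = "a,\u0301b"

# Code-point iteration finds the bare comma as a separate unit:
assert row.split(",") == ["a", "\u0301b"]

# But U+0301 is a combining mark, so under UAX #29 grapheme
# clustering it extends the preceding character: an iterator over
# clusters would yield ",\u0301" as one unit and never a bare ",".
assert unicodedata.combining("\u0301") != 0
```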
| PeterisP wrote:
| If some tokenizer of a comma-separated list treats the
| comma (I'm assuming any 0x2C byte) as "just a comma" even
| if the value after it begins with a combining character,
| that's a broken, buggy tokenizer, and one that can
| potentially be exploited by providing some specifically
| crafted unicode data in a single field that then causes
| the tokenizer to misinterpret field boundaries. If you
| combine a character with something, that's not the same
| character anymore - it's not equal to that, it's not that
| separator anymore, and you can't tell that unless/until
| you look at the following codepoints.
|
| If anything, your example is an illustration why it's
| dangerous to iterate over codepoints and not graphemes.
| Dylan16807 wrote:
| > The Unicode Standard does not specify how character
| properties should be extracted from a grapheme cluster.
| Programming languages that define "character" to mean
| grapheme cluster (like Swift) need to establish their own
| ad-hoc rules.
|
| Right. Which means not just iterating by code point.
|
| > The author suggests iterating text as grapheme
| clusters, but fails to consider that this breaks
| tokenizers, e.g. a tokenizer for a comma-separated list
| [1] won't see the comma as "just a comma" if the value
| after it begins with a combining character.
|
| I don't think they're talking about tokenizers. It's a
| general purpose rule.
|
| Also I would argue that a CSV file with non-attached
| combining characters doesn't qualify as "text".
| jcranmer wrote:
| There's one part of this document that I would push extremely
| hard against, and that's the notion that "extended grapheme
| clusters" are the one true, right way to think of characters in
| Unicode, and therefore any language that views the length in any
| other way is doing it wrong.
|
| The truth of the matter is that there are several different
| definitions of "character", depending on what you want to use it
| for. An extended grapheme cluster is largely defined on "this
| visually displays as a single unit", which isn't necessarily
| correct for things like "display size in a monospace font" or
| "thing that gets deleted when you hit backspace." Like so many
| other things in Unicode, the correct answer is use-case
| dependent.
|
| (And for this reason, String iteration should be based on
| codepoints--it's the fundamental level on which Unicode works,
| and whatever algorithm you want to use to derive the correct
| answer for your purpose will be based on codepoint iteration.
| hsivonen's article (https://hsivonen.fi/string-length/), linked
| in this one, does try to explain why extended grapheme clusters
| is the wrong primitive to use in a language.)
| mananaysiempre wrote:
| > thing that gets deleted when you hit backspace
|
| Is there a canonical source for this part, by the way? Xi
| copied the logic from Android[1] (per the issue you linked
| downthread), which is reasonable given its heritage but seems
| suboptimal generally, and I vaguely remember that CLDR had
| something to say about this too, but I don't know if there's
| any sort of consensus here that's actually written down
| anywhere.
|
| [1] https://github.com/xi-editor/xi-editor/pull/837
| pif wrote:
| > An extended grapheme cluster is largely defined on "this
| visually displays as a single unit", which isn't necessarily
| correct for things like "display size in a monospace font" or
| "thing that gets deleted when you hit backspace."
|
| I'm sorry, but I fail to see how "This visually displays as a
| single unit" could ever differ from "Display size in a
| monospace font" or "Thing that gets deleted when you hit
| backspace".
| yeputons wrote:
| Here is a full article of such examples:
| https://manishearth.github.io/blog/2017/01/14/stop-
| ascribing...
|
| Discussion on HN:
| https://news.ycombinator.com/item?id=31858311
| RichieAHB wrote:
| Here's a good example of the test cases used for backspaces
| in Android[1]. It's definitely more involved than just
| deleting a grapheme cluster.
|
| [1] https://android.googlesource.com/platform/frameworks/base
| /+/...
| mattnewton wrote:
| > Display size in a monospace font
|
| Some clusters are going to be multiple characters wide.
|
| > thing that gets deleted when you hit backspace
|
| Some clusters are meant to be composed of multiple
| keystrokes and a natural editing experience would allow users
| to delete the last stroke.
|
| Look into how Korean works.
| jfultz wrote:
| A couple of cases I'm aware of...
|
| * Coding ligatures often display as a single glyph (maybe
| occupying a single-width character space, or maybe spread out
| over multiple spaces), but are composed of multiple characters.
| The ligature may "look" like a single character for purposes
| of selection and cursoring, but it can act like multiple
| characters when subject to backspacing.
|
| * Similarly, I've seen keyboard interfaces for various
| languages (e.g., Hindi) where standard grapheme cluster rules
| bind together a group of code points, but the grapheme
| cluster was composed from multiple key presses (which
| typically add one code point each to the cluster). And in
| some such interfaces I've seen, the cluster can be decomposed
| by an equal number of backspace presses. I don't have a good
| sense of how much a monospaced Hindi font makes sense, but
| it's definitely a case where a "character" doesn't always act
| "character-like".
| jcranmer wrote:
| See, e.g., https://github.com/xi-editor/xi-editor/issues/655
| for why backspace isn't the same as extended grapheme
| cluster.
|
| As for "display size in monospace font", emojis and CJK
| characters are usually two units wide, not one (although, to
| be honest, there's a fair amount of bugs in the Unicode
| properties that define this).
| layer8 wrote:
| In terminals there is a distinction between single-width and
| double-width characters (east-asian characters, in
| particular). E.g. the three characters AMei
| C
|
| would take up the width of four ASCII monospace characters,
| the "Mei " being double-width.
|
| Similarly, for composed characters like say the ligature
| "ﬀ" (U+FB00), you may want to backspace as if it was two
| "f"s (which logically it is, and decomposes to two "f"s in
| NFKD normalization).
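That single/double-width distinction comes from the East_Asian_Width property; a sketch in Python (real terminals also have to handle ambiguous-width and combining characters):

```python
import unicodedata

assert unicodedata.east_asian_width("A") == "Na"      # narrow
assert unicodedata.east_asian_width("\u7F8E") == "W"  # 美: wide

def terminal_width(text: str) -> int:
    """Approximate display columns: Wide and Fullwidth characters
    take two cells, everything else one. An approximation only."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

assert terminal_width("A\u7F8EC") == 4  # A + wide 美 + C
```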
| orphea wrote:
| If you type "a", combine it with "´", then change your mind
| and hit backspace, you probably want to end up with "a" even
| though "á" was a thing "visually displayed as a single
| unit".
| Findecanor wrote:
| Most European keyboard layouts have it the other way
| around: first press a "dead key" for the diacritic mark and
| then the letter to apply it to.
|
| Where some layouts may require this method for some
| characters, another keyboard layout may have the same
| character on a dedicated key.
|
| The program receives the combined character as one unit,
| and does not need to be aware of different keyboard
| layouts.
| umanwizard wrote:
| > Most European keyboard layouts have it the other way
| around: first press a "dead key" for the diacritic mark
| and then the letter to apply it to.
|
| Which ones? At least the French and German ones don't
| work like that: there is no composing, just separate keys
| for all the characters with diacritics that appear in the
| language.
| Reefersleep wrote:
| Danish is one.
| riggsdk wrote:
| Danish keyboards also require you to press '¨' first and
| then 'o' to produce 'ö'.
| Sardtok wrote:
| But do you really use ö much over o?
| riggsdk wrote:
| No, but I do once in a while (very rarely) write a little
| in german that might use that character.
| mostlylurks wrote:
| Do the danes not have the mechanism that is found on
| Finnish keyboard layouts, where pressing AltGr+Ö yields Ø
| and AltGr+Ä yields Æ, except in reverse?
| Findecanor wrote:
| Those mappings are not universal. They are present under
| Linux but not on MS-Windows. I don't know about Mac, but
| the layout has in the past been slightly different there
| from Windows also.
| riggsdk wrote:
| For me that doesn't work on Windows. Those key
| combinations don't seem to do anything.
| greenshackle2 wrote:
| Which French layout would that be? I've never seen a
| French keyboard where this is true. French is my native
| language. On layouts I'm familiar with, _some_ accented
| letters have separate keys like é, but not all; the
| others are made by composing an accent key with a letter.
| umanwizard wrote:
| You're right, sorry. I had forgotten about the ^ and "
| keys.
| tpm wrote:
| Slovak or Czech for example.
| mostlylurks wrote:
| The nordic layout(s) offer such a mechanism to allow
| people to type in letters that you'll find in various
| other European languages, even though the extra letters
| used in the languages themselves (ÅÄÖÆØ) are present as
| their own keys. Interestingly, the Swedish layout has no
| dedicated é key, although é occurs in some Swedish words.
| gumby wrote:
| In Swedish, Å, Ä, and Ö are actual letters of the
| alphabet, while é is used in foreign words. Like the
| English dieresis (e.g. in coöperate) is essentially
| unknown in the US and only occasionally used in England,
| so doesn't give rise to characters with dieresis on the
| keyboard.
| gdprrrr wrote:
| On the German Layout the backtick (next to the 1 key) is
| a dead key.
| TacticalCoder wrote:
| Nitpicking but most French keyboards have both ready-made
| keys for "é" and the few other commonly used characters _and_
| composing: hitting either '¨' or '^'. For example hitting
| '¨' then 'e' produces "ë".
| umanwizard wrote:
| You are right, thanks.
| mananaysiempre wrote:
| > first press a "dead key" for the diacritic mark and
| then the letter to apply it to.
|
| That being exactly the way "floating diacritics" in ISO
| 2022 (or properly one of its Latin encodings, T.51 = ISO
| 6937) work, amusingly. I wonder which came first. (Yes, I
| know that a<BS>` came first, the ASCII spec even says
| that this should give you an accented character IIRC. Or
| perhaps it was one of the other "don't call it ASCII"
| specs--ISO 646? IA5?..)
| [deleted]
| kdmccormick wrote:
| But then if I type "á" directly (through, say, a mobile
| keyboard) and hit backspace, I'd get "a", which doesn't
| seem _terrible_ but does feel a little off.
|
| Seems like the right answer for codepoints vs graphemes,
| unfortunately, is dependent on the context.
| bombela wrote:
| I expect to delete the character "á". And I prefer
| consistency too so I expect "œ" and "<emoji>" and
| "<emoji>" to be deleted as one unit.
|
| edit: emojis are filtered by HN
| pests wrote:
| Even the emojis that you create by combining multiple
| emojis? Type one emoji, then a second, it merges into
| one. What happens when you backspace?
| mostlylurks wrote:
| As a European, no I don't. á isn't used in my language, but
| my layout offers it via a dead-key-then-base-letter
| mechanism, and it is correctly treated as one unit when
| pressing backspace, anything else would feel incorrect. It
| would be even worse if such a thing happened for the
| letters that my layout offers individual buttons for (ÅÄÖ).
| Some languages do treat these as letters with attached
| modifiers, but many, including mine, treat them as
| indivisible letters that just happen to look similar to
| some others for historical reasons, and to treat them as
| combinations of base letters and diacritics would be
| completely incorrect, even if you typed them in using the
| dead-key-then-base-letter mechanism for some reason.
| haberman wrote:
| In that case, it sounds like `length` on Unicode strings simply
| shouldn't exist, since there is no obvious right answer for it.
| Instead there should be `codepointCount`, `graphemeCount`, etc.
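The distinct counts are easy to surface in Python; a sketch using what I believe is the "man facepalming: medium-light skin tone" ZWJ sequence (written with escapes since the thread strips emoji):

```python
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # facepalm + skin tone + ZWJ + male sign + VS16

assert len(s) == 5                           # code points
assert len(s.encode("utf-8")) == 17          # UTF-8 bytes
assert len(s.encode("utf-16-le")) // 2 == 7  # UTF-16 code units
# Grapheme clusters: 1 - but counting those needs a segmentation
# library, which is exactly the point.
```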
| astrange wrote:
| String iteration should be based on whatever you want to
| iterate on - bytes, codepoints, grapheme clusters, words or
| paragraphs. There's no reason to privilege any one of these,
| and Swift doesn't do this.
|
| "Length" is a meaningless query because of this, but you might
| want to default to whatever approximates width in a UI label,
| so that's grapheme clusters. Using codepoints mostly means you
| wish you were doing bytes.
| b3morales wrote:
| > There's no reason to privilege any one of these, and Swift
| doesn't do this.
|
| Strange thing to say: Swift String count property is the
| count of extended grapheme clusters. The documentation is
| explicit:
|
| > A string is a collection of _extended grapheme clusters_ ,
| which approximate human-readable characters. [emphasis in
| original]
| astrange wrote:
| The length/count property was added after people asked for
| it, but it wasn't originally in the String revamp, and it
| provides iterators for all of the above. .count also only
| claims to be O(n) to discourage using it.
| lucideer wrote:
| I'm not Korean but seeing that said of the Hangul example
| definitely made me pause - I doubt Koreans think of that
| example as a single grapheme (open to correction), though it is
| an excellent example all the same since it demonstrates the
| complexity of defining "units" consistently across language.
|
| It reminds me a little of Open Street Map's inconsistent
| administrative hierarchies ("states", "countries", "counties",
| etc. being represented at different administrative "levels" in
| their hierarchy for each geographical area), and how that
| hinders consistency in styling - font size, zoom levels, etc.
| being generally applied by level.
| hgs3 wrote:
| Everybody loves to debate what "character" means but nobody
| ever consults the standard. In the Unicode Standard a
| "character" is an abstract unit of textual data identified by a
| code point. The standard never refers to graphemes as
| "characters" but rather as _user-perceived characters_ which
| the article omits.
| raphlinus wrote:
| Agreed. And one more consideration is that (extended) grapheme
| cluster boundaries vary from one version of Unicode to another,
| and also allow for "tailoring." For example, should Thai "อำ" be
| one grapheme cluster or two? It's two on Android but one by
| Unicode recommendation, which is also the behavior on mac. So in
| applications where a query such as length needs to have one
| definitive answer which cannot change by context, counting
| (extended) grapheme clusters is the wrong way to go.
| riggsdk wrote:
| There are libraries that help with iterating both code-points
| and grapheme clusters... - but are there any of them that can
| help decide what to do for example when pressing backspace
| given an input string and a cursor position? Or any other text
| editing behavior. This use-case-dependent behavior must have
| some "correct" behavior that is standardized somewhere?
|
| Like a way to query what should be treated like a single
| "symbol" when selecting text? Basically something that could
| help out users making simple text editors. There are so many
| bad implementations out there that do it incorrectly, so there
| must be some tools/libraries to help with this? Not only for
| actual applications but for people making games as well where
| you want users to enter names, chat or other text. Not all
| platforms make it easy (or possible) to embed a fully fledged
| text editing engine for those use-cases.
|
| I can imagine that typing a multi-code-point character manually
| by hand would allow the user to undo their typing mistake by a
| single backspace press when they are actively typing it, but
| after that if you return to the symbol and press backspace that
| it would delete the whole symbol (grapheme cluster).
|
| For example if you manually entered the code points for the
| various family combination emojis (mother, son, daughter) you
| could still correct it for a while - but after the fact the
| editor would only see it as a single symbol to be deleted with
| a single backspace press?
|
| Or typing 'o' + '"' to produce 'ö' but realizing you wanted to
| type 'ô'; there just one backspace press would revert it to 'o'
| again and you could press '^' to get the 'ô'. (Not sure that is
| the way in which you would normally type those characters but
| it seems possible to do it with unicode that way.)
| gumby wrote:
| > Or typing 'o' + '"' to produce 'ö' but realizing you wanted
| to type 'ô', there just one backspace press would revert it
| to 'o' again and you could press '^' to get the 'ô'.
|
| This is a good example because in German I would expect 'o' +
| '"' + <delete> to leave no character at all while in French I
| would expect 'e' + '`' + <delete> to leave the 'e' behind
| because in my mind it was a typo.
|
| The rendering of brahmic- and arabic-derived scripts makes
| these choices even more interesting.
| posix86 wrote:
| But typing "ö" (e.g. swiss keyboard) and pressing delete &
| getting an 'o' would be annoying af
| riggsdk wrote:
| I realize that the editor would have to keep track of how
| the character was entered for this to work. If you made the
| character with a single keypress it would only make sense
| that backspace also undid the entire character. Only if you
| created the character from multiple keypresses would it make
| sense to "undo" only part of it with backspace (at least
| until you move away from the character).
| gumby wrote:
| Definitely agree with that! I use a US kbd (incl on
| phone) no matter what language I'm writing in. A little
| annoying but switching kbd layouts is more disruptive for
| me.
| jraph wrote:
| Same for a French keyboard with éèàù, which are all typed
| using one key. But even êôûâëïöü, all typed using at least
| two keys, if not 3 with a compose key (from memory, I'm
| using a phone). Everybody is used to the way it has been
| working on all OSes.
| makapuf wrote:
| In French, e is a single character issued by a single
| keypress on a French keyboard, like e, or +. (Note that A
| is shift+a). Why should it need two backspaces? If you
| press e+` well you have e`, not e.
| NikolaNovak wrote:
| I am assuming that means "on a French keyboard", not "in
| French". I have a US keyboard and live in Canada... every
| now and then it thinks I'm typing French, and the keyboard
| indeed behaves in a way that some vowel plus some quotation
| mark gives me some other character (that I don't need :)
| jcranmer wrote:
| Some platforms (e.g., Android) have methods specifically for
| asking how to edit a string following a backspace. However,
| there's no standard Unicode algorithm to answer the question
| (and I strongly suspect that it's something that's actually
| locale-dependent to a degree).
|
| On further reflection, probably the best starting point for
| string editing on backspace is to operate on codepoints,
| _not_ grapheme clusters. For most written languages, the
| various elements that make up a character are likely to be
| separate codepoints. In Latin text, diacritics are generally
| precomposed (I mean, you can have a + combining diacritic as
| opposed to precomposed á in theory, but the IME system is
| going to spit out á anyways, even if dead keys are used). But
| if you have
| Indic characters or Hangul, the grapheme cluster algorithm is
| going to erroneously combine multiple characters into a
| single unit. The issue is that the biggest false positive for
| a codepoint-based algorithm is emoji, and if you're a
| monolingual speaker whose only exposure to complex written
| scripts is Unicode emoji, you're going to incorrectly
| generalize it for all written languages.
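| A hedged sketch of the codepoint-wise backspace described above
| (the function is made up for illustration, not Android's actual
| API), showing both the common Latin case and the emoji false
| positive:

```python
def backspace(buf: str) -> str:
    """Delete the last code point -- a sketch, not any platform's API."""
    return buf[:-1]

# Precomposed Latin text: one backspace removes one perceived character.
assert backspace("caf\u00e9") == "caf"   # é is the single code point U+00E9

# Multi-codepoint emoji: only the last component is removed.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man+ZWJ+woman+ZWJ+girl
assert len(backspace(family)) == 4       # man, ZWJ, woman, ZWJ remain
```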
| layer8 wrote:
| Behavior that depends on whether you edited something else in
| between, or that depends on timing, is just bad. Either
| always backspace grapheme clusters, or else backspace
| characters, possibly NFC-normalized. I could also imagine
| having something like Shift+Backspace to backspace NFKD-
| normalized characters when normal Backspace deletes grapheme
| clusters.
|
| As for selection and cursor movement, grapheme clusters would
| seem to be the correct choice. Same for Delete. An editor may
| also support an "exploded" view of separate characters (like
| WordPerfect Reveal Codes) where you manipulate individual
| characters.
| PeterisP wrote:
| I'd argue that you must use grapheme clusters for text
| editing and cursor position, because there are popular
| characters (like the ö used as an example above) which can be
| either one or two codepoints depending on the normalization
| choice, but the difference is invisible to the user and
| should not matter to the user, so any editor should behave
| _exactly_ the same for ö as U+00F6 (LATIN SMALL LETTER O WITH
| DIAERESIS) and ö as a sequence of U+006F (LATIN SMALL LETTER
| O) and U+0308 (COMBINING DIAERESIS).
|
| Furthermore, you shouldn't assume that there is any
| relationship between how unicode constructs a combined
| character from codepoints and how that character is typed;
| even at the level of typing you're _not_ typing unicode
| codepoints - they're just a technical standard
| representation of "text at rest"; unicode codepoints do not
| define an input method. Depending on your language and
| device, a sequence of three or more keystrokes may be used to
| get a single codepoint, or a dedicated key on the keyboard or
| a virtual button may spawn a combined character of multiple
| codepoints as a single unit; you definitely can't assume that
| the "last codepoint" corresponds to the "last user action"
| even if you're writing a text editor - much of that can
| happen before your editor receives that input from e.g. OS
| keyboard layout code; your editor won't know whether I input
| that ö from a dedicated key, a 'chord' of the 'o' key with a
| modifier, or a sequence of two keystrokes (and if so, whether
| 'o' was the first keystroke or the second, opposite of how
| the unicode codepoints are ordered).
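| The two encodings of ö discussed above can be checked directly;
| a small sketch (Python) showing that naive comparison
| distinguishes them until you normalize:

```python
import unicodedata

composed   = "\u00f6"    # ö as one code point: LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"   # ö as two: o + COMBINING DIAERESIS

assert composed != decomposed                               # raw strings differ
assert unicodedata.normalize("NFC", decomposed) == composed  # compose
assert unicodedata.normalize("NFD", composed) == decomposed  # decompose
```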
| mananaysiempre wrote:
| > I'd argue that you must use grapheme clusters for text
| editing and cursor position
|
| Korean packs syllables into Han-script-like squares, but
| they are unmistakably composed of alphabetic letters, and
| are both typed and erased that way (the latter may depend
| on system configuration), yet the NFC form has only a
| single codepoint per syllable ( _a fortiori_ a single
| grapheme cluster). Hebrew vowel markings, where used, are
| (reasonably) considered part of a grapheme cluster but
| nevertheless erased and deleted separately. In both of
| those cases, pressing backspace will erase less than
| pressing shift-left, backspace; that is, cursor movement
| and backspace boundaries are different.
|
| There IIRC are also scripts that will have a vowel both
| pronounced and encoded in the codepoint stream _after_ the
| syllable-initial consonant but written _before_ it; and
| ones where some parts of a syllable will _enclose_ it. I
| don't even want to think how cursor movement works there.
|
| Overall, your suggestion will work for Latin, Cyrillic,
| Greek(?), and maybe other nonfancy scripts like Armenian,
| Ge'ez, or Georgian, but will absolutely crash and burn for
| others.
| WalterBright wrote:
| Quotes from the article illustrating what a train wreck Unicode
| has become:
|
| "The problem is, in Unicode, some graphemes are encoded with
| multiple code points!"
|
| "An Extended Grapheme Cluster is a sequence of one or more
| Unicode code points that must be treated as a single,
| unbreakable character."
|
| "Starting roughly in 2014, Unicode has been releasing a major
| revision of their standard every year."
|
| "Å" === "Å" "Å" === "Å" "Å" === "Å" What do you get? False?
| You should get false, and it's not a mistake.
|
| "That's why we need normalization."
|
| "Unicode is locale-dependent"
|
| The article forgot one: characters that switch presentation to
| right-to-left.
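| The comparisons quoted above use three different code point
| sequences that all render as Å; a sketch (Python) of why they
| compare unequal and how normalization reconciles them:

```python
import unicodedata

a1 = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
a2 = "\u212b"    # ANGSTROM SIGN -- a distinct code point, same glyph
a3 = "A\u030a"   # A + COMBINING RING ABOVE

assert a1 != a2 and a1 != a3 and a2 != a3    # all render as Å, none equal
nfc = {unicodedata.normalize("NFC", s) for s in (a1, a2, a3)}
assert nfc == {"\u00c5"}                      # NFC maps all three to U+00C5
```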
| dathinab wrote:
| The author seems to hate people with concentration issues
| and/or various visual impairments.
|
| That collaboration tools show the moving mouse cursors of
| other participants even if they aren't needed/wanted is
| already pretty bad, why bring it to a website?
| wffurr wrote:
| This seems like good feedback but it could really be phrased
| more constructively. I doubt the author "hates" any such thing
| and you know it too. "Didn't design with such in mind", sure.
| You can do better.
| dathinab wrote:
| yes I should have highlighted that it is satire
|
| though it also wasn't meant as constructive critique
| kazinator wrote:
| If you have to recognize a grapheme cluster, it will be easier to
| do that from a sequence of code points, than from UTF-8.
|
| It's like saying that we don't need to tokenize, because you
| never want to deal with tokens anyway, but phrase structures!
|
| Mmkay, whatever ...
___________________________________________________________________
(page generated 2023-10-02 23:00 UTC)