[HN Gopher] Emoji Under the Hood
       ___________________________________________________________________
        
       Emoji Under the Hood
        
       Author : kogir
       Score  : 376 points
       Date   : 2021-03-24 22:48 UTC (2 days ago)
        
 (HTM) web link (tonsky.me)
 (TXT) w3m dump (tonsky.me)
        
       | mannerheim wrote:
       | > Currently they are used for these three flags only: England,
       | Scotland and Wales:
       | 
       | Not quite true, you can get US state flags with this as well.
        
         | petepete wrote:
          | I've never seen them used, have they actually been implemented
          | by any of the creators?
        
           | mannerheim wrote:
           | If I type the following into ghci, I get the state flag of
           | Texas:
           | 
           | putStrLn "\x1f3f4\xe0075\xe0073\xe0074\xe0078\xe007f"
           | 
           | The first character is a flag, the last character is a
           | terminator, and in between are the tag characters
           | corresponding to the ASCII for ustx. Just take those
           | characters and subtract 0xe0000 from them, 0x75, 0x73, 0x74,
           | 0x78.
           | 
           | https://en.wikipedia.org/wiki/Tags_(Unicode_block)
           | 
           | Edit:
           | 
           | Just for fun:
           | 
           | import Data.StateCodes
           | 
           | import Data.Char
           | 
            | putStrLn $ map (map toLower . show . snd) allStates >>=
            |   \stateCode -> '\x1f3f4' : map (toEnum . (0xe0000+) . fromEnum)
            |     ("us" ++ stateCode) ++ "\xe007f"
        
             | azornathogron wrote:
             | Oh my god they've put terminal escape codes into Unicode.
        
         | TheRealSteel wrote:
         | Does this have anything to do with why Google Keyboard/Gboard
         | doesn't have the Scottish flag? It's by far my most used emoji
         | and my keyboard not having it drives me nuts.
        
           | scatters wrote:
           | Why not switch to a keyboard that does have it?
        
         | Sniffnoy wrote:
         | This may be supported in some implementations, but currently
         | only England, Scotland, and Wales are officially in the Unicode
         | data files and recommended for general interchange. You can see
         | that they're the only examples of RGI_Emoji_Tag_Sequence listed
         | here: https://www.unicode.org/Public/emoji/13.1/emoji-
         | sequences.tx...
        
       | rkangel wrote:
       | The article is great, but there is one slightly misleading bit at
       | the start:
       | 
       | > The most popular encoding we use is called Unicode, with the
       | two most popular variations called UTF-8 and UTF-16.
       | 
       | Unicode is a list of codepoints - the characters talked about in
       | the rest of the article. These live in a number space that's very
        | big (~2^21, as discussed in the article).
       | 
       | You can talk about these codepoints in the abstract as this
       | article does, but at some point you need to put them in a
       | computer - store them on disk or transmit them over a network
       | connection. To do this you need a way to make a stream of bytes
       | store a series of unicode codepoints. This is an 'encoding',
       | UTF-8 and UTF-16, UTF-32 etc. are different encodings.
       | 
       | UTF-32 is the simplest and most 'obvious' encoding to use. 32
       | bits is more than enough to represent every codepoint, so just
       | use a 32-bit value to represent each codepoint, and keep them in
       | a big array. This has a lot of value in simplicity, but it means
        | that text ends up taking up a lot of space. Most western text
        | (e.g. this page) fits in the first 128 codepoints, so for the
        | majority of values most of the bits will be 0.
       | 
        | UTF-16 is an abomination that is largely Microsoft's fault and is
        | the default unicode encoding on Windows. It is based on the fact
        | that most text in most languages fits in the first 65,536 unicode
        | codepoints - referred to as the 'Basic Multilingual Plane'. This
        | means that you can use a 16 bit value to represent most
        | codepoints, so unicode is stored as an array of 16-bit values
        | ("wide strings" in MS APIs). Obviously not _all_ Unicode values
        | fit, so there is the capability to use two 16-bit values (a
        | 'surrogate pair') to represent a codepoint. There are many
        | problems with UTF-16, but my favourite is that it really sets you
        | up for 'unicode surprises' in your code. Something in your stack
        | that assumes one code unit per character and barfs on anything
        | else is a well-known failure mode, and with UTF-8 you hit it in
        | testing fairly often. Because UTF-16 uses a single value for the
        | vast majority of common codepoints, the equivalent bug only shows
        | up in the rare surrogate-pair cases that you will inevitably only
        | discover in production.
       | 
        | UTF-8 is generally agreed to be the best encoding (particularly
        | among people who don't work for Microsoft). It is a
       | full variable length encoding, so a single codepoint can take 1,
       | 2, 3 or 4 bytes. It has lots of nice properties, but one is that
       | codepoints that are <= 127 encode using a single byte. This means
       | that proper ASCII is valid UTF-8.
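        | 
        | To make the size trade-off concrete, here is a rough ghci sketch
        | (assuming the text and bytestring packages are available; the
        | byte counts in the comments are for this particular string):
        | 
        | import qualified Data.Text as T
        | import qualified Data.Text.Encoding as TE
        | import qualified Data.ByteString as BS
        | 
        | -- 7 codepoints: ASCII, one Latin-1 char, one astral-plane emoji
        | let s = T.pack "h\xe9llo \x1F600"
        | 
        | BS.length (TE.encodeUtf8 s)     -- 11 bytes (1-4 bytes/codepoint)
        | BS.length (TE.encodeUtf16LE s)  -- 16 bytes (emoji = surrogate pair)
        | BS.length (TE.encodeUtf32LE s)  -- 28 bytes (4 bytes/codepoint)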
        
         | rectang wrote:
         | For people who want to hear more on this subject I gave a talk
         | for Papers We Love Seattle on UTF-8, its origins and evolution,
         | and how it compares against other encodings:
         | 
         | https://www.youtube.com/watch?v=mhvaeHoIE24
         | 
         | "Smiling Cat Face With Heart Eyes Emoji" plays a major role. :)
         | 
         | It doesn't cover the same ground as this wonderful post with
         | its study of variation selectors and skin-tone modifiers, but
         | it provides the prerequisites leading up to it.
         | 
          | > _UTF-16 is an abomination that is largely Microsoft's fault_
         | 
         | I think that's unfair. The problem lies more in the
         | conceptualization of "Unicode" in the late 1980s as a two-byte
         | fixed-width encoding whose 65k-sized code space would be enough
         | for the characters of all the world's living languages. (I
         | cover that here:
         | https://www.youtube.com/watch?v=mhvaeHoIE24&t=7m10s ) It turns
         | out that we needed more space, and if Asian countries had had
         | more say from the start, it would have been obvious earlier
         | that a problem existed.
        
           | rkangel wrote:
           | >> UTF-16 is an abomination that is largely Microsoft's fault
           | 
           | > I think that's unfair.
           | 
           | Fair enough. It was a moderately 'emotional' response caused
           | by some painful history of issues caused by 2-byte
           | assumptions.
           | 
            | The problem I suppose is that MS actually moved to Unicode
            | _earlier_ than most of the industry (to their credit), and
            | therefore played guinea pig in discovering what works and
            | doesn't. My complaint now is that I feel they should start a
            | migration to UTF-8 (yes, I know how challenging that would
            | be).
        
       | vanderZwan wrote:
       | > _Flags don't have dedicated codepoints. Instead, they are two-
       | letter ligatures. (...) There are 258 valid two-letter
       | combinations. Can you find them all?_
       | 
       | Well this nerd-sniped me pretty hard
       | 
       | https://next.observablehq.com/@jobleonard/which-unicode-flag...
       | 
       | That was a fun little exercise, but enough time wasted, back to
       | work.
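        | 
        | For anyone who wants to poke at this in ghci: a flag is just the
        | two ASCII letters shifted up into the regional indicator block
        | (U+1F1E6..U+1F1FF). A rough helper:
        | 
        | import Data.Char (chr, ord, toUpper)
        | 
        | -- map 'A'..'Z' onto U+1F1E6..U+1F1FF (regional indicator symbols)
        | let flag = map (chr . (+ 0x1F1A5) . ord . toUpper)
        | 
        | putStrLn (flag "se")  -- Swedish flag (if your terminal cooperates)
        | putStrLn (flag "gb")  -- Union Jack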
        
         | mercer wrote:
         | Haha, playing around with reversing flags was the first thing I
         | thought about trying.
        
           | vanderZwan wrote:
            | The surprising result (to me at least) was that out of 270
            | valid letter combinations, 105 can be reversed. The odd
            | number is easy to explain: a palindromic pair like MM => MM
            | adds a single flag to the count instead of a pair of two
            | flags, but the fact that almost two out of every five flags
            | are reversible feels pretty high to me.
        
             | SamBam wrote:
             | > but the fact that almost two out of every five flags are
             | reversible feels pretty high to me.
             | 
             | I think some letter-frequency analysis can probably explain
              | it. Given that certain letters are unlikely in both the
              | first and second slot (e.g., there are only 4
             | country codes that start with J, and 3 that end with J),
             | the letters that can be used as both first and second
             | characters are over-represented.
             | 
             | It's the same as how far more English language words can be
             | reversed to make other valid words than you would expect if
             | the letters were equally-frequent and arbitrarily arranged.
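              | 
              | A rough sketch of how you'd find them, assuming regionCodes
              | holds the valid two-letter codes (not defined here):
              | 
              | import qualified Data.Set as Set
              | 
              | -- codes whose reversal is also a valid code;
              | -- length (reversible regionCodes) gives the count
              | reversible :: [String] -> [String]
              | reversible regionCodes =
              |   let valid = Set.fromList regionCodes
              |   in  filter (\c -> reverse c `Set.member` valid) regionCodes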
        
       | breck wrote:
       | I thought I knew Emoji, but there was a lot I didn't know. Thank
        | you, a very enjoyable and enlightening read. Also, "dingbats"! I've
        | rarely seen that word since I was a kid (when I had no idea what
        | that voodoo was but loved it).
        
       | artur_makly wrote:
        | What I really want to know is the story behind how these emojis
        | came to be?! Who was tasked to come up with this sticker list of
       | symbols? What was the decision/strategy behind the selection of
       | these symbols? etc etc. it seems soooo arbitrary at first-glance.
       | 
       | And how do we as a community propose new icons while considering
       | others to be removed/replaced?
        
         | rynt wrote:
         | 99PI did a story on the process of submitting a new emoji
         | request to the Unicode Consortium that you might find
         | interesting: https://99percentinvisible.org/episode/person-
         | lotus-position...
        
           | artur_makly wrote:
           | brilliant. thank you {{ U+1F64F }}
        
       | itsmeamario wrote:
       | Great quality post. I'd like to see more things like this on HN.
       | Interesting and I learnt a lot about emojis and UTF.
        
       | kaeruct wrote:
       | I'm confused about the part saying flags don't work on Windows
       | because I can see them on Firefox (on Windows). They don't work
       | on Edge though.
        
         | tonsky wrote:
         | I guess FF ships its own version
        
       | BlueGh0st wrote:
       | I wish I could read this without getting a migraine. The
       | "darkmode" joke was funny until I realized there was no actual
       | way to turn it on.
        
         | jffry wrote:
         | Firefox's reader mode works great and includes a dark theme.
         | 
         | The icon shows up in the right side of the URL bar, but you can
         | always force it by prepending the URL, e.g.
         | about:reader?url=<url>
        
         | tobz1000 wrote:
         | https://darkreader.org/
        
           | sundarurfriend wrote:
           | I _just_ turned this off today, after one too many  "an
           | extension is slowing this page down" warnings from Firefox,
            | always from Dark Reader. It's a pretty useful addon, but
            | there are enough websites that implement their own dark mode
            | these days (I hope) that it's less necessary, and possibly
            | not worth the slowdown.
        
       | truefossil wrote:
       | I wonder why Mediterranean nations switched from ideograms to
       | alphabet as soon as one was invented. Probably they did not have
       | enough surplus grain to feed something like the Unicode
       | consortium?
        
         | kps wrote:
         | An alphabet (or syllabary, abjad, abugida) has a _small_ set of
         | symbols that can express anything, which means that it could be
         | used by people who did something other than read and write for
         | a living. Probably no accident that the first to catch on, and
         | the root of possibly all others, was spread by Phoenician
         | traders.
        
         | meepmorp wrote:
         | Hieroglyphics weren't really ideographic after a very early
         | point, because it's a pain in the ass making up new symbols for
         | every word. Very quickly, it transitioned to being largely an
         | abjad, representing only consonants. Abjads work reasonably
         | well for semitic languages, as the consonantal roots of words
         | carry the meaning and a reader would be able to fill in the
         | vowels themselves via context.
         | 
          | According to the account I've heard, it's the Greeks who
          | invented the alphabet, by accident. The Phoenician script used
         | single symbols to represent consonants, including the glottal
         | stop (and some pharyngeal consonant that would likely be
         | subject to a similar process, iirc). The glottal stop was
         | represented by aleph, and because Greek didn't have contrastive
         | glottal stops in its phoneme inventory, Greeks just interpreted
         | the vowel that followed it as what the symbol was meant to
         | represent.
         | 
          | It's a bit of a just-so story, but also completely plausible.
        
       | avipars wrote:
        | Really interesting article. Why haven't platforms banned
        | U[?][?][?][?][?][?][?] or figured out a way to parse/contain the
        | character to its container?
        
       | Hawzen wrote:
       | > The most popular encoding we use is called Unicode
       | 
        | Unicode is a character set, not an encoding. UTF-8, UTF-16, etc.
        | are encodings of that character set.
        
       | aglionby wrote:
       | Great post, entertainingly written.
       | 
       | Back in 2015, Instagram did a blog post on similar challenges
       | they came across implementing emoji hashtags [1]. Spoiler alert:
       | they programmatically constructed a huge regex to detect them.
       | 
       | [1] https://instagram-engineering.com/emojineering-part-ii-
       | imple...
        
         | lifthrasiir wrote:
         | Nowadays you can refer to UAX #31 for hashtag identifiers
         | (first specified in 2016):
         | https://www.unicode.org/reports/tr31/#hashtag_identifiers
        
       | imtiyaz wrote:
        | Never dug into these nitty-gritties before. Very well explained.
        | Thanks Nikita.
        
         | tonsky wrote:
         | You are welcome! Glad you liked it
        
       | tomduncalf wrote:
       | Really interesting and well written (and entertaining!) post. I
        | was vaguely aware of most of it but hadn't appreciated how the
        | ZWJ system for building more complex emojis out of basic ones
        | means the meaning can still be discerned even if your device
        | doesn't support the new emoji. Clever approach!
        
       | yuntei wrote:
        | And now to see how emoji rendering is completely broken: put a
        | gear (U+2699) text variant and emoji variant in some HTML, set
        | the font to Menlo in one element and Monaco in another element,
        | and then view it in Chrome, desktop Safari, and iOS Safari. Also
        | select and right-click on it in Chrome, and maybe also post it
        | into the comment section of various websites. Every single
        | combination of text variant and emoji variant will be displayed
        | in complete randomness :)
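        | 
        | For reference, the two variants are just the base character
        | followed by a variation selector; a minimal ghci sketch (how
        | each line actually renders is exactly the lottery described
        | above):
        | 
        | -- U+FE0E requests text presentation, U+FE0F emoji presentation
        | putStrLn "\x2699\xFE0E"  -- gear, text style (monochrome glyph)
        | putStrLn "\x2699\xFE0F"  -- gear, emoji style (colour glyph)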
        
       | mshenfield wrote:
       | It's a post about emojis, but I feel like I understand Unicode
       | better now?
        
       | remux wrote:
       | Great post!
        
       | woko wrote:
        | > Unicode allocates 2^21 (~2 mil) characters called codepoints.
        | Sorry, programmers, but it's not a multiply of 8.
       | 
       | Why would 2^21 not be a multiple of 2^3?
        
         | RedNifre wrote:
          | It's a typo, they meant ~2^21 instead of 2^21, because it's
          | 17*2^16, which is more like ~2^20.087. (And that's not even
          | true either, since a couple values like FFFF are forbidden)
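          | 
          | Quick sanity check in ghci:
          | 
          | 17 * 2^16              -- 1114112 codepoints, U+0000..U+10FFFF
          | logBase 2 (17 * 2^16)  -- ~20.087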
        
           | [deleted]
        
           | howtodowtle wrote:
           | Of course, 17 x 2^16 is also a multiple of 2^3:
           | 
           | 17 x 2^16 = 17 x 2^13 x 2^3
           | 
           | (reposted/edited because * was interpreted as formatting)
        
             | RedNifre wrote:
             | In case hacker news doesn't show emoji, I meant m(
             | 
             | Right, I guess I was thinking more of "not a power of 2"
             | instead of "not a multiple of 8".
             | 
             | On second thought, the author might have meant "Sorry that
             | the exponent is not a multiple of 8" as in Unicode neither
             | has 2^16 nor 2^32 code points.
        
               | kps wrote:
                | Agreed; they meant "not a power of 2^8".
        
       | ijidak wrote:
        | This is eye opening. So many frustrations I've had with emoji
        | over the years are explained by this post.
       | 
       | Big thank you to the OP.
        
         | chronogram wrote:
         | What kind of frustrations?
        
       | peteretep wrote:
       | An excellent article, although:
       | 
       | > "U" is a single grapheme cluster, even though it's composed of
       | two codepoints: U+0055 UPPER-CASE U followed by U+0308 COMBINING
       | DIAERESIS.
       | 
       | would be a great opportunity to talk about normal form, because
       | there's also a single code point version: "latin capital letter u
       | with diaeresis".
        
         | colejohnson66 wrote:
         | Does anyone know the history behind why there's two ways to
         | "encode" things like that? What's the rationale for having both
         | combining and precombined codepoints?
        
           | bombcar wrote:
           | I believe a lot of the "combined" characters are (basically)
           | from importing old codepages directly into Unicode, and they
           | did that so it would be a simple formula to convert from the
           | various codepages in use.
           | 
           | I may be wrong however.
        
       | z3t4 wrote:
        | Related: implementing Emoji support in a text editor:
        | https://xn--zta-qla.com/en/blog/editor10.htm
        
       | devadvance wrote:
       | Fantastic post that builds up knowledge along the way. A fun case
       | where this type of knowledge was relevant: when creating emoji
       | short links with a couple characters (symbols), I made sure to
       | snag both URLs: one with the emoji (codepoint + `U+FE0F`) and one
       | with just the symbol codepoint.
       | 
       | Another thing worth calling out: you can get involved in emoji
       | creation and Unicode in general. You can do this directly, or by
       | working with groups like Emojination [0].
       | 
       | [0] http://www.emojination.org/
        
         | codetrotter wrote:
         | The emojination website mentions UTC and ESC. UTC in this
         | context certainly means Unicode Technical Committee. And after
         | a bit of Googling it seems that ESC is the Unicode Emoji
         | Subcommittee.
         | 
         | Some of the suggested emojis are marked as UTC rejected, some
          | as ESC rejected or ESC pushback. Does it mean that both UTC and
          | ESC have to approve each suggested emoji?
         | 
         | And is there a place to see the reason for rejection and a
         | place to see what kind of pushback they are receiving?
        
           | lifthrasiir wrote:
           | It's complicated. So this mainly boils down to the
           | relationship between UTC and ESC.
           | 
           | ESC contributes to UTC, along with other groups (e.g. Scripts
           | Ad Hoc Group or IRG) or other individuals (you can submit
           | documents to UTC [1]), and technically UTC has a right to
           | reject ESC contributions. In reality however ESC manages a
           | huge volume of emoji proposals to UTC and distills them down
           | to a packaged submission, so UTC rarely outright rejects ESC
           | contributions. After all ESC is a part of UTC so there is a
           | huge overlap anyway (e.g. Mark Davis is the Unicode
            | Consortium _and_ ESC chair). "UTC rejected" emojis thus
            | generally come from direct proposals to UTC.
           | 
           | You can see a list of emoji requests [2] but it lacks much
           | information. This lack of transparency in the ESC process is
           | well known and was most directly criticized by contributing
           | experts in 2017 [3]. ESC responded [4] that there are so many
            | flawed proposals (with no regard to the submission criteria
           | [5]) that it is infeasible to document all of them. IMHO it's
           | not a very satisfactory answer, but still understandable.
           | 
           | [1] https://www.unicode.org/L2/
           | 
           | [2] https://www.unicode.org/emoji/emoji-requests.html
           | 
           | [3] https://www.unicode.org/L2/L2017/17147-emoji-
           | subcommittee.pd...
           | 
           | [4] https://www.unicode.org/L2/L2017/17192-response-cmts.pdf
           | 
           | [5] https://www.unicode.org/emoji/proposals.html
        
           | dgellow wrote:
           | It's for "Emoji SubCommittee" (aka ESC).
           | 
           | > Unicode Emoji Subcommittee:
           | 
           | > The Unicode Emoji Subcommittee is responsible for the
           | following:
           | 
           | > - Updating, revising, and extending emoji documents such as
           | UTS #51: Unicode Emoji and Unicode Emoji Charts.
           | 
           | > - Taking input from various sources and reviewing requests
           | for new emoji characters.
           | 
           | > - Creating proposals for the Unicode Technical Committee
           | regarding additional emoji characters and new emoji-related
           | mechanisms.
           | 
           | > - Investigating longer-term mechanisms for supporting emoji
           | as images (stickers).
           | 
           | From https://unicode.org/emoji/techindex.html
           | 
           | Edit: Welp, the parent comment was asking what "ESC" stands
           | for, but has now been updated, so this comment is now
           | outdated :)
        
             | codetrotter wrote:
             | Sorry, yeah I was originally asking about what ESC stands
             | for but found some info shortly after and updated my
             | comment. But I appreciate the additional info anyways :)
        
       | MrGilbert wrote:
       | Reading about the 2 million codepoints: Is there a good set of
       | open-source licensed fonts which cover as many codepoints as
        | possible? Just curiosity, no real use case at the moment. I don't
       | think it would make sense to create one huge font for this,
       | right?
        
         | dan-robertson wrote:
         | There's a project called, I think, gnufont but their font is a
         | bitmap font...
        
           | MrGilbert wrote:
           | Ah, thank you! Searching for "gnufont" brought me to[1],
           | which looks pretty nice indeed.
           | 
           | [1]: https://www.gnu.org/software/freefont/
        
             | dan-robertson wrote:
             | I think that's what I was thinking of. I guess they've got
             | some vector outlines now
        
         | pta2002 wrote:
         | Google's Noto Fonts[1] attempt to cover all of Unicode and are
         | released under the SIL Open Font License.
         | 
         | [1] https://www.google.com/get/noto/
        
           | MrGilbert wrote:
            | That looks incredibly complete, thank you!
        
       | mojuba wrote:
        | Can someone explain what the rules are for substring(m, n), given
        | all the madness that is today's Unicode? Is it standardized, or is
        | it up to the implementations?
        
         | _ZeD_ wrote:
          | I think the only reasonable rule for substring(m, n) is "don't"
        
           | mojuba wrote:
           | So string is no longer a "string of characters", it is in
           | fact a program (not Turing complete) that you need to
           | execute.
           | 
           | Though substring(m, n) still makes sense in at least
           | interactive text manipulation: how do you do copy/paste?
        
             | roel_v wrote:
             | "So string is no longer a "string of characters""
             | 
             | It hasn't been for 30 years.
        
             | goto11 wrote:
              | No, it is not a program - at least no more than an ASCII
              | string is a program.
              | 
              | It is just that in unicode there isn't a simple 1:1
              | correspondence between bytes, characters, and glyphs, so
              | you can't just extract an arbitrary byte-sequence from a
              | string and expect it to render correctly.
        
               | mojuba wrote:
               | > there isn't a simple 1:1 correspondence between bytes
               | and characters and glyphs
               | 
               | There isn't a simple 1:1 correspondence between anything
               | at all. The only definitive thing about Unicode strings
               | is the beginning where you should start your parsing.
               | 
               | Then the way things are supposed to be displayed to be
                | Unicode-compliant looks more like some virtual machine
               | analyzing the code. How is this different from any other
               | declarative language?
        
             | techdragon wrote:
             | Not really. A Unicode string is more like a sequence of
             | data built from simple binary structs, which belong to a
             | smallish group of valid structs. Additionally, some but not
             | all, of these structs can be used to infer the validity of
              | subsequent structs in the sequence if you're parsing in a
              | more byte-at-a-time fashion. Alternately, if you're happy
             | dealing with a little less forward compatibility and go for
             | explicit enumeration of all groups of valid bytes you can
             | be a lot more sure of things but it's harder to make this
             | method as performant as the byte-at-a-time method, which
             | given the complete ubiquity of string processing in
             | software... leads to the dominance of the byte-at-a-time
             | method.
        
             | truefossil wrote:
             | The safest path is to consider it a blob. There is some
             | library that can render it magically and that's the only
             | wise thing you can do. The internal structure is hard to
             | understand. Also, definitions change over time. So, you
             | better leave it all to professionals.
        
               | spookthesunset wrote:
               | The thing about Unicode is.... anybody who tried to do it
               | "more simple" would eventually just develop a crappier
               | version of Unicode.
               | 
               | Unicode is complex because the sum of all human language
                | is complex. Short of a ground-up rewrite of the world's
               | languages, you cannot boil away most of that
               | complexity... it has to go somewhere.
               | 
                | And even if you did manage to "rewrite" the world's
               | languages to be simple and remove accidental complexity I
               | assert that over centuries it would devolve right back
               | into a complex mess again. Why? Languages represent (and
               | literally shape and constrain) how humans think and
               | humans are a messy bunch of meat sacks living in a huge
               | world rich in weird crazy things to feel and talk about.
        
               | kps wrote:
               | There are definitely crappy things about Unicode that are
               | separate from language.
               | 
               | - Several writing systems are widely scattered across
               | multiple 'Supplement'/'Extended'/'Extensions' blocks.
               | 
               | - Operators (e.g. combining forms, joiners) are a
               | mishmash of postfix, infix, and halffix. They should have
               | been (a) in an easily tested _reserved_ block (e.g.
               | 0xF0nn for binary operators, 0xFmnn for unary), so that
               | you could _parse_ over a sequence even if it contains
               | specific operators from a later version -- i.e. separate
               | syntax from semantics, and (b) uniformly prefix, so that
                | read-ahead isn't required to find the end of a sequence
               | (and dead keys become just like normal characters).
        
         | [deleted]
        
         | RedNifre wrote:
         | Maybe have m,n refer to grapheme clusters instead of bytes/code
         | points?
        
           | mojuba wrote:
           | Apparently it's what Swift does when you try to get the
           | length of a string. Though there's no more plain substring()
           | since Swift 5, it was removed to indicate it's no longer
           | O(1). You will get different results across languages though.
        
         | EMM_386 wrote:
         | It is up to the implementation.
         | 
         | This is a good read on aspects of it:
         | 
         | https://hsivonen.fi/string-length/
        
         | thristian wrote:
         | It depends what your string is a string of.
         | 
         | Slicing by byte-offset is pretty unhelpful, given how many
         | Unicode characters occupy more than one byte. In an encoding
         | like UTF-16, that's "all of them" but even in UTF-8 it's still
         | "most of them".
         | 
         | Slicing by UTF-16 code-unit is still pretty unhelpful, since a
         | lot of Unicode characters (such as emoji) do not fit in 16
         | bits, and are encoded as "surrogate pairs". If you happen to
         | slice a surrogate pair in half, you've made a mess.
         | 
         | Slicing by code-points (the numbers allocated by the Unicode
         | consortium) is better, but not great. A shape like the "e" in
         | "cafe" could be written as U+0065 LATIN SMALL LETTER E followed
         | by U+0301 COMBINING ACUTE ACCENT. Those are separate code-
         | points, but if you slice between them you'll wind up with
         | "cafe" and an isolated acute accent that will stick to whatever
         | it's next to, like this:
         | 
         | When combining characters stick to a base character, the result
         | is called a "grapheme cluster". Slicing by grapheme clusters is
         | the best option, but it's expensive since you need a bunch of
         | data from the Unicode database to find the edges of each
         | cluster - it depends on the properties assigned to each
         | character.
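          | 
          | A tiny ghci illustration of the codepoint pitfall (a Haskell
          | String is a list of codepoints):
          | 
          | -- "café" written with a combining accent: 'e' then U+0301
          | let s = "cafe\x0301"
          | 
          | putStrLn (take 4 s)  -- "cafe": the accent has been sliced off
          | putStrLn (drop 4 s)  -- a lone accent, ready to stick to anything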
        
           | andreareina wrote:
            | Doesn't splitting by grapheme cluster also depend on which
            | version of the unicode standard you use, since new standards
           | come with new combinations?
        
         | kevincox wrote:
          | The standard answer is "don't". Just treat text as a blob. But
          | the other question is: what are you trying to accomplish?
         | 
         | - Are you trying to control the rendered length? In that case
         | the perfect solution is actually rendering the string.
         | 
         | - Are you limiting storage size? Then you need to find a good
         | split point that is <N bytes. This is probably done using
         | extended grapheme clusters. (Although this also isn't perfect)
         | 
         | I'm sure there are other use cases as well. But at the end of
         | the day try to avoid splitting text if it can be helped.
        
       | lifthrasiir wrote:
       | > One weird inconsistency I've noticed is that hair color is done
       | via ZWJ, while skin tone is just modifier emoji with no joiner.
       | Why? Seriously, I am asking you: why? I have no clue.
       | 
       | Mainly because skin tone modifiers [1] predate the ZWJ mechanism
       | [2]. For hair colors there were two contending proposals [3] [4],
       | one of which doesn't use ZWJ, and the ZWJ proposal was accepted
       | because new modifiers (as opposed to ZWJ sequences) needed the
       | architectural change [5].
       | 
       | [1] https://www.unicode.org/L2/L2014/14213-skin-tone-mod.pdf
       | 
       | [2] https://www.unicode.org/L2/L2015/15029r-zwj-emoji.pdf
       | 
       | [3] https://www.unicode.org/L2/L2017/17082-natural-hair-
       | color.pd...
       | 
       | [4] https://www.unicode.org/L2/L2017/17193-hair-colour-
       | proposal....
       | 
       | [5] https://www.unicode.org/L2/L2017/17283-response-hair.pdf
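        | 
        | For comparison, the two mechanisms side by side in ghci (these
        | are published RGI sequences; whether they render as a single
        | glyph depends on your platform):
        | 
        | -- skin tone: base emoji directly followed by a modifier, no joiner
        | putStrLn "\x1F44B\x1F3FD"        -- waving hand + medium skin tone
        | 
        | -- hair colour: base emoji + ZWJ + hair component
        | putStrLn "\x1F9D1\x200D\x1F9B0"  -- person + ZWJ + red hair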
        
         | kevincox wrote:
          | Randall Munroe was also wondering why most of the emoji aren't
          | just modifiers: https://xkcd.com/1813/
        
           | vanderZwan wrote:
           | I wonder how many years it'll take for someone to train a
           | neural network to generate emojis for all possible modifiers,
           | regardless of whether they're currently real combinations.
        
             | pranau wrote:
             | The Google keyboard on Android (and, I think iOS) already
             | lets you do this[1]. Go to the emoji picker on the keyboard
             | and select an emoji or two. You get 4-5 suggestions of
             | randomly added modifiers for the selected emoji.
             | 
             | https://9to5google.com/2020/12/03/gboard-emoji-kitchen-
             | expan...
        
               | [deleted]
        
             | theseanz wrote:
             | There's Emoji Mashup Bot+ - "Tries to create new emojis out
             | of three random emoji parts"
             | 
             | https://twitter.com/emojimashupplus
        
               | vanderZwan wrote:
               | Nice! A more "old-school" procgen-like approach, but that
               | makes it all the more elegant in how effective it is in
               | its simplicity
        
             | blauditore wrote:
             | Intuitively, I think this would be doable today with style
             | transfer networks. But maybe there wouldn't be enough
             | training data with existing emojis.
             | 
             | I hope someone who knows more about it can tune in...
        
           | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-03-26 23:02 UTC)