[HN Gopher] Emoji Under the Hood
___________________________________________________________________
Emoji Under the Hood
Author : kogir
Score : 376 points
Date : 2021-03-24 22:48 UTC (2 days ago)
(HTM) web link (tonsky.me)
(TXT) w3m dump (tonsky.me)
| mannerheim wrote:
| > Currently they are used for these three flags only: England,
| Scotland and Wales:
|
| Not quite true, you can get US state flags with this as well.
| petepete wrote:
| I've never seen them used; have they actually been implemented
| by any of the creators?
| mannerheim wrote:
| If I type the following into ghci, I get the state flag of
| Texas:
|
| putStrLn "\x1f3f4\xe0075\xe0073\xe0074\xe0078\xe007f"
|
| The first character is a flag, the last character is a
| terminator, and in between are the tag characters
| corresponding to the ASCII for ustx. Just take those
| characters and subtract 0xe0000 from them, 0x75, 0x73, 0x74,
| 0x78.
|
| https://en.wikipedia.org/wiki/Tags_(Unicode_block)
|
| Edit:
|
| Just for fun:
|
import Data.StateCodes
import Data.Char

putStrLn $ map (map toLower . show . snd) allStates >>=
  \stateCode -> '\x1f3f4':map (toEnum . (0xe0000+) . fromEnum)
    ("us" ++ stateCode) ++ "\xe007f"
| azornathogron wrote:
| Oh my god they've put terminal escape codes into Unicode.
| TheRealSteel wrote:
| Does this have anything to do with why Google Keyboard/Gboard
| doesn't have the Scottish flag? It's by far my most used emoji
| and my keyboard not having it drives me nuts.
| scatters wrote:
| Why not switch to a keyboard that does have it?
| Sniffnoy wrote:
| This may be supported in some implementations, but currently
| only England, Scotland, and Wales are officially in the Unicode
| data files and recommended for general interchange. You can see
| that they're the only examples of RGI_Emoji_Tag_Sequence listed
| here: https://www.unicode.org/Public/emoji/13.1/emoji-
| sequences.tx...
| rkangel wrote:
| The article is great, but there is one slightly misleading bit at
| the start:
|
| > The most popular encoding we use is called Unicode, with the
| two most popular variations called UTF-8 and UTF-16.
|
| Unicode is a list of codepoints - the characters talked about in
| the rest of the article. These live in a number space that's
| very big (~2^21 codepoints, as discussed in the article).
|
| You can talk about these codepoints in the abstract as this
| article does, but at some point you need to put them in a
| computer - store them on disk or transmit them over a network
| connection. To do this you need a way to represent a series of
| Unicode codepoints as a stream of bytes. This is an 'encoding';
| UTF-8, UTF-16, UTF-32 etc. are different encodings.
|
| UTF-32 is the simplest and most 'obvious' encoding to use. 32
| bits is more than enough to represent every codepoint, so you
| just use a 32-bit value for each codepoint and keep them in a
| big array. This has a lot of value in simplicity, but it means
| that text ends up taking a lot of space. Most western text
| (e.g. this page) fits in the first 128 codepoints, so for the
| majority of values most of the bits will be 0.
|
| UTF-16 is an abomination that is largely Microsoft's fault and is
| the default Unicode encoding on Windows. It is based on the fact
| that most text in most languages fits in the first 65,536 Unicode
| codepoints - referred to as the 'Basic Multilingual Plane'. This
| means that you can use a single 16-bit value to represent most
| codepoints, so Unicode is stored as an array of 16-bit values
| ("wide strings" in MS APIs). Obviously not _all_ Unicode values
| fit, so there is a mechanism ("surrogate pairs") that uses two
| 16-bit values to represent one codepoint. There are many problems
| with UTF-16, but my favourite is that it practically guarantees
| 'Unicode surprises' in your code. Something in your stack that
| assumes single-byte characters and barfs on higher Unicode values
| is a well-known failure mode, and you find it in testing fairly
| often. Because UTF-16 uses a single value for the vast majority
| of common codepoints, it makes things worse: the failure only
| occurs for a very small number of cases that you will inevitably
| only discover in production.
|
| UTF-8 is generally agreed to be the best encoding (particularly
| among people who don't work for Microsoft). It is a fully
| variable-length encoding, so a single codepoint can take 1, 2, 3
| or 4 bytes. It has lots of nice properties, but one is that
| codepoints <= 127 encode as a single byte, which means that plain
| ASCII is also valid UTF-8.
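|
| To make the size difference concrete, here is a quick Python
| sketch (bytes.hex(" ") needs Python 3.8+) encoding one codepoint
| from outside the Basic Multilingual Plane under each scheme:
|
s = "\U0001F600"                        # U+1F600 GRINNING FACE, outside the BMP

print(s.encode("utf-8").hex(" "))       # f0 9f 98 80  -> 4 bytes
print(s.encode("utf-16-be").hex(" "))   # d8 3d de 00  -> a surrogate pair
print(s.encode("utf-32-be").hex(" "))   # 00 01 f6 00  -> one 32-bit unit
print("A".encode("utf-8").hex(" "))     # 41           -> plain ASCII is valid UTF-8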
| rectang wrote:
| For people who want to hear more on this subject I gave a talk
| for Papers We Love Seattle on UTF-8, its origins and evolution,
| and how it compares against other encodings:
|
| https://www.youtube.com/watch?v=mhvaeHoIE24
|
| "Smiling Cat Face With Heart Eyes Emoji" plays a major role. :)
|
| It doesn't cover the same ground as this wonderful post with
| its study of variation selectors and skin-tone modifiers, but
| it provides the prerequisites leading up to it.
|
| > _UTF-16 is an abomination that is largely Microsoft's fault_
|
| I think that's unfair. The problem lies more in the
| conceptualization of "Unicode" in the late 1980s as a two-byte
| fixed-width encoding whose 65k-sized code space would be enough
| for the characters of all the world's living languages. (I
| cover that here:
| https://www.youtube.com/watch?v=mhvaeHoIE24&t=7m10s ) It turns
| out that we needed more space, and if Asian countries had had
| more say from the start, it would have been obvious earlier
| that a problem existed.
| rkangel wrote:
| >> UTF-16 is an abomination that is largely Microsoft's fault
|
| > I think that's unfair.
|
| Fair enough. It was a moderately 'emotional' response caused
| by some painful history of issues caused by 2-byte
| assumptions.
|
| The problem I suppose is that MS actually moved to Unicode
| _earlier_ than most of the industry (to their credit), and
| therefore played guinea pig in discovering what works and
| doesn't. My complaint now is that I feel they should start a
| migration to UTF-8 (yes, I know how challenging that would
| be).
| vanderZwan wrote:
| > _Flags don't have dedicated codepoints. Instead, they are two-
| letter ligatures. (...) There are 258 valid two-letter
| combinations. Can you find them all?_
|
| Well this nerd-sniped me pretty hard
|
| https://next.observablehq.com/@jobleonard/which-unicode-flag...
|
| That was a fun little exercise, but enough time wasted, back to
| work.
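|
| For anyone who wants to reproduce it without the notebook: the
| flags are just pairs of Regional Indicator Symbols, so a small
| Python sketch is enough to build one from a two-letter code
| (whether it renders as a flag is up to your platform):
|
def flag(country_code):
    # map 'A'..'Z' onto U+1F1E6..U+1F1FF REGIONAL INDICATOR SYMBOL LETTER A..Z
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in country_code.upper())

print(flag("SE"), flag("JP"), flag("UA"))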
| mercer wrote:
| Haha, playing around with reversing flags was the first thing I
| thought about trying.
| vanderZwan wrote:
| The surprising result (to me at least) was that out of 270
| valid letter combinations, 105 can be reversed. The odd
| number is easy to explain: letter pairs like MM => MM can add
| a single flag instead of a pair of two flags, but the fact
| that almost two out of every five flags are reversible feels
| pretty high to me.
| SamBam wrote:
| > but the fact that almost two out of every five flags are
| reversible feels pretty high to me.
|
| I think some letter-frequency analysis can probably explain
| it. Given the fact that certain letters are less likely
| both as the first slot and second (e.g., there are only 4
| country codes that start with J, and 3 that end with J),
| the letters that can be used as both first and second
| characters are over-represented.
|
| It's the same as how far more English language words can be
| reversed to make other valid words than you would expect if
| the letters were equally-frequent and arbitrarily arranged.
| breck wrote:
| I thought I knew Emoji, but there was a lot I didn't know. Thank
| you, a very enjoyable and enlightening read. Also, "dingbats"! I've
| rarely seen that word since I was a kid (when I had no idea what
| that voodoo was but loved it).
| artur_makly wrote:
| What I really want to know is the story behind how these emojis
| came to be?! Who was tasked to come up with this sticker list of
| symbols? What was the decision/strategy behind the selection of
| these symbols? Etc. etc. It seems soooo arbitrary at first glance.
|
| And how do we as a community propose new icons while considering
| others to be removed/replaced?
| rynt wrote:
| 99PI did a story on the process of submitting a new emoji
| request to the Unicode Consortium that you might find
| interesting: https://99percentinvisible.org/episode/person-
| lotus-position...
| artur_makly wrote:
| brilliant. thank you {{ U+1F64F }}
| itsmeamario wrote:
| Great quality post. I'd like to see more things like this on HN.
| Interesting and I learnt a lot about emojis and UTF.
| kaeruct wrote:
| I'm confused about the part saying flags don't work on Windows
| because I can see them on Firefox (on Windows). They don't work
| on Edge though.
| tonsky wrote:
| I guess FF ships its own version
| BlueGh0st wrote:
| I wish I could read this without getting a migraine. The
| "darkmode" joke was funny until I realized there was no actual
| way to turn it on.
| jffry wrote:
| Firefox's reader mode works great and includes a dark theme.
|
| The icon shows up in the right side of the URL bar, but you can
| always force it by prepending the URL, e.g.
| about:reader?url=<url>
| tobz1000 wrote:
| https://darkreader.org/
| sundarurfriend wrote:
| I _just_ turned this off today, after one too many "an
| extension is slowing this page down" warnings from Firefox,
| always from Dark Reader. It's a pretty useful addon, but
| there are enough websites that implement their own dark mode
| that it's less necessary these days (I hope), which possibly
| makes it not worth the slowdown.
| truefossil wrote:
| I wonder why Mediterranean nations switched from ideograms to
| alphabet as soon as one was invented. Probably they did not have
| enough surplus grain to feed something like the Unicode
| consortium?
| kps wrote:
| An alphabet (or syllabary, abjad, abugida) has a _small_ set of
| symbols that can express anything, which means that it could be
| used by people who did something other than read and write for
| a living. Probably no accident that the first to catch on, and
| the root of possibly all others, was spread by Phoenician
| traders.
| meepmorp wrote:
| Hieroglyphics weren't really ideographic after a very early
| point, because it's a pain in the ass making up new symbols for
| every word. Very quickly, it transitioned to being largely an
| abjad, representing only consonants. Abjads work reasonably
| well for Semitic languages, as the consonantal roots of words
| carry the meaning and a reader would be able to fill in the
| vowels themselves via context.
|
| According to the account I've heard, it's the Greeks who
| invented the alphabet, by accident. The Phoenician script used
| single symbols to represent consonants, including the glottal
| stop (and some pharyngeal consonant that would likely be
| subject to a similar process, iirc). The glottal stop was
| represented by aleph, and because Greek didn't have contrastive
| glottal stops in its phoneme inventory, Greeks just interpreted
| the vowel that followed it as what the symbol was meant to
| represent.
|
| It's a bit of a just-so story, but also completely plausible.
| avipars wrote:
| Really interesting article. Why haven't platforms banned
| U[?][?][?][?][?][?][?] or figured out a way to parse/contain the
| character to its container?
| Hawzen wrote:
| > The most popular encoding we use is called Unicode
|
| Unicode is a character set, not an encoding. UTF-8, UTF-16, etc.
| are encodings of that character set.
| aglionby wrote:
| Great post, entertainingly written.
|
| Back in 2015, Instagram did a blog post on similar challenges
| they came across implementing emoji hashtags [1]. Spoiler alert:
| they programmatically constructed a huge regex to detect them.
|
| [1] https://instagram-engineering.com/emojineering-part-ii-
| imple...
| lifthrasiir wrote:
| Nowadays you can refer to UAX #31 for hashtag identifiers
| (first specified in 2016):
| https://www.unicode.org/reports/tr31/#hashtag_identifiers
| imtiyaz wrote:
| Never dug into this nitty-gritty before. Very well explained.
| Thanks Nikita.
| tonsky wrote:
| You are welcome! Glad you liked it
| tomduncalf wrote:
| Really interesting and well written (and entertaining!) post. I
| was vaguely aware of most of it but hadn't appreciated how the
| ZWJ system for more complex emojis made up of basic ones means
| the meaning can still be discerned even if your device doesn't
| support the new emoji - clever approach!
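|
| A small Python sketch of that fallback behaviour: the family
| emoji below is just three basic emoji glued together with
| invisible ZERO WIDTH JOINERs, so a device that doesn't know the
| sequence simply shows the three people side by side:
|
import unicodedata

family = "\U0001F468\u200D\U0001F469\u200D\U0001F466"   # man + ZWJ + woman + ZWJ + boy
for codepoint in family:
    print(f"U+{ord(codepoint):04X} {unicodedata.name(codepoint)}")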
| yuntei wrote:
| And now, to see how emoji rendering is completely broken: put a
| gear (U+2699) in both its text variant and its emoji variant in
| some HTML, set the font to Menlo in one element and Monaco in
| another, then view it in Chrome, desktop Safari and iOS Safari;
| also select it and right-click on it in Chrome, and maybe paste
| it into the comment sections of various websites. Every single
| combination of text variant and emoji variant will be displayed
| completely at random :)
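|
| For reference, the two variants differ only by a trailing
| variation selector; a minimal Python sketch:
|
GEAR = "\u2699"
text_style  = GEAR + "\uFE0E"   # VS15 requests monochrome text presentation
emoji_style = GEAR + "\uFE0F"   # VS16 requests colour emoji presentation
print(text_style, emoji_style)  # how each renders depends entirely on the font stack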
| mshenfield wrote:
| It's a post about emojis, but I feel like I understand Unicode
| better now?
| remux wrote:
| Great post!
| woko wrote:
| > Unicode allocates 2^21 (~2 mil) characters called codepoints.
| Sorry, programmers, but it's not a multiply of 8.
|
| Why would 2^21 not be a multiple of 2^3?
| RedNifre wrote:
| It's a typo, they meant ~2^21 instead of 2^21, because it's
| 17*2^16, which is more like ~2^20.087. (And that's not even
| true either, since a couple of values like FFFF are forbidden)
| [deleted]
| howtodowtle wrote:
| Of course, 17 x 2^16 is also a multiple of 2^3:
|
| 17 x 2^16 = 17 x 2^13 x 2^3
|
| (reposted/edited because * was interpreted as formatting)
| RedNifre wrote:
| In case hacker news doesn't show emoji, I meant m(
|
| Right, I guess I was thinking more of "not a power of 2"
| instead of "not a multiple of 8".
|
| On second thought, the author might have meant "Sorry that
| the exponent is not a multiple of 8" as in Unicode neither
| has 2^16 nor 2^32 code points.
| kps wrote:
| Agreed; they meant "not a power of 2^8".
| ijidak wrote:
| This is eye-opening. So many frustrations I've had with emoji
| over the years are explained by this post.
|
| Big thank you to the OP.
| chronogram wrote:
| What kind of frustrations?
| peteretep wrote:
| An excellent article, although:
|
| > "U" is a single grapheme cluster, even though it's composed of
| two codepoints: U+0055 UPPER-CASE U followed by U+0308 COMBINING
| DIAERESIS.
|
| would be a great opportunity to talk about normal form, because
| there's also a single code point version: "latin capital letter u
| with diaeresis".
| colejohnson66 wrote:
| Does anyone know the history behind why there's two ways to
| "encode" things like that? What's the rationale for having both
| combining and precombined codepoints?
| bombcar wrote:
| I believe a lot of the "precombined" characters are (basically)
| from importing old codepages directly into Unicode, and they
| did that so it would be a simple formula to convert from the
| various codepages in use.
|
| I may be wrong however.
| z3t4 wrote:
| Related: implementing Emoji support in a text editor: https://xn
| --zta-qla.com/en/blog/editor10.htm
| devadvance wrote:
| Fantastic post that builds up knowledge along the way. A fun case
| where this type of knowledge was relevant: when creating emoji
| short links with a couple characters (symbols), I made sure to
| snag both URLs: one with the emoji (codepoint + `U+FE0F`) and one
| with just the symbol codepoint.
|
| Another thing worth calling out: you can get involved in emoji
| creation and Unicode in general. You can do this directly, or by
| working with groups like Emojination [0].
|
| [0] http://www.emojination.org/
| codetrotter wrote:
| The emojination website mentions UTC and ESC. UTC in this
| context certainly means Unicode Technical Committee. And after
| a bit of Googling it seems that ESC is the Unicode Emoji
| Subcommittee.
|
| Some of the suggested emojis are marked as UTC rejected, some
| as ESC rejected or ESC pushback. Does that mean that both UTC
| and ESC have to approve each suggested emoji?
|
| And is there a place to see the reason for rejection and a
| place to see what kind of pushback they are receiving?
| lifthrasiir wrote:
| It's complicated. So this mainly boils down to the
| relationship between UTC and ESC.
|
| ESC contributes to UTC, along with other groups (e.g. Scripts
| Ad Hoc Group or IRG) or other individuals (you can submit
| documents to UTC [1]), and technically UTC has a right to
| reject ESC contributions. In reality however ESC manages a
| huge volume of emoji proposals to UTC and distills them down
| to a packaged submission, so UTC rarely outright rejects ESC
| contributions. After all ESC is a part of UTC so there is a
| huge overlap anyway (e.g. Mark Davis is the Unicode
| Consortium _and_ ESC chair). "UTC rejected" emojis thus
| generally come from the direct proposal to UTC.
|
| You can see a list of emoji requests [2] but it lacks much
| information. This lack of transparency in the ESC process is
| well known and was most directly criticized by contributing
| experts in 2017 [3]. ESC responded [4] that there are so many
| flawed proposals (with no regard to the submission criteria
| [5]) that it is infeasible to document all of them. IMHO it's
| not a very satisfactory answer, but still understandable.
|
| [1] https://www.unicode.org/L2/
|
| [2] https://www.unicode.org/emoji/emoji-requests.html
|
| [3] https://www.unicode.org/L2/L2017/17147-emoji-
| subcommittee.pd...
|
| [4] https://www.unicode.org/L2/L2017/17192-response-cmts.pdf
|
| [5] https://www.unicode.org/emoji/proposals.html
| dgellow wrote:
| It's for "Emoji SubCommittee" (aka ESC).
|
| > Unicode Emoji Subcommittee:
|
| > The Unicode Emoji Subcommittee is responsible for the
| following:
|
| > - Updating, revising, and extending emoji documents such as
| UTS #51: Unicode Emoji and Unicode Emoji Charts.
|
| > - Taking input from various sources and reviewing requests
| for new emoji characters.
|
| > - Creating proposals for the Unicode Technical Committee
| regarding additional emoji characters and new emoji-related
| mechanisms.
|
| > - Investigating longer-term mechanisms for supporting emoji
| as images (stickers).
|
| From https://unicode.org/emoji/techindex.html
|
| Edit: Welp, the parent comment was asking what "ESC" stands
| for, but has now been updated, so this comment is now
| outdated :)
| codetrotter wrote:
| Sorry, yeah I was originally asking about what ESC stands
| for but found some info shortly after and updated my
| comment. But I appreciate the additional info anyways :)
| MrGilbert wrote:
| Reading about the 2 million codepoints: Is there a good set of
| open-source licensed fonts which cover as many codepoints as
| possible? Just curiosity, no real use case at the moment. I don't
| think it would make sense to create one huge font for this,
| right?
| dan-robertson wrote:
| There's a project called, I think, gnufont but their font is a
| bitmap font...
| MrGilbert wrote:
| Ah, thank you! Searching for "gnufont" brought me to[1],
| which looks pretty nice indeed.
|
| [1]: https://www.gnu.org/software/freefont/
| dan-robertson wrote:
| I think that's what I was thinking of. I guess they've got
| some vector outlines now
| pta2002 wrote:
| Google's Noto Fonts[1] attempt to cover all of Unicode and are
| released under the SIL Open Font License.
|
| [1] https://www.google.com/get/noto/
| MrGilbert wrote:
| That looks incredibly complete, thank you!
| mojuba wrote:
| Can someone explain what the rules are for substring(m, n), given
| all the madness that is today's Unicode? Is it standardized, or
| is it up to the implementations?
| _ZeD_ wrote:
| I think the only reasonable rule for substring(m, n) is "don't"
| mojuba wrote:
| So string is no longer a "string of characters", it is in
| fact a program (not Turing complete) that you need to
| execute.
|
| Though substring(m, n) still makes sense in at least
| interactive text manipulation: how do you do copy/paste?
| roel_v wrote:
| "So string is no longer a "string of characters""
|
| It hasn't been for 30 years.
| goto11 wrote:
| No, it is not a program - at least not any more than an ASCII
| string is a program.
|
| It is just that in Unicode there isn't a simple 1:1
| correspondence between bytes, characters and glyphs, so you
| can't just extract an arbitrary byte sequence from a string and
| expect it to render correctly.
| mojuba wrote:
| > there isn't a simple 1:1 correspondence between bytes
| and characters and glyphs
|
| There isn't a simple 1:1 correspondence between anything
| at all. The only definitive thing about Unicode strings
| is the beginning where you should start your parsing.
|
| Then the way things are supposed to be displayed, to be
| Unicode-compliant, looks more like some virtual machine
| analyzing the code. How is this different from any other
| declarative language?
| techdragon wrote:
| Not really. A Unicode string is more like a sequence of
| data built from simple binary structs, which belong to a
| smallish set of valid structs. Additionally, some (but not
| all) of these structs can be used to infer the validity of
| subsequent structs in the sequence if you're parsing in a
| byte-at-a-time fashion. Alternatively, if you're happy to
| trade away a little forward compatibility and go for
| explicit enumeration of all groups of valid bytes, you can
| be a lot more sure of things, but it's harder to make this
| method as performant as the byte-at-a-time method - which,
| given the complete ubiquity of string processing in
| software, leads to the dominance of the byte-at-a-time
| method.
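|
| A rough Python sketch of that byte-at-a-time idea for UTF-8: the
| lead byte tells you how long the "struct" is, and every following
| byte must be a continuation byte (it deliberately skips edge
| cases such as overlong 3-byte forms and surrogates):
|
def utf8_sequence_length(lead):
    """Bytes in the sequence introduced by this lead byte (0 if invalid)."""
    if lead < 0x80: return 1   # 0xxxxxxx: ASCII
    if lead < 0xC2: return 0   # continuation byte or overlong 2-byte lead
    if lead < 0xE0: return 2   # 110xxxxx
    if lead < 0xF0: return 3   # 1110xxxx
    if lead < 0xF5: return 4   # 11110xxx, up to U+10FFFF
    return 0

def looks_like_utf8(data):
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n == 0 or i + n > len(data):
            return False
        # every byte after the lead must be a continuation byte 10xxxxxx
        if any(b & 0xC0 != 0x80 for b in data[i + 1:i + n]):
            return False
        i += n
    return True

print(looks_like_utf8("emoji \U0001F642".encode("utf-8")))   # True
print(looks_like_utf8(b"\xff\xfe"))                          # False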
| truefossil wrote:
| The safest path is to consider it a blob. There is some
| library that can render it magically and that's the only
| wise thing you can do. The internal structure is hard to
| understand. Also, definitions change over time. So, you
| better leave it all to professionals.
| spookthesunset wrote:
| The thing about Unicode is.... anybody who tried to do it
| "more simple" would eventually just develop a crappier
| version of Unicode.
|
| Unicode is complex because the sum of all human language
| is complex. Short of a ground-up rewrite of the world's
| languages, you cannot boil away most of that
| complexity... it has to go somewhere.
|
| And even if you did manage to "rewrite" the world's
| languages to be simple and remove accidental complexity, I
| assert that over centuries it would devolve right back
| into a complex mess again. Why? Languages represent (and
| literally shape and constrain) how humans think and
| humans are a messy bunch of meat sacks living in a huge
| world rich in weird crazy things to feel and talk about.
| kps wrote:
| There are definitely crappy things about Unicode that are
| separate from language.
|
| - Several writing systems are widely scattered across
| multiple 'Supplement'/'Extended'/'Extensions' blocks.
|
| - Operators (e.g. combining forms, joiners) are a
| mishmash of postfix, infix, and halffix. They should have
| been (a) in an easily tested _reserved_ block (e.g.
| 0xF0nn for binary operators, 0xFmnn for unary), so that
| you could _parse_ over a sequence even if it contains
| specific operators from a later version -- i.e. separate
| syntax from semantics, and (b) uniformly prefix, so that
| read-ahead isn't required to find the end of a sequence
| (and dead keys become just like normal characters).
| [deleted]
| RedNifre wrote:
| Maybe have m,n refer to grapheme clusters instead of bytes/code
| points?
| mojuba wrote:
| Apparently it's what Swift does when you try to get the
| length of a string. Though there's no more plain substring()
| since Swift 5, it was removed to indicate it's no longer
| O(1). You will get different results across languages though.
| EMM_386 wrote:
| It is up to the implementation.
|
| This is a good read on aspects of it:
|
| https://hsivonen.fi/string-length/
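|
| The gist of that article, sketched in Python for a face-palm ZWJ
| sequence like the one in its title:
|
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # face palm + skin tone + ZWJ + male sign + VS16

print(len(s))                           # 5  codepoints (Python's unit)
print(len(s.encode("utf-16-le")) // 2)  # 7  UTF-16 code units (JavaScript's .length)
print(len(s.encode("utf-8")))           # 17 UTF-8 bytes (Rust's str::len)
# ...and it is still a single extended grapheme cluster, i.e. one user-perceived character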
| thristian wrote:
| It depends what your string is a string of.
|
| Slicing by byte-offset is pretty unhelpful, given how many
| Unicode characters occupy more than one byte. In an encoding
| like UTF-16, that's "all of them" but even in UTF-8 it's still
| "most of them".
|
| Slicing by UTF-16 code-unit is still pretty unhelpful, since a
| lot of Unicode characters (such as emoji) do not fit in 16
| bits, and are encoded as "surrogate pairs". If you happen to
| slice a surrogate pair in half, you've made a mess.
|
| Slicing by code-points (the numbers allocated by the Unicode
| consortium) is better, but not great. A shape like the "é" in
| "café" could be written as U+0065 LATIN SMALL LETTER E followed
| by U+0301 COMBINING ACUTE ACCENT. Those are separate code-
| points, but if you slice between them you'll wind up with
| "cafe" and an isolated acute accent that will stick to whatever
| it's next to.
|
| When combining characters stick to a base character, the result
| is called a "grapheme cluster". Slicing by grapheme clusters is
| the best option, but it's expensive since you need a bunch of
| data from the Unicode database to find the edges of each
| cluster - it depends on the properties assigned to each
| character.
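|
| A minimal sketch of the difference, slicing by codepoint versus
| by grapheme cluster (the third-party `regex` package provides \X
| for extended grapheme clusters):
|
import regex   # pip install regex

cafe = "cafe\u0301"                    # "café" spelled with a combining acute accent
print(cafe[:4])                        # slicing by codepoint cuts the accent off: "cafe"

clusters = regex.findall(r"\X", cafe)  # ['c', 'a', 'f', 'é'] - 4 grapheme clusters
print("".join(clusters[:4]))           # slicing by cluster keeps the "é": "café"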
| andreareina wrote:
| Doesn't splitting by grapheme cluster also depend on which
| version of the Unicode standard you use, since new versions
| come with new combinations?
| kevincox wrote:
| The standard answer is "don't". Just treat text as a blob. But
| the other question is: what are you trying to accomplish?
|
| - Are you trying to control the rendered length? In that case
| the perfect solution is actually rendering the string.
|
| - Are you limiting storage size? Then you need to find a good
| split point that is <N bytes. This is probably done using
| extended grapheme clusters. (Although this also isn't perfect)
|
| I'm sure there are other use cases as well. But at the end of
| the day try to avoid splitting text if it can be helped.
| lifthrasiir wrote:
| > One weird inconsistency I've noticed is that hair color is done
| via ZWJ, while skin tone is just modifier emoji with no joiner.
| Why? Seriously, I am asking you: why? I have no clue.
|
| Mainly because skin tone modifiers [1] predate the ZWJ mechanism
| [2]. For hair colors there were two contending proposals [3] [4],
| one of which doesn't use ZWJ, and the ZWJ proposal was accepted
| because new modifiers (as opposed to ZWJ sequences) would have
| required an architectural change [5].
|
| [1] https://www.unicode.org/L2/L2014/14213-skin-tone-mod.pdf
|
| [2] https://www.unicode.org/L2/L2015/15029r-zwj-emoji.pdf
|
| [3] https://www.unicode.org/L2/L2017/17082-natural-hair-
| color.pd...
|
| [4] https://www.unicode.org/L2/L2017/17193-hair-colour-
| proposal....
|
| [5] https://www.unicode.org/L2/L2017/17283-response-hair.pdf
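|
| Side by side, the two mechanisms look like this (a small Python
| sketch; how each sequence renders depends on the platform):
|
# skin tone: the modifier follows the base directly, no joiner
thumbs_up_medium = "\U0001F44D\U0001F3FD"     # THUMBS UP SIGN + EMOJI MODIFIER FITZPATRICK TYPE-4

# hair colour: an emoji component attached with a ZERO WIDTH JOINER
man_red_hair = "\U0001F468\u200D\U0001F9B0"   # MAN + ZWJ + RED HAIR component

print(thumbs_up_medium, man_red_hair)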
| kevincox wrote:
| Randall Munroe was also wondering why most of the emoji aren't
| just modifiers: https://xkcd.com/1813/
| vanderZwan wrote:
| I wonder how many years it'll take for someone to train a
| neural network to generate emojis for all possible modifiers,
| regardless of whether they're currently real combinations.
| pranau wrote:
| The Google keyboard on Android (and, I think iOS) already
| lets you do this[1]. Go to the emoji picker on the keyboard
| and select an emoji or two. You get 4-5 suggestions of
| randomly added modifiers for the selected emoji.
|
| https://9to5google.com/2020/12/03/gboard-emoji-kitchen-
| expan...
| [deleted]
| theseanz wrote:
| There's Emoji Mashup Bot+ - "Tries to create new emojis out
| of three random emoji parts"
|
| https://twitter.com/emojimashupplus
| vanderZwan wrote:
| Nice! A more "old-school" procgen-like approach, but that
| makes it all the more elegant in how effective it is in
| its simplicity
| blauditore wrote:
| Intuitively, I think this would be doable today with style
| transfer networks. But maybe there wouldn't be enough
| training data with existing emojis.
|
| I hope someone who knows more about it can tune in...
| [deleted]
___________________________________________________________________
(page generated 2021-03-26 23:02 UTC)