[HN Gopher] Drag an emoji family with a string size of 11 into a...
___________________________________________________________________
Drag an emoji family with a string size of 11 into an input with
maxlength=10
Author : robin_reala
Score : 190 points
Date : 2023-02-24 15:31 UTC (7 hours ago)
(HTM) web link (mastodon.social)
(TXT) w3m dump (mastodon.social)
| a_c wrote:
| Edit 2: Actually it took 7 backspaces to obliterate the whole
| family. Tough one.
|
| Edit: Oops. I was trying to type an emoji into an HN comment;
| apparently it is not supported.
|
| I never knew that [emoji of family of 4] takes 5 backspaces to
| eliminate. It goes from [emoji of family of 4] to [emoji of
| family of 3] to [father and mother] to [father] to [father].
| Somehow [father] can take double the (key)punches of the rest
| of his family.
| WirelessGigabit wrote:
| There is a ZWJ after father. That would explain the double
| punch.
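|
| (For illustration, a quick console sketch of that structure; the
| emoji is assumed to be the standard 4-person family, which
| matches the 11-code-unit string in the title:)
|
|     // U+1F468 MAN, ZWJ, U+1F469 WOMAN, ZWJ, U+1F467 GIRL, ZWJ,
|     // U+1F466 BOY
|     const family =
|       "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
|     [...family].map(c => c.codePointAt(0).toString(16));
|     // ["1f468", "200d", "1f469", "200d", "1f467", "200d", "1f466"]
|     family.length; // 11 UTF-16 code units (4 surrogate pairs + 3 ZWJs)
|     // If each backspace removes one code point, that is the 7
|     // presses observed above.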
| hn_throwaway_99 wrote:
| Lol, sorry kid, we can't afford the minivan, you've got to go.
| graypegg wrote:
| Odd. It makes sense on a technical level given these ZWJ
| characters, but hiding the implementation also makes sense from
| Safari's point of view. I'd actually prefer that as a UNIVERSAL
| standard, at least for UI: count visible symbols, not
| characters.
|
| But I can also imagine this being problematic when setting
| validation rules elsewhere, so now there's a subtle footgun
| buried in most web forms.
|
| I guess the thing to learn here is to not rely on maxlength=
| zuhsetaqi wrote:
| The thing every developer should have learned from the
| beginning is to never rely on anything that comes from user
| input or the client.
| turmeric_root wrote:
| agreed, i discard user input in all of my apps
| graypegg wrote:
| I hope no one is relying on it! But in the context of a
| regular user, maxlength measuring all chars is fail-safe (the
| user is allowed to submit a long string that should be caught
| by validation elsewhere) vs measuring visible chars, which is
| fail-dangerous (the user can't submit a valid string because
| the front-end validation stops them).
|
| Which is weird
| lapcat wrote:
| The bigger problem is that web browsers silently truncate strings
| over the maxlength. This behavior is particularly nasty for
| secure password fields.
| Wowfunhappy wrote:
| If anyone would like to try this out:
|
| https://codepen.io/Wowfunhappy/pen/dyqpMXO?editors=1000
| waltbosz wrote:
| I forked your code and made it a bit more fun. Presenting: the
| Magical Shrinking Emoji Family.
|
| https://codepen.io/waltbosz/pen/zYJKBEE
| brycedriesenga wrote:
| The father seems so much happier to be alone.
| jefftk wrote:
| And he stops shaving his mustache. Perhaps his family
| didn't like it?
| test6554 wrote:
| Ugh, how did we get here?
| rom-antics wrote:
| Edge cases like this will just get more common as Unicode keeps
| getting more complex. There was a fun slide in this talk[1] that
| suggests Unicode might be Turing-complete due to its case-
| folding rules.
|
| I miss when Unicode was just a simple list of codepoints. (Get
| off my lawn)
|
| [1]:
| https://seriot.ch/resources/talks_papers/20171027_brainfuck_...
| WorldMaker wrote:
| These "edge cases" have always existed in Unicode. Languages
| with ZWJ needs have existed in Unicode since the beginning.
| That emoji put a spotlight on this for especially English-
| speaking developers with assumptions that language encodings
| are "simple", is probably one of the best things about the
| popularity of emoji.
| GuB-42 wrote:
| I think we need some kind of standard "Unicode-light" with
| limitations that allow it to be used on low-spec hardware and
| without weird edge cases like this. A bit like video codecs,
| which have "profiles": sets of limitations you can adhere to in
| order to avoid overwhelming low-end hardware.
|
| It wouldn't be "universal", but enough to write in the most
| commonly used languages, and maybe support a few single-
| codepoint special characters and emoji.
| sbierwagen wrote:
| Why bother with a standard? Just deny by default and
| whitelist the codepoints you care about.
|
| Plenty of software already does that-- HN itself doesn't
| allow emoji, for example.
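|
| (A minimal sketch of the deny-by-default idea in JS; the
| property classes kept here are illustrative, not what HN
| actually permits:)
|
|     // Keep letters, numbers, punctuation and spaces; drop the rest.
|     const sanitize = s => s.replace(/[^\p{L}\p{N}\p{P}\p{Zs}\n]/gu, "");
|     sanitize("hello \u{1F600} world"); // "hello  world"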
| rom-antics wrote:
| I don't think HN is deny-by-default. Most emojis are
| stripped, but some still get through. I don't know why those
| would be whitelisted.
| zerocrates wrote:
| Is it just that whole block that's allowed? I feel like I've
| seen some of the legacy characters that have been kind of
| "promoted" to emoji allowed here. Let's see, is the
| all-important [emoji] allowed?
| GuB-42 wrote:
| The point of having a standard is to know which ones to
| deny.
|
| Do a study over a wide range of documents of which
| characters are used the most, see if there are
| alternatives to the characters that don't make it, etc.
|
| This is unlike the HN ban on emoji, which I think is more
| of a political decision than a technical one. Most people
| on HN use systems that can read emoji just fine, but they
| decided that the site would be better without it. This
| would be more technical, a way to balance inclusivity and
| technical constraints, something between ASCII and full
| Unicode.
| zokier wrote:
| The subset of codepoints that are included in NFKD would
| probably make a decent starting point for such a standard;
| if you want to be even more restrictive, limit it to the BMP.
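|
| (A rough sketch of what the "limit to the BMP" restriction
| means in practice:)
|
|     // Every code point must fit in 16 bits (no surrogate pairs).
|     const isBmpOnly = s => [...s].every(c => c.codePointAt(0) <= 0xFFFF);
|     isBmpOnly("h\u00E9llo");  // true
|     isBmpOnly("\u{1F468}");   // false: U+1F468 lies outside the BMP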
| klausa wrote:
| The _entire point_ of Unicode is to be _universal_; you're
| just suggesting we go back to pre-Unicode days and use
| different code pages.
|
| This, for reasons stated in responses already posted, does
| not work, to put it mildly, in an international context
| (e.g. the vast majority of software written every day).
| themerone wrote:
| You can't express the most commonly used languages without
| multi-code-point graphemes.
|
| If you want to eliminate edge cases you would need to
| introduce new, incompatible code points and a 32- or 64-bit
| fixed-length encoding, depending on how many languages you
| want to support.
| GuB-42 wrote:
| Extra pages for these extra code points wouldn't seem far-
| fetched to me. We already have single-code-point letters
| with diacritics like "é", and a huge code page of Hangul,
| in which each code point is a combination of characters.
|
| As for encoding, a 32-bit fixed length should be sufficient;
| I can't believe that we would need billions of symbols,
| combinations included, in order to write in the most common
| languages, though I may be wrong.
|
| Also, "limiting" doesn't necessarily mean "single code
| point only", but more like "only one diacritic, only from
| this list, and only for these characters", so that the
| combination fits in a certain size limit (e.g. a 32-bit
| word), and the engine only has to process a limited
| number of use cases.
| Dylan16807 wrote:
| Unicode always had combining characters. Is this so different
| from accent marks disappearing? And the Hangul pieces were
| there from 1.0/1.1.
| ryandrake wrote:
| Seems like the obvious root cause for these sorts of things is
| languages (and developers) that can't or won't differentiate
| between "byte length" and "string length". Joel warned us all
| about this 20 years (!!) ago[1], and we're still struggling with
| it.
|
| 1: https://www.joelonsoftware.com/2003/10/08/the-absolute-
| minim...
| rockwotj wrote:
| String length isn't even a well defined term. Do you mean
| codepoint length? Or the number of graphemes?
| ryandrake wrote:
| Good point. When it comes to text there are a lot of
| "lengths". Languages should differentiate between each of
| those lengths, and programmers should pay attention to the
| differences and actually use the right one for whatever
| situation they're programming.
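|
| (To make the different "lengths" concrete, a console sketch
| using the family emoji; Intl.Segmenter is assumed to be
| available, as in current browsers and recent Node:)
|
|     const s =
|       "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
|     new TextEncoder().encode(s).length;  // 25 UTF-8 bytes
|     s.length;                            // 11 UTF-16 code units
|     [...s].length;                       //  7 code points
|     [...new Intl.Segmenter("en", { granularity: "grapheme" })
|       .segment(s)].length;               //  1 grapheme cluster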
| [deleted]
| ok123456 wrote:
| multi-character graphemes were a mistake.
| jonhohle wrote:
| It seems odd to suggest the bug is with Safari. Normal humans
| (and even most developers!) don't care that the byte length of an
| emoji in a particular encoding that may or may not be under their
| control defines the maximum number of "characters" in a text box
| ("character" here meaning a logical collection of code points,
| each of which may fit into one or more bytes).
| twic wrote:
| It's a bug with Safari because the HTML spec defines maxlength
| as applying to the number of UTF-16 code units [1]:
|
| > Constraint validation: If an element has a maximum allowed
| value length, its dirty value flag is true, its value was last
| changed by a user edit (as opposed to a change made by a
| script), and the length of the element's API value is greater
| than the element's maximum allowed value length, then the
| element is suffering from being too long.
|
| Where "the length of" is a link to [2]:
|
| > A string's length is the number of code units it contains.
|
| And "code units" is a link to [3]:
|
| > A string is a sequence of unsigned 16-bit integers, also
| known as code units.
|
| I agree with your implied point that this is a questionable
| definition, though!
|
| [1] https://html.spec.whatwg.org/multipage/form-control-
| infrastr...
|
| [2] https://infra.spec.whatwg.org/#string-length
|
| [3] https://infra.spec.whatwg.org/#code-unit
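|
| (Concretely, under that definition the check a browser is
| expected to make is roughly this sketch, not WebKit's actual
| code:)
|
|     const input = document.querySelector("input[maxlength]");
|     // .value.length and .maxLength both count UTF-16 code units
|     const tooLong = input.value.length > input.maxLength;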
| Someone wrote:
| I am not sure it's a bug. [1] also says (emphasis added)
|
| _"User agents _may_ prevent the user from causing the
| element's API value to be set to a value whose length is
| greater than the element's maximum allowed value length."_
|
| and we have _"MAY This word, or the adjective "OPTIONAL",
| mean that an item is truly optional"_ [2]
|
| [1] https://html.spec.whatwg.org/multipage/form-control-
| infrastr...
|
| [2] https://www.ietf.org/rfc/rfc2119.txt
| syrrim wrote:
| Apple clearly intended to implement it though, and did so
| incorrectly.
| Longhanks wrote:
| IMHO that is then a "bug" in the HTML spec, which should be
| "fixed" to speak of extended grapheme clusters instead, since
| that is what users would probably expect.
| WorldMaker wrote:
| Except "maxlength" is often going to be added to fields to
| deal with storage/database limitations somewhere on the
| server side and those are almost always going to be in
| terms of code points, so counting extended grapheme
| clusters makes it harder for server-side storage maximums
| to agree with the input.
|
| Choosing 16-bit code units as the base keeps consistency
| with the naive string length counting of JS in particular,
| which has always returned length in 16-bit code units.
|
| Sure, it's slightly to the detriment of user experience in
| terms of how a user expects graphemes to be counted, but it
| avoids later storage problems.
| biftek wrote:
| Yeah, the author's conclusion is flawed. If I enter an emoji and
| a different one appears, I'm just going to assume your website
| is broken. Safari is in the right here.
| jfk13 wrote:
| It's not that clear-cut. If you enter that 11-code-unit emoji
| into a field that has maxLength=10, and the browser lets you
| do so, but the backend system that receives the data only
| stores 10 code units, you're worse off -- because you
| probably won't realise your data got corrupted -- than if the
| browser had prevented you from entering/submitting it.
| test6554 wrote:
| Unicode... There may be some regrets.
| SquareWheel wrote:
| > Except in Safari, whose maxlength implementation seems to treat
| all emoji as length 1. This means that the maxlength attribute is
| not fully interoperable between browsers.
|
| No, it's definitely not. You can read the byte length more
| directly in JS, and use that to decide whether more text is
| allowed:
|
|     const encoder = new TextEncoder();
|     const currentBytes = encoder.encode(inputStr).byteLength;
|
| But the maxlength attribute is at best an approximation. Don't
| rely on it for things like limiting length for database fields
| (not that you should trust the client anyway).
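|
| (For example, a sketch of wiring that byte count up to an
| input; "byteLimit" is a made-up application constraint, not a
| standard attribute:)
|
|     const input = document.querySelector("input");
|     const byteLimit = 40; // whatever the backend column can hold
|     input.addEventListener("input", () => {
|       const bytes = new TextEncoder().encode(input.value).byteLength;
|       input.setCustomValidity(bytes > byteLimit ? "Too long" : "");
|     });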
| ASalazarMX wrote:
| Apple's take seems more reasonable. When a user uses an emoji,
| they think of it as a single symbol; they don't care about the
| Unicode implementation or its length in bytes. IMO this should
| be the standard, and all other interpretations are a repeat of
| the transition from ASCII to Unicode.
| duskwuff wrote:
| It isn't just for emoji, either. More ordinary grapheme
| clusters like characters with combining accents are also
| perceived by users as a single character, and should be
| counted as one by software.
| david422 wrote:
| The max length isn't for the user though. It's to prevent the
| user from inputting more data than the system can handle.
| withinboredom wrote:
| The user can submit data of any length... browsers aren't
| required to implement validation for you. Nor is your user
| required to use a browser or client you wrote. This is
| security and application development 102, welcome to my
| class.
| tobr wrote:
| Max length is there to communicate to the user how much
| data the system can handle.
| scatters wrote:
| The number of graphemes an emoji displays as depends on the
| platform and font. How many "characters" does Safari think
| Ninja Cat[1] is? It displays as a single grapheme on Windows
| 10.
|
| 1. https://emojipedia.org/ninja-cat/
| MereInterest wrote:
| And depends on the location. For example, the characters
| U+1F1FA U+1F1F8 are the regional indicators "U" and "S",
| and are rendered as . These are two separate codepoints
| that may together be displayed as a United States flag.
| Similarly, the regional indicators "T" and "W" are rendered
| as and "H" and "K" are rendered as . On my system, this is
| rendered as the flag of Taiwan and Hong Kong, respectively.
| Depending on where you live, these regional indicators
| might not be rendered as flags.
|
| Edit: Looks like HN stripped out the character codes. The
| effect can be reproduced by copy-pasting the symbols from
| https://en.wikipedia.org/wiki/Regional_indicator_symbol
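|
| (A console sketch of the mechanism; whether the pair renders as
| a flag or as two boxed letters is up to the font/platform:)
|
|     // Map "us" -> U+1F1FA U+1F1F8 (REGIONAL INDICATOR U, S)
|     const flag = cc => String.fromCodePoint(
|       ...[...cc.toUpperCase()].map(ch => 0x1F1E6 + ch.charCodeAt(0) - 65));
|     [...flag("us")].map(c => c.codePointAt(0).toString(16));
|     // ["1f1fa", "1f1f8"]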
| WorldMaker wrote:
| > Depending on where you live, these regional indicators
| might not be rendered as flags.
|
| This is also why Microsoft-vended emoji fonts don't
| include flags/"regional indicators" support _at all_. On
| a Windows machine you just see the letters US or TW in an
| ugly boxed type and nothing like a flag.
|
| The interesting background story goes: everyone remembers
| that Windows 95 had that fancy graphical Time Zone
| selector that looked real pretty in screenshots (but
| wasn't actually the best way to select time zones
| anyway). The reasons Microsoft removed that tool are less
| well known and almost entirely geopolitical: time zone
| borders often correspond to country borders and several
| countries got mad at their border looking wrong and sued
| Microsoft for "lying" about their borders. After a bunch
| of micro-changes to the graphics of that widget and then
| a lot of money spent on lawsuits and geopolitical fights,
| Microsoft eventually just entirely removed the widget
| because it was a "nice-to-have" and never actually
| necessary. It is said that from that fight Microsoft
| decided that it never wanted to be involved in that sort
| of geopolitics ever again, and flags are nothing if not
| geopolitical symbols, so Microsoft's emoji fonts hardly
| encode any flags at all. (In exchange they encode
| a lot of fun "cats with jobs" that other font vendors
| still don't. I find that a fun trade-off myself.)
| crazygringo wrote:
| According to the Unicode standard there is a canonical
| notion of countable graphemes, with zero ambiguity.
|
| I don't actually know whether ninja cat would count as one
| or two though. The spec for calculating grapheme boundaries
| is actually several pages long.
|
| And the OS might not obey the Unicode standard. Ninja cat
| appears to be proprietary to Microsoft.
| [deleted]
| WorldMaker wrote:
| What I recall of the standard algorithm is that it does
| include font/user-agent/locale pieces in its calculations
| and "zero ambiguity" is a bit of a stretch,
| unfortunately.
|
| Even the overview includes this note:
|
| > Note: Font-based information may be required to
| determine the appropriate unit to use for UI purposes
|
| First of all, grapheme counting depends on normalization
| and there's like 6 Unicode normalization algorithms to
| consider, depending on locale and intended use/display
| style.
|
| Keep in mind that grapheme counting includes things like
| natural ligatures, which English has always had a rough
| time counting. Things like `fi` and `ft` sometimes
| form single visual graphemes from the font's perspective
| in some serif fonts, but are _always_ supposed to be
| counted as two graphemes from the user's perspective.
|
| Relatedly, one of the simpler parts of the Unicode
| standard algorithm is that the ZWJ _always_ signals a
| grapheme cluster/merged grapheme, so per the standard
| algorithm the Ninja Cat is _always_ a single grapheme.
| Even though the usual emoji rules make all the non-
| Microsoft fonts that don't include Cat+ZWJ+Job sequences
| ("cats with jobs" emojis) fall back to two display
| characters to try to get the idea across, the spec does
| say that they should still count that as only a single
| grapheme even when not presented as such.
|
| (ETA Aside: I think the cats-with-jobs emojis are great
| and fun and should be more widely adopted outside of just
| Microsoft. Why should only the people emojis have jobs?)
| Dylan16807 wrote:
| > First of all, grapheme counting depends on
| normalization and there's like 6 Unicode normalization
| algorithms to consider, depending on locale and intended
| use/display style.
|
| It says "The boundary specifications are stated in terms
| of text normalized according to Normalization Form NFD
| (see Unicode Standard Annex #15, "Unicode Normalization
| Forms" [UAX15]). In practice, normalization of the input
| is not required."
|
| And there's also "Even in Normalization Form NFC, a
| syllable block may contain a precomposed Hangul syllable
| in the middle."
|
| So are you sure normalization matters? I'm excluding K
| normalizations since they're not for general use and
| alter a bunch of characters on purpose. But also that's
| only NFC/NFD/NFKC/NFKD, are there two others or was "six"
| just a misremembering?
| WorldMaker wrote:
| "like 6" was intentional hyperbole, but I appreciate the
| technical clarifications.
|
| > I'm excluding K normalizations since they're not for
| general use
|
| "Not for general use" doesn't mean that they aren't
| _somebody 's_ use and still a complicating factor in
| everything else.
| Dylan16807 wrote:
| Nobody that cares about keeping text intact uses K. And
| you don't have to care about how _they_ count characters.
| It won't affect your system, and they're probably
| counting something dumb.
|
| NFC/NFD will give you the same graphemes.
|
| Or to put it another way, there's only two real
| normalizations, and there's some weird junk off to the
| side.
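|
| (A quick illustration that NFC and NFD segment into the same
| graphemes, using "e" plus a combining acute accent and assuming
| Intl.Segmenter is available:)
|
|     const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
|     const nfc = "e\u0301".normalize("NFC"); // 1 code point: U+00E9
|     const nfd = "\u00E9".normalize("NFD");  // 2 code points: U+0065 U+0301
|     [...nfc].length;              // 1
|     [...nfd].length;              // 2
|     [...seg.segment(nfc)].length; // 1
|     [...seg.segment(nfd)].length; // 1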
| thih9 wrote:
| > This Emoji ZWJ Sequence has not been Recommended For
| General Interchange (RGI) by Unicode. Expect limited cross-
| platform support.
| ASalazarMX wrote:
| I'm not saying Apple's implementation should be the gold
| standard, just that characters that the Unicode
| specification allows to combine into one character should
| be counted as one character.
|
| I understand that technical limitations will sometimes
| cause a combined character to be displayed as more than one
| character, as happens in Safari with that ninja cat, and
| in those cases it's preferable to decompose the emoji, or
| use a tofu character, than to change it for another.
| jakelazaroff wrote:
| They are decomposing the emoji, are they not? The
| resulting glyph (with three family members rather than
| four) is just the result of the ligatures between the
| remaining emojis.
| tobr wrote:
| From a user perspective, a family emoji is one character.
| Whether it's implemented as a ligature seems irrelevant,
| as that's not visible to the user.
| junon wrote:
| Emoji was such a mistake in my opinion. It's cluttered
| the Unicode spec for a long time now and it's a continual
| source of problems.
| gpderetta wrote:
| Emojis simply manifest existing bugs in programs. Those
| bugs would have existed anyway as emojis are not the only
| symbols beyond the BMP nor the only symbols composed of
| grapheme clusters.
|
| Being much more frequent than other more exotic
| characters they just reveal bugs more often.
| tobr wrote:
| An incredibly popular and successful mistake, in that
| case. Is the purpose of Unicode to be an uncluttered
| spec, or to be a useful way to represent text?
| wwalexander wrote:
| Emojis aren't text, though. Nobody "writes" emoji outside
| of the digital world, so I don't think emoji could
| accurately be called "text" or part of a "writing system":
| > Unicode, formally The Unicode Standard,[note 1][note 2]
| is an information technology standard for the consistent
| encoding, representation, and handling of text expressed
| in most of the world's writing systems
|
| People write smiley faces, but there was already a code
| point for basic smileys.
|
| I think standardized emoticons are a good thing, but I
| don't think the Unicode spec is where they best belong.
| tobr wrote:
| I don't see anything in the quoted passage that limits it
| to non-digital writing systems.
| yamtaddle wrote:
| Smiley faces and hearts and all kinds of stuff appear in
| hand-written text.
| Nevermark wrote:
| Given that emoji are most typically used in text, having
| it all together seems better than two bodies coming up
| with incompatible solutions to emoji and the glyph issues
| of various languages.
|
| In the end emoji is a language, a visual icon language.
| WorldMaker wrote:
| Where would they belong if not Unicode?
|
| Emoticons are a "writing system" (whether or not you
| believe them to be more "text" or "meta-text") that needs
| encoding, and one that was encoded in standards that
| predated Unicode. (They were at first added for
| compatibility with CJK systems that had already been
| using them for some time. They've been greatly expanded
| since those original encodings that Unicode "inherited"
| from other text encodings, but they started as text
| encodings in other standards.)
|
| Personally, I think Unicode actually did the world a
| favor by taking ownership of emoji and standardizing
| them. There's nothing "weird" that emoji do that doesn't
| exist somewhere else in some other language's text
| encodings. Things like ZWJ sequences for complex
| graphemes existed in a number of languages that Unicode
| encoded well before emoji. Emoji being common and popular
| helps break certain Western-centric assumptions that
| language encodings are "simple" and 1-to-1 code point to
| grapheme, assumptions that had needed breaking for many
| years before emoji gave Unicode users a test that even
| "Western developers" had to respect if they wanted happy
| users. Better emoji support is better
| Unicode support across the lovely and incredible
| diversity of what Unicode encodes.
| WorldMaker wrote:
| Aside: Also, given the current, common vernacular usages
| of the eggplant and peach emoji _alone_, I find it
| difficult in 2023 to argue that emoji aren't text and
| aren't being used as a language inside text for and of
| text.
| baq wrote:
| What's the difference between kanji and emoji? Tradition?
| Most text is now created digitally, so no wonder you have
| reflexivity here. You might think that it's tail wagging
| the dog, but in truth the dog and the tail switched
| places. People want to write emoji, so it becomes
| Unicode.
| rkeene2 wrote:
| How many characters are in "offer"? 4 (because of the
| ligature for "ff") or 5?
| Sharlin wrote:
| Indeed, Swift is one of the very few languages that even has
| a stdlib API for working with this intuitive definition of a
| "character" (the technical Unicode term is "(extended)
| grapheme cluster", and the division of text into EGCs is
| called Unicode text segmentation, described by [uax29] -
| spoiler: definitely not trivial).
|
| [uax29]: https://unicode.org/reports/tr29/
| alpaca128 wrote:
| Agreed. I'm surprised the linked OP filed a bug against WebKit
| when I'd say it's the only correct implementation.
|
| HTML is for UIs, and I don't think many users would expect an
| emoji to be counted as 11 symbols when it looks like just
| one. If this weren't an emoji but a symbol from a more complex
| writing system, splitting it up would clearly be
| unacceptable, so I don't see why one should make an exception
| here just because this one known example of a family emoji
| still mostly survives the truncation.
| kevin_thibedeau wrote:
| This breaks down in the terminal where you need to know which
| codepoints are double width for managing layout. Some are
| ambiguous width because they've been present since before the
| inclusion of emoji and had optional emoji presentation added
| to them. The final determination is font dependent so you can
| never be sure without full insight into the rendering chain.
| pie_flavor wrote:
| This doesn't break down anything. If a grapheme cluster is
| double width and you treat it as 5x because there's 5 code
| points in it, then you've _still_ gotten the layout wrong.
| You can enforce text _storage_ length limitations to
| prevent infinite zalgo, but text _display_ length
| limitations should always deal in extended grapheme
| clusters.
| RobotToaster wrote:
| >Don't rely on it for things like limiting length for database
| fields (not that you should trust the client anyway).
|
| Do database engines all agree on edge cases like this?
| Manfred wrote:
| No, they don't even agree between engine implementations
| within the same database server. Generally limits in the
| database are defined as storage limits for however a
| character is defined. That usually means bytes or codepoints.
| jonhohle wrote:
| Or act like MySQL and cap a code point at 3 bytes when
| choosing UTF-8.
| bruce343434 wrote:
| to be fair, there's utf8mb4
| bonsaibilly wrote:
| Thankfully MySQL _also_ offers a non-gimped version of
| UTF-8 that one should always use in preference to the
| 3-byte version, but yeah it sucks that it's not the
| "obvious" version of UTF-8.
| tragomaskhalos wrote:
| Is this part of MySQL's policy of "do the thing I've
| always done, no matter how daft or broken that may be,
| unless I see an obscure setting telling me to do the new
| correct thing" ?
| bonsaibilly wrote:
| That'd be my guess, but I don't really know. They just
| left the "utf8" type as broken 3-byte gibbled UTF-8, and
| added the "utf8mb4" type and "utf8mb4_unicode_ci"
| collation for "no, actually, I want UTF-8 for real".
| WJW wrote:
| No, the default these days is the saner utf8mb4, if you
| create a new database on a modern MySQL version. But if
| you have an old database using the old encoding, then
| upgrading databases doesn't magically update the encoding,
| because some people take backwards compatibility seriously.
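|
| (The underlying issue, in one line: anything outside the BMP,
| emoji included, needs 4 bytes in UTF-8, which is exactly what
| the legacy 3-byte "utf8" cannot store:)
|
|     new TextEncoder().encode("\u{1F468}"); // Uint8Array [240, 159, 145, 168]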
| Manfred wrote:
| Any attempt at defining what people think of as characters is
| going to fail because of how many exceptions our combined writing
| systems have. See: codepoints, characters, grapheme clusters.
|
| A good starting place is UAX #29:
| https://www.unicode.org/reports/tr29/tr29-41.html
|
| However, the gold standard in UI implementations is that you
| never break the user's input.
| Dylan16807 wrote:
| > See: codepoints, characters, grapheme clusters.
|
| People don't think about code points and they definitely don't
| think about code units.
|
| What does "character" mean?
|
| Grapheme clusters aren't perfect but they're far ahead of code
| whatevers.
| rockwotj wrote:
| Personally I think this is much easier to digest than the above
| link: https://manishearth.github.io/blog/2017/01/14/stop-
| ascribing...
|
| It starts by breaking down common Unicode assumptions folks
| have.
___________________________________________________________________
(page generated 2023-02-24 23:01 UTC)