[HN Gopher] Drag an emoji family with a string size of 11 into a...
___________________________________________________________________
Drag an emoji family with a string size of 11 into an input with
maxlength=10
Author : robin_reala
Score : 190 points
Date : 2023-02-24 15:31 UTC (7 hours ago)
(HTM) web link (mastodon.social)
(TXT) w3m dump (mastodon.social)
| a_c wrote:
| Edit 2: Actually it took 7 backspaces to obliterate the whole
| family. Tough one.
|
| Edit: Oops. I was trying to type an emoji into an HN comment;
| apparently it is not supported.
|
| I never knew that [emoji of family of 4] takes 5 backspaces to
| eliminate. It goes from [emoji of family of 4] to [emoji of
| family of 3] to [father and mother] to [father] to [father].
| Somehow [father] can take double the (key)punches of the rest
| of his family.
| WirelessGigabit wrote:
| There is a ZWJ after father. That would explain the double
| punch.
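|
| (For illustration, a quick console sketch of that structure; the
| emoji is assumed to be the standard 4-person family, which
| matches the 11-code-unit string in the title:)
|
|     // U+1F468 MAN, ZWJ, U+1F469 WOMAN, ZWJ, U+1F467 GIRL, ZWJ,
|     // U+1F466 BOY
|     const family =
|       "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
|     [...family].map(c => c.codePointAt(0).toString(16));
|     // ["1f468", "200d", "1f469", "200d", "1f467", "200d", "1f466"]
|     family.length; // 11 UTF-16 code units (4 surrogate pairs + 3 ZWJs)
|     // If each backspace removes one code point, that is the 7
|     // presses observed above.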
| hn_throwaway_99 wrote:
| Lol, sorry kid, we can't afford the minivan, you've got to go.
| graypegg wrote:
| Odd. It makes sense on a technical level given these ZWJ
| characters, but hiding the implementation also makes sense from
| Safari's point of view. I'd actually prefer that as a UNIVERSAL
| standard, at least for UI: count visible symbols, not
| characters.
|
| But I can also imagine this being problematic when setting
| validation rules elsewhere, so now there's a subtle footgun
| buried in most web forms.
|
| I guess the thing to learn here is to not rely on maxlength=
| zuhsetaqi wrote:
| The thing every developer should have learned from the
| beginning is to never rely on anything that comes from user
| input or the client.
| turmeric_root wrote:
| agreed, i discard user input in all of my apps
| graypegg wrote:
| I hope no one is relying on it! But in the context of a
| regular user, maxlength measuring all chars is fail-safe (the
| user is allowed to submit a long string that should be caught
| by validation elsewhere) vs measuring visible chars, which is
| fail-dangerous (the user can't submit a valid string because
| the front-end validation stops them).
|
| Which is weird
| lapcat wrote:
| The bigger problem is that web browsers silently truncate strings
| over the maxlength. This behavior is particularly nasty for
| secure password fields.
| Wowfunhappy wrote:
| If anyone would like to try this out:
|
| https://codepen.io/Wowfunhappy/pen/dyqpMXO?editors=1000
| waltbosz wrote:
| I forked your code and made it a bit more fun. Presenting: the
| Magical Shrinking Emoji Family.
|
| https://codepen.io/waltbosz/pen/zYJKBEE
| brycedriesenga wrote:
| The father seems so much happier to be alone.
| jefftk wrote:
| And he stops shaving his mustache. Perhaps his family
| didn't like it?
| test6554 wrote:
| Ugh, how did we get here?
| rom-antics wrote:
| Edge cases like this will just get more common as Unicode keeps
| getting more complex. There was a fun slide in this talk[1] that
| suggests Unicode might be Turing-complete due to its case-
| folding rules.
|
| I miss when Unicode was just a simple list of codepoints. (Get
| off my lawn)
|
| [1]:
| https://seriot.ch/resources/talks_papers/20171027_brainfuck_...
| WorldMaker wrote:
| These "edge cases" have always existed in Unicode. Languages
| with ZWJ needs have existed in Unicode since the beginning.
| That emoji put a spotlight on this for especially English-
| speaking developers with assumptions that language encodings
| are "simple", is probably one of the best things about the
| popularity of emoji.
| GuB-42 wrote:
| I think we need some kind of standard "Unicode-light" with
| limitations that allow it to be used on low-spec hardware and
| without weird edge cases like this. A bit like video codecs,
| which have "profiles": sets of limitations you can adhere to in
| order to avoid overwhelming low-end hardware.
|
| It wouldn't be "universal", but enough to write in the most
| commonly used languages, and maybe support a few single-
| codepoint special characters and emoji.
| sbierwagen wrote:
| Why bother with a standard? Just deny by default and
| whitelist the codepoints you care about.
|
| Plenty of software already does that-- HN itself doesn't
| allow emoji, for example.
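|
| (A minimal sketch of the deny-by-default idea in JS; the
| property classes kept here are illustrative, not what HN
| actually permits:)
|
|     // Keep letters, numbers, punctuation and spaces; drop the rest.
|     const sanitize = s => s.replace(/[^\p{L}\p{N}\p{P}\p{Zs}\n]/gu, "");
|     sanitize("hello \u{1F600} world"); // "hello  world"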
| rom-antics wrote:
| I don't think HN is deny-by-default. Most emojis are
| stripped, but some still get through. I don't know why those
| would be whitelisted.
| zerocrates wrote:
| Is it just that whole block that's allowed? I feel like I've
| seen some of the legacy characters that have been kind of
| "promoted" to emoji allowed here. Let's see, is the
| all-important [emoji] allowed?
| GuB-42 wrote:
| The point of having a standard is to know which ones to
| deny.
|
| Do a study over a wide range of documents of which
| characters are used the most, see if there are
| alternatives to the characters that don't make it, etc.
|
| This is unlike the HN ban on emoji, which I think is more
| of a political decision than a technical one. Most people
| on HN use systems that can read emoji just fine, but they
| decided that the site would be better without it. This
| would be more technical, a way to balance inclusivity and
| technical constraints, something between ASCII and full
| Unicode.
| zokier wrote:
| The subset of codepoints that are included in NFKD would
| probably make a decent starting point for such a standard;
| if you want to be even more restrictive, limit it to the BMP.
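|
| (A rough sketch of what the "limit to the BMP" restriction
| means in practice:)
|
|     // Every code point must fit in 16 bits (no surrogate pairs).
|     const isBmpOnly = s => [...s].every(c => c.codePointAt(0) <= 0xFFFF);
|     isBmpOnly("h\u00E9llo");  // true
|     isBmpOnly("\u{1F468}");   // false: U+1F468 lies outside the BMP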
| klausa wrote:
| The _entire point_ of Unicode is to be _universal_; you're
| just suggesting we go back to pre-Unicode days and use
| different code pages.
|
| This, for reasons stated in responses already posted, does
| not work, to put it mildly, in an international context
| (e.g. the vast majority of software written every day).
| themerone wrote:
| You can't express the most commonly used languages without
| multi-code-point graphemes.
|
| If you want to eliminate edge cases you would need to
| introduce new, incompatible code points and a 32- or 64-bit
| fixed-length encoding, depending on how many languages you
| want to support.
| GuB-42 wrote:
| Extra pages for these extra code points wouldn't seem far-
| fetched to me. We already have single-code-point letters
| with diacritics like "é", and a huge code page of Hangul,
| in which each code point is a combination of characters.
|
| As for encoding, a 32-bit fixed length should be sufficient;
| I can't believe that we would need billions of symbols,
| combinations included, in order to write in the most common
| languages, though I may be wrong.
|
| Also, "limiting" doesn't necessarily mean "single code
| point only", but more like "only one diacritic, only from
| this list, and only for these characters", so that the
| combination fits in a certain size limit (e.g. a 32-bit
| word), and the engine only has to process a limited
| number of use cases.
| Dylan16807 wrote:
| Unicode always had combining characters. Is this so different
| from accent marks disappearing? And the Hangul pieces were
| there from 1.0/1.1.
| ryandrake wrote:
| Seems like the obvious root cause for these sorts of things is
| languages (and developers) that can't or won't differentiate
| between "byte length" and "string length". Joel warned us all
| about this 20 years (!!) ago[1], and we're still struggling with
| it.
|
| 1: https://www.joelonsoftware.com/2003/10/08/the-absolute-
| minim...
| rockwotj wrote:
| String length isn't even a well defined term. Do you mean
| codepoint length? Or the number of graphemes?
| ryandrake wrote:
| Good point. When it comes to text there are a lot of
| "lengths". Languages should differentiate between each of
| those lengths, and programmers should pay attention to the
| differences and actually use the right one for whatever
| situation they're programming.
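|
| (To make the different "lengths" concrete, a console sketch
| using the family emoji; Intl.Segmenter is assumed to be
| available, as in current browsers and recent Node:)
|
|     const s =
|       "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
|     new TextEncoder().encode(s).length;  // 25 UTF-8 bytes
|     s.length;                            // 11 UTF-16 code units
|     [...s].length;                       //  7 code points
|     [...new Intl.Segmenter("en", { granularity: "grapheme" })
|       .segment(s)].length;               //  1 grapheme cluster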
| [deleted]
| ok123456 wrote:
| multi-character graphemes were a mistake.
| jonhohle wrote:
| It seems odd to suggest the bug is with Safari. Normal humans
| (and even most developers!) don't care that the byte length of an
| emoji in a particular encoding that may or may not be under their
| control defines the maximum number of "characters" in a text box
| ("character" here meaning a logical collection of code points,
| each of which may fit into one or more bytes).
| twic wrote:
| It's a bug with Safari because the HTML spec defines maxlength
| as applying to the number of UTF-16 code units [1]:
|
| > Constraint validation: If an element has a maximum allowed
| value length, its dirty value flag is true, its value was last
| changed by a user edit (as opposed to a change made by a
| script), and the length of the element's API value is greater
| than the element's maximum allowed value length, then the
| element is suffering from being too long.
|
| Where "the length of" is a link to [2]:
|
| > A string's length is the number of code units it contains.
|
| And "code units" is a link to [3]:
|
| > A string is a sequence of unsigned 16-bit integers, also
| known as code units.
|
| I agree with your implied point that this is a questionable
| definition, though!
|
| [1] https://html.spec.whatwg.org/multipage/form-control-
| infrastr...
|
| [2] https://infra.spec.whatwg.org/#string-length
|
| [3] https://infra.spec.whatwg.org/#code-unit
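|
| (Concretely, under that definition the check a browser is
| expected to make is roughly this sketch, not WebKit's actual
| code:)
|
|     const input = document.querySelector("input[maxlength]");
|     // .value.length and .maxLength both count UTF-16 code units
|     const tooLong = input.value.length > input.maxLength;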
| Someone wrote:
| I am not sure it's a bug. [1] also says (emphasis added)
|
| _"User agents _may_ prevent the user from causing the
| element's API value to be set to a value whose length is
| greater than the element's maximum allowed value length."_
|
| and we have _"MAY This word, or the adjective "OPTIONAL",
| mean that an item is truly optional"_ [2]
|
| [1] https://html.spec.whatwg.org/multipage/form-control-
| infrastr...
|
| [2] https://www.ietf.org/rfc/rfc2119.txt
| syrrim wrote:
| Apple clearly intended to implement it though, and did so
| incorrectly.
| Longhanks wrote:
| IMHO that is then a "bug" in the HTML spec, which should be
| "fixed" to speak of extended grapheme clusters instead, since
| that is what users would probably expect.
| WorldMaker wrote:
| Except "maxlength" is often going to be added to fields to
| deal with storage/database limitations somewhere on the
| server side and those are almost always going to be in
| terms of code points, so counting extended grapheme
| clusters makes it harder for server-side storage maximums
| to agree with the input.
|
| Choosing 16-bit code units as the base keeps consistency
| with the naive string length counting of JS in particular,
| which has always returned length in 16-bit code units.
|
| Sure, it's slightly to the detriment of user experience in
| terms of how a user expects graphemes to be counted, but it
| avoids later storage problems.
| biftek wrote:
| Yeah, the author's conclusion is flawed. If I enter an emoji and
| a different one appears, I'm just going to assume your website
| is broken. Safari is in the right here.
| jfk13 wrote:
| It's not that clear-cut. If you enter that 11-code-unit emoji
| into a field that has maxLength=10, and the browser lets you
| do so, but the backend system that receives the data only
| stores 10 code units, you're worse off -- because you
| probably won't realise your data got corrupted -- than if the
| browser had prevented you from entering/submitting it.
| test6554 wrote:
| Unicode... There may be some regrets.
| SquareWheel wrote:
| > Except in Safari, whose maxlength implementation seems to treat
| all emoji as length 1. This means that the maxlength attribute is
| not fully interoperable between browsers.
|
| No, it's definitely not. You can read the byte length more
| directly in JS, and use that to decide whether more text is
| allowed:
|
|     const encoder = new TextEncoder();
|     const currentBytes = encoder.encode(inputStr).byteLength;
|
| But the maxlength attribute is at best an approximation. Don't
| rely on it for things like limiting length for database fields
| (not that you should trust the client anyway).
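|
| (For example, a sketch of wiring that byte count up to an
| input; "byteLimit" is a made-up application constraint, not a
| standard attribute:)
|
|     const input = document.querySelector("input");
|     const byteLimit = 40; // whatever the backend column can hold
|     input.addEventListener("input", () => {
|       const bytes = new TextEncoder().encode(input.value).byteLength;
|       input.setCustomValidity(bytes > byteLimit ? "Too long" : "");
|     });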
| ASalazarMX wrote:
| Apple's take seems more reasonable. When a user uses an emoji,
| they think of it as a single symbol; they don't care about the
| Unicode implementation or its length in bytes. IMO this should
| be the standard, and all other interpretations are a repeat of
| the transition from ASCII to Unicode.
| duskwuff wrote:
| It isn't just for emoji, either. More ordinary grapheme
| clusters like characters with combining accents are also
| perceived by users as a single character, and should be
| counted as one by software.
| david422 wrote:
| The max length isn't for the user though. It's to prevent the
| user from inputting more data than the system can handle.
| withinboredom wrote:
| The user can submit data of any length... browsers aren't
| required to implement validation for you. Nor is your user
| required to use a browser or client you wrote. This is
| security and application development 102, welcome to my
| class.
| tobr wrote:
| Max length is there to communicate to the user how much
| data the system can handle.
| scatters wrote:
| The number of graphemes an emoji displays as depends on the
| platform and font. How many "characters" does Safari think
| Ninja Cat[1] is? It displays as a single grapheme on Windows
| 10.
|
| 1. https://emojipedia.org/ninja-cat/
| MereInterest wrote:
| And depends on the location. For example, the characters
| U+1F1FA U+1F1F8 are the regional indicators "U" and "S",
| and are rendered as . These are two separate codepoints
| that may together be displayed as a United States flag.
| Similarly, the regional indicators "T" and "W" are rendered
| as and "H" and "K" are rendered as . On my system, this is
| rendered as the flag of Taiwan and Hong Kong, respectively.
| Depending on where you live, these regional indicators
| might not be rendered as flags.
|
| Edit: Looks like HN stripped out the character codes. The
| effect can be reproduced by copy-pasting the symbols from
| https://en.wikipedia.org/wiki/Regional_indicator_symbol
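|
| (A console sketch of the mechanism; whether the pair renders as
| a flag or as two boxed letters is up to the font/platform:)
|
|     // Map "us" -> U+1F1FA U+1F1F8 (REGIONAL INDICATOR U, S)
|     const flag = cc => String.fromCodePoint(
|       ...[...cc.toUpperCase()].map(ch => 0x1F1E6 + ch.charCodeAt(0) - 65));
|     [...flag("us")].map(c => c.codePointAt(0).toString(16));
|     // ["1f1fa", "1f1f8"]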
| WorldMaker wrote:
| > Depending on where you live, these regional indicators
| might not be rendered as flags.
|
| This is also why Microsoft-vended emoji fonts don't
| include flags/"regional indicators" support _at all_. On
| a Windows machine you just see the letters US or TW in an
| ugly boxed type and nothing like a flag.
|
| The interesting background story goes: everyone remembers
| that Windows 95 had that fancy graphical Time Zone
| selector that looked real pretty in screenshots (but
| wasn't actually the best way to select time zones
| anyway). The reasons Microsoft removed that tool are less
| well known and almost entirely geopolitical: time zone
| borders often correspond to country borders and several
| countries got mad at their border looking wrong and sued
| Microsoft for "lying" about their borders. After a bunch
| of micro-changes to the graphics of that widget and then
| a lot of money spent on lawsuits and geopolitical fights,
| Microsoft eventually just entirely removed the widget
| because it was a "nice-to-have" and never actually
| necessary. It is said that from that fight Microsoft
| decided that it never wanted to be involved in that sort
| of geopolitics ever again, and flags are nothing if not
| geopolitical symbols, so Microsoft's emoji fonts hardly
| encode any flags at all. (In exchange they encode
| a lot of fun "cats with jobs" that other font vendors
| still don't. I find that a fun trade-off myself.)
| crazygringo wrote:
| According to the Unicode standard there is a canonical
| notion of countable graphemes, with zero ambiguity.
|
| I don't actually know whether ninja cat would count as one
| or two though. The spec for calculating grapheme boundaries
| is actually several pages long.
|
| And the OS might not obey the Unicode standard. Ninja cat
| appears to be proprietary to Microsoft.
| [deleted]
| WorldMaker wrote:
| What I recall of the standard algorithm is that it does
| include font/user-agent/locale pieces in its calculations
| and "zero ambiguity" is a bit of a stretch,
| unfortunately.
|
| Even the overview includes this note:
|
| > Note: Font-based information may be required to
| determine the appropriate unit to use for UI purposes
|
| First of all, grapheme counting depends on normalization
| and there's like 6 Unicode normalization algorithms to
| consider, depending on locale and intended use/display
| style.
|
| Keep in mind that grapheme counting includes things like
| natural ligatures, which English has always had a rough
| time counting. Things like `fi` and `ft` sometimes
| form single visual graphemes from the font's perspective
| in some serif fonts, but are _always_ supposed to be
| counted as two graphemes from the user's perspective.
|
| Relatedly, one of the simpler parts of the Unicode
| standard algorithm is that the ZWJ _always_ signals a
| grapheme cluster/merged grapheme, so per the standard
| algorithm the Ninja Cat is _always_ a single grapheme.
| Even though the usual emoji rules make all the non-
| Microsoft fonts that don't include Cat+ZWJ+Job sequences
| ("cats with jobs" emojis) fall back to two display
| characters to try to get the idea across, the spec does
| say that they should still count that as only a single
| grapheme even when not presented as such.
|
| (ETA Aside: I think the cats-with-jobs emojis are great
| and fun and should be more widely adopted outside of just
| Microsoft. Why should only the people emojis have jobs?)
| Dylan16807 wrote:
| > First of all, grapheme counting depends on
| normalization and there's like 6 Unicode normalization
| algorithms to consider, depending on locale and intended
| use/display style.
|
| It says "The boundary specifications are stated in terms
| of text normalized according to Normalization Form NFD
| (see Unicode Standard Annex #15, "Unicode Normalization
| Forms" [UAX15]). In practice, normalization of the input
| is not required."
|
| And there's also "Even in Normalization Form NFC, a
| syllable block may contain a precomposed Hangul syllable
| in the middle."
|
| So are you sure normalization matters? I'm excluding K
| normalizations since they're not for general use and
| alter a bunch of characters on purpose. But also that's
| only NFC/NFD/NFKC/NFKD, are there two others or was "six"
| just a misremembering?
| WorldMaker wrote:
| "like 6" was intentional hyperbole, but I appreciate the
| technical clarifications.
|
| > I'm excluding K normalizations since they're not for
| general use
|
| "Not for general use" doesn't mean that they aren't
| _somebody 's_ use and still a complicating factor in
| everything else.
| Dylan16807 wrote:
| Nobody that cares about keeping text intact uses K. And
| you don't have to care about how _they_ count characters.
| It won't affect your system, and they're probably
| counting something dumb.
|
| NFC/NFD will give you the same graphemes.
|
| Or to put it another way, there's only two real
| normalizations, and there's some weird junk off to the
| side.
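|
| (A quick illustration that NFC and NFD segment into the same
| graphemes, using "e" plus a combining acute accent and assuming
| Intl.Segmenter is available:)
|
|     const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
|     const nfc = "e\u0301".normalize("NFC"); // 1 code point: U+00E9
|     const nfd = "\u00E9".normalize("NFD");  // 2 code points: U+0065 U+0301
|     [...nfc].length;              // 1
|     [...nfd].length;              // 2
|     [...seg.segment(nfc)].length; // 1
|     [...seg.segment(nfd)].length; // 1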
| thih9 wrote:
| > This Emoji ZWJ Sequence has not been Recommended For
| General Interchange (RGI) by Unicode. Expect limited cross-
| platform support.
| ASalazarMX wrote:
| I'm not saying Apple's implementation should be the gold
| standard, just that characters that the Unicode
| specification allows to combine into one character should
| be counted as one character.
|
| I understand that technical limitations will sometimes
| cause a combined character to be displayed as more than one
| character, as happens in Safari with that ninja cat, and
| in those cases it's preferable to decompose the emoji, or
| use a tofu character, than to change it for another.
| jakelazaroff wrote:
| They are decomposing the emoji, are they not? The
| resulting glyph (with three family members rather than
| four) is just the result of the ligatures between the
| remaining emojis.
| tobr wrote:
| From a user perspective, a family emoji is one character.
| Whether it's implemented as a ligature seems irrelevant,
| as that's not visible to the user.
| junon wrote:
| Emoji was such a mistake in my opinion. It's cluttered
| the Unicode spec for a long time now and it's a continual
| source of problems.
| gpderetta wrote:
| Emojis simply manifest existing bugs in programs. Those
| bugs would have existed anyway as emojis are not the only
| symbols beyond the BMP nor the only symbols composed of
| grapheme clusters.
|
| Being much more frequent than other more exotic
| characters they just reveal bugs more often.
| tobr wrote:
| An incredibly popular and successful mistake, in that
| case. Is the purpose of Unicode to be an uncluttered
| spec, or to be a useful way to represent text?
| wwalexander wrote:
| Emojis aren't text, though. Nobody "writes" emoji outside
| of the digital world, so I don't think emoji could
| accurately be called "text" or part of a "writing system":
| > Unicode, formally The Unicode Standard,[note 1][note 2]
| is an information technology standard for the consistent
| encoding, representation, and handling of text expressed
| in most of the world's writing systems
|
| People write smiley faces, but there was already a code
| point for basic smileys.
|
| I think standardized emoticons are a good thing, but I
| don't think the Unicode spec is where they best belong.
| tobr wrote:
| I don't see anything in the quoted passage that limits it
| to non-digital writing systems.
| yamtaddle wrote:
| Smiley faces and hearts and all kinds of stuff appear in
| hand-written text.
| Nevermark wrote:
| Given that emoji are most typically used in text, having
| it all together seems better than two bodies coming up
| with incompatible solutions to emoji and the glyph issues
| of various languages.
|
| In the end emoji is a language, a visual icon language.
| WorldMaker wrote:
| Where would they belong if not Unicode?
|
| Emoticons are a "writing system" (whether or not you
| believe them to be more "text" or "meta-text") that needs
| encoding, and one that was encoded in standards that
| predated Unicode. (They were at first added for
| compatibility with CJK systems that had already been
| using them for some time. They've been greatly expanded
| since those original encodings that Unicode "inherited"
| from other text encodings, but they started as text
| encodings in other standards.)
|
| Personally, I think Unicode actually did the world a
| favor by taking ownership of emoji and standardizing
| them. There's nothing "weird" that emoji do that doesn't
| exist somewhere else in some other language's text
| encodings. Things like ZWJ sequences for complex
| graphemes existed in a number of languages that Unicode
| encoded well before emoji. Emoji being common and popular
| helps break certain Western-centric assumptions that
| language encodings are "simple" and 1-to-1 code point to
| grapheme, assumptions that had needed breaking for many
| years before emoji gave Unicode users a test that even
| "Western developers" had to respect if they wanted happy
| users. Better emoji support is better
| Unicode support across the lovely and incredible
| diversity of what Unicode encodes.
| WorldMaker wrote:
| Aside: Also, given the current, common vernacular usages
| of the eggplant and peach emoji _alone_, I find it
| difficult in 2023 to argue that emoji aren't text and
| aren't being used as a language inside text for and of
| text.
| baq wrote:
| What's the difference between kanji and emoji? Tradition?
| Most text is now created digitally, so no wonder you have
| reflexivity here. You might think that it's tail wagging
| the dog, but in truth the dog and the tail switched
| places. People want to write emoji, so it becomes
| Unicode.
| rkeene2 wrote:
| How many characters are in "offer"? 4 (because of the
| ligature for "ff") or 5?
| Sharlin wrote:
| Indeed, Swift is one of the very few languages that even has
| a stdlib API for working with this intuitive definition of a
| "character" (the technical Unicode term is "(extended)
| grapheme cluster", and the division of text into EGCs is
| called Unicode text segmentation, described by [uax29] -
| spoiler: definitely not trivial).
|
| [uax29]: https://unicode.org/reports/tr29/
| alpaca128 wrote:
| Agreed. I'm surprised the linked OP filed a bug against WebKit
| when I'd say it's the only correct implementation.
|
| HTML is for UIs, and I don't think many users would expect an
| emoji to be counted as 11 symbols when it looks like just
| one. If this weren't an emoji but a symbol from a more complex
| writing system, splitting it up would clearly be
| unacceptable, so I don't see why one should make an exception
| here just because this one known example of a family emoji
| still mostly survives the truncation.
| kevin_thibedeau wrote:
| This breaks down in the terminal where you need to know which
| codepoints are double width for managing layout. Some are
| ambiguous width because they've been present since before the
| inclusion of emoji and had optional emoji presentation added
| to them. The final determination is font dependent so you can
| never be sure without full insight into the rendering chain.
| pie_flavor wrote:
| This doesn't break down anything. If a grapheme cluster is
| double width and you treat it as 5x because there's 5 code
| points in it, then you've _still_ gotten the layout wrong.
| You can enforce text _storage_ length limitations to
| prevent infinite zalgo, but text _display_ length
| limitations should always deal in extended grapheme
| clusters.
| RobotToaster wrote:
| >Don't rely on it for things like limiting length for database
| fields (not that you should trust the client anyway).
|
| Do database engines all agree on edge cases like this?
| Manfred wrote:
| No, they don't even agree between engine implementations
| within the same database server. Generally limits in the
| database are defined as storage limits for however a
| character is defined. That usually means bytes or codepoints.
| jonhohle wrote:
| Or act like MySQL and cap a code point at 3 bytes when
| choosing UTF-8.
| bruce343434 wrote:
| to be fair, there's utf8mb4
| bonsaibilly wrote:
| Thankfully MySQL _also_ offers a non-gimped version of
| UTF-8 that one should always use in preference to the
| 3-byte version, but yeah it sucks that it's not the
| "obvious" version of UTF-8.
| tragomaskhalos wrote:
| Is this part of MySQL's policy of "do the thing I've
| always done, no matter how daft or broken that may be,
| unless I see an obscure setting telling me to do the new
| correct thing" ?
| bonsaibilly wrote:
| That'd be my guess, but I don't really know. They just
| left the "utf8" type as broken 3-byte gibbled UTF-8, and
| added the "utf8mb4" type and "utf8mb4_unicode_ci"
| collation for "no, actually, I want UTF-8 for real".
| WJW wrote:
| No, the default these days is the saner utf8mb4, if you
| create a new database on a modern MySQL version. But if
| you have an old database using the old encoding, then
| upgrading databases doesn't magically update the encoding,
| because some people take backwards compatibility seriously.
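|
| (The underlying issue, in one line: anything outside the BMP,
| emoji included, needs 4 bytes in UTF-8, which is exactly what
| the legacy 3-byte "utf8" cannot store:)
|
|     new TextEncoder().encode("\u{1F468}"); // Uint8Array [240, 159, 145, 168]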
| Manfred wrote:
| Any attempt at defining what people think of as characters is
| going to fail because of how many exceptions our combined writing
| systems have. See: codepoints, characters, grapheme clusters.
|
| A good starting place is UAX #29:
| https://www.unicode.org/reports/tr29/tr29-41.html
|
| However, the gold standard in UI implementations is that you
| never break the user's input.
| Dylan16807 wrote:
| > See: codepoints, characters, grapheme clusters.
|
| People don't think about code points and they definitely don't
| think about code units.
|
| What does "character" mean?
|
| Grapheme clusters aren't perfect but they're far ahead of code
| whatevers.
| rockwotj wrote:
| Personally I think this is much easier to digest than the above
| link: https://manishearth.github.io/blog/2017/01/14/stop-
| ascribing...
|
| It starts by breaking down common Unicode assumptions folks
| have.
___________________________________________________________________
(page generated 2023-02-24 23:01 UTC)