[HN Gopher] String length functions for single emoji characters ...
       ___________________________________________________________________
        
       String length functions for single emoji characters evaluate to
       greater than 1
        
       Author : kevincox
       Score  : 100 points
       Date   : 2021-03-26 12:40 UTC (10 hours ago)
        
 (HTM) web link (hsivonen.fi)
 (TXT) w3m dump (hsivonen.fi)
        
       | waterside81 wrote:
        | I like how Go handles this case by providing the
        | utf8.DecodeRune/utf8.DecodeRuneInString functions, which return
        | each individual "rune" along with its size in bytes.
        | 
        | Coming from Python 2, it was the first time I saw a language
        | handle Unicode so gracefully.
        
       | sanedigital wrote:
       | This may be the best tech writeup I've ever seen. Super in-depth,
       | easily readable. Really well done.
        
         | donaldihunter wrote:
         | With clickbaity misrepresentations of correctness,
         | unfortunately.
        
       | tzs wrote:
       | 5 in Perl, 17 in PHP.
        
       | ChrisSD wrote:
       | In summary the two useful measures of unicode length are:
       | 
       | * Number of native (preferably UTF-8) code units
       | 
       | * Number of extended grapheme clusters
       | 
       | The first is useful for basic string operations. The second is
       | good for telling you what a user would consider a "character".
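        | 
        | A minimal Python sketch of the two measures (an illustration
        | only; it assumes the third-party "regex" module, whose \X
        | pattern matches extended grapheme clusters):
        | 
        |     import regex  # pip install regex
        | 
        |     def utf8_len(s):
        |         # Code units: UTF-8 bytes, what you size buffers with.
        |         return len(s.encode("utf-8"))
        | 
        |     def grapheme_len(s):
        |         # Extended grapheme clusters: what a user would call
        |         # "characters".
        |         return len(regex.findall(r"\X", s))
        | 
        |     # The article's facepalm emoji, written with escapes
        |     # because HN strips emoji.
        |     s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
        |     print(utf8_len(s), grapheme_len(s))
        |     # 17 1 (one cluster with a recent Unicode database)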
        
         | emergie wrote:
          | Contemplate the two methods of writing a 'dz' digraph:
          | 
          |     dz - \u0064\u007a, 2 basic latin block codepoints
          |     DZ - \u0044\u005a
          |     Dz - \u0044\u007a
          | 
          |     dz - \u01f3, lowercase, single codepoint
          |     DZ - \u01f1, uppercase
          |     Dz - \u01f2, TITLECASE!
          | 
          | What happens if you try to express dz or dż from Polish
          | orthography?
          | 
          | You can use:
          | 
          |     dż - \u0064\u017c - d followed by
          |          'LATIN SMALL LETTER Z WITH DOT ABOVE'
          |     dż - \u0064\u007a\u0307 - d followed by z, followed by
          |          combining diacritical dot above
          |     dż - \u01f3\u0307 - the single-codepoint dz with
          |          combining diacritical dot above
          | 
          | multiplied by the uppercase and titlecase forms.
          | 
          | In Polish orthography the dz digraph is considered 2 letters,
          | despite being only one sound (głoska). I'm not so sure about
          | Macedonian orthography; they might count it as one thing.
          | 
          | The medieval ß is a letter/ligature that was created from sZ,
          | that is, a long s and a tailed z. In other words it is a form
          | of the 'sz' digraph. Contemporarily it is used only in German
          | orthography.
          | 
          | How long is ß?
          | 
          | By some rules uppercasing ß yields SS or SZ. Should uppercasing
          | or titlecasing operations change the length of a string?
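          | 
          | A rough Python illustration of that last question, using the
          | built-in (locale-unaware) case mappings:
          | 
          |     s = "stra\u00dfe"         # "straße": one code point for ß
          |     print(len(s))             # 6
          |     print(s.upper())          # STRASSE, a one-to-many mapping
          |     print(len(s.upper()))     # 7: uppercasing changed the length
          |     print(s.upper().lower())  # strasse: the round trip loses ß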
        
         | samatman wrote:
         | While I agree with this assessment, it means that these are the
         | basic string operations:
         | 
         | * Passing strings around
         | 
         | * Reading and writing strings from files / sockets
         | 
         | * Concatenation
         | 
         |  _Anything else_ should reckon with extended grapheme clusters,
         | whether it does so or not. Even proper upcasing is impossible
         | without knowing, for one example, whether or not the string is
         | in Turkish.
        
           | josephg wrote:
            | One difficulty in using extended grapheme clusters is that
            | the codepoints which get merged into a cluster change
            | depending on the Unicode version, and sometimes on the
            | platform and library. For collaborative editing, the basic
            | unit of measure is codepoints rather than grapheme clusters,
            | because you don't want any ambiguity about where an insert
            | is happening on different peers, and you don't want a
            | library bump to change how historical data is interpreted.
        
             | d110af5ccf wrote:
             | > For collaborative editing, the basic unit of measure is
             | codepoints
             | 
             | I'd quibble it's not the basic unit of measure so much as
             | how changesets are represented. The user edits based on
             | grapheme clusters. The final edit is then encoded using
             | codepoints, which makes sense because a changeset amounts
             | to a collection of basic string operations (splitting,
             | concatenating, etc). As you note, it would be undesirable
             | for changesets to be aware of higher level string
             | representation details.
        
               | samatman wrote:
               | For that matter, as long as the format is restricted to
               | one encoding, I don't see why the unit of a changeset
               | can't just be a byte array.
               | 
                | I can see why it would _happen_ to be a codepoint; that
                | might be ergonomic for the language. But it seems to me
                | that, like clustering codepoints together into
                | graphemes, clustering bytes into codepoints is something
                | the runtime takes care of, such that a changeset will be
                | a valid example of all three.
        
         | masklinn wrote:
         | > The first is useful for basic string operations.
         | 
         | The only thing it's useful for is sizing up storage. It does
         | nothing for "basic string operations" unless "basic string
         | operations" are solely 7-bit ascii manipulations.
        
           | ChrisSD wrote:
           | It's useful for any and all operations that involve indexing
           | a range.
           | 
           | Yes you locate the specific indexes by using extended
           | grapheme clusters, but you use the retrieved byte indexes to
           | actually perform the basic operations. These indexes can also
           | be cached so you don't have to recalculate their byte
           | position every time (so long as the string isn't modified).
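            | 
            | A sketch of that caching idea in Python (assuming the
            | third-party "regex" module for grapheme clusters):
            | 
            |     import regex
            | 
            |     def grapheme_byte_offsets(s):
            |         # Byte boundaries of extended grapheme clusters in
            |         # the UTF-8 encoding; valid until s is modified.
            |         offsets, pos = [0], 0
            |         for cluster in regex.findall(r"\X", s):
            |             pos += len(cluster.encode("utf-8"))
            |             offsets.append(pos)
            |         return offsets
            | 
            |     text = "a\u00e9\U0001F926"   # 'a', 'e acute', facepalm
            |     data = text.encode("utf-8")
            |     bounds = grapheme_byte_offsets(text)
            |     print(bounds)                              # [0, 1, 3, 7]
            |     print(data[bounds[1]:bounds[2]].decode())  # the 'e acute'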
        
             | [deleted]
        
             | masklinn wrote:
             | > It's useful for any and all operations that involve
             | indexing a range.
             | 
             | That has little to do with the string length, those indices
             | can be completely opaque for all you care.
             | 
             | > Yes you locate the specific indexes by using extended
             | grapheme clusters
             | 
             | Do you? I'd expect that the indices are mostly located
             | using some sort of pattern matching.
        
               | ChrisSD wrote:
               | And how does that pattern matching work?
        
               | masklinn wrote:
               | By traversing the string?
        
               | ChrisSD wrote:
               | Yes. Any pattern matching has to match on extended
               | grapheme clusters or else we're back to the old "treating
               | a string like it's 7-bit ascii" problem.
        
         | [deleted]
        
         | bikeshaving wrote:
         | Why do people prefer UTF-8 coordinates? While for storage I
         | think we should use UTF-8, when working with strings live it's
         | just so much easier to use UTF-16 because it's predictable: 1
         | unit for the basic plane and 2 for everything else (the multi-
         | character emoji and modifier stuff aside). I am probably biased
         | because I mostly use and think about DOMStrings which are
         | always UTF-16 but I'm not sure why people who use languages
         | which are more flexible about string representations than
         | JavaScript would not also appreciate this kind of regularity.
        
           | ChrisSD wrote:
           | The distinction between the basic plane and everything else
           | is not particularly useful, imho. So I'm not sure I
           | understand what the advantage of UTF-16 is here? It's an
           | arbitrary division either way.
        
           | vlmutolo wrote:
            | Performance is one reason. UTF-8 is up to twice as compact as
            | UTF-16, which allows much better cache locality. And for
            | Latin-like text, you're probably hitting that 2x limit most
            | of the time.
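            | 
            | A quick back-of-the-envelope comparison (numbers computed
            | with Python, as a sketch):
            | 
            |     samples = {
            |         "ascii": "hello world",
            |         "kana":  "\u3053\u3093\u306b\u3061\u306f",
            |         "emoji": "\U0001F926",   # astral plane
            |     }
            |     for name, text in samples.items():
            |         print(name,
            |               len(text.encode("utf-8")),
            |               len(text.encode("utf-16-le")))
            |     # ascii 11 22  -- UTF-8 is half the size
            |     # kana  15 10  -- UTF-16 wins for pure CJK/kana text
            |     # emoji  4  4  -- astral-plane characters tie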
        
           | samatman wrote:
           | The article explains this.
           | 
            | To summarize: the "codepoint" is a broken metric for what a
            | grapheme "is" in basically any context. An edge case might
            | be truncating with a guarantee that the substring is still
            | validly encoded? But really you want to truncate at extended
            | grapheme cluster boundaries. Truncating a country flag
            | between its two regional indicator symbols might not throw
            | errors in your code, but no user is going to consider the
            | result a valid string. The same is true of all manner of
            | composed characters, and there are a lot of them.
           | 
           | So the only advantage of using the very-often-longer UTF-16
           | encoding is that it's an attractive nuisance! This makes it
           | easier to write code which will do the wrong thing,
           | constantly, but at a low enough rate that developers will put
           | off fixing it.
           | 
           | Unicode is variable width, and _what width you need_ is
            | application-specific. That's the whole point of the article!
           | UTF-8 doesn't try to hide any of this from you.
        
           | pdpi wrote:
           | I struggle to see why you'd ever want UTF-16. If you're using
           | a variable length encoding, might as well stick to UTF-8. If
           | you want predictable sizes, there's UTF-32 instead.
           | 
           | ~Also, DOM Strings are not UTF-16, they're UCS-16.~
           | 
           | EDIT: UCS-2, not UCS-16. Also, I'm confusing the DOM with
           | EcmaScript, and even that hasn't been true in a while.
        
             | bikeshaving wrote:
             | UTF-32 is a bit like giving up and saying code units equal
             | code points right? I'm more interested in the comparison
             | between UTF-8 and UTF-16, where UTF-8 requires 1 to 3 bytes
             | just in the BMP, with 3 bytes for CJK characters. I'm
             | saying that as a quick measure of the actual length of a
             | string, UTF-16 is much more predictable and provides a nice
             | intuitive estimation of how long text actually is, as well
             | as providing a fair encoding for most commonly used
             | languages.
             | 
             | RE the UTF-16 vs UCS-2 stuff, that's probably a distinction
             | which has technical meaning but will collapse at some point
             | because no one actually cares, much like the distinction
             | between URI and URL.
        
               | masklinn wrote:
               | > I'm saying that as a quick measure of the actual length
               | of a string, UTF-16 is much more predictable
               | 
               | Meaning the software will deal much less well when it's
               | wrong.
               | 
               | > and provides a nice intuitive estimation of how long
               | text actually is
               | 
               | Not really due to combining codepoints, which make it not
               | useful.
               | 
               | > as well as providing a fair encoding for most commonly
               | used languages.
               | 
                | Which we know effectively doesn't matter: it's
                | essentially only a gain for pure CJK text stored at
                | rest, because otherwise the waste on ASCII more than
                | cancels out the gain.
        
             | josephg wrote:
              | Yes, UTF-16 / UCS-2 is a silly encoding and in retrospect
              | it was a mistake. It matters because for a while it was
              | believed that 2 bytes would be enough to encode any
              | Unicode character. During this time, lots of important
              | languages appeared which figured a 2-byte fixed-size
              | encoding was better than a variable-length encoding. So
              | Java, JavaScript, C# and others all use UCS-2. Now they
              | suffer the downsides of both using a variable-length
              | encoding _and_ being memory inefficient. The string.length
              | property in all these languages is almost totally
              | meaningless.
              | 
              | UTF-8 is the encoding you should generally reach for when
              | designing new systems or implementing a network protocol.
              | Rust, Go and other newer languages all use UTF-8
              | internally because it's better. Well, and in Go's case
              | because its author, Rob Pike, also had a hand in inventing
              | UTF-8.
              | 
              | Ironically, C and UNIX, which (mostly) stubbornly stuck
              | with single-byte character encodings, generally work
              | better with UTF-8 than a lot of newer languages do.
        
             | ChrisSD wrote:
             | > Also, DOM Strings are not UTF-16, they're UCS-16.
             | 
             | Hm, according to the spec they should be interpreted as
             | UTF-16 but this isn't enforced by the language so it can
             | contain unpaired surrogates:
             | 
             | From https://heycam.github.io/webidl/#idl-DOMString
             | 
             | > Such sequences are commonly interpreted as UTF-16 encoded
             | strings [RFC2781] although this is not required... Nothing
             | in this specification requires a DOMString value to be a
             | valid UTF-16 string.
             | 
             | From https://262.ecma-international.org/11.0/#sec-
             | ecmascript-lang...
             | 
              | > The String type is the set of all ordered sequences of
              | zero or more 16-bit unsigned integer values ("elements")
              | up to a maximum length of 2^53 - 1 elements. The String
              | type is
             | generally used to represent textual data in a running
             | ECMAScript program, in which case each element in the
             | String is treated as a UTF-16 code unit value... Operations
             | that do interpret String values treat each element as a
             | single UTF-16 code unit. However, ECMAScript does not
             | restrict the value of or relationships between these code
             | units, so operations that further interpret String contents
             | as sequences of Unicode code points encoded in UTF-16 must
             | account for ill-formed subsequences.
        
         | wodenokoto wrote:
         | > The first is useful for basic string operations.
         | 
          | Reversing a string is what I would consider a basic string
          | operation, but I also expect it not to break emoji and other
          | grapheme clusters.
         | 
         | Nothing is easy.
        
           | ChrisSD wrote:
           | Personally I'd consider reversing a string to be a pretty
           | niche use case and, as you say, it's a complex operation.
           | Especially in some languages.
           | 
           | Tbh I can't think of a time when I've actually needed to do
           | that.
        
           | [deleted]
        
           | thrower123 wrote:
           | This always comes up as a thing that people use as an
           | interview question or an algorithm problem. But for what
           | conceivable non-toy reason would you actually want to reverse
           | a string?
           | 
           | I have never once wanted to actually do this in real code,
           | with real strings.
           | 
           | Furthermore, the few times I've tried to do something like
           | this by being cute and encoding data in a string, I never get
           | outside the ASCII character set.
        
             | mprovost wrote:
             | To index from the end of the string. If you want the last 3
             | characters from a string, it's often easier to reverse the
             | string and take the first 3.
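                | 
                | For what it's worth, a Python sketch of a suffix that
                | doesn't reverse anything (it assumes the third-party
                | "regex" module; \X matches extended grapheme clusters):
                | 
                |     import regex
                | 
                |     def suffix(s, n):
                |         # Last n user-perceived characters.
                |         return "".join(regex.findall(r"\X", s)[-n:])
                | 
                |     s = "abcde\u0301"    # ends in 'e' + combining acute
                |     print(s[-3:])        # 3 code points, 2 visible chars
                |     print(suffix(s, 3))  # 'c', 'd', accented 'e'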
        
               | saagarjha wrote:
               | Most languages that are good enough at doing Unicode are
               | also modern enough to give you a "suffix" function.
        
             | macintux wrote:
             | Generating a list through recursion often involves
             | reversing it at the end, so I've reversed a lot of Erlang
             | strings in real code.
        
             | edflsafoiewq wrote:
             | And even ASCII can't do a completely dumb string reverse,
             | because of at least \r\n.
        
               | kps wrote:
               | Trivia/history question: There's a very good reason it's
               | \r\n and not \n\r.
        
               | btilly wrote:
                | The commands were used to drive a teletype machine.
                | Carriage return and newline were independent operations,
                | with the carriage return physically taking longer.
                | Sending \r first gave the carriage time to finish
                | returning before the next printable character arrived.
        
           | giva wrote:
            | I agree. Also, truncating a string won't work. The first 3
            | characters of "Österreich" are not "Ös" in my opinion.
        
           | darrylb42 wrote:
           | Outside of an interview have you ever needed to reverse a
           | string?
        
             | benibela wrote:
             | I have my personal library of string functions
             | 
             | Recently I rewrote the codepoint reverse function to make
             | it faster. The trick is to not write codepoints, but copy
             | the bytes.
             | 
              | On the first attempt I introduced two bugs.
        
             | tgv wrote:
             | Sorting by reversed string is not totally uncommon.
        
               | neolog wrote:
               | What is a use case for this?
        
               | carapace wrote:
               | I did this yesterday.                   find ... | rev |
               | sort | rev > sorted-filelist
               | 
               | I had several directories and I wanted to pull out the
               | _unique_ list of certain files (TrueType fonts) across
                | all of them, regardless of which subdirs they were in.
                | (I'm omitting the find CLI args for clarity; the command
               | just finds all the *.ttf (case-insensitive) files in the
               | dirs.)
               | 
               | By reversing the lines before sorting them (and un-
               | reversing them) they come out sorted and grouped by
               | filename.
        
           | mseepgood wrote:
            | Reversing a string does not make sense in most (human)
            | languages.
        
           | TacticalCoder wrote:
           | > Reversing a string
           | 
            | I wonder: several comments are saying it hardly makes any
            | sense to reverse a string, but surely there are useful
            | algorithms out there which do work by, at some point,
            | reversing strings, no? I mean: not just for the sake of
            | reversing, but for lookup/parsing or some such.
        
             | ChrisSD wrote:
             | Some parsing algorithms may want to reverse the bytes but
             | that's different to reversing the order of user-perceived
             | characters.
        
         | dahfizz wrote:
         | > Number of native (preferably UTF-8) code units
         | 
         | > The first is useful for basic string operations
         | 
         | Can you expand on this? I don't see why knowing the number of
         | code units would be useful except when calculating the total
         | size of the string to allocate memory. Basic string operations,
         | such as converting to uppercase, would operate on codepoints,
         | regardless of how many code units are used to encode that
         | codepoint.
         | 
          | Converting 'é' to 'É', for example, is an operation on one
          | codepoint but (in UTF-8) multiple code units.
        
           | josephg wrote:
           | > Can you expand on this? I don't see why knowing the number
           | of code units would be useful except when calculating the
           | total size of the string to allocate memory
           | 
           | I've used this for collaborative editing. If you want to send
           | a change saying "insert A at position 10", the question is:
           | what units should you use for "position 10"?
           | 
           | - If you use byte offsets then you have to enforce an
           | encoding on all machines, even when that doesn't make sense.
           | And you're allowing the encoding to become corrupted by edits
           | in invalid locations. (Which goes against the principle of
           | making invalid state impossible to represent).
           | 
           | - If you use grapheme clusters, the positions aren't portable
           | between systems or library versions. What today is position
           | 10 in a string might tomorrow be position 9 due to new
           | additions to the Unicode spec.
           | 
           | The cleanest answer I've found is to count using Unicode
           | codepoints. This approach is encoding-agnostic, portable,
           | simple, well defined and stable across time and between
           | platforms.
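            | 
            | A tiny sketch of what that looks like in Python, where
            | indexing already counts code points (names are invented for
            | the example):
            | 
            |     def apply_insert(doc, pos, text):
            |         # pos is a code point offset, so it means the same
            |         # thing on every peer, whatever encoding each peer
            |         # uses for local storage.
            |         return doc[:pos] + text + doc[pos:]
            | 
            |     doc = "he\u0301llo"              # 6 code points
            |     print(apply_insert(doc, 3, "A")) # insert at position 3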
        
           | tedunangst wrote:
            | Neither code points nor code units help with uppercasing ß.
        
           | 7786655 wrote:
           | >calculating the total size of the string to allocate memory
           | 
           | In some languages this is 90% of everything you do with
           | strings. In other languages it's still 90% of everything done
           | to strings, but done automatically.
        
       | throwawayffffas wrote:
        | Python 3's approach is clearly the best, because it focuses on
        | the problem at hand: Unicode codepoints. A string in Python is
        | a sequence of Unicode codepoints, so its length should be the
        | number of codepoints in the sequence; it has nothing to do with
        | bytes.
        | 
        | To draw an absurd parallel, "[emoji]".len() == 17 is equivalent
        | to [1,2].len() == 8 (two 32-bit integers).
        | 
        | In my opinion the most useful result in the case the article
        | describes is 5. There should of course be a way to get 1 (the
        | number of extended graphemes), but it should not be the
        | string's "length".
        
         | banthar wrote:
         | Defining string as a sequence of unicode codepoints is the
         | mistake.
         | 
         | Nobody ever cares about unicode codepoints. You either want the
         | number of bytes or the width of the string on screen.
         | 
         | UTF-32 codepoints waste space and give you neither.
        
           | stkdump wrote:
           | The width on the screen is in pixels. Yes, I find monospace
           | fonts increasingly pointless.
        
         | anaerobicover wrote:
         | But what do you do when you're processing a string with
         | codepoints that compose into one user-visible glyph?
          | >>> len("")
          | 2
        
           | jxy wrote:
           | yeah, and what do you do when you got a nonexistent country:
           | "".
        
             | Symbiote wrote:
             | Your example changes length depending on the font. (If a
             | new country gets the code DB, fonts will gradually be
             | updated to include the flag.)
             | 
             | I'll bet there are some systems in the PRC that don't
             | include .
        
         | nerdponx wrote:
         | Don't Swift and Go support iterating over graphemes? Edit: yes,
         | Swift is mentioned at the bottom of the article.
         | 
         | It'd be great to have a function for that in other scripting
         | languages like Python, Ruby, etc.
         | 
         | There was an interesting sub-thread here on HN a while ago
         | about how a string-of-codepoints is just as bad as a string-of-
         | bytes (subject to truncation, incorrect iteration, incorrect
         | length calculation) and that we should just have string-of-
         | bytes if we can't have string-of-graphemes. I don't agree, but
         | some people felt very strongly about it.
        
       | eximius wrote:
        | Gah, I wish I was done with my crate so I could point to lovely
        | formatted documentation...
       | 
       | But I believe this can be handled explicitly and well and I'm
       | trying to do that in my fuzzy string matching library based on
       | fuzzywuzzy.
       | 
       | https://github.com/logannc/fuzzywuzzy-rs/pull/26/files
        
       | hateful wrote:
       | If you do a formula in Google Sheets and it contains an emoji,
       | this comes into play. For example, if the first character is an
       | emoji and you want to reference it you need to do =LEFT(A1, 2) -
       | and not =LEFT(A1, 1)
        
       | SamBam wrote:
        | After having read the (quite interesting) article, I still don't
        | quite get the subtitle:
       | 
       | > 'But It's Better that "[emoji]".len() == 17 and Rather Useless
       | that len("[emoji]") == 5'
       | 
       | It sounds like it's just whether you're counting UTF-8/-16/-32
       | units. Does the article explain why one is worse and one is
       | "rather useless?"
        
         | dahfizz wrote:
         | I agree. The author talks a little bit about which UTF encoding
         | makes sense in which situation, but they never make an argument
         | about which result from len is correct.
         | 
         | My two cents is that string length should always be the number
         | of Unicode codepoints in the string, regardless of encoding. If
         | you want the byte length, I'm sure there is a sizeof equivalent
         | for your language.
         | 
         | When we call len() on an array, we want the number of objects
         | in the array. When we iterate over an array, we want to deal
         | with one object from the array at a time. We don't care how big
         | an object is. A large object shouldn't count for more than 1
         | when calculating the array length.
         | 
         | Similarly, a unicode codepoint is the fundamental object in a
         | string. The byte-size of a codepoint does not affect the length
         | of the string. It makes no sense to iterate over each byte in a
         | unicode string, because a byte on its own is completely
         | meaningless. len() should be the number of objects we can
         | iterate over, just like in arrays.
        
           | MayeulC wrote:
           | But what about combining characters?
           | https://en.wikipedia.org/wiki/Combining_Diacritical_Marks
           | 
           | Should the letter plus a combining character count as one (I
           | think so), or two characters? Should you normalize before
           | counting length? And so on.
        
             | dahfizz wrote:
              | Combining characters are their own unicode codepoint, so
              | they count towards length. The beauty of this approach is
              | that it's simple and objective.
              | 
              | If you had a list of 5 Dom Element objects, and one Dom
              | Attr object, the length of that list is 6. It's
              | nonsensical to say "the Attr object modifies an Element
              | object, so it's not really in the list".
        
               | d110af5ccf wrote:
               | > Combining characters are their own unicode codepoint,
               | so they count towards length.
               | 
               | This is incredibly arbitrary - it depends entirely on
               | what "length" means for a particular usecase. From the
               | user's perspective there might only be a single character
               | on the screen.
               | 
               | Any nontrivial string operation _must_ be based around
               | grapheme clusters, otherwise it is fundamentally broken.
               | Codepoints are a useful encoding agnostic way to handle
               | basic (split  & concatenate) operations, but the method
               | by which those offsets are determined needs to be
               | grapheme cluster aware. Raw byte offsets are encoding
               | specific and only really useful for allocating the
               | underlying storage.
        
               | shawnz wrote:
               | Going by bytes is also simple and objective. And also
               | totally arbitrary, just like going by codepoints.
               | 
               | Which is the most useful for dealing with strings in
               | practice though? Are either interpretations useful at
               | all?
        
               | dahfizz wrote:
               | Going byte by byte is useless. You can't do anything with
               | a single byte of a unicode codepoint (unless, by luck,
               | the codepoint is encoded in a single byte).
               | 
               | Codepoint is the smallest useful unit of a unicode
               | string. It is a character, and you can do all the
               | character things with it.
               | 
               | If you wanted to implement a toUpper() function for
               | example, you would want to iterate over all the
               | codepoints.
        
               | masklinn wrote:
               | > If you wanted to implement a toUpper() function for
               | example, you would want to iterate over all the
               | codepoints.
               | 
                | Nope. In order to deal with special casings you will
                | have to span multiple codepoints, at which point working
                | directly over the code units is no extra work.
        
           | ChrisSD wrote:
           | What is a string an array of?
           | 
           | If you ask the user, it's an array of characters (aka
           | extended grapheme clusters in Unicode speak).
           | 
           | If you ask the machine it's an array of integers (how many
           | bytes make up an integer depends on the encoding used).
           | 
           | Nothing really considers them an array of code points. Code
           | points are only useful as intermediary values when converting
           | between encodings or interpreting an encoded string as
           | grapheme clusters.
        
             | [deleted]
        
             | dahfizz wrote:
             | > If you ask the machine it's an array of integers
             | 
              | Not sure what you mean by this. A string is an array of
              | bytes, in the way that literally every array is an array
              | of bytes, but it's not "implemented" with integers. It's a
              | UTF-encoded array of bytes.
              | 
              | And what is the information that is encoded in those
              | bytes? Codepoints. That's what UTF does: it lets us store
              | Unicode codepoints as bytes. There is a reasonable
              | argument that the machine, or at least the developer,
              | considers a string as an array of codepoints.
        
               | ChrisSD wrote:
               | UTF-8 is an array of bytes (8 bit integers).
               | 
               | UTF-16 is an array of 16 bit integers.
               | 
               | UTF-32 is an array of 32 bit integers.
               | 
               | The machine doesn't know anything about code points. If
               | you want to index into the array you'll need to know the
               | integer offset.
        
               | dahfizz wrote:
               | > The machine doesn't know anything about code points. If
               | you want to index into the array you'll need to know the
               | integer offset.
               | 
               | The machine doesn't know anything about Colors either.
               | But if I defined a Color object, I would be able to put
               | Color objects into an array and count how many Color
               | objects I had. You're being needlessly reductive.
               | 
               | > UTF-8 is an array of bytes (8 bit integers)
               | 
               | UTF-8 encodes a codepoint with 1-4 single-byte code
               | units. The reason UTF-8 exists is to provide a way for
               | machines and developers to interact with unicode
               | codepoints.
               | 
               | Is a huffman code an array of bits? Or is it a list of
               | symbols encoded using bits?
        
               | ChrisSD wrote:
               | You seem to be thinking of the abstraction as a concrete
               | thing. A code point is like LLVM IR; an intermediary
               | language for describing and converting between encodings.
               | It is not a concrete thing in itself.
               | 
               | The concrete thing we're encoding is human readable text.
               | The atomic unit of which is the user perceived character.
               | 
               | I'm curious, what use is knowing the number of code
               | points in a string? It doesn't tell the user anything. It
               | doesn't even tell the programmer anything actionable.
        
       | Techyrack wrote:
       | #cryptocurrency #bitcoin #crypto
        
       | beaconstudios wrote:
        | Perhaps the issue is that there's a canonical "length" at all.
        | It would make more sense to me to have different types of length
        | depending on which measure you're after, like Swift apparently
        | has, but without the canonical `.count`. Because when there are
        | multiple interpretations of a thing's length, asking for
        | "length" leaves the developers to resolve the ambiguity, and I'm
        | of the firm belief that developers shouldn't be expected to be
        | psychic.
        
         | anaerobicover wrote:
          | The main reason, I think, that Swift strings have `count` is
          | that they conform to the `Collection` protocol. Swift's stdlib
          | has a pervasive "generic programming" philosophy, enabled by
          | various protocol hierarchies.
          | 
          | So, given that the property is required to be present, some
          | semantic or other had to be chosen. I am sure there were
          | debates when they were writing `String` about which one was
          | proper.
        
           | beaconstudios wrote:
           | that's very cringe of them.
           | 
           | > So, given that the property is required to be present, some
           | semantic or the other had to be chosen.
           | 
           | Sounds like an invented solution to an invented problem. The
           | programmer's speciality.
        
             | samatman wrote:
             | On the contrary it is based!
             | 
             | This conforms exactly to our intuition about what a
             | "collection" is. Some (exact) number of items which share
             | some meaningful property in common such that we can
             | consider them "the same" for purposes of enumeration and
             | iteration.
             | 
              | In the real world, we also have to decide what our
              | collection is a collection of! Let's say we have a pallet
              | of candy bars, each of which is in a display box. If we
              | want to ask "how many X" are on the pallet, we have to
              | decide whether we're asking about the candy bars or the
              | boxes. Clearly we should be able to answer questions about
              | both; just as clearly, _operations_ on the pallet should
              | work with boxes. We don't want to open them, and even if
              | we _do_ want to open them, we _have to_ open them; we
              | can't just ignore their existence!
              | 
              | I assert that the extended grapheme cluster is the "box"
              | around bytes in a string. Even if you do care about the
              | contents (very often you do not!), you have to know where
              | the boundaries are to do it! Because a Fitzpatrick skin
              | tone modifier all on its own has different semantics from
              | one found within an emoji modifier sequence.
             | 
             | So it makes perfect sense for Swift to provide one blessed
             | way to iterate over strings, and provide other options for
             | when you're interested in some other level of aggregation.
             | Which is what Swift does.
        
               | beaconstudios wrote:
               | I think the problem is that strings are ambiguous enough
               | a collection to warrant extra semantics.
               | 
               | The alternative to a collection would be an iterator or
               | other public method returning a collection-like accessor,
               | which would be a good compromise.
               | 
                | Though if you were to choose a canonical collection
                | quantum, then it'd probably be the grapheme cluster,
                | yeah.
               | 
               | Unfortunately OOP can never be based; only functional or
               | procedural programming can attain such standards.
        
               | saagarjha wrote:
                | Strings were done well, because they are just
                | BidirectionalCollection and not RandomAccessCollection
                | on graphemes, which is usually what you would want
               | (especially as an app developer writing user-facing
               | code). The other views are collections in their own
               | right. By conforming to Collection a string can do things
               | like be sliced and prefixed and searched over "for free",
               | which are extremely common operations to define
               | generically.
        
               | samatman wrote:
               | OOP using a meta-object protocol is very based indeed.
               | 
               | Unfortunately, only Common Lisp and Lua do it that way.
               | 
               | Actor models are pretty based as well, and have a better
               | historical claim to the title "object oriented
               | programming" than class ontologies do, but that ship has
               | sailed.
        
               | beaconstudios wrote:
               | Yes agreed; I'm guessing by meta object that is the
               | module pattern from fp with syntax sugar? I use the
               | module pattern with typescript interfaces plus namespaces
               | and it's pretty great.
               | 
               | 100% on the actor model. My visual programming platform
               | is basically based on actors, but the core data model is
               | cybernetic (persistence is done via self referentiality).
               | Alan Kay got shafted by the creation of C++, OO in the
               | original conception was very based.
        
           | xoudini wrote:
           | In Swift 3 (and probably previous versions as well),
           | `String.count` defaulted to the count of the Unicode scalar
           | representation. In this version, iterating over a string
           | would operate on each Unicode scalar, which often doesn't
           | make sense due to the expected behaviour of extended grapheme
           | clusters. So, this is my best guess why `String` in Swift 4
           | and later ended up with the current default behaviour.
        
       | maxnoe wrote:
       | https://pypi.org/project/grapheme/
        
       | patrickas wrote:
        | Raku seems to be more correct (DWIM) in this regard than all the
        | examples given in the post...
        | 
        |     my \emoji = "\c[FACE PALM]\c[EMOJI MODIFIER FITZPATRICK TYPE-3]\c[ZERO WIDTH JOINER]\c[MALE SIGN]\c[VARIATION SELECTOR-16]";
        | 
        |     # One character
        |     say emoji.chars; # 1
        | 
        |     # Five code points
        |     say emoji.codes; # 5
        | 
        |     # If I want to know how many bytes that takes up in
        |     # various encodings...
        |     say emoji.encode('UTF8').bytes;  # 17 bytes
        |     say emoji.encode('UTF16').bytes; # 14 bytes
       | 
       | Edit: Updated to use the names of each code point since HN cannot
       | display the emoji
        
         | duskwuff wrote:
          | You can represent it as a sequence of escapes. If Raku handles
          | this the same way as Perl 5, it should be:
          | 
          |     $a = "\N{FACE PALM}\N{EMOJI MODIFIER FITZPATRICK TYPE-3}\N{ZERO WIDTH JOINER}\N{MALE SIGN}\N{VARIATION SELECTOR-16}";
        
           | csharptwdec19 wrote:
           | now do it in YAML
        
         | cjm42 wrote:
         | And if you try to say emoji.length, you'll get an error:
         | 
         | No such method 'length' for invocant of type 'Str'. Did you
         | mean any of these: 'codes', 'chars'?
         | 
         | Because as the article points out, the "length" of a string is
         | an ambiguous concept these days.
        
       | j1elo wrote:
        | If you're working with an image, you might have an Image class
        | that has an Image.width and an Image.height in pixels,
        | regardless of how those pixels are laid out in memory (which
        | depends on encoding, colorspace, etc). Most if not all methods
        | operate on these pixels, e.g. cropping, scaling, color
        | filtering, etc. Then there might be an Image.memory property
        | that provides access to the underlying, actual bytes.
        | 
        | I don't understand why the same is not the obvious solution for
        | strings. len("") should be 1, because we as humans read one
        | emoji as one character, regardless of the memory layout behind
        | it. Most if not all methods operate on characters (reversing,
        | concatenating, indexing).
        | 
        | And then, if you need access to the low-level data,
        | String.memory would contain the actual bytes... which would
        | differ depending on the actual text encoding.
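        | 
        | A hypothetical sketch of that interface in Python (all names
        | here are invented; grapheme splitting uses the third-party
        | "regex" module):
        | 
        |     import regex
        | 
        |     class Text:
        |         def __init__(self, s, encoding="utf-8"):
        |             # User-perceived characters and the raw bytes.
        |             self._clusters = regex.findall(r"\X", s)
        |             self._memory = s.encode(encoding)
        | 
        |         def __len__(self):
        |             return len(self._clusters)
        | 
        |         @property
        |         def memory(self):
        |             return self._memory
        | 
        |     t = Text("\U0001F926\U0001F3FC\u200D\u2642\uFE0F")
        |     print(len(t), len(t.memory))   # 1 17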
        
         | dnautics wrote:
          | The number of bytes necessary is _incredibly_ important for
          | security reasons. It's arguably better to make the number of
          | bytes the primary value and have a secondary lookup for the
          | number of glyphs.
          | 
          | To be fair, some systems distinguish between size and length
          | (with size expected to be O(1) and length allowed to be up to
          | O(n)). For those systems, proceed as the parent suggests.
        
       | mannykannot wrote:
       | In my curmudgeonly way, I suggest zero is the correct value, to
       | reflect the information content of almost all emoji use.
        
         | elliekelly wrote:
         | Counterpoint: Appending U+1F609, U+1F61C, or U+1F643 to the end
         | of your comment would have added immense value in communicating
         | your point because it would have lightened the otherwise harsh
         | tone of your words :)
        
         | Kaze404 wrote:
         | With that metric the length of your comment should be zero as
         | well. Probably mine also :)
        
       | bob1029 wrote:
       | We need to deal with mountains of user input from mobile devices
       | and the worst we have run into is "smart" quotes from iOS devices
       | hosing older business systems. Our end users are pretty good
       | about not typing pirate flags into customer name fields.
       | 
       | I still haven't run into a situation where I need to count the
       | logical number of glyphs in a string. Any system that we need to
       | push a string into will be limiting things on the basis of byte
       | counts, so a .Length check is still exactly what we need.
       | 
       | Does this cause trouble for anyone else?
        
         | ravi-delia wrote:
         | I have used glyph counting a handful of times, mostly for width
         | computing before I learned there were better ways. I'm 100%
         | sure my logic was just waiting to fail on any input that didn't
         | use the Latin alphabet.
        
       | dooglius wrote:
       | I think this title needs an exemption from HN's no-emoji rule
        
         | kyberias wrote:
         | Why is there such a rule?
        
           | TeMPOraL wrote:
           | Surprisingly many people try to abuse Unicode when posting
           | submissions and comments. This includes not just emojis, but
           | also other symbols, or glyphs that look like bold, underlined
           | or otherwise stylized letters.
        
             | masklinn wrote:
             | > Surprisingly many people try to abuse Unicode when
             | posting submissions and comments.
             | 
             | .@snqa buipio^a ta qor poob a Yons s@op sr@toaraYo jo Younq
             | a buiuuaq ySirartiqra @snao@
        
             | criddell wrote:
             | I don't think they've banned it entirely.
        
               | masklinn wrote:
               | HN's character stripping is completely arbitrary.
               | 
               | You can play mahjong, domino or cards (, , ) but not
               | chess, you can show some random glyphs like [?] or  but
               | not others, you can use box-drawing characters (+) but
               | not block elements, you can use byzantine musical
               | notation () but only western musical symbols () and the
               | notes are banned, you can Za[?][?][?]lg[?]o just fine.
        
           | colejohnson66 wrote:
           | Emojis aren't professional (depending on who you ask) and can
           | be overused (look at a few popular GitHub projects' READMEs -
           | emojis on every bullet point of the feature list, for
           | example)
        
             | tester34 wrote:
             | I agree that they aren't the most professional thing to
             | use, but I'm not sure why I agree
             | 
             | Maybe it's some kind of bias?
        
             | Kaze404 wrote:
             | What a weird reasoning. In most contexts I've seen,
             | "professionalism" doesn't really do anything except strip
             | away all the human factor that goes into every interaction.
             | Personally, I don't care whether the person I'm talking to
             | is being "professional" or not. What I care is that they're
             | respectful and can properly communicate their thoughts.
             | 
             | With that mindset, emojis (and emoticon) can actually add
             | context to interactions, considering how much context is
             | lost when communicating over text. A simple smiley face at
             | the end of a sentence can go a long way in my experience :)
        
             | fhifjfhjjjk wrote:
             | emojis aren't "professional" is your reasoning?
             | 
             | is it 1987? should women also wear their skirts below the
             | knee?
        
               | colejohnson66 wrote:
               | _I_ didn't say that they weren't. I'm simply saying that
               | some do think that. Whether I do or don't is irrelevant
               | as I'm only speculating on the reason why.
        
           | pessimizer wrote:
           | This is a forum sponsored by opinionated people who find them
           | annoying. I find them annoying, too, so I think it's a good
            | convention. Somehow ":)" became an industry and, as
            | industries do, then eliminated ":)" (which gets replaced
            | with a smiling yellow head in most places).
            | 
            | It isn't the only arbitrary convention enforced here, and
            | the sum of those conventions is what attracts the audience.
        
           | saagarjha wrote:
           | Emoji draw attention to themselves in a way that plain text
           | does not. I assume this is why Hacker News does not allow
           | bolding text and strongly discourages use of ALL CAPS.
        
           | at_a_remove wrote:
            | Also, looking at the article, they are quite complex! It
            | looks like handling emojis in a proper manner requires a
            | large investment, and the payoff is somewhat small.
        
             | masklinn wrote:
             | Stripping some arbitrary subset of characters is a lot more
             | work than just letting them through, which is what HN would
             | otherwise be doing: the hard work is done by the browser,
             | HN doesn't render emoji or lay text out.
        
             | Someone wrote:
             | For HN I would think almost all the complexity is in
             | rendering. That's a job for your browser.
             | 
             | What's left is things like the max length for a title (not
             | too problematic to count that in code points or bytes, I
             | think)
             | 
             | The big risk, I think, is in users (mis)using rendering
             | direction, multiple levels of accents, zero-width
             | characters and weird code points to mess up the resulting
             | page. Some Unicode symbols and emojis typically look a lot
             | larger and heavier than 'normal' characters, switching
             | writing direction may change visual indentation of a
             | comment, etc.
             | 
             | Worse, Unicode rendering being hard, text rendering bugs
             | that crash browsers or even entire devices are discovered
             | every now and then. If somebody entered one in a HN
             | message, that could crash millions of devices.
        
             | wongarsu wrote:
             | Emoji don't introduce anything that isn't used for various
             | other languages. Emojis are just the most visible breakage
             | for American users when you screw up your unicode handling.
             | 
             | HN however handles all of unicode just fine, it just chose
             | to specifically exclude emojis (and a bunch of other
             | symbols)
        
               | MayeulC wrote:
                | I think emoji are wonderful for widespread Unicode
                | support. They combat ossification. They are also a nice
                | carrot to entice users to install updates.
               | 
               | However, I don't like the rabbit hole it started to go
               | into with gendered and colored emojis. There's never
               | going to be enough. I wish we had stuck to color-neutral
               | and gender-neutral, like original smileys :)
               | 
               | I find it also conveys too much meaning. I am generally
               | not interested in knowing a person's gender or ethnic
               | group when discussing over text... but I digress.
        
               | samatman wrote:
               | I've said this before, but not here:
               | 
               | Eventually, Unicode will allow you to combine flag
               | modifiers with human emojis. So you can have a black
               | _South Korean_ man facepalming.
               | 
               | This will trigger a war which ends industrial
               | civilization, and is the leading candidate for the Great
               | Filter.
        
         | colejohnson66 wrote:
          | I'm curious: is the no-emoji rule a rule that happens to block
          | emoji, or a hardcoded rule? What I mean is: emojis (in UTF-16)
          | have to use surrogate pairs because all their code points are
          | in the U+1xxxx plane. Is the software just disallowing any
          | character that needs two UTF-16 code units to encode (which
          | would include emoji)? Or is it specifically singling out the
          | emoji blocks?
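          | 
          | Using Python as a calculator for the surrogate-pair point (an
          | illustration, not a statement about HN's actual filter):
          | 
          |     def utf16_units(cp):
          |         return len(chr(cp).encode("utf-16-le")) // 2
          | 
          |     print(utf16_units(0x1F926))  # 2: astral, surrogate pair
          |     print(utf16_units(0x2642))   # 1: BMP, a single code unit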
        
           | kps wrote:
           | It seems to have changed recently. I recall a thread about
            | plastics a few months ago where the plastic type symbols ([?]
            | through [?], i.e. U+2673 through U+2679) disappeared, but I
            | see them now.
           | 
           | Edits:
           | 
           |  that post was https://news.ycombinator.com/item?id=25237688
           | 
           |  The rules seem pretty arbitrary. Recycling symbol U+2672 [?]
           | is allowed but recycled paper symbol U+267C is not. Chess
           | kings  are allowed but checkers kings aren't.
           | 
           |  is allowed (for now).
           | 
           |  I think the right thing to do would be to strip anything
           | with emoji _presentation_
           | http://unicode.org/reports/tr51/#Presentation_Style
        
             | saagarjha wrote:
              | It's not that simple, as many OSes render characters that
              | aren't strictly emoji "as emoji", and there is no standard
              | way to check for this.
        
           | saagarjha wrote:
           | The latter.
        
           | wongarsu wrote:
           | It's a very deliberate and precise filter. I can write in old
           | persian: "" (codepoints 0x103xx) or CJK "" (codepoints
           | 0x2000x), but can't write "" (emoji with codepoint 0x1F605)
        
       | colejohnson66 wrote:
       | I was wondering what the title meant. Turns out HN's emoji
       | stripper screwed with the title.
       | 
       | It's asking why a _skin toned_ (not yellow) facepalm emoji's
       | length is 7 when the user perceives it as a single character.
       | 
        | Tangent: Emojis are an interesting topic in regards to
        | programming. They challenged the "rule" held by programmers that
        | every character is a single code unit (of 8 or 16 bits). So,
        | str[1234] would get me the 1235th character, when it's actually
        | the 1235th _byte_ (or 16-bit unit). UTF-8 threw a wrench in
        | that, but many programmers went along ignoring reality.
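        | 
        | A small Python illustration of that indexing pitfall:
        | 
        |     s = "caf\u00e9"   # "cafe" with an accented final 'e'
        |     print(s[3])       # the accented 'e': indexes a code point
        |     b = s.encode("utf-8")
        |     print(b[3])       # 195: just the first byte of that char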
       | 
        | Sadly, preexisting languages such as Arabic weren't enough of a
        | warning in regards to line breaking. As in: an Arabic
       | "character"[a] can change its width depending on if there's a
       | "character" before or after it (which gives it its cursive-like
       | look). So, a naive line breaking routine could cause bugs if it
       | tried to break in the middle of a word. Tom Scott has a nice
       | video on it that was filmed when the "effective power" crash was
       | going around.[0]
       | 
       | [0]: https://youtu.be/hJLMSllzoLA
       | 
       | [a]: Arabic script isn't _technically_ an alphabet like Latin
       | characters are. It's either an _abugida_ or _abjad_ (depending on
       | who you ask). See Wikipedia:
       | https://en.wikipedia.org/wiki/Arabic_script
        
         | MengerSponge wrote:
          | Interesting. Also, many of us use fonts with ligatures, which
          | render several letters as a single glyph (for example: tt, ti,
          | ff, Th, ffi)
         | Of course, we're taught to parse that as multiple discrete
         | letters from an early age, so we don't get confused :)
        
       ___________________________________________________________________
       (page generated 2021-03-26 23:02 UTC)