[HN Gopher] String length functions for single emoji characters ...
___________________________________________________________________
String length functions for single emoji characters evaluate to
greater than 1
Author : kevincox
Score : 100 points
Date : 2021-03-26 12:40 UTC (10 hours ago)
(HTM) web link (hsivonen.fi)
(TXT) w3m dump (hsivonen.fi)
| waterside81 wrote:
| I like how Go handles this case by providing the
| utf8.DecodeRune/utf8.DecodeRuneInString functions, which return
| each individual "rune" as well as its size in bytes.
|
| Coming from Python 2, for me it was the first time I saw a
| language handle unicode so gracefully.
| sanedigital wrote:
| This may be the best tech writeup I've ever seen. Super in-depth,
| easily readable. Really well done.
| donaldihunter wrote:
| With clickbaity misrepresentations of correctness,
| unfortunately.
| tzs wrote:
| 5 in Perl, 17 in PHP.
| ChrisSD wrote:
| In summary the two useful measures of unicode length are:
|
| * Number of native (preferably UTF-8) code units
|
| * Number of extended grapheme clusters
|
| The first is useful for basic string operations. The second is
| good for telling you what a user would consider a "character".
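|
| For example, in Rust, where the byte count is built in and the
| grapheme count comes from a crate (e.g. unicode-segmentation),
| the two measures look something like:
|
|     use unicode_segmentation::UnicodeSegmentation;
|
|     // U+1F926 U+1F3FC U+200D U+2642 U+FE0F (facepalm + skin
|     // tone + ZWJ + male sign + variation selector)
|     let s = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";
|     assert_eq!(s.len(), 17);                  // UTF-8 code units
|     assert_eq!(s.graphemes(true).count(), 1); // grapheme clusters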
| emergie wrote:
| Contemplate 2 methods of writing a 'dz' digraph:
|
|     dz - \u0064\u007a, 2 basic latin block codepoints
|     DZ - \u0044\u005a
|     Dz - \u0044\u007a
|
|     dz - \u01f3, lowercase, single codepoint
|     DZ - \u01f1, uppercase
|     Dz - \u01f2, TITLECASE!
|
| What happens if you try to express dź or dż from Polish
| orthography?
|
| You can use:
|
|     dż - \u0064\u017c - d followed by 'LATIN SMALL LETTER Z
|          WITH DOT ABOVE'
|     dż - \u0064\u007a\u0307 - d followed by z, followed by
|          combining diacritical dot above
|     dż - \u01f3\u0307 - dz with combining diacritical dot above
|
| ...multiplied by uppercase and titlecase forms.
|
| In Polish orthography the dz digraph is considered 2 letters,
| despite being only one sound (głoska). I'm not so sure about
| Macedonian orthography; they might count it as one thing.
|
| Medieval ß is a letter/ligature that was created from ſʒ,
| that is, a long s and a tailed z. In other words it is a form
| of the 'sz' digraph. Contemporarily it is used only in German
| orthography.
|
| How long is ß?
|
| By some rules uppercasing ß yields SS or SZ. Should
| uppercasing or titlecasing operations change the length of a
| string?
| samatman wrote:
| While I agree with this assessment, it means that these are the
| basic string operations:
|
| * Passing strings around
|
| * Reading and writing strings from files / sockets
|
| * Concatenation
|
| _Anything else_ should reckon with extended grapheme clusters,
| whether it does so or not. Even proper upcasing is impossible
| without knowing, for one example, whether or not the string is
| in Turkish.
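|
| To illustrate in Rust, whose standard case mapping is
| locale-independent by design:
|
|     // Default Unicode case mapping, no locale tailoring:
|     assert_eq!("i".to_uppercase(), "I");
|     // Correct Turkish casing would instead produce dotted
|     // U+0130 ("İ"), which needs locale-aware tailoring
|     // (e.g. via ICU), not just per-codepoint mapping.
|     assert_eq!("\u{130}", "İ");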
| josephg wrote:
| One difficulty in using extended grapheme clusters is that
| the codepoints which will be merged into a cluster changes
| depending on the Unicode version, and sometimes the platform
| and library. For collaborative editing, the basic unit of
| measure is codepoints rather than grapheme clusters because
| you don't want any ambiguity about where an insert is
| happening on different peers. Or for a library bump to change
| how historical data is interpreted.
| d110af5ccf wrote:
| > For collaborative editing, the basic unit of measure is
| codepoints
|
| I'd quibble it's not the basic unit of measure so much as
| how changesets are represented. The user edits based on
| grapheme clusters. The final edit is then encoded using
| codepoints, which makes sense because a changeset amounts
| to a collection of basic string operations (splitting,
| concatenating, etc). As you note, it would be undesirable
| for changesets to be aware of higher level string
| representation details.
| samatman wrote:
| For that matter, as long as the format is restricted to
| one encoding, I don't see why the unit of a changeset
| can't just be a byte array.
|
| I can see why it would _happen_ to be a codepoint, this
| might be ergonomic for the language, but it seems to me
| that, like clustering codepoints together in graphemes,
| clustering bytes into codepoints is something the runtime
| takes care of, such that a changeset will be a valid
| example of all three.
| masklinn wrote:
| > The first is useful for basic string operations.
|
| The only thing it's useful for is sizing up storage. It does
| nothing for "basic string operations" unless "basic string
| operations" are solely 7-bit ascii manipulations.
| ChrisSD wrote:
| It's useful for any and all operations that involve indexing
| a range.
|
| Yes you locate the specific indexes by using extended
| grapheme clusters, but you use the retrieved byte indexes to
| actually perform the basic operations. These indexes can also
| be cached so you don't have to recalculate their byte
| position every time (so long as the string isn't modified).
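|
| In Rust, for instance, that division of labor looks something
| like this (using the unicode-segmentation crate):
|
|     use unicode_segmentation::UnicodeSegmentation;
|
|     let s = "a\u{301}bc"; // "a" + combining acute, then "bc"
|     // Locate the first user-perceived character...
|     let (start, g) = s.grapheme_indices(true).next().unwrap();
|     // ...then do the actual operation with plain byte indexes:
|     let rest = &s[start + g.len()..];
|     assert_eq!(rest, "bc");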
| [deleted]
| masklinn wrote:
| > It's useful for any and all operations that involve
| indexing a range.
|
| That has little to do with the string length, those indices
| can be completely opaque for all you care.
|
| > Yes you locate the specific indexes by using extended
| grapheme clusters
|
| Do you? I'd expect that the indices are mostly located
| using some sort of pattern matching.
| ChrisSD wrote:
| And how does that pattern matching work?
| masklinn wrote:
| By traversing the string?
| ChrisSD wrote:
| Yes. Any pattern matching has to match on extended
| grapheme clusters or else we're back to the old "treating
| a string like it's 7-bit ascii" problem.
| [deleted]
| bikeshaving wrote:
| Why do people prefer UTF-8 coordinates? While for storage I
| think we should use UTF-8, when working with strings live it's
| just so much easier to use UTF-16 because it's predictable: 1
| unit for the basic plane and 2 for everything else (the multi-
| character emoji and modifier stuff aside). I am probably biased
| because I mostly use and think about DOMStrings which are
| always UTF-16 but I'm not sure why people who use languages
| which are more flexible about string representations than
| JavaScript would not also appreciate this kind of regularity.
| ChrisSD wrote:
| The distinction between the basic plane and everything else
| is not particularly useful, imho. So I'm not sure I
| understand what the advantage of UTF-16 is here? It's an
| arbitrary division either way.
| vlmutolo wrote:
| Performance is one reason. UTF-8 is up to twice as compact as
| UTF-16, which allows much better cache locality. And for
| Latin-like text, you're probably frequently hitting that full
| 2x savings.
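|
| Concretely, counting code units in Rust:
|
|     let latin = "hello";
|     assert_eq!(latin.len(), 5);                  // UTF-8: 5 bytes
|     assert_eq!(latin.encode_utf16().count(), 5); // UTF-16: 10 bytes
|
|     let cjk = "\u{4F60}\u{597D}"; // two CJK characters
|     assert_eq!(cjk.len(), 6);                    // UTF-8: 3 bytes each
|     assert_eq!(cjk.encode_utf16().count(), 2);   // UTF-16: 2 bytes each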
| samatman wrote:
| The article explains this.
|
| To summarize: the "codepoint" is a broken metric for what a
| grapheme "is" in basically any context. The edge case might be
| truncating with a guarantee that the substring remains validly
| encoded? But really you want to truncate at extended
| grapheme cluster boundaries. Truncating a country flag
| between two of the regional indicator symbols might not throw
| errors in your code, but no user is going to consider that a
| valid string. The same is true of all manner of composed
| characters, and there are a lot of them.
|
| So the only advantage of using the very-often-longer UTF-16
| encoding is that it's an attractive nuisance! This makes it
| easier to write code which will do the wrong thing,
| constantly, but at a low enough rate that developers will put
| off fixing it.
|
| Unicode is variable width, and _what width you need_ is
| application-specific. That's the whole point of the article!
| UTF-8 doesn't try to hide any of this from you.
| pdpi wrote:
| I struggle to see why you'd ever want UTF-16. If you're using
| a variable length encoding, might as well stick to UTF-8. If
| you want predictable sizes, there's UTF-32 instead.
|
| ~Also, DOM Strings are not UTF-16, they're UCS-16.~
|
| EDIT: UCS-2, not UCS-16. Also, I'm confusing the DOM with
| EcmaScript, and even that hasn't been true in a while.
| bikeshaving wrote:
| UTF-32 is a bit like giving up and saying code units equal
| code points right? I'm more interested in the comparison
| between UTF-8 and UTF-16, where UTF-8 requires 1 to 3 bytes
| just in the BMP, with 3 bytes for CJK characters. I'm
| saying that as a quick measure of the actual length of a
| string, UTF-16 is much more predictable and provides a nice
| intuitive estimation of how long text actually is, as well
| as providing a fair encoding for most commonly used
| languages.
|
| RE the UTF-16 vs UCS-2 stuff, that's probably a distinction
| which has technical meaning but will collapse at some point
| because no one actually cares, much like the distinction
| between URI and URL.
| masklinn wrote:
| > I'm saying that as a quick measure of the actual length
| of a string, UTF-16 is much more predictable
|
| Meaning the software will deal much less well when it's
| wrong.
|
| > and provides a nice intuitive estimation of how long
| text actually is
|
| Not really due to combining codepoints, which make it not
| useful.
|
| > as well as providing a fair encoding for most commonly
| used languages.
|
| Which we know effectively doesn't matter: it's
| essentially only a gain for pure CJK text being stored at
| rest, because otherwise the waste on ASCII will more than
| compensate for the gain.
| josephg wrote:
| Yes, UTF-16 / UCS-2 is a silly encoding and in retrospect it
| was a mistake. It matters because for a while it was
| believed that 2 bytes would be enough to encode any Unicode
| character. During this time, lots of important languages
| appeared which figured a 2-byte fixed-size encoding was
| better than a variable-length encoding. So Java,
| JavaScript, C# and others all use UCS-2. Now they suffer
| the downsides of both using a variable-length encoding
| _and_ being memory inefficient. The string.length property
| in all these languages is almost totally meaningless.
|
| UTF-8 is the encoding you should generally always reach for
| when designing new systems, or when implementing a network
| protocol. Rust, Go and other newer languages all use UTF-8
| internally because it's better. Well, and in Go's case
| because its author, Rob Pike, also had a hand in inventing
| UTF-8.
|
| Ironically, C and UNIX, which (mostly) stubbornly stuck with
| single-byte character encodings, generally work better with
| UTF-8 than a lot of newer languages.
| ChrisSD wrote:
| > Also, DOM Strings are not UTF-16, they're UCS-16.
|
| Hm, according to the spec they should be interpreted as
| UTF-16 but this isn't enforced by the language so it can
| contain unpaired surrogates:
|
| From https://heycam.github.io/webidl/#idl-DOMString
|
| > Such sequences are commonly interpreted as UTF-16 encoded
| strings [RFC2781] although this is not required... Nothing
| in this specification requires a DOMString value to be a
| valid UTF-16 string.
|
| From https://262.ecma-international.org/11.0/#sec-
| ecmascript-lang...
|
| > The String type is the set of all ordered sequences of
| zero or more 16-bit unsigned integer values ("elements") up
| to a maximum length of 2^53 - 1 elements. The String type is
| generally used to represent textual data in a running
| ECMAScript program, in which case each element in the
| String is treated as a UTF-16 code unit value... Operations
| that do interpret String values treat each element as a
| single UTF-16 code unit. However, ECMAScript does not
| restrict the value of or relationships between these code
| units, so operations that further interpret String contents
| as sequences of Unicode code points encoded in UTF-16 must
| account for ill-formed subsequences.
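|
| A small Rust demonstration of that consequence: a lone
| surrogate is a legal element of a JS string but not legal
| UTF-16, so a strict decoder must reject or replace it:
|
|     // 0xD83D is a high surrogate with no low surrogate after it.
|     let ill_formed: Vec<u16> = vec![0xD83D];
|     assert!(String::from_utf16(&ill_formed).is_err());
|     // Lossy decoding substitutes U+FFFD instead:
|     assert_eq!(String::from_utf16_lossy(&ill_formed), "\u{FFFD}");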
| wodenokoto wrote:
| > The first is useful for basic string operations.
|
| Reversing a string is what I would consider basic string
| operations, but I also expect it not to break emoji and other
| grapheme clusters.
|
| Nothing is easy.
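|
| For what it's worth, with a UAX #29 segmentation library
| (e.g. Rust's unicode-segmentation crate) a grapheme-aware
| reverse is nearly a one-liner:
|
|     use unicode_segmentation::UnicodeSegmentation;
|
|     fn reverse_graphemes(s: &str) -> String {
|         // Reverse user-perceived characters while keeping the
|         // codepoints inside each cluster in their original order.
|         s.graphemes(true).rev().collect()
|     }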
| ChrisSD wrote:
| Personally I'd consider reversing a string to be a pretty
| niche use case and, as you say, it's a complex operation.
| Especially in some languages.
|
| Tbh I can't think of a time when I've actually needed to do
| that.
| [deleted]
| thrower123 wrote:
| This always comes up as a thing that people use as an
| interview question or an algorithm problem. But for what
| conceivable non-toy reason would you actually want to reverse
| a string?
|
| I have never once wanted to actually do this in real code,
| with real strings.
|
| Furthermore, the few times I've tried to do something like
| this by being cute and encoding data in a string, I never get
| outside the ASCII character set.
| mprovost wrote:
| To index from the end of the string. If you want the last 3
| characters from a string, it's often easier to reverse the
| string and take the first 3.
| saagarjha wrote:
| Most languages that are good enough at doing Unicode are
| also modern enough to give you a "suffix" function.
| macintux wrote:
| Generating a list through recursion often involves
| reversing it at the end, so I've reversed a lot of Erlang
| strings in real code.
| edflsafoiewq wrote:
| And even ASCII can't do a completely dumb string reverse,
| because of at least \r\n.
| kps wrote:
| Trivia/history question: There's a very good reason it's
| \r\n and not \n\r.
| btilly wrote:
| The commands were used to drive a teletype machine.
| Carriage return and newline were independent operations,
| with carriage return physically taking longer. Sending \r
| first let the carriage finish returning while the \n was
| processed, so the machine was ready for the next letter
| sooner.
| giva wrote:
| I agree. Also, truncating a string by bytes won't work. The
| first 3 characters of "Österreich" are not "Ös" in my opinion.
| darrylb42 wrote:
| Outside of an interview have you ever needed to reverse a
| string?
| benibela wrote:
| I have my personal library of string functions
|
| Recently I rewrote the codepoint reverse function to make
| it faster. The trick is to not write codepoints, but copy
| the bytes.
|
| On the first attempt I introduced two bugs.
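|
| A minimal sketch of the trick in Rust (not the actual library
| code):
|
|     fn reverse_codepoints(s: &str) -> String {
|         let mut out = Vec::with_capacity(s.len());
|         let mut end = s.len();
|         // Walk codepoint boundaries from the back, copying each
|         // codepoint's UTF-8 bytes as an uninterpreted block.
|         for (start, _) in s.char_indices().rev() {
|             out.extend_from_slice(&s.as_bytes()[start..end]);
|             end = start;
|         }
|         String::from_utf8(out).unwrap() // still valid UTF-8
|     }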
| tgv wrote:
| Sorting by reversed string is not totally uncommon.
| neolog wrote:
| What is a use case for this?
| carapace wrote:
| I did this yesterday:
|
|     find ... | rev | sort | rev > sorted-filelist
|
| I had several directories and I wanted to pull out the
| _unique_ list of certain files (TrueType fonts) across
| all of them, regardless of which subdirs they were in.
| (I'm omitting the find CLI args for clarity; the command
| just finds all the *.ttf (case-insensitive) files in the
| dirs.)
|
| By reversing the lines before sorting them (and un-
| reversing them) they come out sorted and grouped by
| filename.
| mseepgood wrote:
| Reversing a string does not make sense in most languages.
| TacticalCoder wrote:
| > Reversing a string
|
| I wonder: several comments are saying it hardly makes any
| sense to reverse a string but... surely there are useful
| algorithms out there which do work by, at some point,
| reversing strings, no!? I mean: not just for the sake of
| reversing it, but for lookup/parsing or I don't know what.
| ChrisSD wrote:
| Some parsing algorithms may want to reverse the bytes but
| that's different to reversing the order of user-perceived
| characters.
| dahfizz wrote:
| > Number of native (preferably UTF-8) code units
|
| > The first is useful for basic string operations
|
| Can you expand on this? I don't see why knowing the number of
| code units would be useful except when calculating the total
| size of the string to allocate memory. Basic string operations,
| such as converting to uppercase, would operate on codepoints,
| regardless of how many code units are used to encode that
| codepoint.
|
| Converting 'Å' to 'å', for example, is an operation on one
| codepoint but multiple (UTF-8) code units.
| josephg wrote:
| > Can you expand on this? I don't see why knowing the number
| of code units would be useful except when calculating the
| total size of the string to allocate memory
|
| I've used this for collaborative editing. If you want to send
| a change saying "insert A at position 10", the question is:
| what units should you use for "position 10"?
|
| - If you use byte offsets then you have to enforce an
| encoding on all machines, even when that doesn't make sense.
| And you're allowing the encoding to become corrupted by edits
| in invalid locations. (Which goes against the principle of
| making invalid state impossible to represent).
|
| - If you use grapheme clusters, the positions aren't portable
| between systems or library versions. What today is position
| 10 in a string might tomorrow be position 9 due to new
| additions to the Unicode spec.
|
| The cleanest answer I've found is to count using Unicode
| codepoints. This approach is encoding-agnostic, portable,
| simple, well defined and stable across time and between
| platforms.
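|
| The receiving side can then translate that portable position
| into a local byte offset. A sketch in Rust (the helper is
| hypothetical, not from any particular CRDT library):
|
|     /// Apply "insert `text` at codepoint position `pos`".
|     fn insert_at_codepoint(s: &mut String, pos: usize, text: &str) {
|         // pos == total codepoint count means "append at the end".
|         let byte_idx = s
|             .char_indices()
|             .nth(pos)
|             .map(|(i, _)| i)
|             .unwrap_or(s.len());
|         s.insert_str(byte_idx, text);
|     }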
| tedunangst wrote:
| Neither code points nor units help upper casing ß.
| 7786655 wrote:
| >calculating the total size of the string to allocate memory
|
| In some languages this is 90% of everything you do with
| strings. In other languages it's still 90% of everything done
| to strings, but done automatically.
| throwawayffffas wrote:
| Python 3's approach is clearly the best, because it focuses on
| the problem at hand: unicode codepoints. A string in Python is a
| sequence of unicode codepoints; its length should be the number
| of codepoints in the sequence, and it has nothing to do with
| bytes.
|
| To draw an absurd parallel, "[emoji]".len() == 17 is equivalent
| to [1,2].len() == 8 (two 32-bit integers).
|
| In my opinion the most useful result in the case the article
| describes is 5. There should of course be a way to get 1 (the
| number of extended graphemes), but it should not be the string's
| "length".
| banthar wrote:
| Defining string as a sequence of unicode codepoints is the
| mistake.
|
| Nobody ever cares about unicode codepoints. You either want the
| number of bytes or the width of the string on screen.
|
| UTF-32 codepoints waste space and give you neither.
| stkdump wrote:
| The width on the screen is in pixels. Yes, I find monospace
| fonts increasingly pointless.
| anaerobicover wrote:
| But what do you do when you're processing a string with
| codepoints that compose into one user-visible glyph?
|
|     >>> len("e\u0301")   # 'e' + combining acute accent
|     2
| jxy wrote:
| yeah, and what do you do when you get a nonexistent country,
| like the regional-indicator pair "DB" (U+1F1E9 U+1F1E7)?
| Symbiote wrote:
| Your example changes its apparent length depending on the
| font. (If a new country gets the code DB, fonts will
| gradually be updated to include the flag.)
|
| I'll bet there are some systems in the PRC that don't
| include the Taiwan flag (U+1F1F9 U+1F1FC).
| nerdponx wrote:
| Don't Swift and Go support iterating over graphemes? Edit: yes,
| Swift is mentioned at the bottom of the article.
|
| It'd be great to have a function for that in other scripting
| languages like Python, Ruby, etc.
|
| There was an interesting sub-thread here on HN a while ago
| about how a string-of-codepoints is just as bad as a string-of-
| bytes (subject to truncation, incorrect iteration, incorrect
| length calculation) and that we should just have string-of-
| bytes if we can't have string-of-graphemes. I don't agree, but
| some people felt very strongly about it.
| eximius wrote:
| Gah, I wish I was done with my crate so I could point to lovely
| formatted documentation...
|
| But I believe this can be handled explicitly and well and I'm
| trying to do that in my fuzzy string matching library based on
| fuzzywuzzy.
|
| https://github.com/logannc/fuzzywuzzy-rs/pull/26/files
| hateful wrote:
| If you do a formula in Google Sheets and it contains an emoji,
| this comes into play. For example, if the first character is an
| emoji and you want to reference it you need to do =LEFT(A1, 2) -
| and not =LEFT(A1, 1)
| SamBam wrote:
| After having read the (quite interesting) article, I still don't
| quite get the subtitle:
|
| > 'But It's Better that "[emoji]".len() == 17 and Rather Useless
| that len("[emoji]") == 5'
|
| It sounds like it's just whether you're counting UTF-8/-16/-32
| units. Does the article explain why one is worse and the other
| "rather useless"?
| dahfizz wrote:
| I agree. The author talks a little bit about which UTF encoding
| makes sense in which situation, but they never make an argument
| about which result from len is correct.
|
| My two cents is that string length should always be the number
| of Unicode codepoints in the string, regardless of encoding. If
| you want the byte length, I'm sure there is a sizeof equivalent
| for your language.
|
| When we call len() on an array, we want the number of objects
| in the array. When we iterate over an array, we want to deal
| with one object from the array at a time. We don't care how big
| an object is. A large object shouldn't count for more than 1
| when calculating the array length.
|
| Similarly, a unicode codepoint is the fundamental object in a
| string. The byte-size of a codepoint does not affect the length
| of the string. It makes no sense to iterate over each byte in a
| unicode string, because a byte on its own is completely
| meaningless. len() should be the number of objects we can
| iterate over, just like in arrays.
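|
| In Rust terms, something like:
|
|     let s = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";
|     assert_eq!(s.chars().count(), 5); // objects we iterate over
|     assert_eq!(s.len(), 17);          // byte size, a separate question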
| MayeulC wrote:
| But what about combining characters?
| https://en.wikipedia.org/wiki/Combining_Diacritical_Marks
|
| Should the letter plus a combining character count as one (I
| think so), or two characters? Should you normalize before
| counting length? And so on.
| dahfizz wrote:
| Combining characters are their own unicode codepoint, so
| they count towards length. The beauty of this approach is
| that it's simple and objective.
|
| If you had a list of 5 DOM Element objects and one DOM
| Attr object, the length of that list is 6. It's nonsensical
| to say "the Attr object modifies an Element object, so it's
| not really in the list".
| d110af5ccf wrote:
| > Combining characters are their own unicode codepoint,
| so they count towards length.
|
| This is incredibly arbitrary - it depends entirely on
| what "length" means for a particular usecase. From the
| user's perspective there might only be a single character
| on the screen.
|
| Any nontrivial string operation _must_ be based around
| grapheme clusters, otherwise it is fundamentally broken.
| Codepoints are a useful encoding agnostic way to handle
| basic (split & concatenate) operations, but the method
| by which those offsets are determined needs to be
| grapheme cluster aware. Raw byte offsets are encoding
| specific and only really useful for allocating the
| underlying storage.
| shawnz wrote:
| Going by bytes is also simple and objective. And also
| totally arbitrary, just like going by codepoints.
|
| Which is the most useful for dealing with strings in
| practice though? Are either interpretations useful at
| all?
| dahfizz wrote:
| Going byte by byte is useless. You can't do anything with
| a single byte of a unicode codepoint (unless, by luck,
| the codepoint is encoded in a single byte).
|
| Codepoint is the smallest useful unit of a unicode
| string. It is a character, and you can do all the
| character things with it.
|
| If you wanted to implement a toUpper() function for
| example, you would want to iterate over all the
| codepoints.
| masklinn wrote:
| > If you wanted to implement a toUpper() function for
| example, you would want to iterate over all the
| codepoints.
|
| Nope. In order to deal with special casings you will have
| to span multiple codepoints, at which point it's no more
| work to operate directly on whatever the code units are.
| ChrisSD wrote:
| What is a string an array of?
|
| If you ask the user, it's an array of characters (aka
| extended grapheme clusters in Unicode speak).
|
| If you ask the machine it's an array of integers (how many
| bytes make up an integer depends on the encoding used).
|
| Nothing really considers them an array of code points. Code
| points are only useful as intermediary values when converting
| between encodings or interpreting an encoded string as
| grapheme clusters.
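|
| You can see that intermediary role in re-encoding. In Rust,
| for example, going from UTF-8 to UTF-16 decodes to code points
| and immediately re-encodes them:
|
|     let s = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";
|     let utf16: Vec<u16> = s.encode_utf16().collect();
|     assert_eq!(s.len(), 17);          // UTF-8 code units in
|     assert_eq!(utf16.len(), 7);       // UTF-16 code units out
|     assert_eq!(s.chars().count(), 5); // code points in between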
| [deleted]
| dahfizz wrote:
| > If you ask the machine it's an array of integers
|
| Not sure what you mean by this. A string is an array of
| bytes, in the way that literally every array is an array of
| bytes, but it's not "implemented" with integers. It's a
| UTF-encoded array of bytes.
|
| And what is the information that is encoded in those bytes?
| Codepoints. That's what UTF does, it lets us store unicode
| codepoints as bytes. There is a reasonable argument that
| the machine, or at least the developer, considers a string
| as an array of codepoints.
| ChrisSD wrote:
| UTF-8 is an array of bytes (8 bit integers).
|
| UTF-16 is an array of 16 bit integers.
|
| UTF-32 is an array of 32 bit integers.
|
| The machine doesn't know anything about code points. If
| you want to index into the array you'll need to know the
| integer offset.
| dahfizz wrote:
| > The machine doesn't know anything about code points. If
| you want to index into the array you'll need to know the
| integer offset.
|
| The machine doesn't know anything about Colors either.
| But if I defined a Color object, I would be able to put
| Color objects into an array and count how many Color
| objects I had. You're being needlessly reductive.
|
| > UTF-8 is an array of bytes (8 bit integers)
|
| UTF-8 encodes a codepoint with 1-4 single-byte code
| units. The reason UTF-8 exists is to provide a way for
| machines and developers to interact with unicode
| codepoints.
|
| Is a Huffman code an array of bits? Or is it a list of
| symbols encoded using bits?
| ChrisSD wrote:
| You seem to be thinking of the abstraction as a concrete
| thing. A code point is like LLVM IR; an intermediary
| language for describing and converting between encodings.
| It is not a concrete thing in itself.
|
| The concrete thing we're encoding is human readable text.
| The atomic unit of which is the user perceived character.
|
| I'm curious, what use is knowing the number of code
| points in a string? It doesn't tell the user anything. It
| doesn't even tell the programmer anything actionable.
| beaconstudios wrote:
| perhaps the issue is that there's a canonical "length" at all. It
| would make more sense to me to have different types of length
| depending on which measure you're after, like Swift apparently
| has but without the canonical `.count`. Because when there's
| multiple interpretations of a thing's length, when you ask for
| "length" you're leaving the developers to resolve the ambiguity
| and I'm of the firm belief that developers shouldn't consider
| themselves psychic.
| anaerobicover wrote:
| The main reason, I think, that Swift strings have `count` is
| that they conform to the `Collection` protocol. Swift's stdlib
| has a pervasive "generic programming" philosophy, enabled by
| various protocol hierarchies.
|
| So, given that the property is required to be present, some
| semantic or the other had to be chosen. I am sure there were
| debates when they were writing `String` about which one was
| proper.
| beaconstudios wrote:
| that's very cringe of them.
|
| > So, given that the property is required to be present, some
| semantic or the other had to be chosen.
|
| Sounds like an invented solution to an invented problem. The
| programmer's speciality.
| samatman wrote:
| On the contrary it is based!
|
| This conforms exactly to our intuition about what a
| "collection" is. Some (exact) number of items which share
| some meaningful property in common such that we can
| consider them "the same" for purposes of enumeration and
| iteration.
|
| In the real world, we also have to decide what our
| collection is a collection of! Let's say we have a pallet of
| candy bars, each of which is in a display box. If we want
| to ask "how many X" are on the pallet, we have to decide
| whether we're asking about the candy bars or the boxes.
| Clearly we should be able to answer questions about both;
| just as clearly, _operations_ on the pallet should work
| with boxes. Because we don't want to open them, and even
| if we _do_ want to open them, we _have to_ open them; we
| can't just ignore their existence!
|
| I assert that the extended grapheme cluster is the "box"
| around bytes in a string. Even if you do care about the
| contents (very often you do not!) you have to know where
| the boundaries are to do it! Because a Fitzpatrick skin tone
| modifier all on its own has different semantics from one
| found within an emoji modifier sequence.
|
| So it makes perfect sense for Swift to provide one blessed
| way to iterate over strings, and provide other options for
| when you're interested in some other level of aggregation.
| Which is what Swift does.
| beaconstudios wrote:
| I think the problem is that strings are ambiguous enough
| a collection to warrant extra semantics.
|
| The alternative to a collection would be an iterator or
| other public method returning a collection-like accessor,
| which would be a good compromise.
|
| Though if you were to choose a canonical collection
| quantum, then it'd probably be the grapheme cluster, yeah.
|
| Unfortunately OOP can never be based; only functional or
| procedural programming can attain such standards.
| saagarjha wrote:
| Strings were done well, because they are just
| BidirectionalCollection and not RandomAccessCollection on
| graphemes, which is usually what you would want
| (especially as an app developer writing user-facing
| code). The other views are collections in their own
| right. By conforming to Collection a string can do things
| like be sliced and prefixed and searched over "for free",
| which are extremely common operations to define
| generically.
| samatman wrote:
| OOP using a meta-object protocol is very based indeed.
|
| Unfortunately, only Common Lisp and Lua do it that way.
|
| Actor models are pretty based as well, and have a better
| historical claim to the title "object oriented
| programming" than class ontologies do, but that ship has
| sailed.
| beaconstudios wrote:
| Yes, agreed; I'm guessing by meta-object protocol you mean
| the module pattern from FP with syntax sugar? I use the
| module pattern with TypeScript interfaces plus namespaces
| and it's pretty great.
|
| 100% on the actor model. My visual programming platform
| is basically based on actors, but the core data model is
| cybernetic (persistence is done via self referentiality).
| Alan Kay got shafted by the creation of C++, OO in the
| original conception was very based.
| xoudini wrote:
| In Swift 3 (and probably previous versions as well),
| `String.count` defaulted to the count of the Unicode scalar
| representation. In this version, iterating over a string
| would operate on each Unicode scalar, which often doesn't
| make sense due to the expected behaviour of extended grapheme
| clusters. So, this is my best guess why `String` in Swift 4
| and later ended up with the current default behaviour.
| maxnoe wrote:
| https://pypi.org/project/grapheme/
| patrickas wrote:
| Raku seems to be more correct (DWIM) in this regard than all the
| examples given in the post...
|
|     my \emoji = "\c[FACE PALM]\c[EMOJI MODIFIER FITZPATRICK TYPE-3]\c[ZERO WIDTH JOINER]\c[MALE SIGN]\c[VARIATION SELECTOR-16]";
|
|     # one character
|     say emoji.chars;                  # 1
|
|     # five code points
|     say emoji.codes;                  # 5
|
|     # If I want to know how many bytes that takes up in
|     # various encodings...
|     say emoji.encode('UTF8').bytes;   # 17 bytes
|     say emoji.encode('UTF16').bytes;  # 14 bytes
|
| Edit: Updated to use the names of each code point since HN cannot
| display the emoji
| duskwuff wrote:
| You can represent it as a sequence of escapes. If Raku handles
| this the same way as Perl 5, it should be:
|
|     $a = "\N{FACE PALM}\N{EMOJI MODIFIER FITZPATRICK TYPE-3}\N{ZERO WIDTH JOINER}\N{MALE SIGN}\N{VARIATION SELECTOR-16}";
| csharptwdec19 wrote:
| now do it in YAML
| cjm42 wrote:
| And if you try to say emoji.length, you'll get an error:
|
|     No such method 'length' for invocant of type 'Str'.
|     Did you mean any of these: 'codes', 'chars'?
|
| Because as the article points out, the "length" of a string is
| an ambiguous concept these days.
| j1elo wrote:
| If you're working with an image, you might have an Image class
| that has an Image.width and an Image.height in pixels, regardless
| of how these pixels are laid out in memory (depends on encoding,
| colorspace, etc). Most if not all methods operate on these
| pixels, e.g. cropping, scaling, color filtering, etc. Then, there
| might be an Image.memory property that provides access to the
| underlying, actual bytes.
|
| I don't understand why the same is not the obvious solution for
| strings. len("[emoji]") should be 1, because we as humans read
| one emoji as one character, regardless of the memory layout
| behind it. Most if not all methods operate on characters
| (reversing, concatenating, indexing).
|
| And then, if you need access to the low level data, the
| String.memory would contain the actual bytes... which would be
| different depending on the actual text encoding.
| dnautics wrote:
| The number of bytes necessary is _incredibly_ important for
| security reasons. It's arguably better to make the number of
| bytes be the primary value and have a secondary lookup be the
| number of glyphs.
|
| To be fair, some systems distinguish between size and length
| (with size expected to be O(1) and length allowed to be up to
| O(n)). For those systems, proceed as the parent suggests.
| mannykannot wrote:
| In my curmudgeonly way, I suggest zero is the correct value, to
| reflect the information content of almost all emoji use.
| elliekelly wrote:
| Counterpoint: Appending U+1F609, U+1F61C, or U+1F643 to the end
| of your comment would have added immense value in communicating
| your point because it would have lightened the otherwise harsh
| tone of your words :)
| Kaze404 wrote:
| With that metric the length of your comment should be zero as
| well. Probably mine also :)
| bob1029 wrote:
| We need to deal with mountains of user input from mobile devices
| and the worst we have run into is "smart" quotes from iOS devices
| hosing older business systems. Our end users are pretty good
| about not typing pirate flags into customer name fields.
|
| I still haven't run into a situation where I need to count the
| logical number of glyphs in a string. Any system that we need to
| push a string into will be limiting things on the basis of byte
| counts, so a .Length check is still exactly what we need.
|
| Does this cause trouble for anyone else?
| ravi-delia wrote:
| I have used glyph counting a handful of times, mostly for width
| computing before I learned there were better ways. I'm 100%
| sure my logic was just waiting to fail on any input that didn't
| use the Latin alphabet.
| dooglius wrote:
| I think this title needs an exemption from HN's no-emoji rule
| kyberias wrote:
| Why is there such a rule?
| TeMPOraL wrote:
| Surprisingly many people try to abuse Unicode when posting
| submissions and comments. This includes not just emojis, but
| also other symbols, or glyphs that look like bold, underlined
| or otherwise stylized letters.
| masklinn wrote:
| > Surprisingly many people try to abuse Unicode when
| posting submissions and comments.
|
| ˙ǝsnqɐ ƃuᴉpᴉoʌɐ ʇɐ qoɾ pooƃ ɐ ɥɔns sǝop sɹǝʇɔɐɹɐɥɔ ɟo ɥɔunq
| ɐ ƃuᴉuuɐq ʎlᴉɹɐɹʇᴉqɹɐ ǝsnɐɔǝq
| criddell wrote:
| I don't think they've banned it entirely.
| masklinn wrote:
| HN's character stripping is completely arbitrary.
|
| You can play mahjong, domino or cards (🀄, 🁣, 🂡) but not
| chess, you can show some random glyphs like ⌘ or ⍼ but
| not others, you can use box-drawing characters (┼) but
| not block elements, you can use byzantine musical
| notation (𝀀) but only western musical symbols (𝄞) and the
| notes are banned, you can Z̴a̷l̶g̴o̵ just fine.
| colejohnson66 wrote:
| Emojis aren't professional (depending on who you ask) and can
| be overused (look at a few popular GitHub projects' READMEs -
| emojis on every bullet point of the feature list, for
| example)
| tester34 wrote:
| I agree that they aren't the most professional thing to
| use, but I'm not sure why I agree
|
| Maybe it's some kind of bias?
| Kaze404 wrote:
| What a weird reasoning. In most contexts I've seen,
| "professionalism" doesn't really do anything except strip
| away all the human factor that goes into every interaction.
| Personally, I don't care whether the person I'm talking to
| is being "professional" or not. What I care is that they're
| respectful and can properly communicate their thoughts.
|
| With that mindset, emojis (and emoticon) can actually add
| context to interactions, considering how much context is
| lost when communicating over text. A simple smiley face at
| the end of a sentence can go a long way in my experience :)
| fhifjfhjjjk wrote:
| emojis aren't "professional" is your reasoning?
|
| is it 1987? should women also wear their skirts below the
| knee?
| colejohnson66 wrote:
| _I_ didn't say that they weren't. I'm simply saying that
| some do think that. Whether I do or don't is irrelevant
| as I'm only speculating on the reason why.
| pessimizer wrote:
| This is a forum sponsored by opinionated people who find them
| annoying. I find them annoying, too, so I think it's a good
| convention. Somehow ":)" became an industry, and, as
| industries go, it then eliminated ":)" (which gets replaced
| with a smiling yellow head in most places).
|
| It isn't the only arbitrary convention enforced here, and the
| sum of those conventions are what attracts the audience.
| saagarjha wrote:
| Emoji draw attention to themselves in a way that plain text
| does not. I assume this is why Hacker News does not allow
| bolding text and strongly discourages use of ALL CAPS.
| at_a_remove wrote:
| Also, looking at the article, they are quite complex! It
| looks like handling emojis in a proper manner requires a
| large investment, and the payoff is somewhat small.
| masklinn wrote:
| Stripping some arbitrary subset of characters is a lot more
| work than just letting them through, which is what HN would
| otherwise be doing: the hard work is done by the browser,
| HN doesn't render emoji or lay text out.
| Someone wrote:
| For HN I would think almost all the complexity is in
| rendering. That's a job for your browser.
|
| What's left is things like the max length for a title (not
| too problematic to count that in code points or bytes, I
| think)
|
| The big risk, I think, is in users (mis)using rendering
| direction, multiple levels of accents, zero-width
| characters and weird code points to mess up the resulting
| page. Some Unicode symbols and emojis typically look a lot
| larger and heavier than 'normal' characters, switching
| writing direction may change visual indentation of a
| comment, etc.
|
| Worse, Unicode rendering being hard, text rendering bugs
| that crash browsers or even entire devices are discovered
| every now and then. If somebody entered one in a HN
| message, that could crash millions of devices.
| wongarsu wrote:
| Emoji don't introduce anything that isn't already used by
| various other languages. Emojis are just the most visible
| breakage for American users when you screw up your unicode
| handling.
|
| HN however handles all of unicode just fine, it just chose
| to specifically exclude emojis (and a bunch of other
| symbols)
| MayeulC wrote:
| I think emoji are wonderful for widespread unicode
| support. They combat ossification, and are also a nice
| carrot to entice users to install updates.
|
| However, I don't like the rabbit hole it started to go
| into with gendered and colored emojis. There's never
| going to be enough. I wish we had stuck to color-neutral
| and gender-neutral, like original smileys :)
|
| I find it also conveys too much meaning. I am generally
| not interested in knowing a person's gender or ethnic
| group when discussing over text... but I digress.
| samatman wrote:
| I've said this before, but not here:
|
| Eventually, Unicode will allow you to combine flag
| modifiers with human emojis. So you can have a black
| _South Korean_ man facepalming.
|
| This will trigger a war which ends industrial
| civilization, and is the leading candidate for the Great
| Filter.
| colejohnson66 wrote:
| I'm curious: is the no-emoji rule a rule that happens to block
| emoji or a hardcoded rule? What I mean is: emojis (in UTF-16)
| have to use surrogate pairs because all their code points are
| in the U+1xxxx plane. Is the software just disallowing any
| characters needing two code points to encode (which would
| include emoji)? Or is it specifically singling out the emoji
| blocks?
| kps wrote:
| It seems to have changed recently. I recall a thread about
| plastics a few months ago where the plastic type symbols (♳
| through ♹, i.e. U+2673 through U+2679) disappeared, but I
| see them now.
|
| Edits:
|
| that post was https://news.ycombinator.com/item?id=25237688
|
| The rules seem pretty arbitrary. Recycling symbol U+2672 ♲
| is allowed but recycled paper symbol U+267C ♼ is not. Chess
| kings are allowed but checkers kings aren't.
|
| is allowed (for now).
|
| I think the right thing to do would be to strip anything
| with emoji _presentation_
| http://unicode.org/reports/tr51/#Presentation_Style
| saagarjha wrote:
| It's not that simple, as many OSes render characters that
| aren't strictly emoji "as emoji", and there is no
| standard way to check for this.
| saagarjha wrote:
| The latter.
| wongarsu wrote:
| It's a very deliberate and precise filter. I can write in Old
| Persian: "𐎠" (codepoints 0x103xx) or CJK "𠀀" (codepoints
| 0x2000x), but can't write "😅" (emoji with codepoint 0x1F605)
| colejohnson66 wrote:
| I was wondering what the title meant. Turns out HN's emoji
| stripper screwed with the title.
|
| It's asking why a _skin toned_ (not yellow) facepalm emoji's
| length is 7 when the user perceives it as a single character.
|
| Tangent: Emojis are an interesting topic in regards to
| programming. They challenged the "rule" held by programmers that
| every character is a single code unit (of 8 or 16 bits). So,
| str[1234] would get me the 1235th character, but it's actually
| the 1235th _byte_. UTF-8 threw a wrench in that, but many
| programmers went along ignoring reality.
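|
| Rust, for one, makes the byte-vs-character split explicit:
| string indexing is by byte range, and slicing off a UTF-8
| boundary panics rather than silently corrupting the text:
|
|     let s = "na\u{EF}ve";       // "naïve": the 'ï' takes 2 bytes
|     assert_eq!(&s[0..1], "n");  // fine: byte 1 is a boundary
|     // &s[0..3] would panic: byte 3 is inside the 2-byte 'ï'.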
|
| Sadly, preexisting scripts such as Arabic weren't enough of a
| warning in regards to line breaking. As in: an Arabic
| "character"[a] can change its width depending on whether
| there's a "character" before or after it (which gives the
| script its cursive-like look). So, a naive line-breaking
| routine could cause bugs if it tried to break in the middle
| of a word. Tom Scott has a nice video on it that was filmed
| when the "effective power" crash was going around.[0]
|
| [0]: https://youtu.be/hJLMSllzoLA
|
| [a]: Arabic script isn't _technically_ an alphabet like Latin
| characters are. It's either an _abugida_ or _abjad_ (depending on
| who you ask). See Wikipedia:
| https://en.wikipedia.org/wiki/Arabic_script
| MengerSponge wrote:
| Interesting. Also, many of us use fonts with ligatures, which
| render as a single character (for example: tt, ti, ff, Th, ffi)
|
| Of course, we're taught to parse that as multiple discrete
| letters from an early age, so we don't get confused :)
___________________________________________________________________
(page generated 2021-03-26 23:02 UTC)