[HN Gopher] Why can't you reverse a string with a flag emoji?
___________________________________________________________________
Why can't you reverse a string with a flag emoji?
Author : da12
Score : 106 points
Date : 2022-01-27 18:35 UTC (4 hours ago)
(HTM) web link (davidamos.dev)
(TXT) w3m dump (davidamos.dev)
| zanzibar735 wrote:
| Of course you can reverse a string with a flag emoji. You just
| need to treat a "string" as a collection of Extended Grapheme
| Clusters, and then you reverse the order of the EGCs. So if the
| string is `a<flag unicode bytes>b`, the output should be `b<flag
| unicode bytes>a`.
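|
| In Python, for instance, a minimal sketch of that approach (this
| assumes the third-party `regex` module, whose \X pattern matches
| one extended grapheme cluster):
|
|     import regex  # third-party; pip install regex
|
|     def reverse_graphemes(s):
|         # \X keeps flag pairs, combining marks and ZWJ
|         # sequences together as single units
|         return ''.join(reversed(regex.findall(r'\X', s)))
|
|     us = '\U0001F1FA\U0001F1F8'  # regional indicators U + S
|     assert reverse_graphemes('a' + us + 'b') == 'b' + us + 'a'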
| Crazyontap wrote:
| The section of the linked Wikipedia article(1) on how the family
| emoji is rendered using a zero-width joiner is quite amazing
|
| (1) https://en.wikipedia.org/wiki/Emoji#Joining
|
| edit: forgot HN doesn't render emojis. Better to read it
| directly on Wikipedia, I guess.
| codezero wrote:
| You also can't URL-encode a string (in JS, at least) if you
| truncate an emoji at the beginning or end of it.
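|
| Presumably that's because JS strings are UTF-16, so truncating
| mid-emoji leaves a lone surrogate that can't be encoded as UTF-8
| for percent-encoding. A rough Python sketch of the same failure
| (the byte slicing just manufactures the lone surrogate):
|
|     s = '\U0001F600'                 # one emoji, 2 UTF-16 units
|     units = s.encode('utf-16-le')
|     half = units[:2].decode('utf-16-le', 'surrogatepass')
|     half.encode('utf-8')  # UnicodeEncodeError, like JS URIError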
| coreyp_1 wrote:
| If you think the Unicode flag emoji take a lot of bytes, then
| consider the family emoji!
| (https://unicode.org/emoji/charts/full-emoji-list.html#family)
|
| I'm in the process of designing a scripting language and
| implementing it in C++. I plan to put together a YouTube series
| about it. (Doesn't everyone want to see Bison and Flex mixed with
| proper unit tests and C++20 code?)
|
| Due to my future intended use case, I needed good support for
| Unicode. I thought that I could write it myself, and I was wrong.
| I wasted two weeks (in my spare time, mostly evenings) trying to
| cobble together things that should work, identifying patterns,
| figuring out how to update it as Unicode itself is updated,
| thinking about edge cases, i18n, zalgo text, etc. And then I
| finally reached the point where I knew enough to know that I was
| making the wrong choice.
|
| I'm now using ICU. (https://icu.unicode.org/) It's huge, it was
| hard to get it working in my environment, and there are very few
| examples of its usage online, but after the initial setup dues
| are paid, it WORKS.
|
| Aside: Yes, I know I'm crazy for implementing a programming
| language that I intend for serious usage. Yes, I have good
| reasons for doing it, and yes I have considered alternatives. But
| it's fun, so I'm doing it anyways.
|
| Moral of the story: Dealing with Unicode is hard, and if you
| think it shouldn't be that hard, then you probably don't know
| enough about the problem!
| josephg wrote:
| Handling unicode can be fine, depending on what you're doing.
| The hard parts are:
|
| - Counting, rendering and collapsing grapheme clusters (like
| the flag emoji)
|
| - Converting between legacy encodings (Shift JIS, KOI8, etc.) and
| UTF-8 / UTF-16
|
| - Canonicalization
|
| If all you need is to deal with utf8 byte buffers, you don't
| need all that stuff. And your code can stay simple, small and
| fast.
|
| IIRC the rust standard library doesn't bother supporting any of
| the hard parts in unicode. The only real unicode support in std
| is utf8 validation for strings. All the complex aspects of
| unicode are delegated to 3rd party crates.
|
| By contrast, nodejs (and web browsers) do all of this. But they
| implement it in the same way you're suggesting - they simply
| call out to libicu.
| tialaramex wrote:
| > The only real unicode support in std is utf8 validation for
| strings.
|
| Rust's core library gives char methods such as is_numeric
| which asks whether this Unicode codepoint is in one of
| Unicode's numeric classes such as the letter-like-numerics
| and various digits. (Rust does provide char with
| is_ascii_digit and is_ascii_hexdigit if that's all you
| actually cared about)
|
| So yes, the Rust standard library is carrying around the
| entire Unicode character class rule list, among other things.
| Of course, Rust's library isn't all built into your binary:
| if you never use these features, your binary doesn't get that
| code.
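|
| For comparison, Python's str methods expose the same Unicode
| numeric classes:
|
|     '5'.isnumeric()       # True
|     '\u00bd'.isnumeric()  # True: VULGAR FRACTION ONE HALF
|     '\u00bd'.isdigit()    # False: numeric, but not a digit
|     '\u216b'.isnumeric()  # True: ROMAN NUMERAL TWELVE
|
| so that character class table travels with Python's runtime too.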
| Gigachad wrote:
| It always feels like the most work goes into the least-used
| emoji. So many revisions and additions to the family emoji, and
| yet it's one of the ones I don't recall anyone ever using.
|
| I think the trap Unicode got into is that technically they can
| have infinite emoji, so they just don't ever have a way to say
| no to new proposals.
| masklinn wrote:
| > It always feels like the most work goes into the least-used
| emoji.
|
| I always feel like those emoji were added on purpose in order
| to force implementations to fix their unicode support. Before
| emoji were added, most software had completely broken support
| for anything beyond the BMP (case study: MySQL's so-called
| "UTF8" encoding). The introduction of emoji, and their
| immediate popularity, forced many systems to better support
| astral planes (that is officially acknowledged:
| https://unicode.org/faq/emoji_dingbats.html#EO1)
|
| Progressively, emoji using more advanced features got
| introduced, which forced systems (and developers) to fix their
| unicode handling, or at least improve it somewhat: e.g. skin
| tones via combining codepoints, etc.
|
| > I think the trap Unicode got into is that technically they
| can have infinite emoji, so they just don't ever have a way to
| say no to new proposals.
|
| You should try to follow a new character through the process,
| because that's absolutely not what happens and shepherding a
| new emoji through to standardisation is not an easy task. The
| unicode consortium absolutely does say no, and has many
| reasons to do so. There's an entire page on just proposal
| guidelines (https://unicode.org/emoji/proposals.html), and
| following it does not in any way ensure it'll be accepted.
| mike_hock wrote:
| WTF business do emojis have in Unicode? The BMP is all
| there ever should have been. Standardize the actual writing
| systems of the world, so everyone can write in their
| language. And once that is done, the standard doesn't need
| to change for a hundred years.
|
| What we need now is a standardized, sane subset of Unicode
| that implementations can support while rejecting the insane
| scope creep that got added on top of that. I guess the BMP
| is a good start, even though it already contains
| superfluous crap like "dingbats" and boxes.
| laumars wrote:
| They do say no though. Frequently too.
|
| The problem with Unicode is simply that it's trying to solve
| a very hard problem.
| tialaramex wrote:
| Exactly this. Humans have _incredibly_ complicated writing
| systems, and all Unicode wants to do is encode them all.
| Keep in mind that the trivial toy system we're more
| familiar with, ASCII, already has some pretty strange
| features because even to half-arse one human writing system
| they needed those features.
|
| Case is totally wild: it only applies to a fraction of the
| symbols in ASCII, but in the process it means they each
| need two codepoints, and you're expected to carry around
| tech for switching back and forth between cases.
|
| And then there are several distinct types of white space,
| each gets a codepoint, some of them try to mess with your
| text's "position" which may not make any sense in the
| context where you wanted to use it. What does it mean to
| have a "horizontal tab" between two parts of the text I
| wanted to draw on this mug? I found a document which says
| it is the same as "eight spaces" which seems wrong because
| surely if you wanted eight spaces you'd just write eight
| spaces.
|
| And after all that ASCII doesn't have working quotation
| marks, it doesn't understand how to spell a bunch of common
| English words like naïve or café, pretty disappointing.
| xxpor wrote:
| >Humans have incredibly complicated writing systems
|
| Not only that, there isn't even agreement about what's
| correct all the time!
|
| >it doesn't understand how to spell a bunch of common
| English words like naïve or café, pretty disappointing.
|
| A perfect example of this, since I would argue English
| doesn't have any diacritics at all. So the use of café is
| code switching. :)
| mattkrause wrote:
| Not a New Yorker writer, I see....
| mappu wrote:
| If you like this, you may also like why len(emoji) is still not 1
| in Python 3 despite all the unicode breakage:
| https://storytime.ivysaur.me/posts/grapheme-clusters/
|
| I do feel like these are all 'gotcha' questions - I haven't seen
| any real-world requirement to reverse a string and then have it
| be displayed in a useful way.
| raffy wrote:
| Kinda related: I am developing a library for ENS (Ethereum Name
| Service) name normalization: https://github.com/adraffy/ens-
| normalize.js
|
| I'm trying to find the best combination of UTS-46, UTS-51,
| UTS-39, and prior work on IDN resolution w/r/t confusables:
| https://adraffy.github.io/ens-normalize.js/test/report-confu...
|
| Personally, I found the Unicode spec very messy. Critical
| information is all over the place. You can see the direct effect
| of this when you compare various packages across different
| languages and discover that every library disagrees in multiple
| places. Even JS String.normalize() isn't consistent in the latest
| version of most browsers: https://adraffy.github.io/ens-
| normalize.js/test/report-nf.ht... (fails in Chrome, Safari)
|
| The major difference between ENS and DNS is emoji are front and
| center. ENS resolves by computing a hash of a name in a
| canonicalized form. Since resolution must happen in a
| decentralized way, simply punting to punycode and relying on
| custom logic for Unicode handling isn't possible. On-chain
| records are 1:1, so there's no
| fuzzy matching either. Additionally, ENS is actively registering
| names, so any improvement to the system must preserve as many
| names as possible.
|
| At the moment, I'm attempting to improve upon the confusables in
| the Common/Greek/Latin/Cyrillic scripts, and will combine these
| new groupings with mixed-script limitations similar to the IDN
| handling in Chromium.
|
| Interactive Demo: https://adraffy.github.io/ens-
| normalize.js/test/resolver.htm...
|
| Also this emoji report is pretty cool:
| https://adraffy.github.io/ens-normalize.js/test/report-emoji...
| [deleted]
| xmprt wrote:
| This is a cool article about Unicode encoding; however, I still
| feel like it should be possible to reverse strings with flag
| emojis. I don't see why computers can't handle multi-rune
| symbols in the same way that they handle multi-byte runes. We
| could combine all the runes that should form a single symbol and
| make sure that we maintain the ordering of those runes in the
| reversed string. Of course, that means that naive string
| reversal doesn't work anymore, but naive string reversal
| wouldn't work in the world of UTF-8 either if we just went byte
| by byte.
| happytoexplain wrote:
| Swift, for example, does what you're saying. I thought that the
| reason many languages don't do it that way is that part of the
| definition of an array (or at least what's expected by
| convention) is constant-time operations. If you treat a string
| as an array,
| then having to deal with variable-length units breaks that
| rule. That's why, when there _is_ an API for dealing with
| grapheme clusters, it is usually a special case that duplicates
| an array-like API, instead of literally using an array.
|
| I actually don't know how/why Python is apparently using code
| points, since they are variable length. That seems like a
| compromise between using code units and using grapheme clusters
| that gets you the worst of both worlds.
|
| Edit: Maybe it uses UTF-32 under the hood when it's doing array
| operations on code points?
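|
| For what it's worth, CPython answers this with PEP 393: each
| string is stored at a fixed width chosen when it is created (1,
| 2, or 4 bytes per code point), so indexing stays constant-time.
| A quick check:
|
|     import sys
|     e = '\U0001F600'
|     sys.getsizeof('aaaa') - sys.getsizeof('aaa')           # 1
|     sys.getsizeof('\u0394'*4) - sys.getsizeof('\u0394'*3)  # 2
|     sys.getsizeof(e*4) - sys.getsizeof(e*3)                # 4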
| kevin_thibedeau wrote:
| This misses the real problem with flag emoji: they are composed
| of codepoints that can appear in any order. With other emoji you
| get a base codepoint with potential combining characters. Using
| a table of combining character ranges you can skip over them and
| isolate the logical glyph sequences. Unlike flags, they don't
| need surrounding context to be parsed out.
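|
| A small Python illustration of that context dependence (the `ri`
| helper is just shorthand for building regional indicators):
|
|     def ri(letter):  # 'A'..'Z' -> regional indicator symbol
|         return chr(0x1F1E6 + ord(letter) - ord('A'))
|
|     s = ri('F') + ri('R') + ri('U') + ri('S')
|     # s renders as two flags: FR, US
|     # s[1:] renders as the RU flag plus a lone S indicator;
|     # the pairing of every later codepoint shifted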
| uniqueuid wrote:
| Thanks for that interesting detail!
|
| If such re-purposing continues, it might be easier to go
| straight to utf-32 for some use cases.
| dhosek wrote:
| Nope, because the repurposing is independent of how the
| Unicode is represented. There's absolutely no advantage to
| having a string in UTF-32 over UTF-8 since you'll still need
| to examine every character and the added overhead for
| converting byte strings in UTF-8 to 32-bit code points is by
| far offset by the huge memory increase necessary to store
| UTF-32.
|
| What's more, it's really not that difficult to start at the
| end of a valid UTF-8 string and get the characters in reverse
| order. UTF-8 is well-designed that way in that there's never
| ambiguity about whether you're looking at the beginning byte
| of a code point.
| colejohnson66 wrote:
| > UTF-8 is well-designed that way in that there's never
| ambiguity about whether you're looking at the beginning
| byte of a code point.
|
| To expand, if the most-significant-bit is a 0, it's an
| ASCII codepoint. If the top two are '10', it's a
| continuation byte, and if they're '11', it's the start of a
| multibyte codepoint (the other most-significant-bits
| specify how long it is to facilitate easy codepoint
| counting).
|
| So a naive codepoint reversal algorithm would start at the
| end, and move backwards until it sees either an ASCII
| codepoint or the start of a multibyte one. Upon reaching
| it, copy those 1-4 bytes to the start of a new buffer.
| Continue until you reach the start.
|
| [0]: https://en.wikipedia.org/wiki/UTF-8#Encoding
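|
| A sketch of that algorithm in Python, working directly on the
| UTF-8 bytes:
|
|     def reverse_codepoints(buf: bytes) -> bytes:
|         # Continuation bytes look like 0b10xxxxxx; anything
|         # else starts a new 1-4 byte code point.
|         out, end = bytearray(), len(buf)
|         for i in range(len(buf) - 1, -1, -1):
|             if buf[i] & 0xC0 != 0x80:    # lead or ASCII byte
|                 out += buf[i:end]        # copy whole code point
|                 end = i
|         return bytes(out)
|
|     rev = reverse_codepoints('a\u00f1b'.encode())  # 'añb'
|     assert rev.decode() == 'b\u00f1a'              # 'bña'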
| jug wrote:
| I think that somewhere in this answer lies a reason why Windows
| still doesn't support flag emoji. I don't count Microsoft Edge
| as "Windows" in this case, but as Chromium. Windows doesn't
| support flag emoji in its native text boxes, but it does
| support even colorized emoji.
|
| But then again, flags seem to be not only Unicode-hard but
| post-Unicode-hard.
| masklinn wrote:
| > But then again, flags seem to be not only Unicode-hard but
| post-Unicode-hard.
|
| Flags are not that hard; they're a very specific block that
| combines in a very predictable way. They're little more than
| ligatures. Family emoji are much harder.
|
| And this is not "post-Unicode" in any way.
| cygx wrote:
| _Flags are not that hard; they're a very specific block that
| combines in a very predictable way._
|
| But before their introduction, you could decide if there's
| a grapheme cluster break between codepoints just by looking
| at the two codepoints in question. Now, you may need to
| parse a whole sequence of codepoints to see how flags pair
| up.
| otagekki wrote:
| If flag emojis are really a combination of 2 special characters,
| the reversal of the U.S. flag should result in the Soviet Union
| flag.
| TonyTrapp wrote:
| It's up to the installed fonts really. I don't know if the
| combination of S + U is standardized as a Soviet Union flag
| emoji, but even if it is, your locally installed fonts may not
| contain every single flag emoji, so the browser would still
| fall back to rendering the two letters instead.
| masklinn wrote:
| > the reversal of the U.S. flag should result in having the
| Soviet Union flag.
|
| Except it has been deleted from the ISO 3166-2 registry, so not
| having it is perfectly valid (arguably more so than having it).
| jameshart wrote:
| I was _so_ disappointed that didn't turn out to be the case.
| brewmarche wrote:
| Just tried reversing a Spanish flag with Python and indeed I
| got Sweden back
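|
| For anyone who wants to try it:
|
|     flag = '\U0001F1EA\U0001F1F8'  # E + S: renders as Spain
|     flag[::-1]                     # S + E: renders as Sweden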
| ezfe wrote:
| Works in Swift, which is the benefit of Swift having the most
| painful String API possible:
|
| let v = "Flag: " String(v.reversed()) // Output: :galF v.count //
| Output: 7
| jiveturkey wrote:
| Interesting article. Written for beginners, conversationally. Has
| excessive amounts of whitespace, for "readability" I guess. But
| at the same time, it dives quite deep, and I don't think this
| "style" of presentation matches up with the amount of time a more
| novice reader is going to devote to a single long-form article.
|
| As to the content, for all the deep dive, a simple link to
| https://unicode.org/reports/tr51/#Flags and what an emoji is,
| would have saved so much exposition. I also wish he'd touched on
| normalization. With the amount of time he's demanding from
| readers he could have mentioned this important subject. Because
| then he could discuss why (starting from his emoji example)
| a-grave (à) might or might not be reversible, depending on how
| the character is composed.
|
| Also wish he'd pointed to some libraries that can do such
| reversals.
| faebi wrote:
| Why reverse them when one can barely implement, display, and
| edit them correctly? I never could make them work perfectly in
| VIM.
| Also I had to open a bug in Firefox recently:
|
| _Flag emojis and others are displayed in double the size on
| Windows 10 using Firefox Nightly_
| https://bugzilla.mozilla.org/show_bug.cgi?id=1746795
| [deleted]
| nottorp wrote:
| So basically unicode along with c++ are great job security if you
| do bother to learn them.
|
| There's another word that comes to mind when thinking about those
| two: metastasis.
| [deleted]
| ts4z wrote:
| Let me cheat a bit and say Unicode comes in three flavors: UTF-8,
| UCS-2 aka UTF-16, and UTF-32. UTF-8 is byte-oriented, UTF-16 is
| double-byte oriented, and UTF-32 nobody uses because you waste
| half the word almost all of the time.
|
| You can't reverse the _bytes_ in UTF-8 or UTF-16, because you'll
| scramble the encoding. But you could parse the string,
| codepoint-at-a-time, handling the specifics of UTF-8, or UTF-16
| with its surrogate pairs, and reverse those. This sounds
| equivalent to reversing UTF-32, and I believe it is what the
| original poster was imagining.
|
| Except you can't do that, because Unicode has combining
| characters. Now, I'm American and too stupid to type anything
| other than ASCII, but I know about n + ~ = ñ. If you have the
| pre-composed version of ñ, you can reverse the codepoint (it's
| one codepoint). If you don't have it, and you have n + combining
| ~, you can't reverse it, or in the word "año" you might put the
| ~ on the "o". (Even crazier things happen when you get to the
| ligatures in Arabic; IIRC one of those is about 20 codepoints.)
|
| So we can't just reverse codepoints, even with ancient versions
| of Unicode. Other posters have talked about the even more exotic
| stuff like emoji + skin tone. It's necessary to be very careful.
|
| Now, the old fart in me says that ASCII never had this problem.
| But the old fart in me knows about CRLF in text protocols, and
| that's never LFCR; and that if you want to make a ñ in ASCII you
| must send n ^H ~. I guess you can reverse that, but if you want
| to do more exotic things it becomes less obvious.
|
| (IIRC UCS-2 is the deadname, now we call it UTF-16 to remind us
| to always handle surrogate pairs correctly, which we don't.)
|
| TLDR: Strings are hard.
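|
| (The año example is easy to demonstrate in Python, where the
| two normalization forms reverse differently:
|
|     import unicodedata
|     nfc = unicodedata.normalize('NFC', 'a\u00f1o')  # precomposed
|     nfd = unicodedata.normalize('NFD', 'a\u00f1o')  # n + U+0303
|     nfc[::-1]  # 'oña', fine
|     nfd[::-1]  # 'õna', the tilde jumped onto the o
|
| Same text, same rendering, different reversals.)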
| progbits wrote:
| Semi-related (about length of emoji "characters", not reversing):
| https://hsivonen.fi/string-length/
|
| Previously discussed:
|
| https://news.ycombinator.com/item?id=20914184
|
| https://news.ycombinator.com/item?id=26591373
|
| As for this article & Python - as usual, it is biased towards
| convenience and implicit behavior rather than properly handling
| all edge cases.
|
| Compare with Rust where you can't "reverse" a string - that is
| not a defined operation. But you can either break it into a
| sequence of characters or graphemes and then reverse that, with
| expected results: https://play.rust-
| lang.org/?version=stable&mode=debug&editio...
|
| (Sadly the grapheme segmentation is not part of standard library,
| at least yet)
| aidenn0 wrote:
| > The answer is: it depends. There isn't a canonical way to
| reverse a string, at least that I'm aware of.
|
| Unicode defines grapheme clusters[1] that represent "user-
| perceived characters". Separating a string into those and
| reversing them seems like a pretty good way to go about it.
|
| 1: http://www.unicode.org/reports/tr29/
| qqii wrote:
| > Challenge: How would you go about writing a function that
| reverses a string while leaving symbols encoded as sequences of
| code points intact? Can you do it from scratch? Is there a
| package available in your language that can do it for you? How
| did that package solve the problem?
|
| So are there any good libraries that can deal with code points
| that are merged together into a single pictograph and reverse
| them "as expected"?
| da12 wrote:
| If you're using Python, check out grapheme:
| https://github.com/alvinlindstam/grapheme
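|
| Rough usage, for the flag case from the article:
|
|     import grapheme  # pip install grapheme
|
|     s = 'a\U0001F1FA\U0001F1F8b'   # a + US flag + b
|     grapheme.length(s)             # 3
|     ''.join(reversed(list(grapheme.graphemes(s))))
|     # -> 'b' + flag + 'a', with the flag intact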
| tl wrote:
| This is a nice dive into limitations in Python's unicode handling
| and at the end, how to work around some problems. But you could
| use languages with proper unicode support like Swift or Elixir
| (weirdly, HN is fighting flags in comment code, which makes
| examples harder to demonstrate).
| anamexis wrote:
| HN doesn't allow any emoji.
| mlindner wrote:
| The author tries to define "character" when there isn't actually
| any definition of what that even means. "Character" is a term
| limited to languages that actually use characters, and not all
| text is made up of characters.
| yoyohello13 wrote:
| Maybe I'm missing some prerequisite knowledge here, but why would
| I assume `flag="us"` is an emoji? Looking at that first block of
| code, there is no reason for me to think "us" is a single
| character.
|
| Edit: Turns out my browser wasn't rendering the flags.
| ljm wrote:
| If it's Windows, it doesn't actually use flags for those
| emojis, it renders a country code instead. If it wasn't
| supported you would just see the glyph for an unknown
| character.
|
| The reason is that they didn't want to be caught up in any
| arguments about what flag to render for a country during any
| dispute, as with, e.g., the flag for Afghanistan after the
| Taliban took control.
| happytoexplain wrote:
| In Windows Chrome, it doesn't render the emoji for me. In
| Android Chrome, it renders a flag emoji - not the raw region
| indicators (which look like the letters "u" and "s").
| Benlights wrote:
| I had the same issue when I read the article; I kept on getting
| stuck and asking myself what I was missing.
| greenyoda wrote:
| In my browser (Firefox on Windows), the thing between the
| quotes in the first block of code looks like a picture of the
| US flag cropped to a circle, not like the characters "us".
| yoyohello13 wrote:
| Ah, I see. I just opened it in Firefox. It looks like some JS
| library is not getting loaded in Edge. The author was talking
| about "us", "so", etc. looking like one character and I
| thought I was going crazy, lol.
| da12 wrote:
| A whole lesson in Unicode in itself right there with your
| experience, haha!
| bialpio wrote:
| Reminds me of an image that renders differently on Macs
| (https://www.bleepingcomputer.com/news/technology/this-
| image-...), I bet it'd make for a fun conversation that
| could make the participants question their sanity. :-)
| masklinn wrote:
| There should not be any JS involved though, only a font
| able to render these grapheme clusters.
|
| Do you see the US flag after "copy and paste this emoji" on
| https://emojipedia.org/flag-united-states/?
| jfk13 wrote:
| I don't think that's about a JS library. Firefox bundles an
| emoji font that supports some things -- such as the flags
| -- that aren't supported by Segoe UI Emoji on Windows, so
| it has additional coverage for such character sequences.
| yoyohello13 wrote:
| That makes sense. I saw a failure to load a JS module in
| the console and assumed that was part of the problem.
| jug wrote:
| I'm not surprised the flag had two components, but I _was_
| surprised the US flag was literally made of U and S, haha!
|
| I definitely thought it'd be something like [I am a Flag] and
| [The flag ID between 0 and 65535]. And reversing it would be
| [Flag ID] + [I am a Flag], which would not be a defined
| "component" and would instead be rendered as two individual
| nonsense characters.
| andylynch wrote:
| You might also have noticed this is partly a well-thought-out
| hack to make Unicode less sensitive to disagreements and
| changes in consensus on which flags are encoded, or even the
| names of the countries concerned!
| happytoexplain wrote:
| I guessed that it would become the USSR flag (US -> SU), but
| apparently Unicode doesn't define that one! I wonder why. That
| would have been humorous.
| bloak wrote:
| As I understand it, there is no two-letter ISO code for the
| USSR because when they update the standard they remove
| countries that no longer exist. In at least one case they have
| reused a code: CS has been both "Czechoslovakia" and
| "Serbia and Montenegro", neither of which currently exist.
|
| As a result, two-letter ISO codes are useless for many
| potential applications, such as, for example, recording which
| country a book was published in, unless you supplement them
| with a reference to a particular version of the standard.
|
| Is there a way of getting the Czechoslovakian flag as an emoji?
| And did Serbia and Montenegro get round to making a flag?
| happytoexplain wrote:
| Ah, I didn't realize they reused codes from ISO 3166-3. I
| figured, because they keep these regions around in their own
| set, there was some implication that the codes would not be
| reused.
| ts4z wrote:
| IIRC Unicode doesn't define country codes. It was a workaround
| for a political issue of which countries recognize which other
| countries.
|
| It would have been difficult to get the CN delegation to sign
| off on a list that contained TW, although there are probably
| others.
| andylynch wrote:
| There are many more than I realised - Wikipedia has a decent
| list https://en.m.wikipedia.org/wiki/List_of_states_with_limi
| ted_...
| chungy wrote:
| Unicode doesn't define any flags, really. That's up to the font
| rendering on systems/libraries.
| happytoexplain wrote:
| True, but Unicode explicitly defines "SU" as a deprecated
| combination, regardless of flags. Seems like they omit
| everything from the list of "no longer used" country codes,
| with some exceptions. I would think they would have no reason
| not to allow historical regions.
| WA9ACE wrote:
| I feel like I'm obligated to share this almost 20-year-old
| Spolsky post that gave me my understanding of characters.
|
| https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
| xmprt wrote:
| In that same vein, here's my introduction to Unicode about 10
| years ago from Tom Scott.
|
| https://www.youtube.com/watch?v=MijmeoH9LT4
| zerox7felf wrote:
| Poor man gave me and many others something like half of our
| introduction to computer science, but has gotten far more
| fame as the "emoji guy" for his repeated bouts with this
| particular part of unicode :)
| ciupicri wrote:
| That's more about the UTF-8 encoding than Unicode itself.
| bandyaboot wrote:
| Would be interesting to see the list of flag emojis that, when
| reversed, become a different flag emoji.
| jfk13 wrote:
| There are plenty of country codes that when reversed become a
| different, valid country code: e.g. Israel (IL) when reversed
| is Lithuania (LI); Australia (AU) becomes Ukraine (UA).
|
| Whether "reversing flag emojis" causes such transformations
| will depend on what is meant by "reversing", which is kind of
| the whole point here: there are a number of possible
| interpretations of "reverse".
| alfredxing wrote:
| Related -- I did a deep dive a couple years ago on emoji
| codepoints and how they're encoded in the Apple emoji font file,
| with the end goal of extracting the embedded images --
| https://github.com/alfredxing/emoji
| utopcell wrote:
| There are Unicode characters that reverse the parsing order
| themselves. This has been the basis of a code injection attack,
| analyzed in [1].
|
| [1] ``Trojan Source: Invisible Vulnerabilities'':
| https://trojansource.codes/trojan-source.pdf
| uniqueuid wrote:
| Upper and lower codepoints are really way too obscure and can
| create issues you didn't even know you had.
|
| I once had the very unpleasant experience of debugging a case
| where data saved with R on windows and loaded on macOS ended up
| with individually double-encoded codepoints.
|
| Not fun.
| randpx wrote:
| Try reversing the Canadian flag (CA) and you get the Ascension
| Island Flag (AC). Great article, but completely misses the point.
| Mesopropithecus wrote:
| Unfortunately the HN text input won't let me do this, but a funny
| starter for the article would have been this:
|
| '(Spanish flag)'[::-1]
|
| basically ''.join([chr(127466), chr(127480)]) vs.
| ''.join([chr(127466), chr(127480)])[::-1]
|
| I'll add this to my collection of party tricks and show myself
| out.
|
| Cool article!
| dhosek wrote:
| On the challenge front, there are things like á, which might be
| a single code point or two code points (a + combining ´). Then
| there are the really challenging things like a letter with two
| combining marks, where, if the components are individual
| characters, the order of the marks is not guaranteed to be
| consistent.
| saltminer wrote:
| Then you have stuff like zalgo text (http://eeemo.net/) which
| takes pride in abusing code points
| happytoexplain wrote:
| Which is why these APIs should always make normalization
| available: https://unicode.org/reports/tr15/
| treesknees wrote:
| But you can, and did, reverse a string. It seems you would need
| more details, such as a request to reverse the meaning or
| interpretation of the string, which is what the author is getting
| at.
|
| If someone challenges you to reverse an image, what do you do? Do
| you invert the colors? Mirror horizontally? Mirror vertically?
| Just reverse the byte order?
| wahern wrote:
| There's a specification problem here. I like to say that a
| "string" isn't a data structure, it's the absence of one.
| Discussing "strings" is pointless. It follows that comparing
| programming languages by their "string" handling is likewise
| pointless.
|
| Case in point: a "struct" in languages like C and Rust is
| literally a specification of how to treat segments of a
| "string" of contiguous bytes.
| shadowgovt wrote:
| Even the most basic ASCII string is still a data structure.
|
| Is it a Pascal string (length byte followed by data) or a C
| string (arbitrary run of bytes terminated by a null
| character)?
| wahern wrote:
| You qualified "string" with "ASCII", and also tacitly
| admitted you still need more information than the octets
| themselves--the length.
|
| Of course, various programming languages have primitives
| and concepts which they may label "string". But you still
| need to specify that _context_ , drawing in the additional
| specification those languages provide. Plus, traditionally
| and in practice, such concepts often serve the function of
| importing or exporting unstructured data. So even in the
| context of a specific programming language, the label
| "string" is often used to _elide_ details necessary to
| understanding the content and semantics of some particular
| chunk of data.
| shadowgovt wrote:
| I think I understand the difference; you're using
| "string" the way I would use "blob" or "untyped byte
| array."
|
| Shifting definitions to yours, I agree.
| avianlyric wrote:
| In languages like C, "string" isn't a proper data structure;
| it's a `char` array, which itself is little more than an `int`
| array or `byte` array.
|
| But these languages don't provide true "string" support. They
| just have a vaguely useful type alias that renames a byte
| array to a char array, and a bunch of byte array functions
| that have been renamed to sound like string functions. In
| reality all the language supports are byte arrays, with some
| syntactic sugar so you can pretend they're strings.
|
| Newer languages, like Go and Python 3, that were created in
| the world of Unicode provide true string types, where the
| type primitives properly deal with the idea of variable-length
| characters and provide tools to make it easy to manipulate
| strings and characters as independent concepts. If you want
| to ignore Unicode, because your specific application doesn't
| need to understand it, then you cast your strings into byte
| arrays, and all pretences of true string manipulation vanish
| at the same time.
|
| This is not to say that C can't handle Unicode; it's just that
| the language doesn't provide true primitives to manipulate
| strings and instead relies on libraries to provide that
| functionality, which is a perfectly valid approach. Just as
| baking more complex string primitives into your language
| is also a perfectly valid approach. It's just a question of
| trade-offs and use cases, i.e. the problem at the heart of
| all good engineering.
| samatman wrote:
| We would all be better off if this were actually true.
|
| Tragically, in C, a string is just _barely_ a data structure,
| because it must have \0 at the end.
|
| If it were the complete absence of a data structure, we would
| need some way to get at the length of it, and could treat a
| slice of it as the same sort of thing as the thing itself.
| jameshart wrote:
| Yep, it's as meaningful a programming task as 'reverse this
| double-precision float'.
| egypturnash wrote:
| Galaxy brain image reversal: completely redraw it from scratch,
| with a viewpoint 180 degrees from the original.
| ravi-delia wrote:
| New computer vision challenge
| zwerdlds wrote:
| In normal conditions you can check for a ZWJ, but with regional
| indicator chars, you would have to consider the regional char
| block as a single char in the reversal. Given that it isn't
| necessarily locale-dependent but presentation-layer-dependent,
| there might not be enough info to decide how to act.
| jerf wrote:
| So, in terms of acing interviews, increasingly one of the best
| answers to the question "Write some code that reverses a string"
| is that in a world of unicode, "reversing a string" is no longer
| possible or meaningful.
|
| You'll probably be told "oh, assume US ASCII" or something, but
| in the meantime, if you can back that up when they dig into it,
| you'll look really smart.
| Someone wrote:
| Even ASCII can be argued to be problematic.
|
| What is "3 >= 2", reversed?
|
| What is "Rijksmuseum", reversed?
| (https://en.wikipedia.org/wiki/IJ_(digraph); capitalization
| isn't simple here, either:
| https://en.wikipedia.org/wiki/IJ_(digraph)#Capitalisation)
| greenyoda wrote:
| > "reversing a string" is no longer possible or meaningful.
|
| If you really wanted to, you could write a string reversal
| algorithm that treated two-character emojis as an indivisible
| element of the string and preserved its order (just as you'd
| need to preserve the order of the bytes in a single multi-byte
| UTF-8 character). You'd just need to carefully specify what you
| mean by the terms "string", "character" and "reverse" in a way
| that includes ordered, multi-character sequences like flag
| emojis.
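|
| A minimal sketch of that, handling only the flag case (real
| grapheme segmentation has many more rules):
|
|     RI = range(0x1F1E6, 0x1F200)  # regional indicator symbols
|
|     def reverse_keeping_flags(s):
|         units, i = [], 0
|         while i < len(s):
|             # two adjacent regional indicators = one flag unit
|             step = 2 if (i + 1 < len(s) and ord(s[i]) in RI
|                          and ord(s[i + 1]) in RI) else 1
|             units.append(s[i:i + step])
|             i += step
|         return ''.join(reversed(units))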
| happytoexplain wrote:
| I would argue that it is possible and meaningful. AFAIK
| extended grapheme clusters are well defined by the standard,
| and are very well suited to the default meaning of when
| somebody says "character", so, given no other information, it's
| reasonable to reverse a string based on them. I guess the issue
| is "reverse a string" lacks details, but I think that's
| different from "not meaningful".
| viktorcode wrote:
| You certainly can. `print(String(flag.reversed()))` in Swift
| reverses emojis correctly.
| Spivak wrote:
| Reversing a string is still meaningful. Take a step back
| outside the implementation and imagine handing a Unicode string
| to a human. They could without any knowledge look at the
| characters they see and produce the correct string reversal.
|
| There is a solution to this which is to compute the list of
| grapheme clusters, and reverse that.
|
| https://unicode.org/reports/tr29/
| akersten wrote:
| > imagine handing a Unicode string to a human. They could
| without any knowledge look at the characters they see and
| produce the correct string reversal.
|
| I really highly doubt it.
|
| How do you reverse this?: mrHban , hdhh slsl@.
|
| Can you do it without any knowledge about whether what looks
| like one character is actually a special case joiner between
| two adjacent codepoints that only happens in one direction?
| Can you do it without knowing that this string appears
| wrongly in the HN textbox due to an apparent RTL issue?
|
| It's just not well-defined to reverse a string, and the
| reason we say it's not meaningful is that no User Story ever
| starts "as a visitor to this website I want to be able to see
| this string in opposite order, no not just that all the bytes
| are reversed, but you know what I mean."
| adolph wrote:
| Is a RTL character string already "reversed" from a LTR
| POV?
|
| Is an absolute value signed as positive?
| Spivak wrote:
| I mean no but only because I don't understand the
| characters. Someone who reads Arabic (I assume based on the
| shape) would have no trouble. You're nitpicking cases where,
| for _some readers_, visual characters might be hard to
| distinguish, but it doesn't change the fact that _there
| exists a correct answer_ for every piece of text that will
| be obvious to readers of that text, which is the definition
| of a grapheme cluster.
| akersten wrote:
| > the fact that there exists a correct answer for every
| piece of text that will be obvious to readers of that
| text which is the definition of a grapheme cluster.
|
| No, I insist there is _not_ a single "correct answer,"
| even if a reader has perfect knowledge of the language(s)
| involved. Now remember, this is already moving the
| goalposts, since it was claimed that a human needed "no
| knowledge" to get to this allegedly "correct answer."
|
| You already admit that people who don't speak Arabic will
| have trouble finding the "grapheme clusters," but even
| two people who speak Arabic may do your clustering or
| not, depending on some implicit feeling of "the right way
| to do it" vs taking the question literally and pasting
| the smallest highlight-able selection of the string in
| reverse at a time.
|
| Anyway, take a string like this: "here is some Arabic
| text: <RLM> <Arabic codepoints> <LRM> And back to
| English"
|
| Whether you discard the ordering marks[0], keep them, or
| invert them is an implementation decision that already
| produces three completely different strings. Unless we
| want to write a rulebook for the right way to reverse a
| string, it remains an impossibility to declare anything
| the correct answer, and because there is no _reason_ to
| reverse such a string outside of contrived interview
| questions and ivory tower debates, it is also
| meaningless.
|
| [0]: https://en.m.wikipedia.org/wiki/Right-to-left_mark
| https://en.m.wikipedia.org/wiki/Left-to-right_mark
| Spivak wrote:
| You added the requirement that it be a single correct
| answer. I just asserted that there existed a correct
| answer. You're being woefully pedantic -- a human who can
| read the text presented to them but _no knowledge of
| unicode_ was my intended meaning. Grapheme clusters are
| language dependent and chosen for readers of languages
| that use the characters involved. There's no implicit
| feeling, this is what the standards body has decided is
| the "right way to do it." If you want to use different
| grapheme clusters because you think the Unicode people
| are wrong then fine, use those. You can still reverse the
| string.
|
| Like what are you even arguing? You declared that
| something was impossible and then ended by saying it's
| not only possible but so possible that there are
| many reasonable correct answers. Pick one and call it a
| day.
| akersten wrote:
| > Like what are you even arguing?
|
| It is impossible to "correctly reverse a string" because
| "reverse a string" is not well defined. We explored many
| different potential definitions of it, to show that there
| is no meaningful singular answer.
|
| > You added the requirement that it be a single correct
| answer.
|
| Your original post says "they could produce _the_ correct
| string reversal"?
| happytoexplain wrote:
| >what looks like one character is actually a special case
| joiner between two adjacent codepoints
|
| Are you referring to a grouping not covered by the
| definition of grapheme clusters (which I am only passingly
| familiar with)? If so, then I don't think it's any more
| non-meaningful to reverse it than to reverse an English
| string. The result is gibberish to humans either way - it
| sounds more like you're saying that there is no universally
| "meaningful to humans" way to reverse some text in
| potentially any language, which is true regardless of what
| encoding or written language you're using. I was thinking
| of it more from the programmer side - i.e. that Unicode
| provides ways to reverse strings that are more "meaningful"
| (as opposed to arbitrary) than e.g. just reversing code
| points.
| nonameiguess wrote:
| You can even demonstrate a similar concept with English and
| Latin characters. There is no single thing called a
| "grapheme" linguistically. There are actually two different
| types of graphemes. The character sequence "sh" in English
| is a single referential grapheme but two analogical
| graphemes. Depending on what the specification means,
| "short" could be reversed as either "trosh" or "trohs".
| That's without getting into transliteration. The word for
| Cherokee in the Cherokee language is "Tsalagi" but the "ts"
| is a Latin transliteration of a single Cherokee character.
| Should we count that as one grapheme or two?
|
| Of course, if an interviewer is really asking you how to do
| this, they're probably either 1) working in bioinformatics,
| in which case there are exactly four ASCII characters they
| really care about and the problem is well-defined, or 2)
| it's implementing something like rev | cut -d '-' -f1 | rev
| to get rid of the last field and it doesn't matter how you
| implement "rev" just so long as it works exactly the same
| in reverse and you can always recover the original string.
| Spivak wrote:
| The fact that how to reverse a piece of text is locale-
| dependent doesn't mean it's impossible. Basically any
| transformation on text will be locale-dependent. Hell,
| _length_ is locale-dependent.
| lloeki wrote:
| Should it reverse a BOM as well or keep it first?
| Spivak wrote:
| Keep it first? Like that's not a gotcha. Your input is a
| string and the output is that string visually reversed.
| What it looks like in memory is irrelevant.
| paxys wrote:
| UTF-8 string reversal has been a thing for a long time in
| most/all programming languages. It may not work perfectly in
| 100% of the cases, but that doesn't mean reversing a string is
| no longer possible.
| jerf wrote:
| "It may not work perfectly in 100% of the cases, but that
| doesn't mean reversing a string is no longer possible."
|
| It depends on your point of view. From a strict point of
| view, it _does_ exactly mean it is no longer possible. By
| contrast, we all 100% knew what reversing an ASCII string
| meant, with no ambiguity.
|
| It also depends on the version of Unicode you are using, and
| oh by the way, unicode strings do not come annotated with the
| version they are in. Since it's supposed to be backwards
| compatible hopefully the latest works, but I'd be unsurprised
| if someone can name something whose correct reversal depends
| on the version of Unicode. And, if not now, then in some
| later not-yet-existing pair of Unicode standards.
| pwdisswordfish9 wrote:
| > By contrast, we all 100% knew what reversing an ASCII
| string meant, with no ambiguity.
|
| Not if the ASCII string employed the backspace control
| character to accomplish what is today done with Unicode
| combining characters.
|
| Or, in fact, if it employed any other kind of control
| sequence.
| thaumasiotes wrote:
| I always thought it was interesting that ASCII is
| transparently just a bunch of control codes for a
| typewriter (where "strike an 'a'" is a mechanical
| instruction no different from "reset the carriage
| position"), but when we wanted to represent symbolic data
| we copied it and included all of the nonsensical
| mechanical instructions.
| adzm wrote:
| Well, the control codes were specifically for TTYs rather
| than typewriters; many of the control codes still make
| sense from that standpoint.
| jameshart wrote:
| Like... \r\n
| jcelerier wrote:
| > It may not work perfectly in 100% of the cases, but that
| doesn't mean reversing a string is no longer possible.
|
| I don't understand why in maths finding one single counter-
| example is enough to disprove a theorem, yet in programming
| people seem to be happy with a 99.x% success rate. To me,
| "It may not work perfectly in 100% of the cases" exactly
| means "no longer possible" as "possible" used to imply that
| it would work consistently, 100% of the time.
| tux3 wrote:
| It is very useful in engineering to do things that are
| mathematically impossible, by simply ignoring or rejecting
| the last 1%.
|
| Sometimes that's unacceptable, because you really do care
| about 100% of cases. When it isn't, you get really cool
| "impossible" tools out of it :)
| paxys wrote:
| Because programming is not a science (or at most it is an
| applied science).
|
| By your logic any software that has a single bug would be
| useless, and if that were the case this entire profession
| wouldn't exist.
| jameshart wrote:
| I'd go further and argue that _in general_ reversing a string
| isn't possible or meaningful.
|
| It's just not a thing people do, so it's just... not very
| interesting to argue about what the 'correct' way to do it is.
|
| Similarly, any argument over whether a string has n characters
| or n+1 characters in it is almost entirely meaningless and
| uninteresting for real world string processing problems. Allow
| me to let you into a secret:
|
| _there's never really such a thing as a 'character limit'_
|
| There might be a 'printable character width' limit; or there
| might be a 'number of bytes of storage' limit. Which means
| interesting questions about a string include things like 'how
| wide is it when displayed in this font?' or 'how many bytes
| does it take to store or transmit it?'... But there's rarely
| any point where, for a general string, it is really interesting
| to know 'how many characters does the string contain?'
|
| Processing direct user text input is the only situation where
| you really need a rich notion of 'character', because you need
| to have a clear sense of what will happen if the user moves a
| cursor using a left or right arrow, and for exactly what will
| be deleted when a user hits backspace, or copied/cut and pasted
| when they operate on a selection. The ij ligature might be a
| single glyph, but is it a single character? When does it
| matter? Probably not at all unless you're trying to decide
| whether to let a user put a cursor in the middle of it or not.
|
| And next to that, arguing that there is such a thing as a
| 'correct' way to reverse "Rijndael" according to a strict
| reading of Unicode glyph composability rules seems like a
| supremely silly thing to try to do.
|
| I'd much rather, when asked to reverse a string, more
| developers simply said 'that doesn't make sense, you can't
| arbitrarily chunk up a string and reassemble it in a different
| order and expect any good to come of it'.
| Beldin wrote:
| Interestingly, on my phone the so-called flag is not a flag at
| all, but "US" in outline.
|
| So Python behaves as expected: the 2-character string, when
| reversed, becomes "SU". Similar stuff happens with the other
| "flag" strings.
|
| I'm sure the emojis on my phone are outdated. I'm not sure how
| that affects whether I see a flag or letters.
| pilsetnieks wrote:
| Thankfully, there isn't an assigned ISO 3166-1 2-letter country
| code for SU currently; people may have interesting reactions
| seeing what happens when reversing a US flag emoji if there
| were.
| nextstep wrote:
| Compare all of this nonsense to how it's done in Swift. String
| APIs in Swift are great: intuitive and do what you expect.
| exdsq wrote:
| Am I missing something or is this Day 1 of a programming course
| in C?
| techwiz137 wrote:
| It's pretty funny that reversing the American flag yields Soviet
| Union(SU).
| emodendroket wrote:
| What I'd like to know is, given the explosion of the character
| set for emoji, does the rationale for Han unification still make
| sense? The case for not allowing national variants seems less and
| less compelling with every emoji they add.
|
| This is a bit of a hobby horse, but imagine if every time you
| read an article in English on your phone some of the letters were
| replaced with "equivalent" Greek or Cyrillic ones, and you can
| get an idea of the annoyance. Yeah, you can still read it with a
| bit of thought, but who wants to read that way?
| AlanYx wrote:
| I agree that Han unification was an unfortunate design
| decision, but I'd argue that the consortium is following a
| consistent approach to the Han unification with emoji. For
| example, they treat "regional" vendor variations in emoji as a
| font issue. If you get a message with the gun emoji, unless you
| have out-of-band information regarding which vendor variant is
| intended, there's no way in software to know if it should be
| displayed as a water gun (Apple "regional" variant) or a weapon
| (other vendor variants). Which is not that different from a
| common problem stemming from Han unification.
| emodendroket wrote:
| I don't disagree, but my point is more that their concern was
| about having "too many characters" in Unicode, which no
| longer seems to be a real concern, so what would be the harm
| of adding national variants?
| hougaard wrote:
| In other news, water is wet :)
| michaelsbradley wrote:
| See chapter 7 in _Hacking the Planet (with Notcurses)_ for a
| short treatment of encodings, extended grapheme clusters, etc.
|
| https://nick-black.com/htp-notcurses.pdf#page53
| smegsicle wrote:
| did they think all those skintone emojis are individual
| codepoints?
| advisedwang wrote:
| They might have thought that `reverse()` had some kind of
| unicode-aware handling. I believe `upper()`/`lower()` do.
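|
| They do; e.g. Python's upper() knows about one-to-many case
| mappings:
|
|     'stra\u00dfe'.upper()  # 'STRASSE': the sharp s becomes SS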
| daveslash wrote:
| When I first realized that the skin tone emojis were a code-
| point + a color code-point modifier, I tried to see what other
| colors there were and if I could apply those to _other_ emojis.
| The immature child in me looked to see if there was a red color
| code point and, if so, whether I could use it to make a _"blood
| poop"_ emoji. Turns out... no.
| codingkev wrote:
| Yes, this allows for easy building of flag emojis as long as you
| know the ISO 3166 two-letter country code.
|
| Example: https://github.com/kennell/flagz/blob/master/flagz.py
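|
| The core trick is just an offset into the regional indicator
| block; something like:
|
|     def flag(code):  # two-letter ISO 3166 code, e.g. 'us'
|         return ''.join(chr(0x1F1E6 + ord(c) - ord('A'))
|                        for c in code.upper())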
| sltkr wrote:
| So what was the deal with the Scottish flag?
| gsnedders wrote:
| From Wikipedia:
|
| > A separate mechanism (emoji tag sequences) is used for
| regional flags, such as England, Scotland, Wales, Texas or
| California. It uses U+1F3F4 WAVING BLACK FLAG and formatting
| tag characters instead of regional indicator symbols. It is
| based on ISO 3166-2 regions with hyphen removed and lowercase,
| e.g. GB-ENG -> gbeng, terminating with U+E007F CANCEL TAG. Flag
| of England is therefore represented by a sequence U+1F3F4,
| U+E0067, U+E0062, U+E0065, U+E006E, U+E0067, U+E007F.
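|
| That sequence is straightforward to build in Python (a sketch
| from the description above):
|
|     BLACK_FLAG, CANCEL = '\U0001F3F4', '\U000E007F'
|
|     def subdivision_flag(code):  # e.g. 'GB-SCT'
|         letters = code.replace('-', '').lower()
|         # tag characters mirror ASCII at U+E0000 + ordinal
|         return (BLACK_FLAG
|                 + ''.join(chr(0xE0000 + ord(c)) for c in letters)
|                 + CANCEL)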
| ghostly_s wrote:
| This was the only part that was surprising to me, and as it
| turns out my surprise mostly stems from still not really
| understanding how the United Kingdom works.
| tialaramex wrote:
| Don't worry, "How the United Kingdom works" is a political
| question and so subject to change.
|
| For example, Wales was essentially just straight up
| conquered, and so for long periods Wales did not have any
| distinct legal identity from England. You'll see that today
| there's a bunch of laws which are for _England and Wales_
| but notably not Scotland, including criminal laws. In
| living memory Wales got some measure of independent control
| over its own affairs, via an elected "Assembly" but what
| powers are "devolved" to this assembly are in effect the
| gift of the Parliament, in Westminster, which is sovereign.
| Whether taking away those powers would go well is a good
| question.
|
| On the other hand, Northern Ireland is what's left of
| English/ British dominion over the entire island of
| Ireland, most of which today is the Republic of Ireland, a
| sovereign entity with its own everything. It's only existed
| for about a century, and is a result of the agreed
| "partition" when the Irish rebelled because _most of the
| Irish_ wanted independence but those in the North not so
| much. Feel free to read about euphemistically named
| "Troubles". In the modern era, Northern Ireland, like
| Wales, gets a devolved government in Stormont. Unlike
| Wales, the Northern Ireland government is a total mess, and
| e.g. they have abortion (like the rest of the UK, and like
| the rest of Ireland) only because Stormont was so broken
| that Westminster imposed abortion legalisation on them
| since they weren't actually governing. If you think the US
| Congress is dysfunctional, check out Stormont...
|
| Finally Scotland was for a very long time an independent
| but closely related sovereign nation. It _agreed_ to join
| this United Kingdom about three hundred years ago in the
| Acts of Union after about a century with the same Monarch
| ruling both countries. However, it too got a devolved
| government, a Parliament, probably the most powerful of the
| three, in Holyrood, Edinburgh, in the 20th century, and it
| has a relatively powerful pro-independence politics, the
| Scottish National Party is the dominant power in Scottish
| politics, although how many of its voters _actually_
| support independence per se is tricky to judge.
|
| Brexit changed all this again, because as part of the EU a
| bunch of the powers you could reasonably localise, and so
| were "devolved" to Wales, Scotland and Northern Ireland,
| had been controlled by EU law. So Westminster could _say_
| they were devolved, knowing that the constituent entities
| couldn't actually do much with this supposed power. Having
| left the EU, those powers were among the things Brexiteers
| seemed to have imagined now lay at Westminster, but of
| course the devolved countries said no, these are our
| powers, we get to decide e.g. how agricultural subsidies
| are distributed to suit our farmers.
|
| That's even more fun in Northern Ireland, because they
| share a border with the Republic, an EU member, and so
| they're not allowed to have certain rules that would
| obviously result in a physical border with guards and so
| on. Their Unionists (the people who are why it isn't just
| part of the Republic of Ireland because they want to be in
| the United Kingdom) feel like they were sold out by
| Westminster politicians, while the Republicans (those who'd
| rather be part of the Republic) see this as potentially a
| further argument in favour of that. All of which isn't
| helping at all to keep the peace between these rivals, that
| peace being the whole reason we don't want to put up a
| border...
| dhosek wrote:
| Most flags use the ISO 2-character country code to access their
| values. However, some flags don't map to 2-character country
| codes (Scotland being one example). In this case it uses the
| sequence black flag, GBSCT (for Great Britain-Scotland,
| represented using the tag latin small letter codes for the
| letters) then cancel tag to end the sequence. Changing the
| middle five to be GBENG gives the English flag and GBWLS gives
| the Welsh flag.
| [deleted]
| architectdrone wrote:
| Humorously, on my local machine, I only see the string "us", and
| was rather confused when he was asserting that it was a single
| character :D
___________________________________________________________________
(page generated 2022-01-27 23:00 UTC)