[HN Gopher] Any Encoding, Ever - ztd.text and Unicode for C++
       ___________________________________________________________________
        
       Any Encoding, Ever - ztd.text and Unicode for C++
        
       Author : hasheddan
       Score  : 85 points
       Date   : 2021-07-01 01:24 UTC (21 hours ago)
        
 (HTM) web link (thephd.dev)
 (TXT) w3m dump (thephd.dev)
        
       | theamk wrote:
        | There is a lot of arguing with invisible opponents in that
        | text: plenty of counter-arguments, but only hints of what they
        | are replying to.
       | 
        | Other than that, it seems to present a C++ version of an
        | encoding-conversion library (like iconv), but instead of
        | handling encodings dynamically it uses C++ types. So it's not
        | very good for any place with a user-specified encoding (like a
        | browser, database, or text editor), but it can be used when the
        | encoding is known at compile time?
       | 
       | I can see a few uses of it, but it seems kinda niche for all that
       | grandiose talk about "liberation of the encoding hell".
        
         | aidenn0 wrote:
         | ztd::text::transcode takes an encoding type, as shown in the
         | second example.
         | 
         | In any event, a single switch statement turns a compile-time
         | encoding into a run-time encoding.
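          | 
          | For instance, roughly like this (a sketch only; the
          | ztd::text encoding object names and this particular
          | transcode overload are taken from the article's examples and
          | might not match the shipped library exactly):
          | 
          |     #include <ztd/text.hpp>
          | 
          |     #include <string>
          |     #include <string_view>
          | 
          |     // Hypothetical run-time tag from a config file, HTTP
          |     // header, database column, etc.
          |     enum class src_encoding { ascii, shift_jis };
          | 
          |     // One switch at the boundary picks the compile-time
          |     // encoding type; everything downstream stays static.
          |     std::u8string to_utf8(std::string_view bytes,
          |                           src_encoding which) {
          |         switch (which) {
          |         case src_encoding::ascii:
          |             return ztd::text::transcode(
          |                 bytes, ztd::text::ascii{}, ztd::text::utf8{});
          |         case src_encoding::shift_jis:
          |             return ztd::text::transcode(
          |                 bytes, ztd::text::shift_jis{}, ztd::text::utf8{});
          |         }
          |         return std::u8string(); // unreachable for valid tags
          |     }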
        
       | lifthrasiir wrote:
       | I think the presented design is almost okay, even with the
       | complexity of legacy encodings, but one thing raised my eyebrows:
       | 
       | > If your type always has a replacement character, regardless of
       | the situation, it can signal this by writing one of two
       | functions: replacement_code_units() (for any failed encode step),
       | replacement_code_points() (for any failed decode step).
       | 
        | They (and their cousins maybe_replacement_code_units/points) do
        | not accept any argument, and that's wrong. There are stateful
        | encodings like ISO/IEC 2022 and HZ where you might need the
        | current encoder state to select a correct replacement character.
        | For example `?` might not be representable in the current state
        | and needs a switch sequence. Fortunately, in ISO 2022 you can
        | always switch to the desired character set with a fixed
        | sequence, so a fixed replacement sequence works there, but you
        | can't do that in HZ. It is technically possible to force-switch
        | to the default state when an error occurs, but not all users
        | will want that behavior. It should be obvious to anyone that
        | stateful encodings are in general a PITA.
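        | 
        | To make that concrete, here is roughly the shape of hook I have
        | in mind (illustrative only - these are not the library's actual
        | extension points): in an HZ-like encoding, "~{" enters GB mode
        | and "~}" leaves it, so the right replacement bytes depend on
        | the encoder's current shift state.
        | 
        |     #include <span>
        | 
        |     struct hz_like_encoding {
        |         struct state { bool in_gb_mode = false; };
        |         using code_unit = char;
        | 
        |         // A state-aware replacement hook: '?' is directly
        |         // representable in ASCII mode, but in GB mode the
        |         // encoder must first emit the "~}" switch-out
        |         // sequence (which not every user will want).
        |         std::span<const code_unit>
        |         replacement_code_units(const state& s) const {
        |             static constexpr code_unit ascii_repl[] = { '?' };
        |             static constexpr code_unit gb_repl[] =
        |                 { '~', '}', '?' };
        |             return s.in_gb_mode
        |                 ? std::span<const code_unit>(gb_repl)
        |                 : std::span<const code_unit>(ascii_repl);
        |         }
        |     };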
        
       | flohofwoe wrote:
        | I'd like to see a bit more explanation of why the "UTF-8
        | Everywhere" (https://utf8everywhere.org/) advice is poor.
        | 
        | Outside of the Windows APIs, UTF-16 is pretty much irrelevant
        | (yes, there are a few popular languages around that jumped on
        | the Unicode bandwagon too early and are now stuck with UTF-16,
        | but those languages usually also offer simple conversion
        | functions to and from other encodings - like UTF-8).
       | 
       | UTF-32 has its place as internal runtime string representation,
       | but not for data exchange.
       | 
       | UTF-8 has won, and for all the right reasons. UTF-16 is a
       | backward compatibility hack for legacy APIs, and UTF-32 is a
       | special-case encoding for easier runtime-manipulation of string
       | data.
        
         | lifthrasiir wrote:
          | The author claims that the UTF-8 Everywhere manifesto makes
          | people think they don't need to implement anything but UTF-8
          | - a belief which is false and not even what the manifesto
          | actually says. I think the author's claim itself is also
          | false.
        
         | flohofwoe wrote:
          | Btw, I'm almost certain that the first code example, which
          | expects the command line arguments to be UTF-8 strings, won't
          | work on Windows; instead the argument strings will be 8-bit
          | strings in the current system code page. The only way I got
          | this working reliably with the standard main() entry point
          | was to call the Win32 function GetCommandLineW() and then
          | split into args and convert to UTF-8 myself.
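          | 
          | Roughly like this (a sketch; the Win32 calls are real, but
          | error handling is omitted):
          | 
          |     #include <windows.h>
          |     #include <shellapi.h> // CommandLineToArgvW
          | 
          |     #include <string>
          |     #include <vector>
          | 
          |     // Fetch the UTF-16 command line, split it, and convert
          |     // each argument to UTF-8 ourselves.
          |     std::vector<std::string> utf8_args() {
          |         int argc = 0;
          |         LPWSTR* wargv =
          |             CommandLineToArgvW(GetCommandLineW(), &argc);
          |         std::vector<std::string> args;
          |         if (wargv == nullptr) return args;
          |         for (int i = 0; i < argc; ++i) {
          |             int len = WideCharToMultiByte(
          |                 CP_UTF8, 0, wargv[i], -1, nullptr, 0,
          |                 nullptr, nullptr);
          |             std::string arg(static_cast<size_t>(len), '\0');
          |             if (len > 0) {
          |                 WideCharToMultiByte(
          |                     CP_UTF8, 0, wargv[i], -1, arg.data(),
          |                     len, nullptr, nullptr);
          |                 arg.resize(len - 1); // drop the terminator
          |             }
          |             args.push_back(std::move(arg));
          |         }
          |         LocalFree(wargv);
          |         return args;
          |     }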
        
           | flohofwoe wrote:
           | PS: found some example code in the github project which seems
           | to take care of the special Windows situation:
           | 
           | https://github.com/soasis/text/blob/main/examples/documentat.
           | ..
           | 
           | Unfortunately this doesn't look so simple anymore.
        
             | nly wrote:
             | This code looks like a disaster to me.
             | 
             | The issue on Windows is simply that code pages are a per-
             | application setting (or state) and not a system setting.
             | Programs are free to run with, and change between, any
             | "active code page" they want and spit bytes out in that
             | code page.
             | 
              | This is all fine and dandy when you're talking to a -A
              | Windows API, since Windows knows the ACP of your
              | application, but it's a disaster for I/O and for making
              | arbitrary programs talk to one another.
             | 
             | Nothing you can do in your code can fix this, it's a
             | contract you have to have with the outside world.
             | 
              | On Linux, calling setlocale() with a non-UTF-8 locale
              | might not even work, because your /etc/locale.gen file
              | will probably only have a couple of enabled entries, and
              | doing so would be mostly pointless because your terminal
              | and all your other programs are likely using UTF-8.
             | 
              | The downside to the Linux approach is that if somebody
              | does actually send you an ISO-8859-1 encoded file, you
              | will have to do some conversion:
              | 
              |     #include <boost/locale.hpp>
              |     #include <cstdint>
              |     #include <fstream>
              |     #include <iostream>
              |     #include <string>
              | 
              |     int main(int argc, char** argv) {
              |         std::ifstream ifs("test.txt");
              |         std::string str;
              | 
              |         // No dependency on what's in /etc/locale.gen
              |         boost::locale::generator gen;
              |         auto loc = gen("en_US.ISO-8859-1");
              |         // Should make formatting input functions work
              |         ifs.imbue(loc);
              | 
              |         std::getline(ifs, str);
              |         str = boost::locale::conv::to_utf<char>(str, loc);
              | 
              |         for (auto wc : str) {
              |             std::cout << sizeof(wc) << ": "
              |                       << static_cast<uint64_t>(
              |                              static_cast<unsigned char>(wc))
              |                       << "\n";
              |         }
              |     }
        
         | MaxBarraclough wrote:
         | > there are a few popular languages around which jumped on the
         | UNICODE bandwagon too early and are now stuck with UTF-16, but
         | those languages usually also offer simple conversion functions
         | 
         | Java does this, right?
        
           | x4e wrote:
            | Java uses UTF-16 for internally storing strings; however,
            | all its APIs use the platform default charset, which is
            | usually UTF-8. It also uses modified UTF-8 for storing
            | strings inside compiled classes.
        
       | Animats wrote:
       | If you use Go, Rust, Python 3, or Javascript, this is a non-
       | problem, because those are all native UTF-8. Only for legacy
       | languages is this still a problem.
        
         | layoutIfNeeded wrote:
         | False. JavaScript uses UTF16:
         | https://mathiasbynens.be/notes/javascript-encoding
        
         | jcelerier wrote:
         | sure fam
         | https://stackoverflow.com/questions/12053168/how-to-properly-
         | output-a-string-in-a-windows-console-with-go
         | https://github.com/intellij-rust/intellij-rust/issues/766
         | https://bugs.python.org/issue44275
         | 
         | etc etc
        
         | chrismorgan wrote:
         | Quite apart from the fact that you're talking at cross-purposes
         | to the article, you're wrong about three of the four languages
         | you mention.
         | 
         | Go strings aren't actually UTF-8; they aren't even Unicode:
         | they can contain arbitrary bytes. In practice, _most_ strings
         | will be UTF-8, but you can't depend on it at all.
         | 
         | Rust strings are strictly UTF-8. I appreciate this greatly.
         | 
         | Python 3 strings are sequences of Unicode code points (as
         | distinct from Unicode scalar values), allowing ill-formed
         | Unicode (unmatched surrogates), and its internal representation
         | is a disaster because they decided indexing by code point was
         | worthwhile (it's not, given the costs), so since CPython 3.3
         | strings are encoded as ISO-8859-1 (only able to represent
         | U+0000 to U+00FF), UCS-2 (further able to represent U+0100 to
         | U+FFFF) or UCS-4 (able to represent all Unicode code points).
         | 
         | JavaScript strings allow ill-formed Unicode, and the public
         | representation can be described as either UCS-2 or ill-formed
         | UTF-16 (almost all string access techniques work in UTF-16 code
         | units, though new stuff is now mostly working in Unicode and
         | UTF-8 terms). I think all major engines now use an internal
         | representation of ISO-8859-1 if the string contains only U+0000
         | to U+00FF, or ill-formed UTF-16 for anything else; but Servo's
         | work has demonstrated that it's possible to shift to WTF-8
         | (UTF-8 plus allowing unmatched surrogates), saving lots of
         | memory on some pages and simplifying and slightly speeding up
         | some things, but at the cost of random access performance by
         | code unit index.
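          | 
          | (For the curious, WTF-8 is just generalized UTF-8: a lone
          | surrogate U+D800..U+DFFF is allowed to take the ordinary
          | three-byte form instead of being rejected. A sketch of the
          | encode side is below; a real encoder must also combine valid
          | surrogate pairs into one code point first, rather than
          | emitting them as two three-byte sequences.)
          | 
          |     #include <string>
          | 
          |     // Append one code point to a WTF-8 byte string. Unlike
          |     // strict UTF-8, the U+D800..U+DFFF range is not treated
          |     // as an error.
          |     void wtf8_append(std::string& out, char32_t cp) {
          |         if (cp < 0x80) {
          |             out += static_cast<char>(cp);
          |         } else if (cp < 0x800) {
          |             out += static_cast<char>(0xC0 | (cp >> 6));
          |             out += static_cast<char>(0x80 | (cp & 0x3F));
          |         } else if (cp < 0x10000) { // includes lone surrogates
          |             out += static_cast<char>(0xE0 | (cp >> 12));
          |             out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
          |             out += static_cast<char>(0x80 | (cp & 0x3F));
          |         } else {
          |             out += static_cast<char>(0xF0 | (cp >> 18));
          |             out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
          |             out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
          |             out += static_cast<char>(0x80 | (cp & 0x3F));
          |         }
          |     }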
        
           | Animats wrote:
           | Right, Python 3 looks like it's UTF-8, but CPython has that
           | weird 1, 2 or 4 byte representation. PyPy, though, finally
           | went UTF-8. Not sure how they do indexing.
           | 
           | Go strings are supposed to be UTF-8, but it's not enforced.
           | 
           | Javascript - bleah.
        
             | chrismorgan wrote:
             | > _Python 3 looks like it 's UTF-8_
             | 
             | No, it never looked like it was UTF-8. It did look like it
             | was _Unicode_ (of an unspecified encoding), but Python
             | allows unpaired surrogates, which makes for ill-formed
             | Unicode. Python lets you have  '\udead'. Well-formed
             | encodings of Unicode (such as UTF-8, UTF-16 and UTF-32)
             | cannot represent U+DEAD.
             | 
             | There's a big difference between "Unicode" and "UTF-8".
             | 
             | > _PyPy, though, finally went UTF-8._
             | 
             | Oh cool! I remember vaguely hearing the idea being mulled
             | over, years ago. Good to see it happened:
             | https://morepypy.blogspot.com/2019/03/pypy-v71-released-
             | now-....
             | 
             | Wonder what they've done about surrogates; the only way
             | they can truly have switched to UTF-8 internally is if
             | they've broken compatibility, which I doubt they've done;
             | I'm guessing that they actually use WTF-8, not UTF-8.
             | 
             | (In the meantime, I've changed "Python 3.3" in the
             | grandparent comment to "CPython 3.3" for accuracy.)
        
         | samatman wrote:
          | Would have been better to read the article, because then
          | you'd be in a position to explain how those languages handle
          | transcoding from Shift-JIS to UTF-8.
        
         | gameman144 wrote:
          | This is explicitly called out in the article: non-UTF-8
          | encodings still need to be handled _somehow_, and this
          | library makes those just as easy to handle as UTF-8.
        
         | 1wd wrote:
         | Python 3 by default still opens text files using the legacy
         | locale encoding.
        
       | nly wrote:
       | Why are the examples for ztd.text writing UTF-16 to std::cout?
       | Won't this fail on Linux where UTF-16 is rarely used and
       | terminals typically default to UTF-8?
       | 
       | Personally I've enjoyed using the tiny-weeny header only
       | utfcpp[0] for simple manipulation of UTF-8 and unicode-to-unicode
       | conversions, and typically use Boost.Locale[1] when I need to do
       | I/O.
       | 
        | 95% of localization problems have almost nothing to do with
        | encoding. Encodings are boring. It's like arguing over JSON vs
        | YAML. To do I/O properly across platforms you _may_ need
        | wrappers for the standard output streams that will handle
        | conversion for you, sure, but... you also need to handle date,
        | time, and currency formatting/parsing, message formatting, and
        | translations.
       | 
       | See [2] regarding Windows:
       | 
       | > All of the examples that come with Boost.Locale are designed
       | for UTF-8 and it is the default encoding used by Boost.Locale.
       | 
       | Personally I think doing I/O as UTF-8 on Windows is the right
       | direction, as Microsoft have been enhancing UTF-8 support in
       | Windows for quite a while now.
       | 
       | See[3]:
       | 
       | > Until recently, Windows has emphasized "Unicode" -W variants
       | over -A APIs. However, recent releases have used the ANSI code
       | page and -A APIs as a means to introduce UTF-8 support to apps.
       | 
       | [0] https://github.com/nemtrif/utfcpp
       | 
       | [1]
       | https://www.boost.org/doc/libs/1_76_0/libs/locale/doc/html/i...
       | 
       | [2]
       | https://www.boost.org/doc/libs/1_76_0/libs/locale/doc/html/r...
       | 
       | [3] https://docs.microsoft.com/en-
       | us/windows/apps/design/globali...
        
         | midjji wrote:
          | Ah... it seemed like such a beautiful dream for a moment. But
          | yeah, there's no way to fix this without breaking/deprecating
          | ABI/API in C and C++. Which really isn't surprising: char and
          | uchar should always have been byte, and char should not have
          | existed. std::string, and all the string functions in C,
          | should always have carried (or required) explicit encoding
          | information with a default, the streams should always have
          | had encoding defaults for their input and output, and the
          | compiler should always have required a non-locale-based ...
          | and so on. The problem keeps getting worse too: "view" is a
          | generic and powerful concept, but it means there should never
          | have been such a thing as string_view, except possibly as a
          | template specialization for performance reasons, and of
          | course that would wrap the properly encoded string, with its
          | explicit encoding of string literals and so on.
          | 
          | That said, with appropriate new APIs in the standard, and
          | with the compiler requiring sources with an explicitly
          | specified encoding, this could all be solved. The C++ streams
          | are a very useful construct but horribly implemented, so we
          | could just make a new one and then deprecate the broken part
          | of the standard library. It's just that this is never a
          | problem you notice when coding on your own, since it only
          | ever hits projects big enough to be multi-locale.
        
           | nly wrote:
           | Most programmers should never have to worry about
           | manipulating strings intended for human consumption at the
           | character or code-point level. It makes about as much sense
           | as trying to manipulate bytes inside a JPEG when you're
           | writing a typical desktop CRUD application.
           | 
           | Projects big enough to be multi-locale (or actually all
           | projects) should definitely be using format strings (to
           | account for different parameter orders), locale sensitive
           | date, time and currency formats, and a good externalized
           | translation framework... but I still think they should be
           | using UTF-8 throughout, because the encoding you use is
           | completely unrelated to these problems.
           | 
              | C++ in particular is used less and less for human-facing
              | software, and where it is, libraries like Qt are fairly
              | excellent at handling this stuff.
        
             | midjji wrote:
              | UTF-8 mandated everywhere is how I work at the moment,
              | and I am happy the old Latin-... etc. encodings died off.
              | But I think it's fundamentally the wrong way to go in a
              | sense. UTF-8 is too huge for starters, and does a great
              | deal which it probably should not, as it will never do it
              | well. While LaTeX is horrifying in its own right, it
              | remains indisputably superior at writing math compared to
              | UTF-8, and similarly HTML5 for styling. In both cases the
              | extensibility and styling are built in, despite not
              | needing more than perhaps 50 characters to write. It
              | seems to me like the mistake was adding locales in the
              | first place; we have a chance to standardize not
              | language, but meta-language. The user will never see the
              | low-level stuff, and the programmer will mostly not see
              | it unless they click "show low level" in their IDE, and
              | it will always be compressed when it matters, so why even
              | have separate a, A? Why not have a, \\{upper}{a}, shown,
              | and even entered on the keyboard, as a, A? The key
              | difference to UTF would be that the rendering, and which
              | commands exist, is up to the locale, which could be
              | globally stand... Wait, I just made it worse, didn't I...
              | 
              | Qt still leaves you with the problem of string literals
              | possibly changing when moving the code from one locale to
              | another, as the code itself has no defined encoding, so
              | if reparsed it will either look bad in code or bad in
              | output. The ever-present tr() is also rather annoying.
        
             | midjji wrote:
              | Oh, and one warning regarding UTF-8 everywhere: use it,
              | but perhaps don't force it at the filesystem level. I
              | have a file in a ZFS filesystem with utf-8 and normD
              | which, every time I try to delete it, remains; if I
              | copy/move it within the filesystem it doubles in size,
              | and if I try to copy it someplace else, that fails. It
              | has some weird, seemingly Cyrillic filename, too long to
              | be shown, which changes with ZFS version numbers and
              | possibly randomly. I think it was a meme image
              | originally, but it's taking up 135GB of tank space by
              | now. I mostly keep it for fun, hoping ZFS will eventually
              | either declare it broken or fix it so I can see what it
              | was. I'd share it, but the tank is huge, and I never
              | figured out a way to move or read it without dd'ing the
              | entire tank.
        
               | nly wrote:
               | Can you explain this? Aren't file names on unix platforms
               | essentially blobs, sans a few delimiters ('/', '\0',
               | maybe '\')?
        
               | midjji wrote:
               | I really cant. I have no idea how it works or why.
        
               | Filligree wrote:
               | ZFS has a 'utf8only' property which, in principle, lets
               | you constrain all filenames on a dataset to be utf-8...
               | only. It does not otherwise change the API, but it should
               | make the creation of a non-utf8 file into an I/O error.
               | This defaults to off, so by default you're right.
               | 
               | Apparently the GP found a bug in this code. I'd be
               | interested in seeing the github issue.
        
               | midjji wrote:
                | With utf8only you have to specify a way to convert
                | non-UTF-8 names to UTF-8; I think the problem is a bug
                | in the normD converter.
        
               | AnIdiotOnTheNet wrote:
               | IIRC the only character disallowed in EXT2 at least is
               | '/'. Interesting things happen if you manage to get '/'
               | into a file name, which can be done with a hex editor or
               | a broken Windows EXT2 file browser in my case.
        
         | nemetroid wrote:
          | It depends on what your goal is. It's not going to display
          | properly on a Linux terminal, but it will succeed in its
          | promise to "[...] always dump a sequence of UTF-16, Native-
          | Endian, bytes out to your stdout".
        
           | theamk wrote:
            | I think it fails even at that... because that UTF-16 is
            | followed by a default (8-bit char) std::endl, which is
            | definitely not valid UTF-16.
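            | 
            | Something along these lines (a sketch of the failure mode,
            | not the article's exact code):
            | 
            |     #include <iostream>
            |     #include <string>
            | 
            |     int main() {
            |         std::u16string utf16_output = u"hi";
            |         // Dump the UTF-16 code units as raw bytes...
            |         std::cout.write(
            |             reinterpret_cast<const char*>(utf16_output.data()),
            |             utf16_output.size() * sizeof(char16_t));
            |         // ...then std::endl appends a single 8-bit '\n'
            |         // (and flushes), leaving an odd byte in what was
            |         // supposed to be a UTF-16 byte stream.
            |         std::cout << std::endl;
            |     }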
        
       | ivanche wrote:
        | My bold claim: the author never executed the first 2 code
        | snippets in the article.
        | 
        | Proof: _int main(int argv, char* argv[])_, i.e. both parameters
        | are called argv.
        
       | ncmncm wrote:
       | Every single thing JeanHeyd does goes far, far beyond what we are
       | resigned to accept from lesser individuals.
       | 
       | That said, I am in the rowdy "Only UTF-8, ever, and nothing else"
       | gang. Even thinking about UTF-16 gives me hives: _life is too
       | short_. But Shift-JIS is OK.
        
         | forgotmypw17 wrote:
         | There's also ASCII...
        
         | chrismorgan wrote:
         | I like Rust's approach to UTF-8: strings are rigidly UTF-8, so
         | the general goal is to make encoding and decoding things that
         | you sort out at the I/O boundary, and then inside the library
         | you can have sanity.
        
           | ncmncm wrote:
           | Rust deserves kudos for inventing the term "WTF-8",
           | describing e.g. directory entries -- maybe supposed to be
           | UTF-8, but obliged to allow invalid sequences.
        
             | gfody wrote:
              | isn't wtf8 'double utf8', i.e. the typical utf8 sequence
              | escape-encoded in utf8 again, etc.?
              | 
              | "wobble transform 8-bit" seems like an unnecessary
              | hijacking of a well-labeled error state
        
               | chrismorgan wrote:
               | https://simonsapin.github.io/wtf-8/ (which I think you've
               | found).
               | 
                | People had _occasionally_ used the label before for
                | mojibake of various kinds, but the term was never
                | popular under that meaning. Simon's meaning is now
                | vastly more popular.
        
               | gfody wrote:
               | til, also found this where he apologizes for hijacking it
               | https://news.ycombinator.com/item?id=9613971
        
           | lmm wrote:
            | You cannot have sanity while handling international text
            | solely as UTF-8 byte sequences (or any other encoding that
            | treats identical sequences of Unicode codepoints as always
            | equivalent). Sooner or later you will have to deal with
            | text that contains both Chinese and Japanese characters,
            | so you will need a richer representation.
        
             | chrismorgan wrote:
             | I presume you're talking about Han unification?
             | 
             | It's true that sometimes you may need language annotations,
             | which will sometimes also need to be applied to substrings.
             | I don't think that invalidates my claim that rigid UTF-8
             | allows you to have sanity, though I will tweak it to state
             | that using UTF-8 is a necessary but not always sufficient
             | condition.
        
               | lmm wrote:
               | > It's true that sometimes you may need language
               | annotations, which will sometimes also need to be applied
               | to substrings. I don't think that invalidates my claim
               | that rigid UTF-8 allows you to have sanity, though I will
               | tweak it to state that using UTF-8 is a necessary but not
               | always sufficient condition.
               | 
               | Given that context, I don't see that UTF-8 actually helps
               | much. Your fundamental structure has to look something
               | like a rope of (byte sequence+annotation) entries. With
               | that structure, using different encodings for different
               | segments doesn't make things noticeably worse.
        
               | TorKlingberg wrote:
               | I'd say having your Chinese and Japanese text in
               | different encodings would make it worse. In a markup
               | language the annotations can be inline. HTML has e.g.
               | <span lang="ja">, which seems to work well enough.
        
               | AnIdiotOnTheNet wrote:
               | Someday programmers will have learned their lesson about
               | in-band signaling, but apparently it won't be today.
        
             | kzrdude wrote:
             | Has there been any attempt to solve this in Unicode, in-
             | band? Let's say there was a control char for "this is
             | chinese, start", "this is chinese, end" etc.
        
               | ncmncm wrote:
               | In-band signaling is always a reliable route to
               | (typically slow-motion) disaster.
        
               | lmm wrote:
               | There's an alternate set of codepoints, but existing
               | software will "convert" SJIS into unicode in an
               | information-destroying way so you have problems like
               | "user enters a search string, software uses Chinese
               | codepoints for that search string, doesn't find the
               | matching phrase in the document".
               | 
               | Stateful control characters make unicode mostly pointless
               | - the whole point is to be self-synchronizing and have a
               | universal representation for each character. (Granted
               | emoji are busy destroying that already).
        
               | chrismorgan wrote:
               | Unicode 3.1 introduced U+E0000-U+E007F for in-band
               | language tagging, using what's now called BCP 47 language
               | tags. (https://www.unicode.org/reports/tr27/tr27-4.html,
               | heading "13.7 Tag Characters (new section)".) Right from
               | their introduction, their use was "strongly discouraged":
               | they're designed for use with special protocols, with
               | out-of band tagging preferred.
               | 
               | In Unicode 5.2, I think, this range was elevated from
               | "strongly discouraged" to "deprecated": https://www.unico
               | de.org/versions/Unicode5.2.0/ch16.pdf#page=....
               | 
               | In-band signalling in Unicode in general is fraught. The
               | Unicode 5.2.0 specification linked goes on to show
               | various of the reasons why this sort of tagging is
               | generally problematic and should not be used in normal
               | text. (And this is why they were strongly discouraged
               | from the start.)
               | 
               | Text direction signalling is another troublesome area of
               | Unicode; there are multiple techniques, some strongly
               | discouraged, and it's a perennial source of interesting
               | security bugs. The only reason direction signalling is
               | supported at all is because it's _needed_. Life would be
               | easier with it gone.
        
           | arthur2e5 wrote:
            | Not a fan of the tone, but ngl the seven-part
            | implementation is pretty nice. Kind of want a Rust version
            | of it.
        
       | siraben wrote:
       | What language is/where is the title image from?
        
         | shadowofneptune wrote:
         | The glyphs, ink, and spacing look very similar to the Zodiac
         | Killer's famous cipher, but I cannot find any perfect matches.
        
           | magnio wrote:
           | It is indeed the Zodiac cipher, upside down:
           | https://www.pexels.com/photo/photo-of-cryptic-character-
           | code...
        
       | forgotmypw17 wrote:
        | Thank you so much. I'm mostly using Perl, but I can relate to
        | the problem. I'm working on implementing ASCII-only and ANSI-
        | only modes in my static HTML generator, and it's far trickier
        | than I imagined, even with Perl.
        | 
        | (My reasons for doing so are backwards compatibility and
        | lowering the attack surface.)
        
       | imron wrote:
       | utf16_output.size() * sizeof(char16_t)
       | 
       | Keep on keeping on, c++.
        
       | SloopJon wrote:
       | I wrote some Unicode generators for RapidCheck that worked
       | perfectly well with GCC and Clang on Unix, but not at all with
       | Visual C++ on Windows. The idea was to generate arbitrary code
       | points within various ranges (ASCII, Latin-1, BMP, etc. for
       | shrinking) as UTF-32 / UCS-4 in a u32string, then convert to the
       | datatype and encoding for the API under test--UTF-8 in a string,
       | UTF-16 in a u16string, UTF-something in a wstring, etc. The
       | problem was, some of the conversion facets just aren't supported
       | in the Visual C++ runtime library. I think I ended up doing the
       | UTF-32 to UTF-16 conversion myself.
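        | 
        | For what it's worth, the hand-rolled UTF-32 to UTF-16 step is
        | small; it's the usual surrogate-pair arithmetic (sketch, with
        | no validation of lone surrogates or out-of-range code points):
        | 
        |     #include <string>
        | 
        |     std::u16string utf32_to_utf16(const std::u32string& in) {
        |         std::u16string out;
        |         out.reserve(in.size());
        |         for (char32_t cp : in) {
        |             if (cp <= 0xFFFF) {
        |                 // BMP code points pass through unchanged.
        |                 out.push_back(static_cast<char16_t>(cp));
        |             } else {
        |                 // Split supplementary code points into a
        |                 // high/low surrogate pair.
        |                 cp -= 0x10000;
        |                 out.push_back(static_cast<char16_t>(
        |                     0xD800 + (cp >> 10)));
        |                 out.push_back(static_cast<char16_t>(
        |                     0xDC00 + (cp & 0x3FF)));
        |             }
        |         }
        |         return out;
        |     }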
       | 
       | The other thing I ran into recently, is that u"" string literals
       | that worked on GCC / Linux, Clang / Mac, and VC++ 2017 / Windows
       | 2012 did not work for a colleague on VC++ 2019 / Windows 10. An
       | emoji, for example, came out as four char16_t code units (one for
       | each byte of the UTF-8 encoding, I think), instead of two. We
       | ended up using Unicode escapes instead, although the source code
       | is less colorful without a pile of poo.
       | 
       | This ztd.text library looks interesting, although it's a little
       | discouraging that the getting started section of the
       | documentation is empty. Is this a header-only library?
        
       | Kranar wrote:
       | >In other words, this snippet of code will do exactly what you
       | expect it to without a single surprise:
       | 
        | That reinterpret_cast from char* to char8_t* is undefined
        | behavior:
        | 
        |     std::u8string_view utf8_input(
        |         reinterpret_cast<const char8_t*>(argv[1]));
        | 
        | This is not just pedantry either; it was purposely designed
        | this way to allow compilers to better optimize code without
        | worrying about aliasing issues.
       | 
       | Link to the paper which explicitly calls out that char8_t be
       | defined as a new type that does not alias with any other type
       | (hence making that reinterpret cast undefined behavior):
       | 
       | http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p048...
        
         | arc-in-space wrote:
         | This seems wrong. Regardless of what char8_t is, doesn't char*
         | have a spec-given right to read it?
         | 
         | Defining char8_t as a new type is specifically to avoid
         | unnecessarily granting it these all-aliasing powers, but you
         | can still read it as bytes.
        
           | Kranar wrote:
            | Yes, but as I mentioned earlier, only char* has the right
            | to read it; in the snippet I posted, it's a char8_t* doing
            | the reading.
        
         | moonchild wrote:
         | char8_t* is allowed to alias char* not because of any special
         | property of char8_t, but because _char_ is allowed to alias any
         | other type (including char8_t).
        
           | Kranar wrote:
            | As you said, char* can alias any other type, but that does
            | not allow any other type to alias char*.
            | 
            |     reinterpret_cast<char*>(T*); // Perfectly fine.
            |     reinterpret_cast<T*>(char*); // Undefined behavior.
            | 
            | Otherwise it would be trivial for any type to alias any
            | other type:
            | 
            |     reinterpret_cast<T*>(reinterpret_cast<char*>(U*));
        
         | 10000truths wrote:
         | `-fno-strict-aliasing` to the rescue!
        
         | gpderetta wrote:
          | Pedantically, the cast itself is not UB. Dereferencing a
          | pointer which isn't compatible with the dynamic type of the
          | underlying data would be UB, but in this case the actual data
          | is coming from outside the process (usually from the OS) and
          | wasn't even necessarily written by C++, so talking about the
          | type of the underlying data is really not well defined (this
          | is similar to reinterpret-casting the data coming from read()
          | to whatever structure represents the layout of the data).
          | 
          | Accessing the data using two different types would be
          | problematic, except that if the other type is char, it is
          | still fine, as char is allowed to alias anything.
          | 
          | So, tl;dr: as the programmer you can posit that the
          | underlying data is indeed char8_t, and the reinterpret_cast
          | is valid. You can also read it as char and would still be
          | safe.
        
           | Kranar wrote:
            | Yes, you are correct. That said, the dereferencing happens
            | in std::u8string_view's constructor, which scans for the
            | null terminator to compute the size.
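            | 
            | One conservative way to sidestep the whole question
            | (sketch): copy the bytes into a char8_t-owning string
            | instead of viewing the original storage through a
            | reinterpreted pointer.
            | 
            |     #include <cstring>
            |     #include <string>
            | 
            |     int main(int argc, char* argv[]) {
            |         if (argc < 2) return 1;
            |         // Owning copy: each char is converted to char8_t
            |         // on the way in, so there is no reinterpret_cast
            |         // (and no aliasing argument) at all. Costs one
            |         // allocation.
            |         std::u8string utf8_input(
            |             argv[1], argv[1] + std::strlen(argv[1]));
            |         (void)utf8_input;
            |     }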
        
       | ezoe wrote:
       | Well, good luck. I lost all hope and trust in C++ Standard
       | committee. I gave up.
        
       ___________________________________________________________________
       (page generated 2021-07-01 23:01 UTC)