[HN Gopher] Any Encoding, Ever - ztd.text and Unicode for C++
___________________________________________________________________
Any Encoding, Ever - ztd.text and Unicode for C++
Author : hasheddan
Score : 85 points
Date : 2021-07-01 01:24 UTC (21 hours ago)
(HTM) web link (thephd.dev)
(TXT) w3m dump (thephd.dev)
| theamk wrote:
| There was a lot of discussion with invisible opponents in that
| text: plenty of counter-arguments, but only a hint of what this
| is replying to.
|
| Other than that, it seems to present a C++ take on an encoding-
| conversion library (like iconv), but instead of handling
| encodings dynamically it uses C++ types. So it's not very good
| for any place with user-specified encodings (like a browser,
| database, or text editor), but can be used when the encoding is
| known at compile time?
|
| I can see a few uses of it, but it seems kinda niche for all that
| grandiose talk about "liberation of the encoding hell".
| aidenn0 wrote:
| ztd::text::transcode takes an encoding type, as shown in the
| second example.
|
| In any event, a single switch statement turns a compile-time
| encoding into a run-time encoding.
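| Something like this sketch, say. The enum, the helper, and the
| exact encoding object names are assumptions; only the idea of
| ztd::text::transcode taking encoding objects comes from the
| article's examples, so treat this as pseudocode for the
| dispatch pattern:
|
|     #include <ztd/text.hpp>
|     #include <string>
|     #include <string_view>
|
|     // Hypothetical dispatch from a run-time encoding tag onto
|     // compile-time encoding types.
|     enum class input_encoding { utf8, shift_jis };
|
|     std::u8string to_utf8(std::string_view bytes, input_encoding which) {
|         if (which == input_encoding::shift_jis)
|             return ztd::text::transcode(bytes,
|                 ztd::text::shift_jis{}, ztd::text::utf8{});
|         return ztd::text::transcode(bytes,
|             ztd::text::compat_utf8{}, ztd::text::utf8{});
|     }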
| lifthrasiir wrote:
| I think the presented design is almost okay, even with the
| complexity of legacy encodings, but one thing raised my eyebrows:
|
| > If your type always has a replacement character, regardless of
| the situation, it can signal this by writing one of two
| functions: replacement_code_units() (for any failed encode step),
| replacement_code_points() (for any failed decode step).
|
| They (and their cousins maybe_replacement_code_units/points) do
| not accept any argument, and that's wrong. There are stateful
| encodings like ISO/IEC 2022 and HZ where you might need the
| current encoder state to select a correct replacement character.
| For example, `?` might not be representable in the current state
| and needs a switch sequence. Fortunately, in 2022 you can always
| switch to the desired character set with a fixed sequence, so
| you can have a fixed replacement sequence there, but you can't
| do so in HZ. It is technically possible to force-switch to the
| default state when an error occurs, but not all users will want
| this behavior. It should be obvious to anyone that stateful
| encodings are in general a PITA.
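| To illustrate, a state-aware replacement hook would need a
| signature more like the following (invented for illustration;
| this is not the library's interface):
|
|     #include <string_view>
|
|     // HZ (RFC 1843) toggles between ASCII and GB modes with "~{"
|     // and "~}". A plain '?' replacement is only valid in ASCII
|     // mode, so the right replacement depends on encoder state.
|     struct hz_state { bool gb_mode = false; };
|
|     std::string_view replacement_code_units(hz_state& state) {
|         if (state.gb_mode) {
|             state.gb_mode = false; // the "~}" drops back to ASCII
|             return "~}?";
|         }
|         return "?";
|     }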
| flohofwoe wrote:
| I'd like to see a bit more explanation of why the "UTF-8
| Everywhere" (https://utf8everywhere.org/) advice is poor.
|
| Outside of Windows APIs, UTF-16 is pretty much irrelevant (yes,
| there are a few popular languages around which jumped on the
| Unicode bandwagon too early and are now stuck with UTF-16, but
| those languages usually also offer simple conversion functions
| to and from other encodings, like UTF-8).
|
| UTF-32 has its place as internal runtime string representation,
| but not for data exchange.
|
| UTF-8 has won, and for all the right reasons. UTF-16 is a
| backward compatibility hack for legacy APIs, and UTF-32 is a
| special-case encoding for easier runtime-manipulation of string
| data.
| lifthrasiir wrote:
| The author claims that the UTF-8 Everywhere manifesto makes
| people think that they don't need to implement anything but
| UTF-8 (which is false and not even what the manifesto actually
| says). I think that the author's claim itself is also false.
| flohofwoe wrote:
| Btw, I'm almost certain that the first code example, which
| expects the command line arguments as UTF-8 strings, won't work
| on Windows; instead the argument strings will be 8-bit encoded
| in the current system (ANSI) code page. The only way I got this
| working reliably with the standard main() entry point was to
| call the Win32 function GetCommandLineW() and then split into
| args and convert to UTF-8 myself.
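| For reference, that workaround looks roughly like this (a
| minimal sketch with error handling omitted; the utf8_args
| wrapper name is mine):
|
|     #include <windows.h>
|     #include <shellapi.h>
|     #include <string>
|     #include <vector>
|
|     std::vector<std::string> utf8_args() {
|         int argc = 0;
|         // Re-split the wide command line, bypassing main()'s argv.
|         LPWSTR* wargv = CommandLineToArgvW(GetCommandLineW(), &argc);
|         std::vector<std::string> args;
|         for (int i = 0; i < argc; ++i) {
|             int n = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
|                                         nullptr, 0, nullptr, nullptr);
|             std::string s(n, '\0'); // n includes the terminating NUL
|             WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
|                                 s.data(), n, nullptr, nullptr);
|             s.pop_back();           // drop the embedded NUL
|             args.push_back(std::move(s));
|         }
|         LocalFree(wargv);
|         return args;
|     }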
| flohofwoe wrote:
| PS: found some example code in the github project which seems
| to take care of the special Windows situation:
|
| https://github.com/soasis/text/blob/main/examples/documentat.
| ..
|
| Unfortunately this doesn't look so simple anymore.
| nly wrote:
| This code looks like a disaster to me.
|
| The issue on Windows is simply that code pages are a per-
| application setting (or state) and not a system setting.
| Programs are free to run with, and change between, any
| "active code page" they want and spit bytes out in that
| code page.
|
| This is all fine and dandy when you're talking to a -A Windows
| API, since Windows knows the ACP of your application, but it's
| a disaster for I/O and for making arbitrary programs talk to
| one another.
|
| Nothing you can do in your code can fix this, it's a
| contract you have to have with the outside world.
|
| On Linux, calling setlocale() with a non-UTF-8 locale might not
| even work, because your /etc/locale.gen file will probably only
| have a couple of enabled entries, and doing so would be mostly
| pointless because your terminal and all your other programs are
| likely using UTF-8.
|
| The downside to the Linux approach is that if somebody does
| actually send you an ISO-8859-1 encoded file, you will have to
| do some conversion.
|
|     #include <boost/locale.hpp>
|     #include <cstdint>
|     #include <fstream>
|     #include <iostream>
|     #include <string>
|
|     int main(int argc, char** argv) {
|         std::ifstream ifs("test.txt");
|         std::string str;
|         boost::locale::generator gen; // No dependency on what's in /etc/locale.gen
|         auto loc = gen("en_US.ISO-8859-1");
|         ifs.imbue(loc); // Should make formatted input functions work
|         std::getline(ifs, str);
|         str = boost::locale::conv::to_utf<char>(str, loc);
|         for (auto wc : str) {
|             std::cout << sizeof(wc) << ": "
|                       << static_cast<uint64_t>(static_cast<unsigned char>(wc))
|                       << "\n";
|         }
|     }
| MaxBarraclough wrote:
| > there are a few popular languages around which jumped on the
| UNICODE bandwagon too early and are now stuck with UTF-16, but
| those languages usually also offer simple conversion functions
|
| Java does this, right?
| x4e wrote:
| Java uses UTF-16 for internally storing strings; however, all
| its APIs use the platform default charset, which is usually
| UTF-8. It also uses modified UTF-8 for storing strings inside
| compiled classes.
| Animats wrote:
| If you use Go, Rust, Python 3, or Javascript, this is a non-
| problem, because those are all native UTF-8. Only for legacy
| languages is this still a problem.
| layoutIfNeeded wrote:
| False. JavaScript uses UTF16:
| https://mathiasbynens.be/notes/javascript-encoding
| jcelerier wrote:
| sure fam
| https://stackoverflow.com/questions/12053168/how-to-properly-output-a-string-in-a-windows-console-with-go
| https://github.com/intellij-rust/intellij-rust/issues/766
| https://bugs.python.org/issue44275
|
| etc etc
| chrismorgan wrote:
| Quite apart from the fact that you're talking at cross-purposes
| to the article, you're wrong about three of the four languages
| you mention.
|
| Go strings aren't actually UTF-8; they aren't even Unicode:
| they can contain arbitrary bytes. In practice, _most_ strings
| will be UTF-8, but you can't depend on it at all.
|
| Rust strings are strictly UTF-8. I appreciate this greatly.
|
| Python 3 strings are sequences of Unicode code points (as
| distinct from Unicode scalar values), allowing ill-formed
| Unicode (unmatched surrogates), and its internal representation
| is a disaster because they decided indexing by code point was
| worthwhile (it's not, given the costs), so since CPython 3.3
| strings are encoded as ISO-8859-1 (only able to represent
| U+0000 to U+00FF), UCS-2 (further able to represent U+0100 to
| U+FFFF) or UCS-4 (able to represent all Unicode code points).
|
| JavaScript strings allow ill-formed Unicode, and the public
| representation can be described as either UCS-2 or ill-formed
| UTF-16 (almost all string access techniques work in UTF-16 code
| units, though new stuff is now mostly working in Unicode and
| UTF-8 terms). I think all major engines now use an internal
| representation of ISO-8859-1 if the string contains only U+0000
| to U+00FF, or ill-formed UTF-16 for anything else; but Servo's
| work has demonstrated that it's possible to shift to WTF-8
| (UTF-8 plus allowing unmatched surrogates), saving lots of
| memory on some pages and simplifying and slightly speeding up
| some things, but at the cost of random access performance by
| code unit index.
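| For the curious, WTF-8 is just generalized UTF-8: the ordinary
| UTF-8 byte layout with the surrogate range allowed through. A
| minimal sketch of encoding one code point:
|
|     #include <string>
|
|     // Generalized UTF-8: identical to UTF-8 except that lone
|     // surrogates (U+D800-U+DFFF) are not rejected. A lone
|     // surrogate like U+DEAD comes out as the bytes ED BA AD.
|     std::string wtf8_encode(char32_t cp) {
|         std::string out;
|         if (cp < 0x80) {
|             out += static_cast<char>(cp);
|         } else if (cp < 0x800) {
|             out += static_cast<char>(0xC0 | (cp >> 6));
|             out += static_cast<char>(0x80 | (cp & 0x3F));
|         } else if (cp < 0x10000) { // surrogates deliberately allowed
|             out += static_cast<char>(0xE0 | (cp >> 12));
|             out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
|             out += static_cast<char>(0x80 | (cp & 0x3F));
|         } else {
|             out += static_cast<char>(0xF0 | (cp >> 18));
|             out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
|             out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
|             out += static_cast<char>(0x80 | (cp & 0x3F));
|         }
|         return out;
|     }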
| Animats wrote:
| Right, Python 3 looks like it's UTF-8, but CPython has that
| weird 1, 2 or 4 byte representation. PyPy, though, finally
| went UTF-8. Not sure how they do indexing.
|
| Go strings are supposed to be UTF-8, but it's not enforced.
|
| Javascript - bleah.
| chrismorgan wrote:
| > _Python 3 looks like it 's UTF-8_
|
| No, it never looked like it was UTF-8. It did look like it
| was _Unicode_ (of an unspecified encoding), but Python
| allows unpaired surrogates, which makes for ill-formed
| Unicode. Python lets you have '\udead'. Well-formed
| encodings of Unicode (such as UTF-8, UTF-16 and UTF-32)
| cannot represent U+DEAD.
|
| There's a big difference between "Unicode" and "UTF-8".
|
| > _PyPy, though, finally went UTF-8._
|
| Oh cool! I remember vaguely hearing the idea being mulled
| over, years ago. Good to see it happened:
| https://morepypy.blogspot.com/2019/03/pypy-v71-released-now-...
|
| Wonder what they've done about surrogates; the only way
| they can truly have switched to UTF-8 internally is if
| they've broken compatibility, which I doubt they've done;
| I'm guessing that they actually use WTF-8, not UTF-8.
|
| (In the meantime, I've changed "Python 3.3" in the
| grandparent comment to "CPython 3.3" for accuracy.)
| samatman wrote:
| Would have been better to read the article, because then you'd
| be in a position to explain how those languages handle
| transcoding from Shift-JIS to UTF-8.
| gameman144 wrote:
| This is explicitly called out in the article: non-UTF-8
| encodings still need to be handled _somehow_ , and this library
| makes those just as easy to handle as UTF-8.
| 1wd wrote:
| Python 3 by default still opens text files using the legacy
| locale encoding.
| nly wrote:
| Why are the examples for ztd.text writing UTF-16 to std::cout?
| Won't this fail on Linux where UTF-16 is rarely used and
| terminals typically default to UTF-8?
|
| Personally I've enjoyed using the tiny-weeny header only
| utfcpp[0] for simple manipulation of UTF-8 and unicode-to-unicode
| conversions, and typically use Boost.Locale[1] when I need to do
| I/O.
|
| 95% of localization problems have almost nothing to do with
| encoding. Encodings are boring. It's like arguing over JSON vs
| YAML. To do I/O properly across platforms you _may_ need
| wrappers for the standard output streams that will handle
| conversion for you, sure, but... you also need to handle date,
| time, and currency formatting/parsing, message formatting, and
| translations.
|
| See [2] regarding Windows:
|
| > All of the examples that come with Boost.Locale are designed
| for UTF-8 and it is the default encoding used by Boost.Locale.
|
| Personally I think doing I/O as UTF-8 on Windows is the right
| direction, as Microsoft have been enhancing UTF-8 support in
| Windows for quite a while now.
|
| See[3]:
|
| > Until recently, Windows has emphasized "Unicode" -W variants
| over -A APIs. However, recent releases have used the ANSI code
| page and -A APIs as a means to introduce UTF-8 support to apps.
|
| [0] https://github.com/nemtrif/utfcpp
|
| [1]
| https://www.boost.org/doc/libs/1_76_0/libs/locale/doc/html/i...
|
| [2]
| https://www.boost.org/doc/libs/1_76_0/libs/locale/doc/html/r...
|
| [3] https://docs.microsoft.com/en-
| us/windows/apps/design/globali...
| midjji wrote:
| Ah ... it seemed like such a beautiful dream for a moment. But
| yeah, no way to fix this without breaking/deprecating abi /api
| in C,C++. Which really isn't surprising, char and uchar should
| always have been byte, char should not have existed.
| std::string, and all string functions in C should always have
| contained/ required explicit if default encoding information,
| and the streams should always have had encoding defaults for
| its input and output. While the compiler should always have
| required a non locale based ... and so on. The problem keeps
| getting worse too, view is a generic and powerful meta, but it
| means there should never have been such a thing as string_view,
| except possibly as a template specialization for performance
| reasons, and of course that would wrapp the proper encoded
| string, with its explicit encoding of string litteral and so
| on.
|
| That said, with appropriate new api in the standard and with
| the compiler explicitly requiring explicit encoding specified
| sources, this could all be solved. The c++ streams are a very
| useful construct but horribly implemented, so we could just
| make a new one, and then deprecate the broken part of the
| standard lib. Its just, this is never a problem you realize
| when coding on your own, since it only ever hits projects big
| enough to be multilocal.
| nly wrote:
| Most programmers should never have to worry about
| manipulating strings intended for human consumption at the
| character or code-point level. It makes about as much sense
| as trying to manipulate bytes inside a JPEG when you're
| writing a typical desktop CRUD application.
|
| Projects big enough to be multi-locale (or actually all
| projects) should definitely be using format strings (to
| account for different parameter orders), locale sensitive
| date, time and currency formats, and a good externalized
| translation framework... but I still think they should be
| using UTF-8 throughout, because the encoding you use is
| completely unrelated to these problems.
|
| C++ in particular is used less and less for human-facing
| software, and where it is, libraries like Qt are fairly
| excellent at handling this stuff.
| midjji wrote:
| UTF-8 mandated everywhere is how I work atm, and I am happy
| the old latin... etc died off. But I think its
| fundamentally the wrong way to go in a sense. UTF-8 is too
| huge for starters, and does a great deal which it probably
| should not as it will never do it well. While latex is
| horrifying in its own right, it remains indisputably
| superior at writing math compared to UTF-8 and similar for
| html5 with styling. In both cases the extensibility and
| styling is built in despite not needing more than perhaps
| 50 characters to write. It seems to me like the mistake was
| adding locales in the first place, we have a chance to
| standardize not language, but meta language. The user will
| never see the low level stuff, and the programmer will
| mostly not see it unless they click show low level in their
| ide, and it will always be compressed when it matters, so
| why even have separate a,A. Why not have a, \\{upper}{a}.
| shown, and even entered in the keyboard as a,A. The key
| difference to utf would be that the rendering and which
| commands exists is up to the locale which could be globally
| stand... Wait I just made it worse didnt I...
|
| Qt still leaves you with the problem of string litterals
| possibly changing when moving the code from one locale to
| another as the code itself is not of defined encoding, so
| if reparsed it will either look bad in code, or bad in
| output. The everpresent tr is also rather annoying.
| midjji wrote:
| Oh and one warning regarding UTF-8 everywhere, use it but
| perhaps dont force it on a filesystem level. I have a file
| in a zfs filesystem with utf-8 and normD which everytime I
| try to delete it remains, and if I copy/move it in the
| filesystem it duplicates in size and if I try to copy it to
| someplace else, that fails. It has some wierd seemingly
| cryllic filename too long to be shown and changing with zfs
| version numbers and possibly randomly. I think it was a
| meme image originally, but its taking up 135GB of tank
| space by now. I mostly keep it for fun, hoping zfs will
| eventually either declare it broken or fix it so I can see
| what it was. I'd share it but the tank is huge, and I never
| figured out a way to move or read it without dd ing the
| entire tank.
| nly wrote:
| Can you explain this? Aren't file names on unix platforms
| essentially blobs, sans a few delimiters ('/', '\0',
| maybe '\')?
| midjji wrote:
| I really cant. I have no idea how it works or why.
| Filligree wrote:
| ZFS has a 'utf8only' property which, in principle, lets
| you constrain all filenames on a dataset to be utf-8...
| only. It does not otherwise change the API, but it should
| make the creation of a non-utf8 file into an I/O error.
| This defaults to off, so by default you're right.
|
| Apparently the GP found a bug in this code. I'd be
| interested in seeing the github issue.
| midjji wrote:
| With utf8only you have to specify a way to convert non
| utf8 to utf8, I think the problem is a bug in the normD
| converter.
| AnIdiotOnTheNet wrote:
| IIRC the only character disallowed in EXT2 at least is
| '/'. Interesting things happen if you manage to get '/'
| into a file name, which can be done with a hex editor or
| a broken Windows EXT2 file browser in my case.
| nemetroid wrote:
| It depends on what your goal is. It's not going to display
| properly on a Linux terminal, but it will succeed in "[...]
| always dump[ing] a sequence of UTF-16, Native-Endian, bytes out
| to your stdout".
| theamk wrote:
| I think it fails even at that... because that UTF-16 is
| followed by default (8 bit char) std::endl, which is
| definitely not a valid UTF-16.
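| To spell out the mismatch (a sketch, not the article's code):
|
|     #include <iostream>
|     #include <string>
|
|     int main() {
|         // The UTF-16 payload goes out as raw bytes, but std::endl
|         // then emits a single 8-bit '\n': one byte, where valid
|         // UTF-16LE would need the two bytes 0A 00.
|         std::u16string out = u"hi";
|         std::cout.write(reinterpret_cast<const char*>(out.data()),
|                         out.size() * sizeof(char16_t));
|         std::cout << std::endl; // not a UTF-16 code unit
|     }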
| ivanche wrote:
| My bold claim: the author never actually executed the first two
| code snippets in the article.
|
| Proof: _int main (int argv, char* argv[])_ , i.e. both parameters
| are called argv.
| ncmncm wrote:
| Every single thing JeanHeyd does goes far, far beyond what we are
| resigned to accept from lesser individuals.
|
| That said, I am in the rowdy "Only UTF-8, ever, and nothing else"
| gang. Even thinking about UTF-16 gives me hives: _life is too
| short_. But Shift-JIS is OK.
| forgotmypw17 wrote:
| There's also ASCII...
| chrismorgan wrote:
| I like Rust's approach to UTF-8: strings are rigidly UTF-8, so
| the general goal is to make encoding and decoding things that
| you sort out at the I/O boundary, and then inside the library
| you can have sanity.
| ncmncm wrote:
| Rust deserves kudos for inventing the term "WTF-8",
| describing e.g. directory entries -- maybe supposed to be
| UTF-8, but obliged to allow invalid sequences.
| gfody wrote:
| isn't wtf8 'double utf8' ie the typical utf8 escape encoded
| in utf8 Afetc?
|
| wobble transform 8bit seems like an unnecessary hijacking
| of a well labeled error state
| chrismorgan wrote:
| https://simonsapin.github.io/wtf-8/ (which I think you've
| found).
|
| People had _occasionally_ used the label before for
| mojibake of various kinds, but the term was never popular
| under that meaning. Simon's work is now a vastly more
| popular meaning.
| gfody wrote:
| til, also found this where he apologizes for hijacking it
| https://news.ycombinator.com/item?id=9613971
| lmm wrote:
| You cannot have sanity while handling international text as
| solely UTF-8 bytesequences (or any other encoding that treats
| sequences of unicode codepoints as always equivalent). Sooner
| or later you will have to deal with text that contains both
| Chinese and Japanese characters, so you will need a richer
| representation.
| chrismorgan wrote:
| I presume you're talking about Han unification?
|
| It's true that sometimes you may need language annotations,
| which will sometimes also need to be applied to substrings.
| I don't think that invalidates my claim that rigid UTF-8
| allows you to have sanity, though I will tweak it to state
| that using UTF-8 is a necessary but not always sufficient
| condition.
| lmm wrote:
| > It's true that sometimes you may need language
| annotations, which will sometimes also need to be applied
| to substrings. I don't think that invalidates my claim
| that rigid UTF-8 allows you to have sanity, though I will
| tweak it to state that using UTF-8 is a necessary but not
| always sufficient condition.
|
| Given that context, I don't see that UTF-8 actually helps
| much. Your fundamental structure has to look something
| like a rope of (byte sequence+annotation) entries. With
| that structure, using different encodings for different
| segments doesn't make things noticeably worse.
| TorKlingberg wrote:
| I'd say having your Chinese and Japanese text in
| different encodings would make it worse. In a markup
| language the annotations can be inline. HTML has e.g.
| <span lang="ja">, which seems to work well enough.
| AnIdiotOnTheNet wrote:
| Someday programmers will have learned their lesson about
| in-band signaling, but apparently it won't be today.
| kzrdude wrote:
| Has there been any attempt to solve this in Unicode, in-
| band? Let's say there was a control char for "this is
| chinese, start", "this is chinese, end" etc.
| ncmncm wrote:
| In-band signaling is always a reliable route to
| (typically slow-motion) disaster.
| lmm wrote:
| There's an alternate set of codepoints, but existing
| software will "convert" SJIS into unicode in an
| information-destroying way so you have problems like
| "user enters a search string, software uses Chinese
| codepoints for that search string, doesn't find the
| matching phrase in the document".
|
| Stateful control characters make unicode mostly pointless
| - the whole point is to be self-synchronizing and have a
| universal representation for each character. (Granted
| emoji are busy destroying that already).
| chrismorgan wrote:
| Unicode 3.1 introduced U+E0000-U+E007F for in-band
| language tagging, using what's now called BCP 47 language
| tags. (https://www.unicode.org/reports/tr27/tr27-4.html,
| heading "13.7 Tag Characters (new section)".) Right from
| their introduction, their use was "strongly discouraged":
| they're designed for use with special protocols, with
| out-of band tagging preferred.
|
| In Unicode 5.2, I think, this range was elevated from
| "strongly discouraged" to "deprecated": https://www.unicode.org/versions/Unicode5.2.0/ch16.pdf#page=...
|
| In-band signalling in Unicode in general is fraught. The
| Unicode 5.2.0 specification linked goes on to show
| various of the reasons why this sort of tagging is
| generally problematic and should not be used in normal
| text. (And this is why they were strongly discouraged
| from the start.)
|
| Text direction signalling is another troublesome area of
| Unicode; there are multiple techniques, some strongly
| discouraged, and it's a perennial source of interesting
| security bugs. The only reason direction signalling is
| supported at all is because it's _needed_. Life would be
| easier with it gone.
| arthur2e5 wrote:
| Not a fan of the tone, but ngl seven-part implementation is
| pretty nice. Kind of wanting a Rust version of it.
| siraben wrote:
| What language is/where is the title image from?
| shadowofneptune wrote:
| The glyphs, ink, and spacing look very similar to the Zodiac
| Killer's famous cipher, but I cannot find any perfect matches.
| magnio wrote:
| It is indeed the Zodiac cipher, upside down:
| https://www.pexels.com/photo/photo-of-cryptic-character-code...
| forgotmypw17 wrote:
| Thank you so much. I'm mostly using Perl, but I can relate to the
| problem. I'm working on implementing ASCII-only and ANSI-only
| modes in my static html generator, and it's far trickier than I
| imagined, even with Perl.
|
| (My reasons for doing so are backwards compatibility support and
| lowering the attack surface.)
| imron wrote:
| utf16_output.size() * sizeof(char16_t)
|
| Keep on keeping on, c++.
| SloopJon wrote:
| I wrote some Unicode generators for RapidCheck that worked
| perfectly well with GCC and Clang on Unix, but not at all with
| Visual C++ on Windows. The idea was to generate arbitrary code
| points within various ranges (ASCII, Latin-1, BMP, etc. for
| shrinking) as UTF-32 / UCS-4 in a u32string, then convert to the
| datatype and encoding for the API under test--UTF-8 in a string,
| UTF-16 in a u16string, UTF-something in a wstring, etc. The
| problem was, some of the conversion facets just aren't supported
| in the Visual C++ runtime library. I think I ended up doing the
| UTF-32 to UTF-16 conversion myself.
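| Doing that conversion by hand is at least mercifully short; a
| sketch, ignoring error handling for lone surrogates and values
| above U+10FFFF:
|
|     #include <string>
|
|     std::u16string utf32_to_utf16(const std::u32string& in) {
|         std::u16string out;
|         for (char32_t c : in) {
|             if (c < 0x10000) {
|                 out.push_back(static_cast<char16_t>(c));
|             } else {
|                 c -= 0x10000;
|                 out.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));   // high surrogate
|                 out.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF))); // low surrogate
|             }
|         }
|         return out;
|     }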
|
| The other thing I ran into recently, is that u"" string literals
| that worked on GCC / Linux, Clang / Mac, and VC++ 2017 / Windows
| 2012 did not work for a colleague on VC++ 2019 / Windows 10. An
| emoji, for example, came out as four char16_t code units (one for
| each byte of the UTF-8 encoding, I think), instead of two. We
| ended up using Unicode escapes instead, although the source code
| is less colorful without a pile of poo.
|
| This ztd.text library looks interesting, although it's a little
| discouraging that the getting started section of the
| documentation is empty. Is this a header-only library?
| Kranar wrote:
| >In other words, this snippet of code will do exactly what you
| expect it to without a single surprise:
|
| That reinterpret_cast from char* to char8_t* is undefined
| behavior:
|
|     std::u8string_view utf8_input(
|         reinterpret_cast<const char8_t*>(argv[1]));
|
| This is not just pedantry either, it was purposely designed this
| way to allow compilers to better optimize code without worrying
| about aliasing issues.
|
| Link to the paper which explicitly calls out that char8_t be
| defined as a new type that does not alias with any other type
| (hence making that reinterpret cast undefined behavior):
|
| http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p048...
| arc-in-space wrote:
| This seems wrong. Regardless of what char8_t is, doesn't char*
| have a spec-given right to read it?
|
| Defining char8_t as a new type is specifically to avoid
| unnecessarily granting it these all-aliasing powers, but you
| can still read it as bytes.
| Kranar wrote:
| Yes, but as I mentioned earlier, only char* has the right to
| read it; in the snippet I posted, it's a char8_t* doing the
| reading.
| moonchild wrote:
| char8_t* is allowed to alias char* not because of any special
| property of char8_t, but because _char_ is allowed to alias any
| other type (including char8_t).
| Kranar wrote:
| As you said, char* can alias any other type, but that does
| not allow any other type to alias char*.
| reinterpret_cast<char*>(T*); // Perfectly fine.
| reinterpret_cast<T*>(char*); // Undefined behavior.
|
| Otherwise it would be trivial for any type to alias any other
| type:
| reinterpret_cast<T*>(reinterpret_cast<char*>(U*));
| 10000truths wrote:
| `-fno-strict-aliasing` to the rescue!
| gpderetta wrote:
| pedantically, the cast itself is not UB. Dereferencing a
| pointer which isn't compatible with the dynamic type of the
| underlying data would be UB, but in this case the actual data
| is coming from outside the process (the OS usually), and wasn't
| even necessarily written in C++, so talking about the type of
| the underlying data is really not well defined (this is similar
| to reinterpret casting the data coming from read() to whatever
| is the structure representing the layout of the data).
|
| Accessing the data using two different types would be
| problematic, except that if the other type is char, it is still
| fine as char is allowed to alias anything.
|
| so tldr; as the programmer you can posit that the underlying
| data is indeed char8_t and the reinterpret cast is valid. You
| can also read it as char and would still be safe.
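| In code, the pattern sketched above is just (argv[1] assumed
| present, as in the article's snippet):
|
|     #include <string_view>
|
|     // The bytes come from the OS, so posit their type once, up
|     // front, and only ever dereference them as char8_t (or as
|     // char, which is allowed to alias anything).
|     int main(int argc, char* argv[]) {
|         const char8_t* p = reinterpret_cast<const char8_t*>(argv[1]);
|         std::u8string_view utf8_input(p);                   // reads as char8_t only
|         const char* raw = reinterpret_cast<const char*>(p); // also fine
|     }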
| Kranar wrote:
| Yes, you are correct. That said, the dereferencing happens in
| std::u8string_view's constructor, which dereferences in search
| of the NUL character to compute the size.
| ezoe wrote:
| Well, good luck. I lost all hope and trust in the C++ Standard
| committee. I gave up.
___________________________________________________________________
(page generated 2021-07-01 23:01 UTC)