[HN Gopher] The Wonderfully Terrible World of C and C++ Text Enc...
___________________________________________________________________
The Wonderfully Terrible World of C and C++ Text Encoding APIs
(With Some Rust)
Author : codewiz
Score : 14 points
Date : 2022-10-15 21:27 UTC (1 hours ago)
(HTM) web link (thephd.dev)
(TXT) w3m dump (thephd.dev)
| int_19h wrote:
| It's kind of amazing that something as basic as character
| encodings - at least the basics like UTF-16 - UTF-8 - stdio-
| encoding! - is something that's still not in the C++ standard
| library. For a while there was codecvt_utf8 et al, but that was
| deprecated 5 years ago in C++17 with no replacement "to clear the
| path for the future" (https://www.open-
| std.org/jtc1/sc22/wg21/docs/papers/2017/p06...), yet no
| replacement came in C++20, and none are planned for C++23.
| lultimouomo wrote:
| I feel your pain. Last week I just gave up and wrote my own UTF-8
| to UTF-32 conversion routine. It took me far less time to do that
| than I had spent looking for a standard solution.
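For reference, a hand-written UTF-8 to UTF-32 routine of the kind
described above can be quite short. The following is only a sketch,
not lultimouomo's actual code: the function name utf8_to_utf32 and the
choice to return std::nullopt on malformed input (rather than emitting
U+FFFD) are assumptions made for illustration. It requires C++17.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: decode UTF-8 into UTF-32 code points.
// Returns std::nullopt on any malformed input (bad lead byte, bad or
// missing continuation byte, overlong form, surrogate, or > U+10FFFF).
std::optional<std::u32string> utf8_to_utf32(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;        // code point being assembled
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t min = 0;       // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        extra = 0; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; min = 0x10000; }
        else return std::nullopt;     // stray continuation or invalid lead

        if (in.size() - i <= extra) return std::nullopt;  // truncated

        for (std::size_t j = 1; j <= extra; ++j) {
            const unsigned char b = static_cast<unsigned char>(in[i + j]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong encodings, surrogates, and out-of-range values.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return std::nullopt;

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Calling utf8_to_utf32("A\xC3\xA9") would yield the two code points
U+0041 and U+00E9. Whether to fail hard or substitute U+FFFD on bad
input is a design choice this sketch does not try to settle.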
| kevin_thibedeau wrote:
| Unicode support requires incorporating their database into a
| library. At a minimum you need to know which code points are
| combining chars. For a language with five- to ten-year update
| cycles, should everyone be stuck with outdated data if the
| Unicode standard is revised in the interim?
| arka2147483647 wrote:
| All operating systems have the Unicode database saved
| somewhere. There should just be a standard way of accessing
| it, just like the filesystem.
|
| Edit: that is, it does not have to be linked into the standard
| lib. It can be a data file somewhere, or a shared lib.
| duskwuff wrote:
| There's precedent for this, too: time zones! Time zone data
| can change over time, and as such it's typically stored in
| system files and loaded at runtime, rather than being
| embedded in executables.
|
| Locales have some similar behavior as well.
| Ferrotin wrote:
| Conversion between encodings doesn't require a database or
| knowledge of combining characters.
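To illustrate Ferrotin's point, converting between Unicode encoding
forms is pure arithmetic on code point values. The sketch below
converts UTF-32 to UTF-16 without consulting any character database.
The function name utf32_to_utf16 and the std::nullopt error convention
are assumptions for illustration, and C++17 is assumed.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: encode UTF-32 code points as UTF-16.
// Returns std::nullopt if the input contains a surrogate value or
// anything above U+10FFFF, since neither is a valid scalar value.
std::optional<std::u16string> utf32_to_utf16(std::u32string_view in) {
    std::u16string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return std::nullopt;
        if (cp < 0x10000) {
            // Basic Multilingual Plane: one 16-bit code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: split the 20-bit offset into a
            // high (lead) and low (trail) surrogate.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

For example, U+1F600 becomes the surrogate pair 0xD83D 0xDE00. Nothing
here needs to know what any code point means, only its numeric value.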
| poorlyknit wrote:
| This. But it strengthens the argument that programming
| environments should just come with some sort of support for
| the most common encoding forms.
| kevin_thibedeau wrote:
| A library that only extracts code points will do more
| damage than not having one at all. If you have to decode
| Unicode you presumably want to parse it some of the time.
| Not supporting string processing over graphemes made of several
| code points leads to broken Unicode "support" that
| doesn't actually work with all valid Unicode.
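A small sketch of the gap kevin_thibedeau describes: the same
user-perceived character can be one code point or two, so counting
code points alone does not count graphemes. The helper name
count_code_points is hypothetical, and it assumes its input is already
valid UTF-8 (it does no validation). Real grapheme segmentation needs
the Unicode data tables, which this deliberately does not attempt.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: count code points in a string that is assumed
// to be valid UTF-8, by counting every byte that is not a
// continuation byte (10xxxxxx).
std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (char c : utf8) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

With this, "\xC3\xA9" (precomposed U+00E9) counts as one code point,
while "e\xCC\x81" (U+0065 followed by combining U+0301) counts as two,
even though both render as the same single grapheme.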
___________________________________________________________________
(page generated 2022-10-15 23:00 UTC)