[HN Gopher] Corrected UTF-8 (2022)
___________________________________________________________________
Corrected UTF-8 (2022)
Author : RGBCube
Score : 29 points
Date : 2025-07-03 15:29 UTC (3 days ago)
(HTM) web link (www.owlfolio.org)
(TXT) w3m dump (www.owlfolio.org)
| chowells wrote:
| Forbidding \r\n line endings _in the encoding_ just sort of sinks
| the whole idea. The first couple ideas are nice, but then you
| suddenly get normative with what characters are allowed to be
| encoded? That creates a very large initial hurdle to clear to get
| people to use your encoding. Suddenly you need to forbid specific
| texts, instead of just handling everything. Why put such a huge
| footgun in your system when it's not necessary?
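|
| To make that footgun concrete, a minimal sketch in Python of what
| an encoder under this proposal has to do (corrected_utf8_encode is
| a made-up name for illustration):
|
|     def corrected_utf8_encode(s: str) -> bytes:
|         # The proposal forbids CR LF at the encoding level, so the
|         # encoder must reject otherwise ordinary text outright.
|         if "\r\n" in s:
|             raise ValueError("CR LF is not encodable")
|         return s.encode("utf-8")
|
|     corrected_utf8_encode("a,b\r\n1,2\r\n")  # plain CSV export: rejected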
| jmclnx wrote:
| Many things here make sense to me, but as we can all guess, this
| will never become a thing :(
|
| But the "magic number" thing to me is a waste of space. If this
| standard is accepted, if no magic number you have corrected
| UTF-8.
|
| As for \r\n, not a big deal to me. I would like to see it
| forbidden, if only to force Microsoft to use \n like UN*X and
| Apple. I still need to deal with \r\n in files showing up every
| so often.
| _kst_ wrote:
| "If this standard is accepted, if no magic number you have
| corrected UTF-8."
|
| That's true only if "corrected UTF-8" is accepted _and_
| existing UTF-8 becomes obsolete. That can't happen. There's
| too much existing UTF-8 text that will never be translated to
| a newer standard.
| Dwedit wrote:
| Magic numbers do appear a lot in C# programs. The default
| text encoder will output a BOM marker.
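|
| Python's utf-8-sig codec mirrors that behavior, if you want to
| see the actual bytes:
|
|     import codecs
|
|     data = "hi".encode("utf-8-sig")           # codec that writes a BOM
|     print(data)                               # b'\xef\xbb\xbfhi'
|     print(data.startswith(codecs.BOM_UTF8))   # True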
| oleganza wrote:
| A magic prefix (similar to a byte order mark, BOM) also kills the
| idea. The reason for the success of any standard is the ability
| to establish consensus while navigating existing constraints.
| UTF-8 won over codepages and UTF-16/32 by being purely ASCII-
| compatible. A magic prefix kills that compatibility.
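|
| A quick illustration of that compatibility in Python: pure ASCII
| text is byte-for-byte valid UTF-8, and a magic prefix breaks
| exactly that property.
|
|     ascii_bytes = "plain ASCII".encode("ascii")
|     utf8_bytes = "plain ASCII".encode("utf-8")
|     print(ascii_bytes == utf8_bytes)        # True: identical on the wire
|
|     prefixed = b"\xef\xbb\xbf" + utf8_bytes # BOM-style magic prefix
|     # prefixed.decode("ascii") would now raise UnicodeDecodeError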
| moonshadow565 wrote:
| What about encoding it in such a way that we don't need huge
| tables to figure out the category of each code point?
| lifthrasiir wrote:
| It means that you are encoding those categories into the code
| point itself, which is a waste for every single use of the
| character encoding.
| panpog wrote:
| It seems plausible that this could be made efficiently doable
| byte-wise. For example, C3 xx could be made to uppercase to
| C4 xx. Unicode actually does structure its codespace to make
| certain properties easier to compute, but those properties
| are mostly related to legacy encodings, and things are
| designed with UCS-2 or UTF-32 in mind, not UTF-8.
|
| It's also not clear to me that the code point is a good
| abstraction in the design of UTF-8. Usually, what you want is
| either the byte or the grapheme cluster.
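|
| For what it's worth, today's UTF-8 already has a byte-wise case
| relationship in the C3 block, just in the continuation byte
| rather than the lead byte: for U+00C0..U+00DE the case difference
| is one bit, same as ASCII.
|
|     s = "À".encode("utf-8")              # b'\xc3\x80'
|     lowered = bytes([s[0], s[1] | 0x20]) # flip the ASCII case bit
|     print(lowered.decode("utf-8"))       # 'à'
|     # Caveat: the trick fails for U+00D7 (×), U+00F7 (÷) and
|     # U+00DF (ß), so real case mapping still needs tables.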
| karteum wrote:
| > Usually, what you want is either the byte or the grapheme
| cluster.
|
| Exactly! That's what I understood after reading this great
| post: https://tonsky.me/blog/unicode/
|
| _" Even in the widest encoding, UTF-32, [some grapheme]
| will still take three 4-byte units to encode. And it still
| needs to be treated as a single character. If the analogy
| helps, we can think of the Unicode itself (without any
| encodings) as being variable-length."_
|
| I tend to think it's the biggest design decision in Unicode
| (but maybe I just don't fully see the need and use cases
| beyond emoji. Of course I read the section saying it's used
| in actual languages, but the few examples described could
| have been handled with a dedicated 32-bit code point...)
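|
| A quick Python illustration (the family emoji here is just one
| example of such a cluster):
|
|     s = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # family emoji
|     print(len(s))                           # 5 code points
|     print(len(s.encode("utf-32-le")) // 4)  # 5 units even in UTF-32
|     # One user-perceived character, many code points in every encoding.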
| duskwuff wrote:
| 1) Adding offsets to multi-byte sequences breaks compatibility
| with existing UTF-8 text, while generating text which can be
| decoded (incorrectly) as UTF-8. That seems like a non-starter.
| The alleged benefit of "eliminating overlength encodings" seems
| marginal; overlength encodings are already invalid (see the
| sketch below). It also
| significantly increases the complexity of encoders and decoders,
| especially in dealing with discontinuities like the UTF-16
| surrogate "hole".
|
| 2) I really doubt that the current upper limit of U+10_FFFF is
| going to need to be raised. Past growth in the Unicode standard
| has primarily been driven by the addition of more CJK characters;
| that isn't going to continue indefinitely.
|
| 3) Disallowing C0 characters like U+0009 (horizontal tab) is
| absurd, especially at the level of a text encoding.
|
| 4) BOMs are dumb. We learned that lesson in the early 2000s -
| even if they sound great as a way of identifying text encodings,
| they have a nasty way of sneaking into the middle of strings and
| causing havoc. Bringing them back is a terrible idea.
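|
| On point 1, the sketch mentioned above: in standard UTF-8,
| rejecting an overlength (overlong) encoding is already a simple
| range check per sequence length. A minimal two-byte example in
| Python:
|
|     def decode_2byte(b1: int, b2: int) -> int:
|         # 110xxxxx 10yyyyyy -> code point
|         cp = ((b1 & 0x1F) << 6) | (b2 & 0x3F)
|         if cp < 0x80:
|             # e.g. C0 80 for U+0000 must be rejected as overlong
|             raise ValueError("overlong encoding")
|         return cp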
| rini17 wrote:
| Yes, it should be completely incompatible with UTF-8, not just
| partially. As in, anything beyond ASCII should be invalid and
| not decodable as UTF-8.
| lifthrasiir wrote:
| If you do need an expanded code point space,
| https://ucsx.org/ is the definitive answer; it was designed by
| actual Unicode contributors.
| Dwedit wrote:
| I don't expect anyone to adopt this. Listing complaints about a
| heavily used standard and proposing something else incompatible
| won't gain any traction.
|
| Compare to WTF-8, which solves a different problem (representing
| unpaired UTF-16 surrogates within an 8-bit encoding).
| esrauch wrote:
| Yeah, WTF-8 is a very straightforward case: the spec semi-
| artificially says we can't do this one thing, and that prevents
| you from using utf8 under the hood to represent JS and Java
| strings, which allow unpaired utf16 surrogates. In practice,
| utf8-except-this-one-thing is the only way to do an in-memory
| representation in things that want to implement or round-trip
| interop with those.
|
| It's literally the exact opposite of this proposal, in that
| there's an actual concrete problem and a way to make it not a
| problem. This one is a list of weird grievances that aren't
| actually problems for anyone, like the max code point number.
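|
| Python's "surrogatepass" error handler is essentially this in
| practice: it produces the WTF-8-style bytes for an unpaired
| surrogate that strict UTF-8 must reject.
|
|     lone = "\ud800"                                # unpaired surrogate
|     wtf8 = lone.encode("utf-8", "surrogatepass")   # b'\xed\xa0\x80'
|     back = wtf8.decode("utf-8", "surrogatepass")   # round-trips
|     print(back == lone)                            # True
|     # wtf8.decode("utf-8") would raise UnicodeDecodeError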
| timbray wrote:
| Relevant: https://www.ietf.org/archive/id/draft-bray-unichars-15.html
| - IETF approved and will have an RFC number in a few weeks.
|
| Tl;dr: Since we're kinda stuck with Uncorrected UTF-8, here are
| the "characters" you shouldn't use. Includes a bunch of stuff the
| OP mentioned.
| philipwhiuk wrote:
| He got very close to killing SOH (U+0001), which is useful in
| various technical specifications, and seems to still want to put
| the boot in.
|
| I don't understand the desire to make existing characters
| unrepresentable. For the sake of what? Shifting used characters
| earlier in the byte sequence?
| omoikane wrote:
| This scheme skips over 80 through 9F because they claim it's
| never appropriate to send those control characters in
| interchanged text, but it just seems like a very brave proposal
| to intentionally have codepoints that can't be encoded.
|
| I think the offset scheme should only be used to fix overlength
| encodings, not to patch over an ad-hoc hole at the same time. It
| seems safer to make it possible to encode all codepoints, whether
| those codepoints should be used or not. Unicode already has holes
| in various ranges anyway.
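|
| For reference, the generic bias trick (the offsets below are
| illustrative, not the article's exact values): each length's
| range starts where the previous one ends, so no value ever has
| two encodings.
|
|     # Hypothetical bases: 2-byte sequences start right after the
|     # 1-byte range, 3-byte right after the 2-byte range, etc.
|     OFFSETS = {2: 0x80, 3: 0x80 + 0x800, 4: 0x80 + 0x800 + 0x10000}
|
|     def decode_biased(raw: int, length: int) -> int:
|         # raw = the payload bits packed from the sequence
|         return raw + OFFSETS[length]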
| _kst_ wrote:
| "UTF-16 is now obsolete."? That's news to me.
|
| I _wish_ it were true, but it's not.
| timbray wrote:
| Yeah, for example it's how Java stores strings to this day. But
| I think it's more or less never transmitted over the network.
| esrauch wrote:
| Even if all wire-format encoding is utf8, you wouldn't be
| able to decode these new high codepoints into systems that
| are semantically utf16 - which means Java and JS at least,
| hardly "obsolete" targets to worry about.
|
| And even Swift is designed so that strings can be utf8 or
| utf16, for cheap objc interop reasons.
|
| Discarding compatibility with 2 of the top ~5 most widely
| used languages kind of reflects how disconnected the author
| is from the technical realities of whether any fixed utf8
| would be feasible outside of the most toy use cases.
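|
| Concretely, the surrogate mechanism is exactly why U+10FFFF is
| the ceiling: a pair carries 10 + 10 = 20 bits above U+10000, and
| nothing higher fits.
|
|     def to_surrogates(cp: int) -> tuple[int, int]:
|         assert 0x10000 <= cp <= 0x10FFFF
|         v = cp - 0x10000
|         return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)
|
|     print([hex(u) for u in to_surrogates(0x10FFFF)])
|     # ['0xdbff', '0xdfff'] -- the very last representable pair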
___________________________________________________________________
(page generated 2025-07-06 23:00 UTC)