[HN Gopher] Corrected UTF-8 (2022)
       ___________________________________________________________________
        
       Corrected UTF-8 (2022)
        
       Author : RGBCube
       Score  : 29 points
       Date   : 2025-07-03 15:29 UTC (3 days ago)
        
 (HTM) web link (www.owlfolio.org)
 (TXT) w3m dump (www.owlfolio.org)
        
       | chowells wrote:
        | Forbidding \r\n line endings _in the encoding_ just sort of
        | sinks the whole idea. The first couple of ideas are nice, but
        | then you suddenly get normative about which characters are
        | allowed to be encoded? That creates a very large initial
        | hurdle to clear to get people to use your encoding. Suddenly
        | you need to forbid specific texts, instead of just handling
        | everything. Why put such a huge footgun in your system when
        | it's not necessary?
        
         | jmclnx wrote:
          | Many things make sense to me, but as we can all guess, this
          | will never become a thing :(
         | 
         | But the "magic number" thing to me is a waste of space. If this
         | standard is accepted, if no magic number you have corrected
         | UTF-8.
         | 
          | As for \r\n, not a big deal to me. I would like to see it
          | forbidden, if only to force Microsoft to use \n like UN*X
          | and Apple. I still have to deal with \r\n showing up in
          | files every so often.
        
           | _kst_ wrote:
           | "If this standard is accepted, if no magic number you have
           | corrected UTF-8."
           | 
           | That's true only if "corrected UTF-8" is accepted _and_
            | existing UTF-8 becomes obsolete. That can't happen. There's
           | too much existing UTF-8 text that will never be translated to
           | a newer standard.
        
           | Dwedit wrote:
           | Magic numbers do appear a lot in C# programs. The default
           | text encoder will output a BOM marker.
        
       | oleganza wrote:
        | A magic prefix (similar to a byte-order mark, BOM) also kills
        | the idea. The reason any standard succeeds is the ability to
        | establish consensus while navigating existing constraints.
        | UTF-8 won over codepages and over UTF-16/32 by being purely
        | ASCII-compatible; a magic prefix kills that compatibility.
        
       | moonshadow565 wrote:
        | What about encoding it in such a way that we don't need huge
        | tables to figure out the category of each code point?
        
         | lifthrasiir wrote:
         | It means that you are encoding those categories into the code
         | point itself, which is a waste for every single use of the
         | character encoding.
        
           | panpog wrote:
           | It seems plausible that this could be made efficiently doable
           | byte-wise. For example, C3 xx could be made to uppercase to
           | C4 xx. Unicode actually does structure its codespace to make
           | certain properties easier to compute, but those properties
           | are mostly related to legacy encodings, and things are
            | designed with UCS-2 or UTF-32 in mind, not UTF-8.
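            | 
            | A quick sketch of what I mean, in Python (the helper name
            | upper_latin1_bytes is made up for illustration): in
            | today's UTF-8 the Latin-1 Supplement lowercase letters
            | already sit under lead byte C3, and for most of them
            | uppercasing is just "subtract 0x20 from the trailing
            | byte", no lookup table needed (exceptions such as U+00DF
            | and U+00FF fall outside the range and are left alone):
            | 
            |   def upper_latin1_bytes(b: bytes) -> bytes:
            |       # Byte-wise uppercase for U+00E0..U+00FE, skipping
            |       # U+00F7 (the division sign); everything else is
            |       # passed through untouched.
            |       out = bytearray()
            |       i = 0
            |       while i < len(b):
            |           nxt = b[i + 1] if i + 1 < len(b) else 0
            |           if (b[i] == 0xC3 and 0xA0 <= nxt <= 0xBE
            |                   and nxt != 0xB7):
            |               out += bytes((0xC3, nxt - 0x20))
            |               i += 2
            |           else:
            |               out.append(b[i])
            |               i += 1
            |       return bytes(out)
            | 
            |   s = "àéîöþ".encode("utf-8")
            |   print(upper_latin1_bytes(s).decode("utf-8"))  # ÀÉÎÖÞ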
           | 
           | It's also not clear to me that the code point is a good
            | abstraction in the design of UTF-8. Usually, what you want is
           | either the byte or the grapheme cluster.
        
             | karteum wrote:
             | > Usually, what you want is either the byte or the grapheme
             | cluster.
             | 
              | Exactly! That's what I understood after reading this great
             | post https://tonsky.me/blog/unicode/
             | 
             |  _" Even in the widest encoding, UTF-32, [some grapheme]
             | will still take three 4-byte units to encode. And it still
             | needs to be treated as a single character. If the analogy
             | helps, we can think of the Unicode itself (without any
             | encodings) as being variable-length."_
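              | 
              | To make the quoted point concrete, here is a tiny Python
              | check (the family emoji is just an arbitrary example of
              | a multi-codepoint grapheme cluster):
              | 
              |   # woman + ZWJ + woman + ZWJ + girl: one "character"
              |   s = "\U0001F469\u200D\U0001F469\u200D\U0001F467"
              |   print(len(s))                      # 5 code points
              |   print(len(s.encode("utf-8")))      # 18 UTF-8 bytes
              |   print(len(s.encode("utf-32-le")))  # 20 bytes, i.e.
              |                                      # five 4-byte units
              | 
              | So even UTF-32 is "variable-length" once you count
              | user-perceived characters rather than code points.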
             | 
              | I tend to think it's the biggest design decision in
              | Unicode (but maybe I just don't fully see the need and
              | use cases beyond emojis. Of course I read the section
              | saying it's used in actual languages, but the few
              | examples described could have been handled with a
              | dedicated 32-bit codepoint...)
        
       | duskwuff wrote:
       | 1) Adding offsets to multi-byte sequences breaks compatibility
       | with existing UTF-8 text, while generating text which can be
       | decoded (incorrectly) as UTF-8. That seems like a non-starter.
        | The alleged benefit of "eliminating overlength encodings"
        | seems marginal; overlength encodings are already invalid (see
        | the small demo after this list). It also significantly
        | increases the complexity of encoders and decoders,
       | especially in dealing with discontinuities like the UTF-16
       | surrogate "hole".
       | 
       | 2) I really doubt that the current upper limit of U+10_FFFF is
       | going to need to be raised. Past growth in the Unicode standard
       | has primarily been driven by the addition of more CJK characters;
       | that isn't going to continue indefinitely.
       | 
       | 3) Disallowing C0 characters like U+0009 (horizontal tab) is
       | absurd, especially at the level of a text encoding.
       | 
       | 4) BOMs are dumb. We learned that lesson in the early 2000s -
       | even if they sound great as a way of identifying text encodings,
       | they have a nasty way of sneaking into the middle of strings and
       | causing havoc. Bringing them back is a terrible idea.
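        | 
        | On (1), a minimal demo that conforming decoders already treat
        | overlength sequences as errors (using Python's strict UTF-8
        | codec as one example; C0 AF is an overlength encoding of "/",
        | U+002F):
        | 
        |   try:
        |       b"\xc0\xaf".decode("utf-8")
        |   except UnicodeDecodeError as e:
        |       print(e)  # "invalid start byte" for 0xc0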
        
         | rini17 wrote:
          | Yes, it should be completely incompatible with UTF-8, not
          | only partially. As in, anything beyond ASCII should be
          | invalid and not decodable as UTF-8.
        
       | lifthrasiir wrote:
       | If you do need the expansion of code point space,
       | https://ucsx.org/ is the definitive answer; it was designed by
       | actual Unicode contributors.
        
       | Dwedit wrote:
        | I don't expect anyone to adopt this. Listing complaints about
        | a heavily used standard and proposing something incompatible
        | won't gain any traction.
       | 
        | Compare to WTF-8, which solves a different problem
        | (representing unpaired UTF-16 surrogates within an 8-bit
        | encoding).
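        | 
        | A small illustration of the idea, using Python's
        | "surrogatepass" error handler as a stand-in for WTF-8 (a lone
        | surrogate such as U+D800 is not valid UTF-8, but WTF-8 admits
        | the generalized 3-byte form so JS/Java strings can
        | round-trip):
        | 
        |   lone = "\ud800"
        |   print(lone.encode("utf-8", "surrogatepass"))
        |   # -> b'\xed\xa0\x80', the byte form WTF-8 permits
        |   try:
        |       lone.encode("utf-8")   # strict UTF-8 rejects it
        |   except UnicodeEncodeError as e:
        |       print(e)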
        
         | esrauch wrote:
          | Yeah, WTF-8 is a very straightforward "the spec
          | semi-artificially says we can't do this one thing, and that
          | prevents you from using UTF-8 under the hood to represent
          | JS and Java strings, which allow unpaired UTF-16
          | surrogates; so in practice UTF-8-except-this-one-thing is
          | the only way to do an in-memory representation in things
          | that want to implement or round-trip interop with those".
         | 
          | It's literally the exact opposite of this proposal, in that
          | there's an actual concrete problem and a way to make it not
          | a problem. This one is a list of weird grievances that
          | aren't actually problems for anyone, like the max code
          | point number.
        
       | timbray wrote:
        | Relevant:
        | https://www.ietf.org/archive/id/draft-bray-unichars-15.html -
        | IETF approved; it will have an RFC number in a few weeks.
       | 
       | Tl;dr: Since we're kinda stuck with Uncorrected UTF-8, here are
       | the "characters" you shouldn't use. Includes a bunch of stuff the
       | OP mentioned.
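        | 
        | For a rough feel of what that means in practice, here is a
        | Python paraphrase of the kind of check the draft implies (not
        | its normative text; the exact subsets and names are defined
        | in the draft itself):
        | 
        |   def is_problematic(cp: int) -> bool:
        |       if 0xD800 <= cp <= 0xDFFF:        # surrogates
        |           return True
        |       if cp < 0x20 and cp not in (0x09, 0x0A, 0x0D):
        |           return True                   # legacy C0 controls
        |       if 0x7F <= cp <= 0x9F:            # DEL and C1 controls
        |           return True
        |       if 0xFDD0 <= cp <= 0xFDEF:        # noncharacters
        |           return True
        |       if (cp & 0xFFFF) >= 0xFFFE:       # xxFFFE / xxFFFF
        |           return True
        |       return False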
        
       | philipwhiuk wrote:
        | He got very close to killing SOH (U+0001), which is useful in
        | various technical specifications. He seems to still want to
        | put the boot in.
       | 
       | I don't understand the desire to make existing characters
       | unrepresentable for the sake of what? Shifting used characters
       | earlier in the byte sequence?
        
       | omoikane wrote:
        | This scheme skips over 80 through 9F because they claim it's
        | never appropriate to send those control characters in
        | interchanged text, but it just seems like a very brave
        | proposal to intentionally have codepoints that can't be
        | encoded.
       | 
        | I think the offset scheme should only be used to fix
        | overlength encodings, and not try to patch over an ad-hoc
        | hole at the same time. It seems safer to make it possible to
        | encode all codepoints, whether those codepoints should be
        | used or not. Unicode already has holes in various ranges
        | anyway.
        
       | _kst_ wrote:
       | "UTF-16 is now obsolete."? That's news to me.
       | 
        | I _wish_ it were true, but it's not.
        
         | timbray wrote:
         | Yeah, for example it's how Java stores strings to this day. But
          | I think it's more or less never transmitted over the network.
        
           | esrauch wrote:
            | Even if all wire-format encoding is UTF-8, you wouldn't
            | be able to decode these new high codepoints into systems
            | that are semantically UTF-16, which means Java and JS at
            | least; hardly "obsolete" targets to worry about.
           | 
            | And even Swift is designed so that strings can be UTF-8
            | or UTF-16, for cheap Obj-C interop reasons.
           | 
            | Discarding compatibility with 2 of the top ~5 most widely
            | used languages kind of reflects how disconnected the
            | author is from the technical realities of whether any
            | fixed UTF-8 would be feasible outside of the most toy use
            | cases.
        
       ___________________________________________________________________
       (page generated 2025-07-06 23:00 UTC)