[HN Gopher] Could we have avoided the whole UTF-16 fiasco? (2020)
       ___________________________________________________________________
        
       Could we have avoided the whole UTF-16 fiasco? (2020)
        
       Author : r721
       Score  : 31 points
       Date   : 2022-11-27 16:29 UTC (12 hours ago)
        
 (HTM) web link (retrocomputing.stackexchange.com)
 (TXT) w3m dump (retrocomputing.stackexchange.com)
        
       | stevefan1999 wrote:
       | Well, I would say UTF-8 is a clever hack: it exploits bit
       | patterns rather than having proper "code planes" to encode
       | characters one by one. The inherent problems of UTF-8 are, of
       | course, some wasted space and the need for fast bit-masking
       | operations on the processor (although that was virtually solved
       | in the 90s), but the advantages of UTF-8 outweigh the downsides.
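
To make the "bit patterns" and "bit masking" in the comment above concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The function name `utf8_encode` is made up for illustration; real code would normally just call `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using shifts and masks."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx (plain ASCII)
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Cross-check against Python's built-in encoder.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```

The marker bits at the top of every byte are the "wasted space": each continuation byte carries only 6 payload bits out of 8.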
        
       | nwellnhof wrote:
       | UTF-16 is not the only fiasco. Combining characters are mostly
       | useless overengineering as well, requiring composition,
       | decomposition and normalization forms and leading to exploits
       | like Zalgo text.
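
For readers who have not run into normalization forms, here is a small sketch using Python's standard `unicodedata` module. It shows why two strings that look identical can compare unequal, and how Zalgo-style text is just a base letter followed by a pile of combining marks. The helper names `canonically_equal` and `count_combining_marks` are made up for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def count_combining_marks(text: str) -> int:
    """Count code points with a non-zero canonical combining class."""
    return sum(1 for ch in text if unicodedata.combining(ch) != 0)


# A Zalgo-like string: one base letter with 20 stacked acute accents.
zalgo_like = "a" + "\u0301" * 20

print(precomposed == decomposed)                    # False: raw code points differ
print(canonically_equal(precomposed, decomposed))   # True: same text after NFC
print(count_combining_marks(zalgo_like))            # 20
```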
        
         | Someone wrote:
         | If it's _mostly_ useless, could we implement what it is needed
         | for in a simpler way?
         | 
         | If not, I wouldn't call it overengineering.
        
           | arcbyte wrote:
           | > "What is needed"
           | 
           | The best engineers do most of their engineering in the
           | requirements. Even if UTF-16 is the only way to satisfy what
           | is wanted, is that really the best expression of what is
           | wanted or is it filled with misunderstandings and unnecessary
           | compromises? Almost always the latter.
        
         | adrian_b wrote:
         | The combining characters could be eliminated only if Unicode
         | included every combination with alphabetic characters that has
         | ever been used in the writing of any language, which it does
         | not.
         | 
         | Moreover, most typefaces do not even include many of the
         | precomposed combinations that are included in Unicode, so even
         | the rendering of the precomposed Unicode characters may need to
         | use the combining characters from the typeface.
         | 
         | Unfortunately, most computer applications have always been
         | plagued by problems caused by the fact that English happens to
         | be one of the few languages that do not use diacritic signs,
         | so English speakers have always neglected them, while for most
         | other languages they are absolutely essential.
         | 
         | It is impractical to include in Unicode all possible
         | combinations. Many combinations have become obsolete, due to
         | changes in orthography, but they are still needed in Unicode,
         | for the encoding of old texts, which used different
         | orthographic rules.
         | 
         | Unicode would have been simpler if it had not included any
         | precomposed combinations of alphabetic characters and
         | diacritic signs, but only the combining characters.
         | 
         | Unfortunately, Unicode adopted the principle of including all
         | the character sets of the previous standards as they were,
         | without changes, and that forced the inclusion of many
         | combinations as distinct Unicode code points.
         | 
         | While removing all the precomposed combinations would be
         | possible, removing the combining characters is impossible in a
         | character set that may be used for non-English languages.
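
The point that Unicode does not contain every combination can be checked with Python's standard `unicodedata` module. This is a rough sketch: the helpers `compose` and `has_precomposed_form` are made up for illustration, and "has a precomposed form" here only means "NFC composes the pair into a single code point".

```python
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS


def compose(base: str, mark: str) -> str:
    """Return the NFC form of a base letter followed by one combining mark."""
    return unicodedata.normalize("NFC", base + mark)


def has_precomposed_form(base: str, mark: str) -> bool:
    """True if NFC folds base + mark into a single code point."""
    return len(compose(base, mark)) == 1


# "u" + diaeresis has a precomposed code point (U+00FC).
print(has_precomposed_form("u", DIAERESIS))   # True

# "q" + diaeresis does not, so only the combining sequence can express it.
print(has_precomposed_form("q", DIAERESIS))   # False

# Going the other way, NFD rewrites a precomposed letter as base + mark,
# which is the "only combining characters" design described above.
print(unicodedata.normalize("NFD", "\u00fc") == "u" + DIAERESIS)  # True
```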
        
       | baybal2 wrote:
        
       ___________________________________________________________________
       (page generated 2022-11-28 05:01 UTC)