[HN Gopher] Could we have avoided the whole UTF-16 fiasco? (2020)
___________________________________________________________________
Could we have avoided the whole UTF-16 fiasco? (2020)
Author : r721
Score : 31 points
Date : 2022-11-27 16:29 UTC (12 hours ago)
(HTM) web link (retrocomputing.stackexchange.com)
(TXT) w3m dump (retrocomputing.stackexchange.com)
| stevefan1999 wrote:
| Well, I would say UTF-8 is a clever hack: it exploits bit patterns
| rather than having proper "code planes" to encode characters one
| by one. The inherent problems of UTF-8 are, of course, wasted
| space and the need for the processor to have fast bit-masking
| operations (although that was virtually solved in the 90s), but
| the advantages of UTF-8 outweigh the downsides.
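|
| To make the "bit patterns" and "bit masking" point concrete, here
| is a minimal sketch of a UTF-8 encoder for a single code point.
| The function name utf8_encode is made up for illustration; real
| code would simply call str.encode("utf-8").

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using shifts and masks.

    Illustrative only: Python's built-in str.encode("utf-8") does this already.
    """
    # Reject values outside the Unicode range and the surrogate block.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # 0xxxxxxx: ASCII is stored unchanged in one byte.
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx: 11 payload bits in two bytes.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits in three bytes.
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 payload bits in four bytes.
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

| For example, utf8_encode(ord("é")) gives the two bytes C3 A9: 11
| payload bits carried in 16, which is the wasted space in question.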
| nwellnhof wrote:
| UTF-16 is not the only fiasco. Combining characters are mostly
| useless overengineering as well, requiring composition,
| decomposition and normalization forms and leading to exploits
| like Zalgo text.
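|
| For readers who have not run into this, a small sketch using
| Python's standard unicodedata module shows why normalization is
| needed and how Zalgo-style stacking works. The helper names
| same_text and count_marks are made up for illustration.

```python
import unicodedata

composed = "\u00e9"     # "é" as one precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def count_marks(s: str) -> int:
    """Count code points with a non-zero canonical combining class."""
    return sum(1 for ch in s if unicodedata.combining(ch) != 0)


# Zalgo-style text: one base letter with many stacked combining marks.
zalgo = "a" + "\u0301" * 5
```

| Here composed == decomposed is False although both render as "é",
| same_text(composed, decomposed) is True, and count_marks(zalgo)
| reports the five marks piled onto a single "a".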
| Someone wrote:
| If it's _mostly_ useless, could we implement what it is needed
| for in a simpler way?
|
| If not, I wouldn't call it overengineering.
| arcbyte wrote:
| > "What is needed"
|
| The best engineers do most of their engineering in the
| requirements. Even if UTF-16 is the only way to satisfy what
| is wanted, is that really the best expression of what is
| wanted or is it filled with misunderstandings and unnecessary
| compromises? Almost always the latter.
| adrian_b wrote:
| The combining characters could be eliminated only if Unicode
| included every combination with alphabetic characters that has
| ever been used in the writing of any language, which it does
| not.
|
| Moreover, most typefaces do not even include many of the
| precomposed combinations that are included in Unicode, so even
| the rendering of the precomposed Unicode characters may need to
| use the combining characters from the typeface.
|
| Unfortunately, most computer applications have always been
| plagued by problems caused by the fact that English happens to
| be one of the few languages that does not use diacritic signs.
| English speakers have therefore always neglected them, while
| for most other languages they are absolutely essential.
|
| It is impractical to include in Unicode all possible
| combinations. Many combinations have become obsolete, due to
| changes in orthography, but they are still needed in Unicode,
| for the encoding of old texts, which used different
| orthographic rules.
|
| Unicode would have been simpler had it not included any
| precomposed combinations of alphabetic characters and diacritic
| signs, but only the combining characters.
|
| Unfortunately, Unicode adopted the principle of including all
| the character sets of the previous standards as they were,
| without changes, and that forced the inclusion of many
| combinations as distinct Unicode code points.
|
| While removing all the precomposed combinations would be
| possible, removing the combining characters is impossible in a
| character set that is meant to be used for non-English
| languages.
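|
| Both halves of that claim can be checked with Python's standard
| unicodedata module. This is only a sketch: the helper name
| nfc_composes is made up, and because NFC skips a few code points
| listed as composition exclusions, a False result means "NFC does
| not compose this pair", which is not a strict proof that no
| precomposed code point exists.

```python
import unicodedata

TILDE = "\u0303"  # COMBINING TILDE


def nfc_composes(base: str, mark: str) -> bool:
    """True if NFC turns base + mark into a single code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1


# "n" + tilde has a precomposed form (U+00F1); "q" + tilde, seen as an
# abbreviation in old texts, does not and stays as two code points.
n_tilde = nfc_composes("n", TILDE)
q_tilde = nfc_composes("q", TILDE)

# Going the other way, a precomposed letter decomposes fully:
# U+1EC7 becomes "e" + COMBINING DOT BELOW + COMBINING CIRCUMFLEX ACCENT.
parts = unicodedata.normalize("NFD", "\u1ec7")
```

| So the precomposed letters can always be expressed with combining
| characters, while the reverse does not hold.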
| baybal2 wrote:
___________________________________________________________________
(page generated 2022-11-28 05:01 UTC)