[HN Gopher] You can't just assume UTF-8
       ___________________________________________________________________
        
       You can't just assume UTF-8
        
       Author : calpaterson
       Score  : 31 points
       Date   : 2024-04-29 06:11 UTC (16 hours ago)
        
 (HTM) web link (csvbase.com)
 (TXT) w3m dump (csvbase.com)
        
       | bhaney wrote:
       | I'm just gonna assume UTF-8
        
         | calpaterson wrote:
          | Huang Nong peng Qin Lian
        
           | bhaney wrote:
           | And a good day to you too, my friend whose input I'm going to
           | discard
        
         | duskwuff wrote:
         | I'm disappointed that the article doesn't discuss this in more
         | detail. _Most byte sequences are not valid UTF-8._ If you can
         | decode a message as UTF-8 with no errors, that is almost
          | certainly the correct encoding to use; it's extremely unlikely
         | that some text in another encoding just happened to be
         | perfectly valid as UTF-8. (The converse is not true; most 8-bit
         | text encodings will happily decode UTF-8 sequences to nonsense
         | strings like dYs(c).)
         | 
          | If UTF-8 decoding fails, _then_ it's time to pull out the
         | fancy statistical tools to (unreliably) guess an encoding. But
         | that should be a fallback, not the first thing you try.
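          | 
          | A minimal sketch of that ordering, assuming Python 3; the
          | third-party chardet package stands in here for whatever
          | statistical detector you fall back on:
          | 
          |     import chardet  # third-party, illustrative fallback
          | 
          |     def decode_text(data: bytes) -> str:
          |         try:
          |             # Strict UTF-8 first: text in other encodings
          |             # almost never decodes cleanly as UTF-8.
          |             return data.decode("utf-8")
          |         except UnicodeDecodeError:
          |             # Only now guess, and accept it may be wrong.
          |             guess = chardet.detect(data)["encoding"]
          |             return data.decode(guess or "latin-1",
          |                                errors="replace")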
        
           | actionfromafar wrote:
           | Not _extremely_ unlikely. Many charsets decode fine as UTF-8
           | as long as the message happens to fit in ASCII.
        
       | jujube3 wrote:
       | Actually, I can just assume UTF-8, since that's what the world
       | standardized on. Just like I can assume the length of a meter or
       | the weight of a gram. There is no need to have dozens of
       | incompatible systems.
        
         | qwertox wrote:
         | You can't assume it if you're handed a random sample of text
         | files.
        
           | duskwuff wrote:
           | You can _start_ by assuming UTF-8, then move on to other
            | heuristics if UTF-8 decoding fails. UTF-8 is "picky" about
           | the sequence of bytes in multi-byte sequences; it's
           | extraordinarily unlikely that text in any other encoding will
           | satisfy its requirements.
           | 
           | (Other than pure ASCII, of course. But "decoding" ASCII text
           | as UTF-8 is safe anyway, so that hardly matters.)
        
           | plorkyeran wrote:
           | I in fact can assume it. If the assumption is wrong then
           | that's someone else's problem. 15 years ago I wrote a bunch
           | of code using uchardet to detect encodings and it was a
           | pretty useful feature at the time. In the last decade
           | everything I've touched has required UTF-8 unless it's been
           | interoperating with a specific legacy system which has some
           | other fixed charset, and it's never been an issue.
        
         | fl7305 wrote:
         | That comparison doesn't hold.
         | 
         | If you're dealing with lengths, you can get input data in
         | meters, centimeters, millimeters, inches, feet, etc.
         | 
         | If the input data is human heights, would you automatically
         | assume meters even if the input data is "183"?
         | 
         | If the input data is the weight of humans, would you always
         | assume grams, even if the input data is "75"?
        
           | asddubs wrote:
           | would you guess the unit instead of specifying what you
           | expect?
        
             | fl7305 wrote:
             | > would you guess the unit instead of specifying what you
             | expect?
             | 
             | It depends on the circumstances. It might be the least bad
             | thing to do. Or not.
             | 
             | But that wasn't my point. I replied to this:
             | 
             | > I can assume the length of a meter or the weight of a
             | gram
             | 
              | Sure, the length of a meter and the "weight" of a gram are
              | both standardized. (To be very picky, a gram is a unit of
              | mass, not weight. The actual weight depends on the local
              | gravitational acceleration g, which averages about 9.81
              | m/s^2 on Earth but can vary by roughly 0.5%.)
             | 
             | So if you know the input is in meters, you don't need to do
             | any further processing.
             | 
             | But dealing with input text files with an unknown encoding
             | is like dealing with input lengths with an unknown unit.
             | 
             | So while UTF-8 itself might be standardized, it is not the
             | same as all input text files always being in UTF-8.
             | 
             | You can choose to say that all input text files must be in
             | valid UTF-8, or the program refuses to load them. Or you
              | can use silent heuristics. Or something in between.
        
         | nijave wrote:
         | Microsoft would like a word with you (utf-8-bom & utf-16)
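          | 
          | Both are easy enough to cope with up front, though. A rough
          | sketch using Python's standard codecs module (the helper
          | name is made up):
          | 
          |     import codecs
          | 
          |     def codec_from_bom(data: bytes):
          |         # Windows tools often prepend a byte order mark.
          |         if data.startswith(codecs.BOM_UTF8):
          |             return "utf-8-sig"  # strips the BOM on decode
          |         if data.startswith(codecs.BOM_UTF16_LE) or \
          |            data.startswith(codecs.BOM_UTF16_BE):
          |             return "utf-16"     # codec consumes the BOM
          |         return None             # no BOM: assume plain UTF-8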
        
         | paulddraper wrote:
         | You can assume encoding is UTF-8, length is in meters, and
         | timezone is UTC.
         | 
         | -
         | 
         | You just won't always be right.
        
         | pyuser583 wrote:
         | Universal UTF-8 is a hope we aspire to, not a reality we
         | assume.
        
           | fiddlerwoaroof wrote:
           | The way you get that reality is you do the opposite of the
           | recommendation of Postel's law: be very picky about what you
           | consume and fail loudly if it's not UTF-8
        
         | treflop wrote:
          | There's a difference between assuming and not making a
          | distinction.
          | 
          | Very few developers I've met could even make that distinction.
          | They'd see a few off characters and think it's some one-off bug,
          | when really it's because both sides are assuming an encoding.
         | 
         | Even if you said you'd pay them one billion dollars to fix it,
         | they'd absolutely be unable to.
        
       | pronoiac wrote:
       | Archive copy:
       | https://web.archive.org/web/20240429061925/https://csvbase.c...
        
       | o11c wrote:
        | Default UTF-8 is better than the linked suggestion of using a
        | heuristic, but failing catastrophically when old data is
        | encountered is unacceptable. There _must_ be a fallback.
       | 
       | (Note that the heuristic for "is this intended to be UTF-8" is
       | pretty reliable, but most other encoding-detection heuristics are
       | very bad quality)
        
       | dandigangi wrote:
       | Except I can
        
       | Veserv wrote:
       | Off-topic, but the bit numbering convention is deliciously
       | confusing.
       | 
       | Little-endian bytes (lowest byte is leftmost) and big-endian bits
       | (bits contributing less numerical value are rightmost) are
       | normal, but the bits are referenced/numbered little-endian (first
       | bit is leftmost even though it contributes the most numerical
       | value). When I first read the numbering convention I thought it
       | was going to be a breath of fresh air of someone using the much
       | more sane, but non-standard, little-endian bits with little-
       | endian bytes, but it was actually another layered twist.
       | Hopefully someday English can write numbers little-endian, which
       | is objectively superior, and do away with this whole mess.
        
         | kstrauser wrote:
         | > Hopefully someday English can write numbers little-endian,
         | which is objectively superior
         | 
         | Upon reading this, I threw my laptop out the window.
        
       | bandyaboot wrote:
       | > In the most popular character encoding, UTF-8, character number
       | 65 ("A") is written:
       | 
       | > 01000001
       | 
       | > Only the second and final bits are 1, or "on".
       | 
       | Isn't it more accurate to say that the first and penultimate bits
       | are 1, or "on"?
        
         | fl7305 wrote:
         | It depends on whether your bit numbering is like x86 (your
          | description), or PowerPC (leftmost bit is 0).
        
           | duskwuff wrote:
           | Basically everyone uses x86 bit numbering. It has the
           | pleasant property that the place value of every bit is always
           | 2^n (or -2^n for a sign bit), and zero-extending a value
           | doesn't change the numbering of its bits.
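            | 
            | For instance, under that numbering the "A" byte from the
            | article (65, 0b01000001) has bits 0 and 6 set; a quick
            | check in Python:
            | 
            |     value = ord("A")  # 65 == 0b01000001
            |     bits = [n for n in range(8) if value & (1 << n)]
            |     print(bits)       # [0, 6]: 2**0 + 2**6 == 65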
        
             | fl7305 wrote:
             | Sure, it is by far the industry standard.
             | 
             | It works much better for handling discrete integers.
             | 
             | Once you get into bitfield instructions it is nice to have
             | bit 0 be the "left most bit".
        
           | bandyaboot wrote:
           | The more I thought it through, even assuming x86, I guess
           | there's just no "correct" way to casually reference bit
           | positions when we read them in the opposite order from the
           | machine. Are they being referenced from the perspective of a
           | human consumer of text, or the machine's perspective as a
           | consumer of bits? If I were writing that content, I'd have a
           | difficult time deciding on which to use. If I were writing
           | for a lay person, referencing left-to-right seems obvious,
           | but in this case where the audience is primarily developers,
           | it becomes much less obvious.
        
       | vitaut wrote:
       | This is so spectacularly outdated. KOI-8 has been dead for ages.
        
       | JonChesterfield wrote:
       | How about assume utf-8, and if someone has some binary file
       | they'd rather a program interpret as some other format, they turn
       | it into utf-8 using a standalone program first. Instead of
       | burning this guess-what-bytes-they-might-like nonsense into all
       | the software.
       | 
       | We don't go "oh that input that's supposed to be json? It looks
       | like a malformed csv file, let's silently have a go at fixing
       | that up for you". Or at least we shouldn't, some software
       | probably does.
        
         | zarzavat wrote:
         | Agreed. Continuing to support other encodings is like insisting
         | that cars should continue to have cassette tape players.
         | 
         | It's much easier to tell the people with old cassette tapes to
         | rip them, rather than try to put a tape player in every car.
        
           | fl7305 wrote:
           | > It's much easier to tell the people with old cassette tapes
           | to rip them
           | 
           | I assume you mean "rip them", as in transcode to a different
           | format?
           | 
           | In that case, you need a tool that takes the old input
           | format(s) and convert them to the new format.
           | 
           | For text files, you'd need a tool that takes the old text
           | files with various encodings and converts them to UTF-8.
           | 
           | Isn't the point of the article to describe how an engineer
           | would create such a tool?
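            | 
            | As a sketch, the conversion itself is tiny once the source
            | encoding is known or guessed (function and argument names
            | here are made up); the hard part the article covers is
            | choosing src_encoding in the first place:
            | 
            |     import sys
            | 
            |     def to_utf8(src_path, dst_path, src_encoding):
            |         # Decode with the legacy codec; fail loudly if
            |         # the file does not actually match it.
            |         with open(src_path, encoding=src_encoding) as f:
            |             text = f.read()
            |         with open(dst_path, "w", encoding="utf-8") as f:
            |             f.write(text)
            | 
            |     if __name__ == "__main__":
            |         to_utf8(sys.argv[1], sys.argv[2], sys.argv[3])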
        
         | amarshall wrote:
         | > some software probably does.
         | 
         | Browsers do, kind of https://mimesniff.spec.whatwg.org/#rules-
         | for-identifying-an-...
        
         | fl7305 wrote:
         | > they turn it into utf-8 using a standalone program first
         | 
         | I took the article to be for people who would be writing that
         | "standalone program"?
         | 
         | I have certainly been in a position where I was the person who
         | had to deal with input text files with unknown encodings. There
         | was no-one else to hand off the problem to.
        
       | dublin wrote:
       | Make your life easy. Assume 7-bit ASCII. No one needs all those
       | other characters, anyway...
        
         | AlienRobot wrote:
         | Do we really need 128 permutations just to express an alphabet
         | of 26 letters?
         | 
         | I think we should use a 4 bit encoding.
         | 
         | 0 - NUL
         | 
         | 1-7 - aeiouwy
         | 
         | 8 - space
         | 
         | 9-12 - rst
         | 
         | 13-15 - modifiers
         | 
         | When modifier bits are set, the values of the next half-byte
         | change to represent the rest of the alphabet, numbers, symbols,
         | etc. depending on the bits set.
        
       | drdaeman wrote:
       | Anyone got EBCDIC on their bingo cards? Because if the argument
       | is "legacy encodings are still relevant in 2024" then we also
       | need to bring EBCDIC (and EBCDIK and UTF-EBCDIC for more
       | perverted fun) into the picture. Makes heuristics extra fun.
       | 
        | Or, you know, just say "nah, I can; that ancient stuff doesn't
        | matter (outside of obligatory exceptions, like software
       | archeology) anymore." If someone wants to feed me a KOI8-R or JIS
       | X 0201 CSV heirloom, they should convert it into something modern
       | first.
        
       | AlienRobot wrote:
       | "You can't assume a 32 bit integer starts from 0"
        
       | Karellen wrote:
       | Don't worry, I never assume UTF-8.
       | 
        | I _require_ UTF-8. If it isn't currently UTF-8, it's someone
       | else's problem to transform it to UTF-8 first. If they haven't,
       | and I get non-UTF-8 input, I'm fine bailing on that with a
       | "malformed input - please correct" error.
        
         | lmm wrote:
         | So you're fine excluding anyone from Japan who wants their name
         | displayed correctly?
        
           | pavel_lishin wrote:
           | Does the entirety of Japan shun utf8?
        
             | koito17 wrote:
             | Many Japanese websites have migrated from Shift-JIS to
             | UTF-8, but this still ignores the fact that e.g. television
             | captioning uses special characters[1] that are not found in
             | UTF-8 or Shift-JIS. Windows itself has a habit of using its
             | own Windows-932 encoding, which frequently causes problems
             | in the Unix software I use. (e.g. Emacs fails at auto-
             | detecting this format, and instructing Emacs to use Shift-
             | JIS will result in decoding issues)
             | 
             | [1] section 2 in https://www.fontucom.com/pdf/AFSARIBR-
             | Map.pdf
        
           | java-man wrote:
           | This is interesting. Can you show which Japanese names cannot
           | be encoded in UTF-8 please?
        
             | kstrauser wrote:
             | They cannot.
        
           | Karellen wrote:
           | If the Unicode consortium haven't been able to come up with a
           | way of encoding their name correctly, I don't see what hope I
           | have of doing so.
           | 
           | Bonus - as soon as the Unicode consortium do find a way, my
           | software should be able to handle it with no further changes.
           | Well, it might need a recompile against a newer `libicu` as I
           | don't think they maintain ABI backcompat between versions.
           | But there's not much I can do about that.
        
           | gabrielhidasy wrote:
           | Are there Japanese characters missing in UTF-8? They should
           | be added ASAP.
           | 
           | I know there's a weird Chinese/Japanese encoding problem
           | where characters that kind-of look alike have the same
           | character id, and the font file is responsible for
           | disambiguation (terrible for multi-language content and we
           | should really add more characters to create versions for
           | each, but still the best we have).
        
             | koito17 wrote:
              | I don't think there are any missing. However, the latter
             | statement is true and ruins the typography of countless
             | things :(
             | 
             | The most common example I can think of is the following:
             | 
              | "Mi Pian" in Chinese is subtly different from Japanese.
              | See https://i.imgur.com/PA4mTME.jpeg ... the left-hand
              | side is the Chinese way of writing, the right-hand side
              | is Japanese. In variety shows, where you will see lots
              | of telops (on-screen captions) in different fonts, you
              | may catch both variants!
        
           | ranger_danger wrote:
           | Unless you're using an OS older than Windows 2000, or a linux
           | distro from the 2000s, where some form of Unicode was not the
           | default encoding, or maybe an ancient Win32 program compiled
           | without "UNICODE" defined, it shouldn't be a problem. I
           | specifically work with a lot of Japanese software and have
           | not seen this problem in many years.
           | 
           | And even back in the mid 2000s, the only real problems I saw
           | otherwise, were things like malformed html pages that assumed
           | a specific encoding that they wouldn't tell you, or an MP3
           | file with an ID3 tag with CP932 shoved into it against the
           | (v1) spec.
           | 
            | I also disagree with the author that Shift-JIS can be "good
            | enough" heuristically detected, because it uses both 7-bit
            | and 8-bit values in its high and low bytes, with the same
            | byte meaning different things depending on which character
            | is actually intended. Even string searching requires a
            | complex custom-made version just for Shift-JIS handling.
        
           | okanat wrote:
           | What a bad, hyperbolic take. UTF-8 can encode the entire
           | Unicode space. All you need is up-to-date libraries and fonts
           | to display the codepoints correctly. It is backwards
            | compatible forever. So requiring UTF-8 allows Japanese users
            | to represent their writing system exactly as it is and keep
            | that scheme for a very long time, with room to improve.
        
           | toast0 wrote:
           | My understanding is Unicode (and therefore UTF-8) can encode
           | all the codepoints encodable by Shift JIS. I know that you
           | need a language context to properly display the codepoints
           | that have been Han Unified, so that could lead to display
           | problems. But if we're trying to properly display a Japanese
           | name, it's probably easier to put the appropriate language
           | context in a UTF-8 document than it is to embed Shift JIS
           | text into a UTF-8 document.
           | 
           | Realistically --- if someone hands me well marked Shift JIS
           | content, I'm just going to reencode it as UTF-8 anyway... And
           | if they hand me unmarked Shift JIS content, I'll try to see
           | if I can decode it as UTF-8 and throw it away as invalid if
           | not.
        
         | fl7305 wrote:
         | That works until you can't pay your bills unless you take a new
          | contract where you have to deal with a large number of
          | historical text files from various sources.
         | 
         | Then it's no longer "someone else's problem".
        
       | groestl wrote:
       | I will assume it, I will enforce it where I can, and I will fight
       | tooth and nail should push come to shove.
       | 
        | I got 99 problems, but charsets ain't one of them.
        
       | zadokshi wrote:
        | Better to assume UTF-8 and fail with a clear message/warning. Sure
       | you can offer to guess to help the end user if it fails, but as
       | other people have pointed out, it's been standard for a long time
       | now. Even python caved and accepted it as the default:
       | https://peps.python.org/pep-0686/
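        | 
        | The "fail with a clear message" part is only a few lines in
        | Python; passing the encoding explicitly works today, without
        | waiting for the PEP 686 default (the helper name is
        | illustrative):
        | 
        |     import sys
        | 
        |     def read_utf8_or_die(path):
        |         try:
        |             with open(path, encoding="utf-8") as f:
        |                 return f.read()
        |         except UnicodeDecodeError as exc:
        |             sys.exit(f"{path}: not valid UTF-8 near byte "
        |                      f"{exc.start}; please re-encode it")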
        
       | smeagull wrote:
       | I absolutely can. If it's not UTF-8, I assume it's worthless.
        
       | koito17 wrote:
       | The comments in this thread are a bit amusing.
       | 
       | I wish I could live in the world where I could bluntly say "I
       | will assume UTF-8 and ignore the rest of the world". Many
       | Japanese documents and sites still use Shift JIS. Windows has
       | this strange Windows-932 format that you will frequently
       | encounter in CUE files outputted by some CD ripping software.
       | ARIB STD-B24, the captioning standard used in Japanese
       | television, has its own text encoding with characters not found
       | in either JIS X 0201 or JIS X 0208. These special characters are
       | mostly icons used in traffic and weather reports, but transcoding
       | to UTF-8 still causes trouble with these icons.
        
         | bongodongobob wrote:
         | Most people on this site probably live in the world where
         | everything is done in English. That's the norm for the vast
         | majority of businesses and people in the US.
        
         | samatman wrote:
         | It's more like "I will assume UTF-8 and ignore edge case
         | encoding problems which still arise in Japan, for some strange
         | reason".
         | 
         | We are not running short on Unicode codepoints. I'm sure they
         | can spare a few more to cover the Japanese characters and icons
         | which invariably get mentioned any time this subject comes up
         | on HN. I don't know why it hasn't happened and I won't be
         | making it my problem to solve. Best I can do is update to
         | version 16 when it's released.
        
           | koito17 wrote:
           | I only bring up Japanese because I deal with Japanese text
           | every day. I _could_ mention Chinese documents and sites
            | frequently using GB / GBK to save on space (since such
           | encodings use exactly 2 bytes per character whereas the
           | average size in UTF-8 is strictly larger than 2 bytes). But I
           | am not very familiar with it.
        
       | norir wrote:
       | If it's turtles all the way down and at every level you use
       | utf-8, it's hard to see how any input with a different encoding
        | (for the same underlying text) could go undetected before any
        | unintended side effects are invoked.
       | 
       | At this point, I don't see any sufficiently good reason to not
       | use utf-8 exclusively in any new system. Conversions to and from
       | other encodings would only be done at well defined boundaries
       | when I'm calling into dependencies that require non utf-8 input
       | for whatever reason.
        
       | kstrauser wrote:
       | If you give me a computer timestamp without a timezone, I can and
       | will assume it's in UTC. It might not be, but if it's not and I
       | process it as though it is, and the sender doesn't like the
       | results, that's on them. I'm willing to spend approximately zero
       | effort trying to guess what nonstandard thing they're trying to
       | send me unless they're paying me or my company a whole lot of
       | money, in which case I'll convert it to UTC upon import and
       | continue on from there.
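        | 
        | A sketch of that import-time policy, assuming Python's
        | datetime and roughly ISO 8601 input:
        | 
        |     from datetime import datetime, timezone
        | 
        |     def to_utc(stamp: str) -> datetime:
        |         dt = datetime.fromisoformat(stamp)
        |         if dt.tzinfo is None:
        |             # No zone given: assume UTC and move on.
        |             return dt.replace(tzinfo=timezone.utc)
        |         return dt.astimezone(timezone.utc)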
       | 
       | Same with UTF-8. Life's too short for bothering with anything
       | else today. I'll deal with some weird janky encoding for the
       | right price, but the first thing I'd do is convert it to UTF-8.
       | Damned if I'm going to complicate the innards of my code with
       | special case code paths for non-UTF-8.
       | 
       | If there were some inherent issue with UTF-8 that made it
       | significantly worse than some other encoding for a given task,
       | I'd be sympathetic to that explanation and wouldn't be such a
       | pain in the neck about this. For instance, if it were the case
       | that it did a bad job of encoding Mandarin or Urdu or Xhosa or
       | Persian, and the people who use those languages strongly
       | preferred to use something else, I'd understand. However, I've
       | never heard a viable explanation for _not_ using UTF-8 other than
       | legacy software support, and if you want to continue to use
        | something ancient and weird, it's on you to adapt it to the rest
       | of the world because they're definitely not going to adapt the
       | world to you.
        
         | kccqzy wrote:
         | > For instance, if it were the case that it did a bad job of
         | encoding Mandarin
         | 
         | I don't know if you picked this example on purpose, but using
          | UTF-8 to encode Chinese comes out about 50% larger than the
          | old encoding (GB2312). I remember people cared about this like
          | twenty years ago. I don't know of anyone that still cares
          | about this
         | encoding inefficiency.
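          | 
          | The overhead is easy to check (the two sample characters
          | here are arbitrary):
          | 
          |     text = "\u6c49\u5b57"              # two CJK characters
          |     print(len(text.encode("gb2312")))  # 4 bytes (2 each)
          |     print(len(text.encode("utf-8")))   # 6 bytes (3 each)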
        
           | PeterisP wrote:
           | A key aspect is that nowadays we rarely encode pure text -
           | while other encodings are more efficient for encoding pure
           | Mandarin, nowadays a "Mandarin document" may be an HTML or
           | JSON or XML file where less than half of the characters are
           | from CJK codespace, and the rest come from all the formatting
           | overhead which is in the 7-bit ASCII range, and UTF-8 works
           | great for such combined content.
        
       | jcranmer wrote:
       | I haven't seen discussion of this point yet, but the post
       | completely fails to provide any data to back up its assertion
        | that charset detection heuristics work, because the feedback
        | I've seen from people who actually work with charsets is that
        | they largely _don't_ (especially if they're based on naive one-
        | byte frequency analysis). Okay, sure, it works if you want to
       | distinguish between KOI8-R and Windows-1252, but what about
       | Windows-1252 and Windows-1257?
       | 
       | See for example this effort in building a universal charset
       | detector in Gecko:
       | https://bugzilla.mozilla.org/show_bug.cgi?id=1551276
        
         | toast0 wrote:
         | I've done some charset detection, although it's been a while.
          | Heuristics kind of work for some things --- I'm a big fan of:
          | if it's decodable as utf-8, it's probably utf-8, unless there
          | are zero bytes (in most text). If there are a lot of zero bytes,
         | maybe it's UCS-2 or UTF-16, and you can try to figure out the
         | byte order and if it decodes as utf-16.
         | 
         | If it doesn't fit in those categories, you've got a much harder
         | guessing game. But usually you can't actually ask the source
         | what it is, because they probably don't know and might not
         | understand the question or might not be contactable. Usually,
         | you have to guess something, so you may as well take someone
         | else's work to guess, if you don't have better information.
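          | 
          | Roughly that decision tree as a sketch, with an arbitrary
          | threshold for "a lot of zero bytes":
          | 
          |     def guess_encoding(data: bytes) -> str:
          |         try:
          |             data.decode("utf-8")
          |             return "utf-8"
          |         except UnicodeDecodeError:
          |             pass
          |         # Many NUL bytes hint at UTF-16; their offsets
          |         # hint at the byte order.
          |         nuls = data.count(0)
          |         if nuls > len(data) // 4:
          |             odd = sum(data[i] == 0
          |                       for i in range(1, len(data), 2))
          |             order = "le" if odd * 2 > nuls else "be"
          |             try:
          |                 data.decode("utf-16-" + order)
          |                 return "utf-16-" + order
          |             except UnicodeDecodeError:
          |                 pass
          |         return "unknown"  # the much harder guessing game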
        
       ___________________________________________________________________
       (page generated 2024-04-29 23:00 UTC)