[HN Gopher] You can't just assume UTF-8
___________________________________________________________________
You can't just assume UTF-8
Author : calpaterson
Score : 31 points
Date : 2024-04-29 06:11 UTC (16 hours ago)
(HTM) web link (csvbase.com)
(TXT) w3m dump (csvbase.com)
| bhaney wrote:
| I'm just gonna assume UTF-8
| calpaterson wrote:
| Huang \ue0afNong pengQin Lian
| bhaney wrote:
| And a good day to you too, my friend whose input I'm going to
| discard
| duskwuff wrote:
| I'm disappointed that the article doesn't discuss this in more
| detail. _Most byte sequences are not valid UTF-8._ If you can
| decode a message as UTF-8 with no errors, that is almost
| certainly the correct encoding to use; it's extremely unlikely
| that some text in another encoding just happened to be
| perfectly valid as UTF-8. (The converse is not true; most 8-bit
| text encodings will happily decode UTF-8 sequences to nonsense
| strings like dYs(c).)
|
| If UTF-8 decoding fails, _then_ it's time to pull out the
| fancy statistical tools to (unreliably) guess an encoding. But
| that should be a fallback, not the first thing you try.
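|
| A minimal Python sketch of that ordering (hypothetical code,
| assuming the third-party charset_normalizer package for the
| fallback step):
|
|     from charset_normalizer import from_bytes
|
|     def decode_text(data: bytes) -> str:
|         try:
|             # strict UTF-8 first; raises on any invalid sequence
|             return data.decode("utf-8")
|         except UnicodeDecodeError:
|             # fallback: statistical guess (inherently unreliable)
|             best = from_bytes(data).best()
|             if best is None:
|                 raise ValueError("could not guess an encoding")
|             return str(best)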
| actionfromafar wrote:
| Not _extremely_ unlikely. Many charsets decode fine as UTF-8
| as long as the message happens to fit in ASCII.
| jujube3 wrote:
| Actually, I can just assume UTF-8, since that's what the world
| standardized on. Just like I can assume the length of a meter or
| the weight of a gram. There is no need to have dozens of
| incompatible systems.
| qwertox wrote:
| You can't assume it if you're handed a random sample of text
| files.
| duskwuff wrote:
| You can _start_ by assuming UTF-8, then move on to other
| heuristics if UTF-8 decoding fails. UTF-8 is "picky" about
| the sequence of bytes in multi-byte sequences; it's
| extraordinarily unlikely that text in any other encoding will
| satisfy its requirements.
|
| (Other than pure ASCII, of course. But "decoding" ASCII text
| as UTF-8 is safe anyway, so that hardly matters.)
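|
| A quick Python illustration of that pickiness (the byte strings
| are just made-up samples):
|
|     # Latin-1 bytes for "café" are not valid UTF-8: raises UnicodeDecodeError
|     b"caf\xe9".decode("utf-8")
|     # pure ASCII bytes decode as UTF-8 unchanged
|     b"plain ascii".decode("utf-8")    # -> 'plain ascii'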
| plorkyeran wrote:
| I in fact can assume it. If the assumption is wrong then
| that's someone else's problem. 15 years ago I wrote a bunch
| of code using uchardet to detect encodings and it was a
| pretty useful feature at the time. In the last decade
| everything I've touched has required UTF-8 unless it's been
| interoperating with a specific legacy system which has some
| other fixed charset, and it's never been an issue.
| fl7305 wrote:
| That comparison doesn't hold.
|
| If you're dealing with lengths, you can get input data in
| meters, centimeters, millimeters, inches, feet, etc.
|
| If the input data is human heights, would you automatically
| assume meters even if the input data is "183"?
|
| If the input data is the weight of humans, would you always
| assume grams, even if the input data is "75"?
| asddubs wrote:
| would you guess the unit instead of specifying what you
| expect?
| fl7305 wrote:
| > would you guess the unit instead of specifying what you
| expect?
|
| It depends on the circumstances. It might be the least bad
| thing to do. Or not.
|
| But that wasn't my point. I replied to this:
|
| > I can assume the length of a meter or the weight of a
| gram
|
| Sure, the length of a meter and the "weight" of a gram are
| both standardized. (To be very picky, "gram" is a mass, not
| a weight. The actual weight depends on the "g" constant,
| which on average is 9.81 m/s^2 on Earth, but can vary by about
| 0.5%.)
|
| So if you know the input is in meters, you don't need to do
| any further processing.
|
| But dealing with input text files with an unknown encoding
| is like dealing with input lengths with an unknown unit.
|
| So while UTF-8 itself might be standardized, it is not the
| same as all input text files always being in UTF-8.
|
| You can choose to say that all input text files must be in
| valid UTF-8, or the program refuses to load them. Or you
| can use silent heuristics. Or something in between.
| nijave wrote:
| Microsoft would like a word with you (utf-8-bom & utf-16)
| paulddraper wrote:
| You can assume encoding is UTF-8, length is in meters, and
| timezone is UTC.
|
| -
|
| You just won't always be right.
| pyuser583 wrote:
| Universal UTF-8 is a hope we aspire to, not a reality we
| assume.
| fiddlerwoaroof wrote:
| The way you get that reality is you do the opposite of the
| recommendation of Postel's law: be very picky about what you
| consume and fail loudly if it's not UTF-8
| treflop wrote:
| There's a difference between assuming and not making a
| distinction.
|
| Very few developers I've met could make that distinction.
| They'd see a few off characters and think it's some one-off bug,
| when really it's because both sides are assuming an encoding.
|
| Even if you said you'd pay them one billion dollars to fix it,
| they'd absolutely be unable to.
| pronoiac wrote:
| Archive copy:
| https://web.archive.org/web/20240429061925/https://csvbase.c...
| o11c wrote:
| Default UTF-8 is better than the linked suggestion of using a
| heuristic, but failing catastrophically when old data is
| encountered is unacceptable. There _must_ be a fallback.
|
| (Note that the heuristic for "is this intended to be UTF-8" is
| pretty reliable, but most other encoding-detection heuristics are
| of very poor quality.)
| dandigangi wrote:
| Except I can
| Veserv wrote:
| Off-topic, but the bit numbering convention is deliciously
| confusing.
|
| Little-endian bytes (lowest byte is leftmost) and big-endian bits
| (bits contributing less numerical value are rightmost) are
| normal, but the bits are referenced/numbered little-endian (first
| bit is leftmost even though it contributes the most numerical
| value). When I first read the numbering convention I thought it
| was going to be a breath of fresh air of someone using the much
| more sane, but non-standard, little-endian bits with little-
| endian bytes, but it was actually another layered twist.
| Hopefully someday English can write numbers little-endian, which
| is objectively superior, and do away with this whole mess.
| kstrauser wrote:
| > Hopefully someday English can write numbers little-endian,
| which is objectively superior
|
| Upon reading this, I threw my laptop out the window.
| bandyaboot wrote:
| > In the most popular character encoding, UTF-8, character number
| 65 ("A") is written:
|
| > 01000001
|
| > Only the second and final bits are 1, or "on".
|
| Isn't it more accurate to say that the first and penultimate bits
| are 1, or "on"?
| fl7305 wrote:
| It depends on whether your bit numbering is like x86 (your
| description), or PowerPC (left most bit is 0).
| duskwuff wrote:
| Basically everyone uses x86 bit numbering. It has the
| pleasant property that the place value of every bit is always
| 2^n (or -2^n for a sign bit), and zero-extending a value
| doesn't change the numbering of its bits.
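|
| A small Python sketch of that numbering, reusing the article's
| 'A' = 65 example:
|
|     n = ord("A")                          # 65 == 0b01000001
|     format(n, "08b")                      # '01000001'
|     [i for i in range(8) if n >> i & 1]   # -> [0, 6] under LSB-0 numbering
|     sum(2**i for i in (0, 6))             # -> 65; bit i always has place value 2**i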
| fl7305 wrote:
| Sure, it is by far the industry standard.
|
| It works much better for handling discrete integers.
|
| Once you get into bitfield instructions, it is nice to have
| bit 0 be the "leftmost bit".
| bandyaboot wrote:
| The more I thought it through, the more I felt that, even
| assuming x86, there's just no "correct" way to casually
| reference bit
| positions when we read them in the opposite order from the
| machine. Are they being referenced from the perspective of a
| human consumer of text, or the machine's perspective as a
| consumer of bits? If I were writing that content, I'd have a
| difficult time deciding on which to use. If I were writing
| for a lay person, referencing left-to-right seems obvious,
| but in this case where the audience is primarily developers,
| it becomes much less obvious.
| vitaut wrote:
| This is so spectacularly outdated. KOI-8 has been dead for ages.
| JonChesterfield wrote:
| How about assume utf-8, and if someone has some binary file
| they'd rather a program interpret as some other format, they turn
| it into utf-8 using a standalone program first. Instead of
| burning this guess-what-bytes-they-might-like nonsense into all
| the software.
|
| We don't go "oh that input that's supposed to be json? It looks
| like a malformed csv file, let's silently have a go at fixing
| that up for you". Or at least we shouldn't, some software
| probably does.
| zarzavat wrote:
| Agreed. Continuing to support other encodings is like insisting
| that cars should continue to have cassette tape players.
|
| It's much easier to tell the people with old cassette tapes to
| rip them, rather than try to put a tape player in every car.
| fl7305 wrote:
| > It's much easier to tell the people with old cassette tapes
| to rip them
|
| I assume you mean "rip them", as in transcode to a different
| format?
|
| In that case, you need a tool that takes the old input
| format(s) and converts them to the new format.
|
| For text files, you'd need a tool that takes the old text
| files with various encodings and converts them to UTF-8.
|
| Isn't the point of the article to describe how an engineer
| would create such a tool?
| amarshall wrote:
| > some software probably does.
|
| Browsers do, kind of https://mimesniff.spec.whatwg.org/#rules-
| for-identifying-an-...
| fl7305 wrote:
| > they turn it into utf-8 using a standalone program first
|
| I took the article to be for people who would be writing that
| "standalone program"?
|
| I have certainly been in a position where I was the person who
| had to deal with input text files with unknown encodings. There
| was no-one else to hand off the problem to.
| dublin wrote:
| Make your life easy. Assume 7-bit ASCII. No one needs all those
| other characters, anyway...
| AlienRobot wrote:
| Do we really need 128 permutations just to express an alphabet
| of 26 letters?
|
| I think we should use a 4 bit encoding.
|
| 0 - NUL
|
| 1-7 - aeiouwy
|
| 8 - space
|
| 9-12 - rst
|
| 13-15 - modifiers
|
| When modifier bits are set, the values of the next half-byte
| change to represent the rest of the alphabet, numbers, symbols,
| etc. depending on the bits set.
| drdaeman wrote:
| Anyone got EBCDIC on their bingo cards? Because if the argument
| is "legacy encodings are still relevant in 2024" then we also
| need to bring EBCDIC (and EBCDIK and UTF-EBCDIC for more
| perverted fun) into the picture. Makes heuristics extra fun.
|
| Or, you know, just say "nah, I can; that ancient stuff doesn't
| matter anymore (outside of obligatory exceptions, like software
| archeology)." If someone wants to feed me a KOI8-R or JIS
| X 0201 CSV heirloom, they should convert it into something modern
| first.
| AlienRobot wrote:
| "You can't assume a 32 bit integer starts from 0"
| Karellen wrote:
| Don't worry, I never assume UTF-8.
|
| I _require_ UTF-8. If it isn't currently UTF-8, it's someone
| else's problem to transform it to UTF-8 first. If they haven't,
| and I get non-UTF-8 input, I'm fine bailing on that with a
| "malformed input - please correct" error.
| lmm wrote:
| So you're fine excluding anyone from Japan who wants their name
| displayed correctly?
| pavel_lishin wrote:
| Does the entirety of Japan shun utf8?
| koito17 wrote:
| Many Japanese websites have migrated from Shift-JIS to
| UTF-8, but this still ignores the fact that e.g. television
| captioning uses special characters[1] that are not found in
| UTF-8 or Shift-JIS. Windows itself has a habit of using its
| own Windows-932 encoding, which frequently causes problems
| in the Unix software I use. (e.g. Emacs fails at auto-
| detecting this format, and instructing Emacs to use Shift-
| JIS will result in decoding issues)
|
| [1] section 2 in https://www.fontucom.com/pdf/AFSARIBR-
| Map.pdf
| java-man wrote:
| This is interesting. Can you show which Japanese names cannot
| be encoded in UTF-8 please?
| kstrauser wrote:
| They cannot.
| Karellen wrote:
| If the Unicode consortium haven't been able to come up with a
| way of encoding their name correctly, I don't see what hope I
| have of doing so.
|
| Bonus - as soon as the Unicode consortium do find a way, my
| software should be able to handle it with no further changes.
| Well, it might need a recompile against a newer `libicu` as I
| don't think they maintain ABI backcompat between versions.
| But there's not much I can do about that.
| gabrielhidasy wrote:
| Are there Japanese characters missing in UTF-8? They should
| be added ASAP.
|
| I know there's a weird Chinese/Japanese encoding problem
| where characters that kind-of look alike have the same
| character id, and the font file is responsible for
| disambiguation (terrible for multi-language content and we
| should really add more characters to create versions for
| each, but still the best we have).
| koito17 wrote:
| I don't think there are any missing. However, the latter
| statement is true and ruins the typography of countless
| things :(
|
| The most common example I can think of is the following:
|
| Mi Pian in Chinese is subtly different from Japanese. See
| https://i.imgur.com/PA4mTME.jpeg ... left-hand side is the
| Chinese way of writing, right-hand side is Japanese. In
| variety shows, where you will see lots of telops (on-screen
| captions) and different fonts in use, you may catch both variants!
| ranger_danger wrote:
| Unless you're using an OS older than Windows 2000, or a linux
| distro from the 2000s, where some form of Unicode was not the
| default encoding, or maybe an ancient Win32 program compiled
| without "UNICODE" defined, it shouldn't be a problem. I
| specifically work with a lot of Japanese software and have
| not seen this problem in many years.
|
| And even back in the mid 2000s, the only real problems I saw
| otherwise were things like malformed html pages that assumed
| a specific encoding that they wouldn't tell you, or an MP3
| file with an ID3 tag with CP932 shoved into it against the
| (v1) spec.
|
| I also disagree with the author that Shift-JIS can be
| heuristically detected "good enough", due to its use of both
| 7-bit and 8-bit values in the lead and trail bytes, which mean
| different things depending on which character is actually
| intended. Even string searching requires a complex custom-made
| version just for Shift-JIS handling.
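|
| A small Python illustration of that overlap, using the commonly
| cited example of U+8868 (whose Shift-JIS trail byte collides
| with the ASCII backslash):
|
|     # U+8868 encodes in Shift-JIS as the byte pair 0x95 0x5C, and
|     # 0x5C alone is the ASCII backslash, so a naive byte-level
|     # search "finds" a backslash in the middle of a character.
|     data = "C:\u8868".encode("shift_jis")   # b'C:\x95\\'
|     b"\\" in data                            # -> True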
| okanat wrote:
| What a bad, hyperbolic take. UTF-8 can encode the entire
| Unicode space. All you need is up-to-date libraries and fonts
| to display the codepoints correctly. It is backwards
| compatible forever. So requiring UTF-8 allows Japanese users to
| represent their writing system exactly as it is and keep that
| scheme for a very long time, with room to improve.
| toast0 wrote:
| My understanding is Unicode (and therefore UTF-8) can encode
| all the codepoints encodable by Shift JIS. I know that you
| need a language context to properly display the codepoints
| that have been Han Unified, so that could lead to display
| problems. But if we're trying to properly display a Japanese
| name, it's probably easier to put the appropriate language
| context in a UTF-8 document than it is to embed Shift JIS
| text into a UTF-8 document.
|
| Realistically --- if someone hands me well marked Shift JIS
| content, I'm just going to reencode it as UTF-8 anyway... And
| if they hand me unmarked Shift JIS content, I'll try to see
| if I can decode it as UTF-8 and throw it away as invalid if
| not.
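|
| The reencode step is only a couple of lines in Python (the file
| names here are hypothetical):
|
|     with open("legacy.csv", "rb") as f:        # input known to be Shift JIS
|         text = f.read().decode("shift_jis")
|     with open("legacy.utf8.csv", "w", encoding="utf-8") as f:
|         f.write(text)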
| fl7305 wrote:
| That works until you can't pay your bills unless you take a new
| contract where you have to deal with a large number of
| historical text files from various sources.
|
| Then it's no longer "someone else's problem".
| groestl wrote:
| I will assume it, I will enforce it where I can, and I will fight
| tooth and nail should push come to shove.
|
| I got 99 problems, but charsets ain't one of them.
| zadokshi wrote:
| Better to assume UTF8 and fail with a clear message/warning. Sure
| you can offer to guess to help the end user if it fails, but as
| other people have pointed out, it's been standard for a long time
| now. Even python caved and accepted it as the default:
| https://peps.python.org/pep-0686/
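|
| A minimal sketch of the fail-loudly approach (the file path is
| hypothetical):
|
|     import sys
|
|     try:
|         with open("input.csv", encoding="utf-8") as f:
|             text = f.read()
|     except UnicodeDecodeError as e:
|         sys.exit(f"input.csv is not valid UTF-8 (byte offset {e.start}); "
|                  "please convert it to UTF-8 and retry")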
| smeagull wrote:
| I absolutely can. If it's not UTF-8, I assume it's worthless.
| koito17 wrote:
| The comments in this thread are a bit amusing.
|
| I wish I could live in the world where I could bluntly say "I
| will assume UTF-8 and ignore the rest of the world". Many
| Japanese documents and sites still use Shift JIS. Windows has
| this strange Windows-932 format that you will frequently
| encounter in CUE files outputted by some CD ripping software.
| ARIB STD-B24, the captioning standard used in Japanese
| television, has its own text encoding with characters not found
| in either JIS X 0201 or JIS X 0208. These special characters are
| mostly icons used in traffic and weather reports, but transcoding
| to UTF-8 still causes trouble with these icons.
| bongodongobob wrote:
| Most people on this site probably live in the world where
| everything is done in English. That's the norm for the vast
| majority of businesses and people in the US.
| samatman wrote:
| It's more like "I will assume UTF-8 and ignore edge case
| encoding problems which still arise in Japan, for some strange
| reason".
|
| We are not running short on Unicode codepoints. I'm sure they
| can spare a few more to cover the Japanese characters and icons
| which invariably get mentioned any time this subject comes up
| on HN. I don't know why it hasn't happened and I won't be
| making it my problem to solve. Best I can do is update to
| version 16 when it's released.
| koito17 wrote:
| I only bring up Japanese because I deal with Japanese text
| every day. I _could_ mention Chinese documents and sites
| frequently using GB / GBK to save on space (since such
| encodings use exactly 2 bytes per character whereas the
| average size in UTF-8 is strictly larger than 2 bytes). But I
| am not very familiar with it.
| norir wrote:
| If it's turtles all the way down and at every level you use
| utf-8, it's hard to see how any input with a different encoding
| (for the same underlying text) will not be detected before any
| unintended side effects are invoked.
|
| At this point, I don't see any sufficiently good reason to not
| use utf-8 exclusively in any new system. Conversions to and from
| other encodings would only be done at well defined boundaries
| when I'm calling into dependencies that require non utf-8 input
| for whatever reason.
| kstrauser wrote:
| If you give me a computer timestamp without a timezone, I can and
| will assume it's in UTC. It might not be, but if it's not and I
| process it as though it is, and the sender doesn't like the
| results, that's on them. I'm willing to spend approximately zero
| effort trying to guess what nonstandard thing they're trying to
| send me unless they're paying me or my company a whole lot of
| money, in which case I'll convert it to UTC upon import and
| continue on from there.
|
| Same with UTF-8. Life's too short for bothering with anything
| else today. I'll deal with some weird janky encoding for the
| right price, but the first thing I'd do is convert it to UTF-8.
| Damned if I'm going to complicate the innards of my code with
| special case code paths for non-UTF-8.
|
| If there were some inherent issue with UTF-8 that made it
| significantly worse than some other encoding for a given task,
| I'd be sympathetic to that explanation and wouldn't be such a
| pain in the neck about this. For instance, if it were the case
| that it did a bad job of encoding Mandarin or Urdu or Xhosa or
| Persian, and the people who use those languages strongly
| preferred to use something else, I'd understand. However, I've
| never heard a viable explanation for _not_ using UTF-8 other than
| legacy software support, and if you want to continue to use
| something ancient and weird, it's on you to adapt it to the rest
| of the world, because the rest of the world is definitely not
| going to adapt to you.
| kccqzy wrote:
| > For instance, if it were the case that it did a bad job of
| encoding Mandarin
|
| I don't know if you picked this example on purpose, but using
| UTF-8 to encode Chinese is 50% larger than the old encoding
| (GB2312). I remember people cared about this like twenty years
| ago. I don't know of anyone that still cares about this
| encoding inefficiency.
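|
| The 50% figure is easy to reproduce in Python (the sample string
| is arbitrary):
|
|     s = "\u4f60\u597d\u4e16\u754c"    # four common CJK characters
|     len(s.encode("gb2312"))           # -> 8 bytes (2 per character)
|     len(s.encode("utf-8"))            # -> 12 bytes (3 per character), 50% larger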
| PeterisP wrote:
| A key aspect is that nowadays we rarely encode pure text -
| while other encodings are more efficient for encoding pure
| Mandarin, nowadays a "Mandarin document" may be an HTML or
| JSON or XML file where less than half of the characters are
| from CJK codespace, and the rest come from all the formatting
| overhead which is in the 7-bit ASCII range, and UTF-8 works
| great for such combined content.
| jcranmer wrote:
| I haven't seen discussion of this point yet, but the post
| completely fails to provide any data to back up its assertion
| that charset detection heuristics work, because the feedback
| I've seen from people who actually work with charsets is that they
| largely _don't_ (especially if they're based on naive one-byte
| frequency analysis). Okay, sure, it works if you want to
| distinguish between KOI8-R and Windows-1252, but what about
| Windows-1252 and Windows-1257?
|
| See for example this effort in building a universal charset
| detector in Gecko:
| https://bugzilla.mozilla.org/show_bug.cgi?id=1551276
| toast0 wrote:
| I've done some charset detection, although it's been a while.
| Heuristics kind of work for some things --- I'm a big fan of "if
| it's decodable as utf-8, it's probably utf-8", unless there are
| zero bytes (in most text). If there are a lot of zero bytes,
| maybe it's UCS-2 or UTF-16, and you can try to figure out the
| byte order and if it decodes as utf-16.
|
| If it doesn't fit in those categories, you've got a much harder
| guessing game. But usually you can't actually ask the source
| what it is, because they probably don't know and might not
| understand the question or might not be contactable. Usually,
| you have to guess something, so you may as well reuse someone
| else's detection work to make the guess, if you don't have
| better information.
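|
| Roughly that ordering as a Python sketch (a rough heuristic, not
| production code):
|
|     def guess_decode(data: bytes) -> str:
|         # a BOM makes the UTF-16 case unambiguous
|         if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
|             return data.decode("utf-16")
|         # lots of NUL bytes usually means a 16-bit encoding, not UTF-8
|         if data.count(0) > len(data) // 4:
|             for enc in ("utf-16-le", "utf-16-be"):
|                 try:
|                     return data.decode(enc)
|                 except UnicodeDecodeError:
|                     pass
|         # otherwise: if it decodes as UTF-8, it almost certainly is UTF-8
|         return data.decode("utf-8")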
___________________________________________________________________
(page generated 2024-04-29 23:00 UTC)