[HN Gopher] How UTF-8 Works
___________________________________________________________________
How UTF-8 Works
Author : SethMLarson
Score : 206 points
Date : 2022-02-08 14:56 UTC (8 hours ago)
(HTM) web link (sethmlarson.dev)
(TXT) w3m dump (sethmlarson.dev)
| brian_rak wrote:
| This was presented well. A follow up for unicode might be in
| order!
| SethMLarson wrote:
| Glad you enjoyed! Unicode and how it interacts with other
| aspects of computers (IDNA, NFKC, grapheme clusters, etc.) is
| one of the areas I want to explore more.
| dspillett wrote:
| Not sure if the issue is with Chrome or my local config generally
| (bog standard Windows, nothing fancy), but the us-flag example
| doesn't render as intended. It shows as "US", with the components
| in the next step being "U" and "S" (not the ASCII characters U &
| S -- the encoding is as intended, but those glyphs are shown in
| place of the intended regional-indicator symbols).
|
| Displays as I assume intended in Firefox on the same machine:
| American flag emoji then when broken down in the next step U-in-
| a-box & S-in-a-box. The other examples seem fine in Chrome.
|
| Take care when using relatively new additions to the Unicode
| emoji set; test to make sure your intentions are displayed
| correctly in all the browsers you might expect your audience to
| be using.
| SethMLarson wrote:
| Yeah, there's not much I can do there unfortunately (since I'm
| using SVG with the actual U and S emojis to show the flag). I
| can't comment on whether it's your config or not, but I've
| tested the SVGs on iOS and Firefox/Chrome on desktop to make
| sure they rendered nicely for most people. Sorry you aren't
| getting a great experience there.
|
| Here's how it's rendering for me on Firefox:
| https://pasteboard.co/rjLtqANVQUIJ.png
| xurukefi wrote:
| For me it also renders like this on Chrome/Windows:
| https://i.imgur.com/HCJTpfA.png
|
| Really nice diagrams nevertheless
| andylynch wrote:
| They aren't new (2010) - this is a Windows thing - speculation
| is it's a policy decision to avoid awkward conversations with
| various governments (presumably large customers) about TW, PS
| and others -- see long discussion here for instance
| https://answers.microsoft.com/en-us/windows/forum/all/flag-e...
| zaik wrote:
| Those diagrams look really good. How were they made?
| jeremieb wrote:
| The author mentions at the end of the article that he spent a
| lot of time on https://www.diagrams.net/. :)
| nayuki wrote:
| Excellent presentation! One improvement to consider is that many
| usages of "code point" should be "Unicode scalar value" instead.
| Basically, you don't want to use UTF-8 to encode UTF-16 surrogate
| code points (which are not scalar values).
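|
| A quick way to see the scalar-value distinction in practice is
| Python 3, which refuses to encode lone surrogates by default (a
| small illustration of my own, not from the article):
|       # U+1F602 is a scalar value, so it encodes fine:
|       "\U0001F602".encode("utf-8")   # b'\xf0\x9f\x98\x82'
|       # U+D83D is a surrogate code point, not a scalar value,
|       # so the UTF-8 codec rejects it:
|       "\ud83d".encode("utf-8")       # raises UnicodeEncodeError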
|
| Fun fact, UTF-8's prefix scheme can cover up to 31 payload bits.
| See https://en.wikipedia.org/wiki/UTF-8#FSS-UTF , section "FSS-
| UTF (1992) / UTF-8 (1993)".
|
| A manifesto that was much more important ~15 years ago when UTF-8
| hadn't completely won yet: https://utf8everywhere.org/
| masklinn wrote:
| > Fun fact, UTF-8's prefix scheme can cover up to 31 payload
| bits.
|
| It'd probably be more correct to say that it was originally
| defined to cover 31 payload bits: you can easily complete the
| first byte to get 7- and 8-byte sequences (36- and 42-bit
| payloads).
|
| Alternatively, you could save the 11111111 leading byte to flag
| the following bytes as counts (5 bits each since you'd need a
| flag bit to indicate whether this was the last), then add the
| actual payload afterwards, this would give you an infinite-size
| payload, though it would make the payload size dynamic and
| streamed (where currently you can get the entire USV in two
| fetches, as the first byte tells you exactly how many
| continuation bytes you need).
| SethMLarson wrote:
| Yeah the current definition is restricted to 4 octets in RFC
| 3629. Really interesting to see the history of ranges UTF-8
| was able to cover.
| CountSessine wrote:
| _Basically, you don't want to use UTF-8 to encode UTF-16
| surrogate code points_
|
| The awful truth is that there is such a beast: a UTF-8 wrapper
| around UTF-16 surrogate pairs.
|
| https://en.wikipedia.org/wiki/CESU-8
| nayuki wrote:
| Is CESU-8 a synonym of WTF-8?
| https://en.wikipedia.org/wiki/UTF-8#WTF-8 ;
| https://simonsapin.github.io/wtf-8/
| bhawks wrote:
| UTF-8 is one of the most momentous and underappreciated /
| relatively unknown achievements in software.
|
| A sketch on a diner placemat has led to every person in the
| world being able to communicate written language digitally using
| a common software stack. Thanks to Ken Thompson and Rob Pike we
| have avoided the deeply siloed and incompatible world that code
| pages, wide chars and other insufficient encoding schemes were
| guiding us towards.
| cryptonector wrote:
| And stayed ASCII-compatible. And did not have to go to wide
| chars. And it does not suck. And it resynchronizes. And...
| ahelwer wrote:
| It really is wonderful. I was forced to wrap my head around it
| in the past year while writing a tree-sitter grammar for a
| language that supports Unicode. Calculating column position
| gets a whole lot trickier when the preceding codepoints are of
| variable byte-width!
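|
| For what it's worth, one way to recover the column (in
| codepoints) for a byte offset is just to count the
| non-continuation bytes before it. A rough Python sketch of my
| own (the helper name is made up):
|       def column_at(data: bytes, byte_offset: int) -> int:
|           """Count codepoints before byte_offset in UTF-8 data.
|           Tail octets match the bit pattern 10xxxxxx."""
|           return sum(1 for b in data[:byte_offset]
|                      if (b & 0xC0) != 0x80)
|
|       line = "déjà vu".encode("utf-8")
|       column_at(line, len(line))   # 7 codepoints, 9 bytes
| (Note this counts codepoints, not grapheme clusters.)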
|
| It's one of those rabbit holes where you can see people whose
| entire career is wrapped up in incredibly tiny details like
| what number maps to what symbol - and it can get real
| political!
| GekkePrutser wrote:
| It's great as a global character set and really enabled the
| world to move ahead at just the right time.
|
| But the whole emoji modifier thing (e.g. guy + heart + lips +
| girl = one kissing-couple character) is a disaster. Too many
| rules were made up on the fly, which makes building an accurate
| parser a nightmare. The standard should have either specified
| this strictly and consistently, or left it out for a future
| standard to implement and just used separate codepoints for the
| combinations that were really necessary.
|
| This complexity has also led to multiple vulnerabilities,
| especially on mobile.
|
| See here all the combos: https://unicode.org/emoji/charts/full-
| emoji-modifiers.html
| inglor_cz wrote:
| As a young Czech programming acolyte in the late 1990s, I had
| to cope with several competing 8-bit encodings. It was a pure
| nightmare.
|
| Long live UTF-8. Finally I can write any Central European name
| without mutilating it.
| [deleted]
| Simplicitas wrote:
| I still wanna know WHICH Jersey diner it was invented in! :-)
| jsrcout wrote:
| This may be the first explanation of Unicode representation that
| I can actually follow. Great work.
| SethMLarson wrote:
| Wow, thank you for the kind words. You've made my morning!!
| ctxc wrote:
| Such clean presentation, refreshing.
| RoddaWallPro wrote:
| I spent 2 hours last Friday trying to wrap my head around what
| UTF-8 was (https://www.joelonsoftware.com/2003/10/08/the-
| absolute-minim is great, but doesn't explain the inner workings
| like this does) and completely failed, could not understand it.
| This made it super easy to grok, thank you!
| karsinkk wrote:
| The following article is one of my favorite primers on Character
| sets/Unicode : https://www.joelonsoftware.com/2003/10/08/the-
| absolute-minim...
| jokoon wrote:
| I wonder how large a font must be to display all UTF-8
| characters...
|
| I'm also waiting for new emoji; they've recently been adding
| more and more that can be used as icons, which is simpler than
| integrating PNG or SVG icons.
| banana_giraffe wrote:
| OpenType makes this impossible: a glyph index is a uint16, so a
| single font can't fit all of the ~143k Unicode characters.
|
| There are some attempts at font families to cover the majority
| of characters. Like Noto ( https://fonts.google.com/noto/fonts
| ), broken out into different fonts for different regions.
|
| Or, Unifont's ( http://www.unifoundry.com/ ) goal of gathering
| the first 65536 code points in one font, though it leaves a lot
| to be desired if you actually use it as a font.
| dspillett wrote:
| Take care using recently added Unicode entries unless you have
| some control over your user base and when they update, or are
| providing a custom font that you know has those items
| represented. You could be giving a broken-looking UI to many
| users if their setup does not render the newly assigned codes
| correctly.
| jvolkman wrote:
| Rob Pike wrote up his version of its inception almost 20 years
| ago.
|
| The history of UTF-8 as told by Rob Pike (2003):
| http://doc.cat-v.org/bell_labs/utf-8_history
|
| Recent HN discussion:
| https://news.ycombinator.com/item?id=26735958
| filleokus wrote:
| Recently I learned about UTF-16 when doing some stuff with
| PowerShell on Windows.
|
| In parallel with my annoyance at Microsoft, I realized how long
| it's been since I encountered any kind of text-encoding drama. As
| a regular typer of åäö, many hours of my youth were spent
| configuring shells, terminal emulators, and IRC clients to use
| compatible encodings.
|
| The wide adoption of UTF-8 has been truly awesome. Let's just
| hope it's another 15-20 years until I have to deal with UTF-16
| again...
| legulere wrote:
| There's increasing support for UTF-8 as an ANSI code page on
| Windows. And UTF-8 support is also part of the modernization of
| the terminal:
| https://devblogs.microsoft.com/commandline/windows-command-l...
| ChrisSD wrote:
| There are many reasons why UTF-8 is a better encoding but
| UTF-16 does at least have the benefit of being simpler. Every
| scalar value is either encoded as a single unit or a pair of
| units (leading surrogate + trailing surrogate).
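|
| For example, here's roughly how a scalar value above U+FFFF
| turns into a surrogate pair (a sketch of my own, not any
| particular library's code):
|       def to_utf16_units(cp: int) -> list[int]:
|           """Encode one Unicode scalar value as UTF-16 units."""
|           if cp <= 0xFFFF:
|               return [cp]                  # single unit (BMP)
|           v = cp - 0x10000                 # 20 bits to split
|           return [0xD800 | (v >> 10),      # leading surrogate
|                   0xDC00 | (v & 0x3FF)]    # trailing surrogate
|
|       to_utf16_units(0x1F602)   # -> 0xD83D, 0xDE02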
|
| However, Powershell (or more often the host console) has a lot
| of issues with handling Unicode. This has been improving in
| recent years but it's still a work in progress.
| masklinn wrote:
| > There are many reasons why UTF-8 is a better encoding but
| UTF-16 does at least have the benefit of being simpler. Every
| scalar value is either encoded as a single unit or a pair of
| units (leading surrogate + trailing surrogate).
|
| UTF16 is really not noticeably simpler. Decoding UTF8 is
| really rather straightforward in any language which has even
| minimal bit-twiddling abilities.
|
| And that's assuming you need to write your own encoder or
| decoder, which seems unlikely.
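|
| To illustrate, a naive decoder fits in a handful of lines (a
| sketch of my own; it skips validation of overlong forms,
| surrogates and truncated input):
|       def decode_utf8(data: bytes) -> list[int]:
|           """Naive UTF-8 decoder: bytes -> list of code points."""
|           out, i = [], 0
|           while i < len(data):
|               b = data[i]
|               if b < 0x80:                 # 0xxxxxxx
|                   cp, extra = b, 0
|               elif b >> 5 == 0b110:        # 110xxxxx
|                   cp, extra = b & 0x1F, 1
|               elif b >> 4 == 0b1110:       # 1110xxxx
|                   cp, extra = b & 0x0F, 2
|               else:                        # 11110xxx
|                   cp, extra = b & 0x07, 3
|               for j in range(1, extra + 1):
|                   cp = (cp << 6) | (data[i + j] & 0x3F)
|               out.append(cp)
|               i += extra + 1
|           return out
|
|       decode_utf8(b"h\xf0\x9f\x98\x82")   # [104, 128514],
|                                           # i.e. U+0068, U+1F602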
| Fill1200 wrote:
| I have a MySQL database with a large amount of Japanese text
| data. When I convert it from UTF-8 to UTF-16, it certainly
| reduces disk space.
| tialaramex wrote:
| UTF-16 _only_ makes sense if you were sure UCS-2 would be fine,
| and then oops, Unicode is going to be more than 16 bits, so UCS-2
| won't work and you need to somehow cope anyway. It makes zero
| sense to adopt it in greenfield projects today, whereas Java and
| Windows, which had bought into UCS-2 back in the early-mid 1990s,
| needed UTF-16 or else they would have had to throw all their
| 16-bit text APIs away and start over.
|
| UTF-32 / UCS-4 is fine but feels very bloated, especially if a
| lot of your text data is more or less ASCII -- which, unless it's
| literally human text, it usually will be. It feels a bit bloated
| even on a good day (it's always wasting 11 bits per character!).
|
| UTF-8 is a little more complicated to handle than UTF-16 and
| certainly than UTF-32 but it's nice and compact, it's pretty
| ASCII compatible (lots of tools that work with ASCII also
| work fine with UTF-8 unless you insist on adding a spurious
| UTF-8 "byte order mark" to the front of text) and so it was a
| huge success once it was designed.
| ChrisSD wrote:
| As I said, there are many reasons UTF-8 is a better
| encoding. And indeed compact, backwards compatible,
| encoding of ASCII is one of them.
| glandium wrote:
| It is less compact than UTF-16 for CJK languages, FWIW.
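|
| A quick illustration with an arbitrary string of my own (Python):
|       s = "日本語のテキスト"           # 8 CJK/kana characters
|       len(s.encode("utf-8"))       # 24 bytes (3 per character)
|       len(s.encode("utf-16-le"))   # 16 bytes (2 per character)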
| nwallin wrote:
| > There are many reasons why UTF-8 is a better encoding but
| UTF-16 does at least have the benefit of being simpler.
|
| Big endian or little endian?
| cryptonector wrote:
| LOL
| ts4z wrote:
| And did they handle surrogate pairs correctly?
|
| My team managed a system that read user data and did input
| validation. One day we got a smart-quote
| character that happened to be > U+10000. But because the
| data validation happened in chunks, we only got half of it.
| Which was an invalid character, so input validation failed.
|
| In UTF-8, partial characters happen so often, they're
| likely to get tested. In UTF-16, they are more rarely seen,
| so things work until someone pastes in emoji and then it
| falls apart.
| [deleted]
| daenz wrote:
| Great explanation. The only part that tripped me up was in
| determining the number of octets to represent the codepoint. From
| the post:
|
| >From the previous diagram the value 0x1F602 falls in the range
| for a 4 octets header (between 0x10000 and 0x10FFFF)
|
| Relying on the diagram in the post felt like a crutch. It
| seems easier to remember the maximum number of "data" bits that
| each octet layout can support (7, 11, 16, 21). Then by knowing
| that 0x1F602 maps to 11111011000000010, which is 17 bits, you
| know it must fit into the 4-octet layout, which can hold 21 bits.
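|
| That rule is small enough to sketch directly (my own
| illustration, with a made-up helper name):
|       def utf8_octets(cp: int) -> int:
|           """Octets needed for a scalar value, by payload bits:
|           1 octet holds 7 bits, then 11, 16 and 21."""
|           bits = max(cp.bit_length(), 1)
|           for octets, capacity in ((1, 7), (2, 11), (3, 16), (4, 21)):
|               if bits <= capacity:
|                   return octets
|           raise ValueError("beyond U+10FFFF")
|
|       utf8_octets(0x1F602)                # 17 payload bits -> 4
|       len("\U0001F602".encode("utf-8"))   # 4, which agrees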
| mananaysiempre wrote:
| As the continuation bytes always bear the payload in the low 6
| bits, Connor Lane Smith suggests writing them out in octal[1].
| That 3 octets of UTF-8 precisely cover the BMP is also quite
| convenient and easy to remember (though perhaps don't use that
| the way MySQL did[2]?..).
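|
| To make the octal trick concrete with the article's 0x1F602
| example (Python):
|       oct(0x1F602)                        # '0o373002'
|       ["%o" % b for b in "\U0001F602".encode("utf-8")]
|       # ['360', '237', '230', '202']
|       # the low two octal digits of each tail octet (37, 30, 02)
|       # are exactly the low six octal digits of the code point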
|
| [1] http://www.lubutu.com/soso/write-out-unicode-in-octal
|
| [2] https://mathiasbynens.be/notes/mysql-utf8mb4
| bumblebritches5 wrote:
| riwsky wrote:
| How UTF-8 works?
|
| "pretty well, all things considered"
| who-shot-jr wrote:
| Fantastic! Very well explained.
| SethMLarson wrote:
| Thanks for the kind comment :)
| BitwiseFool wrote:
| I feel the same way as the GP, great work. I also appreciate
| how clean and easy to read the diagrams are.
| nabla9 wrote:
| > _NOTE: You can always find a character boundary from an
| arbitrary point in a stream of octets by moving left an octet
| each time the current octet starts with the bit prefix 10 which
| indicates a tail octet. At most you'll have to move left 3
| octets to find the nearest header octet._
|
| This is incorrect. You can only find boundaries between code
| points this way.
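|
| A sketch of that scan (my own code), showing it lands on a
| codepoint boundary that is still inside a single user-perceived
| character, the US flag from the article:
|       def codepoint_boundary(data: bytes, i: int) -> int:
|           """Back up while byte i is a 10xxxxxx tail octet."""
|           while (data[i] & 0xC0) == 0x80:
|               i -= 1
|           return i
|
|       flag = "\U0001F1FA\U0001F1F8".encode("utf-8")  # 8 bytes
|       codepoint_boundary(flag, 5)  # 4: a codepoint boundary, but
|                                    # the middle of the flag glyph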
|
| Unicode seems cool until you learn that not all "user-perceived
| characters" (grapheme clusters) can be expressed as a single code
| point. These UTF-8 explanations explain the encoding but leave
| out this unfortunate detail. The author might not even know this,
| because they deal with a subset of Unicode in their life.
|
| If you want to split text between two user-perceived characters,
| rather than merely between code points, this tutorial does not
| help.
|
| Unicode encodings are great if you want to handle a subset of
| languages and characters; if you want to be complete, it's a
| mess.
| SethMLarson wrote:
| You're right, that should read "codepoint boundary" not
| "character boundary". I can fix that.
|
| I do briefly mention grapheme clusters near the end, didn't
| want to introduce them as this article was more about the
| encoding mechanism itself. Maybe a future article after more
| research :)
| nabla9 wrote:
| Please do. You have the best visualizations of UTF-8 I have
| seen so far.
|
| Usually people write up just the UTF-8 encoding part and don't
| mention the rest of Unicode, because it's clearly not as good and
| simple.
| mark-r wrote:
| UTF-8 is one of the most brilliant things I've ever seen. I only
| wish it had been invented and caught on before so many
| influential bodies started using UCS-2 instead.
| SethMLarson wrote:
| 100% agree, it's really rare that there's a ~blanket solution
| to a whole class of problems. "Just use UTF-8!"
| BiteCode_dev wrote:
| Like anything new, people had a hard time with it at the
| beginning.
|
| I remember that I got a home assignment in an interview for a
| PHP job. The person evaluating my code said I should not have
| used UTF8, which causes "compatibility problems". At the time,
| I didn't know better, and I answered that no, it was explicitly
| created to solve compatibility problems, and that they just
| didn't understand how to deal with encoding properly.
|
| Needless to say, I didn't get the job :)
|
| Same with Python 2 code. So many people, when migrating to
| Python 3, suddenly thought Python 3's encoding management was
| broken, since it was raising so many UnicodeDecodeErrors.
|
| Only much later did people realize the huge number of programs
| that couldn't deal with non-ASCII characters in file paths, HTML
| attributes or user names, because they just implicitly assumed
| ASCII. "My code used to work fine", they said. But it worked
| fine on their machine, set to an English locale, tested only
| using ASCII plain-text files in their ASCII-named directories
| with their ASCII last name.
| SAI_Peregrinus wrote:
| My Slack name at work is "This name is a valid POSIX path".
| My hope is that it serves as an amusing reminder to consider
| things like spaces and non-ASCII characters.
| andrepd wrote:
| That's in general a problem with dynamic languages with weak
| type systems: "your code runs without crashing" is really !=
| "your code works". How do people even manage production Python?
| A bug could be lurking anywhere, undetected until it's actually
| run. Whereas in a compiled language with a strong type system,
| "your code compiles" is much closer to "your code is correct".
| [deleted]
| digisign wrote:
| There are a number of mitigations, so those kinds of bugs
| are quite rare. In our large code base, about 98% of bugs
| we find are of the "we need to handle another case"
| variety. Pyflakes quickly finds typos which eliminates most
| of the rest.
| BiteCode_dev wrote:
| I don't think a type system can help you with decoding a
| file with the wrong charset.
| morelisp wrote:
| Python 3 encoding management was broken, because it tried to
| impose Unicode semantics on things that were actually byte
| streams. For anyone _actually correctly handling encodings_
| in Python 2 it was awful because suddenly the language
| runtime was hiding half the data you needed.
| BiteCode_dev wrote:
| Nowadays, passing bytes to any os function returns bytes
| objects, not unicode. You'll get strings if you pass string
| objects though, and they will use UTF-8 with surrogate
| escaping.
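|
| Roughly, on a POSIX system with a UTF-8 filesystem encoding
| (a small illustration of my own, not exhaustive):
|       import os
|       os.listdir(b".")  # filenames as raw bytes, nothing decoded
|       os.listdir(".")   # filenames as str, surrogateescape applied
|       os.fsdecode(b"caf\xe9")    # 'caf\udce9': the invalid byte
|                                  # survives as a lone surrogate
|       os.fsencode("caf\udce9")   # b'caf\xe9': round-trips back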
| junon wrote:
| There are (a few) very good reasons not to use UTF-8. It's a
| great encoding but not suitable for all cases.
|
| For example, constant time subscripting, or improved length
| calculations, are made possible by encodings other than utf-8.
|
| But when performance isn't critical, utf-8 should be the
| default. I don't see a reason for any other encoding.
| [deleted]
| jfk13 wrote:
| > For example, constant time subscripting, or improved length
| calculations, are made possible by encodings other than
| utf-8.
|
| Assuming you mean different encoding forms of Unicode (rather
| than entirely different and far less comprehensive character
| sets, such as ASCII or Latin-1), there are very few use cases
| where "subscripting" or "length calculations" would benefit
| significantly from using a different encoding form, because
| it is rare that individual Unicode code points are the most
| appropriate units to work with.
|
| (If you're happy to sacrifice support for most of the world's
| writing systems in favour of raw performance for a limited
| subset of scripts and text operations, that's different.)
| ninkendo wrote:
| Constant time subscripting is a myth. There's nothing(*)
| useful to be obtained by adding a fixed offset to the base of
| your string, in _any_ unicode encoding, including UTF-32.
|
| If you're hoping that a fixed offset gives you a user-
| perceived character boundary, then you're not handling
| composed characters or zero-width-joiners or any number of
| other things that may cause a grapheme cluster to be composed
| of multiple UTF code points.
|
| The "fixed" size of code points in encodings like UTF-32 are
| just that: code points. Whether a code point corresponds with
| anything useful, like the boundary of a visible character,
| will always require linear-time indexing of the string, in
| any encoding.
|
| (*) Approximately nothing. If you're in a position where
| you've somehow already vetted that the text is of a subset of
| human languages where you're guaranteed to never have
| grapheme clusters that occupy more than a single code point,
| then you maybe have a use case for this, but I'd argue you
| really just have a bunch of bugs waiting to happen.
| irq-1 wrote:
| > Constant time subscripting is a myth. There's nothing(*)
| useful to be obtained by adding a fixed offset to the base
| of your string, in any unicode encoding, including UTF-32.
|
| What about UTF-256? Maybe not today, maybe not tomorrow,
| but someday...
| ts4z wrote:
| I know you're kidding, but I want to note that UTF-256
| isn't enough. There's an Arabic ligature that decomposes
| into 20 codepoints. That was already in Unicode 20 years
| ago. You can probably do something even crazier with the
| family emoji. These make "single characters" that do not
| have precomposed forms.
| pjscott wrote:
| Also, if you want O(1) indexing by grapheme cluster you
| can get that with less memory overhead by precomputing a
| lookup table of the location in the string where you can
| find every k-th grapheme cluster, for some constant k >=
| 1. (This requires a single O(n) pass through the string
| to build the index, but you were always going to have do
| make at least one such pass through the string for other
| reasons.)
| wizzwizz4 wrote:
| Some characters are longer than 32 codepoints.
| josephg wrote:
| Absolutely. At least it's well supported now in very old
| languages (like C) and very new languages (like Rust). But
| Java, Javascript, C# and others will probably be stuck using
| UCS-2 forever.
| HideousKojima wrote:
| There's actually a proposal with a decent amount of support
| to add utf-8 strings to C#. Probably won't be added to the
| language for another 3 or 4 years (if ever) but it's not
| outside the realm of possibility.
|
| Edit: The proposal for anyone interested
| https://github.com/dotnet/runtime/issues/933
| stewx wrote:
| What is stopping people from encoding their Java, JS, and C#
| files in UTF-8?
| mark-r wrote:
| Nothing at all, and in fact there's a site set up
| specifically to advocate for this:
| https://utf8everywhere.org/
|
| The biggest problem is when you're working in an ecosystem
| that uses a different encoding and you're forced to convert
| back and forth constantly.
|
| I like the way Python 3 does it - every string is Unicode,
| and you don't know or care what encoding it is using
| internally in memory. It's only when you read or write to a
| file that you need to care about encoding, and the default
| has slowly been converging on UTF-8.
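|
| A trivial example of my own (the filename is made up); the
| encoding only appears at the file boundary:
|       with open("names.txt", "w", encoding="utf-8") as f:
|           f.write("José, 北京, Zoë\n")   # str in, UTF-8 on disk
|       with open("names.txt", encoding="utf-8") as f:
|           print(f.read())               # decoded back to str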
| maskros wrote:
| Nothing, but Java's "char" type is always going to be
| 16-bit.
| josephg wrote:
| Yep. In javascript (and Java and C# from memory) the
| String.length property is based on the encoding length in
| UTF16. It's essentially useless. I don't know if I've
| _ever_ seen a valid use for the javascript String.length
| field in a program which handles Unicode correctly.
|
| There are 3 valid (and useful) ways to measure a string
| depending on context:
|
| - Number of Unicode characters (useful in collaborative
| editing)
|
| - Byte length when encoded (these days usually in utf8)
|
| - and the number of rendered grapheme clusters
|
| All of these measures are identical in ASCII text - which
| is an endless source of bugs.
|
| Sadly these languages give you a deceptively useless
| .length property and make you go fishing when you want to
| make your code correct.
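|
| In Python terms, using the flag example from the article
| (grapheme clusters via the third-party regex module, assuming
| it's installed):
|       import regex                   # third-party; supports \X
|       s = "\U0001F1FA\U0001F1F8"     # the US flag
|       len(s)                         # 2 code points
|       len(s.encode("utf-8"))         # 8 bytes
|       len(regex.findall(r"\X", s))   # 1 grapheme cluster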
| tialaramex wrote:
| Java's char is a strong competitor for most stupid "char"
| type award.
|
| I would give it to Java outright if not for the fact that C's
| char type _doesn't define how big it is at all, nor whether it
| is signed_. In practice it's probably a byte, but you aren't
| actually promised that, and even if it is a byte you aren't
| promised whether this byte is treated as signed or unsigned;
| that's implementation-dependent. Completely useless.
|
| For years I thought char was just pointless, and even
| today I would still say that a high level language like
| Java (or Javascript) should not offer a "char" type
| because the problems you're solving with these languages
| are so unlikely to make effective use of such a type as
| to make it far from essential. Just have a string type,
| and provide methods acting on strings, forget "char". But
| Rust did show me that a strongly typed systems language
| might actually have some use for a distinct type here
| (Rust's char really does only hold the 21-bit Unicode
| Scalar Values, you can't put arbitrary 32-bit values in
| it, nor UTF-16's surrogate code points) so I'll give it
| that.
| mark-r wrote:
| The only guarantee that C gives you is that sizeof char
| == 1, and even that's not as useful as it looks.
| SAI_Peregrinus wrote:
| It also guarantees that char is at least 8 bits.
| jasode wrote:
| _> What is stopping [...] Java, JS, and C# files in UTF-8?_
|
| Files written to disk can be UTF-8. The continued use of UCS-2
| (later revised to UTF-16) is happening _in the runtime_, because
| things like the Win32 API that C# uses are UCS-2. The internal
| raw memory layout of strings in Win32 is UCS-2.
|
| *EDIT to add correction
| kevin_thibedeau wrote:
| Win32 narrow API calls support UTF-8 natively now.
| mark-r wrote:
| Code page 65001 has existed for a long time now, but it
| was discouraged because there were a lot of corner cases
| that didn't work. Did they finally get all the kinks out
| of it?
| kevin_thibedeau wrote:
| Yes. Applications can switch code page on their own.
| colejohnson66 wrote:
| UTF-16*, not UCS-2. Although there are probably many
| programs that assume UCS-2.
| mark-r wrote:
| When Windows adopted Unicode, I think the only encoding
| available was UCS-2. They converted pretty quickly to
| UTF-16 though, and I think the same is true of everybody
| else who started with UCS-2. Unfortunately UTF-16 has its
| own set of hassles.
| nwallin wrote:
| Note that the asterisk in `UTF-16*` is a _really_ big
| asterisk. I fixed a UCS-16 bug last week at my day job.
| DannyB2 wrote:
| There is an error in the first example under Giant Reference
| Card.
|
| The bytes come out as:
|
| 0xF0 0x9F 0x87 0xBA 0xF0 0x9F 0x87 0xBA
|
| but the bits directly above them are all of the bit pattern: 010
| 10111
| SethMLarson wrote:
| Great eye! I'll fix this and push it out.
| bussyfumes wrote:
| BTW here's a surprise I had to learn at some point: strings in JS
| are UTF-16. Keep that in mind if you want to use the console to
| follow this great article, you'll get the surrogate pair for the
| emoji instead.
| pierrebai wrote:
| I never understood why UTF-8 did not use the _much_ simpler
| encoding of:
|
|       - 0xxxxxxx -> 7 bits, ASCII compatible (same as UTF-8)
|       - 10xxxxxx -> 6 bits, more bits to come
|       - 11xxxxxx -> final 6 bits
|
| It has multiple benefits:
|
|       - It encodes more bits per octet: 7, 12, 18, 24 vs
|         7, 11, 16, 21 for UTF-8
|       - It is easily extensible for more bits.
|       - Such an extension is backward compatible for
|         reasonable implementations.
|
| The last point is key: UTF-8 would need to invent a new prefix to
| go beyond 21 bits. Old software would not know the new prefix and
| what to do with it. With the simpler scheme, they could
| potentially work out of the box up to at least 30 bits (that's a
| billion code points, much more than the mere million of 21 bits).
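|
| A decoder for that scheme would look something like this (my
| reading of the proposal, no error handling):
|       def decode(data: bytes) -> list[int]:
|           """0xxxxxxx is ASCII; otherwise accumulate 6-bit
|           groups from 10xxxxxx octets until an 11xxxxxx octet
|           ends the sequence."""
|           out, i = [], 0
|           while i < len(data):
|               b = data[i]; i += 1
|               if b < 0x80:
|                   out.append(b)
|                   continue
|               cp = b & 0x3F
|               while b < 0xC0:          # 10xxxxxx: more to come
|                   b = data[i]; i += 1
|                   cp = (cp << 6) | (b & 0x3F)
|               out.append(cp)
|           return out
|
|       decode(bytes([0x41]))          # [65], plain ASCII 'A'
|       decode(bytes([0x94, 0xCA]))    # [1290] == [0x50A]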
|
| nephrite wrote:
| From the Wikipedia article:
|
| Prefix code: The first byte indicates the number of bytes in
| the sequence. Reading from a stream can instantaneously decode
| each individual fully received sequence, without first having
| to wait for either the first byte of a next sequence or an end-
| of-stream indication. The length of multi-byte sequences is
| easily determined by humans as it is simply the number of high-
| order 1s in the leading byte. An incorrect character will not
| be decoded if a stream ends mid-sequence.
|
| https://en.wikipedia.org/wiki/UTF-8#Comparison_with_other_en...
| pierrebai wrote:
| "instantaneously" in the sense of first having to read the
| first byte to know how many bytes to read. So it's a two-step
| process. Given the current maximum length and SIMD, detecting
| the end-byte of my scheme is easily parallelizable for up to
| 4 bytes, which conveniently goes to 24 bits, enough for all
| current unicode code points, so there is no waiting for
| termination. Furthermore, decoding a UTF-8 character needs bit
| extraction and shifting of every byte, so there is no practical
| gain from not looking at each byte. It actually makes the
| decoding loop more complex.
|
| Also, the human-readability argument sounds fishy. Humans are
| really bad at decoding _high-order_ bits. For example, can you
| tell the length of a UTF-8 sequence that begins with 0xEC at a
| glance? With my scheme, either the high bit is not set (0x7F or
| less), which is easy to see: you only need to compare the first
| digit to 7. Or the high bit is set and the high nibble is less
| than 0xC, meaning there is another byte; also easy to see, you
| compare the first digit to C.
|
| The quote also implicitly mischaracterizes the fact that in my
| scheme an incorrect character would also not be decoded if
| interrupted, since it would lack the terminating flag (no byte
| >= 0xC0).
| masklinn wrote:
| UTF-8 as defined (or restricted) is a _prefix code_: it gets
| all relevant information on the first read, and the rest on the
| (optional) second. Your scheme requires an unbounded number of
| reads.
|
| > - It is easily extensible for more bits.
|
| UTF8 already is easily extensible to more bits, either 7
| continuation bytes (and 42 bits), or infinite. Neither of which
| is actually useful to its purposes.
|
| > The last point is key: UTF-8 would need to invent a new
| prefix to go beyond 21 bits
|
| UTF8 was defined as encoding 31 bits over 6 bytes. It was
| restricted to 21 bits (over 4 bytes) when unicode itself was
| restricted to 21 bits.
| cesarb wrote:
| > UTF8 already is easily extensible to more bits, either 7
| continuation bytes (and 42 bits), or infinite.
|
| Extending UTF-8 to 7 continuation bytes (or more) loses the
| useful property that the all-ones byte (0xFF) never happens
| in a valid UTF-8 string. Limiting it to 36 bits (6
| continuation bytes) would be better.
| edflsafoiewq wrote:
| Why is that useful?
| mananaysiempre wrote:
| You can use FF as a sentinel byte internally (I think
| utf8proc actually does that?); given that FE never
| occurs, either, if you see the byte sequence
| corresponding to U+FEFF BYTE ORDER MARK in one of the
| other UTFs you can pretty much immediately tell it can't
| possibly be UTF-8. (In general UTF-8, because of all the
| self-synchronization redundancy, has a very distinctive
| pattern that allows it to be detected with almost perfect
| reliability, and that is a frequent point of UTF-8
| advocacy, which lends some irony to the fact that UTF-8
| is the one encoding that Web browsers support but refuse
| to detect[1].) I don't think there is any other advantage
| to excluding FF specifically, it's not like we're using
| punched paper tape.
|
| [1] https://hsivonen.fi/utf-8-detection/
| xigoi wrote:
| You can use 11000000 and 11000001 as sentinel bytes, since a
| sequence beginning with them can't possibly be minimal.
| cryptonector wrote:
| And Unicode was restricted to 21 bits because of UTF-16.
| There is still the possibility of that restriction being
| lifted eventually.
| pierrebai wrote:
| No software decodes data by reading a stream byte-by-byte.
| Like I said in a previous comment, decoding 4 bytes using
| SIMD is possible and probably the best way to go.
| Furthermore, to actually decode, you need bit twiddling
| anyway, so you do need to do byte-processing. Finally, the
| inner loop for detecting character boundaries is simpler: the
| UTF-8 scheme, due to the variable-length prefixes, requires
| detecting the first 0 bit in the leading byte. It is probably
| written with a switch/case in C, vs two bit tests in my scheme.
| I'm not convinced UTF-8 ends up with a faster loop.
| LegionMammal978 wrote:
| The problem is that UTF-8 has the ability to detect and reject
| partial characters at the start of the string; this encoding
| would silently produce an incorrect character. Also, UTF-8 is
| easily extensible already: the bit patterns 111110xx, 1111110x,
| and 11111110 are only disallowed for compatibility with
| UTF-16's limits.
| pierrebai wrote:
| How often are streams truncated _at the start_? In my career,
| I've seen plenty of end truncation, but start truncation never
| happens. Or, to be more precise, it only happens if a
| previous decoding is already borked. If a previous decoding
| read too much data, then even UTF-8 is borked. You could be
| decoding UTF-8 from the bits of any follow-up data.
|
| Even for pure text data, if a previous field was over-read
| (the only plausible way to have start-truncation), then you
| probably are decoding incorrect data from then on.
|
| IOW, this upside is both ludicrously improbable and much more
| damning to the decoding than simply being able to skip a
| character.
| cryptonector wrote:
| UTF-8 is self-resynchronizing. You can scan forwards and/or
| backwards and all you have to do is look for bytes that start a
| UTF-8 codepoint encoding to find the boundaries between
| codepoints. It's genius.
| [deleted]
| stkdump wrote:
| The current scheme is extensible to 7x6=42 bits (which will
| probably never be needed). The advantage of the current scheme
| is that when you read the first byte you know how long the code
| point is in memory and you have fewer branch dependencies,
| i.e. better performance.
|
| EDIT: another huge advantage is that lexicographical
| comparison/sorting is trivial (usually the ascii version of the
| code can be reused without modification).
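|
| The sorting property is easy to check (Python, with arbitrary
| sample strings of my own):
|       words = ["Zebra", "apple", "\u00e9clair", "\U0001F602"]
|       by_code_point = sorted(words)
|       by_utf8_bytes = sorted(words,
|                              key=lambda w: w.encode("utf-8"))
|       assert by_code_point == by_utf8_bytes  # byte order matches
|                                              # code point order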
| lkuty wrote:
| like "A Branchless UTF-8 Decoder" at
| https://nullprogram.com/blog/2017/10/06/
| coldpie wrote:
| > The current scheme is extensible to 7x6=42 bits (which will
| probably never be needed).
|
| I have printed this out and inserted it into my safe deposit
| box, so my children's children's children can take it out and
| have a laugh.
| jjice wrote:
| Fun fact: Ken Thompson and Rob Pike of Unix, Plan 9, Go, and
| other fame had a heavy influence on the standard while working on
| Plan 9. To quote Wikipedia:
|
| > Thompson's design was outlined on September 2, 1992, on a
| placemat in a New Jersey diner with Rob Pike.
|
| If that isn't a classic story of an international standard's
| creation/impactful update, then I don't know what is.
|
| https://en.wikipedia.org/wiki/UTF-8#FSS-UTF
| SethMLarson wrote:
| I knew that Ken Thompson had an influence but wasn't aware of
| Rob Pike, what a great fact! Thanks for sharing this :)
| ChrisSD wrote:
| For whatever it's worth Rob Pike seems to credit Ken Thompson
| for the invention, though they both worked together to make
| it the encoding used by Plan 9 and to advocate for its use
| more widely.
| YaBomm wrote:
___________________________________________________________________
(page generated 2022-02-08 23:01 UTC)