[HN Gopher] Decoding UTF8 with parallel extract
___________________________________________________________________
Decoding UTF8 with parallel extract
Author : g0xA52A2A
Score : 40 points
Date : 2024-05-05 09:28 UTC (13 hours ago)
(HTM) web link (nrk.neocities.org)
(TXT) w3m dump (nrk.neocities.org)
| nwellnhof wrote:
| In my experiments, anything using lookup tables was slower than a
| naive, branching decoder on real-world data. Reading from a
| lookup table in L1 cache has ~4 cycles of latency, which is
| prohibitive for the simple case of mostly ASCII bytes. You can
| easily achieve more than 1.5 GB/s with a naive decoder, while all
| the "smarter" approaches are capped at ~800 MB/s.
| camel-cdr wrote:
| IIRC all of the simdutf implementations use lookup tables
| except for the AVX512 and RVV backends.
|
| Here is e.g. the NEON code:
| https://github.com/simdutf/simdutf/blob/1b8ca3d1072a8e2e1026...
| nwellnhof wrote:
| SIMD is great if you want to convert to UTF-32, but most of
| the time I want to iterate over UTF-8 codepoints directly.
| The initial test for a 1-byte sequence is trivially cheap as
| a branch and will be hit most of the time. Asian scripts are
| an exception, but there you get long runs of 3-byte
| sequences. Testing with real-world data, I found the worst
| case for the branch predictor to be Eastern European scripts,
| which often alternate between 1- and 2-byte sequences. But
| even with Czech-language content, I got around 800 MB/s with
| the naive approach, IIRC.
| kevingadd wrote:
| I would expect it depends on where the lookup table enters the
| picture. If the table lookup is at the end of a dependency
| chain, it's not going to stall anything, and you're putting
| less strain on the branch predictor/reordering machinery than
| a branchy decoder would.
|
| The article is using pext, which is different from a regular
| table lookup. While it has horrible latency and overhead on
| older chips, on Zen 3 it appears to decode to a single uop with
| a latency of 3 cycles, and on Haswell it likewise decodes to
| one uop with 3-cycle latency. So it's not as bad as a regular
| table lookup, as long as you're on a new enough processor.
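|
| (As a rough sketch of what pext-based payload extraction can
| look like -- assuming BMI2 and that the sequence length has
| already been determined; this is not the article's exact code:)
|
|     #include <stdint.h>
|     #include <immintrin.h>  /* _pext_u32, requires BMI2 */
|
|     /* Extract the payload bits of one UTF-8 sequence of `len`
|        bytes (1..4) starting at s. No validation, sketch only. */
|     static uint32_t decode_pext(const unsigned char *s, int len)
|     {
|         /* Per-length masks selecting each byte's payload bits. */
|         static const uint32_t mask[5] = {
|             0, 0x7F, 0x1F3F, 0x0F3F3F, 0x073F3F3F
|         };
|         uint32_t w = 0;
|         for (int i = 0; i < len; i++)    /* big-endian pack */
|             w = (w << 8) | s[i];
|         return _pext_u32(w, mask[len]);  /* compact payload bits */
|     }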
|
| It's also possible to do vectorized lookup table loads if your
| lookup table is small enough and you have a modern CPU, see
| http://0x80.pl/notesen/2016-03-13-simd-lookup-pshufb.html and
| http://0x80.pl/notesen/2019-02-03-simd-switch-implementation...
| (there's a more generalized version of this I can't remember
| the URL of...)
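|
| (A minimal illustration of that pshufb trick -- a 16-entry
| in-register table indexed by a nibble, here applied to mapping
| UTF-8 leading bytes to sequence lengths; sketch only, assuming
| SSSE3:)
|
|     #include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */
|
|     /* Map the high nibble of each of 16 input bytes to a UTF-8
|        sequence length (0 for continuation bytes). */
|     static __m128i utf8_lengths(__m128i bytes)
|     {
|         const __m128i table = _mm_setr_epi8(
|             1, 1, 1, 1, 1, 1, 1, 1,   /* 0x0_..0x7_: ASCII        */
|             0, 0, 0, 0,               /* 0x8_..0xB_: continuation */
|             2, 2,                     /* 0xC_, 0xD_: 2-byte lead  */
|             3,                        /* 0xE_:       3-byte lead  */
|             4);                       /* 0xF_:       4-byte lead  */
|         __m128i hi = _mm_and_si128(_mm_srli_epi16(bytes, 4),
|                                    _mm_set1_epi8(0x0F));
|         return _mm_shuffle_epi8(table, hi);
|     }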
| andrewf wrote:
| It'd be interesting to benchmark this on texts from different
| languages. English text will be very friendly to the branch
| predictor because it's all 1-byte codepoints. I think a
| language dominated by either 2-byte or 3-byte codepoints would
| be as well; mixing codepoint lengths is what would trip things
| up.
| pbsd wrote:
| The overlong lookup can also be written without a memory lookup
| as 0x10000U >> ((0x1531U >> (i*5)) & 31);
|
| On most current x86 chips this has a latency of 3 cycles --
| LEA+SHR+SHR -- which is better than an L1 cache hit almost
| everywhere.
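|
| (Spelled out as code -- assuming i is the number of continuation
| bytes, 0..3, this yields the minimum codepoint 0x0 / 0x80 /
| 0x800 / 0x10000 for each sequence length:)
|
|     #include <stdint.h>
|
|     /* Minimum codepoint for a UTF-8 sequence with i continuation
|        bytes; a decoded value below this is an overlong encoding. */
|     static uint32_t overlong_min(unsigned i)
|     {
|         return 0x10000U >> ((0x1531U >> (i * 5)) & 31);
|     }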
| moonchild wrote:
| checking for an overlong input is off the critical path, so
| latency is irrelevant. (tfa doesn't appear to use a branch--it
| should use a branch)
___________________________________________________________________
(page generated 2024-05-05 23:01 UTC)