[HN Gopher] Decoding UTF8 with parallel extract
___________________________________________________________________
Decoding UTF8 with parallel extract
Author : g0xA52A2A
Score : 101 points
Date : 2024-05-05 09:28 UTC (1 day ago)
(HTM) web link (nrk.neocities.org)
(TXT) w3m dump (nrk.neocities.org)
| nwellnhof wrote:
| In my experiments, anything using lookup tables was slower than a
| naive, branching decoder on real-world data. Reading from a
| lookup table in L1 cache has ~4 cycles latency which is
| prohibitive for the simple case of mostly ASCII bytes. You can
| easily achieve more than 1.5 GB/s with a naive decoder while all
| the "smarter" approaches are capped to ~800 MB/s.
| camel-cdr wrote:
| IIRC all of the simdutf implementations use lookup tables
| except for the AVX512 and RVV backends.
|
| Here is e.g. the NEON code:
| https://github.com/simdutf/simdutf/blob/1b8ca3d1072a8e2e1026...
| nwellnhof wrote:
| SIMD is great if you want to convert to UTF-32 but most of
| the time, I want to iterate over UTF-8 codepoints directly.
| The initial test for a 1-byte sequence is just a single
| cheap branch and will be hit most of the time. Asian scripts
| are an exception but there you have longer runs of 3-byte
| sequences. Testing with real-world data, I found the worst
| case for the branch predictor to be Eastern European scripts
| which often alternate between 1- and 2-byte sequences. But
| even with Czech language content, I got around 800 MB/s with
| the naive approach, IIRC.
| anonymoushn wrote:
| What do you mean by "iterate over UTF-8 codepoints
| directly?" Is the requirement that you get a pointer to the
| start of each, or that you receive a stream of u32 whose
| content is the UTF-8 bytes, or something else? Basically I
| don't really understand how a UTF-8-to-UTF-32 converter is
| not trivially modifiable to do the required thing.
| torstenvl wrote:
| I mean there's the whole issue of quadrupling your
| storage space... plus you have to iterate twice... plus
| you'd have to write four bytes for every code point, even
| if most of the code points only require one byte...
| anonymoushn wrote:
| You don't have to iterate twice, you don't have to use
| any more storage than a constant number of bytes chosen
| to fit in L1 along with lots of other stuff, and any
| system of "iterating over UTF-8 codepoints directly" has
| to spend multiple bytes on each of them anyway (for
| example, an index into a byte array occupies multiple
| bytes, and the minimum storage for an arbitrary
| codepoint is also multiple bytes). So each part of your
| comment is either incorrect or applies to all possible
| implementations that meet the requirements. You also
| haven't proposed any particular meaning for "iterating
| over UTF-8 codepoints directly," but this is probably for
| the best, since it is someone else's requirement.
| torstenvl wrote:
| Okay, so, from my current perspective, everything in your
| comment is flat wrong. Assuming that it isn't, you and I
| must be talking about different things to have such
| divergent views.
|
| Upthread was posted, "I want to iterate over UTF-8
| codepoints directly. The initial test for a 1-byte
| sequence is just too simple with a branch and will be hit
| most of the time."
|
| Compared to this, your proposals require iterating twice,
| in the context as I understand it: either (1) to convert
| to UCS-4 and (2) to process; or (1) to establish pointers
| to the starting UTF-8 code unit of each code point and
| (2) to process.
|
| Additionally, if you convert to UCS-4/UTF-32, and most of
| the text is in ASCII, you will roughly quadruple the
| space required for this conversion. If you establish
| 32-bit pointers into the UTF-8 string for the starting
| code unit of each code point, you will also approximately
| quadruple the space required. Obviously, double that if
| you're using 64-bit pointers.
|
| By contrast, iterating through a UTF-8 string _directly_
| has no such problems, at the expense of slightly more CPU
| time for each iteration. Such iteration can take the
| form of:
| 
|     // Extremely hot codepath. Handling well-formed
|     // UTF-8 by macro --> ~4% speedup
|     #define next(R) ((R[1]>>6!=2) ? (R+1) : \
|                      (R[2]>>6!=2) ? (R+2) : \
|                      (R[3]>>6!=2) ? (R+3) : \
|                      (R[4]>>6!=2) ? (R+4) : \
|                      next_func(R+4))
| 
|     static const unsigned char *next_func(
|             const unsigned char *rex)
|     {
|         do { rex++; } while (*rex && rex[0]>>6 == 2);
|         return rex;
|     }
|
| Are you talking about something else? Perhaps I have
| misunderstood something somewhere.
| kevingadd wrote:
| I would expect it depends on where the lookup table enters the
| picture. If the table lookup is at the end of a dependency
| chain, it's not going to stall anything, and you're still
| putting less strain on the branch predictor/reordering
| machinery than a branchy decoder would.
|
| The article is using pext, which is different from a regular
| table lookup. While it has horrible latency and overhead on
| older chips, on both Zen 3 and Haswell it decodes to a
| single uop with a latency of 3 cycles. So it's not as bad
| as a regular table lookup as
| long as you're on a new enough processor.
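| 
| For example, on a BMI2-capable x86 core the payload bits
| of a 2-byte sequence can be compacted with a single pext
| (a sketch of the general idea, not TFA's exact masks):
| 
|     #include <immintrin.h>  /* BMI2; build with -mbmi2 */
|     #include <stdint.h>
| 
|     /* Lead byte 110xxxxx in bits 15..8, continuation
|        10yyyyyy in bits 7..0; pext gathers the 5+6
|        payload bits into a contiguous codepoint. */
|     static uint32_t decode2(const unsigned char *s)
|     {
|         uint32_t v = (uint32_t)s[0] << 8 | s[1];
|         return _pext_u32(v, 0x1F3F);  /* C3 A9 -> 0xE9 */
|     }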
|
| It's also possible to do vectorized lookup table loads if your
| lookup table is small enough and you have a modern CPU, see
| http://0x80.pl/notesen/2016-03-13-simd-lookup-pshufb.html and
| http://0x80.pl/notesen/2019-02-03-simd-switch-implementation...
| (there's a more generalized version of this I can't remember
| the URL of...)
| andrewf wrote:
| It'd be interesting to benchmark this on texts from different
| languages. English text will be very friendly to the branch
| predictor because it's all 1-byte codepoints. I think a
| language dominated by either 2-byte or 3-byte codepoints would
| be as well; mixing codepoint lengths is what would trip things
| up.
| nitwit005 wrote:
| Been a while, but I found it was best to assume all the text
| is ASCII, and fall back to a slower code path only if that
| turned out not to be the case.
|
| A lot of tools that create JSON, XML, and so forth will
| escape any non-ASCII characters by default. Python's json
| package, for example.
| torstenvl wrote:
| Agreed. I saw measurable speed increases when defaulting to
| ASCII (via macro) and only actually calling the decoder
| function if the byte value was >= 128.
|
| #define cdpt(s) ((s[0] < 128) ? s[0] : cdpt_func(s))
|
| https://github.com/torstenvl/trex/blob/master/src/trex.c
| dzaima wrote:
| An all-ASCII fast-path is basically obligatory for anything
| that cares about performance. And then the "slow path" gets
| a guarantee that there will be varying-length data, and, as
| you'll want it to not actually be slow, you'll still need
| to avoid branching on the now-guaranteed-variable width.
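| 
| One common shape for that fast path (a sketch, not the
| commenter's code) scans a word at a time and drops into
| the multibyte decoder only once a high bit shows up:
| 
|     #include <stdint.h>
|     #include <stddef.h>
|     #include <string.h>
| 
|     /* Length of the leading all-ASCII prefix of s[0..n).
|        The AND is nonzero iff any of the eight bytes has
|        its top bit set, i.e. is non-ASCII. */
|     static size_t ascii_prefix(const unsigned char *s, size_t n)
|     {
|         size_t i = 0;
|         uint64_t w;
|         while (i + 8 <= n) {
|             memcpy(&w, s + i, 8);
|             if (w & 0x8080808080808080ULL)
|                 break;
|             i += 8;
|         }
|         while (i < n && s[i] < 0x80)
|             i++;
|         return i;
|     }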
| pbsd wrote:
| The overlong lookup can also be written without a memory lookup
| as 0x10000U >> ((0x1531U >> (i*5)) & 31);
|
| On most current x86 chips this has a latency of 3 cycles --
| LEA+SHR+SHR -- which is better than an L1 cache hit almost
| everywhere.
| moonchild wrote:
| Checking for an overlong input is off the critical path,
| so latency is irrelevant. (TFA doesn't appear to use a
| branch -- it should use one.)
| teo_zero wrote:
| You can do it with one shift: 0x10880 & (24 << (i*4))
| pbsd wrote:
| Brilliant. Can be further simplified to
| 0x10880 & (0x2240 << i).
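| 
| Spelled out as a helper (a sketch; here i is the number
| of continuation bytes, 0-3, and the expression yields the
| minimum legal codepoint 0, 0x80, 0x800, or 0x10000):
| 
|     #include <stdint.h>
| 
|     /* Table-free overlong check: a sequence with i
|        continuation bytes must encode a codepoint of at
|        least 0x10880 & (0x2240 << i). */
|     static int is_overlong(uint32_t cp, int i)
|     {
|         return cp < (uint32_t)(0x10880 & (0x2240 << i));
|     }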
| hairtuq wrote:
| Similarly, the eexpect table can be done in 3 instructions
| with t = 0x3c783023f + i; return (t >> (t & 63)) & 0xf0808088;
| clausecker wrote:
| This looks very similar to the approach we recently used to
| transcode UTF-8 into UTF-16 using AVX-512:
| https://arxiv.org/pdf/2212.05098
|
| It's part of simdutf.
| masfuerte wrote:
| The code is careful not to read past the end of the buffer, but
| it doesn't explicitly check that there are enough bytes available
| for the current multibyte character. However, this "end of buffer
| in middle of character" error is caught later by the check for
| valid continuation bytes. I thought that was quite neat.
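| 
| A sketch of the idea (illustrative, not TFA's code): any
| byte that isn't of the form 10xxxxxx fails the
| continuation test, so a sequence cut short at the end of
| the input is rejected like any other stray byte:
| 
|     /* True iff b can be a UTF-8 continuation byte. */
|     static int is_continuation(unsigned char b)
|     {
|         return (b & 0xC0) == 0x80;
|     }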
___________________________________________________________________
(page generated 2024-05-06 23:02 UTC)