[HN Gopher] Decoding UTF8 with parallel extract
       ___________________________________________________________________
        
       Decoding UTF8 with parallel extract
        
       Author : g0xA52A2A
       Score  : 101 points
       Date   : 2024-05-05 09:28 UTC (1 day ago)
        
 (HTM) web link (nrk.neocities.org)
 (TXT) w3m dump (nrk.neocities.org)
        
       | nwellnhof wrote:
        | In my experiments, anything using lookup tables was slower than
        | a naive, branching decoder on real-world data. Reading from a
        | lookup table in L1 cache has ~4 cycles of latency, which is
        | prohibitive for the common case of mostly ASCII bytes. You can
        | easily achieve more than 1.5 GB/s with a naive decoder, while
        | all the "smarter" approaches are capped at ~800 MB/s.
        
         | camel-cdr wrote:
         | IIRC all of the simdutf implementations use lookup tables
         | except for the AVX512 and RVV backends.
         | 
         | Here is e.g. the NEON code:
         | https://github.com/simdutf/simdutf/blob/1b8ca3d1072a8e2e1026...
        
           | nwellnhof wrote:
            | SIMD is great if you want to convert to UTF-32, but most of
            | the time I want to iterate over UTF-8 codepoints directly.
            | The initial test for a 1-byte sequence is just a single,
            | cheap branch, and it will be hit most of the time. Asian
            | scripts are an exception, but there you have longer runs of
            | 3-byte sequences. Testing with real-world data, I found the
            | worst case for the branch predictor to be Eastern European
            | scripts, which often alternate between 1- and 2-byte
            | sequences. But even with Czech-language content, I got
            | around 800 MB/s with the naive approach, IIRC.
        
             | anonymoushn wrote:
             | What do you mean by "iterate over UTF-8 codepoints
             | directly?" Is the requirement that you get a pointer to the
             | start of each, or that you receive a stream of u32 whose
             | content is the UTF-8 bytes, or something else? Basically I
             | don't really understand how a UTF-8-to-UTF-32 converter is
             | not trivially modifiable to do the required thing.
        
               | torstenvl wrote:
               | I mean there's the whole issue of quadrupling your
               | storage space... plus you have to iterate twice... plus
               | you'd have to write four bytes for every code point, even
               | if most of the code points only require one byte...
        
               | anonymoushn wrote:
                | You don't have to iterate twice, you don't have to use
                | more storage than a constant number of bytes chosen to
                | fit in L1 alongside lots of other stuff, and any scheme
                | for "iterating over UTF-8 codepoints directly" has to
                | spend multiple bytes on each codepoint anyway (for
                | example, an index into a byte array occupies multiple
                | bytes, and the minimum storage needed to hold an
                | arbitrary codepoint is also multiple bytes). So each
                | part of your comment is either incorrect or applies to
                | all possible implementations that meet the
                | requirements. You also haven't proposed any particular
                | meaning for "iterating over UTF-8 codepoints directly,"
                | but this is probably for the best, since it is someone
                | else's requirement.
        
               | torstenvl wrote:
               | Okay, so, from my current perspective, everything in your
               | comment is flat wrong. Assuming that it isn't, you and I
               | must be talking about different things to have such
               | divergent views.
               | 
               | Upthread was posted, "I want to iterate over UTF-8
               | codepoints directly. The initial test for a 1-byte
               | sequence is just too simple with a branch and will be hit
               | most of the time."
               | 
               | Compared to this, your proposals require iterating twice,
               | in the context as I understand it: either (1) to convert
               | to UCS-4 and (2) to process; or (1) to establish pointers
               | to the starting UTF-8 code unit of each code point and
               | (2) to process.
               | 
               | Additionally, if you convert to UCS-4/UTF-32, and most of
               | the text is in ASCII, you will roughly quadruple the
               | space required for this conversion. If you establish
               | 32-bit pointers into the UTF-8 string for the starting
               | code unit of each code point, you will also approximately
               | quadruple the space required. Obviously, double that if
               | you're using 64-bit pointers.
               | 
               | By contrast, iterating through a UTF-8 string _directly_
               | has no such problems, at the expense of slightly more CPU
                | time for each iteration. Such iteration can take the
                | form of:
                | 
                |     // Extremely hot codepath. Handling well-formed
                |     // UTF-8 by macro --> ~4% speedup
                |     #define next(R) ((R[1]>>6!=2) ? (R+1) : \
                |                      (R[2]>>6!=2) ? (R+2) : \
                |                      (R[3]>>6!=2) ? (R+3) : \
                |                      (R[4]>>6!=2) ? (R+4) : \
                |                      next_func(R+4))
                | 
                |     static const unsigned char *next_func(
                |         const unsigned char *rex)
                |     {
                |         do { rex++; } while (*rex && rex[0]>>6 == 2);
                |         return rex;
                |     }
               | 
               | Are you talking about something else? Perhaps I have
               | misunderstood something somewhere.
        
         | kevingadd wrote:
          | I would expect it depends on where the lookup table enters
          | the picture. If the table lookup is at the end of a
          | dependency chain, it's not going to stall anything, and
          | you're putting less strain on the branch predictor and
          | reordering machinery.
          | 
          | The article is using pext, which is different from a regular
          | table lookup. While pext has horrible latency and overhead
          | on older AMD chips (it's microcoded before Zen 3), on Zen 3
          | it decodes to a single uop with a latency of 3 cycles, and
          | on Haswell and later Intel chips it's likewise one uop with
          | a latency of 3. So it's not as bad as a regular table lookup
          | as long as you're on a new enough processor.
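          | 
          | For concreteness, the pext trick looks roughly like this (a
          | sketch, not the article's exact code; the helper name and
          | masks are illustrative, and it assumes BMI2, at least 4
          | readable bytes at p, and skips all validation):
          | 
          |     #include <stdint.h>
          |     #include <string.h>
          |     #include <immintrin.h>
          | 
          |     static uint32_t decode_pext(const unsigned char *p,
          |                                 int *len)
          |     {
          |         // Payload-bit masks per sequence length, for a
          |         // big-endian word (first byte in the high bits).
          |         static const uint32_t mask[4] = {
          |             0x7F000000,  // 0xxxxxxx
          |             0x1F3F0000,  // 110xxxxx 10xxxxxx
          |             0x0F3F3F00,  // 1110xxxx 10xxxxxx 10xxxxxx
          |             0x073F3F3F,  // 11110xxx 10xxxxxx ...
          |         };
          |         unsigned b0 = p[0];
          |         int n = b0 >= 0xF0 ? 4 : b0 >= 0xE0 ? 3
          |               : b0 >= 0xC0 ? 2 : 1;
          |         uint32_t w;
          |         memcpy(&w, p, 4);
          |         w = __builtin_bswap32(w); // first byte -> high bits
          |         *len = n;
          |         return _pext_u32(w, mask[n - 1]);
          |     }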
         | 
         | It's also possible to do vectorized lookup table loads if your
         | lookup table is small enough and you have a modern CPU, see
         | http://0x80.pl/notesen/2016-03-13-simd-lookup-pshufb.html and
         | http://0x80.pl/notesen/2019-02-03-simd-switch-implementation...
         | (there's a more generalized version of this I can't remember
         | the URL of...)
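          | 
          | The basic pshufb trick from those links, as a sketch (SSSE3;
          | the table contents here are illustrative): use each byte's
          | high nibble to index a 16-entry table, e.g. to classify 16
          | UTF-8 bytes at once by sequence length:
          | 
          |     #include <tmmintrin.h>
          | 
          |     // 0 marks continuation bytes, 1-4 are lead lengths.
          |     static __m128i utf8_lengths(__m128i bytes)
          |     {
          |         const __m128i table = _mm_setr_epi8(
          |             1, 1, 1, 1, 1, 1, 1, 1, // 0x0?-0x7?: ASCII
          |             0, 0, 0, 0,             // 0x8?-0xB?: continuation
          |             2, 2,                   // 0xC?, 0xD?: 2-byte lead
          |             3,                      // 0xE?: 3-byte lead
          |             4);                     // 0xF?: 4-byte lead
          |         __m128i hi = _mm_and_si128(_mm_srli_epi16(bytes, 4),
          |                                    _mm_set1_epi8(0x0F));
          |         return _mm_shuffle_epi8(table, hi);
          |     }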
        
         | andrewf wrote:
         | It'd be interesting to benchmark this on texts from different
         | languages. English text will be very friendly to the branch
          | predictor because it's all 1-byte codepoints. I think a
          | language dominated by either 2-byte or 3-byte codepoints
          | would be as well; mixing codepoint lengths is what would
          | trip things up.
        
           | nitwit005 wrote:
           | Been a while, but I found it was best to assume all the text
           | is ASCII, and fall back to a slower code path only if that
           | turned out not to be the case.
           | 
            | A lot of tools that create JSON, XML, and so forth will
            | escape any non-ASCII characters by default. Python's json
            | package, for example.
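            | 
            | (For instance, Python's json.dumps escapes "é" to \u00e9,
            | because ensure_ascii defaults to True.)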
        
             | torstenvl wrote:
              | Agreed. I saw measurable speed increases when defaulting
              | to ASCII (via macro) and only actually calling the
              | decoder function if the byte was >= 128:
              | 
              |     #define cdpt(s) ((s[0] < 128) ? s[0] : cdpt_func(s))
              | 
              | https://github.com/torstenvl/trex/blob/master/src/trex.c
        
             | dzaima wrote:
             | An all-ASCII fast-path is basically obligatory for anything
             | that cares about performance. And then the "slow path" gets
             | a guarantee that there will be varying-length data, and, as
             | you'll want it to not actually be slow, you'll still need
             | to avoid branching on the now-guaranteed-variable width.
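              | 
              | A minimal sketch of such a fast path (assuming the
              | caller guarantees 8 readable bytes per load):
              | 
              |     #include <stdint.h>
              |     #include <string.h>
              | 
              |     // Returns the index of the first non-ASCII byte
              |     // (or n), scanning 8 bytes per iteration.
              |     static size_t skip_ascii(const unsigned char *p,
              |                              size_t n)
              |     {
              |         size_t i = 0;
              |         uint64_t w;
              |         while (i + 8 <= n) {
              |             memcpy(&w, p + i, 8);
              |             if (w & 0x8080808080808080ULL)
              |                 break;   // some byte has its top bit set
              |             i += 8;
              |         }
              |         while (i < n && p[i] < 0x80)
              |             i++;
              |         return i;
              |     }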
        
       | pbsd wrote:
        | The overlong lookup can also be written without a memory
        | lookup as
        | 
        |     0x10000U >> ((0x1531U >> (i*5)) & 31);
       | 
       | On most current x86 chips this has a latency of 3 cycles --
       | LEA+SHR+SHR -- which is better than an L1 cache hit almost
       | everywhere.
        
         | moonchild wrote:
         | checking for an overlong input is off the critical path, so
         | latency is irrelevant. (tfa doesn't appear to use a branch--it
         | should use a branch)
        
         | teo_zero wrote:
          | You can do it with one shift:
          | 
          |     0x10880 & (24 << (i*4))
        
           | pbsd wrote:
            | Brilliant. Can be further simplified to
            | 
            |     0x10880 & (0x2240 << i)
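            | 
            | (A quick self-check, not from the thread: all three
            | formulas should reproduce the per-length minimum
            | codepoints {0, 0x80, 0x800, 0x10000} for i = len - 1:)
            | 
            |     #include <assert.h>
            | 
            |     int main(void)
            |     {
            |         for (unsigned i = 0; i < 4; i++) {
            |             unsigned a = 0x10000U >> ((0x1531U >> (i*5)) & 31);
            |             unsigned b = 0x10880U & (24U << (i*4));
            |             unsigned c = 0x10880U & (0x2240U << i);
            |             assert(a == b && b == c);
            |         }
            |         return 0;
            |     }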
        
         | hairtuq wrote:
          | Similarly, the eexpect table can be done in 3 instructions
          | with
          | 
          |     t = 0x3c783023f + i;
          |     return ((t >> (t & 63)) & 0xf0808088);
        
       | clausecker wrote:
       | This looks very similar to the approach we recently used to
       | transcode UTF-8 into UTF-16 using AVX-512:
       | https://arxiv.org/pdf/2212.05098
       | 
       | It's part of simdutf.
        
       | masfuerte wrote:
       | The code is careful not to read past the end of the buffer, but
       | it doesn't explicitly check that there are enough bytes available
       | for the current multibyte character. However, this "end of buffer
       | in middle of character" error is caught later by the check for
       | valid continuation bytes. I thought that was quite neat.
        
       ___________________________________________________________________
       (page generated 2024-05-06 23:02 UTC)