[HN Gopher] Decoding UTF8 with parallel extract
       ___________________________________________________________________
        
       Decoding UTF8 with parallel extract
        
       Author : g0xA52A2A
       Score  : 40 points
       Date   : 2024-05-05 09:28 UTC (13 hours ago)
        
 (HTM) web link (nrk.neocities.org)
 (TXT) w3m dump (nrk.neocities.org)
        
       | nwellnhof wrote:
       | In my experiments, anything using lookup tables was slower than a
       | naive, branching decoder on real-world data. Reading from a
        | lookup table in L1 cache has ~4 cycles of latency, which is
        | prohibitive for the simple case of mostly ASCII bytes. You can
        | easily achieve more than 1.5 GB/s with a naive decoder, while all
        | the "smarter" approaches are capped at ~800 MB/s.
        
         | camel-cdr wrote:
         | IIRC all of the simdutf implementations use lookup tables
         | except for the AVX512 and RVV backends.
         | 
         | Here is e.g. the NEON code:
         | https://github.com/simdutf/simdutf/blob/1b8ca3d1072a8e2e1026...
        
           | nwellnhof wrote:
            | SIMD is great if you want to convert to UTF-32, but most of
            | the time I want to iterate over UTF-8 codepoints directly.
            | The initial test for a 1-byte sequence is just a simple
            | branch and will be hit most of the time. Asian scripts
           | are an exception but there you have longer runs of 3-byte
           | sequences. Testing with real-world data, I found the worst
           | case for the branch predictor to be Eastern European scripts
           | which often alternate between 1- and 2-byte sequences. But
           | even with Czech language content, I got around 800 MB/s with
           | the naive approach, IIRC.
        
         | kevingadd wrote:
         | I would expect it depends on where the lookup table enters the
         | picture. If the table lookup is at the end of a dependency
          | chain, it's not going to stall anything, and you're putting
          | less strain on the branch predictor/reordering machinery.
         | 
          | The article is using pext, which is different from a regular
          | table lookup. While it has horrible latency and overhead on
          | older chips (it's microcoded on pre-Zen3 AMD parts), on both
          | Zen3 and Haswell it decodes to a single uop with a latency of
          | 3 cycles. So it's not as bad as a regular table lookup as long
          | as you're on a new enough processor.
         | 
         | It's also possible to do vectorized lookup table loads if your
         | lookup table is small enough and you have a modern CPU, see
         | http://0x80.pl/notesen/2016-03-13-simd-lookup-pshufb.html and
         | http://0x80.pl/notesen/2019-02-03-simd-switch-implementation...
         | (there's a more generalized version of this I can't remember
         | the URL of...)
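          | 
          | For concreteness, a rough sketch of the pext idea (an
          | illustration with my own masks and names, not the article's
          | exact code): the UTF-8 tag bits are masked away and the
          | payload bits are compressed into the codepoint by a single
          | _pext_u32:
          | 
          |     #include <stdint.h>
          |     #include <immintrin.h>   /* _pext_u32, needs BMI2 */
          | 
          |     /* Payload-bit masks per sequence length (lead byte in
          |      * the most significant position). */
          |     static const uint32_t pext_mask[4] = {
          |         0x7F000000u,   /* 1 byte:  7 bits         */
          |         0x1F3F0000u,   /* 2 bytes: 5 + 6 bits     */
          |         0x0F3F3F00u,   /* 3 bytes: 4 + 6 + 6 bits */
          |         0x073F3F3Fu,   /* 4 bytes: 3 + 6 + 6 + 6  */
          |     };
          | 
          |     /* Extract the codepoint of a sequence whose length
          |      * (1..4) is already known from the lead byte.  The
          |      * caller must guarantee 4 readable bytes at s (e.g. by
          |      * padding the buffer); the mask zeroes any trailing
          |      * bytes beyond the sequence anyway. */
          |     static uint32_t utf8_extract_pext(const unsigned char *s,
          |                                       int len)
          |     {
          |         uint32_t b = ((uint32_t)s[0] << 24)
          |                    | ((uint32_t)s[1] << 16)
          |                    | ((uint32_t)s[2] << 8) | s[3];
          |         return _pext_u32(b, pext_mask[len - 1]);
          |     }
          | 
          | Validation (continuation bytes, overlong forms) still has to
          | happen separately; this only covers the bit extraction.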
        
         | andrewf wrote:
         | It'd be interesting to benchmark this on texts from different
         | languages. English text will be very friendly to the branch
         | predictor because it's all 1-byte codepoints. I think a
         | language dominated by either 2-byte or 3-byte codepoints would
          | be as well; mixing codepoint lengths is what would trip things
         | up.
        
       | pbsd wrote:
       | The overlong lookup can also be written without a memory lookup
        | as:
        | 
        |     0x10000U >> ((0x1531U >> (i*5)) & 31);
       | 
       | On most current x86 chips this has a latency of 3 cycles --
       | LEA+SHR+SHR -- which is better than an L1 cache hit almost
       | everywhere.
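        | 
        | For context, assuming i is the sequence length minus one, that
        | expression reproduces the usual minimum-codepoint table used to
        | reject overlong forms (a sketch of my reading of the trick):
        | 
        |     #include <stdint.h>
        | 
        |     /* Table form: minimum codepoint for each encoded length
        |      * (1..4 bytes); a decoded value below this threshold
        |      * means the sequence is overlong. */
        |     static const uint32_t overlong_min_tbl[4] =
        |         { 0, 0x80, 0x800, 0x10000 };
        | 
        |     /* Branch-free form: (0x1531 >> (i*5)) & 31 unpacks the
        |      * shift amounts 17, 9, 5, 0, and 0x10000 shifted right
        |      * by those gives 0, 0x80, 0x800, 0x10000 again. */
        |     static inline uint32_t overlong_min(unsigned i)
        |     {
        |         return 0x10000U >> ((0x1531U >> (i * 5)) & 31);
        |     }
        | 
        | The & 31 matches x86's implicit 5-bit shift-count masking, so
        | the compiler can drop it and emit just the LEA (for i*5) and
        | the two shifts mentioned above.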
        
         | moonchild wrote:
         | checking for an overlong input is off the critical path, so
         | latency is irrelevant. (tfa doesn't appear to use a branch--it
         | should use a branch)
        
       ___________________________________________________________________
       (page generated 2024-05-05 23:01 UTC)