[HN Gopher] Decoding base16 sequences quickly
       ___________________________________________________________________
        
       Decoding base16 sequences quickly
        
       Author : greghn
       Score  : 29 points
       Date   : 2023-08-04 00:50 UTC (1 day ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | jheriko wrote:
       | why is a lookup necessary? surely we just need a subtraction and
       | conditional move - this is how hexadecimal is designed.
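       | 
       | a minimal branchless sketch of that idea (my own illustration,
       | not code from the thread; it assumes the input is already a
       | valid hex digit and does no validation):

```java
// Branch-free ASCII hex nibble decode. Bit 0x40 is set on ASCII letters
// ('a'-'f', 'A'-'F') but clear on digits ('0'-'9'), so it selects the +9
// offset without any comparison or branch.
class HexSketch {
    static int decodeNibble(int c) {
        int isLetter = (c & 0x40) >>> 6;  // 1 for letters, 0 for digits
        return (c & 0xF) + isLetter * 9;  // no validation: input assumed valid
    }

    static int decodeByte(int hi, int lo) {
        return decodeNibble(hi) << 4 | decodeNibble(lo);
    }
}
```

       | (validation is the part that actually costs something; the
       | article's approaches differ mostly in how they reject bad input)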
        
       | iraqmtpizza wrote:
       | with the latest Graal, my hex decoder does 3.6 GB/s on Apple M1.
       | No AVX coding required.
       | 
       | short val = DECODE_TABLE[extractor.applyAsInt(offset) << 7
       |         | extractor.applyAsInt(offset + 1)];
       | 
       | decode lookup table is 26318 bytes long
       | 
       | I believe that CPU runs at 3.2 GHz, so it's decoding at less
       | than one cycle per byte (~14.2 cycles per 16-byte string).
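       | 
       | roughly, a pair-at-a-time table works like this (the layout,
       | names, and size here are my guesses for illustration, not the
       | actual implementation; this dense version is 32 KB of shorts,
       | so presumably the real 26318-byte table is packed differently):

```java
// Sketch of a pair-at-a-time decode table: index = first char << 7 | second
// char. ASCII hex characters are all < 128, so the combined index fits in
// 14 bits and one lookup decodes two input characters into one output byte.
class PairTable {
    static final short[] DECODE_TABLE = new short[128 * 128];
    static {
        java.util.Arrays.fill(DECODE_TABLE, (short) -1);  // -1 marks invalid pairs
        String hex = "0123456789abcdef";                  // lowercase only in this sketch
        for (int hi = 0; hi < 16; hi++)
            for (int lo = 0; lo < 16; lo++)
                DECODE_TABLE[hex.charAt(hi) << 7 | hex.charAt(lo)]
                        = (short) (hi << 4 | lo);
    }

    static int decodePair(int c1, int c2) {
        return DECODE_TABLE[c1 << 7 | c2];  // byte value, or -1 if invalid
    }
}
```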
        
         | anonymoushn wrote:
         | It would be surprising if a gather-based approach beat doing
         | the math. Can you post something that readers can run?
        
           | dzaima wrote:
           | M1 can do 3 loads or 2 stores per cycle and has 6 ALU ports,
           | so a loop doing 2 bytes/cycle taking 2 cycles average is not
           | really out of the question.
           | 
           | But with the constant overheads present it's kind of useless
           | to compare (the C benchmarks in OP test 56-char strings,
           | cycling through 10000 of them, so the inputs themselves don't
           | fit in L1; I'd imagine the actual SSE/AVX code by itself
           | should be much faster, certainly faster than the ~1B/cycle
           | that the benchmark reports).
        
           | jsheard wrote:
           | That LUT would fit into the M1's L1 cache, so in a
           | microbenchmark you'd expect it to be pretty fast.
           | 
           | The question is how does it perform when there's surrounding
           | code competing for cache space?
        
             | anonymoushn wrote:
             | Well, you'll also only hit a tiny part of the LUT, but it
             | still takes a lot more instructions because there's no
             | gather in NEON.
        
           | iraqmtpizza wrote:
           | The JIT is doing all the work, it appears.
           | 
           | https://pastebin.com/TvJ8Y9cq
           | 
           |     private static final byte[] _16 = new byte[16];
           | 
           |     @Setup(Level.Trial)
           |     public void setUp() {
           |         final byte[] _eight = new byte[_16.length / 2];
           |         new Random(System.currentTimeMillis() + System.nanoTime())
           |                 .nextBytes(_eight);
           |         FastHex.encodeBytes(_eight, 0, _eight.length, _16, 0);
           |     }
           | 
           |     @Benchmark
           |     @Fork(value = 1, warmups = 1)
           |     @BenchmarkMode(Mode.Throughput)
           |     @Warmup(iterations = 1)
           |     @Measurement(iterations = 3)
           |     public void largeHexFast(Blackhole blackhole) {
           |         blackhole.consume(FasterHex.decode(0, _16.length, i -> _16[i]));
           |     }
        
             | dzaima wrote:
             | A difference between this and the OP is that OP doesn't
             | pass in a length, instead terminating on the first invalid
             | character (which is the trailing null byte in the
             | benchmark). Means that the array-out-of-bounds check can't
             | be abused.
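             | 
             | (a simplified scalar sketch of that "stop at the first
             | invalid character" contract; names are mine, and the real
             | code in the OP is vectorized C, not this:)

```java
// Decode hex pairs until a non-hex character (e.g. a trailing NUL) is hit,
// returning the number of output bytes written. Because the loop terminates
// on the first invalid character rather than a caller-supplied length, a
// bounds check on the input cannot simply be hoisted away.
class StopOnInvalid {
    static int nibble(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;  // sentinel: not a hex digit
    }

    static int decodeUntilInvalid(byte[] src, byte[] dst) {
        int n = 0;
        while (2 * n + 1 < src.length) {
            int hi = nibble(src[2 * n]);
            if (hi < 0) break;            // first invalid char ends the parse
            int lo = nibble(src[2 * n + 1]);
            if (lo < 0) break;
            dst[n++] = (byte) (hi << 4 | lo);
        }
        return n;
    }
}
```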
        
               | iraqmtpizza wrote:
               | Yeah, the methodologies aren't directly comparable (JMH
               | is dark magic as well). For the record I used JMH 1.37
        
       ___________________________________________________________________
       (page generated 2023-08-05 23:00 UTC)