[HN Gopher] Decoding base16 sequences quickly
___________________________________________________________________
Decoding base16 sequences quickly
Author : greghn
Score : 29 points
Date   : 2023-08-04 00:50 UTC (1 day ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| jheriko wrote:
| Why is a lookup necessary? Surely we just need a subtraction and a
| conditional move - hexadecimal is designed for exactly this.
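|
| For example, a branchless sketch of one digit (my illustration, not
| code from the article; it assumes the byte is already a valid hex
| digit):
|
|     // Decode one ASCII hex digit with arithmetic only: fold case
|     // with an OR, then add an offset selected by a sign mask.
|     static int decodeNibble(int c) {
|         int lower = c | 0x20;                // 'A'..'F' -> 'a'..'f'; digits unchanged
|         int mask = ('a' - 1 - lower) >> 31;  // -1 for letters, 0 for digits
|         return (lower - '0') + (mask & ('0' - 'a' + 10));
|     }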
| iraqmtpizza wrote:
| with the latest Graal, my hex decoder does 3.6 GB/s on Apple M1.
| No AVX coding required.
|
| short val = DECODE_TABLE[extractor.applyAsInt(offset) << 7 | extractor.applyAsInt(offset + 1)];
|
| the decode lookup table is 26318 bytes long
|
| I believe that CPU runs at 3.2 GHz, so it's running at less than one
| cycle per byte (3.2 GHz / 3.6 GB/s ~ 0.89 cycles/byte), i.e. ~14.2
| cycles per 16-byte string.
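|
| How such a table can be built (a sketch, not the actual code): index
| a pair of ASCII chars as (c1 << 7) | c2. The largest useful index is
| ('f' << 7) | 'f' = 13158, so a short[13159] holds 26318 bytes of
| entries, matching the size above:
|
|     // Maps a pair of lowercase ASCII hex chars to the decoded byte,
|     // -1 for invalid pairs. Uppercase entries would be added the
|     // same way.
|     static final short[] DECODE_TABLE = new short[(('f' << 7) | 'f') + 1];
|     static {
|         java.util.Arrays.fill(DECODE_TABLE, (short) -1);
|         String digits = "0123456789abcdef";
|         for (int hi = 0; hi < 16; hi++) {
|             for (int lo = 0; lo < 16; lo++) {
|                 int index = (digits.charAt(hi) << 7) | digits.charAt(lo);
|                 DECODE_TABLE[index] = (short) ((hi << 4) | lo);
|             }
|         }
|     }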
| anonymoushn wrote:
| It would be surprising if a gather-based approach beat doing the
| math. Can you post something that readers can run?
| dzaima wrote:
| M1 can do 3 loads or 2 stores per cycle and has 6 ALU ports, so a
| loop handling 2 bytes per iteration at ~2 cycles per iteration on
| average is not really out of the question.
|
| But with the constant overheads present it's kind of useless to
| compare (the C benchmarks in OP test 56-char strings, cycling
| through 10000 of them, so the inputs themselves - 56 B x 10000 ~
| 560 KB - don't fit in L1; I'd imagine the actual SSE/AVX code by
| itself should be much faster, certainly faster than the ~1 B/cycle
| that the benchmark reports).
| jsheard wrote:
| That LUT would fit into the M1's L1 cache, so in a
| microbenchmark you'd expect it to be pretty fast.
|
| The question is how does it perform when there's surrounding
| code competing for cache space?
| anonymoushn wrote:
| Well, you'll also only hit a tiny part of the LUT, but it
| still takes a lot more instructions because there's no
| gather in NEON.
| iraqmtpizza wrote:
| The JIT is doing all the work, it appears.
|
| https://pastebin.com/TvJ8Y9cq
|
|     // 16 hex chars, i.e. 8 decoded bytes, re-randomized per trial
|     private static final byte[] _16 = new byte[16];
|
|     @Setup(Level.Trial)
|     public void setUp() {
|         final byte[] _eight = new byte[_16.length / 2];
|         new Random(System.currentTimeMillis() + System.nanoTime())
|                 .nextBytes(_eight);
|         FastHex.encodeBytes(_eight, 0, _eight.length, _16, 0);
|     }
|
|     @Benchmark
|     @Fork(value = 1, warmups = 1)
|     @BenchmarkMode(Mode.Throughput)
|     @Warmup(iterations = 1)
|     @Measurement(iterations = 3)
|     public void largeHexFast(Blackhole blackhole) {
|         blackhole.consume(FasterHex.decode(0, _16.length, i -> _16[i]));
|     }
| dzaima wrote:
| A difference between this and the OP is that the OP doesn't
| pass in a length, instead terminating on the first invalid
| character (which is the trailing null byte in the
| benchmark). That means the array-out-of-bounds check can't
| be abused.
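|
| In sketch form (my reconstruction, not the OP's code; it relies
| on a non-hex sentinel such as the trailing NUL ending the input):
|
|     // Decode until the first non-hex byte; returns the output
|     // length, or -1 if the hex run has odd length.
|     static int decodeUntilInvalid(byte[] src, byte[] dst) {
|         int out = 0;
|         for (int i = 0; ; i += 2, out++) {
|             int hi = nibble(src[i]);
|             if (hi < 0) return out;       // first invalid char: done
|             int lo = nibble(src[i + 1]);
|             if (lo < 0) return -1;        // odd number of hex digits
|             dst[out] = (byte) ((hi << 4) | lo);
|         }
|     }
|
|     static int nibble(int c) {
|         if (c >= '0' && c <= '9') return c - '0';
|         int lower = c | 0x20;             // fold 'A'..'F' onto 'a'..'f'
|         return (lower >= 'a' && lower <= 'f') ? lower - 'a' + 10 : -1;
|     }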
| iraqmtpizza wrote:
| Yeah, the methodologies aren't directly comparable (JMH
| is dark magic as well). For the record, I used JMH 1.37.
___________________________________________________________________
(page generated 2023-08-05 23:00 UTC)