[HN Gopher] Summing ASCII encoded integers on Haswell at almost ...
       ___________________________________________________________________
        
       Summing ASCII encoded integers on Haswell at almost the speed of
       memcpy
        
       Author : iliekcomputers
       Score  : 91 points
        Date   : 2024-07-12 16:42 UTC (1 day ago)
        
 (HTM) web link (blog.mattstuchlik.com)
 (TXT) w3m dump (blog.mattstuchlik.com)
        
       | wolf550e wrote:
        | I think the trick with dereferencing unmapped memory is cool,
        | but I only really care about techniques that work reliably and
        | that I can use in production.
        
         | sYnfo wrote:
          | To be clear, it's not dereferencing unmapped memory; I just
          | haven't shown how it's being mapped, because it's a little
          | complex. As I note in the post, you can think of it as if I
          | mmap all the necessary addresses at the start of the
          | program.
        
           | camel-cdr wrote:
            | Given that the input is "integers uniformly sampled from
            | [0, 2^31-1]", couldn't you use a LUT for the 99.99% case
            | of just 10/9/8 digit numbers instead and have a cold
            | branch to handle the very rare smaller numbers?
        
             | anonymoushn wrote:
              | Yes, maybe if one is clever and lucky this could cost only
              | a popcnt and a branch? Not sure.
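              | 
              | Very roughly, the classification step might look like
              | the sketch below (assuming newline-separated input and
              | AVX2; the shuffle table itself is hypothetical and
              | omitted):
              | 
              |     #include <immintrin.h>
              |     #include <stdint.h>
              | 
              |     /* Sketch only: bitmask of '\n' positions in a
              |        32-byte chunk.  In a hypothetical fast path the
              |        mask would index a precomputed table of shuffle
              |        controls; unexpected patterns (rare short
              |        numbers) fall back to a scalar slow path. */
              |     static inline uint32_t newline_mask(const char *p)
              |     {
              |         __m256i v  = _mm256_loadu_si256((const __m256i *)p);
              |         __m256i nl = _mm256_cmpeq_epi8(v, _mm256_set1_epi8('\n'));
              |         return (uint32_t)_mm256_movemask_epi8(nl);
              |     }
              | 
              |     /* How many numbers end inside this chunk - the
              |        cheap popcnt-and-branch check mentioned above. */
              |     static inline int delimiters_in_chunk(uint32_t mask)
              |     {
              |         return _mm_popcnt_u32(mask);
              |     }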
        
       | ashleyn wrote:
       | Knew it'd be SIMD. Such an underrated feature of modern CPUs.
       | Hopefully with cross-platform SIMD in Rust and Golang, it'll be
       | more commonly used.
       | 
       | Thinking parallel gets you enormous speed benefits for any number
       | of arbitrary algorithms: https://mcyoung.xyz/2023/11/27/simd-
       | base64/
        
         | neonsunset wrote:
         | Here's the tracking issue for Go if you're interested:
         | https://github.com/golang/go/issues/67520
         | 
          | I wouldn't hold my breath, though - proper support for a
          | high-level portable SIMD abstraction requires quite a lot of
          | compiler complexity, due to how _wide_ (heh) the API surface
          | of SIMD extensions is in most ISAs, and because of the
          | details you need to get right to keep data in the
          | appropriate (vector and/or mask) registers. This, naturally,
          | goes in the complete opposite direction to the design
          | philosophy of Go's compiler. Instead, you are supposed to
          | write Go's custom ASM syntax, with byte literals used to
          | encode opcodes when they are not natively supported (which
          | is common).
         | 
          | If you're interested in what a high-effort SIMD
          | implementation in this kind of language looks like, take a
          | look at C#'s cross-platform Vector API:
          | https://github.com/dotnet/runtime/blob/main/docs/coding-guid...
          | 
          | https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd...
          | (uses platform intrinsics, but showcases that you can go one
          | abstraction level lower, retaining the same Vector128<T>
          | type if you need to specialize a particular part of your
          | algorithm for a platform, without having to maintain a
          | separate copy for each one)
         | 
          | Here's a high-effort vectorized CRC64 implementation that
          | uses these:
          | https://github.com/dotnet/runtime/blob/283de5b5adf08c42d4945...
          | (it performs as fast as the C++-based mnemonic variant)
        
       | dist1ll wrote:
        | This is the first time I've heard of HighLoad. It seems
        | really interesting at first glance. I personally find SIMD
        | and ISA/march-specific optimizations more rewarding than pure
        | algorithmic challenges (codeforces and such).
       | 
       | Though Haswell seems like a pretty obsolete platform to optimize
       | for at this point. Even Skylake will be a decade old next year.
        
         | sgerenser wrote:
          | Realistically, beyond Haswell there hasn't been a ton of
          | advancement in SIMD. Haswell introduced AVX2, which is what
          | this blog post uses. AVX-512 is certainly more powerful, but
          | it's not even available in the majority of Intel CPUs, even
          | brand new ones.
        
           | jandrewrogers wrote:
            | AVX-512 has been ubiquitous on Intel server CPUs for a
            | long time. Most people don't run high-performance,
            | throughput-oriented code - the primary application for
            | AVX-512 - on consumer-grade CPUs with no ECC. AVX-512 is
            | a markedly better ISA than AVX2, aside from being wider.
        
       | raldi wrote:
       | Is there an explanation of why it sometimes gives the wrong
       | answer?
        
         | genter wrote:
         | > will only produce correct results with probability < 1,
         | though very close to 1
         | 
         | That's terrifying
        
           | _a_a_a_ wrote:
           | Why? So long as you know the probabilities and they are
           | tolerable, why?
        
             | kardos wrote:
             | The challenge is to get the right answer. It's much less
             | interesting if you relax the challenge to no longer require
             | the right answer. Here's a really fast approximate answer:
             | 50000000*(2^31-1)/2
        
               | nerdponx wrote:
               | The fact that it's wrong sometimes is a lot less
               | interesting than the probability distribution of
               | wrongness, conditional on magnitude.
        
               | sYnfo wrote:
               | Map vs. territory. The challenge, as defined by the
               | system the competition runs on, is to get 3 correct
               | responses in a row. That's it.
        
               | Neywiny wrote:
               | Solid point and good example
        
           | madars wrote:
           | It's worse: Pr[correct output | hard input] = 0, even though
           | they estimate that Pr[correct output | random input] ~ 1.
           | This means that you can't, for example, amplify your success
           | probability by repeating the algorithm a bunch of times and
           | taking the majority vote.
        
         | sYnfo wrote:
          | 1) If you set BATCH_SIZE > 14, sums_acc may overflow.
          | 
          | 2) Chunks with too many small numbers cannot be processed
          | with just 2 shuffle-adds.
          | 
          | 3) (Not mentioned in the post) HighLoad limits the size of
          | the source code you can submit, so you can't put all
          | possible values in the look-up table.
        
           | kardos wrote:
           | For 1, can you raise that to 28 with unsigned accumulators?
        
             | sYnfo wrote:
              | 14 already assumes an unsigned accumulator! 255
              | [accumulator capacity] / (2 [shuffle-adds] * 9 [highest
              | digit value]) ~= 14
        
               | dzaima wrote:
                | You could have a separate accumulator for each
                | shuffle, which should allow 28 iterations (and merge
                | those together at the end of the 28-iteration loop by
                | widening to u16; at which point you could have an
                | outer loop accumulating in u16 until that runs out).
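                | 
                | A minimal sketch of that folding step (hypothetical,
                | not the post's code), assuming AVX2 and one u8
                | accumulator per shuffle so each lane grows by at most
                | 9 per iteration:
                | 
                |     #include <immintrin.h>
                | 
                |     /* Fold a u8 accumulator into two u16 accumulators.
                |        Call this every 28 iterations, before any u8
                |        lane (at most +9 per iteration) can exceed 255. */
                |     static inline void fold_u8_to_u16(__m256i acc8,
                |         __m256i *lo16, __m256i *hi16)
                |     {
                |         const __m256i z = _mm256_setzero_si256();
                |         /* zero-extend each u8 lane to u16; every u8
                |            lane maps to a fixed u16 lane, so per-lane
                |            weights can still be applied at the end */
                |         *lo16 = _mm256_add_epi16(*lo16, _mm256_unpacklo_epi8(acc8, z));
                |         *hi16 = _mm256_add_epi16(*hi16, _mm256_unpackhi_epi8(acc8, z));
                |     }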
        
           | camel-cdr wrote:
            | Couldn't you organize the accumulators in 8-byte chunks
            | and leave the upper byte unused? Then you map consecutive
            | digits to those chunks and use 64-bit addition for the
            | accumulation. Overflow between the bytes would then keep
            | the correct result if you do the shuffles correctly, and
            | you have a full byte of overflow buffer.
        
             | Dwedit wrote:
             | Gaps in the numbers are often enough to do some kind of
             | "SIMD" even on ordinary 32-bit processors.
        
               | camel-cdr wrote:
               | Yeah, but I was thinking of doing this within the vector
               | registers to increase the batch size.
        
       ___________________________________________________________________
       (page generated 2024-07-13 23:01 UTC)