[HN Gopher] Summing ASCII encoded integers on Haswell at almost ...
       ___________________________________________________________________
        
       Summing ASCII encoded integers on Haswell at almost the speed of
       memcpy
        
       Author : iliekcomputers
       Score  : 118 points
       Date   : 2024-07-12 16:42 UTC (2 days ago)
        
 (HTM) web link (blog.mattstuchlik.com)
 (TXT) w3m dump (blog.mattstuchlik.com)
        
       | wolf550e wrote:
       | I think the trick with dereferencing unmapped memory is cool, but
       | I only really care about techniques that work reliably and I can
       | use in production.
        
         | sYnfo wrote:
         | To be clear, it's not dereferencing unmapped memory, I just
         | haven't shown how it's being mapped, because it's a little
         | complex. As I note in the post, you can imagine as if I mmap
         | all the necessary addresses at the start of the program.
        
           | camel-cdr wrote:
            | Given that the input is "integers uniformly sampled from [0,
            | 2^31-1]" couldn't you use a LUT for the 99.99% case of just
            | 10/9/8 digit numbers instead and have a cold branch to
            | handle the very rare smaller numbers?
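
For scale, the actual fraction of uniform draws from [0, 2^31-1] that have 8-10 digits is easy to compute (a quick check, not from the thread; it comes out a bit under the 99.99% quoted, but still overwhelmingly the hot path):

```python
N = 2**31       # draws are uniform over 0 .. 2**31 - 1
small = 10**7   # values with fewer than 8 digits

frac_hot = (N - small) / N
print(f"{frac_hot:.4%}")   # ~99.53% of inputs have 8-10 digits
```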
        
             | anonymoushn wrote:
              | Yes, maybe if one is clever and lucky this could cost
              | only a popcnt and a branch? Not sure.
        
       | ashleyn wrote:
       | Knew it'd be SIMD. Such an underrated feature of modern CPUs.
       | Hopefully with cross-platform SIMD in Rust and Golang, it'll be
       | more commonly used.
       | 
       | Thinking parallel gets you enormous speed benefits for any number
       | of arbitrary algorithms: https://mcyoung.xyz/2023/11/27/simd-
       | base64/
        
         | neonsunset wrote:
         | Here's the tracking issue for Go if you're interested:
         | https://github.com/golang/go/issues/67520
         | 
         | I wouldn't be holding my breath though - proper support of
         | high-level portable SIMD abstraction requires quite a lot of
         | compiler complexity due to how _wide_ (heh) the API surface of
         | SIMD extensions is in most ISAs, and because of details
         | necessary to get right to keep data in appropriate (vector and
         | /or mask) registers. This, naturally, goes in the complete
         | opposite direction to the design philosophy of Go's compiler.
          | Instead, you are supposed to write assembly in Go's custom
          | ASM syntax, with byte literals used to encode opcodes that
          | are not natively supported (which is common).
         | 
         | If you're interested in what high-effort SIMD implementation in
         | this kind of language looks like, take a look at C#'s cross-
         | platform Vector API:
         | https://github.com/dotnet/runtime/blob/main/docs/coding-guid...
         | 
          | https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd...
          | (uses platform intrinsics, but showcases that you can go one
          | abstraction level lower, retaining the same Vector128<T>
          | type if you need to specialize a particular part of your
          | algorithm for a platform, without having to maintain a
          | separate copy for each one)
         | 
         | Here's high-effort vectorized CRC64 implementation that uses
         | these:
         | https://github.com/dotnet/runtime/blob/283de5b5adf08c42d4945...
          | (performs as fast as the C++-based mnemonic variant)
        
           | ziofill wrote:
            | Mojo should get a mention since we're on the topic of SIMD.
        
             | neonsunset wrote:
             | Corrective upvote from me :)
             | 
              | Mojo's effort in bringing a portable SIMD abstraction to
              | the Python audience is commendable. I'm looking forward
              | to it being open-sourced so I can try it out!
             | 
              | For anyone who's curious, the reason I'm mostly talking
              | about C# above is that its Vector API is the most
              | accessible and mature portable SIMD abstraction that
              | comes out of the box as part of the standard library -
              | unlike more complex alternatives, you only need to
              | install the SDK and run `dotnet new console` to start
              | working with it.
        
       | dist1ll wrote:
        | First time I've heard of HighLoad. It seems really interesting
        | at first glance. I personally find SIMD and ISA/march-specific
        | optimizations more rewarding than pure algorithmic challenges
        | (codeforces and such).
       | 
       | Though Haswell seems like a pretty obsolete platform to optimize
       | for at this point. Even Skylake will be a decade old next year.
        
         | sgerenser wrote:
          | Realistically, beyond Haswell there hasn't been a ton of
          | advancement in SIMD. Haswell introduced AVX2, which is what
          | this blog post uses. AVX512 is certainly more powerful, but
          | it's not even available in the majority of Intel CPUs, even
          | brand new ones.
        
           | jandrewrogers wrote:
           | AVX-512 has been ubiquitous on Intel server CPUs for a long
           | time. Most people don't run high-performance throughput-
           | oriented codes on consumer-grade CPUs with no ECC, which is
           | the primary application for AVX-512. AVX-512 is a markedly
           | better ISA than AVX2, aside from being wider.
        
             | adrian_b wrote:
             | "has been" => "had been"
             | 
             | AVX-512 is no longer ubiquitous on Intel servers, but only
             | on new AMD servers.
             | 
             | Even earlier, there were cheap Intel servers with CPUs
             | using Atom cores, for example the Denverton, Snow Ridge,
             | Denverton Refresh, Snow Ridge Refresh, Parker Ridge and
             | Arizona Beach series of server CPUs. None of these
             | supported AVX-512 and many did not support even AVX.
             | 
              | However, now that the Sierra Forest server CPUs have
              | launched (to be followed next year by the Clearwater
              | Forest server CPUs), the Atom cores have expanded up to
              | the biggest Intel server CPUs. While such server CPUs
              | are intended for applications where computations using
              | array operations are less important, like Web servers or
              | the hosting of many small virtual machines, the
              | fragmentation of the Intel ISA is extremely annoying -
              | especially when AMD demonstrates that the same ISA can
              | be implemented at different performance levels (by
              | varying the number of vector pipelines and the maximum
              | achievable clock frequency) in laptop CPUs and
              | desktop/server CPUs alike, in both compact cores with
              | low clock frequency and big cores with high clock
              | frequency.
             | 
              | At least for me, the lack of AVX-512 support is what
              | made me stop buying Intel CPUs some years ago. There are
              | some features of the Intel CPUs that I prefer over the
              | AMD CPUs (like TSC deadline), but none of those can
              | compensate for the lack of AVX-512 support.
             | 
              | The greater width of AVX-512 is not its greatest
              | advantage; the mask registers and a more complete set of
              | instructions, which simplify many algorithms, matter
              | more. Therefore, when Intel supports AVX10/256 across
              | all their CPUs, that will partially restore the
              | competitiveness of Intel CPUs, but that is not likely to
              | happen before 2026.
        
               | zxexz wrote:
               | Yeah, I buy AMD Epyc almost exclusively, primarily
               | because it's so freaking annoying to even decide which
               | Intel CPU will work for a given application. Plus you can
               | usually get current or previous generation Epyc cores new
               | for less than half the MSRP.
        
           | hedgehog wrote:
            | As part of that, Haswell brought FMA support, which was a
            | boon to those of us doing a lot of multiplication and
            | addition (it made those workloads twice as fast).
        
             | hansvm wrote:
             | It does depend a little on what ratio of additions to
             | multiplies you had. Haswell dropped down to one execution
             | unit capable of floating point addition, so for addition-
             | heavy workloads you basically had to replace half the
             | additions with fma instructions just to keep your old
             | performance from dropping by 2x.
        
         | xipix wrote:
         | The correct solution to this optimization problem is to write
         | the integers raw, not as ASCII.
        
       | raldi wrote:
       | Is there an explanation of why it sometimes gives the wrong
       | answer?
        
         | genter wrote:
         | > will only produce correct results with probability < 1,
         | though very close to 1
         | 
         | That's terrifying
        
           | _a_a_a_ wrote:
           | Why? So long as you know the probabilities and they are
           | tolerable, why?
        
             | kardos wrote:
             | The challenge is to get the right answer. It's much less
             | interesting if you relax the challenge to no longer require
             | the right answer. Here's a really fast approximate answer:
             | 50000000*(2^31-1)/2
        
               | nerdponx wrote:
               | The fact that it's wrong sometimes is a lot less
               | interesting than the probability distribution of
               | wrongness, conditional on magnitude.
        
               | sYnfo wrote:
               | Map vs. territory. The challenge, as defined by the
               | system the competition runs on, is to get 3 correct
               | responses in a row. That's it.
        
               | Neywiny wrote:
               | Solid point and good example
        
           | madars wrote:
           | It's worse: Pr[correct output | hard input] = 0, even though
           | they estimate that Pr[correct output | random input] ~ 1.
           | This means that you can't, for example, amplify your success
           | probability by repeating the algorithm a bunch of times and
           | taking the majority vote.
        
         | sYnfo wrote:
         | 1) if you set BATCH_SIZE > 14 sums_acc may overflow
         | 
         | 2) chunks with too many small numbers cannot be processed with
         | just 2 shuffle-adds
         | 
         | 3) (not mentioned in the post) HighLoad limits the size of the
         | source code you can submit, so you can't put all possible
         | values in the look-up table
        
           | kardos wrote:
           | For 1, can you raise that to 28 with unsigned accumulators?
        
             | sYnfo wrote:
             | 14 already assumes unsigned accumulator! 255 [accumulator
             | capacity] / (2 [shuffle-adds] * 9 [highest digit value]) ~=
             | 14
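
Re-doing that headroom arithmetic in code (trivial, but it makes the bound explicit):

```python
SHUFFLE_ADDS = 2   # shuffle-adds per iteration
MAX_DIGIT = 9      # largest value a masked ASCII digit can contribute
U8_MAX = 255       # unsigned byte accumulator capacity

per_iteration = SHUFFLE_ADDS * MAX_DIGIT   # 18 per lane, worst case
max_batch = U8_MAX // per_iteration        # == 14

# one more iteration could push a lane to 15 * 18 == 270 > 255
```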
        
               | dzaima wrote:
               | You could have a separate accumulator for each shuffle,
               | which should allow 28 iterations. (and merge those
               | together at the end of the 28-iteration-loop by widening
               | to u16; at which point you could have an outer loop
               | accumulating in u16 until that runs out)
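
Under the worst case (every lane gaining the maximum digit value 9 per shuffle-add), the two-accumulator scheme can be modeled in a few lines of plain Python (a sketch, with u8 lanes simulated by masking):

```python
LANES = 32           # one YMM register's worth of byte lanes

def u8_add(acc, vals):
    # wrapping byte-lane add, modeling how vpaddb behaves per lane
    return [(a + v) & 0xFF for a, v in zip(acc, vals)]

worst = [9] * LANES  # worst case: max digit value in every lane
acc_a = [0] * LANES  # accumulator for the first shuffle-add
acc_b = [0] * LANES  # accumulator for the second shuffle-add

for _ in range(28):  # 255 // 9 == 28 safe iterations per accumulator
    acc_a = u8_add(acc_a, worst)
    acc_b = u8_add(acc_b, worst)

# each lane holds 28 * 9 == 252, so no byte ever wrapped;
# widen to u16 and merge, as the outer loop would
total_u16 = [a + b for a, b in zip(acc_a, acc_b)]
```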
        
           | camel-cdr wrote:
            | Couldn't you organize the accumulators in 8-byte chunks
            | and leave the upper byte unused? Then you map consecutive
            | digits to those chunks and use 64-bit addition for the
            | accumulation. Overflow between the bytes would then keep
            | the correct result if you do the shuffles correctly, and
            | you have a full byte of overflow buffer.
        
             | Dwedit wrote:
             | Gaps in the numbers are often enough to do some kind of
             | "SIMD" even on ordinary 32-bit processors.
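
For reference, the classic scalar "SWAR" trick being alluded to looks something like this (a hypothetical Python sketch: two lanes packed into one 32-bit word, with enough headroom that carries never cross the lane boundary):

```python
# Each lane is 16 bits wide but only ever holds values < 2**15,
# so a single 32-bit add can never carry from one lane into the next.
def pack(lo: int, hi: int) -> int:
    return lo | (hi << 16)

def swar_add(x: int, y: int) -> int:
    return (x + y) & 0xFFFFFFFF   # one ordinary 32-bit add

def unpack(x: int):
    return x & 0xFFFF, x >> 16

a = pack(1000, 2000)
b = pack(3000, 4000)
lo, hi = unpack(swar_add(a, b))
# lo == 4000, hi == 6000: two additions for the price of one
```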
        
               | camel-cdr wrote:
               | Yeah, but I was thinking of doing this within the vector
               | registers to increase the batch size.
        
           | rurban wrote:
            | So SIMD would also need to set an overflow flag to catch
            | them, which would be much faster than a checked add (adc).
            | Does any hardware support such checked SIMD arithmetic
            | already?
            | 
            | Or can you still assume that arithmetic overflow goes
            | unchecked in most languages/libraries?
        
             | adrian_b wrote:
             | AVX-512 has evolved from the Larrabee New Instructions
             | (2009), passing through Knights Ferry (2010), Knights
             | Corner (2012) and Knights Landing (2016), to reach Skylake
             | Server (2017), whose set of AVX-512 instructions has
             | remained a subset of the instruction sets of all later CPUs
             | with AVX-512 support.
             | 
             | At each step from Larrabee to Skylake Server, some
             | instructions have been lost, because the initial set of
             | instructions was more complete in order to enable the
             | writing of efficient GPU algorithms, while later Intel
             | believed that for a general-purpose CPU they can reduce the
             | costs by omitting some of those instructions.
             | 
             | (Nevertheless, later they have added many other
             | instructions, some of which may be less useful and more
             | expensive than the original instructions that have been
             | removed.)
             | 
             | Among the original Larrabee instructions that have been
             | deleted, was addition with unsigned overflow (a.k.a.
             | carry), where the output overflow flags were stored in a
             | mask register, enabling their use in a later conditional
             | SIMD instruction.
             | 
              | Signed overflow can be implemented in hardware with
              | negligible additional complexity (a single gate per
              | result number), so it would have been easy to also add
              | to Larrabee/AVX-512 an addition instruction with signed
              | overflow flags stored in a mask register. Even when only
              | unsigned overflow is available, it is possible to
              | preprocess the operands so that signed overflow can be
              | detected from the unsigned overflow bits, though that
              | requires multiple instructions, slowing the algorithm
              | down a lot.
             | 
             | However in this problem the numbers that are added are non-
             | negative, so the addition with unsigned overflow of the
             | original Larrabee ISA would have been sufficient, had Intel
             | not removed it from AVX-512.
        
               | dzaima wrote:
               | As a minor note, overflow checking for add & sub can be
               | done reasonably-efficiently in software by comparing the
               | wrapping result with a saturating one, good for
               | i8/i16/u8/u16 on x86, should be roughly the same for
               | signed overflow compared to repurposing hardware unsigned
               | overflow (though of course worse than would-be-native for
               | unsigned).
               | 
               | And, regardless, this would be at least one more uop in
               | the core loop (and a somewhat-unpredictable branch at
               | that) which you'd still want to avoid.
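
The wrapping-vs-saturating comparison can be modeled in scalar code; a hypothetical Python sketch for i8 lanes (a vector version would compare the results of e.g. paddb against paddsb):

```python
def wrapping_add_i8(a: int, b: int) -> int:
    # two's-complement wraparound, as paddb does per byte lane
    return ((a + b + 128) & 0xFF) - 128

def saturating_add_i8(a: int, b: int) -> int:
    # clamp to [-128, 127], as paddsb does per byte lane
    return max(-128, min(127, a + b))

def overflowed(a: int, b: int) -> bool:
    # overflow occurred iff the two results disagree
    return wrapping_add_i8(a, b) != saturating_add_i8(a, b)

assert not overflowed(100, 27)   # 127 still fits in an i8
assert overflowed(100, 28)       # 128 wraps to -128, saturates to 127
```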
        
       ___________________________________________________________________
       (page generated 2024-07-14 23:02 UTC)