[HN Gopher] Summing ASCII encoded integers on Haswell at almost ...
___________________________________________________________________
Summing ASCII encoded integers on Haswell at almost the speed of
memcpy
Author : iliekcomputers
Score : 118 points
Date : 2024-07-12 16:42 UTC (2 days ago)
(HTM) web link (blog.mattstuchlik.com)
(TXT) w3m dump (blog.mattstuchlik.com)
| wolf550e wrote:
| I think the trick with dereferencing unmapped memory is cool,
| but I only really care about techniques that work reliably and
| that I can use in production.
| sYnfo wrote:
| To be clear, it's not dereferencing unmapped memory. I just
| haven't shown how it's being mapped, because it's a little
| complex. As I note in the post, you can imagine it as if I mmap
| all the necessary addresses at the start of the program.
| camel-cdr wrote:
| Given that the input is "integers uniformly sampled from
| [0, 2^31-1]", couldn't you use a LUT for the 99.99% case of
| just 10/9/8-digit numbers instead, and have a cold branch to
| handle the very rare smaller numbers?
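|
| For illustration, the 8-digit case could use the classic
| maddubs/madd kernel below (a sketch, not from the post); a
| small LUT of shuffle masks indexed by digit count could
| right-align the digits before it, with the extra leading
| digits of 9/10-digit numbers handled separately:
|
|     #include <immintrin.h>
|     #include <stdint.h>
|
|     // Parse exactly 8 ASCII digits, most significant first.
|     // Loads 16 bytes; the caller must ensure they are readable.
|     static uint32_t parse8(const uint8_t *p) {
|         __m128i in = _mm_sub_epi8(
|             _mm_loadu_si128((const __m128i *)p),
|             _mm_set1_epi8('0'));
|         // pairs of digits -> 8 base-100 words
|         __m128i t1 = _mm_maddubs_epi16(in, _mm_setr_epi8(
|             10, 1, 10, 1, 10, 1, 10, 1,
|             10, 1, 10, 1, 10, 1, 10, 1));
|         // pairs of words -> 4 base-10000 dwords
|         __m128i t2 = _mm_madd_epi16(t1,
|             _mm_setr_epi16(100, 1, 100, 1, 100, 1, 100, 1));
|         __m128i t3 = _mm_packus_epi32(t2, t2);
|         // 10000 * high half + low half
|         __m128i t4 = _mm_madd_epi16(t3,
|             _mm_setr_epi16(10000, 1, 10000, 1,
|                            10000, 1, 10000, 1));
|         return (uint32_t)_mm_cvtsi128_si32(t4);
|     }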
| anonymoushn wrote:
| Yes, maybe if one is clever and lucky this could cost only
| a popcnt and a branch? Not sure.
| ashleyn wrote:
| Knew it'd be SIMD. Such an underrated feature of modern CPUs.
| Hopefully with cross-platform SIMD in Rust and Golang, it'll be
| more commonly used.
|
| Thinking parallel gets you enormous speed benefits for any number
| of arbitrary algorithms: https://mcyoung.xyz/2023/11/27/simd-
| base64/
| neonsunset wrote:
| Here's the tracking issue for Go if you're interested:
| https://github.com/golang/go/issues/67520
|
| I wouldn't hold my breath though - proper support for a
| high-level portable SIMD abstraction requires quite a lot of
| compiler complexity, both because of how _wide_ (heh) the API
| surface of SIMD extensions is in most ISAs, and because of the
| details you have to get right to keep data in the appropriate
| (vector and/or mask) registers. This, naturally, goes in the
| completely opposite direction to the design philosophy of Go's
| compiler. Instead, you are supposed to write in a custom Go ASM
| syntax, with byte literals used to encode opcodes that are not
| natively supported (which is common).
|
| If you're interested in what high-effort SIMD implementation in
| this kind of language looks like, take a look at C#'s cross-
| platform Vector API:
| https://github.com/dotnet/runtime/blob/main/docs/coding-guid...
|
| https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd...
| (uses platform intrinsics, but showcases that you can go one
| abstraction level lower, retaining the same Vector128<T> type,
| if you need to specialize a particular part of your algorithm
| for a platform, without having to maintain a separate copy for
| each one)
|
| Here's a high-effort vectorized CRC64 implementation that uses
| these:
| https://github.com/dotnet/runtime/blob/283de5b5adf08c42d4945...
| (it performs as fast as the C++-based mnemonic variant)
| ziofill wrote:
| Mojo should get a mention since we are on the topic of SIMD.
| neonsunset wrote:
| Corrective upvote from me :)
|
| Mojo's effort in bringing a portable SIMD abstraction to the
| Python audience is commendable. I'm looking forward to it being
| open-sourced so I can try it out!
|
| For anyone who's curious, the reason I'm mostly talking about
| C# above is that its Vector API is the most accessible and
| mature portable SIMD abstraction that ships as part of the
| standard library / comes out of the box. Compared with more
| complex alternatives, you really only need to install the SDK
| and run `dotnet new console` to start working with it.
| dist1ll wrote:
| This is the first time I've heard of HighLoad. It seems really
| interesting at first glance. I personally find SIMD and
| ISA/march-specific optimizations more rewarding than pure
| algorithmic challenges (codeforces and such).
|
| Though Haswell seems like a pretty obsolete platform to optimize
| for at this point. Even Skylake will be a decade old next year.
| sgerenser wrote:
| Realistically, beyond Haswell there hasn't been a ton of
| advancement in SIMD. Haswell introduced AVX2, which is what
| this blog post uses. AVX-512 is certainly more powerful, but
| it's not even available in the majority of Intel CPUs, even
| brand new ones.
| jandrewrogers wrote:
| AVX-512 has been ubiquitous on Intel server CPUs for a long
| time. Most people don't run high-performance throughput-
| oriented code, which is the primary application for AVX-512,
| on consumer-grade CPUs with no ECC. AVX-512 is a markedly
| better ISA than AVX2, aside from being wider.
| adrian_b wrote:
| "has been" => "had been"
|
| AVX-512 is no longer ubiquitous on Intel servers, but only
| on new AMD servers.
|
| Even earlier, there were cheap Intel servers with CPUs
| using Atom cores, for example the Denverton, Snow Ridge,
| Denverton Refresh, Snow Ridge Refresh, Parker Ridge and
| Arizona Beach series of server CPUs. None of these
| supported AVX-512 and many did not support even AVX.
|
| However, now, after the launch of the Sierra Forest server
| CPUs, which will be followed next year by the Clearwater
| Forest server CPUs, the Atom cores have expanded up to the
| biggest Intel server CPUs. Such server CPUs are intended for
| applications where computations using array operations matter
| less, like Web servers or the hosting of many small virtual
| machines. Even so, the fragmentation of the Intel ISA is
| extremely annoying, especially when AMD demonstrates how the
| same ISA can be implemented at different levels of performance
| (by varying the number of vector pipelines and the maximum
| achievable clock frequency), both in laptop CPUs and in
| desktop/server CPUs, and both in compact cores with low clock
| frequency and in big cores with high clock frequency.
|
| At least for me, the lack of AVX-512 support is what made me
| stop buying Intel CPUs some years ago. There are some features
| of the Intel CPUs that I prefer over the AMD CPUs (like TSC
| deadline), but none of them can compensate for the lack of
| AVX-512 support.
|
| The greatest advantage of AVX-512 is not its greater width,
| but the mask registers and a more complete set of
| instructions, which simplify many algorithms. Therefore, when
| Intel supports AVX10/256 across all their CPUs, that will
| partially restore the competitiveness of Intel CPUs, but that
| is not likely to happen before 2026.
| zxexz wrote:
| Yeah, I buy AMD Epyc almost exclusively, primarily
| because it's so freaking annoying to even decide which
| Intel CPU will work for a given application. Plus you can
| usually get current or previous generation Epyc cores new
| for less than half the MSRP.
| hedgehog wrote:
| As part of that, Haswell brought FMA support, which was a boon
| to those of us doing a lot of multiplication and addition (it
| made those workloads twice as fast).
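|
| Roughly (an illustrative sketch with AVX/FMA3 intrinsics): the
| fused form does in one instruction, with a single rounding,
| what previously took a multiply plus a dependent add:
|
|     #include <immintrin.h>
|
|     // pre-Haswell: separate multiply and add
|     __m256 madd_mul_add(__m256 a, __m256 b, __m256 c) {
|         return _mm256_add_ps(_mm256_mul_ps(a, b), c);
|     }
|
|     // Haswell and later: one fused multiply-add (FMA3)
|     __m256 madd_fused(__m256 a, __m256 b, __m256 c) {
|         return _mm256_fmadd_ps(a, b, c);
|     }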
| hansvm wrote:
| It does depend a little on what ratio of additions to
| multiplies you had. Haswell dropped down to one execution
| unit capable of floating point addition, so for addition-
| heavy workloads you basically had to replace half the
| additions with fma instructions just to keep your old
| performance from dropping by 2x.
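|
| Concretely, that workaround looks something like this sketch
| (illustrative only): route plain additions through the FMA
| units by multiplying by 1.0, so both FMA ports can be used:
|
|     #include <immintrin.h>
|
|     // A plain add only issues on Haswell's single FP-add
|     // port; expressing it as an FMA lets it use either of
|     // the two FMA ports.
|     __m256 add_via_fma(__m256 a, __m256 b) {
|         return _mm256_fmadd_ps(a, _mm256_set1_ps(1.0f), b);
|     }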
| xipix wrote:
| The correct solution to this optimization problem is to write
| the integers raw, not as ASCII.
| raldi wrote:
| Is there an explanation of why it sometimes gives the wrong
| answer?
| genter wrote:
| > will only produce correct results with probability < 1,
| though very close to 1
|
| That's terrifying
| _a_a_a_ wrote:
| Why? So long as you know the probabilities and they are
| tolerable, why?
| kardos wrote:
| The challenge is to get the right answer. It's much less
| interesting if you relax the challenge to no longer require
| the right answer. Here's a really fast approximate answer:
| 50000000*(2^31-1)/2
| nerdponx wrote:
| The fact that it's wrong sometimes is a lot less
| interesting than the probability distribution of
| wrongness, conditional on magnitude.
| sYnfo wrote:
| Map vs. territory. The challenge, as defined by the
| system the competition runs on, is to get 3 correct
| responses in a row. That's it.
| Neywiny wrote:
| Solid point and good example
| madars wrote:
| It's worse: Pr[correct output | hard input] = 0, even though
| they estimate that Pr[correct output | random input] ~ 1.
| This means that you can't, for example, amplify your success
| probability by repeating the algorithm a bunch of times and
| taking the majority vote.
| sYnfo wrote:
| 1) if you set BATCH_SIZE > 14, sums_acc may overflow
|
| 2) chunks with too many small numbers cannot be processed with
| just 2 shuffle-adds
|
| 3) (not mentioned in the post) HighLoad limits the size of the
| source code you can submit, so you can't put all possible
| values in the look-up table
| kardos wrote:
| For 1, can you raise that to 28 with unsigned accumulators?
| sYnfo wrote:
| 14 already assumes an unsigned accumulator! 255 [accumulator
| capacity] / (2 [shuffle-adds] * 9 [highest digit value]) ~= 14
| dzaima wrote:
| You could have a separate accumulator for each shuffle,
| which should allow 28 iterations. (and merge those
| together at the end of the 28-iteration-loop by widening
| to u16; at which point you could have an outer loop
| accumulating in u16 until that runs out)
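|
| The merge step could look roughly like this (a sketch, AVX2,
| names assumed): each u8 accumulator is zero-extended to u16
| lanes and folded into wider accumulators after every
| 28-iteration block:
|
|     #include <immintrin.h>
|
|     // Fold a u8 accumulator into two u16 accumulators without
|     // losing any lane (unpacking against zero zero-extends).
|     static void fold_u8_into_u16(__m256i acc8, __m256i *lo16,
|                                  __m256i *hi16) {
|         const __m256i zero = _mm256_setzero_si256();
|         *lo16 = _mm256_add_epi16(*lo16,
|                     _mm256_unpacklo_epi8(acc8, zero));
|         *hi16 = _mm256_add_epi16(*hi16,
|                     _mm256_unpackhi_epi8(acc8, zero));
|     }
|
| Since each u8 lane is at most 28 * 9 = 252 after a block, the
| u16 accumulators themselves only need widening again after a
| few hundred such folds.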
| camel-cdr wrote:
| Couldn't you organize the accumulators in 8-byte chunks and
| leave the upper byte unused? Then you map consecutive digits
| to those chunks and use 64-bit addition for the accumulation.
| Overflow between the bytes would then keep the correct result,
| if you do the shuffles correctly, and you'd have a full byte
| of overflow buffer.
| Dwedit wrote:
| Gaps in the numbers are often enough to do some kind of
| "SIMD" even on ordinary 32-bit processors.
| camel-cdr wrote:
| Yeah, but I was thinking of doing this within the vector
| registers to increase the batch size.
| rurban wrote:
| So SIMD would also need to set an overflow flag to catch these.
|
| That would be much faster than the checked add (adc). Does any
| hardware support such checked SIMD arithmetic already?
|
| Or can you still assume that most arithmetic is still broken
| in most languages/libraries?
| adrian_b wrote:
| AVX-512 has evolved from the Larrabee New Instructions
| (2009), passing through Knights Ferry (2010), Knights
| Corner (2012) and Knights Landing (2016), to reach Skylake
| Server (2017), whose set of AVX-512 instructions has
| remained a subset of the instruction sets of all later CPUs
| with AVX-512 support.
|
| At each step from Larrabee to Skylake Server, some
| instructions were lost: the initial instruction set was more
| complete in order to enable writing efficient GPU algorithms,
| while later Intel believed that for a general-purpose CPU they
| could reduce costs by omitting some of those instructions.
|
| (Nevertheless, later they have added many other
| instructions, some of which may be less useful and more
| expensive than the original instructions that have been
| removed.)
|
| Among the original Larrabee instructions that have been
| deleted, was addition with unsigned overflow (a.k.a.
| carry), where the output overflow flags were stored in a
| mask register, enabling their use in a later conditional
| SIMD instruction.
|
| Signed overflow can be implemented in hardware with negligible
| additional complexity (a single gate per result number), so it
| would have been easy to also add to Larrabee/AVX-512 an
| addition instruction with signed overflow flags stored in a
| mask register. Even when only unsigned overflow is available,
| it is possible to preprocess the operands in such a way that
| signed overflow can be detected from the unsigned overflow
| bits, though that requires multiple instructions, slowing the
| algorithm down a lot.
|
| However in this problem the numbers that are added are non-
| negative, so the addition with unsigned overflow of the
| original Larrabee ISA would have been sufficient, had Intel
| not removed it from AVX-512.
| dzaima wrote:
| As a minor note, overflow checking for add & sub can be done
| reasonably efficiently in software by comparing the wrapping
| result with a saturating one, which works for i8/i16/u8/u16 on
| x86. For signed overflow this should cost roughly the same as
| repurposing a hardware unsigned-overflow flag would (though of
| course it's worse than a would-be-native instruction for
| unsigned).
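|
| A sketch of that check (AVX2, unsigned bytes; illustrative
| names):
|
|     #include <immintrin.h>
|
|     // Wrapping add plus a flag saying whether any lane
|     // overflowed: a lane overflowed exactly when the wrapping
|     // and saturating results disagree.
|     static __m256i add_u8_checked(__m256i a, __m256i b,
|                                   int *overflowed) {
|         __m256i wrap = _mm256_add_epi8(a, b);
|         __m256i sat  = _mm256_adds_epu8(a, b);
|         __m256i same = _mm256_cmpeq_epi8(wrap, sat);
|         *overflowed =
|             (unsigned)_mm256_movemask_epi8(same) != 0xFFFFFFFFu;
|         return wrap;
|     }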
|
| And, regardless, this would be at least one more uop in
| the core loop (and a somewhat-unpredictable branch at
| that) which you'd still want to avoid.
___________________________________________________________________
(page generated 2024-07-14 23:02 UTC)