[HN Gopher] Summing ASCII encoded integers on Haswell at almost ...
___________________________________________________________________
Summing ASCII encoded integers on Haswell at almost the speed of
memcpy
Author : iliekcomputers
Score : 91 points
Date : 2024-07-12 16:42 UTC (1 days ago)
(HTM) web link (blog.mattstuchlik.com)
(TXT) w3m dump (blog.mattstuchlik.com)
| wolf550e wrote:
| I think the trick with dereferencing unmapped memory is cool, but
| I only really care about techniques that work reliably and I can
| use in production.
| sYnfo wrote:
| To be clear, it's not dereferencing unmapped memory, I just
| haven't shown how it's being mapped, because it's a little
| complex. As I note in the post, you can imagine as if I mmap
| all the necessary addresses at the start of the program.
| camel-cdr wrote:
| Given that the input is "integers uniformly sampled from [0,
| 231-1]" couldn't you use a LUT for the 99.99% case of just
| 10/9/8 digit numbers instead and have a cold branch the
| handle the very rare smaller numbers.
| anonymoushn wrote:
| Yes, maybe if one is clever and lucky this could cost only
| a popcnt and a branch? not sure.
| ashleyn wrote:
| Knew it'd be SIMD. Such an underrated feature of modern CPUs.
| Hopefully with cross-platform SIMD in Rust and Golang, it'll be
| more commonly used.
|
| Thinking parallel gets you enormous speed benefits for any number
| of arbitrary algorithms: https://mcyoung.xyz/2023/11/27/simd-
| base64/
| neonsunset wrote:
| Here's the tracking issue for Go if you're interested:
| https://github.com/golang/go/issues/67520
|
| I wouldn't be holding my breath though - proper support of
| high-level portable SIMD abstraction requires quite a lot of
| compiler complexity due to how _wide_ (heh) the API surface of
| SIMD extensions is in most ISAs, and because of details
| necessary to get right to keep data in appropriate (vector and
| /or mask) registers. This, naturally, goes in the complete
| opposite direction to the design philosophy of Go's compiler.
| Instead, you are supposed to write a custom Go ASM syntax, with
| byte literals used to encode opcodes if they are not natively
| supported (which is common).
|
| If you're interested in what high-effort SIMD implementation in
| this kind of language looks like, take a look at C#'s cross-
| platform Vector API:
| https://github.com/dotnet/runtime/blob/main/docs/coding-guid...
|
| https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd...
| (uses platform intrisics, but showcases that you can go one
| abstraction level lower, retaining the same Vector128<T> type
| if you need to specialize a particular part of your algorithm
| for a platform, without having to maintain separate copy for
| each one)
|
| Here's high-effort vectorized CRC64 implementation that uses
| these:
| https://github.com/dotnet/runtime/blob/283de5b5adf08c42d4945...
| (performs as fast as C++-based mnemonic variant)
| dist1ll wrote:
| First time I hear about HighLoad. Seems really interesting to me
| on the first glance. I personally find SIMD and ISA/march-
| specific optimizations more rewarding than pure algorithmic
| challenges (codeforces and such).
|
| Though Haswell seems like a pretty obsolete platform to optimize
| for at this point. Even Skylake will be a decade old next year.
| sgerenser wrote:
| Realistically beyond Haswell there hasn't been a ton of
| advancement in SIMD. Hawell introduced AVX2, which is what this
| blog post uses. AVX512 is certainly more powerful, but that's
| not even available in the majority of Intel CPUs, even brand
| new ones.
| jandrewrogers wrote:
| AVX-512 has been ubiquitous on Intel server CPUs for a long
| time. Most people don't run high-performance throughput-
| oriented codes on consumer-grade CPUs with no ECC, which is
| the primary application for AVX-512. AVX-512 is a markedly
| better ISA than AVX2, aside from being wider.
| raldi wrote:
| Is there an explanation of why it sometimes gives the wrong
| answer?
| genter wrote:
| > will only produce correct results with probability < 1,
| though very close to 1
|
| That's terrifying
| _a_a_a_ wrote:
| Why? So long as you know the probabilities and they are
| tolerable, why?
| kardos wrote:
| The challenge is to get the right answer. It's much less
| interesting if you relax the challenge to no longer require
| the right answer. Here's a really fast approximate answer:
| 50000000*(2^31-1)/2
| nerdponx wrote:
| The fact that it's wrong sometimes is a lot less
| interesting than the probability distribution of
| wrongness, conditional on magnitude.
| sYnfo wrote:
| Map vs. territory. The challenge, as defined by the
| system the competition runs on, is to get 3 correct
| responses in a row. That's it.
| Neywiny wrote:
| Solid point and good example
| madars wrote:
| It's worse: Pr[correct output | hard input] = 0, even though
| they estimate that Pr[correct output | random input] ~ 1.
| This means that you can't, for example, amplify your success
| probability by repeating the algorithm a bunch of times and
| taking the majority vote.
| sYnfo wrote:
| 1) if you set BATCH_SIZE > 14 sums_acc may overflow
|
| 2) chunks with too many small numbers cannot be processed with
| just 2 shuffle-adds
|
| 3) (not mentioned in the post) HighLoad limits the size of the
| source code you can submit, so you can't put all possible
| values in the look-up table
| kardos wrote:
| For 1, can you raise that to 28 with unsigned accumulators?
| sYnfo wrote:
| 14 already assumes unsigned accumulator! 255 [accumulator
| capacity] / (2 [shuffle-adds] * 9 [highest digit value]) ~=
| 14
| dzaima wrote:
| You could have a separate accumulator for each shuffle,
| which should allow 28 iterations. (and merge those
| together at the end of the 28-iteration-loop by widening
| to u16; at which point you could have an outer loop
| accumulating in u16 until that runs out)
| camel-cdr wrote:
| Couldn't you organize the accumulators in 8 byte chunks, and
| leave the upper byte unused. Then you map consecutive digits
| to those chunks and use 64 bit addition for the accumulation.
| Then overflow between the bytes would keep the correct result
| if you do the shuffles correctly, and you have a full byte of
| overflow buffer.
| Dwedit wrote:
| Gaps in the numbers are often enough to do some kind of
| "SIMD" even on ordinary 32-bit processors.
| camel-cdr wrote:
| Yeah, but I was thinking of doing this within the vector
| registers to increase the batch size.
___________________________________________________________________
(page generated 2024-07-13 23:01 UTC)