[HN Gopher] Prefix Sum with SIMD
___________________________________________________________________
Prefix Sum with SIMD
Author : gbrown_
Score : 104 points
Date : 2022-02-12 08:45 UTC (14 hours ago)
(HTM) web link (en.algorithmica.org)
(TXT) w3m dump (en.algorithmica.org)
| [deleted]
| JimBlackwood wrote:
| This book is fantastic - it comes at a perfect time too, as
| I'm getting more work-related projects about 'make slow code
| fast'.
|
| Going to be reading this - and looking forward to any future
| parts!
| gfd wrote:
| I've seen the author advertise his book on codeforces.com blogs
| before if you want somewhere to reach him:
| https://codeforces.com/blog/entry/99790
|
| That might be a better intro than a random chapter of the book
| and contextualizes why you might want to learn SIMD programming
| (i.e., up to an order of magnitude speed-up vs STL
| implementations).
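|
| For context, the STL baseline here is presumably std::partial_sum
| or the equivalent scalar loop, and the SIMD versions are built
| roughly from a block-local prefix sum plus a carried running
| total. A hedged SSE sketch of that standard building block (my
| own illustration, not the author's exact code):
|
|   #include <immintrin.h>
|
|   // In-place prefix sum; assumes n is a multiple of 4.
|   void prefix_sum_sse(int *a, int n) {
|       __m128i s = _mm_setzero_si128();   // running total of previous blocks
|       for (int i = 0; i < n; i += 4) {
|           __m128i x = _mm_loadu_si128((__m128i *) &a[i]);
|           x = _mm_add_epi32(x, _mm_slli_si128(x, 4));  // local prefix sum...
|           x = _mm_add_epi32(x, _mm_slli_si128(x, 8));  // ...within the block
|           x = _mm_add_epi32(x, s);                     // add carry from previous blocks
|           _mm_storeu_si128((__m128i *) &a[i], x);
|           s = _mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 3, 3)); // broadcast last lane
|       }
|   }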
| jonstewart wrote:
| Yes, but his work here on prefix sum is very recent; he just
| started telling people about it on Twitter this past week
| (specifically, the improvement due to interleaving).
| fulafel wrote:
| It's interesting that SIMD went mainstream 25 years ago, yet
| compilers, our apps, and PL tech are still quite far from
| effectively utilizing it outside of nonportable, manually coded,
| SIMD-aware compute kernels written in glorified assembler. There
| are some exceptions, like ispc and GPU languages such as OpenCL
| and Futhark (GPU people say "cores" when they mean SIMD lanes!)...
| torginus wrote:
| The really strange thing about this is that GPU shaders are
| essentially SIMD programs - but nobody, except for some real
| hardcore gamedevs, programs SIMD assembly directly; the compiler
| handles it for you. There's ISPC, which is basically a C-like
| language and SIMD compiler that translates its scalar code into
| SIMD assembly, but other than that, SIMD code tends to be hand
| rolled.
| fleabitdev wrote:
| I recently experimented with Rust's std::simd library
| prototype, and it gave me the same sense of wasted potential.
|
| My program performed complicated processing of large 8-bit
| grayscale images. Naive use of portable SIMD made several
| important parts of the algorithm run ten times faster than my
| integrated-GPU implementation! (OpenGL ES 3.0 over ANGLE on an
| Intel HD Graphics 520)
|
| I think the sheer power of modern SIMD has slipped past most
| people's radar. A multicore processor with AVX2 has number-
| crunching throughput comparable to a low-end GPU, and for some
| workloads, the extra flexibility of the CPU can unlock an
| additional order-of-magnitude performance uplift. (The
| comparison may be a little different if you're programming the
| GPU with modern compute shaders - I'm not sure.)
|
| Portable SIMD programming is also just fun! Trying to keep my
| working set within the 32 KiB L1 cache, trying to keep my data
| packed within individual 8-bit lanes, and lacking access to
| some basic operations like integer multiplication and division
| - it almost feels like programming for a retro games console.
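|
| To give a flavor of what "staying within 8-bit lanes" looks
| like (my own illustration, written with C++ intrinsics rather
| than std::simd for brevity): averaging two grayscale rows
| handles 32 pixels per AVX2 instruction, as long as you pick
| ops like the rounding average that can't overflow a lane.
|
|   #include <immintrin.h>
|   #include <cstdint>
|
|   // Average two 8-bit grayscale rows; assumes n is a multiple of 32.
|   void average_rows(const uint8_t *a, const uint8_t *b,
|                     uint8_t *out, int n) {
|       for (int i = 0; i < n; i += 32) {
|           __m256i va = _mm256_loadu_si256((const __m256i *) &a[i]);
|           __m256i vb = _mm256_loadu_si256((const __m256i *) &b[i]);
|           // rounding average of unsigned 8-bit lanes
|           _mm256_storeu_si256((__m256i *) &out[i],
|                               _mm256_avg_epu8(va, vb));
|       }
|   }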
| dahart wrote:
| > it almost feels like programming for a retro games console.
|
| I get this feeling writing CUDA code all the time, having to
| keep my stack frames and local data small, figuring out how
| to even out and time-slice the work. It's a little like
| Arduino coding in a way, it's just that instead of one of
| them running at 16 MHz, I have ten thousand of them running at
| 2 GHz (and as a bonus they have floating point superpowers).
| Royi wrote:
| Have you looked at [ISPC - Intel SPMD Program Compiler][0]?
| [0]: https://github.com/ispc/ispc
| shakow wrote:
| I concur. I recently ported some bioinfo computations from
| scalar to SIMD with primitives from core::arch, and the gain
| was an immediate ~10x.
|
| It is a shame that one needs to fiddle with RUSTFLAGS just
| for SIMD extensions to trigger, though :/
| zozbot234 wrote:
| LLVM does a pretty good job of vectorizing code these days, and
| pretty much everything is using it. And making vectorization
| more effective is one of the most workable avenues for
| improving compiler tech.
| fulafel wrote:
| Given the constraints, LLVM frequently does a relatively OK
| job with the pieces it has to work with, but the results are
| usually poor in absolute terms vs. what manual optimization
| would allow. The language semantics of C/C++ don't support
| the required leeway of optimization transformations (e.g.
| optimizing data layout hand in hand with its access
| patterns, parallel semantics).
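|
| To make the data layout point concrete (my own sketch, not
| from the article): the compiler cannot legally turn the first
| layout into the second on its own, even though the second
| loop vectorizes trivially with unit-stride loads.
|
|   #include <vector>
|
|   struct PointAoS { float x, y, z; };   // array of structs: fields interleaved
|
|   struct PointsSoA {                    // struct of arrays: each field contiguous
|       std::vector<float> x, y, z;
|   };
|
|   // Strided access, hard to vectorize well:
|   float sum_x_aos(const std::vector<PointAoS> &p) {
|       float s = 0;
|       for (const auto &q : p) s += q.x;
|       return s;
|   }
|
|   // Unit-stride access, vectorizes trivially:
|   float sum_x_soa(const PointsSoA &p) {
|       float s = 0;
|       for (float v : p.x) s += v;
|       return s;
|   }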
| mandarax8 wrote:
| Are there any languages you know that do support these
| kinds of transformations?
| fulafel wrote:
| Besides the ones mentioned in the other comment, I think
| Halide is a good model of the kind of approach needed, but
| it is somewhat domain-specific. The idea is to express the
| problem in a sufficiently abstract form that the compiler
| has many freedoms in organizing how the computation
| concretely takes place. The ISPC approach seems solid as
| well, even though it's much less ambitious and doesn't do
| data layout optimizations.
|
| An important nontechnical problem with these is that
| there is such a long way to travel for a language to
| reach widespread adoption: it needs credible long-term
| sponsorship and commitment from a broad-shouldered user
| community or private organisation, etc.
| inetknght wrote:
| > _c++ don't support the required leeway of optimization
| transformations_
|
| What transformations do you think C++ doesn't support?
|
| I've got a _lot_ of experience in C++ and I've found that
| it's almost always the case that whatever library is being
| used needs to be rewritten for a small change. Rewriting
| libraries is rarely approved. And further, there's often
| very little unit testing being done towards that end too.
| NohatCoder wrote:
| I don't get why the baseline is so slow; a read, a write, and an
| add per clock should all be well within the capabilities of a
| modern processor.
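|
| For reference, the baseline being discussed is presumably just
| the plain scalar loop (my paraphrase, not copied from the
| article):
|
|   void prefix_sum(int *a, int n) {
|       for (int i = 1; i < n; i++)
|           a[i] += a[i - 1];   // a[i-1] can stay in a register, leaving
|   }                           // one load, one add, one store per element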
| dragontamer wrote:
| Because a 64-bit read from L1 or L2 cache takes the same amount
| of time as a 256-bit (or even 512-bit) SIMD read on a modern
| Intel or AMD processor.
|
| On L3 cache or DDR4, coalesced reads/writes are still more
| beneficial and 256-bit SIMD reads still have benefits over
| 64-bit reads, though it's less dramatic.
| NohatCoder wrote:
| It takes 1 read port for 1 clock. Any AVX2-capable processor
| has at least 1 read port and 1 separate write port, so we
| should get 1 processed element per clock; the reported number
| is only around a third of that.
| [deleted]
| dahart wrote:
| The author assumes an average of (slightly over) 2
| instructions per iteration, which means the theoretical
| limit is half an iteration per clock, assuming no cache
| misses ever. How can you reduce that to 1 processed element
| per clock in a single thread? Is there a load-add-store
| instruction, or a way to get multiple instructions per
| clock through a single thread?
| NohatCoder wrote:
| All modern high-performance processors do multiple
| instructions per clock.
| dahart wrote:
| There's a lot of ambiguity behind that statement (fetch,
| decode, queue, pipeline, etc.). I'm just saying what I
| read in the article: the peak bandwidth calculation was
| making the assumption of a peak throughput of one
| instruction per cycle, according to the author.
|
| The best answer might be to profile it yourself and see
| what happens on your processor; it's only a few lines of
| code and looks incredibly easy to do.
| NohatCoder wrote:
| There are usually a bunch of weird specifics when it comes
| to calculating performance, but this one happens to be
| quite simple: we bottleneck on 1 write per clock, and the
| add instruction has a latency of 1, so none of the hairy
| stuff should come into play.
|
| Not knowing what CPU, compiler, and settings were used, I
| can't do much to replicate the test.
| dahart wrote:
| It would be compelling enough to use the author's 3-line
| loop as-is on your processor with your own C or C++
| compiler. Or the compiled baseline assembly from the
| article...
| dahart wrote:
| I think I don't quite understand why the baseline is so fast.
| ;) But the article answers this question directly, and has a
| link to a longer explanation with data... "The reason why
| unidirectional and bidirectional memory accesses would perform
| differently is that they share the cache and memory buses and
| other CPU facilities. In the case of RAM, this causes a twofold
| difference in performance between the pure read and
| simultaneous read and write scenarios because the memory
| controller has to switch between the modes on the one-way
| memory bus, thus halving the bandwidth. The performance drop is
| less severe for the L2 cache: the bottleneck here is not the
| cache bus, so the incrementing loop loses by only ~15%."
| https://en.algorithmica.org/hpc/cpu-cache/bandwidth/#directi...
| NohatCoder wrote:
| We are not even close to the performance of ram. The cache
| can do a read and a write per clock, so it should be good for
| 1 iteration per clock. I think what may have happened is that
| the compiler did exactly what it was instructed to do: read
| back the value that it just wrote to memory. Figuring out that
| it is safe to just keep the intermediate value in a register
| is one of those simple things that can be surprisingly
| difficult for a compiler.
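|
| Concretely, the two shapes I have in mind (my own sketch; which
| one the compiler actually emits is the question):
|
|   // 1: naively, each iteration reloads the value just stored
|   void prefix_sum_reload(int *a, int n) {
|       for (int i = 1; i < n; i++)
|           a[i] = a[i - 1] + a[i]; // load a[i-1] (just written), load a[i], store
|   }
|
|   // 2: the running sum stays in a register, one load per element
|   void prefix_sum_register(int *a, int n) {
|       int s = 0;
|       for (int i = 0; i < n; i++) {
|           s += a[i];              // load + add
|           a[i] = s;               // store
|       }
|   }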
| dahart wrote:
| > We are not even close to the performance of ram
|
| * Late edit up front to add: you are of course absolutely
| correct that the ram perf has room, this is demonstrated by
| the article's speedups. If prefetching works, it means the
| baseline wasn't bottlenecked on peak ram bandwidth.
|
| The article was assuming that the peak theoretical
| throughput is instruction limited, not ram limited. Then it
| demonstrated empirically that ram was _somehow_ slowing it
| down further from the theoretical peak. Providing a counter
| example might be better than arguing over various
| assumptions?
|
| Could we elaborate with some specifics on the ram speed
| you're thinking of? Which kind are we talking about? What
| is your calculation of the bandwidth of this problem? I
| assume it's 8 bytes per iteration (4 read, 4 written), so
| for a typical processor at, say, 3.5 GHz and one iteration
| per clock, that would require a ram bandwidth of 28 GB/s.
| That's higher than most single-channel DDR4, right? I think
| most high-end desktops are dual channel with higher ram
| bandwidth, but I don't know what the bandwidth is in
| practice when you plow through memory linearly with minimal
| compute and no prefetching; the perf in the article doesn't
| strike me as being strange.
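|
| Spelling that estimate out (my own numbers; 25.6 GB/s is the
| peak of single-channel DDR4-3200):
|
|   constexpr double bytes_per_iter = 8.0;    // 4 read + 4 written
|   constexpr double clock_hz       = 3.5e9;  // one element per cycle
|   constexpr double needed_gbps    = bytes_per_iter * clock_hz / 1e9; // 28 GB/s
|   constexpr double ddr4_1ch_gbps  = 25.6;   // single-channel DDR4-3200 peak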
|
| > The cache can do a read and a write per clock, so should
| be good for 1 iteration per clock.
|
| It seems like the data in the article mostly supports that
| statement, with the caveat that an occasional cache miss
| will bring the average down a little, and the author claims
| that switching between read and write every single word is
| slower than reading large blocks.
|
| > I think what may have happened is that the compiler did
| exactly what it was instructed to do, read the value that
| it just wrote to memory.
|
| I'm not sure I understand what you mean. The code doesn't
| read the value after it writes, it writes over the last
| value read, and then moves to the next address, right? No
| need to speculate about the compiler, the author included
| x86 assembly, right? Are you suggesting the hardware might
| be treating the value as volatile and skipping the cache,
| and stalling a read? I think that would be a lot slower. Or
| do you mean something else?
| NohatCoder wrote:
| The observed throughput for the scalar code is a lot less
| than memory speed. A Zen 2/3 has a theoretical memory
| bandwidth of 51.2 GB/s, so it should still be fast enough
| for the theoretical case I make. The observed memory
| slowdown is for the vector code; it does multiple
| iterations per cycle, that is why.
|
| The switching between read and write part doesn't make
| much sense; there is no such switch in the cache, and the
| memory controller takes care of chunking memory accesses
| reasonably. Reading and writing is double the operations
| of just reading, of course, but we have covered that
| already.
|
| What I'm suggesting the code might do is: at every
| iteration, read the data at location i and location i-1,
| add them together, and store the result at location i.
| This is problematic because not only do we get an extra
| read, it also hits a cache location that was just written
| to. There is a guard for preventing this unnecessary
| read, but it might have failed because the location is
| not accessed as the same pointer with the same offset.
| We still keep it in cache, so a cost of 2 extra cycles
| per iteration seems reasonable.
| dahart wrote:
| > This is problematic because not only do we get an extra
| read
|
| I don't understand why you're speculating on this. The
| author provided the assembly, which has only 1 read.
|
| FWIW, I just tested this on my machine. I get the
| following results:
|
| Intel Core i7-7800 3.5Ghz / Ubuntu 20 / g++ 9.3.0
|   Rolled loop     : 2.08 Gflops | 16.7 GB/s bandwidth
|   8-unrolled loop : 2.48 Gflops | 19.8 GB/s bandwidth
|
| Here's the unrolled assembly (8 iters), with -O3:
|
|   .L3:
|       addl    (%rdi), %eax
|       addq    $32, %rdi
|       movl    %eax, -32(%rdi)
|       addl    -28(%rdi), %eax
|       movl    %eax, -28(%rdi)
|       addl    -24(%rdi), %eax
|       movl    %eax, -24(%rdi)
|       addl    -20(%rdi), %eax
|       movl    %eax, -20(%rdi)
|       addl    -16(%rdi), %eax
|       movl    %eax, -16(%rdi)
|       addl    -12(%rdi), %eax
|       movl    %eax, -12(%rdi)
|       addl    -8(%rdi), %eax
|       movl    %eax, -8(%rdi)
|       addl    -4(%rdi), %eax
|       movl    %eax, -4(%rdi)
|       cmpq    %rdi, %rdx
|       jne     .L3
| NohatCoder wrote:
| Okay, that was a bad guess. I see in the previous chapter
| that the processor is only clocked at 2 GHz; the results
| still feel a bit low, but they are certainly way more
| reasonable in that light. I still wonder exactly where the
| extra clock cycles go.
| udbhavs wrote:
| The whole book looks full of interesting low-level techniques.
| Do high-level JIT compilers like V8 or the JVM apply
| vectorisation or any of the other optimisations mentioned there,
| or is that level of fine-tuned performance only possible if you
| code them manually in C++?
| tralarpa wrote:
| People claim that the JVM is able to auto-vectorize simple
| loops: https://stackoverflow.com/questions/59725341/java-auto-
| vecto...
|
| That said, I tried the matrix-multiplication example from
| section 1.2 on different computers and with different JDK
| versions, but I didn't manage to convince the JVM JIT to emit
| SIMD instructions. No idea whether the loop is too complex for
| the JIT or whether there is some secret JVM option that I'm not
| aware of.
| sereja wrote:
| Author here. I wrote a separate section trying to answer this
| question: https://en.algorithmica.org/hpc/complexity/languages/
|
| tl;dr: to some extent, but mostly no
| marcos100 wrote:
| Hey author. This looks like a very approachable book given
| the subject.
|
| Still on chapter 2, but really enjoying it. Instant buy for
| me.
|
| Thanks for your effort.
| khoobid_shoma wrote:
| I think the topics are nicely explained in C and assembly, but
| not C++.
| saagarjha wrote:
| Most JITs have basic autovectorization support, but it's
| generally not that good.
| mhkool wrote:
| Since the performance for array sizes < L1-size and < L2-size is
| similar, I would like to see an attempt to improve B. B =
| L2-size / 2 / sizeof(int) - 16 might produce better results.
|
| Note also that _mm_broadcast_ss() is faster on newer processors.
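|
| For a concrete number, assuming a 512 KiB L2 (this varies by
| CPU), that formula gives:
|
|   constexpr int l2_size = 512 * 1024;                 // bytes, assumed
|   constexpr int B = l2_size / 2 / sizeof(int) - 16;   // = 65520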
___________________________________________________________________
(page generated 2022-02-12 23:01 UTC)