[HN Gopher] Prefix Sum with SIMD
       ___________________________________________________________________
        
       Prefix Sum with SIMD
        
       Author : gbrown_
       Score  : 104 points
       Date   : 2022-02-12 08:45 UTC (14 hours ago)
        
 (HTM) web link (en.algorithmica.org)
 (TXT) w3m dump (en.algorithmica.org)
        
       | [deleted]
        
       | JimBlackwood wrote:
        | This book is fantastic - and it comes at a perfect time too, as
        | I'm getting more work-related projects along the lines of 'make
        | slow code fast'.
       | 
       | Going to be reading this - and looking forward to any future
       | parts!
        
       | gfd wrote:
       | I've seen the author advertise his book on codeforces.com blogs
       | before if you want somewhere to reach him:
       | https://codeforces.com/blog/entry/99790
       | 
       | That might be a better intro than a random chapter of the book
       | and contextualizes why you might want to learn SIMD programming
       | (i.e., up to an order of magnitude speed-up vs STL
       | implementations).
        
         | jonstewart wrote:
          | Yes, but his work here on prefix sum is very recent; he just
         | started telling people about it on Twitter this past week
         | (specifically, the improvement due to interleaving).
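          | 
          | The basic in-register building block (a sketch of the idea, not
          | his exact code or the interleaving trick) is a prefix sum
          | within each SSE vector, done with byte shifts and adds:
          | 
          |     #include <emmintrin.h>  // SSE2
          | 
          |     // prefix sum of the 4 ints inside one 128-bit register
          |     __m128i local_prefix(__m128i x) {
          |         x = _mm_add_epi32(x, _mm_slli_si128(x, 4));
          |         x = _mm_add_epi32(x, _mm_slli_si128(x, 8));
          |         return x;
          |     }
          | 
          | A full pass then carries a running total across vectors by
          | broadcasting the last element of each processed block and
          | adding it to everything that follows; the interleaving
          | improvement is, as I understand it, about how those two passes
          | are scheduled relative to each other.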
        
       | fulafel wrote:
        | It's interesting that SIMD came to the mainstream 25 years ago,
        | yet compilers, our apps, and PL tech are still quite far from
        | utilizing it effectively outside of nonportable, manually coded,
        | SIMD-aware compute kernels written in glorified assembler. There
        | are some exceptions, like ispc and GPU languages such as OpenCL
        | and Futhark (GPU people say "cores" when they mean SIMD lanes!)...
        
         | torginus wrote:
         | The really strange thing about this is that GPU shaders are
         | essentially SIMD programs - but nobody, except for some real
          | hardcore gamedevs, programs SIMD assembly directly; the
          | compiler handles it for you. There's ISPC, which is basically a
          | C-like language and SIMD compiler that translates its scalar
          | code into SIMD assembly, but other than that, SIMD code tends
          | to be hand rolled.
        
         | fleabitdev wrote:
         | I recently experimented with Rust's std::simd library
         | prototype, and it gave me the same sense of wasted potential.
         | 
         | My program performed complicated processing of large 8-bit
         | grayscale images. Naive use of portable SIMD made several
         | important parts of the algorithm run ten times faster than my
         | integrated-GPU implementation! (OpenGL ES 3.0 over ANGLE on an
         | Intel HD Graphics 520)
         | 
         | I think the sheer power of modern SIMD has slipped past most
          | people's radar. A multicore processor with AVX2 has number-
         | crunching throughput comparable to a low-end GPU, and for some
         | workloads, the extra flexibility of the CPU can unlock an
         | additional order-of-magnitude performance uplift. (The
         | comparison may be a little different if you're programming the
         | GPU with modern compute shaders - I'm not sure.)
         | 
         | Portable SIMD programming is also just fun! Trying to keep my
         | working set within the 32 KiB L1 cache, trying to keep my data
         | packed within individual 8-bit lanes, and lacking access to
         | some basic operations like integer multiplication and division
         | - it almost feels like programming for a retro games console.
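          | 
          | To illustrate what "packed within individual 8-bit lanes" buys
          | you: with AVX2 one instruction touches 32 grayscale pixels.
          | Something like this (a C++ intrinsics sketch of the idea; my
          | actual code used Rust's std::simd):
          | 
          |     #include <immintrin.h>  // AVX2
          | 
          |     // brighten 32 8-bit pixels per step, saturating at 255
          |     // instead of wrapping around
          |     void brighten(unsigned char *p, int n, unsigned char d) {
          |         __m256i dv = _mm256_set1_epi8(d);
          |         int i = 0;
          |         for (; i + 32 <= n; i += 32) {
          |             __m256i v =
          |                 _mm256_loadu_si256((const __m256i *)(p + i));
          |             v = _mm256_adds_epu8(v, dv);
          |             _mm256_storeu_si256((__m256i *)(p + i), v);
          |         }
          |         for (; i < n; i++)  // scalar tail
          |             p[i] = p[i] + d > 255 ? 255 : p[i] + d;
          |     }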
        
           | dahart wrote:
           | > it almost feels like programming for a retro games console.
           | 
           | I get this feeling writing CUDA code all the time, having to
           | keep my stack frames and local data small, figuring out how
           | to even out and time-slice the work. It's a little like
            | Arduino coding in a way, except that instead of one of them
            | running at 16 MHz, I have ten thousand of them running at
            | 2 GHz (and as a bonus they have floating-point superpowers).
        
           | Royi wrote:
           | Have you looked at [ISPC - Intel SPMD Program Compiler][0]?
           | [0]: https://github.com/ispc/ispc
        
           | shakow wrote:
           | I concur. I recently ported some bioinfo computations from
           | scalar to SIMD with primitives from core::arch, and the gain
           | was an immediate ~10x.
           | 
            | It is a shame that one needs to fiddle with RUSTFLAGS just
           | for SIMD extensions to trigger though :/
        
         | zozbot234 wrote:
         | LLVM does a pretty good job of vectorizing code these days, and
         | pretty much everything is using it. And making vectorization
         | more effective is one of the most workable avenues for
         | improving compiler tech.
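          | 
          | A minimal example of the kind of loop it handles well (a
          | sketch; any simple element-wise loop will do):
          | 
          |     // LLVM's loop vectorizer will typically turn this into
          |     // packed SIMD adds at -O2/-O3, modulo aliasing checks
          |     void add(float *a, const float *b, int n) {
          |         for (int i = 0; i < n; i++)
          |             a[i] += b[i];
          |     }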
        
           | fulafel wrote:
            | Given the constraints, LLVM frequently does a relatively OK
            | job with the pieces it has to work with, but the results are
            | usually poor in absolute terms vs. what manual optimizations
            | would allow. The language semantics of C/C++ don't support
            | the required leeway of optimization transformations (e.g.
            | optimizing data layout hand in hand with its access
            | patterns, parallel semantics).
        
             | mandarax8 wrote:
             | Are there any languages you know that do support these
             | kinds of transformations?
        
               | fulafel wrote:
                | Besides the ones mentioned in the other comment, Halide I
                | think is a good model of the kind of approach needed,
                | though it is somewhat domain-specific: the problem is
                | expressed in a sufficiently abstract form that the
                | compiler has a lot of freedom in organizing how the
                | computation concretely takes place. The ISPC approach
                | seems solid as well, even though it's much less ambitious
                | and doesn't do data layout optimizations.
                | 
                | An important nontechnical problem with these is that
                | there is such a long way to travel for a language to
                | reach widespread adoption, needing credible long-term
                | sponsorship and commitment from a broad-shouldered user
                | community or private organisation, etc.
        
             | inetknght wrote:
              | > _c++ don't support the required leeway of optimization
              | transformations_
             | 
             | What transformations do you think C++ doesn't support?
             | 
             | I've got a _lot_ of experience in C++ and I've found that
              | it's almost always the case that whatever library is being
              | used needs to be rewritten for a small change. Rewriting
             | libraries is rarely approved. And further, there's often
             | very little unit testing being done towards that end too.
        
       | NohatCoder wrote:
        | I don't get why the baseline is so slow; a read, a write, and an
       | add per clock should all be within the capabilities of a modern
       | processor.
        
         | dragontamer wrote:
         | Because a 64-bit read to L1 or L2 cache takes the same amount
         | of time as a 256-bit (or even 512-bit) SIMD read on a modern
         | Intel or AMD processor.
         | 
          | On L3 cache or DDR4, coalesced reads/writes are still more
          | beneficial, and 256-bit SIMD reads still have benefits over
          | 64-bit reads, though it's less dramatic.
        
           | NohatCoder wrote:
            | It takes 1 read port for 1 clock. Any AVX2-capable processor
            | has at least 1 read port and 1 separate write port, so we
            | should get 1 processing cycle per clock; the reported number
            | is only around a third of that.
        
             | [deleted]
        
             | dahart wrote:
             | The author assumes an average of (slightly over) 2
             | instructions per iteration, which means the theoretical
             | limit is a half iteration per clock, assuming no cache
             | misses ever. How can you reduce that to 1 processing cycle
                | per clock in a single thread? Is there a load-add-store
             | instruction, or a way to get multiple instructions per
             | clock through a single thread?
        
               | NohatCoder wrote:
               | All modern high-performance processors do multiple
               | instructions per clock.
        
               | dahart wrote:
                | There's a lot of ambiguity behind that statement (fetch,
                | decode, queue, pipeline, etc.). I'm just saying what I
                | read in the article: the peak bandwidth calculation was
                | making the assumption of a peak throughput of one
                | instruction per cycle, according to the author.
                | 
                | The best answer might be to profile it yourself and see
                | what happens on your processor; it's only a few lines of
                | code and looks incredibly easy to do.
        
               | NohatCoder wrote:
               | There is usually a bunch of weird specifics when it comes
               | to calculating performance, but this one happens to be
                | quite simple, as we bottleneck on 1 write per clock and
                | an add instruction latency of 1, so none of the hairy
                | stuff should come into play.
                | 
                | Not knowing what CPU, compiler, and settings were used, I
                | can't do much to replicate the test.
        
               | dahart wrote:
               | It would be compelling enough to use the author's 3-line
               | loop as-is on your processor with your own C or C++
               | compiler. Or the compiled baseline assembly from the
               | article...
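                | 
                | For reference, the baseline loop in question is
                | essentially this (my paraphrase, not copied verbatim from
                | the article):
                | 
                |     // in-place prefix sum, the scalar baseline
                |     void prefix_sum(int *a, int n) {
                |         for (int i = 1; i < n; i++)
                |             a[i] += a[i - 1];
                |     }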
        
         | dahart wrote:
         | I think I don't quite understand why the baseline is so fast.
          | ;) But the article answers this question directly, and has a
         | link to a longer explanation with data... "The reason why
         | unidirectional and bidirectional memory accesses would perform
         | differently is that they share the cache and memory buses and
         | other CPU facilities. In the case of RAM, this causes a twofold
         | difference in performance between the pure read and
         | simultaneous read and write scenarios because the memory
         | controller has to switch between the modes on the one-way
         | memory bus, thus halving the bandwidth. The performance drop is
         | less severe for the L2 cache: the bottleneck here is not the
         | cache bus, so the incrementing loop loses by only ~15%."
         | https://en.algorithmica.org/hpc/cpu-cache/bandwidth/#directi...
        
           | NohatCoder wrote:
            | We are not even close to the performance of RAM. The cache
            | can do a read and a write per clock, so it should be good for
            | 1 iteration per clock. I think what may have happened is that
            | the compiler did exactly what it was instructed to do: read
            | the value that it just wrote to memory. Figuring out that it
            | is safe to just keep the intermediate value in a register is
            | one of those simple things that can be surprisingly difficult
            | for a compiler.
        
             | dahart wrote:
              | > We are not even close to the performance of RAM
              | 
              | * Late edit up front to add: you are of course absolutely
              | correct that the RAM perf has room; this is demonstrated by
              | the article's speedups. If prefetching works, it means the
              | baseline wasn't bottlenecked on peak RAM bandwidth.
              | 
              | The article was assuming that the peak theoretical
              | throughput is instruction limited, not RAM limited. Then it
              | demonstrated empirically that RAM was _somehow_ slowing it
              | down further from the theoretical peak. Providing a counter
              | example might be better than arguing over various
              | assumptions?
              | 
              | Could you elaborate with some specifics on the RAM speed
              | you're thinking of? Which kind are we talking about? What
              | is your calculation of the bandwidth of this problem? I
              | assume it's 8 bytes per iteration (4 read, 4 write), so for
              | a typical processor at, say, 3.5 GHz and one instruction
              | per clock, that would require a RAM bandwidth of 28 GB/s.
              | That's higher than most single-channel DDR4, right? I think
              | most high-end desktops are dual channel with higher RAM
              | bandwidth, but I don't know what the bandwidth is in
              | practice when you plow through memory linearly with minimal
              | compute and no prefetching, so the perf in the article
              | doesn't strike me as being strange.
             | 
             | > The cache can do a read and a write per clock, so should
             | be good for 1 iteration per clock.
             | 
             | It seems like the data in the article mostly supports that
             | statement, with the caveat that an occasional cache miss
             | will bring the average down a little, and the author claims
             | that switching between read and write every single word is
             | slower than reading large blocks.
             | 
             | > I think what may have happened is that the compiler did
             | exactly what it was instructed to do, read the value that
             | it just wrote to memory.
             | 
             | I'm not sure I understand what you mean. The code doesn't
              | read the value after it writes; it writes over the last
              | value read and then moves to the next address, right? No
              | need to speculate about the compiler - the author included
              | the x86 assembly, right? Are you suggesting the hardware
              | might be treating the value as volatile, skipping the
              | cache, and stalling a read? I think that would be a lot
              | slower. Or do you mean something else?
        
               | NohatCoder wrote:
               | The observed throughput for the scalar code is a lot less
               | than memory speed. A Zen 2/3 has a theoretical memory
                | bandwidth of 51.2 GB/s, so it should still be fast enough
                | for the theoretical case I make. The observed memory
                | slowdown is for the vector code; it does multiple
                | iterations at a time, which is why.
               | 
                | The switching-between-read-and-write part doesn't make
                | much sense; there is no such switch in cache, and the
                | memory controller takes care of chunking memory accesses
                | reasonably. Reading and writing is double the operations
                | of just reading, of course, but we have covered that
                | already.
               | 
                | What I'm suggesting the code might do is: at every
                | iteration, read the data at location i and location i-1,
                | add them together, and store the result at location i.
                | This is problematic because not only do we get an extra
                | read, it also happens on a cache location that was just
                | written to. There is a guard for preventing this
                | unnecessary read, but it might have failed because the
                | location isn't accessed as the same pointer with the same
                | offset. We still keep it in cache, so a cost of 2 extra
                | cycles per iteration seems reasonable.
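                | 
                | In code, the two formulations I have in mind (a sketch):
                | 
                |     // as written: a[i - 1] names a location that the
                |     // previous iteration just stored to
                |     for (int i = 1; i < n; i++)
                |         a[i] = a[i] + a[i - 1];
                | 
                |     // what the compiler should emit: keep the running
                |     // sum in a register, one load and one store per
                |     // element
                |     int s = 0;
                |     for (int i = 0; i < n; i++) {
                |         s += a[i];
                |         a[i] = s;
                |     }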
        
               | dahart wrote:
               | > This is problematic because not only do we get an extra
               | read
               | 
               | I don't understand why you're speculating on this. The
               | author provided the assembly that only has 1 read.
               | 
               | FWIW, I just tested this on my machine. I get the
               | following results:
               | 
                |     Intel Core i7-7800 3.5 GHz / Ubuntu 20 / g++ 9.3.0
                |     Rolled loop     : 2.08 Gflops | 16.7 GB/s bandwidth
                |     8-unrolled loop : 2.48 Gflops | 19.8 GB/s bandwidth
               | 
               | Here's the unrolled assembly (8 iters), with -O3:
                | .L3:
                |     addl    (%rdi), %eax
                |     addq    $32, %rdi
                |     movl    %eax, -32(%rdi)
                |     addl    -28(%rdi), %eax
                |     movl    %eax, -28(%rdi)
                |     addl    -24(%rdi), %eax
                |     movl    %eax, -24(%rdi)
                |     addl    -20(%rdi), %eax
                |     movl    %eax, -20(%rdi)
                |     addl    -16(%rdi), %eax
                |     movl    %eax, -16(%rdi)
                |     addl    -12(%rdi), %eax
                |     movl    %eax, -12(%rdi)
                |     addl    -8(%rdi), %eax
                |     movl    %eax, -8(%rdi)
                |     addl    -4(%rdi), %eax
                |     movl    %eax, -4(%rdi)
                |     cmpq    %rdi, %rdx
                |     jne     .L3
        
               | NohatCoder wrote:
                | Okay, that was a bad guess. I see in the previous chapter
                | that the processor is only clocked at 2 GHz; the results
                | still feel a bit low, but certainly way more reasonable
               | in that light. I still wonder exactly where the extra
               | clock cycles go.
        
       | udbhavs wrote:
        | The whole book looks full of interesting low-level techniques.
        | Do high-level JIT compilers like V8 or the JVM apply
        | vectorisation or any of the other optimisations mentioned there,
        | or is that level of fine-tuned performance only possible if you
        | code it manually in C++?
        
         | tralarpa wrote:
         | People claim that the JVM is able to auto-vectorize simple
         | loops: https://stackoverflow.com/questions/59725341/java-auto-
         | vecto...
         | 
         | That said, I tried the matrix-multiplication example from
          | section 1.2 on different computers and with different JDK
          | versions, but I didn't manage to convince the JVM JIT to emit
         | SIMD instructions. No idea whether the loop is too complex for
         | the JIT or whether there is some secret JVM option that I'm not
         | aware of.
        
         | sereja wrote:
         | Author here. I wrote a separate section trying to answer this
         | question: https://en.algorithmica.org/hpc/complexity/languages/
         | 
         | tl;dr: to some extent, but mostly no
        
           | marcos100 wrote:
           | Hey author. This looks like a very approachable book given
           | the subject.
           | 
           | Still on chapter 2, but really enjoying it. Instant buy for
           | me.
           | 
           | Thanks for your effort.
        
         | khoobid_shoma wrote:
          | I think the topics are nicely explained in C and assembly, but
          | not C++.
        
         | saagarjha wrote:
         | Most JITs have basic autovectorization support, but it's
         | generally not that good.
        
       | mhkool wrote:
       | Since the performance for array sizes <L1-size and <L2-size is
        | similar, I would like to see an attempt to improve B. B =
       | L2-size / 2 / sizeof(int) - 16 might produce better results.
       | 
       | Note also that _mm_broadcast_ss() is faster on newer processors.
        
       ___________________________________________________________________
       (page generated 2022-02-12 23:01 UTC)