[HN Gopher] Prefix sum: 20 GB/s (2.6x baseline)
       ___________________________________________________________________
        
       Prefix sum: 20 GB/s (2.6x baseline)
        
       Author : ashtonsix
       Score  : 71 points
       Date   : 2025-10-14 16:53 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | tonetegeatinst wrote:
       | Wonder if PTX programming for a GPU would accelerate this.
        
         | almostgotcaught wrote:
         | Lol do you think "PTX programming" is some kind of trick path
         | to perf? It's just inline asm. Sometimes it's necessary but
         | most of the time "CUDA is all you need":
         | 
         | https://github.com/b0nes164/GPUPrefixSums
        
         | ashtonsix wrote:
         | If the data is already in GPU memory, yes. Otherwise you'll be
         | limited by the DRAM<->VRAM memory bottleneck.
         | 
          | When we consider that delta coding (and its relatives) is
          | typically applied as one step in a series of CPU-first
          | transforms and benefits from L1-L3 caching, we find CPU
          | throughput pulls far ahead of GPU-based approaches for
          | typical workloads.
         | 
         | This note holds for all GPU-based approaches, not just PTX.
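          | 
          | (For anyone wondering how delta coding relates to the title:
          | decoding deltas back to absolute values is exactly a prefix
          | sum. A minimal scalar sketch, nothing like the optimized
          | kernel in the repo:)
          | 
          |   #include <stddef.h>
          |   #include <stdint.h>
          | 
          |   /* delta decode in place: v[i] becomes v[0] + ... + v[i] */
          |   void delta_decode_scalar(uint32_t *v, size_t n) {
          |       uint32_t acc = 0;
          |       for (size_t i = 0; i < n; i++) {
          |           acc += v[i];  /* running (prefix) sum */
          |           v[i] = acc;   /* serial carry that limits throughput */
          |       }
          |   }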
        
           | _zoltan_ wrote:
           | what is a typical workload that you speak of, where CPUs are
           | better?
           | 
           | We've been implementing GPU support in Presto/Velox for
           | analytical workloads and I'm yet to see a use case where we
           | wouldn't pull ahead.
           | 
            | The DRAM-VRAM memory bottleneck isn't really a bottleneck
            | on GH/GB platforms (you can pull 400+ GB/s across the C2C
            | NVLink). And on NVL8 systems like the typical A100/H100
            | deployments out there, running real workloads where the
            | data comes in over the network links, you're toast without
            | GPUDirect RDMA anyway.
        
             | ashtonsix wrote:
             | By typical I imagined adoption within commonly-deployed
             | TSDBs like Prometheus, InfluxDB, etc.
             | 
             | GB/GH are actually ideal targets for my code: both
             | architectures integrate Neoverse V2 cores, the same core I
             | developed for. They are superchips with 144/72 CPU cores
             | respectively.
             | 
             | The perf numbers I shared are for one core, so multiply the
             | numbers I gave by 144/72 to get expected throughput on
              | GB/GH. As you (apparently?) have access to this hardware,
              | I'd sincerely appreciate it if you could benchmark my
              | code there and share the results.
        
               | _zoltan_ wrote:
               | GB is CPU+2xGPU.
               | 
                | GH is readily available to anybody at $1.50 per hour
                | on Lambda; GB is harder to get, and we're only just
                | beginning to experiment with it.
        
               | ashtonsix wrote:
               | Each Grace CPU has multiple cores:
               | https://www.nvidia.com/en-gb/data-center/grace-cpu-
               | superchip
               | 
               | This superchip (might be different to whichever you're
               | referring to) has 2 CPUs (144 cores):
               | https://developer.nvidia.com/blog/nvidia-grace-cpu-
               | superchip...
        
             | tmostak wrote:
              | Even without NVLink C2C, on a GPU with 16x PCIe 5.0 lanes
              | to host, you have 128 GB/s in theory and 100+ GB/s in
              | practice of bidirectional bandwidth (half that in each
              | direction), so you still come out ahead with pipelining.
             | 
             | Of course prefix sums are often used within a series of
             | other operators, so if these are already computed on GPU,
             | you come out further ahead still.
        
               | ashtonsix wrote:
               | Haha... GPUs are great. But do you mean to suggest we
               | should swap a single ARM core for a top-line GPU with
               | 10k+ cores and compare numbers on that basis? Surely not.
               | 
               | Let's consider this in terms of throughput-per-$ so we
               | have a fungible measurement unit. I think we're all
               | agreed that this problem's bottleneck is the host
               | memory<->compute bus so the question is: for $1 which
               | server architecture lets you pump more data from memory
               | to a compute core?
               | 
                | It looks like you can get an H100 GPU with 16x PCIe
                | 5.0 (128 GB/s theoretical, 100 GB/s realistic) for
                | $1.99/hr from RunPod.
               | 
                | With an m8g.8xlarge instance (32 ARM CPU cores) you
                | should get much better RAM<->CPU throughput (175 GB/s
                | realistic) for $1.44/hr from AWS.
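                | 
                | Working that through: 100 / 1.99 is ~50 GB/s per
                | dollar-hour for the H100 route, versus 175 / 1.44 =
                | ~122 GB/s per dollar-hour for the Graviton instance;
                | roughly a 2.4x advantage on that metric.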
        
         | bassp wrote:
          | Yes! There's a canonical algorithm called the "Blelloch scan"
          | for prefix sum (aka prefix scan, because you can generalize
          | "sum" to "any binary associative function") that's very GPU-
          | friendly. I have... fond is the wrong word, but "strong"
          | memories of implementing it in a parallel programming class :)
         | 
         | Here's a link to a pretty accessible writeup, if you're curious
         | about the details:
         | https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...
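          | 
          | If you just want the shape of the algorithm without the GPU
          | plumbing, here's a serial sketch of the two phases (up-sweep
          | then down-sweep) for an exclusive scan over a power-of-two-
          | sized array; on a GPU each depth level's inner loop runs as
          | one parallel step:
          | 
          |   #include <stddef.h>
          |   #include <stdint.h>
          | 
          |   /* Blelloch-style exclusive prefix sum, n a power of two. */
          |   void blelloch_scan(uint32_t *a, size_t n) {
          |       /* up-sweep (reduce): build partial sums up a tree */
          |       for (size_t d = 1; d < n; d *= 2)
          |           for (size_t i = 2 * d - 1; i < n; i += 2 * d)
          |               a[i] += a[i - d];
          | 
          |       /* down-sweep: push the partial sums back down */
          |       a[n - 1] = 0;
          |       for (size_t d = n / 2; d >= 1; d /= 2)
          |           for (size_t i = 2 * d - 1; i < n; i += 2 * d) {
          |               uint32_t t = a[i - d];
          |               a[i - d] = a[i];
          |               a[i] += t;
          |           }
          |   }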
        
           | ashtonsix wrote:
           | Mm, I used that exact writeup as a reference to implement
           | this algorithm in WebGL 3 years ago: https://github.com/ashto
           | nsix/webglc/blob/main/src/kernel/sca...
           | 
           | It even inspired the alternative "transpose" method I
           | describe in the OP README.
        
         | TinkersW wrote:
          | Your average non-shared-memory GPU communicates with the CPU
          | over PCIe, which is dogshit slow, like 100x slower than DRAM.
         | 
          | I can upload an average of about 3.7 MB per millisecond to my
          | GPU (PCIe gen 3, x8), but it can be spiky and sometimes take
          | longer than you might expect.
         | 
          | By comparison, a byte-based AVX2 prefix scan can pretty much
          | run at the speed of DRAM, so there is never any reason to
          | transfer to the GPU.
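          | 
          | For anyone curious what that looks like, an untested sketch
          | of the in-register part (wrapping uint8 adds; the running
          | total still has to be broadcast from one 32-byte block to
          | the next):
          | 
          |   #include <immintrin.h>
          | 
          |   /* Inclusive prefix sum of 32 uint8 lanes (mod 256):
          |      Hillis-Steele shifts within each 128-bit lane, then add
          |      the low lane's total into every byte of the high lane. */
          |   static __m256i prefix_sum_u8(__m256i v) {
          |       v = _mm256_add_epi8(v, _mm256_slli_si256(v, 1));
          |       v = _mm256_add_epi8(v, _mm256_slli_si256(v, 2));
          |       v = _mm256_add_epi8(v, _mm256_slli_si256(v, 4));
          |       v = _mm256_add_epi8(v, _mm256_slli_si256(v, 8));
          |       /* carry = [0 | low lane]; broadcast its byte 15 per lane */
          |       __m256i carry = _mm256_permute2x128_si256(v, v, 0x08);
          |       carry = _mm256_shuffle_epi8(carry, _mm256_set1_epi8(15));
          |       return _mm256_add_epi8(v, carry);
          |   }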
        
       | nullbyte wrote:
       | This code looks like an alien language to me. Or maybe I'm just
       | rusty at C.
        
         | ashtonsix wrote:
         | The weirdness probably comes from heavy use of "SIMD
         | intrinsics" (Googleable term). These are functions with a 1:1
         | correspondence to assembly instructions, used for processing
         | multiple values per instruction.
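          | 
          | For example, on the Neoverse cores this targets, something
          | like the following (illustrative, not the repo's kernel) maps
          | each intrinsic to a single NEON instruction:
          | 
          |   #include <arm_neon.h>
          | 
          |   /* Inclusive prefix sum of 4 uint32 lanes. Each intrinsic
          |      maps to one instruction: vdupq_n_u32 -> MOVI,
          |      vextq_u32 -> EXT, vaddq_u32 -> ADD. */
          |   static uint32x4_t prefix_sum_4(uint32x4_t v) {
          |       uint32x4_t z = vdupq_n_u32(0);
          |       v = vaddq_u32(v, vextq_u32(z, v, 3)); /* up 1 lane, zero in */
          |       v = vaddq_u32(v, vextq_u32(z, v, 2)); /* up 2 lanes, zero in */
          |       return v;
          |   }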
        
         | mananaysiempre wrote:
         | SIMD intrinsics are less C and more assembly with overlong
         | mnemonics and a register allocator, so even reading them is
         | something of a separate skill. Unlike the skill of achieving
         | meaningful speedups by writing them (i.e. low-level
         | optimization), it's nothing special, but expect to spend a lot
         | of time jumping between the code and the reference manuals[1,2]
         | at first.
         | 
         | [1] https://www.intel.com/content/www/us/en/docs/intrinsics-
         | guid...
         | 
         | [2] https://developer.arm.com/architectures/instruction-
         | sets/int...
        
       | yogishbaliga wrote:
        | Way back, I used delta encoding for storing posting lists (the
        | inverted index for a search index). I experimented with using
        | GPUs for decoding the posting lists. It turned out that, as
        | another reply mentioned, copying the posting list from CPU
        | memory to GPU memory took way too long. If the posting list is
        | static, it can be copied to GPU memory once, which makes
        | decoding faster. But there is still a bottleneck in copying the
        | results back into CPU memory.
       | 
        | Nvidia's unified memory architecture may make this better, as
        | the same memory can be shared between the CPU and GPU.
        
         | Certhas wrote:
         | AMD has had unified memory for ages in HPC and for a while now
         | in the Strix Halo systems. I haven't had the chance to play
         | with one yet, but I have high hopes for some of our complex
         | simulation workloads.
        
         | ashtonsix wrote:
         | Oh neat. I have some related unpublished SOTA results I want to
         | release soon: PEF/BIC-like compression ratios, with faster
          | boolean algebra than Roaring Bitmaps.
        
         | hughw wrote:
         | The shared memory architecture doesn't eliminate copying the
         | data across to the device. Edit: or back.
        
           | yogishbaliga wrote:
            | If it is unified memory, the CPU can access the result of
            | GPU processing without copying it to CPU memory
            | (theoretically).
        
             | hughw wrote:
              | If the CPU touches an address mapped to the GPU, doesn't
              | it fault a page into the CPU address space? The program
              | doesn't do anything special, but a page gets faulted in,
              | I believe.
        
       | Galanwe wrote:
       | While the results look impressive, I can't help but think "yeah
       | but had you stored an absolute value every X deltas instead of
       | just a stream of deltas, you would have had a perfectly scalable
       | parallel decoding"
        
         | ashtonsix wrote:
         | I just did a mini-ablation study for this (prefix sum). By
         | getting rid of the cross-block carry (16 values), you can
         | increase perf from 19.85 to 23.45 GB/s: the gain is modest as
         | most performance is lost on accumulator carry within the block.
         | 
         | An absolute value every 16 deltas would undermine compression:
         | a greater interval would lose even the modest performance gain,
         | while a smaller interval would completely lose the
         | compressibility benefits of delta coding.
         | 
         | It's a different matter, although there is definitely plausible
         | motivation for absolute values every X deltas: query/update
         | locality (mini-partition-level). You wouldn't want to transcode
         | a huge number of values to access/modify a small subset.
        
           | zamadatix wrote:
            | I think what GP is saying is you can drop an absolute value
            | every e.g. 8192 elements (or even orders of magnitude more
            | if you're actually processing GBs of elements), and this
            | frees you to compute the blocks in parallel threads in a
            | dependency-free manner. The latency for a block would still
            | be bound by the single-core rate, but the throughput of a
            | stream is likely memory bound after 2-3 cores. It still
            | hurts the point of doing delta compression, but not nearly
            | as badly as every 16 values would.
           | 
            | Even if one is willing to adopt such an encoding scheme,
            | you'd still want to optimize what you have here anyway. It
            | also doesn't help, as mentioned, if the goal is
           | actually latency of small streams rather than throughput of
           | large ones.
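            | 
            | The layout would be something like this (illustrative
            | sketch, not the OP's format): one absolute base per block
            | plus per-block deltas, so blocks decode with no shared
            | state:
            | 
            |   #include <stddef.h>
            |   #include <stdint.h>
            | 
            |   #define BLOCK 8192  /* illustrative block size */
            | 
            |   static void decode_block(uint32_t base,
            |                            const uint32_t *deltas,
            |                            uint32_t *out, size_t n) {
            |       uint32_t acc = base;
            |       for (size_t i = 0; i < n; i++) {
            |           acc += deltas[i];  /* carry stays in the block */
            |           out[i] = acc;
            |       }
            |   }
            | 
            |   static void decode_all(const uint32_t *bases,
            |                          const uint32_t *deltas,
            |                          uint32_t *out, size_t nblocks) {
            |       /* no cross-block dependency, so this loop can be
            |          spread across threads (e.g. omp parallel for) */
            |       for (size_t b = 0; b < nblocks; b++)
            |           decode_block(bases[b], deltas + b * BLOCK,
            |                        out + b * BLOCK, BLOCK);
            |   }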
        
             | ashtonsix wrote:
             | Oh right. That's sensible enough. Makes total sense to
             | parallelise across multiple cores.
             | 
             | I wouldn't expect a strictly linear speed-up due to
             | contention on the memory bus, but it's not as bad as flat-
             | lining after engaging 2-3 cores. On most AWS Graviton
             | instances you should be able to pull ~5.5 GB/s per-core
             | even with all cores active, and that becomes less of a
             | bottleneck when considering you'll typically run a sequence
             | of compute operations between memory round-trips (not just
             | delta).
        
         | jdonaldson wrote:
          | While that sounds like a dealbreaker, I can't help but think
          | "yeah but if a decoding method took advantage of the prefix
          | in a similarly scalable way, one would reap the same
          | benefits".
        
         | shoo wrote:
          | fgiesen has a 3-part series of blog posts from 2018, "reading
          | bits in far too many ways", focusing on the problem of reading
          | data encoded with a variable-bit-length code.
         | 
         | Part 3 explores a straightforward idea that can be used to
         | improve performance of bitstream decoding - decode multiple
         | streams at once.
         | 
         | > if decoding from a single bitstream is too serial, then why
         | not decode from multiple bitstreams at once?
         | 
         | > So how many streams should you use? It depends. At this
         | point, for anything that is even remotely performance-
         | sensitive, I would recommend trying at least N=2 streams. Even
         | if your decoder has a lot of other stuff going on (computations
         | with the decoded values etc.), bitstream decoding tends to be
         | serial enough that there's many wasted cycles otherwise, even
         | on something relatively narrow like a dual-issue in-order
         | machine.
         | 
         | > If you're using multiple streams, you need to decide how
         | these multiple streams get assembled into the final output
         | bitstream. If you don't have any particular reason to go with a
         | fine-grained interleaving, the easiest and most straightforward
         | option is to concatenate the sub-streams
         | 
          | For delta coding, this suggests that instead of resetting the
          | delta stream with an absolute value every X deltas for a
          | small value of X, we could perhaps use a much larger value of
          | X, i.e. partition the input into coarse chunks, delta encode
          | each, and then rewrite our decoder to decode a pair (or more)
          | of the resulting streams at once.
         | 
         | Unclear if that would help at all here, or if there's already
         | enough independent work for the CPU to keep busy with.
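          | 
          | For a pair of such streams, the decode loop might look like
          | this (illustrative sketch): two independently delta-coded
          | halves advanced in the same loop, so the two serial carries
          | can overlap on an out-of-order core:
          | 
          |   #include <stddef.h>
          |   #include <stdint.h>
          | 
          |   /* Prefix-sum two independent delta streams in one loop.
          |      Each stream keeps its serial carry, but the two carries
          |      don't depend on each other. */
          |   static void decode2(uint32_t *a, uint32_t *b, size_t n) {
          |       uint32_t acc_a = 0, acc_b = 0;
          |       for (size_t i = 0; i < n; i++) {
          |           acc_a += a[i];  a[i] = acc_a;
          |           acc_b += b[i];  b[i] = acc_b;
          |       }
          |   }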
         | 
         | Part 1 https://fgiesen.wordpress.com/2018/02/19/reading-bits-
         | in-far...
         | 
         | Part 2 https://fgiesen.wordpress.com/2018/02/20/reading-bits-
         | in-far...
         | 
         | Part 3 https://fgiesen.wordpress.com/2018/09/27/reading-bits-
         | in-far...
        
       | jasonthorsness wrote:
       | "We achieve 19.8 GB/s prefix sum throughput--1.8x faster than a
       | naive implementation and 2.6x faster than FastPFoR"
       | 
       | "FastPFoR is well-established in both industry and academia.
       | However, on our target platform (Graviton4, SIMDe-compiled) it
       | benchmarks at only ~7.7 GB/s, beneath a naive scalar loop at
       | ~10.8 GB/s."
       | 
       | I thought the first bit was a typo but it was correct; the naive
       | approach was faster than a "better" method. Another demonstration
       | of how actually benchmarking on the target platform is important!
        
       ___________________________________________________________________
       (page generated 2025-10-14 23:00 UTC)