[HN Gopher] Prefix sum: 20 GB/s (2.6x baseline)
___________________________________________________________________
Prefix sum: 20 GB/s (2.6x baseline)
Author : ashtonsix
Score : 71 points
Date : 2025-10-14 16:53 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| tonetegeatinst wrote:
| Wonder if PTX programming for a GPU would accelerate this.
| almostgotcaught wrote:
| Lol do you think "PTX programming" is some kind of trick path
| to perf? It's just inline asm. Sometimes it's necessary but
| most of the time "CUDA is all you need":
|
| https://github.com/b0nes164/GPUPrefixSums
| ashtonsix wrote:
| If the data is already in GPU memory, yes. Otherwise you'll be
| limited by the DRAM<->VRAM memory bottleneck.
|
| When we consider that delta coding (and its family) is typically
| applied as one step in a series of CPU-first transforms and
| benefits from L1-L3 caching, we find CPU throughput pulls far
| ahead of GPU-based approaches for typical workloads.
|
| This note holds for all GPU-based approaches, not just PTX.
| _zoltan_ wrote:
| what is a typical workload that you speak of, where CPUs are
| better?
|
| We've been implementing GPU support in Presto/Velox for
| analytical workloads and I'm yet to see a use case where we
| wouldn't pull ahead.
|
| The DRAM-VRAM memory bottleneck isn't really a bottleneck on
| GH/GB platforms (you can pull 400+GB/s across the C2C
| NVLink), and on NVL8 systems like the typical A100/H100
| deployments out there, doing real workloads, where the data
| is coming over the network links, you're toast without using
| GPUDirect RDMA.
| ashtonsix wrote:
| By typical I meant adoption within commonly deployed TSDBs
| like Prometheus, InfluxDB, etc.
|
| GB/GH are actually ideal targets for my code: both
| architectures integrate Neoverse V2 cores, the same core I
| developed for. They are superchips with 144/72 CPU cores
| respectively.
|
| The perf numbers I shared are for one core, so multiply the
| numbers I gave by 144/72 to get expected throughput on
| GB/GH. As you (apparently?) have access to this hardware
| I'd sincerely appreciate if you could benchmark my code
| there and share the results.
| _zoltan_ wrote:
| GB is CPU + 2x GPU.
|
| GH is readily available to anybody at $1.50 per hour on
| Lambda; GB is harder to get and we're only just beginning
| to experiment with it.
| ashtonsix wrote:
| Each Grace CPU has multiple cores:
| https://www.nvidia.com/en-gb/data-center/grace-cpu-
| superchip
|
| This superchip (which might differ from the one you're
| referring to) has 2 CPUs (144 cores):
| https://developer.nvidia.com/blog/nvidia-grace-cpu-
| superchip...
| tmostak wrote:
| Even without NVLink C2C, on a GPU with 16x PCIe 5.0 lanes to
| the host you have 128 GB/sec in theory and 100+ GB/sec in
| practice of bidirectional bandwidth (half that in each
| direction), so you still come out ahead with pipelining.
|
| Of course prefix sums are often used within a series of
| other operators, so if these are already computed on GPU,
| you come out further ahead still.
| ashtonsix wrote:
| Haha... GPUs are great. But do you mean to suggest we
| should swap a single ARM core for a top-line GPU with
| 10k+ cores and compare numbers on that basis? Surely not.
|
| Let's consider this in terms of throughput-per-$ so we
| have a fungible measurement unit. I think we're all
| agreed that this problem's bottleneck is the host
| memory<->compute bus so the question is: for $1 which
| server architecture lets you pump more data from memory
| to a compute core?
|
| It looks like you can get an H100 GPU with 16x PCIe 5.0
| (128 GB/s theoretical, 100 GB/s realistic) for $1.99/hr
| from RunPod.
|
| With an m8g.8xlarge instance (32 ARM CPU cores) you
| should get much better RAM<->CPU throughput (175 GB/s
| realistic) for $1.44/hr from AWS.
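|
| Doing the division on those numbers: 100 GB/s / $1.99/hr is
| roughly 50 GB/s per dollar-hour for the H100 host link, vs
| 175 GB/s / $1.44/hr, roughly 122 GB/s per dollar-hour, for the
| Graviton box -- about 2.4x more bytes moved per dollar on the
| CPU side (at these list prices).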
| bassp wrote:
| Yes! There's a canonical algorithm called the "Blelloch scan"
| for prefix sum (aka prefix scan, because you can generalize
| "sum" to "any binary associative function") that's very GPU-
| friendly. I have... fond is the wrong word, but "strong"
| memories of implementing it in a parallel programming class :)
|
| Here's a link to a pretty accessible writeup, if you're curious
| about the details:
| https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...
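|
| A minimal serial sketch of the two phases in C, assuming the
| array length is a power of two (on a GPU each inner loop would
| run as one parallel step across threads):
|
|   #include <stddef.h>
|
|   /* Exclusive prefix sum via Blelloch's up-sweep/down-sweep.
|      Serial sketch only; assumes n is a power of two. */
|   void blelloch_scan(int *a, size_t n) {
|     /* up-sweep (reduce): partial sums up a balanced tree */
|     for (size_t d = 1; d < n; d <<= 1)
|       for (size_t i = 2 * d - 1; i < n; i += 2 * d)
|         a[i] += a[i - d];
|     a[n - 1] = 0; /* clear the root */
|     /* down-sweep: push sums back down, forming the scan */
|     for (size_t d = n >> 1; d >= 1; d >>= 1)
|       for (size_t i = 2 * d - 1; i < n; i += 2 * d) {
|         int t = a[i - d];
|         a[i - d] = a[i];
|         a[i] += t;
|       }
|   }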
| ashtonsix wrote:
| Mm, I used that exact writeup as a reference to implement
| this algorithm in WebGL 3 years ago: https://github.com/ashto
| nsix/webglc/blob/main/src/kernel/sca...
|
| It even inspired the alternative "transpose" method I
| describe in the OP README.
| TinkersW wrote:
| Your average non-shared-memory GPU communicates with the CPU
| over PCIe, which is dogshit slow, like 100x slower than DRAM.
|
| I can upload an average of about 3.7 MB per millisecond to my
| GPU (PCIe gen 3, x8), but it can be spiky and sometimes take
| longer than you might expect.
|
| By comparison, a byte-based AVX2 prefix scan can pretty much
| run at the speed of DRAM, so there is never any reason to
| transfer to the GPU.
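|
| The in-register part is just log2(16) shift+add steps per 16
| bytes (SSE2 shown here for brevity; a full AVX2 stream version
| also has to carry the last byte of each chunk into the next
| one). A rough sketch:
|
|   #include <emmintrin.h> /* SSE2 */
|
|   /* Inclusive prefix sum of 16 bytes within one register. */
|   static inline __m128i prefix_sum_u8(__m128i v) {
|     v = _mm_add_epi8(v, _mm_slli_si128(v, 1)); /* + neighbour */
|     v = _mm_add_epi8(v, _mm_slli_si128(v, 2)); /* + pairs     */
|     v = _mm_add_epi8(v, _mm_slli_si128(v, 4)); /* + quads     */
|     v = _mm_add_epi8(v, _mm_slli_si128(v, 8)); /* + octets    */
|     return v;
|   }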
| nullbyte wrote:
| This code looks like an alien language to me. Or maybe I'm just
| rusty at C.
| ashtonsix wrote:
| The weirdness probably comes from heavy use of "SIMD
| intrinsics" (Googleable term). These are functions with a 1:1
| correspondence to assembly instructions, used for processing
| multiple values per instruction.
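|
| A toy NEON example (not code from the repo) to give a flavour:
| each intrinsic below maps to a single instruction operating on
| four 32-bit lanes at once.
|
|   #include <arm_neon.h>
|   #include <stddef.h>
|
|   /* Add two int32 arrays, four elements per instruction.
|      Assumes n is a multiple of 4. */
|   void add_arrays(const int32_t *a, const int32_t *b,
|                   int32_t *out, size_t n) {
|     for (size_t i = 0; i < n; i += 4) {
|       int32x4_t va = vld1q_s32(a + i);       /* load 4 lanes */
|       int32x4_t vb = vld1q_s32(b + i);
|       vst1q_s32(out + i, vaddq_s32(va, vb)); /* 4 adds, 1 op */
|     }
|   }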
| mananaysiempre wrote:
| SIMD intrinsics are less C and more assembly with overlong
| mnemonics and a register allocator, so even reading them is
| something of a separate skill. Unlike the skill of achieving
| meaningful speedups by writing them (i.e. low-level
| optimization), it's nothing special, but expect to spend a lot
| of time jumping between the code and the reference manuals[1,2]
| at first.
|
| [1] https://www.intel.com/content/www/us/en/docs/intrinsics-
| guid...
|
| [2] https://developer.arm.com/architectures/instruction-
| sets/int...
| yogishbaliga wrote:
| Way back, I used delta encoding for storing posting lists (the
| inverted index of a search engine). I experimented with using
| GPUs for decoding the posting lists. It turned out that, as
| another reply mentioned, copying the posting list from CPU
| memory to GPU memory took way too long. If the posting list is
| static it can be copied to GPU memory once, which makes the
| decoding faster, but there is still a bottleneck in copying
| the result back into CPU memory.
|
| Nvidia's unified memory architecture may make this better, as
| the same memory can be shared between CPU and GPU.
| Certhas wrote:
| AMD has had unified memory for ages in HPC and for a while now
| in the Strix Halo systems. I haven't had the chance to play
| with one yet, but I have high hopes for some of our complex
| simulation workloads.
| ashtonsix wrote:
| Oh neat. I have some related unpublished SOTA results I want to
| release soon: PEF/BIC-like compression ratios, with faster
| boolean algebra than Roaring Bitsets.
| hughw wrote:
| The shared memory architecture doesn't eliminate copying the
| data across to the device. Edit: or back.
| yogishbaliga wrote:
| If it is unified memory, the CPU can access the result of GPU
| processing without copying it to CPU memory (theoretically).
| hughw wrote:
| If the CPU touches an address mapped to the GPU, doesn't it
| fault a page into the CPU address space? I mean, the program
| doesn't do anything special, but a page gets faulted in, I
| believe.
| Galanwe wrote:
| While the results look impressive, I can't help but think "yeah
| but had you stored an absolute value every X deltas instead of
| just a stream of deltas, you would have had a perfectly scalable
| parallel decoding"
| ashtonsix wrote:
| I just did a mini-ablation study for this (prefix sum). By
| getting rid of the cross-block carry (16 values), you can
| increase perf from 19.85 to 23.45 GB/s: the gain is modest as
| most performance is lost on accumulator carry within the block.
|
| An absolute value every 16 deltas would undermine compression:
| a greater interval would lose even the modest performance gain,
| while a smaller interval would completely lose the
| compressibility benefits of delta coding.
|
| It's a different matter, although there is definitely plausible
| motivation for absolute values every X deltas: query/update
| locality (mini-partition-level). You wouldn't want to transcode
| a huge number of values to access/modify a small subset.
| zamadatix wrote:
| I think what GP is saying is you can drop an absolute value
| every e.g. 8192 elements (or even orders of magnitude more if
| you're actually processing GBs of element) and this frees you
| to compute the blocks in parallel threads in a dependency
| free manor. The latency for a block would still bound by the
| single core rate, but the throughput of a stream is likely
| memory bound after 2-3 cores. It still hurts the point of
| doing delta compression, but not nearly as bad as every 16
| values would.
|
| Even if one is willing to adopt such an encoding scheme,
| you'd still want to optimize what you have here anyways
| though. It also doesn't help, as mentioned, if the goal is
| actually latency of small streams rather than throughput of
| large ones.
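|
| Roughly, the scheme would look like this (hypothetical layout:
| one stored absolute per block, OpenMP for the threading):
|
|   #include <stddef.h>
|   #include <stdint.h>
|
|   #define BLOCK 8192 /* one absolute value per BLOCK deltas */
|
|   /* Each block starts from its own absolute value base[b], so
|      blocks have no dependency on one another and can be
|      decoded on any core in any order. */
|   void decode_blocks(const int32_t *delta, const int32_t *base,
|                      int32_t *out, size_t nblocks) {
|     #pragma omp parallel for
|     for (size_t b = 0; b < nblocks; b++) {
|       int32_t acc = base[b];
|       for (size_t i = 0; i < BLOCK; i++) {
|         acc += delta[b * BLOCK + i]; /* ordinary prefix sum */
|         out[b * BLOCK + i] = acc;
|       }
|     }
|   }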
| ashtonsix wrote:
| Oh right. That's sensible enough. Makes total sense to
| parallelise across multiple cores.
|
| I wouldn't expect a strictly linear speed-up due to
| contention on the memory bus, but it's not as bad as flat-
| lining after engaging 2-3 cores. On most AWS Graviton
| instances you should be able to pull ~5.5 GB/s per-core
| even with all cores active, and that becomes less of a
| bottleneck when considering you'll typically run a sequence
| of compute operations between memory round-trips (not just
| delta).
| jdonaldson wrote:
| While that sounds like a dealbreaker, I can't help but think
| "yeah, but if a decoding method took advantage of the prefix
| in a similarly scalable way, one would reap the same benefits".
| shoo wrote:
| fgiesen has a 3-part series of blog posts from 2018, "reading
| bits in far too many ways", focusing on the problem of reading
| data encoded with a variable-bit-length code.
|
| Part 3 explores a straightforward idea that can be used to
| improve performance of bitstream decoding - decode multiple
| streams at once.
|
| > if decoding from a single bitstream is too serial, then why
| not decode from multiple bitstreams at once?
|
| > So how many streams should you use? It depends. At this
| point, for anything that is even remotely performance-
| sensitive, I would recommend trying at least N=2 streams. Even
| if your decoder has a lot of other stuff going on (computations
| with the decoded values etc.), bitstream decoding tends to be
| serial enough that there's many wasted cycles otherwise, even
| on something relatively narrow like a dual-issue in-order
| machine.
|
| > If you're using multiple streams, you need to decide how
| these multiple streams get assembled into the final output
| bitstream. If you don't have any particular reason to go with a
| fine-grained interleaving, the easiest and most straightforward
| option is to concatenate the sub-streams
|
| For delta coding, perhaps instead of resetting the delta
| encoding stream with an absolute value every X deltas, for a
| small value of X, this suggests we could use a much larger
| value of X, i.e. partition the input into coarse chunks and
| delta encode each, and then rewrite our decoder to decode a
| pair (or more) of the resulting streams at once.
|
| Unclear if that would help at all here, or if there's already
| enough independent work for the CPU to keep busy with.
|
| Part 1 https://fgiesen.wordpress.com/2018/02/19/reading-bits-
| in-far...
|
| Part 2 https://fgiesen.wordpress.com/2018/02/20/reading-bits-
| in-far...
|
| Part 3 https://fgiesen.wordpress.com/2018/09/27/reading-bits-
| in-far...
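|
| Concretely, for the delta-decoding case that might look like
| the sketch below: two concatenated chunks decoded in one loop,
| so the two serial carry chains can overlap in the out-of-order
| pipeline (the layout and the passed-in bases are assumptions):
|
|   #include <stddef.h>
|   #include <stdint.h>
|
|   /* Decode two independently delta-coded chunks at once; each
|      has its own accumulator, so the CPU can interleave the
|      two dependency chains. */
|   void decode_two(const int32_t *d0, const int32_t *d1,
|                   int32_t *o0, int32_t *o1,
|                   int32_t acc0, int32_t acc1, size_t n) {
|     for (size_t i = 0; i < n; i++) {
|       acc0 += d0[i]; o0[i] = acc0; /* chain #1 */
|       acc1 += d1[i]; o1[i] = acc1; /* chain #2 */
|     }
|   }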
| jasonthorsness wrote:
| "We achieve 19.8 GB/s prefix sum throughput--1.8x faster than a
| naive implementation and 2.6x faster than FastPFoR"
|
| "FastPFoR is well-established in both industry and academia.
| However, on our target platform (Graviton4, SIMDe-compiled) it
| benchmarks at only ~7.7 GB/s, beneath a naive scalar loop at
| ~10.8 GB/s."
|
| I thought the first bit was a typo but it was correct; the naive
| approach was faster than a "better" method. Another demonstration
| of how actually benchmarking on the target platform is important!
___________________________________________________________________
(page generated 2025-10-14 23:00 UTC)