[HN Gopher] The One Billion Row Challenge in CUDA
___________________________________________________________________
The One Billion Row Challenge in CUDA
Author : tspeterkim
Score : 199 points
Date : 2024-04-12 16:27 UTC (2 days ago)
(HTM) web link (tspeterkim.github.io)
(TXT) w3m dump (tspeterkim.github.io)
| MuffinFlavored wrote:
| > GPU Hash Table?
|
| How badly would performance have suffered if you sha256'd the
| lines to build the map? I'm going to guess "badly"?
|
| Maybe something like this in CUDA:
| https://github.com/Cyan4973/xxHash ?
| tspeterkim wrote:
| So performance would increase since hashing is faster than
| binary-searching.
|
| However, the problem of collisions across threads and dealing
| with concurrent map key insertions still remains. E.g. when two
| different cities produce the same hash (one at each thread),
| how can we atomically compare 100-byte city strings and
| correctly do collision resolution (using linear probing, for
| example - https://nosferalatu.com/SimpleGPUHashTable.html)?
|
| Atomic operations are limited to 32-bits.
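|
| For reference, the 32-bit version from that link boils down to
| roughly this (just a sketch, with hypothetical names): claim a
| slot with atomicCAS and linear-probe on conflict. The trouble
| starts exactly when the key is a 100-byte string instead of a
| uint.
|
|     #define CAPACITY (1u << 16)   // power of two, assumed big enough
|     #define EMPTY 0xffffffffu
|
|     __device__ void insert(unsigned int* keys, float* vals,
|                            unsigned int key, float val) {
|         unsigned int slot = key & (CAPACITY - 1);
|         while (true) {
|             unsigned int prev = atomicCAS(&keys[slot], EMPTY, key);
|             if (prev == EMPTY || prev == key) {
|                 vals[slot] = val;  // the value update itself still needs atomics
|                 return;
|             }
|             slot = (slot + 1) & (CAPACITY - 1);  // linear probe
|         }
|     }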
| convivialdingo wrote:
| You could hash the city names using something like a djb2a
| algorithm?
| mandarax8 wrote:
| > Atomic operations are limited to 32-bits.
|
| I'm using 64-bit atomics at work. Are you on an old version of
| CUDA, or are some operations only supported on 32-bit?
| tspeterkim wrote:
| I misspoke. (got confused with the key limit in my link
| above)
|
| Atomics work up to 128-bits
| (https://docs.nvidia.com/cuda/cuda-c-programming-
| guide/#atomi...).
|
| Regardless, it's still less than 100 bytes, which is the
| max length of city strings.
| Sesse__ wrote:
| If the set of cities is known (as your binary search
| algorithm assumes), you can insert all of them first, on the
| CPU. That will resolve all collisions ahead of time, making
| the structure of the hash table essentially read-only. (Of
| course, you would still need atomics for the values.)
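|
| Roughly (a sketch, hypothetical names): the key array is built
| and collision-resolved on the CPU and never written on the GPU,
| so the probe is an ordinary read; only the per-city stats need
| atomics.
|
|     __device__ int lookup(const unsigned int* keys,
|                           unsigned int capacity,  // power of two
|                           unsigned int key) {
|         unsigned int slot = key & (capacity - 1);
|         while (keys[slot] != key)  // terminates: key was pre-inserted
|             slot = (slot + 1) & (capacity - 1);
|         return (int)slot;
|     }
|
|     // per parsed row:
|     //   int s = lookup(keys, capacity, hash(city));
|     //   atomicMin(&stats[s].min, temp);
|     //   atomicMax(&stats[s].max, temp); ...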
| bhouston wrote:
| Interesting approach.
|
| I wonder if one could use a reduce operation across the cities or
| similar rather than atomic operations? GPUs are amazing at reduce
| operations.
| tspeterkim wrote:
| Agreed. I tried reducing across cities first.
|
| The problem was that the work of gathering all the temperatures
| for each city (before I could launch the reduction CUDA
| kernels) required a full parse of the input data.
|
| My final solution would be slower than the C++ baseline, since
| the baseline already does that full parse anyway.
| cavisne wrote:
| The Programming Massively Parallel Processors textbook has a
| chapter on histograms. You have basically mentioned the main
| trick though (privatization).
|
| For this problem I think coarsening would also help (so
| everything isn't queued up on the copy to global memory)
| tspeterkim wrote:
| The PMPP book is great. I reread the histogram chapter after
| finishing the blog, and realized I could use privatization. You
| got me!
|
| By coarsening, do you mean making the threads handle more file
| parts, and reducing the number of private copies (of histogram
| or stats here) to globally commit at the end?
| kevincox wrote:
| > These offsets are obtained iteratively, stepping through the
| entire file buffer by the desired split size (= total file size /
| desired number of parts), and marking the position of a new line
| character:
|
| In many cases it is better to just pass the raw byte offsets to
| the workers, then they can skip to the first newline and progress
| until the next newline after their "end offset". This way the
| launcher doesn't need to access the input data at all.
|
| This can be ineffective when boundaries are not self-describing
| or when the minimum processable chunk is near the expected batch
| size (as you will have significant swings in batch size), but I
| think it would work quite well in this case.
| tspeterkim wrote:
| By "launcher", do you mean the CUDA kernel? How can it avoid
| accessing the input data since it needs access to the
| characters based on the offsets?
|
| I also already pass these offsets to the threads as `Part*
| parts`.
|
| I probably didn't understand your suggestion and am drawing a
| blank here, so please feel free to elaborate and correct me.
| tromp wrote:
| He's proposing to skip the whole computation of the parts[]
| array and, within the kernel, replace
|
|     // bx is the global thread index
|     for (int i = 0; i < parts[bx].length; i++) {
|         char c = buffer[parts[bx].offset - buffer_offset + i];
|
| by something like
|
|     long long split_size = size / num_parts;
|     long long offset = bx * split_size;
|     while (buffer[offset++] != '\n') ;
|     for (int i = 0; buffer[offset + i] != '\n'; i++) {
|         char c = buffer[offset + i];
|
| (ignoring how to deal with buffer_offset for now).
| tspeterkim wrote:
| for (int i = 0; buffer[offset+i] != '\n'; i++) {
|
| This would only process the current line, though. Here,
| each thread processes ~split_size bytes (multiple lines).
|
| Even if it were to read multiple lines, how would a thread
| know when to stop? (at which offset?)
|
| And when it does, it should communicate where it stopped
| with the other threads to prevent re-reading the buffer. My
| brain's hurting now.
| tromp wrote:
| It can compute both the starting offset for itself and
| next_offset for (bx+1)...
| kevincox wrote:
| Yeah, this is a bug in their quick example code.
| Basically think that you simply give each thread a start
| and stop offset with simple byte division. Then each
| thread handles the first line that starts after the start
| offset up to but excluding the first line that starts
| after its end offset. So there will be a small amount of
| "overread" at the boundaries but this is probably
| negligible.
|
|     let chunk_size = buf.length / threads;
|     let start_byte = chunk_size * bx;
|     let end_byte = start_byte + chunk_size;
|
|     let i = start_byte;
|     while (buffer[i++] != '\n') {} // Skip until the first new row.
|     while (i < end_byte) {
|         let row_start = i;
|         let row_end = i;
|         while (buffer[row_end++] != '\n') {}
|         process_row(buffer, row_start, row_end);
|         i = row_end; // row_end already points just past the '\n'
|     }
|
| Basically you first roughly split the data with simple
| byte division. Then each thread aligns itself to the
| underlying data chunks. This alignment can be done in
| parallel across all threads rather than being part of a
| serial step that examines every byte before the parallel
| work starts. You need to take care that the alignment
| each thread does doesn't skip or duplicate any rows, but
| for simple data formats like this I don't think that
| should be a major difficulty.
| konstantinua00 wrote:
| am I reading correctly that your AtomicMin stops only when CAS
| succeeds?
|
| can you not stop early if extracted value becomes smaller than
| val?
| konstantinua00 wrote:
| also, iirc, floats' comparison is compatible with signed
| integers' comparison due to representation?
|
| so just using cuda's AtomicMin for ints would be enough
| Sesse__ wrote:
| Positive finite floats compare like ints. If you can have
| negative values, things get slightly more complicated, since
| they use sign-magnitude (you basically shift the sign bit to
| the right and use that as an XOR mask on all the other bits).
| If you need to care about -0.0 == +0.0 and NaN != NaN, most
| bets are off. :-)
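|
| In CUDA terms, that trick looks roughly like this (a sketch,
| hypothetical names): map the float's bits to an unsigned int
| that preserves ordering, then a plain 32-bit atomicMin works
| without a CAS loop.
|
|     __device__ unsigned int float_to_ordered(float f) {
|         unsigned int u = __float_as_uint(f);
|         // negative floats: flip all bits (reverses their order);
|         // positive floats: just set the sign bit (puts them above).
|         return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
|     }
|
|     __device__ void atomic_min_float(unsigned int* slot, float v) {
|         atomicMin(slot, float_to_ordered(v));
|     }
|
| (Reading the result back means applying the inverse mapping on
| the host; and as noted, -0.0/NaN need their own decisions.)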
| formerly_proven wrote:
| Well that's a very pessimistic C++ baseline - using iostreams.
| The CUDA solution presented here seems to me like it's a bit of
| an old-timey approach, treating the GPU as an accelerator. The
| implementation first prepares a million tasks, then submits them
| as kernel invocations which perform the work. The first phase
| already takes over two seconds, the second phase takes over ten
| seconds. Resulting runtime is 16s, 60x faster than iostreams.
|
| This is similar to how GPU acceleration is bolted on in a ton of
| commercial software, you look at hotspots and then move just
| those over to the accelerator. For example outsourcing some
| linear algebra in a FEM solver. That's a sensible approach for
| trying to accelerate legacy software, but it's not very effective
| and leads to poor utilization ("copy mat, copy mat, copy mat copy
| copy copy mat" - Dave Cutler about GPU acceleration). I think a)
| you can do all of this on the GPU b) this should be limited by
| PCIe bandwidth for getting the file contents into VRAM. The 1BRC
| data set is 14 GB. So this is processing that file at about 1
| GB/s using both CPU and GPU.
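|
| To make (b) concrete, the usual shape is a pipelined copy: chunk
| a pinned host buffer, issue each chunk's H2D copy and kernel on
| a rotating set of streams, and let copies overlap with compute.
| A rough sketch (hypothetical names; chunk-boundary row splitting
| ignored):
|
|     #define NSTREAMS 4
|     #define CHUNK (64u << 20)   // 64 MB per chunk
|
|     // host_buf: pinned (cudaMallocHost / cudaHostRegister)
|     // dev_buf:  NSTREAMS * CHUNK bytes of device memory
|     void run(const char* host_buf, size_t file_size, char* dev_buf) {
|         cudaStream_t streams[NSTREAMS];
|         for (int i = 0; i < NSTREAMS; i++)
|             cudaStreamCreate(&streams[i]);
|
|         for (size_t off = 0, n = 0; off < file_size;
|              off += CHUNK, n++) {
|             int s = n % NSTREAMS;
|             size_t len = file_size - off < CHUNK ? file_size - off
|                                                  : CHUNK;
|             cudaMemcpyAsync(dev_buf + s * (size_t)CHUNK,
|                             host_buf + off, len,
|                             cudaMemcpyHostToDevice, streams[s]);
|             // process_chunk<<<blocks, threads, 0, streams[s]>>>(...);
|         }
|         for (int i = 0; i < NSTREAMS; i++)
|             cudaStreamSynchronize(streams[i]);
|     }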
| taspeotis wrote:
| Yeah <iostream> has terrible performance, and also I find the
| use of a map rather than an unordered_map a bit suspect.
| jodleif wrote:
| It's much easier to improve if the baseline is terrible.
| muep wrote:
| Instead of desiring atomic maps, would it work here to give each
| task its own map and then merge those after the parallel
| processing has ended? I have absolutely no experience with
| CUDA, so no idea how applicable that approach would be. However,
| it seemed pretty practical and effective when I tried a parallel
| implementation of this challenge on plain CPUs.
| speps wrote:
| Looks like TFA already uses partitioning per thread.
| SJC_Hacker wrote:
| You can definitely have thread local storage
| tspeterkim wrote:
| This is the possible optimization that I mention at the end of
| the blog - using a private map for each thread block.
|
| The catch is that this map must fit in shared memory, which is
| pretty limited on all current hardware: ~100KB.
|
| I originally thought that my map (stats array) was too big to
| fit into this shared memory. Now, however, I realize it can.
| It'll be interesting to see how much speedup (or not!) this
| optimization can bring.
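|
| For the curious, the shape I have in mind is roughly this (a
| sketch with hypothetical names, assuming rows are already parsed
| into a city index and a temperature in tenths of a degree; the
| real kernel would parse in place):
|
|     #define NUM_CITIES 413  // assumption: distinct stations known up front
|
|     struct Stats { int min, max, count, sum; };  // int sum only for brevity
|
|     __global__ void aggregate(const int* city, const int* temp,
|                               long long n, Stats* global_stats) {
|         __shared__ Stats priv[NUM_CITIES];  // ~6.6 KB, well under ~100 KB
|
|         for (int i = threadIdx.x; i < NUM_CITIES; i += blockDim.x)
|             priv[i] = {INT_MAX, INT_MIN, 0, 0};
|         __syncthreads();
|
|         long long stride = (long long)gridDim.x * blockDim.x;
|         long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
|         for (; i < n; i += stride) {
|             int c = city[i], t = temp[i];
|             atomicMin(&priv[c].min, t);   // shared-memory atomics: cheap
|             atomicMax(&priv[c].max, t);
|             atomicAdd(&priv[c].count, 1);
|             atomicAdd(&priv[c].sum, t);
|         }
|         __syncthreads();
|
|         // One global commit per block instead of one per thread/row.
|         for (int j = threadIdx.x; j < NUM_CITIES; j += blockDim.x) {
|             atomicMin(&global_stats[j].min, priv[j].min);
|             atomicMax(&global_stats[j].max, priv[j].max);
|             atomicAdd(&global_stats[j].count, priv[j].count);
|             atomicAdd(&global_stats[j].sum, priv[j].sum);
|         }
|     }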
| _zoltan_ wrote:
| I think this is very slow for some reason. I'll bookmark this and
| try on an A100 and an H100 box, and then will try multi-GPU on
| larger data.
| tspeterkim wrote:
| Please do.
|
| My hope with this blog was to rile up other CUDA enthusiasts,
| making them want to bring in bigger and better hardware that I
| don't have access to.
|
| + Someone in the CUDA MODE community got 6 seconds on a 4090.
| pama wrote:
| There are some good ideas for this type of problem here:
| https://github.com/dannyvankooten/1brc
|
| After you deal with parsing and hashes, basically you are IO
| limited, so mmap helps. The C code takes less than 1.4s without
| any CUDA access. Because there is no compute to speak of, other
| than parsing and a hashmap, a reasonable guess is that even for
| the optimal CUDA implementation, the starting of kernels and
| transfer of data to the GPU would likely add a noticeable
| bottleneck and make the optimal CUDA code slower than this pure C
| code.
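|
| (The mmap part is just the standard pattern, something like this
| sketch: map the file read-only and let the page cache feed the
| parser, with no read() copies.)
|
|     #include <fcntl.h>
|     #include <sys/mman.h>
|     #include <sys/stat.h>
|     #include <unistd.h>
|
|     static const char* map_file(const char* path, size_t* size_out) {
|         int fd = open(path, O_RDONLY);
|         if (fd < 0) return NULL;
|         struct stat st;
|         if (fstat(fd, &st) != 0) { close(fd); return NULL; }
|         *size_out = (size_t)st.st_size;
|         const char* p = (const char*)mmap(NULL, *size_out, PROT_READ,
|                                           MAP_PRIVATE, fd, 0);
|         close(fd);  // the mapping stays valid after close
|         return p == MAP_FAILED ? NULL : p;
|     }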
| _zoltan_ wrote:
| this is true for small datasets.
|
| on large datasets, once loaded into GPU memory, cross GPU
| shuffling with NVLink is going to be much faster than CPU to
| RAM.
|
| on the H100 boxes with 8x400Gbps, IO with GDS is also pretty
| fast.
|
| for truly IO-bound tasks I think a lot of GPUs beat almost
| anything :-)
| 15155 wrote:
| Try an array of FPGAs.
| _zoltan_ wrote:
| I've never been a fan plus good luck getting 800GB/s
| across.
| dan-robertson wrote:
| Why does that help with IO bandwidth?
| candido_heavyai wrote:
| I am testing a GH200 and the speed at which you can access
| system memory is amazing. Assuming you have already encoded the
| station into a smallint, the dataset would be around 6 GB,
| which on such a system takes just 20 ms to be transferred (I am
| sure about that because I'm watching a 9.5 GB transfer take
| about 33 ms right now).
| konstantinua00 wrote:
| 1 billion rows is "small dataset"?
| _zoltan_ wrote:
| eh, it's just two columns. 12 or 13GB IIRC.
| PeterisP wrote:
| The meaningful boundary between small data and large data
| is the difference whether the whole dataset/processing is
| expected to fit in the RAM of a single common machine. From
| that perspective, 1 billion rows is a borderline case, which
| can be small or large depending on how large the rows are -
| and in this particular challenge the rows are tiny.
| belter wrote:
| Just the click stream data of four 15-year-olds, chilling
| out in the living room while watching TikTok videos....
| pama wrote:
| Yes, GDS will accelerate the IO to the GPU. I'd love to see
| the above C code compared to hyperoptimized GPU code on the
| right hardware, but I don't want to accidentally nerd snipe
| myself :-) The unfortunate part of this particular benchmark
| is that once you have the data in the right place in your
| hardware there is very little compute left. The GPU code
| would probably have constant performance with an additional
| couple thousand operations on each row whereas CPU would slow
| down.
|
| https://docs.nvidia.com/gpudirect-storage/overview-
| guide/ind...
| nwallin wrote:
| > on large datasets, once loaded into GPU memory,
|
| You're yada-yada-yadaing the best part.
|
| If the disk can process the data at 1GB/s, the CPU can
| process the data at 2GB/s, and the GPU can process the data at
| 32GB/s, then in practice the CPU can process the data at 1GB/s
| and the GPU can process the data at 1GB/s.
|
| (also, personally, "large dataset" is a short way to say
| "doesn't fit in memory". if it fits in memory it's small, if
| it doesn't fit in memory it's large. but that's just my
| opinion. I generally avoid calling something a "large" or
| "small" dataset because it's an overloaded term that means
| different things to different people.)
| ww520 wrote:
| Actually on modern systems it's purely compute plus memory
| bound. For this problem hashmap lookup is the biggest
| bottleneck. Parsing takes time as well, but not as much as the
| hashmap.
| petermcneeley wrote:
| From what I can tell the number of unique elements is pretty
| small. This would mean the hash map will sit in cache.
| Parsing is likely the bottleneck; I don't see any use of SIMD
| in the linked code.
| ww520 wrote:
| The key can be up to 100 bytes. The key in the hashmap is
| in cache. The key being compared is from main memory.
| Basically the whole file in memory is being gone through
| and compared. With 1 billion rows it's about 16GB data in
| memory for a 16-byte average row. That's approaching the
| memory bandwidth limit for a second, making it a memory
| bound problem.
| petermcneeley wrote:
| This really depends on number of unique keys. If they sit
| in L1-L2: the bandwidth of these caches is an order of
| magnitude greater than main memory.
|
| This problem is quite a Nerd Snipe, so it's a good thing
| that I don't have a computer with more than 16 GB or I
| might end up trying it myself.
| emcq wrote:
| The 1.4s is _after_ having the file loaded into RAM by the
| kernel. Because this is mostly I/O bound, it's not a fair
| comparison to skip the read time. If you were running on an M3
| Mac you might get less than 100ms if the dataset was stored
| in RAM.
|
| If you account for time loading from disk, the C implementation
| would be more like ~5s as reported in the blog post [1].
| Speculating that their laptop's SSD may be in the 3GB/s range,
| perhaps there is another second or so of optimization left
| there (which would roughly work out to the 1.4s in-memory
| time).
|
| Because you have a lot of variable width row reads this will be
| more difficult on a GPU than CPU.
|
| [1] https://www.dannyvankooten.com/blog/2024/1brc/
| ww520 wrote:
| Also this uses 16 threads while the contest restricts runs to
| 8 cores. The benchmarks need to be run in the same environment
| to make a fair comparison.
| pama wrote:
| The AMD Ryzen 4800U has 8 cores total so the author follows
| the contest restriction. This CPU supports hyperthreading.
| (I'd be very interested in seeing hyperoptimized CUDA code
| using unlimited GPU cores FWIW.)
| ww520 wrote:
| Good to know. I didn't know the contest has no limit on
| hyperthreading.
| pama wrote:
| The performance report followed the initial request: run 6
| times and remove the best and worst outliers, so the mmap
| optimization is fair game. Agreed that the C code has room
| left for some additional optimization.
| emcq wrote:
| If we are going to consider it fair that prior runs of the
| program leave the file loaded in RAM by the kernel, why stop
| there?
|
| Let's say I create a "cache" where I store the min/mean/max
| output for each city, mmap it, and read it at least once to
| make sure it is in RAM. If the cache is available I simply
| write it to standard out. I use whatever method to compute
| the first run, and I persist it to disk and then mmap it.
| The first run could take 20 hours and gets discarded.
|
| On a technicality it might fit the rules of the original
| request but it isn't an interesting solution. Feel free to
| submit it :)
| pama wrote:
| Having write access to storage or spawning persistent
| daemons is an extra requirement and that is often not
| available in practice when evaluating contest code :-)
|
| This is a fun project for learning CUDA and I enjoyed
| reading about it; I just wanted to point out that the
| performance tuning in this case is really on the parsing,
| hashing, memory transfers, and IO. Taking IO out of the
| picture using specialized hardware or Linux kernel
| caching still leaves an interesting problem to solve and
| the focus should be on minimizing the memory transfers,
| parsing, and hashing.
| gunnarmorling wrote:
| This actually doesn't fit the rules. I've designed the
| challenge so that disk I/O is not part of the measured
| runtime (initially, by relying on the fact that the first
| run which would pull the file into the page cache will be
| the slowest and thus discarded, later on by loading the
| file from a RAM disk). But keeping state between runs is
| explicitly ruled out as per the README, for the reason
| you describe.
| jbochi wrote:
| Pretty cool. Shameless plug: My team at Anthropic is hiring
| people that can write accelerator kernels. Please reach out to
| <my username>@gmail.com if you want to make state of the art
| models faster :)
| petermcneeley wrote:
| qq: Would this be in CUDA?
| jbochi wrote:
| Not sure what I can share publicly, but we use TPUs as well:
| https://cloud.google.com/blog/products/compute/announcing-
| cl...
| fpgamlirfanboy wrote:
| It's funny - former Google devs (who just maintain a good
| relationship with their former employer) are ideally
| positioned to profitably take advantage of the arb of TPU
| over GPU.
| bee_rider wrote:
| Google should improve their abstractions until there
| isn't room for that anymore, haha.
| trashtensor wrote:
| Not looking for a job at this time but I do this kind of work -
| what is the name of the team that you are working on / that
| typically works on this?
| candido_heavyai wrote:
| The query itself is a perfect hash (assuming the station is
| dictionary encoded) and takes around 100 ms on a GH200 (the GPU
| is a single H100 with 132 SMs) with a query that's concurrency
| constrained.
|
| The same level of performance can be obtained using an Ada
| class card like an L40S or an RTX 4090.
|
| The transfer across the NVLink connecting the CPU and GPU on a
| GH200, after the parse and encoding of the source CSV, takes a
| negligible amount of time given the 500 GB/s of system memory
| bandwidth and the 900 GB/s interconnect between CPU and GPU.
|
| So the problem is disk bandwidth, which is going to limit the
| performance of the GPU kernel. The fastest solution would be to
| parse the CSV with a GPU kernel using mmap and managed memory
| (?) and encode the station into an integer or a small integer.
| The min and max value can be used to create a keyless perfect
| hash table for each SM to limit the concurrency on global
| memory, using 32-bit atomic operations for min, max, count and
| sum, and then do a final reduction on the GPU.
|
| I don't think more than one modern GPU is needed for this,
| especially if you are on modern hardware like the GH200.
|
| I've been running this kind of aggregate on a GH200 over 10
| billion records, with the data in GPU or CPU memory, using our
| software (heavydb), for testing purposes over the last two
| weeks.
| fulafel wrote:
| The 1brc benchmark has the data in a RAM disk[0], there's no
| disk IO.
|
| [0] https://github.com/gunnarmorling/1brc#evaluating-results
| ehsankia wrote:
| I have a hard time believing the basic C++ implementation took 16
| minutes... I just did the most trivial non-parallel
| implementation in Python and it took less than 5 minutes with
| PyPy.
| anthony88 wrote:
| The basic Java implementation was also around 5 minutes
___________________________________________________________________
(page generated 2024-04-14 23:01 UTC)