[HN Gopher] The One Billion Row Challenge in CUDA
       ___________________________________________________________________
        
       The One Billion Row Challenge in CUDA
        
       Author : tspeterkim
       Score  : 199 points
       Date   : 2024-04-12 16:27 UTC (2 days ago)
        
 (HTM) web link (tspeterkim.github.io)
 (TXT) w3m dump (tspeterkim.github.io)
        
       | MuffinFlavored wrote:
       | > GPU Hash Table?
       | 
        | How badly would performance have suffered if you sha256'd the
        | lines to build the map? I'm going to guess "badly"?
       | 
       | Maybe something like this in CUDA:
       | https://github.com/Cyan4973/xxHash ?
        
         | tspeterkim wrote:
         | So performance would increase since hashing is faster than
         | binary-searching.
         | 
          | However, the problem of collisions across threads and dealing
          | with concurrent map key insertions still remains. E.g. when two
          | different cities produce the same hash (one at each thread),
          | how can we atomically compare 100-byte city strings and
          | correctly do collision resolution (using linear probing, for
          | example - https://nosferalatu.com/SimpleGPUHashTable.html)?
         | 
         | Atomic operations are limited to 32-bits.
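          | 
          | For reference, the insert in the linked post boils down to
          | something like this (a minimal sketch, assuming the key is
          | already reduced to a 32-bit hash; kEmpty marks a free slot).
          | It also shows exactly what breaks for 100-byte strings:
          | atomicCAS can claim a word-sized key, but it cannot compare a
          | whole string atomically.
          | 
          |     #define CAPACITY (1 << 20)   // power of two, so we can mask
          |     #define kEmpty 0xffffffffu   // reserved "slot is free" key
          | 
          |     __device__ void insert(unsigned int* keys, unsigned int* values,
          |                            unsigned int key, unsigned int value) {
          |         unsigned int slot = key & (CAPACITY - 1);
          |         while (true) {
          |             // Atomically claim the slot if it is still empty.
          |             unsigned int prev = atomicCAS(&keys[slot], kEmpty, key);
          |             if (prev == kEmpty || prev == key) {
          |                 values[slot] = value;  // claimed, or key already here
          |                 return;
          |             }
          |             slot = (slot + 1) & (CAPACITY - 1);  // linear probe
          |         }
          |     }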
        
           | convivialdingo wrote:
           | You could hash the city names using something like a djb2a
           | algorithm?
        
           | mandarax8 wrote:
           | > Atomic operations are limited to 32-bits.
           | 
            | I'm using 64-bit atomics at work. Are you on an old version
            | of CUDA, or are some operations only supported at 32 bits?
        
             | tspeterkim wrote:
              | I misspoke. (I got confused with the key limit in my link
              | above.)
              | 
              | Atomics work up to 128 bits
              | (https://docs.nvidia.com/cuda/cuda-c-programming-guide/#atomi...).
              | 
              | Regardless, that's still less than 100 bytes, which is the
              | max length of the city strings.
        
           | Sesse__ wrote:
           | If the set of cities is known (as your binary search
           | algorithm assumes), you can insert all of them first, on the
           | CPU. That will resolve all collisions ahead of time, making
           | the structure of the hash table essentially read-only. (Of
           | course, you would still need atomics for the values.)
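            | 
            | A minimal sketch of the device-side lookup under that scheme
            | (hypothetical names; it assumes every key that can be queried
            | was pre-inserted on the CPU, so the probe always terminates
            | and no atomics on keys are needed):
            | 
            |     // CAPACITY as in the table above (a power of two).
            |     __device__ int lookup(const unsigned int* keys,
            |                           unsigned int key) {
            |         unsigned int slot = key & (CAPACITY - 1);
            |         // The key table is read-only on the device: all
            |         // collisions were resolved on the CPU at build time.
            |         while (keys[slot] != key)
            |             slot = (slot + 1) & (CAPACITY - 1);
            |         return (int)slot;  // index for atomic value updates
            |     }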
        
       | bhouston wrote:
       | Interesting approach.
       | 
       | I wonder if one could use a reduce operation across the cities or
       | similar rather than atomic operations? GPUs are amazing at reduce
       | operations.
        
         | tspeterkim wrote:
         | Agreed. I tried reducing across cities first.
         | 
          | The problem was that the work of gathering all the
          | temperatures for each city (before I could launch the
          | reduction CUDA kernels) required a full parse through the
          | input data.
          | 
          | My final solution would be slower than the C++ baseline, since
          | the baseline already does the full parsing anyway.
        
       | cavisne wrote:
       | The Programming Massively Parallel Processors textbook has a
       | chapter on histograms. You have basically mentioned the main
       | trick though (privatization).
       | 
        | For this problem I think coarsening would also help (so
        | everything isn't queued up on the copy to global memory).
        
         | tspeterkim wrote:
         | The PMPP book is great. I reread the histogram chapter after
         | finishing the blog, and realized I could use privatization. You
         | got me!
         | 
         | By coarsening, do you mean making the threads handle more file
         | parts, and reducing the number of private copies (of histogram
         | or stats here) to globally commit at the end?
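          | 
          | (If that is the right reading, a grid-stride loop is the usual
          | way to coarsen: launch fewer threads than parts and let each
          | thread walk several of them, so each private copy absorbs more
          | rows before the single global commit. A sketch - the Part
          | field names are from the post's kernel, the types assumed:)
          | 
          |     struct Part { long long offset; int length; };
          | 
          |     __global__ void processCoarsened(const Part* parts,
          |                                      int num_parts) {
          |         // Each thread strides over the grid and handles
          |         // several parts in sequence.
          |         for (int p = blockIdx.x * blockDim.x + threadIdx.x;
          |              p < num_parts;
          |              p += gridDim.x * blockDim.x) {
          |             // ... parse parts[p] and update private stats ...
          |         }
          |     }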
        
       | kevincox wrote:
       | > These offsets are obtained iteratively, stepping through the
       | entire file buffer by the desired split size (= total file size /
       | desired number of parts), and marking the position of a new line
       | character:
       | 
        | In many cases it is better to just pass the raw byte offsets to
        | the workers; then they can skip to the first newline and proceed
        | until the next newline after their "end offset". This way the
        | launcher doesn't need to access the input data at all.
        | 
        | This can be ineffective when boundaries are not self-describing,
        | or when the minimum processable chunk is near the expected batch
        | size (as you will have significant swings in batch size), but I
        | think it would work quite well for this case.
        
         | tspeterkim wrote:
         | By "launcher", do you mean the CUDA kernel? How can it avoid
         | accessing the input data since it needs access to the
         | characters based on the offsets?
         | 
         | I also already pass these offsets to the threads as `Part*
         | parts`.
         | 
          | I probably didn't understand your suggestion and am drawing a
          | blank here, so please feel free to elaborate and correct me.
        
            | tromp wrote:
            | He's proposing to skip the whole computation of the parts[]
            | array and, within the kernel, replace
            | 
            |     for (int i = 0; i < parts[bx].length; i++) {  // bx is the global thread index
            |         char c = buffer[parts[bx].offset - buffer_offset + i];
            | 
            | by something like
            | 
            |     long long split_size = size / num_parts;
            |     long long offset = bx * split_size;
            |     while (buffer[offset++] != '\n')
            |         ;
            |     for (int i = 0; buffer[offset + i] != '\n'; i++) {
            |         char c = buffer[offset + i];
            | 
            | (ignoring how to deal with buffer_offset for now).
        
             | tspeterkim wrote:
              |     for (int i = 0; buffer[offset+i] != '\n'; i++) {
              | 
              | This would only process the current line, though. Here,
              | each thread processes ~split_size bytes (multiple lines).
              | 
              | Even if it were to read multiple lines, how would a thread
              | know when to stop? (At which offset?)
             | 
             | And when it does, it should communicate where it stopped
             | with the other threads to prevent re-reading the buffer. My
             | brain's hurting now.
        
               | tromp wrote:
               | It can compute both the starting offset for itself and
               | next_offset for (bx+1)...
        
                | kevincox wrote:
                | Yeah, this is a bug in their quick example code.
                | Basically, you give each thread a start and stop offset
                | by simple byte division. Then each thread handles the
                | first line that starts after its start offset, up to but
                | excluding the first line that starts after its end
                | offset. So there will be a small amount of "overread" at
                | the boundaries, but this is probably negligible.
                | 
                |     let chunk_size = buf.length / threads;
                |     let start_byte = chunk_size * bx;
                |     let end_byte = start_byte + chunk_size;
                | 
                |     let i = start_byte;
                |     // Skip until the first new row. (Thread 0 already
                |     // starts on a row boundary and must not skip.)
                |     if (bx != 0) {
                |         while (buffer[i++] != '\n') {}
                |     }
                | 
                |     while (i < end_byte) {
                |         let row_start = i;
                |         let row_end = i;
                |         while (buffer[row_end++] != '\n') {}
                |         process_row(buffer, row_start, row_end);
                |         i = row_end;  // row_end already points past the '\n'
                |     }
                | 
                | Basically you first roughly split the data with simple
                | byte division, then each thread aligns itself to the
                | underlying data chunks. This alignment can be done in
                | parallel across all threads, rather than being part of a
                | serial step that examines every byte before the parallel
                | work starts. You need to take care that the alignment
                | each thread does doesn't skip or duplicate any rows, but
                | for simple data formats like this I don't think that
                | should be a major difficulty.
        
       | konstantinua00 wrote:
        | Am I reading correctly that your AtomicMin stops only when the
        | CAS succeeds?
        | 
        | Can you not stop early if the extracted value becomes smaller
        | than val?
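        | 
        | (Assuming the post's AtomicMin is the usual CAS loop on the
        | float's bit pattern, the early exit would look something like
        | this sketch:)
        | 
        |     __device__ float atomicMinFloat(float* addr, float val) {
        |         int* addr_as_int = (int*)addr;
        |         int old = *addr_as_int;
        |         // Stop as soon as the stored value is already <= val:
        |         // no update is needed, so no CAS has to succeed.
        |         while (__int_as_float(old) > val) {
        |             int assumed = old;
        |             old = atomicCAS(addr_as_int, assumed,
        |                             __float_as_int(val));
        |             if (old == assumed) break;  // our CAS won
        |         }
        |         return __int_as_float(old);
        |     }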
        
          | konstantinua00 wrote:
          | Also, IIRC, float comparison is compatible with signed integer
          | comparison, due to the representation?
          | 
          | So just using CUDA's atomicMin for ints would be enough.
        
           | Sesse__ wrote:
           | Positive finite floats compare like ints. If you can have
           | negative values, things get slightly more complicated, since
           | they use sign-magnitude (you basically shift the sign bit to
           | the right and use that as an XOR mask on all the other bits).
           | If you need to care about -0.0 == +0.0 and NaN != NaN, most
           | bets are off. :-)
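            | 
            | In code, the mapping described above is something like this
            | sketch (finite values only; NaN and the -0.0 == +0.0 caveat
            | are ignored, as noted):
            | 
            |     // The arithmetic shift replicates the sign bit into a
            |     // mask; XORing the 31 magnitude bits reverses the order
            |     // of the negatives so plain int comparison works.
            |     __device__ int floatToOrderedInt(float f) {
            |         int i = __float_as_int(f);
            |         return i ^ ((i >> 31) & 0x7fffffff);
            |     }
            | 
            |     // The transform is its own inverse.
            |     __device__ float orderedIntToFloat(int i) {
            |         return __int_as_float(i ^ ((i >> 31) & 0x7fffffff));
            |     }
            | 
            | After that, plain integer atomicMin/atomicMax do the job.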
        
       | formerly_proven wrote:
       | Well that's a very pessimistic C++ baseline - using iostreams.
       | The CUDA solution presented here seems to me like it's a bit of
       | an old-timey approach, treating the GPU as an accelerator. The
       | implementation first prepares a million tasks, then submits them
       | as kernel invocations which perform the work. The first phase
       | already takes over two seconds, the second phase takes over ten
       | seconds. Resulting runtime is 16s, 60x faster than iostreams.
       | 
       | This is similar to how GPU acceleration is bolted on in a ton of
        | commercial software: you look at hotspots and then move just
        | those over to the accelerator. For example, outsourcing some
       | linear algebra in a FEM solver. That's a sensible approach for
       | trying to accelerate legacy software, but it's not very effective
       | and leads to poor utilization ("copy mat, copy mat, copy mat copy
       | copy copy mat" - Dave Cutler about GPU acceleration). I think a)
       | you can do all of this on the GPU b) this should be limited by
       | PCIe bandwidth for getting the file contents into VRAM. The 1BRC
       | data set is 14 GB. So this is processing that file at about 1
       | GB/s using both CPU and GPU.
        
         | taspeotis wrote:
         | Yeah <iostream> has terrible performance, and also I find the
         | use of a map rather than an unordered_map a bit suspect.
        
           | jodleif wrote:
            | It's much easier to improve if the baseline is terrible.
        
       | muep wrote:
       | Instead of desiring atomic maps, would it work here to give each
       | task its own map and then merge those after the parallel
        | processing has ended? I have absolutely no experience with
       | CUDA, so no idea how applicable that approach would be. However,
       | it seemed pretty practical and effective when I tried a parallel
       | implementation of this challenge on plain CPUs.
        
         | speps wrote:
         | Looks like TFA already uses partitioning per thread.
        
         | SJC_Hacker wrote:
         | You can definitely have thread local storage
        
         | tspeterkim wrote:
         | This is the possible optimization that I mention at the end of
         | the blog - using a private map for each thread block.
         | 
         | The catch is that this map must fit in shared memory, which is
         | pretty limited on all current hardware: ~100KB.
         | 
          | I originally thought that my map (stats array) was too big to
          | fit into this shared memory. Now, however, I realize it can.
          | It'll be interesting to see how much speedup (or not!) this
          | optimization can bring.
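          | 
          | A sketch of what that could look like (hypothetical names; it
          | assumes the rows were already parsed into city-id and
          | temperature arrays, and roughly the ~400 distinct stations of
          | the challenge, so the private copy is only ~10KB of shared
          | memory):
          | 
          |     #include <climits>  // INT_MAX / INT_MIN
          | 
          |     #define NUM_CITIES 413  // assumed station count
          | 
          |     struct Stats { int min; int max;
          |                    unsigned long long count; long long sum; };
          | 
          |     __global__ void blockPrivateStats(const int* city_ids,
          |                                       const int* temps,
          |                                       long long n,
          |                                       Stats* global_stats) {
          |         __shared__ Stats local[NUM_CITIES];
          |         for (int i = threadIdx.x; i < NUM_CITIES; i += blockDim.x)
          |             local[i] = { INT_MAX, INT_MIN, 0, 0 };
          |         __syncthreads();
          | 
          |         // Shared-memory atomics are far cheaper than global ones.
          |         for (long long r = blockIdx.x * (long long)blockDim.x
          |                            + threadIdx.x;
          |              r < n; r += (long long)gridDim.x * blockDim.x) {
          |             int c = city_ids[r], t = temps[r];
          |             atomicMin(&local[c].min, t);
          |             atomicMax(&local[c].max, t);
          |             atomicAdd(&local[c].count, 1ULL);
          |             // Two's complement makes the unsigned add safe
          |             // for negative temperatures.
          |             atomicAdd((unsigned long long*)&local[c].sum,
          |                       (unsigned long long)(long long)t);
          |         }
          |         __syncthreads();
          | 
          |         // One global commit per block instead of one per row.
          |         for (int i = threadIdx.x; i < NUM_CITIES; i += blockDim.x) {
          |             if (local[i].count == 0) continue;
          |             atomicMin(&global_stats[i].min, local[i].min);
          |             atomicMax(&global_stats[i].max, local[i].max);
          |             atomicAdd(&global_stats[i].count, local[i].count);
          |             atomicAdd((unsigned long long*)&global_stats[i].sum,
          |                       (unsigned long long)local[i].sum);
          |         }
          |     }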
        
       | _zoltan_ wrote:
        | I think this is very slow for some reason. I'll bookmark this
        | and try it on an A100 and an H100 box, and then will try
        | multi-GPU on larger data.
        
         | tspeterkim wrote:
         | Please do.
         | 
          | My hope with this blog was to rile up other CUDA enthusiasts,
          | making them want to bring in bigger and better hardware that I
          | don't have access to.
         | 
         | + Someone in the CUDA MODE community got 6 seconds on a 4090.
        
       | pama wrote:
       | There are some good ideas for this type of problem here:
       | https://github.com/dannyvankooten/1brc
       | 
        | After you deal with parsing and hashes, you are basically IO
        | limited, so mmap helps. The C code takes less than 1.4s without
        | any CUDA access. Because there is no compute to speak of other
        | than parsing and a hashmap, a reasonable guess is that even for
        | an optimal CUDA implementation, launching kernels and
        | transferring data to the GPU would add a noticeable bottleneck
        | and make the optimal CUDA code slower than this pure C code.
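        | 
        | (The mmap part is a few lines of plain POSIX, something like
        | this host-side sketch:)
        | 
        |     #include <fcntl.h>
        |     #include <sys/mman.h>
        |     #include <sys/stat.h>
        |     #include <unistd.h>
        | 
        |     // Map the whole file read-only; after the first (discarded)
        |     // run, the pages come straight from the OS page cache.
        |     const char* map_file(const char* path, size_t* len) {
        |         int fd = open(path, O_RDONLY);
        |         if (fd < 0) return NULL;
        |         struct stat st;
        |         if (fstat(fd, &st) != 0) { close(fd); return NULL; }
        |         *len = (size_t)st.st_size;
        |         const char* p = (const char*)mmap(NULL, *len, PROT_READ,
        |                                           MAP_PRIVATE, fd, 0);
        |         close(fd);  // the mapping outlives the descriptor
        |         return p == MAP_FAILED ? NULL : p;
        |     }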
        
         | _zoltan_ wrote:
         | this is true for small datasets.
         | 
          | on large datasets, once loaded into GPU memory, cross-GPU
          | shuffling with NVLink is going to be much faster than CPU to
          | RAM.
         | 
         | on the H100 boxes with 8x400Gbps, IO with GDS is also pretty
         | fast.
         | 
          | for truly IO-bound tasks I think a lot of GPUs beats almost
          | anything :-)
        
           | 15155 wrote:
           | Try an array of FPGAs.
        
             | _zoltan_ wrote:
              | I've never been a fan, plus good luck getting 800GB/s
              | across.
        
             | dan-robertson wrote:
             | Why does that help with IO bandwidth?
        
           | candido_heavyai wrote:
            | I am testing a GH200, and the speed at which you can access
            | system memory is amazing. Assuming you have already encoded
            | the station into a smallint, the dataset would be around
            | 6GB, which on such a system takes just 20ms to be
            | transferred (I am sure about that because I'm observing a
            | 9.5GB transfer that took about 33ms right now).
        
           | konstantinua00 wrote:
           | 1 billion rows is "small dataset"?
        
             | _zoltan_ wrote:
             | eh, it's just two columns. 12 or 13GB IIRC.
        
             | PeterisP wrote:
             | The meaningful boundary between small data and large data
             | is the difference whether the whole dataset/processing is
             | expected to fit in the RAM of a single common machine. From
              | that perspective, 1 billion rows is a borderline case, which
             | can be small or large depending on how large the rows are -
             | and in this particular challenge the rows are tiny.
        
             | belter wrote:
              | Just the click stream data of four 15-year-olds chilling
              | out in the living room while watching TikTok videos....
        
           | pama wrote:
           | Yes, GDS will accelerate the IO to the GPU. I'd love to see
           | the above C code compared to hyperoptimized GPU code on the
           | right hardware, but I don't want to accidentally nerd snipe
           | myself :-) The unfortunate part of this particular benchmark
           | is that once you have the data in the right place in your
           | hardware there is very little compute left. The GPU code
           | would probably have constant performance with an additional
            | couple thousand operations on each row, whereas the CPU
            | would slow down.
           | 
            | https://docs.nvidia.com/gpudirect-storage/overview-guide/ind...
        
           | nwallin wrote:
           | > on large datasets, once loaded into GPU memory,
           | 
           | You're yada-yada-yadaing the best part.
           | 
            | If the disk can deliver the data at 1GB/s, the CPU can
            | process the data at 2GB/s, and the GPU can process the data
            | at 32GB/s, then in practice the CPU processes the data at
            | 1GB/s and the GPU processes the data at 1GB/s.
           | 
           | (also, personally, "large dataset" is a short way to say
           | "doesn't fit in memory". if it fits in memory it's small, if
           | it doesn't fit in memory it's large. but that's just my
           | opinion. I generally avoid calling something a "large" or
           | "small" dataset because it's an overloaded term that means
           | different things to different people.)
        
         | ww520 wrote:
          | Actually, on modern systems it's purely compute plus memory
          | bound. For this problem the hashmap lookup is the biggest
          | bottleneck. Parsing takes time as well, but not as much as the
          | hashmap.
        
           | petermcneeley wrote:
           | From what I can tell the number of unique elements is pretty
           | small. This would mean the hash map will sit in cache.
            | Parsing is likely the bottleneck; I don't see any use of SIMD
           | in the linked code.
        
             | ww520 wrote:
              | The key can be up to 100 bytes. The key in the hashmap is
              | in cache; the key being compared is from main memory.
              | Basically the whole file in memory is being gone through
              | and compared. With 1 billion rows, that's about 16GB of
              | data in memory at a 16-byte average row, which approaches
              | the memory bandwidth limit for a second, making it a
              | memory-bound problem.
        
               | petermcneeley wrote:
                | This really depends on the number of unique keys. If
                | they sit in L1-L2, the bandwidth of these caches is an
                | order of magnitude greater than main memory's.
                | 
                | This problem is quite a Nerd Snipe, so it's a good thing
                | that I don't have a computer with more than 16GB or I
                | might end up trying it myself.
        
         | emcq wrote:
         | The 1.4s is _after_ having the file loaded into RAM by the
         | kernel. Because this is mostly I/O bound, it's not a fair
          | comparison to skip the read time. If you were running on an M3
          | Mac, you might get less than 100ms if the dataset was stored
          | in RAM.
         | 
         | If you account for time loading from disk, the C implementation
         | would be more like ~5s as reported in the blog post [1].
         | Speculating that their laptop's SSD may be in the 3GB/s range,
         | perhaps there is another second or so of optimization left
         | there (which would roughly work out to the 1.4s in-memory
         | time).
         | 
          | Because you have a lot of variable-width row reads, this will
          | be more difficult on a GPU than on a CPU.
         | 
         | [1] https://www.dannyvankooten.com/blog/2024/1brc/
        
           | ww520 wrote:
            | Also, this uses 16 threads while the contest restricts runs
            | to 8 cores. The benchmarks need to be run in the same
            | environment for a fair comparison.
        
             | pama wrote:
             | The AMD Ryzen 4800U has 8 cores total so the author follows
             | the contest restriction. This CPU supports hyperthreading.
             | (I'd be very interested in seeing hyperoptimized CUDA code
             | using unlimited GPU cores FWIW.)
        
               | ww520 wrote:
                | Good to know. I didn't know the contest had no limit on
                | hyperthreading.
        
           | pama wrote:
           | The performance report followed the initial request: run 6
           | times and remove the best and worst outliers, so the mmap
           | optimization is fair game. Agreed that the C code has room
           | left for some additional optimization.
        
             | emcq wrote:
              | If we are going to consider it fair that prior runs of the
              | program left the file loaded in RAM by the kernel, why
              | stop there?
             | 
             | Let's say I create a "cache" where I store the min/mean/max
             | output for each city, mmap it, and read it at least once to
             | make sure it is in RAM. If the cache is available I simply
             | write it to standard out. I use whatever method to compute
             | the first run, and I persist it to disk and then mmap it.
             | The first run could take 20 hours and gets discarded.
             | 
              | On a technicality it might fit the rules of the original
             | request but it isn't an interesting solution. Feel free to
             | submit it :)
        
               | pama wrote:
               | Having write access to storage or spawning persistent
               | daemons is an extra requirement and that is often not
               | available in practice when evaluating contest code :-)
               | 
                | This is a fun project for learning CUDA and I enjoyed
                | reading about it - I just wanted to point out that the
               | performance tuning in this case is really on the parsing,
               | hashing, memory transfers, and IO. Taking IO out of the
               | picture using specialized hardware or Linux kernel
               | caching still leaves an interesting problem to solve and
               | the focus should be on minimizing the memory transfers,
               | parsing, and hashing.
        
               | gunnarmorling wrote:
               | This actually doesn't fit the rules. I've designed the
               | challenge so that disk I/O is not part of the measured
               | runtime (initially, by relying on the fact that the first
               | run which would pull the file into the page cache will be
               | the slowest and thus discarded, later on by loading the
               | file from a RAM disk). But keeping state between runs is
               | explicitly ruled out as per the README, for the reason
               | you describe.
        
       | jbochi wrote:
        | Pretty cool. Shameless plug: My team at Anthropic is hiring
        | people who can write accelerator kernels. Please reach out to
       | <my username>@gmail.com if you want to make state of the art
       | models faster :)
        
         | petermcneeley wrote:
         | qq: Would this be in CUDA?
        
           | jbochi wrote:
           | Not sure what I can share publicly, but we use TPUs as well:
            | https://cloud.google.com/blog/products/compute/announcing-cl...
        
             | fpgamlirfanboy wrote:
              | It's funny - former Google devs (who maintain a good
              | relationship with their former employer) are ideally
              | positioned to profitably take advantage of the arb of TPU
              | over GPU.
        
               | bee_rider wrote:
               | Google should improve their abstractions until there
               | isn't room for that anymore, haha.
        
         | trashtensor wrote:
          | Not looking for a job at this time, but I do this kind of
          | work - what is the name of the team that you are on / that
          | typically works on this?
        
       | candido_heavyai wrote:
        | The query itself is a perfect hash (assuming the station is
        | dictionary encoded) and takes around 100ms on a GH200 (the GPU
        | is a single H100 with 132 SMs) with a query that's concurrency
        | constrained.
        | 
        | The same level of performance can be obtained using an Ada card
        | like an L40S or an RTX 4090.
        | 
        | The transfer across the NVLink connecting the CPU and GPU on a
        | GH200, after the parse and encoding of the source CSV, takes a
        | negligible amount of time given the 500 GB/sec of system memory
        | bandwidth and the 900 GB/sec interconnect between CPU and GPU.
        | 
        | So the problem is disk bandwidth, which is going to limit the
        | performance of the GPU kernel. The fastest solution should be to
        | parse the CSV with a GPU kernel using mmap and managed memory
        | (?), and encode the station into an integer or a small integer.
        | The min and max values can be used to create a keyless perfect
        | hash table for each SM to limit the concurrency on global
        | memory, using 32-bit atomic operations for min, max, count and
        | sum, and then do a final reduction on the GPU.
        | 
        | I don't think more than 1 modern GPU is needed for this,
        | especially if you are on modern hardware like the GH200.
        | 
        | I've been running this kind of aggregate on a GH200 over 10
        | billion records, with the data in GPU or CPU memory, using our
        | software (heavydb), for testing purposes over the last two
        | weeks.
        
         | fulafel wrote:
         | The 1brc benchmark has the data in a RAM disk[0], there's no
         | disk IO.
         | 
         | [0] https://github.com/gunnarmorling/1brc#evaluating-results
        
       | ehsankia wrote:
        | I have a hard time believing the basic C++ implementation took
        | 16 minutes... I just did the most trivial non-parallel
        | implementation in Python, and it took less than 5 minutes with
        | PyPy.
        
         | anthony88 wrote:
          | The basic Java implementation was also around 5 minutes.
        
       ___________________________________________________________________
       (page generated 2024-04-14 23:01 UTC)