[HN Gopher] The GPU is not always faster
       ___________________________________________________________________
        
       The GPU is not always faster
        
       Author : CowFreedom
       Score  : 77 points
       Date   : 2024-12-11 14:28 UTC (8 hours ago)
        
 (HTM) web link (cowfreedom.de)
 (TXT) w3m dump (cowfreedom.de)
        
       | tylermw wrote:
        | For simple operations like the dot product (which also map
        | extremely well to SIMD), yes, the CPU is often better, as
        | there is not much actual "computation" being done. More
        | complex computations, where the data does not need to shuttle
        | between host and device at every step, amortize that transfer
        | cost across multiple operations, and the balance can quickly
        | tip in favor of the GPU.
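        | 
        | A minimal sketch of that trade-off, assuming NumPy plus CuPy
        | on a CUDA machine (illustrative, not the article's code):
        | 
        |   import time
        |   import numpy as np
        |   import cupy as cp
        | 
        |   x = np.random.rand(1 << 26).astype(np.float32)
        |   y = np.random.rand(1 << 26).astype(np.float32)
        | 
        |   t0 = time.perf_counter()
        |   cpu = np.dot(x, y)                     # stays in host RAM
        |   t1 = time.perf_counter()
        | 
        |   xd, yd = cp.asarray(x), cp.asarray(y)  # host -> device copy
        |   gpu = cp.dot(xd, yd)
        |   cp.cuda.Device().synchronize()         # wait for the kernel
        |   t2 = time.perf_counter()
        | 
        |   print(f"cpu {t1-t0:.3f}s  gpu+copy {t2-t1:.3f}s")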
        
       | ramoz wrote:
       | Simpler research could've shown that there is a physical data
       | transfer cost.
        
       | ssivark wrote:
       | A good mental model is to compare the number of floats being
       | processed -vs- the number of primitive computations. Matrix
       | multiplication has n^3 computation with n^2 data. Multiplication
       | of large matrices is therefore special in the potential for "data
       | re-use" (each float is used against all the columns or rows of
       | the other matrix) -- so systems are designed to have a much
       | higher flops throughput than memory bandwidth. A dot product is
       | at the other extreme, where each float is used only once
       | (loosely).
       | 
        | Roofline plots [1] are a framework for visualizing system
        | design from this perspective.
       | 
       | [1] https://en.wikipedia.org/wiki/Roofline_model
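        | 
        | A back-of-the-envelope sketch of that intensity argument in
        | Python (the machine numbers below are made up for
        | illustration):
        | 
        |   # flops vs bytes moved, float32
        |   n = 4096
        |   dot_ai  = (2 * n) / (2 * n * 4)         # ~0.25 flop/byte
        |   gemm_ai = (2 * n**3) / (3 * n**2 * 4)   # ~n/6 flop/byte
        | 
        |   # roofline: attainable = min(peak, intensity * bandwidth)
        |   peak, bw = 10e12, 500e9   # hypothetical accelerator
        |   for name, ai in {"dot": dot_ai, "gemm": gemm_ai}.items():
        |       print(name, min(peak, ai * bw) / 1e12, "Tflop/s")
        | 
        | The dot product pins you to the bandwidth roof; a large GEMM
        | can reach the compute roof.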
        
         | sigmoid10 wrote:
          | This is amplified even more by the fact that only the
          | trivial implementation of matmul is O(n^3), whereas
          | efficient ones (e.g. BLAS) use things like the Strassen
          | algorithm. You can also speed it up significantly with
          | cache-aware approaches when retrieving rows and columns. In
          | practice there is a huge amount of theory behind this, far
          | beyond the scope of anyone who isn't an actual researcher.
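          | 
          | For the curious, a toy recursive Strassen for square power-
          | of-two sizes (nothing like what a tuned BLAS actually does,
          | just to show the shape of the algorithm):
          | 
          |   import numpy as np
          | 
          |   def strassen(A, B, leaf=64):
          |       n = A.shape[0]
          |       if n <= leaf:           # fall back to plain matmul
          |           return A @ B
          |       h = n // 2
          |       a, b = A[:h, :h], A[:h, h:]
          |       c, d = A[h:, :h], A[h:, h:]
          |       e, f = B[:h, :h], B[:h, h:]
          |       g, i = B[h:, :h], B[h:, h:]
          |       p1 = strassen(a, f - i)       # 7 multiplies
          |       p2 = strassen(a + b, i)       # instead of 8
          |       p3 = strassen(c + d, e)
          |       p4 = strassen(d, g - e)
          |       p5 = strassen(a + d, e + i)
          |       p6 = strassen(b - d, g + i)
          |       p7 = strassen(a - c, e + f)
          |       top = np.hstack([p5 + p4 - p2 + p6, p1 + p2])
          |       bot = np.hstack([p3 + p4, p1 + p5 - p3 - p7])
          |       return np.vstack([top, bot])
          | 
          |   A, B = np.random.rand(256, 256), np.random.rand(256, 256)
          |   assert np.allclose(strassen(A, B), A @ B)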
        
           | rnrn wrote:
           | Is there actually a BLAS implementation that uses strassen?
           | 
           | I don't think it's accurate that only trivial implementations
           | use the direct o(n^3) algorithm. AFAIK high performance BLAS
           | implementations just use highly optimized versions of it.
        
             | chessgecko wrote:
              | I remember reading that it's too hard to get good memory
              | bandwidth/L2 utilization with the fancy algorithms; you
              | need to read contiguous blocks and be able to reuse them
              | repeatedly. But I also haven't looked at the GPU BLAS
              | implementations directly.
        
             | jcranmer wrote:
             | AIUI, Strassen gets used moderately commonly with non-
             | floating-point datatypes, where numerical stability is less
             | of a concern and multiplications are more useful to
             | minimize than memory traffic. But from what I can tell,
             | every floating-point BLAS library eschews Strassen, despite
             | a steady trickle of papers saying "hey, there might be some
             | small wins if we go to Strassen!"
        
               | chillee wrote:
               | The big issue with Strassen isn't performance - it's
               | numerical stability.
        
             | leecarraher wrote:
              | BLAS is just the library definition and not the
              | implementation, so BLAS implementations could implement
              | GEMM any way they want. But in practice the triple-loop
              | method (n^3) is the most common, despite Strassen's and
              | the more numerically stable Winograd methods being well
              | known and available for decades. But with most things
              | involving real computing hardware, memory access
              | patterns and locality tend to be more important for
              | performance than operation counts.
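              | 
              | For reference, the triple loop in question is just this
              | (pure-Python sketch, not how any real BLAS writes it):
              | 
              |   # naive O(n^3) GEMM: C += A @ B; real BLAS blocks
              |   # this loop nest for cache, same operation count
              |   def gemm(A, B, C):
              |       n, k, m = len(A), len(B), len(B[0])
              |       for i in range(n):
              |           for j in range(m):
              |               acc = 0.0
              |               for p in range(k):
              |                   acc += A[i][p] * B[p][j]
              |               C[i][j] += acc
              | 
              |   C = [[0.0, 0.0], [0.0, 0.0]]
              |   gemm([[1.0, 2.0], [3.0, 4.0]],
              |        [[5.0, 6.0], [7.0, 8.0]], C)
              |   # C == [[19.0, 22.0], [43.0, 50.0]]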
        
           | sestep wrote:
           | Really? I thought that no practical linear algebra library
           | used the Strassen algorithm. Can you provide a source?
        
             | CowFreedom wrote:
             | The BLAS GEMM routines I have seen use normal blocked
             | algorithms.
        
               | bee_rider wrote:
                | I don't know what the real convention is, but IMO,
                | BLAS GEMM "is" the O(n^3) algorithm (blocked is fine
                | of course) in the sense that something like Strassen
                | has stability implications and isn't appropriate for
                | lots of sizes. Just swapping it in would be nuts, haha.
        
         | hansvm wrote:
         | That property is the same reason you don't incur substantial
         | overhead doing large matrix multiplications sharded over disks
         | or many machines. You apply the same chunking strategies used
         | to optimally use L1/L2/L3 caches, just instead at the level of
         | numa nodes, physical disks, machines, and clusters. So long as
         | each "cache" is big enough for the N^3/N^2 term to dominate
         | communication overhead (especially if that communication can
         | happen concurrently), the networked result is about as fast as
         | the individual machines running at their max FLOPs for some
         | smaller problem.
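          | 
          | A toy version of that chunking with NumPy (the block size
          | stands in for whatever the "cache" is at a given level):
          | 
          |   import numpy as np
          | 
          |   def blocked_matmul(A, B, bs=128):
          |       # each bs x bs block pair does O(bs^3) work on
          |       # O(bs^2) data, so the moves amortize
          |       n, k = A.shape
          |       m = B.shape[1]
          |       C = np.zeros((n, m))
          |       for i in range(0, n, bs):
          |           for j in range(0, m, bs):
          |               for p in range(0, k, bs):
          |                   C[i:i+bs, j:j+bs] += (
          |                       A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs])
          |       return C
          | 
          |   A, B = np.random.rand(512, 512), np.random.rand(512, 512)
          |   assert np.allclose(blocked_matmul(A, B), A @ B)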
        
       | SigmundA wrote:
        | Would be interesting to see what a unified memory setup can
        | do, say an Apple M-series, since that is the argument for
        | unified memory: zero-copy memory access between CPU and GPU.
        
         | CowFreedom wrote:
         | Even the integrated Intel HD Graphics would be an interesting
         | comparison.
        
         | mirsadm wrote:
         | Unified memory is what makes using the GPU viable in my use
         | case (mobile). The copy operation is almost always the slowest
         | part. This is especially true for real time work.
        
       | glitchc wrote:
        | It's the relative difference between transfer overhead and
        | amount of compute. For one single operation, sure, the
        | transfer overhead dominates. Add multiple compute steps
        | (operations), however, and experiments will show that the GPU
        | is faster, since the transfer cost is fixed.
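        | 
        | Back-of-the-envelope, with made-up numbers, the break-even is
        | where the fixed transfer cost gets paid off:
        | 
        |   # t_cpu(k) = k*c_cpu ; t_gpu(k) = transfer + k*c_gpu
        |   transfer, c_cpu, c_gpu = 20e-3, 5e-3, 1e-3  # seconds
        |   break_even = transfer / (c_cpu - c_gpu)     # k = 5 ops
        |   for k in (1, 5, 10):
        |       print(k, k * c_cpu, transfer + k * c_gpu)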
        
       | moomin wrote:
        | Even if the GPU took literally no time at all to compute the
        | results, there would still be workflows where doing it on the
        | CPU was faster.
        
         | Waterluvian wrote:
         | The GPU is the gas station across town that's five cents
         | cheaper.
        
           | adamc wrote:
           | Good analogy.
        
           | _zoltan_ wrote:
           | this is simply not true. the post just uses outdated
           | hardware, as pointed out above.
           | 
           | a GH200 will run miles around any CPU.
        
             | CowFreedom wrote:
             | The gist of the post is that optimizations and
             | interpretations thereof must always be made with respect to
             | the underlying hardware.
        
       | dragontamer wrote:
       | There is the compute vs communicate ratio.
       | 
        | For problems like matrix multiplication, it costs N^2 to
        | communicate the problem but N^3 operations to calculate.
       | 
       | For problems like dot product, it costs N to communicate but only
       | N operations to calculate.
       | 
        | Compute must be substantially larger than communication costs
        | if you hope to see any benefits. Asymptotic differences
        | obviously help, but even a linear advantage can be enough.
        | 
        | You'd never transfer N data to perform a log(N) binary search,
        | for example. At that point communication dominates.
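        | 
        | Rough counts for those three cases (n x n matrices, length-n
        | vectors), just to make the ratios concrete:
        | 
        |   import math
        | 
        |   n = 4096
        |   cases = {            # (elements moved, operations)
        |       "matmul": (2 * n * n, n ** 3),
        |       "dot":    (2 * n,     n),
        |       "search": (n,         math.log2(n)),
        |   }
        |   for name, (comm, comp) in cases.items():
        |       print(name, comp / comm)  # ~2048, 0.5, ~0.003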
        
         | AnotherGoodName wrote:
          | For those skimming, and to add to the above: the article
          | has the GPU work out of system memory, since that's where
          | the initial data lives and where they want the result, and
          | compares that to a CPU doing the same. The entire
          | bottleneck is the GPU-to-system-memory transfer.
          | 
          | If you're willing to work entirely out of GPU memory, the
          | GPU will of course be faster even in this scenario.
        
         | HarHarVeryFunny wrote:
         | Sure, but once you've justified moving data onto the GPU you
         | don't want to incur the cost of moving the operation output
         | back to the CPU unless you have to. So, for example, you might
         | justify moving data to the GPU for a neural net convolution,
         | but then also execute the following activation function (&
         | subsequent operators) there because that's now where the data
         | is.
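          | 
          | In CuPy-flavored pseudocode (a sketch, with a matmul + tanh
          | standing in for the convolution and activation):
          | 
          |   import numpy as np
          |   import cupy as cp
          | 
          |   x = np.random.rand(1024, 1024).astype(np.float32)
          |   w = np.random.rand(1024, 1024).astype(np.float32)
          | 
          |   xd, wd = cp.asarray(x), cp.asarray(w)  # pay the copy once
          |   h = cp.tanh(xd @ wd)      # "conv" + activation on device
          |   h = cp.tanh(h @ wd)       # later ops stay on the GPU
          |   out = cp.asnumpy(h)       # copy back only at the end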
        
       | lmeyerov wrote:
        | Comparing multicore wide AVX to CUDA is a bit of an
        | unnecessary nuance for most folks. The comparisons make
        | sense, but miss the forest for the trees:
       | 
       | - Either way, you're writing 'cuda style' fine-grained data
       | parallel code that looks and works very different from regular
       | multithreaded code. You are now in a different software universe.
       | 
       | - You now also have to think about throughput, latency hiding,
       | etc. Nvidia has been commoditizing throughput-oriented hardware a
       | lot better than others, and while AMD is catching up on some
       | workloads, Nvidia is already advancing. This is where we think
       | about bandwidth between network/disk=>compute unit. My best
       | analogy here, when looking at things like GPU Direct
       | Storage/Network, is CPU systems feel like a long twisty straw,
       | while GPU paths are fat pipes. Big compute typically needs both
       | compute + IO, and hardware specs tell you the bandwidth ceiling.
       | 
        | To a large extent, ideas are cross-pollinating -- CPUs looking
        | more like GPUs, and GPUs getting the flexibility of CPUs --
        | but either way, you're in a different universe of how code &
        | hardware work than 1990s & early 2000s Intel.
        
         | bee_rider wrote:
         | Realistically you should use Numpy or Cupy (or whatever the
         | appropriate/fashionable library is) anyway, because tuning this
         | stuff is a big pain.
         | 
          | So, GPUs have the slight disadvantages that you have to
          | think about data movement and the drivers are a little less
          | convenient to install, but it isn't really a big deal.
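          | 
          | e.g. the same code can target either backend, since CuPy
          | mirrors the NumPy API (sketch; cupy assumes a CUDA device):
          | 
          |   import numpy as np
          |   # import cupy as cp   # if a CUDA device is available
          | 
          |   def normalize(xp, v):
          |       # xp is numpy or cupy; the tuned kernels
          |       # underneath are somebody else's problem
          |       return v / xp.sqrt(xp.dot(v, v))
          | 
          |   v = np.random.rand(1 << 20)
          |   print(normalize(np, v)[:3])
          |   # print(normalize(cp, cp.asarray(v))[:3])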
        
       | rnrn wrote:
        | How can the multicore AVX implementation do a dot product
        | (for arrays much larger than cache) at 340 GB/s on a system
        | with RAM bandwidth < 50 GB/s?
        
         | alecco wrote:
         | I think the post is a bit disingenuous.
         | 
         | But about bandwidth, matrix multiplications happen mostly in
         | cache and that has a lot more bandwidth than RAM. Blocks of the
         | matrix are loaded to cache (explicitly in CUDA) and used
         | multiple times there.
         | 
         | I'd exploit the better multi-level cache hierarchy in CPUs and
         | make the code NUMA aware. But still I wouldn't bet against a
         | recent GPU card.
        
       | jsight wrote:
       | TBH, I'm finding that people underestimate the usefulness of CPU
       | in both inference and fine tuning. PEFT with access to 64GB+ RAM
       | and lots of cores can sometimes be cost effective.
        
         | ramoz wrote:
         | I think engineers learn this quickly in high-scale/performance
         | production environments. Even without hardware backgrounds.
         | SLAs/costs create constraints you need to optimize against
         | after promising the business line these magical models can
         | enable that cool new feature for a million users.
         | 
         | Traditional AI/ML models (including smaller transformers) can
         | definitely be optimized for mass scale/performance on cpu-
         | optimized infrastructure.
        
       | alecco wrote:
       | > Each version is severely memory bandwidth bottlenecked, the
       | CUDA version suffers the most with its practical 11.8 GB/s
       | device-to-host bandwidth due to its PCI-Express 3.0 x16
       | interface.
       | 
       | PCIe 3.0? What?
       | 
       | https://cowfreedom.de/#appendix/computer_specs/
       | 
       | > GeForce GTX 1050 Ti with Max-Q Design (PCIe 3.0 x16) (2016)
       | 
       | > Intel Core i5-8300H (2020)
       | 
       | This is a low-price 8 year old GPU and a 4 year old CPU. And he
       | seems to be including loading the data to GPU. Newer cards have
       | wide PCIe 5.0 or some faster interconnect, like Nvidia Grace-
       | Hopper.
       | 
        | Also he is comparing his own CUDA implementation. He should
        | use one of the many available in cuBLAS/CUTLASS. Making a
        | good CUDA GEMM is a very difficult art, and very hardware
        | specific.
        
         | jsheard wrote:
         | > Newer cards have wide PCIe 5.0 or some faster interconnect,
         | like Nvidia Grace-Hopper.
         | 
         | There aren't any (edit: consumer) GPUs with PCIe5 yet, though
         | they probably aren't far off. Plenty already have PCIe4 though.
        
           | alecco wrote:
           | Consumer cards are PCIe 4.0 x16. H100 PCIe version is PCIe
           | 5.0 https://en.wikipedia.org/wiki/Hopper_(microarchitecture).
           | And it's been out 2 years already.
        
             | throwway120385 wrote:
             | So everyone is supposed to do all of their testing on
             | H100's?
        
               | alecco wrote:
               | 4090 (2022) PCIe 4.0 x16 is quite decent. The major limit
               | is memory, not bandwidth. And 3090 (2020) is also PCIe
               | 4.0 x16, and used cards are a bargain. You can hook them
               | up with Nvlink.
               | 
                | Nvidia is withholding new releases, but the current
                | hardware has more legs with new matrix
                | implementations, like FlashAttention delivering a
                | significant improvement every 6 months.
               | 
               | Nvidia could make consumer chips with combined CPU-GPU. I
               | guess they are too busy making money with the big cloud.
               | Maybe somebody will pick up. Apple is already doing
               | something like it even on laptops.
        
               | _zoltan_ wrote:
               | get a GH100 on lambda and behold you have 900GB/s between
               | CPU memory and GPU, and forget PCIe.
        
             | jsheard wrote:
             | TIL, I missed Hopper already having it. I assume the RTX
             | 5000 series will bring it to consumers.
        
             | yvdriess wrote:
              | For the consumer GPUs, PCIe 4.0 x16 has plenty of BW
              | headroom. The full-sized x16 is more for stability
              | reasons. Some vendors even put a couple of M.2 slots on
              | PCIe 4/5 GPU boards to recover the unused PCIe lanes.
        
             | touisteur wrote:
             | Well let me tell you about the pretty high-end and
             | expensive L40 that shipped with PCIe-4.0 to my utter dismay
             | and disgust. Only the H100 had 5.0 although I could already
             | saturate 4.0 (and 5.0) with Mellanox NICs and
             | GPU/StorageDirect. Waiting for the next one to maybe get
             | 5.0.
        
           | goosedragons wrote:
           | Supposedly the Intel B580 releasing friday will use PCIe 5.0
           | 8x.
        
             | yvdriess wrote:
             | PCIe 4.0 16x
             | 
             | https://www.intel.com/content/www/us/en/products/sku/227961
             | /...
        
           | _zoltan_ wrote:
           | that's not true, H100 NVL is PCIe gen5 x16.
        
         | _zoltan_ wrote:
         | GH100 can do 900GB/s HtoD.
        
           | alecco wrote:
           | And both 3090 and 4090 can do 32 GB/s host-device. Not far
           | from CPU-RAM. You only load the matrix once. The bandwidth
           | for the matmul is orders of magnitude larger and happens all
           | in device and mostly in cache.
        
       | hangonhn wrote:
       | Question from someone who doesn't know enough about GPUs:
       | Recently a friend mentioned his workstation has 384 cores using 4
       | processors. This is starting to approach some of the core numbers
       | of earlier GPUs.
       | 
       | Is there a possibility that in the not too distant future that
       | GPUs and CPUs will just converge? Or are the tasks done by GPUs
       | too specialized?
        
         | krapht wrote:
         | Too specialized. You can't use GPUs as general purpose
         | computers. The basic unit of operation is the warp, which is 32
         | threads operating in lockstep (simplified). If you're not using
         | all 32 threads, then you may as well not be using a GPU.
        
         | immibis wrote:
         | You'd also need extreme hyperthreading. A GPU can cycle between
         | several warps in the same execution unit (barrel-processor-
         | style), padding out the time per instruction to hide memory
         | latency, while still getting the same throughput. That's
         | counter to the fundamental design of CPUs.
        
         | JonChesterfield wrote:
         | They're really similar already. You can program a GPU much like
         | you would a CPU (existence proof at
         | https://news.ycombinator.com/item?id=42387267). There's a lot
          | of obfuscation and hoarding of the ancient knowledge from
          | the "GPUs are special" enthusiasts, but the magic doesn't
          | survive looking at the things. It's a masked vector ISA.
         | 
         | My dev GPU is a 6800XT. Cheapish gaming card from a little
         | while ago, 16GB ram on the card. 72 "compute units" which are
         | independent blocks of hardware containing memory ports,
         | floating point unit, register file etc. Roughly "a core" from
         | x64 world. Each of those can have up to 64 tasks ready to go,
         | roughly a "hyperthread". It's 300W or so.
         | 
         | There's some noise in the details, e.g. the size of the
         | register file from the perspective of a hyperthread affects how
         | many can be resident on the compute unit ready to run, the
          | memory hierarchy has extra layers in it. The vector unit is
          | 256 bytes wide as opposed to 64 bytes on x64.
         | 
          | But if you wanted to run a web browser entirely on the GPU
          | and were sufficiently bloody-minded you'd get it done, with
          | the CPU routing keyboard I/O to it and nothing else. If you
          | want a process to sit on the GPU talking to the network and
          | crunching numbers, you don't need the x64 or arm host to do
          | anything at all.
        
       | ltbarcly3 wrote:
       | Taking a helicopter is not always faster than walking.
       | 
       | Is this surprising or obvious?
        
       ___________________________________________________________________
       (page generated 2024-12-11 23:00 UTC)