[HN Gopher] The GPU is not always faster
___________________________________________________________________
The GPU is not always faster
Author : CowFreedom
Score : 77 points
Date : 2024-12-11 14:28 UTC (8 hours ago)
(HTM) web link (cowfreedom.de)
(TXT) w3m dump (cowfreedom.de)
| tylermw wrote:
| For simple operations like the dot product (which also map
| extremely well to SIMD), yes, the CPU is often better, as there
| is not much actual "computation" being done. More complex
| computations, where the data does not need to move between host
| and device for each step, amortize that transfer cost across
| multiple operations, and the balance can quickly tip in favor
| of the GPU.
| ramoz wrote:
| Even simple research would've shown that there is a physical
| data transfer cost.
| ssivark wrote:
| A good mental model is to compare the number of floats being
| processed -vs- the number of primitive computations. Matrix
| multiplication has n^3 computation with n^2 data. Multiplication
| of large matrices is therefore special in the potential for "data
| re-use" (each float is used against all the columns or rows of
| the other matrix) -- so systems are designed to have a much
| higher flops throughput than memory bandwidth. A dot product is
| at the other extreme, where each float is used only once
| (loosely).
|
| Roofline plots [1] are a framework for visualizing system
| design from this perspective.
|
| [1] https://en.wikipedia.org/wiki/Roofline_model
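|
| A quick back-of-the-envelope sketch of that ratio (the size and
| the 8-byte doubles are arbitrary assumptions, not from the
| article):
|
|   #include <cstdio>
|
|   // FLOPs per byte of data moved, for double precision (8 bytes).
|   // Dot product of length n: 2n FLOPs over 2n values loaded.
|   // GEMM on n x n matrices: 2n^3 FLOPs over 3n^2 values moved.
|   int main() {
|       const double n = 4096.0;
|       double dot  = (2.0 * n) / (2.0 * n * 8.0);             // constant
|       double gemm = (2.0 * n * n * n) / (3.0 * n * n * 8.0); // grows ~ n/12
|       std::printf("dot product: %.3f FLOPs/byte\n", dot);     // 0.125
|       std::printf("GEMM (n=%g): %.1f FLOPs/byte\n", n, gemm); // ~341
|       return 0;
|   }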
| sigmoid10 wrote:
| This is amplified even more by the fact that only the trivial
| implementation of matmul is O(n^3), whereas efficient ones
| (e.g. BLAS) use things like the Strassen algorithm. You can
| also speed it up significantly by using cache-aware approaches
| when retrieving rows and columns. In practice there is a huge
| amount of theory behind this, far beyond the scope of anyone
| who is not an actual researcher.
| rnrn wrote:
| Is there actually a BLAS implementation that uses Strassen?
|
| I don't think it's accurate that only trivial implementations
| use the direct O(n^3) algorithm. AFAIK high-performance BLAS
| implementations just use highly optimized versions of it.
| chessgecko wrote:
| I remember reading that it's too hard to get good memory
| bandwidth/L2 utilization with the fancy algorithms; you need to
| read contiguous blocks and be able to reuse them repeatedly.
| But I also haven't looked at the GPU BLAS implementations
| directly.
| jcranmer wrote:
| AIUI, Strassen gets used moderately commonly with non-
| floating-point datatypes, where numerical stability is less
| of a concern and multiplications are more useful to
| minimize than memory traffic. But from what I can tell,
| every floating-point BLAS library eschews Strassen, despite
| a steady trickle of papers saying "hey, there might be some
| small wins if we go to Strassen!"
| chillee wrote:
| The big issue with Strassen isn't performance - it's
| numerical stability.
| leecarraher wrote:
| BLAS is just the library definition, not an implementation, so
| BLAS implementations could implement GEMM any way they want.
| But in practice the triple-loop (n^3) method is the most
| common, despite Strassen's algorithm and the more numerically
| stable Winograd variant being well known and available for
| decades. As with most things involving real computing hardware,
| memory access patterns and locality tend to matter more for
| performance than operation counts.
| sestep wrote:
| Really? I thought that no practical linear algebra library
| used the Strassen algorithm. Can you provide a source?
| CowFreedom wrote:
| The BLAS GEMM routines I have seen use normal blocked
| algorithms.
| bee_rider wrote:
| I don't know what the real convention is, but IMO BLAS GEMM
| "is" the O(n^3) algorithm (blocked is fine, of course), in the
| sense that something like Strassen has stability implications
| and isn't appropriate for lots of sizes. Just swapping it in
| would be nuts, haha.
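|
| As a concrete picture of "blocked", here is a minimal shared-
| memory tiled GEMM kernel in CUDA; a sketch only (it assumes
| square row-major float matrices with n divisible by the tile
| size), not the article's code or any BLAS library's:
|
|   #define TILE 16
|
|   // C = A * B for n x n row-major float matrices, n % TILE == 0.
|   __global__ void tiled_gemm(const float* A, const float* B,
|                              float* C, int n) {
|       __shared__ float As[TILE][TILE];
|       __shared__ float Bs[TILE][TILE];
|       int row = blockIdx.y * TILE + threadIdx.y;
|       int col = blockIdx.x * TILE + threadIdx.x;
|       float acc = 0.0f;
|       for (int t = 0; t < n / TILE; ++t) {
|           // Stage one tile of A and B in shared memory; each value
|           // fetched from global memory is reused TILE times. That
|           // re-use is what makes blocked O(n^3) GEMM compute-bound.
|           As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
|           Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
|           __syncthreads();
|           for (int k = 0; k < TILE; ++k)
|               acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
|           __syncthreads();
|       }
|       C[row * n + col] = acc;
|   }
|
|   // Launch: dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
|   //         tiled_gemm<<<grid, block>>>(dA, dB, dC, n);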
| hansvm wrote:
| That property is the same reason you don't incur substantial
| overhead doing large matrix multiplications sharded over disks
| or many machines. You apply the same chunking strategies used
| to make optimal use of the L1/L2/L3 caches, just at the level
| of NUMA nodes, physical disks, machines, and clusters. So long
| as each "cache" is big enough for the N^3 compute term to
| dominate the N^2 communication overhead (especially if that
| communication can happen concurrently), the networked result is
| about as fast as the individual machines running at their
| maximum FLOPS on some smaller problem.
| SigmundA wrote:
| It would be interesting to see what a unified memory setup can
| do, say an Apple M-series, since zero-copy memory access
| between CPU and GPU is the argument for unified memory.
| CowFreedom wrote:
| Even the integrated Intel HD Graphics would be an interesting
| comparison.
| mirsadm wrote:
| Unified memory is what makes using the GPU viable in my use
| case (mobile). The copy operation is almost always the slowest
| part. This is especially true for real time work.
| glitchc wrote:
| It's the relative size of the transfer overhead vs. the amount
| of compute. For one single operation, sure, the transfer
| overhead dominates. Add multiple compute steps (operations),
| however, and experiments will show that the GPU is faster,
| since the transfer cost is fixed.
| moomin wrote:
| Even if the GPU took literally no time at all to compute the
| results there would be workflows where doing it on the CPU was
| faster.
| Waterluvian wrote:
| The GPU is the gas station across town that's five cents
| cheaper.
| adamc wrote:
| Good analogy.
| _zoltan_ wrote:
| This is simply not true. The post just uses outdated hardware,
| as pointed out above.
|
| A GH200 will run rings around any CPU.
| CowFreedom wrote:
| The gist of the post is that optimizations and
| interpretations thereof must always be made with respect to
| the underlying hardware.
| dragontamer wrote:
| There is the compute-vs-communicate ratio.
|
| For an n x n matrix multiplication, it costs n^2 to communicate
| the problem but n^3 operations to calculate.
|
| For a dot product of length n, it costs n to communicate and
| only n operations to calculate.
|
| Compute must be substantially larger than communication cost if
| you hope to see any benefit. Asymptotic differences obviously
| help, but even a linear factor can.
|
| You'd never transfer n elements just to perform an O(log n)
| binary search, for example. At that point communication
| dominates.
| AnotherGoodName wrote:
| For those skimming, and to add to the above: the article uses
| the GPU to work with system memory, since that's where the
| initial data lives and where the result is wanted, and compares
| that to a CPU doing the same. The entire bottleneck is the
| GPU-to-system-memory transfer.
|
| If you're willing to work entirely within GPU memory, the GPU
| will of course be faster even in this scenario.
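|
| One way to see that split on your own machine (a sketch
| assuming the CUDA toolkit and cuBLAS are available; the vector
| size is arbitrary) is to time the host-to-device copies
| separately from the on-device dot product:
|
|   #include <cstdio>
|   #include <vector>
|   #include <cuda_runtime.h>
|   #include <cublas_v2.h>
|
|   int main() {
|       const int n = 1 << 26;                  // ~256 MB per vector
|       std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
|       float *dx, *dy, result = 0.0f;
|       cudaMalloc(&dx, n * sizeof(float));
|       cudaMalloc(&dy, n * sizeof(float));
|
|       cublasHandle_t h;
|       cublasCreate(&h);
|       cudaEvent_t t0, t1, t2;
|       cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);
|
|       cudaEventRecord(t0);
|       cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
|       cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);
|       cudaEventRecord(t1);                    // transfers done, compute starts
|       cublasSdot(h, n, dx, 1, dy, 1, &result);
|       cudaEventRecord(t2);
|       cudaEventSynchronize(t2);
|
|       float copy_ms = 0, dot_ms = 0;
|       cudaEventElapsedTime(&copy_ms, t0, t1);
|       cudaEventElapsedTime(&dot_ms, t1, t2);
|       std::printf("H2D copy %.1f ms, dot %.2f ms, result %g\n",
|                   copy_ms, dot_ms, result);
|
|       cublasDestroy(h);
|       cudaFree(dx); cudaFree(dy);
|       return 0;
|   }
|
| At the ~11.8 GB/s the article measures, the two ~256 MB copies
| alone take on the order of 40 ms, which is the cost being
| discussed here.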
| HarHarVeryFunny wrote:
| Sure, but once you've justified moving data onto the GPU you
| don't want to incur the cost of moving the operation output
| back to the CPU unless you have to. So, for example, you might
| justify moving data to the GPU for a neural net convolution,
| but then also execute the following activation function (&
| subsequent operators) there because that's now where the data
| is.
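|
| A sketch of that pattern (the two kernels are hypothetical
| stand-ins, not the article's code): keep the buffer resident on
| the device across stages and pay the transfer cost once each
| way.
|
|   #include <cuda_runtime.h>
|
|   __global__ void conv_stage(float* x, int n) {  // stand-in for the heavy op
|       int i = blockIdx.x * blockDim.x + threadIdx.x;
|       if (i < n) x[i] = 0.5f * x[i] + 1.0f;
|   }
|
|   __global__ void relu_stage(float* x, int n) {  // follow-up activation
|       int i = blockIdx.x * blockDim.x + threadIdx.x;
|       if (i < n) x[i] = x[i] > 0.0f ? x[i] : 0.0f;
|   }
|
|   void run_pipeline(const float* host_in, float* host_out, int n) {
|       float* d;
|       cudaMalloc(&d, n * sizeof(float));
|       cudaMemcpy(d, host_in, n * sizeof(float),
|                  cudaMemcpyHostToDevice);          // pay H2D once
|       int block = 256, grid = (n + block - 1) / block;
|       conv_stage<<<grid, block>>>(d, n);  // data stays on the device,
|       relu_stage<<<grid, block>>>(d, n);  // so extra ops add no transfer
|       cudaMemcpy(host_out, d, n * sizeof(float),
|                  cudaMemcpyDeviceToHost);          // pay D2H once
|       cudaFree(d);
|   }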
| lmeyerov wrote:
| Comparing multicore wide AVX to CUDA is a bit of an unnecessary
| nuance for most folks. The comparisons make sense, but they
| miss the forest for the trees:
|
| - Either way, you're writing "CUDA-style" fine-grained data-
| parallel code that looks and works very differently from
| regular multithreaded code. You are now in a different software
| universe.
|
| - You now also have to think about throughput, latency hiding,
| etc. Nvidia has been commoditizing throughput-oriented hardware
| a lot better than others, and while AMD is catching up on some
| workloads, Nvidia is already advancing. This is where we think
| about bandwidth from network/disk to the compute unit. My best
| analogy here, when looking at things like GPUDirect
| Storage/Network, is that CPU systems feel like a long twisty
| straw, while GPU paths are fat pipes. Big compute typically
| needs both compute and IO, and the hardware specs tell you the
| bandwidth ceiling.
|
| To a large extent, ideas are cross-pollinating -- CPUs looking
| more like GPUs, and GPUs getting the flexibility of CPUs -- but
| either way, you're in a different universe of how code and
| hardware work than 1990s and early 2000s Intel.
| bee_rider wrote:
| Realistically you should use Numpy or Cupy (or whatever the
| appropriate/fashionable library is) anyway, because tuning this
| stuff is a big pain.
|
| So, GPUs have the slight disadvantages that you have to think
| about data movement and that the drivers are a little less
| convenient to install, but it isn't really a big deal.
| rnrn wrote:
| How can the multicore AVX implementation do a dot product (for
| arrays much larger than cache) at 340 GB/s on a system with RAM
| bandwidth < 50 GB/s?
| alecco wrote:
| I think the post is a bit disingenuous.
|
| But about bandwidth, matrix multiplications happen mostly in
| cache and that has a lot more bandwidth than RAM. Blocks of the
| matrix are loaded to cache (explicitly in CUDA) and used
| multiple times there.
|
| I'd exploit the better multi-level cache hierarchy in CPUs and
| make the code NUMA aware. But still I wouldn't bet against a
| recent GPU card.
| jsight wrote:
| TBH, I'm finding that people underestimate the usefulness of
| CPUs in both inference and fine-tuning. PEFT with access to
| 64GB+ RAM and lots of cores can sometimes be cost-effective.
| ramoz wrote:
| I think engineers learn this quickly in high-scale/performance
| production environments. Even without hardware backgrounds.
| SLAs/costs create constraints you need to optimize against
| after promising the business line that these magical models can
| enable that cool new feature for a million users.
|
| Traditional AI/ML models (including smaller transformers) can
| definitely be optimized for mass scale/performance on cpu-
| optimized infrastructure.
| alecco wrote:
| > Each version is severely memory bandwidth bottlenecked, the
| CUDA version suffers the most with its practical 11.8 GB/s
| device-to-host bandwidth due to its PCI-Express 3.0 x16
| interface.
|
| PCIe 3.0? What?
|
| https://cowfreedom.de/#appendix/computer_specs/
|
| > GeForce GTX 1050 Ti with Max-Q Design (PCIe 3.0 x16) (2016)
|
| > Intel Core i5-8300H (2020)
|
| This is a low-priced, 8-year-old GPU and a 4-year-old CPU. And
| he seems to be including the cost of loading the data onto the
| GPU. Newer cards have wide PCIe 5.0 or some faster
| interconnect, like Nvidia Grace Hopper.
|
| Also, he is comparing his own CUDA implementation. He should
| use one of the many available in cuBLAS/CUTLASS. Writing a good
| CUDA GEMM is a very difficult art and very hardware specific.
| jsheard wrote:
| > Newer cards have wide PCIe 5.0 or some faster interconnect,
| like Nvidia Grace-Hopper.
|
| There aren't any (edit: consumer) GPUs with PCIe5 yet, though
| they probably aren't far off. Plenty already have PCIe4 though.
| alecco wrote:
| Consumer cards are PCIe 4.0 x16. H100 PCIe version is PCIe
| 5.0 https://en.wikipedia.org/wiki/Hopper_(microarchitecture).
| And it's been out 2 years already.
| throwway120385 wrote:
| So everyone is supposed to do all of their testing on
| H100's?
| alecco wrote:
| The 4090 (2022), PCIe 4.0 x16, is quite decent. The major limit
| is memory capacity, not bandwidth. And the 3090 (2020) is also
| PCIe 4.0 x16, and used cards are a bargain. You can hook them
| up with NVLink.
|
| Nvidia is withholding new releases, but the current hardware
| has more legs with new matrix implementations, like
| FlashAttention delivering a significant improvement every 6
| months.
|
| Nvidia could make consumer chips with a combined CPU-GPU. I
| guess they are too busy making money with the big clouds. Maybe
| somebody else will pick that up. Apple is already doing
| something like it, even on laptops.
| _zoltan_ wrote:
| get a GH100 on lambda and behold you have 900GB/s between
| CPU memory and GPU, and forget PCIe.
| jsheard wrote:
| TIL, I missed Hopper already having it. I assume the RTX
| 5000 series will bring it to consumers.
| yvdriess wrote:
| For consumer GPUs, PCIe 4.0 x16 has plenty of bandwidth
| headroom. The full-sized x16 connector is more for stability
| reasons. Some vendors even put a couple of M.2 slots on the
| PCIe 4/5 GPU board to make use of the unused PCIe lanes.
| touisteur wrote:
| Well, let me tell you about the pretty high-end and expensive
| L40, which shipped with PCIe 4.0, to my utter dismay and
| disgust. Only the H100 had 5.0, although I could already
| saturate 4.0 (and 5.0) with Mellanox NICs and
| GPU/StorageDirect. Waiting for the next one to maybe get 5.0.
| goosedragons wrote:
| Supposedly the Intel B580, releasing Friday, will use PCIe 5.0
| x8.
| yvdriess wrote:
| PCIe 4.0 x16
|
| https://www.intel.com/content/www/us/en/products/sku/227961/...
| _zoltan_ wrote:
| That's not true; the H100 NVL is PCIe Gen5 x16.
| _zoltan_ wrote:
| GH100 can do 900GB/s HtoD.
| alecco wrote:
| And both the 3090 and 4090 can do 32 GB/s host-to-device, not
| far from CPU-RAM bandwidth. You only load the matrix once; the
| bandwidth used for the matmul itself is orders of magnitude
| larger and stays entirely on the device, mostly in cache.
| hangonhn wrote:
| Question from someone who doesn't know enough about GPUs:
| recently a friend mentioned his workstation has 384 cores
| across 4 processors. That is starting to approach the core
| counts of earlier GPUs.
|
| Is there a possibility that in the not-too-distant future GPUs
| and CPUs will just converge? Or are the tasks done by GPUs too
| specialized?
| krapht wrote:
| Too specialized. You can't use GPUs as general purpose
| computers. The basic unit of operation is the warp, which is 32
| threads operating in lockstep (simplified). If you're not using
| all 32 threads, then you may as well not be using a GPU.
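|
| A small illustration of the warp as the unit of execution (a
| sketch, not from the article): all 32 lanes of one warp
| cooperate on a sum by exchanging registers in lockstep.
|
|   #include <cstdio>
|   #include <cuda_runtime.h>
|
|   // Sum 32 values held one-per-lane, using only register shuffles.
|   __global__ void warp_sum(const float* x, float* out) {
|       float v = x[threadIdx.x];                 // one value per lane
|       for (int offset = 16; offset > 0; offset >>= 1)
|           v += __shfl_down_sync(0xffffffffu, v, offset);
|       if (threadIdx.x == 0) *out = v;           // lane 0 holds the total
|   }
|
|   int main() {
|       float hx[32], *dx, *dout, result;
|       for (int i = 0; i < 32; ++i) hx[i] = 1.0f;
|       cudaMalloc(&dx, sizeof(hx));
|       cudaMalloc(&dout, sizeof(float));
|       cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
|       warp_sum<<<1, 32>>>(dx, dout);            // exactly one warp
|       cudaMemcpy(&result, dout, sizeof(float), cudaMemcpyDeviceToHost);
|       std::printf("warp sum = %g\n", result);   // prints 32
|       cudaFree(dx); cudaFree(dout);
|       return 0;
|   }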
| immibis wrote:
| You'd also need extreme hyperthreading. A GPU can cycle between
| several warps in the same execution unit (barrel-processor-
| style), padding out the time per instruction to hide memory
| latency, while still getting the same throughput. That's
| counter to the fundamental design of CPUs.
| JonChesterfield wrote:
| They're really similar already. You can program a GPU much like
| you would a CPU (existence proof at
| https://news.ycombinator.com/item?id=42387267). There's a lot
| of obfuscation and hoarding of the ancient knowledge from the
| "GPUs are special" enthusiasts, but the magic doesn't survive a
| close look at the things. It's a masked vector ISA.
|
| My dev GPU is a 6800XT: a cheapish gaming card from a little
| while ago, with 16 GB of RAM on the card. It has 72 "compute
| units", which are independent blocks of hardware containing
| memory ports, floating-point units, a register file, etc.,
| roughly "a core" from the x64 world. Each of those can have up
| to 64 tasks ready to go, roughly a "hyperthread". It's 300W or
| so.
|
| There's some noise in the details, e.g. the size of the
| register file from the perspective of a hyperthread affects how
| many can be resident on the compute unit ready to run, and the
| memory hierarchy has extra layers in it. The vector unit is 256
| bytes wide as opposed to 64 bytes on x64.
|
| But if you wanted to run a web browser entirely on the GPU and
| were sufficiently bloody-minded, you'd get it done, with the
| CPU routing keyboard I/O to it and nothing else. If you want a
| process to sit on the GPU talking to the network and crunching
| numbers, you don't need the x64 or ARM host to do anything at
| all.
| ltbarcly3 wrote:
| Taking a helicopter is not always faster than walking.
|
| Is this surprising or obvious?
___________________________________________________________________
(page generated 2024-12-11 23:00 UTC)