[HN Gopher] The GPU is not always faster
___________________________________________________________________
The GPU is not always faster
Author : CowFreedom
Score : 197 points
Date   : 2024-12-11 14:28 UTC (1 day ago)
(HTM) web link (cowfreedom.de)
(TXT) w3m dump (cowfreedom.de)
| tylermw wrote:
| For simple operations like the dot product (that also map
| extremely well to SIMD operations), yes, the CPU is often better,
| as there is not much actual "computation" being done. More complex
| computations where the data does not need to transfer between the
| host and device amortize that transfer cost across multiple
| operations, and the balance can quickly weigh in favor of the
| GPU.
| ramoz wrote:
| Simpler research could've shown that there is a physical data
| transfer cost.
| hershey890 wrote:
| Yeah, classic use cases of GPUs like deep learning have you
| transfer the weights for the entire model to your GPU(s) at the
| start of inference, and after that you only transfer your input
| over.
|
| The use case of transferring ALL data over every time is
| obviously misusing the GPU.
|
| If you've ever tried running a model that's too large for your
| GPU, you will have experienced how slow this is when you have to
| pull in the model in parts for a single inference run.
| ssivark wrote:
| A good mental model is to compare the number of floats being
| processed -vs- the number of primitive computations. Matrix
| multiplication has n^3 computation with n^2 data. Multiplication
| of large matrices is therefore special in the potential for "data
| re-use" (each float is used against all the columns or rows of
| the other matrix) -- so systems are designed to have a much
| higher flops throughput than memory bandwidth. A dot product is
| at the other extreme, where each float is used only once
| (loosely).
|
| Roofline plots [1] are a framework to visualize system design from
| this perspective.
|
| [1] https://en.wikipedia.org/wiki/Roofline_model
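|
| To make the comparison concrete, here is a minimal roofline sketch
| in plain C++; the 1 TFLOP/s peak and 50 GB/s bandwidth figures are
| illustrative assumptions, not numbers from the article:
|
|     #include <algorithm>
|     #include <cstdio>
|
|     int main() {
|         const double peak_flops = 1.0e12;  // peak compute (assumed)
|         const double mem_bw     = 50.0e9;  // DRAM bandwidth (assumed)
|
|         // Dot product of two length-n double vectors:
|         // 2n flops over 16n bytes read -> 0.125 flop/byte, any n.
|         const double dot_intensity = 2.0 / 16.0;
|
|         // n x n double GEMM: 2n^3 flops over ~24n^2 bytes -> n/12.
|         const double n = 4096.0;
|         const double gemm_intensity = (2.0 * n * n * n) / (24.0 * n * n);
|
|         // Roofline: attainable = min(peak, intensity * bandwidth).
|         std::printf("dot product: %6.1f GFLOP/s attainable\n",
|                     std::min(peak_flops, dot_intensity * mem_bw) / 1e9);
|         std::printf("4096^2 GEMM: %6.1f GFLOP/s attainable\n",
|                     std::min(peak_flops, gemm_intensity * mem_bw) / 1e9);
|         return 0;
|     }
|
| With a machine balance of 20 flop/byte in this sketch, the GEMM is
| compute-bound while the dot product never leaves the memory roof.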
| sigmoid10 wrote:
| This is amplified even more by the fact that only the trivial
| implementation of matmul is O(n^3) whereas efficient ones (e.g.
| BLAS) use things like the Strassen algorithm. You can also
| speed it up significantly by using cache-aware approaches when
| retrieving rows and columns. In practice there is a huge amount
| of theory behind this that is far beyond the average person's
| scope if they are not actual researchers.
| rnrn wrote:
| Is there actually a BLAS implementation that uses strassen?
|
| I don't think it's accurate that only trivial implementations
| use the direct O(n^3) algorithm. AFAIK high-performance BLAS
| implementations just use highly optimized versions of it.
| chessgecko wrote:
| I remember reading that it's too hard to get good memory
| bandwidth/l2 utilization in the fancy algorithms, you need
| to read contiguous blocks and be able to use them
| repeatedly. But I also haven't looked at the gpu blas
| implementations directly.
| jcranmer wrote:
| AIUI, Strassen gets used moderately commonly with non-
| floating-point datatypes, where numerical stability is less
| of a concern and multiplications are more useful to
| minimize than memory traffic. But from what I can tell,
| every floating-point BLAS library eschews Strassen, despite
| a steady trickle of papers saying "hey, there might be some
| small wins if we go to Strassen!"
| chillee wrote:
| The big issue with Strassen isn't performance - it's
| numerical stability.
| leecarraher wrote:
| BLAS is just the library definition and not the
| implementation, so BLAS implementations could implement
| GEMM anyway they want. But in practice the triple loop
| method (n^3) is the most common, despite Strassen's and the
| more numerically stable Winograd methods being well known
| and available for decades. But with most things involving
| real computing hardware, memory access patterns and
| locality tend to be more important for performance than
| operation counts.
| sestep wrote:
| Really? I thought that no practical linear algebra library
| used the Strassen algorithm. Can you provide a source?
| CowFreedom wrote:
| The BLAS GEMM routines I have seen use normal blocked
| algorithms.
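|
| For reference, the blocked structure is just a re-nesting of the
| triple loop so each tile stays in cache; a stripped-down sketch
| (the block size is a tuning knob, and real BLAS kernels add
| packing, vectorization and per-CPU tuning on top of this):
|
|     #include <algorithm>
|     #include <vector>
|
|     // C += A * B for n x n row-major matrices, processed in
|     // bs x bs blocks so loaded tiles are reused from cache.
|     void gemm_blocked(const std::vector<double>& A,
|                       const std::vector<double>& B,
|                       std::vector<double>& C, int n, int bs) {
|         for (int ii = 0; ii < n; ii += bs)
|             for (int kk = 0; kk < n; kk += bs)
|                 for (int jj = 0; jj < n; jj += bs)
|                     for (int i = ii; i < std::min(ii + bs, n); ++i)
|                         for (int k = kk; k < std::min(kk + bs, n); ++k) {
|                             const double a = A[i * n + k];
|                             for (int j = jj; j < std::min(jj + bs, n); ++j)
|                                 C[i * n + j] += a * B[k * n + j];
|                         }
|     }
|
|     int main() {
|         const int n = 256;
|         std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
|         gemm_blocked(A, B, C, n, 64);
|         return C[0] == n ? 0 : 1;  // each entry of C should equal n
|     }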
| bee_rider wrote:
| I don't know what the real convention is, but IMO, BLAS
| GEMM "is" the O(n^3) algorithm (blocked is fine of
| course) in the sense that something like Strassen has
| stability implications and isn't appropriate for lots of
| sizes. Just swapping it in would be nuts, haha.
| ryao wrote:
| Not that long ago, I tried using the FFT to do matrix
| multiplication since it was supposed to be asymptotically
| faster. It turns out that the constant factor is huge
| compared to the O(n^3) grade school algorithm that BLAS
| optimizes via tiling and other tricks. Even if it looks
| expensive on paper, the cubic algorithm is fast.
|
| I just wish I understood the tricks done to make it so fast
| so I could implement my own for variations for which there
| are no pre-existing BLAS implementations. The best BLAS
| implementations are all closed source sadly.
| david-gpu wrote:
| _> The best BLAS implementations are all closed source
| sadly._
|
| NVidia open-sourced CUTLASS [0] some years ago and it
| achieves pretty competitive performance compared to e.g.
| the closed-source cuBLAS.
|
| Keen observers will notice that Strassen is not used in
| CUTLASS.
|
| [0] https://github.com/NVIDIA/cutlass
| Ballas wrote:
| Using FFT to do matmul is much more memory intensive,
| IIRC.
|
| CUDNN supports FFT to do matmul as well as
| convolution/correlation and can also be configured to
| automatically use the best algorithm.
|
| In some cases the FFT method has the incidental side-
| benefit of data reuse, like in the case of FIR filters,
| where the data allows for partitioned convolution.
| phkahler wrote:
| Strassen works best on power of 2 sized matrices. That's not
| a restriction you'd usually see in a general purpose library.
| ithkuil wrote:
| Ok strassen and others are better but they are still O(n^w)
| where 2 < w < 3
| hansvm wrote:
| That property is the same reason you don't incur substantial
| overhead doing large matrix multiplications sharded over disks
| or many machines. You apply the same chunking strategies used
| to optimally use L1/L2/L3 caches, just instead at the level of
| numa nodes, physical disks, machines, and clusters. So long as
| each "cache" is big enough for the N^3/N^2 term to dominate
| communication overhead (especially if that communication can
| happen concurrently), the networked result is about as fast as
| the individual machines running at their max FLOPs for some
| smaller problem.
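|
| A back-of-envelope version of that argument, with assumed numbers
| (the 2 TFLOP/s per node and ~100 Gbit/s network below are made up):
|
|     #include <cstdio>
|
|     int main() {
|         const double node_flops = 2.0e12;  // per-node throughput (assumed)
|         const double net_bw     = 12.5e9;  // ~100 Gbit/s network (assumed)
|
|         // For a b x b double tile: ~2*b^3 flops vs ~24*b^2 bytes moved.
|         for (double b : {512.0, 2048.0, 8192.0}) {
|             const double t_compute = 2.0 * b * b * b / node_flops;
|             const double t_comm    = 24.0 * b * b / net_bw;
|             std::printf("b = %5.0f: compute %.4f s, move %.4f s\n",
|                         b, t_compute, t_comm);
|         }
|         return 0;
|     }
|
| With these made-up numbers the crossover sits around b ~ 2000: below
| that the network dominates, above it the tile's O(b^3) work hides
| the O(b^2) transfer.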
| hansvm wrote:
| It looks like a dozen people found this helpful. As a related
| idea, it's one of the main reasons batch inference is so much
| more efficient in ML. You transmute the problem from memory-
| bound to compute-bound.
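|
| The arithmetic behind that, for a single fp32 weight matrix applied
| to a batch (the 4096x4096 size is purely illustrative):
|
|     #include <cstdio>
|
|     int main() {
|         const double m = 4096.0, n = 4096.0;  // weight matrix W is m x n
|         for (double B : {1.0, 8.0, 64.0, 512.0}) {
|             const double flops = 2.0 * m * n * B;               // W * X
|             const double bytes = 4.0 * (m * n + n * B + m * B); // W, X, Y
|             std::printf("batch %4.0f: %6.1f flop/byte\n", B, flops / bytes);
|         }
|         return 0;
|     }
|
| At batch 1 the weight matrix is read once per output vector and the
| kernel is memory-bound; by a batch of a few hundred the intensity is
| in GEMM territory.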
| fulafel wrote:
| This is indeed a good model. In the article's context, though,
| dGPUs have more of both bandwidth and flops, so the compute-
| intensity balance isn't necessarily the deciding factor in
| whether there's a speedup on the GPU.
|
| In this article the deciding factor seems to be the startup
| cost because the application has placed the data on the CPU
| memory side, and is considering shipping out to GPU memory just
| for this computation.
| KeplerBoy wrote:
| You can modify the roofline model to include the PCIe
| bandwidth. It is sometimes called a hierarchical roofline
| model.
| SigmundA wrote:
| Would be interesting to see what a unified memory setup can do
| like say an Apple M-series since this is the argument for unified
| memory, zero copy memory access between CPU and GPU.
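|
| The closest CUDA analogue is managed memory; a minimal sketch of a
| dot product with no explicit copies (assumes compute capability
| 6.0+ for the double atomicAdd; on a discrete card the driver still
| migrates pages over PCIe, so only the explicit copies disappear,
| not the transfer itself):
|
|     #include <cstdio>
|     #include <cuda_runtime.h>
|
|     // Grid-stride dot product, accumulated with atomicAdd.
|     __global__ void dot(const double* a, const double* b,
|                         double* out, int n) {
|         double sum = 0.0;
|         for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
|              i += gridDim.x * blockDim.x)
|             sum += a[i] * b[i];
|         atomicAdd(out, sum);
|     }
|
|     int main() {
|         const int n = 1 << 24;
|         double *a, *b, *out;
|         cudaMallocManaged(&a, n * sizeof(double));  // visible to CPU and GPU
|         cudaMallocManaged(&b, n * sizeof(double));
|         cudaMallocManaged(&out, sizeof(double));
|         for (int i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; }  // host writes
|         *out = 0.0;
|         dot<<<256, 256>>>(a, b, out, n);
|         cudaDeviceSynchronize();                    // host reads result directly
|         std::printf("dot = %.1f (expected %.1f)\n", *out, 2.0 * n);
|         cudaFree(a); cudaFree(b); cudaFree(out);
|         return 0;
|     }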
| CowFreedom wrote:
| Even the integrated Intel HD Graphics would be an interesting
| comparison.
| mirsadm wrote:
| Unified memory is what makes using the GPU viable in my use
| case (mobile). The copy operation is almost always the slowest
| part. This is especially true for real time work.
| glitchc wrote:
| It's the relative difference of transfer overhead vs degree of
| compute. For one single operation, sure, the transfer overhead
| dominates. Add multiple compute steps (operations) however, and
| experiments will show that the GPU is faster as the transfer cost
| is fixed.
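|
| The break-even point is easy to estimate once you put numbers on
| it; every figure below is made up purely to show the shape of the
| trade-off:
|
|     #include <cstdio>
|
|     int main() {
|         const double transfer_s = 0.100;  // one-time host->device copy (assumed)
|         const double cpu_op_s   = 0.020;  // per operation on the CPU (assumed)
|         const double gpu_op_s   = 0.004;  // per operation on the GPU (assumed)
|         for (int k = 1; k <= 10; ++k) {
|             const double cpu = k * cpu_op_s;
|             const double gpu = transfer_s + k * gpu_op_s;
|             std::printf("k=%2d  CPU %.3f s  GPU %.3f s  -> %s\n",
|                         k, cpu, gpu, gpu < cpu ? "GPU wins" : "CPU wins");
|         }
|         return 0;
|     }
|
| With these particular numbers the GPU pulls ahead at k = 7; change
| any of them and the crossover moves, which is exactly the point.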
| moomin wrote:
| Even if the GPU took literally no time at all to compute the
| results there would be workflows where doing it on the CPU was
| faster.
| Waterluvian wrote:
| The GPU is the gas station across town that's five cents
| cheaper.
| adamc wrote:
| Good analogy.
| _zoltan_ wrote:
| this is simply not true. the post just uses outdated
| hardware, as pointed out above.
|
| a GH200 will run miles around any CPU.
| CowFreedom wrote:
| The gist of the post is that optimizations and
| interpretations thereof must always be made with respect to
| the underlying hardware.
| Dylan16807 wrote:
| In this analogy, using better hardware affects the discount
| per gallon, but there are still situations where the closer
| gas station is the better choice.
| Legend2440 wrote:
| No, it's the bulk order from China that's 100x cheaper but
| has a minimum order quantity of 100000 units and takes 6-8
| weeks to get here.
| bob1029 wrote:
| L2 cache is 1000x closer than the PCIe bus. Pretty much
| anything that has to respond in ~realtime to outside events
| will run better on the CPU. You can use the GPU to visualize
| the state of the system with some small delay (e.g., video
| games), but it is not so great at modifying state in an
| efficient manner - especially when serialization of events &
| causality are important.
| gpderetta wrote:
| > ~realtime to outside events
|
| well, to play the devil's advocate, for the outside event to
| affect the CPU, a signal will have to go through the PCI bus
| or equivalent.
| dragontamer wrote:
| There is the compute vs communicate ratio.
|
| For problems like matrix multiplication of two NxN matrices, it
| costs N^2 to communicate the problem but N^3 operations to
| calculate.
|
| For problems like dot product, it costs N to communicate but only
| N operations to calculate.
|
| Compute must be substantially larger than communication costs if
| you hope to see any benefits. Asymptotic differences obviously
| help, but even a linear-factor advantage can matter.
|
| You'd never transfer N data to perform a log(n) binary search for
| example. At that point communication dominates.
| AnotherGoodName wrote:
| For those skimming, and to add to the above: the article uses the
| GPU to work with system memory, since that's where the initial
| data lives and where the result is wanted, and compares that to a
| CPU doing the same. The entire bottleneck is the GPU-to-system-
| memory transfer.
|
| If you're willing to work entirely with the gpu memory the gpu
| will of course be faster even in this scenario.
| saagarjha wrote:
| Assuming your task is larger than kernel launch overhead, of
| course.
| HarHarVeryFunny wrote:
| Sure, but once you've justified moving data onto the GPU you
| don't want to incur the cost of moving the operation output
| back to the CPU unless you have to. So, for example, you might
| justify moving data to the GPU for a neural net convolution,
| but then also execute the following activation function (&
| subsequent operators) there because that's now where the data
| is.
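|
| A sketch of that pattern with stand-in kernels (a scale in place of
| the convolution, plus a ReLU): one upload, an arbitrary chain of
| device-side work, one download:
|
|     #include <cstdio>
|     #include <cuda_runtime.h>
|
|     __global__ void scale(float* x, float s, int n) {  // stand-in for conv
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) x[i] *= s;
|     }
|     __global__ void relu(float* x, int n) {            // activation, also on GPU
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) x[i] = x[i] > 0.f ? x[i] : 0.f;
|     }
|
|     int main() {
|         const int n = 1 << 20;
|         float* h = new float[n];
|         for (int i = 0; i < n; ++i) h[i] = (i % 2) ? 1.f : -1.f;
|
|         float* d;
|         cudaMalloc(&d, n * sizeof(float));
|         cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice); // pay once
|
|         const int block = 256, grid = (n + block - 1) / block;
|         scale<<<grid, block>>>(d, 2.f, n);  // data never leaves the device
|         relu<<<grid, block>>>(d, n);        // between these launches
|
|         cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost); // pay once
|         std::printf("h[0]=%.1f h[1]=%.1f\n", h[0], h[1]);
|         cudaFree(d); delete[] h;
|         return 0;
|     }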
| leeter wrote:
| So a while back I was working on a chaotic renderer. This led
| me to a really weird set of situations:
|
| * If the GPU is a non-dedicated older style intel GPU, use
| CPU
|
| * If the GPU is a non-dedicated anything else, do anything
| super parallel on the GPU, but anything that can go BRRRRRT
| via CPU on the CPU, because the memory is shared.
|
| * If the GPU is dedicated move everything to GPU memory and
| keep it there and only pull back small statistics if at all
| plausible.
| lmeyerov wrote:
| Comparing multicore wide AVX to CUDA is a bit of an unnecessary
| nuance for most folks. These comparisons make sense, but miss the
| forest for the trees:
|
| - Either way, you're writing 'cuda style' fine-grained data
| parallel code that looks and works very different from regular
| multithreaded code. You are now in a different software universe.
|
| - You now also have to think about throughput, latency hiding,
| etc. Nvidia has been commoditizing throughput-oriented hardware a
| lot better than others, and while AMD is catching up on some
| workloads, Nvidia is already advancing. This is where we think
| about bandwidth between network/disk=>compute unit. My best
| analogy here, when looking at things like GPU Direct
| Storage/Network, is CPU systems feel like a long twisty straw,
| while GPU paths are fat pipes. Big compute typically needs both
| compute + IO, and hardware specs tell you the bandwidth ceiling.
|
| To a large extent, ideas are cross-pollinating -- CPUs looking
| more like GPUs, and GPUs getting the flexibility of CPUs -- but
| either way, you're in a different universe of how code & hardware
| works than 1990s & early 2000s intel.
| bee_rider wrote:
| Realistically you should use Numpy or Cupy (or whatever the
| appropriate/fashionable library is) anyway, because tuning this
| stuff is a big pain.
|
| So, GPUs have the slight disadvantages that you have to think
| about data movement and the drivers are a little less
| convenient to install, but it isn't really a big deal.
| lmeyerov wrote:
| Agreed! The bigger shift is switching to data parallel coding
| styles.
| PartiallyTyped wrote:
| I am a big fan of jax for numerical computations these days.
| bee_rider wrote:
| I've been seeing lots of posts about it lately. Haven't had
| a chance to try it out, though.
| rnrn wrote:
| how can the multicore AVX implementation do a dot product (for
| arrays much larger than cache) at 340 GB/s on a system with RAM
| bandwidth < 50 GB/s?
| alecco wrote:
| I think the post is a bit disingenuous.
|
| But about bandwidth, matrix multiplications happen mostly in
| cache and that has a lot more bandwidth than RAM. Blocks of the
| matrix are loaded to cache (explicitly in CUDA) and used
| multiple times there.
|
| I'd exploit the better multi-level cache hierarchy in CPUs and
| make the code NUMA aware. But still I wouldn't bet against a
| recent GPU card.
| rnrn wrote:
| > But about bandwidth, matrix multiplications happen mostly
| in cache and that has a lot more bandwidth than RAM. Blocks
| of the matrix are loaded to cache (explicitly in CUDA) and
| used multiple times there.
|
| The post is about dot product, not matrix multiply. Dot
| product has no data reuse
| rnrn wrote:
| Answer: it can't.
|
| The author has updated the post with corrected AVX
| measurements, with the original ~340 GB/s revised down to 31.7
| GB/s. (Thanks CowFreedom)
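|
| For anyone repeating that sanity check: a dot product over n
| doubles streams about 16n bytes, so the implied bandwidth is just
| 16n over the measured time (the numbers here are placeholders to
| plug into):
|
|     #include <cstdio>
|
|     int main() {
|         const double n = 1.0e9;         // elements per vector (illustrative)
|         const double seconds = 0.5;     // measured time (illustrative)
|         const double bytes = 16.0 * n;  // two double reads per element
|         std::printf("effective bandwidth: %.1f GB/s\n",
|                     bytes / seconds / 1e9);
|         // If this lands far above the DRAM spec (e.g. ~40-50 GB/s for
|         // dual-channel DDR4), the benchmark is measuring caching,
|         // dead-code elimination or a bookkeeping error instead.
|         return 0;
|     }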
| jsight wrote:
| TBH, I'm finding that people underestimate the usefulness of CPU
| in both inference and fine tuning. PEFT with access to 64GB+ RAM
| and lots of cores can sometimes be cost effective.
| ramoz wrote:
| I think engineers learn this quickly in high-scale/performance
| production environments. Even without hardware backgrounds.
| SLAs/costs create constraints you need to optimize against
| after promising the business line these magical models can
| enable that cool new feature for a million users.
|
| Traditional AI/ML models (including smaller transformers) can
| definitely be optimized for mass scale/performance on cpu-
| optimized infrastructure.
| alecco wrote:
| > Each version is severely memory bandwidth bottlenecked, the
| CUDA version suffers the most with its practical 11.8 GB/s
| device-to-host bandwidth due to its PCI-Express 3.0 x16
| interface.
|
| PCIe 3.0? What?
|
| https://cowfreedom.de/#appendix/computer_specs/
|
| > GeForce GTX 1050 Ti with Max-Q Design (PCIe 3.0 x16) (2016)
|
| > Intel Core i5-8300H (2020)
|
| This is a low-price 8 year old GPU and a 4 year old CPU. And he
| seems to be including loading the data to GPU. Newer cards have
| wide PCIe 5.0 or some faster interconnect, like Nvidia Grace-
| Hopper.
|
| Also, he is comparing his own CUDA implementation. He should use
| one of the many available in cuBLAS/CUTLASS. Making a good CUDA
| GEMM is a very difficult art and very hardware-specific.
| jsheard wrote:
| > Newer cards have wide PCIe 5.0 or some faster interconnect,
| like Nvidia Grace-Hopper.
|
| There aren't any (edit: consumer) GPUs with PCIe5 yet, though
| they probably aren't far off. Plenty already have PCIe4 though.
| alecco wrote:
| Consumer cards are PCIe 4.0 x16. H100 PCIe version is PCIe
| 5.0 https://en.wikipedia.org/wiki/Hopper_(microarchitecture).
| And it's been out 2 years already.
| throwway120385 wrote:
| So everyone is supposed to do all of their testing on
| H100's?
| alecco wrote:
| 4090 (2022) PCIe 4.0 x16 is quite decent. The major limit
| is memory, not bandwidth. And 3090 (2020) is also PCIe
| 4.0 x16, and used cards are a bargain. You can hook them
| up with Nvlink.
|
| Nvidia is withholding new releases but the current
| hardware has more legs with new matrix implementations.
| Like FlashAttention doing some significant improvement
| every 6 months.
|
| Nvidia could make consumer chips with combined CPU-GPU. I
| guess they are too busy making money with the big cloud.
| Maybe somebody will pick up. Apple is already doing
| something like it even on laptops.
| _zoltan_ wrote:
| get a GH100 on lambda and behold you have 900GB/s between
| CPU memory and GPU, and forget PCIe.
| saagarjha wrote:
| Where are you seeing 900 GB/s?
| KeplerBoy wrote:
| The 900 GB/s is a figure often cited for Hopper-based
| SXM boards, it's the aggregate of 18 NVLink connections.
| So it's more of a many-to-many GPU-to-GPU bandwidth
| figure.
|
| https://www.datacenterknowledge.com/data-center-
| hardware/nvi...
| imtringued wrote:
| That doesn't add up, because you're now exceeding the
| memory bandwidth of the memory controller. I.e. it would
| be faster to do everything, including the CPU only
| algorithms, in far away VRAM.
| gpderetta wrote:
| latency might be higher and latency usually dominates CPU
| side algorithms.
| jsheard wrote:
| TIL, I missed Hopper already having it. I assume the RTX
| 5000 series will bring it to consumers.
| yvdriess wrote:
| For the consumer GPUs, PCIe 4.0 x16 has plenty of BW
| headroom. The full sized x16 is more for stability reasons.
| Some vendors even put a couple of M.2 slots on PCIe 4/5 GPU
| boards to recuperate the unused PCIe lanes.
| KeplerBoy wrote:
| That's mixing apples with oranges.
|
| Some low-end or middle range GPUs really only use 8
| lanes, because that's how the chip is designed. Fewer
| lanes -> less silicon area for the pcie logic needed ->
| cheaper.
|
| The chips which use 16 lanes take advantage of it and can
| saturate the link.
| wtallis wrote:
| Smaller GPU silicon is also often designed more around
| the laptop market, where it's common for the CPU to not
| have more than 8 lanes available for the GPU.
| touisteur wrote:
| Well let me tell you about the pretty high-end and
| expensive L40 that shipped with PCIe-4.0 to my utter dismay
| and disgust. Only the H100 had 5.0 although I could already
| saturate 4.0 (and 5.0) with Mellanox NICs and
| GPU/StorageDirect. Waiting for the next one to maybe get
| 5.0.
| goosedragons wrote:
| Supposedly the Intel B580 releasing Friday will use PCIe 5.0
| 8x.
| yvdriess wrote:
| PCIe 4.0 16x
|
| https://www.intel.com/content/www/us/en/products/sku/227961
| /...
| phonon wrote:
| That's A580. Correct link https://www.intel.com/content/w
| ww/us/en/products/sku/241598/...
|
| "PCI Express 4.0 x8"
| alecco wrote:
| So Intel is doing the same artificial performance
| limitation. Nvidia 4060s are also PCIe 4.0 with half
| lanes (x8). Argh.
| phonon wrote:
| Alternatively, at that performance level it doesn't make
| a difference, so better to save the $5....
| _zoltan_ wrote:
| that's not true, H100 NVL is PCIe gen5 x16.
| _zoltan_ wrote:
| GH100 can do 900GB/s HtoD.
| alecco wrote:
| And both 3090 and 4090 can do 32 GB/s host-device. Not far
| from CPU-RAM. You only load the matrix once. The bandwidth
| for the matmul is orders of magnitude larger and happens all
| in device and mostly in cache.
| jing wrote:
| No it can't. That's d to d
| saagarjha wrote:
| Well, device to device is technically doubled because you
| have a read and a write. But yes
| Ballas wrote:
| The CPU was launched in Q2 2018, so it is also 6 years old. I
| wonder what the outcome will be with a CPU that supports
| AVX-512 and a more recent GPU.
| hangonhn wrote:
| Question from someone who doesn't know enough about GPUs:
| Recently a friend mentioned his workstation has 384 cores using 4
| processors. This is starting to approach some of the core numbers
| of earlier GPUs.
|
| Is there a possibility that in the not too distant future that
| GPUs and CPUs will just converge? Or are the tasks done by GPUs
| too specialized?
| krapht wrote:
| Too specialized. You can't use GPUs as general purpose
| computers. The basic unit of operation is the warp, which is 32
| threads operating in lockstep (simplified). If you're not using
| all 32 threads, then you may as well not be using a GPU.
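|
| A sketch of what "not using all 32 threads" looks like: both
| kernels launch the same number of threads, but in warp_wasteful
| only lane 0 of each warp does the work, so roughly 31/32 of the
| machine idles:
|
|     #include <cuda_runtime.h>
|
|     __global__ void warp_full(const float* a, const float* b,
|                               float* c, int n) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) c[i] = a[i] * b[i];         // every lane of the warp busy
|     }
|
|     __global__ void warp_wasteful(const float* a, const float* b,
|                                   float* c, int n) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n && (threadIdx.x % 32) == 0)  // only lane 0 per warp works
|             for (int j = i; j < i + 32 && j < n; ++j)
|                 c[j] = a[j] * b[j];            // serialized in a single lane
|     }
|
|     int main() {
|         const int n = 1 << 22;
|         float *a, *b, *c;
|         cudaMalloc(&a, n * sizeof(float));
|         cudaMalloc(&b, n * sizeof(float));
|         cudaMalloc(&c, n * sizeof(float));
|         cudaMemset(a, 0, n * sizeof(float));
|         cudaMemset(b, 0, n * sizeof(float));
|         const int block = 256, grid = (n + block - 1) / block;
|         warp_full<<<grid, block>>>(a, b, c, n);      // time these two on real
|         warp_wasteful<<<grid, block>>>(a, b, c, n);  // hardware to see the gap
|         cudaDeviceSynchronize();
|         cudaFree(a); cudaFree(b); cudaFree(c);
|         return 0;
|     }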
| immibis wrote:
| You'd also need extreme hyperthreading. A GPU can cycle between
| several warps in the same execution unit (barrel-processor-
| style), padding out the time per instruction to hide memory
| latency, while still getting the same throughput. That's
| counter to the fundamental design of CPUs.
| JonChesterfield wrote:
| They're really similar already. You can program a GPU much like
| you would a CPU (existence proof at
| https://news.ycombinator.com/item?id=42387267). There's a lot
| of obfuscation and hoarding of the ancient knowledge from the
| "GPUs are special" enthusiasts, but the magic doesn't survive
| looking at the things. It's a masked vector ISA.
|
| My dev GPU is a 6800XT. Cheapish gaming card from a little
| while ago, 16GB ram on the card. 72 "compute units" which are
| independent blocks of hardware containing memory ports,
| floating point unit, register file etc. Roughly "a core" from
| x64 world. Each of those can have up to 64 tasks ready to go,
| roughly a "hyperthread". It's 300W or so.
|
| There's some noise in the details, e.g. the size of the
| register file from the perspective of a hyperthread affects how
| many can be resident on the compute unit ready to run, the
| memory hierarchy has extra layers in it. The vector unit is
| 256 bytes wide as opposed to 64 bytes wide on x64.
|
| But if you wanted to run a web browser entirely on the GPU and
| were sufficiently bloody minded you'd get it done, with the CPU
| routing keyboard I/O to it and nothing else. If you want a
| process to sit on the GPU talking to the network and crunching
| numbers, don't need the x64 or arm host to do anything at all.
| owlbite wrote:
| CPUs and GPUs are fundamentally aiming at different vertices of
| the performance polygon.
|
| CPUs aim to minimize latency (how many cycles have to pass
| before you can use a result), and do so by way of high clock
| frequencies, caches and fancy micro architectural tricks. This
| is what you want in most general computation cases where you
| don't have other work to do whilst you wait.
|
| GPUs instead just context switch to a different thread whilst
| waiting on a result. They hide their latency by making
| parallelism as cheap as possible. You can have many more cores
| running at a lower clock frequency and be more efficient as a
| result. But this only works if you have enough parallelism to
| keep everything busy whilst waiting for things to finish on
| other threads. As it happens that's pretty common in large
| matrix computations done in machine learning, so they're pretty
| popular there.
|
| Will they converge? I don't think so - they're fundamentally
| different design points. But it may well be that they get
| integrated at a much closer level than current designs, pushing
| the heterogeneous/dark silicon/accelerator direction to an
| extreme.
| nox101 wrote:
| GPU single threads are up to 20x slower than CPU threads. GPUs
| get their speed from massive parallelization, SIMD-style "do N
| things with one instruction", and some specialized hardware (like
| texture samplers).
|
| If you take a serial algorithm and put it on the GPU, it's easy
| to verify that a single GPU thread is much slower than a single
| thread on the CPU. For example, just do a bubble sort on the
| GPU with a single thread. I'm not even including the time to
| transfer data or read the result. You'll easily find the CPU is
| way faster.
|
| The way you get GPU speed is by finding/designing algorithms
| that are massively parallel. There are lots of them. There are
| sorting solutions for example.
|
| As an example, 100 cores * 32 execution units per core = 3200 /
| 20 = 160x faster than the CPU if you can figure out a parallel
| solution. But, not every problem can be solved with parallel
| solutions and if it can't then there's where the CPU wins.
|
| It seems unlikely GPU threads will be as fast as CPU threads.
| They get their massive parallelism by being simpler.
|
| That said, who knows what the future holds.
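|
| The "single GPU thread is slow" point as code: the same grid-stride
| summation launched once with <<<1,1>>> (one thread walking the
| whole array) and once with a normal parallel configuration; only
| the launch parameters differ:
|
|     #include <cstdio>
|     #include <cuda_runtime.h>
|
|     __global__ void sum(const float* x, float* out, int n) {
|         float acc = 0.f;
|         for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
|              i += gridDim.x * blockDim.x)
|             acc += x[i];
|         atomicAdd(out, acc);
|     }
|
|     int main() {
|         const int n = 1 << 24;
|         float *x, *out;
|         cudaMallocManaged(&x, n * sizeof(float));
|         cudaMallocManaged(&out, sizeof(float));
|         for (int i = 0; i < n; ++i) x[i] = 1.f;
|
|         *out = 0.f;
|         sum<<<1, 1>>>(x, out, n);       // one "serial" GPU thread
|         cudaDeviceSynchronize();
|
|         *out = 0.f;
|         sum<<<1024, 256>>>(x, out, n);  // massively parallel launch
|         cudaDeviceSynchronize();
|         std::printf("parallel sum = %.0f\n", *out);
|
|         cudaFree(x); cudaFree(out);
|         return 0;
|     }
|
| Timing the two launches separately on real hardware makes the gap
| the parent describes very visible.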
| ltbarcly3 wrote:
| Taking a helicopter is not always faster than walking.
|
| Is this surprising or obvious?
| ryao wrote:
| The same thing applies to using a GPU to do inference with your
| weights in system memory. That is why nobody does that.
| shihab wrote:
| An otherwise valid point made using a terrible example.
| juunpp wrote:
| Terrible post, really. Did they need 5 pages to say that a
| silly dot product micro-bench that is PCIe-bound loses to a CPU
| implementation? Why are they even comparing computation vs
| computation + memory transfer?
| refulgentis wrote:
| Because going to the GPU adds the overhead of memory and this
| is a simple way to demonstrate that. Did you know that
| already? Congrats, you're in the top 10% of software
| engineers
| kittikitti wrote:
| But how will people be impressed if you don't exhaust them with
| jargon?
| gdiamos wrote:
| Memory locality depends on your perspective.
|
| The CPU would always be slower if the data originated in GPU
| memory.
| KeplerBoy wrote:
| Not necessarily, i'm sure one could construct scenarios where
| offloading to the CPU makes sense. Think of highly branchy
| double precision stuff.
| ImHereToVote wrote:
| The GPU is never faster. It's parallel.
| fancyfredbot wrote:
| This article is a CPU benchmark and a PCI express bandwidth
| benchmark. It's masquerading as something else though.
___________________________________________________________________
(page generated 2024-12-12 23:02 UTC)