[HN Gopher] The GPU is not always faster
       ___________________________________________________________________
        
       The GPU is not always faster
        
       Author : CowFreedom
       Score  : 197 points
       Date   : 2024-12-11 14:28 UTC (1 days ago)
        
 (HTM) web link (cowfreedom.de)
 (TXT) w3m dump (cowfreedom.de)
        
       | tylermw wrote:
        | For simple operations like the dot product (which also map
        | extremely well to SIMD), yes, the CPU is often better, as
        | there is not much actual "computation" being done. More
        | complex workloads, where the data does not need to move back
        | and forth between host and device for every step, amortize
        | that transfer cost across multiple operations, and the
        | balance can quickly tip in favor of the GPU.
        
       | ramoz wrote:
       | Simpler research could've shown that there is a physical data
       | transfer cost.
        
         | hershey890 wrote:
          | Yeah, classic use cases of GPUs like deep learning have you
          | transfer the weights for the entire model to your GPU(s) at
          | the start of inference, and after that you only transfer
          | your input over.
         | 
         | The use case of transferring ALL data over every time is
         | obviously misusing the GPU.
         | 
          | If you've ever tried running a model that's too large for
          | your GPU, you will have experienced how slow this is when
          | the model has to be pulled in piece by piece for a single
          | inference run.
        
       | ssivark wrote:
       | A good mental model is to compare the number of floats being
       | processed -vs- the number of primitive computations. Matrix
       | multiplication has n^3 computation with n^2 data. Multiplication
       | of large matrices is therefore special in the potential for "data
       | re-use" (each float is used against all the columns or rows of
       | the other matrix) -- so systems are designed to have a much
       | higher flops throughput than memory bandwidth. A dot product is
       | at the other extreme, where each float is used only once
       | (loosely).
       | 
        | Roofline plots [1] are a framework to visualize system design
        | from this perspective.
       | 
       | [1] https://en.wikipedia.org/wiki/Roofline_model
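        | 
        | To make that concrete, a back-of-the-envelope sketch (plain
        | C++; the size n and 4-byte floats are assumptions, not
        | numbers from the article):
        | 
        |   // Arithmetic intensity: flops per byte moved from memory.
        |   #include <cstdio>
        | 
        |   int main() {
        |     const double n = 4096;  // hypothetical problem size
        |     // dot product: 2n flops over 2n floats read once
        |     double dot = (2 * n) / (2 * n * 4);
        |     // n x n matmul: 2n^3 flops over ~3n^2 floats touched
        |     double mm = (2 * n * n * n) / (3 * n * n * 4);
        |     std::printf("%.2f vs %.0f flop/byte\n", dot, mm);
        |   }
        | 
        | The dot product sits well below any realistic machine balance
        | point, so it is bandwidth-bound everywhere; the matmul sits
        | far above it.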
        
         | sigmoid10 wrote:
         | This is amplified even more by the fact that only the trivial
          | implementation of matmul is O(n^3), whereas efficient ones (e.g.
          | BLAS) use things like the Strassen algorithm. You can also
         | speed it up significantly by using cache-aware approaches when
         | retrieving rows and columns. In practice there is a huge amount
         | of theory behind this that is far beyond the average person's
         | scope if they are not actual researchers.
        
           | rnrn wrote:
           | Is there actually a BLAS implementation that uses strassen?
           | 
           | I don't think it's accurate that only trivial implementations
            | use the direct O(n^3) algorithm. AFAIK high-performance BLAS
           | implementations just use highly optimized versions of it.
        
             | chessgecko wrote:
              | I remember reading that it's too hard to get good memory
              | bandwidth/L2 utilization in the fancy algorithms; you
              | need to read contiguous blocks and be able to reuse them
              | repeatedly. But I also haven't looked at the GPU BLAS
              | implementations directly.
        
             | jcranmer wrote:
             | AIUI, Strassen gets used moderately commonly with non-
             | floating-point datatypes, where numerical stability is less
             | of a concern and multiplications are more useful to
             | minimize than memory traffic. But from what I can tell,
             | every floating-point BLAS library eschews Strassen, despite
             | a steady trickle of papers saying "hey, there might be some
             | small wins if we go to Strassen!"
        
               | chillee wrote:
               | The big issue with Strassen isn't performance - it's
               | numerical stability.
        
             | leecarraher wrote:
              | BLAS is just the library definition and not the
              | implementation, so BLAS implementations could implement
              | GEMM any way they want. But in practice the triple-loop
              | method (O(n^3)) is the most common, despite Strassen's
              | and the more numerically stable Winograd methods being
              | well known and available for decades. As with most
              | things involving real computing hardware, memory access
              | patterns and locality tend to be more important for
              | performance than operation counts.
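              | 
              | For illustration only, a minimal blocked version of that
              | triple loop (a sketch of the locality idea, not what any
              | particular BLAS implementation ships):
              | 
              |   // Blocked O(n^3) GEMM sketch: same op count as
              |   // the naive loop, but each tile of B and C stays
              |   // hot in cache while it is reused.
              |   #include <algorithm>
              |   #include <cstddef>
              | 
              |   // C += A * B, all n x n, row-major, block size T.
              |   void gemm(const float* A, const float* B, float* C,
              |             std::size_t n, std::size_t T = 64) {
              |     for (std::size_t ii = 0; ii < n; ii += T)
              |      for (std::size_t kk = 0; kk < n; kk += T)
              |       for (std::size_t jj = 0; jj < n; jj += T)
              |        for (auto i = ii; i < std::min(ii+T, n); ++i)
              |         for (auto k = kk; k < std::min(kk+T, n); ++k) {
              |           float a = A[i*n + k];
              |           for (auto j = jj; j < std::min(jj+T, n); ++j)
              |             C[i*n + j] += a * B[k*n + j];
              |         }
              |   }
              | 
              | Real BLAS adds packing, vectorized microkernels and
              | multi-level blocking on top, but the payoff comes from
              | this reuse pattern, not from a lower operation count.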
        
           | sestep wrote:
           | Really? I thought that no practical linear algebra library
           | used the Strassen algorithm. Can you provide a source?
        
             | CowFreedom wrote:
             | The BLAS GEMM routines I have seen use normal blocked
             | algorithms.
        
               | bee_rider wrote:
                | I don't know what the real convention is, but IMO, BLAS
                | GEMM "is" the O(n^3) algorithm (blocked is fine of
                | course), in the sense that something like Strassen has
                | stability implications and isn't appropriate for lots
                | of sizes. Just swapping it in would be nuts, haha.
        
           | ryao wrote:
           | Not that long ago, I tried using the FFT to do matrix
            | multiplication since it was supposed to be asymptotically
           | faster. It turns out that the constant factor is huge
           | compared to the O(n^3) grade school algorithm that BLAS
           | optimizes via tiling and other tricks. Even if it looks
           | expensive on paper, the cubic algorithm is fast.
           | 
           | I just wish I understood the tricks done to make it so fast
           | so I could implement my own for variations for which there
           | are no pre-existing BLAS implementations. The best BLAS
           | implementations are all closed source sadly.
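            | 
            | For what it's worth, the most approachable of those tricks
            | is plain tiling. A toy CUDA version of the idea (assumes a
            | 16x16 thread block, a matching grid, and n divisible by
            | TILE; nowhere near cuBLAS, just the shape of the
            | technique):
            | 
            |   #define TILE 16
            | 
            |   // Each 16x16 block stages tiles of A and B in shared
            |   // memory, so every loaded value is reused 16 times.
            |   __global__ void matmul(const float* A, const float* B,
            |                          float* C, int n) {
            |     __shared__ float As[TILE][TILE];
            |     __shared__ float Bs[TILE][TILE];
            |     int row = blockIdx.y * TILE + threadIdx.y;
            |     int col = blockIdx.x * TILE + threadIdx.x;
            |     float acc = 0.0f;
            |     for (int t = 0; t < n / TILE; ++t) {
            |       As[threadIdx.y][threadIdx.x] =
            |           A[row * n + t * TILE + threadIdx.x];
            |       Bs[threadIdx.y][threadIdx.x] =
            |           B[(t * TILE + threadIdx.y) * n + col];
            |       __syncthreads();              // tile fully loaded
            |       for (int k = 0; k < TILE; ++k)
            |         acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            |       __syncthreads();              // done with the tile
            |     }
            |     C[row * n + col] = acc;
            |   }
            | 
            | Real libraries layer register tiling, double buffering and
            | tensor-core paths on top, but this is the core idea.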
        
             | david-gpu wrote:
             | _> The best BLAS implementations are all closed source
             | sadly._
             | 
             | NVidia open-sourced CUTLASS [0] some years ago and it
             | achieves pretty competitive performance compared to e.g.
             | the closed-source cuBLAS.
             | 
             | Keen observers will notice that Strassen is not used in
             | CUTLASS.
             | 
             | [0] https://github.com/NVIDIA/cutlass
        
             | Ballas wrote:
              | Using FFT to do matmul is much more memory intensive,
             | IIRC.
             | 
             | CUDNN supports FFT to do matmul as well as
             | convolution/correlation and can also be configured to
             | automatically use the best algorithm.
             | 
             | In some cases the FFT method has the incidental side-
             | benefit of data reuse, like in the case of FIR filters,
             | where the data allows for partitioned convolution.
        
           | phkahler wrote:
           | Strassen works best on power of 2 sized matrices. That's not
           | a restriction you'd usually see in a general purpose library.
        
           | ithkuil wrote:
           | Ok strassen and others are better but they are still O(n^w)
           | where 2 < w < 3
        
         | hansvm wrote:
         | That property is the same reason you don't incur substantial
         | overhead doing large matrix multiplications sharded over disks
         | or many machines. You apply the same chunking strategies used
         | to optimally use L1/L2/L3 caches, just instead at the level of
         | numa nodes, physical disks, machines, and clusters. So long as
         | each "cache" is big enough for the N^3/N^2 term to dominate
         | communication overhead (especially if that communication can
         | happen concurrently), the networked result is about as fast as
         | the individual machines running at their max FLOPs for some
         | smaller problem.
        
           | hansvm wrote:
           | It looks like a dozen people found this helpful. As a related
           | idea, it's one of the main reasons batch inference is so much
           | more efficient in ML. You transmute the problem from memory-
           | bound to compute-bound.
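            | 
            | Rough numbers to illustrate (a sketch with made-up layer
            | sizes, nothing measured):
            | 
            |   #include <cstdio>
            | 
            |   int main() {
            |     // One 4096x4096 layer (hypothetical), fp32 weights.
            |     const double d = 4096;
            |     for (int b : {1, 8, 64}) {
            |       double flops = 2.0 * b * d * d;
            |       double bytes = 4.0 * (d * d + 2.0 * b * d);
            |       std::printf("batch %2d: %.1f flop/byte\n",
            |                   b, flops / bytes);
            |     }
            |   }
            | 
            | The weight-matrix bytes dominate, so the intensity grows
            | almost linearly with the batch size until the compute roof
            | takes over.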
        
         | fulafel wrote:
          | This is indeed a good model. In the article's context, dGPUs
          | have more of both bandwidth and flops, so the compute-
          | intensity balance isn't necessarily the deciding factor in
          | whether there's a speedup on the GPU.
         | 
         | In this article the deciding factor seems to be the startup
         | cost because the application has placed the data on the CPU
         | memory side, and is considering shipping out to GPU memory just
         | for this computation.
        
           | KeplerBoy wrote:
           | You can modify the roofline model to include the PCIe
           | bandwidth. It is sometimes called a hierarchical roofline
           | model.
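            | 
            | Something like this, in spirit (illustrative sketch; the
            | bandwidth and peak numbers are hypothetical, not the
            | article's hardware):
            | 
            |   #include <algorithm>
            |   #include <cstdio>
            | 
            |   int main() {
            |     const double peak = 10000.0;  // GFLOP/s compute roof
            |     const double hbm  = 500.0;    // GB/s device memory
            |     const double pcie = 12.0;     // GB/s host<->device
            |     const double ai   = 0.25;     // flop/byte, f32 dot
            |     double on_dev  = std::min(peak, ai * hbm);
            |     double via_pci = std::min(on_dev, ai * pcie);
            |     std::printf("%.0f vs %.0f GFLOP/s\n", on_dev, via_pci);
            |   }
            | 
            | Whichever ceiling is lowest wins; for a dot product that
            | has to cross PCIe it's the link, by a couple of orders of
            | magnitude.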
        
       | SigmundA wrote:
        | Would be interesting to see what a unified memory setup can
        | do, like say an Apple M-series, since this is the argument for
        | unified memory: zero-copy memory access between CPU and GPU.
        
         | CowFreedom wrote:
         | Even the integrated Intel HD Graphics would be an interesting
         | comparison.
        
         | mirsadm wrote:
         | Unified memory is what makes using the GPU viable in my use
         | case (mobile). The copy operation is almost always the slowest
         | part. This is especially true for real time work.
        
       | glitchc wrote:
        | It's about the relative size of the transfer overhead vs the
        | amount of compute. For a single operation, sure, the transfer
        | overhead dominates. Add multiple compute steps (operations),
        | however, and experiments will show that the GPU is faster, as
        | the transfer cost is fixed.
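        | 
        | A toy break-even calculation makes the point (all numbers
        | invented for illustration):
        | 
        |   #include <cstdio>
        | 
        |   int main() {
        |     const double xfer = 40.0;  // ms, one-time H2D copy
        |     const double gpu  = 1.0;   // ms per operation on device
        |     const double cpu  = 5.0;   // ms per operation on host
        |     for (int k : {1, 5, 10, 50})
        |       std::printf("k=%2d: GPU %5.0f ms, CPU %5.0f ms\n",
        |                   k, xfer + k * gpu, k * cpu);
        |     // break-even at k = xfer / (cpu - gpu) = 10 here
        |   }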
        
       | moomin wrote:
       | Even if the GPU took literally no time at all to compute the
       | results there would be workflows where doing it on the CPU was
       | faster.
        
         | Waterluvian wrote:
         | The GPU is the gas station across town that's five cents
         | cheaper.
        
           | adamc wrote:
           | Good analogy.
        
           | _zoltan_ wrote:
           | this is simply not true. the post just uses outdated
           | hardware, as pointed out above.
           | 
           | a GH200 will run miles around any CPU.
        
             | CowFreedom wrote:
             | The gist of the post is that optimizations and
             | interpretations thereof must always be made with respect to
             | the underlying hardware.
        
             | Dylan16807 wrote:
             | In this analogy, using better hardware affects the discount
             | per gallon, but there are still situations where the closer
             | gas station is the better choice.
        
           | Legend2440 wrote:
           | No, it's the bulk order from China that's 100x cheaper but
           | has a minimum order quantity of 100000 units and takes 6-8
           | weeks to get here.
        
         | bob1029 wrote:
         | L2 cache is 1000x closer than the PCIe bus. Pretty much
         | anything that has to respond in ~realtime to outside events
         | will run better on the CPU. You can use the GPU to visualize
         | the state of the system with some small delay (e.g., video
         | games), but it is not so great at modifying state in an
         | efficient manner - especially when serialization of events &
         | causality are important.
        
           | gpderetta wrote:
           | > ~realtime to outside events
           | 
            | well, to play devil's advocate, for the outside event to
            | affect the CPU, a signal still has to come in over the PCIe
            | bus or an equivalent interface.
        
       | dragontamer wrote:
       | There is the compute vs communicate ratio.
       | 
        | For problems like matrix multiplication, it costs N^2 to
        | communicate the problem but N^3 operations to calculate.
        | 
        | For problems like dot product, it costs N to communicate but
        | only N operations to calculate.
        | 
        | Compute must be substantially larger than communication costs
        | if you hope to see any benefits. Asymptotic differences
        | obviously help, but even a linear factor can be enough.
       | 
       | You'd never transfer N data to perform a log(n) binary search for
       | example. At that point communication dominates.
        
         | AnotherGoodName wrote:
          | For those skimming, and to add to the above: the article uses
          | the GPU to work on data in system memory, since that's where
          | the initial data lives and where the result is wanted in this
          | case, and compares it to a CPU doing the same. The entire
          | bottleneck is GPU-to-system-memory transfer.
          | 
          | If you're willing to work entirely out of GPU memory, the GPU
          | will of course be faster even in this scenario.
        
           | saagarjha wrote:
           | Assuming your task is larger than kernel launch overhead, of
           | course.
        
         | HarHarVeryFunny wrote:
         | Sure, but once you've justified moving data onto the GPU you
         | don't want to incur the cost of moving the operation output
         | back to the CPU unless you have to. So, for example, you might
         | justify moving data to the GPU for a neural net convolution,
         | but then also execute the following activation function (&
         | subsequent operators) there because that's now where the data
         | is.
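          | 
          | As a sketch of that pattern (toy element-wise kernels
          | standing in for the real conv/activation; names are made up,
          | compiles with nvcc):
          | 
          |   #include <cuda_runtime.h>
          |   #include <cstdio>
          |   #include <vector>
          | 
          |   __global__ void scale(float* x, float a, int n) {
          |     int i = blockIdx.x * blockDim.x + threadIdx.x;
          |     if (i < n) x[i] *= a;
          |   }
          |   __global__ void relu(float* x, int n) {
          |     int i = blockIdx.x * blockDim.x + threadIdx.x;
          |     if (i < n) x[i] = x[i] > 0.f ? x[i] : 0.f;
          |   }
          | 
          |   int main() {
          |     const int n = 1 << 20;
          |     std::vector<float> h(n, -1.5f);
          |     float* d;
          |     cudaMalloc(&d, n * sizeof(float));
          |     cudaMemcpy(d, h.data(), n * sizeof(float),
          |                cudaMemcpyHostToDevice);     // pay once
          |     dim3 blk(256), grd((n + 255) / 256);
          |     scale<<<grd, blk>>>(d, 2.f, n);  // stays on device
          |     relu<<<grd, blk>>>(d, n);        // no round trip
          |     cudaMemcpy(h.data(), d, n * sizeof(float),
          |                cudaMemcpyDeviceToHost);     // result only
          |     cudaFree(d);
          |     std::printf("%f\n", h[0]);       // -1.5 -> -3 -> 0
          |   }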
        
           | leeter wrote:
            | So a while back I was working on a chaotic renderer. This
            | led me to a really weird set of situations:
           | 
           | * If the GPU is a non-dedicated older style intel GPU, use
           | CPU
           | 
           | * If the GPU is a non-dedicated anything else do anything
           | super parallel on the GPU but anything that can be BRRRRRT
           | via CPU on the CPU because the memory is shared.
           | 
           | * If the GPU is dedicated move everything to GPU memory and
           | keep it there and only pull back small statistics if at all
           | plausible.
        
       | lmeyerov wrote:
        | Comparing multicore wide AVX to CUDA is a bit of an unnecessary
        | nuance for most folks. These comparisons make sense, but miss
        | the forest for the trees:
       | 
        | - Either way, you're writing 'cuda style' fine-grained data
        | parallel code that looks and works very differently from
        | regular multithreaded code. You are now in a different
        | software universe.
       | 
       | - You now also have to think about throughput, latency hiding,
       | etc. Nvidia has been commoditizing throughput-oriented hardware a
       | lot better than others, and while AMD is catching up on some
       | workloads, Nvidia is already advancing. This is where we think
       | about bandwidth between network/disk=>compute unit. My best
        | analogy here, when looking at things like GPU Direct
        | Storage/Network, is that CPU systems feel like a long twisty
        | straw, while GPU paths are fat pipes. Big compute typically
        | needs both
       | compute + IO, and hardware specs tell you the bandwidth ceiling.
       | 
        | To a large extent, ideas are cross-pollinating -- CPUs looking
       | more like GPUs, and GPUs getting the flexibility of CPUs -- but
       | either way, you're in a different universe of how code & hardware
       | works than 1990s & early 2000s intel.
        
         | bee_rider wrote:
         | Realistically you should use Numpy or Cupy (or whatever the
         | appropriate/fashionable library is) anyway, because tuning this
         | stuff is a big pain.
         | 
          | So, GPUs have the slight disadvantages that you have to think
          | about data movement and that the drivers are a little less
          | convenient to install, but it isn't really a big deal.
        
           | lmeyerov wrote:
           | Agreed! The bigger shift is switching to data parallel coding
           | styles.
        
           | PartiallyTyped wrote:
           | I am a big fan of jax for numerical computations these days.
        
             | bee_rider wrote:
             | I've been seeing lots of posts about it lately. Haven't had
             | a chance to try it out, though.
        
       | rnrn wrote:
        | How can the multicore AVX implementation do a dot product (for
        | arrays much larger than cache) at 340 GB/s on a system with RAM
        | bandwidth < 50 GB/s?
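        | 
        | Back of the envelope (array size assumed, just to show the
        | scale of the claim):
        | 
        |   #include <cstdio>
        | 
        |   int main() {
        |     const double n = 1e9;               // floats per array
        |     const double gb = 2 * n * 4 / 1e9;  // both arrays: 8 GB
        |     std::printf("at 50 GB/s: %.2f s minimum\n", gb / 50);
        |     std::printf("at 340 GB/s: %.2f s claimed\n", gb / 340);
        |   }
        | 
        | Arrays that don't fit in cache have to stream every byte over
        | the memory bus at least once, so the sustained figure can't
        | exceed the bus bandwidth.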
        
         | alecco wrote:
         | I think the post is a bit disingenuous.
         | 
         | But about bandwidth, matrix multiplications happen mostly in
         | cache and that has a lot more bandwidth than RAM. Blocks of the
         | matrix are loaded to cache (explicitly in CUDA) and used
         | multiple times there.
         | 
         | I'd exploit the better multi-level cache hierarchy in CPUs and
         | make the code NUMA aware. But still I wouldn't bet against a
         | recent GPU card.
        
           | rnrn wrote:
           | > But about bandwidth, matrix multiplications happen mostly
           | in cache and that has a lot more bandwidth than RAM. Blocks
           | of the matrix are loaded to cache (explicitly in CUDA) and
           | used multiple times there.
           | 
           | The post is about dot product, not matrix multiply. Dot
           | product has no data reuse
        
         | rnrn wrote:
         | Answer: it can't.
         | 
         | The author has updated the post with corrected AVX
         | measurements, with the original ~340 GB/s revised down to 31.7
         | GB/s. (Thanks CowFreedom)
        
       | jsight wrote:
       | TBH, I'm finding that people underestimate the usefulness of CPU
       | in both inference and fine tuning. PEFT with access to 64GB+ RAM
       | and lots of cores can sometimes be cost effective.
        
         | ramoz wrote:
         | I think engineers learn this quickly in high-scale/performance
         | production environments. Even without hardware backgrounds.
         | SLAs/costs create constraints you need to optimize against
         | after promising the business line these magical models can
         | enable that cool new feature for a million users.
         | 
         | Traditional AI/ML models (including smaller transformers) can
         | definitely be optimized for mass scale/performance on cpu-
         | optimized infrastructure.
        
       | alecco wrote:
       | > Each version is severely memory bandwidth bottlenecked, the
       | CUDA version suffers the most with its practical 11.8 GB/s
       | device-to-host bandwidth due to its PCI-Express 3.0 x16
       | interface.
       | 
       | PCIe 3.0? What?
       | 
       | https://cowfreedom.de/#appendix/computer_specs/
       | 
       | > GeForce GTX 1050 Ti with Max-Q Design (PCIe 3.0 x16) (2016)
       | 
       | > Intel Core i5-8300H (2020)
       | 
       | This is a low-price 8 year old GPU and a 4 year old CPU. And he
       | seems to be including loading the data to GPU. Newer cards have
       | wide PCIe 5.0 or some faster interconnect, like Nvidia Grace-
       | Hopper.
       | 
        | Also, he is comparing his own CUDA implementation. He should use
        | one of the many available in cuBLAS/CUTLASS. Making a good CUDA
        | GEMM is a very difficult art and very hardware specific.
        
         | jsheard wrote:
         | > Newer cards have wide PCIe 5.0 or some faster interconnect,
         | like Nvidia Grace-Hopper.
         | 
         | There aren't any (edit: consumer) GPUs with PCIe5 yet, though
         | they probably aren't far off. Plenty already have PCIe4 though.
        
           | alecco wrote:
           | Consumer cards are PCIe 4.0 x16. H100 PCIe version is PCIe
           | 5.0 https://en.wikipedia.org/wiki/Hopper_(microarchitecture).
           | And it's been out 2 years already.
        
             | throwway120385 wrote:
             | So everyone is supposed to do all of their testing on
             | H100's?
        
               | alecco wrote:
               | 4090 (2022) PCIe 4.0 x16 is quite decent. The major limit
               | is memory, not bandwidth. And 3090 (2020) is also PCIe
               | 4.0 x16, and used cards are a bargain. You can hook them
               | up with Nvlink.
               | 
               | Nvidia is withholding new releases but the current
               | hardware has more legs with new matrix implementations.
               | Like FlashAttention doing some significant improvement
               | every 6 months.
               | 
               | Nvidia could make consumer chips with combined CPU-GPU. I
               | guess they are too busy making money with the big cloud.
               | Maybe somebody will pick up. Apple is already doing
               | something like it even on laptops.
        
               | _zoltan_ wrote:
               | get a GH100 on lambda and behold you have 900GB/s between
               | CPU memory and GPU, and forget PCIe.
        
               | saagarjha wrote:
               | Where are you seeing 900 GB/s?
        
               | KeplerBoy wrote:
               | The 900 GB/s are a figure often cited for Hopper based
               | SXM boards, it's the aggregate of 18 NVLink connections.
               | So it's more of a many-to-many GPU-to-GPU bandwidth
               | figure.
               | 
               | https://www.datacenterknowledge.com/data-center-
               | hardware/nvi...
        
               | imtringued wrote:
               | That doesn't add up, because you're now exceeding the
               | memory bandwidth of the memory controller. I.e. it would
               | be faster to do everything, including the CPU only
               | algorithms, in far away VRAM.
        
               | gpderetta wrote:
               | latency might be higher and latency usually dominates CPU
               | side algorithms.
        
             | jsheard wrote:
             | TIL, I missed Hopper already having it. I assume the RTX
             | 5000 series will bring it to consumers.
        
             | yvdriess wrote:
              | For the consumer GPUs, PCIe 4.0 x16 has plenty of BW
              | headroom. The full-sized x16 connector is more for
              | stability reasons. Some vendors even put a couple of M.2
              | slots on PCIe 4/5 GPU boards to make use of the unused
              | PCIe lanes.
        
               | KeplerBoy wrote:
               | That's mixing apples with oranges.
               | 
               | Some low-end or middle range GPUs really only use 8
               | lanes, because that's how the chip is designed. Fewer
               | lanes -> less silicon area for the pcie logic needed ->
               | cheaper.
               | 
               | The chips which use 16 lanes take advantage of it and can
               | saturate the link.
        
               | wtallis wrote:
               | Smaller GPU silicon is also often designed more around
               | the laptop market, where it's common for the CPU to not
               | have more than 8 lanes available for the GPU.
        
             | touisteur wrote:
             | Well let me tell you about the pretty high-end and
             | expensive L40 that shipped with PCIe-4.0 to my utter dismay
             | and disgust. Only the H100 had 5.0 although I could already
             | saturate 4.0 (and 5.0) with Mellanox NICs and
             | GPU/StorageDirect. Waiting for the next one to maybe get
             | 5.0.
        
           | goosedragons wrote:
            | Supposedly the Intel B580, releasing Friday, will use PCIe
            | 5.0 x8.
        
             | yvdriess wrote:
             | PCIe 4.0 16x
             | 
             | https://www.intel.com/content/www/us/en/products/sku/227961
             | /...
        
               | phonon wrote:
               | That's A580. Correct link https://www.intel.com/content/w
               | ww/us/en/products/sku/241598/...
               | 
               | "PCI Express 4.0 x8"
        
               | alecco wrote:
               | So Intel is doing the same artificial performance
               | limitation. Nvidia 4060s are also PCIe 4.0 with half
               | lanes (x8). Argh.
        
               | phonon wrote:
               | Alternatively, at that performance level it doesn't make
               | a difference, so better to save the $5....
        
           | _zoltan_ wrote:
           | that's not true, H100 NVL is PCIe gen5 x16.
        
         | _zoltan_ wrote:
         | GH100 can do 900GB/s HtoD.
        
           | alecco wrote:
           | And both 3090 and 4090 can do 32 GB/s host-device. Not far
           | from CPU-RAM. You only load the matrix once. The bandwidth
           | for the matmul is orders of magnitude larger and happens all
           | in device and mostly in cache.
        
           | jing wrote:
              | No it can't. That's device to device.
        
             | saagarjha wrote:
             | Well, device to device is technically doubled because you
             | have a read and a write. But yes
        
         | Ballas wrote:
         | The CPU was launched in Q2 2018, so that is also 6 years. I
         | wonder what the outcome will be with a CPU that supports
         | AVX-512 and a more recent GPU.
        
       | hangonhn wrote:
       | Question from someone who doesn't know enough about GPUs:
       | Recently a friend mentioned his workstation has 384 cores using 4
       | processors. This is starting to approach some of the core numbers
       | of earlier GPUs.
       | 
       | Is there a possibility that in the not too distant future that
       | GPUs and CPUs will just converge? Or are the tasks done by GPUs
       | too specialized?
        
         | krapht wrote:
         | Too specialized. You can't use GPUs as general purpose
         | computers. The basic unit of operation is the warp, which is 32
         | threads operating in lockstep (simplified). If you're not using
         | all 32 threads, then you may as well not be using a GPU.
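          | 
          | A deliberately bad kernel makes the point (kernel only,
          | launch boilerplate omitted): only lane 0 of each 32-wide warp
          | does any work, yet the whole warp stays occupied, so roughly
          | 31/32 of the machine idles.
          | 
          |   __global__ void one_lane(const float* x, float* y,
          |                            int nwarps, int len) {
          |     int tid  = blockIdx.x * blockDim.x + threadIdx.x;
          |     int warp = tid / 32;
          |     if (tid % 32 == 0 && warp < nwarps) {
          |       float acc = 0.f;
          |       for (int k = 0; k < len; ++k)  // serial on one lane
          |         acc += x[warp * len + k];
          |       y[warp] = acc;
          |     }
          |   }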
        
         | immibis wrote:
         | You'd also need extreme hyperthreading. A GPU can cycle between
         | several warps in the same execution unit (barrel-processor-
         | style), padding out the time per instruction to hide memory
         | latency, while still getting the same throughput. That's
         | counter to the fundamental design of CPUs.
        
         | JonChesterfield wrote:
         | They're really similar already. You can program a GPU much like
         | you would a CPU (existence proof at
          | https://news.ycombinator.com/item?id=42387267). There's a lot
          | of obfuscation and hoarding of ancient knowledge from the
          | "GPUs are special" enthusiasts, but the magic doesn't survive
          | a close look at the things. It's a masked vector ISA.
         | 
         | My dev GPU is a 6800XT. Cheapish gaming card from a little
         | while ago, 16GB ram on the card. 72 "compute units" which are
         | independent blocks of hardware containing memory ports,
         | floating point unit, register file etc. Roughly "a core" from
         | x64 world. Each of those can have up to 64 tasks ready to go,
         | roughly a "hyperthread". It's 300W or so.
         | 
         | There's some noise in the details, e.g. the size of the
         | register file from the perspective of a hyperthread affects how
          | many can be resident on the compute unit ready to run, and the
          | memory hierarchy has extra layers in it. The vector unit is
          | 256 bytes wide as opposed to 64 bytes on x64.
         | 
         | But if you wanted to run a web browser entirely on the GPU and
         | were sufficiently bloody minded you'd get it done, with the CPU
         | routing keyboard I/O to it and nothing else. If you want a
         | process to sit on the GPU talking to the network and crunching
         | numbers, don't need the x64 or arm host to do anything at all.
        
         | owlbite wrote:
         | CPUs and GPUs are fundamentally aiming at different vertices of
         | the performance polygon.
         | 
         | CPUs aim to minimize latency (how many cycles have to pass
         | before you can use a result), and do so by way of high clock
         | frequencies, caches and fancy micro architectural tricks. This
         | is what you want in most general computation cases where you
         | don't have other work to do whilst you wait.
         | 
         | GPUs instead just context switch to a different thread whilst
         | waiting on a result. They hide their latency by making
         | parallelism as cheap as possible. You can have many more cores
         | running at a lower clock frequency and be more efficient as a
         | result. But this only works if you have enough parallelism to
         | keep everything busy whilst waiting for things to finish on
         | other threads. As it happens that's pretty common in large
         | matrix computations done in machine learning, so they're pretty
         | popular there.
         | 
         | Will they converge? I don't think so - they're fundamentally
         | different design points. But it may well be that they get
         | integrated at a much closer level than current designs, pushing
         | the heterogeneous/dark silicon/accelerator direction to an
         | extreme.
        
         | nox101 wrote:
          | Single GPU threads are up to 20x slower than CPU threads. GPUs
          | get their speed from massive parallelization, SIMD-style "do N
          | things with one instruction", and some specialized hardware
          | (like texture samplers).
         | 
         | If you take a serial algorithm and put it on the GPU, it's easy
         | to verify that a single GPU thread is much slower than a single
         | thread on the CPU. For example, just do a bubble sort on the
         | GPU with a single thread. I'm not even including the time to
         | transfer data or read the result. You'll easily find the CPU is
         | way faster.
         | 
         | The way you get GPU speed is by finding/designing algorithms
         | that are massively parallel. There are lots of them. There are
         | sorting solutions for example.
         | 
          | As an example, 100 cores * 32 execution units per core = 3200;
          | divide by the 20x slowdown and you're still 160x faster than
          | the CPU if you can figure out a parallel solution. But not
          | every problem has a parallel solution, and when it doesn't,
          | that's where the CPU wins.
         | 
         | It seems unlikely GPU threads will be as fast as CPU threads.
         | They get their massive parallelism by being simpler.
         | 
         | That said, who knows what the future holds.
        
       | ltbarcly3 wrote:
       | Taking a helicopter is not always faster than walking.
       | 
       | Is this surprising or obvious?
        
       | ryao wrote:
       | The same thing applies to using a GPU to do inference with your
       | weights in system memory. That is why nobody does that.
        
       | shihab wrote:
       | An otherwise valid point made using a terrible example.
        
         | juunpp wrote:
         | Terrible post, really. Did they need 5 pages to say that a
         | silly dot product micro-bench that is PCIE-bound loses to a CPU
         | implementation? Why are they even comparing computation vs
         | computation + memory transfer?
        
           | refulgentis wrote:
           | Because going to the GPU adds the overhead of memory and this
           | is a simple way to demonstrate that. Did you know that
           | already? Congrats, you're in the top 10% of software
           | engineers
        
         | kittikitti wrote:
         | But how will people be impressed if you don't exhaust them with
         | jargon?
        
       | gdiamos wrote:
       | Memory locality depends on your perspective.
       | 
       | The CPU would always be slower if the data originated in GPU
       | memory.
        
         | KeplerBoy wrote:
          | Not necessarily; I'm sure one could construct scenarios where
          | offloading to the CPU makes sense. Think of highly branchy
          | double precision stuff.
        
       | ImHereToVote wrote:
       | The GPU is never faster. It's parallel.
        
       | fancyfredbot wrote:
       | This article is a CPU benchmark and a PCI express bandwidth
       | benchmark. It's masquerading as something else though.
        
       ___________________________________________________________________
       (page generated 2024-12-12 23:02 UTC)