[HN Gopher] Basic Facts about GPUs
       ___________________________________________________________________
        
       Basic Facts about GPUs
        
       Author : ibobev
       Score  : 213 points
       Date   : 2025-06-24 12:15 UTC (10 hours ago)
        
 (HTM) web link (damek.github.io)
 (TXT) w3m dump (damek.github.io)
        
       | kittikitti wrote:
        | This is a really good introduction and I appreciate it. When I
        | was building my AI PC, the deep-dive research into GPUs took a
        | few days, but this lays it all out in front of me. It's
        | especially great because it touches on high-value applications
        | like generative artificial intelligence. A notable diagram from
        | the page, one I wasn't able to find represented well elsewhere,
        | was the memory hierarchy of the A100 GPU. The diagrams were very
        | helpful. Thank you for this!
        
       | b0a04gl wrote:
        | been running llama.cpp and vllm on the same 4070, trying to
        | batch more prompts for serving. llama.cpp was lagging badly once
        | I hit batch 8 or so, even though GPU usage looked fine. vllm
        | handled it way better.
       | 
        | later found vllm uses a paged kv cache with a layout that
        | matches how the GPU wants to read: fully coalesced, without
        | strided jumps. llama.cpp was using a flat layout that's fine for
        | a single prompt but breaks L2 access patterns when batching.
       | 
        | reshaped the kv tensors in llama.cpp to interleave: made it
        | [head, seq, dim] instead of [seq, head, dim], closer to how vllm
        | feeds data into its fused attention kernel. 2x speedup right
        | there for the same ops.
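        | 
        | roughly what that reshape looks like in PyTorch terms (just a
        | sketch with made-up shapes, not the actual llama.cpp buffers):
        | 
        |     import torch
        | 
        |     seq_len, n_heads, head_dim = 4096, 32, 128
        |     # flat [seq, head, dim] layout: fine for a single prompt
        |     kv = torch.randn(seq_len, n_heads, head_dim,
        |                      device="cuda", dtype=torch.float16)
        |     # interleave to [head, seq, dim] and make it contiguous, so
        |     # each head walks its sequence with coalesced reads
        |     kv = kv.permute(1, 0, 2).contiguous()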
       | 
        | the GPU was never the bottleneck. it was the memory layout not
        | aligning with the SMs' expected access stride. vllm just
        | defaults to layouts that make better use of shared memory and
        | reduce global reads. that's the real reason it scales better per
        | batch.
       | 
        | this took its own time, say 2+ days, and I had to dig under the
        | nice-looking GPU graphs to find the real bottlenecks. it was
        | wildly trial and error tbf.
       | 
        | > anybody got an idea on how to do this kind of experiment in
        | hot-reload mode without so much hassle??
        
         | jcelerier wrote:
          | did you do a PR to integrate these changes back into
          | llama.cpp? A 2x speedup would be absolutely wild.
        
           | zargon wrote:
            | Almost nobody using llama.cpp does batch inference. I
            | wouldn't be surprised if the change is somewhat involved to
            | integrate with all of llama.cpp's other features. Combined
            | with the lack of interest, the need to keep up with code
            | churn, and the number of PRs the maintainers are flooded
            | with, that would probably make it difficult to get included.
        
             | tough wrote:
              | if you open a PR, even if it doesn't get merged, anyone
              | with the same issue can find it and use your
              | PR/branch/fix if it suits their needs better than master.
        
               | zargon wrote:
               | Yeah good point. I have applied such PRs myself in the
               | past. Eventually the code churn can sometimes make it too
               | much of a pain to maintain them, but they're useful for a
               | while.
        
             | buildxyz wrote:
              | Any 2x speedup is definitely worth fixing. Especially
              | since someone has already figured out the issue, and
              | performance testing [1] shows that llama.cpp is lagging
              | behind vLLM by 2x. This is a positive for everyone running
              | LLMs locally using llama.cpp.
              | 
              | Even if llama.cpp isn't used for batch inference now, this
              | can finally allow people to run llama.cpp for batching,
              | and on any hardware, since vLLM supports only select
              | hardware. Maybe we can finally stop all this GPU API
              | software fragmentation and the CUDA moat, as llama.cpp
              | benchmarks have shown Vulkan to be as performant as CUDA
              | or SYCL, or more so.
             | 
             | [1] https://miro.medium.com/v2/resize:fit:1400/format:webp/
             | 1*lab...
        
               | menaerus wrote:
                | So, what exactly is a batch inference workload, and how
                | would someone running inference on a local setup benefit
                | from it? Or how would I even benefit from it if I had a
                | single machine hosting multiple users simultaneously?
                | 
                | I believe batching is a concept only useful during the
                | training or fine-tuning process.
        
               | zargon wrote:
               | Batch inference is just running multiple inferences
               | simultaneously. If you have simultaneous requests, you'll
               | get incredible performance gains, since a single
               | inference doesn't leverage any meaningful fraction of a
               | GPU's compute capability.
               | 
               | For local hosting, a more likely scenario where you could
               | use batching is if you had a lot of different data you
               | wanted to process (lots of documents or whatever). You
                | could batch them in sets of x and have it complete in
                | roughly 1/x the time.
               | 
               | A less likely scenario is having enough users that you
               | can make the first user wait a few seconds while you wait
               | to see if a second user submits a request. If you do get
               | a second request, then you can batch them and the second
               | user will get their result back much faster than if they
               | had had to wait for the first user's request to complete
               | first.
               | 
               | Most people doing local hosting on consumer hardware
               | won't have the extra VRAM for the KV cache for multiple
               | simultaneous inferences though.
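                | 
                | A toy way to see the effect (PyTorch sketch with made-up
                | sizes, not what any particular server actually does):
                | 
                |     import torch
                | 
                |     dev = "cuda" if torch.cuda.is_available() else "cpu"
                |     # stand-in for one weight matrix of a model
                |     W = torch.randn(4096, 4096, device=dev)
                |     # 8 concurrent requests, one token's activations each
                |     xs = torch.randn(8, 4096, device=dev)
                | 
                |     # sequential: W is streamed from memory 8 times
                |     outs = [x @ W for x in xs]
                | 
                |     # batched: one matmul, W is streamed once for all 8
                |     out = xs @ W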
        
               | menaerus wrote:
               | Wouldn't batching the multiple inference requests from
               | multiple different users with multiple different contexts
               | simultaneously impact the inference results for each of
               | those users?
        
               | pests wrote:
                | The different prompts being batched do not mathematically
                | affect each other. When running inference, you have
                | massive weights that need to get loaded and unloaded just
                | to serve the current prompt and however long its context
                | is (maybe even just a few tokens). Batching lets you
                | manipulate and move the weights around less to serve the
                | same amount of combined context.
        
               | menaerus wrote:
                | Batching isn't about "moving weights around less". Where
                | do you move the weights anyway once they are loaded into
                | the GPU VRAM? Batching, as always in CS problems, is
                | about maximizing the compute per unit of a single round
                | trip, in this case the DMA of context from CPU RAM to
                | GPU VRAM.
                | 
                | The premise of self-attention is exactly that it isn't
                | context-free, so it is also incorrect to say that
                | batched requests do not mathematically affect each
                | other. They do, and that's by design.
        
               | zargon wrote:
               | > Where do you move the weights anyway once they are
               | loaded into the GPU VRAM?
               | 
               | The GPU can't do anything with weights while they are in
               | VRAM. They have to be moved into the GPU itself first.
               | 
               | So it is about memory round-trips, but not between RAM
               | and VRAM. It's the round trips between the VRAM and the
               | registers in the GPU die. When batch processing, the
               | calculations for all batched requests can be done while
               | the model parameters are in the GPU registers. Compared
               | to if they were done sequentially, you would multiply the
               | number of trips between the VRAM and the GPU by the
               | number of individual inferences.
               | 
               | Also, batched prompts and outputs are indeed
               | mathematically independent from each other.
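                | 
                | Back-of-envelope version of that, with illustrative
                | numbers only:
                | 
                |     params = 7e9                # ~7B-parameter model
                |     weight_bytes = params * 2   # fp16 -> ~14 GB per pass
                |     bandwidth = 2e12            # ~2 TB/s VRAM bandwidth
                |     n_requests = 8
                | 
                |     # weight traffic per decode step, in seconds
                |     sequential = n_requests * weight_bytes / bandwidth  # ~0.056
                |     batched = weight_bytes / bandwidth                  # ~0.007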
        
           | zozbot234 wrote:
            | It depends: if the optimization is too hardware-dependent,
            | it might hurt/regress performance on other platforms. One
            | would have to find ways to generalize and auto-tune it based
            | on known features of the local hardware architecture.
        
             | amelius wrote:
              | Yes, the easiest approach is to separate it into a set of
              | options, then have a bunch of JSON/YAML files, one for
              | each hardware configuration. From there, the community can
              | fiddle with the settings and share new ones as new
              | hardware is released.
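              | 
              | Rough sketch of what that could look like (file names and
              | keys are made up for illustration):
              | 
              |     import json
              |     import torch
              | 
              |     # e.g. "NVIDIA GeForce RTX 4070"
              |     gpu = torch.cuda.get_device_name(0)
              |     # one tuning file per hardware configuration
              |     with open("tuning/%s.json" % gpu) as f:
              |         opts = json.load(f)   # e.g. {"kv_layout": "head_seq_dim"}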
        
         | tough wrote:
          | did you see nano-vllm [1] yesterday, from a DeepSeek
          | employee? ~1200 LOC and faster than vanilla vLLM.
         | 
         | 1. https://github.com/GeeeekExplorer/nano-vllm
        
           | Gracana wrote:
           | Is it faster for large models, or are the optimizations more
           | noticeable with small models? Seeing that the benchmark uses
           | a 0.6B model made me wonder about that.
        
             | tough wrote:
              | I have not tested it, but it's from a DeepSeek employee.
              | I don't know if it's used in prod there or not!
        
         | leeoniya wrote:
         | try https://github.com/ikawrakow/ik_llama.cpp
        
         | chickenzzzzu wrote:
          | > GPU was never the bottleneck
          | > it was memory layout
          | 
          | ah right, so the GPU was the bottleneck then
        
       | SoftTalker wrote:
       | Contrasting colors. Use them!
        
         | jasonjmcghee wrote:
          | If the author stops by: the links and the comments in the
          | code blocks were the ones I had to use extra effort to read.
          | 
          | It might be worth trying to increase the contrast a bit.
         | 
         | The content is really great though!
        
         | cubefox wrote:
          | The website seems to use alpha transparency for text. A
          | grave, contrast-reducing sin.
        
           | xeonmc wrote:
           | It's just liquid-glass text and you'll get used to it soon
           | enough.
        
         | currency wrote:
         | The author might be formatting for and editing in dark mode. I
         | use edge://flags/#enable-force-dark and the links are readable.
        
         | Yizahi wrote:
         | font-weight: 300;
         | 
          | I'm 99% sure the author designed this website on an Apple
          | Mac with so-called "font smoothing" enabled, which makes all
          | regular fonts artificially "semi-bold". So to make a
          | normal-looking font, Mac designers use this thinner font
          | weight, and then Apple helpfully renders it as roughly
          | "normal".
         | 
         | https://news.ycombinator.com/item?id=23553486
        
           | neuroelectron wrote:
           | Jfc
        
       | elashri wrote:
        | Good article summarizing a good chunk of information that
        | people should have some idea about. I just want to comment that
        | the title is a little bit misleading, because it describes the
        | particular choices NVIDIA makes in developing its GPU
        | architectures, which is not always what others do.
        | 
        | For example, the arithmetic-intensity break-even point (ridge
        | point) is very different once you leave NVIDIA-land. The AMD
        | Instinct MI300, for instance, offers up to 160 TFLOPS of FP32
        | paired with ~6 TB/s of HBM3/3E bandwidth, which gives a ridge
        | point near 27 FLOPs/byte, about double the A100's 13
        | FLOPs/byte. The larger on-package HBM (128-256 GB) also shifts
        | the practical trade-offs between tiling depth and occupancy.
        | That said, it is very expensive and does not have CUDA (which
        | can be good and bad at the same time).
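        | 
        | Putting numbers to it (ridge point = peak FLOP/s divided by
        | memory bandwidth; A100 figures are the commonly quoted FP32
        | peak and HBM bandwidth):
        | 
        |     mi300_ridge = 160e12 / 6e12       # ~27 FLOPs/byte
        |     a100_ridge  = 19.5e12 / 1.555e12  # ~12.5, the ~13 cited above
        | 
        | A kernel whose arithmetic intensity falls between the two ridge
        | points would be compute-bound on the A100 but bandwidth-bound
        | on the MI300.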
        
         | apitman wrote:
          | Unfortunately, Nvidia GPUs are the only ones that matter
          | until AMD starts taking their compute software seriously.
        
           | fooblaster wrote:
           | They are. It's just not at the consumer hardware level.
        
             | have-a-break wrote:
              | You could argue it's all the nice GPU debugging tools
              | NVIDIA provides that make GPU programming accessible.
              | 
              | There are so many potential bottlenecks (usually just
              | memory access patterns, but without tools to verify that,
              | you have to design and run manual experiments).
        
             | tucnak wrote:
             | This misconception is repeated time and time again;
             | software support of their datacenter-grade hardware is just
             | as bad. I've had the displeasure of using MI50, MI100 (a
              | lot), and MI210 (very briefly). All three are supposedly
              | enterprise-grade computing hardware, and yet it was a
              | pathetic experience, with a myriad of disconnected
              | components which had to be patched and married with a
              | very specific kernel version to get ANY kind of LLM
              | inference going.
             | 
             | Now, the last of it I bothered with was 9 months ago;
             | enough is enough.
        
               | fooblaster wrote:
                | this hardware is ancient history. the MI250 and MI300
                | are much better supported.
        
               | tucnak wrote:
                | What a load of nonsense. The MI210 effectively hit the
                | market in 2023, similarly to the H100. We're talking
                | about a datacenter-grade card that is two years out of
                | date, and it's already "ancient history"?
               | 
               | No wonder nobody on this site trusts AMD.
        
               | bluescrn wrote:
               | Unless you're, you know, using GPUs for graphics...
               | 
               | Xbox, Playstation, and Steam Deck seem to be doing pretty
               | nicely with AMD.
        
           | tucnak wrote:
            | Unfortunately, GPUs are old news now. When it comes to
            | perf/watt/dollar, TPUs are substantially ahead for both
            | training and inference. There's a sparsity disadvantage with
            | trailing-edge TPU devices such as the v4, but if you care
            | about large-scale training of any sort, it's not even close.
            | Additionally, Tenstorrent p300 devices are hitting the
            | market soon enough, and there's lots of promising stuff
            | coming on the Xilinx side of the AMD shop: the recent Versal
            | chips allow for AI compute-in-network capabilities that put
            | NVIDIA Bluefield's supposed programmability to shame. NVIDIA
            | likes to say Bluefield is like a next-generation SmartNIC,
            | but compared to the actually field-programmable Versal
            | stuff, it's more like a 100BASE-T card from the 90s.
            | 
            | I think it's very naive to assume that GPUs will continue
            | to dominate the AI landscape.
        
             | menaerus wrote:
             | So, where does one buy a TPU?
        
               | tucnak wrote:
               | The actual lead times on similarly-capable GPU systems
               | are so long, by the time your order is executed, you're
               | already losing money. Even assuming perfect utilization,
               | and perfect after-market conditions--you won't be making
               | any money on the hardware anyway.
               | 
               | Buy v. rent calculus is only viable if there's no
               | asymmetry between the two. Oftentimes, what you can rent
               | you cannot buy, and vice-versa, what you can buy--you
               | could never rent. Even if you _could_ buy an actual TPU,
               | you wouldn't be able to run it anyway, as it's all built
               | around sophisticated networking and switching
               | topologies[1]. The same goes for GPU deployments of
               | comparable scale: what made you think that you could buy
                | and run GPUs at scale?
               | 
               | It's a fantasy.
               | 
               | [1] https://arxiv.org/abs/2304.01433
        
               | almostgotcaught wrote:
               | Is your answer to "where can I buy a TPU" that you can't
               | buy a GPU either? That's a new one.
               | 
               | First of all I don't understand how that's an answer.
               | Second of all it's laughably wrong - I can name 5 firms
               | (outside of FAANG) off the top of my head with >1k
               | Blackwell devices and they're making very good money
               | (have you ever heard of quantfi....). Third of all, how
               | is TPU going to conquer absolutely anything when (as you
               | admit) you couldn't run one even if you could buy one?
        
               | tucnak wrote:
                | I never claimed that "TPU is going to conquer
                | everything"; it's a matter of fact that the latest-
                | generation TPU is currently the most cost-effective
                | solution for large-scale training. I'm not even saying
                | that NVIDIA has lost, just that GPUs have lost. Maybe
                | NVIDIA comes up with a non-GPU-based system, and it
                | includes programmable fabric to enable compute-in-network
                | capabilities, sure, anything other than the Bluefield
                | nonsense, but it's already clear from an engineering
                | standpoint that the formula of large HBM stacks attached
                | to a "GPU" plus Bluefield is over.
        
               | almostgotcaught wrote:
                | > NVIDIA has lost, just that GPUs have lost
               | 
               | i hope you realize how silly you sound when
               | 
               | 1. NVDA's market cap is 70% more than GOOG's
               | 
               | 2. there is literally not a single other viable
               | competitor to GPGPU amongst the 30 or so "accelerator"
               | companies that all swear their thing will definitely be
               | the one, even with many of them approaching 10 years in
               | the market by now (cerebras, samba nova, groq, dmatrix,
               | blah blah blah).
        
               | menaerus wrote:
                | Right. Your argument doesn't really follow. Since I
                | cannot buy a TPU, which you agree with, the single
                | viable option is really only a GPU, which I _can_ buy.
                | 
                | So, according to that, GPUs aren't really going anywhere
                | unless there's a new player in town who will compete
                | with Nvidia and sell at lower prices.
        
             | almostgotcaught wrote:
             | > Unfortunately, GPU's are old news now
             | 
             | ...
             | 
             | > the recent Versal chips allow for AI compute-in-network
             | capabilities that puts NVIDIA Bluefield's supposed
             | programmability to shame
             | 
              | I'm always just like... who are you people? Like, what is
              | the profile of a person that just goes around proclaiming
              | wild things as if they're completely established? And I
              | see this kind of comment on hn very frequently. Like, you
              | either work for Tenstorrent, or you're an influencer or a
              | zdnet presenter, or just... because none of this is even
              | remotely true.
             | 
             | Reminds me of
             | 
             | "My father would womanize; he would drink. He would make
             | outrageous claims like he invented the question mark.
             | Sometimes, he would accuse chestnuts of being lazy."
             | 
             | > I think it's very naive to assume that GPU's will
             | continue to dominate the AI landscape
             | 
             | I'm just curious - how much of your portfolio is AMD and
             | how much is NVDA and how much is GOOG?
        
               | timeinput wrote:
                | Listen, I'm ~~not~~ all in on Ferrero Rocher, and
                | chestnuts *are* lazy. Nowhere near as productive as
                | hazelnuts.
        
               | tucnak wrote:
               | > I'm just curious - now much of your portfolio is AMD
               | 
               | I'm always just like... who are you people: financiers,
               | or hackers? :-) I don't work for TT, but I am a founder
               | in the vertical AI space. Firstly, every major player is
               | making AI accelerators of their own now, and guess what,
               | most state-of-the-art designs have very little in common
                | with a GPGPU design of yesteryear. We have thoroughly
                | evaluated various options, including buying/renting
                | NVIDIA hardware; unfortunately, it didn't make any
                | sense, neither in terms of cost nor capability. Buying
                | (and waiting _months_ for) rack-fuls of NVIDIA hardware
                | is the quickest way to bankrupt your business with
                | CAPEX. Renting the same hardware merely moves the
                | disease to OPEX, and in the post-ZIRP era this is
                | equally devastating.
               | 
               | No matter how much HBM memory you get for whatever
               | individual device, no matter the packaging--it's never
               | going to be enough. The weights alone are quickly dwarfed
                | by K/V cache pages anyway. This is doubly true if you're
                | executing highly concurrent agents that share a lot of
                | the context, or doing dataset-scale inference
               | transformations. The only thing that matters, truly, is
               | the ability to scale-out, meaning fabrics, RDMA over
               | fabrics. Even the leading-edge GPU systems aren't really
               | good at it, because none of the interconnect is actually
               | programmable.
               | 
                | The current generation of TT cards (7nm) has four 800G
                | NICs per card, and the actual Blackhole chips[1] support
                | up to 12x400G. You can approach TT, they will license
                | you the IP, and you get to integrate it at whatever
                | scale you please (good luck even getting in a room with
                | Arm people!), and because TT's whole stack is open
                | source, you get to "punch in" whatever topology you
                | want[2]. In other words, at least with TT you would
                | _get a chance_ to scale out without bankrupting your
                | business.
               | 
                | The compute hierarchy is fresh and in line with the
                | latest research, their toolchain is as hackable as it
                | gets, and it stands head and shoulders above anything
                | that AMD or Intel has ever released. Most importantly,
                | because TT is currently undervalued, it presents an
                | outstanding opportunity for businesses like ours in
                | navigating around the established cost centers. For
                | example, TT still offers "Galaxy" deployments, which
                | used to contain 32 previous-generation (Wormhole)
                | devices in a 6U air-cooled chassis. It's not a stretch
                | that a similar setup, composed of 32 liquid-cooled
                | Blackholes (2 TB GDDR6, 100 Tbps interconnect), would
                | fit in a 4U chassis. AFAIK, there's no GPU deployment in
                | the world at that density. Similarly to the TPU design,
                | it's also infinitely scalable by means of 3+D twisted
                | torus topologies.
               | 
                | What's currently missing in the TT ecosystem: (1) a
                | "superchip" package including state-of-the-art CPU
                | cores, like TT-Ascalon, which they would also happily
                | license to you, and perhaps more importantly, (2)
                | compute-in-network capability, so that the stupidly
                | massive TT interconnect bandwidth could be exploited by,
                | and informed by, applications.
               | 
               | Firstly, the Grendel superchip is expected to hit the
               | market by the end of next year.
               | 
                | Secondly, because the interconnect is not some
                | proprietary bullshit from Mellanox, you get to introduce
                | programmable-logic NICs into the topology, and maybe
                | even avoid IP encapsulation altogether! There are many
                | reasons to do so, and indeed, Versal FPGAs have lots to
                | offer in terms of hard IP in addition to PL: K/V cache
                | management with offloading to NVMe-oF clusters, prefix
                | matching, reshaping, quantization, compression, and all
                | the other embarrassingly parallel tasks which are
                | basically intractable for anything other than FPGAs.
               | 
                | Today, if we wanted to do a large-scale training run, we
                | would simply go for the most cost-effective option
                | available at scale, which is renting TPU v6 from Google.
                | This is a temporary measure, if anything, because
                | compute-in-network in AI deployments is still a novelty,
                | and nobody can really do it at sufficiently large scale
                | yet. Thankfully, Xilinx is getting there[3]. AWS offers
                | F1 instances, it offers NVMe-accelerated ones, and it
                | offers AI accelerators, but there's a good reason it is
                | unable to offer all three at the same time.
               | 
               | [1] https://riscv.epcc.ed.ac.uk/assets/files/hpcasia25/Te
               | nstorre...
               | 
               | [2] https://github.com/tenstorrent/tt-
               | metal/blob/main/tech_repor...
               | 
               | [3] https://www.amd.com/en/products/accelerators/alveo/v8
               | 0.html
        
               | almostgotcaught wrote:
               | > I don't work for TT, but I am a founder in the vertical
               | AI space
               | 
                | yes, so this perfectly answers my question: you're a
                | salesman "talking their book" when making claims like
                | "GPUs are old news". it makes perfect sense and is
                | basically what i expected on hn.
                | 
                | i'm not really gonna try to respond to anything else in
                | your marketing-brochure-sized post, which is replete
                | with buzzwords and links to more material that isn't
                | actually yours but sure looks like it lends your post
                | credibility ;)
        
               | tucnak wrote:
               | Your obsession with finance/marketing is exactly what I
               | expect to see on HN.
               | 
                | It's a shame your accusations have zero merit. In the
                | future, please try not to embarrass yourself by
                | attempting to get into a technical discussion and then
                | promptly backing out of it without having made a single
                | technical argument in the process. Good luck on the
                | stock market.
               | 
               | See https://news.ycombinator.com/newsguidelines.html
        
               | almostgotcaught wrote:
               | > Your obsession with finance/marketing is exactly what I
               | expect to see on HN.
               | 
                | homie, i work on AI infra at one of these companies that
                | you're so casually citing in all of _your_ marketing
                | content here. you're not simply wrong on the things you
                | claim - you're not _even_ wrong. you literally don't
                | know what you're talking about, because you're citing
                | external-facing docs/code/whatever.
                | 
                | > attempting to get into a technical discussion and then
                | promptly backing out of it without having made a single
                | technical argument in the process
                | 
                | there's no technical discussion to be had with someone
                | that cites _other people's work_ as proof for their own
                | claims.
        
       | eapriv wrote:
       | Spoiler: it's not about how GPUs work, it's about how to use them
       | for machine learning computations.
        
         | oivey wrote:
          | It's a pretty standard rundown of CUDA. Nothing to do with ML
          | other than using relu in an example and mentioning torch.
        
       | neuroelectron wrote:
       | ASCII diagrams, really?
        
       | LarsDu88 wrote:
        | Maybe this should be titled "Basic Facts about Nvidia GPUs", as
        | the warp terminology is a feature of modern Nvidia GPUs.
        | 
        | Again, I emphasize "modern".
        | 
        | An NVIDIA GPU from circa 2003 is completely different and has
        | baked-in circuitry specific to the rendering pipelines used for
        | videogames at that time.
        | 
        | So most of this post is not quite general to all "GPUs", which
        | are a much broader category of devices that don't necessarily
        | support the kind of general-purpose computation we use modern
        | Nvidia GPUs for.
        
       ___________________________________________________________________
       (page generated 2025-06-24 23:00 UTC)