[HN Gopher] Basic Facts about GPUs
___________________________________________________________________
Basic Facts about GPUs
Author : ibobev
Score : 213 points
Date : 2025-06-24 12:15 UTC (10 hours ago)
(HTM) web link (damek.github.io)
(TXT) w3m dump (damek.github.io)
| kittikitti wrote:
| This is a really good introduction and I appreciate it. When I
| was building my AI PC, the deep-dive research into GPUs took a
| few days, but this lays it all out in front of me. It's
| especially great because it touches on high-value applications
| like generative artificial intelligence. A notable diagram from
| the page that I wasn't able to find represented well elsewhere
| was the memory hierarchy of the A100 GPU. The diagrams were very
| helpful. Thank you for this!
| b0a04gl wrote:
| been running llama.cpp and vllm on the same 4070, trying to
| batch more prompts for serving. llama.cpp was lagging badly once
| I hit batch 8 or so, even though GPU usage looked fine. vllm
| handled it way better.
|
| later found vllm uses a paged kv cache with a layout that
| matches how the GPU wants to read: fully coalesced, without
| strided jumps. llama.cpp was using a flat layout that's fine for
| a single prompt but breaks L2 access patterns when batching.
|
| reshaped the kv tensors in llama.cpp to interleave: made it
| [head, seq, dim] instead of [seq, head, dim], closer to how vllm
| feeds data into its fused attention kernel. 2x speedup right
| there for the same ops.
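|
| roughly, that layout change as a standalone pytorch sketch
| (shapes and tensor names are just illustrative, not llama.cpp
| internals):
|
|     import torch
|
|     seq_len, n_heads, head_dim = 4096, 32, 128
|
|     # flat [seq, head, dim]: fine for a single prompt, but each
|     # head's reads are strided by n_heads * head_dim when batching
|     kv_seq_major = torch.randn(seq_len, n_heads, head_dim,
|                                device="cuda")
|
|     # [head, seq, dim]: each head's keys/values sit contiguously,
|     # so a fused attention kernel can read them coalesced
|     kv_head_major = kv_seq_major.permute(1, 0, 2).contiguous()
|
|     assert kv_head_major.shape == (n_heads, seq_len, head_dim)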
|
| GPU was never the bottleneck. it was the memory layout not
| aligning with the SMs' expected access stride. vllm just
| defaults to layouts that make better use of shared memory and
| reduce global reads. that's the real reason it scales better
| per batch.
|
| this took its own time, say 2+ days, and I had to dig under the
| nice-looking GPU graphs to find the real bottlenecks. it was
| wildly trial and error tbf.
|
| > anybody got an idea on how to do this kind of experiment in
| hot-reload mode without so much hassle??
| jcelerier wrote:
| did you do a PR to integrate these changes back into llama.cpp
| ? 2x speedup would be absolutely wild
| zargon wrote:
| Almost nobody using llama.cpp does batch inference. I
| wouldn't be surprised if the change is somewhat involved to
| integrate with all of llama.cpp's other features. Combined
| with the lack of interest and the need to keep up with code
| churn, that would probably make it difficult to get merged,
| given the number of PRs the maintainers are flooded with.
| tough wrote:
| if you open a PR, even if it doesn't get merged, anyone with
| the same issue can find it and use your PR/branch/fix if it
| suits their needs better than master
| zargon wrote:
| Yeah good point. I have applied such PRs myself in the
| past. Eventually the code churn can sometimes make it too
| much of a pain to maintain them, but they're useful for a
| while.
| buildxyz wrote:
| Any 2x speedup is definitely worth fixing. Especially
| since someone has already figured out the issue and
| performance testing [1] shows that llama.cpp is lagging
| behind vLLM by 2x. This is a positive for everyone
| running LLMs locally with llama.cpp.
|
| Even if llama.cpp isn't used for batch inference now,
| this could finally let people use llama.cpp for
| batching, and on any hardware, since vLLM supports only
| select hardware. Maybe we can finally stop all this GPU
| API fragmentation and the CUDA moat, as llama.cpp
| benchmarks have shown Vulkan to be as performant as
| CUDA or SYCL, or more so.
|
| [1] https://miro.medium.com/v2/resize:fit:1400/format:webp/
| 1*lab...
| menaerus wrote:
| So, what exactly is a batch inference workload, and how
| would someone running inference on a local setup benefit
| from it? Or how would I even benefit from it if I had a
| single machine hosting multiple users simultaneously?
|
| I believe batching is a concept only useful during the
| training or fine-tuning process.
| zargon wrote:
| Batch inference is just running multiple inferences
| simultaneously. If you have simultaneous requests, you'll
| get incredible performance gains, since a single
| inference doesn't leverage any meaningful fraction of a
| GPU's compute capability.
|
| For local hosting, a more likely scenario where you could
| use batching is if you had a lot of different data you
| wanted to process (lots of documents or whatever). You
| could batch them in sets of x and have it complete in 1/x
| the time.
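|
| A minimal sketch of that document-processing case using
| vLLM's offline API (the model name, prompts, and variable
| names here are just examples):
|
|     from vllm import LLM, SamplingParams
|
|     docs = ["first report ...", "second report ...",
|             "third report ..."]
|     # one request per document; vLLM batches them internally
|     prompts = [f"Summarize: {d}" for d in docs]
|
|     llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
|     params = SamplingParams(max_tokens=256)
|     for out in llm.generate(prompts, params):
|         print(out.outputs[0].text)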
|
| A less likely scenario is having enough users that you
| can make the first user wait a few seconds while you wait
| to see if a second user submits a request. If you do get
| a second request, then you can batch them and the second
| user will get their result back much faster than if they
| had had to wait for the first user's request to complete
| first.
|
| Most people doing local hosting on consumer hardware
| won't have the extra VRAM for the KV cache for multiple
| simultaneous inferences though.
| menaerus wrote:
| Wouldn't batching the multiple inference requests from
| multiple different users with multiple different contexts
| simultaneously impact the inference results for each of
| those users?
| pests wrote:
| The different prompts being batched do not mathematically
| affect each other. When running inference you have massive
| weights that need to get loaded and unloaded just to serve
| the current prompt and however long its context is (maybe
| just a few tokens). This batching lets you manipulate and
| move the weights around less to serve the same amount of
| combined context.
| menaerus wrote:
| Batching isn't about "moving weights around less". Where
| do you move the weights anyway once they are loaded into
| GPU VRAM? Batching, as always in CS problems, is about
| maximizing the compute done per round trip, in this case
| the DMA of context from CPU RAM to GPU VRAM.
|
| The premise of self-attention is exactly that it isn't
| context-free, so it is also incorrect to say that batched
| requests do not mathematically affect each other. They
| do, and that's by design.
| zargon wrote:
| > Where do you move the weights anyway once they are
| loaded into the GPU VRAM?
|
| The GPU can't do anything with weights while they are in
| VRAM. They have to be moved into the GPU itself first.
|
| So it is about memory round-trips, but not between RAM
| and VRAM. It's the round trips between the VRAM and the
| registers in the GPU die. When batch processing, the
| calculations for all batched requests can be done while
| the model parameters are in the GPU registers. If they
| were instead done sequentially, you would multiply the
| number of trips between the VRAM and the GPU by the
| number of individual inferences.
|
| Also, batched prompts and outputs are indeed
| mathematically independent from each other.
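|
| A toy torch sketch of the effect (not how any engine
| actually implements it; it only shows the weight matrix
| being read once for the whole batch instead of once per
| request):
|
|     import torch
|
|     d_model, batch = 4096, 8
|     W = torch.randn(d_model, d_model, device="cuda",
|                     dtype=torch.float16)
|     xs = [torch.randn(1, d_model, device="cuda",
|                       dtype=torch.float16)
|           for _ in range(batch)]
|
|     # sequential: W is pulled from VRAM once per request
|     ys_seq = [x @ W for x in xs]
|
|     # batched: one kernel, W is pulled (roughly) once for
|     # all requests, so arithmetic intensity rises ~batch x
|     X = torch.cat(xs, dim=0)
|     ys_bat = X @ W
|
|     assert torch.allclose(torch.cat(ys_seq), ys_bat,
|                           rtol=1e-2, atol=1e-2)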
| zozbot234 wrote:
| It depends, if the optimization is too hardware-dependent it
| might hurt/regress performance on other platforms. One would
| have to find ways to generalize and auto-tune it based on
| known features of the local hardware architecture.
| amelius wrote:
| Yes, the easiest approach is to separate it into a set of
| options. Then have a bunch of JSON/YAML files, one for
| each hardware configuration. From there, the community
| can fiddle with the settings and share new ones when new
| hardware is released.
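|
| A hypothetical sketch of the idea in Python (the file
| layout and keys are made up, not an existing llama.cpp
| feature):
|
|     import yaml
|     import torch
|
|     # e.g. "nvidia-geforce-rtx-4070.yaml"
|     name = torch.cuda.get_device_name(0)
|     profile = name.lower().replace(" ", "-") + ".yaml"
|
|     with open(f"tuning/{profile}") as f:
|         opts = yaml.safe_load(f)
|
|     kv_layout = opts.get("kv_layout", "seq_head_dim")
|     max_batch = opts.get("max_batch", 1)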
| tough wrote:
| did you see nano-vllm [1] yesterday, from a DeepSeek
| employee? 1200 LOC and faster than vanilla vLLM.
|
| 1. https://github.com/GeeeekExplorer/nano-vllm
| Gracana wrote:
| Is it faster for large models, or are the optimizations more
| noticeable with small models? Seeing that the benchmark uses
| a 0.6B model made me wonder about that.
| tough wrote:
| I have not tested it, but it's from a DeepSeek employee.
| I don't know if it's used in prod there or not!
| leeoniya wrote:
| try https://github.com/ikawrakow/ik_llama.cpp
| chickenzzzzu wrote:
| > GPU was never the bottleneck
| > it was memory layout
|
| ah right, so the GPU was the bottleneck then
| SoftTalker wrote:
| Contrasting colors. Use them!
| jasonjmcghee wrote:
| If the author stops by: the links and the comments in the code
| blocks were the ones that took extra effort to read.
|
| It might be worth trying to increase the contrast a bit.
|
| The content is really great though!
| cubefox wrote:
| The website seems to use alpha transparency for text. A grave,
| contrast-reducing sin.
| xeonmc wrote:
| It's just liquid-glass text and you'll get used to it soon
| enough.
| currency wrote:
| The author might be formatting for and editing in dark mode. I
| use edge://flags/#enable-force-dark and the links are readable.
| Yizahi wrote:
| font-weight: 300;
|
| I'm 99% sure that the author designed this website on an Apple
| Mac with so-called "font smoothing" enabled, which makes all
| regular fonts artificially "semi-bold". So to get a normal-
| looking font, Mac designers use this thinner font weight, and
| then Apple helpfully makes it look kinda "normal".
|
| https://news.ycombinator.com/item?id=23553486
| neuroelectron wrote:
| Jfc
| elashri wrote:
| Good article summarizing a good chunk of information that people
| should have some idea about. I just want to comment that the
| title is a little bit misleading, because it talks about the
| specific choices that NVIDIA makes in developing their GPU
| archs, which is not always what others do.
|
| For example, the arithmetic intensity break-even point (ridge
| point) is very different once you leave NVIDIA-land. Take the
| AMD Instinct MI300: up to 160 TFLOPS FP32 paired with ~6 TB/s of
| HBM3/3E bandwidth gives a ridge point near 27 FLOPs/byte, about
| double the A100's 13 FLOPs/byte. The larger on-package HBM
| (128-256 GB) also shifts the practical trade-offs between tiling
| depth and occupancy. Although this is very expensive and does
| not have CUDA (which can be good and bad at the same time).
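|
| The arithmetic, as a quick sanity check (MI300 numbers as
| above; for the A100 these are the commonly quoted 19.5 TFLOPS
| FP32 and ~1.56 TB/s of the 40 GB part):
|
|     def ridge_point(tflops, tb_per_s):
|         # ridge point = peak FLOP/s divided by peak bytes/s
|         return (tflops * 1e12) / (tb_per_s * 1e12)
|
|     print(ridge_point(160.0, 6.0))   # MI300-class: ~26.7 FLOPs/byte
|     print(ridge_point(19.5, 1.56))   # A100 FP32:   ~12.5 FLOPs/byte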
| apitman wrote:
| Unfortunately Nvidia GPUs are the only ones that matter until
| AMD starts taking their compute software seriously.
| fooblaster wrote:
| They are. It's just not at the consumer hardware level.
| have-a-break wrote:
| You could argue it's all the nice GPU debugging tools
| Nvidia provides which make GPU programming accessible.
|
| There are so many potential bottlenecks (normally just
| memory access patterns, but without tools to verify that
| you have to design and run manual experiments).
| tucnak wrote:
| This misconception is repeated time and time again;
| software support of their datacenter-grade hardware is just
| as bad. I've had the displeasure of using MI50, MI100 (a
| lot), MI210 (very briefly.) All three are supposedly
| enterprise-grade computing hardware, and yet, it was a
| pathetic experience with a myriad of disconnected
| components which had to be patched, & married with a very
| specific kernel version to get ANY kind of LLM inference
| going.
|
| Now, the last of it I bothered with was 9 months ago;
| enough is enough.
| fooblaster wrote:
| this hardware is ancient history. mi250 and mi300 are
| much better supported
| tucnak wrote:
| What a load of nonsense. The MI210 effectively hit the
| market in 2023, similarly to the H100. We're talking
| about a datacenter-grade card that's two years old, and
| it's already "ancient history"?
|
| No wonder nobody on this site trusts AMD.
| bluescrn wrote:
| Unless you're, you know, using GPUs for graphics...
|
| Xbox, Playstation, and Steam Deck seem to be doing pretty
| nicely with AMD.
| tucnak wrote:
| Unfortunately, GPUs are old news now. When it comes to
| perf/watt/dollar, TPUs are substantially ahead for both
| training and inference. There's a sparsity disadvantage with
| trailing-edge TPU devices such as the v4, but if you care
| about large-scale training of any sort, it's not even close.
| Additionally, Tenstorrent p300 devices are hitting the market
| soon enough, and there's lots of promising stuff coming on
| the Xilinx side of the AMD shop: the recent Versal chips
| allow for AI compute-in-network capabilities that put NVIDIA
| Bluefield's supposed programmability to shame. NVIDIA likes
| to say Bluefield is like a next-generation SmartNIC, but
| compared to the actually field-programmable Versal stuff,
| it's more like a 100BASE-T card from the 90s.
|
| I think it's very naive to assume that GPUs will continue to
| dominate the AI landscape.
| menaerus wrote:
| So, where does one buy a TPU?
| tucnak wrote:
| The actual lead times on similarly-capable GPU systems
| are so long, by the time your order is executed, you're
| already losing money. Even assuming perfect utilization,
| and perfect after-market conditions--you won't be making
| any money on the hardware anyway.
|
| The buy vs. rent calculus is only viable if there's no
| asymmetry between the two. Oftentimes, what you can rent
| you cannot buy, and vice versa, what you can buy you
| could never rent. Even if you _could_ buy an actual TPU,
| you wouldn't be able to run it anyway, as it's all built
| around sophisticated networking and switching
| topologies[1]. The same goes for GPU deployments of
| comparable scale: what made you think that you could buy
| and run GPUs at scale?
|
| It's a fantasy.
|
| [1] https://arxiv.org/abs/2304.01433
| almostgotcaught wrote:
| Is your answer to "where can I buy a TPU" that you can't
| buy a GPU either? That's a new one.
|
| First of all I don't understand how that's an answer.
| Second of all it's laughably wrong - I can name 5 firms
| (outside of FAANG) off the top of my head with >1k
| Blackwell devices and they're making very good money
| (have you ever heard of quantfi....). Third of all, how
| is TPU going to conquer absolutely anything when (as you
| admit) you couldn't run one even if you could buy one?
| tucnak wrote:
| I never claimed that "TPU is going to conquer
| everything"; it's a matter of fact that the latest-
| generation TPU is currently the most cost-effective
| solution for large-scale training. I'm not even saying
| that NVIDIA has lost, just that GPUs have lost. Maybe
| NVIDIA comes up with a non-GPU-based system, and it
| includes programmable fabric to enable compute-in-network
| capabilities, sure, anything other than the Bluefield
| nonsense, but it's already clear from an engineering
| standpoint that the formula of large HBM stacks attached
| to a "GPU" plus Bluefield is over.
| almostgotcaught wrote:
| > NVIDIA has lost, just that GPUs have lost
|
| i hope you realize how silly you sound when
|
| 1. NVDA's market cap is 70% more than GOOG's
|
| 2. there is literally not a single other viable
| competitor to GPGPU amongst the 30 or so "accelerator"
| companies that all swear their thing will definitely be
| the one, even with many of them approaching 10 years in
| the market by now (cerebras, samba nova, groq, dmatrix,
| blah blah blah).
| menaerus wrote:
| Right. Your argument doesn't really follow. Since I
| cannot buy a TPU, which you agree with, the only really
| viable option is a GPU, which I _can_ buy.
|
| So, according to that, GPUs aren't really going anywhere
| unless there's a new player in town who will compete
| with Nvidia and sell at lower prices.
| almostgotcaught wrote:
| > Unfortunately, GPUs are old news now
|
| ...
|
| > the recent Versal chips allow for AI compute-in-network
| capabilities that puts NVIDIA Bluefield's supposed
| programmability to shame
|
| I'm always just like... who are you people. Like what is
| the profile of a person that just goes around proclaiming
| wild things as if they're completely established. And I see
| this kind of comment on hn very frequently. Like you either
| work for Tenstorrent or you're an influencer or a zdnet
| presenter or just ... because none of this is even
| remotely true.
|
| Reminds me of
|
| "My father would womanize; he would drink. He would make
| outrageous claims like he invented the question mark.
| Sometimes, he would accuse chestnuts of being lazy."
|
| > I think it's very naive to assume that GPUs will
| continue to dominate the AI landscape
|
| I'm just curious - how much of your portfolio is AMD and
| how much is NVDA and how much is GOOG?
| timeinput wrote:
| Listen, I'm ~~not~~ all in on Ferrero Rocher, and
| chestnuts *are* lazy. Nowhere near as productive as
| hazelnuts.
| tucnak wrote:
| > I'm just curious - how much of your portfolio is AMD
|
| I'm always just like... who are you people: financiers,
| or hackers? :-) I don't work for TT, but I am a founder
| in the vertical AI space. Firstly, every major player is
| making AI accelerators of their own now, and guess what,
| most state-of-the-art designs have very little in common
| with a GPGPU design of yesteryear. We have thoroughly
| evaluated various options, including buying/renting
| NVIDIA hardware; unfortunately, it didn't make any
| sense, neither in terms of cost nor capability. Buying
| (and waiting _months_ for) NVIDIA rack-fuls is the
| quickest way to bankrupt your business with CAPEX.
| Renting the same hardware is merely moving the disease
| to OPEX, and in the post-ZIRP era this is equally
| devastating.
|
| No matter how much HBM memory you get for whatever
| individual device, no matter the packaging--it's never
| going to be enough. The weights alone are quickly dwarfed
| by K/V cache pages anyway. This is doubly true, if you're
| executing highly-concurrent agents that share a lot of
| the context, or doing dataset-scale inference
| transformations. The only thing that matters, truly, is
| the ability to scale-out, meaning fabrics, RDMA over
| fabrics. Even the leading-edge GPU systems aren't really
| good at it, because none of the interconnect is actually
| programmable.
|
| The current generation of TT cards (7nm) has four 800G
| NICs per card, and the actual Blackhole chips[1] support
| up to 12x400G. You can approach TT and they will license
| you the IP, and you get to integrate it at whatever scale
| you please (good luck even getting in a room with Arm
| people!), and because TT's whole stack is open source,
| you get to "punch in" whatever topology you want[2]. In
| other words, at least with TT you would _get a chance_ to
| scale out without bankrupting your business.
|
| The compute hierarchy is fresh and in line with the
| latest research, their toolchain is as hackable as it
| gets, and it stands head and shoulders above anything
| that AMD or Intel has ever released. Most importantly,
| because TT
| is currently under-valued, it presents an outstanding
| opportunity for businesses like ours in navigating around
| the established cost-centers. For example, TT still
| offers "Galaxy" deployments which used to contain 32
| previous-generation (Wormhole) devices in a 6U air-cooled
| chassis. It's not a stretch that a similar setup,
| composed of 32 liquid-cooled Blackholes (2 TB GDDR6, 100
| Tbps interconnect) would fit in a 4U chassis. AFAIK,
| there's no GPU deployment in the world at that density.
| Similarly to TPU design, it's also infinitely scalable by
| means of 3+D twisted torus topologies.
|
| What's currently missing in the TT ecosystem: (1) the
| "superchip" package including state of the art CPU cores,
| like TT-Ascalon, that they would also happily license to
| you, and perhaps more importantly, (2) compute-in-network
| capability, so that the stupidly-massive TT interconnect
| bandwidth could be exploited/informed by applications.
|
| Firstly, the Grendel superchip is expected to hit the
| market by the end of next year.
|
| Secondly, because the interconnect is not some
| proprietary bullshit from Mellanox, you get to introduce
| programmable-logic NICs into the topology, and maybe
| even avoid IP encapsulation altogether! There are many
| reasons to do so, and indeed, Versal FPGAs have lots to
| offer in terms of hard IP in addition to PL: K/V cache
| management with offloading to NVMe-oF clusters, prefix
| matching, reshaping, quantization, compression, and all
| the other terribly parallel tasks which are basically
| intractable for anything other than FPGAs.
|
| Today, if we wanted to do a large-scale training run, we
| would simply go for the most cost-effective option
| available at scale, which is renting TPU v6 from Google.
| This is a temporary measure, if anything, because
| compute-in-network in AI deployments is still a novelty,
| and nobody can really do it at sufficiently-large scale
| yet. Thankfully, Xilinx is getting there[3]. AWS offers
| F1 instances; it does offer NVMe-accelerated ones, as
| well as AI accelerators, but there's a good reason
| they're unable to offer all three at the same time.
|
| [1] https://riscv.epcc.ed.ac.uk/assets/files/hpcasia25/Te
| nstorre...
|
| [2] https://github.com/tenstorrent/tt-
| metal/blob/main/tech_repor...
|
| [3] https://www.amd.com/en/products/accelerators/alveo/v8
| 0.html
| almostgotcaught wrote:
| > I don't work for TT, but I am a founder in the vertical
| AI space
|
| yes so this perfectly answers my question: you're a
| salesman "talking their book" when making claims like
| "GPU is old news". it makes perfect sense and is
| basically what i expected on hn.
|
| i'm not really gonna try to respond to anything else in
| your marketing-brochure sized post, which is replete with
| buzzwords and links to more material that isn't actually
| yours but sure looks like it lends your post credibility
| ;)
| tucnak wrote:
| Your obsession with finance/marketing is exactly what I
| expect to see on HN.
|
| It's a shame your accusations have zero merit. In the
| future, please try not to embarrass yourself by
| attempting to get into a technical discussion and then
| promptly backing out of it without having made a single
| technical argument in the process. Good luck on the
| stock market.
|
| See https://news.ycombinator.com/newsguidelines.html
| almostgotcaught wrote:
| > Your obsession with finance/marketing is exactly what I
| expect to see on HN.
|
| homie i work on AI infra at one of these companies that
| you're so casually citing in all of _your_ marketing
| content here. you're not simply wrong on the things you
| claim - you're not _even_ wrong. you literally don't
| know what you're talking about because you're citing
| external-facing docs/code/whatever.
|
| > attempting to get into a technical discussion and then
| promptly backing out of it without having made a single
| technical argument in the process
|
| there's no technical discussion to be had with someone
| that cites _other people's work_ as proof for their own
| claims.
| eapriv wrote:
| Spoiler: it's not about how GPUs work, it's about how to use them
| for machine learning computations.
| oivey wrote:
| It's a pretty standard rundown of CUDA. Nothing to do with ML
| other than using relu in an example and mentioning torch.
| neuroelectron wrote:
| ASCII diagrams, really?
| LarsDu88 wrote:
| Maybe this should be titled "Basic Facts about Nvidia GPUs", as
| the warp terminology is a feature of modern Nvidia GPUs.
|
| Again, I emphasize "modern".
|
| An NVIDIA GPU from circa 2003 is completely different and has
| baked-in circuitry specific to the rendering pipelines used for
| videogames at that time.
|
| So most of this post is not quite general to all "GPUs", which
| are a much broader category of devices that don't necessarily
| support the type of general-purpose computation we use modern
| Nvidia GPUs for.
___________________________________________________________________
(page generated 2025-06-24 23:00 UTC)