[HN Gopher] Big GPUs don't need big PCs
___________________________________________________________________
Big GPUs don't need big PCs
Author : mikece
Score : 270 points
Date   : 2025-12-20 17:49 UTC (1 day ago)
(HTM) web link (www.jeffgeerling.com)
(TXT) w3m dump (www.jeffgeerling.com)
| jonahbenton wrote:
| So glad someone did this. Have been running big GPUs in eGPU
| enclosures connected to spare laptops and thinking, why not Pis?
| 3eb7988a1663 wrote:
| Datapoints like this really make me reconsider my daily driver. I
| should be running one of those $300 mini PCs at <20W. With ~flat
| CPU performance gains, would be fine for the next 10 years. Just
| remote into my beefy workstation when I actually need to do real
| work. Browsing the web, watching videos, even playing some games
| is easily within their wheelhouse.
| ekropotin wrote:
| As an experiment, I decided to try using a Proxmox VM with an
| eGPU and the USB bus passed through to it as my main PC for
| browsing and working on hobby projects.
|
| It's just 1 vCPU with 4 GB of RAM, and you know what? It's more
| than enough for these needs. I think hardware manufacturers
| falsely convinced us that every professional needs a beefy
| laptop to be productive.
| reactordev wrote:
| I went with a beelink for this purpose. Works great.
|
| Keeps the desk nice and tidy while "the beasts" roar in a
| soundproofed closet.
| samuelknight wrote:
| Switching from my 8-core Ryzen mini PC to an 8-core Ryzen
| desktop makes my unit tests run way faster. TDP limits can tip
| you off to very different performance envelopes in otherwise
| similar-spec CPUs.
| loeg wrote:
| Even if you could cool the full TDP in a micro PC, in a full
| size desktop you might be able to use a massive AIO radiator
| with fans running at very slow, very quiet speeds instead of
| jet turbine howl in the micro case. The quiet and ease of
| working in a bigger space are mostly a good tradeoff for a
| slightly larger form factor under a desk.
| adrian_b wrote:
| A full-size desktop computer will always be much faster for
| any workload that fully utilizes the CPU.
|
| However, a full-size desktop computer seldom makes sense as a
| _personal_ computer, i.e. as the computer that interfaces to
| a human via display, keyboard and graphic pointer.
|
| For most of the activities done directly by a human, i.e.
| reading & editing documents, browsing the Internet, watching
| movies and so on, a mini-PC is powerful enough. The only
| exception is playing games designed for big GPUs, but there
| are many computer users who are not gamers.
|
| In most cases the optimal setup is to use a mini-PC as your
| personal computer and a full-size desktop as a server on
| which you can launch any time-consuming tasks, e.g.
| compilation of big software projects, EDA/CAD simulations,
| testing suites etc.
|
| The desktop used as server can use Wake-on-LAN to stay
| powered off when not needed and wake up whenever it must run
| some task remotely.
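|
| For illustration, the wake-up can be scripted with a minimal
| Wake-on-LAN magic packet sender in Python (the MAC address is
| a placeholder for your desktop's NIC):
|
|     import socket
|
|     def wake(mac: str, broadcast: str = "255.255.255.255") -> None:
|         # Magic packet: 6 x 0xFF, then the target MAC repeated
|         # 16 times, sent as a UDP broadcast (port 9 by convention).
|         payload = bytes.fromhex(mac.replace(":", ""))
|         packet = b"\xff" * 6 + payload * 16
|         with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
|             s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
|             s.sendto(packet, (broadcast, 9))
|
|     wake("aa:bb:cc:dd:ee:ff")  # placeholder MAC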
| whatevaa wrote:
| Not everything supports remoting well. For example, many
| IDEs - unless you run RDP, with the whole graphical session
| on the remote machine.
|
| Also, having to buy two computers costs money. It
| makes sense to use one for both use cases if you have to buy
| the desktop anyway.
| jasonwatkinspdx wrote:
| For just basic windows desktop stuff, a $200 NUC has been good
| enough for like 15 years now.
| themafia wrote:
| > I should be running one of those $300 mini PCs at <20W.
|
| Yes. They're basically laptop chips at this point. The thermals
| are worse but the chips are perfectly modern and can handle
| reasonably large workloads. I've got an 8-core Ryzen 7 with
| Radeon 780M graphics and 96GB of DDR5. Outside of AAA gaming
| this thing is absolutely fine.
|
| The power draw is a huge win for me. It's like 6W at idle. I
| live remotely so grid power is somewhat unreliable and saving
| watts when using solar batteries extends their lifetime
| massively. I'm thrilled with them.
| PunchyHamster wrote:
| Slapping $300 worth of solar panels on your roof/balcony will
| probably get you ahead on power usage
| nottorp wrote:
| That's why I use an M2 (not even Pro) Mac Mini as a terminal and
| remote into other boxes when needed.
| ivanjermakov wrote:
| Another benefit is low noise. Many consider fan noise under
| load to be the most important property of a workstation.
| yjftsjthsd-h wrote:
| I've been kicking this around in my head for a while. If I want
| to run LLMs locally, a decent GPU is really the only important
| thing. At that point, the question becomes, roughly, what is the
| cheapest computer to tack on the side of the GPU? Of course, that
| assumes that everything does in fact work; unlike OP, I am barely
| in a position to _understand_ e.g. BAR problems, let alone try to
| fix them, so what I actually did was build a cheap-ish x86 box
| with a half-decent GPU and called it a day:) But it still is
| stuck in my brain: there must be a more efficient way to do this,
| especially if all you need is just enough computer to shuffle
| data to and from the GPU and serve that over a network
| connection.
| zeusk wrote:
| Get the DGX Spark computers? They're exactly what you're trying
| to build.
| Gracana wrote:
| They're very slow.
| geerlingguy wrote:
| They're okay, generally, but slow for the price. You're
| paying more for the ConnectX-7 networking than for
| inference performance.
| Gracana wrote:
| Yeah, I wouldn't complain if one dropped in my lap, but
| they're not at the top of my list for inference hardware.
|
| Although... Is it possible to pair a fast GPU with one?
| Right now my inference setup for large MoE LLMs has
| shared experts in system memory, with KV cache and dense
| parts on a GPU, and a Spark would do a better job of
| handling the experts than my PC, if only it could talk to
| a fast GPU.
|
| [edit] Oof, I forgot these have only 128GB of RAM. I take
| it all back, I still don't find them compelling.
| tcdent wrote:
| We're not yet to the point where a single PCIe device will get
| you anything meaningful; IMO 128 GB of RAM available to the GPU
| is essential.
|
| So while you don't need a ton of compute on the CPU, you do need
| the ability to address multiple PCIe lanes. A relatively low-spec
| AMD EPYC processor is fine if the motherboard exposes enough
| lanes.
| skhameneh wrote:
| There is plenty that can run within 32/64/96 GB of VRAM. IMO
| models like Phi-4 are underrated for many simple tasks. Some
| quantized Gemma 3 models are quite good as well.
|
| There are larger/better models as well, but those tend to
| really push the limits of 96 GB.
|
| FWIW when you start pushing into 128 GB+, the ~500 GB models
| really start to become attractive, because at that point
| you're probably wanting just a bit more out of everything.
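|
| A rough fit check, for what it's worth (the 20% overhead for
| KV cache and activations is just a guess):
|
|     def fits_in_vram(params_b: float, bits_per_weight: int,
|                      vram_gb: float, overhead: float = 1.2) -> bool:
|         # Weights: 1B params at 8-bit is ~1 GB; scale by quantization.
|         weight_gb = params_b * bits_per_weight / 8
|         return weight_gb * overhead <= vram_gb
|
|     print(fits_in_vram(27, 4, 24))   # Gemma 3 27B @ 4-bit, 24 GB: True
|     print(fits_in_vram(120, 4, 96))  # ~120B @ 4-bit, 96 GB: True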
| tcdent wrote:
| IDK, all of my personal and professional projects involve
| pushing the SOTA to the absolute limit. Using anything
| other than the latest OpenAI or Anthropic model is out of
| the question.
|
| Smaller open source models are a bit like 3d printing in
| the early days; fun to experiment with but really not that
| valuable for anything other than making toys.
|
| Text summarization, maybe? But even then I want a model
| that understands the complete context and does a good job.
| Even things like "generate one sentence about the action
| we're performing" I usually find I can just incorporate it
| into the output schema of a larger request instead of
| making a separate request to a smaller model.
| xyzzy123 wrote:
| It seems to me like the use case for local GPUs is almost
| entirely privacy.
|
| If you buy a 15k AUD RTX 6000 with 96GB, that card will
| _never_ pay for itself on a gpt-oss:120b workload vs just
| using OpenRouter - no matter how many tokens you push
| through it - because the cost of residential power in
| Australia means you cannot generate tokens cheaper than
| the cloud, even if the card were free.
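|
| Back-of-envelope, with every number an assumption (card draw,
| decode speed, tariff):
|
|     watts = 450         # assumed full-load draw of the card
|     tok_per_s = 60      # assumed gpt-oss:120b decode speed
|     aud_per_kwh = 0.35  # typical AU residential tariff
|
|     joules_per_tok = watts / tok_per_s
|     kwh_per_mtok = joules_per_tok * 1e6 / 3.6e6
|     print(f"~A${kwh_per_mtok * aud_per_kwh:.2f}/Mtok in power alone")
|     # ~A$0.73/Mtok before amortizing the card; if hosted output
|     # tokens cost less than that, the card can never catch up.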
| joefourier wrote:
| There are a few more considerations:
|
| - You can use the GPU for training and run your own fine
| tuned models
|
| - You can have much higher generation speeds
|
| - You can sell the GPU on the used market in ~2 years
| time for a significant portion of its value
|
| - You can run other types of models like image, audio or
| video generation that are not available via an API, or
| cost significantly more
|
| - Psychologically, you don't feel like you have to
| constrain your token spending and you can, for instance,
| just leave an agent to run for hours or overnight without
| feeling bad that you just "wasted" $20
|
| - You won't be running the GPU at max power constantly
| 15155 wrote:
| Or censorship avoidance
| girvo wrote:
| > because the cost of residential power in Australia
|
| This _so_ doesn't really matter to your overall point,
| which I agree with, but:
|
| The rise of rooftop solar and home battery energy storage
| flips this a bit now in Australia, IMO. At least where I
| live, every house has a solar panel on it.
|
| Not worth it _just_ for local LLM usage, but an
| interesting change to energy economics IMO!
| popalchemist wrote:
| This is simply not true. Your heuristic is broken.
|
| The recent Gemma 3 models, which are produced by Google
| (a little startup - heard of em?) outperform the last
| several OpenAI releases.
|
| Closed does not necessarily mean better. Plus the local
| ones can be finetuned to whatever use case you may have,
| won't have any inputs blocked by censorship
| functionality, and you can optimize them by distilling to
| whatever spec you need.
|
| Anyway all that is extraneous detail - the important
| thing is to decouple "open" and "small" from "worse" in
| your mind. The most recent Gemma 3 model specifically is
| incredible, and it makes sense, given that Google has
| access to many times more data than OpenAI for training
| (something like a factor of 10 at least). Which is of
| course a very straightforward idea to wrap your head
| around, Google was scraping the internet for decades
| before OpenAI even entered the scene.
|
| So just because their Gemma model is released in an open-
| source (open weights) way, doesn't mean it should be
| discounted. There's no magic voodoo happening behind the
| scenes at OpenAI or Anthropic; the models are essentially
| of the same type. But Google releases theirs to undercut
| the profitability of their competitors.
| tcdent wrote:
| This one?
| https://artificialanalysis.ai/models/gemma-3-27b
| p1necone wrote:
| I'm holding out for someone to ship a GPU with DIMM slots on
| it.
| kristianp wrote:
| A single CAMM might suit better.
| tymscar wrote:
| DDR5 is a couple of orders of magnitude slower than really
| good VRAM. That's one big reason.
| dawnerd wrote:
| But it would still be faster than splitting the model up
| on a cluster, right? I've also wondered why they haven't
| just shipped GPUs like CPUs.
| cogman10 wrote:
| Man I'd love to have a GPU socket. But it'd be pretty
| hard to get a standard going that everyone would support.
| Look at sockets for CPUs; we barely had crossover for
| like 2 generations.
|
| But boy, a standard GPU socket so you could easily BYO
| cooler would be nice.
| estimator7292 wrote:
| The problem isn't the sockets. It costs a _lot_ to spec
| and build new sockets; we wouldn't swap them for no
| reason.
|
| The problem is that the signals and features that the
| motherboard and CPU expect are different between
| generations. We use different sockets on different
| generations to prevent you plugging in incompatible CPUs.
|
| We used to have cross-generational sockets in the 386 era
| because the hardware supported it. Motherboards weren't
| changing so you could just upgrade the CPU. But then the
| CPUs needed different voltages than before for
| performance. So we needed a new socket to not blow up
| your CPU with the wrong voltage.
|
| That's where we are today. Each generation of CPU wants
| different voltages, power, signals, a specific chipset,
| etc. Within the same +-1 generation you can swap CPUs
| because they're electrically compatible.
|
| To have universal CPU sockets, we'd need a universal
| electrical interface standard, which is too much of a
| moving target.
|
| AMD would probably love to never have to tool up a new
| CPU socket. They don't make money on the motherboard you
| have to buy. But the old motherboards just can't support
| new CPUs. Thus, new socket.
| cogman10 wrote:
| For AI, really good isn't really a requirement. If a
| middle ground memory module could be made, then it'd be
| pretty appealing.
| zrm wrote:
| DDR5 is ~8GT/s, GDDR6 is ~16GT/s, GDDR7 is ~32GT/s. It's
| faster but the difference isn't crazy and if the premise
| was to have a lot of slots then you could also have a lot
| of channels. 16 channels of DDR5-8200 would have slightly
| more memory bandwidth than RTX 4090.
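|
| The arithmetic, using published specs (the 16-channel
| DDR5-8200 box is the hypothetical):
|
|     def gb_per_s(gt_per_s: float, bus_bits: int) -> float:
|         # Peak bandwidth = transfer rate x bus width in bytes
|         return gt_per_s * bus_bits / 8
|
|     print(gb_per_s(8.2, 16 * 64))  # hypothetical 16ch DDR5-8200: ~1050
|     print(gb_per_s(21.0, 384))     # RTX 4090 (21 GT/s GDDR6X): 1008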
| tymscar wrote:
| Yeah, so DDR5 is 8 GT/s and GDDR7 is 32 GT/s, and bus width is
| 64 vs 384 bits. That's 4x the rate times 6x the width, which
| already makes the VRAM 24 times faster.
|
| You can add more channels, sure, but each channel makes
| it less and less likely for you to boot. Look at modern
| AM5 struggling to boot at over 6000 with more than two
| sticks.
|
| So you'd have to get an insane six channels to match the
| bus width, at which point your only choice to be stable
| would be to lower the speed so much that you're back to
| the same orders of magnitude difference, really.
|
| Now we could instead solder that RAM, move it closer to
| the GPU and cross-link channels to reduce noise. We could
| also increase the speed and oh, we just invented
| soldered-on GDDR...
| zrm wrote:
| > Bus width is 64 vs 384.
|
| The bus width is the number of channels. They don't call
| them channels when they're soldered but 384 is already
| the equivalent of 6. The premise is that you would have
| more. Dual socket Epyc systems already have 24 channels
| (12 channels per socket). It costs money but so does
| 256GB of GDDR.
|
| > Look at modern AM5 struggling to boot at over 6000 with
| more than two sticks.
|
| The relevant number for this is the number of sticks per
| channel. With 16 channels and 64GB sticks you could have
| 1TB of RAM with only one stick per channel. Use CAMM2
| instead of DIMMs and you get the same speed and capacity
| from 8 slots.
| anon25783 wrote:
| Would that be worth anything, though? What about the
| overhead of clock cycles needed for loading from and
| storing to RAM? Might not amount to a net benefit for
| performance, and it could also potentially complicate heat
| management I bet.
| dist-epoch wrote:
| This problem was already solved 10 years ago - crypto mining
| motherboards, which have a large number of PCIe slots, a CPU
| socket, one memory slot, and not much else.
|
| > Asus made a crypto-mining motherboard that supports up to 20
| GPUs
|
| https://www.theverge.com/2018/5/30/17408610/asus-crypto-mini...
|
| For LLMs you'll probably want a different setup, with some
| more memory and some M.2 storage.
| jsheard wrote:
| Those only gave each GPU a single PCIe lane though, since
| crypto mining barely needed to move any data around. If your
| application doesn't fit that mould then you'll need a much,
| much more expensive platform.
| dist-epoch wrote:
| After you load the weights into the GPU and keep the KV
| cache there too, you don't need any other significant
| traffic.
| numpad0 wrote:
| Even in tensor parallel modes? I thought it could only
| work if you're fine stalling all but one of the n GPUs
| unless there are n users/tasks at any given moment.
| skhameneh wrote:
| In theory, it's only sufficient for pipeline parallel due to
| limited lanes and interconnect bandwidth.
|
| Generally, scalability on consumer GPUs falls off between 4-8
| GPUs for most workloads. Those running more GPUs are typically
| using a higher quantity of smaller GPUs for cost effectiveness.
| zozbot234 wrote:
| M.2 is mostly just a different form factor for PCIe anyway.
| seanmcdirmid wrote:
| And you don't want to go the M4 Max/M3 Ultra route? It works
| well enough for most mid sized LLMs.
| binsquare wrote:
| I run a crowd sourced website to collect data on the best and
| cheapest hardware setup for local LLM here:
| https://inferbench.com/
|
| Source code: https://github.com/BinSquare/inferbench
| nodja wrote:
| Cool site, I noticed the 3090 is on there twice.
|
| https://inferbench.com/gpu/NVIDIA%20GeForce%20RTX%203090
|
| https://inferbench.com/gpu/NVIDIA%20RTX%203090
| binsquare wrote:
| Oh nice catch, I'll fix that
|
| ---
|
| Edit: Fixed
| kilpikaarna wrote:
| Nice! Though for older hardware it would be nice if the price
| reflected the current second hand market (harder to get data
| for, I know). E.g. the Nvidia RTX 3070 ranks as the second-best
| GPU in tok/s/$ even at its MSRP of $499, but you can get one for
| half that now.
| binsquare wrote:
| Great idea - I've added it by manually browsing ebay for
| that initial data.
|
| So it's just a static value in this hardware list: https://
| github.com/BinSquare/inferbench/blob/main/src/lib/ha...
|
| Let me know if you know of a better way, or contribute :D
| jsight wrote:
| It seems like verification might need to be improved a bit? I
| looked at Mistral-Large-123B. Someone is claiming 12
| tokens/sec on a single RTX 3090 at FP16.
|
| Perhaps some filter could cut out submissions that don't
| really make sense?
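|
| A sketch of such a filter: for dense models, batch-1 decode is
| roughly bounded by VRAM bandwidth over model size, so claims
| above that ceiling can be flagged (numbers from public specs):
|
|     def plausible(claimed_tok_s: float, model_gb: float,
|                   bw_gb_s: float, slack: float = 1.1) -> bool:
|         # Decode must stream the weights once per token, so
|         # tok/s <= bandwidth / model size (dense models; MoE
|         # only streams the active experts).
|         return claimed_tok_s <= (bw_gb_s / model_gb) * slack
|
|     # Mistral-Large 123B at FP16 is ~246 GB; a 3090 has ~936 GB/s
|     # (and only 24 GB of VRAM to begin with).
|     print(plausible(12, 246, 936))  # False: ceiling ~3.8 tok/s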
| Eisenstein wrote:
| There is a whole section in here on how to spec out a cheap rig
| and what to look for:
|
| * https://jabberjabberjabber.github.io/Local-AI-Guide/
| Wowfunhappy wrote:
| I really would have liked to see gaming performance, although I
| realize it might be difficult to find an AAA game that supports
| ARM. (Forcing the Pi to emulate x86 with FEX doesn't seem
| entirely fair.)
| 3eb7988a1663 wrote:
| You might have to thread the needle to find a game which does
| not bottleneck on the CPU.
| kristjansson wrote:
| Really, why have the PCIe/CPU artifice at all? Apple and Nvidia
| have the right idea: put the MPP on the same die/package as the
| CPU.
| bigyabai wrote:
| > put the MPP on the same die/package as the CPU.
|
| That would help in latency-constrained workloads, but I don't
| think it would make much of a difference for AI or most HPC
| applications.
| PunchyHamster wrote:
| We need low-power but high-PCIe-lane-count CPUs for that, just
| purely for shoving models from NVMe to the GPU.
| lostmsu wrote:
| Now compare batched training performance. Or batched inference.
|
| Of course prefill is going to be GPU bound. You only send a few
| thousand bytes to it, and don't really ask it to return much. But
| after prefill is done, unless you use batched mode, you aren't
| really using your GPU for anything more than its VRAM bandwidth.
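|
| Rough napkin math on that, all numbers ballpark (a ~20 GB
| quantized model on a 3090-class card):
|
|     model_bytes = 20e9   # quantized weights resident in VRAM
|     params = 20e9        # assumed active parameter count
|     bw = 936e9           # 3090-class VRAM bandwidth, bytes/s
|     flops = 71e12        # 3090-class dense fp16 tensor throughput
|
|     bandwidth_bound = bw / model_bytes      # ~47 tok/s at batch 1
|     compute_bound = flops / (2 * params)    # ~1775 tok/s if batched
|     print(bandwidth_bound, compute_bound)
|     # The gap is what batching recovers: weights stream from VRAM
|     # once per step and get reused across every sequence in the batch.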
| numpad0 wrote:
| Not sure what was unexpected about the multi-GPU part.
|
| It's very well known that most LLM frameworks, including
| llama.cpp, split models by layers, which have a sequential
| dependency, so multi-GPU setups are completely stalled unless
| there are n_gpu users/tasks running in parallel. It's also known
| that some GPUs are faster at "prompt processing" and some at
| "token generation", so combining Radeon and NVIDIA sometimes
| helps. Reportedly the inter-layer transfer sizes are in kilobyte
| ranges and PCIe x1 is plenty, or something.
|
| It takes appropriate backends with "tensor parallel" mode
| support, which split the neural network parallel to the
| direction of data flow, and which obviously benefit
| substantially from a good interconnect between GPUs like PCIe
| x16 or NVLink/Infinity Fabric bridge cables, and/or inter-GPU DMA
| over PCIe (called GPU P2P or GPUDirect or some such lingo).
|
| Absent those, I've read somewhere that people can sometimes see
| GPU utilization spikes walking over GPUs on nvtop-style tools.
|
| Looking for a way to break up tasks for LLMs so that there will
| be multiple tasks to run concurrently would be interesting, maybe
| like creating one "manager" and a few "delegated engineers"
| personalities. Or simulating multiple different domains of the
| brain, such as the speech center, visual cortex, language center,
| etc., communicating in tokens, might be an interesting way of
| working around this problem.
| zozbot234 wrote:
| > Looking for a way to break up tasks for LLMs so that there
| will be multiple tasks to run concurrently would be
| interesting, maybe like creating one "manager" and few
| "delegated engineers" personalities.
|
| This is pretty much what "agents" are for. The manager model
| constructs prompts and contexts that the delegated models can
| work on in parallel, returning results when they're done.
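|
| A minimal sketch of that fan-out against an OpenAI-compatible
| local server (endpoint and model name are placeholders):
|
|     from concurrent.futures import ThreadPoolExecutor
|     import json, urllib.request
|
|     def ask(prompt: str) -> str:
|         body = json.dumps({"model": "local-model", "messages":
|                            [{"role": "user", "content": prompt}]})
|         req = urllib.request.Request(
|             "http://localhost:8080/v1/chat/completions",
|             data=body.encode(),
|             headers={"Content-Type": "application/json"})
|         with urllib.request.urlopen(req) as resp:
|             return json.load(resp)["choices"][0]["message"]["content"]
|
|     tasks = ["summarize module A", "summarize module B"]
|     with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
|         # Parallel requests keep every pipeline stage busy
|         results = list(pool.map(ask, tasks))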
| nodja wrote:
| > Reportedly the inter-layer transfer sizes are in kilobyte
| ranges and PCIe x1 is plenty or something.
|
| Not an expert, but napkin math tells me that more often than
| not this will be on the order of megabytes--not kilobytes--
| since it scales with sequence length.
|
| Example: Qwen3 30B has a hidden state size of 5120; even if
| quantized to 8 bits, that's 5120 bytes per token. It would pass
| the MB boundary with just a little over 200 tokens. Still not
| much of an issue when a single PCIe lane is ~2GB/s.
|
| I think device to device latency is more of an issue here, but
| I don't know enough to assert that with confidence.
| remexre wrote:
| For each token generated, you only send one token's worth
| between layers; the previous tokens are in the KV cache.
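|
| Putting both points into numbers (hidden size from the comment
| above, fp16 activations assumed):
|
|     hidden, bytes_per = 5120, 2
|
|     def transfer_mb(tokens: int) -> float:
|         # Activation bytes crossing one pipeline split per pass
|         return tokens * hidden * bytes_per / 1e6
|
|     print(transfer_mb(2048))  # prefill of a 2k prompt: ~21 MB
|     print(transfer_mb(1))     # each decode step: ~0.01 MB
|     # Even over PCIe 3.0 x1 (~1 GB/s) the prefill hop is ~20 ms,
|     # and decode steps are microseconds; latency, not bandwidth,
|     # is the bottleneck.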
| syntaxing wrote:
| There are some technical implementations that make it more
| efficient, like EXO [1]. Jeff Geerling recently did a review of
| a 4x Mac Studio cluster with RDMA support, and you can see that
| EXO has a noticeable advantage [2].
|
| [1] https://github.com/exo-explore/exo [2]
| https://www.youtube.com/watch?v=x4_RsUxRjKU
| sgt wrote:
| At this point I'd consider a cluster of top-specced Mac
| Studios to be worthwhile in production. I just need to host
| them properly in a rack in a co-lo data center.
| syntaxing wrote:
| Honestly, I genuinely can see the value if you want to host
| something internally for sensitive and important
| information. I really hope the M5 ultra with matmul
| accelerators will knock this out of the park. With the way
| RAM is trending, a Mac Studio cluster will become more
| enticing.
| scotty79 wrote:
| > Not sure what was unexpected about the multi GPU part. It's
| very well known that most LLM frameworks including llama.cpp
| splits models by layers, which has sequential dependency, and
| so multi GPU setups are completely stalled
|
| Oh, I thought the point of transformers was being able to split
| the load vertically to avoid sequential dependencies. Is it true
| just for training, or not at all?
| sailingparrot wrote:
| Just for training and processing the existing context (prefill
| phase). But when doing inference, token t has to be sampled
| before t+1 can be, so it's still sequential.
| kgeist wrote:
| What about constrained decoding (with JSON schemas)? I noticed my
| vLLM instance is using 1 CPU 100%.
| jauntywundrkind wrote:
| PCIe 3.0 is the nice, easy, convenient generation where 1 lane =
| 1 GB/s. Given the overhead, that's pretty close to 10Gb Ethernet
| speeds (lower latency though).
|
| I do wonder how long the cards are going to need host systems at
| all. We've already seen GPUs with M.2 SSDs attached! The Radeon
| Pro SSG hails back to 2016! You still need a way to get the model
| onto it in the first place and to get work in and out, but 1GbE
| and a small RISC-V chip (which Nvidia already uses for management
| cores) could suffice. Maybe even an RPi on the card.
| https://www.techpowerup.com/224434/amd-announces-the-radeon-...
|
| Given the gobs of memory these cards have, they probably don't
| even need storage; they just need big pipes. Intel had 100GbE on
| their Xeon & Xeon Phi chips (10x what we saw here!) in _2016_!
| GPUs that just plug into the switch and talk across 400GbE or
| UltraEthernet or switched CXL, and run semi-independently, feel
| so sensible, so not outlandish.
| https://www.servethehome.com/next-generation-interconnect-in...
|
| It's far off for now, but flash makers are also looking at
| radically many channel flash, which can provide absurdly high
| GB/s: High Bandwidth Flash. And potentially integrating some
| extremely parallel tensor cores on each channel. Switching from
| DRAM to flash for AI processing could be a colossal win for
| fitting large models cost effectively (& perhaps power
| efficiently) while still having ridiculous gobs of bandwidth.
| With that possible win of doing processing & filtering extremely
| near to the data too. https://www.tomshardware.com/tech-
| industry/sandisk-and-sk-hy...
| Avlin67 wrote:
| tired of jeff glinglin everywhere...
| manarth wrote:
| I personally find his work and his posts interesting, and enjoy
| seeing them pop up on HN.
|
| If you prefer not to see his posts on the HN list pages, a
| practical solution is to use a browser extension (such as
| Stylus) to customise the HN styling to hide the posts.
|
| Here is a specific CSS style which will hide submissions from
| Jeff's website:
|
|     tr.submission:has(td a[href="from?site=jeffgeerling.com"]),
|     tr.submission:has(td a[href="from?site=jeffgeerling.com"]) + tr,
|     tr.submission:has(td a[href="from?site=jeffgeerling.com"]) + tr + tr {
|       opacity: 0.05;
|     }
|
| In this example, I've made it almost invisible, whilst it still
| takes up space on the screen (to avoid confusion about the post
| number increasing from N to N+2). You could use { display: none
| } to completely hide the relevant posts.
|
| The approach can be modified to suit any origin you prefer to
| not come across.
|
| The limitation is that the style modification may need
| refactoring if HN changes the markup structure.
| mjh2539 wrote:
| I only ever see him on HN. He's smart, kind, and talks about
| interesting things. Are you sure what you're feeling isn't
| envy?
| omneity wrote:
| I wish for a hardware + software solution to enable direct PCIe
| interconnect using lanes independent from the chipset/CPU. A PCIe
| mesh of sorts.
|
| With the right software support from say pytorch this could
| suddenly make old GPUs and underpowered PCs like in TFA into very
| attractive and competitive solutions for training and inference.
| snuxoll wrote:
| PCIe already allows DMA between peers on the bus, but, as you
| pointed out, the traces for the lanes have to terminate
| somewhere. However, it doesn't have to be the CPU (which is, of
| course, the PCIe root in modern systems) handling the traffic -
| a PCIe switch may be used to facilitate DMA between devices
| attached to it, if it supports routing DMA traffic directly.
| ComputerGuru wrote:
| That's what happened in TFA.
| omneity wrote:
| You're right. Let me correct myself: a hobbyist-friendly
| hardware solution. Dolphin's PCIe switches cost more than 8
| RTX 3090s on a Threadripper machine.
| Waterluvian wrote:
| At what point do the OEMs begin to realize they don't have to
| follow the current mindset of attaching a GPU to a PC and instead
| sell what looks like a GPU with a PC built into it?
| nightshift1 wrote:
| Exactly. With the Intel-Nvidia partnership signed this
| September, I expect to see some high-performance single-board
| computers being released very soon. I don't think the ATX form
| factor will survive another 30 years.
| bostik wrote:
| One should also remember that NVidia _does_ have
| organisational experience on designing and building CPUs[0].
|
| They were a pretty big deal back in ~2010, and I have to
| admit I didn't know that Tegra was powering Nintendo Switch.
|
| 0: https://en.wikipedia.org/wiki/Tegra
| goku12 wrote:
| I had a Xolo Tegra Note 7 tablet (marketed in the US as
| EVGA Tegra Note 7) in around 2013. I preordered it as far
| as I remember. It had a Tegra 4 SoC with quad core Cortex
| A15 CPU and a 72-core GeForce GPU. Nvidia used to claim
| that it was the fastest SoC for mobile devices at the time.
|
| To this day, it's the best mobile/Android device I ever
| owned. I don't know if it was the fastest, but it certainly
| was the best performing one I ever had. UI interactions
| were smooth, apps were fast on it, screen was bright, touch
| was perfect and still had long enough battery backup. The
| device felt very thin and light, but sturdy at the same
| time. It had a pleasant matte finish and a magnetic cover
| that lasted as long as the device did. It spoiled the feel
| of later tablets for me.
|
| It had only 1 GB RAM. We have much more powerful SoCs
| today. But nothing ever felt that smooth (iPhone is not
| considered). I don't know why it was so. Perhaps Android
| was light enough for it back then. Or it may have had a
| very good selection and integration of subcomponents. I was
| very disappointed when Nvidia discontinued the Tegra SoC
| family and tablets.
| miladyincontrol wrote:
| I'd argue their current CPUs aren't to be discounted
| either. Much as people love to crown Apple's M-series chips
| as the poster child of what ARM can do, Nvidia's Grace CPUs
| trade blows with the best of the best too.
|
| It leaves one to wonder what could be if they had any
| appetite for devices more in the consumer realm of things.
| pjmlp wrote:
| So basically going back to the old days of Amiga and Atari, in
| a certain sense, when PCs could only display text.
| goku12 wrote:
| I'm not familiar with that history. Could you elaborate?
| pjmlp wrote:
| In the home computer universe, such computers were the
| first ones having a programmable graphics unit that did
| more than paste the framebuffer into the screen.
|
| While the PCs were still displaying text - or, if you were
| lucky enough to own a Hercules card, gray text, or maybe a CGA
| one, with 4 colours.
|
| While the Amigas, which I am more comfortable with, were
| doing this in the mid-80's:
|
| https://www.youtube.com/watch?v=x7Px-ZkObTo
|
| https://www.youtube.com/watch?v=-ga41edXw3A
|
| The original Amiga 1000 had on its motherboard (later
| reduced to fit into an Amiga 500) a Motorola 68000 CPU, a
| programmable sound chip with DMA channels (Paula), and a
| programmable blitter chip (Agnus, an early GPU of sorts).
|
| You would build in RAM the audio or graphics instructions
| for the respective chipset, set the DMA parameters, and let
| them loose.
| nnevatie wrote:
| Hey! I had an Amiga 1000 back in the day - it was simply
| awesome.
| goku12 wrote:
| Thanks! Early computing history is very interesting (I
| know that this wasn't the earliest). They also sometimes
| explain certain odd design decisions that are still
| followed today.
| estimator7292 wrote:
| In the olden days we didn't have GPUs, we had "CRT
| controllers".
|
| What it offered you was a page of memory where each byte
| value mapped to a character in ROM. You feed in your text
| and the controller fetches the character pixels and puts
| them on the display. Later we got ASCII box drawing
| characters. Then we got sprite systems like the NES, where
| the Picture Processing Unit handles loading pixels and
| moving sprites around the screen.
|
| Eventually we moved on to raw framebuffers. You get a big
| chunk of memory and you draw the pixels yourself. The
| hardware was responsible for swapping the framebuffers and
| doing the rendering on the physical display.
|
| Along the way we slowly got more features like defining a
| triangle, its texture, and how to move it, instead of doing
| it all in software.
|
| Up until the 90s when the modern concept of a GPU
| coalesced, we were mainly pushing pixels by hand onto the
| screen. Wild times.
|
| The history of display processing is obviously a lot more
| nuanced than that, it's pretty interesting if that's your
| kind of thing.
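|
| A toy version of that character-generator scheme, for the
| curious (one made-up 8x8 glyph standing in for the ROM):
|
|     FONT_ROM = {65: [0x18, 0x24, 0x42, 0x7e, 0x42, 0x42, 0x42, 0x00]}
|
|     def scanline(text_row: list[int], y: int) -> str:
|         # For each byte of "screen memory", fetch row y of its
|         # glyph - what the controller did as the beam swept.
|         out = []
|         for code in text_row:
|             bits = FONT_ROM.get(code, [0] * 8)[y]
|             out.append("".join("#" if bits & (0x80 >> x) else "."
|                                for x in range(8)))
|         return "".join(out)
|
|     for y in range(8):
|         print(scanline([65, 65], y))  # two 'A's, rendered row by row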
| pjmlp wrote:
| Small addendum, there was already stuff like TMS34010 in
| the 1980's, just not at home.
| cmrdporcupine wrote:
| Those machines multiplexed the bus to split access to memory,
| because RAM speeds were competitive with or faster than the
| CPU bus speed. The CPU and VDP "shared" the memory, but only
| because CPUs were slow enough to make that possible.
|
| We have had the opposite problem for 35+ years at this point.
| The newer architecture machines like the Apple machines, the
| GB10, the AI 395+ _do_ share memory between GPU and CPU but
| in a different way, I believe.
|
| I'd argue with memory becoming suddenly much more expensive
| we'll probably see the opposite trend. I'm going to get me
| one of these GB10 or Strix Halo machines ASAP because I think
| with RAM prices skyrocketing we won't be seeing more of this
| kind of thing in the consumer market for a long time. Or at
| least, prices will not be dropping any time soon.
| pjmlp wrote:
| You are right, hence my "in a certain sense", because I was
| too lazy to point out the differences between a motherboard
| having everything there without pluggable graphics unit[0],
| and having everything now inside of a single chip.
|
| [0] - Not fully correct, as there are/were extensions cards
| that override the bus, thus replacing one of the said
| chips, on Amiga case.
| animal531 wrote:
| It's funny how ideas come and go. I made this very comment here
| on Hacker News probably 4-5 years ago and received a few down
| votes for it at the time (albeit that I was thinking of
| computers in general).
|
| It would take a lot of work to make a GPU do current CPU type
| tasks, but it would be interesting to see how it changes
| parallelism and our approach to logic in code.
| goku12 wrote:
| > I made this very comment here on Hacker News probably 4-5
| years ago and received a few down votes for it at the time
|
| HN isn't always very rational about voting. It would be a loss
| to judge any idea on that basis.
|
| > It would take a lot of work to make a GPU do current CPU
| type tasks
|
| In my opinion, that would be counterproductive. The advantage
| of GPUs is that they have a large number of very simple GPU
| cores. Instead, just do a few separate CPU cores on the same
| die, or on a separate die. Or you could even have a forest of
| GPU cores with a few CPU cores interspersed among them - sort
| of like how modern FPGAs have logic tiles, memory tiles and
| CPU tiles spread out on it. I doubt it would be called a GPU
| at that point.
| Den_VR wrote:
| As I recall, Gartner made the outrageous claim that upwards
| of 70% of all computing will be "AI" in some number of
| years - nearly the end of CPU workloads.
| yetihehe wrote:
| Looking at home computers, most of "computing" when
| counted as flops is done by GPUs anyway, just to show
| more and more frames. Processors are only used to
| organise all that data to be crunched by GPUs. The
| rest is browsing webpages and running Word or Excel
| several times a month.
| deliciousturkey wrote:
| I'd say over 70% of all computing has already been non-CPU
| for years. If you look at your typical phone or laptop
| SoC, the CPU is only a small part. The GPU takes the
| majority of area, with other accelerators also taking
| significant space. Manufacturers would not spend that
| money on silicon, if it was not already used.
| goku12 wrote:
| > I'd say over 70% of all computing has already been
| > non-CPU for years.
|
| > If you look at your typical phone or laptop SoC, the
| CPU is only a small part.
|
| Keep in mind that the die area doesn't always correspond
| to the throughput (average rate) of the computations done
| on it. That area may be allocated for a higher
| computational bandwidth (peak rate) and lower latency. Or
| in other words, get the results of a large number of
| computations faster, even if it means that the circuits
| idle for the rest of the cycles. I don't know the
| situation on mobile SoCs with regards to those
| quantities.
| deliciousturkey wrote:
| This is true, and my example was a very rough metric. But
| the computation density per area is actually way, way
| higher on GPUs compared to CPUs. CPUs only spend a
| tiny fraction of their area doing actual computation.
| PunchyHamster wrote:
| If going by raw operations done, if the given workload
| uses 3d rendering for UI that's probably true for
| computers/laptops. Watching YT video is essentially CPU
| pushing data between internet and GPU's video decoder,
| and to GPU-accelerated UI.
| swiftcoder wrote:
| > If you look at your typical phone or laptop SoC, the
| CPU is only a small part
|
| In mobile SoCs a good chunk of this is power efficiency.
| On a battery-powered device, there's always going to be a
| tradeoff to spend die area making something like 4K video
| playback more power efficient, versus general purpose
| compute
|
| Desktop-focussed SKUs are more liable to spend a metric
| ton of die area on bigger caches close to your compute.
| zozbot234 wrote:
| GPU compute units are not that simple, the main difference
| with CPU is that they generally use a combination of wide
| SIMD and wide SMT to hide latency, as opposed to the power-
| intensive out-of-order processing used by CPUs. Performing
| tasks that can't take advantage of either SIMD or SMT on
| GPU compute units might be a bit wasteful.
|
| Also you'd need to add extra hardware for various OS
| support functions (privilege levels, address space
| translation/MMU) that are currently missing from the GPU.
| But the idea is otherwise sound, you can think of the
| 'Mill' proposed CPU architecture as one variety of it.
| goku12 wrote:
| > GPU compute units are not that simple
|
| Perhaps I should have phrased it differently. CPU and GPU
| cores are designed for different types of loads. The rest
| of your comment seems similar to what I was imagining.
|
| Still, I don't think that enhancing the GPU cores with
| CPU capabilities (OOE, rings, MMU, etc from your
| examples) is the best idea. You may end up with the
| advantages of neither and the disadvantages of both. I
| was suggesting that you could instead have a few
| dedicated CPU cores distributed among the numerous GPU
| cores. Finding the right balance of GPU to CPU cores may
| be the key to achieving the best performance on such a
| system.
| deliciousturkey wrote:
| HN in general is quite clueless about topics like hardware,
| high performance computing, graphics, and AI performance. So
| you probably shouldn't care if you are downvoted, especially
| if you honestly know you are being correct.
|
| Also, I'd say if you buy, for example, a MacBook with an M4 Pro
| chip, it already is a big GPU attached to a small CPU.
| philistine wrote:
| People on here tend to act as if 20% of all computers sold
| were laptops, when it's the reverse.
| sharpneli wrote:
| Is there any need for that? Just have a few good CPUs there
| and you're good to go.
|
| As for what the HW looks like, we already know. Look at Strix
| Halo as an example. We are just getting bigger and bigger
| integrated GPUs. Most of the flops on that chip are in the GPU
| part.
| amelius wrote:
| I still would like to see a general GPU back end for LLVM
| just for fun.
| PunchyHamster wrote:
| It would just make everything worse. Some (if anything, most)
| tasks are far less paralleliseable than typical GPU loads.
| themafia wrote:
| At this point what you really need is an incredibly powerful
| heatsink with some relatively small chips pressed against it.
| whywhywhywhy wrote:
| The trashcan Mac Pro was this idea: a triangular heatsink core
| with CPU+GPU+GPU, one for each side.
| jnwatson wrote:
| If you disassemble a modern GPU, that's what you'll find. 95%
| by weight of a GPU card is cooling related.
| amelius wrote:
| Maybe at the point where you can run Python directly on the
| GPU. At which point the GPU becomes the new CPU.
|
| Anyway, we're still stuck with "G" for "graphics" so it all
| doesn't make much sense and I'm actually looking for a vendor
| that takes its mission more seriously.
| lizknope wrote:
| The vast majority of computers sold today have a CPU / GPU
| integrated together in a single chip. Most ordinary home users
| don't care about GPU or local AI performance that much.
|
| In this video Jeff is interested in GPU accelerated tasks like
| AI and Jellyfin. His last video was using a stack of 4 Mac
| Studios connected by Thunderbolt for AI stuff.
|
| https://www.youtube.com/watch?v=x4_RsUxRjKU
|
| The Apple chips have both powerful CPU and GPU cores, but also
| a huge amount of memory (up to 512GB) directly connected, unlike
| most Nvidia consumer-level GPUs that have far less memory.
| onion2k wrote:
| _Most ordinary home users don't care about GPU or local AI
| performance that much._
|
| Right now, sure. There's a reason why chip manufacturers are
| adding AI pipelines, tensor processors, and 'neural cores'
| though. They believe that running small local models is
| going to be a popular feature in the future. They might be
| right.
| swiftcoder wrote:
| It's mostly marketing gimmicks though - they aren't adding
| anywhere near enough compute for that future. The tensor
| cores in an "AI ready" laptop from a year ago are already
| pretty much irrelevant as far as inferencing current-
| generation models goes.
| zozbot234 wrote:
| NPU/Tensor cores are actually very useful for prompt pre-
| processing, or really any ML inference task that isn't
| strictly bandwidth limited (because you end up wasting a
| lot of bandwidth on padding/dequantizing data to a format
| that the NPU can natively work with, whereas a GPU can
| just do that in registers/local memory). Main issue is
| the limited support in current ML/AI inference
| frameworks.
| cmrdporcupine wrote:
| I mean, that's kind of what's going on at a certain level with
| the AMD Strix Halo, the NVIDIA GB10, and the newer Apple
| machines.
|
| In the sense that the RAM is fully integrated, anyways.
| pjmlp wrote:
| Of course, just go to any computer store, where most gamer setups
| on affordable budgets go with the combo "beefy GPU + an i5"
| instead of an i7 or i9 Intel CPU.
| moebrowne wrote:
| I'd be interested to see if workloads like Folding@home could be
| efficiently run this way. I don't think they need a lot of
| bandwidth.
| haritha-j wrote:
| I currently have a £500 laptop hooked up to an eGPU box with a
| £700 GPU. It's not a bad setup.
| yoan9224 wrote:
| The most interesting takeaway for me is that PCIe bandwidth
| really doesn't bottleneck LLM inference for single-user
| workloads. You're essentially just shuttling the model weights
| once, then the GPU churns through tokens using its own VRAM.
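|
| The one-time cost is easy to estimate (a ~20 GB quantized model
| assumed; usable PCIe rates are approximate):
|
|     model_gb = 20
|     links = {"3.0 x1": 0.985, "3.0 x16": 15.75}  # usable GB/s
|
|     for name, rate in links.items():
|         print(f"PCIe {name}: weight load ~{model_gb / rate:.0f} s")
|     # ~20 s over x1 vs ~1 s over x16 - a startup cost, not a
|     # per-token one.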
|
| This is huge for home lab setups. You can run a Pi 5 with a high-
| end GPU via external enclosure and get 90% of the performance of
| a full workstation at a fraction of the power draw and cost.
|
| The multi-GPU results make sense too - without tensor
| parallelism, you just get pipeline parallelism across layers,
| which is inherently sequential. The GPUs are literally sitting
| idle waiting for the previous layer's output. Exo and similar
| frameworks are trying to solve this but it's still early days.
|
| For anyone considering this: watch out for Resizable BAR
| requirements. Some older boards won't work at all without it.
___________________________________________________________________
(page generated 2025-12-21 23:01 UTC)