[HN Gopher] Big GPUs don't need big PCs
___________________________________________________________________
Big GPUs don't need big PCs
Author : mikece
Score : 270 points
Date   : 2025-12-20 17:49 UTC (1 day ago)
(HTM) web link (www.jeffgeerling.com)
(TXT) w3m dump (www.jeffgeerling.com)
| jonahbenton wrote:
| So glad someone did this. Have been running big GPUs in eGPU
| enclosures connected to spare laptops and thinking, why not Pis?
| 3eb7988a1663 wrote:
| Datapoints like this really make me reconsider my daily driver. I
| should be running one of those $300 mini PCs at <20W. With ~flat
| CPU performance gains, would be fine for the next 10 years. Just
| remote into my beefy workstation when I actually need to do real
| work. Browsing the web, watching videos, even playing some games
| is easily within their wheelhouse.
| ekropotin wrote:
| As an experiment, I decided to try using a Proxmox VM with an
| eGPU and the USB bus passed through to it as my main PC for
| browsing and working on hobby projects.
|
| It's just 1 vCPU with 4 GB of RAM, and you know what? It's more
| than enough for these needs. I think hardware manufacturers
| falsely convinced us that every professional needs a beefy
| laptop to be productive.
| reactordev wrote:
| I went with a beelink for this purpose. Works great.
|
| Keeps the desk nice and tidy while "the beasts" roar in a
| soundproofed closet.
| samuelknight wrote:
| Switching from my 8-core Ryzen mini PC to an 8-core Ryzen
| desktop makes my unit tests run way faster. TDP limits can tip
| you off to very different performance envelopes in otherwise
| similar-spec CPUs.
| loeg wrote:
| Even if you could cool the full TDP in a micro PC, in a full
| size desktop you might be able to use a massive AIO radiator
| with fans running at very slow, very quiet speeds instead of
| jet turbine howl in the micro case. The quiet and ease of
| working in a bigger space are mostly a good tradeoff for a
| slightly larger form factor under a desk.
| adrian_b wrote:
| A full-size desktop computer will always be much faster for
| any workload that fully utilizes the CPU.
|
| However, a full-size desktop computer seldom makes sense as a
| _personal_ computer, i.e. as the computer that interfaces to
| a human via display, keyboard and graphic pointer.
|
| For most of the activities done directly by a human, i.e.
| reading & editing documents, browsing the Internet, watching
| movies and so on, a mini-PC is powerful enough. The only
| exception is playing games designed for big GPUs, but there
| are many computer users who are not gamers.
|
| In most cases the optimal setup is to use a mini-PC as your
| personal computer and a full-size desktop as a server on
| which you can launch any time-consuming tasks, e.g.
| compilation of big software projects, EDA/CAD simulations,
| testing suites etc.
|
| The desktop used as server can use Wake-on-LAN to stay
| powered off when not needed and wake up whenever it must run
| some task remotely.
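|
| For illustration, the wake-up can be scripted with a minimal
| Wake-on-LAN magic packet sender in Python (the MAC address is
| a placeholder for your desktop's NIC):
|
|     import socket
|
|     def wake(mac: str, broadcast: str = "255.255.255.255") -> None:
|         # Magic packet: 6 x 0xFF, then the target MAC repeated
|         # 16 times, sent as a UDP broadcast (port 9 by convention).
|         payload = bytes.fromhex(mac.replace(":", ""))
|         packet = b"\xff" * 6 + payload * 16
|         with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
|             s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
|             s.sendto(packet, (broadcast, 9))
|
|     wake("aa:bb:cc:dd:ee:ff")  # placeholder MAC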
| whatevaa wrote:
| Not everything supports remoting well. For example, many
| IDEs - unless you run RDP, with the whole graphical session
| on the remote machine.
|
| Also, having to buy two computers costs money. It
| makes sense to use one for both use cases if you have to buy
| the desktop anyway.
| jasonwatkinspdx wrote:
| For just basic windows desktop stuff, a $200 NUC has been good
| enough for like 15 years now.
| themafia wrote:
| > I should be running one of those $300 mini PCs at <20W.
|
| Yes. They're basically laptop chips at this point. The thermals
| are worse but the chips are perfectly modern and can handle
| reasonably large workloads. I've got an 8-core Ryzen 7 with
| Radeon 780M graphics and 96GB of DDR5. Outside of AAA gaming
| this thing is absolutely fine.
|
| The power draw is a huge win for me. It's like 6W at idle. I
| live remotely so grid power is somewhat unreliable and saving
| watts when using solar batteries extends their lifetime
| massively. I'm thrilled with them.
| PunchyHamster wrote:
| Slapping $300 worth of solar panels on your roof/balcony will
| probably get you ahead on power usage
| nottorp wrote:
| That's why I use an M2 (not even Pro) Mac Mini as a terminal and
| remote into other boxes when needed.
| ivanjermakov wrote:
| Another benefit is low noise. Many consider fan noise under
| load to be the most important property of a workstation.
| yjftsjthsd-h wrote:
| I've been kicking this around in my head for a while. If I want
| to run LLMs locally, a decent GPU is really the only important
| thing. At that point, the question becomes, roughly, what is the
| cheapest computer to tack on the side of the GPU? Of course, that
| assumes that everything does in fact work; unlike OP, I am barely
| in a position to _understand_ e.g. BAR problems, let alone try to
| fix them, so what I actually did was build a cheap-ish x86 box
| with a half-decent GPU and called it a day:) But it still is
| stuck in my brain: there must be a more efficient way to do this,
| especially if all you need is just enough computer to shuffle
| data to and from the GPU and serve that over a network
| connection.
| zeusk wrote:
| Get the DGX Spark computers? They're exactly what you're trying
| to build.
| Gracana wrote:
| They're very slow.
| geerlingguy wrote:
| They're okay, generally, but slow for the price. You're
| paying more for the ConnectX-7 networking than for
| inference performance.
| Gracana wrote:
| Yeah, I wouldn't complain if one dropped in my lap, but
| they're not at the top of my list for inference hardware.
|
| Although... Is it possible to pair a fast GPU with one?
| Right now my inference setup for large MoE LLMs has
| shared experts in system memory, with KV cache and dense
| parts on a GPU, and a Spark would do a better job of
| handling the experts than my PC, if only it could talk to
| a fast GPU.
|
| [edit] Oof, I forgot these have only 128GB of RAM. I take
| it all back, I still don't find them compelling.
| tcdent wrote:
| We're not yet to the point where a single PCIe device will get
| you anything meaningful; IMO 128 GB of RAM available to the GPU
| is essential.
|
| So while you don't need a ton of compute on the CPU, you do need
| the ability to address multiple PCIe lanes. A relatively low-spec
| AMD EPYC processor is fine if the motherboard exposes enough
| lanes.
| skhameneh wrote:
| There is plenty that can run within 32/64/96 GB of VRAM. IMO
| models like Phi-4 are underrated for many simple tasks. Some
| quantized Gemma 3 models are quite good as well.
|
| There are larger/better models as well, but those tend to
| really push the limits of 96 GB.
|
| FWIW when you start pushing into 128 GB+, the ~500 GB models
| really start to become attractive, because at that point
| you're probably wanting just a bit more out of everything.
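|
| A rough fit check, for what it's worth (the 20% overhead for
| KV cache and activations is just a guess):
|
|     def fits_in_vram(params_b: float, bits_per_weight: int,
|                      vram_gb: float, overhead: float = 1.2) -> bool:
|         # Weights: 1B params at 8-bit is ~1 GB; scale by quantization.
|         weight_gb = params_b * bits_per_weight / 8
|         return weight_gb * overhead <= vram_gb
|
|     print(fits_in_vram(27, 4, 24))   # Gemma 3 27B @ 4-bit, 24 GB: True
|     print(fits_in_vram(120, 4, 96))  # ~120B @ 4-bit, 96 GB: True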
| tcdent wrote:
| IDK, all of my personal and professional projects involve
| pushing the SOTA to the absolute limit. Using anything
| other than the latest OpenAI or Anthropic model is out of
| the question.
|
| Smaller open source models are a bit like 3d printing in
| the early days; fun to experiment with but really not that
| valuable for anything other than making toys.
|
| Text summarization, maybe? But even then I want a model
| that understands the complete context and does a good job.
| Even things like "generate one sentence about the action
| we're performing" I usually find I can just incorporate it
| into the output schema of a larger request instead of
| making a separate request to a smaller model.
| xyzzy123 wrote:
| It seems to me like the use case for local GPUs is almost
| entirely privacy.
|
| If you buy a 15k AUD RTX 6000 with 96GB, that card will
| _never_ pay for itself on a gpt-oss:120b workload vs just
| using OpenRouter - no matter how many tokens you push
| through it - because the cost of residential power in
| Australia means you cannot generate tokens cheaper than
| the cloud, even if the card were free.
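|
| Back-of-envelope, with every number an assumption (card draw,
| decode speed, tariff):
|
|     watts = 450         # assumed full-load draw of the card
|     tok_per_s = 60      # assumed gpt-oss:120b decode speed
|     aud_per_kwh = 0.35  # typical AU residential tariff
|
|     joules_per_tok = watts / tok_per_s
|     kwh_per_mtok = joules_per_tok * 1e6 / 3.6e6
|     print(f"~A${kwh_per_mtok * aud_per_kwh:.2f}/Mtok in power alone")
|     # ~A$0.73/Mtok before amortizing the card; if hosted output
|     # tokens cost less than that, the card can never catch up.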
| joefourier wrote:
| There are a few more considerations:
|
| - You can use the GPU for training and run your own fine
| tuned models
|
| - You can have much higher generation speeds
|
| - You can sell the GPU on the used market in ~2 years
| time for a significant portion of its value
|
| - You can run other types of models like image, audio or
| video generation that are not available via an API, or
| cost significantly more
|
| - Psychologically, you don't feel like you have to
| constrain your token spending and you can, for instance,
| just leave an agent to run for hours or overnight without
| feeling bad that you just "wasted" $20
|
| - You won't be running the GPU at max power constantly
| 15155 wrote:
| Or censorship avoidance
| girvo wrote:
| > because the cost of residential power in Australia
|
| This _so_ doesn't really matter to your overall point,
| which I agree with, but:
|
| The rise of rooftop solar and home battery energy storage
| flips this a bit now in Australia, IMO. At least where I
| live, every house has a solar panel on it.
|
| Not worth it _just_ for local LLM usage, but an
| interesting change to energy economics IMO!
| popalchemist wrote:
| This is simply not true. Your heuristic is broken.
|
| The recent Gemma 3 models, which are produced by Google
| (a little startup - heard of em?) outperform the last
| several OpenAI releases.
|
| Closed does not necessarily mean better. Plus the local
| ones can be finetuned to whatever use case you may have,
| won't have any inputs blocked by censorship
| functionality, and you can optimize them by distilling to
| whatever spec you need.
|
| Anyway all that is extraneous detail - the important
| thing is to decouple "open" and "small" from "worse" in
| your mind. The most recent Gemma 3 model specifically is
| incredible, and it makes sense, given that Google has
| access to many times more data than OpenAI for training
| (something like a factor of 10 at least). Which is of
| course a very straightforward idea to wrap your head
| around, Google was scraping the internet for decades
| before OpenAI even entered the scene.
|
| So just because their Gemma model is released in an open-
| source (open weights) way, doesn't mean it should be
| discounted. There's no magic voodoo happening behind the
| scenes at OpenAI or Anthropic; the models are essentially
| of the same type. But Google releases theirs to undercut
| the profitability of their competitors.
| tcdent wrote:
| This one?
| https://artificialanalysis.ai/models/gemma-3-27b
| p1necone wrote:
| I'm holding out for someone to ship a GPU with DIMM slots on
| it.
| kristianp wrote:
| A single CAMM might suit better.
| tymscar wrote:
| DDR5 is a couple of orders of magnitude slower than really
| good VRAM. That's one big reason.
| dawnerd wrote:
| But it would still be faster than splitting the model up
| on a cluster, right? I've also wondered why they haven't
| just shipped GPUs like CPUs.
| cogman10 wrote:
| Man I'd love to have a GPU socket. But it'd be pretty
| hard to get a standard going that everyone would support.
| Look at sockets for CPUs; we barely had crossover for
| like 2 generations.
|
| But boy, a standard GPU socket so you could easily BYO
| cooler would be nice.
| estimator7292 wrote:
| The problem isn't the sockets. It costs a _lot_ to spec
| and build new sockets; we wouldn't swap them for no
| reason.
|
| The problem is that the signals and features that the
| motherboard and CPU expect are different between
| generations. We use different sockets on different
| generations to prevent you plugging in incompatible CPUs.
|
| We used to have cross-generational sockets in the 386 era
| because the hardware supported it. Motherboards weren't
| changing so you could just upgrade the CPU. But then the
| CPUs needed different voltages than before for
| performance. So we needed a new socket to not blow up
| your CPU with the wrong voltage.
|
| That's where we are today. Each generation of CPU wants
| different voltages, power, signals, a specific chipset,
| etc. Within the same +-1 generation you can swap CPUs
| because they're electrically compatible.
|
| To have universal CPU sockets, we'd need a universal
| electrical interface standard, which is too much of a
| moving target.
|
| AMD would probably love to never have to tool up a new
| CPU socket. They don't make money on the motherboard you
| have to buy. But the old motherboards just can't support
| new CPUs. Thus, new socket.
| cogman10 wrote:
| For AI, really good isn't really a requirement. If a
| middle ground memory module could be made, then it'd be
| pretty appealing.
| zrm wrote:
| DDR5 is ~8GT/s, GDDR6 is ~16GT/s, GDDR7 is ~32GT/s. It's
| faster but the difference isn't crazy and if the premise
| was to have a lot of slots then you could also have a lot
| of channels. 16 channels of DDR5-8200 would have slightly
| more memory bandwidth than RTX 4090.
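|
| The arithmetic, using published specs (the 16-channel
| DDR5-8200 box is the hypothetical):
|
|     def gb_per_s(gt_per_s: float, bus_bits: int) -> float:
|         # Peak bandwidth = transfer rate x bus width in bytes
|         return gt_per_s * bus_bits / 8
|
|     print(gb_per_s(8.2, 16 * 64))  # hypothetical 16ch DDR5-8200: ~1050
|     print(gb_per_s(21.0, 384))     # RTX 4090 (21 GT/s GDDR6X): 1008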
| tymscar wrote:
| Yeah, so DDR5 is 8 GT/s and GDDR7 is 32 GT/s, and bus width is
| 64 vs 384 bits. That's 4x the rate times 6x the width, which
| already makes the VRAM 24 times faster.
|
| You can add more channels, sure, but each channel makes
| it less and less likely for you to boot. Look at modern
| AM5 struggling to boot at over 6000 with more than two
| sticks.
|
| So you'd have to get an insane six channels to match the
| bus width, at which point your only choice to be stable
| would be to lower the speed so much that you're back to
| the same orders of magnitude difference, really.
|
| Now we could instead solder that RAM, move it closer to
| the GPU and cross-link channels to reduce noise. We could
| also increase the speed and oh, we just invented
| soldered-on GDDR...
| zrm wrote:
| > Bus width is 64 vs 384.
|
| The bus width is the number of channels. They don't call
| them channels when they're soldered but 384 is already
| the equivalent of 6. The premise is that you would have
| more. Dual socket Epyc systems already have 24 channels
| (12 channels per socket). It costs money but so does
| 256GB of GDDR.
|
| > Look at modern AM5 struggling to boot at over 6000 with
| more than two sticks.
|
| The relevant number for this is the number of sticks per
| channel. With 16 channels and 64GB sticks you could have
| 1TB of RAM with only one stick per channel. Use CAMM2
| instead of DIMMs and you get the same speed and capacity
| from 8 slots.
| anon25783 wrote:
| Would that be worth anything, though? What about the
| overhead of clock cycles needed for loading from and
| storing to RAM? Might not amount to a net benefit for
| performance, and it could also potentially complicate heat
| management I bet.
| dist-epoch wrote:
| This problem was already solved 10 years ago - crypto mining
| motherboards, which have a large number of PCIe slots, a CPU
| socket, one memory slot, and not much else.
|
| > Asus made a crypto-mining motherboard that supports up to 20
| GPUs
|
| https://www.theverge.com/2018/5/30/17408610/asus-crypto-mini...
|
| For LLMs you'll probably want a different setup, with some
| more memory and some M.2 storage.
| jsheard wrote:
| Those only gave each GPU a single PCIe lane though, since
| crypto mining barely needed to move any data around. If your
| application doesn't fit that mould then you'll need a much,
| much more expensive platform.
| dist-epoch wrote:
| After you load the weights into the GPU and keep the KV
| cache there too, you don't need any other significant
| traffic.
| numpad0 wrote:
| Even in tensor parallel modes? I thought it could only
| work if you're fine stalling all but one of the n GPUs
| unless there are n users/tasks at any given moment.
| skhameneh wrote:
| In theory, it's only sufficient for pipeline parallel due to
| limited lanes and interconnect bandwidth.
|
| Generally, scalability on consumer GPUs falls off between 4-8
| GPUs for most workloads. Those running more GPUs are typically
| using a higher quantity of smaller GPUs for cost effectiveness.
| zozbot234 wrote:
| M.2 is mostly just a different form factor for PCIe anyway.
| seanmcdirmid wrote:
| And you don't want to go the M4 Max/M3 Ultra route? It works
| well enough for most mid sized LLMs.
| binsquare wrote:
| I run a crowd sourced website to collect data on the best and
| cheapest hardware setup for local LLM here:
| https://inferbench.com/
|
| Source code: https://github.com/BinSquare/inferbench
| nodja wrote:
| Cool site, I noticed the 3090 is on there twice.
|
| https://inferbench.com/gpu/NVIDIA%20GeForce%20RTX%203090
|
| https://inferbench.com/gpu/NVIDIA%20RTX%203090
| binsquare wrote:
| Oh nice catch, I'll fix that
|
| ---
|
| Edit: Fixed
| kilpikaarna wrote:
| Nice! Though for older hardware it would be nice if the price
| reflected the current second hand market (harder to get data
| for, I know). E.g. the Nvidia RTX 3070 ranks as the second-best
| GPU in tok/s/$ even at its MSRP of $499, but you can get one for
| half that now.
| binsquare wrote:
| Great idea - I've added it by manually browsing ebay for
| that initial data.
|
| So it's just a static value in this hardware list: https://
| github.com/BinSquare/inferbench/blob/main/src/lib/ha...
|
| Let me know if you know of a better way, or contribute :D
| jsight wrote:
| It seems like verification might need to be improved a bit? I
| looked at Mistral-Large-123B. Someone is claiming 12
| tokens/sec on a single RTX 3090 at FP16.
|
| Perhaps some filter could cut out submissions that don't
| really make sense?
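|
| A sketch of such a filter: for dense models, batch-1 decode is
| roughly bounded by VRAM bandwidth over model size, so claims
| above that ceiling can be flagged (numbers from public specs):
|
|     def plausible(claimed_tok_s: float, model_gb: float,
|                   bw_gb_s: float, slack: float = 1.1) -> bool:
|         # Decode must stream the weights once per token, so
|         # tok/s <= bandwidth / model size (dense models; MoE
|         # only streams the active experts).
|         return claimed_tok_s <= (bw_gb_s / model_gb) * slack
|
|     # Mistral-Large 123B at FP16 is ~246 GB; a 3090 has ~936 GB/s
|     # (and only 24 GB of VRAM to begin with).
|     print(plausible(12, 246, 936))  # False: ceiling ~3.8 tok/s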
| Eisenstein wrote:
| There is a whole section in here on how to spec out a cheap rig
| and what to look for:
|
| * https://jabberjabberjabber.github.io/Local-AI-Guide/
| Wowfunhappy wrote:
| I really would have liked to see gaming performance, although I
| realize it might be difficult to find an AAA game that supports
| ARM. (Forcing the Pi to emulate x86 with FEX doesn't seem
| entirely fair.)
| 3eb7988a1663 wrote:
| You might have to thread the needle to find a game which does
| not bottleneck on the CPU.
| kristjansson wrote:
| Really, why have the PCIe/CPU artifice at all? Apple and Nvidia
| have the right idea: put the MPP on the same die/package as the
| CPU.
| bigyabai wrote:
| > put the MPP on the same die/package as the CPU.
|
| That would help in latency-constrained workloads, but I don't
| think it would make much of a difference for AI or most HPC
| applications.
| PunchyHamster wrote:
| We need low-power but high-PCIe-lane-count CPUs for that, just
| purely for shoving models from NVMe to the GPU.
| lostmsu wrote:
| Now compare batched training performance. Or batched inference.
|
| Of course prefill is going to be GPU bound. You only send a few
| thousand bytes to it, and don't really ask it to return much. But
| after prefill is done, unless you use batched mode, you aren't
| really using your GPU for anything more than its VRAM bandwidth.
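|
| Rough napkin math on that, all numbers ballpark (a ~20 GB
| quantized model on a 3090-class card):
|
|     model_bytes = 20e9   # quantized weights resident in VRAM
|     params = 20e9        # assumed active parameter count
|     bw = 936e9           # 3090-class VRAM bandwidth, bytes/s
|     flops = 71e12        # 3090-class dense fp16 tensor throughput
|
|     bandwidth_bound = bw / model_bytes      # ~47 tok/s at batch 1
|     compute_bound = flops / (2 * params)    # ~1775 tok/s if batched
|     print(bandwidth_bound, compute_bound)
|     # The gap is what batching recovers: weights stream from VRAM
|     # once per step and get reused across every sequence in the batch.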
| numpad0 wrote:
| Not sure what was unexpected about the multi-GPU part.
|
| It's very well known that most LLM frameworks, including
| llama.cpp, split models by layers, which have a sequential
| dependency, so multi-GPU setups are completely stalled unless
| there are n_gpu users/tasks running in parallel. It's also known
| that some GPUs are faster at "prompt processing" and some at
| "token generation", so combining Radeon and NVIDIA sometimes
| helps. Reportedly the inter-layer transfer sizes are in kilobyte
| ranges and PCIe x1 is plenty, or something.
|
| It takes appropriate backends with "tensor parallel" mode
| support, which split the neural network parallel to the
| direction of data flow, and which obviously benefit
| substantially from a good interconnect between GPUs like PCIe
| x16 or NVLink/Infinity Fabric bridge cables, and/or inter-GPU DMA
| over PCIe (called GPU P2P or GPUDirect or some such lingo).
|
| Absent those, I've read somewhere that people can sometimes see
| GPU utilization spikes walking over GPUs on nvtop-style tools.
|
| Looking for a way to break up tasks for LLMs so that there will
| be multiple tasks to run concurrently would be interesting, maybe
| like creating one "manager" and a few "delegated engineers"
| personalities. Or simulating multiple different domains of the
| brain, such as the speech center, visual cortex, language center,
| etc., communicating in tokens, might be an interesting way of
| working around this problem.
| zozbot234 wrote:
| > Looking for a way to break up tasks for LLMs so that there
| will be multiple tasks to run concurrently would be
| interesting, maybe like creating one "manager" and few
| "delegated engineers" personalities.
|
| This is pretty much what "agents" are for. The manager model
| constructs prompts and contexts that the delegated models can
| work on in parallel, returning results when they're done.
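|
| A minimal sketch of that fan-out against an OpenAI-compatible
| local server (endpoint and model name are placeholders):
|
|     from concurrent.futures import ThreadPoolExecutor
|     import json, urllib.request
|
|     def ask(prompt: str) -> str:
|         body = json.dumps({"model": "local-model", "messages":
|                            [{"role": "user", "content": prompt}]})
|         req = urllib.request.Request(
|             "http://localhost:8080/v1/chat/completions",
|             data=body.encode(),
|             headers={"Content-Type": "application/json"})
|         with urllib.request.urlopen(req) as resp:
|             return json.load(resp)["choices"][0]["message"]["content"]
|
|     tasks = ["summarize module A", "summarize module B"]
|     with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
|         # Parallel requests keep every pipeline stage busy
|         results = list(pool.map(ask, tasks))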
| nodja wrote:
| > Reportedly the inter-layer transfer sizes are in kilobyte
| ranges and PCIe x1 is plenty or something.
|
| Not an expert, but napkin math tells me that more often than
| not this will be on the order of megabytes--not kilobytes--
| since it scales with sequence length.
|
| Example: Qwen3 30B has a hidden state size of 5120; even if
| quantized to 8 bits, that's 5120 bytes per token. It would pass
| the MB boundary with just a little over 200 tokens. Still not
| much of an issue when a single PCIe lane is ~2GB/s.
|
| I think device to device latency is more of an issue here, but
| I don't know enough to assert that with confidence.
| remexre wrote:
| For each token generated, you only send one token's worth
| between layers; the previous tokens are in the KV cache.
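|
| Putting both points into numbers (hidden size from the comment
| above, fp16 activations assumed):
|
|     hidden, bytes_per = 5120, 2
|
|     def transfer_mb(tokens: int) -> float:
|         # Activation bytes crossing one pipeline split per pass
|         return tokens * hidden * bytes_per / 1e6
|
|     print(transfer_mb(2048))  # prefill of a 2k prompt: ~21 MB
|     print(transfer_mb(1))     # each decode step: ~0.01 MB
|     # Even over PCIe 3.0 x1 (~1 GB/s) the prefill hop is ~20 ms,
|     # and decode steps are microseconds; latency, not bandwidth,
|     # is the bottleneck.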
| syntaxing wrote:
| There are some technical implementations that make it more
| efficient, like EXO [1]. Jeff Geerling recently did a review of
| a 4x Mac Studio cluster with RDMA support, and you can see that
| EXO has a noticeable advantage [2].
|
| [1] https://github.com/exo-explore/exo [2]
| https://www.youtube.com/watch?v=x4_RsUxRjKU
| sgt wrote:
| At this point I'd consider a cluster of top-specced Mac
| Studios to be worthwhile in production. I just need to host
| them properly in a rack in a co-lo data center.
| syntaxing wrote:
| Honestly, I genuinely can see the value if you want to host
| something internally for sensitive and important
| information. I really hope the M5 ultra with matmul
| accelerators will knock this out of the park. With the way
| RAM is trending, a Mac Studio cluster will become more
| enticing.
| scotty79 wrote:
| > Not sure what was unexpected about the multi GPU part. It's
| very well known that most LLM frameworks including llama.cpp
| splits models by layers, which has sequential dependency, and
| so multi GPU setups are completely stalled
|
| Oh, I thought the point of transformers was being able to split
| the load vertically to avoid sequential dependencies. Is it true
| just for training, or not at all?
| sailingparrot wrote:
| Just for training and processing the existing context (prefill
| phase). But when doing inference, token t has to be sampled
| before t+1 can be, so it's still sequential.
| kgeist wrote:
| What about constrained decoding (with JSON schemas)? I noticed my
| vLLM instance is using 1 CPU 100%.
| jauntywundrkind wrote:
| PCIe 3.0 is the nice, easy, convenient generation where 1 lane =
| 1 GB/s. Given the overhead, that's pretty close to 10Gb Ethernet
| speeds (lower latency though).
|
| I do wonder how long the cards are going to need host systems at
| all. We've already seen GPUs with M.2 SSDs attached! The Radeon
| Pro SSG hails back to 2016! You still need a way to get the model
| onto it in the first place and to get work in and out, but 1GbE
| and a small RISC-V chip (which Nvidia already uses for management
| cores) could suffice. Maybe even an RPi on the card.
| https://www.techpowerup.com/224434/amd-announces-the-radeon-...
|
| Given the gobs of memory these cards have, they probably don't
| even need storage; they just need big pipes. Intel had 100GbE on
| their Xeon & Xeon Phi chips (10x what we saw here!) in _2016_!
| GPUs that just plug into the switch and talk across 400GbE or
| UltraEthernet or switched CXL, and run semi-independently, feel
| so sensible, so not outlandish.
| https://www.servethehome.com/next-generation-interconnect-in...
|
| It's far off for now, but flash makers are also looking at
| radically many channel flash, which can provide absurdly high
| GB/s: High Bandwidth Flash. And potentially integrating some
| extremely parallel tensor cores on each channel. Switching from
| DRAM to flash for AI processing could be a colossal win for
| fitting large models cost effectively (& perhaps power
| efficiently) while still having ridiculous gobs of bandwidth.
| With that possible win of doing processing & filtering extremely
| near to the data too. https://www.tomshardware.com/tech-
| industry/sandisk-and-sk-hy...
| Avlin67 wrote:
| tired of jeff glinglin everywhere...
| manarth wrote:
| I personally find his work and his posts interesting, and enjoy
| seeing them pop up on HN.
|
| If you prefer not to see his posts on the HN list pages, a
| practical solution is to use a browser extension (such as
| Stylus) to customise the HN styling to hide the posts.
|
| Here is a specific CSS style which will hide submissions from
| Jeff's website:
|
|     tr.submission:has(td a[href="from?site=jeffgeerling.com"]),
|     tr.submission:has(td a[href="from?site=jeffgeerling.com"]) + tr,
|     tr.submission:has(td a[href="from?site=jeffgeerling.com"]) + tr + tr {
|       opacity: 0.05;
|     }
|
| In this example, I've made it almost invisible, whilst it still
| takes up space on the screen (to avoid confusion about the post
| number increasing from N to N+2). You could use { display: none
| } to completely hide the relevant posts.
|
| The approach can be modified to suit any origin you prefer to
| not come across.
|
| The limitation is that the style modification may need
| refactoring if HN changes the markup structure.
| mjh2539 wrote:
| I only ever see him on HN. He's smart, kind, and talks about
| interesting things. Are you sure what you're feeling isn't
| envy?
| omneity wrote:
| I wish for a hardware + software solution to enable direct PCIe
| interconnect using lanes independent from the chipset/CPU. A PCIe
| mesh of sorts.
|
| With the right software support from say pytorch this could
| suddenly make old GPUs and underpowered PCs like in TFA into very
| attractive and competitive solutions for training and inference.
| snuxoll wrote:
| PCIe already allows DMA between peers on the bus, but, as you
| pointed out, the traces for the lanes have to terminate
| somewhere. However, it doesn't have to be the CPU (which is, of
| course, the PCIe root in modern systems) handling the traffic -
| a PCIe switch may be used to facilitate DMA between devices
| attached to it, if it supports routing DMA traffic directly.
| ComputerGuru wrote:
| That's what happened in TFA.
| omneity wrote:
| You're right. Let me correct myself: a hobbyist-friendly
| hardware solution. Dolphin's PCIe switches cost more than 8
| RTX 3090s on a Threadripper machine.
| Waterluvian wrote:
| At what point do the OEMs begin to realize they don't have to
| follow the current mindset of attaching a GPU to a PC and instead
| sell what looks like a GPU with a PC built into it?
| nightshift1 wrote:
| Exactly. With the Intel-Nvidia partnership signed this
| September, I expect to see some high-performance single-board
| computers being released very soon. I don't think the ATX form
| factor will survive another 30 years.
| bostik wrote:
| One should also remember that NVidia _does_ have
| organisational experience on designing and building CPUs[0].
|
| They were a pretty big deal back in ~2010, and I have to
| admit I didn't know that Tegra was powering Nintendo Switch.
|
| 0: https://en.wikipedia.org/wiki/Tegra
| goku12 wrote:
| I had a Xolo Tegra Note 7 tablet (marketed in the US as
| EVGA Tegra Note 7) in around 2013. I preordered it as far
| as I remember. It had a Tegra 4 SoC with quad core Cortex
| A15 CPU and a 72-core GeForce GPU. Nvidia used to claim
| that it was the fastest SoC for mobile devices at the time.
|
| To this day, it's the best mobile/Android device I ever
| owned. I don't know if it was the fastest, but it certainly
| was the best performing one I ever had. UI interactions
| were smooth, apps were fast on it, screen was bright, touch
| was perfect and still had long enough battery backup. The
| device felt very thin and light, but sturdy at the same
| time. It had a pleasant matte finish and a magnetic cover
| that lasted as long as the device did. It spoiled the feel
| of later tablets for me.
|
| It had only 1 GB RAM. We have much more powerful SoCs
| today. But nothing ever felt that smooth (iPhone is not
| considered). I don't know why it was so. Perhaps Android
| was light enough for it back then. Or it may have had a
| very good selection and integration of subcomponents. I was
| very disappointed when Nvidia discontinued the Tegra SoC
| family and tablets.
| miladyincontrol wrote:
| I'd argue their current CPUs aren't to be discounted
| either. Much as people love to crown Apple's M-series chips
| as the poster child of what ARM can do, Nvidia's Grace CPUs
| trade blows with the best of the best too.
|
| It leaves one to wonder what could be if they had any
| appetite for devices more in the consumer realm of things.
| pjmlp wrote:
| So basically going back to the old days of Amiga and Atari, in
| a certain sense, when PCs could only display text.
| goku12 wrote:
| I'm not familiar with that history. Could you elaborate?
| pjmlp wrote:
| In the home computer universe, such computers were the
| first ones having a programmable graphics unit that did
| more than paste the framebuffer into the screen.
|
| While the PCs were still displaying text - or, if you were
| lucky enough to own a Hercules card, gray text, or maybe a CGA
| one, with 4 colours.
|
| While the Amigas, which I am more comfortable with, were
| doing this in the mid-80's:
|
| https://www.youtube.com/watch?v=x7Px-ZkObTo
|
| https://www.youtube.com/watch?v=-ga41edXw3A
|
| The original Amiga 1000 had on its motherboard (later
| reduced to fit into an Amiga 500) a Motorola 68000 CPU, a
| programmable sound chip with DMA channels (Paula), and a
| programmable blitter chip (Agnus, an early GPU of sorts).
|
| You would build in RAM the audio or graphics instructions
| for the respective chipset, set the DMA parameters, and let
| them loose.
| nnevatie wrote:
| Hey! I had an Amiga 1000 back in the day - it was simply
| awesome.
| goku12 wrote:
| Thanks! Early computing history is very interesting (I
| know that this wasn't the earliest). They also sometimes
| explain certain odd design decisions that are still
| followed today.
| estimator7292 wrote:
| In the olden days we didn't have GPUs, we had "CRT
| controllers".
|
| What it offered you was a page of memory where each byte
| value mapped to a character in ROM. You feed in your text
| and the controller fetches the character pixels and puts
| them on the display. Later we got ASCII box drawing
| characters. Then we got sprite systems like the NES, where
| the Picture Processing Unit handles loading pixels and
| moving sprites around the screen.
|
| Eventually we moved on to raw framebuffers. You get a big
| chunk of memory and you draw the pixels yourself. The
| hardware was responsible for swapping the framebuffers and
| doing the rendering on the physical display.
|
| Along the way we slowly got more features like defining a
| triangle, its texture, and how to move it, instead of doing
| it all in software.
|
| Up until the 90s when the modern concept of a GPU
| coalesced, we were mainly pushing pixels by hand onto the
| screen. Wild times.
|
| The history of display processing is obviously a lot more
| nuanced than that, it's pretty interesting if that's your
| kind of thing.
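|
| A toy version of that character-generator scheme, for the
| curious (one made-up 8x8 glyph standing in for the ROM):
|
|     FONT_ROM = {65: [0x18, 0x24, 0x42, 0x7e, 0x42, 0x42, 0x42, 0x00]}
|
|     def scanline(text_row: list[int], y: int) -> str:
|         # For each byte of "screen memory", fetch row y of its
|         # glyph - what the controller did as the beam swept.
|         out = []
|         for code in text_row:
|             bits = FONT_ROM.get(code, [0] * 8)[y]
|             out.append("".join("#" if bits & (0x80 >> x) else "."
|                                for x in range(8)))
|         return "".join(out)
|
|     for y in range(8):
|         print(scanline([65, 65], y))  # two 'A's, rendered row by row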
| pjmlp wrote:
| Small addendum, there was already stuff like TMS34010 in
| the 1980's, just not at home.
| cmrdporcupine wrote:
| Those machines multiplexed the bus to split access to memory,
| because RAM speeds were competitive with or faster than the
| CPU bus speed. The CPU and VDP "shared" the memory, but only
| because CPUs were slow enough to make that possible.
|
| We have had the opposite problem for 35+ years at this point.
| The newer architecture machines like the Apple machines, the
| GB10, the AI 395+ _do_ share memory between GPU and CPU but
| in a different way, I believe.
|
| I'd argue with memory becoming suddenly much more expensive
| we'll probably see the opposite trend. I'm going to get me
| one of these GB10 or Strix Halo machines ASAP because I think
| with RAM prices skyrocketing we won't be seeing more of this
| kind of thing in the consumer market for a long time. Or at
| least, prices will not be dropping any time soon.
| pjmlp wrote:
| You are right, hence my "in a certain sense", because I was
| too lazy to point out the differences between a motherboard
| having everything there without pluggable graphics unit[0],
| and having everything now inside of a single chip.
|
| [0] - Not fully correct, as there are/were extensions cards
| that override the bus, thus replacing one of the said
| chips, on Amiga case.
| animal531 wrote:
| It's funny how ideas come and go. I made this very comment here
| on Hacker News probably 4-5 years ago and received a few down
| votes for it at the time (albeit that I was thinking of
| computers in general).
|
| It would take a lot of work to make a GPU do current CPU type
| tasks, but it would be interesting to see how it changes
| parallelism and our approach to logic in code.
| goku12 wrote:
| > I made this very comment here on Hacker News probably 4-5
| years ago and received a few down votes for it at the time
|
| HN isn't always very rational about voting. It would be a loss
| to judge any idea on that basis.
|
| > It would take a lot of work to make a GPU do current CPU
| type tasks
|
| In my opinion, that would be counterproductive. The advantage
| of GPUs is that they have a large number of very simple GPU
| cores. Instead, just do a few separate CPU cores on the same
| die, or on a separate die. Or you could even have a forest of
| GPU cores with a few CPU cores interspersed among them - sort
| of like how modern FPGAs have logic tiles, memory tiles and
| CPU tiles spread out on it. I doubt it would be called a GPU
| at that point.
| Den_VR wrote:
| As I recall, Gartner made the outrageous claim that upwards
| of 70% of all computing will be "AI" in some number of
| years - nearly the end of CPU workloads.
| yetihehe wrote:
| Looking at home computers, most of "computing" when
| counted as flops is done by GPUs anyway, just to show
| more and more frames. Processors are only used to
| organise all that data to be crunched by GPUs. The
| rest is browsing webpages and running Word or Excel
| several times a month.
| deliciousturkey wrote:
| I'd say over 70% of all computing has already been non-CPU
| for years. If you look at your typical phone or laptop
| SoC, the CPU is only a small part. The GPU takes the
| majority of area, with other accelerators also taking
| significant space. Manufacturers would not spend that
| money on silicon, if it was not already used.
| goku12 wrote:
| > I'd say over 70% of all computing has already been
| > non-CPU for years.
|
| > If you look at your typical phone or laptop SoC, the
| CPU is only a small part.
|
| Keep in mind that the die area doesn't always correspond
| to the throughput (average rate) of the computations done
| on it. That area may be allocated for a higher
| computational bandwidth (peak rate) and lower latency. Or
| in other words, get the results of a large number of
| computations faster, even if it means that the circuits
| idle for the rest of the cycles. I don't know the
| situation on mobile SoCs with regards to those
| quantities.
| deliciousturkey wrote:
| This is true, and my example was a very rough metric. But
| the computation density per area is actually way, way
| higher on GPUs compared to CPUs. CPUs only spend a
| tiny fraction of their area doing actual computation.
| PunchyHamster wrote:
| If going by raw operations done, if the given workload
| uses 3d rendering for UI that's probably true for
| computers/laptops. Watching YT video is essentially CPU
| pushing data between internet and GPU's video decoder,
| and to GPU-accelerated UI.
| swiftcoder wrote:
| > If you look at your typical phone or laptop SoC, the
| CPU is only a small part
|
| In mobile SoCs a good chunk of this is power efficiency.
| On a battery-powered device, there's always going to be a
| tradeoff to spend die area making something like 4K video
| playback more power efficient, versus general purpose
| compute
|
| Desktop-focussed SKUs are more liable to spend a metric
| ton of die area on bigger caches close to your compute.
| zozbot234 wrote:
| GPU compute units are not that simple, the main difference
| with CPU is that they generally use a combination of wide
| SIMD and wide SMT to hide latency, as opposed to the power-
| intensive out-of-order processing used by CPUs. Performing
| tasks that can't take advantage of either SIMD or SMT on
| GPU compute units might be a bit wasteful.
|
| Also you'd need to add extra hardware for various OS
| support functions (privilege levels, address space
| translation/MMU) that are currently missing from the GPU.
| But the idea is otherwise sound, you can think of the
| 'Mill' proposed CPU architecture as one variety of it.
| goku12 wrote:
| > GPU compute units are not that simple
|
| Perhaps I should have phrased it differently. CPU and GPU
| cores are designed for different types of loads. The rest
| of your comment seems similar to what I was imagining.
|
| Still, I don't think that enhancing the GPU cores with
| CPU capabilities (OOE, rings, MMU, etc from your
| examples) is the best idea. You may end up with the
| advantages of neither and the disadvantages of both. I
| was suggesting that you could instead have a few
| dedicated CPU cores distributed among the numerous GPU
| cores. Finding the right balance of GPU to CPU cores may
| be the key to achieving the best performance on such a
| system.
| deliciousturkey wrote:
| HN in general is quite clueless about topics like hardware,
| high performance computing, graphics, and AI performance. So
| you probably shouldn't care if you are downvoted, especially
| if you honestly know you are being correct.
|
| Also, I'd say if you buy, for example, a MacBook with an M4 Pro
| chip, it already is a big GPU attached to a small CPU.
| philistine wrote:
| People on here tend to act as if 20% of all computers sold
| were laptops, when it's the reverse.
| sharpneli wrote:
| Is there any need for that? Just have a few good CPUs there
| and you're good to go.
|
| As for what the HW looks like, we already know. Look at Strix
| Halo as an example. We are just getting bigger and bigger
| integrated GPUs. Most of the flops on that chip are in the GPU
| part.
| amelius wrote:
| I still would like to see a general GPU back end for LLVM
| just for fun.
| PunchyHamster wrote:
| It would just make everything worse. Some (if anything, most)
| tasks are far less paralleliseable than typical GPU loads.
| themafia wrote:
| At this point what you really need is an incredibly powerful
| heatsink with some relatively small chips pressed against it.
| whywhywhywhy wrote:
| The trashcan Mac Pro was this idea: a triangular heatsink core
| with CPU+GPU+GPU, one for each side.
| jnwatson wrote:
| If you disassemble a modern GPU, that's what you'll find. 95%
| by weight of a GPU card is cooling related.
| amelius wrote:
| Maybe at the point where you can run Python directly on the
| GPU. At which point the GPU becomes the new CPU.
|
| Anyway, we're still stuck with "G" for "graphics" so it all
| doesn't make much sense and I'm actually looking for a vendor
| that takes its mission more seriously.
| lizknope wrote:
| The vast majority of computers sold today have a CPU / GPU
| integrated together in a single chip. Most ordinary home users
| don't care about GPU or local AI performance that much.
|
| In this video Jeff is interested in GPU accelerated tasks like
| AI and Jellyfin. His last video was using a stack of 4 Mac
| Studios connected by Thunderbolt for AI stuff.
|
| https://www.youtube.com/watch?v=x4_RsUxRjKU
|
| The Apple chips have both powerful CPU and GPU cores, but also
| a huge amount of memory (up to 512GB) directly connected, unlike
| most Nvidia consumer-level GPUs that have far less memory.
| onion2k wrote:
| _Most ordinary home users don't care about GPU or local AI
| performance that much._
|
| Right now, sure. There's a reason why chip manufacturers are
| adding AI pipelines, tensor processors, and 'neural cores'
| though. They believe that running small local models is
| going to be a popular feature in the future. They might be
| right.
| swiftcoder wrote:
| It's mostly marketing gimmicks though - they aren't adding
| anywhere near enough compute for that future. The tensor
| cores in an "AI ready" laptop from a year ago are already
| pretty much irrelevant as far as inferencing current-
| generation models goes.
| zozbot234 wrote:
| NPU/Tensor cores are actually very useful for prompt pre-
| processing, or really any ML inference task that isn't
| strictly bandwidth limited (because you end up wasting a
| lot of bandwidth on padding/dequantizing data to a format
| that the NPU can natively work with, whereas a GPU can
| just do that in registers/local memory). Main issue is
| the limited support in current ML/AI inference
| frameworks.
| cmrdporcupine wrote:
| I mean, that's kind of what's going on at a certain level with
| the AMD Strix Halo, the NVIDIA GB10, and the newer Apple
| machines.
|
| In the sense that the RAM is fully integrated, anyways.
| pjmlp wrote:
| Of course, just go to any computer store, where most gamer setups
| on affordable budgets go with the combo "beefy GPU + an i5"
| instead of an i7 or i9 Intel CPU.
| moebrowne wrote:
| I'd be interested to see if workloads like Folding@home could be
| efficiently run this way. I don't think they need a lot of
| bandwidth.
| haritha-j wrote:
| I currently have a £500 laptop hooked up to an eGPU box with a
| £700 GPU. It's not a bad setup.
| yoan9224 wrote:
| The most interesting takeaway for me is that PCIe bandwidth
| really doesn't bottleneck LLM inference for single-user
| workloads. You're essentially just shuttling the model weights
| once, then the GPU churns through tokens using its own VRAM.
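|
| The one-time cost is easy to estimate (a ~20 GB quantized model
| assumed; usable PCIe rates are approximate):
|
|     model_gb = 20
|     links = {"3.0 x1": 0.985, "3.0 x16": 15.75}  # usable GB/s
|
|     for name, rate in links.items():
|         print(f"PCIe {name}: weight load ~{model_gb / rate:.0f} s")
|     # ~20 s over x1 vs ~1 s over x16 - a startup cost, not a
|     # per-token one.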
|
| This is huge for home lab setups. You can run a Pi 5 with a high-
| end GPU via external enclosure and get 90% of the performance of
| a full workstation at a fraction of the power draw and cost.
|
| The multi-GPU results make sense too - without tensor
| parallelism, you just get pipeline parallelism across layers,
| which is inherently sequential. The GPUs are literally sitting
| idle waiting for the previous layer's output. Exo and similar
| frameworks are trying to solve this but it's still early days.
|
| For anyone considering this: watch out for Resizable BAR
| requirements. Some older boards won't work at all without it.
___________________________________________________________________
(page generated 2025-12-21 23:01 UTC)