[HN Gopher] Making AMD GPUs competitive for LLM inference (2023)
       ___________________________________________________________________
        
       Making AMD GPUs competitive for LLM inference (2023)
        
       Author : plasticchris
       Score  : 256 points
       Date   : 2024-12-24 00:17 UTC (22 hours ago)
        
 (HTM) web link (blog.mlc.ai)
 (TXT) w3m dump (blog.mlc.ai)
        
       | dragontamer wrote:
       | Intriguing. I thought AMD GPUs didn't have tensor cores (or
        | matrix multiplication units) like NVidia. I believe they only
        | have dot product / fused multiply-accumulate instructions.
       | 
       | Are these LLMs just absurdly memory bound so it doesn't matter?
        
         | ryao wrote:
         | They don't, but GPUs were designed for doing matrix
         | multiplications even without the special hardware instructions
         | for doing matrix multiplication tiles. Also, the forward pass
         | for transformers is memory bound, and that is what does token
         | generation.
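          | 
          | A rough back-of-envelope sketch of why that matters (my own
          | illustration, not from the article): every generated token
          | streams essentially all of the weights once, so memory
          | bandwidth sets a hard ceiling on decode speed.
          | 
          |     def max_tok_per_s(model_bytes, bw_bytes_per_s):
          |         # one full pass over the weights per token
          |         return bw_bytes_per_s / model_bytes
          | 
          |     # 7B model at 4-bit (~3.5 GB) on a 1 TB/s GPU:
          |     print(max_tok_per_s(3.5e9, 1e12))  # ~285 tok/s ceiling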
        
           | dragontamer wrote:
           | Well sure, but in other GPU tasks, like Raytracing, the
           | difference between these GPUs is far more pronounced.
           | 
           | And AMD has passable Raytracing units (NVidias are better but
           | the difference is bigger than these LLM results).
           | 
           | If RAM is the main bottleneck then CPUs should be on the
           | table.
        
             | webmaven wrote:
             | RAM is (often) the bottleneck for highly parallel GPUs, but
             | not for CPUs.
             | 
             | Though the distinction between the two categories is
             | blurring.
        
               | ryao wrote:
               | Memory bandwidth is the bottleneck for both when running
               | GEMV, which is the main operation used by token
               | generation in inference. It has always been this way.
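                | 
                | To make that concrete (my own numbers, not from the
                | thread): GEMV does about 2 FLOPs per weight read, far
                | below the roughly 100 FLOPs/byte a modern GPU needs to
                | keep its ALUs busy.
                | 
                |     def gemv_flops_per_byte(n, m, bpw=2.0):
                |         flops = 2.0 * n * m   # mul + add per weight
                |         moved = n * m * bpw   # fp16 weight bytes
                |         return flops / moved  # ~1 FLOP/byte
                | 
                |     # ~1 FLOP/byte vs a ~100 FLOPs/byte ridge point
                |     # (e.g. ~100 TFLOPS at ~1 TB/s), so GEMV is
                |     # bandwidth bound on CPUs and GPUs alike.
                |     print(gemv_flops_per_byte(4096, 4096))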
        
             | IX-103 wrote:
             | > If RAM is the main bottleneck then CPUs should be on the
             | table
             | 
             | That's certainly not the case. The graphics memory model is
             | very different from the CPU memory model. Graphics memory
             | is explicitly designed for multiple simultaneous reads
             | (spread across several different buses) at the cost of
             | generality (only portions of memory may be available on
             | each bus) and speed (the extra complexity means reads are
              | slower). This makes them fast at doing simple operations on
             | a large amount of data.
             | 
             | CPU memory only has one bus, so only a single read can
             | happen at a time (a cache line read), but can happen
             | relatively quickly. So CPUs are better for workloads with
             | high memory locality and frequent reuse of memory locations
             | (as is common in procedural programs).
        
               | dragontamer wrote:
               | > CPU memory only has one bus
               | 
               | If people are paying $15,000 or more per GPU, then I can
               | choose $15,000 CPUs like EPYC that have 12-channels or
               | dual-socket 24-channel RAM.
               | 
               | Even desktop CPUs are dual-channel at a minimum, and
               | arguably DDR5 is closer to 2 or 4 buses per channel.
               | 
               | Now yes, GPU RAM can be faster, but guess what?
               | 
               | https://www.tomshardware.com/pc-components/cpus/amd-
               | crafts-c...
               | 
               | GPUs are about extremely parallel performance, above and
               | beyond what traditional single-threaded (or limited-SIMD)
               | CPUs can do.
               | 
               | But if you're waiting on RAM anyway?? Then the compute-
                | method doesn't matter. It's all about RAM.
        
             | schmidtleonard wrote:
             | CPUs have pitiful RAM bandwidth compared to GPUs. The
             | speeds aren't so different but GPU RAM busses are
             | wiiiiiiiide.
        
               | teleforce wrote:
               | Compute Express Link (CXL) should mostly solve limited
               | RAM with CPU:
               | 
               | 1) Compute Express Link (CXL):
               | 
               | https://en.wikipedia.org/wiki/Compute_Express_Link
               | 
               | PCIe vs. CXL for Memory and Storage:
               | 
               | https://news.ycombinator.com/item?id=38125885
        
               | schmidtleonard wrote:
               | Gigabytes per second? What is this, bandwidth for ants?
               | 
               | My years old pleb tier non-HBM GPU has more than 4 times
               | the bandwidth you would get from a PCIe Gen 7 x16 link,
               | which doesn't even officially exist yet.
        
               | teleforce wrote:
                | Yes, CXL will soon benefit from PCIe Gen 7 x16, with
                | 64GB/s expected in 2025, and non-HBM I/O bandwidth
                | alternatives are improving rapidly. For most
                | near-real-time LLM inference it will be feasible. For
                | the majority of SME companies and other DIY users
                | (humans or ants) running localized LLMs, it should not
                | be an issue [1],[2]. In addition, new techniques for
                | more efficient LLMs are being discovered that reduce
                | memory consumption [3].
               | 
               | [1] Forget ChatGPT: why researchers now run small AIs on
               | their laptops:
               | 
               | https://news.ycombinator.com/item?id=41609393
               | 
               | [2] Welcome to LLMflation - LLM inference cost is going
               | down fast:
               | 
               | https://a16z.com/llmflation-llm-inference-cost/
               | 
               | [3] New LLM optimization technique slashes memory costs
               | up to 75%:
               | 
               | https://news.ycombinator.com/item?id=42411409
        
               | schmidtleonard wrote:
               | No. Memory bandwidth is _the_ important factor for LLM
                | inference. 64GB/s is 4x less than the hypothetical I
               | granted you (Gen7x16 = 256GB/s), which is 4x less than
               | the memory bandwidth on my 2 year old pleb GPU (1TB/s),
               | which is 10x less than a state of the art professional
               | GPU (10TB/s), which is what the cloud services will be
               | using.
               | 
               | That's 160x worse than cloud and 16x worse than what I'm
               | using for local LLM. I am keenly aware of the options for
               | compression. I use them every day. The sacrifices I make
               | to run local LLM cut deep compared to the cloud models,
               | and squeezing it down by another factor of 16 will cut
               | deep on top of cutting deep.
               | 
               | Nothing says it can't be useful. My most-used model is
               | running in a microcontroller. Just keep those
               | expectations tempered.
               | 
               | (EDIT: changed the numbers to reflect red team victory
               | over green team on cloud inference.)
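                | 
                | Turning those tiers into decode ceilings, assuming a
                | ~40 GB (70B-class, 4-bit) model streamed once per token
                | (my own arithmetic, same ratios as above):
                | 
                |     model_gb = 40
                |     tiers = {"CXL x4": 64, "Gen7 x16": 256,
                |              "consumer GPU": 1000, "DC GPU": 10000}
                |     for name, gbps in tiers.items():
                |         # upper bound; real tok/s lands below this
                |         print(name, round(gbps / model_gb), "tok/s")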
        
               | ryao wrote:
               | It is reportedly 242GB/sec due to overhead:
               | 
               | https://en.wikipedia.org/wiki/PCI_Express#PCI_Express_7.0
        
               | Dylan16807 wrote:
               | > 4 times the bandwidth you would get from a PCIe Gen 7
               | x16 link
               | 
               | So you have a full terabyte per second of bandwidth? What
               | GPU is that?
               | 
               | (The 64GB/s number is an x4 link. If you meant you have
               | over four times that, then it sounds like CXL would be
               | pretty competitive.)
        
               | schmidtleonard wrote:
                | https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889
                | 
                |     Memory Size: 24 GB
                |     Memory Type: GDDR6X
                |     Memory Bus: 384 bit
                |     Bandwidth: 1.01 TB/s
               | 
               | Bandwidth between where the LLM is stored and where your
               | matrix*vector multiplies are done is _the_ important
               | figure for inference. You want to measure this in
               | terabytes per second, not gigabytes per second.
               | 
               | A 7900XTX also has 1TB/s on paper, but you'll need
               | awkward workarounds every time you want to do something
               | (see: article) and half of your workloads will stop dead
               | with driver crashes and you need to decide if that's
               | worth $500 to you.
               | 
               | Stacking 3090s is the move if you want to pinch pennies.
               | They have 24GB of memory and 936GB/s of bandwidth each,
               | so almost as good as the 4090, but they're as cheap as
               | the 7900XTX with none of the problems. They aren't as
               | good for gaming or training workloads, but for local
               | inference 3090 is king.
               | 
               | It's not a coincidence that the article lists the same 3
               | cards. These are the 3 cards you should decide between
               | for local LLM, and these are the 3 cards a true
               | competitor should aim to exceed.
        
               | Dylan16807 wrote:
               | A 4090 is not "years old pleb tier". Same for 3090 and
               | 7900XTX.
               | 
               | There's a serious gap between CXL and RAM, but it's not
               | nearly as big as it used to be.
        
               | adrian_b wrote:
               | Already an ancient Radeon VII from 5 years ago had 1
               | terabyte per second of memory bandwidth.
               | 
                | Later consumer GPUs have regressed, and only the RTX
                | 4090 offers the same memory bandwidth in the current
                | NVIDIA generation.
        
               | Dylan16807 wrote:
               | Radeon VII had HBM.
               | 
               | So I can understand a call for returning to HBM, but it's
               | an expensive choice and doesn't fit the description.
        
               | ryao wrote:
               | That seems unlikely given that the full HBM supply for
               | the next year has been earmarked for enterprise GPUs.
               | That said, it would be definitely nice if HBM became
               | available for consumer GPUs.
        
               | ryao wrote:
               | The 3090 Ti and 4090 both have 1.01TB/sec memory
               | bandwidth:
               | 
               | https://www.techpowerup.com/gpu-specs/geforce-
               | rtx-3090-ti.c3...
        
             | ryao wrote:
             | The main bottleneck is memory bandwidth. CPUs have less
             | memory bandwidth than GPUs.
        
         | throwaway314155 wrote:
         | > Are these LLMs just absurdly memory bound so it doesn't
         | matter?
         | 
         | During inference? Definitely. Training is another story.
        
         | boroboro4 wrote:
          | They absolutely do have an equivalent of tensor cores, called
          | matrix cores, and they have dedicated instructions to utilize
          | them (MFMA). Note I'm talking about DC compute chips, like
          | MI300.
          | 
          | LLMs aren't memory bound under production loads; they are
          | pretty much compute bound, at least in the prefill phase, but
          | in practice in general too.
        
           | almostgotcaught wrote:
           | Ya people in these comments don't know what they're talking
           | about (no one ever does in these threads). AMDGPU has had MMA
           | and WMMA for a while now
           | 
           | https://rocm.docs.amd.com/projects/rocWMMA/en/latest/what-
           | is...
        
       | sroussey wrote:
       | [2023]
       | 
       | Btw, this is from MLC-LLM which makes WebLLM and other good
       | stuff.
        
       | throwaway314155 wrote:
       | > Aug 9, 2023
       | 
       | Ignoring the very old (in ML time) date of the article...
       | 
       | What's the catch? People are still struggling with this a year
       | later so I have to assume it doesn't work as well as claimed.
       | 
       | I'm guessing this is buggy in practice and only works for the HF
       | models they chose to test with?
        
         | Const-me wrote:
         | It's not terribly hard to port ML inference to alternative GPU
         | APIs. I did it for D3D11 and the performance is pretty good
         | too: https://github.com/Const-me/Cgml
         | 
         | The only catch is, for some reason developers of ML libraries
         | like PyTorch aren't interested in open GPU APIs like D3D or
          | Vulkan. Instead, they focus on proprietary ones, i.e. CUDA
          | and to a lesser extent ROCm. I don't know why that is.
         | 
          | D3D-based videogames have been making heavy use of GPU
          | compute for more than a decade now. Since Valve shipped the
          | Steam Deck, the same applies to Vulkan on Linux. By now, both
          | technologies are stable, reliable and performant.
        
           | jsheard wrote:
           | Isn't part of it because the first-party libraries like cuDNN
           | are only available through CUDA? Nvidia has poured a ton of
           | effort into tuning those libraries so it's hard to justify
           | not using them.
        
             | Const-me wrote:
             | Unlike training, ML inference is almost always bound by
             | memory bandwidth as opposed to computations. For this
             | reason, tensor cores, cuDNN, and other advanced shenanigans
             | make very little sense for the use case.
             | 
             | OTOH, general-purpose compute instead of fixed-function
             | blocks used by cuDNN enables custom compression algorithms
             | for these weights which does help, by saving memory
             | bandwidth. For example, I did custom 5 bits/weight
             | quantization which works on all GPUs, no hardware support
             | necessary, just simple HLSL codes:
             | https://github.com/Const-me/Cgml?tab=readme-ov-
             | file#bcml1-co...
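              | 
              | For illustration only (a generic block quantizer in
              | NumPy, not the actual BCML1 codec), the bandwidth math
              | looks roughly like this:
              | 
              |     import numpy as np
              | 
              |     def quantize_block(w, bits=5):
              |         levels = 2 ** bits - 1
              |         lo, hi = float(w.min()), float(w.max())
              |         scale = (hi - lo) / levels or 1.0
              |         # indices would be bit-packed in practice
              |         q = np.round((w - lo) / scale).astype(np.uint8)
              |         return q, lo, scale
              | 
              |     def dequantize_block(q, lo, scale):
              |         return q.astype(np.float16) * scale + lo
              | 
              |     # per 32-weight block: fp16 = 64 bytes; packed 5-bit
              |     # indices + fp16 scale/offset ~ 24 bytes, i.e. ~2.7x
              |     # less data to stream per generated token.
              |     w = np.random.randn(32).astype(np.float16)
              |     q, lo, scale = quantize_block(w)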
        
               | boroboro4 wrote:
                | Only local (read: batch size 1) ML inference is memory
                | bound; production loads are pretty much compute bound.
                | The prefill phase is very compute bound, and with
                | continuous batching the generation phase gets mixed in
                | with prefill, which makes the whole process compute
                | bound too. So no, tensor cores and all the other
                | shenanigans are absolutely critical for performant
                | inference infrastructure.
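                | 
                | Rough intuition (my own sketch): with batch size B,
                | every weight fetched from memory is reused for B
                | tokens, so arithmetic intensity grows with B.
                | 
                |     def flops_per_byte(batch, bpw=2.0):
                |         # each fp16 weight byte serves `batch` tokens
                |         return 2.0 * batch / bpw
                | 
                |     for b in (1, 8, 64, 512):
                |         print(b, flops_per_byte(b))
                |     # batch 1 is ~1 FLOP/byte (bandwidth bound); a few
                |     # hundred clears the ~100 FLOPs/byte ridge of a
                |     # modern GPU, i.e. compute bound.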
        
               | Const-me wrote:
                | PyTorch is a project of the Linux Foundation. The about
                | page with the foundation's mission contains phrases like
               | "empowering generations of open source innovators",
               | "democratize code", and "removing barriers to adoption".
               | 
               | I would argue running local inference with batch size=1
               | is more useful for empowering innovators compared to
               | running production loads on shared servers owned by
               | companies. Local inference increases count of potential
               | innovators by orders of magnitude.
               | 
               | BTW, in the long run it may also benefit these companies
               | because in theory, an easy migration path from CUDA puts
               | a downward pressure on nVidia's prices.
        
               | idonotknowwhy wrote:
                | Most people running local inference do so through quants
                | with llama.cpp (which runs on everything) or AWQ/EXL2/MLX
                | with vLLM/TabbyAPI/LM Studio, which are much faster than
                | using PyTorch directly.
        
         | lhl wrote:
         | It depends on what you mean by "this." MLC's catch is that you
         | need to define/compile models for it with TVM. Here is the list
         | of supported model architectures: https://github.com/mlc-
         | ai/mlc-llm/blob/main/python/mlc_llm/m...
         | 
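          | For reference, once you have converted/compiled weights, the
          | Python API looks roughly like this (from memory; check the
          | current MLC docs, the API has moved around a lot):
          | 
          |     from mlc_llm import MLCEngine
          | 
          |     # pre-converted weights from the mlc-ai HF org
          |     model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
          |     engine = MLCEngine(model)
          |     for chunk in engine.chat.completions.create(
          |             messages=[{"role": "user", "content": "hi"}],
          |             model=model, stream=True):
          |         for choice in chunk.choices:
          |             print(choice.delta.content or "", end="")
          |     engine.terminate()
          | 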
         | llama.cpp has a much bigger supported model list, as does vLLM
         | and of course PyTorch/HF transformers covers everything else,
         | all of which work w/ ROCm on RDNA3 w/o too much fuss these
         | days.
         | 
          | For inference, the biggest caveat is that Flash Attention is
          | only available as an aotriton implementation, which besides
          | sometimes being less performant, also doesn't support SWA.
          | For CDNA there is a better CK-based version of FA, but CK
          | does not have RDNA support. There are a couple of people at
          | AMD apparently working on native FlexAttention, so I guess
          | we'll see how that turns out.
         | 
          | (Note the recent SemiAccurate piece was on training, which
          | I'd agree is in a much worse state; I have personal
          | experience with it often being broken for even the simplest
          | distributed training runs. Funnily enough, if you're running
          | simple fine tunes on a single RDNA3 card, you'll probably
          | have a better time. OOTB, a 7900 XTX will train at about the
          | same speed as an RTX 3090. 4090s blow both of those away,
          | but you'll probably want more cards and VRAM, or just move
          | to H100s.)
        
       | shihab wrote:
        | I have come across quite a few startups who are trying a
        | similar idea: break the Nvidia monopoly by utilizing AMD GPUs
        | (for inference at least): Felafax, Lamini, TensorWave
        | (partially), SlashML. I even saw optimistic claims from some
        | of them that the CUDA moat is only 18 months deep [1]. Let's
        | see.
       | 
       | [1]
       | https://www.linkedin.com/feed/update/urn:li:activity:7275885...
        
         | ryukoposting wrote:
         | Peculiar business model, at a glance. It seems like they're
         | doing work that AMD ought to be doing, and is probably doing
         | behind the scenes. Who is the customer for a third-party GPU
         | driver shim?
        
           | tesch1 wrote:
           | AMD. Just one more dot to connect ;)
        
           | dpkirchner wrote:
           | Could be trying to make themselves a target for a big
           | acquihire.
        
             | to11mtm wrote:
             | Cynical take: Try to get acquired by Intel for Arc.
        
               | shiroiushi wrote:
               | More cynical take: this would be a bad strategy, because
               | Intel hasn't shown much competence in its leadership for
               | a long time, especially in regards to GPUs.
        
               | rockskon wrote:
               | They've actually been making positive moves with GPUs
               | lately along with a success story for the B580.
        
               | schmidtleonard wrote:
               | Yeah but MLID says they are losing money on every one and
               | have been winding down the internal development
               | resources. That doesn't bode well for the future.
               | 
               | I want to believe he's wrong, but on the parts of his
               | show where I am in a position to verify, he generally
               | checks out. Whatever the opposite of Gell-Mann Amnesia
               | is, he's got it going for him.
        
               | derektank wrote:
               | Wait, are they losing money on every one in the sense
               | that they haven't broken even on research and development
               | yet? Or in the sense that they cost more to manufacture
               | than they're sold at? Because one is much worse than the
               | other.
        
               | rockskon wrote:
               | They're trying to unseat Radeon as the budget card. That
               | means making a more enticing offer than AMD for a
               | temporary period of time.
        
               | sodality2 wrote:
               | MLID on Intel is starting to become the same as
               | UserBenchmark on AMD (except for the generally reputable
               | sources)... he's beginning to sound like he simply wants
               | Intel to fail, to my insider-info-lacking ears. For
               | competition's sake I _really_ hope that MLID has it wrong
                | (at least the opining about the imminent failure of
                | Intel's GPU division), and that the B series will
                | encourage Intel to push farther to spark more
                | competition in the GPU space.
        
               | oofabz wrote:
               | The die size of the B580 is 272 mm2, which is a lot of
               | silicon for $249. The performance of the GPU is good for
               | its price but bad for its die size. Manufacturing cost is
               | closely tied to die size.
               | 
               | 272 mm2 puts the B580 in the same league as the Radeon
               | 7700XT, a $449 card, and the GeForce 4070 Super, which is
               | $599. The idea that Intel is selling these cards at a
               | loss sounds reasonable to me.
        
               | tjoff wrote:
                | Though you assume the prices of the competition are
                | reasonable. There are plenty of reasons for them not to
                | be: availability issues, lack of competition, other
                | more lucrative avenues, etc.
                | 
                | Intel has none of those, or at least not to the same
                | extent.
        
               | KeplerBoy wrote:
               | At a loss seems a bit overly dramatic. I'd guess Nvidia
               | sells SKUs for three times their marginal cost. Intel is
               | probably operating at cost without any hopes of recouping
               | R&D with the current SKUs, but that's reasonable for an
               | aspiring competitor.
        
               | 7speter wrote:
               | It kinda seems they are covering the cost of throwing
               | massive amounts of resources trying to get Arc's drivers
               | in shape.
        
               | kimixa wrote:
               | B580 being a "success" is purely a business decision as a
               | loss leader to get their name into the market. A larger
               | die on a newer node than either Nvidia or AMD means their
               | per-unit costs _are_ higher, and are selling it at a
               | lower price.
               | 
               | That's not a long-term success strategy. Maybe good for
               | getting your name in the conversation, but not
               | sustainable.
        
               | jvanderbot wrote:
               | I was reading this whole thread as about technical
               | accomplishment and non-nvidia GPU capabilities, not
               | business. So I think you're talking about different
               | definitions of "Success". Definitely counts, but not what
               | I was reading.
        
               | bitmasher9 wrote:
               | It's a long term strategy to release a hardware platform
               | with minimal margins in the beginning to attract software
               | support needed for long term viability.
               | 
               | One of the benefits of being Intel.
        
               | 7speter wrote:
               | I don't know if this matters but while the B580 has a die
               | comparable in size to a 4070 (~280mm^2), it has about
               | half the transistors (~17-18 billion), iirc.
        
               | ryao wrote:
               | Is it a loss leader? I looked up the price of 16Gbit
               | GDDR6 ICs the other day at dramexchange and the cost of
                | 12GB is $48. Using the Gamers Nexus die measurements,
                | we can calculate that they get at least 214 dies per
                | wafer. At $12095 per wafer, that is $57 per die. While
                | defects ordinarily reduce yields, Intel put plenty of
                | redundant transistors into the silicon. This is
                | ordinarily not possible to estimate, but Tom Petersen
                | reported in his interview with Hardware Unboxed that
                | they did not count those when reporting the transistor
                | count. Given that the density based on reported
                | transistors is about 40% less than the density others
                | get from the same process, they likely have enough
                | redundancy to work around defects so that they can use
                | all of those dies.
               | 
               | As for the rest of the card, there is not much in it that
               | would not be part of the price of an $80 Asrock
               | motherboard. The main thing would be the bundled game,
               | which they likely can get in bulk at around $5 per copy.
               | This seems reasonable given how much Epic games pays for
               | their giveaways:
               | 
               | https://x.com/simoncarless/status/1389297530341519362
               | 
               | That brings the total cost to $190. They are selling
               | these at $250. They would not be getting huge margins,
               | but provided enough sales volume (to match the sales
               | volume Asrock gets on their $80 motherboards), Intel
               | should turn a profit on these versus not selling these at
               | all. Of course, they need to share some profit with their
               | board partners. Presumably, they are not sharing much
               | given that their board partners are not able/willing to
               | hit the $250 MSRP. The closest they come to it is $260.
               | 
               | My guess is that their margins are between 10% and 20%.
               | This would be razor thin for Intel, but it is definitely
                | not a loss. Given that the demand is there, the only way
                | that they could lose money on these would be by limiting
                | production such that they do not get the economies of
                | scale needed to reach those margins.
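                | 
                | Restating the arithmetic above in one place:
                | 
                |     wafer = 12095   # $ per wafer
                |     dies = 214      # worst-case dies per wafer
                |     die_cost = wafer / dies   # ~$57
                |     gddr6 = 48      # 12GB of GDDR6
                |     board = 80      # PCB/VRM/cooler, etc.
                |     game = 5        # bundled game, bulk price
                |     bom = die_cost + gddr6 + board + game   # ~$190
                |     msrp = 250
                |     print(bom, (msrp - bom) / msrp)
                |     # ~$190 and ~0.24 gross, before the partner's cut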
        
               | dangero wrote:
               | More cynical take: Trying to get acquired by nvidia
        
               | dizhn wrote:
               | Person below says they (the whole team) already joined
               | Nvidia.
        
               | dogma1138 wrote:
                | Intel is in vastly better shape than AMD; they have the
                | software pretty much nailed down.
        
               | indolering wrote:
               | Tell that to the board.
        
               | bboygravity wrote:
                | Someone never used Intel's Killer WiFi software.
        
               | lhl wrote:
               | I've recently been poking around with Intel oneAPI and
               | IPEX-LLM. While there are things that I find refreshing
               | (like their ability to actually respond to bug reports in
               | a timely manner, or at all) on a whole, support/maturity
               | actually doesn't match the current state of ROCm.
               | 
                | PyTorch requires its own support kit separate from the
                | oneAPI Toolkit (and runs slightly different versions of
                | everything), the vLLM xpu support doesn't work - both
                | source _and_ the docker failed to build/run for me. The
                | IPEX-LLM Whisper support is completely borked, etc, etc.
        
               | moffkalast wrote:
               | I've recently been trying to get IPEX working as well,
               | apparently picking Ubuntu 24.04 was a mistake, because
               | while things compile, everything fails at runtime. I've
               | tried native, docker, different oneAPI versions, threw
               | away a solid week of afternoons for nothing.
               | 
                | SYCL with llama.cpp is great though, at least at FP16
                | since it supports nothing else, but even Arc iGPUs
                | easily give 2-4x the performance of CPU inference.
                | 
                | Intel should've just contributed to SYCL instead of
                | trying to make their own thing and then forgetting to
                | keep maintaining it halfway through.
        
               | lhl wrote:
               | My testing has been w/ a Lunar Lake Core 258V chip (Xe2 -
               | Arc 140V) on Arch Linux. It sounds like you've tried a
                | lot of things already, but in case it helps, my notes
                | for installing llama.cpp and PyTorch:
                | https://llm-tracker.info/howto/Intel-GPUs
               | 
               | I have some benchmarks as well, and the IPEX-LLM backend
               | performed a fair bit better than the SYCL llama.cpp
               | backend for me (almost +50% pp512 and almost 2X tg128) so
               | worth getting it working if you plan on using llama.cpp
               | much on an Intel system. SYCL still performs
               | significantly better than Vulkan and CPU backends,
               | though.
               | 
               | As an end-user, I agree that it'd be way better if they
               | could just contribute upstream somehow (whether to the
                | SYCL backend, or if not possible, to a dependency-
                | minimized IPEX backend). The IPEX backend is one of the
                | _more_ maintained parts of IPEX-LLM, btw. I found a lot
                | of stuff in that repo that depends on versions of
                | oneKit that aren't even downloadable on Intel's site. I
                | couldn't help but smirk when I heard someone say "Intel
                | has their software nailed down."
        
               | moffkalast wrote:
               | Well that's funny, I think we already spoke on Reddit.
               | I'm the guy who was testing the 125H recently. I guess
               | there's like 5 of us who have intel hardware in total and
               | we keep running into each other :P
               | 
               | Honestly I think there's just something seriously broken
               | with the way IPEX expects the GPU driver to be on 24.04
               | and there's nothing I can really do about it except wait
               | for them to fix it if I want to keep using this OS.
               | 
               | I am vaguely considering adding another drive and
               | installing 22.04 or 20.04 with the exact kernel they want
               | to see if that might finally work in the meantime, but
               | honestly I'm fairly satisfied with the speed I get from
               | SYCL already. The problem is more that it's annoying to
                | integrate it directly through the server endpoint;
                | every project expects a damn ollama API or
                | llama-cpp-python these days and I'm a fan of neither
                | since it's just another layer of headaches to get those
                | compiled with SYCL.
               | 
               | > I found a lot of stuff in that repo that depend on
               | versions of oneKit that aren't even downloadable on
               | Intel's site. I couldn't help but smirk when I heard
               | someone say "Intel has their software nailed down."
               | 
               | Yeah well the fact that oneAPI 2025 got released, broke
               | IPEX, and they still haven't figured out a way to patch
               | it for months makes me think it's total chaos internally,
               | where teams work against each other instead of talking
               | and coordinating.
        
             | dboreham wrote:
             | > Could be trying to make themselves a target for a big
             | acquihire.
             | 
             | Is this something anyone sets out to do?
        
               | seeknotfind wrote:
               | Yes.
        
               | ryukoposting wrote:
               | It definitely is, yes.
        
           | dylan604 wrote:
           | It would be interesting to find out AMD is funding these
           | other companies to ensure the shim happens while they focus
           | on not doing it.
        
             | bushbaba wrote:
             | AMD is kind of doing that funding by pricing its GPUs low
             | and/or giving them away at cost to these startups
        
           | shmerl wrote:
           | Is this effort benefiting everyone? I.e. where is it going /
           | is it open source?
        
             | britannio wrote:
             | Some of the work from Tinycorp is:
             | https://github.com/tinygrad/7900xtx
        
         | jsheard wrote:
         | Tinygrad was another one, but they ended up getting frustrated
         | with AMD and semi-pivoted to Nvidia.
        
           | nomel wrote:
            | This is discussed in the Lex Fridman episode. AMD's own demo
            | would kernel panic when run in a loop [1].
           | 
           | [1] https://youtube.com/watch?v=dNrTrx42DGQ&t=3218
        
             | kranke155 wrote:
              | Interesting. I wonder if focusing on GPUs and CPUs is
              | something that requires two companies instead of one, or
              | whether the concentration of resources just leads to one
              | arm of your company being much better than the other.
        
           | noch wrote:
           | > Tinygrad was another one, but they ended up getting
           | frustrated with AMD and semi-pivoted to Nvidia.
           | 
           | From their announcement on 20241219[^0]:
           | 
           | " _We are the only company to get AMD on MLPerf_ , and _we
           | have a completely custom driver that 's 50x simpler than the
           | stock one_. A bit shocked by how little AMD cared, but we'll
           | take the trillions instead of them."
           | 
           | From 20241211[^1]:
           | 
           | "We gave up and soon _tinygrad will depend on 0 AMD code_
            | except what's required by code signing.
           | 
           | We did this for the 7900XTX (tinybox red). If AMD was
           | thinking strategically, they'd be begging us to take some
           | free MI300s to add support for it."
           | 
           | ---
           | 
           | [^0]: https://x.com/__tinygrad__/status/1869620002015572023
           | 
           | [^1]: https://x.com/__tinygrad__/status/1866889544299319606
        
         | pinsiang wrote:
         | AMD GPUs are becoming a serious contender for LLM inference.
          | vLLM is already showing impressive performance on AMD [1],
          | even with consumer-grade Radeon cards, including GGUF support
          | [2]. This could be a game-changer for folks who want to run
          | LLMs without shelling out for expensive NVIDIA hardware.
         | 
         | [1] https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html [2]
         | https://embeddedllm.com/blog/vllm-now-supports-running-gguf-...
        
           | MrBuddyCasino wrote:
            | Fun fact: Nvidia H200s are currently half the price/hr of
            | H100s because people can't get vLLM to work on them.
           | 
           | https://x.com/nisten/status/1871325538335486049
        
             | adrian_b wrote:
             | That seems like a CPU problem, not a GPU problem (due to
             | Aarch64 replacing x86-64).
        
             | ryao wrote:
             | That is GH200 and it is likely due to an amd64 dependency
             | in vLLM.
        
           | treprinum wrote:
           | AMD decided not to release a high-end GPU this cycle so any
           | investment into 7x00 or 6x00 is going to be wasted as Nvidia
           | 5x00 is likely going to destroy any ROI from the older cards
            | and AMD won't have an answer for at least two years,
            | possibly never, due to being non-existent in high-end
            | consumer GPUs usable for compute.
        
         | llama-mini wrote:
          | From Lamini: we have a private AMD GPU cluster, ready to
          | serve anyone who wants to try MI300X or MI250 for inference
          | and tuning.
          | 
          | We just onboarded a customer moving from the OpenAI API to an
          | on-prem solution, currently evaluating MI300X for inference.
         | 
         | Email me at my profile email.
        
         | 3abiton wrote:
          | My understanding is that once JAX takes off, the CUDA
          | advantage is gone for Nvidia. That's a big if/when though.
        
       | jroesch wrote:
        | Note: this is old work, and much of the team working on TVM and
        | MLC was from OctoAI; we have all recently joined NVIDIA.
        
         | sebmellen wrote:
         | Is there no hope for AMD anymore? After George Hotz/Tinygrad
         | gave up on AMD I feel there's no realistic chance of using
         | their chips to break the CUDA dominance.
        
           | llm_trw wrote:
           | Not really.
           | 
           | AMD is constitutionally incapable of shipping anything but
           | mid range hardware that requires no innovation.
           | 
           | The only reason why they are doing so well in CPUs right now
           | is that Intel has basically destroyed itself without any
           | outside help.
        
             | perching_aix wrote:
             | And I'm supposed to believe that HN is this amazing
             | platform for technology and science discussions, totally
             | unlike its peers...
        
               | zamadatix wrote:
                | The above take is worded a bit cynically, but it is
                | their general approach to GPUs lately across the board,
                | e.g.
                | https://www.techpowerup.com/326415/amd-confirms-retreat-
                | from...
                | 
                | Also, I'd take HN as being an amazing platform for the
                | overall consistency and quality of moderation. Anything
                | beyond that depends more on who you're talking to than
                | where you are.
        
               | petesergeant wrote:
               | Maybe be the change you want to see and tell us what the
               | real story is?
        
               | perching_aix wrote:
               | We seem to disagree on what the change in the world I'd
               | like to see is like, which is a real shocker I'm sure.
               | 
               | Personally, I think that's when somebody who has no real
               | information to contribute doesn't try to pretend that
               | they do.
               | 
                | So thanks for the offer, but I think I'm already
                | delivering in that realm.
        
               | llm_trw wrote:
               | I don't really care what you believe.
               | 
                | Everyone who's dug deep into what AMD is doing has left
                | in disgust if they are lucky, and in bankruptcy if they
                | are not.
               | 
               | If I can save someone else from wasting $100,000 on
               | hardware and six months of their life then my post has
               | done more good than the AMD marketing department ever
               | will.
        
               | AnthonyMouse wrote:
               | > If I can save someone else from wasting $100,000 on
               | hardware and six months of their life then my post has
               | done more good than the AMD marketing department ever
               | will.
               | 
                | This seems like unhelpful advice if you've already
                | given up on them.
               | 
               | You tried it and at some point in the past it wasn't
               | ready. But by not being ready they're losing money, so
               | they have a direct incentive to fix it. Which would take
               | a certain amount of time, but once you've given up you no
               | longer know if they've done it yet or not, at which point
               | your advice would be stale.
               | 
               | Meanwhile the people who attempt it apparently seem to
               | get acquired by Nvidia, for some strange reason. Which
               | implies it should be a worthwhile thing to do. If they've
               | fixed it by now which you wouldn't know if you've stopped
               | looking, or they fix it in the near future, you have a
               | competitive advantage because you have access to lower
               | cost GPUs than your rivals. If not, but you've
               | demonstrated a serious attempt to fix it for everyone
               | yourself, Nvidia comes to you with a sack full of money
               | to make sure you don't finish, and then you get a sack
               | full of money. That's win/win, so rather than nobody
               | doing it, it seems like everybody should be doing it.
        
               | llm_trw wrote:
               | I've tried it three times.
               | 
               | I've seen people try it every six months for two decades
               | now.
               | 
               | At some point you just have to accept that AMD is not a
               | serious company, but is a second rate copycat and there
               | is no way to change that without firing everyone from
               | middle management up.
               | 
               | I'm deeply worried about stagnation in the CPU space now
               | that they are top dog and Intel is dead in the water.
               | 
                | Here's hoping China and RISC-V save us.
               | 
               | >Meanwhile the people who attempt it apparently seem to
               | get acquired by Nvidia
               | 
                | Everyone I've seen base jumping has gotten a
                | sponsorship from Red Bull; ergo, everyone should base
                | jump.
               | 
               | Ignore the red smears around the parking lot.
        
               | Const-me wrote:
               | > I've tried it three times
               | 
               | Have you tried compute shaders instead of that weird HPC-
               | only stuff?
               | 
               | Compute shaders are widely used by millions of gamers
               | every day. GPU vendors have huge incentive to make them
               | reliable and efficient: modern game engines are using
                | them for lots of things, e.g. UE5 can even render triangle
               | meshes with GPU compute instead of graphics (the tech is
               | called nanite virtualized geometry). In practice they
               | work fine on all GPUs, ML included:
               | https://github.com/Const-me/Cgml
        
               | AnthonyMouse wrote:
               | > At some point you just have to accept that AMD is not a
               | serious company, but is a second rate copycat and there
               | is no way to change that without firing everyone from
               | middle management up.
               | 
               | AMD has always punched above their weight. Historically
               | their problem was that they were the much smaller company
               | and under heavy resource constraints.
               | 
               | Around the turn of the century the Athlon was faster than
               | the Pentium III and then they made x86 64-bit when Intel
               | was trying to screw everyone with Itanic. But the Pentium
               | 4 was a marketing-optimized design that maximized clock
               | speed at the expense of heat and performance per clock.
               | Intel was outselling them even though the Athlon 64 was
               | at least as good if not better. The Pentium 4 was rubbish
               | for laptops because of the heat problems, so Intel
               | eventually had to design a separate chip for that, but
               | they also had the resources to do it.
               | 
               | That was the point that AMD made their biggest mistake.
               | When they set out to design their next chip the
               | competition was the Pentium 4, so they made a power-
               | hungry monster designed to hit high clock speeds at the
               | expense of performance per clock. But the reason more
               | people didn't buy the Athlon 64 wasn't that they couldn't
               | figure out that a 2.4GHz CPU could be faster than a
               | 2.8GHz CPU, it was all the anti-competitive shenanigans
               | Intel was doing behind closed doors to e.g. keep PC OEMs
               | from featuring systems with AMD CPUs. Meanwhile by then
               | Intel had figured out that the Pentium 4 was, in fact, a
               | bad design, when their own Pentium M laptops started
               | outperforming the Pentium 4 desktops. So the Pentium 4
               | line got canceled and Bulldozer had to go up against the
               | Pentium M-based Core, which nearly bankrupted AMD and
               | compromised their ability to fund the R&D needed to
               | sustain state of the art fabs.
               | 
               | Since then they've been climbing back out of the hole but
               | it wasn't until Ryzen in 2017 that you could safely
               | conclude they weren't on the verge of bankruptcy, and
               | even then they were saddled with a lot of debt and
               | contracts requiring them to use the uncompetitive Global
               | Foundries fabs for several years. It wasn't until Zen4 in
               | 2022 that they finally got to switch the whole package to
               | TSMC.
               | 
               | So until quite recently the answer to the question "why
               | didn't they do X?" was obvious. They didn't have the
               | money. But now they do.
        
               | perching_aix wrote:
               | I'd be very concerned if somebody makes a $100K decision
               | based on a comment where the author couldn't even
               | differentiate between the words "constitutionally" and
               | "institutionally", while providing as much substance as
               | any other random techbro on any random forum and being
               | overwhelmingly oblivious to it.
        
             | ksec wrote:
              | Everything is comparative. AMD isn't perfect. As an ex-
              | shareholder I have argued they did well partly because
              | of Intel's downfall. In terms of execution they are far
              | from perfect.
              | 
              | But Nvidia is a different beast. It is a bit like Apple
              | in the late 00s: take business, forecasting, marketing,
              | operations, software, hardware, sales, etc. Take any
              | part of it and they are all industry leading. And having
              | industry-leading capability is only part of the game;
              | having it all work together is another thing entirely.
              | And unlike Apple, which lost direction once Steve Jobs
              | passed away and wasn't sure how to deploy capital,
              | Jensen is still here, and they have more resources now,
              | making Nvidia even more competitive.
              | 
              | People often underestimate the magnitude of the task
              | required (I like to tell the story about an Intel GPU
              | engineer in 2016 arguing they could take dGPU market
              | share by 2020, and we are now in 2025), overestimate the
              | capability of an organisation, and underestimate the
              | rival's speed of innovation and execution. These three
              | things combined are why most people are often off the
              | estimate by an order of magnitude.
        
               | llm_trw wrote:
               | Yeah, no.
               | 
               | We are in the middle of a monopoly squeeze by NVidia on
               | the most innovative part of the economy right now. I
               | expect the DOJ to hit them harder than they did MS in the
               | 90s given the bullshit they are pulling and the drag on
               | the economy they are causing.
               | 
               | By comparison if AMD could write a driver that didn't
               | shit itself when it had to multiply more than two
               | matrices in a row they'd be selling cards faster than
               | they can make them. You don't need to sell the best
               | shovels in a gold rush to make mountains of money, but
               | you can't sell teaspoons as premium shovels and expect
               | people to come back.
        
               | shiroiushi wrote:
               | >I expect the DOJ to hit them harder than they did MS in
               | the 90s given the bullshit they are pulling and the drag
               | on the economy they are causing.
               | 
               | It sounds like you're expecting extreme competence from
               | the DOJ. Given their history with regulating big tech
               | companies, and even worse, the incoming administration, I
               | think this is a very unrealistic expectation.
        
               | kadoban wrote:
               | What effect did the DOJ have on MS in the 90s? Didn't all
               | of that get rolled back before they had to pay a dime,
               | and all it amounted to was that browser choice screen
               | that was around for a while? Hardly a crippling blow. If
               | anything that showed the weakness of regulators in fights
               | against big tech, just outlast them and you're fine.
        
               | ksec wrote:
               | >We are in the middle of a monopoly squeeze by NVidia on
               | the most innovative part of the economy right now.
               | 
                | I am not sure which part of Nvidia is a monopoly. That
                | is like suggesting TSMC has a monopoly.
        
               | vitus wrote:
               | > That is like suggesting TSMC has a monopoly.
               | 
               | They... do have a monopoly on foundry capacity,
               | especially if you're looking at the most advanced nodes?
               | Nobody's going to Intel or Samsung to build 3nm
               | processors. Hell, there have been whispers over the past
               | month that even Samsung might start outsourcing Exynos to
               | TSMC; Intel already did that with Lunar Lake.
               | 
               | Having a monopoly doesn't mean that you are engaging in
               | anticompetitive behavior, just that you are the only real
               | option in town.
        
               | Vecr wrote:
               | Will they? Given the structure of global controls on
                | GPUs, Nvidia is a de facto self-funding US government
                | company.
               | 
               | Maybe the US will do something if GPU price becomes the
               | limit instead of the supply of chips and power.
        
             | lofaszvanitt wrote:
             | It had to destroy itself. These companies do not act on
             | their own...
        
             | adrian_b wrote:
              | In CPUs, AMD has made many innovations that were copied
              | by Intel only after many years, and this delay
              | contributed significantly to Intel's downfall.
              | 
              | The most important has been that AMD correctly predicted
              | that big monolithic CPUs would no longer be feasible in
              | future CMOS fabrication technologies, so they designed
              | the Zen family from the beginning with a chiplet-based
              | architecture. Intel had attempted to ridicule them, but
              | after losing many billions they have been forced to copy
              | this strategy.
             | 
             | Also in the microarchitecture of their CPUs AMD has made
             | the right choices since the beginning and then they have
             | improved it constantly with each generation. The result is
             | that now the latest Intel big core, Lion Cove, has a
             | microarchitecture that is much more similar to AMD Zen 5
             | than to any of the previous Intel cores, because they had
             | to do this to get a competitive core.
             | 
              | In the distant past, AMD also introduced a lot of
              | innovations long before they were copied by Intel. It is
              | true that those had not been invented by AMD but had
              | been borrowed from more expensive CPUs, like DEC Alpha,
              | Cray, or IBM POWER; still, Intel copied them only after
              | being forced to by the competition with AMD.
        
           | latchkey wrote:
           | https://x.com/dylan522p/status/1871287937268383867
        
             | krackers wrote:
              | That's almost word for word what geohot said last year?
        
               | refulgentis wrote:
               | What part?
               | 
               | I assume the part where she said there's "gaps in the
               | software stack", because that's the only part that's
               | attributed to her.
               | 
               | But I must be wrong because that hasn't been in dispute
               | or in the news in a decade, it's not a geohot discovery
               | from last year.
               | 
                | Hell, I remember a subargument of a subargument re:
                | this being an issue a decade ago in macOS dev (TL;DR:
                | whether to invest in OpenCL)
        
               | bn-l wrote:
                | I went through the thread. There's an argument to be
                | made for firing Su for being so spaced out as to miss
                | an opportunity to get their own CUDA for free.
        
               | hedgehog wrote:
               | Not remotely, how did you get to that idea?
        
               | refulgentis wrote:
                | Kids these days (shakes fist)
                | 
                | tl;dr there's a not-insubstantial # of people who learn
                | a lot from geohot. I'd say about 3% of people here
                | would be confused if you thought of him as less than a
                | top technical expert across many comp sci fields.
               | 
               | And he did the geohot thing recently, way tl;dr: acted
               | like there was a scandal being covered up by AMD around
               | drivers that was causing them to "lose" to nVidia.
               | 
               | He then framed AMD not engaging with him on this topic as
               | further covering-up and choosing to lose.
               | 
               | So if you're of a certain set of experiences, you see an
               | anodyne quote from the CEO that would have been utterly
               | unsurprising dating back to when ATI was still a company,
               | and you'd read it as the CEO breezily admitting in public
               | that geohot was right about how there was malfeasance,
               | followed by a cover up, implying extreme dereliction of
               | duty, because she either helped or didn't realize till
               | now.
               | 
                | I'd argue this is partially due to the stonk-ification
                | of discussions: there was a vague, yet often
                | communicated, sense that something illegal was
                | happening. The idea was that it amounted to financial
                | dereliction of duty to shareholders.
        
           | dismalaf wrote:
           | IMO the hope shouldn't be that AMD specifically wins, rather
           | it's best for consumers that hardware becomes commoditized
           | and prices come down.
           | 
            | And that's what's happening, slowly anyway. Google, Apple and
            | Amazon all have their own AI chips, Intel has Gaudi, AMD has
            | their thing, and the software at least works on more than just
            | Nvidia. Which is a win, even if it's not perfect.
           | I'm personally hoping that everyone piles in on a standard
           | like SYCL.
        
           | comex wrote:
           | Maybe from Modular (the company Chris Lattner is working
           | for). In this recent announcement they said they had achieved
           | competitive ML performance... on NVIDIA GPUs, but with their
           | own custom stack completely replacing CUDA. And they're
           | targeting AMD next.
           | 
           | https://www.modular.com/blog/introducing-max-24-6-a-gpu-
           | nati...
        
             | behnamoh wrote:
             | Ah yes, the programming language (Mojo) that requires an
             | account before I can use it...
        
               | melodyogonna wrote:
               | Mojo no longer requires an account to install.
               | 
               | But that is irrelevant to the conversation because this
               | is not about Mojo but something they call MAX. [1]
               | 
               | 1. https://www.modular.com/max
        
           | quotemstr wrote:
           | The world is bigger than AMD and Nvidia. Plenty of
           | interesting new AI-tuned non-GPU accelerators coming online.
        
             | grigio wrote:
              | I hope so. Name an NPU that can run a 70B model...
        
           | steeve wrote:
            | We (ZML) have AMD MI300X working just fine - in fact, faster
            | than the H100.
        
       | lasermike026 wrote:
       | I believe these efforts are very important. If we want this stuff
       | to be practical we are going to have to work on efficiency. Price
       | efficiency is good. Power and compute efficiency would be better.
       | 
       | I have been playing with llama.cpp to run inference on
       | conventional CPUs. No conclusions yet, but it's interesting. I
       | need to look at llamafile next.
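       | 
       | Something like this is enough to get going through the llama-
       | cpp-python bindings (untested sketch; the model path and thread
       | count are placeholders, not recommendations):
       | 
       |     from llama_cpp import Llama
       | 
       |     # CPU-only inference; tune n_threads to your physical cores
       |     llm = Llama(model_path="llama-2-7b.Q4_K_M.gguf", n_threads=8)
       |     out = llm("Why is token generation memory bound?", max_tokens=64)
       |     print(out["choices"][0]["text"])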
        
       | zamalek wrote:
       | I have been playing around with Phi-4 Q6 on my 7950x and 7900XT
       | (with HSA_OVERRIDE_GFX_VERSION). It's bloody fast, even with CPU
       | alone - in practical terms it beats hosted models due to the
       | roundtrip time. Obviously perf is more important if you're
       | hosting this stuff, but we've definitely reached AMD usability at
       | home.
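       | 
       | For anyone trying to reproduce this, here's roughly how the
       | override gets used with a ROCm build of PyTorch (sketch; the
       | exact gfx version string depends on the card, 11.0.0 being the
       | usual RDNA3 value):
       | 
       |     import os
       |     # must be set before the ROCm runtime initializes
       |     os.environ["HSA_OVERRIDE_GFX_VERSION"] = "11.0.0"
       | 
       |     import torch  # ROCm builds expose the GPU via the CUDA API
       |     if torch.cuda.is_available():
       |         print(torch.cuda.get_device_name(0))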
        
       | latchkey wrote:
       | Previously:
       | 
       |  _Making AMD GPUs competitive for LLM inference_
       | https://news.ycombinator.com/item?id=37066522 (August 9, 2023 --
       | 354 points, 132 comments)
        
       | lxe wrote:
        | A used 3090 is $600-900, performs better than the 7900, and is
        | much more versatile because of CUDA.
        
         | Uehreka wrote:
         | Reality check for anyone considering this: I just got a used
         | 3090 for $900 last month. It works great.
         | 
         | I would not recommend buying one for $600, it probably either
         | won't arrive or will be broken. Someone will reply saying they
         | got one for $600 and it works, that doesn't mean it will happen
         | if you do it.
         | 
         | I'd say the market is realistically $900-1100, maybe $800 if
         | you know the person or can watch the card running first.
         | 
         | All that said, this advice will expire in a month or two when
         | the 5090 comes out.
        
           | idonotknowwhy wrote:
            | I've bought 5 used and they're all perfect. But that's what
            | buyer protection on eBay is for. I had to send back an Epyc
            | mobo with bent pins and eBay handled it fine.
        
       | leonewton253 wrote:
        | This benchmark doesn't look right. Is it using the tensor cores
        | in the Nvidia GPU? AMD does not have AI cores, so it should run
        | noticeably slower.
        
         | nomel wrote:
         | AMD has WMMA.
        
       | mattfrommars wrote:
        | Great. I have yet to understand why the ML community doesn't
        | really push to move away from CUDA. To me, it feels like a
        | dinosaur move to build on top of CUDA, which screams
        | proprietary; nothing about it is open source or cross-platform.
        | 
        | The reason I say it's a dinosaur move is: imagine if we as a dev
        | community had continued to build on top of Flash or Microsoft
        | Silverlight...
        | 
        | LLMs and ML have been around for quite a while; with the pace of
        | AI/LLM advancement, the move to cross-platform should have
        | happened much more quickly. But it hasn't yet, and it's not
        | clear when it will.
        | 
        | Building a translation layer on top of CUDA is not the answer to
        | this problem either.
        
         | dwood_dev wrote:
         | Except I never hear complaints about CUDA from a quality
          | perspective. The complaints are always about lock-in to the
         | best GPUs on the market. The desire to shift away is to make
         | cheaper hardware with inferior software quality more usable.
         | Flash was an abomination, CUDA is not.
        
           | xedrac wrote:
           | Maybe the situation has gotten better in recent years, but my
           | experience with Nvidia toolchains was a complete nightmare
           | back in 2018.
        
             | claytonjy wrote:
              | The CUDA situation is definitely better. The Nvidia
              | struggles are now with the higher-level software they're
              | pushing (Triton, TensorRT-LLM, Riva, etc.), tools that are
              | the most performant option when they work but a garbage
              | developer experience when you step outside the golden path.
        
               | cameron_b wrote:
                | I want to double down on this statement and call
                | attention to the competitive nature of it. Specifically,
                | I have recently tried to set up Triton on ARM hardware.
                | One might presume Nvidia would give attention to an
                | architecture they develop, but the way forward is not
                | easy. On some versions of Ubuntu you might have the
                | correct version of Python (usually older than what's
                | packaged), but the current LTS is out of luck for
                | guidance or packages.
               | 
               | https://github.com/triton-lang/triton/issues/4978
        
           | AnthonyMouse wrote:
           | Flash was popular because it was an attractive platform for
           | the developer. Back then there was no HTML5 and browsers
           | didn't otherwise support a lot of the things Flash did. Flash
           | _Player_ was an abomination, it was crashy and full of
           | security vulnerabilities, but that was a problem for the user
           | rather than the developer and it was the developer choosing
           | what to use to make the site.
           | 
           | This is pretty much exactly what happens with CUDA.
           | Developers like it but then the users have to use expensive
           | hardware with proprietary drivers/firmware, which is the
           | relevant abomination. But users have _some_ ability to
           | influence developers, so as soon as we get the GPU equivalent
           | of HTML5, what happens?
        
             | wqaatwt wrote:
             | > users have to use expensive hardware with proprietary
             | drivers/firmware
             | 
             | What do you mean by that? People trying to run their own
             | models are not "the users" they are a tiny insignificant
             | niche segment.
        
               | AnthonyMouse wrote:
               | There are _far_ more people running llama.cpp, various
               | image generators, etc. than there are people developing
               | that code. Even when the  "users" are corporate entities,
               | they're not necessarily doing any development in excess
               | of integrating the existing code with their other
               | systems.
               | 
               | We're also likely to see a stronger swing away from "do
               | inference in the cloud" because of the aligned incentives
               | of "companies don't want to pay for all that hardware and
               | electricity" and "users have privacy concerns" such that
               | companies doing inference on the local device will have
               | both lower costs and a feature they can advertise over
               | the competition.
               | 
                | What this is waiting for is hardware in the hands of
                | users that can actually do this at a mass-market price,
                | and there is no shortage of companies wanting a piece
                | of that. In particular, Apple is going to push it hard,
                | and despite the price they do a lot of volume. Then
                | you'll start seeing more PCs with high-VRAM GPUs, or
                | iGPUs with dedicated GDDR/HBM on the package, as their
                | competitors want feature parity for the thing everybody
                | is talking about. The cost isn't actually that high,
                | e.g. 40GB of GDDR6 is less than $100.
        
         | idonotknowwhy wrote:
          | For me personally, hacking together projects as a hobbyist,
          | two reasons:
          | 
          | 1. It just works. When I tried to build things on Intel Arcs, I
          | spent way more hours fighting ipex and driver issues than
          | developing.
          | 
          | 2. LLMs seem to have more CUDA code in their training data. I
          | can leverage Claude and 4o to help me build things with CUDA,
          | but trying to get them to help me do the same things with ipex
          | just doesn't work.
          | 
          | I'd very much love a translation layer for CUDA, like a DXVK or
          | Wine equivalent.
          | 
          | It would save a lot of money, since Arc GPUs are in the bargain
          | bin and Nvidia cloud servers are double the price of AMD.
          | 
          | As it stands, my dual Intel Arc rig is just a llama.cpp
          | inference server for the family to use.
        
           | FloatArtifact wrote:
            | What kind of models do you run, and what token throughput do
            | you get on the Intel GPUs?
        
           | jeroenhd wrote:
           | If CUDA counts as "just works", I dread to see the dark,
           | unholy rituals you need to invoke to get ROCm to work. I have
           | spent too many hours browsing the Nvidia forums for obscure
           | error codes and driver messages to ever consider updating my
           | CUDA install and every time I reboot my desktop for an update
           | I dread having to do it all over again.
        
       | pavelstoev wrote:
        | The problem is that performance achievements on AMD consumer-
        | grade GPUs (RX 7900 XTX) are not representative of, or
        | transferable to, the datacenter-grade GPUs (MI300X). Consumer
        | GPUs are based on the RDNA architecture, while datacenter GPUs
        | are based on the CDNA architecture, and AMD is not expected to
        | release the unifying UDNA architecture until sometime around
        | 2026 [1]. At CentML we are currently working on integrating AMD
        | CDNA and HIP support into our Hidet deep learning compiler [2],
        | which will also power inference workloads for all Nvidia GPUs,
        | AMD GPUs, Google TPUs and AWS Inf2 chips on our platform [3].
       | 
       | [1] https://www.jonpeddie.com/news/amd-to-integrate-cdna-and-
       | rdn.... [2] https://centml.ai/hidet/ [3]
       | https://centml.ai/platform/
        
         | llm_trw wrote:
          | The problem is that the specs of AMD consumer-grade GPUs do not
          | translate to compute performance when you try to chain more
          | than one together.
         | 
         | I have 7 NVidia 4090s under my desk happily chugging along on
         | week long training runs. I once managed to get a Radeon VII to
         | run for six hours without shitting itself.
        
           | tspng wrote:
            | Wow, are these 7 RTX 4090s in a single setup? Care to share
            | more about how you built it (case, cooling, power, ...)?
        
             | adakbar wrote:
             | I'd like to know too
        
             | ghxst wrote:
              | You might find the journey of Tinycorp's Tinybox
              | interesting. It's a machine with 6 to 8 4090 GPUs, and
              | you should be able to track down a lot of their hardware
              | choices, including pictures, on their Twitter, plus other
              | info in George's livestreams.
        
             | llm_trw wrote:
             | Basically this but with an extra card on the x8 slot for
             | connecting my monitors:
             | https://www.youtube.com/watch?v=C548PLVwjHA
             | 
             | There's a bunch of similar setups and there are a couple of
             | dozen people that have done something similar on
             | /r/localllama.
        
             | osmarks wrote:
             | Most of these are just an EPYC server platform, some cursed
             | risers and multiple PSUs (though cryptominer server PSU
             | adapters are probably better). See
             | https://nonint.com/2022/05/30/my-deep-learning-rig/ and
             | https://www.mov-axbx.com/wopr/wopr_concept.html.
        
               | Keyframe wrote:
               | Looks like a fire hazard :)
        
           | mpreda wrote:
           | > I have 7 NVidia 4090s under my desk
           | 
           | I have 6 Radeon Pro VII under my desk (in a single system
           | BTW), and they run hard for weeks until I choose to reboot
           | e.g. for Linux kernel updates.
           | 
           | I bought them "new old stock" for $300 apiece. So that's
           | $1800 for all six.
        
             | highwaylights wrote:
             | How does the compute performance compare to 4090's for
             | these workloads?
             | 
              | (I realise it will be significantly lower, just trying to
              | get as much of a comparison as possible.)
        
               | cainxinth wrote:
               | The 4090 offers 82.58 teraflops of single-precision
               | performance compared to the Radeon Pro VII's 13.06
               | teraflops.
        
               | adrian_b wrote:
               | On the other hand, for double precision a Radeon Pro VII
               | is many times faster than a RTX 4090 (due to 1:2 vs. 1:64
               | FP64:FP32 ratio).
               | 
               | Moreover, for workloads limited by the memory bandwidth,
               | a Radeon Pro VII and a RTX 4090 will have about the same
               | speed, regardless what kind of computations are
               | performed. It is said that speed limitation by memory
               | bandwidth happens frequently for ML/AI inferencing.
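                | 
                | Back-of-envelope, using the FP32 numbers quoted
                | above and rough spec-sheet figures (approximate, my
                | own arithmetic):
                | 
                |     # FP64 implied by the FP64:FP32 ratios
                |     rtx_4090_fp64 = 82.58 / 64       # ~1.3 TFLOPS
                |     radeon_pro_vii_fp64 = 13.06 / 2  # ~6.5 TFLOPS
                |     print(radeon_pro_vii_fp64 / rtx_4090_fp64)  # ~5x
                | 
                |     # Bandwidth roofline: each generated token streams
                |     # the whole weight set, so t/s <= BW / model size.
                |     bandwidth_gb_s = 1000  # both cards are ~1 TB/s
                |     weights_gb = 3.9       # e.g. a 7B model at 4-bit
                |     print(bandwidth_gb_s / weights_gb)  # ~256 t/s cap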
        
               | llm_trw wrote:
               | For inference sure, for training: no.
        
               | crest wrote:
                | The Radeon VII is special compared to most older (and
                | current) affordable GPUs in that it used HBM, giving it
                | memory bandwidth comparable to modern cards (~1 TB/s),
                | and it has reasonable FP64 throughput (1:4 instead of
                | 1:64). So this card can still be pretty interesting for
                | running memory-bandwidth-intensive FP64 workloads.
                | Everything affordable released afterward by either AMD
                | or Nvidia crippled realistic FP64 throughput to below
                | what an AVX-512 many-core CPU can do.
        
               | nine_k wrote:
               | If we speak about FP64, are your loads more like fluid
               | dynamics than ML training?
        
             | llm_trw wrote:
             | Are you running ml workloads or solving differential
             | equations?
             | 
             | The two are rather different and one market is worth
             | trillions, the other isn't.
        
               | comboy wrote:
               | I think there is some money to be made in machine
               | learning too.
        
         | zozbot234 wrote:
          | It looks like AMD's CDNA GPUs are supported by Mesa, which
          | ought to suffice for Vulkan Compute and SYCL support. So there
          | should be ways to run ML workloads on the hardware without
          | going through HIP/ROCm.
        
       | aussieguy1234 wrote:
        | I got a "gaming" PC for LLM inference with an RTX 3060. I could
        | have gotten more VRAM for my buck with AMD, but didn't because
        | at the time a lot of inference needed CUDA.
        | 
        | As soon as AMD is as good as Nvidia for inference, I'll switch
        | over.
        | 
        | But I've read on here that their hardware engineers aren't even
        | given enough hardware to test with...
        
       | lhl wrote:
        | Just an FYI, this is a writeup from August 2023 and a lot has
        | changed (for the better!) for RDNA3 AI/ML support.
       | 
       | That being said, I did some very recent inference testing on an
       | W7900 (using the same testing methodology used by Embedded LLM's
       | recent post to compare to vLLM's recently added Radeon GGUF
        | support [1]) and MLC continues to perform quite well. On Llama
        | 3.1 8B, MLC's q4f16_1 (4.21GB weights) performed +35% faster
        | than llama.cpp w/ Q4_K_M on their ROCm/HIP backend (4.30GB
        | weights, a 2% size difference).
       | 
       | That makes MLC still the generally fastest standalone inference
       | engine for RDNA3 by a country mile. However, you have much less
       | flexibility with quants and by and large have to compile your own
       | for every model, so llama.cpp is probably still more flexible for
        | general use. Also, llama.cpp's speculative decoding (recently
        | added to llama-server) can give some pretty sizable
        | performance gains. Using a 70B Q4_K_M + 1B Q8_0 draft model
       | improves output token throughput by 59% on the same ShareGPT
       | testing. I've also been running tests with Qwen2.5-Coder and
       | using a 0.5-3B draft model for speculative decoding gives even
       | bigger gains on average (depends highly on acceptance rate).
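        | 
        | A quick sketch of why acceptance rate dominates, using the
        | usual independent-acceptance approximation (illustrative
        | numbers, not measurements):
        | 
        |     # expected tokens emitted per target-model pass, given
        |     # draft length k and per-token acceptance rate a
        |     def expected_tokens(a: float, k: int) -> float:
        |         return (1 - a ** (k + 1)) / (1 - a)
        | 
        |     for a in (0.5, 0.7, 0.9):
        |         print(a, round(expected_tokens(a, k=5), 2))
        |     # 0.5 -> 1.97, 0.7 -> 2.94, 0.9 -> 4.69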
       | 
       | Note, I think for local use, vLLM GGUF is still not suitable at
       | all. When testing w/ a 70B Q4_K_M model (only 40GB), loading,
       | engine warmup, and graph compilation took on avg 40 minutes.
       | llama.cpp takes 7-8s to load the same model.
       | 
       | At this point for RDNA3, basically everything I need works/runs
       | for my use cases (primarily LLM development and local
       | inferencing), but almost always slower than an RTX 3090/A6000
       | Ampere (a new 24GB 7900 XTX is $850 atm, used or refurbished 24
        | GB RTX 3090s are in the same ballpark, about $800 atm; a new
        | 48GB W7900 goes for $3600 while a 48GB A6000 (Ampere) goes for
        | $4600). The efficiency gap can be sizable. E.g., on my standard
       | llama-bench test w/ llama2-7b-q4_0, the RTX 3090 gets a tg128 of
       | 168 t/s while the 7900 XTX only gets 118 t/s even though both
       | have similar memory bandwidth (936.2 GB/s vs 960 GB/s). It's also
       | worth noting that since the beginning of the year, the llama.cpp
       | CUDA implementation has gotten almost 25% faster, while the ROCm
       | version's performance has stayed static.
       | 
       | There is an actively (solo dev) maintained fork of llama.cpp that
       | sticks close to HEAD but basically applies a rocWMMA patch that
       | can improve performance if you use the llama.cpp FA (still
       | performs worse than w/ FA disabled) and in certain long-context
       | inference generations (on llama-bench and w/ this ShareGPT
       | serving test you won't see much difference) here:
       | https://github.com/hjc4869/llama.cpp - The fact that no one from
       | AMD has shown any interest in helping improve llama.cpp
        | performance (despite often citing llama.cpp-based apps in
        | marketing/blog posts, etc.) is disappointing ... but sadly on
        | brand for AMD GPUs.
       | 
       | Anyway, for those interested in more information and testing for
       | AI/ML setup for RDNA3 (and AMD ROCm in general), I keep a doc
       | with lots of details here: https://llm-tracker.info/howto/AMD-
       | GPUs
       | 
       | [1] https://embeddedllm.com/blog/vllm-now-supports-running-
       | gguf-...
        
       | mrcsharp wrote:
       | I will only consider AMD GPUs for LLM when I can easily make my
       | AMD GPU available within WSL and Docker on Windows.
       | 
       | For now, it is as if AMD does not exist in this field for me.
        
       | Sparkyte wrote:
        | The more players in the market, the better. AI shouldn't be
        | owned by one business.
        
       | melodyogonna wrote:
        | Modular claims that it achieves 93% GPU utilization on AMD GPUs
        | [1], with an official preview release coming early next year;
        | we'll see. I must say I'm bullish because of the feedback I've
        | seen people give about the performance on Nvidia GPUs.
        | 
        | 1. https://www.modular.com/max
        
       | guerrilla wrote:
       | So, does ollama use this work or does it do something else? How
       | does it compare?
        
       ___________________________________________________________________
       (page generated 2024-12-24 23:00 UTC)