[HN Gopher] Making AMD GPUs competitive for LLM inference (2023)
___________________________________________________________________
Making AMD GPUs competitive for LLM inference (2023)
Author : plasticchris
Score : 256 points
Date : 2024-12-24 00:17 UTC (22 hours ago)
(HTM) web link (blog.mlc.ai)
(TXT) w3m dump (blog.mlc.ai)
| dragontamer wrote:
| Intriguing. I thought AMD GPUs didn't have tensor cores (or
| matrix multiplication units) like NVidia. I believe they only
| have dot product / fused multiply-accumulate instructions.
|
| Are these LLMs just absurdly memory bound so it doesn't matter?
| ryao wrote:
| They don't, but GPUs were designed for doing matrix
| multiplications even without the special hardware instructions
| for doing matrix multiplication tiles. Also, the forward pass
| for transformers is memory bound, and that is what does token
| generation.
| dragontamer wrote:
| Well sure, but in other GPU tasks, like Raytracing, the
| difference between these GPUs is far more pronounced.
|
  | And AMD has passable Raytracing units (NVidia's are better, but
  | that gap is bigger than the one these LLM results show).
|
| If RAM is the main bottleneck then CPUs should be on the
| table.
| webmaven wrote:
| RAM is (often) the bottleneck for highly parallel GPUs, but
| not for CPUs.
|
| Though the distinction between the two categories is
| blurring.
| ryao wrote:
| Memory bandwidth is the bottleneck for both when running
| GEMV, which is the main operation used by token
| generation in inference. It has always been this way.
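    |
    | Back-of-the-envelope for why (a rough sketch; assumes every
    | weight is read once per generated token and ignores KV cache
    | traffic):
    |
    |     # Ceiling on single-stream decode speed: each token has
    |     # to stream essentially all of the weights once.
    |     def ceiling(weight_bytes, bw_bytes_per_s):
    |         return bw_bytes_per_s / weight_bytes
    |
    |     weights = 4e9  # roughly a 4-bit 7B model
    |     for name, bw in [("dual-channel DDR5", 90e9),
    |                      ("GDDR6X at 1 TB/s", 1.0e12)]:
    |         print(name, round(ceiling(weights, bw)), "tok/s max")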
| IX-103 wrote:
| > If RAM is the main bottleneck then CPUs should be on the
| table
|
| That's certainly not the case. The graphics memory model is
| very different from the CPU memory model. Graphics memory
| is explicitly designed for multiple simultaneous reads
| (spread across several different buses) at the cost of
| generality (only portions of memory may be available on
| each bus) and speed (the extra complexity means reads are
   | slower). This makes them fast at doing simple operations on
| a large amount of data.
|
| CPU memory only has one bus, so only a single read can
| happen at a time (a cache line read), but can happen
| relatively quickly. So CPUs are better for workloads with
| high memory locality and frequent reuse of memory locations
| (as is common in procedural programs).
| dragontamer wrote:
| > CPU memory only has one bus
|
| If people are paying $15,000 or more per GPU, then I can
| choose $15,000 CPUs like EPYC that have 12-channels or
| dual-socket 24-channel RAM.
|
| Even desktop CPUs are dual-channel at a minimum, and
| arguably DDR5 is closer to 2 or 4 buses per channel.
|
| Now yes, GPU RAM can be faster, but guess what?
|
| https://www.tomshardware.com/pc-components/cpus/amd-
| crafts-c...
|
| GPUs are about extremely parallel performance, above and
| beyond what traditional single-threaded (or limited-SIMD)
| CPUs can do.
|
    | But if you're waiting on RAM anyway?? Then the compute
    | method doesn't matter. It's all about RAM.
| schmidtleonard wrote:
| CPUs have pitiful RAM bandwidth compared to GPUs. The
| speeds aren't so different but GPU RAM busses are
| wiiiiiiiide.
| teleforce wrote:
      | Compute Express Link (CXL) should mostly solve the
      | limited-RAM problem for CPUs:
|
| 1) Compute Express Link (CXL):
|
| https://en.wikipedia.org/wiki/Compute_Express_Link
|
| PCIe vs. CXL for Memory and Storage:
|
| https://news.ycombinator.com/item?id=38125885
| schmidtleonard wrote:
| Gigabytes per second? What is this, bandwidth for ants?
|
| My years old pleb tier non-HBM GPU has more than 4 times
| the bandwidth you would get from a PCIe Gen 7 x16 link,
| which doesn't even officially exist yet.
| teleforce wrote:
        | Yes, CXL will soon benefit from PCIe Gen 7 x16 with an
        | expected 64GB/s in 2025, and the non-HBM bandwidth of
        | I/O alternatives is increasing rapidly by the day. For
        | most near real-time LLM inference it will be feasible.
        | For the majority of SME companies and other DIY users
        | (humans or ants) running localized LLMs, there should
        | not be any issues [1],[2]. In addition, new techniques
        | for more efficient LLMs are being discovered that
        | reduce memory consumption [3].
|
| [1] Forget ChatGPT: why researchers now run small AIs on
| their laptops:
|
| https://news.ycombinator.com/item?id=41609393
|
| [2] Welcome to LLMflation - LLM inference cost is going
| down fast:
|
| https://a16z.com/llmflation-llm-inference-cost/
|
| [3] New LLM optimization technique slashes memory costs
| up to 75%:
|
| https://news.ycombinator.com/item?id=42411409
| schmidtleonard wrote:
| No. Memory bandwidth is _the_ important factor for LLM
         | inference. 64GB/s is 4x less than the hypothetical I
| granted you (Gen7x16 = 256GB/s), which is 4x less than
| the memory bandwidth on my 2 year old pleb GPU (1TB/s),
| which is 10x less than a state of the art professional
| GPU (10TB/s), which is what the cloud services will be
| using.
|
| That's 160x worse than cloud and 16x worse than what I'm
| using for local LLM. I am keenly aware of the options for
| compression. I use them every day. The sacrifices I make
| to run local LLM cut deep compared to the cloud models,
| and squeezing it down by another factor of 16 will cut
| deep on top of cutting deep.
|
| Nothing says it can't be useful. My most-used model is
| running in a microcontroller. Just keep those
| expectations tempered.
|
| (EDIT: changed the numbers to reflect red team victory
| over green team on cloud inference.)
| ryao wrote:
| It is reportedly 242GB/sec due to overhead:
|
| https://en.wikipedia.org/wiki/PCI_Express#PCI_Express_7.0
| Dylan16807 wrote:
| > 4 times the bandwidth you would get from a PCIe Gen 7
| x16 link
|
| So you have a full terabyte per second of bandwidth? What
| GPU is that?
|
| (The 64GB/s number is an x4 link. If you meant you have
| over four times that, then it sounds like CXL would be
| pretty competitive.)
| schmidtleonard wrote:
         | https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889
         |   Memory Size: 24 GB
         |   Memory Type: GDDR6X
         |   Memory Bus: 384 bit
         |   Bandwidth: 1.01 TB/s
|
| Bandwidth between where the LLM is stored and where your
| matrix*vector multiplies are done is _the_ important
| figure for inference. You want to measure this in
| terabytes per second, not gigabytes per second.
|
| A 7900XTX also has 1TB/s on paper, but you'll need
| awkward workarounds every time you want to do something
| (see: article) and half of your workloads will stop dead
| with driver crashes and you need to decide if that's
| worth $500 to you.
|
| Stacking 3090s is the move if you want to pinch pennies.
| They have 24GB of memory and 936GB/s of bandwidth each,
| so almost as good as the 4090, but they're as cheap as
| the 7900XTX with none of the problems. They aren't as
| good for gaming or training workloads, but for local
| inference 3090 is king.
|
| It's not a coincidence that the article lists the same 3
| cards. These are the 3 cards you should decide between
| for local LLM, and these are the 3 cards a true
| competitor should aim to exceed.
| Dylan16807 wrote:
| A 4090 is not "years old pleb tier". Same for 3090 and
| 7900XTX.
|
| There's a serious gap between CXL and RAM, but it's not
| nearly as big as it used to be.
| adrian_b wrote:
| Already an ancient Radeon VII from 5 years ago had 1
| terabyte per second of memory bandwidth.
|
| Later consumer GPUs have regressed and only RTX 4090
| offers the same memory bandwidth in the current NVIDIA
| generation.
| Dylan16807 wrote:
| Radeon VII had HBM.
|
| So I can understand a call for returning to HBM, but it's
| an expensive choice and doesn't fit the description.
| ryao wrote:
| That seems unlikely given that the full HBM supply for
| the next year has been earmarked for enterprise GPUs.
| That said, it would be definitely nice if HBM became
| available for consumer GPUs.
| ryao wrote:
| The 3090 Ti and 4090 both have 1.01TB/sec memory
| bandwidth:
|
| https://www.techpowerup.com/gpu-specs/geforce-
| rtx-3090-ti.c3...
| ryao wrote:
| The main bottleneck is memory bandwidth. CPUs have less
| memory bandwidth than GPUs.
| throwaway314155 wrote:
| > Are these LLMs just absurdly memory bound so it doesn't
| matter?
|
| During inference? Definitely. Training is another story.
| boroboro4 wrote:
 | They absolutely do have cores similar to tensor cores; they're
| called matrix cores. And they have particular instructions to
| utilize them (MFMA). Note I'm talking about DC compute chips,
| like MI300.
|
 | LLMs aren't memory bound under production loads; they are
 | pretty much compute bound, at least in the prefill phase, but
 | in practice in general too.
| almostgotcaught wrote:
| Ya people in these comments don't know what they're talking
| about (no one ever does in these threads). AMDGPU has had MMA
| and WMMA for a while now
|
| https://rocm.docs.amd.com/projects/rocWMMA/en/latest/what-
| is...
| sroussey wrote:
| [2023]
|
| Btw, this is from MLC-LLM which makes WebLLM and other good
| stuff.
| throwaway314155 wrote:
| > Aug 9, 2023
|
| Ignoring the very old (in ML time) date of the article...
|
| What's the catch? People are still struggling with this a year
| later so I have to assume it doesn't work as well as claimed.
|
| I'm guessing this is buggy in practice and only works for the HF
| models they chose to test with?
| Const-me wrote:
| It's not terribly hard to port ML inference to alternative GPU
| APIs. I did it for D3D11 and the performance is pretty good
| too: https://github.com/Const-me/Cgml
|
| The only catch is, for some reason developers of ML libraries
| like PyTorch aren't interested in open GPU APIs like D3D or
| Vulkan. Instead, they focus on proprietary ones i.e. CUDA and
 | to a lesser extent ROCm. I don't know why that is.
|
 | D3D-based videogames have been using GPU compute heavily for
 | more than a decade. Since Valve shipped SteamDeck, the same now
| applies to Vulkan on Linux. By now, both technologies are
| stable, reliable and performant.
| jsheard wrote:
| Isn't part of it because the first-party libraries like cuDNN
| are only available through CUDA? Nvidia has poured a ton of
| effort into tuning those libraries so it's hard to justify
| not using them.
| Const-me wrote:
| Unlike training, ML inference is almost always bound by
| memory bandwidth as opposed to computations. For this
| reason, tensor cores, cuDNN, and other advanced shenanigans
| make very little sense for the use case.
|
| OTOH, general-purpose compute instead of fixed-function
| blocks used by cuDNN enables custom compression algorithms
   | for these weights, which does help by saving memory
| bandwidth. For example, I did custom 5 bits/weight
| quantization which works on all GPUs, no hardware support
   | necessary, just simple HLSL code:
| https://github.com/Const-me/Cgml?tab=readme-ov-
| file#bcml1-co...
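   |
   | A minimal sketch of the idea in plain Python/NumPy (not the
   | actual Cgml shaders; the block size and the scale/offset
   | format here are made up for illustration):
   |
   |     import numpy as np
   |
   |     def quantize_5bit(block):
   |         # Map a block of 32 float weights onto 0..31 integer
   |         # codes with a per-block scale and offset.
   |         scale = (block.max() - block.min()) / 31.0 + 1e-12
   |         zero = block.min()
   |         q = np.clip(np.round((block - zero) / scale), 0, 31)
   |         return q.astype(np.uint8), scale, zero
   |
   |     def dequantize_5bit(q, scale, zero):
   |         # Done on the fly at inference time; the real thing
   |         # bit-packs the 5-bit codes instead of keeping uint8
   |         # like this sketch does.
   |         return q.astype(np.float32) * scale + zero
   |
   |     w = np.random.randn(32).astype(np.float32)
   |     q, s, z = quantize_5bit(w)
   |     print(np.abs(dequantize_5bit(q, s, z) - w).max())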
| boroboro4 wrote:
    | Only local (read: batch size 1) ML inference is memory
    | bound; production loads are pretty much compute bound.
    | The prefill phase is very compute bound, and with
    | continuous batching the generation phase gets mixed with
    | prefill, which makes the whole process compute bound too.
    | So no, tensor cores and all the other shenanigans are
    | absolutely critical for performant inference
    | infrastructure.
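    |
    | A rough roofline sketch of why batch size matters (ballpark
    | H100-class peak numbers, FP16 weights, KV-cache and
    | activation traffic ignored):
    |
    |     # A weight matrix applied to a batch of B tokens costs
    |     # ~2*B*params flops but only ~2*params bytes of weight
    |     # traffic, so arithmetic intensity ~= B flops per byte.
    |     def regime(batch, peak_flops=1e15, peak_bw=3.35e12):
    |         ridge = peak_flops / peak_bw  # ~300 flops/byte
    |         return "memory" if batch < ridge else "compute"
    |
    |     for b in (1, 8, 64, 512):
    |         print(b, regime(b), "bound")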
| Const-me wrote:
| PyTorch is a project by Linux foundation. The about page
| with the mission of the foundation contains phrases like
| "empowering generations of open source innovators",
| "democratize code", and "removing barriers to adoption".
|
| I would argue running local inference with batch size=1
| is more useful for empowering innovators compared to
| running production loads on shared servers owned by
| companies. Local inference increases count of potential
| innovators by orders of magnitude.
|
| BTW, in the long run it may also benefit these companies
| because in theory, an easy migration path from CUDA puts
| a downward pressure on nVidia's prices.
| idonotknowwhy wrote:
      | Most people running local inference do so through quants
| with llamacpp (which runs on everything) or awq/exl2/mlx
      | with vllm/tabbyAPI/lmstudio, which are much faster than
| using pytorch directly
| lhl wrote:
| It depends on what you mean by "this." MLC's catch is that you
| need to define/compile models for it with TVM. Here is the list
| of supported model architectures: https://github.com/mlc-
| ai/mlc-llm/blob/main/python/mlc_llm/m...
|
| llama.cpp has a much bigger supported model list, as does vLLM
| and of course PyTorch/HF transformers covers everything else,
| all of which work w/ ROCm on RDNA3 w/o too much fuss these
| days.
|
| For inference, the biggest caveat is that Flash Attention is
| only an aotriton implementation, which besides being less
 | performant sometimes, also doesn't support SWA. For CDNA there
 | is a better CK-based version of FA, but CK does not have
 | RDNA support. There are a couple of people at AMD apparently
 | working on native FlexAttention, so I guess we'll see how that
 | turns out.
|
 | Note: the recent SemiAccurate piece was on training, which I'd
| agree is in a much worse state (I have personal experience with
| it being often broken for even the simplest distributed
| training runs). Funnily enough, if you're running simple fine
| tunes on a single RDNA3 card, you'll probably have a better
| time. OOTB, a 7900 XTX will train at about the same speed as an
| RTX 3090 (4090s blow both of those away, but you'll probably
 | want more cards and VRAM, or just move to H100s).
| shihab wrote:
| I have come across quite a few startups who are trying a similar
| idea: break the nvidia monopoly by utilizing AMD GPUs (for
| inference at least): Felafax, Lamini, tensorwave (partially),
| SlashML. Some of them have even made optimistic claims like the
| CUDA moat being only 18 months deep [1]. Let's see.
|
| [1]
| https://www.linkedin.com/feed/update/urn:li:activity:7275885...
| ryukoposting wrote:
| Peculiar business model, at a glance. It seems like they're
| doing work that AMD ought to be doing, and is probably doing
| behind the scenes. Who is the customer for a third-party GPU
| driver shim?
| tesch1 wrote:
| AMD. Just one more dot to connect ;)
| dpkirchner wrote:
| Could be trying to make themselves a target for a big
| acquihire.
| to11mtm wrote:
| Cynical take: Try to get acquired by Intel for Arc.
| shiroiushi wrote:
| More cynical take: this would be a bad strategy, because
| Intel hasn't shown much competence in its leadership for
| a long time, especially in regards to GPUs.
| rockskon wrote:
| They've actually been making positive moves with GPUs
| lately along with a success story for the B580.
| schmidtleonard wrote:
| Yeah but MLID says they are losing money on every one and
| have been winding down the internal development
| resources. That doesn't bode well for the future.
|
| I want to believe he's wrong, but on the parts of his
| show where I am in a position to verify, he generally
| checks out. Whatever the opposite of Gell-Mann Amnesia
| is, he's got it going for him.
| derektank wrote:
| Wait, are they losing money on every one in the sense
| that they haven't broken even on research and development
| yet? Or in the sense that they cost more to manufacture
| than they're sold at? Because one is much worse than the
| other.
| rockskon wrote:
| They're trying to unseat Radeon as the budget card. That
| means making a more enticing offer than AMD for a
| temporary period of time.
| sodality2 wrote:
| MLID on Intel is starting to become the same as
| UserBenchmark on AMD (except for the generally reputable
| sources)... he's beginning to sound like he simply wants
| Intel to fail, to my insider-info-lacking ears. For
| competition's sake I _really_ hope that MLID has it wrong
       | (at least the opining about the imminent failure of
       | Intel's GPU division), and that the B series will encourage
| Intel to push farther to spark more competition in the
| GPU space.
| oofabz wrote:
| The die size of the B580 is 272 mm2, which is a lot of
| silicon for $249. The performance of the GPU is good for
| its price but bad for its die size. Manufacturing cost is
| closely tied to die size.
|
| 272 mm2 puts the B580 in the same league as the Radeon
| 7700XT, a $449 card, and the GeForce 4070 Super, which is
| $599. The idea that Intel is selling these cards at a
| loss sounds reasonable to me.
| tjoff wrote:
| Though you assume the prices of the competition are
| reasonable. There are plenty of reasons for them not to
| be. Availability issues, lack of competition, other more
| lucrative avenues etc.
|
         | Intel has none of those, or at least not as many of them.
| KeplerBoy wrote:
| At a loss seems a bit overly dramatic. I'd guess Nvidia
| sells SKUs for three times their marginal cost. Intel is
| probably operating at cost without any hopes of recouping
| R&D with the current SKUs, but that's reasonable for an
| aspiring competitor.
| 7speter wrote:
| It kinda seems they are covering the cost of throwing
| massive amounts of resources trying to get Arc's drivers
| in shape.
| kimixa wrote:
| B580 being a "success" is purely a business decision as a
| loss leader to get their name into the market. A larger
| die on a newer node than either Nvidia or AMD means their
| per-unit costs _are_ higher, and are selling it at a
| lower price.
|
| That's not a long-term success strategy. Maybe good for
| getting your name in the conversation, but not
| sustainable.
| jvanderbot wrote:
| I was reading this whole thread as about technical
| accomplishment and non-nvidia GPU capabilities, not
| business. So I think you're talking about different
| definitions of "Success". Definitely counts, but not what
| I was reading.
| bitmasher9 wrote:
| It's a long term strategy to release a hardware platform
| with minimal margins in the beginning to attract software
| support needed for long term viability.
|
| One of the benefits of being Intel.
| 7speter wrote:
| I don't know if this matters but while the B580 has a die
| comparable in size to a 4070 (~280mm^2), it has about
| half the transistors (~17-18 billion), iirc.
| ryao wrote:
| Is it a loss leader? I looked up the price of 16Gbit
| GDDR6 ICs the other day at dramexchange and the cost of
| 12GB is $48. Using the gamer nexus die measurements, we
| can calculate that they get at least 214 dies per wafer.
         | At $12095 per wafer, that is $57 per die.
         |
         | While defects ordinarily reduce yields, Intel put plenty
         | of redundant transistors into the
| silicon. This is ordinarily not possible to estimate, but
| Tom Petersen reported in his interview with hardware
| unboxed that they did not count those when reporting the
| transistor count. Given that the density based on
| reported transistors is about 40% less than the density
| others get from the same process, they likely have enough
| redundancy to work around defects that they can use all
| of those dies.
|
| As for the rest of the card, there is not much in it that
| would not be part of the price of an $80 Asrock
| motherboard. The main thing would be the bundled game,
| which they likely can get in bulk at around $5 per copy.
| This seems reasonable given how much Epic games pays for
| their giveaways:
|
| https://x.com/simoncarless/status/1389297530341519362
|
| That brings the total cost to $190. They are selling
| these at $250. They would not be getting huge margins,
| but provided enough sales volume (to match the sales
| volume Asrock gets on their $80 motherboards), Intel
| should turn a profit on these versus not selling these at
| all. Of course, they need to share some profit with their
| board partners. Presumably, they are not sharing much
| given that their board partners are not able/willing to
| hit the $250 MSRP. The closest they come to it is $260.
|
| My guess is that their margins are between 10% and 20%.
| This would be razor thin for Intel, but it is definitely
         | not a loss. Given that the demand is there, the only way
         | that they could lose money on these would be by limiting
         | production such that they do not get the economies of
         | scale needed to reach those margins.
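         |
         | Rough version of that arithmetic (every input is one of
         | the estimates above, not an Intel number, so treat the
         | output the same way):
         |
         |     gddr6 = 48          # 12GB of 16Gbit GDDR6 ICs
         |     die = 12095 / 214   # wafer price / dies per wafer
         |     board = 80          # rest of the card, ~$80 board
         |     game = 5            # bundled game, bulk copy
         |     cost = gddr6 + die + board + game
         |     print(f"cost ~${cost:.0f}, margin before partner"
         |           f" share {(250 - cost) / 250:.0%}")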
| dangero wrote:
| More cynical take: Trying to get acquired by nvidia
| dizhn wrote:
| Person below says they (the whole team) already joined
| Nvidia.
| dogma1138 wrote:
    | Intel is in vastly better shape than AMD; they have the
| software pretty much nailed down.
| indolering wrote:
| Tell that to the board.
| bboygravity wrote:
| Someone never used intel killer wifi software.
| lhl wrote:
| I've recently been poking around with Intel oneAPI and
| IPEX-LLM. While there are things that I find refreshing
| (like their ability to actually respond to bug reports in
     | a timely manner, or at all), on the whole, support/maturity
| actually doesn't match the current state of ROCm.
|
     | PyTorch requires its own support kit separate from the
| oneAPI Toolkit (and runs slightly different versions of
| everything), the vLLM xpu support doesn't work - both
     | source _and_ the docker failed to build/run for me. The
| IPEX-LLM whisper support is completely borked, etc, etc.
| moffkalast wrote:
| I've recently been trying to get IPEX working as well,
| apparently picking Ubuntu 24.04 was a mistake, because
| while things compile, everything fails at runtime. I've
| tried native, docker, different oneAPI versions, threw
| away a solid week of afternoons for nothing.
|
      | SYCL with llama.cpp is great though, at least at FP16
      | since it supports nothing else, but even Arc iGPUs easily
      | give 2-4x the performance of CPU inference.
      |
      | Intel should've just contributed to SYCL instead of
      | trying to make their own thing and then forgetting to
      | keep maintaining it halfway through.
| lhl wrote:
| My testing has been w/ a Lunar Lake Core 258V chip (Xe2 -
| Arc 140V) on Arch Linux. It sounds like you've tried a
       | lot of things already, but in case it helps, my notes for
| installing llama.cpp and PyTorch: https://llm-
| tracker.info/howto/Intel-GPUs
|
| I have some benchmarks as well, and the IPEX-LLM backend
| performed a fair bit better than the SYCL llama.cpp
| backend for me (almost +50% pp512 and almost 2X tg128) so
| worth getting it working if you plan on using llama.cpp
| much on an Intel system. SYCL still performs
| significantly better than Vulkan and CPU backends,
| though.
|
| As an end-user, I agree that it'd be way better if they
| could just contribute upstream somehow (whether to the
       | SYCL backend, or if not possible, to a dependency-minimized
| IPEX backend). the IPEX backend is one of the _more_
| maintained parts of IPEX-LLM, btw. I found a lot of stuff
       | in that repo that depends on versions of oneKit that aren't
       | even downloadable on Intel's site. I couldn't help but
| smirk when I heard someone say "Intel has their software
| nailed down."
| moffkalast wrote:
| Well that's funny, I think we already spoke on Reddit.
| I'm the guy who was testing the 125H recently. I guess
| there's like 5 of us who have intel hardware in total and
| we keep running into each other :P
|
| Honestly I think there's just something seriously broken
| with the way IPEX expects the GPU driver to be on 24.04
| and there's nothing I can really do about it except wait
| for them to fix it if I want to keep using this OS.
|
| I am vaguely considering adding another drive and
| installing 22.04 or 20.04 with the exact kernel they want
| to see if that might finally work in the meantime, but
| honestly I'm fairly satisfied with the speed I get from
| SYCL already. The problem is more that it's annoying to
        | integrate it directly through the server endpoint; every
        | project expects a damn ollama api or llama-cpp-python
        | these days, and I'm a fan of neither since it's just
| another layer of headaches to get those compiled with
| SYCL.
|
| > I found a lot of stuff in that repo that depend on
| versions of oneKit that aren't even downloadable on
| Intel's site. I couldn't help but smirk when I heard
| someone say "Intel has their software nailed down."
|
| Yeah well the fact that oneAPI 2025 got released, broke
| IPEX, and they still haven't figured out a way to patch
| it for months makes me think it's total chaos internally,
| where teams work against each other instead of talking
| and coordinating.
| dboreham wrote:
| > Could be trying to make themselves a target for a big
| acquihire.
|
| Is this something anyone sets out to do?
| seeknotfind wrote:
| Yes.
| ryukoposting wrote:
| It definitely is, yes.
| dylan604 wrote:
  | It would be interesting to find out that AMD is funding these
| other companies to ensure the shim happens while they focus
| on not doing it.
| bushbaba wrote:
| AMD is kind of doing that funding by pricing its GPUs low
| and/or giving them away at cost to these startups
| shmerl wrote:
| Is this effort benefiting everyone? I.e. where is it going /
| is it open source?
| britannio wrote:
| Some of the work from Tinycorp is:
| https://github.com/tinygrad/7900xtx
| jsheard wrote:
| Tinygrad was another one, but they ended up getting frustrated
| with AMD and semi-pivoted to Nvidia.
| nomel wrote:
  | This is discussed in the Lex Fridman episode. AMD's own demo
| would kernel panic when run in a loop [1].
|
| [1] https://youtube.com/watch?v=dNrTrx42DGQ&t=3218
| kranke155 wrote:
| Interesting. I wonder if focusing on GPUs and CPUs is
   | something that requires two companies instead of one, or
| whether the concentration of resources just leads to one
| arm of your company being much better than the other.
| noch wrote:
| > Tinygrad was another one, but they ended up getting
| frustrated with AMD and semi-pivoted to Nvidia.
|
| From their announcement on 20241219[^0]:
|
| " _We are the only company to get AMD on MLPerf_ , and _we
| have a completely custom driver that 's 50x simpler than the
| stock one_. A bit shocked by how little AMD cared, but we'll
| take the trillions instead of them."
|
| From 20241211[^1]:
|
| "We gave up and soon _tinygrad will depend on 0 AMD code_
  | except what's required by code signing.
|
| We did this for the 7900XTX (tinybox red). If AMD was
| thinking strategically, they'd be begging us to take some
| free MI300s to add support for it."
|
| ---
|
| [^0]: https://x.com/__tinygrad__/status/1869620002015572023
|
| [^1]: https://x.com/__tinygrad__/status/1866889544299319606
| pinsiang wrote:
| AMD GPUs are becoming a serious contender for LLM inference.
| vLLM is already showing impressive performance on AMD [1], even
| with consumer-grade Radeon cards (GGUF is even supported) [2]. This
| could be a game-changer for folks who want to run LLMs without
| shelling out for expensive NVIDIA hardware.
|
| [1] https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html [2]
| https://embeddedllm.com/blog/vllm-now-supports-running-gguf-...
| MrBuddyCasino wrote:
| Fun fact: Nvidia H200 are currently half the price/hr of H100
| bc people can't get vLLM to work on it.
|
| https://x.com/nisten/status/1871325538335486049
| adrian_b wrote:
| That seems like a CPU problem, not a GPU problem (due to
| Aarch64 replacing x86-64).
| ryao wrote:
| That is GH200 and it is likely due to an amd64 dependency
| in vLLM.
| treprinum wrote:
| AMD decided not to release a high-end GPU this cycle so any
| investment into 7x00 or 6x00 is going to be wasted as Nvidia
| 5x00 is likely going to destroy any ROI from the older cards
| and AMD won't have an answer for at least two years, possibly
 | never, due to being non-existent in high-end consumer GPUs
| usable for compute.
| llama-mini wrote:
| From Lamini, we have a private AMD GPU cluster, ready to serve
| anyone who wants to try MI300x or MI250 with inference and
| tuning.
|
| We just onboarded a customer to move from openai API to on-prem
| solution, currently evaluating MI300x for inference.
|
| Email me at my profile email.
| 3abiton wrote:
| My understanding is that once JAX takes off, the cuda advantage
| is gone for nvidia. That's a big if/when though.
| jroesch wrote:
| Note: this is old work, and much of the team working on TVM and
| MLC was from OctoAI; we have all recently joined NVIDIA.
| sebmellen wrote:
| Is there no hope for AMD anymore? After George Hotz/Tinygrad
| gave up on AMD I feel there's no realistic chance of using
| their chips to break the CUDA dominance.
| llm_trw wrote:
| Not really.
|
| AMD is constitutionally incapable of shipping anything but
| mid range hardware that requires no innovation.
|
| The only reason why they are doing so well in CPUs right now
| is that Intel has basically destroyed itself without any
| outside help.
| perching_aix wrote:
| And I'm supposed to believe that HN is this amazing
| platform for technology and science discussions, totally
| unlike its peers...
| zamadatix wrote:
    | The above take is worded a bit cynically, but it is their
    | general approach to GPUs lately across the board, e.g.
| https://www.techpowerup.com/326415/amd-confirms-retreat-
| from...
|
    | Also I'd take HN as being an amazing platform for
| the overall consistency and quality of moderation.
| Anything beyond that depends more on who you're talking
| to than where at.
| petesergeant wrote:
| Maybe be the change you want to see and tell us what the
| real story is?
| perching_aix wrote:
| We seem to disagree on what the change in the world I'd
| like to see is like, which is a real shocker I'm sure.
|
| Personally, I think that's when somebody who has no real
| information to contribute doesn't try to pretend that
| they do.
|
| So thanks for the offer, but I think I'm already
| delivering on that realm.
| llm_trw wrote:
| I don't really care what you believe.
|
    | Everyone who's dug deep into what AMD is doing has left
    | in disgust if they are lucky, and in bankruptcy if they
    | are not.
|
| If I can save someone else from wasting $100,000 on
| hardware and six months of their life then my post has
| done more good than the AMD marketing department ever
| will.
| AnthonyMouse wrote:
| > If I can save someone else from wasting $100,000 on
| hardware and six months of their life then my post has
| done more good than the AMD marketing department ever
| will.
|
| This seems like unuseful advice if you've already given
| up on them.
|
| You tried it and at some point in the past it wasn't
| ready. But by not being ready they're losing money, so
| they have a direct incentive to fix it. Which would take
| a certain amount of time, but once you've given up you no
| longer know if they've done it yet or not, at which point
| your advice would be stale.
|
| Meanwhile the people who attempt it apparently seem to
| get acquired by Nvidia, for some strange reason. Which
| implies it should be a worthwhile thing to do. If they've
| fixed it by now which you wouldn't know if you've stopped
| looking, or they fix it in the near future, you have a
| competitive advantage because you have access to lower
| cost GPUs than your rivals. If not, but you've
| demonstrated a serious attempt to fix it for everyone
| yourself, Nvidia comes to you with a sack full of money
| to make sure you don't finish, and then you get a sack
| full of money. That's win/win, so rather than nobody
| doing it, it seems like everybody should be doing it.
| llm_trw wrote:
| I've tried it three times.
|
| I've seen people try it every six months for two decades
| now.
|
| At some point you just have to accept that AMD is not a
| serious company, but is a second rate copycat and there
| is no way to change that without firing everyone from
| middle management up.
|
| I'm deeply worried about stagnation in the CPU space now
| that they are top dog and Intel is dead in the water.
|
      | Here's hoping China and RISC-V save us.
|
| >Meanwhile the people who attempt it apparently seem to
| get acquired by Nvidia
|
| Everyone I've seen base jumping has gotten a sponsorship
      | from redbull, ergo everyone should base jump.
|
| Ignore the red smears around the parking lot.
| Const-me wrote:
| > I've tried it three times
|
| Have you tried compute shaders instead of that weird HPC-
| only stuff?
|
| Compute shaders are widely used by millions of gamers
| every day. GPU vendors have huge incentive to make them
| reliable and efficient: modern game engines are using
       | them for lots of things, e.g. UE5 can even render triangle
| meshes with GPU compute instead of graphics (the tech is
| called nanite virtualized geometry). In practice they
| work fine on all GPUs, ML included:
| https://github.com/Const-me/Cgml
| AnthonyMouse wrote:
| > At some point you just have to accept that AMD is not a
| serious company, but is a second rate copycat and there
| is no way to change that without firing everyone from
| middle management up.
|
| AMD has always punched above their weight. Historically
| their problem was that they were the much smaller company
| and under heavy resource constraints.
|
| Around the turn of the century the Athlon was faster than
| the Pentium III and then they made x86 64-bit when Intel
| was trying to screw everyone with Itanic. But the Pentium
| 4 was a marketing-optimized design that maximized clock
| speed at the expense of heat and performance per clock.
| Intel was outselling them even though the Athlon 64 was
| at least as good if not better. The Pentium 4 was rubbish
| for laptops because of the heat problems, so Intel
| eventually had to design a separate chip for that, but
| they also had the resources to do it.
|
| That was the point that AMD made their biggest mistake.
| When they set out to design their next chip the
| competition was the Pentium 4, so they made a power-
| hungry monster designed to hit high clock speeds at the
| expense of performance per clock. But the reason more
| people didn't buy the Athlon 64 wasn't that they couldn't
| figure out that a 2.4GHz CPU could be faster than a
| 2.8GHz CPU, it was all the anti-competitive shenanigans
| Intel was doing behind closed doors to e.g. keep PC OEMs
| from featuring systems with AMD CPUs. Meanwhile by then
| Intel had figured out that the Pentium 4 was, in fact, a
| bad design, when their own Pentium M laptops started
| outperforming the Pentium 4 desktops. So the Pentium 4
| line got canceled and Bulldozer had to go up against the
| Pentium M-based Core, which nearly bankrupted AMD and
| compromised their ability to fund the R&D needed to
| sustain state of the art fabs.
|
| Since then they've been climbing back out of the hole but
| it wasn't until Ryzen in 2017 that you could safely
| conclude they weren't on the verge of bankruptcy, and
| even then they were saddled with a lot of debt and
| contracts requiring them to use the uncompetitive Global
| Foundries fabs for several years. It wasn't until Zen4 in
| 2022 that they finally got to switch the whole package to
| TSMC.
|
| So until quite recently the answer to the question "why
| didn't they do X?" was obvious. They didn't have the
| money. But now they do.
| perching_aix wrote:
| I'd be very concerned if somebody makes a $100K decision
| based on a comment where the author couldn't even
| differentiate between the words "constitutionally" and
| "institutionally", while providing as much substance as
| any other random techbro on any random forum and being
| overwhelmingly oblivious to it.
| ksec wrote:
| Everything is comparative. AMD isn't perfect. As an Ex
| Shareholder I have argued they did well partly because of
| Intel's downfall. In terms of execution it is far from
| perfect.
|
   | But Nvidia is a different beast. It is a bit like Apple in
   | the late 00s: take business, forecast, marketing,
   | operations, software, hardware, sales, etc. You take any
   | part of it and they are all industry leading. And having
   | industry-leading capability is only part of the game;
   | having it all work together is another thing entirely.
   | And unlike Apple, which lost direction once Steve Jobs
   | passed away and wasn't sure how to deploy capital, Jensen
   | is still here, and they have more resources now, making
   | Nvidia even more competitive.
|
   | Most people underestimate the magnitude of the task
   | required (I like to tell the story again about an Intel
   | GPU engineer in 2016 arguing they could take dGPU market
   | share by 2020, and we are now in 2025), overestimate the
   | capability of an organisation, and underestimate the
   | rival's speed of innovation and execution. These three
   | things combined are why most people are often off in their
   | estimates by an order of magnitude.
| llm_trw wrote:
| Yeah, no.
|
| We are in the middle of a monopoly squeeze by NVidia on
| the most innovative part of the economy right now. I
| expect the DOJ to hit them harder than they did MS in the
| 90s given the bullshit they are pulling and the drag on
| the economy they are causing.
|
| By comparison if AMD could write a driver that didn't
| shit itself when it had to multiply more than two
| matrices in a row they'd be selling cards faster than
| they can make them. You don't need to sell the best
| shovels in a gold rush to make mountains of money, but
| you can't sell teaspoons as premium shovels and expect
| people to come back.
| shiroiushi wrote:
| >I expect the DOJ to hit them harder than they did MS in
| the 90s given the bullshit they are pulling and the drag
| on the economy they are causing.
|
| It sounds like you're expecting extreme competence from
| the DOJ. Given their history with regulating big tech
| companies, and even worse, the incoming administration, I
| think this is a very unrealistic expectation.
| kadoban wrote:
| What effect did the DOJ have on MS in the 90s? Didn't all
| of that get rolled back before they had to pay a dime,
| and all it amounted to was that browser choice screen
| that was around for a while? Hardly a crippling blow. If
| anything that showed the weakness of regulators in fights
| against big tech, just outlast them and you're fine.
| ksec wrote:
| >We are in the middle of a monopoly squeeze by NVidia on
| the most innovative part of the economy right now.
|
| I am not sure which part of Nvidia is monopoly. That is
| like suggesting TSMC has a monopoly.
| vitus wrote:
| > That is like suggesting TSMC has a monopoly.
|
| They... do have a monopoly on foundry capacity,
| especially if you're looking at the most advanced nodes?
| Nobody's going to Intel or Samsung to build 3nm
| processors. Hell, there have been whispers over the past
| month that even Samsung might start outsourcing Exynos to
| TSMC; Intel already did that with Lunar Lake.
|
| Having a monopoly doesn't mean that you are engaging in
| anticompetitive behavior, just that you are the only real
| option in town.
| Vecr wrote:
| Will they? Given the structure of global controls on
| GPUs, Nvidia is a de-facto self funding US government
| company.
|
| Maybe the US will do something if GPU price becomes the
| limit instead of the supply of chips and power.
| lofaszvanitt wrote:
| It had to destroy itself. These companies do not act on
| their own...
| adrian_b wrote:
   | In CPUs, AMD has made many innovations that were copied by
   | Intel only after many years, and this delay contributed
   | significantly to Intel's downfall.
   |
   | The most important has been that AMD correctly predicted
   | that big monolithic CPUs would no longer be feasible in
   | future CMOS fabrication technologies, so they designed the
   | Zen family from the beginning with a chiplet-based
   | architecture. Intel had attempted to ridicule them, but
   | after losing many billions they have been forced to copy
   | this strategy.
|
| Also in the microarchitecture of their CPUs AMD has made
| the right choices since the beginning and then they have
| improved it constantly with each generation. The result is
| that now the latest Intel big core, Lion Cove, has a
| microarchitecture that is much more similar to AMD Zen 5
| than to any of the previous Intel cores, because they had
| to do this to get a competitive core.
|
| In the distant past, AMD has also introduced a lot of
| innovations long before they were copied by Intel, but it
| is true that those had not been invented by AMD, but they
| had been copied by AMD from more expensive CPUs, like DEC
| Alpha or Cray or IBM POWER, but Intel has also copied them
| only after being forced by the competition with AMD.
| latchkey wrote:
| https://x.com/dylan522p/status/1871287937268383867
| krackers wrote:
  | That's almost word for word what geohot said last year?
| refulgentis wrote:
| What part?
|
| I assume the part where she said there's "gaps in the
| software stack", because that's the only part that's
| attributed to her.
|
| But I must be wrong because that hasn't been in dispute
| or in the news in a decade, it's not a geohot discovery
| from last year.
|
| Hell I remember a subargument of a subargument re: this
| being an issue a decade ago in macOS dev (TL;Dr whether
| to invest in opencl)
| bn-l wrote:
| I went through the thread. There's an argument to be made
| in firing Su for being so spaced out as to miss an op for
| their own CUDA for free.
| hedgehog wrote:
| Not remotely, how did you get to that idea?
| refulgentis wrote:
    | Kids these days (shakes fist)
|
    | tl;dr there's a not-insubstantial # of people who learn a
| lot from geohot. I'd say about 3% of people here will be
| confused if you thought of him as less than a top
| technical expert across many comp sci fields.
|
| And he did the geohot thing recently, way tl;dr: acted
| like there was a scandal being covered up by AMD around
| drivers that was causing them to "lose" to nVidia.
|
| He then framed AMD not engaging with him on this topic as
| further covering-up and choosing to lose.
|
| So if you're of a certain set of experiences, you see an
| anodyne quote from the CEO that would have been utterly
| unsurprising dating back to when ATI was still a company,
| and you'd read it as the CEO breezily admitting in public
| that geohot was right about how there was malfeasance,
| followed by a cover up, implying extreme dereliction of
| duty, because she either helped or didn't realize till
| now.
|
| I'd argue this is partially due to stonk-ification of
| discussions, there was a vague, yet often communicated,
| sense there was something illegal happening. Idea was it
| was financial dereliction of duty to shareholders.
| dismalaf wrote:
| IMO the hope shouldn't be that AMD specifically wins, rather
| it's best for consumers that hardware becomes commoditized
| and prices come down.
|
| And that's what's happening, slowly anyway. Google, Apple and
| Amazon all have their own AI chips, Intel has Gaudi, AMD had
| their thing, and the software is at least working on more
| than just Nvidia. Which is a win. Even if it's not perfect.
| I'm personally hoping that everyone piles in on a standard
| like SYCL.
| comex wrote:
| Maybe from Modular (the company Chris Lattner is working
| for). In this recent announcement they said they had achieved
| competitive ML performance... on NVIDIA GPUs, but with their
| own custom stack completely replacing CUDA. And they're
| targeting AMD next.
|
| https://www.modular.com/blog/introducing-max-24-6-a-gpu-
| nati...
| behnamoh wrote:
| Ah yes, the programming language (Mojo) that requires an
| account before I can use it...
| melodyogonna wrote:
| Mojo no longer requires an account to install.
|
| But that is irrelevant to the conversation because this
| is not about Mojo but something they call MAX. [1]
|
| 1. https://www.modular.com/max
| quotemstr wrote:
| The world is bigger than AMD and Nvidia. Plenty of
| interesting new AI-tuned non-GPU accelerators coming online.
| grigio wrote:
  | I hope so. Name an NPU that can run a 70B model...
| steeve wrote:
| We (ZML) have AMD MI300X working just fine, in fact, faster
| than H100
| lasermike026 wrote:
| I believe these efforts are very important. If we want this stuff
| to be practical we are going to have to work on efficiency. Price
| efficiency is good. Power and compute efficiency would be better.
|
| I have been playing with llama.cpp to run inference on
| conventional cpus. No conclusions but it's interesting. I need to
| look at llamafile next.
| zamalek wrote:
| I have been playing around with Phi-4 Q6 on my 7950x and 7900XT
| (with HSA_OVERRIDE_GFX_VERSION). It's bloody fast, even with CPU
| alone - in practical terms it beats hosted models due to the
| roundtrip time. Obviously perf is more important if you're
| hosting this stuff, but we've definitely reached AMD usability at
| home.
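|
| In case anyone wants to try the same thing, roughly what that
| looks like (assumes the ROCm build of PyTorch; 11.0.0 is the
| usual override value for RDNA3 cards and has to be set before
| the ROCm runtime loads):
|
|     import os
|     # Must be set before torch (and the ROCm runtime) loads.
|     os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")
|
|     import torch  # ROCm builds expose AMD GPUs via the CUDA API
|     print(torch.cuda.is_available())
|     print(torch.cuda.get_device_name(0))  # e.g. the 7900 XT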
| latchkey wrote:
| Previously:
|
| _Making AMD GPUs competitive for LLM inference_
| https://news.ycombinator.com/item?id=37066522 (August 9, 2023 --
| 354 points, 132 comments)
| lxe wrote:
| A used 3090 is $600-900, performs better than 7900, and is much
| more versatile because CUDA
| Uehreka wrote:
| Reality check for anyone considering this: I just got a used
| 3090 for $900 last month. It works great.
|
| I would not recommend buying one for $600, it probably either
| won't arrive or will be broken. Someone will reply saying they
| got one for $600 and it works, that doesn't mean it will happen
| if you do it.
|
| I'd say the market is realistically $900-1100, maybe $800 if
| you know the person or can watch the card running first.
|
| All that said, this advice will expire in a month or two when
| the 5090 comes out.
| idonotknowwhy wrote:
| I've bought 5 used and they're all perfect. But that's what
| buyer protection on ebay is for. Had to send back an Epyc
| mobo with bent pins and ebay handled it fine.
| leonewton253 wrote:
| This benchmark doesn't look right. Is it using the tensor cores
| in the Nvidia GPU? AMD does not have AI cores, so it should run
| noticeably slower.
| nomel wrote:
| AMD has WMMA.
| mattfrommars wrote:
| Great, I have yet to understand why the ML community doesn't
| really push to move away from CUDA. To me, it feels like a
| dinosaur move to build on top of CUDA, which is screaming
| proprietary: nothing about it is open source or cross platform.
|
| The reason I say it's a dinosaur move is: imagine we as a dev
| community had continued to build on top of Flash or Microsoft
| Silverlight...
|
| LLMs and ML have been out for quite a while; with AI/LLM
| advancement, the transition to cross platform should have been
| much quicker. But it hasn't happened yet, and I'm not sure when
| it will.
|
| Building a translation layer on top of CUDA is not the answer to
| this problem either.
| dwood_dev wrote:
| Except I never hear complaints about CUDA from a quality
| perspective. The complaints are always about lock in to the
| best GPUs on the market. The desire to shift away is to make
| cheaper hardware with inferior software quality more usable.
| Flash was an abomination, CUDA is not.
| xedrac wrote:
| Maybe the situation has gotten better in recent years, but my
| experience with Nvidia toolchains was a complete nightmare
| back in 2018.
| claytonjy wrote:
| The cuda situation is definitely better. The nvidia
| struggles are now with the higher-level software they're
   | pushing (triton, tensorrt-llm, riva, etc), tools that are the
| most performant option when they work, but a garbage
| developer experience when you step outside the golden path
| cameron_b wrote:
| I want to double-down on this statement, and call
| attention to the competitive nature of it. Specifically,
| I have recently tried to set up Triton on arm hardware.
| One might presume Nvidia would give attention to an
| architecture they develop, but the way forward is not
| easy. For some version of Ubuntu, you might have the
    | correct version of python (usually older than packaged)
| but current LTS is out of luck for guidance or packages.
|
| https://github.com/triton-lang/triton/issues/4978
| AnthonyMouse wrote:
| Flash was popular because it was an attractive platform for
| the developer. Back then there was no HTML5 and browsers
| didn't otherwise support a lot of the things Flash did. Flash
| _Player_ was an abomination, it was crashy and full of
| security vulnerabilities, but that was a problem for the user
| rather than the developer and it was the developer choosing
| what to use to make the site.
|
| This is pretty much exactly what happens with CUDA.
| Developers like it but then the users have to use expensive
| hardware with proprietary drivers/firmware, which is the
| relevant abomination. But users have _some_ ability to
| influence developers, so as soon as we get the GPU equivalent
| of HTML5, what happens?
| wqaatwt wrote:
| > users have to use expensive hardware with proprietary
| drivers/firmware
|
| What do you mean by that? People trying to run their own
| models are not "the users" they are a tiny insignificant
| niche segment.
| AnthonyMouse wrote:
| There are _far_ more people running llama.cpp, various
| image generators, etc. than there are people developing
| that code. Even when the "users" are corporate entities,
| they're not necessarily doing any development in excess
| of integrating the existing code with their other
| systems.
|
| We're also likely to see a stronger swing away from "do
| inference in the cloud" because of the aligned incentives
| of "companies don't want to pay for all that hardware and
| electricity" and "users have privacy concerns" such that
| companies doing inference on the local device will have
| both lower costs and a feature they can advertise over
| the competition.
|
| What this is waiting for is hardware in the hands of the
| users that can actually do this for a mass market price,
| but there is no shortage of companies wanting a piece of
| that. In particular, Apple is going to be pushing that
| hard and despite the price they do a lot of volume, and
| then you're going to start seeing more PCs with high-VRAM
| GPUs or iGPUs with dedicated GDDR/HBM on the package as
| their competitors want feature parity for the thing
| everybody is talking about, the cost of which isn't
| actually that high, e.g. 40GB of GDDR6 is less than $100.
| idonotknowwhy wrote:
 | For me personally, hacking together projects as a hobbyist, 2
 | reasons:
|
 | 1. It just works. When I tried to build things on Intel Arcs, I
| spent way more hours bikeshedding ipex and driver issues than
| developing
|
| 2. LLMs seem to have more cuda code in their training data. I
| can leverage claude and 4o to help me build things with cuda,
| but trying to get them to help me do the same things on ipex
| just doesn't work.
|
| I'd very much love a translation layer for Cuda, like a dxvk or
| wine equivalent.
|
| Would save a lot of money since Arc gpus are in the bargain bin
| and nvidia cloud servers are double the price of AMD.
|
| As it stands now, my dual Intel Arc rig is now just a llama.cpp
| inference server for the family to use.
| FloatArtifact wrote:
  | What kind of model, and what's its token output on Intel
  | GPUs?
| jeroenhd wrote:
| If CUDA counts as "just works", I dread to see the dark,
| unholy rituals you need to invoke to get ROCm to work. I have
| spent too many hours browsing the Nvidia forums for obscure
| error codes and driver messages to ever consider updating my
| CUDA install and every time I reboot my desktop for an update
| I dread having to do it all over again.
| pavelstoev wrote:
| The problem is that performance achievements on AMD consumer-
| grade GPUs (RX7900XTX) are not representative/transferable to
| the datacenter-grade GPUs (MI300X). Consumer GPUs are based on
| RDNA architecture, while datacenter GPUs are based on the CDNA
| architecture, and only sometime in ~2026 AMD is expected to
| release the unifying UDNA architecture [1]. At CentML we are
| currently working on integrating AMD CDNA and HIP support into
| our Hidet deep learning compiler [2], which will also power
| inference workloads for all Nvidia GPUs, AMD GPUs, Google TPU and
| AWS Inf2 chips on our platform [3]
|
| [1] https://www.jonpeddie.com/news/amd-to-integrate-cdna-and-
| rdn.... [2] https://centml.ai/hidet/ [3]
| https://centml.ai/platform/
| llm_trw wrote:
| The problem is that the specs of AMD consumer-grade GPUs do not
 | translate to compute performance when you try to chain more
| than one together.
|
| I have 7 NVidia 4090s under my desk happily chugging along on
| week long training runs. I once managed to get a Radeon VII to
| run for six hours without shitting itself.
| tspng wrote:
| Wow, are these 7 RTX 4090s in a single setup? Care to share
  | more about how you built it (case, cooling, power, ..)?
| adakbar wrote:
| I'd like to know too
| ghxst wrote:
| You might find the journey of Tinycorp's Tinybox
| interesting, it's a machine with 6 to 8 4090 GPUs and you
| should be able to track down a lot of their hardware
| choices including pictures on their Twitter and other info
   | on George's livestreams.
| llm_trw wrote:
| Basically this but with an extra card on the x8 slot for
| connecting my monitors:
| https://www.youtube.com/watch?v=C548PLVwjHA
|
| There's a bunch of similar setups and there are a couple of
| dozen people that have done something similar on
| /r/localllama.
| osmarks wrote:
| Most of these are just an EPYC server platform, some cursed
| risers and multiple PSUs (though cryptominer server PSU
| adapters are probably better). See
| https://nonint.com/2022/05/30/my-deep-learning-rig/ and
| https://www.mov-axbx.com/wopr/wopr_concept.html.
| Keyframe wrote:
| Looks like a fire hazard :)
| mpreda wrote:
| > I have 7 NVidia 4090s under my desk
|
| I have 6 Radeon Pro VII under my desk (in a single system
| BTW), and they run hard for weeks until I choose to reboot
| e.g. for Linux kernel updates.
|
| I bought them "new old stock" for $300 apiece. So that's
| $1800 for all six.
| highwaylights wrote:
| How does the compute performance compare to 4090's for
| these workloads?
|
   | (I realize it will be significantly lower, just trying to
   | get as much of a comparison as possible).
| cainxinth wrote:
| The 4090 offers 82.58 teraflops of single-precision
| performance compared to the Radeon Pro VII's 13.06
| teraflops.
| adrian_b wrote:
| On the other hand, for double precision a Radeon Pro VII
| is many times faster than a RTX 4090 (due to 1:2 vs. 1:64
| FP64:FP32 ratio).
|
| Moreover, for workloads limited by the memory bandwidth,
| a Radeon Pro VII and a RTX 4090 will have about the same
| speed, regardless what kind of computations are
| performed. It is said that speed limitation by memory
| bandwidth happens frequently for ML/AI inferencing.
| llm_trw wrote:
| For inference sure, for training: no.
| crest wrote:
| The Radeon VII is special compared to most older (and
| current) affordable GPUs in that it used HBM giving it
| memory bandwidth comparable to modern cards ~1TB/s and
| has reasonable FP64 (1:4) throughput instead of (1:64).
| So this card can still be pretty interesting for running
| memory bandwidth intensive FP64 workloads. Anything
| affordable afterward by either AMD or Nvidia crippled
      | realistic FP64 throughput to below what an AVX-512 many-
| core CPU can do.
| nine_k wrote:
| If we speak about FP64, are your loads more like fluid
| dynamics than ML training?
| llm_trw wrote:
| Are you running ml workloads or solving differential
| equations?
|
| The two are rather different and one market is worth
| trillions, the other isn't.
| comboy wrote:
| I think there is some money to be made in machine
| learning too.
| zozbot234 wrote:
 | It looks like AMD's CDNA GPUs are supported by Mesa, which
| ought to suffice for Vulkan Compute and SYCL support. So there
| should be ways to run ML workloads on the hardware without
| going through HIP/ROCm.
| aussieguy1234 wrote:
| I got a "gaming" PC for LLM inference with an RTX 3060. I could
| have gotten more VRAM for my buck with AMD, but didn't because at
| the time a lot of inference needed CUDA.
|
| As soon as AMD is as good as Nvidia for inference, I'll switch
| over.
|
| But I've read on here that their hardware engineers aren't even
| given enough hardware to test with...
| lhl wrote:
| Just an FYI, this is writeup from August 2023 and a lot has
| changed (for the better!) for RDNA3 AI/ML support.
|
| That being said, I did some very recent inference testing on an
| W7900 (using the same testing methodology used by Embedded LLM's
| recent post to compare to vLLM's recently added Radeon GGUF
| support [1]) and MLC continues to perform quite well. On Llama
| 3.1 8B, MLC's q4f16_1 (4.21GB weights) performed +35% faster than
| llama.cpp w/ Q4_K_M w/ their ROCm/HIP backend (4.30GB weights, 2%
| size difference).
|
| That makes MLC still the generally fastest standalone inference
| engine for RDNA3 by a country mile. However, you have much less
| flexibility with quants and by and large have to compile your own
| for every model, so llama.cpp is probably still more flexible for
| general use. Also llama.cpp's (recently added to llama-server)
| speculative decoding can also give some pretty sizable
| performance gains. Using a 70B Q4_K_M + 1B Q8_0 draft model
| improves output token throughput by 59% on the same ShareGPT
| testing. I've also been running tests with Qwen2.5-Coder and
| using a 0.5-3B draft model for speculative decoding gives even
| bigger gains on average (depends highly on acceptance rate).
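|
| For intuition on why the draft-model gains swing so much with
| acceptance rate, the usual back-of-the-envelope estimate
| (assuming independent per-token acceptance and a draft model
| whose cost is negligible next to the 70B target):
|
|     # Expected tokens emitted per target-model verification
|     # pass when the draft proposes k tokens and each one is
|     # accepted with probability a.
|     def expected_tokens(a, k=5):
|         return (1 - a ** (k + 1)) / (1 - a)
|
|     for a in (0.5, 0.7, 0.9):
|         print(f"{a:.0%} acceptance: ~{expected_tokens(a):.1f}"
|               " tokens per 70B forward pass")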
|
| Note, I think for local use, vLLM GGUF is still not suitable at
| all. When testing w/ a 70B Q4_K_M model (only 40GB), loading,
| engine warmup, and graph compilation took on avg 40 minutes.
| llama.cpp takes 7-8s to load the same model.
|
| At this point for RDNA3, basically everything I need works/runs
| for my use cases (primarily LLM development and local
| inferencing), but almost always slower than an RTX 3090/A6000
| Ampere (a new 24GB 7900 XTX is $850 atm, used or refurbished 24
| GB RTX 3090s are in in the same ballpark, about $800 atm; a new
| 48GB W7900 goes for $3600 while an 48GB A6000 (Ampere) goes for
| $4600). The efficiency gains can be sizable. Eg, on my standard
| llama-bench test w/ llama2-7b-q4_0, the RTX 3090 gets a tg128 of
| 168 t/s while the 7900 XTX only gets 118 t/s even though both
| have similar memory bandwidth (936.2 GB/s vs 960 GB/s). It's also
| worth noting that since the beginning of the year, the llama.cpp
| CUDA implementation has gotten almost 25% faster, while the ROCm
| version's performance has stayed static.
|
| There is an actively (solo dev) maintained fork of llama.cpp that
| sticks close to HEAD but basically applies a rocWMMA patch that
| can improve performance if you use the llama.cpp FA (still
| performs worse than w/ FA disabled) and in certain long-context
| inference generations (on llama-bench and w/ this ShareGPT
| serving test you won't see much difference) here:
| https://github.com/hjc4869/llama.cpp - The fact that no one from
| AMD has shown any interest in helping improve llama.cpp
| performance (despite often citing llama.cpp-based apps in
| marketing/blog posts, etc) is disappointing ... but sadly on
| brand for AMD GPUs.
|
| Anyway, for those interested in more information and testing for
| AI/ML setup for RDNA3 (and AMD ROCm in general), I keep a doc
| with lots of details here: https://llm-tracker.info/howto/AMD-
| GPUs
|
| [1] https://embeddedllm.com/blog/vllm-now-supports-running-
| gguf-...
| mrcsharp wrote:
| I will only consider AMD GPUs for LLM when I can easily make my
| AMD GPU available within WSL and Docker on Windows.
|
| For now, it is as if AMD does not exist in this field for me.
| Sparkyte wrote:
| The more players in the market, the better. AI shouldn't be
| owned by one business.
| melodyogonna wrote:
| Modular claims that it achieves 93% GPU utilization on AMD GPUs
| [1], official preview release coming early next year, we'll see.
| I must say I'm bullish because of feedback I've seen people give
| about the performance on Nvidia GPUs
|
| 1. https://www.modular.com/max
| guerrilla wrote:
| So, does ollama use this work or does it do something else? How
| does it compare?
___________________________________________________________________
(page generated 2024-12-24 23:00 UTC)