[HN Gopher] The AMD Radeon Instinct MI300A's Giant Memory Subsystem
___________________________________________________________________
The AMD Radeon Instinct MI300A's Giant Memory Subsystem
Author : pella
Score : 154 points
Date : 2025-01-18 12:28 UTC (10 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| amelius wrote:
| I'm curious why this space hasn't been patented to death.
| hedora wrote:
| It has been. All sides have a pile of patents. All sides
| violate all the other sides' patents. If anyone sues, everyone
| goes out of business.
|
| This is the system working as currently intended. No matter
| what happens, the lawyers will get rich.
|
| If a small company comes in and doesn't pay the lawyers, it'll
| get sued for violating the patents.
| yvdriess wrote:
| Yep, it's an area denial weapon.
|
| You basically cannot do anything worthwhile in this space
| without violating someone's patents. It benefits patent and
| corporate lawyers, but it's detrimental to innovation. As an
| engineer you are asked not to look up existing techniques or
| designs, as this will taint you legally.
| kmeisthax wrote:
| "Tainting" isn't a thing in patent law. All engineers
| worldwide are tainted the moment the patent is published;
| that's why parallel reinvention is not a defense to patent
| infringement.
| lhl wrote:
| But you pay triple damages if you knowingly vs.
| unknowingly violate a patent (35 U.S.C. § 284). Of
| course, everything is patented, so engineers are just
| told not to read patents.
| amelius wrote:
| > If a small company comes in and doesn't pay the lawyers,
| it'll get sued for violating the patents.
|
| This assumes the small company isn't just in it for the
| patents.
| WithinReason wrote:
| Mutually assured destruction
| amelius wrote:
| Where do patent-trolls fit in this analogy?
| btown wrote:
| I've often thought that one of the places AMD could distinguish
| itself from NVIDIA is bringing significantly higher amounts of
| VRAM (or memory systems that are as performant as what we
| currently know as VRAM) to the consumer space.
|
| A card with a fraction of the FLOPS of cutting-edge graphics
| cards (and ideally proportionally less power consumption), but
| with 64-128GB VRAM-equivalent, would be a gamechanger for letting
| people experiment with large multi-modal models, and seriously
| incentivize researchers to build the next generation of tensor
| abstraction libraries for both CUDA and ROCm/HIP. And for gaming,
| you could break new ground on high-resolution textures. AMD
| would be back in the game.
|
| Of course, if it's not real VRAM, it needs to be at least
| somewhat close on the latency and bandwidth front, so let's pop
| on over and see what's happening in this article...
|
| > An Infinity Cache hit has a load-to-use latency of over 140 ns.
| Even DRAM on the AMD Ryzen 9 7950X3D shows less latency. Missing
| Infinity Cache of course drives latency up even higher, to a
| staggering 227 ns. HBM stands for High Bandwidth Memory, not low
| latency memory, and it shows.
|
| Welp. Guess my wish isn't coming true today.
| formerly_proven wrote:
| Totally normal latencies for a GPU though.
| pkroll wrote:
| You're not the only one thinking that:
| https://www.nvidia.com/en-us/project-digits/
|
| 128G of unified memory. $3K. Throw ollama and ComfyUI on that
| sucker and things could get interesting. The question is how
| much slower than a 5090 this is gonna be. The memory bandwidth
| isn't going to match a 512-bit bus.
| lostmsu wrote:
| AFAIK this uses even slower memory.
| sroussey wrote:
| And a fraction of the 5090 cores.
| manojlds wrote:
| It's LPDDR5.
| KeplerBoy wrote:
| It's going to be waaay slower than a 5090. We're looking at
| something like 60W TDP for the entire system vs 600W for a
| 5090 GPU.
|
| It's going to be very energy efficient, it will get plenty of
| flops, but they won't be able to cheat physics.
| mpercival531 wrote:
| They are. Strix Halo is going after that same space of Apple M4
| Pro/Max where it is currently unchallenged. Pairing it with two
| 64GB LPCAMM2 modules will get you there.
|
| Edit: The problem with AMD is less the hardware offerings and
| more that their compute software stack has historically
| handwaved consumer GPU support or been very slow to deliver
| it -- even more so with their APUs. Maybe the advent of MI300A
| will change the equation, maybe not.
| lhl wrote:
| I don't know of any non-soldered memory Strix Halo devices,
| but both HP and Asus have announced 128GB SKUs (availability
| unknown).
|
| For LLM inference, basically everything works w/ ROCm on
| RDNA3 now (well, Flash Attention is via Triton and doesn't
| have support for SWA and some other stuff; also I mostly test
| on Linux, although I did check that the new WSL2 support
| works). I've tested some older APUs w/ basic benchmarking as
| well. Notes here for those interested:
| https://llm-tracker.info/howto/AMD-GPUs
| UncleOxidant wrote:
| Thanks for that link. I'm interested in either getting the
| HP Mini Z1 G1a or an NVidia Digits for LLM experimentation.
| The obvious advantage for the Digits is that the CUDA
| ecosystem is much more tried & true for that kind of thing.
| But the disadvantages are that it's harder to use as a
| replacement for my current PC, that it's going to run an
| already old version of Ubuntu (22.04), and that you're
| dependent on Nvidia for updates.
| lhl wrote:
| Yeah, I think anyone w/ old Jetsons knows what it's like
| to be left high and dry by Nvidia's embedded software
| support. Older models are basically just ewaste. Since
| the Digits won't be out until May, I guess there's enough
| time to wait and see - at least to get a sense of what
| the actual specs are. I have a feeling the FP16 TFLOPS
| and the MBW are going to be much lower than what people
| have been hyping themselves up for.
|
| Sadly, my feeling is that the big Strix Halo SKUs (which
| have no scheduled release dates) aren't going to be
| competitively priced (they're likely to be at a big
| FLOPS/real-world performance disadvantage, and there's
| still the PITA factor), but there is something appealing
| about the do-it-all aspect of it.
| rbanffy wrote:
| DIGITS looks like a serious attempt, but they don't have
| too much of an incentive to have people developing for
| older hardware. I wouldn't expect them to support it for
| more than five years. At least the underlying Ubuntu will
| last more than that and provide a viable work environment
| far beyond the time it gets really boring.
| UncleOxidant wrote:
| If only they could get their changes upstreamed to Ubuntu
| (and possibly kernel mods upstreamed), then we wouldn't
| have to worry about it.
| rbanffy wrote:
| Getting their kernel mods upstreamed is very unlikely,
| but they might provide just enough you can build a new
| kernel with the same major version number.
| KeplerBoy wrote:
| Who said anything about Ubuntu 22.04? I mean, sure, that's
| the newest release the current JetPack comes with, but I'd
| be surprised if they shipped Digits with that.
| rbanffy wrote:
| Doesn't DGX OS use the latest LTS version? Current should
| be 24.04.
| KeplerBoy wrote:
| I wouldn't know. I only work with workstation or jetson
| stuff.
|
| The DGX documentation and downloads aren't public afaik.
|
| Edit: Nevermind, some information about DGX is public and
| they really are on 22.04, but oh well, the deep learning
| stack is guaranteed to run.
|
| https://docs.nvidia.com/base-os/too
| Fade_Dance wrote:
| Assuming we are comparing chips that are using the latest
| generation/high density memory modules, a wider bus width is
| required for larger memory counts, which is expensive when it
| comes to silicon area. Therefore, if AMD is willing to boost
| memory count as a competitive advantage, they may as well
| consider using that die space for more logic gates instead.
| It's a set of trade-offs and an optimization problem to some
| degree.
|
| That said, when an incumbent has a leadership advantage, one of
| the obvious ways to boost profit is to slash the memory bus
| width, and then a competitor can come in and bring it up a bit
| and have a competitive offering. The industry has certainly
| seen this pattern many times. But as far as AMD coming in and
| using gigantic memory counts as a competitive advantage? You
| have to keep in mind the die space constraints.
|
| Well over a decade ago - I think it was R600 - AMD did take
| this approach, and it was fairly disastrous because the logic
| performance of the chip wasn't good enough while the die was
| too big and hot and yields were too low. They didn't strike the
| right balance and sacrificed too much for a 512-bit memory bus.
|
| AMD has also tried to sidestep some of these limitations with
| HBM back when it was new technology, but that didn't work out
| for them either. They actually would have been better off just
| increasing bus width and continuing to use the most optimized
| and cost efficient commodity memory chips in that case.
|
| Data center and such may have a bit more freedom for innovation
| but the consumer space is definitely stuck on the paradigm of
| GPU plus nearby mem chips, and going outside of that fence is a
| huge latency hit.
| amluto wrote:
| > a wider bus width is required for larger memory counts,
| which is expensive when it comes to silicon area
|
| I find this constraint to be rather odd. An extra, say, three
| address bits would add very little space (or latency in a
| serial protocol) to a memory bus, and the actual problem
| seems to be that the current generation of memory chips are
| intended for point-to-point connection.
|
| It seems to me that, if the memory vendors aren't building
| physically larger, higher capacity chips, then any of the
| major players (AMD, Nvidia, Intel, whoever else is in this
| field right now) could kludge around it with a multiplexer. A
| multiplexer would need to be somewhat large, but its job
| would be simple enough that it should be doable with an
| older, cheaper process and without using entirely
| unreasonable amounts of power.
|
| So my assumption is this is mostly an economic issue. The
| vendors don't think it's worthwhile to do this.
| sroussey wrote:
| The bus widths they are talking about are multiples of 128. I
| think Apple M-series chips are good examples. They go from
| 128 to 256 to 512 bits, which happens to roughly track the
| bandwidth in gigabytes per second.
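|
| As a rough sketch of that arithmetic (the 8533 MT/s LPDDR5X
| transfer rate below is an assumption for illustration; exact
| speeds vary by generation):
|
|   # peak bandwidth ~= bus width (bits) / 8 * transfer rate (MT/s)
|   def peak_bandwidth_gb_s(bus_width_bits: int, rate_mt_s: int = 8533) -> float:
|       return bus_width_bits / 8 * rate_mt_s / 1000
|
|   for width in (128, 256, 512):
|       # prints ~137, ~273, ~546 GB/s respectively
|       print(f"{width}-bit bus -> ~{peak_bandwidth_gb_s(width):.0f} GB/s")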
| formerly_proven wrote:
| GDDR has been point-to-point since... I dunno, probably
| 2000? Because, all else being equal, you can't really have an
| actual _bus_ when you chase maximum bandwidth. Even the
| double-sided layouts (like T-layout, with <2mm stubs)
| typically incur a reduction in data rate. These chips also
| dissipate a fair amount of heat -- you're looking at around
| 5-8 W per chip (~6 pJ/bit) -- so it's not like you can just
| stack a bunch of those dies.
|
| > A multiplexer would need to be somewhat large, but its
| job would be simple enough that it should be doable with an
| older, cheaper process and without using entirely
| unreasonable amounts of power.
|
| I don't know what you're basing that on. We're talking
| about 32 Gbps serdes here. Yes, there's multiplexers even
| for that. But what good is deciding which memory chip you
| want to use on boot-up?
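|
| A quick sanity check on that per-chip power figure (the ~6
| pJ/bit, 32 Gbps per pin, and x32-per-chip numbers are the
| illustrative figures from this thread, not vendor specs):
|
|   energy_per_bit_j = 6e-12     # ~6 pJ/bit interface energy
|   pins_per_chip = 32           # typical x32 GDDR wiring
|   rate_per_pin_bps = 32e9      # 32 Gbps per data pin
|   watts = energy_per_bit_j * pins_per_chip * rate_per_pin_bps
|   print(f"~{watts:.1f} W per chip")   # ~6.1 W, within the 5-8 W ballpark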
| Dylan16807 wrote:
| > a wider bus width is required for larger memory counts
|
| Most video cards wire up 32 data pins to each memory chip.
| But GDDR chips already have full support for running 16 pins
| to each chip. And DDR commonly goes down to _4_ data pins per
| chip.
|
| The latest GDDR chips are 24Gbit, and at 16 bits each you
| could fit 48GB onto a nice easy 256 bit bus.
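|
| The arithmetic behind that 48GB figure, as a minimal sketch
| (chip density and per-chip wiring are the numbers from the
| comment above):
|
|   bus_width_bits = 256
|   bits_per_chip = 16          # each GDDR chip wired x16
|   chip_density_gbit = 24      # 24Gbit (3GB) GDDR die
|   num_chips = bus_width_bits // bits_per_chip        # 16 chips
|   total_gb = num_chips * chip_density_gbit / 8       # 48 GB
|   print(num_chips, "chips ->", total_gb, "GB")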
| enragedcacti wrote:
| > Of course, if it's not real VRAM, it needs to be at least
| somewhat close on the latency and bandwidth front
|
| It is close to _VRAM_ *, just not close to DRAM on a
| conventionally designed CPU. This thing is effectively just a
| GPU that fits in a CPU slot and has CPU cores bolted to the
| side. This approach has the downside of worse CPU performance
| and the upsides of orders of magnitude faster CPU<->GPU
| communication, simpler programming since coherency is handled
| for you, and access to substantial amounts of high bandwidth
| memory (up to 512GB with 4 MI300As).
|
| * https://chipsandcheese.com/p/microbenchmarking-nvidias-rtx-4...
| rbanffy wrote:
| I was curious because, given the latencies between the CCXs,
| the number of NUMA domains seems small.
| 0934u934y9g wrote:
| The problem with only providing VRAM is that some AI workloads,
| like real-time audio processing, underperform significantly
| because the hardware doesn't have the equivalent of tensor
| cores to keep up. There are LLMs that won't run for the same
| reason. You will have more than enough VRAM but not enough
| tensor cores. AMD isn't able to compete.
| therealpygon wrote:
| I wholeheartedly agree. Nvidia is intentionally suppressing the
| amount of memory on their consumer GPUs to prevent data centers
| from using consumer cards rather than their far more expensive
| counterparts. The fact that they used to offer the 3060 with
| 12GB but have now pushed pricing higher and limited many cards
| to 8GB is a testament to that. I don't need giga-TOPS with
| 8-16GB of memory; I'd be perfectly happy with half that speed
| but with 64GB of memory or more. Even slower memory would be
| fine. I don't need 1000 t/s, but being able to load a
| reasonably intelligent model even at 50 t/s would be great.
| lhl wrote:
| Getting to 50 tok/s for a big model requires not just memory,
| but also memory bandwidth. Currently, 1TB/s of MBW will get a
| 70B Q4 (~40GB) model to about 20-25 tok/s. The good thing is
| models continue to get smarter - today's 20-30B models beat
| out last year's 70B models on most tasks, and the biggest open
| models like DeepSeek-v3 might have lots of weights but
| actually a relatively reasonable # of activations per pass.
|
| You can test out your "half the speed but w/ 64GB or more of
| memory" scenario w/ the latest Macs, AMD Strix Halo, or the
| upcoming Nvidia Digits, though. I suspect by the middle of the
| year there will be a bunch of options in the ~$3K range.
| Personally, I think I'd rather go for 2 x 5090s for 64GB of
| memory at 1.7TB/s than 96 or 128GB w/ only 250GB/s of MBW.
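|
| For a rough feel of why MBW dominates token generation, a
| minimal sketch (the 85% efficiency factor is an assumption;
| it ignores KV-cache traffic and kernel overheads):
|
|   # decode is roughly bandwidth-bound: each generated token streams
|   # the quantized weights once, so tok/s ~= bandwidth / model size
|   def decode_tok_s(model_gb: float, mbw_gb_s: float, eff: float = 0.85) -> float:
|       return mbw_gb_s * eff / model_gb
|
|   print(decode_tok_s(40, 1000))   # ~21 tok/s: 70B Q4 (~40GB) at ~1 TB/s
|   print(decode_tok_s(40, 250))    # ~5 tok/s at ~250 GB/s unified memory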
| sroussey wrote:
| A Mac with that memory will have closer to 500GB/s but your
| point still stands.
|
| That said, if you just want to play around, having more
| memory will let you do more interesting things. I'd rather
| have that option over speed since I won't be doing
| production inference serving on my laptop.
| lhl wrote:
| Yeah, the M4 Max actually has pretty decent MBW - 546
| GB/s (cheapest config is $4.7K on a 14" MBP atm, but
| maybe there will be a Mac Studio at some point). The big
| weakness for the Mac is actually the lack of TFLOPS on
| the GPU - the beefiest maxes out at ~34 FP16 TFLOPS. It
| makes a lot of use cases super painful, since
| prefill/prompt processing can take _minutes_ before token
| generation starts.
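|
| A rough estimate of that prefill pain (the ~2 FLOPs per
| parameter per prompt token rule of thumb and the 50%
| utilization figure are assumptions for illustration):
|
|   def prefill_seconds(params_b: float, prompt_tokens: int,
|                       fp16_tflops: float, util: float = 0.5) -> float:
|       # prompt processing is compute-bound: ~2 * params * tokens FLOPs
|       flops = 2 * params_b * 1e9 * prompt_tokens
|       return flops / (fp16_tflops * 1e12 * util)
|
|   print(prefill_seconds(70, 8192, 34))    # ~67 s on ~34 TFLOPS (M4 Max class)
|   print(prefill_seconds(70, 8192, 200))   # ~11 s on a ~200 TFLOPS GPU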
| SecretDreams wrote:
| If, by the grace of tech Jesus, AMD gave us such systems at
| volumes Nvidia would notice, Nvidia would simply do the same
| but with a better ecosystem.
|
| The biggest problem for AMD is not that the majority of people
| want to use AMD. It is that the majority of people want AMD to
| be more competitive so that Nvidia will be forced to drop
| prices so that people can afford Nvidia products.
|
| Until this pattern changes, AMD has a big uphill battle. Same
| for Intel, except Intel is at least seemingly doing great
| gen/gen improvements in mid/low range consumer GPUs and
| bringing healthy vram along for the ride.
| holoduke wrote:
| It can change quickly. A great example is the ATI 9700, which
| crushed Nvidia for a short while.
| llm_trw wrote:
| The same could be said for CPUs from Intel and AMD 5 years
| ago. Now people, myself included, buy AMD because it is
| simply the better choice.
| neuroelectron wrote:
| >Still, core to core transfers are very rare in practice. I
| consider core to core latency test results to be just about
| irrelevant to application performance. I'm only showing test
| results here to explain the system topology.
|
| How exactly are "applications" developed for this? Or is that all
| proprietary knowledge? TinyBox has resorted to writing their own
| drivers for the 7900 XTX.
| latchkey wrote:
| ROCm is the stack that people write code against to talk to AMD
| hardware.
|
| George wrote some incomplete, non-performant drivers for a
| consumer grade product. Certainly not an easy task, but it also
| isn't something that most people would use. George just makes
| loud noises to get attention, but few in the HPC industry pay
| any attention to him.
| tucnak wrote:
| Nobody cares what the HPC industry has to say; until recently,
| they have happily been jerking off Monte-Carlo simulations on
| overpriced nation-grade supercomputer NUMA clusters and
| didn't know what a "GPU" was anyway! Also please stop
| spreading "consumer grade product" propaganda. I had used AMD
| Instinct MI50's--supposedly datacenter-grade hardware, and
| have faced the _exact_ same problems as George. Except in my
| case there was no call-line at Lisa 's.
|
| Guess what, the AI industry has spoken: hyper-scalers would
| buy NVIDIA, or rather design their own silicon. Anything,
| anyhow, but nothing to do with AMD.
|
| Also: if your business is doing so great, how come you're
| constantly in all these Hacker News threads talking and
| talking and talking but not actually releasing products of
| any kind, of any breed, that any of the hackers on here could
| use?
| latchkey wrote:
| > but not actually releasing products of any kind, of any
| breed, that any of the hackers on here could use?
|
| Our "product" is open access to a very specific type of HPC
| compute that previously was locked up and only available to
| a short list of researchers.
|
| Thanks for asking, we just added 1 GPU / 1 minute docker
| container access through our excellent partners:
| https://shadeform.ai
|
| 1 GPU / 1 VM / 1 minute is coming soon.
| tucnak wrote:
| From the looks of it, YOU ARE the product. That is,
| manufacturing optics of a "partner" and "distributor"
| ecosystem for AMD. And on borrowed time, too.
| latchkey wrote:
| > From the looks of it, YOU ARE the product.
|
| Sweet, thanks! That's at least part of what a CEO is
| supposed to be.
| tucnak wrote:
| Please don't be salty; the only person here who may
| embarrass you is yourself. I'm happy that you like to
| think about yourself as CEO, but perhaps it's worth
| reflecting you may be doing a better job if you had spent
| less time on Hacker News, and more time figuring out how
| to get Hacker News excited about your product? So far you
| have pledged allegiance to AMD every chance you got, and
| spun tall tales of great capability, with not much to
| show for it besides "partners." You know nobody has
| trained a thing with your GPU's yet? That would be a
| great place to start for a CEO. To make something people
| would use. To justify it to us; as AMD themselves have
| clearly justified your existence there's no work there!
|
| It's just tough words from a nobody, don't worry you'll
| be fine!
| latchkey wrote:
| > You know nobody has trained a thing with your GPUs
| yet?
|
| https://x.com/zealandic1/status/1877005338324427014
| neuroelectron wrote:
| Yes, ROCm is for the GPU, but the MI300A also includes 4
| clusters of CPUs connected by Infinity Fabric. Generally
| this kind of thing is handled by the OS, but there is no OS
| for this product.
| latchkey wrote:
| AMD has had APUs for years; the PS5 chip is an APU.
|
| I did a quick google search and found this presentation
| which details the programming model...
|
| https://nowlab.cse.ohio-state.edu/static/media/workshops/pre...
| mk_stjames wrote:
| So the 300A is an accelerator coupled with a full 24-core EPYC
| and 128GB of HBM all on a single chip (or, packaged chiplets,
| whatever).
|
| Why is it I can't buy a single one of these, on a motherboard, in
| a workstation format case, to use as an insane workstation?
| Assuming you could program for the accelerator part, there is an
| entire world of x86-fixed CAD, engineering, and entertainment
| industry (rendering, etc.) where people want a single desktop
| machine with 128GB+ of fast RAM to number crunch.
|
| There are Blender artists out there that build dual and quad
| RTX4090 machines with Threadrippers for $20k+ in components all
| day, because their render jobs pay for it.
|
| There are engineering companies that would not bat an eye at
| dropping $30k on a workstation if it meant they could spin
| around 80-gigabyte CATIA models of cars or aircraft loaded in
| RAM quicker. I know this at least because I sure as hell did,
| with several HP Z-series machines costing whole-Toyota-Corolla
| prices over the years...
|
| But these combined APU chips are relegated to these server units.
| In the end, is this a driver problem? Just a software problem? A
| chicken and egg problem where no one is developing the support
| because there isn't the hardware on the market, and there isn't
| the hardware on the market because AMD thinks there is no use
| case?
|
| Edit: and note the use cases I mentioned don't really rely on
| latency the way gamers need to hit framerates. The cache-miss
| latency mentioned in the article doesn't matter as much for
| these types of compute applications, where the main problem is
| just loading and unloading the massive amount of data. Things
| like offline renders and post-processing CFD simulations. Not
| necessarily a video output framerate.
| latchkey wrote:
| (I run a company that buys MI300x.)
|
| > _Why is it I can 't buy a single one of these, on a
| motherboard, in a workstation format case, to use as an insane
| workstation?_
|
| AMD doesn't have the resources to support end users for
| something like this. They are a public company, look at their
| spend. They are pouring everything they've got into trying to
| keep up with the Nvidia release cycle for AI chips.
|
| These chips are cutting edge, they are not perfect. They are
| still working through the hardware and software issues. It is
| hard enough to deal with all the public opinion on things as it
| is. Why would they add another layer of potential abuse?
| behnamoh wrote:
| AMD is done; no one uses their GPUs for AI because AMD was too
| dumb to understand the value of software lock-in the way Nvidia
| did with CUDA.
| guywhocodes wrote:
| More like the value of drivers that don't require one in-
| house team per customer to "fix" driver crashes in the
| customers' particular workloads.
| numpy-thagoras wrote:
| Yeah, the labour involved in running non-Nvidia equipment is
| the elephant in the room.
|
| Nvidia GPU: spin up OS, run your sims or load your LLM,
| gather results.
|
| AMD GPU: spin up OS, grok driver fixes, try and run your
| sims, grok more driver fixes, can't even gather results until
| you can verify software correctness of your fixes. Yeah,
| sometimes you need someone with specialized knowledge of
| numerical methods to help tune your fixes.
|
| ... What kind of maddening workflows are these? It's
| literally negative work: you are busy, you barely get
| anywhere, and you end up having to do more.
|
| In light of that, the Nvidia tax doesn't look so bad.
| ChuckMcM wrote:
| That is quite a thing. I've been out of the 'design loop' for
| chips like this for a while, so I don't know if they still do
| full-chip simulations prior to tapeout, but woah, trying to
| simulate that thing would take quite the compute complex in
| itself. Hats off to AMD for getting it out the door.
| erulabs wrote:
| It's interesting that two simultaneous and contradictory views are
| held by AI engineers:
|
| - Software is over
|
| - An impenetrable software moat protects Nvidia's market
| capitalization
___________________________________________________________________
(page generated 2025-01-18 23:00 UTC)