[HN Gopher] The AMD Radeon Instinct MI300A's Giant Memory Subsystem
       ___________________________________________________________________
        
       The AMD Radeon Instinct MI300A's Giant Memory Subsystem
        
       Author : pella
       Score  : 154 points
       Date   : 2025-01-18 12:28 UTC (10 hours ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | amelius wrote:
       | I'm curious why this space hasn't been patented to death.
        
         | hedora wrote:
         | It has been. All sides have a pile of patents. All sides
         | violate all the other sides' patents. If anyone sues, everyone
         | goes out of business.
         | 
         | This is the system working as currently intended. No matter
          | what happens, the lawyers will get rich.
         | 
         | If a small company comes in and doesn't pay the lawyers, it'll
         | get sued for violating the patents.
        
           | yvdriess wrote:
           | Yep, it's an area denial weapon.
           | 
           | You basically cannot do anything worthwhile in this space
            | without violating someone's patents. It's beneficial to
            | patent and corporate lawyers, but detrimental to innovation.
            | As an engineer you are asked not to look up existing
            | techniques or designs, as this will taint you legally.
        
             | kmeisthax wrote:
             | "Tainting" isn't a thing in patent law. All engineers
             | worldwide are tainted the moment the patent is published;
             | that's why parallel reinvention is not a defense to patent
             | infringement.
        
               | lhl wrote:
                | But you pay triple damages if you knowingly vs
                | unknowingly violate a patent (35 U.S.C. § 284). Of
               | course, everything is patented, so, engineers are just
               | told to not read patents.
        
           | amelius wrote:
           | > If a small company comes in and doesn't pay the lawyers,
           | it'll get sued for violating the patents.
           | 
           | This assumes the small company isn't just in it for the
           | patents.
        
           | WithinReason wrote:
           | Mutually assured destruction
        
             | amelius wrote:
             | Where do patent-trolls fit in this analogy?
        
       | btown wrote:
       | I've often thought that one of the places AMD could distinguish
       | itself from NVIDIA is bringing significantly higher amounts of
       | VRAM (or memory systems that are as performant as what we
       | currently know as VRAM) to the consumer space.
       | 
       | A card with a fraction of the FLOPS of cutting-edge graphics
       | cards (and ideally proportionally less power consumption), but
       | with 64-128GB VRAM-equivalent, would be a gamechanger for letting
       | people experiment with large multi-modal models, and seriously
       | incentivize researchers to build the next generation of tensor
       | abstraction libraries for both CUDA and ROCm/HIP. And for gaming,
        | you could break new ground on high-resolution textures. AMD
       | would be back in the game.
       | 
       | Of course, if it's not real VRAM, it needs to be at least
       | somewhat close on the latency and bandwidth front, so let's pop
       | on over and see what's happening in this article...
       | 
       | > An Infinity Cache hit has a load-to-use latency of over 140 ns.
       | Even DRAM on the AMD Ryzen 9 7950X3D shows less latency. Missing
       | Infinity Cache of course drives latency up even higher, to a
       | staggering 227 ns. HBM stands for High Bandwidth Memory, not low
       | latency memory, and it shows.
       | 
       | Welp. Guess my wish isn't coming true today.
        
         | formerly_proven wrote:
         | Totally normal latencies for a GPU though.
        
         | pkroll wrote:
         | You're not the only one thinking that:
         | https://www.nvidia.com/en-us/project-digits/
         | 
         | 128G of unified memory. $3K. Throw ollama and ComfyUI on that
          | sucker and things could get interesting. The question is how
          | much slower than a 5090 this is gonna be. The memory bandwidth
          | isn't going to match a 512-bit bus.
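          | 
          | Rough napkin math below (a minimal sketch; the 5090's 1792
          | GB/s follows from the announced 512-bit GDDR7 spec, while the
          | Digits memory config is unannounced, so the 256-bit LPDDR5X
          | line is pure speculation):
          | 
          |     # bandwidth (GB/s) = bus width (bits) / 8 * data rate (GT/s)
          |     def mem_bw_gbs(bus_bits, gt_per_s):
          |         return bus_bits / 8 * gt_per_s
          | 
          |     print(mem_bw_gbs(512, 28.0))   # 5090-class GDDR7 -> 1792 GB/s
          |     print(mem_bw_gbs(256, 8.533))  # speculative LPDDR5X -> ~273 GB/s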
        
           | lostmsu wrote:
           | AFAIK this uses even slower memory.
        
             | sroussey wrote:
             | And a fraction of the 5090 cores.
        
           | manojlds wrote:
           | It's LPDDR5.
        
           | KeplerBoy wrote:
           | It's going to be waaay slower than a 5090. We're looking at
           | something like 60W TDP for the entire system vs 600W for a
           | 5090 GPU.
           | 
           | It's going to be very energy efficient, it will get plenty of
           | flops, but they won't be able to cheat physics.
        
         | mpercival531 wrote:
         | They are. Strix Halo is going after that same space of Apple M4
         | Pro/Max where it is currently unchallenged. Pairing it with two
         | 64GB LPCAMM2 modules will get you there.
         | 
          | Edit: The problem with AMD is less the hardware offerings and
          | more that their compute software stack historically tends to
          | hand-wave consumer GPU support, or be very slow to deliver it
          | -- even more so with their APUs. Maybe the advent of MI300A
          | will change the equation, maybe not.
        
           | lhl wrote:
           | I don't know of any non-soldered memory Strix Halo devices,
           | but both HP and Asus have announced 128GB SKUs (availability
           | unknown).
           | 
           | For LLM inference, basically everything works w/ ROCm on
           | RDNA3 now (well, Flash Attention is via Triton and doesn't
           | have support for SWA and some other stuff; also I mostly test
           | on Linux, although I did check that the new WSL2 support
           | works). I've tested some older APUs w/ basic benchmarking as
            | well. Notes here for those interested:
            | https://llm-tracker.info/howto/AMD-GPUs
        
             | UncleOxidant wrote:
             | Thanks for that link. I'm interested in either getting the
             | HP Mini Z1 G1a or an NVidia Digits for LLM experimentation.
              | The obvious advantage for the Digits is that the CUDA
              | ecosystem is much more tried & true for that kind of
              | thing. The disadvantages are trying to use it as a
              | replacement for my current PC, the fact that it's going to
              | run an already-old version of Ubuntu (22.04), and being
              | dependent on Nvidia for updates.
        
               | lhl wrote:
               | Yeah, I think anyone w/ old Jetsons knows what it's like
               | to be left high and dry by Nvidia's embedded software
               | support. Older models are basically just ewaste. Since
               | the Digits won't be out until May, I guess there's enough
               | time to wait and see - at least to get a sense of what
               | the actual specs are. I have a feeling the FP16 TFLOPS
               | and the MBW are going to be much lower than what people
               | have been hyping themselves up for.
               | 
               | Sadly, my feeling is that the big Strix Halo SKUs (which
               | have no scheduled release dates) aren't going to be
               | competitively priced (they're likely to be at a big
               | FLOPS/real-world performance disadvantage, and there's
               | still the PITA factor), but there is something appealing
                | about the do-it-all aspect of it.
        
               | rbanffy wrote:
               | DIGITS looks like a serious attempt, but they don't have
               | too much of an incentive to have people developing for
                | older hardware. I wouldn't expect them to support it for
               | more than five years. At least the underlying Ubuntu will
               | last more than that and provide a viable work environment
               | far beyond the time it gets really boring.
        
               | UncleOxidant wrote:
               | If only they could get their changes upstreamed to Ubuntu
                | (and any kernel mods upstreamed), then we wouldn't
               | have to worry about it.
        
               | rbanffy wrote:
               | Getting their kernel mods upstreamed is very unlikely,
               | but they might provide just enough you can build a new
               | kernel with the same major version number.
        
               | KeplerBoy wrote:
                | Who said anything about Ubuntu 22.04? I mean, sure,
                | that's the newest release current JetPack comes with, but
                | I'd be surprised if they shipped Digits with that.
        
               | rbanffy wrote:
               | Doesn't DGX OS use the latest LTS version? Current should
               | be 24.04.
        
               | KeplerBoy wrote:
               | I wouldn't know. I only work with workstation or jetson
               | stuff.
               | 
               | The DGX documentation and downloads aren't public afaik.
               | 
               | Edit: Nevermind, some information about DGX is public and
               | they really are on 22.04, but oh well, the deep learning
               | stack is guaranteed to run.
               | 
               | https://docs.nvidia.com/base-os/too
        
         | Fade_Dance wrote:
         | Assuming we are comparing chips that are using the latest
         | generation/high density memory modules, a wider bus width is
         | required for larger memory counts, which is expensive when it
          | comes to silicon area. Therefore, if AMD is willing to boost
          | memory capacity as a competitive advantage, they may as well
          | consider using that die space for more logic gates instead.
         | It's a set of trade-offs and an optimization problem to some
         | degree.
         | 
         | That said, when an incumbent has a leadership advantage, one of
         | the obvious ways to boost profit is to slash the memory bus
         | width, and then a competitor can come in and bring it up a bit
         | and have a competitive offering. The industry has certainly
         | seen this pattern many times. But as far as AMD coming in and
         | using gigantic memory counts as a competitive advantage? You
         | have to keep in mind the die space constraints.
         | 
         | Well over a decade ago - I think it was R600 - AMD did take
         | this approach, and it was fairly disastrous because the logic
         | performance of the chip wasn't good enough while the die was
         | too big and hot and yields were too low. They didn't strike the
         | right balance and sacrificed too much for a 512-bit memory bus.
         | 
         | AMD has also tried to sidestep some of these limitations with
         | HBM back when it was new technology, but that didn't work out
         | for them either. They actually would have been better off just
         | increasing bus width and continuing to use the most optimized
         | and cost efficient commodity memory chips in that case.
         | 
         | Data center and such may have a bit more freedom for innovation
         | but the consumer space is definitely stuck on the paradigm of
         | GPU plus nearby mem chips, and going outside of that fence is a
         | huge latency hit.
        
           | amluto wrote:
           | > a wider bus width is required for larger memory counts,
           | which is expensive when it comes to silicon area
           | 
           | I find this constraint to be rather odd. An extra, say, three
           | address bits would add very little space (or latency in a
           | serial protocol) to a memory bus, and the actual problem
            | seems to be that the current generation of memory chips is
            | intended for point-to-point connection.
           | 
           | It seems to me that, if the memory vendors aren't building
           | physically larger, higher capacity chips, then any of the
           | major players (AMD, Nvidia, Intel, whoever else is in this
           | field right now) could kludge around it with a multiplexer. A
           | multiplexer would need to be somewhat large, but its job
           | would be simple enough that it should be doable with an
           | older, cheaper process and without using entirely
           | unreasonable amounts of power.
           | 
           | So my assumption is this is mostly an economic issue. The
           | vendors don't think it's worthwhile to do this.
        
             | sroussey wrote:
              | The bus widths they are talking about are multiples of
              | 128. I think Apple M series chips are good examples. They
              | go from 128 to 256 to 512 bits, and the width in bits
              | happens to land roughly at the bandwidth in GB/s.
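              | 
              | A minimal sketch of that rule of thumb (assuming
              | LPDDR5X-8533 on every tier, which only roughly matches
              | the lower-bus-width parts):
              | 
              |     # bus width (bits) * data rate (GT/s) / 8 = GB/s
              |     for bits in (128, 256, 512):
              |         print(bits, bits * 8.533 / 8)
              |     # -> ~137, ~273, ~546 GB/s: width in bits ~ GB/s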
        
             | formerly_proven wrote:
              | GDDR has been point-to-point since... I dunno, probably
              | 2000? Because, ceteris paribus, you can't really have an
              | actual _bus_ when you chase maximum bandwidth. Even the
              | double-sided layouts (like T-layout, with <2mm stubs)
              | typically incur a reduction in data rate. These chips also
              | dissipate a fair amount of heat -- you're looking at
              | around 5-8 W per chip (~6 pJ/bit) -- so it's not like you
              | can just stack a bunch of those dies.
             | 
             | > A multiplexer would need to be somewhat large, but its
             | job would be simple enough that it should be doable with an
             | older, cheaper process and without using entirely
             | unreasonable amounts of power.
             | 
             | I don't know what you're basing that on. We're talking
             | about 32 Gbps serdes here. Yes, there's multiplexers even
             | for that. But what good is deciding which memory chip you
             | want to use on boot-up?
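              | 
              | For reference, the per-chip wattage above falls straight
              | out of the energy-per-bit figure (a sketch assuming a
              | 32-pin chip at 32 Gbps/pin and ~6 pJ/bit):
              | 
              |     pins, gbps_per_pin, pj_per_bit = 32, 32, 6
              |     bits_per_s = pins * gbps_per_pin * 1e9
              |     watts = bits_per_s * pj_per_bit * 1e-12
              |     print(watts)  # ~6.1 W for the interface alone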
        
           | Dylan16807 wrote:
           | > a wider bus width is required for larger memory counts
           | 
           | Most video cards wire up 32 data pins to each memory chip.
           | But GDDR chips already have full support for running 16 pins
           | to each chip. And DDR commonly goes down to _4_ data pins per
           | chip.
           | 
           | The latest GDDR chips are 24Gbit, and at 16 bits each you
           | could fit 48GB onto a nice easy 256 bit bus.
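            | 
            | The arithmetic, for anyone checking (24Gbit chips in a
            | 16-bit-per-chip hookup are the assumptions):
            | 
            |     bus_bits, bits_per_chip, gbit_per_chip = 256, 16, 24
            |     chips = bus_bits // bits_per_chip      # 16 chips
            |     total_gb = chips * gbit_per_chip / 8   # 48.0 GB
            |     print(chips, total_gb)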
        
         | enragedcacti wrote:
         | > Of course, if it's not real VRAM, it needs to be at least
         | somewhat close on the latency and bandwidth front
         | 
          | It is close to _VRAM_*, just not close to DRAM on a
         | conventionally designed CPU. This thing is effectively just a
         | GPU that fits in a CPU slot and has CPU cores bolted to the
         | side. This approach has the downside of worse CPU performance
         | and the upsides of orders of magnitude faster CPU<->GPU
         | communication, simpler programming since coherency is handled
         | for you, and access to substantial amounts of high bandwidth
         | memory (up to 512GB with 4 MI300As).
         | 
          | * https://chipsandcheese.com/p/microbenchmarking-nvidias-rtx-4...
        
           | rbanffy wrote:
           | I was curious because given the latencies between the CCXs,
           | the number of NUMA domains seems small.
        
         | 0934u934y9g wrote:
          | The problem with only providing VRAM is that some AI
          | workloads, like real-time audio processing, underperform
          | significantly without the equivalent of tensor cores to keep
          | up. There are LLMs that won't run for the same reason. You
          | will have more than enough VRAM but not enough tensor cores.
          | AMD isn't able to compete.
        
         | therealpygon wrote:
          | I wholeheartedly agree. Nvidia is intentionally suppressing
          | the amount of memory on their consumer GPUs to prevent data
          | centers from using consumer cards rather than their far more
          | expensive counterparts. The fact that they used to offer the
          | 3060 with 12GB, but have now pushed pricing higher and limited
          | many cards to 8GB, is a testament to that. I don't need
          | giga-TOPS with 8-16GB of memory; I'd be perfectly happy with
          | half that speed but with 64GB of memory or more. Even slower
          | memory would be fine. I don't need 1000t/s, but being able to
          | load a reasonably intelligent model even at 50t/s would be
          | great.
        
           | lhl wrote:
           | Getting to 50 tok/s for a big model requires not just memory,
           | but also memory bandwidth. Currently, 1TB/s of MBW will get a
           | 70B Q4 (~40GB) model to about 20-25 tok/s. The good thing is
            | models continue to get smarter - today's 20-30B models beat
            | out last year's 70B models on most tasks and the biggest open
           | models like DeepSeek-v3 might have lots of weights, but
           | actually a relatively reasonable # of activations/pass.
           | 
            | You can test out that "half the speed but w/ 64GB or more of
            | memory" setup w/ the latest Macs, AMD Strix Halo, or the upcoming
           | Nvidia Digits, though. I suspect by the middle of the year
           | there will be a bunch of options in the ~$3K range.
           | Personally, I think I'd rather go for 2 x 5090s for 64GB of
           | memory at 1.7TB/s than 96 or 128GB w/ only 250GB/s of MBW.
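            | 
            | The rough ceiling math behind those numbers (batch-1 decode
            | is memory-bound, so tok/s can't exceed bandwidth divided by
            | bytes read per token; real-world lands a bit below):
            | 
            |     def max_tok_per_s(mbw_gb_s, model_gb):
            |         return mbw_gb_s / model_gb
            | 
            |     print(max_tok_per_s(1000, 40))  # 70B Q4 @ 1 TB/s -> 25
            |     print(max_tok_per_s(1700, 40))  # 2 x 5090 -> ~42
            |     print(max_tok_per_s(250, 40))   # 250 GB/s box -> ~6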
        
             | sroussey wrote:
             | A Mac with that memory will have closer to 500GB/s but your
             | point still stands.
             | 
             | That said, if you just want to play around, having more
             | memory will let you do more interesting things. I'd rather
             | have that option over speed since I won't be doing
             | production inference serving on my laptop.
        
               | lhl wrote:
               | Yeah, the M4 Max actually has pretty decent MBW - 546
               | GB/s (cheapest config is $4.7K on a 14" MBP atm, but
               | maybe there will be a Mac Studio at some point). The big
               | weakness for the Mac is actually the lack of TFLOPS on
               | the GPU - the beefiest maxes out at ~34 FP16 TFLOPS. It
               | makes a lot of use cases super painful, since
               | prefill/prompt processing can take _minutes_ before token
               | generation starts.
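                | 
                | A back-of-envelope for the prefill pain (a sketch
                | assuming ~2 FLOPs per parameter per token and 50%
                | utilization, both hand-wavy):
                | 
                |     params, prompt_toks = 70e9, 8192
                |     flops_avail = 34e12 * 0.5   # ~34 FP16 TFLOPS at 50%
                |     secs = 2 * params * prompt_toks / flops_avail
                |     print(secs)  # ~67s before the first token appears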
        
         | SecretDreams wrote:
          | If, by the grace of tech Jesus, AMD gave us such systems at
         | volumes Nvidia would notice, Nvidia would simply then do the
         | same but with a better ecosystem.
         | 
         | The biggest problem for AMD is not that the majority of people
         | want to use AMD. It is that the majority of people want AMD to
         | be more competitive so that Nvidia will be forced to drop
         | prices so that people can afford Nvidia products.
         | 
         | Until this pattern changes, AMD has a big uphill battle. Same
         | for Intel, except Intel is at least seemingly doing great
         | gen/gen improvements in mid/low range consumer GPUs and
         | bringing healthy vram along for the ride.
        
           | holoduke wrote:
            | It can change quickly. A great example is the ATI Radeon
            | 9700, which crushed Nvidia for a while.
        
           | llm_trw wrote:
            | The same could be said for CPUs from Intel and AMD 5 years
           | ago. Now people, myself included, buy AMD because it is
           | simply the better choice.
        
       | neuroelectron wrote:
       | >Still, core to core transfers are very rare in practice. I
       | consider core to core latency test results to be just about
       | irrelevant to application performance. I'm only showing test
       | results here to explain the system topology.
       | 
       | How exactly are "applications" developed for this? Or is that all
       | proprietary knowledge? TinyBox has resorted to writing their own
        | drivers for the 7900 XTX.
        
         | latchkey wrote:
         | ROCm is the stack that people write code against to talk to AMD
         | hardware.
         | 
          | George wrote some incomplete, non-performant drivers for a
          | consumer grade product. Certainly not an easy task, but it
         | isn't something that most people would use. George just makes
         | loud noises to get attention, but few in the HPC industry pay
         | any attention to him.
        
           | tucnak wrote:
            | Nobody cares what the HPC industry has to say; until recently,
           | they have happily been jerking off Monte-Carlo simulations on
           | overpriced nation-grade supercomputer NUMA clusters and
           | didn't know what a "GPU" was anyway! Also please stop
            | spreading "consumer grade product" propaganda. I had used AMD
            | Instinct MI50's -- supposedly datacenter-grade hardware --
            | and faced the _exact_ same problems as George. Except in my
            | case there was no call-line at Lisa's.
           | 
           | Guess what, the AI industry has spoken: hyper-scalers would
            | buy NVIDIA, or rather design their own silicon. Anything,
            | anyhow, but nothing to do with AMD.
           | 
           | Also: if your business is doing so great, how come you're
           | constantly in all these Hacker News threads talking and
           | talking and talking but not actually releasing products of
            | any kind, of any breed, that any of the hackers on here could
           | use?
        
             | latchkey wrote:
             | > but not actually releasing products of any kind, of any
              | breed, that any of the hackers on here could use?
             | 
             | Our "product" is open access to a very specific type of HPC
             | compute that previously was locked up and only available to
             | a short list of researchers.
             | 
             | Thanks for asking, we just added 1 GPU / 1 minute docker
             | container access through our excellent partners:
             | https://shadeform.ai
             | 
             | 1 GPU / 1 VM / 1 minute is coming soon.
        
               | tucnak wrote:
               | From the looks of it, YOU ARE the product. That is,
                | manufacturing optics of a "partner" and "distributor"
               | ecosystem for AMD. And on borrowed time, too.
        
               | latchkey wrote:
               | > From the looks of it, YOU ARE the product.
               | 
               | Sweet, thanks! That's at least part of what a CEO is
               | supposed to be.
        
               | tucnak wrote:
               | Please don't be salty; the only person here who may
               | embarrass you is yourself. I'm happy that you like to
                | think of yourself as a CEO, but perhaps it's worth
                | reflecting that you might do a better job if you spent
                | less time on Hacker News, and more time figuring out how
                | to get Hacker News excited about your product. So far you
                | have pledged allegiance to AMD every chance you got, and
                | spun tall tales of great capability, with not much to
                | show for it besides "partners." You know nobody has
                | trained a thing with your GPUs yet? That would be a
                | great place to start for a CEO: to make something people
                | would use. To justify it to us; as AMD themselves have
                | clearly justified your existence, there's no work there!
               | 
               | It's just tough words from a nobody, don't worry you'll
               | be fine!
        
               | latchkey wrote:
                | > You know nobody has trained a thing with your GPUs
                | yet?
               | 
               | https://x.com/zealandic1/status/1877005338324427014
        
           | neuroelectron wrote:
            | Yes, ROCm is for the GPU, but the MI300A also includes 3
            | clusters of CPUs (CCDs) connected by Infinity Fabric.
            | Generally this kind of thing is handled by the OS, but there
            | is no OS for this product.
        
             | latchkey wrote:
              | AMD has had APUs for years; the PS5 chip is an APU.
             | 
             | I did a quick google search and found this presentation
             | which details the programming model...
             | 
              | https://nowlab.cse.ohio-state.edu/static/media/workshops/pre...
        
       | mk_stjames wrote:
       | So the 300A is an accelerator coupled with a full 24-core EPYC
       | and 128GB of HBM all on a single chip (or, packaged chiplets,
       | whatever).
       | 
       | Why is it I can't buy a single one of these, on a motherboard, in
       | a workstation format case, to use as an insane workstation?
       | Assuming you could program for the accelerator part, there is an
       | entire world of x86-fixed CAD, engineering, and entertainment
        | industry (rendering, etc.) where people want a single desktop
        | machine with 128GB+ of fast RAM to number crunch.
       | 
       | There are Blender artists out there that build dual and quad
       | RTX4090 machines with Threadrippers for $20k+ in components all
       | day, because their render jobs pay for it.
       | 
        | There are engineering companies that would not bat an eye at
        | dropping $30k on a workstation if it meant they could spin
        | around 80-gigabyte CATIA models of cars or aircraft loaded in
        | RAM quicker. I know this at least because I sure as hell did,
        | with several HP Z-series machines costing whole-Toyota-Corolla
        | prices over the years...
       | 
       | But these combined APU chips are relegated to these server units.
        | In the end, is this a driver problem? Just a software problem? A
       | chicken and egg problem where no one is developing the support
       | because there isn't the hardware on the market, and there isn't
       | the hardware on the market because AMD thinks there is no use
       | case?
       | 
        | Edit: and note my use cases mentioned don't rely on latency the
        | way videogamers need to hit framerates. The cache-miss latency
        | mentioned in the article doesn't matter as much for these types
        | of compute applications, where the main problem is just loading
        | and unloading massive amounts of data. Things like offline
        | renders and post-processing CFD simulations. Not necessarily a
        | video output framerate.
        
         | latchkey wrote:
         | (I run a company that buys MI300x.)
         | 
         | > _Why is it I can 't buy a single one of these, on a
         | motherboard, in a workstation format case, to use as an insane
         | workstation?_
         | 
         | AMD doesn't have the resources to support end users for
         | something like this. They are a public company, look at their
         | spend. They are pouring everything they've got into trying to
         | keep up with the Nvidia release cycle for AI chips.
         | 
         | These chips are cutting edge, they are not perfect. They are
         | still working through the hardware and software issues. It is
         | hard enough to deal with all the public opinion on things as it
         | is. Why would they add another layer of potential abuse?
        
       | behnamoh wrote:
        | AMD is done; no one uses their GPUs for AI because AMD was too
        | dumb to understand the value of software lock-in like Nvidia did
       | with CUDA.
        
         | guywhocodes wrote:
          | More like the value of drivers that don't require one in-
          | house team per customer to "fix" driver crashes in the
          | customers' particular workloads.
        
           | numpy-thagoras wrote:
            | Yeah, the labour involved in running non-Nvidia equipment is
            | the elephant in the room.
           | 
           | Nvidia GPU: spin up OS, run your sims or load your LLM,
           | gather results.
           | 
           | AMD GPU: spin up OS, grok driver fixes, try and run your
           | sims, grok more driver fixes, can't even gather results until
           | you can verify software correctness of your fixes. Yeah,
           | sometimes you need someone with specialized knowledge of
           | numerical methods to help tune your fixes.
           | 
           | ... What kind of maddening workflows are these? It's
           | literally negative work: you are busy, you barely get
           | anywhere, and you end up having to do more.
           | 
           | In light of that, the Nvidia tax doesn't look so bad.
        
       | ChuckMcM wrote:
       | That is quite a thing. I've been out of the 'design loop' for
        | chips like this for a while, so I don't know if they still do
        | full-chip simulations prior to tapeout, but woah, trying to
        | simulate that thing would take quite the compute complex in
        | itself. Hats off to AMD for getting it out the door.
        
       | erulabs wrote:
        | It's interesting that two simultaneous and contradictory views
        | are held by AI engineers:
       | 
       | - Software is over
       | 
       | - An impenetrable software moat protects Nvidia's market
       | capitalization
        
       ___________________________________________________________________
       (page generated 2025-01-18 23:00 UTC)