[HN Gopher] Testing AMD's Giant MI300X
       ___________________________________________________________________
        
       Testing AMD's Giant MI300X
        
       Author : klelatti
       Score  : 167 points
       Date   : 2024-06-25 15:36 UTC (7 hours ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | pella wrote:
       | from the summary:
       | 
       |  _" When it is all said and done, MI300X is a very impressive
       | piece of hardware. However, the software side suffers from a
       | chicken-and-egg dilemma. Developers are hesitant to invest in a
       | platform with limited adoption, but the platform also depends on
       | their support. Hopefully the software side of the equation gets
       | into ship shape. Should that happen, AMD would be a serious
       | competitor to NVIDIA."_
        
         | firebaze wrote:
         | OTOH, the performance advantage compared to the H100 is super-
         | impressive according to tfa. Things could become interesting
         | again in the GPU market.
        
         | bearjaws wrote:
          | IMO Nvidia is going to force companies to fix this. Nvidia has
          | always made it clear it will raise prices and capture 90% of
          | your profits when left free to do so; see any example from the
          | GPU vendor space. There isn't infinite money to be spent per
          | token, so it's not like the AI companies can just raise prices.
          | 
          | That AMD can offer this product at a 40% discount and still
          | make money tells you all you need to know.
        
           | amelius wrote:
           | I'm personally wondering when nVidia will open an AI
           | AppStore, and every app that runs on nVidia hardware will
           | have to be notarized first, and you'll have to pay 30% of
           | your profits to nVidia.
           | 
           | History has shown that this idea is not as crazy as it
           | sounds.
        
       | latchkey wrote:
       | The news you've all been waiting for!
       | 
       | We are thrilled to announce that Hot Aisle Inc. proudly
       | volunteered our system for Chips and Cheese to use in their
       | benchmarking and performance showcase. This collaboration has
       | demonstrated the exceptional capabilities of our hardware and
       | further highlighted our commitment to cutting-edge technology.
       | 
       | Stay tuned for more exciting updates!
        
         | JonChesterfield wrote:
         | Thank you for loaning the box out! Has a lot more credibility
         | than the vendor saying it runs fast
        
           | latchkey wrote:
           | Thanks Jon, that's exactly the idea. About $12k worth of free
           | compute on a box that costs as much as a Ferrari.
           | 
           | Funny that HN doesn't like my comment for some reason though.
        
             | renewiltord wrote:
              | It reads like the kind of chumbox PR you find at the bottom
              | of a random website. Get a copywriter or something like
             | writer.ai. I thought your comment was spam and nearly
             | flagged it. It really is atrocious copy.
        
               | jampekka wrote:
               | I thought it was sarcastic.
        
             | logicallee wrote:
             | [retracted]
        
               | latchkey wrote:
               | This is the news that many people have been waiting for
               | and we do have more exciting updates coming. There is
               | another team on the system now doing testing. We have a
               | list of 22 people currently waiting.
        
               | logicallee wrote:
               | okay, I've retracted my comments. Thanks for your
               | generosity.
        
               | latchkey wrote:
               | Thanks, but I wouldn't call it generosity. We're helping
               | AMD build a developer flywheel and that is very much to
               | our benefit. The more developers using these chips, the
               | more chips that are needed, the more we buy to rent out,
               | the more our business grows.
               | 
               | Previously, this stuff was only available to HPC
               | applications. We're trying to get these into the hands of
               | more developers. Our view is that this is a great way to
               | foster the ecosystem.
               | 
               | Our simple and competitive pricing reflects this as well.
        
               | klelatti wrote:
               | Do you think this comment will make Hot Aisle more or
               | less likely to loan out their hardware in the future?
               | 
               | Personally, I couldn't care less about the quality of
               | copy. I do care about having access to similar hardware
               | in the future.
        
               | latchkey wrote:
               | Heh, I didn't even think of that, but you make a good
               | point. Don't worry though, we will keep the access
               | coming. I hate to say it, but it literally is... stay
               | tuned for more exciting updates.
        
               | klelatti wrote:
               | Thanks so much for doing that. There are loads of people
               | here who really appreciate it. We will stay tuned!
        
             | alecco wrote:
              | Don't sweat it. Some people are trigger-happy about
              | downvoting anything that looks like self-promotion, given
              | the sheer amount of spam everywhere. Your sponsorship (?)
              | is the right way to promote your company. Thank you.
        
       | elorant wrote:
       | Even if the community provides support it could take years to
       | reach the maturity of CUDA. So while it's good to have some
       | competition, I doubt it will make any difference in the immediate
       | future. Unless some of the big corporations in the market lean in
       | heavily and support the framework.
        
         | JonChesterfield wrote:
          | Something like Triton from Microsoft/OpenAI as a CUDA bypass?
          | Or PyTorch/TensorFlow targeting ROCm without user intervention.
          | 
          | Or there's OpenMP or HIP. In extremis, OpenCL.
          | 
          | I think the language stack is fine at this point. The moat
          | isn't in CUDA the tech. It's in code running reliably on
          | Nvidia's stack, without things like stray pointers needing a
          | machine reboot. Hard to know how far off robust ROCm is at this
          | point.
        
         | jeroenhd wrote:
         | If, and that's a big if, AMD can get ROCm working well for this
         | chip, I don't think this will be a big problem.
         | 
          | ROCm can be spotty, especially on consumer cards, but many
          | models do seem to work on their more expensive cards. It may
          | be worth spending a few hours/days/weeks to work around the
          | peculiarities of ROCm given the cost difference between AMD
          | and Nvidia in this market segment.
         | 
         | This all stands or falls with how well AMD can get ROCm to
         | work. As this article states, it's nowhere near ready yet, but
         | one or two updates can turn AMD's accelerators from "maybe in
         | 5-10 years" to "we must consider this next time we order
         | hardware".
         | 
         | I also wonder if AMD is going to put any effort into ROCm (or a
         | similar framework) as a response to Qualcomm and other ARM
         | manufacturers creaming them on AI stuff. If these Copilot PCs
         | take off, we may see AMD invest into their AI compatibility
         | libraries because of interest from both sides.
        
           | entropicdrifter wrote:
           | If we're extremely lucky they might invest in SYCL and we'll
           | see an Intel/AMD open-source teamup
        
           | pbalcer wrote:
           | > Qualcomm and other ARM manufacturers creaming them on AI
           | stuff
           | 
           | That's mostly on Microsoft's DirectML though. I'm not sure
           | whether AMD's implementation is based on ROCm (doubt it).
        
           | jjoonathan wrote:
           | So I knew that AMD's compute stack was a buggy mess -- nobody
           | starts out wanting to pay more for less and I had to learn
           | the hard way how big of a gap there was between AMD's paper
           | specs and their actual offerings -- and I also knew that
           | Nvidia had a huge edge at the cutting edge of things, if you
           | need gigashaders or execution reordering or whatever, but ML
           | isn't any of that. The calculations are "just" matrix
           | multiplication, or not far off.
           | 
           | I would have thought AMD could have scrambled to fix their
           | bugs, at least the matmul related ones, scrambled to shore up
           | torch compatibility or whatever was needed for LLM training,
           | and pushed something out the door that might not have been
           | top-of-market but could at least have taken advantage of the
           | opportunity provided by 80% margins from team green. I
           | thought the green moat was maybe a year wide and tens of
           | millions deep (enough for a team to test the bugs, a team to
           | fix the bugs, time to ramp, and time to make it happen). But
           | here we are, multiple years and trillions in market cap delta
           | later, and AMD still seems to be completely non-viable. What
           | happened? Did they go into denial about the bugs? Did they
           | fix the bugs but the industry still doesn't trust them?
        
             | JonChesterfield wrote:
             | It's roughly that the AMD tech works reasonably well on HPC
             | and less convincingly on "normal" hardware/systems. So a
             | lot of AMD internal people think the stack is solid because
             | it works well on their precisely configured dev machines
             | and on the commercially supported clusters.
             | 
             | Other people think it's buggy and useless because that's
             | the experience on some other platforms.
             | 
             | This state of affairs isn't great. It could be worse but it
             | could certainly be much better.
        
           | latchkey wrote:
           | https://stratechery.com/2024/an-interview-with-amd-ceo-
           | lisa-...
           | 
           | "One of the things that you mentioned earlier on software,
           | very, very clear on how do we make that transition super easy
           | for developers, and one of the great things about our
           | acquisition of Xilinx is we acquired a phenomenal team of
           | 5,000 people that included a tremendous software talent that
           | is right now working on making AMD AI as easy to use as
           | possible."
        
             | jjoonathan wrote:
             | Oh no. Ohhhh nooooo. No, no, no!
             | 
             | Xilinx dev tools are awful. They are the ones who had
             | Windows XP as the only supported dev environment for a
             | product with guaranteed shipments through 2030. I saw
             | Xilinx defend this state of affairs for over a decade. My
             | entire FPGA-programming career was born, lived, and died,
             | long after XP became irrelevant but before Xilinx moved
             | past it, although I think they finally gave in some time
             | around 2022. Still, Windows XP through 2030, and if you
             | think that's bad wait until you hear about the actual
             | software. These are not role models of dev experience.
             | 
             | In my, err, uncle? post I said that I was confused about
             | where AMD was in the AI arms race. Now I know. They really
             | are just this dysfunctional. Yikes.
        
               | pbalcer wrote:
               | Xilinx made triSYCL (https://github.com/triSYCL/triSYCL),
                | so maybe there's some chance AMD invests in first-class
                | support for SYCL (an open standard from Khronos). That'd
               | be nice. But I don't have much hope.
        
             | paulmd wrote:
             | this is honestly a very enlightening interview because - as
             | pointed out at the time - Lisa Su is basically _repeatedly_
             | asked about software and every single time she blatantly
             | dodges the question and tries to steer the conversation
             | back to her comfort-zone on hardware.
             | https://news.ycombinator.com/item?id=40703420
             | 
             | > He tries to get a comment on the (in hindsight) not great
             | design tradeoffs made by the Cell processor, which was hard
             | to program for and so held back the PS3 at critical points
             | in its lifecycle. It was a long time ago so there's been
             | plenty of time to reflect on it, yet her only thought is
             | "Perhaps one could say, if you look in hindsight,
             | programmability is so important". That's it! In hindsight,
             | programmability of your CPU is important! Then she
             | immediately returns to hardware again, and saying how proud
             | she was of the leaps in hardware made over the PS
             | generations.
             | 
             | > He asks her if she'd stayed at IBM and taken over there,
             | would she have avoided Gerstner's mistake of ignoring the
             | cloud? Her answer is "I don't know that I would've been on
             | that path. I was a semiconductor person, I am a
             | semiconductor person." - again, she seems to just reject on
             | principle the idea that she would think about software,
             | networking or systems architecture because she defines
             | herself as an electronics person.
             | 
             | > Later Thompson tries harder to ram the point home, asking
             | her "Where is the software piece of this? You can't just be
             | a hardware cowboy ... What is the reticence to software at
             | AMD and how have you worked to change that?" and she just
             | point-blank denies AMD has ever had a problem with
             | software. Later she claims everything works out of the box
             | with AMD and seems to imply that ROCm hardly matters
             | because everyone is just programming against PyTorch
             | anyway!
             | 
             | > The final blow comes when he asks her about ChatGPT. A
             | pivotal moment that catapults her competitor to absolute
             | dominance, apparently catching AMD unaware. Thompson asks
             | her what her response was. Was she surprised? Maybe she
              | realized this was an all-hands-on-deck moment? What did
             | NVIDIA do right that you missed? Answer: no, we always knew
             | and have always been good at AI. NVIDIA did nothing
             | different to us.
             | 
             | > The whole interview is just astonishing. Put under
             | pressure to reflect on her market position, again and again
             | Su retreats to outright denial and management waffle about
             | "product arcs". It seems to be her go-to safe space. It's
             | certainly possible she just decided to play it all as low
             | key as possible and not say anything interesting to protect
             | the share price, but if I was an analyst looking for signs
             | of a quick turnaround in strategy there's no sign of that
             | here.
             | 
             | not expecting a heartfelt postmortem about how things got
             | to be this bad, but you can very easily _make this question
             | go away too_ , simply by acknowledging that it's a focus
             | and you're working on driving change and blah blah. you
             | really don't have to worry about crushing some analyst's
             | mindshare on AMD's software stack because nobody is crazy
             | enough to think that AMD's software isn't horrendously
             | behind at the present moment.
             | 
             | and frankly that's literally how she's governed as far as
             | software too. ROCm is barely a concern. Support
             | base/install base, obviously not a concern. DLSS
             | competitiveness, obviously not a concern. Conventional
             | gaming devrel: obviously not a concern. She wants to ship
             | the hardware and be done with it, but that's not how
              | products are built and released in the 2020s anymore.
             | 
             | NVIDIA is out here building integrated systems that you
             | build your code on and away you go. They run NVIDIA-written
             | CUDA libraries, NVIDIA drivers, on NVIDIA-built networks
             | and stacks. AMD can't run the sample packages in ROCm
             | stably (as geohot discovered) on a supported configuration
             | of hardware/software, even after hours of debugging just to
             | get it that far. AMD doesn't even think drivers/runtime is
             | a thing they should have to write, let alone a software
             | library for the ecosystem.
             | 
             | "just a small family company (bigger than NVIDIA, until
             | very recently) who can't possibly afford to hire developers
             | for all the verticals they want to be in". But like, they
             | spent $50b on a single acquisition, they spent $12b in
             | stock buybacks over 2 years, they have money, just not for
             | _this_.
        
         | kd913 wrote:
         | You do know that Microsoft, Oracle, Meta are all in on this
         | right?
         | 
         | Heck I think it is being used to run ChatGPT 3.5 and 4
         | services.
        
           | 0cf8612b2e1e wrote:
           | On the other hand, AMD has had a decade of watching CUDA eat
           | their lunch and done basically nothing to change the
           | situation.
        
             | bee_rider wrote:
             | AMD tries to compete in hardware with Intel's CPUs and
             | Nvidia's GPUs. They have to slack somewhere, and software
             | seems to be where. It isn't any surprise that they can't
             | keep up on every front, but it does mean they can freely
             | bring in partners whose core competency is software and
             | work with them without any caveats.
             | 
             | Not sure why they haven't managed to execute on that yet,
             | but the partners must be pretty motivated now, right? I'm
             | sure they don't love doing business at Nvidia's leisure.
        
               | bobsondugnut wrote:
               | when was the last time AMD hardware was keeping up with
               | NVIDIA? 2014?
        
               | 0cf8612b2e1e wrote:
               | Been a while since AMD had the top tier offering, but it
               | has been trading blows in the middle tier segment the
               | entire time. If you are just looking for a gamer card (ie
               | not max AI performance), the AMD is typically cheaper and
               | less power hungry than the equivalent Nvidia.
        
               | bobsondugnut wrote:
               | > the AMD is typically cheaper and less power hungry than
               | the equivalent Nvidia
               | 
               | cheaper is true, but less power hungry is absolutely not
               | true, which is kind of my point.
        
               | dralley wrote:
               | It was true with RDNA 2. RDNA 3 regressed on this a bit,
               | supposedly there was a hardware hiccup that prevented
               | them from hitting frequency and voltage targets that they
               | were hoping to reach.
               | 
               | In any case they're only slightly behind, not crazy far
               | behind like Intel is.
        
               | aurareturn wrote:
               | It's trading blows because AMD sells their cards at lower
               | margins in the midrange and Nvidia lets them.
        
               | bee_rider wrote:
               | The MI300X sounds like it is competitive, haha
        
               | bobsondugnut wrote:
                | Competitive with the H100 for inference: a two-year-old
                | product, and just one half of the ML story. The H200
                | (and potentially the B100) is the appropriate
                | comparison, based on when they ship in volume.
        
           | softfalcon wrote:
            | I feel like people forget that AMD has huge contracts with
            | Microsoft, Valve, Sony, etc. to design consoles at scale.
            | It's an invisible provider; most folks don't even realize
            | their Xbox and their PlayStation are both AMD.
            | 
            | When you're shipping chip designs at that scale, it makes a
            | lot more sense that companies would be willing to try a more
            | affordable alternative to Nvidia hardware.
            | 
            | My bet is that AMD figures out a serviceable solution for
            | some (not all) workloads that isn't groundbreaking, but is
            | affordable to the clients that want an alternative. That's
            | usually how this goes for AMD in my experience.
        
             | sangnoir wrote:
              | If you read/listen to the Stratechery interview with Lisa
              | Su, she spelled out being open to customizing AMD hardware
              | to meet partners' needs. So if Microsoft needs more memory
             | bandwidth and less compute, AMD will build something just
             | for them based on what they have now. If Meta wants 10%
             | less power consumption (and cooling) for a 5% hit in
             | compute, AMD will hear them out too. We'll see if that
             | hardware customization strategy works outside of consoles.
        
             | Rinzler89 wrote:
             | _> I feel like people forget that AMD has huge contracts
             | with Microsoft, Valve, Sony, etc to design consoles at
             | scale. _
             | 
              | Nobody forgot that. It's just that those console chips have
              | super low margins, which is why Intel and Nvidia stopped
              | catering to that market after the Xbox/PS3 generations and
              | only AMD
             | took it up because they were broke and every penny mattered
             | to them.
             | 
             | Nvidia did a brief stint with the Shield/Switch because
             | they were trying to get into the Android/ARM space and also
             | kinda gave up due to the margins.
        
           | adabyron wrote:
           | I have read in a few places that Microsoft is using AMD for
           | inference to run ChatGPT. If I recall they said the
           | price/performance was better.
           | 
           | I'm curious if that's just because they can't get enough
           | Nvidia GPUs or if the price/performance is actually that much
           | better.
        
             | atq2119 wrote:
             | Most likely it really is better overall.
             | 
             | Think of it this way: AMD is pretty good at hardware, so
             | there's no reason to think that the raw difference in terms
             | of flops is significant in either direction. It may go in
             | AMD's favor sometimes and Nvidia's other times.
             | 
             | What AMD traditionally couldn't do was software, so those
             | AMD GPUs are sold at a discount (compared to Nvidia),
             | giving you better price/performance _if you can use them_.
             | 
             | Surely Microsoft is operating GPUs at large enough scale
             | that they can pay a few people to paper over the software
             | deficiencies so that they _can_ use the AMD GPUs and still
             | end up ahead in terms of overall price /performance.
        
         | singhrac wrote:
         | The problem is that we all have a lot of FUD (for good
          | reasons). It's on AMD to solve that problem publicly. They
         | need to make it easier to understand what is supported so far
         | and what's not.
         | 
          | For example, for bitsandbytes (a common dependency in the LLM
          | world) there's a ROCm fork that the AMD maintainers are trying
          | to merge in
         | (https://github.com/TimDettmers/bitsandbytes/issues/107).
         | Meanwhile an Intel employee merged a change that made a common
         | device abstraction (presumably usable by AMD + Apple + Intel
         | etc.).
         | 
          | There's a lot of that right now - a super popular package that
          | is CUDA-only navigating how to make it work correctly with any
          | other accelerator. We just need more information on what is
         | supported.
        
       | JonChesterfield wrote:
       | Fantastic to see.
       | 
       | The MI300X does memory bandwidth better than anything else by a
       | ridiculous margin, up and down the cache hierarchy.
       | 
       | It did not score very well on global atomics.
       | 
       | So yeah, that seems about right. If you manage to light up the
       | hardware, lots and lots of number crunching for you.
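        | 
        | For anyone wondering what a "global atomics" test looks like:
        | roughly, every thread hammering one counter in device memory. A
        | minimal HIP sketch (illustrative only, not the article's actual
        | benchmark; compile with hipcc):
        | 
        |     #include <hip/hip_runtime.h>
        |     #include <cstdio>
        | 
        |     // Every thread repeatedly increments the same counter in
        |     // device ("global") memory, so the adds serialize on one
        |     // location.
        |     __global__ void atomic_hammer(unsigned long long* c, int iters) {
        |         for (int i = 0; i < iters; ++i)
        |             atomicAdd(c, 1ULL);
        |     }
        | 
        |     int main() {
        |         unsigned long long* counter;
        |         hipMalloc((void**)&counter, sizeof(*counter));
        |         hipMemset(counter, 0, sizeof(*counter));
        | 
        |         hipEvent_t start, stop;
        |         hipEventCreate(&start);
        |         hipEventCreate(&stop);
        |         hipEventRecord(start, 0);
        |         atomic_hammer<<<1024, 256>>>(counter, 1000);
        |         hipEventRecord(stop, 0);
        |         hipEventSynchronize(stop);
        | 
        |         float ms = 0.0f;
        |         hipEventElapsedTime(&ms, start, stop);
        |         unsigned long long total = 0;
        |         hipMemcpy(&total, counter, sizeof(total),
        |                   hipMemcpyDeviceToHost);
        |         printf("%llu atomic adds in %.3f ms\n", total, ms);
        | 
        |         hipFree(counter);
        |         return 0;
        |     }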
        
       | jsheard wrote:
       | All eyes are of course on AI, but with 192GB of VRAM I wonder if
       | this or something like it could be good enough for high end
       | production rendering. Pixar and co still use CPU clusters for all
       | of their final frame rendering, even though the task is
       | ostensibly a better fit for GPUs, mainly because their memory
       | demands have usually been so far ahead of what even the biggest
       | GPUs could offer.
       | 
       | Much like with AI, Nvidia has the software side of GPU production
       | rendering locked down tight though so that's just as much of an
       | uphill battle for AMD.
        
         | Havoc wrote:
          | I'd imagine ray tracing is a bit easier to parallelize over
          | lots of older cards. The computations aren't as heavily linked
          | and are more fault tolerant. So I doubt anyone is paying
          | H100-style premiums.
        
           | bryanlarsen wrote:
           | Pixar is paying a massive premium; they probably are using an
           | order of magnitude or two more CPUs than they would if they
           | could use GPUs. Using a hundred CPUs in place of a single
            | H100 is a greater-than-H100-style premium.
        
             | imjonse wrote:
             | Would Pixar's existing software run on GPUs without much
             | work?
        
               | jsheard wrote:
               | It does already, at least on Nvidia GPUs: https://rmanwik
               | i.pixar.com/pages/viewpage.action?mobileBypas...
               | 
               | They currently only use the GPU mode for quick iteration
               | on relatively small slices of data though, and then
               | switch back to CPU mode for the big renders.
        
           | jsheard wrote:
           | The computations are easily parallelized, sure, but the data
           | feeding those computations isn't easily partitioned. Every
           | parallel render node needs as much memory as a lone render
           | node would, and GPUs typically have nowhere near enough for
           | the highest of high end productions. Last I heard they were
           | putting around 128GB to 256GB of RAM in their machines and
           | that was a few years ago.
        
         | PaulHoule wrote:
         | One missed opportunity from the game streaming bubble would be
         | a 20-or-so player game where one big machine draws everything
         | for everybody and streams it.
        
           | bob1029 wrote:
           | Stuff like this is still of interest to me. There are some
           | really compelling game ideas that only become possible once
           | you look into modern HPC platforms and streaming.
        
             | PaulHoule wrote:
              | My son and I have wargamed it a bit. The trouble is that
              | open-world and other complex single-player games use a
              | huge box of tricks for conserving RAM, and those tricks
              | compete with just having a huge amount of RAM, so it is
              | not so clear the huge SMP machine with a huge GPU really
              | comes out ahead in terms of creating a revolution in
              | gaming.
             | 
             | In the case of Stadia, however, failing to develop this was
             | like a sports team not playing any home games. One way of
             | thinking about the current crisis of the games industry and
             | VR is that building 3-d worlds is too expensive and a major
             | part of it is all the shoehorning tricks the industry
             | depends on. Better hardware for games could be about
             | lowering development cost as opposed to making fancier
             | graphics but that tends to be a non-starter with companies
             | whose core competence is getting 1000 highly-paid
             | developers to struggle with difficult to use tools and the
             | idea you could do the same with 10 ordinary developers is
             | threatening to them.
        
               | bob1029 wrote:
               | I am thinking beyond the scale of any given machine and
               | traditional game engine architectures.
               | 
               | I am thinking of an entire datacenter purpose-built to
               | host a single game world, with edge locations handling
               | the last mile of client-side prediction, viewport
               | rendering, streaming and batching of input events.
               | 
               | We already have a lot of the conceptual architecture
                | figured out in places like the NYSE and CBOE: processing
               | hundreds of millions of events in less than a second on a
               | single CPU core against one synchronous view of _some_
               | world. We can do this with insane reliability and
               | precision day after day. Many of the technology
               | requirements that emerge from the single instance WoW
               | path approximate what we have already accomplished in
               | other domains.
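                | 
                | Roughly the shape I have in mind, as a minimal
                | single-threaded sketch (types and event fields are made
                | up for illustration):
                | 
                |     #include <cstdint>
                |     #include <queue>
                |     #include <unordered_map>
                | 
                |     // One thread owns the entire world state, so every
                |     // event sees one synchronous, consistent view and
                |     // no locks are needed on the hot path.
                |     struct InputEvent { uint64_t player_id; int8_t dx, dy; };
                |     struct PlayerState { int32_t x = 0, y = 0; };
                | 
                |     struct World {
                |         std::unordered_map<uint64_t, PlayerState> players;
                |         void apply(const InputEvent& e) {
                |             auto& p = players[e.player_id];
                |             p.x += e.dx;
                |             p.y += e.dy;
                |         }
                |     };
                | 
                |     int main() {
                |         World world;
                |         // In practice this queue is fed by edge/network
                |         // threads that batch client inputs.
                |         std::queue<InputEvent> inbox;
                |         inbox.push({42, 1, 0});
                |         inbox.push({7, 0, -1});
                | 
                |         // The hot loop: drain events in arrival order
                |         // against the single authoritative world.
                |         while (!inbox.empty()) {
                |             world.apply(inbox.front());
                |             inbox.pop();
                |         }
                |         return 0;
                |     }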
        
               | rcxdude wrote:
               | EVE online is more or less the closest to this so far, so
               | it may be worth learning lessons from them (though I
               | wouldn't suggest copying their approach: their stackless
               | python behemoth codebase appears to contain many a
               | horror). It's certainly a hard problem though, especially
               | when you have a concentration of large numbers of players
               | (which is inevitable when you create such a game world).
        
             | ganzuul wrote:
             | Curious what that is. Some kind of AR physics simulation?
             | 
              | I have been thinking about whether the compute could go
              | right in cellphone towers, but this would take it up a
             | notch.
        
           | ThrowawayTestr wrote:
           | Stadia was supposed to allow for really big games distributed
           | across a cluster. Too bad it died in the crib.
        
           | Nexxxeh wrote:
           | It would immediately prevent several classes of cheating. No
           | more wallhacks or ESP.
           | 
           | Ironically the main type that'd still exist would be the
           | vision-based external AI-powered target-highlighting and
           | aim/fire assist.
           | 
           | The display is analysed and overlaid with helpful info (like
           | enemies highlighted) and/or inputs are assisted (snap to
           | visible enemies, and/or automatically pull trigger.)
        
         | JackYoustra wrote:
         | It's probably implemented way differently, but I worry about
         | the driver suitability. Gaming benchmarks at least perform
          | substantially worse on AI accelerators than on GPUs that are
          | many generations old; I wonder if this extends to custom
          | graphics code too.
        
         | Arelius wrote:
         | I work in this field, and I think so. This is actually the
         | project I'm currently working on.
         | 
         | I'm betting with current hardware and some clever tricks, we
        | can resolve full production frames at real-time rates.
        
       | Pesthuf wrote:
       | I feel like these huge graphics cards with insane amounts of RAM
       | are the moat that AI companies have been hoping for.
       | 
       | We can't possibly hope to run the kinds of models that run on
       | 192GB of VRAM at home.
        
         | dmbaggett wrote:
         | For inference you could use a maxed-out Mac Ultra; the RAM is
         | shared between the CPU and GPU.
        
           | alecco wrote:
           | For single user (batch_size = 1), sure. But that is quite
           | expensive in $/tok.
        
         | jsheard wrote:
         | Apple will gladly sell you a GPU with 192GB of memory, but your
         | wallet won't like it.
        
           | kbenson wrote:
           | Won't Nvidia, and Intel, and Qualcomm, and Falanx (who make
           | the ARM Mali GPUs from what I can see), and Imagination
           | Technologies (PowerVR) do the same? They each make a GPU, and
            | if you pay them enough money I have a hard time believing
            | they won't figure out how to slap enough RAM on a board for
            | one of their existing products and make whatever changes
           | are required.
        
           | nextaccountic wrote:
            | The US government is looking into heavily limiting the
            | availability of high-end GPUs from now on. And the biggest
            | and most effective bottleneck for AI right now is VRAM.
            | 
            | So maybe Apple is happy to sell huge GPUs like that, but the
            | government will probably put them under export controls like
            | the A100 and H100 already are.
        
             | rbanffy wrote:
             | Cue the PowerMac G4 TV ad.
             | 
             | https://youtu.be/lb7EhYy-2RE
        
           | rbanffy wrote:
           | OTOH, it comes free with one of the finest Unix workstations
           | ever made.
        
             | Rinzler89 wrote:
             | Which Unix workstation?
        
               | coolspot wrote:
               | They are referring to MacOS being included with expensive
               | Mac hardware.
        
         | tonetegeatinst wrote:
          | On the contrary, I'd argue the opposite. GPU VRAM has gotten
          | faster but the density isn't that good. 8GB used to be high
          | end, yet now 16GB can't even run games that well, especially
          | if it's a studio that loves VRAM.
          | 
          | Side note: as someone who has been into machine learning for
          | over 10 years, let me tell ya, us hobbyists and researchers
          | hunger for compute and memory.
          | 
          | VRAM isn't everything... I am well aware, but certain
          | workflows really do benefit from heaps of VRAM, like VFX and
          | CAD and CFD. I still hold onto the dream of upgradable GPUs,
          | where I could upgrade the different components just like you
          | do on a computer: if the computer is slow, you upgrade the RAM
          | or storage or get a faster chip that uses the same socket.
          | GPUs could possibly see modularity with the processor, the
          | VRAM, etc.
          | 
          | Level1Techs has some great videos about how PCIe is the
          | future, where we can connect systems together using raw PCIe
          | lanes, which is similar to how Nvidia Blackwell servers
          | communicate with other servers in the rack.
        
           | immibis wrote:
           | Wasn't that just because of Nvidia's market segmentation?
        
         | phkahler wrote:
         | >> We can't possibly hope to run the kinds of models that run
         | on 192GB of VRAM at home.
         | 
         | I'm looking to build a mini-ITX system with 256GB of RAM for my
         | next build. DDR5 spec can support that in 2 modules, but nobody
        | makes them yet. No need for a GPU; I'm looking at the AMD APUs,
        | which are getting into the 50 TOPS range. But yes, RAM seems to
         | be the limiting factor. I'm a little surprised the memory
         | companies aren't pushing harder for consumers to have that
         | capacity.
        
           | auspiv wrote:
           | 128GB DDR5 module -
           | https://store.supermicro.com/us_en/supermicro-
           | hynix-128gb-28...
           | 
           | It is of course RDIMM, but you didn't specify what memory
           | type you were looking at.
        
       | tonetegeatinst wrote:
        | I wonder if the human body could grow artificial kidneys so that
        | I can just sell infinite kidneys and manage to afford a couple
       | of these so I can do AI training on my own hardware.
        
         | pca006132 wrote:
         | why not infinite brains so you can have more computational
         | power than these GPUs?
        
         | Maken wrote:
          | Apparently one of those costs around $15K. I don't know if you
          | can buy a couple or if they only sell those in massive
          | batches, but in any case, how many human kidneys do you need
          | to sell to get $30K?
        
       | Filligree wrote:
       | So just out of curiosity, what does this thing cost?
        
         | latchkey wrote:
         | Pricing is strictly NDA. AMD does not give it out.
        
         | sva_ wrote:
         | The rumors say $20k. Nothing official though.
        
       | pheatherlite wrote:
       | Without first-class CUDA translation or cross compile, AMD is
       | just throwing more transistors at the void
        
         | chung8123 wrote:
          | I agree they need to work on their software, but I also think
          | that given the limited availability and massive expense of the
          | H100, AMD can undercut Nvidia and build a developer ecosystem
          | if they wanted to. I think they need to hit the consumer
          | market pretty hard and get all the local llama people hacking
          | up the software and drivers to make things work. A cheaper
          | large-VRAM consumer card would go a long way toward getting a
          | developer ecosystem behind them.
        
         | epistasis wrote:
         | Given the number of people who need the compute but are only
         | accessing it via APIs like HuggingFace's transformers library,
         | which supports these chips, I don't really think that CUDA
         | support is absolutely essential.
         | 
         | Most kernels are super quick to rewrite, and higher level
         | abstractions like PyTorch and JAX make dealing with CUDA a
         | pretty rare experience for most people making use of large
         | clusters and small installs. And if you have the money to build
         | a big cluster, you can probably also hire the engineers to port
         | your framework to the right AMD library.
         | 
         | The world has changed a lot!
         | 
         | The bigger challenge is that if you are starting up, why in the
         | world would you give yourself the additional challenge of going
          | off the beaten path? It's not just CUDA but the whole
         | infrastructure of clusters and networking that really gives
         | NVIDIA an edge, in addition to knowing that they are going to
         | stick around in the market, whereas AMD might leave it
         | tomorrow.
        
       | alkonaut wrote:
        | Good. If there is even a slight suspicion that the best value
        | will be team red in 5 or 10 years, then CUDA already looks a lot
        | less attractive today.
        
       | w-m wrote:
        | Impressions from last week's CVPR, a computer vision conference
        | with 12k attendees - pretty much everyone is using NVIDIA GPUs,
        | pretty much nobody is happy with the prices, and everyone would
        | like some competition in the space:
       | 
       | NVIDIA was there with 57 papers, a website dedicated to their
       | research presented at the conference, a full day tutorial on
       | accelerating deep learning, and ever present with shirts and
       | backpacks in the corridors and at poster presentations.
       | 
       | AMD had a booth at the expo part, where they were raffling off
        | some GPUs. I went up to them to ask what framework I should look
        | into for writing GPGPU kernels (ideally from Python). They
        | referred me to the "technical guy", who it turns out had a demo
        | of LLM inference. Which he couldn't show me, as the laptop
       | with the APU had crashed and wouldn't reboot. He didn't know
       | about writing kernels, but told me there was a compiler guy who
       | might be able to help, but he wasn't to be found at that moment,
       | and I couldn't find him when returning to the booth later.
       | 
        | I'm not at all happy with this situation. As long as AMD's
        | investment in software and evangelism remains at ~$0, I don't
       | see how any hardware they put out will make a difference. And
       | you'll continue to hear people walking away from their booth,
       | saying "oh when I win it I'm going to sell it to buy myself an
       | NVIDIA GPU".
        
         | cstejerean wrote:
         | Completely agree. It's been 18 years since Nvidia released
         | CUDA. AMD has had a long time to figure this out so I'm amazed
         | at how they continue to fumble this.
        
           | bryanlarsen wrote:
           | 10 years ago they were basically broke and bet the farm on
           | Zen. That bet paid off. I doubt a bet on CUDA would have paid
           | off in time to save the company. They definitely didn't have
           | the resources to split that bet.
        
           | jsheard wrote:
           | It's not like the specific push for AI on GPUs came out of
          | nowhere either; Nvidia first shipped cuDNN in 2014.
        
           | kimixa wrote:
           | CUDA of 18 years ago is _very_ different to CUDA of today.
           | 
           | Back then AMD/ATI were actually at the forefront on the GPGPU
            | side - things like the early Brook language and CTM led
            | pretty quickly into things like OpenCL. Lots of work went on
            | using the Xbox 360 GPU in _real_ games for GPGPU tasks.
            | 
            | But CUDA steadily improved iteratively, and AMD kinda just...
            | stopped developing their equivalents? Considering a good part
            | of that time they were near bankruptcy, it might not have
            | been surprising though.
            | 
            | But saying Nvidia solely kicked off everything with CUDA is
            | rather ahistorical.
        
             | yvdriess wrote:
              | Yep! I used BrookGPU for my GPGPU master's thesis, before
              | CUDA was a thing. AMD lacked follow-through on the software
             | side as you said, but a big factor was also NV handing out
             | GPUs to researchers.
        
             | userabchn wrote:
             | > CUDA of 18 years ago is very different to CUDA of today.
             | 
             | I've been writing CUDA since 2008 and it doesn't seem that
             | different to me. They even still use some of the same
             | graphics in the user guide.
        
             | dagw wrote:
             | _AMD kinda just... stopped developing their equivalents?_
             | 
              | It wasn't so much that they stopped developing, rather that
              | they kept throwing everything out and coming out with new
              | and non-backwards-compatible replacements. I knew people
              | working in the GPU compute field back in those days who
              | were trying to support both AMD/ATI and Nvidia. While their
              | CUDA code just worked from release to release and every new
              | release of CUDA just got better and better, AMD kept coming
              | up with new breaking APIs and forcing rewrite after rewrite
              | until they just gave up and dropped AMD.
        
           | dragontamer wrote:
           | 10 years ago AMD was selling its own headquarters so that it
           | could stave off bankruptcy for another few weeks
           | (https://arstechnica.com/information-
           | technology/2013/03/amd-s...).
           | 
            | AMD's software investments began in earnest a few years ago,
            | but AMD really did progress more than pretty much everyone
            | else aside from Nvidia, IMO.
            | 
            | AMD further made a few bad decisions where they "split the
            | bet", relying upon Microsoft and others to push software
            | forward. (I did like C++ AMP for what it's worth.) The
            | underpinnings of C++ AMP led to Boltzmann which led to ROCm,
            | which then needed to be ported away from C++ AMP and into
            | CUDA-like HIP.
            | 
            | So it's a bit of a misstep there for sure. But it's not like
            | AMD has been dilly-dallying. And for what it's worth, I would
            | have personally preferred C++ AMP (a standardized C++11 way
            | to represent GPU functions as []-lambdas rather than CUDA-
            | specific <<<extensions>>>). Obviously everyone else disagrees
            | with me, but there's some elegance to
            | parallel_for_each([](param1, param2){magically a GPU function
            | executing in parallel}), where the compiler figures out the
            | details of how to get param1 and param2 from CPU RAM into the
            | GPU (or you use GPU-specific allocators to place
            | param1/param2 in the GPU's address space already to bypass
            | the automagic).
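            | 
            | For the curious, a minimal sketch of the C++ AMP style being
            | described (Microsoft's old <amp.h> API, MSVC-only and since
            | deprecated):
            | 
            |     #include <amp.h>
            |     #include <vector>
            | 
            |     int main() {
            |         std::vector<float> data(1024, 1.0f);
            | 
            |         // array_view handles the CPU<->GPU copies for you.
            |         concurrency::array_view<float, 1> av((int)data.size(),
            |                                              data);
            | 
            |         // The restrict(amp) lambda *is* the GPU function; no
            |         // <<<grid, block>>> launch syntax, the runtime derives
            |         // the dispatch shape from av.extent.
            |         concurrency::parallel_for_each(av.extent,
            |             [=](concurrency::index<1> i) restrict(amp) {
            |                 av[i] = av[i] * 2.0f;
            |             });
            | 
            |         av.synchronize();  // copy results back to the host
            |         return 0;
            |     }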
        
         | jacoblambda wrote:
         | > As long as AMDs investment into software and evangelism
         | remains at ~$0
         | 
          | Last time I checked they have been trying to hire a ton of
          | software engineers to improve the applied stacks (CV, ML,
          | DSP, compute, etc.) at their office near where I live.
         | 
         | It seems like there's a big push to improve the stacks but
         | given that less than 10 years ago they were practically at
         | death's door it's not terribly surprising that their software
         | is in the state it is. It's been getting better gradually but
          | quality software doesn't just show up overnight, and especially
         | so when things are as complex and arcane as they are in the GPU
         | world.
        
           | benreesman wrote:
           | With margins that high?
           | 
           | There is always financing, there are always people willing to
           | go to the competitor at some wage, there is always a way if
           | the leadership wants to.
           | 
           | If it was just a straight up fab bottleneck? Yeah maybe you
           | buy that for a year or two.
           | 
           | "During Q1, Nvidia reported $5.6 billion in cost of goods
           | sold (COGS). This resulted in a gross profit of $20.4
           | billion, or a margin profile of 78.4%."
           | 
           | That's called an "induced market failure".
        
             | almostgotcaught wrote:
             | > With margins that high? There is always financing, there
             | are always people willing to go to the competitor at some
             | wage, there is always a way if the leadership wants to.
             | 
              | People love to pop off on stuff they don't really know
              | anything about. Let me ask you: what financing do you
              | imagine is available? Like literally what financing do you
              | propose for a publicly traded company? Like do you realize
              | they can't actually issue new shares without putting it to
              | a shareholder vote? Should they issue bonds? No, I know,
              | they should run an ICO!!!
              | 
              | And then what margins exactly? Do you know what the margin
              | is on the MI300? No. Do you know whether they're currently
              | selling at a loss to win market share? No.
              | 
              | I would be the happiest boy if HN, in addition to policing
              | jokes and memes, could police arrogance.
        
               | JohnPrine wrote:
               | Are you saying that companies lose the ability to secure
               | financing once they go public?
        
               | almostgotcaught wrote:
              | Of course not - I mentioned 3 routes to securing further
              | financing. Did you read about those 3 routes in my
              | comment?
        
         | qaq wrote:
         | Well if Mojo and Modular Max Platform take off I guess there
         | will be a path for AMD
        
         | monkeydust wrote:
        | As more of a business person than an engineer, help me
        | understand why AMD is not getting this; what's the
        | counterargument? Is CUDA just too far ahead, or are they lacking
        | the right people in senior leadership roles to see this through?
        
           | cyanydeez wrote:
            | CUDA is a software moat. If you want to use any GPU other
            | than Nvidia's, you need to double your engineering budget
            | because there are no easy-to-bootstrap projects at any
            | level. The hardware prices are meaningless if you need a
            | $200k engineer, if they even exist, just to bootstrap a
            | product.
        
             | rbanffy wrote:
             | Depending on your hardware budget, the engineering one can
             | look like a rounding error.
        
               | cyanydeez wrote:
                | Sure, but then you're still on the side of NVIDIA
                | because you have the budget.
        
               | sangnoir wrote:
               | Why give any additional money to Nvidia when you can
               | announce more profits (or get more compute if you're a
               | government agency) by hiring more engineers to enable AMD
               | hardware for less than a few million per year? It's not
               | like Microsoft loves the idea of handing over money to
               | Nvidia if there is a cheaper alternative that can make
               | $MSFT go up.
        
           | noelwelsh wrote:
           | Leadership lacking vision + being almost bankrupt until
           | relatively recently.
        
           | hedgehog wrote:
            | As another commenter points out, their strategy appears to
            | be to focus on HPC clients, where AMD can concentrate after-
            | sale software support around a relatively small number of
            | customer requests. This gets them some sales while avoiding
            | the level of organizational investment necessary to build a
            | software platform that can support NVIDIA-style broad
            | compatibility and a good out-of-the-box experience.
        
           | dagw wrote:
           | CUDA is very far ahead. Not only technically, but in
           | mindshare. Developers trust CUDA and know that investing in
            | CUDA is a future-proof investment. AMD has had so many API
            | changes over the years that no one trusts them any more. If
           | you go all in on AMD, you might have to re-write all your
           | code in 3-5 years. AMD can promise that this won't happen,
           | but it's happened so many times already that no one really
           | believes them.
           | 
           | Another problem is simply that hiring (and keeping) top
           | talent is really really hard. If you're smart enough to be a
            | lead developer of AMD's core machine learning libraries, you
            | can probably get hired at any number of other places, so why
            | choose AMD?
           | 
           | I think the leadership gets it and understand the importance,
           | I just don't think they (or really anybody) knows how to come
           | up with a good plan to turn things around quickly. They're
           | going to have to commit to at least a 5 year plan and lose
           | money each of those 5 years, and I'm not sure they can or
           | even want to fight that battle.
        
             | martinpw wrote:
             | > Another problem is simply that hiring (and keeping) top
             | talent is really really hard.
             | 
             | Absolutely. And when your mandate for this top talent is
             | going to be "go and build something that basically copies
             | what those other guys have already built", it is even
             | harder to attract them, when they can go any place they
             | like and work on something new.
             | 
             | > I think the leadership gets it and understand the
             | importance, I just don't think they (or really anybody)
             | knows how to come up with a good plan to turn things around
             | quickly.
             | 
             | Yes, it always puzzles me when people think nobody at AMD
             | actually sees the problem. Of course they see it. Turning a
             | large company is incredibly hard. Leadership can give
             | direction, but there is so much baked in momentum, power
             | structures, existing projects and interests, that it is
             | really tough to change things.
        
           | alecco wrote:
           | > are they lacking the right people in senior leadership
           | roles to see this through?
           | 
           | Just like Intel, they have an outdated culture. IMHO they
           | should start a software Skunk Works isolated from the company
           | and have the software guys guide the hardware features. Not
           | the other way around.
           | 
           | I wouldn't bet money on either of them doing this. Hopefully
           | some other smaller, modern, and flexible companies can try
           | it.
        
         | jwuphysics wrote:
         | Have you looked into TinyCorp [0]/tinygrad [1], one of the
         | latest endeavors by George Hotz? I've been pretty impressed by
         | the performance. [2]
         | 
         | [0] https://tinygrad.org/ [1]
         | https://github.com/tinygrad/tinygrad [2]
         | https://x.com/realGeorgeHotz/status/1800932122569343043?t=Y6...
        
           | arghwhat wrote:
            | He also shakes his fist at the software stack, but loudly
            | enough that it gets AMD to react to it.
        
           | anthonix1 wrote:
           | I have not been impressed by the perf. Slower than PyTorch
           | for LLMs, and PyTorch is actually stable on AMD (I've trained
            | 7B/13B models)... so the stability issues seem to be more of a
           | tinygrad problem and less of an AMD problem, despite George's
           | ramblings [0][1]
           | 
           | [0] https://github.com/tinygrad/tinygrad/issues/4301 [1]
           | https://x.com/realAnthonix/status/1800993761696284676
        
         | slavik81 wrote:
         | MIVisionX is probably the library you want for computer vision.
         | As for kernels, you would generally write HIP, which is very
         | similar to CUDA. To my knowledge, there's no equivalent to cupy
         | for writing kernels in Python.
         | 
         | For what it's worth, your post has cemented my decision to
         | submit a few conference talks. I've felt too busy writing code
         | to go out and speak, but I really should make time.
        
           | hyperbovine wrote:
           | The equivalent to cupy is ... cupy:
           | 
           | https://docs.cupy.dev/en/v13.2.0/install.html#using-cupy-
           | on-...
        
             | slavik81 wrote:
             | Oh cool! It appears that I've already packaged cupy's
             | required dependencies for AMD GPU support in the Debian 13
             | 'main' and Ubuntu 24.04 'universe' repos. I also extended
             | the enabled architectures to cover all discrete AMD GPUs
             | from Vega onwards (aside from MI300, ironically). It might
             | be nice to get python3-cupy-rocm added to Debian 13 if this
             | is a library that people find useful.
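
          A minimal sketch of what "writing kernels in Python" can look
          like through CuPy, assuming the ROCm build of CuPy mentioned
          above is installed and an AMD GPU is visible; on ROCm the
          kernel source below is assumed to be built through HIP's
          runtime compiler rather than NVRTC:

              import cupy as cp

              # CUDA-flavoured kernel source; CuPy compiles it for the
              # active GPU (NVIDIA or, on the ROCm build, AMD) at first
              # launch.
              vec_add = cp.RawKernel(r'''
              extern "C" __global__
              void vec_add(const float* a, const float* b, float* out,
                           int n) {
                  int i = blockDim.x * blockIdx.x + threadIdx.x;
                  if (i < n) out[i] = a[i] + b[i];
              }
              ''', 'vec_add')

              n = 1 << 20
              a = cp.random.rand(n, dtype=cp.float32)
              b = cp.random.rand(n, dtype=cp.float32)
              out = cp.empty_like(a)
              threads = 256
              blocks = (n + threads - 1) // threads
              vec_add((blocks,), (threads,), (a, b, out, cp.int32(n)))
              cp.testing.assert_allclose(out, a + b)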
        
         | sangnoir wrote:
         | > I'm not at all happy with this situation. As long as AMDs
         | investment into software and evangelism remains at ~$0, I don't
         | see how any hardware they put out will make a difference.
         | 
          | It appears AMD's initial strategy is courting the HPC crowd
          | and hyperscalers: they have big budgets and lower support
          | overhead, and they are willing and able to write code that
          | papers over AMD's not-great software while appreciating the
          | lower-than-Nvidia TCO. I think this incremental strategy is
          | sensible, considering where most of the money is.
          | 
          | As a first mover, Nvidia had to start from the bottom up;
          | CUDA used to run only/mostly on consumer GPUs. AMD is going
          | top-down, starting with high-margin DC hardware, before
          | trickling down to rack-level users, and eventually to APUs as
          | revenue growth allows more re-investment.
        
           | antupis wrote:
            | That is the wrong move. Personally, I would start with the
            | local LLM/llama folks who crave more memory and build up
            | from there.
        
             | sangnoir wrote:
             | Seeing that they don't have a mature software stack, I
             | think for now AMD would prefer one customer who brings in
             | $10m revenue over 10'000 customers at $1000 a pop.
        
           | landryraccoon wrote:
           | They're making the wrong strategic play.
           | 
           | They will fail if they go after the highest margin customers.
           | Nvidia has every advantage and every motivation to keep those
           | customers. They would need a trillion dollars in capital to
           | have a chance imho.
           | 
            | It would be like trying to go after Intel in the early
            | 2000s by targeting server CPUs, or going after the desktop
            | operating system market in the 90s against Microsoft. It's
            | aiming at your competition where they are strongest and you
            | are weakest.
            | 
            | Their only chance to disrupt is to try to get some of the
            | customers that Nvidia doesn't care about, like consumer-
            | level inference / academic or hobbyist models. Intel failed
            | when it got beaten in a market it didn't care about, i.e.
            | mobile / low-power devices.
        
             | acchow wrote:
             | > They would need a trillion dollars in capital to have a
             | chance imho.
             | 
              | All AMD would really need is for Nvidia's innovation to
              | stall. Which, with many of their engineers coasting on
              | $10M annual compensation, seems not too far-fetched.
        
               | sangnoir wrote:
                | AMD can go toe-to-toe with Nvidia on hardware
                | innovation. What AMD has realised (correctly, IMO) is
                | that all they need is for _hyperscalers_ to match or
                | come close to Nvidia on software innovation on AMD
                | hardware - Amazon/Meta/Microsoft engineers can get
                | their foundation models running on MI300X well enough
                | for their needs - and CUDA is not much of a moat in
                | that market segment, where there are dedicated AI-
                | infrastructure teams. If the price is right, they may
                | shift some of those CapEx dollars from Nvidia to AMD.
                | Few AI practitioners - and even fewer LLM consumers -
                | care about the libraries underpinning
                | torch/numpy/high-level-python-framework/$LLM-service,
                | as long as it works.
        
             | Certhas wrote:
              | This is a common sentiment, no doubt also driven by the
              | wish that AMD would cater to us.
             | 
             | But I see no evidence that the strategy is wrong or
             | failing. AMD is already powering a massive and rapidly
             | growing share of Top 500 HPC:
             | 
             | https://www.top500.org/statistics/treemaps/
             | 
             | AMD compute growth isn't in places where people see it, and
             | I think that gives a wrong impression. (Or it means people
             | have missed the big shifts over the last two years.)
        
               | frognumber wrote:
               | I see a lot of evidence, in the form of a rising moat for
               | NVidia.
        
               | jampekka wrote:
               | It would be interesting to see how much these
               | "supercomputers" are actually used, and what parts of
               | them are used.
               | 
               | I use my university's "supercomputer" every now and then
               | when I need lots of VRAM, and there are rarely many other
               | users. E.g. I've never had to queue for a GPU even though
               | I use only the top model, which probably should be the
               | most utilized.
               | 
               | Also, I'd guess there can be nvidia cards in the grid
               | even if "the computer" is AMD.
               | 
               | Of course it doesn't matter for AMD whether the compute
               | is actually used or not as long as it's bought, but lots
               | of theoretical AMD flops standing somewhere doesn't
               | necessarily mean AMD is used much for compute.
        
             | bryanlarsen wrote:
             | The savings are an order of magnitude different. Switching
             | from Intel to AMD in a data center might have saved
             | millions if you were lucky. Switching from NVidia to AMD
             | might save the big LLM vendors billions.
        
             | toast0 wrote:
             | I only observe this market from the sidelines... but
             | 
              | They're able to get the high-end customers, and this
              | strategy works because they can sell those customers
              | high-end parts in volume without having to have a good
              | software stack; at the high end, customers are willing to
              | put in the effort to make their code work on hardware
              | that is better in dollars/watts/availability, or whatever
              | it is that's giving AMD inroads into the supercomputing
              | market. They can't sell low-end customers on GPU compute
              | without a stack that works, and somebody with a small GPU
              | compute workload may not be willing or able to adapt
              | their software for an AMD card, even if the AMD card
              | would be the better choice if they could make it work.
        
               | jiggawatts wrote:
               | They're going to sell a billion dollars of GPUs to a
               | handful of customers while NVIDIA sells a trillion
               | dollars of their products to _everyone_.
               | 
               | Every framework, library, demo, tool, and app is going to
               | use CUDA forever and ever while some "account manager" at
               | AMD takes a government procurement officer to lunch to
               | sell one more supercomputer that year.
        
         | make3 wrote:
         | 99%+ of people aren't writing kernels man, this doesn't mean
         | anything, this is just silly
        
         | xadhominemx wrote:
         | If you are looking for attention from an evangelist, I'm sorry
         | but you are not the target customer for MI300. They are
          | courting the hyperscalers for heavy-duty production inference
         | workloads.
        
       | spitfire wrote:
        | I remember years ago one of the AMD APUs had the CPU and GPU on
        | the same die, and could exchange ownership of CPU and GPU
        | memory with just a pointer change or some other small
        | accounting.
        | 
        | Has this returned? Because for dual GPU/CPU workloads
        | (AlphaZero, etc.) that would deliver effectively "infinite
        | bandwidth" between GPU and CPU. Using an APU of course gets you
        | huge amounts of slowish memory. But being able to fling things
        | around with abandon would be an advantage, particularly for
        | development.
        
         | wmf wrote:
         | I assume the MI300A APU also supports zero-copy. Because MI300X
         | is a separate chip you necessarily have to copy data over PCIe
         | to get it into the GPU.
        
           | rbanffy wrote:
           | One day someone will build a workstation around that chip.
           | One day...
        
         | JonChesterfield wrote:
         | You don't need to change the pointer value. The GPU and the CPU
         | have the same page table structures and both use the same
         | pointer representation for "somewhere in common memory".
         | 
         | On the GPU there are additional pointer types for different
         | local memory, e.g. LDS is a uint16_t indexing from zero. But
         | even there you can still have a single pointer to "somewhere"
         | and when you store to it with a single flat addressing
         | instruction the hardware sorts out whether it's pointing to
         | somewhere in GPU stack or somewhere on the CPU.
         | 
         | This works really well for tables of data. It's a bit of a
         | nuisance for code as the function pointer is aimed at somewhere
         | in memory and whether that's to some x86 or to some gcn depends
         | on where you got the pointer from, and jumping to gcn code from
         | within x86 means exactly what it sounds like.
        
           | spitfire wrote:
            | I'm not sure it was "pointers", but it was some very low-
            | cost way to change ownership of memory between the CPU and
            | GPU.
            | 
            | They had some fancy marketing name for it at the time. But
            | it wasn't on all chips, and it should have been. Even if it
            | was dog slow between a PCIe GPU and the CPU, the unified
            | interface would have been the right way to go. It's also
            | amenable to automated scheduling.
            | 
            | The point still stands though: I want entirely unified GPU
            | and CPU memory.
        
             | JonChesterfield wrote:
             | The unified address space with moving pages between CPU and
             | GPU on page fault works on some discrete GPU systems but
             | it's a bit of a performance risk compared to keeping the
             | pages on the same device.
             | 
             | Fundamentally if you've got separate blocks of memory tied
             | together by pcie then it's either annoying copying data
             | across or a potential performance problem doing it behind
             | the scenes.
             | 
             | A single block of memory that everything has direct access
             | to is much better. It works very neatly on the APU systems.
        
               | spitfire wrote:
               | > Fundamentally if you've got separate blocks of memory
               | tied together by pcie then it's either annoying copying
               | data across or a potential performance problem doing it
               | behind the scenes.
               | 
               | Well, as I said that's amenable to automated planning.
               | 
               | But what I really, really want is a nice APU with 512GB+
               | of memory that both the CPU and GPU can access willy
               | nilly.
        
               | JonChesterfield wrote:
               | Yep, that's what I want too. The future is now.
               | 
                | The MI300A is an APU with 128GB on the package. They
                | come in four-socket systems, so that's 512GB of cache-
                | coherent machine with 96 fast x64 cores and many GCN
                | cores. Quite like a node from El Capitan.
                | 
                | I'm delighted with the hardware and not very impressed
                | with the GPU offloading languages for programming it.
                | The GCN and x64 cores are very much equal peers on the
                | machine; the asymmetry baked into the languages grates
                | on me.
               | 
               | (on non-apu systems, moving the data around in the
               | background such that the latency is hidden is a nice idea
               | and horrendously difficult to do for arbitrary workloads)
        
             | kcb wrote:
             | Probably thinking of this https://en.m.wikipedia.org/wiki/H
             | eterogeneous_System_Archite...
             | 
             | > Even if it was dog slow between PCIe GPU and CPU the
             | unified interface would have been the right way to go
             | 
             | That is actually what happened. You can directly access
             | pinned cpu memory over pcie on discrete gpus.
        
       | omneity wrote:
       | I'm surprised at the simplicity of the formula in the paragraph
       | below. Could someone explain the relationship between model size,
       | memory bandwidth and token/s as they calculated here?
       | 
       | > Taking LLaMA 3 70B as an example, in float16 the weights are
       | approximately 140GB, and the generation context adds another
       | ~2GB. MI300X's theoretical maximum is 5.3TB/second, which gives
       | us a hard upper limit of (5300 / 142) = ~37.2 tokens per second.
        
         | wmf wrote:
         | AFAIK generating a single token requires reading all the
         | weights from RAM. So 5300 GB/s total memory bandwidth / 142 GB
         | weights = ~37.2 tokens per second.
        
         | latchkey wrote:
         | From Cheese (they don't have a HN account, so I'm posting for
         | them):
         | 
          | Each weight is an FP16 float, which is 2 bytes worth of data,
          | and you have 70B weights, so the total amount of data the
          | weights take up is 140GB; then you have a couple of extra GBs
          | for the context.
          | 
          | Then to figure out the theoretical tokens per second you just
          | divide the memory bandwidth, 5300GB/s in MI300X's case, by
          | the amount of data that the weights and context take up, so
          | 5300/142, which is about 37 tokens per second.
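
          The arithmetic above as a quick Python sketch (the 140GB, ~2GB
          and 5300GB/s figures are the article's; everything else is just
          division):

              # 70B parameters x 2 bytes (fp16) = 140 GB of weights
              weights_gb   = 70e9 * 2 / 1e9
              context_gb   = 2.0       # generation context, per the article
              bandwidth_gb = 5300.0    # MI300X theoretical peak, GB/s

              # Every generated token has to stream all of that data from
              # HBM at least once, so bandwidth sets a hard upper limit.
              tokens_per_second = bandwidth_gb / (weights_gb + context_gb)
              print(round(tokens_per_second, 1))  # ~37.3 (the article's
                                                  # ~37.2, up to rounding)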
        
           | rbanffy wrote:
           | 37 somethings per second doesn't sound fast at all. You need
           | to remember it's 37 ridiculously difficult things per second.
        
           | omneity wrote:
            | So am I correct in understanding that what they really mean
            | is 37 full forward passes per second?
            | 
            | In which case, if the model weights fit in VRAM and are
            | already loaded, why does the bandwidth impact the rate of
            | tok/s?
        
             | fancyfredbot wrote:
             | You have to get those weights from the RAM to the floating
             | point unit. The bandwidth here is the rate at which you can
             | do that.
             | 
             | The weights are not really reused. Which means they are
             | never in registers, or in L1/L2/L3 caches. They are always
             | in VRAM and always need to be loaded back in again.
             | 
             | However, if you are batching multiple separate inputs you
              | can reuse each weight on each input, in which case you may
             | not be entirely bandwidth bound and this analysis breaks
             | down a bit. Basically you can't produce a single stream of
             | tokens any faster than this rate, but you can produce more
             | than one stream of tokens at this rate.
        
         | immibis wrote:
         | That would be higher with batching, right? (5300 / 144) * 2 =
         | ~73.6 and so on.
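
          A rough sketch of the batching point made in the two comments
          above, under the simplifying assumption that each extra stream
          adds its own ~2GB of context and that nothing else becomes the
          bottleneck (at larger batch sizes the workload stops being
          purely bandwidth bound, so this estimate breaks down, as noted
          above):

              weights_gb   = 140.0
              context_gb   = 2.0      # assumed per sequence in the batch
              bandwidth_gb = 5300.0

              def aggregate_tokens_per_second(batch_size):
                  # One pass over the weights serves every sequence in
                  # the batch; only the context grows with batch size.
                  bytes_read_gb = weights_gb + batch_size * context_gb
                  return bandwidth_gb / bytes_read_gb * batch_size

              for b in (1, 2, 4, 8):
                  print(b, round(aggregate_tokens_per_second(b), 1))
              # 1 -> 37.3, 2 -> 73.6 (the figure above), 4 -> 143.2,
              # 8 -> 271.8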
        
       | snaeker58 wrote:
        | I hate the state of AMD's software for non-gamers. ROCm is a
        | war crime (which has improved dramatically in the last two
        | years and still sucks).
        | 
        | But, like many have said, considering AMD was almost bankrupt,
        | their performance is impressive. This really speaks for their
        | hardware division. If only they could get the software side of
        | things fixed!
       | 
       | Also I wonder if NVIDIA has an employee of the decade plaque for
       | CUDA. Because CUDA is the best thing that could've happened to
       | them.
        
       | rbanffy wrote:
       | Would be interesting to see a workstation based on the version
       | with a couple x86 dies, the MI300A. Oddly enough, it'd need a
       | discrete GPU.
        
       | alecco wrote:
       | > Taking LLaMA 3 70B as an example, in float16 the weights are
       | approximately 140GB, and the generation context adds another
       | ~2GB. MI300X's theoretical maximum is 5.3TB/second, which gives
       | us a hard upper limit of (5300 / 142) = ~37.2 tokens per second.
       | 
        | I think they mean 37.2 _forward passes_ per second. And at 4008
        | tokens per second (from the "LLaMA3-70B Inference" chart) it
        | means they were using a batch size of ~108 (if using that math,
        | but probably not correct). Right?
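
          Backing the implied batch size out of the chart figure, under
          the same bandwidth-only reasoning and ignoring per-stream
          context growth (the 4008 tok/s number is from the article's
          chart; the rest is division):

              measured_tps  = 4008.0   # "LLaMA3-70B Inference" chart
              single_stream = 37.2     # per-stream cap derived above
              implied_batch = measured_tps / single_stream
              print(round(implied_batch))  # ~108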
        
       ___________________________________________________________________
       (page generated 2024-06-25 23:01 UTC)