[HN Gopher] AMD Now Has More Compute on the Top500 Than Nvidia
       ___________________________________________________________________
        
       AMD Now Has More Compute on the Top500 Than Nvidia
        
       Author : rbanffy
       Score  : 89 points
       Date   : 2024-11-18 18:54 UTC (4 hours ago)
        
 (HTM) web link (www.nextplatform.com)
 (TXT) w3m dump (www.nextplatform.com)
        
       | pie420 wrote:
        | Layperson with no industry knowledge here, but it seems like
        | Nvidia's CUDA moat will fall in the next 2-5 years. It seems
        | impossible to sustain those margins without competition coming
        | in and taking a decent slice of the pie.
        
         | metadat wrote:
         | But how will AMD or anyone else push in? CUDA is actually a
         | whole virtualization layer on top of the hardware and isn't
         | easily replicable, Nvidia has been at it for 17 years.
         | 
         | You are right, eventually something's gotta give. The path for
         | this next leg isn't yet apparent to me.
         | 
          | P.S. How much is an exaflop or petaflop, and how significant
          | are they? The numbers thrown around in this article don't mean
          | anything to me. Is this new cluster way more powerful than the
          | previous top system?
        
           | bryanlarsen wrote:
           | Anybody spending tens of billions annually on Nvidia hardware
           | is going to be willing to spend millions to port their
           | software away from CUDA.
        
             | echelon wrote:
             | For the average non-FAANG company, there's nothing to port
             | to yet. We don't all have the luxury of custom TPUs.
        
             | talldayo wrote:
             | To slower hardware? What are they supposed to port to,
             | ASICs?
        
               | adgjlsfhk1 wrote:
                | If the hardware is 30% slower and 2x cheaper, that's a
                | pretty great deal.
        
               | selectodude wrote:
               | Power density tends to be the limiting factor for this
               | stuff, not money. If it's 30 percent slower per watt,
               | it's useless.
        
           | vlovich123 wrote:
            | The API part isn't thaaat hard. Indeed, HIP already does a
            | pretty good job of getting existing CUDA code to run
            | unmodified on AMD hardware. The bigger challenge is that the
            | AMD and Nvidia architectures are different enough that the
            | optimization choices for how a kernel should be written
            | diverge more between Nvidia and AMD than they do between
            | Intel and AMD in CPU land, even including SIMD.
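            | 
            | At the framework level this is already mostly invisible: on
            | a ROCm build of PyTorch, AMD GPUs are exposed through the
            | torch.cuda API, so Python-side "CUDA" code runs unmodified.
            | A minimal sketch (assuming a ROCm build is installed):
            | 
            |     import torch
            | 
            |     # On ROCm builds, AMD GPUs show up under torch.cuda;
            |     # torch.version.hip is non-None there, None on CUDA builds.
            |     device = "cuda" if torch.cuda.is_available() else "cpu"
            |     print(torch.version.hip)
            | 
            |     a = torch.randn(1024, 1024, device=device)
            |     b = torch.randn(1024, 1024, device=device)
            |     c = a @ b  # handled by a vendor BLAS (rocBLAS / cuBLAS)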
        
           | sangnoir wrote:
            | CUDA is the assembly to Torch's high-level language; for
            | most, it's a very good intermediary, but an intermediary
            | nonetheless, sitting between the code they actually care
            | about and the hardware that runs it.
            | 
            | Most customers care about cost-effectiveness more than best-
            | in-class raw performance, a fact that AMD has ruthlessly
            | exploited over the past 8 years. It helps that AMD products
            | are occasionally both.
        
           | LeanderK wrote:
            | It's possible. Just look at Apple's GPUs: they're mostly
            | supported by torch, and what's left are mostly edge-cases.
            | Apple should make a datacenter GPU :D that would be insanely
            | funny. It's actually somewhat well positioned because, thanks
            | to the MacBooks, the support is already there. I assume here
            | that most things translate to Linux, as I don't think you can
            | sell macOS in the cloud :D
            | 
            | I know a lot of people developing on Apple silicon and just
            | pushing to clusters for bigger runs. So why not run it on an
            | Apple GPU there?
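            | 
            | The backend-agnostic device selection people already write
            | looks roughly like this (a sketch; "mps" is the device name
            | PyTorch exposes for Apple GPUs):
            | 
            |     import torch
            | 
            |     # Pick whatever accelerator is present, else fall back to CPU.
            |     if torch.cuda.is_available():
            |         device = torch.device("cuda")   # Nvidia, or AMD via ROCm
            |     elif torch.backends.mps.is_available():
            |         device = torch.device("mps")    # Apple silicon GPU
            |     else:
            |         device = torch.device("cpu")
            | 
            |     model = torch.nn.Linear(512, 512).to(device)
            |     x = torch.randn(64, 512, device=device)
            |     y = model(x)  # same code path regardless of vendor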
        
             | talldayo wrote:
             | > what's left are mostly edge-cases.
             | 
             | For everything that isn't machine learning, I frankly feel
             | like it's the other way around. Apple's "solution" to these
             | edge cases is telling people to write compute shaders that
             | you could write in Vulkan or DirectX instead. What sets
             | CUDA apart is an integration with a complex acceleration
             | pipeline that Apple gave up trying to replicate years ago.
             | 
             | When cryptocurrency mining was king-for-a-day, everyone
             | rushed out to buy Nvidia hardware because it supported
             | accelerated crypto well from the start. The same thing
             | happened with the AI and machine learning boom. Apple and
             | AMD were both late to the party and wrongly assumed that
             | NPU hardware would provide a comparable solution. Without a
             | CUDA competitor, Apple would struggle more than AMD to find
             | market fit.
        
               | LeanderK wrote:
               | well, but machine learning is the major reason we use
               | GPUs in the datacenter (not talking about consumer GPUs
               | here). The others are edge-cases for data-centre
               | applications! Apple is uniquely positioned exactly
               | because it is already solved due to a significant part of
               | the ML-engineers using MacBooks to develop locally.
               | 
               | The code to run these things on apples GPUs exist and is
               | used every day! I don't know anyone using AMD GPUs, but
               | pretty often its nvidia on the cluster and Apple on the
               | laptop. So if nvidia is making these juicy profits, i
               | think apple could seriously think about moving to the
               | cluster if it wants to.
        
           | stonemetal12 wrote:
           | According to Wikipedia the previous #1 was from 2022 with a
           | peak petaflops of 2,055. This system is rated at 2,746. So
           | about 33% faster than the old #1.
           | 
           | Also, of the top 10, AMD has 5 systems.
           | 
           | https://en.wikipedia.org/wiki/TOP500
        
           | smokel wrote:
           | _> P.s. how much is an exaflop or petaflop_
           | 
           | 1 petaflop = 10^15 flops = 1,000,000,000,000,000 flops.
           | 
           | 1 exaflop = 10^18 flops = 1,000,000,000,000,000,000 flops.
           | 
           | Note that these are simply powers of 10, not powers of 2,
           | which are used for storage for example.
        
           | fweimer wrote:
           | Isn't porting software to the next generation supercomputer
           | pretty standard for HPC?
        
           | ok123456 wrote:
           | People have been chipping away at this for a while. HIP
           | allows source-level translation, and libraries like Jax
           | provide a HIP version.
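            | 
            | With JAX the user-facing program is identical across
            | backends; only the jaxlib build changes. A small sketch
            | (assuming a ROCm build of jaxlib is installed):
            | 
            |     import jax
            |     import jax.numpy as jnp
            | 
            |     # AMD GPUs appear here with a ROCm jaxlib, just as CUDA
            |     # devices do with the CUDA build; the code is unchanged.
            |     print(jax.devices())
            | 
            |     @jax.jit
            |     def step(w, x):
            |         return jnp.tanh(x @ w)
            | 
            |     w = jnp.ones((256, 256))
            |     x = jnp.ones((8, 256))
            |     print(step(w, x).shape)  # XLA-compiled for whichever backend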
        
           | vitus wrote:
           | > P.s. how much is an exaflop or petaflop, and how
           | significant is it? The numbers thrown around in this article
           | don't mean anything to me. Is this new cluster way more
           | powerful than the last top?
           | 
            | Nominally, a measurement in "flops" is how many FLoating-
            | point Operations Per Second the hardware is capable of
            | performing (the Top500 measures 64-bit, double-precision
            | math; consumer GPU specs usually quote 32-bit), so it's an
            | approximate measure of total available computing power.
           | 
           | A high-end consumer-grade CPU can achieve on the order of a
           | few hundred gigaflops (let's say 250, just for a nice round
           | number). https://boinc.bakerlab.org/rosetta/cpu_list.php
           | 
           | A petaflop is therefore about four thousand of those;
           | multiply by another thousand to get an exaflop.
           | 
           | For another point of comparison, a high-end GPU might be on
           | the order of 40-80 teraflops.
            | https://www.tomshardware.com/reviews/gpu-hierarchy,4388-2.ht...
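            | 
            | Back-of-the-envelope with those round numbers (250 gigaflops
            | per consumer CPU, ~50 teraflops per high-end consumer GPU;
            | both are the rough assumptions above, not measurements):
            | 
            |     CPU_FLOPS = 250e9   # ~250 gigaflops, high-end consumer CPU
            |     GPU_FLOPS = 50e12   # ~50 teraflops, high-end consumer GPU
            |     PETAFLOP = 1e15
            |     EXAFLOP = 1e18
            | 
            |     print(PETAFLOP / CPU_FLOPS)  # ~4,000 CPUs per petaflop
            |     print(EXAFLOP / CPU_FLOPS)   # ~4,000,000 CPUs per exaflop
            |     print(EXAFLOP / GPU_FLOPS)   # ~20,000 GPUs per exaflop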
        
           | quickthrowman wrote:
           | > But how will AMD or anyone else push in? CUDA is actually a
           | whole virtualization layer on top of the hardware and isn't
           | easily replicable, Nvidia has been at it for 17 years.
           | 
            | Nvidia currently has 80-90% gross margins on their LLM GPUs;
            | that's all the incentive another company needs to invest
            | money into a CUDA alternative.
        
           | NineStarPoint wrote:
           | A high grade consumer gpu a (a 4090) is about 80 teraflops.
           | So rounding up to 100, an exaflop is about 10,000 consumer
           | grade cards worth of compute, and a petaflop is about 10.
           | 
           | Which doesn't help with understanding how much more
           | impressive these are than the last clusters, but does to me
           | at least put the amount of compute these clusters have into
           | focus.
        
             | vitus wrote:
             | You're off by three orders of magnitude.
             | 
             | My point of reference is that back in undergrad (~10-15
             | years ago), I recall a class assignment where we had to
             | optimize matrix multiplication on a CPU; typical good
             | parallel implementations achieved about 100-130 gigaflops
             | (on a... Nehalem or Westmere Xeon, I think?).
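              | 
              | If you want to reproduce that kind of number today, a
              | crude measurement looks like this (a sketch; NumPy hands
              | the work to whatever BLAS it was built against):
              | 
              |     import time
              |     import numpy as np
              | 
              |     n = 4096
              |     a = np.random.rand(n, n)
              |     b = np.random.rand(n, n)
              | 
              |     start = time.perf_counter()
              |     c = a @ b
              |     elapsed = time.perf_counter() - start
              | 
              |     # A dense n x n matmul is roughly 2*n^3 floating-point ops.
              |     print(f"{2 * n**3 / elapsed / 1e9:.1f} GFLOP/s")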
        
               | NineStarPoint wrote:
               | You are 100% correct, I lost a full prefix of performance
               | there. Edited my message.
               | 
               | Which does make the clusters a fair bit less impressive,
               | but also a lot more sensibly sized.
        
             | winwang wrote:
             | 4090 tensor performance (FP8): 660 teraflops, 1320 "with
             | sparsity" (i.e. max theoretical with zeroes in the right
             | places).
             | 
              | https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid...
             | 
             | But at these levels of compute, the memory/interconnect
             | bandwidth becomes the bottleneck.
        
         | latchkey wrote:
         | We donated one of our MI300x systems to the SCALE team. The
         | moat-less future is coming more quickly than you think.
         | 
         | https://scale-lang.com/
        
         | YetAnotherNick wrote:
         | CUDA moat is highly overrated for AI in the first place and
         | sold as the reason for the failure of AMD. Almost no one in AI
         | uses CUDA. They only use pytorch or Triton. TPUs didn't face
         | lot of hurdle due to CUDA because they were initially better in
         | terms of price to performance and supported pytorch, tensorflow
         | and jax.
         | 
         | The reason why AMD is behind is that it is behind in hardware.
         | MI300x is more pricey per hour in all the cloud I can find
         | compared to H100, and the MFU is order of magnitude lower
         | compared to NVIDIA for transformers, even though transformers
         | are fully supported. And I get same 40-50% MFU in TPU for the
         | same code. If anyone is investing >10 million dollar for
         | hardware, they sure can invest a million dollar to rewrite
         | everything in whatever language AMD asks them to if it is
         | cheaper.
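          | 
          | For reference, MFU is just achieved model flops divided by the
          | hardware's advertised peak. A sketch with illustrative
          | placeholder numbers (not measurements):
          | 
          |     # Model flops utilization: useful flops vs. hardware peak.
          |     tokens_per_sec = 800                # placeholder throughput
          |     flops_per_token = 6 * 70e9          # ~6 * params, 70B model
          |     peak_flops = 989e12                 # H100 dense BF16 peak
          | 
          |     mfu = tokens_per_sec * flops_per_token / peak_flops
          |     print(f"MFU ~= {mfu:.0%}")          # ~34% with these numbers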
        
       | nwgo wrote:
        | It does not matter. AMD is shit when it comes to low-level
        | processing; their algos are stuck and go nowhere. Nvidia is
        | killing it. There is a reason Zuckerberg ordered billions in
        | GPUs from Nvidia and not from AMD.
        
         | ipsum2 wrote:
         | AMD GPUs handle all inference for Llama3 at Meta btw.
        
         | thechao wrote:
         | Why is AMD shit at low-level processing? What does it mean
         | "their algos are stuck"? Having watched "the industry" for a
         | few decades now, the appeal for NV smells heavily like the old
         | appeal for Xeons, and Big Blue before them. The moat appears
         | (to me, an unknowledgeable outsider) to be just cultural, not
         | necessarily technical.
        
         | sqeaky wrote:
         | This is just silly fanboyism, there are pros and cons to each.
        
         | Koshkin wrote:
         | This comment is somewhat more insightful:
         | 
         | https://news.ycombinator.com/item?id=40791010
        
       | ipsum2 wrote:
        | As someone who worked in the ML infra space: Google, Meta, xAI,
        | Oracle, Microsoft, and Amazon have clusters that perform better
        | than the highest-performing cluster on the Top500. They don't submit
       | because there's no reason to, and some want to keep the size of
       | their clusters a secret. They're all running Nvidia. (Except
       | Google, who uses TPUs and Nvidia.)
       | 
       | > El Capitan - we don't yet know how big of a portion yet as we
       | write this - with 43,808 of AMD's "Antares-A" Instinct MI300A
       | devices
       | 
        | By comparison, xAI announced that they have 100k H100s, and the
        | MI300A and H100 have roughly similar performance. Meta says
        | they're training on more than 100k H100s for Llama-4, and have
        | the equivalent of 600k H100s worth of compute. (Note that
        | compute and networking can be orthogonal.)
       | 
       | Also, Nvidia B200s are rolling out now. They offer 2-3x the
       | performance of H100s.
        
         | danpalmer wrote:
         | Google is running its own TPU hardware for internal workloads.
         | I believe Nvidia is just resold for cloud customers.
        
           | ipsum2 wrote:
           | Nvidia GPUs are also used for inference on Google products.
           | It just depends on availability.
        
             | danpalmer wrote:
             | Interesting, do you have a source for this? I've not been
             | able to find one.
        
               | nextos wrote:
                | GCP plans offer access to high-end Nvidia GPUs, as well
                | as TPUs. I thought Google used the same pool of
                | resources that it also resells?
        
         | pclmulqdq wrote:
          | B200s offer only an incremental increase in FP64 and FP32
          | performance over H100s, and those are the number formats that
          | HPC people care about.
          | 
          | The MI300A can reach about 150% of the FP64 peak performance
          | of a B200, although AMD GPUs have historically underperformed
          | their spec more than Nvidia GPUs have. It's possible that B200
          | devices are actually behind for HPC.
        
           | cayleyh wrote:
            | Top line comparison numbers for reference:
            | https://www.theregister.com/2024/03/18/nvidia_turns_up_the_a...
           | 
           | It does seem like Nvidia is prioritizing int8 / fp8
           | performance over FP64, which given the current state of the
           | ML marketplace is a great idea.
        
         | zekrioca wrote:
          | The Top500 list is useful as a public, standardized baseline:
          | it is straightforward and has run on a predictable cadence for
          | more than 30 years. It is trickier to compare cloud infras due
          | to their heterogeneity, fast pace of change, and, more
          | importantly, the lack of standardized tests, although MLCommons
          | [1] has been very keen on helping with that.
         | 
         | [1] https://mlcommons.org/datasets/
        
         | almostgotcaught wrote:
         | Ya exactly - no one cares about top500 outside of academia
         | (literally have never heard it come up at work). So this is
         | like the gold star (participation award) of DCGPU competition.
        
         | maratc wrote:
         | > Nvidia B200s ... offer 2-3x the performance of H100s
         | 
         | For ML, not for HPC. ML and HPC are two completely different,
         | only loosely related fields.
         | 
          | ML tasks are doing great with low precision: 16- and 8-bit
          | precision is fine, and arguably good results can be achieved
          | even with 4-bit precision [0][1]. That won't do for HPC tasks
          | like predicting global weather, computational biology, etc. --
          | one would need 64- to 128-bit precision for those.
         | 
         | Nvidia is dividing their new silicon between many more low-
         | precision cores, so that their new offerings are arguably
         | _worse_ than their older offerings for HPC tasks (of course
         | they are _much better_ offerings for low-precision ML tasks --
         | where the hype and the money is currently). The feeling with
         | the HPC crowd is that  "Nvidia and AMD are in the process of
         | abandoning this market".
         | 
         | [0]
         | https://papers.nips.cc/paper/2020/file/13b919438259814cd5be8...
         | 
         | [1] https://arxiv.org/abs/2212.09720
        
         | formerly_proven wrote:
         | China has been absent from TOP500 for years as well.
        
       | amelius wrote:
       | Why the focus on AMD and Nvidia? It really isn't that hard to
       | design a large number of ALU blocks into some silicon IP block
       | and make them work together efficiently.
       | 
       | The real accomplishment is fabricating them.
        
         | georgeecollins wrote:
         | But not the profits.
        
         | talldayo wrote:
          | > It really isn't that hard to design a large number of ALU
          | blocks into some silicon IP block and make them work together
          | efficiently.
          | 
          | It really is that hard, and the fabrication side of the issue
          | is the easy part from Nvidia's perspective - you just pay TSMC
          | a shitload of money. Nvidia's _real_ victory (besides leading
          | on performance-per-watt) is that their software stack doesn't
          | suck. They invested in complex shader units and tensor
          | accelerators that scale with the size of the card rather than
          | being restrained to puny and limited NPUs. CUDA unified this
          | featureset and has been industry-entrenched for almost a
          | decade, which gave it pretty much any feature you could want,
          | be it crypto acceleration or AI/ML primitives.
          | 
          | The ultimate tragedy is that there was a potential future
          | where a Free and Open Source CUDA alternative existed. Apple
          | wrote the OpenCL spec for exactly that purpose and gave it to
          | Khronos, but later abandoned it to focus on... _checks
          | clipboard_ MLX and Metal Performance Shaders. Oh, what could
          | have been if the industry weren't so stingy and shortsighted.
        
           | amelius wrote:
           | > you just pay TSMC a shitload of money
           | 
           | I guess with money you can win any argument ...
        
             | talldayo wrote:
             | Sure, Apple did the same thing with TSMC's 5nm node. They
             | still lost in performance-per-watt in direct comparison
             | with Nvidia GPUs using Samsung's 8nm node. Money isn't
             | everything, even when you have so much of it that you can
             | deny your competitors access to the tech you use.
             | 
             | Nvidia's lead is not _only_ cemented by dense silicon.
             | Their designs are extremely competitive, perhaps even a
             | generational leap over what their competitors offer.
        
               | amelius wrote:
               | Let me phrase it differently.
               | 
               | If Nvidia pulls the plug we can still go to AMD and have
               | a reasonable alternative.
               | 
               | If TSMC pulls the plug, however ...
        
       | vitus wrote:
       | After skimming the article, I'm confused -- where exactly is the
       | headline being pulled from?
       | 
       | If you look at the table toward the bottom, no matter how you
       | slice it, Nvidia has 50% of the total cores, 50% of the total
       | flops, and 90% of the total systems among the Top 500, while AMD
       | has 26% of the total cores, 27.5% of the total flops, and 7% of
       | the total systems.
       | 
       | Is it a matter of _newly-added_ compute?
       | 
       | > This time around, on the November 2024 Top500 rankings, AMD is
       | the big winner in terms of adding capacity to the HPC base.
        
         | Koshkin wrote:
         | > _AMD GPUs drove 72.1 percent of the new performance added for
         | the November 2024 rankings_
        
           | vitus wrote:
           | Yes, I saw that, but that doesn't justify the title as
           | written. Had it said "AMD Now Has More New Compute" I
           | wouldn't have said anything.
        
       | latchkey wrote:
       | I'm sure there is also a lot not on the Top500. I've got enough
       | AMD MI300x compute for about 140th position, but haven't
       | submitted numbers.
        
       ___________________________________________________________________
       (page generated 2024-11-18 23:00 UTC)