[HN Gopher] AMD Now Has More Compute on the Top500 Than Nvidia
___________________________________________________________________
AMD Now Has More Compute on the Top500 Than Nvidia
Author : rbanffy
Score : 89 points
Date : 2024-11-18 18:54 UTC (4 hours ago)
(HTM) web link (www.nextplatform.com)
(TXT) w3m dump (www.nextplatform.com)
| pie420 wrote:
| Layperson with no industry knowledge here, but it seems like Nvidia's
| CUDA moat will fall in the next 2-5 years. It seems impossible to
| sustain those margins without competition coming in and getting a
| decent slice of the pie
| metadat wrote:
| But how will AMD or anyone else push in? CUDA is actually a
| whole virtualization layer on top of the hardware and isn't
| easily replicable; Nvidia has been at it for 17 years.
|
| You are right, eventually something's gotta give. The path for
| this next leg isn't yet apparent to me.
|
| P.S. How much is an exaflop or petaflop, and how significant is
| it? The numbers thrown around in this article don't mean
| anything to me. Is this new cluster way more powerful than the
| last top?
| bryanlarsen wrote:
| Anybody spending tens of billions annually on Nvidia hardware
| is going to be willing to spend millions to port their
| software away from CUDA.
| echelon wrote:
| For the average non-FAANG company, there's nothing to port
| to yet. We don't all have the luxury of custom TPUs.
| talldayo wrote:
| To slower hardware? What are they supposed to port to,
| ASICs?
| adgjlsfhk1 wrote:
| if the hardware is 30% slower and 2x cheaper, that's a
| pretty great deal.
| selectodude wrote:
| Power density tends to be the limiting factor for this
| stuff, not money. If it's 30 percent slower per watt,
| it's useless.
| vlovich123 wrote:
| The API part isn't thaaat hard. Indeed HIP already works
| pretty well at getting existing CUDA code to work unmodified
| on AMD HW. The bigger challenge is that the AMD and Nvidia
| architectures are so different that the optimization choices
| for how kernels should be written diverge more between Nvidia
| and AMD than they would between Intel and AMD in CPU land,
| even including SIMD.
| sangnoir wrote:
| CUDA is the assembly to Torch's high-level language; for
| most, it's a very good intermediary, but an intermediary
| nonetheless, sitting between the code they actually care about
| and the hardware that runs it.
|
| Most customers care about cost-effectiveness more than best-
| in-class raw-performance, a fact that AMD has ruthlessly
| exploited over the past 8 years. It helps that AMD products
| are occasionally both.
| LeanderK wrote:
| its possible. Just look at Apples GPU, its mostly supported
| by torch, what's left are mostly edge-cases. Apple should
| make a datacenter GPU :D that would be insanely funny. It's
| actually somewhat well positioned as, due to the MacBooks,
| the support is already there. I assume here that most things
| translate to linux, as I don't think you can sell MacOS in
| the cloud :D
|
| I know a lot of people developing on Apple silicon and just
| pushing to clusters for bigger runs. So why not run it on an
| Apple GPU there?
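|
| To make that concrete, a minimal PyTorch sketch of the
| "develop on the laptop, run on the cluster" flow (untested;
| assumes a recent torch build, and note that ROCm builds also
| show up under the "cuda" device name):
|
|     import torch
|
|     # Pick whatever accelerator is present: Nvidia/ROCm GPUs
|     # ("cuda"), Apple silicon ("mps"), or CPU as a fallback.
|     if torch.cuda.is_available():
|         device = torch.device("cuda")
|     elif torch.backends.mps.is_available():
|         device = torch.device("mps")
|     else:
|         device = torch.device("cpu")
|
|     model = torch.nn.Linear(1024, 1024).to(device)
|     x = torch.randn(64, 1024, device=device)
|     y = model(x)  # same user code on the MacBook and the cluster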
| talldayo wrote:
| > what's left are mostly edge-cases.
|
| For everything that isn't machine learning, I frankly feel
| like it's the other way around. Apple's "solution" to these
| edge cases is telling people to write compute shaders that
| you could write in Vulkan or DirectX instead. What sets
| CUDA apart is an integration with a complex acceleration
| pipeline that Apple gave up trying to replicate years ago.
|
| When cryptocurrency mining was king-for-a-day, everyone
| rushed out to buy Nvidia hardware because it supported
| accelerated crypto well from the start. The same thing
| happened with the AI and machine learning boom. Apple and
| AMD were both late to the party and wrongly assumed that
| NPU hardware would provide a comparable solution. Without a
| CUDA competitor, Apple would struggle more than AMD to find
| market fit.
| LeanderK wrote:
| Well, but machine learning is the major reason we use
| GPUs in the datacenter (not talking about consumer GPUs
| here). The others are edge cases for datacenter
| applications! Apple is uniquely positioned exactly
| because the problem is already solved, thanks to a
| significant share of ML engineers using MacBooks to
| develop locally.
|
| The code to run these things on Apple's GPUs exists and is
| used every day! I don't know anyone using AMD GPUs, but
| pretty often it's Nvidia on the cluster and Apple on the
| laptop. So if Nvidia is making these juicy profits, I
| think Apple could seriously think about moving to the
| cluster if it wants to.
| stonemetal12 wrote:
| According to Wikipedia the previous #1 was from 2022 with a
| peak petaflops of 2,055. This system is rated at 2,746. So
| about 33% faster than the old #1.
|
| Also, of the top 10, AMD has 5 systems.
|
| https://en.wikipedia.org/wiki/TOP500
| smokel wrote:
| _> P.s. how much is an exaflop or petaflop_
|
| 1 petaflop = 10^15 flops = 1,000,000,000,000,000 flops.
|
| 1 exaflop = 10^18 flops = 1,000,000,000,000,000,000 flops.
|
| Note that these are simply powers of 10, not powers of 2,
| which are used for storage for example.
| fweimer wrote:
| Isn't porting software to the next generation supercomputer
| pretty standard for HPC?
| ok123456 wrote:
| People have been chipping away at this for a while. HIP
| allows source-level translation, and libraries like Jax
| provide a HIP version.
| vitus wrote:
| > P.s. how much is an exaflop or petaflop, and how
| significant is it? The numbers thrown around in this article
| don't mean anything to me. Is this new cluster way more
| powerful than the last top?
|
| Nominally, a measurement in "flops" is how many FLoating-point
| Operations Per Second the hardware is capable of performing
| (consumer specs usually quote 32-bit; the Top500 itself measures
| 64-bit double precision), so it's an approximate measure of
| total available computing power.
|
| A high-end consumer-grade CPU can achieve on the order of a
| few hundred gigaflops (let's say 250, just for a nice round
| number). https://boinc.bakerlab.org/rosetta/cpu_list.php
|
| A petaflop is therefore about four thousand of those;
| multiply by another thousand to get an exaflop.
|
| For another point of comparison, a high-end GPU might be on
| the order of 40-80 teraflops.
| https://www.tomshardware.com/reviews/gpu-hierarchy,4388-2.ht...
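|
| Back-of-envelope, using those same round numbers (so purely
| illustrative, not benchmarks):
|
|     cpu_flops = 250e9   # ~250 gigaflops, high-end consumer CPU
|     gpu_flops = 80e12   # ~80 teraflops, high-end consumer GPU
|     petaflop, exaflop = 1e15, 1e18
|
|     print(petaflop / cpu_flops)  # ~4,000 CPUs per petaflop
|     print(exaflop / cpu_flops)   # ~4,000,000 CPUs per exaflop
|     print(exaflop / gpu_flops)   # ~12,500 GPUs per exaflop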
| quickthrowman wrote:
| > But how will AMD or anyone else push in? CUDA is actually a
| whole virtualization layer on top of the hardware and isn't
| easily replicable, Nvidia has been at it for 17 years.
|
| NVidia currently has 80-90% gross margins on their LLM GPUs,
| that's all the incentive another company needs to invest
| money into a CUDA alternative.
| NineStarPoint wrote:
| A high-grade consumer GPU (a 4090) is about 80 teraflops. So
| rounding up to 100, an exaflop is about 10,000 consumer-grade
| cards worth of compute, and a petaflop is about 10.
|
| Which doesn't help with understanding how much more
| impressive these are than the last clusters, but it does at
| least put the amount of compute these clusters have into
| focus for me.
| vitus wrote:
| You're off by three orders of magnitude.
|
| My point of reference is that back in undergrad (~10-15
| years ago), I recall a class assignment where we had to
| optimize matrix multiplication on a CPU; typical good
| parallel implementations achieved about 100-130 gigaflops
| (on a... Nehalem or Westmere Xeon, I think?).
| NineStarPoint wrote:
| You are 100% correct, I lost a full prefix of performance
| there. Edited my message.
|
| Which does make the clusters a fair bit less impressive,
| but also a lot more sensibly sized.
| winwang wrote:
| 4090 tensor performance (FP8): 660 teraflops, 1320 "with
| sparsity" (i.e. max theoretical with zeroes in the right
| places).
|
| https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid...
|
| But at these levels of compute, the memory/interconnect
| bandwidth becomes the bottleneck.
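|
| A rough way to see the bandwidth point: divide peak compute by
| memory bandwidth to get the arithmetic intensity a kernel needs
| before the ALUs, rather than DRAM, become the limit (a sketch;
| the ~1 TB/s figure below is an approximation):
|
|     peak_flops = 660e12  # FP8 tensor throughput quoted above
|     mem_bw = 1.0e12      # ~1 TB/s GDDR6X bandwidth (approx.)
|
|     ridge = peak_flops / mem_bw
|     print(ridge)  # ~660 FLOPs per byte; anything below that
|                   # ends up bandwidth-bound, not compute-bound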
| latchkey wrote:
| We donated one of our MI300x systems to the SCALE team. The
| moat-less future is coming more quickly than you think.
|
| https://scale-lang.com/
| YetAnotherNick wrote:
| The CUDA moat is highly overrated for AI in the first place and
| sold as the reason for AMD's failure. Almost no one in AI
| writes CUDA directly; they use PyTorch or Triton. TPUs didn't
| face much of a hurdle from CUDA because they were initially
| better in terms of price to performance and supported PyTorch,
| TensorFlow and JAX.
|
| The reason AMD is behind is that it is behind in hardware. The
| MI300x is pricier per hour than the H100 in every cloud I can
| find, and the MFU (model flops utilization) is an order of
| magnitude lower than on NVIDIA for transformers, even though
| transformers are fully supported. And I get the same 40-50% MFU
| on TPU for the same code. If anyone is investing >10 million
| dollars in hardware, they can surely invest a million dollars
| to rewrite everything in whatever language AMD asks them to if
| it is cheaper.
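|
| For reference, MFU is just achieved model flops over the
| hardware's peak. A sketch with made-up throughput numbers (the
| ~6N flops/token rule of thumb and the approximate H100 BF16
| peak are the only real constants here):
|
|     params = 70e9            # 70B-parameter transformer
|     tokens_per_sec = 1_000   # per-GPU training throughput (hypothetical)
|     peak_flops = 989e12      # H100 SXM BF16 dense peak, approx.
|
|     flops_per_token = 6 * params   # ~6N flops/token for training
|     mfu = tokens_per_sec * flops_per_token / peak_flops
|     print(f"MFU = {mfu:.0%}")      # ~42% with these numbers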
| nwgo wrote:
| It does not matter. AMD is shit when it comes to low-level
| processing, their algos are stuck and go nowhere. Nvidia is
| killing it. There is a reason why Zuckerberg ordered billions in
| GPUs from Nvidia and not from AMD.
| ipsum2 wrote:
| AMD GPUs handle all inference for Llama3 at Meta btw.
| thechao wrote:
| Why is AMD shit at low-level processing? What does it mean
| "their algos are stuck"? Having watched "the industry" for a
| few decades now, the appeal for NV smells heavily like the old
| appeal for Xeons, and Big Blue before them. The moat appears
| (to me, an unknowledgeable outsider) to be just cultural, not
| necessarily technical.
| sqeaky wrote:
| This is just silly fanboyism, there are pros and cons to each.
| Koshkin wrote:
| This comment is somewhat more insightful:
|
| https://news.ycombinator.com/item?id=40791010
| ipsum2 wrote:
| As someone who worked in the ML infra space: Google, Meta, xAI,
| Oracle, Microsoft, Amazon have clusters that perform better than
| the highest performing cluster on Top500. They don't submit
| because there's no reason to, and some want to keep the size of
| their clusters a secret. They're all running Nvidia. (Except
| Google, who uses TPUs and Nvidia.)
|
| > El Capitan - we don't yet know how big of a portion yet as we
| write this - with 43,808 of AMD's "Antares-A" Instinct MI300A
| devices
|
| By comparison, xAI announced that they have 100k H100s. MI300A and
| H100s have roughly similar performance. Meta says they're
| training on more than 100k H100s for Llama-4, and have the
| equivalent of 600k H100s worth of compute. (Note that compute and
| networking can be orthogonal).
|
| Also, Nvidia B200s are rolling out now. They offer 2-3x the
| performance of H100s.
| danpalmer wrote:
| Google is running its own TPU hardware for internal workloads.
| I believe Nvidia is just resold for cloud customers.
| ipsum2 wrote:
| Nvidia GPUs are also used for inference on Google products.
| It just depends on availability.
| danpalmer wrote:
| Interesting, do you have a source for this? I've not been
| able to find one.
| nextos wrote:
| GCP plans offer access to high-end Nvidia GPUs, as well
| as TPUs. I thought Google used the same pool of resources
| that is also resold?
| pclmulqdq wrote:
| B200s have an incremental increase in FP64 and FP32 performance
| over H100s. Those are the number formats that HPC people
| care about.
|
| The MI300A can get to 150% the FP64 peak performance that B200
| devices can get, although AMD GPUs have historically
| underperformed their spec more than Nvidia GPUs. It's possible
| that B200 devices are actually behind for HPC.
| cayleyh wrote:
| Top line comparison numbers for reference:
| https://www.theregister.com/2024/03/18/nvidia_turns_up_the_a...
|
| It does seem like Nvidia is prioritizing int8 / fp8
| performance over FP64, which given the current state of the
| ML marketplace is a great idea.
| zekrioca wrote:
| The Top500 list is useful as a public, standardized baseline
| that is straightforward and has run on a predictable cadence
| for more than 30 years. It is trickier to compare cloud infras
| due to their heterogeneity, fast pace, and, more importantly,
| the lack of standardized tests, although MLCommons [1] has
| been very keen on helping with that.
|
| [1] https://mlcommons.org/datasets/
| almostgotcaught wrote:
| Ya exactly - no one cares about top500 outside of academia
| (literally have never heard it come up at work). So this is
| like the gold star (participation award) of DCGPU competition.
| maratc wrote:
| > Nvidia B200s ... offer 2-3x the performance of H100s
|
| For ML, not for HPC. ML and HPC are two completely different,
| only loosely related fields.
|
| ML tasks do great with low precision: 16- and 8-bit precision
| is fine, and arguably good results can be achieved even with
| 4-bit precision [0][1]. That won't do for HPC tasks like
| predicting global weather, computational biology, etc. -- there
| one needs 64-bit (or even 128-bit) precision.
|
| Nvidia is dividing their new silicon between many more low-
| precision cores, so that their new offerings are arguably
| _worse_ than their older offerings for HPC tasks (of course
| they are _much better_ offerings for low-precision ML tasks --
| where the hype and the money currently are). The feeling among
| the HPC crowd is that "Nvidia and AMD are in the process of
| abandoning this market".
|
| [0]
| https://papers.nips.cc/paper/2020/file/13b919438259814cd5be8...
|
| [1] https://arxiv.org/abs/2212.09720
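|
| A tiny numpy illustration of the precision point above (not an
| HPC benchmark, just the accumulation effect):
|
|     import numpy as np
|
|     ones = np.ones(100_000)
|     # float16 stalls at 2048: beyond that, sum + 1 is no longer
|     # representable, so the additions silently stop accumulating.
|     print(ones.astype(np.float16).sum(dtype=np.float16))  # 2048.0
|     print(ones.sum())                                      # 100000.0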
| formerly_proven wrote:
| China has been absent from TOP500 for years as well.
| amelius wrote:
| Why the focus on AMD and Nvidia? It really isn't that hard to
| design a large number of ALU blocks into some silicon IP block
| and make them work together efficiently.
|
| The real accomplishment is fabricating them.
| georgeecollins wrote:
| But not the profits.
| talldayo wrote:
| > It really isn't that hard to design a large number of ALU
| blocks into some silicon IP block and make them work together
| efficiently.
|
| It really is that hard, and the fabrication side of the issue
| is the easy part from Nvidia's perspective - you just pay TSMC
| a shitload of money. Nvidia's _real_ victory (besides leading
| on performance-per-watt) is that their software stack doesn't
| suck. They invested in complex shader units and tensor
| accelerators that scale with the size of the card rather than
| being restrained to puny and limited NPUs. CUDA unified this
| featureset and has been industry-entrenched for almost a
| decade, which gave it pretty much any feature you could want,
| be it crypto acceleration or AI/ML primitives.
|
| The ultimate tragedy is that there was a potential future where
| a Free and Open Source CUDA alternative existed. Apple wrote
| the OpenCL spec for exactly that purpose and gave it to
| Khronos, but later abandoned it to focus on... _checks
| clipboard_ MLX and Metal Performance Shaders. Oh, what could
| have been if the industry weren't so stingy and shortsighted.
| amelius wrote:
| > you just pay TSMC a shitload of money
|
| I guess with money you can win any argument ...
| talldayo wrote:
| Sure, Apple did the same thing with TSMC's 5nm node. They
| still lost in performance-per-watt in direct comparison
| with Nvidia GPUs using Samsung's 8nm node. Money isn't
| everything, even when you have so much of it that you can
| deny your competitors access to the tech you use.
|
| Nvidia's lead is not _only_ cemented by dense silicon.
| Their designs are extremely competitive, perhaps even a
| generational leap over what their competitors offer.
| amelius wrote:
| Let me phrase it differently.
|
| If Nvidia pulls the plug we can still go to AMD and have
| a reasonable alternative.
|
| If TSMC pulls the plug, however ...
| vitus wrote:
| After skimming the article, I'm confused -- where exactly is the
| headline being pulled from?
|
| If you look at the table toward the bottom, no matter how you
| slice it, Nvidia has 50% of the total cores, 50% of the total
| flops, and 90% of the total systems among the Top 500, while AMD
| has 26% of the total cores, 27.5% of the total flops, and 7% of
| the total systems.
|
| Is it a matter of _newly-added_ compute?
|
| > This time around, on the November 2024 Top500 rankings, AMD is
| the big winner in terms of adding capacity to the HPC base.
| Koshkin wrote:
| > _AMD GPUs drove 72.1 percent of the new performance added for
| the November 2024 rankings_
| vitus wrote:
| Yes, I saw that, but that doesn't justify the title as
| written. Had it said "AMD Now Has More New Compute" I
| wouldn't have said anything.
| latchkey wrote:
| I'm sure there is also a lot not on the Top500. I've got enough
| AMD MI300x compute for about 140th position, but haven't
| submitted numbers.
___________________________________________________________________
(page generated 2024-11-18 23:00 UTC)