[HN Gopher] AmpereOne: Cores Are the New MHz
       ___________________________________________________________________
        
       AmpereOne: Cores Are the New MHz
        
       Author : speckx
       Score  : 76 points
       Date   : 2024-12-05 17:46 UTC (5 hours ago)
        
 (HTM) web link (www.jeffgeerling.com)
 (TXT) w3m dump (www.jeffgeerling.com)
        
       | znpy wrote:
        | Weird, but this makes me think X86-64 might actually be better?
       | 
       | It isn't mentioned anywhere in __this__ article, but the power
       | draw of that chip is 276W. I got this from Phoronix [1]:
       | 
       | > The AmpereOne A192-32X boasts 192 AmpereOne cores, a 3.2GHz
       | clock frequency, a rated 276 Watt usage power
       | 
       | Which is interesting because it's almost half of AMD's 192-core
       | offering [2].
       | 
        | Why is this interesting? The AMD offering draws a little less
       | than double the wattage but has hyper-threading (!!!) meaning you
       | get 384 threads... So this means that AMD is essentially on par
       | with ARM cpus (at least with Ampere's ARM cpus) in terms of power
       | efficiency... Maybe a little better.
       | 
       | I'd be more inclined to think that AMD is the king/queen of power
       | efficiency in the datacenter rather than ARM/Ampere [3].
       | 
       | notes:
       | 
       | [1]: https://www.phoronix.com/review/ampereone-a192-32x
       | 
       | [2]: https://www.youtube.com/watch?v=S-NbCPEgP1A
       | 
        | [3]: regarding Graviton/Axiom or "alleged" server-class Apple
        | silicon... Their power draw is essentially irrelevant, as they're
        | all claims that cannot be tested and evaluated independently, so
        | they don't count in my opinion.
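A back-of-envelope sketch of the threads-per-watt arithmetic above, in Python. The ~500 W EPYC figure ("a little less than double" 276 W) is an assumption taken from the comment, and counting an SMT thread as a full core flatters AMD, since SMT adds well under 2x throughput:

```python
# Threads-per-watt comparison using the figures quoted above. The 500 W
# EPYC number is an assumption ("a little less than double" 276 W), and
# treating an SMT thread as equal to a full core is generous to AMD.
chips = {
    "AmpereOne A192-32X": {"threads": 192, "watts": 276},
    "EPYC 192-core":      {"threads": 384, "watts": 500},  # assumed
}

for name, c in chips.items():
    print(name, round(c["threads"] / c["watts"], 2))  # threads per watt
```

By raw thread count per watt the two come out close (~0.70 vs ~0.77), which is the "essentially on par... maybe a little better" claim, but it ignores per-thread performance differences entirely.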
        
         | alfiedotwtf wrote:
          | Last time I looked at Ampere it was way less power hungry
          | than the competition. EPYC is over 250W at idle!
         | 
         | Off the top of my head, the Ampere 128 core was 250W at max
         | load. Big difference.
         | 
         | Only issue I had was cache size, and you could only buy them
         | from NewEgg.
         | 
          | Oh... and the dual socket Gigabyte board had the sockets so
          | close together that if you put a big-ass heat sink and fan on
          | each, the outlet of one fan will go straight into the inlet of
          | the other!
        
           | zamadatix wrote:
           | I think you might be comparing your memories of Ampere's CPU
           | power draw to either a full Epyc server's wall power draw or
           | stated TDP. The 128 core EPYC 9754 in the chart linked above
           | has a min of 10 W and a max draw of 397 W even though it
           | outperforms the new 192 core variant AmpereOne A192-32X which
           | had a min of 102 W and a max of 401 W.
           | 
            | The min on the A192 does seem anomalously high though. That
           | said, even the M128-30 has a min of 21 W or about twice that
           | of the AmpereOne.
        
         | ChocolateGod wrote:
         | > has hyper-threading (!!!) meaning you get 384 threads
         | 
          | isn't SMT turned off in shared-tenant scenarios now due to
          | security concerns?
        
           | wmf wrote:
           | No, you can use core scheduling to assign both threads to the
           | same tenant.
        
         | klelatti wrote:
         | > but has hyper-threading (!!!) meaning you get 384 threads...
         | So this means that AMD is essentially on par with ARM cpus
         | 
         | So you're saying 'essentially' an AMD thread is the same as an
         | ampere core?!
        
           | wmf wrote:
           | Always has been. Most ARM cores are closer to x86 E-cores.
        
             | klelatti wrote:
             | Ampere cores are not 'most Arm cores'
        
         | jsheard wrote:
         | Hyperthreading doesn't even get close to doubling actual
          | performance; it depends on the workload, but AMD's Zen 5 gains
          | about 15% from HT on average, according to Phoronix's
          | benchmarks.
         | 
         | https://www.phoronix.com/review/amd-ryzen-zen5-smt/8
        
           | phkahler wrote:
           | In practice on Zen1 and Zen3 I found HT to provide 20% to
           | 25%.
           | 
           | It seems some benchmarks are as high as 50% with zen5.
           | 
           | AMD really improved HT with Zen 5 by adding a second
           | instruction decoder per core. I expect even better per-thread
           | performance from Zen 6.
        
         | loudmax wrote:
         | Jeff says as much. The AMD EPYC core does offer better
         | performance per watt, which says more about AMD than it does
         | about Ampere.
         | 
         | But also this: "The big difference is the AmpereOne A192-32X is
         | $5,555, while the EPYC 9965 is almost $15,000!"
         | 
         | So you have to look at TCO over a period of years, but that's
         | still quite significant.
        
           | magicalhippo wrote:
           | Dad of a friend ran a store selling designer underwear for
           | men. He sold boxer shorts for $70 over 25 years ago.
           | 
           | He once told me that he also had a rack of thongs. Now, the
           | thongs didn't have a lot of material, obviously, and were
           | inexpensive to manufacture, so if he'd do his regular markup
           | on them they'd end up selling for $15.
           | 
           | "However, notice the price tag is $25. I add an extra $10,
           | because if a guy's looking to buy a thong, he's going to buy
           | a thong".
           | 
           | I think about what he said when I see the chip prices on
           | these high-end server CPUs.
        
           | touisteur wrote:
           | Wondering who actually pays public price on those AMD chips
            | in big-OEM servers. Not saying the AmpereOne isn't
            | discountable too, but these public prices always feel like
           | signalling, to me, more than a reference to the actual price.
           | Or maybe I'm lucky...
        
         | burnte wrote:
         | In the video Jeff mentions that the EPYC CPU wins in
         | performance per watt but not per dollar because it's 3x the
         | cost.
        
         | monlockandkey wrote:
         | $15,000 CPU is better than a $5000 CPU?
         | 
         | You can do all the benchmarks you want, but if you don't factor
         | in price, then of course the more expensive product (in
         | general) is going to be better.
         | 
         | It is the same thing with the Snapdragon Elite and Mediatek
         | Dimensity 9400. The SDE is the faster processor and more
         | expensive.
        
       | amelius wrote:
       | > Cores are great, but it's all about how you slice them. Don't
       | think of this as a single 192-core server. Think of it more like
       | 48 dedicated 4-core servers in one box. And each of those servers
       | has 10 gigs of high-speed RAM and consistent performance.
       | 
       | Suddenly it sounds less impressive ...
        
         | kevingadd wrote:
         | To be fair, utilizing 192 cores for a single process or
         | operation is often exceedingly difficult. Scheduling and
         | coordination and resource sharing are all really hard with
            | thread counts that high, so you're probably best off working
            | in terms of smaller clusters of 4-16 threads instead. Lots of
         | algorithms stop scaling well around the 8-32 range.
        
           | stonemetal12 wrote:
           | When was the last time you saw a server with 10GB Ram no
           | matter the number of cores\threads?
        
             | Suppafly wrote:
             | >When was the last time you saw a server with 10GB Ram no
             | matter the number of cores\threads?
             | 
             | Is that a lot or a little? I have a bunch that only have
             | 8gb, it just depends on what they are being used for.
        
               | geerlingguy wrote:
               | My two primary webservers are running on 2GB of RAM
               | still... it depends on the needs of your application :)
        
             | com2kid wrote:
             | I've run plenty of microservices with 256 or 512 GB of RAM,
             | and they were handling large loads. So long as each request
             | is short lived, and using a runtime with low per request
             | overhead (e.g. Node), memory is not really a problem for
             | many types of workloads.
        
           | chasil wrote:
           | I use xargs -P every weekend to back up my (Oracle) database.
           | 
            | It has 510 files of around 2 GB each, and the parallel script
           | uses rman to make a datafile copy on each one, lzip it, then
           | scp it to my backup server.
           | 
           | I have xargs set to run 10 at once. Could I increase to 192?
           | Yes.
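The pipeline described above can be sketched with xargs -P. This self-contained demo swaps gzip in for lzip and skips the rman and scp steps, since those need a real database and a backup host; only the parallelism pattern is the point:

```shell
# Sketch of the parallel backup pattern described above. gzip stands in
# for lzip, and the rman/scp steps are omitted (they need a real
# database and remote host). xargs -P caps concurrent jobs.
demo=/tmp/xargs_backup_demo
rm -rf "$demo" && mkdir -p "$demo"

# Fake "datafiles" standing in for the ~2 GB Oracle files.
for i in 1 2 3 4; do head -c 4096 /dev/zero > "$demo/datafile$i.dbf"; done

# Compress up to 4 at once (the comment's script uses -P 10; raising it
# toward the core count mainly helps the CPU-bound compression step).
ls "$demo"/*.dbf | xargs -P 4 -I{} gzip -k {}

ls "$demo"/*.dbf.gz | wc -l   # count of compressed copies
```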
        
             | bee_rider wrote:
             | So the part that would actually benefit from the 192 cores
             | would just be the lzip, right?
        
           | griomnib wrote:
           | You can also saturate your network link as well. I do like
           | these improvements, but I'm old enough to know it's always a
           | game of "move the bottleneck"!
        
           | WorkerBee28474 wrote:
           | > Lots of algorithms stop scaling well around the 8-32 range.
           | 
           | For the curious: https://en.wikipedia.org/wiki/Amdahl's_law
        
             | bangaladore wrote:
             | I think a simple way to think about this is with more cores
             | comes more scheduling complexity. That's without even
             | considering how well you can parallelize your actual tasks.
             | 
              | If instead we break 128 cores into 4-core systems,
              | scheduling becomes much less complex.
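Amdahl's law, linked above, can be put in numbers; the 5% serial fraction here is purely illustrative:

```python
# Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n), where s is the
# fraction of the work that stays serial. The 5% figure is illustrative.
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for n in (4, 32, 192):
    print(n, round(amdahl_speedup(0.05, n), 1))
```

With just 5% serial work, 192 cores deliver roughly an 18x speedup, not 192x, which is why scaling tends to flatten out well before core counts like these.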
        
         | sixothree wrote:
         | I'm not understanding how the 10 GB of RAM gets assigned to a
         | core. Is this just his way to describe memory channels
         | corresponding to the cores?
        
           | wmf wrote:
           | It's not assigned at the hardware level (although Ampere has
           | memory QoS) but you can assign RAM at the VM level which is
           | what these CPUs are intended for.
        
         | Dylan16807 wrote:
         | If you use the 1U model you can put about 1800 of those basic
         | servers into a single rack. I think that's pretty impressive.
         | 
         | And they have a dual processor 1U too...
        
       | ThinkBeat wrote:
        | How much does one of these servers cost?
       | 
        | Is the estimate that it is cheaper based on a comparable server
       | with the same or similar core count?
        
         | voxadam wrote:
         | From the article:
         | 
         | > the AmpereOne A192-32X is $5,555, while the EPYC 9965 is
         | almost $15,000!
        
           | cbmuser wrote:
            | Sounds like x86-64 is starting to lose some of its market
           | share very soon.
        
             | wmf wrote:
             | Half of all CPUs in AWS are ARM already.
             | 
             | Update: "Over the last two years, more than 50 percent of
             | all the CPU capacity landed in our datacenters was on AWS
             | Graviton." So not half of all but getting there.
             | https://www.nextplatform.com/2024/12/03/aws-reaps-the-
             | benefi...
        
               | droideqa wrote:
               | I just looked it up - that is a mistaken statistic. 50%
               | of their CPUs are not Arm, but AWS has 50% of all server-
               | side Arm CPUs.
               | 
               | "But that total is beaten by just one company - Amazon -
               | which has slightly above 50 percent of all Arm server
               | CPUs in the world deployed in its Amazon Web Services
               | (AWS) datacenters, said the analyst."[0]
               | 
               | [0]: https://www.theregister.com/2023/08/08/amazon_arm_se
               | rvers/
        
             | hkchad wrote:
             | Yup,
             | 
             | https://www.google.com/finance/quote/INTC:NASDAQ?sa=X&ved=2
             | a...
        
       | loudmax wrote:
       | I found this part particularly interesting:
       | 
       | > Also, with 512 gigs of RAM and a massive CPU, it can run a 405
       | billion parameter Large Language Model. It's not fast, but it did
       | run, giving me just under a token per second.
       | 
       | If you're serious about running LLMs and you can afford it,
       | you'll of course want GPUs. But this might be a relatively
       | affordable way to run really huge models like Llama 405B on your
       | own hardware. This could be even more plausible on Ampere's
       | upcoming 512-core CPU, though RAM bandwidth might be more of a
       | bottleneck than CPU cores. Probably a niche use case, but
       | intriguing.
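One way to sanity-check the ~1 token/sec figure: on a CPU, each generated token streams the whole quantized model through memory once, so memory bandwidth caps throughput regardless of core count. The bandwidth and quantization numbers below are illustrative assumptions, not measured A192-32X specs:

```python
# Rough upper bound on CPU token generation: tokens/sec is at most
# memory_bandwidth / model_size, since every token reads all weights.
# Both numbers here are assumptions, not measured A192-32X figures.
params = 405e9
bytes_per_param = 0.5                 # assuming 4-bit quantization
model_bytes = params * bytes_per_param  # ~203 GB

sustained_bandwidth = 200e9           # assumed effective DDR5 bytes/sec

print(sustained_bandwidth / model_bytes)  # tokens/sec upper bound
```

Under these assumptions the bound lands right around 1 token/sec, consistent with the article's observation, and it suggests the 512-core part would be bandwidth-limited on a model this size too.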
        
         | revnode wrote:
         | It's really slow. Like, unusably slow. For those interested in
         | self-hosting, this is a really good resource:
         | https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
        
           | geerlingguy wrote:
           | It's a little under 1 token/sec using ollama, but that was
           | with stock llama.cpp -- apparently Ampere has their own
           | optimized version that runs a little better on the AmpereOne.
           | I haven't tested it yet with 405b.
        
           | zozbot234 wrote:
           | It's not "really slow" at all, 1 tok/sec is absolutely par
           | for the course given the overall model size. The 405B model
           | was never actually intended for production use, so the fact
           | that it can even kinda run at speeds that are almost usable
           | is itself noteworthy.
        
           | johnklos wrote:
           | You know, there's nothing wrong with running a slow LLM.
           | 
           | For some people, they lack the resources to run an LLM on a
           | GPU. For others, they want to try certain models without
           | buying thousands of dollars of equipment just to try things
           | out.
           | 
            | Either way, I see too many people putting the proverbial
            | cart before the horse: they buy a video card, then try to fit
           | LLMs in to the limited VRAM they have, instead of playing
           | around, even if at 1/10th the speed, and figuring out _which_
           | models they want to run before deciding where they want to
           | invest their money.
           | 
           | One token a second is worlds better than running nothing at
           | all because someone told you that you shouldn't or can't
           | because you don't have a fancy, expensive GPU.
        
             | zozbot234 wrote:
             | > For some people, they lack the resources to run an LLM on
             | a GPU.
             | 
             | Most people have a usable iGPU, that's going to run most
              | models significantly slower (because of less available
              | memory throughput, and/or more of it being wasted on
              | padding, compared to the CPU) but a lot cooler than the
              | CPU. NPUs will likely be a similar story.
             | 
             | It would be nice if there was an easy way to only run the
             | initial prompt+context processing (which is generally
             | compute bound) on iGPU+NPU, but move to CPU for the token
             | generation stage.
        
           | MrDrMcCoy wrote:
           | Bummer that they have no stats for AMD, Intel, Qualcomm, etc
           | (C|G|N|X)PUs.
        
       | astrodust wrote:
        | The article states $5,555 USD or so.
        
       | binary132 wrote:
       | not when your throughput is dictated by how fast you can process
       | a single task
        
       | cbmuser wrote:
       | Really a pity that Oracle killed off SPARC. They already had
       | 32-core CPUs almost a decade ago but Oracle never really
       | understood the value that SPARC and Solaris brought to the table.
        
         | mrbluecoat wrote:
          | Ironically, Oracle seems to be the only cloud provider
          | offering Ampere compute currently.
        
           | geerlingguy wrote:
            | Azure seems to have Ampere-based offerings[1], as well
           | as Hetzner[2].
           | 
           | [1] https://azure.microsoft.com/en-us/blog/azure-virtual-
           | machine...
           | 
           | [2] https://www.hetzner.com/press-release/arm64-cloud/
        
             | mrbluecoat wrote:
             | Oh cool, I thought they had discontinued them.
        
       | torginus wrote:
       | Now, what percentage of companies could get away with sticking a
        | pair of these (for redundancy) and running their entire
        | operation off of them?
        
       ___________________________________________________________________
       (page generated 2024-12-05 23:00 UTC)