[HN Gopher] AmpereOne: Cores Are the New MHz
___________________________________________________________________
AmpereOne: Cores Are the New MHz
Author : speckx
Score : 76 points
Date : 2024-12-05 17:46 UTC (5 hours ago)
(HTM) web link (www.jeffgeerling.com)
(TXT) w3m dump (www.jeffgeerling.com)
| znpy wrote:
| Weird, but this makes me think X86-64 might actually be better?
|
| It isn't mentioned anywhere in __this__ article, but the power
| draw of that chip is 276W. I got this from Phoronix [1]:
|
| > The AmpereOne A192-32X boasts 192 AmpereOne cores, a 3.2GHz
| clock frequency, a rated 276 Watt usage power
|
| Which is interesting because it's almost half of AMD's 192-core
| offering [2].
|
| Why is this interesting? The AMD offering draws a little less
| than double the wattage but has hyper-threading (!!!), meaning
| you get 384 threads... So this means that AMD is essentially on
| par with ARM CPUs (at least with Ampere's ARM CPUs) in terms of
| power efficiency... Maybe a little better.
|
| I'd be more inclined to think that AMD is the king/queen of power
| efficiency in the datacenter rather than ARM/Ampere [3].
|
| notes:
|
| [1]: https://www.phoronix.com/review/ampereone-a192-32x
|
| [2]: https://www.youtube.com/watch?v=S-NbCPEgP1A
|
| [3]: regarding Graviton/Axiom or "alleged" server-class apple
| silicon... Their power draw is essentially irrelevant, as
| they're all claims that cannot be tested and evaluated
| independently, so they don't count in my opinion.
| alfiedotwtf wrote:
| Last time I looked at Ampere, it was way less power hungry
| than the competition. EPYC is over 250W at idle!
|
| Off the top of my head, the Ampere 128 core was 250W at max
| load. Big difference.
|
| Only issue I had was cache size, and you could only buy them
| from NewEgg.
|
| Oh... and the dual-socket Gigabyte board had the sockets so
| close together that if you put a big-ass heat sink and fan on
| each, the outlet of one fan would go straight into the inlet
| of the other!
| zamadatix wrote:
| I think you might be comparing your memories of Ampere's CPU
| power draw to either a full Epyc server's wall power draw or
| stated TDP. The 128 core EPYC 9754 in the chart linked above
| has a min of 10 W and a max draw of 397 W even though it
| outperforms the new 192 core variant AmpereOne A192-32X which
| had a min of 102 W and a max of 401 W.
|
| The min on the A192 does seem anomalously high though. That
| said, even the M128-30 has a min of 21 W or about twice that
| of the AmpereOne.
| ChocolateGod wrote:
| > has hyper-threading (!!!) meaning you get 384 threads
|
| Isn't SMT turned off in shared-tenant scenarios now due to
| security concerns?
| wmf wrote:
| No, you can use core scheduling to assign both threads to the
| same tenant.
| klelatti wrote:
| > but has hyper-threading (!!!) meaning you get 384 threads...
| So this means that AMD is essentially on par with ARM cpus
|
| So you're saying 'essentially' an AMD thread is the same as
| an Ampere core?!
| wmf wrote:
| Always has been. Most ARM cores are closer to x86 E-cores.
| klelatti wrote:
| Ampere cores are not 'most Arm cores'
| jsheard wrote:
| Hyperthreading doesn't even get close to doubling actual
| performance. It depends on the workload, but AMD's Zen 5
| gains about 15% from HT on average, according to Phoronix's
| benchmarks.
|
| https://www.phoronix.com/review/amd-ryzen-zen5-smt/8
| phkahler wrote:
| In practice on Zen1 and Zen3 I found HT to provide 20% to
| 25%.
|
| It seems some benchmarks are as high as 50% with Zen 5.
|
| AMD really improved HT with Zen 5 by adding a second
| instruction decoder per core. I expect even better per-thread
| performance from Zen 6.
| loudmax wrote:
| Jeff says as much. The AMD EPYC core does offer better
| performance per watt, which says more about AMD than it does
| about Ampere.
|
| But also this: "The big difference is the AmpereOne A192-32X is
| $5,555, while the EPYC 9965 is almost $15,000!"
|
| So you have to look at TCO over a period of years, but that's
| still quite significant.
| magicalhippo wrote:
| Dad of a friend ran a store selling designer underwear for
| men. He sold boxer shorts for $70 over 25 years ago.
|
| He once told me that he also had a rack of thongs. Now, the
| thongs didn't have a lot of material, obviously, and were
| inexpensive to manufacture, so if he'd do his regular markup
| on them they'd end up selling for $15.
|
| "However, notice the price tag is $25. I add an extra $10,
| because if a guy's looking to buy a thong, he's going to buy
| a thong".
|
| I think about what he said when I see the chip prices on
| these high-end server CPUs.
| touisteur wrote:
| Wondering who actually pays public price on those AMD chips
| in big-OEM servers. Not saying the AmpereOne isn't
| discountable too, but these public prices always feel like
| signalling, to me, more than a reference to the actual price.
| Or maybe I'm lucky...
| burnte wrote:
| In the video Jeff mentions that the EPYC CPU wins in
| performance per watt but not per dollar because it's 3x the
| cost.
| monlockandkey wrote:
| $15,000 CPU is better than a $5000 CPU?
|
| You can do all the benchmarks you want, but if you don't factor
| in price, then of course the more expensive product (in
| general) is going to be better.
|
| It is the same thing with the Snapdragon Elite and Mediatek
| Dimensity 9400. The SDE is the faster processor and more
| expensive.
| amelius wrote:
| > Cores are great, but it's all about how you slice them. Don't
| think of this as a single 192-core server. Think of it more like
| 48 dedicated 4-core servers in one box. And each of those servers
| has 10 gigs of high-speed RAM and consistent performance.
|
| Suddenly it sounds less impressive ...
| kevingadd wrote:
| To be fair, utilizing 192 cores for a single process or
| operation is often exceedingly difficult. Scheduling and
| coordination and resource sharing are all really hard with
| thread counts that high, so you're probably best operating in
| terms of smaller clusters of 4-16 threads instead. Lots of
| algorithms stop scaling well around the 8-32 range.
| stonemetal12 wrote:
| When was the last time you saw a server with 10GB Ram no
| matter the number of cores\threads?
| Suppafly wrote:
| >When was the last time you saw a server with 10GB Ram no
| matter the number of cores\threads?
|
| Is that a lot or a little? I have a bunch that only have
| 8gb, it just depends on what they are being used for.
| geerlingguy wrote:
| My two primary webservers are running on 2GB of RAM
| still... it depends on the needs of your application :)
| com2kid wrote:
| I've run plenty of microservices with 256 or 512 MB of RAM,
| and they were handling large loads. So long as each request
| is short lived, and using a runtime with low per request
| overhead (e.g. Node), memory is not really a problem for
| many types of workloads.
| chasil wrote:
| I use xargs -P every weekend to back up my (Oracle) database.
|
| It has 510 files that are around 2 GB each, and the parallel script
| uses rman to make a datafile copy on each one, lzip it, then
| scp it to my backup server.
|
| I have xargs set to run 10 at once. Could I increase to 192?
| Yes.
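| The same fan-out can be dry-run sketched with xargs -P (the
| file names and backup host here are made up, and the
| rman/lzip/scp stages are only echoed, not executed):

```shell
# xargs -P keeps up to 10 pipelines in flight, one per datafile.
printf '%s\n' datafile_{001..510}.dbf |
  xargs -P 10 -I{} echo 'rman copy {} | lzip | scp to backuphost' |
  wc -l    # -> 510 pipelines would be launched
```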
| bee_rider wrote:
| So the part that would actually benefit from the 192 cores
| would just be the lzip, right?
| griomnib wrote:
| You can saturate your network link as well. I do like
| these improvements, but I'm old enough to know it's always a
| game of "move the bottleneck"!
| WorkerBee28474 wrote:
| > Lots of algorithms stop scaling well around the 8-32 range.
|
| For the curious: https://en.wikipedia.org/wiki/Amdahl's_law
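| A quick way to see why: under Amdahl's law the speedup on n
| cores is 1 / ((1-p) + p/n) for parallel fraction p. A small
| sketch (the p = 0.95 figure is illustrative, not from the
| article):

```shell
# Even a 95%-parallel workload tops out at 1/(1-p) = 20x,
# so 192 cores only buy ~18x over a single core.
for n in 8 32 192; do
  awk -v n="$n" 'BEGIN {
    p = 0.95
    printf "n=%3d  speedup=%.1f\n", n, 1 / ((1 - p) + p / n)
  }'
done
```

| Going from 32 to 192 cores only moves the speedup from ~12.5x
| to ~18.2x, which is the "stops scaling" effect in practice.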
| bangaladore wrote:
| I think a simple way to think about this is with more cores
| comes more scheduling complexity. That's without even
| considering how well you can parallelize your actual tasks.
|
| If instead we break 128 cores into 4-core systems,
| scheduling becomes much less complex.
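| On Linux, one hedged sketch of that carve-up is pinning each
| workload to its own 4-core slice with taskset, matching the
| article's "48 dedicated 4-core servers" framing (my_worker
| and the core numbering are illustrative):

```shell
# 192 cores -> 48 independent 4-core "servers"; the kernel
# scheduler then never juggles more than 4 cores per job.
# Dry run: the taskset commands are echoed, not executed.
for i in $(seq 0 47); do
  lo=$(( i * 4 )); hi=$(( lo + 3 ))
  echo taskset -c "$lo-$hi" ./my_worker
done | tail -n 1    # -> taskset -c 188-191 ./my_worker
```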
| sixothree wrote:
| I'm not understanding how the 10 GB of RAM gets assigned to a
| core. Is this just his way to describe memory channels
| corresponding to the cores?
| wmf wrote:
| It's not assigned at the hardware level (although Ampere has
| memory QoS) but you can assign RAM at the VM level which is
| what these CPUs are intended for.
| Dylan16807 wrote:
| If you use the 1U model you can put about 1800 of those basic
| servers into a single rack. I think that's pretty impressive.
|
| And they have a dual processor 1U too...
| ThinkBeat wrote:
| How much does one of these servers cost?
|
| Is the estimate that it is cheaper based on comparable server
| with the same or similar core count?
| voxadam wrote:
| From the article:
|
| > the AmpereOne A192-32X is $5,555, while the EPYC 9965 is
| almost $15,000!
| cbmuser wrote:
| Sounds like x86-64 is going to start losing some of its
| market share very soon.
| wmf wrote:
| Half of all CPUs in AWS are ARM already.
|
| Update: "Over the last two years, more than 50 percent of
| all the CPU capacity landed in our datacenters was on AWS
| Graviton." So not half of all but getting there.
| https://www.nextplatform.com/2024/12/03/aws-reaps-the-
| benefi...
| droideqa wrote:
| I just looked it up - that is a mistaken statistic. 50%
| of their CPUs are not Arm, but AWS has 50% of all server-
| side Arm CPUs.
|
| "But that total is beaten by just one company - Amazon -
| which has slightly above 50 percent of all Arm server
| CPUs in the world deployed in its Amazon Web Services
| (AWS) datacenters, said the analyst."[0]
|
| [0]: https://www.theregister.com/2023/08/08/amazon_arm_se
| rvers/
| hkchad wrote:
| Yup,
|
| https://www.google.com/finance/quote/INTC:NASDAQ?sa=X&ved=2
| a...
| loudmax wrote:
| I found this part particularly interesting:
|
| > Also, with 512 gigs of RAM and a massive CPU, it can run a 405
| billion parameter Large Language Model. It's not fast, but it did
| run, giving me just under a token per second.
|
| If you're serious about running LLMs and you can afford it,
| you'll of course want GPUs. But this might be a relatively
| affordable way to run really huge models like Llama 405B on your
| own hardware. This could be even more plausible on Ampere's
| upcoming 512-core CPU, though RAM bandwidth might be more of a
| bottleneck than CPU cores. Probably a niche use case, but
| intriguing.
| revnode wrote:
| It's really slow. Like, unusably slow. For those interested in
| self-hosting, this is a really good resource:
| https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
| geerlingguy wrote:
| It's a little under 1 token/sec using ollama, but that was
| with stock llama.cpp -- apparently Ampere has their own
| optimized version that runs a little better on the AmpereOne.
| I haven't tested it yet with 405b.
| zozbot234 wrote:
| It's not "really slow" at all, 1 tok/sec is absolutely par
| for the course given the overall model size. The 405B model
| was never actually intended for production use, so the fact
| that it can even kinda run at speeds that are almost usable
| is itself noteworthy.
| johnklos wrote:
| You know, there's nothing wrong with running a slow LLM.
|
| For some people, they lack the resources to run an LLM on a
| GPU. For others, they want to try certain models without
| buying thousands of dollars of equipment just to try things
| out.
|
| Either way, I see too many people putting the proverbial
| cart before the horse: they buy a video card, then try to
| fit LLMs into the limited VRAM they have, instead of playing
| around, even if at 1/10th the speed, and figuring out _which_
| models they want to run before deciding where they want to
| invest their money.
|
| One token a second is worlds better than running nothing at
| all because someone told you that you shouldn't or can't
| because you don't have a fancy, expensive GPU.
| zozbot234 wrote:
| > For some people, they lack the resources to run an LLM on
| a GPU.
|
| Most people have a usable iGPU, which is going to run most
| models significantly slower than the CPU (because of lower
| available memory throughput, and/or more of it being wasted
| on padding) but a lot cooler. NPUs will likely be a similar
| story.
|
| It would be nice if there was an easy way to only run the
| initial prompt+context processing (which is generally
| compute bound) on iGPU+NPU, but move to CPU for the token
| generation stage.
| MrDrMcCoy wrote:
| Bummer that they have no stats for AMD, Intel, Qualcomm, etc
| (C|G|N|X)PUs.
| astrodust wrote:
| The article states $5,550 USD or so.
| binary132 wrote:
| not when your throughput is dictated by how fast you can process
| a single task
| cbmuser wrote:
| Really a pity that Oracle killed off SPARC. They already had
| 32-core CPUs almost a decade ago but Oracle never really
| understood the value that SPARC and Solaris brought to the table.
| mrbluecoat wrote:
| Ironically, Oracle seems to be the only cloud compute
| provider currently offering Ampere.
| geerlingguy wrote:
| Azure seems to be offering Ampere-based offerings[1], as well
| as Hetzner[2].
|
| [1] https://azure.microsoft.com/en-us/blog/azure-virtual-
| machine...
|
| [2] https://www.hetzner.com/press-release/arm64-cloud/
| mrbluecoat wrote:
| Oh cool, I thought they had discontinued them.
| torginus wrote:
| Now, what percentage of companies could get away with sticking a
| pair of these (for redundancy) and run their entire operation off
| of it?
___________________________________________________________________
(page generated 2024-12-05 23:00 UTC)