[HN Gopher] AMD's Turin: 5th Gen EPYC Launched
___________________________________________________________________
AMD's Turin: 5th Gen EPYC Launched
Author : zdw
Score : 294 points
Date : 2024-10-12 00:22 UTC (22 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| bitwize wrote:
| > Apparently we now think 64 cores is 'lower core count'. What a
| world we live in.
|
| 64 cores is a high-end gaming rig. Civilization VII won't run
| smoothly on fewer than 16.
| noncoml wrote:
| Civilization VII won't run smoothly.
|
| Only recently did I manage to build a PC that runs Civ 6
| smoothly during the late game on a huge map.
| 7thpower wrote:
| What are the specs?
|
| Tangentially related, but I need to go check A18 Civ 6
| benchmarks. The experience on my A15 with small map sizes
| was surprisingly good.
| noncoml wrote:
| It's not the latest and greatest: a 12900K + 64GB DDR4. But
| even when the 12900K came out (2021), Civ 6 was already 5
| years old.
| zamadatix wrote:
| Civ 6 really doesn't utilize cores as much as one would think.
| I mean it'll spread the load across a lot of threads, sure, but
| it never seems to actually... use them much? E.g. I just ran
| the Gathering Storm expansion AI benchmark (late game map
| completely full of civs and units - basically the worst case
| for CPU requirements and the best case for eating up multicore
| performance) on a 7950X 16 core CPU, and it rarely peaked over
| 30% utilization, often averaging ~25%. 30% utilization means a
| 6 core part (barring frequency/cache differences) should be
| able to eat that at 80% load (16 cores x 30% is ~4.8 cores'
| worth of work, which is 80% of 6 cores).
|
| https://i.imgur.com/YlJFu4s.png
|
| Whether the bottleneck is memory bandwidth (2x6000 MHz),
| unoptimized locking, small batch sizes, or something else, it
| doesn't seem to be related to core count. It's also not waiting
| on the GPU much here; the 4090 is seeing even less utilization
| than the CPU. Hopefully Civ 7 actually scales utilization up,
| rather than just splitting the load across more threads.
| lukeschlather wrote:
| > 16 core CPU and it rarely peaked over 30% utilization,
| often averaging ~25%. 30% utilization means a 6 core part
| (barring frequency/cache differences) should be able to eat
| that at 80% load.
|
| As a rule I wouldn't be surprised if 90% of the stuff Civ 6
| is doing can't be parallelized at all, but then for that
| remaining 10% you get a 16x speedup with 16 cores. The cores
| are underutilized on average, but there are bursts where you
| get a measurable speedup from having 16 cores, and that
| speedup is strictly linear with the number of cores. With 6
| cores, that remaining 10% will run at less than half the
| speed it would with 16 cores. And I think this is consistent
| with observing 30% CPU usage.
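|
| A quick way to sanity-check this sort of reasoning is to plug
| an assumed serial/parallel split into Amdahl's law; a minimal
| sketch in Python (the 90/10 split is just the guess above, not
| a measured number):
|
|     # Amdahl's law: normalized wall time for a given split.
|     def amdahl(parallel_frac: float, cores: int):
|         wall = (1 - parallel_frac) + parallel_frac / cores
|         speedup = 1 / wall
|         # total work is 1 core-unit, spread over `cores` for `wall` time
|         avg_util = 1 / (cores * wall)
|         return speedup, avg_util
|
|     for cores in (6, 16):
|         s, u = amdahl(0.10, cores)
|         print(f"{cores} cores: {s:.2f}x speedup, {u:.0%} avg utilization")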
| colechristensen wrote:
| My rule is more like I'd be willing to bet even odds that
| this could be sped up 100x with the right programmers
| focused on performance. When you lack expertise and things
| work "well enough" that's what you get. Same for games or
| enterprise software.
| squarefoot wrote:
| That's what we get in a market dominated more by the need
| to release before the competition rather than taking some
| time to optimize software. If it's slow, one can still
| blame the iron and users who don't upgrade it.
| oivey wrote:
| If only 10% of the workload can be parallelized, then the
| best-case speedup from parallelization is only about 10%. That
| doesn't line up with the GP's claim that Civ6 benefits from
| more cores.
| Uvix wrote:
| They referenced the upcoming Civ7 (which does include
| 16-core chips in its highest recommended specs), not
| Civ6.
| gkhartman wrote:
| I can't help but think that this sounds more like a failure to
| optimize at the software level than a reasonable hardware
| limitation.
| cubefox wrote:
| That's the usual case when vid3o g4mes are "CPU limited". One
| just has to ask whether the game does anything high-level
| that other games didn't do 10 years ago. Reasonable hardware
| limitations related to the CPU normally have to do with
| complex physics effects or unusually large crowds of NPCs.
| (Many games are CPU limited even for fairly small crowds
| because their engine isn't optimized for that purpose.)
| deaddodo wrote:
| > vid3o g4mes
|
| Why do you type it like that?
| immibis wrote:
| Usually people do this to avoid automatic censorship
| systems. HN certainly has censorship in place, but I'm
| not aware of any that targets discussion of video games.
| cubefox wrote:
| Old habit. I'm a kid of the 1990s, and we were convinced
| there wasn't anything cooler than a) video games and b)
| replacing letters with numbers. In retrospect, we might
| have been a little biased.
| deaddodo wrote:
| Well, just for future reference; if you're a kid of the
| 90s, you're well into your 30s now.
|
| It's weird/juvenile to be typing a phrase in a manner
| similar to a preteen, well after 20+ years have passed.
| Especially in the middle of an otherwise normal
| message/conversation.
| bitwize wrote:
| We thought that was dorky, skiddie shit even back in the
| 90s. It was even more stale by the time Piro and Largo
| turned it into a personality trait.
|
| Though I think this guy just did it the way a 2000s kid
| would say "vidya gaems".
| treesciencebot wrote:
| all high end "gaming" rigs are either using ~16 real cores or
| 8:24 performance/efficiency cores these days.
| threadripper/other HEDT options are not particularly good at
| gaming due to (relatively) lower clock speed / inter-CCD
| latencies.
| csomar wrote:
| If Civ 6 is any guide, 64 or 32 won't make the slightest
| difference. The next-turn calculations seem to run on a single
| core, and thus having more cores is not going to change a
| thing. This is a software problem; they need to distribute the
| calculation over several cores.
| fulafel wrote:
| As the GPGPU scene's trajectory seems dismal[1] for the
| foreseeable future wrt the developer experience, this seems
| like the best hope.
|
| [1] Fragmentation, at best C++ dialects, no practical compiler
| tech to transparently offload to the GPU, etc.
| snvzz wrote:
| civ6's slowness is purely bad programming. No excuses to be
| had.
| Pet_Ant wrote:
| [citation needed]
| dragontamer wrote:
| ChipsAndCheese is one of the few new tech publications that
| really knows what they are talking about, especially with these
| deep dive benchmarks.
|
| With the loss of Anandtech, TechReport, HardCOP and other old
| technical sites, I'm glad to see a new publisher who can keep up
| with the older style stuff.
| mongol wrote:
| Interestingly, Slashdot originated from a site called "Chips &
| Dips". Similar inspiration?
| tandr wrote:
| Did you mean to say HardOCP?
| kderbe wrote:
| Chips and Cheese most reminds me of the long-gone LostCircuits.
| Most tech sites focus on the usual slate of application
| benchmarks, but C&C writes, and LC wrote, long-form articles
| about architecture, combined with subsystem micro-benchmarks.
| jeffbee wrote:
| The part with only 16 cores but 512MB L3 cache ... that must be
| for some specific workload.
| addaon wrote:
| Does anyone know if modern AMD chips allow mapping the L3 cache
| and using it as TCM instead of cache? I know older non-x86
| processors supported this (and often booted into that mode so
| that the memory controllers could be brought up), but I'm not
| sure if it's possible today. If so, that would sure make for
| some interesting embedded use cases for a large DRAM-less
| system...
| bpye wrote:
| The coreboot docs claim that modern AMD parts no longer
| support cache-as-RAM.
|
| https://doc.coreboot.org/soc/amd/family17h.html
| SV_BubbleTime wrote:
| Lame.
|
| Using it as TCM seems super useful.
|
| Although you would need to fight for/request it from the OS, so
| technically I see why they might ditch it.
| hales wrote:
| Wow, thanks for the link, I had no idea:
|
| > AMD has ported early AGESA features to the PSP, which now
| discovers, enables and trains DRAM. Unlike any other x86
| device in coreboot, a Picasso system has DRAM online prior
| to the first instruction fetch.
|
| Perhaps they saw badly trained RAM as a security flaw? Or
| maybe doing it with the coprocessor helped them distribute
| the training code more easily (I heard a rumour once that
| RAM training algos are heavily patented? Might have
| imagined it).
| londons_explore wrote:
| If you keep your working set small enough, you should be able
| to tell the CPU it has RAM attached, but never actually
| attach any RAM.
|
| It would never flush any cache lines to RAM, and never do any
| reads from RAM.
| phonon wrote:
| Oracle can charge $40-$100k+ for EE including options _per
| core_ (times the 0.5 x86 core factor)... and some workloads are
| very cache sensitive. So a high cache, high bandwidth, high
| frequency, high memory capacity 16 core CPU[1] (x2 sockets)
| might be the best bang for their buck for that million-dollar+
| license.
|
| [1]
| https://www.amd.com/en/products/processors/server/epyc/9005-...
| dajonker wrote:
| Surely that's a good reason for Oracle to increase their
| prices even more, leading to a cat-and-mouse game between CPU
| makers and software license sellers.
| immibis wrote:
| Hopefully ending when nobody uses Oracle.
| Tuna-Fish wrote:
| Oh yes, this cat-and-mouse game has been going on for more
| than a decade. But despite that, for any given time and
| license terms, there is a type of CPU that is optimal for
| performance/licensing costs, and when the license is as
| expensive and widely used as it is, it makes sense to sell
| CPUs for that specific purpose.
| chx wrote:
| It is very Oracle that their licensing policy gives CPU makers
| a reason to build crippled CPUs.
| jsheard wrote:
| The topology of that part is wild: it's physically the same
| silicon as the 128-core part, but they've disabled all but one
| core on each compute chiplet. 112 cores are switched off to
| leave just 16 cores with as much cache as possible.
|
| Inter-core latency will be rough, though, since you'll always
| be hitting the relatively slow inter-chiplet bus.
| mistyvales wrote:
| Here I am running a 12 year old Dell PowerEdge with dual Xeons..
| I wonder when the first gen Epyc servers will be cheap fodder on
| eBay.
| assusdan wrote:
| IMO, 1st gen Epyc is not any good given that 2nd gen exists,
| is more popular, and is cheap enough (I actually run an Epyc
| 7302 and an MZ31-AR0 motherboard as a homelab). Too low
| performance per core and NUMA quirks, plus a worse node (2nd
| gen compute dies are 7nm TSMC).
| renewiltord wrote:
| Not worth. Get a 9654 on eBay for $2k plus $1k for a mobo, $7k
| for a full system. Or go the Epyc 7282 route - that's a good
| combo that's easily available.
| ipsum2 wrote:
| They already are, and aren't very good.
| p1necone wrote:
| 1st-3rd gen Epycs can be had super cheap, but the motherboards
| are expensive.
|
| Also not worth getting anything less than 3rd gen unless you're
| primarily buying them for the PCIe lanes and RAM capacity - a
| regular current gen consumer CPU with half to a quarter of the
| core count will outperform them in compute while consuming
| significantly less power.
| justmarc wrote:
| Lots of great second-hand hardware to be had on eBay. Even
| last-gen used CPUs, as well as RAM, at _much_ less than
| retail.
|
| However, when you end up building a server, quite often the
| motherboard + case are the cheap parts, the CPUs are second in
| cost, and the biggest expense can be the RAM.
| jsheard wrote:
| When buying used Epycs you have to contend with them possibly
| being vendor-locked to a specific brand of motherboard as
| well.
|
| https://www.servethehome.com/amd-psb-vendor-locks-epyc-
| cpus-...
| sph wrote:
| They sell this vendor lock-in "feature" as enhanced
| security?
| Uvix wrote:
| Yes. It keeps the board from booting if the firmware is
| replaced with a version not signed by the board
| manufacturer (i.e. so an attacker can't replace it with a
| version that does nefarious things). Preventing CPU reuse
| in other boards is just an (unintentional?) side effect.
| kjs3 wrote:
| The cynic would say the security implications are the
| side effect, since selling more, new chips is the goal.
| Uvix wrote:
| If that was the goal then the CPU would fuse on first
| boot for _any_ manufacturer's board, rather than being
| fused only by Dell boards.
| Tuna-Fish wrote:
| The reason for this is that CPU upgrades on the same board
| were/are very viable on SP3.
|
| Doing that on Intel platforms just wasn't done, basically
| ever; it was never worth it. But upgrading to Milan from
| Naples or Rome is very appealing.
|
| So used SP3 CPUs are much more common than the boards, simply
| because more of them were made. This is probably very bad for
| hobbyists; the boards are not going to get cheap until the
| entire platform is obsolete.
| swarnie wrote:
| Unsure about the Epyc chips, but Ryzen 5000 series kit was
| being given away on Amazon this week...
|
| I snagged a Ryzen 9 5950X for £242.
| kombine wrote:
| Thanks for pointing it out; it's still up there for £253. I
| might consider upgrading my 8-core 5800X3D.
| smolder wrote:
| Whether that's an upgrade depends on your use case, as the
| X3D has more cache.
| kombine wrote:
| I don't play games so the X3D's cache doesn't really
| benefit me. 5950X should speed up compilation, but then,
| I mostly do Python at the moment :)
| taneq wrote:
| Haha same, and it's perfectly capable of anything a smallish
| company would need for general on-prem hosting.
| elric wrote:
| At 1:11 in the video, there's a chart of the TDP (which I looked
| for in the text but couldn't find). At 125-500W, these things run
| very hot.
| bunabhucan wrote:
| https://www.servethehome.com/wp-content/uploads/2024/10/AMD-...
|
| Has the full range with TDPs. 500W is only for the 128/192
| core monster chips. The 16 core fast SKU has a 320W TDP.
| Sakos wrote:
| The 7950X3D is rated at 120W TDP. 320W seems quite high.
| adgjlsfhk1 wrote:
| That's the processor you're supposed to buy only if you are
| paying $10,000 per core per year for some ridiculously
| expensive enterprise software. The extra power comes from
| all the extra memory channels and cache.
| icegreentea2 wrote:
| It is high, but it probably can sustain a much higher all-core
| frequency compared to the 7950X3D (or 7950X). If nothing
| else, it has a bigger die and heat spreader to pull heat
| from, so it should be easier to maintain thermal headroom on
| the EPYC chip.
|
| That being said, it looks most probable that a 9175F is
| just a 9755 (their absolute max full-size Zen 5 core part)
| with nearly all (7/8) of the cores per CCD disabled in
| order to get all 512MB of cache. This means that there's a
| lot of extra interconnect being kept running per core, which
| certainly would contribute to the higher TDP.
|
| Of course, in principle this should also mean that each
| core (which should basically have all of a CCD's IO and
| cache available to it) should be screaming fast in real
| terms.
|
| Of course, finally, TDP is a totally fluffy number. The TDP
| of the 7950X3D is most probably as low as it is because of
| how much internal thermal resistance (the extra V-Cache
| layer) it has. Part of its lower TDP isn't because "it's
| efficient"; part of it is because "if we run it hotter,
| we'll melt it". The TDP for the 7950X, for example, is 170W.
| ticoombs wrote:
| > 500W is only for... 128/192 chips. The 16 core fast SKU has
| a 320W TDP.
|
| When you think about it, 180W more for 7x the threads is
| amazing.
| masklinn wrote:
| It's not new that high frequencies require higher power.
|
| The base clock falls by half between the fastest and the
| widest chips.
| bjoli wrote:
| That depends on the size of the processor, surely.
|
| Socket SP5 is more than 3x the area of AM5.
| jmb99 wrote:
| Doubtful; the 350W Threadripper parts don't run particularly
| hot with normal desktop cooling. I've overclocked a 7975WX with
| an unrestricted power limit and could dissipate over 800W
| while keeping it below 90C (admittedly, with water cooling).
| 500W with server cooling (super high RPM forced air) shouldn't
| be a problem.
| justmarc wrote:
| You just need one unit of these in dual socket config per room
| of your house, and you're sorted for the winter (if you live
| somewhere cold).
| formerly_proven wrote:
| The cores are spread out over roughly 1200 mm2 of silicon and
| the IO die seems to have grown again (maybe 500ish mm2?). So at
| just 0.3 W/mm2 this is pretty cozy. The desktop parts have 3x
| higher power density.
| bob1029 wrote:
| I'd like to see the 9965 in action. These parts are crazy. Will
| definitely be looking to buy a machine from this generation.
|
| https://www.amd.com/en/products/processors/server/epyc/9005-...
| Salgat wrote:
| I wonder how this compares to the 7950x3d. So much cache and a
| high boost clock.
| https://www.amd.com/en/products/processors/server/epyc/9005-...
| Tuna-Fish wrote:
| Great if you have 16 independent workloads, terrible for
| things that care about communication between threads.
|
| It has 16 CCDs, each with only one core enabled, and latency
| between CCDs is ~150ns.
| justmarc wrote:
| Surprise surprise, not every tool is right for every job.
| menaerus wrote:
| Not sure if this comment was meant to come out as snarky, but
| the parent rightly pointed out the not-so-obvious design of
| EPYC CPUs. A CCD is NUMA in disguise.
| smolder wrote:
| 150ns is actually surprisingly high! I didn't realize it
| was so bad. That's about 2-3x as much latency as fetching
| from DRAM, based on what I see in people's AIDA64 results.
| justmarc wrote:
| Truly mind boggling scale.
|
| Twenty years ago we had just 1-2 cores per CPU, so we were lucky
| to have 4 cores in a dual socket server.
|
| A single server can now have almost 400 cores. Yes, we can have
| even more ARM cores but they don't perform as well as these do,
| at least for now.
| zer00eyz wrote:
| 700+ threads over 2 cores, able to saturate two 400GbE NICs,
| 500 watts per chip (less than 2 watts per thread)... All of
| that in a 2U package.... 20 years ago that would have been
| racks of gear.
| smolder wrote:
| > 700+ threads over 2 cores
|
| I assume you mean 2 sockets.
| jmrm wrote:
| I think those roughly 2 watts per thread are a lot more
| important than us home users usually realize. Having to
| deliver less power and dissipate fewer watts as heat in a
| data centre is really good news for operating costs, which
| are usually a lot bigger than the purchase cost of the
| servers.
| justmarc wrote:
| With these CPUs one can definitely hit much higher rates.
|
| 800Gbps from a single server was achieved by Netflix on much
| lesser systems two years ago:
|
| https://nabstreamingsummit.com/wp-
| content/uploads/2022/05/20...
|
| If I were to guess, this hardware can do double that; it also
| helps that we now have actual 800Gbps Ethernet hardware.
|
| Indeed, 20 years ago this would have been racks of gear at a
| very high cost and with a huge power bill.
| immibis wrote:
| That's one hell of a pixelflut server.
| chx wrote:
| Indeed: the first dual core server chips only launched in 2005
| afaik, with the 90nm Denmark/Italy/Egypt Opterons and Paxville
| Xeons, but on the Intel side it wasn't until 2007 that they
| were in full swing.
| p_l wrote:
| The first dual core server chips showed up generally available
| in 2001 with the IBM POWER4, then HP PA-RISC ones in 2004, and
| then Opterons, which were followed by the "emergency" design of
| essentially two "sockets" on one die in NetBurst dual core
| systems.
| chx wrote:
| Ehhhh, the MAJC 5200 was generally available in 1999, and I am
| sure even older designs could be found if we were to dig deep
| enough. Their market share would need some digging too.
|
| To quote the announcement: "two VLIW (very long instruction
| word) microprocessors on a single piece of silicon"
| formerly_proven wrote:
| Power and PA-RISC shipped servers, though. MAJC on the
| other hand
|
| > Sun built a single model of the MAJC, the two-core MAJC
| 5200, which was the heart of Sun's XVR-1000 and XVR-4000
| workstation graphics boards.
| kristianp wrote:
| MAJC: https://en.m.wikipedia.org/wiki/MAJC
|
| Why have I not heard of those before?
| chx wrote:
| Also, I raised the question at
| https://retrocomputing.stackexchange.com/q/30743/3722 and
| one of the answers points out the 1984 Rockwell R65C29 Dual
| CMOS Microprocessor. It was two standard 6502s on the same
| die using the same bus to access the same memory... and of
| course IBM mainframes did it decades before.
| kjs3 wrote:
| If we're going that direction, National Semiconductor had
| a 2 'core' COPS4 processor in 1981[1]. I have some in a
| tube somewhere (unused).
|
| [1] https://www.cpushack.com/2014/08/25/national-
| semiconductor-c...
| chx wrote:
| Yes, Retro SE also points out the Intel 8271 from 1977
| was a dual core microcontroller.
| kjs3 wrote:
| Depends on your definition. The 8271 wasn't programmable
| by anyone but Intel (at least, they never made that a
| market option), and the second core was more of a bit-
| oriented coprocessor, sorta like saying the 80486 is a
| 2-core processor because of the FPU.
| RobinL wrote:
| I wonder what percentage of 'big data' jobs that run in
| clusters would now be far faster on a single big machine with
| e.g. duckdb rather than spark
| conjecTech wrote:
| The difference in throughput for local versus distributed
| orchestration would mainly come from serdes, networking, and
| switching. Serdes can be substantial. Networking and
| switching have been aggressively offloaded from the CPU
| through better hardware support.
|
| Individual tasks would definitely have better latency, but
| I'd suspect the impact on throughput/CPU usage might be
| muted. Of course at the extremes (very small jobs, very
| large/complex objects being passed) you'd see big gains.
| mtremsal wrote:
| Would you mind expanding on how SerDes become a bottleneck?
| I'm not familiar and reading the Wikipedia article wasn't
| enough to connect the dots.
| RobinL wrote:
| By way of a single example, we've been migrating recently
| from Spark to DuckDB. Our jobs are not huge, but too big
| for a single 'normal' machine. We've gone from a 2.5 hour
| runtime on a cluster of 10 machines (40 vCPU total) to a 15
| minute runtime on a single 32 vCPU machine. I don't know for
| sure, but I think this is largely because it eliminates
| expensive shuffles and serde. Obviously results vary hugely
| depending on workload, and some jobs are simply too big
| even for a 192 core machine. But I suspect a high
| proportion of workloads would be better run on single large
| machines nowadays.
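|
| For flavor, the single-machine version of one of these jobs is
| little more than the following sketch (paths and columns are
| hypothetical; DuckDB parallelizes across local cores on its
| own, with no shuffle or network serde):
|
|     import duckdb
|
|     # Aggregate a directory of Parquet files on one machine.
|     con = duckdb.connect()
|     result = con.sql("""
|         SELECT customer_id, count(*) AS n, sum(amount) AS total
|         FROM read_parquet('events/*.parquet')
|         GROUP BY customer_id
|     """).df()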
| semi-extrinsic wrote:
| Essentially all, I would guess. But scheduling jobs and
| moving data in and out of a single big machine can become a
| huge bottleneck.
| justmarc wrote:
| I often think about huge, fancy cloud setups literally
| costing silly money to run, being replaced by a single beast
| of a machine powered by a modern, high core count CPU (say
| 48+), lots of RAM and lots of high performance enterprise-
| grade SSD storage.
| cjbprime wrote:
| (Oftentimes part of the reason the huge, fancy cloud setup
| costs more is that any component can fail, all the way down
| to the region-level, without loss of service.)
| justmarc wrote:
| And oftentimes that loss of service, if temporary, is not
| all that painful -- it really depends on the exact
| needs/use case/scenario.
|
| That said, sometimes The Cloud goes down too.
| varispeed wrote:
| Nowadays most services can fit on a single server and serve
| millions of users a day. I wonder how this will affect overly
| expensive cloud services, when you can rent a beefy dedicated
| server for under a grand and make tens of thousands in savings
| (enough to hire a full-time administrator with plenty of money
| left for other things).
| Dylan16807 wrote:
| On the other hand, at the time we would have expected twenty
| years of progress to make the cores a thousand times faster.
| Instead that number is more like 5x.
| speedgoose wrote:
| I'm looking forward to deploying AMD Turin bare metal servers
| on Hetzner. The previous generations were already great value,
| but this seems a step above.
| nickpp wrote:
| Just in time for Factorio 2.0.
| smolder wrote:
| The weirdest one of the bunch is the AMD EPYC 9175F: 16 cores
| with 512MB of L3 cache! Presumably this is for customers trying
| to minimize software costs that are based on "per-core"
| licensing. It really doesn't make much sense to have so few cores
| at such an expense, otherwise. Does Oracle still use this style
| of licensing? If so, they need to knock it off.
|
| The only other thing I can think of is some purpose like HFT may
| need to fit a whole algorithm in L3 for absolute minimum latency,
| and maybe they want only the best core in each chiplet? It's
| probably about software licenses, though.
| londons_explore wrote:
| Plenty of applications are single threaded, and it's cheaper
| to spend thousands on a super fast CPU to run one as fast as
| possible than to spend tens of thousands on a programmer to
| rewrite the code to be more parallel.
|
| And like you say, plenty of times it is infeasible to rewrite
| the code because it's third-party code for which you don't
| have the source or the rights.
| puzzlingcaptcha wrote:
| Windows Server licensing starts at 16 cores.
| RHab wrote:
| Abaqus, for example, is licensed by core. I am severely
| limited, so for me this makes total sense.
| bob1029 wrote:
| Another good example is any kind of discrete event simulation.
| Things like spiking neural networks are inherently single
| threaded if you are simulating them accurately (i.e.,
| serialized through the pending spike queue). Being able to keep
| all the state in local cache and picking the fastest core to do
| the job is the best possible arrangement. The ability to run 16
| in parallel simply reduces the search space by the same factor.
| Worrying about inter-CCD latency isn't a thing for these kinds
| of problems. The amount of bandwidth between cores is minimal,
| even if we were doing something like a genetic algorithm with
| periodic crossover between physical cores.
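|
| The serialization is easiest to see in the core loop; a
| minimal sketch (event types and constants are made up, the
| point is the single globally time-ordered queue):
|
|     import heapq
|
|     # (time, kind, neuron_id); correctness requires popping
|     # events in global time order, which keeps this loop serial.
|     events = [(0.05, "spike", 7), (0.10, "spike", 42)]
|     heapq.heapify(events)
|
|     while events:
|         t, kind, neuron = heapq.heappop(events)
|         # A handler may schedule future events:
|         if kind == "spike" and t < 1.0:
|             heapq.heappush(events, (t + 0.01, "spike", neuron))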
| Jestzer wrote:
| MATLAB Parallel Server also does per-core licensing.
|
| https://www.mathworks.com/products/matlab-parallel-
| server/li....
| forinti wrote:
| You can pin your workload to specific cores and so stay within
| your contract with Oracle.
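|
| On Linux that's just CPU affinity, either taskset on the
| command line or, for example, from Python (a sketch; which CPU
| numbers map to which physical cores is machine-specific):
|
|     import os
|
|     # Restrict this process (pid 0 = self, later children
|     # inherit it) to the first 16 logical CPUs. Linux-only.
|     os.sched_setaffinity(0, set(range(16)))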
| heraldgeezer wrote:
| Windows Server and MSSQL are per-core now. A lot of enterprise
| software is. Vendors switched to per-core licensing because
| before, it was based on CPU sockets. Not just Oracle.
| bee_rider wrote:
| 512 MB of cache, wow.
|
| A couple years ago I noticed that some Xeons I was using had as
| much cache as the RAM in the systems I had growing up
| (millennial, so we're not talking about ancient Commodores or
| whatever; real usable computers that could play Quake and
| everything).
|
| But 512MB? That's _roomy._ Could Puppy Linux just be held
| entirely in L3 cache?
| hedora wrote:
| I wonder if you can boot it without populating any DRAM
| sockets.
| bee_rider wrote:
| I would be pretty curious about such a system. Or, maybe
| more practically, it might be interesting to have a system
| that pretends the L3 cache is RAM, and the RAM is the hard
| drive (in particular, RAM could disguise itself as the swap
| partition, so the OS would treat it as basically a chunk of
| RAM that it would rather not use).
| compressedgas wrote:
| Philip Machanick's RAMpage! (ca. 2000)
|
| > The RAMpage memory hierarchy is an alternative to a
| conventional cache-based hierarchy, in which the lowest-
| level cache is managed as a paged memory, and DRAM
| becomes a paging device.
| afr0ck wrote:
| So, essentially, you're just doing cache eviction in
| software. That's obviously a lot of overhead, but at
| least it gives you eviction control. However, there is
| very little to do when it comes to cache eviction. The
| algorithms are all well known and there is little
| innovation in that space. So baking that into the
| hardware is always better, for now.
| edward28 wrote:
| Intel had such a CPU in the previous gen, the Xeon Max, with
| up to 64GB of HBM on package. It can use it as cache or just
| as memory.
| lewurm wrote:
| Firmware uses cache-as-RAM (e.g.
| https://www.coreboot.org/images/6/6c/LBCar.pdf) to do early
| init, like DRAM training. I guess later stages in the boot
| chain probably rely on DRAM being set up, though.
| zamadatix wrote:
| CCDs can't access each other's L3 cache as their own (the
| fabric penalty is too high to do that directly). Assuming it's
| anything like the 9174F, that means it's really 8 groups of 2
| cores that each have 64 MB of L3 cache. Still enormous, and
| you can still access data over the Infinity Fabric with
| penalties, but not quite the single block of 512 MB of cache
| shared by a 16 core block that it might sound like at first.
|
| Zen 4 also had 96 MB per CCD variants like the 9184X, so 768
| MB per socket, and they are dual socket, so you can end up
| with 1.5 GB of total L3 cache in a single machine! The
| downside being that beyond CCD<->CCD latencies you now have
| socket<->socket latencies.
| edward28 wrote:
| It's actually 16 CCDs with a single core and 32MB each.
| Aurornis wrote:
| Many algorithms are limited by memory bandwidth. On my 16-core
| workstation I've run several workloads that hit peak
| performance with fewer than 16 threads.
|
| It's common practice to test algorithms with different numbers
| of threads and then use the optimal number of threads. For
| memory-intensive algorithms the peak performance frequently
| comes in at a relatively small number of cores.
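|
| The sweep itself is trivial to script; a minimal sketch in
| Python (the work function is a made-up stand-in for one shard
| of a real workload, and processes are used to sidestep the
| GIL):
|
|     import time
|     from concurrent.futures import ProcessPoolExecutor
|
|     def work(seed: int) -> int:
|         acc = 0
|         for i in range(2_000_000):
|             acc = (acc + i * seed) % 1_000_003
|         return acc
|
|     def bench(workers: int, tasks: int = 32) -> float:
|         start = time.perf_counter()
|         with ProcessPoolExecutor(max_workers=workers) as pool:
|             list(pool.map(work, range(tasks)))
|         return time.perf_counter() - start
|
|     if __name__ == "__main__":
|         # Sweep worker counts and keep whichever is fastest.
|         for n in (1, 2, 4, 8, 12, 16):
|             print(f"{n:2d} workers: {bench(n):.2f}s")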
| CraigJPerry wrote:
| Is this because of NUMA, or L2 cache, or something entirely
| different?
|
| I worked on high perf around 10 years ago, and at that point I
| would pin the OS and interrupt handling to a specific core, so
| I'd always lose one core. Testing led me to disable
| hyperthreading in our particular use case, so that was
| "cores" (really threads) halved.
|
| A colleague had a nifty trick built on top of Solarflare zero
| copy, but at that time it required fairly intrusive kernel
| changes, which never totally sat well with me, and again I'd
| lose a 2nd core to some bookkeeping code that orchestrated
| that.
|
| I'd then taskset the app to the other cores.
|
| NUMA was a thing by then, so it really wasn't straightforward
| to eke out maximum performance. It became somewhat of a
| competition to see who could get the highest throughput, but
| usually those configurations were unusable due to
| unacceptable p99 latencies.
| afr0ck wrote:
| NUMA gives you more bandwidth at the expense of higher
| latency (if not managed properly).
| yusyusyus wrote:
| New VMware licensing is per-core.
| bytepursuits wrote:
| When is 8005 coming?
| thelastparadise wrote:
| I wonder how LLM performance is on the higher core counts?
|
| With recent DDR generations and many core CPUs, perhaps CPUs will
| give GPUs a run for their money.
| kolbe wrote:
| The H100 has 16,000 CUDA cores at 1.2GHz. My rough calculation
| is it can handle 230k concurrent calculations. Whereas a 192
| core AVX-512 chip (assuming it calculates on 16-bit data) can
| handle 6k concurrent calculations at 4x the frequency. So,
| about a 10x difference just on compute, not to mention that
| memory bandwidth is an even stronger advantage for GPUs.
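|
| The lane math, as a sketch (the 230k GPU figure is my own
| rough estimate above, taken as given here):
|
|     lanes_per_core = 512 // 16             # fp16 lanes per AVX-512 vector
|     cpu_concurrent = 192 * lanes_per_core  # ~6k concurrent fp16 lanes
|     gpu_concurrent = 230_000               # H100 estimate from above
|     freq_ratio = 4                         # ~4.8GHz CPU vs ~1.2GHz GPU
|     print(gpu_concurrent / (cpu_concurrent * freq_ratio))  # ~9.4, so ~10x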
| TacticalCoder wrote:
| > The system we had access to was running 6000MT/s for its
| memory, and DDR5-6000 MT/s is what most systems will support in a
| 1 DIMM per channel configuration. Should you want to run 2 DIMMs
| per channel, then your memory speeds drop to 4400 MT/s; and if
| you run 1 DIMM per channel in a motherboard with 2 DIMMs per
| channel then expect 5200 MT/s for your memory speed.
|
| Is this all ECC memory at these speeds?
| wmf wrote:
| Yes, servers only use ECC RAM.
| aurareturn wrote:
| Phoronix recently reviewed the 192 core Turin Dense against the
| AmpereOne 192 core.
|
| * Ampere MSRP $5.5K vs $15K for the EPYC.
|
| * Turin 192 had 1.6x better performance.
|
| * Ampere had 1.2x better energy consumption.
|
| In terms of actual $/perf, the Ampere 192 core is 1.7x better
| than the Turin Dense 192 core based on Phoronix's review.
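|
| The 1.7x figure is just a quick perf-per-dollar ratio, using
| the review numbers above:
|
|     turin_perf, turin_cost = 1.6, 15_000
|     ampere_perf, ampere_cost = 1.0, 5_500
|     ratio = (ampere_perf / ampere_cost) / (turin_perf / turin_cost)
|     print(round(ratio, 2))  # ~1.7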
|
| So for $5.5k, you can either buy an AmpereOne 192 core CPU
| (274W) or a Turin Dense 48 core CPU (300W).
|
| Ampere has a 256 core, 3nm, 12 memory channel part shipping
| next year that is likely to better challenge Turin Dense and
| Sierra Forest in terms of raw performance. For now, their
| value proposition is $/perf.
|
| Anyway, I'm very interested in how Qualcomm's Nuvia-based server
| chips will perform. Also, if ARM's client core improvements are
| any indication, I will be very interested in how in-house chips
| like AWS Graviton, Google Axion, Microsoft Cobalt, Nvidia Grace,
| Alibaba Yitian will compete with better Neoverse cores. Nuvia vs
| ARM vs AmpereOne.
|
| This is probably the golden age of server CPUs. 7 years ago, it
| was only Intel's Xeon. Now you have numerous options.
| KingOfCoders wrote:
| The difference is, you can get EPYC CPUs but you can't get hold
| of Ampere CPUs.
| tpurves wrote:
| AMD also wins on perf/watt, which is pretty notable for anyone
| who still believed that x86 could never challenge ARM/RISC in
| efficiency. These days, a lot of data centers are also more
| limited by available watts (and the associated cooling), which
| bodes well for Turin.
| stzsch wrote:
| For those who dislike their change to Substack, there is
| https://old.chipsandcheese.com/2024/10/11/amds-turin-5th-gen....
|
| At least for now.
___________________________________________________________________
(page generated 2024-10-12 23:01 UTC)