[HN Gopher] AMD's Turin: 5th Gen EPYC Launched
       ___________________________________________________________________
        
       AMD's Turin: 5th Gen EPYC Launched
        
       Author : zdw
       Score  : 294 points
       Date   : 2024-10-12 00:22 UTC (22 hours ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | bitwize wrote:
       | > Apparently we now think 64 cores is 'lower core count'. What a
       | world we live in.
       | 
       | 64 cores is a high-end gaming rig. Civilization VII won't run
       | smoothly on less than 16.
        
         | noncoml wrote:
         | Civilization VII won't run smoothly.
         | 
          | Only recently did I manage to build a PC that will run Civ 6
          | smoothly during the late game on a huge map.
        
           | 7thpower wrote:
           | What are the specs?
           | 
            | Tangentially related, but I need to go check A18 Civ 6
            | benchmarks. The experience on my A15 with small map sizes
            | was surprisingly good.
        
             | noncoml wrote:
              | It's not the latest and greatest, 12900K + 64GB DDR4. But
              | even when the 12900K came out (2021), Civ 6 was already 5
              | years old.
        
         | zamadatix wrote:
         | Civ 6 really doesn't utilize cores as much as one would think.
         | I mean it'll spread the load across a lot of threads, sure, but
         | it never seems to actually... use them much? E.g. I just ran
         | the Gathering Storm expansion AI benchmark (late game map
         | completely full of civs and units - basically worst case for
         | CPU requirements and best case for eating up the multicore
         | performance) on a 7950X 16 core CPU and it rarely peaked over
         | 30% utilization, often averaging ~25%. 30% utilization means a
         | 6 core part (barring frequency/cache differences) should be
         | able to eat that at 80% load.
         | 
         | https://i.imgur.com/YlJFu4s.png
         | 
         | Whether the bottleneck is memory bandwidth (2x6000 MHz),
         | unoptimized locking, small batch sizes, or something else it
         | doesn't seem to be related to core count. It's also not waiting
         | on the GPU much here, the 4090 is seeing even less utilization
          | than the CPU. Hopefully utilization actually scales better
          | with Civ 7, not just splits up a lot.
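
A quick sanity check of the arithmetic in the comment above, as a Python sketch. It assumes utilization translates linearly across core counts, ignoring the frequency and cache differences the commenter already flags:

```python
# Back-of-the-envelope: 30% average utilization on a 16-core part equals
# 4.8 "fully busy" core-equivalents, which a 6-core part could in
# principle cover at 80% load. Purely illustrative.

def cores_of_work(n_cores: int, utilization: float) -> float:
    """Average number of fully-busy core-equivalents."""
    return n_cores * utilization

work = cores_of_work(16, 0.30)   # ~4.8 core-equivalents on the 7950X
load_on_6 = work / 6             # the same work spread over 6 cores
print(work, load_on_6)           # ~4.8, ~0.8
```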
        
           | lukeschlather wrote:
           | > 16 core CPU and it rarely peaked over 30% utilization,
           | often averaging ~25%. 30% utilization means a 6 core part
           | (barring frequency/cache differences) should be able to eat
           | that at 80% load.
           | 
           | As a rule I wouldn't be surprised if 90% of the stuff Civ 6
           | is doing can't be parallelized at all, but then for that
           | remaining 10% you get a 16x speedup with 16 cores. And
           | they're underutilized on average but there are bursts where
           | you get a measurable speedup from having 16 cores, and that
           | speedup is strictly linear with the number of cores. 6 cores
           | means that remaining 10% will be less than half as fast vs.
           | having 16 cores. And this is consistent with observing 30%
           | CPU usage I think.
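
The model sketched above is essentially Amdahl's law. A minimal illustration (the 10% parallel fraction is the commenter's assumption, not a measurement of Civ 6):

```python
# Amdahl's law: if a fraction p of the work parallelizes perfectly and
# the remaining (1 - p) is serial, the speedup on n cores is
# 1 / ((1 - p) + p / n).

def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

p = 0.10  # assumed parallelizable fraction
for n in (1, 6, 16):
    print(n, round(amdahl_speedup(p, n), 4))

# The parallel 10% itself runs n times faster during bursts, yet overall
# speedup tops out near 1.11x -- consistent with low average utilization.
```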
        
             | colechristensen wrote:
             | My rule is more like I'd be willing to bet even odds that
             | this could be sped up 100x with the right programmers
             | focused on performance. When you lack expertise and things
             | work "well enough" that's what you get. Same for games or
             | enterprise software.
        
               | squarefoot wrote:
               | That's what we get in a market dominated more by the need
               | to release before the competition rather than taking some
               | time to optimize software. If it's slow, one can still
               | blame the iron and users who don't upgrade it.
        
             | oivey wrote:
             | If only 10% of the workload can be parallelized, then the
             | best case parallelization speed up is only 10%. That
             | doesn't line up with the GP's claim that Civ6 benefits from
             | more cores.
        
               | Uvix wrote:
               | They referenced the upcoming Civ7 (which does include
               | 16-core chips on their highest recommended specs), not
               | Civ6.
        
         | gkhartman wrote:
         | I can't help but think that this sounds more like a failure to
         | optimize at the software level rather than a reasonable
         | hardware limitation.
        
           | cubefox wrote:
           | That's the usual case when vid3o g4mes are "CPU limited". One
            | just has to look at whether the game does anything high-level
           | that other games didn't do 10 years ago. Reasonable hardware
           | limitations related to the CPU have normally to do with
           | complex physics effects or unusually large crowds of NPCs.
           | (Many games are CPU limited even for fairly small crowds
           | because their engine isn't optimized for that purpose.)
        
             | deaddodo wrote:
             | > vid3o g4mes
             | 
             | Why do you type it like that?
        
               | immibis wrote:
               | Usually people do this to avoid automatic censorship
               | systems. HN certainly has censorship in place, but I'm
               | not aware of any that targets discussion of video games.
        
               | cubefox wrote:
               | Old habit. I'm a kid of the 1990s, and we were convinced
               | there wasn't anything cooler than a) video games and b)
                | replacing letters with numbers. In retrospect, we might
               | have been a little biased.
        
               | deaddodo wrote:
               | Well, just for future reference; if you're a kid of the
               | 90s, you're well into your 30s now.
               | 
               | It's weird/juvenile to be typing a phrase in a manner
               | similar to a preteen, well after 20+ years have passed.
               | Especially in the middle of an otherwise normal
               | message/conversation.
        
               | bitwize wrote:
               | We thought that was dorky, skiddie shit even back in the
               | 90s. It was even more stale by the time Piro and Largo
               | turned it into a personality trait.
               | 
               | Though I think this guy just did it the way a 2000s kid
               | would say "vidya gaems".
        
         | treesciencebot wrote:
         | all high end "gaming" rigs are either using ~16 real cores or
         | 8:24 performance/efficiency cores these days.
         | threadripper/other HEDT options are not particularly good at
         | gaming due to (relatively) lower clock speed / inter-CCD
         | latencies.
        
         | csomar wrote:
          | If Civ 6 is any guide, 64 vs 32 won't make the slightest
          | difference. The next-turn calculations seem to run on a single
          | core, and thus having more cores is not going to change a
          | thing. This is a software problem; they need to distribute the
          | calculation over several cores.
        
         | fulafel wrote:
         | As the GPGPU scene trajectory seems dismal[1] for the
         | foreseeable future wrt the DX, this seems like the best hope.
         | 
         | [1] Fragmentation, at best C++ dialects, no practical compiler
         | tech to transparently GPU offload, etc
        
         | snvzz wrote:
         | civ6's slowness is purely bad programming. No excuses to be
         | had.
        
           | Pet_Ant wrote:
           | [citation needed]
        
       | dragontamer wrote:
       | ChipsAndCheese is one of the few new tech publications that
       | really knows what they are talking about, especially with these
       | deep dive benchmarks.
       | 
       | With the loss of Anandtech, TechReport, HardCOP and other old
       | technical sites, I'm glad to see a new publisher who can keep up
       | with the older style stuff.
        
         | mongol wrote:
         | Interestingly, Slashdot originated from a site called "Chips &
          | Dips". Similar inspiration?
        
         | tandr wrote:
         | Did you mean to say HardOCP?
        
         | kderbe wrote:
          | Chips and Cheese most reminds me of the long-gone LostCircuits.
         | Most tech sites focus on the slate of application benchmarks,
         | but C&C writes, and LC wrote, long form articles about
         | architecture, combined with subsystem micro-benchmarks.
        
       | jeffbee wrote:
       | The part with only 16 cores but 512MB L3 cache ... that must be
       | for some specific workload.
        
         | addaon wrote:
         | Does anyone know if modern AMD chips allow mapping the L3 cache
         | and using it as TCM instead of cache? I know older non-X86
         | processors supported this (and often booted into that mode so
         | that the memory controllers could be brought up), but not sure
         | if it's possible today. If so, that would sure make for some
         | interesting embedded use cases for a large DRAM-less system...
        
           | bpye wrote:
           | The coreboot docs claim that modern AMD parts no longer
           | support cache-as-RAM.
           | 
           | https://doc.coreboot.org/soc/amd/family17h.html
        
             | SV_BubbleTime wrote:
             | Lame.
             | 
              | Using it as TCM RAM seems super useful.
             | 
             | Although you would need to fight/request it from the OS, so
             | technically I see why they might ditch it.
        
             | hales wrote:
             | Wow, thanks for the link, I had no idea:
             | 
             | > AMD has ported early AGESA features to the PSP, which now
             | discovers, enables and trains DRAM. Unlike any other x86
             | device in coreboot, a Picasso system has DRAM online prior
             | to the first instruction fetch.
             | 
             | Perhaps they saw badly trained RAM as a security flaw? Or
             | maybe doing it with the coprocessor helped them distribute
             | the training code more easily (I heard a rumour once that
             | RAM training algos are heavily patented? Might have
             | imagined it).
        
           | londons_explore wrote:
           | If you keep your working set small enough, you should be able
           | to tell the CPU it has RAM attached, but never actually
           | attach any RAM.
           | 
           | It would never flush any cache lines to RAM, and never do any
           | reads from RAM.
        
         | phonon wrote:
         | Oracle can charge $40-$100k+ for EE including options _per
         | core_ (times .5)...and some workloads are very cache sensitive.
         | So a high cache, high bandwidth, high frequency, high memory
         | capacity 16 core CPU[1] (x2 socket) might be the best bang for
         | their buck for that million dollar+ license.
         | 
         | [1]
         | https://www.amd.com/en/products/processors/server/epyc/9005-...
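
Rough math behind the comment above, as a sketch. The $75k/core figure is a midpoint of the commenter's ballpark, not an Oracle list price, and the 0.5 is the x86 core factor the commenter mentions:

```python
# Per-core licensing: licensed cores = physical cores * core factor.
# Prices here are illustrative only.

def license_cost(physical_cores: int, per_core_price: float,
                 core_factor: float = 0.5) -> float:
    return physical_cores * core_factor * per_core_price

cores = 2 * 16                        # dual-socket 16-core EPYC 9175F
print(license_cost(cores, 75_000))    # 1,200,000 -> "million dollar+"
```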
        
           | dajonker wrote:
           | Surely that's a good reason for Oracle to increase their
           | prices even more, leading to a cat-and-mouse game between CPU
           | makers and software license sellers.
        
             | immibis wrote:
             | Hopefully ending when nobody uses Oracle.
        
             | Tuna-Fish wrote:
             | Oh yes, this cat-and-mouse game has been going on for more
             | than a decade. But despite that, for any given time and
             | license terms, there is a type of CPU that is optimal for
             | performance/licensing costs, and when the license is as
             | expensive and widely used as it is, it makes sense to sell
             | CPUs for that specific purpose.
        
           | chx wrote:
           | It is very Oracle that their license policy gives a reason to
           | make crippled CPUs.
        
         | jsheard wrote:
         | The topology of that part is wild, it's physically the same
         | silicon as the 128-core part but they've disabled all but one
         | core on each compute chiplet. 112 cores are switched off to
         | leave just 16 cores with as much cache as possible.
         | 
         | Inter-core latency will be rough since you'll always be hitting
         | the relatively slow inter-chiplet bus, though.
        
       | mistyvales wrote:
       | Here I am running a 12 year old Dell PowerEdge with dual Xeons..
       | I wonder when the first gen Epyc servers will be cheap fodder on
       | eBay.
        
         | assusdan wrote:
          | IMO, 1st gen Epyc isn't worth it, given that 2nd gen exists,
          | is more popular, and is cheap enough (I actually run an Epyc
          | 7302 and an MZ31-AR0 motherboard as a homelab). Too little
          | performance per core, NUMA quirks, plus a worse node (2nd gen
          | compute is 7nm TSMC).
        
         | renewiltord wrote:
          | Not worth it. Get a 9654 on eBay for $2k plus $1k for a mobo;
          | $7k for a full system. Or go the Epyc 7282 route - that's a
          | good combo that's easily available.
        
         | ipsum2 wrote:
         | They already are, and aren't very good.
        
         | p1necone wrote:
         | 1-3rd gen Epycs can be had super cheap, but the motherboards
         | are expensive.
         | 
         | Also not worth getting anything less than 3rd gen unless you're
          | primarily buying them for the PCIe lanes and RAM capacity - a
          | regular current-gen consumer CPU with half to a quarter of the
         | core count will outperform them in compute while consuming
         | significantly less power.
        
           | justmarc wrote:
           | Lots of great second hand hardware to be had on ebay. Even
           | last gen used CPUs, as well as RAM, at _much_ less than
           | retail.
           | 
           | However when you end up building a server quite often the
           | motherboard + case is the cheap stuff, the CPUs are second in
           | cost and the biggest expense can be the RAM.
        
           | jsheard wrote:
           | When buying used Epycs you have to contend with them possibly
           | being vendor-locked to a specific brand of motherboard as
           | well.
           | 
           | https://www.servethehome.com/amd-psb-vendor-locks-epyc-
           | cpus-...
        
             | sph wrote:
             | They sell this vendor lock-in "feature" as enhanced
             | security?
        
               | Uvix wrote:
               | Yes. It keeps the board from booting if the firmware is
               | replaced with a version not signed by the board
               | manufacturer (i.e. so an attacker can't replace it with a
               | version that does nefarious things). Preventing CPU reuse
               | in other boards is just an (unintentional?) side effect.
        
               | kjs3 wrote:
               | The cynic would say the security implications are the
               | side effect, since selling more, new chips is the goal.
        
               | Uvix wrote:
               | If that was the goal then the CPU would fuse on first
               | boot for _any_ manufacturer's board, rather than being
               | fused only by Dell boards.
        
           | Tuna-Fish wrote:
           | The reason for this is that CPU upgrades on the same board
           | were/are very viable on SP3.
           | 
           | Doing that on Intel platforms just wasn't done for basically
           | ever, it was never worth it. But upgrade to Milan from Naples
           | or Rome is very appealing.
           | 
           | So SP3 CPUs are much more common used than the boards, simply
           | because more of them were made. This is probably very bad for
           | hobbyists, the boards are not going to get cheap until the
           | entire platform is obsolete.
        
         | swarnie wrote:
         | Unsure about the Epyc chips but Ryzen 5 series kit was being
         | given away on Amazon in the week...
         | 
          | I snagged a Ryzen 9 5950X for £242
        
           | kombine wrote:
            | Thanks for pointing that out, it's still up there for £253 -
            | I might consider upgrading my 8-core 5800X3D.
        
             | smolder wrote:
             | Whether that's an upgrade depends on your use case, as the
             | X3D has more cache.
        
               | kombine wrote:
               | I don't play games so the X3D's cache doesn't really
               | benefit me. 5950X should speed up compilation, but then,
               | I mostly do Python at the moment :)
        
         | taneq wrote:
         | Haha same, and it's perfectly capable of anything a smallish
         | company would need for general on-prem hosting.
        
       | elric wrote:
       | At 1:11 in the video, there's a chart of the TDP (which I looked
       | for in the text but couldn't find). At 125-500W, these things run
       | very hot.
        
         | bunabhucan wrote:
         | https://www.servethehome.com/wp-content/uploads/2024/10/AMD-...
         | 
         | Has the full range with TDP. 500w is only for the 128/192
         | monster chips. The 16 core fast sku has a 320W TDP.
        
           | Sakos wrote:
           | The 7950x3D is rated at 120w TDP. 320w seems quite high.
        
             | adgjlsfhk1 wrote:
             | that's the processor you're supposed to buy only if you are
             | paying $10000 per core per year for some ridiculously
             | expensive enterprise software. the extra power comes from
             | all the extra memory channels and cache
        
             | icegreentea2 wrote:
             | It is high, but it probably can sustain much higher all
             | core frequency compared to 7950x3D (or 7950x). If nothing
             | else, it has a bigger die and heat spreading to pull heat
             | from, it should be easier to maintain thermal headroom on
             | the EYPC chip.
             | 
              | That being said, it looks most probable that a 9175F is
              | just a 9755 (their absolute max full-size Zen 5 core part)
              | with nearly all (7 of 8) of the cores per CCD disabled in
              | order to keep all 512MB of cache. This means that there's
              | a lot of extra interconnect being kept running per core,
              | which certainly would contribute to the higher TDP.
             | 
             | Of course, in principle this should also mean that each
              | core (which should basically have all of a CCD's IO and
             | cache available to it) should be screaming fast in real
             | terms.
             | 
             | Of course finally, TDP is a totally fluffy number. The TDP
             | of the 7950X3D is most probably as low as it is because of
             | how much internal thermal resistance (the extra V-Cache
              | layer) it has. Part of its lower TDP isn't because "it's
             | efficient", part of it is because "if we run it hotter,
             | we'll melt it". The TDP for the 7950X for example is 170W.
        
           | ticoombs wrote:
            | > 500w is only for... 128/192 chips. The 16 core fast sku has
           | a 320W TDP.
           | 
            | When you think about it, 180W more for 7x the cores is
            | amazing.
        
             | masklinn wrote:
             | It's not new that high frequencies require higher power.
             | 
             | The base clock falls by half between the fastest and the
             | widest chips.
        
         | bjoli wrote:
         | That depends on the size of the processor, surely.
         | 
         | Socket sp5 is more than 3x the area of am5.
        
         | jmb99 wrote:
         | Doubtful, the 350W threadripper parts don't run particularly
         | hot with normal desktop cooling. I've overclocked a 7975WX with
         | an unrestricted power limit, and could dissipate over 800W
         | while keeping it below 90C (admittedly, with water cooling).
         | 500W with server cooling (super high RPM forced air) shouldn't
         | be a problem.
        
         | justmarc wrote:
         | You just need one unit of these in dual socket config per room
         | of your house, and you're sorted for the winter (if you live
         | somewhere cold).
        
         | formerly_proven wrote:
         | The cores are spread out over roughly 1200 mm2 of silicon and
         | the IO die seems to have grown again (maybe 500ish mm2?). So at
         | just 0.3 W/mm2 this is pretty cozy. The desktop parts have 3x
         | higher power density.
        
       | bob1029 wrote:
       | I'd like to see the 9965 in action. These parts are crazy. Will
       | definitely be looking to buy a machine from this generation.
       | 
       | https://www.amd.com/en/products/processors/server/epyc/9005-...
        
         | Salgat wrote:
         | I wonder how this compares to the 7950x3d. So much cache and a
         | high boost clock.
         | https://www.amd.com/en/products/processors/server/epyc/9005-...
        
           | Tuna-Fish wrote:
           | Great if you have 16 independent workloads, terrible for
           | things that care about communication between threads.
           | 
           | It has 16 CCD, each with only one thread enabled, latency
           | between CCD is ~150ns.
        
             | justmarc wrote:
             | Surprise surprise, not every tool is right for every job.
        
               | menaerus wrote:
                | Not sure if this comment was meant to come across as
                | snarky, but the parent rightfully pointed out the not-
                | so-obvious design of EPYC CPUs. A CCD is NUMA in
                | disguise.
        
             | smolder wrote:
             | 150ns is actually surprisingly high! I didn't realize it
             | was so bad. That's about 2-3x as much latency as fetching
              | from DRAM, based on what I see in people's AIDA64 results.
        
       | justmarc wrote:
       | Truly mind boggling scale.
       | 
       | Twenty years ago we had just 1-2 cores per CPU, so we were lucky
       | to have 4 cores in a dual socket server.
       | 
       | A single server can now have almost 400 cores. Yes, we can have
       | even more ARM cores but they don't perform as well as these do,
       | at least for now.
        
         | zer00eyz wrote:
            | 700+ threads over 2 cores, can saturate two 400GbE NICs,
            | 500 watts per chip (less than 2 watts per thread)... All of
            | that in a 2U package... 20 years ago that would have been
            | racks of gear.
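
The per-thread figure above roughly checks out. A sketch assuming the 192-core top SKU, dual socket, SMT2 (counts inferred from the thread, not from a spec sheet):

```python
# Watts per thread for a dual-socket 192-core (SMT2) box at 500 W/chip.
cores_per_socket = 192
threads = 2 * cores_per_socket * 2   # two sockets, two threads per core
watts = 2 * 500                      # 500 W TDP per chip
print(threads, watts / threads)      # 768 threads, ~1.3 W per thread
```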
        
           | smolder wrote:
           | > 700+ threads over 2 cores
           | 
           | I assume you mean 2 sockets.
        
           | jmrm wrote:
            | I think that "2 watts per thread" figure is a lot more
            | important than we home users usually realize. Having to
            | deliver less power, and to dissipate less of it as heat, in
            | a data centre is really good news for operating costs, which
            | are usually a lot bigger than the purchase cost of the
            | servers.
        
           | justmarc wrote:
           | With these CPUs one can definitely hit much higher rates.
           | 
           | 800Gbps from a single server was achieved by Netflix on much
           | lesser systems two years ago:
           | 
           | https://nabstreamingsummit.com/wp-
           | content/uploads/2022/05/20...
           | 
           | If I were to guess, this hardware can do double that, also
           | helping that we now have actual 800Gbps Ethernet hardware.
           | 
           | Indeed 20 years ago this would have been racks of gear at a
           | very high cost and a huge power bill.
        
             | immibis wrote:
             | That's one hell of a pixelflut server.
        
         | chx wrote:
         | Indeed: the first dual core server chips only launched in 2005
         | afaik with 90nm Denmark/Italy/Egypt Opterons and Paxville Xeons
        | but on the Intel side it wasn't until 2007 that they were in
         | full swing.
        
           | p_l wrote:
            | The first dual core server chips showed up generally
            | available in 2001 with IBM POWER4, then HP PA-RISC ones in
            | 2004, and then Opterons, which were followed by Intel's
            | "emergency" design of essentially two "sockets" on one die
            | in the NetBurst dual core systems.
        
             | chx wrote:
             | Ehhhh the MAJC 5200 was generally available in 1999 and I
             | am sure even older designs could be found if we were to dig
             | deep enough. Their market share would also need some
             | digging too.
             | 
             | To quote the announcement: "two VLIW (very long instruction
             | word) microprocessors on a single piece of silicon"
        
               | formerly_proven wrote:
               | Power and PA-RISC shipped servers, though. MAJC on the
               | other hand
               | 
               | > Sun built a single model of the MAJC, the two-core MAJC
               | 5200, which was the heart of Sun's XVR-1000 and XVR-4000
               | workstation graphics boards.
        
               | kristianp wrote:
               | MAJC: https://en.m.wikipedia.org/wiki/MAJC
               | 
               | Why have I not heard of those before?
        
             | chx wrote:
             | Also, I raised the question at
             | https://retrocomputing.stackexchange.com/q/30743/3722 and
             | one of the answers points out the 1984 Rockwell R65C29 Dual
                | CMOS Microprocessor. It was two standard 6502s on the same
             | die using the same bus to access the same memory... and of
             | course IBM mainframes did it decades before.
        
               | kjs3 wrote:
               | If we're going that direction, National Semiconductor had
               | a 2 'core' COPS4 processor in 1981[1]. I have some in a
               | tube somewhere (unused).
               | 
               | [1] https://www.cpushack.com/2014/08/25/national-
               | semiconductor-c...
        
               | chx wrote:
               | Yes, Retro SE also points out the Intel 8271 from 1977
               | was a dual core microcontroller.
        
               | kjs3 wrote:
               | Depends on your definition. The 8271 wasn't programmable
               | by anyone but Intel (at least, they never made that a
               | market option), and the second core was more of a bit-
               | oriented coprocessor, sorta like saying the 80486 is a
               | 2-core processor because of the FPU.
        
         | RobinL wrote:
         | I wonder what percentage of 'big data' jobs that run in
         | clusters would now be far faster on a single big machine with
         | e.g. duckdb rather than spark
        
           | conjecTech wrote:
           | The difference in throughput for local versus distributed
           | orchestration would mainly come from serdes, networking,
           | switching. Serdes can be substantial. Networking and
            | switching have been aggressively offloaded from the CPU through
           | better hardware support.
           | 
           | Individual tasks would definitely have better latency, but
           | I'd suspect the impact on throughput/CPU usage might be
           | muted. Of course at the extremes (very small jobs, very
           | large/complex objects being passed) you'd see big gains.
        
             | mtremsal wrote:
             | Would you mind expanding on how SerDes become a bottleneck?
             | I'm not familiar and reading the Wikipedia article wasn't
             | enough to connect the dots.
        
             | RobinL wrote:
             | By way of a single example, we've been migrating recently
             | from spark to duckdb. Our jobs are not huge, but too big
             | for a single 'normal' machine. We've gone from a 2.5 hour
              | runtime on a cluster of 10 machines (40 vCPU total) to a 15
             | minute runtime on a 32vCPU single machine. I don't know for
             | sure, but I think this is largely because it eliminates
             | expensive shuffles and serde. Obviously results vary hugely
             | depending on workload, and some jobs are simply too big
             | even for a 192 core machine. But I suspect a high
             | proportion of workloads would be better run on single large
             | machines nowadays
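
In raw numbers, that migration works out as follows (illustrative arithmetic on the figures quoted above):

```python
# Wall-clock speedup and vCPU-minute efficiency for the quoted migration.
spark_min, spark_vcpus = 150, 40   # 2.5 h on a 10-machine cluster
duck_min, duck_vcpus = 15, 32      # 15 min on one 32-vCPU machine

speedup = spark_min / duck_min                            # wall clock
efficiency = (spark_min * spark_vcpus) / (duck_min * duck_vcpus)
print(speedup, efficiency)         # 10x faster, 12.5x fewer vCPU-minutes
```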
        
           | semi-extrinsic wrote:
           | Essentially all, I would guess. But scheduling jobs and
           | moving data in and out of a single big machine can become a
           | huge bottleneck.
        
           | justmarc wrote:
           | I often think about huge, fancy cloud setups literally
           | costing silly money to run, being replaced by a single beast
           | of a machine powered by a modern, high core count CPU (say
           | 48+), lots of RAM and lots of high performance enterprise-
           | grade SSD storage.
        
             | cjbprime wrote:
             | (Oftentimes part of the reason the huge, fancy cloud setup
             | costs more is that any component can fail, all the way down
             | to the region-level, without loss of service.)
        
               | justmarc wrote:
                | And oftentimes that loss of service, if temporary, is
                | not all that painful -- it really depends on the exact
               | needs/use case/scenario.
               | 
               | That said, sometimes The Cloud goes down too.
        
         | varispeed wrote:
          | Nowadays most services can fit on a single server and serve
          | millions of users a day. I wonder how this will affect overly
          | expensive cloud services, when you can rent a beefy dedicated
          | server for under a grand and make tens of thousands in savings
          | (enough to hire a full-time administrator with plenty of money
          | left over for other things).
        
         | Dylan16807 wrote:
         | On the other hand, at the time we would have expected twenty
         | years of progress to make the cores a thousand times faster.
         | Instead that number is more like 5x.
        
       | speedgoose wrote:
        | I'm looking forward to deploying AMD Turin bare metal servers
        | on Hetzner. The previous generations were already great value,
        | but this seems a step above.
        
       | nickpp wrote:
       | Just in time for Factorio 2.0.
        
       | smolder wrote:
       | The weirdest one of the bunch is the AMD EPYC 9175F: 16 cores
       | with 512MB of L3 cache! Presumably this is for customers trying
       | to minimize software costs that are based on "per-core"
       | licensing. It really doesn't make much sense to have so few cores
       | at such an expense, otherwise. Does Oracle still use this style
       | of licensing? If so, they need to knock it off.
       | 
       | The only other thing I can think of is some purpose like HFT may
       | need to fit a whole algorithm in L3 for absolute minimum latency,
       | and maybe they want only the best core in each chiplet? It's
       | probably about software licenses, though.
        
         | londons_explore wrote:
         | Plenty of applications are single threaded and it's cheaper to
         | spend thousands on a super fast CPU to run it as fast as
         | possible than spend tens of thousands on a programmer to
         | rewrite the code to be more parallel.
         | 
         | And like you say, plenty of times it is infeasible to
         | rewrite the code because it's third-party code for which
         | you don't have the source or the rights.
        
         | puzzlingcaptcha wrote:
         | Windows server licensing starts at 16 cores
        
         | RHab wrote:
         | Abaqus, for example, is licensed per core. I am severely
         | limited by it, so for me this makes total sense.
        
         | bob1029 wrote:
         | Another good example is any kind of discrete event simulation.
         | Things like spiking neural networks are inherently single-
         | threaded if you are simulating them accurately (i.e.,
         | serialized through the pending spike queue). Being able to keep
         | all the state in local cache and picking the fastest core to do
         | the job is the best possible arrangement. The ability to run 16
         | in parallel simply reduces the search space by the same factor.
         | Worrying about inter-CCD latency isn't a thing for these kinds
         | of problems. The amount of bandwidth between cores is minimal,
         | even if we were doing something like a genetic algorithm with
         | periodic crossover between physical cores.
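The serialized pending-spike queue described above can be sketched as a toy discrete event loop (illustrative only; the propagation rule and neuron numbering are made up for the example, not a real SNN model):

```python
import heapq

def run(events, horizon=10.0):
    """Process timestamped spikes strictly in time order; each spike
    may schedule follow-up spikes, so the loop is inherently serial."""
    queue = list(events)
    heapq.heapify(queue)
    processed = []
    while queue:
        t, neuron = heapq.heappop(queue)
        if t > horizon:
            break
        processed.append((t, neuron))
        # Toy propagation rule (assumption): each spike triggers the
        # next neuron 1 ms later, stopping at neuron 3.
        if neuron < 3:
            heapq.heappush(queue, (t + 1.0, neuron + 1))
    return processed

print(run([(0.0, 0)]))  # spikes are handled one at a time, in order
```

Running 16 independent copies of such a loop (one per core) is trivially parallel, but each individual simulation stays serial.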
        
         | Jestzer wrote:
         | MATLAB Parallel Server also does per-core licensing.
         | 
         | https://www.mathworks.com/products/matlab-parallel-
         | server/li....
        
         | forinti wrote:
         | You can pin which cores you will use and so stay within your
         | contract with Oracle.
        
         | heraldgeezer wrote:
         | Windows Server and MSSQL are per-core now. A lot of
         | enterprise software is. Vendors switched to per-core
         | licensing because it used to be based on CPU sockets. Not
         | just Oracle.
        
         | bee_rider wrote:
         | 512 MB of cache, wow.
         | 
         | A couple years ago I noticed that some Xeons I was using had
         | as much cache as the RAM in the systems I had growing up
         | (millennial, so we're not talking about ancient Commodores
         | or whatever; real usable computers that could play Quake and
         | everything).
         | 
         | But 512MB? That's _roomy._ Could Puppy Linux just be held
         | entirely in L3 cache?
        
           | hedora wrote:
           | I wonder if you can boot it without populating any DRAM
           | sockets.
        
             | bee_rider wrote:
              | I would be pretty curious about such a system. Or,
              | maybe more practically, it might be interesting to have
              | a system that pretends the L3 cache is RAM, and the RAM
              | is the hard drive (in particular, the RAM could
              | disguise itself as the swap partition, so the OS would
              | treat it as basically a chunk of RAM that it would
              | rather not use).
        
               | compressedgas wrote:
               | Philip Machanick's RAMpage! (ca. 2000)
               | 
               | > The RAMpage memory hierarchy is an alternative to a
               | conventional cache-based hierarchy, in which the lowest-
               | level cache is managed as a paged memory, and DRAM
               | becomes a paging device.
        
               | afr0ck wrote:
               | So, essentially, you're just doing cache eviction in
               | software. That's obviously a lot of overhead, but at
               | least it gives you eviction control. However, there is
               | very little to do when it comes to cache eviction. The
               | algorithms are all well known and there is little
               | innovation in that space. So baking that into the
               | hardware is always better, for now.
        
               | edward28 wrote:
                | Intel had such a CPU in the previous generation: the
                | Xeon Max, with up to 64GB of HBM on package. It could
                | use it as cache or as plain memory.
        
             | lewurm wrote:
              | Firmware uses cache-as-RAM (e.g.
              | https://www.coreboot.org/images/6/6c/LBCar.pdf) to do
              | early init, like DRAM training. I guess later stages of
              | the boot chain rely on DRAM being set up, though.
        
           | zamadatix wrote:
           | CCDs can't access each other's L3 cache as their own (fabric
           | penalty is too high to do that directly). Assuming it's
           | anything like the 9174F that means it's really 8 groups of 2
           | cores that each have 64 MB of L3 cache. Still enormous, and
           | you can still access data over the infinity fabric with
           | penalties, but not quite a block of 512 MB of cache on a
           | single 16 core block that it might sound like at first.
           | 
            | Zen 4 also had 96 MB-per-CCD variants like the 9184X, so
            | 768 MB per socket, and they are dual socket, so you can
            | end up with 1.5 GB of total L3 cache in a single machine!
            | The downside is that beyond CCD<->CCD latencies you now
            | have socket<->socket latencies.
        
             | edward28 wrote:
             | It's actually 16 CCDs with a single core and 32MB each.
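The arithmetic behind the L3 figures in this subthread, taking the per-CCD layouts as reported in the comments above (as stated by the commenters, not verified against AMD's spec sheets):

```python
# 9175F as described above: 16 CCDs, one core and 32 MB of L3 each.
assert 16 * 32 == 512
# The 9174F-style layout mentioned earlier: 8 CCDs x 64 MB.
assert 8 * 64 == 512
# Zen 4 X-variant: 96 MB per CCD x 8 CCDs x 2 sockets.
total_mb = 96 * 8 * 2
print(total_mb)  # 1536 MB, i.e. the 1.5 GB total quoted above
```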
        
         | Aurornis wrote:
         | Many algorithms are limited by memory bandwidth. On my 16-core
         | workstation I've run several workloads that have peak
         | performance with less than 16 threads.
         | 
         | It's common practice to test algorithms with different numbers
         | of threads and then use the optimal number of threads. For
         | memory-intensive algorithms the peak performance frequently
         | comes in at a relatively small number of cores.
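The thread-count sweep described above can be sketched like this (illustrative only; the workload is a stand-in, and the optimal count depends entirely on the machine and the real kernel):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def stand_in_task(n: int) -> int:
    # Placeholder for a memory-bound kernel.
    return sum(range(n))

def time_with_threads(num_threads: int, tasks: int = 8,
                      n: int = 100_000) -> float:
    """Time a fixed batch of tasks at a given thread count."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(stand_in_task, [n] * tasks))
    return time.perf_counter() - start

# Sweep a few thread counts and keep the fastest configuration.
timings = {t: time_with_threads(t) for t in (1, 2, 4, 8)}
best = min(timings, key=timings.get)
print("best thread count:", best)
```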
        
           | CraigJPerry wrote:
           | Is this because of NUMA or is it L2 cache or something
           | entirely different?
           | 
           | I worked on high perf around 10 years ago and at that point I
           | would pin the OS and interrupt handling to a specific core so
           | I'd always lose one core. Testing led me to disable
           | hyperthreading in our particular use case, so that was
           | "cores" (really threads) halfed.
           | 
           | A colleague had a nifty trick built on top of solarflare zero
           | copy but at that time it required fairly intrusive kernel
           | changes, which never totally sat well with me, but again I'd
           | lose a 2nd core to some bookkeeping code that orchestrated
           | that.
           | 
            | I'd then taskset the app to the other cores.
           | 
            | NUMA was a thing by then, so it really wasn't
            | straightforward to eke out maximum performance. It became
            | somewhat of a competition to see who could get the
            | highest throughput, but usually those configurations were
            | unusable due to unacceptable p99 latencies.
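The core-reservation trick above can also be done from inside the process on Linux (a sketch; reserving core 0 for the OS is just the convention from the comment, not a requirement):

```python
import os

if hasattr(os, "sched_setaffinity"):  # Linux-only API
    all_cpus = os.sched_getaffinity(0)
    # Reserve core 0 for the OS/interrupt handling; run on the rest.
    app_cpus = {c for c in all_cpus if c != 0} or all_cpus
    os.sched_setaffinity(0, app_cpus)
    pinned = sorted(os.sched_getaffinity(0))
else:
    pinned = []  # no affinity control on this platform
print(pinned)
```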
        
             | afr0ck wrote:
             | NUMA gives you more bandwidth at the expense of higher
             | latency (if not managed properly).
        
         | yusyusyus wrote:
         | new vmware licensing is per-core.
        
       | bytepursuits wrote:
       | When is 8005 coming?
        
       | thelastparadise wrote:
       | I wonder how LLM performance is on the higher core counts?
       | 
       | With recent DDR generations and many core CPUs, perhaps CPUs will
       | give GPUs a run for their money.
        
         | kolbe wrote:
         | The H100 has 16,000 CUDA cores at 1.2GHz. My rough
         | calculation is it can handle 230k concurrent calculations,
         | whereas a 192 core AVX-512 chip (assuming it calculates on
         | 16 bit data) can handle 6k concurrent calculations at 4x
         | the frequency. So, about a 10x difference just on compute,
         | not to mention that memory bandwidth is an even stronger
         | advantage for GPUs.
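Reproducing the back-of-the-envelope arithmetic from the comment above (the 230k and 6k figures are the commenter's own estimates, not vendor specs):

```python
gpu_concurrent = 230_000   # commenter's H100 estimate
cpu_concurrent = 6_000     # 192 cores x 32 16-bit AVX-512 lanes
cpu_freq_advantage = 4     # CPU clocks roughly 4x the GPU's 1.2GHz

# The 6k figure: 192 cores x (512 bits / 16 bits) lanes per core.
assert 192 * (512 // 16) == 6_144  # ~6k, as stated

advantage = gpu_concurrent / cpu_concurrent / cpu_freq_advantage
print(round(advantage, 1))  # roughly the "10x" quoted above
```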
        
       | TacticalCoder wrote:
       | > The system we had access to was running 6000MT/s for its
       | memory, and DDR5-6000 MT/s is what most systems will support in a
       | 1 DIMM per channel configuration. Should you want to run 2 DIMMs
       | per channel, then your memory speeds drop to 4400 MT/s; and if
       | you run 1 DIMM per channel in a motherboard with 2 DIMMs per
       | channel then expect 5200 MT/s for your memory speed.
       | 
       | Is this all ECC memory at these speeds?
        
         | wmf wrote:
         | Yes, servers only use ECC RAM.
        
       | aurareturn wrote:
       | Phoronix recently reviewed the 192 core Turin Dense against
       | the AmpereOne 192 core.
       | 
       | * Ampere MSRP $5.5K vs $15K for the EPYC.
       | 
       | * Turin Dense had 1.6x better performance
       | 
       | * Ampere had 1.2x better energy consumption
       | 
       | In terms of actual $/perf, the AmpereOne 192 core is 1.7x
       | better than the Turin Dense 192 core based on Phoronix's
       | review.
       | 
       | So for $5.5k, you can either buy an AmpereOne 192 core CPU
       | (274W) or a Turin 48 core CPU (300W).
       | 
       | Ampere has a 256 core, 3nm, 12 memory channel shipping next year
       | that is likely to better challenge Turin Dense and Sierra Forest
       | in terms of raw performance. For now, their value proposition is
       | $/perf.
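A quick check of the $/perf claim, using the MSRPs and the performance ratio quoted in the comment above:

```python
ampere_price = 5_500   # AmpereOne 192 core MSRP, per the comment
epyc_price = 15_000    # Turin Dense MSRP, per the comment
epyc_perf = 1.6        # Turin's performance, normalized to Ampere = 1.0

ampere_perf_per_dollar = 1.0 / ampere_price
epyc_perf_per_dollar = epyc_perf / epyc_price
print(round(ampere_perf_per_dollar / epyc_perf_per_dollar, 1))  # -> 1.7
```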
       | 
       | Anyway, I'm very interested in how Qualcomm's Nuvia-based server
       | chips will perform. Also, if ARM's client core improvements are
       | any indication, I will be very interested in how in-house chips
       | like AWS Graviton, Google Axion, Microsoft Cobalt, Nvidia Grace,
       | Alibaba Yitian will compete with better Neoverse cores. Nuvia vs
       | ARM vs AmpereOne.
       | 
       | This is probably the golden age of server CPUs. 7 years ago, it
       | was only Intel's Xeon. Now you have numerous options.
        
         | KingOfCoders wrote:
          | The difference is, you can get EPYC CPUs, but you can't
          | get hold of Ampere CPUs.
        
         | tpurves wrote:
          | AMD also wins on perf/watt, which is pretty notable for
          | anyone who still believed that x86 could never challenge
          | ARM/RISC in efficiency. These days, a lot of data centers
          | are also more limited by available watts (and associated
          | cooling), which bodes well for Turin.
        
       | stzsch wrote:
       | For those that dislike their change to substack, there is
       | https://old.chipsandcheese.com/2024/10/11/amds-turin-5th-gen....
       | 
       | At least for now.
        
       ___________________________________________________________________
       (page generated 2024-10-12 23:01 UTC)