[HN Gopher] Testing AMD's Bergamo: Zen 4c
       ___________________________________________________________________
        
       Testing AMD's Bergamo: Zen 4c
        
       Author : latchkey
       Score  : 77 points
       Date   : 2024-06-22 17:36 UTC (5 hours ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | tedunangst wrote:
        | I remember that as part of the TBB library release, Intel
        | remarked that 100 Pentium cores had the same transistor count
        | as one Core 2. It took a while, but we're starting to turn the
        | corner on slower-and-wider becoming more common.
        
         | imtringued wrote:
          | The problem with adding more cores is that SRAM is the real
          | bottleneck: adding more cores means more cache.
          | 
          | Until someone figures out how to do 3D-stackable SRAM, the
          | way SSDs stack NAND layers, SRAM will always consume most of
          | the area on your chip.
        
           | wmf wrote:
            | It's called 3D V-Cache.
        
           | JonChesterfield wrote:
            | This isn't "fill the reticle with CPU"; it's "make a dozen
            | separate chips and package them on a network fabric". The
            | amount of cache can increase linearly with the number of
            | cores without a problem.
            | 
            | There is some outstanding uncertainty about cache coherency
            | vs. performance as N goes up, which shows up in NUMA-cliff
            | fashion. My pet theory is that's what will ultimately kill
            | x64 - the concurrency semantics are skewed really hard
            | towards convenience and thus away from scalability.
        
             | doctor_eval wrote:
             | I know next to nothing about CPU architecture so please
             | forgive a stupid question.
             | 
             | Are you saying that the x86 memory model means RAM latency
             | is more impactful than on some other architectures?
             | 
              | Is this (tangentially) related to the memory-ordering
              | mode that Apple reportedly added to the M1 to emulate the
              | x86 memory model and make emulation faster? - presumably
              | to account for assumptions that compilers make about the
              | state of the CPU after certain operations?
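              | 
              | For concreteness, my (possibly wrong) mental model in C++
              | is the sketch below: on x86 the release/acquire pair
              | compiles to plain mov instructions, because ordinary
              | loads and stores are already strongly ordered (TSO),
              | while on ARM the compiler has to emit stlr/ldar or
              | explicit barriers. That would explain why the M1's TSO
              | mode speeds up translated x86 code that never spelled
              | this ordering out.
              | 
              |     #include <atomic>
              |     #include <cstdio>
              |     #include <thread>
              |     
              |     int data = 0;                    // plain, non-atomic payload
              |     std::atomic<bool> ready{false};  // synchronization flag
              |     
              |     void producer() {
              |         data = 42;
              |         // release: the write above must be visible
              |         // before the flag flips
              |         ready.store(true, std::memory_order_release);
              |     }
              |     
              |     void consumer() {
              |         // acquire: pairs with the release store above
              |         while (!ready.load(std::memory_order_acquire)) {}
              |         std::printf("%d\n", data);  // guaranteed to print 42
              |     }
              |     
              |     int main() {
              |         std::thread t1(producer), t2(consumer);
              |         t1.join();
              |         t2.join();
              |     }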
        
       | nullc wrote:
        | Is the 4c core slower in any way other than the L3 cache
        | reduction?
        | 
        | It would be interesting to take a compute-bound, perfectly
        | scaling workload and compare Bergamo and Genoa on absolute
        | performance and performance per watt.
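        | 
        | Something like this trivial kernel is what I have in mind
        | (just a sketch: each thread runs an independent, cache-
        | resident floating-point chain, so throughput should scale
        | roughly linearly with core count until power limits bite).
        | Pairing it with turbostat's package-power readings would give
        | perf per watt.
        | 
        |     #include <chrono>
        |     #include <cstdio>
        |     #include <thread>
        |     #include <vector>
        |     
        |     int main() {
        |         const unsigned n = std::thread::hardware_concurrency();
        |         const long iters = 200000000L;
        |         std::vector<std::thread> pool;
        |         std::vector<double> sinks(n);  // defeat dead-code elimination
        |     
        |         auto t0 = std::chrono::steady_clock::now();
        |         for (unsigned i = 0; i < n; ++i)
        |             pool.emplace_back([&sinks, i, iters] {
        |                 double x = 1.0 + i;
        |                 for (long k = 0; k < iters; ++k)
        |                     x = x * 1.0000001 + 0.5;  // dependent FP chain
        |                 sinks[i] = x;  // publish the result
        |             });
        |         for (auto& t : pool) t.join();
        |         std::chrono::duration<double> dt =
        |             std::chrono::steady_clock::now() - t0;
        |         std::printf("%u threads, %.2f s, %.0f Miters/s\n",
        |                     n, dt.count(),
        |                     (double)n * iters / dt.count() / 1e6);
        |     }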
        
         | mlyle wrote:
          | Same performance per cycle; less L3 and a lower maximum
          | frequency.
          | 
          | Zen 4c is said to have slightly better performance per watt
          | than Zen 4, but it's not clear how much.
        
         | toast0 wrote:
          | > Is the 4c core slower in any way other than the L3 cache
          | > reduction?
          | 
          | It's got a lower clock ceiling. Not much lower than the
          | achievable clocks in the really dense Zen 4 Epyc parts, but
          | a lot lower than an 8-core Ryzen.
        
         | wmf wrote:
         | https://www.phoronix.com/review/amd-epyc-9754-bergamo/2
        
       | Pet_Ant wrote:
        | The actual title seems to be "Testing AMD's Bergamo: Zen 4c
        | Spam", which I really like, because from the perspective of 20
        | years ago this would feel a bit like "spam", or a CPU-core
        | "Zergling rush".
        | 
        | As I said before, I do believe this is the future of CPU
        | cores. [1] With RAM latency not really having kept pace with
        | CPUs, ever-more-performant cores really seem like a waste. In
        | a cloud setting, where you always have some work to do, it
        | seems like simpler cores, but more of them, is really the
        | answer. It's in that environment that the weight of x86's
        | legacy will catch up with us and we'll need to get rid of all
        | the transistors wasted on decoding cruft.
       | 
       | https://news.ycombinator.com/item?id=40535915
        
         | ComputerGuru wrote:
         | I agreed with you up until the x86 comment. Legacy x86 support
         | is largely a red herring. The constraints are architectural (as
         | you noted, per-core memory bandwidth, plus other things) more
         | than they are due to being tied down to legacy instruction
         | sets.
        
           | mlyle wrote:
           | If the goal ends up being many-many-core, x86's complexity
           | tax may start to matter. The cost of x86 compatibility
           | relative to all the complexities required for performance has
           | been small, but if we end up deciding that memory latency is
           | going to kill us and we can't keep complex cores fed, then
           | that is a vote for simpler architectures.
           | 
           | I suspect the future will be something between these extremes
           | (tons of dumb cores or ever-more-complicated cores to try and
           | squeeze out IPC), though.
        
           | Pet_Ant wrote:
            | Each core needs to handle the full complexity of x86. Now,
            | as super-scalar OoO x86 cores have evolved, the percentage
            | of die area allocated to decoding the cruft has gone down.
            | 
            | ...but when we start swarming simple cores, that cost
            | starts to rise again. Each core still needs to be able to
            | decode everything. When you have 100 cores, even if the
            | cruft is just 4% of each one, that adds up to the area of
            | 4 more cores. This is free if you are willing to recompile
            | your code.
            | 
            | Now, it may turn out that we need more decoding complexity
            | than something like RISC-V currently has (Qualcomm has
            | been working on it), but those trade-offs will be
            | deliberate and intentionally chosen to meet the needs of
            | today, not accrued since the early '80s.
        
             | magicalhippo wrote:
             | As a developer of fairly standard software, there's very
             | little I can say I rely on from the x86/x64 ISA.
             | 
              | One big one is probably the memory consistency model [1]
              | and such, which affects atomic operations and
              | synchronizing multi-threaded code. Usually not directly,
              | though; I typically use libraries or OS primitives.
             | 
             | Are there any non-obvious (to me anyway) ways us "typical
             | devs" rely on x86/x64?
             | 
             | I get the sense that a lot of software is one recompile
             | away from running on some other ISA, but perhaps I'm overly
             | naive.
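              | 
              | One concrete gotcha I do know of (strictly an ABI
              | convention that follows the ISA, rather than the ISA
              | itself): plain char is signed on x86-64 Linux but
              | unsigned on AArch64 Linux, so even this toy program
              | changes behaviour after a straight recompile:
              | 
              |     #include <cstdio>
              |     
              |     int main() {
              |         char c = '\xFF';  // bit pattern 0xFF
              |         if (c < 0)
              |             std::printf("signed char (x86-64)\n");
              |         else
              |             std::printf("unsigned char (AArch64)\n");
              |     }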
             | 
             | [1]: https://en.wikipedia.org/wiki/Consistency_model
        
       | adrian_b wrote:
       | The article says "AMD's server platform also leaves potential for
       | expansion. Top end Genoa SKUs use 12 compute chiplets while
       | Bergamo is limited to just eight. 12 Zen 4c compute dies would
       | let AMD fit 192 cores in a single socket".
       | 
        | It should be noted that Bergamo's successor, Turin Dense,
        | which is expected by the end of the year, will have 12 compute
        | chiplets, for a total of 192 Zen 5c cores (12 chiplets x 16
        | cores each), thus bringing both more cores and faster cores.
        
         | metadat wrote:
         | 192 cores per socket? If so, that's pretty wild.
        
       ___________________________________________________________________
       (page generated 2024-06-22 23:00 UTC)