[HN Gopher] Test Results for AMD Zen 5
       ___________________________________________________________________
        
       Test Results for AMD Zen 5
        
       Author : matt_d
       Score  : 173 points
       Date   : 2025-07-26 18:44 UTC (4 hours ago)
        
 (HTM) web link (www.agner.org)
 (TXT) w3m dump (www.agner.org)
        
       | eigenform wrote:
       | This reminds me: has anyone ever figured out why Zen 3 was
       | missing memory renaming, but it came back in Zen 4 and Zen 5?
        
         | Tuna-Fish wrote:
         | AMD had two leapfrogging CPU design teams. Memory renaming was
         | added by the team that did Zen2; presumably the Zen3 team
         | couldn't import it in time for some reason.
        
       | alberth wrote:
       | While an interesting read, the title is a bit misleading since I
       | didn't see any actual "test results" in the post.
        
         | ooopdddddd wrote:
         | The detailed results are in the links at the bottom of the
         | post.
        
         | Someone wrote:
         | AMD's documentation for the CPU may or may not state such
         | things as "There are six integer ALUs, four address generation
         | units, three branch units, four vector ALUs, and two vector
         | read/write units", but even if it does, Agnes Fog runs actual
         | code to check that, and often discovers corner cases that the
         | official documentation doesn't mention.
         | 
         | So, he black box tests the CPU to try and discover its innards.
        
           | titanomachy wrote:
           | > Agnes Fog
           | 
           | Agner
        
         | djoldman wrote:
         | They are linked at the bottom of Mr. Fog's post. For example on
         | page 142 of this:
         | 
         | https://www.agner.org/optimize/instruction_tables.pdf
        
       | ashvardanian wrote:
       | > All vector units have full 512 bits capabilities except for
       | memory writes. A 512-bit vector write instruction is executed as
       | two 256-bit writes.
       | 
       | That sounds like a weird design choice. Curious if this will
       | affect memcpy-heavy workloads.
       | 
       | Writes aside, Zen5 is taking much longer to roll out than I
       | thought, and some of AMD's positioning is (almost expectedly)
       | misleading, especially around AI.
       | 
       | AMD's website claims Zen5 is the "Leading CPU for AI"
       | (<https://www.amd.com/en/products/processors/server/epyc/ai.ht...>),
       | but I strongly doubt that. First, they compare Zen5 (9965), which
       | is still largely unavailable, to Xeon2 (8280), a processor two
       | generations older. Xeon4 is abundantly available and comes with
       | AMX, a feature exclusive to Intel. I doubt AVX-512 support with a
       | 512-bit physical path and even twice as many cores will be enough
       | to compete with that (if we consider just the ALU throughput
       | rather than the overall system & memory).
        
         | dragontamer wrote:
         | Well, when you consider that AVX 512 instructions have 2 or 3
         | reads per 1 write, there's a degree of sense here.
         | 
         | Consider the standard matrix multiplication primitive the FMAC
         | / multiply and accumulate: 3 reads and one write if I'm
         | counting correctly .... (Output = A * B + C, three reads one
         | output).
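
The operand count above can be illustrated with a minimal Python sketch of a naive matrix multiply written as multiply-accumulate steps. The function name `matmul_fma` and the read/write tally are hypothetical, added only for illustration. The tally counts operand reads and result writes per step, not memory traffic: in real vector code the accumulator and many inputs stay in registers.

```python
def matmul_fma(A, B):
    """Naive matrix multiply as multiply-accumulate steps, tallying operands.

    Returns (C, reads, writes), where reads/writes count the operands of
    each acc = a * b + acc step (an illustrative tally, not memory traffic).
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    reads = 0
    writes = 0
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                # One multiply-accumulate: reads A[i][p], B[p][j] and acc,
                # then writes the new acc.
                acc = A[i][p] * B[p][j] + acc
                reads += 3
                writes += 1
            C[i][j] = acc
    return C, reads, writes


C, reads, writes = matmul_fma([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, reads, writes)
```

For the 2x2 case this performs 8 multiply-accumulate steps, so the tally comes out at 3 reads per write.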
        
         | rpiguy wrote:
         | It may be easier for the memory controller to schedule two
         | narrower writes than to wait for one 512-bit block, or perhaps
         | they just didn't substantially update the memory controller and
         | so it still has to operate as it did in Zen 4.
        
         | arrakark wrote:
         | Cache-line bursts/beats tend to be standardized to 64B in lots
         | of NoC architectures.
        
           | Dylan16807 wrote:
           | "Network on Chip" okay got it.
        
         | ryao wrote:
         | AMD CPUs tend to have more memory bandwidth than Intel CPUs and
         | inference is CPU bound, so their claim seems accurate to me.
         | 
         | Whether the core does a 512-bit write in 1 cycle or 2 because
         | it is two 256-bit writes is immaterial. Memory bandwidth is
         | bottlenecked by 64GB/sec per CCX. You need to use cores from
         | multiple CCXs to get full bandwidth.
         | 
         | That said, the EPYC 9175F has 614.4GB/sec memory bandwidth and
         | should be able to use all of it. I have one, although the
         | machine is not yet assembled (Supermicro took 7 weeks to send
         | me a motherboard, which delayed assembly), so I have not
         | confirmed that it can use all of it yet.
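
The 614.4GB/sec figure is consistent with a back-of-the-envelope calculation, sketched below in Python. The configuration of 12 DDR5 channels at 6400 MT/s with 8 bytes per transfer is an assumption, not something stated in the comment. The 64GB/sec per-CCX limit is taken from the comment as given. All constant names are hypothetical.

```python
import math

# Assumed memory configuration (not stated in the comment above).
CHANNELS = 12               # DDR5 channels per socket
MEGATRANSFERS_PER_S = 6400  # DDR5-6400
BYTES_PER_TRANSFER = 8      # 64-bit data path per channel

# Per-CCX limit as quoted in the comment.
PER_CCX_GB_S = 64

# Theoretical peak in GB/s (decimal gigabytes).
peak_gb_s = CHANNELS * MEGATRANSFERS_PER_S * BYTES_PER_TRANSFER / 1000

# Minimum number of CCXs needed to reach the peak at 64 GB/s each.
ccx_needed = math.ceil(peak_gb_s / PER_CCX_GB_S)

print(f"peak bandwidth: {peak_gb_s:.1f} GB/s")
print(f"CCXs needed to saturate it: {ccx_needed}")
```

Under these assumptions the peak works out to 614.4 GB/s, and a workload would need cores on at least 10 CCXs to approach it.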
        
       | pbsd wrote:
       | Vector ALU instruction latencies are understandably listed as 2
       | and higher, but this is not strictly the case. From AMD's Zen 5
       | optimization manual [1], we have
       | 
       | > The floating point schedulers have a slow region, in the oldest
       | entries of a scheduler and only when the scheduler is full. If an
       | operation is in the slow region and it is dependent on a 1-cycle
       | latency operation, it will see a 1 cycle latency penalty.
       | 
       | > There is no penalty for operations in the slow region that
       | depend on longer latency operations or loads.
       | 
       | > There is no penalty for any operations in the fast region.
       | 
       | > To write a latency test that does not see this penalty, the test
       | needs to keep the FP schedulers from filling up.
       | 
       | > The latency test could interleave NOPs to prevent the scheduler
       | from filling up.
       | 
       | Basically, short vector code sequences that don't fill up the
       | scheduler will have better latency.
       | 
       | [1]
       | https://www.amd.com/content/dam/amd/en/documents/processor-t...
        
       | vhcr wrote:
       | https://web.archive.org/web/20250726202105/https://www.agner...
        
       | londons_explore wrote:
       | > Integer vector instructions and floating point vector
       | instructions now have the same latencies.
       | 
       | There is very little reason to use integers for anything anymore.
       | Loop counter? Why not make it a double - you never know when you
       | might need an extra 0.5 loops at the end!
        
         | bee_rider wrote:
         | Finally we can implement BiCGStab intuitively!
        
         | Intralexical wrote:
         | Integers aren't for performance. They're for precision
         | (anything financial for example) and occasionally size.
        
         | sushevff wrote:
         | Totally. Can't wait to access the 18463.637th record in my
         | database plus or minus a record or thousand.
        
           | vhcr wrote:
           | Doubles can represent integers exactly up to 2^52
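
The bound is easy to check empirically. A minimal Python sketch follows, assuming Python floats are IEEE 754 doubles (true on mainstream platforms). `roundtrips` is a hypothetical helper name. It suggests the tight limit is 2^53 rather than 2^52: every integer up to 2^53 survives the conversion, and 2^53 + 1 is the first that does not.

```python
def roundtrips(n: int) -> bool:
    """True if n survives conversion to a 64-bit float and back unchanged."""
    return int(float(n)) == n


LIMIT = 2 ** 53  # a double has a 53-bit significand (52 stored + 1 implicit)

print(roundtrips(2 ** 52))      # True: the bound quoted above holds
print(roundtrips(2 ** 52 + 1))  # True: still exact past 2^52
print(roundtrips(LIMIT))        # True: 2^53 itself is exact
print(roundtrips(LIMIT + 1))    # False: rounds to 2^53
```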
        
       | varispeed wrote:
       | Is it better than M4?
       | 
       | If a laptop needs to be plugged in to deliver full performance,
       | whilst blasting fans at full throttle, what is the point? (apart
       | from server / workstation use, where you don't like macOS or need
       | a different OS)
        
         | PixyMisa wrote:
         | Price.
        
       ___________________________________________________________________
       (page generated 2025-07-26 23:00 UTC)