[HN Gopher] Test Results for AMD Zen 5
___________________________________________________________________
Test Results for AMD Zen 5
Author : matt_d
Score : 173 points
Date : 2025-07-26 18:44 UTC (4 hours ago)
(HTM) web link (www.agner.org)
(TXT) w3m dump (www.agner.org)
| eigenform wrote:
| This reminds me: has anyone ever figured out why Zen 3 was
| missing memory renaming, but it came back in Zen 4 and Zen 5?
| Tuna-Fish wrote:
| AMD had two leapfrogging CPU design teams. Memory renaming was
| added by the team that did Zen2; presumably the Zen3 team
| couldn't import it in time for some reason.
| alberth wrote:
| While an interesting read, the title is a bit misleading since I
| didn't see any actual "test results" in the post.
| ooopdddddd wrote:
| The detailed results are in the links at the bottom of the
| post.
| Someone wrote:
| AMD's documentation for the CPU may or may not state such
| things as "There are six integer ALUs, four address generation
| units, three branch units, four vector ALUs, and two vector
| read/write units", but even if it does, Agnes Fog runs actual
| code to check that, and often discovers corner cases that the
| official documentation doesn't mention.
|
| So, he black-box tests the CPU to try to discover its innards.
| titanomachy wrote:
| > Agnes Fog
|
| Agner
| djoldman wrote:
| They are linked at the bottom of Mr. Fog's post. For example on
| page 142 of this:
|
| https://www.agner.org/optimize/instruction_tables.pdf
| ashvardanian wrote:
| > All vector units have full 512 bits capabilities except for
| memory writes. A 512-bit vector write instruction is executed as
| two 256-bit writes.
|
| That sounds like a weird design choice. Curious if this will
| affect memcpy-heavy workloads.
|
| Writes aside, Zen5 is taking much longer to roll out than I
| thought, and some of AMD's positioning is (almost expectedly)
| misleading, especially around AI.
|
| AMD's website claims Zen5 is the "Leading CPU for AI"
| (<https://www.amd.com/en/products/processors/server/epyc/ai.ht...>),
| but I strongly doubt that. First, they compare Zen5 (9965), which
| is still largely unavailable, to Xeon2 (8280), a processor two
| generations older. Xeon4 is abundantly available and comes with
| AMX, a feature exclusive to Intel. I doubt AVX-512 support with a
| 512-bit physical path and even twice as many cores will be enough
| to compete with that (if we consider just the ALU throughput
| rather than the overall system & memory).
| dragontamer wrote:
| Well, when you consider that AVX-512 instructions have 2 or 3
| reads per 1 write, there's a degree of sense here.
|
| Consider the standard matrix multiplication primitive, the FMAC
| (multiply and accumulate): 3 reads and one write, if I'm
| counting correctly (Output = A * B + C, three reads, one
| output).
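|
| A minimal sketch of the counting argument above, in Python. The
| names fma and dot_with_counts are hypothetical, and the counters
| track instruction operands (register reads and writes), not actual
| memory traffic, so this only illustrates the 3:1 ratio.

```python
def fma(a, b, c):
    """Multiply-accumulate: three input operands, one result."""
    return a * b + c


def dot_with_counts(xs, ys):
    """Accumulate a dot product with FMA steps and count operand traffic.

    Returns (result, operand_reads, operand_writes).
    """
    acc = 0.0
    reads = 0
    writes = 0
    for x, y in zip(xs, ys):
        acc = fma(x, y, acc)  # reads x, y, acc; writes acc
        reads += 3
        writes += 1
    return acc, reads, writes


if __name__ == "__main__":
    result, reads, writes = dot_with_counts([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
    print(result, reads, writes)
```

| Every step reads three operands for each one it writes, which is
| the imbalance the comment points at.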
| rpiguy wrote:
| It may be easier for the memory controller to schedule two
| narrower writes than to wait for one 512-bit block, or perhaps
| they just didn't substantially update the memory controller and
| so it still has to operate as it did in Zen 4.
| arrakark wrote:
| Cache-line bursts/beats tend to be standardized to 64B in lots
| of NoC architectures.
| Dylan16807 wrote:
| "Network on Chip" okay got it.
| ryao wrote:
| AMD CPUs tend to have more memory bandwidth than Intel CPUs and
| inference is memory-bandwidth bound, so their claim seems
| accurate to me.
|
| Whether the core does a 512-bit write in 1 cycle or 2 because
| it is two 256-bit writes is immaterial. Memory bandwidth is
| bottlenecked by 64GB/sec per CCX. You need to use cores from
| multiple CCXs to get full bandwidth.
|
| That said, the EPYC 9175F has 614.4GB/sec memory bandwidth and
| should be able to use all of it. I have one, although the
| machine is not yet assembled (Supermicro took 7 weeks to send
| me a motherboard, which delayed assembly), so I have not
| confirmed that it can use all of it yet.
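|
| As a rough worked example using only the two figures quoted in this
| comment (64GB/sec per CCX and 614.4GB/sec total), the sketch below
| estimates how many CCXs would have to be active to reach the full
| bandwidth. The function name ccxs_needed is hypothetical, and the
| model assumes the per-CCX limit is the only bottleneck.

```python
import math

PER_CCX_GBPS = 64.0   # per-CCX limit quoted in the comment
TOTAL_GBPS = 614.4    # total memory bandwidth quoted in the comment


def ccxs_needed(total_gbps, per_ccx_gbps):
    """Smallest number of CCXs whose combined limit covers total_gbps."""
    return math.ceil(total_gbps / per_ccx_gbps)


if __name__ == "__main__":
    print(ccxs_needed(TOTAL_GBPS, PER_CCX_GBPS))
```

| Under these assumptions, 614.4 / 64 = 9.6, so at least ten CCXs
| would need to be pulling data at once.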
| pbsd wrote:
| Vector ALU instruction latencies are understandably listed as 2
| and higher, but this is not strictly the case. From AMD's Zen 5
| optimization manual [1], we have:
|
| > The floating point schedulers have a slow region, in the oldest
| entries of a scheduler and only when the scheduler is full. If an
| operation is in the slow region and it is dependent on a 1-cycle
| latency operation, it will see a 1 cycle latency penalty. There is
| no penalty for operations in the slow region that depend on longer
| latency operations or loads. There is no penalty for any
| operations in the fast region. To write a latency test that does
| not see this penalty, the test needs to keep the FP schedulers
| from filling up. The latency test could interleave NOPs to prevent
| the scheduler from filling up.
|
| Basically, short vector code sequences that don't fill up the
| scheduler will have better latency.
|
| [1]
| https://www.amd.com/content/dam/amd/en/documents/processor-t...
| vhcr wrote:
| https://web.archive.org/web/20250726202105/https://www.agner...
| londons_explore wrote:
| > Integer vector instructions and floating point vector
| instructions now have the same latencies.
|
| There is very little reason to use integers for anything anymore.
| Loop counter? Why not make it a double - you never know when you
| might need an extra 0.5 loops at the end!
| bee_rider wrote:
| Finally we can implement BiCGStab intuitively!
| Intralexical wrote:
| Integers aren't for performance. They're for precision
| (anything financial for example) and occasionally size.
| sushevff wrote:
| Totally. Can't wait to access the 18463.637th record in my
| database plus or minus a record or thousand.
| vhcr wrote:
| Doubles can represent integers exactly up to 2^53
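|
| A quick way to check where exact integer representation ends: an
| IEEE 754 double has a 53-bit significand (52 stored bits plus an
| implicit leading bit), so every integer up to 2^53 round-trips, and
| 2^53 + 1 is the first one that does not. The helper name is_exact
| is hypothetical; Python's float is a double on common platforms.

```python
LIMIT = 2 ** 53  # 53-bit significand: 52 stored bits plus 1 implicit bit


def is_exact(n):
    """True if integer n survives a round trip through a double."""
    return int(float(n)) == n


if __name__ == "__main__":
    for n in (LIMIT - 1, LIMIT, LIMIT + 1, LIMIT + 2):
        print(n, is_exact(n))
```

| Above the limit only some integers survive (even ones at first),
| which is why a double makes a poor record index at that scale.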
| varispeed wrote:
| Is it better than M4?
|
| If a laptop needs to be plugged in to deliver full performance,
| whilst blasting fans at full throttle, what is the point? (apart
| from server / workstation use, or where you don't like macOS or
| need a different OS)
| PixyMisa wrote:
| Price.
___________________________________________________________________
(page generated 2025-07-26 23:00 UTC)