[HN Gopher] Pushing AMD's Infinity Fabric to Its Limit
       ___________________________________________________________________
        
       Pushing AMD's Infinity Fabric to Its Limit
        
       Author : klelatti
       Score  : 221 points
       Date   : 2024-11-24 20:39 UTC (1 days ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | cebert wrote:
        | George's detailed analysis always impresses me. I'm amazed by
        | his attention to detail.
        
         | geerlingguy wrote:
         | It's like Anandtech of old, though the articles usually lag
         | product launches a little further. Probably due to lack of
         | resources (in comparison to Anandtech at its height).
         | 
         | I feel like I've learned a bit after every deep dive.
        
           | ip26 wrote:
           | He goes far deeper than I remember Anandtech going.
        
         | IanCutress wrote:
         | Just to highlight, this one's Chester :)
        
       | AbuAssar wrote:
       | Great deep dive into AMD's Infinity Fabric! The balance between
       | bandwidth, latency, and clock speeds shows both clever
       | engineering and limits under pressure. Makes me wonder how these
       | trade-offs will evolve in future designs. Thoughts?
        
         | Cumpiler69 wrote:
          | IMHO these internal and external high-speed interconnects will
          | become more and more important in the future: Moore's law is
          | dying, clock speeds aren't increasing, and newer fab nodes are
          | becoming monstrously expensive, so connecting cheaper dies
          | together is the only way to scale compute performance for
          | consumer applications where cost matters. Apple did the same
          | on the high-end M chips.
          | 
          | The only challenge is that the software also needs to be
          | rewritten to use these new architectures efficiently, otherwise
          | we see performance decreases instead of increases.
        
           | sylware wrote:
            | You would need fine-grained hardware configuration driven by
            | the software itself, based on that software's semantics and
            | workload - if that is even possible in a shared hardware
            | environment.
            | 
            | Video game consoles with a CPU and GPU (for 3D) sharing
            | memory had to choose: favor the GPU with high bandwidth and
            | high latency, or the CPU with low latency and lower
            | bandwidth. Since a video game console is mostly about the
            | GPU, they went for GDDR, namely high bandwidth with high
            | latency.
            | 
            | On Linux, you have alsa-lib, which handles sharing the audio
            | device among the various applications. They had to choose a
            | reasonable default hardware configuration for everyone: it is
            | currently stereo 48kHz, and it is moving to the maximum
            | number of channels at up to 48kHz, with left and right
            | channels.
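            | 
            | For a concrete picture, a minimal sketch of requesting that
            | plain stereo 48kHz configuration from alsa-lib (the "default"
            | device and the 500ms buffer are just example choices; build
            | with -lasound):
            | 
            |   #include <alsa/asoundlib.h>
            | 
            |   int main(void) {
            |       snd_pcm_t *pcm;
            |       if (snd_pcm_open(&pcm, "default",
            |                        SND_PCM_STREAM_PLAYBACK, 0) < 0)
            |           return 1;
            |       /* S16LE, 2 ch, 48000 Hz, resample ok, 500ms buffer */
            |       if (snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE,
            |               SND_PCM_ACCESS_RW_INTERLEAVED,
            |               2, 48000, 1, 500000) < 0)
            |           return 1;
            |       /* ... write frames with snd_pcm_writei() ... */
            |       snd_pcm_close(pcm);
            |       return 0;
            |   }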
        
       | Agingcoder wrote:
        | Proper thread placement and NUMA handling have a massive impact
        | on modern AMD CPUs - significantly more so than on Xeon systems.
        | This might be anecdotal, but I've seen performance improve by
        | 50% on some real-world workloads.
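        | 
        | Most of that win comes from keeping each worker and its data on
        | the same node. A minimal sketch, assuming Linux with libnuma
        | (node 0 is just an example; build with -lnuma):
        | 
        |     #include <numa.h>
        |     #include <stdio.h>
        |     #include <string.h>
        | 
        |     int main(void) {
        |         if (numa_available() < 0) {
        |             fprintf(stderr, "no NUMA support\n");
        |             return 1;
        |         }
        |         int node = 0;             /* node owning this worker  */
        |         numa_run_on_node(node);   /* pin to that node's cores */
        |         size_t len = 1UL << 28;   /* 256 MiB working set      */
        |         char *buf = numa_alloc_onnode(len, node);
        |         memset(buf, 1, len);      /* fault pages in locally   */
        |         /* ... run the workload; accesses stay on-node ... */
        |         numa_free(buf, len);
        |         return 0;
        |     }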
        
         | hobs wrote:
          | When I cared more about hardware configuration for databases
          | on big virtual machine hosts, not configuring NUMA was an
          | absolute performance killer - more than a 50% hit on almost
          | any hardware, because as soon as you left the socket the
          | interconnect suuuuucked.
        
         | bob1029 wrote:
         | NUMA feels like a really big deal on AMD now.
         | 
          | I recently refactored an evolutionary algorithm from
          | Parallel.ForEach over one gigantic population to an isolated
          | population+simulation per thread. The difference is so dramatic
          | (100x+) that the loss of large-scale population dynamics seems
          | to be more than offset by the # of iterations you can achieve
          | per unit time.
         | 
         | Communicating information between threads of execution should
         | be assumed to be growing _more_ expensive (in terms of latency)
         | as we head further in this direction. More threads is usually
         | not the answer for most applications. Instead, we need to back
         | up and review just how fast one thread can be when the
         | dependent data is in the right place at the right time.
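          | 
          | The shape of the refactor, as a rough C sketch rather than the
          | actual Parallel.ForEach-based code: each thread owns its own
          | population and RNG, shares nothing while running, and results
          | are merged once at the end (build with cc -O2 -pthread):
          | 
          |     #include <pthread.h>
          |     #include <stdio.h>
          |     #include <stdlib.h>
          | 
          |     #define THREADS 16
          |     #define POP     4096
          | 
          |     struct island {          /* one per thread, no sharing */
          |         unsigned seed;       /* private RNG state          */
          |         double   pop[POP];   /* private population         */
          |         double   best;       /* merged after join          */
          |     };
          | 
          |     static void *evolve(void *arg) {
          |         struct island *is = arg;
          |         is->best = -1e300;
          |         for (long gen = 0; gen < 1000000; gen++) {
          |             int i = rand_r(&is->seed) % POP;
          |             is->pop[i] +=
          |                 rand_r(&is->seed) / (double)RAND_MAX - 0.5;
          |             if (is->pop[i] > is->best) is->best = is->pop[i];
          |         }
          |         return NULL;
          |     }
          | 
          |     int main(void) {
          |         static struct island isl[THREADS];
          |         pthread_t tid[THREADS];
          |         for (int t = 0; t < THREADS; t++) {
          |             isl[t].seed = 1234u + t;
          |             pthread_create(&tid[t], NULL, evolve, &isl[t]);
          |         }
          |         double best = -1e300;
          |         for (int t = 0; t < THREADS; t++) {
          |             pthread_join(tid[t], NULL);
          |             if (isl[t].best > best) best = isl[t].best;
          |         }
          |         printf("best = %f\n", best);
          |         return 0;
          |     }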
        
           | bobmcnamara wrote:
           | Is cross thread latency more expensive in time, or more
           | expensive relative to things like local core throughput?
        
             | bob1029 wrote:
             | Time and throughput are inseparable quantities. I would
             | interpret "local core throughput" as being the subclass of
             | timing concerns wherein everything happens in a smaller
             | physical space.
             | 
             | I think a different way to restate the question would be:
             | What are the categories of problems for which the time it
             | takes to communicate cross-thread more than compensates for
             | the loss of cache locality? How often does it make sense to
             | run each thread ~100x slower so that we can leverage some
             | aggregate state?
             | 
              | The only headline use cases I can come up with for using
              | more than <modest #> of threads are hosting VMs in the
              | cloud and running simulations/rendering in an
              | embarrassingly parallel manner. I don't think gaming
              | benefits much beyond a certain point - humans have their
              | own timing issues.
             | Hosting a web app and ferrying the user's state between 10
             | different physical cores under an async call stack is
             | likely not the most ideal use of the computational
             | resources, and this scenario will further worsen as inter-
             | thread latency increases.
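              | 
              | As a toy sketch of that trade-off (assuming gcc/clang with
              | -O2 -pthread; the counts are arbitrary): the same number of
              | increments done against one shared atomic, where every core
              | fights over a single cache line, versus padded per-thread
              | counters merged at the end. Timing the two runs with
              | time(1) or perf shows the gap:
              | 
              |   #include <pthread.h>
              |   #include <stdatomic.h>
              |   #include <stdio.h>
              | 
              |   #define T 8
              |   #define N 10000000L
              | 
              |   static atomic_long shared;
              |   static struct {            /* one line per thread */
              |       atomic_long v;
              |       char pad[56];
              |   } local[T];
              | 
              |   static void *hot(void *arg) {   /* line bounces   */
              |       (void)arg;
              |       for (long i = 0; i < N; i++)
              |           atomic_fetch_add(&shared, 1);
              |       return NULL;
              |   }
              | 
              |   static void *cold(void *arg) {  /* stays local    */
              |       long id = (long)arg;
              |       for (long i = 0; i < N; i++)
              |           atomic_fetch_add(&local[id].v, 1);
              |       return NULL;
              |   }
              | 
              |   int main(int argc, char **argv) {
              |       pthread_t t[T];
              |       void *(*fn)(void *) = argc > 1 ? cold : hot;
              |       for (long i = 0; i < T; i++)
              |           pthread_create(&t[i], NULL, fn, (void *)i);
              |       long sum = 0;
              |       for (int i = 0; i < T; i++) {
              |           pthread_join(t[i], NULL);
              |           sum += local[i].v;
              |       }
              |       printf("%ld\n", shared + sum);
              |       return 0;
              |   }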
        
           | Agingcoder wrote:
            | Yes - I almost view the server as a small cluster in a box,
            | with an internal network and the associated performance
            | impact when you start going off-box.
        
       | majke wrote:
        | This has puzzled me for a while. The cited system has 2x89.6 GB/s
        | bandwidth. But a single CCD can do at most 64GB/s of sequential
        | reads. Are claims like "Apple Silicon having 400GB/s"
        | meaningless? I understand a typical single logical CPU can't do
        | more than 50-70GB/s, and it seems like a group of CPUs typically
        | shares a memory controller which is similarly limited.
        | 
        | To rephrase: is it possible to reach 100% memory bandwidth
        | utilization with only 1 or 2 CPUs doing the work per CCD?
        
         | KeplerBoy wrote:
          | Aren't those 400 GB/s a figure which only applies when the GPU,
          | with its much wider interface, is accessing the memory?
        
           | bobmcnamara wrote:
           | That figure is at the memory controller.
           | 
            | It applies as a maximum speed limit all the time, but it's
            | unlikely that a CPU alone would cause the memory controller
            | to reach it. Why it matters is that latency increases
            | whenever other bus controllers are competing for bandwidth,
            | but I don't think Apple has documented their internal bus
            | architecture or the performance counters necessary to see
            | how.
        
           | doctorpangloss wrote:
           | Another POV is that maybe the max memory bandwidth figure is
           | too vague to guide people optimizing libraries. It would be
           | nice if Apple Silicon was as fast as "400GB/s" sounds.
           | Grounded closer to reality, the parts are 65W.
        
             | KeplerBoy wrote:
              | But those 65 watts deliver state-of-the-art FLOPS per watt.
        
         | ryao wrote:
         | On Zen 3, I am able to use nearly the full 51.2GB/sec from a
         | single CPU core. I have not tried using two as I got so close
         | to 51.2GB/sec that I had assumed that going higher was not
         | possible. Off the top of my head, I got 49-50GB/sec, but I last
         | measured a couple years ago.
         | 
         | By the way, if the cores were able to load things at full
         | speed, they would be able to use 640GB/sec each. That is 2
         | AVX-512 loads per cycle at 5GHz. Of course, they never are able
         | to do this due to memory bottlenecks. Maybe Intel's Xeon Max
         | series with HBM can, but I would not be surprised to see an
         | unadvertised internal bottleneck there too. That said, it is so
         | expensive and rare that few people will ever run code on one.
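          | 
          | The measurement itself is just a timed sequential sweep over a
          | buffer much larger than the caches. A rough sketch along those
          | lines (not the exact benchmark used; build with e.g.
          | gcc -O3 -march=native):
          | 
          |     #include <stdint.h>
          |     #include <stdio.h>
          |     #include <stdlib.h>
          |     #include <time.h>
          | 
          |     #define BYTES (1UL << 30)   /* 1 GiB, >> any cache */
          |     #define REPS  10
          | 
          |     int main(void) {
          |         size_t n = BYTES / sizeof(uint64_t);
          |         uint64_t *buf = malloc(BYTES);
          |         if (!buf) return 1;
          |         for (size_t i = 0; i < n; i++)
          |             buf[i] = i;                 /* fault pages in */
          | 
          |         struct timespec t0, t1;
          |         clock_gettime(CLOCK_MONOTONIC, &t0);
          |         uint64_t sum = 0;
          |         for (int r = 0; r < REPS; r++)
          |             for (size_t i = 0; i < n; i++)
          |                 sum += buf[i];          /* streaming reads */
          |         clock_gettime(CLOCK_MONOTONIC, &t1);
          | 
          |         double s = (t1.tv_sec - t0.tv_sec)
          |                  + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          |         printf("%.1f GB/s (sum %lu)\n",
          |                REPS * (double)BYTES / s / 1e9,
          |                (unsigned long)sum);
          |         free(buf);
          |         return 0;
          |     }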
        
           | buildbot wrote:
            | People have studied the Xeon Max! Spoiler - yes, it's limited
            | to ~23GB/s per core, and it can't get anywhere close to the
            | theoretical bandwidth of the HBM even with all cores active.
            | It's a pretty bad design in my opinion.
           | 
           | https://www.ixpug.org/images/docs/ISC23/McCalpin_SPR_BW_limi.
           | ..
        
             | electricshampo1 wrote:
              | It still has integer-factor better overall total BW than
              | DDR5 SPR; I think they went for minimal investment and time
              | to market for the SPR w/ HBM product rather than heavy
              | investment to hit full BW utilization. Which may have made
              | sense for Intel overall, given the business context etc.
        
         | jeffbee wrote:
          | There are large differences in load/store performance across
          | implementations. On Apple Silicon, for example the M1 Max, a
          | single core can stream about 100GB/s all by itself. This is a
          | significant advantage over competing designs that are built to
          | hit that kind of memory bandwidth only with all-cores
          | workloads. For example, five generations of Intel Xeon
          | processors, from Sandy Bridge through Skylake, were built to
          | achieve about 20GB/s streams from a single core. That is one
          | reason why the M1 was so exceptional at the time it was
          | released. The 1T memory performance is much better than what
          | you get from everyone else.
         | 
         | As far as claims of the M1 Max having > 400GB/s of memory
         | bandwidth, this isn't achievable from CPUs alone. You need all
         | CPUs and GPUs running full tilt to hit that limit. In practice
         | you can hit maybe 250GB/s from CPUs if you bring them all to
         | bear, including the efficiency cores. This is still extremely
         | good performance.
        
           | majke wrote:
            | I don't think a single M1 CPU core can do 100GB/s. This
            | source says 68GB/s peak:
            | https://www.anandtech.com/show/16252/mac-mini-
            | apple-m1-teste...
        
             | wizzard0 wrote:
              | btw, what's about as important is that in practice you
              | don't need to write super clever code to do that; these
              | 68GB/s are easy to reach with textbook code, without any
              | cleverness.
        
             | jeffbee wrote:
             | That's the plain M1. The Max can do a bit more. Same site
             | since you favor it:
             | https://www.anandtech.com/show/17024/apple-m1-max-
             | performanc...
        
               | majke wrote:
               | > From a single core perspective, meaning from a single
               | software thread, things are quite impressive for the
               | chip, as it's able to stress the memory fabric to up to
               | 102GB/s. This is extremely impressive and outperforms any
               | other design in the industry by multiple factors, we had
               | already noted that the M1 chip was able to fully saturate
               | its memory bandwidth with a single core and that the
               | bottleneck had been on the DRAM itself. On the M1 Max, it
               | seems that we're hitting the limit of what a core can do
               | - or more precisely, a limit to what the CPU cluster can
               | do.
               | 
               | Wow
        
         | jmb99 wrote:
         | > The cited system has 2x89.6 GB/s bandwidth.
         | 
         | The following applies for certain only to the Zen4 system; I
         | have no experience with Zen5.
         | 
          | That is the theoretical max bandwidth of the DDR5 memory
          | (/controller) running at 5600 MT/s (roughly: 5600MT/s x 2
          | channels x 64 bits/T / 8 = 89.6GB/s). There is also a bandwidth
          | limitation between the memory controller (IO die) and the cores
          | themselves (CCDs), along the Infinity Fabric. Infinity Fabric
          | runs at a different clock speed than the cores, their cache(s),
          | and the memory controller; by default, 2/3 of the memory
          | controller clock. So, if the Memory controller's CLocK (MCLK)
          | is 2800MHz (for 5600MT/s), the FCLK (infinity Fabric CLocK)
          | will run at 1866.66MHz. With 32 bytes per clock of read
          | bandwidth, you get 59.7GB/s maximum sequential memory read
          | bandwidth per CCD<->IOD interconnect.
         | 
         | Many systems (read: motherboard manufacturers) will overclock
         | the FCLK when applying automatic overclocking (such as when
          | selecting XMP/EXPO profiles), and I believe some EXPO profiles
         | include overclocking the FCLK as well. (Note that 5600MT/s RAM
         | is overclocked; the fastest officially supported Zen4 memory
         | speed is 5200MT/s, and most memory kits are 3600MT/s or less
         | until overclocked with their built-in profiles.) In my
         | experience, Zen4 will happily accept FCLK up to 2000MHz, while
         | Zen4 Threadripper (7000 series) seems happy up to 2200MHz. This
         | particular system has the FCLK overclocked to 2000MHz, which
         | will hurt latency[0] (due to not being 2/3 of MCLK) but
         | increase bandwidth. 2000MHz x 32 bytes/cycle = 64GB/s read
         | bandwidth, as quoted in the article.
         | 
         | First: these are theoretical maximums. Even the most "perfect"
         | benchmark won't hit these, and if they do, there are other
         | variables at play not being taken into account (likely lower
         | level caches). You will never, ever see theoretical maximum
         | memory bandwidth in any real application.
         | 
         | Second: no, it is not possible to see maximum memory bandwidth
         | on Zen4 from only one CCD, assuming you have sufficiently fast
         | DDR5 that the FCLK cannot be equal to the MCLK. This is an
         | architecture limitation, although rarely hit in practice for
         | most of the target market. A dual-CCD chip has sufficient
         | memory bandwidth to saturate the memory before the Infinity
         | Fabric (but as alluded to in the article, unless tuned
         | incredibly well, you'll likely run into contention issues and
         | either hit a latency or bandwidth wall in real applications).
         | My quad-CCD Threadripper can achieve nearly 300GB/s, due to
         | having 8 (technically 16) DDR5 channels operating at 5800MT/s
         | and FCLK at 2200MHz; I would need an octo-CCD chip to achieve
         | maximum memory bandwidth utilization.
         | 
          | Third: no, claims like "Apple Silicon having 400GB/s" are not
         | meaningless. Those numbers are achieved the exact same way as
         | above, and the same way Nvidia determines their maximum memory
         | bandwidth on their GPUs. Platform differences (especially CPU
         | vs GPU, but even CPU vs CPU since Apple, AMD, and Intel all
         | have very different topologies) make the numbers incomparable
         | to each other directly. As an example, Apple Silicon can
         | probably achieve higher per-core memory bandwidth than Zen4 (or
         | 5), but also shares bandwidth with the GPU; this may not be
         | great for gaming applications, for instance, where memory
         | bandwidth requirements will be high for both the CPU and GPU,
         | but may be fine for ML inference since the CPU sits mostly idle
         | while the GPU does most of the work.
         | 
          | [0] I'm surprised the author didn't mention this. I can only
          | assume they didn't know this, and haven't tested other
          | frequencies or read much on the overclocking forums about Zen4.
         | Which is fair enough, it's a very complicated topic with a lot
         | of hidden nuances.
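          | 
          | To keep the arithmetic above in one place, a throwaway snippet
          | with the same assumptions (DDR5-5600, 2 channels, 32 bytes per
          | FCLK cycle per CCD link, FCLK at 2/3 of MCLK or overclocked to
          | 2000MHz):
          | 
          |     #include <stdio.h>
          | 
          |     int main(void) {
          |         double mts    = 5600e6;      /* DDR5-5600           */
          |         double dram   = 2 * 8 * mts; /* 2 ch x 8 B/transfer */
          |         double mclk   = mts / 2;     /* 2800 MHz            */
          |         double fclk   = mclk * 2/3;  /* default ratio       */
          |         double fclkoc = 2000e6;      /* overclocked FCLK    */
          | 
          |         printf("DRAM peak:      %.1f GB/s\n", dram / 1e9);
          |         printf("CCD link (2/3): %.1f GB/s\n",
          |                32 * fclk / 1e9);
          |         printf("CCD link (OC):  %.1f GB/s\n",
          |                32 * fclkoc / 1e9);
          |         return 0;
          |     }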
        
           | bpye wrote:
           | > Note that 5600MT/s RAM is overclocked; the fastest
           | officially supported Zen4 memory speed is 5200MT/s
           | 
            | This specifically did change in Zen 5; the max supported is
            | now 5600MT/s.
        
       ___________________________________________________________________
       (page generated 2024-11-25 23:02 UTC)