[HN Gopher] Pushing AMD's Infinity Fabric to Its Limit
___________________________________________________________________
Pushing AMD's Infinity Fabric to Its Limit
Author : klelatti
Score : 221 points
Date : 2024-11-24 20:39 UTC (1 day ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| cebert wrote:
| George's detailed analysis always impresses me. I'm amazed by
| his attention to detail.
| geerlingguy wrote:
| It's like Anandtech of old, though the articles usually lag
| product launches a little further. Probably due to lack of
| resources (in comparison to Anandtech at its height).
|
| I feel like I've learned a bit after every deep dive.
| ip26 wrote:
| He goes far deeper than I remember Anandtech going.
| IanCutress wrote:
| Just to highlight, this one's Chester :)
| AbuAssar wrote:
| Great deep dive into AMD's Infinity Fabric! The balance between
| bandwidth, latency, and clock speeds shows both clever
| engineering and limits under pressure. Makes me wonder how these
| trade-offs will evolve in future designs. Thoughts?
| Cumpiler69 wrote:
| IMHO these internal and external high speed interconnects
| will be more and more important in the future: as Moore's
| law is dying, GHz aren't increasing, and newer fab nodes are
| becoming monstrously expensive, connecting cheaper dies
| together is the only way to scale compute performance for
| consumer applications where cost matters. Apple did the same
| on the high end M chips.
|
| The only challenge is that SW also needs to be rewritten to
| use these new architectures efficiently, otherwise we see
| performance decreases instead of increases.
| sylware wrote:
| You would need fine-grained hardware configuration from the
| software based on that very software semantics and task. If
| ever possible in a shared hardware environment.
|
| Video game consoles with a shared GPU (for 3D) and CPU had
| to choose: favor the GPU with high bandwidth and high
| latency, or the CPU with low latency and lower bandwidth.
| Since a video game console is mostly GPU, they went for
| GDDR, namely high bandwidth with high latency.
|
| On Linux, you have alsa-lib, which handles sharing the
| audio device among the various applications. They had to
| choose a reasonable default hardware configuration for
| everyone: it is currently stereo 48kHz, and it is moving to
| the device's maximum number of channels, capped at 48kHz,
| keeping the left and right channels.
| Agingcoder wrote:
| Proper thread placement and NUMA handling have a massive
| impact on modern AMD CPUs - significantly more so than on
| Xeon systems. This might be anecdotal, but I've seen
| performance improve by 50% on some real-world workloads.
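|
| To make "thread placement" concrete, a minimal Linux sketch
| of pinning the calling thread to one core (core 0 is an
| arbitrary choice; real code would pick cores per NUMA node):
|
|     #define _GNU_SOURCE
|     #include <pthread.h>
|     #include <sched.h>
|     #include <stdio.h>
|
|     /* Pin the calling thread to one core so that, under
|        Linux's default first-touch policy, the memory it
|        allocates stays on the NUMA node it runs on. */
|     static int pin_to_core(int core)
|     {
|         cpu_set_t set;
|         CPU_ZERO(&set);
|         CPU_SET(core, &set);
|         return pthread_setaffinity_np(pthread_self(),
|                                       sizeof(set), &set);
|     }
|
|     int main(void)
|     {
|         if (pin_to_core(0) != 0) {
|             fprintf(stderr, "failed to pin thread\n");
|             return 1;
|         }
|         /* work and allocations from here stay local */
|         printf("pinned to core 0\n");
|         return 0;
|     }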
| hobs wrote:
| When I cared more about hardware configuration for
| databases on big virtual machine hosts, not configuring NUMA
| was an absolute performance killer - more than 50% on almost
| any hardware, because as soon as you left the socket the
| interconnect suuuuucked.
| bob1029 wrote:
| NUMA feels like a really big deal on AMD now.
|
| I recently refactored an evolutionary algorithm from
| Parallel.ForEach over one gigantic population to an isolated
| population+simulation per thread. The difference is so
| dramatic (100x+) that the loss of large-scale population
| dynamics seems to be more than offset by the number of
| iterations you can achieve per unit time.
|
| Communicating information between threads of execution should
| be assumed to be growing _more_ expensive (in terms of latency)
| as we head further in this direction. More threads is usually
| not the answer for most applications. Instead, we need to back
| up and review just how fast one thread can be when the
| dependent data is in the right place at the right time.
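|
| The code is .NET (Parallel.ForEach), but the shape of the
| refactor is roughly the following C sketch; the "mutate and
| score" step and all names are invented for illustration:
|
|     #include <pthread.h>
|     #include <stdio.h>
|
|     #define THREADS 16
|     #define POP     1024
|
|     /* Each thread owns its population and RNG outright:
|        no shared writes, no cross-CCD traffic in the hot
|        loop. */
|     struct island {
|         unsigned rng;     /* private RNG state */
|         double pop[POP];  /* private population */
|         double best;
|     };
|
|     static void *evolve(void *arg)
|     {
|         struct island *is = arg;
|         for (long gen = 0; gen < 100000; gen++) {
|             /* toy mutate-and-score, purely thread-local */
|             is->rng = is->rng * 1664525u + 1013904223u;
|             unsigned i = is->rng % POP;
|             is->pop[i] += ((int)((is->rng >> 16) & 0xff)
|                            - 128) * 1e-4;
|             if (is->pop[i] > is->best) is->best = is->pop[i];
|         }
|         return NULL;
|     }
|
|     int main(void)
|     {
|         static struct island isl[THREADS];
|         pthread_t t[THREADS];
|         for (int i = 0; i < THREADS; i++) {
|             isl[i].rng = 1u + i;
|             pthread_create(&t[i], NULL, evolve, &isl[i]);
|         }
|         for (int i = 0; i < THREADS; i++)
|             pthread_join(t[i], NULL);
|         /* aggregate only after the threads finish */
|         double best = isl[0].best;
|         for (int i = 1; i < THREADS; i++)
|             if (isl[i].best > best) best = isl[i].best;
|         printf("best: %f\n", best);
|         return 0;
|     }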
| bobmcnamara wrote:
| Is cross thread latency more expensive in time, or more
| expensive relative to things like local core throughput?
| bob1029 wrote:
| Time and throughput are inseparable quantities. I would
| interpret "local core throughput" as being the subclass of
| timing concerns wherein everything happens in a smaller
| physical space.
|
| I think a different way to restate the question would be:
| What are the categories of problems for which the time it
| takes to communicate cross-thread more than compensates for
| the loss of cache locality? How often does it make sense to
| run each thread ~100x slower so that we can leverage some
| aggregate state?
|
| The only headline use cases I can come up with for using
| more than <modest #> of threads are hosting VMs in the
| cloud and running simulations/rendering in an
| embarrassingly parallel manner. I don't think gaming
| benefits much beyond a certain point - humans have their
| own timing issues. Hosting a web app and ferrying the
| user's state between 10 different physical cores under an
| async call stack is likely not the best use of the
| computational resources, and this scenario will only get
| worse as inter-thread latency increases.
| Agingcoder wrote:
| Yes - I almost view the server as a small cluster in a box,
| with an internal network and the associated performance
| impact when you start going out of the box.
| majke wrote:
| This has puzzled me for a while. The cited system has 2x89.6
| GB/s bandwidth, but a single CCD can do at most 64GB/s of
| sequential reads. Are claims like "Apple Silicon having
| 400GB/s" meaningless? I understand a typical single logical
| CPU can't do more than 50-70GB/s, and it seems like a group
| of CPUs typically shares a memory controller which is
| similarly limited.
|
| To rephrase: is it possible to cause 100% memory bandwidth
| utilization with only 1 or 2 CPUs doing the work per CCD?
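|
| A rough way to probe this empirically is a single-thread
| streaming read over a buffer much larger than L3, run once
| per core count you care about; a sketch (buffer size and rep
| count are arbitrary; compile with -O3 so the loop
| vectorizes):
|
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <stdint.h>
|     #include <time.h>
|
|     /* Stream-read a 1 GiB buffer and report effective
|        read bandwidth. Run one copy pinned to one core,
|        then more copies on the same CCD, and watch whether
|        the total plateaus below the DRAM figure. */
|     int main(void)
|     {
|         size_t n = 1ULL << 27;  /* 128M x 8 B = 1 GiB */
|         uint64_t *buf = malloc(n * sizeof *buf);
|         if (!buf) return 1;
|         for (size_t i = 0; i < n; i++)
|             buf[i] = i;         /* touch pages first */
|
|         struct timespec t0, t1;
|         clock_gettime(CLOCK_MONOTONIC, &t0);
|         uint64_t sum = 0;
|         for (int rep = 0; rep < 8; rep++)
|             for (size_t i = 0; i < n; i++)
|                 sum += buf[i];
|         clock_gettime(CLOCK_MONOTONIC, &t1);
|         volatile uint64_t sink = sum;  /* keep loop alive */
|         (void)sink;
|
|         double s = (t1.tv_sec - t0.tv_sec)
|                  + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
|         printf("%.1f GB/s\n",
|                8.0 * n * sizeof *buf / s / 1e9);
|         free(buf);
|         return 0;
|     }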
| KeplerBoy wrote:
| Aren't those 400 GB/s a figure which only applies when the
| GPU with its much wider interface is accessing the memory?
| bobmcnamara wrote:
| That figure is at the memory controller.
|
| It applies as a maximum speed limit all the time, but it's
| unlikely that the CPU alone would cause the memory
| controller to reach it. It matters because latency
| increases whenever other bus controllers are competing for
| bandwidth, but I don't think Apple has documented the
| internal bus architecture or the performance counters
| necessary to see this.
| doctorpangloss wrote:
| Another POV is that maybe the max memory bandwidth figure is
| too vague to guide people optimizing libraries. It would be
| nice if Apple Silicon was as fast as "400GB/s" sounds.
| Grounded closer to reality, the parts are 65W.
| KeplerBoy wrote:
| But those 65 watts come with state-of-the-art FLOPS/watt.
| ryao wrote:
| On Zen 3, I am able to use nearly the full 51.2GB/sec from a
| single CPU core. I have not tried using two as I got so close
| to 51.2GB/sec that I had assumed that going higher was not
| possible. Off the top of my head, I got 49-50GB/sec, but I last
| measured a couple years ago.
|
| By the way, if the cores were able to load things at full
| speed, they would be able to use 640GB/sec each: 2 AVX-512
| loads per cycle x 64 bytes per load x 5GHz = 640GB/sec. Of
| course, they are never able to do this due to memory
| bottlenecks. Maybe Intel's Xeon Max
| series with HBM can, but I would not be surprised to see an
| unadvertised internal bottleneck there too. That said, it is so
| expensive and rare that few people will ever run code on one.
| buildbot wrote:
| People have studied the Xeon Max! Spoiler - yes, it's
| limited to ~23GB/s per core. It can't get anywhere close to
| the theoretical bandwidth of the HBM even with all cores
| active. It's a pretty bad design in my opinion.
|
| https://www.ixpug.org/images/docs/ISC23/McCalpin_SPR_BW_limi.
| ..
| electricshampo1 wrote:
| It is still an integer factor better in overall total BW
| than DDR5 SPR. I think they went for minimal investment and
| time to market for the SPR-with-HBM product rather than
| heavy investment to hit full BW utilization, which may have
| made sense for Intel overall given the business context.
| jeffbee wrote:
| There are large differences in load/store performance across
| implementations. On Apple Silicon, for example, a single M1
| Max core can stream about 100GB/s all by itself. This is a
| significant advantage over competing designs that are built
| to hit that kind of memory bandwidth only with all-cores
| workloads. For example, five generations of Intel Xeon
| processors, from Sandy Bridge through Skylake, were built to
| achieve about 20GB/s streams from a single core. That is one
| reason why the M1 was so exceptional at the time it was
| released: its 1T memory performance is much better than what
| you get from everyone else.
|
| As far as claims of the M1 Max having > 400GB/s of memory
| bandwidth, this isn't achievable from CPUs alone. You need all
| CPUs and GPUs running full tilt to hit that limit. In practice
| you can hit maybe 250GB/s from CPUs if you bring them all to
| bear, including the efficiency cores. This is still extremely
| good performance.
| majke wrote:
| I don't think a single M1 CPU can do 100GB/s. This source
| says
| 68GB/s peak: https://www.anandtech.com/show/16252/mac-mini-
| apple-m1-teste...
| wizzard0 wrote:
| btw, what's about as important is that in practice you
| don't need to write super clever code to do that - these
| 68GB/s are easy to reach with textbook code.
| jeffbee wrote:
| That's the plain M1. The Max can do a bit more. Same site
| since you favor it:
| https://www.anandtech.com/show/17024/apple-m1-max-
| performanc...
| majke wrote:
| > From a single core perspective, meaning from a single
| software thread, things are quite impressive for the
| chip, as it's able to stress the memory fabric to up to
| 102GB/s. This is extremely impressive and outperforms any
| other design in the industry by multiple factors, we had
| already noted that the M1 chip was able to fully saturate
| its memory bandwidth with a single core and that the
| bottleneck had been on the DRAM itself. On the M1 Max, it
| seems that we're hitting the limit of what a core can do
| - or more precisely, a limit to what the CPU cluster can
| do.
|
| Wow
| jmb99 wrote:
| > The cited system has 2x89.6 GB/s bandwidth.
|
| The following applies for certain only to the Zen4 system; I
| have no experience with Zen5.
|
| That is the theoretical max bandwidth of the DDR5 memory
| (/controller) running at 5600MT/s (roughly: 5600MT/s x 8
| bytes/T x 2 channels = 89.6GB/s). There is also a bandwidth
| limitation between the memory controller (IO die) and the
| cores themselves (CCDs), along the Infinity Fabric. Infinity
| Fabric runs at a different clock speed than the cores, their
| cache(s), and the memory controller; by default, 2/3 of the
| memory controller clock. So, if the Memory controller CLocK
| (MCLK) is 2800MHz (for 5600MT/s), the FCLK (Fabric CLocK)
| will run at 1866.66MHz. With 32 bytes per clock of read
| bandwidth, you get 59.7GB/s maximum sequential memory read
| bandwidth per CCD<->IOD interconnect.
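|
| For anyone who wants to sanity-check those numbers, the
| arithmetic above in a few lines of C (clocks and widths as
| just described):
|
|     #include <stdio.h>
|
|     int main(void)
|     {
|         double mt_s = 5600e6;            /* DDR5, T/s */
|         double dram = mt_s * 8 * 2 / 1e9;  /* 8 B/T x 2 ch */
|         double mclk = mt_s / 2;          /* controller clock */
|         double fclk = mclk * 2.0 / 3.0;  /* default FCLK */
|         double ccd  = fclk * 32 / 1e9;   /* 32 B/clk per link */
|
|         printf("DRAM ceiling      : %5.1f GB/s\n", dram);
|         printf("Per-CCD read link : %5.1f GB/s\n", ccd);
|         printf("At FCLK=2000MHz   : %5.1f GB/s\n",
|                2000e6 * 32 / 1e9);
|         return 0;  /* prints 89.6, 59.7, 64.0 */
|     }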
|
| Many systems (read: motherboard manufacturers) will
| overclock the FCLK when applying automatic overclocking
| (such as when selecting XMP/EXPO profiles), and I believe
| some EXPO profiles include an FCLK overclock as well. (Note
| that 5600MT/s RAM is overclocked; the fastest officially
| supported Zen4 memory speed is 5200MT/s, and most memory
| kits run at slower default speeds until overclocked with
| their built-in profiles.) In my experience, Zen4 will
| happily accept FCLK up to 2000MHz, while Zen4 Threadripper
| (7000 series) seems happy up to 2200MHz. This particular
| system has the FCLK overclocked to 2000MHz, which will hurt
| latency[0] (due to not being 2/3 of MCLK) but increase
| bandwidth: 2000MHz x 32 bytes/cycle = 64GB/s read
| bandwidth, as quoted in the article.
|
| First: these are theoretical maximums. Even the most
| "perfect" benchmark won't hit them, and if it does, there
| are other variables at play not being taken into account
| (likely lower-level caches). You will never, ever see
| theoretical maximum memory bandwidth in any real
| application.
|
| Second: no, it is not possible to see maximum memory bandwidth
| on Zen4 from only one CCD, assuming you have sufficiently fast
| DDR5 that the FCLK cannot be equal to the MCLK. This is an
| architecture limitation, although rarely hit in practice for
| most of the target market. A dual-CCD chip has sufficient
| fabric bandwidth to saturate the memory before the Infinity
| Fabric becomes the bottleneck (but as alluded to in the
| article, unless tuned incredibly well, you'll likely run
| into contention issues and either hit a latency or
| bandwidth wall in real applications).
| My quad-CCD Threadripper can achieve nearly 300GB/s (4 CCDs
| x 2200MHz x 32 bytes = ~282GB/s), due to having 8
| (technically 16) DDR5 channels operating at 5800MT/s and
| FCLK at 2200MHz; I would need an octo-CCD chip to achieve
| maximum memory bandwidth utilization (the DRAM side is good
| for ~371GB/s).
|
| Third: no, claims like "Apple Silicon having 400GB/s" are not
| meaningless. Those numbers are achieved the exact same way as
| above, and the same way Nvidia determines their maximum memory
| bandwidth on their GPUs. Platform differences (especially CPU
| vs GPU, but even CPU vs CPU since Apple, AMD, and Intel all
| have very different topologies) make the numbers incomparable
| to each other directly. As an example, Apple Silicon can
| probably achieve higher per-core memory bandwidth than Zen4 (or
| 5), but also shares bandwidth with the GPU; this may not be
| great for gaming applications, for instance, where memory
| bandwidth requirements will be high for both the CPU and GPU,
| but may be fine for ML inference since the CPU sits mostly idle
| while the GPU does most of the work.
|
| [0] I'm surprised the author didn't mention this. I can only
| assume they didn't know about it, and haven't tested other
| frequencies or read much on the overclocking forums about
| Zen4. Which is fair enough; it's a very complicated topic
| with a lot of hidden nuances.
| bpye wrote:
| > Note that 5600MT/s RAM is overclocked; the fastest
| officially supported Zen4 memory speed is 5200MT/s
|
| This specifically did change in Zen 5; the max supported is
| now 5600MT/s.
___________________________________________________________________
(page generated 2024-11-25 23:02 UTC)