Memory bandwidth
April 11, 2017

Absolute memory bandwidth figures tend to look fairly large, especially for GPUs. This is deceptive. It's much more useful to relate memory bandwidth to, say, the number of clock cycles or instructions being executed, to get a feel for what you can (and can't) get away with.

Let's start with a historical example: the MOS 6502, first released in 1975 (42 years ago) and one of the key chips in the microcomputer revolution. A 6502 was typically clocked at 1MHz and did a 1-byte memory access essentially every clock cycle, which are nice round figures to use as a baseline. A typical 6502 instruction took 3-5 cycles; some instructions with more complex addressing modes took longer, a few were quicker, and there was some overlap of the fetch of the next instruction with execution of the current one, but no full pipelining like you'd see in later (and more complex) workstation and then microcomputer CPUs, starting around the mid-80s. That gives us a baseline of 1 byte/cycle and, let's say, about 4 bytes/instruction of memory bandwidth on a 40-year-old CPU. A large fraction of that bandwidth went simply into fetching instruction bytes.

Next, let's look at a recent (as of this writing) and relatively high-end desktop CPU. An Intel Core i7-7700K has about 50GB/s of memory bandwidth and 4 cores, so if all 4 cores are under equal load, each gets about 12.5GB/s. The cores also clock at about 4.2GHz (it's safe to assume that with all 4 cores active and hitting memory, none of them is going to be in "turbo boost" mode), so they come in just under 3 bytes per cycle of memory bandwidth. Code that runs OK-ish on that CPU averages around 1 instruction per cycle, well-optimized code around 3 instructions per cycle. So well-optimized code running with all cores busy has about 1 byte/instruction of available memory bandwidth.

Note that we're 40 years of Moore's law scaling later, and the available memory bandwidth per instruction has gone down substantially. And while the 6502 is an 8-bit microprocessor doing 8-bit operations, these modern cores can execute multiple (again, usually up to three) 256-bit SIMD operations in one cycle; if we treat the CPU like a GPU and count each 32-bit vector lane as a separate "thread" (appropriate when running SIMT/SPMD-style code), then we get 24 "instructions" executed per cycle and a memory bandwidth of about 0.125 bytes per cycle per "SIMT thread", or, less unwieldy, one byte every 8 "instructions".
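If you want to play with these numbers, here is the back-of-envelope arithmetic above as a small Python sketch. The inputs are just the round figures quoted in the text (clock rates, bandwidth, instructions per cycle), not measurements, so treat the outputs as the same rough estimates:

```python
# Back-of-envelope check of the CPU figures quoted above.
# All inputs are the rough numbers from the text, not measured values.

# MOS 6502 (1975): ~1 MHz, ~1 byte accessed per cycle, ~4 cycles/instruction.
mos6502_bytes_per_cycle = 1.0
mos6502_bytes_per_insn  = 4.0 * mos6502_bytes_per_cycle   # ~4 bytes/instruction

# Intel Core i7-7700K: ~50 GB/s of memory bandwidth shared by 4 cores at ~4.2 GHz.
bw_per_core     = 50e9 / 4              # ~12.5 GB/s per core under equal load
bytes_per_cycle = bw_per_core / 4.2e9   # just under 3 bytes/cycle per core

# Well-optimized code at ~3 instructions/cycle -> about 1 byte/instruction.
bytes_per_insn = bytes_per_cycle / 3.0

# SIMT-style view: up to 3 x 256-bit SIMD ops per cycle = 24 32-bit lanes busy.
lanes_per_cycle = 3 * (256 // 32)                    # 24 "instructions"/cycle
bytes_per_lane  = bytes_per_cycle / lanes_per_cycle  # ~0.125 bytes/cycle per lane

print(f"6502:     {mos6502_bytes_per_cycle:.1f} B/cycle, "
      f"{mos6502_bytes_per_insn:.1f} B/instruction")
print(f"i7-7700K: {bytes_per_cycle:.2f} B/cycle/core, "
      f"{bytes_per_insn:.2f} B/instruction, "
      f"{bytes_per_lane:.3f} B/cycle per SIMT lane "
      f"(one byte every {1 / bytes_per_lane:.0f} lane 'instructions')")
```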
It gets even worse if we look at GPUs. Now, GPUs generally look like they have insanely high memory bandwidths. But they also have a lot of compute units and (by CPU standards) extremely small amounts of cache per "thread" (invocation, lane, CUDA core, pick your terminology of choice). Let's take the (again quite recent as of this writing) NVidia GeForce GTX 1080 Ti as an example. It has (as per Wikipedia) a memory bandwidth of 484GB/s with a stock core clock of about 1.48GHz, for an overall memory bandwidth of about 327 bytes/cycle for the whole GPU. However, this GPU has 28 "Shading Multiprocessors" (roughly comparable to CPU cores) and 3584 "CUDA cores" (SIMT lanes).

That works out to about 11.7 bytes/cycle per SM, about 4x what one of the i7-7700K's cores gets; that sounds good, but each SM drives 128 "CUDA cores", each corresponding to a thread in the SIMT programming model. Per thread, we get about 0.09 bytes of memory bandwidth per cycle, or, perhaps less awkward at this scale, one byte every 11 instructions.
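And the same quick sketch for the GPU numbers, again plugging in the rough spec-sheet figures quoted above rather than anything measured:

```python
# Same exercise for the GeForce GTX 1080 Ti, using the spec-sheet numbers
# quoted above (484 GB/s, ~1.48 GHz, 28 SMs, 3584 CUDA cores).

bandwidth    = 484e9    # bytes/s
core_clock   = 1.48e9   # Hz (stock core clock)
num_sms      = 28       # "Shading Multiprocessors"
lanes_per_sm = 3584 // num_sms   # 128 "CUDA cores" (SIMT lanes) per SM

bytes_per_cycle_gpu    = bandwidth / core_clock             # ~327 B/cycle, whole GPU
bytes_per_cycle_sm     = bytes_per_cycle_gpu / num_sms      # ~11.7 B/cycle per SM
bytes_per_cycle_thread = bytes_per_cycle_sm / lanes_per_sm  # ~0.09 B/cycle per thread

print(f"GTX 1080 Ti: {bytes_per_cycle_gpu:.0f} B/cycle total, "
      f"{bytes_per_cycle_sm:.1f} B/cycle per SM, "
      f"{bytes_per_cycle_thread:.3f} B/cycle per thread "
      f"(one byte every {1 / bytes_per_cycle_thread:.0f} instructions)")
```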
That, in short, is why everything keeps getting more and larger caches, and why even desktop GPUs have quietly started using tile-based rendering approaches (or just announced so openly). Absolute memory bandwidths in consumer devices have gone up by several orders of magnitude from the ~1MB/s of early-80s home computers, but available compute resources have grown much faster still, and the only way to stop bumping into bandwidth limits all the time is to make sure your workloads have reasonable locality of reference, so that the caches can do their job.

Final disclaimer: bandwidth is only one part of the equation. Not considered here is memory latency (that's a topic for a different post). The good news is that absolute DRAM latencies have gone down since the 80s, by a factor of about 4-5 or so. The bad news is that clock rates have increased by about a factor of 3000 since then; oops. CPUs generally see much lower memory latencies than GPUs (and are designed for low-latency operation in general), whereas GPUs are all about throughput. When CPU code is limited by memory, it is more commonly due to latency than bandwidth, i.e. running out of independent work to do while waiting for a memory access; GPU kernels have tons of runnable warps at the same time and are built to schedule something else during the wait, so on GPUs it's much easier to run into bandwidth issues.

Comments (2)

Emre Yavuz:
I find it depressing. After reading this, I get the impression that the performance we're getting from current systems is sort of an illusion, held together by a) layers of "fast but tiny caches", which get bigger and slower the closer we get to DRAM down in the memory hierarchy, and b) manual or hardware prefetching (plus speculative execution on the CPU side) that tries to hide (mainly cache) latencies. Given that Moore's law is on its last legs, and cache plus speculation is subject to a tradeoff between performance, consumed die area and power budget, it's probably going to plateau from here for a while, until someone comes up with a novel but vastly different approach.

vilx2 (in reply):
To be honest, it feels to me like we've already plateaued for a while, at least in the CPU world. My 7+ year old PCs can easily keep up with today's software, and it's been ages since I've seen a computer that was too weak for the tasks it needs to do (short of niche specialty work). The biggest speedups come from using SSDs instead of HDDs, not better RAM or CPUs. GPUs are still progressing steadily, but that's the only place with a noticeable improvement in every new generation.