Memory bandwidth
April 11, 2017

Absolute memory bandwidth figures tend to look fairly large, especially for GPUs. This is deceptive. It's much more useful to relate memory bandwidth to, say, the number of clock cycles or instructions being executed, to get a feel for what you can (and can't) get away with.

Let's start with a historical example: the MOS 6502, first released in 1975 (42 years ago) and one of the key chips in the microcomputer revolution. A 6502 was typically clocked at 1MHz and did a 1-byte memory access essentially every clock cycle, which are nice round figures to use as a baseline. A typical 6502 instruction took 3-5 cycles; some instructions with more complex addressing modes took longer, a few were quicker, and there was some overlap of the fetch of the next instruction with execution of the current one, but no full pipelining like you'd see in later (and more complex) workstation and then microcomputer CPUs, starting around the mid-80s. That gives us a baseline of 1 byte/cycle and, let's say, about 4 bytes/instruction of memory bandwidth on a 40-year-old CPU. A large fraction of that bandwidth went simply into fetching instruction bytes.

Next, let's look at a recent (as of this writing) and relatively high-end desktop CPU. An Intel Core i7-7700K has about 50GB/s of memory bandwidth and 4 cores, so if all 4 cores are under equal load, each gets about 12.5GB/s. The cores also clock at about 4.2GHz (it's safe to assume that with all 4 cores active and hitting memory, none of them is going to be in "turbo boost" mode), so they come in just under 3 bytes per cycle of memory bandwidth. Code that runs OK-ish on that CPU averages around 1 instruction per cycle, well-optimized code around 3 instructions per cycle. So well-optimized code running with all cores busy has about 1 byte/instruction of available memory bandwidth.

Note that we're 40 years of Moore's law scaling later, and the available memory bandwidth per instruction has gone down substantially. And while the 6502 is an 8-bit microprocessor doing 8-bit operations, these modern cores can execute multiple (again, usually up to three) 256-bit SIMD operations in one cycle; if we treat the CPU like a GPU and count each 32-bit vector lane as a separate "thread" (appropriate when running SIMT/SPMD-style code), then we get 24 "instructions" executed per cycle and a memory bandwidth of about 0.125 bytes per cycle per "SIMT thread", or, less unwieldy, one byte every 8 "instructions".
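If you want to play with these numbers, here is the back-of-envelope arithmetic above as a small Python sketch. The inputs are just the round figures quoted in the text (clock rates, bandwidth, instructions per cycle), not measurements, so treat the outputs as the same rough estimates:

```python
# Back-of-envelope check of the CPU figures quoted above.
# All inputs are the rough numbers from the text, not measured values.

# MOS 6502 (1975): ~1 MHz, ~1 byte accessed per cycle, ~4 cycles/instruction.
mos6502_bytes_per_cycle = 1.0
mos6502_bytes_per_insn  = 4.0 * mos6502_bytes_per_cycle   # ~4 bytes/instruction

# Intel Core i7-7700K: ~50 GB/s of memory bandwidth shared by 4 cores at ~4.2 GHz.
bw_per_core     = 50e9 / 4              # ~12.5 GB/s per core under equal load
bytes_per_cycle = bw_per_core / 4.2e9   # just under 3 bytes/cycle per core

# Well-optimized code at ~3 instructions/cycle -> about 1 byte/instruction.
bytes_per_insn = bytes_per_cycle / 3.0

# SIMT-style view: up to 3 x 256-bit SIMD ops per cycle = 24 32-bit lanes busy.
lanes_per_cycle = 3 * (256 // 32)                    # 24 "instructions"/cycle
bytes_per_lane  = bytes_per_cycle / lanes_per_cycle  # ~0.125 bytes/cycle per lane

print(f"6502:     {mos6502_bytes_per_cycle:.1f} B/cycle, "
      f"{mos6502_bytes_per_insn:.1f} B/instruction")
print(f"i7-7700K: {bytes_per_cycle:.2f} B/cycle/core, "
      f"{bytes_per_insn:.2f} B/instruction, "
      f"{bytes_per_lane:.3f} B/cycle per SIMT lane "
      f"(one byte every {1 / bytes_per_lane:.0f} lane 'instructions')")
```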
It gets even worse if we look at GPUs. Now, GPUs generally look like they have insanely high memory bandwidths. But they also have a lot of compute units and (by CPU standards) extremely small amounts of cache per "thread" (invocation, lane, CUDA core, pick your terminology of choice). Let's take the (again quite recent as of this writing) NVidia GeForce GTX 1080 Ti as an example. It has (as per Wikipedia) a memory bandwidth of 484GB/s with a stock core clock of about 1.48GHz, for an overall memory bandwidth of about 327 bytes/cycle for the whole GPU. However, this GPU has 28 "Shading Multiprocessors" (roughly comparable to CPU cores) and 3584 "CUDA cores" (SIMT lanes).

That works out to about 11.7 bytes/cycle per SM, about 4x what one of the i7-7700K's cores gets; that sounds good, but each SM drives 128 "CUDA cores", each corresponding to a thread in the SIMT programming model. Per thread, we get about 0.09 bytes of memory bandwidth per cycle, or, perhaps less awkward at this scale, one byte every 11 instructions.
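And the same quick sketch for the GPU numbers, again plugging in the rough spec-sheet figures quoted above rather than anything measured:

```python
# Same exercise for the GeForce GTX 1080 Ti, using the spec-sheet numbers
# quoted above (484 GB/s, ~1.48 GHz, 28 SMs, 3584 CUDA cores).

bandwidth    = 484e9    # bytes/s
core_clock   = 1.48e9   # Hz (stock core clock)
num_sms      = 28       # "Shading Multiprocessors"
lanes_per_sm = 3584 // num_sms   # 128 "CUDA cores" (SIMT lanes) per SM

bytes_per_cycle_gpu    = bandwidth / core_clock             # ~327 B/cycle, whole GPU
bytes_per_cycle_sm     = bytes_per_cycle_gpu / num_sms      # ~11.7 B/cycle per SM
bytes_per_cycle_thread = bytes_per_cycle_sm / lanes_per_sm  # ~0.09 B/cycle per thread

print(f"GTX 1080 Ti: {bytes_per_cycle_gpu:.0f} B/cycle total, "
      f"{bytes_per_cycle_sm:.1f} B/cycle per SM, "
      f"{bytes_per_cycle_thread:.3f} B/cycle per thread "
      f"(one byte every {1 / bytes_per_cycle_thread:.0f} instructions)")
```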
That, in short, is why everything keeps getting more and larger caches, and why even desktop GPUs have quietly started using tile-based rendering approaches (or just announced so openly). Absolute memory bandwidths in consumer devices have gone up by several orders of magnitude from the ~1MB/s of early-80s home computers, but available compute resources have grown much faster still, and the only way to stop bumping into bandwidth limits all the time is to make sure your workloads have reasonable locality of reference, so that the caches can do their job.

Final disclaimer: bandwidth is only one part of the equation. Not considered here is memory latency (that's a topic for a different post). The good news is that absolute DRAM latencies have gone down since the 80s, by a factor of about 4-5 or so. The bad news is that clock rates have increased by about a factor of 3000 since then; oops. CPUs generally see much lower memory latencies than GPUs (and are designed for low-latency operation in general), whereas GPUs are all about throughput. When CPU code is limited by memory, it is more commonly due to latency than bandwidth, i.e. running out of independent work to do while waiting for a memory access; GPU kernels have tons of runnable warps at the same time and are built to schedule something else during the wait, so on GPUs it's much easier to run into bandwidth issues.

Comments (2)

Emre Yavuz:
I find it depressing. After reading this, I get the impression that the performance we're getting from current systems is sort of an illusion, held together by a) layers of "fast but tiny caches", which get bigger and slower the closer we get to DRAM down in the memory hierarchy, and b) manual or hardware prefetching (plus speculative execution on the CPU side) that tries to hide (mainly cache) latencies. Given that Moore's law is on its last legs, and cache plus speculation is subject to a tradeoff between performance, consumed die area and power budget, it's probably going to plateau from here for a while, until someone comes up with a novel but vastly different approach.

vilx2 (in reply):
To be honest, it feels to me like we've already plateaued for a while, at least in the CPU world. My 7+ year old PCs can easily keep up with today's software, and it's been ages since I've seen a computer that was too weak for the tasks it needs to do (short of niche specialty work). The biggest speedups come from using SSDs instead of HDDs, not better RAM or CPUs. GPUs are still progressing steadily, but that's the only place with a noticeable improvement in every new generation.