Memory access on the Apple M1 processor

When a program is mostly just accessing memory randomly, a standard cost model is to count the number of distinct random accesses. The general idea is that memory access is much slower than most other computational tasks. Furthermore, the cost model can be extended to count "nearby" memory accesses as free: if I read a byte at memory address x and then read a byte at memory address x+1, I can assume that the second byte comes "for free".

This naive memory-access model is often sensible. However, you should always keep in mind that it is merely a model, and a model can fail to predict real performance. How might it fail? A CPU core can issue multiple memory requests at once. So if I need to access 7 memory locations, I can issue 7 memory requests and wait for them all. Waiting for 7 memory requests is likely slower than waiting for a single memory request, but is it 7 times slower? The latest Apple laptop processor, the M1, apparently has a lot of memory-level parallelism: it looks like a single core has about 28 levels of memory parallelism, and possibly more [results]. Such a high degree of memory-level parallelism makes it less likely that our naive random-memory model applies.
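To see what memory-level parallelism looks like in code, consider a pointer-chasing experiment. What follows is only a minimal sketch (it is not the benchmark behind the 28-level figure, and the array size, lane counts and seed are arbitrary choices): a single dependent chain pays the full memory latency on every load, whereas several independent chains give the core cache misses that it can overlap.

    // Minimal sketch: dependent pointer chasing vs. several independent chains.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    static std::vector<uint32_t> single_cycle_permutation(size_t n) {
      std::vector<uint32_t> next(n);
      std::iota(next.begin(), next.end(), uint32_t(0));
      std::mt19937_64 rng(1234);
      // Sattolo's algorithm: the permutation is one big cycle, so a chase
      // never gets trapped in a short loop.
      for (size_t i = n - 1; i > 0; i--) {
        std::uniform_int_distribution<size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
      }
      return next;
    }

    template <size_t LANES>
    double ns_per_load(const std::vector<uint32_t> &next, size_t steps) {
      uint32_t lane[LANES];
      for (size_t j = 0; j < LANES; j++) {
        lane[j] = uint32_t(j * (next.size() / LANES)); // spread the starting points
      }
      auto start = std::chrono::steady_clock::now();
      for (size_t s = 0; s < steps; s++) {
        // The LANES chases are independent: an out-of-order core can keep
        // up to LANES cache misses in flight at the same time.
        for (size_t j = 0; j < LANES; j++) lane[j] = next[lane[j]];
      }
      auto stop = std::chrono::steady_clock::now();
      uint32_t sink = 0;
      for (size_t j = 0; j < LANES; j++) sink ^= lane[j];
      if (sink == 42) printf("unlikely\n"); // keep the loads from being optimized away
      double ns = std::chrono::duration<double, std::nano>(stop - start).count();
      return ns / double(steps * LANES);
    }

    int main() {
      const size_t n = size_t(1) << 27; // 128M entries of 4 bytes: well beyond the caches
      const size_t steps = 1 << 22;     // dependent loads per lane
      std::vector<uint32_t> next = single_cycle_permutation(n);
      printf("1 lane  : %5.1f ns/load\n", ns_per_load<1>(next, steps));
      printf("8 lanes : %5.1f ns/load\n", ns_per_load<8>(next, steps));
      printf("28 lanes: %5.1f ns/load\n", ns_per_load<28>(next, steps));
      return 0;
    }

On a core with abundant memory-level parallelism, the time per load should fall sharply as the number of independent lanes grows, until the lanes exceed what the core can keep in flight.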
To test it out, I designed the following benchmark where I compare three functions. The first one grabs pairs of randomly selected bytes and computes a bitwise XOR between them before adding them to a counter:

    for (size_t i = 0; i < 2 * M; i += 2) {
      answer += array[random[i]] ^ array[random[i + 1]];
    }

We compare against a 3-wise version of this function:

    for (size_t i = 0; i < 3 * M; i += 3) {
      answer += array[random[i]] ^ array[random[i + 1]] ^ array[random[i + 2]];
    }

Our naive memory-access cost model predicts that the second function should be 50% more expensive. However, many other models (such as a simple instruction count) would also predict a 50% overhead. To give our naive memory-access model a run for its money, let us throw in a 2-wise version that also accesses nearby values (with a one-byte offset):

    for (size_t i = 0; i < 2 * M; i += 2) {
      int idx1 = random[i];
      int idx2 = random[i + 1];
      answer += array[idx1] ^ array[idx1 + 1] ^ array[idx2] ^ array[idx2 + 1];
    }

Our naive memory-access cost model predicts that the first and last functions should have about the same running time, while the second function should be 50% more expensive. Let us measure it. I use a 1GB array and I report the average time spent, in nanoseconds, on each iteration.

    2-wise    8.9 ns
    3-wise   13.0 ns
    2-wise+  12.5 ns

At first glance, our naive memory-access model is validated: the 3-wise function is 46% more expensive than the 2-wise function. Yet we should not be surprised, because most reasonable models would make such a prediction: in almost every way, the function does 50% more work. It is more interesting to compare the two 2-wise functions: the last one is 40% more expensive than the first, which contradicts our prediction. And so, at least in this instance, our simple memory-access cost model fails us on the Apple M1 processor.
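For readers who want the whole setup in one place, here is a self-contained sketch of the comparison. The three kernels are the ones shown above; the 1GB array matches the description, but the iteration count, the seed and the timing harness are illustrative choices of this sketch rather than those of the actual benchmark (the real source code is linked in the notes below).

    // Sketch of the comparison; needs C++14 or later and a bit over 1.1 GB of memory.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
      const size_t N = size_t(1) << 30;  // 1 GB array of bytes
      const size_t M = 10 * 1000 * 1000; // iterations per function (sketch value)
      std::vector<uint8_t> array(N);
      for (size_t i = 0; i < N; i++) array[i] = uint8_t(i); // contents are irrelevant
      // Precomputed random indices: the three kernels differ only in how they use them.
      std::vector<uint32_t> random(3 * M);
      std::mt19937_64 rng(1234);
      std::uniform_int_distribution<uint32_t> dist(0, uint32_t(N - 2));
      for (auto &r : random) r = dist(rng);

      // Time a kernel and return the average cost per iteration in nanoseconds.
      auto time_ns = [&](auto &&kernel) {
        uint64_t answer = 0;
        auto start = std::chrono::steady_clock::now();
        kernel(answer);
        auto stop = std::chrono::steady_clock::now();
        printf("bogus %llu\n", (unsigned long long)answer); // defeat dead-code elimination
        return std::chrono::duration<double, std::nano>(stop - start).count() / M;
      };

      double two = time_ns([&](uint64_t &answer) {
        for (size_t i = 0; i < 2 * M; i += 2) {
          answer += array[random[i]] ^ array[random[i + 1]];
        }
      });
      double three = time_ns([&](uint64_t &answer) {
        for (size_t i = 0; i < 3 * M; i += 3) {
          answer += array[random[i]] ^ array[random[i + 1]] ^ array[random[i + 2]];
        }
      });
      double twoplus = time_ns([&](uint64_t &answer) {
        for (size_t i = 0; i < 2 * M; i += 2) {
          int idx1 = random[i];
          int idx2 = random[i + 1];
          answer += array[idx1] ^ array[idx1 + 1] ^ array[idx2] ^ array[idx2 + 1];
        }
      });
      printf("2-wise %.1f ns, 3-wise %.1f ns, 2-wise+ %.1f ns\n", two, three, twoplus);
      return 0;
    }

Compile with optimizations enabled; without them, instruction overhead rather than memory access dominates and the comparison is meaningless.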
Notes:

1. My source code is available. The run-to-run variability is relatively high on such a test, but the conclusion is robust on my Apple M1 system.
2. I posted the assembly online.
3. Importantly, I do not predict that other systems will follow the same pattern. Please do not run this benchmark on your non-M1 PC and expect comparable results.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ). Posted on January 6, 2021.

10 thoughts on "Memory access on the Apple M1 processor"

Jongilanga Guma says (January 6, 2021 at 8:08 pm):
Hi, this is interesting. I ran this on my Mac with a 2.2 GHz quad-core Intel Core i7. Here are the results:

    $ ./two_or_three
    N = 1000000000, 953.7 MB
    starting experiments.
    two  : 44.7 ns
    two+ : 45.0 ns
    three: 67.6 ns
    bogus 137531640

Way too slow for my PC. Thanks for sharing. Regards, Jongi

Daniel Lemire says (January 6, 2021 at 8:16 pm):
You may want to upgrade to the Apple M1. It is a massively better processor.

Frank Astier says (January 6, 2021 at 8:58 pm):
Did you look at the compiled assembly code? That could be interesting too.

Daniel Lemire says (January 6, 2021 at 9:09 pm):
See https://gist.github.com/lemire/1c9e8827b45d057d7546e2743ad34496

Olivier Galibert says (January 6, 2021 at 9:00 pm):
In the first version, the compiler may have scheduled the first memory access to run in parallel with the second random calculation, and failed to do so in the second. Looking at the asm could shine some light on what's going on.

Daniel Lemire says (January 6, 2021 at 9:09 pm):
See https://gist.github.com/lemire/1c9e8827b45d057d7546e2743ad34496

Chris says (January 6, 2021 at 10:05 pm):
Isn't this also dependent on the memory's speed?

Daniel Lemire says (January 6, 2021 at 10:19 pm):
The Apple M1 comes with built-in memory, so the memory speed is a constant.

Ole Kristian Morch-Storstein says (January 6, 2021 at 10:17 pm):
Great read! Getting these results on my base-model M1 MacBook Air (8GB/256):

    two  : 10.2 ns
    two+ : 12.1 ns
    three: 12.4 ns

Daniel Lemire says (January 6, 2021 at 10:23 pm):
Thanks. As I indicated (see Notes), there are run-to-run variations, so you should expect to get different numbers.