Memory access on the Apple M1 processor

When a program is mostly just accessing memory randomly, a standard cost model is to count the number of distinct random accesses. The general idea is that memory access is much slower than most other computational tasks. Furthermore, the cost model can be extended to count "nearby" memory accesses as free: if I read a byte at memory address x and then read a byte at memory address x+1, I can assume that the second byte comes "for free".

This naive memory-access model is often sensible. However, you should always keep in mind that it is merely a model, and a model can fail to predict real performance. How might it fail? A CPU core can issue multiple memory requests at once. So if I need to access 7 memory locations, I can issue 7 memory requests and wait for them all. Waiting for 7 memory requests is likely slower than waiting for a single memory request, but is it 7 times slower? The latest Apple laptop processor, the M1, apparently has a lot of memory-level parallelism: it looks like a single core has about 28 levels of memory parallelism, and possibly more [results]. Such a high degree of memory-level parallelism makes it less likely that our naive random-memory model applies.
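To see what memory-level parallelism looks like in code, consider a pointer-chasing experiment. What follows is only a minimal sketch (it is not the benchmark behind the 28-level figure, and the array size, lane counts and seed are arbitrary choices): a single dependent chain pays the full memory latency on every load, whereas several independent chains give the core cache misses that it can overlap.

    // Minimal sketch: dependent pointer chasing vs. several independent chains.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    static std::vector<uint32_t> single_cycle_permutation(size_t n) {
      std::vector<uint32_t> next(n);
      std::iota(next.begin(), next.end(), uint32_t(0));
      std::mt19937_64 rng(1234);
      // Sattolo's algorithm: the permutation is one big cycle, so a chase
      // never gets trapped in a short loop.
      for (size_t i = n - 1; i > 0; i--) {
        std::uniform_int_distribution<size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
      }
      return next;
    }

    template <size_t LANES>
    double ns_per_load(const std::vector<uint32_t> &next, size_t steps) {
      uint32_t lane[LANES];
      for (size_t j = 0; j < LANES; j++) {
        lane[j] = uint32_t(j * (next.size() / LANES)); // spread the starting points
      }
      auto start = std::chrono::steady_clock::now();
      for (size_t s = 0; s < steps; s++) {
        // The LANES chases are independent: an out-of-order core can keep
        // up to LANES cache misses in flight at the same time.
        for (size_t j = 0; j < LANES; j++) lane[j] = next[lane[j]];
      }
      auto stop = std::chrono::steady_clock::now();
      uint32_t sink = 0;
      for (size_t j = 0; j < LANES; j++) sink ^= lane[j];
      if (sink == 42) printf("unlikely\n"); // keep the loads from being optimized away
      double ns = std::chrono::duration<double, std::nano>(stop - start).count();
      return ns / double(steps * LANES);
    }

    int main() {
      const size_t n = size_t(1) << 27; // 128M entries of 4 bytes: well beyond the caches
      const size_t steps = 1 << 22;     // dependent loads per lane
      std::vector<uint32_t> next = single_cycle_permutation(n);
      printf("1 lane  : %5.1f ns/load\n", ns_per_load<1>(next, steps));
      printf("8 lanes : %5.1f ns/load\n", ns_per_load<8>(next, steps));
      printf("28 lanes: %5.1f ns/load\n", ns_per_load<28>(next, steps));
      return 0;
    }

On a core with abundant memory-level parallelism, the time per load should fall sharply as the number of independent lanes grows, until the lanes exceed what the core can keep in flight.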
To test it out, I designed the following benchmark where I compare three functions. The first one grabs pairs of randomly selected bytes and computes a bitwise XOR between them before adding them to a counter:

    for (size_t i = 0; i < 2 * M; i += 2) {
      answer += array[random[i]] ^ array[random[i + 1]];
    }

We compare against a 3-wise version of this function:

    for (size_t i = 0; i < 3 * M; i += 3) {
      answer += array[random[i]] ^ array[random[i + 1]] ^ array[random[i + 2]];
    }

Our naive memory-access cost model predicts that the second function should be 50% more expensive. However, many other models (such as a simple instruction count) would also predict a 50% overhead. To give our naive memory-access model a run for its money, let us throw in a 2-wise version that also accesses nearby values (with a one-byte offset):

    for (size_t i = 0; i < 2 * M; i += 2) {
      int idx1 = random[i];
      int idx2 = random[i + 1];
      answer += array[idx1] ^ array[idx1 + 1] ^ array[idx2] ^ array[idx2 + 1];
    }

Our naive memory-access cost model predicts that the first and last functions should have about the same running time, while the second function should be 50% more expensive. Let us measure it. I use a 1GB array and I report the average time spent, in nanoseconds, on each iteration.

    2-wise    8.9 ns
    3-wise   13.0 ns
    2-wise+  12.5 ns

At first glance, our naive memory-access model is validated: the 3-wise function is 46% more expensive than the 2-wise function. Yet we should not be surprised, because most reasonable models would make such a prediction: in almost every way, the function does 50% more work. It is more interesting to compare the two 2-wise functions: the last one is 40% more expensive than the first, which contradicts our prediction. And so, at least in this instance, our simple memory-access cost model fails us on the Apple M1 processor.
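For readers who want the whole setup in one place, here is a self-contained sketch of the comparison. The three kernels are the ones shown above; the 1GB array matches the description, but the iteration count, the seed and the timing harness are illustrative choices of this sketch rather than those of the actual benchmark (the real source code is linked in the notes below).

    // Sketch of the comparison; needs C++14 or later and a bit over 1.1 GB of memory.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
      const size_t N = size_t(1) << 30;  // 1 GB array of bytes
      const size_t M = 10 * 1000 * 1000; // iterations per function (sketch value)
      std::vector<uint8_t> array(N);
      for (size_t i = 0; i < N; i++) array[i] = uint8_t(i); // contents are irrelevant
      // Precomputed random indices: the three kernels differ only in how they use them.
      std::vector<uint32_t> random(3 * M);
      std::mt19937_64 rng(1234);
      std::uniform_int_distribution<uint32_t> dist(0, uint32_t(N - 2));
      for (auto &r : random) r = dist(rng);

      // Time a kernel and return the average cost per iteration in nanoseconds.
      auto time_ns = [&](auto &&kernel) {
        uint64_t answer = 0;
        auto start = std::chrono::steady_clock::now();
        kernel(answer);
        auto stop = std::chrono::steady_clock::now();
        printf("bogus %llu\n", (unsigned long long)answer); // defeat dead-code elimination
        return std::chrono::duration<double, std::nano>(stop - start).count() / M;
      };

      double two = time_ns([&](uint64_t &answer) {
        for (size_t i = 0; i < 2 * M; i += 2) {
          answer += array[random[i]] ^ array[random[i + 1]];
        }
      });
      double three = time_ns([&](uint64_t &answer) {
        for (size_t i = 0; i < 3 * M; i += 3) {
          answer += array[random[i]] ^ array[random[i + 1]] ^ array[random[i + 2]];
        }
      });
      double twoplus = time_ns([&](uint64_t &answer) {
        for (size_t i = 0; i < 2 * M; i += 2) {
          int idx1 = random[i];
          int idx2 = random[i + 1];
          answer += array[idx1] ^ array[idx1 + 1] ^ array[idx2] ^ array[idx2 + 1];
        }
      });
      printf("2-wise %.1f ns, 3-wise %.1f ns, 2-wise+ %.1f ns\n", two, three, twoplus);
      return 0;
    }

Compile with optimizations enabled; without them, instruction overhead rather than memory access dominates and the comparison is meaningless.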
Notes:

1. My source code is available. The run-to-run variability is relatively high on such a test, but the conclusion is robust on my Apple M1 system.
2. I posted the assembly online.
3. Importantly, I do not predict that other systems will follow the same pattern. Please do not run this benchmark on your non-M1 PC and expect comparable results.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ). Posted on January 6, 2021.

10 thoughts on "Memory access on the Apple M1 processor"

Jongilanga Guma says (January 6, 2021 at 8:08 pm):
Hi, this is interesting. I ran this on my Mac with a 2.2 GHz quad-core Intel Core i7. Here are the results:

    $ ./two_or_three
    N = 1000000000, 953.7 MB
    starting experiments.
    two  : 44.7 ns
    two+ : 45.0 ns
    three: 67.6 ns
    bogus 137531640

Way too slow for my PC. Thanks for sharing. Regards, Jongi

Daniel Lemire says (January 6, 2021 at 8:16 pm):
You may want to upgrade to the Apple M1. It is a massively better processor.

Frank Astier says (January 6, 2021 at 8:58 pm):
Did you look at the compiled assembly code? That could be interesting too.

Daniel Lemire says (January 6, 2021 at 9:09 pm):
See https://gist.github.com/lemire/1c9e8827b45d057d7546e2743ad34496

Olivier Galibert says (January 6, 2021 at 9:00 pm):
In the first version, the compiler may have scheduled the first memory access to run in parallel with the second random calculation, and failed to do so in the second. Looking at the asm could shine some light on what's going on.

Daniel Lemire says (January 6, 2021 at 9:09 pm):
See https://gist.github.com/lemire/1c9e8827b45d057d7546e2743ad34496

Chris says (January 6, 2021 at 10:05 pm):
Isn't this also dependent on the memory's speed?

Daniel Lemire says (January 6, 2021 at 10:19 pm):
The Apple M1 comes with built-in memory, so the memory speed is a constant.

Ole Kristian Morch-Storstein says (January 6, 2021 at 10:17 pm):
Great read! Getting these results on my base-model M1 MacBook Air (8GB/256):

    two  : 10.2 ns
    two+ : 12.1 ns
    three: 12.4 ns

Daniel Lemire says (January 6, 2021 at 10:23 pm):
Thanks. As I indicated (see Notes), there are run-to-run variations, so you should expect to get different numbers.