https://lemire.me/blog/2021/03/24/counting-cycles-and-instructions-on-the-apple-m1-processor/

Daniel Lemire's blog

Daniel Lemire is a computer science professor at the University of Quebec (TELUQ) in Montreal. His research is focused on software performance and data engineering. He is a techno-optimist.
Counting cycles and instructions on the Apple M1 processor

When benchmarking software, we often start by measuring the time elapsed. If you are benchmarking data bandwidth or latency, it is the right measure. However, if you are benchmarking computational tasks where you avoid disk and network accesses and where you only access a few pages of memory, then the time elapsed is often not ideal because it can vary too much from run to run and it provides too little information.

Most processors will adjust their frequency in response to power and thermal constraints. Thus you should generally avoid using a laptop. Yet even if you can get stable measures, it is hard to reason about your code from a time measurement. Processors operate in cycles, retiring instructions. They have branches, and sometimes they mispredict these branches. These are the measures you want!

You can, of course, translate the time into CPU cycles if you know the CPU frequency. But it might be harder than it sounds because even without physical constraints, processors can vary their frequency during a test. You can measure the CPU frequency using predictable loops. It is a little bit awkward.
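The predictable-loop idea can be sketched as follows. This is a minimal sketch and not code from the post: the names `dependent_adds` and `estimate_ghz` are hypothetical, the empty inline-asm barrier assumes GCC or Clang, and the estimate assumes an optimized build where the processor completes one dependent addition per cycle. That assumption is typical of current processors but not guaranteed, which is part of why the approach is awkward.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Runs a chain of n dependent additions. The empty inline-asm statement
// (GCC/Clang syntax) stops the compiler from collapsing the loop into x = n.
uint64_t dependent_adds(uint64_t n) {
  uint64_t x = 0;
  for (uint64_t i = 0; i < n; i++) {
    x += 1;
    asm volatile("" : "+r"(x));
  }
  return x;
}

// Estimates the frequency in GHz, ASSUMING one cycle per loop iteration.
// Keeps the fastest of several trials to reduce the effect of interruptions.
double estimate_ghz(uint64_t n) {
  assert(n > 0);
  double best_ns = 1e300;
  for (int trial = 0; trial < 10; trial++) {
    auto start = std::chrono::steady_clock::now();
    uint64_t result = dependent_adds(n);
    auto end = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(end - start).count();
    if (result == n && ns < best_ns) {
      best_ns = ns;
    }
  }
  return double(n) / best_ns;  // cycles per nanosecond, that is, GHz
}
```

Calling `estimate_ghz(100000000)` from a program compiled with optimizations should give a rough figure. An unoptimized build keeps `x` in memory, so each iteration costs several cycles and the estimate comes out too low.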
Most people then go to a graphical tool like Intel VTune or Apple Instruments. These are powerful tools that can provide fancy graphical displays, run samples, record precise instruction counts and so forth. They also tend to work across a wide range of programming languages. These graphical tools use the fact that processor vendors include "performance counters" in their silicon. You can tell precisely how many instructions were executed between two points in time.

Sadly, these tools can be difficult to tailor to your needs and to automate. Thankfully, the Linux kernel exposes performance counters on most processors. Thus if you write code for Linux, you can rather easily query the performance counters for yourself: you can put markers in your code and find out how many instructions or cycles were spent between these markers. We often refer to such code as being "instrumented". It requires you to modify your code and it will not work in all programming languages, but it is precise and flexible. It even works under Docker if you are into containers. You may need privileged access to use the counters. Surely you can also access the performance counters from your own program under Windows, but I have never found any documentation or example.

My main laptop these days is a new Apple MacBook with an M1 processor. This ARM processor is remarkable. In many ways, it is more advanced than comparable Intel processors. Sadly, until recently, I did not know how to instrument my code for the Apple M1.

Recently, one of the readers of my blog (Duc Tri Nguyen) showed me how, inspired by code from Dougall Johnson. Dougall has been doing interesting research on Apple's processors. As far as I can tell, this way of reading the counters is entirely undocumented and could blow up your computer. Thankfully, to access the performance counters, you need administrative access (wheel group).
In practice, it means that you could start your instrumented program in a shell using sudo so that your program has, itself, administrative privileges.

To illustrate the approach, I have posted a full C++ project which builds an instrumented benchmark. You need administrative access and an Apple M1 system. I assume you have installed the complete developer kit with command-line utilities provided by Apple.

I recommend measuring both the minimum and the average of each counter. When the average is close to the minimum, you usually have reliable results. The maximum is less relevant in computational benchmarks. Observe that measures taken during a benchmark are not normally distributed: they are better described as following a log-normal distribution.

The core of the benchmark looks like the following C++ code:

    performance_counters agg_min{1e300};  // start high so the first run wins
    performance_counters agg_avg{0.0};    // running sum, divided at the end
    for (size_t i = 0; i < repeat; i++) {
      performance_counters start = get_counters();
      my_function();
      performance_counters end = get_counters();
      performance_counters diff = end - start;
      agg_min = agg_min.min(diff);
      agg_avg += diff;
    }
    agg_avg /= repeat;

Afterward, it is simply a matter of printing the results. I decided to benchmark floating-point number parsers in C++. I get the following output:

    # parsing random numbers
    model: generate random numbers uniformly in the interval [0.000000,1.000000]
    volume: 10000 floats
    volume = 0.0762939 MB
    strtod     491.89 instructions/float (+/- 0.0 %)
                83.83 cycles/float (+/- 0.1 %)
                 5.87 instructions/cycle
                94.97 branches/float (+/- 0.0 %)
               0.5947 mis. branches/float
    fastfloat  226.03 instructions/float (+/- 0.0 %)
                29.02 cycles/float (+/- 0.1 %)
                 7.79 instructions/cycle
                41.00 branches/float (+/- 0.0 %)
               0.0001 mis. branches/float

As you can see, I get the average number of instructions, branches and mispredicted branches for every floating-point number. I also get the number of instructions retired per cycle.
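To try the minimum-and-average loop shown above without an Apple M1 or administrative access, a portable stand-in can help. The sketch below is an assumption-laden illustration and not the code from the posted project: this `performance_counters` struct tracks only elapsed nanoseconds, this `get_counters()` reads `std::chrono::steady_clock` instead of the hardware counters, and the `summary` and `measure` names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for the post's performance_counters type: it holds
// one value (elapsed nanoseconds) rather than cycles, instructions, branches.
struct performance_counters {
  double nanoseconds;
  explicit performance_counters(double init) : nanoseconds(init) {}
  performance_counters operator-(const performance_counters& other) const {
    return performance_counters(nanoseconds - other.nanoseconds);
  }
  performance_counters min(const performance_counters& other) const {
    return performance_counters(std::min(nanoseconds, other.nanoseconds));
  }
  performance_counters& operator+=(const performance_counters& other) {
    nanoseconds += other.nanoseconds;
    return *this;
  }
  performance_counters& operator/=(double divisor) {
    nanoseconds /= divisor;
    return *this;
  }
};

// Hypothetical get_counters(): reads the steady clock, NOT the M1 counters.
performance_counters get_counters() {
  auto now = std::chrono::steady_clock::now().time_since_epoch();
  return performance_counters(
      std::chrono::duration<double, std::nano>(now).count());
}

struct summary {
  performance_counters minimum;
  performance_counters average;
};

// Runs `work` `repeat` times; returns the minimum and the average cost.
template <class Function>
summary measure(Function&& work, std::size_t repeat) {
  assert(repeat > 0);
  performance_counters agg_min{1e300};
  performance_counters agg_avg{0.0};
  for (std::size_t i = 0; i < repeat; i++) {
    performance_counters start = get_counters();
    work();
    performance_counters end = get_counters();
    performance_counters diff = end - start;
    agg_min = agg_min.min(diff);
    agg_avg += diff;
  }
  agg_avg /= double(repeat);
  return {agg_min, agg_avg};
}
```

A call such as `measure(my_function, 1000)` returns both aggregates. Comparing `average.nanoseconds` with `minimum.nanoseconds` gives the same kind of reliability check as the percentages in the output above, although wall-clock time will usually vary more than cycle counts do.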
It appears that on this benchmark, the Apple M1 processor gets close to 8 instructions retired per cycle when parsing numbers with the fast_float library. That is a score far higher than anything possible on an Intel processor.

You should note how precise the results are: the minimum and the average number of cycles are almost identical. It is quite uncommon in my experience to get such consistent numbers on a laptop. But these Apple M1 systems seem to show remarkably little variation. It suggests that there is little in the way of thermal constraints. I usually avoid benchmarking on laptops, but I make an exception with these laptops.

To my knowledge, none of this performance-counter access is documented by Apple. Thus my code should be viewed with suspicion. It is possible that these numbers are not what I take them to be. However, the numbers are generally credible.

My source code is available.

Published by Daniel Lemire. Posted on March 24, 2021.