https://lemire.me/blog/2021/03/24/counting-cycles-and-instructions-on-the-apple-m1-processor/

Daniel Lemire's blog

Daniel Lemire is a computer science professor at the University of
Quebec (TELUQ) in Montreal. His research is focused on software
performance and data engineering. He is a techno-optimist.

Counting cycles and instructions on the Apple M1 processor

When benchmarking software, we often start by measuring the time
elapsed. If you are benchmarking data bandwidth or latency, it is
the right measure. However, if you are benchmarking computational tasks
where you avoid disk and network accesses and where you only access
a few pages of memory, then the time elapsed is often not ideal
because it can vary too much from run to run and it provides too
little information.

Most processors will adjust their frequency in response to power and
thermal constraints. Thus you should generally avoid using a
laptop. Yet even if you can get stable measures, it is hard to reason
about your code from a time measurement. Processors operate in
cycles, retiring instructions. They have branches, and sometimes they
mispredict these branches. These are the measures you want!

You can, of course, translate the time into CPU cycles if you know the
CPU frequency. But it might be harder than it sounds because even
without physical constraints, processors can vary their frequency
during a test. You can measure the CPU frequency using predictable
loops. It is a little bit awkward.
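
As a rough illustration of the "predictable loop" idea, here is a minimal sketch. It assumes that a loop made of a single dependent addition runs at one iteration per cycle, which is plausible on many recent processors but is an assumption, not a guarantee. The function names are hypothetical, and the empty asm statement relies on GCC/Clang syntax.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Runs `iterations` dependent additions and returns the elapsed seconds.
// The empty asm statement (GCC/Clang syntax) forces the compiler to keep
// every addition instead of collapsing the loop into one operation.
double time_dependent_adds(uint64_t iterations) {
  uint64_t x = 0;
  auto start = std::chrono::steady_clock::now();
  for (uint64_t i = 0; i < iterations; i++) {
    x += 1;
    asm volatile("" : "+r"(x));
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count();
}

// Estimated frequency in GHz, ASSUMING one cycle per loop iteration.
// We keep the fastest trial since interruptions can only slow a run down.
double estimate_ghz(uint64_t iterations, int trials) {
  assert(iterations > 0 && trials > 0);
  double best = 1e300;
  for (int t = 0; t < trials; t++) {
    double seconds = time_dependent_adds(iterations);
    if (seconds < best) { best = seconds; }
  }
  return static_cast<double>(iterations) / best / 1e9;
}
```

Calling estimate_ghz(100000000, 5) would give a ballpark figure, but the result is only as good as the one-cycle-per-iteration assumption, which is part of why the approach is awkward.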

Most people then go to a graphical tool like Intel VTune or Apple
Instruments. These are powerful tools that can provide fancy
graphical displays, run samples, record precise instruction counts
and so forth. They also tend to work across a wide range of
programming languages.

These graphical tools use the fact that processor vendors include
"performance counters" in their silicon. You can tell precisely how
many instructions were executed between two points in time.

Sadly, these tools can be difficult to tailor to your needs and to
automate. Thankfully, the Linux kernel exposes performance counters
on most processors. Thus if you write code for Linux, you can rather
easily query the performance counters for yourself. Thus you can put
markers in your code and find out how many instructions or cycles
were spent between these markers. We often refer to such code as
being "instrumented". It requires you to modify your code and it will
not work in all programming languages, but it is precise and
flexible. It even works under Docker if you are into containers. You
may need privileged access to use the counters. Surely you can also
access the performance counters from your own program under Windows,
but I have never found any documentation or example.
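
For the Linux case, a minimal sketch of such instrumentation could rely on the perf_event_open system call. The helper name count_instructions is hypothetical, and the sketch only counts retired user-space instructions for the calling thread. It reports failure rather than crashing when the counter is unavailable (for example, without sufficient privileges or inside some virtual machines).

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cstdint>
#include <cstring>

// Counts user-space instructions retired while work() runs.
// Returns false if the counter cannot be opened or read; work() is
// still executed exactly once in the case where the counter opens.
template <typename F>
bool count_instructions(F&& work, uint64_t& count) {
  perf_event_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_INSTRUCTIONS;
  attr.disabled = 1;        // start stopped; enabled explicitly below
  attr.exclude_kernel = 1;  // user-space only
  attr.exclude_hv = 1;

  // pid = 0 and cpu = -1: measure the calling thread on any CPU.
  int fd = static_cast<int>(
      syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL));
  if (fd == -1) {
    work();  // still run the workload so the caller sees its effects
    return false;
  }
  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);   // first marker
  work();
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);  // second marker

  ssize_t n = read(fd, &count, sizeof(count));
  close(fd);
  return n == static_cast<ssize_t>(sizeof(count));
}
```

You would pass the code to be measured as a lambda. Counting cycles instead would mean setting the config field to PERF_COUNT_HW_CPU_CYCLES.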

My main laptop these days is a new Apple MacBook with an M1
processor. This ARM processor is remarkable. In many ways, it is more
advanced than comparable Intel processors. Sadly, until recently, I
did not know how to instrument my code for the Apple M1.

Recently, one of the readers of my blog (Duc Tri Nguyen) showed me
how, inspired by code from Dougall Johnson. Dougall has been doing
interesting research on Apple's processors. As far as I can tell, the
technique is entirely undocumented and could blow up your computer.
Thankfully, to access the performance counters, you need
administrative access (wheel group). In practice, it means that you
could start your instrumented program in a shell using sudo so that
your program has, itself, administrative privileges.

To illustrate the approach, I have posted a full C++ project which
builds an instrumented benchmark. You need administrative access and
an Apple M1 system. I assume you have installed the complete
developer kit with command-line utilities provided by Apple.

I recommend measuring both the minimum and the average of each
counter. When the average is close to the minimum, you usually have
reliable results. The maximum is less relevant in computational
benchmarks. Observe that measures taken during a benchmark are not
normally distributed: they are better described as following a
log-normal distribution.
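
To make the minimum-versus-average check concrete, here is a small sketch. The summary type, the summarize function and the choice of expressing the gap as a percentage of the average are my own assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct summary {
  double minimum;
  double average;
  double gap_percent;  // (average - minimum) / average * 100
};

// Summarizes repeated measurements of one counter (e.g., cycles per run).
// A small gap_percent suggests the measurements can be trusted.
summary summarize(const std::vector<double>& samples) {
  assert(!samples.empty());
  double minimum = *std::min_element(samples.begin(), samples.end());
  double average =
      std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
  double gap = average > 0 ? (average - minimum) / average * 100.0 : 0.0;
  return {minimum, average, gap};
}
```

With the samples {29, 29, 29, 30}, the minimum is 29, the average is 29.25 and the gap is under 1%. A single slow run pulls the average up but leaves the minimum untouched, which is why comparing the two is informative.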

The core of the benchmark looks like the following C++ code:

  performance_counters agg_min{1e300};
  performance_counters agg_avg{0.0};
  for (size_t i = 0; i < repeat; i++) {
    performance_counters start = get_counters();
    my_function();
    performance_counters end = get_counters();
    performance_counters diff = end - start;
    agg_min = agg_min.min(diff);
    agg_avg += diff;
  }
  agg_avg /= repeat;
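
The loop above relies on a performance_counters type that supports subtraction, an element-wise minimum, accumulation and division. The following is a minimal sketch of what such a type could look like. The field names are assumptions, and get_counters is replaced by a stand-in that reads a fake global state, since the real counter access is specific to the M1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical counter record; all fields are doubles so that averages
// can be stored in the same type.
struct performance_counters {
  double cycles, branches, missed_branches, instructions;

  explicit performance_counters(double init)
      : cycles(init), branches(init), missed_branches(init),
        instructions(init) {}
  performance_counters(double c, double b, double m, double i)
      : cycles(c), branches(b), missed_branches(m), instructions(i) {}

  performance_counters& operator+=(const performance_counters& o) {
    cycles += o.cycles; branches += o.branches;
    missed_branches += o.missed_branches; instructions += o.instructions;
    return *this;
  }
  performance_counters& operator/=(double d) {
    cycles /= d; branches /= d; missed_branches /= d; instructions /= d;
    return *this;
  }
  // Element-wise minimum of two records.
  performance_counters min(const performance_counters& o) const {
    return performance_counters(
        std::min(cycles, o.cycles), std::min(branches, o.branches),
        std::min(missed_branches, o.missed_branches),
        std::min(instructions, o.instructions));
  }
};

inline performance_counters operator-(const performance_counters& a,
                                      const performance_counters& b) {
  return performance_counters(a.cycles - b.cycles, a.branches - b.branches,
                              a.missed_branches - b.missed_branches,
                              a.instructions - b.instructions);
}

// Stand-ins for illustration only: the "workload" advances a fake state.
static performance_counters fake_state{0.0};
performance_counters get_counters() { return fake_state; }
void my_function() { fake_state += performance_counters(29, 41, 0, 226); }

// Same structure as the benchmark loop in the post.
void measure(size_t repeat, performance_counters& agg_min,
             performance_counters& agg_avg) {
  assert(repeat > 0);
  for (size_t i = 0; i < repeat; i++) {
    performance_counters start = get_counters();
    my_function();
    performance_counters end = get_counters();
    performance_counters diff = end - start;
    agg_min = agg_min.min(diff);
    agg_avg += diff;
  }
  agg_avg /= repeat;
}
```

Initializing the minimum with a huge value (1e300) and the average with zero, as in the post, means the first iteration always replaces the minimum.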

Afterward, it is simply a matter of printing the results. I decided
to benchmark floating-point number parsers in C++. I get the
following output:

# parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB
   strtod    491.89 instructions/float (+/- 0.0 %)
             83.83 cycles/float (+/- 0.1 %)
             5.87 instructions/cycle
             94.97 branches/float (+/- 0.0 %)
             0.5947 mis. branches/float

   fastfloat    226.03 instructions/float (+/- 0.0 %)
                29.02 cycles/float (+/- 0.1 %)
                7.79 instructions/cycle
                41.00 branches/float (+/- 0.0 %)
                0.0001 mis. branches/float

As you can see, I get the average number of instructions, branches
and mispredicted branches for every floating-point number. I also get
the number of instructions retired per cycle. It appears that on this
benchmark, the Apple M1 processor gets close to 8 instructions
retired per cycle when parsing numbers with the fast_float library.
That is a score far higher than anything possible on an Intel
processor.
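
The instructions-per-cycle figures follow directly from the two lines above them in the output. As a tiny sketch (the function name is hypothetical):

```cpp
#include <cassert>

// Instructions retired per cycle, from per-float averages.
double instructions_per_cycle(double instructions, double cycles) {
  assert(cycles > 0);
  return instructions / cycles;
}
```

With the numbers reported above, instructions_per_cycle(226.03, 29.02) is about 7.79 for fastfloat and instructions_per_cycle(491.89, 83.83) is about 5.87 for strtod.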

You should note how precise the results are: the minimum and the
average number of cycles are almost identical. It is quite uncommon
in my experience to get such consistent numbers on a laptop. But
these Apple M1 systems seem to show remarkably little variation. It
suggests that there is little in the way of thermal constraints. I
usually avoid benchmarking on laptops, but I make an exception with
these laptops.

To my knowledge, none of this performance-counter access is
documented by Apple. Thus my code should be viewed with suspicion. It
is possible that these numbers are not what I take them to be.
However, the numbers are generally credible.

My source code is available.

Posted on March 24, 2021 by Daniel Lemire
