https://lemire.me/blog/2024/02/19/measuring-energy-usage-regular-code-vs-simd-code/

Daniel Lemire's blog

Daniel Lemire is a computer science professor at the Data Science
Laboratory of the Université du Québec (TÉLUQ) in Montreal. His
research is focused on software performance and data engineering. He
is a techno-optimist and a free-speech advocate.

Measuring energy usage: regular code vs. SIMD code

Modern processors have fancy instructions that can do many operations
at once using wide registers: SIMD instructions. Intel and AMD have
512-bit registers and associated instructions under AVX-512.

You expect these instructions to use more power, more energy.
However, they get the job done faster. Do you save energy overall?
You should expect so.

Let us consider an example. I can just sum all values in a large
array.

float sum(float *data, size_t N) {
  double counter = 0;
  for (size_t i = 0; i < N; i++) {
    counter += data[i];
  }
  return counter;
}

If I leave it as is, the compiler might be tempted to optimize too
much, but I can instruct it to avoid 'autovectorization': it will not
do anything fancy.
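
The post does not say which mechanism was used to turn off autovectorization, so the following is only a sketch of one possible approach. It assumes GCC or Clang: the function name sum_scalar is hypothetical, the optimize attribute is GCC-specific, and the loop pragma is Clang-specific. The same effect can usually be had from the command line with -fno-tree-vectorize (GCC) or -fno-vectorize -fno-slp-vectorize (Clang).

```c
#include <stddef.h>

/* Hypothetical name: a scalar sum that asks the compiler not to
   autovectorize it. Assumes GCC or Clang; other compilers simply
   ignore the guarded lines. */
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-tree-vectorize"))) /* GCC only */
#endif
float sum_scalar(const float *data, size_t n) {
  double counter = 0; /* accumulate in double, as in the post */
#if defined(__clang__)
#pragma clang loop vectorize(disable) interleave(disable)
#endif
  for (size_t i = 0; i < n; i++) {
    counter += data[i];
  }
  return (float)counter;
}
```

Checking the generated assembly is the only reliable way to confirm that no wide registers are used.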

I can write the equivalent function using AVX-512 intrinsic
functions. The details do not matter too much, just trust me that it
is expected to be faster for sufficiently long inputs.

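#include <immintrin.h> // requires AVX-512F and AVX-512DQ support

float sum(float *data, size_t N) {
  __m512d counter = _mm512_setzero_pd();
  // Process full blocks of 16 floats; stop before a partial block.
  for (size_t i = 0; i + 16 <= N; i += 16) {
    __m512 v = _mm512_loadu_ps(&data[i]);
    // Widen each half (8 floats) to 8 doubles before accumulating.
    __m512d part1 = _mm512_cvtps_pd(_mm512_extractf32x8_ps(v, 0));
    __m512d part2 = _mm512_cvtps_pd(_mm512_extractf32x8_ps(v, 1));
    counter = _mm512_add_pd(counter, part1);
    counter = _mm512_add_pd(counter, part2);
  }
  double sum = _mm512_reduce_add_pd(counter);
  // Add the remaining values (fewer than 16) one by one.
  for (size_t i = N / 16 * 16; i < N; i++) {
    sum += data[i];
  }
  return sum;
}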

Under Linux, we can ask the kernel about power usage. You can query
the power usage of different components, but I query the overall
power usage. This includes, among other things, the power usage of
the memory system. It works well with Intel processors as long as you
have privileged access on the system. I wrote a little benchmark that
runs both functions.
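
The post does not name the kernel interface it queries. One interface that Linux exposes on Intel systems is the powercap (RAPL) sysfs tree, so here is a minimal sketch that assumes it. The function names are hypothetical, and the file paths mentioned afterwards are assumptions that vary from machine to machine.

```c
#include <stdio.h>

/* Read one cumulative energy counter, in microjoules, from a
   powercap-style file. Returns 0 on success, -1 on failure. */
int read_energy_uj(const char *path, unsigned long long *out) {
  FILE *f = fopen(path, "r");
  if (f == NULL) {
    return -1; /* missing file or insufficient privileges */
  }
  int ok = fscanf(f, "%llu", out) == 1;
  fclose(f);
  return ok ? 0 : -1;
}

/* Energy consumed between two readings. The counter wraps around
   at max_range, so this handles at most one wrap. */
unsigned long long energy_delta_uj(unsigned long long before,
                                   unsigned long long after,
                                   unsigned long long max_range) {
  if (after >= before) {
    return after - before;
  }
  return (max_range - before) + after;
}
```

On a typical Intel machine, the counter for the first package is found at /sys/class/powercap/intel-rapl:0/energy_uj, with the wrap-around limit in max_energy_range_uj next to it. You would read the counter before and after the benchmark, then divide the difference by the elapsed time or by the number of values processed.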

On a 32-core Ice Lake processor, my results are as follows:

            energy per unit of time   energy per value
naive code  0.055 μJ/s                0.11 μJ/value
AVX-512     0.061 μJ/s                0.032 μJ/value

So the AVX-512 code uses about 3.5 times less energy overall, despite
consuming about 10% more energy per unit of time.
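
As a quick check of that arithmetic, the ratios can be recomputed from the reported figures. The sketch below only uses the four numbers from the results; the function names are hypothetical. The last function derives an implied speedup, which is not a figure reported in the post: it follows from the fact that the time per value is the energy per value divided by the energy per unit of time.

```c
/* Figures copied from the results reported above. */
static const double naive_power = 0.055, naive_energy = 0.11;
static const double simd_power = 0.061, simd_energy = 0.032;

/* How many times less energy per value the AVX-512 code uses. */
double energy_ratio(void) { return naive_energy / simd_energy; }

/* How much more energy per unit of time the AVX-512 code uses. */
double power_ratio(void) { return simd_power / naive_power; }

/* Implied speedup: time per value = energy per value / power. */
double implied_speedup(void) {
  return (naive_energy / naive_power) / (simd_energy / simd_power);
}
```

The energy ratio comes out at about 3.4 and the power ratio at about 1.11, which match the rounded "3.5 times" and "10%". The implied speedup is about 3.8: the AVX-512 code has to be that much faster to save that much energy while drawing more power.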

My benchmark is naive and should only serve as an illustration. The
general principle holds, however: if your tasks complete much faster,
you are likely to use less energy overall, even if you are using more
power, that is, more energy per unit of time.

Published by Daniel Lemire, a computer science professor at the
University of Quebec (TELUQ).

Posted on February 19, 2024
