https://pythonspeed.com/articles/numba-profiling/

Python=Speed
- About
- Consulting
- Contact
Articles
- Docker packaging
- Faster data science
Products
- Docker packaging
- Faster data science

Profiling your Numba code

by Itamar Turner-Trauring
Last updated 30 Jan 2024, originally created 30 Jan 2024

If you're writing numeric Python code, Numba can be a great way to
speed up your program. By compiling a subset of Python to machine
code, Numba lets you write for loops and other constructs that would
be too slow in normal Python. In other words, it's similar to Cython,
C, or Rust, in that it lets you write compiled extensions for Python.

Numba code isn't always as fast as it could be, however. This is
where profiling is useful: it can find at least some of the
bottlenecks in your code.

In this article we'll cover:

  * Profila, a new profiler I've released that is specifically
    designed for Numba code.
  * The limits of profiling. There are many potential performance
    enhancements that a profiler can't and won't help you discover.

Introducing Profila: a profiler for Numba

Profila is a profiler specifically designed to identify which lines
of code in your Numba code are slow. It works via sampling: it runs a
gdb process that connects to your process every 10ms or so, gets a
backtrace, and uses that to identify which lines of code were
running. Get enough samples, and you'll get a sense of where the most
time was spent in your program.

Let's see Profila in action! We're going to profile the following
program, which dithers an image, switching it from grayscale to black
and white. (This is the initial naive version that I then optimized
for performance in a different article.)

from numba import njit
import numpy as np
from skimage import io
from time import time

IMAGE = io.imread("hallway.jpg")

# Code that is decorated with numba.njit looks like Python code,
# but is actually compiled into machine code, at runtime. This
# is fast, low-level code!
@njit
def dither(img):
    # Allow negative values and wider range than a uint8 has:
    result = img.astype(np.int16)
    y_size = img.shape[0]
    x_size = img.shape[1]
    last_y = y_size - 1
    last_x = x_size - 1
    for y in range(y_size):
        for x in range(x_size):
            old_value = result[y, x]
            if old_value < 0:
                new_value = 0
            elif old_value > 255:
                new_value = 255
            else:
                new_value = np.uint8(np.round(old_value / 255.0)) * 255
            result[y, x] = new_value
            # We might get a negative value for the error:
            error = np.int16(old_value) - new_value
            if x < last_x:
                result[y, x + 1] += error * 7 // 16
            if y < last_y and x > 0:
                result[y + 1, x - 1] += error * 3 // 16
            if y < last_y:
                result[y + 1, x] += error * 5 // 16
            if y < last_y and x < last_x:
                result[y + 1, x + 1] += error // 16

    return result.astype(np.uint8)

# Run a first time, to compile the code:
dither(IMAGE)

# Make sure we run this for long enough that the profiler
# gets sufficient samples:
start = time()
runs = 0
while time() - start < 5:
    runs += 1
    dither(IMAGE)

elapsed = time() - start
print(f"Processed {int(round(runs / elapsed))} images / sec")

We then install gdb and profila:

$ sudo apt-get install -y gdb
$ pip install profila

And now we can use Profila to profile the code:

$ python -m profila annotate -- dither.py
Processed 255 images / sec
# Total samples: 805 (37.8% non-Numba samples, 1.9% bad samples)

## File `dither.py`
Lines 14 to 40:

  0.1% |     result = img.astype(np.int16)
       |     y_size = img.shape[0]
       |     x_size = img.shape[1]
       |     last_y = y_size - 1
       |     last_x = x_size - 1
       |     for y in range(y_size):
  0.2% |         for x in range(x_size):
       |             old_value = result[y, x]
       |             if old_value < 0:
       |                 new_value = 0
  1.0% |             elif old_value > 255:
       |                 new_value = 255
       |             else:
 30.2% |                 new_value = np.uint8(np.round(old_value / 255.0)) * 255
       |             result[y, x] = new_value
       |             # We might get a negative value for the error:
  2.0% |             error = np.int16(old_value) - new_value
  0.6% |             if x < last_x:
 11.3% |                 result[y, x + 1] += error * 7 // 16
       |             if y < last_y and x > 0:
  0.9% |                 result[y + 1, x - 1] += error * 3 // 16
       |             if y < last_y:
  2.4% |                 result[y + 1, x] += error * 5 // 16
       |             if y < last_y and x < last_x:
  1.6% |                 result[y + 1, x + 1] += error // 16
       |
  0.2% |     return result.astype(np.uint8)

## File `numba/np/arraymath.py`
Lines 3157 to 3157:

  9.8% |                         return _np_round_float(a)

Optimizing the code

First, notice from the top of the output that 37.8 + 1.9 = about 40%
of the samples were either bad, or not running Numba code. This
includes time spent importing modules, running regular Python code,
compiling the Numba code, and so on. That means 60% of the time was
spent running actual Numba code.

Then, we can see the combination of np.round() and new_value =
np.uint8(np.round(old_value / 255.0)) * 255 is 30.2 + 9.8 = 40% of
total samples. So that means 40/60, two thirds of the time, was spent
in that one line of code, turning old_value into new_value.

As we saw in the original article featuring this example, we can
replace this code:

if old_value < 0:
    new_value = 0
elif old_value > 255:
    new_value = 255
else:
    new_value = np.uint8(np.round(old_value / 255.0)) * 255

With a much simper version:

new_value = 0 if old_value < 128 else 255

If we profile this optimized version we'll see:

$ python -m profila annotate -- dither2.py
Processed 758 images / sec
# Total samples: 713 (30.3% non-Numba samples, 2.2% bad samples)

## File `/home/itamarst/devel/sandbox/numba-profiling-article/dither2.py`
Lines 14 to 35:

  0.1% |     result = img.astype(np.int16)
       |     y_size = img.shape[0]
       |     x_size = img.shape[1]
       |     last_y = y_size - 1
       |     last_x = x_size - 1
       |     for y in range(y_size):
  0.3% |         for x in range(x_size):
  0.1% |             old_value = result[y, x]
  4.5% |             new_value = 0 if old_value < 128 else 255
  2.0% |             result[y, x] = new_value
       |             # We might get a negative value for the error:
  5.5% |             error = np.int16(old_value) - new_value
  5.3% |             if x < last_x:
 34.9% |                 result[y, x + 1] += error * 7 // 16
       |             if y < last_y and x > 0:
  3.6% |                 result[y + 1, x - 1] += error * 3 // 16
       |             if y < last_y:
  6.0% |                 result[y + 1, x] += error * 5 // 16
       |             if y < last_y and x < last_x:
  4.5% |                 result[y + 1, x + 1] += error // 16
       |
  0.6% |     return result.astype(np.uint8)

As was the case in the original article, our code now runs 3x as
fast!

And this time we didn't have to guess, we had a profiler to point us
at the main bottleneck.

The limits of profiling

While profiling is extremely helpful, it won't always tell you
everything you need to know to speed up your code. To learn more
about all of these issues, check out my upcoming book on writing
high-performance low-level code.

Limit #1: Compiled code doesn't always match the source code

Consider the following example:

@njit
def add(arr):
    result = np.empty_like(arr)
    for i in range(len(arr)):
        orig = arr[i]
        orig += 123_000_000
        orig += 456_000
        orig += 789
        result[i] = orig
    return result

If we profile it, we get:

## File `combo.py`
Lines 13 to 14:

  2.9% |         orig += 789
  7.5% |         result[i] = orig

Why is orig += 789 listed, but not the other lines adding 123_000_000
and 456_000? Because thanks to compiler optimizations, they no longer
exist in the compiled version as distinct operations. We can for
example inspect the generated assembly for the function and search
for 789:

>>> for line in next(iter(add.inspect_asm().values())).splitlines():
...     if "789" in line: print(line)
...
        .long   123456789
        movl    $123456789, %edi
        movl    $123456789, %edx

The three added numbers (123_000_000, 456_000, 789) have been
combined by the compiler into a single number (123_456_789) which
presumably is being used with a single addition operation, instead of
three.

Limit #2: You need a mental model of how CPU execution works

Recall that in our optimized dithering function we saw the following
profiling results:

  5.3% |             if x < last_x:
 34.9% |                 result[y, x + 1] += error * 7 // 16
       |             if y < last_y and x > 0:
  3.6% |                 result[y + 1, x - 1] += error * 3 // 16
       |             if y < last_y:
  6.0% |                 result[y + 1, x] += error * 5 // 16
       |             if y < last_y and x < last_x:
  4.5% |                 result[y + 1, x + 1] += error // 16

Why is the first calculation so much more expensive than the later
ones? They seem to be doing more or less the same thing.

One hypothesis is instruction-level parallelism: the CPU is able to
go ahead and speculatively execute the later code in parallel. So the
first line is expensive not because it's any different, but because
it's first. We can partially test this hypothesis by changing the
order of the statements and profiling again. Here's the result:

       |             if y < last_y and x > 0:
 20.3% |                 result[y + 1, x - 1] += error * 3 // 16
 14.2% |             if x < last_x:
  8.8% |                 result[y, x + 1] += error * 7 // 16
       |             if y < last_y:
  3.7% |                 result[y + 1, x] += error * 5 // 16
       |             if y < last_y and x < last_x:
  3.7% |                 result[y + 1, x + 1] += error // 16

Now the line result[y, x + 1] += error * 7 // 16 has a much lower
percentage of samples than in the previous version of the code, even
though it's running the exact same number of times and doing the
exact same calculation. The only thing that has changed is the order
of execution.

A mental model of a CPU that just executes code linearly, with no
parallelism, is not sufficient to understand these results.

Limit #3: Faster code may require a more holistic view

Seeing which lines of code are slow can be very helpful, but some
optimizations are a result of rethinking the code more broadly. For
example, the final optimized version in the original article cuts
memory usage by two thirds, and also runs 50% even faster than the
second version. It does so by changing the structure of the
algorithm, rather than just focusing on a single line or two of code.

Give Profila a try!

Profila is a new project: I released it the same day I wrote this
article. So try it out, and please let me know about any bugs you
find, your success stories, or any feedback at all, in the GitHub
issue tracker or by email.

    Find performance and memory bottlenecks in your data processing
    code with the Sciagraph profiler

    Slow-running jobs waste your time during development, impede your
    users, and increase your compute costs. Speed up your code and
    you'll iterate faster, have happier users, and stick to your
    budget--but first you need to identify the cause of the problem.

    Find performance bottlenecks and memory hogs in your data science
    Python jobs with the Sciagraph profiler. Profile in development
    and production, with multiprocessing support, on macOS and Linux,
    with built-in support for Jupyter notebooks.

    A performance timeline created by Sciagraph, showing both CPU and
    I/O as bottlenecks

    Speed up your Python code and learn skills you can use at your
    job

    Join over 7600 Python developers and data scientists learning
    practical tools and techniques every week, from Python
    performance to Docker packaging, by signing up for my newsletter.

  * Home
  * About
  * RSS Feed
  * Mastodon
  * Privacy policy
  * Terms & Conditions

(c) 2024 Hyphenated Enterprises LLC. All rights reserved.