Generate Python extensions using Nim language

A Python extension is, generally, code written in a language other than Python that extends the Python ecosystem by exposing an interface compatible with the Python API. Such extensions usually take the form of compiled code loaded dynamically by the Python runtime. The ability to create extensions quickly, to be used directly from Python, is a great advantage. Ideally, users would write the bottleneck parts of their code in their favorite (generally faster-than-Python) language to offload the compute-intensive parts, and use them from Python with all of its flexibility. Extensions written in such a way let users utilize all the features offered by the foreign language/dialect, including but not limited to multiprocessing, assembly-level instructions, and meta-programming, hence also bypassing the Python GIL. Many modules fundamental to data science, like numpy and scikit-learn, are themselves written in C/C++.

There are lots of options for writing such extensions, ranging from more specific dialects like Cython and Numba (for accelerating numpy-array-based operations) to general foreign-language bindings for Rust, C++, and C. Writing extensions for Python is anything but a delightful experience, generally because a lot of boilerplate/glue code is needed to wrap the extension.

In this post, we will be looking at Nim. Nim compiles to C, hence it can generate compiled code to be loaded as a module from Python. I personally have found Nim a highly productive language and have written a lot of code in it for various use-cases. For me, it fulfills what it promises while offering a lot of flexibility, like directly compiling any already existing C code for fast prototyping. We will cover a concrete example to show the practical usefulness of a Nim-based Python extension by writing a simple image preprocessing pipeline.
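The payoff of offloading hot loops to compiled code is easy to see with numpy, which (as noted above) is itself a C extension. Here is a minimal, self-contained sketch; the dot-product workload and the array size are illustrative choices of mine, not from the post:

```python
import time
import numpy as np

def py_dot(a, b):
    """Pure-Python dot product: every multiply/add runs in the interpreter."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

if __name__ == "__main__":
    n = 100_000
    a, b = np.random.rand(n), np.random.rand(n)

    t0 = time.perf_counter()
    slow = py_dot(a.tolist(), b.tolist())
    t1 = time.perf_counter()
    fast = float(a @ b)  # the same reduction, executed by numpy's compiled kernel
    t2 = time.perf_counter()

    # Both paths compute the same value (up to float rounding); the compiled
    # kernel is typically orders of magnitude faster on a loop like this.
    assert abs(slow - fast) < 1e-6 * n
    print(f"pure Python: {t1 - t0:.4f}s  numpy kernel: {t2 - t1:.6f}s")
```

This interpreter-versus-compiled-kernel gap is exactly what a hand-written extension closes for operations that no existing module implements.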
We will try to establish a simple baseline with generic single-threaded code, without going into optimizations.

Image preprocessing

Image preprocessing is generally the first step in computer-vision-based ML models, and roughly follows this order:

1. Convert uint8 [HWC] data to float32 in the range [0-1].
2. Optionally convert the [HWC] format to [CHW], depending on the framework.
3. Resize to the expected INPUT-SIZE (dictated by the model being used), like 224x224 or 256x256.
4. Normalize the data (by subtracting the MEAN and dividing by the standard deviation).

Optimizing this part is worthwhile since this pipeline needs to be run for each frame/image. Preprocessing and post-processing code is generally the least optimized code, given that it is highly specific to each model, and hence contributes significant latency if not properly optimized.

Code:

Assumptions:

1. Input data is contiguous in memory, i.e. the data is packed without any padding, in consecutive memory locations.
2. Input data contains uint8 values, i.e. values in the range [0-255].
3. Input data follows the HWC format, i.e. the stride of the channel dimension is 1.
4. Output data produced is also contiguous.
5. Output data format is [CHW], i.e. the stride of the width dimension is 1.

Resizing involves collecting one input/source pixel (nearest neighbour) or a weighted combination of input pixels (bilinear), based on the logical coordinates (i, j) of each output pixel location. We can convert the collected input (uint8) data into float32 before writing the corresponding output/destination memory location, effectively fusing steps 1 and 3. We can decide whether we want the output format to be CHW or HWC beforehand (or write 2 different implementations), thus also fusing step 2.

# Writing a simple Resize function that fuses steps 1, 2 and 3 of our pipeline, for the case where the output format is CHW.
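An extension like this is easiest to trust when checked against a reference implementation. The fused transform described above (gather a nearest-neighbour source pixel, convert uint8 to float32 in [0-1], write in CHW order) can be sketched in NumPy; the function name and the /255 scaling are illustrative choices of mine, useful for validating the output of the Nim routine that follows:

```python
import numpy as np

def hwc2chw_resize_ref(inp: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Reference for the fused steps 1-3:
    uint8 [H, W, C] input -> float32 [C, out_h, out_w] output in [0, 1]."""
    inp_h, inp_w, _c = inp.shape
    # Nearest-neighbour source index per output row/column, clamped to the
    # input bounds (floor of out_index * scale, capped at input_size - 1).
    src_h = np.minimum((np.arange(out_h) * (inp_h / out_h)).astype(int), inp_h - 1)
    src_w = np.minimum((np.arange(out_w) * (inp_w / out_w)).astype(int), inp_w - 1)
    gathered = inp[src_h[:, None], src_w[None, :], :]  # [out_h, out_w, C], uint8
    # Fuse the uint8 -> float32 [0-1] conversion with the HWC -> CHW transpose.
    return gathered.astype(np.float32).transpose(2, 0, 1) / 255.0
```

Calling `hwc2chw_resize_ref(img, 224, 224)` on a [576, 768, 3] uint8 array returns a [3, 224, 224] float32 array; comparing it elementwise against the extension's output catches indexing mistakes early.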
import math

# Nearest-neighbour-based routine to calculate the corresponding input/source index, given the output index.
proc nearest_neighbour_compute_source_index(scale: float, out_index: int, input_size: int): int =
    return min(int(floor(out_index.float * scale)), input_size - 1)

proc hwc2chw_resize_simple(inpRawData_ptr: ptr uint8, outRawData_ptr: ptr float32,
                           inpH: int, inpW: int, outH: int, outW: int, C: int = 3) =
    let
        inpRawData = cast[ptr UncheckedArray[uint8]](inpRawData_ptr)
        outRawData = cast[ptr UncheckedArray[float32]](outRawData_ptr)
    let
        scale_h: float = inpH.float / outH.float
        scale_w: float = inpW.float / outW.float
    # For each position in the output image/array, get the corresponding input pixel value.
    for h in 0 ..< outH:
        let src_h = nearest_neighbour_compute_source_index(scale_h, h, inpH)
        for w in 0 ..< outW:
            let src_w = nearest_neighbour_compute_source_index(scale_w, w, inpW)
            for c in 0 ..< C:
                # Gather the uint8 HWC input pixel, convert to float32 in [0-1],
                # and write it at the contiguous CHW output offset.
                outRawData[c*outH*outW + h*outW + w] =
                    inpRawData[src_h*inpW*C + src_w*C + c].float32 / 255'f32

For comparison, an equivalent pipeline built from torchvision transforms:

import PIL.Image
from torchvision import transforms

pipeline = transforms.Compose([
    transforms.ToTensor(),  ## uint8 [HWC] -> [CHW] [0-1] float32
    transforms.Normalize(mean=[0, 0, 0], std=[1, 1, 1])
])
image = PIL.Image.open("test.jpg")  ## uint8 HWC format [576, 768, 3]
output = pipeline(image)

Timing: the torchvision pipeline takes about 5.6 milliseconds, without even a color-conversion routine being run/fused. Even though this is a very basic timing comparison, it indicates the flexibility we can have when writing extensions for very specific needs, with much lower latency than already-compiled modules.

Remarks: In the same way, we should be able to fuse more operations depending on model requirements, like resizing while keeping the aspect ratio unchanged. In our experience, by writing preprocessing/post-processing code from scratch in Nim/C/Rust/Zig and wrapping existing fast implementations of operations like convolution, we have been able to make a lot of deep-learning-based models run in real time on consumer-grade CPUs even without AVX512 instructions, thus opening a lot of exciting opportunities, along with the added benefit of deploying them in production without much friction.

If you think I made a mistake or have any comments, please reach out to anubhav@ramanlabs.in.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
(c) 2022 RamanLabs