Generate Python extensions using Nim language

A Python extension is, generally, code written in a language other than Python that extends the Python ecosystem by exposing an interface compatible with the Python API. Such extensions usually take the form of compiled code loaded dynamically by the Python runtime. The ability to create extensions quickly, to be used directly from Python, is a great advantage. Ideally, users would write the bottleneck parts of their code in their favorite (generally faster-than-Python) language to offload the compute-intensive parts, and use them from Python with all of its flexibility. Extensions written in such a way let users utilize all the features offered by the foreign language/dialect, including but not limited to multiprocessing, assembly-level instructions, and meta-programming, hence also bypassing the Python GIL. Many modules fundamental to data science, like numpy and scikit-learn, are themselves written in C/C++.

There are lots of options for writing such extensions, ranging from more specific dialects like Cython and Numba (for accelerating numpy-array-based operations) to general foreign-language bindings for Rust, C++, and C. Writing extensions for Python is anything but a delightful experience, generally because a lot of boilerplate/glue code is needed to wrap the extension.

In this post, we will be looking at Nim. Nim compiles to C, hence it can generate compiled code to be loaded as a module from Python. I personally have found Nim a highly productive language and have written a lot of code in it for various use-cases. For me, it fulfills what it promises while offering a lot of flexibility, like directly compiling any already existing C code for fast prototyping. We will cover a concrete example to show the practical usefulness of a Nim-based Python extension by writing a simple image preprocessing pipeline.
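The payoff of offloading hot loops to compiled code is easy to see with numpy, which (as noted above) is itself a C extension. Here is a minimal, self-contained sketch; the dot-product workload and the array size are illustrative choices of mine, not from the post:

```python
import time
import numpy as np

def py_dot(a, b):
    """Pure-Python dot product: every multiply/add runs in the interpreter."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

if __name__ == "__main__":
    n = 100_000
    a, b = np.random.rand(n), np.random.rand(n)

    t0 = time.perf_counter()
    slow = py_dot(a.tolist(), b.tolist())
    t1 = time.perf_counter()
    fast = float(a @ b)  # the same reduction, executed by numpy's compiled kernel
    t2 = time.perf_counter()

    # Both paths compute the same value (up to float rounding); the compiled
    # kernel is typically orders of magnitude faster on a loop like this.
    assert abs(slow - fast) < 1e-6 * n
    print(f"pure Python: {t1 - t0:.4f}s  numpy kernel: {t2 - t1:.6f}s")
```

This interpreter-versus-compiled-kernel gap is exactly what a hand-written extension closes for operations that no existing module implements.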
We will try to establish a simple baseline with generic single-threaded code, without going into optimizations.

Image preprocessing

Image preprocessing is generally the first step in computer-vision-based ML models, and roughly follows this order:

1. Convert uint8 [HWC] data to float32 in the range [0-1].
2. Optionally convert the [HWC] format to [CHW], depending on the framework.
3. Resize to the expected INPUT-SIZE (dictated by the model being used), like 224x224 or 256x256.
4. Normalize the data (by subtracting the MEAN and dividing by the standard deviation).

Optimizing this part is worthwhile since this pipeline needs to be run for each frame/image. Preprocessing and post-processing code is generally the least optimized code, given that it is highly specific to each model, and hence contributes significant latency if not properly optimized.

Code:

Assumptions:

1. Input data is contiguous in memory, i.e. the data is packed without any padding, in consecutive memory locations.
2. Input data contains uint8 values, i.e. values in the range [0-255].
3. Input data follows the HWC format, i.e. the stride of the channel dimension is 1.
4. Output data produced is also contiguous.
5. Output data format is [CHW], i.e. the stride of the width dimension is 1.

Resizing involves collecting one input/source pixel (nearest neighbour) or a weighted combination of input pixels (bilinear), based on the logical coordinates (i, j) of each output pixel location. We can convert the collected input (uint8) data into float32 before writing the corresponding output/destination memory location, effectively fusing steps 1 and 3. We can decide whether we want the output format to be CHW or HWC beforehand (or write 2 different implementations), thus also fusing step 2.

# Writing a simple Resize function that fuses steps 1, 2 and 3 of our pipeline, for the case where the output format is CHW.
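An extension like this is easiest to trust when checked against a reference implementation. The fused transform described above (gather a nearest-neighbour source pixel, convert uint8 to float32 in [0-1], write in CHW order) can be sketched in NumPy; the function name and the /255 scaling are illustrative choices of mine, useful for validating the output of the Nim routine that follows:

```python
import numpy as np

def hwc2chw_resize_ref(inp: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Reference for the fused steps 1-3:
    uint8 [H, W, C] input -> float32 [C, out_h, out_w] output in [0, 1]."""
    inp_h, inp_w, _c = inp.shape
    # Nearest-neighbour source index per output row/column, clamped to the
    # input bounds (floor of out_index * scale, capped at input_size - 1).
    src_h = np.minimum((np.arange(out_h) * (inp_h / out_h)).astype(int), inp_h - 1)
    src_w = np.minimum((np.arange(out_w) * (inp_w / out_w)).astype(int), inp_w - 1)
    gathered = inp[src_h[:, None], src_w[None, :], :]  # [out_h, out_w, C], uint8
    # Fuse the uint8 -> float32 [0-1] conversion with the HWC -> CHW transpose.
    return gathered.astype(np.float32).transpose(2, 0, 1) / 255.0
```

Calling `hwc2chw_resize_ref(img, 224, 224)` on a [576, 768, 3] uint8 array returns a [3, 224, 224] float32 array; comparing it elementwise against the extension's output catches indexing mistakes early.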
import math

# Nearest-neighbour-based routine to calculate the corresponding input/source index, given the output index.
proc nearest_neighbour_compute_source_index(scale: float, out_index: int, input_size: int): int =
    return min(int(floor(out_index.float * scale)), input_size - 1)

proc hwc2chw_resize_simple(inpRawData_ptr: ptr uint8, outRawData_ptr: ptr float32,
                           inpH: int, inpW: int, outH: int, outW: int, C: int = 3) =
    let
        inpRawData = cast[ptr UncheckedArray[uint8]](inpRawData_ptr)
        outRawData = cast[ptr UncheckedArray[float32]](outRawData_ptr)
    let
        scale_h: float = inpH.float / outH.float
        scale_w: float = inpW.float / outW.float
    # For each position in the output image/array, get the corresponding input pixel value.
    for h in 0 ..< outH:
        let src_h = nearest_neighbour_compute_source_index(scale_h, h, inpH)
        for w in 0 ..< outW:
            let src_w = nearest_neighbour_compute_source_index(scale_w, w, inpW)
            for c in 0 ..< C:
                # Gather the uint8 HWC input pixel, convert to float32 in [0-1],
                # and write it at the contiguous CHW output offset.
                outRawData[c*outH*outW + h*outW + w] =
                    inpRawData[src_h*inpW*C + src_w*C + c].float32 / 255'f32

For comparison, an equivalent pipeline built from torchvision transforms:

import PIL.Image
from torchvision import transforms

pipeline = transforms.Compose([
    transforms.ToTensor(),  ## uint8 [HWC] -> [CHW] [0-1] float32
    transforms.Normalize(mean=[0, 0, 0], std=[1, 1, 1])
])
image = PIL.Image.open("test.jpg")  ## uint8 HWC format [576, 768, 3]
output = pipeline(image)

Timing: the torchvision pipeline takes about 5.6 milliseconds, without even a color-conversion routine being run/fused. Even though this is a very basic timing comparison, it indicates the flexibility we can have when writing extensions for very specific needs, with much lower latency than already-compiled modules.

Remarks: In the same way, we should be able to fuse more operations depending on model requirements, like resizing while keeping the aspect ratio unchanged. In our experience, by writing preprocessing/post-processing code from scratch in Nim/C/Rust/Zig and wrapping existing fast implementations of operations like convolution, we have been able to make a lot of deep-learning-based models run in real time on consumer-grade CPUs even without AVX512 instructions, thus opening a lot of exciting opportunities, along with the added benefit of deploying them in production without much friction.

If you think I made a mistake or have any comments, please reach out to anubhav@ramanlabs.in.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
(c) 2022 RamanLabs