[HN Gopher] Speeding up Cython with SIMD
       ___________________________________________________________________
        
       Speeding up Cython with SIMD
        
       Author : PaulHoule
       Score  : 65 points
       Date   : 2023-10-30 13:34 UTC (9 hours ago)
        
 (HTM) web link (pythonspeed.com)
 (TXT) w3m dump (pythonspeed.com)
        
       | alecmg wrote:
        | ahh, I wish they included a speed comparison to numpy.average
        | 
        | I know, that's not the point and average was only picked as a
        | simple example, but still...
        
         | akasakahakada wrote:
          | Agree. A plain Python for loop applied to a NumPy array to
          | do simple math is just pure nonsense.
          | 
          | Just tested how it would be without the compile nonsense.
         | 
          | ```
          | a = np.random.random(int(1e6))
          | 
          | %timeit np.average(a)
          | 
          | %timeit np.average(a[::16])
          | ```
         | 
          | And my result is that no matter how non-contiguous the view
          | is in memory (here I take every 16th element like they did,
          | and I tested strides of 2, 4, 8, 16), we are doing fewer
          | operations, so it always ends up faster. By contrast, their
          | SIMD-compiled code is 10-20X slower in the non-contiguous
          | case.
          | 
          | And for a larger array that is 16X the size of the
          | contiguous one, where we only take 1/16 of its elements, the
          | result is about 10X slower, as shown by the article. But I
          | suspect that is purely because you now have a 16X larger
          | array to load from memory, which is itself slow.
         | 
          | ```
          | b = np.random.random(int(16e6))
          | 
          | %timeit np.average(b[::16])
          | ```
         | 
          | Which concludes that people should use NumPy the right way.
          | It is really hard to beat pure NumPy speed.
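Outside a notebook, the `%timeit` comparison above can be reproduced with the stdlib `timeit` module. A minimal sketch of the same measurement, with the sizes and stride taken from the comment:

```python
import timeit

import numpy as np

a = np.random.random(int(1e6))

# Full contiguous array vs. a strided view taking every 16th element.
# The view has 1/16 as many elements, so even though each access is
# non-contiguous, np.average has much less work to do overall.
t_full = timeit.timeit(lambda: np.average(a), number=100)
t_strided = timeit.timeit(lambda: np.average(a[::16]), number=100)

print(f"contiguous: {t_full:.4f}s")
print(f"every 16th: {t_strided:.4f}s")
```

Note that `a[::16]` is a view, not a copy, so the strided timing includes the cost of the non-contiguous memory access pattern.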
        
           | nerdponx wrote:
            | But that's precisely what makes this a good exercise: you
            | can see how far you are able to close the gap between the
            | naive looping implementation and the optimized array
            | implementation.
        
           | thatsit wrote:
            | A few years ago I tried to beat the C/C++ compiler on
            | speed with manual SIMD instructions vs. pure C/C++. It
            | didn't work out...
            | 
            | I can only imagine that this is already baked into NumPy
            | now.
        
             | cozzyd wrote:
             | You usually have to unroll your loops for it to help
             | (unless compilers have gotten smarter about data
             | dependencies)
        
           | Elucalidavah wrote:
           | > np.average
           | 
           | But that's not the function in the article. The article
           | implements `(a + b) / 2`.
           | 
           | And, on my system, simple `return (arr1 + arr2) / 2` takes
           | 1.2ms, while the `average_arrays_4` takes 0.74ms.
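For reference, a plain NumPy version of the `(a + b) / 2` computation being benchmarked. The `out=` form is an assumption, shown only to illustrate avoiding the extra temporary that `(arr1 + arr2) / 2` allocates:

```python
import numpy as np

def average_arrays(arr1, arr2, out=None):
    """Elementwise (arr1 + arr2) / 2, with optional preallocated output."""
    out = np.add(arr1, arr2, out=out)  # writes the sum into `out` (or a new array)
    out /= 2                           # in-place halving, no second temporary
    return out

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 4.0, 5.0])
print(average_arrays(a, b))  # -> [2. 3. 4.]
```

Passing a reusable `out` buffer is one way the plain-NumPy baseline can be tightened before reaching for Cython at all.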
        
       | jvanderbot wrote:
       | IIRC you need to enable AVX-type instructions to really have
       | SIMD:
       | 
       | -mavx2 -mfma -mavx512pf -msse4.2 etc
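In a Cython project those flags typically go into `extra_compile_args` on the extension. A hypothetical `setup.py` sketch (the module name and exact flag set are illustrative, the flags assume gcc/clang, and `-mavx2`-style flags make the binary non-portable to older CPUs):

```python
# setup.py: build a Cython extension with SIMD-enabling compiler flags.
from setuptools import Extension, setup
from Cython.Build import cythonize

extensions = [
    Extension(
        "average",  # hypothetical module name
        ["average.pyx"],
        extra_compile_args=["-O3", "-mavx2", "-mfma"],
    )
]

setup(ext_modules=cythonize(extensions))
```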
        
         | toth wrote:
          | Or alternatively, -march=native should enable all of those
          | and more, if your CPU supports them
        
         | itamarst wrote:
         | Author here. The original article I was going to write was
         | about using newer instruction sets, but then I discovered it
         | doesn't even use original SSE instructions by default, so I
         | wrote this instead.
         | 
         | Eventually I'll write that other article; I've been wondering
         | if it's possible to have infrastructure to support both modern
         | and old CPUs in Python libraries without doing runtime dispatch
         | on the C level, so this may involve some coding if I have time.
        
           | jvanderbot wrote:
            | Yeah. And I don't mean this in a "no true Scotsman" way.
            | I really have trouble coaxing out any kind of
            | instruction-level parallelism w/o those.
        
             | itamarst wrote:
             | There's presumably a reason they've spent the past 20 years
             | adding additional instructions to CPUs, yeah :) And a large
             | part of the Python ecosystem just ignores all of them.
             | (NumPy has a bunch of SIMD with function-level dispatch,
             | and they add more over time.)
        
       | dec0dedab0de wrote:
        | off topic, but Cython working with type hints instead of only
        | using its custom cdef syntax is the best thing to happen to
        | Cython, and imho the best reason for using type hints. I miss
        | the days when "pure python" was a badge of honor; by using
        | type hints you can get speed from compiling when possible, and
        | the portability of just running Python otherwise. Of course
        | that level of portability has really gone out the window with
        | so many things dropping backwards compatibility over the
        | years, but it's still a nice dream.
        
         | aldanor wrote:
         | There's just less need for cython these days.
         | 
         | If you're writing a compiled extension, there's pybind11 or,
         | better yet, pyo3, where you can also get access to internals of
         | polars and other libraries.
         | 
          | In numeric Python, and especially with computers becoming
          | progressively faster, it's rarely the case that the layer of
          | pure Python is where the CPU time is spent. And in the rare
          | case when you need to do something funky for-loop-style with
          | your numpy data, there's JIT-compiled numba...
        
           | alfalfasprout wrote:
            | Nowadays the Python bottleneck is more often the GIL.
            | Assuming you're already using native extensions for the
            | heavy lifting, the GIL is what prevents workloads from
            | being efficiently multithreaded if they need any sort of
            | communication, short of really hacky workarounds.
            | 
            | I have a feeling this is ultimately why PEP 703 is being
            | accepted despite the setbacks to the Faster CPython
            | effort-- while faster CPython is a great goal, it is
            | rarely the bottleneck nowadays.
        
             | Qem wrote:
              | > I have a feeling this is ultimately why PEP 703 is
              | being accepted despite the setbacks to the Faster
              | CPython effort-- while faster CPython is a great goal,
              | it is rarely the bottleneck nowadays.
             | 
             | How much of a setback that should be? The initial target
             | was a 5x improvement over 4 releases, IIRC. What are
             | current estimates like?
        
       | nw05678 wrote:
        | It's a pity that Jython was never more of a thing. I got
        | promoted for integrating it with the product we were working
        | on.
        
       ___________________________________________________________________
       (page generated 2023-10-30 23:01 UTC)