[HN Gopher] Speeding up Cython with SIMD
___________________________________________________________________
Speeding up Cython with SIMD
Author : PaulHoule
Score : 65 points
Date : 2023-10-30 13:34 UTC (9 hours ago)
(HTM) web link (pythonspeed.com)
(TXT) w3m dump (pythonspeed.com)
| alecmg wrote:
| Ahh, I wish they included a speed comparison to numpy.average.
|
| I know, that's not the point and average was only picked as a
| simple example, but still...
| akasakahakada wrote:
| Agree. Applying a normal Python for loop to a NumPy array to do
| simple math is just pure nonsense.
|
| Just tested how it would be without the compile nonsense.
|
| ```
| import numpy as np
|
| a = np.random.random(int(1e6))
|
| %timeit np.average(a)
|
| %timeit np.average(a[::16])
| ```
|
| And my result is that no matter how non-contiguous the access is
| (here I take every 16th element like they did, and I also tested
| strides of 2, 4, 8 and 16), we are doing fewer operations, so it
| always ends up faster. By contrast, their SIMD-compiled code is
| 10-20x slower in the non-contiguous case.
|
| And for a larger array that is 16x the size of the contiguous
| one, but from which we only take 1/16 of the elements, the
| result is roughly 10x slower, as shown by the article. But I
| suspect that is simply because you now have a 16x larger array
| to load from memory, which is slow in itself.
|
| ```
| b = np.random.random(int(16e6))
|
| %timeit np.average(b[::16])
| ```
|
| Which leads me to conclude that people should use NumPy the
| right way. It is really hard to beat pure NumPy speed.
| nerdponx wrote:
| But that's precisely what makes this a good exercise: you can
| see how far you are able to close the gap between the naive
| looping implementation and the optimized array
| implementation.
| thatsit wrote:
| A few years ago I tried to beat the C/C++ compiler on speed
| with manual SIMD instructions vs pure C/C++. It didn't work
| out...
|
| I can only imagine that this is already baked into NumPy by now.
| cozzyd wrote:
| You usually have to unroll your loops for it to help
| (unless compilers have gotten smarter about data
| dependencies)
| Elucalidavah wrote:
| > np.average
|
| But that's not the function in the article. The article
| implements `(a + b) / 2`.
|
| And, on my system, a simple `return (arr1 + arr2) / 2` takes
| 1.2 ms, while `average_arrays_4` takes 0.74 ms.
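|
| A minimal sketch of that comparison (the array size here is my
| assumption, not the article's exact benchmark):
|
| ```
| import numpy as np
|
| arr1 = np.random.random(1_000_000)
| arr2 = np.random.random(1_000_000)
|
| # the plain NumPy expression from the comparison above
| %timeit (arr1 + arr2) / 2
| ```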
| jvanderbot wrote:
| IIRC you need to enable AVX-type instructions to really have
| SIMD:
|
| -mavx2 -mfma -mavx512pf -msse4.2 etc
| toth wrote:
| Or alternatively, -march=native should do all of those and more
| if your CPU supports them
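|
| A sketch of one way to pass such flags to a Cython build via
| setuptools (module and file names are placeholders, not from
| the article):
|
| ```
| # setup.py -- hypothetical module/file names
| from setuptools import setup, Extension
| from Cython.Build import cythonize
|
| ext = Extension(
|     "average",
|     ["average.pyx"],
|     extra_compile_args=["-O3", "-mavx2", "-mfma"],  # or "-march=native"
| )
|
| setup(ext_modules=cythonize([ext]))
| ```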
| itamarst wrote:
| Author here. The original article I was going to write was
| about using newer instruction sets, but then I discovered it
| doesn't even use original SSE instructions by default, so I
| wrote this instead.
|
| Eventually I'll write that other article; I've been wondering
| if it's possible to have infrastructure to support both modern
| and old CPUs in Python libraries without doing runtime dispatch
| on the C level, so this may involve some coding if I have time.
| jvanderbot wrote:
| Yeah. And I don't mean this in a "no true Scotsman" way. I
| really have trouble coaxing out any kind of instruction-level
| parallelism without those.
| itamarst wrote:
| There's presumably a reason they've spent the past 20 years
| adding additional instructions to CPUs, yeah :) And a large
| part of the Python ecosystem just ignores all of them.
| (NumPy has a bunch of SIMD with function-level dispatch,
| and they add more over time.)
| dec0dedab0de wrote:
| Off topic, but Cython working with type hints instead of only
| its custom cdef syntax is the best thing to happen to Cython,
| and IMHO the best reason for using type hints. I miss the days
| when "pure python" was a badge of honor, and by using type
| hints you can get speed from compiling when possible, and the
| portability of just running Python otherwise. Of course that
| level of portability has really gone out the window with so
| many things dropping backwards compatibility over the years,
| but it's still a nice dream.
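|
| A minimal sketch of that "pure Python" style, assuming Cython
| 3's annotation support; the same file runs unmodified as plain
| Python and compiles to C when Cython is available:
|
| ```
| import cython
|
| def mean(values: cython.double[:]) -> cython.double:
|     # typed locals let Cython generate a plain C loop when compiled
|     total: cython.double = 0.0
|     i: cython.Py_ssize_t
|     for i in range(values.shape[0]):
|         total += values[i]
|     return total / values.shape[0]
| ```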
| aldanor wrote:
| There's just less need for cython these days.
|
| If you're writing a compiled extension, there's pybind11 or,
| better yet, pyo3, where you can also get access to internals of
| polars and other libraries.
|
| In numeric Python, and especially with computers becoming
| progressively faster, it's rarely the case that the layer of
| pure Python is where the CPU time is spent. And in the rare
| case when you need to do something funky for-loop-style with
| your numpy data, there's jit-compiled numba...
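|
| A minimal sketch of that kind of Numba-jitted loop, using the
| article's (a + b) / 2 example (the function name is just
| illustrative):
|
| ```
| import numpy as np
| from numba import njit
|
| @njit
| def average_arrays(a, b):
|     # explicit loop that Numba compiles to machine code
|     out = np.empty_like(a)
|     for i in range(a.shape[0]):
|         out[i] = (a[i] + b[i]) / 2
|     return out
| ```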
| alfalfasprout wrote:
| Nowadays the Python bottleneck is typically the GIL. Assuming
| you're already using native extensions for heavy lifting, the
| GIL is what prevents workloads from being efficiently
| multithreaded if they need any sort of communication, without
| really hacky workarounds.
|
| I have a feeling this is ultimately why PEP 703 is being
| accepted despite the setbacks to the Faster CPython effort --
| while a faster CPython is a great goal, it is rarely the
| bottleneck nowadays.
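|
| For what it's worth, a native extension can already sidestep
| the GIL for this kind of numeric work; a minimal Cython sketch
| using prange (an OpenMP-backed parallel loop, assuming the
| extension is built with OpenMP flags):
|
| ```
| # sketch of a .pyx file; not from the article
| from cython.parallel import prange
|
| def average_arrays_parallel(double[::1] a, double[::1] b,
|                             double[::1] out):
|     cdef Py_ssize_t i
|     # the loop body runs on multiple threads with the GIL released
|     for i in prange(a.shape[0], nogil=True):
|         out[i] = (a[i] + b[i]) / 2.0
| ```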
| Qem wrote:
| > I have a feeling this is ultimately why PEP 703 is being
| accepted despite the setbacks to the Faster CPython effort --
| while a faster CPython is a great goal, it is rarely the
| bottleneck nowadays.
|
| How much of a setback is that expected to be? The initial
| target was a 5x improvement over 4 releases, IIRC. What are the
| current estimates like?
| nw05678 wrote:
| It's a pity that Jython was never more of a thing. I got
| promoted for integrating it with the product we were working on.
___________________________________________________________________
(page generated 2023-10-30 23:01 UTC)