[HN Gopher] Vectorizing Ragged Arrays
___________________________________________________________________
Vectorizing Ragged Arrays
Author : vladf
Score : 21 points
Date : 2021-12-07 17:48 UTC (5 hours ago)
(HTM) web link (vladfeinberg.com)
(TXT) w3m dump (vladfeinberg.com)
| [deleted]
| johndough wrote:
 | While this is a neat trick, the claim of a "100x" speedup for the
 | centroid computation is somewhat unrealistic, since I doubt there
 | are many practical applications where one wants to compute 10000
 | centroids for 10000 data points. The proposed technique is 50%
 | slower than the naive implementation for a more reasonable k = 10
 | centroids.
|
 | Anyway, JIT compilers are quite good these days. A naive
 | implementation with for loops beats the fastest version in both
 | cases by a factor of 5 (here with numba):
 |
 |     import numpy as np
 |     from numba import njit
 |
 |     @njit
 |     def numba_centroids(X_nd, label_n):
 |         n, d = X_nd.shape
 |         k = label_n.max() + 1
 |         c_kd = np.zeros((k, d))
 |         weights = np.zeros(k)
 |         dist_n = np.zeros(n)
 |         for i in range(n):
 |             label = label_n[i]
 |             for j in range(d):
 |                 c_kd[label, j] += X_nd[i, j]
 |             weights[label] += 1
 |         c_kd /= weights.reshape(k, 1) + 1e-10
 |         for i in range(n):
 |             label = label_n[i]
 |             squared_difference = 0.0
 |             for j in range(d):
 |                 difference = c_kd[label, j] - X_nd[i, j]
 |                 squared_difference += difference * difference
 |             dist_n[i] = squared_difference
 |         return c_kd, dist_n
| shoo wrote:
 | Similarly, a translation into bad cython code, compiled with
 | cython's defaults (-O2 but without taking advantage of cpu
 | specific SIMD instruction sets such as avx) runs a touch over
 | 1000x faster than the baseline python version:
 |
 |     ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
 |         20    5.020    0.251    9.129    0.456  main.py:4(centroids)  # baseline version from blog
 |         20    0.007    0.000    0.008    0.000  {built-in method cycentroids.centroids}  # below
 |
 |     cimport cython
 |     import numpy as np
 |
 |     @cython.boundscheck(False)
 |     @cython.wraparound(False)
 |     @cython.cdivision(True)
 |     cpdef centroids(const double[:, :] X_nd, const long[:] label_n):
 |         cdef size_t n, d, k, k_ix, n_ix, d_ix, label
 |         cdef double[:, :] c_kd
 |         cdef long[:] counts_k
 |         cdef double[:] dist_n
 |         cdef double acc, z, delta
 |         n = X_nd.shape[0]
 |         d = X_nd.shape[1]
 |         k = np.max(label_n) + 1
 |         c_kd = np.zeros((k, d), dtype=np.float64)
 |         dist_n = np.zeros((n,), dtype=np.float64)
 |         counts_k = np.zeros((k,), dtype=np.int64)
 |         # pass one: compute centroids
 |         for n_ix in range(n):
 |             label = label_n[n_ix]
 |             counts_k[label] += 1
 |             for d_ix in range(d):
 |                 c_kd[label, d_ix] += X_nd[n_ix, d_ix]
 |         for k_ix in range(k):
 |             z = 1.0 / counts_k[k_ix]
 |             for d_ix in range(d):
 |                 c_kd[k_ix, d_ix] = z * c_kd[k_ix, d_ix]
 |         # pass two: compute square distances to centroids
 |         for n_ix in range(n):
 |             label = label_n[n_ix]
 |             acc = 0.0
 |             for d_ix in range(d):
 |                 delta = X_nd[n_ix, d_ix] - c_kd[label, d_ix]
 |                 acc += delta * delta
 |             dist_n[n_ix] = acc
 |         return c_kd, dist_n
| vladf wrote:
| Numba is great! Whenever you're using CPU and have very simple
| parallelism patterns, it's your best bet for speeding up numpy.
|
| But if you needed to do this on a GPU or TPU, ideally with
| native and transparent SIMT, such as the case for the SO
| question inspiring the post (an unsupervised centroid-based
| loss for a deep learning setting), would you have to write a
| custom C++ kernel to do this?
|
| Free SIMT may even make this worthwhile in the few-centroid
| setting.
___________________________________________________________________
(page generated 2021-12-07 23:03 UTC)