[HN Gopher] Vectorizing Ragged Arrays
       ___________________________________________________________________
        
       Vectorizing Ragged Arrays
        
       Author : vladf
       Score  : 21 points
       Date   : 2021-12-07 17:48 UTC (5 hours ago)
        
 (HTM) web link (vladfeinberg.com)
 (TXT) w3m dump (vladfeinberg.com)
        
       | [deleted]
        
       | johndough wrote:
        | While this is a neat trick, the claim of a "100x" speedup for
        | the centroid computation is somewhat unrealistic, since I doubt
        | there are many practical applications where one wants to compute
        | 10000 centroids for 10000 data points. The proposed technique is
        | 50% slower than the naive implementation for a more reasonable
        | k = 10 centroids.
       | 
        | Anyway, JIT compilers are quite good these days. A naive
        | implementation with for loops beats the fastest version in both
        | cases by a factor of 5 (here with numba):
        | 
        |     from numba import njit
        |     import numpy as np
        | 
        |     @njit
        |     def numba_centroids(X_nd, label_n):
        |         n, d = X_nd.shape
        |         k = label_n.max() + 1
        |         c_kd = np.zeros((k, d))
        |         weights = np.zeros(k)
        |         dist_n = np.zeros(n)
        |         for i in range(n):
        |             label = label_n[i]
        |             for j in range(d):
        |                 c_kd[label, j] += X_nd[i, j]
        |             weights[label] += 1
        |         c_kd /= weights.reshape(k, 1) + 1e-10
        |         for i in range(n):
        |             label = label_n[i]
        |             squared_difference = 0.0
        |             for j in range(d):
        |                 difference = c_kd[label, j] - X_nd[i, j]
        |                 squared_difference += difference * difference
        |             dist_n[i] = squared_difference
        |         return c_kd, dist_n
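For reference, the same two passes can also be written in plain vectorized NumPy without any JIT. This is a sketch, not from the thread: `np.add.at` performs the unbuffered scatter-add into the centroid accumulators, and `np.bincount` supplies the per-centroid counts.

```python
import numpy as np

def numpy_centroids(X_nd, label_n):
    n, d = X_nd.shape
    k = label_n.max() + 1
    # pass one: scatter-add each point into its centroid's accumulator
    c_kd = np.zeros((k, d))
    np.add.at(c_kd, label_n, X_nd)
    weights = np.bincount(label_n, minlength=k).astype(np.float64)
    c_kd /= weights.reshape(k, 1) + 1e-10
    # pass two: squared distance from each point to its assigned centroid
    dist_n = ((X_nd - c_kd[label_n]) ** 2).sum(axis=1)
    return c_kd, dist_n
```

Note that `np.add.at` is correct for repeated indices (unlike plain fancy-index assignment) but is known to be slow, which is part of why the JIT versions above win.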
        
          | shoo wrote:
          | Similarly, a translation into bad cython code, compiled with
          | cython's defaults (-O2, but without taking advantage of CPU-
          | specific SIMD instruction sets such as AVX), runs a touch over
          | 1000x faster than the baseline python version:
          | 
          |     ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
          |         20    5.020    0.251    9.129    0.456  main.py:4(centroids)                     # baseline version from blog
          |         20    0.007    0.000    0.008    0.000  {built-in method cycentroids.centroids}  # below
          | 
          |     cimport cython
          |     import numpy as np
          | 
          |     @cython.boundscheck(False)
          |     @cython.wraparound(False)
          |     @cython.cdivision(True)
          |     cpdef centroids(const double[:, :] X_nd, const long[:] label_n):
          |         cdef size_t n, d, k, k_ix, n_ix, d_ix, label
          |         cdef double[:, :] c_kd
          |         cdef long[:] counts_k
          |         cdef double[:] dist_n
          |         cdef double acc, z, delta
          | 
          |         n = X_nd.shape[0]
          |         d = X_nd.shape[1]
          |         k = np.max(label_n) + 1
          |         c_kd = np.zeros((k, d), dtype=np.float64)
          |         dist_n = np.zeros((n, ), dtype=np.float64)
          |         counts_k = np.zeros((k, ), dtype=np.int64)
          | 
          |         # pass one: compute centroids
          |         for n_ix in range(n):
          |             label = label_n[n_ix]
          |             counts_k[label] += 1
          |             for d_ix in range(d):
          |                 c_kd[label, d_ix] += X_nd[n_ix, d_ix]
          |         for k_ix in range(k):
          |             z = 1.0 / counts_k[k_ix]
          |             for d_ix in range(d):
          |                 c_kd[k_ix, d_ix] = z * c_kd[k_ix, d_ix]
          | 
          |         # pass two: compute square distances to centroids
          |         for n_ix in range(n):
          |             label = label_n[n_ix]
          |             acc = 0.0
          |             for d_ix in range(d):
          |                 delta = (X_nd[n_ix, d_ix] - c_kd[label, d_ix])
          |                 acc += (delta * delta)
          |             dist_n[n_ix] = acc
          |         return c_kd, dist_n
        
          | vladf wrote:
          | Numba is great! Whenever you're on CPU with very simple
          | parallelism patterns, it's your best bet for speeding up numpy.
          | 
          | But if you needed to do this on a GPU or TPU, ideally with
          | native and transparent SIMT, as is the case for the SO
          | question that inspired the post (an unsupervised centroid-based
          | loss in a deep learning setting), would you have to write a
          | custom C++ kernel to do this?
          | 
          | Free SIMT may even make this worthwhile in the few-centroid
          | setting.
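One accelerator-friendly answer, in the spirit of the original post: segment reductions already express this ragged pattern without a custom kernel. Below is a sketch (not from the thread) using JAX's `jax.ops.segment_sum` for the scatter-add; `jax_centroids` and its signature are illustrative names, and `k` must be passed as a static Python int for `num_segments`.

```python
from functools import partial

import jax
import jax.numpy as jnp
from jax.ops import segment_sum

@partial(jax.jit, static_argnums=2)
def jax_centroids(X_nd, label_n, k):
    # ragged scatter-add: sum the rows of X_nd belonging to each label
    sums_kd = segment_sum(X_nd, label_n, num_segments=k)
    counts_k = segment_sum(jnp.ones(X_nd.shape[0]), label_n, num_segments=k)
    c_kd = sums_kd / (counts_k[:, None] + 1e-10)
    # gather each point's centroid back and take the squared distance
    dist_n = ((X_nd - c_kd[label_n]) ** 2).sum(axis=1)
    return c_kd, dist_n
```

Under `jit`, XLA lowers the segment sum and the gather to native scatter/gather ops, so the same code runs on CPU, GPU, or TPU.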
        
       ___________________________________________________________________
       (page generated 2021-12-07 23:03 UTC)