[HN Gopher] Luhn algorithm using SWAR and SIMD
       ___________________________________________________________________
        
       Luhn algorithm using SWAR and SIMD
        
       Author : harporoeder
       Score  : 13 points
       Date   : 2022-05-08 04:50 UTC (1 days ago)
        
 (HTM) web link (nullprogram.com)
 (TXT) w3m dump (nullprogram.com)
        
       | Const-me wrote:
       | I would do the SSE version slightly differently. No need for
       | bitwise tricks, SSE2 has an instruction to compare bytes, which
       | saves 1 instruction and one constant vector.
       | int luhn( const char* s )         {             __m128i r =
       | _mm_loadu_si128( ( const __m128i* )s );                  //
       | Decode ASCII             r = _mm_subs_epu8( r, _mm_set1_epi8(
       | 0x30 ) );                  // Double every other digit
       | __m128i m = _mm_set1_epi16( 0x00ff );             r =
       | _mm_add_epi8( r, _mm_and_si128( r, m ) );                  // if(
       | digit > 9 ) digit -= 9             const __m128i nine =
       | _mm_set1_epi8( 9 );             __m128i gt = _mm_cmpgt_epi8( r,
       | nine );             r = _mm_sub_epi8( r, _mm_and_si128( gt, nine
       | ) );                  // Horizontal sum             r =
       | _mm_sad_epu8( r, _mm_setzero_si128() );             r =
       | _mm_add_epi64( r, _mm_srli_si128( r, 8 ) );             return
       | _mm_cvtsi128_si32( r ) % 10;         }
        
       | emerged wrote:
       | I'd make it an inline function and call it multiple times per
       | loop iteration. You should be able to pull some of the
       | initializations out of the loop and fetch multiple at once,
       | pipelining using different registers. Optimizing compilers have
       | gotten good enough to do the right things much of the time with
       | intrinsics but it still may benefit from hand writing.
       | 
       | Sometimes you can even eek a touch more speed with prefetch, but
       | it can be tricky to get that right.
       | 
       | Another thing that matters a ton is whether you're
       | microbenchmarking it accurately. It isn't trivial to do so, as
       | you should mimic real memory / cache behavior as opposed to
       | repeatedly having it fetch memory which is already cached and
       | give you misleading results.
        
       | [deleted]
        
       | dragontamer wrote:
       | This is so stupid, I love it!
       | 
       | Its actually a great demonstration of efficient parallel
       | processing on something totally not needing it. But you've got
       | your "prefix-sum" pattern (maybe called reduction-pattern in some
       | other code, or even "scan" in yet other code).
       | 
       | The only problem is that there's only 16-digits, so parallel
       | processing of this is barely an upgrade at all. But you can
       | imagine that if we scaled credit-card numbers up to 1024 digits
       | or 1-million digits (1 MB), this algorithm would scale very well.
        
       ___________________________________________________________________
       (page generated 2022-05-09 23:02 UTC)