[HN Gopher] Luhn algorithm using SWAR and SIMD
___________________________________________________________________
Luhn algorithm using SWAR and SIMD
Author : harporoeder
Score : 13 points
Date : 2022-05-08 04:50 UTC (1 days ago)
(HTM) web link (nullprogram.com)
(TXT) w3m dump (nullprogram.com)
| Const-me wrote:
| I would do the SSE version slightly differently. No need for
| bitwise tricks, SSE2 has an instruction to compare bytes, which
| saves 1 instruction and one constant vector.
| int luhn( const char* s ) { __m128i r =
| _mm_loadu_si128( ( const __m128i* )s ); //
| Decode ASCII r = _mm_subs_epu8( r, _mm_set1_epi8(
| 0x30 ) ); // Double every other digit
| __m128i m = _mm_set1_epi16( 0x00ff ); r =
| _mm_add_epi8( r, _mm_and_si128( r, m ) ); // if(
| digit > 9 ) digit -= 9 const __m128i nine =
| _mm_set1_epi8( 9 ); __m128i gt = _mm_cmpgt_epi8( r,
| nine ); r = _mm_sub_epi8( r, _mm_and_si128( gt, nine
| ) ); // Horizontal sum r =
| _mm_sad_epu8( r, _mm_setzero_si128() ); r =
| _mm_add_epi64( r, _mm_srli_si128( r, 8 ) ); return
| _mm_cvtsi128_si32( r ) % 10; }
| emerged wrote:
| I'd make it an inline function and call it multiple times per
| loop iteration. You should be able to pull some of the
| initializations out of the loop and fetch multiple at once,
| pipelining using different registers. Optimizing compilers have
| gotten good enough to do the right things much of the time with
| intrinsics but it still may benefit from hand writing.
|
| Sometimes you can even eek a touch more speed with prefetch, but
| it can be tricky to get that right.
|
| Another thing that matters a ton is whether you're
| microbenchmarking it accurately. It isn't trivial to do so, as
| you should mimic real memory / cache behavior as opposed to
| repeatedly having it fetch memory which is already cached and
| give you misleading results.
| [deleted]
| dragontamer wrote:
| This is so stupid, I love it!
|
| Its actually a great demonstration of efficient parallel
| processing on something totally not needing it. But you've got
| your "prefix-sum" pattern (maybe called reduction-pattern in some
| other code, or even "scan" in yet other code).
|
| The only problem is that there's only 16-digits, so parallel
| processing of this is barely an upgrade at all. But you can
| imagine that if we scaled credit-card numbers up to 1024 digits
| or 1-million digits (1 MB), this algorithm would scale very well.
___________________________________________________________________
(page generated 2022-05-09 23:02 UTC)