[HN Gopher] On Summing Integers
       ___________________________________________________________________
        
       On Summing Integers
        
       Author : electricshampo1
       Score  : 18 points
       Date   : 2021-03-07 15:38 UTC (7 hours ago)
        
 (HTM) web link (unomerite.medium.com)
 (TXT) w3m dump (unomerite.medium.com)
        
       | [deleted]
        
       | orf wrote:
       | numpy, for comparison:                  In [8]: vec =
       | np.random.randint(-200, 200, (100_000_000,))             In [9]:
       | %timeit vec.sum()        63 ms +- 4.81 ms per loop (mean +- std.
       | dev. of 7 runs, 10 loops each)
       | 
       | The branchless C++ version took 125ms, and the AVX 512 version
       | took ~9ms.
        
       | tom_mellior wrote:
       | After the updates to the article, the takeaway seems to be "you
       | can use AVX-512 dot product instructions to sum an array of bytes
       | to int and get a 15% speedup over more straightforward vector
       | code". That's an interesting point, but it's now well-hidden
       | among irrelevant things like the compressed representation that
       | was only relevant to the article's original point.
       | 
       | It might make sense to resubmit a completely rewritten and pared-
       | down version of the article. The dot product trick is neat.
        
         | electricshampo1 wrote:
         | (Author here)
         | 
         | The main focus of the article is to show a worked example of
         | how to analyze the behavior of simple programs. The circuitous
         | path taken in the article is similar to that which you might
         | face when analyzing your own programs. The article is only
         | incidentally about AVX512 and summing numbers. Thus any
         | specific technique used or even the precise runtimes measured
         | should not be given too much weight. Forgive me, but the
         | journey is more important than the destination.
         | 
         | The article is also meant to inspire others to learn and spend
         | time deeply understanding (by looking at data that the hardware
         | makes available) what actually happens to their code when it
         | runs. It is too easy to lose sight of that in a professional
         | setting, where business needs/requirements leave little time
         | for deep analysis.
        
       | Const-me wrote:
       | I don't believe in the bright future of AVX512 tech, and I don't
       | have hardware either, my desktop PC has AMD Zen2 CPU.
       | 
       | Here's how I would do that in AVX2:
       | https://gist.github.com/Const-me/eed10bfe690b5804d2fc8266e02...
       | 
       | I wonder how does the performance compare to your version.
        
         | electricshampo1 wrote:
         | (author here)
         | 
         | Using srand(0) as in your gist; changing int8_t to char ; using
         | int for sum:
         | 
         | BM_hn 10155300 ns
         | 
         | BM_avx512 9218652 ns
         | 
         | -----
         | 
         | BM_avx512 9372319 ns
         | 
         | BM_hn 10428792 ns
         | 
         | Your code is getting ~18-19GB/s read bandwidth compared to
         | roughly ~21GB/s for my code.
        
       | electricshampo1 wrote:
       | NOTE: The article has been updated and expanded since the initial
       | post.
        
       | plesner wrote:
       | Could you not just do
       | 
       | sum(byteVals) + sum(intVals) + 128 * len(intVals)?
        
         | tom_mellior wrote:
         | And even if you didn't, I wonder what the motivation is for
         | trying to do it all in one loop.                   sum(byteVals
         | != -128) + sum(intVals)
         | 
         | should vectorize more nicely and be at least as cache friendly.
        
         | electricshampo1 wrote:
         | That is essentially the approach mentioned in the article at
         | 
         | " UPDATE: see
         | https://www.realworldtech.com/forum/?threadid=200693&curpost...
         | for a dramatic simplification. Not catching this is an
         | oversight on my part. This post will be updated to include
         | numbers with the mentioned strategy.
         | 
         | UPDATE: To my surprise and after much fiddling, I did not
         | manage to write a version that was measurably faster (indeed
         | they were at least a percent slower) than the hand written
         | sum_avx512 shown below. There is almost certainly something
         | that I am doing wrong but I can't seem to figure out what it
         | is. I will take this opportunity to leave this as an exercise
         | for the reader :). "
        
       ___________________________________________________________________
       (page generated 2021-03-07 23:02 UTC)