[HN Gopher] On Summing Integers
___________________________________________________________________
On Summing Integers
Author : electricshampo1
Score : 18 points
Date : 2021-03-07 15:38 UTC (7 hours ago)
(HTM) web link (unomerite.medium.com)
(TXT) w3m dump (unomerite.medium.com)
| [deleted]
| orf wrote:
| numpy, for comparison: In [8]: vec =
| np.random.randint(-200, 200, (100_000_000,)) In [9]:
| %timeit vec.sum() 63 ms +- 4.81 ms per loop (mean +- std.
| dev. of 7 runs, 10 loops each)
|
| The branchless C++ version took 125ms, and the AVX 512 version
| took ~9ms.
| tom_mellior wrote:
| After the updates to the article, the takeaway seems to be "you
| can use AVX-512 dot product instructions to sum an array of bytes
| to int and get a 15% speedup over more straightforward vector
| code". That's an interesting point, but it's now well-hidden
| among irrelevant things like the compressed representation that
| was only relevant to the article's original point.
|
| It might make sense to resubmit a completely rewritten and pared-
| down version of the article. The dot product trick is neat.
| electricshampo1 wrote:
| (Author here)
|
| The main focus of the article is to show a worked example of
| how to analyze the behavior of simple programs. The circuitous
| path taken in the article is similar to that which you might
| face when analyzing your own programs. The article is only
| incidentally about AVX512 and summing numbers. Thus any
| specific technique used or even the precise runtimes measured
| should not be given too much weight. Forgive me, but the
| journey is more important than the destination.
|
| The article is also meant to inspire others to learn and spend
| time deeply understanding (by looking at data that the hardware
| makes available) what actually happens to their code when it
| runs. It is too easy to lose sight of that in a professional
| setting, where business needs/requirements leave little time
| for deep analysis.
| Const-me wrote:
| I don't believe in the bright future of AVX512 tech, and I don't
| have hardware either, my desktop PC has AMD Zen2 CPU.
|
| Here's how I would do that in AVX2:
| https://gist.github.com/Const-me/eed10bfe690b5804d2fc8266e02...
|
| I wonder how does the performance compare to your version.
| electricshampo1 wrote:
| (author here)
|
| Using srand(0) as in your gist; changing int8_t to char ; using
| int for sum:
|
| BM_hn 10155300 ns
|
| BM_avx512 9218652 ns
|
| -----
|
| BM_avx512 9372319 ns
|
| BM_hn 10428792 ns
|
| Your code is getting ~18-19GB/s read bandwidth compared to
| roughly ~21GB/s for my code.
| electricshampo1 wrote:
| NOTE: The article has been updated and expanded since the initial
| post.
| plesner wrote:
| Could you not just do
|
| sum(byteVals) + sum(intVals) + 128 * len(intVals)?
| tom_mellior wrote:
| And even if you didn't, I wonder what the motivation is for
| trying to do it all in one loop. sum(byteVals
| != -128) + sum(intVals)
|
| should vectorize more nicely and be at least as cache friendly.
| electricshampo1 wrote:
| That is essentially the approach mentioned in the article at
|
| " UPDATE: see
| https://www.realworldtech.com/forum/?threadid=200693&curpost...
| for a dramatic simplification. Not catching this is an
| oversight on my part. This post will be updated to include
| numbers with the mentioned strategy.
|
| UPDATE: To my surprise and after much fiddling, I did not
| manage to write a version that was measurably faster (indeed
| they were at least a percent slower) than the hand written
| sum_avx512 shown below. There is almost certainly something
| that I am doing wrong but I can't seem to figure out what it
| is. I will take this opportunity to leave this as an exercise
| for the reader :). "
___________________________________________________________________
(page generated 2021-03-07 23:02 UTC)