https://stackoverflow.com/questions/72306573/why-does-this-code-execute-more-slowly-after-strength-reducing-multiplications-t

Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions?

Asked 10 days ago · Modified today · Viewed 51k times · Score 189

I am reading Agner Fog's optimization manuals, and I came across this example:

    double data[LEN];

    void compute()
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        int i;
        for(i=0; i<LEN; i++) {
            data[i] = A*i*i + B*i + C;
        }
    }

Agner indicates that this can be strength-reduced: instead of multiplying on every iteration, the loop can carry the per-iteration deltas and just add them:

    void compute2()
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        const double A2 = A + A;
        double Z = A + B;
        double Y = C;
        int i;
        for(i=0; i<LEN; i++) {
            data[i] = Y;
            Y += Z;
            Z += A2;
        }
    }

I benchmark both versions, timing each call with clock_gettime and keeping the minimum over 1000 runs:

    #include <stdio.h>
    #include <time.h>

    unsigned long long ts2ns(const struct timespec *ts)
    {
        return ts->tv_sec * 1e9 + ts->tv_nsec;
    }

    int main(int argc, char *argv[])
    {
        unsigned long long mini = 1e9;
        for (int i=0; i<1000; i++) {
            struct timespec t1, t2;
            clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
            compute();
            clock_gettime(CLOCK_MONOTONIC_RAW, &t2);
            unsigned long long diff = ts2ns(&t2) - ts2ns(&t1);
            if (mini > diff) mini = diff;
        }
        printf("[-] Took: %lld ns.\n", mini);
    }

I compile the two versions, run them... and see this:

    # gcc -O3 -o 1 ./code1.c
    # gcc -O3 -o 2 ./code2.c
    # ./1
    [-] Took: 405858 ns.
    # ./2
    [-] Took: 791652 ns.

Well, that's unexpected. Since we report the minimum time of execution, we are throwing away the "noise" caused by various parts of the OS. We also took care to run on a machine that does absolutely nothing else. And the results are more or less repeatable - re-running the two binaries shows this is a consistent result:

    # for i in {1..10} ; do ./1 ; done
    [-] Took: 406886 ns.
    [-] Took: 413798 ns.
    [-] Took: 405856 ns.
    [-] Took: 405848 ns.
    [-] Took: 406839 ns.
    [-] Took: 405841 ns.
    [-] Took: 405853 ns.
    [-] Took: 405844 ns.
    [-] Took: 405837 ns.
    [-] Took: 406854 ns.

    # for i in {1..10} ; do ./2 ; done
    [-] Took: 791797 ns.
    [-] Took: 791643 ns.
    [-] Took: 791640 ns.
    [-] Took: 791636 ns.
    [-] Took: 791631 ns.
    [-] Took: 791642 ns.
    [-] Took: 791642 ns.
    [-] Took: 791640 ns.
    [-] Took: 791647 ns.
    [-] Took: 791639 ns.

The only thing to do next is to see what kind of code the compiler created for each of the two versions. objdump -d -S shows that the first version of compute - the "dumb", yet somehow fast code - has a loop that looks like this:

    [screenshot: objdump of the naive version]

What about the second, optimized version - that does just two additions?

    [screenshot: objdump of the optimized version]

Now I don't know about you, but speaking for myself, I am... puzzled. The second version has approximately 4 times fewer instructions, with the two major ones being just SSE-based additions (addsd).
The first version not only has 4 times more instructions; it's also full (as expected) of multiplications (mulpd).

I confess I did not expect that result. Not because I am a fan of Agner (I am, but that's irrelevant).

Any idea what I am missing? Did I make any mistake here that can explain the difference in speed? Note that I have done the test on a Xeon W5580 and a Xeon E5-1620 - in both, the first (dumb) version is much faster than the second one.

EDIT: For easy reproduction of the results, I added two gists with the two versions of the code: Dumb yet somehow faster and optimized, yet somehow slower.

P.S. Please don't comment on floating point accuracy issues; that's not the point of this question.

loops assembly optimization x86-64 simd

— asked May 19 at 14:39 by ttsiodras (edited 46 mins ago by Ian Kemp)

Comments:

* 40 The original code is easily vectorisable, the new version has a loop-carried dependency and is not. So on top of the lack of vectorisation, you also lose the ability of the OoO processor to execute multiple iterations at once in your second version. - fuz May 19 at 15:13

* 6 What CPU are those time numbers from? You mentioned two old Xeon CPUs, a W5580 (Nehalem-EP) and an E5-1620 (Sandybridge-EP). Both of those have 1/clock FP add and 1/clock FP mul throughput, on different ports so they can run in parallel. Only on Skylake and later is there 2/clock add throughput. But all of them have pipelined FPUs with significant latency vs. throughput, so yeah, the loop-carried dependency phuclv and fuz pointed out is a huge problem. - Peter Cordes May 19 at 15:25

* 4 To vectorize the 2-addition version, you'd need manual unrolling with increments of 4*A2 or something like that. Possibly clang could do that for you with -ffast-math (or possibly even GCC, but GCC tends to unroll without multiple accumulators). With FMA available on Haswell or later, Horner's method would be great for such a short polynomial, easy for out-of-order exec to hide, although it would still need an FP version of i. - Peter Cordes May 19 at 15:31

* 8 I want to mention that for integers, multiplication is more expensive than addition; but for floating point it's often the opposite (addition is more expensive). The reason is that for floating point multiplication the significand and the exponent can be determined independently in parallel (like significand = sig1 * sig2; exponent = exp1 + exp2), while for floating point addition it needs to be done in series (determine the result exponent, then "shift" both values to match the result exponent, then determine the result significand). - Brendan May 20 at 13:56

* 4 @Brendan: Despite that, modern x86 hardware FPUs have multiply latency always at least as high as addition. The significand multiply is still a 24 or 53-bit integer multiply. But yeah, if you take a microcode assist to deal with subnormal inputs or outputs, that can make the fast path pretty short. uops.info for mulpd vs. addpd (and vfma...). Alder Lake improved addpd latency to 3 cycles, down from 4, which was the latency for addpd/subpd/mulpd/vfma...pd since Skylake. AMD has had lower add latency on some CPUs, but Zen2 has 3-cycle latency addpd and mulpd vs. 5c fma, like Broadwell. - Peter Cordes May 21 at 13:40
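For reference, the Horner form mentioned in the comments above evaluates A*i*i + B*i + C as (A*i + B)*i + C, i.e. two fused multiply-adds per element. A scalar sketch (assuming C99 fma() and a target with FMA enabled, e.g. -march=haswell; illustrative only, not code from the question):

    #include <math.h>

    // Horner's method: two FMAs per element instead of two multiplies plus two adds.
    // With optimization and FMA enabled (-mfma / -march=haswell), fma() compiles to vfmadd* instructions.
    static inline double poly_horner(double A, double B, double C, double x)
    {
        return fma(fma(A, x, B), x, C);   // (A*x + B)*x + C
    }

A vectorized loop built this way still needs a floating-point copy of i in each SIMD lane (an int->FP conversion or an FP increment), as the comments note.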
7 Answers

[+198]

The key to understanding the performance differences you're seeing is in vectorization. Yes, the addition-based solution has a mere two instructions in its inner loop, but the important difference is not in how many instructions there are in the loop, but in how much work each instruction is performing.

In the first version, the output is purely dependent on the input: Each data[i] is a function just of i itself, which means that each data[i] can be computed in any order: The compiler can do them forwards, backwards, sideways, whatever, and you'll still get the same result -- unless you're observing that memory from another thread, you'll never notice which way the data is being crunched.

In the second version, the output isn't dependent on i -- it's dependent on the Y and Z from the last time around the loop.

If we were to represent the bodies of these loops as little mathematical functions, they'd have very different overall forms:

* f(i) -> di
* f(Y, Z) -> (di, Y', Z')

In the latter form, there's no actual dependency on i -- the only way you can compute the value of the function is by knowing the previous Y and Z from the last invocation of the function, which means that the functions form a chain -- you can't do the next one until you've done the previous one.

Why does that matter? Because the CPU has vector parallel instructions that each can perform two, four, or even eight arithmetic operations at the same time! (AVX CPUs can do even more in parallel.) That's four multiplies, four adds, four subtracts, four comparisons -- four whatevers! So if the output you're trying to compute is only dependent on the input, then you can safely do two, four, or even eight at a time -- it doesn't matter if they're forward or backward, since the result is the same. But if the output is dependent on previous computation, then you're stuck doing it in serial form -- one at a time.

That's why the "longer" code wins for performance. Even though it has a lot more setup, and it's actually doing a lot more work, most of that work is being done in parallel: It's not computing just data[i] in each iteration of the loop -- it's computing data[i], data[i+1], data[i+2], and data[i+3] at the same time, and then jumping to the next set of four.

To expand out a little what I mean here, the compiler first turned the original code into something like this:

    int i;
    for (i = 0; i < LEN; i += 4) {
        data[i+0] = A*(i+0)*(i+0) + B*(i+0) + C;
        data[i+1] = A*(i+1)*(i+1) + B*(i+1) + C;
        data[i+2] = A*(i+2)*(i+2) + B*(i+2) + C;
        data[i+3] = A*(i+3)*(i+3) + B*(i+3) + C;
    }

You can convince yourself that'll do the same thing as the original, if you squint at it. It did that because of all of those identical vertical lines of operators: All of those * and + operations are the same operation, just being performed on different data -- and the CPU has special built-in instructions that can perform multiple * or multiple + operations on different data at the same time, in a mere single clock cycle each.

Notice the letter p in the instructions in the faster solution -- addpd and mulpd -- and the letter s in the instructions in the slower solution -- addsd. That's "Add Packed Doubles" and "Multiply Packed Doubles" versus "Add Scalar Double".
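To make "packed" concrete: with SSE2, one packed step evaluates the polynomial for two consecutive i values at once, two doubles per 128-bit register. An illustrative intrinsics sketch (not the compiler's actual output, which also does the int->double conversion with packed instructions and unrolls further):

    #include <emmintrin.h>   // SSE2 intrinsics

    // Evaluate A*i*i + B*i + C for two consecutive i values at the same time.
    void poly_pair(double *out, double i0, double A, double B, double C)
    {
        __m128d vi = _mm_set_pd(i0 + 1.0, i0);      // lanes {i0, i0+1}
        __m128d vA = _mm_set1_pd(A);
        __m128d vB = _mm_set1_pd(B);
        __m128d vC = _mm_set1_pd(C);
        __m128d ii = _mm_mul_pd(vi, vi);            // i*i      (mulpd)
        __m128d r  = _mm_mul_pd(vA, ii);            // A*i*i    (mulpd)
        r = _mm_add_pd(r, _mm_mul_pd(vB, vi));      // + B*i    (mulpd, addpd)
        r = _mm_add_pd(r, vC);                      // + C      (addpd)
        _mm_storeu_pd(out, r);                      // store both results
    }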
Not only that, it looks like the compiler partially unrolled the loop too -- the loop doesn't just do two values each iteration, but actually four, and interleaved the operations to avoid dependencies and stalls, all of which cuts down on the number of times that the assembly code has to test i < 1000 as well.

All of this only works, though, if there are no dependencies between iterations of the loop: if the only thing that determines what happens for each data[i] is i itself. If there are dependencies, if data from the last iteration influences the next one, then the compiler may be so constrained by them that it can't alter the code at all -- instead of the compiler being able to use fancy parallel instructions or clever optimizations (CSE, strength reduction, loop unrolling, reordering, et al.), you get out code that's exactly what you put in -- add Y, then add Z, then repeat.

But here, in the first version of the code, the compiler correctly recognized that there were no dependencies in the data, and figured out that it could do the work in parallel, and so it did, and that's what makes all the difference.

— answered May 19 at 16:29 by Sean Werkema (edited 13 mins ago)

Comments:

* 25 It's not just vectorization but data dependencies. The scalar code from the "optimized" version can't run at full speed because of latency bottlenecks across iterations. That's the same thing that prevents it from vectorizing, but I would have started an answer by saying that the key is loop-carried dependencies. Lack of such allows both vectorization and instruction-level parallelism across iterations. (Integer i++ is a loop-carried dep, but the compiler can play with it since integer math is associative, unlike FP without -ffast-math.) - Peter Cordes May 19 at 16:42

* 15 @PeterCordes I really wanted to focus on the high-level concept of "parallel versus serial computation" in this answer, since that seemed to be at the root of the question -- if you're unaware that parallel instructions even exist, you'd be just as puzzled as the asker was at how "more" can somehow magically be "less." Dependencies and bottlenecks -- how the compiler determined which of the optimization options were available to it -- would be great followup questions, though. - Sean Werkema May 19 at 17:10

* 9 But instruction-level parallelism is equally important to SIMD parallelism. Maybe more so, with only 2 doubles per vector vs. SIMD FP addsd/addpd having 3-cycle latency, 1-cycle throughput on Nehalem and Sandy Bridge. (Although with two separate add chains in the loop, that maybe works out to one scalar FP add per 1.5 clock cycles, so yeah maybe SIMD was more important.) - Peter Cordes May 19 at 17:21

* 7 Anyway, having a serial dependency across loop iterations actually is the ultimate key to parallel vs. serial code (and execution of that code), and IMO would be a good opening paragraph. Compilers and CPUs can take advantage of it in multiple ways, with the compiler auto-vectorizing and the CPU exploiting the ILP of independent loop iterations. Even if you only want to talk about SIMD vectorization, spotting the data parallelism available in a loop is the key first observation. (I did already upvote this answer; overall good but I'd like it more if it started from parallelism vs. deps.) - Peter Cordes May 19 at 17:23

* 6 In your update, you mentioned the strength-reduction optimization. The optimization proposed in the question is a fancy case of strength-reduction, replacing independent multiplies with loop-carried add chains. So if the compiler does that (with -ffast-math) you hope it does it in an unrolling-friendly way to allow vectorization. - Peter Cordes May 19 at 20:18
[+43]

The main difference here is loop dependencies. The loop in the second case is dependent -- operations in the loop depend on the previous iteration. This means that each iteration can't even start until the previous iteration finishes. In the first case, the loop body is fully independent -- everything in the loop body is self-contained, depending solely on the iteration counter and constant values. This means that the loop can be computed in parallel -- multiple iterations can all work at the same time. This then allows the loop to be trivially unrolled and vectorized, overlapping many instructions.

If you were to look at the performance counters (e.g. with perf stat ./1), you would see that the first loop, besides running faster, is also executing many more instructions per cycle (IPC). The second loop, in contrast, has more dependency cycles -- time when the CPU is sitting around doing nothing, waiting for instructions to complete, before it can issue more instructions.

The first one may bottleneck on memory bandwidth, especially if you let the compiler auto-vectorize with AVX on your Sandybridge (gcc -O3 -march=native), if it manages to use 256-bit vectors. At that point IPC will drop, especially for an output array too large for L3 cache.

---------------------------------------------------------------------

One note: unrolling and vectorizing do not require independent loops -- you can do them when (some) loop dependencies are present. However, it is harder and the payoff is less. So if you want to see the maximum speedup from vectorization, it helps to remove loop dependencies where possible.

— answered May 19 at 16:42 by Chris Dodd (edited May 19 at 16:47 by Peter Cordes)

Comments:

* 1 Thanks - this makes sense. And by running 4 at a time, the branch comparison also runs 4 times less often, I guess. Any suggestions on how to read the performance counters you are talking about (under Linux) would be most welcome. - ttsiodras May 19 at 16:44

* 1 oprofile is the usual way to do this on Linux. - Chris Dodd May 19 at 16:46

* 1 @ttsiodras: These days most people use something like perf stat --all-user ./1 to accumulate counts across the whole program. Which is fine since it spends most of its time inside the loop. You might want to move the timing outside of the loop or remove it for this kind of profiling, perhaps hiding the repeat loop from the optimizer by putting the actual work in a __attribute__((noinline,noipa)) function, to stop inter-procedural analysis and inlining. - Peter Cordes May 19 at 16:48

* 7 To get the maximum payoff with manual vectorization, I think you might actually use version 2, but with multiple vectors that advance in lock-step, four each of different Z and Y vectors, like Z0 += 8*A2 (or 16*A2 if each Z vector holds 4 doubles instead of 2). You'd need some math to account for striding an element by 8 or 16 i values instead of 1, maybe a multiply in there somewhere. You can almost certainly do better than redoing int->FP conversion each iteration; that's an expensive way to get independent iterations. - Peter Cordes May 19 at 17:00

* (Yup, the idea in my last comment works, and doesn't need any extra multiplies. I wrote up an answer with something that compiles to 4-way unrolled code that just does two vaddpd ymm instructions and a store per 4 doubles of results.) - Peter Cordes May 22 at 18:59
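A minimal way to set up the measurement these comments describe (the function name and layout here are illustrative, not from the question; the perf command is the one quoted above):

    /* Sketch: keep the measured work in a non-inlined function so that perf
       counters for the whole process are dominated by the loop itself.
       (GCC also accepts the noipa attribute to block inter-procedural analysis.) */
    __attribute__((noinline)) static void compute_once(void)
    {
        /* body of compute() or compute2() from the question goes here */
    }

    int main(void)
    {
        for (int r = 0; r < 1000; r++)   /* repeat enough that startup cost is negligible */
            compute_once();
        return 0;
    }

Then run perf stat --all-user ./1 and perf stat --all-user ./2; perf stat's default output already counts cycles and instructions, so the "insn per cycle" line shows the IPC difference described in this answer.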
[+20]

This optimization can give a speedup over the best you can do re-evaluating the polynomial separately for each i -- but only if you generalize it to a larger stride, to still have enough parallelism in the loop. My version can store 1 vector (4 doubles) per clock cycle on my Skylake, or in theory 1 vector per 2 clocks on earlier Intel, including Sandybridge, where a 256-bit store takes 2 clock cycles in the store port. It obviously needs to fit in cache to run this fast, so repeat a small LEN multiple times.

---------------------------------------------------------------------

This strength-reduction optimization (just adding instead of starting with a fresh i and multiplying) introduces a serial dependency across loop iterations, involving FP math rather than an integer increment.

The original has data parallelism across every output element: each one depends only on constants and its own i value. Compilers can auto-vectorize with SIMD (SSE2, or AVX if you use -O3 -march=native), and CPUs can overlap the work across loop iterations with out-of-order execution. Despite the amount of extra work, the CPU is able to apply sufficient brute force, with the compiler's help.

But the version that calculates poly(i+1) in terms of poly(i) has very limited parallelism: no SIMD vectorization, and your CPU can only run two scalar adds per 4 cycles, for example, where 4 cycles is the latency of FP addition on Intel Skylake through Tiger Lake. (https://uops.info/)

---------------------------------------------------------------------

@huseyin tugrul buyukisik's answer shows how you can get close to maxing out the throughput of the original version on a more modern CPU, with two FMA operations to evaluate the polynomial (Horner's scheme), plus an int->FP conversion or an FP increment. (The latter creates an FP dep chain which you need to unroll to hide.) So best case you have 3 FP math operations per SIMD vector of output (plus a store).

Current Intel CPUs only have two FP execution units that can run FP math operations. (With 512-bit vectors under AVX-512, current CPUs shut down the vector ALU on port 1, so there are only 2 SIMD ALU ports at all, and non-FP-math ops like integer increment will also compete for throughput.)

AMD since Zen2 has two FMA/mul units on two ports, and two FP add/sub units on two different ports, so if you use FMA to do addition, you have a theoretical max of four SIMD additions per clock cycle.

Haswell/Broadwell have 2/clock FMA, but only 1/clock FP add/sub (with lower latency). This is good for naive code, not great for code that has been optimized to have lots of parallelism. That's probably why Intel changed it in Skylake.

Your Sandybridge (E5-1620) and Nehalem (W5580) CPUs have 1/clock FP add/sub and 1/clock FP mul, on separate ports. This is what Haswell was building on. And it's why adding extra multiplies isn't a big problem: they can run in parallel with the existing adds.

Finding parallelism: strength-reduction with an arbitrary stride

Your compute2 calculates the next Y and the next Z in terms of the immediately previous value,
i.e. with a stride of 1, the values you need for data[i+1]. So each iteration is dependent on the immediately previous one.

If you generalize that to other strides, you can advance 4, 8, 16, or more separate Y and Z values so they all leapfrog in lockstep, each independently of the others. This regains some parallelism which the compiler and/or CPU can take advantage of.

    poly(i)   = A i^2 + B i + C

    poly(i+4) = A (i+4)^2 + B (i+4) + C
              = A i^2 + 8 A i + 16 A + B i + 4 B + C
              = poly(i) + 4*(2A)*i + 16*A + 4*B

(I picked a fixed 4 partly because writing exponents is messy in ASCII text. On paper I might have just done (i+s)^2 = i^2 + 2is + s^2 and so on. Another approach would be taking differences between successive points, and then differences of those differences. At least that's where we want to end up, with values we can add to generate the sequence (of FP values). This is basically like taking the 1st and 2nd derivative, and then the optimized loop is effectively integrating to recover the original function.)

* The 2nd-derivative part (formerly Z += A*2) becomes Z[j] += 2*A*(stride*stride): the 8*A*i term grows by 8*A * 4 = 2*A*(stride*stride) each time i advances by another 4.
* The linear growth (formerly the initial Z = A+B;) becomes (A*stride + B) * stride, from the 16*A + 4*B that appears on advancing by 4.

As you can see, for stride=1 (no unroll) these simplify to the original values. But with a larger stride, they still let this method work, with no extra multiplies or anything in the loop.

    // UNTESTED, but compiles to asm that looks right.
    // Hopefully I got the math right.
    #include <stdalign.h>
    #define LEN 1024
    alignas(64) double data[LEN];

    void compute2_unrolled(void)
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        const int stride = 16;    // unroll factor.  1 reduces to the original
        const double deriv2 = 2*A * (stride * stride);
        double Z[stride], Y[stride];
        for (int j = 0 ; j < stride ; j++) {
            // ...
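Spelled out, the scheme can look like the following sketch (my reconstruction of the idea in the bullets above, with a hypothetical name and signature, not necessarily the exact code behind the asm below; the initial values come straight from the finite differences of poly(), and with stride = 1 it reduces to the original compute2):

    // One Y/Z chain per j; each chain advances its i by `stride` per iteration.
    void compute2_strided(double * restrict data, int len)
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        enum { stride = 16 };                            // unroll factor; 1 reduces to the original compute2
        const double diff2 = 2*A * (stride * stride);    // 2nd difference of poly() with step = stride

        double Y[stride], Z[stride];
        for (int j = 0; j < stride; j++) {               // set up one chain per j
            Y[j] = A*j*j + B*j + C;                      // poly(j)
            Z[j] = 2*A*stride*j + (A*stride + B)*stride; // poly(j+stride) - poly(j)
        }

        for (int i = 0; i < len; i += stride) {          // assumes len is a multiple of stride
            for (int j = 0; j < stride; j++) data[i+j] = Y[j];  // store
            for (int j = 0; j < stride; j++) Y[j] += Z[j];      // advance each chain by stride i-values
            for (int j = 0; j < stride; j++) Z[j] += diff2;     // constant 2nd difference
        }
    }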
(If you have AVX intrinsics available you could use __m256d Z[4] instead of double Z[16]. But this version can vectorize for other ISAs like AArch64.)

Other downsides to a large unroll factor are leaving more cleanup work when the problem size isn't a multiple of the unroll. (You might use the compute1 strategy for cleanup, letting the compiler vectorize that for an iteration or two before doing scalar.)

---------------------------------------------------------------------

In theory a compiler would be allowed to do this for you with -ffast-math, either from compute1, doing the strength-reduction on the original polynomial, or from compute2, seeing how the stride accumulates. But in practice that's really complicated and something humans have to do themselves, unless / until someone gets around to teaching compilers how to look for patterns like this and apply differential calculus (or successive differences with a choice of stride).

---------------------------------------------------------------------

Experimental results: On my desktop (i7-6700k) with clang 13.0.0, I tested and was pleasantly surprised that this does in fact run at 1 SIMD store per clock cycle, with any combination of compiler options (fast-math or not) and #if 0 vs. #if 1 for the inner-loop strategy. I used a version of @huseyin tugrul buyukisik's benchmark framework improved to repeat a more measurable amount of work between rdtsc instructions, and to compensate for the difference between the core clock frequency and the "reference" frequency of the clock rdtsc reads, in my case 3.9 GHz vs. 4008 MHz. (Rated max turbo is 4.2 GHz, but with EPP = balance_performance on Linux, it only wants to clock up to 3.9 GHz.)

Source code I tested on Godbolt: using one inner loop, rather than 3 separate j<16 loops, and not using -ffast-math. And using __attribute__((noinline)) to keep this from inlining into the repeat loop. Some other variations of options and source led to some vpermpd shuffles inside the loop.

    $ clang++ -std=gnu++17 -O3 -march=native -mbranches-within-32B-boundaries poly-eval.cpp -Wall
      # warning about noipa, only GCC knows that attribute
    $ perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,fp_arith_inst_retired.256b_packed_double -r10 ./a.out
    ... (10 runs of the whole program, ending with)
    0.252072 cycles per data element (corrected from ref cycles to core clocks)
    0.252236 cycles per data element (corrected from ref cycles to core clocks)
    0.252408 cycles per data element (corrected from ref cycles to core clocks)
    0.251945 cycles per data element (corrected from ref cycles to core clocks)
    0.252442 cycles per data element (corrected from ref cycles to core clocks)
    0.252295 cycles per data element (corrected from ref cycles to core clocks)
    0.252109 cycles per data element (corrected from ref cycles to core clocks)
    xor=4303
    min cycles per data = 0.251868

     Performance counter stats for './a.out' (10 runs):

            298.92 msec task-clock                  #  0.989 CPUs utilized     ( +- 0.49% )
                 0      context-switches            #  0.000 /sec
                 0      cpu-migrations              #  0.000 /sec
               129      page-faults                 #  427.583 /sec            ( +- 0.56% )
     1,162,430,637      cycles                      #  3.853 GHz               ( +- 0.49% )  # time spent in the kernel for system calls and interrupts isn't counted, that's why it's not 3.90 GHz
     3,772,516,605      instructions                #  3.22 insn per cycle     ( +- 0.00% )
     3,683,072,459      uops_issued.any             #  12.208 G/sec            ( +- 0.00% )
     4,824,064,881      uops_executed.thread        #  15.990 G/sec            ( +- 0.00% )
     2,304,000,000      fp_arith_inst_retired.256b_packed_double  #  7.637 G/sec

           0.30210 +- 0.00152 seconds time elapsed  ( +- 0.50% )

fp_arith_inst_retired.256b_packed_double counts 1 for each FP add or mul instruction (2 for FMA), so we're getting 1.98 vaddpd instructions per clock cycle for the whole program, including printing and so on. That's very close to the theoretical max of 2/clock, apparently not suffering from sub-optimal scheduling of the add instructions. (I bumped up the repeat loop so the program spends more of its time there, which makes perf stat on the whole program more useful.)

The goal of this optimization was to get the same work done with fewer FLOPS, but that also means we're essentially maxing out the 8 FLOP/clock limit for Skylake without using FMA. (And getting 30.58 GFLOP/s at 3.9 GHz on a single core.)
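The ref-cycle correction mentioned above is just a rescaling by the two frequencies quoted (core clock vs. TSC frequency); something along these lines (helper name and constants are illustrative, taken from the numbers above):

    // Convert an rdtsc delta (reference cycles) into core clock cycles.
    // The TSC ticks at a fixed "reference" frequency even when the core runs slower or faster.
    static double ref_to_core_cycles(unsigned long long tsc_ticks)
    {
        const double tsc_hz  = 4008e6;   // TSC frequency quoted above
        const double core_hz = 3.9e9;    // sustained core clock while the test runs
        return (double)tsc_ticks * (core_hz / tsc_hz);
    }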
Asm of the non-inline function (objdump -drwC -Mintel); clang used 4 sets of 2 pairs of YMM vectors, and unrolled the loop a further 3x. Note the add rax,0x30 doing 3 * 0x10 doubles per iteration. So it works out to an exact multiple of 24KiB with no cleanup.

    0000000000001440 <...(double*)>:
        # just loading constants; the setup loop did flatten out and optimize into vector constants
        1440: c5 fd 28 0d 18 0c 00 00          vmovapd ymm1,YMMWORD PTR [rip+0xc18]        # 2060 <_IO_stdin_used+0x60>
        1448: c5 fd 28 15 30 0c 00 00          vmovapd ymm2,YMMWORD PTR [rip+0xc30]        # 2080 <_IO_stdin_used+0x80>
        1450: c5 fd 28 1d 48 0c 00 00          vmovapd ymm3,YMMWORD PTR [rip+0xc48]        # 20a0 <_IO_stdin_used+0xa0>
        1458: c4 e2 7d 19 25 bf 0b 00 00       vbroadcastsd ymm4,QWORD PTR [rip+0xbbf]     # 2020 <_IO_stdin_used+0x20>
        1461: c5 fd 28 2d 57 0c 00 00          vmovapd ymm5,YMMWORD PTR [rip+0xc57]        # 20c0 <_IO_stdin_used+0xc0>
        1469: 48 c7 c0 d0 ff ff ff             mov    rax,0xffffffffffffffd0
        1470: c4 e2 7d 19 05 af 0b 00 00       vbroadcastsd ymm0,QWORD PTR [rip+0xbaf]     # 2028 <_IO_stdin_used+0x28>
        1479: c5 fd 28 f4                      vmovapd ymm6,ymm4
        147d: c5 fd 28 fc                      vmovapd ymm7,ymm4
        1481: c5 7d 28 c4                      vmovapd ymm8,ymm4
        1485: 66 66 2e 0f 1f 84 00 00 00 00 00 data16 cs nop WORD PTR [rax+rax*1+0x0]
        # top of outer loop.  The NOP before this is to align it.
        1490: c5 fd 11 ac c7 80 01 00 00       vmovupd YMMWORD PTR [rdi+rax*8+0x180],ymm5
        1499: c5 d5 58 ec                      vaddpd ymm5,ymm5,ymm4
        149d: c5 dd 58 e0                      vaddpd ymm4,ymm4,ymm0
        14a1: c5 fd 11 9c c7 a0 01 00 00       vmovupd YMMWORD PTR [rdi+rax*8+0x1a0],ymm3
        14aa: c5 e5 58 de                      vaddpd ymm3,ymm3,ymm6
        14ae: c5 cd 58 f0                      vaddpd ymm6,ymm6,ymm0
        14b2: c5 fd 11 94 c7 c0 01 00 00       vmovupd YMMWORD PTR [rdi+rax*8+0x1c0],ymm2
        14bb: c5 ed 58 d7                      vaddpd ymm2,ymm2,ymm7
        14bf: c5 c5 58 f8                      vaddpd ymm7,ymm7,ymm0
        14c3: c5 fd 11 8c c7 e0 01 00 00       vmovupd YMMWORD PTR [rdi+rax*8+0x1e0],ymm1
        14cc: c5 bd 58 c9                      vaddpd ymm1,ymm8,ymm1
        14d0: c5 3d 58 c0                      vaddpd ymm8,ymm8,ymm0
        14d4: c5 fd 11 ac c7 00 02 00 00       vmovupd YMMWORD PTR [rdi+rax*8+0x200],ymm5
        14dd: c5 d5 58 ec                      vaddpd ymm5,ymm5,ymm4
        14e1: c5 dd 58 e0                      vaddpd ymm4,ymm4,ymm0
        14e5: c5 fd 11 9c c7 20 02 00 00       vmovupd YMMWORD PTR [rdi+rax*8+0x220],ymm3
        14ee: c5 e5 58 de                      vaddpd ymm3,ymm3,ymm6
        14f2: c5 cd 58 f0                      vaddpd ymm6,ymm6,ymm0
        14f6: c5 fd 11 94 c7 40 02 00 00       vmovupd YMMWORD PTR [rdi+rax*8+0x240],ymm2
        14ff: c5 ed 58 d7                      vaddpd ymm2,ymm2,ymm7
        1503: c5 c5 58 f8                      vaddpd ymm7,ymm7,ymm0
        1507: c5 fd 11 8c c7 60 02 00 00       vmovupd YMMWORD PTR [rdi+rax*8+0x260],ymm1
        1510: c5 bd 58 c9                      vaddpd ymm1,ymm8,ymm1
        1514: c5 3d 58 c0                      vaddpd ymm8,ymm8,ymm0
        1518: c5 fd 11 ac c7 80 02 00 00       vmovupd YMMWORD PTR [rdi+rax*8+0x280],ymm5
        1521: c5 d5 58 ec                      vaddpd ymm5,ymm5,ymm4
        1525: c5 dd 58 e0                      vaddpd ymm4,ymm4,ymm0
        1529: c5 fd 11 9c c7 a0 02 00 00       vmovupd YMMWORD PTR [rdi+rax*8+0x2a0],ymm3
        1532: c5 e5 58 de                      vaddpd ymm3,ymm3,ymm6
        1536: c5 cd 58 f0                      vaddpd ymm6,ymm6,ymm0
        153a: c5 fd 11 94 c7 c0 02 00 00       vmovupd YMMWORD PTR [rdi+rax*8+0x2c0],ymm2
        1543: c5 ed 58 d7                      vaddpd ymm2,ymm2,ymm7
        1547: c5 c5 58 f8                      vaddpd ymm7,ymm7,ymm0
        154b: c5 fd 11 8c c7 e0 02 00 00       vmovupd YMMWORD PTR [rdi+rax*8+0x2e0],ymm1
        1554: c5 bd 58 c9                      vaddpd ymm1,ymm8,ymm1
        1558: c5 3d 58 c0                      vaddpd ymm8,ymm8,ymm0
        155c: 48 83 c0 30                      add    rax,0x30
        1560: 48 3d c1 0b 00 00                cmp    rax,0xbc1
        1566: 0f 82 24 ff ff ff                jb     1490 <...(double*)+0x50>
        156c: c5 f8 77                         vzeroupper
        156f: c3                               ret

---------------------------------------------------------------------

Related:

* Latency bounds and throughput bounds for processors for operations that must occur in sequence - analysis of code with two dep chains, one reading from the other and earlier in itself. Same dependency pattern as the strength-reduced loop, except one of its chains is an FP multiply. (It's also a polynomial evaluation scheme, but for one large polynomial.)
* SIMD optimization of a curve computed from the second derivative - another case of being able to stride along the serial dependency.
* Is it possible to use SIMD on a serial dependency in a calculation, like an exponential moving average filter? - If there's a closed-form formula for n steps ahead, you can use that to sidestep serial dependencies.
* Out of Order Execution, How to Solve True Dependency? - CPUs have to wait when an instruction depends on one that hasn't executed yet.
* Dependency chain analysis - a non-loop-carried dependency chain analysis, from one of Agner Fog's examples.
* Modern Microprocessors: A 90-Minute Guide! - general background on out-of-order exec and pipelines. Modern CPU-style short-vector SIMD exists in this form to get more work through the pipeline of a single CPU without widening the pipeline. By contrast, GPUs have many simple pipelines.
* Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) - some experimental numbers with unrolling to hide the latency of FP dependency chains, and some CPU-architecture background on register renaming.

— answered May 21 at 20:54 by Peter Cordes (edited 4 hours ago)

Comments:

* Maybe ffast-math is helping only by flushing denormal values to zero for better performance; it gave gcc 10% performance in the simple version. I also tried a hybrid approach; performance was the same, about 0.3 clocks per element. Maybe 4-way unrolled with 1 casting & 3 adding iterations could work, instead of 50% / 50%. Or 1 cast, 2 add, 1 lookup per 4-way unroll. - huseyin tugrul buyukisik May 21 at 21:22

* 1 @huseyintugrulbuyukisik: Updated with test results from my Skylake desktop: it does run 1 store per clock (and two vaddpd), so I'm getting 0.251 cycles per element without AVX-512 (which my desktop doesn't have). While testing, I noticed you were using rdtsc numbers instead of core clock cycles, which is a big assumption. It might hold true for some Xeons that the actual core clock is close to the TSC frequency when running "heavy" 512-bit instructions, but that's a risky assumption. - Peter Cordes May 22 at 21:20

* 1 But anyway, presumably the same asm as mine but using ZMM vectors could also run at 1 store per clock on Skylake-avx512 CPUs, thus about 0.125 cycles per element. Getting a compiler to generate asm like that might be problematic without options to override tuning heuristics, so there are potential practical problems if you're not using intrinsics. - Peter Cordes May 22 at 21:22

* 1 @huseyintugrulbuyukisik: It's not like we know the CPU frequency of the server instance your code ends up running on, although we could use CPUID to get the brand string and print it, which may include the stock "rated" frequency. Having that would allow manual calculation (or correction to the RDTSC guesswork numbers). Perhaps employ Quick-bench's strategy of timing a NOP loop to estimate the current CPU frequency, although turbo reductions from running AVX-512 "heavy" instructions make that harder. - Peter Cordes May 22 at 22:54

* 1 It's just a theoretical question anyway; no sense getting too crazy about actually optimizing this for production use, just proof of concept is fine.
So getting it to auto-vectorize from plain C++ source is not something I'm going to spend more of my time on, until/unless a real-world use-case comes up in a specific project that would govern what compiler / options we can use, what problem sizes to tune for, how it needs to be called, etc. - Peter Cordes May 22 at 22:58

[+7]

If you either need this code to run fast, or if you are curious, you can try the following:

You changed the calculation of a[i] = f(i) to two additions. Modify this to calculate a[4i] = f(4i) using two additions, a[4i+1] = f(4i+1) using two additions, and so on. Now you have four calculations that can be done in parallel.

There is a good chance that the compiler will do the same loop unrolling and vectorisation, and you have the same latency, but for four operations, not one.

— answered May 20 at 11:48 by gnasher729
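Concretely, the four interleaved chains described in this answer could look something like the following sketch (not code from the answer; the initial values come from the finite differences of f(i) = A*i*i + B*i + C with a stride of 4):

    // Four independent Y/Z chains; chain r produces f(r), f(r+4), f(r+8), ...
    void compute_4way(double *a, int len)            // hypothetical name and signature
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        const double diff2 = 2*A * 4*4;              // 2nd difference with step 4 = 32*A
        // Yr = f(r), Zr = f(r+4) - f(r), for r = 0..3
        double Y0 = C,               Z0 = 16*A + 4*B;
        double Y1 = A   + B   + C,   Z1 = 24*A + 4*B;
        double Y2 = 4*A + 2*B + C,   Z2 = 32*A + 4*B;
        double Y3 = 9*A + 3*B + C,   Z3 = 40*A + 4*B;

        for (int i = 0; i < len; i += 4) {           // assumes len is a multiple of 4
            a[i]   = Y0;  Y0 += Z0;  Z0 += diff2;
            a[i+1] = Y1;  Y1 += Z1;  Z1 += diff2;
            a[i+2] = Y2;  Y2 += Z2;  Z2 += diff2;
            a[i+3] = Y3;  Y3 += Z3;  Z3 += diff2;
        }
    }

Each element still costs two additions, but the four chains are independent of each other, so the CPU (or the vectorizer) can overlap them.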
[+6]

Multiplications have a reputation for being significantly slower in our CPUs, compared to additions. That may have been true historically, and may still be true for simpler low-power CPUs, but if the CPU designer is prepared to "throw gates at the problem", multiplication can be made almost as fast as addition.

Modern CPUs are designed to process multiple instructions at the same time, through a combination of pipelining and having multiple execution units. The problem with this, though, is data dependencies. If an instruction depends on the results of another instruction, then its execution cannot be started until the instruction it depends on has completed.

Modern CPUs try to work around this with "out of order execution". Instructions that are waiting for data can be kept queued while other instructions are allowed to execute. But even with these measures, sometimes the CPU can simply run out of new work to schedule.

— answered May 20 at 21:18 by plugwash (edited May 21 at 7:58 by Toby Speight)

Comments:

* 4 True for FP on Intel CPUs from Skylake onward, before Alder Lake. FP add/sub/mul/fma all have literally identical performance, running on the same 2 (fully pipelined) execution ports with the same 4-cycle latency. Alder Lake sped up FP add/sub to 3 cycles, like it was in Haswell (but still with 2/clock throughput like mul/fma, unlike Haswell). - Peter Cordes May 20 at 21:34

* 1 But not true for integer math: 1/clock with 3-cycle latency vs. 4/clock with 1c for scalar integer, and also a factor of 4 in throughput for SIMD integer on modern Intel. Integer multiply is still fairly high throughput vs. old CPUs. - Peter Cordes May 20 at 21:35

[+4]

By using only additions as an optimization, you are missing all the gflops of (newer CPUs') multiplication pipelines, and the loop-carried dependency makes it worse by stopping the auto-vectorization. If it were autovectorized, it would be much faster than multiplication+addition, and much more energy efficient per data element (only add is better than mul+add).

Another issue is that the end of the array receives more rounding error, because of the number of additions accumulated. But it shouldn't be visible until very large arrays (unless the data type becomes float).

When you apply Horner's scheme with GCC build options (on newer CPUs) -std=c++20 -O3 -march=native -mavx2 -mprefer-vector-width=256 -ftree-vectorize -fno-math-errno,

    void f(double * const __restrict__ data){
        double A=1.1,B=2.2,C=3.3;
        for(int i=0; i<1024; i++)
        {
            double id = double(i);
            double result = A;
            result *= id;
            result += B;
            result *= id;
            result += C;
            data[i] = result;
        }
    }

the compiler produces this:

    .L2:
        vmovdqa         ymm0, ymm2
        vcvtdq2pd       ymm1, xmm0
        vextracti128    xmm0, ymm0, 0x1
        vmovapd         ymm7, ymm1
        vcvtdq2pd       ymm0, xmm0
        vmovapd         ymm6, ymm0
        vfmadd132pd     ymm7, ymm4, ymm5
        vfmadd132pd     ymm6, ymm4, ymm5
        add             rdi, 64
        vpaddd          ymm2, ymm2, ymm8
        vfmadd132pd     ymm1, ymm3, ymm7
        vfmadd132pd     ymm0, ymm3, ymm6
        vmovupd         YMMWORD PTR [rdi-64], ymm1
        vmovupd         YMMWORD PTR [rdi-32], ymm0
        cmp             rax, rdi
        jne             .L2
        vzeroupper
        ret

and with -mavx512f -mprefer-vector-width=512:

    .L2:
        vmovdqa32       zmm0, zmm3
        vcvtdq2pd       zmm4, ymm0
        vextracti32x8   ymm0, zmm0, 0x1
        vcvtdq2pd       zmm0, ymm0
        vmovapd         zmm2, zmm4
        vmovapd         zmm1, zmm0
        vfmadd132pd     zmm2, zmm6, zmm7
        vfmadd132pd     zmm1, zmm6, zmm7
        sub             rdi, -128
        vpaddd          zmm3, zmm3, zmm8
        vfmadd132pd     zmm2, zmm5, zmm4
        vfmadd132pd     zmm0, zmm5, zmm1
        vmovupd         ZMMWORD PTR [rdi-128], zmm2
        vmovupd         ZMMWORD PTR [rdi-64], zmm0
        cmp             rax, rdi
        jne             .L2
        vzeroupper
        ret

All FP operations are in "packed" vector form, and there are fewer instructions (it is a twice-unrolled version) because mul+add is joined into a single FMA: 16 instructions per 64 bytes of data (128 bytes if AVX512).

Another good thing about Horner's scheme is that it computes with a bit better accuracy within the FMA instruction, and it is only O(1) operations per loop iteration, so it doesn't accumulate that much error with longer arrays.

I think the optimization in Agner Fog's optimization manuals must date from the times of the Quake III fast inverse square root approximation. In those times SIMD was not wide enough to make too much of a difference, as well as lacking sqrt support. The manual says copyright 2004, so there were Celerons with SSE and not FMA. The first AVX desktop CPU was launched much later, and FMA was even later than that.

---------------------------------------------------------------------

Here is another version with strength-reduction (for the id value):

    void f(double * const __restrict__ data){
        double B[]={2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.2,
                    2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.2};
        double C[]={3.3,3.3,3.3,3.3,3.3,3.3,3.3,3.3,
                    3.3,3.3,3.3,3.3,3.3,3.3,3.3,3.3};
        double id[] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};

        for(long long i=0; i<1024; i+=16)
        {
            double result[]={1.1,1.1,1.1,1.1,1.1,1.1,1.1,1.1,
                             1.1,1.1,1.1,1.1,1.1,1.1,1.1,1.1};

            // same thing, just with explicit auto-vectorization help
            for(int j=0;j<16;j++)
            {
                result[j] *= id[j];
                result[j] += B[j];
                result[j] *= id[j];
                result[j] += C[j];
                data[i+j] = result[j];
            }

            // strength reduction
            for(int j=0;j<16;j++)
            {
                id[j] += 16.0;
            }
        }
    }

assembly:

    .L2:
        vmovapd         zmm3, zmm0
        vmovapd         zmm2, zmm1
        sub             rax, -128
        vfmadd132pd     zmm3, zmm6, zmm7
        vfmadd132pd     zmm2, zmm6, zmm7
        vfmadd132pd     zmm3, zmm5, zmm0
        vfmadd132pd     zmm2, zmm5, zmm1
        vaddpd          zmm0, zmm0, zmm4
        vaddpd          zmm1, zmm1, zmm4
        vmovupd         ZMMWORD PTR [rax-128], zmm3
        vmovupd         ZMMWORD PTR [rax-64], zmm2
        cmp             rdx, rax
        jne             .L2
        vzeroupper
        ret

When the data, A, B and C arrays are aligned with alignas(64) and the data array size is small enough, it runs at 0.26 cycles per element.
— answered May 20 at 13:06 by huseyin tugrul buyukisik (edited May 21 at 20:00)

Comments:

* 3 The querent was only testing on Nehalem and Sandybridge Xeon CPUs, which don't support FMA. You forgot to mention the build options you used to let it auto-vectorize with AVX2+FMA. But yes, this is a good strategy if you have FMA. Possibly even if you don't, on CPUs where mulpd runs on a separate port from addpd, so they only compete for front-end throughput. If you just care about speed, not accuracy, the strategy suggested in gnasher's answer (which I'd suggested earlier in comments), with multiple accumulators to hide FP latency, is probably still even better, avoiding the int->FP cost. - Peter Cordes May 20 at 13:54

* 1 Right, there is an int->FP cost and it is not hidable by aggressive unrolling. Maybe it's better to represent some magic with std::memcpy instead of casting. I will test it when I have some more time. (If the loop count is less than 53 bits, it should work.) - huseyin tugrul buyukisik May 20 at 13:56

* Exactly, this algorithm can't hide it. (You either have to convert, or do an FP increment with set1(1.0).) It is hideable with strength-reduction as in compute2 in the question, which can be done with sufficient unrolling with multiple SIMD accumulators (to hide FP latency), I think. Maybe needing one multiply per 2 adds, so perhaps an add + FMA. - Peter Cordes May 20 at 14:02

* 1 3 FP math ops per vector of results means the theoretical best-case with 2/clock FP math throughput is 3 ops * 0.5 c/op / 8 elements per ZMM vector = 0.1875 cycles per element. But there are also two (eliminated) vmovapd instructions and two stores, so that fills the whole 4-wide pipeline on Skylake-X; only Ice Lake's wider pipeline can also run the loop overhead. But Ice Lake disabled mov elimination (at least for integer; I forget for SIMD), so those vmovapd instructions would compete with FMAs. - Peter Cordes May 20 at 15:20

* 1 Of course the current bottleneck in your code is the vaddpd latency of 4 cycles (SKX or ICX, only down to 3 cycles with Alder Lake). More unrolling is needed to hide that latency on any current CPU; you're only unrolling by 2 ZMM vectors here. (And of course the output array should fit in L1d cache, since you need to store to it every 1.5 clock cycles, one vector of results per 3 FP math ops = one per 1.5 cycles.) 4-cycle latency with a required throughput of one per 1.5 clock cycles (for the vaddpd) needs an unroll of at least 4/1.5 = 2.666, so might as well do 4. - Peter Cordes May 20 at 15:23

[+0]

Seems like you can have the cake and eat it too, by manually parallelizing the code to something like this:

    double A4 = A+A+A+A;
    double Z = 3A+B;
    double Y1 = C;
    double Y2 = A+B+C;
    int i;
    // ... setup unroll when LEN is odd...

    for(i=0; i