[HN Gopher] Why does this code execute more slowly after strengt...
___________________________________________________________________
Why does this code execute more slowly after strength-reducing
multiplications?
Author : ttsiodras
Score : 249 points
Date : 2022-05-29 13:40 UTC (9 hours ago)
(HTM) web link (stackoverflow.com)
(TXT) w3m dump (stackoverflow.com)
| manholio wrote:
| Seems like you can have the cake and eat it too, by manually
| parallelizing the code to something like this:
|           double A4 = A + A + A + A;
|           double Z  = 3*A + B;
|           double Y1 = C;
|           double Y2 = A + B + C;
|           int i;
|           // ... setup unroll when LEN is odd ...
|           for (i = 0; i < LEN; i++) {
|               data[i]   = Y1;
|               data[++i] = Y2;
|               Y1 += Z;
|               Y2 += Z;
|               Z  += A4;
|           }
|
| Probably not entirely functional as written, but you get the
| idea: unroll the loop so that the data dependent paths can each
| be done in parallel. For the machine being considered, a 4 step
| unroll should achieve maximum performance, but of course, you get
| all the fun things that come with hard-coding the architecture in
| your software.
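| For what it's worth, here's a corrected sketch of that idea
| (assuming, as in the question, that data[i] should end up as
| A*i*i + B*i + C; with a 2-way unroll each stream needs its own
| step, and both steps advance by 8*A):
|           double Y1 = C;             // f(0)
|           double Y2 = A + B + C;     // f(1)
|           double Z1 = 4*A + 2*B;     // f(2) - f(0)
|           double Z2 = 8*A + 2*B;     // f(3) - f(1)
|           double A8 = 8*A;           // second difference per stream
|           int i;
|           for (i = 0; i + 1 < LEN; i += 2) {
|               data[i]     = Y1;
|               data[i + 1] = Y2;
|               Y1 += Z1;  Z1 += A8;
|               Y2 += Z2;  Z2 += A8;
|           }
|           if (i < LEN) data[i] = Y1;  // odd-LEN tail
| The two Y chains never read each other's results, so their
| additions can be in flight simultaneously.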
| Beltiras wrote:
| Is the implication here that tail call optimizations don't work
| anymore? They might seem to do the proper thing on the language
| level but the CPU just can't think that way.
| Retr0id wrote:
| How does this relate to tail calls?
| Beltiras wrote:
| Tail calls have an accumulator that collects the result,
| which is a very similar pattern to the one in the example.
| It's an optimization in LISP and similar languages.
| readams wrote:
| Tail calls will work the same as a loop. So if you have data
| dependencies between loop iterations or between tail-call-
| eliminated stack frames, then it will be slower than if you do
| not have those dependencies.
| lvass wrote:
| At least in some languages like Elixir and probably most FP
| languages, tail calls are practically only used when said
| dependencies exist, so their usage can perhaps be a marker
| for when some optimizations are not possible.
| dragontamer wrote:
| Tldr: the autovectorizer transforms the first code into SIMD
| instructions, but the second into scalar 64-bit instructions.
|
| SIMD is very powerful, and modern compilers can sometimes simd-
| ify your code automatically.
|
| -------
|
| The second code can probably become SIMD as well, but it's beyond
| GCC's ability to autovectorize it in that form. I kinda want to
| give it a go myself but don't have time today...
|
| Autovectorizers are mysterious, and often harder to predict and
| use than explicit SIMD code (like OpenCL or CUDA).
| mgaunard wrote:
| The problem is not the ability of the compiler; it's that the
| two forms are not numerically equivalent, so the compiler isn't
| allowed to do that optimization.
| parenthesis wrote:
| How can one go about making one's code amenable to these kinds
| of compiler optimizations?
| astrange wrote:
| Don't bother. Autovectorization almost fundamentally doesn't
| work.
|
| You can write code using explicit vectors and it's not too
| bad, though less cross platform than you'd like.
|
| An actual answer involves things like restrict and alignment
| keywords.
| DangitBobby wrote:
| I also wonder if there's any compiler that allows you to hint
| that you expect certain optimizations to occur (like
| vectorization), and fails to compile at all if they do not.
| dragontamer wrote:
| You study autovectorizers, then you enable autovectorization
| warning flags and carefully read your compiler's output.
|
| If the compiler says autovectorization failed, you rewrite
| the code until the autovectorizer succeeds.
|
| https://docs.microsoft.com/en-us/cpp/build/reference/qvec-
| re...
| janwas wrote:
| hm, I have had limited success with such nudging - it's
| certainly not a programming model with a predictable
| outcome. And as soon as the compiler or its flags are
| updated, maybe the situation changes.
| skavi wrote:
| A good way is to arrange your data in structs of arrays
| rather than in arrays of structs. This allows the compiler to
| generate code which just loads linear sections of memory into
| SIMD registers. It's also just more cache efficient in
| general.
|
| Check out data oriented design if you aren't already
| familiar.
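| As a sketch (hypothetical struct and field names):
|           #define N 1024
|           // array of structs: x, y, z interleaved in memory
|           struct ParticleAoS { double x, y, z; };
|           // struct of arrays: each field is one contiguous run
|           struct ParticlesSoA { double x[N], y[N], z[N]; };
|           // with SoA this loop reads three linear streams,
|           // which maps directly onto SIMD loads
|           void step(struct ParticlesSoA *p) {
|               for (int i = 0; i < N; i++)
|                   p->x[i] += p->y[i] * p->z[i];
|           }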
| djmips wrote:
| Note that sometimes it's sufficient to have arrays of
| structs of arrays. That is, you don't necessarily need to go
| whole hog.
| MauranKilom wrote:
| > The second code can probably become SIMD as well, but it's
| beyond GCC's ability to autovectorize it in that form.
|
| Usually, for floating point operations the compiler simply has
| no chance to do anything clever. NaNs, infinities and signed
| zero mean that even the most "obvious" identities don't
| actually hold. For example, x + 0 == x (where the == is
| bitwise) does not hold for x = -0.
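| A tiny demonstration (under the default round-to-nearest
| mode):
|           #include <stdio.h>
|           #include <math.h>
|           int main(void) {
|               double x = -0.0;
|               // -0.0 + 0.0 yields +0.0, so folding "x + 0"
|               // into "x" would change the sign bit
|               printf("%d %d\n", signbit(x) != 0,
|                      signbit(x + 0.0) != 0);  // prints "1 0"
|               return 0;
|           }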
| sampo wrote:
| > For example, x + 0 == x (where the == is bitwise) does not
| hold for x = -0.
|
| But when operating on floating point numbers, == is not
| bitwise, and -0 == +0.
| Someone wrote:
| That doesn't matter for the case where a compiler would
| like to replace _x + y_ by _x_ because it knows _y == 0_.
|
| The compiler would have to test for the _x == -0_ case, and
| doing that is rarely (if ever) faster than computing _x +
| y_.
| MauranKilom wrote:
| If the user provides code where the result should be +0.0
| and the compiler emits code that results in -0.0 (and the
| user has not explicitly enabled relaxed IEEE754
| conformance), that's a bug in the compiler.
| sampo wrote:
| That is correct. But we were talking about how the ==
| operator compares zeros. According to IEEE 754, -0 == +0
| even though their bit patterns are not identical. And, by
| the way, NaN != NaN even when the bit patterns are
| identical.
| andi999 wrote:
| so x+0 is 0?
| ncmncm wrote:
| No. -0.0 + 0 is not necessarily bitwise-equal to 0.0.
| Though it arguably could be, depending on details.
| Sesse__ wrote:
| -0.0 + x == x, whether x is -0.0 or +0.0.
|
| +0.0 + x, on the other hand, is not always equal to x
| (take x = -0.0: the result is +0.0).
| titzer wrote:
| Note that IEEE 754 defines exactly which zero you get in
| every combination of operands. -0 + 0 is actually 0.
| anticensor wrote:
| There is a compiler switch for that too: -fassociative-math
| -funsafe-math-optimizations
| ascar wrote:
| > Tldr: the autovectorizer transforms the first code into SIMD
| instructions, but the second into scalar 64-bit instructions.
|
| That's not the tldr. It's actually more wrong than right.
|
| The actual TLDR is: loop-carried dependencies prevent the
| compiler _and_ your CPU from parallelizing. The SIMD part is the
| compiler; the more important part here is the CPU, though.
|
| Long explanation:
|
| Let's start with the compiler. Yes, the first version can be
| vectorized, and you can see that in the included screenshots of
| the assembly output: the use of addpd [1] (adds 2 pairs of
| 64-bit doubles simultaneously) instead of addsd [2] (adds 1
| pair of 64-bit doubles). Both operations have identical
| throughput/latency on any architecture that's not too ancient
| (e.g. check the information for the corresponding intrinsic
| here [3]; unfortunately Agner Fog's instruction tables don't
| include ADDSD for most architectures). So while you can crunch
| 2 operations using SIMD at the same time, you are still doing 3
| multiplications and 2 additions instead of 2 additions. The
| multiplications and additions have the same throughput, at
| least on Skylake. So we can use simple math, and 5/2 is still
| more than 2!
|
| Now to the CPU part. The loop dependency prevents the CPU from
| properly utilizing instruction pipelining [4], and now
| differences between throughput and latency come into play. In a
| pipelined scenario the CPU can work on multiple steps of the
| instruction cycle in parallel (from fetching the instruction,
| then decoding it, to executing the operation and eventually
| writing the result back to the register) and thus achieve
| higher throughput. However, if your next instruction depends on
| the one before it, the CPU has to finish all the steps of the
| pipeline, from start to finish, before the next instruction can
| start. This is the latency of an instruction. For example,
| mulsd/mulpd have 8 times higher latency than throughput on
| Intel Skylake.
|
| So while SIMD comes in at a factor of 2, the loop dependency
| stops the CPU from realizing a potential factor of 8.
|
| Peter Cordes also correctly points this out in his comments and
| answer to the question on stackoverflow.
|
| [1] https://www.felixcloutier.com/x86/addpd
|
| [2] https://www.felixcloutier.com/x86/addsd
|
| [3] https://www.intel.com/content/www/us/en/docs/intrinsics-
| guid...
|
| [4] https://en.wikipedia.org/wiki/Instruction_pipelining
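| To see the latency-vs-throughput effect without any SIMD at
| all, compare two summation loops (a sketch; GCC won't do this
| rewrite itself without -ffast-math, since it changes rounding):
|           // dependent chain: every addition must wait for the
|           // previous result, so the loop runs at add *latency*
|           double sum_dep(const double *x, int n) {
|               double s = 0.0;
|               for (int i = 0; i < n; i++)
|                   s += x[i];
|               return s;
|           }
|           // four independent chains: the adds overlap in the
|           // pipeline, so the loop approaches add *throughput*
|           double sum_ilp(const double *x, int n) {
|               double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
|               int i;
|               for (i = 0; i + 3 < n; i += 4) {
|                   s0 += x[i];     s1 += x[i + 1];
|                   s2 += x[i + 2]; s3 += x[i + 3];
|               }
|               for (; i < n; i++) s0 += x[i];   // tail
|               return (s0 + s1) + (s2 + s3);
|           }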
| rasz wrote:
| Both versions use SSE and are pipelined; the problem with the
| second one is the data dependency: it has only two adds, but the
| second add directly depends on the first one's result = stall.
| dragontamer wrote:
| SSE includes "scalar" adds (addsd), which operate on a single
| floating-point value. These are "non-SIMD" instructions,
| serving as a replacement for the legacy x87 instructions.
|
| There is also a "parallel" add (addpd).
|
| Carefully look at the assembly language: the 1st version uses
| parallel adds (addpd) and parallel multiplies. The 2nd
| version uses scalar adds (addsd).
|
| The other major point is that the 2nd version uses a single
| 64-bit qword move per loop iteration, while the 1st version
| uses a full 128-bit move per loop iteration.
|
| ---------
|
| SSE is used for scalar double-precision these days, because
| scalar-SSE is faster than x87 instructions... and better
| matches the standards (x87 had "higher precision" than the
| IEEE specs, so it has different results compared to other
| computers. SSE is closer to the specs)
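| If you want to poke at the difference yourself, the same
| pair of instructions is exposed via the SSE2 intrinsics (a
| minimal sketch):
|           #include <emmintrin.h>
|           // addpd: both 64-bit lanes advance in one instruction
|           __m128d add_both(__m128d y, __m128d z) {
|               return _mm_add_pd(y, z);
|           }
|           // addsd: only the low lane is added; the high lane
|           // of y is passed through unchanged
|           __m128d add_low(__m128d y, __m128d z) {
|               return _mm_add_sd(y, z);
|           }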
| [deleted]
| loeg wrote:
| The other essential insight is that the second version has data
| dependencies between loop iterations. Without that, the tldr is
| incomplete.
| dragontamer wrote:
| That is why autovectorization fails. But it seems pretty
| easily solvable in this case, actually.
|
| EDIT: Solvable by a human. I dunno much about compilers
| though, so I dunno if there's some kind of language issue
| that would prevent the unrolling of this dependency.
| stormbrew wrote:
| I think the point GP is making is that even without
| vectorization, the data dependency causes stalls even in
| normal, single data instructions. That is, a data
| dependency between iterations of loops will hurt
| performance even for non-vectorizable calculations (or on
| CPUs with high ILP but no really good vector instructions,
| which granted is probably a small pool since both those
| things came about at about the same time).
| touisteur wrote:
| I think that is the point Peter Cordes is trying to make
| (quite politely but firmly) over there on stackoverflow.
| It's not only about autovectorization. The main point
| there is the loop-carried dependency that prevents both
| the compiler (autovec) _and_ the processor (ILP) to do
| their thing.
|
| Loop-carried dependency is the big culprit here. I wish
| we had a culture of writing for loops with the index a
| constant inside the loop, as in the Ada for statement,
| and not the clever C while loops or for with two running
| variables... Simpler loop syntax makes so many static
| analyses 'easier' and kind of forces the brain to think
| in bounded independent steps, or to reach for higher
| level constructs (e.g. reduce).
| hansvm wrote:
| It's a different issue altogether. Even in vectorized code
| artificial data dependencies are an issue. E.g., for fast
| dot products (not actually usually applicable on most input
| sizes because of memory bandwidth, but related problems can
| be CPU bound) you'll roughly double your execution speed by
| alternating between two running totals and merging those at
| the end, and that gain is mostly orthogonal to the question
| of whether those running totals are vectorized.
|
| Edit: And many modern compilers have been doing that
| particular optimization for a few years anyway, but it's
| still an important idea to keep in mind for any non-trivial
| graph of operations.
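| Roughly like this (a sketch of the two-totals idea, not
| tuned code):
|           double dot2(const double *a, const double *b, int n) {
|               double t0 = 0.0, t1 = 0.0;  // independent totals
|               int i;
|               for (i = 0; i + 1 < n; i += 2) {
|                   t0 += a[i] * b[i];          // the two chains
|                   t1 += a[i + 1] * b[i + 1];  // never wait on
|               }                               // each other
|               if (i < n) t0 += a[i] * b[i];   // odd tail
|               return t0 + t1;                 // merge at the end
|           }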
| zeusk wrote:
| How would the second approach be vectorized, given that each
| iteration's input depends on the previous iteration's
| output?
| dragontamer wrote:
| Unroll the dependency until you are longer than the SIMD
| width.
|
| Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on
| each other, you can vectorize to SIMD-width 8.
|
| Or in other words: i+7 can depend on i-1 no problems.
| yongjik wrote:
| If they were integer variables, I guess the compiler would
| have done that, but you can't really do that with floats
| because i+A+A is not necessarily i+2*A. (Of course, in this
| particular example, the difference doesn't matter for the
| programmer, but the compiler doesn't know that!)
|
| I think there's some gcc option that enables these
| "dangerous" optimizations. -ffast-math, or something like
| that?
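| Concretely, a case where the two forms differ (illustrative
| values, not the question's code):
|           #include <stdio.h>
|           int main(void) {
|               double x = 1e16, a = 0.7;  // ulp(1e16) == 2.0
|               printf("%.1f\n", (x + a) + a);  // 10000000000000000.0
|               printf("%.1f\n", x + 2 * a);    // 10000000000000002.0
|               return 0;
|           }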
| oezi wrote:
| Funny that this question gained 180 upvotes in 10 days, when it
| could just as well have received the reverse, given how little
| the author tried to figure out the (rather obvious) data
| dependency on his own.
| LAC-Tech wrote:
| I'll put my hand up and say none of the post or answers were
| obvious to me. I found it all very interesting.
| diarrhea wrote:
| Is this relevant to interpreted languages as well? I'm thinking
| perhaps Python's bytecode could have similar optimisations.
| sampo wrote:
| In the post, multiplications and 2 additions are not faster than
| 2 additions.
|
| The post compares (1) loop code that can be vectorized, as loop
| rounds are independent and do not depend on the result from the
| previous round, and (2) an "optimization" that makes calculations
| shorter, but also makes each loop round depend on the result of
| the previous round, so this cannot be vectorized.
| dang wrote:
| Ok, we've reverted the title to what the article says.
|
| (Submitted title was "Multiplications and 2 additions are
| faster than 2 additions")
| magicalhippo wrote:
| > In the post, multiplications and 2 additions are not faster
| than 2 additions.
|
| Reminded me of when I tried to optimize some code using
| assembly on x86, maybe 8-10 years ago. The compiler spit out a
| DIV [EDI] instruction which was inside a hot loop. I saw how I
| could easily rewrite it to use a register instead of the
| indirect memory access.
|
| So I did, and it was significantly slower, like 10-15%... I
| even took care to align the loop target address and so on. If I
| changed my code to use the memory indirection it was measurably
| faster.
|
| And with that I decided hand-optimizing x86 assembly was
| definitely a skill to leave behind.
| userbinator wrote:
| You probably added an extra instruction or two to put the
| value in a register? The CPU can already split memory
| accesses into uops and cached accesses are fast, so there's
| no point in doing that because it'll just waste an additional
| register (vs. using one of the many the renamer generates)
| and add instructions to decode.
|
| x86 is fundamentally a CISC; if you treat it like a RISC, it
| will definitely disappoint.
| ncmncm wrote:
| Cached accesses are not fast anymore. Modern chips can
| retire a dozen or even more instructions in the time it
| takes to move a word in from L1 cache.
| magicalhippo wrote:
| Not really, the compiler did the round-trip into the local
| variable (memory), I did it via register.
|
| I asked around at the time and someone mentioned that I
| might have overtaxed certain execution ports or something
| like that, but yeah that just cemented my belief that x86
| optimization is not my cup of tea anymore. Better to spend
| time learning how to write code the compiler can optimize
| well.
| astrange wrote:
| The compiler doesn't know anything about optimizing x86
| code either. The actual details there are too secret for
| Intel to want to accurately describe them in gcc, they're
| different across different CPUs, and compilers just
| aren't as good as you think they are.
|
| (But CPUs are usually better than you think they are.)
| CamperBob2 wrote:
| What made me give up was when I found that an
| optimization for one x86 processor was an impairment on
| another. Your DIV example might have had a different
| outcome on an AMD chip, or on the next- or previous-
| generation part from Intel.
|
| Nowadays, counting cycles is a game for monks who don't
| actually have to get anything done.
| sillysaurusx wrote:
| More generally, "stop storing state, unless it makes the
| program less complicated."
|
| The first version is simple. The second version is more
| complicated. The simplicity is because the first version can be
| represented as a formula; imagine trying to write out the
| second version as a formula, and the complexity becomes
| obvious.
|
| (The addition loop would have to be written as a recurrence
| relation, which of course means every step depends on the
| previous step.)
|
| Complexity isn't always obvious. It's common in C programs to
| write a loop like dst[index++] = src[i], particularly in
| deeply nested for-loops. In my experience it's
| almost always worth rewriting it so that the indices are
| computable entirely from the loop iteration variables, with no
| state. It helps you build a mental map of the operations,
| because you can think of it in terms of geometry (memcpy =
| copying rectangles onto other larger rectangles) whereas when
| the index is stateful it becomes quite a lot harder to
| visualize in higher dimensions. At least for me.
|
| We've been building a compiler at Groq for our custom ML
| hardware. (Roughly, "turn pytorch into our ISA with no extra
| programmer effort.") I used to think of posts like this as
| "Well, modern compilers are so fast, who really cares about
| autovectorization?" -- it turns out _you_ care when you need to
| write your own compiler. :)
|
| MLIR is pretty cool. It makes a lot of transformations like
| this pretty easy. The MLIR "krnl" dialect can also
| automatically transform nested loops into tiled iteration.
| Graphics devs will know what I mean -- no need to loop over 8x8
| blocks of pixels manually, just write "for x in range(width):
| for y in range(height): ..." and set the block size to 8.
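| For a concrete (hypothetical) example, copying a w-by-h block
| out of a larger image with stride src_stride:
|           // stateful: dst_i couples every iteration to the last
|           int dst_i = 0;
|           for (int y = 0; y < h; y++)
|               for (int x = 0; x < w; x++)
|                   dst[dst_i++] = src[y * src_stride + x];
|           // stateless: both indices fall out of the loop
|           // variables, so any (x, y) can be reasoned about alone
|           for (int y = 0; y < h; y++)
|               for (int x = 0; x < w; x++)
|                   dst[y * w + x] = src[y * src_stride + x];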
| mike_hock wrote:
| The first comment on the code _does_ say precisely this.
| Arwill wrote:
| SIMD vectorization is like quantum mechanics, it becomes
| classical if you look.
| dreamcompiler wrote:
| The other issue is instruction-level parallelism, as another
| poster in TFA pointed out. Even within a single loop iteration
| the "unoptimized" code is more likely to exploit multiple ALUs
| if they exist, regardless of vectorization instructions.
| savant_penguin wrote:
| What would be the difference in power consumption between the
| two methods? (Would it always be better to multiply? If so, why
| not multiply by one?)
| dento wrote:
| The faster version usually consumes less power, as it allows
| the core more time in sleep. This is known as race-to-sleep or
| race-to-idle.
| chaboud wrote:
| The general rule to follow for power consumption on CPUs is to
| do your work quickly and then get to sleep. Propagating the
| clock is going to eat the bulk of your power. The mild
| difference between multiply and add in actual usage is lost in
| the noise (orders of magnitude smaller). The bigger penalty in
| this case
| is the inter-iteration dependency, which, vectorized or not,
| runs the risk of holding up the whole show due to pipelining.
|
| As a performance rule on modern processors: avoid using the
| result of a calculation as long as you reasonably can (in tight
| loops... You don't want to be out of cache.).
|
| Have fun threading the needle!
| varajelle wrote:
| The problem here is that the additions depend on values
| computed in the previous iteration of the loop. The version
| with multiplication is faster because there are no dependencies
| on the previous iteration, so the CPU has more freedom in
| scheduling the operations.
|
| The power consumption is a good question.
| MikeHolman wrote:
| Scheduling plays a part, but it is definitely more about
| vectorization.
| dgb23 wrote:
| If we ignore the details, then this is a perfect example of how
| simpler code should be preferred _by default_ over complex code.
| Both for performance and reasoning. In this case and often
| elsewhere, these are related things.
|
| I'm taking the definition of simple and complex (or complected)
| from this talk[0]. In short: Simple means individual pieces are
| standing on their own, not necessarily easy or minimal. Complex
| means individual pieces are braided together into a whole.
|
| The first code is simpler than the second. Just have a look at
| the two solutions and you see what I mean: the individual
| instructions in the first stand on their own and clearly
| declare what they mean. The second example is more complex,
| because each line/subexpression cannot be understood in
| isolation; there is an implicit coupling that requires the
| programmer _and_ the machine to understand the whole thing for
| it to make sense. The CPU apparently fails at that and cannot
| schedule the machine instructions efficiently.
|
| The example illustrates performance benefits of simpler code, but
| there are others too:
|
| For example a type checker might be able to infer things from
| simpler code, but not from complex code. Again, complexity is
| about coupling, which can for example arise from making
| assumptions about things that are out of bounds or logic that
| happens in some other part of your program, which _this_ part
| relies on.
|
| Things like these can either be outright rejected by a type
| checker or simply ignored and sometimes we can provide additional
| ceremony to make it happy. But there is always an underlying
| question when these issues arise: Can I express this in a simpler
| way? It's sometimes possible to describe rich semantics with
| simpler, more primitive types and explicit coordination.
|
| Another thing that comes to mind are rich ORMs. They can be very
| convenient for common cases. But they have footguns for the
| general case, because they _complect_ so many things, such as
| validation, storage, domain rules, caching etc. And they are
| leaky abstractions because we have to do certain things in
| certain ways to avoid bloated SQL, N+1 etc.
|
| They are not simple (and certainly not declarative) so we program
| against a black box. There are simpler "ORMs" that I very much
| like, but they typically only provide a minimal set of orthogonal
| features, such as query builders, and mapping results into plain
| data structures. It is simpler to use these kind of orthogonal
| tools.
|
| Last but not least: Simple code can be pulled apart and changed
| more easily without having to worry about introducing problems.
| The substitution rule or referential transparency enables this by
| guaranteeing that each expression can be replaced by the value it
| will evaluate to. This also implies that other expressions that
| evaluate to the same value can be substituted freely.
|
| [0] Simple Made Easy - Rich Hickey at Strangeloop 2011
| https://www.youtube.com/watch?v=LKtk3HCgTa8
| djmips wrote:
| You can't ignore the details if you want the fastest
| performance. Unfortunately sometimes the solution that's the
| fastest is also more complex. Simplicity tends to win and also
| is less likely to be buggy, easier to maintain etc but
| sometimes the fastest code deliberately introduces coupling,
| hardware specific cache shenanigans or unreadable hand written
| asm and/or inlined SIMD. This is often counter to your
| hypothesis.
| kortex wrote:
| > sometimes the fastest code deliberately introduces
| coupling, hardware specific cache shenanigans or unreadable
| hand written asm and/or inlined SIMD.
|
| >> this is a perfect example of how simpler code should be
| preferred _by default_ over complex code.
|
| Extra emphasis on *by default*.
|
| You are both on the same side of the coin. This is the
| essence of "premature optimization is the root of all evil".
| Simplicity should be the default. Once you know your
| implementation is solid, then you can go back and optimize.
|
| TFA is quite contrived, but imagine a much more complex
| process. You see some code:
|           for i in loop:
|               foo[i] = myfunc(foo[i], bar[i])
|
| You have to go in and dig into why exactly foo is being fed
| back in at every step. But if instead you saw:
|           for i in loop:
|               foo[i] = myfunc(bar[i])
|
| You _immediately_ know you have more options at your disposal
| to optimize that loop. Further, the _compiler_ immediately
| knows it has more options. Maybe this is a good candidate for
| automatic unrolling or Duff's device. Maybe you want to send
| this to a GPU shader, or use a macro to parallelize the loop.
| But it's a lot harder to get to that point if your starting
| point is the complected, stateful one.
| ww520 wrote:
| Besides the autovectorization, the second version also has two
| additional assignments. Depending on where that storage lives
| (registers, L1/L2, or the stack), there might be a performance
| hit.
| layer8 wrote:
| TLDR: Due to loop-carried dependencies preventing parallelized
| execution:
| https://en.wikipedia.org/wiki/Loop_dependence_analysis#Loop-...
|
| In the 2 additions version, computation of the next iteration
| depends on the results of the preceding iteration. In the
| multiplication version, the computations are independent for each
| iteration, enabling parallel execution (by SIMD and/or
| pipelined/superscalar execution).
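| A minimal reconstruction of the two shapes (not the question's
| exact code):
|           // version 1: each data[i] depends only on i
|           for (int i = 0; i < LEN; i++)
|               data[i] = A*i*i + B*i + C;
|           // version 2: strength-reduced; Y and Z carry values
|           // from the previous iteration, serializing the loop
|           double Y = C, Z = A + B;
|           for (int i = 0; i < LEN; i++) {
|               data[i] = Y;
|               Y += Z;
|               Z += A + A;
|           }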
| mlatu wrote:
| you could try to prefill the first cell with -(A+B), then each
| cell gets a+b+c, and in a loop for (j=0; j<i; j++): (A+A)
|
| but I'm too lazy to test that
| [deleted]
| someweirdperson wrote:
| I doubt repeatedly adding floating point numbers is a good idea.
| With every addition the sum moves further away from the addend,
| and as their relative difference grows, the rounding problems
| grow as well.
|
| Just because the algebra works, that doesn't guarantee that the
| code does, too.
| LAC-Tech wrote:
| The algebra works for reals, binary floating point is a
| different beast!
| ascar wrote:
| Your points are factually correct, but in practice not a big
| concern if your floating-point type is much more precise than
| necessary for the numbers used.
|
| E.g. if you use doubles for the range of 32-bit integers, even
| adding 1.0 two billion times to 2 billion still ends up at 4
| billion. Even adding 1.0 twenty billion times to 20 billion
| ends up at 40 billion. Now, adding 0.1 twenty billion times to
| 2 billion ends up 1908 short on my CPU, i.e. about 19080
| absorption/rounding errors occurred. You need some serious
| differences in magnitude and a huge number of operations to
| actually trigger errors.
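| A scaled-down, runnable illustration of the same absorption
| effect (float instead of double so it shows up within seconds;
| illustrative numbers):
|           #include <stdio.h>
|           int main(void) {
|               // once s reaches 2^21, adjacent floats are 0.25
|               // apart, and s + 0.1f rounds back to s every time
|               float s = 0.0f;
|               for (int i = 0; i < 30000000; i++) s += 0.1f;
|               printf("%.1f\n", s);  // pins at 2097152.0, far
|               return 0;             // from the nominal 3000000.0
|           }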
| sray wrote:
| The author of the post is aware of this. They explicitly say
| that is not the point of the question.
| oconnor663 wrote:
| The part about data dependencies across loop iterations is
| fascinating to me, because it's mostly invisible even when you
| look at the generated assembly. There's a related optimization
| that comes up in implementations of ChaCha/BLAKE, where we
| permute columns around in a kind of weird order, because it
| breaks a data dependency for an operation that's about to happen:
| https://github.com/sneves/blake2-avx2/pull/4#issuecomment-50...
| benreesman wrote:
| Oh man, I have been smoked by data dependencies like this so
| many times. I've gotten a lot better over the years at "seeing"
| data dependencies in godbolt or whatever, but they still slip
| through in my code and that of my colleagues way more often
| than I'd like.
|
| Is anyone aware of good tooling for automatically catching this
| sort of thing even some of the time?
| slaymaker1907 wrote:
| I'm not sure if there are existing rules for it, but you could
| write a CodeQL query looking for data dependencies in loops.
| Obviously dependencies are sometimes required, but it could at
| least tell you they were there.
| mgaunard wrote:
| It's fairly obvious: the rewrite prevents parallelization because
| floating-point isn't associative.
|
| You'd need to parallelize it explicitly (which can be done by
| just unrolling the loop).
___________________________________________________________________
(page generated 2022-05-29 23:00 UTC)