[HN Gopher] Why does this code execute more slowly after strengt...
       ___________________________________________________________________
        
       Why does this code execute more slowly after strength-reducing
       multiplications?
        
       Author : ttsiodras
       Score  : 249 points
       Date   : 2022-05-29 13:40 UTC (9 hours ago)
        
 (HTM) web link (stackoverflow.com)
 (TXT) w3m dump (stackoverflow.com)
        
       | manholio wrote:
       | Seems like you can have the cake and eat it too, by manually
       | parallelizing the code to something like this:
        | double A8 = 8*A;
        | double Z1 = 4*A + 2*B;   // step for the even-index chain
        | double Z2 = 8*A + 2*B;   // step for the odd-index chain
        | double Y1 = C;           // data[0]
        | double Y2 = A + B + C;   // data[1]
        | 
        | int i;
        | // ... handle the leftover element when LEN is odd ...
        | 
        | for (i = 0; i + 1 < LEN; i += 2) {
        |     data[i]   = Y1;
        |     data[i+1] = Y2;
        |     Y1 += Z1;
        |     Y2 += Z2;
        |     Z1 += A8;
        |     Z2 += A8;
        | }
        | 
        | Probably not entirely functional as written, but you get the
        | idea: unroll the loop so that the data-dependent chains can each
        | be advanced in parallel. For the machine being considered, a
        | 4-step unroll should achieve maximum performance, but of course,
        | you get all the fun things that come with hard-coding the
        | architecture into your software.
        
       | Beltiras wrote:
       | Is the implication here that tail call optimizations don't work
       | anymore? They might seem to do the proper thing on the language
       | level but the CPU just can't think that way.
        
         | Retr0id wrote:
         | How does this relate to tail calls?
        
           | Beltiras wrote:
            | Tail-recursive functions have an accumulator that collects
            | the result, a very similar pattern to the one in the example.
            | It's an optimization in LISP and similar languages.
        
         | readams wrote:
          | Tail calls will work the same as a loop. So if you have data
         | dependencies between loop iterations or between tail-call-
         | eliminated stack frames, then it will be slower than if you do
         | not have those dependencies.
        
           | lvass wrote:
           | At least in some languages like Elixir and probably most FP
           | languages, tail calls are practically only used when said
           | dependencies exist, so their usage can perhaps be a marker
           | for when some optimizations are not possible.
        
       | dragontamer wrote:
        | Tldr: the autovectorizer transforms the first code into SIMD
        | instructions, but the second into scalar 64-bit instructions.
       | 
       | SIMD is very powerful, and modern compilers can sometimes simd-
       | ify your code automatically.
       | 
       | -------
       | 
        | The second code can probably become SIMD as well, but it's beyond
        | GCC's ability to autovectorize it in that form. I kinda want to
        | give it a go myself but don't have time today...
       | 
        | Autovectorizers are mysterious, and often harder to reason about
        | and use than explicitly SIMD code (like OpenCL or CUDA).
        
         | mgaunard wrote:
          | The problem is not the ability of the compiler; it's that the
          | transformation is not numerically equivalent, so the compiler
          | isn't allowed to do that optimization.
        
         | parenthesis wrote:
          | How can one go about making one's code amenable to these kinds
          | of compiler optimizations?
        
           | astrange wrote:
           | Don't bother. Autovectorization almost fundamentally doesn't
           | work.
           | 
            | You can write code using explicit vectors and it's not too
            | bad, though less cross-platform than you'd like.
           | 
           | An actual answer involves things like restrict and alignment
           | keywords.
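            | 
            | For instance, a minimal sketch using the GCC/Clang vector
            | extensions (the function and names here are made up, and
            | this won't build with MSVC):
            | 
            |     typedef double v2df __attribute__((vector_size(16)));
            | 
            |     void fill(double *data, long len,
            |               double A, double B, double C) {
            |         v2df i2 = {0.0, 1.0};  // two indices handled per step
            |         for (long i = 0; i + 1 < len; i += 2) {
            |             // A*i*i + B*i + C, both lanes at once
            |             v2df y = (A * i2 + B) * i2 + C;
            |             __builtin_memcpy(&data[i], &y, sizeof y);
            |             i2 += 2.0;  // advance both lanes
            |         }
            |     }
            | 
            | Each iteration is independent, so the compiler doesn't have
            | to prove anything: the vectorization is right there in the
            | source.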
        
           | DangitBobby wrote:
            | I also wonder if there's any compiler that allows you to hint
            | that you expect certain optimizations to occur (like
            | vectorization), and fails to compile at all if they don't.
        
           | dragontamer wrote:
            | You study autovectorizers, then you enable autovectorization
            | warning flags and carefully read your compiler's output.
           | 
           | If the compiler says autovectorization failed, you rewrite
           | the code until the autovectorizer works.
           | 
           | https://docs.microsoft.com/en-us/cpp/build/reference/qvec-
           | re...
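            | 
            | Concretely, the reporting switches look like this (the file
            | name is made up):
            | 
            |     g++     -O3 -fopt-info-vec-missed        example.cpp
            |     clang++ -O3 -Rpass-missed=loop-vectorize example.cpp
            |     cl      /O2 /Qvec-report:2               example.cpp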
        
             | janwas wrote:
              | hm, I have had limited success with such nudging - it's
              | certainly not a programming model with a predictable
              | outcome. And as soon as the compiler or its flags are
              | updated, the situation may change.
        
           | skavi wrote:
            | A good way is to have your data arranged in structs of arrays
            | rather than in arrays of structs. This allows the compiler to
            | generate code which simply loads linear sections of memory
            | into SIMD registers. It's also just more cache-efficient in
            | general.
           | 
           | Check out data oriented design if you aren't already
           | familiar.
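            | 
            | A minimal sketch of the idea (the particle layout is a
            | made-up example):
            | 
            |     #include <cstddef>
            |     #include <vector>
            | 
            |     // Array of structs: consecutive x values are 32 bytes
            |     // apart in memory.
            |     struct ParticleAoS { double x, y, z, mass; };
            | 
            |     // Struct of arrays: all x values are contiguous.
            |     struct ParticlesSoA {
            |         std::vector<double> x, y, z, mass;
            |     };
            | 
            |     // Contiguous, independent element-wise work: easy for
            |     // the compiler to turn into SIMD loads/stores.
            |     void advance(ParticlesSoA &p,
            |                  const std::vector<double> &vx, double dt) {
            |         for (std::size_t i = 0; i < p.x.size(); i++)
            |             p.x[i] += vx[i] * dt;
            |     }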
        
             | djmips wrote:
              | Note that sometimes it's sufficient to have arrays of
              | structs of arrays. That is, you don't necessarily need to
              | go whole hog.
        
         | MauranKilom wrote:
         | > The second code can probably become SIMD as well, but it's
         | beyond GCC's ability to autovectorizer it in that form.
         | 
         | Usually, for floating point operations the compiler simply has
         | no chance to do anything clever. NaNs, infinities and signed
         | zero mean that even the most "obvious" identities don't
         | actually hold. For example, x + 0 == x (where the == is
         | bitwise) does not hold for x = -0.
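          | 
          | A quick way to see it (assuming IEEE 754 doubles and the
          | default rounding mode):
          | 
          |     #include <cstdio>
          | 
          |     int main() {
          |         double x = -0.0;
          |         double y = x + 0.0;  // IEEE 754: -0.0 + (+0.0) is +0.0
          |         std::printf("%g %g\n", x, y);  // "-0 0": not bitwise equal
          |         std::printf("%d\n", x == y);   // "1": compares equal anyway
          |     }
          | 
          | So folding x + 0.0 into x would silently flip the sign of a
          | zero.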
        
           | sampo wrote:
           | > For example, x + 0 == x (where the == is bitwise) does not
           | hold for x = -0.
           | 
           | But when operating on floating point numbers, == is not
           | bitwise, and -0 == +0.
        
             | Someone wrote:
             | That doesn't matter for the case where a compiler would
             | like to replace _x + y_ by _x_ because it knows _y == 0_.
             | 
              | The compiler would have to test for the _x == -0_ case, and
              | doing that is rarely (if ever) faster than just computing
              | _x + y_.
        
             | MauranKilom wrote:
             | If the user provides code where the result should be +0.0
             | and the compiler emits code that results in -0.0 (and the
             | user has not explicitly enabled relaxed IEEE754
             | conformance), that's a bug in the compiler.
        
               | sampo wrote:
               | That is correct. But we were talking about how the ==
               | operator compares zeros. According to IEEE 754, -0 == +0
               | even though their bit patterns are not identical. And, by
               | the way, inf != inf even if their bit patters are
               | identical.
        
           | andi999 wrote:
           | so x+0 is 0?
        
             | ncmncm wrote:
             | No. -0.0 + 0 is not necessarily bitwise-equal to 0.0.
             | Though it arguably could be, depending on details.
        
               | Sesse__ wrote:
                | -0.0 + x = x, even when x is -0.0 or +0.0.
                | 
                | +0.0 + x, on the other hand, is not always equal to x.
        
               | titzer wrote:
               | Note that IEEE 754 defines exactly which zero you get in
               | every combination of operands. -0 + 0 is actually 0.
        
           | anticensor wrote:
           | There is a compiler switch for that too: -fassociative-math
           | -funsafe-math-optimizations
        
         | ascar wrote:
         | > Tldr: autovectorizer transforms the first code into SIMD
         | instructions, but the second into 64 bit instructions.
         | 
         | That's not the tldr. It's actually more wrong than right.
         | 
          | The actual TLDR is: loop dependencies prohibit the compiler
          | _and_ your CPU from parallelizing. The SIMD part is the
          | compiler; the more important part here, though, is the CPU.
         | 
         | Long explanation:
         | 
          | Let's start with the compiler. Yes, the first version can be
          | vectorized, and you can see that in the included screenshots of
          | the assembly output: the use of addpd [1] (add 2 pairs of
          | 64-bit doubles simultaneously) instead of addsd [2] (add 1 pair
          | of 64-bit doubles). Both operations have identical
          | throughput/latency on any architecture that's not too ancient
          | (e.g. check the information for the corresponding intrinsics
          | here [3]; unfortunately Agner's instruction tables don't
          | include ADDSD for most architectures). So while you can crunch
          | 2 operations at the same time using SIMD, you are still doing 3
          | multiplications and 2 additions instead of 2 additions.
          | Multiplication and addition have the same throughput, at least
          | on Skylake. So we can use simple math, and 5/2 is still more
          | than 2!
         | 
          | Now to the CPU part. The loop dependency prevents the CPU from
          | properly utilizing instruction pipelining [4], and now the
          | difference between throughput and latency comes into play. In a
          | pipelined scenario the CPU can work on multiple steps of the
          | instruction cycle in parallel (from fetching the instruction,
          | to decoding it, to executing the operation and eventually
          | writing the result back to the register) and thus achieve
          | higher throughput. However, if the next instruction depends on
          | the one before it, the CPU has to finish all the steps of the
          | pipeline, from start to finish, before the next instruction can
          | begin. This is the latency of an instruction. For example,
          | mulsd/mulpd have 8 times higher latency than (reciprocal)
          | throughput on Intel Skylake.
         | 
         | So while SIMD comes in at a factor of 2, the loop dependency
         | stops the CPU from realizing a potential factor of 8.
         | 
         | Peter Cordes also correctly points this out in his comments and
         | answer to the question on stackoverflow.
         | 
         | [1] https://www.felixcloutier.com/x86/addpd
         | 
         | [2] https://www.felixcloutier.com/x86/addsd
         | 
         | [3] https://www.intel.com/content/www/us/en/docs/intrinsics-
         | guid...
         | 
         | [4] https://en.wikipedia.org/wiki/Instruction_pipelining
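          | 
          | To make the latency-vs-throughput point concrete, here is a
          | toy pair of summation loops (the cycle counts are the Skylake
          | numbers from above; note the second loop reorders the
          | floating-point additions, which is exactly the transformation
          | the compiler may not make on its own):
          | 
          |     // One dependency chain: each add waits ~4 cycles
          |     // (latency) for the previous one.
          |     double sum1(const double *a, long n) {
          |         double s = 0.0;
          |         for (long i = 0; i < n; i++)
          |             s += a[i];
          |         return s;
          |     }
          | 
          |     // Four independent chains: adds overlap in the pipeline,
          |     // approaching the ~0.5 cycles/add throughput limit.
          |     double sum4(const double *a, long n) {
          |         double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
          |         long i = 0;
          |         for (; i + 3 < n; i += 4) {
          |             s0 += a[i];
          |             s1 += a[i + 1];
          |             s2 += a[i + 2];
          |             s3 += a[i + 3];
          |         }
          |         for (; i < n; i++) s0 += a[i];  // leftovers
          |         return (s0 + s1) + (s2 + s3);   // merge once at the end
          |     }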
        
         | rasz wrote:
          | Both versions use SSE and are pipelined. The problem with the
          | second one is the data dependency: only two adds, but the
          | second directly depends on the first one's result = stall.
        
           | dragontamer wrote:
            | SSE includes "scalar" adds (addsd), which operate on a single
            | floating point value. These are "non-SIMD" instructions,
            | serving as a replacement for the legacy x87 instructions.
            | 
            | There are also "packed" adds (addpd).
            | 
            | Look carefully at the assembly: the 1st version uses packed
            | adds (addpd) and packed multiplies, while the 2nd version
            | uses scalar adds (addsd).
            | 
            | The other major point is that the 2nd version uses a single
            | qword (64-bit) move per loop iteration, while the 1st version
            | uses a full 128-bit move per loop iteration.
           | 
           | ---------
           | 
            | SSE is used for scalar double-precision these days because
            | scalar SSE is faster than the x87 instructions... and better
            | matches the standard (x87 had "higher precision" than the
            | IEEE spec, so it produced different results than other
            | computers; SSE is closer to the spec).
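            | 
            | In intrinsics form (illustrative only, not taken from the
            | question's assembly):
            | 
            |     #include <immintrin.h>
            | 
            |     __m128d scalar_add(__m128d a, __m128d b) {
            |         return _mm_add_sd(a, b);  // addsd: lane 0 only
            |     }
            | 
            |     __m128d packed_add(__m128d a, __m128d b) {
            |         return _mm_add_pd(a, b);  // addpd: both lanes
            |     }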
        
             | [deleted]
        
         | loeg wrote:
         | The other essential insight is that the second version has data
         | dependencies between loop iterations. Without that, the tldr is
         | incomplete.
        
           | dragontamer wrote:
            | That is the 'why' of the autovectorization failure. But it
            | seems pretty easily solvable in this case, actually.
           | 
           | EDIT: Solvable by a human. I dunno much about compilers
           | though, so I dunno if there's some kind of language issue
           | that would prevent the unrolling of this dependency.
        
             | stormbrew wrote:
             | I think the point GP is making is that even without
             | vectorization, the data dependency causes stalls even in
             | normal, single data instructions. That is, a data
             | dependency between iterations of loops will hurt
             | performance even for non-vectorizable calculations (or on
             | CPUs with high ILP but no really good vector instructions,
             | which granted is probably a small pool since both those
             | things came about at about the same time).
        
               | touisteur wrote:
               | I think that is the point Peter Cordes is trying to make
               | (quite politely but firmly) over there on stackoverflow.
                | It's not only about autovectorization. The main point
                | there is the loop-carried dependency, which prevents both
                | the compiler (autovec) _and_ the processor (ILP) from
                | doing their thing.
               | 
                | Loop-carried dependency is the big culprit here. I wish
                | we had a culture of writing for loops with the index a
                | constant inside the loop, as in the Ada for statement,
                | and not the clever C while loops or for loops with two
                | running variables... Simpler loop syntax makes so many
                | static analyses 'easier' and kind of forces the brain to
                | think in bounded, independent steps, or to reach for
                | higher-level constructs (e.g. reduce).
        
             | hansvm wrote:
              | It's a different issue altogether. Even in vectorized code,
              | artificial data dependencies are an issue. E.g., for fast
              | dot products (usually not actually applicable on most input
              | sizes because of memory bandwidth, but related problems can
              | be CPU-bound) you'll roughly double your execution speed by
              | alternating between two running totals and merging them at
              | the end, and that gain is mostly orthogonal to whether
              | those running totals are vectorized. (Sketch below.)
             | 
             | Edit: And many modern compilers have been doing that
             | particular optimization for a few years anyway, but it's
             | still an important idea to keep in mind for any non-trivial
             | graph of operations.
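              | 
              | A sketch of that trick on a plain (unvectorized) dot
              | product; the same splitting works lane-wise in SIMD code:
              | 
              |     double dot(const double *a, const double *b, long n) {
              |         double s0 = 0.0, s1 = 0.0;  // independent totals
              |         long i = 0;
              |         for (; i + 1 < n; i += 2) {
              |             s0 += a[i] * b[i];          // chain 0
              |             s1 += a[i + 1] * b[i + 1];  // chain 1
              |         }
              |         if (i < n) s0 += a[i] * b[i];   // odd leftover
              |         return s0 + s1;  // merge once at the end
              |     }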
        
         | zeusk wrote:
          | How would the second approach be vectorized, given that each
          | iteration's input depends on the previous iteration's output?
        
           | dragontamer wrote:
            | Unroll the dependency until it is longer than the SIMD width.
            | 
            | Ex: as long as i, i+1, i+2, ... i+7 are not dependent on each
            | other, you can vectorize to SIMD-width 8.
            | 
            | Or in other words: i+7 can depend on i-1, no problem.
        
             | yongjik wrote:
             | If they were integer variables, I guess the compiler would
             | have done that, but you can't really do that with floats
             | because i+A+A is not necessarily i+2*A. (Of course, in this
             | particular example, the difference doesn't matter for the
             | programmer, but the compiler doesn't know that!)
             | 
             | I think there's some gcc option that enables these
             | "dangerous" optimizations. -ffast-math, or something like
             | that?
        
       | oezi wrote:
        | Funny that this question gained 180 upvotes in 10 days, when it
        | could just as well have received the reverse for showing little
        | effort by the author to figure out the (rather obvious) data
        | dependency on his own.
        
         | LAC-Tech wrote:
         | I'll put my hand up and say none of the post or answers were
         | obvious to me. I found it all very interesting.
        
       | diarrhea wrote:
        | Is this relevant to interpreted languages as well? I'm thinking
        | perhaps Python's bytecode could have similar optimisations.
        
       | sampo wrote:
       | In the post, multiplications and 2 additions are not faster than
       | 2 additions.
       | 
       | The post compares (1) loop code that can be vectorized, as loop
       | rounds are independent and do not depend on the result from the
       | previous round, and (2) an "optimization" that makes calculations
       | shorter, but also makes each loop round depend on the result of
       | the previous round, so this cannot be vectorized.
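        | 
        | For reference, the two versions look roughly like this
        | (reconstructed from the question's description, not copied from
        | it):
        | 
        |     // (1) Independent iterations: vectorizable and pipelinable.
        |     void version1(double *data, int len,
        |                   double A, double B, double C) {
        |         for (int i = 0; i < len; i++)
        |             data[i] = A * i * i + B * i + C;
        |     }
        | 
        |     // (2) Strength-reduced: fewer operations, but Y and Z carry
        |     // a dependency from each iteration into the next.
        |     void version2(double *data, int len,
        |                   double A, double B, double C) {
        |         double Y = C, Z = A + B;  // value at i=0, first difference
        |         for (int i = 0; i < len; i++) {
        |             data[i] = Y;
        |             Y += Z;      // needs last iteration's Y and Z
        |             Z += A + A;  // second difference of a quadratic: 2A
        |         }
        |     }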
        
         | dang wrote:
         | Ok, we've reverted the title to what the article says.
         | 
         | (Submitted title was "Multiplications and 2 additions are
         | faster than 2 additions")
        
         | magicalhippo wrote:
         | > In the post, multiplications and 2 additions are not faster
         | than 2 additions.
         | 
          | Reminded me of when I tried to optimize some code using
          | assembly on x86, maybe 8-10 years ago. The compiler spit out a
          | DIV [EDI] instruction inside a hot loop. I saw how I could
          | easily rewrite it to use a register instead of the indirect
          | memory access.
         | 
         | So I did, and it was significantly slower, like 10-15%... I
         | even took care to align the loop target address and so on. If I
         | changed my code to use the memory indirection it was measurably
         | faster.
         | 
         | And with that I decided hand-optimizing x86 assembly was
         | definitely a skill to leave behind.
        
           | userbinator wrote:
            | You probably added an extra instruction or two to put the
            | value in a register? The CPU can already split memory
            | accesses into uops, and cached accesses are fast, so there's
            | no point in doing that: it'll just waste an additional
            | architectural register (vs. one of the many the renamer
            | provides) and add instructions to decode.
           | 
           | x86 is fundamentally a CISC; if you treat it like a RISC, it
           | will definitely disappoint.
        
             | ncmncm wrote:
             | Cached accesses are not fast anymore. Modern chips can
             | retire a dozen or even more instructions in the time it
             | takes to move a word in from L1 cache.
        
             | magicalhippo wrote:
              | Not really: the compiler did the round-trip through the
              | local variable (memory), I did it via a register.
              | 
              | I asked around at the time, and someone mentioned that I
              | might have overtaxed certain execution ports or something
              | like that, but yeah, that just cemented my belief that x86
              | optimization is not my cup of tea anymore. Better to spend
              | the time learning how to write code the compiler can
              | optimize well.
        
               | astrange wrote:
               | The compiler doesn't know anything about optimizing x86
               | code either. The actual details there are too secret for
               | Intel to want to accurately describe them in gcc, they're
               | different across different CPUs, and compilers just
               | aren't as good as you think they are.
               | 
               | (But CPUs are usually better than you think they are.)
        
               | CamperBob2 wrote:
               | What made me give up was when I found that an
               | optimization for one x86 processor was an impairment on
               | another. Your DIV example might have had a different
               | outcome on an AMD chip, or on the next- or previous-
               | generation part from Intel.
               | 
               | Nowadays, counting cycles is a game for monks who don't
               | actually have to get anything done.
        
         | sillysaurusx wrote:
         | More generally, "stop storing state, unless it makes the
         | program less complicated."
         | 
         | The first version is simple. The second version is more
         | complicated. The simplicity is because the first version can be
         | represented as a formula; imagine trying to write out the
         | second version as a formula, and the complexity becomes
         | obvious.
         | 
         | (The addition loop would have to be written as a recurrence
         | relation, which of course means every step depends on the
         | previous step.)
         | 
          | Complexity isn't always obvious. It's common in C programs to
          | write a loop like dst[index++] = src[i], particularly in deeply
          | nested for-loops. In my experience it's almost always worth
          | rewriting it so that the indices are computable entirely from
          | the loop iteration variables, with no state. It helps you build
          | a mental map of the operations, because you can think of them
          | in terms of geometry (memcpy = copying rectangles onto other,
          | larger rectangles), whereas when the index is stateful it
          | becomes quite a lot harder to visualize in higher dimensions.
          | At least for me.
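          | 
          | A contrived sketch of the two styles (the rectangle copy is a
          | made-up example):
          | 
          |     // Stateful: idx threads hidden state through both loops.
          |     void copy_stateful(double *dst, const double *src,
          |                        int rows, int cols, int src_stride) {
          |         int idx = 0;
          |         for (int r = 0; r < rows; r++)
          |             for (int c = 0; c < cols; c++)
          |                 dst[idx++] = src[r * src_stride + c];
          |     }
          | 
          |     // Stateless: the index is a pure function of (r, c), so
          |     // each iteration stands on its own -- the "geometry" view.
          |     void copy_pure(double *dst, const double *src,
          |                    int rows, int cols, int src_stride) {
          |         for (int r = 0; r < rows; r++)
          |             for (int c = 0; c < cols; c++)
          |                 dst[r * cols + c] = src[r * src_stride + c];
          |     }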
         | 
         | We've been building a compiler at Groq for our custom ML
         | hardware. (Roughly, "turn pytorch into our ISA with no extra
         | programmer effort.") I used to think of posts like this as
         | "Well, modern compilers are so fast, who really cares about
         | autovectorization?" -- it turns out _you_ care when you need to
         | write your own compiler. :)
         | 
         | MLIR is pretty cool. It makes a lot of transformations like
         | this pretty easy. The MLIR "krnl" dialect can also
         | automatically transform nested loops into tiled iteration.
         | Graphics devs will know what I mean -- no need to loop over 8x8
         | blocks of pixels manually, just write "for x in range(width):
         | for y in range(height): ..." and set the block size to 8.
        
         | mike_hock wrote:
         | The first comment on the code _does_ say precisely this.
        
         | Arwill wrote:
         | SIMD vectorization is like quantum mechanics, it becomes
         | classical if you look.
        
         | dreamcompiler wrote:
         | The other issue is instruction-level parallelism, as another
         | poster in TFA pointed out. Even within a single loop iteration
         | the "unoptimized" code is more likely to exploit multiple ALUs
         | if they exist, regardless of vectorization instructions.
        
       | savant_penguin wrote:
        | What would be the difference in power consumption between the two
        | methods? (Would it always be better to multiply? If so, why not
        | multiply by one?)
        
         | dento wrote:
          | Usually the faster version consumes less power, as it allows
          | the core to spend more time asleep. This is known as the race-
          | to-sleep or race-to-idle strategy.
        
         | chaboud wrote:
          | The general rule to follow for power consumption on CPUs is to
          | do your work quickly and then get to sleep. Propagating the
          | clock is going to eat the bulk of your power. The mild
          | difference between multiply and add in actual usage is inside
          | the noise (orders of magnitude smaller). The bigger penalty in
          | this case is the inter-iteration dependency, which, vectorized
          | or not, runs the risk of holding up the whole show due to
          | pipelining.
         | 
         | As a performance rule on modern processors: avoid using the
         | result of a calculation as long as you reasonably can (in tight
         | loops... You don't want to be out of cache.).
         | 
         | Have fun threading the needle!
        
         | varajelle wrote:
          | The problem here is that the additions depend on values
          | computed in the previous iteration of the loop. The version
          | with multiplication is faster because there are no dependencies
          | on the previous iteration, so the CPU has more freedom in
          | scheduling the operations.
         | 
         | The power consumption is a good question.
        
           | MikeHolman wrote:
           | Scheduling plays a part, but it is definitely more about
           | vectorization.
        
       | dgb23 wrote:
       | If we ignore the details, then this is a perfect example of how
       | simpler code should be preferred _by default_ over complex code.
       | Both for performance and reasoning. In this case and often
       | elsewhere, these are related things.
       | 
       | I'm taking the definition of simple and complex (or complected)
       | from this talk[0]. In short: Simple means individual pieces are
       | standing on their own, not necessarily easy or minimal. Complex
       | means individual pieces are braided together into a whole.
       | 
       | The first code is simpler than the second. Just have a look at
       | the two solutions and you see what I mean: The individual
       | instructions in the first stand on their own and clearly declare
        | what they mean as well. The second example is more complex,
        | because each line/subexpression cannot be understood in
        | isolation; there is an implicit coupling that requires the
        | programmer _and_ the machine to understand the whole thing for it
        | to make sense. The CPU apparently fails at that and cannot
        | execute the machine instructions in a more efficient (parallel)
        | order.
       | 
       | The example illustrates performance benefits of simpler code, but
       | there are others too:
       | 
       | For example a type checker might be able to infer things from
       | simpler code, but not from complex code. Again, complexity is
       | about coupling, which can for example arise from making
       | assumptions about things that are out of bounds or logic that
       | happens in some other part of your program, which _this_ part
       | relies on.
       | 
       | Things like these can either be outright rejected by a type
       | checker or simply ignored and sometimes we can provide additional
       | ceremony to make it happy. But there is always an underlying
       | question when these issues arise: Can I express this in a simpler
       | way? It's sometimes possible to describe rich semantics with
       | simpler, more primitive types and explicit coordination.
       | 
       | Another thing that comes to mind are rich ORMs. They can be very
       | convenient for common cases. But they have footguns for the
       | general case, because they _complect_ so many things, such as
       | validation, storage, domain rules, caching etc. And they are
       | leaky abstractions because we have to do certain things in
       | certain ways to avoid bloated SQL, N+1 etc.
       | 
       | They are not simple (and certainly not declarative) so we program
       | against a black box. There are simpler "ORMs" that I very much
       | like, but they typically only provide a minimal set of orthogonal
       | features, such as query builders, and mapping results into plain
       | data structures. It is simpler to use these kind of orthogonal
       | tools.
       | 
       | Last but not least: Simple code can be pulled apart and changed
       | more easily without having to worry about introducing problems.
       | The substitution rule or referential transparency enables this by
       | guaranteeing that each expression can be replaced by the value it
       | will evaluate to. This also implies that other expressions that
       | evaluate to the same value can be substituted freely.
       | 
       | [0] Simple Made Easy - Rich Hickey at Strangeloop 2011
       | https://www.youtube.com/watch?v=LKtk3HCgTa8
        
         | djmips wrote:
          | You can't ignore the details if you want the fastest
          | performance. Unfortunately, sometimes the solution that's the
          | fastest is also more complex. Simplicity tends to win, and is
          | also less likely to be buggy, easier to maintain, etc., but
          | sometimes the fastest code deliberately introduces coupling,
          | hardware-specific cache shenanigans, or unreadable hand-written
          | asm and/or inlined SIMD. This is often counter to your
          | hypothesis.
        
           | kortex wrote:
           | > sometimes the fastest code deliberately introduces
           | coupling, hardware specific cache shenanigans or unreadable
           | hand written asm and/or inlined SIMD.
           | 
           | >> this is a perfect example of how simpler code should be
           | preferred _by default_ over complex code.
           | 
           | Extra emphasis on *by default*.
           | 
           | You are both on the same side of the coin. This is the
           | essence of "premature optimization is the root of all evil".
           | Simplicity should be the default. Once you know your
           | implementation is solid, then you can go back and optimize.
           | 
            | TFA is quite contrived, but imagine a much more complex
            | process. You see some code:
            | 
            |     for i in loop:
            |         foo[i] = myfunc(foo[i], bar[i])
            | 
            | You have to go in and dig into why exactly foo is being fed
            | back in at every step. But if instead you saw:
            | 
            |     for i in loop:
            |         foo[i] = myfunc(bar[i])
           | 
            | You _immediately_ know you have more options at your disposal
            | to optimize that loop. Further, the _compiler_ immediately
            | knows it has more options. Maybe this is a good candidate for
            | automatic unrolling or Duff's device. Maybe you want to send
            | this to a GPU shader, or use a macro to parallelize the loop.
            | But it's a lot harder to get to that point if your starting
            | point is the complected, stateful one.
        
       | ww520 wrote:
        | Besides the autovectorization, the second version also has two
        | additional assignments. Depending on where that storage lives -
        | registers, L1/L2, or the stack - there might be a performance
        | hit.
        
       | layer8 wrote:
       | TLDR: Due to loop-carried dependencies preventing parallelized
       | execution:
       | https://en.wikipedia.org/wiki/Loop_dependence_analysis#Loop-...
       | 
       | In the 2 additions version, computation of the next iteration
       | depends on the results of the preceding iteration. In the
       | multiplication version, the computations are independent for each
       | iteration, enabling parallel execution (by SIMD and/or
       | pipelined/superscalar execution).
        
       | mlatu wrote:
        | You could try prefilling the first cell with -(A+B), then each
        | cell gets A+B+C, plus an inner loop for (j=0; j<i; j++) that adds
        | (A+A).
        | 
        | But I'm too lazy to test that.
        
       | [deleted]
        
       | someweirdperson wrote:
        | I doubt repeatedly adding floating point numbers is a good idea.
        | With every addition the sum moves further away from the addend,
        | and as their relative difference grows, so do the problems.
        | 
        | Just because the algebra works doesn't guarantee that the code
        | does, too.
        
         | LAC-Tech wrote:
         | The algebra works for reals, binary floating point is a
         | different beast!
        
         | ascar wrote:
          | Your points are factually correct, but in practice not a big
          | concern if your floating-point type is much more precise than
          | necessary for the numbers used.
          | 
          | E.g. if you use doubles for the range of 32-bit integers, even
          | adding 1.0 two billion times to 2 billion still ends up at 4
          | billion. Even adding 1.0 twenty billion times to 20 billion
          | ends up at 40 billion. Now, adding 0.1 twenty billion times to
          | 2 billion ends up 1908 short on my CPU, i.e. about 19080
          | absorptions/rounding errors occurred. You need some serious
          | differences in magnitude and numbers of operations to actually
          | trigger errors.
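          | 
          | The absorption effect is easy to reproduce directly (IEEE 754
          | doubles assumed):
          | 
          |     #include <cstdio>
          | 
          |     int main() {
          |         double big = 1e16;  // double spacing here is 2.0
          |         std::printf("%d\n", big + 1.0 == big);  // 1: absorbed
          |         std::printf("%d\n", big + 2.0 == big);  // 0: registers
          |         std::printf("%d\n", 2e9 + 0.1 == 2e9);  // 0: headroom
          |     }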
        
         | sray wrote:
         | The author of the post is aware of this. They explicitly say
         | that is not the point of the question.
        
       | oconnor663 wrote:
        | The part about data dependencies across loop iterations is
        | fascinating to me, because it's mostly invisible even when you
        | look at the generated assembly. There's a related optimization
       | that comes up in implementations of ChaCha/BLAKE, where we
       | permute columns around in a kind of weird order, because it
       | breaks a data dependency for an operation that's about to happen:
       | https://github.com/sneves/blake2-avx2/pull/4#issuecomment-50...
        
       | benreesman wrote:
       | Oh man I have been smoked by data-dependency like this so many
        | times. I've gotten a lot better over the years at "seeing" data
        | dependencies in godbolt or whatever, but they still slip through
        | into my code and that of my colleagues way more often than I'd
        | like.
       | 
       | Is anyone aware of good tooling for automatically catching this
       | sort of thing even some of the time?
        
         | slaymaker1907 wrote:
          | I'm not sure if there are existing rules for it, but you could
          | write a CodeQL query looking for data dependencies in loops.
          | Obviously dependencies are sometimes required, but it could at
          | least tell you they were there.
        
       | mgaunard wrote:
       | It's fairly obvious: the rewrite prevents parallelization because
       | floating-point isn't associative.
       | 
       | You'd need to parallelize it explicitly (which can be done by
       | just unrolling the loop).
        
       ___________________________________________________________________
       (page generated 2022-05-29 23:00 UTC)