[HN Gopher] Do not taunt happy fun branch predictor (2023)
       ___________________________________________________________________
        
       Do not taunt happy fun branch predictor (2023)
        
       Author : fanf2
       Score  : 201 points
       Date   : 2024-07-03 14:42 UTC (8 hours ago)
        
 (HTM) web link (www.mattkeeter.com)
 (TXT) w3m dump (www.mattkeeter.com)
        
       | CrazyStat wrote:
       | (2023)
       | 
       | Discussion at the time:
       | https://news.ycombinator.com/item?id=34520498
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _Do not taunt happy fun branch predictor_ -
         | https://news.ycombinator.com/item?id=34520498 - Jan 2023 (171
         | comments)
         | 
         | (Reposts are fine after a year or so; links to past threads are
         | just to satisfy extra-curious readers)
        
       | jcranmer wrote:
       | > After checking for loop termination, we fall through into the
       | foo function (without a branch!)
       | 
       | Literally just reading this line made me go "well, there's your
       | problem." I thought this was going to be something deep about
       | fancy branch predictor heuristics, but it turns out to be a
       | violation of basic heuristics.
       | 
       | Folks, don't think you're going to get amazing speedups by using
       | mismatched call/ret instructions. The branch predictor keeping a
       | shadow stack of return addresses is something that's been around
       | for decades.
        
         | vlovich123 wrote:
         | And yet they also showed how they did actually accomplish this
         | and other optimizations. It's an informative read.
        
         | quotemstr wrote:
          | And this will screw up program execution more profoundly
          | (i.e. crash) on systems with architectural shadow call
          | stacks as a security feature.
        
           | astrobe_ wrote:
            | Speaking of "security feature", this insanely complex
            | behavior (for a CPU) is a recipe for disasters like
            | Spectre and Meltdown.
        
         | adwn wrote:
         | It's great that you're so knowledgeable about branch predictor
         | behavior, but many people aren't and for them this is new,
         | maybe even valuable, information.
         | 
         | I guess this article's just not for you then, and that's okay.
        
           | akira2501 wrote:
            | I don't think the complaint is that the article exists,
            | it's that it perhaps needs more context. The article
            | doesn't make clear what level of skill it assumes of
            | the reader, or what the author's own level is, so
            | comments like these help put that in perspective.
            | 
            | The commentary from "old salts" isn't meant to insult
            | or discourage you but to inform your world view with
            | hard-earned experience.
            | 
            | It all has a value that's greater than the sum of its
            | parts, and there's no good reason to be critical or
            | reactionary about it.
        
         | chasil wrote:
         | On the one hand, a design goal of RISC is to improve the
         | performance of compiled code at the expense of most other
         | things. As such, this sort of hazard should be documented, but
         | the designers should be able to assume that anyone directly
         | writing assembler has read the documentation.
         | 
         | On the other hand, Sophie Wilson wrote an implementation of BBC
         | BASIC for the original ARM (but it didn't have a branch
         | predictor). While that is 32-bit and plays by different rules,
         | I wonder how AArch64 slows down code when the architectural
         | assumptions change.
        
       | orbital-decay wrote:
       | Tangential question, perhaps a dumb one: why do most hardware
       | schedulers try to predict the load on their own, instead of
       | letting users explicitly signal their intent? It's everywhere,
       | from CPUs to storage.
        
         | 082349872349872 wrote:
         | In this case, I'd consider BL (call) a fairly explicit signal
         | of intent to RET?
        
           | dzaima wrote:
           | Yep. For returns, the more important thing in the article,
           | the "ret"[0] instruction behaves exactly identically to "br
           | x30"[1], just with the hint that it's expected to match a
           | "bl".
           | 
           | On x86 things are less pretty as call/ret push/pop the ip
           | from the stack, with no non-matching-hinted versions, but
           | having such would also just not be particularly useful as
           | unpaired call/ret would result in the stack pointer
           | continually drifting without manual fixup (at which point jmp
           | is just clearly better).
           | 
           | [0]: https://developer.arm.com/documentation/dui0802/a/A64-Ge
           | nera...
           | 
           | [1]: https://developer.arm.com/documentation/dui0802/a/A64-Ge
           | nera...
        
             | LegionMammal978 wrote:
             | On 32-bit x86, there is the trap of trying to use call/pop
             | to get the absolute instruction pointer. It will work
             | correctly, but it will mess up any call stack prediction
             | and cause great performance penalties. Hence why compiler-
             | generated EIP shims use call/mov/ret instead. (But this is
             | not such a big issue in x86-64 with its RIP-relative
             | addressing.)
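              | 
              | A sketch of the two patterns in GCC inline asm (my
              | illustration, 32-bit x86 only, not from the thread):
              
              /* Cursed: CALL with no matching RET. Gets EIP, but
                 unbalances the return-address stack predictor. */
              static void *pc_via_call_pop(void)
              {
                  void *pc;
                  __asm__ volatile("call 1f\n1:\tpop %0" : "=r"(pc));
                  return pc;
              }
              
              /* Shim-style: a balanced CALL/RET around a tiny thunk
                 that reads the return address off the stack. */
              static void *pc_via_thunk(void)
              {
                  void *pc;
                  __asm__ volatile("call 2f\n\t"
                                   "jmp 3f\n"
                                   "2:\tmov (%%esp), %0\n\t"
                                   "ret\n"
                                   "3:"
                                   : "=r"(pc));
                  return pc;
              }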
        
         | hansvm wrote:
         | 1. "Explicitly signaling your intent" takes space and time.
         | Decoding isn't free, and even if there were no other downsides
         | you'd want to be careful with that change.
         | 
         | 2. It's a bit historical. Early CPUs didn't have a RAM/CPU
         | discrepancy and didn't need pipelining. Code wasn't written to
         | account for pipelining. As CPUs got a little faster relative to
         | RAM, you added a few prediction mechanisms so that most
         | consumers and workloads could actually use your brand new 2x
         | faster gigaterrawatthournewton CPUs without having to rewrite
         | all their code. Iterate 10-20 generations, where most software
         | has never been written to care about branch prediction and
         | where modern languages don't even expose the concept, and you
         | have the current status quo.
        
         | rcxdude wrote:
          | It's supported by some CPUs: the Linux kernel has
          | likely/unlikely macros to indicate whether a branch is
          | expected to be taken in the common case (these use GCC
          | attributes, which allow the hint to be propagated through
          | to codegen). It's just that it's generally less
          | effective: not all branches get annotated, some that are
          | annotated are annotated incorrectly, and in a lot of
          | cases the branching behaviour is data-dependent enough
          | that you can't make an optimal static prediction at all.
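          | 
          | For reference, a sketch of the kernel-style macros (my
          | illustration, GCC/Clang):
          
          /* Hint macros built on GCC's __builtin_expect; !!(x)
             normalizes any truthy value to exactly 1. */
          #define likely(x)   __builtin_expect(!!(x), 1)
          #define unlikely(x) __builtin_expect(!!(x), 0)
          
          int process(const int *p)
          {
              if (unlikely(p == 0))   /* expected-cold error path */
                  return -1;
              return *p + 1;          /* hot path laid out to fall
                                         through */
          }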
        
         | pjc50 wrote:
         | The user doesn't know the capabilities of the system. In
         | particular, anything compiled in is set in stone: you don't
         | know the capabilities of systems yet unbuilt. New systems don't
         | sell well unless they make existing code run faster, and sell
         | badly if they require you to rebuild everything and make it
         | non-backwards-compatible.
         | 
          | The peak example of "let the user do the scheduling" was
          | Itanium, whose VLIW design scheduled three instructions
          | at once. It was a huge commercial failure, and allowed
          | AMD to instead define the instruction set people wanted:
          | x64 is IA-32 but wider and with more registers, not
          | substantially different and certainly not as weird as
          | Itanium.
        
           | magicalhippo wrote:
            | I've lost count of the times I've sped up some existing
            | code by removing the hand-optimized assembly and just
            | using a plain C implementation or similar. Sure, it was
            | hand-optimized assembly, but for a 10+ year old CPU.
           | 
           | If you're going to do hand-optimized code for a given
           | platform, include the baseline code and measure at runtime to
           | pick the implementation.
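            | 
            | A sketch of one way to do that pick - here keyed off
            | CPU features rather than timing, with hypothetical
            | names (__builtin_cpu_supports is GCC/Clang on x86):
            
            #include <stddef.h>
            
            /* Portable baseline implementation. */
            static float sum_scalar(const float *x, size_t n)
            {
                float s = 0.0f;
                for (size_t i = 0; i < n; i++)
                    s += x[i];
                return s;
            }
            
            /* Stand-in for the hand-tuned kernel; a real one would
               use intrinsics or asm. */
            static float sum_avx2(const float *x, size_t n)
            {
                return sum_scalar(x, n);
            }
            
            static float (*sum_impl)(const float *, size_t) = sum_scalar;
            
            static void init_dispatch(void)
            {
                if (__builtin_cpu_supports("avx2"))
                    sum_impl = sum_avx2;
            }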
        
             | patmorgan23 wrote:
              | Hand-optimized assembly was necessary 30 years ago,
              | and still good for several optimizations 20 years
              | ago. But today's computers are SO FAST, it's just not
              | necessary in most situations.
        
         | gpderetta wrote:
         | Often, but not always, optimal scheduling is data dependent so
         | it can't be statically computed at compile time.
        
         | mhh__ wrote:
          | They want to be able to guess long, long before actually
          | executing the instructions.
        
         | hindsightbias wrote:
         | We could solve a lot of problems if every user had to take
         | EE180.
        
         | crote wrote:
         | You should look into the history of Itanium, it was designed
         | around the idea that the compiler would do pretty much exactly
         | that. It looked great on paper, but in practice nobody figured
         | out how to actually _write_ a compiler capable of doing it
         | without constantly running into weird edge cases.
         | 
         | X86 does have "prefetch" instructions, which tell the CPU that
         | you want to use some data in the near future. There are also
         | "branch hint" instructions which tell the CPU if a branch is
         | usually taken or not. The problem is that they tend to make
         | your code _slower_ : the CPU is already more than capable of
         | predicting it by itself, and the extra instructions slow down
         | the overall execution because they take up cache space and have
         | to be decoded too.
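          | 
          | For what it's worth, the data-prefetch hint survives as a
          | compiler builtin; a sketch (mine, GCC/Clang), though per
          | the above the hardware prefetcher usually wins on a
          | linear scan like this:
          
          /* __builtin_prefetch(addr, rw, locality): hint that addr
             will be read (rw=0) soon with high temporal locality
             (3). Compiles to a PREFETCH* instruction on x86, or to
             nothing on targets without one. */
          float sum_with_prefetch(const float *data, int n)
          {
              float acc = 0.0f;
              for (int i = 0; i < n; i++) {
                  __builtin_prefetch(&data[i + 16], 0, 3);
                  acc += data[i];
              }
              return acc;
          }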
        
           | CalChris wrote:
           | VLIW is pretty successful for loopy DSP and now AI/ML.
           | However, Itanium was trying to work for general purpose code
           | and then as you say, it constantly ran into weird edge cases.
            | It _seemed_ as if VLIW succumbed to the Peter
            | Principle: it had risen to its level of incompetence.
            | But as long as you use it appropriately, VLIW is a best
            | practice and LLVM supports it.
           | 
           | BTW, CS252 at Berkeley and Onur Mutlu's ETH lectures give a
           | conventionally disparaging view of VLIW without pointing out
           | its successes.
        
             | bbatha wrote:
              | Adding on, VLIW is only successful in AI/ML because
              | GPUs are incapable of doing branching in a good way,
              | let alone prediction. I would guess the same story
              | applies to DSPs. If someone figures out how to stick
              | a branch predictor in those pipelines, I'm guessing
              | the VLIW nature of those platforms will disappear
              | overnight.
        
               | CalChris wrote:
                | The defining character of VLIW is to have the
                | brilliant compiler software schedule dumb parallel
                | hardware instructions _statically_ and then not
                | depend on power/transistor-expensive dynamic branch
                | prediction and OOO execution.
                | 
                | In a perfect VLIW world that would mean you don't
                | spend any transistors or power on branch prediction
                | or out-of-order instruction searches. Indeed, the
                | original VLIW paper [1] spends the vast majority of
                | its paragraphs on solving the (hard) compiler
                | instruction-scheduling problem with trace
                | scheduling, which is still used. The VLIW hardware
                | itself is dead simple.
               | 
               | [1] https://safari.ethz.ch/architecture/fall2022/lib/exe/
               | fetch.p...
               | 
               | So if VLIW fits the problem it has fantastic performance
               | characteristics. If it doesn't fit, and far and away most
               | don't, then VLIW is terrible. VLIW is very brittle.
               | 
               | I need to make a caveat about the Mill CPU which is a
               | VLIW but I see I've written too much already.
        
         | ww520 wrote:
          | GCC has the __builtin_expect macro to give a hint for
          | branch prediction. C++20 added something similar (the
          | [[likely]] and [[unlikely]] attributes).
        
           | mhh__ wrote:
            | These are usually for the compiler rather than the
            | CPU, e.g. I don't want my error-handling code inlined!
            | (But the compiler can't work this out without a hint.)
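            | 
            | E.g., a sketch with GCC/Clang attributes (my example):
            
            #include <stdio.h>
            
            /* cold: rarely executed, so the compiler moves it out
               of the hot text and treats branches to it as
               unlikely. */
            __attribute__((cold, noinline))
            static void report_error(const char *msg)
            {
                fprintf(stderr, "error: %s\n", msg);
            }
            
            static int first_byte(const char *s)
            {
                if (!s) {
                    report_error("null input"); /* off the hot path */
                    return -1;
                }
                return s[0];
            }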
        
       | bee_rider wrote:
       | > The SIMD code does come with one asterisk, though: because
       | floating-point addition is not associative, and it performs the
       | summation in a different order, it may not get the same result as
       | straight-line code. In retrospect, this is likely why the
       | compiler doesn't generate SIMD instructions to compute the sum!
       | 
       | > Does this matter for your use case? Only you can know!
       | 
       | Anything is possible of course, and the typical way of writing a
       | loop to sum up an array is literally telling the computer to go
       | element by element and accumulate the values.
       | 
       | But it seems pretty unlikely that doing, say, four accumulations
       | in parallel with simd and then summing them up at the end is any
       | more wrong than going element by element.
       | 
        | Summing floats should by default be taken to have error
        | bounds, and any answer in those bounds is valid. If you
        | know something special about the floats you are inputting,
        | the language should have some means to explicitly encode
        | that. It shouldn't be the most basic loop; the basic loop
        | is the default case, so it should give the best
        | performance.
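        | 
        | Concretely, the reassociated version amounts to something
        | like this sketch (mine, in C):
        
        /* Four independent partial sums, combined at the end; this
           is the reordering SIMD code performs. Rounding differs
           from the strict left-to-right loop, but is not inherently
           less accurate. */
        float sum4(const float *data, int n)
        {
            float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
            int i;
            for (i = 0; i + 4 <= n; i += 4) {
                a0 += data[i];
                a1 += data[i + 1];
                a2 += data[i + 2];
                a3 += data[i + 3];
            }
            for (; i < n; i++)  /* scalar tail */
                a0 += data[i];
            return (a0 + a1) + (a2 + a3);
        }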
        
         | neonsunset wrote:
          | A lot of code relies on FP operations being
          | deterministic, often within the confines of a specific
          | ISA. Applying SIMD to loops with FP could have been the
          | default, but it breaks a lot of existing code, frequently
          | makes the output straight-up non-deterministic, and as a
          | result is something that a programmer has to explicitly
          | choose to use.
          | 
          | Moreover, many programmers may not even be aware of these
          | issues, so when a `float Sum(float[] values)` starts to
          | return something different, they may have no way to know
          | that it is the vectorization that makes it do so. Which
          | is why, for example, .NET's standard library will use
          | SIMD for `integers.Sum()` but not for `floats.Sum()`.
        
           | crote wrote:
           | > A lot of code relies on FP operations being deterministic,
           | often within confines of specific ISA
           | 
           | How much of this is true in practice? I vaguely recall
           | reading about a hard-to-diagnose bug where a seemingly
           | unrelated code change meant that an intermediate value was no
           | longer stored in the x87 extended-precision register but in
           | memory instead, leading to a different result. Wouldn't you
           | run into stuff like that all the time?
        
             | mindcandy wrote:
             | > A lot of code relies on FP operations being deterministic
             | 
             | A lot, but still quite the minority. People do run into
             | situations where the same code with the same data produces
             | slightly different results all the time because of changes
             | in compiler, compiler settings, cpu arch, etc... They just
             | don't notice.
             | 
             | But, then they run into a situation where it matters and
             | it's a huge PITA to nail down.
        
               | kardos wrote:
                | Yes, this is exactly it. The gold standard is
                | bit-for-bit matching when shifting code between
                | systems. If you move your code to a new system and
                | get different results, is it due to a different
                | summing order, or something else wrong? By ruling
                | out the former you're more likely to get the
                | bit-for-bit match and thus avoid a lot of debugging
                | effort -- or spend it solving a real bug.
        
             | Sesse__ wrote:
             | This is one of many reasons why we generally don't use the
             | x87 registers anymore.
        
         | throwaway260124 wrote:
          | Sorting the floats reduces the error. Using multiple
          | accumulators would then, I think, reduce the accuracy.
          | And sorted data is not uncommon.
          | 
          | I think there is always a correct answer, and the
          | compiler shouldn't make changes, at least by default,
          | that are wrong.
          | 
          | But ways for the programmer to express his intent more
          | clearly are always welcome.
        
           | Chabsff wrote:
            | Not necessarily. If the array is large and the values
            | are similar, using multiple accumulators would yield a
            | more accurate result than a single one.
        
           | celrod wrote:
            | Multiple accumulators increase accuracy. See pairwise
            | summation, for example.
            | 
            | SIMD sums are typically going to be much more accurate
            | than a naive sum.
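            | 
            | A sketch of pairwise summation, for reference (mine):
            
            /* Split-in-half recursion: error grows O(log n) instead
               of O(n) for the naive left-to-right loop. */
            float pairwise_sum(const float *x, int n)
            {
                if (n <= 8) {  /* small base case: naive sum */
                    float s = 0.0f;
                    for (int i = 0; i < n; i++)
                        s += x[i];
                    return s;
                }
                int half = n / 2;
                return pairwise_sum(x, half)
                     + pairwise_sum(x + half, n - half);
            }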
        
             | Chabsff wrote:
             | Not necessarily either. It's not particularly hard to
             | create a vector where in-order addition is the most
             | accurate way to sum its terms. All you need is a sequence
             | where the next term is close to the sum of all prior ones.
             | 
             | There just isn't a one-size-fit-all solution to be had
             | here.
        
               | kardos wrote:
               | > There just isn't a one-size-fit-all solution to be had
               | here.
               | 
               | But there is:
               | https://news.ycombinator.com/item?id=40867842
        
         | pjc50 wrote:
          | > Summing floats should by default be taken to have
          | error bounds, and any answer in those bounds is valid. If
          | you know something special about the floats you are
          | inputting, the language should have some means to
          | explicitly encode that. It shouldn't be the most basic
          | loop; the basic loop is the default case, so it should
          | give the best performance.
         | 
         | This is a lot of "should" for something that basically doesn't
         | happen. The only information you give is the order of
         | arithmetic in the original expression.
         | 
         | It is a total nightmare if arithmetic is not stable between
         | builds. Rebuilding the software and running it on the same
         | input should not produce different results.
         | 
         | (Long ago I encountered the special Intel version of this:
         | because the FPU used 80-bit registers internally but 64-bit in
         | memory, changing when a register fill/spill happened would
         | change when your answer got rounded, and thereby change the
         | results. You can set a global FPU flag at the start of the
         | program to force rounding on every operation.)
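          | 
          | (The incantation, sketched for glibc on x86; the macros
          | are from <fpu_control.h>:)
          
          #include <fpu_control.h>
          
          /* Set the x87 precision-control field so intermediates
             round to 53-bit double on every operation instead of
             the default 64-bit extended precision. */
          void force_double_rounding(void)
          {
              fpu_control_t cw;
              _FPU_GETCW(cw);
              cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
              _FPU_SETCW(cw);
          }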
        
           | cwzwarich wrote:
           | > You can set a global FPU flag at the start of the program
           | to force rounding on every operation
           | 
           | This doesn't do quite the same thing. It still uses the wider
           | exponent range of the 80-bit type.
        
           | bee_rider wrote:
           | It does happen in languages and libraries with a higher level
           | of abstraction. MKL for example will do whatever it wants for
           | accumulations (which practically means that it'll take
           | advantage of SIMD because that's a big reason why people use
            | the library) unless you specifically request otherwise
            | via their "Conditional Numerical Reproducibility" flag.
           | 
           | I think that was the right way to do it. BLAS made the right
           | decision by defining these things in terms of sums and dot
           | products instead of step-by-step instructions.
           | 
           | It will always be possible to write programs that run
           | differently on different hardware or with different
           | optimization levels. If somebody is writing code for floating
           | point computations and expects exact bit patterns--it is
           | possible, of course, all the rounding behavior is clearly
           | specified. But usually this is an error.
        
         | crote wrote:
         | It turns into a serious problem if the floats are of different
         | magnitude. Consider [1e50, -1e50, 1e3, 1e3]. Evaluating it as
         | (((1e50 + -1e50) + 1e3) + 1e3) gives 2e3, evaluating it as
         | ((1e50 + 1e3) + (-1e50 + 1e3)) gives 0.
         | 
         | You get similar issues if you're trying to add a large number
         | of small-magnitude numbers to a single larger-magnitude number.
         | (((1e3 + 1e3) + 1e3) ... + 1e50) is substantially different
         | from (((1e50 + 1e3) + 1e3) ... + 1e3).
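          | 
          | A quick demo of those groupings (sketch):
          
          #include <stdio.h>
          
          int main(void)
          {
              double v[] = { 1e50, -1e50, 1e3, 1e3 };
              /* Left to right: the huge terms cancel first. */
              double a = ((v[0] + v[1]) + v[2]) + v[3];
              /* Reassociated: each 1e3 is absorbed into 1e50. */
              double b = (v[0] + v[2]) + (v[1] + v[3]);
              printf("%g %g\n", a, b);  /* prints: 2000 0 */
              return 0;
          }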
        
           | jandrese wrote:
           | Adding and subtracting numbers of wildly different magnitudes
           | is always cursed. If you're doing this I'd say the majority
           | of the time you are making an error.
           | 
           | Your equation ends up being like "We calculated the diameter
           | of the observable universe, but what if we added the Empire
           | State Building to one side, how does that affect the result?"
        
             | skybrian wrote:
             | It's also what we do when incrementing a counter, if it's a
             | very big counter. Operations like that don't have a
             | physical meaning, but can happen in things like random
             | number generators where it's a way of mixing the bits. Or
             | maybe allocating unique ids, if you started from a random
             | number.
        
               | bee_rider wrote:
               | Of course it is just an example so we shouldn't nitpick
               | you too much, but that sounds more like a job for an
               | integer.
        
               | skybrian wrote:
               | Yes, absolutely. Though in JavaScript we often make do
               | with the safe integer range. (I don't know how fast
               | BigInt is for 64 bit integers.)
        
               | bee_rider wrote:
                | I don't know JavaScript so I'm sure to shoot my
                | own foot here, but surely a BigInt whose value can
                | fit in 64 bits is just a normal 64-bit int (from
                | hardware) with some extra bookkeeping around it on
                | most platforms, right?
        
               | IainIreland wrote:
               | I can't speak for any other engines, but in SpiderMonkey
               | a BigInt is a GC-managed object with a one-word header
               | containing the sign bit, the length, and some GC
               | bookkeeping bits, along with a second word that is either
               | a single inline 64-bit digit, or a pointer to an array of
               | 64-bit digits. So the memory overhead of a BigInt that
               | fits in 64 bits is "only" 2x, but doing math with BigInt
               | requires more work than a native integer because it needs
               | to handle larger numbers, manually manage sign bits, and
               | so on.
               | 
               | (https://searchfox.org/mozilla-
               | central/source/js/src/vm/BigIn...)
        
               | aardvark179 wrote:
               | If all you've got is JS then every problem looks like a
               | double.
        
           | mywittyname wrote:
            | But you haven't demonstrated that the SIMD approach
            | produces a meaningfully different and more accurate
            | result than the naive method.
           | 
           | Just like processor magic made the theoretically more
           | performant code perform worse, it's possible that the
           | processor has facilities for bucketing the data such that
           | results from such a dataset are the same or more accurate
           | than the naive approach.
           | 
           | This is absolutely a situation that requires empirical
           | testing.
        
             | scaredginger wrote:
             | Yeah, no.
             | 
             | If I know what my data look like, I can choose an order of
             | summation that reduces the error. I wouldn't want the
             | compiler by default to assume associativity and introduce
             | bugs. There's a reason this reordering is disabled by
              | default.
        
               | Sesse__ wrote:
               | Not the least that you should get the same result from
               | your code every time, even if you compile it with a
               | different compiler. Doesn't matter if that result is
               | somehow better or worse or "difference is so small that
               | it doesn't matter"; it needs to be exactly the same,
               | every time.
        
               | bee_rider wrote:
               | I think there's often a disconnect in these discussions
               | between people that work on libraries and people that
               | work on the application code.
               | 
                | If you have written a library, the user will
                | provide some inputs. While the rounding behavior of
                | floating-point operations is well defined, for
                | arbitrary user input you can't usually guarantee
                | which way a result will round. Therefore, you need
                | to do the numerical analysis given user inputs from
                | some range, if you want to be at all rigorous. This
                | will give results with error bounds, not exact bit
                | patterns.
               | 
               | If you want exact matches for your tests, maybe identify
               | the bits that are essentially meaningless and write them
               | to some arbitrary value.
               | 
               | Edit: that said I don't think anybody particularly owns
               | rigor on this front, given that most libraries don't
               | actually do the analysis, lol.
        
             | Dylan16807 wrote:
             | > it's possible that the processor has facilities for
             | bucketing the data such that results from such a dataset
             | are the same or more accurate
             | 
             | Not realistically. The processor very much needs to do what
             | you tell it.
        
             | Sesse__ wrote:
             | > it's possible that the processor has facilities for
             | bucketing the data such that results from such a dataset
             | are the same or more accurate than the naive approach.
             | 
             | It absolutely is not. Floating-point addition is completely
             | defined (save for some of the bits in NaN results), and no
             | CPU will start reordering your addition instructions in a
             | way that changes the result. Even if it's an ordering that
             | gives you better accuracy. Having SIMD do four (or more) of
             | them in parallel does not change this.
        
           | bee_rider wrote:
           | Whether you care about the order or not depends on the intent
           | of the programmer. I think the intent is usually to sum up
           | the array for this sort of code. So, at first glance your
           | first example looks more like a result of something like 0 +-
           | 1e35 in double to me (added one to the exponent to avoid
           | having to do math).
           | 
           | The intent of the programmer always has to be interpreted by
           | the language. I'm saying the default interpretation should be
           | one that doesn't impose an ordering unless specifically
           | requested. This is much more sympathetic to modern deeply
           | parallel machines.
        
             | FreakLegion wrote:
             | Very often there's no particular intent, other than to be
             | consistent. You haven't lived until you've traced
             | discrepancies in an ML model deployed across a bunch of
             | platforms back to which SIMD instructions each happens to
             | support (usually squirreled away in some dependency).
        
         | RobotToaster wrote:
         | > this is likely why the compiler doesn't generate SIMD
         | instructions to compute the sum!
         | 
          | Isn't this the type of thing -ffast-math is supposed to
          | enable?
        
           | not2b wrote:
            | -ffast-math can be read as -fwrong-math if the
            | accuracy of the answers matters. For some applications
            | accuracy might not matter much. But the differences you
            | get by pretending that floating-point math is
            | associative can be enormous, and scientific and
            | engineering code is often carefully designed to take
            | the likely scale of values into account; rearranging it
            | is not a good idea.
        
             | Dylan16807 wrote:
              | Assuming associativity is not anti-accuracy in
              | general; it just gets in the way of certain clever
              | algorithms.
        
             | dzaima wrote:
              | For code that's been explicitly manually optimized
              | already, indeed you wouldn't want to turn on
              | -ffast-math; and I'd hope anyone who has actually
              | learned what benefits from manual optimization, and
              | how to properly do it, would also at some point have
              | read something about not coupling it with
              | -ffast-math. But I really doubt such meaningfully-
              | manually-optimized code is at all frequent, and where
              | it isn't, it's a rather simple improvement
              | (potentially using some subset of the flags to allow
              | NaN or infinity if desired).
        
           | Dylan16807 wrote:
           | Fast math does a lot of risky things in addition to using
           | basic math rules.
        
           | bee_rider wrote:
            | Maybe the "fun safe math" flag
            | (-funsafe-math-optimizations) could be used?
        
         | lscharen wrote:
         | Surprisingly there are different algorithms for doing something
         | as simple as summing up a list of numbers.
         | 
         | The naive way of adding numbers one-by-one in a loop is an
         | obvious way, but there are more sophisticated methods that give
         | better bounds of the total accumulated error; Kahan
         | summation[1] being one of the better-known ones.
         | 
         | Like most things, it can get complicated depending on the
         | specific context. For streaming data, adding numbers one at a
         | time may be the only option. But what if one could use a fixed-
         | size buffer of N numbers? When a new number arrives what should
         | be done? Take a partial sum of some subset of the N numbers in
          | the buffer and add that to a cumulative total? If a
          | subset is chosen, how? Are there provable (improved)
          | error bounds for the subset selection method?
         | 
         | Fun stuff.
         | 
         | [1] https://en.wikipedia.org/wiki/Kahan_summation_algorithm
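          | 
          | For reference, Kahan summation is only a few lines
          | (sketch mine; note that -ffast-math would optimize the
          | trick away):
          
          /* Compensated summation: c carries the low-order bits
             that each addition into sum would otherwise discard. */
          float kahan_sum(const float *x, int n)
          {
              float sum = 0.0f, c = 0.0f;
              for (int i = 0; i < n; i++) {
                  float y = x[i] - c;  /* apply running correction */
                  float t = sum + y;   /* low bits of y are lost */
                  c = (t - sum) - y;   /* recover them (negated) */
                  sum = t;
              }
              return sum;
          }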
        
           | kardos wrote:
           | As far as I know, xsum [1] more or less solves the problem
           | completely: order invariant and exact, at less than 2x the
           | cost of the naive summation. Further speeding it up may be
           | the only avenue left for improvement.
           | 
           | [1] https://gitlab.com/radfordneal/xsum
        
             | lscharen wrote:
             | I was not aware of this (2015) work -- very nice!
             | 
             | A couple of pull-quotes from the paper to summarize:
             | 
             |  _Much work has been done on trying to improve the accuracy
             | of summation. Some methods aim to somewhat improve accuracy
             | at little computational cost, but do not guarantee that the
             | result is the correctly rounded exact sum._
             | 
             |  _Many methods have been developed that instead compute the
             | exact sum of a set of floating-point values, and then
             | correctly round this exact sum to the closest floating-
             | point value. This obviously would be preferable to any non-
             | exact method, if the exact computation could be done
             | sufficiently quickly_
             | 
             |  _Exact summation methods fall into two classes -- those
             | implemented using standard floating point arithmetic
             | operations available in hardware on most current
             | processors, such as the methods of Zhu and Hayes (2010),
             | and those that instead perform the summation with integer
             | arithmetic, using a "superaccumulator"._
             | 
             |  _I present two new methods for exactly summing a set of
             | floating-point numbers, and then correctly rounding to the
             | nearest floating-point number. ... One method uses a
             | "small" superaccumulator with sixty-seven 64-bit chunks,
             | each with 32-bit overlap with the next chunk, allowing
             | carry propagation to be done infrequently. The small
             | superaccumulator is used alone when summing a small number
             | of terms. For big summations, a "large" superaccumulator is
             | used as well. It consists of 4096 64-bit chunks, one for
             | every possible combination of exponent bits and sign bit,
             | plus counts of when each chunk needs to be transferred to
             | the small superaccumulator._
             | 
             |  _On modern 64-bit processors, exactly summing a large
             | array using this combination of large and small
             | superaccumulators takes less than twice the time of simple,
             | inexact, ordered summation, with a serial implementation_
        
               | pacaro wrote:
               | Thanks for the summary. I kinda low-key love the idea of
               | converting floats into a fixed point representation that
               | covers the entire range represented by the float type. I
               | mean the accumulator is only 32 KB, which is likely to be
               | in L1 the entire time on modern hardware, and any given
               | float is only going to need two 64 bit words, + 13 bits
               | (12 bits for offset, and 1 for sign) to be represented in
               | this scheme.
        
       | petsounds wrote:
       | The title of this article is a reference to "Happy Fun Ball", my
       | favorite SNL skit: https://www.youtube.com/watch?v=GmqeZl8OI2M
        
       | quotemstr wrote:
       | Yeah. This effect is why, also, using a bare CALL without a RET
       | ends up being a slow way to get the program counter on 32-bit
       | x86, which lacks native PC-relative addressing: it confuses the
       | CPU's tracking of calls and returns. Basically, you always want
       | to balance them. No exceptions.
        
         | gpderetta wrote:
            | Wasn't an article posted a few days ago showing that
            | CPUs special-case CALL+0 so as not to update the
            | prediction stack, since it was commonly used to get
            | the IP?
        
           | rep_lodsb wrote:
           | Yes: https://news.ycombinator.com/item?id=40767676
        
           | quotemstr wrote:
           | Okay, fair enough. It _used_ to be a pessimization before
           | CPUs recognized that pattern. :-)
        
       | AshamedCaptain wrote:
       | The same thing was covered by Raymond Chen almost 2 decades ago:
       | https://devblogs.microsoft.com/oldnewthing/20041216-00/?p=36...
        
         | rkagerer wrote:
         | He did, and that's a fantastic article which is worth the read
         | and provides good context for interpreting this post.
         | 
         | One thing this post adds is the simple rectification of
         | replacing ret with another br instruction, so the pairs are
         | again "mirrored", and you get to have your cake and eat it too
         | - slightly faster code without breaking the branch predictor.
        
         | electrodank wrote:
          | As someone who has the old dead-tree version of Intel's
          | x86 and 64 architecture instruction set reference (the
          | fat blue books), and in general as someone who carefully
          | reads the data sheets and documentation and looks for
          | guidance from the engineers and staff who wrote said data
          | sheets, I always have reservations when I hear that
          | "intuitively you would expect X but Y happens." There's
          | nothing intuitive about any of this except, maybe, a
          | reasonable understanding of the semiconductive nature of
          | the silicon and the various dopants in the process.
          | Unless you've seen the die schematic and the traces, and
          | you know the paths, there is little to no reason to have
          | any sort of expectation that Thing A is faster than Thing
          | B unless the engineering staff and data sheets explicitly
          | tell you.
          | 
          | There are exceptions, but just my 2c. Especially with
          | ARM.
        
           | neonsunset wrote:
            | Respectfully, I disagree. CPU architecture optimization
            | is in a continuous dance with compiler optimization,
            | where the former tries to adapt to the patterns most
            | commonly produced by the latter, and the latter tries
            | to adjust its optimizations according to what performs
            | fastest on the former.
            | 
            | Therefore, it is not unreasonable to make assumptions
            | based on the premise of "does this code look like
            | something that could be reasonably produced by
            | GCC/LLVM?".
           | 
           | It is true that as cores get simpler and cheaper, they get
           | more edge cases - something really big like Firestorm
           | (A14/M1) can afford to have very consistent and tight
           | latencies for all of its SIMD instructions regardless of the
           | element/lane size and even hide complex dependencies or
           | alignment artifacts wherever possible. But compare that with
           | simpler and cheaper Neoverse N1, and it's a different story
           | entirely, where trivial algorithm changes lead to significant
           | slowdown - ADDV Vn.16B is way slower than Vn.4H, so you have
           | to work around it. This is only exacerbated if you look at
           | much smaller cores.
           | 
           | LLVM and GCC deal with this by being able to use relatively
           | precise knowledge of CPU's (-mtune) fetch, reorder, load and
           | store queue/buffer depths, as well as latencies and
           | dependency penalty cost of opcodes of the ISA it implements,
           | and other details like loop alignment requirements, branch
           | predictor limitations.
           | 
           | Generally, it's difficult to do better in straight-line code
           | with local data than such compilers assuming that whatever
           | you are doing doesn't make concessions that a compiler is not
           | allowed to make.
           | 
            | Nonetheless, the mindset for writing a performant
            | algorithm implementation is going to be the same as
            | long as you are doing so for the same _class_ of CPU
            | cores - loop unrolling, using cmovs, scheduling
            | operations in advance, and ensuring that, should spills
            | happen, the load and store operations have matching
            | arguments - all of that will be profitable on AMD's
            | Zen 4, Intel's Golden Cove, Apple's Firestorm, or ARM's
            | Neoverse V3.
        
           | Sharlin wrote:
           | "Intuitively" here should be taken to mean approximately the
           | same as "naively" - as in, the intuition that most of us
           | learn at first that CPUs work ("as if") by executing one
           | instruction at a time, strictly mechanistically, exactly
           | corresponding to the assembly code. The way a toy
           | architecture on a first-year intro to microprocessors course
           | - or indeed a 6502 or 8086 or 68000 - would do it. Which is
           | to say, no pipelining, no superscalar, no prefetching, no
           | out-of-order execution, no branch prediction, no speculative
           | execution, and so on.
        
         | rep_lodsb wrote:
         | This seems no longer to be true for recent x86 processors:
         | https://news.ycombinator.com/item?id=40767676
        
         | bigstrat2003 wrote:
         | Raymond Chen is a treasure. I'm so glad that Microsoft gives
         | him the leeway to write that blog, I have learned a ton from
         | it.
        
       | allenrb wrote:
       | Was summarizing this article for a group of friends who largely
       | met during the Apple II days and wanted to repost a bit of that
       | here:
       | 
       | The optimized code at the end takes 94 _nanoseconds_ to sum an
       | array of 1024 32-bit floating point numbers.
       | 
       | In 94 nanoseconds, our old friend the 1 MHz 6502, would be just
       | starting to consider signaling the memory chips that maybe they
       | ought to try digging up the first byte of the first instruction
       | in the program.
       | 
       | Worth mentioning, that code is entirely dependent on running in
       | cache. Otherwise even the mighty M1 Max in the post would still
       | be stuck waiting on that first memory fetch. DRAM is slow. :-)
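        | 
        | (For scale: 1024 floats in 94 ns is ~10.9 floats per
        | nanosecond; assuming a ~3.2 GHz clock, that's roughly 3.4
        | float additions retired per cycle, sustained.)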
        
         | cmrdporcupine wrote:
         | Lucky our total L1 cache sizes are about as big as the entire
         | addressable memory of the 6502.
         | 
         | We truly live in amazing times.
        
           | LoganDark wrote:
           | Truly. And I'm also always amazed by how much slower (in
           | terms of wall time) modern software is than it ought to be.
           | CPUs sure don't feel three to four orders of magnitude faster
           | than they were 50 years ago, because software has gotten four
           | to five orders of magnitude more wasteful to compensate.
           | Argh...
        
       | samatman wrote:
       | Everything I know, or need to know, about branch prediction, I
       | learned from James Mickens.
       | 
       | https://scholar.harvard.edu/files/mickens/files/theslowwinte...
        
       | monitron wrote:
        | This was a lovely article. I only wish the author hadn't
        | kept switching units (between us and ns), making it hard to
        | scan the tables for comparison.
        
         | dmccarty wrote:
         | 1000ns = 1us :)
        
           | carl_dr wrote:
           | We know.
           | 
           | On a phone at normal reading distance, with the articles
           | styling, it's really hard to tell the difference between n
           | and u without zooming, and the decimal points get lost -
           | scanning the tables is hard.
        
         | cshimmin wrote:
         | They also switched from C to Rust halfway through the article,
         | which was a bit jarring...
        
       | mrinterweb wrote:
       | Classic SNL reference: "Do not taunt happy fun ball"
       | 
       | https://www.youtube.com/watch?v=GmqeZl8OI2M
        
         | rhussmann wrote:
         | If happy fun branch predictor starts to smoke, seek shelter
         | immediately.
        
         | Terr_ wrote:
         | > Happy Fun Ball has been shipped to our troops in Saudi Arabia
         | and is also being dropped by our warplanes in Iraq.
         | 
         | "What YEAR is it!?" /robin-williams-in-jumanji
        
       | nayuki wrote:
       | > Why do we need a special function return instruction?
       | Functionally, BR LR would do the same job as RET. Using RET tells
       | the processor that this is a function return.
       | 
       | I'm not sure this is a good design. Those two opcodes perform the
       | same logic function but have different branch prediction hints.
       | 
       | Meanwhile, there are other cases where an existing instruction is
       | reinterpreted with new semantics:
       | 
       | * On x86, `XCHG eax, eax` is `NOP`.
       | 
       | * On x86, `XOR reg, reg` is `MOV reg, 0` and breaks dependencies
       | for the purposes of register renaming.
       | 
       | * On x86, various examples of macro-op fusion and micro-op
       | fusion.
       | 
       | * On various RISC architectures, `ADD r1, r0, 1234` is `MOV r1,
       | 1234`.
       | 
       | * On some RISC architectures, conditionally branching if r0 == 0
       | is the only way to express an unconditional branch.
       | 
       | I see no reason why `BR LR` can't be the standard function return
       | instruction and involve the branch predictor.
        
         | zerohp wrote:
         | In AArch64, `BR LR` and `RET` do not perform the same logic
         | when FEAT_BTI (Branch Target Identification) is present.
        
       | 38 wrote:
       | > const float f = *data++;
       | 
       | cursed code. people might say "oh that's normal idiomatic C", I
       | don't care. I am glad Go does not allow code like this
        
       | trealira wrote:
       | I'm surprised they didn't try something less clever to optimize
       | their code first. The assembly code could be rewritten like this,
       | so the test requires one branch at the bottom of the loop instead
       | of two (one to jump to the top and one to branch conditionally),
       | as well as one instead of two ALU operations per loop for X1 (one
       | subtraction to compare X1 to zero, and one to decrement X1, as
       | opposed to using the result of the latter).
            stp   x29, x30, [sp, #-16]!
            fmov  s0, #0.0
            cbnz  x1, exit        // Skip loop if x1 == 0.
        loop:
            ldr   s1, [x0], #4
            bl    foo
            // Decrement x1. If x1 != 0, repeat the loop.
            subs  x1, x1, #1
            b.ne  loop
            // Fall through to exit after the loop finishes.
        exit:
            ldp   x29, x30, [sp], #16
            ret

        foo:
            // ...
            ret
       | 
       | Going further, you could just inline foo and omit the RET
       | instruction without doing any trickery with mismatched BL and RET
       | instructions.
       | 
       | I haven't benchmarked this, though, so I don't actually know how
       | much this speeds the code up.
        
         | trealira wrote:
         | Typo: that line with "cbnz" should be "cbz". The CBZ
         | instruction branches to a label if a register is zero, and CBNZ
         | branches if the register isn't zero.
        
       | croemer wrote:
       | Don't miss the 2023! The article really is already a little
       | outdated: since Rust 1.78, the compiler uses some more aggressive
       | loop unrolling (and a little bit of SIMD):
       | https://godbolt.org/z/zhbobW7rr
       | 
       | OP wrote: "Looking at the assembly, it's doing some loop
       | unrolling.", linking to https://godbolt.org/z/Kv77abW6c, which is
       | using "Rust Nightly", a moving target. By now, there's more loop
       | unrolling.
       | 
       | Loop unrolling started with Rust 1.59:
       | https://godbolt.org/z/5PTnWrWf7
       | 
       | Update: The code on Github shows they were using Rust version
       | 1.67.0-nightly, 2022-11-27.
        
         | croemer wrote:
         | Rust 1.67.0, which is likely what OP was looking at, gives:
         | https://godbolt.org/z/4Y61d9seh
         | 
         | I ran the benchmarks locally on the same hardware, using latest
         | nightly Rust 1.81 (aggressive loop unrolling): no difference.
         | Same speed as 1.5 years ago.
        
       ___________________________________________________________________
       (page generated 2024-07-03 23:00 UTC)