[HN Gopher] Do not taunt happy fun branch predictor (2023)
___________________________________________________________________
Do not taunt happy fun branch predictor (2023)
Author : fanf2
Score : 201 points
Date : 2024-07-03 14:42 UTC (8 hours ago)
(HTM) web link (www.mattkeeter.com)
(TXT) w3m dump (www.mattkeeter.com)
| CrazyStat wrote:
| (2023)
|
| Discussion at the time:
| https://news.ycombinator.com/item?id=34520498
| dang wrote:
| Thanks! Macroexpanded:
|
| _Do not taunt happy fun branch predictor_ -
| https://news.ycombinator.com/item?id=34520498 - Jan 2023 (171
| comments)
|
| (Reposts are fine after a year or so; links to past threads are
| just to satisfy extra-curious readers)
| jcranmer wrote:
| > After checking for loop termination, we fall through into the
| foo function (without a branch!)
|
| Literally just reading this line made me go "well, there's your
| problem." I thought this was going to be something deep about
| fancy branch predictor heuristics, but it turns out to be a
| violation of basic heuristics.
|
| Folks, don't think you're going to get amazing speedups by using
| mismatched call/ret instructions. The branch predictor keeping a
| shadow stack of return addresses is something that's been around
| for decades.
| vlovich123 wrote:
| And yet they also showed how they did actually accomplish this
| and other optimizations. It's an informative read.
| quotemstr wrote:
| And this will screw up program execution more profoundly
| (i.e. crash) on systems with architectural shadow call stacks as
| a security feature.
| astrobe_ wrote:
| Speaking of "security feature", this insanely complex
| behavior (for a CPU) is a recipe for disasters like Spectre
| and Meltdown.
| adwn wrote:
| It's great that you're so knowledgeable about branch predictor
| behavior, but many people aren't and for them this is new,
| maybe even valuable, information.
|
| I guess this article's just not for you then, and that's okay.
| akira2501 wrote:
| I don't think the complaint is that the article exists, it's
| that it perhaps needs more context. The article does not
| state the level of skill it assumes of the reader, or of the
| author, so comments like these help put that in perspective.
|
| The commentary from "old salts" isn't meant to insult or
| discourage you but to inform your world view with hard earned
| experience.
|
| It all has a value that's greater than the sum of its parts
| and there's no good reason to be critical or reactionary
| about it.
| chasil wrote:
| On the one hand, a design goal of RISC is to improve the
| performance of compiled code at the expense of most other
| things. As such, this sort of hazard should be documented, but
| the designers should be able to assume that anyone directly
| writing assembler has read the documentation.
|
| On the other hand, Sophie Wilson wrote an implementation of BBC
| BASIC for the original ARM (but it didn't have a branch
| predictor). While that is 32-bit and plays by different rules,
| I wonder how AArch64 slows down code when the architectural
| assumptions change.
| orbital-decay wrote:
| Tangential question, perhaps a dumb one: why do most hardware
| schedulers try to predict the load on their own, instead of
| letting users explicitly signal their intent? It's everywhere,
| from CPUs to storage.
| 082349872349872 wrote:
| In this case, I'd consider BL (call) a fairly explicit signal
| of intent to RET?
| dzaima wrote:
| Yep. For returns, the more important thing in the article,
| the "ret"[0] instruction behaves identically to "br
| x30"[1], just with the hint that it's expected to match a
| "bl".
|
| On x86 things are less pretty as call/ret push/pop the ip
| from the stack, with no non-matching-hinted versions, but
| having such would also just not be particularly useful as
| unpaired call/ret would result in the stack pointer
| continually drifting without manual fixup (at which point jmp
| is just clearly better).
|
| [0]: https://developer.arm.com/documentation/dui0802/a/A64-Ge
| nera...
|
| [1]: https://developer.arm.com/documentation/dui0802/a/A64-Ge
| nera...
| LegionMammal978 wrote:
| On 32-bit x86, there is the trap of trying to use call/pop
| to get the absolute instruction pointer. It will work
| correctly, but it will mess up any call stack prediction
| and cause great performance penalties. Hence why compiler-
| generated EIP shims use call/mov/ret instead. (But this is
| not such a big issue in x86-64 with its RIP-relative
| addressing.)
| hansvm wrote:
| 1. "Explicitly signaling your intent" takes space and time.
| Decoding isn't free, and even if there were no other downsides
| you'd want to be careful with that change.
|
| 2. It's a bit historical. Early CPUs didn't have a RAM/CPU
| speed discrepancy and didn't need pipelining. Code wasn't written
| to account for pipelining. As CPUs got a little faster relative to
| RAM, you added a few prediction mechanisms so that most
| consumers and workloads could actually use your brand new 2x
| faster gigaterrawatthournewton CPUs without having to rewrite
| all their code. Iterate 10-20 generations, where most software
| has never been written to care about branch prediction and
| where modern languages don't even expose the concept, and you
| have the current status quo.
| rcxdude wrote:
| It's supported by some CPUs: the linux kernel has
| likely/unlikely macros to indicate whether a branch is expected
| to be taken or not in the common case (which uses GCC
| attributes which allow this to be propagated through to
| codegen). It's just that it's generally less effective: not all
| branches get annotated, some that are annotated are annotated
| incorrectly, and in a lot of cases the branching behaviour is
| data-dependent enough that you can't make an optimal static
| prediction at all.
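The kernel-style hint macros described above can be sketched with GCC's __builtin_expect (a minimal illustration; parse_byte is a made-up function, not taken from the kernel):

```c
/* Kernel-style likely()/unlikely() macros built on GCC's
 * __builtin_expect; parse_byte is a hypothetical example function. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int parse_byte(int c) {
    if (unlikely(c < 0))   /* error path hinted as cold */
        return -1;
    return c * 2;          /* hot path compiles to the fall-through case */
}
```

The hint only influences code layout and static prediction; as the comment notes, data-dependent branches defeat any static annotation.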
| pjc50 wrote:
| The user doesn't know the capabilities of the system. In
| particular, anything compiled in is set in stone: you don't
| know the capabilities of systems yet unbuilt. New systems don't
| sell well unless they make existing code run faster, and sell
| badly if they require you to rebuild everything and make it
| non-backwards-compatible.
|
| The peak example of "let the user do the scheduling" was
| Itanium, which had a VLIW system which scheduled three
| instructions at once. It was a huge commercial failure, and
| allowed AMD to instead define the instruction set people
| wanted: X64 is IA32 but wider and with more registers, but not
| substantially different and certainly not as weird as Itanium.
| magicalhippo wrote:
| I've lost count of the times I've sped up some existing code
| by removing the hand-optimized assembly and just used a plain
| C implementation or similar. Sure, it was hand-optimized
| assembly, but for a 10+ year old CPU.
|
| If you're going to do hand-optimized code for a given
| platform, include the baseline code and measure at runtime to
| pick the implementation.
| patmorgan23 wrote:
| Hand-optimized assembly was necessary 30 years ago, and
| good for several optimizations 20 years ago. But today's
| computers are SO FAST, it's just not necessary in most
| situations.
| gpderetta wrote:
| Often, but not always, optimal scheduling is data dependent so
| it can't be statically computed at compile time.
| mhh__ wrote:
| They want to be able to guess long, long, before actually
| executing the instructions.
| hindsightbias wrote:
| We could solve a lot of problems if every user had to take
| EE180.
| crote wrote:
| You should look into the history of Itanium, it was designed
| around the idea that the compiler would do pretty much exactly
| that. It looked great on paper, but in practice nobody figured
| out how to actually _write_ a compiler capable of doing it
| without constantly running into weird edge cases.
|
| X86 does have "prefetch" instructions, which tell the CPU that
| you want to use some data in the near future. There are also
| "branch hint" instructions which tell the CPU if a branch is
| usually taken or not. The problem is that they tend to make
| your code _slower_ : the CPU is already more than capable of
| predicting it by itself, and the extra instructions slow down
| the overall execution because they take up cache space and have
| to be decoded too.
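The prefetch instructions mentioned above are exposed in GCC and Clang as __builtin_prefetch; a minimal sketch (the 16-element lookahead is an arbitrary illustrative choice, and as the comment says, the hardware prefetcher often makes this a pessimization in practice):

```c
#include <stddef.h>

/* Sketch of software prefetching via GCC/Clang's __builtin_prefetch.
 * Second argument 0 = prefetch for read; third argument 1 = low
 * temporal locality. The lookahead distance of 16 is arbitrary. */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);
        s += a[i];
    }
    return s;
}
```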
| CalChris wrote:
| VLIW is pretty successful for loopy DSP and now AI/ML.
| However, Itanium was trying to work for general purpose code
| and then as you say, it constantly ran into weird edge cases.
| It _seemed_ as if VLIW succumbed to the Peter Principle: it
| had risen to its level of incompetence. But as long as you
| use it appropriately, VLIW is a best practice and LLVM
| supports it.
|
| BTW, CS252 at Berkeley and Onur Mutlu's ETH lectures give a
| conventionally disparaging view of VLIW without pointing out
| its successes.
| bbatha wrote:
| Adding on, VLIW is only successful in AI/ML because GPUs
| are incapable of doing branching in a good way let alone
| prediction. I would guess the same story applies to DSPs.
| If someone figures out how to stick a branch predictor in
| those pipelines I'm guessing the VLIW nature of those
| platforms will disappear overnight.
| CalChris wrote:
| The defining character of VLIW is to have the brilliant
| compiler software schedule dumb parallel hardware
| instructions _statically_ and then not depend on power
| /transistor expensive dynamic branch prediction and OOO
| execution.
|
| In a perfect VLIW world that would mean you don't spend
| any transistors or power on branch prediction or out of
| order instruction searches. Indeed the original VLIW
| paper [1] spends most of its paragraphs on solving
| the (hard) compiler instruction scheduling problem with
| trace scheduling which is still used. The VLIW hardware
| itself is dead simple.
|
| [1] https://safari.ethz.ch/architecture/fall2022/lib/exe/
| fetch.p...
|
| So if VLIW fits the problem it has fantastic performance
| characteristics. If it doesn't fit, and far and away most
| don't, then VLIW is terrible. VLIW is very brittle.
|
| I need to make a caveat about the Mill CPU which is a
| VLIW but I see I've written too much already.
| ww520 wrote:
| GCC has the __builtin_expect builtin to give a hint for branch
| prediction. C++ has added something similar recently.
| mhh__ wrote:
| These are usually for the compiler rather than the CPU e.g. I
| don't want my error-handling code inlined! (But the compiler
| can't work this out without a hint)
| bee_rider wrote:
| > The SIMD code does come with one asterisk, though: because
| floating-point addition is not associative, and it performs the
| summation in a different order, it may not get the same result as
| straight-line code. In retrospect, this is likely why the
| compiler doesn't generate SIMD instructions to compute the sum!
|
| > Does this matter for your use case? Only you can know!
|
| Anything is possible of course, and the typical way of writing a
| loop to sum up an array is literally telling the computer to go
| element by element and accumulate the values.
|
| But it seems pretty unlikely that doing, say, four accumulations
| in parallel with simd and then summing them up at the end is any
| more wrong than going element by element.
|
| Summing floats should by default be taken to have error bounds,
| and any answer in those bounds is valid. If you know something
| special about the floats you are inputting, the language should
| have some means to explicitly encode that. It shouldn't be the
| most basic loop, that's the default case, so it should give the
| best performance.
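The multi-accumulator scheme described above can be sketched in plain C; whether the compiler turns it into SIMD is beside the point, what matters is the changed association order:

```c
#include <stddef.h>

/* Sketch of the four-accumulator idea: four independent partial sums,
 * combined at the end. This changes the association order, so the
 * result can differ from a strict left-to-right loop in the low bits. */
float sum4(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)  /* leftover tail elements */
        s += a[i];
    return s;
}
```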
| neonsunset wrote:
| A lot of code relies on FP operations being deterministic,
| often within confines of specific ISA. Applying SIMD to loops
| with FP could have been a default but it breaks a lot of
| existing code, makes the output frequently straight-up non-
| deterministic and as a result is something that a programmer
| has to explicitly choose to use.
|
| Moreover, many programmers may not be even aware of these, so
| when a `float Sum(float[] values)` starts to return something
| different, they may not have any way to know that it is the
| vectorization that makes it do so. Which is why, for example,
| .NET's standard library will use SIMD for `integers.Sum()` but
| not for `floats.Sum()`.
| crote wrote:
| > A lot of code relies on FP operations being deterministic,
| often within confines of specific ISA
|
| How much of this is true in practice? I vaguely recall
| reading about a hard-to-diagnose bug where a seemingly
| unrelated code change meant that an intermediate value was no
| longer stored in the x87 extended-precision register but in
| memory instead, leading to a different result. Wouldn't you
| run into stuff like that all the time?
| mindcandy wrote:
| > A lot of code relies on FP operations being deterministic
|
| A lot, but still quite the minority. People do run into
| situations where the same code with the same data produces
| slightly different results all the time because of changes
| in compiler, compiler settings, cpu arch, etc... They just
| don't notice.
|
| But, then they run into a situation where it matters and
| it's a huge PITA to nail down.
| kardos wrote:
| Yes this is exactly it. The gold standard is bit-for-bit
| matching when shifting code between systems. If you move
| your code to a new system and get different results -->
| is it due to different summing order, or something else
| wrong? By ruling out the former you're more likely to get
| the bit-for-bit match and thus avoid a lot of debugging
| effort -- or spend it solving a real bug.
| Sesse__ wrote:
| This is one of many reasons why we generally don't use the
| x87 registers anymore.
| throwaway260124 wrote:
| Sorting the floats reduces the error, and sorted data is not
| uncommon. I think using multiple accumulators would then reduce
| the accuracy.
|
| I think there is always a correct answer and the compiler
| shouldn't make changes at least by default that are wrong.
|
| But ways for the programmer to express his intent more clearly
| are always welcome.
| Chabsff wrote:
| Not necessarily. If the array is large and the values are
| similar, using accumulators would yield a more accurate
| result than a single one.
| celrod wrote:
| Multiple accumulators increases accuracy. See pairwise
| summation, for example.
|
| SIMD sums are going to typically be much more accurate than a
| naive sum.
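Pairwise summation, as mentioned, can be sketched as a simple recursion (the base-case size of 8 is an arbitrary choice; real implementations pick it to amortize the recursion overhead):

```c
#include <stddef.h>

/* Sketch of pairwise summation: split the array in half, sum each
 * half recursively, then add the two partial sums. The worst-case
 * error grows O(log n) instead of O(n) for the naive loop. */
float pairwise_sum(const float *a, size_t n) {
    if (n <= 8) {                 /* small base case: naive sum */
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }
    size_t h = n / 2;
    return pairwise_sum(a, h) + pairwise_sum(a + h, n - h);
}
```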
| Chabsff wrote:
| Not necessarily either. It's not particularly hard to
| create a vector where in-order addition is the most
| accurate way to sum its terms. All you need is a sequence
| where the next term is close to the sum of all prior ones.
|
| There just isn't a one-size-fit-all solution to be had
| here.
| kardos wrote:
| > There just isn't a one-size-fit-all solution to be had
| here.
|
| But there is:
| https://news.ycombinator.com/item?id=40867842
| pjc50 wrote:
| > Summing floats should by default be taken to have error
| bounds, and any answer in those bounds is valid. If you know
| something special about the floats you are inputting, the
| language should have some means to explicitly encode that. It
| shouldn't be the most basic loop, that's the default case, so
| it should give the best performance.
|
| This is a lot of "should" for something that basically doesn't
| happen. The only information you give is the order of
| arithmetic in the original expression.
|
| It is a total nightmare if arithmetic is not stable between
| builds. Rebuilding the software and running it on the same
| input should not produce different results.
|
| (Long ago I encountered the special Intel version of this:
| because the FPU used 80-bit registers internally but 64-bit in
| memory, changing when a register fill/spill happened would
| change when your answer got rounded, and thereby change the
| results. You can set a global FPU flag at the start of the
| program to force rounding on every operation.)
| cwzwarich wrote:
| > You can set a global FPU flag at the start of the program
| to force rounding on every operation
|
| This doesn't do quite the same thing. It still uses the wider
| exponent range of the 80-bit type.
| bee_rider wrote:
| It does happen in languages and libraries with a higher level
| of abstraction. MKL for example will do whatever it wants for
| accumulations (which practically means that it'll take
| advantage of SIMD because that's a big reason why people use
| the library) unless you specifically request otherwise via
| their "Conditional Numerical Reproducibility" flag.
|
| I think that was the right way to do it. BLAS made the right
| decision by defining these things in terms of sums and dot
| products instead of step-by-step instructions.
|
| It will always be possible to write programs that run
| differently on different hardware or with different
| optimization levels. If somebody is writing code for floating
| point computations and expects exact bit patterns--it is
| possible, of course, all the rounding behavior is clearly
| specified. But usually this is an error.
| crote wrote:
| It turns into a serious problem if the floats are of different
| magnitude. Consider [1e50, -1e50, 1e3, 1e3]. Evaluating it as
| (((1e50 + -1e50) + 1e3) + 1e3) gives 2e3, evaluating it as
| ((1e50 + 1e3) + (-1e50 + 1e3)) gives 0.
|
| You get similar issues if you're trying to add a large number
| of small-magnitude numbers to a single larger-magnitude number.
| (((1e3 + 1e3) + 1e3) ... + 1e50) is substantially different
| from (((1e50 + 1e3) + 1e3) ... + 1e3).
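The example above can be checked directly; in IEEE-754 double precision, 1e3 is far below half an ulp of 1e50, so it is absorbed:

```c
/* The same four values summed in two association orders. ulp(1e50)
 * in double is about 2e34, so adding 1e3 to +/-1e50 is a no-op,
 * while cancelling the big terms first preserves the 1e3 terms. */
double sum_left(void)  { return ((1e50 + -1e50) + 1e3) + 1e3; } /* 2000 */
double sum_mixed(void) { return (1e50 + 1e3) + (-1e50 + 1e3); } /* 0 */
```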
| jandrese wrote:
| Adding and subtracting numbers of wildly different magnitudes
| is always cursed. If you're doing this I'd say the majority
| of the time you are making an error.
|
| Your equation ends up being like "We calculated the diameter
| of the observable universe, but what if we added the Empire
| State Building to one side, how does that affect the result?"
| skybrian wrote:
| It's also what we do when incrementing a counter, if it's a
| very big counter. Operations like that don't have a
| physical meaning, but can happen in things like random
| number generators where it's a way of mixing the bits. Or
| maybe allocating unique ids, if you started from a random
| number.
| bee_rider wrote:
| Of course it is just an example so we shouldn't nitpick
| you too much, but that sounds more like a job for an
| integer.
| skybrian wrote:
| Yes, absolutely. Though in JavaScript we often make do
| with the safe integer range. (I don't know how fast
| BigInt is for 64 bit integers.)
| bee_rider wrote:
| I don't know JavaScript so I'm sure to shoot my own foot
| here, but surely a BigInt whose value can fit in 64 bits
| is just a normal 64 bit int (from hardware) with some
| extra bookkeeping around it on most platforms, right?
| IainIreland wrote:
| I can't speak for any other engines, but in SpiderMonkey
| a BigInt is a GC-managed object with a one-word header
| containing the sign bit, the length, and some GC
| bookkeeping bits, along with a second word that is either
| a single inline 64-bit digit, or a pointer to an array of
| 64-bit digits. So the memory overhead of a BigInt that
| fits in 64 bits is "only" 2x, but doing math with BigInt
| requires more work than a native integer because it needs
| to handle larger numbers, manually manage sign bits, and
| so on.
|
| (https://searchfox.org/mozilla-
| central/source/js/src/vm/BigIn...)
| aardvark179 wrote:
| If all you've got is JS then every problem looks like a
| double.
| mywittyname wrote:
| But you haven't demonstrated that the SIMD approach produces
| a meaningfully different and more accurate result than the
| naive method.
|
| Just like processor magic made the theoretically more
| performant code perform worse, it's possible that the
| processor has facilities for bucketing the data such that
| results from such a dataset are the same or more accurate
| than the naive approach.
|
| This is absolutely a situation that requires empirical
| testing.
| scaredginger wrote:
| Yeah, no.
|
| If I know what my data look like, I can choose an order of
| summation that reduces the error. I wouldn't want the
| compiler by default to assume associativity and introduce
| bugs. There's a reason this reordering is disabled by
| default
| Sesse__ wrote:
| Not the least that you should get the same result from
| your code every time, even if you compile it with a
| different compiler. Doesn't matter if that result is
| somehow better or worse or "difference is so small that
| it doesn't matter"; it needs to be exactly the same,
| every time.
| bee_rider wrote:
| I think there's often a disconnect in these discussions
| between people that work on libraries and people that
| work on the application code.
|
| If you have written a library, the user will provide some
| inputs. While the rounding behavior of floating point
| operations is well defined, for arbitrary user input, you
| can't usually guarantee that it'll go either way.
| Therefore, you need to do the numerical analysis given
| users inputs from some range, if you want to be at all
| rigorous. This will give results with error bounds, not
| exact bit patterns.
|
| If you want exact matches for your tests, maybe identify
| the bits that are essentially meaningless and write them
| to some arbitrary value.
|
| Edit: that said I don't think anybody particularly owns
| rigor on this front, given that most libraries don't
| actually do the analysis, lol.
| Dylan16807 wrote:
| > it's possible that the processor has facilities for
| bucketing the data such that results from such a dataset
| are the same or more accurate
|
| Not realistically. The processor very much needs to do what
| you tell it.
| Sesse__ wrote:
| > it's possible that the processor has facilities for
| bucketing the data such that results from such a dataset
| are the same or more accurate than the naive approach.
|
| It absolutely is not. Floating-point addition is completely
| defined (save for some of the bits in NaN results), and no
| CPU will start reordering your addition instructions in a
| way that changes the result. Even if it's an ordering that
| gives you better accuracy. Having SIMD do four (or more) of
| them in parallel does not change this.
| bee_rider wrote:
| Whether you care about the order or not depends on the intent
| of the programmer. I think the intent is usually to sum up
| the array for this sort of code. So, at first glance your
| first example looks more like a result of something like 0 +-
| 1e35 in double to me (added one to the exponent to avoid
| having to do math).
|
| The intent of the programmer always has to be interpreted by
| the language. I'm saying the default interpretation should be
| one that doesn't impose an ordering unless specifically
| requested. This is much more sympathetic to modern deeply
| parallel machines.
| FreakLegion wrote:
| Very often there's no particular intent, other than to be
| consistent. You haven't lived until you've traced
| discrepancies in an ML model deployed across a bunch of
| platforms back to which SIMD instructions each happens to
| support (usually squirreled away in some dependency).
| RobotToaster wrote:
| > this is likely why the compiler doesn't generate SIMD
| instructions to compute the sum!
|
| Isn't this the type of thing ffast-math is supposed to enable?
| not2b wrote:
| -ffast-math can be read as -fwrong-math if the accuracy of
| the answers matters. For some applications accuracy might not
| matter much. But the differences you get by pretending that
| floating point math is associative can be enormous, and
| scientific and engineering code is often carefully designed
| to take the likely scale of values into account, and
| rearranging it is not a good idea.
| Dylan16807 wrote:
| Assuming associativity is not anti-accuracy in general, it
| just gets in the way of certain clever algorithms.
| dzaima wrote:
| For code that's been explicitly manually optimized already,
| indeed you wouldn't want to turn on -ffast-math; and I'd
| hope anyone who has actually learned what benefits from
| manual optimization and how to properly do that would also
| at some point have read something about not coupling that
| with -ffast-math. But I really doubt such meaningfully-
| manually-optimized code is at all frequent, and where it
| isn't, it's a rather simple improvement (potentially
| using some subset of the flags to allow NaN or infinity if
| desired).
| Dylan16807 wrote:
| Fast math does a lot of risky things in addition to using
| basic math rules.
| bee_rider wrote:
| Maybe the "fun safe math" flag could be used?
| lscharen wrote:
| Surprisingly there are different algorithms for doing something
| as simple as summing up a list of numbers.
|
| The naive way of adding numbers one-by-one in a loop is an
| obvious way, but there are more sophisticated methods that give
| better bounds of the total accumulated error; Kahan
| summation[1] being one of the better-known ones.
|
| Like most things, it can get complicated depending on the
| specific context. For streaming data, adding numbers one at a
| time may be the only option. But what if one could use a fixed-
| size buffer of N numbers? When a new number arrives what should
| be done? Take a partial sum of some subset of the N numbers in
| the buffer and add that to a cumulative total? If a subset is
| chosen, how? Are there provable (improved) error bounds for the
| subset selection method?
|
| Fun stuff.
|
| [1] https://en.wikipedia.org/wiki/Kahan_summation_algorithm
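A minimal sketch of Kahan summation from [1], shown next to a naive loop for contrast (note that -ffast-math would optimize the compensation away):

```c
#include <stddef.h>

/* Kahan (compensated) summation: the correction term c recovers the
 * low-order bits lost by each addition. Must not be compiled with
 * -ffast-math, which assumes associativity and deletes the fix-up. */
float kahan_sum(const float *a, size_t n) {
    float s = 0.0f, c = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y = a[i] - c;  /* next term, corrected */
        float t = s + y;     /* low bits of y are lost here... */
        c = (t - s) - y;     /* ...and recovered into c */
        s = t;
    }
    return s;
}

/* naive one-by-one loop for comparison */
float naive_sum(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

With a large leading term, the naive loop drops every small addend while the compensated loop keeps them.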
| kardos wrote:
| As far as I know, xsum [1] more or less solves the problem
| completely: order invariant and exact, at less than 2x the
| cost of the naive summation. Further speeding it up may be
| the only avenue left for improvement.
|
| [1] https://gitlab.com/radfordneal/xsum
| lscharen wrote:
| I was not aware of this (2015) work -- very nice!
|
| A couple of pull-quotes from the paper to summarize:
|
| _Much work has been done on trying to improve the accuracy
| of summation. Some methods aim to somewhat improve accuracy
| at little computational cost, but do not guarantee that the
| result is the correctly rounded exact sum._
|
| _Many methods have been developed that instead compute the
| exact sum of a set of floating-point values, and then
| correctly round this exact sum to the closest floating-
| point value. This obviously would be preferable to any non-
| exact method, if the exact computation could be done
| sufficiently quickly_
|
| _Exact summation methods fall into two classes -- those
| implemented using standard floating point arithmetic
| operations available in hardware on most current
| processors, such as the methods of Zhu and Hayes (2010),
| and those that instead perform the summation with integer
| arithmetic, using a "superaccumulator"._
|
| _I present two new methods for exactly summing a set of
| floating-point numbers, and then correctly rounding to the
| nearest floating-point number. ... One method uses a
| "small" superaccumulator with sixty-seven 64-bit chunks,
| each with 32-bit overlap with the next chunk, allowing
| carry propagation to be done infrequently. The small
| superaccumulator is used alone when summing a small number
| of terms. For big summations, a "large" superaccumulator is
| used as well. It consists of 4096 64-bit chunks, one for
| every possible combination of exponent bits and sign bit,
| plus counts of when each chunk needs to be transferred to
| the small superaccumulator._
|
| _On modern 64-bit processors, exactly summing a large
| array using this combination of large and small
| superaccumulators takes less than twice the time of simple,
| inexact, ordered summation, with a serial implementation_
| pacaro wrote:
| Thanks for the summary. I kinda low-key love the idea of
| converting floats into a fixed point representation that
| covers the entire range represented by the float type. I
| mean the accumulator is only 32 KB, which is likely to be
| in L1 the entire time on modern hardware, and any given
| float is only going to need two 64 bit words, + 13 bits
| (12 bits for offset, and 1 for sign) to be represented in
| this scheme.
| petsounds wrote:
| The title of this article is a reference to "Happy Fun Ball", my
| favorite SNL skit: https://www.youtube.com/watch?v=GmqeZl8OI2M
| quotemstr wrote:
| Yeah. This effect is also why using a bare CALL without a RET
| ends up being a slow way to get the program counter on 32-bit
| x86, which lacks native PC-relative addressing: it confuses the
| CPU's tracking of calls and returns. Basically, you always want
| to balance them. No exceptions.
| gpderetta wrote:
| Wasn't an article posted a few days ago showing that CPUs
| special-case CALL+0 not to update the prediction stack, as it
| was commonly used to get the IP?
| rep_lodsb wrote:
| Yes: https://news.ycombinator.com/item?id=40767676
| quotemstr wrote:
| Okay, fair enough. It _used_ to be a pessimization before
| CPUs recognized that pattern. :-)
| AshamedCaptain wrote:
| The same thing was covered by Raymond Chen almost 2 decades ago:
| https://devblogs.microsoft.com/oldnewthing/20041216-00/?p=36...
| rkagerer wrote:
| He did, and that's a fantastic article which is worth the read
| and provides good context for interpreting this post.
|
| One thing this post adds is the simple rectification of
| replacing ret with another br instruction, so the pairs are
| again "mirrored", and you get to have your cake and eat it too
| - slightly faster code without breaking the branch predictor.
| electrodank wrote:
| As someone who has the old dead tree version of Intel's x86 and
| 64 architecture instruction set reference (the fat blue books),
| and in general as someone who carefully reads the data sheets
| and documentation and looks for guidance from the engineers and
| staff who wrote said data sheets, I always have
| reservations when I hear that "intuitively you would expect X
| but Y happens." There's nothing intuitive about any of this
| except, maybe, a reasonable understanding of the semi-
| conductive nature of the silicon and the various dopants in the
| process. Unless you've seen the die schematic, the traces, and
| you know the paths, there is little to no reason to have any
| sort of expectations that Thing A is faster than Thing B unless
| the engineering staff and data sheets explicitly tell you.
|
| There are exceptions, but just my 2c. Especially with ARM.
| neonsunset wrote:
| Respectfully, I disagree. CPU architecture optimization is in
| continuous dance with compiler optimization where the former
| tries to adapt to the patterns most commonly produced by the
| latter, and the latter tries to adjust its optimizations
| according to what performs the faster within the former.
|
| Therefore, it is not unreasonable to make assumptions based
| on the premise of "does this code look like something that
| could be reasonably produced by GCC/LLVM?".
|
| It is true that as cores get simpler and cheaper, they get
| more edge cases - something really big like Firestorm
| (A14/M1) can afford to have very consistent and tight
| latencies for all of its SIMD instructions regardless of the
| element/lane size and even hide complex dependencies or
| alignment artifacts wherever possible. But compare that with
| simpler and cheaper Neoverse N1, and it's a different story
| entirely, where trivial algorithm changes lead to significant
| slowdown - ADDV Vn.16B is way slower than Vn.4H, so you have
| to work around it. This is only exacerbated if you look at
| much smaller cores.
|
| LLVM and GCC deal with this by being able to use relatively
| precise knowledge of CPU's (-mtune) fetch, reorder, load and
| store queue/buffer depths, as well as latencies and
| dependency penalty cost of opcodes of the ISA it implements,
| and other details like loop alignment requirements, branch
| predictor limitations.
|
| Generally, it's difficult to beat such compilers in
| straight-line code over local data, assuming that whatever
| you are doing doesn't rely on concessions that a compiler is
| not allowed to make.
|
| Nonetheless, the mindset for writing a performant algorithm
| implementation is going to be the same as long as you are
| doing so for the same _class_ of CPU cores - loop unrolling,
| using cmovs, and scheduling operations in advance, or
| ensuring that should spills happen, the load and store
| operations have matching arguments - all of that will be
| profitable on AMD's Zen 4, Intel's Golden Cove, Apple's
| Firestorm or ARM's Neoverse V3.
| Sharlin wrote:
| "Intuitively" here should be taken to mean approximately the
| same as "naively" - as in, the intuition that most of us
| learn at first that CPUs work ("as if") by executing one
| instruction at a time, strictly mechanistically, exactly
| corresponding to the assembly code. The way a toy
| architecture on a first-year intro to microprocessors course
| - or indeed a 6502 or 8086 or 68000 - would do it. Which is
| to say, no pipelining, no superscalar, no prefetching, no
| out-of-order execution, no branch prediction, no speculative
| execution, and so on.
| rep_lodsb wrote:
| This seems no longer to be true for recent x86 processors:
| https://news.ycombinator.com/item?id=40767676
| bigstrat2003 wrote:
| Raymond Chen is a treasure. I'm so glad that Microsoft gives
| him the leeway to write that blog, I have learned a ton from
| it.
| allenrb wrote:
| Was summarizing this article for a group of friends who largely
| met during the Apple II days and wanted to repost a bit of that
| here:
|
| The optimized code at the end takes 94 _nanoseconds_ to sum an
| array of 1024 32-bit floating point numbers.
|
| In 94 nanoseconds, our old friend the 1 MHz 6502, would be just
| starting to consider signaling the memory chips that maybe they
| ought to try digging up the first byte of the first instruction
| in the program.
|
| Worth mentioning, that code is entirely dependent on running in
| cache. Otherwise even the mighty M1 Max in the post would still
| be stuck waiting on that first memory fetch. DRAM is slow. :-)
| cmrdporcupine wrote:
| Lucky our total L1 cache sizes are about as big as the entire
| addressable memory of the 6502.
|
| We truly live in amazing times.
| LoganDark wrote:
| Truly. And I'm also always amazed by how much slower (in
| terms of wall time) modern software is than it ought to be.
| CPUs sure don't feel three to four orders of magnitude faster
| than they were 50 years ago, because software has gotten four
| to five orders of magnitude more wasteful to compensate.
| Argh...
| samatman wrote:
| Everything I know, or need to know, about branch prediction, I
| learned from James Mickens.
|
| https://scholar.harvard.edu/files/mickens/files/theslowwinte...
| monitron wrote:
| This was a lovely article. I only wish the author hadn't kept
| switching units (between us and ns), making it hard to scan
| the tables for comparison.
| dmccarty wrote:
| 1000ns = 1us :)
| carl_dr wrote:
| We know.
|
| On a phone at normal reading distance, with the article's
| styling, it's really hard to tell the difference between n
| and u without zooming, and the decimal points get lost -
| scanning the tables is hard.
| cshimmin wrote:
| They also switched from C to Rust halfway through the article,
| which was a bit jarring...
| mrinterweb wrote:
| Classic SNL reference: "Do not taunt happy fun ball"
|
| https://www.youtube.com/watch?v=GmqeZl8OI2M
| rhussmann wrote:
| If happy fun branch predictor starts to smoke, seek shelter
| immediately.
| Terr_ wrote:
| > Happy Fun Ball has been shipped to our troops in Saudi Arabia
| and is also being dropped by our warplanes in Iraq.
|
| "What YEAR is it!?" /robin-williams-in-jumanji
| nayuki wrote:
| > Why do we need a special function return instruction?
| Functionally, BR LR would do the same job as RET. Using RET tells
| the processor that this is a function return.
|
| I'm not sure this is a good design. Those two opcodes perform the
| same logic function but have different branch prediction hints.
|
| Meanwhile, there are other cases where an existing instruction is
| reinterpreted with new semantics:
|
| * On x86, `XCHG eax, eax` is `NOP`.
|
| * On x86, `XOR reg, reg` is `MOV reg, 0` and breaks dependencies
| for the purposes of register renaming.
|
| * On x86, various examples of macro-op fusion and micro-op
| fusion.
|
| * On various RISC architectures, `ADD r1, r0, 1234` is `MOV r1,
| 1234`.
|
| * On some RISC architectures, conditionally branching if r0 == 0
| is the only way to express an unconditional branch.
|
| I see no reason why `BR LR` can't be the standard function return
| instruction and involve the branch predictor.
| zerohp wrote:
| In AArch64, `BR LR` and `RET` do not perform the same logic
| when FEAT_BTI (Branch Target Identification) is present.
| 38 wrote:
| > const float f = *data++;
|
| cursed code. people might say "oh that's normal idiomatic C", I
| don't care. I am glad Go does not allow code like this
| trealira wrote:
| I'm surprised they didn't try something less clever to optimize
| their code first. The assembly code could be rewritten like this,
| so the test requires one branch at the bottom of the loop instead
| of two (one to jump to the top and one to branch conditionally),
| as well as one instead of two ALU operations per loop for X1 (one
| subtraction to compare X1 to zero, and one to decrement X1, as
| opposed to using the result of the latter).
|         stp  x29, x30, [sp, #-16]!
|         fmov s0, #0.0
|         cbnz x1, exit        // Skip loop if x1 == 0.
| loop:   ldr  s1, [x0], #4
|         bl   foo
|         subs x1, x1, #1      // Decrement x1.
|         b.ne loop            // If x1 != 0, repeat the loop.
|         // Fall through to exit after the loop finishes.
| exit:   ldp  x29, x30, [sp], #16
|         ret
| foo:    // ...
|         ret
|
| Going further, you could just inline foo and omit the RET
| instruction without doing any trickery with mismatched BL and RET
| instructions.
|
| I haven't benchmarked this, though, so I don't actually know how
| much this speeds the code up.
| trealira wrote:
| Typo: that line with "cbnz" should be "cbz". The CBZ
| instruction branches to a label if a register is zero, and CBNZ
| branches if the register isn't zero.
| croemer wrote:
| Don't miss the 2023! The article really is already a little
| outdated: since Rust 1.78, the compiler uses some more aggressive
| loop unrolling (and a little bit of SIMD):
| https://godbolt.org/z/zhbobW7rr
|
| OP wrote: "Looking at the assembly, it's doing some loop
| unrolling.", linking to https://godbolt.org/z/Kv77abW6c, which is
| using "Rust Nightly", a moving target. By now, there's more loop
| unrolling.
|
| Loop unrolling started with Rust 1.59:
| https://godbolt.org/z/5PTnWrWf7
|
| Update: The code on Github shows they were using Rust version
| 1.67.0-nightly, 2022-11-27.
| croemer wrote:
| Rust 1.67.0, which is likely what OP was looking at, gives:
| https://godbolt.org/z/4Y61d9seh
|
| I ran the benchmarks locally on the same hardware, using latest
| nightly Rust 1.81 (aggressive loop unrolling): no difference.
| Same speed as 1.5 years ago.
___________________________________________________________________
(page generated 2024-07-03 23:00 UTC)