[HN Gopher] Do not taunt happy fun branch predictor
___________________________________________________________________
Do not taunt happy fun branch predictor
Author : mkeeter
Score : 169 points
Date : 2023-01-25 16:41 UTC (6 hours ago)
(HTM) web link (www.mattkeeter.com)
(TXT) w3m dump (www.mattkeeter.com)
| efitz wrote:
| I think that the days when hand-tuned assembly language
| outperforms compiler-generated code are largely behind us (let
| loose the contrary anecdotes).
|
| Compilers and microprocessors are way more complex than they were
| back in the 80s or 90s, and compiler engineers know way more
| about how instructions are actually executed than the vast
| majority of programmers.
| jcalvinowens wrote:
| I think about it like this:
|
| If I ask you to write assembler faster than what the compiler
| emits for one very specific CPU model, you'll pretty much
| always be able to do that (given enough time).
|
| But as CPUs become more complex, and compilers get better,
| achieving that win requires more and more "overfitting" to
| implementation details of the specific hardware, in ways that
| are not helpful or even counterproductive on older or newer
| models of the same CPU family.
|
| It's still worth it sometimes, of course. It's just more work
| and a lower value proposition on average.
| ravi-delia wrote:
| The only exception (pointed out in this post) is leveraging
| SIMD instructions, which aren't nearly as well exploited by
| compilers. I doubt, of course, that this will last all that
| long, but for now there are too many subtle differences that
| _might_ matter and totally different instruction sets between
| cpus and even generations.
| makapuf wrote:
| ... and yet in this article, the author beats compiled C code by
| 10x with hand-tuned assembly. (By using SIMD and unrolling, which
| the compiler did not. Granted, the compiler's linear code is
| faster than hand-made linear assembly.)
| zokier wrote:
| The author beats the compiler because of floating-point
| constraints; with -ffast-math the compiler vectorizes the code. I
| don't have arm64 hardware to test the result, but it's probably
| again pretty fast: https://gcc.godbolt.org/z/xvjY8P4cM
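|
| Roughly the kind of loop in question (a sketch, not the article's
| exact code): with plain -O3 the compiler keeps the scalar
| dependency chain, while -O3 -ffast-math lets it reassociate the
| sum and emit vector adds.
|
|     float sum(const float *data, unsigned n) {
|         float total = 0.0f;
|         for (unsigned i = 0; i < n; i++)
|             total += data[i];  /* serial chain unless reassociation is allowed */
|         return total;
|     }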
| rowanG077 wrote:
| I used to think that was true. Then I had a crypto course in
| uni which required us to write and optimize 3 different hashing
| and encryption algorithms.
|
| I was stunned by the first one, which I first did in C and then
| moved to ASM. The C code was pretty straightforward. But it was
| trivial to beat in ASM by something like 120%.
|
| That course taught me how bad compilers (or rather GCC) are at
| high-level register optimization and using the barrel shifter.
| Basically all the performance I squeezed out of ASM was because
| the compiler just wasn't figuring basic stuff out. Mind you, I
| was (am) not an ASM expert. That piece of code was the first
| thing I ever wrote in ASM. And yet I was able to easily beat a
| decades-old, world-class compiler.
| wolf550e wrote:
| compiler autovectorization is poor, people often outperform it
| when writing SIMD code. intrinsics are so low level they might
| as well be assembly.
| JonChesterfield wrote:
| I've had really good results from the two in LLVM (one works
| on loops, one within basic blocks). Optimal loop body when
| the pointers had alignment metadata attached, though at the
| time it failed to unroll the tail.
|
| Using intrinsics with the control flow in C or C++ works
| really well - you get the right instruction selection from
| the intrinsics, easy to reason about control flow and the
| compiler deals with register allocation (and possibly
| instruction scheduling) which are a pain to do by hand.
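|
| A minimal sketch of that pattern (assuming an AArch64 target and
| the usual arm_neon.h intrinsics; the function name is
| illustrative, not from the article): the loop and tail handling
| stay in plain C, the adds are explicit 4-wide intrinsics, and the
| compiler handles registers and scheduling.
|
|     #include <arm_neon.h>
|     #include <stddef.h>
|
|     float sum_f32(const float *data, size_t n) {
|         float32x4_t acc = vdupq_n_f32(0.0f);
|         size_t i = 0;
|         for (; i + 4 <= n; i += 4)
|             acc = vaddq_f32(acc, vld1q_f32(data + i)); /* 4 lanes at a time */
|         float total = vaddvq_f32(acc);                 /* horizontal add */
|         for (; i < n; i++)                             /* scalar tail */
|             total += data[i];
|         return total;
|     }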
| iamsmooney wrote:
| Title is a reference to this old SNL skit:
| https://www.youtube.com/watch?v=GmqeZl8OI2M
| sophacles wrote:
| It's definitely a classic - if you haven't seen it I'd
| recommend it even if it wasn't related to the article :D
| bioint7812 wrote:
| More translation:
|
| > for reasons
|
| https://www.urbandictionary.com/define.php?term=for%20reason...
| jerf wrote:
| This is another good example of how our CPUs are in many ways
| specialized C processors. C is a structured programming language
| that uses functions, so our processors like functions. If you
| jump out of that paradigm, even if the assembly instructions
| nominally seem to allow it, you'll run more slowly. Even when it
| seems like what you're offering is a shortcut to the CPU.
|
| This is neither praise nor criticism of the current CPU paradigm;
| it's just something you need to understand if you want the best
| performance out of our machines.
|
| A different paradigm, like a concatenative-paradigm-based
| program, might naively be more inclined to compile into code that
| looks more like what the author tried, jumping between
| implementations of the stack operators without it actually being
| "functions". One can imagine processors that would be "happier"
| with that, and would be bothered by things that look like
| function returns more. But that's not the CPUs we have.
| masklinn wrote:
| Author was not jumping out of the paradigm here, they were
| deliberately misusing constructs specialised for the paradigm
| (br/ret). That's like saying the toolcase is a specialised
| screw processor because you're trying to drive nails using a
| screwdriver and it does not go well.
|
| And C is hardly the first or only procedural language.
| ablob wrote:
| This has more to do with the ISA than C I'd assume. C was built
| as an "easy assembler". Furthermore, the computer doesn't
| really care about structure (in the structured programming
| sense).
|
| In this case the implementation of the ISA stores information
| on each 'ret' depending on some 'bl' that came before. One can
| imagine that a different optimization technique which actually
| leads to a speedup exists. Without a branch-predictor, the
| program that Matt wrote might've been faster.
|
| Imo, this has nothing to do with paradigms, but how control-
| flow interacts with the system. This interaction is different
| between different implementations of an architecture.
|
| Code written for Cache-locality and paradigms that work well
| with it, for example, only became a "winner" after caches were
| widely implemented. Before that, the cost of sequential array
| access and random array access was identical. With caches, a
| linear search for an element can be faster than a binary
| search, even though it requires a lot more memory accesses.
| Thus, the optimal implementation for finding an element in an
| array is now dependent on size as well. (I.e. after an ordered
| array reaches a certain size, binary search should become
| faster on average).
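|
| A rough sketch of the two searches being compared (illustrative
| code, not from the article; the crossover size is hardware
| dependent):
|
|     #include <stddef.h>
|
|     /* Branch-light scan: friendly to caches and prefetchers. */
|     ptrdiff_t linear_search(const int *a, size_t n, int key) {
|         for (size_t i = 0; i < n; i++)
|             if (a[i] == key)
|                 return (ptrdiff_t)i;
|         return -1;
|     }
|
|     /* O(log n) comparisons, but each probe is a likely cache miss. */
|     ptrdiff_t binary_search(const int *a, size_t n, int key) {
|         size_t lo = 0, hi = n;          /* a must be sorted */
|         while (lo < hi) {
|             size_t mid = lo + (hi - lo) / 2;
|             if (a[mid] < key)       lo = mid + 1;
|             else if (a[mid] > key)  hi = mid;
|             else                    return (ptrdiff_t)mid;
|         }
|         return -1;
|     }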
| int_19h wrote:
| The point is that the ISA is assuming that BL and RET opcodes
| come in pairs - which is an assumption that does reflect what
| structured code is typically compiled down to. Going by the
| semantics of the opcodes themselves, there's no reason why
| RET should be treated differently from a simple indirect jump
| here.
| masklinn wrote:
| > there's no reason why RET should be treated differently
| from a simple indirect jump here.
|
| There's _its very existence_. If you're looking for a
| general purpose indirect jump, use that.
| tsimionescu wrote:
| As far as I understand, the ISA explicitly intends for BL
| and RET to come in pairs. The only point of RET is to jump
| to the address last stored by BL. If you don't need this
| behavior, there's no reason to use RET - as the article
| itself shows, B x30 does the exact same job, and doesn't
| come with the extra assumptions.
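|
| The two encodings side by side (a sketch using GCC file-scope asm
| and made-up symbol names, assuming an AArch64 target): both jump
| through x30, but only ret carries the "pop the return-address
| stack" prediction hint.
|
|     __asm__(
|         ".text\n"
|         ".global jump_via_ret\n"
|         "jump_via_ret:\n"
|         "    ret\n"      /* jump to x30; the predictor pops its RAS */
|         ".global jump_via_br\n"
|         "jump_via_br:\n"
|         "    br x30\n"   /* same jump, but no return-stack interaction */
|     );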
| classichasclass wrote:
| I agree, because this semantic difference between br x30 and
| ret doesn't exist in many other RISCs which also mostly run
| C-like languages. Power just has blr, and MIPS jr $ra, for
| example. This feels more like a well-intentioned footgun in
| the ISA.
| shadowofneptune wrote:
| The potential population for that footgun is compiler
| writers, who should safely fall into 'know what they are
| doing' territory.
| flohofwoe wrote:
| It's just a natural side effect of software and hardware
| forming a symbiosis and that they cannot really be separated.
| New software is (or should be!) built to run well on existing
| hardware, and new hardware is built to run existing software
| well. Besides, subroutine call instructions were a thing long
| before C became mainstream, when manual assembly coding ruled
| supreme.
| [deleted]
| brigade wrote:
| If you want to encode a series of indirect jumps between
| concatenated functions, you can do that already and the return
| address stack won't get involved; you'll simply get the normal
| branch predictors.
|
| But generalized branch prediction is expensive (multiple
| entries in the BTB hashed by the previous program flow, and
| mispredicts if the next function in the chain changes); the
| point of an RAS is that it's a _very_ cheap way to keep some
| 100% predictable branches from using those general resources.
| Concatenation pretty much requires branches without unique
| characteristics, so no fast path shortcuts and everything is a
| bit slower.
| pranith wrote:
| Great investigative work! The stack structure you refer to here
| is called the Return Address Stack (RAS).
| ShroudedNight wrote:
| I feel like this treatment is incomplete without having tested
| the scenario where the unmatched ret is replaced with a br lr.
|
| EDIT: Reading the documentation after the fact, it appears that
| that was what br x30 was - naively I had interpreted the hex as a
| fixed offset to a label.
| snerbles wrote:
| While I was in undergrad I toyed around with abusing the branch
| predictor on a few different machines, compiling something like
| the following with optimizations off - it performs an identical
| computation regardless of branch outcome:
|
|     void branchLoop(unsigned int condition, unsigned int &sum) {
|         // put something suitably large here
|         unsigned int loopCount = 0x0fffffff;
|         unsigned int i;
|         // compile with -O0 or this gets optimized away
|         for (i = 0; i < loopCount; i++)
|             if ((i & condition) == 0)
|                 sum++;
|             else
|                 sum++;
|     }
|
| The Core Duo on my Thinkpad T60 had some very distinct slowdowns
| on certain bit patterns, which were not repeatable on the handful
| of other CPUs I had access to at the time. I haven't tried this
| with more modern CPUs, however.
| gpderetta wrote:
| Predictors are getting better and better at recognizing long
| patterns (sometimes at the cost of not being optimal with short
| patterns).
| Malic wrote:
| Ow. My head hurts.
|
| And this is why optimizing compilers are some of the most complex
| programs there are. (or so I have been taught)
| shadowgovt wrote:
| It's also why the modern rule of thumb is "don't optimize by
| writing your own assembly."
|
| The rule is a boiled-down version of the larger notion "Don't
| optimize by writing your own assembly, _because even with
| domain knowledge of the problem you're trying to solve, you're
| probably not more clever than the engineer-decades that went
| into building your compiler toolchain and processor
| architecture, unless you're an expert in both fields, in which
| case good luck and shoulder that maintenance burden._"
|
| The rule of thumb drops a lot of detail on the ground but is a
| good first approximation.
| astrobe_ wrote:
| ... Which is kind of worrying; is it really a good thing that
| processors are so complex that you need "decades" of expertise to
| use them fully? Bottom line, you end up with chaotic (in the
| "sensitive to the slightest change" sense) performance behavior.
|
| OTOH this reminds of another saying, "don't roll your own
| crypto". But all those "don't" are a bit frustrating.
| olliej wrote:
| But you don't need decades of experience: we have compilers
| and optimizers to do that.
| bioint7812 wrote:
| It's interesting that the impedance the author was
| experiencing was one of the CPU incorrectly "speculating"
| about the intent. We as readers are left to speculate
| about the problem being solved by the author.
|
| Based on the content of his recent articles, we could
| assume he is continuing his development of a JIT for a
| Rust port of his graphics engine. Given that assumption,
| I would argue that the compiler writers are lagging here
| --specifically lack of dynamic compilation. For example,
| is there a JIT compiler for the Rust language? I was
| thinking about reproducing his experiment in SBCL, which
| does have dynamic compilation--although it wouldn't be a
| real "apples" to "apples" comparison because my work
| machine is an x86_64.
| olliej wrote:
| JITs have very different constraints (but also many
| advantages) vs AOT compilers, so I don't think in the
| general case a language compiler can "support" JITs
| directly. llvm/clang for instance have a jit mode... for
| compiling C/C++, not some other language.
|
| Also obviously the compiler writers (AOT or JIT) need to
| know about and understand very minute details of how a
| cpu is behaving. I was responding to a person saying it
| was worrying that devs need that kind of knowledge and
| experience by saying that that isn't true because the
| tools exist (I feel that there's some analogy to what
| knowledge you need for a car: changing oil, servicing
| engine, replacing engine, designing an engine...). Once
| you are writing a JIT you're in the "making the tools"
| group, so you now need to know more than the vast
| majority of devs (not just more than the "average" dev);
| that's inescapable.
| miloignis wrote:
| I've seen people get frustrated by the "don't"s before, but
| I think that's generally taking the first degree
| approximation too literally. Feel free to hand-write
| assembly or roll your own crypto, but _don't_ depend on it
| for anything serious unless you are an expert or have it
| reviewed by one. Doing so for learning and fun is fine, if
| that's clearly called out such that no one accidentally
| depends on it for something serious. There's only one way
| to become good at something, and that's good practice!
|
| In a professional setting there's a responsibility to the
| end user which generally precludes doing these dangerous
| things - that is, one should feel free to take up
| woodworking as a hobby but shouldn't offer to build
| someone's house unless you're a licensed professional.
| [deleted]
| pjdesno wrote:
| Maybe better expressed as "don't write your own assembly
| unless you know why it might be better than the compiler".
|
| Trying to beat the compiler at optimizing normal code is kind
| of like trying to beat a calculator with paper and pencil -
| computers are just better at that sort of thing than people
| are.
|
| One use case is where you want to use bizarre CPU functions
| (popcount, encryption, load CR3, etc.) that no one's taught
| the compiler how to generate, although for some of them you
| might be better off using compiler intrinsics.
|
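| For instance (a sketch using the GCC/Clang builtin; the wrapper
| name is made up): __builtin_popcountll compiles down to a single
| population-count instruction where the target has one, with no
| hand-written assembly needed.
|
|     #include <stdint.h>
|
|     /* Count set bits; emits POPCNT / CNT when the target supports it. */
|     static inline int bits_set(uint64_t x) {
|         return __builtin_popcountll(x);
|     }
|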
| Another is when you're dealing with things _underneath_ the
| language abstraction, like the Boost co-routines mentioned in
| a link a few comments above. Of course, if the language folks
| decide to add the capability you want (e.g. C++20
| coroutines), you're back to being better off using the
| compiler.
|
| Finally there are pedagogical reasons, e.g. sometimes I show
| my classes the world's simplest "Hello World", using a few
| lines of assembler to invoke the write and exit syscalls.
| JonChesterfield wrote:
| The boost coroutines mentioned are not the same thing as
| the C++20 ones. Boost captures the stack, that's what the
| register shuffling is for. C++20's are a compiler transform that
| moves some state onto the heap, but probably not the whole
| stack, and builds a switch-dispatch-style thing to
| transform the control flow. This is why C++20 coroutines come
| with co_* annotations and Boost's don't.
| avgcorrection wrote:
| I will never understand the trend of "using quotes around
| things". Which is a shorthand version for "using quotes
| around things that I wanted to point to and say, hey, this is
| something that "goes over here", you know, inside these
| quotes, to make sure that you understand exactly what I'm
| delimiting, since _using commas, and semicolons, and colons
| wouldn't fit for some reason. Oh look the thing that I'm
| quoting now consumes 80% of this paragraph. But this is way
| better than just saying "the modern rule of thumb is to not
| optimize by writing your own assembly." Because then it isn't
| 100% clear that the rule of thumb is delimited by (exactly)
| "[do] not optimize by writing your own assembly." Ya see?_ "
| RodgerTheGreat wrote:
| I would just phrase it as "if you optimize by writing your
| own assembly, don't expect your program or its performance to
| be portable."
| bob1029 wrote:
| The state of modern compilers is pretty staggering to me. I've
| seen some code folding in RyuJIT that makes me feel inferior as
| a developer.
|
| You've got a few compilers (Java, .NET, et al.) which are
| capable of re-compiling hot path code during live execution and
| then seamlessly transitioning to those paths. This
| recompilation can be based upon the statistics of the live
| process, so it's almost like a sort of adaptive AI. Which paths
| are hot in production does not need to be known at compile time
| with these approaches.
| JonChesterfield wrote:
| There's a sense in which they're complicated. It's a sequence
| of graph transforms which mostly deal with non-polynomial time
| problems using heuristics, where mistakes in the transforms can
| manifest quite a long way away from the error. There's a
| significant risk that they're implemented in languages unique
| to that compiler toolchain, as compiler devs are quite prone to
| solving problems by writing compilers.
|
| There's also a sense in which they're really simple. The input
| format and output format are (usually) well defined. The user
| interface is largely printing things to stderr and giving up,
| possibly in a retry loop when there's an editor involved. The
| program dependency graph is often quite small so the bug you're
| looking at is probably in the source code you checked out.
| Security is not the dominant concern you have elsewhere.
| gpderetta wrote:
| Yes, never push and ret. Here is something I wrote (/me checks
| calendar) more than 15 years ago about optimizing coroutine
| control flow:
| https://www.crystalclearsoftware.com/soc/coroutine/coroutine...
| water-your-self wrote:
| If an HN reader wanted to play around with similar digging, what
| would be the essential tools to be aware of and where best could
| he start?
|
| Assuming prior knowledge of assembly/C but without much
| experience decompiling or testing speed.
| shoo wrote:
| Learn how to use a decent profiler. If you're running Linux,
| that's probably perf:
|
| https://man7.org/linux/man-pages/man1/perf.1.html
|
| https://www.brendangregg.com/perf.html
|
| Here's a fun article from the cloudflare blog that gives an
| example of using perf to diagnose performance of a small
| utility: https://blog.cloudflare.com/when-bloom-filters-dont-
| bloom/
|
| Matt Godbolt's compiler explorer is also worth checking out:
| https://godbolt.org/
| dcow wrote:
| Your compiler can spit out assembly, you just need to know how
| to read it. Sounds like the author was also using Xcode
| Instruments
| https://help.apple.com/instruments/mac/current/#/dev7b09c84f...
| to check cpu counters. And they were using criterion
| https://crates.io/crates/criterion to microbenchmark.
|
| My guess would be that the author is porting some C code to
| Rust and making sure not to regress performance along the way
| (probably hopefully trying to increase it). Likely their
| program was written in Rust and the section they were trying to
| optimize called some old C code. Sounds like they rewrote the
| section in Rust since Rust <-> C ffi calls break out of the
| happy realm the rust compiler likes and end up causing a
| performance hit themselves. You can write inline assembly in
| Rust using the macro https://doc.rust-
| lang.org/reference/inline-assembly.html.
| CalChris wrote:
| It took me a while to understand the mismatched bl/ret pairs
| because I'm used to reading matched bl/ret pairs. This confusion
| has to be similar to what the silicon is 'thinking': _Human, I
| see matched bl/ret pairs all the time and I'm good at them. Why
| are you giving me mismatched pairs? I'll do the right thing but
| I'm not so good at them._
|
| Still, this seems like function inlining. But why not just inline
| and use a regular branch loop? Is _foo()_ also being called from
| elsewhere? Is space at a premium?
| masklinn wrote:
| > Upon seeing this program, it's a common reaction to ask "why
| is foo a subroutine at all?"
|
| > The answer is "because this is a didactic example, not code
| that's trying to go as fast as possible".
| ogogmad wrote:
| Minor erratum: Floating point addition actually _is_ commutative;
| it's in fact non-associative.
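|
| A minimal illustration of the non-associativity (not from the
| article): the same three addends, grouped differently, give
| different results because 1.0 is absorbed by the large
| intermediate value.
|
|     #include <stdio.h>
|
|     int main(void) {
|         double a = 1e20, b = -1e20, c = 1.0;
|         printf("(a + b) + c = %g\n", (a + b) + c);  /* prints 1 */
|         printf("a + (b + c) = %g\n", a + (b + c));  /* prints 0 */
|         return 0;
|     }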
| lmm wrote:
| It's not commutative; -0.0 + 0.0 = -0.0 but 0.0 + -0.0 = 0.0.
| dekhn wrote:
| with some significant exceptions, such as NaNs.
| recursive wrote:
| Can you think of some `x` where `x + NaN` is not identical to
| `NaN + x`? I can't.
| dekhn wrote:
| you mean, like 1 + NaN = NaN and NaN + 1 = NaN, but NaN !=
| NaN? (I'm not a numerical expert, just repeating what
| others have told me)
| CountSessine wrote:
| Yes. A NaN in IEEE754 has all 1's in the exponent, and
| then the high bit of the mantissa determines whether it's
| quiet or signalling, but the rest of the mantissa is/can
| be a "payload".
| bruce343434 wrote:
| Sign bit determines signalling
| robocat wrote:
| Please double check your facts before disagreeing with
| somebody so abruptly.
|
| Sign bit is NOT the signalling/quiet bit. Bit 51 (edit or
| bit 50 - damn ***** IEEE for not publishing important
| standards for free public access) is according to the
| first result I looked at:
| https://craftinginterpreters.com/optimization.html
|
| Edit 2 from IEEE 754 (2008 version):
|     6.2.1 NaN encodings in binary formats
|
|     This subclause further specifies the encodings of NaNs as
|     bit strings when they are the results of operations. When
|     encoded, all NaNs have a sign bit and a pattern of bits
|     necessary to identify the encoding as a NaN and which
|     determines its kind (sNaN vs. qNaN). The remaining bits,
|     which are in the trailing significand field, encode the
|     payload, which might be diagnostic information (see above).
|
|     All binary NaN bit strings have all the bits of the biased
|     exponent field E set to 1 (see 3.4). A quiet NaN bit string
|     should be encoded with the first bit (d1) of the trailing
|     significand field T being 1. A signaling NaN bit string
|     should be encoded with the first bit of the trailing
|     significand field being 0. If the first bit of the trailing
|     significand field is 0, some other bit of the trailing
|     significand field must be non-zero to distinguish the NaN
|     from infinity. In the preferred encoding just described, a
|     signaling NaN shall be quieted by setting d1 to 1, leaving
|     the remaining bits of T unchanged.
|
|     6.3 The sign bit
|
|     When either an input or result is NaN, this standard does
|     not interpret the sign of a NaN. Note, however, that
|     operations on bit strings--copy, negate, abs, copySign--
|     specify the sign bit of a NaN result, sometimes based upon
|     the sign bit of a NaN operand. The logical predicate
|     totalOrder is also affected by the sign bit of a NaN
|     operand. For all other operations, this standard does not
|     specify the sign bit of a NaN result, even when there is
|     only one input NaN, or when the NaN is produced from an
|     invalid operation.
|
|     When neither the inputs nor result are NaN, the sign of a
|     product or quotient is the exclusive OR of the operands'
|     signs; the sign of a sum, or of a difference x-y regarded
|     as a sum x+(-y), differs from at most one of the addends'
|     signs; and the sign of the result of conversions, the
|     quantize operation, the roundToIntegral operations, and the
|     roundToIntegralExact (see 5.3.1) is the sign of the first
|     or only operand. These rules shall apply even when operands
|     or results are zero or infinite.
|
|     When the sum of two operands with opposite signs (or the
|     difference of two operands with like signs) is exactly
|     zero, the sign of that sum (or difference) shall be +0 in
|     all rounding-direction attributes except
|     roundTowardNegative; under that attribute, the sign of an
|     exact zero sum (or difference) shall be -0. However, x + x
|     = x - (-x) retains the same sign as x even when x is zero.
|
|     When (a x b) + c is exactly zero, the sign of
|     fusedMultiplyAdd(a, b, c) shall be determined by the rules
|     above for a sum of operands. When the exact result of
|     (a x b) + c is non-zero yet the result of fusedMultiplyAdd
|     is zero because of rounding, the zero result takes the sign
|     of the exact result. Except that squareRoot(-0) shall be
|     -0, every numeric squareRoot result shall have a positive
|     sign.
|
| I.e. you are definitely wrong. The sign bit can be + or -
| for NaN (presumably a side-effect of the encoding for
| +/-Infinity ). And then that leads to a bunch of arse
| (section 6.3) because the spec needs to decide what
| happens to the sign bit in a bunch of different
| situations. PS: fucking infinity. Infinity should have
| been NaN. Infinity ≠ Infinity, except in the
| egghead-land IEEE (side note: egghead is a compliment
| IMHO). Mind you, easy to see mistakes in retrospect, but
| corner cases are shit in programming. I do like NaN,
| although reading comments here, and the IEEE spec, forces
| me to learn how little I know about NaN encodings. Oh, and
| any NaN should equal any other NaN. Mathematically
| obviously not, but logically yes and IEEE is for
| programming. NaN is already defined as a nonsense, so at
| least keep the nonsense consistent. Changing _if =_ to
| _if [?]_ should not introduce subtle logic bugs.
|
| Ranting edit #755: and while we are at it, -0 is an
| abomination in the eyes of the Great Architect in the
| Matrix - it should never have been allowed - perhaps -0
| should have been NaN with signalling bits in the exponent
| (even though that would prevent some language virtual
| machine optimisations where 53 bits of NaN get used to
| pack other information, but the win would be compelling
| because reducing bugs due special cases is huge IMHO).
| How many developers understand IEEE corner cases: fuck
| all in my long experience.
| bruce343434 wrote:
| The source I had in my head as I was replying was
| https://posithub.org/docs/Posits4.pdf pages 31/32 which
| implies that the sign bit is responsible for signalling-
| ness. Neither of these are a primary source however.
|
| A random stackoverflow answer without sources seems to
| confirm your pov. Now I don't know what to believe. How
| is the sign bit used in NaNs? Would they really waste
| that bit?
| stephencanon wrote:
| Hi, former IEEE 754 committee member here: languages and
| architectures are allowed to choose how they encode
| signalingness, but they cannot use the sign bit to do it.
| The most common choice is to use the high-order bit of
| the significand field, but most choices you might make are
| represented by some architecture.
| robocat wrote:
| Awesome! So the important part of section 6.2.1 (2008) is that
| the "should" is not a "must"?
|
|     "A quiet NaN bit string should be encoded with the first bit
|     (d1) of the trailing significand field T being 1. A
|     signaling NaN bit string should be encoded with the first
|     bit of the trailing significand field being 0."
|
| Or is it that some implementations before 2008 used other
| bits?
|
| PS: I might be complaining, but I do sincerely thank you
| for your hard work. I certainly would not want to be on a
| standards committee, so I sincerely appreciate the
| efforts of the committed whom fight so hard to make
| things better. Your comment just goes to show how many
| corner cases there are to the corner cases!!
| MaulingMonkey wrote:
| If x is another NaN with a different payload, the result is
| likely a NaN with a payload corresponding to the left side
| of the addition.
| raphlinus wrote:
| That is correct, here is a playground link which shows
| that result: https://play.rust-
| lang.org/?version=stable&mode=debug&editio...
| stephencanon wrote:
| This varies tremendously across architectures and uarches
| and compilers. The result will be a quiet NaN, and you
| can't say anything beyond that.
| jcranmer wrote:
| It depends on your model of NaN. If you think there is only
| one NaN value, then the two values return the same result.
| If there are multiple distinct NaN values (i.e., you
| distinguish between NaNs with different payloads), then the
| resulting payload of an arithmetic operation is... not
| well-specified.
|
| Most hardware architectures (x86 and ARM fall into this
| category) pick the rule that the payload is that of the Nth
| operand (usually the first, sometimes the second) when multiple
| inputs are
| NaN. I believe there's some hardware where the lesser of
| the payloads (converted to an integer) is picked instead.
| RISC-V dispenses with NaN payload propagation entirely.
| There is theoretically the ability for hardware to generate
| a new unique NaN with the hardware address of the
| instruction in the payload, but I don't believe any
| hardware actually does it.
|
| Most programming languages do not generally model NaN with
| greater fidelity than "there is a NaN value" or maybe
| "there is a distinction between qNaN and sNaN."
| [deleted]
| marcosdumay wrote:
| Yes, it's commutative. But it is associative.
|
| Make a number with 3 bits of mantissa, and go add a thousand
| repetitions of 1. You can get hundreds of different results
| depending on the order you add them up.
| wizzwizz4 wrote:
| > _But it is associative._
|
| You're describing a non-associative operation.
| ogogmad wrote:
| Exactly. See the definition:
| https://mathworld.wolfram.com/Associative.html
| [deleted]
| mkeeter wrote:
| Thanks, fixed!
| titzer wrote:
| > More specifically, the branch predictor probably keeps an
| internal stack of function return addresses, which is pushed to
| whenever a bl is executed. When the branch predictor sees a ret
| coming down the pipeline, it assumes that you're returning to the
| address associated with the most recent bl (and begins
| prefetching / speculative execution / whatever), then pops that
| top address from its internal stack.
|
| There's no need for "probably" here. The micro-architectural
| mechanism is known as a return stack buffer[1] and is generally
| separate from the branch predictor unit, though the processor may
| make use of indirect branch prediction entries for returns as
| well.
|
| [1] It is, indeed, a tiny little stack of return addresses and
| indeed, the article hit performance issues by misaligning it. The
| (Intel chips') RSB is behind the Retbleed vulnerabilities.
| londons_explore wrote:
| Observation: Almost any code, when micro-optimized, can gain
| about 10x performance.
|
| So, if we had the time and energy, we could probably make all of
| computing at least 10x faster.
|
| But we don't have the time or energy to dedicate that much effort
| to every line of code... But perhaps AI does?
| fluoridation wrote:
| Not true by a long shot, unfortunately. Getting even a 100%
| performance gain out of manual optimization is unusual. Usually
| the only way to get significant gains is by either switching
| algorithms or by relaxing problem requirements.
| david2ndaccount wrote:
| He didn't gain 10x from a micro-optimization; he gained that
| much by converting it to use SIMD, which is a macro
| optimization. You usually have to structure your program in
| such a way to be SIMD friendly. In this case it was already
| simd-friendly (adding a large number of floats).
| Jensson wrote:
| > In this case it was already simd-friendly (adding a large
| number of floats).
|
| So it was micro optimization after all!
| lmm wrote:
| It was a micro optimization because it was a toy example.
| Doing floating-point arithmetic (and not already using BLAS
| etc.) is very niche in real code.
| Jensson wrote:
| It is still micro optimization to ensure that those
| instructions are used when the compiler is too dumb to
| use them. You can say that they are niche, but that is
| different from it being micro optimization.
| shoo wrote:
| > Almost any code, when micro-optimized, can gain about 10x
| performance.
|
| well, it depends on where you start from. the speedups can
| become arbitrarily impressive sounding when the starting point
| is arbitrarily inefficient.
|
| e.g. if you're starting with a python script that wasn't
| written with performance in mind and has ended up being
| compute-bound in pure python number crunching code, if you
| rewrite the thing in C while thinking a little about
| appropriate data structures and memory allocation/access --
| i.e. replacing a festival of python dict lookups and very
| frequent memory allocation for tiny intermediate results with
| indexing into preallocated arrays or so on -- it's quite common
| to see speedups of 500x - 1000x. this is before micro-
| optimisation, before introducing SIMD or multi-threading or
| setting the compiler to build for your exact CPU or so on.
| Pet_Ant wrote:
| Whenever you read/hear/think "AI" replace it with "a
| probabilistic process". If you can validate the output then
| it's really a beam search (
| https://en.wikipedia.org/wiki/Beam_search ) of the solution
| space, and not really "AI".
|
| If you can't, then it's really a crap shoot as to whether the
| output is at all valid. If we want to use more opaque
| processes, I think we need more transparent outputs. If a
| neural net can produce a machine checkable proof or supply it
| with the optimisation that's great, but otherwise it's just hot
| air.
| celeritascelery wrote:
| I don't think that is generally true. He only got a large
| speedup because he used SIMD, which has nothing to do with
| micro optimization. I would say a better take away is that
| micro optimization is really hard and you will often make
| things worse if you don't know what you are doing. Even if you
| do, you are only going to get a few percentage points.
| thethirdone wrote:
| My experience micro optimizing things is that even without
| SIMD, most software can get at least a 5x in performance.
| With SIMD, you can often get 50x improvements.
|
| The reason why people think "Even if you do, you are only
| going to get a few percentage points." is because it
| generally takes 5-50x the developer time to optimize such
| code. If it takes half a day to write naive code to do
| something like validate utf8, it probably takes ~25 workdays
| to make a fast SIMD version. If you instead spend an extra
| half a day, there is a good chance you get a 10-50% speedup
| using normal code.
| mumumu wrote:
| This is true on a few streaming applications (such as
| parsing).
|
| And most of the speedup is because of tricks to avoid doing
| branches. There is a great blog post from one of the authors
| of JSON SIMD discussing this.
|
| I'm on mobile, there is a link for the blog post on the
| simd JSON github repository.
| bee_rider wrote:
| It can also be weird and non-obvious. For example, depending
| on the instruction mix and hardware it might not be worth
| getting dinged by AVX-512 clocks. And if you are, say,
| writing the UTF validation code as a library (more
| favorable to put effort into a library!) you might not know
| where the code is being used, so you might not even know
| the instruction mix...
| FullyFunctional wrote:
| Well, _of course_, the Return Address Stack (RAS) predictor
| maintains its own call stack and you need to understand how it
| works. However, there's a subtler way to break it: recurse too
| deeply. The RAS only has a fixed, small, and _implementation
| dependent_ length. If you use deep recursion with non-trivial
| control flow (in particular multiple call sites), then the RAS
| will start missing once you return from beyond that limit.
|
| Another consequence of the RAS is that co-routines switching is
| more expensive than they might appear at first. RISC-V has
| encoding hints to mark call(jal)/returns that are actually co-
| routine switching but the full cost can't be avoided.
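|
| A sketch of the deep-recursion case (illustrative types and
| names): a plain tree walk with two call sites per frame will run
| past a typical RAS, which only holds on the order of a few dozen
| return addresses, and every return beyond that depth gets
| mispredicted.
|
|     struct node { long value; struct node *left, *right; };
|
|     long walk(const struct node *n) {
|         if (!n)
|             return 0;
|         /* two distinct call sites, so the indirect predictor can't
|            easily cover for the RAS once it wraps */
|         return n->value + walk(n->left) + walk(n->right);
|     }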
| gpderetta wrote:
| you can mitigate the cost by not 'call'-ing into your coroutine
| switch function but inlining the code into the surrounding
| coroutine. As a bonus you get a bit better branch prediction on
| your yield because distinct yields will share less state.
|
| Of course there is always going to be a penalty for stackful
| coroutines that yield deep into a callstack.
| sylware wrote:
| I write assembly mainly _not_ because it is faster, but because I
| don't depend on an absurdly complex and massive compiler.
| packetlost wrote:
| So... how does that work? What sort of work do you do that you
| have the time to write raw ASM and still be productive? I'm
| asking in earnest, because I'm curious what sort of workflows
| still allow for writing ASM directly outside of very specific
| cases (such as initializing embedded MCUs, for example)
| jefftk wrote:
| _> The SIMD code does come with one asterisk, though: because
| floating-point addition is not associative, and it performs the
| summation in a different order, it may not get the same result as
| straight-line code. In retrospect, this is likely why the
| compiler doesn't generate SIMD instructions to compute the sum!_
|
| What if you set -funsafe-math-optimizations, which allows
| "optimizations that allow arbitrary reassociations and
| transformations with no accuracy guarantees"?
| makapuf wrote:
| You could just turn gcc to 11 and use -Ofast
| zokier wrote:
| Both clang and gcc vectorize if you ask them nicely:
| https://gcc.godbolt.org/z/xvjY8P4cM
| [deleted]
___________________________________________________________________
(page generated 2023-01-25 23:00 UTC)