[HN Gopher] Own Constant Folder in C/C++
___________________________________________________________________
Own Constant Folder in C/C++
Author : todsacerdoti
Score : 149 points
Date : 2024-06-22 10:07 UTC (12 hours ago)
(HTM) web link (www.neilhenning.dev)
(TXT) w3m dump (www.neilhenning.dev)
| fwsgonzo wrote:
| Very interesting, and that constant folding trick is handy!
| smitty1e wrote:
| I was going to label this a fine example of yak shaving.
| Usually when I'm working this hard against the tool, I've
| missed something, though the author here sounds expert.
| DrBazza wrote:
| This is the sort of thing we do in HFT to get single
| instruction calls and save nanos in places that get called
| many many thousands of times a second.
| smitty1e wrote:
| I mean, at some point, why not just write a custom
| compiler?
| trelane wrote:
| If you care that much, you use intrinsics, asm in the C
| file, or asm to create an object you call into.
|
| Most of the code doesn't need nearly that level of
| optimization, so you write it in a higher level language.
|
| Making your own compiler is a _lot_ of work compared to
| just doing what I've outlined above. So you don't. It's
| not worth it, and you'd still end up back where you are.
| TillE wrote:
| Writing assembly really isn't that scary, especially in
| small doses for performance-critical code. That was a
| very common practice in the 80s/90s, though it's faded as
| compiler optimizations have improved.
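A minimal illustration of the "small doses" approach (a sketch with names of my choosing, not from the thread): GCC/Clang extended inline asm lets you drop a single instruction into otherwise ordinary C/C++.

```cpp
// Minimal GCC/Clang extended inline asm: add two ints with an explicit
// x86 instruction, with a plain C++ fallback on other architectures.
inline int add_asm(int a, int b) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__("addl %1, %0"   // a += b, AT&T syntax
            : "+r"(a)       // a is read and written, in a register
            : "r"(b));      // b is read, in a register
    return a;
#else
    return a + b;           // elsewhere: let the compiler pick the add
#endif
}
```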
| feverzsj wrote:
| clang has __builtin_elementwise_sqrt, which meets all your
| requirements.
| josephg wrote:
| I just tried it on godbolt. __builtin_elementwise_sqrt emits
| the same terrible asm as _mm_sqrt_ps when -ffast-math is
| enabled:
|
| https://godbolt.org/z/34z81GEsG
| btdmaster wrote:
| Fortunately gcc does not miscompile like this:
| https://godbolt.org/z/PaeaohqGz
| noitpmeder wrote:
| This seems like a great example of why people don't like C/C++,
| and probably a good example of why some people _do_ like it.
|
| How is a non-expert in the language supposed to learn tricks/...
| things like this? I'm asking as a C++ developer of 6+ years in
| high performance settings, most of this article is esoteric to
| me.
| joosters wrote:
| Is this really C++ specific though? It seems like the
| optimisations are happening on a lower level, and so would
| 'infect' other languages too.
|
| Whatever the language, at some point in performance tweaking
| you will end up having to look at the assembly produced by your
| compiler, and discovering all kinds of surprises.
| tialaramex wrote:
| LLVM isn't perfect, but the problem here is that there's a
| C++ compiler flag (-ffast-math) which says OK, disregard how
| arithmetic actually works, we're going to promise _across the
| entire codebase_ that we don't actually care and that's
| fine.
|
| This is nonsense, but it's really common, distressingly
| common, for C and C++ programmers to use this sort of
| inappropriate global modelling. It's something which cannot
| scale; it works OK for one-man projects: "Oh, I use the
| Special Goose Mode to make routine A better, so even though
| normal Elephants can't Teleport I need to remember that in
| Special Goose Mode the Elephants in routine B might
| Teleport". In practice you'll screw this up, but it _feels_
| like you'll get it right often enough to be valuable.
|
| In a large project where we're doing software engineering
| this is complete nonsense, now Jenny, the newest member of
| the team working on routine A, will see that obviously
| Special Goose Mode is a great idea, and turn it on, whereupon
| the entirely different team handling routine B find that
| their fucking Elephants can now Teleport. WTF.
|
| The need to _never do this_ is why I was glad to see Rust
| stabilize (e.g.) u32::unchecked_add fairly recently. This
| (unsafe, obviously) method says no, I don't want checked
| arithmetic, or wrapping, or saturating, I want you to _assume
| this cannot overflow_. I am _formally promising_ that this
| addition is never going to overflow, in order to squeeze out
| the last drops of performance.
|
| Notice that's not a global flag. I can write let a = unsafe {
| b.unchecked_add(c) }; in just one place in a 50MLOC system,
| and _for just that one place_ the compiler can go absolutely
| wild optimising for the promise that overflows never happen -
| and yet right next door, even on the next line, I can write
| let x = y + z; and _that_ still gets the kid gloves; if it
| overflows, nothing catches on fire. That's how granular this
| needs to be to be useful, unlike C++ -ffast-math.
| gpderetta wrote:
| You can set fast math (or a subset of it) on a translation
| unit basis.
| tialaramex wrote:
| Because the language works by textual inclusion, a
| "translation unit" isn't really just your code, so this is
| much more likely to result in nasty surprises, up to and
| including ODR violations.
| gpderetta wrote:
| Yes, if you try hard enough I'm sure you can find ways to
| screw up.
|
| From a practical point of view it is fine.
| cozzyd wrote:
| Or even on a per function basis, at least with gcc (no
| clue about clang...)
| gpderetta wrote:
| In principle yes, you can use Attribute optimize, but I
| wouldn't rely on it. Too many bugs open against it.
| cozzyd wrote:
| It's worked for me in the past but maybe I got lucky.
| pdw wrote:
| That's not really true. If you link a shared library that
| was compiled with -ffast-math, that will affect the
| entire program.
| https://moyix.blogspot.com/2022/09/someones-been-messing-
| wit...
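The linked post is about -ffast-math linking in crtfastmath.o, whose startup code sets the FTZ/DAZ bits in MXCSR for the whole process. A hedged sketch of what that changes (function name is mine):

```cpp
#include <cfloat>

// Under default IEEE semantics, subnormals survive arithmetic, so this
// returns true. If any object linked into the process was built with
// -ffast-math, its startup code may set FTZ/DAZ, and x is then treated
// as zero even though this translation unit never asked for fast math.
inline bool subnormals_survive() {
    volatile float x = FLT_MIN / 2.0f;  // a subnormal value
    return x > 0.0f && x * 2.0f == FLT_MIN;
}
```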
| nikic wrote:
| LLVM actually also supports instruction-level granularity
| for fast-math (using essentially the same mechanism as
| things like unchecked_add), but Clang doesn't expose that
| level of control.
| wmobit wrote:
| clang does have pragma clang fp to enable a subset of
| fast math flags within a scope
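For reference, the pragma looks roughly like this (a clang-specific extension; GCC merely warns about the unknown pragma):

```cpp
// Enable reassociation for one function body only, instead of the whole
// build: clang may then regroup these additions (e.g. to vectorize).
inline float dot4(const float* a, const float* b) {
#pragma clang fp reassociate(on)
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
```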
| nickelpro wrote:
| If you're using floating point at all you have declared you
| don't care about determinism or absolute precision across
| platforms.
|
| Fast math is simply saying "I care even less than IEEE"
|
| This is perfectly appropriate in many settings, but
| _especially_ video games where such deterministic results
| are completely irrelevant.
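Concretely, one thing fast math gives up is the written order of operations: float addition is not associative, so the reassociation that -ffast-math licenses can change results. A portable demonstration:

```cpp
// Without -ffast-math the compiler must preserve the written grouping,
// because the two groupings genuinely differ:
inline float sum_left(float a, float b, float c)  { return (a + b) + c; }
inline float sum_right(float a, float b, float c) { return a + (b + c); }
// sum_left(1e20f, -1e20f, 1.0f)  -> 1.0f (cancellation happens first)
// sum_right(1e20f, -1e20f, 1.0f) -> 0.0f (1.0f is absorbed by -1e20f)
```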
| aw1621107 wrote:
| > This is perfectly appropriate in many settings, but
| _especially_ video games where such deterministic results
| are completely irrelevant.
|
| I'm not sure I'd agree. Off the top of my head, a
| potentially significant benefit to deterministic
| floating-point is allowing you to send updates/inputs
| instead of world state in multiplayer games, which could
| be a substantial improvement in network traffic demand.
| It would also allow for smaller/simpler cross-
| machine/platform replays, though I don't know how much
| that feature is desired in comparison.
| comex wrote:
| Indeed, but this isn't hypothetical. A large fraction of
| multiplayer games operate by sending inputs and requiring
| the simulation to be deterministic.
|
| (Some of those are single-platform or lack cross-play
| support, and thus only need consistency between different
| machines running the same build. That makes compiler
| optimizations less of an issue. However, some do support
| cross-play, and thus need consistency between different
| builds - using different compilers - of the same source
| code.)
| immibis wrote:
| Actually, floating-point math is mostly deterministic.
| There is an exception for rounding errors in
| transcendental functions.
|
| The perception of nondeterminism came specifically from
| x87, which had 80-bit native floating-point registers,
| which were different from every other platform's 64-bit
| default, and forcing values to 64-bit all the time cost
| performance, so compilers secretly turned data types to
| different ones when compiling for x87, therefore giving
| different results. It would be like if the compiler for
| ARM secretly changed every use of 'float' into 'double'.
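A small sketch of the effect described here: rounding every intermediate to float differs from computing wide and rounding once at the end, which is effectively what x87's 80-bit registers did behind the programmer's back.

```cpp
// Each step rounded to float: 1.0f + 2^-25 rounds back to 1.0f every time.
inline float sum_narrow(float t) { return ((1.0f + t) + t) + t; }

// Kept wide (double) and rounded once at the end: 1 + 3*2^-25 lies above
// the halfway point to the next float, so the final rounding goes up.
inline float sum_wide(float t) {
    return static_cast<float>(((1.0 + t) + t) + t);
}
```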
| hifromwork wrote:
| Everything in this post applies to C too, so it's not C++
| specific. And the same gotchas apply for every case when you
| use inline assembly. I wouldn't call it a trick... Just an
| interesting workaround for a performance bug in Clang.
|
| The post can be boiled down to "Clang doesn't compile this
| intrinsic nicely, so just use inline asm directly. But remember
| that you need to have a non-asm special case to optimize
| constants too, and you can achieve this with
| __builtin_constant_p".
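A scalar sketch of that boiled-down recipe (my approximation of the shape, not the article's exact code): because the wrapper is always_inline, __builtin_constant_p is evaluated after inlining and sees the caller's argument, so known constants take the foldable libm path while runtime values get the exact instruction.

```cpp
#include <cmath>

__attribute__((always_inline))
inline float my_sqrtf(float x) {
#if defined(__x86_64__) || defined(__i386__)
    if (!__builtin_constant_p(x)) {
        // Runtime values: emit exactly the instruction we want.
        __asm__("sqrtss %0, %0" : "+x"(x));
        return x;
    }
#endif
    // Known constants (and non-x86 targets): a plain call the
    // optimizer can constant-fold.
    return sqrtf(x);
}
```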
| GrantMoyer wrote:
| I wouldn't call this a performance bug in clang. It's an
| optimization working as intended.
| dooglius wrote:
| I would challenge you to find a processor on which the
| rsqrt plus two Newton-Raphson iterations is not slower than
| plain sqrt. (We don't know what -mtune the author used.)
| pclmulqdq wrote:
| The author probably didn't use any mtune setting, which
| is likely the problem. If you look at older cores on
| Agner's instruction tables, SQRT has been getting
| steadily faster over time. This implementation is
| slightly faster on old Intel machines, for example.
| GrantMoyer wrote:
| According to Intel, any processor before Skylake (section
| 15.12 from [1]).
|
| [1]: https://cdrdv2.intel.com/v1/dl/getContent/814198?fil
| eName=24...
| parasti wrote:
| A lesson I learned very early on: don't use -ffast-math unless
| you know what it does. It has a very appealing name that
| suggests nice things. You probably won't ever need it.
| SassyBird wrote:
| And even if you do know what it does, it's very impractical
| given that it's a global option. Turning it on for selected
| portions of code would be a completely different game.
| adgjlsfhk1 wrote:
| Julia has this (via a @fastmath macro) and it's so nice!
| swatcoder wrote:
| You're mistaken.
|
| You can/would just use it in the translation units where
| you want it; usually for numerical code where you want
| certain optimizations or behaviors and know that the
| tradeoffs are irrelevant.
|
| It's mostly harmless for everyday application math anyway,
| and so enabling it for your whole application isn't a
| catastrophe, but it's not what people who know what they're
| doing would usually do. It's usually used for a specific
| file or perhaps a specific support library.
| leni536 wrote:
| -ffast-math affects other translation units too, because
| it introduces a global initialization that affects some
| CPU flags. You can't really contain it to a single TU.
| adgjlsfhk1 wrote:
| I prefer -funsafemath. Who doesn't want their math to be fun
| and safe?
| glitchc wrote:
| I don't know how many people got your joke, but that flag
| actually says unsafe math. The -f prefix signifies a
| floating point flag.
| rockwotj wrote:
| I am pretty sure the -f is for feature. Because there are
| flags like -fno-exceptions and -fcoroutines
| jkrejcha wrote:
| It's a feature flag, they're not all necessarily floating
| point related. As an example though, I still
| intentionally humorously misread -funroll-loops as
| funroll loops even though it's f(eature) unroll loops
| flohofwoe wrote:
| This is a Clang or even LLVM code generation issue and entirely
| unrelated to the C and C++ standards (-ffast-math, intrinsics
| and inline assembly are all compiler-specific features not
| covered by the standards).
|
| Most other language ecosystems most likely suffer from
| similar problems if you look under the hood.
|
| At least compilers like Clang also give you the tools to
| work around such issues, as demonstrated by the article.
| almostgotcaught wrote:
| > How is a non-expert in the language supposed to learn
| tricks/... things like this?
|
| just like everything else in life that's complex: slowly and
| diligently.
|
| i hate to break it to you - C++ is complex not for fun but
| because it has to both be modern (support lots of modern
| syntax/sugar/primitives) _and_ compile to target code for an
| enormous range of architectures, and modern architectures are
| horribly complex and varied (x86/arm isn't the only game in
| town). Forgo either of those and it could be a much simpler
| language, but then it's not nearly as compelling.
| nalrqa wrote:
| > How is a non-expert in the language supposed to learn
| tricks/... things like this?
|
| By learning C and inline asm. For a C developer, this is
| nothing out of the ordinary. C++ focuses too much on new
| abstractions and hiding everything in the stdlib++, where the
| actual implementations of course use all of this and look more
| like C, which includes using (OMG!) raw pointers.
| pas wrote:
| arguably vast-vast-vaaast majority of projects, problems,
| solutions, developers simply don't need this.
|
| but yes, given the premise of the article is that a friend
| wants to use a specific CPU instruction, yeah, at least
| minimum knowledge of one's stack is required (and usually the
| path leads through Assembly, some C/Rust and an FFI interface
| - like JNI for Java, cffi/Cython for Python, and so on)
| elvircrn wrote:
| Take a computer architecture course for starters.
| bryancoxwell wrote:
| Got any you'd recommend?
| pas wrote:
| for appetizer I recommend Cliff Click's "A Crash Course in
| Modern Hardware" talk
|
| https://www.youtube.com/watch?v=5ZOuCuGrw48 (and here's the
| 2009 version https://www.infoq.com/presentations/click-
| crash-course-moder... .. might be interesting for
| comparison )
| bryancoxwell wrote:
| Thanks!
| tonetegeatinst wrote:
| I'm a security student. My main experience has been python and
| java....but I have started to learn c to better learn how low
| level stuff works without so much abstraction.
|
| My understanding is that C is a great language, but I also get
| that it's not for everyone. It's really powerful, and yet you can
| easily make mistakes.
|
| For me, I'm just learning how to use C, I'm not trying to
| understand the compiler or make files yet. From what I get, the
| compiler is how you can achieve even better performance, but
| you need to understand how it is doing its black
| magic....otherwise you just might make your code slower or more
| inefficient.
| vlovich123 wrote:
| First order optimization is always overall program
| architecture, then hotspots, then fixing architectural issues
| in the code (e.g. getting rid of misspeculated branches,
| reducing instructions etc), and then optimizing the code the
| compiler is generating. And at no point does it require to
| know the internals of the optimization passes as to how it
| generates the code.
|
| As for the compiler's role in C, it's equivalent to javac -
| it's taking your source and creating machine code, except the
| machine code isn't an abstract bytecode but the exact machine
| instructions intended to run on the CPU.
|
| The issues with C and C++ are around memory safety. Practice
| has repeatedly shown that the defect rate with these
| languages is high enough that it results in lots of easily
| exploitable vulnerabilities. That's a bit more serious than a
| personal preference. That's why there's pushes to shift the
| professional industry itself to stop using C and C++ in favor
| of Rust or even Go.
| torusle wrote:
| Nah, it is not that bad.
|
| Sure you can mess up your performance by picking bad compiler
| options, but most of the time you are fine with just default
| optimizations enabled and let it do its thing. No need to
| understand the black magic behind it.
|
| This is only really necessary if you want to squeeze the last
| bit of performance out of a piece of code. And honestly, how
| often does this occur in day-to-day coding unless you write a
| video or audio codec?
| vlovich123 wrote:
| The main flags to look at:
|
| * mtune/march - specifying a value of native optimizes for
| the current machine, x86-64-v1/v2/v3/v4 for generations or
| you can specify a specific CPU (ARM has different naming
| conventions). Recommendation: use the generation if
| distributing binaries, native if building and running
| locally unless you can get much much more specific
|
| * -O2 / -O3 - turn on most optimizations for speed.
| Alternatively Os/Oz for smaller binaries (sometimes faster,
| particularly on ARM)
|
| * -flto=thin - get most of the benefits of LTO with minimal
| compile time overhead
|
| * pgo - if you have a representative workload you can use
| this to replace compiler heuristics with real world
| measurements. AutoFDO is the next evolution of this to make
| it easier to connect data from production environments to
| compile time.
|
| * math: -fno-math-errno and -fno-trapping-math are "safe"
| subsets of ffast-math (i.e. don't alter the numerical
| accuracy). -fno-signed-zeros can also probably be
| considered if valuable.
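One concrete example of why -fno-math-errno belongs in the "safe" list: C requires sqrt to set errno for negative inputs, which keeps compilers from lowering a call to a bare hardware square-root instruction; -fno-math-errno removes only that side effect, not any numerical accuracy.

```cpp
#include <cmath>

// With -fno-math-errno, GCC/Clang can compile this to a single sqrtsd
// (x86) or fsqrt (ARM); by default they must also preserve the errno
// side effect for negative inputs, typically via a guard and an
// out-of-line libm call.
inline double plain_sqrt(double x) { return std::sqrt(x); }
```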
| vbezhenar wrote:
| Also I learned recently that there's `-Og` which enables
| optimizations suitable for debug build.
| vlovich123 wrote:
| In practice I've had limited success with that flag. It
| still seems to enable optimizations that make debugging
| difficult.
| fredgrott wrote:
| if you understand that C/C++'s purpose was at first to write an
| OS... you're somewhat aware of this... but that would depend on
| your CS classroom studies exposure...
|
| In my case it was by accident as I picked up assembly and
| machine language before I touched C in the late 1980s.
| plq wrote:
| They have a computation to do, and they want to do it using a
| particular instruction. That's the only reason they are
| fiddling with the compiler. You would have to do so too, if you
| had decided to solve a problem at hand using a similar method.
|
| The author is talking about a way to get a particular (version
| of) C/C++ compiler to emit the desired instruction. So I'd call
| this clang-18.1.0-specific but not C/C++-specific since this
| has nothing to do with the language.
|
| Also such solutions are not portable nor stable since
| optimization behavior _does_ change between compiler versions.
| As far as I can tell, they also would have to implement a
| compiler-level unit test that ensures that the desired machine
| code is emitted as toolchain versions change.
| kenferry wrote:
| I mean, you don't have to care about this unless you have an
| application where you do. And if you do there is enough
| transparency (ie ability to inspect the assembly and ask
| questions) that you can solve this one issue without knowing
| everything under the sun.
|
| If you had an application where this sort of thing made a
| difference in JavaScript, the problem would likely still be
| there; you'd just have a lot less visibility on it.
|
| I guess you're still right - at the end of the day you see
| discussions like this far more often in C, so it impacts the
| feel of programming in C more.
| Conscat wrote:
| In my opinion, `__builtin_constant_p()` is not _that_ obscure
| of a feature. In C, it is used in macros to imitate constant
| functions, and in C++ it is useful for determining the current
| alternative that has lifetime in a `union` within a constant
| function. Granted that `__builtin_is_constant_evaluated()` has
| obsoleted its primary purpose, but there are enough ways it's
| still useful that I see it from time to time.
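The macro idiom referred to looks roughly like this (a sketch with made-up names; popcount is just a stand-in operation):

```cpp
// Classic GCC/Clang dispatch: if the argument is a compile-time constant,
// use an expression the compiler can evaluate outright (and which stays
// usable in constant contexts); otherwise call a runtime helper.
inline int popcount_runtime(unsigned x) { return __builtin_popcount(x); }

#define POPCOUNT(x) \
    (__builtin_constant_p(x) ? __builtin_popcount(x) : popcount_runtime(x))
```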
| pandaman wrote:
| Indeed, C++ is different from most languages (other than C)
| because "knowing C++" does not mean just knowing syntax and
| standard library API but implies understanding of how the
| source code is turned into bytes inside an executable image. A
| Python or Java or whatever programmer could be writing books
| and praised as a language expert without the slightest idea of
| how memory is allocated; a C++ programmer who does not know that
| is probably going to work in some legacy shop supporting an
| ancient MFC/Qt data-entry app.
| Waterluvian wrote:
| Not only that but this is so far away from the actual problem
| being solved that it's just kind of annoying.
|
| I wish there was some sensible way for code that's purely about
| optimization to live entirely separated from the code that's
| about solving the problem at hand...
| pxmpxm wrote:
| Huh, why would reciprocal sqrt and fpmul be faster than regular
| sqrt?
| pkhuong wrote:
| It's less accurate.
| GrantMoyer wrote:
| Intel's explanation (at section 15.12):
| https://cdrdv2.intel.com/v1/dl/getContent/814198?fileName=24...
|
| > In Intel microarchitectures prior to Skylake, _the SSE divide
| and square root instructions DIVPS and SQRTPS have a latency of
| 14 cycles (or the neighborhood) and they are not pipelined.
| This means that the throughput of these instructions is one in
| every fourteen cycles._ The 256-bit Intel AVX instructions
| VDIVPS and VSQRTPS execute with 128-bit data path and have a
| latency of twenty-eight cycles and they are not pipelined as
| well.
|
| > In microarchitectures that provide DIVPS/SQRTPS with high
| latency and low throughput, it is possible to speed up single-
| precision divide and square root calculations using the
| (V)RSQRTPS and (V)RCPPS instructions. _For example, with
| 128-bit RCPPS /RSQRTPS at five-cycle latency and one-cycle
| throughput_ or with 256-bit implementation of these
| instructions at seven-cycle latency and two-cycle throughput, a
| single Newton-Raphson iteration or Taylor approximation can
| achieve almost the same precision as the (V)DIVPS and (V)SQRTPS
| instructions
|
| (emphasis mine)
| dooglius wrote:
| That's a different case, where one is trying to calculate
| 1/sqrt(x) from rsqrt(x), whereas the article involves
| computing sqrt(x) as x*rsqrt(x); in this case you don't
| need the divps instruction for sqrt but do need an additional
| mul instruction for the rsqrt approach.
| GrantMoyer wrote:
| I think you've misread the section:
|
| > it is possible to speed up single-precision divide and
| square root calculations using the (V)RSQRTPS and (V)RCPPS
| instructions
| TinkersW wrote:
| Only on some older hardware, really. Pretty dumb optimization
| from clang; people writing SIMD probably already know this
| trick and have a version of sqrt that is exactly this, so going
| behind their backs and forcing it when they requested sqrt_ps
| is very uncool.
| cozzyd wrote:
| Arguably this is an incorrect default of mtune
| eqvinox wrote:
| Pet peeve: this isn't "your own constant folder in C/C++"... it's
| "your own enabling constant folding in C/C++"...
|
| With "own constant folder" I expected a GCC/clang plugin that
| does tree/control flow analysis and some fancy logic in order to
| determine when, where and how to constant fold...
| GrantMoyer wrote:
| I think the real problem is "really really wanted the sqrtps to
| be used in some code they were writing" is at odds with -ffast-
| math.
|
| Clang transforms sqrtps(x) to x * rsqrtps(x) when -ffast-math is
| set because it's often faster (See [1] section 15.12). It isn't
| faster for some architectures, but if you tell clang what
| architecture you're targeting (with -mtune), it appears to make
| the right choice for the architecture[2].
|
| [1]:
| https://cdrdv2.intel.com/v1/dl/getContent/814198?fileName=24...
|
| [2]:
| https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename...
| white_beach wrote:
| somewhat related to this:
|
| https://godbolt.org/z/1543TYszP
|
| (intel c++ compiler)
| kookamamie wrote:
| It's just so fiddly - can't trust the compiler, not a good sign
| for a language or its implementation.
| g15jv2dp wrote:
| Do you think other compilers for other languages don't perform
| optimizations like this?
| adgjlsfhk1 wrote:
| Julia does all of this correctly (although there are a few 0
| cost redundant move instructions presumably for ABI reasons).
|
| julia> f(x) = @fastmath sqrt.(x)
|
| julia> @code_native debuginfo=:none f.(Tuple(rand(Float32, 8)))
|         vsqrtps ymm0, ymmword ptr [rsi]
|         mov rbp, rsp
|         mov rax, rdi
|         vmovups ymmword ptr [rdi], ymm0
|         pop rbp
|         vzeroupper
|         ret
|
| julia> g() = f((1f0, 2f0, 3f0, 4f0))
|
| julia> @code_native debuginfo=:none g()
| .LCPI0_0:
|         .long 0x3f800000 # float 1
|         .long 0x3fb504f3 # float 1.41421354
|         .long 0x3fddb3d7 # float 1.73205078
|         .long 0x40000000 # float 2
|         .text
|         .globl julia_g_600
|         .p2align 4, 0x90
|         .type julia_g_600,@function
| julia_g_600: # @julia_g_600
| # %bb.0: # %top
|         push rbp
|         movabs rcx, offset .LCPI0_0
|         mov rbp, rsp
|         mov rax, rdi
|         vmovaps xmm0, xmmword ptr [rcx]
|         vmovups xmmword ptr [rdi], xmm0
|         pop rbp
|         ret
| g15jv2dp wrote:
| ...so does LLVM, unless you specifically instruct it to do
| otherwise with -ffast-math. Don't be fooled by the name of
| this flag, if it were possible to just do math faster
| without compromising anything, it wouldn't be a flag, it
| would be the default. It seems that Julia has a flag with
| the same name but a different behavior. Okay? What's your
| point exactly?
| chongli wrote:
| But in this case, -ffast-math actually results in slower
| math, so the flag is badly named. Naming things is
| important (and hard). The naive expectation of a flag
| named -ffast-math is that the math should never be
| slower, sometimes be faster, and potentially less
| accurate and/or error prone. The fact that it can also be
| sometimes slower means the flag really should be named
| -fdifferent-math or (uncharitably) -funreliable-math.
| tcbawo wrote:
| It's too bad that it hasn't acquired the name of an
| author to describe this alternative functionality more
| concisely without being judgy (like Nagle's algorithm). I
| looked around, and it seems that there isn't any one
| designer/author that you could give credit to.
| adgjlsfhk1 wrote:
| @fastmath in Julia is telling LLVM (basically) the same
| thing. I think the only difference is that Julia doesn't
| think rsqrt(x)*x with Newton iteration is a good way of
| computing sqrt
| dist1ll wrote:
| I mean, the fiddliness comes from wanting to use inline
| assembly, which is often considered to be niche and second-
| class in most modern languages. Even writing the asm in the
| first place is full of footguns [0]. There are ways to reduce
| the optimization barrier of asm, and implementing constant
| folding is one such example. But I can see why most compiler
| writers wouldn't be that interested in it.
|
| [0] https://news.ycombinator.com/item?id=40607845
| ComputerGuru wrote:
| You can never trust an optimizing compiler. I've seen similar
| issues in pretty much every language I've written performance-
| sensitive code in. Point revisions change optimizations all the
| time, sometimes with horribly disastrous results. One example:
| every time rust upgrades the version of LLVM underlying the
| compiler backend, the team prepares itself for a deluge of
| optimization regression reports (though this has been somewhat
| "mitigated" by a massive regression in rustc's ability to
| convey sufficient information to the LLVM backend for it to
| optimize to begin with, thanks to a change made to improve
| compile times [0]).
|
| This is almost universally true across all languages. The
| behaviors you see are usually the result of countless
| heuristics stacked one atop the other, and a tiny change here
| can end up with huge changes there.
|
| [0]: https://github.com/rust-lang/rust/pull/91743
| edelsohn wrote:
| __builtin_constant_p(vec) is not inquiring whether the contents
| of vec are constant. The compilers are not being fickle. The
| statement is not asking the question that the developer intended.
| dooglius wrote:
| I have not found __builtin_constant_p to be very reliable when I
| want to fold multiple times. Is there any way to do this trick
| better using c++ constexpr I wonder?
| ComputerGuru wrote:
| > but I think the LLVM folks have purposefully tried to match
| what GCC would do.
|
| I never got that impression in my perusal of LLVM bug reports and
| patches. I wonder if there is an open issue for this specific
| case.
| ComputerGuru wrote:
| This reminds me of an issue that I ran into with rust [0] when I
| was trying to optimize some machine learning code. Rust's uber-
| strict type-safe math operations have you use matching types to
| get the euclidean non-negative remainder of x mod y. When you
| have a float-point x value but an integral y value, the operation
| can be performed _much_ more cheaply than when y is also a
| floating point value.
|
| The problem is that you end up promoting y from an integer to an
| f64 and get a much slower operation. I ended up writing my own
| `rem_i64(self: &f64, divisor: i64) -> f64` routine that was some
| ~35x faster (a huge win when crunching massive arrays), but as
| there are range limitations (since f64::MAX > i64::MAX) you can't
| naively replace all call sites based on the type signatures.
| However, with some support from the compiler it would be
| completely doable anytime the compiler is able to infer an
| upper/lower bound on the f64 dividend, when the result of the
| operation is coerced to an integer afterwards, or when the
| dividend is a constant value that doesn't exceed that range.
|
| So now I copy that function around from ML project to ML project,
| because what else can I do?
|
| (A "workaround" was to use a slower-but-still-faster
| `rem_i128(self: &f64, divisor: i128) -> f64` to raise the
| functional limits of the operation, but you're never going to
| match the range of a 64-bit floating point value until you use
| 512-bit integral math!)
|
| [0]: https://github.com/rust-lang/rust/issues/83973
|
| Godbolt link: https://godbolt.org/z/EqrEqExnc
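For reference, the operation under discussion in its portable form (this is the generic version, not the poster's faster integer-divisor routine):

```cpp
#include <cmath>

// Euclidean remainder: unlike std::fmod, the result is always in [0, m)
// for m > 0, matching the semantics of Rust's rem_euclid.
inline double rem_euclid(double x, double m) {
    double r = std::fmod(x, m);
    return r < 0.0 ? r + m : r;
}
```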
| toast0 wrote:
| > So now I copy that function around from ML project to ML
| project, because what else can I do?
|
| Aren't you supposed to make a crate? Which _is_ copying with
| more steps, but might make your life easier (or harder).
| ComputerGuru wrote:
| I knew someone would pipe in with that suggestion as I was
| writing that comment!
|
| Yes, I suppose that would be the canonical way to go. But
| it's just one function; four lines! I'm getting isEven()
| vibes!
| pryelluw wrote:
| I package unrelated classes, functions, utilities like
| yours into one module/crate/library and just keep adding
| stuff to it. Sort of like my own standard library.
| mikepurvis wrote:
| That works for a time but it's ultimately not great for
| new/external contributors to be faced with your code
| being full of unfamiliar idioms and utility functions
| coming from a single kitchen sink package.
| marcosdumay wrote:
| Really don't.
|
| If that crate becomes successful, it will basically
| greenlight a lot of functionality that was just there
| because it happened to be made by the same author. And
| all that extra functionality will only reduce the chances
| of the really useful function to get into the spotlight.
| Both are bad for the ecosystem.
|
| If the GP has _related_ useful little functions, yes,
| pack them together. Otherwise, I'd say a crate for a
| small little function isn't a problem at all.
| im3w1l wrote:
| Maybe you could go looking for a home for your function in
| some existing crate then?
| saghm wrote:
| Maybe it could fit into the `num` crate in some fashion?
| https://docs.rs/num/latest/num/
| pclmulqdq wrote:
| Another option is to use "-march" to set your target architecture
| to something post-skylake/Zen 2. That should emit the right
| instruction.
|
| The square root and div instructions used to be a lot slower than
| they are now.
| SAI_Peregrinus wrote:
| > if you happened to use -ffast-math
|
| That option is terribly named. It should be
| -fincorrect_math_that_might_sometimes_be_faster.
| omoikane wrote:
| Offtopic, but the original title is "Your Own Constant Folder in
| C/C++". I am guessing Hacker News cut out the "your" for some
| reason, but at first glance I thought it was some kind of read-
| only directory implemented in C/C++. It's like a truncated garden
| path sentence.
|
| https://en.wikipedia.org/wiki/Garden-path_sentence
| mgaunard wrote:
| The real fix is to not use -ffast-math. It breaks all sorts of
| things by design.
|
| If you really want to relax the correctness of your code to get
| some potential speedup, either do the optimizations yourself
| instead of letting the compiler do them, or locally enable fast-
| math equivalents.
|
| As for the is-constant-expression GCC extension, that stuff is
| natively available in standard C++ nowadays.
| hawk_ wrote:
| How does one locally enable -ffast-math equivalent?
| cozzyd wrote:
| https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-
| Option-...
| Lockal wrote:
| For C++ this goes to the question
| https://stackoverflow.com/questions/8936549/constexpr-overlo...
|
| Back in the days the answer was `__builtin_constant_p`.
|
| But with C++20 it is possible to use std::is_constant_evaluated,
| or `if consteval` with C++23.
|
| But this is for scenario when you still want to keep high quality
| of code (maybe when you write multiprecision math library; not
| when you hack around compiler flag), which inline assembly
| violates for many reasons:
|
| 1) major issue: instead of dealing with `-ffast-math` with inline
| asm, just remove `-ffast-math`
|
| 2) randomly slapped inline asm inside normal fp32 computations
| breaks autovectorization
|
| 3) randomly slapped inline asm in example uses non-vex encoding,
| you will likely forget to call vzeroupper on transition. Or in
| general, this limits code to x86 (forget about x86-64/arm)
|
| 4) provided example (best_always_inline) does not work in GCC as
| expected
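A sketch of the modern dispatch being pointed at, using the builtin spelling mentioned elsewhere in the thread so it also works before C++20 on GCC/Clang (std::is_constant_evaluated and `if consteval` are the standard forms; the Newton loop is my stand-in for a "real" constant path):

```cpp
#include <cmath>

constexpr float const_sqrt(float x) {
    if (__builtin_is_constant_evaluated()) {
        // Constant evaluation: a plain Newton iteration the compiler
        // can fold entirely at compile time.
        float g = x > 1.0f ? x : 1.0f;
        for (int i = 0; i < 64; ++i) g = 0.5f * (g + x / g);
        return g;
    }
    // Runtime: an ordinary call (imagine the intrinsic/asm path here).
    return sqrtf(x);
}

// Folded entirely at compile time:
static_assert(const_sqrt(2.0f) > 1.414f && const_sqrt(2.0f) < 1.415f, "");
```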
| hansvm wrote:
| Forced constant folding is something I use on occasion in Zig. To
| declare that a computation needs to happen at compile-time, just
| prefix it with `comptime`.
|
| Compile-time execution emulates the target architecture, which
| has pros and cons. The most notable con is that inline assembly
| isn't supported (yet?). The complete solution to this particular
| problem then additionally requires `return if (isComptime())
| normal_code() else assembly_magic();` (which has no runtime cost
| because that branch is constant-folded). Given that inline
| assembly usually also branches on target architecture, that winds
| up not adding much complexity -- especially given that you
| probably had exactly that same code as a fallback for
| architectures you didn't explicitly handle.
| metadat wrote:
| Is `_mm_sqrt_ps(..)' an Intel-only thing? Why is the naming so
| jacked?
|
| https://www.intel.com/content/www/us/en/docs/cpp-compiler/de...
___________________________________________________________________
(page generated 2024-06-22 23:01 UTC)