[HN Gopher] Optimizing compilers reload vector constants needlessly
___________________________________________________________________
Optimizing compilers reload vector constants needlessly
Author : ibobev
Score : 82 points
Date : 2022-12-06 16:41 UTC (6 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| jeffbee wrote:
| Moving the constant to file or anonymous namespace scope solves
| the issue. It's too bad that intrinsics are not `constexpr`
| because I have a powerful urge to hang a `constinit` in front of
| this line.
| gumby wrote:
| Disturbing that this works, as it shouldn't do the reload even
| if the constant is passed in as a parameter.
| leni536 wrote:
| In this particular case the broadcasting instruction can be
| replaced with builtin operations, allowing constexpr.
|
| https://godbolt.org/z/Td6vG9cqG
|
| edit: uh, the constant requires some hand adjustment
|
| edit2: fixed version https://godbolt.org/z/4Px5Mbsx4, and I
| just don't get this. gcc really just wants to load that
| constant twice.
| foota wrote:
| Maybe it's trying to avoid using SSE in the case where there's no
| loop? SSE on some older platforms had a cost just from using it,
| so it might be possible.
| stephc_int13 wrote:
| My experience with optimizing compilers is that generated code is
| often frustratingly close to optimal (given the source is well
| written and taking account the constraints of the target arch).
|
| It is perfectly reasonable to take a look at the output on
| Godbolt, tweak it a bit and call it a day.
|
| Maintaining a full assembly language version of the same code is
| rarely justifiable.
|
| And yet, I understand the itch, especially because there are
| quite often some low-hanging fruits to grab.
| evancox100 wrote:
| This may be true for scalar code but it seems like the
| compilers still aren't quite there with vector code.
| phkahler wrote:
| The optimization here would be CSE or hoisting, or both? I'm
| guessing the problem is those are performed prior to
| vectorization.
|
| In other words, I suspect an invariant calculation inside
| consecutive loops but that is not vectorized will be pulled out
| of the loops and also moved prior to them and executed just once.
| JonChesterfield wrote:
| At a guess, constant rematerialision failing to cross basic
| block boundaries. Feels like a plausible thing for a heuristic
| to miss. E.g. sink the constant into the loop so it's available
| when optimising that block, then fail to hoist it back out
| afterwards because constant materialisation is cheap.
| JoeAltmaier wrote:
| Intel had an optimizing compiler that was amazing. But unless you
| were intel-only it made life harder to switch compilers for that
| platform.
| berkut wrote:
| Yeah, I haven't used ICC for 7 years now, but at the time it
| was much better than clang/gcc at keeping SSE/AVX intrinsic
| types in registers through function calls (i.e. clang/gcc used
| to spill out onto the stack and re-load), and things like this
| in the article.
| cwzwarich wrote:
| Were you testing on the same platform? The Microsoft ABI has
| callee-save XMM registers, whereas the Linux/macOS ABI does
| not. Regardless, it would be nice if more compilers could do
| interprocedural register allocation in cases where all
| callers are known.
| tester756 wrote:
| I've heard similar opinions that people could just recompile
| their soft and receive significant speed boost
| inetknght wrote:
| I haven't (yet) read the article, but I will. But the headline...
|
| > _Optimizing compilers reload vector constants needlessly_
|
| ...is absolutely true. I wrote some code that just does bit
| management (shifting, or, and, xor, popcount) on a byte-level.
| Compiler produced vectorized instructions that provided about a
| 30% speed-up. But when I looked at it... it was definitely not as
| good as it could be, and one of the big things was frequently
| reloading /broadcasting constants like 0x0F or 0xCC or similar.
| Another thing it would do is to sometimes drop down to normal
| (not-SIMD) instructions. This was with both `-O2` and `-O3`, and
| also with `-march=native`
|
| I ended up learning how to use SIMD intrinsics and hand-wrote it
| all... and achieved about a 600% speedup. The code reached about
| 90% of the performance of the bus to RAM which was what I
| theorized "should" be the limiting factor: bitwise operations
| like this are _extremely_ fast and the slowest point point was
| popcount which didn 't have a native instruction on the hardware
| I was targeting (AVX2). This was with GCC 6.3 if I recall, about
| 5 years ago.
| an1sotropy wrote:
| Can you recommend any favorite resources for learning how to
| use SIMD intrinsics?
| teux wrote:
| Not OP but also work with this.
|
| There's some tutorials but honestly the best thing is to just
| use them.
|
| Write an image processing routine that does something like
| apply a gaussian blur to a black and white image. The c++
| code for this is _everywhere_. You have a fixed kernel (2d
| matrix) and you have to do repeat multiplication and addition
| to each pixel for each element in the kernel.
|
| Write it in C++ or Rust. Then read the Arm SIMD manual, find
| the instructions that do the math you want, and switch it
| over to intrinsics. You are doing the same exact operations
| with the intrinsics as the raw c++. Just 8 or 16 of them at a
| single time.
|
| Run them side by side for parity and to check speed, tweak
| the simd, etc.
|
| Arm has good (well ,okay) documentation
|
| https://developer.arm.com/documentation/den0018/a/?lang=en
|
| https://arm-
| software.github.io/acle/neon_intrinsics/advsimd....
|
| * Edit: you also have to do this on a supported architecture.
| Raspberry pi's have a neon core at least in the 3's. Not sure
| about the 4's but I believe so too!
| corysama wrote:
| Adding on:
|
| Go to
| https://www.intel.com/content/www/us/en/docs/intrinsics-
| guid...
|
| Start with SSE, SSE2, SSE3
|
| Write small functions in https://godbolt.org/ . Watch the
| assembly and the program output.
| jeffreyrogers wrote:
| That's basically the problem the article describes although
| he's using vector intrinsics too and it still reloads and
| broadcasts the constant before each loop.
| pclmulqdq wrote:
| When I have used intrinsics, the compiler at least has a hope
| of getting this right, particularly when you use patterns
| like:
|
| __m256i mask = _mm256_set1_epi8(0x0f)
|
| If you just used the intrinsic that sets the register to a
| constant over and over, it often repeats the instruction.
|
| The compilers just aren't that smart about SIMD yet.
| jeffreyrogers wrote:
| He sets it once like this before the loops.
| __m256i c = _mm256_set1_epi32(10001);
|
| And then the disassembly has mov
| eax, 10001 vpbroadcastd ymm1, eax
|
| before each loop.
| DannyBee wrote:
| There are three reasons it reloads constants:
|
| 1. It thinks it is cheaper than keeping them in a register (
| this is known as rematerialization). It will reload constants
| that it lets it keep something else in a register, and it's
| cheaper to do this.
|
| 2. It thinks something could affect the constant.
|
| 3. It thinks it must move it through memory to use it, and
| then it thinks the memory was clobbered.
|
| In this case, it definitely knows it is a constant, and it
| can't prove that both loops always execute, so it places it
| in the path where it is only executed once per loop, because
| it believes it will be cheaper.
|
| I can still make at least gcc do weird things if i prove to
| it the loop executes once.
|
| In that case, what is happening in gcc is that constant
| propagation is propagating the vector constant forward into
| both loops. Something later (that has a machine cost model)
| is expected to commonize it if it is cheaper, but never does.
| teux wrote:
| I often hand write neon (and other vectorised architecture)
| intrinsics/assembly for my job, optimising image and signal
| processing routines. We have seen many many 3 digit percentage
| speedups from bare c/c++ code.
|
| I got into the nastiest discussion on reddit where people were
| swearing up and down it was impossible to beat the compiler,
| and handwritten assembly was useless/pretentious/dangerous. I
| was downvoted massively. Sigh.
|
| Anyways, that was a year ago. Thanks for another point of
| validation for that. It clearly didn't hurt my feelings. :)
|
| I never come across people in the wild that actually do this
| also, it's such a niche area of expertise.
| fwsgonzo wrote:
| It also slightly annoys me a bit the things JIT people write
| on their github READMEs about the incredibly theoretical
| improvements that can happen at runtime, yet it's never
| anywhere close to AOT compilation. Then you can add 2-3x on
| top of that for hand-written assembly.
|
| I do wonder whats going on with projects like BOLT though. I
| have seen it was merged into LLVM, and I have tried to use it
| but the improvement was never more than 7%. I feel like it
| has a lot of potential because it does try to take run-time
| into account.
|
| See: https://github.com/llvm/llvm-project/tree/main/bolt
| wyldfire wrote:
| > improvement was never more than 7%.
|
| If your use case isn't straining icache then you won't
| benefit as much.
|
| BTW 7% is huge, odd that you would describe it as "only".
| wyldfire wrote:
| > impossible to beat the compiler
|
| Ludicrous! How could they be taken seriously? Which subreddit
| was this?
| astrange wrote:
| Tell them to read the ffmpeg code. All the platform-
| specific/SIMD stuff is done in asm.
|
| This isn't only because it's faster, it's honestly easier to
| read than intrinsics anyway. What it does lack is
| debugability.
| teux wrote:
| For debugging you can actually use gdb in assembly tui mode
| and step through the instructions! You can even get it
| hooked up in vs code and remote debug an embedded target
| using the full IDE. Full register view, watch registers for
| changes, breakpoints, step instruction to instruction.
|
| Pipelining and optimisations can make the intrinsics a bit
| fucky though, have to make sure it's -O0 and a proper debug
| compilation.
|
| I have line by line debugged raw assembly many times. It's
| just a pain to initially set up. Honestly not very
| different from c/c++ debugging once running.
| MaxBarraclough wrote:
| Or any other highly optimised numerical codebase. From a
| quick glance at OpenBLAS, it looks like they have a _lot_
| of microarchitecture-specific assembly code, with
| dispatching code to pick out the appropriate
| implementations.
|
| https://github.com/xianyi/OpenBLAS/blob/02ea3db8e720b0ffb3e
| 2...
|
| https://github.com/xianyi/OpenBLAS/blob/02ea3db8e720b0ffb3e
| 2...
| [deleted]
| phkahler wrote:
| >> This was with both `-O2` and `-O3`, and also with
| `-march=native`
|
| Until very recently GCC didn't do vectorization at -O2 usless
| you told it to.
| inetknght wrote:
| That's true. I definitely omitted a bunch of other flags that
| were added including the flags to turn on vectorizations
| BoardsOfCanada wrote:
| Seems like the compiler puts the test for the first loop before
| loading the constant the first time, and therefor needs to load
| it again before the second loop. So the "tradeoff" is that if
| neither loop runs it will load the constant zero times. Of course
| this isn't what a human would do but at least there is some kind
| of sliver of logic to it. (Like if vpbroadcastd was a 2000 cycle
| instruction this pattern might have made sense)
___________________________________________________________________
(page generated 2022-12-06 23:00 UTC)