[HN Gopher] The Alder Lake SHLX Anomaly
___________________________________________________________________
The Alder Lake SHLX Anomaly
Author : panic
Score : 223 points
Date : 2025-01-02 23:00 UTC (1 day ago)
(HTM) web link (tavianator.com)
(TXT) w3m dump (tavianator.com)
| aftbit wrote:
| Woah that's weird. Left shifting either takes 3 cycles or 1
| cycle, depending on how you initialize the shift count register?
|
| This patch from the article makes it take 1 cycle instead of 3:
|         - MOV RCX, 1
|         + MOV ECX, 1
|
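| (A minimal sketch of the kind of dependency-chain loop that shows
| the difference; this is a reconstruction with GNU inline asm, not
| the article's harness, and the function names, iteration count and
| timing are just illustrative:)
|
|       #include <stdint.h>
|       #include <stdio.h>
|       #include <time.h>
|
|       int main(void) {
|           uint64_t x = 1;
|           const long iters = 10000000;
|           struct timespec t0, t1;
|           clock_gettime(CLOCK_MONOTONIC, &t0);
|           for (long i = 0; i < iters; i++) {
|               /* 64-bit immediate form: the slow SHLX case on
|                  Alder Lake. Swap the movq for "movl $1, %%ecx"
|                  to see the fast case. */
|               asm volatile(
|                   "movq $1, %%rcx\n\t"
|                   "shlxq %%rcx, %0, %0\n\t"
|                   "shrxq %%rcx, %0, %0"   /* undo the shift */
|                   : "+r"(x) : : "rcx");
|           }
|           clock_gettime(CLOCK_MONOTONIC, &t1);
|           double ns = (t1.tv_sec - t0.tv_sec) * 1e9
|                     + (t1.tv_nsec - t0.tv_nsec);
|           printf("%.2f ns/iter\n", ns / iters);
|           return 0;
|       }
|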
| >It seems like SHLX performs differently depending on how the
| shift count register is initialized. If you use a 64-bit
| instruction with an immediate, performance is slow. This is also
| true for instructions like INC (which is similar to ADD with a 1
| immediate).
|
| Practically speaking, is this sort of uop-dependent optimization
| implemented by compilers? How do they do so?
| tavianator wrote:
| > Practically speaking, is this sort of uop-dependent
| optimization implemented by compilers?
|
| I suspect this specific thing is not; I don't think many people
| were aware of this anomaly until this week. I doubt the
| compilers are modeling it.
|
| But in general, compilers will have a machine model that
| predicts the cost of different instructions. The backend will
| try to generate instruction sequences with low cost, so if you
| include this anomaly in the model it will probably work around
| it automatically.
|
| This old example might interest you: a false dependency on the
| destination register for `popcnt` on some uarches led to
| compiler patches: https://stackoverflow.com/q/25078285/502399
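|
| (For reference, the shape of the workaround those patches emit; a
| sketch with GNU inline asm and a made-up helper name, not the
| actual compiler output:)
|
|       #include <stdint.h>
|
|       /* On some older uarches (Sandy Bridge era), popcnt has a
|          false dependency on its destination register. Zeroing the
|          destination with xor is a dependency-breaking idiom that
|          hides it. */
|       static inline uint64_t popcnt64(uint64_t x) {
|           uint64_t r;
|           asm("xorl %k0, %k0\n\t"   /* break the false output dep */
|               "popcntq %1, %0"
|               : "=&r"(r)
|               : "r"(x)
|               : "cc");
|           return r;
|       }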
| RossBencina wrote:
| user mshockwave posted this the other day in a comment, might
| be of interest/tangential relevance:
|
| Scheduling Model in LLVM - Part I: https://myhsu.xyz/llvm-sched-model-1
| loeg wrote:
| https://news.ycombinator.com/item?id=42555110
| tavianator wrote:
| One fun thing about this is that spilling+restoring the register
| will fix it, so if any kind of context switch happens (thread
| switch, page fault, interrupt, etc.), the register will get
| pushed to the stack and popped back from it, and the code
| suddenly gets 3x faster. Makes it a bit tricky to reproduce
| reliably, and led me down a few dead ends as I was writing this
| up.
| eigenform wrote:
| I wonder if this is a thing where the machine is trying to
| predict the actual value of the 'count' operand ...
| tavianator wrote:
| If it were a prediction/speculation thing, I would expect it
| to settle down long before 10,000 `SHLX`s are retired.
| dataflow wrote:
| Wow. Any chance you could share either an explanation or your
| benchmark code to show how you measured this reliably? I've
| never had to constrain benchmarks to within a time slice so I'm
| kind of fascinated how you got stable measurements out of that.
| Did you force a context switch prior to starting and then loop
| for a fixed number of times that wouldn't trigger another
| context switch? Or did you disable preemption on your thread,
| or something else?
| valleyer wrote:
| Yeah, even an interrupt (that doesn't result in a thread
| switch) would cause a spill to the stack, right? That seems
| hard to control.
| dataflow wrote:
| Yeah. I don't know how often interrupts happen to know how
| easy this would be to control for, but it certainly sounds
| like it could be tough. I suppose one way to mitigate these
| though is e.g. taking a min() of the execution times rather
| than the average?
| Bulat_Ziganshin wrote:
| except that we need max() here :)
| dataflow wrote:
| Huh, why max()? Generally microbenchmarks want min() (or
| close to it).
| Bulat_Ziganshin wrote:
| because THIS code becomes faster once RCX is saved and
| restored during the interrupt call
| dataflow wrote:
| Oh right, my bad, thanks. Totally escaped me when I wrote
| that :-)
| toast0 wrote:
| You should be able to prevent, or at least strictly limit,
| interrupts if you run a modern kernel, have a single-threaded
| test process, CPU-pin your test process to core X while
| pinning everything else away from core X (and from core X's
| SMT siblings, if enabled), and have no hardware interrupts
| routed to that core either.
|
| You might have to look into tickless kernels too, but I think
| that's pretty mainstream now. Tl;dr: if there's no contention
| for the core and there are no hardware interrupts, there's no
| need for a timer/scheduler tick on that core.
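|
| (A sketch of the in-process half of that, assuming Linux; the core
| number and helper name are arbitrary, and isolcpus/nohz_full/
| irqaffinity still have to be set on the kernel command line
| separately:)
|
|       #define _GNU_SOURCE
|       #include <sched.h>
|       #include <stdio.h>
|
|       /* Pin the calling thread to one core; combine with e.g.
|          isolcpus=3 nohz_full=3 irqaffinity=0-2 so nothing else
|          (tasks, ticks, IRQs) runs there. */
|       static int pin_to_core(int core) {
|           cpu_set_t set;
|           CPU_ZERO(&set);
|           CPU_SET(core, &set);
|           return sched_setaffinity(0, sizeof(set), &set);
|       }
|
|       int main(void) {
|           if (pin_to_core(3) != 0) {
|               perror("sched_setaffinity");
|               return 1;
|           }
|           /* ... run the benchmark loop here ... */
|           return 0;
|       }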
| tavianator wrote:
| Benchmark code is in the post! The trick is to limit the
| number of SHLX instructions before you re-initialize RCX.
| Doing it every 10,000 keeps it fresh enough. I didn't have to
| disable preemption or anything; perf shows 0 context switches
| and only ~50 page faults (probably all at startup).
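|
| (Roughly the shape of that structure, as described above; a sketch
| with a crude repeat macro, not the code from the post, and the
| unroll factor here is picked just for illustration:)
|
|       #include <stdint.h>
|
|       /* Crude compile-time repetition of an asm fragment. */
|       #define REP4(s)   s s s s
|       #define REP16(s)  REP4(REP4(s))
|       #define REP256(s) REP16(REP16(s))
|
|       /* One block: initialize the count register once, then run a
|          long dependent chain of SHLX/SHRX through it. Calling this
|          repeatedly re-initializes RCX every 512 shifts. */
|       static void shlx_block(uint64_t *x) {
|           asm volatile(
|               "movq $1, %%rcx\n\t"   /* slow form; movl = fast */
|               REP256("shlxq %%rcx, %0, %0\n\t"
|                      "shrxq %%rcx, %0, %0\n\t")
|               : "+r"(*x) : : "rcx");
|       }
|
|       int main(void) {
|           uint64_t x = 1;
|           for (long i = 0; i < 20000; i++)
|               shlx_block(&x);
|           return x == 1 ? 0 : 1;
|       }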
| amluto wrote:
| Wait a sec, isn't this backwards? The OS will restore thread
| state by loading all of RCX, which sounds like it should
| trigger the slow path.
|
| I wonder whether IRET fixes it. Can you try MOV RCX, 1; push an
| IRET frame; IRET; SHLX to see if IRET fixes it?
| tavianator wrote:
| > Wait a sec, isn't this backwards?
|
| No, if you push and pop RCX it's fast again. I mentioned this
| in the blog post: `mov rcx, 1` is slow but `mov rcx, 1; push
| rcx; pop rcx` is fast.
| amluto wrote:
| I think I missed that one in the pile of options. That's
| bizarre. I wonder if MOV from memory is like POP.
| BeeOnRope wrote:
| The slow state is when rcx has a special "hidden immediate"
| attached to it, from the `mov rcx, 1` or `add rcx, 1`. If the
| last operation on the register doesn't fit that narrow
| template, it is fast, so loading from memory in any way (pop,
| xrstor, a plain load, etc.) will be in fast mode.
| userbinator wrote:
| How about trying "xor ecx, ecx; inc ecx"? Or the even shorter
| "mov cl, 1"?
|
| _It is very strange to me that the instruction used to set the
| shift count register can make the SHLX instruction 3x slower._
|
| I suspect this is a width restriction in the bypass/forwarding
| network.
|
| _The 32-bit vs. 64-bit operand size distinction is especially
| surprising to me as SHLX only looks at the bottom 6 bits of the
| shift count._
|
| Unfortunately the dependency analysis circuitry seems not Intel-
| ligent enough to make that distinction.
| tavianator wrote:
| > How about trying "xor ecx, ecx; inc ecx"?
|
| Fast. But inc rcx is slow.
|
| > Or the even shorter "mov cl, 1"?
|
| Fast.
|
| > I suspect this is a width restriction in the
| bypass/forwarding network...
|
| I think we just found the explanation on Twitter:
| https://x.com/corsix/status/1874965887108976858
|
| Alder Lake adds support for mov/add/sub with small immediates
| to the _register renamer_. So `add rcx, 1` gets handled by the
| renamer, potentially with zero latency.
|
| Unfortunately shifts are slow when the shift count is renamed
| in this way. This leads to fun things like `mov rcx, 1024`
| being fast while `mov rcx, 1023` is slow: 1024 falls outside
| the renamer's immediate range, so it isn't renamed. I'll
| update the blog post in a bit.
| tptacek wrote:
| I got nothing here other than: this is very cool.
| eigenform wrote:
| Oooh, I forgot about this. Only thing I can think of is
| something like:
|
| 1. Assume that some values can be treated internally as
| "physical register plus a small immediate"
|
| 2. Assume that, in some cases, a physical register is known
| to be zero and the value can be represented as "zero plus a
| small immediate" (without reference to a physical register?)
|
| 3. Since 'count' is always expected to be <64, you
| technically don't need to use a physical register for it
| (since the immediate bits can always just be carried around
| with the name).
|
| 4. The 3-cycle case occurs when the physical register read
| cannot be optimized away??
|
| (Or maybe it's just that the whole shift operation can occur
| at rename in the fast case??)
| phire wrote:
| _> 4. The 3-cycle case occurs when the physical register
| read cannot be optimized away??_
|
| No... the 3-cycle case seems to be when the physical
| register read is optimized away. I think it's some kind of
| stupid implementation bug.
| eigenform wrote:
| Like, the scheduler is just waiting unnecessarily for
| both operands, despite the fact that 'count' has already
| been resolved at rename?
| phire wrote:
| The 3-cycle latency casts massive suspicion on the
| bypass network. But I don't see how the bypass network
| could be bugged without causing the incorrect result. So
| the scheduler doesn't know how to bypass this "shift with
| small immediate" micro op.
|
| Or maybe the bypass network is bugged, and what we are
| seeing is a chicken bit set by the microcode that
| disables the bypass network for this one micro op that
| does shift.
| eigenform wrote:
| > But I don't see how the bypass network could be bugged
| without causing the incorrect result.
|
| Maybe if they really rely on this kind of forwarding in
| many cases, it's not unreasonable to expect that latency
| can be generated by having to recover from "incorrect PRF
| read" (like I imagine there's also a case for recovery
| from "incorrect forwarding")
| phire wrote:
| Yeah, "incorrect PRF read" is something that might exist.
|
| I know modern CPUs will sometimes schedule uops that consume
| the result of a load instruction, assuming the load will hit
| the L1 cache. If the load actually missed L1, it's not going
| to find out until that uop tries to read the value coming in
| from L1 over the bypass network. So that uop needs to be
| aborted and rescheduled later. And I assume this is generic
| enough to catch any "incorrect forwarding", because there are
| other variable-latency instructions (like division) that
| would benefit from this optimistic scheduling.
|
| But my gut is to only have these checks on the bypass
| network, and only ever schedule PRF reads after you know
| the correct value has been stored.
| Bulat_Ziganshin wrote:
| Maybe the bypass network doesn't include these "constant
| registers"? A bit like Zen 5, where some 1-cycle SIMD ops
| execute in 2 cycles, probably due to shortcomings of the same
| network.
| phire wrote:
| Ok... That does make some sense.
|
| Essentially, Intel have allocated a bunch of fake renaming
| registers to hold all small immediates. Like how MIPS had a
| zero register, Intel now have over 1000 constant registers.
| These registers don't really exist, the immediate is just
| reconstituted at the execution unit.
|
| And I'm guessing these small immediates predate Golden Cove;
| it's probably how they have always represented shift-by-constant
| and any instruction with an 8-bit immediate (presumably at
| least as far back as Sandy Bridge). The only change is that
| the renamer can now execute simple movs/adds.
|
| So essentially these variable shifts are being converted to
| constant shifts on the fly. The problem is that there is no
| constant-shift version of shlx, so it wasn't in the test
| suite. There is probably some kind of stupid implementation
| bug in the scheduler that simply wasn't exposed until the
| renamer changed.
| tavianator wrote:
| I don't think there are a thousand constant registers. I
| think the renamer just represents (reg, 10-bit offset)
| pairs rather than just registers.
|
| Also the problem affects SHL as well as SHLX, I didn't
| realize until just now.
| phire wrote:
| The speculation of a "reg + 10-bit offset" representation
| feels wrong to me.
|
| That requires a whole bunch of extra 64-bit full-adders
| everywhere one of these pairs might be consumed (so
| realistically, on every single read port of the register
| file). 64-bit adders take quite a bit of latency, so you
| don't want extra adders on all your critical paths.
|
| In the case where it appears to be holding a reg + offset
| pair, what I think has actually happened is that the renamer
| (and/or uop fusion) has rewritten the uop to a 3-input add,
| with the offset as the third input.
|
| _> Also the problem affects SHL as well as SHLX, I didn't
| realize until just now._
|
| And presumably SHR/SHRX/SAR/SARX too?
| BeeOnRope wrote:
| Yeah, I am pretty sure that the renamer adds the tracked
| immediate(s) to the emitted uop, but that's not inconsistent
| with tracking "reg + offset" pairs, is it?
| eigenform wrote:
| I dunno, you could imagine it happens [speculatively?] in
| parallel, at the cost of a read port and adder for each
| op that can be renamed in a single cycle:
|
| 1. Start PRF reads at dispatch/rename
|
| 2. In the next cycle, you have the register value and can
| compute the 64-bit sum. At the same time, the scheduler is
| sending other operands [not subject to the optimization] to
| read ports.
|
| 3. In the next cycle, the results from both sets of
| operands are available
| phire wrote:
| Seems rather pointless to implement it like that. You
| would save space in the scheduler because the uops have
| been fused, but execution time and execution unit usage
| is the same as just executing the original ops before
| this optimisation was implemented.
|
| It also adds an extra cycle of scheduling latency, which
| may or may not be an issue (I really have no idea how far
| ahead these schedulers can schedule).
|
| If you think about the "1024 constant registers" approach: it
| allows you to have a small 10-bit adder in the renamer which
| handles long chains of mov/add/sub/inc/dec ops, as long as
| the chain stays within the range of 1024. This frees the main
| adders for bigger sums, or maybe you can power-gate them.
|
| And in the case where one of the inputs is larger than 1024,
| or its value is unknown during renaming, the renamer can
| still merge two two-input adds into a single three-input add
| uop.
| Bulat_Ziganshin wrote:
| An add3 operation will have 3 inputs, though. Do we have any
| other integer operations with three 64-bit inputs?
| dzaima wrote:
| You don't quite need a full 64-bit adder to materialize
| the proper value, as one argument is only a 10-bit int.
| So a 10-bit adder, a 54-bit increment, and muxing it with
| the original depending on the add carry.
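|
| (To make that concrete, a small self-checking sketch of the
| decomposition for an unsigned 10-bit immediate; the function name
| is made up:)
|
|       #include <assert.h>
|       #include <stdint.h>
|
|       /* Add a 10-bit immediate k to a 64-bit value r without a
|          full 64-bit adder: a 10-bit add for the low bits, a
|          54-bit incrementer for the high bits, and a mux on the
|          carry-out. */
|       static uint64_t add_small_imm(uint64_t r, uint64_t k) {
|           assert(k < 1024);
|           uint64_t low = (r & 0x3FF) + k;            /* 10-bit add   */
|           uint64_t hi  = r >> 10;                    /* upper 54 bits */
|           uint64_t sel = (low >> 10) ? hi + 1 : hi;  /* carry mux    */
|           return (sel << 10) | (low & 0x3FF);        /* == r + k     */
|       }
|
|       int main(void) {
|           uint64_t bases[] = {0, 0x3FF, 0x123456789ABCDEF0ULL, ~0ULL};
|           for (int i = 0; i < 4; i++)
|               for (uint64_t k = 0; k < 1024; k++)
|                   assert(add_small_imm(bases[i], k) == bases[i] + k);
|           return 0;
|       }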
|
| And, presumably, the OP shift case here is in fact a case
| of there not being a built-in immediate adder and thus a
| need for fixup uops being inserted to materialize it?
| tavianator wrote:
| Right. Actually it turns out it's 11 bits, since [-1024,
| 1023] are all supported by the immediate add renamer.
|
| In general I think people are overstating the delay of an
| additional 64-bit add on register file reads (though I'm
| not a hardware guy so maybe someone can correct me).
| There are carry-lookahead adders with O(log_2(n)) gate
| delays, i.e. 6 levels for 64 bits. Carry-save adders might
| also be relevant to how they can do multiple dependent adds
| with 0c latency.
|
| > And, presumably, the OP shift case here is in fact a
| case of there not being a built-in immediate adder and
| thus a need for fixup uops being inserted to materialize
| it?
|
| No, the perf counters show 1 uop dispatched/retired in
| both the slow and fast cases.
| dzaima wrote:
| Ah, good to know on the uop count. Still could be (or,
| well, has to be to some extent) the same concept, just
| pipelined within one uop.
| userbinator wrote:
| _So `add rcx, 1` gets handled by the renamer, potentially
| with zero latency.
|
| Unfortunately shifts are slow when the shift count is renamed
| in this way._
|
| That's because the value still has to make it to the
| shifter... and the renamer's adder output clearly takes a
| different and higher-latency path than the one which doesn't
| go through it, confirming my original suspicion that this is
| a limitation of the bypass/forwarding network.
|
| Ultimately it seems they are just playing around with where
| things happen; the addition can be done _here_ but the value
| ultimately needs to go _there_ for the shift, or the addition
| and shift can both happen _there_. One could envision a
| future change that puts the shifter in the renamer too.
| They're really constrained by the laws of physics.
| Bulat_Ziganshin wrote:
| We set RCX only once and then use it 10,000 times. The
| problem is not the slow calculation of RCX per se, but the
| slow shift once we get RCX from the renamer.
| monocasa wrote:
| I would have thought that the movs would have been dumped right
| into a ROB entry without even going through the bypass network,
| much like a zeroing idiom does.
| bhouston wrote:
| Shifts were not always fast. These old hacker news comments
| contain the details: https://news.ycombinator.com/item?id=2962770
| petermcneeley wrote:
| Indeed. Dynamic shifting was microcoded (not a uop!) on the
| PowerPC for gen3. However, shifting by immediate values was not.
|
| This leads to all sorts of strange performance workarounds.
| Mike Acton refers to it here:
| https://macton.smugmug.com/Other/2008-07-15-by-Eye-Fi/n-xmKD...
| BeeOnRope wrote:
| Worth noting that, whether intentional or not, this would be
| easy to miss and unlikely to move benchmark numbers, since
| compilers won't generate instructions like this: they would use
| the eax form, which is shorter and functionally equivalent.
|
| Even some assemblers will optimize this for you.
| tavianator wrote:
| It's possible to get gcc to generate sequences that would
| trigger this: https://godbolt.org/z/rYKqPxn7b
| loeg wrote:
| The godbolt paste I'm looking at shows GCC generating a SAL
| with 8 bit shift (CL), not a SHLX with 64-bit shift (RCX).
| tjalfi wrote:
| GCC will generate the _shlx_ instruction if you add the
| _-mbmi2_ flag to your build options. For example, you can
| see this in action here: https://godbolt.org/z/asb1fxos5.
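|
| (For example, a variable shift like the following typically
| compiles to SHLX at -O2 with -mbmi2; the exact code in those
| godbolt links may differ, and the function name here is made up:)
|
|       #include <stdint.h>
|
|       /* gcc -O2 -mbmi2: the variable shift is usually emitted as
|          shlx rather than shl with the count in CL. */
|       uint64_t shift_left(uint64_t x, uint64_t n) {
|           return x << (n & 63);
|       }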
| BeeOnRope wrote:
| It's true, I was considering only the original mov rax, 1
| case, which I'm pretty sure compilers don't generate:
| it's just a useless encoding of mov eax, 1.
|
| Given that 64-bit immediate math also causes this, though,
| compilers might generate it (I think this is a minor missed
| optimization by gcc: it could have used add eax, 1 instead).
| tavianator wrote:
| SHL suffers from the same latency issue as SHLX, it turns
| out.
| loeg wrote:
| SAL as well? (And anyway this example seems to be
| shifting by the low 8 bits of RCX (CL), not the full
| width register.)
| tavianator wrote:
| SAL and SHL are synonyms; they have the same encoding. SHL
| only accepts CL as the count register; there's no other form
| that takes a variable shift count.
| loeg wrote:
| Ahh, thanks, I didn't make that connection.
| Bulat_Ziganshin wrote:
| Intel shifts mask out the higher bits of the count anyway,
| which hurts sometimes, e.g. when you need to shift by 1..64
| bits.
| eigenform wrote:
| Funny thing to double-check: are these encodings correctly
| specifying a 64-bit operand? Maybe everyone's compilers are
| subtly wrong D:
|
| edit: It looks like VEX.W _is_ set in the encoding from the
| uops.info tests ¯\\_(ツ)_/¯
| BeeOnRope wrote:
| To rule out alignment you should add padding to one so the
| two variations have the same alignment in their long run of
| SHLX (I don't actually think it's alignment-related, though).
| tavianator wrote:
| I did try aligning the loop; it didn't matter.
| BeeOnRope wrote:
| Does this also occur with other 3-argument instructions like
| ANDN?
| tavianator wrote:
| ANDN is fast. The issue looks specific to shift counts. SHL is
| also slow.
| crest wrote:
| So the issue is that the frontend, which "eliminates" those
| operations, doesn't make the resulting value available to the
| backend if it fits Intel's ad-hoc definition of a short
| immediate (which shift counts do), and since there is no
| SHLX-with-immediate uop, the small value has to be expanded
| to a full 32 or 64 bits and requested from the frontend,
| which adds two cycles of latency?
|
| Has anyone run tests with >=3 interleaved SHLX dependency
| chains in a loop? Does it "just" have 3-cycle latency, or
| also less than 1 operation/cycle sustained throughput?
| Because if the pipeline stalls, that would be even more
| annoying for existing optimised code.
|
| Is the regression limited to Alder Lake P-cores, or is it
| also present in later (refreshed) cores?
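|
| (One way to set up the interleaved-chain test asked about above:
| three independent SHLX chains sharing the slow-initialized count
| register. A sketch with GNU inline asm and a made-up function
| name, untested on real hardware:)
|
|       #include <stdint.h>
|
|       /* If the slow case is purely extra latency, the three
|          chains should overlap and sustained throughput should
|          stay near one shift per cycle; a real stall would show
|          up here too. */
|       void three_chains(uint64_t *a, uint64_t *b, uint64_t *c,
|                         long iters) {
|           asm volatile(
|               "movq $1, %%rcx\n"     /* 64-bit imm: slow-path init */
|               "1:\n\t"
|               "shlxq %%rcx, %0, %0\n\t"
|               "shlxq %%rcx, %1, %1\n\t"
|               "shlxq %%rcx, %2, %2\n\t"
|               "dec %3\n\t"
|               "jnz 1b"
|               : "+r"(*a), "+r"(*b), "+r"(*c), "+r"(iters)
|               : : "rcx", "cc");
|       }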
| orlp wrote:
| Seems like LLVM knows about this quirk (note how it suddenly uses
| eax instead of rax for the multiply):
| https://rust.godbolt.org/z/8jh7YPhz4.
| stassats wrote:
| Using EAX is one byte shorter, so it might be doing that
| inadvertently.
| rincebrain wrote:
| Since this seems like an optimization going awry somewhere, I
| wonder if there's a chicken bit that disables it, and if so, how
| broad the impact of disabling it is...
| eqvinox wrote:
| Hmm... does this have any impact on time-constant cryptographic
| algorithm implementations? In particular the wider "addition in
| register renamer" story?
| dzaima wrote:
| Perhaps not - as it happens at the rename stage and thus only
| with immediate values, it's all fixed behavior for a given
| instruction stream; with branching you can of course get
| dynamically different renamed immediates, but, as that's
| branchy code, it's not time-constant anyway.
| LegionMammal978 wrote:
| I don't think it should, if the performance change only depends
| on immediates hardcoded into the binary, and not the variable
| data being processed. Maybe if the algorithm branched on the
| variable data and had two separate ways of loading something,
| but branching on variable data is the #1 thing you're supposed
| to avoid in the first place.
| atq2119 wrote:
| To expand on this a bit, the hypothesis stated elsewhere is
| that the register renamer stores "physical register + 11-bit
| signed immediate".
|
| It should be possible to construct scenarios where a logical
| register is mapped to "physical register + immediate" where
| the immediate depends on the path taken through the program
| so far. The easiest example would be an if-statement that
| optionally adds 1023 to a register, and then an add of 1
| after the if-statement. The add of 1 overflows the immediate
| stored in the register renamer if and only if the if-
| condition was true.
|
| As you said, that requires taking different control flow
| paths through the program, which is something to be avoided
| by constant-time algorithms anyway.
|
| One potential twist: This shows that it may be important that
| the interface between the constant-time algorithm and the
| rest of the program goes entirely through memory and not
| through registers. If the interface does go through
| registers, it's possible that variable register-renamer-
| immediates leak from the outside world into the constant-time
| algorithm. (This admittedly feels like a bit of a stretch,
| but who knows.)
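|
| (The if-statement example above, written out; this is purely
| hypothetical and only meant to illustrate the scenario, not a
| demonstrated leak:)
|
|       #include <stdint.h>
|
|       /* Sketch of the scenario described above: if the renamer
|          is tracking x as "physical register + small immediate",
|          the +1023 path leaves the tracked immediate at its
|          maximum, so the following +1 overflows it only when the
|          branch was taken. The data-dependent branch itself is
|          already forbidden in constant-time code, which is the
|          point made above. */
|       uint64_t renamer_imm_demo(uint64_t x, int secret) {
|           if (secret)
|               x += 1023;
|           x += 1;
|           return x;
|       }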
| juancn wrote:
| Code alignment? I mean, the different instructions change the
| alignment of the rest of the code.
|
|   - 64-bit register:
|         0:  48 c7 c1 01 00 00 00     mov rcx,0x1
|   - 32-bit register:
|         0:  b9 01 00 00 00           mov ecx,0x1
|
| It should be easy to test by adding a couple of NOPs to the
| fast version:
|
|         0:  b9 01 00 00 00           mov ecx,0x1
|         5:  90                       nop
|         6:  90                       nop
|
| and see if it regresses again.
|
| I don't have an Alder Lake to test on.
| BeeOnRope wrote:
| https://news.ycombinator.com/item?id=42582623
| tavianator wrote:
| In general I'd be very surprised if code alignment had much
| effect on a 10,000-instruction block, especially since the
| instruction is 5 bytes long, so it will quickly unalign
| itself.
___________________________________________________________________
(page generated 2025-01-03 23:01 UTC)