[HN Gopher] The Alder Lake SHLX Anomaly
       ___________________________________________________________________
        
       The Alder Lake SHLX Anomaly
        
       Author : panic
       Score  : 223 points
        Date   : 2025-01-02 23:00 UTC (1 day ago)
        
 (HTM) web link (tavianator.com)
 (TXT) w3m dump (tavianator.com)
        
       | aftbit wrote:
        | Woah, that's weird. Left shifting either takes 3 cycles or 1
        | cycle, depending on how you initialize the shift count register?
       | 
       | This patch from the article makes it take 1 cycle instead of 3:
        |     -        MOV RCX, 1
        |     +        MOV ECX, 1
       | 
        | > It seems like SHLX performs differently depending on how the
       | shift count register is initialized. If you use a 64-bit
       | instruction with an immediate, performance is slow. This is also
       | true for instructions like INC (which is similar to ADD with a 1
       | immediate).
       | 
       | Practically speaking, is this sort of uop-dependent optimization
       | implemented by compilers? How do they do so?
        
         | tavianator wrote:
         | > Practically speaking, is this sort of uop-dependent
         | optimization implemented by compilers?
         | 
         | I suspect this specific thing is not; I don't think many people
         | were aware of this anomaly until this week. I doubt the
         | compilers are modeling it.
         | 
         | But in general, compilers will have a machine model that
         | predicts the cost of different instructions. The backend will
         | try to generate instruction sequences with low cost, so if you
         | include this anomaly in the model it will probably work around
         | it automatically.
         | 
         | This old example might interest you: a false dependency on the
         | destination register for `popcnt` on some uarches led to
         | compiler patches: https://stackoverflow.com/q/25078285/502399
        
         | RossBencina wrote:
          | User mshockwave posted this the other day in a comment; it
          | might be of interest / tangentially relevant:
         | 
          | Scheduling Model in LLVM - Part I
          | https://myhsu.xyz/llvm-sched-model-1
        
           | loeg wrote:
           | https://news.ycombinator.com/item?id=42555110
        
       | tavianator wrote:
       | One fun thing about this is that spilling+restoring the register
       | will fix it, so if any kind of context switch happens (thread
       | switch, page fault, interrupt, etc.), the register will get
       | pushed to the stack and popped back from it, and the code
       | suddenly gets 3x faster. Makes it a bit tricky to reproduce
       | reliably, and led me down a few dead ends as I was writing this
       | up.
        
         | eigenform wrote:
         | I wonder if this is a thing where the machine is trying to
         | predict the actual value of the 'count' operand ...
        
           | tavianator wrote:
           | If it were a prediction/speculation thing, I would expect it
           | to settle down long before 10,000 `SHLX`s are retired.
        
         | dataflow wrote:
         | Wow. Any chance you could share either an explanation or your
         | benchmark code to show how you measured this reliably? I've
         | never had to constrain benchmarks to within a time slice so I'm
         | kind of fascinated how you got stable measurements out of that.
         | Did you force a context switch prior to starting and then loop
         | for a fixed number of times that wouldn't trigger another
         | context switch? Or did you disable preemption on your thread,
         | or something else?
        
           | valleyer wrote:
           | Yeah, even an interrupt (that doesn't result in a thread
           | switch) would cause a spill to the stack, right? That seems
           | hard to control.
        
             | dataflow wrote:
              | Yeah. I don't know how often interrupts happen, so I'm not
              | sure how easy this would be to control for, but it
              | certainly sounds like it could be tough. I suppose one way
              | to mitigate these, though, is e.g. taking a min() of the
              | execution times rather than the average?
        
               | Bulat_Ziganshin wrote:
               | except that we need max() here :)
        
               | dataflow wrote:
               | Huh, why max()? Generally microbenchmarks want min() (or
               | close to it).
        
               | Bulat_Ziganshin wrote:
               | because THIS code becomes faster once RCX is saved and
               | restored during the interrupt call
        
               | dataflow wrote:
               | Oh right, my bad, thanks. Totally escaped me when I wrote
               | that :-)
        
             | toast0 wrote:
              | You should be able to prevent, or at least strictly limit,
              | interrupts: run a modern kernel, make the test process
              | single-threaded, CPU-pin it to core X, pin everything else
              | away from core X (and away from core X's SMT siblings, if
              | enabled), and make sure no hardware interrupts are routed
              | to that core either.
              | 
              | You might have to look into tickless kernels too, but I
              | think that's pretty mainstream now. TL;DR: if there's no
              | contention for the core and there are no hardware
              | interrupts, there's no need for a timer/scheduler tick on
              | that core.
        
           | tavianator wrote:
           | Benchmark code is in the post! The trick is to limit the
           | number of SHLX instructions before you re-initialize RCX.
           | Doing it every 10,000 keeps it fresh enough. I didn't have to
           | disable preemption or anything, perf shows 0 context switches
           | and only ~50 page faults (probably all at startup).
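            | 
            | If you just want something quick to play with, a stripped-
            | down sketch along these lines shows the effect (GCC/Clang-
            | style inline asm; not the exact harness from the post, and
            | RDTSC counts TSC ticks rather than core cycles):
            | 
            |     #include <stdint.h>
            |     #include <stdio.h>
            | 
            |     static inline uint64_t rdtsc(void) {
            |         uint32_t lo, hi;
            |         __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
            |         return ((uint64_t)hi << 32) | lo;
            |     }
            | 
            |     int main(void) {
            |         uint64_t x = 1;
            |         uint64_t t0 = rdtsc();
            |         for (int i = 0; i < 1000; i++) {
            |             /* re-init RCX, then a chain of 10,000
            |              * dependent SHLXs before touching it again */
            |             __asm__ volatile(
            |                 "mov $1, %%rcx\n\t"  /* ecx = fast */
            |                 ".rept 10000\n\t"
            |                 "shlx %%rcx, %0, %0\n\t"
            |                 ".endr"
            |                 : "+r"(x) : : "rcx");
            |         }
            |         uint64_t t1 = rdtsc();
            |         printf("%.2f ticks/shift, x = %llu\n",
            |                (t1 - t0) / 1e7, (unsigned long long)x);
            |         return 0;
            |     }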
        
         | amluto wrote:
         | Wait a sec, isn't this backwards? The OS will restore thread
         | state by loading all of RCX, which sounds like it should
         | trigger the slow path.
         | 
         | I wonder whether IRET fixes it. Can you try MOV RCX, 1; push an
         | IRET frame; IRET; SHLX to see if IRET fixes it?
        
           | tavianator wrote:
           | > Wait a sec, isn't this backwards?
           | 
           | No, if you push and pop RCX it's fast again. I mentioned this
           | in the blog post, `mov rcx, 1` is slow but `mov rcx, 1; push
           | rcx; pop rcx` is fast.
        
             | amluto wrote:
             | I think I missed that one in the pile of options. That's
             | bizarre. I wonder if MOV from memory is like POP.
        
           | BeeOnRope wrote:
           | The slow state is when rcx has a special "hidden immediate"
           | attached it, from the `mov rcx, 1` or `add rcx, 1`. If the
           | last option on the register doesn't fit that narrow template,
           | it is fast, so loading from memory in any way, like pop,
           | xrestor, plan load, etc, will be in fast mode.
        
       | userbinator wrote:
       | How about trying "xor ecx, ecx; inc ecx"? Or the even shorter
       | "mov cl, 1"?
       | 
       |  _It is very strange to me that the instruction used to set the
       | shift count register can make the SHLX instruction 3x slower._
       | 
       | I suspect this is a width restriction in the bypass/forwarding
       | network.
       | 
       |  _The 32-bit vs. 64-bit operand size distinction is especially
       | surprising to me as SHLX only looks at the bottom 6 bits of the
       | shift count._
       | 
       | Unfortunately the dependency analysis circuitry seems not Intel-
       | ligent enough to make that distinction.
        
         | tavianator wrote:
         | > How about trying "xor ecx, ecx; inc ecx"?
         | 
         | Fast. But inc rcx is slow.
         | 
         | > Or the even shorter "mov cl, 1"?
         | 
         | Fast.
         | 
         | > I suspect this is a width restriction in the
         | bypass/forwarding network...
         | 
         | I think we just found the explanation on Twitter:
         | https://x.com/corsix/status/1874965887108976858
         | 
         | Alder Lake adds support for mov/add/sub with small immediates
         | to the _register renamer_. So `add rcx, 1` gets handled by the
         | renamer, potentially with zero latency.
         | 
         | Unfortunately shifts are slow when the shift count is renamed
         | in this way. This leads to fun things like `mov rcx, 1024`
         | being fast while `mov rcx, 1023` is slow. I'll update the blog
         | post in a bit.
        
           | tptacek wrote:
           | I got nothing here other than: this is very cool.
        
           | eigenform wrote:
           | Oooh, I forgot about this. Only thing I can think of is
           | something like:
           | 
           | 1. Assume that some values can be treated internally as
           | "physical register plus a small immediate"
           | 
           | 2. Assume that, in some cases, a physical register is known
           | to be zero and the value can be represented as "zero plus a
           | small immediate" (without reference to a physical register?)
           | 
           | 3. Since 'count' is always expected to be <64, you
           | technically don't need to use a physical register for it
           | (since the immediate bits can always just be carried around
           | with the name).
           | 
           | 4. The 3-cycle case occurs when the physical register read
           | cannot be optimized away??
           | 
           | (Or maybe it's just that the whole shift operation can occur
           | at rename in the fast case??)
        
             | phire wrote:
             | _> 4. The 3-cycle case occurs when the physical register
             | read cannot be optimized away??_
             | 
             | No... the 3-cycle case seems to be when the physical
             | register read is optimized away. I think it's some kind of
             | stupid implementation bug.
        
               | eigenform wrote:
               | Like, the scheduler is just waiting unnecessarily for
               | both operands, despite the fact that 'count' has already
               | been resolved at rename?
        
               | phire wrote:
                | The 3-cycle latency casts massive suspicion on the
                | bypass network. But I don't see how the bypass network
                | could be bugged without producing incorrect results. So
                | maybe the scheduler just doesn't know how to bypass this
                | "shift with small immediate" micro-op.
                | 
                | Or maybe the bypass network is bugged, and what we are
                | seeing is a chicken bit set by the microcode that
                | disables the bypass network for this one shift micro-op.
        
               | eigenform wrote:
               | > But I don't see how the bypass network could be bugged
               | without causing the incorrect result.
               | 
                | Maybe, if they really rely on this kind of forwarding in
                | many cases, it's not unreasonable to expect that the
                | extra latency comes from having to recover from an
                | "incorrect PRF read" (like I imagine there's also a
                | recovery case for "incorrect forwarding").
        
               | phire wrote:
               | Yeah, "incorrect PRF read" is something that might exist.
               | 
                | I know modern CPUs will sometimes schedule uops that
                | consume the result of a load instruction, on the
                | assumption that the load will hit L1 cache. If the load
                | actually missed L1, the CPU isn't going to find out until
                | that uop tries to read the value coming in from L1 over
                | the bypass network. So that uop needs to be aborted and
                | rescheduled later. And I assume this is generic enough to
                | catch any "incorrect forwarding", because there are other
                | variable-latency instructions (like division) that would
                | benefit from this optimistic scheduling.
                | 
                | But my gut is to only have these checks on the bypass
                | network, and to only ever schedule PRF reads after you
                | know the correct value has been stored.
        
               | Bulat_Ziganshin wrote:
                | Maybe the bypass network doesn't include these "constant
                | registers"? A bit like Zen 5, where some 1-cycle SIMD ops
                | are executed in 2 cycles, probably due to shortcomings of
                | the same network.
        
           | phire wrote:
           | Ok... That does make some sense.
           | 
           | Essentially, Intel have allocated a bunch of fake renaming
           | registers to hold all small immediates. Like how MIPS had a
           | zero register, Intel now have over 1000 constant registers.
            | These registers don't really exist; the immediate is just
            | reconstituted at the execution unit.
           | 
            | And I'm guessing these small immediates predate Golden Cove;
            | it's probably how they have always represented shift-by-
            | constant and any instruction with an 8-bit immediate
            | (presumably at least as far back as Sandy Bridge). The only
            | change is that the renamer can now execute simple movs/adds.
           | 
            | So essentially these variable shifts are being converted to
            | constant shifts on the fly. The problem is that there is no
            | constant-shift version of shlx, so it wasn't in the test
            | suite. There is probably some kind of stupid implementation
            | bug in the scheduler that simply wasn't exposed until the
            | renamer changes landed.
        
             | tavianator wrote:
             | I don't think there are a thousand constant registers. I
             | think the renamer just represents (reg, 10-bit offset)
             | pairs rather than just registers.
             | 
             | Also the problem affects SHL as well as SHLX, I didn't
             | realize until just now.
        
               | phire wrote:
               | The speculation of a "reg + 10-bit offset" representation
               | feels wrong to me.
               | 
               | That requires a whole bunch of extra 64-bit full-adders
               | everywhere one of these pairs might be consumed (so
               | realistically, on every single read port of the register
               | file). 64-bit adders take quite a bit of latency, so you
               | don't want extra adders on all your critical paths.
               | 
                | In the case where it appears to be holding a reg + offset
                | pair, what I think has actually happened is that the
                | renamer (and/or uop fusion) has rewritten the uop to a
                | 3-input add, with the offset as the third input.
               | 
                |  _> Also the problem affects SHL as well as SHLX, I
                | didn't realize until just now._
               | 
               | And presumably SHR/SHRX/SAR/SARX too?
        
               | BeeOnRope wrote:
                | Yeah, I am pretty sure that the renamer adds the tracked
                | immediate(s) to the emitted uop, but that's not
                | inconsistent with tracking "(reg, offset) pairs", is it?
        
               | eigenform wrote:
               | I dunno, you could imagine it happens [speculatively?] in
               | parallel, at the cost of a read port and adder for each
               | op that can be renamed in a single cycle:
               | 
               | 1. Start PRF reads at dispatch/rename
               | 
                | 2. In the next cycle, you have the PRF read result and
                | compute the 64-bit sum. At the same time, the scheduler
                | is sending other operands [not subject to the
                | optimization] to read ports.
               | 
               | 3. In the next cycle, the results from both sets of
               | operands are available
        
               | phire wrote:
               | Seems rather pointless to implement it like that. You
               | would save space in the scheduler because the uops have
               | been fused, but execution time and execution unit usage
               | is the same as just executing the original ops before
               | this optimisation was implemented.
               | 
               | It also adds an extra cycle of scheduling latency, which
               | may or may not be an issue (I really have no idea how far
               | ahead these schedulers can schedule).
               | 
                | If you think about the "1024 constant registers"
                | approach: it allows you to have a small 10-bit adder in
                | the renamer which handles long chains of
                | mov/add/sub/inc/dec ops, as long as the chain stays
                | within a range of 1024. This frees the main adders for
                | bigger sums, or maybe you can power-gate them.
                | 
                | And in the case where one of the inputs is larger than
                | 1024, or its value is unknown during renaming, the
                | renamer can still merge two two-input adds into a single
                | three-input add uop.
        
               | Bulat_Ziganshin wrote:
                | An add3 operation will have 3 inputs, though. Do we have
                | other integer operations with three 64-bit inputs?
        
               | dzaima wrote:
               | You don't quite need a full 64-bit adder to materialize
               | the proper value, as one argument is only a 10-bit int.
               | So a 10-bit adder, a 54-bit increment, and muxing it with
               | the original depending on the add carry.
               | 
                | And, presumably, the shift case in the OP is one where
                | there is no built-in immediate adder, and thus fixup uops
                | would need to be inserted to materialize the value?
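                | 
                | Very roughly, in C, for the unsigned case only (a
                | negative immediate would need a borrow/decrement path
                | instead):
                | 
                |     #include <stdint.h>
                | 
                |     /* sketch of the idea only, not how the
                |      * hardware is documented to work */
                |     uint64_t add_imm10(uint64_t reg, uint32_t imm)
                |     {
                |         /* 10-bit add of the low bits */
                |         uint32_t lo = (uint32_t)(reg & 0x3FF)
                |                     + (imm & 0x3FF);
                |         /* upper 54 bits, +1 on carry out */
                |         uint64_t hi = reg >> 10;
                |         if ((lo >> 10) & 1)
                |             hi += 1;
                |         /* mux the halves back together */
                |         return (hi << 10) | (lo & 0x3FF);
                |     }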
        
               | tavianator wrote:
               | Right. Actually it turns out it's 11 bits, since [-1024,
               | 1023] are all supported by the immediate add renamer.
               | 
               | In general I think people are overstating the delay of an
               | additional 64-bit add on register file reads (though I'm
               | not a hardware guy so maybe someone can correct me).
                | There are carry-lookahead adders with log2(n) = 6 gate
                | delays for n = 64 bits. Carry-save adders might also be
                | relevant to how they can do multiple dependent adds with
                | 0-cycle latency.
               | 
               | > And, presumably, the OP shift case here is in fact a
               | case of there not being a built-in immediate adder and
               | thus a need for fixup uops being inserted to materialize
               | it?
               | 
               | No, the perf counters show 1 uop dispatched/retired in
               | both the slow and fast cases.
        
               | dzaima wrote:
               | Ah, good to know on the uop count. Still could be (or,
               | well, has to be to some extent) the same concept, just
               | pipelined within one uop.
        
           | userbinator wrote:
           | _So `add rcx, 1` gets handled by the renamer, potentially
           | with zero latency.
           | 
           | Unfortunately shifts are slow when the shift count is renamed
           | in this way._
           | 
           | That's because the value still has to make it to the
           | shifter... and the renamer's adder output clearly takes a
           | different and higher-latency path than the one which doesn't
           | go through it, confirming my original suspicion that this is
           | a limitation of the bypass/forwarding network.
           | 
           | Ultimately it seems they are just playing around with where
           | things happen; the addition can be done _here_ but the value
           | ultimately needs to go _there_ for the shift, or the addition
           | and shift can both happen _there_. One could envision a
            | future change that puts the shifter in the renamer too.
            | They're really constrained by the laws of physics.
        
             | Bulat_Ziganshin wrote:
              | We set CX only once and then use it 10,000 times. The
              | problem is not the slow calculation of CX per se, but the
              | slow shift once we get CX from the renamer.
        
         | monocasa wrote:
         | I would have thought that the movs would have been dumped right
          | into a ROB entry without even going through the bypass network,
         | much like a zeroing idiom does.
        
       | bhouston wrote:
        | Shifts were not always fast. These old Hacker News comments
       | contain the details: https://news.ycombinator.com/item?id=2962770
        
         | petermcneeley wrote:
          | Indeed. Dynamic shifting was microcoded (not uop!) on the
          | PowerPC for gen3. However, shifting with immediate values was
          | not.
         | 
         | This leads to all sorts of strange performance workarounds.
         | Mike Acton refers to it here:
         | https://macton.smugmug.com/Other/2008-07-15-by-Eye-Fi/n-xmKD...
        
       | BeeOnRope wrote:
        | Worth noting that, whether intentional or not, this would be
        | easy to miss and unlikely to move benchmark numbers, since
        | compilers won't generate instructions like this: they would use
        | the eax form, which is 1 byte shorter and functionally
        | equivalent.
       | 
       | Even some assemblers will optimize this for you.
        
         | tavianator wrote:
         | It's possible to get gcc to generate sequences that would
         | trigger this: https://godbolt.org/z/rYKqPxn7b
        
           | loeg wrote:
            | The godbolt paste I'm looking at shows GCC generating a SAL
            | with an 8-bit shift count (CL), not a SHLX with a 64-bit
            | shift count (RCX).
        
             | tjalfi wrote:
             | GCC will generate the _shlx_ instruction if you add the
             | _-mbmi2_ flag to your build options. For example, you can
             | see this in action here: https://godbolt.org/z/asb1fxos5.
        
               | BeeOnRope wrote:
               | It's true, I was considering only the original mov rax, 1
               | case, which I'm pretty sure compilers don't generate:
               | it's just a useless encoding of mov eax, 1.
               | 
                | Given that 64-bit immediate math also causes this,
                | though, compilers might generate it (I think this is a
                | minor missed optimization by gcc: it could have used add
                | eax, 1 instead).
        
             | tavianator wrote:
              | SHL suffers from the same latency issue as SHLX, it turns
              | out.
        
               | loeg wrote:
               | SAL as well? (And anyway this example seems to be
               | shifting by the low 8 bits of RCX (CL), not the full
               | width register.)
        
               | tavianator wrote:
                | SAL and SHL are synonyms; they have the same encoding.
                | SHL only accepts CL as the count register; there's no
                | other form that takes a variable shift count.
        
               | loeg wrote:
               | Ahh, thanks, I didn't make that connection.
        
               | Bulat_Ziganshin wrote:
                | Intel shifts mask out the higher bits of CL anyway; this
                | sometimes hurts, e.g. when you need to shift by 1..64
                | bits.
        
         | eigenform wrote:
         | Funny thing to double-check: are these encodings correctly
         | specifying a 64-bit operand? Maybe everyone's compilers are
         | subtly wrong D:
         | 
          | edit: It looks like VEX.W _is_ set in the encoding from the
          | uops.info tests ¯\_(ツ)_/¯
        
       | BeeOnRope wrote:
        | To rule out alignment you should add padding to one so the two
        | variations have the same alignment in their long run of SHLX (I
        | don't actually think it's alignment-related, though).
        
         | tavianator wrote:
          | I did try aligning the loop; it didn't matter.
        
       | BeeOnRope wrote:
       | Does this also occur with other 3-argument instructions like
       | ANDN?
        
         | tavianator wrote:
         | ANDN is fast. The issue looks specific to shift counts. SHL is
         | also slow.
        
       | crest wrote:
        | So the issue is that the frontend which "eliminates" those
        | operations doesn't make the resulting values available to the
        | backend if they fit Intel's ad-hoc definition of a short
        | immediate (which shift values do), and since there is no SHLX-
        | with-immediate uop, the small value has to be expanded to a full
        | 32 or 64 bits and requested from the frontend, which adds two
        | cycles of latency?
       | 
        | Has anyone run tests with >=3 interleaved SHLX dependency chains
        | in a loop? Does it "just" have 3-cycle latency, or does sustained
        | throughput also drop below 1 operation/cycle? Because if the
        | pipeline stalls, that would be even more annoying for existing
        | optimised code. (A sketch of the kind of test I mean is at the
        | end of this comment.)
       | 
        | Is the regression limited to only Alder Lake P-cores, or is it
        | also present in later (refreshed) cores?
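        | 
        | For the interleaved-chain question, something along these lines
        | is what I have in mind (untested sketch, GCC-style inline asm;
        | the 10,000 repeat count is just borrowed from upthread):
        | 
        |     #include <stdint.h>
        | 
        |     /* Three independent SHLX chains all sharing one
        |      * "renamed-immediate" RCX.  If only latency is
        |      * affected, the chains should overlap and still
        |      * sustain at least 1 shift per cycle. */
        |     uint64_t chains3(uint64_t a, uint64_t b, uint64_t c)
        |     {
        |         __asm__ volatile(
        |             "mov $1, %%rcx\n\t"
        |             ".rept 10000\n\t"
        |             "shlx %%rcx, %0, %0\n\t"
        |             "shlx %%rcx, %1, %1\n\t"
        |             "shlx %%rcx, %2, %2\n\t"
        |             ".endr"
        |             : "+r"(a), "+r"(b), "+r"(c)
        |             : : "rcx");
        |         return a ^ b ^ c;
        |     }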
        
       | orlp wrote:
       | Seems like LLVM knows about this quirk (note how it suddenly uses
       | eax instead of rax for the multiply):
       | https://rust.godbolt.org/z/8jh7YPhz4.
        
         | stassats wrote:
         | Using EAX is one byte shorter, so it might be doing that
         | inadvertently.
        
       | rincebrain wrote:
       | Since this seems like an optimization going awry somewhere, I
       | wonder if there's a chicken bit that disables it, and if so, how
       | broad the impact of disabling it is...
        
       | eqvinox wrote:
        | Hmm... does this have any impact on constant-time cryptographic
        | algorithm implementations? In particular the wider "addition in
        | the register renamer" story?
        
         | dzaima wrote:
         | Perhaps not - as it happens at the rename stage and thus only
         | with immediate values, it's all fixed behavior for a given
         | instruction stream; with branching you can of course get
         | dynamically different renamed immediates, but, as that's
          | branchy code, it's not constant-time anyway.
        
         | LegionMammal978 wrote:
         | I don't think it should, if the performance change only depends
         | on immediates hardcoded into the binary, and not the variable
         | data being processed. Maybe if the algorithm branched on the
         | variable data and had two separate ways of loading something,
         | but branching on variable data is the #1 thing you're supposed
         | to avoid in the first place.
        
           | atq2119 wrote:
           | To expand on this a bit, the hypothesis stated elsewhere is
           | that the register renamer stores "physical register + 11-bit
           | signed immediate".
           | 
           | It should be possible to construct scenarios where a logical
           | register is mapped to "physical register + immediate" where
           | the immediate depends on the path taken through the program
           | so far. The easiest example would be an if-statement that
           | optionally adds 1023 to a register, and then an add of 1
           | after the if-statement. The add of 1 overflows the immediate
           | stored in the register renamer if and only if the if-
           | condition was true.
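            | 
            | In code, the scenario I have in mind is roughly this
            | (illustrative only; whether a compiler actually lowers it to
            | two immediate adds on a renamed register is a separate
            | question):
            | 
            |     #include <stdint.h>
            | 
            |     /* hypothetical sketch: the final +1 overflows
            |      * the renamer's 11-bit immediate only on the
            |      * path where the branch was taken */
            |     uint64_t example(uint64_t x, int secret)
            |     {
            |         if (secret)   /* data-dependent branch */
            |             x += 1023;
            |         x += 1;
            |         return x;
            |     }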
           | 
           | As you said, that requires taking different control flow
           | paths through the program, which is something to be avoided
           | by constant-time algorithms anyway.
           | 
           | One potential twist: This shows that it may be important that
           | the interface between the constant-time algorithm and the
           | rest of the program goes entirely through memory and not
           | through registers. If the interface does go through
           | registers, it's possible that variable register-renamer-
           | immediates leak from the outside world into the constant-time
           | algorithm. (This admittedly feels like a bit of a stretch,
           | but who knows.)
        
       | juancn wrote:
       | Code alignment? I mean, the different instructions change the
       | alignment of the rest of the code.
       | 
        | - 64 bit register
        | 
        |     0:  48 c7 c1 01 00 00 00    mov    rcx,0x1
        | 
        | - 32 bit register
        | 
        |     0:  b9 01 00 00 00          mov    ecx,0x1
       | 
        | It should be easy to test by adding a couple of NOPs to the fast
        | version:
        | 
        |     0:  b9 01 00 00 00          mov    ecx,0x1
        |     5:  90                      nop
        |     6:  90                      nop
       | 
       | and see if it regresses again.
       | 
       | I don't have an Alder Lake to test on.
        
         | BeeOnRope wrote:
         | https://news.ycombinator.com/item?id=42582623
        
         | tavianator wrote:
          | In general I'd be very surprised if code alignment had much
          | effect on a 10,000-instruction block, especially since the
          | instruction is 5 bytes long, so it will quickly unalign itself.
        
       ___________________________________________________________________
       (page generated 2025-01-03 23:01 UTC)