https://blog.xoria.org/pipelining/

# Execution Units are Often Pipelined

In the context of out-of-order microarchitectures, I was under the impression that execution units remain occupied until the uop they're processing is complete. This is often not the case.

As an example, take the Firestorm microarchitecture in the A14 and M1. It has two integer execution units capable of executing multiplies, each of which takes three cycles to complete one multiplication. Of course, a sequence of dependent instructions like

```
benchmark:
	mul x1, x0, x0 // a
	mul x2, x1, x1 // b
	mul x3, x2, x2 // c
	mul x4, x3, x3 // d
	ret
```

will take 4 × 3 = 12 cycles, since each multiply must wait for the result of the previous one and so can only take advantage of a single execution unit:

```
cycle  EU 1  EU 2  completed
0      [a]   [ ]
1      [a]   [ ]
2      [a]   [ ]
3      [b]   [ ]   a
4      [b]   [ ]   a
5      [b]   [ ]   a
6      [c]   [ ]   a, b
7      [c]   [ ]   a, b
8      [c]   [ ]   a, b
9      [d]   [ ]   a, b, c
10     [d]   [ ]   a, b, c
11     [d]   [ ]   a, b, c
12     [ ]   [ ]   a, b, c, d
```

With my original understanding of how execution units work, a sequence of independent instructions like

```
benchmark:
	mul x1, x0, x0 // a
	mul x2, x0, x0 // b
	mul x3, x0, x0 // c
	mul x4, x0, x0 // d
	ret
```

would take 2 × 3 = 6 cycles:

```
cycle  EU 1  EU 2  completed
0      [a]   [b]
1      [a]   [b]
2      [a]   [b]
3      [c]   [d]   a, b
4      [c]   [d]   a, b
5      [c]   [d]   a, b
6      [ ]   [ ]   a, b, c, d
```

As it turns out, many execution unit and uop combinations are heavily pipelined. This means that a uop can be issued to an execution unit while the unit is still busy processing a different uop. So, on Firestorm that code sequence actually executes more like

```
cycle  EU 1         EU 2         completed
0      [a][ ][ ]    [b][ ][ ]
1      [c][a][ ]    [d][b][ ]
2      [ ][c][a]    [ ][d][b]
3      [ ][ ][c]    [ ][ ][d]    a, b
4      [ ][ ][ ]    [ ][ ][ ]    a, b, c, d
```

taking 4 cycles instead of 6.
In the limit, where the two execution units are constantly kept fed with multiplication uops, my original understanding would have predicted an average of 1.5 cycles/instruction, when in reality they can sustain 0.5 cycles/instruction: each execution unit can accept a new multiplication uop every cycle, and we have two of them.

Knowing this, I finally get why instruction latency and throughput tables specify *reciprocal* throughput: it's equivalent to cycles/instruction!

I put together a GitHub repo where you can see this for yourself. Make sure to adjust the maximum CPU frequency in Entry Point.c as appropriate.

Luna Razzaghipour
26 December 2024

Thoughts, comments, corrections or suggestions? Email me! I'd like nothing more than to hear from you.