https://blog.xoria.org/pipelining/

# Execution Units are Often Pipelined

In the context of out-of-order microarchitectures, I was under the impression that execution units remain occupied until the uop they're processing is complete. This is often not the case.

As an example, take the Firestorm microarchitecture in the A14 and M1. It has two integer execution units capable of executing multiplies, each of which takes three cycles to complete one multiplication. Of course, a sequence of dependent instructions like

```
benchmark:
	mul x1, x0, x0 // a
	mul x2, x1, x1 // b
	mul x3, x2, x2 // c
	mul x4, x3, x3 // d
	ret
```

will take 4 × 3 = 12 cycles, since each multiply must wait for the result of the previous one and so can only take advantage of a single execution unit:

```
cycle  EU 1  EU 2  completed
0      [a]   [ ]
1      [a]   [ ]
2      [a]   [ ]
3      [b]   [ ]   a
4      [b]   [ ]   a
5      [b]   [ ]   a
6      [c]   [ ]   a, b
7      [c]   [ ]   a, b
8      [c]   [ ]   a, b
9      [d]   [ ]   a, b, c
10     [d]   [ ]   a, b, c
11     [d]   [ ]   a, b, c
12     [ ]   [ ]   a, b, c, d
```

With my original understanding of how execution units work, a sequence of independent instructions like

```
benchmark:
	mul x1, x0, x0 // a
	mul x2, x0, x0 // b
	mul x3, x0, x0 // c
	mul x4, x0, x0 // d
	ret
```

would take 2 × 3 = 6 cycles:

```
cycle  EU 1  EU 2  completed
0      [a]   [b]
1      [a]   [b]
2      [a]   [b]
3      [c]   [d]   a, b
4      [c]   [d]   a, b
5      [c]   [d]   a, b
6      [ ]   [ ]   a, b, c, d
```

As it turns out, many execution unit and uop combinations are heavily pipelined. This means that a uop can be issued to an execution unit while the unit is still busy processing a different uop. So, on Firestorm that code sequence actually executes more like

```
cycle  EU 1         EU 2         completed
0      [a][ ][ ]    [b][ ][ ]
1      [c][a][ ]    [d][b][ ]
2      [ ][c][a]    [ ][d][b]
3      [ ][ ][c]    [ ][ ][d]    a, b
4      [ ][ ][ ]    [ ][ ][ ]    a, b, c, d
```

taking 4 cycles instead of 6.
In the limit, where the two execution units are constantly kept fed with multiplication uops, my original understanding would have predicted an average of 1.5 cycles/instruction, when in reality they can sustain 0.5 cycles/instruction: each execution unit can accept a new multiplication uop every cycle, and we have two of them.

Knowing this, I finally get why instruction latency and throughput tables specify *reciprocal* throughput: it's equivalent to cycles/instruction!

I put together a GitHub repo where you can see this for yourself. Make sure to adjust the maximum CPU frequency in Entry Point.c as appropriate.

Luna Razzaghipour
26 December 2024

Thoughts, comments, corrections or suggestions? Email me! I'd like nothing more than to hear from you.