[HN Gopher] Execution units are often pipelined
___________________________________________________________________
Execution units are often pipelined
Author : ingve
Score : 132 points
Date : 2024-12-27 09:02 UTC (3 days ago)
(HTM) web link (blog.xoria.org)
(TXT) w3m dump (blog.xoria.org)
| Vogtinator wrote:
| For x86 cores this is visible in Agner Fog's instruction
| performance tables: https://agner.org/optimize/#manuals
|
| The latency shows after how many cycles the result of an
| instruction can be consumed by another, while the throughput
| shows how many such instructions can be pipelined per cycle, i.e.
| in parallel.
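|
| A minimal C sketch of the difference (illustrative only; check
| that your compiler keeps the multiplies as written, e.g. build
| with -O2 and read the disassembly):
|
|     #include <inttypes.h>
|     #include <stdint.h>
|     #include <stdio.h>
|     #include <time.h>
|
|     #define N 100000000L
|
|     static double now(void) {
|         struct timespec ts;
|         clock_gettime(CLOCK_MONOTONIC, &ts);
|         return ts.tv_sec + ts.tv_nsec * 1e-9;
|     }
|
|     int main(int argc, char **argv) {
|         (void)argv;
|         /* multiplier unknown at compile time */
|         uint64_t m = (uint64_t)argc + 2;
|         uint64_t a = 3, b = 5, c = 7, d = 11;
|
|         double t0 = now();
|         /* one dependent chain: bound by mul latency */
|         for (long i = 0; i < N; i++)
|             a *= m;
|         double t1 = now();
|         /* three independent chains: bound by mul throughput */
|         for (long i = 0; i < N; i++) {
|             b *= m; c *= m; d *= m;
|         }
|         double t2 = now();
|
|         printf("dependent:   %.2f ns/mul\n",
|                (t1 - t0) / N * 1e9);
|         printf("independent: %.2f ns/mul\n",
|                (t2 - t1) / (3.0 * N) * 1e9);
|         /* keep the results live */
|         printf("(%" PRIu64 ")\n", a ^ b ^ c ^ d);
|         return 0;
|     }
|
| On a pipelined multiplier the second loop should get close to
| one multiply per cycle, even though each individual multiply
| still has a latency of several cycles.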
| wtallis wrote:
| I believe the throughput shown in those tables is the total
| throughput for the whole CPU core, so it isn't immediately
| obvious which instructions have high throughput due to
| pipelining within an execution unit and which have high
| throughput due just to the core having several execution units
| capable of handling that instruction.
| BeeOnRope wrote:
| That's true, but another part of the tables shows how many
| "ports" the operation can be executed on, which is enough
| information to conclude whether an operation is pipelined.
|
| For example, for many years Intel chips had a multiplier unit
| on a single port, with a latency of 3 cycles, but an inverse
| throughput of 1 cycle, so effectively pipelined across 3
| stages.
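|
| Schematically (one port, 3-cycle latency, one issue per cycle;
| purely illustrative):
|
|     cycle:    0   1   2   3   4
|     mul A     S1  S2  S3
|     mul B         S1  S2  S3
|     mul C             S1  S2  S3
|
| One result comes out every cycle from cycle 2 onward, even
| though each individual multiply takes 3 cycles.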
|
| In any case, I think uops.info [1] has replaced Agner for up-
| to-date and detailed information on instruction execution.
|
| ---
|
| [1] https://uops.info/table.html
| Earw0rm wrote:
| Shame it doesn't seem to have been updated with Arrow Lake,
| Zen 5 and so on yet.
| BeeOnRope wrote:
| Yes. In the past new HW has been made available to the
| uops.info authors in order to run their benchmark suite
| and publish new numbers: I'm not sure if that just hasn't
| happened for the new stuff, or if they are not interested
| in updating it.
| ajross wrote:
| FWIW, there are _two_ ideas of parallelism being conflated
| here. One is the parallel execution of the different sequential
| steps of an instruction (e.g. fetch, decode, operate, retire).
| That 's "pipelining", and it's a different idea than decoding
| multiple instructions in a cycle and sending them to one of
| many execution units (which is usually just called "dispatch",
| though "out of order execution" tends to connote the same idea
| in practice).
|
| The Fog tables try hard to show the former, not the latter. You
| measure dispatch parallelism with benchmarks, not microscopes.
|
| Also IIRC there are still some non-pipelined units in Intel
| chips, like the division engine, which show latency numbers ~=
| to their execution time.
| BeeOnRope wrote:
| I don't think anyone is talking about "fetch, decode,
| operate, retire" pipelining (though that is certainly called
| pipelining): only pipelining within the execution of an
| instruction that takes multiple cycles just to execute (i.e.,
| latency from input-ready to output-ready).
|
| Pipelining in stages like fetch and decode is mostly hidden
| in these small benchmarks, but becomes visible when there are
| branch mispredictions, other types of flushes, I$ misses and
| so on.
| ajross wrote:
| > I don't think anyone is talking about "fetch, decode,
| operate, retire" pipelining (though that is certainly
| called pipelining): only pipelining within the execution of
| an instruction that takes multiple cycles just to execute
| (i.e., latency from input-ready to output-ready).
|
| I'm curious what you think the distinction is? Those
| statements are equivalent. The circuit implementing "an
| instruction" can't work in a single cycle, so you break it
| up and overlap sequentially issued instructions. Exactly
| what they do will be different for different hardware,
| sure, clearly we've moved beyond the classic four stage
| Patterson pipeline. But that doesn't make it a different
| kind of pipelining!
| formerly_proven wrote:
| Independently scheduled and queued execution phases are
| qualitatively different from a fixed pipeline.
| gpderetta wrote:
| An OoO design is qualitatively different from an in-order
| one because of renaming and dynamic scheduling, but the
| pipelining is essentially the same and for the same
| reasons.
| BeeOnRope wrote:
| We are interested in the software visible performance
| effects of pipelining. For small benchmarks that don't
| miss in the predictors or icache, this mostly means
| execution pipelining. That's the type of pipelining the
| article is discussing and the type of pipelining
| considered in instruction performance breakdowns
| considered by Agner, uops.info, simulated by LLVM-MCA,
| etc.
|
| I.e., a lot of what you need to model for tight loops
| only depends on the execution latencies (as little as 1
| cycle), and not on the full pipeline end-to-end latency
| (almost always more than 10 cycles on big OoO, maybe more
| than 20).
| gpderetta wrote:
| > and it's a different idea than decoding multiple
| instructions in a cycle and sending them to one of many
| execution units (which is usually just called "dispatch",
| though "out of order execution
|
| Being able to execute multiple instructions per cycle is more
| properly superscalar execution, right? In-order designs are
| also capable of doing it, and the separate execution units do
| not even need to run in lockstep (consider the original P5 U
| and V pipes).
| ajross wrote:
| In-order parallel designs are "VLIW". The jargon indeed
| gets thick. :)
|
| But as to OO: the whole idea of issuing _sequential_
| instructions in parallel means that the hardware needs to
| track dependencies between them so they can't race ahead
| of their inputs. And if you're going to do that anyway,
| allowing them to retire out of order is a big
| performance/transistor-count win as it allows the pipeline
| lengths to be different.
| gpderetta wrote:
| VLIW is again a different thing. It uses a single
| instruction that encodes multiple independent operations
| to simplify decoding and tracking, usually with exposed
| pipelines.
|
| But you can have, for example, a classic in-order RISC
| design that allows for parallel execution. OoO renaming
| is not necessary for dependency tracking (in fact even
| scalar in order CPUs need dependency tracking to solve
| RAW and other hazards), it is "only" needed for executing
| around stalled instructions (while an in order design
| will stall the whole pipeline).
|
| Again P5 (i.e the original Pentium) was a very
| traditional in order design, yet could execute up to two
| instructions per cycle.
| ajross wrote:
| > VLIW is again a different thing.
|
| No it isn't. I'm being very deliberate here in refusing
| pedantry. _In practice_, "multiple dispatch" means "OO"
| in the same way that "VLIW" means "parallel in order
| dispatch". Yes, you can imagine hypothetical CPUs that
| mix the distinction, but they'd be so weird that they'd
| never be built. Discussing the jargon without context
| only confuses things.
|
| > you can have, for example, a classic in-order RISC
| design that allows for parallel execution.
|
| Only by inventing VLIW, though; otherwise there's no way
| to tell the CPU how to order what it does. Which is my
| point; the ideas are joined at the hip. Note that the
| Pentium had two defined pipes with specific rules about
| how the pairing was encoded in the instruction stream. It
| was, in practice, a VLIW architecture (just one with a
| variable length encoding and where most of the available
| instruction bundles only filled one slot)! Pedantry hurts
| in this world, it doesn't help.
| gpderetta wrote:
| I'm sorry, but if P5 was VLIW then the word has lost all
| meaning. They couldn't possibly be more different.
| wtallis wrote:
| Right; it's easy to forget that superscalar CPU cores don't
| actually have to be out-of-order, but most of them _are_
| out-of-order because that's usually necessary to make good
| use of a wide superscalar core.
|
| (What's the best-performing in-order general-purpose CPU
| core? POWER6 was notably in-order and ran at quite high
| clock speeds for the time. Intel's first-gen Atom cores
| were in-order and from around the same time as POWER6, but
| at half the clock speed. SPARC T3 ran at an even lower
| clock speed.)
| gpderetta wrote:
| POWER6 might indeed have been the last In-Order speed
| demon.
| eigenform wrote:
| > Also IIRC there are still some non-pipelined units in Intel
| chips, like the division engine, which show latency numbers
| ~= to their execution time
|
| I don't think that's accurate. That latency exists _because_
| the execution unit is pipelined. If it were not pipelined,
| there would be no latency. The latency corresponds to the
| fact that "doing division" is distributed across multiple
| clock cycles.
| delusional wrote:
| Is this still the case if I have different ALU operations? Say I
| have a single ALU on a single x86 core. Would the ALU be able to
| interleave, say, ADDs and MULs? Or would I incur the measured
| latency on each operation switch?
|
| I know that some ALUs have multiple ADD complexes, and I assume
| that would influence the answer, which is why I specified x86.
| barbegal wrote:
| You need to be far more specific than x86: x86 is just the
| instruction set, and the actual microarchitecture can vary
| massively within the same instruction set.
|
| In general though there is no penalty for interleaved
| operations.
| bjourne wrote:
| A modern ALU has multiple pipelined data paths for its
| operations. So maybe three adders with a one-cycle latency, two
| multipliers with a three-cycle latency, and one divider with a
| 16-cycle latency. Sustained throughput depends on the
| operation. Maybe one per cycle for add and multiply, but only
| one every eight cycles for divide.
| BeeOnRope wrote:
| Yes, it applies to different operations. E.g. you could
| interleave two or three different operations with 3 cycle
| latency and 1 cycle inv throughput on the same port and get 1
| cycle inv throughput in aggregate for all of them. There is no
| restriction that they must be the same operation.
|
| In some cases mixing operations with _different_ latencies on
| the same execution port will leave you with less throughput
| than you expect due to "writeback conflicts", i.e., two
| instructions finishing on the same cycle (e.g., a 2-cycle
| operation starting on cycle 0 and a 1-cycle operation starting
| on cycle 1 will both finish on cycle 2, and on some CPUs this
| will delay the result of one of the operations by 1 cycle due
| to the conflict).
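|
| Schematically, for a port with a single result bus
| (illustrative):
|
|     cycle 0: 2-cycle op X issues
|     cycle 1: 1-cycle op Y issues
|     cycle 2: X and Y both try to write back their results;
|              on some CPUs one of them is delayed to cycle 3.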
| Tuna-Fish wrote:
| You can interleave most operations how you like, without any
| extra latency, each op starting as soon as all of its inputs
| are ready, regardless of where they were computed.
|
| There are some exceptions where "domain crossing", i.e. using
| a very different kind of operation, costs an extra clock
| cycle. Notably, in vector registers, using an FP operation on
| the result of an integer op or vice versa: modern CPUs don't
| actually hold FP values in their IEEE 754 interchange format
| inside registers; instead the registers have hidden extra
| bits and store all values in normalized form, allowing fast
| operations on denormals. The downside of this is that if you
| alternate between FP and INT operations, the CPU has to
| insert extra conversion ops between them. Typical cost is 1-3
| cycles per domain crossing.
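|
| A small intrinsics sketch of such a crossing (whether and how
| much it costs depends on the microarchitecture):
|
|     #include <immintrin.h>
|
|     /* FP add -> integer AND (clear sign bits, i.e. fabs) ->
|        FP mul: the AND runs in the integer domain, so some
|        cores add a cycle or two of bypass latency on each
|        FP<->INT transition. */
|     __m128 mix(__m128 x, __m128 y) {
|         __m128  s = _mm_add_ps(x, y);        /* FP domain  */
|         __m128i b = _mm_castps_si128(s);     /* bitcast    */
|         __m128i m = _mm_set1_epi32(0x7fffffff);
|         b = _mm_and_si128(b, m);             /* INT domain */
|         return _mm_mul_ps(_mm_castsi128_ps(b), y);  /* FP  */
|     }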
| barbegal wrote:
| These days CPUs are so complex and have so many interdependencies
| that the best way to simulate them is simply to run them!
|
| In most real code the high throughput of these sorts of
| operations means that something else is the limiting factor. And
| if multiplier throughput is limiting performance then you should
| be using SIMD or a GPU.
| sroussey wrote:
| > These days CPUs are so complex and have so many
| interdependencies that the best way to simulate them is simply
| to run them!
|
| True. You can imagine how difficult it is for the hardware
| engineer designing and testing these things before production!
| alain94040 wrote:
| Very true. To paraphrase a saying, CPU amateurs argue about
| micro-benchmarks on HN, the pros simulate real code.
| pizlonator wrote:
| Or even just A:B test real code.
| devit wrote:
| The amateurs usually run benchmarks (because they can't
| reason about it as they lack the relevant knowledge) and
| believe they got a useful result on some aspect, when in
| fact the benchmark usually depends on other arbitrary random
| factors (e.g. maybe they think they are measuring FMA
| throughput, but are in fact measuring whether the compiler
| autovectorizes or whether it fuses multiplies and adds
| automatically).
|
| A pro would generally only run benchmarks if it's the only
| way to find out (or if it's easy), but isn't going to trust
| it unless there's a good explanation for the effects, or
| unless they actually just want to compare two very specific
| configurations rather than coming up with a general finding.
| LegionMammal978 wrote:
| Then again, 'reasoning about it' can easily go awry if your
| knowledge doesn't get updated alongside the CPU
| architectures. I've seen people confidently say all sorts
| of stuff about optimization that hasn't been relevant since
| the early 2000s. Or in the other direction, some people
| treat modern compilers/CPUs like they can perform magic, so
| that it makes no difference what you shovel into them
| (e.g., "this OOP language has a compiler that always knows
| when to store values on the stack"). Benchmarks can help
| dispel some of the most egregious myths, even if they are
| easy to misuse.
| Joker_vD wrote:
| > the best way to simulate them is simply to run them!
|
| And it's quite sad because when you are faced with choosing
| between two ways to express something in the code, you can't
| predict how fast one or another option will run. You need to
| actually run both, preferably in an environment close to the
| prod, and under similar load, to get an accurate idea of which
| one is more performant.
|
| And the worst thing is, you most likely can't extract any
| useful general principle out of it, because any small
| perturbation in the problem will result in code that is very
| similar yet has completely different latency/throughput
| characteristics.
|
| The only saving grace is that modern computers are really
| incredibly fast, so layers upon layers of suboptimal code
| result in applications that mostly perform okay, with maybe
| some places where they perform egregiously slow.
| mpweiher wrote:
| > you can't predict how fast one or another option will run
|
| "The best way to predict the future is to invent it" -- Alan
| Kay
|
| > You need to actually run both
|
| Always! If you're not measuring, you're not doing performance
| optimization. And if you think CPUs are bad: try benchmarking
| I/O.
|
| Operating System (n) -- Mechanism designed specifically to
| prevent any meaningful performance measurement (every
| performance engineer ever)
|
| If it's measurable and repeatable, it's not meaningful. If
| it's meaningful, it's not measurable or repeatable.
|
| Pretty much.
|
| Or put another way: an actual stop watch is a very meaningful
| performance measurement tool.
| codedokode wrote:
| What I am interested to know is who invented pipelining? I tried
| googling but without much success. Does anybody know?
| paulsutter wrote:
| Probably the first person to ask, 'how can I speed up this
| processor, maybe there's a way to do more than one processing
| step at a time for each instruction'
| zelos wrote:
| Presumably by analogy to production lines in factories?
| sroussey wrote:
| Actually, it's just obvious when you design the most basic
| ALU and the circuits for doing the carry.
| codedokode wrote:
| Pipelining requires inserting extra (costly) flip-flops
| and intuitively one might think that this would decrease
| the performance. It is not that easy.
| NotCamelCase wrote:
| Is it really that much different than e.g. a cook preparing
| carrots while onions are being cooked? Nobody needed to
| invent this as far as my knowledge of history goes; so,
| maybe there really is nothing new under the sun :)
|
| Joking aside, of course, the ingenuity lies in finding a
| fitting solution to the problem at hand, as well as knowing how
| to apply it.
| adrian_b wrote:
| Actually there are 2 ways to do more than one processing step
| at a time: parallel execution and pipelined (a.k.a.
| overlapped) execution.
|
| Both ways had been used for many centuries for the
| manufacturing of complex things, as ways to organize the work
| of multiple workers.
| taneq wrote:
| Slightly offtopic but did we go to uni together and team up
| on a couple of demo compos and VJ some events? If so hey
| it's been ages and whats up? :D If not, cool name and
| stuff. :)
| PittleyDunkin wrote:
| I'm guessing https://en.wikipedia.org/wiki/Donald_B._Gillies is
| a good bet for this claim.
| adrian_b wrote:
| Nope.
|
| 1957 in a transistorized computer is far too late for the
| origin of pipelining.
|
| Pipelining had already been used in computers with vacuum
| tubes a few years before and it had also been used already a
| decade earlier in computers with electromechanical relays,
| i.e. IBM SSEC, which had a 3-stage pipeline for the execution
| of its instructions (IBM SSEC had a Harvard architecture,
| with distinct kinds of memories for program and for data).
|
| Before 1959, pipelining was named "overlapped execution".
|
| The first use of the word "pipeline" was as a metaphor in the
| description of the IBM Stretch computer, in 1959. The word
| "pipeline" began to be used as a verb, with forms like
| "pipelining" and "pipelined" around 1965. The first paper
| where I have seen such a verbal use was from the US Army.
|
| Outside computing, pipelining as a method of accelerating
| iterative processes had been used for centuries in the
| progressive assembly lines, e.g. for cars, and before that
| for ships or engines.
|
| Pipelined execution and parallel execution are dual methods
| for accelerating an iterative process that transforms a
| stream of data. In the former the process is divided into
| subprocesses through which the stream of data passes in
| series, while in the latter the stream of data is divided
| into substreams that go in parallel through multiple
| instances of the process.
| cogman10 wrote:
| The earliest "pipeline" example I can find is the
| portsmouth block mills [1] implemented by Marc
| Brunel[2]/Henry Maudslay[3] around 1802
|
| [1] https://en.wikipedia.org/wiki/Portsmouth_Block_Mills
|
| [2] https://en.wikipedia.org/wiki/Marc_Isambard_Brunel
|
| [3] https://en.wikipedia.org/wiki/Henry_Maudslay
| bjourne wrote:
| Before that you had bucket brigades.
| formerly_proven wrote:
| Zuse Z3 had a three stage instruction pipeline.
| Voultapher wrote:
| > Pipelined execution and parallel execution are dual
| methods for accelerating an iterative process that
| transforms a stream of data. In the former the process is
| divided into subprocesses through which the stream of data
| passes in series, while in the latter the stream of data is
| divided into substreams that go in parallel through
| multiple instances of the process.
|
| With parallel, do you mean superscalar execution, i.e. having
| two ALUs, or do you mean having multiple cores?
| nxobject wrote:
| EDIT: Adrian_B's sibling comment has even earlier historical
| examples that illustrate how architecture ideas are as old as
| the hills.
|
| There are good historical hardware architecture textbooks:
| Peter Kogge's "The Architecture of Pipelined Computers" ('81)
| mentions that the UNIVAC 1 pipelined IO (from which dedicated
| IO units sprung out), the IBM 7094 similarly interleaved
| multiple-state memory access cycles between multiple memory
| units to increase throughput, and then the IBM STRETCH had a
| two-part fetch-execute without attempting to address hazards.
|
| So pipelining came out of the ad-hoc application of the patterns
| of interleaving and hiving off functionality to dedicated
| units. In general, a lot of hardware architecture ideas cropped
| up very early in big iron, with microcoding even co-existing
| with them, without all of the supporting ideas being fully
| worked out (e.g. forwarding, hazard resolution of all types,
| etc.)
|
| There's a whole fun universe of vintage hardware architecture
| textbooks that get through the essential concepts through
| historical big iron, if you're willing to look beyond Hennessy
| and Patterson and Shen and Lipasti! I am thinking that the
| latter might have references to other historical textbooks on
| pipelining, though. "The Anatomy of a High-Performance
| Microprocessor", a pedagogical deep-dive into the design of the
| AMD K6 uarch (up to detailed algorithms and HDL!), will have
| good references to historical textbooks too since these authors
| were trained before the crop of "modern" (microprocessor-era)
| textbooks.
| AtlasBarfed wrote:
| Come on, this isn't like the invention of calculus. Pipelining
| is just assembly line processing that anyone who managed or
| designed any factory would have implemented as soon as the
| transistor real estate became available.
| saagarjha wrote:
| Unrelated, but for benchmarks like these you can just ask the
| kernel how many clock cycles you've used (via proc_pid_rusage)
| rather than hardcoding it.
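|
| Roughly, on macOS (field and flavor names assume the
| rusage_info_v4 layout in <sys/resource.h>; check your SDK):
|
|     #include <libproc.h>
|     #include <sys/resource.h>
|     #include <stdio.h>
|     #include <unistd.h>
|
|     int main(void) {
|         /* ri_cycles / ri_instructions assume rusage_info_v4 */
|         struct rusage_info_v4 ru;
|         if (proc_pid_rusage(getpid(), RUSAGE_INFO_V4,
|                             (rusage_info_t *)&ru) != 0) {
|             perror("proc_pid_rusage");
|             return 1;
|         }
|         printf("cycles: %llu  instructions: %llu\n",
|                (unsigned long long)ru.ri_cycles,
|                (unsigned long long)ru.ri_instructions);
|         return 0;
|     }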
| taneq wrote:
| Is anything important not pipelined?
| monocasa wrote:
| Operations that are important, but take a single cycle.
| BeeOnRope wrote:
| Division and gather are multi-cycle instructions which have
| typically had little or no pipelining on x86.
| syncsynchalt wrote:
| Memory barriers come to mind.
| ryukoposting wrote:
| If the two ALUs could feed results into each other, that would
| make things _really_ interesting, especially when you consider
| the way it would affect reordering.
|
| Imagine if the OP's benchmark had two of those multiplication
| chains on independent registers:
|
|     mul x1, x0, x0  // a
|     mul x2, x1, x1  // b
|     mul x3, x2, x2  // c
|     mul x4, x3, x3  // d
|     mul x6, x5, x5  // e
|     mul x7, x6, x6  // f
|     mul x8, x7, x7  // g
|     mul x9, x8, x8  // h
|
| If you had two independent ALUs, my intuition is that you'd want
| to dispatch the instructions as AEBFCGDH, interleaving the two
| chains.
|
| But, if the two ALUs could feed each other, perhaps you'd
| actually want to leave it in ABCDEFGH order? Hmm.
| monocasa wrote:
| They are able to do this via what's called a bypass network.
| wtallis wrote:
| > If you had two independent ALUs, my intuition is that you'd
| want to dispatch the instructions as AEBFCGDH, interleaving the
| two chains. But, if the two ALUs could feed each other, perhaps
| you'd actually want to leave it in ABCDEFGH order? Hmm.
|
| Once the frontend has determined what the dependencies are
| between instructions, you no longer have a linear sequence of
| instructions being dispatched to execution units but a DAG, and
| with multiple instructions starting on the same clock cycle the
| execution order would be more like (AE)(BF)(CG)(DH) without any
| ordering between instructions in the same parenthesized group.
|
| The reorder buffer in the CPU will be large enough to handle
| the compiler emitting either ABCDEFGH or AEBFCGDH, but the
| interleaved ordering is less likely to run into limitations of
| the instruction decoder (if this chunk of code isn't already in
| a decoded uop cache).
| filiphorvat wrote:
| Unrelated, do FPUs on modern CPUs use FMAs to both multiply and
| add or do they use mul/add-only units?
| bonzini wrote:
| Probably to do multiplies, as the extra add is basically free.
| Adds are cheaper.
| thesz wrote:
| Adds are cheaper only for fixed-point computations. Floating
| point addition needs to denormalize one of its arguments,
| perform an (integer) addition and then normalize the result.
|
| Usually FP adds take a cycle or two longer than FP
| multiplication.
| gpderetta wrote:
| I don't think there is a generally optimal design. There are
| cons and pros to using the same homogeneous FMAs units for
| adds, multiplies and fmas, even at the cost of making adds
| slower (simpler design, and having all instructions of the same
| latency greatly simplifies scheduling). IIRC Intel cycled
| through 4-cycle fma, add and mul; then 4-cycle add and mul
| with 5-cycle fma; then a dedicated 3-cycle add.
|
| The optimal design depends a lot on the rest of the
| microarchitecture, the loads the core is being optimized for,
| the target frequency, the memory latency, etc.
| rasz wrote:
| Pipelining is the direct reason behind 1996 Quake running 30 fps
| on an Intel Pentium 133 but requiring 233MHz from Cyrix and
| 166MHz from an AMD K6 to reach the same result. 20 fps needed a
| 75MHz Intel Pentium, a 133MHz Cyrix, or a 100MHz AMD K5.
| gary_0 wrote:
| Specifically, Carmack exploited the fact that on the Pentium,
| integer instructions could run in parallel with floating-point
| division[0]. This goes to show that an optimization that
| usually gets you ~10% might get you 200% depending on what
| software you're running. And that no implementation detail is
| safe from an ambitious software engineer.
|
| [0] https://news.ycombinator.com/item?id=38249029
| gpderetta wrote:
| So that was not actually pipelining, but superscalar
| execution. Incidentally, division wasn't pipelined at all.
| dhash wrote:
| My favorite illustrations for the concepts discussed here (in an
| accessible form, not the processor optimization manuals) have long
| been [0].
|
| For me, this really makes working with a modern microprocessor a
| _science_, as anyone who has written benchmarks knows -- it's
| difficult to reason about the complex behaviour and performance
| cliffs without testing.
|
| Another excellent example of the weirdness has to be JVM anatomy
| quarks [1]
|
| [0] https://www.lighterra.com/papers/modernmicroprocessors/
|
| [1] https://shipilev.net/jvm/anatomy-quarks/
| gpderetta wrote:
| The first link is very nice, worthy of a submission of its own.
___________________________________________________________________
(page generated 2024-12-30 23:00 UTC)