[HN Gopher] Cortex A73's Not-So-Infinite Reordering Capacity
___________________________________________________________________
Cortex A73's Not-So-Infinite Reordering Capacity
Author : ingve
Score : 175 points
Date : 2024-08-04 21:30 UTC (1 day ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| phkahler wrote:
| I've recently been contemplating the idea of putting a small ALU
| with each register. Simplest would do just AND, OR, XOR. A bigger
| one might add and subtract. The idea is that any instruction that
| clobbered a source operand would only need one read port (and no
| write port) from the register file. As routing near registers
| gets more complicated it might make sense to put a bit of
| execution right there with the data.
|
| Thoughts on this?
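|
| A quick sketch of the instruction shape this targets (a
| hedged illustration; the AND/OR/XOR vs. add/subtract split is
| from the comment above, the C is made up, not any real ISA):
|
|   // Destructive two-operand ops: the destination is also a
|   // source, so with a small ALU sitting next to the
|   // destination register, only the second operand needs a
|   // register-file read port.
|   unsigned r0 = 0xdeadbeefu, r1 = 0x12345678u;
|   r0 &= r1;  // AND -- covered by the simplest per-register ALU
|   r0 ^= r1;  // XOR -- likewise
|   r0 += r1;  // add -- needs the bigger variant with an adder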
| gen3 wrote:
| It's been a few years, but my understanding is that most time
| spent by the CPU is waiting for data, not logic. I wonder if
| there would be a real impact on execution speed
| phkahler wrote:
| I wasn't really thinking of faster execution speed overall.
| Just faster for a given number of ports on the register file.
| It may also eliminate the need for result forwarding hardware
| since the registers would just be the ALU output latches.
| The_Colonel wrote:
| Most of the crazy parts of CPUs (out-of-order, speculative
| execution, branch predictors, cache hierarchy) are ways of
| trying to work around the slow memory. Improving the logic
| execution can allow going farther speculatively and thus pre-
| fetch sooner. Compressing instruction encoding can lower the
| need to fetch instructions.
| fulafel wrote:
| Most of those, except cache, are attempts to work around
| the bottleneck of single flow of control and thus limited
| available parallelism.
| tsimionescu wrote:
| Unfortunately all experience shows that both programmers
| and compilers are _much_ worse at parallelizing code than
| CPUs are. There have been many attempts at architectures
| that allowed compilers or programmers to express code in
| more parallel ways (VLIW architectures, Intel's Itanium,
| IBM's Cell used in the PS3) and they all categorically
| failed.
|
| Successful processors offer a sequential execution model,
| and handle the parallelism internally. Even CUDA is
| designed somewhat like this: you express your code in a
| largely linear fashion, and rely on the NVIDIA-created
| compiler and on the GPU itself to run it in parallel.
| adrian_b wrote:
| CUDA is quite different. In CUDA you do not express
| parallel code in a linear fashion, a.k.a. sequential
| fashion, hoping that CUDA will determine the dependence
| chains and extract them from the sequential code and
| execute them in parallel.
|
| The main feature of CUDA is that in order to describe an
| algorithm that is applied to an array, you just write the
| code that applies the algorithm to an element of that
| array, like writing only the body of a "for" loop.
|
| Then the CUDA run-time will take care of creating an
| appropriate number of execution threads, taking into
| account the physical configuration of the available GPU,
| e.g. how many array elements can be processed by a single
| instruction, how many execution threads can share a
| single processing core, how many processing cores exist
| in the GPU, and so on. When there are more array elements
| than the GPU resources can process at once, CUDA will add
| some appropriate looping construct, to ensure the
| processing of the entire array.
|
| The CUDA programmer writes the code that processes a single
| array element, thus informing the CUDA run-time that this
| code is independent of its replicas that process any other
| array element, except when the programmer explicitly
| references other array elements, which is normally avoided.
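|
| A minimal CUDA sketch of that contract (saxpy is the usual
| example; the grid-stride loop stands in for the "appropriate
| looping construct" used when the array is larger than the
| launched grid):
|
|   // Written once, per element; every i is independent.
|   __global__ void saxpy(int n, float a,
|                         const float *x, float *y) {
|       for (int i = blockIdx.x * blockDim.x + threadIdx.x;
|            i < n;
|            i += blockDim.x * gridDim.x)   // grid-stride loop
|           y[i] = a * x[i] + y[i];
|   }
|
|   // Host side: the launch maps threads onto whatever GPU is
|   // present, e.g.
|   //   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);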
|
| The task of a CPU able to do non-sequential instruction
| execution (a.k.a. out-of-order execution) and
| simultaneous initiation of multiple instructions (a.k.a.
| superscalar instruction execution) is quite different.
|
| The main problem for such a CPU is the determination of
| the dependence relationships between instructions, based
| on examining the register numbers encoded in the
| instructions. Based on the detected dependencies and on
| the availability of the operands, the CPU can schedule
| the execution of the instructions, in parallel over
| multiple execution units.
|
| It remains an open research problem whether there is any
| better way to pass the information about the dependencies
| between instructions from the compiler, which already knows
| them, to the CPU that runs the machine code, other than by
| using fake register numbers whose only purpose is to express
| the dependencies, and which the CPU's renaming mechanism must
| replace with the real numbers of the physical registers
| before execution.
| tsimionescu wrote:
| Agreed, CUDA is very different. But still, it's another
| example where programmers just write sequential single
| flow of execution code, and it gets executed in parallel
| according to "external" rules.
| The_Colonel wrote:
| Out-of-order execution doesn't imply parallelism, it was
| designed from the beginning to work around the input data
| availability problem.
|
| In speculative execution and branch prediction, prefetch may
| seem like just a nice bonus, but given that CPU performance
| is nowadays largely bottlenecked by memory access, the
| prefetching that falls out of these techniques often turns
| out to be the dominant performance factor.
| tsimionescu wrote:
| It's still a form of parallelism that could in principle
| be written into the program instead of being
| automatically implemented by the processor.
|
| For example, in hand-crafted assembly programs it is common
| to know how long a fetch operation takes, and to manually
| schedule operations so that they execute in parallel with
| the fetch.
|
| Theoretically a high level language could also be
| designed to expose this kind of logic to the programmer.
| A program in such a language would be expressed as a set
| of very very short threads that can be interleaved by the
| compiler given precise knowledge of instruction timings.
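|
| A hedged C sketch of that kind of manual scheduling
| (software pipelining): the next iteration's load is started
| before the current iteration's arithmetic, so on an in-order
| machine the two can overlap. Function and names are made up:
|
|   // Assumes n >= 1.
|   float sum_of_squares(const float *a, int n) {
|       float sum = 0.0f;
|       float cur = a[0];            // prologue: first load
|       for (int i = 0; i < n - 1; i++) {
|           float next = a[i + 1];   // load for the *next* step
|           sum += cur * cur;        // overlaps with that load
|           cur = next;
|       }
|       return sum + cur * cur;      // epilogue
|   }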
| _nalply wrote:
| Would it make sense to compress code and data with
| something like zstd and let the CPU decompress?
|
| Of course this means a large change of how computers work
| but perhaps it is possible to make this opt-in (i.e.
| backwards compatible) for software?
| tsimionescu wrote:
| This would make memory read performance much, much more
| unpredictable, so it is a no-go from the start. And
| beyond that, the problem is not one of bandwidth, it is
| one of latency. This would increase bandwidth sometimes,
| but it would increase latency always, which is a terrible
| trade-off. Missed branch predictions would cost even more
| than they do today, for example.
| The_Colonel wrote:
| This idea isn't about compressing in-flight, but in the
| instruction cache, so that more instructions will fit
| into the cache, and you don't need to fetch as often (and
| thus incur latency) from main memory / L2.
|
| Zstd is impractical, but I can imagine some sort of
| storage-efficient microcode? (Current Intel CPUs store the
| original x86 instructions in the L1 instruction cache.)
| simiones wrote:
| You then need extra memory to store the de-compressed
| instructions, since you can't predict before running the
| decompression how many instructions you'll get. And you
| still have the problem of an unpredictably-sized instruction
| cache.
|
| Plus, the larger the instruction cache is, the worse
| every branch mis-prediction is. As far as I know, the
| size of the instruction cache is not really limited
| because of space issues, it's limited for precisely this
| reason.
| adgjlsfhk1 wrote:
| The problem here is branches. Since you can jump to
| pretty much any instruction, every instruction needs to
| be a valid place to start decoding, which makes the idea
| untenable.
| pkhuong wrote:
| x86 is a variable-length encoding where more frequently
| used instructions tend to have shorter encodings. Thumb
| doesn't go as far, with only 2- and 4-byte instructions.
| You can find old papers on Huffman encoding instructions.
|
| More effective block compression schemes are harder to
| pull off because of branches.
| nullc wrote:
| Some instruction sets are variable length, there are also
| things like thumb and the 'compressed' riscv extension.
|
| If you compress each instruction one at a time into a
| variable number of bits the ability to jump to any
| instruction is preserved but compression is hurt by not
| being able to use cross-instruction conditional
| probabilities and having to pack things into integer
| numbers of bits.
|
| One could imagine a compression format that had 'jump
| points' -- compiler-selected locations where it was possible
| to jump to and decode from, so that you'd only take the
| above losses at potential jump targets.
|
| You could go further and imagine the instruction set having
| some kind of constrained short jump forward over a limited
| distance (by simply letting the decoder decode that way), or
| one that can go back a limited distance without using a jump
| point, so long as there was no way for control flow to
| change out from under it.
|
| I wonder what percentage of instructions would need to be
| jump points in such a system?
|
| But I think this is mostly of academic interest: except
| in very small embedded devices code size is not a huge
| performance driver and like anything else you get
| diminishing returns from further optimizations. Variable
| length multiple-of-bytes instruction sets like x86
| probably get a significant fraction of the potential
| compression gains and do so without a lot of complexity.
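|
| One hedged sketch of how the 'jump point' constraint might
| look to the branch unit (a hypothetical format with one
| marker bit per code byte; nothing here is a real ISA):
|
|   #include <stdint.h>
|
|   // Branches may only target compiler-marked decode
|   // boundaries; straight-line decode can fall through
|   // anywhere.
|   int valid_branch_target(const uint8_t *jump_point_bitmap,
|                           uint32_t target) {
|       return (jump_point_bitmap[target >> 3]
|               >> (target & 7)) & 1;
|   }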
| o11c wrote:
| That breaks OoO users of the register before the operation.
|
| Also, although the linked article complains about "too many
| ports", remember that the useful size of the register file is
| ultimately limited by how many in-flight operations are
| possible, which is determined by pipeline depth and number of
| instructions between branches.
| phkahler wrote:
| >> That breaks OoO users of the register before the
| operation.
|
| How so? The register can be used the same as before but
| clobber operations don't have to be sent over to a separate
| ALU and "back".
| dzaima wrote:
| If you have, say, 'add r1, r2, r3; add r2, r2, 1' and want
| to do the latter instruction in-register-file, you can't
| reorder them (e.g. if r2 is known before r3) as after
| having run the second instruction, the first one's r2
| operand would be nowhere to be found. You'd need to track
| whether there are any unfinished uses of each register (or,
| more simply, just whether it has ever been an operand),
| which isn't traditionally required.
|
| Perhaps a bigger issue is that, if you have a speculative
| decision (incl. all loads & stores) between having written
| a register and the clobbering update, you can't do it in-
| register too, as that'd make it impossible to roll back.
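|
| The same hazard in plain C (a sketch; with renaming the two
| adds may run in either order, but clobbering r2 in place
| makes the order matter -- a write-after-read hazard on r2):
|
|   int r2 = 10, r3 = 5;   // pretend r3 arrives late (a load)
|   // program order:  r1 = r2 + r3;  r2 = r2 + 1;
|   r2 = r2 + 1;           // reordered to run first...
|   int r1 = r2 + r3;      // ...reads 11 instead of 10: wrong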
| phkahler wrote:
| Thanks! This and some of the other comments have been
| helpful in understanding how these things work in more
| detail. So conceptually register names should be seen as
| result names - the result of a load or an operation gets
| called r2 for example. The ISA only provides so many
| result names, but the CPU may have any number of physical
| registers to store results. It's a nice model, and
| modifying a result in-register would really mess things
| up. It would destroy a previous result which would
| require a very different (more complex) approach to
| scheduling and clean up.
| Findecanor wrote:
| I definitely think it is worth exploring other models than the
| traditional randomly accessed register file.
|
| Your idea reminded me a bit of Scheduled Dataflow [0]
| architecture where every register is a dedicated input to a
| unit. Instead of an instruction having source and destination
| register parameters there are only destination registers.
|
| The Mill does it the other way around: There are only source
| operands and the result is stored in each unit's dedicated
| output register.
|
| [0]. https://www.semanticscholar.org/paper/Scheduled-
| Dataflow%3A-...
| pests wrote:
| The video where it's revealed that the belt in the Mill
| architecture is completely conceptual is something I
| randomly think of at night when trying to sleep. I've always
| been fascinated by it.
| gumby wrote:
| A modern core has a bunch of resources (ALUs, register files
| and so on) to assign/draw upon as dynamically needed. Not every
| one is needed for every instruction of course. In the old days
| when there was a single ALU, and maybe a single address
| calculator, that was that.
|
| Now the resources can be scheduled in a very complex, out of
| order, and subinstruction fashion. The chip designers guess
| what the instruction mix will likely be and hopefully make the
| right call as to how many Xs are required and how many Ys. Too
| few and operations in flight will wait. Too many and the chip
| becomes more complex for no benefit, and maybe those unused
| resources crowd out space for something else useful.
|
| If you stick an ALU on every register you're guaranteeing to
| use some area on something not used all the time.
| CalChris wrote:
| Onur Mutlu has a similar (though not the same) idea of
| _Processing in Memory_. Basically the idea is to put some
| operations nearer to the data. Your idea puts them near the
| register; his puts them in the memory controller, nearer to
| memory.
|
| https://arxiv.org/pdf/2012.03112
| phkahler wrote:
| I could see memory getting a vector FPU that takes an entire
| DRAM row (64Kbit these days) and does things like
| scalar/vector MAC. Since DRAM is so slow it could be a
| relatively slow FPU (10 cycles or more). The biggest issue
| would be standardization. How do you send it instructions and
| operands? How do you manage the internal state? A standard
| would be difficult and seems premature given the rapid
| changes happening even with FP data formats. Oh, looks like
| that paper goes into some depth on this stuff!
| sroussey wrote:
| Why stop there? Move the ALU to the RAM
| rcxdude wrote:
| There are some proposals of that form: basically if you have
| a RAM interface that can support commands like "fill this
| range of memory with this pattern" or "add this constant to
| all values in this range" it can help speed up a decent
| amount of code, but also drastically reduce power
| consumption, as the memory interface is a pretty significant
| source of power drain on mobile systems. It does
| significantly complicate the interfaces involved, though, so
| it would require a lot of effort to implement.
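|
| A hedged sketch of what such a command interface could look
| like (purely hypothetical; no real controller exposes this
| exact shape):
|
|   #include <stdint.h>
|
|   enum mem_op { MEM_FILL, MEM_ADD_CONST };
|
|   struct mem_cmd {
|       enum mem_op op;
|       uint64_t    base;   // start of the target range
|       uint64_t    len;    // bytes covered
|       uint64_t    value;  // fill pattern or addend
|   };
|   // Executed controller-side, so the data never crosses the
|   // memory bus -- which is where much of the power goes.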
| weinzierl wrote:
| This has been tried. I think the IBM System/360 could do
| arithmetic directly in RAM.
| _a_a_a_ wrote:
| err, any refs for that? I really doubt it but could well be
| wrong. I'm not aware of any mainstream arch that can/could
| do that, that I can think of.
| weinzierl wrote:
| At least the IBM System/370 (not 360 as I said above) had
| assembly instructions that took memory addresses directly
| without a register or immediate being in the instruction.
|
| Here is an example of such an instruction, from the page
| linked below:
|
| _" AP, Add Packed (Decimal) The data string located at
| the storage address specified by operand-2 (b2+d2) is
| added to the data string located at the storage address
| specified by operand-1 (b1+d1). Operand-2 remains
| unchanged."_
|
| http://www.simotime.com/asmins01.htm#AP
|
| Now, how this was actually done in hardware I don't know,
| but I would not be surprised if hidden registers were
| involved. So, maybe this is not really an example that
| fits the original premise of this subthread, but I think
| it is interesting nevertheless.
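|
| A simplified sketch of AP's memory-to-memory shape in C
| (unsigned digits only; the real instruction also handles the
| sign nibble, per-operand lengths, and condition codes):
|
|   #include <stdint.h>
|
|   // Add the packed-decimal string at src into dst, two
|   // digits per byte, least significant digit in the last
|   // byte. Carry out of the top digit is dropped here.
|   void add_packed(uint8_t *dst, const uint8_t *src, int len) {
|       int carry = 0;
|       for (int i = len - 1; i >= 0; i--) {
|           int lo = (dst[i] & 0x0F) + (src[i] & 0x0F) + carry;
|           carry = lo > 9; if (carry) lo -= 10;
|           int hi = (dst[i] >> 4) + (src[i] >> 4) + carry;
|           carry = hi > 9; if (carry) hi -= 10;
|           dst[i] = (uint8_t)((hi << 4) | lo);
|       }
|   }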
| vasco wrote:
| We've been doing the reverse and bringing the RAM to the ALU.
| dzaima wrote:
| There's already an x86-64 extension for this: RAO-INT (remote
| atomics).
| ip26 wrote:
| The entries in this register file would be larger & slower,
| which means you will not be able to squeeze in as many. This
| reduces your performance.
|
| Also, the hardware to distribute any entry as the second source
| to any other entry is effectively a read port, followed by a
| write port.
| NohatCoder wrote:
| It is probably easier to add more functionality to existing
| ALUs. Various instruction sets have added "free" bonus
| operations, but one could go a step further and allow any two
| functions from a wide set to be combined; the intermediate
| value would not hit the register file, thus saving a write
| and a read relative to two individual instructions.
| Tuna-Fish wrote:
| That's not how registers work. Register names are not
| registers. Unless you are in the deepest of embedded, all
| operations that write to a register allocate a new one, and do
| not write to their input register.
|
| edit: OoO primer, assuming x86_64 but everything else works the
| same way too:
|
| Your CPU has hundreds of physical registers. There are 16
| register names managed in the frontend. Whenever the frontend
| sees an instruction that writes to RAX, it allocates a fresh
| register that does not contain a value, sets it as pending,
| writes that register number under RAX in the register alias
| table (RAT), and sends the instruction forward. Once the
| instruction is in the scheduler, it waits until all its
| inputs are ready, and then issues to an execution unit. Once
| it gets executed, it writes its value to the register
| allocated for it and sets it as ready.
|
| If you have an instruction that uses the same register name as
| input and output, those are guaranteed not to be the same
| register. Used registers get garbage collected once no name or
| pending instruction refers to them anymore.
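|
| A toy model of that rename step (illustrative C++ only; the
| RAT really is just a name-to-physical-register map, but real
| hardware adds free lists, checkpoints, and so on):
|
|   #include <cstdio>
|
|   int main() {
|       const int NUM_NAMES = 16;   // RAX..R15
|       int rat[NUM_NAMES];         // name -> physical register
|       for (int i = 0; i < NUM_NAMES; i++) rat[i] = i;
|       int next_phys = NUM_NAMES;  // stand-in for a free list
|
|       // rename "add rax, rbx" twice (rax = name 0, rbx = 3)
|       for (int k = 0; k < 2; k++) {
|           int in0 = rat[0], in1 = rat[3]; // sources: RAT now
|           rat[0] = next_phys++;           // dest: fresh reg
|           std::printf("p%d = p%d + p%d\n", rat[0], in0, in1);
|       }
|       // prints p16 = p0 + p3, then p17 = p16 + p3: the name
|       // rax is reused, the physical registers never are.
|   }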
| phkahler wrote:
| Thank you, that was a very good explanation of "register
| renaming". The way you described it "renaming" is really more
| like allocation and is pretty simple - not done on an as-
| needed basis, just done automatically for every instruction,
| making it SSA (static single assignment).
| LeifCarrotson wrote:
| As someone deep in an embedded cave, that's illuminating. You
| have a completely different meaning abstracted on top of the
| word "register".
|
| Over here, in AVR and MSP430 and Cortex-M0 land, there are
| about 16 registers, instructions take at most 2 registers,
| and write to the destination (often also a source argument,
| ADD R1 R2 does R1 := R1 + R2). There is no 'frontend', no
| renaming, no aliasing, no garbage collector; what you see is
| what you get.
|
| I assume that sometime long ago, the words had the original
| meaning, but they gradually added abstractions that were
| still _mostly_ indiscernible at each step of the journey
| until they had completely diverged.
| mystified5016 wrote:
| Yeah I'm deep into AVR, never got into the fancy
| architectures like x86 and this sounded bananas to me. In
| my world a register is a register is hardware.
|
| Hell, by sidestepping GCC's register allocation one can
| substantially increase performance if you arrange your data
| into registers that your ALU or other peripheral likes
| best. Prime example, RAM access can be twice as fast if you
| load your target address into the X/Y/Z register with LD
| instead of letting GCC emit the LDS instruction. You'd do
| something like store a pointer to a very important struct
| that you want to access many times very quickly.
|
| I honestly have no clue if a similar concept exists on
| modern x86. I'd assume that you are explicitly meant to not
| care about things at this level and it's all handled behind
| the scenes. It's crazy how much the x86 CPU does for you,
| but I find it much more fun to tweak things close to the
| bare metal.
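|
| A hedged sketch of that pointer trick in C (the struct and
| names are invented; the LD-vs-LDS effect is the one described
| above):
|
|   #include <stdint.h>
|
|   struct motor { volatile uint8_t duty, dir, status; };
|   struct motor m0;
|
|   void tick(void) {
|       struct motor *p = &m0;  // address lands in X/Y/Z once
|       p->duty++;              // LD/ST through the pointer...
|       p->status |= 0x01;      // ...instead of LDS/STS per
|   }                           // field access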
| gpderetta wrote:
| It is not really my speciality, but there are equivalent
| things in OoO land. Although CPUs are very good at running
| reasonable code efficiently, peak FP performance is still
| the realm of hand-written ASM, or at the very least copious
| use of intrinsics.
|
| It can also be very microarchitecture-specific. Because FP
| code often needs significant unrolling, the number of
| architectural registers available to store partial results
| can be a bottleneck, especially if the compiler doesn't do a
| perfect job.
| dzaima wrote:
| All registers are basically always treated the exact same
| in OoO-land (though on x86-64 for some instruction forms
| it can add an extra byte of instruction encoding if
| there's some operand in r8..r15).
|
| But there's still a massive amount of performance tuning.
| Some fun examples being how Haswell can only do one FP
| add per cycle, but two FMAs, so you can sometimes improve
| throughput by replacing some 'a+b's with 'a*1+b' (at the
| cost of higher latency). Or how, on older Intel, three-
| addend 'lea' can execute on only one port (as opposed to
| two for two-addend 'lea', or 3 or 4 for 'add') and has
| 3-cycle latency (vs. 1 cycle for two-addend; so two back-
| to-back 'lea's have lower latency than one three-addend
| one), but still uses only one port and thus can sometimes
| be better for throughput.
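|
| A sketch of that first trick using plain C's fmaf (whether
| the compiler actually emits FMA instructions depends on
| flags like -mfma; the function itself is illustrative):
|
|   #include <math.h>
|
|   float sum8(const float *x) {
|       float acc0 = 0.0f, acc1 = 0.0f; // two dependency chains
|       for (int i = 0; i < 8; i += 2) {
|           acc0 = fmaf(x[i],     1.0f, acc0); // "a*1+b" add
|           acc1 = fmaf(x[i + 1], 1.0f, acc1); // via an FMA port
|       }
|       return acc0 + acc1;
|   }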
| Neywiny wrote:
| While I share your passion, there's still a front end. The
| front end vs back end of a cpu core exists as long as the
| logic for decoding an instruction is separate from that of
| executing. In an AVR core, the "instruction decode" and
| "instruction register" blocks are the front end, and the
| ALU is the backend. The first image result for "msp430 cpu
| core block diagram" for me was [1] which has the first
| block saying "front end". So like I get it but just because
| we work on the smallest cores doesn't mean they're alien
| tech compared to the newest Neoverse or AVX-512 monsters.
|
| https://www.researchgate.net/profile/Ch-
| Srinivas/publication...
| Symmetry wrote:
| Just as importantly, most value movement along critical
| dependency chains happens via the bypass network, forwarding
| from one execution port directly to the next without ever
| making the round trip to the registers.
| peterfirefly wrote:
| Registers and ALUs are sort of already coupled like that in
| practice.
|
| Longer, less wrong version: modern CPUs have forwarding
| networks with latches in them. Those latches store ALU results.
| Those results are (usually) the equivalent of the content-to-be
| of a specific version of a register -- and by register, I
| actually mean register name.
|
| So, "registers" that are currently active get to live in the
| forwarding network where all the routing is and "registers"
| that aren't as active get to live in the register file, away
| from the ALUs and the forwarding network.
|
| (And the "registers" used in machine instructions are really
| just register _names_. There's a lot of renaming going on in
| order to keep many versions (and potential versions) of
| register values live at the same time, to enable superscalar
| and out-of-order execution.)
| eigenform wrote:
| Presumably in that kind of machine, a "register" explicitly
| means "the link between one execution unit and another" and an
| "instruction" configures how a specific unit is linked to
| others. There are no "names of registers," only "names of
| execution units".
|
| 1. When you program this machine, you have to compute the
| schedule at compile-time. Exactly when and how long do you keep
| certain units linked?
|
| 2. You probably have a worse routing problem? - if you want to
| perform arbitrary instructions in a chain, all execution units
| need point-to-point physical wires to carry data to/from all
| the other units that they might need data from (and most of
| them will necessarily be unused?)
|
| Instead of having a massively distributed network of execution
| units, it's probably more efficient to have "virtual" links
| between units: which you implement as shared access to a
| centralized memory (a "register file").
| thesz wrote:
| https://en.wikipedia.org/wiki/Operand_forwarding
|
| Add and subtract units are too expensive to replicate per
| register.
|
| Operand forwarding, referenced above, effectively reduces
| the pipeline length by one step and is quite useful in
| scoreboarding-based implementations of OoO instruction
| issue.
___________________________________________________________________
(page generated 2024-08-05 23:01 UTC)