[HN Gopher] Cortex A73's Not-So-Infinite Reordering Capacity
       ___________________________________________________________________
        
       Cortex A73's Not-So-Infinite Reordering Capacity
        
       Author : ingve
       Score  : 175 points
       Date   : 2024-08-04 21:30 UTC (1 day ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | phkahler wrote:
       | I've recently been contemplating the idea of putting a small ALU
       | with each register. Simplest would do just AND, OR, XOR. A bigger
       | one might add and subtract. The idea is that any instruction that
       | clobbered a source operand would only need one read port (and no
       | write port) from the register file. As routing near registers
       | gets more complicated it might make sense to put a bit of
       | execution right there with the data.
       | 
       | Thoughts on this?
        
         | gen3 wrote:
          | It's been a few years, but my understanding is that most
          | CPU time is spent waiting for data, not on logic. I wonder
          | if there would be a real impact on execution speed.
        
           | phkahler wrote:
           | I wasn't really thinking of faster execution speed overall.
           | Just faster for a given number of ports on the register file.
            | It may also eliminate the need for result forwarding
            | hardware since the registers would just be the ALU output
            | latches.
        
           | The_Colonel wrote:
           | Most of the crazy parts of CPUs (out-of-order, speculative
           | execution, branch predictors, cache hierarchy) are ways of
              | trying to work around slow memory. Improving logic
              | execution can allow speculation to go farther and thus
              | prefetch sooner. Compressing the instruction encoding
              | can lower the need to fetch instructions.
        
             | fulafel wrote:
              | Most of those, except the cache, are attempts to work
              | around the bottleneck of a single flow of control and
              | thus limited available parallelism.
        
               | tsimionescu wrote:
               | Unfortunately all experience shows that both programmers
               | and compilers are _much_ worse at parallelizing code than
               | CPUs are. There have been many attempts at architectures
               | that allowed compilers or programmers to express code in
                | more parallel ways (VLIW architectures, Intel's Itanium,
               | IBM's Cell used in the PS3) and they all categorically
               | failed.
               | 
               | Successful processors offer a sequential execution model,
               | and handle the parallelism internally. Even CUDA is
               | designed somewhat like this: you express your code in a
               | largely linear fashion, and rely on the NVIDIA-created
               | compiler and on the GPU itself to run it in parallel.
        
               | adrian_b wrote:
                | CUDA is quite different. In CUDA you do not express
                | parallel code in a linear (i.e. sequential) fashion
                | and hope that CUDA will determine the dependence
                | chains, extract them from the sequential code, and
                | execute them in parallel.
               | 
               | The main feature of CUDA is that in order to describe an
               | algorithm that is applied to an array, you just write the
               | code that applies the algorithm to an element of that
               | array, like writing only the body of a "for" loop.
               | 
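                | For illustration, a minimal sketch of this style (the
                | kernel name and launch sizes are my own, arbitrary
                | choices):
                | 
                |     __global__ void scale(float *a, float s, int n) {
                |         // Each thread handles one array element; the
                |         // grid as a whole covers the array.
                |         int i = blockIdx.x * blockDim.x + threadIdx.x;
                |         if (i < n)
                |             a[i] *= s;
                |     }
                | 
                |     // Host side: the launch configuration, not a loop,
                |     // expresses the parallelism:
                |     //   scale<<<(n + 255) / 256, 256>>>(d_a, 2.0f, n);
                | 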
               | Then the CUDA run-time will take care of creating an
               | appropriate number of execution threads, taking into
               | account the physical configuration of the available GPU,
               | e.g. how many array elements can be processed by a single
               | instruction, how many execution threads can share a
               | single processing core, how many processing cores exist
               | in the GPU, and so on. When there are more array elements
               | than the GPU resources can process at once, CUDA will add
               | some appropriate looping construct, to ensure the
               | processing of the entire array.
               | 
                | The CUDA programmer writes the code that processes a
                | single array element, thus informing the CUDA run-time
                | that this code is independent of its replicas that
                | process any other array element, except where the
                | programmer explicitly references other array elements,
                | which is normally avoided.
               | 
               | The task of a CPU able to do non-sequential instruction
               | execution (a.k.a. out-of-order execution) and
               | simultaneous initiation of multiple instructions (a.k.a.
               | superscalar instruction execution) is quite different.
               | 
               | The main problem for such a CPU is the determination of
               | the dependence relationships between instructions, based
               | on examining the register numbers encoded in the
               | instructions. Based on the detected dependencies and on
               | the availability of the operands, the CPU can schedule
               | the execution of the instructions, in parallel over
               | multiple execution units.
               | 
                | There is an open research problem: is there any better
                | way to pass the information about the dependencies
                | between instructions from the compiler, which already
                | knows them, to the CPU that runs the machine code,
                | other than by using fake register numbers whose only
                | purpose is to express the dependencies, and which must
                | be replaced for execution with the real numbers of the
                | physical registers by the renaming mechanism of the
                | CPU?
        
               | tsimionescu wrote:
               | Agreed, CUDA is very different. But still, it's another
               | example where programmers just write sequential single
               | flow of execution code, and it gets executed in parallel
               | according to "external" rules.
        
               | The_Colonel wrote:
                | Out-of-order execution doesn't imply parallelism; it
                | was designed from the beginning to work around the
                | input data availability problem.
               | 
                | In speculative execution and branch prediction,
                | prefetch may seem like just a nice bonus, but given
                | that CPU performance is nowadays largely bottlenecked
                | by memory access, the prefetching that results from
                | these techniques often turns out to be the dominant
                | performance factor.
        
               | tsimionescu wrote:
                | It's still a form of parallelism that could in
                | principle be written into the program instead of being
                | automatically implemented by the processor.
               | 
                | For example, in hand-crafted assembly programs it's
                | common to know how long a fetch operation takes, and
                | to manually schedule operations such that they can be
                | executed in parallel with the fetch.
               | 
                | Theoretically, a high-level language could also be
                | designed to expose this kind of logic to the
                | programmer. A program in such a language would be
                | expressed as a set of very, very short threads that
                | can be interleaved by the compiler, given precise
                | knowledge of instruction timings.
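                | 
                | A rough sketch of the manual-interleaving idea in C
                | (the loop and data layout are hypothetical;
                | __builtin_prefetch is a real GCC builtin):
                | 
                |     #include <stddef.h>
                | 
                |     // Overlap independent arithmetic with fetches.
                |     float sum_indexed(const float *table,
                |                       const size_t *idx,
                |                       const float *x, size_t n) {
                |         float total = 0.0f, acc = 0.0f;
                |         for (size_t i = 0; i + 1 < n; i++) {
                |             // Start the next fetch early...
                |             __builtin_prefetch(&table[idx[i + 1]]);
                |             // ...do independent work that overlaps
                |             // with it...
                |             acc += x[i] * x[i];
                |             // ...then consume the previously fetched
                |             // value.
                |             total += table[idx[i]];
                |         }
                |         return total + acc;
                |     }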
        
             | _nalply wrote:
             | Would it make sense to compress code and data with
             | something like zstd and let the CPU decompress?
             | 
              | Of course this means a large change in how computers
              | work, but perhaps it is possible to make this opt-in
              | (i.e. backwards-compatible) for software?
        
               | tsimionescu wrote:
                | This would make memory read performance much, much
                | more unpredictable, so it is a no-go from the start.
                | And beyond that, the problem is not one of bandwidth,
                | it is one of latency. This would sometimes increase
                | bandwidth, but it would always increase latency, which
                | is a terrible trade-off. Mispredicted branches would
                | cost even more than they do today, for example.
        
               | The_Colonel wrote:
               | This idea isn't about compressing in-flight, but in the
               | instruction cache, so that more instructions will fit
               | into the cache, and you don't need to fetch as often (and
               | thus incur latency) from main memory / L2.
               | 
                | Zstd is impractical, but I can imagine some sort of
                | storage-efficient microcode? (Current Intel CPUs store
                | the original x86 instructions in the L1 instruction
                | cache.)
        
               | simiones wrote:
                | You then need extra memory to store the decompressed
                | instructions, since you can't predict, before running
                | the decompression, how many instructions you'll get.
                | And you still have the problem of an unpredictably-
                | sized instruction cache.
               | 
               | Plus, the larger the instruction cache is, the worse
               | every branch mis-prediction is. As far as I know, the
               | size of the instruction cache is not really limited
               | because of space issues, it's limited for precisely this
               | reason.
        
               | adgjlsfhk1 wrote:
                | The problem here is branches. Since you can jump to
                | pretty much any instruction, every instruction needs
                | to be a valid place to start decoding, which makes the
                | idea untenable.
        
               | pkhuong wrote:
               | x86 is a variable-length encoding where more frequently
               | used instructions tend to have shorter encodings. Thumb
                | doesn't go as far, with only 2- and 4-byte
                | instructions.
               | You can find old papers on Huffman encoding instructions.
               | 
               | More effective block compression schemes are harder to
               | pull off because of branches.
        
               | nullc wrote:
                | Some instruction sets are variable-length; there are
                | also things like Thumb and the 'compressed' RISC-V
                | extension.
               | 
               | If you compress each instruction one at a time into a
               | variable number of bits the ability to jump to any
               | instruction is preserved but compression is hurt by not
               | being able to use cross-instruction conditional
               | probabilities and having to pack things into integer
               | numbers of bits.
               | 
                | One could imagine a compression format that had 'jump
                | points' -- compiler-selected locations where it was
                | possible to jump to and decode from, so that you'd
                | only take the above losses at potential jump targets.
               | 
                | You could go further and imagine the instruction set
                | having some kind of constrained short jump forward a
                | limited distance (by simply letting the decoder decode
                | that way), or one that can go back a limited distance
                | without using a jump point, so long as there was no
                | way for control flow to change out from under it.
               | 
               | I wonder what percentage of instructions would need to be
               | jump points in such a system?
               | 
               | But I think this is mostly of academic interest: except
               | in very small embedded devices code size is not a huge
               | performance driver and like anything else you get
                | diminishing returns from further optimizations.
                | Variable-length, multiple-of-bytes instruction sets
                | like x86 probably get a significant fraction of the
                | potential compression gains, and do so without a lot
                | of complexity.
        
         | o11c wrote:
         | That breaks OoO users of the register before the operation.
         | 
         | Also, although the linked article complains about "too many
         | ports", remember that the useful size of the register file is
         | ultimately limited by how many in-flight operations are
         | possible, which is determined by pipeline depth and number of
         | instructions between branches.
        
           | phkahler wrote:
           | >> That breaks OoO users of the register before the
           | operation.
           | 
           | How so? The register can be used the same as before but
           | clobber operations don't have to be sent over to a separate
           | ALU and "back".
        
             | dzaima wrote:
              | If you have, say, 'add r1, r2, r3; add r2, r2, 1' and
              | want to do the latter instruction in-register-file, you
              | can't reorder them (e.g. if r2 is known before r3), as
              | after having run the second instruction the first one's
              | r2 operand would be nowhere to be found. You'd need to
              | track whether there are any unfinished usages of each
              | register (or maybe the simpler option of just tracking
              | whether it has been any operand), which isn't
              | traditionally required.
             | 
              | Perhaps a bigger issue is that, if you have a
              | speculative decision (incl. all loads & stores) between
              | having written a register and the clobbering update, you
              | can't do it in-register either, as that'd make it
              | impossible to roll back.
        
               | phkahler wrote:
               | Thanks! This and some of the other comments have been
               | helpful in understanding how these things work in more
               | detail. So conceptually register names should be seen as
               | result names - the result of a load or an operation gets
               | called r2 for example. The ISA only provides so many
               | result names, but the CPU may have any number of physical
               | registers to store results. It's a nice model, and
               | modifying a result in-register would really mess things
               | up. It would destroy a previous result which would
               | require a very different (more complex) approach to
               | scheduling and clean up.
        
         | Findecanor wrote:
         | I definitely think it is worth exploring other models than the
         | traditional randomly accessed register file.
         | 
         | Your idea reminded me a bit of Scheduled Dataflow [0]
         | architecture where every register is a dedicated input to a
         | unit. Instead of an instruction having source and destination
         | register parameters there are only destination registers.
         | 
         | The Mill does it the other way around: There are only source
         | operands and the result is stored in each unit's dedicated
         | output register.
         | 
         | [0]. https://www.semanticscholar.org/paper/Scheduled-
         | Dataflow%3A-...
        
           | pests wrote:
           | The video where it's revealed the belt in the Mill
           | architecture is completely conceptual is something I randomly
           | think of at night when trying to sleep. Always been
           | fascinated by it.
        
         | gumby wrote:
         | A modern core has a bunch of resources (ALUs, register files
         | and so on) to assign/draw upon as dynamically needed. Not every
         | one is needed for every instruction of course. In the old days
         | when there was a single ALU, and maybe a single address
         | calculator, that was that.
         | 
         | Now the resources can be scheduled in a very complex, out of
         | order, and subinstruction fashion. The chip designers guess
         | what the instruction mix will likely be and hopefully make the
         | right call as to how many Xs are required and how many Ys. Too
         | few and operations in flight will wait. Too many and the chip
         | becomes more complex for no benefit, and maybe those unused
         | resources crowd out space for something else useful.
         | 
         | If you stick an ALU on every register you're guaranteeing to
         | use some area on something not used all the time.
        
         | CalChris wrote:
          | Onur Mutlu has a similar (though definitely not the same)
          | idea of _Processing in Memory_. Basically, the idea is to
          | put some operations nearer to the data. Your idea is nearer
          | to the register; his is in the memory controller, nearer to
          | memory.
         | 
         | https://arxiv.org/pdf/2012.03112
        
           | phkahler wrote:
           | I could see memory getting a vector FPU that takes an entire
           | DRAM row (64Kbit these days) and does things like
           | scalar/vector MAC. Since DRAM is so slow it could be a
           | relatively slow FPU (10 cycles or more). The biggest issue
           | would be standardization. How do you send it instructions and
           | operands? How do you manage the internal state? A standard
           | would be difficult and seems premature given the rapid
           | changes happening even with FP data formats. Oh, looks like
           | that paper goes into some depth on this stuff!
        
         | sroussey wrote:
         | Why stop there? Move the ALU to the RAM
        
           | rcxdude wrote:
           | There are some proposals of that form: basically if you have
           | a RAM interface that can support commands like "fill this
           | range of memory with this pattern" or "add this constant to
           | all values in this range" it can help speed up a decent
           | amount of code, but also drastically reduce power
           | consumption, as the memory interface is a pretty significant
           | source of power drain on mobile systems. It does
           | significantly complicate the interfaces involved, though, so
           | it would require a lot of effort to implement.
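            | 
            | As a purely hypothetical sketch, such a command interface
            | might look something like this to a driver (every name
            | and field here is made up for illustration):
            | 
            |     #include <stdint.h>
            | 
            |     /* Commands written to a memory-mapped queue, so the
            |        data itself never crosses the memory bus. */
            |     enum pim_op { PIM_FILL = 0, PIM_ADD_CONST = 1 };
            | 
            |     struct pim_cmd {
            |         uint8_t  op;      /* a pim_op value */
            |         uint64_t base;    /* start address of the range */
            |         uint64_t len;     /* length of range, in bytes */
            |         uint64_t operand; /* fill pattern / constant */
            |     };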
        
           | weinzierl wrote:
           | This has been tried. I think the IBM System/360 could do
           | arithmetic directly in RAM.
        
             | _a_a_a_ wrote:
              | Err, any refs for that? I really doubt it but could
              | well be wrong. I'm not aware of any mainstream arch
              | that can or could do that.
        
               | weinzierl wrote:
               | At least the IBM System/370 (not 360 as I said above) had
               | assembly instructions that took memory addresses directly
               | without a register or immediate being in the instruction.
               | 
               | Here is an example of such an instruction from the
               | 
               |  _" AP, Add Packed (Decimal) The data string located at
               | the storage address specified by operand-2 (b2+d2) is
               | added to the data string located at the storage address
               | specified by operand-1 (b1+d1). Operand-2 remains
               | unchanged."_
               | 
               | http://www.simotime.com/asmins01.htm#AP
               | 
               | Now, how this was actually done in hardware I don't know,
               | but I would not be surprised if hidden registers were
               | involved. So, maybe this is not really an example that
               | fits the original premise of this subthread, but I think
               | it is interesting nevertheless.
        
           | vasco wrote:
           | We've been doing the reverse and bringing the RAM to the ALU.
        
           | dzaima wrote:
           | There's already an x86-64 extension for this: RAO-INT (remote
           | atomics).
        
         | ip26 wrote:
         | The entries in this register file would be larger & slower,
         | which means you will not be able to squeeze in as many. This
         | reduces your performance.
         | 
         | Also, the hardware to distribute any entry as the second source
         | to any other entry is effectively a read port, followed by a
         | write port.
        
         | NohatCoder wrote:
         | It is probably easier to add more functionality to existing
         | ALUs. Various instruction sets have added "free" bonus
         | operations, but one could go a step further and allow any two
         | functions from a wide set to be combined, the intermediate
         | value would not hit the registers, and thus save a store and a
         | load relative to two individual instructions.
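          | 
          | E.g., ARM's shifted operands are an existing limited form
          | of this (a sketch; actual compiler output may vary):
          | 
          |     // The shift folds into the add as a single
          |     // instruction, e.g. "add r0, r0, r1, lsl #2"; the
          |     // intermediate (b << 2) never hits the register file.
          |     int index4(int a, int b) { return a + (b << 2); }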
        
         | Tuna-Fish wrote:
         | That's not how registers work. Register names are not
         | registers. Unless you are in the deepest of embedded, all
         | operations that write to a register allocate a new one, and do
         | not write to their input register.
         | 
         | edit: OoO primer, assuming x86_64 but everything else works the
         | same way too:
         | 
         | Your cpu has hundreds of physical registers. There are 16
         | register names managed in the frontend. Whenever the frontend
         | sees an instruction that writes to RAX, it allocates a fresh
         | register that does not contain a value, sets it as pending,
         | writes that register number under RAX in the register alias
         | table (RAT), and sends the instruction forward. Once the
          | instruction is in the scheduler, it waits until all its
          | inputs are ready, and then issues to an execution unit.
          | Once it gets executed, it writes its value to the register
          | allocated for it and sets it as ready.
         | 
         | If you have an instruction that uses the same register name as
         | input and output, those are guaranteed not to be the same
         | register. Used registers get garbage collected once no name or
         | pending instruction refers to them anymore.
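          | 
          | A toy sketch of the rename step just described (not how
          | any real RAT is built; names and sizes are illustrative):
          | 
          |     #include <array>
          |     #include <utility>
          | 
          |     struct Renamer {
          |         std::array<int, 16> rat;  // name -> physical reg
          |         int next_phys;            // hundreds in a real core
          | 
          |         Renamer() : next_phys(16) {
          |             for (int i = 0; i < 16; i++) rat[i] = i;
          |         }
          | 
          |         // One instruction, one source name, one destination
          |         // name; the write gets a fresh physical register,
          |         // so input and output never alias.
          |         std::pair<int, int> rename(int src, int dst) {
          |             int s = rat[src];    // current mapping for src
          |             int d = next_phys++; // fresh reg, set "pending"
          |             rat[dst] = d;        // later readers see this
          |             return {s, d};
          |         }
          |     };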
        
           | phkahler wrote:
           | Thank you, that was a very good explanation of "register
           | renaming". The way you described it "renaming" is really more
           | like allocation and is pretty simple - not done on an as-
           | needed basis, just done automatically for every instruction
           | making it SSA (single static assignment).
        
           | LeifCarrotson wrote:
           | As someone deep in an embedded cave, that's illuminating. You
           | have a completely different meaning abstracted on top of the
           | word "register".
           | 
           | Over here, in AVR and MSP430 and Cortex-M0 land, there are
           | about 16 registers, instructions take at most 2 registers,
           | and write to the destination (often also a source argument,
           | ADD R1 R2 does R1 := R1 + R2). There is no 'frontend', no
           | renaming, no aliasing, no garbage collector; what you see is
           | what you get.
           | 
           | I assume that sometime long ago, the words had the original
           | meaning, but they gradually added abstractions that were
           | still _mostly_ indiscernible at each step of the journey
           | until they had completely diverged.
        
             | mystified5016 wrote:
             | Yeah I'm deep into AVR, never got into the fancy
             | architectures like x86 and this sounded bananas to me. In
             | my world a register is a register is hardware.
             | 
             | Hell, by sidestepping GCC's register allocation one can
             | substantially increase performance if you arrange your data
             | into registers that your ALU or other peripheral likes
             | best. Prime example, RAM access can be twice as fast if you
             | load your target address into the X/Y/Z register with LD
             | instead of letting GCC emit the LDS instruction. You'd do
             | something like store a pointer to a very important struct
             | that you want to access many times very quickly.
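              | 
              | For instance, with avr-gcc you can pin a variable to a
              | specific register (the struct and the choice of Z here
              | are just for illustration):
              | 
              |     #include <stdint.h>
              | 
              |     struct state { uint8_t flags; uint8_t count; };
              | 
              |     /* Pin the pointer to the Z pair (r30:r31) so that
              |        accesses compile to LD/LDD through Z rather than
              |        absolute LDS. */
              |     register struct state *hot asm("r30");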
             | 
             | I honestly have no clue if a similar concept exists on
             | modern x86. I'd assume that you are explicitly meant to not
             | care about things at this level and it's all handled behind
             | the scenes. It's crazy how much the x86 CPU does for you,
             | but I find it much more fun to tweak things close to the
             | bare metal.
        
               | gpderetta wrote:
                | It is not really my speciality, but there are
                | equivalent things in OoO land. Although CPUs are very
                | good at running reasonable code efficiently, peak FP
                | performance is still the realm of hand-written ASM, or
                | at the very least copious use of intrinsics.
               | 
                | It can also be very microarchitecture-specific.
                | Because FP code often needs significant unrolling, the
                | number of architectural registers needed to store
                | partial results can be a bottleneck, especially if the
                | compiler doesn't do a perfect job.
        
               | dzaima wrote:
               | All registers are basically always treated the exact same
               | in OoO-land (though on x86-64 for some instruction forms
               | it can add an extra byte of instruction encoding if
               | there's some operand in r8..r15).
               | 
               | But there's still a massive amount of performance tuning.
               | Some fun examples being how Haswell can only do one FP
               | add per cycle, but two FMAs, so you can sometimes improve
               | throughput by replacing some 'a+b's with 'a*1+b' (at the
               | cost of higher latency). Or how, on older Intel, three-
               | addend 'lea' can execute on only one port (as opposed to
               | two for two-addend 'lea', or 3 or 4 for 'add') and has
               | 3-cycle latency (vs. 1 cycle for two-addend; so two back-
               | to-back 'lea's have lower latency than one three-addend
               | one), but still uses only one port and thus can sometimes
               | be better for throughput.
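                | 
                | E.g., the FMA trick with AVX2 intrinsics (a sketch;
                | whether it wins depends entirely on the surrounding
                | code):
                | 
                |     #include <immintrin.h>
                | 
                |     // On Haswell, FP add has one port but FMA has
                |     // two, so a + b recast as a*1 + b can use the FMA
                |     // ports (at the cost of higher latency).
                |     __m256 add_via_fma(__m256 a, __m256 b) {
                |         return _mm256_fmadd_ps(
                |             a, _mm256_set1_ps(1.0f), b);  // a*1 + b
                |     }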
        
             | Neywiny wrote:
             | While I share your passion, there's still a front end. The
              | front end vs back end of a CPU core exists as long as the
              | logic for decoding an instruction is separate from that of
              | executing it. In an AVR core, the "instruction decode" and
             | "instruction register" blocks are the front end, and the
             | ALU is the backend. The first image result for "msp430 cpu
             | core block diagram" for me was [1] which has the first
             | block saying "front end". So like I get it but just because
             | we work on the smallest cores doesn't mean they're alien
             | tech compared to the newest Neoverse or AVX-512 monsters.
             | 
             | https://www.researchgate.net/profile/Ch-
             | Srinivas/publication...
        
           | Symmetry wrote:
            | Just as importantly, most value movement along critical
            | dependency chains happens via the bypass network,
            | forwarding from one execution port directly to the next
            | without ever making the round trip to the registers.
        
         | peterfirefly wrote:
         | Registers and ALUs are sort of already coupled like that in
         | practice.
         | 
         | Longer, less wrong version: modern CPUs have forwarding
         | networks with latches in them. Those latches store ALU results.
         | Those results are (usually) the equivalent of the content-to-be
         | of a specific version of a register -- and by register, I
         | actually mean register name.
         | 
         | So, "registers" that are currently active get to live in the
         | forwarding network where all the routing is and "registers"
         | that aren't as active get to live in the register file, away
         | from the ALUs and the forwarding network.
         | 
         | (And the "registers" used in machine instructions are really
          | just register _names_. There's a lot of renaming going on
          | in order to keep many versions (and potential versions) of
         | register values live at the same time to enable superscalar
         | execution and out-of-order execution.)
        
         | eigenform wrote:
         | Presumably in that kind of machine, a "register" explicitly
         | means "the link between one execution unit and another" and an
         | "instruction" configures how a specific unit is linked to
         | others. There are no "names of registers," only "names of
         | execution units".
         | 
         | 1. When you program this machine, you have to compute the
         | schedule at compile-time. Exactly when and how long do you keep
         | certain units linked?
         | 
         | 2. You probably have a worse routing problem? - if you want to
         | perform arbitrary instructions in a chain, all execution units
         | need point-to-point physical wires to carry data to/from all
         | the other units that they might need data from (and most of
         | them will necessarily be unused?)
         | 
         | Instead of having a massively distributed network of execution
         | units, it's probably more efficient to have "virtual" links
         | between units: which you implement as shared access to a
         | centralized memory (a "register file").
        
         | thesz wrote:
         | https://en.wikipedia.org/wiki/Operand_forwarding
         | 
          | Add and subtract units are too expensive to replicate per
          | register.
         | 
          | The operand forwarding referenced above effectively reduces
          | pipeline length by one step and is quite useful in
          | scoreboarding-based implementations of OoO instruction
          | issue.
        
       ___________________________________________________________________
       (page generated 2024-08-05 23:01 UTC)