[HN Gopher] RISC in 2022
       ___________________________________________________________________
        
       RISC in 2022
        
       Author : todsacerdoti
       Score  : 159 points
       Date   : 2022-10-22 06:02 UTC (16 hours ago)
        
 (HTM) web link (wiki.alopex.li)
 (TXT) w3m dump (wiki.alopex.li)
        
       | ranma42 wrote:
       | > 64 registers with 3 registers per instruction would require 21
       | bits
       | 
       | Actually only 18. With 21 bits you can get 128 registers.
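       | 
       | A quick way to check the arithmetic, as a minimal sketch (the
       | helper name `operand_bits` is made up for illustration):

```python
def operand_bits(num_registers: int, operands: int = 3) -> int:
    """Bits needed to name `operands` registers out of `num_registers`."""
    # ceil(log2(n)) for n >= 1, computed without floating point
    bits_per_register = (num_registers - 1).bit_length()
    return operands * bits_per_register


for n in (16, 32, 64, 128):
    print(f"{n:3d} registers -> {operand_bits(n)} bits for 3 operands")
```

       | 64 registers need 6 bits each, so three operands take 18 bits;
       | 7 bits each (21 total) is what 128 registers would need.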
        
       | klelatti wrote:
       | A minor historical quibble:
       | 
       | > RISC was a set of design principles developed in the 1980's
       | that enabled hardware to get much faster and more efficient.
       | 
       | There is a strong argument that RISC as a set of design
       | principles (if not as an acronym) started in the 1970s with the
       | IBM 801 [1] and many of the ideas date back to the 1960s with the
       | CDC 6600 mainframes which were very RISCy.
       | 
       | On a more substantive point, I don't think small code size was
       | ever 'officially' part of the RISC concept. Arm pioneered it with
       | Thumb but I think that was a pragmatic decision to get Arm into
       | devices with limited memory space such as early mobile phones.
       | 
       | [1] https://thechipletter.substack.com/p/the-first-risc-john-coc...
        
         | speed_spread wrote:
         | Nitpick: Hitachi pioneered small code size with the SuperH
         | processors, which ARM licensed and marketed as Thumb.
        
           | klelatti wrote:
           | Nitpick accepted - you're absolutely right. To be precise
           | they licensed the relevant patents of course.
        
       | rwmj wrote:
       | Should be "32-ish _architectural_ registers". Real processors
       | have a lot more registers but they are not directly visible. This
       | is the whole reason why x86-64 is usable despite having only 16
       | architectural integer registers (or actually a little less than
       | that).
       | 
       | The article doesn't get RISC-V instruction encoding right. It
       | mentions compressed instructions, but instructions can also be
       | longer than 32 bits. The important thing about RISC-V is that the
       | instruction stream can easily be divided at instruction
       | boundaries (unlike, say, x86 which is horrific to decode). This
       | gives you most of the benefits of fixed size instructions and the
       | benefits of extensibility when you need it.
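       | 
       | A rough sketch of why the boundaries are easy to find: the low
       | bits of the first 16-bit parcel give the length. This follows
       | my reading of the length-encoding scheme in the unprivileged
       | spec; the 48- and 64-bit cases are the ones I'm least sure have
       | stayed stable, and the function names are invented here.

```python
def rv_insn_length(first_parcel: int) -> int:
    """Instruction length in bytes, from its first 16-bit parcel."""
    if first_parcel & 0b11 != 0b11:
        return 2  # compressed (C extension) instruction
    if first_parcel & 0b11100 != 0b11100:
        return 4  # ordinary 32-bit instruction
    if first_parcel & 0b111111 == 0b011111:
        return 6  # 48-bit encoding space
    if first_parcel & 0b1111111 == 0b0111111:
        return 8  # 64-bit encoding space
    raise ValueError("longer or reserved encoding, not handled in this sketch")


def split_instructions(code: bytes) -> list:
    """Cut a little-endian instruction stream at instruction boundaries."""
    out, pc = [], 0
    while pc < len(code):
        parcel = int.from_bytes(code[pc:pc + 2], "little")
        n = rv_insn_length(parcel)
        out.append(code[pc:pc + n])
        pc += n
    return out


# c.li a0, 0 (0x4501), then nop (0x00000013), then c.nop (0x0001)
stream = bytes.fromhex("0145" "13000000" "0100")
print([len(insn) for insn in split_instructions(stream)])
```

       | The decoder never has to look past the first parcel to know
       | where the next instruction starts.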
        
         | weinzierl wrote:
         | > _" This is the whole reason why x86-64 is usable despite
         | having only 16 architectural integer registers [..]"_
         | 
         | Who'd have thought thirty year ago we'd all be sittin' here
         | talking tens of registers, eh?
         | 
         | In them days we was glad to have two or three.
         | 
         | If we were lucky.
         | 
         | My first computer had 1 (one). One with less bits than fingers
         | on our hands. It was so dear to us we even gave it a name.
         | "Accu" it was called.
         | 
         | If you tell that to the young people today, they won't believe
         | you.
        
           | _joel wrote:
           | Luxury
        
           | Dylan16807 wrote:
           | The 68k was quite a popular chip, with almost 20 registers
           | and a clear trend towards more, and it came out over 40 years
           | ago.
        
           | speed_spread wrote:
           | Sounds as comfortable as living with two teenagers in a house
           | with a single bathroom. You have to optimize around it and use
           | clever tricks that shouldn't be documented.
        
           | kstrauser wrote:
           | My first (usable) computer was a Commodore 64, with 3
           | registers but a special addressing mode for RAM in the
           | 0x00-0xFF range. It was largely used as 256 bytes of "kinda
           | like a register but not exactly" storage.
        
         | adwn wrote:
         | > _Should be "32-ish architectural registers"._
         | 
         | The entire article is about _instruction sets_ , not about
         | their physical implementation, so I don't think this
         | clarification is necessary.
        
         | ajross wrote:
         | x86 isn't _that_ difficult to decode. Instructions all fall on
         | byte boundaries, so if you want to decode 16 bytes of
         | instructions in a cycle you need 16 parallel decode engines.
         | That sounds awful by the standards of 1987 RISC transistor
         | budgets, but for modern CPUs it's mostly noise.
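         | 
         | A toy sketch of the idea, not real x86: assume a made-up
         | encoding where the low two bits of an instruction's first
         | byte give its length minus one. Real x86 length depends on
         | prefixes, opcode, ModRM and more, so treat `toy_length` as a
         | stand-in for a much hairier function.

```python
def toy_length(first_byte: int) -> int:
    """Hypothetical encoding: low two bits of the first byte = length - 1."""
    return (first_byte & 0b11) + 1


def find_starts(window: bytes) -> list:
    # Step 1: guess a length at EVERY byte offset. Each guess is independent,
    # so hardware can run one small decoder per byte in parallel.
    lengths = [toy_length(b) for b in window]
    # Step 2: chain from offset 0 to pick out which guesses were real starts.
    starts, pos = [], 0
    while pos < len(window):
        starts.append(pos)
        pos += lengths[pos]
    return starts


window = bytes([0x00, 0x02, 0xAA, 0xBB, 0x01, 0xCC, 0x03, 0x01, 0x02, 0x03])
print(find_starts(window))
```

         | Most of the per-byte guesses get thrown away, which is the
         | transistor cost being argued about.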
        
       | q-big wrote:
       | > MIPS, Itanium, SPARC and POWER all have weak memory models.
       | 
       | This does not seem to hold for SPARC: according to
       | 
       | > https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tpho...
       | 
       | the (strong) memory models of x86 and SPARCv8 are very related:
       | 
       | "We give two equivalent definitions of x86-TSO: an intuitive
       | operational model based on local write buffers, and an axiomatic
       | total store ordering model, similar to that of the SPARCv8."
       | 
       | "Our x86-TSO axiomatic memory model is based on the SPARCv8
       | memory model specification [20, 21], but adapted to x86 and in
       | the same terms as our earlier x86-CC model."
       | 
       | "We have described x86-TSO, a memory model for x86 processors
       | that does not suffer from the ambiguities, weaknesses, or
       | unsoundnesses of earlier models. Its abstract-machine definition
       | should be intuitive for programmers, and its equivalent
       | axiomatic definition supports the memevents exhaustive search and
       | permits an easy comparison with related models; the similarity
       | with SPARCv8 suggests x86-TSO is strong enough to program above."
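       | 
       | To make "TSO" concrete, here is a small sketch of the classic
       | store-buffering litmus test. It is my own simplified model
       | (one buffered store per thread, event names S/F/L invented
       | here), not the formal model from the paper. It enumerates
       | every interleaving and collects the register results.

```python
from itertools import permutations

# Store-buffering litmus test; x and y start at 0.
#   thread 0: x = 1; r0 = y
#   thread 1: y = 1; r1 = x
PROGRAM = {0: ("x", "y"), 1: ("y", "x")}  # thread -> (variable stored, variable loaded)


def outcomes(tso: bool) -> set:
    """Every reachable (r0, r1) pair under SC or under a TSO-like model."""
    # Per-thread events: S = store issued (enters the write buffer),
    # F = store flushed to shared memory, L = load reads shared memory.
    events = [(t, e) for t in PROGRAM for e in "SFL"]
    # SC: the store must be in memory before the load runs (F before L).
    # TSO: the store only has to be issued before the load (S before L).
    must_precede_load = "S" if tso else "F"
    results = set()
    for order in permutations(events):
        pos = {ev: i for i, ev in enumerate(order)}
        if any(pos[(t, "S")] > pos[(t, "F")]
               or pos[(t, must_precede_load)] > pos[(t, "L")]
               for t in PROGRAM):
            continue  # this ordering is not allowed by the model
        memory = {"x": 0, "y": 0}
        regs = {}
        for t, e in order:
            stored, loaded = PROGRAM[t]
            if e == "F":
                memory[stored] = 1
            elif e == "L":
                # No store forwarding needed here: each thread loads the
                # variable it did not store.
                regs[t] = memory[loaded]
        results.add((regs[0], regs[1]))
    return results


print("SC :", sorted(outcomes(tso=False)))
print("TSO:", sorted(outcomes(tso=True)))
```

       | The only difference between the two result sets is r0 = r1 = 0,
       | which needs a store to sit in a write buffer past a later
       | load. That one relaxation is what x86 and SPARC TSO share.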
        
       | vodou wrote:
       | Poor OpenRISC... Not even mentioned when they talk about RISC.
        
       | renox wrote:
       | I remember an interesting message about 'scalable' vector
       | extensions which pointed out that they weren't necessarily
       | compatible with loop unrolling. An interesting point to keep in
       | mind.
        
         | bumblebritches5 wrote:
        
       | rcgorton wrote:
       | Re: register windows. I disagree: code size wasn't the killer
       | here, it was how DEEP the stack got. If your architectural
       | register window spilled at 4 deep, then calls 3 deep were fine,
       | but if you had a set of code attempting to iterate over a tight
       | loop which had 8 calls deep, you were in [performance] trouble.
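       | 
       | A simplified sketch of that effect, assuming a made-up model
       | where each call needs one window and the oldest window is
       | spilled on overflow (real SPARC reserves a window and traps
       | into an OS handler, so the exact numbers will differ):

```python
def window_traps(num_windows: int, depth: int, iterations: int) -> int:
    """Count spill/fill traps for a loop that calls `depth` deep, then returns."""
    traps = 0
    current = 0  # frame index of the running function
    oldest = 0   # oldest frame still held in a register window
    for _ in range(iterations):
        for _ in range(depth):  # nested calls
            current += 1
            if current - oldest + 1 > num_windows:
                oldest += 1  # overflow: spill the oldest window to memory
                traps += 1
        for _ in range(depth):  # matching returns
            current -= 1
            if current < oldest:
                oldest -= 1  # underflow: fill the window back from memory
                traps += 1
    return traps


for depth in (3, 4, 8):
    print(depth, window_traps(num_windows=4, depth=depth, iterations=1000))
```

       | With 4 windows, 3-deep calls never trap, while an 8-deep call
       | chain in a loop pays for spills and fills on every iteration.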
       | 
       | Another divot: asymmetric functional units. Some versions of
       | Alpha supported a PopCount instruction, but it only worked in a
       | single functional unit, which made scheduling a pain, esp. if you
       | had to write in assembly language.
       | 
       | I'm not convinced that AVX 256 and AVX 512 are useful for non-
       | matrix operations. Most strings (more importantly, parsing
       | bounded by whitespace) are much shorter than 512 bits (64 bytes).
       | In English, I cannot come up with many words longer than 16 bytes
       | (some place names, antidisestablishmentarianism, chemical
       | compound names, and some other stuff)
        
         | loup-vaillant wrote:
         | > _I'm not convinced that AVX 256 and AVX 512 are useful for
         | non-matrix operations._
         | 
         | I've observed that compared to regular x86-64 code without
         | SIMD, using AVX 256 speeds up the Chacha20 cipher (for long
         | messages so they can be processed in 512-byte chunks (8
         | blocks)) by a factor of 5. Network packets easily exceed 1KB,
         | and files are usually much bigger.
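         | 
         | A sketch of why it vectorizes so well: the ChaCha quarter
         | round is only add, xor and rotate on 32-bit words, so the
         | same steps can run on 8 independent blocks at once, one per
         | 32-bit lane of a 256-bit register. Python lists stand in for
         | vector registers here; this is an illustration, not the AVX
         | code I measured, and the `v*` helper names are invented.

```python
MASK = 0xFFFFFFFF


def vadd(x, y):
    """Lane-wise 32-bit wrapping add."""
    return [(p + q) & MASK for p, q in zip(x, y)]


def vxor(x, y):
    """Lane-wise xor."""
    return [p ^ q for p, q in zip(x, y)]


def vrotl(x, n):
    """Lane-wise 32-bit rotate left by n bits."""
    return [((p << n) | (p >> (32 - n))) & MASK for p in x]


def quarter_round_lanes(a, b, c, d):
    """ChaCha quarter round; each list holds one word from each of 8 blocks."""
    a = vadd(a, b); d = vrotl(vxor(d, a), 16)
    c = vadd(c, d); b = vrotl(vxor(b, c), 12)
    a = vadd(a, b); d = vrotl(vxor(d, a), 8)
    c = vadd(c, d); b = vrotl(vxor(b, c), 7)
    return a, b, c, d


# Quarter-round test vector from RFC 8439, section 2.1.1, copied into all 8 lanes.
lanes = 8
a, b, c, d = quarter_round_lanes([0x11111111] * lanes, [0x01020304] * lanes,
                                 [0x9B8D6F43] * lanes, [0x01234567] * lanes)
print(hex(a[0]), hex(b[0]), hex(c[0]), hex(d[0]))
```

         | Eight lanes times a 64-byte block is where the 512-byte
         | chunk size comes from.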
         | 
         | Matrix operations aren't the only viable niche.
        
           | sitkack wrote:
           | SIMD has _many_ non-matrix uses.
           | 
           | https://simdjson.org/
        
       | danybittel wrote:
       | If you think about it, instruction encoding is really a
       | compression method.
       | 
       | I wonder, if it would make sense to decouple compression and
       | microcodes. So you could take a body of Code and find the best
       | compression for it. Or even be able to change "lookup tables"
       | before starting the operating system. Possibly have different
       | compression methods (x86 / arm..) run on the same CPU, without
       | any drawbacks. Which could get you around licensing an ISA. (Yes
       | I'm a software engineer thinking about hardware)
        
         | Someone wrote:
         | That is more or less what Transmeta did
         | (https://en.wikipedia.org/wiki/Transmeta#Code_Morphing_Softwa...)
         | 
         | The "without any drawbacks" part never panned out, though.
         | 
         | It also is fairly similar to what Apple has done with emulating
         | 68k on PPC, PPC on x64, and x64 on ARM (but those, AFAIK, do
         | not offer full emulation of the host CPU, as they don't need to
         | run code in kernel mode)
        
         | ThrowawayR2 wrote:
         | " _The jack of all trades is the master of none._ " Being able
         | to do any type of ISA means you have to put in the circuits to
         | do everything those ISAs can do and those extra circuits are
         | just expensive deadweight when the processor is configured as a
         | specific ISA.
        
         | ip26 wrote:
         | The most interesting concept here (to me) is the idea of taking
         | a variable length instruction set and optionally promising to
         | align instructions. If you did this, the code could switch
         | between dense and wide as it pleased, getting either the best
         | footprint in the instruction cache or allowing extremely wide
         | decode.
        
       | fulafel wrote:
       | > RISC was a set of design principles developed in the 1980's
       | that enabled hardware to get much faster and more efficient. We
       | tend to still call modern-looking instruction sets "RISC-y", but
       | really, a bunch of the original design principles of RISC CPU's
       | have not stood the test of time. So let's look at the things that
       | have worked and not worked between the 1980's and 2022.
       | 
       | I think the design principles of RISC were actually on a meta
       | level above this: take the quantitative approach, use your
       | transistors to best serve the software you have and compilers you
       | can build. "Nice ISA to write assembly code for" was thrown out,
       | or at least demoted significantly.
       | 
       | In that 80s moment in the transistor count curve, it meant
       | simplifying the ISA very radically in order to implement fewer
       | instructions in a hard-coded way without microcode, to the point
       | of ditching HW multiply instructions. The microarchitecture could
       | either do fast, pipelined execution or a large instruction set in
       | the transistor budget, so you optimized for bang for the buck in the
       | whole-system sense. You could make simple fast machines that were
       | designed to run Unix (so had just enough VM, exception, etc
       | support).
        
         | gumby wrote:
         | > I think the design principles of RISC were actually on a meta
         | level above this: take the quantitative approach, use your
         | transistors to best serve the software you have and compilers
         | you can build. "Nice ISA to write assembly code for" was thrown
         | out, or at least demoted significantly.
         | 
         | You have it right: this is almost exactly (in different words!)
         | what Radin wrote in his original RISC paper.
        
         | klelatti wrote:
         | Largely agree with you here but I think there were two other
         | factors in RISC's early success:
         | 
         | 1. As microprocessors became dominant freezing complex
         | microcode - with possible bugs - on an IC was a really bad
         | idea. Better to run with simpler instructions and less
         | microcode.
         | 
         | Look at the debugging issues that National Semiconductor had
         | with the NS32016 - which I think really hindered its adoption.
         | 
         | 2. You needed a much smaller team to design a RISC CPU - low
         | double figures for IBM 801, Berkeley RISC, MIPS and Arm. This
         | opened the door to lots of experiments and business models that
         | would not have been possible with CISC.
        
           | _0ffh wrote:
           | Yep, RISC is really a KISS CPU, paired with the attempt to
           | draw the maximal advantage from the potential strengths of
           | the underlying concept. And it turned out pretty nicely imho.
        
           | sitkack wrote:
           | Almost no systems use fixed microcode.
           | 
           | What RISC took advantage of was decoupling the memory bus
           | from the CPU clock rate and the introduction of instruction
           | and data caches.
           | 
           | RISC is a simple interface, but the internal processor
           | complexity is as high or higher than CISC (thanks to RISC
           | simplicity). Speculative execution, reordering instructions,
           | hazards, branch prediction: these issues are the same across
           | both.
           | 
           | The front end isn't this huge deal. CPUs all do the same
           | thing, attempt to unroll huge state machines and compress
           | time. The ISA is just a way to get that problem into the CPU.
        
             | klelatti wrote:
             | The key word in my comment is 'early'. You're right about
             | architectures today but that explicitly wasn't what I was
             | talking about. Caches were important in the IBM 801 but
             | your other points don't apply to early RISC designs.
        
               | sitkack wrote:
               | RISC didn't hit its stride until the early 80s, using
               | caches. The 801 was an experiment. When the 801 shipped,
               | almost _no_ system had hard coded microcode. Being able
               | to update microcode in the field isn't the reason RISC
               | thrived.
        
               | klelatti wrote:
               | I can think of at least two of the RISC pioneers who have
               | cited buggy CISC architectures / microcode as a key
               | motivating reason for adopting RISC.
               | 
               | I actually cited an example of a major CISC design that
               | basically failed because it was so buggy.
               | 
               | The early microprocessors all had hard coded microcode -
               | Intel only had upgradeable microcode with P6 in the mid
               | 1990s.
               | 
               | I didn't say that it was the reason - there were many
               | reasons - but that it was one factor. It was absolutely
               | the case that original RISC designs were simpler to
               | design and that helped them to get traction.
               | 
               | Edit - just to add that caches enabled RISC to get decent
               | performance but they were not by themselves a reason for
               | choosing RISC over CISC.
        
         | bogomipz wrote:
         | >"Nice ISA to write assembly code for" was thrown out, or at
         | least demoted significantly."
         | 
         | Interesting. Was the guiding principle of CISC ISAs essentially
         | that they be easy to write assembly code for, then? Maybe it's
         | obvious but I had never considered how or where CISC evolved
         | from. Would I have to look at something like the history of the
         | VAX ISA to understand this better?
        
           | pjmlp wrote:
           | Yes, with CISC instructions and powerful macro assemblers
           | like TASM and MASM.
        
           | klelatti wrote:
           | I think 'easy to write assembly code for them' might be
           | stretching it a little bit. Some instructions were pretty
           | complex and I would expect them to be quite hard to use in
           | practice.
           | 
           | What is certainly true is that a single instruction could do
           | a lot, making for much more concise code than would be the
           | case for RISC code.
           | 
           | IBM S/360 is probably the most influential CISC architecture.
           | There is lots of S/360 documentation online. If of interest I
           | did a short post on S/360 assembly a little while ago.
           | 
           | https://thechipletter.substack.com/p/writing-ibm-s360-assemb...
        
             | bogomipz wrote:
             | Thanks but your link has two sentences and then is
             | interrupted by an input box to subscribe to your newsletter
             | and by the time I made it to there my reading was again
             | interrupted, this time by an obnoxious pop up prompt again
             | asking me to sign up for the newsletter. Is there a reason to
             | be this aggressive? It's hard to believe this is a
             | successful strategy. If someone enjoys your content they
             | will sign up after they've actually been able to read the
             | article. Who signs up for a newsletter before they've even
             | been able to read the first paragraph? I didn't bother with
             | the rest of your link after this.
        
               | klelatti wrote:
               | Thanks for letting me know. It's a Substack thing not
               | something I've chosen to do - I've tried to see if I
               | could turn off all pop-ups in the past without
               | success. I didn't know it was quite that bad so will have
               | another look and possibly feedback to Substack.
        
         | sroussey wrote:
         | You mention the key part here--it was designed for compilers
         | writing assembly, not people writing assembly.
         | 
         | And more importantly, future compilers, not the tech they had
         | at the time.
        
           | pclmulqdq wrote:
           | This is probably the most important factor in the RISC
           | revolution. The printed size of assembly doesn't matter any
           | more (only the number of bytes), and neither does
           | readability.
        
       | cpeterso wrote:
       | Andrew Waterman's PhD dissertation, "Design of the RISC-V
       | Instruction Set Architecture", is quite accessible and starts
       | with a similar analysis of other ISAs (including OpenRISC):
       | 
       | https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-...
        
       | cesarb wrote:
       | > Nobody besides x86_64 seems to have dedicated stack registers
       | anymore, they're defined only by convention.
       | 
       | Aarch64 also has a dedicated SP register.
        
         | renox wrote:
         | I remember reading that in fact it was an important improvement
         | from ARM32 to Aarch64 to have a dedicated stack register. Not
         | surprisingly Risc-V has also a dedicated stack register.
        
           | zozbot234 wrote:
           | > Not surprisingly Risc-V has also a dedicated stack
           | register.
           | 
           | I don't think this is correct. It has a _suggested_ SP
           | register, which merely gets some special support in the C
           | subset. But that's just an optional compression scheme and
           | not really part of the ISA design.
        
             | renox wrote:
             | Thanks for correcting me. I was misled by the assembly
             | language.
        
           | kragen wrote:
           | RV32I, RV64I, and RV128I (the base RISC-V architecture) don't
           | have a dedicated stack register in the instruction set, and
           | I'm not even sure if your code will even run faster on fancy
           | implementations if you use x2 as the stack pointer in the
           | standard way. However, the compressed instruction extension
           | has "compressed" (16-bit-long) instructions that implicitly
           | index from x2: C.LWSP, C.LDSP (on RV64C and theoretically
           | RV128C), C.LQSP (RV128C only), C.FLWSP, and C.FLDSP; and
           | corresponding store instructions. These instructions
           | incorporate a 6-bit immediate offset field which is added to
           | x2 to form the effective address.
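           | 
           | A small sketch of how far those instructions reach,
           | assuming (as I read the C extension) that the 6-bit
           | immediate is unsigned and scaled by the access size. The
           | table and function names are mine.

```python
# Access size in bytes for each stack-pointer-relative compressed load.
ACCESS_SIZE = {"C.LWSP": 4, "C.FLWSP": 4, "C.LDSP": 8, "C.FLDSP": 8, "C.LQSP": 16}


def sp_offsets(mnemonic: str) -> range:
    """Byte offsets from x2 reachable with a 6-bit scaled immediate."""
    size = ACCESS_SIZE[mnemonic]
    return range(0, 64 * size, size)  # 64 = 2**6 immediate values


for name in ("C.LWSP", "C.LDSP", "C.LQSP"):
    offs = sp_offsets(name)
    print(f"{name}: {len(offs)} slots, max offset {offs[-1]} bytes")
```

           | Under that assumption a 32-bit load reaches the first 256
           | bytes of the frame and a 64-bit load the first 512.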
           | 
           | As far as I know, that's the full extent to which RISC-V has
           | a dedicated stack register: it has a compressed instruction
           | format that uses x2 as a base register, but not in the base
           | ISA, just a standard extension. There's no dedicated PUSH or
           | POP instruction, no dedicated instruction for storing the
           | link register into the stack, no dedicated instructions for
           | incrementing or decrementing x2 (you do that with ADDI, which
           | can be compressed as C.ADDI as long as the stack frame size
           | is less than 32 bytes, which means it has to be 16 bytes in
           | the standard ABI), not even autoincrement and autodecrement
           | addressing modes.
        
             | unwind wrote:
             | Too lazy to look this up (or even figure out the relevant
             | extension) but the obvious question to me would be how
             | storing state for interrupts is handled?
        
               | kragen wrote:
               | They call the interrupt mechanism "traps", reserving
               | "interrupt" for traps caused by asynchronous inputs, and
               | (in recent versions of RISC-V) they're specified in the
               | separate "The RISC-V Instruction Set Manual, Volume II:
               | Privileged Architecture". I'm looking at version 1.12
               | (20211203).
               | 
               | Basically there are special registers for trap handling,
               | which are CSRs: xscratch (a scratch register), xepc (the
               | trapping program counter), xcause and xtval (which trap),
               | and xip (interrupts pending). These come in four sets:
               | x=s (supervisor-mode), x=m (machine-mode, with a couple
               | of extras), x=h (hypervisor mode, which has some
               | differences), and x=vs (virtual supervisor). You can't
               | handle traps in U-mode, so in a RISC-V processor with
               | trap handling and without multiple modes, you're always
               | in M-mode. (See p.3, 17/155.)
               | 
               | I haven't done this but I suppose that what you're
               | supposed to do in a mode-X trap handler is start by
               | saving some user register to xscratch, then load a useful
               | pointer value into that user register off which you can
               | index to save the remaining user registers to memory.
               | 
               | I guess you know xscratch (and xepc, etc.) wasn't
               | previously being used because you only use them during
               | this very brief time and leave x-mode traps disabled
               | until you finish using it. If all your traps are
               | "vertical" (from a less-privileged mode like U-mode into
               | a more-privileged mode like M-mode) you don't have to
               | worry about this, because you'll never have another
               | x-mode trap while running your x-mode trap handler.
               | 
               | I should probably check out how FreeBSD and Linux handle
               | system calls on RV64.
               | 
               | dh` explained the following technique to me, as explained
               | to him by jrtc27: upon entry to, say, an S-mode trap
               | handler, you use CSRRW to swap the stack pointer in x2
               | with the sscratch register, if it's null you swap back,
               | then push all the registers on the stack, then you can do
               | real work.
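               | 
               | A minimal simulation of that swap, as I understand
               | it. It assumes the convention that sscratch holds
               | the kernel stack pointer while user code runs and 0
               | while the kernel runs; the dict-of-registers model
               | and the addresses are invented for illustration.

```python
def trap_entry(regs: dict) -> str:
    """Model the first instructions of an S-mode trap handler."""
    # csrrw sp, sscratch, sp : swap x2 with the sscratch CSR in one instruction
    regs["sp"], regs["sscratch"] = regs["sscratch"], regs["sp"]
    if regs["sp"] == 0:
        # sscratch was null, so the trap came from S-mode and sp was already
        # a kernel stack pointer: swap back and carry on with it.
        regs["sp"], regs["sscratch"] = regs["sscratch"], regs["sp"]
        return "from kernel"
    # sp is now the kernel stack; sscratch holds the interrupted user sp,
    # which the handler would save before zeroing sscratch.
    return "from user"


user = {"sp": 0x7FFF0000, "sscratch": 0x80200000}
print(trap_entry(user), hex(user["sp"]), hex(user["sscratch"]))

kernel = {"sp": 0x801FFFC0, "sscratch": 0}
print(trap_entry(kernel), hex(kernel["sp"]), hex(kernel["sscratch"]))
```

               | Either way you end up with a usable kernel stack in
               | x2 without having clobbered any other register.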
        
               | unwind wrote:
               | Very informative, thanks a lot!
        
         | zerohp wrote:
         | Correct. It's a common mistake to think of the stack pointer as
         | just a convention for register 31 in A64. The 31 encoding has
         | two uses. For some instructions it is the zero register XZR and
         | others it is SP.
        
       | gbin wrote:
       | AFAIK the weak memory model gets interesting when you start to
       | scale multicore. It is up to the programmer to synchronize
       | accesses correctly, so when synchronization is not necessary all
       | the units accessing memory can operate more independently.
        
         | [deleted]
        
       | klodolph wrote:
       | You can think of "fused multiply-add" as an instruction that does
       | two things at once, but I think of FMA more as a single operation
       | these days. There's a critical difference in floating-point,
       | where the result of FMA can be rounded, so you're not actually
       | getting the same result as separate multiply + add.
        
         | Findecanor wrote:
         | I suppose you meant to write that: a FMA instruction rounds
         | only once at the end, but both the Multiply and Add
         | instructions also round once each = total two times.
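         | 
         | A small sketch of the difference. Since not every Python
         | has a real FMA available, the fused version is emulated
         | with exact rationals and one final rounding; the function
         | name is made up.

```python
from fractions import Fraction


def fma_emulated(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to the nearest double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = b = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

separate = a * b + c          # product rounded, then sum rounded
fused = fma_emulated(a, b, c) # rounded only once, at the end

print("separate:", separate)
print("fused:   ", fused)
```

         | The exact product is 1 + 2^-26 + 2^-54. Rounding it to a
         | double drops the 2^-54 term, so the separate version
         | cancels to zero, while the fused version keeps 2^-54.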
         | 
         | BTW. Some MIPS processors did have a FMA instruction that _did_
         | round twice. The compiler was thus able to fuse instructions
         | without the code giving different results. This was deprecated
         | in later versions, however.
        
           | klodolph wrote:
           | I thought I _did_ write that?
        
         | vardump wrote:
         | Yeah, FMA is really just one operation.
         | 
         | Except when it's time to boast about FLOPS. Then it's _always_
         | counted as two operations. :-)
        
       ___________________________________________________________________
       (page generated 2022-10-22 23:02 UTC)