[HN Gopher] RISC in 2022
___________________________________________________________________
RISC in 2022
Author : todsacerdoti
Score : 159 points
Date : 2022-10-22 06:02 UTC (16 hours ago)
(HTM) web link (wiki.alopex.li)
(TXT) w3m dump (wiki.alopex.li)
| ranma42 wrote:
| > 64 registers with 3 registers per instruction would require 21
| bits
|
| Actually only 18. With 21 bits you can get 128 registers.
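| (Worked out: addressing one of 64 registers takes log2(64) = 6
| bits, so three register fields take 3 x 6 = 18 bits; at 3 x 7 = 21
| bits each field is 7 bits wide and can address 2^7 = 128
| registers.)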
| klelatti wrote:
| A minor historical quibble:
|
| > RISC was a set of design principles developed in the 1980's
| that enabled hardware to get much faster and more efficient.
|
| There is a strong argument that RISC as a set of design
| principles (if not as an acronym) started in the 1970s with the
| IBM 801 [1] and many of the ideas date back to the 1960s with the
| CDC 6600 mainframes which were very RISCy.
|
| On a more substantive point, I don't think small code size was
| ever 'officially' part of the RISC concept. Arm pioneered it with
| Thumb but I think that was a pragmatic decision to get Arm into
| devices with limited memory space such as early mobile phones.
|
| [1] https://thechipletter.substack.com/p/the-first-risc-john-coc...
| speed_spread wrote:
| Nitpick: Hitachi pioneered small code size with the SuperH
| processors, which ARM licensed and marketed as Thumb.
| klelatti wrote:
| Nitpick accepted - you're absolutely right. To be precise
| they licensed the relevant patents of course.
| rwmj wrote:
| Should be "32-ish _architectural_ registers ". Real processors
| have a lot more registers but they are not directly visible. This
| is the whole reason why x86-64 is usable despite having only 16
| architectural integer registers (or actually a little less than
| that).
|
| The article doesn't get RISC-V instruction encoding right. It
| mentions compressed instructions, but instructions can also be
| longer than 32 bits. The important thing about RISC-V is that the
| instruction stream can easily be divided at instruction
| boundaries (unlike, say, x86 which is horrific to decode). This
| gives you most of the benefits of fixed size instructions and the
| benefits of extensibility when you need it.
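|
| For illustration, here is a rough C sketch of that length rule as I
| understand it from the base spec: the low bits of the first 16-bit
| parcel tell you the instruction's length, so a decoder never has to
| guess where the next instruction starts. (Encodings longer than 32
| bits are reserved rather than defined, so treat those branches as
| assumptions.)
|
|     #include <stdint.h>
|
|     /* Return the instruction length in bytes implied by the first
|        16-bit parcel, per the RISC-V length-encoding convention. */
|     static unsigned rv_insn_len(uint16_t parcel) {
|         if ((parcel & 0x03) != 0x03) return 2;  /* compressed (C ext) */
|         if ((parcel & 0x1c) != 0x1c) return 4;  /* standard 32-bit */
|         if ((parcel & 0x3f) == 0x1f) return 6;  /* reserved 48-bit */
|         if ((parcel & 0x7f) == 0x3f) return 8;  /* reserved 64-bit */
|         return 0;                               /* longer / reserved */
|     }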
| weinzierl wrote:
| > _" This is the whole reason why x86-64 is usable despite
| having only 16 architectural integer registers [..]"_
|
| Who'd have thought thirty year ago we'd all be sittin' here
| talking tens of registers, eh?
|
| In them days we was glad to have two or three.
|
| If we were lucky.
|
| My first computer had 1 (one). One with less bits than fingers
| on our hands. It was so dear to us we even gave it a name.
| "Accu" it was called.
|
| If you tell that to the young people today, they won't believe
| you.
| _joel wrote:
| Luxury
| Dylan16807 wrote:
| The 68k was quite a popular chip, with almost 20 registers
| and a clear trend towards more, and it came out over 40 years
| ago.
| speed_spread wrote:
| Sounds as comfortable as living with two teenagers in a house
| with a single bathroom. You have to optimize around it and use
| clever tricks that shouldn't be documented.
| kstrauser wrote:
| My first (usable) computer was a Commodore 64, with 3
| registers but a special addressing mode for RAM in the
| 0x00-0xFF range. It was largely used as 256 bytes of "kinda
| like a register but not exactly" storage.
| adwn wrote:
| > _Should be "32-ish architectural registers"._
|
| The entire article is about _instruction sets_ , not about
| their physical implementation, so I don't think this
| clarification is necessary.
| ajross wrote:
| x86 isn't _that_ difficult to decode. Instructions all fall on
| byte boundaries, so if you want to decode 16 bytes of
| instructions in a cycle you need 16 parallel decode engines.
| That sounds awful by the standards of 1987 RISC transistor
| budgets, but for modern CPUs it's mostly noise.
| q-big wrote:
| > MIPS, Itanium, SPARC and POWER all have weak memory models.
|
| This does not seem to hold for SPARC: according to
|
| > https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tpho...
|
| the (strong) memory models of x86 and SPARCv8 are closely related:
|
| "We give two equivalent definitions of x86-TSO: an intuitive
| operational model based on local write buffers, and an axiomatic
| total store ordering model, similar to that of the SPARCv8."
|
| "Our x86-TSO axiomatic memory model is based on the SPARCv8
| memory model specification [20, 21], but adapted to x86 and in
| the same terms as our earlier x86-CC model."
|
| "We have described x86-TSO, a memory model for x86 processors
| that does not suffer from the ambiguities, weaknesses, or
| unsoundnesses of earlier models. Its abstract-machine definition
| should be intuitive for programmers, and its equivalent
| axiomatic definition supports the memevents exhaustive search and
| permits an easy comparison with related models; the similarity
| with SPARCv8 suggests x86-TSO is strong enough to program above."
| vodou wrote:
| Poor OpenRISC... Not even mentioned when they talk about RISC.
| renox wrote:
| I remember an interesting message about 'scalable' vector
| extensions which pointed out that they weren't necessarily
| compatible with loop unrolling. An interesting point to keep in
| mind.
| bumblebritches5 wrote:
| rcgorton wrote:
| Re: register windows. I disagree: code size wasn't the killer
| here, it was how DEEP the stack got. If your architectural
| register window spilled at 4 deep, then calls 3 deep were fine,
| but if you had a set of code attempting to iterate over a tight
| loop which went 8 calls deep, you were in [performance] trouble.
|
| Another divot: asymmetric functional units. Some versions of
| Alpha supported a PopCount instruction, but it only worked in a
| single functional unit, which made scheduling a pain, esp. if you
| had to write in assembly language.
|
| I'm not convinced that AVX 256 and AVX 512 are useful for non-
| matrix operations. Most strings (more importantly, parsing
| bounded by whitespace) are much shorter than 512 bits (64 bytes).
| In English, I cannot come up with many words longer than 16 bytes
| (some place names, antidisestablishmentarianism, chemical
| compound names, and some other stuff).
| loup-vaillant wrote:
| > _I'm not convinced that AVX 256 and AVX 512 are useful for
| non-matrix operations._
|
| I've observed that compared to regular x86-64 code without
| SIMD, using AVX 256 speeds up the ChaCha20 cipher (for long
| messages so they can be processed in 512-byte chunks (8
| blocks)) by a factor of 5. Network packets easily exceed 1KB,
| and files are usually much bigger.
|
| Matrix operations aren't the only viable niche.
| sitkack wrote:
| SIMD has _many_ non-matrix uses.
|
| https://simdjson.org/
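|
| To make that concrete, here is a small sketch of the kind of
| byte-scanning such parsers do: searching 32 bytes per iteration for
| a structural character with AVX2 intrinsics (the function name and
| the choice of '"' as the needle are just illustrative; assumes
| GCC/Clang with -mavx2):
|
|     #include <immintrin.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     /* Return the index of the first '"' in buf, or -1 if absent. */
|     long find_quote(const uint8_t *buf, size_t len) {
|         const __m256i needle = _mm256_set1_epi8('"');
|         size_t i = 0;
|         for (; i + 32 <= len; i += 32) {
|             __m256i chunk =
|                 _mm256_loadu_si256((const __m256i *)(buf + i));
|             uint32_t hits = (uint32_t)_mm256_movemask_epi8(
|                 _mm256_cmpeq_epi8(chunk, needle));
|             if (hits)
|                 return (long)(i + (unsigned)__builtin_ctz(hits));
|         }
|         for (; i < len; i++)            /* scalar tail */
|             if (buf[i] == '"') return (long)i;
|         return -1;
|     }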
| danybittel wrote:
| If you think about it, instruction encoding is really a
| compression method.
|
| I wonder if it would make sense to decouple compression and
| microcode. So you could take a body of code and find the best
| compression for it. Or even be able to change "lookup tables"
| before starting the operating system. Possibly have different
| compression methods (x86 / arm..) run on the same CPU, without
| any drawbacks. Which could get you around licensing an ISA. (Yes
| I'm a software engineer thinking about hardware)
| Someone wrote:
| That is more or less what Transmeta did
| (https://en.wikipedia.org/wiki/Transmeta#Code_Morphing_Softwa...)
|
| The "without any drawbacks" part never panned out, though.
|
| It also is fairly similar to what Apple has done with emulating
| 68k on PPC, PPC on x64, and x64 on ARM (but those, AFAIK, do
| not offer full emulation of the emulated CPU, as they don't need to
| run code in kernel mode)
| ThrowawayR2 wrote:
| " _The jack of all trades is the master of none._ " Being able
| to do any type of ISA means you have to put in the circuits to
| do everything those ISAs can do and those extra circuits are
| just expensive deadweight when the processor is configured as a
| specific ISA.
| ip26 wrote:
| The most interesting concept here (to me) is the idea of taking
| a variable length instruction set and optionally promising to
| align instructions. If you did this, the code could switch
| between dense and wide as it pleased, getting either the best
| footprint in the instruction cache or allowing extremely wide
| decode.
| fulafel wrote:
| > RISC was a set of design principles developed in the 1980's
| that enabled hardware to get much faster and more efficient. We
| tend to still call modern-looking instruction sets "RISC-y", but
| really, a bunch of the original design principles of RISC CPU's
| have not stood the test of time. So let's look at the things that
| have worked and not worked between the 1980's and 2022.
|
| I think the design principles of RISC were actually on a meta
| level above this: take the quantitative approach, use your
| transistors to best serve the software you have and compilers you
| can build. "Nice ISA to write assembly code for" was thrown out,
| or at least demoted significantly.
|
| In that 80s moment in the transistor count curve, it meant
| simplifying the ISA very radically in order to implement fewer
| instructions in a hard-coded way without microcode, to the point
| of ditching HW multiply instructions. The microarchitecture could
| either do fast, pipelined execution or a large instruction set
| within the transistor budget; you optimized for bang for the buck
| in the whole-system sense. You could make simple fast machines
| that were
| designed to run Unix (so had just enough VM, exception, etc
| support).
| gumby wrote:
| > I think the design principles of RISC were actually on a meta
| level above this: take the quantitative approach, use your
| transistors to best serve the software you have and compilers
| you can build. "Nice ISA to write assembly code for" was thrown
| out, or at least demoted significantly.
|
| You have it right: this is almost exactly (in different words!)
| what Radin wrote in his original RISC paper.
| klelatti wrote:
| Largely agree with you here but I think there were two other
| factors in RISC's early success:
|
| 1. As microprocessors became dominant, freezing complex
| microcode - with possible bugs - on an IC was a really bad
| idea. Better to run with simpler instructions and less
| microcode.
|
| Look at the debugging issues that National Semiconductor had
| with the NS32016 - which I think really hindered its adoption.
|
| 2. You needed a much smaller team to design a RISC CPU - low
| double figures for IBM 801, Berkeley RISC, MIPS and Arm. This
| opened the door to lots of experiments and business models that
| would not have been possible with CISC.
| _0ffh wrote:
| Yep, RISC is really a KISS CPU, paired with the attempt to
| draw the maximal advantage from the potential strengths of
| the underlying concept. And it turned out pretty nicely imho.
| sitkack wrote:
| Almost no systems use fixed microcode.
|
| What RISC took advantage of was decoupling the memory bus
| from the CPU clock rate and the introduction of instruction
| and data caches.
|
| RISC is a simple interface, but the internal processor
| complexity is as high as or higher than CISC's (thanks to RISC
| simplicity). Speculative execution, reordering instructions,
| hazards, branch prediction: these issues are the same across
| both.
|
| The front end isn't this huge deal. CPUs all do the same
| thing, attempt to unroll huge state machines and compress
| time. The ISA is just a way to get that problem into the CPU.
| klelatti wrote:
| The key word in my comment is 'early'. You're right about
| architectures today but that explicitly wasn't what I was
| talking about. Caches were important in the IBM 801 but
| your other points don't apply to early RISC designs.
| sitkack wrote:
| RISC didn't hit its stride until the early 80s, using
| caches. The 801 was an experiment. When the 801 shipped,
| almost _no_ system had hard coded microcode. Being able
| to update microcode in the field isn't the reason RISC
| thrived.
| klelatti wrote:
| I can think of at least two of the RISC pioneers who have
| cited buggy CISC architectures / microcode as a key
| motivating reason for adopting RISC.
|
| I actually cited an example of a major CISC design that
| basically failed because it was so buggy.
|
| The early microprocessors all had hard coded microcode -
| Intel only had upgradeable microcode with P6 in the mid
| 1990s.
|
| I didn't say that it was the reason - there were many
| reasons - but that it was one factor. It was absolutely
| the case that original RISC designs were simpler to
| design and that helped them to get traction.
|
| Edit - just to add that caches enabled RISC to get decent
| performance but they were not by themselves a reason for
| choosing RISC over CISC.
| bogomipz wrote:
| >"Nice ISA to write assembly code for" was thrown out, or at
| least demoted significantly."
|
| Interesting. Was the guiding principle of CISC ISAs essentially
| that they be easy to write assembly code for, then? Maybe it's
| obvious but I had never considered how or where CISC evolved
| from. Would I have to look at something like the history of the
| VAX ISA to understand this better?
| pjmlp wrote:
| Yes, with CISC instructions and powerful macro assemblers
| like TASM and MASM.
| klelatti wrote:
| I think 'easy to write assembly code for them' might be
| stretching it a little bit. Some instructions were pretty
| complex and, I would expect, quite hard to use in practice.
|
| What is certainly true is that a single instruction could do
| a lot, making for much more concise code than would be the
| case for RISC code.
|
| IBM S/360 is probably the most influential CISC architecture.
| There is lots of S/360 documentation online. If of interest I
| did a short post on S/360 assembly a little while ago.
|
| https://thechipletter.substack.com/p/writing-ibm-s360-assemb...
| bogomipz wrote:
| Thanks but your link has two sentences and then is
| interrupted by an input box to subscribe to your newsletter
| and by the time I made it there my reading was again
| interrupted, this time by an obnoxious pop-up prompt again
| asking me to sign up for the newsletter. Is there a reason to
| be this aggressive? It's hard to believe this is a
| successful strategy. If someone enjoys your content they
| will sign up after they've actually been able to read the
| article. Who signs up for a newsletter before they've even
| been able to read the first paragraph? I didn't bother with
| the rest of your link after this.
| klelatti wrote:
| Thanks for letting me know. It's a Substack thing, not
| something I've chosen to do - I've tried to see if I
| could turn off all pop-ups in the past, without
| success. I didn't know it was quite that bad so will have
| another look and possibly give feedback to Substack.
| sroussey wrote:
| You mention the key part here--it was designed for compilers
| writing assembly, not people writing assembly.
|
| And more importantly, future compilers, not the tech they had
| at the time.
| pclmulqdq wrote:
| This is probably the most important factor in the RISC
| revolution. The printed size of assembly doesn't matter any
| more (only the number of bytes), and neither does
| readability.
| cpeterso wrote:
| Andrew Waterman's PhD dissertation, "Design of the RISC-V
| Instruction Set Architecture", is quite accessible and starts
| with a similar analysis of other ISAs (including OpenRISC):
|
| https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-...
| cesarb wrote:
| > Nobody besides x86_64 seems to have dedicated stack registers
| anymore, they're defined only by convention.
|
| Aarch64 also has a dedicated SP register.
| renox wrote:
| I remember reading that in fact it was an important improvement
| from ARM32 to Aarch64 to have a dedicated stack register. Not
| surprisingly, RISC-V also has a dedicated stack register.
| zozbot234 wrote:
| > Not surprisingly, RISC-V also has a dedicated stack
| register.
|
| I don't think this is correct. It has a _suggested_ SP
| register, which merely gets some special support in the C
| subset. But that's just an optional compression scheme and
| not really part of the ISA design.
| renox wrote:
| Thanks for correcting me. I was mistaken by the assembly
| language.
| kragen wrote:
| RV32I, RV64I, and RV128I (the base RISC-V architecture) don't
| have a dedicated stack register in the instruction set, and
| I'm not even sure if your code will run faster on fancy
| implementations if you use x2 as the stack pointer in the
| standard way. However, the compressed instruction extension
| has "compressed" (16-bit-long) instructions that implicitly
| index from x2: C.LWSP, C.LDSP (on RV64C and theoretically
| RV128C), C.LQSP (RV128C only), C.FLWSP, and C.FLDSP; and
| corresponding store instructions. These instructions
| incorporate a 6-bit immediate offset field which is added to
| x2 to form the effective address.
|
| As far as I know, that's the full extent to which RISC-V has
| a dedicated stack register: it has a compressed instruction
| format that uses x2 as a base register, but not in the base
| ISA, just a standard extension. There's no dedicated PUSH or
| POP instruction, no dedicated instruction for storing the
| link register into the stack, no dedicated instructions for
| incrementing or decrementing x2 (you do that with ADDI, which
| can be compressed as C.ADDI as long as the stack frame size
| is less than 32 bytes, which means it has to be 16 bytes in
| the standard ABI), not even autoincrement and autodecrement
| addressing modes.
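|
| To illustrate the "6-bit immediate added to x2" point, here is a
| small decode sketch for C.LWSP. The bit positions are my reading of
| the RVC spec's CI format, so treat them as something to check
| against the manual rather than gospel:
|
|     #include <stdint.h>
|
|     /* C.LWSP is a CI-format compressed load: rd = mem[x2 + offset],
|        where the 6 immediate bits encode a zero-extended,
|        4-byte-scaled offset (0..252). */
|     static uint32_t clwsp_offset(uint16_t insn) {
|         uint32_t off = 0;
|         off |= ((insn >> 4) & 0x7u) << 2;   /* insn[6:4] -> off[4:2] */
|         off |= ((insn >> 12) & 0x1u) << 5;  /* insn[12]  -> off[5]   */
|         off |= ((insn >> 2) & 0x3u) << 6;   /* insn[3:2] -> off[7:6] */
|         return off;  /* effective address = x2 + off */
|     }
|
|     static unsigned clwsp_rd(uint16_t insn) {
|         return (insn >> 7) & 0x1fu;         /* destination register */
|     }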
| unwind wrote:
| Too lazy to look this up (or even figure out the relevant
| extension) but the obvious question to me would be how
| storing state for interrupts is handled?
| kragen wrote:
| They call the interrupt mechanism "traps", reserving
| "interrupt" for traps caused by asynchronous inputs, and
| (in recent versions of RISC-V) they're specified in the
| separate "The RISC-V Instruction Set Manual, Volume II:
| Privileged Architecture". I'm looking at version 1.12
| (20211203).
|
| Basically there are special registers for trap handling,
| which are CSRs: xscratch (a scratch register), xepc (the
| trapping program counter), xcause and xtval (which trap),
| and xip (interrupts pending). These come in four sets:
| x=s (supervisor-mode), x=m (machine-mode, with a couple
| of extras), x=h (hypervisor mode, which has some
| differences), and x=vs (virtual supervisor). You can't
| handle traps in U-mode, so in a RISC-V processor with
| trap handling and without multiple modes, you're always
| in M-mode. (See p.3, 17/155.)
|
| I haven't done this but I suppose that what you're
| supposed to do in a mode-X trap handler is start by
| saving some user register to xscratch, then load a useful
| pointer value into that user register off which you can
| index to save the remaining user registers to memory.
|
| I guess you know xscratch (and xepc, etc.) wasn't
| previously being used because you only use them during
| this very brief time and leave x-mode traps disabled
| until you finish using it. If all your traps are
| "vertical" (from a less-privileged mode like U-mode into
| a more-privileged mode like M-mode) you don't have to
| worry about this, because you'll never have another
| x-mode trap while running your x-mode trap handler.
|
| I should probably check out how FreeBSD and Linux handle
| system calls on RV64.
|
| dh` explained the following technique to me, as explained
| to him by jrtc27: upon entry to, say, an S-mode trap
| handler, you use CSRRW to swap the stack pointer in x2
| with the sscratch register, if it's null you swap back,
| then push all the registers on the stack, then you can do
| real work.
| unwind wrote:
| Very informative, thanks a lot!
| zerohp wrote:
| Correct. It's a common mistake to think of the stack pointer as
| just a convention for register 31 in A64. The 31 encoding has
| two uses. For some instructions it is the zero register XZR, and
| for others it is SP.
| gbin wrote:
| AFAIK the weak memory model is interesting when you start to
| scale to multicore. It is up to the programmer to synchronize
| accesses correctly, so when synchronization is not necessary all
| the units accessing the memory can operate more independently.
| [deleted]
| klodolph wrote:
| You can think of "fused multiply-add" as an instruction that does
| two things at once, but I think of FMA more as a single operation
| these days. There's a critical difference in floating-point,
| where the result of FMA can be rounded, so you're not actually
| getting the same result as separate multiply + add.
| Findecanor wrote:
| I suppose you meant to write that: a FMA instruction rounds
| only once at the end, whereas the separate Multiply and Add
| instructions each round once = two times in total.
|
| BTW. Some MIPS processors did have a FMA instruction that _did_
| round twice. The compiler was thus able to fuse instructions
| without the code giving different results. This was deprecated
| in later versions, however.
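|
| A tiny demonstration of the single rounding (assuming IEEE-754
| doubles and a libm/hardware whose fma() is actually fused rather
| than emulated with two roundings):
|
|     #include <math.h>
|     #include <stdio.h>
|
|     int main(void) {
|         double a = 1.0 + ldexp(1.0, -30);   /* 1 + 2^-30 */
|         /* a*a = 1 + 2^-29 + 2^-60 exactly; the separate multiply
|            rounds away the 2^-60 term before the subtraction. */
|         double separate = a * a - 1.0;
|         double fused    = fma(a, a, -1.0);  /* rounded only once */
|         printf("separate: %.20e\nfused:    %.20e\n", separate, fused);
|         return 0;
|     }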
| klodolph wrote:
| I thought I _did_ write that?
| vardump wrote:
| Yeah, FMA is really just one operation.
|
| Except when it's time to boast about FLOPS. Then it's _always_
| counted as two operations. :-)
___________________________________________________________________
(page generated 2022-10-22 23:02 UTC)