[HN Gopher] Addressing Criticism of RISC-V Microprocessors
___________________________________________________________________
Addressing Criticism of RISC-V Microprocessors
Author : nnx
Score : 163 points
Date : 2022-03-20 05:37 UTC (17 hours ago)
(HTM) web link (erik-engheim.medium.com)
(TXT) w3m dump (erik-engheim.medium.com)
| devit wrote:
| I think the biggest issue is the lack of arithmetic with overflow
| checking, especially with a variant that calls a location in a
| control register on overflow.
|
| This makes it very inefficient to compile languages that would
| like overflow checks on all arithmetic.
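|
| To make the cost concrete, here is a sketch in C of the check such
| a language needs around every add, using the GCC/Clang
| __builtin_add_overflow builtin (checked_add and the abort() call
| are placeholders, not any particular language's actual runtime):
|
|     #include <stdint.h>
|     #include <stdlib.h>
|
|     /* Checked signed addition: without ISA support the builtin
|        lowers to an add followed by one or more compare-and-branch
|        instructions on RISC-V. */
|     int64_t checked_add(int64_t a, int64_t b) {
|         int64_t sum;
|         if (__builtin_add_overflow(a, b, &sum))
|             abort();   /* placeholder for an overflow handler */
|         return sum;
|     }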
| audunw wrote:
| A comment elsewhere here pointed out that RISC-V can do it with
| two fused compressed instructions for the most common
| operations. So it seems like they made the right trade-off to me.
| FullyFunctional wrote:
| I'm heavily invested in RISC-V, both personally and
| professionally, and I think the story is much more complicated
| than this makes it out to be, but I'm not going to rehash the
| discussion yet again.
|
| However, I do want to point out that a real issue (especially
| with legacy code) is the scaled address calculation with 32-bit
| unsigned values. Thankfully the Zba extension adds a number of
| instructions that help a lot, but still would require fusion to
| get complete parity with Arm64.
|
| For
|
|     int update(int *base, unsigned index) { return base[index]++; }
|
| we get
|
|     update:
|         sh2add.uw a1,a1,a0
|         lw        a0,0(a1)
|         addiw     a5,a0,1
|         sw        a5,0(a1)
|         ret
|
| Zba is included in the next Unix profile and will _likely_ be
| adopted eventually by all serious implementations.
|
| EDIT: grammar and spacing
| abainbridge wrote:
| I'm guessing that your assembly code is RISC-V with the Zba
| extension. Is the non-Zba version worse than Arm64?
|
| Compiling your function with Godbolt, I get:
|
|     RISC-V (no Zba) Clang - 7 instructions - https://godbolt.org/z/7znnrzxKq
|     Arm64 Clang - 7 instructions - https://godbolt.org/z/Trv8scxad
|
| Annoyingly I can't see the code size for the Arm64 case because
| no output is generated if I tick the "Compile to binary" option
| in "Output". I have to use GCC instead:
|
|     RISC-V (no Zba) Clang - 20 bytes - https://godbolt.org/z/eWfPaorcj
|     Arm64 GCC - 24 bytes - https://godbolt.org/z/bzsPzov5h
| FullyFunctional wrote:
| EDIT: Hmm, I seem to have picked a bad example. Try this one:
|
|     int get(int *base, unsigned index) { return base[index]; }
|
| Arm64:
|
|     update:
|         ldr  w0, [x0, w1, uxtw 2]
|         ret
|
| RV64GC (vanilla):
|
|     update:
|         slli a5,a1,32
|         srli a1,a5,30
|         add  a0,a0,a1
|         lw   a0,0(a0)
|         ret
|
| RV64GC+Zba:
|
|     update:
|         sh2add.uw a0,a1,a0
|         lw        a0,0(a0)
|         ret
|
| Arm64 is able to do some indexed loads in a single
| instruction that might take two in RISC-V w/Zba (and up to 4+
| in regular RISC-V). However, calling that a win for Arm64 is
| not so clear as the more complicated addressing modes could
| become a critical timing path and/or require an extra
| pipeline stage. However, as a first approximation, for a
| superscalar dynamically scheduled implementation, fewer ops
| is better so I would say it's a slight win.
|
| I don't understand the obsession with bytes. 25% fewer bytes
| has only a very marginal impact on a high-performance
| implementation and the variable length encoding has some
| horrendous complications (which is probably why Arm64
| _dropped_ variable length instructions). Including compressed
| instructions in the Unix profile was the biggest mistake
| RISC-V made and I'll die on that hill.
|
| ADD: Don't forget that every 32-bit instruction is currently
| wasting the lower two bits to allow for compressed instructions,
| thus any gain from compression must be offset by the 6.25% tax
| that is forced upon it.
| damageboy wrote:
| Thank you for writing the obvious. Instruction Byte count
| is the wrong metric here 100%. Instruction Count (given
| reasonable decoding/timing constraints) is the thing to
| optimize for and indeed variable length encoding is very
| bad.
| knorker wrote:
| For those of us without the expertise, could you
| elaborate on why that is?
|
| On the one hand we have byte count, with its obvious
| effect on cache space used. But to those of us who don't
| know, why is instruction count so important?
|
| There's macro-op fusion, which admittedly would burn
| transistors that could be used for other things. Could
| you elaborate why it's not sufficient?
|
| And then the fact that modern x86 does the opposite to
| macro-op fusion, by actually splitting up CISC
| instructions into micro-ops. Why is it so bad if they
| were more micro-ops to start with, if Intel chooses to do
| this?
| tsmi wrote:
| Instruction byte count matters quite a lot when you're
| buying ROM in volume. And today, the main commercial
| battleground for RISCV is in the microcontroller space
| where people care about these things.
| avianes wrote:
| Why use an unsigned? It is obvious here that RISC-V
| without Zba takes 4 instructions because it manages special
| cases related to unsigned.
|
| If you use a simple int for the index:
|
|     slli a1,a1,2
|     add  a0,a0,a1
|     lw   a0,0(a0)
|
| And isolating this code in a small function puts
| constraints on register allocation, but if we remove this
| constraint then we can write:
|
|     slli a1,a1,2
|     add  a1,a1,a0
|     lw   a1,0(a1)
|
| which is very suitable for macro-op fusion and the C extension.
|
| > Including compressed instruction in the Unix profile was
| the biggest mistake RISC-V did and I'll die on that hill.
|
| This is so wrong. The C extension is one of the great
| strengths of RISC-V: it is easy to decode, very suitable
| for macro-op fusion, and it gives a huge boost in code
| density.
| JonChesterfield wrote:
| Iirc compressed instructions are the thing that costs 2
| bits per 32 and was criticised as overfitted to naive
| compiler output. Am I thinking of something else?
| avianes wrote:
| Yes, but RISC-V still has a lot of encoding-space free
| and the benefit of C extension is huge. It's a trade-off.
|
| I don't think RISC-V is perfect or universal, but on this
| point they do a pretty good job compared to other ISAs
| klelatti wrote:
| You say that the benefit is 'huge' but why should I care
| about code density on a modern CPU with gigabytes of
| memory and large caches?
|
| From a performance perspective what is the evidence that
| this actually provides an advantage?
| zozbot234 wrote:
| Because the _fastest_ cache levels are tiny, even in the
| largest and most advanced CPUs. There's plenty of
| evidence for the performance benefits of improved density
| and terseness in both code and data.
| klelatti wrote:
| The M1 has a 192k instruction cache for performance cores
| which is not 'tiny'.
|
| If there is lots of evidence for the performance benefits
| of improved density vs the alternative of fixed
| instruction width in real world CPUs then I'm sure you'll
| be able to cite it.
| avianes wrote:
| To clarify, I'm not saying that the code density RISC-V
| gets from the C extension is a big advantage over its
| competitors, but it is a huge benefit for RISC-V.
|
| You are right, code density is perhaps not that critical
| today.
|
| And it is difficult to quantify its relevance as code
| density is always related to other variables such as
| instruction expressiveness, numbers of uop emitted, etc.
|
| But I still think that code density is important for
| RISC-V, because the RISC-V philosophy for reaching high
| performance is to use very simple instructions that can
| be combined together and take advantage of macro-op
| fusion. I think RISC-V without macro-op fusion can't
| reach the performance of other ISAs.
|
| But RISC-V, with all these simple and not very expressive
| instructions and without the C extension, has a pretty bad
| code density, which could cost a lot because it is not at
| the competitors' level.
|
| So if we think of RISC-V as a macro-op fusion oriented
| ISA, then the C extension becomes important to be
| competitive.
|
| I don't know what is better between a "macro-fusion"
| oriented arch or a "complex-instruction" oriented arch,
| the future will tell us.
| klelatti wrote:
| Thanks for clarifying - interesting to see a different
| philosophy being tried.
| JonChesterfield wrote:
| 32 bit, 32 registers, three register code. So add r0 r1
| r2 spends fifteen bits on identifying which register to
| use then another two on the compressed ISA. That's half
| the encoding space gone before identifying the op. Never
| thought I'd want fewer registers but here we are.
|
| If the compressed extension is great in practice it might
| be a win. If the early criticism of overfit to gcc -O0
| proves sound and in practice compilers don't emit it then
| it was an expensive experiment.
| avianes wrote:
| The encoding space is not the number of bits used to
| encode an instruction; the encoding space used is the ratio of
| values that encode a valid instruction to the total
| number of values that could be encoded.
|
| As an example, take an ISA with an 8-bit fixed instruction
| length and 8 registers (register index encoded on 3 bits):
|
| If the top 2 bits are the opcode and we define 2
| instructions (eg. AND, XOR) that manipulate 2 registers
| (3 bits + 3 bits), then instruction word values
| 0b00_000_000 to 0b01_111_111 encode these two
| instructions (ignore the "_", they are just separators).
|
| Therefore, instruction word values from 0b10_000_000 to
| 0b11_111_111 remain free, which represents half of the
| encoding space. So half of the encoding space remains
| free.
|
| This means we still have room to put new instructions.
|
| Similarly, RISC-V valid instructions use almost all the
| available bits, but there is still room in the encoding
| space because some opcodes remain free.
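|
| A tiny sketch of that toy encoding, just to make the counting
| concrete (the program itself is made up for illustration; AND
| and XOR are the two instructions from the example above):
|
|     #include <stdio.h>
|
|     /* Toy 8-bit ISA: top 2 bits = opcode, next 3 bits = rd,
|        low 3 bits = rs. Opcodes 0b00 and 0b01 are AND and XOR;
|        opcodes 0b10 and 0b11 are unassigned. */
|     int main(void) {
|         int used = 0;
|         for (int word = 0; word < 256; word++) {
|             int opcode = word >> 6;          /* top 2 bits */
|             if (opcode == 0 || opcode == 1)  /* AND or XOR */
|                 used++;
|         }
|         printf("%d of 256 encodings used, %d%% of the space free\n",
|                used, (256 - used) * 100 / 256);
|         return 0;
|     }
|
| It reports 128 of 256 encodings used, i.e. half the encoding
| space is still free, matching the reasoning above.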
| FullyFunctional wrote:
| Indeed why use unsigned? Go take a look at a lot of C
| code (hint, look at SPEC benchmarks). They do that.
| Decades of pointer == int == unsigned have led to a lot
| of horrific code. But we still compile it.
|
| The sin of the original RISC-V was spending too much
| time looking at RV32 and not realizing how big a problem
| this is in practice. Zba (slipped in as it's not really
| "bit manipulation") fixes the worst of this.
|
| ADD: The problem in this HN thread is the same reason we
| got compressed instructions in the first place. The vast majority
| of people aren't doing high performance wide
| implementations so the true cost isn't widely appreciated.
| The people holding the decision power certainly didn't
| understand it. I really think you have to live it to
| understand it.
| avianes wrote:
| > Go take a look at a lot of C code (hint, look at SPEC
| benchmarks). They do that.
|
| This does not really justify isolating that snippet if
| you admit yourself it's a bad one.
|
| > The sins of the original RISC-V was spending too much
| time looking at RV32 and not realizing how big a problem
| this is in practice.
|
| But indeed, my previous message shows that even without
| Zba the problem is erased by a good register allocation
| and macro-op fusion.
|
| I think you are trying too hard to find special cases
| that "trick" RISC-V, you didn't even pay attention to the
| use of unsigned which is non-optimal (unsigned has an
| undesirable overflow semantic here).
| saagarjha wrote:
| > 25% fewer bytes has only very marginally impact on a
| high-performance implementation
|
| Instruction cache doesn't come for free, and is usually
| pretty small on most shipping processors. It's not a big
| deal for smaller benchmarks, but in real-world programs
| this can become a problem.
| snek_case wrote:
| Not only that, but CPUs have a maximum number of
| instructions they can dispatch per cycle (typically 4 or
| 6). Even in microbenchmarks, the difference there could
| show up.
| marcosdumay wrote:
| The bottleneck on that is the data interdependency of your
| algorithm. If you break it into 6 or 10 instructions, the
| data dependency stays the same.
|
| (Of course, you can add unnecessary dependencies with a
| badly designed ISA. But it's not a necessary condition.)
| snek_case wrote:
| There's still a limit to how many instructions you can
| decode and dispatch every cycle, even with zero
| dependencies. There's also definitely dependencies in the
| example where you're computing a memory address to access
| a value.
| FullyFunctional wrote:
| I am obviously aware and I'm here to tell you that the
| overhead of variable length instructions matters more.
| Arm agrees. M1 has a 192 KiB I$ btw.
|
| ADD: had RISC-V just disallowed instructions from
| spanning cache lines and disallowed jumping into the
| middle of instructions, then almost all of the issues
| would have gone away. Sigh.
| saagarjha wrote:
| I actually had Apple's chips in mind when talking about
| "most shipping processors" because they have historically
| invested heavily in their caches and reaped benefits from
| it. But not all the world's an M1, and also I'll have you
| know that Apple themselves cares _very much_ about their
| code size, even with their large caches. Don't go
| wasting it for no reason!
|
| (I should also note that I am pretty on board with you
| with regards to variable-length instructions, this is
| just independent of that.)
| avianes wrote:
| Variable instruction sizes have a cost, but with only 2
| instruction sizes like current RISC-V that cost remains
| very low as long as we don't have to decode a very large
| number of instructions each cycle, and it gives a huge
| code density advantage.
| FullyFunctional wrote:
| The biggest issue is one instruction spanning two cache
| lines, and even two pages. This means a bunch of tricky
| cases that are a source of bugs and overheads.
|
| It also means you cannot tell instruction boundaries
| until you directly fetch instructions, so you cannot do
| any predecode in the cache that would help you figure out
| dependencies, branch targets, etc. These things matter
| when you are trying to fetch 8+ instructions per cycle.
| avianes wrote:
| > This biggest issue is one instruction spanning two
| cache lines
|
| Even with fixed (32 bit) instruction lengths aligned on
| 32 bit, when we have to decode a group of 8 instructions
| you are facing this kind of issue.
|
| So you either have to cut the instruction group (and thus
| not take full advantage of the 8 way decoder) or you have
| to implement a more complex prefetch with a longer
| pipeline. And these special cases can be handled in these
| pipeline stages.
|
| > It also means you cannot tell instruction boundaries
| until you directly fetch instructions
|
| I mean, AMD does that on x86, with 14 instruction
| lengths.
|
| It can be done for RISC-V, it's much cheaper than x86,
| and it takes significantly less surface area than a
| bigger cache to compensate.
| mst wrote:
| Given the compression stuff is an extension (and so far
| as I can tell the 16-bit alignment for 32-bit
| instructions that can result in that sort of spanning is
| part of that extension), so far as I can tell you could
| implement said extension for tiny hardware where every
| byte counts, and then for hardware where you're wanting
| to fetch 8+ instructions per cycle just ... not implement
| it?
|
| Wait (he says to himself, realising he's an idiot
| immediately -before- posting the comment for once). You
| said upthread the C extension is specified as part of the
| standard UNIX profile, so I guess people are effectively
| required to implement it currently?
|
| If that was changed, would that be sufficient to dissolve
| the issues for people wanting to design high performance
| implementations, or are there other problems inherent to
| the extension having been specified at all? (apologies
| for the 101 level questions, the only processor I really
| understood was the ARM2 so my curiosity vastly exceeds my
| knowledge here)
| KerrAvon wrote:
| Have the ARM AArch64 designers ever commented on this?
| They intentionally left out any kind of compressed
| instructions, and certainly Apple at least cares a lot
| about code size.
| [deleted]
| klelatti wrote:
| Try this at 34:30 - from Arm's architecture lead Richard
| Grisenthwaite. Earlier he says that several leading micro
| architects think that mixing 16 bit and 32 bit
| instructions (Thumb2) was the worst thing that Arm ever
| did.
|
| https://m.soundcloud.com/university-of-
| cambridge/a-history-o...
| albanread wrote:
| People should zoom right out and think about the whole RISC-V
| project. When our phones have billions of transistors, are we
| seriously supposed to believe that RISC philosophy still matters?
| Personally I greatly prefer the user programmable 68000 family of
| processors. The marketing of RISC-V is perhaps the most
| impressive thing about it. Each to their own, I can see why giant
| SSD manufacturers want to use a license free design and share the
| cost of compiler development. Is there really anything else?
| kortex wrote:
| Yeah, absolutely. Personally, when I zoom out, and look at the
| trends of engineering in general: simpler modular systems that
| compose well together vs bespoke solutions, RISC-V precisely
| follows the trend. Reduce global state. Make it easier (for
| humans and algos) to reason about control flow. Have a simple
| core with optional extensions. This all makes building multi-
| core solutions way simpler. We are fast running out of
| transistor density improvements. But we are getting way better
| at building coprocessors. There's clear value in "doing more
| simple things in parallel".
| monocasa wrote:
| > When our phones have billions of transistors, are we
| seriously supposed to believe that RISC philosophy still
| matters?
|
| The point isn't just saving gates because it's cheaper. Fewer
| gates mean a shorter critical path, meaning less power
| consumption and/or higher overall performance when compared
| apples to apples.
| throwaway81523 wrote:
| Fairly lame article (not wrong, but stuff people following the
| topic have seen before), and I'd still like to hear about integer
| overflow detection. If the floating point extension is able to do
| IEEE 754 condition codes including overflow detection, why can't
| the integer unit do something similar?
| FullyFunctional wrote:
| This comes up a lot and I'm sympathetic to your plea, really (I
| enjoy fantasizing about a different reality where CPUs weren't
| just "machines to run C programs"), but in computer
| architecture, what really matters for one application or a
| class of applications might not be important when viewed across
| millions of programs.
|
| The fact is that integer operations and floating point are two
| completely different beasts, so much so that we have different
| benchmark suites for each.
|
| Integer operations are critically latency sensitive and tacking
| on extra semantics doesn't come for free, and for _most_ code
| this would be a tax. The "overflow bit" represents an implicit
| result that would have to be threaded around (I'm assuming that
| you aren't asking for exceptions which literally nobody wants).
| For FP we do that, but the cost and latency of FP ops is
| already high so it doesn't hurt quite as much.
|
| The RISC-V spec [1] (which I assume you have seen) already
| discusses all these trade offs:
|
| "We did not include special instruction-set support for
| overflow checks on integer arithmetic operations in the base
| instruction set, as many overflow checks can be cheaply
| implemented using RISC-V branches. Overflow checking for
| unsigned addition requires only a single additional branch
| instruction after the addition:
|
|     add  t0, t1, t2
|     bltu t0, t1, overflow
|
| For signed addition, if one operand's sign is known, overflow
| checking requires only a single branch after the addition:
|
|     addi t0, t1, +imm
|     blt  t0, t1, overflow
|
| This covers the common case of addition with an immediate
| operand. For general signed addition, three additional
| instructions after the addition are required, leveraging the
| observation that the sum should be less than one of the
| operands if and only if the other operand is negative.
|
|     add  t0, t1, t2
|     slti t3, t2, 0
|     slt  t4, t0, t1
|     bne  t3, t4, overflow
|
| In RV64I, checks of 32-bit signed additions can be optimized
| further by comparing the results of ADD and ADDW on the
| operands."
|
| I do think that it might have been worth adding a single
| instruction version for the last one (excluding the branch),
| but I'm not aware of it getting accepted.
|
| [1] https://github.com/riscv/riscv-isa-manual
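|
| For readers who prefer C, here is my paraphrase of the two checks
| quoted above (the helper names are made up; a compiler would emit
| the branch sequences directly rather than call anything):
|
|     #include <stdint.h>
|     #include <stdbool.h>
|
|     /* Unsigned add: it overflowed iff the sum wrapped around
|        below an operand; this is what "bltu t0, t1, overflow"
|        tests. */
|     static bool uadd_overflowed(uint64_t t1, uint64_t t2,
|                                 uint64_t *sum) {
|         *sum = t1 + t2;
|         return *sum < t1;
|     }
|
|     /* General signed add: without overflow, sum < t1 holds
|        exactly when t2 is negative; a mismatch means the add
|        overflowed (the slti/slt/bne sequence). The add itself is
|        done on unsigned values to sidestep C's undefined signed
|        overflow. */
|     static bool sadd_overflowed(int64_t t1, int64_t t2,
|                                 int64_t *sum) {
|         *sum = (int64_t)((uint64_t)t1 + (uint64_t)t2);
|         return (t2 < 0) != (*sum < t1);
|     }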
| throwaway81523 wrote:
| Yes I've seen that reasoning: they propose bloating 1 integer
| instruction into 4 instructions in the usual case where the
| operands are unknown. Ouch. In reality they expect programs
| to normally run without checking like they did in the 1980s.
| So this is more fuel for the criticism that RiscV is a 1980s
| design with new paint. Do GCC and Clang currently support
| -ftrapv for RiscV, and what happens to the code size and
| speed when it is enabled? Yes, IEEE FP uses sticky overflow
| bits and the idea is that integer operations could do the
| same thing. Integer overflow is one of those things like null
| pointer dereferences, which originally went unchecked but now
| really should always be checked. (C itself is also deficient
| in not having checkable unsigned int types).
| modeless wrote:
| > I'm assuming that you aren't asking for exceptions which
| literally nobody wants
|
| I want exceptions. Why would they be a bad idea? Besides the
| fact that software doesn't utilize them today (because
| they're not implemented, chicken and egg problem)? IMO they
| would be as big a security win as many other complex features
| CPU designers are adding in the name of security, e.g.
| pointer authentication.
| adrian_b wrote:
| Generating the overflow bit and storing it adds a completely
| negligible cost to a 64-bit adder, so touting this as a cost
| saving measure is just a lie, even if indeed this claim has
| always been present in the RISC-V documentation.
|
| Most real cases of overflow checking are of the last type.
| Tripling the number of instructions over a bad ISA that lacks
| overflow exceptions, like unfortunately almost all currently
| popular ISAs are, or quadrupling the number of instructions
| over a traditional ISA with overflow exceptions is a totally
| unacceptable cost.
|
| The claim that providing overflow exceptions for integer
| addition might be too expensive can be easily countered by
| the fact that generating exceptions on each instruction is
| not the only way to guarantee that overflows do not happen.
|
| It is enough to store 2 overflow flags, 1 flag with the
| result of the last operation and 1 sticky flag that is set by
| any overflow and is reset only by a special instruction.
| Having the sticky flag allows zero-overhead overflow checking
| for most arithmetic instructions, because it can be tested
| only once after many operations, e.g. at a function exit.
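|
| A software model of that usage pattern (sticky_overflow,
| add_tracked and sum3 are hypothetical names, just to show how a
| single test at the end covers a whole sequence of adds):
|
|     #include <stdbool.h>
|     #include <stdint.h>
|     #include <stdlib.h>
|
|     /* Every add records whether it overflowed into a flag that
|        only an explicit clear resets. */
|     static bool sticky_overflow = false;
|
|     static int64_t add_tracked(int64_t a, int64_t b) {
|         int64_t sum;
|         if (__builtin_add_overflow(a, b, &sum))
|             sticky_overflow = true;   /* set, never cleared here */
|         return sum;
|     }
|
|     int64_t sum3(int64_t a, int64_t b, int64_t c) {
|         sticky_overflow = false;      /* the "clear" instruction */
|         int64_t s = add_tracked(add_tracked(a, b), c);
|         if (sticky_overflow)          /* one test at function exit */
|             abort();
|         return s;
|     }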
|
| The cost of implementing the 2 overflow bits is absolutely
| negligible, 2 gates and 2 flip-flops. Much more extra
| hardware is needed for decoding a few additional instructions
| for flag testing and clearing, but even that is a negligible
| cost compared with a typical complete RISC-V implementation.
|
| Not providing such a means of reliable and cheap overflow
| detection is just stupid and it is an example of hardware
| design disconnected from the software design for the same
| device.
|
| The early RISC theory was to select the features that need to
| be implemented in hardware by carefully examining the code
| generated by compilers for representative useful programs.
|
| The choices made for the RISC-V ISA, e.g. the omission of
| both the most frequently required addressing modes and of
| overflow checking, prove that the ISA designers either have
| never applied the RISC methodology, or they have studied only
| examples of toy programs, which are allowed to provide
| erroneous results.
| zozbot234 wrote:
| The typical overhead of overflow checking in compiled
| languages (which, as a reminder, is in the low single-digit
| %'s _at most_ ) has nothing to do with the lack of
| hardware-specific extensions. It's a consistent pattern of
| missing optimization opportunities, because the compiler
| now needs to preserve the exact state of intermediate
| results after some operation fails with an overflow. Adding
| these new opcodes to your preferred ISA would barely change
| anything. (If they help at all it's in executing highly
| dynamic languages as opposed to compiled ones, which makes
| them a natural target for the in-progress 'J' extension.)
| ajb wrote:
| The extra expense is not the generation of the overflow
| bit, but the infrastructure needed to support a flags
| register, or for every instruction to be able to generate
| an exception.
|
| On a simple processor like a microcontroller this doesn't
| cost much, but it's severely hampers a superscalar or out
| of order processor, as it can't work out very easily which
| instructions can be run in parallel or out of order.
|
| The clean solution from a micro architectural point of view
| would be to have an overflow bit (or whatever flags you
| wanted) in every integer register. But that's an expense
| most don't want to pay.
| ansible wrote:
| > _The clean solution from a micro architectural point of
| view would be to have an overflow bit (or whatever flags
| you wanted) in every integer register._
|
| That's what the Mill CPU does. Each "register" also had
| the other usual flags, and even some new ones like Not a
| Result, which helps with vector operations and access
| protection.
| adrian_b wrote:
| One must not forget that on any non-toy CPU, any
| instruction may generate exceptions, e.g. invalid opcode
| exceptions or breakpoint exceptions.
|
| In every 4-5 instructions, one is a load or store, which
| may generate a multitude of exceptions.
|
| Allowing exceptions does not slow down a CPU. However
| they create the problem that a CPU must be able to
| restore the state previous to the exception, so the
| instruction results must not be committed to permanent
| storage before it becomes certain that they could not
| have generated an exception.
|
| Allowing overflow exceptions on all integer arithmetic
| instructions, would increase the number of instructions
| that cannot be committed yet at any given time.
|
| This would increase the size of various internal queues,
| so it would increase indeed the cost of a CPU.
|
| That is why I have explained that overflow exceptions can
| be avoided while still having zero-overhead overflow
| checking, by using sticky overflow flags.
|
| On a microcontroller with a target price under 50 cents,
| which may lack a floating-point unit, the infrastructure
| to support a flags register may be missing, so it may be
| argued that it is an additional cost, even if the truth
| is that the cost is negligible. Such an infrastructure
| existed in 8-bit CPUs with much less than 10 thousand
| transistors, so arguing that it is too expensive in
| 32-bit or 64-bit CPUs is BS.
|
| On the other hand, any CPU that includes the floating-
| point unit must have a status register for the FPU and
| means of testing and setting its flags, so that
| infrastructure already exists.
|
| It is enough to allocate some of the unused bits of the
| FPU status register to the integer overflow flags.
|
| So, no, there are absolutely no valid arguments that may
| justify the failure to provide means for overflow
| checking.
|
| I have no idea why they happened to make this choice, but
| the reasons are not those stated publicly. All this talk
| about "costs" is BS made up to justify an already taken
| decision.
|
| For a didactic CPU, as RISC-V was actually designed,
| lacking support for overflow checking or for indexed
| addressing is completely irrelevant. RISC-V is a perfect
| target for student implementation projects.
|
| The problem appears only when an ISA like RISC-V is taken
| outside its right domain of application and forced into
| industrial or general-purpose applications by managers
| who have no idea about its real advantages and
| disadvantages. After that, the design engineers must
| spend extra efforts into workarounds for the ISA
| shortcomings.
|
| Moreover, the claim that overflow checking may have any
| influence upon the parallel execution of instructions is
| incorrect.
|
| For a sticky overflow bit, the order in which it is
| updated by instructions does not matter. For an overflow
| bit that shows the last operation, the bit updates must
| be reordered, but that is also true for absolutely all
| the registers in a CPU. Even if 4 previous instructions
| that were executed in parallel had the same destination
| register, you must ensure that the result stored in the
| register is the result corresponding to the last
| instruction in program order. One more bit along hundreds
| of other bits does not matter.
| throwaway81523 wrote:
| > After that, the design engineers must spend extra
| efforts into workarounds for the ISA shortcomings.
|
| That is too optimistic. Programs will keep running
| unchecked and we'll keep getting CVE's from overflow
| bugs.
| ansible wrote:
| > _The cost of implementing the 2 overflow bits is
| absolutely negligible, 2 gates and 2 flip-flops. Much more
| extra hardware is needed for decoding a few additional
| instructions for flag testing and clearing, but even that
| is a negligible cost compared with a typical complete
| RISC-V implementation._
|
| That's understating things considerably.
|
| ARMv8-A has PSTATE, which includes the overflow bit. This
| explicit state must be saved / restored upon any context
| switch.
|
| And there isn't just a single PSTATE for an OOO
| SuperScalar, there are several.
|
| Everything has a cost.
| feanaro wrote:
| > add t0, t1, t2 bltu t0, t1
|
| How does this work? Isn't `bltu` simply a branch that is
| taken if `t0 < t1`? How does that detect addition overflow?
|
| EDIT: Ah, because the operands are `t1` and `t2`. `t0` is the
| result. Quack.
| [deleted]
| KSPAtlas wrote:
| I am personally a fan of RISC-V, and I have written low level
| code for it before.
| VariableStar wrote:
| It is amusing and sobering to get a glimpse of some of the
| complexities going on inside a processor and how design
| philosophies may affect them. Those are things the user or even
| your normal programmer seldom thinks about.
| mhh__ wrote:
| > Easily out-perform ARM in code density
|
| > No data [that I can see at least]
| brucehoult wrote:
| I am _so_ bored with people criticising RISC-V based on tiny code
| snippets of things that basically never happen in real code.
|
| A function that does nothing but return an array element from an
| array base address and index passed to it? Really? Do you
| actually write junk like that? And if you write it does your
| compiler really not inline it? Why? Do you like big slow code?
| Once it's inlined, it's probably in a loop, and strength-reduced.
|
| It's very easy to verify that in the real world, RISC-V code is
| more compact than amd64 and arm64. Just download the same version
| of Ubuntu or Fedora (etc) for each one and run the "size" command
| on the binaries. The RISC-V ones are consistently significantly
| smaller.
|
| You can also, with quite a bit more work, count the number of
| uops each ISA executes. RISC-V executes slightly more
| instructions, but they are each simple and don't need expanding.
| Lots of x86 instructions get expanded into multiple uops and many
| 64 bit ARM instructions do too. In the end the number of uops
| executed by each is very similar.
|
| Trying to judge the goodness of a modern ISA by looking at two or
| three instruction snippets is as silly as using Dhrystone as your
| only benchmark program.
| sylware wrote:
| RISC-V is technically not bad enough to select arm64 or x86_64
| over it, since those have beyond toxic IP tied to them.
|
| From what I read in the comments, I don't expect compressed
| instructions to be there on future high-performance
| desktop/server RISC-V CPU cores.
| panick21_ wrote:
| People over-argue these minimal differences. Let's be honest:
| never in the history of ISAs has slightly better code size been
| the primary reason for the success or failure of an instruction
| set.
|
| Even if by <insert objective measurement> RISC-V is 10% worse
| than ARM, it wouldn't actually matter that much for adoption.
|
| Adoption happens for business reasons and what is differentiating
| RISC-V far more than anything else is the change in license,
| governance and ecosystem.
|
| RISC-V being better at hitting different verticals optimally
| because of its modularity is likely another thing that matters
| more overall than how perfectly it fits each vertical.
| wmf wrote:
| Agreed, but I think the purpose of these kinds of criticisms is
| to "fix" RISC-V before it becomes yet another worse-is-better
| design locked in for 50 years.
| zozbot234 wrote:
| The only part of RISC-V that is "locked in" to any extent is
| the minimal set of basic integer instructions. Everything
| else is defined as part of standardized extensions, and can
| be superseded simply by defining new custom extensions.
| Actually even the minimal instruction set admits of some
| variation, such as the 'E' architectures that dispense with
| registers R16 to R31, thus saving area in the smallest
| implementations and potentially freeing up some bits in the
| encoding.
| wmanley wrote:
| Things get locked in not by standards, but by usage. If
| your software depends on particular instructions being
| present you're not going to buy a processor that has
| superseded those instructions, even if the new instructions
| conform to a theoretically cleaner design.
|
| Everything being an extension (and thus removable) is a
| strength in some specific circumstances, but is a weakness
| in most.
| socialdemocrat wrote:
| I think you are really missing the point here. Of course
| RISC-V has negatives but most of those negatives exist for
| good reasons. It is a question of tradeoffs.
|
| One of the most important goals of RISC-V is to make an
| architecture which can stand the test of time. In this space
| adding the wrong kind of instructions is a bigger problem
| than not adding particular instructions.
|
| Whether you look at x86, HTML or just about anything the
| problem is nearly always about having to support old junk
| which no longer makes sense to support, or lacking the
| ability to grow. Remember 640K is enough for everyone? RISC-V
| has a lot of room to grow.
|
| If you want an architecture for the future you would want a
| minimalist one with room to grow a lot. By keeping the
| instruction count very low and building in a system for
| extensions they have made a future proof ISA. Okay we cannot
| know the future, but it is more likely to survive for decades
| than something like x86 or maybe even ARM.
| wmf wrote:
| Most of the complaints about RISC-V are extremely basic
| things like array indexing and conditional execution. These
| will never not be needed.
| tsmi wrote:
| I'm sure that's what the team that invented segment
| registers said too.
|
| The question is does it make sense to add these to the
| ISA long term? In the short term, given die density and
| how memory works today, it has advantages. But die
| density increases, making OoO cores cheaper, and memory
| technology changes. It's not obvious that these are long
| term improvements.
| dgreensp wrote:
| IANAE, but the article addresses why the arguments that
| assume these instructions need to be combined are usually
| not based on looking at the whole picture.
| tsmi wrote:
| People argue over these minimal differences for good reasons.
|
| If <insert objective measurement> = binary size, and I'm buying
| ROM in volume to hold that binary, +10% ROM address space can
| easily cost more than the ARM license.
|
| That can matter quite a lot for adoption. Especially in the
| short term.
|
| Obviously, priorities differ and change as a function of time
| but as the saying goes, the only thing worse than making a
| decision with benchmarks is making a decision without
| benchmarks.
| dontlaugh wrote:
| It's nice to have an open ISA, don't get me wrong.
|
| However, trade offs matter. Compressing instructions may improve
| density, but it makes them variable length. This is a big barrier
| to decoding in parallel, which is very important to high
| performance cores.
| brucehoult wrote:
| This is really not a big deal with RISC-V's 2 instruction
| lengths and the encoding they use.
|
| If you are decoding 32 bytes of code (256 bits, somewhere between
| 8 and 16 instructions), you can figure out where all the actual
| instructions start (yes, even the 16th instruction) with 2
| layers of LUT6.
|
| You can then use those outputs to mux two possible starting
| positions for 8 decoders that do 16 or 32 bit instructions,
| plus 8 decoders that will only ever do 16 bit instructions from
| fixed start positions (and might output a NOP or in some other
| way indicate they don't have an input).
|
| OR you can use those outputs to mux the _outputs_ of 8
| decoders that only do 32 bit instructions and 8 decoders that
| do 16 or 32 (all with fixed starting positions), plus again 8
| decoders that only do 16 bit instructions from fixed start
| positions (possibly not used / NOP).
|
| The first option uses less hardware but has higher latency.
|
| That, again, is for decoding between 8 and 16 instructions per
| cycle, with an average on real code of close to 12.
|
| That is more than is actually useful on normally branchy code.
|
| In short: not a problem. Unlike x86 decoding.
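|
| To make the boundary-finding step concrete, here is a rough
| software sketch of the length rule (not the parallel LUT hardware
| described above; find_starts is a made-up name, and it assumes
| only 16- and 32-bit encodings are in use):
|
|     #include <stdint.h>
|     #include <stddef.h>
|
|     /* Scan a 32-byte fetch block, viewed as 16-bit parcels. A
|        parcel whose low two bits are 0b11 starts a 32-bit
|        instruction; anything else starts a 16-bit one. The last
|        instruction may spill into the next block; hardware has to
|        deal with that, this sketch just stops. */
|     size_t find_starts(const uint16_t parcels[16],
|                        uint8_t starts[16]) {
|         size_t n = 0, i = 0;
|         while (i < 16) {
|             starts[n++] = (uint8_t)i;
|             i += ((parcels[i] & 0x3) == 0x3) ? 2 : 1;
|         }
|         return n;
|     }
|
| The hardware does the same classification for all 16 parcels at
| once and resolves the chain of start positions with the LUT
| layers, rather than walking sequentially like this.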
| audunw wrote:
| Regular base instructions are always 32-bit, compressed are
| always 16-bit, and they're always aligned. I don't think
| there's a problem decoding them in parallel. You always know
| where the opcodes will be located in a 32-bit word - or set of
| 32-bit words - you're trying to decode.
|
| What I've been wondering is how difficult it is to fuse
| instructions when the compressed instructions you're trying to
| fuse isn't aligned to a 32-bit word.
| avianes wrote:
| You are right but RISC-V variable instruction size is indeed a
| good trade-off.
|
| Unlike x86, where instructions can range from 1 up to 15 bytes,
| the current RISC-V ISA only has 2 instruction sizes.
|
| Today x86 decoding is limiting because we want to decode more
| than ~4 instructions each cycle; for RISC-V to cause the same
| decoding difficulty it would probably be required to decode
| more than ~20 instructions each cycle.
| dontlaugh wrote:
| I don't know that it's a good tradeoff. ARM64 has fixed
| length instructions with decent density and has proven to
| allow highly parallel decoding.
| IshKebab wrote:
| Dubious. How is "you have to use this magic combination of
| instructions that compress & execute well" better than having a
| dedicated instruction?
|
| Also no mention of the binary compatibility issues - which
| `-march` do you compile your code for? On x86 you have a choice
| of 3. For RISC-V as far as I can tell there are 96 valid targets.
| mhh__ wrote:
| No. Firstly -march (or similar, e.g. -mcpu in LLVM land) should
| target a chip not individual instruction sets.
|
| Secondly, AVX-512 alone has a handful of different extensions.
| There are a bunch of different SSE variants, and similarly
| instructions are still being added to the VEX prefix (normal
| AVX).
|
| There is more potential for getting it wrong with riscv but
| 64 bit implies a number of extensions too so it's not too far off
| what amd64 originally meant for X86 (e.g. implies SSE2)
| IshKebab wrote:
| What do you mean "no"? My comment was entirely factual.
|
| > Firstly -march (or similar, e.g. -mcpu in LLVM land) should
| target a chip not individual instruction sets.
|
| LLVM still uses -march. And no you shouldn't target a
| specific chip unless you know your code will only run on that
| chip. That's the whole point I'm making. Sometimes you do
| know that (in embedded situations) but _often you don't_.
| Desktop apps aren't compiled for specific chips.
|
| > Secondly, AVX-512 alone has a handful of different
| extensions.
|
| Yes but these are generally linear - if an x86 chip supports
| extension N it will support extension N-1 too. Not true for
| RISC-V.
| monocasa wrote:
| > Yes but these are generally linear - if an x86 chip
| supports extension N it will support extension N-1 too. Not
| true for RISC-V.
|
| Not if you include AMD and Intel cores in that.
| IshKebab wrote:
| Why do you say that? Look here:
|
| https://clang.llvm.org/docs/UsersManual.html#x86
| monocasa wrote:
| That list isn't really an accurate picture of the world,
| but a vague attempt to make sense of the madness.
|
| There's plenty of cores that don't follow that versioning
| scheme, and it's not an Intel or AMD construct.
| [deleted]
| mhh__ wrote:
| The LLVM tools (like llc) use -mcpu. Clang mimics GCC. My
| point about the specific chip is that you have to know it
| anyway if you're planning on targeting a combination of
| extensions so you might as well use it.
|
| As for linearity, the "generally" bit will apply to RISC-V
| by the time we have real desktop class chips using the ISA.
| We still can't assume AVX support for most programs, I
| don't view this as any different to RISC-V extensions. Just
| this ~year Intel added VEX-coded AI NN acceleration
| instructions, I assume RISC-V has similar plans.
| IshKebab wrote:
| LLVM uses -march and -mcpu. It seems to be a bit of a
| mess which one you should use and also depend on the
| architecture.
|
| Time will tell if there's a de facto minimum set of
| extensions for desktop RISC-V. Let's hope so, but it
| isn't guaranteed.
| socialdemocrat wrote:
| Because dedicated instructions suck up valuable encoding space,
| and the more instructions you have, the more instructions you
| have which potentially become obsolete with new advances in
| microarchitecture.
|
| Not to mention that by sticking with simple single purpose
| instructions you make the CPU easier to teach to students. That
| is after all one of the goals of RISC-V in addition to creating
| a good ISA for industry.
|
| Have we learned nothing about why we abandoned CISC in the
| first place? Those CPUs got riddled with instructions that
| never got used much.
| google234123 wrote:
| With every node shrinkage those legacy instructions take up
| less space
| bigcheesegs wrote:
| Encoding space, not die space.
| stonogo wrote:
| It doesn't matter how convincing the sales pitch is when the
| product is not actually for sale.
|
| One thing ARM and x86 got right that SPARC and POWER got wrong is
| widely available machines at reasonable prices. All the
| 'being right' in the world won't help if developers need a five-
| figure hardware budget to port to your platform. VMs don't cut it
| for bringup.
| johndoe0815 wrote:
| At least PowerPC machines were available for reasonable prices
| from Apple for about a decade - and Linux was quite well
| supported in addition to OS X. But with Motorola's loss of
| interest in the PC and server market and IBM's focus on
| processors for consoles, there was no future for Apple in the
| growing mobile market. After all, we're still waiting for the
| G5 Powerbook :).
| BirAdam wrote:
| $17 64 bit RISC-V https://linuxgizmos.com/17-sbc-runs-linux-on-
| allwinner-d1-ri...
|
| $29 64 bit RISC-V in the same form factor as an RPi CM3
| https://www.clockworkpi.com/product-page/copy-of-clockworkpi...
|
| If you never look for it, you will believe it doesn't exist.
|
| I was very skeptical of ARM back in the day thinking that it
| was great for crappy little iTrinkets and Androids but not for
| "real computing". I was clearly wrong. I was very skeptical of
| RISC-V until I recently heard Jim Keller explain why RISC-V has
| a bright future. He was rather convincing. This is especially
| true given his track record of straight-up magical results.
| Looking at different RISC-V machines, I think that the greatest
| advantage is that it is simple and can therefore be optimized
| more easily than complex designs, and due to being open, it has
| very low cost which will encourage more eyes trying more and
| different optimizations.
|
| EDIT: Link to Jim Keller interview
| https://www.anandtech.com/show/16762/an-anandtech-interview-...
| nnx wrote:
| > I recently heard Jim Keller explain why RISC-V has a bright
| future
|
| Would like to hear it too. Can you share a link?
| BirAdam wrote:
| Updated my response to include a link to the transcription.
| The audio/video is here:
| https://www.youtube.com/watch?v=AFVDZeg4RVY
|
| It's actually important (if you're not an engineer) to
| listen to the whole thing, because he drops knowledge all
| over the place.
| tsmi wrote:
| I agree mostly with Keller's take but I think he left out one
| key factor, the quality of the software tool chain.
|
| The x86 tool chains are amazing. They're practically black
| magic in the kinds of optimizations they can do. Honestly, I
| think they're a lot of what is keeping Intel competitive in
| performance. ARM tool chains are also very good. I think
| they're a lot of the reason behind why ARM can beat RISCV in
| code space and performance on equivalent class hardware
| because honestly, like Keller says, they're not all that
| different for common case software. But frankly x86 and ARM
| toolchains should dominate RISCV when we just consider the
| amount of person-hours that have been devoted to these tools.
|
| So for me the real question is, where are the resources that
| make RISCV toolchains competitive going to come from (and
| keep in mind x86 and ARM have open source toolchains too)?
| And, will these optimizations be made available to the
| public?
|
| If we see significant investment in the toolchains from the
| likes of Google, Apple and nVidia, or even Intel. ARM needs
| to be really worried.
| ansible wrote:
| I don't know that such a heavy investment in the toolchains
| for RISC-V is actually needed.
|
| If you look at generated code, it seems fairly
| straightforward. There aren't a lot of tricks or anything.
| BirAdam wrote:
| I think the serious investment will be from Intel, Apple
| (with LLVM), and possibly Microsoft (into the GCC/Linux
| ecosystem).
| pjmlp wrote:
| ARM has proven their place for real computing on Newton OS
| and Acorn Archimedes, no need to prove it again on crappy
| little iTrinkets and Androids.
|
| Where is a RISC-V doing "real computing" on an Acorn
| Archimedes-like personal computer?
| rwmj wrote:
| There's lots of RISC-V hardware these days, from embedded RV32
| chips up to machines you can run Linux on. It's nothing at all
| like SPARC/POWER.
| jfkimmes wrote:
| I get "To keep reading this story, get the free app or log in.
| (With Facebook or Google)" on mobile.
|
| No thanks, Medium. These dark patterns crop up everywhere
| lately...
| socialdemocrat wrote:
| It is for authors like me writing on Medium, to have a way of
| getting paid. There is a need for both paid and free content.
| But reality is that you cannot produce quality content if
| everything has to be free. Advertisement is one solution, but
| one not without its own serious drawbacks.
|
| Medium is like a magazine with a very large number of
| journalists which it pays to write for it. Naturally it needs
| to charge subscribers to make an income.
| Eduard wrote:
| https://archive.is/BqS0n
| math-dev wrote:
| That's a shame...it was the most annoying thing about Quora to
| me.
|
| As a Medium writer, I'm annoyed now! They already stopped
| paying me my ~0-10$ per month because I refused to beg everyone
| to get to their new minimum 100 followers requirement for
| getting paid.
| aw1cks wrote:
| https://scribe.rip
| throwaway81523 wrote:
| 12ft.io got past that for me.
| btdmaster wrote:
| https://scribe.rip/addressing-criticism-of-risc-v-microproce...
| zozbot234 wrote:
| > Every 32-bit word in the instruction cache will contain either
| a 32-bit uncompressed instruction or two 16-bit compressed
| instructions. Thus everything lines up nicely.
|
| This is not really accurate AIUI, since the RISC-V C extension
| allows 32-bit insns to be 16-bit aligned. (This would also happen
| if 48-bit insns were enabled by some other future extension).
| It's nonetheless a lot simpler than whatever x86 has to do, since
| insn length is given by a few well-defined bits in the insn word.
| ribit wrote:
| I have difficulty following the points the author is trying to
| make.
|
| - Even with instruction compression the type of code they present
|   will take more space than, say, Aarch64.
|
| - The entire section on conditional execution doesn't make any
|   sense. Conditional execution is bad, we know it, that's why
|   modern ARM does not have conditional execution. Overall, the
|   author's insistence on comparing RISC-V to the practically
|   obsolete ARMv7 when ARMv8 has been available for over a decade
|   is... odd.
|
| - Regarding SIMD... it's a very complex topic, but personally, I
|   don't see any fundamental problem with a vector-style ISA. I
|   think it's a great way of allowing scalable software. But a
|   vector ISA does not replace basic SIMD as they solve different
|   problems. Vector stuff is great for throughput, SIMD is great
|   for latency. There are many tasks such as geometry processing,
|   modern data structures etc. where fixed-size 128-bit SIMD is an
|   excellent building block. That's why ARM has both NEON and
|   SVE2; the latter does not make the former obsolete. And that
|   bit about GPUs and how they are not good for vector
|   processing... not even sure how to comment on it. Also, at the
|   end of the day, specialised devices will vastly outperform any
|   general-purpose CPU solution. That's why we see, say, Apple M1
|   matrix accelerators delivering matmul performance on par with
|   workstation CPU solutions, despite using a fraction of power.
|
| Overall, my impression is that the article is grasping at straws,
| ignores modern technology and ultimately fails to deliver. I
| also remain unconvinced by the initial premise that RISC-V
| follows the principle "not painting yourself into a corner due to
| choices which have short term benefit". I do think that choices
| like keeping instructions as simple as possible (even though it
| makes expression of common patterns verbose), avoiding flags
| registers, disregarding SIMD etc. could be characterised as
| "painting oneself into a corner".
|
| A usual disclaimer: I do think that RISC-V is a great
| architecture for many domains. Simple low-power/low-cost
| controllers, specialised hardware, maybe even GPUs (with
| extensions) -- the simplicity and openness of RISC-V makes it a
| great point of entry for basically anyone and invites
| experimentation. I just don't see much merit of RISC-V in the
| general-purpose high-performance consumer computing
| (laptop/desktop). In this space RISC-V does not have any notable
| advantages, it does have potential disadvantages (e.g. code
| density and lack of standard SIMD -- yet). Most importantly, the
| CPU microarchitecture becomes the decisive factor, and designing
| a fast general-purpose CPU requires a lot of expertise and
| resources. It's not something that a small group of motivated
| folk can realistically pull off. So all the great things about
| RISC-V simply do not apply here.
| audunw wrote:
| Why do you list code density as a potential disadvantage? With
| changes to the ISA that have already been approved, RISC-V will
| have the best code density of any significant ISA in real-world
| code.
|
| The article touched on this briefly so it's odd that you would
| claim this without a source for the claim. I know there's some
| outdated benchmarks where it's slightly worse than Thumb for
| instance. But then Thumb isn't relevant for desktop CPUs.
|
| The downside for RISC-V for high end desktop/laptop is lack of
| a large commercial backer (someone like Apple could pull it
| off, but clearly they've bet on ARM,which was clearly the right
| choice since RISC-V was far from ready). Lack of the huge
| legacy of tool chains and software built around x86 and ARM is
| also obviously a huge disadvantage.
|
| But you could have said the same about ARM back in the day. The
| thing is I'm not sure if the advantages for RISC-V are big
| enough to take over all of ARM's markets the way ARM has the
| potential to with x86.
| socialdemocrat wrote:
| > And that bit about GPUs and how they are not good for vector
| processing... not even sure how to comment on it. Also, at the
| end of the day, specialised devices will vastly outperform any
| general-purpose CPU solution. That's why we see, say, Apple M1
| matrix accelerators delivering matmul performance on par with
| workstation CPU solutions, despite using a fraction of power.
|
| Of course GPUs are good for vector processing compared to a
| general-purpose CPU. That was not the point at all. The point
| is that unlike older architectures such as Cray, they were not
| designed specifically for general-purpose vector processing but
| for graphics processing. That is why solutions such as SOC-1,
| built specifically for general-purpose vector processing, can
| compete with graphics cards made by giants like Nvidia.
|
| The article is talking about adding vector processing both to
| RISC-V chips aimed at general purpose processing as well as to
| specialized RISC-V cores which are primarily designed for
| vector-processing. SOC-1 is an example of this. It has 4
| general purpose RISC-V cores called ET-Maxion, while also
| having 1088 small ET-Minion cores made for vector processing.
| However these are still RISC-V cores, rather than some graphics
| card SM core.
|
| I don't get your argument about SIMD being great for latency.
| RISC-V requires that vector registers are at minimum 128-bit,
| so you can use RVV as a SIMD instruction set with 128-bit
| registers if you want.
| pjmlp wrote:
| It isn't odd at all, this is the kind of usual narrative when
| selling stuff to an audience that only has a passing knowledge
| of all issues.
|
| So anyone that isn't deep into ARM architecture will indeed buy
| into the arguments being made, as they can't assert otherwise.
| socialdemocrat wrote:
| Author here: I have tried to clarify this better in the update.
| The point is that I am talking about AArch32 and AArch64 in the
| article. Yes, everybody has been going away from conditional
| instructions, because they don't work well in Out-of-Order
| superscalar processors, and they are pointless when you got
| good branch predictors.
|
| HOWEVER, an argument in the ARM camp is that they are very
| useful when dealing with smaller chips. Remember ARM and RISC-V
| compete in the low range as well as higher range. AArch32 is
| not obsolete. It still has uses. There have been ARM fans
| claiming that conditional instructions make ARM superior for
| simple chips. The argument here was that RISC-V has ways of
| dealing with simple in-order chips as well.
| Someone wrote:
| > There has been ARM fans claiming that conditional
| instructions make ARM superior for simple chips.
|
| For those following this only from the sidelines, it would
| help strengthen the article if the article had links to such
| claims. I couldn't easily find them, and would be curious as
| to their age, given that, reading https://en.wikipedia.org/wi
| ki/Predication_(computer_architec..., ARM has made
| substantial changes to conditional execution a few times
| since 1994 (over 25 years ago); Thumb (1994) dropped them,
| Thumb-2 (2003) replaced them by, if I understand it
| correctly, an instruction "skip the next 4 instructions
| depending on flags", and ARMv8 replaced them by conditional
| select.
|
| (In general, providing links to articles claiming each
| proclaimed myth to be true would strengthen this article. I
| think I've only ever read about #1, and not with as strong a
| wording as "bloats")
| socialdemocrat wrote:
| If there were some good articles to point to I would.
| However I don't want to single out people ranting against
| RISC-V. This is more about opinions which keep popping out
| here in Hacker news, twitter, Quora and other places. I
| don't want this discussion to be turned personal.
|
| It should be possible to discuss these opinions without
| singling out anyone.
|
| I am however talking about claims put forth after ARMv8.
| The argument here has basically been this: Both ARM and
| RISC-V aims to cover both the low end and high end. Some
| ARM fans think that by not including conditional
| instructions RISC-V really only works for high-end CPUs.
| The idea here is that AArch32 would be better than RV32 for
| lower-end chips.
| erosenbe0 wrote:
| What are these r registers for AArch64? The author likely hasn't
| bothered to run any of the examples through an actual 64-bit ARM
| assembler. Dubious.
| TazeTSchnitzel wrote:
| The conditional execution section makes no mention of the fact
| AArch64 doesn't have this feature either, and bizarrely lists a
| "64-bit" ARM code example that isn't. This doesn't inspire
| confidence in the author's understanding.
| fay59 wrote:
| All of the ARM assembly is wrong. AArch64 uses "x" or "w" to
| identify general purpose registers, "r" isn't a thing.
| solarexplorer wrote:
| In the same section the part about the SiFive optimization is
| also misleading. The goal of the optimization is obviously to
| avoid interrupting the instruction fetch. But he makes it sound
| like the goal was to reduce instruction count by fusing two
| instructions to get a single monster op with five (!) register
| operands. That just doesn't make sense.
| pxeger1 wrote:
| I thought it was only supposed to be pseudo-assembly?
| audunw wrote:
| There are so many armchair specialists when it comes to criticizing
| RISC-V. I've seen people claim an ISA is better because it has
| branch delay slots.. which seems clever to someone who knows
| enough technical details about CPUs to understand what the
| benefit of that feature is ("free" instruction execution for
| every branch taken), but is a terrible idea for a truly scalable
| ISA (huge PITA for out-of-order architectures if I've understood
| correctly)
|
| I'm sure there are some bad decisions in RISC-V, but I've yet to
| see one that isn't in the process of being remedied. There was a
| good argument that the lack of a POPCOUNT instruction is bad, but
| I think that's being added soon.
| okl wrote:
| First you complain about "armchair specialists", then you make
| a blanket assertion about branch delay slots. Believe it or
| not, there are ISAs for applications where branch delay slots
| are useful, for example TMS320 DSPs with up to 5 delay slots.
| __s wrote:
| You shouldn't assess RISC-V from the viewpoint of a single
| chip. It's an ISA first. Branch delay slots are highly target
| specific.
| mst wrote:
| My very much non-expert understanding was that branch delay
| slots, where enough details of the target processor design are
| known at ISA design time to have the 'right' number of slots,
| can be a neat optimisation.
|
| OTOH if one is designing an ISA that will have a bunch of
| different implementations - and this includes later
| implementations wanting to be ASM compatible with earlier
| ones - they tend to eventually become a footgun for the
| processor designers. (if I remember correctly and didn't
| completely misunderstand, MIPS' branch delay slots were
| absolutely a neat optimisation for the early models, but when
| they went to a deeper pipeline for later chips, they required a
| bunch of extra design effort to maintain compatibility with,
| without being helpful anymore)
|
| (explicit disclaimer that I'm an armchair amateur here, so if
| you're a fellow non-expert reading this comment before it
| attracts better informed replies please default to joining me
| in the assumption that I've made at least one massive error
| in what I'm saying here)
___________________________________________________________________
(page generated 2022-03-20 23:01 UTC)