[HN Gopher] Addressing Criticism of RISC-V Microprocessors
       ___________________________________________________________________
        
       Addressing Criticism of RISC-V Microprocessors
        
       Author : nnx
       Score  : 163 points
       Date   : 2022-03-20 05:37 UTC (17 hours ago)
        
 (HTM) web link (erik-engheim.medium.com)
 (TXT) w3m dump (erik-engheim.medium.com)
        
       | devit wrote:
       | I think the biggest issue is the lack of arithmetic with overflow
       | checking, especially with a variant that calls a location in a
       | control register on overflow.
       | 
       | This makes it very inefficient to compile languages that would
       | like overflow checks on all arithmetic.
        
         | audunw wrote:
          | A comment elsewhere here pointed out that RISC-V can do it
          | with two fused compressed instructions for the most common
          | operations. So it seems like they made the right trade-off
          | to me.
        
       | FullyFunctional wrote:
       | I'm heavily invested in RISC-V, both personally and
       | professionally, and I think the story is much more complicated
       | than this makes it out to be, but I'm not going to rehash the
       | discussion yet again.
       | 
        | However, I do want to point out that a real issue (especially
        | with legacy code) is scaled address calculation with 32-bit
        | unsigned values. Thankfully the Zba extension adds a number of
        | instructions that help a lot, but it would still require
        | fusion to get complete parity with Arm64.
       | 
        | For
        | 
        |     int update(int *base, unsigned index) { return base[index]++; }
        | 
        | we get:
        | 
        |     update:
        |         sh2add.uw  a1,a1,a0
        |         lw         a0,0(a1)
        |         addiw      a5,a0,1
        |         sw         a5,0(a1)
        |         ret
       | 
       | Zba is included in the next Unix profile and will _likely_ be
       | adopted eventually by all serious implementations.
       | 
       | EDIT: grammar and spacing
        
          | abainbridge wrote:
          | I'm guessing that your assembly code is RISC-V with the Zba
          | extension. Is the non-Zba version worse than Arm64?
         | 
          | Compiling your function with Godbolt, I get:
          | 
          |     RISC-V (no Zba) Clang - 7 instructions - https://godbolt.org/z/7znnrzxKq
          |     Arm64 Clang           - 7 instructions - https://godbolt.org/z/Trv8scxad
         | 
          | Annoyingly I can't see the code size for the Arm64 case because
          | no output is generated if I tick the "Compile to binary" option
          | in "Output". I have to use GCC instead:
          | 
          |     RISC-V (no Zba) Clang - 20 bytes - https://godbolt.org/z/eWfPaorcj
          |     Arm64 GCC             - 24 bytes - https://godbolt.org/z/bzsPzov5h
        
           | FullyFunctional wrote:
            | EDIT: Hmm, I seem to have picked a bad example. Try this one:
            | 
            |     int get(int *base, unsigned index) { return base[index]; }
            | 
            | Arm64:
            | 
            |     get:
            |         ldr     w0, [x0, w1, uxtw 2]
            |         ret
            | 
            | RV64GC (vanilla):
            | 
            |     get:
            |         slli    a5,a1,32
            |         srli    a1,a5,30
            |         add     a0,a0,a1
            |         lw      a0,0(a0)
            |         ret
            | 
            | RV64GC+Zba:
            | 
            |     get:
            |         sh2add.uw a0,a1,a0
            |         lw        a0,0(a0)
            |         ret
           | 
            | Arm64 is able to do some indexed loads in a single
            | instruction that might take two in RISC-V w/Zba (and up to
            | 4+ in regular RISC-V). However, calling that a win for
            | Arm64 is not so clear, as the more complicated addressing
            | modes could become a critical timing path and/or require an
            | extra pipeline stage. Still, as a first approximation, for
            | a superscalar, dynamically scheduled implementation, fewer
            | ops is better, so I would say it's a slight win.
           | 
            | I don't understand the obsession with bytes. 25% fewer
            | bytes has only a very marginal impact on a high-performance
            | implementation, and the variable-length encoding has some
            | horrendous complications (which is probably why Arm64
            | _dropped_ variable-length instructions). Including
            | compressed instructions in the Unix profile was the biggest
            | mistake RISC-V made, and I'll die on that hill.
           | 
            | ADD: Don't forget that every 32-bit instruction is
            | currently wasting its lower two bits to allow for the
            | compressed encoding, so any gain from compression must be
            | offset against the 6.25% tax that is forced upon it.
        
             | damageboy wrote:
              | Thank you for writing the obvious. Instruction byte count
              | is 100% the wrong metric here. Instruction count (given
              | reasonable decoding/timing constraints) is the thing to
              | optimize for, and indeed variable-length encoding is very
              | bad.
        
               | knorker wrote:
               | For those of us without the expertise, could you
               | elaborate on why that is?
               | 
               | On the one hand we have byte count, with its obvious
               | effect on cache space used. But to those of us who don't
               | know, why is instruction count so important?
               | 
               | There's macro-op fusion, which admittedly would burn
               | transistors that could be used for other things. Could
               | you elaborate why it's not sufficient?
               | 
                | And then there's the fact that modern x86 does the
                | opposite of macro-op fusion, by actually splitting up
                | CISC instructions into micro-ops. Why is it so bad if
                | there were more micro-ops to start with, if Intel
                | chooses to do this?
        
               | tsmi wrote:
                | Instruction byte count matters quite a lot when you're
                | buying ROM in volume. And today, the main commercial
                | battleground for RISC-V is in the microcontroller
                | space, where people care about these things.
        
             | avianes wrote:
              | Why use an unsigned? It is obvious here that RISC-V
              | without Zba takes 4 instructions because it has to handle
              | the special cases related to unsigned indices.
              | 
              | If you use a simple int for the index:
              | 
              |     slli  a1,a1,2
              |     add   a0,a0,a1
              |     lw    a0,0(a0)
              | 
              | Isolating this code in a small function puts constraints
              | on register allocation, but if we remove this constraint
              | then we can write:
              | 
              |     slli  a1,a1,2
              |     add   a1,a1,a0
              |     lw    a1,0(a1)
              | 
              | which is very suitable for macro-op fusion and the C
              | extension.
             | 
              | > Including compressed instructions in the Unix profile
              | was the biggest mistake RISC-V made and I'll die on that
              | hill.
              | 
              | This is so wrong. The C extension is one of the great
              | strengths of RISC-V: it is easy to decode, very suitable
              | for macro-op fusion, and it gives a huge boost in code
              | density.
        
               | JonChesterfield wrote:
                | IIRC compressed instructions are the thing that costs 2
                | bits per 32 and was criticised as overfitted to naive
                | compiler output. Am I thinking of something else?
        
               | avianes wrote:
                | Yes, but RISC-V still has a lot of encoding space free,
                | and the benefit of the C extension is huge. It's a
                | trade-off.
                | 
                | I don't think RISC-V is perfect or universal, but on
                | this point they did a pretty good job compared to other
                | ISAs.
        
               | klelatti wrote:
               | You say that the benefit is 'huge' but why should I care
               | about code density on a modern CPU with gigabytes of
               | memory and large caches?
               | 
               | From a performance perspective what is the evidence that
               | this actually provides an advantage?
        
               | zozbot234 wrote:
                | Because the _fastest_ cache levels are tiny, even in
                | the largest and most advanced CPUs. There's plenty of
                | evidence for the performance benefits of improved
                | density and terseness in both code and data.
        
               | klelatti wrote:
               | The M1 has a 192k instruction cache for performance cores
               | which is not 'tiny'.
               | 
               | If there is lots of evidence for the performance benefits
               | of improved density vs the alternative of fixed
               | instruction width in real world CPUs then I'm sure you'll
               | be able to cite it.
        
               | avianes wrote:
                | To clarify, I'm not saying that the code density of the
                | C extension is a big advantage over RISC-V's
                | competitors, but it is a huge benefit for RISC-V.
                | 
                | You are right, code density is perhaps not that
                | critical today.
               | 
                | And it is difficult to quantify its relevance, as code
                | density is always entangled with other variables such
                | as instruction expressiveness, number of uops emitted,
                | etc.
               | 
                | But I still think that code density is important for
                | RISC-V, because the RISC-V philosophy for reaching high
                | performance is to use very simple instructions that can
                | be combined and take advantage of macro-op fusion. I
                | think RISC-V without macro-op fusion can't reach the
                | performance of other ISAs.
                | 
                | But RISC-V, with all these simple and not very
                | expressive instructions and without the C extension,
                | has pretty bad code density, which could cost a lot
                | because it is not at the competitors' level.
                | 
                | So if we think of RISC-V as a macro-op-fusion-oriented
                | ISA, then the C extension becomes important for it to
                | be competitive.
               | 
                | I don't know which is better between a "macro-fusion"-
                | oriented arch and a "complex-instruction"-oriented
                | arch; the future will tell.
        
               | klelatti wrote:
               | Thanks for clarifying - interesting to see a different
               | philosophy being tried.
        
               | JonChesterfield wrote:
                | 32 bits, 32 registers, three-register code. So add r0
                | r1 r2 spends fifteen bits identifying which registers
                | to use, then another two on the compressed ISA. That's
                | more than half the encoding space gone before
                | identifying the op. Never thought I'd want fewer
                | registers, but here we are.
               | 
                | If the compressed extension is great in practice, it
                | might be a win. If the early criticism of overfitting
                | to gcc -O0 proves sound and in practice compilers don't
                | emit it, then it was an expensive experiment.
        
               | avianes wrote:
                | The encoding space is not the number of bits used to
                | encode an instruction; the encoding space is the full
                | set of possible instruction-word values, and what
                | matters is the ratio of values that encode a valid
                | instruction to the total number of values available.
               | 
                | As an example, take an ISA with an 8-bit fixed
                | instruction length and 8 registers (register index
                | encoded in 3 bits):
                | 
                | If the top 2 bits are the opcode and we define 2
                | instructions (e.g. AND, XOR) that each take 2 registers
                | (3 bits + 3 bits), then instruction-word values
                | 0b00_000_000 to 0b01_111_111 encode these two
                | instructions (ignore the "_"; they are separators).
                | 
                | Therefore, instruction-word values from 0b10_000_000 to
                | 0b11_111_111 remain free, which represents half of the
                | encoding space. So half of the encoding space remains
                | free.
               | 
               | This means we still have room to put new instructions.
               | 
                | Similarly, RISC-V's valid instructions use almost all
                | the available bits, but there is still room in the
                | encoding space because some opcodes remain free.
        
               | FullyFunctional wrote:
                | Indeed, why use unsigned? Go take a look at a lot of C
                | code (hint: look at the SPEC benchmarks). They do that.
                | Decades of pointer == int == unsigned have led to a lot
                | of horrific code. But we still compile it.
               | 
                | The sin of the original RISC-V was spending too much
                | time looking at RV32 and not realizing how big a
                | problem this is in practice. Zba (slipped in, as it's
                | not really "bit manipulation") fixes the worst of it.
               | 
                | ADD: The problem in this HN thread is the same reason
                | we got compressed instructions in the first place. The
                | vast majority of people aren't doing high-performance
                | wide implementations, so the true cost isn't widely
                | appreciated. The people holding the decision power
                | certainly didn't understand it. I really think you have
                | to live it to understand it.
        
               | avianes wrote:
               | > Go take a look at a lot of C code (hint, look at SPEC
               | benchmarks). They do that.
               | 
               | This does not really justify isolating that snippet if
               | you admit yourself it's a bad one.
               | 
               | > The sins of the original RISC-V was spending too much
               | time looking at RV32 and not realizing how big a problem
               | this is in practice.
               | 
                | But indeed, my previous message shows that even without
                | Zba the problem is erased by good register allocation
                | and macro-op fusion.
               | 
                | I think you are trying too hard to find special cases
                | that "trick" RISC-V; you didn't even pay attention to
                | the use of unsigned, which is non-optimal (unsigned has
                | undesirable overflow semantics here).
        
             | saagarjha wrote:
              | > 25% fewer bytes has only a very marginal impact on a
              | high-performance implementation
             | 
             | Instruction cache doesn't come for free, and is usually
             | pretty small on most shipping processors. It's not a big
             | deal for smaller benchmarks, but in real-world programs
             | this can become a problem.
        
               | snek_case wrote:
               | Not only that, but CPUs have a maximum number of
               | instructions they can dispatch per cycle (typically 4 or
               | 6). Even in microbenchmarks, the difference there could
               | show up.
        
               | marcosdumay wrote:
                | The bottleneck on that is the data interdependency of
                | your algorithm. If you break it into 6 or 10
                | instructions, the data dependency stays the same.
                | 
                | (Of course, you can add unnecessary dependencies with a
                | badly designed ISA. But it's not a necessary condition.)
        
               | snek_case wrote:
                | There's still a limit to how many instructions you can
                | decode and dispatch every cycle, even with zero
                | dependencies. And there are definitely dependencies in
                | the example, where you're computing a memory address to
                | access a value.
        
               | FullyFunctional wrote:
                | I am obviously aware, and I'm here to tell you that the
                | overhead of variable-length instructions matters more.
                | Arm agrees. The M1 has a 192 KiB I$, btw.
               | 
                | ADD: had RISC-V just disallowed instructions from
                | spanning cache lines and disallowed jumping into the
                | middle of instructions, then almost all of these issues
                | would have gone away. Sigh.
        
               | saagarjha wrote:
               | I actually had Apple's chips in mind when talking about
               | "most shipping processors" because they have historically
               | invested heavily in their caches and reaped benefits from
                | it. But not all the world's an M1, and also I'll have
                | you know that Apple themselves care _very much_ about
                | their code size, even with their large caches. Don't go
                | wasting it for no reason!
               | 
               | (I should also note that I am pretty on board with you
               | with regards to variable-length instructions, this is
               | just independent of that.)
        
               | avianes wrote:
                | Variable instruction sizes have a cost, but with only 2
                | instruction sizes, as in current RISC-V, that cost
                | remains very low as long as we don't have to decode a
                | very large number of instructions each cycle, and it
                | gives a huge code-density advantage.
        
               | FullyFunctional wrote:
                | The biggest issue is one instruction spanning two cache
                | lines, and even two pages. This means a bunch of tricky
                | cases that are a source of bugs and overhead.
               | 
                | It also means you cannot tell instruction boundaries
                | until you actually fetch the instructions, so you
                | cannot do any predecode in the cache that would help
                | you figure out dependencies, branch targets, etc. These
                | things matter when you are trying to fetch 8+
                | instructions per cycle.
        
               | avianes wrote:
               | > This biggest issue is one instruction spanning two
               | cache lines
               | 
                | Even with fixed (32-bit) instruction lengths aligned on
                | 32 bits, when you have to decode a group of 8
                | instructions you face this kind of issue.
                | 
                | So you either have to cut the instruction group (and
                | thus not take full advantage of the 8-way decoder) or
                | you have to implement a more complex prefetch with a
                | longer pipeline. And these special cases can be handled
                | in those pipeline stages.
               | 
               | > It also means you cannot tell instruction boundaries
               | until you directly fetch instructions
               | 
                | I mean, AMD does that on x86, with 14 instruction
                | lengths.
                | 
                | It can be done for RISC-V; it's much cheaper than for
                | x86, and it takes significantly less die area than a
                | bigger cache to compensate.
        
                | mst wrote:
                | The compression stuff is an extension (and so far as I
                | can tell the 16-bit alignment for 32-bit instructions
                | that can result in that sort of spanning is part of
                | that extension), so you could implement said extension
                | for tiny hardware where every byte counts, and then for
                | hardware where you want to fetch 8+ instructions per
                | cycle just ... not implement it?
                | 
                | Wait (he says to himself, realising he's an idiot
                | immediately -before- posting the comment for once). You
                | said upthread the C extension is specified as part of
                | the standard Unix profile, so I guess people are
                | effectively required to implement it currently?
               | 
                | If that were changed, would that be sufficient to
                | dissolve the issues for people wanting to design high-
                | performance implementations, or are there other
                | problems inherent in the extension having been
                | specified at all? (Apologies for the 101-level
                | questions; the only processor I ever really understood
                | was the ARM2, so my curiosity vastly exceeds my
                | knowledge here.)
        
               | KerrAvon wrote:
               | Have the ARM AArch64 designers ever commented on this?
               | They intentionally left out any kind of compressed
               | instructions, and certainly Apple at least cares a lot
               | about code size.
        
               | [deleted]
        
               | klelatti wrote:
                | Try this at 34:30 - from Arm's architecture lead
                | Richard Grisenthwaite. Earlier he says that several
                | leading microarchitects think that mixing 16-bit and
                | 32-bit instructions (Thumb-2) was the worst thing that
                | Arm ever did.
                | 
                | https://m.soundcloud.com/university-of-cambridge/a-history-o...
        
       | albanread wrote:
        | People should zoom right out and think about the whole RISC-V
        | project. When our phones have billions of transistors, are we
        | seriously supposed to believe that the RISC philosophy still
        | matters? Personally I greatly prefer the user-programmable
        | 68000 family of processors. The marketing of RISC-V is perhaps
        | the most impressive thing about it. To each their own; I can
        | see why giant SSD manufacturers want to use a license-free
        | design and share the cost of compiler development. Is there
        | really anything else?
        
         | kortex wrote:
          | Yeah, absolutely. Personally, when I zoom out and look at the
          | trends of engineering in general (simpler modular systems
          | that compose well together versus bespoke solutions), RISC-V
          | precisely follows the trend. Reduce global state. Make it
          | easier (for humans and algos) to reason about control flow.
          | Have a simple core with optional extensions. This all makes
          | building multi-core solutions way simpler. We are fast
          | running out of transistor-density improvements. But we are
          | getting way better at building coprocessors. There's clear
          | value in "doing more simple things in parallel".
        
         | monocasa wrote:
         | > When our phones have billions of transistors, are we
         | seriously supposed to believe that RISC philosophy still
         | matters.
         | 
          | The point isn't just saving gates because it's cheaper. Fewer
          | gates means a shorter critical path, meaning less power
          | consumption and/or higher overall performance when compared
          | apples to apples.
        
       | throwaway81523 wrote:
       | Fairly lame article (not wrong, but stuff people following the
       | topic have seen before), and I'd still like to hear about integer
       | overflow detection. If the floating point extension is able to do
       | IEEE 754 condition codes including overflow detection, why can't
       | the integer unit do something similar?
        
         | FullyFunctional wrote:
         | This comes up a lot and I'm sympathetic to your plea, really (I
         | enjoy fantasizing about a different reality where CPUs weren't
         | just "machines to run C programs"), but in computer
         | architecture, what really matters for one application or a
         | class of applications might not be important when viewed across
         | millions of programs.
         | 
         | The fact is that integer operations and floating point are two
         | completely different beasts, so much so that we have different
         | benchmark suites for each.
         | 
          | Integer operations are critically latency sensitive, and
          | tacking on extra semantics doesn't come for free; for _most_
          | code this would be a tax. The "overflow bit" represents an
          | implicit result that would have to be threaded around (I'm
          | assuming that you aren't asking for exceptions, which
          | literally nobody wants). For FP we do that, but the cost and
          | latency of FP ops is already high, so it doesn't hurt quite
          | as much.
         | 
         | The RISC-V spec [1] (which I assume you have seen) already
         | discusses all these trade offs:
         | 
          | "We did not include special instruction-set support for
          | overflow checks on integer arithmetic operations in the base
          | instruction set, as many overflow checks can be cheaply
          | implemented using RISC-V branches. Overflow checking for
          | unsigned addition requires only a single additional branch
          | instruction after the addition:
          | 
          |     add  t0, t1, t2
          |     bltu t0, t1, overflow
          | 
          | For signed addition, if one operand's sign is known, overflow
          | checking requires only a single branch after the addition:
          | 
          |     addi t0, t1, +imm
          |     blt  t0, t1, overflow
          | 
          | This covers the common case of addition with an immediate
          | operand. For general signed addition, three additional
          | instructions after the addition are required, leveraging the
          | observation that the sum should be less than one of the
          | operands if and only if the other operand is negative.
          | 
          |     add  t0, t1, t2
          |     slti t3, t2, 0
          |     slt  t4, t0, t1
          |     bne  t3, t4, overflow
          | 
          | In RV64I, checks of 32-bit signed additions can be optimized
          | further by comparing the results of ADD and ADDW on the
          | operands."
         | 
          | I do think that it might have been worth adding a single-
          | instruction version of the last one (excluding the branch),
          | but I'm not aware of one being accepted.
         | 
         | [1] https://github.com/riscv/riscv-isa-manual
        
           | throwaway81523 wrote:
            | Yes, I've seen that reasoning: they propose bloating 1
            | integer instruction into 4 instructions in the usual case
            | where the operands are unknown. Ouch. In reality they
            | expect programs to normally run without checking, like they
            | did in the 1980s. So this is more fuel for the criticism
            | that RISC-V is a 1980s design with new paint. Do GCC and
            | Clang currently support -ftrapv for RISC-V, and what
            | happens to code size and speed when it is enabled? Yes,
            | IEEE FP uses sticky overflow bits, and the idea is that
            | integer operations could do the same thing. Integer
            | overflow is one of those things, like null pointer
            | dereferences, which originally went unchecked but now
            | really should always be checked. (C itself is also
            | deficient in not having checkable unsigned int types.)
        
           | modeless wrote:
           | > I'm assuming that you aren't asking for exceptions which
           | literally nobody wants
           | 
            | I want exceptions. Why would they be a bad idea, besides
            | the fact that software doesn't utilize them today (because
            | they're not implemented; a chicken-and-egg problem)? IMO
            | they would be as big a security win as many other complex
            | features CPU designers are adding in the name of security,
            | e.g. pointer authentication.
        
           | adrian_b wrote:
            | Generating the overflow bit and storing it adds a
            | completely negligible cost to a 64-bit adder, so touting
            | this as a cost-saving measure is just a lie, even if this
            | claim has always been present in the RISC-V documentation.
           | 
            | Most real cases of overflow checking are of the last type.
            | Tripling the number of instructions over a bad ISA that
            | lacks overflow exceptions (like, unfortunately, almost all
            | currently popular ISAs), or quadrupling the number of
            | instructions over a traditional ISA with overflow
            | exceptions, is a totally unacceptable cost.
           | 
           | The claim that providing overflow exceptions for integer
           | addition might be too expensive can be easily countered by
           | the fact that generating exceptions on each instruction is
           | not the only way to guarantee that overflows do not happen.
           | 
           | It is enough to store 2 overflow flags, 1 flag with the
           | result of the last operation and 1 sticky flag that is set by
           | any overflow and is reset only by a special instruction.
           | Having the sticky flag allows zero-overhead overflow checking
           | for most arithmetic instructions, because it can be tested
           | only once after many operations, e.g. at a function exit.
           | 
           | The cost of implementing the 2 overflow bits is absolutely
           | negligible, 2 gates and 2 flip-flops. Much more extra
           | hardware is needed for decoding a few additional instructions
           | for flag testing and clearing, but even that is a negligible
           | cost compared with a typical complete RISC-V implementation.
           | 
            | Not providing such a means of reliable and cheap overflow
            | detection is just stupid, and it is an example of hardware
            | design disconnected from the software design for the same
            | device.
           | 
           | The early RISC theory was to select the features that need to
           | be implemented in hardware by carefully examining the code
           | generated by compilers for representative useful programs.
           | 
            | The choices made for the RISC-V ISA, e.g. the omission of
            | both the most frequently required addressing modes and of
            | overflow checking, prove that the ISA designers either
            | never applied the RISC methodology, or studied only toy
            | programs, which are allowed to provide erroneous results.
        
             | zozbot234 wrote:
             | The typical overhead of overflow checking in compiled
             | languages (which, as a reminder, is in the low single-digit
             | %'s _at most_ ) has nothing to do with the lack of
              | hardware-specific extensions. It's a consistent pattern of
             | missing optimization opportunities, because the compiler
             | now needs to preserve the exact state of intermediate
             | results after some operation fails with an overflow. Adding
             | these new opcodes to your preferred ISA would barely change
             | anything. (If they help at all it's in executing highly
             | dynamic languages as opposed to compiled ones, which makes
             | them a natural target for the in-progress 'J' extension.)
        
             | ajb wrote:
             | The extra expense is not the generation of the overflow
             | bit, but the infrastructure needed to support a flags
             | register, or for every instruction to be able to generate
             | an exception.
             | 
             | On a simple processor like a microcontroller this doesn't
              | cost much, but it severely hampers a superscalar or out
             | of order processor, as it can't work out very easily which
             | instructions can be run in parallel or out of order.
             | 
             | The clean solution from a micro architectural point of view
             | would be to have an overflow bit (or whatever flags you
             | wanted) in every integer register. But that's an expense
             | most don't want to pay.
        
               | ansible wrote:
               | > _The clean solution from a micro architectural point of
               | view would be to have an overflow bit (or whatever flags
               | you wanted) in every integer register._
               | 
                | That's what the Mill CPU does. Each "register" also has
               | the other usual flags, and even some new ones like Not a
               | Result, which helps with vector operations and access
               | protection.
        
               | adrian_b wrote:
               | One must not forget that on any non-toy CPU, any
               | instruction may generate exceptions, e.g. invalid opcode
               | exceptions or breakpoint exceptions.
               | 
                | Of every 4-5 instructions, one is a load or store, which
               | may generate a multitude of exceptions.
               | 
               | Allowing exceptions does not slow down a CPU. However
               | they create the problem that a CPU must be able to
               | restore the state previous to the exception, so the
               | instruction results must not be committed to permanent
               | storage before it becomes certain that they could not
               | have generated an exception.
               | 
               | Allowing overflow exceptions on all integer arithmetic
               | instructions, would increase the number of instructions
               | that cannot be committed yet at any given time.
               | 
               | This would increase the size of various internal queues,
               | so it would increase indeed the cost of a CPU.
               | 
               | That is why I have explained that overflow exceptions can
               | be avoided while still having zero-overhead overflow
               | checking, by using sticky overflow flags.
               | 
               | On a microcontroller with a target price under 50 cents,
               | which may lack a floating-point unit, the infrastructure
               | to support a flags register may be missing, so it may be
               | argued that it is an additional cost, even if the truth
               | is that the cost is negligible. Such an infrastructure
               | existed in 8-bit CPUs with much less than 10 thousand
               | transistors, so arguing that it is too expensive in
               | 32-bit or 64-bit CPUs is BS.
               | 
               | On the other hand, any CPU that includes the floating-
               | point unit must have a status register for the FPU and
               | means of testing and setting its flags, so that
               | infrastructure already exists.
               | 
               | It is enough to allocate some of the unused bits of the
               | FPU status register to the integer overflow flags.
               | 
               | So, no, there are absolutely no valid arguments that may
               | justify the failure to provide means for overflow
               | checking.
               | 
               | I have no idea why they happened to make this choice, but
               | the reasons are not those stated publicly. All this talk
               | about "costs" is BS made up to justify an already taken
               | decision.
               | 
               | For a didactic CPU, as RISC-V was actually designed,
               | lacking support for overflow checking or for indexed
               | addressing is completely irrelevant. RISC-V is a perfect
               | target for student implementation projects.
               | 
               | The problem appears only when an ISA like RISC-V is taken
               | outside its right domain of application and forced into
               | industrial or general-purpose applications by managers
               | who have no idea about its real advantages and
               | disadvantages. After that, the design engineers must
               | spend extra efforts into workarounds for the ISA
               | shortcomings.
               | 
               | Moreover, the claim that overflow checking may have any
               | influence upon the parallel execution of instructions is
               | incorrect.
               | 
               | For a sticky overflow bit, the order in which it is
               | updated by instructions does not matter. For an overflow
               | bit that shows the last operation, the bit updates must
               | be reordered, but that is also true for absolutely all
               | the registers in a CPU. Even if 4 previous instructions
               | that were executed in parallel had the same destination
               | register, you must ensure that the result stored in the
               | register is the result corresponding to the last
               | instruction in program order. One more bit along hundreds
               | of other bits does not matter.
        
               | throwaway81523 wrote:
               | > After that, the design engineers must spend extra
               | efforts into workarounds for the ISA shortcomings.
               | 
               | That is too optimistic. Programs will keep running
               | unchecked and we'll keep getting CVE's from overflow
               | bugs.
        
             | ansible wrote:
             | > _The cost of implementing the 2 overflow bits is
             | absolutely negligible, 2 gates and 2 flip-flops. Much more
             | extra hardware is needed for decoding a few additional
             | instructions for flag testing and clearing, but even that
             | is a negligible cost compared with a typical complete
             | RISC-V implementation._
             | 
             | That's understating things considerably.
             | 
             | ARMv8-A has PSTATE, which includes the overflow bit. This
             | explicit state must be saved / restored upon any context
             | switch.
             | 
             | And there isn't just a single PSTATE for an OOO
             | SuperScalar, there are several.
             | 
             | Everything has a cost.
        
           | feanaro wrote:
           | > add t0, t1, t2 bltu t0, t1
           | 
           | How does this work? Isn't `bltu` simply a branch that is
           | taken if `t0 < t1`? How does that detect addition overflow?
           | 
           | EDIT: Ah, because the operands are `t1` and `t2`. `t0` is the
           | result. Quack.
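For the record, the quoted idiom works because unsigned addition wraps modulo 2^N: the wrapped sum is smaller than an operand exactly when the addition overflowed. A small C sketch of the same check:

```c
#include <stdint.h>

/* Mirrors the RISC-V sequence "add t0, t1, t2; bltu t0, t1, <handler>":
 * unsigned addition wraps modulo 2^32, and the wrapped result is below
 * either operand if and only if the addition overflowed. */
static int add_overflows(uint32_t t1, uint32_t t2) {
    uint32_t t0 = t1 + t2;  /* add t0, t1, t2 */
    return t0 < t1;         /* bltu t0, t1, <overflow handler> */
}
```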
        
             | [deleted]
        
       | KSPAtlas wrote:
       | I am personally a fan of RISC-V, and I have written low level
       | code for it before.
        
       | VariableStar wrote:
       | It is amusing and sobering to get a glimpse of some of the
        | complexities going on inside a processor and how design
       | philosophies may affect them. Those are things the user or even
       | your normal programmer seldom thinks about.
        
       | mhh__ wrote:
       | > Easily out-perform ARM in code density
       | 
       | > No data [that I can see at least]
        
       | brucehoult wrote:
       | I am _so_ bored with people criticising RISC-V based on tiny code
       | snippets of things that basically never happen in real code.
       | 
       | A function that does nothing but return an array element from an
       | array base address and index passed to it? Really? Do you
       | actually write junk like that? And if you write it does your
       | compiler really not inline it? Why? Do you like big slow code?
       | Once it's inlined, it's probably in a loop, and strength-reduced.
       | 
       | It's very easy to verify that in the real-world RISC-V code is
       | more compact than amd64 and arm64. Just download the same version
       | of Ubuntu or Fedora (etc) for each one and run the "size" command
       | on the binaries. The RISC-V ones are consistently significantly
       | smaller.
       | 
       | You can also, with quite a bit more work, count the number of
       | uops each ISA executes. RISC-V executes slightly more
       | instructions, but they are each simple and don't need expanding.
       | Lots of x86 instructions get expanded into multiple uops and many
       | 64 bit ARM instructions do too. In the end the number of uops
       | executed by each is very similar.
       | 
       | Trying to judge the goodness of a modern ISA by looking at two or
       | three instruction snippets is as silly as using Dhrystone as your
       | only benchmark program.
        
       | sylware wrote:
       | RISC-V is technically not bad enough to select arm64 or x86_64
       | over it, since those have beyond toxic IP tied to them.
       | 
        | From what I read in the comments, I don't expect compressed
        | instructions to be there on future high-performance
        | desktop/server RISC-V CPU cores.
        
       | panick21_ wrote:
        | People over-argue these minimal differences. Let's be honest:
        | never in the history of ISAs has slightly better code size
        | been the primary reason for the success or failure of an
        | instruction set.
       | 
        | Even if by <insert objective measurement> RISC-V is 10% worse
        | than ARM, it wouldn't actually matter that much for adoption.
       | 
        | Adoption happens for business reasons, and what differentiates
        | RISC-V far more than anything else is the change in license,
        | governance and ecosystem.
       | 
        | RISC-V being better at hitting different verticals optimally
        | because of its modularity is likely another thing that
        | matters more overall than how perfectly it fits each
        | vertical.
        
         | wmf wrote:
          | Agreed, but I think the purpose of these kinds of criticisms
          | is
         | to "fix" RISC-V before it becomes yet another worse-is-better
         | design locked in for 50 years.
        
           | zozbot234 wrote:
           | The only part of RISC-V that is "locked in" to any extent is
           | the minimal set of basic integer instructions. Everything
           | else is defined as part of standardized extensions, and can
           | be superseded simply by defining new custom extensions.
           | Actually even the minimal instruction set admits of some
           | variation, such as the 'E' architectures that dispense with
           | registers R16 to R31, thus saving area in the smallest
           | implementations and potentially freeing up some bits in the
           | encoding.
        
             | wmanley wrote:
             | Things get locked in not by standards, but by usage. If
             | your software depends on particular instructions being
             | present you're not going to buy a processor that has
             | superseded those instructions, even if the new instructions
             | conform to a theoretically cleaner design.
             | 
             | Everything being an extension (and thus removable) is a
             | strength in some specific circumstances, but is a weakness
             | in most.
        
           | socialdemocrat wrote:
           | I think you are really missing the point here. Of course
           | RISC-V has negatives but most of those negatives exist for
           | good reasons. It is a question of tradeoffs.
           | 
           | One of the most important goals of RISC-V is to make an
           | architecture which can stand the test of time. In this space
           | adding the wrong kind of instructions is a bigger problem
           | than not adding particular instructions.
           | 
           | Whether you look at x86, HTML or just about anything the
           | problem is nearly always about having to support old junk
           | which no longer makes sense to support, or lacking the
           | ability to grow. Remember 640K is enough for everyone? RISC-V
           | has a lot of room to grow.
           | 
           | If you want an architecture for the future you would want a
           | minimalist one with room to grow a lot. By keeping the
           | instruction count very low and building in a system for
           | extensions they have made a future proof ISA. Okay we cannot
           | know the future, but it is more likely to survive for decades
           | than something like x86 or maybe even ARM.
        
             | wmf wrote:
             | Most of the complaints about RISC-V are extremely basic
             | things like array indexing and conditional execution. These
             | will never not be needed.
        
               | tsmi wrote:
               | I'm sure that's what the team that invented segment
               | registers said too.
               | 
               | The question is does it make sense to add these to the
               | ISA long term? In the short term, given die density and
               | how memory works today, it has advantages. But die
               | density increases, making OoO cores cheaper, and memory
               | technology changes. It's not obvious that these are long
               | term improvements.
        
               | dgreensp wrote:
               | IANAE, but the article addresses why the arguments that
               | assume these instructions need to be combined are usually
               | not based on looking at the whole picture.
        
         | tsmi wrote:
         | People argue over these minimal differences for good reasons.
         | 
         | If <insert objective measurement> = binary size, and I'm buying
         | ROM in volume to hold that binary, +10% ROM address space can
         | easily cost more than the ARM license.
         | 
         | That can matter quite a lot for adoption. Especially in the
         | short term.
         | 
         | Obviously, priorities differ and change as a function of time
         | but as the saying goes, the only thing worse than making a
         | decision with benchmarks is making a decision without
         | benchmarks.
        
       | dontlaugh wrote:
       | It's nice to have an open ISA, don't get me wrong.
       | 
       | However, trade offs matter. Compressing instructions may improve
       | density, but it makes them variable length. This is a big barrier
       | to decoding in parallel, which is very important to high
       | performance cores.
        
         | brucehoult wrote:
         | This is really not a big deal with RISC-V's 2 instruction
         | lengths and the encoding they use.
         | 
          | When decoding 32 bytes of code (256 bits, somewhere between
          | 8 and 16 instructions), you can figure out where all the
          | actual instructions start (yes, even the 16th instruction)
          | with 2 layers of LUT6.
         | 
         | You can then use those outputs to mux two possible starting
         | positions for 8 decoders that do 16 or 32 bit instructions,
          | plus 8 decoders that will only ever do 16 bit instructions from
         | fixed start positions (and might output a NOP or in some other
         | way indicate they don't have an input).
         | 
          | OR you can use those outputs to mux the _outputs_ of 8
         | decoders that only do 32 bit instructions and 8 decoders that
         | do 16 or 32 (all with fixed starting positions), plus again 8
         | decoders that only do 16 bit instructions from fixed start
         | positions (possibly not used  / NOP).
         | 
         | The first option uses less hardware but has higher latency.
         | 
         | That, again, is for decoding between 8 and 16 instructions per
         | cycle, with an average on real code of close to 12.
         | 
         | That is more than is actually useful on normally branchy code.
         | 
         | In short: not a problem. Unlike x86 decoding.
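The length-finding step described above hinges on the RISC-V encoding rule that a 16-bit parcel whose low two bits are 0b11 starts a 32-bit instruction, while anything else is a compressed 16-bit one. A sequential C sketch of the same computation (hardware does this in parallel with the LUT layers described above; the function name is just for illustration):

```c
#include <stdint.h>

/* With only two instruction lengths, finding instruction boundaries in
 * a fetch block is a simple scan over 16-bit parcels: a parcel whose
 * low two bits are 0b11 starts a 32-bit instruction, anything else is
 * a 16-bit compressed instruction. Returns the number of instructions
 * and writes each starting parcel index into `starts`. */
static int find_starts(const uint16_t *parcels, int n, int *starts) {
    int count = 0;
    for (int i = 0; i < n; ) {
        starts[count++] = i;
        i += ((parcels[i] & 0x3) == 0x3) ? 2 : 1;  /* 32-bit vs 16-bit */
    }
    return count;
}
```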
        
         | audunw wrote:
         | Regular base instructions are always 32-bit, compressed are
         | always 16-bit, and they're always aligned. I don't think
         | there's a problem decoding them in parallel. You always know
         | where the opcodes will be located in a 32-bit word - or set of
         | 32-bit words - you're trying to decode.
         | 
         | What I've been wondering is how difficult it is to fuse
         | instructions when the compressed instructions you're trying to
         | fuse isn't aligned to a 32-bit word.
        
         | avianes wrote:
          | You are right, but RISC-V's variable instruction size is
          | indeed a good trade-off.
         | 
         | Unlike x86 where instructions can range from 1 up to 15 byte,
         | current RISC-V ISA only has 2 instruction sizes.
         | 
          | Today x86 decoding is limiting because we want to decode
          | more than ~4 instructions each cycle; for RISC-V to cause
          | the same decoding difficulty, it would probably need to
          | decode more than ~20 instructions each cycle.
        
           | dontlaugh wrote:
           | I don't know that it's a good tradeoff. ARM64 has fixed
           | length instructions with decent density and has proven to
           | allow highly parallel decoding.
        
       | IshKebab wrote:
       | Dubious. How is "you have to use this magic combination of
       | instructions that compress & execute well" better than having a
       | dedicated instruction?
       | 
       | Also no mention of the binary compatibility issues - which
       | `-march` do you compile your code for? On x86 you have a choice
       | of 3. For RISC-V as far as I can tell there are 96 valid targets.
        
         | mhh__ wrote:
         | No. Firstly -march (or similar, e.g. -mcpu in LLVM land) should
         | target a chip not individual instruction sets.
         | 
         | Secondly, AVX-512 alone has a handful of different extensions.
         | There are a bunch of different SSE variants, and similarly
         | instructions are still being added to the VEX prefix (normal
         | AVX).
         | 
          | There is more potential for getting it wrong with RISC-V,
          | but 64-bit implies a number of extensions too, so it's not
          | too far off from what amd64 originally meant for x86 (e.g.
          | it implies SSE2).
        
           | IshKebab wrote:
           | What do you mean "no"? My comment was entirely factual.
           | 
           | > Firstly -march (or similar, e.g. -mcpu in LLVM land) should
           | target a chip not individual instruction sets.
           | 
           | LLVM still uses -march. And no you shouldn't target a
           | specific chip unless you know your code will only run on that
           | chip. That's the whole point I'm making. Sometimes you do
           | know that (in embedded situations) but _often you don 't_.
           | Desktop apps aren't compiled for specific chips.
           | 
           | > Secondly, AVX-512 alone has a handful of different
           | extensions.
           | 
           | Yes but these are generally linear - if an x86 chip supports
           | extension N it will support extension N-1 too. Not true for
           | RISC-V.
        
             | monocasa wrote:
             | > Yes but these are generally linear - if an x86 chip
             | supports extension N it will support extension N-1 too. Not
             | true for RISC-V.
             | 
             | Not if you include AMD and Intel cores in that.
        
               | IshKebab wrote:
               | Why do you say that? Look here:
               | 
               | https://clang.llvm.org/docs/UsersManual.html#x86
        
               | monocasa wrote:
               | That list isn't really an accurate picture of the world,
               | but a vague attempt to make sense of the madness.
               | 
               | There's plenty of cores that don't follow that versioning
               | scheme, and it's not an Intel or AMD construct.
        
               | [deleted]
        
             | mhh__ wrote:
             | The LLVM tools (like llc) use -mcpu. Clang mimics GCC. My
             | point about the specific chip is that you have to know it
             | anyway if you're planning on targeting a combination of
             | extensions so you might as well use it.
             | 
             | As for linearity, the "generally" bit will apply to RISC-V
             | by the time we have real desktop class chips using the ISA.
             | We still can't assume AVX support for most programs, I
             | don't view this as any different to RISC-V extensions. Just
             | this ~year Intel added VEX-coded AI NN acceleration
             | instructions, I assume RISC-V has similar plans.
        
               | IshKebab wrote:
               | LLVM uses -march and -mcpu. It seems to be a bit of a
                | mess which one you should use, and it also depends on
                | the architecture.
               | 
               | Time will tell if there's a de facto minimum set of
               | extensions for desktop RISC-V. Let's hope so, but it
               | isn't guaranteed.
        
         | socialdemocrat wrote:
          | Because dedicated instructions suck up valuable encoding
          | space, and the more instructions you have, the more
          | instructions you have which potentially become obsolete
          | with new advances in microarchitecture.
         | 
         | Not to mention that by sticking with simple single purpose
         | instructions you make the CPU easier to teach to students. That
         | is after all one of the goals of RISC-V in addition to creating
         | a good ISA for industry.
         | 
         | Have we learned nothing about why we abandoned CISC in the
         | first place? Those CPUs got riddled with instructions that
         | never got used much.
        
           | google234123 wrote:
           | With every node shrinkage those legacy instructions take up
           | less space
        
             | bigcheesegs wrote:
             | Encoding space, not die space.
        
       | stonogo wrote:
       | It doesn't matter how convincing the sales pitch is when the
       | product is not actually for sale.
       | 
       | One thing ARM and x86 got right that SPARC and POWER got wrong is
        | widely available machines at reasonable prices. All the
       | 'being right' in the world won't help if developers need a five-
       | figure hardware budget to port to your platform. VMs don't cut it
       | for bringup.
        
         | johndoe0815 wrote:
         | At least PowerPC machines were available for reasonable prices
         | from Apple for about a decade - and Linux was quite well
         | supported in addition to OS X. But with Motorola's loss of
         | interest in the PC and server market and IBM's focus on
         | processors for consoles, there was no future for Apple in the
         | growing mobile market. After all, we're still waiting for the
         | G5 Powerbook :).
        
         | BirAdam wrote:
         | $17 64 bit RISC-V https://linuxgizmos.com/17-sbc-runs-linux-on-
         | allwinner-d1-ri...
         | 
         | $29 64 bit RISC-V in the same form factor as an RPi CM3
         | https://www.clockworkpi.com/product-page/copy-of-clockworkpi...
         | 
         | If you never look for it, you will believe it doesn't exist.
         | 
         | I was very skeptical of ARM back in the day thinking that it
         | was great for crappy little iTrinkets and Androids but not for
         | "real computing". I was clearly wrong. I was very skeptical of
         | RISC-V until I recently heard Jim Keller explain why RISC-V has
         | a bright future. He was rather convincing. This is especially
         | true given his track record of straight-up magical results.
         | Looking at different RISC-V machines, I think that the greatest
         | advantage is that it is simple and can therefore be optimized
         | more easily than complex designs, and due to being open, it has
         | very low cost which will encourage more eyes trying more and
         | different optimizations.
         | 
         | EDIT: Link to Jim Keller interview
         | https://www.anandtech.com/show/16762/an-anandtech-interview-...
        
           | nnx wrote:
           | > I recently heard Jim Keller explain why RISC-V has a bright
           | future
           | 
           | Would like to hear it too. Can you share a link?
        
             | BirAdam wrote:
             | Updated my response to include a link to the transcription.
             | The audio/video is here:
             | https://www.youtube.com/watch?v=AFVDZeg4RVY
             | 
             | It's actually important (if you're not an engineer) to
             | listen to the whole thing, because he drops knowledge all
             | over the place.
        
           | tsmi wrote:
            | I agree mostly with Keller's take, but I think he left
            | out one key factor: the quality of the software toolchain.
           | 
           | The x86 tool chains are amazing. They're practically black
           | magic in the kinds of optimizations they can do. Honestly, I
           | think they're a lot of what is keeping Intel competitive in
           | performance. ARM tool chains are also very good. I think
           | they're a lot of the reason behind why ARM can beat RISCV in
           | code space and performance on equivalent class hardware
           | because honestly, like Keller says, they're not all that
           | different for common case software. But frankly x86 and ARM
           | toolchains should dominate RISCV when we just consider the
           | amount of person-hours that have been devoted to these tools.
           | 
           | So for me the real question is, where are the resources that
           | make RISCV toolchains competitive going to come from (and
           | keep in mind x86 and ARM have open source toolchains too)?
           | And, will these optimizations be made available to the
           | public?
           | 
            | If we see significant investment in the toolchains from
            | the likes of Google, Apple and nVidia, or even Intel, ARM
            | needs to be really worried.
        
             | ansible wrote:
              | I don't know that such a heavy investment in the
              | toolchains for RISC-V is actually needed.
             | 
             | If you look at generated code, it seems fairly
             | straightforward. There aren't a lot of tricks or anything.
        
             | BirAdam wrote:
             | I think the serious investment will be from Intel, Apple
             | (with LLVM), and possibly Microsoft (into the GCC/Linux
             | ecosystem).
        
           | pjmlp wrote:
           | ARM has proven their place for real computing on Newton OS
           | and Acorn Archimedes, no need to prove it again on crappy
           | little iTrinkets and Androids.
           | 
          | Where is a RISC-V doing "real computing" on an Acorn
          | Archimedes-like personal computer?
        
         | rwmj wrote:
         | There's lots of RISC-V hardware these days, from embedded RV32
         | chips up to machines you can run Linux on. It's nothing at all
         | like SPARC/POWER.
        
       | jfkimmes wrote:
       | I get "To keep reading this story, get the free app or log in.
       | (With Facebook or Google)" on mobile.
       | 
       | No thanks, Medium. These dark patterns crop up everywhere
       | lately...
        
         | socialdemocrat wrote:
          | It is so that authors like me, writing on Medium, have a way
          | of getting paid. There is a need for both paid and free
          | content.
         | But reality is that you cannot produce quality content if
         | everything has to be free. Advertisement is one solution, but
         | one not without its own serious drawbacks.
         | 
          | Medium is like a magazine with a very large number of
          | journalists whom it pays to write for it. Naturally it
          | needs to charge subscribers to make an income.
        
         | Eduard wrote:
         | https://archive.is/BqS0n
        
         | math-dev wrote:
         | That's a shame...it was the most annoying thing about Quora to
         | me.
         | 
         | As a Medium writer, I'm annoyed now! They already stopped
         | paying me my ~0-10$ per month because I refused to beg everyone
         | to get to their new minimum 100 followers requirement for
         | getting paid.
        
         | aw1cks wrote:
         | https://scribe.rip
        
         | throwaway81523 wrote:
         | 12ft.io got past that for me.
        
       | btdmaster wrote:
       | https://scribe.rip/addressing-criticism-of-risc-v-microproce...
        
       | zozbot234 wrote:
       | > Every 32-bit word in the instruction cache will contain either
       | a 32-bit uncompressed instruction or two 16-bit compressed
       | instructions. Thus everything lines up nicely.
       | 
       | This is not really accurate AIUI, since the RISC-V C extension
       | allows 32-bit insns to be 16-bit aligned. (This would also happen
       | if 48-bit insns were enabled by some other future extension).
       | It's nonetheless a lot simpler than whatever x86 has to do, since
       | insn length is given by a few well-defined bits in the insn word.
        
       | ribit wrote:
       | I have difficulty following the points the author is trying to
       | make.
       | 
        | - Even with instruction compression, the type of code they
        | present will take more space than, say, Aarch64.
        | 
        | - The entire section on conditional execution doesn't make
        | any sense. Conditional execution is bad, we know it; that's
        | why modern ARM does not have conditional execution. Overall,
        | the author's insistence on comparing RISC-V to the
        | practically obsolete ARMv7, when ARMv8 has been available for
        | over a decade, is... odd.
        | 
        | - Regarding SIMD... it's a very complex topic, but
        | personally, I don't see any fundamental problem with a
        | vector-style ISA. I think it's a great way of allowing
        | scalable software. But a vector ISA does not replace basic
        | SIMD, as they solve different problems. Vector stuff is great
        | for throughput, SIMD is great for latency. There are many
        | tasks such as geometry processing, modern data structures
        | etc. where fixed-size 128-bit SIMD is an excellent building
        | block. That's why ARM has both NEON and SVE2; the latter does
        | not make the former obsolete. And that bit about GPUs and how
        | they are not good for vector processing... not even sure how
        | to comment on it. Also, at the end of the day, specialised
        | devices will vastly outperform any general-purpose CPU
        | solution. That's why we see, say, Apple M1 matrix
        | accelerators delivering matmul performance on par with
        | workstation CPU solutions, despite using a fraction of the
        | power.
       | 
       | Overall, my impression is that the article is grasping at straws,
       | ignores modern technology and ultimately fails to deliver. I
       | also remain unconvinced by the initial premise that RISC-V
       | follows the principle "not painting yourself into a corner due to
       | choices which have short term benefit". I do think that choices
       | like keeping instructions as simple as possible (even though it
       | makes expression of common patterns verbose), avoiding flags
       | registers, disregarding SIMD etc. could be characterised as
       | "painting oneself into a corner".
       | 
       | A usual disclaimer: I do think that RISC-V is a great
       | architecture for many domains. Simple low-power/low-cost
       | controllers, specialised hardware, maybe even GPUs (with
       | extensions) -- the simplicity and openness of RISC-V makes it a
       | great point of entry for basically anyone and invites
       | experimentation. I just don't see much merit for RISC-V in
       | general-purpose high-performance consumer computing
       | (laptop/desktop). In this space RISC-V does not have any notable
       | advantages, but it does have potential disadvantages (e.g. code
       | density and lack of standard SIMD -- yet). Most importantly, the
       | CPU microarchitecture becomes the decisive factor, and designing
       | a fast general-purpose CPU requires a lot of expertise and
       | resources. It's not something that a small group of motivated
       | folk can realistically pull off. So all the great things about
       | RISC-V simply do not apply here.
        
         | audunw wrote:
         | Why do you list code density as a potential disadvantage? With
         | changes to the ISA that have already been approved, RISC-V will
         | have the best code density of any significant ISA in real-world
         | code.
         | 
         | The article touched on this briefly, so it's odd that you
         | would make this claim without a source. I know there are some
         | outdated benchmarks where it's slightly worse than Thumb, for
         | instance. But then Thumb isn't relevant for desktop CPUs.
         | 
         | The downside for RISC-V for high-end desktop/laptop is the
         | lack of a large commercial backer (someone like Apple could
         | pull it off, but they've clearly bet on ARM, which was the
         | right choice since RISC-V was far from ready). The lack of the
         | huge legacy of toolchains and software built around x86 and
         | ARM is also obviously a huge disadvantage.
         | 
         | But you could have said the same about ARM back in the day.
         | The thing is, I'm not sure the advantages of RISC-V are big
         | enough for it to take over all of ARM's markets the way ARM
         | has the potential to do with x86's.
        
         | socialdemocrat wrote:
         | > And that bit about GPUs and how they are not good for vector
         | processing... not even sure how to comment on it. Also, at the
         | end of the day, specialised devices will vastly outperform any
         | general-purpose CPU solution. That's why we see, say, Apple M1
         | matrix accelerators delivering matmul performance on par with
         | workstation CPU solutions, despite using a fraction of power.
         | 
         | Of course GPUs are good for vector processing compared to a
         | general-purpose CPU. That was not the point at all. The point
         | is that, unlike older architectures such as the Cray, they
         | were not designed specifically for general-purpose vector
         | processing but for graphics processing. That is why solutions
         | such as SOC-1, built specifically for general-purpose vector
         | processing, can compete with graphics cards made by giants
         | like Nvidia.
         | 
         | The article is talking about adding vector processing both to
         | RISC-V chips aimed at general purpose processing as well as to
         | specialized RISC-V cores which are primarily designed for
         | vector-processing. SOC-1 is an example of this. It has 4
         | general purpose RISC-V cores called ET-Maxion, while also
         | having 1088 small ET-Minion cores made for vector processing.
         | However these are still RISC-V cores, rather than some graphics
         | card SM core.
         | 
         | I don't get your argument about SIMD being great for latency.
         | RISC-V requires that vector registers are at minimum 128 bits,
         | so you can use RVV as a SIMD instruction set with 128-bit
         | registers if you want.
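         | 
         | To make the throughput-vs-latency distinction concrete: a
         | plain loop like the one below is what both camps have to
         | handle. A fixed SIMD ISA such as NEON compiles it to 128-bit
         | (4 x i32) chunks plus a scalar tail, while RVV strip-mines it
         | with vsetvli so the same binary adapts to any hardware vector
         | length. (Illustrative C, not code from the article:)

```c
#include <stddef.h>
#include <stdint.h>

/* A compiler targeting NEON vectorizes this in fixed 4 x i32 chunks
   plus a scalar tail; one targeting RVV emits a vsetvli strip-mining
   loop, so the chunk size is chosen by the hardware at run time. */
static void axpy_i32(int32_t *y, const int32_t *x, int32_t a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```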
        
         | pjmlp wrote:
         | It isn't odd at all; this is the usual kind of narrative when
         | selling stuff to an audience that has only a passing knowledge
         | of the issues.
         | 
         | So anyone who isn't deep into the ARM architecture will indeed
         | buy into the arguments being made, as they can't tell
         | otherwise.
        
         | socialdemocrat wrote:
         | Author here: I have tried to clarify this better in the update.
         | The point is that I am talking about AArch32 and AArch64 in the
         | article. Yes, everybody has been going away from conditional
         | instructions, because they don't work well in Out-of-Order
         | superscalar processors, and they are pointless when you have
         | good branch predictors.
         | 
         | HOWEVER, an argument in the ARM camp is that they are very
         | useful when dealing with smaller chips. Remember ARM and RISC-V
         | compete in the low range as well as the high range. AArch32 is
         | not obsolete. It still has uses. There have been ARM fans
         | claiming that conditional instructions make ARM superior for
         | simple chips. The argument here was that RISC-V has ways of
         | dealing with simple in-order chips as well.
        
           | Someone wrote:
           | > There have been ARM fans claiming that conditional
           | instructions make ARM superior for simple chips.
           | 
           | For those following this only from the sidelines, it would
           | help strengthen the article if the article has links to such
           | claims. I couldn't easily find them, and would be curious as
           | to their age, given that, reading https://en.wikipedia.org/wi
           | ki/Predication_(computer_architec..., ARM has made
           | substantial changes to conditional execution a few times
           | since 1994 (over 25 years ago); Thumb (1994) dropped them,
           | Thumb-2 (2003) replaced them by, if I understand it
           | correctly, an instruction "skip the next 4 instructions
           | depending on flags", and ARMv8 replaced them by conditional
           | select.
           | 
           | (In general, providing links to articles claiming each
           | proclaimed myth to be true would strengthen this article. I
           | think I've only ever read about #1, and not with as strong a
           | wording as "bloats")
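           | 
           | The end point of that progression, conditional select, is
           | easy to picture in C: it computes x = cond ? a : b with no
           | branch. A branch-free sketch of what a single AArch64 csel
           | performs (RISC-V would typically use a short forward branch
           | here instead); illustrative code, not from the article:

```c
#include <stdint.h>

/* Branch-free select: what AArch64's csel does in one instruction.
   mask is all-ones when cond is nonzero, all-zeros otherwise. */
static int64_t select64(int cond, int64_t a, int64_t b)
{
    int64_t mask = -(int64_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}
```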
        
             | socialdemocrat wrote:
             | If there were some good articles to point to, I would.
             | However, I don't want to single out people ranting against
             | RISC-V. This is more about opinions which keep popping up
             | here on Hacker News, Twitter, Quora and other places. I
             | don't want this discussion to be turned personal.
             | 
             | It should be possible to discuss these opinions without
             | singling out anyone.
             | 
             | I am, however, talking about claims put forth after ARMv8.
             | The argument here has basically been this: both ARM and
             | RISC-V aim to cover both the low end and the high end. Some
             | ARM fans think that by not including conditional
             | instructions RISC-V really only works for high-end CPUs.
             | The idea here is that AArch32 would be better than RV32 for
             | lower-end chips.
        
       | erosenbe0 wrote:
       | What are these r registers for AArch64? The author likely hasn't
       | bothered to run any of the examples through an actual 64-bit ARM
       | assembler. Dubious.
        
       | TazeTSchnitzel wrote:
       | The conditional execution section makes no mention of the fact
       | AArch64 doesn't have this feature either, and bizarrely lists a
       | "64-bit" ARM code example that isn't. This doesn't inspire
       | confidence in the author's understanding.
        
         | fay59 wrote:
         | All of the ARM assembly is wrong. AArch64 uses "x" or "w" to
         | identify general purpose registers, "r" isn't a thing.
        
         | solarexplorer wrote:
         | In the same section the part about the SiFive optimization is
         | also misleading. The goal of the optimization is obviously to
         | avoid interrupting the instruction fetch. But he makes it sound
         | like the goal was to reduce instruction count by fusing two
         | instructions to get a single monster op with five (!) register
         | operands. That just doesn't make sense.
        
         | pxeger1 wrote:
         | I thought it was only supposed to be pseudo-assembly?
        
       | audunw wrote:
       | There are so many armchair specialists when it comes to criticizing
       | RISC-V. I've seen people claim an ISA is better because it has
       | branch delay slots... which seems clever to someone who knows
       | enough technical details about CPUs to understand what the
       | benefit of that feature is ("free" instruction execution for
       | every branch taken), but is a terrible idea for a truly scalable
       | ISA (huge PITA for out-of-order architectures if I've understood
       | correctly)
       | 
       | I'm sure there are some bad decisions in RISC-V, but I've yet to
       | see one that isn't in the process of being remedied. There was a
       | good argument that the lack of a POPCOUNT instruction was bad,
       | but I think that's being added soon.
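       | 
       | For reference, without a hardware POPCOUNT a compiler falls back
       | to a bit-twiddling sequence along these lines (the classic
       | branch-free version, roughly what __builtin_popcount lowers to
       | on targets lacking the instruction):

```c
#include <stdint.h>

/* Classic branch-free population count: pairwise sums of 1-bit fields,
   then 2-bit and 4-bit fields, then a multiply to fold the four bytes. */
static unsigned popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0f0f0f0fu;                 /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                   /* fold bytes */
}
```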
        
         | okl wrote:
         | First you complain about "armchair specialists", then you make
         | a blanket assertion about branch delay slots. Believe it or
         | not, there are ISAs for applications where branch delay slots
         | are useful, for example TMS320 DSPs with up to 5 delay slots.
        
           | __s wrote:
           | You shouldn't assess RISC-V from the viewpoint of a single
           | chip. It's an ISA first. Branch slots are highly target
           | specific
        
           | mst wrote:
           | My very much non-expert understanding was that branch delay
           | slots can be a neat optimisation when enough details of the
           | target processor design are known at ISA design time to pick
           | the 'right' number of slots.
           | 
           | OTOH if one is designing an ISA that will have a bunch of
           | different implementations - and this includes later
           | implementations wanting to be ASM compatible with earlier
           | ones - they tend to eventually become a footgun for the
           | processor designers. (if I remember correctly and didn't
           | completely misunderstand, MIPS' branch delay slots were
           | absolutely a neat optimisation for the early models, but
           | when MIPS went to a deeper pipeline for later chips, the
           | slots required a bunch of extra design effort to maintain
           | compatibility with, without being helpful anymore)
           | 
           | (explicit disclaimer that I'm an armchair amateur here, so if
           | you're a fellow non-expert reading this comment before it
           | attracts better informed replies please default to joining me
           | in the assumption that I've made at least one massive error
           | in what I'm saying here)
        
       ___________________________________________________________________
       (page generated 2022-03-20 23:01 UTC)