[HN Gopher] An Open-Source FPGA-Optimized Out-of-Order RISC-V Soft Processor
       ___________________________________________________________________
        
       An Open-Source FPGA-Optimized Out-of-Order RISC-V Soft Processor
       (2019) [pdf]
        
       Author : varbhat
       Score  : 202 points
       Date   : 2021-01-05 12:05 UTC (10 hours ago)
        
 (HTM) web link (www.rsg.ci.i.u-tokyo.ac.jp)
 (TXT) w3m dump (www.rsg.ci.i.u-tokyo.ac.jp)
        
       | ecesena wrote:
       | Is anyone working on low power open risc-v implementations?
       | (Ideally including manufacturing, i.e. a physical device that I
       | could buy/build on top of)
        
         | nereye wrote:
         | One of the two ULP (Ultra Low Power) co-processors on ESP32-S2
         | chips is based on RISC-V. Boards using ESP32-S2 are available
         | for relatively little, e.g. see
         | https://www.mouser.com/ProductDetail/Espressif-
         | Systems/ESP32....
         | 
         | It's admittedly a niche use-case but it's an option for playing
         | with RISC-V hardware...
        
          | blacksmith_tb wrote:
          | Seeed has some RISC-V Arduino-alikes [1], but my memory is
          | that they have teased a more substantial, Linux-capable board
          | for Q1.
         | 
         | 1: https://www.seeedstudio.com/SeeedStudio-GD32-RISC-V-Dev-
         | Boar...
        
       | londons_explore wrote:
       | The gif of the Konata pipeline visualizer seems to show pretty
       | much one instruction per cycle most of the time... Many parts of
        | the trace show as low as 0.2 instructions/cycle.
       | 
       | Wouldn't we expect much higher numbers (more parallelism)
       | considering the number of frontend/backend pipelines?
        
        | xiphias2 wrote:
        | I guess somebody will implement the 64-bit GC extensions to
        | run Linux on it.
        
         | chrisseaton wrote:
         | GC extensions? What does 'GC' stand for?
        
            | the_duke wrote:
            | G is short for IMAFD (the base integer set plus
            | multiplication, atomics, and single/double-precision
            | floats) and C for compressed instructions.
           | 
           | See https://en.wikipedia.org/wiki/RISC-V#Design .
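For the curious, the ISA-string shorthand can be expanded mechanically. A minimal sketch in Python (the helper and its output format are my own invention, not from any RISC-V tool):

```python
# Expand a RISC-V ISA string like "RV64GC" into its component extensions.
# "G" is shorthand for IMAFD plus the Zicsr and Zifencei extensions.
G_EXPANSION = ["I", "M", "A", "F", "D", "Zicsr", "Zifencei"]

def expand_isa(isa: str) -> list[str]:
    """Split e.g. 'RV64GC' into its extension list, expanding 'G'."""
    assert isa.upper().startswith("RV")
    width_end = 4  # skip 'RV' plus the two-digit width, e.g. 'RV64'
    exts = []
    for ch in isa[width_end:].upper():
        if ch == "G":
            exts.extend(G_EXPANSION)
        else:
            exts.append(ch)
    return exts

print(expand_isa("RV64GC"))
# ['I', 'M', 'A', 'F', 'D', 'Zicsr', 'Zifencei', 'C']
```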
        
             | chrisseaton wrote:
             | Ah sorry I get it - ISA extensions - I thought it was
             | _compiler_ extensions.
        
        | choletentent wrote:
        | I am glad they are using SystemVerilog. It is hard for me to
        | understand why SiFive chose Chisel as its RTL language. I
        | think that quietly slows down RISC-V adoption. I honestly
        | tried to understand the advantages of Chisel, but I can not
        | see any. There is an answer on Stack Overflow regarding Chisel
        | benefits, and it is just embarrassing [1].
       | 
       | [1] https://stackoverflow.com/questions/53007782/what-
       | benefits-d...
        
         | unionpivo wrote:
         | Please note I am software dev, not a hardware guy, and I just
         | got my first FPGA during holidays, and am just beginning to
         | play with it.
         | 
         | > There is an answer on Stack Overflow regarding Chisel
         | benefits, it is just embarrassing [1].
         | 
          | I don't understand what is embarrassing about the answer? As
          | a software guy, the above answer makes sense. Some problems
          | you want to use C (or similar) for, and some problems you
          | want to use a scripting language for, and then again
          | sometimes the right tool is Erlang, Rust or Go...
         | 
         | But like I said, that's my software guy perspective, so I am
         | wondering what I missed?
        
           | alain94040 wrote:
           | I just read the answer as well. It's not "embarrassing", but
           | it basically doesn't answer the question. Instead, it argues
           | that the question is equivalent to asking what's the point of
           | Python vs. C.
           | 
           | So in the end, the answer doesn't provide any specific answer
           | regarding SystemVerilog and Chisel. All I found is one
           | mention of negotiating parameters, which Verilog doesn't do.
           | I would have loved to hear a lot more about examples of what
           | Chisel makes more convenient than SystemVerilog.
        
         | ljhsiung wrote:
         | Google built one of their TPUs in Chisel [1].
         | 
          | TL;DW: Chisel is beautiful/fun to write in, with a definite
          | productivity bonus, but it has a pretty large learning curve
          | and a much greater verification cost, partly because it's an
          | HLS (most have that problem) and partly because of the lack
          | of tooling. Both of those costs are gradually being reduced
          | (though in my opinion, not enough to keep verification from
          | being a PITA).
         | 
         | [1] https://www.youtube.com/watch?v=x85342Cny8c
        
           | jeff-davis wrote:
           | What is an HLS?
        
             | ris wrote:
             | High Level Synthesis
        
          | tverbeure wrote:
          | I wrote a long blog post about the VexRiscv RISC-V CPU and
          | how its design methodology is radically different from that
          | of traditional RTL languages.
         | 
         | The VexRiscv is written in SpinalHDL, which is a close relative
         | of Chisel.
         | 
         | The advantage of SpinalHDL/Chisel is that it supports plug and
         | play configurability that's impossible with languages like
         | SystemVerilog or VHDL.
         | 
         | You can read about it here:
         | https://tomverbeure.github.io/rtl/2018/12/06/The-VexRiscV-CP...
         | 
         | That said: there must be at least 50 open source RISC-V cores
         | out there, and only a small fraction is written in Chisel. I
         | don't see how the use of Chisel has held back RISC-V in any
         | meaningful way.
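As a loose illustration of the plug-and-play point, and nothing more: in a Chisel/SpinalHDL-style generator, the design is an object graph built by host-language code, so optional units can be composed and parameterized before anything is lowered to RTL. A toy sketch in plain Python with entirely invented names:

```python
# Toy "hardware generator" in the spirit of Chisel/SpinalHDL (all names
# invented): optional units are plugged in at elaboration time, which is
# the kind of configurability that is hard to express in plain Verilog.
class Plugin:
    def ports(self):
        return []

class MulDivPlugin(Plugin):
    def ports(self):
        return ["mul_req", "mul_rsp"]

class DebugPlugin(Plugin):
    def ports(self):
        return ["dbg_halt"]

class Cpu:
    def __init__(self, xlen=32, plugins=()):
        self.xlen = xlen
        self.plugins = list(plugins)

    def elaborate(self):
        # A real generator would emit a netlist or Verilog here; this
        # sketch just collects the top-level ports each plugin adds.
        ports = ["clk", "reset", f"ibus[{self.xlen}]", f"dbus[{self.xlen}]"]
        for p in self.plugins:
            ports += p.ports()
        return ports

small = Cpu(xlen=32, plugins=[MulDivPlugin()])
debug = Cpu(xlen=64, plugins=[MulDivPlugin(), DebugPlugin()])
print(small.elaborate())
print(debug.elaborate())
```

The point of the pattern is that adding or removing a unit is a one-line change to an ordinary program, not a rewrite of module ports and wiring.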
        
          | artemonster wrote:
          | I really want to bonk on the head anyone praising
          | SystemVerilog. This collective undertaking of corrupt
          | committee members locked us into this horrible language and
          | these grossly outdated tools, forever. There is not the
          | slightest bit of good in that language, neither for design
          | nor for verification.
        
          | nickik wrote:
          | There are tons of people using SystemVerilog for RISC-V. The
          | majority of the work is done with SystemVerilog, not Chisel.
          | Having lots of cores in Chisel, SystemVerilog and many other
          | languages (VHDL, Bluespec and so on) is a huge benefit for
          | RISC-V.
          | 
          | SiFive values programmability above everything, and for that
          | Chisel is pretty clearly an advantage.
        
       | CodesInChaos wrote:
       | Is there a quantification of "high performance"?
       | 
       | It will obviously be much lower than the IPC of an actual high
       | performance CPU (modern x86-64), but how big is the difference?
       | And how does it compare to typical mobile processors?
        
         | NieDzejkob wrote:
         | I believe the closest measurement would be in Table VII on page
         | 8 of the paper:
         | http://www.rsg.ci.i.u-tokyo.ac.jp/members/shioya/pdfs/Mashim...
        
           | jhallenworld wrote:
           | RV32IM, 2.04 DMIPS/MHz, 95.3 MHz, 15K LUTs, 8K LUT FFs, 6
           | BRAMs on Zynq 7020.
           | 
            | I'm curious if this will work on the Lattice ECP5 - I'm
            | not really sure if Synplify supports SystemVerilog to the
            | same degree as the Xilinx tools. The ECP5 is interesting
            | because it's a $10 FPGA.
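For scale, the Table VII figures convert to absolute Dhrystone throughput with simple arithmetic (the 3.93 DMIPS/MHz BOOMv3 figure appears downthread):

```python
# Back-of-the-envelope Dhrystone throughput from the paper's Table VII.
dmips_per_mhz = 2.04   # RSD, RV32IM
fmax_mhz = 95.3        # on a Zynq 7020

rsd_dmips = dmips_per_mhz * fmax_mhz
print(f"RSD: {rsd_dmips:.0f} DMIPS")            # RSD: 194 DMIPS

# BOOMv3 claims 3.93 DMIPS/MHz (per the SonicBOOM paper).
boom_v3_ratio = 3.93 / dmips_per_mhz
print(f"BOOMv3 per-clock advantage: {boom_v3_ratio:.2f}x")   # 1.93x
```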
        
              | mhh__ wrote:
              | At a glance I think it would fit, although I might be
              | thinking of the bigger ECP5s, which are in the same
              | price bracket as some Zynqs.
        
            | childintime wrote:
            | "In comparison to the BOOM, the RSD achieved 2.5-times
            | higher DMIPS and 1.9-times higher DMIPS/MHz", which should
            | be comparable to ARM cores from around 4 years ago.
        
             | Veedrac wrote:
             | Note that BOOM is now on v3[1], which claims to be 3.93
             | DMIPS/MHz, or about twice RSD.
             | 
             | [1]
             | https://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-
             | CAR...
        
          | mhh__ wrote:
          | This has 5 backend pipelines, which I would say is getting
          | there speed-wise. I think AMD Zen has 10 execution pipelines
          | to play with, although the exact structure is usually not
          | commented upon, so I don't know how long they are.
          | 
          | It's 32-bit, so it's not necessarily desktop-class, but for
          | scale, this should blow a microcontroller out of the sky.
        
           | phkahler wrote:
           | I've been wondering if it could make sense to put a small ALU
           | with each register (when they are not virtual/renamed but
           | maybe then too?). This would allow instructions that use the
           | destination register as a source to only use one read port,
           | potentially allowing higher IPC. Has anyone looked into this
           | and if so what did the analysis show?
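One way to frame the question is a toy read-port model: instructions of the form "rd = rd OP rs" would need only one port from the shared register file if the destination operand lives next to its per-register ALU. The instruction mix below is invented purely for illustration:

```python
# Toy model: average register-file read-port demand with and without a
# per-register ALU that reads "rd" locally for destructive two-operand
# ops ("rd = rd OP rs"). The 30% destructive fraction is invented.
def ports_saved(frac_destructive, reads_per_instr=2):
    """Return (baseline, with_local) average reads per instruction."""
    baseline = reads_per_instr
    with_local = reads_per_instr - frac_destructive  # one read stays local
    return baseline, with_local

base, opt = ports_saved(frac_destructive=0.3)
print(f"avg reads/instr: {base} -> {opt:.1f}")
```

Whether the saved port pays for the extra ALUs, muxing, and renaming complications is exactly the analysis the comment asks about.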
        
           | londons_explore wrote:
           | If I understand correctly, most designs effectively #define
           | the number of frontend and backend pipelines. That means you
           | are free to make as many as you like, for as much performance
           | as you like, but some parameters might scale _very_ badly
           | outside what the core was designed for. For example, I would
           | imagine the number of gates and power use to go up to
           | unfeasibly high levels above the quoted numbers.
        
             | Symmetry wrote:
             | It's a bit harder than that. You have to worry about
             | forwarding results from one pipeline to another bypassing
             | the registers if you want to avoid having to deal with
             | stalls waiting for results. The transistor cost of the
             | bypass network grows as the square of the number of
             | pipelines so it can be pretty significant, potentially
             | being larger than the cost of the execution pipelines
             | themselves.
             | 
             | Many modern designs aren't fully bypassed and involve
             | clusters of pipelines to manage this. IBM's recent Power
             | chips and Apple's ARM cores are particularly known to do
             | this.
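The quadratic growth falls straight out of counting forwarding paths; a minimal sketch, assuming a fully bypassed machine with 2-operand execution units:

```python
# Count forwarding (bypass) paths in a fully bypassed machine: every
# unit's result must be muxable into each source operand of every unit,
# so the path count grows as ~2*N^2 for 2-operand ALUs.
def bypass_paths(n_units, operands_per_unit=2):
    return n_units * n_units * operands_per_unit

for n in (2, 5, 10):
    print(n, bypass_paths(n))
```

Doubling the unit count quadruples the bypass network, which is why clustering (only bypassing fully within a cluster) becomes attractive at higher widths.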
        
               | sitkack wrote:
               | Does optimal pipeline topology vary with the specific
               | workload or is more dependent on the instruction set
               | semantics and result visibility?
               | 
               | Is this something that could be automatically optimized
               | via simulation?
               | 
                | Is it something that could be made dynamic and shared
                | across a bunch of schedulers, so that cores could move
                | between big/little during execution?
        
               | Symmetry wrote:
               | It's highly dependent on both workload and instruction
               | set.
               | 
               | I'm sure it could be automatically optimized in theory,
               | even without the solution being AI complete, but I don't
               | think we have any idea how to do it right now.
               | 
               | No, not unless you're reflashing an FPGA. You'd have
               | better luck sharing subcores for threadlets I think.
        
           | gchadwick wrote:
           | > This has 5 backend pipelines, which I would say is getting
           | there speed-wise.
           | 
           | It's pretty easy to slap down pipelines. What is far harder
           | is keeping them all fed and running without excessive
           | stalling and bubbles whilst avoiding killing your max
           | frequency and blowing through your area and power goals.
        
            | swiley wrote:
            | Do you really want pipelining on a microcontroller though?
        
              | mhh__ wrote:
              | Why not? I think the Cortex-M3 has a 3-stage pipeline;
              | obviously I would not expect the $0.01 Chinese ones to
              | have one.
             | 
              | It's worth saying here that a big OoO CPU is pipelined
              | differently from a small/old RISC processor. Even in
              | discussions about compiler optimizations, people still
              | use terminology like "pipeline stall", but a modern CPU
              | has a pipeline that handles fetching an instruction
              | window, finding dependencies, doing register renaming
              | and execution; that pipeline is not like an old
              | IF->ID->EX->MEM->WB - it won't stall in the same way an
              | original (P5) Pentium did. The execution pipes
              | themselves have a more familiar structure.
        
                | gmueckl wrote:
                | Not sure about the Cortex-M3, but I can confirm that
                | the Cortex-M4 has a pretty basic 3-stage in-order
                | pipeline that gets flushed on branches. So unless
                | there are caches between the core and memory (I think
                | some STM32s have that), that CPU is still trivially
                | deterministic.
        
                | makapuf wrote:
                | Yes, M4-based STM32s have the ART Accelerator (a flash
                | cache), which makes them less deterministic (but much
                | faster) for flash access. See
                | https://www.st.com/en/microcontrollers-
                | microprocessors/stm32...
        
         | zurn wrote:
         | It's targeting small FPGAs, comparisons to full custom designs
         | wouldn't be very fair.
        
            | adwn wrote:
            | "High-performance" is relative, so they need to compare it
            | to _something_.
        
             | chungus_khan wrote:
             | I would imagine it is high-performance within its own
             | domain: soft microprocessors. There isn't much danger of
             | something like this being used outside of domains where it
             | is relevant, so just calling it "high-performance" gets the
             | message across fairly well to anyone who would actually
             | need it.
             | 
             | I can't comment on how well it performs myself without
             | testing it, but a quick skim of the paper reveals that it
             | apparently performs well in comparison to its rival open-
             | source out-of-order soft processor.
             | 
             | Comparing soft processors to other soft processors is
             | fairly easy if they can both run on the same hardware, but
             | comparing them to real silicon is inherently kind of
             | meaningless, as they don't really compete at the moment,
             | and the performance of the design in absolute terms will
             | depend on the FPGA it is implemented on. Nonetheless, you
             | could compare the raw numbers presented in the paper for
             | curiosity's sake and see that indeed, it isn't very fast
             | compared to modern silicon processors.
        
               | adwn wrote:
               | Sorry, what I meant was: They need to present benchmark
               | results (preferably against other soft-core processors,
               | of course) to claim the "high-performance" attribute. A
               | single row of a synthetic benchmark against only two
               | other contenders is a little ... lacking.
        
          | Symmetry wrote:
          | It seems broadly comparable to an ARM Cortex-A9 in width and
          | reorder depth, but it seems to be using a lot more pipeline
          | stages to do that. Probably that's because they had fewer
          | engineer-years to invest in the design than ARM did. It
          | looks like it's using pure automatic synthesis, so the clock
          | speed will tend to be lower on equivalent process nodes too.
          | The big question is how accurate the prefetcher and branch
          | predictors are. It's entirely possible for a 2-issue out-of-
          | order core to be slower than a 2-issue in-order core if the
          | latter is significantly better at speculating.
         | 
         | Still, that's a good bit of work they should be proud for
         | putting out there and I hope other people build on it.
         | 
          | EDIT: Oh, wait, they don't mention register renaming. Hmm,
          | well, I guess no speculating over multiple iterations of a
          | loop then.
          | 
          | EDIT2: No, the PDF at the link mentions a rename unit.
         | http://www.rsg.ci.i.u-tokyo.ac.jp/members/shioya/pdfs/Mashim...
        
       | smrxx wrote:
       | All those abbreviations on the block diagram make it very
       | difficult to interpret. A key map in the image would be great, or
       | at least some markdown directly below it.
        
        | dpoochieni wrote:
        | If the FPGA is a closed design, are we really that much better
        | off?
        
         | rwmj wrote:
         | There are fully reverse-engineered FPGAs and open source
         | toolchains to drive them.
         | 
         | The real question is: Is there anything hidden in the silicon?
         | That's something you can only solve by owning your own fab -
         | the US military approach.
        
           | eeZah7Ux wrote:
           | Good luck hiding an effective backdoor in an FPGA. The
           | attacker (the FPGA fab) has no idea of how it's going to be
           | programmed.
        
             | rwmj wrote:
             | The usual thing the military worries about is a "kill
             | switch" (a very unlikely sequence of bits) which disables
             | the hardware completely. The idea is that at the beginning
             | of a war, the kill signal is broadcast by the enemy by
             | every means possible which brings all your electronics to a
             | halt.
             | 
             | This can be hidden in an FPGA - for example attached to the
             | input pins or SERDES - without needing to know anything
             | about the application.
             | 
             | (Article:
             | https://spectrum.ieee.org/semiconductors/design/the-hunt-
             | for...)
        
               | eeZah7Ux wrote:
               | Triggering a malfunction is incredibly easy compared to a
               | proper backdoor. A kill signal could also be injected
               | through side channels e.g. a power line, and the kill
               | mechanism could be implemented in many other
               | semiconductors than an FPGA.
        
            | fest wrote:
            | The argument here usually is: with a fixed silicon chip,
            | the vendor can hide a backdoor in various locations and
            | have it triggered by various events (e.g. a particular
            | sequence of incoming ICMP packets overwriting the first
            | byte of the response with the contents of some register).
            | With an FPGA, the vendor can't really know where a
            | particular register is located, or where incoming packets
            | are processed, as that is highly dependent on the
            | synthesized CPU design and can even be non-deterministic.
           | 
            | This does not mean that there is no way the vendor can
            | backdoor the chip you are getting, but it does narrow the
            | possibilities significantly.
        
          | gmueckl wrote:
          | Even if the FPGA design were fully public, you wouldn't
          | necessarily find the required fab around the corner, so
          | what's there to gain from the FPGA design? If you're seeking
          | trust, you'd need to establish trust in the whole chain:
          | design, mask production, chip fab, transport... that's a
          | tall order.
        
       | mhh__ wrote:
       | Well done to the authors for making a surprisingly readable core
       | for once.
        
       | Symmetry wrote:
       | To push this up from the comments, if you're interested in why
       | this is important or what the authors are trying to do the PDF
       | where they describe their approach and architecture is really
       | interesting.
       | 
       | http://www.rsg.ci.i.u-tokyo.ac.jp/members/shioya/pdfs/Mashim...
        
         | dang wrote:
         | Ok, we've changed the URL to that from https://github.com/rsd-
         | devel/rsd, which doesn't give much background.
        
         | jeff-davis wrote:
         | If I understand this correctly:
         | 
         | Typically, I think of an FPGA as something used to accelerate
         | specialized operations. But sometimes, in the middle of one of
         | these specialized operations, you might want to do something
         | more general, like run a network stack, without returning to
         | the CPU. A soft processor like this allows you to run an
         | ordinary network stack (with ordinary code) inside the FPGA.
         | 
         | Is that right?
         | 
         | I thought one of the things people used FPGAs for was
         | accelerating network stacks, so I don't quite know why you'd
         | want to use a soft processor for that. But it does make sense
         | that you'd want to be able to run ordinary code in an FPGA (as
         | part of a larger FPGA operation that is not ordinary code).
         | 
         | EDIT: Also, I don't understand this statement: "for example,
         | one main compute kernel, which is too complex to deploy on
         | dedicated hardware, is run by specialized soft processors".
         | What do the authors mean "too complex to deploy on dedicated
         | hardware"?
        
            | mindentropy wrote:
            | Generally softcores are used for command and control of
            | the FPGA. For example, if you have a framegrabber and you
            | would like to adjust the shutter speed, fps, etc., you
            | would create a softcore and run normal firmware to set up
            | the registers based on the user's choices.
            | 
            | You can think of the FPGA as an ASIC and the softcore as
            | controlling this ASIC. The hot data path and heavy
            | processing are done in the FPGA, and processing options
            | for the ASIC can be set using the softcore firmware.
        
           | TrueDuality wrote:
           | It's definitely common to accelerate specialized operations
           | without the overhead of a general processor, but it's also
           | possible to effectively use them as a much more flexible
           | microprocessor when you need it.
           | 
            | If you go looking for a microcontroller for your project,
            | you have to choose among what is available. Maybe this
            | microcontroller has two hardware UART interfaces and one
            | SPI interface. If I don't need any UART but instead need a
            | CAN bus interface, that microcontroller won't work for me.
            | Sure, I can bitbang the protocol on GPIO ports, but that
            | uses up a lot of the limited processing power on the
            | microcontroller... Usually that means a more expensive
            | microcontroller.
           | 
           | There is a threshold that you can cross where a small FPGA is
           | cheaper than a microcontroller that has enough pins and
           | processing power for your application. This does come with an
           | additional upfront design cost of also writing (but more
           | often integrating) the soft cores but sometimes that makes
           | sense.
           | 
           | Sometimes peripherals just don't exist at the price point you
           | need. Try and find a microcontroller that has a MMIO
           | controller for under $5 (I probably couldn't do it at under
           | $10 but I haven't gone looking recently), it's rare they're
           | needed but sometimes a design requires one.
           | 
            | There has also been a lot of recent interest in doing
            | formal verification of hardware logic. A lot of
            | microprocessors, and even the full CPU in whatever device
            | you're reading this on, have a lot of undocumented black
            | boxes and undefined behaviours, both of which prevent that
            | verification from being meaningful.
        
           | mng2 wrote:
           | >What do the authors mean "too complex to deploy on dedicated
           | hardware"?
           | 
           | It costs time and effort to translate a software function to
           | HDL/FPGA, so it's not always worth doing. For example you
           | could do TCP/IP in hardware, but unless you have particular
           | performance requirements (say HFT) you're probably better off
           | with a soft core and a tested software TCP/IP stack.
           | 
           | Also each feature translated to hardware takes up space in
           | the fabric. When you crunch numbers on an FPGA, it's ideal if
           | you can lay out the entire sequence of operations as one big
           | pipeline, so you can keep throughput as high as possible.
           | Sufficiently long or complex sequences may not translate
           | efficiently to FPGA.
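The "one big pipeline" point can be made concrete with a simple cycle-count model (depth and item counts invented): once a depth-D pipeline is full, it retires one result per cycle, so N items cost D + N - 1 cycles rather than D*N:

```python
# Throughput of a fully pipelined datapath vs. re-running a sequential
# unit of the same latency for each item.
def pipelined_cycles(depth, n_items):
    return depth + n_items - 1     # fill once, then one result per cycle

def sequential_cycles(depth, n_items):
    return depth * n_items         # each item traverses the whole latency

d, n = 20, 1000
print(pipelined_cycles(d, n), sequential_cycles(d, n))   # 1019 20000
```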
        
            | lizknope wrote:
            | Making a custom chip requires one-time mask fees of up to
            | $30 million for a 5nm chip. If your volume is high then
            | you can amortize that cost. If not, then you probably go
            | to an FPGA. An FPGA has a high per-part cost but no custom
            | mask costs, and you can reprogram it in a few hours or
            | days instead of waiting 2 months for a new custom chip to
            | be manufactured.
           | 
           | A high end FPGA already has hardened CPU blocks, USB and PCIE
           | interfaces, and lots of other things built in. Then it has a
           | large area of generic reconfigurable logic that you can
           | customize to do whatever you want.
           | 
            | This reconfigurable logic will not be as fast as a custom
            | chip, but it is still far faster than software and can be
            | used to implement your own CPU (assuming you don't want to
            | use the hardened CPUs, or got an FPGA without them).
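The amortization trade-off reduces to a break-even volume. In this sketch only the $30 million mask figure comes from the comment above; the per-unit costs are invented placeholders:

```python
# Break-even volume between an FPGA (no NRE, high unit cost) and a
# custom chip (large one-time mask/NRE cost, low unit cost).
# The $10 and $100 unit costs below are illustrative placeholders.
def breakeven_units(nre, asic_unit, fpga_unit):
    assert fpga_unit > asic_unit
    return nre / (fpga_unit - asic_unit)

n = breakeven_units(nre=30e6, asic_unit=10.0, fpga_unit=100.0)
print(f"custom silicon wins above ~{n:,.0f} units")
```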
        
       | oblio wrote:
       | For people in the industry: how likely are we to get RISC-V
       | servers/VMs/laptops/desktops in the next 5-10 years? You know, go
       | on a PC Part Picker and assemble a RISC-V desktop, for example.
        
         | jagger27 wrote:
         | Can you go on PC Part Picker and build an Arm desktop yet?
        
            | oblio wrote:
            | I haven't tried, I assume not. But aren't hardware
            | fashions changing rapidly now, compared to 2005-2020, when
            | x86 was everywhere?
        
            | anfilt wrote:
            | Part of ARM's licensing generally prevents socketed CPU
            | chips.
        
           | moyix wrote:
           | Sure, order a Raspberry Pi 4 and an external USB SSD ;)
        
              | AceJohnny2 wrote:
              | So, fun fact: I was recently curious how the RasPi4
              | compared to my 15-year-old Dell Precision M65 (Intel
              | Core Duo T2400).
              | 
              | I ran the sysbench CPU test on each, and the M65
              | _trounced_ the RasPi4, being over 3x as fast in single-
              | core (and about 1.5x as fast in multicore, which makes
              | sense with the T2400 being 2-core to the RasPi's
              | 4-core).
              | 
              | So the RasPi4 (a cheap-class SoC) remains slower than a
              | performance-class PC from 15 years prior. Moore's law
              | certainly helped in performance-per-watt and
              | performance-per-dollar, but if pure performance is what
              | you want... I don't think there's anything available to
              | consumers outside of Apple's offerings.
        
       ___________________________________________________________________
       (page generated 2021-01-05 23:01 UTC)