[HN Gopher] An Open-Source FPGA-Optimized Out-of-Order RISC-V So...
___________________________________________________________________
An Open-Source FPGA-Optimized Out-of-Order RISC-V Soft Processor
(2019) [pdf]
Author : varbhat
Score : 202 points
Date : 2021-01-05 12:05 UTC (10 hours ago)
(HTM) web link (www.rsg.ci.i.u-tokyo.ac.jp)
(TXT) w3m dump (www.rsg.ci.i.u-tokyo.ac.jp)
| ecesena wrote:
| Is anyone working on low power open risc-v implementations?
| (Ideally including manufacturing, i.e. a physical device that I
| could buy/build on top of)
| nereye wrote:
| One of the two ULP (Ultra Low Power) co-processors on ESP32-S2
| chips is based on RISC-V. Boards using ESP32-S2 are available
| for relatively little, e.g. see
| https://www.mouser.com/ProductDetail/Espressif-
| Systems/ESP32....
|
| It's admittedly a niche use-case but it's an option for playing
| with RISC-V hardware...
| blacksmith_tb wrote:
| Seeed has some RISC-V Arduino-alikes [1], but my memory is that
| they have teased a more substantial, Linux-capable board due
| here in Q1.
|
| 1: https://www.seeedstudio.com/SeeedStudio-GD32-RISC-V-Dev-
| Boar...
| londons_explore wrote:
| The GIF of the Konata pipeline visualizer seems to show pretty
| much one instruction per cycle most of the time... Many parts of
| the trace show as low as 0.2 instructions per cycle.
|
| Wouldn't we expect much higher numbers (more parallelism)
| considering the number of frontend/backend pipelines?
| xiphias2 wrote:
| I guess somebody will implement the 64-bit GC extensions to run
| Linux on it.
| chrisseaton wrote:
| GC extensions? What does 'GC' stand for?
| the_duke wrote:
| G is shorthand for IMAFD (the base integer set plus
| multiplication, atomics, floats, and doubles) and C for
| compressed instructions.
|
| See https://en.wikipedia.org/wiki/RISC-V#Design .
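To make the letter soup concrete, here is a toy Python sketch (a hypothetical helper, not an official RISC-V tool) that unpacks an ISA string such as RV64GC; note that recent specs also fold Zicsr and Zifencei into G, which this sketch ignores:

```python
# Toy expansion of a RISC-V ISA string like "RV64GC" into its
# constituent single-letter extensions. Illustration only; newer
# specs also include Zicsr/Zifencei in G.
EXPANSIONS = {"G": "IMAFD"}  # G = base integer set plus M, A, F, D
MEANINGS = {
    "I": "base integer",
    "M": "integer multiplication/division",
    "A": "atomics",
    "F": "single-precision floating point",
    "D": "double-precision floating point",
    "C": "compressed instructions",
}

def expand_isa(name: str) -> list[str]:
    letters = name.removeprefix("RV32").removeprefix("RV64")
    flat = "".join(EXPANSIONS.get(ch, ch) for ch in letters)
    return [MEANINGS[ch] for ch in flat]

print(expand_isa("RV64GC"))
```

So "64 bit GC" above means a 64-bit core with the full general-purpose instruction set plus compressed instructions, which is roughly the baseline that Linux distributions target.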
| chrisseaton wrote:
| Ah sorry I get it - ISA extensions - I thought it was
| _compiler_ extensions.
| choletentent wrote:
| I am glad they are using SystemVerilog. It is hard for me to
| understand why SiFive chose Chisel as its RTL language. I think
| that quietly slows down RISC-V adoption. I honestly tried to
| understand the advantages of Chisel, but I cannot see any. There
| is an answer on Stack Overflow regarding Chisel's benefits; it
| is just embarrassing [1].
|
| [1] https://stackoverflow.com/questions/53007782/what-
| benefits-d...
| unionpivo wrote:
| Please note I am a software dev, not a hardware guy; I just got
| my first FPGA during the holidays and am only beginning to play
| with it.
|
| > There is an answer on Stack Overflow regarding Chisel
| benefits, it is just embarrassing [1].
|
| I don't understand what is embarrassing about the answer? As a
| software guy, the above answer makes sense to me. Some problems
| you want to use C (or similar) for, some problems you want a
| scripting language for, and then again sometimes the right tool
| is Erlang, Rust, or Go...
|
| But like I said, that's my software-guy perspective, so I am
| wondering what I missed.
| alain94040 wrote:
| I just read the answer as well. It's not "embarrassing", but
| it basically doesn't answer the question. Instead, it argues
| that the question is equivalent to asking what's the point of
| Python vs. C.
|
| So in the end, it doesn't provide anything specific regarding
| SystemVerilog vs. Chisel. All I found is one mention of
| negotiating parameters, which Verilog doesn't do. I would have
| loved to hear a lot more about examples of what Chisel makes
| more convenient than SystemVerilog.
| ljhsiung wrote:
| Google built one of their TPUs in Chisel [1].
|
| TL;DW: Chisel is beautiful/fun to write in, with a definite
| productivity bonus, but it has a pretty large learning curve and
| a much greater verification cost, partly because it's an HLS
| (most have that problem) and partly because of the lack of
| tooling. Both of those costs are gradually being reduced (though
| in my opinion, not enough to stop verification being a PITA).
|
| [1] https://www.youtube.com/watch?v=x85342Cny8c
| jeff-davis wrote:
| What is an HLS?
| ris wrote:
| High Level Synthesis
| tverbeure wrote:
| I wrote a long blog post about the VexRiscv RISC-V CPU and how
| its design methodology is radically different from that of
| traditional RTL languages.
|
| The VexRiscv is written in SpinalHDL, which is a close relative
| of Chisel.
|
| The advantage of SpinalHDL/Chisel is that it supports plug-and-
| play configurability that's impossible with languages like
| SystemVerilog or VHDL.
|
| You can read about it here:
| https://tomverbeure.github.io/rtl/2018/12/06/The-VexRiscV-CP...
|
| That said: there must be at least 50 open-source RISC-V cores
| out there, and only a small fraction are written in Chisel. I
| don't see how the use of Chisel has held back RISC-V in any
| meaningful way.
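A deliberately toy Python sketch of that generator style: the "design" is a program that runs at elaboration time, so structure can be decided from parameters in a full general-purpose language, which is the kind of thing plain Verilog parameters can't express. The generator and its output are illustrative, not real Chisel or SpinalHDL:

```python
# Toy "hardware generator": elaboration-time code decides the
# structure of the emitted RTL from a parameter, loosely mimicking
# what Chisel/SpinalHDL do with real type systems and libraries.
def make_adder(width: int) -> str:
    pipelined = width > 32  # wide adders get a pipeline register
    lines = [
        f"module adder_{width} (",
        "  input clk,",
        f"  input  [{width-1}:0] a, b,",
        f"  output [{width-1}:0] sum);",
    ]
    if pipelined:
        lines += [
            f"  reg [{width-1}:0] sum_r;",
            "  always @(posedge clk) sum_r <= a + b;",
            "  assign sum = sum_r;",
        ]
    else:
        lines += ["  assign sum = a + b;"]
    lines.append("endmodule")
    return "\n".join(lines)

print(make_adder(64))  # emits the pipelined variant
```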
| artemonster wrote:
| I really want to bonk on the head anyone praising
| SystemVerilog. This collective undertaking of corrupt committee
| members locked us into this horrible language and these grossly
| outdated tools, forever. There is not the slightest bit of good
| in that language, neither for design nor for verification.
| nickik wrote:
| There are tons of people using SystemVerilog for RISC-V. The
| majority of the work is done with SystemVerilog, not Chisel.
| Having lots of cores in Chisel, SystemVerilog, and many other
| languages (VHDL, Bluespec, and so on) is a huge benefit for
| RISC-V.
|
| SiFive values programmability above everything, and for that
| Chisel is pretty clearly an advantage.
| CodesInChaos wrote:
| Is there a quantification of "high performance"?
|
| It will obviously be much lower than the IPC of an actual high
| performance CPU (modern x86-64), but how big is the difference?
| And how does it compare to typical mobile processors?
| NieDzejkob wrote:
| I believe the closest measurement would be in Table VII on page
| 8 of the paper:
| http://www.rsg.ci.i.u-tokyo.ac.jp/members/shioya/pdfs/Mashim...
| jhallenworld wrote:
| RV32IM, 2.04 DMIPS/MHz, 95.3 MHz, 15K LUTs, 8K LUT FFs, 6
| BRAMs on Zynq 7020.
|
| I'm curious if this will work on the Lattice ECP5 - I'm not
| really sure if Synplify supports SystemVerilog to the same
| degree as the Xilinx tools do. The ECP5 is interesting because
| it's a $10 FPGA.
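For reference, the quoted figures multiply out as follows:

```python
# Sanity arithmetic on the numbers quoted above (RSD, RV32IM on a
# Zynq 7020): DMIPS/MHz times achieved clock gives absolute DMIPS.
dmips_per_mhz = 2.04
fmax_mhz = 95.3
print(f"{dmips_per_mhz * fmax_mhz:.0f} DMIPS")  # prints "194 DMIPS"
```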
| mhh__ wrote:
| At a glance I think it would fit, although I might be thinking
| of the bigger ECP5s, which are in the same price bracket as
| some Zynqs.
| childintime wrote:
| "In comparison to the BOOM, the RSD achieved 2.5-times higher
| DMIPS and 1.9-times higher DMIPS/MHz", which should be
| comparable to ARM cores around 4 years ago.
| Veedrac wrote:
| Note that BOOM is now on v3[1], which claims to be 3.93
| DMIPS/MHz, or about twice RSD.
|
| [1]
| https://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-
| CAR...
| mhh__ wrote:
| This has 5 backend pipelines, which I would say is getting
| there speed-wise. I think AMD Zen has 10 execution pipelines to
| play with, although the exact structure is usually not
| commented upon so I don't know how long they are.
|
| It's 32-bit, so it's not necessarily desktop-class, but for
| scale, this should blow a microcontroller out of the sky.
| phkahler wrote:
| I've been wondering if it could make sense to put a small ALU
| with each register (when they are not virtual/renamed but
| maybe then too?). This would allow instructions that use the
| destination register as a source to only use one read port,
| potentially allowing higher IPC. Has anyone looked into this
| and if so what did the analysis show?
| londons_explore wrote:
| If I understand correctly, most designs effectively #define
| the number of frontend and backend pipelines. That means you
| are free to make as many as you like, for as much performance
| as you like, but some parameters might scale _very_ badly
| outside what the core was designed for. For example, I would
| imagine the number of gates and power use to go up to
| unfeasibly high levels above the quoted numbers.
| Symmetry wrote:
| It's a bit harder than that. You have to worry about
| forwarding results from one pipeline to another bypassing
| the registers if you want to avoid having to deal with
| stalls waiting for results. The transistor cost of the
| bypass network grows as the square of the number of
| pipelines so it can be pretty significant, potentially
| being larger than the cost of the execution pipelines
| themselves.
|
| Many modern designs aren't fully bypassed and involve
| clusters of pipelines to manage this. IBM's recent Power
| chips and Apple's ARM cores are particularly known to do
| this.
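The quadratic cost is easy to see by counting forwarding paths; a rough model (the operand and stage counts here are illustrative assumptions, not taken from any particular core):

```python
# Rough model of a full bypass network: every producing pipeline
# must forward its result to every consuming pipeline's source
# operands, so the path count grows as the square of pipe count.
def bypass_paths(n_pipes: int, fwd_stages: int = 1,
                 src_operands: int = 2) -> int:
    return n_pipes * n_pipes * fwd_stages * src_operands

for n in (2, 5, 10):
    print(n, bypass_paths(n))  # doubling pipes quadruples paths
```

Clustering the pipelines, as described above, replaces one big n-squared network with several smaller ones plus slower inter-cluster links.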
| sitkack wrote:
| Does optimal pipeline topology vary with the specific
| workload or is more dependent on the instruction set
| semantics and result visibility?
|
| Is this something that could be automatically optimized
| via simulation?
|
| Is it something that could be made dynamic and shared across a
| bunch of schedulers, so that cores could move between
| big/little during execution?
| Symmetry wrote:
| It's highly dependent on both workload and instruction
| set.
|
| I'm sure it could be automatically optimized in theory, even
| without the solution being AI-complete, but I don't think we
| have any idea how to do it right now.
|
| No, not unless you're reflashing an FPGA. You'd have
| better luck sharing subcores for threadlets I think.
| gchadwick wrote:
| > This has 5 backend pipelines, which I would say is getting
| there speed-wise.
|
| It's pretty easy to slap down pipelines. What is far harder
| is keeping them all fed and running without excessive
| stalling and bubbles whilst avoiding killing your max
| frequency and blowing through your area and power goals.
| swiley wrote:
| Do you really want pipelining on a microcontroller, though?
| mhh__ wrote:
| Why not? I think the Cortex-M3 has a 3-stage pipeline;
| obviously I would not expect the $0.01 Chinese ones to have
| one.
|
| It's worth saying here that a big OoO CPU is pipelined
| differently to a small/old RISC processor. Even in discussions
| about compiler optimizations, people still use terminology
| like "pipeline stall", but a modern CPU has a pipeline that
| handles fetching an instruction window, finding dependencies,
| doing register renaming, and execution. That pipeline is not
| like an old IF->ID->EX->MEM->WB design - it won't stall the
| way an original (P5) Pentium did. The execution pipes
| themselves have a more familiar structure.
| gmueckl wrote:
| Not sure about the Cortex-M3, but I can confirm that the
| Cortex-M4 has a pretty basic 3-stage in-order pipeline that
| gets flushed on branches. So unless there are caches between
| the core and memory (I think some STM32s have that), that CPU
| is still trivially deterministic.
| makapuf wrote:
| Yes, the M4s from ST have the ART Accelerator (a flash cache),
| which makes them less deterministic (but much faster) for
| flash access. See https://www.st.com/en/microcontrollers-
| microprocessors/stm32...
| zurn wrote:
| It's targeting small FPGAs, comparisons to full custom designs
| wouldn't be very fair.
| adwn wrote:
| "high-performance" is relative, so they need to compare it to
| _something_.
| chungus_khan wrote:
| I would imagine it is high-performance within its own
| domain: soft microprocessors. There isn't much danger of
| something like this being used outside of domains where it
| is relevant, so just calling it "high-performance" gets the
| message across fairly well to anyone who would actually
| need it.
|
| I can't comment on how well it performs myself without
| testing it, but a quick skim of the paper reveals that it
| apparently performs well in comparison to its rival open-
| source out-of-order soft processor.
|
| Comparing soft processors to other soft processors is
| fairly easy if they can both run on the same hardware, but
| comparing them to real silicon is inherently kind of
| meaningless, as they don't really compete at the moment,
| and the performance of the design in absolute terms will
| depend on the FPGA it is implemented on. Nonetheless, you
| could compare the raw numbers presented in the paper for
| curiosity's sake and see that indeed, it isn't very fast
| compared to modern silicon processors.
| adwn wrote:
| Sorry, what I meant was: They need to present benchmark
| results (preferably against other soft-core processors,
| of course) to claim the "high-performance" attribute. A
| single row of a synthetic benchmark against only two
| other contenders is a little ... lacking.
| Symmetry wrote:
| It seems broadly comparable to an ARM Cortex A9 in width and
| reorder depth but it seems to be using a lot more pipeline
| stages to do that. Probably that's because they had less
| engineer-years to invest in the design than ARM did. It looks
| like it's using pure automatic synthesis so the clock speed
| will tend to be lower on equivalent process nodes too. The big
| question is how accurate the prefetcher and branch predictors
| are. It's entirely possible for a 2-issue out-of-order core to
| be slower than a 2-issue in-order core if the latter is
| significantly better at speculating.
|
| Still, that's a good bit of work they should be proud of
| putting out there, and I hope other people build on it.
|
| EDIT: Oh, wait, they don't mention register renaming. Hmm,
| well, I guess no speculating over multiple iterations of a loop
| then.
|
| EDIT2: No, the linked PDF mentions a rename unit.
| http://www.rsg.ci.i.u-tokyo.ac.jp/members/shioya/pdfs/Mashim...
| smrxx wrote:
| All those abbreviations on the block diagram make it very
| difficult to interpret. A key map in the image would be great, or
| at least some markdown directly below it.
| dpoochieni wrote:
| If the FPGA is a closed design, are we really that much better
| off?
| rwmj wrote:
| There are fully reverse-engineered FPGAs and open source
| toolchains to drive them.
|
| The real question is: Is there anything hidden in the silicon?
| That's something you can only solve by owning your own fab -
| the US military approach.
| eeZah7Ux wrote:
| Good luck hiding an effective backdoor in an FPGA. The
| attacker (the FPGA fab) has no idea of how it's going to be
| programmed.
| rwmj wrote:
| The usual thing the military worries about is a "kill
| switch" (a very unlikely sequence of bits) which disables
| the hardware completely. The idea is that at the beginning
| of a war, the kill signal is broadcast by the enemy by
| every means possible which brings all your electronics to a
| halt.
|
| This can be hidden in an FPGA - for example attached to the
| input pins or SERDES - without needing to know anything
| about the application.
|
| (Article:
| https://spectrum.ieee.org/semiconductors/design/the-hunt-
| for...)
| eeZah7Ux wrote:
| Triggering a malfunction is incredibly easy compared to a
| proper backdoor. A kill signal could also be injected
| through side channels e.g. a power line, and the kill
| mechanism could be implemented in many other
| semiconductors than an FPGA.
| fest wrote:
| The argument here usually is: with a fixed silicon chip, the
| vendor can hide the backdoor in various locations and have it
| triggered by various events (e.g. a particular sequence of
| incoming ICMP packets would overwrite the first byte of the
| response with the content of some register). With an FPGA, the
| vendor can't really know where a particular register is
| located, or where incoming packets are processed, as that is
| highly dependent on the synthesised CPU design and can even be
| non-deterministic.
|
| This does not mean that there is no way the vendor can
| backdoor the chip you are getting, but it does narrow the
| possibilities significantly.
| gmueckl wrote:
| Even if the FPGA design were fully public, you won't
| necessarily find the required fab around the corner, so what's
| there to gain from the FPGA design? If you're seeking trust,
| you'd need to establish trust in the whole chain: design, mask
| production, chip fab, transport... that's a tall order.
| mhh__ wrote:
| Well done to the authors for making a surprisingly readable core
| for once.
| Symmetry wrote:
| To push this up from the comments, if you're interested in why
| this is important or what the authors are trying to do the PDF
| where they describe their approach and architecture is really
| interesting.
|
| http://www.rsg.ci.i.u-tokyo.ac.jp/members/shioya/pdfs/Mashim...
| dang wrote:
| Ok, we've changed the URL to that from https://github.com/rsd-
| devel/rsd, which doesn't give much background.
| jeff-davis wrote:
| If I understand this correctly:
|
| Typically, I think of an FPGA as something used to accelerate
| specialized operations. But sometimes, in the middle of one of
| these specialized operations, you might want to do something
| more general, like run a network stack, without returning to
| the CPU. A soft processor like this allows you to run an
| ordinary network stack (with ordinary code) inside the FPGA.
|
| Is that right?
|
| I thought one of the things people used FPGAs for was
| accelerating network stacks, so I don't quite know why you'd
| want to use a soft processor for that. But it does make sense
| that you'd want to be able to run ordinary code in an FPGA (as
| part of a larger FPGA operation that is not ordinary code).
|
| EDIT: Also, I don't understand this statement: "for example,
| one main compute kernel, which is too complex to deploy on
| dedicated hardware, is run by specialized soft processors".
| What do the authors mean "too complex to deploy on dedicated
| hardware"?
| mindentropy wrote:
| Generally, soft cores are used for command and control of the
| FPGA. For example, if you have a frame grabber and you would
| like to adjust the shutter speed, fps, etc., you would create
| a soft core and run normal firmware on it to set up the
| registers based on the user's choice.
|
| You can think of the FPGA as an ASIC and the soft core as
| controlling that ASIC. The hot data path and heavy processing
| are done in the FPGA fabric, and the processing options for
| that "ASIC" can be set from the soft-core firmware.
| TrueDuality wrote:
| It's definitely common to accelerate specialized operations
| without the overhead of a general processor, but it's also
| possible to effectively use them as a much more flexible
| microprocessor when you need it.
|
| If you go looking for a microcontroller for your project, you
| have to choose among what is available. Maybe this
| microcontroller has two hardware UART interfaces and 1 SPI
| interface. If I don't need any UART but instead need a CANBUS
| interface that microcontroller won't work for me. Sure I can
| bitbang the protocol on GPIO ports but that uses up a lot of
| the limited processing power on the microcontroller...
| Usually that means a more expensive microcontroller.
|
| There is a threshold that you can cross where a small FPGA is
| cheaper than a microcontroller that has enough pins and
| processing power for your application. This does come with an
| additional upfront design cost of also writing (but more
| often integrating) the soft cores but sometimes that makes
| sense.
|
| Sometimes peripherals just don't exist at the price point you
| need. Try to find a microcontroller that has an MMIO
| controller for under $5 (I probably couldn't do it at under
| $10, but I haven't gone looking recently); it's rare that
| they're needed, but sometimes a design requires one.
|
| There has also been a lot of recent interest in doing formal
| verification of hardware logic. A lot of microprocessors, and
| even the full CPU in whatever device you're reading this on,
| have a lot of undocumented black boxes and undefined
| behaviours, both of which prevent that verification from being
| meaningful.
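To give a feel for why bit-banging a protocol on GPIO, as mentioned above, eats CPU time, here is a hardware-free Python sketch of the bit sequence a pin would have to drive for one 8N1 UART frame; on a real microcontroller each of these bits costs a precisely timed busy-wait or interrupt:

```python
# Bit sequence for one UART frame (8N1, LSB first): start bit low,
# eight data bits, stop bit high. A bit-banged transmitter has to
# toggle a GPIO through exactly this sequence with strict timing.
def uart_frame(byte: int) -> list[int]:
    bits = [0]                                   # start bit
    bits += [(byte >> i) & 1 for i in range(8)]  # data, LSB first
    bits += [1]                                  # stop bit
    return bits

print(uart_frame(0x55))  # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
```

At 115200 baud that is ten precisely spaced edges every ~87 microseconds per byte, which is exactly the kind of busywork a hardware UART (or a few lines of FPGA fabric) does for free.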
| mng2 wrote:
| >What do the authors mean "too complex to deploy on dedicated
| hardware"?
|
| It costs time and effort to translate a software function to
| HDL/FPGA, so it's not always worth doing. For example you
| could do TCP/IP in hardware, but unless you have particular
| performance requirements (say HFT) you're probably better off
| with a soft core and a tested software TCP/IP stack.
|
| Also each feature translated to hardware takes up space in
| the fabric. When you crunch numbers on an FPGA, it's ideal if
| you can lay out the entire sequence of operations as one big
| pipeline, so you can keep throughput as high as possible.
| Sufficiently long or complex sequences may not translate
| efficiently to FPGA.
| lizknope wrote:
| Making a custom chip requires one-time mask fees of up to $30
| million for a 5 nm chip. If your volume is high then you can
| amortize that cost; if not, you probably go to an FPGA. An
| FPGA has a high per-part cost but no custom mask costs, and
| you can reprogram it in a few hours or days instead of the two
| months it takes for a new custom chip to be manufactured.
|
| A high end FPGA already has hardened CPU blocks, USB and PCIE
| interfaces, and lots of other things built in. Then it has a
| large area of generic reconfigurable logic that you can
| customize to do whatever you want.
|
| This reconfigurable logic will not be as fast as a custom
| chip, but it is still far faster than software and can be used
| to implement your own CPU (assuming you don't want to use the
| hardened CPUs, or you got an FPGA without them).
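The amortization argument can be put in rough numbers; the mask cost below is the "up to $30 million" quoted above, while the per-unit prices are invented placeholders:

```python
# Back-of-envelope break-even between paying ASIC NRE (mask fees)
# and buying FPGAs. Unit prices are illustrative assumptions.
def breakeven_units(nre: float, asic_unit: float,
                    fpga_unit: float) -> float:
    # Total costs are equal when: nre + n*asic_unit == n*fpga_unit
    return nre / (fpga_unit - asic_unit)

n = breakeven_units(30e6, asic_unit=5.0, fpga_unit=80.0)
print(f"{n:,.0f} units")  # prints "400,000 units"
```

Below that volume the FPGA wins on total cost despite the much higher per-part price, which is why soft cores make sense for low-volume products.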
| oblio wrote:
| For people in the industry: how likely are we to get RISC-V
| servers/VMs/laptops/desktops in the next 5-10 years? You know,
| go on PC Part Picker and assemble a RISC-V desktop, for
| example.
| jagger27 wrote:
| Can you go on PC Part Picker and build an Arm desktop yet?
| oblio wrote:
| I haven't tried; I assume not. But aren't hardware fashions
| changing rapidly now, compared to 2005-2020, when x86 was
| everywhere in the places I mentioned?
| anfilt wrote:
| Part of ARM's licensing generally prevents socketed CPU chips.
| moyix wrote:
| Sure, order a Raspberry Pi 4 and an external USB SSD ;)
| AceJohnny2 wrote:
| So, fun fact: I was recently curious how the RasPi4 compared
| to my 15-year-old Dell Precision M65 (Intel Core Duo T2400).
|
| I ran the sysbench CPU test on each, and the M65 _trounced_
| the RasPi4, being over 3x as fast in single-core (and about
| 1.5x as fast in multi-core, which makes sense with the T2400
| being 2-core to the RasPi's 4-core).
|
| So the RasPi4 (a cheap-class SoC) remains slower than a
| performance-class PC from 14 years prior. Moore's law
| certainly helped in performance-per-watt and performance-per-
| dollar, but if pure performance is what you want... I don't
| think there's anything available to consumers outside of
| Apple's offerings.
___________________________________________________________________
(page generated 2021-01-05 23:01 UTC)