[HN Gopher] SiFive's P550 Microarchitecture
       ___________________________________________________________________
        
       SiFive's P550 Microarchitecture
        
       Author : rbanffy
       Score  : 128 points
       Date   : 2025-01-27 10:32 UTC (12 hours ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | turblety wrote:
       | Why are these only 1.4ghz frequency, when raspberry pi gets to
       | 2.4ghz? Is it a limitation of the cost of scale that prevents
       | building faster chips? Or does the architecture not really
       | support faster chips?
       | 
       | I really hope that RISC-V can take over as a modern architecture,
       | adding some competition to Intel/AMD and Arm. But they'll need to
       | be able to offer faster chips, or at a minimum more than 4 cores.
       | 
       | Also does anyone know the rate of progress? I believe 10 years
       | ago these were at 0.5mhz?
        
         | rbanffy wrote:
         | Seems to be limited mostly by the power and size constraints,
         | while it's also fabricated in an older node. The target market
         | seems to be simple embedded devices not needing SIMD
         | instructions, which is less constrained by the software
         | availability. RISC-V is still a very new architecture.
        
           | daef wrote:
           | i thought risc-v is >10years "old". how long is an ISA "very
           | new"?
        
             | rbanffy wrote:
             | x86 is more than 40. ARM is 30-ish.
             | 
             | Most importantly, it's been just a few years since we could
             | start getting reasonable RISC-V boards.
        
               | Someone wrote:
               | and one likely reason the boards weren't there was
               | (https://en.wikipedia.org/wiki/RISC-V#Design):
               | 
               |  _"As of June 2019, version 2.2 of the user-space ISA[46]
               | and version 1.11 of the privileged ISA[3] are frozen,
               | permitting software and hardware development to proceed.
               | The user-space ISA, now renamed the Unprivileged ISA, was
               | updated, ratified and frozen as version 20191213"_
               | 
               | So, it's more like 5 years old, compared to ~40 for
               | 32-bit x86, ~20 for 64-bit x86.
        
               | RobotToaster wrote:
               | First ARM processor was 1985, so 40 years almost exactly.
        
               | klelatti wrote:
               | I think it's a bit problematic to say ARM is 30-ish years
               | old. The company is 34 years old but 64 bit Arm (AArch64)
               | which is really very different to its predecessors was
               | announced in 2011 so arguably only 14 years old.
        
               | K0balt wrote:
               | I'm pretty sure the IP/experience transfer to AArch64 was
               | massive compared to starting from scratch.
        
             | beeflet wrote:
             | This same point is made in threads discussing how wayland
             | protocol is 16 years "old". I think it's different if the
             | system starts out as a research project rather than a
             | commercial project, because the time until a usable
             | implementation is much greater. For example, I would say
             | that riscv is "newer" than loongISA/loongarch despite being
             | slightly older in a literal sense.
             | 
             | If you look at an arch like x86 or ARM it was designed
             | right before chips were released, and then extended over
             | time. The same goes for the X protocol, it simply extended
             | previous versions.
             | 
             | If you are designing something from the ground up to avoid
             | the inherent problems of an existing system, it is
             | reasonable to take time and research design problems to
             | make sure you don't recreate the same issues (which would
             | defeat the point of the redesign). It doesn't compete on
             | the same time-frame as an extension of an existing system.
        
             | formerly_proven wrote:
             | Privileged spec is only a couple years old and mainline
             | Linux only runs on RISC-V since 2022 or something like
             | that.
        
           | gliptic wrote:
           | > while it's also fabricated in an older node
           | 
           | Is this 7 nm node really older than raspberry pi 5's 16 nm
           | node?
        
             | nsteel wrote:
             | > This SoC has a 1.4 GHz, quad core P550 cluster with 4 MB
             | of shared cache. The EIC7700X is manufactured on TSMC's
             | 12nm FFC process
             | 
             | > Next up is TSMC's 12 nm FFC manufacturing technology,
             | which is an optimized version of the company's CLN16FFC
             | that is set to use 6T libraries (as opposed to 7.5T and 9T
             | libraries) providing a 20% area reduction. Despite
             | noticeably higher transistor density, the CLN12FFC is
             | expected to also offer a 10% frequency improvement at the
             | same power and complexity or a 25% power reduction at the
             | same clock rate and complexity.
             | 
             | They optimised for density and power, not frequency. A lot
             | of the benefit they're claiming comes just from this.
             | 
             | https://www.anandtech.com/show/11337/samsung-and-tsmc-
             | roadma...
        
         | beeflet wrote:
         | I think that the development of riscv will eke out greater
         | market share for chinese manufacturers, which will have a
         | negative effect on the global order.
         | 
         | I am also skeptical that it will lead to more open designs, but
         | perhaps it could increase competition enough in the chip design
         | space that more open chip designers can make a space for
         | themselves, especially if the business of chip fabrication is
         | isolated from design.
        
           | cbm-vic-20 wrote:
           | I'm surprised China's economic rivals, specifically India,
           | haven't made a serious push to bootstrap a home-grown chip
           | industry based around RISC-V.
        
             | Arnavion wrote:
             | India had
             | https://en.m.wikipedia.org/wiki/SHAKTI_(microprocessor) as
             | of a few years ago but I haven't followed it. The main
             | website and blog seem to have had not much activity since
             | 2021.
        
           | daghamm wrote:
           | China tried to create a homegrown CPU 15-20 years ago with
           | their MIPS variant but that died out. I think this time they
           | are much wiser and will pull it off. In 5-8 years we may have
           | China CPUs dominating the Asian market at least.
           | 
           | Riscv leading to more open designs is wishful thinking, plus
           | probably a large dose of PR. MIPS has been open for years and
           | how many open source MIPS designs have we seen so far?
        
             | yjftsjthsd-h wrote:
             | > China tried to create a homegrown CPU 15-20 years ago
             | with their MIPS variant but that died out.
             | 
             | Er, are you talking about LoongArch? The CPU line that
             | https://en.wikipedia.org/wiki/Loongson lists new models of
             | this year?
        
               | daghamm wrote:
               | Yes, I had no idea they were still working on it.
               | 
               | You could buy their products in the West as a sort of
               | low-power PC for a short while, but I think once netbooks
               | arrived those just vanished.
        
         | azinman2 wrote:
         | Does Intel/AMD/ARM really need more competition? Do you think
         | they're stagnant?
         | 
         | As I and others have said before, successful consolidation
         | around RISC-V is ultimately a gift to China. Maybe you're
         | for that; as an American I am not.
        
           | wbl wrote:
           | How is it a gift to China? Architecture isn't where the magic
           | is.
        
       | klelatti wrote:
       | From the SiFive website [1]
       | 
       | > The Performance P550 scales up to four-core complex
       | configurations while delivering 30% higher performance in less
       | than half the area of a comparable Arm® Cortex®-A75.
       | 
       | Dylan Patel wasn't impressed by these comparisons with A75 [2]
       | 
       | > @SiFive is claiming half the area and higher perf/GHz, but they
       | are using 7nm and 100ns memory latency. Choosing to compare to
       | the 10nm A75 on S845, notorious for its high latency at over
       | 200ns. Purposely ignoring iso-node or other A75 comparisons.
       | 
       | And this analysis seems to be borne out in this Chips and Cheese
       | post.
       | 
       | > As a step along that journey, P550 feels more comparable to one
       | of Arm's early out-of-order designs like Cortex A57. By the time
       | A75 came out, Arm already accumulated substantial experience in
       | designing out-of-order CPUs. Therefore, A75 is a well polished
       | and well rounded core, aside from obvious sacrifices required for
       | its low power and thermal budgets. P550 by comparison is rough
       | around the edges.
       | 
       | So what to make of SiFive's claims? It seems quite an important
       | claim / comparison.
       | 
       | [1] https://www.sifive.com/cores/performance-p550
       | 
       | [2] https://x.com/dylan522p/status/1415395415000817664
        
         | bhouston wrote:
         | > As a step along that journey, P550 feels more comparable to
         | one of Arm's early out-of-order designs like Cortex A57.
         | 
         | If it is as fast as a A57 on similar node, that would still be
         | a major win for RISC-V which so far has been incredibly slow.
         | The Nintendo Switch 1 uses Cortex-A57.
        
         | pankajdoharey wrote:
         | Even if I try to avoid hyperbole like "by the time SiFive's
         | designs reach maturity relative to the current Cortex, GPT-30
         | will be out" ...
         | 
         | The fact is that the node difference (7nm vs. 10nm) is critical
         | here, SiFive's area/power efficiency gains aren't purely
         | architectural but partly process-driven. Even with that
         | advantage, matching a 2018 A75 (designed for mobile
         | thermal/power limits) in 2024 feels like catching up to ARM's
         | rearview mirror. ARM's A720 today benefits from years of
         | iterative refinement (cache hierarchies, branch predictors,
         | memory subsystems) that aren't easily replicated overnight.
         | 
         | Scaling beyond cores is another hurdle, interconnects, memory
         | controllers, and accelerators matter just as much as raw IPC.
         | RISC-V's ecosystem (tools, firmware, software optimization)
         | also lags ARM's, which could limit adoption even if the P550
         | were competitive.
         | 
         | SiFive's claims highlight RISC-V's potential, but until they
         | benchmark against modern cores on the same node and demonstrate
         | system-level competitiveness (not just microarchitecture wins),
         | the gap will persist. That said, disruption takes time--ARM
         | wasn't born polished either. The real test is whether SiFive
         | can close the maturity deficit before ARM's roadmap (and
         | AI-driven heterogeneity) leaves them behind. I doubt it: the gap
         | in GPU cores alone between Cortex and the M series is huge, and
         | then there are accelerators like NPU cores, which SiFive hasn't
         | even started working on yet.
         | 
         | Even Cortex NPUs are behind the Apple M series, and if large
         | companies like Samsung, Qualcomm, and MediaTek lag behind Apple
         | in quality ARM chips with a decent GPU, NPU, and on-board
         | memory, what hope does SiFive have? At worst, burn investor
         | money and die. At best, supply chips for your remote control,
         | washing machine, etc. Competing in mainstream applications would
         | not be wise by any standard.
        
           | Symmetry wrote:
           | It's also worth noting that they're claiming a high perf/GHz,
           | not a high perf. It's easy to shrink a chip by using slower
           | but more compact libraries that don't increase gate size as
           | much for larger fanout at the cost of a lower maximum
           | frequency, like AMD's compact cores do. And lowering clock
           | speeds means that your main memory latency, measured in clock
           | cycles, goes down, increasing perf/GHz too.
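The clock-versus-latency point above can be made concrete with a rough sketch. The 100 ns and 200 ns figures are the ones quoted from Dylan Patel earlier in the thread, 1.4 GHz is the EIC7700X clock quoted from the article, and 2.8 GHz is the S845 clock mentioned elsewhere in the thread. The helper name `latency_cycles` is made up for illustration.

```python
def latency_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert a memory latency in nanoseconds into core clock cycles.

    1 GHz is one cycle per nanosecond, so cycles = ns * GHz.
    """
    return latency_ns * clock_ghz


# Figures taken from the thread, not measured here.
p550_miss = latency_cycles(100, 1.4)   # P550 test system as described
s845_miss = latency_cycles(200, 2.8)   # A75 in the Snapdragon 845

print(f"P550 DRAM miss: about {p550_miss:.0f} cycles")
print(f"S845 DRAM miss: about {s845_miss:.0f} cycles")

# Same 100 ns DRAM, but a hypothetical doubled clock: the miss costs
# twice as many cycles, which drags perf/GHz down.
same_dram_faster_clock = latency_cycles(100, 2.8)
```

On these assumed numbers a miss costs the A75 system roughly four times as many core cycles, half of which comes purely from the clock difference, so a perf/GHz comparison across the two setups says little about the cores themselves.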
        
         | hajile wrote:
         | TSMC 7nm is 91MTr/mm2 and 10nm is 53MTr/mm2. That's a 1.72x
         | increase in density while SiFive is claiming a 2x density
         | advantage which still puts it pretty far ahead if the claim is
         | accurate and that's without discussing the 30% IPC advantage
         | (though final clockspeed equivalence from their claims would
         | still put it 35% slower than the S845 at 2.8GHz). The real
         | question is about how much more dense could A75 be if they
         | lowered target clockspeeds.
         | 
         | Dylan's complaint about comparing to the S845 is mystifying
         | ignorance as he should know better.
         | 
         | What other A75 SoCs are there? Exynos used it for their mid
         | cores, but the SoC sucked. MediaTek had the Helio P65, but it
         | was announced in late 2019 which was basically 2 years after
         | S845 was announced at the end of 2017. There were some other
         | smaller suppliers from China, but I have no idea who they are.
         | S850 existed, but as I recall, it was just a better binning of
         | the S845 announced months after the original.
         | 
         | S845 is the ONLY A75 design worth comparing.
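The arithmetic in the comment above can be checked with a short sketch. All inputs are figures quoted in the thread, not independently verified: the two density numbers, SiFive's "half the area" read as exactly 2x, the 30% perf/GHz claim, and the 1.4 GHz and 2.8 GHz clocks.

```python
# Densities as quoted in the comment, in millions of transistors per mm^2.
density_7nm = 91.0
density_10nm = 53.0

# Density gained from the process node alone.
process_gain = density_7nm / density_10nm

# SiFive's "less than half the area", read here as an assumed 2x.
claimed_area_gain = 2.0
residual_design_gain = claimed_area_gain / process_gain

# Throughput relative to the S845's A75, assuming the 30% perf/GHz claim.
p550_ghz = 1.4
s845_ghz = 2.8
perf_per_ghz_ratio = 1.30
relative_perf = (p550_ghz * perf_per_ghz_ratio) / s845_ghz

print(f"process-only density gain: {process_gain:.2f}x")
print(f"area gain left after node: {residual_design_gain:.2f}x")
print(f"P550 throughput vs S845 A75: {relative_perf:.2f}")
```

Under these assumptions the node accounts for about 1.72x of the claimed 2x, leaving roughly a 1.16x area advantage for the design itself, and the clock deficit puts the P550 at about 0.65 of the A75's throughput, which matches the "35% slower" figure.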
        
       | phire wrote:
       | _> Likely, P550 doesn't have another BTB level. If a branch
       | misses the 32 entry BTB, the core simply calculates the branch's
       | destination address when it arrives at the frontend_
       | 
       | That seems unwise. Might work well enough for direct branches,
       | but it's going to perform very badly on indirect branches. I
       | would love to see some tests for indirect branch performance
       | (static and dynamic) in your suite.
       | 
       |  _> When return stack capacity is exceeded, P550 sees a sharp
       | spike in latency. That contrasts with A75's more gentle increase
       | in latency._
       | 
       | That might be a direct consequence of the P550's limited BTB.
       | Even when the return stack overflows, the A75 can probably still
       | predict the return as if it was an indirect branch, utilising its
       | massive 3072 entry L1 BTB.
       | 
       | Actually, are you sure the P550 even has a return stack? 16
       | correctly predicted call/ret pairs just so happens to be what you
       | would get from a 32 entry BTB predicting 16 calls then 16
       | returns.
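The return-stack behaviour discussed above can be illustrated with a toy model. This is a minimal sketch, not a description of the P550 or A75: it assumes a 16-entry return address stack that overwrites its oldest entry on overflow and has no fallback predictor. The function name and the addresses are hypothetical.

```python
from collections import deque


def predicted_returns(call_depth: int, ras_entries: int = 16) -> int:
    """Count correctly predicted returns for `call_depth` nested calls.

    The return address stack (RAS) is modelled as a bounded deque:
    pushing onto a full RAS silently drops the oldest entry.
    """
    ras = deque(maxlen=ras_entries)
    actual = []                      # the real, unbounded call stack
    for i in range(call_depth):
        ret_addr = 0x1000 + 4 * i    # hypothetical return address
        ras.append(ret_addr)
        actual.append(ret_addr)

    correct = 0
    while actual:
        real = actual.pop()
        guess = ras.pop() if ras else None   # an empty RAS has no guess
        correct += (guess == real)
    return correct


for depth in (8, 16, 24):
    print(depth, predicted_returns(depth))
```

In this model every return past the 16th nesting level mispredicts, which is the kind of sharp cliff described in the article quote. A core that falls back to an indirect-branch predictor or a large BTB would degrade more gradually.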
        
         | monocasa wrote:
         | Calls don't need to be predicted though since they're
         | unconditional. So 16 call/ret pairs should only be 16 BTB
         | entries.
        
           | IshKebab wrote:
           | You can predict a dynamically dispatched call surely?
        
           | colejohnson66 wrote:
           | You still need to predict the target if it's indirect
        
           | eigenform wrote:
           | Modern machines usually try to predict target addresses, not
           | just the direction of conditional branches. You can implement
           | it for unconditional calls and jumps too, even for
           | direct/relative-addressed ones. That's pretty common
           | nowadays.
           | 
           | When you calculate the target of a jump, you can cache it
           | (that's what a BTB is). Next time you encounter it, you
           | predict the target by accessing the cached value in the BTB
           | and start fetching early instead of waiting for your
           | jump/call to move all the way through the machine.
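The caching idea described above can be sketched as a toy branch target buffer. It assumes a fully associative table with LRU replacement, which real hardware rarely uses at scale. The 32-entry size mirrors the figure quoted from the article, while the class name, the addresses, and the fixed branch offset are invented for illustration.

```python
from collections import OrderedDict


class BranchTargetBuffer:
    """Toy fully associative BTB with LRU replacement."""

    def __init__(self, entries: int = 32):
        self.entries = entries
        self.table = OrderedDict()   # branch PC -> last seen target

    def predict(self, pc):
        """Return the cached target for this branch, or None on a miss."""
        target = self.table.get(pc)
        if target is not None:
            self.table.move_to_end(pc)   # mark as most recently used
        return target

    def update(self, pc, target):
        """Record the resolved target, evicting the LRU entry if full."""
        self.table[pc] = target
        self.table.move_to_end(pc)
        if len(self.table) > self.entries:
            self.table.popitem(last=False)


def hit_rate(btb, branch_pcs, passes):
    """Replay a loop over the same branches and measure BTB hit rate."""
    hits = total = 0
    for _ in range(passes):
        for pc in branch_pcs:
            target = pc + 0x40           # hypothetical fixed target
            if btb.predict(pc) == target:
                hits += 1
            else:
                btb.update(pc, target)   # learn it once the branch resolves
            total += 1
    return hits / total


small_loop = [0x1000 + 8 * i for i in range(16)]   # fits in 32 entries
big_loop = [0x1000 + 8 * i for i in range(40)]     # exceeds 32 entries

print(hit_rate(BranchTargetBuffer(32), small_loop, passes=10))
print(hit_rate(BranchTargetBuffer(32), big_loop, passes=10))
```

With 16 branches only the first pass misses. With 40 branches cycling through a 32-entry LRU table, each entry is evicted before it is reused, so this model never hits. That is the situation where a second BTB level would normally help.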
        
           | phire wrote:
           | You need to predict the target (with BTB) if you want zero-bubble
           | calls (1 cycle latency).
           | 
           | Otherwise it takes 3 cycles to take a direct call. Doesn't
           | matter that it's unconditional, it can't see the call until
           | after decoding. Sure, it's only two extra cycles, but that's
           | a full six instructions on this small 3-wide core.
           | 
           | And an indirect call? Even if the target is known, it's going
           | to need to fetch it from the register file, which requires
           | going through rename. And unless you put a fast-path in with
           | an extra register-file read-port, you need to go through
           | dispatch and the scheduler too. Probably takes 6-10 cycles to
           | take an indirect branch without a BTB.
           | 
           | On bigger designs it's even more essential to predict
           | unconditional branches, as their instruction caches take
           | multiple cycles, and then there are quite substantial queues
           | between fetch and decode, and between predict and fetch.
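The cost estimate above works out as follows. This sketch uses only the cycle counts given in the comment (1 cycle when predicted, 3 for an unpredicted direct call, a guessed 6 to 10 for an unpredicted indirect branch) and the 3-wide width. The function name is hypothetical.

```python
def lost_issue_slots(redirect_cycles: int, width: int = 3,
                     predicted_cycles: int = 1) -> int:
    """Issue slots lost versus a zero-bubble (BTB-predicted) taken branch."""
    bubble_cycles = redirect_cycles - predicted_cycles
    return bubble_cycles * width


direct_call = lost_issue_slots(3)       # 2 bubble cycles on a 3-wide core
indirect_best = lost_issue_slots(6)     # lower end of the 6-10 cycle guess
indirect_worst = lost_issue_slots(10)   # upper end of the 6-10 cycle guess

print(direct_call, indirect_best, indirect_worst)
```

The direct-call case reproduces the "full six instructions" figure. On the same assumptions, an unpredicted indirect branch would cost somewhere between 15 and 27 issue slots.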
        
       | drmpeg wrote:
       | The latest Technical Reference Manual for the Eswin EIC7700X is
       | here.
       | 
       | https://github.com/eswincomputing/EIC7700X-SoC-Technical-Ref...
        
       | somanyphotons wrote:
       | I'd love to see some regular-workload benchmarks that compare
       | equal-frequency, on the same fabrication node, same storage etc.
       | A real apples-to-apples shootout
        
       | sakras wrote:
       | I was going to buy one of these until I realized it didn't have
       | vector extensions. I expected something with "Performance" and
       | "Premier" in the name to have them. I think some sort of SIMD
       | capability is table stakes for a lot of workloads these days, so
       | I'm disappointed that there doesn't seem to be a CPU on the
       | market that supports them. I've heard that the vector extensions
       | being stateful makes them particularly hard to implement, which
       | makes me wonder if there needs to be some sort of simpler-to-
       | implement version which mirrors more traditional SIMDs like AVX2
       | and Neon.
        
         | drmpeg wrote:
         | The SpacemiT K1 SoC implements RVV1.0. It can be found on the
         | Banana Pi BPI-F3 and the Milk-V Jupiter boards.
        
           | adgjlsfhk1 wrote:
           | Unfortunately the K1 is horribly slow. It's an in order
           | processor, doesn't have L3 cache and has pretty slow floating
           | point multiplies. It's an OK dev board for riscv-V, but it is
           | closer to Raspberry pi 3 than the P550 which is a lot closer
           | to a Pi 4 for general performance.
        
             | drmpeg wrote:
             | Yes, it was very disappointing. I was hoping the 8 cores
             | would give a speedup for compiling code, but no dice. On a
             | large Linux build, the BPI-F3 with make -j8 takes exactly
             | the same time as a make -j4 on the VisionFive 2.
        
         | adgjlsfhk1 wrote:
         | The SiFive P670 has vector support, and apparently dev boards
         | using it are expected by end of year.
        
           | sakras wrote:
           | Oh that's exciting, I will be on the lookout for that!
        
       | sylware wrote:
       | Slap a good AMD gpu to that, get some rv64 recompiled AAA games,
       | and time to tune performance from there after some
       | QA/debugging... well for high-end desktop performance.
       | 
       | Then, after a little while of tuning, it will be time to access
       | the best silicon process.
        
         | remexre wrote:
         | ...Intel Core 2 -level desktop performance?
         | 
         | Also, I'd imagine you'd want the Ztso extension to port PC
         | games, assuming you mean Rosetta-style instruction translation
         | rather than "somehow get the source and port the engine and all
         | the middleware" -- I don't think the P550 has that extension.
        
           | sylware wrote:
           | What?
           | 
           | A rv64 port on a rv64 elf/linux (using the rv64 glibc) with
           | the AMD mesa drivers. That will certainly reveal where a lot
           | of work will have to be done, at all levels.
           | 
           | And better do that with many AAA games (the nasty and badly
           | coded ones, probably many of them).
           | 
           | Better try to do that work before getting access to the
           | latest silicon process.
        
       | ge96 wrote:
       | That 3D graph is great
        
       ___________________________________________________________________
       (page generated 2025-01-27 23:01 UTC)