[HN Gopher] SiFive's P550 Microarchitecture
___________________________________________________________________
SiFive's P550 Microarchitecture
Author : rbanffy
Score : 128 points
Date : 2025-01-27 10:32 UTC (12 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| turblety wrote:
| Why are these only 1.4 GHz, when the Raspberry Pi gets to
| 2.4 GHz? Is it a limitation of the cost of scale that prevents
| building faster chips? Or does the architecture not really
| support faster chips?
|
| I really hope that RISC-V can take over as a modern architecture,
| adding some competition to Intel/AMD and Arm. But they'll need to
| be able to offer faster chips, or at a minimum more than 4 cores.
|
| Also does anyone know the rate of progress? I believe 10 years
| ago these were at 0.5 MHz?
| rbanffy wrote:
| Seems to be limited mostly by the power and size constraints,
| while it's also fabricated in an older node. The target market
| seems to be simple embedded devices not needing SIMD
| instructions, which is less constrained by the software
| availability. RISC-V is still a very new architecture.
| daef wrote:
| i thought risc-v is >10years "old". how long is an ISA "very
| new"?
| rbanffy wrote:
| x86 is more than 40. ARM is 30-ish.
|
| Most importantly, it's been just a few years since we could
| start getting reasonable RISC-V boards.
| Someone wrote:
| and one likely reason the boards weren't there was
| (https://en.wikipedia.org/wiki/RISC-V#Design):
|
| _"As of June 2019, version 2.2 of the user-space ISA[46]
| and version 1.11 of the privileged ISA[3] are frozen,
| permitting software and hardware development to proceed.
| The user-space ISA, now renamed the Unprivileged ISA, was
| updated, ratified and frozen as version 20191213"_
|
| So, it's more like 5 years old, compared to ~40 for
| 32-bit x86, ~20 for 64-bit x86.
| RobotToaster wrote:
| First ARM processor was 1985, so 40 years almost exactly.
| klelatti wrote:
| I think it's a bit problematic to say ARM is 30-ish years
| old. The company is 34 years old but 64 bit Arm (AArch64)
| which is really very different to its predecessors was
| announced in 2011 so arguably only 14 years old.
| K0balt wrote:
| I'm pretty sure the IP/experience transfer to AArch64 was
| massive compared to starting from scratch.
| beeflet wrote:
| This same point is made in threads discussing how wayland
| protocol is 16 years "old". I think it's different if the
| system starts out as a research project rather than a
| commercial project, because the time until a usable
| implementation is much greater. For example, I would say
| that riscv is "newer" than loongISA/loongarch despite being
| slightly older in a literal sense.
|
| If you look at an arch like x86 or ARM it was designed
| right before chips were released, and then extended over
| time. The same goes for the X protocol, it simply extended
| previous versions.
|
| If you are designing something from the ground up to avoid
| the inherent problems of an existing system, it is
| reasonable to take time and research design problems to
| make sure you don't recreate the same issues (which would
| defeat the point of the redesign). It doesn't compete on
| the same time-frame as an extension of an existing system.
| formerly_proven wrote:
| Privileged spec is only a couple years old and mainline
| Linux only runs on RISC-V since 2022 or something like
| that.
| gliptic wrote:
| > while it's also fabricated in an older node
|
| Is this 7 nm node really older than raspberry pi 5's 16 nm
| node?
| nsteel wrote:
| > This SoC has a 1.4 GHz, quad core P550 cluster with 4 MB
| of shared cache. The EIC7700X is manufactured on TSMC's
| 12nm FFC process
|
| > Next up is TSMC's 12 nm FFC manufacturing technology,
| which is an optimized version of the company's CLN16FFC
| that is set to use 6T libraries (as opposed to 7.5T and 9T
| libraries) providing a 20% area reduction. Despite
| noticeably higher transistor density, the CLN12FFC is
| expected to also offer a 10% frequency improvement at the
| same power and complexity or a 25% power reduction at the
| same clock rate and complexity.
|
| They optimised for density and power, not frequency. A lot
| of the benefit they're claiming comes just from this.
|
| https://www.anandtech.com/show/11337/samsung-and-tsmc-
| roadma...
| beeflet wrote:
| I think that the development of riscv will eke out greater
| market share for Chinese manufacturers, which will have a
| negative effect on the global order.
|
| I am also skeptical that it will lead to more open designs, but
| perhaps it could increase competition enough in the chip design
| space that more open chip designers can make a space for
| themselves, especially if the business of chip fabrication is
| isolated from design.
| cbm-vic-20 wrote:
| I'm surprised China's economic rivals, specifically India,
| haven't made a serious push to bootstrap a home-grown chip
| industry based around RISC-V.
| Arnavion wrote:
| India had
| https://en.m.wikipedia.org/wiki/SHAKTI_(microprocessor) as
| of a few years ago but I haven't followed it. The main
| website and blog seem to have had not much activity since
| 2021.
| daghamm wrote:
| China tried to create a homegrown CPU 15-20 years ago with
| their MIPS variant but that died out. I think this time they
| are much wiser and will pull it off. In 5-8 years we may have
| China CPUs dominating the Asian market at least.
|
| Riscv leading to more open designs is wishful thinking, plus
| probably a large dose of PR. MIPS has been open for years and
| how many open source MIPS designs have we seen so far?
| yjftsjthsd-h wrote:
| > China tried to create a homegrown CPU 15-20 years ago
| with their MIPS variant but that died out.
|
| Er, are you talking about LoongArch? The CPU line that
| https://en.wikipedia.org/wiki/Loongson lists new models of
| this year?
| daghamm wrote:
| Yes, I had no idea they were still working on it.
|
| You could buy their products in the West as a sort of
| low-power PC for a short while, but I think once netbooks
| arrived those just vanished.
| azinman2 wrote:
| Does Intel/AMD/ARM really need more competition? Do you think
| they're stagnant?
|
| As I and others have said before, successful consolidation
| around RISC-V is ultimately a gift to China. Maybe you're
| for that; as an American I am not.
| wbl wrote:
| How is it a gift to China? Architecture isn't where the magic
| is.
| klelatti wrote:
| From the SiFive website [1]
|
| > The Performance P550 scales up to four-core complex
| configurations while delivering 30% higher performance in less
| than half the area of a comparable Arm® Cortex®-A75.
|
| Dylan Patel wasn't impressed by these comparisons with A75 [2]
|
| > @SiFive is claiming half the area and higher perf/GHz, but they
| are using 7nm and 100ns memory latency. Choosing to compare to
| the 10nm A75 on S845, notorious for its high latency at over
| 200ns. Purposely ignoring iso-node or other A75 comparisons.
|
| And this analysis seems to be borne out in this Chips and Cheese
| post.
|
| > As a step along that journey, P550 feels more comparable to one
| of Arm's early out-of-order designs like Cortex A57. By the time
| A75 came out, Arm already accumulated substantial experience in
| designing out-of-order CPUs. Therefore, A75 is a well polished
| and well rounded core, aside from obvious sacrifices required for
| its low power and thermal budgets. P550 by comparison is rough
| around the edges.
|
| So what to make of SiFive's claims? It seems quite an important
| claim / comparison.
|
| [1] https://www.sifive.com/cores/performance-p550
|
| [2] https://x.com/dylan522p/status/1415395415000817664
| bhouston wrote:
| > As a step along that journey, P550 feels more comparable to
| one of Arm's early out-of-order designs like Cortex A57.
|
| If it is as fast as an A57 on a similar node, that would still be
| a major win for RISC-V which so far has been incredibly slow.
| The Nintendo Switch 1 uses Cortex-A57.
| pankajdoharey wrote:
| Even if I try to avoid hyperbole such as saying that, by the time
| SiFive's designs reach maturity comparable to the current Cortex,
| GPT-30 will be out, etc. ...
|
| The fact is that the node difference (7nm vs. 10nm) is critical
| here: SiFive's area/power efficiency gains aren't purely
| architectural but partly process-driven. Even with that
| advantage, matching a 2018 A75 (designed for mobile
| thermal/power limits) in 2024 feels like catching up to ARM's
| rearview mirror. ARM's A720 today benefits from years of
| iterative refinement (cache hierarchies, branch predictors,
| memory subsystems) that aren't easily replicated overnight.
|
| Scaling beyond cores is another hurdle: interconnects, memory
| controllers, and accelerators matter just as much as raw IPC.
| RISC-V's ecosystem (tools, firmware, software optimization)
| also lags ARM's, which could limit adoption even if the P550
| were competitive.
|
| SiFive's claims highlight RISC-V's potential, but until they
| benchmark against modern cores on the same node and demonstrate
| system-level competitiveness (not just microarchitecture wins),
| the gap will persist. That said, disruption takes time--ARM
| wasn't born polished either. The real test is whether SiFive
| can close the maturity deficit before ARM's roadmap (and
| AI-driven heterogeneity) leaves them behind. I doubt it: the gap
| in GPU cores alone between Cortex and M Series is huge, and
| then there are accelerators like NPU cores, which SiFive hasn't
| even started working on yet.
|
| Even Cortex NPUs are behind Apple's M Series, and if large
| companies like Samsung, Qualcomm, and MediaTek lag behind Apple in
| quality ARM chips with a decent GPU, NPU, and on-board memory, what
| hope does SiFive have? At worst, burn investor money and die. At
| best, supply chips for your remote control, washing machine, etc.
| Competing in mainstream applications would not be wise by
| any standards.
| Symmetry wrote:
| It's also worth noting that they're claiming a high perf/GHz,
| not a high perf. It's easy to shrink a chip by using slower
| but more compact libraries that don't increase gate size as
| much for larger fanout at the cost of a lower maximum
| frequency, like AMD's compact cores do. And lowering clock
| speeds means that your main memory latency, measured in clock
| cycles, goes down, increasing perf/GHz too.
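A rough illustration of the last point above: a fixed DRAM latency in nanoseconds costs fewer core cycles at a lower clock. The 100 ns and 200 ns latencies and the 1.4 GHz and 2.8 GHz clocks are figures quoted elsewhere in this thread; the helper name `latency_cycles` is made up for this sketch, and pairing 100 ns with 2.8 GHz is a hypothetical comparison point.

```python
def latency_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert a memory latency in nanoseconds into core clock cycles.

    One GHz is one cycle per nanosecond, so cycles = ns * GHz.
    """
    return latency_ns * clock_ghz


if __name__ == "__main__":
    # (latency in ns, clock in GHz) pairs built from numbers in the thread.
    cases = [
        (100.0, 1.4),  # SiFive's quoted memory latency at the EIC7700X clock
        (100.0, 2.8),  # same DRAM latency behind a core clocked twice as fast
        (200.0, 2.8),  # the S845 latency Dylan Patel cites, at the S845 clock
    ]
    for ns, ghz in cases:
        print(f"{ns:.0f} ns at {ghz} GHz = {latency_cycles(ns, ghz):.0f} cycles")
```

The same DRAM looks twice as "far away" to the core that runs twice as fast, so a per-GHz metric favours the slower core before any microarchitectural difference is counted.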
| hajile wrote:
| TSMC 7nm is 91MTr/mm2 and 10nm is 53MTr/mm2. That's a 1.72x
| increase in density while SiFive is claiming a 2x density
| advantage which still puts it pretty far ahead if the claim is
| accurate and that's without discussing the 30% IPC advantage
| (though final clockspeed equivalence from their claims would
| still put it 35% slower than the S845 at 2.8GHz). The real
| question is about how much more dense could A75 be if they
| lowered target clockspeeds.
|
| Dylan's complaint about comparing to the S845 is mystifying
| ignorance as he should know better.
|
| What other A75 SoCs are there? Exynos used it for their mid
| cores, but the SoC sucked. MediaTek had the Helio P65, but it
| was announced in late 2019 which was basically 2 years after
| S845 was announced at the end of 2017. There were some other
| smaller suppliers from China, but I have no idea who they are.
| S850 existed, but as I recall, it was just a better binning of
| the S845 announced months after the original.
|
| S845 is the ONLY A75 design worth comparing.
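The arithmetic in hajile's comment can be checked with a few lines. This is only a sketch: the density figures, the 2x area claim, the 30% per-clock gain, and the 2.8 GHz S845 clock are taken from the comment as quoted, and the 1.4 GHz P550 clock is an assumption borrowed from the EIC7700X mentioned earlier in the thread (it happens to reproduce the "35% slower" figure). The function names are hypothetical.

```python
# Figures as quoted in the thread; none are independently verified here.
TSMC_7NM_MTR_PER_MM2 = 91.0
TSMC_10NM_MTR_PER_MM2 = 53.0
CLAIMED_AREA_ADVANTAGE = 2.0   # "less than half the area", read as about 2x
CLAIMED_PER_CLOCK_GAIN = 1.30  # "30% higher performance" per GHz
P550_CLOCK_GHZ = 1.4           # assumption: the EIC7700X clock
S845_A75_CLOCK_GHZ = 2.8


def density_ratio() -> float:
    """How much denser the 7nm node is than the 10nm node."""
    return TSMC_7NM_MTR_PER_MM2 / TSMC_10NM_MTR_PER_MM2


def area_advantage_beyond_node() -> float:
    """Area advantage left over after dividing out the node shrink."""
    return CLAIMED_AREA_ADVANTAGE / density_ratio()


def relative_performance() -> float:
    """P550 performance relative to the S845's A75 at their actual clocks."""
    return P550_CLOCK_GHZ * CLAIMED_PER_CLOCK_GAIN / S845_A75_CLOCK_GHZ


if __name__ == "__main__":
    print(f"density ratio:        {density_ratio():.2f}x")
    print(f"area beyond the node: {area_advantage_beyond_node():.2f}x")
    print(f"relative performance: {relative_performance():.2f}")
```

Under these inputs the node shrink accounts for most of the 2x area claim, which is why the open question in the comment (how dense an A75 could be at a lower target clock) matters.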
| phire wrote:
| _> Likely, P550 doesn't have another BTB level. If a branch
| misses the 32 entry BTB, the core simply calculates the branch's
| destination address when it arrives at the frontend_
|
| That seems unwise. Might work well enough for direct branches,
| but it's going to perform very badly on indirect branches. I
| would love to see some tests for indirect branch performance
| (static and dynamic) in your suite.
|
| _> When return stack capacity is exceeded, P550 sees a sharp
| spike in latency. That contrasts with A75's more gentle increase
| in latency._
|
| That might be a direct consequence of the P550's limited BTB.
| Even when the return stack overflows, the A75 can probably still
| predict the return as if it was an indirect branch, utilising its
| massive 3072 entry L1 BTB.
|
| Actually, are you sure the P550 even has a return stack? 16
| correctly predicted call/ret pairs just so happens to be what you
| would get from a 32 entry BTB predicting 16 calls then 16
| returns.
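The hypothesis above (that 16 well-predicted call/ret pairs could come from a 32-entry BTB alone, with no return stack) can be illustrated with a toy model. Everything here is an assumption for illustration: the BTB is modelled as fully associative with LRU replacement, each call and each return is a distinct branch with a single fixed target, and the class and function names are invented. It says nothing about how the P550 is actually built.

```python
from collections import OrderedDict


class LruBtb:
    """Toy branch target buffer: fully associative, LRU replacement."""

    def __init__(self, entries: int = 32):
        self.entries = entries
        self.table = OrderedDict()  # branch PC -> last seen target

    def predict_and_update(self, pc, target) -> bool:
        """Return True if the BTB held the correct target, then train it."""
        hit = self.table.get(pc) == target
        if pc in self.table:
            self.table.move_to_end(pc)      # mark as most recently used
        self.table[pc] = target
        if len(self.table) > self.entries:
            self.table.popitem(last=False)  # evict least recently used
        return hit


def hit_rate(depth: int, iterations: int = 50, entries: int = 32) -> float:
    """Hit rate for `depth` nested calls followed by `depth` returns."""
    btb = LruBtb(entries)
    calls = [("call", i) for i in range(depth)]
    rets = [("ret", i) for i in reversed(range(depth))]
    hits = total = 0
    for it in range(iterations):
        for pc in calls + rets:
            target = ("target", pc)  # each branch always goes to one place
            hit = btb.predict_and_update(pc, target)
            if it > 0:               # skip the cold first pass
                hits += hit
                total += 1
    return hits / total


if __name__ == "__main__":
    for depth in (8, 16, 17, 24):
        print(f"depth {depth:2d}: BTB hit rate {hit_rate(depth):.2f}")
```

In this model 16 calls plus 16 returns exactly fill the 32 entries, and one more pair makes the LRU thrash, which would look like a sharp cliff rather than a gentle slope. A real return stack also handles functions called from several sites, which this single-call-site toy deliberately avoids.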
| monocasa wrote:
| Calls don't need to be predicted though since they're
| unconditional. So 16 call/ret pairs should only be 16 BTB
| entries.
| IshKebab wrote:
| You can predict a dynamically dispatched call surely?
| colejohnson66 wrote:
| You still need to predict the target if it's indirect
| eigenform wrote:
| Modern machines usually try to predict target addresses, not
| just the direction of conditional branches. You can implement
| it for unconditional calls and jumps too, even for
| direct/relative-addressed ones. That's pretty common
| nowadays.
|
| When you calculate the target of a jump, you can cache it
| (that's what a BTB is). Next time you encounter it, you
| predict the target by accessing the cached value in the BTB
| and start fetching early instead of waiting for your
| jump/call to move all the way through the machine.
| phire wrote:
| You need to predict the target (with BTB) if you want zero-bubble
| calls (1 cycle latency).
|
| Otherwise it takes 3 cycles to take a direct call. Doesn't
| matter that it's unconditional, it can't see the call until
| after decoding. Sure, it's only two extra cycles, but that's
| a full six instructions on this small 3-wide core.
|
| And an indirect call? Even if the target is known, it's going
| to need to fetch it from the register file, which requires
| going through rename. And unless you put a fast-path in with
| an extra register-file read-port, you need to go through
| dispatch and the scheduler too. Probably takes 6-10 cycles to
| take an indirect branch without a BTB.
|
| On bigger designs it's even more essential to predict
| unconditional branches, as their instruction caches take
| multiple cycles, and then there are quite substantial queues
| between fetch and decode, and between predict and fetch.
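The cycle counts in phire's comment translate into lost instruction slots as follows. This is a back-of-the-envelope sketch using only the numbers from the comment (1 cycle when predicted, 3 cycles for an unpredicted direct call, 6 to 10 cycles for an unpredicted indirect branch, a 3-wide core); the function name is hypothetical.

```python
def lost_issue_slots(taken_cycles: int, width: int = 3,
                     predicted_cycles: int = 1) -> int:
    """Instruction slots lost when a taken branch is not predicted by the BTB.

    taken_cycles:     cycles until the frontend redirects without a BTB hit
    width:            instructions the core can handle per cycle
    predicted_cycles: cycles for a correctly predicted (zero-bubble) branch
    """
    bubbles = taken_cycles - predicted_cycles
    return bubbles * width


if __name__ == "__main__":
    print("direct call, no BTB hit:   ", lost_issue_slots(3), "slots")
    print("indirect branch, best case:", lost_issue_slots(6), "slots")
    print("indirect branch, worst case:", lost_issue_slots(10), "slots")
```

The direct-call case gives the "full six instructions" mentioned above; the indirect cases follow from the same formula applied to the 6-10 cycle estimate.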
| drmpeg wrote:
| The latest Technical Reference Manual for the Eswin EIC7700X is
| here.
|
| https://github.com/eswincomputing/EIC7700X-SoC-Technical-Ref...
| somanyphotons wrote:
| I'd love to see some regular-workload benchmarks that compare
| equal-frequency, on the same fabrication node, same storage etc.
| A real apples-to-apples shootout
| sakras wrote:
| I was going to buy one of these until I realized it didn't have
| vector extensions. I expected something with "Performance" and
| "Premier" in the name to have them. I think some sort of SIMD
| capability is table stakes for a lot of workloads these days, so
| I'm disappointed that there doesn't seem to be a CPU on the
| market that supports them. I've heard that the vector extensions
| being stateful makes them particularly hard to implement, which
| makes me wonder if there needs to be some sort of simpler-to-
| implement version which mirrors more traditional SIMDs like AVX2
| and Neon.
| drmpeg wrote:
| The SpacemiT K1 SoC implements RVV1.0. It can be found on the
| Banana Pi BPI-F3 and the Milk-V Jupiter boards.
| adgjlsfhk1 wrote:
| Unfortunately the K1 is horribly slow. It's an in-order
| processor, doesn't have L3 cache and has pretty slow floating
| point multiplies. It's an OK dev board for riscv-V, but it is
| closer to a Raspberry Pi 3 than the P550, which is a lot closer
| to a Pi 4 for general performance.
| drmpeg wrote:
| Yes, it was very disappointing. I was hoping the 8 cores
| would give a speedup for compiling code, but no dice. On a
| large Linux build, the BPI-F3 with make -j8 takes exactly
| the same time as a make -j4 on the VisionFive 2.
| adgjlsfhk1 wrote:
| The SiFive P670 has vector support, and apparently dev boards
| using it are expected by end of year.
| sakras wrote:
| Oh that's exciting, I will be on the lookout for that!
| sylware wrote:
| Slap a good AMD gpu to that, get some rv64 recompiled AAA games,
| and time to tune performance from there after some
| QA/debugging... well for high-end desktop performance.
|
| Then, after a little while of tuning, it will be time to access
| the best silicon process.
| remexre wrote:
| ...Intel Core 2 -level desktop performance?
|
| Also, I'd imagine you'd want the Ztso extension to port PC
| games, assuming you mean Rosetta-style instruction translation
| rather than "somehow get the source and port the engine and all
| the middleware" -- I don't think the P550 has that extension.
| sylware wrote:
| What?
|
| A rv64 port on a rv64 elf/linux (using the rv64 glibc) with
| the AMD mesa drivers. That will reveal where certainly a lot
| of work will have to be done, and that at all levels.
|
| And better do that with many AAA games (the nasty and badly
| coded ones, probably many of them).
|
| Better try to do that work before getting access to the
| latest silicon process.
| ge96 wrote:
| That 3D graph is great
___________________________________________________________________
(page generated 2025-01-27 23:01 UTC)