[HN Gopher] VRoom! A high-end RISC-V implementation
___________________________________________________________________
VRoom! A high-end RISC-V implementation
Author : cmurf
Score : 108 points
Date : 2022-03-21 16:07 UTC (6 hours ago)
(HTM) web link (moonbaseotago.github.io)
(TXT) w3m dump (moonbaseotago.github.io)
| titzer wrote:
| This is a very ambitious project, so respect and good luck.
|
 | I am wondering if the performance will pan out in practice, as it
 | doesn't seem to have a very deep pipeline, so getting high
 | clock speeds may be a challenge. In particular the 5 clock branch
 | mispredict penalty suggests the pipeline design is fairly simple.
| Production CPUs live and die by the gate depth and hit/miss
| latency of caches and predictors. A longer pipeline is the
| typical answer to gate delay issues. Cache design (and register
| file design!) is also super subtle; L1 is extremely important.
| evilos wrote:
| They mention in their arch slides that they expect to add at
| least 2 more pipeline stages to hit higher clocks.
| Taniwha wrote:
 | As mentioned here I expect that reality will intrude and the
 | pipe will get bigger - of course a good BTC (and spending lots
 | of gates on it) is important because that's what mitigates that
 | deep pipe.
|
 | I haven't published my latest work (coming at the end of the
 | week): I have a minor bump to ~6.5 DMIPS/MHz - Dhrystone isn't
 | everything but it's still proving a useful tool to tweak the
 | architecture (which is what's going on now)
| blacklion wrote:
| > Eventually we'll do some instruction combining using this
| information (best place may be at entry to I$0 trace cache), or
| possibly at the rename stage
|
| So much for "we will do only simplest of commands and u-op fusing
| will fix performance".
|
| It is why I'm very suspicious about this argument from RISC-V
| proponents.
| Taniwha wrote:
| I think that we need lots of trace before we decide which ops
| make sense to combine
| blacklion wrote:
 | As far as I understand, RISC-V proponents want to have
 | "recommended" instruction sequences for compilers, to avoid the
 | situation where different RISC-V CPUs need different
 | compilations. If different RISC-V implementations have
 | different "fuseable" instruction sequences, we will be in the
 | dreadful situation where you need an exact "-mcpu" for
 | decent performance and binary packages will be far from
 | optimal.
 |
 | And such "conventions" are a bad idea, like comments in code,
 | IMHO. They cannot be checked by tools, etc.
| tsmi wrote:
| > you will need exact "-mcpu" for decent performance
|
| For some definitions of decent, I think that ship has
| sailed.
|
| https://clang.llvm.org/docs/CrossCompilation.html
|
 | -target <triple>
 |
 | The triple has the general format
 | <arch><sub>-<vendor>-<sys>-<abi>, where:
 |
 |     arch   = x86_64, i386, arm, thumb, mips, etc.
 |     sub    = for ex. on ARM: v5, v6m, v7a, v7m, etc.
 |     vendor = pc, apple, nvidia, ibm, etc.
 |     sys    = none, linux, win32, darwin, cuda, etc.
 |     abi    = eabi, gnu, android, macho, elf, etc.
 |
 | Note, none of those are exhaustive...
| ncmncm wrote:
| It is always frustrating when you have put in the work to
| optimize code, and turn out to have pessimized it for the
| next chip over.
|
| The extremum for this is getting a 10x performance boost by
| using, e.g., POPCNT, and suffering instead a 10-100x
| pessimization because POPCNT is trapped and emulated.
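 |
 | (A C sketch of that hazard - the __POPCNT__ guard is real
 | GCC/Clang behaviour on x86; the trap-and-emulate scenario is
 | the hypothetical above:)
 |
 |     #include <stdint.h>
 |
 |     /* The same source can be fast or pessimal depending on
 |      * whether the target chip really executes POPCNT natively. */
 |     uint64_t count_bits(uint64_t x)
 |     {
 |     #if defined(__POPCNT__)
 |         /* compilers emit the native instruction here */
 |         return (uint64_t)__builtin_popcountll(x);
 |     #else
 |         /* portable SWAR fallback - no exotic instructions */
 |         x = x - ((x >> 1) & 0x5555555555555555ULL);
 |         x = (x & 0x3333333333333333ULL) +
 |             ((x >> 2) & 0x3333333333333333ULL);
 |         x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
 |         return (x * 0x0101010101010101ULL) >> 56;
 |     #endif
 |     }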
| themerone wrote:
| What does GPL mean for a chip design?
|
 | I understand how it applies to the HDL, but I doubt that it
 | obligates you to open your code to users of physical chips.
| Taniwha wrote:
 | Well (author here) - this is a private project - typically such
 | a project would be very proprietary - people don't get to show
 | their work.
 |
 | But I'm looking to find someone to build this thing, it's been
 | a while since I last built chips (the last CPU I helped design
 | never saw the light of day due to reasons that had little to do
 | with how well it worked). So I need a way to show it off, show
 | it's real. So GPLing it is a great way to do that - as is
 | showing up on HN (thanks to whoever posted this :-).
|
| In practice the RTL level design of a processor is only a part
| of making a real processor - a real VRoom! would likely have
| hand built ALUs, shifters, caches, register files etc those
| things are all in the RTL at a high level but are really
| different IP - likely they'd be entangled with GPL and a
| manufacturer might feel that to be an issue.
|
| However I'm happy to dual license (I want to get it built, and
| maybe get paid to do it).
|
| Also about half the companies building RISCVs are in China
| (I've been building open source hardware in China for a decade
| or so now, so I know there's lots of smart people there) - they
| have a real problem (in the West) building something like this
| - all the rumors about supply chain/etc stuff - having an open
 | sourced GPL'd reference that's cycle accurate is a way to help
| build confidence.
| Taniwha wrote:
 | One other comment about why GPLing something is important for
 | someone like me - publishing my 'secrets' is a great way to
 | turn them into "prior art" - you read it here first, you
 | can't patent it now - I can protect my ideas from becoming
 | part of someone else's protected IP by publishing them.
 |
 | I spent a few years working on an x86 clone, I had maybe 10
 | (now expired) patents on how to get around stupidly obvious
 | things that Intel had patented (or around others' patented
 | ways to get around Intel) - frankly from a technical POV it
 | was all a lot of BS, including my patents
| wmf wrote:
| It means "pay me to remove the GPL". It's fake GPL like MySQL
| and MongoDB.
| homarp wrote:
| https://www.fsf.org/blogs/rms/selling-exceptions
|
| RMS wrote "I've considered selling exceptions acceptable
| since the 1990s, and on occasion I've suggested it to
| companies. Sometimes this approach has made it possible for
| important programs to become free software."
| Someone wrote:
| I guess you could argue that, if you bought a device with this
| CPU, you should be able to replace the CPU with one of your own
| that's derived from this one.
|
| I think that's the spirit of the GPL in a hardware context, but
| I don't think it's a given (by a long stretch) that courts
| would accept that argument.
|
| A somewhat clearer case would be if you bought a device that
| implements a GPL licensed design in a FPGA. I think you could
| argue such devices cannot disable the reprogrammability of the
| FPGA.
| dmoreno wrote:
 | IANAL, but as far as I know it's very important that it's GPLv3,
 | which means the anti-tivoization clause applies, which means that
 | hardware that uses this firmware must provide full source code
 | and a way to let you use your own firmware.
 |
 | If somehow this code is not in a firmware... no idea.
| marcodiego wrote:
 | AFAICS, it is the same as software: if you change it and
 | distribute it, you have to provide your changes if asked to.
| bragr wrote:
| Also IANAL, but as I understand it, the HDL would compile down
| to a sequence of gates, and presumably we'd treat that the same
| way as a binary - a "Non-Source Form" as the GPL calls it. So
| anyone that receives a copy of those gates (either as a binary
| blob for a FPGA, or pre-flashed on a FPGA, or made on actual
| silicon) would be entitled to the source as per GPL3 section 6
| "Conveying Non-Source Forms".
|
 | I don't think the GPL anti-tivoization clause has much bearing
 | there, other than that presumably you'd have to provide the full
 | tool chain that resulted in the final gates - presumably this
 | would affect companies producing actual chips the most, since you
 | couldn't have any proprietary optimization or layout steps in
 | producing the actual chip design, though also no DRM for FPGAs
 | (is that even a thing?)
| Taniwha wrote:
| Author here (Paul Campbell) - AMA
| ncmncm wrote:
| Are you making any attempt at a learning branch predictor? Is
| anything published about really-current methods?
| Taniwha wrote:
 | Not yet - I have a pretty generic combined bimodal/global
 | predictor - there's a lot of research on BTCs - it's easy to
 | throw gates at this area - I can imagine chips spending 20-30%
 | of their area on BTC just to keep the rest running
|
| My next set of work in this area will be integrating an L0
| trace cache into the existing BTC - that will help me greatly
| up the per-clock issue rate
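 |
 | (For reference, a textbook sketch in C of the "bimodal" half of
 | such a combined predictor - a table of 2-bit saturating
 | counters; the table size and names are invented, not VRoom!'s:)
 |
 |     #include <stdbool.h>
 |     #include <stdint.h>
 |
 |     /* 2-bit saturating counters, indexed by low PC bits */
 |     #define BIMODAL_BITS 12
 |     static uint8_t counters[1 << BIMODAL_BITS];
 |
 |     static unsigned slot(uint64_t pc)
 |     {
 |         return (pc >> 2) & ((1u << BIMODAL_BITS) - 1);
 |     }
 |
 |     /* counter values 2,3 mean "predict taken" */
 |     bool predict_taken(uint64_t pc)
 |     {
 |         return counters[slot(pc)] >= 2;
 |     }
 |
 |     /* train toward the actual outcome, saturating at 0 and 3 */
 |     void train(uint64_t pc, bool taken)
 |     {
 |         uint8_t *c = &counters[slot(pc)];
 |         if (taken && *c < 3) (*c)++;
 |         if (!taken && *c > 0) (*c)--;
 |     }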
| titzer wrote:
| As a language VM implementor, I would really love to have a
| conditional call instruction, like arm32. AFAICT this would be
| a relatively simple instruction to implement in the CPU. Is
| that accurate?
| Taniwha wrote:
| yes and no - there's a few issues here:
|
 | 1 - architectural - RISCV has a nice clean ISA, it's adding
 | instructions quickly, CMOV is a contentious issue there - I'm
 | not an expert on the history so I'll let others relitigate it
 | - it's easy to add new instructions to a RISCV machine,
 | unlike Intel/ARM it's even encouraged - however adding a new
 | instruction to ALL machines is more difficult and may take
 | many years. But unlike Intel/ARM there IS a process to adopt
 | new instructions that doesn't involve just springing them on
 | your customers
 |
 | 2 - remember RISCV is a no-condition-code architecture - that
 | would make CMOV require 3 register file ports (the only such
 | instruction that also requires an adder [for the compare]) -
 | register file ports are extremely expensive, especially for
 | just 1 instruction (a small C model of this follows below)
 |
 | 3 - micro-architectural - on simple pipes CMOV is pretty
 | simple (you just inhibit the register write, plus do something
 | special with register bypass); I'd have to think very hard
 | about how to do it on something like VRoom! with out-of-
 | order, speculative, register renaming - I can see a naive way
 | to do it, but ideally there should be a way to nullify such
 | an instruction early in the pipe, which would mean some sort
 | of renaming-on-the-fly hack
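 |
 | (A minimal C model of point 2 - the encoding is hypothetical,
 | RISCV has no such instruction today; a form comparing two
 | registers, as described above, would additionally need the
 | adder:)
 |
 |     #include <stdint.h>
 |
 |     /* cmov rd, rc, rs  =>  rd = (rc != 0) ? rs : rd
 |      * The not-taken path must preserve rd, so the old rd value
 |      * is a third register-file read alongside rc and rs -
 |      * hence the three read ports mentioned above. */
 |     static inline void cmov(uint64_t regs[32], int rd, int rc,
 |                             int rs)
 |     {
 |         regs[rd] = (regs[rc] != 0) ? regs[rs] : regs[rd];
 |     }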
| titzer wrote:
 | Note I was talking about a conditional _call_ instruction,
 | which is very useful for, e.g., safety checks.
| Taniwha wrote:
 | conditional CALL is MUCH harder to implement well - that's
 | because the call part essentially happens up at the
 | PC/BTC end of the CPU, while at the execution stage what
 | you're doing is writing the saved PC to the LR/etc and
 | doing the register compare (or accessing a condition code
 | that may not have been calculated yet).
 |
 | In many ways I guess it's a bit like a conditional branch
 | that needs a write port - in RISCV, without condition
 | codes, your conditional call's relative branch distance
 | will be smaller because the instruction encoding will
 | need to encode 2-3 registers
| dzaima wrote:
| I imagine something like that might be viable in the to-
| be-designed RISC-V J extension, as safety checks (mostly
| in JITs) would be close to the only thing benefiting from
| this.
|
| Though, maybe instead of a conditional call, a
| conditional signal could do, which'd clearly give no
| expectation of performance if it's hit, simplifying the
| hardware effort required.
| Taniwha wrote:
| Yeah, I can imagine that being particularly easy to
| implement in VRoom! exceptions are handled synchronously
| at the end of the pipe (with everything before them
| already committed, and everything after flushed).
| Instructions that can convert to exceptions (like loads
| and store taking TLB misses) essentially hit two
| different functional units - a conditional exception
| would be tested in a branch/ALU unit and then transition
| into an effective a no-op or transition into an d
| exception and synchronise the pipe when they hit the
| commit stage
| kragen wrote:
| 8080 had it too, 8086 dropped it due to disuse. In a modern
| context it's just a performance hack, an alternative to
| macro-op fusion, but for high-performance RISC-V (or i386, or
| amd64, or aarch64) you need macro-op fusion anyway.
| sitkack wrote:
 | What does your benchmarking workflow look like? I am interested
 | in:
 |
 | * From a high level what does your dev iteration look like?
 | * Getting instruction traces, timing and resimulating those
 |   traces
 | * Power analysis, timing analysis (do you do this as part of
 |   performance simulation)?
 | * Do you benchmark the whole chip or specific sub units?
 | * How do you choose what to focus on in terms of performance
 |   enhancements?
 | * What areas are you focusing on now?
 | * What tools would make this easier?
| Taniwha wrote:
 | At the moment I'm just starting to work my way up the
 | hierarchy of benchmarks; dhrystone's been useful though it's
 | nearing the end of its usefulness - I built the big FPGA
 | version (on an AWS FPGA instance) to give me a place to run
 | bigger things exactly like this.
|
| I currently run low level simulations in Verilator where I
| can easily take large internal architectural trace, and
| bigger stuff on AWS (where that sort of trace is much much
| harder)
|
| I haven't got to the power analysis stage - that will need to
| wait until we decide to build a real chip - timing will
| depend on final tools if we get to build something real,
| currently it's building on Vivado for the FPGA target.
|
 | Mostly I'm doing whole chip tests - getting everything to
 | work well together is sort of the area I'm focusing on at the
 | moment (correctness was the previous goal - being together
 | enough to boot linux), the past 3 months I've brought the
 | performance up by a factor of 4 - the trace cache might get
 | me 2x more if I'm lucky.
|
| I spend a lot of time looking at low level performance, at
| some level I want to get the IPC (instructions per clock) of
| the main pipe as high as I can so I stare at the spots where
| that doesn't happen
|
| I'm using open source tools (thanks everyone!)
| tromp wrote:
| > dhrystone's been useful though it's nearing the end of
| its use
|
| Would my fhourstones [1] [2] benchmark be of any use?
|
| [1] https://tromp.github.io/c4/fhour.html
|
| [2] https://openbenchmarking.org/test/pts/fhourstones
| Taniwha wrote:
| thanks I'll have a look - I'm not so interested in raw
| scores, more about relative numbers so I can judge
| different architectural experiments
| [deleted]
| gary_0 wrote:
| From what little I know about microarchitecture, this seems
| extremely impressive. Hopefully these aren't dumb questions:
|
| Are there GPL'd designs for PCIe, USB, etc, that could be used
| to incorporate this into a SoC design? If not, how much work is
| that compared to this?
|
| Also, what other kind of technical considerations would be
| involved to make this into a "real" chip on something like
| 28nm?
| Taniwha wrote:
 | Great questions - I'm using an open source UART from someone
 | else, and for the AWS FPGA system I have a 'fake' disk
 | driver plus timers/interrupt controllers etc
 |
 | So far I haven't needed USB/ether/PCIe/etc - I've sort of
 | sketched out a place for those to live - I think that for a
 | high end system like this one you can't just plug something
 | in - real performance needs some consideration of how:
 |
 | - cache coherency works
 | - VM and virtual memory work (essentially page tables for IO
 |   devices)
 | - PMAP protections from I/O space work (so that devices can't
 |   bypass the CPU PMAPs that are used to manage secure enclaves
 |   in machine mode)
 |
 | So in general I'm after something unique, or at least
 | slightly bespoke.
 |
 | I also think there's a bit of a grand convergence going on in
 | this area around SerDes, which are sort of becoming a new
 | generic interface - PCIe, high speed Ethernet, new USBs, disk
 | drives etc are all essentially bunches of SerDes with
 | different protocol engines behind them - a smart SoC is going
 | to split things this way for maximum flexibility
| rwmj wrote:
| Don't know much about the details, but this company /
| person claims to have developed some open source IP:
| http://www.enjoy-digital.fr/
| Lramseyer wrote:
| Not Paul Campbell, but I'll share what I know on the matter.
|
| So GPL'd IO blocks - This is a great question, and something
| I have definitely been asking myself! One thing to keep in
| mind is that IO interfaces like PCIe, USB, and whatnot have a
| Physical interface ("Phy" for short.) Those contain quite a
| bit of analog circuitry, which is tied to the transistor
| architecture that's used for the design.
|
 | That being said, a lot of interfaces that aren't DRAM
 | protocols use what's known as a SerDes Phy (short for
 | Serializer/De-serializer Physical interface). More or less,
 | they have an analog front end and a digital back end, and the
 | digital back end that connects to everything else is
 | standardized in a somewhat common way. So it wouldn't be
 | unreasonable to try to build something like an open PCIe
 | controller that only has the Transaction Layer and Data Link
 | Layer. While there are various timing concerns/constraints
 | when not including a Phy layer (the lowest layer), I don't
 | think it's impossible.
|
| The other big challenge is that anyone wanting to use an open
| source design will definitely want the test benches and test
| cases included in the repo (you can think of them like unit
| tests.) Unfortunately, most of the software to actually
| compile and run those simulations is cost prohibitive for an
| individual, because it's licensed software. Also, the
| companies that develop this software make a ton of money
| selling things like USB and PCIe controllers, so I'll let you
| draw your own conclusions about the incentives of that
| industry.
|
| Even if you were able to get your hands on the software, the
| simulations are very computationally intensive, and
| contribution by individuals would be challenging ...though
| not impossible!
|
| Despite those barriers, it's a direction that I desperately
| want to see the industry move towards, and I think it's
| becoming more and more inevitable as companies like Google
| get involved with hardware, and try to make the ecosystem
| more open. Chiplet architectures are also all the rage these
| days, so it would be less of a risk for a company to attempt
| to use an open source design.
|
| I'd really be curious to hear Paul Campbell's take on this
| question though. He definitely knows a lot more than I do!
| tsmi wrote:
| One advantage of SkyWater opening its PDK is Universities
| are starting to back fill all the hardware that is missing.
|
| Here's a SerDes from Purdue. I don't think this particular
| design has been validated in silicon yet though.
|
| https://arxiv.org/abs/2105.13256
| black_puppydog wrote:
| Do you dance? :)
|
| https://youtu.be/nlu0foF3WBk?t=182
|
| I know, I'm leaning hard on that second "A" there. :D
| Taniwha wrote:
| heh! - I'm a Kiwi who lived and worked in Silicon Valley for
| 20 years, moved back when the kids started high school, but
| mostly still work there - while I was there I started a
| company using "Taniwha" ... great for a logo, but a mistake
| because of course no one in the US knows how to pronounce it
| (pro-tip the "wh" is most close to an english "f")
| tsmi wrote:
| Have you considered making an ASIC of your design?
| https://efabless.com/open_shuttle_program
| Taniwha wrote:
| It's likely too big for those programs - I am (just now)
| starting a build with the Open Lane/Sky tools not with the
| intent of actually taping out but more to squeeze the
| architectural timing (above the slow FPGA I've been using for
| bringing up Linux) so I can find the places where I'm being
| stupidly unreasonable about timing (working on my own I can't
| afford a Synopsys license)
| tsmi wrote:
| Gotcha. Did you run into any issues with yosys given that
| it has limited system verilog support?
|
| Ibex needed to add a pass with sv2v
| https://github.com/lowRISC/ibex/tree/master/syn
| Taniwha wrote:
 | I'm just starting this week, I've recently switched to
 | some use of SV interfaces and it does not like arrays of
 | them - sv2v seems the way to go - but even without that
 | yosys goes bang! - something's too big, though Vivado
 | compiles the same stuff - I rearchitected the bit that most
 | obviously might be doing this but no luck so far.
| tux3 wrote:
 | Any thoughts about higher level HDLs embedded in software
 | languages, like Chisel, nMigen, or others? Some other RISC-V
 | core designers claim they've had increased productivity with
 | those.
 |
 | It seems that despite a lot of valid criticism against
 | (System)Verilog, nothing really seems to be on a trajectory to
 | replace it today. I'm not sure if that's purely inertia
 | (existing tooling, workflows, methodologies), other HDLs not
 | being attractive enough, or maybe Verilog is just good enough?
| Taniwha wrote:
| I think they're great - I earned my VLSI chops building stuff
| in the 90s and I can write Verilog about as fast as I can
| think so it's my goto language. I've also written a couple of
| compilers over the years so I know it really well (you can
| thank me for the ' _' in "always @(_)"). That's just my
| personal bias.
|
 | Inertia in tooling is a REALLY BIG deal - if you can't run
 | your design through simulation (and FPGA simulation),
 | synthesis, layout/etc you'll never build a chip - it can take
 | 5-10 years for a new language feature to become ubiquitous
 | enough that you can depend on it enough to use it in a
 | design (I've been struggling with this using SystemVerilog
 | interfaces this month).
|
| If you look closely at VRoom! you'll see I'm stepping beyond
| some Verilog limitations by adding tiny programs that
| generate bespoke bits of Verilog as part of the build process
| - this stops me from fat fingering some bit in a giant
| encoder but also helps me make things that SV doesn't do so
| well (big 1-hot muxes, priority schedulers etc)
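 |
 | (A minimal sketch of that technique - not VRoom!'s actual
 | generator; the module name, ports and widths are invented - a
 | tiny C program that emits the Verilog for an N-way one-hot
 | AND-OR mux so nobody fat-fingers the expansion by hand:)
 |
 |     #include <stdio.h>
 |
 |     int main(void)
 |     {
 |         const int n = 8;    /* number of one-hot inputs */
 |         const int w = 64;   /* data width in bits       */
 |         int i;
 |
 |         printf("module onehot_mux(\n");
 |         printf("    input  [%d:0] sel,\n", n - 1);
 |         for (i = 0; i < n; i++)
 |             printf("    input  [%d:0] in%d,\n", w - 1, i);
 |         printf("    output [%d:0] out);\n\n", w - 1);
 |
 |         /* AND each input with its replicated select bit,
 |          * then OR the results together */
 |         printf("    assign out =\n");
 |         for (i = 0; i < n; i++)
 |             printf("        ({%d{sel[%d]}} & in%d)%s\n",
 |                    w, i, i, i == n - 1 ? ";" : " |");
 |         printf("endmodule\n");
 |         return 0;
 |     }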
| Taniwha wrote:
| err HN swallowed my * there as in: "(you can thank me for
| the '*' in "always @(*)")"
| snakke wrote:
| As an aside, the latest and active development of nMigen has
| been rebranded a few months ago to Amaranth and can be found
| here: https://github.com/amaranth-lang/amaranth . In case
| people googled nMigen and came to the repository that hasn't
| been updated in two years.
| [deleted]
| codedokode wrote:
 | The presentation was interesting; but I would like to write up
 | an idea that is tangentially related to this CPU.
 |
 | I noticed that modern CPUs are optimized for legacy monolithic
 | OS kernels like Linux or Windows. But having a large,
 | multimegabyte kernel is a bad idea from a security standpoint.
 | A single mistake or intentional error in some rarely used
 | component (like a temperature sensor driver) can get an
 | attacker full access to the system. Again, an error in any part
 | of a monolithic kernel can cause system failure. And the Linux
 | kernel doesn't even use static analysis to find bugs! It is
 | obvious that using microkernels could solve many of the issues
 | above.
|
| But microkernels tend to have poor performance. One of the
| reasons for this could be high context switch latency. CPUs with
| high context switch latency are only good for legacy OSes and not
| ready for better future kernels. Therefore, either we will find a
| way to make context switches fast or we will have to stay with
| large, insecure kernels full of vulnerabilities.
|
 | So I was thinking about what could be done here. For example,
 | one thing that could be improved is to get rid of the address
 | space switch. It causes flushes of various caches and it hurts
 | performance. Instead, we could always use a single mapping from
 | virtual to physical addresses, but allocate each process a
 | different virtual address range. To implement this, we could
 | add two registers, which would hold the minimum and maximum
 | accessible virtual addresses. It should be easy to check each
 | address against them to prevent speculative out-of-bounds
 | memory accesses (see the sketch at the end of this comment).
|
 | By the way, the 32-bit x86 architecture had segments that could
 | be used to divide a single address space between processes.
|
| Another thing that can take time is saving/restoring registers on
| context switch. One way to solve the problem could be to use
| multiple banks (say, 64 banks) of registers that can be quickly
| switched, another way would be to zero out registers on return
| from kernel and let processes save them if they need it.
|
| Or am I wrong somewhere and fast context switches cannot be
| implemented this way?
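 |
 | (A C model of the two-register idea above - all names
 | invented:)
 |
 |     #include <stdbool.h>
 |     #include <stdint.h>
 |
 |     /* Every process gets a window in one shared virtual
 |      * address space, bounded by two per-process registers,
 |      * instead of its own page tables. */
 |     typedef struct {
 |         uint64_t vaddr_min;  /* lowest allowed virtual address  */
 |         uint64_t vaddr_max;  /* highest allowed virtual address */
 |     } bounds_regs;
 |
 |     /* The hardware check is just two compares against
 |      * registers, so even speculative out-of-window accesses
 |      * can be rejected early, before any cache side effects. */
 |     static inline bool access_ok(const bounds_regs *b,
 |                                  uint64_t va)
 |     {
 |         return va >= b->vaddr_min && va <= b->vaddr_max;
 |     }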
| db65edfc7996 wrote:
| >But microkernels tend to have poor performance.
|
| Citation needed. What kind of hit are we talking about? 5%?
| 90%? We have supercomputers from the future that have capacity
| to spare. I would be willing to take an enormous performance
| hit for better security guarantees on essential infrastructure
| (routers, firewalls, file servers, electrical grid, etc).
| kragen wrote:
| SASOSes are interesting, sometimes extending a 64-bit address
| space to cover a whole cluster, but they aren't compatible with
| anything that calls fork().
|
| The various variants of L4 have pretty good context-switch
| latency even on traditional CPUs, and seL4 in particular is
| formally proven correct on a few platforms. Spectre+Meltdown
| mitigation was painful for them, but they're still pretty good.
|
| Lots of microcontrollers have no MMUs but do have MPUs to keep
| a user task from cabbaging the memory of the kernel or other
| tasks. Not sure if any of them use the PDP-11-style base+offset
| segment scheme you're describing to define the memory regions.
|
| Protected-memory multitasking on a multicore system doesn't
| need to involve context switches, especially with per-core
| memory.
|
| Even on Linux, context switches are cheap when your memory map
| is small. httpdito normally has five pages mapped and takes
| about 100 microseconds (on a 2.8GHz amd64 laptop) to fork,
| serve a request, and exit. I think I've measured context
| switches a lot faster than that between two existing processes.
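 |
 | (The classic way to measure that yourself is a pipe ping-pong
 | between two processes - a rough sketch, results vary by
 | machine and kernel:)
 |
 |     #include <stdio.h>
 |     #include <time.h>
 |     #include <unistd.h>
 |
 |     int main(void)
 |     {
 |         int a[2], b[2];
 |         char c = 0;
 |         const int iters = 100000;
 |         struct timespec t0, t1;
 |
 |         if (pipe(a) || pipe(b)) return 1;
 |         if (fork() == 0) {           /* child: echo bytes back */
 |             while (read(a[0], &c, 1) == 1)
 |                 write(b[1], &c, 1);
 |             _exit(0);
 |         }
 |         clock_gettime(CLOCK_MONOTONIC, &t0);
 |         for (int i = 0; i < iters; i++) {
 |             /* each round trip forces at least two switches */
 |             write(a[1], &c, 1);
 |             read(b[0], &c, 1);
 |         }
 |         clock_gettime(CLOCK_MONOTONIC, &t1);
 |         printf("%.0f ns per round trip\n",
 |                ((t1.tv_sec - t0.tv_sec) * 1e9 +
 |                 (t1.tv_nsec - t0.tv_nsec)) / iters);
 |         return 0;
 |     }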
|
 | Multiple register banks for context switching go back to the
 | CDC 6600's peripheral processor (FEP) or maybe the TX-0 on
 | which Sutherland wrote SKETCHPAD; it has a lot of advantages
 | beyond potentially cheaper IPC. Register bank switching for
 | interrupt handling was one of the major features the Z80 had
 | over the 8080 (you can think of the interrupt handler as being
 | the kernel). The Tera MTA in the 1990s was at least widely
 | talked about if not widely imitated. Switching register sets is
 | how "SMT" works and also sort of how GPUs work. And today
 | Padauk's "FPPA" microcontrollers (starting around 12 cents
 | IIRC) use register bank switching to get much lower I/O latency
 | than competing microcontrollers that must take an interrupt and
 | halt background processing until I/O is complete.
|
| Another alternative approach to memory protection is to do it
| in software, like Java, Oberon, and Smalltalk do, and Liedtke's
| EUMEL did; then an IPC can be just an ordinary function call.
| Side-channel leaks like Spectre seem harder to plug in that
| scenario. GC may make fault isolation difficult in such an
| environment, particularly with regard to performance bugs that
| make real-time tasks miss deadlines, and possibly Rust-style
| memory ownership could help there.
| codedokode wrote:
| What I would like to have is a context switch latency
| comparable to a function call. For example, if in a
| microkernel system bus driver, network card driver, firewall,
| TCP stack, socket service are all separate userspace
| processes, then every time a packet arrives there would be a
| context-switching festival.
|
| As I understand, in microkernel OSes most system calls are
| simply IPCs - for example, network card driver passes
| incoming packet to the firewall. So there is almost no kernel
| work except for context switch. That's why it has to be as
| fast as possible and resemble a normal function call, maybe
| even without invoking the kernel at all. Maybe something like
| Intel's call gate, but fast.
|
| > they aren't compatible with anything that calls fork().
|
| I wouldn't miss it; for example, Windows works fine without
| it.
| kinghajj wrote:
| You should look into the Mill CPU architecture.[0] Its design
| should make microkernels much more viable.
|
| * Single 64-bit address space. Caches use virtual addresses.
|
| * Because of that, the TLB is moved _after_ the last level
 | cache, so it's not on the critical path.
|
| * There's instead a PLB (protection lookaside buffer), which
| can be searched in parallel with cache lookup. (Technically,
| there's three: two instruction PLBs and one data PLB.)
|
| [0]: https://millcomputing.com/
| foobiekr wrote:
| I was also going to mention the Mill, but it's become a bit
| of a Flying Dutchman that people tell tales of but which
| probably doesn't exist.
| thechao wrote:
| Segment registers are _precisely_ how NT does context
| switching. I think it may be restricted to just switching from
| user- to kernel- threads. I can 't remember if there's thread-
| to-thread switching using segment registers -- I feel like this
| was a thing, or it was just a thing we did when we tried to
| boot NT on Larrabee. (Blech.)
| wrs wrote:
| Long ago, we in the Newton project at Apple had that idea. We
| (in conjunction with ARM) were defining the first ARM MMU, so
| we took the opportunity to implement "domains" of memory
| protection mappings that could be quickly swapped at a context
| switch. So you get multiple threads in the same address space,
| but with independent R/W permission mappings.
|
| I think a few other ARM customers were intrigued by the
| security possibilities, but the vast majority were more like
| "what is this bizarre thing, I just want to run Unix", so the
| feature disappeared eventually.
|
| Here's some ARM documentation if you want to pull this thread:
| https://developer.arm.com/documentation/dui0056'/latest/'cac...
| wrs wrote:
| Too late to edit, but here's a documentation link that works
| better:
| https://developer.arm.com/documentation/dui0056/d/caches-
| and...
| StillBored wrote:
 | It's similar to the original macOS, which used handles to
 | track/access/etc memory requested from the OS and swap it
 | to disk as needed. First you request the space, then you
 | request access, which pinned it into RAM.
|
| PalmOS was another one that worked similarly.
| https://www.fuw.edu.pl/~michalj/palmos/Memory.html
| Taniwha wrote:
 | These days there are few caches that need to be flushed at
 | context switch time - RISCV's ASIDs mean that you don't need to
 | flush the TLBs (mostly) when you context switch.
 |
 | VRoom! largely has physically tagged caches so they don't need
 | to be flushed; the BTC is virtually tagged, but split into
 | kernel and user caches - you need to flush the user one on a
 | context switch (or both on a VM switch) - the trace cache
 | (L0 icache) will also be virtually tagged. VRoom! also doesn't
 | do speculative accesses past the TLBs.
|
| Honestly saving and restoring kernel context is small compared
| to the time spent in the kernel (and I've spent much of the
| past year looking at how this works in depth).
|
 | Practically you have to design stuff to an architecture (like
 | RISCV) so that one can leverage off of the work of others
 | (compilers, libraries, kernels) - adding specialised stuff that
 | would (in this case) get into a critical timing path is
 | something that one has to consider very carefully - but that's
 | a lot of what RISCV is about - you can go and knock up that
 | chip yourself on an FPGA and start trialing it on your
 | microkernel
| kragen wrote:
| Thanks, this is really informative.
| nynx wrote:
| The Architectural presentation linked from the GitHub repository
| for this project is an incredibly good resource on how these
| kinds of things are designed.
| avianes wrote:
 | Yes, there is a huge lack of open and approachable information
 | sources on micro-architecture.
 |
 | Be aware though, the micro-architecture used here is very
 | interesting but differs in many ways from state-of-the-art
 | industrial high-end micro-architectures for superscalar out-of-
 | order speculative processors.
|
| I am quite curious about how the author came up with these
| choices
| Taniwha wrote:
| Well, everyone was building tiny RISCVs, I kind of thought
| "can I make a Xeon+ class RISCV if I throw gates at the
| problem ?" :-)
|
 | Seriously though I started out with the intent of building a
 | 4/8 instruction/clock decoder, and an out-of-order execution
 | pipe that could keep up - with the end goal of at least 4+
 | instructions/clock average (we peak now at 8) - the renamer,
 | dual register file, and commitQ are the core of what's probably
 | different here
| avianes wrote:
| Yes, the "dual register file" is probably the most
| intriguing to me.
|
 | This looks like the renaming scheme used in some old micro-
 | architectures (Intel Core 2 maybe) where the ROB receives
 | transient results and acts as a physical regfile; at commit,
 | reg values are copied to an architectural regfile. But in your
 | uarch the physical regfile is decoupled from the ROB, which
 | must correspond to your commitQ.
 |
 | I wonder if this solution is viable for a very large uarch
 | (8 way), because the read ports used to copy reg values from
 | the physical regfile to the arch regfile are additional read
 | ports that can be avoided with other (more complex) renaming
 | schemes. These additional read ports can be expensive on a
 | regfile that already has a bunch of ports.
|
| Any thoughts about this?
|
| But I haven't read much of your code yet, that's just a raw
| observation
| Taniwha wrote:
 | the commitQ entries are smart enough to 'see' the commits
 | into the architectural file and request the data from
 | its current location
|
| It does mean lots of register read ports .... but you can
| duplicate register files at some point (reducing read
| ports but keeping the write ports) (you want to keep them
| close to the ALUs/multipliers/etc) - in some ways these
| are more implementation issues rather than
| 'architectural'
| avianes wrote:
 | I see, there are indeed solutions like regfile
 | duplication to handle large port counts, but it's
 | expensive when the physical regfile becomes large. I still
 | think that the uarch's job is to ensure minimal
 | implementation cost ;).
|
 | Thank you for your opinion and thought process, it's very
 | valuable!
| tasty_freeze wrote:
| For a few years I worked with the guy behind this project, Paul
| Campbell. He is a fearless coder, and moves between hardware and
| software design with equal ease.
|
 | An example of his crazy coding chops: he was frustrated by the
 | lack of verilog licenses at the place he worked back in the
 | early 90s. His solution was to whip up a compliant verilog
 | simulator, then write a screen saver that would pick up
 | verification tasks
| from a pending queue. They had many macs around the office that
| were powered 24/7, and they could chew through a lot of work
| during the 16 hours a day when nobody was sitting in front of
| them. When someone sat down at their computer in the morning or
| came back from lunch, the screen saver would just abandon the
| simulation job it was running and that job would go back to the
| queue of work waiting to be completed.
| evilos wrote:
| That is terrifying.
| thechao wrote:
| Synthesizable verilog is a very small language compared to
| system verilog -- especially in the 90s. Off the top of my head
| I know of _six_ "just real quick" verilog simulators that I've
| worked with (one of which I wrote). I'm not sure how I feel
| about them. On one hand, I hate dealing with licenses; on the
| other hand, now you've got to worry that your custom simulation
| matches behavior with the synthesis tools. A lot of the
| "nonstandard" interpretation for synthesizable verilog from the
| bigs comes from practical understanding of the behavior for a
| given node. Most of that is captured in ABC files ... but not
| all of it.
| Taniwha wrote:
 | It was more than simple synthesisable verilog, but not a lot
 | more - it was also a compiler rather than an interpreter - at
 | the time VCS was just starting to be a thing, and verilog as a
 | language was not at all well defined (lots of assumptions
 | about event ordering that no-one should have been making)
 |
 | I was designing Mac graphics accelerators; I'd built it on
 | some similar infrastructure I'd built to capture traces from
 | people's machines to try and figure out where QuickDraw was
 | really spending its time - we ended up with a minimalistic
 | graphics accelerator that beat the pants off of everyone else
| thechao wrote:
| This is why I think Moore (LLHD), Verilator, and Yosys are
| such awesome tools. They move a lot more slowly than (say)
| GCC, but I personally think they're all close to the
| tipping point.
| Taniwha wrote:
| I wrote a second, much more standard Verilog compiler
| (because by then there was a standard) with the intent of
| essentially selling cloud simulation time (being 3rd to a
| marketplace means you have to innovate) - sadly I was a
| bit ahead of my time ('cloud' was not yet a word) the
| whole California/Enron "smartest guys in the room"
| debacle kind of made a self financed startup like that
| non-viable
|
| So in the end I open sourced the compiler ('vcomp') but
| it didn't take off
| colejohnson66 wrote:
| So, BOINC before BOINC?
| jasonwatkinspdx wrote:
| A lot of people have come up with something similar. Someone
| I know implemented the Condor scheduler to run models on
| workstations at night at a hedge fund. That Condor scheduler
| dates to the 80s. Smaller 3d animation studios commonly do
| this too.
| Symmetry wrote:
| The architectural details here were pretty interesting:
|
| https://moonbaseotago.github.io/talk/index.html
|
| It would be nice to get actual performance numbers rather than
| just frequency scaled Dhrystone but I suppose we have to be
| patient.
| Taniwha wrote:
 | Dhrystone's just a place to start, it helps me make quick
 | tweaks, and I'm at that stage of the process - it's
 | particularly good because it's somewhat at odds with my big
 | wide decoders - VRoom! can decode bundles of up to 8
 | instructions per clock, while Dhrystone, with lots of twisty
 | branches, only decodes ~3.7 instructions per bundle - it's a
 | great test for the architecture because it pushes at the
 | things the design might not be as good at.
 |
 | Having said that, I'm about reaching the end of the period
 | where it's the only thing I run - being able to run bigger,
 | longer benchmarks is one of the reasons for bringing up linux
 | on the big FPGA on AWS
| Taniwha wrote:
| I'll add that freq scaled Dhrystone (DMIPS/MHz) is a
| particularly useful number because it helps you compare
| architectures rather than just clocks - you can figure out
| questions like "If I can make this run at 5GHz how will it
| compare with X?"
| minroot wrote:
 | Any recommendations for resources on learning to make things
 | like this in general?
| Symmetry wrote:
| Computer Architecture: A Quantitative Approach[1] is the
| textbook that gets recommended the most on the topic, I
| believe.
| homarp wrote:
| [1] https://www.elsevier.com/books/computer-
| architecture/henness...
| tsmi wrote:
| If you're at the point in your career where you're not sure
| which is the right textbook then "A Quantitative Approach" is
| likely to be really tough to get through.
|
| Computer Organization and Design, by the same authors, is
| considered a better choice for a first book. I personally
| loved it and couldn't put it down the first time I read it.
|
| https://www.elsevier.com/books/computer-organization-and-
| des...
| camtarn wrote:
| Definitely recommend this textbook as a great read - it
| remains one of the very few textbooks I've read end-to-end
| and genuinely enjoyed.
| minroot wrote:
 | Any recommendations for books on (System)Verilog?
| sitkack wrote:
| You might like "Digital Design and Computer Architecture,
| RISC-V Edition" by Harris and Harris.
|
| https://www.google.com/books/edition/Digital_Design_and_C
| omp...
|
| This book definitely skews pragmatic, hands on and
| doesn't assume much. Covers both VHDL and Verilog. Has
| sections on branch prediction, register renaming, etc.
| tsmi wrote:
| I personally am not into the verilog specific books. For
| me HDLs are hardware description languages, so first you
| learn to design digital hardware, then you learn to
| describe them.
|
| For that I highly recommend: https://www.cambridge.org/us
| /academic/subjects/engineering/c...
|
| Great first book on the subject.
| jasonwatkinspdx wrote:
| Older editions of this are freely available online, and great
| for learning about microarchitecture.
___________________________________________________________________
(page generated 2022-03-21 23:00 UTC)