Post Ay1PV7MeLVcKBgITBo by rygorous@mastodon.gamedev.place
 (DIR) More posts by rygorous@mastodon.gamedev.place
 (DIR) Post #Ay11wC44a2UNgTzAUi by regehr@mastodon.social
       2025-09-09T00:57:24Z
       
       0 likes, 1 repeats
       
        is this bullshit? or does ISA not really matter in some fictitious world where we can normalize for process and other factors?
        https://www.techpowerup.com/340779/amd-claims-arm-isa-doesnt-offer-efficiency-advantage-over-x86
       
 (DIR) Post #Ay11wDmmBJqF1PuNoO by rygorous@mastodon.gamedev.place
       2025-09-09T01:07:46Z
       
       1 likes, 0 repeats
       
        @regehr Every serious study (both from independent researchers and from vendors themselves) that I've ever seen (and I'm up to 5 or so at this point), broadly, supports this, with some caveats.
        It's not "no difference", but for server/application cores, the differences that do exist are typically somewhere in the single-digit %. You can always find pathological examples, but typically it's not that much.
        There is a real cost to x86's many warts, but it's mostly in design/validation cost and toolchains.
       
 (DIR) Post #Ay11wF2PWfwSuCENou by rygorous@mastodon.gamedev.place
       2025-09-09T01:12:49Z
       
       0 likes, 0 repeats
       
        @regehr Some more details:
        - the D/V and toolchain costs are amortized. Broadly speaking, the bigger your ecosystem/market share, the bigger your ability to absorb that cost.
        - This holds for what ARM would call "application" cores; oversimplifying a bit, it's essentially a constant overhead on the design that adds some extra area and pipe stages. It's more onerous for smaller cores, but you need to be really small.
       
 (DIR) Post #Ay11wG8pQIfaKOF1Si by rygorous@mastodon.gamedev.place
       2025-09-09T01:19:09Z
       
       0 likes, 0 repeats
       
        @regehr Eventually, there's nowhere left to hide. For applications where you'd use say an ARM Cortex-M0 or a bare-bones minimal RV32I CPU, I'm not aware of anything x86 past or present that would really make sense.
        Intel did "Quark" a while back, which I believe was either a 486 or P5 derivative, so still something like a 5-stage pipelined integer core. If you want to go even lower than that, I don't think anyone has (or wants to do) anything.
       
 (DIR) Post #Ay11wHTQTCjwSYszCq by steve@discuss.systems
       2025-09-09T01:21:13Z
       
       0 likes, 0 repeats
       
       @rygorous @regehr yeah, this
       
 (DIR) Post #Ay11wI5iArMgNInYsi by rygorous@mastodon.gamedev.place
       2025-09-09T01:24:59Z
       
       0 likes, 0 repeats
       
        @steve @regehr Anyway, take that with whatever amount of salt you want, but Intel and AMD both are strongly incentivized to seriously look at this.
        They for sure would prefer to sell you x86s because they have decades of experience with that, but they're looking at what it costs them to do it, both in capex and in how much it hurts the resulting designs.
        And for the latter, the consistent answer has been "a bit, but not much".
       
 (DIR) Post #Ay11wIwB1nKf01LSeu by regehr@mastodon.social
       2025-09-09T01:29:21Z
       
       0 likes, 0 repeats
       
       @rygorous @steve I've seen part of a convincing / complete formal spec for x86 and I would run away from any effort to validate an implementation of this
       
 (DIR) Post #Ay11wJPFHiaISAwfy4 by rygorous@mastodon.gamedev.place
       2025-09-09T01:38:08Z
       
       0 likes, 0 repeats
       
        @regehr @steve Anecdotally, there are at least 3 companies (Intel, AMD, Centaur) that do this on the regular, and one of them (Centaur) is quite small as such things go.
        I wouldn't want to do it either, but the other thing you gotta keep in mind is that the CPU core, while important, is only part of a SoC, and ISA has very little impact on the "everything else".
       
 (DIR) Post #Ay11wJu5R3FpzpNJ2W by regehr@mastodon.social
       2025-09-09T01:40:46Z
       
       0 likes, 0 repeats
       
       @rygorous @steve sure sure
       
 (DIR) Post #Ay11wKPzWQm7amImlk by rygorous@mastodon.gamedev.place
       2025-09-09T01:47:34Z
       
       1 likes, 0 repeats
       
        @regehr @steve For example, it's a goddamn NIGHTMARE doing a high-performance memory subsystem for absolutely anything.
        This whole "shared memory" fiction we're committed to maintaining is a significant drag on all HW, but HW impls of it are just in another league perf-wise than "just" building message-passing and trying to work around it in SW (lots have tried, but there's little code for it and it's a PITA), so we're kind of stuck with it.
       
 (DIR) Post #Ay11wLFkQ0AwBIW7RQ by rygorous@mastodon.gamedev.place
       2025-09-09T01:49:43Z
       
       1 likes, 0 repeats
       
       @regehr @steve Basically almost everything that _all_ major ISAs pretend is true about memory at the ISA level is an expensive lie, but one that ~ALL the SW depends on. :)
       
 (DIR) Post #Ay11wMHYalDVNCN4ts by rygorous@mastodon.gamedev.place
       2025-09-09T01:52:29Z
       
       1 likes, 0 repeats
       
       @regehr @steve To wit: virtual memory is a lie, by design. Uniform memory is a lie. Shared instruction/data memory is a lie. Coherent caches are a lie, caches would rather be _anything_ else. Buses are a lie. Memory-mapped IO is IO lying about being memory. Oh and the data bits and wires are small and shitty enough now that they started lying too and everything is slowly creeping towards ECCing all the things
       
 (DIR) Post #Ay11wNEP3yHwJhu4ci by rygorous@mastodon.gamedev.place
       2025-09-09T01:21:14Z
       
       1 likes, 0 repeats
       
        @regehr It's not that x86 couldn't do that, but you'd need to dive even deeper into history, and P5 level is honestly about the lowest anyone still wants to go.
        You could do 286-or-less, but that's 16-bit x86, and tooling for that is essentially extinct at this point. You're stuck with old compilers etc.
       
 (DIR) Post #Ay11wPzym3X6tqB63c by rygorous@mastodon.gamedev.place
       2025-09-09T01:28:17Z
       
       0 likes, 0 repeats
       
       @steve @regehr If they really felt it was a lead anchor around their necks, they would've switched. AMD did a serious push on a high-end ARM core something like 15 years(?) ago, and ultimately decided against continuing it, because they didn't think it was a compelling product. If they'd gotten magic sauce huge wins just from the different ISA, I don't think they'd have canned it.
       
 (DIR) Post #Ay1NDme1vbhN5y27Ae by regehr@mastodon.social
       2025-09-09T01:53:00Z
       
       0 likes, 0 repeats
       
       @rygorous @steve yep this is more or less how I teach this stuff
       
 (DIR) Post #Ay1NDo02tEu3IXLD7o by rygorous@mastodon.gamedev.place
       2025-09-09T01:55:49Z
       
       0 likes, 0 repeats
       
        @regehr @steve Also, re: ISA efficiency, I like re-posting this by now rather old image that shows you what the score really is.
        This was on the Xeon Phis but the general trend holds to this day. (Source: https://people.eecs.berkeley.edu/~ysshao/assets/papers/shao2013-islped.pdf p. 3) NB this is an in-order core with 512b vector units.
       
 (DIR) Post #Ay1NDp9eb0BOscqOjw by rygorous@mastodon.gamedev.place
       2025-09-09T01:59:18Z
       
       0 likes, 0 repeats
       
        @regehr @steve This is one of the bigger reasons why ISA doesn't matter more.
        Broadly, your uArch is only as good as its data movement, because that shit is what's really expensive, not the logic gates.
        It's things like:
        - how good is your entire memory subsystem
        - how good is your bypass network
        - how good are your register files
        etc.
        It's not like you can't make mistakes in the ISA that will really kill your design, you can. That's what happened to VAX.
       
 (DIR) Post #Ay1NDqMo5aIYdi0Pse by rygorous@mastodon.gamedev.place
       2025-09-09T02:02:04Z
       
       0 likes, 0 repeats
       
        @regehr @steve The VAX ISA turns out to be, inadvertently, _extremely_ hostile to an implementation that tries to decouple frontend and backend, which ultimately broke its neck.
        x86 has many flaws, but nothing that makes it so that there is a massive discontinuity where there's basically nothing you can do about a particular problem until you have like 10x the transistor/power/whatever budget, which is the kind of thing that kills archs.
       
 (DIR) Post #Ay1NDrSW1qSW1hgUPw by argv_minus_one@mastodon.sdf.org
       2025-09-09T03:54:45Z
       
       0 likes, 0 repeats
       
        @rygorous What caused the demise of m68k? @regehr @steve
       
 (DIR) Post #Ay1NDsjvGbydzypuBk by TomF@mastodon.gamedev.place
       2025-09-09T04:10:17Z
       
       0 likes, 0 repeats
       
       @argv_minus_one @rygorous @regehr @steve 68k went way way too CISC right at the point RISC got all trendy. Like... RISC was wrong in the long term, but it was 20% right for a decade or so. And then it was wrong. Sadly, that was long enough to kill 68k as a mainstream part (though it lived on for a looooong time in the embedded space)
       
 (DIR) Post #Ay1NDtkJWdst7U1jRA by wolf480pl@mstdn.io
       2025-09-09T06:23:35Z
       
       0 likes, 0 repeats
       
        @TomF @argv_minus_one @rygorous @regehr @steve wasn't 68k basically a VAX?
       
 (DIR) Post #Ay1NDy9z78nMoG4I0O by rygorous@mastodon.gamedev.place
       2025-09-09T02:03:59Z
       
       0 likes, 0 repeats
       
        @regehr @steve x86 has always had a staircase of options where you can gradually throw more budget at it and get incremental wins every gen.
        It's a crooked staircase with some trick steps, but it's a staircase, not a brick wall. :)
       
 (DIR) Post #Ay1NL2m4MBVl9DFtB2 by argv_minus_one@mastodon.sdf.org
       2025-09-09T05:02:14Z
       
       0 likes, 0 repeats
       
        @TomF ARM now has many more instructions than m68k did. Who's RISC now? 😋 @rygorous @regehr @steve
       
 (DIR) Post #Ay1NL3nWYGGkK0wZ5E by TomF@mastodon.gamedev.place
       2025-09-09T05:34:47Z
       
       0 likes, 0 repeats
       
       @argv_minus_one @rygorous @regehr @steve Yeah, it's a silly comparo unless you restrict yourself to A64 vs x64 and ignore the massive legacy baggage of both.
       
 (DIR) Post #Ay1NL4taTCiHj6mvAm by wolf480pl@mstdn.io
       2025-09-09T06:25:01Z
       
       0 likes, 0 repeats
       
        @TomF @argv_minus_one @rygorous @regehr @steve I thought RISC vs CISC was about load-store vs plentiful addressing modes that can be used with every op (orthogonality)
       
 (DIR) Post #Ay1Nm3jqsbzdxO6Glk by lispi314@udongein.xyz
       2025-09-09T03:16:23.110725Z
       
       0 likes, 0 repeats
       
        @dalias @rygorous @regehr @steve I would really want to see that considered more.
        Finally reliable CPUs? Why, shove that into some whitebox system with Free/Libre firmware & redundant hotswappable CPUs and I've just about got my dream system.
       
 (DIR) Post #Ay1Nm51y4k4vxraFe4 by wolf480pl@mstdn.io
       2025-09-09T06:29:53Z
       
       0 likes, 0 repeats
       
        @lispi314 @steve @dalias @rygorous @regehr also, not having 600 rename registers should help if a lot of your power and area is in data movement, right?
       
 (DIR) Post #Ay1O5vsHpMlzMJWOxc by azonenberg@ioc.exchange
       2025-09-09T04:53:40Z
       
       0 likes, 0 repeats
       
        @rygorous @regehr @steve In the GPU/embedded space, these abstractions get much leakier.
        In ngscopeclient, for example, we use low level Vulkan APIs for memory management. Most (not all) of our waveform data is explicitly incoherent, with one copy on the CPU and one on the GPU, and nonblocking DMAs triggered between them as needed. If you forget to call PrepareForCpuAccess() on a buffer before using it, you're going to read stale data if the GPU was working on it.
        How are buses a lie, though? Unless you are assuming they are a shared reliable broadcast medium rather than a packet-switched network like most of them are these days?
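        A minimal C sketch of that explicit-sync bookkeeping, assuming nothing about the real ngscopeclient code (which is C++ on top of Vulkan): the struct, flags, and DMA stubs below are invented for illustration and stand in for the actual staging-copy/fence machinery.

            #include <stdbool.h>
            #include <stddef.h>

            /* One buffer, two explicit copies, and "dirty" flags saying which side
             * is newer. All names here are made up for illustration.              */
            struct mirrored_buffer {
                void  *cpu_copy;       /* host-side copy of the waveform data      */
                void  *gpu_handle;     /* opaque handle to the device-local buffer */
                size_t size;
                bool   gpu_is_newer;   /* GPU wrote since the last download        */
                bool   cpu_is_newer;   /* CPU wrote since the last upload          */
            };

            /* Placeholders for the real GPU->CPU / CPU->GPU copy + wait machinery. */
            static void dma_download(struct mirrored_buffer *b) { (void)b; }
            static void dma_upload(struct mirrored_buffer *b)   { (void)b; }

            /* Skipping this call before touching cpu_copy is exactly the
             * "read stale data" failure mode described above.                     */
            static void prepare_for_cpu_access(struct mirrored_buffer *b)
            {
                if (b->gpu_is_newer) {
                    dma_download(b);
                    b->gpu_is_newer = false;
                }
            }

            static void prepare_for_gpu_access(struct mirrored_buffer *b)
            {
                if (b->cpu_is_newer) {
                    dma_upload(b);
                    b->cpu_is_newer = false;
                }
            }

        The point being that coherence here is something you opt into by calling the prepare function, not something the hardware fakes for you.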
       
 (DIR) Post #Ay1O5xh14EwizwGQfg by rygorous@mastodon.gamedev.place
       2025-09-09T04:55:14Z
       
       0 likes, 0 repeats
       
       @azonenberg Yeah, I mean buses as shared mediums; anything remotely high-speed these days is switched point-to-point links. (Not necessarily packets, more like flits usually.)
       
 (DIR) Post #Ay1O5ypCrH5kVd6U4m by wolf480pl@mstdn.io
       2025-09-09T06:33:29Z
       
       0 likes, 0 repeats
       
        @rygorous @azonenberg they even pretend to have the pull-up resistors by returning 0xFFFFFFFF when you read an invalid address
       
 (DIR) Post #Ay1OLl6NeDwKa6f0cq by TomF@mastodon.gamedev.place
       2025-09-09T06:36:22Z
       
       0 likes, 0 repeats
       
       @wolf480pl @argv_minus_one @rygorous @regehr @steve I don't know the VAX ISA very well, but although 68k was clearly inspired by a few features of VAX, it's nowhere near as aggressively CISCy - at least not in its first few iterations (the 68020 added some wacky stuff).
       
 (DIR) Post #Ay1OXqAWdRIXlrMOZc by TomF@mastodon.gamedev.place
       2025-09-09T06:38:33Z
       
       0 likes, 0 repeats
       
       @wolf480pl @argv_minus_one @rygorous @regehr @steve There are many definitions of the two, and any argument that follows from one particular definition is usually silly.
       
 (DIR) Post #Ay1PV7MeLVcKBgITBo by rygorous@mastodon.gamedev.place
       2025-09-09T06:49:15Z
       
       0 likes, 0 repeats
       
        @wolf480pl it was absolutely not, no.
        One of VAX's more notable problems was that absolutely every operand could be a memory reference or even an indirect memory reference (meaning a memory location containing a pointer to a memory location that was accessed by the instruction). Some VAX instructions had 6 operands, each of which could be a double-indirect memory reference, and IIRC also unaligned and spanning a page boundary, so the worst-case number of page faults per instruction was bonkers.
       
 (DIR) Post #Ay1Q26KDENlqWJsto8 by rygorous@mastodon.gamedev.place
       2025-09-09T06:51:52Z
       
       0 likes, 0 repeats
       
        @wolf480pl Everything could also be an immediate operand.
        There were two ways to encode immediates: "literal" was for short integers and was more compact, anything out of range used the actual immediate encoding.
        On the VAX, you had 16 GPRs R0-R15, and R15 was just your PC. (32-bit ARM later copied that mistake, and it is a mistake.)
        The immediate encoding boiled down to (r15)+, i.e., fetch data (of whatever the right size is) at PC and auto-increment. That's also how it was encoded.
       
 (DIR) Post #Ay1Q27Qd80UxwVtXRw by wolf480pl@mstdn.io
       2025-09-09T06:55:13Z
       
       0 likes, 0 repeats
       
        @rygorous yeah, *pc++ - isn't that beautiful?
        *hides*
       
 (DIR) Post #Ay1Q2E2mX4a6SZn5kW by rygorous@mastodon.gamedev.place
       2025-09-09T06:54:46Z
       
       0 likes, 0 repeats
       
        @wolf480pl So, in the VAX encoding, if you have say an add instruction where the first operand is an immediate, you get the encoding for the first operand, then the immediate bytes, then the encoding for the second operand, and so forth.
        Crucially, you don't really know where the byte describing the second operand starts until you've finished the first operand; and this goes for all (up to 6) operands.
        Nobody does this anymore, because turns out, it's a _terrible_ idea.
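        A minimal C sketch of that serial dependency, with the operand specifiers heavily simplified (real VAX has many more modes, and immediate sizes depend on the operand's data type):

            #include <stddef.h>
            #include <stdint.h>

            /* Toy model of VAX-style operand specifiers. Byte 0 of the instruction
             * is the opcode; each operand starts with a specifier byte whose top
             * nibble is the addressing mode and bottom nibble the register.
             * Depending on the mode, extra bytes (immediates, displacements) follow. */
            static size_t operand_length(const uint8_t *p)
            {
                uint8_t mode = p[0] >> 4;
                uint8_t reg  = p[0] & 0x0F;
                switch (mode) {
                case 0x5: return 1;          /* register direct                      */
                case 0x8:                    /* autoincrement (Rn)+                  */
                    /* Rn = PC (R15) is the immediate form; assume a 4-byte value.   */
                    return reg == 0x0F ? 1 + 4 : 1;
                case 0xA: return 1 + 1;      /* byte displacement off Rn             */
                case 0xE: return 1 + 4;      /* longword displacement off Rn         */
                default:  return 1;          /* other modes omitted in this toy      */
                }
            }

            /* The problem in miniature: finding where each operand begins is a
             * serial scan, because operand k+1's offset isn't known until operand k
             * has been fully parsed.                                                */
            static void find_operand_offsets(const uint8_t *insn, int n_operands,
                                             size_t offsets[])
            {
                size_t pos = 1;              /* skip the opcode byte                 */
                for (int i = 0; i < n_operands; i++) {
                    offsets[i] = pos;
                    pos += operand_length(insn + pos);
                }
            }

        With fixed-size or length-prefixed operands, a wide decoder could compute all the offsets at once; here every offset depends on parsing everything before it.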
       
 (DIR) Post #Ay1QMpHTnO6pjxUMS0 by wolf480pl@mstdn.io
       2025-09-09T06:58:59Z
       
       0 likes, 0 repeats
       
        @rygorous sounds like it'd save you a lot of gates in the uninteresting scenario of a cacheless byte-addressable memory and a core that takes 3+ cycles to process each operand
       
 (DIR) Post #Ay1QqLN5YxwxfIBgGG by rygorous@mastodon.gamedev.place
       2025-09-09T07:04:18Z
       
       0 likes, 0 repeats
       
        @wolf480pl VAX was multi-cycle everything, basically something like at least 1 cycle for the base operation (even if no operands), at least 1 cycle extra for every operand, more if they involved memory access.
        They did try to pipeline it past that (with the NVAX) but the ISA proved to be remarkably resistant to doing something much better, at least with the transistor budget they had at the time (late 80s/early 90s).
       
 (DIR) Post #Ay1RFiKfKdf27I9SN6 by rygorous@mastodon.gamedev.place
       2025-09-09T06:58:52Z
       
       0 likes, 0 repeats
       
        @wolf480pl In a faster version (ideally, >=1 IPC) you want to pipeline this, which turns into a mess when basic things like where the register numbers are in the instruction bytes for a particular instruction are super context-sensitive.
        M68k got some VAX-isms (they did add some double-indirect addressing in the 68020...) and so did 32-bit ARM (R15=PC, which is the source of nearly half of the "constrained unpredictable" behavior in the 32-bit ARM spec), but real VAX is... something else.
       
 (DIR) Post #Ay1RFjh2Gx9IKxcpsW by wolf480pl@mstdn.io
       2025-09-09T07:08:53Z
       
       0 likes, 0 repeats
       
        @rygorous hmm I see, so m68k makes PC-relative its own addressing mode and doesn't allow for post-increment with it, which bypasses the question of what's read first, the immediate or the remaining operand addresses, and makes it possible to know early which bits are where in the instruction...
       
 (DIR) Post #Ay1RgD3RjJ82vfjP8a by wolf480pl@mstdn.io
       2025-09-09T07:13:42Z
       
       0 likes, 0 repeats
       
        @rygorous hmm... was PDP-11 ever pipelined?
        If it was, then VAX truly is a disgrace
       
 (DIR) Post #Ay1ShlHW3nSjKGcH4q by rygorous@mastodon.gamedev.place
       2025-09-09T07:15:42Z
       
       0 likes, 0 repeats
       
        @wolf480pl Which is all just a random historical footnote at this point, but it is important context, because all the original first-wave early-to-mid-80s RISC papers were subtweeting VAXen, specifically and especially.
        VAX is '77. 8086 is from '78, and descended from the 8080 and ultimately the 8008 ('72). IBM z mainframes are still off the original System/360 architecture from 1965. Both decidedly not RISC. Both made the jumps to first pipelined, then superscalar, then OoO just fine. VAX, nope.
       
 (DIR) Post #Ay1ShmmkTA2VzQEjOi by rygorous@mastodon.gamedev.place
       2025-09-09T07:20:06Z
       
       0 likes, 0 repeats
       
        @wolf480pl The original RISC papers put all "CISCs" in the same boat, but historically, that is demonstrably false.
        VAX made some very specific decisions that felt clean and elegant in the short term and screwed them over big-time in the long term.
        Same for the first-gen RISCs - load delay slots never made it into series production for MIPS, branch delay slots did but were regretted not long after, etc.
        I don't think there's a big lesson here other than predicting the future is hard.
       
 (DIR) Post #Ay1Sho83TQg29nDGFM by rygorous@mastodon.gamedev.place
       2025-09-09T07:23:23Z
       
       0 likes, 0 repeats
       
        @wolf480pl Except, of course, for ISA designers, where there's plenty of immediately actionable information from how the VAX shook out, but that's less along RISC/CISC ideological lines and more like:
        - make instructions fixed-size or, when not practical, at least make it easy to tell insn size from the first word
        - don't bake in decisions that really lock you into one particular implementation, you might not want it in the future
        - don't make the PC a GPR (SP is somewhat special too)
        etc.
       
 (DIR) Post #Ay1ShpXGFCQwWG0uAq by wolf480pl@mstdn.io
       2025-09-09T07:25:07Z
       
       0 likes, 0 repeats
       
        @rygorous those things are fairly obvious if you know pipelining is a thing. I'm guessing in the 70s they didn't know that yet?
       
 (DIR) Post #Ay1TNcoBx8VHD4rVmi by rygorous@mastodon.gamedev.place
       2025-09-09T07:32:41Z
       
       0 likes, 0 repeats
       
        @wolf480pl They did. Pipelining is 50s tech (IBM Stretch had a 3-stage pipeline, designed starting 1956). Superscalar and OoO were developed in the 60s (e.g. IBM ACS, Tomasulo algorithm) and shipped that decade (IBM System/360 Model 91 FPU, 1967).
        But this was all mostly the purview of supercomputers.
        The whole idea of an Instruction Set Architecture with multiple impls goes back to the S/360. That was barely 10 years old when they designed the VAX.
       
 (DIR) Post #Ay1U6E0szIZXL94Bwu by rygorous@mastodon.gamedev.place
       2025-09-09T07:35:33Z
       
       0 likes, 0 repeats
       
        @wolf480pl For all of the 50s and 60s (unless you were at IBM), people mostly designed a sequence of computers, all incompatible with each other. (Related, but incompatible.)
        Even if you'd told the original VAX designers around '75 that their decisions would screw over the VAX by '92 or so, I don't think they would've considered a 17-year life span a bad result.
        We now know (via S/360, 8086 etc.) that commercial archs can last 50+ years (albeit with many changes). They did not.
       
 (DIR) Post #Ay1U6FJMA6wPMiiSNU by wolf480pl@mstdn.io
       2025-09-09T07:40:47Z
       
       0 likes, 0 repeats
       
        @rygorous was most software distributed as source code back then? Or did you have to ask the company you got your accounting package from if they have a build for a newer machine because you want to upgrade your computers?
       
 (DIR) Post #Ay1U6M7YqMfIUAFdSq by rygorous@mastodon.gamedev.place
       2025-09-09T07:39:58Z
       
       0 likes, 0 repeats
       
        @wolf480pl Of course, Alpha was designed in the 90s with the explicit goal of lasting (at least) 25 years, is notable among RISCs in explicitly thinking about superscalar, OoO, SMP etc. from the beginning and avoiding everything that would complicate them, and has a much better track record than other RISCs in terms of architectural regrets because of this (they did make some mistakes in retrospect, but way fewer than MIPS, SPARC, ARM etc. did).
        Great tech, but ultimately: wrong business model.
       
 (DIR) Post #Ay1UFxCQM95s5Y6Q9Q by wolf480pl@mstdn.io
       2025-09-09T07:42:34Z
       
       0 likes, 0 repeats
       
        @rygorous where does Itanium fit in here? SPARC but more?
       
 (DIR) Post #Ay1UOCC4UXRGuQrr2u by rygorous@mastodon.gamedev.place
       2025-09-09T07:44:02Z
       
       0 likes, 0 repeats
       
        @wolf480pl Neither.
        New computer, new languages, new compilers, new software. Had to rewrite everything every time you bought a new computer.
        Some languages were at least somewhat compatible between some machines. When you were lucky. But still lots of manual work (and many bugs) in porting stuff over.
        Weirdly, once customers had significant SW investment, they didn't like that idea so much anymore. :)
       
 (DIR) Post #Ay1UZBHyCDnNHvrgHo by wolf480pl@mstdn.io
       2025-09-09T07:46:02Z
       
       0 likes, 0 repeats
       
        @rygorous so I guess C and Unix were one attempt at a solution to this, the other being backwards-compatible ISAs, and both happened around the same time?
       
 (DIR) Post #Ay1Ub0cQq0J0m7J7Mu by rygorous@mastodon.gamedev.place
       2025-09-09T07:46:22Z
       
       0 likes, 0 repeats
       
        @wolf480pl Itanium/EPIC is more in line with VLIWs, which is sort of a separate lineage.
        There was Multiflow in the 80s and various DSPs. But it was kind of a niche idea.
        Classic VLIW never had any intention of binary-level compat across machines.
        EPIC was trying to go for VLIW-lite so you could have a stable ISA and get newer kit that gave you much better perf without SW changes, it just didn't really work out that way.
       
 (DIR) Post #Ay1UjeQEjZ24XViwm8 by wolf480pl@mstdn.io
       2025-09-09T07:47:56Z
       
       0 likes, 0 repeats
       
        @rygorous oh, I thought VLIW was a natural evolution of RISC, but I guess it branched out much earlier than I thought
       
 (DIR) Post #Ay1UooCpmkRyKbulJQ by wolf480pl@mstdn.io
       2025-09-09T07:48:51Z
       
       0 likes, 0 repeats
       
        @rygorous so I guess there was no market for source-compatible VLIW, and binary-compatible was difficult to get right?
       
 (DIR) Post #Ay1Ux0e1QI2hE2XeDY by rygorous@mastodon.gamedev.place
       2025-09-09T07:50:19Z
       
       0 likes, 0 repeats
       
        @wolf480pl nah, neither were setting out trying to solve this problem at all, it just happened by accident.
        Unix started out as ken and dmr just wanting to noodle around with computers and coming up with applications to keep getting funded. :)
        A lot of Unix's text-centeredness is because its killer app inside Bell Labs was text editing. That's what the non-nerds were using it for.
       
 (DIR) Post #Ay1WDs2WlhbRPobkg4 by rygorous@mastodon.gamedev.place
       2025-09-09T08:04:33Z
       
       0 likes, 0 repeats
       
        @wolf480pl It's all around the same time. Late 70s/early 80s.
        AFAIK the original VLIW works (or at least what coined the term) were Josh Fisher's papers on VLIW at Yale in 1978 (so actually before RISC was coined as a term); he then founded Multiflow to commercialize it.
        Somewhat ironically, after they shut down, some Multiflow alumni ended up in the architecture group at Intel.
        But not on Itanium. On the PPro.
        (C.f. the long but very interesting https://www.sigmicro.org/media/oralhistories/colwell.pdf)
       
 (DIR) Post #Ay1WaZVDr5ZSaFr2FE by rygorous@mastodon.gamedev.place
       2025-09-09T08:08:41Z
       
       0 likes, 0 repeats
       
       @wolf480pl I'm not sure if there was no market, but certainly recompiling is no good to some fraction of customers because often the source for some SW is literally lost. Like random business-specific stuff that they had written by some lone programmer using FoxPro in the 80s
       
 (DIR) Post #Ay1ZzkH5F4tP5uw7TE by rygorous@mastodon.gamedev.place
       2025-09-09T07:53:40Z
       
       0 likes, 0 repeats
       
        @wolf480pl troff was written to drive a phototypesetter (that they had to reverse engineer first), and was then used by all kinds of departments to make professional documents directly, without having to use a typewriter/line printer to get a manuscript and then have a print shop do it.
        The PDPs they were on were pretty common kit, so they mailed source tapes to friends (other researchers) at universities.
       
 (DIR) Post #Ay1Zzli3uG4DXsZBA0 by rygorous@mastodon.gamedev.place
       2025-09-09T07:54:34Z
       
       0 likes, 0 repeats
       
       @wolf480pl University depts then had all kinds of other machines standing around, C was a pretty simple lang, and so much of it being written in C made it easy for University researchers (or gifted students) to port to other machines, so they did. That's also why there's a zillion Unix variants.
       
 (DIR) Post #Ay1ZzmeCQ6ZUSBlbmK by rygorous@mastodon.gamedev.place
       2025-09-09T07:56:54Z
       
       0 likes, 0 repeats
       
        @wolf480pl It's important to remember that the early Unices ran on systems with like 18k of memory.
        It just wasn't that much code to port. Of course, once they noticed that this seemed to be going well, they started leaning into it, but from all sources that I've read on the topic, this seems to be more of a happy accident than a masterplan working out just as designed
       
 (DIR) Post #Ay1ZznNZhOrqiuzqVM by wolf480pl@mstdn.io
       2025-09-09T08:46:48Z
       
       0 likes, 0 repeats
       
        @rygorous so Ken Thompson wasn't like "Man, I just wrote that kernel in PDP-7 asm and now I need to rewrite it for PDP-11 again? What if we didn't have to do that every time?"?
       
 (DIR) Post #Ay1aTLR89aFokSkYOO by jernej__s@infosec.exchange
       2025-09-09T08:43:02Z
       
       0 likes, 0 repeats
       
       @rygorous @wolf480pl Gaah, I had to set up DOSBox at a few clients for them to be able to access old accounting stuff written in Clipper and TurboPascal…
       
 (DIR) Post #Ay1aTMeHeAMyVXuZX6 by rygorous@mastodon.gamedev.place
       2025-09-09T08:43:29Z
       
       0 likes, 0 repeats
       
       @jernej__s @wolf480pl YUUP
       
 (DIR) Post #Ay1aTOBzuIvpIOh0im by wolf480pl@mstdn.io
       2025-09-09T08:52:10Z
       
       0 likes, 0 repeats
       
        @rygorous @jernej__s my dad wrote point-of-sale software in Clipper xD
       
 (DIR) Post #Ay1bMidZaz2GdBySuG by jernej__s@infosec.exchange
       2025-09-09T09:02:12Z
       
       0 likes, 0 repeats
       
       @wolf480pl @rygorous The company I started with had both POS and accounting software written in Clipper (and later ported to xHarbour, when some clients refused to switch to anything else; we had some fun anecdotes with that – first basically nobody expected full support from original programmers when doing the switch, and second, at least Navision couldn't directly import a bunch of stuff, because they had way lower length limits on fields, so every conversion had to go through Excel where they had to rewrite descriptions to fit the lower limits).
       
 (DIR) Post #Ay1dmyKC5mwEiqYX5M by wolf480pl@mstdn.io
       2025-09-09T09:29:23Z
       
       0 likes, 0 repeats
       
        @jernej__s @rygorous
        > lower length limits
        oof
        Anyway, migration pains aside, which do you think has better UX: those DOS programs or modern state-of-the-art GUI apps?
       
 (DIR) Post #Ay1h1VSE0vpjO8w0Ei by jernej__s@infosec.exchange
       2025-09-09T10:05:35Z
       
       0 likes, 0 repeats
       
        @wolf480pl @rygorous Navision had some really weird GUI conventions (very different from how Windows worked), but it seems that it worked well; the software my current company uses also works well, but it's completely classic Win32 (well, Delphi) GUI.
        One of the clients migrated from Navision 5 to Business Central in the beginning of the year, and that seems to be much, much worse.
       
 (DIR) Post #Ay1kSLK05w5RdyZlTs by wolf480pl@mstdn.io
       2025-09-09T10:44:05Z
       
       0 likes, 0 repeats
       
        @jernej__s @rygorous Hmm yeah, that looks unusual.
        Not sure how widespread of a convention that was, but my dad's programs always had a list of keybindings in the bottom line (or two) of the screen
       
 (DIR) Post #Ay2GEuKiK5u8i1XyEa by rygorous@mastodon.gamedev.place
       2025-09-09T16:40:11Z
       
       0 likes, 0 repeats
       
       @wolf480pl @jernej__s WordStar (1978) did that, and it was pretty influential, not sure if they were the first either though.
       
 (DIR) Post #Ay3PkoHDTB43wC9qkK by david_chisnall@infosec.exchange
       2025-09-09T09:18:59Z
       
       0 likes, 2 repeats
       
        @regehr There are two different questions here:
        - Is AArch64 a better ISA for modern microarchitectures than x86-64?
        - Does the architecture limit the performance of an implementation?
        The latter is trivially true. We have a load of examples of dead ends. Stack machines make extracting instruction-level parallelism really hard, so they lost completely to register machines. Complex microcode makes out-of-order execution hard because you have to be able to serialise machine state on interrupt, and decoded microops may have a bunch of state that isn't architectural. This one is quite interesting because it's a very sharp step change. Building a microcode engine that works is quite easy: serialise the pipeline, disable interrupts, run a bunch of microops, reenable interrupts. Building one that is efficient and allows multiple microcoded instructions to run in parallel is really hard, but if you do it then the complexity is amortised across a potentially large number of instructions. x86 chips took the first approach until fairly recently because there was a lot of lower-hanging fruit and microcoded instructions were rare. Having one instruction that requires complex microcode is absolutely the worst case.
        Different ISAs favour different implementation choices. AArch32's choice to make the program counter architectural was great on simple pipelines, for example. It made PC-relative addressing trivial (just use PC as the base of any load or add) and made short relative jumps just an add to the PC. This became more annoying for more complex pipelines because you can't tell whether an instruction is a jump until you've done full decode (on most other RISCy ISAs, you can tell from the major opcode), which impacts where you do branch prediction. Similarly, all of the predication in AArch32 is great for avoiding using the branch predictor in common cases with simple pipelines, but you need that state anyway on big out-of-order machines. Thumb-2's if-then-else instruction provided a denser way of packing predication that scales nicely up to dual-issue in-order cores, but really hurts if you want to decode multiple instructions in parallel.
        The question of AArch64 vs x86-64 is much more interesting.
        Register rename is the biggest single consumer of power and the bottleneck on a lot of very high-end implementations. Complex addressing modes really help reduce this overhead, but so do memory-register operations, where you avoid needing to keep a rename register live for a value that's used only once.
        At MS, we did a lot of work on dataflow architectures to try to avoid this. This was largely driven by two observations:
        - Around 2/3 of values are used exactly once.
        - Around 2/3 of values (not the same 2/3, but an overlapping set) are used only within the same basic block where they are created.
        The theory was that, by encoding this directly in the ISA (input operands were implicit, output operands were the distance in the executed instruction stream to the instruction that consumed the result), you'd be able to significantly reduce rename register pressure. Unfortunately, it turned out that speculative execution required you to do something that looked a lot like register rename for these values.
        AArch64 intentionally tries to provide a useful common set of fused operations. x86-64 does it largely by accident, but there isn't a clear winner here.
        The one big win that we found has not really made it into any instruction set, which continues to surprise me. A cheap way of marking a register as dead can massively improve performance. I've seen a 2x speedup on x86 from putting an xor rax, rax at the end of a tight loop because the pipeline was stalling having to keep all of the old rax values around in rename registers, even though no possible successor blocks used them. If I were designing a new ISA for high-performance systems, I'd be tempted to do one of the following:
        - Have a short instruction with a bitmap of dead registers that compilers could insert at the end of the basic block for the back arc in a loop, to mark multiple registers as dead.
        - Put an extra bit in each source operand to mark it as a kill.
        The latter hurts density, but would probably be a bigger win because it would let you rewrite a load of operations from allocating rename registers to using forwarding in the front end.
        The other bottleneck is parallel decode. A64 ditched the variable-length instruction set that T32 introduced because it makes orthogonal decoding trivial. Fetch 16 bytes, decode four instructions. Apple's implementations make a lot of use of this and also have a nice sideways forwarding path that allows values produced in one instruction to be directly forwarded to a consumer in the same bundle without going via register rename (if the value isn't clobbered, they still need to allocate a rename register).
        x86-64 is staggeringly bad here. As Stephen Dolan quipped, x86 chips don't have an instruction decoder, they have an instruction parser. Instructions can be very long, or as short as a single byte. Mostly this doesn't matter on modern x86-64 chips because 90% of dynamic execution is in loops, and modern x86-64 chips do at least some caching of decoded things (at the very least, caching the locations of instruction boundaries, often caching of decoded micro-ops) in loops.
        And this is where it gets really interesting. The extra decode steps and the extra caches add area, but they also save instruction cache by providing a dense instruction encoding (x86-64 isn't perfect there, it's not that close to a Huffman encoding over common instruction sequences, but it's a moderately good approximation). Is that a good tradeoff? It almost certainly varies between processes (the relative power and area costs of SRAM vs logic vary a surprising amount).
        The thing that I find really surprising is the memory model. Arm's weak memory model is supposed to make high-performance implementations easier by enabling more reorderings in the core, whereas x86's TSO is far more constrained. Apple's processors have a mode that (as I understand it) decodes loads and stores into something with the same semantics as the load-acquire / store-release instructions, but with the normal addressing modes. It doesn't seem to hurt performance (it's only used by Rosetta 2, so it's hard to do an exact comparison, but x86-64 emulation is ludicrously fast on these machines, so it isn't hurting that much. I'd love to see some benchmarks enabling it by default on everything).
        Doing any kind of apples-to-apples comparison is really hard because things in the architecture force implementation choices. You would not implement an x86-64 and an AArch64 core in exactly the same way. There was a paper at ISCA in 2015 that I absolutely hated (not least because it was used to justify a load of bad design decisions in RISC-V) that claimed it did this, but when you looked at their methodology, they'd simulated cores that were not at all how you would implement the ISAs that they were discussing.
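        A minimal C sketch of that decode asymmetry; the variable-length scheme below is a made-up toy for illustration, not the real x86 encoding:

            #include <stddef.h>
            #include <stdint.h>

            /* Fixed-width, A64-style: a 16-byte fetch group is always four 32-bit
             * instructions, so each decode slot can pick up its word independently. */
            static void decode_fixed_bundle(const uint8_t fetch[16], uint32_t insn[4])
            {
                for (int i = 0; i < 4; i++) {           /* in HW these run in parallel */
                    insn[i] = (uint32_t)fetch[4 * i]
                            | (uint32_t)fetch[4 * i + 1] << 8
                            | (uint32_t)fetch[4 * i + 2] << 16
                            | (uint32_t)fetch[4 * i + 3] << 24;
                }
            }

            /* Variable-length toy encoding (NOT real x86): length is a function of
             * the first byte. Finding instruction starts is a serial scan, which is
             * why wide x86 decoders pre-decode and cache instruction boundaries.    */
            static size_t toy_length(uint8_t first_byte)
            {
                return 1 + (first_byte & 0x07);         /* 1..8 bytes in this toy     */
            }

            static int find_instruction_starts(const uint8_t fetch[16],
                                               size_t starts[], int max_insns)
            {
                size_t pos = 0;
                int n = 0;
                while (pos < 16 && n < max_insns) {
                    starts[n++] = pos;
                    pos += toy_length(fetch[pos]);      /* must finish insn k first   */
                }
                return n;
            }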
       
 (DIR) Post #Ay3QD3rx5fraPBoPbc by pro@mu.zaitcev.nu
       2025-09-10T06:06:36.624697Z
       
       0 likes, 0 repeats
       
       @david_chisnall @regehr I wonder what's going on in E2k land these days. Just a decade ago they didn't have register rename at all. The heart of the design was a large (64 or 128) multiport (4 minimum) register file. They claimed that performance was competitive against x86 made on a similar process (although sadly not world-beating).
       
 (DIR) Post #Ay3g5EYx2y5QWLDL3A by david_chisnall@infosec.exchange
       2025-09-10T08:05:50Z
       
       0 likes, 0 repeats
       
        @aedancullen No, it's very sad. Doug Burger gave a keynote at ISCA where he talked about the E2 architecture. There were two follow-on ISAs that improved on that, and then a RISC ISA. I think most of the folks who worked on that project have left MS (Microsoft's senior leadership does not create an environment conducive to people wanting to work on projects with a multi-year time to market), so that expertise is now scattered across the industry, and it's probably too hard to collect enough people to write up some decent publications. Midori (a clean-slate OS written in .NET) suffered a similar fate, where the only publications relate to an earlier research prototype.
       
 (DIR) Post #Ay3g5FoaOKBeP7XL3g by ignaloidas@not.acu.lt
       2025-09-10T09:04:29.418Z
       
       0 likes, 0 repeats
       
        @david_chisnall@infosec.exchange @aedancullen@social.treehouse.systems there's some non-academic stuff about midori from joe duffy which I think includes the later stages of it? https://joeduffyblog.com/2015/11/03/blogging-about-midori/
        Of course it will be far from comprehensive, but still plenty of interesting details (I really like the error model part of it)
       
 (DIR) Post #Ay8nnffbILHAnsGCGm by TomF@mastodon.gamedev.place
       2025-09-12T20:24:29Z
       
       0 likes, 0 repeats
       
       @wolf480pl @rygorous A lot of VLIW was driven by DSPs. And DSP ISAs are absolutely wild - they expose all sorts of things that no CPU ISA would dream of doing. They're really fun to study, but terrifying to write compilers for.
       
 (DIR) Post #Ay8o0831c2iqI59nxg by wolf480pl@mstdn.io
       2025-09-12T20:26:46Z
       
       0 likes, 0 repeats
       
       @TomF @rygorous can you point me to some fun ones?
       
 (DIR) Post #Ay8oDld3lIdAoXW0wa by TomF@mastodon.gamedev.place
       2025-09-12T20:29:13Z
       
       0 likes, 0 repeats
       
       @wolf480pl @rygorous The original reasonably sane, very popular one is the Motorola 56001. It's VLIW, and I think has some explicit pipelining as well? It's a good place to start, because it was only eccentric, not mad.
       
 (DIR) Post #Ay8oPsenSRFe5P6VPs by wolf480pl@mstdn.io
       2025-09-12T20:31:26Z
       
       0 likes, 0 repeats
       
        @TomF @rygorous
        > The stack area is allocated in a separate address space
        > The stack, which is used when subroutine calls and "long interrupt"s, is fifteen in depth
        this would work well in the game "TuringComplete" :D
       
 (DIR) Post #Ay8uZOy9I5WYdE4xZw by TomF@mastodon.gamedev.place
       2025-09-12T20:32:09Z
       
       0 likes, 0 repeats
       
        @wolf480pl @rygorous At the other end of the scale is probably this - the Saturn DSP:
        https://www.youtube.com/watch?v=n8plen8cLro
        This is super extreme, but a whole bunch of other DSPs did similar things, e.g. data hazards become the compiler's problem, not the hardware's.
       
 (DIR) Post #Ay8uZQ9AuZwEHiFHP6 by TomF@mastodon.gamedev.place
       2025-09-12T20:40:04Z
       
       0 likes, 0 repeats
       
        @wolf480pl @rygorous Oh, and the followup video, which is more o_0:
        https://www.youtube.com/watch?v=lxpp3KsA3CI
       
 (DIR) Post #Ay8uZRI4eyePpbPtui by wolf480pl@mstdn.io
       2025-09-12T21:40:21Z
       
       0 likes, 0 repeats
       
        @TomF @rygorous The only cursed thing in those two videos is the manual.
        If you look at the block diagram, clearly the data paths from the multiplier to the adder, and from the adder to the accumulator, don't go through the X and Y buses.
        I don't know why the bits enabling these data paths belong to the X-bus and Y-bus related fields in the instruction, but like, it's a VLIW, it just makes sense that a particular bit in the instruction would control a particular flop or mux.
       
 (DIR) Post #Ay8upQa548tCGQSm48 by TomF@mastodon.gamedev.place
       2025-09-12T21:43:13Z
       
       0 likes, 0 repeats
       
       @wolf480pl @rygorous Right. Once you realise there is no "multiply" instruction - the multiplier just runs all the time. Then you understand that it's not "an instruction set" so much as it is a bunch of direct mux controllers. Wheeee! :-)
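        A minimal C sketch of what that looks like, with a made-up control word (field names and widths invented for illustration, not the real 56001 or Saturn DSP format):

            /* Toy VLIW control word in the spirit described above: each bitfield is
             * a mux/enable control, and "decode" is just routing bits to the right
             * datapath element - there is no opcode in the usual sense.            */
            struct toy_dsp_word {
                unsigned x_bus_src     : 2;  /* which register drives the X bus      */
                unsigned y_bus_src     : 2;  /* which register drives the Y bus      */
                unsigned mul_to_add    : 1;  /* forward multiplier output to adder   */
                unsigned add_to_acc    : 1;  /* latch adder output into accumulator  */
                unsigned acc_writeback : 1;  /* write accumulator back to memory     */
                unsigned unused        : 25;
            };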
       
 (DIR) Post #Ay8utcsrGYOfZxFRQm by wolf480pl@mstdn.io
       2025-09-12T21:44:02Z
       
       0 likes, 0 repeats
       
       @TomF @rygorous I see how this can be confusing for someone who hasn't taken an FPGA class :P
       
 (DIR) Post #B2uAnfsBeuTC7DiAiW by rygorous@mastodon.gamedev.place
       2026-02-02T06:04:57Z
       
       0 likes, 0 repeats
       
        @koakuma @wolf480pl There are use cases for referring to PC, especially PC-relative reads for global literals that don't require relocations, but there are ways to do that that use instruction encoding space better.
        E.g. ARM A64 has PC-relative loads as a separate addressing mode (with a much larger immediate field than other reg-relative loads). Likewise they use the same encoding (#31) for both the zero register xzr and the stack pointer xsp, to limit wasted slots.
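        A minimal C sketch of how a decoder might resolve that shared slot; the enum and helper are invented for illustration and simplify the real A64 rules:

            /* A 5-bit register field has 32 encodings but there are only 31 GPRs
             * (X0..X30), so slot 31 is reused: it names SP where a stack pointer
             * makes sense (e.g. a load/store base) and the zero register elsewhere. */
            enum operand_role { ROLE_LOADSTORE_BASE, ROLE_DATA_OPERAND };

            enum resolved_reg { REG_GPR_BASE = 0, REG_SP = 31, REG_ZR = 32 };

            static int resolve_reg_slot(unsigned field5, enum operand_role role)
            {
                if (field5 < 31)
                    return REG_GPR_BASE + (int)field5;   /* ordinary GPR X0..X30     */
                /* slot 31: context decides which "extra" register it names          */
                return role == ROLE_LOADSTORE_BASE ? REG_SP : REG_ZR;
            }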
       
 (DIR) Post #B2uAnhcJAuxNWYIWFE by koakuma@uwu.social
       2026-02-02T06:42:13Z
       
       0 likes, 0 repeats
       
        @wolf480pl @rygorous So I suppose in e.g. the case of wanting to do PC-relative accesses, it's better to copy PC and then act on the copied value instead of dedicating a register slot to it?
        Assuming here that direct PC-relative addressing is unavailable
       
 (DIR) Post #B2uAnio2klwDDEnPAu by wolf480pl@mstdn.io
       2026-02-02T08:21:37Z
       
       0 likes, 0 repeats
       
        @koakuma that, or have a separate "rN = pc + imm" instruction like RISC-V
        @rygorous