[HN Gopher] The Intel 80376 - A legacy-free i386 with a twist (2...
___________________________________________________________________
The Intel 80376 - A legacy-free i386 with a twist (2010)
Author : anyfoo
Score : 61 points
Date : 2022-08-13 06:13 UTC (2 days ago)
(HTM) web link (www.pagetable.com)
(TXT) w3m dump (www.pagetable.com)
| allenrb wrote:
| I'd forgotten about the 80376 but it hits at a question I've
| occasionally had over the last few years. Why have we not seen a
| "modernized" x86 CPU that strips out everything pre-AMD64? The
| answer seems likely to be one or both of:
|
| 1. There are more users of legacy modes than is obvious to us on
| HN.
|
| 2. The gains in terms of gates saved, critical paths shortened,
| and lower power consumption just don't amount to much.
|
| My guess is that #2 is the dominant factor. If there were
| actually significant gains to be had on a "clean"(er) x86 design,
| we'd see it in the market regardless of #1.
| klelatti wrote:
| x86 did appear in one context where legacy compatibility was
| likely to have been a much smaller issue (or a non-issue?) and
| where efficiencies (e.g. in power consumption) would have been
| even more valuable - mobile devices running Android.
|
| The fact that a cleaned-up version wasn't used would seem to
| support your hypothesis.
| tenebrisalietum wrote:
| So I learned about the "hidden" x86 mode called XuCode (https:/
| /www.intel.com/content/www/us/en/developer/articles/t...) -
| which is x86 binary code placed into RAM by the microcode and
| then "called" by the microcode for certain instructions -
| particularly SGX ones if I'm remembering correctly.
|
| Wild speculative guess: It's entirely possible some of the pre-
| AMD64 stuff is actually internally used by modern Intel and AMD
| CPUs to implement complex instructions.
| kmeisthax wrote:
| Oh boy, we've gone all the way back to Transmeta Code
| Morphing Software. What "ring" does this live on now? Ring
| -4? :P
|
| Jokes aside, I doubt XuCode would use pre-AMD64 stuff;
| microcode is lower-level than that. The pre-AMD64 stuff is
| already handled with sequences of microcode operations
| because it's not really useful for modern applications[0].
| It's entirely possible for microcode to implement other
| instructions too, and that's what XuCode is doing[1].
|
| The real jank is probably hiding in early boot and SMM,
| because you need to both jump to modern execution modes for
| client or server machines _and_ downgrade to BIOS
| compatibility for all those industrial deployments that want
| to run ancient software and OSes on modern machines.
|
| [0] The last time I heard someone even talk about x86
| segmentation, it was as part of enforcing the Native Client
| inner sandbox.
|
| [1] Hell, there's no particular reason why you can't have a
| dual-mode CPU with separate decoders for ARM and x86 ISAs. As
| far as I'm aware, however, such a thing does not exist...
| though evidently at one point AMD was intending on shipping
| Ryzen CPUs with ARM decoders in them.
| Macha wrote:
| As the article points out, modern x86 CPUs boot up in 16 bit
| mode, then get transferred into 32 bit mode, then 64 bit mode.
| So right out the gate such a CPU is not compatible with
| existing operating systems, so now you have a non-compatible
| architecture. Sure Intel could easily add support to GRUB and
| push Microsoft to do it for new Windows media, but that won't
| help the existing install base. Intel tried launching a non-
| compatible CPU once, it was Itanium, it didn't go so well for
| them.
|
| Plus I'm sure there's crazy DRM rootkits that depend on these
| implementation details.
|
| Also, AMD has experimented with not-quite-PC-compatible x86
| setups already in the consoles. As the fail0verflow talk about
| Linux on PS4 emphasised, the PS4 is x86, but not a PC. So
| despite building an x86 CPU with somewhat less legacy, AMD didn't
| seem to think it worthwhile to bring it to a more general-purpose
| platform.
|
| Also AMD/Intel/VIA are the only companies with the licenses to
| produce x86, and you'd need both Intel and AMD to sign off on
| licensing x64 to someone new.
| messe wrote:
| > As the article points out, modern x86 CPUs boot up in 16
| bit mode, then get transferred into 32 bit mode, then 64 bit
| mode. So right out the gate such a CPU is not compatible with
| existing operating systems, so now you have a non-compatible
| architecture
|
| Except that a modern OS is booted in UEFI mode, meaning that
| the steps of going from 16 -> 32 -> 64-bit mode are all
| handled by the firmware, not the kernel or the bootloader.
| The OS kernel will only (at most) switch to 32-bit
| compatibility mode (a submode of long mode, not protected
| mode) when it needs to run 32-bit apps, otherwise staying in
| 64-bit mode 100% of the time.
| anyfoo wrote:
| Yeah. Long mode is a bit of a line in the sand, leaving
| behind many features that were kept for compatibility
| (segmentation, vm86...). It came at a time when,
| fortunately, the mainstream OSes had enough abstraction that
| software no longer had to be written for what was
| effectively bare metal, with DOS being almost more of a
| "software bootloader with filesystem services".
| [deleted]
| anyfoo wrote:
| > Intel tried launching a non-compatible CPU once, it was
| Itanium, it didn't go so well for them.
|
| That may be only secondary, though. Itanium simply failed to
| deliver on its performance promises and to be competitive. The
| compiler was supposed to effectively perform instruction
| scheduling itself, and writing such a compiler turned out to be
| more difficult than anticipated.
| FullyFunctional wrote:
| I've seen this a lot, but IMO the truth is slightly
| different: the assumption behind EPIC was that a compiler
| _could_ do the scheduling, which turned out to be
| _impossible_. The EPIC effort's roots go way back, but I
| still don't understand how they failed to foresee the
| ever-growing tower of caches, which unavoidably leads to a
| crazy wide latency range for loads (3 to 400+ cycles), which
| in turn is why we now have these very deep OoO machines.
| (Tachyum's Prodigy appears to be repeating the EPIC mistake
| with very limited but undisclosed reordering.)
|
| OoO EPIC has been suggested (I recall an old comp.arch
| posting by an Intel architect) but never got green-lit. I
| assume they had bet so much on the compiler assumption that the
| complexity would have killed it.
|
| It's really a shame, because EPIC did get _some_ things
| right. The compiler absolutely can make the front end's life
| easier by making dependences more explicit (though I would
| do it differently) and by making control transfers much
| easier to deal with (the 128-bit block alone saves 4 bits
| in every BTB entry, etc.). On balance, IA-64 was a
| committee-designed train wreck, piling on way too much
| complexity, and it failed both as a brainiac and as a speed
| demon.
|
| Disclaimer: I have an Itanic space heater that I
| occasionally boot up for a chuckle - and then shut down
| before the hearing damage gets permanent.
| klelatti wrote:
| > Intel tried launching a non-compatible CPU once, it was
| Itanium, it didn't go so well for them.
|
| More than once. The iAPX432, if anything, went worse.
| anyfoo wrote:
| Yeah, but that, again, was for far worse reasons than just
| not being "compatible". In fact, iAPX432 was exceptionally
| bad. Intel's i960 for example fared much better in the
| embedded space (where PC-compatibility did not matter).
| klelatti wrote:
| Indeed, and in fairness I don't think the 432 was ever
| intended as a PC CPU replacement, whilst Itanium was
| designed to replace some x86 servers.
|
| As an aside, I'm still astonished that the 432 and Itanium
| got as far as they did, with so much cash spent on them,
| without conclusive proof that performance would be
| competitive. That seems like a prerequisite for projects of
| this size.
| rodgerd wrote:
| Think about the disruption that Apple caused when they moved
| from 32-bit x86 being supported to deprecating it - there was a
| great deal of angst, and that's on a vertically-integrated
| platform that is relatively young (yes, I know that NeXT is
| old, but MacOS isn't, really). Now imagine that on Windows - a
| much older platform, with much higher expectations for
| backwards compat. It would be a bloodbath of user rage.
|
| More importantly, though, backward compat has been Intel's moat
| for a long, long time. Intel have been trying to get people to
| buy non-16-bit-compat processors for literally 40 years!
| They've tried introducing a lot of mildly (StrongARM -
| technically a buy I suppose, i860/i960) and radically (i432,
| Itanium) innovative processors, and they've all been
| indifferent or outright failures in the marketplace.
|
| The market has been really clear on this: it doesn't care for
| Intel graphics cards or SSDs or memory, it hates non-x86 Intel
| processors. Intel stays in business by shipping 16-bit-
| compatible processors.
| rwmj wrote:
| Quite a lot of modern Arm 64 bit processors have dropped 32 bit
| (ie. ARMv7) support. Be careful what you wish for though! It's
| still useful to be able to run 32 bit i386 code at a decent
| speed occasionally. Even on my Linux systems I still have
| hundreds of *.i686.rpms installed.
| jleahy wrote:
| The 64-bit ARM instructions were designed in a way that made
| supporting both modes in parallel very expensive from a
| silicon perspective. In contrast AMD were very clever with
| AMD64 and designed it such that very little additional
| silicon area was required to add it.
| danbolt wrote:
| I feel as though a lot of the consumer value of x86+Windows
| comes from its wide library of software and compatibility.
|
| > than is obvious to us on HN
|
| I think your average HNer is more likely to interact with
| Linux/Mac workstations or servers, where binary compatibility
| isn't as necessary.
| unnah wrote:
| Instruction decoding is a bottleneck for x86 these days: Apple
| M1 can do 8-wide decode, Intel just managed to reach 6-wide in
| Alder Lake, and AMD Zen 3 only has a 4-wide decoder. One would
| think that dropping legacy 16-bit and 32-bit instructions would
| enable simpler and more efficient instruction decoders in
| future x86 versions.
| amluto wrote:
| Sadly not. x86_64's encoding is extremely similar to the
| legacy encodings. AIUI the fundamental problem is that x86 is
| a variable-length encoding, so a fully parallel decoder needs
| to decode at guessed offsets, many of which will be wrong.
| ARM64 instructions are aligned.
|
| Dumping legacy features would be great for all kinds of
| reasons, but not this particular reason.
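|
| To make the "guessed offsets" part concrete, here's a toy sketch
| in C (the helper names are made up, and this is nothing like real
| decode hardware - it just shows the shape of the dependency):
|
|     /* Toy sketch of why variable-length decode is hard to
|      * parallelize. "toy_insn_length" is a made-up stand-in,
|      * not a real x86 length decoder. */
|     #include <stddef.h>
|     #include <stdio.h>
|
|     static size_t toy_insn_length(const unsigned char *b, size_t i) {
|         return (size_t)(b[i] % 3) + 1;   /* pretend 1..3 byte insns */
|     }
|
|     /* Fixed 4-byte instructions (ARM64-style): every decoder lane
|      * knows its start offset up front, so lanes work in parallel. */
|     static size_t count_fixed(size_t n) {
|         return n / 4;
|     }
|
|     /* Variable length (x86-style): the start of instruction k
|      * depends on the lengths of instructions 0..k-1, so parallel
|      * hardware must speculatively decode at guessed byte offsets
|      * and discard the wrong attempts. */
|     static size_t count_variable(const unsigned char *b, size_t n) {
|         size_t count = 0;
|         for (size_t i = 0; i < n; i += toy_insn_length(b, i))
|             count++;                     /* serial dependency chain */
|         return count;
|     }
|
|     int main(void) {
|         unsigned char stream[32] = { 0x90, 0x48, 0x89, 0xe5, 0x55 };
|         printf("fixed: %zu, variable: %zu\n",
|                count_fixed(sizeof stream),
|                count_variable(stream, sizeof stream));
|         return 0;
|     }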
| hyperman1 wrote:
| This suggests another way forward: re-encode the existing
| opcodes with new, more regular byte sequences, e.g. 32 bits
| per instruction, with some escape for e.g. 64-bit constants.
| You'll have to redo the backend of the assembler, but most
| of the compiler and optimization wisdom can be reused as-is.
| Of course, this breaks backward compatibility completely, so
| the high-performance mode can only be unlocked by recompiles.
| colejohnson66 wrote:
| That was Itanium, and it failed for a variety of reasons;
| one of which was a compatibility layer that sucked. You
| _can't_ get rid of x86's backwards compatibility. Intel
| and AMD have done their best by using vector prefixes
| (like VEX and EVEX)[a] that massively simplify decoding,
| but there's only so much that can be done.
|
| People get caught up in the variable length issue that
| x86 has, and then claim that M1 beats x86 because of
| that. Sure, decoding ARM instructions is easier than x86,
| but the variable length aspect is handled in the
| predecode/cache stage, not the actual decoder. The
| decoder, when it reaches an instruction, already knows
| where the various bits of info are.
|
| The RISC vs CISC debate is useless today. M1's big
| advantage comes from the memory ordering model (and other
| things)[0], not the instruction format. Apple actually
| had to create a special mode for the M1 (for Rosetta 2)
| that enforces the x86 ordering model (TSO with load
| forwarding), and native performance is slightly worse
| when doing so.
|
| [0]:
| https://twitter.com/ErrataRob/status/1331735383193903104
|
| [a]: There's also others that predate AVX (VEX) such as
| the 0F38 prefix group consisting only of opcodes that
| have a ModR/M byte and no immediate, and the 0F3A prefix
| being the same, but with an 8 bit immediate.
| ShroudedNight wrote:
| I thought the critical failure of Itanium was that a
| priori VLIW scheduling turned out to be a non-starter, at
| least as far as doing so efficiently is concerned.
| atq2119 wrote:
| The entire approach is misguided for single-threaded
| performance. It turns out that out-of-order execution is
| pretty important for a number is things, perhaps most
| importantly dealing with variable memory instruction
| latencies (cache hits at various points in the hierarchy
| vs. misses). A compiler simply cannot statically predict
| those well enough.
| Dylan16807 wrote:
| > That was Itanium
|
| What? No. Itanium was a vastly, wildly different
| architecture.
| FullyFunctional wrote:
| Two factual mistakes:
|
| * IA-64 failed primarily because it failed to deliver the
| promised performance. x86 compatibility isn't and wasn't
| essential to success (behold the success of Arm, for
| example).
|
| * M1's advantage has almost nothing to do with the weak
| memory model; rather, it has to do with everything: wider,
| deeper, faster (memory). The ISA being Arm64 also helps in
| many ways. The variable-length x86 instructions can be
| dealt with via predecoding, sure, to an extent, but that
| lengthens the pipeline, which hurts the branch-mispredict
| penalty, which absolutely matters.
| kmeisthax wrote:
| M1 doesn't have a special mode for Rosetta. _All code_ is
| executed with x86 TSO on M1's application processors.
| How do I know this?
|
| Well, did you know Apple ported Rosetta 2 to Linux? You
| can get it by running a Linux VM on macOS. It does not
| require any kernel changes to support in VMs, and if you
| extract the binary to run it on Asahi Linux, it works
| just fine too. None of the Asahi team did _anything_ to
| support x86 TSO. Rosetta also works just fine in m1n1's
| hypervisor mode, which exists specifically to log all
| hardware access to detect these sorts of things. If there
| _is_ a hardware toggle for TSO, it's either part of the
| chicken bits (and thus enabled all the time anyway) or
| turned on by iBoot (and thus enabled before any user code
| runs).
|
| Related point: Hector Martin just upstreamed a patch to
| Linux that fixes a memory ordering bug in workqueues
| that's been around since before Linux had Git history. He
| also found a bug in some ARM litmus tests that he was
| using to validate whether or not they were implemented
| correctly. Both of those happened purely because M1 and
| M2 are so hilariously wide and speculative that they
| trigger memory reorders no other CPU would.
| messe wrote:
| I'm sorry, but please cite some sources, because this
| contradicts everything that's been said about M1's x86
| emulation that I've read so far.
|
| > Well, did you know Apple ported Rosetta 2 to Linux? You
| can get it by running a Linux VM on macOS. It does not
| require any kernel changes to support in VMs, and if you
| extract the binary to run it on Asahi Linux, it works
| just fine too. None of the Asahi team did anything to
| support x86 TSO. Rosetta also works just fine in m1n1's
| hypervisor mode, which exists specifically to log all
| hardware access to detect these sorts of things. If there
| is a hardware toggle for TSO, it's either part of the
| chicken bits (and thus enabled all the time anyway) or
| turned on by iBoot (and thus enabled before any user code
| runs).
|
| Apple tells you to attach a special volume/FS to your
| Linux VM in order for Rosetta to work. When such a volume
| is attached, it runs the VM in TSO mode. As simple as
| that.
|
| The Rosetta binary itself doesn't know whether or not TSO
| is enabled, so it's not surprising that it runs fine under
| Asahi. As marcan42 himself said on Twitter[1], most x86
| applications will run fine even without TSO enabled.
| You're liable to run into edge cases in heavily
| multithreaded code though.
|
| [1]:
| https://twitter.com/marcan42/status/1534054757421432833
|
| > Both of those happened purely because M1 and M2 are so
| hilariously wide and speculative that they trigger memory
| reorders no other CPU would.
|
| In other words, they're not constantly running in TSO
| mode? Because if they were, why would they trigger such
| re-orders?
|
| EDIT: I've just run a modified version of the following
| test program[2] (removing the references to the
| tso_enable sysctl which requires an extension), both
| native and under Rosetta.
|
| Running natively, it fails after ~3500 iterations. Under
| Rosetta, it completes the entire test successfully.
|
| [2] https://github.com/losfair/tsotest/
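|
| For anyone curious, the shape of such a test is roughly this - a
| minimal message-passing sketch I've written for illustration, not
| the linked tsotest program. A writer publishes data and then a
| flag; under x86-TSO, seeing the flag guarantees seeing the data,
| while a weakly ordered ARM core running without TSO may
| occasionally show the flag set but the data still stale:
|
|     /* Minimal message-passing litmus sketch (illustrative only).
|      * Relaxed atomics keep the compiler from optimizing the
|      * accesses away but add no hardware ordering, so a weakly
|      * ordered core may show the "flag set but data stale"
|      * outcome that x86-TSO forbids. Build with -pthread. */
|     #include <pthread.h>
|     #include <stdatomic.h>
|     #include <stdio.h>
|
|     static atomic_int data, flag;
|
|     static void *writer(void *arg) {
|         (void)arg;
|         atomic_store_explicit(&data, 1, memory_order_relaxed);
|         atomic_store_explicit(&flag, 1, memory_order_relaxed);
|         return NULL;
|     }
|
|     int main(void) {
|         for (int i = 0; i < 100000; i++) {
|             atomic_store(&data, 0);
|             atomic_store(&flag, 0);
|             pthread_t t;
|             pthread_create(&t, NULL, writer, NULL);
|             while (atomic_load_explicit(&flag,
|                                         memory_order_relaxed) == 0)
|                 ;                            /* wait for the flag */
|             int d = atomic_load_explicit(&data,
|                                          memory_order_relaxed);
|             pthread_join(t, NULL);
|             if (d == 0) {                  /* reordering observed */
|                 printf("stale data at iteration %d\n", i);
|                 return 1;
|             }
|         }
|         puts("no reordering observed");
|         return 0;
|     }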
| anyfoo wrote:
| > M1 doesn't have a special mode for Rosetta. All code is
| executed with x86 TSO on M1's application processors.
|
| That's not true. (And doesn't your last paragraph
| contradict it already?)
|
| You might just have figured out that most stuff will run
| fine (or appear to run fine for a long time) when TSO
| isn't enabled.
| messe wrote:
| I have no idea why you are being downvoted. You are
| entirely correct.
| anyfoo wrote:
| Thanks, I was puzzled as well. The downvotes seem to have
| stopped, though.
| umanwizard wrote:
| If you're inventing a completely incompatible ISA, why
| not just use ARM64 at that point?
| anamax wrote:
| Perhaps because you don't want to commit to ARM
| compatibility AND licensing fees.
| gabereiser wrote:
| To someone who is interested in bare metal, can you explain
| the significance of this? Is this how much data a CPU can
| handle simultaneously? Via instructions from the kernel?
| anyfoo wrote:
| It means how many instructions the CPU can decode at the
| same time - roughly, "figure out what they mean and
| dispatch what they actually have to _do_ to the functional
| units of the CPU, which perform the work of the
| instruction". It is not directly how much data a
| superscalar CPU can handle in parallel, but it still plays
| a role: there is a limited number of functional units
| available in the CPU, and if you cannot keep them busy with
| decoded instructions, they lie around unused. So a decoder
| that is too narrow can be one of the bottlenecks to optimal
| CPU usage (but note that, as a sibling commenter mentioned,
| the complexity of the instructions/architecture also
| matters, e.g. a single CISC instruction may keep things
| pretty busy by itself).
|
| Whether the instructions come from the kernel or from
| userspace does not matter at all, they all go through the
| same decoder and functional units. The kernel/userspace
| differentiation is a higher level concept.
| monocasa wrote:
| It's more complex than that if you'll excuse the pun.
| Instructions on CISC cores aren't 1 to 1 with RISC
| instructions, and tend to encode quite a bit more micro ops.
| Something like inc dword [rbp+16] is one instruction, but
| would be a minimum of three micro ops (and would be three
| RISC instructions as well).
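|
| In C terms, that one instruction does roughly this (the micro-op
| split is illustrative, not any particular core's actual decode):
|
|     #include <stdint.h>
|
|     /* What "inc dword [rbp+16]" breaks down into, roughly. */
|     void inc_mem32(uint8_t *rbp) {
|         uint32_t tmp;
|         tmp = *(uint32_t *)(rbp + 16); /* uop 1: load  (RISC load)  */
|         tmp = tmp + 1;                 /* uop 2: add   (RISC add)   */
|         *(uint32_t *)(rbp + 16) = tmp; /* uop 3: store (RISC store) */
|     }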
|
| Long story short, this isn't really the bottleneck, or we'd
| see more simple decoders on the tail end of the decode
| window.
| mhh__ wrote:
| Decode bound performance issues are actually pretty rare. X86
| is quite dense.
| johnklos wrote:
| > The 80376 doesn't do paging.
|
| Wait - what? How is that even possible? Do they simply _not_ have
| an MMU? That makes it unsuitable for both old OSes and for new
| OSes. No wonder it was so uncommon.
| anyfoo wrote:
| It bills itself as an "embedded" processor, so it likely just
| wasn't meant to run PC OSes. According to Wikipedia at least,
| Intel did not even expect the 286 to be used for PCs, but for
| things such as PBXs. And ARM Cortex-M still doesn't have an MMU
| either; for some applications you can just do without one.
| Especially because both the 286 and this 376 beast did have
| segmentation, which could subsume some of the need for an MMU
| (separated address spaces, if you don't need paging and are
| content with e.g. buddy allocation for dividing available
| memory among tasks).
| [deleted]
| marssaxman wrote:
| It is very common for embedded systems processors not to have
| an MMU. If it ran any OS at all it would likely have been some
| kind of RTOS.
| blueflow wrote:
| Using paging/virtual memory is not a requirement for an OS,
| even though all currently popular OSes make use of it.
|
| Intel CPUs before the 80286 did not have an MMU, either.
| anyfoo wrote:
| Did they even call it an MMU back then? The 286 only had
| segmentation, which arguably is just an addressing mode. It
| introduced descriptors that had to be resolved, but that
| happened when selecting the descriptor (i.e. when loading the
| segment register), at which point a hidden base, limit, and
| permission "cache" was updated - unlike paging, where things
| are resolved when accessing memory.
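|
| A toy model of that difference, purely illustrative (the names,
| the widths, and the flat descriptor table are made up, not real
| 286 hardware):
|
|     #include <stdint.h>
|     #include <stdio.h>
|
|     struct descriptor { uint32_t base, limit; };
|     struct seg_reg { uint16_t selector; struct descriptor cached; };
|
|     static struct descriptor gdt[8192];    /* toy descriptor table */
|
|     /* Resolution happens once, when the selector is loaded: the
|      * hidden base/limit "cache" in the segment register is filled. */
|     static void load_selector(struct seg_reg *sr, uint16_t sel) {
|         sr->selector = sel;
|         sr->cached = gdt[sel >> 3];
|     }
|
|     /* Each access only adds the cached base and checks the cached
|      * limit - no table walk per access, unlike paging. */
|     static uint32_t translate(const struct seg_reg *sr, uint32_t off) {
|         if (off > sr->cached.limit) {
|             fprintf(stderr, "segment limit violation\n");
|             return 0;                      /* toy error handling */
|         }
|         return sr->cached.base + off;
|     }
|
|     int main(void) {
|         gdt[1] = (struct descriptor){ .base = 0x10000, .limit = 0xFFFF };
|         struct seg_reg ds;
|         load_selector(&ds, 1 << 3);
|         printf("0x%x\n", translate(&ds, 0x100)); /* prints 0x10100 */
|         return 0;
|     }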
___________________________________________________________________
(page generated 2022-08-15 23:00 UTC)