[HN Gopher] The x86 architecture is the weirdo (2004)
___________________________________________________________________
The x86 architecture is the weirdo (2004)
Author : signa11
Score : 82 points
Date : 2022-04-19 01:45 UTC (21 hours ago)
(HTM) web link (devblogs.microsoft.com)
(TXT) w3m dump (devblogs.microsoft.com)
| aap_ wrote:
| I wouldn't say the x86 is the weirdo in these cases; it's just
| not a RISC.
| lynguist wrote:
| In essence, it has been a RISC since the days of the Pentium
| Pro. Each instruction is decoded into several micro-operations,
| and the trick to making the CPU run fast lies in the decoder.
|
| This also leads to situations where a sequence of many simple
| x86 instructions runs faster than fewer complex ones, because
| the decoder may not break the complex instructions into the
| most optimal micro-operation sequence.
| aap_ wrote:
| That's an implementation detail. The submission was about
| architecture.
| saagarjha wrote:
| No, this take is incorrect. x86-64 instructions _do_ get
| lowered into uops, but all the common ones execute as only
| one, or maybe two. The real win for uops is letting Intel
| microcode the instructions that nobody cares about, but you
| really shouldn't be using them anyway.
| phamilton wrote:
| To what extent is the decoding consistent? It always felt
| weird to me that we'd do something in hardware on every
| execution that could be done once in software at compile
| time. But if the decoder does smart things that would make
| more sense to me. For example, with SMT the decoder could
| choose not to use certain ops if the compute units are in use
| by the other thread.
| buildbot wrote:
| I am not sure of any arch that does this, but having a
| second-stage decode buys two things off the top of my head:
| better instruction-cache usage, especially for small loops,
| and the flexibility to decode ops differently on different
| versions of the same arch from one binary. Otherwise you
| would have to ship tons of different binaries for each
| flavor of x86 out there.
| imtringued wrote:
| Intel and AMD CPUs have micro-op caches for tight loops.
| akira2501 wrote:
| > on every execution that could be done once in software at
| compile time
|
| Which is to me, essentially, VLIW. Those architectures
| didn't work out very well. There's a lot of memory pressure
| just to move instructions around and the costs didn't seem
| to outweigh the benefits.
| marcosdumay wrote:
| Notice that it's the only one remaining?
|
| Yes, it's a weirdo.
| FullyFunctional wrote:
| Not defending the weirdo, but IBM Z is a thing that is
| shipping, _heavily_ used, and still evolving. It's more CISC
| than even x86.
|
| I do find it sad that we are stuck in an 80-year-old paradigm
| that is a horrible fit for how CPUs are _actually_
| implemented in 2022, but the inertia is strong in this one.
| kosherhurricane wrote:
| Inertia is the same as backward compatibility.
|
| Kinda like gas engines and fuel pumps. Or wall socket
| formats. Inertia.
| aap_ wrote:
| You mean as opposed to the other architectures mentioned in
| the article? ppc, mips, itanium and alpha. Two of which are
| utterly dead, one of which looks like it's dying (mips) and
| one that is little more than a niche product (ppc).
| DaiPlusPlus wrote:
| > Notice that it's the only one remaining?
|
| Motorola 68K is still around too.
| Findecanor wrote:
| > The x86 has a strict memory model
|
| x86 doesn't really impose _sequential consistency_ between
| cores/threads. It imposes a _Total Store Order_ (TSO), in which
| stores are always ordered with respect to each other, but a
| store can be reordered after a later load.
|
| SPARC had TSO on later chips whereas earlier chips had weaker
| models. MIPS developed the other way: older versions had
| stronger memory ordering and later ones got relaxed memory
| ordering.
|
| RISC-V chips can optionally support TSO but it seems that the
| motivation is programs ported from x86. IBM's z/Architecture
| (with lineage back to IBM/360) is still alive and also has TSO.
| BTW. The Mill is supposed to offer sequential consistency, but it
| remains to be seen if that will be a performance bottleneck.
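|
| A minimal Rust sketch of the classic store-buffer litmus test
| for this store-after-load reordering (illustrative only; the
| 0/0 outcome usually needs a loop over many runs to observe):
|
|   use std::sync::atomic::{AtomicUsize, Ordering};
|   use std::thread;
|
|   static X: AtomicUsize = AtomicUsize::new(0);
|   static Y: AtomicUsize = AtomicUsize::new(0);
|
|   fn main() {
|       // Each thread stores to one variable, then loads the
|       // other. Under TSO the store may sit in a store buffer
|       // past the following load, so r1 == 0 && r2 == 0 is a
|       // legal outcome. SeqCst on all four accesses forbids it.
|       let t1 = thread::spawn(|| {
|           X.store(1, Ordering::Release);
|           Y.load(Ordering::Acquire) // r1
|       });
|       let t2 = thread::spawn(|| {
|           Y.store(1, Ordering::Release);
|           X.load(Ordering::Acquire) // r2
|       });
|       let (r1, r2) = (t1.join().unwrap(), t2.join().unwrap());
|       println!("r1={} r2={}", r1, r2); // "r1=0 r2=0" can happen
|   }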
| ajross wrote:
| > x86 doesn't really impose sequential consistency between
| cores/threads. It imposes a Total Store Order (TSO), in which
| stores are always ordered with respect to each other, but a
| store can be reordered after a later load.
|
| To be more pedantic (and hoping I remember this correctly): TSO
| is indistinguishable in software from full sequential
| consistency. Any code to detect the difference must by
| definition be subject to race conditions (or must be an atomic
| read/write operation that on x86 would be serializing anyway).
| So x86 in fact does provide SC semantics "between
| cores/threads". It does have visible reordering artifacts from
| the perspective of hardware designs (e.g. MMIO registers) where
| a load has side effects.
| fmajid wrote:
| An article by _the_ Raymond Chen. Here's Joel Spolsky's
| contemporary take on Chen:
|
| https://www.joelonsoftware.com/2004/06/13/how-microsoft-lost...
| [deleted]
| comex wrote:
| Let's see how things have changed since 2004 when this was
| published:
|
| > The x86 has a small number (8) of general-purpose registers
|
| x86-64 added more general-purpose registers.
|
| > The x86 uses the stack to pass function parameters; the others
| use registers.
|
| OS vendors switched to registers for x86-64.
|
| > The x86 forgives access to unaligned data, silently fixing up
| the misalignment.
|
| Now ubiquitous on application processors.
|
| > The x86 has variable-sized instructions. The others use fixed-
| sized instructions.
|
| ARM introduced Thumb-2, with a mix of 2-byte and 4-byte
| instructions, in 2003. PowerPC and RISC-V also added some form of
| variable-length instruction support. On the other hand, ARM
| turned around and dropped variable-length instructions with its
| 64-bit architecture released in 2011.
|
| > The x86 has a strict memory model ... The others have weak
| memory models
|
| Still x86-only.
|
| > The x86 supports atomic load-modify-store operations. None of
| the others do.
|
| As opposed to load-linked/store-conditional, which is a different
| way to express the same basic idea? Or is he claiming that other
| processors didn't support any form of atomic instructions, which
| definitely isn't true?
|
| At any rate, ARM previously had load-linked/store-conditional but
| recently added a native compare-and-swap instruction with
| ARMv8.1.
|
| > The x86 passes function return addresses on the stack. The
| others use a link register.
|
| Still x86-only.
| NtGuy25 wrote:
| In regards to memory alignment, it's even worse. Most
| instructions work on unaligned data, but some instructions
| require 8-byte, 16-byte, 32-byte, or 64-byte alignment, and I
| think there's even some 128- and 256-byte alignment. It's one
| of the more common pitfalls someone can find themselves in
| when coding x86-64 asm.
| phamilton wrote:
| I vaguely recall that LL/SC solves the ABA problem whereas
| load-modify-store does not.
|
| It's been a while, so I'm going to define my understanding of
| the ABA problem in case I misunderstood it:
|
| x86 only supplies cmpxchg instructions, which will update a
| value only if it matches the passed-in previous value. There's
| a class of concurrency bugs where the value is modified away
| from its initial value and then modified back to its initial
| value again. cmpxchg can't detect that condition, so if the
| difference is meaningful, the 128-bit cmpxchg will often be
| used with a counter in the second 64 bits that is incremented
| on each write to catch this case.
|
| LL/SC will trigger on any intervening write, rather than
| comparing the value, providing the stronger guarantee.
|
| (Please correct me if this is inaccurate; it's been a hot
| minute since I learned this and I'd love to be more current on
| it).
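|
| A rough Rust sketch of that counter trick, scaled down to a
| 32-bit payload plus a 32-bit generation tag in one AtomicU64
| (the names and widths here are illustrative, not how any
| particular library does it):
|
|   use std::sync::atomic::{AtomicU64, Ordering};
|
|   struct Tagged(AtomicU64);
|
|   impl Tagged {
|       fn pack(value: u32, tag: u32) -> u64 {
|           ((tag as u64) << 32) | value as u64
|       }
|
|       // Succeeds only if both the payload and the generation
|       // tag are unchanged; an A->B->A change of the payload
|       // still bumps the tag, so the compare fails.
|       fn try_update(&self, expected: u32, new: u32) -> bool {
|           let word = self.0.load(Ordering::Acquire);
|           let (value, tag) = (word as u32, (word >> 32) as u32);
|           if value != expected {
|               return false;
|           }
|           let next = Self::pack(new, tag.wrapping_add(1));
|           self.0
|               .compare_exchange(word, next, Ordering::AcqRel,
|                                 Ordering::Acquire)
|               .is_ok()
|       }
|   }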
| zozbot234 wrote:
| AIUI, a cmpxchg loop is enough to implement read-modify-write
| of any atomically sized value. The ABA problem becomes
| relevant when trying to implement more complex lock-free data
| structures.
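|
| For instance, a sketch of an atomic maximum built from a plain
| compare-exchange loop (Rust has fetch_max these days, but the
| loop shows the general pattern):
|
|   use std::sync::atomic::{AtomicU64, Ordering};
|
|   fn atomic_max(cell: &AtomicU64, val: u64) -> u64 {
|       let mut cur = cell.load(Ordering::Relaxed);
|       loop {
|           if cur >= val {
|               return cur; // already at least `val`
|           }
|           // Try to install `val`; on failure we get the value
|           // that beat us and retry with it.
|           match cell.compare_exchange_weak(
|               cur, val, Ordering::AcqRel, Ordering::Relaxed)
|           {
|               Ok(prev) => return prev,
|               Err(observed) => cur = observed,
|           }
|       }
|   }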
| ceeplusplus wrote:
| There is still one big thing that hasn't changed, and it has
| been the subject of discussion on whether x86-64 fundamentally
| bottlenecks CPU architecture. Variable-length instructions
| mean decoder complexity scales quadratically rather than
| linearly. It's been speculated that this is one reason why even
| the latest x86 architectures stick with relatively narrow
| decode, while Arm CPUs with lower performance levels (e.g.
| Cortex X1/2) are already 5-wide and Apple is 8-wide.
| cesarb wrote:
| > ARM introduced Thumb-2, with a mix of 2-byte and 4-byte
| instructions, in 2003. PowerPC and RISC-V also [...]
|
| x86 is still the weirdo. Both Thumb-2 and the RISC-V C
| extension (I don't know about PowerPC) have only 2-byte and
| 4-byte instructions, aligned to 2 bytes; x86 instructions can
| vary from 1 to 15 bytes, with no alignment requirement.
| chasil wrote:
| ARM Thumb actually licensed patents from Hitachi Super-H, who
| did this first.
|
| Supposedly, "MIPS processors [also] have a MIPS-16 mode."
|
| https://en.m.wikipedia.org/wiki/SuperH
| cryptonector wrote:
| I suspect variable-length instructions are a big gain because
| you get to pack instructions more tightly and so have fewer
| cache misses. Though, obviously, it's going to depend on
| having an instruction set that yields shorter text for
| typical assembly than fixed-sized instructions would. (In a
| way, opcodes need a bit of Huffman encoding!)
|
| Any losses from having to do more decoding work are probably
| offset by having sufficiently deep pipelines and enough
| decoders.
| phdelightful wrote:
| The counterpoint is that variable-length decoding
| introduces sequential dependence in the decoding, i.e. you
| don't know where instruction 2 starts until you've decoded
| instruction 1. This probably limits how many decoders you
| can have. If you know all your instructions are 4B you can
| basically decode as many as you want in parallel.
| classichasclass wrote:
| Power10 has prefixed instructions. These are essentially
| 64-bit instructions in two pieces. They are odd, even
| (or particularly) to those of us who have worked with the
| architecture for a long time, and not much else supports
| them yet. Their motivation is primarily to more efficiently
| represent constants and offsets.
| ncmncm wrote:
| Apple M1 supports optional x86-style memory event ordering, so
| that its x86 emulation could be made to work without penalty.
|
| When SPARC got new microcode supporting unaligned access, it
| turned out to be a big performance win, as the alignment
| padding had made for a bigger cache footprint. That was an
| embarrassment for the whole RISC industry. Nobody today would
| field a chip that enforced alignment.
|
| The alignment penalty _might_ have been smaller back when clock
| rates were closer to memory latency, but caches were radically
| smaller then, too, so they were even more affected by an
| inflated footprint.
| jnordwick wrote:
| > as the alignment padding had made for a bigger cache
| footprint
|
| I argued with some of the Rust compiler members the other day
| about wanting to just ditch almost all alignment restrictions
| because of this exact thing. They laughed and basically told
| me I didn't know what I was talking about. I remember about 15
| years ago, when I worked at a market-making firm, we tested
| this and it was a great gain - we started packing almost all
| our structs after that.
|
| Now, at another MM shop, we're trying to push the same thing
| but having to fight these arguments again (the only alignments
| I want to keep are for AVX and hardware-accessed buffers).
| anyfoo wrote:
| FWIW, it's still better to lay out your critical structures
| carefully, so that padding isn't needed. That way, you win
| _both_ the cache efficiency and the efficiencies for
| aligned accesses.
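|
| A quick Rust illustration of that trade-off (using repr(C),
| since Rust's default layout is already free to reorder fields;
| the sizes assume a typical 64-bit target):
|
|   use std::mem::size_of;
|
|   #[repr(C)]
|   struct Careless { a: u8, b: u64, c: u16 }  // 24 bytes
|
|   #[repr(C)]                                 // reordered by hand
|   struct Ordered { b: u64, c: u16, a: u8 }   // 16 bytes
|
|   #[repr(C, packed)]                         // no padding at all
|   struct Packed { a: u8, b: u64, c: u16 }    // 11 bytes, but
|                                              // fields may be
|                                              // unaligned
|
|   fn main() {
|       println!("{} {} {}",
|                size_of::<Careless>(),
|                size_of::<Ordered>(),
|                size_of::<Packed>());
|   }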
| ncmncm wrote:
| Superstition is as powerful as it ever was.
| cryptonector wrote:
| It's definitely received wisdom that may once have been
| right and no longer is.
|
| Most people are not used to facts having a half-life, but
| many facts do - or rather, much knowledge does.
|
| We feel very secure in knowing what we know, and the
| reality is that we need to be willing to question a lot
| of things, like authority, including our very own. Now,
| we can't be questioning everything all the time because
| that way madness lies, but we can't never question
| anything we think we know either!
|
| Epistemology is hard. I want a doll that says that when
| you pull the cord.
| cogman10 wrote:
| Sort of depends on the knowledge.
|
| It's certainly true that in the tech industry things are
| CONSTANTLY shifting.
|
| However, talk physics and you'll find that things rarely
| change, especially the physics that most college
| graduates learn.
| MBCook wrote:
| Is this superstition, or more received wisdom, which may
| have been true at one point in the past and is now just
| orthodoxy?
| notriddle wrote:
| Fifty bucks says it isn't even about performance, but is
| instead about passing pointers to C code. Zero-overhead
| FFI has killed a lot of radical performance improvements
| that Rust could have otherwise made.
|
| I don't know, because nobody's actually posting a link to
| it.
| ncmncm wrote:
| This strikes me as likely. Bitwise compatibility with
| machine ABI layout rules has powerful compatibility
| advantages even in places where it might make code
| slower. (And, for the large majority of code, slower
| doesn't matter anyway.)
|
| Of course C and C++ themselves have to keep to machine
| ABI layout rules for backward compatibility with code built
| when those rules were (still) thought a good idea.
| Compilers offer annotations to dictate packing for
| specified types, and the Rust compiler certainly also
| offers such a choice. So, maybe such annotations should
| just be used a lot more in Rust, C, and C++.
|
| This is not unlike the need to write "const" everywhere
| in C and C++ because the inherited default (from before
| it existed) was arguably wrong. We just need to get used
| to ignoring the annotation clutter.
|
| But there is no doubt there are lots of people who think
| padding to alignment boundaries is faster. And, there can
| be other reasons to align even more strictly than the
| machine ABI says, even knowing what it costs.
| kevingadd wrote:
| There are other things you need to take into account too -
| padding can make it more likely for a struct to divide
| evenly into cache lines, which can trigger false sharing.
| Changing the size of a struct from 128 bytes to 120 or 122
| bytes will cause it to be misaligned on cache lines and
| reduce the impact of false sharing, which can
| _significantly_ improve performance.
|
| The last time I worked on a btree-based data store,
| changing the nodes from ~1024 bytes to ~1000 delivered
| something like a 10% throughput improvement. This was done
| by reducing the number of entries in each node, and not by
| changing padding or packing.
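|
| For anyone unfamiliar with the term, a hypothetical Rust
| micro-benchmark of false sharing itself: two threads hammering
| counters that share a cache line versus counters padded onto
| separate lines (numbers will vary by machine):
|
|   use std::sync::atomic::{AtomicU64, Ordering};
|   use std::thread;
|   use std::time::{Duration, Instant};
|
|   #[repr(align(64))] // one counter per 64-byte cache line
|   struct Padded(AtomicU64);
|
|   static SHARED: [AtomicU64; 2] =
|       [AtomicU64::new(0), AtomicU64::new(0)];
|   static PADDED: [Padded; 2] =
|       [Padded(AtomicU64::new(0)), Padded(AtomicU64::new(0))];
|
|   fn hammer(a: &'static AtomicU64, b: &'static AtomicU64)
|       -> Duration
|   {
|       let start = Instant::now();
|       let t1 = thread::spawn(move || {
|           for _ in 0..10_000_000 {
|               a.fetch_add(1, Ordering::Relaxed);
|           }
|       });
|       let t2 = thread::spawn(move || {
|           for _ in 0..10_000_000 {
|               b.fetch_add(1, Ordering::Relaxed);
|           }
|       });
|       t1.join().unwrap();
|       t2.join().unwrap();
|       start.elapsed()
|   }
|
|   fn main() {
|       println!("same line:      {:?}",
|                hammer(&SHARED[0], &SHARED[1]));
|       println!("separate lines: {:?}",
|                hammer(&PADDED[0].0, &PADDED[1].0));
|   }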
| ncmncm wrote:
| True. Another reason to avoid too much aligning is to
| help reduce reliance on N-way cache collision avoidance.
|
| Caches on modern chips can keep up to some small fixed
| number of objects, often 4, in cache at once when their
| addresses are at the same offset into a page, but
| performance may collapse if that number is exceeded.
| is quite hard to tune to avoid this, but by making things
| _not_ line up on power-of-two boundaries, we can avoid
| out-and-out inviting it.
| cryptonector wrote:
| TIL. I should have known this... Maybe I'll start packing my
| structs too.
| ceeplusplus wrote:
| TSO has a performance cost: on M1 it is a 10-15% loss [1]
| from enabling TSO on native arm64 code (not emulated).
|
| [1]: https://blog.yiningkarlli.com/2021/07/porting-takua-to-
| arm-p...
| ncmncm wrote:
| Yes, there are sound reasons for it to be optional. It is
| remarkable how little the penalty is, on M1 and on x86.
| Apparently it takes a really huge number of extra
| transistors in the cache system to keep the overhead
| tolerable.
| wolpoli wrote:
| Thanks for summarizing this. Did they do any other clean-up
| when moving to 64 bit?
| ungamedplayer wrote:
| Thank you for writing this. I was going to cover quite a lot of
| these points and you have done it so very succinctly.
|
| It may be obvious, but I think it bears repeating: this blog
| entry should not reflect badly on Raymond C, as he was reporting
| on the architecture as it was at the time.
| zinekeller wrote:
| The 2022 follow-up also said _"And by x86 I mean
| specifically x86-32."_ Also, I don't think he was on the AMD64
| team yet at that time (he was still on Itanium), so that
| probably counts for something.
| anticensor wrote:
| Part 2:
| https://devblogs.microsoft.com/oldnewthing/20220418-00/?p=10...
___________________________________________________________________
(page generated 2022-04-19 23:01 UTC)