[HN Gopher] The x86 architecture is the weirdo (2004)
       ___________________________________________________________________
        
       The x86 architecture is the weirdo (2004)
        
       Author : signa11
       Score  : 82 points
       Date   : 2022-04-19 01:45 UTC (21 hours ago)
        
 (HTM) web link (devblogs.microsoft.com)
 (TXT) w3m dump (devblogs.microsoft.com)
        
       | aap_ wrote:
       | I wouldn't say the x86 is the weirdo in these cases, it's just
       | not a RISC.
        
         | lynguist wrote:
          | In essence, it has been a RISC since the days of the Pentium
          | Pro. Each instruction is decoded into several micro-operations,
          | and the trick to making the CPU run fast lies in the decoder.
          | 
          | This also leads to situations where a longer sequence of
          | simple x86 instructions runs faster than fewer complex ones,
          | because the decoder may not break the complex instructions
          | into their most optimal micro-operation sequence.
        
           | aap_ wrote:
           | That's an implementation detail. The submission was about
           | architecture.
        
           | saagarjha wrote:
           | No, this take is incorrect. x86-64 instructions _do_ get
           | lowered into uops, but all the common ones will only execute
           | one, or maybe two. The real win for uops is letting Intel
           | microcode instructions that nobody cares about, but you
            | really shouldn't be using them anyways.
        
           | phamilton wrote:
           | To what extent is the decoding consistent? It always felt
           | weird to me that we'd do something in hardware on every
           | execution that could be done once in software at compile
           | time. But if the decoder does smart things that would make
           | more sense to me. For example, with SMT the decoder could
           | choose not to use certain ops if the compute units are in use
           | by the other thread.
        
             | buildbot wrote:
             | I am not sure of any arch that does this, but having a
             | second stage decode buys 2 things off the top of my head -
             | better instruction cache usage, esp. for small loops, and
             | the flexibility to decode ops differently on different
             | versions of the same arch from one binary. Otherwise you
             | would have to have tons of different versions for each
             | flavor of x86 out there.
        
               | imtringued wrote:
               | Intel and AMD CPUs have micro op caches for tight loops.
        
             | akira2501 wrote:
             | > on every execution that could be done once in software at
             | compile time
             | 
             | Which is to me, essentially, VLIW. Those architectures
             | didn't work out very well. There's a lot of memory pressure
              | just to move instructions around, and the benefits didn't
              | seem to outweigh the costs.
        
         | marcosdumay wrote:
         | Notice that it's the only one remaining?
         | 
         | Yes, it's a weirdo.
        
           | FullyFunctional wrote:
           | Not defending the weirdo, but IBM Z is a thing that is
            | shipping, _heavily_ used, and still evolving. It's more CISC
            | than even x86.
            | 
            | I do find it sad that we are stuck in an 80-year-old paradigm
            | that is a horrible fit for how CPUs are _actually_
            | implemented in 2022, but the inertia is strong in this one.
        
             | kosherhurricane wrote:
             | Inertia is the same as backward compatibility.
             | 
             | Kinda like gas engines and fuel pumps. Or wall socket
             | formats. Inertia.
        
           | aap_ wrote:
           | You mean as opposed to the other architectures mentioned in
           | the article? ppc, mips, itanium and alpha. Two of which are
           | utterly dead, one of which looks like it's dying (mips) and
           | one that is little more than a niche product (ppc).
        
           | DaiPlusPlus wrote:
           | > Notice that it's the only one remaining?
           | 
           | Motorola 68K is still around too.
        
       | Findecanor wrote:
       | > The x86 has a strict memory model
       | 
        | x86 doesn't really impose _sequential_ _consistency_ between
        | cores/threads. It imposes a _Total_ _Store_ _Order_ (TSO) in
        | which stores always appear in order with respect to each other,
        | but a store can be reordered after a subsequent load.
       | 
       | SPARC had TSO on later chips whereas earlier chips had weaker
       | models. MIPS developed the other way: with older versions having
       | stronger memory ordering and later getting relaxed memory
       | ordering.
       | 
       | RISC-V chips can optionally support TSO but it seems that the
       | motivation is programs ported from x86. IBM's z/Architecture
       | (with lineage back to IBM/360) is still alive and also has TSO.
        | BTW, the Mill is supposed to offer sequential consistency, but it
       | remains to be seen if that will be a performance bottleneck.
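        | 
        | To make the store->load reordering concrete, here is a minimal
        | C11 sketch of the classic "store buffering" litmus test (names
        | are illustrative, not from the article). Under TSO the outcome
        | r1 == 0 && r2 == 0 is allowed, because each store can sit in a
        | store buffer past the other thread's load; under sequential
        | consistency it is forbidden. (Relaxed ordering also lets the
        | compiler reorder, so this only illustrates what the hardware
        | model permits.)
        | 
        |     #include <pthread.h>
        |     #include <stdatomic.h>
        |     #include <stdio.h>
        | 
        |     atomic_int x, y;      /* shared flags, start at 0 */
        |     int r1, r2;           /* per-thread results */
        | 
        |     void *t1(void *arg) {
        |         atomic_store_explicit(&x, 1, memory_order_relaxed);
        |         r1 = atomic_load_explicit(&y, memory_order_relaxed);
        |         return NULL;
        |     }
        | 
        |     void *t2(void *arg) {
        |         atomic_store_explicit(&y, 1, memory_order_relaxed);
        |         r2 = atomic_load_explicit(&x, memory_order_relaxed);
        |         return NULL;
        |     }
        | 
        |     int main(void) {
        |         pthread_t a, b;
        |         pthread_create(&a, NULL, t1, NULL);
        |         pthread_create(&b, NULL, t2, NULL);
        |         pthread_join(a, NULL);
        |         pthread_join(b, NULL);
        |         /* "r1=0 r2=0" is possible under TSO, not under SC. */
        |         printf("r1=%d r2=%d\n", r1, r2);
        |     }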
        
         | ajross wrote:
         | > x86 doesn't really impose sequential consistency between
         | cores/threads. It imposes a Total Store Order (TSO) in which
         | stores are always in order to each other but a store can be
         | reordered after a load.
         | 
         | To be more pedantic (and hoping I remember this correctly): TSO
         | is indistinguishable in software from full sequential
         | consistency. Any code to detect the difference must by
         | definition be subject to race conditions (or must be an atomic
         | read/write operation that on x86 would be serializing anyway).
         | So x86 in fact does provide SC semantics "between
         | cores/threads". It does have visible reordering artifacts from
         | the perspective of hardware designs (e.g. MMIO registers) where
         | a load has side effects.
        
       | fmajid wrote:
        | An article by _the_ Raymond Chen. Here's Joel Spolsky's
       | contemporary take on Chen:
       | 
       | https://www.joelonsoftware.com/2004/06/13/how-microsoft-lost...
        
       | [deleted]
        
       | comex wrote:
       | Let's see how things have changed since 2004 when this was
       | published:
       | 
       | > The x86 has a small number (8) of general-purpose registers
       | 
       | x86-64 added more general-purpose registers.
       | 
       | > The x86 uses the stack to pass function parameters; the others
       | use registers.
       | 
       | OS vendors switched to registers for x86-64.
       | 
       | > The x86 forgives access to unaligned data, silently fixing up
       | the misalignment.
       | 
       | Now ubiquitous on application processors.
       | 
       | > The x86 has variable-sized instructions. The others use fixed-
       | sized instructions.
       | 
       | ARM introduced Thumb-2, with a mix of 2-byte and 4-byte
       | instructions, in 2003. PowerPC and RISC-V also added some form of
       | variable-length instruction support. On the other hand, ARM
       | turned around and dropped variable-length instructions with its
       | 64-bit architecture released in 2011.
       | 
       | > The x86 has a strict memory model ... The others have weak
       | memory models
       | 
       | Still x86-only.
       | 
       | > The x86 supports atomic load-modify-store operations. None of
       | the others do.
       | 
       | As opposed to load-linked/store-conditional, which is a different
       | way to express the same basic idea? Or is he claiming that other
       | processors didn't support any form of atomic instructions, which
       | definitely isn't true?
       | 
       | At any rate, ARM previously had load-linked/store-conditional but
       | recently added a native compare-and-swap instruction with
       | ARMv8.1.
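        | 
        | As a rough illustration (my sketch, not Chen's): compilers
        | typically lower the same C11 compare-and-swap to LOCK CMPXCHG on
        | x86-64, to a CAS/CASAL instruction on ARMv8.1+ (with
        | -march=armv8.1-a), and to an LDXR/STXR (LL/SC) retry loop on
        | older ARM cores.
        | 
        |     #include <stdatomic.h>
        |     #include <stdbool.h>
        | 
        |     /* Try to take a simple spinlock-style flag atomically. */
        |     bool try_take(atomic_int *lock) {
        |         int expected = 0;
        |         return atomic_compare_exchange_strong(lock,
        |                                               &expected, 1);
        |     }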
       | 
       | > The x86 passes function return addresses on the stack. The
       | others use a link register.
       | 
       | Still x86-only.
        
         | NtGuy25 wrote:
          | In regard to memory alignment, it's even worse. Most
          | instructions work on unaligned data, but some require 8-byte,
          | 16-byte, 32-byte, or 64-byte alignment, and I think there's
          | even some 128- and 256-byte alignment. It's one of the more
          | common pitfalls someone can run into when coding x86-64 asm.
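          | 
          | For example, a small C sketch with SSE intrinsics (my
          | illustration, not from the thread): the aligned load form
          | faults on a misaligned address, while the unaligned form
          | accepts any pointer.
          | 
          |     #include <immintrin.h>
          |     #include <stdint.h>
          | 
          |     __m128 load_vec(const float *p) {
          |         if (((uintptr_t)p & 15) == 0)
          |             return _mm_load_ps(p);  /* movaps: needs 16-byte
          |                                        alignment, else #GP */
          |         return _mm_loadu_ps(p);     /* movups: any address */
          |     }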
        
         | phamilton wrote:
         | I vaguely recall that LL/SC solves the ABA problem whereas
         | load-modify-store does not.
         | 
         | It's been a while, so I'm going to define my understanding of
         | the ABA problem in case I misunderstood it:
         | 
          | x86 only supplies cmpxchg instructions, which update a value
          | only if it matches the passed-in previous value. There's a
          | class of concurrency bugs where the value is modified away
          | from its initial value and then modified back to it again.
          | cmpxchg can't detect that condition, so when the difference
          | matters, the 128-bit cmpxchg16b is often used with a counter
          | in the second 64 bits that is incremented on each write to
          | catch this case.
          | 
          | LL/SC's store-conditional fails on any intervening write,
          | rather than comparing the value, providing the stronger
          | guarantee.
         | 
          | (Please correct me if this is inaccurate; it's been a hot
          | minute since I learned this and I'd love to be more current on
          | it.)
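          | 
          | A rough C11 sketch of the counter trick described above (my
          | names; on x86-64 a 16-byte compare-exchange typically maps to
          | cmpxchg16b when built with -mcx16, and may fall back to
          | libatomic otherwise, depending on the compiler). Real code
          | would also need a memory reclamation scheme; this only shows
          | the ABA counter.
          | 
          |     #include <stdatomic.h>
          |     #include <stddef.h>
          |     #include <stdint.h>
          | 
          |     struct node { struct node *next; };
          | 
          |     /* Pointer plus generation counter, updated as one unit,
          |        so an A->B->A change of the pointer is still caught. */
          |     struct head { struct node *top; uint64_t gen; };
          | 
          |     _Atomic struct head stack_head;
          | 
          |     struct node *pop(void) {
          |         struct head old = atomic_load(&stack_head);
          |         struct head upd;
          |         do {
          |             if (old.top == NULL)
          |                 return NULL;
          |             upd.top = old.top->next;
          |             upd.gen = old.gen + 1;  /* bump on every write */
          |         } while (!atomic_compare_exchange_weak(&stack_head,
          |                                                &old, upd));
          |         return old.top;
          |     }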
        
           | zozbot234 wrote:
           | AIUI, a cmpxchg loop is enough to implement read-modify-write
           | of any atomically sized value. The ABA problem becomes
           | relevant when trying to implement more complex lock-free data
           | structures.
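            | 
            | For instance, a small C11 sketch of that pattern (my
            | example): an atomic "multiply", an operation with no
            | dedicated instruction, built from a compare-exchange loop.
            | 
            |     #include <stdatomic.h>
            |     #include <stdint.h>
            | 
            |     /* Retry until no other thread changed *p between our
            |        load and our compare-exchange. */
            |     uint64_t atomic_mul(_Atomic uint64_t *p, uint64_t n) {
            |         uint64_t old = atomic_load(p);
            |         while (!atomic_compare_exchange_weak(p, &old,
            |                                              old * n))
            |             ;  /* 'old' now holds the fresh value; retry */
            |         return old * n;
            |     }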
        
         | ceeplusplus wrote:
          | There is still one big thing that hasn't changed, and it has
          | been the subject of debate over whether x86-64 fundamentally
          | bottlenecks CPU architecture. Variable-length instructions
          | mean decoder complexity scales quadratically rather than
          | linearly with decode width. It's been speculated this is one
          | reason why even the latest x86 architectures stick with
          | relatively narrow decode, while Arm CPUs at lower performance
          | levels (e.g. Cortex X1/2) are already 5-wide and Apple is
          | 8-wide.
        
         | cesarb wrote:
         | > ARM introduced Thumb-2, with a mix of 2-byte and 4-byte
         | instructions, in 2003. PowerPC and RISC-V also [...]
         | 
         | x86 is still the weirdo. Both Thumb-2 and the RISC-V C
         | extension (I don't know about PowerPC) have only 2-byte and
         | 4-byte instructions, aligned to 2 bytes; x86 instructions can
         | vary from 1 to 15 bytes, with no alignment requirement.
        
           | chasil wrote:
           | ARM Thumb actually licensed patents from Hitachi Super-H, who
           | did this first.
           | 
           | Supposedly, "MIPS processors [also] have a MIPS-16 mode."
           | 
           | https://en.m.wikipedia.org/wiki/SuperH
        
           | cryptonector wrote:
           | I suspect variable-length instructions are a big gain because
           | you get to pack instructions more tightly and so have fewer
           | cache misses. Though, obviously, it's going to depend on
           | having an instruction set that yields shorter text for
           | typical assembly than fixed-sized instructions would. (In a
           | way, opcodes need a bit of Huffman encoding!)
           | 
           | Any losses from having to do more decoding work are probably
           | offset by having sufficiently deep pipelines and enough
           | decoders.
        
             | phdelightful wrote:
             | The counterpoint is that variable-length decoding
             | introduces sequential dependence in the decoding, i.e. you
             | don't know where instruction 2 starts until you've decoded
             | instruction 1. This probably limits how many decoders you
             | can have. If you know all your instructions are 4B you can
             | basically decode as many as you want in parallel.
        
           | classichasclass wrote:
           | Power10 has prefixed instructions. These are essentially
            | 64-bit instructions in two pieces. They are odd, even (or
            | perhaps especially) to those of us who have worked with the
            | architecture for a long time, and not much else supports
            | them yet. Their motivation is primarily to represent
            | constants and offsets more efficiently.
        
         | ncmncm wrote:
         | Apple M1 supports optional x86-style memory event ordering, so
         | that its x86 emulation could be made to work without penalty.
         | 
         | When SPARC got new microcode supporting unaligned access, it
         | turned out to be a big performance win, as the alignment
         | padding had made for a bigger cache footprint. That was an
         | embarrassment for the whole RISC industry. Nobody today would
         | field a chip that enforced alignment.
         | 
         | The alignment penalty _might_ have been smaller back when clock
         | rates were closer to memory latency, but caches were radically
         | smaller then, too, so even more affected by inflated footprint.
        
           | jnordwick wrote:
           | > as the alignment padding had made for a bigger cache
           | footprint
           | 
            | I argued with some of the Rust compiler members the other
            | day about wanting to just ditch almost all alignment
            | restrictions because of this exact thing. They laughed and
            | basically told me I didn't know what I was talking about. I
            | remember about 15 years ago, when I worked at a market-making
            | firm, we tested this and it was a great gain - we started
            | packing almost all our structs after that.
            | 
            | Now, at another MM shop, we're trying to push the same thing
            | but having to fight these arguments again (the only
            | alignments I want to keep are for AVX and hardware-accessed
            | buffers).
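            | 
            | A rough C illustration of the trade-off (GCC/Clang syntax,
            | my example; the field names are made up): packing shrinks
            | the struct and the cache footprint, at the cost of
            | potentially unaligned member accesses.
            | 
            |     #include <stdint.h>
            |     #include <stdio.h>
            | 
            |     /* Naturally aligned: sizeof == 16 (3 bytes padding) */
            |     struct padded {
            |         uint64_t price;
            |         uint32_t qty;
            |         uint8_t  side;
            |     };
            | 
            |     /* Same fields, packed: sizeof == 13 */
            |     struct __attribute__((packed)) tight {
            |         uint64_t price;
            |         uint32_t qty;
            |         uint8_t  side;
            |     };
            | 
            |     int main(void) {
            |         printf("%zu vs %zu\n", sizeof(struct padded),
            |                sizeof(struct tight));
            |     }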
        
             | anyfoo wrote:
             | FWIW, it's still better to lay out your critical structures
             | carefully, so that padding isn't needed. That way, you win
             | _both_ the cache efficiency and the efficiencies for
             | aligned accesses.
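              | 
              | A tiny C sketch of what that careful layout can look like
              | (my example): ordering members largest-first removes the
              | interior padding without giving up aligned access.
              | 
              |     #include <stdint.h>
              | 
              |     struct sloppy {    /* 24 bytes: 7 pad after 'flag',
              |                           4 pad at the end */
              |         uint8_t  flag;
              |         uint64_t key;
              |         uint32_t id;
              |     };
              | 
              |     struct careful {   /* 16 bytes, members still
              |                           naturally aligned */
              |         uint64_t key;
              |         uint32_t id;
              |         uint8_t  flag; /* 3 bytes tail padding */
              |     };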
        
             | ncmncm wrote:
             | Superstition is as powerful as it ever was.
        
               | cryptonector wrote:
               | It's definitely received wisdom that may once have been
               | right and no longer is.
               | 
                | Most people are not used to facts having a half-life, but
                | many facts do - or rather, much knowledge does.
               | 
               | We feel very secure in knowing what we know, and the
               | reality is that we need to be willing to question a lot
               | of things, like authority, including our very own. Now,
               | we can't be questioning everything all the time because
               | that way madness lies, but we can't never question
               | anything we think we know either!
               | 
               | Epistemology is hard. I want a doll that says that when
               | you pull the cord.
        
               | cogman10 wrote:
               | Sort of depends on the knowledge.
               | 
               | It's certainly true that in the tech industry things are
               | CONSTANTLY shifting.
               | 
               | However, talk physics and you'll find that things rarely
               | change, especially the physics that most college
               | graduates learn.
        
               | MBCook wrote:
               | Is this superstition or more received wisdom, which may
                | have been true at one point in the past and is now just
               | orthodoxy?
        
               | notriddle wrote:
               | Fifty bucks says it isn't even about performance, but is
               | instead about passing pointers to C code. Zero-overhead
               | FFI has killed a lot of radical performance improvements
               | that Rust could have otherwise made.
               | 
               | I don't know, because nobody's actually posting a link to
               | it.
        
               | ncmncm wrote:
               | This strikes me as likely. Bitwise compatibility with
               | machine ABI layout rules has powerful compatibility
               | advantages even in places where it might make code
               | slower. (And, for the large majority of code, slower
               | doesn't matter anyway.)
               | 
               | Of course C and C++ themselves have to keep to machine
               | ABI layout rules for backward compatibility to code built
               | when those rules were (still) thought a good idea.
               | Compilers offer annotations to dictate packing for
               | specified types, and the Rust compiler certainly also
               | offers such a choice. So, maybe such annotations should
               | just be used a lot more in Rust, C, and C++.
               | 
               | This is not unlike the need to write "const" everywhere
               | in C and C++ because the inherited default (from before
               | it existed) was arguably wrong. We just need to get used
               | to ignoring the annotation clutter.
               | 
               | But there is no doubt there are lots of people who think
               | padding to alignment boundaries is faster. And, there can
               | be other reasons to align even more strictly than the
               | machine ABI says, even knowing what it costs.
        
             | kevingadd wrote:
             | There are other things you need to take into account too -
             | padding can make it more likely for a struct to divide
             | evenly into cache lines, which can trigger false sharing.
             | Changing the size of a struct from 128 bytes to 120 or 122
              | bytes will cause it to be misaligned on cache lines, which
              | reduces the impact of false sharing and can
              | _significantly_ improve performance.
             | 
             | The last time I worked on a btree-based data store,
             | changing the nodes from ~1024 bytes to ~1000 delivered
             | something like a 10% throughput improvement. This was done
             | by reducing the number of entries in each node, and not by
             | changing padding or packing.
        
               | ncmncm wrote:
               | True. Another reason to avoid too much aligning is to
               | help reduce reliance on N-way cache collision avoidance.
               | 
                | Caches on modern chips can keep up to some small fixed
                | number, often 4, of objects in cache whose addresses are
                | at the same offset into a page, but performance may
                | collapse if that number is exceeded. It
               | is quite hard to tune to avoid this, but by making things
               | _not_ line up on power-of-two boundaries, we can avoid
               | out-and-out inviting it.
        
           | cryptonector wrote:
           | TIL. I should have known this... Maybe I'll start packing my
           | structs too.
        
           | ceeplusplus wrote:
            | TSO has a performance cost; on M1 there is a 10-15% loss [1]
            | from enabling TSO on native arm64 code (not emulated).
           | 
           | [1]: https://blog.yiningkarlli.com/2021/07/porting-takua-to-
           | arm-p...
        
             | ncmncm wrote:
             | Yes, there are sound reasons for it to be optional. It is
             | remarkable how little the penalty is, on M1 and on x86.
             | Apparently it takes a really huge number of extra
             | transistors in the cache system to keep the overhead
             | tolerable.
        
         | wolpoli wrote:
         | Thanks for summarizing this. Did they do any other clean-up
         | when moving to 64 bit?
        
         | ungamedplayer wrote:
         | Thank you for writing this. I was going to cover quite a lot of
         | these points and you have done it so very succinctly.
         | 
         | It may be obvious, but I think it bears repeating. This blog
         | entry should not reflect badly on Raymond C as he was reporting
          | on the architecture at that time.
        
           | zinekeller wrote:
            | The 2022 follow-up also said _"And by x86 I mean
            | specifically x86-32."_ Also, I don't think he was on the
            | AMD64 team yet at that time (still Itanium), so that
            | probably explains something.
        
       | anticensor wrote:
       | Part 2:
       | https://devblogs.microsoft.com/oldnewthing/20220418-00/?p=10...
        
       ___________________________________________________________________
       (page generated 2022-04-19 23:01 UTC)