[HN Gopher] How many x86 instructions are there? (2016)
___________________________________________________________________
How many x86 instructions are there? (2016)
Author : sandinmyjoints
Score : 111 points
Date : 2021-04-21 13:03 UTC (9 hours ago)
(HTM) web link (fgiesen.wordpress.com)
(TXT) w3m dump (fgiesen.wordpress.com)
| hnmullany wrote:
| Now ask how many unique x86 instructions standard compilers
| actually emit...
| FpUser wrote:
| I feel a sudden urge to write some assembly for fun. I haven't
| done it for at least a couple of years, I think.
| carbonguy wrote:
| The assembly bug bit me again a few months back, but instead of
| writing it I had a grand time hand-disassembling some ARMv4T
| binaries - scratched something of the same itch as solving a
| sudoku.
| neurostimulant wrote:
| Then you might like TIS-100 [1] and Shenzhen I/O [2].
|
| [1]
| https://store.steampowered.com/app/370360/TIS100/?curator_cl...
|
| [2]
| https://store.steampowered.com/app/504210/SHENZHEN_IO/?curat...
| deelowe wrote:
| I've been getting a lot of enjoyment out of 6502/Z80 projects
| like RC2014 and Ben Eater's breadboard 6502 kit/videos.
| There's also nand2tetris, if you prefer a more academic
| approach.
| snvzz wrote:
| You'll feel less miserable if you target a modern RISC
| architecture such as RISC-V or a cleaner CISC architecture such
| as the venerable 68000.
|
| Whenever bored, I read or watch chibiakumas tutorials. There's
| no such thing as knowing too many assemblers.
| magicalhippo wrote:
| Then when you check, the results are half the speed of what
| the compiler spits out and the fun is gone. At least that's
| what happens to me...
| kingsuper20 wrote:
| I've always been able to beat the compiler, and that's
| usually after trying to optimize using C. Admittedly, it's a
| whole lot harder to understand what's fast than it used to
| be. Access to SSE has its own benefits.
|
| It's been a problem (optimizing) for some time though. I
| remember it being some work to beat the compiler on the
| i960CA. OTOH, I seem to remember the i860 being not-so-great
| and for sure the TI C80 C compiler was downright awful (per
| usual for DSPs).
| magicalhippo wrote:
| Back in the Pentium 1 and earlier days I could beat the
| compiler. But then it got hard.
|
| And it changes so often, instructions that are fast on one
| CPU are not so fast on the next one, and vice versa.
|
| Not to mention branch prediction and out-of-order execution
| make it very difficult to meaningfully benchmark. Is my code
| really faster, or does it just seem like it because some
| address got better aligned or similar?
|
| I've gotten significant speed gains in certain projects by
| simply replacing certain hand-optimized assembly in
| libraries (ie not my code) with the plain C code
| equivalent. The assembly was probably faster 10-15 years
| ago, but not anymore...
| mhh__ wrote:
| I've started injecting code fragments into spectre
| exploits as a way of seeing whether the CPU likes them or
| not.
| kingsuper20 wrote:
| >I've gotten significant speed gains in certain projects
| by simply replacing certain hand-optimized assembly in
| libraries (ie not my code) with the plain C code
| equivalent.
|
| That's an interesting point, plus there's the portability
| issue.
|
| My own approach to leaving breadcrumbs in legacy code for
| this kind of inner-loop stuff has been to write a
| straightforward 'C' implementation (and time it), an
| optimized 'C' version (which itself can depend on the
| processor used), and a hand-tuned assembly version where
| really needed.
|
| It allows you to back out of the tricky stuff, plus it acts
| as a form of documentation.
| zsmi wrote:
| One should never lose to the compiler; after all, you can
| see its output and it can't see yours.
|
| Also, the programmer can "cheat" by doing things the
| compiler would consider invalid but are known to be ok
| given the larger context of the application.
|
| The restrict keyword in C gives an example of how one can
| optimize code by knowing the larger context:
| https://cellperformance.beyond3d.com/articles/2006/05/demyst...
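|
| A minimal sketch of the kind of thing this enables (hypothetical
| scale() function, not from that article): promising the compiler
| that dst and src never alias lets it vectorize the loop without
| emitting a runtime overlap check.
|
|     #include <stddef.h>
|
|     /* With restrict, the compiler may assume dst and src never
|        overlap, so it can vectorize freely; drop the qualifiers
|        and it has to guard against aliasing. */
|     void scale(float *restrict dst, const float *restrict src,
|                float k, size_t n)
|     {
|         for (size_t i = 0; i < n; i++)
|             dst[i] = src[i] * k;
|     }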
|
| The problem is the ROI is usually pretty bad: in my
| experience these assumptions rarely hold as the code
| evolves, and the optimization usually only lasts for a
| finite (sometimes shockingly short) amount of time, i.e. OS
| changes, hardware changes, memory changes, etc. etc. etc.
| f00zz wrote:
| That's why it's more fun to try to optimize for size
| FpUser wrote:
| For this case all fun is in the process ;)
| im3w1l wrote:
| Last time I wrote assembly, and it was a long while ago, it
| was way faster. But let's be honest, 95% of that was from
| doing manual buffering on top of OS APIs rather than using
| the C stdlib. And the other 5% came from skipping itoa calls
| by doing arithmetic directly on the string representation.
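|
| Something along these lines (a minimal sketch of the idea,
| shown in C rather than asm, with a hypothetical inc_decimal):
|
|     #include <stddef.h>
|
|     /* Bump a fixed-width decimal counter stored as text, e.g.
|        "0041" -> "0042", instead of calling itoa for every
|        output line. */
|     static void inc_decimal(char *buf, size_t len)
|     {
|         for (size_t i = len; i-- > 0; ) {
|             if (buf[i] != '9') { buf[i]++; return; }
|             buf[i] = '0';  /* carry into the next digit */
|         }
|     }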
|
| I think this is why assembly can often be faster. Not
| because I'm better than a compiler, but because the
| structure of the language nudges you into faster approaches.
| magicalhippo wrote:
| Good point, though I usually found that if I then went back
| and restructured my high-level code, the compiler beat my
| ASM again.
| TwoBit wrote:
| ARM has all these variations which make it seem as complicated as
| x86, but they are distinct variations, and future CPUs can, for
| example, drop 16-bit Thumb fairly cleanly.
| klelatti wrote:
| I think that one of the things that distinguishes Arm from
| Intel (for good or bad) is that Arm _has_ left behind a lot of
| the legacy. AArch64 has no Thumb, Jazelle, etc.
| api wrote:
| It's _way_ easier to determine instruction length on ARM. It's
| usually fixed. That eliminates a lot of brute-force thrashing
| that x86 decoders have to do. It doesn't impact transistor
| count all that much on a huge modern CPU but it saves a decent
| amount of power. It's one of the things that factors into why
| ARM is so power efficient.
|
| ARM has also been willing to drop older optional legacy stuff
| like the Java-oriented instructions that almost nobody used and
| Thumb. x86 supports nearly all legacy opcodes, even things like
| MMX and other obsolete vector operations that modern programs
| never use.
| thechao wrote:
| It's not that variable length is expensive, it's that
| variable length _the way Intel does it_ is expensive. For
| instance -- not that this is a good idea -- you could burn
| the top 2b to mark instructions as 2/4/6/8 bytes (or
| whatever) in length. Then you can have your
| variable-width-cake-and-eat-your-fast-decode-too.
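|
| Roughly this (a toy sketch with a hypothetical insn_length, not
| any real ISA): if byte 0 names the length, the decoder never has
| to look further to find the next instruction boundary.
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     /* Hypothetical encoding: the top two bits of byte 0 select
|        the instruction length, so finding boundaries is a table
|        lookup per instruction. */
|     static size_t insn_length(uint8_t first_byte)
|     {
|         static const size_t len[4] = { 2, 4, 6, 8 };
|         return len[first_byte >> 6];
|     }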
| richardwhiuk wrote:
| Variable length is always going to restrict the parallelism
| of your instruction decode (for a given chip area/power/etc.
| cost).
| thechao wrote:
| Only because a single instruction would take the "space"
| of multiple instructions in the I$-fetch-to-decode. The
| idea with variable-length encodings is that, for example,
| an 8B encoding does _more than twice the work_ of a 4B
| encoding, so you lose out on the instruction slot, but
| win on the work done.
|
| I mean ... that's the theory.
| bogomipz wrote:
| Could you elaborate - what is it about the Intel design that
| makes the decode so inefficient? Is "2b" bits here?
|
| Are there examples of ISAs or chips that handle variable-
| length instruction encoding efficiently?
| pkaye wrote:
| Due to how the instruction set evolved, for the Intel x86
| architecture you have to look at a lot of bits of the
| instruction stream to determine where the next instruction
| starts. To execute multiple instructions per clock cycle you
| also have to decode multiple instructions per clock cycle.
|
| I think this old Intel patent talks about one of their
| decoder implementations:
|
| https://patents.google.com/patent/US5758116A/en
| Symmetry wrote:
| Intel x86 isn't self-synchronizing. In theory you have to
| decode every byte of the instruction stream that came
| before to be sure where the boundaries of an instruction
| are. Normally it stabilizes after a while in real world
| instruction streams but you can craft malicious
| instruction streams which yield two different valid sets
| of instructions depending on whether you start reading
| them at an offset or not.
|
| Contrast that to something like utf8 where that isn't
| possible.
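|
| For instance (hand-picked illustrative bytes, not from any real
| program):
|
|     #include <stdint.h>
|
|     /* The same bytes decode as two different, equally valid
|      * instruction streams depending on where decoding starts:
|      *
|      *   offset 0:  b8 05 00 00 00   mov eax, 0x5
|      *              90               nop
|      *
|      *   offset 1:  05 00 00 00 90   add eax, 0x90000000
|      */
|     static const uint8_t stream[] =
|         { 0xb8, 0x05, 0x00, 0x00, 0x00, 0x90 };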
| saagarjha wrote:
| Note that random x86 code is usually "eventually self-
| synchronizing", which id useful if you're trying to
| disassemble a blob at a random offset.
| atq2119 wrote:
| > you could burn the top 2b to mark instructions as 2/4/6/8
| bytes (or whatever) in length.
|
| FWIW, this is exactly what RISC-V does.
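|
| (In the low bits rather than the top, but same idea.) A sketch
| of the length rule from the spec, with a hypothetical
| rv_insn_length -- only the 16/32-bit split is in real use today,
| the longer formats are reserved:
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     /* Length of a RISC-V instruction, in bytes, from its first
|        16-bit parcel. */
|     static size_t rv_insn_length(uint16_t parcel)
|     {
|         if ((parcel & 0x03) != 0x03) return 2; /* compressed (C) */
|         if ((parcel & 0x1c) != 0x1c) return 4; /* standard 32-bit */
|         if ((parcel & 0x20) == 0)    return 6; /* reserved 48-bit */
|         if ((parcel & 0x40) == 0)    return 8; /* reserved 64-bit */
|         return 0;                              /* >= 80-bit: reserved */
|     }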
| akira2501 wrote:
| > you could burn the top 2b to mark
|
| Which seems reasonable... but you just burned 3/4 of the
| single opcode instruction space, which may not be worth it
| for most general purpose loads.
| bogomipz wrote:
| Would you mind elaborating on the math of how the "top
| 2b" ends up burning 3/4 of the single opcode instruction
| space?
| gbrown_ wrote:
| How about undocumented instructions? Sandsifter[1] is an
| interesting project and the video from BlackHat[2] is a good
| watch. There's also a previous discussion of it on HN[3].
|
| [1] https://github.com/Battelle/sandsifter
|
| [2] https://www.youtube.com/watch?v=KrksBdWcZgQ
|
| [3] https://news.ycombinator.com/item?id=18179212
| da_big_ghey wrote:
| Yes, and sandsifter continues to find more; recently,
| undocumented microcode-modifying instructions were found with it.
| sandinmyjoints wrote:
| Very cool project! Thanks for the pointer.
| snvzz wrote:
| The only real answer is: Too Many.
|
| This complexity is pushed down to operating systems, compilers,
| assemblers, debuggers. It ends up causing brutal human time
| overhead throughout the chain, and its cost effectively prevents
| security and high assurance.
|
| This more than justifies moving away from x86 into RISC
| architectures, such as the rising open and royalty-free RISC-V.
| saagarjha wrote:
| There is complexity, but it's not nearly as large as you claim
| it is, nor does RISC magically resolve the issues you've
| brought up.
| WalterBright wrote:
| Back in the 70s, a friend of mine said he loved the Z80
| instruction set the best. I replied that it was the most
| complicated and kludgy one. He laughed and said the complexity
| allowed his code to shine above others and he got paid more.
|
| Of course, these days the Z80 instruction set looks trivial :-)
| deepspace wrote:
| > This complexity is pushed down to operating systems,
| compilers, assemblers, debuggers.
|
| Is that really true though? In my experience, the amount of
| complexity in a layered system is roughly constant, and if you
| make one layer simpler, you have to make another more complex.
|
| I would expect that a compiler for a RISC architecture, for
| example, needs to do a lot of work figuring out which set of
| 'simple' instructions encodes an intent most efficiently. It
| also needs to deal with instruction scheduling and other issues
| that a CISC compiler does not care about.
| mobilio wrote:
| In short - an endless battle between RISC and CISC.
| Symmetry wrote:
| These days RISC is mostly about a regular encoding of
| instructions that doesn't present challenges to doing precise
| interrupts on a deeply pipelined machine, rather than just
| having few instructions _per se_.
| bogomipz wrote:
| What do you mean by "precise" interrupts here? Do some types
| of interrupts cause worse pipeline stalls than other types?
| Are the interrupt handlers in RISC faster because of more
| efficient decoding? Is that the issue?
| zsmi wrote:
| This is the classic definition
|
| https://dl.acm.org/doi/10.1109/12.4607
|
| "An interrupt is precise if the saved process state
| corresponds to a sequential model of program execution in
| which one instruction completes before the next begins. In
| a pipelined processor, precise interrupts are difficult to
| implement because an instruction may be initiated before
| its predecessors have completed."
| Symmetry wrote:
| No, it's just that RISC not having any load-op-store
| instructions made it easier to pipeline back in the day while
| still being able to handle interrupts as if instructions
| executed one at a time: there was one instruction that had
| executed while the following one hadn't executed at all.
| That's possible to do with CISC architectures too, of course;
| we do it all the time these days. But back in the day, being
| able to implement that easily-ish in a reasonable number of
| transistors was a big factor in early RISC processors having
| so much better performance than the CISCs of the time.
|
| In the modern day it mostly just means that you can
| implement fancy out-of-order shenanigans with a bit fewer
| engineer-years than non-RISC ISAs.
| bogomipz wrote:
| Thanks, these are both very helpful and good bits of
| historical perspective as well. Cheers.
| kingsuper20 wrote:
| Oh well, there's RISC and there's RISC.
|
| I always think of the AMD 29000.
| mobilio wrote:
| Or i860 & i960?
| johnklos wrote:
| More than 1,500! Holy cow!
|
| While having instructions for everything that are slow in early
| models but can be significantly improved in silicon over time is
| one way to look at CISC, I genuinely wonder how much silicon is
| spent on instructions that are so rarely used they'd be better in
| software.
|
| Or to ask another way: how many instructions are in billions of
| x86 cores that rarely if ever get used? Hmmm...
| dreamcompiler wrote:
| I'd also be curious to discover how many distinct x86
| instructions gcc can even emit? I expect the answer is "a lot
| less than all of them."
| saagarjha wrote:
| Probably a couple hundred at most, and for common programs
| several dozen.
| Someone wrote:
| If you cheat: all of them. GCC supports in-line assembly.
|
| Of course, that's not an interesting observation.
|
| It gets (somewhat) interesting when you realize that one may
| inadvertently include in-line assembly. Do you call including
| system headers that happen to contain inline assembly
| cheating? Including headers for a standard C library?
| Compiling with old-fashioned compiler flags (for example to
| generate FPU instructions for the x87)?
| kingsuper20 wrote:
| I mostly just think of the test set they must make use of at
| Intel. It makes my head hurt. Maybe you end up with a wad of
| legacy code so large that no one knows how it really works.
| That ends up being the real definition of the part.
| thechao wrote:
| It's really more like 6000, if you do the accounting right.
| And, as someone who just ate a 2x (customer-facing) slowdown
| porting a piece of x86 assembly to ARM, because that "one nifty
| instruction" was missing, I'm going to say: we should keep all
| of 'em.
| mhh__ wrote:
| Which one?
| thechao wrote:
| _mm256_movemask_epi8, i.e., the "fast lexer" instruction.
| That instruction takes ~3c (depending on uarch), and the
| ARM equivalent (7-8 instructions) takes ~5-6c (depending on
| uarch). It's just _annoying_.
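|
| The typical lexer-ish use, roughly (a sketch assuming AVX2, a
| hypothetical find_quote, and a length that's a multiple of 32):
|
|     #include <immintrin.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     /* Compare 32 bytes at once, turn the result into one bit
|        per byte with movemask, and let ctz (gcc/clang builtin)
|        find the first hit. */
|     static long find_quote(const uint8_t *p, size_t len)
|     {
|         const __m256i quote = _mm256_set1_epi8('"');
|         for (size_t i = 0; i < len; i += 32) {
|             __m256i chunk =
|                 _mm256_loadu_si256((const __m256i *)(p + i));
|             uint32_t mask = (uint32_t)_mm256_movemask_epi8(
|                 _mm256_cmpeq_epi8(chunk, quote));
|             if (mask)
|                 return (long)(i + __builtin_ctz(mask));
|         }
|         return -1;
|     }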
| ConcernedCoder wrote:
| This instruction does quite a bit of leg work, and it
| becomes obvious why a RISC architecture would need 7-8
| instructions to do the same, see:
| https://software.intel.com/sites/landingpage/IntrinsicsGuide...
| irdc wrote:
| This looks like a task for gorc:
| https://five-embeddev.com/riscv-bitmanip/draft/bext.html#gen...
|
| (granted, that's only 32/64 bits, but still...)
| CodeArtisan wrote:
| AMD Zen dropped a few instruction sets from the previous
| architecture.
|
| FMA4
| https://en.wikipedia.org/wiki/FMA_instruction_set#FMA4_instr...
|
| XOP https://en.wikipedia.org/wiki/XOP_instruction_set
|
| TBM
| https://en.wikipedia.org/wiki/Bit_manipulation_instruction_s...
| davidkuhta wrote:
| > To not leave you hanging: Intel has an official x86
| encoder/decoder library called XED. According to Intel's XED, as
| of this writing, there are 1503 defined x86 instructions
| ("iclasses" in XED lingo), from AAA to XTEST (this includes AMD-
| specific extensions too, by the way). Straightforward, right?
|
| Hopefully this will have either saved you a click or validated
| your time in reading the article.
| siltpotato wrote:
| Well, it's the first line. Probably doesn't count as click
| bait.
| anyfoo wrote:
| Curious, does anyone actually care about the actual number
| primarily? I thought pretty much everyone who clicks on an
| article with that title would do so because they are interested
| in the insights gathered when getting to that number.
| wglb wrote:
| If you are writing a disassembler or binary program decoder,
| such a number will help you be sure that you enumerate all
| the instructions.
| ketralnis wrote:
| I doubt most people reading the article coming from HN are
| writing disassemblers, and all such people would have to
| read it anyway because the number itself isn't sufficient
| to validate that you've enumerated all of them (because as
| my sibling points out, it's more complicated than that).
| The specific number is the least interesting part.
| account4mypc wrote:
| I think if you were writing such a program, this article
| would show that using such a number is a much more
| complicated idea than it sounds
| Out_of_Characte wrote:
| Enumerating all instructions would be useful to check
| whether you can decode all legal instructions.
| saagarjha wrote:
| The issue is that the list of all legal instructions is
| hard to define.
| carbonguy wrote:
| For me the article was well worth it; where else but in ISA
| discussions can you find gems like the following?
|
| > _Does a non-instruction that is non-defined and unofficially
| guaranteed to non-execute exactly as if it had never been in
| the instruction set to begin with count as an x86 instruction?
| For that matter, does UD2 itself, the defined undefined
| instruction, count as an instruction?_
| sandinmyjoints wrote:
| This was why I posted it -- I learned a lot more than the
| answer to the title.
| xony wrote:
| 6000? The cache can't effectively handle the burden of the
| decoder propagation delays. Compiler writers might not be using
| even 20% of the total instructions. Except for backward
| compatibility, it's a useless architecture.
| optimalsolver wrote:
| Alternate instructions here :-)
|
| https://paws.kettering.edu/~jhuggins/humor/opcodes.html
___________________________________________________________________
(page generated 2021-04-21 23:01 UTC)