[HN Gopher] Things I learned while writing an x86 emulator (2013)
___________________________________________________________________
Things I learned while writing an x86 emulator (2013)
Author : fanf2
Score : 330 points
Date : 2024-07-10 14:42 UTC (1 day ago)
(HTM) web link (www.timdbg.com)
(TXT) w3m dump (www.timdbg.com)
| pm2222 wrote:
| Prior discussion here
| https://news.ycombinator.com/item?id=34636699
|
| Cannot believe it's been 16 months. How time flies.
| AstroJetson wrote:
| Check out Justine Tunney and her emulator.
| https://justine.lol/blinkenlights/
|
| The docs are an amazing tour of how the cpu works.
| zarathustreal wrote:
| Astonishing.. they never cease to amaze me
| trallnag wrote:
| That name, Tunney. Remember it from around 2014, being
| homeless, bumming around, and shit posting on Twitter about
| Occupy lol
| sdsd wrote:
| What a cool person. I really enjoy writing assembly, it feels so
| simple and I really enjoy the vertical aesthetic quality.
|
| The closest I've ever come to something like OP (which is to say,
| not close at all) was when I was trying to help my JS friend
| understand the stack, and we ended up writing a mini vm with its
| own little ISA:
| https://gist.github.com/darighost/2d880fe27510e0c90f75680bfe...
|
| This could have gone much deeper - i'd have enjoyed that, but
| doing so would have detracted from the original educational goal
| lol. I should contact that friend and see if he still wants to
| study with me. it's hard since he's making so much money doing
| fancy web dev, he has no time to go deep into stuff. whereas my
| unemployed ass is basically an infinite ocean of time and energy.
| actionfromafar wrote:
| You should leverage that into your friend teaching you JS,
| maybe.
| djaouen wrote:
| It's like my friend 0x86 always said: "Stay away from
| JavaScript. But stay away from TypeScript harder."
| dmitrygr wrote:
| I've written fast emulators for a dozen non-toy architectures and
| a few JIT translators for a few as well. x86 still gives me PTSD.
| I have never seen a messier architecture. There is history, and a
| reason for it, but still ... damn
| Arech wrote:
| Haha, man, I feel you :DD You probably should have started with
| it from the very beginning :D
| jcranmer wrote:
| > I have never seen a messier architecture.
|
| Itanium. Pretty much every time I open up the manual, I find a
| new thing that makes me go "what the hell were you guys
| thinking!?" without even trying to.
| snazz wrote:
| What sorts of projects are you working on that use Itanium?
| jcranmer wrote:
| None, really. I just happened to get a copy of the manual
| and start idly reading it when my computer got stuck in a
| very long update-reboot cycle and I couldn't do anything
| other than read a physical book.
| trealira wrote:
| Studying the x86 architecture is kind of like studying
| languages with lots of irregularities and vestigial bits, and
| with competing grammatical paradigms, e.g. French. Other
| architectures, like RISC-V and ARMv8, are much more consistent.
| x0x0 wrote:
| I think English may be a better example; we just stapled
| chunks of vulgar latin to an inconsistently simplified proto-
| germanic and then borrowed words from every language we met
| along the way. Add in 44 sounds serialized to the page with
| 26 letters and tada!
| WalterBright wrote:
| The Norman conquest of England means English is a language
| with barbarian syntax and French nouns. It's a happy mess
| of cultural appropriation!
|
| Squeezing the lot into 26 characters was simply genius -
| enabling printing with movable type, Morse code, Baudot
| code, ASCII, etc.
|
| Of course, then icons came along and ruined everything.
| aengelke wrote:
| > Other architectures, like [...] ARMv8, are much more
| consistent.
|
| From an instruction/operation perspective, AArch64 is more
| clean. However, from an instruction operand and encoding
| perspective, AArch64 is a lot less consistent than x86.
| Consider the different operand types: on x86, there are a
| dozen register types, immediate (8/16/32/64 bits), and memory
| operands (always the same layout). On AArch64, there's: GP
| regs, incremented GP reg (MOPS extension), extended GP reg
| (e.g., SXTB), shifted GP reg, stack pointer, FP reg, vector
| register, vector register element, vector register table,
| vector register table element, a dozen types of memory
| operands, conditions, and a dozen types of immediate
| encodings (including the fascinating and very useful, but
| also very non-trivial encoding of logical immediates [1]).
|
| AArch64 also has some register constraints: some vector
| operations can only encode register 0-15 or 0-7; not to
| mention SVE with its "movprfx" prefix instruction that is
| only valid in front of a few selected instructions.
|
| [1]: https://github.com/aengelke/disarm/blob/master/encode.c#
| L19-...
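|
| (A minimal C sketch of that logical-immediate decoding, loosely
| following the DecodeBitMasks pseudocode in the Arm ARM; the
| function name is illustrative, and reserved encodings are
| rejected rather than handled:)
|
|     #include <stdbool.h>
|     #include <stdint.h>
|
|     // Decode an AArch64 logical immediate (N:immr:imms) into
|     // the 64-bit constant it denotes.
|     static bool decode_logical_imm(unsigned n, unsigned immr,
|                                    unsigned imms, uint64_t *out)
|     {
|         // Element size is derived from N:NOT(imms): the
|         // highest set bit selects 2, 4, ..., or 64.
|         unsigned combined = (n << 6) | (~imms & 0x3fu);
|         if (combined == 0)
|             return false;                      // reserved
|         unsigned len = 31 - (unsigned)__builtin_clz(combined);
|         unsigned esize = 1u << len;
|         unsigned s = imms & (esize - 1);
|         unsigned r = immr & (esize - 1);
|         if (s == esize - 1)
|             return false;                      // reserved
|
|         // Run of s+1 ones, rotated right by r within the element.
|         uint64_t elem = (1ull << (s + 1)) - 1;
|         if (r)
|             elem = (elem >> r) | (elem << (esize - r));
|         if (esize < 64)
|             elem &= (1ull << esize) - 1;
|
|         // Replicate the element to fill 64 bits.
|         uint64_t result = 0;
|         for (unsigned i = 0; i < 64; i += esize)
|             result |= elem << i;
|         *out = result;
|         return true;
|     }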
| trealira wrote:
| Admittedly, I never wrote an assembler, but the encoding of
| x86-64 seems pretty convoluted [0] [1], with much of the
| information smeared across the bits of the ModR/M and
| SIB bytes and then extended by prefix bytes. It's more
| complicated than you would assume from having only written
| assembly in text and then having the assembler encode the
| instructions.
|
| One of the things that sticks out to me is that on x86-64,
| operations on 32-bit registers implicitly zero out the
| upper 32-bits of the corresponding 64-bit register (e.g.
| "MOV EAX, EBX" also zeroes out the upper 32-bits of RAX),
| except for the opcode 0x90, which, logically, would be the
| accumulator encoding of "XCHG EAX, EAX" but is special-
| cased to do nothing because it's the traditional NOP
| instruction for x86. So you have to use the other encoding,
| 87 C0.
|
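| (A quick C sketch of that zero-extension rule as an emulator
| might implement it; the helper name is made up:)
|
|     #include <stdint.h>
|
|     // Any write to a 32-bit GPR in 64-bit mode zero-extends
|     // into bits 63:32 of the full register.
|     static void write_gpr32(uint64_t *reg, uint32_t value)
|     {
|         *reg = value;   // implicit zero-extension
|     }
|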
| The fact that AArch64 has instructions that are almost
| always 4 bytes, and the fact that they clearly thought out
| their instruction set more (e.g. instead of having CMP,
| they just use SUBS (subtract and store condition codes) and
| store the result in the zero register, which is always
| zero), is what made me say that it seemed more regular. I
| hadn't studied it in as close detail as x86-64, though.
| But, you've written an ARM disassembler and I haven't;
| you seem to know more about it than me, and so I believe
| what you're saying.
|
| > AArch64 also has some register constraints
|
| x86-64 also has some, in that many instructions can't
| encode the upper 8 bits of a 16-bit register (AH, BH, DH,
| CH) if a REX prefix is used.
|
| [0]: https://wiki.osdev.org/X86-64_Instruction_Encoding
|
| [1]: http://www.c-jump.com/CIS77/CPU/x86/lecture.html
| WalterBright wrote:
| > you've written an ARM disassembler
|
| Here's my AArch64 disassembler work in progress:
|
| https://github.com/dlang/dmd/blob/master/compiler/src/dmd
| /ba...
|
| I add to it in tandem with writing the code generator. It
| helps flush out bugs in both: generate the instruction,
| then disassemble it and compare with what I thought it
| should be.
|
| It's quite a bit more complicated than the corresponding
| x86 disassembler:
|
| https://github.com/dlang/dmd/blob/master/compiler/src/dmd
| /ba...
| bonzini wrote:
| x86 has a lot of special cases and it's becoming harder
| and harder to find non-code material with all
| instructions in a nice tabular format. But overall the
| encoding isn't that complicated. What's complex in the
| architecture is the amount of legacy stuff that still
| lives in privileged code.
|
| > x86-64 also has some, in that many instructions can't
| encode the upper 8 bits of a 16-bit register (AH, BH, DH,
| CH) if a REX prefix is used.
|
| You can't use both the high registers and the extended
| low registers in the same instruction, indeed. But in
| practice nobody is using the high registers anymore so
| it's a relatively rare limitation.
| jcranmer wrote:
| One of the projects I work on every now and then is the
| "World's Worst X86 Decoder", where I'm trying to
| essentially automatically uncover x86 assembly semantics
| without ever having to build a list of x86 instructions,
| and this has forced me to look at the x86 ISA in a
| different way. The basic format I build for an opcode is
| this:
|
|     enum Group1Prefix { Lock = 0xf0, Repnz = 0xf2, Repz = 0xf3 }
|     enum Group2Prefix { Cs = 0x2e, Ds = 0x3e, Es = 0x26,
|                         Fs = 0x64, Gs = 0x65, Ss = 0x36 }
|     enum Group3Prefix { OpSize = 0x66 }
|     enum Group4Prefix { AddrSize = 0x67 }
|     enum ModernPrefix {
|         Rex { w: bool },
|         Vex { w: bool, l: bool },
|     }
|     struct Opcode {
|         pub group1: Option<Group1Prefix>,
|         pub group2: Option<Group2Prefix>,
|         pub group3: Option<Group3Prefix>,
|         pub group4: Option<Group4Prefix>,
|         pub modern_prefix: Option<ModernPrefix>,
|         pub opcode_map: u8,
|         pub opcode: u8,
|     }
|
| And all opcodes can optionally have an immediate of 1, 2,
| 3, 4, or 8 bytes and optionally have a ModR/M byte (which
| is a separate datastructure because I don't enter that
| information myself, I simply run a sandsifter-like
| program to execute every single opcode and work out the
| answer).
|
| This isn't quite accurate to how an assembler would see
| it, as Intel will sometimes pack instructions with 0 or 1
| operands into a ModR/M byte, so that, e.g., 0F.01 eax,
| [mem] is actually the SGDT instruction and 0F.01 ecx,
| [mem] is SIDT (and 0F.01 eax, ecx is actually VMCALL).
|
| As long as you're internally making a distinction between
| the different operand forms of instructions (e.g., 8-bit
| ADD versus 16-bit versus 32-bit versus 64-bit, and
| register/register versus register/immediate versus
| register/memory), it's actually not all that difficult to
| deal with the instruction encoding, or even mapping IR to
| those instructions, at least until EVEX prefixes enter
| the picture.
| trealira wrote:
| That's pretty cool; thanks for showing me. I guess x86
| instruction encoding isn't as bad as I thought, then.
| WalterBright wrote:
| As I'm currently implementing an AArch64 code generator for
| the D language dmd compiler, the inconsistency of its
| instructions is equaled and worsened by the clumsiness of
| the documentation for it :-/ But I'm slowly figuring it
| out.
|
| (For example, some instructions with very different
| encodings have the same mnemonic. Arrgghh.)
| brendank310 wrote:
| It's probably no longer maintained, but a former
| colleague of mine did some work on this for C++:
| https://github.com/ainfosec/shoulder. Obviously if the
| docs are lying it doesn't help much, but there was
| another effort he had https://github.com/ainfosec/scapula
| that tried to automate detecting behavior differences
| between the docs and the hardware implementation.
| corsix wrote:
| For an implementation of logical immediate encoding without
| the loop, see https://github.com/LuaJIT/LuaJIT/blob/04dca79
| 11ea255f37be799...
| SunlitCat wrote:
| Haha! Writing an x86 emulator! I still remember writing a toy
| emulator which was able to execute something around the first
| 1000-ish lines of a real BIOS (and then it got stuck or looped
| when it started to access ports or so; I can't remember, it was
| too long ago, and I didn't continue it as I started to get into
| DirectX and modern C++ more).
| aengelke wrote:
| Bonus quirk: there's BSF/BSR, for which the Intel SDM states that
| on zero input, the destination has an undefined value. (AMD
| documents that the destination is not modified in that case.) And
| then there's glibc, which happily uses the undocumented fact that
| the destination is also unmodified on Intel [1]. It took me quite
| some time to track down the issue in my binary translator.
| (There's also TZCNT/LZCNT, which is BSF/BSR encoded with
| F3-prefix -- which is silently ignored on older processors not
| supporting the extension. So the same code will behave
| differently on different CPUs. At least, that's documented.)
|
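| (A C sketch of what a translator ends up doing for BSF if it
| wants to reproduce the unofficial "unchanged on zero" behavior;
| the function name is made up:)
|
|     #include <stdbool.h>
|     #include <stdint.h>
|
|     // BSF r64: returns the new destination value, sets ZF.
|     static uint64_t emu_bsf64(uint64_t dst_old, uint64_t src,
|                               bool *zf)
|     {
|         if (src == 0) {
|             *zf = true;
|             return dst_old;     // destination left unmodified
|         }
|         *zf = false;
|         return (uint64_t)__builtin_ctzll(src);  // lowest set bit
|     }
|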
| Encoding: People often complain about prefixes, but IMHO, that's
| by far not the worst thing. It is well known and somewhat well
| documented. There are worse quirks: For example, REX/VEX/EVEX.RXB
| extension bits are ignored when they do not apply (e.g., MMX
| registers); except for mask registers (k0-k7), where they trigger
| #UD -- also fine -- except if the register is encoded in
| ModRM.rm, in which case the extension bit is ignored again.
|
| APX takes the number of quirks to a different level: the REX2
| prefix can encode general-purpose registers r16-r31, but not
| xmm16-xmm31; the EVEX prefix has several opcode-dependent
| layouts; and the extension bits for a register used depend on the
| register type (XMM registers use X3:B3:rm and V4:X3:idx; GP
| registers use B4:B3:rm, X4:X3:idx). I can't give a complete list
| yet, I still haven't finished my APX decoder after a year...
|
| [1]: https://sourceware.org/bugzilla/show_bug.cgi?id=31748
| CoastalCoder wrote:
| Can you imagine having to make all this logic work faithfully,
| let alone _fast_, in silicon?
|
| X86 used to be Intel's moat, but what a nightmarish burden to
| carry.
| dx4100 wrote:
| Did people just... do this by hand (in software), transistor
| by transistor, or was it laid out programmatically in some
| sense? As in, were segments created algorithmically, then
| repeated to obtain the desired outcome? CPU design baffles
| me, especially considering there are 134 BILLION transistors
| or so in the latest i7 CPU. How does the team even keep track
| of, work on, or even load the files to WORK on the CPUs?
| tristor wrote:
| They use EDA (Electronic Design Automation) software, there
| are only a handful of vendors, the largest probably being
| Mentor Graphics, now owned by Siemens. So, yes, they use
| automation to algorithmically build and track/resolve
| refactors as they design CPUs. CPUs are /generally/ block-
| type designs these days, so particular functions get
| repeated identically in different places and can be
| somewhat abstracted away in your EDA.
|
| It's still enormously complex, and way more complex than
| the last time I touched this stuff more than 15 years ago.
| rikthevik wrote:
| I love that the EDA industry still uses Tcl heavily.
| Warms my heart.
| monocasa wrote:
| It's written in an HDL; IIRC both Intel and AMD use
| verilog. A modern core is on the order of a million or so
| lines of verilog.
|
| Some of that will be hand placed, quite a bit will just be
| thrown at the synthesizer. Other parts like SRAM blocks
| will have their cad generated directly from a macro and a
| description of the block in question.
| cogman10 wrote:
| To further expound on this: ASIC design (like AMD's CPUs) is a
| lot like software work. The engineers that create a lot of
| the digital logic aren't dealing with individual
| transistors, instead they are saying "give me an
| accumulator for this section of code" and the HDL
| provides it. The definition of that module exists
| elsewhere and is shared throughout the system.
|
| This is how the complexity can be wrangled.
|
| Now, MOST of the work is automated for digital logic.
| However, we live in an analog world. So, there is (As far
| as I'm aware) still quite a bit of work for analog
| engineers to bend the analog reality into digital. In the
| real world, changing current creates magnetic fields
| which means you need definitions limiting voltages and
| defining how close a signal line can be to avoid cross
| talk. Square waves are hard to come by, so there's effort
| in timing and voltage bands to make sure you aren't
| registering a "1" when it should have been a "0".
|
| Several of my professors were intel engineers. From what
| they told me, the ratios of employment were something
| like 100 digital engineers to 10 analog engineers to 1
| Physicist/materials engineer.
| bonzini wrote:
| It was entirely laid out by hand until the 286. Using
| standard cells in the 386 enabled the switch from microcode
| to a mostly hardwired core.
| moosedev wrote:
| You might enjoy Ken's latest article on some of this stuff,
| posted just the other day:
| https://news.ycombinator.com/item?id=40899393
| gumby wrote:
| A lot of this is done in software (microcode). But even in
| that case, your statement still holds: "Can you imagine
| having to make all this logic work faithfully, let alone
| fast, in the chip itself?" Writing that microcode must be
| fiendishly difficult given all the functional units, out of
| order execution, register renaming...
| bonzini wrote:
| The crazy parts that were mentioned in the parent comment
| are all part of the hot path. Microcode handles slow paths
| related to paging and segmentation, and very rare
| instructions. Not necessarily unimportant (many common
| privileged instructions are microcoded) but still rare
| compared to the usual ALU instructions.
|
| But it's not a huge deal to program the quirky encoding in
| an HDL, it's just a waste of transistors. The really
| complicated part is the sequencing of micro operations and
| how they enter the (out of order) execution unit.
| aengelke wrote:
| > A lot of this is done in software (microcode).
|
| No, that hasn't been the case for >30 years. Microcode is only
| used for implementing some complex instructions (mostly
| system instructions). Most regular instructions (and the
| rest of the core) don't use microcode and their expansions
| into uOps are hardwired. Also the entire execution unit is
| hardwired.
|
| There are typically some undocumented registers (MSRs on
| x86) that can control how the core behaves (e.g., kill
| switches for certain optimizations). These can then be
| changed by microcode updates.
| im3w1l wrote:
| It's a lot easier to more or less accidentally create
| something quirky than it is to create a second quirk-for-
| quirk compatible system.
| dzaima wrote:
| A fun thing is that e.g. "cmp ax, 0x4231" differs from "cmp
| eax, 0x87654321" only in the presence of the data16 prefix,
| and thus the longer immediate; and it's the only significant
| case (I think?) of a prefix changing the total instruction
| size, and thus, for _some_ such instructions, the 16-bit
| version, _sometimes_ (but not always!) is significantly
| slower. "but not always" as in, if you try to microbenchmark
| a loop of such, sometimes you can have entire microseconds of
| it consistently running at 0.25 cycles/instr avg, and
| sometimes that same exact code (in the same process!) will
| measure it at 3 cycles/instr (tested on Haswell, but
| uops.info indicates this happens on all non-atom Intel since
| Ivy Bridge).
| BeeOnRope wrote:
| Probably, if the uops come from the uop cache you get the
| fast speed since the prefix and any decoding stalls don't
| have any impact in that case (that mess is effectively
| erased in the uop cache), but if it needs to be decoded you
| get a stall due to the length changing prefix.
|
| Whether a bit of code comes from the uop cache is highly
| dependent on alignment, surrounding instructions, the
| specific microarchitecture, microcode version and even more
| esoteric things like how many incoming jumps target the
| nearby region of code (and in which order they were
| observed by the cache).
| dzaima wrote:
| Yep, a lot of potential contributors. Though, my test was
| of a single plain 8x unrolled loop doing nothing else,
| running for tens of thousands of iterations to take a
| total of ~0.1ms, i.e. should trivially cache, and yet
| there's consistent inconsistency.
|
| Did some 'perf stat'ting, comparing the same test with
| "cmp eax,1000" vs "cmp ax,1000"; per instruction,
| idq.mite_uops goes 0.04% vs. 35%, and lsd.uops goes 90% vs.
| 54%; so presumably sometimes somehow the loop makes it into
| the LSD, at which point dropping out of it is hard, while
| other times it perpetually gets stuck at MITE? (test is
| of 10 instrs - 8 copies of the cmp, and dec+jne that'd
| get fused, so 90% uops/instr makes sense)
| aengelke wrote:
| The behavior/penalty of 66/67 length-changing prefixes is
| documented in the Intel Optimization Reference Manual
| (3.4.2.3):
|
| > Assembly/Compiler Coding Rule 19. (MH impact, MH
| generality) Favor generating code using imm8 or imm32
| values instead of imm16 values.
| dzaima wrote:
| Some interesting quotes around there:
|
| > The following alignment situations can cause LCP stalls
| to trigger twice:
|
| > * An instruction is encoded with a MODR/M and SIB byte,
| and the fetch line boundary crossing is between the
| MODR/M and the SIB bytes.
|
| > * An instruction starts at offset 13 of a fetch line
| references a memory location using register and immediate
| byte offset addressing mode.
|
| So that's the order of funkiness to be expected, fun.
|
| > False LCP stalls occur when (a) instructions with LCP
| that are encoded using the F7 opcodes, and (b) are
| located at offset 14 of a fetch line. These instructions
| are: not, neg, div, idiv, mul, and imul. False LCP
| experiences delay because the instruction length decoder
| can not determine the length of the instruction before
| the next fetch line, which holds the exact opcode of the
| instruction in its MODR/M byte.
|
| The "true" LCP stall for the F7 opcodes would be "test
| r16,imm16", but due to the split opcode info across the
| initial byte & ModR/M, the other F7's suffer too.
| kimixa wrote:
| It seems that state management in a modern superscalar CPU is
| orders of magnitude more complex than even this.
| turndown wrote:
| Intel is coming out with an improved x86 instruction set that
| removes a lot of the cruft, called 'APX' for Advanced
| Performance Extensions.
| bonzini wrote:
| On and off over the last year I have been rewriting QEMU's x86
| decoder. It started as a necessary task to incorporate AVX
| support, but I am now at a point where only a handful of
| opcodes are left to rewrite, after which it should not be too
| hard to add APX support. For EVEX my plan is to keep the raw
| bits until after the opcode has been read (i.e. before
| immediates and possibly before modrm) and the EVEX class
| identified.
|
| My decoder is mostly based on the tables in the manual, and the
| code is mostly okay--not too much indentation and phases mostly
| easy to separate/identify. Because the output is JITted code,
| it's ok to not be super efficient and keep the code readable;
| it's not where most of the time is spent. Nevertheless there
| are several cases in which the manual is wrong or doesn't say
| the whole story. And the tables haven't been updated for
| several years (no K register instructions, for example), so
| going forward there will be more manual work to do. :(
|
| The top comment explains a bit what's going on:
| https://github.com/qemu/qemu/blob/59084feb256c617063e0dbe7e6...
|
| (As I said above, there are still a few instructions handled by
| the old code predating the rewrite, notably BT/BTS/BTR/BTC. I
| have written the code but not merged it yet).
| aengelke wrote:
| Thanks for the pointer to QEMU's decoder! I actually never
| looked at it before.
|
| So you coded all the tables manually in C -- interesting,
| that's quite some effort. I opted to autogenerate the tables
| (and keep them as data only => smaller memory footprint)
| [1,2]. That's doable, because x86 encodings are mostly fairly
| consistent. I can also generate an encoder from it (ok, you
| don't need that). Re 'custom size "xh"': AVX-512 also has
| fourth and eighth. Also interesting that you have a separate
| row for "66+F2". I special case these two (CRC32, MOVBE)
| instructions with a flag.
|
| I think the prefix decoding is not quite right for x86-64:
| 26/2e/36/3e are ignored in 64-bit mode, except for 2e/3e as
| branch-not-taken/taken hints and 3e as notrack. (See SDM Vol.
| 1 3.3.7.1 "Other segment override prefixes (CS, DS, ES, and
| SS) are ignored.") Also, REX prefixes that don't immediately
| precede the opcode (or VEX/EVEX prefix) are ignored. Anyhow,
| I need to take a closer look at the decoder with more time.
| :-)
|
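| (A C sketch of that rule for the legacy/REX prefix scan; the
| loop shape is illustrative, not QEMU's or fadec's code:)
|
|     #include <stdint.h>
|
|     // A REX byte only takes effect if it is the last prefix
|     // before the opcode; any later legacy prefix cancels it.
|     static const uint8_t *scan_prefixes(const uint8_t *p,
|                                         uint8_t *rex_out)
|     {
|         uint8_t rex = 0;
|         for (;;) {
|             uint8_t b = *p;
|             if ((b & 0xf0) == 0x40) {          // REX, 64-bit mode
|                 rex = b;                       // last one wins
|                 p++;
|             } else if (b == 0xf0 || b == 0xf2 || b == 0xf3 ||
|                        b == 0x2e || b == 0x3e || b == 0x26 ||
|                        b == 0x36 || b == 0x64 || b == 0x65 ||
|                        b == 0x66 || b == 0x67) {
|                 rex = 0;                       // cancels earlier REX
|                 p++;
|             } else {
|                 *rex_out = rex;
|                 return p;                      // opcode starts here
|             }
|         }
|     }
|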
| > For EVEX my plan is to keep the raw bits until after the
| opcode has been read
|
| I came to the same conclusion that this is necessary with
| APX. The map+prefix+opcode combination identifies how the
| other fields are to be interpreted. For AVX-512, storing the
| last byte was sufficient, but with APX, vvvv got a second
| meaning.
|
| > Nevertheless there are several cases in which the manual is
| wrong or doesn't say the whole story.
|
| Yes... especially for corner cases, getting real hardware is
| the only reliable way to find out how the CPU behaves.
|
| [1]: https://github.com/aengelke/fadec/blob/master/instrs.txt
| [2]: https://github.com/aengelke/fadec/blob/master/decode.c
| bonzini wrote:
| > interesting that you have a separate row for "66+F2"
|
| Yeah that's only for 0F38F0 to 0F38FF.
|
| > Re 'custom size "xh"': AVX-512 also has fourth and eighth
|
| Also AVX for VPMOVSX and VPMOVZX but those are handled
| differently. I probably should check if xh is actually
| redundant... _EDIT_: it's only needed for VCVTPS2PH, which
| is the only instruction with a half-sized _destination_.
|
| > I think the prefix decoding is not quite right for
| x86-64: 26/2e/36/3e are ignored in 64-bit mode
|
| Interesting, I need to check how they interact with the
| FS/GS prefixes (64/65).
|
| > REX prefixes that don't immediately precede the opcode
| (or VEX/EVEX prefix) are ignored
|
| Oh, didn't know that!
| aengelke wrote:
| > how they interact with the FS/GS prefixes (64/65)
|
| For memory operations, they are ignored: 64-2e-65-3e
| gives 65 as segment override. (From my memory and the
| resulting implementation, I did some tests with hardware
| a few years back.)
|
| I do need to check myself how 2e/3e on branches interact
| with other segment overrides, though.
| jart wrote:
| FWIW here's all 700 lines of Blink's x86 decoder.
| https://github.com/jart/blink/blob/master/blink/x86.c
| bonzini wrote:
| I don't want to be the person that has to add an
| instruction to blink...
| jcranmer wrote:
| It looks like someone started with Intel's XED code
| (which relies on custom tables to specify instructions,
| and compiles that to C tables at compile time) and hand-
| minimized the code into a single file. I'm guessing it's
| designed to never have any more code added to it.
| jart wrote:
| I minimized it by hand to make it more maintainable, to
| me at least.
| mananaysiempre wrote:
| The semantics of LZCNT combined with its encoding feels like an
| own goal: it's encoded as a BSR instruction with a legacy-
| ignored prefix, _but_ for nonzero inputs its return value is
| the operand size minus one, minus the return value of the
| legacy version.
| Yes, clz() is a function that exists, but the extra subtraction
| in its implementation feels like a small cost to pay for extra
| compatibility when LZCNT could've just been BSR with different
| zero-input semantics.
| bonzini wrote:
| Yes, it's like someone looked at TZCNT and thought "let's
| encode LZCNT the same way", but it makes no sense.
| adrian_b wrote:
| LZCNT and TZCNT are corrections (originally introduced by
| AMD) for the serious mistake made by the designers of the
| Intel 80386 when they defined BSF and BSR.
|
| Because on the very slow 80386 the wrong definition for the
| null input did not matter much, they failed to foresee
| how bad it would become for future pipelined and
| superscalar CPUs, where having to insert a test for null
| input can slow down a program many times.
|
| Nevertheless, they should have paid more attention to the
| earlier use of such instructions. For instance, the Cray-1 had
| defined LZCNT in the right way almost ten years earlier.
| bonzini wrote:
| LZCNT was introduced by Intel in BMI1, while TZCNT was
| introduced by AMD.
| BeeOnRope wrote:
| I'm not following: as long as you are introducing a new,
| incompatible instruction for leading zero counting, you'd
| definitely choose LZCNT over BSR as LZCNT has definitely won
| in retrospect over BSR as the primitive for this use case.
| BSR is just a historical anomaly which has a zero-input
| problem for no benefit.
|
| What would be the point of offering a new variation of BSR
| with different input semantics?
| mananaysiempre wrote:
| When it comes to TZCNT vs BSF, they are just compatible
| enough for a new compiler to use unconditionally (if we
| assume that BSF with a zero input leaves its output
| register unchanged, as it has for decades, and as
| documented by AMD who defined LZCNT): the instruction
| sequence
|
|     MOV   ECX, 32
|     TZCNT ECX, EAX    ; i.e. REP BSF ECX, EAX
|
| behaves identically on everything from the original 80386
| up and is better on superscalars with TZCNT support due to
| avoiding the false dependency on ECX. The reason for that
| is that BSF with a nonzero input and TZCNT with a nonzero
| input have exactly the same output. That's emphatically not
| true of BSR and LZCNT, so we're stuck relegating LZCNT to
| compiler flags.
| aengelke wrote:
| TZCNT and BSF are not completely identical even for non-
| zero input: BSF sets the ZF when the input is zero, TZCNT
| sets ZF when the output is zero (i.e., the least
| significant bit is set).
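|
| (Sketched in C for the 32-bit case; helper names are made up,
| and BSF's "unchanged" destination is simplified away:)
|
|     #include <stdbool.h>
|     #include <stdint.h>
|
|     // BSF: ZF reflects the *input* (set iff src == 0);
|     // the result is only meaningful for nonzero src.
|     static unsigned zf_bsf(uint32_t src, bool *zf)
|     {
|         *zf = (src == 0);
|         return src ? (unsigned)__builtin_ctz(src) : 0;
|     }
|
|     // TZCNT: ZF reflects the *output* (set iff the result is 0,
|     // i.e. bit 0 of src was set); zero input yields 32.
|     static unsigned zf_tzcnt(uint32_t src, bool *zf)
|     {
|         unsigned r = src ? (unsigned)__builtin_ctz(src) : 32;
|         *zf = (r == 0);
|         return r;
|     }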
| torusle wrote:
| Another bonus quirk, from the 486 and Pentium era...
|
| BSWAP EAX converts from little endian to big endian and vice
| versa. It was a 32 bit instruction to begin with.
|
| However, we have the 0x66 prefix that switches between 16 and
| 32-bit operand size. If you apply that to BSWAP EAX, undefined
| funky things happen.
|
| On some CPU architectures (Intel vs. AMD) the prefix was just
| ignored. On others it did something that I call an "inner
| swap". E.g. in your four bytes that are stored in EAX byte 1
| and 2 are swapped. 0x11223344 became
| 0x11332244.
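|
| (For reference, the documented 32-bit behavior in C; the
| 66-prefixed cases above are exactly what this does not cover:)
|
|     #include <stdint.h>
|
|     // BSWAP r32: reverse the four bytes of the register.
|     static uint32_t emu_bswap32(uint32_t v)
|     {
|         return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
|                ((v << 8) & 0x00ff0000u) | (v << 24);
|     }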
| userbinator wrote:
| Also known as "bswap ax", and research shows that it does
| something surprising but consistent on almost all hardware:
| It zeros the register.
|
| https://www.ragestorm.net/blogs/?p=141
|
| https://gynvael.coldwind.pl/?id=268
|
| However, this page, now gone, suggests that some CPUs (early
| 486s?) did something different: http://web.archive.org/web/20
| 071231192014/http://www.df.lth....
|
| Unfortunately I have not found any evidence nor reason to
| believe that this "inner swap" behaviour you mention exists
| in some CPU --- except perhaps some flawed emulators?
| __turbobrew__ wrote:
| I know nothing about this space, but it would be interesting to
| hook up a JTAG interface to an x86 CPU and then step instruction
| by instruction and record all the register values.
|
| You could then use this data to test whether or not your
| emulator perfectly emulates the hardware by running the same
| program through the emulated CPU and validating that the state
| is the same at every instruction.
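|
| (A C sketch of that lockstep check; every type and function
| here is hypothetical:)
|
|     #include <stdbool.h>
|     #include <stddef.h>
|     #include <stdint.h>
|     #include <string.h>
|
|     typedef struct { uint64_t regs[16]; uint64_t rip; } CpuState;
|
|     struct Emu;                              // hypothetical
|     void emu_step(struct Emu *);             // hypothetical API
|     void emu_get_state(struct Emu *, CpuState *);
|
|     // Step the emulator alongside a recorded hardware trace,
|     // reporting the first instruction where the states diverge.
|     static bool verify_against_trace(struct Emu *emu,
|                                      const CpuState *trace,
|                                      size_t n)
|     {
|         for (size_t i = 0; i < n; i++) {
|             emu_step(emu);                   // one instruction
|             CpuState s;
|             emu_get_state(emu, &s);
|             if (memcmp(&s, &trace[i], sizeof s) != 0)
|                 return false;                // divergence at i
|         }
|         return true;
|     }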
| was_a_dev wrote:
| Off topic, but I like this blog style/layout. I can imagine it
| isn't everyone's taste, but it just works for me.
| timmisiak wrote:
| Glad you like it. I used m10c, with a few tweaks:
| https://github.com/vaga/hugo-theme-m10c
| waynecochran wrote:
| Intel architecture is loaded with historical artifacts. The
| switch in how segment registers were used as you went from real
| mode to protected mode was an incredible hardware hack to keep
| older software working. I blame Intel for why so many folks avoid
| assembly language. I programmed in assembly for years using TI's
| 84010 graphics chips and the design was gorgeous -- simple RISC
| instruction set, flat address space, and bit addressable! If
| during the earlier decades folks were programming using chips
| with more elegant designs, far more folks would be programming in
| assembly language (or at least would know how to).
| hajile wrote:
| > I blame Intel for why so many folks avoid assembly language.
|
| x86 (the worst assembly of any of the top 50 most popular ISAs
| by a massive margin) and tricky MIPS branch delay slots trivia
| questions at university have done more to turn off programmers
| from learning assembly than anything else and it's not even
| close.
|
| This is one reason I'm hoping that RISC-V kills off x86. It
| actually has a chance of once again allowing your average
| programmer to learn useful assembly.
| LegionMammal978 wrote:
| What do you find particularly problematic about x86 assembly,
| from a pedagogical standpoint? I've never noticed any glaring
| issues with it, except for the weird suffixes and sigils if
| you use AT&T syntax (which I generally avoid).
| timmisiak wrote:
| I suspect the biggest issue is that courses like to talk
| about how instructions are encoded, and that can be
| difficult with x86 considering how complex the encoding
| scheme is. Personally, I don't think x86 is all that bad as
| long as you look at a small useful subset of instructions
| and ignore legacy and encoding.
| LegionMammal978 wrote:
| True, encoding is one thing that really sets x86 apart.
| But as you say, the assembly itself doesn't seem that
| uniquely horrible (at least not since the 32-bit era),
| which is why I found the sentiment confusing as it was
| phrased.
|
| Maybe it's the haphazard SIMD instruction set, with every
| extension adding various subtly-different ways to permute
| bytes and whatnot? But that would hardly seem like a
| beginner's issue. The integer multiplication and division
| instructions can also be a bit wonky to use, but hardly
| unbearably so.
| hajile wrote:
| An ordinary developer cannot write performant x86 without
| a massive optimizing compiler.
|
| Actual instruction encoding is horrible. If you're
| arguing that you can write a high-level assembly over the
| top, then you aren't so much writing assembly as you are
| writing something in between.
|
| When you need to start caring about the actual assembly
| (padding a cache line, avoiding instructions with too
| many uops, or choosing between using APX 32 registers and
| more normal shorter instructions, etc) rather than some
| high level abstraction, the experience is worse than any
| other popular ISA.
| LegionMammal978 wrote:
| > Actual instruction encoding is horrible. If you're
| arguing that you can write a high-level assembly over the
| top, then you aren't so much writing assembly as you are
| writing something in between.
|
| One instruction in x86 assembly is one instruction in the
| machine code, and one instruction as recognized by the
| processor. And except for legacy instructions that we
| shouldn't teach people to use, each of these is not much
| higher-level than an instruction in any other assembly
| language. So I still don't see what the issue is, apart
| from "the byte encoding is wacky".
|
| (There are mops beneath it of course, but these are
| reordered and placed into execution units in a very
| implementation-dependent manner that can't easily be
| optimized until runtime. Recall how VLIW failed at
| exposing this to the programmer/compiler.)
|
| > padding a cache line, avoiding instructions with too
| many uops
|
| Any realistic ARM or RISC-V processor these days also
| supports out-of-order execution with an instruction
| cache. The Cortex processors even have mops to support
| this! The classic 5-stage pipeline is obsolete outside
| the classroom. So if you're aiming for maximum
| performance, I don't see how these are concerns that
| arise far less in other assembly languages. E.g., you'll
| always have to be worried about register dependencies,
| execution units, optimal loop unrolling, etc. It's not
| like a typical program will be blocked on the mop cache
| in any case.
|
| > APX 32 registers and more normal shorter instructions
|
| APX doesn't exist yet, and I'd wager there's a good
| chance it will never reach consumer CPUs.
| hajile wrote:
| Which "one instruction" is the right one? You can have
| the exact same instruction represented many different
| ways in x86. Some are shorter and some are longer (and
| some are different, but the same length). When you say to
| add two numbers, which instruction variant is correct?
|
| This isn't a straightforward answer. As I alluded to in
| another statement you quoted, padding cache lines is a
| fascinating example. Functions should ideally start at
| the beginning of a cache line. This means the preceding
| cache line needs to be filled with something. NOP seems
| like the perfect solution, but it adds unnecessary
| instructions in the uop cache which slow things down.
| Instead, the compiler goes to the code on the previous
| cache line and expands instructions to their largest size
| to fill the space because adding a few bytes of useless
| prefix is allowed and doesn't generate NOPs in the uop
| cache.
|
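| (Concretely, two valid encodings of the same instruction, shown
| as C byte arrays; per the SDM, 83 /0 ib and 05 id are both
| "add eax, imm":)
|
|     #include <stdint.h>
|
|     // add eax, 1 -- imm8 form, 3 bytes
|     static const uint8_t add_short[] = { 0x83, 0xC0, 0x01 };
|
|     // add eax, 1 -- imm32 form, 5 bytes
|     static const uint8_t add_long[] =
|         { 0x05, 0x01, 0x00, 0x00, 0x00 };
|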
| Your uop statements aren't what I was talking about.
| Almost all RISCV instructions result in just one uop. On
| fast machines, they may even generate less than one uop
| if they get fused together.
|
| x86 has a different situation. If your instruction
| generates too many uops (or is too esoteric), it will
| skip the fast hardware decoders and be sent to a
| microcode decoder. There's a massive penalty for doing
| this that slows performance to a crawl. To my knowledge,
| no such instructions exist in any modern ISA.
|
| Intel only has one high performance core. When they
| introduce APX, it'll be on everything. There's good
| reason to believe this will be introduced. It adds lots
| of good features and offers increased code density in
| quite a few situations which is something we haven't seen
| in a meaningful way since AMD64.
| akira2501 wrote:
| I think that's putting the cart before the horse. I think it
| wouldn't matter which architecture you choose as there will
| always be deep performance considerations that must be
| understood in order to write efficient software.
|
| Otherwise your statement might amount to "I hope there
| is an ISA that intentionally wastes performance and energy in
| deference to human standards of beauty."
|
| It's why the annals of expertise rarely make for good dinner
| table conversation.
| hajile wrote:
| x86 has a parity flag. It only takes the parity of the
| lowest 8 bits though. Why is it there? Because it was in
| the 8086 because it was in the 8080 because it was in the
| 8008 because Intel was trying to win a contract for the
| Datapoint 2200.
|
| Sometimes the short instruction variant is correct, but not
| if it makes a single instruction break down into many uops
| as the microcode is 1000x slower.
|
| Oh, but you need to use those longer variants without extra
| uops to align functions to cache boundaries because they
| perform better than NOP.
|
| Floats and SIMD are a mess with x87, AMX (with incompatible
| variants), SSE1-4 (with incompatible variants), AVX, AVX2,
| and AVX512 (with incompatible variants) among others.
|
| The segment-and-offset way of dealing with memory is
| painful.
|
| What about the weird rules about which registers are
| reserved for multiply and divide? Half the "general
| purpose" registers are actually locked in at the ISA level.
|
| Now APX is coming up and you get to choose between shorter
| instructions with 16 registers and two-operand syntax, or
| longer instructions with 32 registers and three-operand
| forms.
|
| And this just scratches the surface.
|
| RISC-V is better in every way. The instruction density is
| significantly higher. The instructions are simpler and
| easier to understand while being just as powerful.
| Optimizing compilers are easier to write because there's
| generally just one way to do things, and it's guaranteed to
| be optimized.
| bheadmaster wrote:
| If Terry A. Davis is to be trusted, as long as you ignore the
| legacy stuff, x64 assembly is nice to work with.
| 256_ wrote:
| Where did he say this? In any case, I can confirm it from
| personal experience. x86 isn't that annoying if you ignore
| most of it. The worst thing is the edge cases with how some
| instructions can only use some registers, like
| shifting/rotating using cl.
| bheadmaster wrote:
| There was a video of him talking about it, but I can't
| recall the exact words or the name of the video.
|
| But his operating system, TempleOS, was based on HolyC
| which was compiled to pure x64, and had support for inline
| x64 assembly.
| russdill wrote:
| What's crazy is that depending on how deep you want to go, a
| lot of the information is not available in documents published
| by Intel. Fortunately, if it matters for emulators, it typically
| can be (or has been) reverse engineered.
| Max-q wrote:
| Wow, is that true? And the documentation is thousands of
| pages! I can't understand how Intel keeps these processors
| consistent during development. It must be a really, really
| hard job.
|
| It was a nice period in history when proper RISC was working
| well, when we could get stuff to run faster, but getting more
| transistors was expensive. Now we can't get it to run faster
| but we can get billions of transistors, so making stuff more
| complicated is the way to more performance.
|
| I wonder if we ever again will get a time where
| simplification is a valid option...
| russdill wrote:
| There's some info here:
|
| http://www.rcollins.org/secrets/
| http://www.rcollins.org/ddj/ddj.html
|
| Notably things like undocumented opcodes, processor modes,
| undocumented registers, undocumented bits within registers,
| etc. It's not uncommon to release a specific set of
| documentation to trusted partners only. Back in the day
| intel called these "gold books".
| jecel wrote:
| Wouldn't that be the 34010?
| waynecochran wrote:
| Yes. TMS 34010 and 34020 ... my bad.
| trollied wrote:
| > Writing a CPU emulator is, in my opinion, the best way to
| REALLY understand how a CPU works
|
| Hard disagree.
|
| The best way is to create a CPU from gate level, like you do on a
| decent CS course. (I really enjoyed making a cut down ARM from
| scratch)
| whobre wrote:
| Reading Petzold's "Code" comes pretty close, though, and is
| easier.
| commandlinefan wrote:
| OTOH, are you really going to be implementing memory segmenting
| in your gate-level CPU? I'd say actually creating a working CPU
| and _then_ emulating a real CPU (warts and all) are both
| necessary steps to real understanding.
| trollied wrote:
| I agree.
| monocasa wrote:
| > OTOH, are you really going to be implementing memory
| segmenting in your gate-level CPU?
|
| I have, but it was a PDP-8 which I'll be the first to admit
| is kind of cheating.
| mrspuratic wrote:
| This. I mean, why not start with wave theory and material
| science if you really want a good understanding :)
|
| In my CS course I learned a hell of a lot from creating a
| 6800 emulator; though it wasn't on the course, building a
| working 6800 system was. The development involved running an
| assembler on a commercial *nix system and then typing the hex
| object code into an EPROM programmer. You get a lot of time
| to think about where your bugs are when you have to wait for
| a UV erase cycle...
| commandlinefan wrote:
| > why not start with wave theory and material science
|
| You jest, but if I had infinite time...
| quantified wrote:
| Well, I think you're both right. It's satisfying as heck to
| sling 74xx chips together and you get a feel for the electrical
| side of things and internal tradeoffs.
|
| When you get to doing that for the CPU that you want to do
| meaningful work with, you start to lose interest in that
| detail. Then the complexities of the behavior and spec become
| interesting and the emulator approach is more tractable and can
| cover more types of behavior.
| IshKebab wrote:
| I think trollied is correct actually. I work on a CPU
| emulator professionally and while it gives you a great
| understanding of the spec there are lots of details about
| _why_ the spec is the way it is that are due to how you
| actually implement the microarchitecture. You only learn that
| stuff by actually implementing a microarchitecture.
|
| Emulators tend not to have many features that you find in
| real chips, e.g. caches, speculative execution, out-of-order
| execution, branch predictors, pipelining, etc.
|
| This isn't "the electrical side of things". When he said
| "gate level" he meant RTL (SystemVerilog/VHDL) which is
| pretty much entirely in the digital domain; you very rarely
| need to worry about actual electricity.
| trollied wrote:
| I write retro console emulators for fun, so agree with you
| 100% :)
| timmisiak wrote:
| I think both are useful, but designing a modern CPU from the
| gate level is out of reach for most folks, and I think there's
| a big gap between the sorts of CPUs we designed in college and
| the sort that run real code. I think creating an emulator of a
| modern CPU is a somewhat more accessible challenge, while still
| being very educational even if you only get something partially
| working.
| banish-m4 wrote:
| This is an illusion and a red herring. RTL synthesis is the
| typical functional prototype stage reached, which is generally
| sufficient for FPGA work. To burn an ASIC as part of an
| educational consortium run is doable, but it's uncommon.
| WalterBright wrote:
| When I was at Caltech, another student in the dorm had been
| admitted because he'd designed and implemented a CPU using
| only 7400 TTL.
|
| Woz wasn't the only supersmart young computer guy at the time
| :-)
|
| (I don't know how capable it was, even a 4 bit CPU would be
| quite a challenge with TTL.)
| nsguy wrote:
| I think the key word above was modern. I felt able to
| design a simple CPU when I finished my Computer
| Architecture course in university. I think I forgot most of
| it by now ;) There are a few basic concepts to wrap your
| head around but once you have them a simple CPU is doable.
| Doing this with TTL or other off the shelf components is
| mostly minimizing/adapting/optimizing to those components
| (or using a lot of chips ;) ). I have never looked at
| discrete component CPU designs; I imagine ROM and RAM chips
| play a dominant part (e.g. you don't just build RAM with
| 74x TTL flip-flops).
| WalterBright wrote:
| He probably used off-the-shelf RAM chips; after all, RAM
| is not part of the CPU.
|
| In the early 70s, before the internet, even finding the
| information needed would be a fair amount of work.
|
| I learned how flip flops worked, adders, and registers in
| college, and that could be extended to an ALU. But still,
| that was in college, not high school.
|
| I've read some books on computer history, and they are
| frustratingly vague about how the machines actually
| worked. I suspect the authors didn't actually know. Sort
| of like the books on the history of Apple that gush
| over Woz's floppy disk interface, but no details.
| nsguy wrote:
| Was doing some Googling and came across:
| https://en.wikipedia.org/wiki/Breakout_(video_game)
|
| I never heard this story...
| akira2501 wrote:
| > and the sort that run real code
|
| And the sort that are commercially viable in today's
| marketplace. The nature of the code has nothing to do with
| it. The types of machines we play around with today surpass
| the machines we used to land men on the moon. What's not
| "real code" about that?
| brailsafe wrote:
| So far on my journey through Nand2Tetris (since I kind of
| dropped out of my real CS course) I've followed the entire
| process of working my way up from the gate level, and just
| finished the VM emulator chapter, which took an eternity. Now
| onto compilation.
| banish-m4 wrote:
| Seconded. A microcoded, pipelined, superscalar, branch-
| predicting basic processor with L1 data & instruction caches
| and write-back L2 cache controller is nontrivial. Most software
| engineers have an incomplete grasp of data hazards, cache
| invalidation, or pipeline stalls.
| nsguy wrote:
| IIRC reading some Intel CPU design history some of their
| designers are from a CS/software background. But I agree.
| Software is naturally very sequential which is different than
| digital hardware which is naturally/inherently parallel. A
| clock can change the state of a million flip-flops all at
| once, it's a very different way of thinking about computation
| (though of course at the theoretical level all the same) and
| then there's the physics and EE parts of a real world CPU.
| Writing software and designing CPUs are just very different
| disciplines and the CPU as it appears to the software
| developer isn't how it appears to the CPU designer.
| banish-m4 wrote:
| I'm not talking about Intel's engineers, I'm saying very
| few software engineers today understand how a processor
| works. I'm sure Intel hires all sorts of engineers for
| different aspects of each ecosystem they maintain.
| Furthermore, very few software engineers have ever even
| touched a physical server because they're sequestered away
| from a significant fraction of the total stack.
|
| Speculative and out-of-order execution requires
| synchronization to maintain dataflow and temporal
| consistency.
|
| Computer Organization is a good book should anyone want to
| dive deeper.
| snvzz wrote:
| CPU was a poor choice of words. ISA would have worked.
| fjfaase wrote:
| Interesting read. I have a lot of respect for people who develop
| emulators for x86 processors. It is a complicated processor and
| from first-hand experience I know that developing and debugging
| emulators for CPUs can be very challenging. In the past year, I
| spent some time developing a very limited i386 emulator [1]
| including some system calls for executing the first steps of
| live-bootstrap [2], primarily to figure out how it works. I
| learned a lot about system calls and ELF.
|
| [1] https://github.com/FransFaase/Emulator/
|
| [2] https://github.com/fosslinux/live-bootstrap/
| banish-m4 wrote:
| Most of the complexities lie in managing the various
| configurations of total system compatibility emulation,
| especially for timing, analog oddities, and whether to include
| bugs or not and for which steppings. If you want precise and
| accurate emulation, you have to have real hardware to validate
| behavior against. Then there are the cases of what not to
| emulate and offering better-than-original alternatives.
| Quekid5 wrote:
| Just as an adjacent aside from a random about learning by doing:
|
| Implementing a ThingDoer is a huge learning experience. I
| remember doing co-op "write-a-compiler" coursework with another
| person. We were doing great, everything was working and then we
| got to the oral exam...
|
| "Why is your Stack Pointer growing upwards"?
|
| ... I was kinda stunned. I'd never thought about that. We
| understood most of the things, but sometimes we kind of just
| bashed at things until they worked... and it turned out upward-
| growing SP did work (up to a point) on the architecture our toy
| compiler was targeting.
| boricj wrote:
| It's funny to me how much grief x86 assembly generates when
| compared to RISC here, because I have the opposite problem when
| delinking code back into object files.
|
| For this use-case, x86 is really easy to analyze whereas MIPS has
| been a nightmare to pull off. This is because all I mostly care
| about are references to code and data. x86 has pointer-sized
| immediate constants and MIPS has split HI16/LO16 relocation
| pairs, which leads to all sorts of trouble with register usage
| graphs, code flow and branch delay instructions.
|
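| (The arithmetic behind those pairs, sketched in C: the %lo half
| is sign-extended by the consuming instruction, so the %hi half
| must compensate, which is what makes matching them back up
| painful:)
|
|     #include <stdint.h>
|
|     // Split a 32-bit address into a lui/addiu pair so that
|     // addr == (hi << 16) + (int32_t)(int16_t)lo.
|     static void split_hi_lo(uint32_t addr, uint16_t *hi,
|                             uint16_t *lo)
|     {
|         *lo = (uint16_t)(addr & 0xffffu);
|         *hi = (uint16_t)((addr + 0x8000u) >> 16);  // carry fixup
|     }
|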
| That should not be construed as praise on my end for x86.
| jmspring wrote:
| Apparently my memory is false: I thought the salsa20 variants
| and machine code were originally on cryp.to, but Dan
| Bernstein's site is https://cr.yp.to/
|
| While at a startup when we were looking at data at rest
| encryption, streaming encryption and other such things. Dan had a
| page with different implementations (cross compiled from his
| assembler representation) to target chipsets and instruction
| sets. Using VMs (this was the early/mid 2000s) and such, it was
| interesting to see which of those instruction sets were supported.
| In testing, there would be occasional hiccups where an
| implementation wasn't fully supported though the VM claimed such.
| djaouen wrote:
| "I don't believe in emulatores." - 0x86
| t_sea wrote:
| > Writing a CPU emulator is, in my opinion, the best way to
| REALLY understand how a CPU works.
|
| The 68k disassembler we wrote in college was such a Neo "I know
| kung fu" moment for me. It was the missing link that let me
| reason about code from high-level language down to transistors
| and back. I can only imagine writing a full emulator is an order
| of magnitude more effective. Great article!
| lifthrasiir wrote:
| I recently implemented a good portion of an x86(-64) decoder for
| a side project [1] and was kinda surprised how it has gotten even
| more complicated in recent years. Sandpile.org [2] was really useful
| for my purpose.
|
| [1] Namely, a version of Fabian Giesen's disfilter for x86-64,
| for yet another side project which is still not in public:
| https://gist.github.com/lifthrasiir/df47509caac2f065032ef72e...
|
| [2] https://sandpile.org/
| ale42 wrote:
| Shouldn't it be (2023) rather than (2013)?
| Sparkyte wrote:
| The footnotes are glorious. "He was convinced that using a shift
| would work and didn't believe me when I said it wasn't possible."
___________________________________________________________________
(page generated 2024-07-11 23:02 UTC)