[HN Gopher] Things I learned while writing an x86 emulator (2013)
___________________________________________________________________
Things I learned while writing an x86 emulator (2013)
Author : fanf2
Score : 330 points
Date : 2024-07-10 14:42 UTC (1 day ago)
(HTM) web link (www.timdbg.com)
(TXT) w3m dump (www.timdbg.com)
| pm2222 wrote:
| Prior discussion here
| https://news.ycombinator.com/item?id=34636699
|
| Cannot believe it's been 16 months. How time flies.
| AstroJetson wrote:
| Check out Justine Tunney and her emulator.
| https://justine.lol/blinkenlights/
|
| The docs are an amazing tour of how the cpu works.
| zarathustreal wrote:
| Astonishing.. they never cease to amaze me
| trallnag wrote:
| That name, Tunney. Remember it from around 2014, being
| homeless, bumming around, and shit posting on Twitter about
| Occupy lol
| sdsd wrote:
| What a cool person. I really enjoy writing assembly, it feels so
| simple and I really enjoy the vertical aesthetic quality.
|
| The closest I've ever come to something like OP (which is to say,
| not close at all) was when I was trying to help my JS friend
| understand the stack, and we ended up writing a mini vm with its
| own little ISA:
| https://gist.github.com/darighost/2d880fe27510e0c90f75680bfe...
|
| This could have gone much deeper - i'd have enjoyed that, but
| doing so would have detracted from the original educational goal
| lol. I should contact that friend and see if he still wants to
| study with me. it's hard since he's making so much money doing
| fancy web dev, he has no time to go deep into stuff. whereas my
| unemployed ass is basically an infinite ocean of time and energy.
| actionfromafar wrote:
| You should leverage that into your friend teaching you JS,
| maybe.
| djaouen wrote:
| It's like my friend 0x86 always said: "Stay away from
| JavaScript. But stay away from TypeScript harder."
| dmitrygr wrote:
| I've written fast emulators for a dozen non-toy architectures and
| a few JIT translators for a few as well. x86 still gives me PTSD.
| I have never seen a messier architecture. There is history, and a
| reason for it, but still ... damn
| Arech wrote:
| Haha, man, I feel you :DD You probably should have started with
| it from the very beginning :D
| jcranmer wrote:
| > I have never seen a messier architecture.
|
| Itanium. Pretty much every time I open up the manual, I find a
| new thing that makes me go "what the hell were you guys
| thinking!?" without even trying to.
| snazz wrote:
| What sorts of projects are you working on that use Itanium?
| jcranmer wrote:
| None, really. I just happened to get a copy of the manual
| and start idly reading it when my computer got stuck in a
| very long update-reboot cycle and I couldn't do anything
| other than read a physical book.
| trealira wrote:
| Studying the x86 architecture is kind of like studying
| languages with lots of irregularities and vestigial bits, and
| with competing grammatical paradigms, e.g. French. Other
| architectures, like RISC-V and ARMv8, are much more consistent.
| x0x0 wrote:
| I think English may be a better example; we just stapled
| chunks of vulgar latin to an inconsistently simplified proto-
| germanic and then borrowed words from every language we met
| along the way. Add in 44 sounds serialized to the page with
| 26 letters and tada!
| WalterBright wrote:
| The Norman conquest of England means English is a language
| with barbarian syntax and French nouns. It's a happy mess
| of cultural appropriation!
|
| Squeezing the lot into 26 characters was simply genius -
| enabling printing with movable type, Morse code, Baudot
| code, ASCII, etc.
|
| Of course, then icons came along and ruined everything.
| aengelke wrote:
| > Other architectures, like [...] ARMv8, are much more
| consistent.
|
| From an instruction/operation perspective, AArch64 is more
| clean. However, from an instruction operand and encoding
| perspective, AArch64 is a lot less consistent than x86.
| Consider the different operand types: on x86, there are a
| dozen register types, immediate (8/16/32/64 bits), and memory
| operands (always the same layout). On AArch64, there's: GP
| regs, incremented GP reg (MOPS extension), extended GP reg
| (e.g., SXTB), shifted GP reg, stack pointer, FP reg, vector
| register, vector register element, vector register table,
| vector register table element, a dozen types of memory
| operands, conditions, and a dozen types of immediate
| encodings (including the fascinating and very useful, but
| also very non-trivial encoding of logical immediates [1]).
|
| AArch64 also has some register constraints: some vector
| operations can only encode register 0-15 or 0-7; not to
| mention SVE with its "movprfx" prefix instruction that is
| only valid in front of a few selected instructions.
|
| [1]: https://github.com/aengelke/disarm/blob/master/encode.c#
| L19-...
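|
| (A minimal C sketch of that logical-immediate decoding, loosely
| following the DecodeBitMasks pseudocode in the Arm ARM; the
| function name is illustrative, and reserved encodings are
| rejected rather than handled:)
|
|     #include <stdbool.h>
|     #include <stdint.h>
|
|     // Decode an AArch64 logical immediate (N:immr:imms) into
|     // the 64-bit constant it denotes.
|     static bool decode_logical_imm(unsigned n, unsigned immr,
|                                    unsigned imms, uint64_t *out)
|     {
|         // Element size is derived from N:NOT(imms): the
|         // highest set bit selects 2, 4, ..., or 64.
|         unsigned combined = (n << 6) | (~imms & 0x3fu);
|         if (combined == 0)
|             return false;                      // reserved
|         unsigned len = 31 - (unsigned)__builtin_clz(combined);
|         unsigned esize = 1u << len;
|         unsigned s = imms & (esize - 1);
|         unsigned r = immr & (esize - 1);
|         if (s == esize - 1)
|             return false;                      // reserved
|
|         // Run of s+1 ones, rotated right by r within the element.
|         uint64_t elem = (1ull << (s + 1)) - 1;
|         if (r)
|             elem = (elem >> r) | (elem << (esize - r));
|         if (esize < 64)
|             elem &= (1ull << esize) - 1;
|
|         // Replicate the element to fill 64 bits.
|         uint64_t result = 0;
|         for (unsigned i = 0; i < 64; i += esize)
|             result |= elem << i;
|         *out = result;
|         return true;
|     }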
| trealira wrote:
| Admittedly, I never wrote an assembler, but the encoding of
| x86-64 seems pretty convoluted [0] [1], with much of the
| information smeared across the bits of the ModR/M and
| SIB bytes and then extended by prefix bytes. It's more
| complicated than you would assume from having only written
| assembly in text and then having the assembler encode the
| instructions.
|
| One of the things that sticks out to me is that on x86-64,
| operations on 32-bit registers implicitly zero out the
| upper 32-bits of the corresponding 64-bit register (e.g.
| "MOV EAX, EBX" also zeroes out the upper 32-bits of RAX),
| except for the opcode 0x90, which, logically, would be the
| accumulator encoding of "XCHG EAX, EAX" but is special-
| cased to do nothing because it's the traditional NOP
| instruction for x86. So you have to use the other encoding,
| 87 C0.
|
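| (A quick C sketch of that zero-extension rule as an emulator
| might implement it; the helper name is made up:)
|
|     #include <stdint.h>
|
|     // Any write to a 32-bit GPR in 64-bit mode zero-extends
|     // into bits 63:32 of the full register.
|     static void write_gpr32(uint64_t *reg, uint32_t value)
|     {
|         *reg = value;   // implicit zero-extension
|     }
|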
| The fact that AArch64 has instructions that are almost
| always 4 bytes, and the fact that they clearly thought out
| their instruction set more (e.g. instead of having CMP,
| they just use SUBS (subtract and store condition codes) and
| store the result in the zero register, which is always
| zero), is what made me say that it seemed more regular. I
| hadn't studied it in as close detail as x86-64, though.
| But, you've written an ARM disassembler and I haven't;
| you seem to know more about it than me, and so I believe
| what you're saying.
|
| > AArch64 also has some register constraints
|
| x86-64 also has some, in that many instructions can't
| encode the upper 8 bits of a 16-bit register (AH, BH, DH,
| CH) if a REX prefix is used.
|
| [0]: https://wiki.osdev.org/X86-64_Instruction_Encoding
|
| [1]: http://www.c-jump.com/CIS77/CPU/x86/lecture.html
| WalterBright wrote:
| > you've written an ARM disassembler
|
| Here's my AArch64 disassembler work in progress:
|
| https://github.com/dlang/dmd/blob/master/compiler/src/dmd
| /ba...
|
| I add to it in tandem with writing the code generator. It
| helps flush out bugs in both: generate the instruction,
| then disassemble it and compare with what I thought it
| should be.
|
| It's quite a bit more complicated than the corresponding
| x86 disassembler:
|
| https://github.com/dlang/dmd/blob/master/compiler/src/dmd
| /ba...
| bonzini wrote:
| x86 has a lot of special cases and it's becoming harder
| and harder to find non-code material with all
| instructions in a nice tabular format. But overall the
| encoding isn't that complicated. What's complex in the
| architecture is the amount of legacy stuff that still
| lives in privileged code.
|
| > x86-64 also has some, in that many instructions can't
| encode the upper 8 bits of a 16-bit register (AH, BH, DH,
| CH) if a REX prefix is used.
|
| You can't use both the high registers and the extended
| low registers in the same instruction, indeed. But in
| practice nobody is using the high registers anymore so
| it's a relatively rare limitation.
| jcranmer wrote:
| One of the projects I work on every now and then is the
| "World's Worst X86 Decoder", where I'm trying to
| essentially automatically uncover x86 assembly semantics
| without ever having to build a list of x86 instructions,
| and this has forced me to look at the x86 ISA in a
| different way. The basic format I build for an opcode is
| this:
|
|     enum Group1Prefix { Lock = 0xf0, Repnz = 0xf2, Repz = 0xf3 }
|     enum Group2Prefix { Cs = 0x2e, Ds = 0x3e, Es = 0x26,
|                         Fs = 0x64, Gs = 0x65, Ss = 0x36 }
|     enum Group3Prefix { OpSize = 0x66 }
|     enum Group4Prefix { AddrSize = 0x67 }
|     enum ModernPrefix {
|         Rex { w: bool },
|         Vex { w: bool, l: bool },
|     }
|     struct Opcode {
|         pub group1: Option<Group1Prefix>,
|         pub group2: Option<Group2Prefix>,
|         pub group3: Option<Group3Prefix>,
|         pub group4: Option<Group4Prefix>,
|         pub modern_prefix: Option<ModernPrefix>,
|         pub opcode_map: u8,
|         pub opcode: u8,
|     }
|
| And all opcodes can optionally have an immediate of 1, 2,
| 3, 4, or 8 bytes and optionally have a ModR/M byte (which
| is a separate datastructure because I don't enter that
| information myself, I simply run a sandsifter-like
| program to execute every single opcode and work out the
| answer).
|
| This isn't quite accurate to how an assembler would see
| it, as Intel will sometimes pack instructions with 0 or 1
| operands into a ModR/M byte, so that, e.g., 0F.01 eax,
| [mem] is actually the SGDT instruction and 0F.01 ecx,
| [mem] is SIDT (and 0F.01 eax, ecx is actually VMCALL).
|
| As long as you're internally making a distinction between
| the different operand forms of instructions (e.g., 8-bit
| ADD versus 16-bit versus 32-bit versus 64-bit, and
| register/register versus register/immediate versus
| register/memory), it's actually not all that difficult to
| deal with the instruction encoding, or even mapping IR to
| those instructions, at least until EVEX prefixes enter
| the picture.
| trealira wrote:
| That's pretty cool; thanks for showing me. I guess x86
| instruction encoding isn't as bad as I thought, then.
| WalterBright wrote:
| As I'm currently implementing an AArch64 code generator for
| the D language dmd compiler, the inconsistency of its
| instructions is equaled and worsened by the clumsiness of
| the documentation for it :-/ But I'm slowly figuring it
| out.
|
| (For example, some instructions with very different
| encodings have the same mnemonic. Arrgghh.)
| brendank310 wrote:
| It's probably no longer maintained, but a former
| colleague of mine did some work on this for C++:
| https://github.com/ainfosec/shoulder. Obviously if the
| docs are lying it doesn't help much, but there was
| another effort he had https://github.com/ainfosec/scapula
| that tried to automate detecting behavior differences
| between the docs and the hardware implementation.
| corsix wrote:
| For an implementation of logical immediate encoding without
| the loop, see https://github.com/LuaJIT/LuaJIT/blob/04dca79
| 11ea255f37be799...
| SunlitCat wrote:
| Haha! Writing an x86 emulator! I still remember writing a toy
| emulator which was able to execute something around the first
| 1000-ish lines of a real BIOS (and then it got stuck or looped
| when it started to access ports or so; I can't remember, it was
| too long ago, and I didn't continue it as I started to get into
| DirectX and modern C++ more).
| aengelke wrote:
| Bonus quirk: there's BSF/BSR, for which the Intel SDM states that
| on zero input, the destination has an undefined value. (AMD
| documents that the destination is not modified in that case.) And
| then there's glibc, which happily uses the undocumented fact that
| the destination is also unmodified on Intel [1]. It took me quite
| some time to track down the issue in my binary translator.
| (There's also TZCNT/LZCNT, which is BSF/BSR encoded with
| F3-prefix -- which is silently ignored on older processors not
| supporting the extension. So the same code will behave
| differently on different CPUs. At least, that's documented.)
|
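| (A C sketch of what a translator ends up doing for BSF if it
| wants to reproduce the unofficial "unchanged on zero" behavior;
| the function name is made up:)
|
|     #include <stdbool.h>
|     #include <stdint.h>
|
|     // BSF r64: returns the new destination value, sets ZF.
|     static uint64_t emu_bsf64(uint64_t dst_old, uint64_t src,
|                               bool *zf)
|     {
|         if (src == 0) {
|             *zf = true;
|             return dst_old;     // destination left unmodified
|         }
|         *zf = false;
|         return (uint64_t)__builtin_ctzll(src);  // lowest set bit
|     }
|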
| Encoding: People often complain about prefixes, but IMHO, that's
| by far not the worst thing. It is well known and somewhat well
| documented. There are worse quirks: For example, REX/VEX/EVEX.RXB
| extension bits are ignored when they do not apply (e.g., MMX
| registers); except for mask registers (k0-k7), where they trigger
| #UD -- also fine -- except if the register is encoded in
| ModRM.rm, in which case the extension bit is ignored again.
|
| APX takes the number of quirks to a different level: the REX2
| prefix can encode general-purpose registers r16-r31, but not
| xmm16-xmm31; the EVEX prefix has several opcode-dependent
| layouts; and the extension bits for a register used depend on the
| register type (XMM registers use X3:B3:rm and V4:X3:idx; GP
| registers use B4:B3:rm, X4:X3:idx). I can't give a complete list
| yet, I still haven't finished my APX decoder after a year...
|
| [1]: https://sourceware.org/bugzilla/show_bug.cgi?id=31748
| CoastalCoder wrote:
| Can you imagine having to make all this logic work faithfully,
| let alone _fast_, in silicon?
|
| X86 used to be Intel's moat, but what a nightmarish burden to
| carry.
| dx4100 wrote:
| Did people just... do this by hand (in software), transistor
| by transistor, or was it laid out programmatically in some
| sense? As in, were segments created algorithmically, then
| repeated to obtain the desired outcome? CPU design baffles
| me, especially considering there are 134 BILLION transistors
| or so in the latest i7 CPU. How does the team even keep track
| of, work on, or even load the files to WORK on the CPUs?
| tristor wrote:
| They use EDA (Electronic Design Automation) software, there
| are only a handful of vendors, the largest probably being
| Mentor Graphics, now owned by Siemens. So, yes, they use
| automation to algorithmically build and track/resolve
| refactors as they design CPUs. CPUs are /generally/ block-
| type designs these days, so particular functions get
| repeated identically in different places and can be
| somewhat abstracted away in your EDA.
|
| It's still enormously complex, and way more complex than
| the last time I touched this stuff more than 15 years ago.
| rikthevik wrote:
| I love that the EDA industry still uses Tcl heavily.
| Warms my heart.
| monocasa wrote:
| It's written in an HDL; IIRC both Intel and AMD use
| verilog. A modern core is on the order of a million or so
| lines of verilog.
|
| Some of that will be hand placed, quite a bit will just be
| thrown at the synthesizer. Other parts like SRAM blocks
| will have their cad generated directly from a macro and a
| description of the block in question.
| cogman10 wrote:
| To further expound on this: ASIC design (like AMD's CPUs) is a
| lot like software work. The engineers that create a lot of
| the digital logic aren't dealing with individual
| transistors, instead they are saying "give me an
| accumulator for this section of code" and the HDL
| provides it. The definition of that module exists
| elsewhere and is shared throughout the system.
|
| This is how the complexity can be wrangled.
|
| Now, MOST of the work is automated for digital logic.
| However, we live in an analog world. So, there is (As far
| as I'm aware) still quite a bit of work for analog
| engineers to bend the analog reality into digital. In the
| real world, changing current creates magnetic fields
| which means you need definitions limiting voltages and
| defining how close a signal line can be to avoid cross
| talk. Square waves are hard to come by, so there's effort
| in timing and voltage bands to make sure you aren't
| registering a "1" when it should have been a "0".
|
| Several of my professors were intel engineers. From what
| they told me, the ratios of employment were something
| like 100 digital engineers to 10 analog engineers to 1
| Physicist/materials engineer.
| bonzini wrote:
| It was entirely laid out by hand until the 286. Using
| standard cells in the 386 enabled the switch from microcode
| to a mostly hardwired core.
| moosedev wrote:
| You might enjoy Ken's latest article on some of this stuff,
| posted just the other day:
| https://news.ycombinator.com/item?id=40899393
| gumby wrote:
| A lot of this is done in software (microcode). But even in
| that case, your statement still holds: "Can you imagine
| having to make all this logic work faithfully, let alone
| fast, in the chip itself?" Writing that microcode must be
| fiendishly difficult given all the functional units, out of
| order execution, register renaming...
| bonzini wrote:
| The crazy parts that were mentioned in the parent comment
| are all part of the hot path. Microcode handles slow paths
| related to paging and segmentation, and very rare
| instructions. Not necessarily unimportant (many common
| privileged instructions are microcoded) but still rare
| compared to the usual ALU instructions.
|
| But it's not a huge deal to program the quirky encoding in
| an HDL, it's just a waste of transistors. The really
| complicated part is the sequencing of micro operations and
| how they enter the (out of order) execution unit.
| aengelke wrote:
| > A lot of this is done in software (microcode).
|
| No, that hasn't been the case for >30 years. Microcode is only
| used for implementing some complex instructions (mostly
| system instructions). Most regular instructions (and the
| rest of the core) don't use microcode and their expansions
| into uOps are hardwired. Also the entire execution unit is
| hardwired.
|
| There are typically some undocumented registers (MSRs on
| x86) that can control how the core behaves (e.g., kill
| switches for certain optimizations). These can then be
| changed by microcode updates.
| im3w1l wrote:
| It's a lot easier to more or less accidentally create
| something quirky than it is to create a second quirk-for-
| quirk compatible system.
| dzaima wrote:
| A fun thing is that e.g. "cmp ax, 0x4231" differs from "cmp
| eax, 0x87654321" only in the presence of the data16 prefix,
| and thus the longer immediate; and it's the only significant
| case (I think?) of a prefix changing the total instruction
| size, and thus, for _some_ such instructions, the 16-bit
| version, _sometimes_ (but not always!) is significantly
| slower. "but not always" as in, if you try to microbenchmark
| a loop of such, sometimes you can have entire microseconds of
| it consistently running at 0.25 cycles/instr avg, and
| sometimes that same exact code (in the same process!) will
| measure it at 3 cycles/instr (tested on Haswell, but
| uops.info indicates this happens on all non-atom Intel since
| Ivy Bridge).
| BeeOnRope wrote:
| Probably, if the uops come from the uop cache you get the
| fast speed since the prefix and any decoding stalls don't
| have any impact in that case (that mess is effectively
| erased in the uop cache), but if it needs to be decoded you
| get a stall due to the length changing prefix.
|
| Whether a bit of code comes from the uop cache is highly
| dependent on alignment, surrounding instructions, the
| specific microarchitecture, microcode version and even more
| esoteric things like how many incoming jumps target the
| nearby region of code (and in which order they were
| observed by the cache).
| dzaima wrote:
| Yep, a lot of potential contributors. Though, my test was
| of a single plain 8x unrolled loop doing nothing else,
| running for tens of thousands of iterations to take a
| total of ~0.1ms, i.e. should trivially cache, and yet
| there's consistent inconsistency.
|
| Did some 'perf stat'ting, comparing the same test with
| "cmp eax,1000" vs "cmp ax,1000"; per instruction,
| idq.mite_uops goes 0.04% vs. 35%, and lsd.uops goes 90% vs.
| 54%; so presumably sometimes somehow the loop makes it into
| the LSD, at which point dropping out of it is hard, while
| other times it perpetually gets stuck at MITE? (test is
| of 10 instrs - 8 copies of the cmp, and dec+jne that'd
| get fused, so 90% uops/instr makes sense)
| aengelke wrote:
| The behavior/penalty of 66/67 length-changing prefixes is
| documented in the Intel Optimization Reference Manual
| (3.4.2.3):
|
| > Assembly/Compiler Coding Rule 19. (MH impact, MH
| generality) Favor generating code using imm8 or imm32
| values instead of imm16 values.
| dzaima wrote:
| Some interesting quotes around there:
|
| > The following alignment situations can cause LCP stalls
| to trigger twice:
|
| > * An instruction is encoded with a MODR/M and SIB byte,
| and the fetch line boundary crossing is between the
| MODR/M and the SIB bytes.
|
| > * An instruction starts at offset 13 of a fetch line
| references a memory location using register and immediate
| byte offset addressing mode.
|
| So that's the order of funkiness to be expected, fun.
|
| > False LCP stalls occur when (a) instructions with LCP
| that are encoded using the F7 opcodes, and (b) are
| located at offset 14 of a fetch line. These instructions
| are: not, neg, div, idiv, mul, and imul. False LCP
| experiences delay because the instruction length decoder
| can not determine the length of the instruction before
| the next fetch line, which holds the exact opcode of the
| instruction in its MODR/M byte.
|
| The "true" LCP stall for the F7 opcodes would be "test
| r16,imm16", but due to the split opcode info across the
| initial byte & ModR/M, the other F7's suffer too.
| kimixa wrote:
| It seems that state management in a modern superscalar CPU is
| orders of magnitude more complex than even this.
| turndown wrote:
| Intel is coming out with an improved x86 instruction set that
| removes a lot of the cruft, called 'APX' for Advanced
| Performance Extensions.
| bonzini wrote:
| On and off over the last year I have been rewriting QEMU's x86
| decoder. It started as a necessary task to incorporate AVX
| support, but I am now at a point where only a handful of
| opcodes are left to rewrite, after which it should not be too
| hard to add APX support. For EVEX my plan is to keep the raw
| bits until after the opcode has been read (i.e. before
| immediates and possibly before modrm) and the EVEX class
| identified.
|
| My decoder is mostly based on the tables in the manual, and the
| code is mostly okay--not too much indentation and phases mostly
| easy to separate/identify. Because the output is JITted code,
| it's ok to not be super efficient and keep the code readable;
| it's not where most of the time is spent. Nevertheless there
| are several cases in which the manual is wrong or doesn't say
| the whole story. And the tables haven't been updated for
| several years (no K register instructions, for example), so
| going forward there will be more manual work to do. :(
|
| The top comment explains a bit what's going on:
| https://github.com/qemu/qemu/blob/59084feb256c617063e0dbe7e6...
|
| (As I said above, there are still a few instructions handled by
| the old code predating the rewrite, notably BT/BTS/BTR/BTC. I
| have written the code but not merged it yet).
| aengelke wrote:
| Thanks for the pointer to QEMU's decoder! I actually never
| looked at it before.
|
| So you coded all the tables manually in C -- interesting,
| that's quite some effort. I opted to autogenerate the tables
| (and keep them as data only => smaller memory footprint)
| [1,2]. That's doable, because x86 encodings are mostly fairly
| consistent. I can also generate an encoder from it (ok, you
| don't need that). Re 'custom size "xh"': AVX-512 also has
| fourth and eighth. Also interesting that you have a separate
| row for "66+F2". I special case these two (CRC32, MOVBE)
| instructions with a flag.
|
| I think the prefix decoding is not quite right for x86-64:
| 26/2e/36/3e are ignored in 64-bit mode, except for 2e/3e as
| branch-not-taken/taken hints and 3e as notrack. (See SDM Vol.
| 1 3.3.7.1 "Other segment override prefixes (CS, DS, ES, and
| SS) are ignored.") Also, REX prefixes that don't immediately
| precede the opcode (or VEX/EVEX prefix) are ignored. Anyhow,
| I need to take a closer look at the decoder with more time.
| :-)
|
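| (A C sketch of that rule for the legacy/REX prefix scan; the
| loop shape is illustrative, not QEMU's or fadec's code:)
|
|     #include <stdint.h>
|
|     // A REX byte only takes effect if it is the last prefix
|     // before the opcode; any later legacy prefix cancels it.
|     static const uint8_t *scan_prefixes(const uint8_t *p,
|                                         uint8_t *rex_out)
|     {
|         uint8_t rex = 0;
|         for (;;) {
|             uint8_t b = *p;
|             if ((b & 0xf0) == 0x40) {          // REX, 64-bit mode
|                 rex = b;                       // last one wins
|                 p++;
|             } else if (b == 0xf0 || b == 0xf2 || b == 0xf3 ||
|                        b == 0x2e || b == 0x3e || b == 0x26 ||
|                        b == 0x36 || b == 0x64 || b == 0x65 ||
|                        b == 0x66 || b == 0x67) {
|                 rex = 0;                       // cancels earlier REX
|                 p++;
|             } else {
|                 *rex_out = rex;
|                 return p;                      // opcode starts here
|             }
|         }
|     }
|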
| > For EVEX my plan is to keep the raw bits until after the
| opcode has been read
|
| I came to the same conclusion that this is necessary with
| APX. The map+prefix+opcode combination identifies how the
| other fields are to be interpreted. For AVX-512, storing the
| last byte was sufficient, but with APX, vvvv got a second
| meaning.
|
| > Nevertheless there are several cases in which the manual is
| wrong or doesn't say the whole story.
|
| Yes... especially for corner cases, getting real hardware is
| the only reliable way to find out how the CPU behaves.
|
| [1]: https://github.com/aengelke/fadec/blob/master/instrs.txt
| [2]: https://github.com/aengelke/fadec/blob/master/decode.c
| bonzini wrote:
| > interesting that you have a separate row for "66+F2"
|
| Yeah that's only for 0F38F0 to 0F38FF.
|
| > Re 'custom size "xh"': AVX-512 also has fourth and eighth
|
| Also AVX for VPMOVSX and VPMOVZX but those are handled
| differently. I probably should check if xh is actually
| redundant... _EDIT_: it's only needed for VCVTPS2PH, which
| is the only instruction with a half-sized _destination_.
|
| > I think the prefix decoding is not quite right for
| x86-64: 26/2e/36/3e are ignored in 64-bit mode
|
| Interesting, I need to check how they interact with the
| FS/GS prefixes (64/65).
|
| > REX prefixes that don't immediately precede the opcode
| (or VEX/EVEX prefix) are ignored
|
| Oh, didn't know that!
| aengelke wrote:
| > how they interact with the FS/GS prefixes (64/65)
|
| For memory operations, they are ignored: 64-2e-65-3e
| gives 65 as segment override. (From my memory and the
| resulting implementation, I did some tests with hardware
| a few years back.)
|
| I do need to check myself how 2e/3e on branches interact
| with other segment overrides, though.
| jart wrote:
| FWIW here's all 700 lines of Blink's x86 decoder.
| https://github.com/jart/blink/blob/master/blink/x86.c
| bonzini wrote:
| I don't want to be the person that has to add an
| instruction to blink...
| jcranmer wrote:
| It looks like someone started with Intel's XED code
| (which relies on custom tables to specify instructions,
| and compiles that to C tables at compile time) and hand-
| minimized the code into a single file. I'm guessing it's
| designed to never have any more code added to it.
| jart wrote:
| I minimized it by hand to make it more maintainable, to
| me at least.
| mananaysiempre wrote:
| The semantics of LZCNT combined with its encoding feels like an
| own goal: it's encoded as a BSR instruction with a legacy-
| ignored prefix, _but_ for nonzero inputs its return value is
| the operand size minus one, minus the return value of the
| legacy version.
| Yes, clz() is a function that exists, but the extra subtraction
| in its implementation feels like a small cost to pay for extra
| compatibility when LZCNT could've just been BSR with different
| zero-input semantics.
| bonzini wrote:
| Yes, it's like someone looked at TZCNT and thought "let's
| encode LZCNT the same way", but it makes no sense.
| adrian_b wrote:
| LZCNT and TZCNT are corrections (originally introduced by
| AMD) for the serious mistake made by the designers of the
| Intel 80386 when they defined BSF and BSR.
|
| Because on the very slow 80386 the wrong definition for the
| null input did not matter much, they failed to foresee
| how bad it would become for future pipelined and
| superscalar CPUs, where having to insert a test for null
| input can slow down a program many times.
|
| Nevertheless, they should have paid more attention to the
| earlier use of such instructions. For instance, the Cray-1 had
| defined LZCNT in the right way almost ten years earlier.
| bonzini wrote:
| LZCNT was introduced by Intel in BMI1, while TZCNT was
| introduced by AMD.
| BeeOnRope wrote:
| I'm not following: as long as you are introducing a new,
| incompatible instruction for leading zero counting, you'd
| definitely choose LZCNT over BSR as LZCNT has definitely won
| in retrospect over BSR as the primitive for this use case.
| BSR is just a historical anomaly which has a zero-input
| problem for no benefit.
|
| What would be the point of offering a new variation of BSR
| with different input semantics?
| mananaysiempre wrote:
| When it comes to TZCNT vs BSF, they are just compatible
| enough for a new compiler to use unconditionally (if we
| assume that BSF with a zero input leaves its output
| register unchanged, as it has for decades, and as
| documented by AMD who defined LZCNT): the instruction
| sequence
|
|     MOV   ECX, 32
|     TZCNT ECX, EAX    ; i.e. REP BSF ECX, EAX
|
| behaves identically on everything from the original 80386
| up and is better on superscalars with TZCNT support due to
| avoiding the false dependency on ECX. The reason for that
| is that BSF with a nonzero input and TZCNT with a nonzero
| input have exactly the same output. That's emphatically not
| true of BSR and LZCNT, so we're stuck relegating LZCNT to
| compiler flags.
| aengelke wrote:
| TZCNT and BSF are not completely identical even for non-
| zero input: BSF sets the ZF when the input is zero, TZCNT
| sets ZF when the output is zero (i.e., the least
| significant bit is set).
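|
| (Sketched in C for the 32-bit case; helper names are made up,
| and BSF's "unchanged" destination is simplified away:)
|
|     #include <stdbool.h>
|     #include <stdint.h>
|
|     // BSF: ZF reflects the *input* (set iff src == 0);
|     // the result is only meaningful for nonzero src.
|     static unsigned zf_bsf(uint32_t src, bool *zf)
|     {
|         *zf = (src == 0);
|         return src ? (unsigned)__builtin_ctz(src) : 0;
|     }
|
|     // TZCNT: ZF reflects the *output* (set iff the result is 0,
|     // i.e. bit 0 of src was set); zero input yields 32.
|     static unsigned zf_tzcnt(uint32_t src, bool *zf)
|     {
|         unsigned r = src ? (unsigned)__builtin_ctz(src) : 32;
|         *zf = (r == 0);
|         return r;
|     }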
| torusle wrote:
| Another bonus quirk, from the 486 and Pentium era...
|
| BSWAP EAX converts from little endian to big endian and vice
| versa. It was a 32 bit instruction to begin with.
|
| However, we have the 0x66 prefix that switches between 16 and
| 32-bit operand size. If you apply that to BSWAP EAX, undefined
| funky things happen.
|
| On some CPU architectures (Intel vs. AMD) the prefix was just
| ignored. On others it did something that I call an "inner
| swap". E.g. in your four bytes that are stored in EAX byte 1
| and 2 are swapped. 0x11223344 became
| 0x11332244.
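|
| (For reference, the documented 32-bit behavior in C; the
| 66-prefixed cases above are exactly what this does not cover:)
|
|     #include <stdint.h>
|
|     // BSWAP r32: reverse the four bytes of the register.
|     static uint32_t emu_bswap32(uint32_t v)
|     {
|         return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
|                ((v << 8) & 0x00ff0000u) | (v << 24);
|     }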
| userbinator wrote:
| Also known as "bswap ax", and research shows that it does
| something surprising but consistent on almost all hardware:
| It zeros the register.
|
| https://www.ragestorm.net/blogs/?p=141
|
| https://gynvael.coldwind.pl/?id=268
|
| However, this page, now gone, suggests that some CPUs (early
| 486s?) did something different: http://web.archive.org/web/20
| 071231192014/http://www.df.lth....
|
| Unfortunately I have not found any evidence nor reason to
| believe that this "inner swap" behaviour you mention exists
| in some CPU --- except perhaps some flawed emulators?
| __turbobrew__ wrote:
| I know nothing about this space, but it would be interesting to
| hook up a JTAG interface to an x86 CPU and then step instruction
| by instruction and record all the register values.
|
| You could then use this data to test whether or not your
| emulator perfectly emulates the hardware by running the same
| program through the emulated CPU and validating that the state
| is the same at every instruction.
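|
| (A C sketch of that lockstep check; every type and function
| here is hypothetical:)
|
|     #include <stdbool.h>
|     #include <stddef.h>
|     #include <stdint.h>
|     #include <string.h>
|
|     typedef struct { uint64_t regs[16]; uint64_t rip; } CpuState;
|
|     struct Emu;                              // hypothetical
|     void emu_step(struct Emu *);             // hypothetical API
|     void emu_get_state(struct Emu *, CpuState *);
|
|     // Step the emulator alongside a recorded hardware trace,
|     // reporting the first instruction where the states diverge.
|     static bool verify_against_trace(struct Emu *emu,
|                                      const CpuState *trace,
|                                      size_t n)
|     {
|         for (size_t i = 0; i < n; i++) {
|             emu_step(emu);                   // one instruction
|             CpuState s;
|             emu_get_state(emu, &s);
|             if (memcmp(&s, &trace[i], sizeof s) != 0)
|                 return false;                // divergence at i
|         }
|         return true;
|     }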
| was_a_dev wrote:
| Off topic, but I like this blog style/layout. I can imagine it
| isn't everyone's taste, but it just works for me.
| timmisiak wrote:
| Glad you like it. I used m10c, with a few tweaks:
| https://github.com/vaga/hugo-theme-m10c
| waynecochran wrote:
| Intel architecture is loaded with historical artifacts. The
| switch in how segment registers were used as you went from real
| mode to protected mode was an incredible hardware hack to keep
| older software working. I blame Intel for why so many folks avoid
| assembly language. I programmed in assembly for years using TI's
| 84010 graphics chips and the design was gorgeous -- simple RISC
| instruction set, flat address space, and bit addressable! If
| during the earlier decades folks were programming using chips
| with more elegant designs, far more folks would be programming in
| assembly language (or at least would know how to).
| hajile wrote:
| > I blame Intel for why so many folks avoid assembly language.
|
| x86 (the worst assembly of any of the top 50 most popular ISAs
| by a massive margin) and tricky MIPS branch delay slots trivia
| questions at university have done more to turn off programmers
| from learning assembly than anything else and it's not even
| close.
|
| This is one reason I'm hoping that RISC-V kills off x86. It
| actually has a chance of once again allowing your average
| programmer to learn useful assembly.
| LegionMammal978 wrote:
| What do you find particularly problematic about x86 assembly,
| from a pedagogical standpoint? I've never noticed any glaring
| issues with it, except for the weird suffixes and sigils if
| you use AT&T syntax (which I generally avoid).
| timmisiak wrote:
| I suspect the biggest issue is that courses like to talk
| about how instructions are encoded, and that can be
| difficult with x86 considering how complex the encoding
| scheme is. Personally, I don't think x86 is all that bad as
| long as you look at a small useful subset of instructions
| and ignore legacy and encoding.
| LegionMammal978 wrote:
| True, encoding is one thing that really sets x86 apart.
| But as you say, the assembly itself doesn't seem that
| uniquely horrible (at least not since the 32-bit era),
| which is why I found the sentiment confusing as it was
| phrased.
|
| Maybe it's the haphazard SIMD instruction set, with every
| extension adding various subtly-different ways to permute
| bytes and whatnot? But that would hardly seem like a
| beginner's issue. The integer multiplication and division
| instructions can also be a bit wonky to use, but hardly
| unbearably so.
| hajile wrote:
| An ordinary developer cannot write performant x86 without
| a massive optimizing compiler.
|
| Actual instruction encoding is horrible. If you're
| arguing that you can write a high-level assembly over the
| top, then you aren't so much writing assembly as you are
| writing something in between.
|
| When you need to start caring about the actual assembly
| (padding a cache line, avoiding instructions with too
| many uops, or choosing between using APX 32 registers and
| more normal shorter instructions, etc) rather than some
| high level abstraction, the experience is worse than any
| other popular ISA.
| LegionMammal978 wrote:
| > Actual instruction encoding is horrible. If you're
| arguing that you can write a high-level assembly over the
| top, then you aren't so much writing assembly as you are
| writing something in between.
|
| One instruction in x86 assembly is one instruction in the
| machine code, and one instruction as recognized by the
| processor. And except for legacy instructions that we
| shouldn't teach people to use, each of these is not much
| higher-level than an instruction in any other assembly
| language. So I still don't see what the issue is, apart
| from "the byte encoding is wacky".
|
| (There are mops beneath it of course, but these are
| reordered and placed into execution units in a very
| implementation-dependent manner that can't easily be
| optimized until runtime. Recall how VLIW failed at
| exposing this to the programmer/compiler.)
|
| > padding a cache line, avoiding instructions with too
| many uops
|
| Any realistic ARM or RISC-V processor these days also
| supports out-of-order execution with an instruction
| cache. The Cortex processors even have mops to support
| this! The classic 5-stage pipeline is obsolete outside
| the classroom. So if you're aiming for maximum
| performance, I don't see how these are concerns that
| arise far less in other assembly languages. E.g., you'll
| always have to be worried about register dependencies,
| execution units, optimal loop unrolling, etc. It's not
| like a typical program will be blocked on the mop cache
| in any case.
|
| > APX 32 registers and more normal shorter instructions
|
| APX doesn't exist yet, and I'd wager there's a good
| chance it will never reach consumer CPUs.
| hajile wrote:
| Which "one instruction" is the right one? You can have
| the exact same instruction represented many different
| ways in x86. Some are shorter and some are longer (and
| some are different, but the same length). When you say to
| add two numbers, which instruction variant is correct?
|
| This isn't a straightforward answer. As I alluded to in
| another statement you quoted, padding cache lines is a
| fascinating example. Functions should ideally start at
| the beginning of a cache line. This means the preceding
| cache line needs to be filled with something. NOP seems
| like the perfect solution, but it adds unnecessary
| instructions in the uop cache which slow things down.
| Instead, the compiler goes to the code on the previous
| cache line and expands instructions to their largest size
| to fill the space because adding a few bytes of useless
| prefix is allowed and doesn't generate NOPs in the uop
| cache.
|
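| (Concretely, two valid encodings of the same instruction, shown
| as C byte arrays; per the SDM, 83 /0 ib and 05 id are both
| "add eax, imm":)
|
|     #include <stdint.h>
|
|     // add eax, 1 -- imm8 form, 3 bytes
|     static const uint8_t add_short[] = { 0x83, 0xC0, 0x01 };
|
|     // add eax, 1 -- imm32 form, 5 bytes
|     static const uint8_t add_long[] =
|         { 0x05, 0x01, 0x00, 0x00, 0x00 };
|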
| Your uop statements aren't what I was talking about.
| Almost all RISCV instructions result in just one uop. On
| fast machines, they may even generate less than one uop
| if they get fused together.
|
| x86 has a different situation. If your instruction
| generates too many uops (or is too esoteric), it will
| skip the fast hardware decoders and be sent to a
| microcode decoder. There's a massive penalty for doing
| this that slows performance to a crawl. To my knowledge,
| no such instructions exist in any modern ISA.
|
| Intel only has one high performance core. When they
| introduce APX, it'll be on everything. There's good
| reason to believe this will be introduced. It adds lots
| of good features and offers increased code density in
| quite a few situations which is something we haven't seen
| in a meaningful way since AMD64.
| akira2501 wrote:
| I think that's putting the cart before the horse. I think it
| wouldn't matter which architecture you choose as there will
| always be deep performance considerations that must be
| understood in order to write efficient software.
|
| Otherwise your statement might amount to "I hope there
| is an ISA that intentionally wastes performance and energy in
| deference to human standards of beauty."
|
| It's why the annals of expertise rarely make for good dinner
| table conversation.
| hajile wrote:
| x86 has a parity flag. It only takes the parity of the
| lowest 8 bits though. Why is it there? Because it was in
| the 8086 because it was in the 8080 because it was in the
| 8008 because Intel was trying to win a contract for the
| Datapoint 2200.
|
| Sometimes the short instruction variant is correct, but not
| if it makes a single instruction break down into many uops
| as the microcode is 1000x slower.
|
| Oh, but you need to use those longer variants without extra
| uops to align functions to cache boundaries because they
| perform better than NOP.
|
| Floats and SIMD are a mess with x87, AMX (with incompatible
| variants), SSE1-4 (with incompatible variants), AVX, AVX2,
| and AVX512 (with incompatible variants) among others.
|
| The segment-and-offset way of dealing with memory is
| painful.
|
| What about the weird rules about which registers are
| reserved for multiply and divide? Half the "general
| purpose" registers are actually locked in at the ISA level.
|
| Now APX is coming up and you get to choose between shorter
| instructions with 16 registers and two-operand syntax, or
| longer instructions with 32 registers and three-operand
| forms.
|
| And this just scratches the surface.
|
| RISC-V is better in every way. The instruction density is
| significantly higher. The instructions are simpler and
| easier to understand while being just as powerful.
| Optimizing compilers are easier to write because there's
| generally just one way to do things, and it's guaranteed to
| be optimized.
| bheadmaster wrote:
| If Terry A. Davis is to be trusted, as long as you ignore the
| legacy stuff, x64 assembly is nice to work with.
| 256_ wrote:
| Where did he say this? In any case, I can confirm it from
| personal experience. x86 isn't that annoying if you ignore
| most of it. The worst thing is the edge cases with how some
| instructions can only use some registers, like
| shifting/rotating using cl.
| bheadmaster wrote:
| There was a video of him talking about it, but I can't
| recall the exact words or the name of the video.
|
| But his operating system, TempleOS, was based on HolyC
| which was compiled to pure x64, and had support for inline
| x64 assembly.
| russdill wrote:
| What's crazy is that depending on how deep you want to go, a
| lot of the information is not available in documents published
| by Intel. Fortunately, if it matters for emulators, it typically
| can be (or has been) reverse engineered.
| Max-q wrote:
| Wow, is that true? And the documentation is thousands of
| pages! I can't understand how Intel keeps these processors
| consistent during development. It must be a really, really
| hard job.
|
| It was a nice period in history when proper RISC was working
| well, when we could get stuff to run faster, but getting more
| transistors was expensive. Now we can't get it to run faster
| but we can get billions of transistors, so making stuff more
| complicated is the way to more performance.
|
| I wonder if we ever again will get a time where
| simplification is a valid option...
| russdill wrote:
| There's some info here:
|
| http://www.rcollins.org/secrets/
| http://www.rcollins.org/ddj/ddj.html
|
| Notably things like undocumented opcodes, processor modes,
| undocumented registers, undocumented bits within registers,
| etc. It's not uncommon to release a specific set of
| documentation to trusted partners only. Back in the day
| intel called these "gold books".
| jecel wrote:
| Wouldn't that be the 34010?
| waynecochran wrote:
| Yes. TMS 34010 and 34020 ... my bad.
| trollied wrote:
| > Writing a CPU emulator is, in my opinion, the best way to
| REALLY understand how a CPU works
|
| Hard disagree.
|
| The best way is to create a CPU from gate level, like you do on a
| decent CS course. (I really enjoyed making a cut down ARM from
| scratch)
| whobre wrote:
| Reading Petzold's "Code" comes pretty close, though, and is
| easier.
| commandlinefan wrote:
| OTOH, are you really going to be implementing memory segmenting
| in your gate-level CPU? I'd say actually creating a working CPU
| and _then_ emulating a real CPU (warts and all) are both
| necessary steps to real understanding.
| trollied wrote:
| I agree.
| monocasa wrote:
| > OTOH, are you really going to be implementing memory
| segmenting in your gate-level CPU?
|
| I have, but it was a PDP-8 which I'll be the first to admit
| is kind of cheating.
| mrspuratic wrote:
| This. I mean, why not start with wave theory and material
| science if you really want a good understanding :)
|
| In my CS course I learned a hell of a lot from creating a
| 6800 emulator; though it wasn't on the course, building a
| working 6800 system was. The development involved running an
| assembler on a commercial *nix system and then typing the hex
| object code into an EPROM programmer. You get a lot of time
| to think about where your bugs are when you have to wait for
| a UV erase cycle...
| commandlinefan wrote:
| > why not start with wave theory and material science
|
| You jest, but if I had infinite time...
| quantified wrote:
| Well, I think you're both right. It's satisfying as heck to
| sling 74xx chips together and you get a feel for the electrical
| side of things and internal tradeoffs.
|
| When you get to doing that for the CPU that you want to do
| meaningful work with, you start to lose interest in that
| detail. Then the complexities of the behavior and spec become
| interesting and the emulator approach is more tractable and can
| cover more types of behavior.
| IshKebab wrote:
| I think trollied is correct actually. I work on a CPU
| emulator professionally and while it gives you a great
| understanding of the spec there are lots of details about
| _why_ the spec is the way it is that are due to how you
| actually implement the microarchitecture. You only learn that
| stuff by actually implementing a microarchitecture.
|
| Emulators tend not to have many features that you find in
| real chips, e.g. caches, speculative execution, out-of-order
| execution, branch predictors, pipelining, etc.
|
| This isn't "the electrical side of things". When he said
| "gate level" he meant RTL (SystemVerilog/VHDL) which is
| pretty much entirely in the digital domain; you very rarely
| need to worry about actual electricity.
| trollied wrote:
| I write retro console emulators for fun, so agree with you
| 100% :)
| timmisiak wrote:
| I think both are useful, but designing a modern CPU from the
| gate level is out of reach for most folks, and I think there's
| a big gap between the sorts of CPUs we designed in college and
| the sort that run real code. I think creating an emulator of a
| modern CPU is a somewhat more accessible challenge, while still
| being very educational even if you only get something partially
| working.
| banish-m4 wrote:
| This is an illusion and a red herring. RTL synthesis is the
| typical functional prototype stage reached, which is generally
| sufficient for FPGA work. To burn an ASIC as part of an
| educational consortium run is doable, but it's uncommon.
| WalterBright wrote:
| When I was at Caltech, another student in the dorm had been
| admitted because he'd designed and implemented a CPU using
| only 7400 TTL.
|
| Woz wasn't the only supersmart young computer guy at the time
| :-)
|
| (I don't know how capable it was, even a 4 bit CPU would be
| quite a challenge with TTL.)
| nsguy wrote:
| I think the key word above was modern. I felt able to
| design a simple CPU when I finished my Computer
| Architecture course in university. I think I forgot most of
| it by now ;) There are a few basic concepts to wrap your
| head around but once you have them a simple CPU is doable.
| Doing this with TTL or other off the shelf components is
| mostly minimizing/adapting/optimizing to those components
| (or using a lot of chips ;) ). I have never looked at
| discrete component CPU designs; I imagine ROM and RAM chips
| play a dominant part (e.g. you don't just build RAM with
| 74x TTL flip-flops).
| WalterBright wrote:
| He probably used off-the-shelf RAM chips; after all, RAM
| is not part of the CPU.
|
| In the early 70s, before the internet, even finding the
| information needed would be a fair amount of work.
|
| I learned how flip flops worked, adders, and registers in
| college, and that could be extended to an ALU. But still,
| that was in college, not high school.
|
| I've read some books on computer history, and they are
| frustratingly vague about how the machines actually
| worked. I suspect the authors didn't actually know. Sort
| of like the books on the history of Apple that gush
| over Woz's floppy disk interface, but no details.
| nsguy wrote:
| Was doing some Googling and came across:
| https://en.wikipedia.org/wiki/Breakout_(video_game)
|
| I never heard this story...
| akira2501 wrote:
| > and the sort that run real code
|
| And the sort that are commercially viable in today's
| marketplace. The nature of the code has nothing to do with
| it. The types of machines we play around with today surpass
| the machines we used to land men on the moon. What's not
| "real code" about that?
| brailsafe wrote:
| So far on my journey through Nand2Tetris (since I kind of
| dropped out of my real CS course) I've followed the entire
| process of working my way up from the gate level, and just
| finished the VM emulator chapter, which took an eternity. Now
| onto compilation.
| banish-m4 wrote:
| Seconded. A microcoded, pipelined, superscalar, branch-
| predicting basic processor with L1 data & instruction caches
| and write-back L2 cache controller is nontrivial. Most software
| engineers have an incomplete grasp of data hazards, cache
| invalidation, or pipeline stalls.
| nsguy wrote:
| IIRC reading some Intel CPU design history some of their
| designers are from a CS/software background. But I agree.
| Software is naturally very sequential which is different than
| digital hardware which is naturally/inherently parallel. A
| clock can change the state of a million flip-flops all at
| once, it's a very different way of thinking about computation
| (though of course at the theoretical level all the same) and
| then there's the physics and EE parts of a real world CPU.
| Writing software and designing CPUs are just very different
| disciplines and the CPU as it appears to the software
| developer isn't how it appears to the CPU designer.
| banish-m4 wrote:
| I'm not talking about Intel's engineers, I'm saying very
| few software engineers today understand how a processor
| works. I'm sure Intel hires all sorts of engineers for
| different aspects of each ecosystem they maintain.
| Furthermore, very few software engineers have ever even
| touched a physical server because they're sequestered away
| from a significant fraction of the total stack.
|
| Speculative and out-of-order execution requires
| synchronization to maintain dataflow and temporal
| consistency.
|
| Computer Organization is a good book should anyone want to
| dive deeper.
| snvzz wrote:
| CPU was a poor choice of words. ISA would have worked.
| fjfaase wrote:
| Interesting read. I have a lot of respect for people who develop
| emulators for x86 processors. It is a complicated processor and
| from first-hand experience I know that developing and debugging
| emulators for CPUs can be very challenging. In the past year, I
| spent some time developing a very limited i386 emulator [1]
| including some system calls for executing the first steps of
| live-bootstrap [2], primarily to figure out how it works. I
| learned a lot about system calls and ELF.
|
| [1] https://github.com/FransFaase/Emulator/
|
| [2] https://github.com/fosslinux/live-bootstrap/
| banish-m4 wrote:
| Most of the complexities lie in managing the various
| configurations of total system compatibility emulation,
| especially for timing, analog oddities, and whether to include
| bugs or not and for which steppings. If you want precise and
| accurate emulation, you have to have real hardware to validate
| behavior against. Then there are the cases of what not to
| emulate and offering better-than-original alternatives.
| Quekid5 wrote:
| Just as an adjacent aside from a random about learning by doing:
|
| Implementing a ThingDoer is a huge learning experience. I
| remember doing co-op "write-a-compiler" coursework with another
| person. We were doing great, everything was working and then we
| got to the oral exam...
|
| "Why is your Stack Pointer growing upwards"?
|
| ... I was kinda stunned. I'd never thought about that. We
| understood most of the things, but sometimes we kind of just
| bashed at things until they worked... and it turned out upward-
| growing SP did work (up to a point) on the architecture our toy
| compiler was targeting.
| boricj wrote:
| It's funny to me how much grief x86 assembly generates when
| compared to RISC here, because I have the opposite problem when
| delinking code back into object files.
|
| For this use-case, x86 is really easy to analyze whereas MIPS has
| been a nightmare to pull off. This is because all I mostly care
| about are references to code and data. x86 has pointer-sized
| immediate constants and MIPS has split HI16/LO16 relocation
| pairs, which leads to all sorts of trouble with register usage
| graphs, code flow and branch delay instructions.
|
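| (The arithmetic behind those pairs, sketched in C: the %lo half
| is sign-extended by the consuming instruction, so the %hi half
| must compensate, which is what makes matching them back up
| painful:)
|
|     #include <stdint.h>
|
|     // Split a 32-bit address into a lui/addiu pair so that
|     // addr == (hi << 16) + (int32_t)(int16_t)lo.
|     static void split_hi_lo(uint32_t addr, uint16_t *hi,
|                             uint16_t *lo)
|     {
|         *lo = (uint16_t)(addr & 0xffffu);
|         *hi = (uint16_t)((addr + 0x8000u) >> 16);  // carry fixup
|     }
|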
| That should not be construed as praise on my end for x86.
| jmspring wrote:
| Apparently my memory is false: I thought the salsa20 variants
| and machine code were originally on cryp.to, but Dan
| Bernstein's site is https://cr.yp.to/
|
| While at a startup when we were looking at data at rest
| encryption, streaming encryption and other such things. Dan had a
| page with different implementations (cross compiled from his
| assembler representation) to target chipsets and instruction
| sets. Using VMs (this was the early/mid 2000s) and such, it was
| interesting to see which of those instruction sets were supported.
| In testing, there would be occasional hiccups where an
| implementation wasn't fully supported though the VM claimed such.
| djaouen wrote:
| "I don't believe in emulatores." - 0x86
| t_sea wrote:
| > Writing a CPU emulator is, in my opinion, the best way to
| REALLY understand how a CPU works.
|
| The 68k disassembler we wrote in college was such a Neo "I know
| kung fu" moment for me. It was the missing link that let me
| reason about code from high-level language down to transistors
| and back. I can only imagine writing a full emulator is an order
| of magnitude more effective. Great article!
| lifthrasiir wrote:
| I recently implemented a good portion of an x86(-64) decoder for
| a side project [1] and was kinda surprised how it has gotten even
| more complicated in recent years. Sandpile.org [2] was really useful
| for my purpose.
|
| [1] Namely, a version of Fabian Giesen's disfilter for x86-64,
| for yet another side project which is still not in public:
| https://gist.github.com/lifthrasiir/df47509caac2f065032ef72e...
|
| [2] https://sandpile.org/
| ale42 wrote:
| Shouldn't it be (2023) rather than (2013)?
| Sparkyte wrote:
| The footnotes are glorious. "He was convinced that using a shift
| would work and didn't believe me when I said it wasn't possible."
___________________________________________________________________
(page generated 2024-07-11 23:02 UTC)