[HN Gopher] The microcode and hardware in the 8086 processor tha...
       ___________________________________________________________________
        
       The microcode and hardware in the 8086 processor that perform
       string operations
        
       Author : picture
       Score  : 111 points
       Date   : 2023-04-04 16:58 UTC (6 hours ago)
        
 (HTM) web link (www.righto.com)
 (TXT) w3m dump (www.righto.com)
        
       | ataylor284_ wrote:
       | These 8086 instructions were pretty cool coming from a 6502
       | background. The usual way to copy a block of memory in 6502 was
        | something like this:
        | 
        |               ldy #len
        |   loop:       lda source-1,y   ; memory accesses: loads opcode, address, value
        |               sta dest-1,y     ; memory accesses: loads opcode, address, stores value
        |               dey              ; memory accesses: loads opcode
        |               bne loop         ; memory accesses: loads opcode, offset
       | 
       | So 4 instructions and a minimum of 9 memory accesses per byte
       | copied, more if not in the zero page. Even unrolling the loop
       | gets you down to 6.
       | 
        | Compare this to the 8086:
        | 
        |               mov cx, len
        |               mov si, source
        |               mov di, dest
        |               rep movsb        ; memory accesses: read the opcodes,
        |                                ; then 1 load and 1 store per byte
       | 
       | Even forgetting about word moves, you're down to 2 memory
       | accesses per byte. No instruction opcode or branching overhead.
        
       | mikepurvis wrote:
       | It still boggles my mind that processors in the 1970s already had
       | microcode-- obviously it was vastly simpler and not doing the
       | crazy branch prediction, speculative execution, and operation
       | reordering that modern processors do, but it's still amazing to
       | me that this architecture was there at all.
        
         | kens wrote:
         | Historically, microcode was invented around 1951 and was
         | heavily used in most of the IBM System/360 computers (1964).
         | Once your instruction set is even moderately complicated, using
         | microcode is a lot easier than hard-wired control circuitry.
         | 
         | You can view microcode as one of the many technologies that
         | started in mainframes, moved into minicomputers, and then moved
         | into microprocessors. In a sense, the idea behind RISC was the
         | recognition that this natural progression was not necessarily
         | the best way to build microprocessors.
        
         | msla wrote:
         | There's something more fundamental going on here:
         | 
         | The concept of the ISA as a standard implemented by hardware,
         | as opposed to the ISA being a document of what a specific
         | hardware design does, goes back to the IBM System/360 family of
         | computers from 1964. That was the first range of computers,
         | from large to (let's be honest) somewhat less large, that could
         | all run most of the same software, modulo some specialized
         | instructions and peripheral hardware like disk drives not
         | present on the cheapest systems, at different levels of
         | performance. As others have said, microcode is plenty old, but
         | the idea of architectural families is one of the things which
         | really gets us to where we are now.
        
         | pwg wrote:
         | Microcode (the concept) is very much older than the 1970's. Per
         | Wikipedia ([1]) the earliest system with something like
         | 'microcode' dates back to 1947.
         | 
          | What is 'special' about the 8086 is that it was a low-cost
          | microprocessor using microcode, as opposed to a multi-million-
          | dollar mainframe CPU.
         | 
         | [1] https://en.wikipedia.org/wiki/Microcode#History
         | 
         | edit: fix typo
        
           | plesner wrote:
           | Babbage's analytical engine used a form of microcode ~100
           | years earlier. Complex operations such as multiplication and
           | division were implemented by lower-level instructions encoded
           | as a series of pegs on a rotating barrel. Each micro-
           | instruction ("vertical" in Babbage's terminology) could take
           | several seconds to execute so a complete multiplication or
           | division would take minutes.
        
           | kens wrote:
           | You mean the 8086, not the 8-bit 8080, right? The 8080 wasn't
           | microcoded.
        
             | pwg wrote:
             | Indeed, that was a typo. Fixed now. Thanks.
        
         | morcheeba wrote:
         | I love that the TMS34082 math coprocessor had user-definable
         | microcode! This was intended to work with 3D graphics, so you
         | could really get the most out of the ALU by chaining operations
         | for your specific need (e.g. divide-and-sqrt single
         | instruction)
         | 
         | See chapter 8:
         | http://www.bitsavers.org/components/ti/TMS340xx/TMS34082_Des...
        
         | [deleted]
        
       | jecel wrote:
       | The 286 added "in string" and "out string" instructions. It
       | turned out that if peripherals had a small internal buffer it was
       | more efficient to use these new instructions instead of using DMA
       | hardware. This is why hard disk controllers for the original PC
       | and XT used DMA but from the AT on they normally didn't.
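        | 
        | For concreteness, a minimal sketch of such a programmed-I/O
        | transfer (NASM-style; 0x1F0 is the AT hard disk controller's data
        | port, and the buffer label is hypothetical):
        | 
        |               mov dx, 0x1F0     ; AT disk controller data port
        |               mov di, buffer    ; ES:DI -> destination buffer
        |               mov cx, 256       ; 256 words = one 512-byte sector
        |               cld               ; direction flag: count upward
        |               rep insw          ; read CX words from port DX to ES:[DI]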
        
         | rep_lodsb wrote:
          | It was the 186 which added those. It wasn't widely used in PCs
         | because it included some built-in peripherals that weren't
         | compatible with the "standard" ones used by IBM in their
         | original PC.
         | 
         | And that standard PC's DMA controller was originally intended
         | for use with the 8-bit 8080 (eighty-eighty, not eighty-eight)
          | processor. It was slow and only supported 16-bit addresses. IBM
         | added a latch that would store the high address bits, but they
         | weren't incremented when the lower bits wrapped around. So
         | buffers had to be allocated in a way that they didn't cross
         | those boundaries, and the BIOS disk functions would return an
         | error if your code didn't do this properly.
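          | 
          | A minimal sketch of the check that implies (hypothetical labels;
          | a transfer of LEN bytes starting at DS:BX must stay within one
          | 64K physical page, since the latch won't increment):
          | 
          |               mov ax, ds
          |               mov cl, 4
          |               shl ax, cl        ; low 16 bits of DS*16 (the 8086
          |                                 ; has no shift-by-immediate)
          |               add ax, bx        ; low 16 bits of the 20-bit address
          |               mov dx, [len]
          |               dec dx
          |               add ax, dx        ; address of the transfer's last byte
          |               jc  crosses_page  ; carry => buffer must be moved or split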
        
           | adrian_b wrote:
            | The 80186 and 80286 were launched simultaneously in February
            | 1982, so you cannot say that one of them introduced INS and
            | OUTS before the other.
            | 
            | Nevertheless, due to their high cost at the time, neither of
            | these two CPUs saw much use until two and a half years later,
            | when IBM launched the IBM PC/AT in August 1984.
            | 
            | Most programmers first encountered INS and OUTS while using
            | PC/AT clones with the 80286. Embedded computers with the 80186
            | became common only some years later, when the price of the
            | 80186 dropped a lot.
        
           | anyfoo wrote:
           | I grew up with a "Siemens PC-D", 80186 and not fully IBM
           | compatible for the reasons you said. However, it seems that
           | the folks at Siemens were very enthusiastic about it, and
           | ported a lot of software that accessed the hardware more
           | directly, including most Microsoft software (Windows, Word,
            | the C compiler...) and GEM. The BIOS was actually on disk, so
            | updatable, and the BIOS updates came with a README file from
            | the PC-D team that again gave me the impression that at least
            | some of them really liked working on it.
           | 
           | A marvelous machine. High resolution (if monochrome),
           | Hercules-like graphics, for example.
           | 
           | But the kicker was that it actually had an MMU[1], built from
           | discrete logic chips, in order to run SINIX (a Xenix 8086
            | derivative, but with added MMU support among other things) and
            | enjoy full memory protection. I reverse engineered that MMU;
            | it's quite fun. Paging wasn't possible, though.
           | 
           | [1] Early on as a variant called "PC-X", but they apparently
           | did away with the early board revisions, and all machines
           | I've seen say "PC-D/PC-X" on the back, so they all have the
           | MMU and can be both.
        
       | kens wrote:
       | Author here for all your 8086 questions :-)
        
         | myself248 wrote:
         | It's always boggled my mind that they'd have 32 bits of
         | addressing registers, but overlap 12 bits, leaving just 20
         | useful bits. What a waste. I guess the segment:offset scheme
         | had some advantages, but honestly, I've never felt like I had a
         | good understanding of them.
         | 
         | But, what if the overlap had been just 10 bits, or just 8,
         | leaving a much larger functional address range before we got
         | clobbered by the 20-bit limit; what if it was 22 or 24 bits of
         | useful range? Can you speculate what effect such a decision
         | would've had, and why it wasn't taken at the time? I understand
         | in 1978 even 20 bits was a lot, but was that optimal?
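          | 
          | (For reference, the scheme works out as physical = segment*16 +
          | offset, so many segment:offset pairs alias the same byte. A tiny
          | illustration with arbitrary example values:)
          | 
          |               mov ax, 0x1234
          |               mov ds, ax
          |               mov bx, 0x5678
          |               mov al, [bx]      ; reads 0x12340 + 0x5678 = 0x179B8
          | 
          |               mov ax, 0x1700
          |               mov ds, ax
          |               mov bx, 0x09B8
          |               mov ah, [bx]      ; reads 0x17000 + 0x09B8 = 0x179B8,
          |                                 ; the very same physical byte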
        
           | pwg wrote:
           | The 20 bits of address was likely a direct result of Intel's
           | decision to package the 8086 in a 40-pin package.
           | 
           | They already multiplex the 16-bit data bus on top of 16 of
           | the 20 pins of the address bus. But with only 40 pins, that
           | 20-bit address bus was already using half of the total chip
           | pinout.
           | 
            | And at the time the 8086 was being designed (circa 1976-1978),
            | with other microprocessors of the era having 16-bit or smaller
            | address buses, the jump to 1M of possible total address space
            | was likely seen as huge. We look back
           | now, comfortable with 12+GB of RAM as a common size, and see
           | 1M as small. But when your common size is 16-64k, having 1M
           | as a possibility would likely seem huge.
        
             | myself248 wrote:
             | Packaging makes a lot of sense. I've handled a 68000 in
             | DIP64 and it's just comically huge, and trying to fit the
             | cursed thing into a socket quickly explains why DIPs larger
             | than 40 are ultra rare.
             | 
             | I'm sure there must be architectures that use a multiplexed
             | low/high address bus, like a latch signal that says "the 16
             | bits on the address bus right now are the segment", then a
             | moment later "okay here's the offset", and leave it to the
             | decoding circuitry on the motherboard, to determine how
             | much to overlap them, or not at all. Doing it this way, you
             | could scale the same chip from 64kB to 4GB, and the
             | decision would be part of the system architecture, rather
             | than the processor. (Could even have mode bits, like the
             | infamous A20 Gate, that would vary the offset and thus the
             | addressable space...)
             | 
             | But, yeah, it was surely seen as unnecessary at the time.
             | Nobody was expecting the x86 to spawn forty-plus years of
             | descendants, and even though Moore's Law was over a decade
             | old at the time, it seems like nobody was wrapping their
             | head around its full implications.
        
             | getpost wrote:
             | Hahaha, somehow I convinced myself that the rationale for
             | the segment register scheme was a long term plan to
             | accommodate virtual memory and memory protection. The idea
             | being that you would only have to validate constraints on
             | access when a segment register was loaded, rather than
             | every single memory access.
        
               | pwg wrote:
               | If you read through the PDF Ken links to in another note
               | here, you find this quote starting the "Objectives and
               | Constraints of the 8086" section:
               | 
               | "The processor was to be assembly-language-level-
               | compatible with the 8080 so that existing 8080 software
               | could be reassembled and correctly executed on the 8086.
               | To allow for this, the 8080 register set and instruction
               | set appear as logical subsets of the 8086 registers and
               | instructions."
               | 
               | The segment registers provide a way to address more than
               | a max of 64k, while also maintaining this "assembly-
               | language-level-compatibl[ity]" with existing 8080
               | programs.
        
             | dfox wrote:
              | It was not so much an engineering decision as a simple
              | requirement for the chip to be feasible. At that time,
              | 40-pin DIL was effectively the largest package that could be
              | made at production scale.
        
               | kens wrote:
               | Intel's decision to use a 40-pin chip was mainly because
               | Intel had a weird drive for small integrated circuits.
               | The Texas Instruments TMS9900 (1976) used a 64-pin
               | package for instance, as did the Motorola 68000 (1979).
               | 
               | For the longest time, Intel was fixated on 16-pin chips,
               | which is why the Intel 4004 processor was crammed into a
               | 16-pin package. The 8008 designers were lucky that they
               | were allowed 18 pins. The Oral History of Federico Faggin
               | [1] describes how 16-pin packages were a completely silly
               | requirement, but the "God-given 16 pins" was like a
               | religion at Intel. He hated this requirement because it
               | was throwing away performance. When Intel was forced to
               | 18 pins by the 1103 memory chip, it "was like the sky had
               | dropped from heaven" and he had "never seen so many long
               | faces at Intel."
               | 
                | [1] pages 55-56 of
                | http://archive.computerhistory.org/resources/text/Oral_Histo...
        
           | garganzol wrote:
            | A similar thing occurs with 64-bit processors nowadays at the
            | silicon level: pointers are 64 bits wide, but typically only
            | 48 or so bits of virtual address space are actually
            | implemented, and often fewer still for physical RAM. Not
            | exactly the 8086 situation, but still an interesting
            | observation.
        
           | kens wrote:
           | According to Steve Morse, the 8086 designer:
           | 
           | "Various alternatives for extending the 8080 address space
           | were considered. One such alternative consisted of appending
           | 8 rather than 4 low-order zero bits to the contents of a
           | segment register, thereby providing a 24-bit physical address
           | capable of addressing up to 16 megabytes of memory. This was
           | rejected for the following reasons:
           | 
           | Segments would be forced to start on 256-byte boundaries,
           | resulting in excessive memory fragmentation.
           | 
           | The 4 additional pins that would be required on the chip were
           | not available.
           | 
           | It was felt that a 1-megabyte address space was sufficient. "
           | 
           | Ref: page 16 of
           | https://www.stevemorse.org/8086history/8086history.pdf
           | 
           | Credit to mschaef for finding this.
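            | 
            | (For scale: with the chosen 4-bit shift, 0x1234:0x5678 maps to
            | 0x12340 + 0x5678 = 0x179B8, a 20-bit address; the rejected
            | 8-bit shift would have given 0x123400 + 0x5678 = 0x128A78, a
            | 24-bit address.)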
        
             | bell-cot wrote:
             | But they revisited and used that alternative later, in
             | designing the 80286. (Launched Feb'82, with ~135K
             | transistors, according to Wikipedia.)
             | 
             | If only there was a famous Bill Gates quote, about how 16MB
             | (the limit of 24-bit physical addresses) ought to be more
             | than enough...
        
               | rep_lodsb wrote:
               | Only in protected mode, and then segments were defined by
                | descriptor table entries having a full 24-bit base
               | address.
        
             | anyfoo wrote:
             | And it's been stated in related threads numerous times (and
             | by you as well if I recall correctly) how Intel at that
             | time strangely treated their low pin count limit as almost
             | a religion... apparently that influenced many decisions.
        
         | benlivengood wrote:
         | Do you still write 8086 assembly for fun? I grew up with DOS,
         | a86, and nasm and remember those times fondly. Once I had flat
         | mode in Linux I never really looked back, and once I had x86_64
         | I never looked back, and now I rarely even delve below the
         | interpreted/compiled level unfortunately. x86 had a closeness
         | to the hardware that was fun to play with.
        
           | kens wrote:
           | Strangely enough, I haven't written 8086 assembly since 1988,
           | when I wrote the ellipse-drawing code used in Windows.
        
             | madmoose wrote:
             | Aaron Giles was just lamenting the other day that the GDI
             | ellipse drawing is apparently slightly lopsided, since he's
             | trying to recreate it :)
             | 
             | https://corteximplant.com/@aaronsgiles/110121906721852861
        
               | kens wrote:
               | Hopefully that's not my bug :-) Before my code [1], small
               | ellipses looked terrible: a small circle looked like a
               | stop sign and others looked corrugated.
               | 
               | [1] I can't take much credit. The code was Bresenham's
               | algorithm, as explained to me by the OS/2 group. At the
               | time, OS/2 was the cool operating system that was the
               | wave of the future and Windows was this obscure operating
               | system that nobody cared about. The impression I got was
               | that the OS/2 people didn't really want to waste their
               | time talking to us.
        
             | garganzol wrote:
                | Pardon my interest, do you happen to be a former
                | Microsoftie?
        
               | kens wrote:
               | I was an intern in the summer of 1988. Instead of
               | staying, I went back to grad school which in retrospect
               | probably cost me a fortune :-)
        
         | pwg wrote:
          | Ken, one tiny update. You have "a 4-megabyte (20-bit) address
          | space" near the top of your post. 2^20 is a 1-megabyte address
          | space (in base-2 'megabytes'), not 4 megabytes.
        
           | kens wrote:
           | I got it fixed just before your comment :-)
        
         | mysterydip wrote:
          | Is there a performance advantage to using a microcoded
          | instruction versus explicitly issuing each instruction? Or is it
          | a space/bugs-while-coding savings?
        
           | rep_lodsb wrote:
           | A non-repeated string opcode is one byte for the CPU to fetch
           | vs. several for the corresponding "RISC-like" series of
           | operations (load, increment SI, store/compare, increment DI).
           | 
           | With the REP prefix (another single byte), an entire loop
           | could run in microcode without any additional instruction
           | fetches. Remember that each memory access took 4 clock cycles
           | and there was no cache yet.
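            | 
            | For concreteness, a rough byte-count sketch (exact encodings
            | depend on the assembler, but the ratio is the point):
            | 
            |               rep movsb         ; 2 bytes total (F3 A4)
            | 
            |               ; vs. the explicit equivalent, per iteration:
            |               mov al, [si]      ; 2 bytes
            |               mov [es:di], al   ; 3 bytes (segment override + mov)
            |               inc si            ; 1 byte
            |               inc di            ; 1 byte
            |               ; plus a dec cx / jnz pair to close the loop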
           | 
           | --
           | 
           | Eliminating opcode fetches might still speed things up today
           | in some situations, but modern x86 cores are optimized to
           | decode and dispatch several simple operations each cycle
           | without having to go through a microcode ROM (and thus have
           | to use a slower path for the complex instructions).
           | 
           | Also the fact that compilers didn't emit most of the more
           | complex/specialized instructions led to Intel not spending
           | much effort on optimizing those.
        
           | bell-cot wrote:
           | Directly implementing many & complex instructions in silicon
            | is fast, but takes a _lot_ of transistors. Microcode running
            | on a much simpler "actual" processor is slower, but requires
            | far fewer transistors. IIRC, ~all substantial CPUs of the
            | past ~40 years have mixed the two approaches.
           | 
           | (Microcode also makes it _possible_ to design in bug-patching
           | features. Possibly including patches for bugs in your direct-
           | implemented instructions.)
        
         | bonzini wrote:
         | In the SCAS/CMPS annotated microcode, the path with the R DS
         | microinstruction is for CMPS, not SCAS.
         | 
         | SCAS compares AX against [ES:DI] while CMPS compares [DS:SI]
         | against [ES:DI], so the paragraph before is slightly incorrect
         | (should be "SCAS compares against the character in the
         | accumulator, while CMPS reads the comparison character from
         | _SI_ ").
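          | 
          | In other words, a small usage sketch (buffer labels are
          | hypothetical):
          | 
          |               ; SCASB compares AL with ES:[DI], then steps DI
          |               mov di, haystack
          |               mov cx, hay_len
          |               mov al, needle    ; byte to search for, in the accumulator
          |               cld
          |               repne scasb       ; scan until match or CX = 0
          |               je  found         ; ZF => match at ES:[DI-1]
          | 
          |               ; CMPSB instead compares DS:[SI] with ES:[DI],
          |               ; stepping both SI and DI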
        
           | kens wrote:
           | Thanks, I've fixed the post.
        
       | anyfoo wrote:
       | > Finally, the address adder increments and decrements the index
       | registers used for block operations.
       | 
       | It's kind of surprising to me that it's the address adder that
       | increments/decrements the index registers. Since it's a direct
       | operation on a register, naively I would have assumed it's the
       | ALU doing that. Is it busy during that time?
       | 
       | EDIT: Having read a bit further, I guess it's because of the
       | microcode's simplicity... so, with the microcode being as it is,
       | using the BL flag to instruct the logic to use the address adder
       | to increment/decrement the index register prevents needing
       | additional ALU uops, which would make the instructions slower? So
       | then I guess it would not have worked to make that BL flag
       | instruct the ALU instead? Does anyone know any details?
        
         | kens wrote:
         | I think the motivation is that the address adder already has
         | the capability to increment/decrement registers by various
         | constants: incrementing the PC, incrementing the address for an
         | unaligned word access, correcting the PC from the queue size,
         | etc. If you updated the SI/DI registers through the ALU, you'd
         | need a separate mechanism to get the constants into the ALU, as
         | well as controlling the ALU to do this. But using the address
         | adder gets this functionality almost for free.
        
       | userbinator wrote:
       | _The solution is that a string instruction can be interrupted in
       | the middle of the instruction, unlike most instructions._
       | 
       | This is true even for later x86 processors that move entire
       | cachelines at a time, and also leads to one sneaky technique for
       | detection of VMs or being debugged/emulated:
       | 
       | https://repzret.org/p/rep-prefix-and-detecting-valgrind/
       | 
       | https://silviocesare.wordpress.com/2009/02/02/anti-debugging...
        
         | rep_lodsb wrote:
         | Small nitpick: the REP STOSB isn't left in the prefetch queue
         | while it's being executed by the microcode engine. The queue
         | always contains the _next_ instructions to execute (obviously
          | it's not as simple for out-of-order architectures).
         | 
          | The following should load AL with 0 on an 8088, 1 on an 8086:
          | 
          |               mov bx,test_instr+1
          |               mov byte [bx],1   ;make sure immediate operand is set to one
          |               mov cl,255        ;max shift count to give time for prefetch
          |               cli               ;no interrupts allowed
          | 
          |               align 2
          |               shl byte [bx],cl  ;clears byte at [bx] after wasting lots of time
          |               nop               ;first byte in queue
          |               nop               ;2nd
          |               nop               ;3rd
          |               sti               ;4th (intr. recognized after next instruction)
          |   test_instr:
          |               mov al,0ffh       ;opcode is 5th, operand is 6th
         | 
         | The 255x left shift is done in an internal register, during
         | which the 8086 has more than enough time to fetch all six
         | following bytes. I didn't test this code but am 99% certain it
         | works.
        
           | secondcoming wrote:
           | username checks out!
        
       | bell-cot wrote:
       | > to create a 4-megabyte (20-bit) address space consisting of 64K
       | segments
       | 
       | sed "s/4-/1-/", perhaps?
        
         | kens wrote:
         | Oops, thanks.
        
           | userbinator wrote:
           | The 8086/88 does have an effective 4MB address space since it
           | outputs the segment register in use during bus cycles, and I
           | believe one of the official manuals did show that
           | possibility, but I don't know of any systems that used such a
           | configuration.
        
             | anyfoo wrote:
              | Today I learned. I briefly went through in my mind anything
              | that would foil such a plan, but with no caches
              | whatsoever... it should indeed have worked, even if it's
              | weird (and deeply incompatible with what PCs did, of
              | course).
             | 
             | The only thing that comes to mind is fetching vectors from
             | the IDT. I guess it simply output CS in those cycles?
        
       | PaulHoule wrote:
        | Once you got a pipelined architecture that stored instructions in
        | a cache, you could write a loop in assembly language that runs as
        | fast as one of those string instructions; but without that, you'd
        | be wasting most of the bandwidth to RAM loading instructions
        | instead of loading and storing data.
        
       | anyfoo wrote:
       | > SI - IND R DS,BL 1: Read byte/word from SI
       | 
       | > IND - SI JMPS X0 8 test instruction bit 3: jump if LODS
       | 
       | > DI - IND W DA,BL MOVS path: write to DI
       | 
       | Is "W DA,BL" a typo, should it be "W DS,BL"?
        
         | ajenner wrote:
         | It's not - the "DA" means that the write happens to the ES
         | segment, while "DS" means that the read happens from the DS
         | segment. The use of different segments for the source and
         | destination gives these instructions extra flexibility.
        
         | kens wrote:
         | Not a typo, but something I should clean up. The move
         | instruction moves from DS:SI to ES:DI. The 8086 registers
         | originally had different names (which is why AX, BX, CX, DX
         | aren't in order) and the "Extra Segment" was the "Alternate
         | Segment". So the 8086 patent uses "DA" to indicate what is now
         | ES. ("A" for Alternate but I'm not sure why "D" is used for all
         | the segments.) The register names were cleaned up before the
         | 8086 was released.
         | 
         | Andrew Jenner's microcode disassembly used the original names.
         | I've been changing the register names to the modern names, but
         | missed that one. So it should probably be "W ES,BL", although I
         | don't really like "BL" either. (Maybe "PM" would make more
         | sense for plus/minus?)
         | 
         | A bit more on the original register names. The original
         | registers XA, BC, DE, HL were renamed to AX, CX, DX, BX. The
         | original registers MP, IJ, IK were renamed to BP, SI, DI.
        
         | rep_lodsb wrote:
         | DA = "Data segment Alternate", or something like that. It
          | refers to the segment register normally called ES, which string
          | instructions implicitly use as the destination (the source being
          | DS unless overridden).
         | 
         | The names are those used in a patent filed by Intel and come
         | from before the official documentation for the 8086 was
         | written. It might make the microcode a bit more readable to
         | invent a different notation that is more in line with normal
         | x86 assembly, but then again this is a very niche topic :)
        
       ___________________________________________________________________
       (page generated 2023-04-04 23:00 UTC)