[HN Gopher] The microcode and hardware in the 8086 processor tha...
___________________________________________________________________
The microcode and hardware in the 8086 processor that perform
string operations
Author : picture
Score : 111 points
Date : 2023-04-04 16:58 UTC (6 hours ago)
(HTM) web link (www.righto.com)
(TXT) w3m dump (www.righto.com)
| ataylor284_ wrote:
| These 8086 instructions were pretty cool coming from a 6502
| background. The usual way to copy a block of memory in 6502 was
| something like this:
|
|         ldy #len
|     loop:
|         lda source-1,y   ; memory accesses: loads opcode, address, value
|         sta dest-1,y     ; memory accesses: loads opcode, address, stores value
|         dey              ; memory accesses: loads opcode
|         bne loop         ; memory accesses: loads opcode, offset
|
| So 4 instructions and a minimum of 9 memory accesses per byte
| copied, more if not in the zero page. Even unrolling the loop
| gets you down to 6.
|
| Compare this to the 8086:
|
|         mov cx, len
|         mov si, source
|         mov di, dest
|         rep movsb        ; memory accesses: read the opcodes, then
|                          ; 1 load and 1 store per byte
|
| Even forgetting about word moves, you're down to 2 memory
| accesses per byte. No instruction opcode or branching overhead.
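To make the `rep movsb` semantics above concrete, here is a minimal Python model (an editor's sketch, not from the thread): CX counts bytes, SI and DI step by ±1 depending on the direction flag, and each iteration does exactly one load and one store. Segments and status flags are omitted.

```python
def rep_movsb(mem, cx, si, di, df=0):
    """Model the 8086 'rep movsb' loop: copy CX bytes from SI to DI
    in a flat memory, stepping both index registers by +1 (DF=0)
    or -1 (DF=1). One load and one store per byte, as in the thread."""
    step = -1 if df else 1
    while cx:
        mem[di] = mem[si]  # 1 load + 1 store per byte copied
        si += step
        di += step
        cx -= 1
    return si, di

mem = bytearray(b"hello world!" + bytes(8))
rep_movsb(mem, cx=5, si=0, di=12)  # copies b"hello" to offset 12
```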
| mikepurvis wrote:
| It still boggles my mind that processors in the 1970s already had
| microcode-- obviously it was vastly simpler and not doing the
| crazy branch prediction, speculative execution, and operation
| reordering that modern processors do, but it's still amazing to
| me that this architecture was there at all.
| kens wrote:
| Historically, microcode was invented around 1951 and was
| heavily used in most of the IBM System/360 computers (1964).
| Once your instruction set is even moderately complicated, using
| microcode is a lot easier than hard-wired control circuitry.
|
| You can view microcode as one of the many technologies that
| started in mainframes, moved into minicomputers, and then moved
| into microprocessors. In a sense, the idea behind RISC was the
| recognition that this natural progression was not necessarily
| the best way to build microprocessors.
| msla wrote:
| There's something more fundamental going on here:
|
| The concept of the ISA as a standard implemented by hardware,
| as opposed to the ISA being a document of what a specific
| hardware design does, goes back to the IBM System/360 family of
| computers from 1964. That was the first range of computers,
| from large to (let's be honest) somewhat less large, that could
| all run most of the same software, modulo some specialized
| instructions and peripheral hardware like disk drives not
| present on the cheapest systems, at different levels of
| performance. As others have said, microcode is plenty old, but
| the idea of architectural families is one of the things which
| really gets us to where we are now.
| pwg wrote:
| Microcode (the concept) is very much older than the 1970's. Per
| Wikipedia ([1]) the earliest system with something like
| 'microcode' dates back to 1947.
|
| What is 'special' for the 8086 is that this was a low cost
| microprocessor using microcode, as opposed to a multi-million
| dollar mainframe CPU.
|
| [1] https://en.wikipedia.org/wiki/Microcode#History
|
| edit: fix typo
| plesner wrote:
| Babbage's analytical engine used a form of microcode ~100
| years earlier. Complex operations such as multiplication and
| division were implemented by lower-level instructions encoded
| as a series of pegs on a rotating barrel. Each micro-
| instruction ("vertical" in Babbage's terminology) could take
| several seconds to execute so a complete multiplication or
| division would take minutes.
| kens wrote:
| You mean the 8086, not the 8-bit 8080, right? The 8080 wasn't
| microcoded.
| pwg wrote:
| Indeed, that was a typo. Fixed now. Thanks.
| morcheeba wrote:
| I love that the TMS34082 math coprocessor had user-definable
| microcode! This was intended to work with 3D graphics, so you
| could really get the most out of the ALU by chaining operations
| for your specific need (e.g. divide-and-sqrt single
| instruction)
|
| See chapter 8:
| http://www.bitsavers.org/components/ti/TMS340xx/TMS34082_Des...
| [deleted]
| jecel wrote:
| The 286 added "in string" and "out string" instructions. It
| turned out that if peripherals had a small internal buffer it was
| more efficient to use these new instructions instead of using DMA
| hardware. This is why hard disk controllers for the original PC
| and XT used DMA but from the AT on they normally didn't.
| rep_lodsb wrote:
| It was the 186 which added those. Wasn't widely used in PCs
| because it included some built-in peripherals that weren't
| compatible with the "standard" ones used by IBM in their
| original PC.
|
| And that standard PC's DMA controller was originally intended
| for use with the 8-bit 8080 (eighty-eighty, not eighty-eight)
| processor. It was slow and only supported 16 bit addresses. IBM
| added a latch that would store the high address bits, but they
| weren't incremented when the lower bits wrapped around. So
| buffers had to be allocated in a way that they didn't cross
| those boundaries, and the BIOS disk functions would return an
| error if your code didn't do this properly.
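The wraparound hazard described above is easy to state in code. A hypothetical sketch (the function name is an editor's invention, not BIOS code) of the check a DOS-era driver had to make before starting a transfer, given that the page latch holding A16-A19 never increments:

```python
def crosses_dma_page(phys_addr, length):
    """True if a transfer of 'length' bytes starting at the given
    20-bit physical address would cross a 64 KB boundary. On the
    PC/XT the DMA page latch (A16-A19) is not incremented when the
    16-bit DMA address wraps, so such a transfer would scribble over
    the start of the same 64 KB page instead."""
    if length == 0:
        return False
    return (phys_addr >> 16) != ((phys_addr + length - 1) >> 16)

# A 512-byte sector buffer ending exactly at a 64 KB line is safe;
# starting one byte later, the low 16 bits wrap back to 0x0000.
```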
| adrian_b wrote:
| The 80186 and 80286 were launched simultaneously in
| February 1982, so you can't say that one of them
| introduced INS and OUTS before the other.
|
| Nevertheless, due to their high cost at the time, neither
| of these two CPUs saw much use until two and a half years
| later, when IBM launched the IBM PC/AT in August 1984.
|
| Most programmers first encountered INS and OUTS while
| using PC/AT clones with the 80286. Embedded computers
| with the 80186 became common only some years later, when
| its price dropped a lot.
| anyfoo wrote:
| I grew up with a "Siemens PC-D", 80186 and not fully IBM
| compatible for the reasons you said. However, it seems that
| the folks at Siemens were very enthusiastic about it, and
| ported a lot of software that accessed the hardware more
| directly, including most Microsoft software (Windows, Word,
| the C compiler...) and GEM. The BIOS was actually on disk, so
| updatable, and the BIOS updates came with a README file from
| the PC-D team that again gave me the impression that at least
| some of them really liked working on it.
|
| A marvelous machine. High resolution (if monochrome),
| Hercules-like graphics, for example.
|
| But the kicker was that it actually had an MMU[1], built from
| discrete logic chips, in order to run SINIX (a Xenix 8086
| derivative, but with added MMU support among other things) and
| enjoy full memory protection. I reverse engineered that MMU,
| it's quite fun. Paging wasn't possible though.
|
| [1] Early on as a variant called "PC-X", but they apparently
| did away with the early board revisions, and all machines
| I've seen say "PC-D/PC-X" on the back, so they all have the
| MMU and can be both.
| kens wrote:
| Author here for all your 8086 questions :-)
| myself248 wrote:
| It's always boggled my mind that they'd have 32 bits of
| addressing registers, but overlap 12 bits, leaving just 20
| useful bits. What a waste. I guess the segment:offset scheme
| had some advantages, but honestly, I've never felt like I had a
| good understanding of them.
|
| But, what if the overlap had been just 10 bits, or just 8,
| leaving a much larger functional address range before we got
| clobbered by the 20-bit limit; what if it was 22 or 24 bits of
| useful range? Can you speculate what effect such a decision
| would've had, and why it wasn't taken at the time? I understand
| in 1978 even 20 bits was a lot, but was that optimal?
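For reference, the scheme in question: a quick Python sketch (an editor's illustration) of the 8086's physical address computation, with the shift as a parameter so the hypothetical 8-bit or 10-bit overlaps above can be tried too.

```python
def phys(segment, offset, shift=4):
    """8086: physical = (segment << 4) + offset, truncated to 20
    bits -- a 12-bit overlap between the two 16-bit values. A larger
    shift, as speculated above, would widen the usable space."""
    bits = 16 + shift
    return ((segment << shift) + offset) & ((1 << bits) - 1)

# Many segment:offset pairs alias the same physical byte:
assert phys(0x1234, 0x0005) == phys(0x1000, 0x2345) == 0x12345
# An 8-bit shift (the alternative Intel rejected, see below in the
# thread) would have given a 24-bit, 16 MB space:
assert phys(0x1234, 0x0005, shift=8) == 0x123405
```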
| pwg wrote:
| The 20 bits of address was likely a direct result of Intel's
| decision to package the 8086 in a 40-pin package.
|
| They already multiplex the 16-bit data bus on top of 16 of
| the 20 pins of the address bus. But with only 40 pins, that
| 20-bit address bus was already using half of the total chip
| pinout.
|
| And, at the time the 8086 was being designed (circa
| 1976-1978), with other microprocessors of the era having
| 16-bit or smaller address buses, the jump to 1M of possible
| total address space was likely seen as huge. We look back
| now, comfortable with 12+GB of RAM as a common size, and see
| 1M as small. But when your common size is 16-64k, having 1M
| as a possibility would likely seem huge.
| myself248 wrote:
| Packaging makes a lot of sense. I've handled a 68000 in
| DIP64 and it's just comically huge, and trying to fit the
| cursed thing into a socket quickly explains why DIPs larger
| than 40 are ultra rare.
|
| I'm sure there must be architectures that use a multiplexed
| low/high address bus, like a latch signal that says "the 16
| bits on the address bus right now are the segment", then a
| moment later "okay here's the offset", and leave it to the
| decoding circuitry on the motherboard, to determine how
| much to overlap them, or not at all. Doing it this way, you
| could scale the same chip from 64kB to 4GB, and the
| decision would be part of the system architecture, rather
| than the processor. (Could even have mode bits, like the
| infamous A20 Gate, that would vary the offset and thus the
| addressable space...)
|
| But, yeah, it was surely seen as unnecessary at the time.
| Nobody was expecting the x86 to spawn forty-plus years of
| descendants, and even though Moore's Law was over a decade
| old at the time, it seems like nobody was wrapping their
| head around its full implications.
| getpost wrote:
| Hahaha, somehow I convinced myself that the rationale for
| the segment register scheme was a long term plan to
| accommodate virtual memory and memory protection. The idea
| being that you would only have to validate constraints on
| access when a segment register was loaded, rather than
| every single memory access.
| pwg wrote:
| If you read through the PDF Ken links to in another note
| here, you find this quote starting the "Objectives and
| Constraints of the 8086" section:
|
| "The processor was to be assembly-language-level-
| compatible with the 8080 so that existing 8080 software
| could be reassembled and correctly executed on the 8086.
| To allow for this, the 8080 register set and instruction
| set appear as logical subsets of the 8086 registers and
| instructions."
|
| The segment registers provide a way to address more than
| a max of 64k, while also maintaining this "assembly-
| language-level-compatibl[ity]" with existing 8080
| programs.
| dfox wrote:
| It was not that much of an engineering decision as a simple
| requirement for the chip to be feasible. At that time 40pin
| DIL was effectively the largest package that could be made
| at production scale.
| kens wrote:
| Intel's decision to use a 40-pin chip was mainly because
| Intel had a weird drive for small integrated circuits.
| The Texas Instruments TMS9900 (1976) used a 64-pin
| package for instance, as did the Motorola 68000 (1979).
|
| For the longest time, Intel was fixated on 16-pin chips,
| which is why the Intel 4004 processor was crammed into a
| 16-pin package. The 8008 designers were lucky that they
| were allowed 18 pins. The Oral History of Federico Faggin
| [1] describes how 16-pin packages were a completely silly
| requirement, but the "God-given 16 pins" was like a
| religion at Intel. He hated this requirement because it
| was throwing away performance. When Intel was forced to
| 18 pins by the 1103 memory chip, it "was like the sky had
| dropped from heaven" and he had "never seen so many long
| faces at Intel."
|
| [1] pages 55-56 of
| http://archive.computerhistory.org/resources/text/Oral_Histo...
| garganzol wrote:
| A similar thing occurs with 64-bit processors nowadays at the
| silicon level: they have 64-bit virtual address space but
| only 48 or so bits are used for physical RAM access. Not
| exactly a 8086 situation but still an interesting
| observation.
| kens wrote:
| According to Steve Morse, the 8086 designer:
|
| "Various alternatives for extending the 8080 address space
| were considered. One such alternative consisted of appending
| 8 rather than 4 low-order zero bits to the contents of a
| segment register, thereby providing a 24-bit physical address
| capable of addressing up to 16 megabytes of memory. This was
| rejected for the following reasons:
|
| Segments would be forced to start on 256-byte boundaries,
| resulting in excessive memory fragmentation.
|
| The 4 additional pins that would be required on the chip were
| not available.
|
| It was felt that a 1-megabyte address space was sufficient. "
|
| Ref: page 16 of
| https://www.stevemorse.org/8086history/8086history.pdf
|
| Credit to mschaef for finding this.
| bell-cot wrote:
| But they revisited and used that alternative later, in
| designing the 80286. (Launched Feb'82, with ~135K
| transistors, according to Wikipedia.)
|
| If only there was a famous Bill Gates quote, about how 16MB
| (the limit of 24-bit physical addresses) ought to be more
| than enough...
| rep_lodsb wrote:
| Only in protected mode, and then segments were defined by
| descriptor table entries having a full 24 bit base
| address.
| anyfoo wrote:
| And it's been stated in related threads numerous times (and
| by you as well if I recall correctly) how Intel at that
| time strangely treated their low pin count limit as almost
| a religion... apparently that influenced many decisions.
| benlivengood wrote:
| Do you still write 8086 assembly for fun? I grew up with DOS,
| a86, and nasm and remember those times fondly. Once I had flat
| mode in Linux I never really looked back, and once I had x86_64
| I never looked back, and now I rarely even delve below the
| interpreted/compiled level unfortunately. x86 had a closeness
| to the hardware that was fun to play with.
| kens wrote:
| Strangely enough, I haven't written 8086 assembly since 1988,
| when I wrote the ellipse-drawing code used in Windows.
| madmoose wrote:
| Aaron Giles was just lamenting the other day that the GDI
| ellipse drawing is apparently slightly lopsided, since he's
| trying to recreate it :)
|
| https://corteximplant.com/@aaronsgiles/110121906721852861
| kens wrote:
| Hopefully that's not my bug :-) Before my code [1], small
| ellipses looked terrible: a small circle looked like a
| stop sign and others looked corrugated.
|
| [1] I can't take much credit. The code was Bresenham's
| algorithm, as explained to me by the OS/2 group. At the
| time, OS/2 was the cool operating system that was the
| wave of the future and Windows was this obscure operating
| system that nobody cared about. The impression I got was
| that the OS/2 people didn't really want to waste their
| time talking to us.
| garganzol wrote:
| Pardon my interest, do you happen to be a former
| Microsoftie?
| kens wrote:
| I was an intern in the summer of 1988. Instead of
| staying, I went back to grad school which in retrospect
| probably cost me a fortune :-)
| pwg wrote:
| Ken, one tiny update. You have "a 4-megabyte (20-bit) address
| space" near the top of your post. 2^20 is a 1-megabyte address
| space (in base-2 'megabytes'), not 4 megabytes.
| kens wrote:
| I got it fixed just before your comment :-)
| mysterydip wrote:
| Is there a performance advantage to using a microcoded
| instruction versus explicitly issuing each instruction? Or is
| it a space/bugs-while-coding savings?
| rep_lodsb wrote:
| A non-repeated string opcode is one byte for the CPU to fetch
| vs. several for the corresponding "RISC-like" series of
| operations (load, increment SI, store/compare, increment DI).
|
| With the REP prefix (another single byte), an entire loop
| could run in microcode without any additional instruction
| fetches. Remember that each memory access took 4 clock cycles
| and there was no cache yet.
|
| --
|
| Eliminating opcode fetches might still speed things up today
| in some situations, but modern x86 cores are optimized to
| decode and dispatch several simple operations each cycle
| without having to go through a microcode ROM (and thus have
| to use a slower path for the complex instructions).
|
| Also the fact that compilers didn't emit most of the more
| complex/specialized instructions led to Intel not spending
| much effort on optimizing those.
| bell-cot wrote:
| Directly implementing many & complex instructions in silicon
| is fast, but takes a _lot_ of transistors. Microcode running
| on a much simpler "actual" processor is slower, but requires
| far fewer transistors. IIRC, ~all substantial CPUs of the
| past ~40 years have mixed the two approaches.
|
| (Microcode also makes it _possible_ to design in bug-patching
| features. Possibly including patches for bugs in your direct-
| implemented instructions.)
| bonzini wrote:
| In the SCAS/CMPS annotated microcode, the path with the R DS
| microinstruction is for CMPS, not SCAS.
|
| SCAS compares AX against [ES:DI] while CMPS compares [DS:SI]
| against [ES:DI], so the preceding paragraph is slightly
| incorrect (it should be "SCAS compares against the character
| in the accumulator, while CMPS reads the comparison character
| from _SI_").
| kens wrote:
| Thanks, I've fixed the post.
| anyfoo wrote:
| > Finally, the address adder increments and decrements the index
| registers used for block operations.
|
| It's kind of surprising to me that it's the address adder that
| increments/decrements the index registers. Since it's a direct
| operation on a register, naively I would have assumed it's the
| ALU doing that. Is it busy during that time?
|
| EDIT: Having read a bit further, I guess it's because of the
| microcode's simplicity... so, with the microcode being as it is,
| using the BL flag to instruct the logic to use the address adder
| to increment/decrement the index register prevents needing
| additional ALU uops, which would make the instructions slower? So
| then I guess it would not have worked to make that BL flag
| instruct the ALU instead? Does anyone know any details?
| kens wrote:
| I think the motivation is that the address adder already has
| the capability to increment/decrement registers by various
| constants: incrementing the PC, incrementing the address for an
| unaligned word access, correcting the PC from the queue size,
| etc. If you updated the SI/DI registers through the ALU, you'd
| need a separate mechanism to get the constants into the ALU, as
| well as controlling the ALU to do this. But using the address
| adder gets this functionality almost for free.
| userbinator wrote:
| _The solution is that a string instruction can be interrupted in
| the middle of the instruction, unlike most instructions._
|
| This is true even for later x86 processors that move entire
| cachelines at a time, and also leads to one sneaky technique for
| detection of VMs or being debugged/emulated:
|
| https://repzret.org/p/rep-prefix-and-detecting-valgrind/
|
| https://silviocesare.wordpress.com/2009/02/02/anti-debugging...
| rep_lodsb wrote:
| Small nitpick: the REP STOSB isn't left in the prefetch queue
| while it's being executed by the microcode engine. The queue
| always contains the _next_ instructions to execute (obviously
| it's not as simple for out-of-order architectures).
|
| The following should load AL with 0 on an 8088, 1 on an 8086:
|
|         mov bx,test_instr+1
|         mov byte [bx],1    ;make sure immediate operand is set to one
|         mov cl,255         ;max shift count to give time for prefetch
|         cli                ;no interrupts allowed
|         align 2
|         shl byte [bx],cl   ;clears byte at [bx] after wasting lots of time
|         nop                ;first byte in queue
|         nop                ;2nd
|         nop                ;3rd
|         sti                ;4th (intr. recognized after next instruction)
|     test_instr:
|         mov al,0ffh        ;opcode is 5th, operand is 6th
|
| The 255x left shift is done in an internal register, during
| which the 8086 has more than enough time to fetch all six
| following bytes. I didn't test this code but am 99% certain it
| works.
| secondcoming wrote:
| username checks out!
| bell-cot wrote:
| > to create a 4-megabyte (20-bit) address space consisting of 64K
| segments
|
| sed "s/4-/1-/", perhaps?
| kens wrote:
| Oops, thanks.
| userbinator wrote:
| The 8086/88 does have an effective 4MB address space since it
| outputs the segment register in use during bus cycles, and I
| believe one of the official manuals did show that
| possibility, but I don't know of any systems that used such a
| configuration.
| anyfoo wrote:
| Today I learned. In my mind I briefly went through anything
| that would foil such a plan, but with no caches
| whatsoever... it should indeed have worked, even if it's weird
| (and deeply incompatible with what PCs did, of course).
|
| The only thing that comes to mind is fetching vectors from
| the interrupt vector table. I guess it simply output CS in
| those cycles?
| PaulHoule wrote:
| Once you got a pipelined architecture that stored instructions in
| a cache, you could write a loop in assembly language that runs as
| fast as one of those string instructions, but if you didn't have
| that you'd be wasting most of the bandwidth to the RAM loading
| instructions instead of loading and storing data.
| anyfoo wrote:
| > SI - IND R DS,BL 1: Read byte/word from SI
|
| > IND - SI JMPS X0 8 test instruction bit 3: jump if LODS
|
| > DI - IND W DA,BL MOVS path: write to DI
|
| Is "W DA,BL" a typo, should it be "W DS,BL"?
| ajenner wrote:
| It's not - the "DA" means that the write happens to the ES
| segment, while "DS" means that the read happens from the DS
| segment. The use of different segments for the source and
| destination gives these instructions extra flexibility.
| kens wrote:
| Not a typo, but something I should clean up. The move
| instruction moves from DS:SI to ES:DI. The 8086 registers
| originally had different names (which is why AX, BX, CX, DX
| aren't in order) and the "Extra Segment" was the "Alternate
| Segment". So the 8086 patent uses "DA" to indicate what is now
| ES. ("A" for Alternate but I'm not sure why "D" is used for all
| the segments.) The register names were cleaned up before the
| 8086 was released.
|
| Andrew Jenner's microcode disassembly used the original names.
| I've been changing the register names to the modern names, but
| missed that one. So it should probably be "W ES,BL", although I
| don't really like "BL" either. (Maybe "PM" would make more
| sense for plus/minus?)
|
| A bit more on the original register names. The original
| registers XA, BC, DE, HL were renamed to AX, CX, DX, BX. The
| original registers MP, IJ, IK were renamed to BP, SI, DI.
| rep_lodsb wrote:
| DA = "Data segment Alternate", or something like that. It
| refers to the segment register normally called ES, which string
| instructions implicitly use as the destination (the source
| being DS unless overridden).
|
| The names are those used in a patent filed by Intel and come
| from before the official documentation for the 8086 was
| written. It might make the microcode a bit more readable to
| invent a different notation that is more in line with normal
| x86 assembly, but then again this is a very niche topic :)
___________________________________________________________________
(page generated 2023-04-04 23:00 UTC)