[HN Gopher] Inside the 8086 processor's instruction prefetch cir...
       ___________________________________________________________________
        
       Inside the 8086 processor's instruction prefetch circuitry
        
       Author : klelatti
       Score  : 64 points
       Date   : 2023-01-02 18:11 UTC (4 hours ago)
        
 (HTM) web link (www.righto.com)
 (TXT) w3m dump (www.righto.com)
        
       | userbinator wrote:
       | If you search for information on 8086/88-specific features and
       | terminology such as BIU/EU, prefetch queue, etc., a surprisingly
       | large number of results are Indian. They are apparently still
       | quite popular in CS courses there. I wonder if new discrete
       | clones are still being made, as Intel stopped making them long
       | ago.
        
       | ajross wrote:
       | I love these. kens, nitpick on your first footnote:
       | 
       | In fact the Apple II was designed for 4k DRAMs with a whole-cycle
       | time of 500 ns, which was the low end of the range in 1977. The
       | idea is that the machine double pumps a 2 MHz memory bus, with
       | cycles alternating between the 1 MHz 6502 and the video scan
       | circuitry (which doubles as the refresh controller). By the
       | advent of 4164 chips, DRAM cycle rates were pushing 4 MHz. But Woz
       | was stuck with more limited hardware.
       | 
       | Similarly, the 8086 bus is multiplexed between address and data,
       | so a memory access takes 4 cycles. A 4.77 MHz IBM PC was therefore
       | cycling the DRAM at just 1.2 MHz (though with no clever tricks, as
       | MDA/CGA had physically separate framebuffers; there the extra
       | bandwidth really was wasted).
       | 
       | Which isn't to say that the footnote is wrong, just that "DRAM
       | was faster than CPUs" is only part of the story.
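
The arithmetic behind the figures in this comment can be checked with a short sketch. The function names are hypothetical and the only inputs are the numbers quoted above (a 4.77 MHz clock with 4 clocks per access, and a 2 MHz Apple II memory bus); it is an illustration, not a timing model of either machine.

```python
def access_rate_mhz(clock_mhz: float, clocks_per_access: int) -> float:
    """Memory accesses per microsecond (i.e. MHz) for a given CPU clock."""
    return clock_mhz / clocks_per_access


def cycle_time_ns(rate_mhz: float) -> float:
    """Time per memory cycle in nanoseconds for a given access rate."""
    return 1000.0 / rate_mhz


# IBM PC: 4.77 MHz clock, 4 clocks per bus cycle (figures from the comment).
ibm_pc_rate = access_rate_mhz(4.77, 4)

# Apple II: memory bus double pumped at 2 MHz (figure from the comment).
apple_ii_cycle = cycle_time_ns(2.0)

print(f"IBM PC DRAM rate: {ibm_pc_rate:.2f} MHz")
print(f"Apple II memory cycle: {apple_ii_cycle:.0f} ns")
```

The Apple II result matches the 500 ns whole-cycle time quoted for its DRAMs, and the IBM PC result rounds to the 1.2 MHz mentioned above.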
        
       | mrlonglong wrote:
       | I hate to nitpick but there's an error in the text where it's
       | stated that the 8086 could address 4 megabytes with 20 bit
       | addresses, but that's actually one megabyte. Otherwise, all
       | fascinating reading. My first proper pc was an Amstrad PC1512
       | with a NEC V30 (clone of the 8086 with some enhanced
       | instructions). Maybe you could take a look at the V30 at some
       | point?
        
         | kens wrote:
         | Thanks, I've fixed that. As for the NEC V30, the 8086 is likely
         | to keep me busy for a long time, but it would be interesting to
         | see how the enhanced instructions in the V30 were implemented.
        
           | mrlonglong wrote:
           | I've got a NEC V30 coprocessor plugged into my raspberry pi.
           | It works amazingly well.
        
       | kens wrote:
       | It's nice to see my post about the Intel 8086 here, even if the
       | title is totally mangled. Any questions?
        
         | toast0 wrote:
         | Footnote 10 ends mid-sentence... I suspect 'instruction' is the
         | missing word at the end
        
           | kens wrote:
           | Thanks, I've added the missing line.
        
         | moosedev wrote:
         | Hey Ken,
         | 
         | Always enjoy your posts!
         | 
         | > The much-reviled solution was to create a 4-megabyte (20-bit)
         | address space consisting of 64K segments,
         | 
         | Did you mean 1-megabyte here? (and again later in the same
         | "paragraph", pun intended)
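
For reference, the 1-megabyte figure follows directly from the 20-bit address width. The sketch below also shows the usual 8086 real-mode address calculation (segment shifted left by 4 bits plus offset, wrapping at 20 bits); that formula is standard 8086 behavior rather than something stated in this thread, and the function names are hypothetical.

```python
ADDRESS_BITS = 20          # width of the 8086 physical address
SEGMENT_BYTES = 1 << 16    # each segment spans 64K


def address_space_bytes(bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 1 << bits


def physical_address(segment: int, offset: int) -> int:
    """8086 real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    mask = (1 << ADDRESS_BITS) - 1
    return ((segment << 4) + offset) & mask


print(address_space_bytes(ADDRESS_BITS))      # bytes reachable with 20 bits
print(hex(physical_address(0x1234, 0x0010)))  # one segment:offset example
```

Twenty address bits give 1,048,576 bytes (1 MB), not 4 MB.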
        
           | userbinator wrote:
           | There are 4 segments and which one is being used is presented
           | on the bus interface (S4/S3) so 4MB of total memory is
           | actually possible in an 8086 system, and the 8086
           | documentation does explicitly mention that as an
           | implementation choice, but I don't think many systems made
           | use of that feature; bank-switching seems to have been more
           | common.
           | 
           | The reason being that separating the segments into their own
           | address spaces would create something more like a Harvard
           | architecture, which isn't really wanted in a general-purpose
           | computer.
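
A rough sketch of the idea described here: the S4/S3 status lines identify the segment register in use, so external hardware could treat them as two extra address bits. The S4/S3 encoding below follows the 8086 datasheet; the `banked_address` function and its 22-bit layout are hypothetical, one possible way such a system might have been wired.

```python
# Segment register indicated by the (S4, S3) status lines during a bus cycle.
SEGMENT_BY_STATUS = {
    (0, 0): "ES",          # alternate data
    (0, 1): "SS",          # stack
    (1, 0): "CS or none",  # code, or no segment
    (1, 1): "DS",          # data
}

MEGABYTE = 1 << 20


def banked_address(s4: int, s3: int, address20: int) -> int:
    """Hypothetical 22-bit address: S4/S3 select one of four 1 MB banks."""
    if s4 not in (0, 1) or s3 not in (0, 1):
        raise ValueError("status lines must be 0 or 1")
    return (s4 << 21) | (s3 << 20) | (address20 & (MEGABYTE - 1))


# Four separate 1 MB spaces, one per status combination.
total_bytes = len(SEGMENT_BY_STATUS) * MEGABYTE

print(total_bytes // MEGABYTE, "MB total")
print(SEGMENT_BY_STATUS[(1, 1)], hex(banked_address(1, 1, 0x12345)))
```

In this arrangement code, stack, and data each live in their own bank, which is the Harvard-like split mentioned above.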
        
             | moosedev wrote:
             | That's interesting - I didn't know that the memory
             | subsystem could "know" which of the 4 segment registers was
             | in use for a given request. With that detail, one could
             | certainly call it a 4MB "address space", albeit even more
             | of a pain to use than the usual 1MB one :-)
        
           | kens wrote:
           | Thanks, I've fixed that.
        
           | zozbot234 wrote:
           | That solution actually made quite a bit of sense. What was
           | really messy is protected mode in the 80286 where the segment
           | address actually indexes into an indirection table, in order
           | to address 24 bits (16M) of physical address space. But the
           | 8086 had none of that. And the 80386 finally got 32-bit flat
           | addressing in protected mode, enabling modern OSes to be
           | ported.
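
The indirection described for the 80286 can be sketched as follows. The selector layout (13-bit index, table-indicator bit, 2-bit privilege level) and the 24-bit base follow the 80286 architecture, but the `tables` dictionary and the function names are hypothetical, and limit and access-rights checks are deliberately left out.

```python
from typing import Dict, NamedTuple


class Selector(NamedTuple):
    index: int   # entry number within the descriptor table
    table: str   # "GDT" or "LDT", from the table-indicator bit
    rpl: int     # requested privilege level


def decode_selector(sel: int) -> Selector:
    """Split a 16-bit protected-mode selector into its three fields."""
    return Selector(sel >> 3, "LDT" if sel & 0x4 else "GDT", sel & 0x3)


def translate(sel: int, offset: int, tables: Dict[str, Dict[int, int]]) -> int:
    """Look up the 24-bit segment base and add the offset (no limit checks).

    `tables` is a hypothetical stand-in for the descriptor tables:
    table name -> {index: base address}.
    """
    s = decode_selector(sel)
    base = tables[s.table][s.index]
    return (base + offset) & 0xFFFFFF  # 24-bit physical address (16 MB)


tables = {"GDT": {1: 0x123456}, "LDT": {}}
print(decode_selector(0x001B))
print(hex(translate(0x0008, 0x0010, tables)))
```

On the 8086, by contrast, the segment value feeds straight into the address calculation with no table lookup.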
        
         | klelatti wrote:
         | Hi Ken, Thanks for a great post as usual. I forgot HN can mess
         | up titles - hopefully fixed now!
        
       | ajross wrote:
       | This shocked me a little:
       | 
       | > the 8088 has a 4-byte prefetch queue instead of a 6-byte
       | prefetch queue
       | 
       | My whole professional life, I've been led to believe that an 8088
       | is just an 8086 with a tiny state machine bolted in front of the
       | bus to split multiword accesses. That it has design differences
       | is really surprising. If the prefetch queue is smaller, then...
       | what are they putting in that die area? Or did they redo the
       | layout in other ways?
        
         | rep_lodsb wrote:
         | Intel removed / disabled the last two bytes in the prefetch
         | queue because they found that it actually made the CPU faster
         | (because of fewer wasted prefetch cycles on jump instructions).
         | 
         | The 8086 didn't have that problem, because it only started a
         | new prefetch when there were two bytes free in the queue, so
         | there was enough time for the jump microcode to stop it.
         | 
         |  _edit: already mentioned in the article_
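
The prefetch rule described above can be put into a toy model. The 8086 numbers (6-byte queue, prefetch only when two bytes are free) come from this thread. The 8088 threshold of one free byte is an assumption based on its byte-wide bus, and the class and method names are hypothetical. Real hardware has more cases, such as fetches from odd addresses, which are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefetchModel:
    name: str
    queue_bytes: int  # total queue capacity in bytes
    fetch_bytes: int  # bytes delivered per bus cycle (assumed threshold)

    def should_start_prefetch(self, bytes_in_queue: int) -> bool:
        """Start a fetch only if the queue has room for a full bus transfer."""
        free = self.queue_bytes - bytes_in_queue
        return free >= self.fetch_bytes


I8086 = PrefetchModel("8086", queue_bytes=6, fetch_bytes=2)
I8088 = PrefetchModel("8088", queue_bytes=4, fetch_bytes=1)  # assumption

for cpu in (I8086, I8088):
    for filled in range(cpu.queue_bytes + 1):
        print(cpu.name, filled, cpu.should_start_prefetch(filled))
```

In this model the 8086 sits idle with a single free byte, while the 8088 starts another fetch as soon as any byte is free.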
        
         | adrian_b wrote:
         | This difference between 8088 and 8086 was used by several MS-
         | DOS programs to detect the type of the CPU on which they were
         | run.
         | 
         | It was easy to measure the length of the prefetch queue by a
         | self-modifying program, which replaced the instruction placed 4
         | bytes after the current instruction pointer.
         | 
         | On an 8088, the new instruction was executed, while on an 8086
         | the old instruction was executed.
         | 
         | Various other such obscure differences (e.g. the value of a
         | pushed stack pointer or the result of a shift operation with a
         | shift count greater than the register size) were used to
         | identify NEC V20 and V30, Intel 80186, 80188, 80286 and 80386,
           | because architectural identification means, like the CPUID
           | instruction, were introduced only with the Intel Pentium and
           | the late 486 models that were launched after the Pentium.
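
The self-modifying-code test described in this comment can be modeled in a few lines. The sketch assumes the prefetch queue is completely full at the moment the store executes, which is a simplification. The function names are hypothetical, and this is a model of the reasoning rather than code that would run on the CPUs in question.

```python
def executes_old_byte(queue_bytes: int, distance: int) -> bool:
    """True if the byte `distance` bytes past the instruction pointer is
    already in the prefetch queue, so the stale copy is what executes.

    Assumes the queue is full, holding offsets 0 .. queue_bytes - 1.
    """
    return distance < queue_bytes


def guess_cpu(old_byte_executed: bool) -> str:
    """Interpret the outcome of patching the byte 4 bytes ahead."""
    return "8086" if old_byte_executed else "8088"


PATCH_DISTANCE = 4  # from the comment: the patched byte is 4 bytes ahead

for name, queue_bytes in (("8086", 6), ("8088", 4)):
    stale = executes_old_byte(queue_bytes, PATCH_DISTANCE)
    print(name, "runs the", "old" if stale else "new", "instruction ->",
          guess_cpu(stale))
```

With a 6-byte queue the patched byte has already been fetched, so the old instruction runs. With a 4-byte queue it has not, so the new one runs, matching the behavior described above.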
        
           | userbinator wrote:
           | The P6 family was when Intel decided to add circuitry to
           | detect SMC and flush the queue, effectively making it appear
           | to have a queue size of 0. AFAIK all x86 henceforth have been
           | the same.
        
             | kens wrote:
             | I tried to track this down but I found sources claiming
             | this was added to the 486, Pentium, and Pentium Pro (P6).
             | So if you know for sure that it's P6, please let me know
             | the details.
        
           | robocat wrote:
           | > It was easy to measure the length of the prefetch queue by
           | a self-modifying program
           | 
           | It became hard on the 386, as linked in the article:
           | http://www.rcollins.org/secrets/PrefetchQueue.html
        
         | kens wrote:
         | The 8088 die is almost the same as the 8086, but there are a
         | few differences. There are two prefetch queue registers instead
         | of three. But the 8088 queue registers are slightly more
         | complicated because the chip needs to write one byte at a time
         | instead of one word. So there's just a small gap in the layout.
         | There are various changes in the memory cycle circuitry. And
         | there are a few changes to the microcode. Once I finish with
         | the 8086, I plan to look at the changes in the 8088 more
         | closely.
        
       ___________________________________________________________________
       (page generated 2023-01-02 23:00 UTC)