Subj : clocks vs. bus cycles
To   : Robin Sheppard
From : David Noon
Date : Sun Oct 21 2001 03:43 am

Hi Robin,

Replying to a message of Robin Sheppard to David Noon:

[snip]

 DN>> frequency. Ultimately, one is hostage to the speed of the
 DN>> motherboard and RAM, not the CPU.

 RS> Okay, that all makes sense. Now, with instructions that have a lower
 RS> listed clock cycle "cost" than the motherboard bus rate, such as a
 RS> one-clock MOV mem, reg on my clock-doubled p133, would the total
 RS> clocks be 2, or would it be 1 to execute the instruction, plus 2 for
 RS> the motherboard bus to catch up? In other words, can I assume that
 RS> any instructions that are listed as 1 or 2 cycles will take at most 2
 RS> clocks on my machine?

The data should have been staged into cache by an early phase of the
pipelined execution, so it will usually be 1 CPU cycle, rather than 1 bus
cycle. This isn't guaranteed, especially if the instruction is the target
of an unpredicted successful (i.e. taken) branch instruction. [Since the
80486 had neither a branch prediction buffer nor a speculative execution
unit, all branches on that and earlier processors are always unpredicted.]

 RS> As for instructions that access memory twice during their execution,
 RS> like XCHG mem, reg or MOVS, how does this affect them? I've always
 RS> been of the impression that MOVS (and many of the string
 RS> instructions, for that matter) was relatively useless; it's listed
 RS> in my TASM guide (which only covers up to i486 instructions) as
 RS> being 7 clocks, yet it can be duplicated by 2 MOVs and 2 INCs or
 RS> ADDs (depending on operand size), which are all listed as one-clock
 RS> instructions.

The string instructions (MOVS, SCAS, etc.) are designed to be prefixed
with REP/REPE or REPNE. This repetition prefix eliminates the repeated
fetches, decodes, etc. when processing a string of bytes, as the one
instruction keeps running until the string is processed or an error
occurs. Thus, the timing for MOVS is not simply 7 clocks, but something
like (6+n) clocks, where n is the number of bytes/words/dwords moved.

Be aware that the 80486 introduced some rather slick microcode to handle
string instructions, and this persists in the newer models. But that
microcode only starts to perform well once you are processing more than
about 8 bytes. For 8 or fewer bytes, using registers to move the data in
chunks of up to 4 bytes is slightly faster, provided you schedule the
instructions to avoid access interlocks on both memory and registers.

You might also notice that MOVSD can move a dword of data in the same
time that MOVSB can move a byte. This means that REP MOVSD can move
strings approximately 4 times faster than REP MOVSB.

A REP MOVSx instruction does have another important use: it is
intrinsically conditional on CX and does nothing when the count is zero,
so the successful branch penalty of an explicit test and jump is avoided.
I have tacked a couple of rough sketches onto the end of this message to
illustrate these points.

 DN>> For an on-die L1 cache (e.g. almost all Intel and AMD processors) you
 DN>> can assume it to be 1 clock. Many of the early Intel P-III
 DN>> processors had off-die L2 caches that ran at half CPU speed, so the
 DN>> fetch time could be 2 clocks. These have, fortunately, disappeared
 DN>> from the shops.

 RS> What about older processors, like the Pentium (no "Pro", "II", etc)?
 RS> Would those caches run at the full CPU speed?

Yes, if the cache is on-die. IIRC, the Pentium and Pentium Pro both used
on-die L1 caches. I can look it up if you want to be really certain.
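Here is the first sketch: the register approach to a short, fixed-size
move. It is only an illustration, assuming a flat 32-bit model; the
labels src and dst and the 8-byte size are my own inventions.

        ; Copy 8 bytes from src to dst with plain MOVs instead of MOVS.
        ; Using two different registers lets the loads and stores pair
        ; on the Pentium's U and V pipes without a register interlock.
        mov     eax, dword ptr [src]    ; load bytes 0..3
        mov     edx, dword ptr [src+4]  ; load bytes 4..7
        mov     dword ptr [dst], eax    ; store bytes 0..3
        mov     dword ptr [dst+4], edx  ; store bytes 4..7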
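The second sketch shows the REP MOVSD approach for longer strings. The
setup is assumed rather than shown: ESI and EDI already point at the
source and destination, and ECX holds the length in bytes.

        ; Move whole dwords first, then mop up the 0..3 leftover bytes.
        ; Neither REP needs a JCXZ guard: REP does nothing if ECX is 0.
        cld                     ; direction flag clear: auto-increment
        mov     edx, ecx        ; save the byte count
        shr     ecx, 2          ; ECX = count of whole dwords
        rep     movsd           ; 4 bytes per iteration
        mov     ecx, edx        ; recover the byte count
        and     ecx, 3          ; ECX = leftover bytes
        rep     movsb           ; copy the tail, if any

The zero-count behaviour of REP in that last line is exactly the
branchless conditional move I described above.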
 RS> I understand that SRAM would be faster than DRAM because you'd never
 RS> run into a case where the CPU had to wait on RAM refresh, but what's
 RS> the deal with "tagging for associative access"? I thought memory
 RS> locations were simply "associated" with their addresses.

Not all SRAM is tagged. E.g. most modern mainframes use SRAM for main
memory.

The way CPU and motherboard caches work is that each location in main
memory (typically some type of DRAM in the Intel 80x86 world) that is
represented in the cache has both a data component, which mirrors the
main memory content, and a "tag", which records where that data resides
in main memory. Such an area is usually called a "line" in the cache,
and its data size is not necessarily the same as the word size of the
machine.

The tag section has special comparator circuitry that allows all tags in
the cache to be compared in parallel, so that a single check can be made
when a lookaside is required. It also helps keep the cache coherent, as
this circuitry should ensure that no 2 cache lines carry the same tag.

The tagging circuitry is quite expensive, which is why "tagged SRAM" is
usually many times more expensive than normal SRAM.
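If it helps to make the tag idea concrete, here is a rough software
illustration, in the same assembler style as above, of how a
direct-mapped cache might carve a 32-bit address into its fields. The
widths (32-byte lines, 128 lines) are invented for the example and match
no particular chip; assume the address arrives in EAX.

        ; 32-byte line -> low 5 bits = offset within the line
        ; 128 lines    -> next 7 bits = line index
        ; remaining 20 bits = the tag stored alongside the line's data
        mov     ebx, eax
        and     ebx, 31         ; EBX = byte offset within the line
        mov     edx, eax
        shr     edx, 5
        and     edx, 127        ; EDX = line index
        shr     eax, 12         ; EAX = tag (5 + 7 low bits dropped)
        ; On a lookup the cache compares this tag with the tag stored
        ; at the indexed line (or with all tags in parallel, in an
        ; associative design) to decide whether it has a hit or a miss.

Regards

Dave

--- FleetStreet 1.25.1
 * Origin: My other computer is an IBM S/390 (2:257/609.5)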