Subj : clocks vs. bus cycles
To   : Robin Sheppard
From : David Noon
Date : Sun Oct 21 2001 03:43 am

Hi Robin,

Replying to a message of Robin Sheppard to David Noon:

[snip]

 DN>> frequency. Ultimately, one is hostage to the speed of the
 DN>> motherboard and RAM, not the CPU.

 RS> Okay, that all makes sense. Now, with instructions that have a lower
 RS> listed clock cycle "cost" than the motherboard bus rate, such as a
 RS> one-clock MOV mem, reg on my clock-doubled p133, would the total
 RS> clocks be 2, or would it be 1 to execute the instruction, plus 2 for
 RS> the motherboard bus to catch up? In other words, can I assume that
 RS> any instructions that are listed as 1 or 2 cycles will take at most 2
 RS> clocks on my machine?

The data should have been staged into cache by an early phase of the
pipelined execution, so it will usually be 1 CPU cycle, rather than 1 bus
cycle. This isn't guaranteed, especially if the instruction is the target
of an unpredicted successful (i.e. taken) branch instruction. [Since the
80486 had neither a branch prediction buffer nor a speculative execution
unit, all branches on that and earlier processors are always unpredicted.]

 RS> As for instructions that access memory twice during their execution,
 RS> like XCHG mem, reg or MOVS, how does this affect them? I've always
 RS> been of the impression that MOVS (and many of the string
 RS> instructions, for that matter) was relatively useless; it's listed
 RS> in my TASM guide (which only covers up to i486 instructions) as
 RS> being 7 clocks, yet it can be duplicated by 2 MOVs and 2 INCs or
 RS> ADDs (depending on operand size), which are all listed as one-clock
 RS> instructions.

The string instructions (MOVS, SCAS, etc.) are designed to be prefixed
with REP/REPE or REPNE. This repetition prefix eliminates the repeated
fetches, decodes, etc. when processing a string of bytes, as the one
instruction keeps running until the string is processed or an error
occurs. Thus, the timing for MOVS is not simply 7 clocks, but something
like (6+n) clocks, where n is the number of bytes/words/dwords moved.

Be aware that the 80486 introduced some rather slick microcode to handle
string instructions, and this persists in the newer models. But that
microcode only starts to perform well once you are processing more than
about 8 bytes. For 8 or fewer bytes, using registers to move the data in
chunks of up to 4 bytes is slightly faster, provided you schedule the
instructions to avoid access interlocks on both memory and registers.

You might also notice that MOVSD can move a dword of data in the same
time that MOVSB can move a byte. This means that REP MOVSD can move
strings approximately 4 times faster than REP MOVSB.

A REP MOVSx instruction does have another important use: it is
intrinsically conditional on CX and does nothing when the count is zero,
so the successful branch penalty of an explicit test and jump is avoided.
I have tacked a couple of rough sketches onto the end of this message to
illustrate these points.

 DN>> For an on-die L1 cache (e.g. almost all Intel and AMD processors) you
 DN>> can assume it to be 1 clock. Many of the early Intel P-III
 DN>> processors had off-die L2 caches that ran at half CPU speed, so the
 DN>> fetch time could be 2 clocks. These have, fortunately, disappeared
 DN>> from the shops.

 RS> What about older processors, like the Pentium (no "Pro", "II", etc)?
 RS> Would those caches run at the full CPU speed?

Yes, if the cache is on-die. IIRC, the Pentium and Pentium Pro both used
on-die L1 caches. I can look it up if you want to be really certain.
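Here is the first sketch: the register approach to a short, fixed-size
move. It is only an illustration, assuming a flat 32-bit model; the
labels src and dst and the 8-byte size are my own inventions.

        ; Copy 8 bytes from src to dst with plain MOVs instead of MOVS.
        ; Using two different registers lets the loads and stores pair
        ; on the Pentium's U and V pipes without a register interlock.
        mov     eax, dword ptr [src]    ; load bytes 0..3
        mov     edx, dword ptr [src+4]  ; load bytes 4..7
        mov     dword ptr [dst], eax    ; store bytes 0..3
        mov     dword ptr [dst+4], edx  ; store bytes 4..7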
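The second sketch shows the REP MOVSD approach for longer strings. The
setup is assumed rather than shown: ESI and EDI already point at the
source and destination, and ECX holds the length in bytes.

        ; Move whole dwords first, then mop up the 0..3 leftover bytes.
        ; Neither REP needs a JCXZ guard: REP does nothing if ECX is 0.
        cld                     ; direction flag clear: auto-increment
        mov     edx, ecx        ; save the byte count
        shr     ecx, 2          ; ECX = count of whole dwords
        rep     movsd           ; 4 bytes per iteration
        mov     ecx, edx        ; recover the byte count
        and     ecx, 3          ; ECX = leftover bytes
        rep     movsb           ; copy the tail, if any

The zero-count behaviour of REP in that last line is exactly the
branchless conditional move I described above.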
 RS> I understand that SRAM would be faster than DRAM because you'd never
 RS> run into a case where the CPU had to wait on RAM refresh, but what's
 RS> the deal with "tagging for associative access"? I thought memory
 RS> locations were simply "associated" with their addresses.

Not all SRAM is tagged. E.g. most modern mainframes use SRAM for main
memory.

The way CPU and motherboard caches work is that each location in main
memory (typically some type of DRAM in the Intel 80x86 world) that is
represented in the cache has both a data component, which mirrors the
main memory content, and a "tag", which records where that data resides
in main memory. Such an area is usually called a "line" in the cache,
and its data size is not necessarily the same as the word size of the
machine.

The tag section has special comparator circuitry that allows all tags in
the cache to be compared in parallel, so that a single check can be made
when a lookaside is required. It also helps keep the cache coherent, as
this circuitry should ensure that no 2 cache lines carry the same tag.

The tagging circuitry is quite expensive, which is why "tagged SRAM" is
usually many times more expensive than normal SRAM.
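If it helps to make the tag idea concrete, here is a rough software
illustration, in the same assembler style as above, of how a
direct-mapped cache might carve a 32-bit address into its fields. The
widths (32-byte lines, 128 lines) are invented for the example and match
no particular chip; assume the address arrives in EAX.

        ; 32-byte line -> low 5 bits = offset within the line
        ; 128 lines    -> next 7 bits = line index
        ; remaining 20 bits = the tag stored alongside the line's data
        mov     ebx, eax
        and     ebx, 31         ; EBX = byte offset within the line
        mov     edx, eax
        shr     edx, 5
        and     edx, 127        ; EDX = line index
        shr     eax, 12         ; EAX = tag (5 + 7 low bits dropped)
        ; On a lookup the cache compares this tag with the tag stored
        ; at the indexed line (or with all tags in parallel, in an
        ; associative design) to decide whether it has a hit or a miss.

Regards

Dave

--- FleetStreet 1.25.1
 * Origin: My other computer is an IBM S/390 (2:257/609.5)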