Newsgroups: comp.arch
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!mit-eddie!uw-beaver!rice!ariel.rice.edu!preston
From: preston@ariel.rice.edu (Preston Briggs)
Subject: Re: cache pre-load/no-load instructions
Message-ID: <1991Mar21.161044.2898@rice.edu>
Sender: news@rice.edu (News)
Organization: Rice University, Houston
References: <JONATHAN.91Mar17034438@speedy.cs.pitt.edu> <765@ajpo.sei.cmu.edu>
Date: Thu, 21 Mar 91 16:10:44 GMT

jonathan@cs.pitt.edu (Jonathan Eunice) writes:
>>Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from 
>>
>>1)  cache pre-load instructions (the compiler inserts these into the
>>instr stream, and hopefully, the appropriate cache line will be available
>>by the time it's needed, avoiding delays and speeding up single-task 
>>execution) 
>>
>>2) cache no-load hints as a part of store instructions (useful to avoid
>>useless cache loading for initialization statements, for faster program
>>startup, and perhaps in other situations too)

At the upcoming ASPLOS, there's a paper called "Software Prefetching",
by Callahan, Kennedy, and Porterfield, describing compiler mechanisms
to take advantage of cache pre-fetch instructions (item 1 above).  The
techniques seem very effective for scientific code.
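
To make the idea concrete, here is a rough sketch of what prefetching
looks like when written by hand in C.  This is not the algorithm from
the paper, just an illustration of the general shape.  It assumes a
GCC- or Clang-style compiler that provides __builtin_prefetch; the
name dot_prefetch and the PREFETCH_DISTANCE value are made up for the
example, and a real distance would have to be tuned to the machine's
memory latency.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Dot product of x and y, hinting the cache about data needed soon. */
double dot_prefetch(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Hint only; it never changes the computed result.
           Arguments: address, 0 = read access, 0 = low temporal locality. */
        if (i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 0);
            __builtin_prefetch(&y[i + PREFETCH_DISTANCE], 0, 0);
        }
#endif
        sum += x[i] * y[i];
    }
    return sum;
}
```

Issuing a prefetch on every iteration is wasteful, since several
consecutive elements share a cache line.  A compiler would normally
unroll the loop and issue one prefetch per line.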

The RS/6000 offers two interesting possibilities.  One is an
instruction that zeroes a line in the data cache without fetching it
from memory.  It can be used like item 2 above, and it's also handy
for zeroing big chunks of memory.  The other is an "invalidate line"
instruction, which says: "don't bother writing this one back to memory."

>>How effective are these optimizations likely to be?  (While they aren't going
>>to give the same kind of speedup as making the system super-scalar or 
>>super-pipelined, they strike me as effective tweaks.)  

This sort of thing can be very important.  One of the basic problems
of the i860 (for example) is its low off-chip memory bandwidth,
at least in relation to its FP performance.  Instruction-level
parallelism (pipelines, wide instructions, superscalar, speculative execution)
is fine for getting the FP performance up, but the processor will starve
without lots of bandwidth.
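
A back-of-the-envelope sketch of the balance problem, in C.  The
kernel (y[i] += a * x[i] on doubles) and the 50 MFLOPS figure used
with it below are illustrative assumptions, not measurements of the
i860 or any other chip, and the function names are made up for the
example.

```c
#include <assert.h>

/* Memory traffic per flop for y[i] += a * x[i] on 8-byte doubles,
   assuming nothing is reused from cache:
   2 loads + 1 store = 24 bytes, 1 multiply + 1 add = 2 flops. */
double daxpy_bytes_per_flop(void)
{
    const double bytes_per_iter = 3.0 * 8.0;
    const double flops_per_iter = 2.0;
    return bytes_per_iter / flops_per_iter;
}

/* Bandwidth (bytes/sec) needed to sustain a given flop rate. */
double required_bandwidth(double flops_per_sec, double bytes_per_flop)
{
    assert(flops_per_sec >= 0.0 && bytes_per_flop >= 0.0);
    return flops_per_sec * bytes_per_flop;
}
```

With these assumptions, the kernel needs 12 bytes per flop, so
sustaining a hypothetical 50 MFLOPS would take 600 MB/sec from
memory.  Anything less and the FP units sit idle, however much
instruction-level parallelism is available.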

Preston Briggs
