Newsgroups: comp.benchmarks
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!van-bc!ubc-cs!uw-beaver!rice!ariel.rice.edu!preston
From: preston@ariel.rice.edu (Preston Briggs)
Subject: Re: A benchmark
Message-ID: <1991May3.053053.29174@rice.edu>
Keywords: snake performance
Sender: news@rice.edu (News)
Organization: Rice University, Houston
References: <1991May3.023705.5616@marlin.jcu.edu.au>
Date: Fri, 3 May 91 05:30:53 GMT

csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

>     CPU times for the Geophysical Fluid Dynamics Model MOM_1.0 (Fortran).

>     The code is well vectorized, and consists mostly of floating point
>     multiplies/additions.

It's important to remember that code that is ideal for a vector machine
is not necessarily ideal for a scalar (or super-scalar) machine.
Yes, the IBM and HP machines can cook on vector code,
but often it can be rearranged for even better performance.

For example, on a vector machine, we don't like to see recurrences
in the inner loop.  On scalar machines, they're desirable.

A simple (contrived) example:

This is ok for vector machines


	DO j = 1, n
	    DO i = 1, n
		A(i) = A(i) + B(j)
	    ENDDO
	ENDDO

but this is better for scalar machines
(and terrible for vector machines, because of the recurrence on A(i))

	DO i = 1, n
	    DO j = 1, n
		A(i) = A(i) + B(j)
	    ENDDO
	ENDDO

Why?  In the first case, we'll hold the inner-loop invariant B(j) in a 
register.  Therefore, we'll require 1 load and 1 store for each flop.
In the second case, we'll hold A(i) in a register
across the inner loop, requiring only 1 load per flop, with no stores
in the inner loop.

We can further munch the second example by unrolling the outer loop
and jamming the resulting inner-loop bodies together
(assuming here that n is a multiple of 4;
otherwise we'd need a short cleanup loop for the leftovers)

	DO i = 1, n, 4
	    DO j = 1, n
		A(i+0) = A(i+0) + B(j)
		A(i+1) = A(i+1) + B(j)
		A(i+2) = A(i+2) + B(j)
		A(i+3) = A(i+3) + B(j)
	    ENDDO
	ENDDO


In this case, we'll hold 4 parts of A in registers, and require
only one load of B for every 4 flops.  This also helps get better
scheduling for the pipelines.

So, the point is that the results of measuring "well vectorized"
code will tend to favor vector machines.  By reworking the code
(a lot?), a la the Perfect Club, you should be able to achieve
even better performance on the scalar machines.

Preston Briggs

