Performance and Library Issues for Mathematical Software on High Performance Computers
J.J. Dongarra and D.C. Sorensen

Abstract
This paper discusses some of the fundamental issues facing designers of mathematical software libraries for medium-scale parallel processors such as the CRAY X-MP-4 and the Denelcor HEP. We discuss the problems that arise with performance and demonstrate that it may be appropriate to exploit parallelism at all levels of the program, not just at the highest level. We give performance measurements indicating the efficiency of a linear algebra library written in terms of a few high-level modules. These modules, chosen at the matrix-vector level, extend the concept of the BLAS [13] and provide enough computational granularity to allow efficient implementations on a wide variety of architectures. Only three modules must be recoded for efficiency in order to transport the library to various machines. We report experience on machines as diverse as the CRAY X-MP and the Denelcor HEP. Finally, we report on some special algorithms for the HEP which take advantage of its fine-grain parallelism capabilities.

Comparison of the CRAY X-MP-4, Fujitsu VP-200, and Hitachi S-810/20: An Argonne Perspective
Jack J. Dongarra and Alan Hinds

Abstract
A set of programs, gathered from major Argonne computer users, was run on the current generation of supercomputers: the CRAY X-MP-4, Fujitsu VP-200, and Hitachi S-810/20. The results show that a single processor of a CRAY X-MP-4 is a consistently strong performer over a wide range of problems. The Fujitsu and Hitachi computers excel on highly vectorized programs and offer an attractive opportunity to sites with IBM-compatible computers.

A Fully Parallel Algorithm for the Symmetric Eigenvalue Problem
J.J. Dongarra and D.C. Sorensen

In this paper we present a parallel algorithm for the symmetric algebraic eigenvalue problem. The algorithm is based upon a divide and conquer scheme suggested by Cuppen for computing the eigensystem of a symmetric tridiagonal matrix. We extend this idea to obtain a parallel algorithm that retains a number of active parallel processes that is greater than or equal to the initial number throughout the course of the computation. We give a new deflation technique which, together with a robust root-finding technique, will assure computation of an eigensystem to full accuracy in the residuals and in the orthogonality of eigenvectors. A brief analysis of the numerical properties and sensitivity to roundoff error is presented to indicate where numerical difficulties may occur. The algorithm is able to exploit parallelism at all levels of the computation and is well suited to a variety of architectures.

Computational results are presented for several machines. These results are very encouraging with respect to both accuracy and speedup. A surprising result is that the parallel algorithm, even when run in serial mode, can be significantly faster than the previously best sequential algorithm on large problems, and it is effective even on problems of moderate size.
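The tearing at the heart of Cuppen's scheme is standard in the literature; the notation below is ours, for illustration, and is not taken from the paper. The symmetric tridiagonal matrix T of order n is split by a rank-one modification,

    T = \begin{pmatrix} T_1 & 0 \\ 0 & T_2 \end{pmatrix} + \rho\, v v^{T},
    \qquad T_i = Q_i D_i Q_i^{T} \quad (i = 1, 2),

so that, with Q = \operatorname{diag}(Q_1, Q_2) and z = Q^{T} v,

    T = Q \,\bigl(D + \rho\, z z^{T}\bigr)\, Q^{T},
    \qquad D = \operatorname{diag}(D_1, D_2),

and the eigenvalues of T are the roots \lambda of the secular equation

    f(\lambda) = 1 + \rho \sum_{j=1}^{n} \frac{z_j^{2}}{d_j - \lambda} = 0 .

Each root can be located independently by a robust root finder, which is the source of the parallelism, and the halves T_1 and T_2 can themselves be torn recursively.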
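The first abstract above describes a library built from a few matrix-vector modules, and the next two entries describe multitasked LU and Cholesky factorizations. As a minimal illustration of the kind of kernel involved (this routine is our own sketch, without pivoting, and is not code from any of the papers), one right-looking step of LU decomposition scales the pivot column and then applies a rank-one update to the trailing submatrix:

      SUBROUTINE LUSTEP(A, LDA, N, K)
C     One right-looking elimination step of LU decomposition.
C     Illustrative sketch only: no pivoting, no singularity check.
      INTEGER LDA, N, K, I, J
      DOUBLE PRECISION A(LDA,*)
C     Scale the pivot column by the pivot element.
      DO 10 I = K+1, N
         A(I,K) = A(I,K)/A(K,K)
   10 CONTINUE
C     Rank-one update of the trailing (N-K) by (N-K) submatrix.
C     Each column update is an independent vector operation,
C     which is where a multitasked version can split the work.
      DO 30 J = K+1, N
         DO 20 I = K+1, N
            A(I,J) = A(I,J) - A(I,K)*A(K,J)
   20    CONTINUE
   30 CONTINUE
      RETURN
      END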
Multiprocessing Linear Algebra Algorithms on the CRAY X-MP-2: Experiences with Small Granularity
Steve C. Chen, Jack J. Dongarra, and Christopher C. Hsiung

Abstract
This paper gives a brief overview of the CRAY X-MP-2 general-purpose multiprocessor system and discusses how it can be used effectively to solve problems that have small granularity. An implementation is described for linear algebra algorithms that solve systems of linear equations when the matrix is general and when the matrix is symmetric and positive definite.

Implementing Dense Linear Algebra Algorithms Using Multitasking on the CRAY X-MP-4 (or Approaching the Gigaflop)
Jack J. Dongarra and Tom Hewitt

Abstract
This note describes some experiments on simple, dense linear algebra algorithms. These experiments show that the CRAY X-MP is capable of small-grain multitasking arising from standard implementations of LU and Cholesky decomposition. The implementation described here provides the "fastest" execution rate for LU decomposition, 718 MFLOPS for a matrix of order 1000.

Distribution of Mathematical Software via Electronic Mail
Jack J. Dongarra and Eric Grosse

A large collection of public-domain mathematical software is now available via electronic mail. Messages sent to "netlib@anl-mcs" (on the Arpanet/CSNET) or to "research!netlib" (on the UNIX® network) wake up a server that distributes items from the collection. For example, the one-line message "send index" gets a library catalog by return mail. We describe how to use the service and some of the issues in its implementation.

Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment
Jack J. Dongarra

Abstract
This note compares the performance of different computer systems while solving dense systems of linear equations using the LINPACK software in a Fortran environment. About 100 computers, ranging from a CRAY X-MP through 68000-based systems such as the Apollo and SUN workstations to IBM PCs, are compared.

Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine
J.J. Dongarra, F.G. Gustavson, and A. Karp

Abstract
This paper examines common implementations of linear algebra algorithms, such as matrix-vector multiplication, matrix-matrix multiplication, and the solution of linear equations. The different versions are examined for efficiency on a computer architecture that uses vector processing and has pipelined instruction execution. By using the advanced architectural features of such machines, one can usually achieve near-maximum performance, and dramatic improvements in execution speed over conventional computers can be obtained.

A Proposal for an Extended Set of Fortran Basic Linear Algebra Subprograms
Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson

Abstract
This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions proposed are targeted at matrix-vector operations, which should provide for more efficient and portable implementations of algorithms for high-performance computers.

Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers
Jack J. Dongarra, Linda Kaufman, and Sven Hammarling

Abstract
This paper describes modifications to many of the standard algorithms used in computing eigenvalues and eigenvectors of matrices. These modifications can dramatically increase the performance of the underlying software on high-performance computers without resorting to assembler language, without significantly influencing the floating-point operation count, and without affecting the roundoff-error properties of the algorithms. The techniques are applied to a wide variety of algorithms and are beneficial in various architectural settings.
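The implementations examined in the Dongarra, Gustavson, and Karp paper above differ mainly in which loop runs innermost. A minimal Fortran sketch of the column-oriented (SAXPY-like) form of y = y + A*x, whose inner loop runs down a column of A with stride one (the routine name and interface are ours, for illustration only):

      SUBROUTINE DMXV(A, LDA, M, N, X, Y)
C     Column-sweep form of y = y + A*x.  The inner loop makes
C     stride-one references to A and Y, which suits a vector
C     pipeline machine; interchanging the two loops gives the
C     dot-product (row-oriented) form, whose inner loop strides
C     across a row of A with stride LDA.
      INTEGER LDA, M, N, I, J
      DOUBLE PRECISION A(LDA,*), X(*), Y(*)
      DO 20 J = 1, N
         DO 10 I = 1, M
            Y(I) = Y(I) + A(I,J)*X(J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END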
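The matrix-vector operations targeted by the extended BLAS proposal above were eventually standardized as the Level 2 BLAS. In that later interface (the calling sequence shown is the standardized one and may differ in detail from the proposal itself), the operation y := alpha*A*x + beta*y becomes a single portable call:

C     y := alpha*A*x + beta*y for a general M by N matrix A.
C     'N' requests no transpose; the unit strides refer to the
C     vectors X and Y.  Calling sequence as standardized in the
C     Level 2 BLAS, not necessarily verbatim from the proposal.
      CALL DGEMV('N', M, N, ALPHA, A, LDA, X, 1, BETA, Y, 1)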
Squeezing the Most out of an Algorithm in CRAY Fortran
Jack J. Dongarra and Stanley C. Eisenstat

Abstract
This paper describes a technique for achieving super-vector performance on a CRAY-1 in a purely Fortran environment (i.e., without resorting to assembler language). The technique can be applied to a wide variety of algorithms in linear algebra and is beneficial in other architectural settings.
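One well-known Fortran device for reaching super-vector rates on the CRAY-1 is to unroll the outer loop of a column-sweep matrix-vector product so that each element of the result vector is loaded and stored once per several columns rather than once per column, letting the compiler hold the accumulating vector in a vector register. Whether this is precisely the technique of the paper is our assumption; the sketch below is illustrative and, for brevity, assumes N is a multiple of 4.

C     y = y + A*x with the column sweep unrolled to depth four.
C     Illustrative sketch; assumes N is a multiple of 4 (in
C     general a clean-up loop handles the leftover columns).
      DO 20 J = 1, N, 4
         DO 10 I = 1, M
            Y(I) = Y(I) + A(I,J  )*X(J  ) + A(I,J+1)*X(J+1)
     $                  + A(I,J+2)*X(J+2) + A(I,J+3)*X(J+3)
   10    CONTINUE
   20 CONTINUE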