Performance and Library Issues for Mathematical Software on High Performance Computers
J.J. Dongarra and D.C. Sorensen

Abstract
This paper discusses some of the fundamental issues facing designers of mathematical software libraries for medium-scale parallel processors such as the CRAY X-MP-4 and the Denelcor HEP. We discuss the problems that arise with performance and demonstrate that it may be appropriate to exploit parallelism at all levels of the program, not just at the highest level. We give performance measurements indicating the efficiency of a linear algebra library written in terms of a few high-level modules. These modules, chosen at the matrix-vector level, extend the concept of the BLAS [13] and provide enough computational granularity to allow efficient implementations on a wide variety of architectures. Only three modules must be recoded for efficiency in order to transport the library to various machines. We report experience on machines as diverse as the CRAY X-MP and the Denelcor HEP. Finally, we report on some special algorithms for the HEP which take advantage of its fine-grain parallelism capabilities.

Comparison of the CRAY X-MP-4, Fujitsu VP-200, and Hitachi S-810/20: An Argonne Perspective
Jack J. Dongarra and Alan Hinds

Abstract
A set of programs, gathered from major Argonne computer users, was run on the current generation of supercomputers: the CRAY X-MP-4, Fujitsu VP-200, and Hitachi S-810/20. The results show that a single processor of a CRAY X-MP-4 is a consistently strong performer over a wide range of problems. The Fujitsu and Hitachi computers excel on highly vectorized programs and offer an attractive opportunity to sites with IBM-compatible computers.

A Fully Parallel Algorithm for the Symmetric Eigenvalue Problem
J.J. Dongarra and D.C. Sorensen

In this paper we present a parallel algorithm for the symmetric algebraic eigenvalue problem. The algorithm is based upon a divide and conquer scheme suggested by Cuppen for computing the eigensystem of a symmetric tridiagonal matrix. We extend this idea to obtain a parallel algorithm that retains a number of active parallel processes that is greater than or equal to the initial number throughout the course of the computation. We give a new deflation technique which, together with a robust root-finding technique, will assure computation of an eigensystem to full accuracy in the residuals and in the orthogonality of eigenvectors. A brief analysis of the numerical properties and sensitivity to roundoff error is presented to indicate where numerical difficulties may occur. The algorithm is able to exploit parallelism at all levels of the computation and is well suited to a variety of architectures.

Computational results are presented for several machines. These results are very encouraging with respect to both accuracy and speedup. A surprising result is that the parallel algorithm, even when run in serial mode, can be significantly faster than the previously best sequential algorithm on large problems, and it is effective even on problems of moderate size.
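The tearing at the heart of Cuppen's scheme is standard in the literature; the notation below is ours, for illustration, and is not taken from the paper. The symmetric tridiagonal matrix T of order n is split by a rank-one modification,

    T = \begin{pmatrix} T_1 & 0 \\ 0 & T_2 \end{pmatrix} + \rho\, v v^{T},
    \qquad T_i = Q_i D_i Q_i^{T} \quad (i = 1, 2),

so that, with Q = \operatorname{diag}(Q_1, Q_2) and z = Q^{T} v,

    T = Q \,\bigl(D + \rho\, z z^{T}\bigr)\, Q^{T},
    \qquad D = \operatorname{diag}(D_1, D_2),

and the eigenvalues of T are the roots \lambda of the secular equation

    f(\lambda) = 1 + \rho \sum_{j=1}^{n} \frac{z_j^{2}}{d_j - \lambda} = 0 .

Each root can be located independently by a robust root finder, which is the source of the parallelism, and the halves T_1 and T_2 can themselves be torn recursively.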
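The first abstract above describes a library built from a few matrix-vector modules, and the next two entries describe multitasked LU and Cholesky factorizations. As a minimal illustration of the kind of kernel involved (this routine is our own sketch, without pivoting, and is not code from any of the papers), one right-looking step of LU decomposition scales the pivot column and then applies a rank-one update to the trailing submatrix:

      SUBROUTINE LUSTEP(A, LDA, N, K)
C     One right-looking elimination step of LU decomposition.
C     Illustrative sketch only: no pivoting, no singularity check.
      INTEGER LDA, N, K, I, J
      DOUBLE PRECISION A(LDA,*)
C     Scale the pivot column by the pivot element.
      DO 10 I = K+1, N
         A(I,K) = A(I,K)/A(K,K)
   10 CONTINUE
C     Rank-one update of the trailing (N-K) by (N-K) submatrix.
C     Each column update is an independent vector operation,
C     which is where a multitasked version can split the work.
      DO 30 J = K+1, N
         DO 20 I = K+1, N
            A(I,J) = A(I,J) - A(I,K)*A(K,J)
   20    CONTINUE
   30 CONTINUE
      RETURN
      END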
Multiprocessing Linear Algebra Algorithms on the CRAY X-MP-2: Experiences with Small Granularity
Steve C. Chen, Jack J. Dongarra, and Christopher C. Hsiung

Abstract
This paper gives a brief overview of the CRAY X-MP-2 general-purpose multiprocessor system and discusses how it can be used effectively to solve problems that have small granularity. An implementation is described for linear algebra algorithms that solve systems of linear equations when the matrix is general and when the matrix is symmetric and positive definite.

Implementing Dense Linear Algebra Algorithms Using Multitasking on the CRAY X-MP-4 (or Approaching the Gigaflop)
Jack J. Dongarra and Tom Hewitt

Abstract
This note describes some experiments on simple, dense linear algebra algorithms. These experiments show that the CRAY X-MP is capable of small-grain multitasking arising from standard implementations of LU and Cholesky decomposition. The implementation described here provides the "fastest" execution rate for LU decomposition, 718 MFLOPS for a matrix of order 1000.

Distribution of Mathematical Software via Electronic Mail
Jack J. Dongarra and Eric Grosse

A large collection of public-domain mathematical software is now available via electronic mail. Messages sent to "netlib@anl-mcs" (on the Arpanet/CSNET) or to "research!netlib" (on the UNIX® network) wake up a server that distributes items from the collection. For example, the one-line message "send index" gets a library catalog by return mail. We describe how to use the service and some of the issues in its implementation.

Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment
Jack J. Dongarra

Abstract
This note compares the performance of different computer systems while solving dense systems of linear equations using the LINPACK software in a Fortran environment. About 100 computers, ranging from a CRAY X-MP through 68000-based systems such as the Apollo and SUN workstations to IBM PCs, are compared.

Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine
J.J. Dongarra, F.G. Gustavson, and A. Karp

Abstract
This paper examines common implementations of linear algebra algorithms, such as matrix-vector multiplication, matrix-matrix multiplication, and the solution of linear equations. The different versions are examined for efficiency on a computer architecture that uses vector processing and has pipelined instruction execution. By using the advanced architectural features of such machines, one can usually achieve near-maximum performance, and dramatic improvements in execution speed over conventional computers can be obtained.

A Proposal for an Extended Set of Fortran Basic Linear Algebra Subprograms
Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson

Abstract
This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions proposed are targeted at matrix-vector operations, which should provide for more efficient and portable implementations of algorithms for high-performance computers.

Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers
Jack J. Dongarra, Linda Kaufman, and Sven Hammarling

Abstract
This paper describes modifications to many of the standard algorithms used in computing eigenvalues and eigenvectors of matrices. These modifications can dramatically increase the performance of the underlying software on high-performance computers without resorting to assembler language, without significantly influencing the floating-point operation count, and without affecting the roundoff-error properties of the algorithms. The techniques are applied to a wide variety of algorithms and are beneficial in various architectural settings.
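The implementations examined in the Dongarra, Gustavson, and Karp paper above differ mainly in which loop runs innermost. A minimal Fortran sketch of the column-oriented (SAXPY-like) form of y = y + A*x, whose inner loop runs down a column of A with stride one (the routine name and interface are ours, for illustration only):

      SUBROUTINE DMXV(A, LDA, M, N, X, Y)
C     Column-sweep form of y = y + A*x.  The inner loop makes
C     stride-one references to A and Y, which suits a vector
C     pipeline machine; interchanging the two loops gives the
C     dot-product (row-oriented) form, whose inner loop strides
C     across a row of A with stride LDA.
      INTEGER LDA, M, N, I, J
      DOUBLE PRECISION A(LDA,*), X(*), Y(*)
      DO 20 J = 1, N
         DO 10 I = 1, M
            Y(I) = Y(I) + A(I,J)*X(J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END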
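The matrix-vector operations targeted by the extended BLAS proposal above were eventually standardized as the Level 2 BLAS. In that later interface (the calling sequence shown is the standardized one and may differ in detail from the proposal itself), the operation y := alpha*A*x + beta*y becomes a single portable call:

C     y := alpha*A*x + beta*y for a general M by N matrix A.
C     'N' requests no transpose; the unit strides refer to the
C     vectors X and Y.  Calling sequence as standardized in the
C     Level 2 BLAS, not necessarily verbatim from the proposal.
      CALL DGEMV('N', M, N, ALPHA, A, LDA, X, 1, BETA, Y, 1)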
Squeezing the Most out of an Algorithm in CRAY Fortran
Jack J. Dongarra and Stanley C. Eisenstat

Abstract
This paper describes a technique for achieving super-vector performance on a CRAY-1 in a purely Fortran environment (i.e., without resorting to assembler language). The technique can be applied to a wide variety of algorithms in linear algebra and is beneficial in other architectural settings.
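One well-known Fortran device for reaching super-vector rates on the CRAY-1 is to unroll the outer loop of a column-sweep matrix-vector product so that each element of the result vector is loaded and stored once per several columns rather than once per column, letting the compiler hold the accumulating vector in a vector register. Whether this is precisely the technique of the paper is our assumption; the sketch below is illustrative and, for brevity, assumes N is a multiple of 4.

C     y = y + A*x with the column sweep unrolled to depth four.
C     Illustrative sketch; assumes N is a multiple of 4 (in
C     general a clean-up loop handles the leftover columns).
      DO 20 J = 1, N, 4
         DO 10 I = 1, M
            Y(I) = Y(I) + A(I,J  )*X(J  ) + A(I,J+1)*X(J+1)
     $                  + A(I,J+2)*X(J+2) + A(I,J+3)*X(J+3)
   10    CONTINUE
   20 CONTINUE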