From convex!dodson@a.cs.uiuc.edu Fri Jul 15 12:41:14 1988
Date: Fri, 15 Jul 88 11:46:05 cdt
From: convex!dodson@a.cs.uiuc.edu (Dave Dodson)
Subject: troff source for sparse blas paper

\" tbl file | eqn | nitroff -ms .hw spars-ity .ft 3 .ps 11 .EQ gsize 11 delim @@ .EN .TL .ps 12 .in 0 Sparse Extensions to the FORTRAN .sp Basic Linear Algebra Subprograms .AU .ps 11 .in 0 David S. Dodson .AI .ps 10 .in 0 Convex Computer Corporation 701 N. Plano Road Richardson, Texas 75081 .AU .ps 11 .in 0 Roger G. Grimes .AI .ps 10 .in 0 Boeing Computer Services, M/S 7L-21 P.O. Box 24346 Seattle, Washington 98124-0346 .AU .ps 11 .in 0 John G. Lewis .AI .ps 10 .in 0 Boeing Computer Services, M/S 7L-21 P.O. Box 24346 Seattle, Washington 98124-0346 .FS .ps 10 .vs 11p Typeset on \*(DY. .FE .sp 3 .QS .ps 10 .in .25i .ll -.25i .I Abstract \(em This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extension is targeted at sparse vector operations, with the goal of providing efficient, but portable, implementations of algorithms for high performance computers.
.in .ll .QE .ps 11 .vs 16 .nr PS 11 .nr VS 16 .nr PD 0.5v .NH Introduction .PP In 1973, Hanson, Krogh, and Lawson [16] described the advantages of adopting a set of basic routines for problems in linear algebra. They observed that standardizing such a subroutine library would improve program clarity, portability, modularity, and maintainability. Additionally, if these routines were coded in assembly language for many computers, they would promote efficiency without sacrificing portability. The original basic linear algebra subprograms, now commonly referred to as the BLAS, were fully described in Lawson, Hanson, Kincaid, and Krogh [17,18]. They have been used in a wide range of software, including LINPACK [5], and have become a \fIde facto\fR standard for the elementary vector operations. The success of the BLAS has led Dongarra \fIet al.\fR [6,7] to propose extended sets of subprograms targeted toward efficiency in dense linear algebra on a broader class of computers. .PP Many codes now exist for solving sparse linear systems, e.g., MA28 [8], the Yale Sparse Matrix Package [10], SPARSPAK [11], and ICCG/ILU preconditioning [20]. Sparse matrix computations arise in many fields. Furthermore, much effort is being expended in other sparse matrix computations such as sparse least squares [12,13] and eigenvalue extraction [14,15]. Just as the original BLAS have already demonstrated their utility for dense matrix computations, further development in sparse matrix algorithms will be well served by sparse extensions to the BLAS. Indeed, the reasons that justified the original BLAS development hold even more strongly for the Sparse BLAS. .PP The clarity and modularity of a code are almost always improved by the use of higher level constructs. This is certainly true in the context of sparse linear algebra, where the ordinary matrix algebra operations occur in a complicated framework of data structures representing the sparsity of these operations. The identification of the fundamental operations eases the conceptual problems of writing and understanding such codes. .PP Sparse extensions have their most profound effect in promoting software portability with efficiency across various machine types via tuned implementations [19]. This is a far more important issue for the Sparse BLAS than for the ordinary BLAS because three of the fundamental operations cannot be written in portable FORTRAN in such a fashion that compilers can automatically optimize them for high performance architectures. The syntactic structure of the code for these operations suggests too much generality to permit safely executing them in vector mode or on distributed processors. It is their use in the context of the Sparse BLAS that assures the implementor that the usage meets the conditions necessary for optimization. Often such tuned implementations require no more than the addition of machine-specific compiler directives. Yet it is precisely these tuned implementations that will ease the transfer of much of the existing sparse linear algebra methodology to computers with vector or parallel architectures. .PP In the light of recent efforts [6,7] to extend the scope of the BLAS to routines involving @n sup 2@ or @n sup 3@ operations, a proposal for standardizing routines that involve far fewer than @n@ operations may appear archaic. The granularity of the routines specified here is low, but the analogy to dense matrix operations is misleading.
As mentioned above, these routines may be essential to obtaining high performance in a portable code. In such environments, the overhead of additional subroutine linkages is easily absorbed. It is also the case that these routines are called less often than their dense counterparts. The factorization of a sparse matrix results in a number of calls to single-loop subroutines that is more like @n@ than @n sup 2@. Lastly, sparsity interferes with the regularity that makes the higher level dense BLAS possible. Although the structures of certain higher level sparse BLAS have been identified [1], these structures appear to have less generality than the single-loop kernels specified here. .PP We believe this is an appropriate time to standardize specifications for a set of sparse extensions to the BLAS. Examination of the previously listed sparse linear algebra codes reveals that a few basic operations occur frequently and dominate the computations. Standardization of these operations may help deter additional non-portable extensions to FORTRAN such as CRAY's GATHER and CDC's Q8GATHR. We specify here those basic operations, along with naming conventions and argument lists. In [4] we present a model implementation of the Sparse BLAS in FORTRAN 77 (extended to include the COMPLEX*16 data type), and also a set of rigorous test programs. .NH Compressed Storage of Sparse Vectors .PP In the original BLAS, a dense @n@-vector, @x@, is represented by a triple (N, X, INCX), where X is a FORTRAN array in which the components of @x@ are stored according to the indexing pattern: .EQ x sub i ~ roman { = ~ left {~ lpile { X(1+(i-1)*INCX) above X(1-(N-i)*INCX) }~~~~ lpile { if above if }~~ lpile { INCX ~ >= ~ 0 above INCX ~ < ~0. } } .EN .PP In sparse linear algebra, large savings in computer time and memory are realized by storing and operating on only the \fIinteresting\fP (usually nonzero) components of vectors. The most common method for representing a sparse vector uses a FORTRAN array just long enough to contain the interesting components, and a companion array of indices that maps the stored values into their proper positions within the vector. Letting NZ be the number of interesting components of @x@, X be the FORTRAN array in which they are stored, and INDX be the FORTRAN array of indices, a sparse vector is represented by the triple (NZ, X, INDX). For example, if .KS .EQ x ~ roman = ~(0,~4,~0,~0,~1,~0,~0,~0,~6,~0) .EN and if the interesting components of @x@ are the nonzero ones, then .EQ roman { rpile { NZ above X above INDX } ~ cpile { = above = above = } ~ lpile { 3, above (4,~1,~6), above (2,~5,~9) } } .EN so that .EQ x sub roman INDX(i) roman { ~=~ X(i). } .EN .KE .PP As we will show later, the generality of INCX and an increment argument for INDX is not needed.
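.PP
For concreteness, the representation in the example above could be set up in FORTRAN as follows. This fragment is only an illustrative sketch; the full-form array Y is shown alongside to emphasize the relationship between the two storage forms.
.DS
      INTEGER NZ, INDX(3)
      REAL X(3), Y(10)
*     Compressed form of  x = (0, 4, 0, 0, 1, 0, 0, 0, 6, 0)
      DATA NZ /3/
      DATA X /4.0E0, 1.0E0, 6.0E0/
      DATA INDX /2, 5, 9/
*     Full storage form of the same vector
      DATA Y /0.0E0, 4.0E0, 0.0E0, 0.0E0, 1.0E0,
     +        0.0E0, 0.0E0, 0.0E0, 6.0E0, 0.0E0/
.DE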
.NH Scope of the Sparse BLAS .PP We follow the naming convention given in [17]: a subprogram name consists of a prefix that indicates the data type involved, a root that describes the operation performed, and an optional suffix that indicates a variant of the basic operation. The Sparse BLAS operations are defined for several data types, so we represent the prefix generically with an underscore character. Usually, the underscore character may be replaced with any of the following type specification characters: .KS .TS center; l l. S REAL D DOUBLE PRECISION C COMPLEX Z COMPLEX*16 or DOUBLE COMPLEX (if available) .TE .KE Since a standard for a COMPLEX analog of SROT was not specified in [17], we provide only real plane rotations; therefore, the underscore character in _ROTI may be replaced only by S or D. Subprograms with Z prefixes may not be available on all machines since COMPLEX*16 is not a standard FORTRAN data type. They are included in this standard for completeness and for their usefulness on those systems that support this data type. .KS .PP An examination of the original BLAS reveals that some of them do non-vector operations. These BLAS are applicable to both dense and sparse linear algebra, so sparse variants are not needed. The subprograms in this category are: .TS center; l l. _ROTG Set up Givens rotation _ROTMG Set up modified Givens rotation .TE .KE .KS .PP Some original BLAS do the correct operation when presented with the compressed value array, X, of a sparse vector @x@ that is represented by (NZ, X, INDX). These BLAS also do not need special sparse variants. They include: .TS center; l l. _NRM2 Euclidean norm _ASUM Sum of absolute values _SCAL Constant times a vector I_AMAX Index of first component of maximum absolute value .TE .KE (Technically, when I_AMAX is applied to X, it returns the index of the first maximal element of X. The corresponding element of INDX gives the index of that component of @x@. This might not be the index of the \fIfirst\fR maximal component of @x@ if the associated array of indices is not in increasing order. This is probably of no significance.) .KS .PP Finally, we omit sparse variants of several of the original BLAS because we perceive a lack of usefulness. The operations thus omitted are: .TS center; l l. Sparse _ROTM Apply sparse modified Givens rotation Sparse _SWAP Swap sparse vectors .TE .KE If these operations are implemented as an extension to the Sparse BLAS, it is recommended that they be named _ROTMI and _SWAPI and that they conform to the Sparse BLAS conventions. .PP The standard BLAS operations that evidently require sparse extensions are _DOT, _AXPY, and _ROT. Two practices are widely used to improve the efficiency of the sparse versions of these operations, and both are followed in our extensions. In both, the major goal is to perform the operations in O(NZ) work, independent of @n@. Achieving this goal requires eliminating all searches for interesting values or matching indices. .PP The first idea, adopted by the authors of all the packages mentioned above, and described in detail in [9], is to avoid binary operations between two compactly stored sparse vectors with different sparsity patterns. For example, the dot product of two such vectors, (NZ1, X1, INDX1) and (NZ2, X2, INDX2), requires at least O(NZ1 + NZ2) work to determine which indices match. The standard approach is to expand one vector to its full, uncompressed form and perform the numerical operation between that uncompressed vector and the remaining compressed vector. Although the expansion requires O(NZ) work, and in some cases a corresponding compression may be required, this overhead typically is recovered in full. It is common for the uncompressed vector to be involved in a series of such binary operations, further amortizing the expansion and compression cost over several vector operations. There also are situations, such as forming the product of a sparse matrix and a vector, in which it is natural for one of the vector operands to be uncompressed.
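.PP
The following subroutine sketches this expand-operate-compress pattern for the dot product of two compressed sparse vectors. The helper SSPDOT is hypothetical and not part of this proposal; SSCTR, SDOTI, and SGTHRZ are the routines specified in \(sc5. The scratch vector Y must be zero on entry and is restored to zero on exit.
.DS
      SUBROUTINE SSPDOT (NZ1, X1, INDX1, NZ2, X2, INDX2, Y, W)
*     Dot product W of two sparse vectors in compressed form,
*     using a full-length scratch vector Y that is zero on entry.
      INTEGER NZ1, INDX1(*), NZ2, INDX2(*)
      REAL X1(*), X2(*), Y(*), W
      REAL SDOTI
*     Expand the first operand into full storage form: O(NZ1) work.
      CALL SSCTR (NZ1, X1, INDX1, Y)
*     Operate against the still-compressed second operand: O(NZ2) work.
      W = SDOTI (NZ2, X2, INDX2, Y)
*     Gather the first operand back, re-zeroing Y for later reuse.
      CALL SGTHRZ (NZ1, Y, X1, INDX1)
      RETURN
      END
.DE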
.PP The expansion and compression operations mentioned above are two symmetric sparse extensions of the standard BLAS routine _COPY. The first extension, _SCTR, expands or \fIscatters\fR a sparse compressed vector into full or uncompressed form. The second extension is the inverse operation, _GTHR, which compresses or \fIgathers\fR the values from the uncompressed form into the compressed form. These sparse variants of _COPY complete our extension to the BLAS. .PP The second idea that contributes greatly to the efficiency of sparse linear algebra computations is separating the determination of the nonzero structure of the sparse vectors from the numerical computation. Ignoring fortuitous cancellation, performing an elementary row operation or a Givens rotation on two sparse vectors results in output vectors having nonzeros wherever either of the inputs had nonzeros. While it is possible to determine the locations of the merged nonzero entries at the same time as their values are computed, it typically is more efficient to perform this symbolic operation independently. This separation allows the symbolic determination of the nonzero structure to exploit the sparsity characteristics of the particular algorithm being realized. For example, chapter 5 of [11] discusses a number of representations for the process of symmetric Gaussian elimination that permit the symbolic operations to be performed efficiently in a form that is much more compact than the sparse representation of the numerical entries. In this and several other important applications, the indices can be determined without knowing the values of the nonzero entries, only that they are nonzero. This leads to dramatic savings in computation when several problems with common sparsity patterns, but different numerical values, are solved. Only the numerical process, not the symbolic operations, need be performed more than once. Chapters 5 and 9 of [9] point out the savings to be found by separating the symbolic and numerical processes, even in the case of sparse Gaussian elimination with partial pivoting, where the actual numerical entries must affect the algorithm. .PP The result of making the symbolic and numerical algorithms distinct is that the numerical operations are implemented using static data structures. Because the data structures are known in advance of the numerical operations, the numerical code can be simpler, faster, and more amenable to vector or parallel processing. As observed in [9], the overhead of merging sparse vectors is reduced to ``the difference between the use of direct and indirect addressing.'' The Sparse BLAS are designed for use in such an environment, and therefore do not change the data structures represented by NZ and INDX. It is assumed that these data structures correctly mirror the sparsity that evolves. These codes never verify that all the interesting entries are processed or allocated storage; that is the responsibility of the symbolic analysis.
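.PP
A schematic example of this separation appears below. The sparsity pattern (NZ, INDX) is fixed once, standing in for the result of a symbolic analysis, and only the numerical phase, here a sequence of calls to SAXPYI (\(sc5), is repeated as the values change. The program is an illustrative sketch, not taken from any of the packages cited.
.DS
      PROGRAM PATDEM
*     The pattern (NZ, INDX) is static, as if produced by a prior
*     symbolic analysis; the DATA statements stand in for that phase.
      INTEGER NZ, INDX(3), I, K
      REAL A, X(3), Y(10)
      DATA NZ /3/, INDX /2, 5, 9/
      DATA Y /10*0.0E0/
*     Numerical phase: several updates reuse the same pattern,
*     with different values in A and X each time.
      DO 20 K = 1, 4
         A = REAL(K)
         DO 10 I = 1, NZ
            X(I) = REAL(K*I)
   10    CONTINUE
         CALL SAXPYI (NZ, A, X, INDX, Y)
   20 CONTINUE
      END
.DE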
.NH Sparse BLAS Conventions .NH 2 Storage .PP Another characteristic of all the packages mentioned in \(sc1 is that the full, uncompressed vector is stored contiguously, e.g., with the BLAS vector descriptor (N, Y, 1). This is the form we have adopted. Because contiguous storage of Y is what is used in practice, INCY has been restricted to unity. The absence of a non-unit INCY is unlikely to cause any trouble, since the Y array usually is a work array whose organization is arbitrary. Furthermore, it is not possible to allow for the backward indexing through Y that would be required by INCY \(<= 0 without knowing N, even though the implementations of the operations are otherwise independent of N. Thus, INCY is omitted from the argument list. Similarly, no increment is required for X or INDX. .PP In the specifications below, the X argument represents the compressed value array of a sparse vector in compressed storage form. INDX follows X, replacing INCX of the original BLAS. There are no restrictions on the order of the values in the INDX array. However, those subprograms where Y is an output argument will give correct and consistent results with vector or parallel processing only when the values in INDX are distinct. In the usual sparse matrix applications, distinct indices are the norm. Because our goal is efficiency, especially on high performance computers, together with portability, \fIwe impose the restriction that the values in \fRINDX\fI be distinct when \fRY\fI is an output argument \fR(\fIthat is, in subroutines \fR_AXPYI\fI, \fR_ROTI\fI, \fR_SCTR\fI, and \fR_GTHRZ). .NH 2 Error Handling .PP The Sparse BLAS do no explicit error checking or reporting. Values of @roman NZ~<=~0@ are legal, and the routines do ``the right thing''; i.e., the dot product routines return zero function values, and all of the routines make no references to their vector arguments. Thus, special case testing need not be performed in a calling program. .PP As mentioned above, certain routines require that the values in INDX be distinct. Checking this condition would have greater complexity than the numerical operations, since testing a list for duplicates requires more than O(NZ) work if the values are not sorted. Therefore, compliance with the restriction is not verified, and violating it may yield incorrect or inconsistent results without warning. .NH 2 Naming Convention .PP If a Sparse BLAS routine is an extension of a dense BLAS, the subprogram name is formed by appending a suffix character, I, standing for ``indexed,'' to the dense name. Arguments justifying this minor extension of the naming convention given in [17] are presented in [2]. Fortunately, the current BLAS needing sparse variants have at most five characters in their names, so a one-character suffix can be appended to every one. If a Sparse BLAS is not a direct extension of a dense BLAS, the root name is a new, mnemonic, four-character name, and the suffix is reserved to describe variants. _GTHR and _SCTR are new root names, and Z is a variant descriptor for _GTHR. .PP The available combinations are listed in Table I. The collection of routines can be thought of as being divided into four separate parts: REAL, DOUBLE PRECISION, COMPLEX, and COMPLEX*16. Except for the routines that use COMPLEX*16 variables, the routines can be written in standard ANSI FORTRAN 77. .KS .ce Table I. Summary of Functions and Names of the Sparse BLAS Subprograms .ps 10 .vs 13 .TS expand, tab(/); l c c s s s s s s s l c c s s s s s s s l a a0e a0e a0e a0e a0e a0e a0e a0e.
_ Function/Root/Prefix and suffix \^/of name/of name _ Dot product/-DOT-/S-I/D-I/C-UI/Z-UI/C-CI/Z-CI Scalar times a vector/-AXPY-/S-I/D-I/C-I/Z-I added to a vector Apply Givens rotation/-ROT-/S-I/D-I Gather \fIy\fR into \fIx\fR/-GTHR-/S-/D-/C-/Z-/S-Z/D-Z/C-Z/Z-Z Scatter \fIx\fR into \fIy\fR/-SCTR/S-/D-/C-/Z- _ .TE .ps .vs .KE .KS .NH Specifications for the Sparse BLAS .PP Type and dimension declarations for variables occurring in the subroutine specifications are as follows: .TS center; c l1 l. all: INTEGER NZ, INDX(NZ) S: REAL A, C, S, W, X(NZ), Y(*) D: DOUBLE PRECISION A, C, S, W, X(NZ), Y(*) C: COMPLEX A, W, X(NZ), Y(*) Z: COMPLEX*16 A, W, X(NZ), Y(*) .TE .KE .KS Type declarations for the function names are as follows: .TS center; c l1 l. S: REAL SDOTI D: DOUBLE PRECISION DDOTI C: COMPLEX CDOTCI, CDOTUI Z: COMPLEX*16 ZDOTCI, ZDOTUI .TE .KE .PP NZ, A, C, S, and INDX are never changed by the Sparse BLAS. In addition, X is not changed by _DOT-I, _AXPYI, or _SCTR, and Y is not changed by _DOT-I or _GTHR. Only the elements of Y whose indices appear in INDX are referenced or modified. In the descriptions, the initial character S can be replaced by D to obtain the description of the double precision version; similarly, an initial C can be replaced by Z. Finally, for convenience in the mathematical descriptions of the operations, we assume that any component of @x@ whose index is not listed in INDX is zero. The FORTRAN descriptions properly describe the operations performed even when \fIuninteresting\fP means something other than \fIzero\fP. .KS .IP 1) Sparse Dot Product .IP W = SDOTI (NZ, X,INDX, Y) .br W = DDOTI (NZ, X,INDX, Y) .br W = CDOTCI (NZ, X,INDX, Y) Conjugated .br W = CDOTUI (NZ, X,INDX, Y) Unconjugated .br W = ZDOTCI (NZ, X,INDX, Y) Conjugated .br W = ZDOTUI (NZ, X,INDX, Y) Unconjugated .KE .IP SDOTI, DDOTI, CDOTUI, and ZDOTUI compute .EQ w ~ : roman = ~ sum from {i roman = 1} to n x sub i y sub i roman {~=~ sum from I=1 to NZ X(I)*Y(INDX(I))} .EN while CDOTCI and ZDOTCI compute .EQ w ~ : roman = ~ sum from {i roman = 1} to n x bar sub i y sub i roman {~=~ sum from I=1 to NZ CONJG(X(I))*Y(INDX(I))} .EN where @x bar@ is the complex conjugate of @x@. .KS .IP 2) Sparse Elementary Vector Operation .IP CALL SAXPYI (NZ, A, X,INDX, Y) .br CALL DAXPYI (NZ, A, X,INDX, Y) .br CALL CAXPYI (NZ, A, X,INDX, Y) .br CALL ZAXPYI (NZ, A, X,INDX, Y) .KE .IP _AXPYI performs the operation .EQ y ~ : roman = ~ ax ~ + ~ y .EN which is the FORTRAN 77 operation .EQ roman { Y(INDX(I)) ~=~ A*X(I) ~ + ~ Y(INDX(I)) ~~~~~~for~I ~=~ 1,~...,~NZ }. .EN The values in INDX must be distinct to allow consistent vector or parallel execution. (A model implementation of SAXPYI is sketched following these specifications.) .KS .IP 3) Apply a Sparse Plane Rotation (REAL and DOUBLE PRECISION only) .IP CALL SROTI (NZ, X,INDX, Y, C,S) .br CALL DROTI (NZ, X,INDX, Y, C,S) .KE .IP _ROTI applies a Givens rotation to a sparse vector stored in compressed form and another vector stored in full storage form. If all nonzero components of @y@ appear in INDX, then _ROTI computes .bp .EQ left [ lpile { x sub i above y sub i } right ] ~ up 30 {: roman =} ~ left [ rpile { c above -s } ~~~ rpile { s above c } ~ right ] ~ up 50 . ~ left [ lpile { x sub i above y sub i } right ] ~~~~for~i~ roman = ~1,~...,~n.
.EN Whether or not all nonzero components of @y@ appear in INDX, the operation is described by the following FORTRAN: .EQ roman { left "" lpile { TEMP above X(I) above Y(INDX(I)) } ~ cpile { = above = above = } ~ rpile { X(I)~~~~~~~^ above C*TEMP above bold - S*TEMP } ~ cpile { ~ above + above + } ~ rpile { ~ above S*Y(INDX(I)) above C*Y(INDX(I)) } ~~~ right } ~~~for~I ~=~ 1,~...,~NZ. } .EN The values in INDX must be distinct to allow consistent vector or parallel execution. Since a standard for a COMPLEX analog of _ROT was not specified in [17], we provide only real plane rotations. .KS .IP 4) Gather a Vector into Compressed Form .IP CALL SGTHR (NZ, Y, X,INDX) .br CALL DGTHR (NZ, Y, X,INDX) .br CALL CGTHR (NZ, Y, X,INDX) .br CALL ZGTHR (NZ, Y, X,INDX) .KE .IP Gather (copy) the specified components of a vector stored in full storage form into compressed form. If all nonzero components of @y@ appear in INDX, then _GTHR performs .EQ x ~ : roman = ~ y .EN which is symbolized by .EQ roman { X(I) ~=~ Y(INDX(I)) ~~~~~~for~I~=~1,~...,~NZ. } .EN .IP 5) Gather a Vector into Compressed Form and Zero the Full-form Vector .IP CALL SGTHRZ (NZ, Y, X,INDX) .br CALL DGTHRZ (NZ, Y, X,INDX) .br CALL CGTHRZ (NZ, Y, X,INDX) .br CALL ZGTHRZ (NZ, Y, X,INDX) .IP Gather (copy) the specified components of a vector stored in full storage form into compressed form and zero the gathered components of the full-form vector. If all nonzero components of @y@ appear in INDX, a common occurrence in the use of a single uncompressed scratch vector, then _GTHRZ simultaneously performs .EQ left { ~ lpile {x ~ : roman = ~ y above y ~ : roman = ~ 0 . } .EN This is specified by the following: .EQ roman { left "" lpile { X(I) above Y(INDX(I)) } ~ lpile { = above = } ~ lpile { Y(INDX(I)) above 0.0 } ~~~ right } ~~~for~I~=~1,~...,~NZ. } .EN In typical use of full-form temporary vectors, the vector is initially set to zero. _SCTR is used to initialize the interesting entries for processing. At the completion of processing, _GTHRZ is used to move the interesting values back into the sparse data structures and simultaneously reset the full-form vector to its initial, zero, state for later use. .IP The values in INDX must be distinct to allow consistent vector or parallel execution, since otherwise the indeterminacy in the order of accessing elements of Y and resetting them to zero makes X ambiguous. .KS .IP 6) Scatter a Sparse Vector from Compressed Form .IP CALL SSCTR (NZ, X,INDX, Y) .br CALL DSCTR (NZ, X,INDX, Y) .br CALL CSCTR (NZ, X,INDX, Y) .br CALL ZSCTR (NZ, X,INDX, Y) .KE .IP Scatter (copy) the components of a sparse vector stored in compressed form into specified components of a vector in full storage form. If @y@ is initially the zero vector, then _SCTR performs .EQ y ~ : roman = ~ x .EN which is represented by the following FORTRAN: .EQ roman { Y(INDX(I)) ~=~ X(I) ~~~~~~for~I~=~1,~...,~NZ. } .EN The values in INDX must be distinct to allow consistent vector or parallel execution.
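.PP
To make the vectorization issue concrete, a model FORTRAN 77 implementation of SAXPYI might read as follows. This is an illustrative sketch in the spirit of the model implementation of [4], not normative text; a tuned version for a vector machine would typically differ only by a machine-specific compiler directive asserting that the indirect store through INDX is safe.
.DS
      SUBROUTINE SAXPYI (NZ, A, X, INDX, Y)
*     Sparse elementary vector operation  y := a*x + y.
*     Only the components of Y whose indices appear in INDX
*     are referenced or modified.
      INTEGER NZ, INDX(*)
      REAL A, X(*), Y(*)
      INTEGER I
*     NZ .LE. 0 is legal; the routine then does nothing.
      IF (NZ .LE. 0) RETURN
      IF (A .EQ. 0.0E0) RETURN
*     This loop may safely run in vector or parallel mode only
*     because the values in INDX are required to be distinct.
      DO 10 I = 1, NZ
         Y(INDX(I)) = Y(INDX(I)) + A*X(I)
   10 CONTINUE
      RETURN
      END
.DE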
.NH Acknowledgements .PP A draft of this proposal was discussed at the Parvec IV Workshop organized by John Rice at Purdue University on October 29-30, 1984. Subsequently, a revised draft appeared in the SIGNUM Newsletter [3] and was discussed at the SIAM Conference on Applied Linear Algebra in Raleigh, April 29-May 2, 1985. We thank the participants at the workshop and meeting for their comments, discussions, and encouragement. .SH References .IP [1] C.C. Ashcraft, R.G. Grimes, J.G. Lewis, B.W. Peyton, and H.D. Simon, ``Progress in Sparse Matrix Methods for Large Linear Systems on Vector Supercomputers,'' \fIInternational Journal of Supercomputer Applications 1,\fR 4 (December, 1987), pp. 10-30. .IP [2] D.S. Dodson and J.G. Lewis, ``Issues Relating to Extension of the Basic Linear Algebra Subprograms,'' \fIACM SIGNUM Newsletter 20,\fR 1 (1985), pp. 19-22. .IP [3] D.S. Dodson and J.G. Lewis, ``Proposed Sparse Extensions to the Basic Linear Algebra Subprograms,'' \fIACM SIGNUM Newsletter 20,\fR 1 (1985), pp. 22-25. .IP [4] D.S. Dodson, R.G. Grimes, and J.G. Lewis, ``Model Implementation and Test Package for the Sparse Basic Linear Algebra Subprograms,'' \fIthis journal.\fR .IP [5] J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart, \fILINPACK Users' Guide,\fR SIAM Publications, Philadelphia, 1979. .IP [6] J.J. Dongarra, J. Du Croz, S. Hammarling, and R.J. Hanson, ``An Extended Set of FORTRAN Basic Linear Algebra Subprograms,'' \fIACM Transactions on Mathematical Software 14,\fR 1 (March, 1988), pp. 1-17. .IP [7] J.J. Dongarra, J. Du Croz, I.S. Duff, and S. Hammarling, ``A Proposal for a Set of Level 3 Basic Linear Algebra Subprograms,'' \fIACM SIGNUM Newsletter 22,\fR 3 (1987), pp. 2-14. .IP [8] I.S. Duff, ``MA28: A Set of FORTRAN Subroutines for Sparse Unsymmetric Linear Systems,'' AERE Report R.8730, HMSO, London, (1977). .IP [9] I.S. Duff, A.M. Erisman, and J.K. Reid, \fIDirect Methods for Sparse Matrices,\fR Oxford University Press, (1986). .IP [10] S.C. Eisenstat, M.C. Gursky, M.H. Schultz, and A.H. Sherman, ``Yale Sparse Matrix Package I. The Symmetric Codes,'' \fIInt. J. Numer. Meth. Eng. 18,\fR (1982), pp. 1145-1151. .IP [11] A. George and J.W. Liu, \fIComputer Solution of Large Sparse Positive Definite Systems,\fR Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1982. .IP [12] A. George and M.T. Heath, ``Solution of Sparse Linear Least Squares Problems Using Givens Rotations,'' \fILinear Algebra and Its Applications 34,\fR (1980), pp. 69-82. .IP [13] A. George, M.T. Heath, and R.J. Plemmons, ``Solution of Large-Scale Sparse Least Squares Problems Using Auxiliary Storage,'' Report ORNL/CSD-63, Oak Ridge National Laboratory, (August, 1980). .IP [14] R.G. Grimes, J.G. Lewis, and H.D. Simon, ``Experiences in Solving Large Eigenvalue Problems on the CRAY X-MP,'' \fIProceedings of the Eighteenth Semiannual CRAY Users Group Meeting,\fR Garmisch, West Germany, (October, 1986). .IP [15] R.G. Grimes, J.G. Lewis, and H.D. Simon, ``Eigenvalue Problems and Algorithms in Structural Engineering,'' in \fILarge Scale Eigenvalue Problems\fR (J. Cullum and R. Willoughby, eds.), Elsevier North-Holland, (1986), pp. 81-93. .IP [16] R.J. Hanson, F.T. Krogh, and C.L. Lawson, ``A Proposal for Standard Linear Algebra Subprograms,'' \fIACM SIGNUM Newsletter 8\fR (1973), pp. 16ff. .IP [17] C.L. Lawson, R.J. Hanson, D.R. Kincaid, and F.T. Krogh, ``Basic Linear Algebra Subprograms for FORTRAN Usage,'' \fIACM Transactions on Mathematical Software 5,\fR 3 (September, 1979), pp. 308-323. .IP [18] C.L. Lawson, R.J. Hanson, D.R. Kincaid, and F.T. Krogh, ``Algorithm 539: Basic Linear Algebra Subprograms for FORTRAN Usage,'' \fIACM Transactions on Mathematical Software 5,\fR 3 (September, 1979), pp. 324-325. .IP [19] J.G. Lewis and H.D. Simon, ``The Impact of Hardware Gather/Scatter on Sparse Gaussian Elimination,'' \fIProceedings of the 1986 International Conference on Parallel Processing\fR (K. Hwang, S. Jacobs, and E. Swartzlander, eds.), IEEE Computer Society, Los Angeles, (1986).
.IP [20] H.D. Simon, ``Incomplete LU Preconditioners for Conjugate Gradient Type Iterative Methods,'' Paper SPE 13533, \fIProceedings of the Eighth SPE Symposium on Reservoir Simulation,\fR (February, 1985).