.ds TH "\h'-.75m'\v'-.50m'^\v'.50m' .ds TD "\h'-.75m'\v'-.45m'~\v'.45m' .B .nr BT ''-%-'' .he '''' .pl 11i .de fO 'bp .. .wh -.5i fO .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 A Proposal for an Extended Set of Fortran Basic Linear Algebra Subprograms .ps 11 .sp 3 Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 41 (Revision 1) .sp .7i November, 1985 .pn 0 .he ''-%-'' .he '''' .bp . .sp 10 .B .ce 1 .ps 11 ABSTRACT .R .ps 10 .sp .5i This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions proposed are targeted at matrix vector operations which should provide for more efficient and portable implementations of algorithms for high performance computers. .in .bp .ft 3 .ps 11 .bp .LP .LP .EQ gsize 11 delim @@ .EN .TL .ps 12 .in 0 A Proposal for an Extended Set of Fortran\h'.35i' .sp Basic Linear Algebra Subprograms\h'.35i' .AU .ps 11 .in 0 Jack J. Dongarra\|@size -1 {"" sup \(dg}@\h'.15i' .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' Argonne, Illinois 60439\h'.20i' .AU .ps 11 .in 0 Jeremy Du Croz\h'.20i' .AI .ps 10 .in 0 Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .AU .ps 11 .in 0 Sven Hammarling\h'.20i' .AI .ps 10 .in 0 Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .AU .ps 11 .in 0 Richard J. 
Hanson\h'.20i' .AI .ps 10 .in 0 Applied Math 2646 Sandia National Laboratory\h'.20i' Albuquerque, New Mexico 87185\h'.20i' .FS .ps 9 .vs 11p @size -1 {"" sup \(dg}@\|Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. Department of Energy, under Contract W-31-109-Eng-38. This report is being issued jointly as Argonne National Laboratory MCSD/TM-41 (Revision 1) and Numerical Algorithms Group Technical Report NP 1195. Typeset on \*(DY. .FE .sp 3 .QS .ps 10 .in .25i .ll -.25i .I Abstract \(em This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions proposed are targeted at matrix vector operations which should provide for more efficient and portable implementations of algorithms for high performance computers. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .SH Part 1: The Proposal .SH 1. Introduction .PP In 1973 Hanson, Krogh, and Lawson wrote an article in the SIGNUM Newsletter (Vol. 8, no. 4, page 16) describing the advantages of adopting a set of basic routines for problems in linear algebra. The original basic linear algebra subprograms, now commonly referred to as the BLAS and fully described in Lawson, Hanson, Kincaid, and Krogh [9,10], have been very successful and have been used in a wide range of software including LINPACK [5] and many of the algorithms published by the ACM Transactions on Mathematical Software. In particular they are an aid to clarity, portability, modularity and maintenance of software and they have become a \f2de facto \f1standard for the elementary vector operations. An excellent discussion of the .I raison d' \*^etre .R of the BLAS is given in Dodson and Lewis [2]. .PP Special versions of the BLAS, in some cases machine code versions, have been implemented on a number of computers, thus improving the efficiency of the BLAS. 
However, with some of the modern machine architectures, the use of the BLAS is not the best way to improve the efficiency of higher level codes. On vector machines, for example, one needs to optimize at least at the level of matrix-vector operations in order to approach the potential efficiency of the machine (see [3 and 4]); and the use of the BLAS inhibits this optimization because they hide the matrix-vector nature of the operations from the compiler. .PP We believe that the time is right to propose the specifications of an additional set of BLAS designed for matrix-vector operations. It has been our experience that a small set of matrix-vector operations occur frequently in the implementation of many of the most common algorithms in linear algebra. We define here the basic operations for that set, together with the naming conventions and the calling sequences. Routines at this level should provide a reasonable compromise between the sometimes conflicting aims of efficiency and modularity and it is our hope that efficient implementations will become available on a wide range of computer architectures. .PP In this paper we shall refer to the existing BLAS of Lawson et al. as ``Level 1 BLAS'' or ``Existing BLAS'', and the proposed new set as ``Level 2 BLAS'' or ``Extended BLAS''. The Level 2 BLAS involve @O(mn )@ scalar operations where @m@ and @n@ are the dimensions of the matrix involved. These could be programmed by a series of calls to the Level 1 BLAS, though we do not recommend that they be implemented in that way. Hence, in a natural sense, the Level 2 BLAS are performing basic operations at one level higher than the Level 1 BLAS. .PP We will make available a complete set of Level 2 BLAS in Fortran 77 so that software developers without access to specific implementations can make use of them. We also plan to develop a test program so that implementations of the extended BLAS can be thoroughly tested before being distributed. 
We intend to submit the test program and the Fortran 77 version of the routines for publication as an ACM algorithm. .SH 2. Scope of the Extended BLAS .PP We propose that the following three types of basic operation be performed by the extended BLAS: a) Matrix-vector products of the form .in +.5i .EQ I y ~<-~ alpha Ax ~+~ beta y ,~~ y ~<-~ alpha A ~ sup T x ~+~ beta y ,~~ and~ ~ y ~<-~ alpha A bar ~ sup T x ~+~ beta y .EN where @ alpha @ and @ beta @ are scalars, @x@ and @y@ are vectors and @A@ is a matrix, and .EQ I x ~<-~ Tx ,~ ~ x ~<-~ T ~ sup T x,~~ and~ ~ x ~<-~ T bar ~ sup T x, .EN where @x@ is a vector and @T@ is an upper or lower triangular matrix. .in -.5i b) Rank-one and rank-two updates of the form .in +.5i .EQ I A ~<-~ alpha xy ~ sup T ~+~ A ,~~ A ~<-~ alpha x y bar ~ sup T ~+~ A ,~~ H ~<-~ alpha x x bar ~ sup T ~+~ H ,~~and ~ H ~<-~ alpha x y bar ~ sup T ~+~ alpha bar y x bar ~ sup T ~+~ H , .EN where @~H~@ is a Hermitian matrix. .in -.5i c) Solution of triangular equations of the form .in +.5i .EQ I x ~<-~ T ~ sup -1 x , ~x ~<-~ T ~ sup -T x , ~~and~ ~x ~<-~ T bar ~ sup -T x , .EN where @T@ is a non-singular upper or lower triangular matrix. .in -.5i .PP We propose that, where appropriate, the operations be applied to general, general band, Hermitian, Hermitian band, triangular, and triangular band matrices in both real and complex arithmetic, and in single and double precision. .PP See Part 2 of this report for examples to illustrate the uses and advantages of the proposed routines. An example illustrating the implementation of the routines is given in Appendix A. .PP In Appendix C we propose some additional routines in which the vectors @x@ and/or @y@ have a higher precision than the matrix, so that on machines with extended precision registers the extra internal precision is not all discarded on return from the routine. 
We propose these routines as an appendix solely because it is not possible, even in single precision, to specify a complete set within the confines of ANSI Fortran 77. The only case that can be realized is where the matrix is real single precision and the vector is real double precision. .SH 3. Naming Conventions .PP The proposed name of a Level 2 BLAS is in the LINPACK style and consists of five characters, the last of which may be blank. The fourth and fifth characters in the name denote the type of operation, as follows: .KS .TS center; l l. MV - Matrix-vector product R - Rank-one update R2 - Rank-two update SV - Solve a system of linear equations .TE .KE Characters two and three in the name denote the kind of matrix involved, as follows: .KS .TS center; l l. GE General matrix GB General band matrix HE Hermitian matrix SY Symmetric matrix HP Hermitian matrix stored in packed form SP Symmetric matrix stored in packed form HB Hermitian band matrix SB Symmetric band matrix TR Triangular matrix TP Triangular matrix in packed form TB Triangular band matrix .TE .KE The first character in the name denotes the Fortran data type of the matrix, as follows: .TS center; l l. S REAL D DOUBLE PRECISION C COMPLEX Z COMPLEX*16 or DOUBLE COMPLEX (if available) .TE The proposed available combinations are indicated in the table below. In the first column, under \f2complex\f1, the initial C may be replaced by Z. In the second column, under \f2real\f1, the initial S may be replaced by D. See Appendix D for the full subroutine calling sequences. .PP The proposed collection of routines can be thought of as being divided into four separate parts, \f2complex\f1, \f2real\f1, \f2double precision\f1, and \f2complex*16\f1. Each part will be accompanied by a separate testing program. The routines proposed are written in the ANSI Fortran 77 standard with the exception of the routines that use COMPLEX*16 variables. 
These routines are included for completeness and, because the Fortran standard does not provide for this variable type, may not be available on all machines. .KS .ce Table 3.1 .TS center; c c c c c c c. \f2complex real\f1 MV R R2 SV CGE SGE * * @nothing sup \(dg @ CGB SGB * CHE SSY * * * CHP SSP * * * CHB SSB * CTR STR * * CTP STP * * CTB STB * * .TE .KE .EQ gsize 8 .EN .FS @size -1 {"" sup \(dg}@\| For the general rank-1 update (GER) we propose two complex routines: CGERC for @ A ~<-~ alpha x y bar ~ sup T ~+~ A @ and CGERU for @ A ~<-~ alpha xy ~ sup T ~+~ A @. This is the only exception to the one to one correspondence between real and complex routines. See section 7 for further discussion. .FE .EQ gsize 11 .EN .PP We do not propose routines for rank-one and rank-two updates applied to band matrices because these can be obtained by calls to the rank-one and rank-two full matrix routines. This is illustrated in Appendix B. .SH 4. Parameter Conventions .PP We propose a similar convention for the parameter lists to that for the existing BLAS, but with extensions where comparable parameters are not present in the existing BLAS. The proposed order of parameters is as follows: a) Parameters specifying options. b) Parameters defining the size of the matrix. c) Input scalar. d) Description of the input matrix. e) Description of input vector(s). f) Input scalar (associated with input-output vector). g) Description of the input-output vector. h) Description of the input-output matrix. Note that not each category is present in each of the routines. .PP The parameters that specify options are character parameters with the names TRANS, UPLO, and DIAG. TRANS is used by the matrix vector product routines as follows: .KS .TS center; l l. Value Meaning _ _ `N' Operate with the matrix. `T' Operate with the transpose of the matrix. `C' Operate with the conjugate transpose of the matrix. .TE .KE In the real case the values `T' and `C' have the same meaning. 
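.PP The naming scheme of Section 3 and the TRANS convention above can be combined in a small illustrative sketch. The following Python fragment is purely expository and not part of the proposal: the tables transcribe the conventions given above, while the function itself is our own illustration. It decodes a routine name and its TRANS argument, honouring the real-case equivalence of `T' and `C':

```python
# Illustrative sketch (not part of the proposal): decode a Level 2 BLAS name
# (<type><matrix-kind><operation>, e.g. D + GE + MV = DGEMV) together with
# the TRANS option.  For real routines 'T' and 'C' are treated identically,
# as stated in the text above.

TYPE = {"S": "REAL", "D": "DOUBLE PRECISION",
        "C": "COMPLEX", "Z": "COMPLEX*16"}
KIND = {"GE": "general", "GB": "general band", "HE": "Hermitian",
        "SY": "symmetric", "HP": "packed Hermitian", "SP": "packed symmetric",
        "HB": "Hermitian band", "SB": "symmetric band", "TR": "triangular",
        "TP": "packed triangular", "TB": "triangular band"}
OP = {"MV": "matrix-vector product", "R": "rank-one update",
      "R2": "rank-two update", "SV": "solve"}

def decode(name, trans="N"):
    """Describe a routine name such as 'DGEMV' and its TRANS argument."""
    dtype, kind, op = name[0], name[1:3], name[3:]
    if dtype in ("S", "D") and trans == "C":
        trans = "T"                 # real case: 'T' and 'C' coincide
    operand = {"N": "the matrix", "T": "the transpose",
               "C": "the conjugate transpose"}[trans]
    return "%s %s %s with %s" % (TYPE[dtype], KIND[kind], OP[op], operand)
```

The TRANS handling applies to the MV and SV routines; the update routines take no TRANS argument.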
.PP UPLO is used by the Hermitian, symmetric, and triangular matrix routines to specify whether the upper or lower triangle is being referenced as follows: .KS .TS center; l l. Value Meaning _ _ `U' Upper triangle `L' Lower triangle .TE .KE .PP DIAG is used by the triangular matrix routines to specify whether or not the matrix is unit triangular, as follows: .KS .TS center; l l. Value Meaning _ _ `U' Unit triangular `N' Non-unit triangular .TE .KE When DIAG is supplied as `U' the diagonal elements are not referenced. .PP It is worth noting that actual character arguments in Fortran may be longer than the corresponding dummy arguments, so that, for example, the value `T' for TRANS may be passed as `TRANSPOSE'. .PP The size of the matrix is determined by the two parameters M and N for an @m@ by @n@ rectangular matrix; by the parameters M, N, KL, and KU for an @m@ by @n@ band matrix with @kl@ sub-diagonals and \f2ku\f1 super-diagonals; and by the parameters N and K for an @n@ by @n@ symmetric or Hermitian or triangular band matrix with \f2k\f1 super- and/or sub-diagonals. .PP The description of the matrix consists either of the array name (A) followed by the leading dimension of the array as declared in the calling (sub) program (LDA), when the matrix is being stored in a two-dimensional array; or the Fortran array name (AP) alone when the matrix is being stored as a (packed) vector. When A is not band, then in the former case the actual array must contain at least @(m ~+~ d(n-1))@ elements, where @d@ is the leading dimension of the array and @m~=~n@ for a square matrix; and in the latter case the actual array must contain at least @n(n+1)/2@ elements. .PP The scalars always have the dummy argument names ALPHA and BETA. As with the existing BLAS the description of a vector consists of the name of the array (X or Y) followed by the storage spacing (increment) in the array of the vector elements (INCX or INCY).
The increment is allowed to be negative, zero, or positive. When the vector @x@ consists of @k@ elements, then the corresponding actual array argument X must be of length at least (1 + (@k@-1)|INCX|). .sp The following values of parameters are invalid: .nf .sp Any value of the character parameters DIAG, TRANS, or UPLO whose meaning is not specified. M < 0 for GE and GB routines. N < 0 for all routines. KL < 0 or KL < M - N for the GB routines. KU < 0 or KU < N - M for the GB routines. K < 0 for the HB, SB, and TB routines. LDA < M for the GE routines. LDA < KL + KU + 1 for the GB routines. LDA < N for the HE, SY, and TR routines. LDA < K + 1 for the HB, SB, and TB routines. .fi Note that it is permissible to call the routines with M or N = 0, in which case the routines exit immediately without referencing their vector or matrix arguments. For those routines that include the parameter BETA, when BETA is supplied as zero then the array Y need not be set on input, so that operations such as @y ~ <- ~ alpha A x @ may be performed without initially setting @y@ to zero. .SH 5. Storage Conventions .PP Unless otherwise stated it is assumed that matrices are stored conventionally in a 2-dimensional array with matrix-element @a sub ij @ stored in array-element A(I,J). .PP The routines for real symmetric and complex Hermitian matrices allow for the matrix to be stored in either the upper (UPLO = `U') or lower triangle (UPLO = `L') of a two-dimensional array, or to be packed in a one-dimensional array. In the latter case the upper triangle may be packed sequentially column by column (UPLO = `U'), or the lower triangle may be packed sequentially column by column (UPLO = `L'). Note that for real symmetric matrices packing the upper triangle by column is equivalent to packing the lower triangle by rows, and packing the lower triangle by columns is equivalent to packing the upper triangle by rows.
(For complex Hermitian matrices the only difference is that the off-diagonal elements are conjugated.) .PP For triangular matrices the parameter UPLO serves to define whether the matrix is upper .br (UPLO = `U') or lower (UPLO = `L') triangular. In the packed storage the triangle has to be packed by column. .PP The band matrix routines allow storage in the same style as with LINPACK, so that the @j sup th @ column of the matrix is stored in the @j sup th @ column of the Fortran array. For a general band matrix the diagonal of the matrix is stored in the @ku + 1 sup st @ row of the array. For a Hermitian or symmetric matrix either the upper triangle (UPLO = `U') may be stored in which case the leading diagonal is in the @k+1 sup st @ row of the array, or the lower triangle (UPLO = `L') may be stored in which case the leading diagonal is in the first row of the array. For an upper triangular band matrix (UPLO = `U') the leading diagonal is in the @k+1 sup st @ row of the array and for a lower triangular band matrix (UPLO = `L') the leading diagonal is in the first row. .PP For a Hermitian matrix the imaginary parts of the diagonal elements are of course zero and thus the imaginary parts of the corresponding Fortran array elements need not be set, but are assumed to be zero. In the R and R2 routines these imaginary parts will be set to zero on return. .PP For packed triangular matrices the same storage layout is used whether or not DIAG = `U' (diagonal elements are assumed to have the value 1), i.e. space is left for the diagonal elements even if those array elements are not referenced. .SH 6. 
Specification of the Extended BLAS .PP Type and dimension for variables occurring in the subroutine specifications are as follows: .KS INTEGER INCX, INCY, K, KL, KU, LDA, M, N CHARACTER*1 DIAG, TRANS, UPLO .KE For routines whose first letter is an S: REAL ALPHA, BETA REAL X(*), Y(*) REAL A(LDA,*) REAL AP(*) For routines whose first letter is a D: DOUBLE PRECISION ALPHA, BETA DOUBLE PRECISION X(*), Y(*) DOUBLE PRECISION A(LDA,*) DOUBLE PRECISION AP(*) For routines whose first letter is a C: COMPLEX ALPHA COMPLEX BETA COMPLEX X(*), Y(*) COMPLEX A(LDA,*) COMPLEX AP(*) except that for CHER and CHPR the scalar @alpha@ is real so that the first declaration above is replaced by: REAL ALPHA For routines whose first letter is a Z: COMPLEX*16 ALPHA or DOUBLE COMPLEX ALPHA COMPLEX*16 BETA or DOUBLE COMPLEX BETA COMPLEX*16 X(*), Y(*) or DOUBLE COMPLEX X(*), Y(*) COMPLEX*16 A(LDA,*) or DOUBLE COMPLEX A(LDA,*) COMPLEX*16 AP(*) or DOUBLE COMPLEX AP(*) except that for ZHER and ZHPR the first declaration above is replaced by: DOUBLE PRECISION ALPHA The generic names, parameter lists and specifications for the extended BLAS now follow. Refer to table 3.1 for the specific subroutine names. a) General Matrix Vector Products for a general matrix: GEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) for a general band matrix: GBMV ( TRANS, M, N, KL, KU, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) Operation: .sp if @TRANS~=~@ `N',@~~~~~y ~ <- ~ alpha Ax ~+~ beta y @ .sp if @TRANS~=~@ `T',@ ~~~~y ~ <- ~ alpha A ~ sup T x ~+~ beta y@ .sp if @TRANS~=~@ `C',@ ~~~~y ~ <- ~ alpha A bar ~ sup T x ~+~ beta y @.
.sp b) Symmetric or Hermitian Matrix Vector Products for a symmetric or Hermitian matrix: SYMV ( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) HEMV ( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) for a symmetric or Hermitian matrix in packed storage: SPMV ( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY ) HPMV ( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY ) for a symmetric or Hermitian band matrix: SBMV ( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) HBMV ( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) Operation: .EQ I y ~ <- ~ alpha Ax ~+~ beta y ~. .EN c) Triangular Matrix Vector Products for a triangular matrix: TRMV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX ) for a triangular matrix in packed storage: TPMV ( UPLO, TRANS, DIAG, N, AP, X, INCX ) for a triangular band matrix: TBMV ( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX ) Operation: .sp if @TRANS~=~@ `N', @~~~~x ~ <- ~ Tx @ .sp if @TRANS~=~@ `T', @~~~~x ~ <- ~ T ~ sup T x@ .sp if @TRANS~=~@ `C', @~~~~x ~ <- ~ T bar ~ sup T x@ .sp d) Triangular equation solvers for a triangular matrix: TRSV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX ) for a triangular matrix in packed storage: TPSV ( UPLO, TRANS, DIAG, N, AP, X, INCX ) for a triangular band matrix: TBSV ( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX ) Operation: .sp if @TRANS~=~ @ `N', @~~~~x ~ <- ~ T ~ sup -1 x@ .sp if @TRANS~=~@ `T', @~~~~x ~ <- ~ T ~ sup -T x @ .sp if @TRANS~=~@ `C', @ ~~~~x ~ <- ~ T bar ~ sup -T x ~ @. .sp e) General Rank-1 Updates: for a general matrix: GER_ ( M, N, ALPHA, X, INCX, Y, INCY, A, LDA ) for real matrices: SGER performs the operation @ A ~<-~ alpha x y ~ sup T ~+~ A @. for complex matrices: CGERC performs the operation @ A ~<-~ alpha x y bar ~ sup T ~+~ A @ and CGERU performs the operation @ A ~<-~ alpha x y ~ sup T ~+~ A @. 
f) Symmetric or Hermitian Rank-1 Updates: for a symmetric or Hermitian matrix: SYR ( UPLO, N, ALPHA, X, INCX, A, LDA ) HER ( UPLO, N, ALPHA, X, INCX, A, LDA ) for symmetric or Hermitian matrix in packed storage: SPR ( UPLO, N, ALPHA, X, INCX, AP ) HPR ( UPLO, N, ALPHA, X, INCX, AP ) Operation: .EQ I A ~<- ~ alpha x x bar ~ sup T ~+~ A .EN for real symmetric matrices this is simply .EQ I A ~<- ~ alpha x x ~ sup T ~+~ A ~. .EN g) Symmetric or Hermitian Rank-2 Updates: for a symmetric or Hermitian matrix: SYR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA ) HER2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA ) for symmetric or Hermitian matrix in packed storage: SPR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, AP ) HPR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, AP ) Operation: .EQ I A ~<- ~ alpha x y bar ~ sup T ~+~ alpha bar y x bar ~ sup T ~+~ A .EN for real symmetric matrices this is simply .EQ I A ~<- ~ alpha x y ~ sup T ~+~ alpha y x ~ sup T ~+~ A ~. .EN .SH 7. Rationale .PP The three basic matrix-vector operations chosen (Section 2) were obvious candidates because they occur in a wide range of linear algebra applications, and they occur at the innermost level of many algorithms. The hard decision was to restrict the scope only to these operations, since there are many other potential candidates, such as matrix scaling and sequences of plane rotations. Similarly, we could have extended the scope by applying the operations to other types of matrices such as complex symmetric or augmented band matrices. We have aimed at a reasonable compromise between a much larger number of routines each performing one type of operation (e.g. @x ~<-~ L ~ sup -T x@), and a smaller number of routines with a more complicated set of options. There are in fact, in each precision, 16 real routines performing altogether 43 different operations, and 17 complex routines performing 58 different operations. 
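.PP The semantics of the most heavily used of these operations, the GEMV product of item a) above, can be pinned down by a short reference model. The following sketch is in Python purely for exposition (the proposal itself is in Fortran 77); it assumes unit increments and row-of-rows storage, and is not part of the proposed software:

```python
# Reference model (illustrative only, unit increments) for GEMV:
#   TRANS = 'N':  y <- alpha*A*x          + beta*y
#   TRANS = 'T':  y <- alpha*A**T*x       + beta*y
#   TRANS = 'C':  y <- alpha*conj(A)**T*x + beta*y
# A is held as a list of m rows of n elements.

def gemv(trans, alpha, a, x, beta, y):
    m = len(a)
    n = len(a[0]) if m else 0
    rows = range(m) if trans == "N" else range(n)
    out = []
    for i in rows:
        if trans == "N":
            t = sum(a[i][j] * x[j] for j in range(n))
        elif trans == "T":
            t = sum(a[j][i] * x[j] for j in range(m))
        elif trans == "C":
            t = sum(a[j][i].conjugate() * x[j] for j in range(m))
        else:
            raise ValueError("invalid TRANS")
        # when BETA is zero, Y need not be defined on input (Section 4)
        out.append(alpha * t if beta == 0 else alpha * t + beta * y[i])
    y[:] = out
    return y
```

Note how the BETA = 0 branch never reads @y@, mirroring the convention of Section 4.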
.PP We feel that to extend the scope further would significantly reduce the chances of having the routines implemented efficiently over a wide range of machines, because it would place too heavy a burden on implementors. On the other hand, to restrict the scope further would place too narrow a limit on the potential applications of the level 2 BLAS. .PP The parameters @ alpha @ and @ beta@ are included in the non-triangular routines to give extra flexibility, but we recommend that implementors consider special code for the cases where @ alpha ~or ~beta~=~ 1.0 @ and @ alpha ~or~ beta ~=~ -1.0 @. Similarly, as with the level 1 BLAS, we have included an increment parameter with the vectors so that a vector could, for example, be a row of a matrix. But again we recommend that implementors consider special code for the case where the increments are unity. Note that the implementation must be such that when @ beta ~ = ~ 0.0 @ then @y@ need not be defined on input. .PP As noted earlier, corresponding to the real routine SGER we propose two complex routines CGERC (for @A ~<-~ alpha x y bar ~ sup T ~+~ A@) and CGERU (for @A ~<-~ alpha x y ~ sup T ~+~ A@). Both are frequently required. An alternative would be to provide a single complex routine CGER with an option parameter; however this parameter would have become redundant in the real routine SGER. Rather than have redundant parameters, or different parameter lists for the real and complex routines, we have chosen two distinct complex routines; they are analogous to the level 1 BLAS CDOTC (@c ~<- ~ c ~+~ x bar ~ sup T y@) and CDOTU (@c ~<- ~ c ~+~ x ~ sup T y@). .PP Note that no check has been included for singularity, and near singularity, in the triangular equation solving routines. The requirements for such a test depend upon the application and so we felt that this should not be included, but should instead be performed before calling the triangular solver. 
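.PP The CGERU/CGERC distinction drawn above can be made concrete with a small sketch (Python, complex arithmetic, unit increments; illustrative only, not part of the proposal):

```python
# Illustrative sketch of the two complex rank-1 updates:
#   CGERU:  A <- alpha*x*y**T       + A   (unconjugated)
#   CGERC:  A <- alpha*x*conj(y)**T + A   (conjugated)
# A is held as a list of m rows of n elements.

def geru(alpha, x, y, a):
    for i in range(len(x)):
        for j in range(len(y)):
            a[i][j] += alpha * x[i] * y[j]
    return a

def gerc(alpha, x, y, a):
    for i in range(len(x)):
        for j in range(len(y)):
            a[i][j] += alpha * x[i] * y[j].conjugate()
    return a
```

For real data the two updates coincide, which is why the single real routine SGER suffices; the pair mirrors CDOTU and CDOTC in the Level 1 BLAS.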
.PP On certain machines, which do not use the ASCII sequence on all of their Fortran systems, lower case characters may not exist, so that the innocent looking argument \f2`t'\f1, passed through the parameter TRANS for designating a transposed matrix, is not in the Fortran character set. Some UNIVAC systems do not have a lower case representation using the `field data' character set. On the CDC NOS-2 system, a mechanism is provided for a full 128 ASCII character set by using pairs of 6-bit host characters for certain 7-bit ASCII characters. This means that there is a `2 for 1' physical extension of the logical records that contain lower case letters. This fact can hamper portability of codes written on ASCII machines that are later moved to CDC systems. The only safe way to proceed is to convert the transported text entirely into the Fortran character set. On the other hand we believe that users on ASCII character set systems may wish to treat upper and lower case letters as equivalent in meaning. If this is done, it means that text that will be transported to machines of unknown types must have the ASCII set mapped into the Fortran character set before the text is moved. .PP The band storage scheme used by the GB, HB, SB, and TB routines has columns of the matrix stored in columns of the array, and diagonals of the matrix stored in rows of the array. This is the storage scheme used by LINPACK. An alternative scheme (used in some EISPACK [8,11] routines) has rows of the matrix stored in rows of the array, and diagonals of the matrix stored in columns of the array. The latter scheme has the advantage that a band matrix-vector product of the form @ y ~<- ~ alpha Ax ~+~ beta y @ can be computed using long vectors (the diagonals of the matrix) stored in contiguous elements, and hence is much more efficient on some machines (e.g. CDC Cyber 205) than the first scheme. 
However other computations involving band matrices, such as @x ~<-~ Tx@, @x ~<-~ T ~ sup -1 x@ and @LU@ and @U ~ sup T U @ factorization, cannot be organized `by diagonals'; instead the computation sweeps along the band, and the LINPACK storage scheme has the advantage of reducing the number of page swaps and allowing contiguous vectors (the columns of the matrix) to be used. .PP Although not discussed here, we plan to provide a portable testing package for each of the four parts of this extended BLAS package. The test package will be self-contained; generating test data and checking for conformity with this proposal in a portable fashion. .PP We considered the possibility of generalizing the rank-1 and rank-2 updates to rank-k updates. Rank-k updates with @k ~>~ 1@ (but @k ~<<~n@) can achieve significantly better performance on some machines than rank-1 [7]. But to take advantage of this usually requires complicating the calling algorithm; and moreover rank-k updates with @k ~ approx ~n@ would allow an even higher level operation such as matrix multiplication `in by the back door'. We prefer to keep to a clean concept of genuine matrix-vector operations. .bp .B Part 2: .SH 8. Applicability of a Set of Level 2 BLAS .PP The purpose of Part 2 is to demonstrate the wide applicability of the set of Level 2 BLAS proposed in Part 1. LINPACK and EISPACK are taken as well-known examples of heavily used software which could, with hindsight, have made extensive use of such a set of BLAS; future versions may do just that. The most straightforward way to use the Level 2 BLAS is as components in the development of new software; working out how to fit them into existing software can be more complicated, but may still be very worthwhile if it leads to substantial improvements in performance; a similar exercise has already been undertaken on selected NAG Library routines to improve their performance on vector processors [1]. 
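.PP As an aside, the LINPACK-style band storage adopted in Section 5 and discussed in Section 7, which recurs throughout what follows, can be made concrete with a short sketch. The following Python fragment is illustrative only (0-based indexing, unit increments): it packs a general band matrix and computes @y ~<-~ Ax@ from the packed form:

```python
# Illustrative sketch of LINPACK-style band storage (Section 5): element
# a[i][j] of an m-by-n band matrix with kl sub-diagonals and ku
# super-diagonals is held in AB[ku + i - j][j] (0-based), so the main
# diagonal lies in row ku of the array and matrix columns stay in columns.

def pack_band(a, kl, ku):
    m, n = len(a), len(a[0])
    ab = [[0.0] * n for _ in range(kl + ku + 1)]
    for j in range(n):
        for i in range(max(0, j - ku), min(m, j + kl + 1)):
            ab[ku + i - j][j] = a[i][j]
    return ab

def gbmv_n(m, n, kl, ku, ab, x):
    """y <- A*x reading only the stored band (TRANS = 'N', alpha=1, beta=0)."""
    y = [0.0] * m
    for j in range(n):
        for i in range(max(0, j - ku), min(m, j + kl + 1)):
            y[i] += ab[ku + i - j][j] * x[j]
    return y
```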
.PP Sections 9 and 10 demonstrate in some detail how calls to individual Level 2 BLAS can be used to perform the bulk of the computation in the CPO-, CPP- and CPB- sets of LINPACK routines. Section 11 surveys the broad applicability of the entire set of Level 2 BLAS in LINPACK and EISPACK. .SH 9. Example: the CPO- Routines in LINPACK .PP The routine CPOFA performs a @ U bar ~ sup T U @ factorization of a Hermitian positive-definite matrix. The executable code for this routine is: .ps 10 .sp 2 .cs 1 24 .nf DO 30 J = 1, N INFO = J S = 0.0E0 JM1 = J - 1 IF (JM1 .LT. 1) GO TO 20 DO 10 K = 1, JM1 T = A(K,J) - CDOTC(K-1,A(1,K),1,A(1,J),1) T = T/A(K,K) A(K,J) = T S = S + REAL(T*CONJG(T)) 10 CONTINUE 20 CONTINUE S = REAL(A(J,J)) - S C ......EXIT IF (S .LE. 0.0E0 .OR. AIMAG(A(J,J)) .NE. 0.0E0) GO TO 40 A(J,J) = CMPLX(SQRT(S),0.0E0) 30 CONTINUE .fi .ps 12 .sp 2 .cs 1 .PP It is not obvious from inspection of the code what type of matrix-vector operation, if any, is being performed here. It helps to analyze the algorithm with the use of partitioned matrices: .EQ I left ( matrix { ccol { A sub 11 above * above * } ccol { a sub 12 above a sub 22 above * } ccol { A sub 13 above a sub 23 ~ sup T above A sub 33 } } right ) ~=~ left ( matrix { ccol { U bar sub 11 ~ sup T above u bar sub 12 ~ sup T above U bar sub 13 ~ sup T } ccol { nothing above u sub 22 above u bar sub 23 } ccol { nothing above nothing above U bar sub 33 ~ sup T } } right ) left ( matrix { ccol { U sub 11 above nothing above nothing } ccol { u sub 12 above u sub 22 above nothing } ccol { U sub 13 above u sub 23 ~ sup T above U sub 33 } } right ) .EN Equating the middle column, we obtain: .EQ I a sub 12 ~=~ U bar sub 11 ^ sup T u sub 12 .EN .EQ I a sub 22 ~=~ u bar sub 12 ^ sup T u sub 12 ~+~ u sub 22 sup 2 .EN hence: .EQ I u sub 12 ~=~ U bar sub 11 ^ sup -T a sub 12 .EN .EQ I u sub 22 ~=~ sqrt { a sub 22 ~-~ u bar sub 12 ^ sup T u sub 12 } .EN Thus it is possible to compute @U@ column-by-column, with the
elements of @U@ overwriting the upper triangle of A, using the matrix-vector operation .EQ I x ~<-~ U bar ~ sup -T x. .EN CPOFA does just this, the outermost loop index J indicating the column isolated in the partitioned matrices above. The only complication is that CPOFA combines the computation of @u sub 12 @ with the computation of @u bar sub 12 ~ sup T u sub 12 @, but this can easily be split out into a separate call of CDOTC, and the code can be rewritten to use the Level 2 BLAS CTRSV: .ps 10 .sp .cs 1 24 .nf DO 30 J = 1, N INFO = J C CALL CTRSV ('UPPER', 'CONJUGATE-TRANSPOSE', 'NON-UNIT', * J-1, A, LDA, A(1,J), 1) C S = REAL(A(J,J) - CDOTC(J-1,A(1,J),1,A(1,J),1)) C ......EXIT IF (S .LE. 0.0E0 .OR. AIMAG(A(J,J)) .NE. 0.0E0) GO TO 40 A(J,J) = CMPLX(SQRT(S),0.0E0) 30 CONTINUE .fi .ps 12 .sp .cs 1 .PP Note that this code takes advantage of the convention that CTRSV performs no computation when called with its @4 sup th @ parameter equal to zero. .PP The routine CPOSL, which solves @Ax ~=~ b@ after @A@ has been factorized by CPOFA, does nothing more than compute .EQ I y ~<-~ U bar ~ sup -T b .EN .EQ I x ~<-~ U ~ sup -1 y .EN and it is straightforward to recode it to call CTRSV. .PP The computation of @A ~ sup -1 @ in CPODI proceeds in two stages. In the first, @U@ is inverted in place: at the @k sup th @ stage, the algorithm performs a rank-1 update of the rectangular submatrix in A(1:K-1, K+1:N), which can be implemented by a call to CGERU. (Incidentally page 3.11 of the LINPACK Users' Guide describes a different organization of this computation from that which is actually performed by the code.) Finally @A ~ sup -1 @ is computed by forming the product @U ~ sup -1 U bar ~ sup -T @: the analysis on pages 3.11 and 3.12 of the LINPACK Users' Guide (using partitioned matrices) shows that the computation involves a rank-1 update of the Hermitian leading submatrix of @A@, i.e. a call to CHER.
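.PP The column-by-column factorization described above can be paraphrased in a few lines of Python (illustrative only; the real CPOFA works in place, in complex arithmetic, and with error handling). Each column requires one conjugate-transposed triangular solve, which is exactly the step CTRSV performs:

```python
# Illustrative paraphrase of the CPOFA recurrence:
#   u12 = conj(U11)**-T * a12              (one triangular solve per column)
#   u22 = sqrt(a22 - conj(u12)**T * u12)
# a is Hermitian positive-definite (list of rows); returns the upper
# factor U with conj(U)**T * U = A.

import math

def chol_upper(a):
    n = len(a)
    u = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # forward substitution with the lower triangle conj(U11)**T
        for i in range(j):
            s = a[i][j] - sum(u[k][i].conjugate() * u[k][j] for k in range(i))
            u[i][j] = s / u[i][i].conjugate()
        # the diagonal of a Hermitian matrix is real, hence .real
        d = (a[j][j] - sum(abs(u[k][j]) ** 2 for k in range(j))).real
        u[j][j] = math.sqrt(d)  # d <= 0 would signal a non-positive-definite A
    return u
```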
.PP The revised form of the relevant sections of code is: .ps 10 .sp 2 .cs 1 24 .nf DO 100 K = 1, N A(K,K) = (1.0E0,0.0E0)/A(K,K) T = -A(K,K) CALL CSCAL(K-1,T,A(1,K),1) C CALL CGERU (K-1, N-K, (+1.0,0.0), A(1,K), 1, * A(K,MIN(K+1,N)), LDA, A(1,MIN(K+1,N)), LDA ) C 100 CONTINUE C DO 130 J = 1, N C CALL CHER ('UPPER', J-1, +1.0, A(1,J), 1, A, LDA) C T = CONJG(A(J,J)) CALL CSCAL(J,T,A(1,J),1) 130 CONTINUE .fi .ps 12 .sp 2 .cs 1 .PP The remaining routine in the CPO- set is the condition estimator CPOCO. Although mathematically the algorithm involves the operations @x ~<-~ U ~ sup -1 x @ and @x ~<-~ U bar ~ sup -T x@, calls to CTRSV cannot simply be substituted for sections of the existing code because of the careful scaling that is done during the solution of these triangular systems. A deeper analysis is required to determine whether the scaling could be separated from the solution. .SH 10. Extensions: the CPP- and CPB- Routines in LINPACK .PP The CPP- set of LINPACK routines differs from the CPO- set only in the way that the matrices are stored: upper triangular matrices are packed by columns into a one-dimensional array. In this storage scheme, the leading upper triangular submatrices are stored in leading subsections of the array, but no other useful submatrices are held in convenient subsections of the array that could be passed as parameters to Level 2 BLAS. In CPOFA, and in the formation of @U ~ sup -1 U bar ~ sup -T @ in CPODI, it is the leading upper triangle that is passed as a parameter to CTRSV and CHER (in the example in Section 9). Hence by direct analogy, calls to CTPSV and CHPR can be inserted into CPPFA and CPPDI. However, in the computation of @U ~ sup -1 @, the rectangular submatrix passed to CGERU by CPODI is not mapped onto a rectangular subsection of the packed array, so CGERU cannot be used in this part of CPPDI. Fortunately an alternative way of organizing the computation - that which is actually described in the LINPACK Users' Guide!
- involves the operation @x ~<-~ Ux@, and so the overall result can be reproduced by using a call to CTPMV. .PP The routine CPBFA, which is the analogue of CPOFA for banded matrices, differs from CPOFA both in the way the elements are stored and in the fact that computations corresponding to elements outside the band are omitted. The computation is illustrated in the diagram below, in which the @6 sup th @ column is partitioned off as in Section 9: .KS .ps 10 .sp 2 .cs 1 24 .nf 11 12 13 14 0 0 0 * 22 23 24 25 0 0 * * 33 34 35 36 0 * * * 44 45 46 47 0 * * * 55 56 57 0 0 * * * 66 67 0 0 0 * * * 77 .fi .ps 12 .KE .sp 2 .cs 1 .PP Instead of using the whole of the leading 5-by-5 upper triangle in the operation @x ~<-~ U x@ to compute the 6th column of the result, we can take advantage of the leading zeros in column 6 so that only the upper triangular submatrix in rows and columns 3 to 5 is required. This is mapped onto the banded array storage thus: .ps 10 .sp 2 .KS .cs 1 24 .nf 14 25 36 47 13 24 35 46 57 12 23 34 45 56 67 11 22 33 44 55 66 77 .fi .ps 12 .KE .sp .cs 1 .PP It can be passed as a parameter to the Level 2 BLAS CTRSV by using the device described in Appendix B: the parameter LDA passed to CTRSV must be one less than the leading dimension declared in the calling program (with care taken to avoid array bound errors). .PP Of course, the routines CPPSL and CPBSL can be implemented via straightforward calls to CTPSV and CTBSV. .SH 11. Scope of Application in LINPACK and EISPACK .PP Table 1 shows, without any claim to completeness, where calls to the proposed set of Level 2 BLAS might be incorporated into LINPACK or EISPACK routines (only the real single precision set has been considered, and EISPACK routines which operate on complex matrices have been omitted). Many of the calls to SGEMV and SGER arise in the application of a Householder transformation to a matrix.
The computation .EQ I X ~<-~ ( I ~-~ alpha u u ~ sup T )X ~=~ X ~-~ alpha u( X ~ sup T u ) ~ sup T .EN can be performed by the following operations using the Level 2 BLAS in conjunction with a work vector @w@: .EQ I w ~<-~ X ~ sup T u ~~~~~~~~~~~~~~~~~~~~~~ (SGEMV) .EN .EQ I X ~<-~ - alpha u w ~ sup T ~+~ X ~~~~~~~~~~~~~~~ (SGER) .EN Some latitude has been allowed in drawing up Table 1, in that a suitable work vector may not be available in the existing code. Similarly, to make use of STRSV in SGESL would require the interchanges to be handled differently by the SGE- set of LINPACK routines. .PP Furthermore, many of the algorithms listed in Table 1 can be organized in different ways, as was pointed out by Dongarra, Gustavson, and Karp [6] for matrix multiplication and LU-factorization. Following their convention, in which the basic operation in the innermost loop is something like @a sub ij ~<-~ a sub ij ~ +- ~ a sub ik a sub kj @: - with @i@ as the outermost loop index, the result is computed row by row; - with @j@ as the outermost loop index, the result is computed column by column; - with @k@ as the outermost loop index, the computation proceeds by updating a large submatrix and (in most algorithms) reducing the dimension of the problem by one at each stage. .sp .KS .ce Table 1 .TS center; l l l|l l l. BLAS LINPACK EISPACK BLAS LINPACK EISPACK _ _ _ _ _ _ SGEMV SGEDI ORTHES SSPMV - TRED3 SQRDC ORTBAK SSVDC ORTRAN SSPR SPPDI - ELMHES SSPFA QZHES TRED2 SSPR2 - TRED3 TRBAK1 TRBAK3 STRMV - ELMBAK REDUC STPMV SPPDI - SGER SPODI ORTHES SGEFA ORTBAK STRSV SGESL ELMHES SGEDI ORTRAN SPOFA REDUC SGBFA QZHES SPBFA REBAK SQRDC TRED2 SPOSL SSVDC TRBAK1 TRBAK3 STPSV SPPSL - SPPFA SSYMV - TRED1 TRED2 STBSV SGBSL - SPBSL SSYR SCHDC - SPODI SSIFA SSYR2 - TRED1 TRED2 .TE .KE .ps 12 .sp 2 .cs 1 .PP In Fortran code, in order to avoid paging problems, the @j@-form is usually preferred to the @i@-form, but the @i@-form would be preferred if matrices were stored by rows.
The @i@-form and @j@-form can often permit GAXPY operations [6] in their innermost loops, and so are likely to be favorable on a CRAY-1, whereas the @k@-form (which can work either by row or column) would be preferable on machines which allowed the individual AXPY operations of a rank-1 update to be performed in parallel. Each of the three methods of organizing an algorithm can involve calls to different Level 2 BLAS. For example, for LU-factorization, - the @i@-form requires STRSV (@x ~<-~ U ~ sup -T x@) and SGEMV (@y ~<-~ alpha A ~ sup T x ~+~ beta y@) - the @j@-form requires STRSV (@x ~<-~ L ~ sup -1 x@) and SGEMV (@y ~<-~ alpha Ax ~+~ beta y @) - the @k@-form requires SGER (@ A ~<-~ alpha xy ~ sup T ~+~ A@) .PP In all the algorithms studied which involve symmetric matrices, it has proved possible to find a form which permits the use of packed storage in conjunction with the Level 2 BLAS. Consider as an example the computation of @L ~ sup -1 A L ~ sup -T @ for symmetric @A@ using packed storage (i.e. a "packed" version of the EISPACK routine REDUC). If the lower triangles of @A@ and @L@ are packed by columns, the computation can be implemented by calls to SSPR2 (@A ~<-~ alpha xy ~ sup T ~+~ alpha yx ~ sup T ~+~ A@) and STPSV (@x ~<-~ L ~ sup -1 x@); if the lower triangles are packed by rows, this is equivalent to computing @U ~ sup -T A U ~ sup -1 @ with the upper triangles of @A@ and @U@ packed by columns, and can be implemented by calls to STPSV (@x ~<-~ U ~ sup -1 x@) and SSPMV (@y ~<-~ alpha Ax ~+~ y@ for symmetric @A@). .PP A further degree of variation is introduced if for any reason we wish to perform the basic triangular factorizations 'backwards' (i.e. as @UL@, @L ~ sup T L@ or @U U ~ sup T @ ). .PP Allowing for all these possibilities within a small set of fundamental algorithms yields applications for each of the proposed Level 2 BLAS, except the banded matrix-vector products.
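.PP Two of the building blocks catalogued in this section can be made concrete with a short sketch. The following Python fragment is our own illustration (0-based indexing, no pivoting, and the function names are ours): the first routine is the two-call Householder application, with the SGEMV step forming w = X^T u and the SGER step applying the rank-1 correction; the second is the @k@-form LU factorization, whose inner update is exactly one SGER call per step:

```python
def house_apply(x_mat, u, alpha):
    # X <- (I - alpha*u*u^T)*X, done as w <- X^T u (the SGEMV step),
    # then X <- X - alpha*u*w^T (the SGER step).
    m, n = len(x_mat), len(x_mat[0])
    w = [sum(x_mat[i][j] * u[i] for i in range(m)) for j in range(n)]
    for i in range(m):
        for j in range(n):
            x_mat[i][j] -= alpha * u[i] * w[j]
    return x_mat

def lu_k_form(a):
    # k-form LU factorization (no pivoting, purely illustrative): at
    # step k the trailing submatrix receives the rank-1 update
    #   A[k+1:, k+1:] <- A[k+1:, k+1:] - A[k+1:, k] * A[k, k+1:],
    # i.e. one SGER call; L (unit diagonal) and U overwrite A.
    n = len(a)
    for k in range(n):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]          # column k of L
        for i in range(k + 1, n):       # the rank-1 (SGER) update
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a
```

In the @i@- and @j@-forms the same factorization would instead be driven by triangular solves and matrix-vector products, as listed above.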
We consider it important that the set of extended BLAS be sufficiently wide to permit this degree of variation and not to constrain the freedom of software developers. .PP These remarks have concentrated on algorithms for dense matrices with a simple structure. It is hoped that the Level 2 BLAS may also prove useful when dealing with matrices with more complicated structures, if this can be done by splitting them into dense sub-matrices. .SH 12. Acknowledgements .PP An earlier draft of this proposal was discussed at the Parvec IV Workshop organized by John Rice at Purdue University on October 29-30, 1984, and at the SIAM Conference on Applied Linear Algebra in Raleigh, April 29-May 2, 1985. We wish to thank all the participants at the workshop and meeting for their comments, discussions, and encouragement, as well as the many people who have sent us comments separately. .bp .SH 13. References .sp .IP [1] C. Daly and J.J. Du Croz, ``Performance of a Subroutine Library on Vector-Processing Machines,'' .I Computer Physics Communications, .R Vol. 37, (1985), 181-186. .sp .IP [2] D.S. Dodson and J.G. Lewis, ``Issues relating to extension of the Basic Linear Algebra Subprograms,'' .I ACM SIGNUM Newsletter, .R Vol. 20, No. 1, (1985), 2-18. .sp .IP [3] .R J.J. Dongarra and S.C. Eisenstat, ``Squeezing the Most out of an Algorithm in CRAY Fortran,'' .I ACM Transactions on Mathematical Software, .R Vol. 10, No. 3, (1984), 221-230. .sp .IP [4] J.J. Dongarra, .I Increasing the Performance of Mathematical Software through High-Level Modularity. .R Proceedings of the Sixth International Symposium on Computing Methods in Engineering and Applied Sciences (Versailles, France), North-Holland, (1984), pp. 239-248. .sp .IP [5] J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart, .I LINPACK Users' Guide, .R SIAM Publications, Philadelphia, 1979. .sp .IP [6] J.J. Dongarra, F.G. Gustavson, and A. Karp, ``Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine,''
.I SIAM Review, .R Vol. 26, (1984), 91-112. .sp .IP [7] .R J.J. Dongarra, L. Kaufman, and S. Hammarling, .I Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers, .R Argonne National Laboratory Report ANL MCS-TM 46, (January 1985), to appear in Linear Algebra and Its Applications. .sp .IP [8] B.S. Garbow, J.M. Boyle, J.J. Dongarra, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide Extension, .R Lecture Notes in Computer Science, Vol. 51, Springer-Verlag, Berlin, 1977. .sp .IP [9] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, ``Basic Linear Algebra Subprograms for Fortran Usage,'' .I ACM Transactions on Mathematical Software, .R Vol. 5, (1979), 308-323. .sp .IP [10] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, ``Algorithm 539: Basic Linear Algebra Subprograms for Fortran Usage,'' .I ACM Transactions on Mathematical Software, .R Vol. 5, (1979), 324-325. .sp .IP [11] B.T. Smith, J.M. Boyle, J.J. Dongarra, B.S. Garbow, Y. Ikebe, V.C. Klema, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide, .R Lecture Notes in Computer Science, Vol. 6, 2nd edition, Springer-Verlag, Berlin, 1976. .bp .SH Appendix A .PP This appendix gives an example to illustrate three possible implementations, within Fortran, of the proposed extended BLAS. The example considers three versions of the routine SGEMV: the first, a straightforward version that is likely to be suitable for many conventional serial paged or non-paged machines; the second, a tuned version with unrolled loops, suitable for vector machines such as the CRAY-1; the third, a version with accumulated inner products, aimed at short word length machines.
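.PP Before the Fortran texts, it may help to state the intended semantics compactly. The following Python fragment is our own 0-based sketch (not part of the proposal) of what all three versions below compute, including the convention that a negative increment runs through the vector backwards:

```python
def sgemv(trans, m, n, alpha, a, x, incx, beta, y, incy):
    # y := alpha*op(A)*x + beta*y, with op(A) = A for TRANS = 'N'/'n'
    # and op(A) = A^T otherwise; a is an m-by-n list of rows, and kx/ky
    # are the BLAS starting points for non-unit or negative increments.
    lenx, leny = (n, m) if trans in 'Nn' else (m, n)
    kx = 0 if incx > 0 else -(lenx - 1) * incx
    ky = 0 if incy > 0 else -(leny - 1) * incy
    for i in range(leny):                      # y := beta*y
        y[ky + i * incy] *= beta
    for j in range(lenx):
        xj = x[kx + j * incx]
        for i in range(leny):
            aij = a[i][j] if trans in 'Nn' else a[j][i]
            y[ky + i * incy] += alpha * aij * xj
    return y
```

The three Fortran versions differ only in how these loops are ordered, unrolled, or accumulated, not in the result they produce.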
.ps 8 .sp 2 .cs 1 24 .nf \f2Version 1.\f1 .vs 11p SUBROUTINE SGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, $ BETA, Y, INCY ) CHARACTER*1 TRANS INTEGER M, N, LDA, INCX, INCY REAL ALPHA, A( LDA, * ), X( * ), BETA, Y( * ) * * Purpose * ======= * * SGEMV performs one of the matrix-vector operations * * y := alpha*A*x + beta*y, or y := alpha*A'*x + beta*y, * * where alpha and beta are scalars, x and y are vectors and A is an * m by n matrix. * * Parameters * ========== * * TRANS - CHARACTER*1. * On entry, TRANS specifies the operation to be performed as * follows: * * TRANS = 'N' or 'n' y := alpha*A*x + beta*y. * * TRANS = 'T' or 't' or y := alpha*A'*x + beta*y. * 'C' or 'c' * * Unchanged on exit. * * M - INTEGER. * On entry, M specifies the number of rows of the matrix A. * M must be at least zero. * Unchanged on exit. * * N - INTEGER. * On entry, N specifies the number of columns of the matrix A. * N must be at least zero. * Unchanged on exit. * * ALPHA - REAL . * On entry, ALPHA specifies the scalar alpha. * Unchanged on exit. * * A - REAL array of DIMENSION ( LDA, n ). * Before entry, the leading m by n part of the array A must * contain the matrix of coefficients. * Unchanged on exit. * * LDA - INTEGER. * On entry, LDA specifies the leading dimension of A as * declared in the calling (sub) program. LDA must be at least * m. * Unchanged on exit. * * X - REAL array of DIMENSION at least * ( 1 + ( n - 1 )*abs( INCX ) ) when TRANS = 'N' or 'n' * and at least * ( 1 + ( m - 1 )*abs( INCX ) ) otherwise. * Before entry, the incremented array X must contain the * vector x. * Unchanged on exit. * * INCX - INTEGER. * On entry, INCX specifies the increment for the elements of * X. * Unchanged on exit. * * BETA - REAL . * On entry, BETA specifies the scalar beta. When BETA is * supplied as zero then Y need not be set on input. * Unchanged on exit.
* * Y - REAL array of DIMENSION at least * ( 1 + ( m - 1 )*abs( INCY ) ) when TRANS = 'N' or 'n' * and at least * ( 1 + ( n - 1 )*abs( INCY ) ) otherwise. * Before entry with BETA non-zero, the incremented array Y * must contain the vector y. On exit, Y is overwritten by the * updated vector y. * * INCY - INTEGER. * On entry, INCY specifies the increment for the elements of * Y. * Unchanged on exit. * * Note that TRANS, M, N and LDA must be such that the value of the * LOGICAL variable OK in the following statement is true: * * OK = ( ( TRANS.EQ.'N' ).OR.( TRANS.EQ.'n' ).OR. * $ ( TRANS.EQ.'T' ).OR.( TRANS.EQ.'t' ).OR. * $ ( TRANS.EQ.'C' ).OR.( TRANS.EQ.'c' ) ) * $ .AND. * $ ( M.GE.0 ) * $ .AND. * $ ( N.GE.0 ) * $ .AND. * $ ( LDA.GE.M ) * * * * Level 2 Blas routine. * * This is the 'vanilla' Fortran version, which should be suitable * for most conventional paged and non-paged machines. * * -- This version written on 29-August-1985. * LOGICAL NULL , UNIT INTEGER I , IX , IY , J , JX , JY INTEGER KX , KY , LENX , LENY REAL TEMP REAL ONE , ZERO PARAMETER ( ONE = 1.0E+0, ZERO = 0.0E+0 ) * * Quick return if possible. * IF( ( M.EQ.0 ).OR.( N.EQ.0 ).OR. $ ( ( ALPHA.EQ.ZERO ).AND.( BETA.EQ.ONE ) ) ) $ RETURN * * Initialize some useful constants. * LENX and LENY are the lengths of the vectors x and y respectively. * NULL = ( TRANS.EQ.'N' ).OR.( TRANS.EQ.'n' ) UNIT = ( INCX.EQ.1 ).AND.( INCY.EQ.1 ) IF( NULL )THEN LENX = N LENY = M ELSE LENX = M LENY = N END IF * * Start the operations. In this version the elements of A are * accessed sequentially with one pass through A. * * Form y := beta*y and set up the start points in X and Y if the * increments are not both unity.
* IF( UNIT )THEN IF( BETA.NE.ONE )THEN IF( BETA.EQ.ZERO )THEN DO 10, I = 1, LENY Y( I ) = ZERO 10 CONTINUE ELSE DO 20, I = 1, LENY Y( I ) = BETA*Y( I ) 20 CONTINUE END IF END IF ELSE IF( INCX.GT.0 )THEN KX = 1 ELSE KX = 1 - ( LENX - 1 )*INCX END IF IF( INCY.GT.0 )THEN KY = 1 ELSE KY = 1 - ( LENY - 1 )*INCY END IF IF( BETA.NE.ONE )THEN IY = KY IF( BETA.EQ.ZERO )THEN DO 30, I = 1, LENY Y( IY ) = ZERO IY = IY + INCY 30 CONTINUE ELSE DO 40, I = 1, LENY Y( IY ) = BETA*Y( IY ) IY = IY + INCY 40 CONTINUE END IF END IF END IF IF( NULL )THEN * * Form y := alpha*A*x + y. * IF( UNIT )THEN DO 60, J = 1, N IF( X( J ).NE.ZERO )THEN TEMP = ALPHA*X( J ) DO 50, I = 1, M Y( I ) = Y( I ) + TEMP*A( I, J ) 50 CONTINUE END IF 60 CONTINUE ELSE JX = KX DO 80, J = 1, N IF( X( JX ).NE.ZERO )THEN TEMP = ALPHA*X( JX ) IY = KY DO 70, I = 1, M Y( IY ) = Y( IY ) + TEMP*A( I, J ) IY = IY + INCY 70 CONTINUE END IF JX = JX + INCX 80 CONTINUE END IF ELSE * * Form y := alpha*A'*x + y. * IF( UNIT )THEN DO 100, J = 1, N TEMP = ZERO DO 90, I = 1, M TEMP = TEMP + A( I, J )*X( I ) 90 CONTINUE Y( J ) = Y( J ) + ALPHA*TEMP 100 CONTINUE ELSE JY = KY DO 120, J = 1, N TEMP = ZERO IX = KX DO 110, I = 1, M TEMP = TEMP + A( I, J )*X( IX ) IX = IX + INCX 110 CONTINUE Y( JY ) = ALPHA*TEMP + Y( JY ) JY = JY + INCY 120 CONTINUE END IF END IF RETURN * * End of SGEMV . * END .vs 16p \f2Version 2.\f1 .vs 11p SUBROUTINE SGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, $ BETA, Y, INCY ) CHARACTER*1 TRANS INTEGER M, N, LDA, INCX, INCY REAL ALPHA, A( LDA, * ), X( * ), BETA, Y( * ) * * The following code illustrates a possible Fortran CRAY 1 version. * In this version the loops are only unrolled to a depth of 4, but a * depth of 16 should be considered for a 'production' version. * Special code for cases such as alpha = 1.0 and unit increments has * not been included, but this could be considered for a 'production' * version. * * -- This version written on 29-August-1985. 
* INTRINSIC MOD INTEGER I , ISTART, IX , IY , J , JSTART INTEGER JX , JY , KX , KY , LENX , LENY REAL TEMP1 , TEMP2 , TEMP3 , TEMP4 REAL ONE , ZERO PARAMETER ( ONE = 1.0E+0, ZERO = 0.0E+0 ) * * Quick return if possible. * IF( ( M.EQ.0 ).OR.( N.EQ.0 ).OR. $ ( ( ALPHA.EQ.ZERO ).AND.( BETA.EQ.ONE ) ) ) $ RETURN * * Initialize some useful constants. * LENX and LENY are the lengths of the vectors x and y respectively. * IF( ( TRANS.EQ.'N' ).OR.( TRANS.EQ.'n' ) )THEN LENX = N LENY = M ELSE LENX = M LENY = N END IF * * Start the operations. In this version the elements of A are * accessed sequentially with one pass through A. * * Set up the start points in X and Y and form y := beta*y. * IF( INCX.GT.0 )THEN KX = 1 ELSE KX = 1 - ( LENX - 1 )*INCX END IF IF( INCY.GT.0 )THEN KY = 1 ELSE KY = 1 - ( LENY - 1 )*INCY END IF IY = KY IF( BETA.EQ.ZERO )THEN DO 30, I = 1, LENY Y( IY ) = ZERO IY = IY + INCY 30 CONTINUE ELSE DO 40, I = 1, LENY Y( IY ) = BETA*Y( IY ) IY = IY + INCY 40 CONTINUE END IF IF( ( TRANS.EQ.'N' ).OR.( TRANS.EQ.'n' ) )THEN * * Form y := alpha*A*x + y. The first two loops are 'clean-up' * loops for the left over vectors. * JX = KX J = MOD( N, 2 ) IF( J.EQ.1 )THEN TEMP1 = ALPHA*X( JX ) IY = KY DO 50, I = 1, M Y( IY ) = Y( IY ) + TEMP1*A( I, J ) IY = IY + INCY 50 CONTINUE JX = JX + INCX END IF J = MOD( N, 4 ) IF( ( J.EQ.2 ).OR.( J.EQ.3 ) )THEN TEMP1 = ALPHA*X( JX ) TEMP2 = ALPHA*X( JX + INCX ) IY = KY DO 60, I = 1, M Y( IY ) = ( ( Y( IY ) ) + $ TEMP1*A( I, J - 1 ) ) + $ TEMP2*A( I, J ) IY = IY + INCY 60 CONTINUE JX = JX + 2*INCX END IF * * The main loop runs over groups of four vectors. 
* JSTART = J + 4 DO 80, J = JSTART, N, 4 TEMP1 = ALPHA*X( JX ) TEMP2 = ALPHA*X( JX + INCX ) TEMP3 = ALPHA*X( JX + 2*INCX ) TEMP4 = ALPHA*X( JX + 3*INCX ) IY = KY DO 70, I = 1, M Y( IY ) = ( ( ( ( Y( IY ) ) + $ TEMP1*A( I, J - 3 ) ) + $ TEMP2*A( I, J - 2 ) ) + $ TEMP3*A( I, J - 1 ) ) + $ TEMP4*A( I, J ) IY = IY + INCY 70 CONTINUE JX = JX + 4*INCX 80 CONTINUE ELSE * * Form y := alpha*A'*x + y. The first two loops are 'clean-up' * loops for the left over vectors. * IX = KX I = MOD( M, 2 ) IF( I.EQ.1 )THEN TEMP1 = ALPHA*X( IX ) JY = KY DO 90, J = 1, N Y( JY ) = Y( JY ) + TEMP1*A( I, J ) JY = JY + INCY 90 CONTINUE IX = IX + INCX END IF I = MOD( M, 4 ) IF( ( I.EQ.2 ).OR.( I.EQ.3 ) )THEN TEMP1 = ALPHA*X( IX ) TEMP2 = ALPHA*X( IX + INCX ) JY = KY DO 100, J = 1, N Y( JY ) = ( ( Y( JY ) ) + $ TEMP1*A( I - 1, J ) ) + $ TEMP2*A( I , J ) JY = JY + INCY 100 CONTINUE IX = IX + 2*INCX END IF * * The main loop runs over groups of four vectors. * ISTART = I + 4 DO 120, I = ISTART, M, 4 TEMP1 = ALPHA*X( IX ) TEMP2 = ALPHA*X( IX + INCX ) TEMP3 = ALPHA*X( IX + 2*INCX ) TEMP4 = ALPHA*X( IX + 3*INCX ) JY = KY DO 110, J = 1, N Y( JY ) = ( ( ( ( Y( JY ) ) + $ TEMP1*A( I - 3, J ) ) + $ TEMP2*A( I - 2, J ) ) + $ TEMP3*A( I - 1, J ) ) + $ TEMP4*A( I , J ) JY = JY + INCY 110 CONTINUE IX = IX + 4*INCX 120 CONTINUE END IF RETURN * * End of SGEMV . * END .vs 16p \f2Version 3.\f1 .vs 11p SUBROUTINE SGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, $ BETA, Y, INCY ) CHARACTER*1 TRANS INTEGER M, N, LDA, INCX, INCY REAL ALPHA, A( LDA, * ), X( * ), BETA, Y( * ) * * The following code illustrates a version for short word length * machines for which speed is not essential. Accumulated inner * products are used. * * -- This version written on 11-October-1985. * INTRINSIC DBLE INTEGER I , IX , IY , J , JX , JY INTEGER KX , KY , LENX , LENY REAL ONE , ZERO PARAMETER ( ONE = 1.0E+0, ZERO = 0.0E+0 ) DOUBLE PRECISION DTEMP * * Quick return if possible. * IF( ( M.EQ.0 ).OR.( N.EQ.0 ).OR. 
$ ( ( ALPHA.EQ.ZERO ).AND.( BETA.EQ.ONE ) ) ) $ RETURN * * LENX and LENY are the lengths of the vectors x and y respectively. * IF( ( TRANS.EQ.'N' ).OR.( TRANS.EQ.'n' ) )THEN LENX = N LENY = M ELSE LENX = M LENY = N END IF * * Start the operations. * * Form y := beta*y and set up the start points in X and Y. * IF( INCX.GT.0 )THEN KX = 1 ELSE KX = 1 - ( LENX - 1 )*INCX END IF IF( INCY.GT.0 )THEN KY = 1 ELSE KY = 1 - ( LENY - 1 )*INCY END IF IY = KY IF( BETA.EQ.ZERO )THEN DO 10, I = 1, LENY Y( IY ) = ZERO IY = IY + INCY 10 CONTINUE ELSE DO 20, I = 1, LENY Y( IY ) = BETA*Y( IY ) IY = IY + INCY 20 CONTINUE END IF IF( ( TRANS.EQ.'N' ).OR.( TRANS.EQ.'n' ) )THEN * * Form y := alpha*A*x + y. * IY = KY DO 40, I = 1, M DTEMP = ZERO JX = KX DO 30, J = 1, N DTEMP = DTEMP + A( I, J )*DBLE( X( JX ) ) JX = JX + INCX 30 CONTINUE Y( IY ) = ALPHA*DTEMP + Y( IY ) IY = IY + INCY 40 CONTINUE ELSE * * Form y := alpha*A'*x + y. * JY = KY DO 60, J = 1, N DTEMP = ZERO IX = KX DO 50, I = 1, M DTEMP = DTEMP + A( I, J )*DBLE( X( IX ) ) IX = IX + INCX 50 CONTINUE Y( JY ) = ALPHA*DTEMP + Y( JY ) JY = JY + INCY 60 CONTINUE END IF RETURN * * End of SGEMV . * END .vs 16p .fi .ps 12 .sp 2 .cs 1 .bp .SH Appendix B .PP In this appendix we illustrate how to use the full matrix update routines to obtain rank-one and rank-two updates to band matrices. We assume that the vectors x and y are such that no fill-in occurs outside the band, in which case the update affects only a full rectangle within the band matrix A. This is illustrated in Fig. B.1 for the case where @m=n=9@, @kl=2@, @ku=3@, and the update commences in row (and column) @l=3@. .cs 1 24 .nf ( * * * * ) ( 0 )( 0 0 * * * * 0 0 0 ) ( * * * * * ) ( 0 ) ( * * |* * * *| ) ( * ) ( * |* * * *| * ) ( * ) ( |* * * *| * * ) + ( * ) ( * * * * * * ) ( 0 ) ( * * * * * ) ( 0 ) ( * * * * ) ( 0 ) ( * * * ) ( 0 ) .fi .cs 1 .EQ A ~+~ xy ~ sup T .EN Fig.
B.1 .PP We see that the update affects only that part of @A@ indicated by the dotted lines, that is, the (@kl@+1) by (@ku@+1) part of @A@ starting at @a sub ll @. The routines that we have omitted are GBR, SBR, and SBR2 (in the complex case HBR and HBR2). Their parameter lists could have been GBR ( M, N, KL, KU, L, ALPHA, X, INCX, Y, INCY, A, LDA ) SBR ( UPLO, N, K, L, ALPHA, X, INCX, A, LDA ) SBR2 ( UPLO, N, K, L, ALPHA, X, INCX, Y, INCY, A, LDA ) where the parameter L denotes the starting row and column for the update, and the elements @x sub l @ and @y sub l @ of the vectors @x@ and @y@ are in elements X(1) and Y(1) of the arrays X and Y. Calls to SGBR can be achieved by KM = MIN (KL+1, M-L+1) KN = MIN (KU+1, N-L+1) CALL SGER (KM, KN, ALPHA, X, INCX, Y, INCY, A(KU+1, L), MAX(KM, LDA-1)) Calls to SSBR can be achieved by KN = MIN(K+1, N-L+1) IF (UPLO .EQ. 'U') THEN CALL SSYR('U', KN, ALPHA, X, INCX, A(K+1, L), MAX(1, LDA-1)) ELSE CALL SSYR('L', KN, ALPHA, X, INCX, A(1, L), MAX(1, LDA-1)) ENDIF and similarly for calls to SSBR2. .bp .SH Appendix C .PP In this appendix we propose an additional set of real and complex single precision Level 2 routines which allow extended precision matrix-vector operations to be performed. .PP The names of these routines are obtained by preceding the character representing the Fortran data type (S or C) by the character E. These routines are to perform the operations described in Section 2 as follows. .sp 2 .PP For the matrix-vector operations .sp .EQ I y ~<-~ alpha A x ~+~ beta y, ~~ y ~<-~ alpha A ~ sup T x ~+~ beta y , ~~ y ~<-~ alpha A bar ~ sup T x ~+~ beta y .EN @ alpha , ~ beta , ~ A,@ and @x@ are single precision, @y@ is double precision, and the computation of @y@ is to be performed in at least double precision.
.sp .PP For the triangular operations .sp .EQ I x ~<-~ T x, ~~ x ~<-~ T ~ sup T x, ~~ x ~<-~ T bar ~ sup T x, .EN .EQ I x ~<-~ T ~ sup -1 x, ~~ x ~<-~ T ~ sup -T x, ~~ x ~<-~ T bar ~ sup -T x .EN .sp @T@ is single precision, @x@ is double precision, and the computation of @x@ is to be performed in at least double precision. .PP For the rank-one and rank-two updates .sp .EQ I A ~<-~ alpha x y ~ sup T ~+~ A,~~ A ~<-~ alpha x y bar ~ sup T ~+~ A, .EN .EQ I H ~<-~ alpha x x bar ~ sup T ~+~ H, ~~ H ~<-~ alpha x y bar ~ sup T ~+~ alpha bar y x bar ~ sup T ~+~ H, .EN .sp @ alpha@, @A@, and @H@ are single precision, @x@ and @y@ are double precision, and the computation is to be performed in at least double precision. .PP The type and dimension of the corresponding Fortran variables (see Section 6) are as follows. .PP For all routines whose first two letters are ES: .sp REAL ALPHA, BETA REAL A( LDA, * ) REAL AP( * ) DOUBLE PRECISION Y( * ) .sp .PP For routines ESGEMV, ESGBMV, ESSYMV, ESSBMV, ESSPMV: .sp REAL X( * ) .sp but for all the other ES routines: .sp DOUBLE PRECISION X( * ) .PP For all routines whose first two letters are EC: .sp COMPLEX ALPHA COMPLEX BETA COMPLEX A( LDA, * ) COMPLEX AP( * ) COMPLEX*16 Y( * ) or DOUBLE COMPLEX Y( * ) .sp except that for ECHER and ECHPR the scalar @ alpha @ is real, so that the first declaration is replaced by: .sp REAL ALPHA .sp .PP For routines ECGEMV, ECGBMV, ECHEMV, ECHBMV, ECHPMV: .sp COMPLEX X( * ) .sp but for all the other EC routines: .sp COMPLEX*16 X( * ) or DOUBLE COMPLEX X( * ) .sp .PP We intend to make a set of these routines available together with the routines of the main proposal, and the test program will include tests of these extended precision Level 2 BLAS. .PP These extended precision routines require just the COMPLEX*16 extension to ANSI Fortran 77. An equivalent set of double precision routines would require further data types to be available, and so we have not included these as part of this proposal.
However, many systems do have an extended precision real data type beyond double precision, such as REAL*16, and similarly some systems have an additional complex data type such as COMPLEX*32. Thus implementors may wish to provide real and complex double precision versions of these routines, in which case we recommend that the names of the routines again be obtained by preceding the characters representing the Fortran data type (D or Z) by the character E. We implore implementors of the double precision routines to describe clearly the arithmetic used for the internal computation. .PP We thank Velvel Kahan for insisting that we think about the extended precision issue. .bp .SH Appendix D .PP This appendix contains the calling sequences for all the proposed Level 2 BLAS. .ps 8 .cs 1 24 .nf name options dim b-width scalar matrix x-vector scalar y-vector _GEMV( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) _GBMV( TRANS, M, N, KL, KU, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) _HEMV( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) _HBMV( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) _HPMV( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY ) _SYMV( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) _SBMV( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) _SPMV( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY ) _TRMV( UPLO, TRANS, DIAG, N, A, LDA, X, INCX ) _TBMV( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX ) _TPMV( UPLO, TRANS, DIAG, N, AP, X, INCX ) _TRSV( UPLO, TRANS, DIAG, N, A, LDA, X, INCX ) _TBSV( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX ) _TPSV( UPLO, TRANS, DIAG, N, AP, X, INCX ) name options dim scalar x-vector y-vector matrix _GER_( M, N, ALPHA, X, INCX, Y, INCY, A, LDA ) _HER ( UPLO, N, ALPHA, X, INCX, A, LDA ) _HPR ( UPLO, N, ALPHA, X, INCX, AP ) _HER2( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA ) _HPR2( UPLO, N, ALPHA, X, INCX, Y, INCY, AP ) _SYR ( UPLO, N, ALPHA, X, INCX, A, LDA ) _SPR ( UPLO, N, ALPHA, X, INCX, AP ) _SYR2( UPLO, N, ALPHA, X, INCX, Y, INCY, A,
LDA ) _SPR2( UPLO, N, ALPHA, X, INCX, Y, INCY, AP ) .fi .cs 1 .
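.PP As a footnote to Appendix B, the effect of the omitted band update routines can be sketched directly. The following Python fragment is our own illustration (0-based indices, ordinary dense storage rather than the band storage scheme, and the name gbr_update is hypothetical): assuming, as in Appendix B, that the vectors cause no fill-in outside the band, the rank-1 update touches only the (@kl@+1)-by-(@ku@+1) rectangle starting at row and column @l@, which is the rectangle enclosed by the dotted lines in Fig. B.1:

```python
def gbr_update(a, kl, ku, l, alpha, x, y):
    # A <- alpha*x*y^T + A restricted to the band rectangle: only rows
    # l..l+kl and columns l..l+ku change, with x[0] and y[0]
    # corresponding to row and column l (the KM/KN bounds mirror the
    # MIN expressions used for the SGER call in Appendix B).
    m, n = len(a), len(a[0])
    km = min(kl + 1, m - l)
    kn = min(ku + 1, n - l)
    for i in range(km):
        for j in range(kn):
            a[l + i][l + j] += alpha * x[i] * y[j]
    return a
```

On band storage itself the same rectangle is addressed through SGER with the leading dimension reduced by one, as shown in Appendix B.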