.ds TH "\h'-.75m'\v'-.50m'^\v'.50m' .ds TD "\h'-.75m'\v'-.45m'~\v'.45m' .B .nr BT ''-%-'' .he '''' .pl 11i .de fO 'bp .. .wh -.5i fO .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 A Proposal for an Extended Set of Fortran Basic Linear Algebra Subprograms .ps 11 .sp 3 Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 41 (Revision 1) .sp .7i November, 1985 .pn 0 .he ''-%-'' .he '''' .bp . .sp 10 .B .ce 1 .ps 11 ABSTRACT .R .ps 10 .sp .5i This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions proposed are targeted at matrix vector operations which should provide for more efficient and portable implementations of algorithms for high performance computers. .in .bp .ft 3 .ps 11 .bp .LP .LP .EQ gsize 11 delim @@ .EN .TL .ps 12 .in 0 A Proposal for an Extended Set of Fortran\h'.35i' .sp Basic Linear Algebra Subprograms\h'.35i' .AU .ps 11 .in 0 Jack J. Dongarra\|@size -1 {"" sup \(dg}@\h'.15i' .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' Argonne, Illinois 60439\h'.20i' .AU .ps 11 .in 0 Jeremy Du Croz\h'.20i' .AI .ps 10 .in 0 Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .AU .ps 11 .in 0 Sven Hammarling\h'.20i' .AI .ps 10 .in 0 Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .AU .ps 11 .in 0 Richard J. 
Hanson\h'.20i' .AI .ps 10 .in 0 Applied Math 2646 Sandia National Laboratory\h'.20i' Albuquerque, New Mexico 87185\h'.20i' .FS .ps 9 .vs 11p @size -1 {"" sup \(dg}@\|Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. Department of Energy, under Contract W-31-109-Eng-38. This report is being issued jointly as Argonne National Laboratory MCSD/TM-41 (Revision 1) and Numerical Algorithms Group Technical Report NP 1195. Typeset on \*(DY. .FE .sp 3 .QS .ps 10 .in .25i .ll -.25i .I Abstract \(em This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions proposed are targeted at matrix vector operations which should provide for more efficient and portable implementations of algorithms for high performance computers. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .SH Part 1: The Proposal .SH 1. Introduction .PP In 1973 Hanson, Krogh, and Lawson wrote an article in the SIGNUM Newsletter (Vol. 8, no. 4, page 16) describing the advantages of adopting a set of basic routines for problems in linear algebra. The original basic linear algebra subprograms, now commonly referred to as the BLAS and fully described in Lawson, Hanson, Kincaid, and Krogh [9,10], have been very successful and have been used in a wide range of software including LINPACK [5] and many of the algorithms published by the ACM Transactions on Mathematical Software. In particular they are an aid to clarity, portability, modularity and maintenance of software and they have become a \f2de facto \f1standard for the elementary vector operations. An excellent discussion of the .I raison d' \*^etre .R of the BLAS is given in Dodson and Lewis [2]. .PP Special versions of the BLAS, in some cases machine code versions, have been implemented on a number of computers, thus improving the efficiency of the BLAS. 
However, with some of the modern machine architectures, the use of the BLAS is not the best way to improve the efficiency of higher level codes. On vector machines, for example, one needs to optimize at least at the level of matrix-vector operations in order to approach the potential efficiency of the machine (see [3 and 4]); and the use of the BLAS inhibits this optimization because they hide the matrix-vector nature of the operations from the compiler. .PP We believe that the time is right to propose the specifications of an additional set of BLAS designed for matrix-vector operations. It has been our experience that a small set of matrix-vector operations occur frequently in the implementation of many of the most common algorithms in linear algebra. We define here the basic operations for that set, together with the naming conventions and the calling sequences. Routines at this level should provide a reasonable compromise between the sometimes conflicting aims of efficiency and modularity and it is our hope that efficient implementations will become available on a wide range of computer architectures. .PP In this paper we shall refer to the existing BLAS of Lawson et al. as ``Level 1 BLAS'' or ``Existing BLAS'', and the proposed new set as ``Level 2 BLAS'' or ``Extended BLAS''. The Level 2 BLAS involve @O(mn )@ scalar operations where @m@ and @n@ are the dimensions of the matrix involved. These could be programmed by a series of calls to the Level 1 BLAS, though we do not recommend that they be implemented in that way. Hence, in a natural sense, the Level 2 BLAS are performing basic operations at one level higher than the Level 1 BLAS. .PP We will make available a complete set of Level 2 BLAS in Fortran 77 so that software developers without access to specific implementations can make use of them. We also plan to develop a test program so that implementations of the extended BLAS can be thoroughly tested before being distributed. 
We intend to submit the test program and the Fortran 77 version of the routines for publication as an ACM algorithm. .SH 2. Scope of the Extended BLAS .PP We propose that the following three types of basic operation be performed by the extended BLAS: a) Matrix-vector products of the form .in +.5i .EQ I y ~<-~ alpha Ax ~+~ beta y ,~~ y ~<-~ alpha A ~ sup T x ~+~ beta y ,~~ and~ ~ y ~<-~ alpha A bar ~ sup T x ~+~ beta y .EN where @ alpha @ and @ beta @ are scalars, @x@ and @y@ are vectors and @A@ is a matrix, and .EQ I x ~<-~ Tx ,~ ~ x ~<-~ T ~ sup T x,~~ and~ ~ x ~<-~ T bar ~ sup T x, .EN where @x@ is a vector and @T@ is an upper or lower triangular matrix. .in -.5i b) Rank-one and rank-two updates of the form .in +.5i .EQ I A ~<-~ alpha xy ~ sup T ~+~ A ,~~ A ~<-~ alpha x y bar ~ sup T ~+~ A ,~~ H ~<-~ alpha x x bar ~ sup T ~+~ H ,~~and ~ H ~<-~ alpha x y bar ~ sup T ~+~ alpha bar y x bar ~ sup T ~+~ H , .EN where @~H~@ is a Hermitian matrix. .in -.5i c) Solution of triangular equations of the form .in +.5i .EQ I x ~<-~ T ~ sup -1 x , ~x ~<-~ T ~ sup -T x , ~~and~ ~x ~<-~ T bar ~ sup -T x , .EN where @T@ is a non-singular upper or lower triangular matrix. .in -.5i .PP We propose that, where appropriate, the operations be applied to general, general band, Hermitian, Hermitian band, triangular, and triangular band matrices in both real and complex arithmetic, and in single and double precision. .PP See Part 2 of this report for examples to illustrate the uses and advantages of the proposed routines. An example illustrating the implementation of the routines is given in Appendix A. .PP In Appendix C we propose some additional routines in which the vectors @x@ and/or @y@ have a higher precision than the matrix, so that on machines with extended precision registers the extra internal precision is not all discarded on return from the routine. 
We propose these routines as an appendix solely because it is not possible, even in single precision, to specify a complete set within the confines of ANSI Fortran 77. The only case that can be realized is where the matrix is real single precision and the vector is real double precision. .SH 3. Naming Conventions .PP The proposed name of a Level 2 BLAS is in the LINPACK style and consists of five characters, the last of which may be blank. The fourth and fifth characters in the name denote the type of operation, as follows: .KS .TS center; l l. MV - Matrix-vector product R - Rank-one update R2 - Rank-two update SV - Solve a system of linear equations .TE .KE Characters two and three in the name denote the kind of matrix involved, as follows: .KS .TS center; l l. GE General matrix GB General band matrix HE Hermitian matrix SY Symmetric matrix HP Hermitian matrix stored in packed form SP Symmetric matrix stored in packed form HB Hermitian band matrix SB Symmetric band matrix TR Triangular matrix TP Triangular matrix in packed form TB Triangular band matrix .TE .KE The first character in the name denotes the Fortran data type of the matrix, as follows: .TS center; l l. S REAL D DOUBLE PRECISION C COMPLEX Z COMPLEX*16 or DOUBLE COMPLEX (if available) .TE The proposed available combinations are indicated in the table below. In the first column, under \f2complex\f1, the initial C may be replaced by Z. In the second column, under \f2real\f1, the initial S may be replaced by D. See Appendix D for the full subroutine calling sequences. .PP The proposed collection of routines can be thought of as being divided into four separate parts, \f2complex\f1, \f2real\f1, \f2double precision\f1, and \f2complex*16\f1. Each part will be accompanied by a separate testing program. The routines proposed are written in the ANSI Fortran 77 standard with the exception of the routines that use COMPLEX*16 variables. 
These routines are included for completeness and, because the Fortran standard does not provide for this variable type, may not be available on all machines. .KS .ce Table 3.1 .TS center; c c c c c c c. \f2complex real\f1 MV R R2 SV CGE SGE * * @nothing sup \(dg @ CGB SGB * CHE SSY * * * CHP SSP * * * CHB SSB * CTR STR * * CTP STP * * CTB STB * * .TE .KE .EQ gsize 8 .EN .FS @size -1 {"" sup \(dg}@\| For the general rank-1 update (GER) we propose two complex routines: CGERC for @ A ~<-~ alpha x y bar ~ sup T ~+~ A @ and CGERU for @ A ~<-~ alpha xy ~ sup T ~+~ A @. This is the only exception to the one to one correspondence between real and complex routines. See section 7 for further discussion. .FE .EQ gsize 11 .EN .PP We do not propose routines for rank-one and rank-two updates applied to band matrices because these can be obtained by calls to the rank-one and rank-two full matrix routines. This is illustrated in Appendix B. .SH 4. Parameter Conventions .PP We propose a similar convention for the parameter lists to that for the existing BLAS, but with extensions where comparable parameters are not present in the existing BLAS. The proposed order of parameters is as follows: a) Parameters specifying options. b) Parameters defining the size of the matrix. c) Input scalar. d) Description of the input matrix. e) Description of input vector(s). f) Input scalar (associated with input-output vector). g) Description of the input-output vector. h) Description of the input-output matrix. Note that not each category is present in each of the routines. .PP The parameters that specify options are character parameters with the names TRANS, UPLO, and DIAG. TRANS is used by the matrix vector product routines as follows: .KS .TS center; l l. Value Meaning _ _ `N' Operate with the matrix. `T' Operate with the transpose of the matrix. `C' Operate with the conjugate transpose of the matrix. .TE .KE In the real case the values `T' and `C' have the same meaning. 
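.PP The naming scheme of Section 3 and the TRANS convention above can be combined in a small illustrative sketch. The following Python fragment is purely expository and not part of the proposal: the tables transcribe the conventions given above, while the function itself is our own illustration. It decodes a routine name and its TRANS argument, honouring the real-case equivalence of `T' and `C':

```python
# Illustrative sketch (not part of the proposal): decode a Level 2 BLAS name
# (<type><matrix-kind><operation>, e.g. D + GE + MV = DGEMV) together with
# the TRANS option.  For real routines 'T' and 'C' are treated identically,
# as stated in the text above.

TYPE = {"S": "REAL", "D": "DOUBLE PRECISION",
        "C": "COMPLEX", "Z": "COMPLEX*16"}
KIND = {"GE": "general", "GB": "general band", "HE": "Hermitian",
        "SY": "symmetric", "HP": "packed Hermitian", "SP": "packed symmetric",
        "HB": "Hermitian band", "SB": "symmetric band", "TR": "triangular",
        "TP": "packed triangular", "TB": "triangular band"}
OP = {"MV": "matrix-vector product", "R": "rank-one update",
      "R2": "rank-two update", "SV": "solve"}

def decode(name, trans="N"):
    """Describe a routine name such as 'DGEMV' and its TRANS argument."""
    dtype, kind, op = name[0], name[1:3], name[3:]
    if dtype in ("S", "D") and trans == "C":
        trans = "T"                 # real case: 'T' and 'C' coincide
    operand = {"N": "the matrix", "T": "the transpose",
               "C": "the conjugate transpose"}[trans]
    return "%s %s %s with %s" % (TYPE[dtype], KIND[kind], OP[op], operand)
```

The TRANS handling applies to the MV and SV routines; the update routines take no TRANS argument.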
.PP UPLO is used by the Hermitian, symmetric, and triangular matrix routines to specify whether the upper or lower triangle is being referenced as follows: .KS .TS center; l l. Value Meaning _ _ `U' Upper triangle `L' Lower triangle .TE .KE .PP DIAG is used by the triangular matrix routines to specify whether or not the matrix is unit triangular, as follows: .KS .TS center; l l. Value Meaning _ _ `U' Unit triangular `N' Non-unit triangular .TE .KE When DIAG is supplied as `U' the diagonal elements are not referenced. .PP It is worth noting that actual character arguments in Fortran may be longer than the corresponding dummy arguments, so that, for example, the value `T' for TRANS may be passed as `TRANSPOSE'. .PP The size of the matrix is determined by the two parameters M and N for an @m@ by @n@ rectangular matrix; by the parameters M, N, KL, and KU for an @m@ by @n@ band matrix with @kl@ sub-diagonals and \f2ku\f1 super-diagonals; and by the parameters N and K for an @n@ by @n@ symmetric or Hermitian or triangular band matrix with \f2k\f1 super- and/or sub-diagonals. .PP The description of the matrix consists either of the array name (A) followed by the leading dimension of the array as declared in the calling (sub) program (LDA), when the matrix is being stored in a two-dimensional array; or the Fortran array name (AP) alone when the matrix is being stored as a (packed) vector. When A is not band, then in the former case the actual array must contain at least @(m ~+~ d(n-1))@ elements, where @d@ is the leading dimension of the array and @m~=~n@ for a square matrix; and in the latter case the actual array must contain at least @n(n+1)/2@ elements. .PP The scalars always have the dummy argument names ALPHA and BETA. As with the existing BLAS the description of a vector consists of the name of the array (X or Y) followed by the storage spacing (increment) in the array of the vector elements (INCX or INCY).
The increment is allowed to be negative, zero, or positive. When the vector @x@ consists of @k@ elements, then the corresponding actual array argument X must be of length at least (1 + (@k@-1)|INCX|). .sp The following values of parameters are invalid: .nf .sp Any value of the character parameters DIAG, TRANS, or UPLO whose meaning is not specified. M < 0 for GE and GB routines. N < 0 for all routines. KL < 0 or KL < M - N for the GB routines. KU < 0 or KU < N - M for the GB routines. K < 0 for the HB, SB, and TB routines. LDA < M for the GE routines. LDA < KL + KU + 1 for the GB routines. LDA < N for the HE, SY, and TR routines. LDA < K + 1 for the HB, SB, and TB routines. .fi Note that it is permissible to call the routines with M or N = 0, in which case the routines exit immediately without referencing their vector or matrix arguments. For those routines that include the parameter BETA, when BETA is supplied as zero then the array Y need not be set on input, so that operations such as @y ~ <- ~ alpha A x @ may be performed without initially setting @y@ to zero. .SH 5. Storage Conventions .PP Unless otherwise stated it is assumed that matrices are stored conventionally in a 2-dimensional array with matrix-element @a sub ij @ stored in array-element A(I,J). .PP The routines for real symmetric and complex Hermitian matrices allow for the matrix to be stored in either the upper (UPLO = `U') or lower triangle (UPLO = `L') of a two-dimensional array, or to be packed in a one-dimensional array. In the latter case the upper triangle may be packed sequentially column by column (UPLO = `U'), or the lower triangle may be packed sequentially column by column (UPLO = `L'). Note that for real symmetric matrices packing the upper triangle by column is equivalent to packing the lower triangle by rows, and packing the lower triangle by columns is equivalent to packing the upper triangle by rows.
(For complex Hermitian matrices the only difference is that the off-diagonal elements are conjugated.) .PP For triangular matrices the parameter UPLO serves to define whether the matrix is upper .br (UPLO = `U') or lower (UPLO = `L') triangular. In the packed storage the triangle has to be packed by column. .PP The band matrix routines allow storage in the same style as with LINPACK, so that the @j sup th @ column of the matrix is stored in the @j sup th @ column of the Fortran array. For a general band matrix the diagonal of the matrix is stored in the @ku + 1 sup st @ row of the array. For a Hermitian or symmetric matrix either the upper triangle (UPLO = `U') may be stored in which case the leading diagonal is in the @k+1 sup st @ row of the array, or the lower triangle (UPLO = `L') may be stored in which case the leading diagonal is in the first row of the array. For an upper triangular band matrix (UPLO = `U') the leading diagonal is in the @k+1 sup st @ row of the array and for a lower triangular band matrix (UPLO = `L') the leading diagonal is in the first row. .PP For a Hermitian matrix the imaginary parts of the diagonal elements are of course zero and thus the imaginary parts of the corresponding Fortran array elements need not be set, but are assumed to be zero. In the R and R2 routines these imaginary parts will be set to zero on return. .PP For packed triangular matrices the same storage layout is used whether or not DIAG = `U' (diagonal elements are assumed to have the value 1), i.e. space is left for the diagonal elements even if those array elements are not referenced. .SH 6. 
Specification of the Extended BLAS .PP Type and dimension for variables occurring in the subroutine specifications are as follows: .KS INTEGER INCX, INCY, K, KL, KU, LDA, M, N CHARACTER*1 DIAG, TRANS, UPLO .KE For routines whose first letter is an S: REAL ALPHA, BETA REAL X(*), Y(*) REAL A(LDA,*) REAL AP(*) For routines whose first letter is a D: DOUBLE PRECISION ALPHA, BETA DOUBLE PRECISION X(*), Y(*) DOUBLE PRECISION A(LDA,*) DOUBLE PRECISION AP(*) For routines whose first letter is a C: COMPLEX ALPHA COMPLEX BETA COMPLEX X(*), Y(*) COMPLEX A(LDA,*) COMPLEX AP(*) except that for CHER and CHPR the scalar @alpha@ is real so that the first declaration above is replaced by: REAL ALPHA For routines whose first letter is a Z: COMPLEX*16 ALPHA or DOUBLE COMPLEX ALPHA COMPLEX*16 BETA or DOUBLE COMPLEX BETA COMPLEX*16 X(*), Y(*) or DOUBLE COMPLEX X(*), Y(*) COMPLEX*16 A(LDA,*) or DOUBLE COMPLEX A(LDA,*) COMPLEX*16 AP(*) or DOUBLE COMPLEX AP(*) except that for ZHER and ZHPR the first declaration above is replaced by: DOUBLE PRECISION ALPHA The generic names, parameter lists and specifications for the extended BLAS now follow. Refer to table 3.1 for the specific subroutine names. a) General Matrix Vector Products for a general matrix: GEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) for a general band matrix: GBMV ( TRANS, M, N, KL, KU, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) Operation: .sp if @TRANS~=~@ `N',@~~~~~y ~ <- ~ alpha Ax ~+~ beta y @ .sp if @TRANS~=~@ `T',@ ~~~~y ~ <- ~ alpha A ~ sup T x ~+~ beta y@ .sp if @TRANS~=~@ `C',@ ~~~~y ~ <- ~ alpha A bar ~ sup T x ~+~ beta y @.
.sp b) Symmetric or Hermitian Matrix Vector Products for a symmetric or Hermitian matrix: SYMV ( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) HEMV ( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) for a symmetric or Hermitian matrix in packed storage: SPMV ( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY ) HPMV ( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY ) for a symmetric or Hermitian band matrix: SBMV ( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) HBMV ( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) Operation: .EQ I y ~ <- ~ alpha Ax ~+~ beta y ~. .EN c) Triangular Matrix Vector Products for a triangular matrix: TRMV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX ) for a triangular matrix in packed storage: TPMV ( UPLO, TRANS, DIAG, N, AP, X, INCX ) for a triangular band matrix: TBMV ( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX ) Operation: .sp if @TRANS~=~@ `N', @~~~~x ~ <- ~ Tx @ .sp if @TRANS~=~@ `T', @~~~~x ~ <- ~ T ~ sup T x@ .sp if @TRANS~=~@ `C', @~~~~x ~ <- ~ T bar ~ sup T x@ .sp d) Triangular equation solvers for a triangular matrix: TRSV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX ) for a triangular matrix in packed storage: TPSV ( UPLO, TRANS, DIAG, N, AP, X, INCX ) for a triangular band matrix: TBSV ( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX ) Operation: .sp if @TRANS~=~ @ `N', @~~~~x ~ <- ~ T ~ sup -1 x@ .sp if @TRANS~=~@ `T', @~~~~x ~ <- ~ T ~ sup -T x @ .sp if @TRANS~=~@ `C', @ ~~~~x ~ <- ~ T bar ~ sup -T x ~ @. .sp e) General Rank-1 Updates: for a general matrix: GER_ ( M, N, ALPHA, X, INCX, Y, INCY, A, LDA ) for real matrices: SGER performs the operation @ A ~<-~ alpha x y ~ sup T ~+~ A @. for complex matrices: CGERC performs the operation @ A ~<-~ alpha x y bar ~ sup T ~+~ A @ and CGERU performs the operation @ A ~<-~ alpha x y ~ sup T ~+~ A @. 
f) Symmetric or Hermitian Rank-1 Updates: for a symmetric or Hermitian matrix: SYR ( UPLO, N, ALPHA, X, INCX, A, LDA ) HER ( UPLO, N, ALPHA, X, INCX, A, LDA ) for symmetric or Hermitian matrix in packed storage: SPR ( UPLO, N, ALPHA, X, INCX, AP ) HPR ( UPLO, N, ALPHA, X, INCX, AP ) Operation: .EQ I A ~<- ~ alpha x x bar ~ sup T ~+~ A .EN for real symmetric matrices this is simply .EQ I A ~<- ~ alpha x x ~ sup T ~+~ A ~. .EN g) Symmetric or Hermitian Rank-2 Updates: for a symmetric or Hermitian matrix: SYR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA ) HER2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA ) for symmetric or Hermitian matrix in packed storage: SPR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, AP ) HPR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, AP ) Operation: .EQ I A ~<- ~ alpha x y bar ~ sup T ~+~ alpha bar y x bar ~ sup T ~+~ A .EN for real symmetric matrices this is simply .EQ I A ~<- ~ alpha x y ~ sup T ~+~ alpha y x ~ sup T ~+~ A ~. .EN .SH 7. Rationale .PP The three basic matrix-vector operations chosen (Section 2) were obvious candidates because they occur in a wide range of linear algebra applications, and they occur at the innermost level of many algorithms. The hard decision was to restrict the scope only to these operations, since there are many other potential candidates, such as matrix scaling and sequences of plane rotations. Similarly, we could have extended the scope by applying the operations to other types of matrices such as complex symmetric or augmented band matrices. We have aimed at a reasonable compromise between a much larger number of routines each performing one type of operation (e.g. @x ~<-~ L ~ sup -T x@), and a smaller number of routines with a more complicated set of options. There are in fact, in each precision, 16 real routines performing altogether 43 different operations, and 17 complex routines performing 58 different operations. 
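.PP The semantics of the most heavily used of these operations, the GEMV product of item a) above, can be pinned down by a short reference model. The following sketch is in Python purely for exposition (the proposal itself is in Fortran 77); it assumes unit increments and row-of-rows storage, and is not part of the proposed software:

```python
# Reference model (illustrative only, unit increments) for GEMV:
#   TRANS = 'N':  y <- alpha*A*x          + beta*y
#   TRANS = 'T':  y <- alpha*A**T*x       + beta*y
#   TRANS = 'C':  y <- alpha*conj(A)**T*x + beta*y
# A is held as a list of m rows of n elements.

def gemv(trans, alpha, a, x, beta, y):
    m = len(a)
    n = len(a[0]) if m else 0
    rows = range(m) if trans == "N" else range(n)
    out = []
    for i in rows:
        if trans == "N":
            t = sum(a[i][j] * x[j] for j in range(n))
        elif trans == "T":
            t = sum(a[j][i] * x[j] for j in range(m))
        elif trans == "C":
            t = sum(a[j][i].conjugate() * x[j] for j in range(m))
        else:
            raise ValueError("invalid TRANS")
        # when BETA is zero, Y need not be defined on input (Section 4)
        out.append(alpha * t if beta == 0 else alpha * t + beta * y[i])
    y[:] = out
    return y
```

Note how the BETA = 0 branch never reads @y@, mirroring the convention of Section 4.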
.PP We feel that to extend the scope further would significantly reduce the chances of having the routines implemented efficiently over a wide range of machines, because it would place too heavy a burden on implementors. On the other hand, to restrict the scope further would place too narrow a limit on the potential applications of the level 2 BLAS. .PP The parameters @ alpha @ and @ beta@ are included in the non-triangular routines to give extra flexibility, but we recommend that implementors consider special code for the cases where @ alpha ~or ~beta~=~ 1.0 @ and @ alpha ~or~ beta ~=~ -1.0 @. Similarly, as with the level 1 BLAS, we have included an increment parameter with the vectors so that a vector could, for example, be a row of a matrix. But again we recommend that implementors consider special code for the case where the increments are unity. Note that the implementation must be such that when @ beta ~ = ~ 0.0 @ then @y@ need not be defined on input. .PP As noted earlier, corresponding to the real routine SGER we propose two complex routines CGERC (for @A ~<-~ alpha x y bar ~ sup T ~+~ A@) and CGERU (for @A ~<-~ alpha x y ~ sup T ~+~ A@). Both are frequently required. An alternative would be to provide a single complex routine CGER with an option parameter; however this parameter would have become redundant in the real routine SGER. Rather than have redundant parameters, or different parameter lists for the real and complex routines, we have chosen two distinct complex routines; they are analogous to the level 1 BLAS CDOTC (@c ~<- ~ c ~+~ x bar ~ sup T y@) and CDOTU (@c ~<- ~ c ~+~ x ~ sup T y@). .PP Note that no check has been included for singularity, and near singularity, in the triangular equation solving routines. The requirements for such a test depend upon the application and so we felt that this should not be included, but should instead be performed before calling the triangular solver. 
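.PP The CGERU/CGERC distinction drawn above can be made concrete with a small sketch (Python, complex arithmetic, unit increments; illustrative only, not part of the proposal):

```python
# Illustrative sketch of the two complex rank-1 updates:
#   CGERU:  A <- alpha*x*y**T       + A   (unconjugated)
#   CGERC:  A <- alpha*x*conj(y)**T + A   (conjugated)
# A is held as a list of m rows of n elements.

def geru(alpha, x, y, a):
    for i in range(len(x)):
        for j in range(len(y)):
            a[i][j] += alpha * x[i] * y[j]
    return a

def gerc(alpha, x, y, a):
    for i in range(len(x)):
        for j in range(len(y)):
            a[i][j] += alpha * x[i] * y[j].conjugate()
    return a
```

For real data the two updates coincide, which is why the single real routine SGER suffices; the pair mirrors CDOTU and CDOTC in the Level 1 BLAS.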
.PP On certain machines, which do not use the ASCII sequence on all of their Fortran systems, lower case characters may not exist, so that the innocent looking argument \f2`t'\f1, passed through the parameter TRANS for designating a transposed matrix, is not in the Fortran character set. Some UNIVAC systems do not have a lower case representation using the `field data' character set. On the CDC NOS-2 system, a mechanism is provided for a full 128 ASCII character set by using pairs of 6-bit host characters for certain 7-bit ASCII characters. This means that there is a `2 for 1' physical extension of the logical records that contain lower case letters. This fact can hamper portability of codes written on ASCII machines that are later moved to CDC systems. The only safe way to proceed is to convert the transported text entirely into the Fortran character set. On the other hand we believe that users on ASCII character set systems may wish to treat upper and lower case letters as equivalent in meaning. If this is done, it means that text that will be transported to machines of unknown types must have the ASCII set mapped into the Fortran character set before the text is moved. .PP The band storage scheme used by the GB, HB, SB, and TB routines has columns of the matrix stored in columns of the array, and diagonals of the matrix stored in rows of the array. This is the storage scheme used by LINPACK. An alternative scheme (used in some EISPACK [8,11] routines) has rows of the matrix stored in rows of the array, and diagonals of the matrix stored in columns of the array. The latter scheme has the advantage that a band matrix-vector product of the form @ y ~<- ~ alpha Ax ~+~ beta y @ can be computed using long vectors (the diagonals of the matrix) stored in contiguous elements, and hence is much more efficient on some machines (e.g. CDC Cyber 205) than the first scheme. 
However other computations involving band matrices, such as @x ~<-~ Tx@, @x ~<-~ T ~ sup -1 x@ and @LU@ and @U ~ sup T U @ factorization, cannot be organized `by diagonals'; instead the computation sweeps along the band, and the LINPACK storage scheme has the advantage of reducing the number of page swaps and allowing contiguous vectors (the columns of the matrix) to be used. .PP Although not discussed here, we plan to provide a portable testing package for each of the four parts of this extended BLAS package. The test package will be self-contained; generating test data and checking for conformity with this proposal in a portable fashion. .PP We considered the possibility of generalizing the rank-1 and rank-2 updates to rank-k updates. Rank-k updates with @k ~>~ 1@ (but @k ~<<~n@) can achieve significantly better performance on some machines than rank-1 [7]. But to take advantage of this usually requires complicating the calling algorithm; and moreover rank-k updates with @k ~ approx ~n@ would allow an even higher level operation such as matrix multiplication `in by the back door'. We prefer to keep to a clean concept of genuine matrix-vector operations. .bp .B Part 2: .SH 8. Applicability of a Set of Level 2 BLAS .PP The purpose of Part 2 is to demonstrate the wide applicability of the set of Level 2 BLAS proposed in Part 1. LINPACK and EISPACK are taken as well-known examples of heavily used software which could, with hindsight, have made extensive use of such a set of BLAS; future versions may do just that. The most straightforward way to use the Level 2 BLAS is as components in the development of new software; working out how to fit them into existing software can be more complicated, but may still be very worthwhile if it leads to substantial improvements in performance; a similar exercise has already been undertaken on selected NAG Library routines to improve their performance on vector processors [1]. 
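.PP As an aside, the LINPACK-style band storage adopted in Section 5 and discussed in Section 7, which recurs throughout what follows, can be made concrete with a short sketch. The following Python fragment is illustrative only (0-based indexing, unit increments): it packs a general band matrix and computes @y ~<-~ Ax@ from the packed form:

```python
# Illustrative sketch of LINPACK-style band storage (Section 5): element
# a[i][j] of an m-by-n band matrix with kl sub-diagonals and ku
# super-diagonals is held in AB[ku + i - j][j] (0-based), so the main
# diagonal lies in row ku of the array and matrix columns stay in columns.

def pack_band(a, kl, ku):
    m, n = len(a), len(a[0])
    ab = [[0.0] * n for _ in range(kl + ku + 1)]
    for j in range(n):
        for i in range(max(0, j - ku), min(m, j + kl + 1)):
            ab[ku + i - j][j] = a[i][j]
    return ab

def gbmv_n(m, n, kl, ku, ab, x):
    """y <- A*x reading only the stored band (TRANS = 'N', alpha=1, beta=0)."""
    y = [0.0] * m
    for j in range(n):
        for i in range(max(0, j - ku), min(m, j + kl + 1)):
            y[i] += ab[ku + i - j][j] * x[j]
    return y
```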
.PP Sections 9 and 10 demonstrate in some detail how calls to individual Level 2 BLAS can be used to perform the bulk of the computation in the CPO-, CPP- and CPB- sets of LINPACK routines. Section 11 surveys the broad applicability of the entire set of Level 2 BLAS in LINPACK and EISPACK. .SH 9. Example: the CPO- Routines in LINPACK .PP The routine CPOFA performs a @ U bar ~ sup T U @ factorization of a Hermitian positive-definite matrix. The executable code for this routine is: .ps 10 .sp 2 .cs 1 24 .nf DO 30 J = 1, N INFO = J S = 0.0E0 JM1 = J - 1 IF (JM1 .LT. 1) GO TO 20 DO 10 K = 1, JM1 T = A(K,J) - CDOTC(K-1,A(1,K),1,A(1,J),1) T = T/A(K,K) A(K,J) = T S = S + REAL(T*CONJG(T)) 10 CONTINUE 20 CONTINUE S = REAL(A(J,J)) - S C ......EXIT IF (S .LE. 0.0E0 .OR. AIMAG(A(J,J)) .NE. 0.0E0) GO TO 40 A(J,J) = CMPLX(SQRT(S),0.0E0) 30 CONTINUE .fi .ps 12 .sp 2 .cs 1 .PP It is not obvious from inspection of the code what type of matrix-vector operation, if any, is being performed here. It helps to analyze the algorithm with the use of partitioned matrices: .EQ I left ( matrix { ccol { A sub 11 above * above * } ccol { a sub 12 above a sub 22 above * } ccol { A sub 13 above a sub 23 ~ sup T above A sub 33 } } right ) ~=~ left ( matrix { ccol { U bar sub 11 ~ sup T above u bar sub 12 ~ sup T above U bar sub 13 ~ sup T } ccol { nothing above u sub 22 above u bar sub 23 } ccol { nothing above nothing above U bar sub 33 ~ sup T } } right ) left ( matrix { ccol { U sub 11 above nothing above nothing } ccol { u sub 12 above u sub 22 above nothing } ccol { U sub 13 above u sub 23 ~ sup T above U sub 33 } } right ) .EN Equating the middle column, we obtain: .EQ I a sub 12 ~=~ U bar sub 11 ^ sup T u sub 12 .EN .EQ I a sub 22 ~=~ u bar sub 12 ^ sup T u sub 12 ~+~ u sub 22 sup 2 .EN hence: .EQ I u sub 12 ~=~ U bar sub 11 ^ sup -T a sub 12 .EN .EQ I u sub 22 ~=~ sqrt { a sub 22 ~-~ u bar sub 12 ^ sup T u sub 12 } .EN Thus it is possible to compute @U@ column-by-column, with the
elements of @U@ overwriting the upper triangle of A, using the matrix-vector operation .EQ I x ~<-~ U bar ~ sup -T x. .EN CPOFA does just this, the outermost loop index J indicating the column isolated in the partitioned matrices above. The only complication is that CPOFA combines the computation of @u sub 12 @ with the computation of @u bar sub 12 ~ sup T u sub 12 @, but this can easily be split out into a separate call of CDOTC, and the code can be rewritten to use the Level 2 BLAS CTRSV: .ps 10 .sp .cs 1 24 .nf DO 30 J = 1, N INFO = J C CALL CTRSV ('UPPER', 'CONJUGATE-TRANSPOSE', 'NON-UNIT', * J-1, A, LDA, A(1,J), 1) C S = REAL(A(J,J) - CDOTC(J-1,A(1,J),1,A(1,J),1)) C ......EXIT IF (S .LE. 0.0E0 .OR. AIMAG(A(J,J)) .NE. 0.0E0) GO TO 40 A(J,J) = CMPLX(SQRT(S),0.0E0) 30 CONTINUE .fi .ps 12 .sp .cs 1 .PP Note that this code takes advantage of the convention that CTRSV performs no computation when called with its @4 sup th @ parameter equal to zero. .PP The routine CPOSL, which solves @Ax ~=~ b@ after @A@ has been factorized by CPOFA, does nothing more than compute .EQ I y ~<-~ U bar ~ sup -T b .EN .EQ I x ~<-~ U ~ sup -1 y .EN and it is straightforward to recode it to call CTRSV. .PP The computation of @A ~ sup -1 @ in CPODI proceeds in two stages. In the first, @U@ is inverted in place: at the @k sup th @ stage, the algorithm performs a rank-1 update of the rectangular submatrix in A(1:K-1, K+1:N), which can be implemented by a call to CGERU. (Incidentally page 3.11 of the LINPACK Users' Guide describes a different organization of this computation from that which is actually performed by the code.) Finally @A ~ sup -1 @ is computed by forming the product @U ~ sup -1 U bar ~ sup -T @: the analysis on pages 3.11 and 3.12 of the LINPACK Users' Guide (using partitioned matrices) shows that the computation involves a rank-1 update of the Hermitian leading submatrix of @A@, i.e. a call to CHER.
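.PP The column-by-column factorization described above can be paraphrased in a few lines of Python (illustrative only; the real CPOFA works in place, in complex arithmetic, and with error handling). Each column requires one conjugate-transposed triangular solve, which is exactly the step CTRSV performs:

```python
# Illustrative paraphrase of the CPOFA recurrence:
#   u12 = conj(U11)**-T * a12              (one triangular solve per column)
#   u22 = sqrt(a22 - conj(u12)**T * u12)
# a is Hermitian positive-definite (list of rows); returns the upper
# factor U with conj(U)**T * U = A.

import math

def chol_upper(a):
    n = len(a)
    u = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # forward substitution with the lower triangle conj(U11)**T
        for i in range(j):
            s = a[i][j] - sum(u[k][i].conjugate() * u[k][j] for k in range(i))
            u[i][j] = s / u[i][i].conjugate()
        # the diagonal of a Hermitian matrix is real, hence .real
        d = (a[j][j] - sum(abs(u[k][j]) ** 2 for k in range(j))).real
        u[j][j] = math.sqrt(d)  # d <= 0 would signal a non-positive-definite A
    return u
```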
.PP The revised form of the relevant sections of code is: .ps 10 .sp 2 .cs 1 24 .nf DO 100 K = 1, N A(K,K) = (1.0E0,0.0E0)/A(K,K) T = -A(K,K) CALL CSCAL(K-1,T,A(1,K),1) C CALL CGERU (K-1, N-K, (+1.0,0.0), A(1,K), 1, * A(K,MIN(K+1,N)), LDA, A(1,MIN(K+1,N)), LDA ) C 100 CONTINUE C DO 130 J = 1, N C CALL CHER ('UPPER', J-1, +1.0, A(1,J), 1, A, LDA) C T = CONJG(A(J,J)) CALL CSCAL(J,T,A(1,J),1) 130 CONTINUE .fi .ps 12 .sp 2 .cs 1 .PP The remaining routine in the CPO- set is the condition estimator CPOCO. Although mathematically the algorithm involves the operations @x ~<-~ U ~ sup -1 x @ and @x ~<-~ U bar ~ sup -T x@, calls to CTRSV cannot simply be substituted for sections of the existing code because of the careful scaling that is done during the solution of these triangular systems. A deeper analysis is required to determine whether the scaling could be separated from the solution. .SH 10. Extensions: the CPP- and CPB- Routines in LINPACK .PP The CPP- set of LINPACK routines differs from the CPO- set only in the way that the matrices are stored: upper triangular matrices are packed by columns into a one-dimensional array. In this storage scheme, the leading upper triangular submatrices are stored in leading subsections of the array, but no other useful submatrices are held in convenient subsections of the array that could be passed as parameters to Level 2 BLAS. In CPOFA, and in the formation of @U ~ sup -1 U bar ~ sup -T @ in CPODI, it is the leading upper triangle that is passed as a parameter to CTRSV and CHER (in the example in Section 9). Hence by direct analogy, calls to CTPSV and CHPR can be inserted into CPPFA and CPPDI. However, in the computation of @U ~ sup -1 @, the rectangular submatrix passed to CGERU by CPODI is not mapped onto a rectangular subsection of the packed array, so CGERU cannot be used in this part of CPPDI. Fortunately an alternative way of organizing the computation - that which is actually described in the LINPACK Users' Guide!
- involves the operation @x ~<-~ Ux@, and so the overall result can be reproduced by using a call to CTPMV. .PP The routine CPBFA, which is the analogue of CPOFA for banded matrices, differs from CPOFA both in the way the elements are stored and in the fact that computations corresponding to elements outside the band are omitted. The computation is illustrated in the diagram below, in which the @6 sup th @ column is partitioned off as in Section 9: .KS .ps 10 .sp 2 .cs 1 24 .nf 11 12 13 14 0 0 0 * 22 23 24 25 0 0 * * 33 34 35 36 0 * * * 44 45 46 47 0 * * * 55 56 57 0 0 * * * 66 67 0 0 0 * * * 77 .fi .ps 12 .KE .sp 2 .cs 1 .PP Instead of using the whole of the leading 5-by-5 upper triangle in the operation @x ~<-~ U x@ to compute the 6th column of the result, we can take advantage of the leading zeros in column 6 so that only the upper triangular submatrix in rows and columns 3 to 5 is required. This is mapped onto the banded array storage thus: .ps 10 .sp 2 .KS .cs 1 24 .nf 14 25 36 47 13 24 35 46 57 12 23 34 45 56 67 11 22 33 44 55 66 77 .fi .ps 12 .KE .sp .cs 1 .PP It can be passed as a parameter to the Level 2 BLAS CTRSV by using the device described in Appendix B: the parameter LDA passed to CTRSV must be one less than the leading dimension declared in the calling program (with care taken to avoid array bound errors). .PP Of course, the routines CPPSL and CPBSL can be implemented via straightforward calls to CTPSV and CTBSV. .SH 11. Scope of Application in LINPACK and EISPACK .PP Table 1 shows, without any claim to completeness, where calls to the proposed set of Level 2 BLAS might be incorporated into LINPACK or EISPACK routines (only the real single precision set has been considered, and EISPACK routines which operate on complex matrices have been omitted). Many of the calls to SGEMV and SGER arise in the application of a Householder transformation to a matrix.
The computation .EQ I X ~<-~ ( I ~-~ alpha u u ~ sup T )X ~=~ X ~-~ alpha u( X ~ sup T u ) ~ sup T .EN can be performed by the following operations using the Level 2 BLAS in conjunction with a work vector @w@: .EQ I w ~<-~ X ~ sup T u ~~~~~~~~~~~~~~~~~~~~~~ (SGEMV) .EN .EQ I X ~<-~ - alpha u w ~ sup T ~+~ X ~~~~~~~~~~~~~~~ (SGER) .EN Some latitude has been allowed in drawing up Table 1, in that a suitable work vector may not be available in the existing code. Similarly, to make use of STRSV in SGESL would require the interchanges to be handled differently by the SGE- set of LINPACK routines. .PP Furthermore, many of the algorithms listed in Table 1 can be organized in different ways, as was pointed out by Dongarra, Gustavson, and Karp [6] for matrix multiplication and LU-factorization. Following their convention, in which the basic operation in the innermost loop is something like @a sub ij ~<-~ a sub ij ~ +- ~ a sub ik a sub kj @: - with @i@ as the outermost loop index, the result is computed row by row; - with @j@ as the outermost loop index, the result is computed column by column; - with @k@ as the outermost loop index, the computation proceeds by updating a large submatrix and (in most algorithms) reducing the dimension of the problem by one at each stage. .sp .KS .ce Table 1 .TS center; l l l|l l l. BLAS LINPACK EISPACK BLAS LINPACK EISPACK _ _ _ _ _ _ SGEMV SGEDI ORTHES SSPMV - TRED3 SQRDC ORTBAK SSVDC ORTRAN SSPR SPPDI - ELMHES SSPFA QZHES TRED2 SSPR2 - TRED3 TRBAK1 TRBAK3 STRMV - ELMBAK REDUC STPMV SPPDI - SGER SPODI ORTHES SGEFA ORTBAK STRSV SGESL ELMHES SGEDI ORTRAN SPOFA REDUC SGBFA QZHES SPBFA REBAK SQRDC TRED2 SPOSL SSVDC TRBAK1 TRBAK3 STPSV SPPSL - SPPFA SSYMV - TRED1 TRED2 STBSV SGBSL - SPBSL SSYR SCHDC - SPODI SSIFA SSYR2 - TRED1 TRED2 .TE .KE .ps 12 .sp 2 .cs 1 .PP In Fortran code, in order to avoid paging problems, the @j@-form is usually preferred to the @i@-form, but the @i@-form would be preferred if matrices were stored by rows.
The @i@-form and @j@-form can often permit GAXPY operations [6] in their innermost loops, and so are likely to be favorable on a CRAY-1, whereas the @k@-form (which can work either by row or column) would be preferable on machines which allowed the individual AXPY operations of a rank-1 update to be performed in parallel. Each of the three methods of organizing an algorithm can involve calls to different Level 2 BLAS. For example, for LU-factorization, - the @i@-form requires STRSV (@x ~<-~ U ~ sup -T x@) and SGEMV (@y ~<-~ alpha A ~ sup T x ~+~ beta y@) - the @j@-form requires STRSV (@x ~<-~ L ~ sup -1 x@) and SGEMV (@y ~<-~ alpha Ax ~+~ beta y @) - the @k@-form requires SGER (@ A ~<-~ alpha xy ~ sup T ~+~ A@) .PP In all the algorithms studied which involve symmetric matrices, it has proved possible to find a form which permits the use of packed storage in conjunction with the Level 2 BLAS. Consider as an example the computation of @L ~ sup -1 A L ~ sup -T @ for symmetric @A@ using packed storage (i.e. a "packed" version of the EISPACK routine REDUC). If the lower triangles of @A@ and @L@ are packed by columns, the computation can be implemented by calls to SSPR2 (@A ~<-~ alpha xy ~ sup T ~+~ alpha yx ~ sup T ~+~ A@) and STPSV (@x ~<-~ L ~ sup -1 x@); if the lower triangles are packed by rows, this is equivalent to computing @U ~ sup -T A U ~ sup -1 @ with the upper triangles of @A@ and @U@ packed by columns, and can be implemented by calls to STPSV (@x ~<-~ U ~ sup -1 x@) and SSPMV (@y ~<-~ alpha Ax ~+~ y@ for symmetric @A@). .PP A further degree of variation is introduced if for any reason we wish to perform the basic triangular factorizations 'backwards' (i.e. as @UL@, @L ~ sup T L@ or @U U ~ sup T @ ). .PP Allowing for all these possibilities within a small set of fundamental algorithms yields applications for each of the proposed Level 2 BLAS, except the banded matrix-vector products.
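.PP Two of the building blocks catalogued in this section can be made concrete with a short sketch. The following Python fragment is our own illustration (0-based indexing, no pivoting, and the function names are ours): the first routine is the two-call Householder application, with the SGEMV step forming w = X^T u and the SGER step applying the rank-1 correction; the second is the @k@-form LU factorization, whose inner update is exactly one SGER call per step:

```python
def house_apply(x_mat, u, alpha):
    # X <- (I - alpha*u*u^T)*X, done as w <- X^T u (the SGEMV step),
    # then X <- X - alpha*u*w^T (the SGER step).
    m, n = len(x_mat), len(x_mat[0])
    w = [sum(x_mat[i][j] * u[i] for i in range(m)) for j in range(n)]
    for i in range(m):
        for j in range(n):
            x_mat[i][j] -= alpha * u[i] * w[j]
    return x_mat

def lu_k_form(a):
    # k-form LU factorization (no pivoting, purely illustrative): at
    # step k the trailing submatrix receives the rank-1 update
    #   A[k+1:, k+1:] <- A[k+1:, k+1:] - A[k+1:, k] * A[k, k+1:],
    # i.e. one SGER call; L (unit diagonal) and U overwrite A.
    n = len(a)
    for k in range(n):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]          # column k of L
        for i in range(k + 1, n):       # the rank-1 (SGER) update
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a
```

In the @i@- and @j@-forms the same factorization would instead be driven by triangular solves and matrix-vector products, as listed above.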
We consider it important that the set of extended BLAS be sufficiently wide to permit this degree of variation and not to constrain the freedom of software developers. .PP These remarks have concentrated on algorithms for dense matrices with a simple structure. It is hoped that the Level 2 BLAS may also prove useful when dealing with matrices with more complicated structures, if this can be done by splitting them into dense sub-matrices. .SH 12. Acknowledgements .PP An earlier draft of this proposal was discussed at the Parvec IV Workshop organized by John Rice at Purdue University on October 29-30, 1984, and at the SIAM Conference on Applied Linear Algebra in Raleigh, April 29-May 2, 1985. We wish to thank all the participants at the workshop and meeting for their comments, discussions, and encouragement, as well as the many people who have sent us comments separately. .bp .SH 13. References .sp .IP [1] C. Daly and J.J. Du Croz, ``Performance of a Subroutine Library on Vector-Processing Machines,'' .I Computer Physics Communications, .R Vol. 37, (1985), 181-186. .sp .IP [2] D.S. Dodson and J.G. Lewis, ``Issues relating to extension of the Basic Linear Algebra Subprograms,'' .I ACM SIGNUM Newsletter, .R Vol. 20, No. 1, (1985), 2-18. .sp .IP [3] .R J.J. Dongarra and S.C. Eisenstat, ``Squeezing the Most out of an Algorithm in CRAY Fortran,'' .I ACM Transactions on Mathematical Software, .R Vol. 10, No. 3, (1984), 221-230. .sp .IP [4] J.J. Dongarra, .I Increasing the Performance of Mathematical Software through High-Level Modularity. .R Proceedings of the Sixth International Symposium on Computing Methods in Engineering and Applied Sciences (Versailles, France), North-Holland, (1984), pp. 239-248. .sp .IP [5] J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart, .I LINPACK Users' Guide, .R SIAM Publications, Philadelphia, 1979. .sp .IP [6] J.J. Dongarra, F.G. Gustavson, and A. Karp, ``Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine,''
.I SIAM Review, .R Vol. 26, (1984), 91-112. .sp .IP [7] .R J.J. Dongarra, L. Kaufman, and S. Hammarling, .I Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers, .R Argonne National Laboratory Report ANL MCS-TM 46, (January 1985), to appear in Linear Algebra and Its Applications. .sp .IP [8] B.S. Garbow, J.M. Boyle, J.J. Dongarra, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide Extension, .R Lecture Notes in Computer Science, Vol. 51, Springer-Verlag, Berlin, 1977. .sp .IP [9] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, ``Basic Linear Algebra Subprograms for Fortran Usage,'' .I ACM Transactions on Mathematical Software, .R Vol. 5, (1979), 308-323. .sp .IP [10] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, ``Algorithm 539: Basic Linear Algebra Subprograms for Fortran Usage,'' .I ACM Transactions on Mathematical Software, .R Vol. 5, (1979), 324-325. .sp .IP [11] B.T. Smith, J.M. Boyle, J.J. Dongarra, B.S. Garbow, Y. Ikebe, V.C. Klema, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide, .R Lecture Notes in Computer Science, Vol. 6, 2nd edition, Springer-Verlag, Berlin, 1976. .bp .SH Appendix A .PP This appendix gives an example to illustrate three possible implementations, within Fortran, of the proposed extended BLAS. The example considers three versions of the routine SGEMV: the first, a straightforward version that is likely to be suitable for many conventional serial paged or non-paged machines; the second, a tuned version with unrolled loops, suitable for vector machines such as the CRAY-1; the third, a version with accumulated inner products, aimed at short word length machines.
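.PP Before the Fortran texts, it may help to state the intended semantics compactly. The following Python fragment is our own 0-based sketch (not part of the proposal) of what all three versions below compute, including the convention that a negative increment runs through the vector backwards:

```python
def sgemv(trans, m, n, alpha, a, x, incx, beta, y, incy):
    # y := alpha*op(A)*x + beta*y, with op(A) = A for TRANS = 'N'/'n'
    # and op(A) = A^T otherwise; a is an m-by-n list of rows, and kx/ky
    # are the BLAS starting points for non-unit or negative increments.
    lenx, leny = (n, m) if trans in 'Nn' else (m, n)
    kx = 0 if incx > 0 else -(lenx - 1) * incx
    ky = 0 if incy > 0 else -(leny - 1) * incy
    for i in range(leny):                      # y := beta*y
        y[ky + i * incy] *= beta
    for j in range(lenx):
        xj = x[kx + j * incx]
        for i in range(leny):
            aij = a[i][j] if trans in 'Nn' else a[j][i]
            y[ky + i * incy] += alpha * aij * xj
    return y
```

The three Fortran versions differ only in how these loops are ordered, unrolled, or accumulated, not in the result they produce.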
.ps 8 .sp 2 .cs 1 24 .nf \f2Version 1.\f1 .vs 11p SUBROUTINE SGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, $ BETA, Y, INCY ) CHARACTER*1 TRANS INTEGER M, N, LDA, INCX, INCY REAL ALPHA, A( LDA, * ), X( * ), BETA, Y( * ) * * Purpose * ======= * * SGEMV performs one of the matrix-vector operations * * y := alpha*A*x + beta*y, or y := alpha*A'*x + beta*y, * * where alpha and beta are scalars, x and y are vectors and A is an * m by n matrix. * * Parameters * ========== * * TRANS - CHARACTER*1. * On entry, TRANS specifies the operation to be performed as * follows: * * TRANS = 'N' or 'n' y := alpha*A*x + beta*y. * * TRANS = 'T' or 't' or y := alpha*A'*x + beta*y. * 'C' or 'c' * * Unchanged on exit. * * M - INTEGER. * On entry, M specifies the number of rows of the matrix A. * M must be at least zero. * Unchanged on exit. * * N - INTEGER. * On entry, N specifies the number of columns of the matrix A. * N must be at least zero. * Unchanged on exit. * * ALPHA - REAL . * On entry, ALPHA specifies the scalar alpha. * Unchanged on exit. * * A - REAL array of DIMENSION ( LDA, n ). * Before entry, the leading m by n part of the array A must * contain the matrix of coefficients. * Unchanged on exit. * * LDA - INTEGER. * On entry, LDA specifies the leading dimension of A as * declared in the calling (sub) program. LDA must be at least * m. * Unchanged on exit. * * X - REAL array of DIMENSION at least * ( 1 + ( n - 1 )*abs( INCX ) ) when TRANS = 'N' or 'n' * and at least * ( 1 + ( m - 1 )*abs( INCX ) ) otherwise. * Before entry, the incremented array X must contain the * vector x. * Unchanged on exit. * * INCX - INTEGER. * On entry, INCX specifies the increment for the elements of * X. * Unchanged on exit. * * BETA - REAL . * On entry, BETA specifies the scalar beta. When BETA is * supplied as zero then Y need not be set on input. * Unchanged on exit.
* * Y - REAL array of DIMENSION at least * ( 1 + ( m - 1 )*abs( INCY ) ) when TRANS = 'N' or 'n' * and at least * ( 1 + ( n - 1 )*abs( INCY ) ) otherwise. * Before entry with BETA non-zero, the incremented array Y * must contain the vector y. On exit, Y is overwritten by the * updated vector y. * * INCY - INTEGER. * On entry, INCY specifies the increment for the elements of * Y. * Unchanged on exit. * * Note that TRANS, M, N and LDA must be such that the value of the * LOGICAL variable OK in the following statement is true: * * OK = ( ( TRANS.EQ.'N' ).OR.( TRANS.EQ.'n' ).OR. * $ ( TRANS.EQ.'T' ).OR.( TRANS.EQ.'t' ).OR. * $ ( TRANS.EQ.'C' ).OR.( TRANS.EQ.'c' ) ) * $ .AND. * $ ( M.GE.0 ) * $ .AND. * $ ( N.GE.0 ) * $ .AND. * $ ( LDA.GE.M ) * * * * Level 2 Blas routine. * * This is the 'vanilla' Fortran version, which should be suitable * for most conventional paged and non-paged machines. * * -- This version written on 29-August-1985. * LOGICAL NULL , UNIT INTEGER I , IX , IY , J , JX , JY INTEGER KX , KY , LENX , LENY REAL TEMP REAL ONE , ZERO PARAMETER ( ONE = 1.0E+0, ZERO = 0.0E+0 ) * * Quick return if possible. * IF( ( M.EQ.0 ).OR.( N.EQ.0 ).OR. $ ( ( ALPHA.EQ.ZERO ).AND.( BETA.EQ.ONE ) ) ) $ RETURN * * Initialize some useful constants. * LENX and LENY are the lengths of the vectors x and y respectively. * NULL = ( TRANS.EQ.'N' ).OR.( TRANS.EQ.'n' ) UNIT = ( INCX.EQ.1 ).AND.( INCY.EQ.1 ) IF( NULL )THEN LENX = N LENY = M ELSE LENX = M LENY = N END IF * * Start the operations. In this version the elements of A are * accessed sequentially with one pass through A. * * Form y := beta*y and set up the start points in X and Y if the * increments are not both unity.
* IF( UNIT )THEN IF( BETA.NE.ONE )THEN IF( BETA.EQ.ZERO )THEN DO 10, I = 1, LENY Y( I ) = ZERO 10 CONTINUE ELSE DO 20, I = 1, LENY Y( I ) = BETA*Y( I ) 20 CONTINUE END IF END IF ELSE IF( INCX.GT.0 )THEN KX = 1 ELSE KX = 1 - ( LENX - 1 )*INCX END IF IF( INCY.GT.0 )THEN KY = 1 ELSE KY = 1 - ( LENY - 1 )*INCY END IF IF( BETA.NE.ONE )THEN IY = KY IF( BETA.EQ.ZERO )THEN DO 30, I = 1, LENY Y( IY ) = ZERO IY = IY + INCY 30 CONTINUE ELSE DO 40, I = 1, LENY Y( IY ) = BETA*Y( IY ) IY = IY + INCY 40 CONTINUE END IF END IF END IF IF( NULL )THEN * * Form y := alpha*A*x + y. * IF( UNIT )THEN DO 60, J = 1, N IF( X( J ).NE.ZERO )THEN TEMP = ALPHA*X( J ) DO 50, I = 1, M Y( I ) = Y( I ) + TEMP*A( I, J ) 50 CONTINUE END IF 60 CONTINUE ELSE JX = KX DO 80, J = 1, N IF( X( JX ).NE.ZERO )THEN TEMP = ALPHA*X( JX ) IY = KY DO 70, I = 1, M Y( IY ) = Y( IY ) + TEMP*A( I, J ) IY = IY + INCY 70 CONTINUE END IF JX = JX + INCX 80 CONTINUE END IF ELSE * * Form y := alpha*A'*x + y. * IF( UNIT )THEN DO 100, J = 1, N TEMP = ZERO DO 90, I = 1, M TEMP = TEMP + A( I, J )*X( I ) 90 CONTINUE Y( J ) = Y( J ) + ALPHA*TEMP 100 CONTINUE ELSE JY = KY DO 120, J = 1, N TEMP = ZERO IX = KX DO 110, I = 1, M TEMP = TEMP + A( I, J )*X( IX ) IX = IX + INCX 110 CONTINUE Y( JY ) = ALPHA*TEMP + Y( JY ) JY = JY + INCY 120 CONTINUE END IF END IF RETURN * * End of SGEMV . * END .vs 16p \f2Version 2.\f1 .vs 11p SUBROUTINE SGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, $ BETA, Y, INCY ) CHARACTER*1 TRANS INTEGER M, N, LDA, INCX, INCY REAL ALPHA, A( LDA, * ), X( * ), BETA, Y( * ) * * The following code illustrates a possible Fortran CRAY 1 version. * In this version the loops are only unrolled to a depth of 4, but a * depth of 16 should be considered for a 'production' version. * Special code for cases such as alpha = 1.0 and unit increments has * not been included, but this could be considered for a 'production' * version. * * -- This version written on 29-August-1985. 
* INTRINSIC MOD INTEGER I , ISTART, IX , IY , J , JSTART INTEGER JX , JY , KX , KY , LENX , LENY REAL TEMP1 , TEMP2 , TEMP3 , TEMP4 REAL ONE , ZERO PARAMETER ( ONE = 1.0E+0, ZERO = 0.0E+0 ) * * Quick return if possible. * IF( ( M.EQ.0 ).OR.( N.EQ.0 ).OR. $ ( ( ALPHA.EQ.ZERO ).AND.( BETA.EQ.ONE ) ) ) $ RETURN * * Initialize some useful constants. * LENX and LENY are the lengths of the vectors x and y respectively. * IF( ( TRANS.EQ.'N' ).OR.( TRANS.EQ.'n' ) )THEN LENX = N LENY = M ELSE LENX = M LENY = N END IF * * Start the operations. In this version the elements of A are * accessed sequentially with one pass through A. * * Set up the start points in X and Y and form y := beta*y. * IF( INCX.GT.0 )THEN KX = 1 ELSE KX = 1 - ( LENX - 1 )*INCX END IF IF( INCY.GT.0 )THEN KY = 1 ELSE KY = 1 - ( LENY - 1 )*INCY END IF IY = KY IF( BETA.EQ.ZERO )THEN DO 30, I = 1, LENY Y( IY ) = ZERO IY = IY + INCY 30 CONTINUE ELSE DO 40, I = 1, LENY Y( IY ) = BETA*Y( IY ) IY = IY + INCY 40 CONTINUE END IF IF( ( TRANS.EQ.'N' ).OR.( TRANS.EQ.'n' ) )THEN * * Form y := alpha*A*x + y. The first two loops are 'clean-up' * loops for the left over vectors. * JX = KX J = MOD( N, 2 ) IF( J.EQ.1 )THEN TEMP1 = ALPHA*X( JX ) IY = KY DO 50, I = 1, M Y( IY ) = Y( IY ) + TEMP1*A( I, J ) IY = IY + INCY 50 CONTINUE JX = JX + INCX END IF J = MOD( N, 4 ) IF( ( J.EQ.2 ).OR.( J.EQ.3 ) )THEN TEMP1 = ALPHA*X( JX ) TEMP2 = ALPHA*X( JX + INCX ) IY = KY DO 60, I = 1, M Y( IY ) = ( ( Y( IY ) ) + $ TEMP1*A( I, J - 1 ) ) + $ TEMP2*A( I, J ) IY = IY + INCY 60 CONTINUE JX = JX + 2*INCX END IF * * The main loop runs over groups of four vectors. 
* JSTART = J + 4 DO 80, J = JSTART, N, 4 TEMP1 = ALPHA*X( JX ) TEMP2 = ALPHA*X( JX + INCX ) TEMP3 = ALPHA*X( JX + 2*INCX ) TEMP4 = ALPHA*X( JX + 3*INCX ) IY = KY DO 70, I = 1, M Y( IY ) = ( ( ( ( Y( IY ) ) + $ TEMP1*A( I, J - 3 ) ) + $ TEMP2*A( I, J - 2 ) ) + $ TEMP3*A( I, J - 1 ) ) + $ TEMP4*A( I, J ) IY = IY + INCY 70 CONTINUE JX = JX + 4*INCX 80 CONTINUE ELSE * * Form y := alpha*A'*x + y. The first two loops are 'clean-up' * loops for the left over vectors. * IX = KX I = MOD( M, 2 ) IF( I.EQ.1 )THEN TEMP1 = ALPHA*X( IX ) JY = KY DO 90, J = 1, N Y( JY ) = Y( JY ) + TEMP1*A( I, J ) JY = JY + INCY 90 CONTINUE IX = IX + INCX END IF I = MOD( M, 4 ) IF( ( I.EQ.2 ).OR.( I.EQ.3 ) )THEN TEMP1 = ALPHA*X( IX ) TEMP2 = ALPHA*X( IX + INCX ) JY = KY DO 100, J = 1, N Y( JY ) = ( ( Y( JY ) ) + $ TEMP1*A( I - 1, J ) ) + $ TEMP2*A( I , J ) JY = JY + INCY 100 CONTINUE IX = IX + 2*INCX END IF * * The main loop runs over groups of four vectors. * ISTART = I + 4 DO 120, I = ISTART, M, 4 TEMP1 = ALPHA*X( IX ) TEMP2 = ALPHA*X( IX + INCX ) TEMP3 = ALPHA*X( IX + 2*INCX ) TEMP4 = ALPHA*X( IX + 3*INCX ) JY = KY DO 110, J = 1, N Y( JY ) = ( ( ( ( Y( JY ) ) + $ TEMP1*A( I - 3, J ) ) + $ TEMP2*A( I - 2, J ) ) + $ TEMP3*A( I - 1, J ) ) + $ TEMP4*A( I , J ) JY = JY + INCY 110 CONTINUE IX = IX + 4*INCX 120 CONTINUE END IF RETURN * * End of SGEMV . * END .vs 16p \f2Version 3.\f1 .vs 11p SUBROUTINE SGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, $ BETA, Y, INCY ) CHARACTER*1 TRANS INTEGER M, N, LDA, INCX, INCY REAL ALPHA, A( LDA, * ), X( * ), BETA, Y( * ) * * The following code illustrates a version for short word length * machines for which speed is not essential. Accumulated inner * products are used. * * -- This version written on 11-October-1985. * INTRINSIC DBLE INTEGER I , IX , IY , J , JX , JY INTEGER KX , KY , LENX , LENY REAL ONE , ZERO PARAMETER ( ONE = 1.0E+0, ZERO = 0.0E+0 ) DOUBLE PRECISION DTEMP * * Quick return if possible. * IF( ( M.EQ.0 ).OR.( N.EQ.0 ).OR. 
$ ( ( ALPHA.EQ.ZERO ).AND.( BETA.EQ.ONE ) ) ) $ RETURN * * LENX and LENY are the lengths of the vectors x and y respectively. * IF( ( TRANS.EQ.'N' ).OR.( TRANS.EQ.'n' ) )THEN LENX = N LENY = M ELSE LENX = M LENY = N END IF * * Start the operations. * * Form y := beta*y and set up the start points in X and Y. * IF( INCX.GT.0 )THEN KX = 1 ELSE KX = 1 - ( LENX - 1 )*INCX END IF IF( INCY.GT.0 )THEN KY = 1 ELSE KY = 1 - ( LENY - 1 )*INCY END IF IY = KY IF( BETA.EQ.ZERO )THEN DO 10, I = 1, LENY Y( IY ) = ZERO IY = IY + INCY 10 CONTINUE ELSE DO 20, I = 1, LENY Y( IY ) = BETA*Y( IY ) IY = IY + INCY 20 CONTINUE END IF IF( ( TRANS.EQ.'N' ).OR.( TRANS.EQ.'n' ) )THEN * * Form y := alpha*A*x + y. * IY = KY DO 40, I = 1, M DTEMP = ZERO JX = KX DO 30, J = 1, N DTEMP = DTEMP + A( I, J )*DBLE( X( JX ) ) JX = JX + INCX 30 CONTINUE Y( IY ) = ALPHA*DTEMP + Y( IY ) IY = IY + INCY 40 CONTINUE ELSE * * Form y := alpha*A'*x + y. * JY = KY DO 60, J = 1, N DTEMP = ZERO IX = KX DO 50, I = 1, M DTEMP = DTEMP + A( I, J )*DBLE( X( IX ) ) IX = IX + INCX 50 CONTINUE Y( JY ) = ALPHA*DTEMP + Y( JY ) JY = JY + INCY 60 CONTINUE END IF RETURN * * End of SGEMV . * END .vs 16p .fi .ps 12 .sp 2 .cs 1 .bp .SH Appendix B .PP In this appendix we illustrate how to use the full matrix update routines to obtain rank-one and rank-two updates to band matrices. We assume that the vectors x and y are such that no fill-in occurs outside the band, in which case the update affects only a full rectangle within the band matrix A. This is illustrated in Fig. B.1 for the case where @m=n=9@, @kl=2@, @ku=3@, and the update commences in row (and column) @l=3@. .cs 1 24 .nf ( * * * * ) ( 0 )( 0 0 * * * * 0 0 0 ) ( * * * * * ) ( 0 ) ( * * |* * * *| ) ( * ) ( * |* * * *| * ) ( * ) ( |* * * *| * * ) + ( * ) ( * * * * * * ) ( 0 ) ( * * * * * ) ( 0 ) ( * * * * ) ( 0 ) ( * * * ) ( 0 ) .fi .cs 1 .EQ A ~+~ xy ~ sup T .EN Fig.
B.1 .PP We see that the update affects only that part of @A@ indicated by the dotted lines, that is, the (@kl@+1) by (@ku@+1) part of @A@ starting at @a sub ll @. The routines that we have omitted are GBR, SBR, and SBR2 (in the complex case HBR and HBR2). Their parameter lists could have been GBR ( M, N, KL, KU, L, ALPHA, X, INCX, Y, INCY, A, LDA ) SBR ( UPLO, N, K, L, ALPHA, X, INCX, A, LDA ) SBR2 ( UPLO, N, K, L, ALPHA, X, INCX, Y, INCY, A, LDA ) where the parameter L denotes the starting row and column for the update, and the elements @x sub l @ and @y sub l @ of the vectors @x@ and @y@ are in elements X(1) and Y(1) of the arrays X and Y. Calls to SGBR can be achieved by KM = MIN (KL+1, M-L+1) KN = MIN (KU+1, N-L+1) CALL SGER (KM, KN, ALPHA, X, INCX, Y, INCY, A(KU+1, L), MAX(KM, LDA-1)) Calls to SSBR can be achieved by KN = MIN(K+1, N-L+1) IF (UPLO .EQ. 'U') THEN CALL SSYR('U', KN, ALPHA, X, INCX, A(K+1, L), MAX(1, LDA-1)) ELSE CALL SSYR('L', KN, ALPHA, X, INCX, A(1, L), MAX(1, LDA-1)) ENDIF and similarly for calls to SSBR2. .bp .SH Appendix C .PP In this appendix we propose an additional set of real and complex single precision Level 2 routines which allow extended precision matrix-vector operations to be performed. .PP The names of these routines are obtained by preceding the character representing the Fortran data type (S or C) by the character E. These routines are to perform the operations described in Section 2 as follows. .sp 2 .PP For the matrix-vector operations .sp .EQ I y ~<-~ alpha A x ~+~ beta y, ~~ y ~<-~ alpha A ~ sup T x ~+~ beta y , ~~ y ~<-~ alpha A bar ~ sup T x ~+~ beta y .EN @ alpha , ~ beta , ~ A,@ and @x@ are single precision, @y@ is double precision, and the computation of @y@ is to be performed in at least double precision.
.sp .PP For the triangular operations .sp .EQ I x ~<-~ T x, ~~ x ~<-~ T ~ sup T x, ~~ x ~<-~ T bar ~ sup T x, .EN .EQ I x ~<-~ T ~ sup -1 x, ~~ x ~<-~ T ~ sup -T x, ~~ x ~<-~ T bar ~ sup -T x .EN .sp @T@ is single precision, @x@ is double precision, and the computation of @x@ is to be performed in at least double precision. .PP For the rank-one and rank-two updates .sp .EQ I A ~<-~ alpha x y ~ sup T ~+~ A,~~ A ~<-~ alpha x y bar ~ sup T ~+~ A, .EN .EQ I H ~<-~ alpha x x bar ~ sup T ~+~ H, ~~ H ~<-~ alpha x y bar ~ sup T ~+~ alpha bar y x bar ~ sup T ~+~ H, .EN .sp @ alpha@, @A@, and @H@ are single precision, @x@ and @y@ are double precision, and the computation is to be performed in at least double precision. .PP The type and dimension of the corresponding Fortran variables (see Section 6) are as follows. .PP For all routines whose first two letters are ES: .sp REAL ALPHA, BETA REAL A( LDA, * ) REAL AP( * ) DOUBLE PRECISION Y( * ) .sp .PP For routines ESGEMV, ESGBMV, ESSYMV, ESSBMV, ESSPMV: .sp REAL X( * ) .sp but for all the other ES routines: .sp DOUBLE PRECISION X( * ) .PP For all routines whose first two letters are EC: .sp COMPLEX ALPHA COMPLEX BETA COMPLEX A( LDA, * ) COMPLEX AP( * ) COMPLEX*16 Y( * ) or DOUBLE COMPLEX Y( * ) .sp except that for ECHER and ECHPR the scalar @ alpha @ is real, so that the first declaration is replaced by: .sp REAL ALPHA .sp .PP For routines ECGEMV, ECGBMV, ECHEMV, ECHBMV, ECHPMV: .sp COMPLEX X( * ) .sp but for all the other EC routines: .sp COMPLEX*16 X( * ) or DOUBLE COMPLEX X( * ) .sp .PP We intend to make a set of these routines available together with the routines of the main proposal, and the test program will include tests of these extended precision Level 2 BLAS. .PP These extended precision routines require just the COMPLEX*16 extension to ANSI Fortran 77. An equivalent set of double precision routines would require further data types to be available, and so we have not included these as part of this proposal.
However, many systems do have an extended precision real data type beyond double precision, such as REAL*16, and similarly some systems have an additional complex data type such as COMPLEX*32. Thus implementors may wish to provide real and complex double precision versions of these routines, in which case we recommend that the names of the routines again be obtained by preceding the characters representing the Fortran data type (D or Z) by the character E. We implore implementors of the double precision routines to describe clearly the arithmetic used for the internal computation. .PP We thank Velvel Kahan for insisting that we think about the extended precision issue. .bp .SH Appendix D .PP This appendix contains the calling sequences for all the proposed Level 2 BLAS. .ps 8 .cs 1 24 .nf name options dim b-width scalar matrix x-vector scalar y-vector _GEMV( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) _GBMV( TRANS, M, N, KL, KU, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) _HEMV( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) _HBMV( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) _HPMV( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY ) _SYMV( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) _SBMV( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY ) _SPMV( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY ) _TRMV( UPLO, TRANS, DIAG, N, A, LDA, X, INCX ) _TBMV( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX ) _TPMV( UPLO, TRANS, DIAG, N, AP, X, INCX ) _TRSV( UPLO, TRANS, DIAG, N, A, LDA, X, INCX ) _TBSV( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX ) _TPSV( UPLO, TRANS, DIAG, N, AP, X, INCX ) name options dim scalar x-vector y-vector matrix _GER_( M, N, ALPHA, X, INCX, Y, INCY, A, LDA ) _HER ( UPLO, N, ALPHA, X, INCX, A, LDA ) _HPR ( UPLO, N, ALPHA, X, INCX, AP ) _HER2( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA ) _HPR2( UPLO, N, ALPHA, X, INCX, Y, INCY, AP ) _SYR ( UPLO, N, ALPHA, X, INCX, A, LDA ) _SPR ( UPLO, N, ALPHA, X, INCX, AP ) _SYR2( UPLO, N, ALPHA, X, INCX, Y, INCY, A,
LDA ) _SPR2( UPLO, N, ALPHA, X, INCX, Y, INCY, AP ) .fi .cs 1 .
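.PP As a footnote to Appendix B, the effect of the omitted band update routines can be sketched directly. The following Python fragment is our own illustration (0-based indices, ordinary dense storage rather than the band storage scheme, and the name gbr_update is hypothetical): assuming, as in Appendix B, that the vectors cause no fill-in outside the band, the rank-1 update touches only the (@kl@+1)-by-(@ku@+1) rectangle starting at row and column @l@, which is the rectangle enclosed by the dotted lines in Fig. B.1:

```python
def gbr_update(a, kl, ku, l, alpha, x, y):
    # A <- alpha*x*y^T + A restricted to the band rectangle: only rows
    # l..l+kl and columns l..l+ku change, with x[0] and y[0]
    # corresponding to row and column l (the KM/KN bounds mirror the
    # MIN expressions used for the SGER call in Appendix B).
    m, n = len(a), len(a[0])
    km = min(kl + 1, m - l)
    kn = min(ku + 1, n - l)
    for i in range(km):
        for j in range(kn):
            a[l + i][l + j] += alpha * x[i] * y[j]
    return a
```

On band storage itself the same rectangle is addressed through SGER with the leading dimension reduced by one, as shown in Appendix B.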