.ds TH "\h'-.75m'\v'-.50m'^\v'.50m' .ds TD "\h'-.75m'\v'-.45m'~\v'.45m' .B .nr BT ''-%-'' .he '''' .pl 11i .de fO 'bp .. .wh -.5i fO .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 LAPACK Working Note #2 .sp 2 Block Reduction of Matrices to Condensed Forms for Eigenvalue Computations .ps 11 .sp 3 Jack J. Dongarra, Sven J. Hammarling, and Danny C. Sorensen .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 99 .sp .7i September 1987 .pn 0 .he ''-%-'' .he '''' .bp . .sp 10 .B .ce 1 .ps 11 ABSTRACT .R .ps 10 .sp .5i In this paper we describe block algorithms for the reduction of a real symmetric matrix to tridiagonal form and for the reduction of a general real matrix to either bidiagonal or Hessenberg form using Householder transformations. The approach is to aggregate the transformations and to apply them in a blocked fashion, thus achieving algorithms that are rich in matrix-matrix operations. These reductions to condensed form typically comprise a preliminary step in the computation of eigenvalues or singular values. With this in mind, we also demonstrate how the initial reduction to tridiagonal or bidiagonal form may be pipelined with the divide and conquer technique for computing the eigensystem of a symmetric matrix or the singular value decomposition of a general matrix to achieve algorithms which are load balanced and rich in matrix-matrix operations. .in .bp .ft 3 .ps 11 .bp .LP .LP .EQ gsize 11 delim @@ .EN .TL .ps 12 .in 0 Block Reduction of Matrices to Condensed Forms for Eigenvalue Computations .AU .ps 11 .in 0 Jack J. Dongarra**, Sven J. Hammarling, and Danny C. Sorensen* .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' Argonne, Illinois 60439\h'.20i' and Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .FS .ps 9 .vs 11p * Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. Department of Energy, under Contract W-31-109-Eng-38. ** Work supported in part by the National Science Foundation. .FE .sp 3 .QS .ps 10 .in .25i .ll -.25i .I Abstract \(em In this paper we describe block algorithms for the reduction of a real symmetric matrix to tridiagonal form and for the reduction of a general real matrix to either bidiagonal or Hessenberg form using Householder transformations. The approach is to aggregate the transformations and to apply them in a blocked fashion, thus achieving algorithms that are rich in matrix-matrix operations. These reductions to condensed form typically comprise a preliminary step in the computation of eigenvalues or singular values. With this in mind, we also demonstrate how the initial reduction to tridiagonal or bidiagonal form may be pipelined with the divide and conquer technique for computing the eigensystem of a symmetric matrix or the singular value decomposition of a general matrix to achieve algorithms which are load balanced and rich in matrix-matrix operations. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .NH Introduction .PP The key to using a high-performance computer effectively is to avoid unnecessary memory references. 
In most computers, data flows from memory into and out of registers and from registers into and out of functional units, which perform the given instructions on the data. Algorithm performance can be dominated by the amount of memory traffic rather than by the number of floating-point operations involved. The movement of data between memory and registers can be as costly as arithmetic operations on the data. This provides considerable motivation to restructure existing algorithms and to devise new algorithms that minimize data movement. .PP Along these lines there has been much activity in the past few years involving redesign of some of the basic routines in linear algebra .[[ dongarra sorensen referbug .], .[ dongarra eisenstat .], .[ dongarra kaufman hammarling .]]. A number of researchers have demonstrated the effectiveness of block algorithms on a variety of modern computer architectures with vector-processing or parallel-processing capabilities, on which potentially high performance can easily be degraded by excessive transfer of data between different levels of memory (vector registers, cache, local memory, main memory, or solid-state disks) .[[ dongarra hewitt .], .[ dongarra sorensen linear algebra on high performance computers .], .[ parallel algorithms on the cedar system .], .[ ibm engineering scientific subroutine library .], .[ bischof van loan .], .[ bucher jordan .], .[ Calahan .]]. This redesign has led to the development of algorithms that are based on matrix-vector and matrix-matrix techniques .[[ dongarra sorensen referbug .], .[ van loan bischof .]]. .PP This approach to software construction is well suited to computers with a hierarchy of memory and true parallel-processing computers. A survey that provides a description of many advanced-computer architectures may be found in .[[ dongarra duff advanced .]]. For those architectures it is often preferable to partition the matrix or matrices into blocks and to perform the computation by matrix-matrix operations on the blocks. By organizing the computation in this fashion we provide for full reuse of data while the block is held in cache or local memory. This approach avoids excessive movement of data to and from memory and gives a .I surface-to-volume .R effect for the ratio of arithmetic operations to data movement, i.e. @ O( n sup 3 ) @ arithmetic operations to @O(n sup 2 )@ data movement. In addition, on architectures that provide for parallel processing, parallelism can be exploited in two ways: (1) operations on distinct blocks may be performed in parallel; and (2) within the operations on each block, scalar or vector operations may be performed in parallel. For a description of blocked implementation for Cholesky factorization, LU decomposition, and matrix multiply and the specifications for a set of building blocks to aid the development of block algorithms see .[[ dongarra hammarling duff .]]. .PP Many of the most successful algorithms for computing eigenvalues or singular values of matrices require an initial reduction to condensed form. Typically, this condensed form is well suited to the implementation of an underlying iterative process used to compute the eigenvalues or singular values. We present block algorithms suitable for computing three different condensed forms. These are the reduction of a symmetric matrix to tridiagonal form, and the reduction of a (real) general matrix to either upper Hessenberg form or bidiagonal form. 
The reduction of a symmetric matrix to tridiagonal form dominates the computation of eigenvalues if no eigenvectors are required and represents about half the work if both eigenvalues and eigenvectors are sought. A similar remark applies to the reduction of a general matrix to bidiagonal form in preparation for the computation of singular values. When the full eigensystem or singular value decomposition is desired, divide and conquer techniques are appropriate for both of these computations, and we shall discuss how to pipeline the reduction to condensed form with a divide and conquer scheme.
.NH
The Algorithm - Reduction to Tridiagonal Form
.PP
We usually think of applying a sequence of Householder transformations to reduce the original symmetric matrix to symmetric tridiagonal form. We apply the transformations as similarity transformations to preserve the eigenvalues of the original matrix. The process can be described as follows:
.EQ I
P sub i ~=~ I ~-~ u sub i u sub i sup T , ~~~~ u sub i sup T u sub i ~=~ 2
.EN
.EQ I
T ~=~ P sub n-2 ~ ... ~ P sub 2 P sub 1 A P sub 1 P sub 2 ~... ~P sub n-2.
.EN
.bp
Each transformation @ P sub i @ is designed to introduce zeros in the @i sup th @ column (and row) of the matrix below the subdiagonal (and above the superdiagonal), so as to leave the leading part of the matrix in tridiagonal form and the trailing part full and symmetric. At the @i sup th @ step of the process, the matrix is of the form
.KS
.EQ I
left ( ~~ matrix { ccol { x above x above nothing above nothing above nothing above nothing above nothing above nothing } ccol { x above x above x above nothing above nothing above nothing above nothing above nothing } ccol { nothing above x above . above . above nothing above nothing above nothing above nothing } ccol { nothing above nothing above . above . above x above nothing above nothing above nothing } ccol { nothing above nothing above nothing above x above x above x above x above x } ccol { nothing above nothing above nothing above nothing above x above x above x above x } ccol { nothing above nothing above nothing above nothing above x above x above x above x } ccol { nothing above nothing above nothing above nothing above x above x above x above x } } ~~ right ) .
.EN
.KE
.PP
To describe the algorithmic details of this reduction, we use the notation @ A sup { ( i:n ^,^ j:n) } @ to denote the @ (n-i+1) times (n-j+1) @ submatrix of @A@ beginning at the @(i,j)@ location of @A@; we denote the subvector of a vector @a@ beginning at the @i@-th position by @ a sup {(i:n)} @; and the @i@-th component of a vector @a@ by @ a sup (i) @. The vector @ u sub i @ is constructed from the @i sup th @ column of the reduced matrix so that
.EQ I
alpha sub i ~=~ -sign( a sub i sup (i+1) ) || a sub i sup (i+1:n) || sub 2
.EN
.EQ I
u sub i sup (i+1) ~=~ sqrt( 1 ~-~ a sub i sup (i+1) / alpha sub i )
.EN
.EQ I
u sub i sup (i+2:n) ~=~ -a sub i sup (i+2:n) / ( alpha sub i u sub i sup (i+1) )
.EN
.EQ I
u sub i sup (1:i) ~=~ 0 ~,
.EN
which gives @ u sub i sup T u sub i ~=~ 2 @.
.PP
In practice @ u sub i @ is constructed and applied to the matrix as follows:
.EQ I
y sub i ~=~ A sub i u sub i
.EN
.EQ I
v sub i ~=~ y sub i ~-~ half ( y sub i sup T u sub i ) u sub i
.EN
.EQ I (2.1)
A sub i+1 ~=~ A sub i ~-~ u sub i v sub i sup T ~-~ v sub i u sub i sup T ~.
.EN
.sp
In the process @ A @ is repeatedly modified by a symmetric rank-two update. This requires updating about half of the @ (n-i) ~ times ~ (n-i) @ elements of the symmetric matrix at each stage of the process.
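.PP
To make the construction above concrete, the following is a minimal, unblocked NumPy sketch that forms @u sub i@, @y sub i@, and @v sub i@ exactly as in the formulas above and applies the rank-two update (2.1) at each step. It is only an illustration; the routine name and the small test at the end are ours, and the blocked organization developed in the rest of this section is not yet used.
.sp
.nf
import numpy as np

def tridiagonalize_unblocked(A):
    # Reduce a symmetric matrix to tridiagonal form by Householder
    # similarity transformations, using the rank-two update (2.1).
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for i in range(n - 2):
        a = A[:, i]
        sigma = np.linalg.norm(a[i + 1:])
        if sigma == 0.0:                 # column already in reduced form
            continue
        alpha = -np.copysign(sigma, a[i + 1])
        u = np.zeros(n)
        u[i + 1] = np.sqrt(1.0 - a[i + 1] / alpha)
        u[i + 2:] = -a[i + 2:] / (alpha * u[i + 1])   # so that u.T u = 2
        y = A.dot(u)                     # y_i = A_i u_i
        v = y - 0.5 * y.dot(u) * u       # v_i = y_i - (1/2)(y_i.T u_i) u_i
        A -= np.outer(u, v) + np.outer(v, u)          # update (2.1)
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((6, 6))
    A = B + B.T
    T = tridiagonalize_unblocked(A)
    assert np.allclose(np.triu(T, 2), 0.0, atol=1e-10)
    assert np.allclose(np.linalg.eigvalsh(T), np.linalg.eigvalsh(A))
.fi
.sp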
.PP
To achieve better memory utilization we can consider aggregating a sequence of transformations, say @p@ of them, so that the matrix will be updated by a rank @2p@ symmetric matrix. Such an implementation would proceed as follows: instead of explicitly updating the matrix with the rank-two change, we form only the second column (row) of @A sub 2 @, say @ a sub 2 @. We then update @ a sub 2 @ by applying (2.1) in the following way:
.EQ I
a sub 2 ~=~ a sub 2 ~-~ v sub 1 sup (2) u sub 1 ~-~ u sub 1 sup (2) v sub 1 ~.
.EN
From this we can compute @ u sub 2 @, and @y sub 2 @ would be formed as @ y sub 2 ~=~ A sub 2 u sub 2 @. However, we have not explicitly formed @ A sub 2 @. We can construct @ y sub 2 @ as follows:
.EQ I
y sub 2 mark ~=~ A sub 2 u sub 2
.EN
.EQ I
lineup ~=~ (A sub 1 ~-~ u sub 1 v sub 1 sup T ~-~ v sub 1 u sub 1 sup T ) u sub 2
.EN
.EQ I
lineup ~=~ A sub 1 u sub 2 ~-~ ( v sub 1 sup T u sub 2 ) u sub 1 ~-~ ( u sub 1 sup T u sub 2 ) v sub 1 ~.
.EN
We could then explicitly form @ A sub 3 @ as a symmetric rank-four update as follows:
.EQ I
A sub 3 mark ~=~ A sub 2 ~-~ u sub 2 v sub 2 sup T ~-~ v sub 2 u sub 2 sup T
.EN
.EQ I
lineup ~=~ A sub 1 ~-~ u sub 1 v sub 1 sup T ~-~ v sub 1 u sub 1 sup T ~-~ u sub 2 v sub 2 sup T ~-~ v sub 2 u sub 2 sup T ~.
.EN
Continuing the process, we obtain in general a rank @2p@ update:
.EQ I
A sub p+1 ~=~ A sub 1 ~-~ UV sup T ~-~ VU sup T ~,
.EN
where
.EQ I
U ~=~ (u sub 1 , ~ u sub 2 , ~ ..., ~ u sub p )
.EN
.EQ I
V ~=~ ( v sub 1 , ~ v sub 2 , ~ ..., ~ v sub p )
.EN
.EQ I
v sub p ~=~ y sub p ~-~ half (y sub p sup T u sub p ) u sub p
.EN
.EQ I
y sub p+1 ~=~ (A sub 1 ~-~ UV sup T ~-~ VU sup T ) u sub p+1
.EN
.EQ I
a sub p+1 sup (p+1:n) ~=~ a sub p+1 sup (p+1:n) ~-~ sum from i=1 to p ( v sub i sup (p+1) u sub i ~+~ u sub i sup (p+1) v sub i ) ~.
.EN
Thus, @ A sub p+1 @ can be formed by a rank @2p@ symmetric update that is rich in matrix-matrix operations.
.sp
.I
Algorithm 1
.R
.nf
@U@ and @V@ are temporary @n times p @ arrays, reused for each iteration of the @k@ loop
@n@ is the order of the matrix
@p@ is the block size
@N ~=~ (n-2)/p@
for @k ~=~ 1, ~ N @
   @s ~=~ (k-1)p ~+~ 1@
   for @ j ~=~ s, ~ s+p-1 @
      @a sub j ~=~ a sub j ~-~ sum from i=s to j-1 ( v sub i sup (j) u sub i ~+~ u sub i sup (j) v sub i )@
      @alpha sub j ~=~ -sign( a sub j sup (j+1) ) || a sub j sup (j+1:n) || sub 2 @
      @u sub j sup (j+1) ~=~ sqrt( 1 ~-~ a sub j sup (j+1) / alpha sub j )@
      @u sub j sup (j+2:n) ~=~ -a sub j sup (j+2:n) / ( alpha sub j u sub j sup (j+1) )@
      @y sub j ~=~ (A sub s ~-~ U sub j-1 {V sub j-1 } sup T ~-~ V sub j-1 {U sub j-1 } sup T ) u sub j @
      @v sub j ~=~ y sub j ~-~ half ( y sub j sup T u sub j ) u sub j @
      @U sub j ~=~ left ( { U sub j-1 , ~ u sub j } right ) @
      @V sub j ~=~ left ( { V sub j-1 , ~ v sub j } right ) @
   end
   @U ~=~ U sub s+p-1 , ~~ V ~=~ V sub s+p-1 @
   @perform~symmetric~rank~2p~update ~on~submatrix@
   @A sup (s+p:n,s+p:n) ~=~ left ( ~ A ~-~ U V sup T ~-~ V U sup T ~ right ) sup (s+p:n,s+p:n) @
end
.fi
@U @ is a lower trapezoidal matrix whose first column has its first non-zero element in position @s~+~1@ and whose @p@-th column has its first non-zero element in position @s~+~p@. Notice that to aggregate the Householder transformations during the construction of the vector @y sub j @ we perform a matrix-vector multiplication with the submatrix @A sub s @ in the @j@ loop.
.PP
At each of the @N@ stages, Algorithm 1 constructs a block transformation and applies it to the trailing part of the matrix; a sketch of this scheme is given below.
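.PP
The following NumPy sketch implements the scheme of Algorithm 1. It is an illustration rather than production code: the function name and the choice to apply the aggregated rank-@2p@ update to the whole matrix (rather than only to the trailing submatrix, which is equivalent here because the rows of @U@ and @V@ above the current block are zero) are ours.
.sp
.nf
import numpy as np

def tridiagonalize_blocked(A, p=2):
    # Blocked reduction to tridiagonal form in the spirit of Algorithm 1:
    # Householder vectors are aggregated in U and V, and one symmetric
    # rank-2p update is applied per block.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    s = 0
    while s < n - 2:
        jb = min(p, n - 2 - s)              # width of the current block
        U = np.zeros((n, 0))
        V = np.zeros((n, 0))
        for j in range(s, s + jb):
            # bring column j up to date without touching the rest of A
            a = A[:, j] - U.dot(V[j, :]) - V.dot(U[j, :])
            sigma = np.linalg.norm(a[j + 1:])
            if sigma == 0.0:
                continue
            alpha = -np.copysign(sigma, a[j + 1])
            u = np.zeros(n)
            u[j + 1] = np.sqrt(1.0 - a[j + 1] / alpha)
            u[j + 2:] = -a[j + 2:] / (alpha * u[j + 1])
            # y_j = (A_s - U V^T - V U^T) u_j, via matrix-vector products
            y = A.dot(u) - U.dot(V.T.dot(u)) - V.dot(U.T.dot(u))
            v = y - 0.5 * y.dot(u) * u
            U = np.column_stack([U, u])
            V = np.column_stack([V, v])
        A -= U.dot(V.T) + V.dot(U.T)        # rank-2p update for this block
        s += jb
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    B = rng.standard_normal((9, 9))
    A = B + B.T
    T = tridiagonalize_blocked(A, p=3)
    assert np.allclose(np.triu(T, 2), 0.0, atol=1e-10)
    assert np.allclose(np.linalg.eigvalsh(T), np.linalg.eigvalsh(A))
.fi
.sp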
.PP
We will call Algorithm 1 a
.I
``right-looking algorithm.''
.R
Notice that at the @k@-th stage we are updating a submatrix of size @n - s + 1 ~ times ~ n-s+1@. We can further reduce the amount of data referenced by the following algorithm.
.KS
.sp
.I
Algorithm 2
.R
.nf
for @k ~=~ 1, ~N @
   @s ~=~ (k-1)p ~+~ 1@
.I
   Apply the previous @k-1@ block transformations to @A sup { (s:n,s:s+p-1) } @
   Compute @ U sub k ~ and ~ V sub k @
.R
end
.KE
.PP
At each stage of this algorithm we are only modifying an @n-s+1 ~ times ~ p @ matrix. We will call this algorithm the
.I
``left-looking algorithm.''
.R
This algorithm still requires access to the submatrix @A sub s @ within the loop; however, it avoids the update of the matrix at the end of the @k@ loop.
.NH
Reduction to Hessenberg Form
.PP
Not surprisingly, the same approach can be used in the reduction to Hessenberg form. Here we have
.EQ I
H ~=~ P sub n-2 ... P sub 2 P sub 1 A P sub 1 P sub 2 ... P sub n-2 ~,
.EN
where @H@ is upper Hessenberg. The idea of using a rank 2 or higher update in this context was discussed in .[[ Dongarra Kaufman Hammarling .]]. Here it is convenient to use formulas slightly modified from those in .[[ Dongarra Kaufman Hammarling .]], given by
.EQ I
y sub i ~=~ A sub i sup T u sub i ~,~~~ z sub i ~=~ A sub i u sub i
.EN
.EQ I
v sub i ~=~ y sub i ~-~ half (z sub i sup T u sub i ) u sub i ~,~~ w sub i ~=~ z sub i ~-~ half (y sub i sup T u sub i ) u sub i
.EN
.EQ I
A sub i+1 ~=~ A sub i ~-~ u sub i v sub i sup T ~-~ w sub i u sub i sup T .
.EN
When @A@ is symmetric, @y sub i ~=~ z sub i @ and @w sub i ~=~ v sub i @, and these equations reduce to those of the tridiagonal case. The vector @ u sub i@ is computed from the same equations as in the tridiagonal case. Here @A@ is updated as
.EQ I
A sub p+1 ~=~ A sub 1 ~-~ UV sup T ~-~ WU sup T ~,
.EN
where
.EQ I
U ~=~ (u sub 1 , u sub 2 , ... , u sub p ) ~,~~ V ~=~ (v sub 1 , v sub 2 , ... , v sub p ) ~,~~ W ~=~ (w sub 1 , w sub 2 , ... , w sub p ) ~and
.EN
.EQ I
y sub p+1 ~=~ (A sub 1 sup T ~-~ VU sup T ~-~ UW sup T ) u sub p+1 ~,~~ z sub p+1 ~=~ (A sub 1 ~-~ UV sup T ~-~ WU sup T ) u sub p+1 ~.
.EN
The matrices @U@, @V@, and @Y ~=~ (y sub 1 , ... , y sub p )@ are trapezoidal, but @Z ~=~ (z sub 1 , ... , z sub p )@ and @W@ are not.
.NH
Reduction to Bidiagonal Form
.PP
A problem that is closely associated with the eigenvalue problem is to compute the singular value decomposition (SVD) of a real @m times n @ matrix @A@. This decomposition is directly related to the symmetric eigenvalue problem in that the singular values of @A@ are the square roots of the eigenvalues of the symmetric positive semidefinite matrix @A sup T A@. It is numerically preferable to avoid forming @A sup T A@, and the algorithm of choice involves an initial reduction of @A@ to upper bidiagonal form @B@ through a sequence of Householder transformations to obtain
.EQ
A ~=~ U B V sup T
.EN
with @U@ and @ V@ orthogonal and @ B @ upper bidiagonal.
.PP
This initial reduction may be treated with an algorithm similar to those already presented. In this case
.EQ I
B ~=~ P sub n-2 sup T ... P sub 2 sup T P sub 1 sup T A Q sub 1 Q sub 2 ... Q sub n-2 ~,
.EN
where each @P sub j ~=~ I ~-~ u sub j u sub j sup T @ is an @m times m@ Householder transformation and each @ Q sub j ~=~ I ~-~ v sub j v sub j sup T @ is an @n times n @ Householder transformation. Again we may achieve efficient memory utilization by aggregating a sequence of transformations, say @p@ of them, so that the matrix will be updated by a matrix of rank @2p@. However, there are data dependencies within this reduction that require additional attention; the sketch below illustrates where they arise.
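.PP
To see the dependency concretely, here is a minimal, unblocked NumPy sketch of the usual bidiagonalization: the right-hand vector @v sub j@ is formed from row @j@ of the matrix only after the left-hand transformation @P sub j@ has been applied. The helper name and the test at the end are ours; the blocked treatment follows in the text.
.sp
.nf
import numpy as np

def house(x):
    # Householder vector w with w.T w = 2 and (I - w w^T) x = (alpha, 0, ..., 0)
    sigma = np.linalg.norm(x)
    if sigma == 0.0:
        return np.zeros_like(x)
    alpha = -np.copysign(sigma, x[0])
    w = np.empty_like(x)
    w[0] = np.sqrt(1.0 - x[0] / alpha)
    w[1:] = -x[1:] / (alpha * w[0])
    return w

def bidiagonalize_unblocked(A):
    # Classical Golub-Kahan bidiagonalization, alternating left and right
    # Householder transformations.
    A = np.array(A, dtype=float)
    m, n = A.shape
    for j in range(n):
        u = np.zeros(m)
        u[j:] = house(A[j:, j])             # P_j zeros column j below the diagonal
        A -= np.outer(u, A.T.dot(u))        # A <- (I - u u^T) A
        if j < n - 2:
            v = np.zeros(n)
            v[j + 1:] = house(A[j, j + 1:]) # Q_j needs row j after P_j was applied
            A -= np.outer(A.dot(v), v)      # A <- A (I - v v^T)
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.standard_normal((7, 5))
    Bmat = bidiagonalize_unblocked(A)
    mask = np.eye(7, 5) + np.eye(7, 5, k=1)
    assert np.allclose(Bmat * (1 - mask), 0.0, atol=1e-10)
    assert np.allclose(np.linalg.svd(Bmat, compute_uv=False),
                       np.linalg.svd(A, compute_uv=False))
.fi
.sp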
.PP
Let us suppose for the moment that the sequences { @ u sub j @ } and { @ v sub j @ } can be computed at will. In general,
.EQ
( I ~-~u u sup T ) A ( I ~-~ v v sup T ) ~=~ A ~-~ u w sup T ~-~ y v sup T ~~,
.EN
where
.EQ
y ~=~ A v ~~,~~~z ~=~ A sup T u ~~ and ~~~ w ~=~ z ~-~ ( u sup T y ) v ~;
.EN
see .[[ Dongarra Kaufman Hammarling .]] for more details. Thus, a straightforward extension of the tridiagonalization scheme presented in Section 2 gives the following algorithm:
.sp2
.KS
.sp
.I
Algorithm 3
.R
.nf
@U@ and @Y@ are temporary @m times p @ arrays, reused for each iteration of the @k@ loop
@V@ and @W@ are temporary @n times p @ arrays, reused for each iteration of the @k@ loop
@n@ is the number of columns in the matrix
@m@ is the number of rows in the matrix
@p@ is the block size
@N ~=~ (n-2)/p@
for @k ~=~ 1, ~ N @
   @s ~=~ (k-1)p ~+~ 1@
   for @ j ~=~ s,~ s+p-1@
      compute @ u sub j @
      compute @ v sub j @
.sp
      @ y sub j ~=~ left (~ A ~-~ U sub j-1 W sub j-1 sup T ~-~ Y sub j-1 V sub j-1 sup T ~ right ) ~v sub j@
.sp
      @ h sub j ~=~ left (~ A ~-~ U sub j-1 W sub j-1 sup T ~-~ Y sub j-1 V sub j-1 sup T ~ right ) sup T u sub j@
.sp
      @ w sub j ~=~ h sub j ~-~ ( u sub j sup T y sub j ) v sub j @
.sp
      @ U sub j ~=~ left (~ U sub j-1 ~,~ u sub j ~right )@
.sp
      @ V sub j ~=~ left (~ V sub j-1 ~,~ v sub j ~right )@
.sp
      @ W sub j ~=~ left (~ W sub j-1 ~,~ w sub j ~right )@
.sp
      @ Y sub j ~=~ left (~ Y sub j-1 ~,~ y sub j ~right )@
   end
   @U ~=~ U sub s+p-1 , ~~ V ~=~ V sub s+p-1 @
   @Y ~=~ Y sub s+p-1 , ~~ W ~=~ W sub s+p-1 @
   @perform~~rank~2p~update ~on~submatrix@
   @A sup (s+p:m,s+p:n) ~=~ (~ A ~-~ U W sup T ~-~ Y V sup T ~) sup (s+p:m,s+p:n) @
end
.fi
.sp2
.KE
Unfortunately, it is not so straightforward to compute @ u sub j @ and @ v sub j @ at will. At step @j@ of the usual bidiagonalization process, the vector @ u sub j @ has its first nonzero element in the @j@-th entry. Hence application of the corresponding Householder transformation on the left alters the @j@-th row of the reduced matrix @ A sub j @, and knowledge of this row is required to compute the vector @ v sub j @, whose first nonzero element is in the @(j+1)@-st entry. The dependencies now become even more complicated because it would appear that the transformation corresponding to @ v sub j @ must be applied from the right before @ u sub j+1 @ can be computed, and so on. However, we note that the above algorithm will be valid if there is an independent formula for computing the @ v sub j @, since the @ u sub j @ may be computed, as in the previous algorithms, from the @j@-th column of the reduced matrix. Indeed, there is an independent formula for computing the @ v sub j @, which may be found by noting that
.EQ
V sup T A sup T A V ~=~ B sup T B ~=~ T
.EN
where @ V ~=~ Q sub 1 Q sub 2 ... Q sub n-2 ~ @ is precisely the sequence of transformations that would be computed in Algorithm 1 to reduce the matrix @ A sup T A @ to tridiagonal form @ T ~=~ B sup T B @.
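.PP
As a quick sanity check on the two-sided update used in Algorithm 3, the identity @( I ~-~ u u sup T ) A ( I ~-~ v v sup T ) ~=~ A ~-~ u w sup T ~-~ y v sup T@ can be verified numerically; the short standalone sketch below does so for random data (the scaling of @u@ and @v@ to Householder length is only for consistency with the text).
.sp
.nf
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 4
A = rng.standard_normal((m, n))

u = rng.standard_normal(m)
u *= np.sqrt(2.0) / np.linalg.norm(u)      # u.T u = 2
v = rng.standard_normal(n)
v *= np.sqrt(2.0) / np.linalg.norm(v)      # v.T v = 2

y = A.dot(v)                               # y = A v
z = A.T.dot(u)                             # z = A^T u
w = z - u.dot(y) * v                       # w = z - (u^T y) v

left = (np.eye(m) - np.outer(u, u)).dot(A).dot(np.eye(n) - np.outer(v, v))
right = A - np.outer(u, w) - np.outer(y, v)
assert np.allclose(left, right)
.fi
.sp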
.PP
The observation that the @ v sub j @ may be generated from @ A sup T A @ leads to the following procedure for computing them:
.sp2
.I
Procedure compute @v sub j@
.R
.nf
@z sub j ~=~ A sub s sup T a sub j sup { (s : s+p-1) } ~-~ sum from i=s to j-1 ( v sub i x sub i sup (j) ~+~ x sub i v sub i sup (j) )@
@zeta sub j ~=~ -sign( z sub j sup (j+1) ) || z sub j sup (j+1:n) || sub 2 @
@v sub j sup (j+1) ~=~ sqrt( 1 ~-~ z sub j sup (j+1) / zeta sub j )@
@v sub j sup (j+2:n) ~=~ -z sub j sup (j+2:n) / ( zeta sub j v sub j sup (j+1) )@
@t sub j ~=~ (A sub s sup T A sub s ~-~ V sub j-1 { X sub j-1 } sup T ~-~ X sub j-1 {V sub j-1 } sup T ) v sub j , ~ A sub s ~=~ A sup (s:m,s:n) @
@x sub j ~=~ t sub j ~-~ half ( t sub j sup T v sub j ) v sub j @
@X sub j ~=~ left ( { X sub j-1, ~ x sub j } right ) @
.fi
.sp2
Computation of the @ u sub j @ depends only upon the @j@-th column of the reduced matrix being in place before the @j@-th step. Therefore, the column-oriented formula given in Algorithm 1 may be adapted to give
.sp2
.I
Procedure compute @u sub j@
.R
.nf
@a sub j ~=~ a sub j ~-~ sum from i=s to j-1 ( u sub i w sub i sup (j) ~+~ y sub i v sub i sup (j) )@
@alpha sub j ~=~ -sign( a sub j sup (j) ) || a sub j sup (j:m) || sub 2 @
@u sub j sup (j) ~=~ sqrt( 1 ~-~ a sub j sup (j) / alpha sub j )@
@u sub j sup (j+1:m) ~=~ -a sub j sup (j+1:m) / ( alpha sub j u sub j sup (j) )@
.fi
.sp2
If these two procedures are substituted for ``compute @u sub j@'' and ``compute @v sub j@'' in Algorithm 3, then the algorithm is well defined. In these procedures we do not explicitly form the indicated matrix products; instead, the matrix-vector products are accumulated.
.NH
Relationship to the @WY@ Factorization
.PP
The algorithm presented here for aggregating Householder transformations is closely related to the @WY@ representation for the product of Householder matrices presented by Bischof and Van Loan .[[ bischof .]]. The relationship is most clearly seen in the context of the QR factorization of a general rectangular matrix. The @WY@ representation has the following form:
.KS
.sp
.I
QR Factorization (Bischof and Van Loan)
.R
.nf
@n@ is the number of columns in the matrix
@p@ is the block size
@N ~=~ n/p@
for @k ~=~ 1, ~ N @
   @s ~=~ (k-1)p ~+~ 1@
   for @ j ~=~ s, ~ s+p-1 @
      @a sub j ~=~ a sub j ~-~ sum from i=s to j-1 z sub i sup (j) u sub i ~~ where ~ z sub i ~=~ A sub i sup T u sub i @
      @compute ~ Householder ~vector ~ u sub j @
      @Y sub k sup (s:j) ~=~ (Y sub k sup (s:j-1) ~-~ u sub j u sub j sup T {Y sub k sup (s:j-1) } , ~ -2u sub j ) @
   end
   @perform~rank~p~update ~on~submatrix@
   @A sup (s+p:n,s+p:n) ~=~ A sup (s+p:n,s+p:n) ~-~ U Y sup T A sup (s+p:n,s+p:n) @
end
.fi
.KE
.sp
If one implements the reduction along the lines of the algorithm described in Section 2, the factorization can be described as follows:
.sp
.KS
.I
QR Factorization (Alternative Algorithm)
.R
.nf
@n@ is the order of the matrix
@p@ is the block size
@N ~=~ n/p@
for @k ~=~ 1, ~ N @
   @s ~=~ (k-1)p ~+~ 1@
   for @ j ~=~ s, ~ s+p-1 @
      @a sub j ~=~ a sub j ~-~ sum from i=s to j-1 v sub i sup (j) u sub i @
      @compute ~ Householder ~vector ~ u sub j @
      @v sub j ~=~ ( A sub s sup T ~-~ V sub k sup (s:j-1) U sub k sup {(s:j-1)~T } ) u sub j @
   end
   @perform~rank~p~update ~on~submatrix@
   @A sup (s+p:n,s+p:n) ~=~ left ( ~ A ~-~ U V sup T ~ right ) sup (s+p:n,s+p:n) @
end
.fi
.KE
The two block algorithms differ in the formation of @v sub j @ and @Y sub j @ and in the update of the matrix @A@; a sketch of the alternative form is given below.
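.PP
The following NumPy sketch implements a blocked Householder QR in the spirit of the alternative algorithm above: within each block the Householder vectors are aggregated in @U@ and the vectors @v sub j@ in @V@, and the matrix receives one aggregated update per block. The routine names are illustrative, and numpy.linalg.qr is used only to check the result.
.sp
.nf
import numpy as np

def house(x):
    # Householder vector w with w.T w = 2 and (I - w w^T) x = (alpha, 0, ..., 0)
    sigma = np.linalg.norm(x)
    if sigma == 0.0:
        return np.zeros_like(x)
    alpha = -np.copysign(sigma, x[0])
    w = np.empty_like(x)
    w[0] = np.sqrt(1.0 - x[0] / alpha)
    w[1:] = -x[1:] / (alpha * w[0])
    return w

def blocked_qr_r(A, p=2):
    # Returns the triangular factor R, accumulating each block of
    # Householder transformations as a single rank-p update A <- A - U V^T.
    A = np.array(A, dtype=float)
    m, n = A.shape
    for s in range(0, n, p):
        jb = min(p, n - s)
        U = np.zeros((m, 0))
        V = np.zeros((n, 0))
        for j in range(s, s + jb):
            a = A[:, j] - U.dot(V[j, :])        # column j of A - U V^T
            u = np.zeros(m)
            u[j:] = house(a[j:])
            v = A.T.dot(u) - V.dot(U.T.dot(u))  # v_j = (A - U V^T)^T u_j
            U = np.column_stack([U, u])
            V = np.column_stack([V, v])
        A -= U.dot(V.T)                         # one rank-p update per block
    return np.triu(A)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    A = rng.standard_normal((8, 6))
    R = blocked_qr_r(A, p=3)[:6]
    R_ref = np.linalg.qr(A, mode="r")
    assert np.allclose(np.abs(R), np.abs(R_ref))    # agree up to row signs
.fi
.sp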
.PP
For the Alternative Algorithm, the vector @v sub j @ is updated using the submatrix @A sub s @, whereas the Bischof and Van Loan algorithm uses information from @u sub j @ and @ Y sub k sup (s:j-1) @. Thus the Bischof and Van Loan algorithm will have fewer accesses to the data. In the update of the matrix @A@ for the Bischof and Van Loan factorization, the update is of the form
.EQ I
A sup (s+p:n,s+p:n) ~=~ A sup (s+p:n,s+p:n) ~-~ U Y sup T A sup (s+p:n,s+p:n) ~;
.EN
and for the alternative factorization, the update is of the form
.EQ I
A sup (s+p:n,s+p:n) ~=~ A sup (s+p:n,s+p:n) ~-~ U V sup T .
.EN
The alternative algorithm incorporates the information about the matrix @A@ in the matrix @V@.
.PP
We can describe the reduction to tridiagonal form for the symmetric eigenvalue problem using the Bischof and Van Loan approach as
.EQ I
A sub p+1 ~=~ (I ~-~ USU sup T ) A sub 1 ( I ~-~ USU sup T ) .
.EN
If we multiply out and combine terms, we can reduce the expression to
.EQ I
A sub p+1 ~=~ A sub 1 ~-~ ZU sup T ~-~ UZ sup T
.EN
where
.EQ I
W ~=~ A sub 1 US sup T ~ and~~ Z ~=~ W ~-~ half USU sup T W
.EN
which is of the form described in Section 2. An implementation of the reductions along these lines is described by W. Harrod at the University of Illinois .[[ Harrod private .]].
.PP
With the @WY@ representation it is simple to apply the set of transformations to another matrix, as is required in the back-transformation for the eigenvector computation; one simply applies @ (I ~-~ WY sup T ) @ to the matrix. To apply the transformations using the formulation in Section 2, one can use the Householder vectors to construct the matrix @S@ such that @ I ~-~ USU sup T @ can be used to apply the transformations to the eigenvectors of the symmetric tridiagonal matrix, back-transforming them to the eigenvectors of the original problem. The matrix @S @ is a @p@ by @p@ upper triangular matrix whose @k@-th column is formed as follows:
.EQ I
(I ~-~ USU sup T ) (I ~-~ uu sup T ) mark ~=~ I ~-~ uu sup T ~-~ USU sup T ~+~ USU sup T uu sup T
.EN
.EQ I
lineup ~=~ I~-~ [U,~u] ~ left ( ~ matrix { ccol { { SU sup T ~-~ SU sup T uu sup T } above u sup T } } ~right )
.EN
.EQ I
lineup ~=~ I~-~ [U,~u] ~ left ( ~ matrix { ccol { S above 0 } ccol { { - SU sup T u } above 1 } } right ) left ( ~ matrix { ccol { U sup T above u sup T } } right )
.EN
So the new column of @S@ is
.EQ I
left ( ~ matrix { ccol { -S U sup T u above 1 } } right ) ~.
.EN
.NH
Pipelining Reduction to Condensed Form with Determination of Eigenvalues
.PP
Recently, algorithms based upon divide and conquer strategies have been developed for the determination of eigenvalues and singular values of a matrix in condensed form .[[ dongarra sorensen fully parallel .], .[ jessup sorensen .]]. These methods are also rich in matrix-matrix operations and mesh very well with the block reductions presented here. This is accomplished by pipelining the initial reduction phase with the computation of eigenvalues and back-transformation of eigenvectors. These considerations are of little consequence on serial computers but have significant performance advantages on parallel-vector processors.
.PP
We use the block reduction algorithm described above to introduce zeros in a block of the matrix; say we are at the @k@-th stage and have just introduced zeros into the @k@-th block.
As we start the next block reduction, on the @k+1@-st block, we can start in parallel the eigenvalue computation on that part of the tridiagonal matrix generated from the @k@-th block reduction of the matrix. When we have completed eigenvalue computations from two tridiagonal segments, we can use the technique applied in the divide and conquer algorithm as described by Dongarra and Sorensen .[[ Dongarra sorensen fully eigenvalue .]] to determine the eigenvalues and eigenvectors of a pair of tridiagonal matrices. Then the eigenvalues of successive pairs of blocks can be found, then pairs of pairs, etc., until the full set is determined. When the reduction to condensed form and the divide and conquer strategy are used in this pipelined fashion a highly efficient parallel algorithm can be constructed. .PP This discussion is made more explicit in the following example. We consider a symmetric matrix that is to be partitioned into four block columns as shown in the figure below. .KS .PS 4 4 line from 2.737, 8.413 to 3.225, 7.925 line from 2.813, 8.538 to 3.338, 8.012 line from 2.725, 8.525 to 3.250, 8.012 dashwid = 0.0500i line dashed from 7.012, 8.475 to 7.188, 8.275 line dashed from 6.750, 8.488 to 7.188, 8.038 line dashed from 6.500, 8.488 to 6.988, 8.025 line dashed from 6.225, 8.512 to 6.725, 8.038 line dashed from 5.988, 8.500 to 6.488, 8.025 line dashed from 5.750, 8.512 to 6.238, 8.025 line dashed from 5.525, 8.500 to 5.975, 8.050 line dashed from 5.213, 8.288 to 5.450, 8.025 line dashed from 5.250, 8.488 to 5.713, 8.012 line dashed from 0.725, 6.738 to 0.475, 6.500 line dashed from 0.712, 6.988 to 0.262, 6.537 line dashed from 0.738, 7.250 to 0.237, 6.775 line dashed from 0.712, 7.512 to 0.237, 7.025 line dashed from 0.712, 7.750 to 0.250, 7.275 line dashed from 0.725, 7.988 to 0.250, 7.537 line dashed from 0.700, 8.238 to 0.262, 7.762 line dashed from 0.725, 8.488 to 0.237, 7.988 line dashed from 0.475, 8.512 to 0.237, 8.250 line from 3.237, 8.025 to 3.237, 6.550 line from 3.250, 8.012 to 4.700, 8.012 line from 5.250, 7.000 to 7.200, 7.000 line from 5.250, 7.512 to 7.213, 7.512 line from 5.238, 8.000 to 7.225, 8.000 line from 1.725, 8.525 to 1.725, 6.500 line from 1.237, 8.512 to 1.237, 6.500 line from 0.738, 8.538 to 0.738, 6.475 line from 0.225, 8.525 to 2.225, 8.525\ to 2.225, 6.525\ to 0.225, 6.525\ to 0.225, 8.525 line from 5.225, 8.525 to 7.225, 8.525\ to 7.225, 6.525\ to 5.225, 6.525\ to 5.225, 8.525 line from 2.725, 8.538 to 4.725, 8.538\ to 4.725, 6.537\ to 2.725, 6.537\ to 2.725, 8.537 .PE .ce @Figure ~ 1@ Partitioned Matrix .KE .sp .PP Let us associate @H sub k @ with the process of reducing the @k-th@ block @ A sub k @ of the partitioned matrix to tridiagonal form using Householder transformations. Thus @H sub k @ executes a block step of Algorithm 1 (see the k-loop) on the block @ A sub k @ In this algorithm we have the possibility of spawning parallel processes. Processes may cooperate in applying the resulting transformation shown in .EQ A sup (s+p:n,s+p:n) ~=~ A sup (s+p:n,s+p:n) ~-~ U sub k V sub k sup T ~-~ V sub k U sub k sup T .EN in parallel. Let us denote these parallel processes by @M sub kj @ so that process @M sub kj@ is responsible for a portion of the work in the matrix multiply in the performance of Algorithm 1 by process @H sub k@. .PP On completion of process @H sub k@ the tridiagonal matrix @ T sub k @ is exposed and the algorithm @TQL2@ may be applied to compute the eigensystem of @T sub k @ after the rank one tearing has been computed. 
Let us denote this process by @ E sub k @. .PP Once processes @ E sub 1@ and @ E sub 2 @ have completed, then the eigensystem of .EQ (6.1) left ( ~ matrix { ccol { T sub 1 above 0 } ccol { 0 above T sub 2 } } ~ right ) ~=~ left ( ~ matrix { ccol { Q sub 1 above 0 } ccol { 0 above Q sub 2 } } ~ right ) left (~ left ( ~ matrix { ccol { D sub 1 above 0 } ccol { 0 above D sub 2 } } ~ right ) +~ beta left ( ~ matrix { ccol { q sub 1 above q sub 2} } ~ right ) ( q sub 1 sup T ~,~ q sub 2 sup T ) ~ ~ right ) left ( ~ matrix { ccol { Q sub 1 sup T above 0 } ccol { 0 above Q sub 2 sup T } } ~ right ) .EN may be computed using the rank one updating scheme. Similarly, once processes @ E sub 3@ and @ E sub 4 @ have completed, the eigensystem of @ diag ( T sub 3 , T sub 4 ) @ may be computed. Finally the entire eigensystem may be obtained through rank one updating of these two eigensystems. Let us denote these processes as @U sub 1 ~,~ U sub 2 @ and @ U sub 0 @ respectively. .PP With the proper storage arrangements these processes obey the following large grain control flow graph: .sp2 .KS .PS 3 3 arrowht = 0.1000i arrowwid = 0.0500i line <- from 5.600, 5.475 to 4.588, 5.475 line <- from 4.088, 5.475 to 3.112, 5.475 line <- from 2.612, 5.488 to 1.638, 5.488 line <- from 3.487, 9.087 to 2.563, 8.387 line <- from 3.850, 9.087 to 4.875, 8.413 line <- from 2.213, 8.025 to 1.500, 6.975 line <- from 2.438, 7.988 to 2.775, 6.975 line <- from 4.938, 8.012 to 4.425, 6.912 line <- from 5.200, 8.025 to 5.775, 6.938 line <- from 1.362, 6.475 to 1.362, 5.713 line <- from 2.813, 6.438 to 2.800, 5.738 line <- from 4.350, 6.450 to 4.350, 5.762 line <- from 5.825, 6.488 to 5.838, 5.775 line <- from 1.388, 5.213 to 1.388, 4.188 line <- from 1.237, 5.262 to 0.538, 4.188 line <- from 1.538, 5.262 to 2.213, 4.150 "U" at 3.537, 9.337 ljust "0" at 3.700, 9.200 ljust "U" at 2.263, 8.250 ljust "1" at 2.388, 8.087 ljust "U" at 4.975, 8.275 ljust "2" at 5.125, 8.100 ljust "E" at 1.288, 6.762 ljust "1" at 1.425, 6.613 ljust "E" at 2.737, 6.762 ljust "2" at 2.900, 6.688 ljust "E" at 4.238, 6.762 ljust "3" at 4.363, 6.637 ljust "E" at 5.775, 6.738 ljust "4" at 5.887, 6.625 ljust "H" at 1.237, 5.525 ljust "1" at 1.438, 5.363 ljust "H" at 2.725, 5.525 ljust "2" at 2.912, 5.387 ljust "H" at 4.213, 5.512 ljust "3" at 4.375, 5.363 ljust "H" at 5.713, 5.500 ljust "4" at 5.887, 5.350 ljust "M" at 0.275, 4.000 ljust "11" at 0.425, 3.838 ljust "M" at 1.225, 3.987 ljust "12" at 1.325, 3.862 ljust "M" at 2.150, 3.963 ljust "13" at 2.287, 3.838 ljust ellipse at 3.662, 9.313 wid 0.500 ht 0.500 ellipse at 2.350, 8.225 wid 0.500 ht 0.500 ellipse at 2.862, 5.500 wid 0.500 ht 0.500 ellipse at 1.375, 6.713 wid 0.500 ht 0.500 ellipse at 2.850, 6.713 wid 0.500 ht 0.500 ellipse at 4.338, 6.700 wid 0.500 ht 0.500 ellipse at 5.863, 6.700 wid 0.500 ht 0.500 ellipse at 5.075, 8.238 wid 0.500 ht 0.500 ellipse at 1.375, 5.475 wid 0.500 ht 0.500 ellipse at 4.338, 5.488 wid 0.500 ht 0.500 ellipse at 5.863, 5.475 wid 0.500 ht 0.500 ellipse at 1.375, 3.950 wid 0.500 ht 0.500 ellipse at 0.438, 3.950 wid 0.500 ht 0.500 ellipse at 2.313, 3.938 wid 0.500 ht 0.500 .PE .ce @Figure ~ 2@ Large Grain Control Flow Graph .KE .sp2 In this control flow graph a @node@ denotes a process. That is, for example, a subroutine name together with the pointers to the data which the subroutine is to operate upon. A process @P@ becomes schedulable or ready to execute when there are no incoming arcs to the node representing process @P@. 
This signifies that all of the data dependencies for process @P@ have been satisfied through the completion of the processes upon which it depends. The graph indicates that processes @ M sub 1j @ can execute immediately. Once they have completed, @H sub 1@ may report to @H sub 2@, and this process may execute and spawn processes @ M sub 2j @. At the same time @ H sub 1 @ reports to process @ E sub 1 @, which may then begin execution. When @ E sub 1 @ and @ E sub 2 @ have both completed, then process @ U sub 1 @ may start, and so on.
.PP
To accumulate a matrix of eigenvectors, the successive Householder transformations must be multiplied from left to right in the order they are applied,
.EQ (6.2)
Q ~=~ PI from i=1 to n-2 left (^ I ~-~ u sub i u sub i sup T ^ right )
.EN
and we observe that when accumulated this way, successive applications of @ I ~-~ u sub i u sub i sup T @ affect only the last @n-i+1@ columns of @Q@. Thus, the Givens transformations associated with @E sub 1 @ may be applied as soon as the products of the Householder transformations associated with @H sub 1@ have been multiplied out. These transformations may be applied independently of the computation of @ H sub k @ for @ k ~>~ 1 @ because these matrices affect columns that are independent of the columns affected by @E sub i @.
.PP
When we do not wish to find eigenvectors, there is no reason to store the product @Q@ of these Householder transformations. Nor is it necessary to accumulate the product of the successive eigenvector transformations resulting from the updating problem. That is, we do not need to overwrite @Q@ with
.EQ
Q ~<-~ Q left ( ~ matrix { ccol { Q sub 1 above 0 } ccol { 0 above Q sub 2 } } ~ right ) Q hat
.EN
where @Q sub 1 @ and @Q sub 2 @ are the matrices appearing in equation (6.1), @Q hat @ is the matrix of eigenvectors of the interior matrix in (6.1), and @Q@ is the matrix appearing in (6.2) above. Instead, we may simply discard @Q@. Then the vector @q sub 1 @ may be formed as @T hat sub 1 @ is transformed to @D sub 1@ in (6.1) by accumulating the products of the transformations constructed in TQL2 that make up @Q sub 1@ against the vector @ e sub k @. If there is more than one division, then @Q sub 1 @ will have been calculated with the updating scheme. In this case we do not calculate all of @Q sub 1@ but instead use the component-wise formula for the eigenvectors to pick off the appropriate components needed to form @q sub 1 @ (for details see .[[ dongarra sorensen fully parallel .], .[ jessup sorensen .]]).
.NH
Operation Counts and Storage
.PP
An analysis of the number of floating-point operations (counting additions and multiplications) for the standard algorithm for reduction to tridiagonal form reveals an operation count of
.EQ I
4 over 3 n sup 3 ~+~ 7 over 2 n sup 2 ~+~ 1 over 6 n ~-~ 25 ~ flops.
.EN
In aggregating the transformations to perform the block reduction, additional work is required in the formation of @y sub j @ in Algorithm 1. The additional work for a block size @p@ amounts to
.EQ I
(2p ~-~ 3 over 2 ) n sup 2 ~+~ ( -2 over 3 p sup 2 ~-~ 2p ~+~ 13 over 6 )n ~+~ ( -4 over 3 p sup 2 ~-~ 4 p )
.EN
floating-point additions and multiplications.
.PP
The algorithm can be organized so that the vectors @u sub i @ overwrite the lower part of the matrix (as is done in the standard version of the software), but additional workspace of size @n times p @ is required to store the current block of @V@.
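.PP
For a rough feel for the size of this overhead, the two expressions above can be evaluated directly. The following short script simply tabulates the blocked overhead as a percentage of the standard operation count for a few values of @n@ and @p@; it assumes nothing beyond the formulas quoted above.
.sp
.nf
def standard_flops(n):
    # 4/3 n^3 + 7/2 n^2 + 1/6 n - 25
    return 4.0 / 3.0 * n**3 + 3.5 * n**2 + n / 6.0 - 25.0

def block_overhead(n, p):
    # (2p - 3/2) n^2 + (-2/3 p^2 - 2p + 13/6) n + (-4/3 p^2 - 4p)
    return ((2.0 * p - 1.5) * n**2
            + (-2.0 / 3.0 * p**2 - 2.0 * p + 13.0 / 6.0) * n
            + (-4.0 / 3.0 * p**2 - 4.0 * p))

for n in (100, 200, 500):
    for p in (4, 8, 16):
        pct = 100.0 * block_overhead(n, p) / standard_flops(n)
        print("n =", n, " p =", p, " overhead =", round(pct, 2), "percent")
.fi
.sp
Since the overhead grows like @p n sup 2 @ while the basic count grows like @n sup 3 @, the relative cost of aggregation diminishes as @n@ increases.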
.NH Experimental Results .PP The following results were generated on an Alliant FX/8 computer using eight processors. The Alliant FX/8 is a parallel processor where each of the processors has vector registers and can perform vector operations. .KS .I .ce Table 1 .ce Ratio of execution times (speedups) between the .ce EISPACK routine and the blocked version .ce on the Alliant FX/8 .R .TS center; c|c|c. Ratio Ratio Order TRED1/TREDB ORTHES/ORTHSB _ 100 1.94 2.59 200 2.39 3.01 300 2.40 3.23 400 2.39 3.35 500 2.36 3.46 .TE .KE The following results were generated on the CRAY X-MP using one processor. .KS .I .ce Table 2 .ce Ratio of execution times (speedups) between the .ce EISPACK routine and the blocked version .ce on the Cray X-MP .R .TS center; c|c|c. Ratio Ratio Order TRED1/TREDB ORTHES/ORTHSB _ 100 1.03 1.29 200 1.10 1.52 300 1.21 1.65 400 1.23 1.79 500 1.28 1.92 .TE .KE .NH References .[ $LIST$ .] .bp .B .ce LAPACK Working Notes .sp 2 .R LAPACK Working Note #1 .sp James Demmel, Jack J. Dongarra, Jeremy Du Croz, Anne Greenbaum, Sven Hammarling, and Danny Sorensen, .I ``Prospectus for the Development of a Linear Algebra Library for High-Performance Computers,'' .R Argonne National Laboratory, Mathematics and Computer Science Division, Technical Memorandum No. 97, September, 1987. .sp 2 .R LAPACK Working Note #2 .sp Jack J. Dongarra, Sven J. Hammarling, and Danny C. Sorensen .I ``Block Reduction of Matrices to Condensed Forms for Eigenvalue Computations,'' .R Argonne National Laboratory, Mathematics and Computer Science Division, Technical Memorandum No. 99, September, 1987. .