.ds TH "\h'-.75m'\v'-.50m'^\v'.50m' .ds TD "\h'-.75m'\v'-.45m'~\v'.45m' .B .nr BT ''-%-'' .he '''' .pl 11i .de fO 'bp .. .wh -.5i fO .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 LAPACK Working Note #2 .sp 2 Block Reduction of Matrices to Condensed Forms for Eigenvalue Computations .ps 11 .sp 3 Jack J. Dongarra, Sven J. Hammarling, and Danny C. Sorensen .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 99 .sp .7i September 1987 .pn 0 .he ''-%-'' .he '''' .bp . .sp 10 .B .ce 1 .ps 11 ABSTRACT .R .ps 10 .sp .5i In this paper we describe block algorithms for the reduction of a real symmetric matrix to tridiagonal form and for the reduction of a general real matrix to either bidiagonal or Hessenberg form using Householder transformations. The approach is to aggregate the transformations and to apply them in a blocked fashion, thus achieving algorithms that are rich in matrix-matrix operations. These reductions to condensed form typically comprise a preliminary step in the computation of eigenvalues or singular values. With this in mind, we also demonstrate how the initial reduction to tridiagonal or bidiagonal form may be pipelined with the divide and conquer technique for computing the eigensystem of a symmetric matrix or the singular value decomposition of a general matrix to achieve algorithms which are load balanced and rich in matrix-matrix operations. .in .bp .ft 3 .ps 11 .bp .LP .LP .EQ gsize 11 delim @@ .EN .TL .ps 12 .in 0 Block Reduction of Matrices to Condensed Forms for Eigenvalue Computations .AU .ps 11 .in 0 Jack J. Dongarra**, Sven J. Hammarling, and Danny C. Sorensen* .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' Argonne, Illinois 60439\h'.20i' and Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .FS .ps 9 .vs 11p * Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. Department of Energy, under Contract W-31-109-Eng-38. ** Work supported in part by the National Science Foundation. .FE .sp 3 .QS .ps 10 .in .25i .ll -.25i .I Abstract \(em In this paper we describe block algorithms for the reduction of a real symmetric matrix to tridiagonal form and for the reduction of a general real matrix to either bidiagonal or Hessenberg form using Householder transformations. The approach is to aggregate the transformations and to apply them in a blocked fashion, thus achieving algorithms that are rich in matrix-matrix operations. These reductions to condensed form typically comprise a preliminary step in the computation of eigenvalues or singular values. With this in mind, we also demonstrate how the initial reduction to tridiagonal or bidiagonal form may be pipelined with the divide and conquer technique for computing the eigensystem of a symmetric matrix or the singular value decomposition of a general matrix to achieve algorithms which are load balanced and rich in matrix-matrix operations. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .NH Introduction .PP The key to using a high-performance computer effectively is to avoid unnecessary memory references. 
In most computers, data flows from memory into and out of registers and from registers into and out of functional units, which perform the given instructions on the data. Algorithm performance can be dominated by the amount of memory traffic rather than by the number of floating-point operations involved. The movement of data between memory and registers can be as costly as arithmetic operations on the data. This provides considerable motivation to restructure existing algorithms and to devise new algorithms that minimize data movement. .PP Along these lines there has been much activity in the past few years involving redesign of some of the basic routines in linear algebra .[[ dongarra sorensen referbug .], .[ dongarra eisenstat .], .[ dongarra kaufman hammarling .]]. A number of researchers have demonstrated the effectiveness of block algorithms on a variety of modern computer architectures with vector-processing or parallel-processing capabilities, on which potentially high performance can easily be degraded by excessive transfer of data between different levels of memory (vector registers, cache, local memory, main memory, or solid-state disks) .[[ dongarra hewitt .], .[ dongarra sorensen linear algebra on high performance computers .], .[ parallel algorithms on the cedar system .], .[ ibm engineering scientific subroutine library .], .[ bischof van loan .], .[ bucher jordan .], .[ Calahan .]]. This redesign has led to the development of algorithms that are based on matrix-vector and matrix-matrix techniques .[[ dongarra sorensen referbug .], .[ van loan bischof .]]. .PP This approach to software construction is well suited to computers with a hierarchy of memory and true parallel-processing computers. A survey that provides a description of many advanced-computer architectures may be found in .[[ dongarra duff advanced .]]. For those architectures it is often preferable to partition the matrix or matrices into blocks and to perform the computation by matrix-matrix operations on the blocks. By organizing the computation in this fashion we provide for full reuse of data while the block is held in cache or local memory. This approach avoids excessive movement of data to and from memory and gives a .I surface-to-volume .R effect for the ratio of arithmetic operations to data movement, i.e. @ O( n sup 3 ) @ arithmetic operations to @O(n sup 2 )@ data movement. In addition, on architectures that provide for parallel processing, parallelism can be exploited in two ways: (1) operations on distinct blocks may be performed in parallel; and (2) within the operations on each block, scalar or vector operations may be performed in parallel. For a description of blocked implementation for Cholesky factorization, LU decomposition, and matrix multiply and the specifications for a set of building blocks to aid the development of block algorithms see .[[ dongarra hammarling duff .]]. .PP Many of the most successful algorithms for computing eigenvalues or singular values of matrices require an initial reduction to condensed form. Typically, this condensed form is well suited to the implementation of an underlying iterative process used to compute the eigenvalues or singular values. We present block algorithms suitable for computing three different condensed forms. These are the reduction of a symmetric matrix to tridiagonal form, and the reduction of a (real) general matrix to either upper Hessenberg form or bidiagonal form. 
The reduction of a symmetric matrix to tridiagonal form dominates the computation of eigenvalues if no eigenvectors are required and represents about half the work if both eigenvalues and eigenvectors are sought. A similar remark applies to the reduction of a general matrix to bidiagonal form in preparation for the computation of singular values. When the full eigensystem or singular value decomposition is desired, divide and conquer techniques are appropriate for both of these computations, and we shall discuss how to pipeline the reduction to condensed form with a divide and conquer scheme.
.NH
The Algorithm - Reduction to Tridiagonal Form
.PP
We usually think of applying a sequence of Householder transformations to reduce the original symmetric matrix to symmetric tridiagonal form. We apply the transformations as similarity transformations to preserve the eigenvalues of the original matrix. The process can be described as follows:
.EQ I
P sub i ~=~ I ~-~ u sub i u sub i sup T , ~~~~ u sub i sup T u sub i ~=~ 2
.EN
.EQ I
T ~=~ P sub n-2 ~ ... ~ P sub 2 P sub 1 A P sub 1 P sub 2 ~... ~P sub n-2.
.EN
.bp
Each transformation @ P sub i @ is designed to introduce zeros in the @i sup th @ column (and row) of the matrix below the subdiagonal (and above the superdiagonal), so as to leave the leading part of the matrix in tridiagonal form and the trailing part full and symmetric. At the @i sup th @ step of the process, the matrix is of the form
.KS
.EQ I
left ( ~~ matrix { ccol { x above x above nothing above nothing above nothing above nothing above nothing above nothing } ccol { x above x above x above nothing above nothing above nothing above nothing above nothing } ccol { nothing above x above . above . above nothing above nothing above nothing above nothing } ccol { nothing above nothing above . above . above x above nothing above nothing above nothing } ccol { nothing above nothing above nothing above x above x above x above x above x } ccol { nothing above nothing above nothing above nothing above x above x above x above x } ccol { nothing above nothing above nothing above nothing above x above x above x above x } ccol { nothing above nothing above nothing above nothing above x above x above x above x } } ~~ right ) .
.EN
.KE
.PP
To describe the algorithmic details of this reduction, we use the notation @ A sup { ( i:n ^,^ j:n) } @ to denote the @ (n-i+1) times (n-j+1) @ submatrix of @A@ beginning at the @(i,j)@ location of @A@; we denote the subvector of a vector @a@ beginning at the @i@-th position by @ a sup {(i:n)} @; and the @i@-th component of a vector @a@ by @ a sup (i) @. The vector @ u sub i @ is constructed from the @i sup th @ column of the reduced matrix so that
.EQ I
alpha sub i ~=~ -sign( a sub i sup (i+1) ) || a sub i sup (i+1:n) || sub 2
.EN
.EQ I
u sub i sup (i+1) ~=~ sqrt( 1 ~-~ a sub i sup (i+1) / alpha sub i )
.EN
.EQ I
u sub i sup (i+2:n) ~=~ -a sub i sup (i+2:n) / ( alpha sub i u sub i sup (i+1) )
.EN
.EQ I
u sub i sup (1:i) ~=~ 0 ~,
.EN
which gives @ u sub i sup T u sub i ~=~ 2 @.
.PP
In practice @ u sub i @ is constructed and applied to the matrix as follows:
.EQ I
y sub i ~=~ A sub i u sub i
.EN
.EQ I
v sub i ~=~ y sub i ~-~ half ( y sub i sup T u sub i ) u sub i
.EN
.EQ I (2.1)
A sub i+1 ~=~ A sub i ~-~ u sub i v sub i sup T ~-~ v sub i u sub i sup T ~.
.EN
.sp
In the process @ A @ is repeatedly modified by a symmetric rank-two update. This requires updating about half of the @ (n-i) ~ times ~ (n-i) @ elements of the symmetric matrix at each stage of the process.
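.PP
To make the construction above concrete, the following is a minimal, unblocked NumPy sketch that forms @u sub i@, @y sub i@, and @v sub i@ exactly as in the formulas above and applies the rank-two update (2.1) at each step. It is only an illustration; the routine name and the small test at the end are ours, and the blocked organization developed in the rest of this section is not yet used.
.sp
.nf
import numpy as np

def tridiagonalize_unblocked(A):
    # Reduce a symmetric matrix to tridiagonal form by Householder
    # similarity transformations, using the rank-two update (2.1).
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for i in range(n - 2):
        a = A[:, i]
        sigma = np.linalg.norm(a[i + 1:])
        if sigma == 0.0:                 # column already in reduced form
            continue
        alpha = -np.copysign(sigma, a[i + 1])
        u = np.zeros(n)
        u[i + 1] = np.sqrt(1.0 - a[i + 1] / alpha)
        u[i + 2:] = -a[i + 2:] / (alpha * u[i + 1])   # so that u.T u = 2
        y = A.dot(u)                     # y_i = A_i u_i
        v = y - 0.5 * y.dot(u) * u       # v_i = y_i - (1/2)(y_i.T u_i) u_i
        A -= np.outer(u, v) + np.outer(v, u)          # update (2.1)
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((6, 6))
    A = B + B.T
    T = tridiagonalize_unblocked(A)
    assert np.allclose(np.triu(T, 2), 0.0, atol=1e-10)
    assert np.allclose(np.linalg.eigvalsh(T), np.linalg.eigvalsh(A))
.fi
.sp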
.PP
To achieve better memory utilization we can consider aggregating a sequence of transformations, say @p@ of them, so that the matrix will be updated by a rank @2p@ symmetric matrix. Such an implementation would proceed as follows: instead of explicitly updating the matrix with the rank-two change, we form only the second column (row) of @A sub 2 @, say @ a sub 2 @. We then update @ a sub 2 @ by applying (2.1) in the following way:
.EQ I
a sub 2 ~=~ a sub 2 ~-~ v sub 1 sup (2) u sub 1 ~-~ u sub 1 sup (2) v sub 1 ~.
.EN
From this we can compute @ u sub 2 @, and @y sub 2 @ would be formed as @ y sub 2 ~=~ A sub 2 u sub 2 @. However, we have not explicitly formed @ A sub 2 @. We can construct @ y sub 2 @ as follows:
.EQ I
y sub 2 mark ~=~ A sub 2 u sub 2
.EN
.EQ I
lineup ~=~ (A sub 1 ~-~ u sub 1 v sub 1 sup T ~-~ v sub 1 u sub 1 sup T ) u sub 2
.EN
.EQ I
lineup ~=~ A sub 1 u sub 2 ~-~ ( v sub 1 sup T u sub 2 ) u sub 1 ~-~ ( u sub 1 sup T u sub 2 ) v sub 1 ~.
.EN
We could then explicitly form @ A sub 3 @ as a symmetric rank-four update as follows:
.EQ I
A sub 3 mark ~=~ A sub 2 ~-~ u sub 2 v sub 2 sup T ~-~ v sub 2 u sub 2 sup T
.EN
.EQ I
lineup ~=~ A sub 1 ~-~ u sub 1 v sub 1 sup T ~-~ v sub 1 u sub 1 sup T ~-~ u sub 2 v sub 2 sup T ~-~ v sub 2 u sub 2 sup T ~.
.EN
Continuing the process, we obtain in general a rank @2p@ update:
.EQ I
A sub p+1 ~=~ A sub 1 ~-~ UV sup T ~-~ VU sup T ~,
.EN
where
.EQ I
U ~=~ (u sub 1 , ~ u sub 2 , ~ ..., ~ u sub p )
.EN
.EQ I
V ~=~ ( v sub 1 , ~ v sub 2 , ~ ..., ~ v sub p )
.EN
.EQ I
v sub p ~=~ y sub p ~-~ half (y sub p sup T u sub p ) u sub p
.EN
.EQ I
y sub p+1 ~=~ (A sub 1 ~-~ UV sup T ~-~ VU sup T ) u sub p+1
.EN
.EQ I
a sub p+1 sup (p+1:n) ~=~ a sub p+1 sup (p+1:n) ~-~ sum from i=1 to p ( v sub i sup (p+1) u sub i ~+~ u sub i sup (p+1) v sub i ) ~.
.EN
Thus, @ A sub p+1 @ can be formed by a rank @2p@ symmetric update that is rich in matrix-matrix operations.
.sp
.I
Algorithm 1
.R
.nf
@U@ and @V@ are temporary @n times p @ arrays, reused for each iteration of the @k@ loop
@n@ is the order of the matrix
@p@ is the block size
@N ~=~ (n-2)/p@
for @k ~=~ 1, ~ N @
   @s ~=~ (k-1)p ~+~ 1@
   for @ j ~=~ s, ~ s+p-1 @
      @a sub j ~=~ a sub j ~-~ sum from i=s to j-1 ( v sub i sup (j) u sub i ~+~ u sub i sup (j) v sub i )@
      @alpha sub j ~=~ -sign( a sub j sup (j+1) ) || a sub j sup (j+1:n) || sub 2 @
      @u sub j sup (j+1) ~=~ sqrt( 1 ~-~ a sub j sup (j+1) / alpha sub j )@
      @u sub j sup (j+2:n) ~=~ -a sub j sup (j+2:n) / ( alpha sub j u sub j sup (j+1) )@
      @y sub j ~=~ (A sub s ~-~ U sub j-1 {V sub j-1 } sup T ~-~ V sub j-1 {U sub j-1 } sup T ) u sub j @
      @v sub j ~=~ y sub j ~-~ half ( y sub j sup T u sub j ) u sub j @
      @U sub j ~=~ left ( { U sub j-1 , ~ u sub j } right ) @
      @V sub j ~=~ left ( { V sub j-1 , ~ v sub j } right ) @
   end
   @U ~=~ U sub s+p-1 , ~~ V ~=~ V sub s+p-1 @
   @perform~symmetric~rank~2p~update ~on~submatrix@
   @A sup (s+p:n,s+p:n) ~=~ left ( ~ A ~-~ U V sup T ~-~ V U sup T ~ right ) sup (s+p:n,s+p:n) @
end
.fi
@U @ is a lower trapezoidal matrix whose first column has its first non-zero element in position @s~+~1@ and whose @p@-th column has its first non-zero element in position @s~+~p@. Notice that to aggregate the Householder transformations during the construction of the vector @y sub j @ we perform a matrix-vector multiplication with the submatrix @A sub s @ in the @j@ loop.
.PP
At each of the @N@ stages, Algorithm 1 constructs a block transformation and applies it to the trailing part of the matrix; a sketch of this scheme is given below.
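.PP
The following NumPy sketch implements the scheme of Algorithm 1. It is an illustration rather than production code: the function name and the choice to apply the aggregated rank-@2p@ update to the whole matrix (rather than only to the trailing submatrix, which is equivalent here because the rows of @U@ and @V@ above the current block are zero) are ours.
.sp
.nf
import numpy as np

def tridiagonalize_blocked(A, p=2):
    # Blocked reduction to tridiagonal form in the spirit of Algorithm 1:
    # Householder vectors are aggregated in U and V, and one symmetric
    # rank-2p update is applied per block.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    s = 0
    while s < n - 2:
        jb = min(p, n - 2 - s)              # width of the current block
        U = np.zeros((n, 0))
        V = np.zeros((n, 0))
        for j in range(s, s + jb):
            # bring column j up to date without touching the rest of A
            a = A[:, j] - U.dot(V[j, :]) - V.dot(U[j, :])
            sigma = np.linalg.norm(a[j + 1:])
            if sigma == 0.0:
                continue
            alpha = -np.copysign(sigma, a[j + 1])
            u = np.zeros(n)
            u[j + 1] = np.sqrt(1.0 - a[j + 1] / alpha)
            u[j + 2:] = -a[j + 2:] / (alpha * u[j + 1])
            # y_j = (A_s - U V^T - V U^T) u_j, via matrix-vector products
            y = A.dot(u) - U.dot(V.T.dot(u)) - V.dot(U.T.dot(u))
            v = y - 0.5 * y.dot(u) * u
            U = np.column_stack([U, u])
            V = np.column_stack([V, v])
        A -= U.dot(V.T) + V.dot(U.T)        # rank-2p update for this block
        s += jb
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    B = rng.standard_normal((9, 9))
    A = B + B.T
    T = tridiagonalize_blocked(A, p=3)
    assert np.allclose(np.triu(T, 2), 0.0, atol=1e-10)
    assert np.allclose(np.linalg.eigvalsh(T), np.linalg.eigvalsh(A))
.fi
.sp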
.PP
We will call Algorithm 1 a
.I
``right-looking algorithm.''
.R
Notice that at the @k@-th stage we are updating a submatrix of size @n - s + 1 ~ times ~ n-s+1@. We can further reduce the amount of data referenced by the following algorithm.
.KS
.sp
.I
Algorithm 2
.R
.nf
for @k ~=~ 1, ~N @
   @s ~=~ (k-1)p ~+~ 1@
.I
   Apply the previous @k-1@ block transformations to @A sup { (s:n,s:s+p-1) } @
   Compute @ U sub k ~ and ~ V sub k @
.R
end
.KE
.PP
At each stage of this algorithm we are only modifying an @n-s+1 ~ times ~ p @ matrix. We will call this algorithm the
.I
``left-looking algorithm.''
.R
This algorithm still requires access to the submatrix @A sub s @ within the loop; however, it avoids the update of the matrix at the end of the @k@ loop.
.NH
Reduction to Hessenberg Form
.PP
Not surprisingly, the same approach can be used in the reduction to Hessenberg form. Here we have
.EQ I
H ~=~ P sub n-2 ... P sub 2 P sub 1 A P sub 1 P sub 2 ... P sub n-2 ~,
.EN
where @H@ is upper Hessenberg. The idea of using a rank 2 or higher update in this context was discussed in .[[ Dongarra Kaufman Hammarling .]]. Here it is convenient to use formulas slightly modified from those in .[[ Dongarra Kaufman Hammarling .]], given by
.EQ I
y sub i ~=~ A sub i sup T u sub i ~,~~~ z sub i ~=~ A sub i u sub i
.EN
.EQ I
v sub i ~=~ y sub i ~-~ half (z sub i sup T u sub i ) u sub i ~,~~ w sub i ~=~ z sub i ~-~ half (y sub i sup T u sub i ) u sub i
.EN
.EQ I
A sub i+1 ~=~ A sub i ~-~ u sub i v sub i sup T ~-~ w sub i u sub i sup T .
.EN
When @A@ is symmetric, @y sub i ~=~ z sub i @ and @w sub i ~=~ v sub i @, and these equations reduce to those of the tridiagonal case. The vector @ u sub i@ is computed from the same equations as in the tridiagonal case. Here @A@ is updated as
.EQ I
A sub p+1 ~=~ A sub 1 ~-~ UV sup T ~-~ WU sup T ~,
.EN
where
.EQ I
U ~=~ (u sub 1 , u sub 2 , ... , u sub p ) ~,~~ V ~=~ (v sub 1 , v sub 2 , ... , v sub p ) ~,~~ W ~=~ (w sub 1 , w sub 2 , ... , w sub p ) ~and
.EN
.EQ I
y sub p+1 ~=~ (A sub 1 sup T ~-~ VU sup T ~-~ UW sup T ) u sub p+1 ~,~~ z sub p+1 ~=~ (A sub 1 ~-~ UV sup T ~-~ WU sup T ) u sub p+1 ~.
.EN
The matrices @U@, @V@, and @Y ~=~ (y sub 1 , ... , y sub p )@ are trapezoidal, but @Z ~=~ (z sub 1 , ... , z sub p )@ and @W@ are not.
.NH
Reduction to Bidiagonal Form
.PP
A problem that is closely associated with the eigenvalue problem is to compute the singular value decomposition (SVD) of a real @m times n @ matrix @A@. This decomposition is directly related to the symmetric eigenvalue problem in that the singular values of @A@ are the square roots of the eigenvalues of the symmetric positive semidefinite matrix @A sup T A@. It is numerically preferable to avoid forming @A sup T A@, and the algorithm of choice involves an initial reduction of @A@ to upper bidiagonal form @B@ through a sequence of Householder transformations to obtain
.EQ
A ~=~ U B V sup T
.EN
with @U@ and @ V@ orthogonal and @ B @ upper bidiagonal.
.PP
This initial reduction may be treated with an algorithm similar to those already presented. In this case
.EQ I
B ~=~ P sub n-2 sup T ... P sub 2 sup T P sub 1 sup T A Q sub 1 Q sub 2 ... Q sub n-2 ~,
.EN
where each @P sub j ~=~ I ~-~ u sub j u sub j sup T @ is an @m times m@ Householder transformation and each @ Q sub j ~=~ I ~-~ v sub j v sub j sup T @ is an @n times n @ Householder transformation. Again we may achieve efficient memory utilization by aggregating a sequence of transformations, say @p@ of them, so that the matrix will be updated by a matrix of rank @2p@. However, there are data dependencies within this reduction that require additional attention; the sketch below illustrates where they arise.
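.PP
To see the dependency concretely, here is a minimal, unblocked NumPy sketch of the usual bidiagonalization: the right-hand vector @v sub j@ is formed from row @j@ of the matrix only after the left-hand transformation @P sub j@ has been applied. The helper name and the test at the end are ours; the blocked treatment follows in the text.
.sp
.nf
import numpy as np

def house(x):
    # Householder vector w with w.T w = 2 and (I - w w^T) x = (alpha, 0, ..., 0)
    sigma = np.linalg.norm(x)
    if sigma == 0.0:
        return np.zeros_like(x)
    alpha = -np.copysign(sigma, x[0])
    w = np.empty_like(x)
    w[0] = np.sqrt(1.0 - x[0] / alpha)
    w[1:] = -x[1:] / (alpha * w[0])
    return w

def bidiagonalize_unblocked(A):
    # Classical Golub-Kahan bidiagonalization, alternating left and right
    # Householder transformations.
    A = np.array(A, dtype=float)
    m, n = A.shape
    for j in range(n):
        u = np.zeros(m)
        u[j:] = house(A[j:, j])             # P_j zeros column j below the diagonal
        A -= np.outer(u, A.T.dot(u))        # A <- (I - u u^T) A
        if j < n - 2:
            v = np.zeros(n)
            v[j + 1:] = house(A[j, j + 1:]) # Q_j needs row j after P_j was applied
            A -= np.outer(A.dot(v), v)      # A <- A (I - v v^T)
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.standard_normal((7, 5))
    Bmat = bidiagonalize_unblocked(A)
    mask = np.eye(7, 5) + np.eye(7, 5, k=1)
    assert np.allclose(Bmat * (1 - mask), 0.0, atol=1e-10)
    assert np.allclose(np.linalg.svd(Bmat, compute_uv=False),
                       np.linalg.svd(A, compute_uv=False))
.fi
.sp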
.PP
Let us suppose for the moment that the sequences { @ u sub j @ } and { @ v sub j @ } can be computed at will. In general,
.EQ
( I ~-~u u sup T ) A ( I ~-~ v v sup T ) ~=~ A ~-~ u w sup T ~-~ y v sup T ~~,
.EN
where
.EQ
y ~=~ A v ~~,~~~z ~=~ A sup T u ~~ and ~~~ w ~=~ z ~-~ ( u sup T y ) v ~;
.EN
see .[[ Dongarra Kaufman Hammarling .]] for more details. Thus, a straightforward extension of the tridiagonalization scheme presented in Section 2 gives the following algorithm:
.sp2
.KS
.sp
.I
Algorithm 3
.R
.nf
@U@ and @Y@ are temporary @m times p @ arrays, reused for each iteration of the @k@ loop
@V@ and @W@ are temporary @n times p @ arrays, reused for each iteration of the @k@ loop
@n@ is the number of columns in the matrix
@m@ is the number of rows in the matrix
@p@ is the block size
@N ~=~ (n-2)/p@
for @k ~=~ 1, ~ N @
   @s ~=~ (k-1)p ~+~ 1@
   for @ j ~=~ s,~ s+p-1@
      compute @ u sub j @
      compute @ v sub j @
.sp
      @ y sub j ~=~ left (~ A ~-~ U sub j-1 W sub j-1 sup T ~-~ Y sub j-1 V sub j-1 sup T ~ right ) ~v sub j@
.sp
      @ h sub j ~=~ left (~ A ~-~ U sub j-1 W sub j-1 sup T ~-~ Y sub j-1 V sub j-1 sup T ~ right ) sup T u sub j@
.sp
      @ w sub j ~=~ h sub j ~-~ ( u sub j sup T y sub j ) v sub j @
.sp
      @ U sub j ~=~ left (~ U sub j-1 ~,~ u sub j ~right )@
.sp
      @ V sub j ~=~ left (~ V sub j-1 ~,~ v sub j ~right )@
.sp
      @ W sub j ~=~ left (~ W sub j-1 ~,~ w sub j ~right )@
.sp
      @ Y sub j ~=~ left (~ Y sub j-1 ~,~ y sub j ~right )@
   end
   @U ~=~ U sub s+p-1 , ~~ V ~=~ V sub s+p-1 @
   @Y ~=~ Y sub s+p-1 , ~~ W ~=~ W sub s+p-1 @
   @perform~~rank~2p~update ~on~submatrix@
   @A sup (s+p:m,s+p:n) ~=~ (~ A ~-~ U W sup T ~-~ Y V sup T ~) sup (s+p:m,s+p:n) @
end
.fi
.sp2
.KE
Unfortunately, it is not so straightforward to compute @ u sub j @ and @ v sub j @ at will. At step @j@ of the usual bidiagonalization process, the vector @ u sub j @ has its first nonzero element in the @j@-th entry. Hence application of the corresponding Householder transformation on the left alters the @j@-th row of the reduced matrix @ A sub j @, and knowledge of this row is required to compute the vector @ v sub j @, whose first nonzero element is in the @(j+1)@-st entry. The dependencies now become even more complicated because it would appear that the transformation corresponding to @ v sub j @ must be applied from the right before @ u sub j+1 @ can be computed, and so on. However, we note that the above algorithm will be valid if there is an independent formula for computing the @ v sub j @, since the @ u sub j @ may be computed, as in the previous algorithms, from the @j@-th column of the reduced matrix. Indeed, there is an independent formula for computing the @ v sub j @, which may be found by noting that
.EQ
V sup T A sup T A V ~=~ B sup T B ~=~ T
.EN
where @ V ~=~ Q sub 1 Q sub 2 ... Q sub n-2 ~ @ is precisely the sequence of transformations that would be computed in Algorithm 1 to reduce the matrix @ A sup T A @ to tridiagonal form @ T ~=~ B sup T B @.
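.PP
As a quick sanity check on the two-sided update used in Algorithm 3, the identity @( I ~-~ u u sup T ) A ( I ~-~ v v sup T ) ~=~ A ~-~ u w sup T ~-~ y v sup T@ can be verified numerically; the short standalone sketch below does so for random data (the scaling of @u@ and @v@ to Householder length is only for consistency with the text).
.sp
.nf
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 4
A = rng.standard_normal((m, n))

u = rng.standard_normal(m)
u *= np.sqrt(2.0) / np.linalg.norm(u)      # u.T u = 2
v = rng.standard_normal(n)
v *= np.sqrt(2.0) / np.linalg.norm(v)      # v.T v = 2

y = A.dot(v)                               # y = A v
z = A.T.dot(u)                             # z = A^T u
w = z - u.dot(y) * v                       # w = z - (u^T y) v

left = (np.eye(m) - np.outer(u, u)).dot(A).dot(np.eye(n) - np.outer(v, v))
right = A - np.outer(u, w) - np.outer(y, v)
assert np.allclose(left, right)
.fi
.sp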
.PP
The observation that the @ v sub j @ may be generated from @ A sup T A @ leads to the following procedure for computing them:
.sp2
.I
Procedure compute @v sub j@
.R
.nf
@z sub j ~=~ A sub s sup T a sub j sup { (s : s+p-1) } ~-~ sum from i=s to j-1 ( v sub i x sub i sup (j) ~+~ x sub i v sub i sup (j) )@
@zeta sub j ~=~ -sign( z sub j sup (j+1) ) || z sub j sup (j+1:n) || sub 2 @
@v sub j sup (j+1) ~=~ sqrt( 1 ~-~ z sub j sup (j+1) / zeta sub j )@
@v sub j sup (j+2:n) ~=~ -z sub j sup (j+2:n) / ( zeta sub j v sub j sup (j+1) )@
@t sub j ~=~ (A sub s sup T A sub s ~-~ V sub j-1 { X sub j-1 } sup T ~-~ X sub j-1 {V sub j-1 } sup T ) v sub j , ~ A sub s ~=~ A sup (s:m,s:n) @
@x sub j ~=~ t sub j ~-~ half ( t sub j sup T v sub j ) v sub j @
@X sub j ~=~ left ( { X sub j-1, ~ x sub j } right ) @
.fi
.sp2
Computation of the @ u sub j @ depends only upon the @j@-th column of the reduced matrix being in place before the @j@-th step. Therefore, the column-oriented formula given in Algorithm 1 may be adapted to give
.sp2
.I
Procedure compute @u sub j@
.R
.nf
@a sub j ~=~ a sub j ~-~ sum from i=s to j-1 ( u sub i w sub i sup (j) ~+~ y sub i v sub i sup (j) )@
@alpha sub j ~=~ -sign( a sub j sup (j) ) || a sub j sup (j:m) || sub 2 @
@u sub j sup (j) ~=~ sqrt( 1 ~-~ a sub j sup (j) / alpha sub j )@
@u sub j sup (j+1:m) ~=~ -a sub j sup (j+1:m) / ( alpha sub j u sub j sup (j) )@
.fi
.sp2
If these two procedures are substituted for ``compute @u sub j@'' and ``compute @v sub j@'' in Algorithm 3, then the algorithm is well defined. In these procedures we do not explicitly form the indicated matrix products; instead, the matrix-vector products are accumulated.
.NH
Relationship to the @WY@ Factorization
.PP
The algorithm presented here for aggregating Householder transformations is closely related to the @WY@ representation for the product of Householder matrices presented by Bischof and Van Loan .[[ bischof .]]. The relationship is most clearly seen in the context of the QR factorization of a general rectangular matrix. The @WY@ representation has the following form:
.KS
.sp
.I
QR Factorization (Bischof and Van Loan)
.R
.nf
@n@ is the number of columns in the matrix
@p@ is the block size
@N ~=~ n/p@
for @k ~=~ 1, ~ N @
   @s ~=~ (k-1)p ~+~ 1@
   for @ j ~=~ s, ~ s+p-1 @
      @a sub j ~=~ a sub j ~-~ sum from i=s to j-1 z sub i sup (j) u sub i ~~ where ~ z sub i ~=~ A sub i sup T u sub i @
      @compute ~ Householder ~vector ~ u sub j @
      @Y sub k sup (s:j) ~=~ (Y sub k sup (s:j-1) ~-~ u sub j u sub j sup T {Y sub k sup (s:j-1) } , ~ -2u sub j ) @
   end
   @perform~rank~p~update ~on~submatrix@
   @A sup (s+p:n,s+p:n) ~=~ A sup (s+p:n,s+p:n) ~-~ U Y sup T A sup (s+p:n,s+p:n) @
end
.fi
.KE
.sp
If one implements the reduction along the lines of the algorithm described in Section 2, the factorization can be described as follows:
.sp
.KS
.I
QR Factorization (Alternative Algorithm)
.R
.nf
@n@ is the order of the matrix
@p@ is the block size
@N ~=~ n/p@
for @k ~=~ 1, ~ N @
   @s ~=~ (k-1)p ~+~ 1@
   for @ j ~=~ s, ~ s+p-1 @
      @a sub j ~=~ a sub j ~-~ sum from i=s to j-1 v sub i sup (j) u sub i @
      @compute ~ Householder ~vector ~ u sub j @
      @v sub j ~=~ ( A sub s sup T ~-~ V sub k sup (s:j-1) U sub k sup {(s:j-1)~T } ) u sub j @
   end
   @perform~rank~p~update ~on~submatrix@
   @A sup (s+p:n,s+p:n) ~=~ left ( ~ A ~-~ U V sup T ~ right ) sup (s+p:n,s+p:n) @
end
.fi
.KE
The two block algorithms differ in the formation of @v sub j @ and @Y sub j @ and in the update of the matrix @A@; a sketch of the alternative form is given below.
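.PP
The following NumPy sketch implements a blocked Householder QR in the spirit of the alternative algorithm above: within each block the Householder vectors are aggregated in @U@ and the vectors @v sub j@ in @V@, and the matrix receives one aggregated update per block. The routine names are illustrative, and numpy.linalg.qr is used only to check the result.
.sp
.nf
import numpy as np

def house(x):
    # Householder vector w with w.T w = 2 and (I - w w^T) x = (alpha, 0, ..., 0)
    sigma = np.linalg.norm(x)
    if sigma == 0.0:
        return np.zeros_like(x)
    alpha = -np.copysign(sigma, x[0])
    w = np.empty_like(x)
    w[0] = np.sqrt(1.0 - x[0] / alpha)
    w[1:] = -x[1:] / (alpha * w[0])
    return w

def blocked_qr_r(A, p=2):
    # Returns the triangular factor R, accumulating each block of
    # Householder transformations as a single rank-p update A <- A - U V^T.
    A = np.array(A, dtype=float)
    m, n = A.shape
    for s in range(0, n, p):
        jb = min(p, n - s)
        U = np.zeros((m, 0))
        V = np.zeros((n, 0))
        for j in range(s, s + jb):
            a = A[:, j] - U.dot(V[j, :])        # column j of A - U V^T
            u = np.zeros(m)
            u[j:] = house(a[j:])
            v = A.T.dot(u) - V.dot(U.T.dot(u))  # v_j = (A - U V^T)^T u_j
            U = np.column_stack([U, u])
            V = np.column_stack([V, v])
        A -= U.dot(V.T)                         # one rank-p update per block
    return np.triu(A)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    A = rng.standard_normal((8, 6))
    R = blocked_qr_r(A, p=3)[:6]
    R_ref = np.linalg.qr(A, mode="r")
    assert np.allclose(np.abs(R), np.abs(R_ref))    # agree up to row signs
.fi
.sp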
.PP
For the Alternative Algorithm, the vector @v sub j @ is updated using the submatrix @A sub s @, whereas the Bischof and Van Loan algorithm uses information from @u sub j @ and @ Y sub k sup (s:j-1) @. Thus the Bischof and Van Loan algorithm will have fewer accesses to the data. In the update of the matrix @A@ for the Bischof and Van Loan factorization, the update is of the form
.EQ I
A sup (s+p:n,s+p:n) ~=~ A sup (s+p:n,s+p:n) ~-~ U Y sup T A sup (s+p:n,s+p:n) ~;
.EN
and for the alternative factorization, the update is of the form
.EQ I
A sup (s+p:n,s+p:n) ~=~ A sup (s+p:n,s+p:n) ~-~ U V sup T .
.EN
The alternative algorithm incorporates the information about the matrix @A@ in the matrix @V@.
.PP
We can describe the reduction to tridiagonal form for the symmetric eigenvalue problem using the Bischof and Van Loan approach as
.EQ I
A sub p+1 ~=~ (I ~-~ USU sup T ) A sub 1 ( I ~-~ USU sup T ) .
.EN
If we multiply out and combine terms, we can reduce the expression to
.EQ I
A sub p+1 ~=~ A sub 1 ~-~ ZU sup T ~-~ UZ sup T
.EN
where
.EQ I
W ~=~ A sub 1 US sup T ~ and~~ Z ~=~ W ~-~ half USU sup T W
.EN
which is of the form described in Section 2. An implementation of the reductions along these lines is described by W. Harrod at the University of Illinois .[[ Harrod private .]].
.PP
With the @WY@ representation it is simple to apply the set of transformations to another matrix, as is required in the back-transformation for the eigenvector computation; one simply applies @ (I ~-~ WY sup T ) @ to the matrix. To apply the transformations using the formulation in Section 2, one can use the Householder vectors to construct the matrix @S@ such that @ I ~-~ USU sup T @ can be used to apply the transformations to the eigenvectors of the symmetric tridiagonal matrix, back-transforming them to the eigenvectors of the original problem. The matrix @S @ is a @p@ by @p@ upper triangular matrix whose @k@-th column is formed as follows:
.EQ I
(I ~-~ USU sup T ) (I ~-~ uu sup T ) mark ~=~ I ~-~ uu sup T ~-~ USU sup T ~+~ USU sup T uu sup T
.EN
.EQ I
lineup ~=~ I~-~ [U,~u] ~ left ( ~ matrix { ccol { { SU sup T ~-~ SU sup T uu sup T } above u sup T } } ~right )
.EN
.EQ I
lineup ~=~ I~-~ [U,~u] ~ left ( ~ matrix { ccol { S above 0 } ccol { { - SU sup T u } above 1 } } right ) left ( ~ matrix { ccol { U sup T above u sup T } } right )
.EN
So the new column of @S@ is
.EQ I
left ( ~ matrix { ccol { -S U sup T u above 1 } } right ) ~.
.EN
.NH
Pipelining Reduction to Condensed Form with Determination of Eigenvalues
.PP
Recently, algorithms based upon divide and conquer strategies have been developed for the determination of eigenvalues and singular values of a matrix in condensed form .[[ dongarra sorensen fully parallel .], .[ jessup sorensen .]]. These methods are also rich in matrix-matrix operations and mesh very well with the block reductions presented here. This is accomplished by pipelining the initial reduction phase with the computation of eigenvalues and back-transformation of eigenvectors. These considerations are of little consequence on serial computers but have significant performance advantages on parallel-vector processors.
.PP
We use the block reduction algorithm described above to introduce zeros in a block of the matrix; say we are at the @k@-th stage and have just introduced zeros into the @k@-th block.
As we start the next block reduction, on the @k+1@-st block, we can start in parallel the eigenvalue computation on that part of the tridiagonal matrix generated from the @k@-th block reduction of the matrix. When we have completed eigenvalue computations from two tridiagonal segments, we can use the technique applied in the divide and conquer algorithm as described by Dongarra and Sorensen .[[ Dongarra sorensen fully eigenvalue .]] to determine the eigenvalues and eigenvectors of a pair of tridiagonal matrices. Then the eigenvalues of successive pairs of blocks can be found, then pairs of pairs, etc., until the full set is determined. When the reduction to condensed form and the divide and conquer strategy are used in this pipelined fashion a highly efficient parallel algorithm can be constructed. .PP This discussion is made more explicit in the following example. We consider a symmetric matrix that is to be partitioned into four block columns as shown in the figure below. .KS .PS 4 4 line from 2.737, 8.413 to 3.225, 7.925 line from 2.813, 8.538 to 3.338, 8.012 line from 2.725, 8.525 to 3.250, 8.012 dashwid = 0.0500i line dashed from 7.012, 8.475 to 7.188, 8.275 line dashed from 6.750, 8.488 to 7.188, 8.038 line dashed from 6.500, 8.488 to 6.988, 8.025 line dashed from 6.225, 8.512 to 6.725, 8.038 line dashed from 5.988, 8.500 to 6.488, 8.025 line dashed from 5.750, 8.512 to 6.238, 8.025 line dashed from 5.525, 8.500 to 5.975, 8.050 line dashed from 5.213, 8.288 to 5.450, 8.025 line dashed from 5.250, 8.488 to 5.713, 8.012 line dashed from 0.725, 6.738 to 0.475, 6.500 line dashed from 0.712, 6.988 to 0.262, 6.537 line dashed from 0.738, 7.250 to 0.237, 6.775 line dashed from 0.712, 7.512 to 0.237, 7.025 line dashed from 0.712, 7.750 to 0.250, 7.275 line dashed from 0.725, 7.988 to 0.250, 7.537 line dashed from 0.700, 8.238 to 0.262, 7.762 line dashed from 0.725, 8.488 to 0.237, 7.988 line dashed from 0.475, 8.512 to 0.237, 8.250 line from 3.237, 8.025 to 3.237, 6.550 line from 3.250, 8.012 to 4.700, 8.012 line from 5.250, 7.000 to 7.200, 7.000 line from 5.250, 7.512 to 7.213, 7.512 line from 5.238, 8.000 to 7.225, 8.000 line from 1.725, 8.525 to 1.725, 6.500 line from 1.237, 8.512 to 1.237, 6.500 line from 0.738, 8.538 to 0.738, 6.475 line from 0.225, 8.525 to 2.225, 8.525\ to 2.225, 6.525\ to 0.225, 6.525\ to 0.225, 8.525 line from 5.225, 8.525 to 7.225, 8.525\ to 7.225, 6.525\ to 5.225, 6.525\ to 5.225, 8.525 line from 2.725, 8.538 to 4.725, 8.538\ to 4.725, 6.537\ to 2.725, 6.537\ to 2.725, 8.537 .PE .ce @Figure ~ 1@ Partitioned Matrix .KE .sp .PP Let us associate @H sub k @ with the process of reducing the @k-th@ block @ A sub k @ of the partitioned matrix to tridiagonal form using Householder transformations. Thus @H sub k @ executes a block step of Algorithm 1 (see the k-loop) on the block @ A sub k @ In this algorithm we have the possibility of spawning parallel processes. Processes may cooperate in applying the resulting transformation shown in .EQ A sup (s+p:n,s+p:n) ~=~ A sup (s+p:n,s+p:n) ~-~ U sub k V sub k sup T ~-~ V sub k U sub k sup T .EN in parallel. Let us denote these parallel processes by @M sub kj @ so that process @M sub kj@ is responsible for a portion of the work in the matrix multiply in the performance of Algorithm 1 by process @H sub k@. .PP On completion of process @H sub k@ the tridiagonal matrix @ T sub k @ is exposed and the algorithm @TQL2@ may be applied to compute the eigensystem of @T sub k @ after the rank one tearing has been computed. 
Let us denote this process by @ E sub k @. .PP Once processes @ E sub 1@ and @ E sub 2 @ have completed, then the eigensystem of .EQ (6.1) left ( ~ matrix { ccol { T sub 1 above 0 } ccol { 0 above T sub 2 } } ~ right ) ~=~ left ( ~ matrix { ccol { Q sub 1 above 0 } ccol { 0 above Q sub 2 } } ~ right ) left (~ left ( ~ matrix { ccol { D sub 1 above 0 } ccol { 0 above D sub 2 } } ~ right ) +~ beta left ( ~ matrix { ccol { q sub 1 above q sub 2} } ~ right ) ( q sub 1 sup T ~,~ q sub 2 sup T ) ~ ~ right ) left ( ~ matrix { ccol { Q sub 1 sup T above 0 } ccol { 0 above Q sub 2 sup T } } ~ right ) .EN may be computed using the rank one updating scheme. Similarly, once processes @ E sub 3@ and @ E sub 4 @ have completed, the eigensystem of @ diag ( T sub 3 , T sub 4 ) @ may be computed. Finally the entire eigensystem may be obtained through rank one updating of these two eigensystems. Let us denote these processes as @U sub 1 ~,~ U sub 2 @ and @ U sub 0 @ respectively. .PP With the proper storage arrangements these processes obey the following large grain control flow graph: .sp2 .KS .PS 3 3 arrowht = 0.1000i arrowwid = 0.0500i line <- from 5.600, 5.475 to 4.588, 5.475 line <- from 4.088, 5.475 to 3.112, 5.475 line <- from 2.612, 5.488 to 1.638, 5.488 line <- from 3.487, 9.087 to 2.563, 8.387 line <- from 3.850, 9.087 to 4.875, 8.413 line <- from 2.213, 8.025 to 1.500, 6.975 line <- from 2.438, 7.988 to 2.775, 6.975 line <- from 4.938, 8.012 to 4.425, 6.912 line <- from 5.200, 8.025 to 5.775, 6.938 line <- from 1.362, 6.475 to 1.362, 5.713 line <- from 2.813, 6.438 to 2.800, 5.738 line <- from 4.350, 6.450 to 4.350, 5.762 line <- from 5.825, 6.488 to 5.838, 5.775 line <- from 1.388, 5.213 to 1.388, 4.188 line <- from 1.237, 5.262 to 0.538, 4.188 line <- from 1.538, 5.262 to 2.213, 4.150 "U" at 3.537, 9.337 ljust "0" at 3.700, 9.200 ljust "U" at 2.263, 8.250 ljust "1" at 2.388, 8.087 ljust "U" at 4.975, 8.275 ljust "2" at 5.125, 8.100 ljust "E" at 1.288, 6.762 ljust "1" at 1.425, 6.613 ljust "E" at 2.737, 6.762 ljust "2" at 2.900, 6.688 ljust "E" at 4.238, 6.762 ljust "3" at 4.363, 6.637 ljust "E" at 5.775, 6.738 ljust "4" at 5.887, 6.625 ljust "H" at 1.237, 5.525 ljust "1" at 1.438, 5.363 ljust "H" at 2.725, 5.525 ljust "2" at 2.912, 5.387 ljust "H" at 4.213, 5.512 ljust "3" at 4.375, 5.363 ljust "H" at 5.713, 5.500 ljust "4" at 5.887, 5.350 ljust "M" at 0.275, 4.000 ljust "11" at 0.425, 3.838 ljust "M" at 1.225, 3.987 ljust "12" at 1.325, 3.862 ljust "M" at 2.150, 3.963 ljust "13" at 2.287, 3.838 ljust ellipse at 3.662, 9.313 wid 0.500 ht 0.500 ellipse at 2.350, 8.225 wid 0.500 ht 0.500 ellipse at 2.862, 5.500 wid 0.500 ht 0.500 ellipse at 1.375, 6.713 wid 0.500 ht 0.500 ellipse at 2.850, 6.713 wid 0.500 ht 0.500 ellipse at 4.338, 6.700 wid 0.500 ht 0.500 ellipse at 5.863, 6.700 wid 0.500 ht 0.500 ellipse at 5.075, 8.238 wid 0.500 ht 0.500 ellipse at 1.375, 5.475 wid 0.500 ht 0.500 ellipse at 4.338, 5.488 wid 0.500 ht 0.500 ellipse at 5.863, 5.475 wid 0.500 ht 0.500 ellipse at 1.375, 3.950 wid 0.500 ht 0.500 ellipse at 0.438, 3.950 wid 0.500 ht 0.500 ellipse at 2.313, 3.938 wid 0.500 ht 0.500 .PE .ce @Figure ~ 2@ Large Grain Control Flow Graph .KE .sp2 In this control flow graph a @node@ denotes a process. That is, for example, a subroutine name together with the pointers to the data which the subroutine is to operate upon. A process @P@ becomes schedulable or ready to execute when there are no incoming arcs to the node representing process @P@. 
This signifies that all of the data dependencies for process @P@ have been satisfied through the completion of the processes upon which it depends. The graph indicates that processes @ M sub 1j @ can execute immediately. Once they have completed, @H sub 1@ may report to @H sub 2@, and this process may execute and spawn processes @ M sub 2j @. At the same time @ H sub 1 @ reports to process @ E sub 1 @, which may then begin execution. When @ E sub 1 @ and @ E sub 2 @ have both completed, then process @ U sub 1 @ may start, and so on.
.PP
To accumulate a matrix of eigenvectors, the successive Householder transformations must be multiplied from left to right in the order they are applied,
.EQ (6.2)
Q ~=~ PI from i=1 to n-2 left (^ I ~-~ u sub i u sub i sup T ^ right )
.EN
and we observe that when accumulated this way, successive applications of @ I ~-~ u sub i u sub i sup T @ affect only the last @n-i+1@ columns of @Q@. Thus, the Givens transformations associated with @E sub 1 @ may be applied as soon as the products of the Householder transformations associated with @H sub 1@ have been multiplied out. These transformations may be applied independently of the computation of @ H sub k @ for @ k ~>~ 1 @ because these matrices affect columns that are independent of the columns affected by @E sub i @.
.PP
When we do not wish to find eigenvectors, there is no reason to store the product @Q@ of these Householder transformations. Nor is it necessary to accumulate the product of the successive eigenvector transformations resulting from the updating problem. That is, we do not need to overwrite @Q@ with
.EQ
Q ~<-~ Q left ( ~ matrix { ccol { Q sub 1 above 0 } ccol { 0 above Q sub 2 } } ~ right ) Q hat
.EN
where @Q sub 1 @ and @Q sub 2 @ are the matrices appearing in equation (6.1), @Q hat @ is the matrix of eigenvectors of the interior matrix in (6.1), and @Q@ is the matrix appearing in (6.2) above. Instead, we may simply discard @Q@. Then the vector @q sub 1 @ may be formed as @T hat sub 1 @ is transformed to @D sub 1@ in (6.1) by accumulating the products of the transformations constructed in TQL2 that make up @Q sub 1@ against the vector @ e sub k @. If there is more than one division, then @Q sub 1 @ will have been calculated with the updating scheme. In this case we do not calculate all of @Q sub 1@ but instead use the component-wise formula for the eigenvectors to pick off the appropriate components needed to form @q sub 1 @ (for details see .[[ dongarra sorensen fully parallel .], .[ jessup sorensen .]]).
.NH
Operation Counts and Storage
.PP
An analysis of the number of floating-point operations (counting additions and multiplications) for the standard algorithm for reduction to tridiagonal form reveals an operation count of
.EQ I
4 over 3 n sup 3 ~+~ 7 over 2 n sup 2 ~+~ 1 over 6 n ~-~ 25 ~ flops.
.EN
In aggregating the transformations to perform the block reduction, additional work is required in the formation of @y sub j @ in Algorithm 1. The additional work for a block size @p@ amounts to
.EQ I
(2p ~-~ 3 over 2 ) n sup 2 ~+~ ( -2 over 3 p sup 2 ~-~ 2p ~+~ 13 over 6 )n ~+~ ( -4 over 3 p sup 2 ~-~ 4 p )
.EN
floating-point additions and multiplications.
.PP
The algorithm can be organized so that the vectors @u sub i @ overwrite the lower part of the matrix (as is done in the standard version of the software), but additional workspace of size @n times p @ is required to store the current block of @V@.
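.PP
For a rough feel for the size of this overhead, the two expressions above can be evaluated directly. The following short script simply tabulates the blocked overhead as a percentage of the standard operation count for a few values of @n@ and @p@; it assumes nothing beyond the formulas quoted above.
.sp
.nf
def standard_flops(n):
    # 4/3 n^3 + 7/2 n^2 + 1/6 n - 25
    return 4.0 / 3.0 * n**3 + 3.5 * n**2 + n / 6.0 - 25.0

def block_overhead(n, p):
    # (2p - 3/2) n^2 + (-2/3 p^2 - 2p + 13/6) n + (-4/3 p^2 - 4p)
    return ((2.0 * p - 1.5) * n**2
            + (-2.0 / 3.0 * p**2 - 2.0 * p + 13.0 / 6.0) * n
            + (-4.0 / 3.0 * p**2 - 4.0 * p))

for n in (100, 200, 500):
    for p in (4, 8, 16):
        pct = 100.0 * block_overhead(n, p) / standard_flops(n)
        print("n =", n, " p =", p, " overhead =", round(pct, 2), "percent")
.fi
.sp
Since the overhead grows like @p n sup 2 @ while the basic count grows like @n sup 3 @, the relative cost of aggregation diminishes as @n@ increases.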
.NH Experimental Results .PP The following results were generated on an Alliant FX/8 computer using eight processors. The Alliant FX/8 is a parallel processor where each of the processors has vector registers and can perform vector operations. .KS .I .ce Table 1 .ce Ratio of execution times (speedups) between the .ce EISPACK routine and the blocked version .ce on the Alliant FX/8 .R .TS center; c|c|c. Ratio Ratio Order TRED1/TREDB ORTHES/ORTHSB _ 100 1.94 2.59 200 2.39 3.01 300 2.40 3.23 400 2.39 3.35 500 2.36 3.46 .TE .KE The following results were generated on the CRAY X-MP using one processor. .KS .I .ce Table 2 .ce Ratio of execution times (speedups) between the .ce EISPACK routine and the blocked version .ce on the Cray X-MP .R .TS center; c|c|c. Ratio Ratio Order TRED1/TREDB ORTHES/ORTHSB _ 100 1.03 1.29 200 1.10 1.52 300 1.21 1.65 400 1.23 1.79 500 1.28 1.92 .TE .KE .NH References .[ $LIST$ .] .bp .B .ce LAPACK Working Notes .sp 2 .R LAPACK Working Note #1 .sp James Demmel, Jack J. Dongarra, Jeremy Du Croz, Anne Greenbaum, Sven Hammarling, and Danny Sorensen, .I ``Prospectus for the Development of a Linear Algebra Library for High-Performance Computers,'' .R Argonne National Laboratory, Mathematics and Computer Science Division, Technical Memorandum No. 97, September, 1987. .sp 2 .R LAPACK Working Note #2 .sp Jack J. Dongarra, Sven J. Hammarling, and Danny C. Sorensen .I ``Block Reduction of Matrices to Condensed Forms for Eigenvalue Computations,'' .R Argonne National Laboratory, Mathematics and Computer Science Division, Technical Memorandum No. 99, September, 1987. .