LAPACK 3.2
==========
Release date: Sun 11/16/2008.

This material is based upon work supported by the National Science Foundation
and the Department of Energy under Grant Nos. NSF-CCF-00444486, NSF-CNS 0325873,
NSF-EIA 0122599, NSF-ACI-0090127, DOE-DE-FC02-01ER25478, DOE-DE-FC02-06ER25768.

LAPACK is a software package provided by Univ. of Tennessee, Univ. of California
Berkeley, Univ. of Colorado Denver and NAG Ltd.

1. Support and questions
------------------------
http://icl.cs.utk.edu/lapack-forum/

2. LAPACK 3.2: What's new
-------------------------

1. *Extra Precise Iterative Refinement:* New linear solvers that ``guarantee''
fully accurate answers (or give a warning that the answer cannot be trusted).
The matrix types supported in this release are GE (general), SY (symmetric),
PO (positive definite), HE (Hermitian), and GB (general band), in all the
relevant precisions. See reference [3] below.

2. *XBLAS, or portable ``extra precise BLAS'':* Our new linear solvers in (1)
depend on these to perform iterative refinement. See reference [3] below.
The XBLAS will be released in a separate package. See ``More Details''.

3. *Non-Negative Diagonals from Householder QR:* The QR factorization routines
now guarantee that the diagonal is both real and non-negative. Factoring a
uniformly random matrix now correctly generates an orthogonal Q from the Haar
distribution. See reference [4] below.

4. *High Performance QR and Householder Reflections on Low-Profile Matrices:*
The auxiliary routines that apply Householder reflections (e.g. DLARFB)
automatically reduce the cost of QR from O(n^3^) to O(n^2^) for matrices stored
in a dense format but with a ``narrow profile'' (including but not limited to
band matrices), with no user interface changes. Other users of these routines
can see similar benefits. See reference [4] below.

5. *New fast and accurate Jacobi SVD:* A high accuracy SVD routine for dense
matrices, which can compute tiny singular values to many more correct digits
than xGESVD when the matrix has columns differing widely in norm, and usually
runs faster than xGESVD too. See references [5,6,7] below.

6. *Routines for Rectangular Full Packed format:* The RFP format (SF, HF, PF,
TF) enables efficient routines with optimal storage for symmetric, Hermitian
or triangular matrices. Since these routines use the Level 3 BLAS, they are
generally much more efficient than the existing packed storage routines
(SP, HP, PP, TP). See reference [8] below.

7. *Pivoted Cholesky:* The Cholesky factorization with diagonal pivoting for
symmetric positive semi-definite matrices. Pivoting is required for reliable
rank detection. See reference [9] below.

8. *Mixed precision iterative refinement routines for exploiting fast single
precision hardware:* On platforms like the Cell processor that do single
precision much faster than double, linear systems can be solved many times
faster. Even on commodity processors there is a factor of 2 in speed between
single and double precision. The matrix types supported in this release are
GE (general) and PO (positive definite). See reference [1] below.

9. *Some new variants added for the one-sided factorizations:* LU gets
Right-Looking, Left-Looking, Crout and Recursive; QR gets Right-Looking and
Left-Looking; Cholesky gets Left-Looking, Right-Looking and Top-Looking.
Depending on the computer architecture (or the speed of the underlying BLAS),
one of these variants may be faster than the original LAPACK implementation.

10. *More robust DQDS algorithm:* Fixed some rare convergence failures in the
bidiagonal DQDS SVD routine.

11. *Better documentation for the multishift Hessenberg QR algorithm with
aggressive early deflation, and various improvements of the code.* See
reference [2] below.

3. References
-------------

[1]:: Alfredo Buttari, Jack Dongarra, Julie Langou, Julien Langou, Piotr
Luszczek, and Jakub Kurzak. _Mixed Precision Iterative Refinement Techniques
for the Solution of Dense Linear Systems_. International Journal of High
Performance Computing Applications, 21(4):457-466, 2007.

[2]:: Ralph Byers. _LAPACK 3.1 xHSEQR: Tuning and Implementation Notes on the
Small Bulge Multi-shift QR Algorithm with Aggressive Early Deflation_. LAPACK
Working Note 187, May 2007.

[3]:: James Demmel, Yozo Hida, William Kahan, Xiaoye S. Li, Sonil Mukherjee,
and E. Jason Riedy. _Error Bounds from Extra Precise Iterative Refinement_.
ACM Transactions on Mathematical Software (TOMS), 32(2):325-351, 2006.
(Also LAWN-165.)

[4]:: James W. Demmel, Mark Hoemmen, Yozo Hida, and E. Jason Riedy.
_Non-Negative Diagonals and High Performance on Low-Profile Matrices from
Householder QR_. LAPACK Working Note 203, May 2008.

[5]:: Zlatko Drmac. _A global convergence proof of cyclic Jacobi methods with
block rotations_. LAPACK Working Note 196, December 2007.

[6]:: Zlatko Drmac and Kresimir Veselic. _New fast and accurate Jacobi SVD
algorithm: I_. SIAM Journal on Matrix Analysis and Applications,
29(4):1322-1342, 2007. (Also LAWN-169.)

[7]:: Zlatko Drmac and Kresimir Veselic. _New fast and accurate Jacobi SVD
algorithm: II_. SIAM Journal on Matrix Analysis and Applications,
29(4):1343-1362, 2007. (Also LAWN-170.)

[8]:: Fred G. Gustavson, Jerzy Wasniewski, and Jack J. Dongarra. _Rectangular
Full Packed Format for Cholesky's Algorithm: Factorization, Solution and
Inversion_. LAPACK Working Note 199, April 2008.

[9]:: Craig Lucas. _LAPACK-Style Codes for Level 2 and 3 Pivoted Cholesky
Factorizations_. LAPACK Working Note 161, February 2004.

4. External Contributors
------------------------

* Ralph Byers (University of Kansas, USA)
* Zlatko Drmac (University of Zagreb, Croatia)
* Peng Du (University of Tennessee, Knoxville, USA)
* Fred Gustavson (IBM Watson Research Center, NY, USA)
* Craig Lucas (University of Manchester / NAG Ltd., UK)
* Kresimir Veselic (Fernuniversitaet Hagen, Hagen, Germany)
* Jerzy Wasniewski (Technical University of Denmark, Lyngby, Copenhagen, Denmark)

5. Thanks
---------

Thanks for bug reports, patches and suggestions to: Patrick Alken (University
of Colorado at Boulder, USA), Penny Anderson, Bobby Cheng, Cleve Moler, Duncan
Po, and Pat Quillen (MathWorks, MA, USA), Michael Baudin (Scilab), Michael
Chuvelev (Intel), Phil DeMier (IBM), Michel Devel (UTINAM institute, University
of Franche-Comte, UMR CNRSA), Alan Edelman (Massachusetts Institute of
Technology, MA, USA), Carlo de Falco and all the Octave developers, Fernando
Guevara (University of Utah, UT, USA), Joao, Christian Keil, Zbigniew Leyk
(Wolfram), ``LawrenceMulholland'' (NAG Ltd), Mick Pont (NAG), Clint Whaley
(University of Texas at San Antonio, TX, USA), Mikhail Wolfson (MIT), and
Vittorio Zecca.

6. Developer list
-----------------

.Principal Investigators
* Jim Demmel (University of California at Berkeley, USA)
* Jack Dongarra (University of Tennessee and ORNL, USA)

.LAPACK developers involved in this release
* Deaglan Halligan (University of California at Berkeley, USA)
* Sven Hammarling (NAG Ltd., UK)
* Yozo Hida (University of California at Berkeley, USA)
* Daniel Kressner (ETH Zurich, Switzerland)
* Julie Langou (University of Tennessee, USA)
* Julien Langou (University of Colorado Denver, USA)
* Osni Marques (Lawrence Berkeley Laboratory, USA)
* E. Jason Riedy (University of California at Berkeley, USA)
* Edward Smyth (NAG Ltd., UK)

.XBLAS developers involved in this release
* David Bailey (Lawrence Berkeley Laboratory, USA)
* Deaglan Halligan (University of California at Berkeley, USA)
* Greg Henry (Intel)
* Yozo Hida (University of California at Berkeley, USA)
* Jimmy Iskandar (University of California at Berkeley, USA)
* William Kahan (University of California at Berkeley, USA)
* Anil Kapur (University of California at Berkeley, USA)
* Suh Y. Kang (University of California at Berkeley, USA)
* Xiaoye Li (Lawrence Berkeley Laboratory, USA)
* Soni Mukherjee (University of California at Berkeley, USA)
* Jason Riedy (University of California at Berkeley, USA)
* Michael Martin (University of California at Berkeley, USA)
* Brandon Thompson (University of California at Berkeley, USA)
* Teresa Tung (University of California at Berkeley, USA)
* Daniel Yoo (University of California at Berkeley, USA)

7. Install Procedure
--------------------

IMPORTANT: *A Fortran 90 compiler is required.* The code now uses Fortran 90
features such as the MAXLOC intrinsic. For the installation procedure, refer
to LAWN 81 (to be updated for this release).

[IMPORTANT]
===============================
*XBLAS and iterative refinement integration*

Building the extra precise iterative refinement routines of (1) against the
XBLAS requires a C compiler (the XBLAS are distributed as a separate package;
see ``More Details'' below).
===============================

[CAUTION]
===============================
*VARIANTS integration*

  make variants
  make variants_testing

The corresponding libraries are created in SRC/VARIANTS/LIB:

- LU Crout : lucr.a
- LU Left Looking : lull.a
- LU Sivan Toledo's recursive : lurec.a
- QR Left Looking : qrll.a
- Cholesky Right Looking : cholrl.a
- Cholesky Top : choltop.a

These variants are compiled by default in the build process but they are not
tested by default. Please refer to the README file in SRC/VARIANTS for more
information.
===============================

8. Interface changes
--------------------

[WARNING]
====
There are interface changes from LAPACK version 3.1 to 3.2 for the routines:
DSGESV, ZCGESV, DLASQ3, DLASQ4, SLASQ3, SLASQ4.

We have removed the routines: DLAZQ3, DLAZQ4, SLAZQ3, SLAZQ4.
====

9. More details
---------------

9.1. Extra Precise Iterative Refinement
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The matrix types supported in this release are

1. GE (general)
2. SY (symmetric)
3. PO (positive definite)
4. HE (Hermitian)
5. GB (general band)

in all the relevant precisions.

9.2. XBLAS, or portable ``extra precise BLAS''
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All material relevant to the XBLAS is located at http://www.netlib.org/xblas/.

9.3. Non-Negative Diagonals from Householder QR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributors
James W. Demmel, Mark Hoemmen, Yozo Hida, and E. Jason Riedy

.lapacker
E. Jason Riedy

.reference
James W. Demmel, Mark Hoemmen, Yozo Hida, and E. Jason Riedy. _Non-Negative
Diagonals and High Performance on Low-Profile Matrices from Householder QR_.
LAPACK Working Note 203, May 2008.

.contribution
The QR factorization routines now guarantee that the diagonal is both real and
non-negative. Factoring a uniformly random matrix now correctly generates an
orthogonal Q from the Haar distribution.
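
As an illustration of the Haar property (this sketch is not part of the
release, and the program name HAARQ is made up): fill a square matrix with
independent normal(0,1) entries using DLARNV and factor it with
DGEQRF/DORGQR. Because the diagonal of R is now guaranteed to be real and
non-negative, the explicit Q is drawn from the Haar distribution.

[source,fortran]
----
*     Illustrative sketch (not part of LAPACK): generate an N-by-N
*     orthogonal Q from the Haar distribution by QR-factoring a
*     matrix of independent normal(0,1) entries.
      PROGRAM HAARQ
      INTEGER            N, LWORK
      PARAMETER          ( N = 100, LWORK = 64*N )
      INTEGER            INFO, ISEED( 4 )
      DOUBLE PRECISION   A( N, N ), TAU( N ), WORK( LWORK )
      DATA               ISEED / 0, 0, 0, 1 /
*
*     Fill A with independent normal(0,1) random numbers.
      CALL DLARNV( 3, ISEED, N*N, A )
*
*     Householder QR; in LAPACK 3.2 the diagonal of R is real and
*     non-negative, which is what makes Q below Haar distributed.
      CALL DGEQRF( N, N, A, N, TAU, WORK, LWORK, INFO )
*
*     Overwrite A with the explicit orthogonal factor Q.
      CALL DORGQR( N, N, N, A, N, TAU, WORK, LWORK, INFO )
      END
----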

M SRC/clarf.f
M SRC/dlarf.f
M SRC/slarf.f
M SRC/zlarf.f
M SRC/clarfx.f
M SRC/dlarfx.f
M SRC/slarfx.f
M SRC/zlarfx.f
A SRC/ilaclc.f
A SRC/iladlc.f
A SRC/ilaslc.f
A SRC/ilazlc.f
A SRC/ilaclr.f
A SRC/iladlr.f
A SRC/ilaslr.f
A SRC/ilazlr.f

9.4. High Performance QR and Householder Reflections on Low-Profile Matrices
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributors
James W. Demmel, Mark Hoemmen, Yozo Hida, and E. Jason Riedy

.lapacker
E. Jason Riedy

.reference
James W. Demmel, Mark Hoemmen, Yozo Hida, and E. Jason Riedy. _Non-Negative
Diagonals and High Performance on Low-Profile Matrices from Householder QR_.
LAPACK Working Note 203, May 2008.

.contribution
The auxiliary routines that apply Householder reflections (e.g. DLARFB)
automatically reduce the cost of QR from O(n^3^) to O(n^2^) for matrices stored
in a dense format but with a ``narrow profile'' (including but not limited to
band matrices), with no user interface changes. Other users of these routines
can see similar benefits.

M SRC/clarfb.f
M SRC/dlarfb.f
M SRC/slarfb.f
M SRC/zlarfb.f
M SRC/clarft.f
M SRC/dlarft.f
M SRC/slarft.f
M SRC/zlarft.f

9.5. New fast and accurate Jacobi SVD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributors
Zlatko Drmac and Kresimir Veselic

.lapacker
Julien Langou

.references
Zlatko Drmac. _A global convergence proof of cyclic Jacobi methods with block
rotations_. LAPACK Working Note 196, December 2007.

Zlatko Drmac and Kresimir Veselic. _New fast and accurate Jacobi SVD algorithm:
I_. SIAM Journal on Matrix Analysis and Applications, 29(4):1322-1342, 2007.
(Also LAWN-169.)

Zlatko Drmac and Kresimir Veselic. _New fast and accurate Jacobi SVD algorithm:
II_. SIAM Journal on Matrix Analysis and Applications, 29(4):1343-1362, 2007.
(Also LAWN-170.)

.contribution
A high accuracy SVD routine for dense matrices, which can compute tiny singular
values to many more correct digits than `SGESVD` when the matrix has columns
differing widely in norm, and usually runs faster than `SGESVD` too. See
references [5,6,7].

A dgejsv.f
A sgejsv.f
A dgesvj.f
A sgesvj.f
A dgsvj0.f
A sgsvj0.f
A dgsvj1.f
A sgsvj1.f

*SGESVJ* implements one-sided Jacobi SVD with fast scaled rotations and
de Rijk's pivoting. The code works as specified in the theory of Demmel and
Veselic. It uses a special implementation of Jacobi rotations, so that it can
compute the singular values in the full range [underflow, overflow]. `SGESVJ`
can be used stand-alone, and it is also the kernel routine of the expert
driver routine `SGEJSV`.

[source,fortran]
----
      SUBROUTINE SGESVJ( JOBA, JOBU, JOBV, M, N, A, LDA, SVA,
     &                   MV, V, LDV, WORK, LWORK, INFO )
*
      INTEGER            INFO, LDA, LDV, LWORK, M, MV, N
      CHARACTER          JOBA, JOBU, JOBV
      REAL               A( LDA, N ), SVA( N ), V( LDV, N ),
     &                   WORK( LWORK )
*
*  Purpose
*  =======
*
*  SGESVJ computes the singular value decomposition (SVD) of a real
*  M-by-N matrix A, where M >= N. The SVD of A is written as
*
*                                   [++]   [xx]   [x0]   [xx]
*             A = U * SIGMA * V^t,  [++] = [xx] * [ox] * [xx]
*                                   [++]   [xx]
*
*  where SIGMA is an N-by-N diagonal matrix, U is an M-by-N orthonormal
*  matrix, and V is an N-by-N orthogonal matrix. The diagonal elements
*  of SIGMA are the singular values of A. The columns of U and V are the
*  left and the right singular vectors of A, respectively.
*
----

*SGEJSV* implements the preconditioned Jacobi SVD of Drmac and Veselic.
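
A minimal usage sketch for `SGESVJ` follows (not part of the release; the
wrapper name EXSVJ is made up). The option letters and the meaning of
`WORK( 1 )` on exit are taken from the routine's header documentation and
should be double-checked there, as should the minimum `LWORK`.

[source,fortran]
----
*     Minimal usage sketch (illustrative only): full SVD of a dense
*     M-by-N matrix, M >= N, with the one-sided Jacobi routine.
      SUBROUTINE EXSVJ( M, N, A, LDA, SVA, V, LDV, WORK, LWORK, INFO )
      INTEGER            INFO, LDA, LDV, LWORK, M, N
      REAL               A( LDA, N ), SVA( N ), V( LDV, N ),
     &                   WORK( LWORK )
*
*     JOBA = 'G': A is a general M-by-N matrix.
*     JOBU = 'U': the left singular vectors overwrite A.
*     JOBV = 'V': the right singular vectors are returned in V
*                 (the argument MV is not referenced in this case).
      CALL SGESVJ( 'G', 'U', 'V', M, N, A, LDA, SVA, 0, V, LDV,
     &             WORK, LWORK, INFO )
*
*     On exit the singular values are WORK( 1 )*SVA( 1:N ); the scale
*     factor WORK( 1 ) differs from one only when the exact values
*     would overflow or underflow.
      END
----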

This is the expert driver routine; it calls `SGESVJ` after certain
preconditioning. Since it uses many LAPACK and BLAS components that cannot
work over the full range of floating point numbers, `SGEJSV` imposes a
restriction on sigma_max/sigma_min that is similar to (or somewhat weaker
than) the one in the current LAPACK SVD routines, but it does not handle the
full range [underflow, overflow] that `SGESVJ` does.

[source,fortran]
----
      SUBROUTINE SGEJSV( JOBA, JOBU, JOBV, JOBR, JOBT, JOBP,
     &                   M, N, A, LDA, SVA, U, LDU, V, LDV,
     &                   WORK, LWORK, IWORK, INFO )
*
      INTEGER            INFO, LDA, LDU, LDV, LWORK, M, N
      INTEGER            IWORK( * )
      CHARACTER          JOBA, JOBP, JOBR, JOBT, JOBU, JOBV
      REAL               A( LDA, N ), SVA( N ), U( LDU, N ),
     &                   V( LDV, N ), WORK( LWORK )
*     ..
*
*  Purpose
*  ~~~~~~~
*
*  SGEJSV computes the singular value decomposition (SVD) of a real M-by-N
*  matrix [A], where M >= N. The SVD of [A] is written as
*
*               [A] = [U] * [SIGMA] * [V]^t,
*
*  where [SIGMA] is an N-by-N (M-by-N) matrix which is zero except for its N
*  diagonal elements, [U] is an M-by-N (or M-by-M) orthonormal matrix, and
*  [V] is an N-by-N orthogonal matrix. The diagonal elements of [SIGMA] are
*  the singular values of [A]. The columns of [U] and [V] are the left and
*  the right singular vectors of [A], respectively. The matrices [U] and [V]
*  are computed and stored in the arrays U and V, respectively. The diagonal
*  of [SIGMA] is computed and stored in the array SVA.
*
----

`SGSVJ0` and `SGSVJ1` are derived from `SGESVJ`; they are computational
routines called by `SGESVJ` and provide a way for `SGESVJ` to call itself
recursively on certain submatrices. Users of `SGESVJ` and `SGEJSV` need not
worry about them, as they are never used stand-alone.

[CAUTION]
This release only includes `SINGLE PRECISION` and `DOUBLE PRECISION` versions;
in other words, there is no complex support yet. Also, the input matrix must
satisfy `M.GE.N`.

9.6. Rectangular Full Packed format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributors
Fred Gustavson and Jerzy Wasniewski

.lapacker
Julien Langou

.reference
Fred G. Gustavson, Jerzy Wasniewski, and Jack J. Dongarra. _Rectangular Full
Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion_.
LAPACK Working Note 199, April 2008.

.rfp
RFP = Rectangular Full Packed.

Motivation for the RFP format: the current LAPACK full formats (PO, TR, HE)
waste half of the memory, while the current packed formats (PP, TP, HP) use
minimal storage but are inefficient because they cannot use the Level 3 BLAS.
In 2005, Gustavson introduced the rectangular full packed format (PF, TF, HF),
which uses only half of the full storage while maintaining efficiency through
Level 3 BLAS/LAPACK kernels. The idea is simple: to store a triangle, break it
into three parts, two triangles of half the original size and a square, and
pack them into a rectangle. Since the pieces are held in full storage (PO, TR,
HE), existing efficient routines can be used on them. There are some
subtleties and variants, but everything works out. The storage format
Gustavson proposed has 8 possible cases, depending on whether N is odd or
even, whether the matrix is upper or lower triangular, and whether the storage
is held in normal or conjugate-transposed form.

.storing a matrix in RFP format
1. The RFP format is not that hard to understand. We refer to the headers of
the routines, the paper referenced above, or an illustrative web page
[http://www-math.cudenver.edu/~langou/lapack-3.2/rfp.html[URL-3]].
2. The routines `CHFRK`, `ZHFRK`, `SSFRK` and `DSFRK` build matrices directly
in RFP format. Once the matrix is constructed, you can use it without
understanding the details of the format.
3. There are conversion routines: `xTPTTF` goes from standard packed to RFP
and `xTRTTF` goes from full to RFP. This approach is less attractive than (1)
or (2) since it requires storing two matrices at once.

.RFP nomenclature
[grid="all"]
`---`---------`------------------------------------------------
SF   xSFyy    Symmetric
HF   xHFyy    Hermitian
PF   xPFyy    Symmetric/Hermitian Positive Definite
TF   xTFyy    Triangular
---------------------------------------------------------------

.contribution
A SRC/chfrk.f
A SRC/dlansf.f
A SRC/slansf.f
A SRC/zhfrk.f
A SRC/clanhf.f
A SRC/dpftrf.f
A SRC/spftrf.f
A SRC/zlanhf.f
A SRC/cpftrf.f
A SRC/dpftri.f
A SRC/spftri.f
A SRC/zpftrf.f
A SRC/cpftri.f
A SRC/dpftrs.f
A SRC/spftrs.f
A SRC/zpftri.f
A SRC/cpftrs.f
A SRC/dsfrk.f
A SRC/ssfrk.f
A SRC/zpftrs.f
A SRC/ctfsm.f
A SRC/dtfsm.f
A SRC/stfsm.f
A SRC/ztfsm.f
A SRC/ctftri.f
A SRC/dtftri.f
A SRC/stftri.f
A SRC/ztftri.f
A SRC/ctfttp.f
A SRC/dtfttp.f
A SRC/stfttp.f
A SRC/ztfttp.f
A SRC/ctfttr.f
A SRC/dtfttr.f
A SRC/stfttr.f
A SRC/ztfttr.f
A SRC/ctpttf.f
A SRC/dtpttf.f
A SRC/stpttf.f
A SRC/ztpttf.f
A SRC/ctpttr.f
A SRC/dtpttr.f
A SRC/stpttr.f
A SRC/ztpttr.f
A SRC/ctrttf.f
A SRC/dtrttf.f
A SRC/strttf.f
A SRC/ztrttf.f
A SRC/ctrttp.f
A SRC/dtrttp.f
A SRC/strttp.f
A SRC/ztrttp.f

.functionality and interface
[grid="all"]
`------------------------------`----------`----------`----------`----------`-----------------------------------------------
Cholesky factorization          CPFTRF     DPFTRF     SPFTRF     ZPFTRF     (TRANSR,UPLO,N,A,INFO)
Multiple solve after xPFTRF     CPFTRS     DPFTRS     SPFTRS     ZPFTRS     (TRANSR,UPLO,N,NRHS,A,B,LDB,INFO)
Inversion after xPFTRF          CPFTRI     DPFTRI     SPFTRI     ZPFTRI     (TRANSR,UPLO,N,A,INFO)
Triangular inversion            CTFTRI     DTFTRI     STFTRI     ZTFTRI     (TRANSR,UPLO,DIAG,N,A,INFO)
Sym/Herm matrix norm            CLANHF     DLANSF     SLANSF     ZLANHF     (NORM,TRANSR,UPLO,N,A,WORK)
Triangular solve                CTFSM      DTFSM      STFSM      ZTFSM      (TRANSR,SIDE,UPLO,TRANS,DIAG,M,N,ALPHA,A,B,LDB)
Sym/Herm rank-k update          CHFRK      DSFRK      SSFRK      ZHFRK      (TRANSR,UPLO,TRANS,N,K,ALPHA,A,LDA,BETA,C)
Conv. from TP to TF             CTPTTF     DTPTTF     STPTTF     ZTPTTF     (TRANSR,UPLO,N,AP,ARF,INFO)
Conv. from TR to TF             CTRTTF     DTRTTF     STRTTF     ZTRTTF     (TRANSR,UPLO,N,A,LDA,ARF,INFO)
Conv. from TF to TP             CTFTTP     DTFTTP     STFTTP     ZTFTTP     (TRANSR,UPLO,N,ARF,AP,INFO)
Conv. from TF to TR             CTFTTR     DTFTTR     STFTTR     ZTFTTR     (TRANSR,UPLO,N,ARF,A,LDA,INFO)
---------------------------------------------------------------------------------------------------------------------------

.example of what one can do with this new release
1. ZHFRK( )  form the normal equations (and possibly update them)
2. ZPFTRF( ) Cholesky factorization of the normal equations
3. ZPFTRS( ) solve the normal equations

This performs the same task as ZHERK, ZPOTRF, ZPOTRS, but with half the
storage for the normal equations. A minimal code sketch of this sequence is
given at the end of this section.

[WARNING]
The current RFP routines can suffer `INTEGER` overflow, since they access
entries like `A( N*N )` and your compiler may not be smart enough to compute
`N*N` in wider than 32-bit `INTEGER` arithmetic. (Note that the same
restriction holds for the current LAPACK packed routines.) Compiling with
64-bit `INTEGER`s avoids the problem.

[URL-3] Illustration of the RFP format:
http://www-math.cudenver.edu/~langou/lapack-3.2/rfp.html.
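
Here is the minimal sketch of that sequence (not part of the release; the
wrapper name EXRFP is made up). The argument lists follow the table above;
the assumption that the scalars ALPHA and BETA of `ZHFRK` are real, as in
`ZHERK`, should be checked against the routine header.

[source,fortran]
----
*     Minimal sketch (illustrative only): solve the normal equations
*     A**H*A * X = B, with A M-by-N (M >= N) and the triangle of
*     A**H*A held in RFP format in N*(N+1)/2 complex numbers.
      SUBROUTINE EXRFP( M, N, NRHS, A, LDA, ARF, B, LDB, INFO )
      INTEGER            INFO, LDA, LDB, M, N, NRHS
      COMPLEX*16         A( LDA, N ), ARF( N*(N+1)/2 ),
     &                   B( LDB, NRHS )
*
*     Form the upper triangle of A**H*A in RFP storage: 'N' = normal
*     RFP storage, 'U' = upper triangle, 'C' = conjugate-transposed
*     product.  As in ZHERK, the scalars ALPHA and BETA are real.
      CALL ZHFRK( 'N', 'U', 'C', N, M, 1.0D+0, A, LDA, 0.0D+0, ARF )
*
*     Cholesky factorization of the RFP matrix.
      CALL ZPFTRF( 'N', 'U', N, ARF, INFO )
      IF( INFO.NE.0 ) RETURN
*
*     Solve for the NRHS right-hand sides; B is overwritten by X.
      CALL ZPFTRS( 'N', 'U', N, NRHS, ARF, B, LDB, INFO )
      END
----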

9.7. Pivoted Cholesky
~~~~~~~~~~~~~~~~~~~~~

.contributor
Craig Lucas

.lapacker
Jason Riedy

.reference
Craig Lucas. _LAPACK-Style Codes for Level 2 and 3 Pivoted Cholesky
Factorizations_. LAPACK Working Note 161, February 2004.

.contribution
The Cholesky factorization with diagonal pivoting for symmetric positive
semi-definite matrices. Pivoting is required for reliable rank detection.

A SRC/cpstf2.f
A SRC/dpstf2.f
A SRC/spstf2.f
A SRC/zpstf2.f
A SRC/cpstrf.f
A SRC/dpstrf.f
A SRC/spstrf.f
A SRC/zpstrf.f

[source,fortran]
----
      SUBROUTINE SPSTRF( UPLO, N, A, LDA, PIV, RANK, TOL, WORK, INFO )
*
      REAL               TOL
      INTEGER            INFO, LDA, N, RANK
      CHARACTER          UPLO
      REAL               A( LDA, N ), WORK( 2*N )
      INTEGER            PIV( N )
*
*  Purpose
*  =======
*
*  SPSTRF computes the Cholesky factorization with complete
*  pivoting of a real symmetric positive semidefinite matrix A.
*
*  The factorization has the form
*     P' * A * P = U' * U ,  if UPLO = 'U',
*     P' * A * P = L  * L',  if UPLO = 'L',
*  where U is an upper triangular matrix and L is lower triangular, and
*  P is stored as vector PIV.
*
*  This algorithm does not attempt to check that A is positive
*  semidefinite. This version of the algorithm calls level 3 BLAS.
*
----

9.8. Mixed precision iterative refinement subroutines for exploiting fast single precision hardware
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributor
Julie Langou

.lapacker
Julie Langou

.reference
Alfredo Buttari, Jack Dongarra, Julie Langou, Julien Langou, Piotr Luszczek,
and Jakub Kurzak. _Mixed Precision Iterative Refinement Techniques for the
Solution of Dense Linear Systems_. International Journal of High Performance
Computing Applications, 21(4):457-466, 2007.

.contribution
On platforms like the Cell processor that do single precision much faster than
double, linear systems can be solved many times faster. Even on commodity
processors there is a factor of 2 in speed between single and double
precision. The matrix types supported in this release are GE (general) and
PO (positive definite).

A SRC/dsposv.f
A SRC/zcposv.f

[source,fortran]
----
      SUBROUTINE ZCPOSV( UPLO, N, NRHS, A, LDA, B, LDB, X, LDX, WORK,
     +                   SWORK, RWORK, ITER, INFO )
*
      CHARACTER          UPLO
      INTEGER            INFO, ITER, LDA, LDB, LDX, N, NRHS
      DOUBLE PRECISION   RWORK( N*NRHS )
      COMPLEX            SWORK( N*(N+NRHS) )
      COMPLEX*16         A( LDA, N ), B( LDB, NRHS ), WORK( N, NRHS ),
     +                   X( LDX, NRHS )
*
*  Purpose
*  =======
*
*  ZCPOSV computes the solution to a complex system of linear equations
*     A * X = B,
*  where A is an N-by-N Hermitian positive definite matrix and X and B
*  are N-by-NRHS matrices.
*
*  ZCPOSV first attempts to factorize the matrix in COMPLEX and use this
*  factorization within an iterative refinement procedure to produce a
*  solution with COMPLEX*16 normwise backward error quality. If the
*  approach fails the method switches to a COMPLEX*16 factorization and
*  solve.
*
----
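
A minimal usage sketch for `ZCPOSV` follows (not part of the release; the
wrapper name EXMXP is made up). It mirrors the interface and workspace
dimensions quoted above and shows how `ITER` reports whether the mixed
precision approach succeeded.

[source,fortran]
----
*     Minimal usage sketch (illustrative only): solve A*X = B with the
*     factorization done in COMPLEX and refinement in COMPLEX*16.
*     Workspace dimensions follow the header quoted above.
      SUBROUTINE EXMXP( N, NRHS, A, LDA, B, LDB, X, LDX, WORK,
     +                  SWORK, RWORK, ITER, INFO )
      INTEGER            INFO, ITER, LDA, LDB, LDX, N, NRHS
      DOUBLE PRECISION   RWORK( N*NRHS )
      COMPLEX            SWORK( N*(N+NRHS) )
      COMPLEX*16         A( LDA, N ), B( LDB, NRHS ), WORK( N, NRHS ),
     +                   X( LDX, NRHS )
*
      CALL ZCPOSV( 'U', N, NRHS, A, LDA, B, LDB, X, LDX, WORK,
     +             SWORK, RWORK, ITER, INFO )
*
*     On exit, ITER < 0 means the refinement did not converge and
*     ZCPOSV fell back to a full COMPLEX*16 factorization and solve;
*     otherwise ITER returns the number of refinement iterations used
*     by the mixed precision approach.
      END
----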

9.9. Add some variants for the one-sided factorizations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributors
Peng Du and Jason Riedy

.lapackers
Julie Langou and Jason Riedy

.contribution
The LAPACK QR blocked factorization (xGEQRF) is right-looking; a left-looking
variant is added. The LAPACK Cholesky blocked factorization (xPOTRF) is
left-looking; right-looking and top-looking variants are added. The LAPACK LU
blocked factorization (xGETRF) is right-looking; left-looking, Crout and
recursive variants are added (in Fortran 77).

A SRC/VARIANTS/cholesky/RL/cpotrf.f
A SRC/VARIANTS/cholesky/RL/dpotrf.f
A SRC/VARIANTS/cholesky/RL/spotrf.f
A SRC/VARIANTS/cholesky/RL/zpotrf.f
A SRC/VARIANTS/cholesky/TOP/cpotrf.f
A SRC/VARIANTS/cholesky/TOP/dpotrf.f
A SRC/VARIANTS/cholesky/TOP/spotrf.f
A SRC/VARIANTS/cholesky/TOP/zpotrf.f
A SRC/VARIANTS/lu/CR/cgetrf.f
A SRC/VARIANTS/lu/CR/dgetrf.f
A SRC/VARIANTS/lu/CR/sgetrf.f
A SRC/VARIANTS/lu/CR/zgetrf.f
A SRC/VARIANTS/lu/LL/cgetrf.f
A SRC/VARIANTS/lu/LL/dgetrf.f
A SRC/VARIANTS/lu/LL/sgetrf.f
A SRC/VARIANTS/lu/LL/zgetrf.f
A SRC/VARIANTS/lu/REC/cgetrf.f
A SRC/VARIANTS/lu/REC/dgetrf.f
A SRC/VARIANTS/lu/REC/sgetrf.f
A SRC/VARIANTS/lu/REC/zgetrf.f
A SRC/VARIANTS/qr/LL/cgeqrf.f
A SRC/VARIANTS/qr/LL/dgeqrf.f
A SRC/VARIANTS/qr/LL/sceil.f
A SRC/VARIANTS/qr/LL/sgeqrf.f
A SRC/VARIANTS/qr/LL/zgeqrf.f

.integration in LAPACK
To install and test the variants:

--------------------------------
make variants
make variants_testing
--------------------------------

This should create the following libraries in `SRC/VARIANTS/LIB`:

--------------------------------
- LU Crout : `lucr.a`
- LU Left Looking : `lull.a`
- LU Sivan Toledo's recursive : `lurec.a`
- QR Left Looking : `qrll.a`
- Cholesky Right Looking : `cholrl.a`
- Cholesky Top : `choltop.a`
--------------------------------

Please refer to the README file in SRC/VARIANTS for more information.

9.10. Bug fixes for the bidiagonal SVD routine that fix some rare convergence failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributors
Osni Marques and Beresford Parlett

.lapackers
Osni Marques, Jim Demmel, and Julien Langou

.contribution
The version of dqds available in LAPACK-3.1.1 (released in February 2007) is
essentially the version of dqds available in LAPACK-3.0 (released in 1999).
The difference between the two versions is that the dqds in LAPACK-3.1.1 is
thread safe, thanks to modifications made by Sven Hammarling. Any shortcomings
of the dqds algorithm in LAPACK-3.0 are therefore still present in
LAPACK-3.1.1.

In LAPACK-3.2.0 we introduce the latest version of dqds, which Beresford
Parlett and Osni Marques produced in 2006. This version contains a number of
improvements introduced after 1999; in particular, the flipping mechanism has
been improved. The original motivation was a problem reported by one of
Inderjit Dhillon's students (Univ. of Texas at Austin). As a result, this new
version of dqds (let us call it dqds-3.2) passes Cleve Moler's matrix, while
dqds-3.0 and dqds-3.1 fail to converge on it within the maximum number of
iterations allowed. Cleve Moler's matrix is at
[http://www-math.cudenver.edu/\~langou/lapack-3.2/set_Moler_200.c[URL-1]].

The single precision version of dqds-3.2 has a problem in the LAPACK testing
suite with one of the matrices of type 16, while dqds-3.0 and dqds-3.1 were
fine. The matrix is at
[http://www-math.cudenver.edu/~langou/lapack-3.2/set_PB1_30.c[URL-2]].
To pass this matrix, we needed to set the `IEEE` LOGICAL variable in SLASQ2 to
`.FALSE.`. This is a temporary fix; we hope to have a better one in the future.

M SRC/dlasq1.f
M SRC/slasq1.f
M SRC/dlasq2.f
M SRC/slasq2.f
M SRC/dlasq3.f
M SRC/slasq3.f
M SRC/dlasq4.f
M SRC/slasq4.f
M SRC/dlasq5.f
M SRC/slasq5.f
M SRC/dlasq6.f
M SRC/slasq6.f
D SRC/dlazq3.f
D SRC/slazq3.f
D SRC/dlazq4.f
D SRC/slazq4.f

[CAUTION]
There are interface changes for `DLASQ3`, `DLASQ4`, `SLASQ3` and `SLASQ4`. We
decided not to add another set of routines and simply changed the interfaces.
+
We have removed the routines `DLAZQ3`, `DLAZQ4`, `SLAZQ3` and `SLAZQ4`. They
were introduced in 3.1 for thread safety but are no longer needed.

[URL-1] A nasty matrix from Cleve Moler:
http://www-math.cudenver.edu/\~langou/lapack-3.2/set_Moler_200.c.

[URL-2] A nasty matrix from the LAPACK testing suite:
http://www-math.cudenver.edu/~langou/lapack-3.2/set_PB1_30.c.

9.11. Better multishift Hessenberg QR algorithm with aggressive early deflation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributor
Ralph Byers

.lapacker
Edward Smyth

.reference
Ralph Byers. _LAPACK 3.1 xHSEQR: Tuning and Implementation Notes on the Small
Bulge Multi-shift QR Algorithm with Aggressive Early Deflation_. LAPACK
Working Note 187, May 2007.

.contribution
Most of the revisions fix typographical errors, but a few have a small effect
on how the program works. Even these are relatively minor revisions:

1. revised the choice of the size of the deflation window slightly, to make
the code a little more robust against convergence failures.
2. revised the section of code that tries to reintroduce bulges after they
have collapsed due to underflow. The new version is cleaner and more robust.
3. revised `xLAQR1` so that it does not assume that `H(2,1)` is real. A code
ought to do what it claims to do, and in the complex case this small
subroutine did not quite do it.

M SRC/chseqr.f
M SRC/dhseqr.f
M SRC/shseqr.f
M SRC/zhseqr.f
M SRC/clahqr.f
M SRC/dlahqr.f
M SRC/slahqr.f
M SRC/zlahqr.f
M SRC/claqr0.f
M SRC/dlaqr0.f
M SRC/slaqr0.f
M SRC/zlaqr0.f
M SRC/claqr1.f
M SRC/dlaqr2.f
M SRC/slaqr2.f
M SRC/zlaqr1.f
M SRC/claqr2.f
M SRC/dlaqr3.f
M SRC/slaqr3.f
M SRC/zlaqr2.f
M SRC/claqr3.f
M SRC/dlaqr4.f
M SRC/slaqr4.f
M SRC/zlaqr3.f
M SRC/claqr4.f
M SRC/dlaqr5.f
M SRC/slaqr5.f
M SRC/zlaqr4.f
M SRC/claqr5.f
M SRC/zlaqr5.f

10. Expected additions and improvements for the future
-------------------------------------------------------

.multishift QZ with aggressive early deflation
Bo Kagstrom and Daniel Kressner. _Multishift variants of the QZ algorithm with
aggressive early deflation_. SIAM J. Matrix Anal. Appl., 29(1):199-227, 2006.

.block reordering algorithm
Daniel Kressner. _Block algorithms for reordering standard and generalized
Schur forms_. ACM Trans. Math. Software, 32(4):521-532, 2006.

.accurate and efficient Givens rotations
http://www.cs.berkeley.edu/~demmel/Givens/

David Bindel, James Demmel, William Kahan, and Osni Marques. _On computing
Givens rotations reliably and efficiently_. ACM Transactions on Mathematical
Software (TOMS), 28(2):206-238, 2002.

.blas 2.5
Gary W. Howell, James Demmel, Charles T. Fulton, Sven Hammarling, and Karen
Marmol. _Cache efficient bidiagonalization using BLAS 2.5 operators_. ACM
Transactions on Mathematical Software (TOMS), 34(3), 2008.

.level 3 BLAS dsytrs, ssytrs, zhetrs, chetrs, dsytri, ssytri, zhetri, chetri
Upgrade dsytrs, ssytrs, zhetrs, chetrs, dsytri, ssytri, zhetri and chetri to
use Level 3 BLAS when called with multiple right-hand sides (feedback from
the MathWorks).

.support more matrix types for extra precise iterative refinement
Matrix types SB (symmetric band), PB (positive definite band), HB (Hermitian
band), and packed storage. Tridiagonal types such as GT (general tridiagonal)
are also on the wish list, but first we need to derive adequate test cases.

.make xLARFB thread friendly
Take into account Robert van de Geijn's (Univ. of Texas at Austin) comments on
the interface of xLARFB.
The interface labels V as input/output, which is not at all convenient for
multithreaded implementations.

.laundry list
1. Change the default Cholesky factorization in SRC from left-looking to
right-looking, and move the left-looking version to the VARIANTS directory.
2. Add some recursive variants for QR and Cholesky.
3. Remove the `IEEE = .FALSE.` workaround in DQDS (Osni's DLASQ3 and SLASQ3).
4. Remove the possibility of `INTEGER` overflow in the RFP routines.