LAPACK 3.2
==========
Release date: Sun 11/16/2008.

This material is based upon work supported by the National Science Foundation
and the Department of Energy under Grant Nos. NSF-CCF-00444486, NSF-CNS 0325873,
NSF-EIA 0122599, NSF-ACI-0090127, DOE-DE-FC02-01ER25478, DOE-DE-FC02-06ER25768.

LAPACK is a software package provided by Univ. of Tennessee, Univ. of California
Berkeley, Univ. of Colorado Denver and NAG Ltd.

1. Support and questions
------------------------
http://icl.cs.utk.edu/lapack-forum/

2. LAPACK 3.2: What's new
-------------------------

1. *Extra Precise Iterative Refinement:* New linear solvers that ``guarantee''
fully accurate answers (or give a warning that the answer cannot be trusted).
The matrix types supported in this release are GE (general), SY (symmetric),
PO (positive definite), HE (Hermitian), and GB (general band), in all the
relevant precisions. See reference [3] below.

2. *XBLAS, or portable ``extra precise BLAS'':* Our new linear solvers in (1)
depend on these to perform iterative refinement. See reference [3] below.
The XBLAS will be released in a separate package. See ``More Details''.

3. *Non-Negative Diagonals from Householder QR:* The QR factorization routines
now guarantee that the diagonal is both real and non-negative. Factoring a
uniformly random matrix now correctly generates an orthogonal Q from the Haar
distribution. See reference [4] below.

4. *High Performance QR and Householder Reflections on Low-Profile Matrices:*
The auxiliary routines that apply Householder reflections (e.g. DLARFB)
automatically reduce the cost of QR from O(n^3^) to O(n^2^) for matrices stored
in a dense format but with a ``narrow profile'' (including but not limited to
band matrices), with no user interface changes. Other users of these routines
can see similar benefits. See reference [4] below.

5. *New fast and accurate Jacobi SVD:* A high accuracy SVD routine for dense
matrices, which can compute tiny singular values to many more correct digits
than xGESVD when the matrix has columns differing widely in norm, and usually
runs faster than xGESVD too. See references [5,6,7] below.

6. *Routines for Rectangular Full Packed format:* The RFP format (SF, HF, PF,
TF) enables efficient routines with optimal storage for symmetric, Hermitian
or triangular matrices. Since these routines use the Level 3 BLAS, they are
generally much more efficient than the existing packed storage routines
(SP, HP, PP, TP). See reference [8] below.

7. *Pivoted Cholesky:* The Cholesky factorization with diagonal pivoting for
symmetric positive semi-definite matrices. Pivoting is required for reliable
rank detection. See reference [9] below.

8. *Mixed precision iterative refinement routines for exploiting fast single
precision hardware:* On platforms like the Cell processor that do single
precision much faster than double, linear systems can be solved many times
faster. Even on commodity processors there is a factor of 2 in speed between
single and double precision. The matrix types supported in this release are
GE (general) and PO (positive definite). See reference [1] below.

9. *Some new variants added for the one-sided factorizations:* LU gets
Right-Looking, Left-Looking, Crout and Recursive; QR gets Right-Looking and
Left-Looking; Cholesky gets Left-Looking, Right-Looking and Top-Looking.
Depending on the computer architecture (or the speed of the underlying BLAS),
one of these variants may be faster than the original LAPACK implementation.

10. *More robust DQDS algorithm:* Fixed some rare convergence failures in the
bidiagonal DQDS SVD routine.

11. *Better documentation for the multishift Hessenberg QR algorithm with
aggressive early deflation, and various improvements of the code.* See
reference [2] below.

3. References
-------------

[1]:: Alfredo Buttari, Jack Dongarra, Julie Langou, Julien Langou, Piotr
Luszczek, and Jakub Kurzak. _Mixed Precision Iterative Refinement Techniques
for the Solution of Dense Linear Systems_. International Journal of High
Performance Computing Applications, 21(4):457-466, 2007.

[2]:: Ralph Byers. _LAPACK 3.1 xHSEQR: Tuning and Implementation Notes on the
Small Bulge Multi-shift QR Algorithm with Aggressive Early Deflation_. LAPACK
Working Note 187, May 2007.

[3]:: James Demmel, Yozo Hida, William Kahan, Xiaoye S. Li, Sonil Mukherjee,
and E. Jason Riedy. _Error Bounds from Extra Precise Iterative Refinement_.
ACM Transactions on Mathematical Software (TOMS), 32(2):325-351, 2006.
(Also LAWN-165.)

[4]:: James W. Demmel, Mark Hoemmen, Yozo Hida, and E. Jason Riedy.
_Non-Negative Diagonals and High Performance on Low-Profile Matrices from
Householder QR_. LAPACK Working Note 203, May 2008.

[5]:: Zlatko Drmac. _A global convergence proof of cyclic Jacobi methods with
block rotations_. LAPACK Working Note 196, December 2007.

[6]:: Zlatko Drmac and Kresimir Veselic. _New fast and accurate Jacobi SVD
algorithm: I_. SIAM Journal on Matrix Analysis and Applications,
29(4):1322-1342, 2007. (Also LAWN-169.)

[7]:: Zlatko Drmac and Kresimir Veselic. _New fast and accurate Jacobi SVD
algorithm: II_. SIAM Journal on Matrix Analysis and Applications,
29(4):1343-1362, 2007. (Also LAWN-170.)

[8]:: Fred G. Gustavson, Jerzy Wasniewski, and Jack J. Dongarra. _Rectangular
Full Packed Format for Cholesky's Algorithm: Factorization, Solution and
Inversion_. LAPACK Working Note 199, April 2008.

[9]:: Craig Lucas. _LAPACK-Style Codes for Level 2 and 3 Pivoted Cholesky
Factorizations_. LAPACK Working Note 161, February 2004.

4. External Contributors
------------------------

* Ralph Byers (University of Kansas, USA)
* Zlatko Drmac (University of Zagreb, Croatia)
* Peng Du (University of Tennessee, Knoxville, USA)
* Fred Gustavson (IBM Watson Research Center, NY, USA)
* Craig Lucas (University of Manchester / NAG Ltd., UK)
* Kresimir Veselic (Fernuniversitaet Hagen, Hagen, Germany)
* Jerzy Wasniewski (Technical University of Denmark, Lyngby, Copenhagen, Denmark)

5. Thanks
---------

Thanks for bug reports, patches and suggestions to: Patrick Alken (University
of Colorado at Boulder, USA), Penny Anderson, Bobby Cheng, Cleve Moler, Duncan
Po, and Pat Quillen (MathWorks, MA, USA), Michael Baudin (Scilab), Michael
Chuvelev (Intel), Phil DeMier (IBM), Michel Devel (UTINAM institute, University
of Franche-Comte, UMR CNRSA), Alan Edelman (Massachusetts Institute of
Technology, MA, USA), Carlo de Falco and all the Octave developers, Fernando
Guevara (University of Utah, UT, USA), Joao, Christian Keil, Zbigniew Leyk
(Wolfram), ``LawrenceMulholland'' (NAG Ltd), Mick Pont (NAG), Clint Whaley
(University of Texas at San Antonio, TX, USA), Mikhail Wolfson (MIT), and
Vittorio Zecca.

6. Developer list
-----------------

.Principal Investigators
* Jim Demmel (University of California at Berkeley, USA)
* Jack Dongarra (University of Tennessee and ORNL, USA)

.LAPACK developers involved in this release
* Deaglan Halligan (University of California at Berkeley, USA)
* Sven Hammarling (NAG Ltd., UK)
* Yozo Hida (University of California at Berkeley, USA)
* Daniel Kressner (ETH Zurich, Switzerland)
* Julie Langou (University of Tennessee, USA)
* Julien Langou (University of Colorado Denver, USA)
* Osni Marques (Lawrence Berkeley Laboratory, USA)
* E. Jason Riedy (University of California at Berkeley, USA)
* Edward Smyth (NAG Ltd., UK)

.XBLAS developers involved in this release
* David Bailey (Lawrence Berkeley Laboratory, USA)
* Deaglan Halligan (University of California at Berkeley, USA)
* Greg Henry (Intel)
* Yozo Hida (University of California at Berkeley, USA)
* Jimmy Iskandar (University of California at Berkeley, USA)
* William Kahan (University of California at Berkeley, USA)
* Anil Kapur (University of California at Berkeley, USA)
* Suh Y. Kang (University of California at Berkeley, USA)
* Xiaoye Li (Lawrence Berkeley Laboratory, USA)
* Soni Mukherjee (University of California at Berkeley, USA)
* Jason Riedy (University of California at Berkeley, USA)
* Michael Martin (University of California at Berkeley, USA)
* Brandon Thompson (University of California at Berkeley, USA)
* Teresa Tung (University of California at Berkeley, USA)
* Daniel Yoo (University of California at Berkeley, USA)

7. Install Procedure
--------------------

IMPORTANT: *A Fortran 90 compiler is required.* The code now uses Fortran 90
features such as the MAXLOC intrinsic. For the installation procedure, refer
to LAWN 81 (to be updated for this release).

[IMPORTANT]
===============================
*XBLAS and iterative refinement integration*

Building the extra precise iterative refinement routines of (1) against the
XBLAS requires a C compiler (the XBLAS are distributed as a separate package;
see ``More Details'' below).
===============================

[CAUTION]
===============================
*VARIANTS integration*

  make variants
  make variants_testing

The corresponding libraries are created in SRC/VARIANTS/LIB:

- LU Crout : lucr.a
- LU Left Looking : lull.a
- LU Sivan Toledo's recursive : lurec.a
- QR Left Looking : qrll.a
- Cholesky Right Looking : cholrl.a
- Cholesky Top : choltop.a

These variants are compiled by default in the build process but they are not
tested by default. Please refer to the README file in SRC/VARIANTS for more
information.
===============================

8. Interface changes
--------------------

[WARNING]
====
There are interface changes from LAPACK version 3.1 to 3.2 for the routines:
DSGESV, ZCGESV, DLASQ3, DLASQ4, SLASQ3, SLASQ4.

We have removed the routines: DLAZQ3, DLAZQ4, SLAZQ3, SLAZQ4.
====

9. More details
---------------

9.1. Extra Precise Iterative Refinement
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The matrix types supported in this release are

1. GE (general)
2. SY (symmetric)
3. PO (positive definite)
4. HE (Hermitian)
5. GB (general band)

in all the relevant precisions.

9.2. XBLAS, or portable ``extra precise BLAS''
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All material relevant to the XBLAS is located at http://www.netlib.org/xblas/.

9.3. Non-Negative Diagonals from Householder QR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributors
James W. Demmel, Mark Hoemmen, Yozo Hida, and E. Jason Riedy

.lapacker
E. Jason Riedy

.reference
James W. Demmel, Mark Hoemmen, Yozo Hida, and E. Jason Riedy. _Non-Negative
Diagonals and High Performance on Low-Profile Matrices from Householder QR_.
LAPACK Working Note 203, May 2008.

.contribution
The QR factorization routines now guarantee that the diagonal is both real and
non-negative. Factoring a uniformly random matrix now correctly generates an
orthogonal Q from the Haar distribution.
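
As an illustration of the Haar property (this sketch is not part of the
release, and the program name HAARQ is made up): fill a square matrix with
independent normal(0,1) entries using DLARNV and factor it with
DGEQRF/DORGQR. Because the diagonal of R is now guaranteed to be real and
non-negative, the explicit Q is drawn from the Haar distribution.

[source,fortran]
----
*     Illustrative sketch (not part of LAPACK): generate an N-by-N
*     orthogonal Q from the Haar distribution by QR-factoring a
*     matrix of independent normal(0,1) entries.
      PROGRAM HAARQ
      INTEGER            N, LWORK
      PARAMETER          ( N = 100, LWORK = 64*N )
      INTEGER            INFO, ISEED( 4 )
      DOUBLE PRECISION   A( N, N ), TAU( N ), WORK( LWORK )
      DATA               ISEED / 0, 0, 0, 1 /
*
*     Fill A with independent normal(0,1) random numbers.
      CALL DLARNV( 3, ISEED, N*N, A )
*
*     Householder QR; in LAPACK 3.2 the diagonal of R is real and
*     non-negative, which is what makes Q below Haar distributed.
      CALL DGEQRF( N, N, A, N, TAU, WORK, LWORK, INFO )
*
*     Overwrite A with the explicit orthogonal factor Q.
      CALL DORGQR( N, N, N, A, N, TAU, WORK, LWORK, INFO )
      END
----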

M SRC/clarf.f
M SRC/dlarf.f
M SRC/slarf.f
M SRC/zlarf.f
M SRC/clarfx.f
M SRC/dlarfx.f
M SRC/slarfx.f
M SRC/zlarfx.f
A SRC/ilaclc.f
A SRC/iladlc.f
A SRC/ilaslc.f
A SRC/ilazlc.f
A SRC/ilaclr.f
A SRC/iladlr.f
A SRC/ilaslr.f
A SRC/ilazlr.f

9.4. High Performance QR and Householder Reflections on Low-Profile Matrices
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributors
James W. Demmel, Mark Hoemmen, Yozo Hida, and E. Jason Riedy

.lapacker
E. Jason Riedy

.reference
James W. Demmel, Mark Hoemmen, Yozo Hida, and E. Jason Riedy. _Non-Negative
Diagonals and High Performance on Low-Profile Matrices from Householder QR_.
LAPACK Working Note 203, May 2008.

.contribution
The auxiliary routines that apply Householder reflections (e.g. DLARFB)
automatically reduce the cost of QR from O(n^3^) to O(n^2^) for matrices stored
in a dense format but with a ``narrow profile'' (including but not limited to
band matrices), with no user interface changes. Other users of these routines
can see similar benefits.

M SRC/clarfb.f
M SRC/dlarfb.f
M SRC/slarfb.f
M SRC/zlarfb.f
M SRC/clarft.f
M SRC/dlarft.f
M SRC/slarft.f
M SRC/zlarft.f

9.5. New fast and accurate Jacobi SVD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributors
Zlatko Drmac and Kresimir Veselic

.lapacker
Julien Langou

.references
Zlatko Drmac. _A global convergence proof of cyclic Jacobi methods with block
rotations_. LAPACK Working Note 196, December 2007.

Zlatko Drmac and Kresimir Veselic. _New fast and accurate Jacobi SVD algorithm:
I_. SIAM Journal on Matrix Analysis and Applications, 29(4):1322-1342, 2007.
(Also LAWN-169.)

Zlatko Drmac and Kresimir Veselic. _New fast and accurate Jacobi SVD algorithm:
II_. SIAM Journal on Matrix Analysis and Applications, 29(4):1343-1362, 2007.
(Also LAWN-170.)

.contribution
A high accuracy SVD routine for dense matrices, which can compute tiny singular
values to many more correct digits than `SGESVD` when the matrix has columns
differing widely in norm, and usually runs faster than `SGESVD` too. See
references [5,6,7].

A dgejsv.f
A sgejsv.f
A dgesvj.f
A sgesvj.f
A dgsvj0.f
A sgsvj0.f
A dgsvj1.f
A sgsvj1.f

*SGESVJ* implements one-sided Jacobi SVD with fast scaled rotations and
de Rijk's pivoting. The code works as specified in the theory of Demmel and
Veselic. It uses a special implementation of Jacobi rotations, so that it can
compute the singular values in the full range [underflow, overflow]. `SGESVJ`
can be used stand-alone, and it is also the kernel routine of the expert
driver routine `SGEJSV`.

[source,fortran]
----
      SUBROUTINE SGESVJ( JOBA, JOBU, JOBV, M, N, A, LDA, SVA,
     &                   MV, V, LDV, WORK, LWORK, INFO )
*
      INTEGER            INFO, LDA, LDV, LWORK, M, MV, N
      CHARACTER          JOBA, JOBU, JOBV
      REAL               A( LDA, N ), SVA( N ), V( LDV, N ),
     &                   WORK( LWORK )
*
*  Purpose
*  =======
*
*  SGESVJ computes the singular value decomposition (SVD) of a real
*  M-by-N matrix A, where M >= N. The SVD of A is written as
*
*                                   [++]   [xx]   [x0]   [xx]
*             A = U * SIGMA * V^t,  [++] = [xx] * [ox] * [xx]
*                                   [++]   [xx]
*
*  where SIGMA is an N-by-N diagonal matrix, U is an M-by-N orthonormal
*  matrix, and V is an N-by-N orthogonal matrix. The diagonal elements
*  of SIGMA are the singular values of A. The columns of U and V are the
*  left and the right singular vectors of A, respectively.
*
----

*SGEJSV* implements the preconditioned Jacobi SVD of Drmac and Veselic.
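
A minimal usage sketch for `SGESVJ` follows (not part of the release; the
wrapper name EXSVJ is made up). The option letters and the meaning of
`WORK( 1 )` on exit are taken from the routine's header documentation and
should be double-checked there, as should the minimum `LWORK`.

[source,fortran]
----
*     Minimal usage sketch (illustrative only): full SVD of a dense
*     M-by-N matrix, M >= N, with the one-sided Jacobi routine.
      SUBROUTINE EXSVJ( M, N, A, LDA, SVA, V, LDV, WORK, LWORK, INFO )
      INTEGER            INFO, LDA, LDV, LWORK, M, N
      REAL               A( LDA, N ), SVA( N ), V( LDV, N ),
     &                   WORK( LWORK )
*
*     JOBA = 'G': A is a general M-by-N matrix.
*     JOBU = 'U': the left singular vectors overwrite A.
*     JOBV = 'V': the right singular vectors are returned in V
*                 (the argument MV is not referenced in this case).
      CALL SGESVJ( 'G', 'U', 'V', M, N, A, LDA, SVA, 0, V, LDV,
     &             WORK, LWORK, INFO )
*
*     On exit the singular values are WORK( 1 )*SVA( 1:N ); the scale
*     factor WORK( 1 ) differs from one only when the exact values
*     would overflow or underflow.
      END
----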

This is the expert driver routine; it calls `SGESVJ` after certain
preconditioning. Since it uses many LAPACK and BLAS components that cannot
work over the full range of floating point numbers, `SGEJSV` imposes a
restriction on sigma_max/sigma_min that is similar to (or somewhat weaker
than) the one in the current LAPACK SVD routines, but it does not handle the
full range [underflow, overflow] that `SGESVJ` does.

[source,fortran]
----
      SUBROUTINE SGEJSV( JOBA, JOBU, JOBV, JOBR, JOBT, JOBP,
     &                   M, N, A, LDA, SVA, U, LDU, V, LDV,
     &                   WORK, LWORK, IWORK, INFO )
*
      INTEGER            INFO, LDA, LDU, LDV, LWORK, M, N
      INTEGER            IWORK( * )
      CHARACTER          JOBA, JOBP, JOBR, JOBT, JOBU, JOBV
      REAL               A( LDA, N ), SVA( N ), U( LDU, N ),
     &                   V( LDV, N ), WORK( LWORK )
*     ..
*
*  Purpose
*  ~~~~~~~
*
*  SGEJSV computes the singular value decomposition (SVD) of a real M-by-N
*  matrix [A], where M >= N. The SVD of [A] is written as
*
*               [A] = [U] * [SIGMA] * [V]^t,
*
*  where [SIGMA] is an N-by-N (M-by-N) matrix which is zero except for its N
*  diagonal elements, [U] is an M-by-N (or M-by-M) orthonormal matrix, and
*  [V] is an N-by-N orthogonal matrix. The diagonal elements of [SIGMA] are
*  the singular values of [A]. The columns of [U] and [V] are the left and
*  the right singular vectors of [A], respectively. The matrices [U] and [V]
*  are computed and stored in the arrays U and V, respectively. The diagonal
*  of [SIGMA] is computed and stored in the array SVA.
*
----

`SGSVJ0` and `SGSVJ1` are derived from `SGESVJ`; they are computational
routines called by `SGESVJ` and provide a way for `SGESVJ` to call itself
recursively on certain submatrices. Users of `SGESVJ` and `SGEJSV` need not
worry about them, as they are never used stand-alone.

[CAUTION]
This release only includes `SINGLE PRECISION` and `DOUBLE PRECISION` versions;
in other words, there is no complex support yet. Also, the input matrix must
satisfy `M.GE.N`.

9.6. Rectangular Full Packed format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributors
Fred Gustavson and Jerzy Wasniewski

.lapacker
Julien Langou

.reference
Fred G. Gustavson, Jerzy Wasniewski, and Jack J. Dongarra. _Rectangular Full
Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion_.
LAPACK Working Note 199, April 2008.

.rfp
RFP = Rectangular Full Packed.

Motivation for the RFP format: the current LAPACK full formats (PO, TR, HE)
waste half of the memory, while the current packed formats (PP, TP, HP) use
minimal storage but are inefficient because they cannot use the Level 3 BLAS.
In 2005, Gustavson introduced the rectangular full packed format (PF, TF, HF),
which uses only half of the full storage while maintaining efficiency through
Level 3 BLAS/LAPACK kernels. The idea is simple: to store a triangle, break it
into three parts, two triangles of half the original size and a square, and
pack them into a rectangle. Since the pieces are held in full storage (PO, TR,
HE), existing efficient routines can be used on them. There are some
subtleties and variants, but everything works out. The storage format
Gustavson proposed has 8 possible cases, depending on whether N is odd or
even, whether the matrix is upper or lower triangular, and whether the storage
is held in normal or conjugate-transposed form.

.storing a matrix in RFP format
1. The RFP format is not that hard to understand. We refer to the headers of
the routines, the paper referenced above, or an illustrative web page
[http://www-math.cudenver.edu/~langou/lapack-3.2/rfp.html[URL-3]].
2. The routines `CHFRK`, `ZHFRK`, `SSFRK` and `DSFRK` build matrices directly
in RFP format. Once the matrix is constructed, you can use it without
understanding the details of the format.
3. There are conversion routines: `xTPTTF` goes from standard packed to RFP
and `xTRTTF` goes from full to RFP. This approach is less attractive than (1)
or (2) since it requires storing two matrices at once.

.RFP nomenclature
[grid="all"]
`---`---------`------------------------------------------------
SF   xSFyy    Symmetric
HF   xHFyy    Hermitian
PF   xPFyy    Symmetric/Hermitian Positive Definite
TF   xTFyy    Triangular
---------------------------------------------------------------

.contribution
A SRC/chfrk.f
A SRC/dlansf.f
A SRC/slansf.f
A SRC/zhfrk.f
A SRC/clanhf.f
A SRC/dpftrf.f
A SRC/spftrf.f
A SRC/zlanhf.f
A SRC/cpftrf.f
A SRC/dpftri.f
A SRC/spftri.f
A SRC/zpftrf.f
A SRC/cpftri.f
A SRC/dpftrs.f
A SRC/spftrs.f
A SRC/zpftri.f
A SRC/cpftrs.f
A SRC/dsfrk.f
A SRC/ssfrk.f
A SRC/zpftrs.f
A SRC/ctfsm.f
A SRC/dtfsm.f
A SRC/stfsm.f
A SRC/ztfsm.f
A SRC/ctftri.f
A SRC/dtftri.f
A SRC/stftri.f
A SRC/ztftri.f
A SRC/ctfttp.f
A SRC/dtfttp.f
A SRC/stfttp.f
A SRC/ztfttp.f
A SRC/ctfttr.f
A SRC/dtfttr.f
A SRC/stfttr.f
A SRC/ztfttr.f
A SRC/ctpttf.f
A SRC/dtpttf.f
A SRC/stpttf.f
A SRC/ztpttf.f
A SRC/ctpttr.f
A SRC/dtpttr.f
A SRC/stpttr.f
A SRC/ztpttr.f
A SRC/ctrttf.f
A SRC/dtrttf.f
A SRC/strttf.f
A SRC/ztrttf.f
A SRC/ctrttp.f
A SRC/dtrttp.f
A SRC/strttp.f
A SRC/ztrttp.f

.functionality and interface
[grid="all"]
`------------------------------`----------`----------`----------`----------`-----------------------------------------------
Cholesky factorization          CPFTRF     DPFTRF     SPFTRF     ZPFTRF     (TRANSR,UPLO,N,A,INFO)
Multiple solve after xPFTRF     CPFTRS     DPFTRS     SPFTRS     ZPFTRS     (TRANSR,UPLO,N,NRHS,A,B,LDB,INFO)
Inversion after xPFTRF          CPFTRI     DPFTRI     SPFTRI     ZPFTRI     (TRANSR,UPLO,N,A,INFO)
Triangular inversion            CTFTRI     DTFTRI     STFTRI     ZTFTRI     (TRANSR,UPLO,DIAG,N,A,INFO)
Sym/Herm matrix norm            CLANHF     DLANSF     SLANSF     ZLANHF     (NORM,TRANSR,UPLO,N,A,WORK)
Triangular solve                CTFSM      DTFSM      STFSM      ZTFSM      (TRANSR,SIDE,UPLO,TRANS,DIAG,M,N,ALPHA,A,B,LDB)
Sym/Herm rank-k update          CHFRK      DSFRK      SSFRK      ZHFRK      (TRANSR,UPLO,TRANS,N,K,ALPHA,A,LDA,BETA,C)
Conv. from TP to TF             CTPTTF     DTPTTF     STPTTF     ZTPTTF     (TRANSR,UPLO,N,AP,ARF,INFO)
Conv. from TR to TF             CTRTTF     DTRTTF     STRTTF     ZTRTTF     (TRANSR,UPLO,N,A,LDA,ARF,INFO)
Conv. from TF to TP             CTFTTP     DTFTTP     STFTTP     ZTFTTP     (TRANSR,UPLO,N,ARF,AP,INFO)
Conv. from TF to TR             CTFTTR     DTFTTR     STFTTR     ZTFTTR     (TRANSR,UPLO,N,ARF,A,LDA,INFO)
---------------------------------------------------------------------------------------------------------------------------

.example of what one can do with this new release
1. ZHFRK( )  form the normal equations (and possibly update them)
2. ZPFTRF( ) Cholesky factorization of the normal equations
3. ZPFTRS( ) solve the normal equations

This performs the same task as ZHERK, ZPOTRF, ZPOTRS, but with half the
storage for the normal equations. A minimal code sketch of this sequence is
given at the end of this section.

[WARNING]
The current RFP routines can suffer `INTEGER` overflow, since they access
entries like `A( N*N )` and your compiler may not be smart enough to compute
`N*N` in wider than 32-bit `INTEGER` arithmetic. (Note that the same
restriction holds for the current LAPACK packed routines.) Compiling with
64-bit `INTEGER`s avoids the problem.

[URL-3] Illustration of the RFP format:
http://www-math.cudenver.edu/~langou/lapack-3.2/rfp.html.
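
Here is the minimal sketch of that sequence (not part of the release; the
wrapper name EXRFP is made up). The argument lists follow the table above;
the assumption that the scalars ALPHA and BETA of `ZHFRK` are real, as in
`ZHERK`, should be checked against the routine header.

[source,fortran]
----
*     Minimal sketch (illustrative only): solve the normal equations
*     A**H*A * X = B, with A M-by-N (M >= N) and the triangle of
*     A**H*A held in RFP format in N*(N+1)/2 complex numbers.
      SUBROUTINE EXRFP( M, N, NRHS, A, LDA, ARF, B, LDB, INFO )
      INTEGER            INFO, LDA, LDB, M, N, NRHS
      COMPLEX*16         A( LDA, N ), ARF( N*(N+1)/2 ),
     &                   B( LDB, NRHS )
*
*     Form the upper triangle of A**H*A in RFP storage: 'N' = normal
*     RFP storage, 'U' = upper triangle, 'C' = conjugate-transposed
*     product.  As in ZHERK, the scalars ALPHA and BETA are real.
      CALL ZHFRK( 'N', 'U', 'C', N, M, 1.0D+0, A, LDA, 0.0D+0, ARF )
*
*     Cholesky factorization of the RFP matrix.
      CALL ZPFTRF( 'N', 'U', N, ARF, INFO )
      IF( INFO.NE.0 ) RETURN
*
*     Solve for the NRHS right-hand sides; B is overwritten by X.
      CALL ZPFTRS( 'N', 'U', N, NRHS, ARF, B, LDB, INFO )
      END
----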

9.7. Pivoted Cholesky
~~~~~~~~~~~~~~~~~~~~~

.contributor
Craig Lucas

.lapacker
Jason Riedy

.reference
Craig Lucas. _LAPACK-Style Codes for Level 2 and 3 Pivoted Cholesky
Factorizations_. LAPACK Working Note 161, February 2004.

.contribution
The Cholesky factorization with diagonal pivoting for symmetric positive
semi-definite matrices. Pivoting is required for reliable rank detection.

A SRC/cpstf2.f
A SRC/dpstf2.f
A SRC/spstf2.f
A SRC/zpstf2.f
A SRC/cpstrf.f
A SRC/dpstrf.f
A SRC/spstrf.f
A SRC/zpstrf.f

[source,fortran]
----
      SUBROUTINE SPSTRF( UPLO, N, A, LDA, PIV, RANK, TOL, WORK, INFO )
*
      REAL               TOL
      INTEGER            INFO, LDA, N, RANK
      CHARACTER          UPLO
      REAL               A( LDA, N ), WORK( 2*N )
      INTEGER            PIV( N )
*
*  Purpose
*  =======
*
*  SPSTRF computes the Cholesky factorization with complete
*  pivoting of a real symmetric positive semidefinite matrix A.
*
*  The factorization has the form
*     P' * A * P = U' * U ,  if UPLO = 'U',
*     P' * A * P = L  * L',  if UPLO = 'L',
*  where U is an upper triangular matrix and L is lower triangular, and
*  P is stored as vector PIV.
*
*  This algorithm does not attempt to check that A is positive
*  semidefinite. This version of the algorithm calls level 3 BLAS.
*
----

9.8. Mixed precision iterative refinement subroutines for exploiting fast single precision hardware
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributor
Julie Langou

.lapacker
Julie Langou

.reference
Alfredo Buttari, Jack Dongarra, Julie Langou, Julien Langou, Piotr Luszczek,
and Jakub Kurzak. _Mixed Precision Iterative Refinement Techniques for the
Solution of Dense Linear Systems_. International Journal of High Performance
Computing Applications, 21(4):457-466, 2007.

.contribution
On platforms like the Cell processor that do single precision much faster than
double, linear systems can be solved many times faster. Even on commodity
processors there is a factor of 2 in speed between single and double
precision. The matrix types supported in this release are GE (general) and
PO (positive definite).

A SRC/dsposv.f
A SRC/zcposv.f

[source,fortran]
----
      SUBROUTINE ZCPOSV( UPLO, N, NRHS, A, LDA, B, LDB, X, LDX, WORK,
     +                   SWORK, RWORK, ITER, INFO )
*
      CHARACTER          UPLO
      INTEGER            INFO, ITER, LDA, LDB, LDX, N, NRHS
      DOUBLE PRECISION   RWORK( N*NRHS )
      COMPLEX            SWORK( N*(N+NRHS) )
      COMPLEX*16         A( LDA, N ), B( LDB, NRHS ), WORK( N, NRHS ),
     +                   X( LDX, NRHS )
*
*  Purpose
*  =======
*
*  ZCPOSV computes the solution to a complex system of linear equations
*     A * X = B,
*  where A is an N-by-N Hermitian positive definite matrix and X and B
*  are N-by-NRHS matrices.
*
*  ZCPOSV first attempts to factorize the matrix in COMPLEX and use this
*  factorization within an iterative refinement procedure to produce a
*  solution with COMPLEX*16 normwise backward error quality. If the
*  approach fails the method switches to a COMPLEX*16 factorization and
*  solve.
*
----
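
A minimal usage sketch for `ZCPOSV` follows (not part of the release; the
wrapper name EXMXP is made up). It mirrors the interface and workspace
dimensions quoted above and shows how `ITER` reports whether the mixed
precision approach succeeded.

[source,fortran]
----
*     Minimal usage sketch (illustrative only): solve A*X = B with the
*     factorization done in COMPLEX and refinement in COMPLEX*16.
*     Workspace dimensions follow the header quoted above.
      SUBROUTINE EXMXP( N, NRHS, A, LDA, B, LDB, X, LDX, WORK,
     +                  SWORK, RWORK, ITER, INFO )
      INTEGER            INFO, ITER, LDA, LDB, LDX, N, NRHS
      DOUBLE PRECISION   RWORK( N*NRHS )
      COMPLEX            SWORK( N*(N+NRHS) )
      COMPLEX*16         A( LDA, N ), B( LDB, NRHS ), WORK( N, NRHS ),
     +                   X( LDX, NRHS )
*
      CALL ZCPOSV( 'U', N, NRHS, A, LDA, B, LDB, X, LDX, WORK,
     +             SWORK, RWORK, ITER, INFO )
*
*     On exit, ITER < 0 means the refinement did not converge and
*     ZCPOSV fell back to a full COMPLEX*16 factorization and solve;
*     otherwise ITER returns the number of refinement iterations used
*     by the mixed precision approach.
      END
----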

9.9. Add some variants for the one-sided factorizations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributors
Peng Du and Jason Riedy

.lapackers
Julie Langou and Jason Riedy

.contribution
The LAPACK QR blocked factorization (xGEQRF) is right-looking; a left-looking
variant is added. The LAPACK Cholesky blocked factorization (xPOTRF) is
left-looking; right-looking and top-looking variants are added. The LAPACK LU
blocked factorization (xGETRF) is right-looking; left-looking, Crout and
recursive variants are added (in Fortran 77).

A SRC/VARIANTS/cholesky/RL/cpotrf.f
A SRC/VARIANTS/cholesky/RL/dpotrf.f
A SRC/VARIANTS/cholesky/RL/spotrf.f
A SRC/VARIANTS/cholesky/RL/zpotrf.f
A SRC/VARIANTS/cholesky/TOP/cpotrf.f
A SRC/VARIANTS/cholesky/TOP/dpotrf.f
A SRC/VARIANTS/cholesky/TOP/spotrf.f
A SRC/VARIANTS/cholesky/TOP/zpotrf.f
A SRC/VARIANTS/lu/CR/cgetrf.f
A SRC/VARIANTS/lu/CR/dgetrf.f
A SRC/VARIANTS/lu/CR/sgetrf.f
A SRC/VARIANTS/lu/CR/zgetrf.f
A SRC/VARIANTS/lu/LL/cgetrf.f
A SRC/VARIANTS/lu/LL/dgetrf.f
A SRC/VARIANTS/lu/LL/sgetrf.f
A SRC/VARIANTS/lu/LL/zgetrf.f
A SRC/VARIANTS/lu/REC/cgetrf.f
A SRC/VARIANTS/lu/REC/dgetrf.f
A SRC/VARIANTS/lu/REC/sgetrf.f
A SRC/VARIANTS/lu/REC/zgetrf.f
A SRC/VARIANTS/qr/LL/cgeqrf.f
A SRC/VARIANTS/qr/LL/dgeqrf.f
A SRC/VARIANTS/qr/LL/sceil.f
A SRC/VARIANTS/qr/LL/sgeqrf.f
A SRC/VARIANTS/qr/LL/zgeqrf.f

.integration in LAPACK
To install and test the variants:

--------------------------------
make variants
make variants_testing
--------------------------------

This should create the following libraries in `SRC/VARIANTS/LIB`:

--------------------------------
- LU Crout : `lucr.a`
- LU Left Looking : `lull.a`
- LU Sivan Toledo's recursive : `lurec.a`
- QR Left Looking : `qrll.a`
- Cholesky Right Looking : `cholrl.a`
- Cholesky Top : `choltop.a`
--------------------------------

Please refer to the README file in SRC/VARIANTS for more information.

9.10. Bug fixes for the bidiagonal SVD routine that fix some rare convergence failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributors
Osni Marques and Beresford Parlett

.lapackers
Osni Marques, Jim Demmel, and Julien Langou

.contribution
The version of dqds available in LAPACK-3.1.1 (released in February 2007) is
essentially the version of dqds available in LAPACK-3.0 (released in 1999).
The difference between the two versions is that the dqds in LAPACK-3.1.1 is
thread safe, thanks to modifications made by Sven Hammarling. Any shortcomings
of the dqds algorithm in LAPACK-3.0 are therefore still present in
LAPACK-3.1.1.

In LAPACK-3.2.0 we introduce the latest version of dqds, which Beresford
Parlett and Osni Marques produced in 2006. This version contains a number of
improvements introduced after 1999; in particular, the flipping mechanism has
been improved. The original motivation was a problem reported by one of
Inderjit Dhillon's students (Univ. of Texas at Austin). As a result, this new
version of dqds (let us call it dqds-3.2) passes Cleve Moler's matrix, while
dqds-3.0 and dqds-3.1 fail to converge on it within the maximum number of
iterations allowed. Cleve Moler's matrix is at
[http://www-math.cudenver.edu/\~langou/lapack-3.2/set_Moler_200.c[URL-1]].

The single precision version of dqds-3.2 has a problem in the LAPACK testing
suite with one of the matrices of type 16, while dqds-3.0 and dqds-3.1 were
fine. The matrix is at
[http://www-math.cudenver.edu/~langou/lapack-3.2/set_PB1_30.c[URL-2]].
To pass this matrix, we needed to set the `IEEE` LOGICAL variable in SLASQ2 to
`.FALSE.`. This is a temporary fix; we hope to have a better one in the future.

M SRC/dlasq1.f
M SRC/slasq1.f
M SRC/dlasq2.f
M SRC/slasq2.f
M SRC/dlasq3.f
M SRC/slasq3.f
M SRC/dlasq4.f
M SRC/slasq4.f
M SRC/dlasq5.f
M SRC/slasq5.f
M SRC/dlasq6.f
M SRC/slasq6.f
D SRC/dlazq3.f
D SRC/slazq3.f
D SRC/dlazq4.f
D SRC/slazq4.f

[CAUTION]
There are interface changes for `DLASQ3`, `DLASQ4`, `SLASQ3` and `SLASQ4`. We
decided not to add another set of routines and simply changed the interfaces.
+
We have removed the routines `DLAZQ3`, `DLAZQ4`, `SLAZQ3` and `SLAZQ4`. They
were introduced in 3.1 for thread safety but are no longer needed.

[URL-1] A nasty matrix from Cleve Moler:
http://www-math.cudenver.edu/\~langou/lapack-3.2/set_Moler_200.c.

[URL-2] A nasty matrix from the LAPACK testing suite:
http://www-math.cudenver.edu/~langou/lapack-3.2/set_PB1_30.c.

9.11. Better multishift Hessenberg QR algorithm with aggressive early deflation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.contributor
Ralph Byers

.lapacker
Edward Smyth

.reference
Ralph Byers. _LAPACK 3.1 xHSEQR: Tuning and Implementation Notes on the Small
Bulge Multi-shift QR Algorithm with Aggressive Early Deflation_. LAPACK
Working Note 187, May 2007.

.contribution
Most of the revisions fix typographical errors, but a few have a small effect
on how the program works. Even these are relatively minor revisions:

1. revised the choice of the size of the deflation window slightly, to make
the code a little more robust against convergence failures.
2. revised the section of code that tries to reintroduce bulges after they
have collapsed due to underflow. The new version is cleaner and more robust.
3. revised `xLAQR1` so that it does not assume that `H(2,1)` is real. A code
ought to do what it claims to do, and in the complex case this small
subroutine did not quite do it.

M SRC/chseqr.f
M SRC/dhseqr.f
M SRC/shseqr.f
M SRC/zhseqr.f
M SRC/clahqr.f
M SRC/dlahqr.f
M SRC/slahqr.f
M SRC/zlahqr.f
M SRC/claqr0.f
M SRC/dlaqr0.f
M SRC/slaqr0.f
M SRC/zlaqr0.f
M SRC/claqr1.f
M SRC/dlaqr2.f
M SRC/slaqr2.f
M SRC/zlaqr1.f
M SRC/claqr2.f
M SRC/dlaqr3.f
M SRC/slaqr3.f
M SRC/zlaqr2.f
M SRC/claqr3.f
M SRC/dlaqr4.f
M SRC/slaqr4.f
M SRC/zlaqr3.f
M SRC/claqr4.f
M SRC/dlaqr5.f
M SRC/slaqr5.f
M SRC/zlaqr4.f
M SRC/claqr5.f
M SRC/zlaqr5.f

10. Expected additions and improvements for the future
-------------------------------------------------------

.multishift QZ with aggressive early deflation
Bo Kagstrom and Daniel Kressner. _Multishift variants of the QZ algorithm with
aggressive early deflation_. SIAM J. Matrix Anal. Appl., 29(1):199-227, 2006.

.block reordering algorithm
Daniel Kressner. _Block algorithms for reordering standard and generalized
Schur forms_. ACM Trans. Math. Software, 32(4):521-532, 2006.

.accurate and efficient Givens rotations
http://www.cs.berkeley.edu/~demmel/Givens/

David Bindel, James Demmel, William Kahan, and Osni Marques. _On computing
Givens rotations reliably and efficiently_. ACM Transactions on Mathematical
Software (TOMS), 28(2):206-238, 2002.

.blas 2.5
Gary W. Howell, James Demmel, Charles T. Fulton, Sven Hammarling, and Karen
Marmol. _Cache efficient bidiagonalization using BLAS 2.5 operators_. ACM
Transactions on Mathematical Software (TOMS), 34(3), 2008.

.level 3 BLAS dsytrs, ssytrs, zhetrs, chetrs, dsytri, ssytri, zhetri, chetri
Upgrade dsytrs, ssytrs, zhetrs, chetrs, dsytri, ssytri, zhetri and chetri to
use Level 3 BLAS when called with multiple right-hand sides (feedback from
the MathWorks).

.support more matrix types for extra precise iterative refinement
Matrix types SB (symmetric band), PB (positive definite band), HB (Hermitian
band), and packed storage. Tridiagonal types such as GT (general tridiagonal)
are also on the wish list, but first we need to derive adequate test cases.

.make xLARFB thread friendly
Take into account Robert van de Geijn's (Univ. of Texas at Austin) comments on
the interface of xLARFB.
The interface labels V as input/output, which is not at all convenient for
multithreaded implementations.

.laundry list
1. Change the default Cholesky factorization in SRC from left-looking to
right-looking, and move the left-looking version to the VARIANTS directory.
2. Add some recursive variants for QR and Cholesky.
3. Remove the `IEEE = .FALSE.` workaround in DQDS (Osni's DLASQ3 and SLASQ3).
4. Remove the possibility of `INTEGER` overflow in the RFP routines.