   Complex                                                   August-1994



                     Users' and Installation guide
               for the GEMM-Based Level 3 BLAS Benchmark

                               Per Ling
                  Institute of Information Processing
                      University of Umea, Sweden

                            August 11, 1994



   Please see the Users' and Installation guide for the single precision
   version of this benchmark program. Installing the complex version is
   similar. Apart from the obvious differences, note how the flop count
   and the GEMM-Efficiency are determined for the complex routines.

   GEMM-Efficiency is measured with problem configurations for CGEMM
   that ``correspond'' to the problem configurations used for the
   remaining Level 3 BLAS routines. Performance of the Level 3
   BLAS problems

      CSYMM(  side, uplo,  m, n, alpha, A, lda, B, ldb, beta, C, ldc ),
      CSYRK(  uplo, trans, n, k, alpha, A, lda, beta, C, ldc ),
      CSYR2K( uplo, trans, n, k, alpha, A, lda, B, ldb, beta, C, ldc ),
      CTRMM(  side, uplo, trans, diag, m, n, alpha, A, lda, C, ldc ),
      CTRSM(  side, uplo, trans, diag, m, n, alpha, A, lda, C, ldc )

   where alpha = ( 0.9, 0.05 ), beta = ( 1.1, 0.03 ), and
   lda = ldb = ldc are compared with the performance of the following
   problems for CGEMM:

      -----------------------------------------------------------
      Level 3 BLAS        |      Input parameters for CGEMM
                          |
      routine  side trans | transa transb  m  n  k  A  B  C  beta
      -----------------------------------------------------------
                          |
      CSYMM     'L'       |   'N'    'N'   m  n  m  A  B  C  beta
                'R'       |   'N'    'N'   m  n  n  B  A  C  beta
                          |
      CSYRK          'N'  |   'N'    'T'   n  n  k  A  A  C  beta
                     'T'  |   'T'    'N'   n  n  k  A  A  C  beta
                          |
      CSYR2K         'N'  |   'N'    'T'   n  n  k  A  B  C  beta
                     'T'  |   'T'    'N'   n  n  k  A  B  C  beta
                          |
      CTRMM,              |
      CTRSM     'L'       |  trans   'N'   m  n  m  A  B  C  one
                'R'       |   'N'   trans  m  n  n  B  A  C  one
                          |
      -----------------------------------------------------------
      (Parameters for CGEMM not shown in the table are equal to the
      corresponding parameters of the Level 3 BLAS routine that CGEMM
      is compared with. The value one for beta is one = ( 1.0, 0.0 ).)
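
   As an illustration only, the table above can be written as a small
   lookup function. This is a sketch in Python, not part of the
   benchmark (which is a Fortran program); the function name
   cgemm_config and the tuple layout are assumptions made for this
   example.

```python
def cgemm_config(routine, m=None, n=None, k=None, side=None, trans=None):
    """Return the CGEMM problem compared with a Level 3 BLAS problem.

    The result is (transa, transb, m, n, k, first, second, beta), where
    first and second name the matrices passed as CGEMM's A and B
    arguments. Hypothetical helper that mirrors the table row by row.
    """
    if routine == "CSYMM":
        if side == "L":
            return ("N", "N", m, n, m, "A", "B", "beta")
        if side == "R":
            return ("N", "N", m, n, n, "B", "A", "beta")
    elif routine in ("CSYRK", "CSYR2K"):
        # CSYRK uses A twice; CSYR2K uses A and B.
        second = "A" if routine == "CSYRK" else "B"
        if trans == "N":
            return ("N", "T", n, n, k, "A", second, "beta")
        if trans == "T":
            return ("T", "N", n, n, k, "A", second, "beta")
    elif routine in ("CTRMM", "CTRSM"):
        # beta is fixed to one = ( 1.0, 0.0 ) for the triangular routines.
        if side == "L":
            return (trans, "N", m, n, m, "A", "B", "one")
        if side == "R":
            return ("N", trans, m, n, n, "B", "A", "one")
    raise ValueError("no table row for this routine/side/trans combination")


# Example: CSYMM with side = 'R' on a 3 x 5 problem.
print(cgemm_config("CSYMM", m=3, n=5, side="R"))
```

   For side = 'R' the roles of A and B are swapped and k equals n, as
   in the second CSYMM row of the table.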

   The number of floating point operations (flop) performed by a Level 3
   BLAS routine is divided by the product of the execution time in
   seconds and 1 000 000 to obtain the performance in megaflops. The
   number of floating point operations performed is calculated as follows:

      ----------------------------------------------------------------
      Level 3 BLAS        |  nops : number of operations for a Level 3
                          |         BLAS problem.
                          |  gops : number of operations for the
      routine  side diag  |         corresponding CGEMM problem.
      ----------------------------------------------------------------
                          |
      CSYMM     'L'       |  nops:
                          |   mult = ( m+1 )mn + min( mn, m( m+1 )/2 )
                          |   add  = mmn
                          |  gops:
                          |   mult = ( m+1 )mn + min( mn, mm )
                          |   add  = mmn
                          |
                'R'       |  nops:
                          |   mult = ( n+1 )mn + min( mn, n( n+1 )/2 )
                          |   add  = mnn
                          |  gops:
                          |   mult = ( n+1 )mn + min( mn, nn )
                          |   add  = mnn
                          |
      CSYRK               |  nops:
                          |   mult = ( k+1 )( n( n+1 )/2 ) +
                          |                      min( nk, n( n+1 )/2 )
                          |   add  = k( n( n+1 )/2 )
                          |  gops:
                          |   mult = ( k+1 )nn + min( nk, nn )
                          |   add  = knn
                          |
      CSYR2K              |  nops:
                          |   mult = ( 2k+1 )( n( n+1 )/2 ) +
                          |                       min( 2nk, n( n+1 ) )
                          |   add  = kn(n+1)
                          |  gops:
                          |   mult = ( k+1 )nn + min( nk, nn )
                          |   add  = knn
                          |
      CTRMM,              |
      CTRSM     'L'  'N'  |  nops:
                          |   mult = ( m( m+1 )/2 )n +
                          |                      min( mn, m( m+1 )/2 )
                          |   add  = ( m( m-1 )/2 )n
                'L'  'U'  |  nops:
                          |   mult = ( m( m-1 )/2 )n +
                          |                      min( mn, m( m+1 )/2 )
                          |   add  = ( m( m-1 )/2 )n
                'L'       |  gops:
                          |   mult = mmn + min( mn, mm )
                          |   add  = m( m-1 )n
                          |
                'R'  'N'  |  nops:
                          |   mult = m( n( n+1 )/2 ) +
                          |                      min( mn, n( n+1 )/2 )
                          |   add  = m( n( n-1 )/2 )
                'R'  'U'  |  nops:
                          |   mult = m( n( n-1 )/2 ) +
                          |                      min( mn, n( n+1 )/2 )
                          |   add  = m( n( n-1 )/2 )
                'R'       |  gops:
                          |   mult = mnn + min( mn, nn )
                          |   add  = m( n-1 )n
                          |
      ----------------------------------------------------------------

   The total number of operations (NOP) is

   o   NOP = 6*mult + 2*add.
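
   The following Python sketch evaluates the formulas above for two of
   the routines, CSYMM and CSYRK, and converts an operation count to
   megaflops. It is an illustration only; the function names are
   assumptions made for this example and the remaining rows of the
   table are not covered.

```python
def tri(x):
    """Number of elements in a triangle of order x: x(x+1)/2."""
    return x * (x + 1) // 2


def nop(mult, add):
    """Total operations: 6 per complex multiply, 2 per complex add."""
    return 6 * mult + 2 * add


def csymm_ops(side, m, n):
    """Return (nops, gops) totals for CSYMM and its CGEMM problem."""
    d = m if side == "L" else n  # order of the symmetric matrix
    nops = nop((d + 1) * m * n + min(m * n, tri(d)), d * m * n)
    gops = nop((d + 1) * m * n + min(m * n, d * d), d * m * n)
    return nops, gops


def csyrk_ops(n, k):
    """Return (nops, gops) totals for CSYRK and its CGEMM problem."""
    nops = nop((k + 1) * tri(n) + min(n * k, tri(n)), k * tri(n))
    gops = nop((k + 1) * n * n + min(n * k, n * n), k * n * n)
    return nops, gops


def megaflops(total_ops, seconds):
    """Performance: operations / ( seconds * 1 000 000 )."""
    return total_ops / (seconds * 1.0e6)


print(csymm_ops("L", 2, 2))  # (106, 112)
print(csyrk_ops(2, 3))       # (108, 144)
```

   For m = n = 2 and side = 'L', CSYMM has mult = 15 and add = 8, so
   NOP = 6*15 + 2*8 = 106, while the corresponding CGEMM problem has
   mult = 16 and add = 8, giving 112.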

   For hermitian matrices, the imaginary part of the diagonal
   elements is always zero. Therefore it is not necessary to involve
   the imaginary parts of the diagonal elements in the computations.
   We can assume they are zero when reading a matrix and explicitly
   assign 0.0D+0 to them when writing a matrix.
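
   A minimal Python sketch of this convention, assuming the matrix is
   stored as a list of rows of complex numbers. The helper names are
   hypothetical and the code is not taken from the benchmark.

```python
def read_diag(a, i):
    """Read diagonal element i, assuming its imaginary part is zero."""
    return a[i][i].real


def write_diag(a, i, value):
    """Store a diagonal element with its imaginary part set to zero."""
    a[i][i] = complex(value.real, 0.0)


# The stray imaginary part 0.5 on the diagonal is never used.
a = [[2.0 + 0.5j, 1.0 - 1.0j],
     [1.0 + 1.0j, 3.0 + 0.0j]]
write_diag(a, 0, complex(read_diag(a, 0) * 2.0, 7.0))
print(a[0][0])  # (4+0j)
```

   Because only the real part of a diagonal element takes part in the
   computation, operations on it cost fewer flops than operations on a
   general complex element.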

   Performance of the Level 3 BLAS routines

      CHEMM(  side, uplo,  m, n, alpha, A, lda, B, ldb, beta, C, ldc ),
      CHERK(  uplo, trans, n, k, alpha, A, lda, beta, C, ldc ),
      CHER2K( uplo, trans, n, k, alpha, A, lda, B, ldb, beta, C, ldc ),

   where alpha = ( 0.9, 0.05 ), beta = ( 1.1, 0.03 ), and
   lda = ldb = ldc are compared with the performance of the following
   problems for CGEMM:

      -----------------------------------------------------------
      Level 3 BLAS        |      Input parameters for CGEMM
                          |
      routine  side trans | transa transb  m  n  k  A  B  C
      -----------------------------------------------------------
                          |
      CHEMM     'L'       |   'N'    'N'   m  n  m  A  B  C
                'R'       |   'N'    'N'   m  n  n  B  A  C
                          |
      CHERK          'N'  |   'N'    'C'   n  n  k  A  A  C
                     'C'  |   'C'    'N'   n  n  k  A  A  C
                          |
      CHER2K         'N'  |   'N'    'C'   n  n  k  A  B  C
                     'C'  |   'C'    'N'   n  n  k  A  B  C
                          |
      -----------------------------------------------------------
      (Parameters for CGEMM not shown in the table are equal to the
      corresponding parameters of the Level 3 BLAS routine that CGEMM
      is compared with.)

   The number of floating point operations (flop) performed by the
   Level 3 BLAS routines involving a hermitian matrix is calculated
   as follows:

      ----------------------------------------------------------------
      Level 3 BLAS        |  nops : number of operations for a Level 3
                          |         BLAS problem.
                          |
                          |  gops : number of operations for the
      routine  side       |         corresponding CGEMM problem.
      ----------------------------------------------------------------
                          |
      CHEMM     'L'       |  nops:
                          |   mult = ( 6m+2 )mn + min( 6mn, 3mm-m )
                          |   add  = 2mmn
                          |  gops:
                          |   mult = ( 6m+6 )mn + min( 6mn, 6mm )
                          |   add  = 2mmn
                          |
                'R'       |  nops:
                          |   mult = ( 6n+2 )mn + min( 6mn, 3nn-n )
                          |   add  = 2mnn
                          |  gops:
                          |   mult = ( 6n+6 )mn + min( 6mn, 6nn )
                          |   add  = 2mnn
                          |
      CHERK               |  nops:
                          |   mult = ( 3k+1 )nn + min( 2nk, nn )
                          |   add  = knn
                          |  gops:
                          |   mult = ( 6k+6 )nn + min( 6nk, 6nn )
                          |   add  = 2knn
                          |
      CHER2K              |  nops:
                          |   mult = ( 6k+1 )nn + min( 12nk, 6nn-2n )
                          |   add  = 2knn
                          |  gops:
                          |   mult = ( 6k+6 )nn + min( 6nk, 6nn )
                          |   add  = 2knn
                          |
      ----------------------------------------------------------------

   where the total number of operations (NOP) is calculated as

   o   NOP = mult + add.
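
   As a sketch only, the CHEMM and CHERK rows of the table above can be
   evaluated as follows. The Python function names are assumptions made
   for this example, and CHER2K is omitted. Note that mult and add in
   this table already count real operations, so they are simply added.

```python
def chemm_ops(side, m, n):
    """Return (nops, gops) totals for CHEMM and its CGEMM problem."""
    d = m if side == "L" else n  # order of the Hermitian matrix
    nops = (6 * d + 2) * m * n + min(6 * m * n, 3 * d * d - d) + 2 * d * m * n
    gops = (6 * d + 6) * m * n + min(6 * m * n, 6 * d * d) + 2 * d * m * n
    return nops, gops


def cherk_ops(n, k):
    """Return (nops, gops) totals for CHERK and its CGEMM problem."""
    nops = (3 * k + 1) * n * n + min(2 * n * k, n * n) + k * n * n
    gops = (6 * k + 6) * n * n + min(6 * n * k, 6 * n * n) + 2 * k * n * n
    return nops, gops


print(chemm_ops("L", 2, 2))  # (82, 112)
print(cherk_ops(2, 3))       # (56, 144)
```

   For m = n = 2 and side = 'L', CHEMM has mult = 14*4 + min( 24, 10 )
   = 66 and add = 16, so NOP = 82. The CGEMM count is 112, the same
   value that 6*mult + 2*add gives for CGEMM with m = n = k = 2, since
   the gops formulas here are the complex CGEMM counts with the factors
   6 and 2 already applied.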

   Notice also that the type of the scalars alpha and beta is not
   always complex for the hermitian Level 3 BLAS routines.

                      alpha      beta

           CHEMM:     complex    complex
           CHERK:     real       real
           CHER2K:    complex    real

   This of course affects the flop count.
