   Single Precision                                         January-1994



                     Users' and Installation guide
               for the GEMM-Based Level 3 BLAS Benchmark

                               Per Ling
                  Institute of Information Processing
                      University of Umea, Sweden

                           January 15, 1994



   1. Introduction

   The GEMM-Based Level 3 BLAS Benchmark is a tool for performance
   evaluation of Level 3 BLAS kernel programs. With the announcement of
   LAPACK [1], the need for high performance Level 3 BLAS kernels became
   apparent. LAPACK is based on calls to the Level 3 BLAS kernels. This
   benchmark measures and compares performance of a set of user-supplied
   Level 3 BLAS implementations and of the GEMM-Based Level 3 BLAS
   implementations permanently included in the benchmark. The purpose of
   the benchmark is to help the user determine the quality of different
   Level 3 BLAS implementations. The included GEMM-Based
   Level 3 BLAS routines provide a lower limit on the performance to be
   expected from a highly optimized Level 3 BLAS library.

   The user supplies a set of Level 3 implementations to be evaluated.
   These are linked with the benchmark program. When the benchmark
   executes, timings are performed according to specifications given in
   an input file. An example input file is given in the file
   'example.in'. The user may design their own tests, use the example
   input file, or use the enclosed input files specifying proposed
   standard tests.

   The output optionally presents the following results:

      A   A collected mean value result, calculated from the
          performance results of the separate user-supplied Level 3
          routines for specified problem configurations.

      B   Tables, showing performance results in megaflops, and
          comparisons between different routines calculated as the
          performance result of one Level 3 BLAS routine divided by the
          performance results of another Level 3 BLAS routine.

   The purpose of the collected result A is to provide a performance
   result of the user-supplied routines which can easily be compared
   between different machines. We propose two standard tests with
   different problem configurations, SMARK01 and SMARK02 (see the input
   files 'smark01.in' and 'smark02.in').


   The tables B are intended for program developers and others who are
   interested in detailed performance information from the routines.


   2. The input file

   The user supplies an input file for the benchmark specifying tests to
   be made and results to be presented. The following parameters need to
   be specified in the input file.

      LBL       An arbitrary label which identifies the test to be
                performed (max 50 characters). The label is printed
                together with the output results A and B (see
                section 1).

      TAB       One or more numbers specifying tests to be made and
                results to be presented.

      RUNS      All results presented are based on the fastest of RUNS
                executions for each problem configuration.

   At least one of the numbers 1 - 6 needs to be specified for the
   parameter TAB. The numbers are interpreted as follows.

      1   The collected benchmark result.

      2   Performance of the built-in GEMM-Based Level 3 BLAS library
          in megaflops.

      3   Performance of the user-supplied Level 3 BLAS library in
          megaflops.

      4   Performance of the user-supplied SGEMM routine in megaflops.
          Problem configurations for SGEMM are chosen to 'correspond' to
          those in 2 and 3 for timing purposes, see section 3.

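      5   GEMM-Efficiency of the user-supplied Level 3 routines, see
          section 3.

      6   GEMM-Ratio, comparing the built-in GEMM-Based Level 3 BLAS
          routines with the user-supplied routines, see section 3.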

   The input parameters for the Level 3 BLAS routines are specified as
   follows.

      SIDE      Characters. L(eft) and/or R(ight).
      UPLO      Characters. U(pper) and/or L(ower) triangular part.
      TRANS     Characters. N(o transpose) and/or T(ranspose).
      DIAG      Characters. N(on-unit) and/or U(nit) triangular.
      DIM1      Integer values for the first of the two dimensions.
      DIM2      Integer values for the second of the two dimensions.
      LDA       Integer values for leading dimension of the matrices.

   See [2] and [3] for further explanations of the input parameters
   SIDE, UPLO, TRANS, and DIAG. The parameters DIM1 and DIM2 are used to
   specify the first and second dimension in the calling sequence
   of the Level 3 BLAS routines, respectively. The values for DIM1 and
   DIM2 come in pairs, i.e., the i:th value for DIM1 is used together
   with the i:th value for DIM2, exclusively. LDA specifies leading
   dimensions of the matrices A, B, and C in calls to the Level 3 BLAS
   routines.
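
   As an illustration only, the pairing rule for DIM1 and DIM2 can be
   sketched in Python (not the Fortran 77 of the benchmark). The name
   problem_configs is hypothetical, and the sketch assumes that every
   character option and every LDA value is combined with every
   dimension pair; the loop order used by the benchmark itself is not
   stated in this guide.

```python
from itertools import product


def problem_configs(sides, uplos, transs, diags, dim1, dim2, ldas):
    """Enumerate hypothetical problem configurations.

    DIM1 and DIM2 are paired element by element. All other parameters
    are ASSUMED to be fully combined with each dimension pair.
    """
    if len(dim1) != len(dim2):
        raise ValueError("DIM1 and DIM2 must have the same number of values")
    dims = list(zip(dim1, dim2))  # the i:th DIM1 goes with the i:th DIM2 only
    return [
        (side, uplo, trans, diag, d1, d2, lda)
        for side, uplo, trans, diag, (d1, d2), lda
        in product(sides, uplos, transs, diags, dims, ldas)
    ]


configs = problem_configs("LR", "U", "N", "N", [16, 32], [64, 128], [256])
```

   With two SIDE options and two dimension pairs, the call above gives
   four configurations. The pair (16, 128) never occurs, since DIM1 and
   DIM2 values are not cross-combined.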

   Determine for each routine whether it should be timed or not. Put T
   after the routine name if the routine should be timed and F if not.

      SSYMM     T
      SSYRK     T
      SSYR2K    T
      STRMM     T
      STRSM     T

   An example of an input file is enclosed in the file 'example.in'.
   This file may be used as a template for user constructed tests.


   3. Benchmark results

   The output from the benchmark optionally includes a ``collected mean
   value result'' of the user-supplied Level 3 routines, and tables
   showing detailed performance results and comparisons between the
   user-supplied and the built-in GEMM-Based Level 3 BLAS routines.
   Problem configurations, routines to be timed, and results to be
   presented are selected according to specifications in the input file.


   3.1. The table results

   The table results are intended for program developers and others who
   are interested in detailed performance presentations. Performance of
   the user-supplied and the built-in GEMM-Based Level 3 BLAS routines
   are shown. The tables also show GEMM-Efficiency and GEMM-Ratio.

   GEMM-Efficiency is a number intended to give the user an idea of how
   close to the ``practical'' peak performance a routine executes. The
   performance of the user-supplied Level 3 BLAS routines is compared
   with the performance of the user-supplied SGEMM routine as follows.

                         Performance of a user-supplied
                         Level 3 BLAS routine (megaflops).
      GEMM-Efficiency = -----------------------------------
                         Performance of the user-supplied
                         SGEMM routine (megaflops).

   SGEMM is often carefully implemented and reaches performance levels
   close to the practical limit on many machines. GEMM-Efficiency is
   measured with problem configurations for SGEMM which, in this
   respect, ``correspond'' to the problem configurations used for the
   remaining Level 3 BLAS routines. Performance of the Level 3 BLAS
   problems

      SSYMM(  side, uplo,  m, n, alpha, A, lda, B, ldb, beta, C, ldc ),
      SSYRK(  uplo, trans, n, k, alpha, A, lda, beta, C, ldc ),
      SSYR2K( uplo, trans, n, k, alpha, A, lda, B, ldb, beta, C, ldc ),
      STRMM(  side, uplo, trans, diag, m, n, alpha, A, lda, C, ldc ),
      STRSM(  side, uplo, trans, diag, m, n, alpha, A, lda, C, ldc )

   where alpha = 0.9, beta = 1.1, and lda = ldb = ldc are compared with
   the performance of the following problems for SGEMM:

      -----------------------------------------------------------
      Level 3 BLAS        |      Input parameters for SGEMM
                          |
      routine  side trans | transa transb  m  n  k  A  B  C  beta
      -----------------------------------------------------------
                          |
      SSYMM     'L'       |   'N'    'N'   m  n  m  A  B  C   1.1
                'R'       |   'N'    'N'   m  n  n  B  A  C   1.1
                          |
      SSYRK          'N'  |   'N'    'T'   n  n  k  A  A  C   1.1
                     'T'  |   'T'    'N'   n  n  k  A  A  C   1.1
                          |
      SSYR2K         'N'  |   'N'    'T'   n  n  k  A  B  C   1.1
                     'T'  |   'T'    'N'   n  n  k  A  B  C   1.1
                          |
      STRMM,              |
      STRSM     'L'       |  trans   'N'   m  n  m  A  B  C   0.0
                'R'       |   'N'   trans  m  n  n  B  A  C   0.0
                          |
      -----------------------------------------------------------
      (Parameters for SGEMM not shown in the table equal the
      corresponding parameters of the Level 3 BLAS routine that SGEMM
      is compared with.)
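
   The dimension and transpose columns of the table can be restated as
   a small Python sketch (an illustration, not code from the
   benchmark). The name sgemm_shape is hypothetical. DIM1 and DIM2 are
   taken as ( m, n ) for SSYMM, STRMM, and STRSM and as ( n, k ) for
   SSYRK and SSYR2K, following the calling sequences above. Operands
   and beta are left out; they are as listed in the table.

```python
def sgemm_shape(routine, dim1, dim2, side=None, trans=None):
    """Return (transa, transb, m, n, k) of the corresponding SGEMM problem."""
    if routine == "SSYMM":
        if side not in ("L", "R"):
            raise ValueError("SSYMM needs side 'L' or 'R'")
        m, n = dim1, dim2
        return ("N", "N", m, n, m if side == "L" else n)
    if routine in ("SSYRK", "SSYR2K"):
        if trans not in ("N", "T"):
            raise ValueError(routine + " needs trans 'N' or 'T'")
        n, k = dim1, dim2
        return ("N", "T", n, n, k) if trans == "N" else ("T", "N", n, n, k)
    if routine in ("STRMM", "STRSM"):
        if side not in ("L", "R") or trans not in ("N", "T"):
            raise ValueError(routine + " needs side and trans")
        m, n = dim1, dim2
        # The triangular operand keeps its transpose setting; the other
        # operand is not transposed.
        return (trans, "N", m, n, m) if side == "L" else ("N", trans, m, n, n)
    raise ValueError("unknown routine: " + routine)


shape = sgemm_shape("STRSM", 30, 40, side="L", trans="T")
```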

   To obtain the performance in megaflops, the number of floating point
   operations (flop) performed by a Level 3 BLAS routine is divided by
   1 000 000 times the execution time in seconds. The number of
   floating point operations performed is calculated as follows:

      ----------------------------------------------------------------
      Level 3 BLAS        |  nops : number of operations for a Level 3
                          |         BLAS problem.
                          |  gops : number of operations for the
      routine  side diag  |         corresponding SGEMM problem.
      ----------------------------------------------------------------
                          |
      SSYMM     'L'       |  nops = ( 2m+1 )mn + min( mn, m( m+1 )/2 )
                          |  gops = ( 2m+1 )mn + min( mn, mm )
                          |
                'R'       |  nops = ( 2n+1 )mn + min( mn, n( n+1 )/2 )
                          |  gops = ( 2n+1 )mn + min( mn, nn )
                          |
      SSYRK               |  nops = ( 2k+1 )( n( n+1 )/2 ) +
                          |                      min( nk, n( n+1 )/2 )
                          |  gops = ( 2k+1 )nn + min( nk, nn )
                          |
      SSYR2K              |  nops = ( 4k+1 )( n( n+1 )/2 ) +
                          |                       min( 2nk, n( n+1 ) )
                          |  gops = ( 2k+1 )nn + min( nk, nn )
                          |
      STRMM,              |
      STRSM     'L'  'N'  |  nops = mmn + min( mn, m( m+1 )/2 )
                'L'  'U'  |  nops = mmn - mn + min( mn, m( m+1 )/2 )
                'L'       |  gops = ( 2m-1 )mn + min( mn, mm )
                          |
                'R'  'N'  |  nops = mnn + min( mn, n( n+1 )/2 )
                'R'  'U'  |  nops = mnn - mn + min( mn, n( n+1 )/2 )
                'R'       |  gops = ( 2n-1 )mn + min( mn, nn )
                          |
      ----------------------------------------------------------------
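
   As a worked illustration, the operation counts in the table and the
   megaflops calculation can be written as the following Python sketch
   (not the Fortran 77 of the benchmark). The function names nops,
   gops, and megaflops are chosen here for illustration. DIM1 and DIM2
   are taken as ( m, n ) for SSYMM, STRMM, and STRSM and as ( n, k )
   for SSYRK and SSYR2K.

```python
def nops(routine, dim1, dim2, side="L", diag="N"):
    """Operation count for a Level 3 BLAS problem (table entry 'nops')."""
    if routine == "SSYMM":
        m, n = dim1, dim2
        p = m if side == "L" else n      # order of the symmetric matrix
        return (2 * p + 1) * m * n + min(m * n, p * (p + 1) // 2)
    if routine == "SSYRK":
        n, k = dim1, dim2
        tri = n * (n + 1) // 2           # entries in one triangle of C
        return (2 * k + 1) * tri + min(n * k, tri)
    if routine == "SSYR2K":
        n, k = dim1, dim2
        tri = n * (n + 1) // 2
        return (4 * k + 1) * tri + min(2 * n * k, 2 * tri)
    if routine in ("STRMM", "STRSM"):
        m, n = dim1, dim2
        p = m if side == "L" else n      # order of the triangular matrix
        ops = p * m * n + min(m * n, p * (p + 1) // 2)
        return ops - m * n if diag == "U" else ops
    raise ValueError("unknown routine: " + routine)


def gops(routine, dim1, dim2, side="L"):
    """Operation count for the corresponding SGEMM problem ('gops')."""
    if routine == "SSYMM":
        m, n = dim1, dim2
        p = m if side == "L" else n
        return (2 * p + 1) * m * n + min(m * n, p * p)
    if routine in ("SSYRK", "SSYR2K"):
        n, k = dim1, dim2
        return (2 * k + 1) * n * n + min(n * k, n * n)
    if routine in ("STRMM", "STRSM"):
        m, n = dim1, dim2
        p = m if side == "L" else n
        return (2 * p - 1) * m * n + min(m * n, p * p)
    raise ValueError("unknown routine: " + routine)


def megaflops(ops, seconds):
    """Operations divided by 1 000 000 times the execution time."""
    return ops / (1000000.0 * seconds)


ssymm_ops = nops("SSYMM", 4, 3, side="L")   # (2*4+1)*12 + min(12, 10)
```

   For SSYMM with side = 'L', m = 4, and n = 3 the sketch gives
   nops = 118 and gops = 120.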

   GEMM-Ratio compares performance of the built-in GEMM-Based Level 3
   BLAS routines with performance of the corresponding user-supplied
   routines.

                    Performance of the internal GEMM-Based
                    Level 3 BLAS routine Sxxxx (megaflops).
      GEMM-Ratio = -----------------------------------------
                    Performance of the user-supplied
                    Level 3 BLAS routine Sxxxx (megaflops).

   A value greater than one implies that the GEMM-Based routine is
   faster than the user-supplied routine.
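
   The two quotients of this section can be summarized in a short
   Python sketch (an illustration only; the function names are chosen
   here and the megaflops figures in the example are made up):

```python
def gemm_efficiency(user_mflops, user_sgemm_mflops):
    """User-supplied routine relative to the user-supplied SGEMM."""
    return user_mflops / user_sgemm_mflops


def gemm_ratio(builtin_mflops, user_mflops):
    """Built-in GEMM-Based routine relative to the user-supplied routine."""
    return builtin_mflops / user_mflops


# Made-up figures: a user routine at 60 megaflops, the user SGEMM at
# 100 megaflops, and the built-in GEMM-Based routine at 90 megaflops.
efficiency = gemm_efficiency(60.0, 100.0)
ratio = gemm_ratio(90.0, 60.0)
```

   Here the GEMM-Ratio is 1.5, i.e., the built-in GEMM-Based routine is
   faster than the user-supplied routine for this made-up problem.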


   3.2. The collected benchmark result

   The collected benchmark result is calculated from performance results
   of the user-supplied Level 3 routines for problems specified in the
   input file. The result consists of a tuple ( x, y ), where x is the
   mean value of the GEMM-Efficiency and y is the mean value of the
   performance of SGEMM in megaflops. SGEMM is timed for problems
   corresponding to those specified for the remaining Level 3 routines.
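
   A minimal Python sketch of the tuple ( x, y ) follows. It is an
   illustration, not the code of SGBTP1, and it assumes that ``mean
   value'' is the unweighted arithmetic mean over all timed problems.
   The function name and the sample figures are made up.

```python
def collected_result(user_mflops, sgemm_mflops):
    """Return (x, y) for matching lists of megaflops results.

    user_mflops[i] is the performance of a user-supplied Level 3 routine
    and sgemm_mflops[i] the performance of SGEMM on the corresponding
    problem. ASSUMPTION: both means are unweighted arithmetic means.
    """
    if not user_mflops or len(user_mflops) != len(sgemm_mflops):
        raise ValueError("need two non-empty lists of equal length")
    efficiencies = [u / g for u, g in zip(user_mflops, sgemm_mflops)]
    x = sum(efficiencies) / len(efficiencies)   # mean GEMM-Efficiency
    y = sum(sgemm_mflops) / len(sgemm_mflops)   # mean SGEMM megaflops
    return x, y


x, y = collected_result([50.0, 75.0], [100.0, 100.0])
```

   For the two made-up problems above, x = 0.625 and y = 100.0.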

   The purpose of the collected benchmark result is to provide an
   overall performance measure of the user-supplied Level 3 BLAS
   routines. The intention is to expose the capacity of the target
   machine for these kinds of problems and to show how well the routines
   utilize the machine. Furthermore, the collected result is intended to
   be easy to compare between different target machines.

   We propose two standard test suites for the collected benchmark
   result, SMARK01 and SMARK02 (see the files 'smark01.in' and
   'smark02.in'). These tests are designed to show performance of the
   user-supplied Level 3 library for problem sizes that a calling
   routine is likely to request often. This implies problems that
   presumably constitute a large part of computations in routines which
   use the Level 3 BLAS as their major computational kernels. LAPACK
   implements blocked algorithms which are based on calls to the Level 3
   BLAS. The problems in the two tests are similar. However, some of the
   matrix dimensions are larger in SMARK02 than in SMARK01. This
   corresponds to larger matrix blocks in the calling routine. The tests
   are expected to match various target machines differently.
   Performance results may depend strongly on sizes of different storage
   units in the memory hierarchy. The size of the cache memory, for
   instance, may be decisive. For this reason, we propose two standard
   tests instead of one.


   4. The built-in GEMM-Based Level 3 BLAS

   The GEMM-Based Level 3 BLAS concept utilizes the fact that it is
   possible to formulate the Level 3 BLAS operations in terms of the
   Level 3 operation for general matrix multiply and add, SGEMM, and
   some Level 1 and Level 2 BLAS operations.

   The GEMM-Based Level 3 BLAS model implementations, included in the
   benchmark, are a set of Level 3 BLAS routines written in Fortran 77
   [4],[5],[6]. They are designed to perform matrix operations in a
   blocked fashion. This means that an operation performed by a GEMM-
   Based Level 3 routine is divided into several suboperations which are
   performed on submatrices (or blocks) of the matrices. Block
   dimensions and other parameters are tuned, by the user, to match the
   characteristics of the memory system on the target machine. These
   user specified parameters may be tuned to match the size of vector
   registers, cache lines, or cache sets, for instance. Suboperations on
   blocks of matrices are performed by calls to Level 1 BLAS routines,
   Level 2 BLAS routines, and to the Level 3 BLAS routine SGEMM, which
   are all supplied by the user. The routines supplied by the user
   should be highly optimized for the target machine, possibly assembly
   implementations provided by the machine manufacturer. See the README
   file and the installation guide, enclosed with the GEMM-Based Level 3
   BLAS model implementations, for further information.


   5. Installing the benchmark program

   All routines are written in Fortran 77 for portability. No changes to
   the code should be necessary in order to run the programs correctly
   on different target machines. In fact, we strongly recommend the user
   to avoid changes, except for the user specified parameters and for
   UNIT numbers for input and output communication. This will ensure
   that performance results from different target machines are
   comparable. UNIT numbers are set in the main program SGBTIM and the
   user specified parameters exist only in the built-in GEMM-Based
   Level 3 BLAS routines.

   Apart from the built-in GEMM-Based Level 3 BLAS routines (SGB02,
   SGB04, SGB06, SGB08, SGB09, SGB90, and SGB91), the benchmark program
   consists of the following routines:

      SGBTIM   The main program. Reads the input file and calls the
               routines described below.

      SGBT01   Times the user-supplied SGEMM routine.

      SGBT02   Times the built-in GEMM-Based Level 3 BLAS routines and
               the user-supplied Level 3 BLAS routines except SGEMM.

      SGBTP1   Calculates and prints the collected benchmark result A,
               see section 1.

      SGBTP2   Calculates and prints the table results B, see section 1.


   The following is a description of how to install the GEMM-Based
   Level 3 BLAS Benchmark on machines with Unix-like operating systems.
   A 'Makefile' is enclosed to facilitate the installation. The user
   specified parameters of the built-in GEMM-Based Level 3 BLAS routines
   come with default values. These values need to be optimized for the
   target machine. We refer to the installation guide enclosed with the
   GEMM-Based Level 3 BLAS model implementations for instructions on how
   to tune these values. The program SSBPM, for assigning values to the
   user specified parameters, corresponds to the program SSGPM for the
   GEMM-Based Level 3 BLAS model implementations. Input files for SSGPM
   may also be used with SSBPM.  To compile and link SSBPM give the
   command:

      % make ssbpm

   Run SSBPM which updates the built-in GEMM-Based Level 3 BLAS routines
   with the new parameters given in the input file, 'newsgpm.in':

      % ssbpm < newsgpm.in

   The benchmark program calls a single precision function SECOND with
   no arguments. This function is assumed to return the central-
   processor time in seconds from some fixed starting-time. Create this
   function if it doesn't already exist on your system. The enclosed
   Fortran 77 function in the file 'second.f' can be used as a template.
   This routine is based on calls to the timing function etime under
   Unix.
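
   The timing principle, a SECOND-like function combined with the
   ``fastest of RUNS executions'' rule from section 2, can be sketched
   in Python as follows. This is an illustration only, not the Fortran
   77 code of the benchmark. The names second and fastest_time are
   chosen here, and time.process_time from the Python standard library
   stands in for etime.

```python
import time


def second():
    """CPU time in seconds from some fixed starting point (like SECOND)."""
    return time.process_time()


def fastest_time(work, runs, clock=second):
    """Run work() `runs` times and return the shortest measured time."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    best = float("inf")
    for _ in range(runs):
        start = clock()
        work()
        elapsed = clock() - start
        best = min(best, elapsed)
    return best


# Example: time a small dummy workload three times, keep the fastest.
best = fastest_time(lambda: sum(i * i for i in range(10000)), runs=3)
```

   Keeping the fastest of several executions reduces the influence of
   disturbances from other activity on the machine.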

   Specify the Level 3 BLAS library to be evaluated and compiler flags
   in the enclosed file 'Makefile'. You may change the UNIT numbers for
   I/O communication NIN, NOUT, and NERR in the main program SGBTIM, if
   necessary. Create the executable benchmark program by giving the
   command:

      % make sgbtim

   You should now have a useful performance evaluation tool for
   Level 3 BLAS kernels.


   6. Executing the benchmark program

   The GEMM-Based Level 3 BLAS Benchmark can be used in different ways
   to evaluate performance of Level 3 BLAS routines. It is possible to
   obtain one, or both, of the output results A and B described in
   previous sections. The tables in B show performance results in
   megaflops and performance comparisons between the different routines,
   according to specifications in the input file (see sections 2 and 3).
   The user fully controls which tests are made and which results are
   presented through specifications in the input file.

   The following Unix command runs the benchmark program with the input
   file 'example.in' and writes the result to the output file
   'example.out':

      % sgbtim < example.in > example.out

   Notice that this benchmark may be quite time consuming to run.
   Obviously, the ``size'' of the test, specified in the input file, is
   decisive for the execution time. Further, the performance of the
   target machine and of the different Level 3 BLAS libraries also
   affects the total execution time.


   7. Where to send the results

   Please help us collect performance results from different target
   machines. Send results obtained with the proposed standard tests
   SMARK01 and SMARK02 to:

      E-mail:  pol@cs.umu.se

      Per Ling
      Institute of Information Processing
      University of Umea
      S-901 87 Umea
      Sweden

   We would appreciate it if you also specified as much as you can of
   the following system characteristics:

   o  Machine                 Name and version.
                              Number of processors.
                              Sizes of cache(s) and main memory etc.

   o  Operating system        Name, version, and release.

   o  Fortran compiler        Name, version, and release.
                              Options used: -O3 -inline -parallel etc.

   o  User suppl. routines    Library name, version, and release.
                              Parallel routines.

   o  Used configuration      Number of processors in parallel.

   o  Precision               Single precision: XX bit words.

   o  The function SECOND     Library:       name, version, and release.
                              Time measured: real, cpu, sys, user, etc.
                              Resolution:    microseconds, etc.
                              Based on:      etime, dclock, mclock, etc.

   If the external GEMM-Based Level 3 BLAS model implementations are
   specified as user-supplied routines in the tests, please enclose
   values for the user specified parameters and describe the underlying
   BLAS implementations (SGEMM, Level 1 and Level 2 BLAS). If more than
   one processor is involved, please explain how the parallelism is
   utilized. Are the GEMM-Based routines automatically parallelized by
   the compiler and/or are the underlying BLAS routines parallel?


   References

   [1] Anderson E., Bai Z., Bischof C., Demmel J., Dongarra J.,
       DuCroz J., Greenbaum A., Hammarling S., McKenney A.,
       Ostrouchov S., and Sorensen D., "LAPACK Users' Guide", Society
       for Industrial and Applied Mathematics, Philadelphia, 1992.

   [2] Dongarra J. J., DuCroz J., Duff I., and Hammarling S., "A Set of
       Level 3 Basic Linear Algebra Subprograms", ACM Trans. Math.
       Softw., Vol. 16, No. 1, 1990, pp.1-17.

   [3] Dongarra J. J., DuCroz J., Duff I., and Hammarling S., "Algorithm
       679: A Set of Level 3 Basic Linear Algebra Subprograms: Model
       Implementation and Test Programs", ACM Trans. Math. Softw.,
       Vol. 16, No. 1, 1990, pp.18-28.

   [4] Kagstrom B., Ling P. and Van Loan C. "High Performance GEMM-Based
       Level-3 BLAS: Sample Routines for Double Precision Real Data",
       in High Performance Computing II, Durand M. and El Dabaghi F.,
       eds., Amsterdam, 1991, North-Holland, pp.269-281.

   [5] Kagstrom B., Ling P. and Van Loan C. "Portable High Performance
       GEMM-Based Level-3 BLAS", in R. F. Sincovec et al., eds., Parallel
       Processing for Scientific Computing, SIAM Publications, 1993.

   [6] Kagstrom B. and Van Loan C. "GEMM-Based Level-3 BLAS", Tech. rep.
       CTC91TR47, Department of Computer Science, Cornell University,
       Dec. 1989.
