   Single Precision           Version 1.0                  December-1993


                        GEMM-Based Level 3 BLAS
                         model implementations


   The GEMM-Based Level 3 BLAS concept rests on the fact that the
   Level 3 BLAS operations can be formulated in terms of the Level 3
   operation for general matrix multiply and add, SGEMM, and some
   Level 1 and Level 2 BLAS operations.

   The GEMM-Based Level 3 BLAS model implementations are written in
   Fortran 77 and designed to be highly efficient on machines with a
   memory hierarchy. The model implementations consist of five single-
   precision Level 3 routines SSYMM, SSYRK, SSYR2K, STRMM, and STRSM and
   the auxiliary routines SBIGP and SCLD. The auxiliary routines LSAME
   and XERBLA, from the original Level 3 BLAS, are also used. For high
   performance, the GEMM-Based Level 3 BLAS routines rely on underlying
   optimized implementations of the Level 3 BLAS routine SGEMM and some
   Level 1 and Level 2 BLAS routines. The model implementations are
   primarily intended for single processor use on machines with local or
   global caches, and microprocessors with on-chip caches. However,
   they can also be parallelized using a parallelizing compiler, or
   linked with underlying parallel BLAS routines. All routines are
   structured to reduce data traffic in the memory hierarchy.

   The compiler- and processor-sensitive parts of the operations are
   concentrated in calls to the underlying BLAS routines. If these are
   efficiently optimized for the target machine, the GEMM-Based Level 3
   BLAS model implementations can offer:

   o  efficient use of vector instructions (compound instructions,
      chaining, etc.), through SGEMM, Level 1 and Level 2 BLAS routines.

   o  vector register reuse, through SGEMM and Level 2 BLAS routines.

   o  efficient cache reuse, through internal blocking, use of local
      arrays, and through SGEMM.

   o  column-wise referencing, for problems with arrays having a leading
      dimension that could cause severe performance degradation with
      row-wise referencing (except for reference patterns in underlying
      BLAS routines).

   o  parallelism, where the concurrency is explicit through automatic
      parallelization of the GEMM-Based Level 3 kernels by the compiler,
      or implicit via use of parallel underlying BLAS kernels.

   o  an opportunity to conveniently create a Level 3 BLAS library based
      on unconventional underlying matrix multiply algorithms like, for
      example, Strassen's or Winograd's algorithms.
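
   The last point can be sketched as follows, again in Python and
   under stated assumptions. Here the multiply kernel is passed as a
   parameter, whereas with the Fortran routines the substitution would
   be made by supplying a different SGEMM. All function names are
   hypothetical. The Strassen variant applies a single level of
   recursion and assumes square matrices of even order. Winograd's
   variant is not shown.

```python
def matmul_naive(A, B):
    """Conventional triple-loop product A*B."""
    inner = len(B)
    cols = len(B[0])
    return [[sum(row[l] * B[l][j] for l in range(inner)) for j in range(cols)]
            for row in A]


def _add(X, Y, sign=1):
    """Elementwise X + sign*Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def matmul_strassen_once(A, B):
    """One level of Strassen's algorithm (7 block products instead of 8)."""
    h = len(A) // 2

    def blk(M, r, c):
        return [row[c * h:(c + 1) * h] for row in M[r * h:(r + 1) * h]]

    a11, a12, a21, a22 = blk(A, 0, 0), blk(A, 0, 1), blk(A, 1, 0), blk(A, 1, 1)
    b11, b12, b21, b22 = blk(B, 0, 0), blk(B, 0, 1), blk(B, 1, 0), blk(B, 1, 1)
    mm = matmul_naive
    m1 = mm(_add(a11, a22), _add(b11, b22))
    m2 = mm(_add(a21, a22), b11)
    m3 = mm(a11, _add(b12, b22, -1))
    m4 = mm(a22, _add(b21, b11, -1))
    m5 = mm(_add(a11, a12), b22)
    m6 = mm(_add(a21, a11, -1), _add(b11, b12))
    m7 = mm(_add(a12, a22, -1), _add(b21, b22))
    c11 = _add(_add(_add(m1, m4), m5, -1), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_add(m1, m2, -1), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


def symm_lower(A, B, matmul=matmul_naive):
    """C = A*B for symmetric A, reading only the lower triangle of A."""
    n = len(A)
    # Expand the stored triangle into a full local array, then hand the
    # whole product to whichever multiply kernel was supplied.
    full = [[A[i][j] if i >= j else A[j][i] for j in range(n)]
            for i in range(n)]
    return matmul(full, B)
```

   Calling symm_lower(A, B, matmul_strassen_once) in place of the
   default changes only the kernel. The symmetric routine itself is
   untouched, which is the convenience the item above refers to.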

   The enclosed file INSTALL contains a guide to facilitate
   installation of these model implementations so that correct results
   are produced, with high and uniform performance.



   Per Ling
   Institute of Information Processing
   University of Umea
   S-901 87 Umea, Sweden
   E-mail: pol@cs.umu.se



   For further information see:

   Dongarra J. J., DuCroz J., Duff I., and Hammarling S., "A Set of
       Level 3 Basic Linear Algebra Subprograms", ACM Trans. Math.
       Softw., Vol. 16, No. 1, 1990, pp.1-17.

   Dongarra J. J., DuCroz J., Duff I., and Hammarling S., "Algorithm
       679: A Set of Level 3 Basic Linear Algebra Subprograms: Model
       Implementation and Test Programs", ACM Trans. Math. Softw.,
       Vol. 16, No. 1, 1990, pp.18-28.

   Kagstrom B. and Van Loan C. "GEMM-Based Level-3 BLAS", Tech. rep.
      CTC91TR47, Department of Computer Science, Cornell University,
      Dec. 1989.

   Kagstrom B., Ling P. and Van Loan C. "High Performance GEMM-Based
      Level-3 BLAS: Sample Routines for Double Precision Real Data",
      in High Performance Computing II, Durand M. and El Dabaghi F.,
      eds., Amsterdam, 1991, North-Holland, pp.269-281.

   Kagstrom B., Ling P. and Van Loan C. "Portable High Performance
      GEMM-Based Level-3 BLAS", in R. F. Sincovec et al., eds.,
      Parallel Processing for Scientific Computing, SIAM
      Publications, 1993.
