.ds CH 
.pl 11i
.LP
.nr LL 6.5i
.ll 6.5i
.nr LT 6.5i
.lt 6.5i
.ft 3
.bp
.R
.sp .5i
 .
.sp
.R
.ta 4.20i
	Distribution Category:
.br
	Mathematics and Computers
.br
	General (UC-32)
.ce 100
.in 0
.sp 1i
.B
.ce 100
-------------
ANL-85-19
-------------
.R
.sp .5i
ARGONNE NATIONAL LABORATORY
.br
9700 South Cass Avenue
.br
Argonne, Illinois  60439
.sp .6i
.ps 12
.ft 3
Comparison of the CRAY X-MP-4,

Fujitsu VP-200, and Hitachi S-810/20:

An Argonne Perspective
.ps 11
.sp 3
.I
Jack J. Dongarra
.ps 10
.R
.ft 1
Mathematics and Computer Science Division
.sp
and
.sp
.I
Alan Hinds
.R
.br
Computing Services
.sp .7i
October 1985
.bp
 .
.sp
.B
.ce 1
.ps 12
Table of Contents
.sp 3
.R
.ps 10
.ta 5.5i
List of Tables	 v

List of Figures	 v

Abstract	 1

1.  Introduction	 1

2.  Architectures	 1

    2.1  CRAY X-MP	 2
    2.2  Fujitsu VP-200	 4
    2.3  Hitachi S-810/20	 6

3.  Comparison of Computers	 8

    3.1  IBM Compatibility of the Fujitsu and Hitachi Machines	 8
    3.2  Main Storage Characteristics	 8
    3.3  Memory Address Architecture	10

         3.3.1  Memory Address Word and Address Space	10
         3.3.2  Operand Sizes and Operand Memory Boundary Alignment	12
         3.3.3  Memory Regions and Program Relocation	13
         3.3.4  Main Memory Size Limitations	13

    3.4  Memory Performance	14

         3.4.1  Memory Bank Structure	14
         3.4.2  Instruction Access	14
         3.4.3  Scalar Memory Access	14
         3.4.4  Vector Memory Access	15

    3.5  Input/Output Performance	16
    3.6  Vector Processing Performance	18
    3.7  Scalar Processing Performance	21

4.  Benchmark Environments	22

5.  Benchmark Codes and Results	23
.sp 2

.sp .70i
.ce 1
iii
.bp
    5.1  Codes	23

         5.1.1  APW	23
         5.1.2  BIGMAIN	24
         5.1.3  BFAUCET and FFAUCET	24
         5.1.4  LINPACK	24
         5.1.5  LU, Cholesky Decomposition, and Matrix Multiply	26

    5.2  Results	28

6.  Fortran Compilers and Tools	28

    6.1  Fortran Compilers	28
    6.2  Fortran Tools	30

7.  Conclusions	31

References	33

Acknowledgments	33

.sp 4.3i
.ce 1
iv
.bp
.ce 1
.B
List of Tables
.R
.sp 3
.ta .3i 6.5iR
  1.	Overview of Machine Characteristics	9
.sp
  2.	Main Storage Characteristics	11
.sp
  3.	Input/Output Features and Performance	17
.sp
  4.	Vector Architecture	19
.sp
  5.	Scalar Architecture	22
.sp
  6.	Programs Used for Benchmarking	25
.sp
  7.	Average Vector Length for BFAUCET and FFAUCET	26
.sp
  8.	LINPACK Timing for a Matrix of Order 100	26
.sp
  9.	LU Decomposition Based on Matrix Vector Operations	27
.sp
 10.	Cholesky Decomposition Based on Matrix Vector Operations	27
.sp
 11.	Matrix Multiply Based on Matrix Vector Operations	28
.sp
 12.	Timing Data (in seconds) for Various Computers	29
.sp
B-1.	Loops Missed by the Respective Compilers	42
.sp 4
.ce 1
.B
List of Figures
.R
.sp 2
  1.	CRAY X-MP/48 Architecture	3
.sp
  2.	Fujitsu VP-200 Architecture	5
.sp
  3.	Hitachi S-810/10 Architecture	7
.sp 1.5i
.ce 1
v
.ft 3
.ps 11
.pn 1
.ds CH "%
.bp
.LP
.EQ
delim @@
.EN
.B
.ps 12
.in 
.ce
Comparison of the CRAY X-MP-4,
.sp
.ce
Fujitsu VP-200, and Hitachi S-810/20:
.sp
.ce
An Argonne Perspective*
.fi
.sp
.R
.AU
.ps 11
.in 0
Jack J. Dongarra\|@size -1 {"" sup \(*}@\h'.15i'
.AI
.ps 10
.in 0
Mathematics and Computer Science Division
.AU
.ps 11
.in 0
Alan Hinds
.AI
.ps 10
.in 0
Computing Services 
.FS
.ps 9
.vs 11p
*Work supported in part by the Applied Mathematical
Sciences subprogram of the Office of Energy Research,
U.S. Department of Energy, under Contract W-31-109-Eng-38.
.FE
.sp 3
.QS
.ps 10
.in +.25i
.ll -.5i
.B
.ce 1
Abstract
.R

A set of programs,
gathered from major Argonne computer users, was run on the current generation of
supercomputers:
the CRAY X-MP-4, Fujitsu VP-200,
and Hitachi S-810/20.
The results show that
a single 
processor of a CRAY X-MP-4 is a consistently strong performer over a
wide range of problems.
The Fujitsu and Hitachi
computers excel on highly vectorized programs and
offer
an attractive opportunity to sites
with IBM-compatible computers.
.in
.ll
.QE
.nr PS 11
.nr VS 16
.nr PD 0.5v
.SH
1. Introduction
.PP
Last year we ran a set of programs,
gathered from major Argonne computer users, on the current generation of 
supercomputers:
the CRAY X-MP-4
at CRAY Research in Mendota Heights, Minnesota;
the Fujitsu VP-200
at the Fujitsu plant in Numazu, Japan;
and 
the Hitachi S-810/20
at the 
Hitachi Ltd. Kanagawa Works in Kanagawa, Japan.
.SH
2. Architectures
.PP
The CRAY X-MP, Fujitsu VP, and Hitachi S-810
computers are all high-performance vector
processors that use pipeline techniques in both scalar and
vector operations and permit concurrent execution by independent
functional units.
All three 
machines use a register-to-register format for
instruction execution.
Each machine has three vector load/store techniques \(em contiguous
element, constant stride, and indirect address (index vector) modes.
All three are optimized for 64-bit floating-point arithmetic operations.
Outstanding features and major differences in these machines
are discussed below and summarized in Table 1 
at the end of this section.
.sp
.B
2.1  CRAY X-MP
.R
.PP
The CRAY X-MP-4 (Figure 1) is the
largest of the family of CRAY X-MP computer models, which range in size from
one to four processors and from one
million to sixteen million words of central memory.
The CRAY X-MP/48 computer consists of four identical pipelined processors,
each with fully segmented scalar and vector functional units with a
9.5-nanosecond clock cycle.
All four processors share an 8-million-word,
high-speed (38-nanosecond cycle time) bipolar central memory, a common I/O
subsystem, and an optional integrated solid-state storage device (SSD).  Each
processor contains a complete set of registers and functional units, and each
processor can access all of the common memory, all of the I/O devices, and the
single (optional) SSD.  The CRAY X-MP-4 vector register set (512 words
per processor) is the smallest in this study.
.PP
The four CRAY CPUs can process four separate and
independent jobs, or they can be organized to work concurrently on a single
job.
This document will focus on the performance of only a single processor of
the CRAY X-MP-4, as none of our benchmark programs were organized to take
advantage of multiple processors.
Thus, in the tables and text that follow, all data
on the capacity and the performance of the CRAY X-MP-4 apply to a single
processor, except for data on the size of memory and the configuration and
performance of I/O devices and the SSD.
.PP
The CRAY X-MP-4 has extremely high floating-point performance for both scalar
and vector applications and both short and long vector lengths.
Each CRAY X-MP
processor has a maximum theoretical floating-point result rate of 210 MFLOPS
(millions of floating-point operations per second)
for overlapped vector multiply and add instructions.  With the optional 
solid-state storage device installed, the CRAY X-MP-4 has an 
input/output bandwidth of over 2.4 billion bytes per second, the largest in
this study; without the SSD, the I/O bandwidth is 424
million bytes per second,
but only 68 million bytes per second is attainable by disk I/O.
The CRAY DD-49 disks have the fastest single disk transfer rate (9.8 million
bytes per second) in this study.
The CRAY permits a maximum of four disk devices on each of
eight disk control units,
the smallest disk subsystem in this study.
.PP
The cooling system for the CRAY X-MP-4 is refrigerated
liquid freon.
.PP
The CRAY X-MP-4 operates with the CRAY Operating System
(COS), a batch operating system designed to
attach by a high-speed
channel or hyperchannel interface
with a large variety of
self-contained, general-purpose front-end computers.
All computing tasks other
than batch compiling, linking, and executing of application programs
must be performed on
the front-end computer.  Alternatively, the CRAY X-MP-4 can operate under CTSS
(CRAY Time-Sharing System,
.bp
  .
.sp 7.8i
.ce
Figure 1
.ce
CRAY X-MP/48 Architecture
.bp
available from Lawrence Livermore National Laboratory), a
full-featured interactive system with background batch computing.
The primary programming languages for the CRAY X-MP
are Fortran 77 and CAL (CRAY Assembly Language); the Pascal and C programming
languages are also available.
.sp 
.B
2.2  Fujitsu VP-200 (Amdahl 1200)
.R
.PP
The Fujitsu VP-200 (Figure 2) is midway in performance in a family of four Fujitsu VP
computers, whose performance levels range to over a billion floating-point
operations per second.  In North America, the Fujitsu VP-200 is marketed and
maintained by the Amdahl Corporation as the Amdahl 1200 Vector Processor.
Although we benchmarked the VP-200 in Japan, the comparisons in this
document will emphasize the configurations of the VP-200 offered by Amdahl in
the United States.  
.PP
The Fujitsu VP-200 is a high-speed, single-processor computer, with up to 32
million words of fast (60-nanosecond cycle time) static MOS central memory.
The VP-200 has separate scalar (15-nanosecond clock cycle) and vector
(7.5-nanosecond clock cycle) execution units, which can execute instructions
concurrently.  A unique characteristic of the VP-200 vector unit is its
large (8192-word) vector register set, which can be
dynamically configured into different numbers and lengths of vector registers.
.PP
The VP-200 has a maximum theoretical floating-point result rate of 533 MFLOPS
for overlapped vector multiply and add instructions.
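.PP
The peak rates quoted for the CRAY X-MP and the VP-200 follow from the
clock cycle and the number of floating-point results delivered per cycle.
The Python sketch below reproduces that arithmetic.
The function name is illustrative, and the count of four results per
VP-200 vector cycle is an assumption inferred from the quoted figures,
not a vendor specification.
.DS L
```python
def peak_mflops(cycle_ns, results_per_cycle):
    """Peak rate in MFLOPS for a unit delivering results_per_cycle each cycle."""
    cycles_per_second = 1.0e9 / cycle_ns
    return results_per_cycle * cycles_per_second / 1.0e6


# CRAY X-MP: one multiply result and one add result per 9.5 ns cycle.
cray = peak_mflops(9.5, 2)

# Fujitsu VP-200: 533 MFLOPS corresponds to four results per 7.5 ns
# vector cycle (assumed here; inferred from the quoted peak rate).
fujitsu = peak_mflops(7.5, 4)

print(int(cray), int(fujitsu))
```
.DE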
.PP
The VP-200 system is
cooled entirely by forced air.
.PP
The Fujitsu VP-200 scalar instruction set and data formats
are fully compatible with the IBM 370 instruction set and data formats;
the VP-200 can execute load modules and share load libraries and datasets that
have been prepared on IBM-compatible computers.  The Fujitsu VP-200 uses
IBM-compatible I/O channels and can attach all IBM-compatible disk and tape
devices and share these devices with other IBM-compatible mainframe computers.
Fujitsu does not offer an integrated solid-state storage device for the VP
computer series, but any such device that attaches to an IBM channel and
emulates an IBM disk device can be attached to the VP-200.  The total I/O
bandwidth of the VP-200 is 96 million bytes per second, the smallest in this study.
Up to 93 million bytes per second can be used for disk I/O;
the maximum single-disk data transfer rate is 3 million bytes per second.
The VP-200 can attach over one thousand disk devices.
.KS
.sp 20
.sp 1.5i
.ce
Figure 2
.ce
Fujitsu VP-200 Architecture
.KE

.PP
The Fujitsu VP-200 operates with the FACOM VP control
program (also called VSP \(em Vector Processor System Program \(em by Amdahl), a
batch operating system designed to interface with an IBM-compatible front-end
computer via a channel-to-channel (CTC) adaptor in a tightly coupled or loosely
coupled network.  
(Internally, Amdahl is running the IBM MVS/XA operating system
on its Amdahl 1200 computer.)
The front-end computer operating system may be Fujitsu's
OS-IV (available only in Japan) or IBM's MVS, MVS/XA, or VM/CMS.
To optimize use
of the VP vector hardware, Fujitsu encourages VP users to perform all computing
tasks, other than executing their Fortran application programs, on the 
front-end computer.
.bp
.PP
Of the three machines in this study, Fujitsu (Amdahl) provides
the most powerful set of optimizing and debugging tools.
With the VSP operating system, interactive tools
must be run on the front-end computer system, but batch versions of the tools
can run on either the front-end or the vector processor.
Fujitsu Fortran 77/VP is
the only programming language that takes
advantage of the Fujitsu VP vector capability,
although object code produced by any other compiler or assembler available for
IBM scalar mainframe computers will execute correctly on the VP in scalar mode.
.sp 
.B
2.3  Hitachi S-810/20
.R
.PP
The Hitachi S-810/20 (Figure 3)
computer is the more powerful of two Hitachi S-810 computers,
which currently are sold only in Japan.  
Little is published in English about the
Hitachi S-810 computers; consequently,
some data 
in the tables and comparisons are inferred and may be
inaccurate.
.PP
The Hitachi S-810/20 is a high-speed, single-processor computer,
with up to 32 million words of fast (70-nanosecond bank cycle time) static MOS
central memory and up to 128 million words of extended storage. 
The computer has separate scalar (28-nanosecond clock cycle) and vector
(14-nanosecond clock cycle) execution units, which can execute instructions
concurrently.
The scalar execution unit is distinguished by its large (32 thousand words)
cache memory.  The S-810/20 vector unit has 8192 words of vector registers,
as well as the largest number of vector functional units and the most
comprehensive vector macro instruction set of the three machines in this study.
The Hitachi S-810 family alone has
the ability to process vectors that are longer
than their vector registers, entirely under hardware control.
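.PP
Where hardware does not handle long vectors in this way, software
must divide a long vector operation into segments that fit a vector
register, a technique commonly called strip mining.
The Python sketch below illustrates the idea only; the function name and
the default segment length of 64 elements are hypothetical choices for
illustration and do not describe any of the three compilers.
.DS L
```python
def strip_mined_add(a, b, segment_length=64):
    """Add two equal-length vectors in register-sized segments."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    result = []
    for start in range(0, len(a), segment_length):
        stop = min(start + segment_length, len(a))
        # One vector instruction would process this whole segment.
        result.extend(x + y for x, y in zip(a[start:stop], b[start:stop]))
    return result


# A 150-element vector is processed as segments of 64, 64, and 22 elements.
total = strip_mined_add(list(range(150)), [1] * 150)
```
.DE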
.PP
The Hitachi
S-810/20 has a maximum theoretical floating-point result rate of 840 MFLOPS for
overlapped vector multiply and add instructions
(two multiply and four add results per cycle).
.PP
The S-810/20 computer is cooled by
forced air across a closed, circulating water radiator.
.PP
Like the Fujitsu VP, the Hitachi S-810/20
scalar instruction set and data formats
are fully compatible with the IBM 370 instruction set and data formats;
the S-810/20 can execute load modules and share load libraries and datasets that
have been prepared on IBM-compatible computers.
The Hitachi S-810/20 uses 
IBM-compatible I/O channels and can attach all IBM-compatible disk and tape
devices and share these devices with other IBM-compatible mainframe computers.
Hitachi's optional
extended storage offers extremely
high-performance I/O.  With the extended storage installed, the Hitachi S-810/20 has
an I/O bandwidth of 1.1 billion bytes per second; without extended storage
the I/O bandwidth is 96 million bytes per second.
Up to 93 million bytes per second can be used for disk I/O;
the maximum single-disk data transfer rate is 3 million bytes per second.
The Hitachi can attach over one thousand disk devices.
.bp
  .
.sp 2
.sp 5.5i
.ce
Figure 3
.ce
Hitachi S-810/10 Architecture

.PP
The Hitachi S-810/20 operates either with a batch operating
system designed to interface with an IBM-compatible front-end computer via a
channel-to-channel (CTC) adaptor in a loosely coupled network, or with
a stand-alone
operating system with MVS-like batch and MVS/TSO-like interactive capabilities.
The primary programming languages for the Hitachi S-810 computers are
Fortran 77 and assembly language,
although object code produced by any assembler or
compiler available for IBM-compatible computers will also execute on
the S-810 computers
in scalar mode.
.bp
.B
3.  Comparison of Computers
.R

.B
3.1  IBM Compatibility of the Fujitsu and Hitachi Machines
.R
.PP
Both Japanese computers run the full IBM System 370 scalar instruction set,
but neither of the Japanese machines can run the IBM 3090 Vector Facility
instruction set. Also, neither Japanese
computer is compatible with the IBM XA extensions to System 370;
each vendor has its own 31-bit address architecture that provides
an XA-equivalent 2-gigabyte address space.
(Amdahl is running MVS/XA on one Amdahl 1200 computer
that they modified to be compatible with XA architecture.)
The Japanese operating systems
simulate IBM MVS system functions at the SVC level. MVS load modules
created
on Argonne's IBM 3033 ran correctly on both the Fujitsu and Hitachi
machines in scalar mode.
.PP
The Japanese computers 
can share datasets
on direct-access I/O equipment
with IBM-compatible front-end
computers. 
Programs can be developed and debugged on the
front-end computers with the user's favorite tools, then recompiled and executed
on the vector processors.
All software tools for the vector
processors will run on IBM-compatible front ends.
Currently the interactive software tools
are MVS TSO/SPF oriented.
.PP
The Japanese Fortran compilers are compatible with IBM VS/Fortran;
the ANSI X3.9 (Fortran 77) standard and (most) IBM extensions are implemented.
.sp
.B
3.2  Main Storage Characteristics
.R
.PP
The main storage characteristics of the three machines in this study
are compared in Table 2.
All three machines have large, interleaved main memories,
optimized for 64-bit-word data transfers,
with bandwidths matched to the requirements of their respective vector units.
Each machine permits vector accesses from contiguous, constant-stride
separated, and scattered (using indirect list-vectors) memory addresses.
All three machines use similar memory error-detection and error-correction
schemes.
The text that follows concentrates on those differences in
main memory that have significant performance implications.
.PP
The CRAY X-MP/48 uses extremely fast bipolar memory, while the Fujitsu and
Hitachi computers use relatively slower static-MOS memory
(see Table 2).
CRAY's choice of the faster but much more expensive bipolar memory
is largely dictated by the need to service four processors from a single,
symmetrically shared main memory.
Fujitsu and Hitachi selected static MOS for its relatively lower cost and
lower heat dissipation.
These MOS characteristics permit much larger memory configurations
without drastic cost and cooling penalties.
Fujitsu and Hitachi compensate for the relatively slower speed of their MOS
memory by providing much higher levels of memory banking and interleaving.
.bp
.ce 1
Table 1
.br
.ce 1
Overview of Machine Characteristics
.TS
center;
lp8 lp8 lp8 lp8.

Characteristic	CRAY X-MP-4	Fujitsu VP-200	Hitachi S-810/20
_	_	_	_

Number of Processors	4	1	1

Machine Cycle Time	9.5 ns vector	7.5 ns vector	14 ns vector
	9.5 ns scalar	15 ns scalar	28 ns scalar

Memory Addressing	Real	Mod. Virtual	Mod. Virtual

Maximum Memory Size	16 Mwords	32 Mwords	32 Mwords

Optional SSD Memory	32; 128 Mwords	Not Available	32; 64; 128 Mwords

SSD Transfer Rate	256 Mwords/s	Not Available	128 Mwords/s

I/O-Memory Bandwidth	50 Mwords/s	12 Mwords/s	12 Mwords/s

	(numbers below are
	per processor)

CPU Memory Bandwidth	315 Mwords/s	533 Mwords/s	560 Mwords/s

Scalar Buffer Memory	64 Words T reg	8192 Words Cache	32768 Words Cache

Vector Registers	512 Words	8192 Words	8192 Words

Vector Pipelines:
 Load/Store Pipes	2 Load; 1 Store	2 Load/Store	3 Load; 1 Load/Store
 Floating Point M & A	1 Mult; 1 Add	1 Mult; 1 Add	2 Add; 2 Mult/Add

Peak Vector (M + A)	210 MFLOPS	533 MFLOPS	840 MFLOPS

Cooling System Type	Freon	Forced Air	Air and Radiator
.bp
Characteristic	CRAY X-MP-4	Fujitsu VP-200	Hitachi S-810/20
_	_	_	_

Operating Systems	CRAY-OS (batch)	VSP (batch)	HAP OS
	CTSS (interactive)

Front Ends	IBM,  CDC,  DEC,	IBM-compatible	IBM-compatible
	Data General,
	Univac, Apollo,
	Honeywell

Vectorizing Languages	Fortran 77	Fortran 77	Fortran 77

Other High-Level
 Languages	Pascal, C, LISP	Any IBM-compat.	Any IBM-compat.

Vectorizing Tools	Fortran Compiler	Fortran Compiler	Fortran Compiler
		FORTUNE	VECTIZER
		Interact. Vectorizer
		Batch Vectorizer

.TE
.sp 2
.B
3.3  Memory Address Architecture

3.3.1  Memory Address Word and Address Space
.R
.PP
The CRAY X-MP uses a 24-bit address, which it interprets as a 16-bit "parcel"
address when referencing instructions and as a 64-bit-word address when
referencing operands.
This addressing duality leads to a 4-million-word address space for
instructions and a 16-million-word address space for operands.
.PP
The two Japanese machines use similar memory addressing schemes, owing to their
mutual commitment to IBM compatibility.
Both Japanese computers allow operating-system selection of
IBM 370-compatible 24-bit addressing or IBM XA-like 31-bit addressing.
These addressing alternatives provide a 2-million-word address
space or a 256-million-word address space, respectively.
The address space is identical for both program instructions and operands.
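.PP
The address-space sizes quoted above follow directly from the address
width and the size of the unit that each address names.
The Python sketch below works through the arithmetic; the function and
constant names are illustrative, and "million words" is taken to mean
2**20 words.
.DS L
```python
MWORD = 2 ** 20          # "million words" taken as 2**20 words
BITS_PER_WORD = 64


def address_space_mwords(address_bits, unit_bits):
    """Words reachable when each address names a unit of unit_bits bits."""
    total_bits = (2 ** address_bits) * unit_bits
    return total_bits // BITS_PER_WORD // MWORD


cray_instructions = address_space_mwords(24, 16)   # 16-bit parcels
cray_operands = address_space_mwords(24, 64)       # 64-bit words
ibm_370_mode = address_space_mwords(24, 8)         # 8-bit bytes
xa_like_mode = address_space_mwords(31, 8)         # 8-bit bytes

print(cray_instructions, cray_operands, ibm_370_mode, xa_like_mode)
```
.DE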
.bp
.ce 1
Table 2
.br
.ce 1
Main Storage Characteristics
.TS
center;
lp8 lp8 lp8 lp8 lp8.

Memory Item	Units	CRAY X-MP-4	Fujitsu VP-200	Hitachi S-810/20
_	_	_	_	_

Memory Type	SECDED	16K-bit Bipolar	64K-bit S-MOS	64K-bit S-MOS

Addressing:	Type	Extended Real	Mod. Virtual	Mod. Virtual
 Paged		No	System Only	System Only
 Address Word	Bits	24	24 or 31	24 or 31
 Address Space	Mwords	4(inst); 16(data)	2; 256	2; 256

Address Boundary:
 Instructions	Bit	16	16	16
 Scalar Data	Bit	64	8	8
 Vector Data	Bit	64	32; 64	32; 64

Vector Addressing		Contiguous	Contiguous	Contiguous
Modes		Constant Stride	Constant Stride	Constant Stride
		Indirect Index	Indirect Index	Indirect Index

Memory Size	Mwords	8; 16	8;   16;   32	4;  8;  16;  32
	Mbytes	64; 128	64;  128;  256	32; 64; 128; 256

Interleave	Sections	4;   4	8;    8;    8	8
	Ways	64;   64	128;  256;  256	128

Cycle Time:
 Section	CP - ns	1CP - 9.5 ns	2CP - 15 ns	1CP - 14 ns
 Bank	CP - ns	4CP - 38 ns	8CP - 60 ns	5CP - 70 ns

Access Time:			From Cache	From Cache
 Scalar	CP - ns	14CP - 133 ns	2CP - 30 ns	2CP - 28 ns
 Vector	CP - ns	17CP - 162 ns	?	?

.bp
Memory Item	Units	CRAY X-MP-4	Fujitsu VP-200	Hitachi S-810/20
_	_	_	_	_

Transfer Rate:		   (per CPU)
 Scalar L/S	Words/CP	1W/19 ns	2W/15 ns	2W/14 ns
 Inst. Fetch	Words/CP	8W/9.5 ns	2W/15 ns	1W/14 ns
 Vect. Load	Words/CP	2W/9.5 ns	8W/15 ns	8W/14 ns
 Vect. Store	Words/CP	1W/9.5 ns	8W/15 ns	2W/14 ns
 Vect. Total	Words/CP	3W/9.5 ns	8W/15 ns	8W/14 ns
 I/O	Words/CP	1W/9.5 ns	?	1W/14 ns

Vector Bandwidth:		   (per CPU)
 L/S Pipes	Pipes	2 Load; 1 Store	2 Load/Store	3 Load; 1 Load/Store
 # Sectors	Sectors		x 2 Sectors	x 2 Sectors

Vector Bandwidth:	Stride	one; odd; even	one; odd; even	one; odd; even
Max. Load	Mwords/s	210; 210; 210	533; 266; 133	560; 560; 560
Max. Store	Mwords/s	105; 105; 105	533; 266; 133	140; 140; 140
Total L/S	Mwords/s	315; 315; 315	533; 266; 133	560; 560; 560

Scalar Buffer Memory:		T Registers	Cache Memory	Cache Memory
 Size	Words	64 	8192	32768
 Block Load 	Words/CP	1W/9.5 ns	8W/60 ns	8W/70 ns
 Access Time	CP - ns	1CP - 9.5 ns	2CP - 15 ns	2CP - 28 ns
 Trans. Rate	Words/CP	1W/9.5 ns	2W/15 ns	2W/28 ns

Instruction Buffer:		128 Words I-stack	Cache Memory	Cache Memory
 Block Load	Words/CP	8W/9.5 ns	8W/60 ns	8W/70 ns
.TE
.sp 2
.B
3.3.2  Operand Sizes and Operand Memory Boundary Alignment
.R
.PP
CRAY X-MP computers have
only two hardware operand sizes:
64-bit integer, real, and logical operands;
and 24-bit integer operands, used primarily for addressing.
All CRAY operands are stored in memory on 64-bit word boundaries.
CRAY program instructions consist of one or two 16-bit "parcels," packed four
to a word.
CRAY instructions are fetched from memory, 32 parcels at a time beginning
on an 8-word memory boundary, into an instruction buffer that in turn is
addressable on 16-bit parcel boundaries.
.bp
.PP
The Japanese computers provide all of the IBM 370 architecture's
operand types and lengths, and some additional ones.
The Fujitsu and Hitachi scalar instruction sets can process 8-bit, 16-bit,
32-bit, 64-bit, and 128-bit binary-arithmetic and logical operands;
8-bit to 128-bit (in units of 8 bits) decimal-arithmetic operands;
and 8-bit to 32768-bit (in units of 8 bits) character operands.
Scalar operands may be aligned in memory on any 8-bit boundary.
However, the Fujitsu and Hitachi vector instruction sets can process only
32-bit and 64-bit binary-arithmetic and logical operands, and these operands
must be aligned in memory on 32-bit and 64-bit boundaries, respectively.
Most of the
Fujitsu and Hitachi
incompatibilities with IBM Fortran programs arise from
vector operand misalignment in COMMON blocks and EQUIVALENCE statements.
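.PP
A simple check can locate such misaligned operands before a program is
moved to the vector processors.
The Python sketch below is illustrative only: it assumes that the COMMON
block itself begins on a 64-bit boundary and that variables are stored
back to back in declaration order, and the block layout shown is
hypothetical.
.DS L
```python
def misaligned_operands(layout):
    """Return names of 4- and 8-byte operands not aligned on their own size.

    layout is a list of (name, size_in_bytes) pairs stored back to back,
    as variables are in a COMMON block.
    """
    offset = 0
    misaligned = []
    for name, size in layout:
        if size in (4, 8) and offset % size != 0:
            misaligned.append(name)
        offset += size
    return misaligned


# Hypothetical block: COMMON /BLK/ I, X, Y with INTEGER*4 I and REAL*8 X, Y.
# X starts at byte 4 and Y at byte 12, so neither is on a 64-bit boundary.
bad = misaligned_operands([("I", 4), ("X", 8), ("Y", 8)])

# Padding with a second INTEGER*4 restores 64-bit alignment.
good = misaligned_operands([("I", 4), ("IPAD", 4), ("X", 8), ("Y", 8)])
```
.DE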
.sp
.B
3.3.3  Memory Regions and Program Relocation
.R
.PP
The CRAY X-MP uses only real memory addresses.
The operating system loads each program into a contiguous region
of memory for instructions and a contiguous region of memory for operands.
The CRAY X-MP uses two base registers to relocate all addresses
in a program; one register uniformly biases all instruction addresses,
and the second register uniformly biases all operand addresses.
.PP
In contrast, the Fujitsu and Hitachi computers use a modified virtual-memory
addressing scheme.
The operating systems and user application programs are each loaded
into a contiguous region of "virtual" memory,
although each may actually occupy noncontiguous "pages" of real memory.
Every virtual address reference must undergo dynamic address translation
to obtain the corresponding real memory address.
As in conventional virtual-memory systems, operating-system pages can be
paged out to an external device, allowing the virtual-memory space to exceed
the underlying real-memory space.
However, user application program pages are never paged out.
Application program address translation is used primarily to avoid
memory fragmentation.
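.PP
The two relocation schemes can be contrasted in a few lines of Python.
This is a sketch under stated assumptions, not a description of either
vendor's hardware: the page size, the page-table contents, and the
function names are hypothetical.
.DS L
```python
PAGE_BYTES = 4096  # hypothetical page size, chosen only for illustration


def relocate_with_base(address, base):
    """Base-register relocation: one uniform bias for every address."""
    return base + address


def translate_paged(virtual_address, page_table):
    """Dynamic address translation through a page table.

    page_table maps a virtual page number to a real page number.
    """
    page, offset = divmod(virtual_address, PAGE_BYTES)
    return page_table[page] * PAGE_BYTES + offset


# A contiguous virtual region occupying noncontiguous real pages.
page_table = {0: 7, 1: 2, 2: 9}

real_a = translate_paged(0, page_table)       # first byte of virtual page 0
real_b = translate_paged(4100, page_table)    # byte 4 of virtual page 1
```
.DE
With base registers the program must occupy one contiguous real region;
with a page table, adjacent virtual pages may map to any free real pages,
which is how address translation avoids fragmentation.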
.sp
.B
3.3.4  Main Memory Size Limitations
.R
.PP
The CRAY X-MP is available with up to 16 million words
of main memory, the maximum permitted by its address space.
This is restrictive compared to the Japanese offerings,
especially as the memory must be shared by four processors.
Currently, the Fujitsu and Hitachi computers offer a maximum of
32 million words of main memory.
However, both Japanese computers
could accommodate expansion to 256 million words
(per program) within the current 31-bit virtual-addressing architecture.
.bp
.B
3.4  Memory Performance

3.4.1  Memory Bank Structure
.R
.PP
The computers on which we ran the benchmark problems were all
equipped with 8 million words of main memory.
The CRAY X-MP/48 memory was divided into 64 independent memory banks,
organized as 4 sections of 16 banks each (later models of the CRAY X-MP are
limited to 32 memory banks).
Both the Fujitsu and Hitachi computer memories are divided into 128
independent memory banks organized as 8 sections of 16 banks each;
Fujitsu memories larger than 8 million words have 256 memory banks
in 8 sections.
In general, the larger numbers of memory banks permit higher bandwidths
for consecutive block memory transfers and fewer bank conflicts from
random memory accesses.
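.PP
The effect of bank count on constant-stride access can be estimated
with a simplified model.
The Python sketch below assumes that word k resides in bank k modulo the
number of banks, that one word is requested per cycle, and that a bank
remains busy for a fixed number of cycles after each access;
it ignores the grouping of banks into sections.
The function names are illustrative.
.DS L
```python
from math import gcd


def banks_visited(n_banks, stride):
    """Number of distinct banks touched by a constant-stride access stream."""
    return n_banks // gcd(n_banks, stride)


def conflict_free(n_banks, stride, bank_busy_cycles):
    """True if each bank has recovered before the stream returns to it."""
    return banks_visited(n_banks, stride) >= bank_busy_cycles


# 64 banks with a 4-cycle bank busy time (the CRAY figures in Table 2).
stride_16_ok = conflict_free(64, 16, 4)   # 4 banks visited, just enough
stride_32_ok = conflict_free(64, 32, 4)   # only 2 banks visited
```
.DE
In this model any odd stride visits every bank, while strides that share
a large power-of-two factor with the bank count revisit a few banks
before they have recovered.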
.sp
.B
3.4.2  Instruction Access
.R
.PP
The CRAY X-MP has four 32-word instruction buffers that can deliver a new
instruction for execution on every clock cycle, leaving the full memory
bandwidth available for operand access.
Each buffer contains 128 consecutive parcels of program instructions,
but the separate buffers need not be from contiguous memory segments.
Looping and branching within the buffers are permitted; entire Fortran DO
loops and small subroutines can be completely contained in the buffer.
An instruction buffer is block-loaded from memory, 32 words at a time,
at the rate of 8 words per 9.5-nanosecond cycle.
.PP
The Fujitsu and Hitachi processors buffer all instruction fetches through
their respective cache memories (see "Scalar Memory Access" below).
The cache bandwidths are adequate to deliver instructions and scalar
operands without conflict.
.sp
.B
3.4.3  Scalar Memory Access
.R
.PP
The CRAY X-MP does not have a scalar cache.
Instead, it has 64 24-bit intermediate-address B-registers and
64 64-bit intermediate-scalar T-registers.
These registers are under program control and can deliver one operand
per 9.5-nanosecond clock cycle to the primary scalar registers.
The user must plan a program carefully to make effective use of the B and T
registers in CRAY Fortran;
variables assigned to B and T registers by the
compiler are never stored in memory.
.bp
.PP
The Fujitsu VP-200 and Hitachi S-810/20 automatically buffer all scalar
memory accesses and instruction fetches through fast cache memories of
8192 words and 32768 words, respectively.
The Fujitsu and Hitachi cache memories can each deliver two words per scalar
clock cycle (15 nanoseconds and 28 nanoseconds, respectively) to their
respective scalar execution units, entirely under hardware control.
.sp
.B
3.4.4  Vector Memory Access
.R
.PP
The computers studied all have multiple data-streaming pipelines to transfer
operands between main memory and vector registers.
Each processor of a CRAY X-MP has three pipelines \(em two dedicated to loads
and one dedicated to stores \(em between its own set of vector registers and
the shared main memory.
(A fourth pipe in each X-MP processor is dedicated to I/O data transfers.)
The Fujitsu VP-200 has two memory pipelines, each capable of both loads
and stores.
The Hitachi S-810/20 has four memory pipelines \(em three dedicated to
loads and one capable of both loads and stores.
.PP
Each CRAY X-MP pipe can transfer one 64-bit word between main storage and
a vector register each 9.5-nanosecond cycle, giving a single-processor
memory bandwidth (excluding I/O) of 315 million words per second and a
four-processor memory bandwidth of 1260 million words per second.
The Fujitsu and Hitachi pipes can each transfer two 64-bit words each memory
cycle (7.5 nanoseconds and 14 nanoseconds, respectively), giving total
memory bandwidths of 533 and 560 million words per second, respectively.
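.PP
These bandwidths are the product of the number of pipes, the words moved
per pipe per cycle, and the cycle rate.
The Python sketch below repeats the calculation with the nominal cycle
times; the function name is illustrative.
With a nominal 14-nanosecond cycle the Hitachi figure comes out near 571
rather than 560, presumably because the quoted cycle time is rounded.
.DS L
```python
def bandwidth_mwords(pipes, words_per_cycle, cycle_ns):
    """Aggregate pipe bandwidth in millions of words per second."""
    return pipes * words_per_cycle * 1000.0 / cycle_ns


cray_cpu = bandwidth_mwords(3, 1, 9.5)      # about 315
cray_system = 4 * int(cray_cpu)             # 4 x 315 = 1260
fujitsu = bandwidth_mwords(2, 2, 7.5)       # about 533
hitachi = bandwidth_mwords(4, 2, 14.0)      # about 571 with the nominal 14 ns
```
.DE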
.PP
For indirect-address operations (scatter/gather) and for constant strides
different from one, the Fujitsu computer devotes one of its memory pipelines to
generating operand addresses; its maximum memory-to-vector register bandwidth
is 266 million words per second for scatter/gather and odd-number constant
strides, and 133 million words per second for even-number constant strides.
.PP
All three machines can automatically "chain" their load and store pipelines
with their vector functional pipelines.
Thus, vector instructions need not wait for a vector load to complete, but
can begin execution as soon as the first vector element arrives from memory.
Likewise, vector stores can begin as soon as the first result is available
in a vector register.
In the limit, pipelines can be chained to create a continuous flow of operands
from memory, through the vector functional unit(s),
and back to memory with an unbroken stream of finished results.
In this "memory-to-memory" processing mode, the vector registers serve as
little more than buffers between memory and the functional units.
The 
CRAY X-MP's three memory pipes permit memory-to-memory operation
with two input operand streams and one result stream.
With only two memory pipes, the Fujitsu VP-200 can function in memory-to-memory
mode only if one of the input operands is already in a vector register, or
if one of the operands is a scalar, and not at all if the vector stride is
different from one.
The Hitachi, with four memory pipes, can function in memory-to-memory mode
with up to three input operand streams and one result stream;
add to this the Hitachi's ability to automatically process vectors that are
longer than its vector registers, and the Hitachi can be viewed
as a formidable memory-to-memory processor.
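.PP
The benefit of chaining can be seen in a simple cycle-count model of a
load, one arithmetic operation, and a store on a vector of n elements.
The Python sketch below is only a model: the startup values are
hypothetical, and each pipeline is assumed to deliver one element per
cycle once started.
.DS L
```python
def unchained_cycles(n, load_startup, op_startup, store_startup):
    """Each stage waits for the previous one to finish all n elements."""
    return (load_startup + n) + (op_startup + n) + (store_startup + n)


def chained_cycles(n, load_startup, op_startup, store_startup):
    """Each stage starts as soon as the first element reaches it."""
    return load_startup + op_startup + store_startup + n


# Hypothetical startup times of 10, 5, and 10 cycles; 64-element vector.
slow = unchained_cycles(64, 10, 5, 10)
fast = chained_cycles(64, 10, 5, 10)
```
.DE
In the unchained case the vector length is paid once per stage; with
chaining it is paid once in total, so the advantage grows with the
number of chained stages.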
.sp
.B
3.5  Input/Output Performance
.R
.PP
Table 3 summarizes the input/output features and
performance of the CRAY X-MP, the Fujitsu, and the Hitachi.
This information is entirely from the manufacturers' published machine
specifications;
no I/O performance comparisons were included in our tests.
.PP
Both the CRAY and Hitachi I/O subsystems have optional 
integrated solid-state storage devices, with data transfer 
rates of 2048 and 1024 Mbytes per second,
respectively, over specialized channels.  The I/O 
bandwidth of one of these devices dwarfs the I/O bandwidth of the entire disk 
I/O subsystem on each machine.  The Fujitsu computers can attach only those 
solid-state storage devices that emulate standard IBM disk and drum devices 
over standard Fujitsu 3-Mbyte-per-second channels.
.PP
The IBM-compatible disk I/O subsystems on the two Japanese computers have a 
much larger aggregate disk storage capacity than the CRAY.
The CRAY can attach
a maximum of 32 disk units, while Fujitsu and Hitachi can each attach
over one thousand disks.
CRAY permits a maximum of 8 concurrent disk data transfers, 
while Fujitsu and Hitachi permit as many concurrent disk data transfers as 
there are channels
(up to 31; at least one channel is required for front-end
communication).
Individually, CRAY's DD-49 disks 
can transfer data sequentially at the rate of 10 Mbytes per second, compared 
with only 3 Mbytes per second for the IBM 3380-compatible disks used by 
Fujitsu and Hitachi.  But the maximum concurrent CRAY disk data rate (four
DD-49 data streams on each of two I/O processors) is only 68 Mbytes per second, 
compared with 93 Mbytes per second for the two Japanese computers.  The disks 
used on all three computers should have very similar random access performance,
which is dominated by access time rather than data transfer rate.
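.PP
The aggregate figures quoted above follow from simple arithmetic on the
per-channel and per-device rates.
The sketch below reproduces that arithmetic; the function names are
hypothetical, and the 8-byte word is an assumption that matches the 64-bit
words of these machines.
.DS
```python
WORD_BYTES = 8  # assumed: one 64-bit word is 8 bytes


def aggregate_disk_rate(channels, reserved, mbytes_per_channel):
    """Concurrent disk rate when some channels are reserved for other use."""
    return (channels - reserved) * mbytes_per_channel


def ssd_rate_mbytes(mwords_per_second):
    """Convert a solid-state device rate from Mwords/s to Mbytes/s."""
    return mwords_per_second * WORD_BYTES


print(aggregate_disk_rate(32, 1, 3))   # Fujitsu or Hitachi disk channels
print(ssd_rate_mbytes(256))            # CRAY SSD
print(ssd_rate_mbytes(128))            # Hitachi SSD
```
.DE
.PP
With 32 channels, one reserved for the front end, and 3 Mbytes per second per
channel, the first line gives the 93 Mbytes per second quoted for the two
Japanese machines; the other two lines give the 2048 and 1024 Mbytes per
second quoted for the solid-state devices.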
.bp
.ce 1
Table 3
.br
.ce 1
Input/Output Features and Performance

.TS
center;
lp9 lp9 lp9 lp9.
I/O Features	CRAY X-MP-4	Fujitsu VP-200	Hitachi S-810/20
_

Disk I/O Channels:			
  Disk I/O Processors	2 I/O Subsystems	2 I/O Directors	2 I/O Directors
  Channels per IOP	1	16	16
  Maximum Channels	2	32	32
  Data Rate/Channel	100 MB/s	3 MB/s	3 MB/s
  Total Bandwidth	200 MB/s	96 MB/s	96 MB/s

Disk Controllers:	DCU-5	6880	3880-equivalent
  Max. per Channel	4	8	16
  Max. Controllers	8	128	256
  Disks/Controller	4	4-64	4-16
  Data Paths/Controller	1	2	2
  Bandwidth/Controller	12 MB/s	6 MB/s	6 MB/s

Disk Devices:	DD-39; DD-49	6380	3380-equivalent
  Storage Capacity	1200 MB; 1200 MB	600 MB; 1200 MB	600 MB; 1200 MB
  Data Transfer Rate	6 MB/s; 10 MB/s	3 MB/s	3 MB/s
  Average Seek Time	18 ms; 16 ms	15 ms	15 ms
  Average Latency	9 ms; 9 ms	8 ms	8 ms
  Maximum Striping	5; 3	24	?
  Max. Disk Bandwidth	45 MB/s; 68 MB/s	93 MB/s	93 MB/s

Integrated SSD:	Optional	Not Available	Optional
  Capacity (Mwords)	32; 64; 128		32; 64; 128
  Data Transfer Rate	256 Mwords/s		128 Mwords/s
.TE
.PP
CRAY includes up to 8 Mwords of I/O subsystem buffer memory between its 
CPUs and its disk units.  This I/O buffer memory permits 100-Mbyte-per-second 
data transfer between the I/O subsystem and a single CRAY CPU.
The IBM 3880-compatible disk controllers
used by the two Japanese machines permit up to 2
Mwords of cache buffer memory on each controller.  This disk controller cache 
does not increase peak data transfer rates but serves to reduce average record
access times.
.bp
.PP
All three machines permit "disk striping" to increase I/O performance \(em the 
data blocks of a single file can be interleaved over multiple disk devices to 
allow concurrent data transfer for a single file.  CRAY allows certain disks to
be designated as striping volumes at the system level; striped and non-striped 
datasets may not reside on the same disk volume.  A single CRAY file may be 
striped over a maximum of three DD-49 or five DD-39 disk units.  Fujitsu and 
Hitachi permit striping on a Fortran dataset basis; striped and non-striped datasets 
may reside on the same disk volume.  A single Fujitsu dataset may be striped 
over as many as 24 disk volumes.
Fortran programs compiled by the Japanese Fortran compilers
in scalar mode can use disk striping on any IBM-compatible computer.
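.PP
As an illustration of the interleaving that striping implies, the sketch below
maps the blocks of one file onto several disks.
A simple round-robin placement is assumed here; the actual placement rules of
the three systems are not described in this report, and the function names are
hypothetical.
.DS
```python
def stripe_location(block, n_disks):
    """Return (disk index, block offset on that disk) for a file block."""
    return block % n_disks, block // n_disks


def blocks_per_disk(n_blocks, n_disks):
    """Count how many blocks of the file land on each disk."""
    counts = [0] * n_disks
    for block in range(n_blocks):
        disk, _ = stripe_location(block, n_disks)
        counts[disk] += 1
    return counts


print(stripe_location(7, 3))
print(blocks_per_disk(10, 3))
```
.DE
.PP
Because consecutive blocks fall on different disks, a sequential read can keep
all of the disks transferring at once.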
.sp
.B
3.6  Vector Processing Performance
.R
.PP
Table 4 shows the vector architectures of the
three computers studied.
All three machines are vector register based, 
with multiple pipelines connecting the vector registers with main memory.
All three have multiple vector functional units, permit concurrency among 
independent vector functional units and with the load/store pipelines, and 
permit flexible chaining of the vector functional units with each other and 
with the load/store pipelines.  Although Fujitsu and Hitachi permit both 32-bit
and 64-bit vector operands, all vector arithmetic on all three machines is 
performed in and optimized for 64-bit floating point.  The three vector units 
differ primarily in the numbers and lengths of vector registers, the numbers of
vector functional units, and the types of vector instructions.
.PP
Of the three machines, the CRAY has the smallest number and size of vector 
registers.  Each CRAY X-MP processing unit has 8 vector registers of 64 elements, 
while the Fujitsu and Hitachi computers each have 8192-word vector register 
sets.  The Fujitsu vector registers can be dynamically configured into 
different numbers and lengths of vector registers (see Table 4), ranging from a
minimum of 8 registers of 1024 words each to a maximum of 256 registers of 32 
words each.
The Fujitsu Fortran compiler uses the vector-length information 
available at compile time to try to optimize the vector register configurations
for each loop.  The Hitachi has 32 vector registers, fixed at 256 elements 
each, but with the unique ability to process longer vectors without the user or
the compiler dividing them into sections of 256 elements or less; the Hitachi 
hardware can automatically repeat a long vector instruction for successive 
vector segments.  The HAP Fortran compiler decides when to divide vectors into 
256-element segments and when to process entire vectors all at once, based on 
whether intermediate results in a vector register can be used in later 
operations.
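.PP
Dividing a long vector into register-sized sections can be pictured as follows.
This is a minimal sketch with hypothetical function names; it models what the
user or compiler must do explicitly on a machine without automatic repetition,
and what the Hitachi hardware is described above as doing by itself.
.DS
```python
def segments(n, register_length=256):
    """Split a vector of n elements into (start, length) pieces."""
    return [(start, min(register_length, n - start))
            for start in range(0, n, register_length)]


def segmented_add(a, b, register_length=256):
    """Add two equal-length vectors one register-sized section at a time."""
    result = []
    for start, length in segments(len(a), register_length):
        stop = start + length
        # one "vector instruction" on a section that fits in a register
        result.extend(x + y for x, y in zip(a[start:stop], b[start:stop]))
    return result


print(segments(600))
```
.DE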
.bp
.ce 1
Table 4
.br
.ce 1
Vector Architecture
.TS
center;
lp8 lp8 lp8 lp8.

Vector Processing Item	CRAY X-MP-4	Fujitsu VP-200	Hitachi S-810/20
_
Vector Registers:
  Configuration	Fixed	Reconfigurable	Fixed
  Total Capacity	512 Words/CPU	8192 Words	8192 Words
  Number x Size	8x64 Words	8x1024 Words	32x256 Words
 		16x512 Words		
 		32x256 Words		
 		64x128 Words		
 		128x64 Words		
 		256x32 Words		
  Mask Registers	64 Bits	8192 Bits	8x256 Words

Vector Pipelines	  (per CPU)			
  Load/Store	2 Load; 1 Store	2 Load/Store	3 Load;1 Load/Store
  Floating Point	1 Mult; 1 Add;	1 Mult; 1 Add	2 Add/Shift/Logic
	1 Recip. Approx.	1 Divide	1 Mult/Divide/Add
			1 Mult/Add
  Other	1 Shift; 1 Mask	1 Mask	1 Mask
	2 Logical
Maximum Vector Result Rates			
(64-bit results):			
  Floating Point Mult.	105 MFLOPS	267 MFLOPS	280 MFLOPS
  Floating Point Add	105 MFLOPS	267 MFLOPS	560 MFLOPS
  Floating Point Divide	33 MFLOPS	56 MFLOPS	70 MFLOPS
  Floating Mult. & Add	210 MFLOPS	533 MFLOPS	560 MFLOPS
 			840 (M+2A)
Vector Data Types:			
  Floating Point	64-bit	32-bit; 64-bit	32-bit; 64-bit
  Fixed Point	64-bit	32-bit	32-bit
  Logical	64-bit	1-bit; 64-bit	64-bit
Vector Macro Instructions:			
  Masked Arithmetic	No	Yes	Yes
  Vector Compress/Expand	Yes	Yes	Yes
  Vector Merge under Mask	Yes	No	No
  Vector Sum (S=S+Vi)	No	Yes	Yes
.bp
Vector Processing Item	CRAY X-MP-4	Fujitsu VP-200	Hitachi S-810/20
_	_	_	_

Vector Macro Instructions:
  Vector Prod (S=S*Vi)	No	No	Yes
  DOT Product (S=S+Vi*Vj)	No	Chain	Yes
  DAXPY (Vi=Vi+S*Xi)	Chain	Chain	Yes
  Iteration (Aj=Ai*Bi+Ci)	No	No	Yes
  Max/Min (S=MAX(S,Vi))	No	Yes	Yes
  Fix/Float (Vi=Ii;Ii=Vi)	Chain	Yes	Yes
.TE
.sp
.PP
The Hitachi has more vector arithmetic pipelines than the CRAY and Fujitsu 
computers.  These pipelines permit the Hitachi to achieve higher peak levels of
concurrency than CRAY and Fujitsu.  Depending on the operation mix, the Hitachi
can drive two vector add and two vector multiply+add pipelines concurrently, 
for an instantaneous result rate of 840 MFLOPS.  If the program operation mix 
is inappropriate, however, the extra pipelines are just expensive unused 
hardware.  The HAP Fortran "pair-processing" option often increases performance
by dividing a vector
in two
and processing each half concurrently through a separate 
pipe.  For long vectors, pair-processing can double the result rate; but for 
short vectors, startup overhead can result in reduced performance.  The HAP 
Fortran compiler permits pair-processing to be selected on a program-wide, 
subroutine-wide, or individual loop basis.
Pair-processing was the compiler default
for all our timings. Previous S-810 benchmarks that reported
relatively poorer performance were done without pair-processing [3].
.PP
The Fujitsu and Hitachi computers have larger and more powerful vector 
instruction sets than the CRAY.  These macro instruction sets make these 
machines more "compilable" and more "vectorizable" than the CRAY.  Especially 
valuable are the macro instructions that reduce an entire vector operation to 
a single result, such as the vector inner (or dot) product.  The CRAY, lacking 
such instructions, must normally perform these operations in scalar mode, 
although vectorizable algorithms exist for long CRAY vectors.  The Hitachi has the 
richest set of vector macro-instructions, with macro functional units to match.
Both Fujitsu and Hitachi have single vector instructions or
two-instruction chains to extract 
the maximum and minimum elements of a vector, to sum the elements of a vector, 
to take the inner product of two vectors, and to convert vector elements 
between fixed point and floating point representations.  To these, the Hitachi 
adds a vector product reduction, the DAXPY sequence common in linear algebra, 
and a vector iteration useful in finite-difference calculations.
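.PP
For readers unfamiliar with these operations, the sketch below spells out in
scalar form what each macro instruction of Table 4 computes.
The function names are hypothetical, and the reading of the "iteration" entry
as a first-order recurrence (each new element computed from the previous one)
is an assumption based on the formula given in the table.
.DS
```python
def vector_sum(v, s=0.0):
    for x in v:
        s = s + x            # S = S + Vi
    return s


def dot_product(u, v, s=0.0):
    for x, y in zip(u, v):
        s = s + x * y        # S = S + Vi*Vj
    return s


def daxpy(s, x, v):
    return [vi + s * xi for vi, xi in zip(v, x)]   # Vi = Vi + S*Xi


def iteration(a0, b, c):
    # ASSUMED reading of Aj = Ai*Bi + Ci: a recurrence with j = i + 1
    a = [a0]
    for bi, ci in zip(b, c):
        a.append(a[-1] * bi + ci)
    return a


print(dot_product([1, 2, 3], [4, 5, 6]))
```
.DE
.PP
The sum, the dot product, and the recurrence all carry a value from one
element to the next, which is why they need dedicated instructions or chains
rather than a plain element-by-element vector operation.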
.PP
The 
only CRAY masked vector instructions are the vector compress/expand and 
conditional vector merge instructions; the CRAY Fortran compiler uses these 
instructions to vectorize loops with only a single IF statement.
The CRAY can hold logical data for only a single vector 
register.
Both Japanese computers,
on the other hand,
have masked arithmetic instructions that permit 
straightforward vectorization of loops with IF statements.  The Fujitsu and 
Hitachi computers have
mask register sets that can hold logical data for every vector register 
element.
These large mask register sets, and vector logical instructions to 
manipulate these masks, should make the Japanese machines strong candidates 
for logic programming.  These machines can hold the results of many 
different logical operations in their multiple mask registers, eliminating the 
need to recompute masks that are needed repeatedly, and permitting the 
vectorization of loops with multiple, compound, and nested IF statements.
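.PP
The two approaches to a loop containing an IF statement can be contrasted in a
short sketch.
Both functions below are hypothetical illustrations of the same guarded
division; the first models masked arithmetic, the second the compress/expand
style, and neither is taken from any of the three compilers.
.DS
```python
def masked_divide(a, b):
    """Vector form of: IF (B(I) .NE. 0) A(I) = A(I) / B(I)."""
    mask = [bi != 0 for bi in b]                 # vector compare fills a mask
    # masked arithmetic: the operation is applied only where the mask is set
    return [ai / bi if m else ai for ai, bi, m in zip(a, b, mask)]


def compress_expand_divide(a, b):
    mask = [bi != 0 for bi in b]
    idx = [i for i, m in enumerate(mask) if m]   # compress: select elements
    quotients = [a[i] / b[i] for i in idx]       # dense vector operation
    out = list(a)
    for i, q in zip(idx, quotients):             # expand: put results back
        out[i] = q
    return out


print(masked_divide([4.0, 5.0, 6.0], [2.0, 0.0, 3.0]))
```
.DE
.PP
A mask computed once can be reused by several later masked operations, which
is the saving that multiple mask registers make possible.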
.sp
.B
3.7  Scalar Processing Performance
.R
.PP
Table 5 compares
the scalar architectures of the three machines studied.
.PP
All three 
computers permit scalar and vector instruction concurrency; CRAY permits 
concurrency among all its functional units.
The Fujitsu and Hitachi computers are compatible with the IBM System/370; 
they implement the complete System/370 scalar instruction set and scalar register 
sets (Fujitsu provides four additional floating-point registers).
.PP
CRAY computers use multiple, fully-segmented functional units for both scalar 
and vector instruction execution, while Fujitsu and Hitachi use an unsegmented 
execution unit for all scalar instructions.
CRAY 
computers can begin a scalar instruction on any clock cycle; more than one CRAY
scalar instruction can be in execution at a given time, in the same and in 
different functional units.
Fujitsu and Hitachi, on the other hand, perform their 
scalar instructions one at a time, many taking more than one cycle.
Thus, even though many individual scalar instructions take less time
on the Fujitsu than on the CRAY, the CRAY will often have a higher 
scalar result rate because of concurrency.  In our benchmark set, a single 
processor of the CRAY X-MP-4 outperformed both the Fujitsu VP-200 and the 
Hitachi S-810/20 on most of the programs that were dominated by scalar floating 
point instruction execution.
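.PP
A rough timing model shows why concurrency can outweigh shorter instruction
times.
The sketch below is an idealization with hypothetical function names: it
assumes fully independent operations, one issue per clock on the segmented
unit, and no memory delays, and it uses the floating-point multiply times from
Table 5.
.DS
```python
def segmented_time_ns(n, latency_cp, cycle_ns):
    """n independent operations issued one per clock into a segmented unit."""
    return (latency_cp + n - 1) * cycle_ns


def unsegmented_time_ns(n, latency_cp, cycle_ns):
    """n operations executed strictly one after another."""
    return n * latency_cp * cycle_ns


n = 10
cray = segmented_time_ns(n, 7, 9.5)      # 7 CP multiply, 9.5 ns clock
fujitsu = unsegmented_time_ns(n, 4, 15)  # 4 CP multiply, 15 ns clock
print(cray, fujitsu)
```
.DE
.PP
Under these assumptions ten independent multiplies take 152 nsec on the
segmented unit and 600 nsec on the unsegmented one, although a single multiply
is slower on the CRAY (66.5 nsec against 60 nsec).
Real code has dependencies between operations, so the actual advantage is
smaller.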
.PP
The Fujitsu and Hitachi computers have larger and more powerful general-purpose
instruction sets than the CRAY, and more flexible data formats for integer and 
character processing.  Thus, applications that are predominately scalar but 
use little floating-point arithmetic may well execute faster on these 
IBM-compatible computers than on a CRAY.  We had no applications in our benchmark 
to measure such performance.
.bp
.ce 1
Table 5
.br
.ce 1
Scalar Architecture

.TS
center;
lp8 cp8 cp8 cp8.
Scalar Processing Item	CRAY X-MP-4	Fujitsu VP-200	Hitachi S-810/20
_
Scalar Cycle Time	9.5 nsec	15 nsec	28 nsec

Scalar Registers:			
  General/Addressing	8x24-bit	16x32-bit	16x32-bit
  Floating Point	8x64-bit	8x64-bit	4x64-bit

Scalar Buffer Memory:	T-Registers	Cache Memory	Cache Memory
  Capacity	64 Words	8192 Words	32768 Words
  Memory Bandwidth	105 Mwords/sec	67 Mwords/sec	112 Mwords/sec
  CPU Access Time	1 CP - 9.5 nsec	2 CP - 30 nsec	1 CP - 28 nsec
  CPU Transfer Rate	1 Word/9.5 nsec	1 Word/15 nsec	1 Word/28 nsec

Scalar Execution Times:			
  Floating Point Mult.	7 CP - 66.5 nsec	4 CP - 60 nsec	3 CP - 84 nsec
  Floating Point Add	6 CP - 57.0 nsec	3 CP - 45 nsec	2 CP - 56 nsec

Scalar Data Types:
  Floating Point	64-bit	32; 64; 128-bit	32; 64; 128-bit
  Fixed Point	24; 64-bit	16; 32-bit	16; 32-bit
  Logical	64-bit	8; 32; 64-bit	8; 32; 64-bit
  Decimal	None	1 to 16-bytes	1 to 16-bytes
  Character	None	1 to 4096-bytes	1 to 4096-bytes
.TE
.sp
.B
4.  Benchmark Environments
.R
.PP
We spent two days
at Cray Research compiling and running
the benchmark on the CRAY X-MP-4.
The CRAY programs were one-processor tests;
no attempt was made to exploit the additional processors.
.PP
For the Japanese benchmarks,
we sent ahead
a preliminary tape of our benchmark source programs and some
load modules produced at Argonne.
At both Fujitsu and Hitachi
the load modules
ran without problem, demonstrating that the machines are in fact
compatible with IBM computers on both instruction set and operating
system interface levels. (Of course, these tests did not
use the vector features of the machines.)
.bp
.PP
The
VP-200 tests were run at the Fujitsu plant in Numazu, Japan, during a one-week
period.
We had as much time on the VP-200 as needed. 
The front-end machine was a 
Fujitsu M-380 (approximately twice as fast as a single
processor of an IBM 3081 K).
.PP
The Hitachi S-810/20 tests were run
at the Hitachi Kanagawa Works,
during two afternoons.
The Hitachi S-810/20 benchmark configuration had no
front-end system. 
Instead, we compiled, linked, ran, and printed output
directly on the machine.
.PP
The physical environment of the Hitachi S-810/20 at Kanagawa is noteworthy.
The machine room
was not air-conditioned; a window was opened
to cool off the area. The outside
temperature exceeded 100 degrees Fahrenheit on the first day,
and we estimate that the computer room temperature 
was well above 100 degrees,
with high humidity; yet the computer ran without problem.
.sp
.B
5.  Benchmark Codes and Results
.R
.sp
.B
5.1  Codes
.PP
We asked some of the 
major computer users at Argonne for typical Fortran
programs that would help in judging the performance
of these vector machines.
We gathered 20 programs,
some simple kernels, others 
full production codes. 
The programs are itemized in Table 6.
.PP
Four of the programs have very little vectorizable Fortran 
(for the most part they are scalar programs):
BANDED, NODAL0, NODAL1, SPARSESP.
Both STRAWEXP and
STRAWIMP have many calculations involving short vectors.
For most of these programs the CRAY X-MP performed fastest,
with the Fujitsu faster than the Hitachi.
.PP
Below we describe some of the benchmarks and 
analyze the results.
.sp
.B
5.1.1  APW
.R
.PP
The APW program is a solid-state quantum mechanics
electronic structure
code.
APW
calculates self-consistent field wave functions and energy band structures
for a sodium chloride lattice using an antisymmetrized plane wave
basis set and a muffin-tin potential.
The majority of loops in this program are short and are coded
as IF loops rather than DO loops; they do not vectorize
on any of the benchmarked computers. The calculations are
predominately scalar.
.bp
.PP
This program highlights the CRAY X-MP advantage when executing
"quasi-vector" code (vector-like loops that
do not vectorize for some reason). The CRAY executes
scalar code on segmented functional units and can achieve
a higher degree of concurrency in scalar 
than either the Fujitsu or Hitachi machines, which execute
scalar instructions one at a time.
.sp
.B
5.1.2  BIGMAIN
.R
.PP
BIGMAIN is a highly vectorized Monte Carlo algorithm for computing
Wilson line observables in SU(2) lattice gauge theory. This program
has the longest vector lengths
of the benchmarks. All the vectors begin on the same memory bank
boundary, and all have a stride of twelve.
The only significant nonvectorized
code is an IF loop, which seriously limits the peak performance.
.PP
The superior performance of the CRAY on BIGMAIN reflects both the CRAY's
insensitivity to the vector stride and its greater levels of concurrency
when executing scalar loops.
The Fujitsu performance reflects a quartering
of memory bandwidth when using a vector stride of twelve.
The Hitachi performance reflects its slower 
scalar performance.
.sp
.B
5.1.3  BFAUCET and FFAUCET
.R
.PP
BFAUCET and FFAUCET
compute the ground state energies of drops of liquid
helium by the variational Monte Carlo method.
The BFAUCET codes
involve Bose statistics, and a table-lookup operation is an important
component of the time.
The FFAUCET cases use Fermi statistics
and are dominated by the evaluation of determinants using LU decomposition.
The different cases correspond to different sized drops,
as shown in Table 7.
.PP
BFAUCET1, 2, and 3 and FFAUCET1 and 2 perform only a single
Monte Carlo iteration each;
these cases are typical of checkout runs and are dominated
by non-repeated setup work.
BFAUCET4, 5, and 6 and FFAUCET3 are long production runs.
.sp 
.B
5.1.4  LINPACK
.R
.PP
The LINPACK timing is dominated by memory reference
as a result of array access through the calls to SAXPY.
For this problem
the vector length
changes during the calculation from length 100 down to length 1
(see Table 8).
.PP
The Fujitsu and Hitachi results reflect the fact that these machines do not
perform as well as the CRAY on short vectors.
.bp
.ce 1
Table 6
.br
.ce 1
Programs Used for Benchmarking

.TS
center;
lp8 lp8 cp8
lp8 np8 lp8.
Code	No. of Lines	Description
_	_	_
APW	1448	Solid-state code, for anti-symmetric plane wave calculations for solids.

BANDED	1539	Band linear algebra equation solver, for parallel processors.

BIGMAIN	774	Vectorized Monte Carlo algorithm, for SU(2) lattice gauge theory.

DIF3D	527	1, 2, and 3-D diffusion theory kernels.

LATFERM3	1149	Statistical-mechanical approach to lattice gauge calculations.

LATFERM4	1149	Statistical-mechanical approach to lattice gauge calculations.

LATTICE8	1149	Statistical-mechanical approach to lattice gauge calculations.

MOLECDYN	1020	Molecular dynamics code simulating a fluid.

NODAL0	345	Kernel of 3-D neutronics code using nodal method.

NODAL1	345	Kernel of 3-D neutronics code using nodal method.

NODALX	345	Kernel of 3-D neutronics code using nodal method.

BFAUCET	5460	Variational Monte Carlo for drops of He-4 atoms \(em Bose statistics.

FFAUCET	5577	Variational Monte Carlo for drops of He-3 atoms \(em Fermi statistics.

SPARSESP	1617	ICCG for non-symmetric sparse matrices based on normal equations.

SPARSE1	3228	MA32 from the Harwell library sparse matrix code using frontal
		techniques and software run on a 64 x 64 problem.

STRAWEXP	4806	2-D nonlinear explicit solution of finite element program with weakly
		coupled thermomechanical formulation in addition to structural and
		fluid structural interaction capability.

STRAWIMP	4806	Same as STRAWEXP but implicit solution.
.TE
.bp
.ce 1
Table 7
.br
.ce 1
Average Vector Length for BFAUCET and FFAUCET

.TS
center;
l l
l n.
Case	Average Vector Length
_	_
BFAUCET1	10
BFAUCET2	35
BFAUCET3	56
BFAUCET4	120
BFAUCET5	10
BFAUCET6	35

FFAUCET1	10
FFAUCET2	17
FFAUCET3	10
.TE
.sp 4
.KF
.ce 
Table 8
.br
.ce
LINPACK Timing for a Matrix of Order 100

.TS
center;
l l l
l n n.
Machine	MFLOPS	Seconds
_	_	_
CRAY X-MP	21	.032
Fujitsu VP-200	17	.040
Hitachi S-810/20	17	.042
.TE
.KE
.sp 2
.B
5.1.5  LU, Cholesky Decomposition, and Matrix Multiply 
.R
.PP
The LU, Cholesky decomposition, and matrix multiply
benchmarks
are based on matrix vector operations. 
As a result, memory reference is not a limiting factor since 
results are retained in vector registers during the operation.
The technique used in these tests is based on vector unrolling [1],
which works equally well on CRAY, Fujitsu, and Hitachi machines.
.bp
.PP
The routines used in Tables 9 through 11 
have a very high percentage of floating-point arithmetic operations.
The algorithms are all based on column accesses to the matrices.
That is, the programs reference array elements sequentially
down a column, not across a row.
With the exception of matrix multiply,
the vector lengths start out as the order of the matrix and
decrease during the course of the computation to a vector length
of one.
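.PP
The column-oriented access pattern can be illustrated with a matrix-vector
product that accumulates whole columns into a result vector.
The sketch below is an assumption about the general shape of such code, with a
hypothetical function name and an unrolling depth of two chosen for brevity;
it is not the benchmark source.
.DS
```python
def matvec_columns(a, x, n):
    """y = A*x with A stored as a list of n columns, each of length n."""
    y = [0.0] * n                       # accumulator, kept "in a register"
    j = 0
    while j + 1 < n:                    # two columns per pass (unrolled by 2)
        cj, ck = a[j], a[j + 1]
        xj, xk = x[j], x[j + 1]
        y = [yi + cji * xj + cki * xk for yi, cji, cki in zip(y, cj, ck)]
        j += 2
    if j < n:                           # leftover column when n is odd
        y = [yi + cji * x[j] for yi, cji in zip(y, a[j])]
    return y


columns = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(matvec_columns(columns, [1, 1, 1], 3))
```
.DE
.PP
Each pass reads two columns but updates the accumulator only once, so the
result is loaded and stored half as often as in the one-column form.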
.sp
.KS
.ce
Table 9
.br
.ce
LU Decomposition Based on Matrix Vector Operations

.TS
center;
c c s s
c c c c
n n n n.
	MFLOPS
Order	CRAY X-MP (1 CPU)	Fujitsu VP-200	Hitachi S-810/20
_
50	24.5	20.5	17.9
100	51.6	51.8	47.5
150	72.1	84.6	76.3
200	87.4	117.1	102.2
250	99.2	148.8	126.4
300	108.4	178.8	147.8
.TE
.KE
.sp 3
.KS
.ce
Table 10
.br
.ce
Cholesky Decomposition Based on Matrix Vector Operations

.TS
center;
c c s s
c c c c
n n n n.
	MFLOPS
Order	CRAY X-MP (1 CPU)	Fujitsu VP-200	Hitachi S-810/20
_
50	29.9	25.8	18.8
100	65.6	70.6	60.1
150	91.9	117.6	104.9
200	107.7	162.2	144.9
250	119.1	202.2	179.7
300	132.3	238.1	211.8
.TE
.KE
.bp
.KS
.ce
Table 11
.br
.ce
Matrix Multiply Based on Matrix Vector Operations

.TS
center;
c c s s
c c c c
n n n n.
	MFLOPS
Order	CRAY X-MP (1 CPU)	Fujitsu VP-200	Hitachi S-810/20
_
50	98.4	112.9	100.0
100	135.7	225.2	213.3
150	149.0	328.1	279.3
200	156.2	404.5	336.8
250	165.9	462.2	366.7
300	167.9	469.2	390.4
.TE
.KE
.sp
.PP
For low-order problems the CRAY X-MP is slightly faster than the VP-200 and 
S-810/20,
because it has the smallest 
vector startup overhead
(primarily due to faster memory access).
As the order increases,
and the calculations become
saturated by longer vectors, the Fujitsu VP-200 attains the fastest overall 
execution rate.
.PP
With matrix multiply, the vectors remain the same length
throughout; here Fujitsu comes close to attaining 
its peak theoretical speed in Fortran.
.sp
.B
5.2  Results
.R
.PP
Table 12 contains the timing data for our benchmark codes.
We also include the timing results on other machines
for comparison.
.ps 11
.fi
.sp
.B
6.  Fortran Compilers and Tools 
.R
.sp
.B
6.1  Fortran Compilers
.R
.PP
The three compilers tested exhibit several similarities.
All three tested systems provide a full Fortran 77 vectorizing
compiler,
and Fortran is the primary programming language on each.
The CRAY compiler includes most IBM and CDC Fortran extensions;
the two Japanese compilers include all the IBM extensions
to Fortran 77.
All three compilers can generate vectorized code from standard
Fortran;
no explicit vector syntax is provided.
All three compilers recognize a variety of compiler directives \(em special
Fortran comments that, when placed in a Fortran source code, aid the
compiler in optimizing and vectorizing the generated code.
Each compiler, in its options and compiler directives,
provides users with a great deal of control over
the optimization and vectorization of their programs.
.bp
.nr PO .5i
.nr LL 7.0i
.po .5i
.ll 7.0i
.ce
Table 12
.br
.ce
Timing Data (in seconds) for Various Computers (a)

.nf
.ps 8
.TS
center;
lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8
lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8
lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8
lp8 np8 np8 np8 np8 np8 np8 np8 np8 np8 np8.
Program	CRAY X-MP-4	Fujitsu	Hitachi	Hitachi(b)	Hitachi(b)	IBM	IBM	IBM	Amdahl
Name	using 1 proc.	VP-200	S-810/20	S-810/20	S-810/20	370/195	3033	3033	5860
_
	CFT 1.13	f77	f77	FORTVS	H EXT	H EXT	FORTVS	H EXT	f77
				(scalar)	(scalar)					

APW	\f330.69\f1	40.58	54.37				171		62
BANDED	\f324.3\f1	34.15	38.3	41.0			102.65		35
BIGMAIN	\f310.86\f1	23.49	34.36		157.66		
DIF3DS1/1	\f323.7\f1	\f320.31\f1	\f321.9\f1	45.1	39.2	74.82	151.81	134.2	62
DIF3DS2/1	\f319.0\f1	21.93	21.9	47.4	41.5	81.27	157.44	142.	67
DIF3DV0/1	\f39.31\f1	16.37	11.8	50.1	39.5	73	168	138	73
DIF3DV1/1	\f39.37\f1	16.59	12.1	49.3	38.7	74	167	137	70
LATFERM3	\f36.1\f1	\f36.2\f1	\f36.6\f1	15.8	33.3		52.07	87.8	18
LATFERM4	121.8	\f365.29\f1	\f365.3\f1	345.2	820.6				640
LATTICE8	10.2	\f35.54\f1	6.7	16	19.4		46.38	53.8	17
MOLECDYN	\f38.68\f1	\f39.07\f1	15.78	16.6	17.2	36.26	51.44	51.74	17
NODAL0	\f36.41\f1	14.31	20.1	19.5	19.7	28.36	45.53	45.5	27
NODAL1	\f36.45\f1	14.47	19.8	19.3	19.5	27.58	45.35	45.	23
NODALX	.25	\f3.14\f1	.20		1.14	1.45	1.57	
BFAUCET1	\f311.2\f1	16.13		22.9	22.8		74	73	31
BFAUCET2	\f38.96\f1	11.66		23.9	24.2		79	78	34
BFAUCET3	\f310.6\f1	18.48		38.7	38.9		130	128	405
BFAUCET4	\f3259.4\f1	551.2			621.0		2100	2048	920
BFAUCET5	\f3787.4\f1	923.04			1529.4				2351
BFAUCET6	\f3727.5\f1	823.98							2786
FFAUCET1	\f313.6\f1	19.45			26.7		94	82	35
FFAUCET2	\f344.4\f1	\f342.31\f1			114.3		419	397	150
FFAUCET3	\f31144.0\f1	1691.83							2440
SPARSESP	\f31200\f1	\f31361\f1	\f31264.29\f1						1484
SPARSE1	\f32.51\f1	6.74	9.85	14.26			33.06		26
STRAWEXP	\f337.3\f1	45.74	59.2			116.28	143.35	142.28	51
STRAWEXP2	\f3153.4\f1	179.37	231.13		273.9				216
STRAWIMP	\f3151.5\f1	\f3151.51\f1	172.61		?	382.73	381.51	360.55	
.TE
(a) Numbers in boldface denote "fastest" time for a given program.
.br
(b) From load modules created on an IBM machine.
.nr PO 1.i
.po 1.i
.nr LL 6.5i
.ll 6.5i

.ps 11
.bp
.PP
All three compilers provide excellent optimization
of scalar code.
The compilers differ primarily in the range of Fortran
statements they can vectorize, the complexity of the DO loops
that they vectorize, and the quantity and quality of messages they
provide the programmer about the success or failure of vectorization.
.PP
All three Fortran compilers have similar capabilities for
vectorizing simple inner DO loops and DO loops with
a single IF statement.
The two Japanese compilers can also vectorize outer DO loops
and loops
with compound, multiple, and nested IF statements.
The Fujitsu compiler has multiple strategies for vectorizing
DO loops containing
IF statements, based on compiler directive estimates of the IF
statement true ratio.
The Japanese compilers can vectorize loops that contain
a mix of
vectorizable and non-vectorizable statements;
the CRAY
compiler requires the user to divide such code into separate
vectorizable and non-vectorizable DO loops.
.PP
The vector macro instructions (e.g., inner product, MAX/MIN, iteration)
on the two Japanese computers permit their compilers to vectorize a wider
range of Fortran statements than can the CRAY compiler.
The Japanese compilers also seem more successful at using information
from outside a DO loop in determining whether that loop is
vectorizable.
.PP
All three compilers convert loops with small iteration counts
to scalar code, when the advantages of vectorization will not repay the
loop vector start-up times.
The CRAY compiler can completely unroll inner DO loops
with constant iteration counts less than ten, eliminating entirely the scalar
loop overhead. Often an unrolled inner loop will then vectorize on an outer
loop index, with dramatic performance improvement. The Fujitsu
compiler can double the statements and halve the iteration count
of all DO loops. This loop doubling improves scalar performance, but usually
degrades vector performance by converting each vector operation to two
new operations with half the vector length and double the stride of the
original. The similar Hitachi option, "pair-processing," usually
improves performance because the two new vector operations can execute 
concurrently on separate functional units.
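.PP
Loop doubling can be shown on a simple scaling loop.
The sketch below uses hypothetical function names and models only the source
transformation; the comments mark how each doubled statement would look if the
loop were then vectorized.
.DS
```python
def scale_original(a, s):
    out = list(a)
    for i in range(len(a)):            # one pass, stride 1, full length
        out[i] = s * a[i]
    return out


def scale_doubled(a, s):
    out = list(a)
    n = len(a)
    for i in range(0, n - 1, 2):       # half the iterations, two statements
        out[i] = s * a[i]              # as a vector op: stride 2, length n/2
        out[i + 1] = s * a[i + 1]      # as a vector op: stride 2, length n/2
    if n % 2:                          # cleanup iteration for odd n
        out[n - 1] = s * a[n - 1]
    return out


print(scale_doubled([1, 2, 3, 4, 5], 2))
```
.DE
.PP
In scalar code the doubled loop pays the loop-control overhead half as often.
In vector code the same transformation leaves two shorter operations with a
stride of two, which is the pattern that the Fujitsu memory system handles
least well.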
.PP
All three compilers, in their output listings, indicate
which DO loops vectorized and which did not.
The two Japanese compilers provide more detailed explanations
of why a particular DO loop or statement does not vectorize.
The Fujitsu
compiler listing is the most effective of the three:
in addition to the vectorization commentary, the Fujitsu compiler
labels each DO statement in the source listing with a V if
it vectorizes totally, an S if the loop compiles to scalar code,
and an M if the loop is a mix of scalar and vector code.
Each statement in the loop itself is similarly labeled.
.PP
The Fujitsu and Hitachi compilers make all architectural
features of their respective machines available from
standard Fortran.
As a measure of confidence in their compilers,
Fujitsu has written all, and Hitachi nearly all, of their
scientific subroutine libraries in standard Fortran.
.sp
.B
6.2  Fortran Tools
.R
.PP
All three systems include tools to trace program
execution and identify the most time consuming program areas
for tuning attention. In addition,
Fujitsu and Hitachi provide Fortran source program analysis
tools which guide the user in optimizing program performance.
The Fujitsu interactive vectorizer is a powerful tool
for both the novice and the experienced user;
it allows one to tune a program despite an
unfamiliarity with vector machine architecture and programming practices.
The interactive vectorizer
(which runs on any IBM-compatible system with MVS/TSO)
displays the Fortran source with each
statement labeled with a V (vectorized), S (scalar),
or M (partially vectorized), and a static estimate
of the execution cost of the statement.
As the user interactively modifies a code, the vectorization
labels and statement execution costs are updated on-screen.
The vectorizer gives detailed explanations for failure
to vectorize a statement, suggests alternative codings that will
vectorize, and inserts compiler directives into the source based on user responses
to the vectorizer's queries.
Statement execution cost analyses are based on assumed DO loop iteration
counts and IF statement true ratios.
The user can supply his own estimate of these values, or
run the FORTUNE execution analyzer to gather run-time statistics
for a program,
which can then be input to the interactive vectorizer to provide a more
accurate dynamic statement execution cost analysis.
.PP
The Hitachi VECTIZER runs in batch mode;
it provides additional information much like the Hitachi Fortran compiler's
vectorization messages.
.sp
.B
7.  Conclusions
.R
.PP
The results of our benchmark show the CRAY X-MP-4 to be a
consistently strong performer
across a wide range of problems.
The CRAY was particularly fast on programs dominated by
scalar calculations and short vectors.
The fast CRAY memory contributes to low vector startup times,
leading to its exceptional short-vector performance.
The CRAY scalar performance derives from its
segmented functional units;
the X-MP achieves enough concurrency in many scalar loops
to outperform the Japanese machines, even though individual
scalar arithmetic instruction times are longer on
the CRAY than on the Fujitsu.
.PP
The Fujitsu and Hitachi computers
perform faster than the CRAY
for highly vectorizable
programs, especially those with long vectors (lengths greater than 50).
The Fujitsu VP achieved the most dramatic peak performance in the
benchmark, outperforming a single CRAY X-MP processor by factors
of two to three on matrix-vector algorithms, with the Hitachi not
far behind.
Over the life cycle of a program, the Fujitsu and Hitachi
machines should benefit relatively more than the CRAY from tuning
that increases the degree of program vectorization.
.PP
The CRAY has I/O weaknesses that were not probed in this exercise.
With an SSD, the CRAY has the highest I/O bandwidth of the
three machines.
However, owing to severe limits on the number of
disk I/O paths and disk devices, the total CRAY
disk storage capacity and aggregate disk I/O
bandwidth fall far below those of the two Japanese machines.
The CRAY is forced to depend on a front-end machine's
mass storage system to manage the large quantities of disk
data created and consumed by such a high-performance machine.
.PP
Several weaknesses were evident in the Fujitsu VP in this
benchmark.
The Fujitsu memory performance degrades seriously for
nonconsecutive vectors.
This was particularly evident in the BIGMAIN, DIF3D, and FAUCET
benchmark programs.
Even-numbered vector strides reduce the Fujitsu memory bandwidth by
75%, and a stride that is a multiple of the number of memory banks
(stride=n*128) reduces the memory bandwidth by about 94%.
This results in poor performance for vectorized Fortran COMPLEX
arithmetic (stride=2).
Fujitsu users will profit by reprogramming their complex
arithmetic using only REAL arrays,
and by ensuring that multidimensional-array algorithms are
vectorized by column (stride=1) rather than by row.
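.PP
The suggested recoding of complex arithmetic can be sketched as follows.
A Fortran COMPLEX array stores each real part next to its imaginary part,
so consecutive real parts lie two words apart (stride=2);
keeping the real and imaginary parts in two separate REAL arrays
makes every operand a stride=1 vector.
The sketch below is written in Python purely for illustration,
and the function name is hypothetical.
.nf
```python
# Illustrative sketch: element-wise complex multiply using only real arrays.
# Each list stands in for a Fortran REAL array accessed with stride 1.

def complex_multiply_split(ar, ai, br, bi):
    """Multiply (ar + i*ai) by (br + i*bi) element by element.

    All four inputs are equal-length sequences of real numbers.
    Returns the real and imaginary parts of the product as two lists.
    """
    cr = [x * u - y * v for x, y, u, v in zip(ar, ai, br, bi)]
    ci = [x * v + y * u for x, y, u, v in zip(ar, ai, br, bi)]
    return cr, ci


# Two complex vectors held as separate real and imaginary parts.
real_part, imag_part = complex_multiply_split(
    [1.0, 0.0], [2.0, 1.0],   # (1+2i), (0+1i)
    [3.0, 0.0], [4.0, 1.0],   # (3+4i), (0+1i)
)
```
.fi
Each of the two result loops reads and writes only contiguous real data,
which is the access pattern the Fujitsu memory system favors.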
.PP
Fujitsu's vector performance is substantially improved if a
program's maximum vector lengths are evident at compile time,
whether from explicit DO loop bounds, array dimension statements,
or compiler directives.
For example, the order-100 LINPACK benchmark improves by 12%, to 19
MFLOPS, and the order-300 matrix-vector LU benchmark
improves by 23%, to 220 MFLOPS, when a Fujitsu compiler directive
is included to specify the maximum vector length (numbers from
the LINPACK benchmark paper [2]).
When maximum vector lengths are known, the Fujitsu compiler can optimize
the numbers and lengths of the vector registers and frequently
avoid the logic that divides vectors into segments no larger than
the vector registers.
Fujitsu's short-loop performance, not strong to begin with,
is particularly degraded by unnecessary vector segmentation
("stripmining") logic.
None of the benchmark problems had explicit vector length
information.
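.PP
The vector segmentation logic referred to above can be sketched in a few lines.
The sketch is a generic illustration of stripmining, not the code the
Fujitsu compiler generates;
the function names and the register length parameter are hypothetical.
.nf
```python
# Generic stripmining sketch: split a loop of n iterations into segments
# no longer than a vector register. Names and lengths are illustrative.

def strip_mine(n, register_length):
    """Return (start, length) pairs covering iterations 0..n-1."""
    segments = []
    start = 0
    while start < n:
        length = min(register_length, n - start)
        segments.append((start, length))
        start += length
    return segments


def vector_add(a, b, register_length):
    """Add two equal-length vectors one register-sized segment at a time."""
    result = [0.0] * len(a)
    for start, length in strip_mine(len(a), register_length):
        stop = start + length
        # Each segment stands in for one vector instruction sequence.
        result[start:stop] = [x + y for x, y in zip(a[start:stop], b[start:stop])]
    return result
```
.fi
When the vector length is known never to exceed the register length,
the outer loop over segments always runs exactly once,
so a compiler with that information can omit the loop and its bookkeeping.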
.PP
In many ways, the Hitachi computer seems to have the greatest
vector potential.
Despite its slower memory technology, the Hitachi has the
highest single processor memory bandwidth, owing to its four
memory pipes.
Also, Hitachi has the most powerful vector macro instruction set
and the most flexible set of arithmetic pipelines;
in addition, the Hitachi is the only computer able to process vectors
longer than its vector registers entirely in hardware.
The vectorizing Fortran compiler is impressive,
although the compiler is rarely able to exploit fully the
potential concurrency of the arithmetic pipelines.
The Hitachi performs best on the benchmarks with
little scalar content;
its slow scalar performance \(em about half that of the Fujitsu
computer \(em burdens its performance on every problem.
.PP
At present, the Japanese Fortran compilers are superior to the
CRAY compiler at vectorization.
Advanced Fujitsu and Hitachi hardware features
provide opportunities for vectorization that are unavailable on
the CRAY.
For example, the Japanese machines have macro instructions to
vectorize dot products, simple recurrences, and the search for
the maximum and minimum elements of an array; and they have
multiple mask registers to allow vectorization of loops with
nested IF statements.
Thus, a wider range of algorithms can vectorize on the Japanese
computers than can vectorize on the CRAY.
Also, the Japanese compilers provide the user with more useful
information about the success and failure of vectorization.
Moreover, there is no CRAY equivalent to the Fujitsu interactive
vectorizer and FORTUNE performance analyzer.
These advanced hardware features and vectorizing tools will make
it easier to tune programs for optimum performance on the
Japanese computers than on the CRAY.
.PP
The CRAY X-MP and the Japanese computers require different tuning
strategies.
The CRAY compiler does not partially vectorize loops.
Therefore, CRAY users typically break up loops into their
vectorizable and nonvectorizable parts.
The Japanese compilers, however, automatically segment loops into
their vectorizable and nonvectorizable parts.
It is advantageous to merge smaller loops together on the
Japanese computers, to take maximum advantage of their large vector
register sets.
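.PP
The manual loop splitting described for the CRAY can be illustrated with a
minimal sketch, written in Python for illustration only.
The first function mixes an independent element-wise statement with a
loop-carried recurrence, which is assumed here not to vectorize;
the second separates the two so that the element-wise part stands alone.
The function names and the particular recurrence are hypothetical.
.nf
```python
# Illustrative loop splitting: separate the vectorizable statement from
# the statement with a loop-carried dependence. Names are hypothetical.

def fused_loop(a, b):
    """One loop containing both an independent and a dependent statement."""
    c = [0.0] * len(a)
    s = [0.0] * len(a)
    running = 0.0
    for i in range(len(a)):
        c[i] = a[i] * b[i]              # independent of other iterations
        running = 0.5 * running + c[i]  # depends on the previous iteration
        s[i] = running
    return c, s


def split_loops(a, b):
    """The same computation as two loops."""
    # Loop 1: element-wise product, a candidate for vectorization.
    c = [x * y for x, y in zip(a, b)]
    # Loop 2: the recurrence, left as a scalar loop.
    s = []
    running = 0.0
    for value in c:
        running = 0.5 * running + value
        s.append(running)
    return c, s
```
.fi
Both functions return identical results.
On a compiler that vectorizes loops only as a whole, the split form exposes
the first loop to vectorization;
a compiler that partially vectorizes loops can accept the fused form as written.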
.sp 3
.B
References

.IP [1] 
.R
J. J. Dongarra and S. C. Eisenstat,
"Squeezing the Most out of an Algorithm in CRAY Fortran,"
.I
ACM Trans. Math. Software,
.R
Vol. 10, No. 3, pp. 221-230 (1984).
.sp
.IP [2]
.R
J. J. Dongarra,
.I
Performance of Various Computers Using Standard
Linear Equations Software in a Fortran
Environment,
.R
Argonne National Laboratory Report MCS-TM-23
(October 1985).
.sp
.IP [3]
.R
O. Lubeck, J. Moore, and R. Mendez, 
.I
A Benchmark Comparison of Three Supercomputers: Fujitsu VP-200,
Hitachi S-810/20 and CRAY X-MP-2.
.sp 3
.B
Acknowledgment
.R
.sp
We would like to thank Gail Pieper for her excellent help
in editing this report.
.R
.bp
.cs 1
.ps 11
.in 0
.ce 1
.B
Distribution for ANL-85-19
.ce 0
.sp 2
.B
Internal:
.sp
.in .75i
.nf
.R
J. J. Dongarra (40)
A. Hinds (40)
K. L. Kliewer
A. B. Krisciunas
P. C. Messina 
G. W. Pieper 
D. M. Pool
T. M. Woods (2)

ANL Patent Department
ANL Contract File
ANL Libraries
TIS Files (6)
.sp 2
.B
.in 0
External:
.R
.sp
.in .75i
DOE-TIC, for distribution per UC-32 (167)
Manager, Chicago Operations Office, DOE
Mathematics and Computer Science Division Review Committee:
.in +.4i
J. L. Bona, U. Chicago
T. L. Brown, U. of Illinois, Urbana
S. Gerhart, MCC, Austin, Texas 
G. Golub, Stanford University
W. C. Lynch, Xerox Corp., Palo Alto
J. A. Nohel, U. of Wisconsin, Madison
M. F. Wheeler, Rice U.
.in -.4i
D. Austin, ER-DOE
J. Greenberg, ER-DOE
G. Michael, LLL

.