.ds CH .pl 11i .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ft 3 .bp .R .sp .5i . .sp .R .ta 4.20i Distribution Category: .br Mathematics and Computers .br General (UC-32) .ce 100 .in 0 .sp 1i .B .ce 100 ------------- ANL-85-19 ------------- .R .sp .5i ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 Comparison of the CRAY X-MP-4, Fujitsu VP-200, and Hitachi S-810/20: An Argonne Perspective .ps 11 .sp 3 .I Jack J. Dongarra .ps 10 .R .ft 1 Mathematics and Computer Science Division .sp and .sp .I Alan Hinds .R .br Computing Services .sp .7i October 1985 .bp . .sp .B .ce 1 .ps 12 Table of Contents .sp 3 .R .ps 10 .ta 5.5i List of Tables v List of Figures v Abstract 1 1. Introduction 1 2. Architectures 1 2.1 CRAY X-MP 2 2.2 Fujitsu VP-200 4 2.3 Hitachi S-810/20 6 3. Comparison of Computers 8 3.1 IBM Compatibility of the Fujitsu and Hitachi Machines 8 3.2 Main Storage Characteristics 8 3.3 Memory Address Architecture 10 3.3.1 Memory Address Word and Address Space 10 3.3.2 Operand Sizes and Operand Memory Boundary Alignment 12 3.3.3 Memory Regions and Program Relocation 13 3.3.4 Main Memory Size Limitations 13 3.4 Memory Performance 14 3.4.1 Memory Bank Structure 14 3.4.2 Instruction Access 14 3.4.3 Scalar Memory Access 14 3.4.4 Vector Memory Access 15 3.5 Input/Output Performance 16 3.6 Vector Processing Performance 18 3.7 Scalar Processing Performance 21 4. Benchmark Environments 22 5. Benchmark Codes and Results 23 .sp 2 .sp .70i .ce 1 iii .bp 5.1 Codes 23 5.1.1 APW 23 5.1.2 BIGMAIN 24 5.1.3 BFAUCET and FFAUCET 24 5.1.4 LINPACK 24 5.1.5 LU, Cholesky Decomposition, and Matrix Multiply 26 5.2 Results 28 6. Fortran Compilers and Tools 28 6.1 Fortran Compilers 28 6.2 Fortran Tools 30 7. Conclusions 31 References 33 Acknowledgments 33 .sp 4.3i .ce 1 iv .bp .ce 1 .B List of Tables .R .sp 3 .ta .3i 6.5iR 1. Overview of Machine Characteristics 9 .sp 2. Main Storage Characteristics 11 .sp 3. 
Input/Output Features and Performance 17 .sp 4. Vector Architecture 19 .sp 5. Scalar Architecture 22 .sp 6. Programs Used for Benchmarking 25 .sp 7. Average Vector Length for BFAUCET and FFAUCET 26 .sp 8. LINPACK Timing for a Matrix of Order 100 26 .sp 9. LU Decomposition Based on Matrix Vector Operations 27 .sp 10. Cholesky Decomposition Based on Matrix Vector Operations 27 .sp 11. Matrix Multiply Based on Matrix Vector Operations 28 .sp 12. Timing Data (in seconds) for Various Computers 29 .sp B-1. Loops Missed by the Respective Compilers 42 .sp 4 .ce 1 .B List of Figures .R .sp 2 1. CRAY X-MP/48 Architecture 3 .sp 2. Fujitsu VP-200 Architecture 5 .sp 3. Hitachi S-810/10 Architecture 7 .sp 1.5i .ce 1 v .ft 3 .ps 11 .pn 1 .ds CH "% .bp .LP .EQ delim @@ .EN .B .ps 12 .in .ce Comparison of the CRAY X-MP-4, .sp .ce Fujitsu VP-200, and Hitachi S-810/20: .sp .ce An Argonne Perspective* .fi .sp .R .AU .ps 11 .in 0 Jack J. Dongarra\|@size -1 {"" sup \(*}@\h'.15i' .AI .ps 10 .in 0 Mathematics and Computer Science Division .AU .ps 11 .in 0 Alan Hinds .AI .ps 10 .in 0 Computing Services .FS .ps 9 .vs 11p *Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under Contract W-31-109-Eng-38. .FE .sp 3 .QS .ps 10 .in +.25i .ll -.5i .B .ce 1 Abstract .R A set of programs, gathered from major Argonne computer users, was run on the current generation of supercomputers: the CRAY X-MP-4, Fujitsu VP-200, and Hitachi S-810/20. The results show that a single processor of a CRAY X-MP-4 is a consistently strong performer over a wide range of problems. The Fujitsu and Hitachi computers excel on highly vectorized programs and offer an attractive opportunity to sites with IBM-compatible computers. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .SH 1. 
Introduction .PP Last year we ran a set of programs, gathered from major Argonne computer users, on the current generation of supercomputers: the CRAY X-MP-4 at CRAY Research in Mendota Heights, Minnesota; the Fujitsu VP-200 at the Fujitsu plant in Numazu, Japan; and the Hitachi S-810/20 at the Hitachi Ltd. Kanagawa Works in Kanagawa, Japan. .SH 2. Architectures .PP The CRAY X-MP, Fujitsu VP, and Hitachi S-810 computers are all high-performance vector processors that use pipeline techniques in both scalar and vector operations and permit concurrent execution by independent functional units. All three machines use a register-to-register format for instruction execution. Each machine has three vector load/store techniques \(em contiguous element, constant stride, and indirect address (index vector) modes. All three are optimized for 64-bit floating-point arithmetic operations. Outstanding features and major differences in these machines are discussed below and summarized in Table 1 at the end of this section. .sp .B 2.1 CRAY X-MP .R .PP The CRAY X-MP-4 (Figure 1) is the largest of the family of CRAY X-MP computer models, which range in size from one to four processors and from one million to sixteen million words of central memory. The CRAY X-MP/48 computer consists of four identical pipelined processors, each with fully segmented scalar and vector functional units with a 9.5-nanosecond clock cycle. All four processors share in common an 8-million word, high-speed (38-nanosecond cycle time) bipolar central memory, a common I/O subsystem, and an optional integrated solid-state storage device (SSD). Each processor contains a complete set of registers and functional units, and each processor can access all of the common memory, all of the I/O devices, and the single (optional) SSD. The CRAY X-MP-4 vector register set -- 512 words per processor -- is the smallest in this study. 
.PP The four CRAY CPUs can process four separate and independent jobs, or they can be organized to work concurrently on a single job. This document will focus on the performance of only a single processor of the CRAY X-MP-4, as none of our benchmark programs were organized to take advantage of multiple processors. Thus, in the tables and text that follow, all data on the capacity and the performance of the CRAY X-MP-4 apply to a single processor, except for data on the size of memory and the configuration and performance of I/O devices and the SSD. .PP The CRAY X-MP-4 has extremely high floating-point performance for both scalar and vector applications and both short and long vector lengths. Each CRAY X-MP processor has a maximum theoretical floating-point result rate of 210 MFLOPS (millions of floating point operations per second) for overlapped vector multiply and add instructions. With the optional solid-state storage device installed, the CRAY X-MP-4 has an input/output bandwidth of over 2.4 billion bytes per second, the largest in this study; without the SSD, the I/O bandwidth is 424 million bytes per second, but only 68 million bytes per second is attainable by disk I/O. The CRAY DD-49 disks have the fastest single disk transfer rate (9.8 million bytes per second) in this study. The CRAY permits a maximum of four disk devices on each of eight disk control units, the smallest disk subsystem in this study. .PP The cooling system for the CRAY X-MP-4 is refrigerated liquid freon. .PP The CRAY X-MP-4 operates with the CRAY Operating System (COS), a batch operating system designed to attach by a high-speed channel or hyperchannel interface with a large variety of self-contained, general-purpose front-end computers. All computing tasks other than batch compiling, linking, and executing of application programs must be performed on the front-end computer. Alternatively, the CRAY X-MP-4 can operate under CTSS (CRAY Time-Sharing System, .bp . 
.sp 7.8i .ce Figure 1 .ce CRAY X-MP/48 Architecture .bp available from Lawrence Livermore National Laboratory), a full-featured interactive system with background batch computing. The primary programming languages for the CRAY X-MP are Fortran 77 and CAL (CRAY Assembly Language); the Pascal and C programming languages are also available. .sp .B 2.2 Fujitsu VP-200 (Amdahl 1200) .R .PP The Fujitsu VP-200 (Figure 2) is midway in performance in a family of four Fujitsu VP computers, whose performance levels range to over a billion floating-point operations per second. In North America, the Fujitsu VP-200 is marketed and maintained by the Amdahl Corporation as the Amdahl 1200 Vector Processor. Although we benchmarked the VP-200 in Japan, the comparisons in this document will emphasize the configurations of the VP-200 offered by Amdahl in the United States. .PP The Fujitsu VP-200 is a high-speed, single-processor computer, with up to 32 million words of fast (60-nanosecond cycle time) static MOS central memory. The VP-200 has separate scalar (15-nanosecond clock cycle) and vector (7.5-nanosecond clock cycle) execution units, which can execute instructions concurrently. A unique characteristic of the VP-200 vector unit is its large (8192-word) vector register set, which can be dynamically configured into different numbers and lengths of vector registers. .PP The VP-200 has a maximum theoretical floating point result rate of 533 MFLOPS for overlapped vector multiply and add instructions. .PP The VP-200 system is cooled entirely by forced air. .PP The Fujitsu VP-200 scalar instruction set and data formats are fully compatible with the IBM 370 instruction set and data formats; the VP-200 can execute load modules and share load libraries and datasets that have been prepared on IBM-compatible computers. 
The Fujitsu VP-200 uses IBM-compatible I/O channels and can attach all IBM-compatible disk and tape devices and share these devices with other IBM-compatible mainframe computers. Fujitsu does not offer an integrated solid-state storage device for the VP computer series, but any such device that attaches to an IBM channel and emulates an IBM disk device can be attached to the VP-200. The total I/O bandwidth of the VP-200 is 96 million bytes per second, the smallest in this study. Up to 93 million bytes per second can be used for disk I/O; the maximum single-disk data transfer rate is 3 million bytes per second. The VP-200 can attach over one thousand disk devices. .KS .sp 20 .sp 1.5i .ce Figure 2 .ce Fujitsu VP-200 Architecture .KE .PP The Fujitsu VP-200 operates with the FACOM VP control program (also called VSP \(em Vector Processor System Program \(em by Amdahl), a batch operating system designed to interface with an IBM-compatible front-end computer via a channel-to-channel (CTC) adaptor in a tightly coupled or loosely coupled network. (Internally, Amdahl is running the IBM MVS/XA operating system on their Amdahl 1200 computer.) The front-end computer operating system may be Fujitsu's OS-IV (available only in Japan) or IBM's MVS, MVS/XA, or VM/CMS. To optimize use of the VP vector hardware, Fujitsu encourages VP users to perform all computing tasks, other than executing their Fortran application programs, on the front-end computer. .bp .PP Of the three machines in this study, Fujitsu (Amdahl) provides the most powerful set of optimizing and debugging tools. With the VSP operating system, interactive tools must be run on the front-end computer system, but batch versions of the tools can run on either the front-end or the vector processor. 
Fujitsu Fortran 77/VP is the only programming language that takes advantage of the Fujitsu VP vector capability, although object code produced by any other compiler or assembler available for IBM scalar mainframe computers will execute correctly on the VP in scalar mode. .sp .B 2.3 Hitachi S-810/20 .R .PP The Hitachi S-810/20 (Figure 3) computer is the more powerful of two Hitachi S-810 computers, which currently are sold only in Japan. Little is published in English about the Hitachi S-810 computers; consequently, some data in the tables and comparisons are inferred and may be inaccurate. .PP The Hitachi S-810/20 is a high-speed, single-processor computer, with up to 32 million words of fast (70-nanosecond bank cycle time) static MOS central memory and up to 128 million words of extended storage. The computer has separate scalar (28-nanosecond clock cycle) and vector (14-nanosecond clock cycle) execution units, which can execute instructions concurrently. The scalar execution unit is distinguished by its large (32 thousand words) cache memory. The S-810/20 vector unit has 8192 words of vector registers, and the largest number of vector functional units and the most comprehensive vector macro instruction set of the three machines in this study. The Hitachi S-810 family alone has the ability to process vectors that are longer than their vector registers, entirely under hardware control. .PP The Hitachi S-810/20 has a maximum theoretical floating point result rate of 840 MFLOPS for overlapped vector multiply and add instructions (two multiply and four add results per cycle). .PP The S-810/20 computer is cooled by forced air across a closed, circulating water radiator. .PP Like the Fujitsu VP, the Hitachi S-810/20 scalar instruction set and data formats are fully compatible with the IBM 370 instruction set and data formats; the S-810/20 can execute load modules and share load libraries and datasets that have been prepared on IBM-compatible computers. 
The Hitachi S-810/20 uses IBM-compatible I/O channels and can attach all IBM-compatible disk and tape devices and share these devices with other IBM-compatible mainframe computers. Hitachi's optional, extended storage offers extremely high performance I/O. With the extended storage installed, the Hitachi S-810/20 has an I/O bandwidth of 1.1 billion bytes per second; without extended storage the I/O bandwidth is 96 million bytes per second. Up to 93 million bytes per second can be used for disk I/O; the maximum single-disk data transfer rate is 3 million bytes per second. The Hitachi can attach over one thousand disk devices. .bp . .sp 2 .sp 5.5i .ce Figure 3 .ce Hitachi S-810/10 Architecture .PP The Hitachi S-810/20 operates either with a batch operating system designed to interface with an IBM-compatible front-end computer via a channel-to-channel (CTC) adaptor in a loosely coupled network, or with a stand-alone operating system with MVS-like batch and MVS/TSO-like interactive capabilities. The primary programming languages for the Hitachi S-810 computers are Fortran 77 and assembly language, although object code produced by any assembler or compiler available for IBM-compatible computers will also execute on the S-810 computers in scalar mode. .bp .B 3. Comparison of Computers .R .B 3.1 IBM Compatibility of the Fujitsu and Hitachi Machines .R .PP Both Japanese computers run the full IBM System 370 scalar instruction set, but neither of the Japanese machines can run the IBM 3090 Vector Facility instruction set. Also, neither Japanese computer is compatible with the IBM XA extensions to System 370; each vendor has its own 31-bit address architecture that provides an XA-equivalent 2-gigabyte address space. (Amdahl is running MVS/XA on one Amdahl 1200 computer that they modified to be compatible with XA architecture.) The Japanese operating systems simulate IBM MVS system functions at the SVC level. 
MVS load modules created on Argonne's IBM 3033 ran correctly on both the Fujitsu and Hitachi machines in scalar mode. .PP The Japanese computers can share datasets on direct-access I/O equipment with IBM-compatible front-end computers. Programs can be developed and debugged on the front-end computers with the user's favorite tools, then recompiled and executed on the vector processors. All software tools for the vector processors will run on IBM-compatible front ends. Currently the interactive software tools are MVS TSO/SPF oriented. .PP The Japanese Fortran compilers are compatible with IBM VS/Fortran; the ANSI X3.9 (Fortran 77) standard and (most) IBM extensions are implemented. .sp .B 3.2 Main Storage Characteristics .R .PP The main storage characteristics of the three machines in this study are compared in Table 2. All three machines have large, interleaved main memories, optimized for 64-bit-word data transfers, with bandwidths matched to the requirements of their respective vector units. Each machine permits vector accesses from contiguous, constant-stride separated, and scattered (using indirect list-vectors) memory addresses. All three machines use similar memory error-detection and error-correction schemes. The text that follows concentrates on those differences in main memory that have significant performance implications. .PP The CRAY X-MP-48 uses extremely fast bipolar memory, while the Fujitsu and Hitachi computers use relatively slower static-MOS memory (see Table 2). CRAY's choice of the faster but much more expensive bipolar memory is largely dictated by the need to service four processors from a single, symmetrically shared main memory. Fujitsu and Hitachi selected static MOS for its relatively lower cost and lower heat dissipation. These MOS characteristics permit much larger memory configurations without drastic cost and cooling penalties. 
Fujitsu and Hitachi compensate for the relatively slower speed of their MOS memory by providing much higher levels of memory banking and interleaving. .bp .ce 1 Table 1 .br .ce 1 Overview of Machine Characteristics .TS center; lp8 lp8 lp8 lp8. Characteristic CRAY X-MP-4 Fujitsu VP-200 Hitachi S-810/20 _ _ _ _ Number of Processors 4 1 1 Machine Cycle Time 9.5 ns vector 7.5 ns vector 14 ns vector 9.5 ns scalar 15 ns scalar 28 ns scalar Memory Addressing Real Mod. Virtual Mod. Virtual Maximum Memory Size 16 Mwords 32 Mwords 32 Mwords Optional SSD Memory 32; 128 Mwords Not Available 32; 64; 128 Mwords SSD Transfer Rate 256 Mwords/s Not Available 128 Mwords/s I/O-Memory Bandwidth 50 Mwords/s 12 Mwords/s 12 Mwords/s (numbers below are per processor) CPU Memory Bandwidth 315 Mwords/s 533 Mwords/s 560 Mwords/s Scalar Buffer Memory 64 Words T reg 8192 Words Cache 32768 Words Cache Vector Registers 512 Words 8192 Words 8192 Words Vector Pipelines: Load/Store Pipes 2 Load; 1 Store 2 Load/Store 3 Load; 1 Load/Store Floating Point M & A 1 Mult; 1 Add; 1 Mult; 1 Add 2 Add; 2 Mult/Add Peak Vector (M + A) 210 MFLOPS 533 MFLOPS 840 MFLOPS Cooling System Type Freon Forced Air Air and Radiator .bp Characteristic CRAY X-MP-4 Fujitsu VP-200 Hitachi S-810/20 _ _ _ _ Operating Systems CRAY-OS (batch) VSP (batch) HAP OS CTSS (interactive) Front Ends IBM, CDC, DEC, IBM-compatible IBM-compatible Data General, Univac, Apollo, Honeywell Vectorizing Languages Fortran 77 Fortran 77 Fortran 77 Other High-Level Languages Pascal, C, LISP Any IBM-compat. Any IBM-compat. Vectorizing Tools Fortran Compiler Fortran Compiler Fortran Compiler FORTUNE VECTIZER Interact. Vectorizer Batch Vectorizer .TE .sp 2 .B 3.3 Memory Address Architecture 3.3.1 Memory Address Word and Address Space .R .PP The CRAY X-MP uses a 24-bit address, which it interprets as a 16-bit "parcel" address when referencing instructions and as a 64-bit-word address when referencing operands. 
This addressing duality leads to a 4-million-word address space for instructions and a 16-million-word address space for operands. .PP The two Japanese machines use similar memory addressing schemes, owing to their mutual commitment to IBM compatibility. Both Japanese computers allow operating-system selection of IBM 370-compatible 24-bit addressing or IBM XA-like 31-bit addressing. These addressing alternatives provide a 2-million-word address space or a 256-million-word address space, respectively. The address space is identical for both program instructions and operands. .bp .ce 1 Table 2 .br .ce 1 Main Storage Characteristics .TS center; lp8 lp8 lp8 lp8 lp8. Memory Item Units CRAY X-MP-4 Fujitsu VP-200 Hitachi S-810/20 _ _ _ _ _ Memory Type SECDED 16K-bit Bipolar 64K-bit S-MOS 64K-bit S-MOS Addressing: Type Extended Real Mod. Virtual Mod. Virtual Paged No System Only System Only Address Word Bits 24 24 or 31 24 or 31 Address Space Mwords 4(inst); 16(data) 2; 256 2; 256 Address Boundary: Instructions Bit 16 16 16 Scalar Data Bit 64 8 8 Vector Data Bit 64 32; 64 32; 64 Vector Addressing Contiguous Contiguous Contiguous Modes Constant Stride Constant Stride Constant Stride Indirect Index Indirect Index Indirect Index Memory Size Mwords 8; 16 8; 16; 32 4; 8; 16; 32 Mbytes 64; 128 64; 128; 256 32; 64; 128; 256 Interleave Sections 4; 4 8; 8; 8 8 Ways 64; 64 128; 256; 256 128 Cycle Time: Section CP - ns 1CP - 9.5 ns 2CP - 15 ns 1CP - 14 ns Bank CP - ns 4CP - 38 ns 8CP - 60 ns 5CP - 70 ns Access Time: From Cache From Cache Scalar CP - ns 14CP - 133 ns 2CP - 30 ns 2CP - 28 ns Vector CP - ns 17CP - 162 ns ? ? .bp Memory Item Units CRAY X-MP-4 Fujitsu VP-200 Hitachi S-810/20 _ _ _ _ _ Transfer Rate: (per CPU) Scalar L/S Words/CP 1W/19. ns 2W/15 ns 2W/14 ns Inst. Fetch Words/CP 8W/9.5 ns 2W/15 ns 1W/14 ns Vect. Load Words/CP 2W/9.5 ns 8W/15 ns 8W/14 ns Vect. Store Words/CP 1W/9.5 ns 8W/15 ns 2W/14 ns Vect. Total Words/CP 3W/9.5 ns 8W/15 ns 8W/14 ns I/O Words/CP 1W/9.5 ns ? 
1W/14 ns Vector Bandwidth: (per CPU) L/S Pipes Pipes 2 Load; 1 Store 2 Load/Store 3 Load; 1 Load/Store # Sectors Sectors x 2 Sectors x 2 Sectors Vector Bandwidth: Stride one; odd; even one; odd; even one; odd; even Max. Load Mwords/s 210; 210; 210 533; 266; 133 560; 560; 560 Max. Store Mwords/s 105; 105; 105 533; 266; 133 140; 140; 140 Total L/S Mwords/s 315; 315; 315 533; 266; 133 560; 560; 560 Scalar Buffer Memory: T Registers Cache Memory Cache Memory Size Words 64 8192 32768 Block Load Words/CP 1W/9.5 ns 8W/60 ns 8W/70 ns Access Time CP - ns 1CP - 9.5 ns 2CP - 15 ns 2CP - 28 ns Trans. Rate Words/CP 1W/9.5 ns 2W/15 ns 2W/28 ns Instruction Buffer: 128 Words I-stack Cache Memory Cache Memory Block Load Words/CP 8W/9.5 ns 8W/60 ns 8W/70 ns .TE .sp 2 .B 3.3.2 Operand Sizes and Operand Memory Boundary Alignment .R .PP CRAY X-MP computers have only two hardware operand sizes: 64-bit integer, real, and logical operands; and 24-bit integer operands, used primarily for addressing. All CRAY operands are stored in memory on 64-bit word boundaries. CRAY program instructions consist of one or two 16-bit "parcels," packed four to a word. CRAY instructions are fetched from memory, 32 parcels at a time beginning on an 8-word memory boundary, into an instruction buffer that in turn is addressable on 16-bit parcel boundaries. .bp .PP The Japanese computers provide all of the IBM 370 architecture's operand types and lengths, and some additional ones. The Fujitsu and Hitachi scalar instruction sets can process 8-bit, 16-bit, 32-bit, 64-bit, and 128-bit binary-arithmetic and logical operands; 8-bit to 128-bit (in units of 8 bits) decimal-arithmetic operands; and 8-bit to 32768-bit (in units of 8 bits) character operands. Scalar operands may be aligned in memory on any 8-bit boundary. 
However, the Fujitsu and Hitachi vector instruction sets can process only 32-bit and 64-bit binary-arithmetic and logical operands, and these operands must be aligned in memory on 32-bit and 64-bit boundaries, respectively. Most of the Fujitsu and Hitachi incompatibilities with IBM Fortran programs arise from vector operand misalignment in COMMON blocks and EQUIVALENCE statements. .sp .B 3.3.3 Memory Regions and Program Relocation .R .PP The CRAY X-MP uses only real memory addresses. The operating system loads each program into a contiguous region of memory for instructions and a contiguous region of memory for operands. The CRAY X-MP uses two base registers to relocate all addresses in a program; one register uniformly biases all instruction addresses, and the second register uniformly biases all operand addresses. .PP In contrast, the Fujitsu and Hitachi computers use a modified virtual-memory addressing scheme. The operating systems and user application programs are each loaded into a contiguous region of "virtual" memory, although each may actually occupy noncontiguous "pages" of real memory. Every virtual address reference must undergo dynamic address translation to obtain the corresponding real memory address. As in conventional virtual-memory systems, operating-system pages can be paged out to an external device, allowing the virtual-memory space to exceed the underlying real-memory space. However, user application program pages are never paged out. Application program address translation is used primarily to avoid memory fragmentation. .sp .B 3.3.4 Main Memory Size Limitations .R .PP The CRAY X-MP is available with up to 16 million words of main memory, the maximum permitted by its address space. This is restrictive compared to the Japanese offerings, especially as the memory must be shared by four processors. Currently, the Fujitsu and Hitachi computers offer a maximum of 32 million words of main memory. 
However, both Japanese computers could accommodate expansion to 256 million words (per program) within the current 31-bit virtual-addressing architecture. .bp .B 3.4 Memory Performance 3.4.1 Memory Bank Structure .R .PP The computers on which we ran the benchmark problems were all equipped with 8 million words of main memory. The CRAY X-MP-48 memory was divided into 64 independent memory banks, organized as 4 sections of 16 banks each (later models of the CRAY X-MP are limited to 32 memory banks). Both the Fujitsu and Hitachi computer memories are divided into 128 independent memory banks organized as 8 sections of 16 banks each; Fujitsu memories larger than 8 million words have 256 memory banks in 8 sections. In general, the larger numbers of memory banks permit higher bandwidths for consecutive block memory transfers and fewer bank conflicts from random memory accesses. .sp .B 3.4.2 Instruction Access .R .PP The CRAY X-MP has four 32-word instruction buffers that can deliver a new instruction for execution on every clock cycle, leaving the full memory bandwidth available for operand access. Each buffer contains 128 consecutive parcels of program instructions, but the separate buffers need not be from contiguous memory segments. Looping and branching within the buffers are permitted; entire Fortran DO loops and small subroutines can be completely contained in the buffer. An instruction buffer is block-loaded from memory, 32 words at a time, at the rate of 8 words per 9.5-nanosecond cycle. .PP The Fujitsu and Hitachi processors buffer all instruction fetches through their respective cache memories (see "Scalar Memory Access" below). The cache bandwidths are adequate to deliver instructions and scalar operands without conflict. .sp .B 3.4.3 Scalar Memory Access .R .PP The CRAY X-MP does not have a scalar cache. Instead, it has 64 24-bit intermediate-address B-registers and 64 64-bit intermediate-scalar T-registers. 
These registers are under program control and can deliver one operand per 9.5-nanosecond clock cycle to the primary scalar registers. The user must plan a program carefully to make effective use of the B and T registers in CRAY Fortran; variables assigned to B and T registers by the compiler are never stored in memory. .bp .PP The Fujitsu VP-200 and Hitachi S-810/20 automatically buffer all scalar memory accesses and instruction fetches through fast cache memories of 8192 words and 32768 words, respectively. The Fujitsu and Hitachi cache memories can each deliver two words per scalar clock cycle (15 nanoseconds and 28 nanoseconds, respectively) to their respective scalar execution units, entirely under hardware control. .sp .B 3.4.4 Vector Memory Access .R .PP The computers studied all have multiple data-streaming pipelines to transfer operands between main memory and vector registers. Each processor of a CRAY X-MP has three pipelines \(em two dedicated to loads and one dedicated to stores \(em between its own set of vector registers and the shared main memory. (A fourth pipe in each X-MP processor is dedicated to I/O data transfers.) The Fujitsu VP-200 has two memory pipelines, each capable of both loads and stores. The Hitachi S-810/20 has four memory pipelines \(em three dedicated to loads and one capable of both loads and stores. .PP Each CRAY X-MP pipe can transfer one 64-bit word between main storage and a vector register each 9.5-nanosecond cycle, giving a single-processor memory bandwidth (excluding I/O) of 315 million words per second and a four-processor memory bandwidth of 1260 million words per second. The Fujitsu and Hitachi pipes can each transfer two 64-bit words each memory cycle (7.5 nanoseconds and 14 nanoseconds, respectively), giving total memory bandwidths of 533 and 560 million words per second, respectively. 
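.PP
The bandwidth figures quoted above follow from a simple product: number of memory pipes, times words moved per pipe per cycle, divided by the cycle time. The short Python sketch below reproduces that arithmetic from the numbers given in the text; the function and table names are illustrative only. Note that with the 14-nanosecond cycle as printed, the formula gives about 571 million words per second for the Hitachi rather than the quoted 560, which suggests that the printed cycle time is rounded (this is an inference from the arithmetic, not a manufacturer's statement).

```python
# Peak vector memory bandwidth, assuming every pipe is busy on every cycle:
#   Mwords/s = pipes * words_per_pipe_per_cycle * 1000 / cycle_time_ns
# All names here are illustrative; the input numbers are those quoted in the text.

MACHINES = {
    # name: (memory pipes, words per pipe per cycle, cycle time in ns)
    "CRAY X-MP (one processor)": (3, 1, 9.5),
    "Fujitsu VP-200": (2, 2, 7.5),
    "Hitachi S-810/20": (4, 2, 14.0),
}


def peak_mwords_per_second(pipes, words_per_cycle, cycle_ns):
    """Return the peak memory bandwidth in millions of 64-bit words per second."""
    return pipes * words_per_cycle * 1000.0 / cycle_ns


if __name__ == "__main__":
    for name, (pipes, words, cycle_ns) in MACHINES.items():
        rate = peak_mwords_per_second(pipes, words, cycle_ns)
        print(f"{name}: {rate:.1f} Mwords/s")
```

Multiplying the single-processor CRAY result by four gives the four-processor figure in the same way.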
.PP For indirect-address operations (scatter/gather) and for constant strides different from one, the Fujitsu computer devotes one of its memory pipelines to generating operand addresses; its maximum memory-to-vector register bandwidth is 266 million words per second for scatter/gather and odd-number constant strides, and 133 million words per second for even-number constant strides. .PP All three machines can automatically "chain" their load and store pipelines with their vector functional pipelines. Thus, vector instructions need not wait for a vector load to complete, but can begin execution as soon as the first vector element arrives from memory. And vector stores can begin as soon as the first result is available in a vector register. In the limit, pipelines can be chained to create a continuous flow of operands from memory, through the vector functional unit(s), and back to memory with an unbroken stream of finished results. In this "memory-to-memory" processing mode, the vector registers serve as little more than buffers between memory and the functional units. The CRAY X-MP's three memory pipes permit memory-to-memory operation with two input operand streams and one result stream. With only two memory pipes, the Fujitsu VP-200 can function in memory-to-memory mode only if one of the input operands is already in a vector register, or if one of the operands is a scalar, and not at all if the vector stride is different from one. The Hitachi, with four memory pipes, can function in memory-to-memory mode with up to three input operand streams and one result stream; add to this the Hitachi's ability to automatically process vectors that are longer than its vector registers, and the Hitachi can be viewed as a formidable memory-to-memory processor. .sp .B 3.5 Input/Output Performance .R .PP Table 3 summarizes the input/output features and performance of the CRAY X-MP, the Fujitsu, and the Hitachi. 
This information is entirely from the manufacturers' published machine specifications; no I/O performance comparisons were included in our tests.
.PP
Both the CRAY and Hitachi I/O subsystems have optional integrated solid-state storage devices, with data transfer rates of 2048 and 1024 Mbytes per second, respectively, over specialized channels. The I/O bandwidth of one of these devices dwarfs the I/O bandwidth of the entire disk I/O subsystem on each machine. The Fujitsu computers can attach only those solid-state storage devices that emulate standard IBM disk and drum devices over standard Fujitsu 3-Mbyte-per-second channels.
.PP
The IBM-compatible disk I/O subsystems on the two Japanese computers have a much larger aggregate disk storage capacity than the CRAY. The CRAY can attach a maximum of 32 disk units, while Fujitsu and Hitachi can each attach over one thousand disks. CRAY permits a maximum of 8 concurrent disk data transfers, while Fujitsu and Hitachi permit as many concurrent disk data transfers as there are channels (up to 31; at least one channel is required for front-end communication). Individually, CRAY's DD-49 disks can transfer data sequentially at the rate of 10 Mbytes per second, compared with only 3 Mbytes per second for the IBM 3380-compatible disks used by Fujitsu and Hitachi. But the maximum concurrent CRAY disk data rate (four DD-49 data streams on each of two I/O processors) is only 68 Mbytes per second, compared with 93 Mbytes per second for the two Japanese computers. The disks used on all three computers should have very similar random access performance, which is dominated by access time rather than data transfer rate.
.bp
.ce 1
Table 3
.br
.ce 1
Input/Output Features and Performance
.TS
center tab(|);
lp9 lp9 lp9 lp9.
I/O Features|CRAY X-MP-4|Fujitsu VP-200|Hitachi S-810/20
_
Disk I/O Channels:
Disk I/O Processors|2 I/O Subsystems|2 I/O Directors|2 I/O Directors
Channels per IOP|1|16|16
Maximum Channels|2|32|32
Data Rate/Channel|100 MB/s|3 MB/s|3 MB/s
Total Bandwidth|200 MB/s|96 MB/s|96 MB/s
Disk Controllers:|DCU-5|6880|3880-equivalent
Max. per Channel|4|8|16
Max. Controllers|8|128|256
Disks/Controller|4|4-64|4-16
Data Paths/Controller|1|2|2
Bandwidth/Controller|12 MB/s|6 MB/s|6 MB/s
Disk Devices:|DD-39; DD-49|6380|3380-equivalent
Storage Capacity|1200 MB; 1200 MB|600 MB; 1200 MB|600 MB; 1200 MB
Data Transfer Rate|6 MB/s; 10 MB/s|3 MB/s|3 MB/s
Average Seek Time|18 ms; 16 ms|15 ms|15 ms
Average Latency|9 ms; 9 ms|8 ms|8 ms
Maximum Striping|5; 3|24|?
Max. Disk Bandwidth|45 MB/s; 68 MB/s|93 MB/s|93 MB/s
Integrated SSD:|Optional|Not Available|Optional
Capacity (Mwords)|32; 64; 128||32; 64; 128
Data Transfer Rate|256 Mwords/s||128 Mwords/s
.TE
.PP
CRAY includes up to 8 Mwords of I/O subsystem buffer memory between its CPUs and its disk units. This I/O buffer memory permits 100-Mbyte-per-second data transfer between the I/O subsystem and a single CRAY CPU. The IBM 3880-compatible disk controllers used by the two Japanese machines permit up to 2 Mwords of cache buffer memory on each controller. This disk controller cache does not increase peak data transfer rates but serves to reduce average record access times.
.bp
.PP
All three machines permit "disk striping" to increase I/O performance \(em the data blocks of a single file can be interleaved over multiple disk devices to allow concurrent data transfer for a single file. CRAY allows certain disks to be designated as striping volumes at the system level; striped and non-striped datasets may not reside on the same disk volume. A single CRAY file may be striped over a maximum of three DD-49 or five DD-39 disk units.
Fujitsu and Hitachi permit striping on a Fortran dataset basis; striped and non-striped datasets may reside on the same disk volume. A single Fujitsu dataset may be striped over as many as 24 disk volumes. Fortran programs compiled by the Japanese Fortran compilers in scalar mode can use disk striping on any IBM-compatible computer.
.sp
.B
3.6 Vector Processing Performance
.R
.PP
Table 4 shows the vector architectures of the three computers studied. All three machines are vector register based, with multiple pipelines connecting the vector registers with main memory. All three have multiple vector functional units, permit concurrency among independent vector functional units and with the load/store pipelines, and permit flexible chaining of the vector functional units with each other and with the load/store pipelines. Although Fujitsu and Hitachi permit both 32-bit and 64-bit vector operands, all vector arithmetic on all three machines is performed in and optimized for 64-bit floating point. The three vector units differ primarily in the numbers and lengths of vector registers, the numbers of vector functional units, and the types of vector instructions.
.PP
Of the three machines, the CRAY has the smallest number and size of vector registers. Each CRAY X-MP processing unit has 8 vector registers of 64 elements, while the Fujitsu and Hitachi computers each have 8192-word vector register sets. The Fujitsu vector registers can be dynamically configured into different numbers and lengths of vector registers (see Table 4), ranging from a minimum of 8 registers of 1024 words each to a maximum of 256 registers of 32 words each. The Fujitsu Fortran compiler uses the vector-length information available at compile time to try to optimize the vector register configurations for each loop.
The Hitachi has 32 vector registers, fixed at 256 elements each, but with the unique ability to process longer vectors without the user or the compiler dividing them into sections of 256 elements or less; the Hitachi hardware can automatically repeat a long vector instruction for successive vector segments. The HAP Fortran compiler decides when to divide vectors into 256-element segments and when to process entire vectors all at once, based on whether intermediate results in a vector register can be used in later operations.
.bp
.ce 1
Table 4
.br
.ce 1
Vector Architecture
.TS
center tab(|);
lp8 lp8 lp8 lp8.
Vector Processing Item|CRAY X-MP-4|Fujitsu VP-200|Hitachi S-810/20
_
Vector Registers:
Configuration|Fixed|Reconfigurable|Fixed
Total Capacity|512 Words/CPU|8192 Words|8192 Words
Number x Size|8x64 Words|8x1024 Words|32x256 Words
||16x512 Words
||32x256 Words
||64x128 Words
||128x64 Words
||256x32 Words
Mask Registers|64 Bits|8192 Bits|8x256 Words
Vector Pipelines (per CPU)
Load/Store|2 Load; 1 Store|2 Load/Store|3 Load;1 Load/Store
Floating Point|1 Mult; 1 Add;|1 Mult; 1 Add|2 Add/Shift/Logic
|1 Recip. Approx.|1 Divide|1 Mult/Divide/Add
|||1 Mult/Add
Other|1 Shift; 1 Mask|1 Mask|1 Mask
|2 Logical
Maximum Vector Result Rates (64-bit results):
Floating Point Mult.|105 MFLOPS|267 MFLOPS|280 MFLOPS
Floating Point Add|105 MFLOPS|267 MFLOPS|560 MFLOPS
Floating Point Divide|33 MFLOPS|56 MFLOPS|70 MFLOPS
Floating Mult. & Add|210 MFLOPS|533 MFLOPS|560 MFLOPS
|||840 (M+2A)
Vector Data Types:
Floating Point|64-bit|32-bit; 64-bit|32-bit; 64-bit
Fixed Point|64-bit|32-bit|32-bit
Logical|64-bit|1-bit; 64-bit|64-bit
Vector Macro Instructions:
Masked Arithmetic|No|Yes|Yes
Vector Compress/Expand|Yes|Yes|Yes
Vector Merge under Mask|Yes|No|No
Vector Sum (S=S+Vi)|No|Yes|Yes
.bp
Vector Processing Item|CRAY X-MP-4|Fujitsu VP-200|Hitachi S-810/20
_|_|_|_
Vector Macro Instructions:
Vector Prod (S=S*Vi)|No|No|Yes
DOT Product (S=S+Vi*Vj)|No|Chain|Yes
DAXPY (Vi=Vi+S*Xi)|Chain|Chain|Yes
Iteration (Aj=Ai*Bi+Ci)|No|No|Yes
Max/Min (S=MAX(S,Vi))|No|Yes|Yes
Fix/Float (Vi=Ii;Ii=Vi)|Chain|Yes|Yes
.TE
.sp
.PP
The Hitachi has more vector arithmetic pipelines than the CRAY and Fujitsu computers. These pipelines permit the Hitachi to achieve higher peak levels of concurrency than CRAY and Fujitsu. Depending on the operation mix, the Hitachi can drive two vector add and two vector multiply+add pipelines concurrently, for an instantaneous result rate of 840 MFLOPS. If the program operation mix is inappropriate, however, the extra pipelines are just expensive unused hardware. The HAP Fortran "pair-processing" option often increases performance by dividing a vector in two and processing each half concurrently through a separate pipe. For long vectors, pair-processing can double the result rate; but for short vectors, startup overhead can result in reduced performance. The HAP Fortran compiler permits pair-processing to be selected on a program-wide, subroutine-wide, or individual loop basis. Pair-processing was the compiler default for all our timings. Previous S-810 benchmarks that reported relatively poorer performance were done without pair-processing [3].
.PP
The Fujitsu and Hitachi computers have larger and more powerful vector instruction sets than the CRAY. These macro instruction sets make these machines more "compilable" and more "vectorizable" than the CRAY.
Especially valuable are the macro instructions that reduce an entire vector operation to a single result, such as the vector inner (or dot) product. The CRAY, lacking such instructions, must normally perform these operations in scalar mode, although vectorizable algorithms exist for long CRAY vectors. The Hitachi has the richest set of vector macro-instructions, with macro functional units to match. Both Fujitsu and Hitachi have single vector instructions or two instruction chains to extract the maximum and minimum elements of a vector, to sum the elements of a vector, to take the inner product of two vectors, and to convert vector elements between fixed point and floating point representations. To these, the Hitachi adds a vector product reduction, the DAXPY sequence common in linear algebra, and a vector iteration useful in finite-difference calculations. .PP The only CRAY masked vector instructions are the vector compress/expand and conditional vector merge instructions; the CRAY Fortran compiler uses these instructions to vectorize loops with only a single IF statement. The CRAY can hold logical data for only a single vector register. Both Japanese computers, on the other hand, have masked arithmetic instructions that permit straightforward vectorization of loops with IF statements. The Fujitsu and Hitachi computers have mask register sets that can hold logical data for every vector register element. These large mask register sets, and vector logical instructions to manipulate these masks, should make the Japanese machines strong candidates for logic programming. These machines can hold the results of many different logical operations in their multiple mask registers, eliminating the need to recompute masks that are needed repeatedly, and permitting the vectorization of loops with multiple, compound, and nested IF statements. .sp .B 3.7 Scalar Processing Performance .R .PP Table 5 compares the scalar architectures of the three machines studied. 
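.PP
As a worked illustration of the figures in Table 5, the sketch below converts the tabulated clock periods into instruction latencies and into two idealized result rates: the rate of an unsegmented unit that finishes one instruction before starting the next, and the one-result-per-clock upper bound of a fully segmented unit. The dictionary layout and function names are hypothetical, and the rates are bounds under these stated assumptions, not measurements.
.DS L
```python
# Scalar cycle time (nsec) and floating-point latencies in clock
# periods (CP), as listed in Table 5. The layout is illustrative only.
MACHINES = {
    "CRAY X-MP-4": {"cycle_ns": 9.5, "mult_cp": 7, "add_cp": 6},
    "Fujitsu VP-200": {"cycle_ns": 15.0, "mult_cp": 4, "add_cp": 3},
    "Hitachi S-810/20": {"cycle_ns": 28.0, "mult_cp": 3, "add_cp": 2},
}


def latency_ns(machine, op):
    """Time for one scalar instruction: clock periods times cycle time."""
    m = MACHINES[machine]
    return m[op + "_cp"] * m["cycle_ns"]


def serial_rate(machine, op):
    """Millions of results per second if instructions run one at a time.

    Assumption: an unsegmented unit starts the next instruction only
    after the previous one has finished.
    """
    return 1000.0 / latency_ns(machine, op)


def segmented_bound(machine):
    """Upper bound, in millions of results per second, for a fully
    segmented unit that accepts a new independent instruction every
    clock period. Real code is limited by operand dependencies.
    """
    return 1000.0 / MACHINES[machine]["cycle_ns"]


if __name__ == "__main__":
    for name in MACHINES:
        print(name,
              latency_ns(name, "mult"),
              round(serial_rate(name, "mult"), 1),
              round(segmented_bound(name), 1))
```
.DE
.PP
On these assumptions a Fujitsu multiply completes in 60 nsec against 66.5 nsec on the CRAY, yet back-to-back unsegmented multiplies yield at most about 16.7 million results per second on the Fujitsu, while a segmented CRAY unit fed with independent operands is bounded by about 105 million results per second.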
.PP
All three computers permit scalar and vector instruction concurrency; CRAY permits concurrency among all its functional units. The Fujitsu and Hitachi computers are compatible with IBM System 370; they implement the complete IBM 370 scalar instruction set and scalar register sets (Fujitsu added four additional floating-point registers).
.PP
CRAY computers use multiple, fully-segmented functional units for both scalar and vector instruction execution, while Fujitsu and Hitachi use an unsegmented execution unit for all scalar instructions. CRAY computers can begin a scalar instruction on any clock cycle; more than one CRAY scalar instruction can be in execution at a given time, in the same and in different functional units. Fujitsu and Hitachi, on the other hand, perform their scalar instructions one at a time, many taking more than one cycle. Thus, even though many scalar instruction times are faster on the Fujitsu than on the CRAY, the CRAY will often have a higher scalar result rate because of concurrency. In our benchmark set, a single processor of the CRAY X-MP-4 outperformed both the Fujitsu VP-200 and the Hitachi S-810/20 on most of the programs that were dominated by scalar floating point instruction execution.
.PP
The Fujitsu and Hitachi computers have larger and more powerful general-purpose instruction sets than the CRAY, and more flexible data formats for integer and character processing. Thus, applications that are predominately scalar but use little floating-point arithmetic may well execute faster on these IBM-compatible computers than on a CRAY. We had no applications in our benchmark to measure such performance.
.bp
.ce 1
Table 5
.br
.ce 1
Scalar Architecture
.TS
center tab(|);
lp8 cp8 cp8 cp8.
Scalar Processing Item|CRAY X-MP-4|Fujitsu VP-200|Hitachi S-810/20
_
Scalar Cycle Time|9.5 nsec|15 nsec|28 nsec
Scalar Registers:
General/Addressing|8x24-bit|16x32-bit|16x32-bit
Floating Point|8x64-bit|8x64-bit|4x64-bit
Scalar Buffer Memory:|T-Registers|Cache Memory|Cache Memory
Capacity|64 Words|8192 Words|32768 Words
Memory Bandwidth|105 Mwords/sec|67 Mwords/sec|112 Mwords/sec
CPU Access Time|1 CP - 9.5 nsec|2 CP - 30 nsec|1 CP - 28 nsec
CPU Transfer Rate|1 Word/9.5 nsec|1 Word/15 nsec|1 Word/28 nsec
Scalar Execution Times:
Floating Point Mult.|7 CP - 66.5 nsec|4 CP - 60 nsec|3 CP - 84 nsec
Floating Point Add|6 CP - 57.0 nsec|3 CP - 45 nsec|2 CP - 56 nsec
Scalar Data Types:
Floating Point|64-bit|32; 64; 128-bit|32; 64; 128-bit
Fixed Point|24; 64-bit|16; 32-bit|16; 32-bit
Logical|64-bit|8; 32; 64-bit|8; 32; 64-bit
Decimal|None|1 to 16-bytes|1 to 16-bytes
Character|None|1 to 4096-bytes|1 to 4096-bytes
.TE
.sp
.B
4. Benchmark Environments
.R
.PP
We spent two days at Cray Research compiling and running the benchmark on the CRAY X-MP-4. The CRAY programs were one-processor tests; no attempt was made to exploit the additional processors.
.PP
For the Japanese benchmarks, we sent ahead a preliminary tape of our benchmark source programs and some load modules produced at Argonne. At both Fujitsu and Hitachi the load modules ran without problem, demonstrating that the machines are in fact compatible with IBM computers on both instruction set and operating system interface levels. (Of course, these tests did not use the vector features of the machines.)
.bp
.PP
The VP-200 tests were run at the Fujitsu plant in Numazu, Japan, during a one-week period. We had as much time on the VP-200 as needed. The front-end machine was a Fujitsu M-380 (approximately twice as fast as a single processor of an IBM 3081 K).
.PP
The Hitachi S-810/20 tests were run at the Hitachi Kanagawa Works, during two afternoons. The Hitachi S-810/20 benchmark configuration had no front-end system.
Instead, we compiled, linked, ran, and printed output directly on the machine. .PP The physical environment of the Hitachi S-810/20 at Kanagawa is noteworthy. The machine room was not air-conditioned; a window was opened to cool off the area. The outside temperature exceeded 100 degrees Fahrenheit on the first day, and we estimate that the computer room temperature was well above 100 degrees, with high humidity; yet the computer ran without problem. .sp .B 5. Benchmark Codes and Results .R .sp .B 5.1 Codes .PP We asked some of the major computer users at Argonne for typical Fortran programs that would help in judging the performance of these vector machines. We gathered 20 programs, some simple kernels, others full production codes. The programs are itemized in Table 6. .PP Four of the programs have very little vectorizable Fortran (for the most part they are scalar programs): BANDED, NODAL0, NODAL1, SPARSESP. Both STRAWEXP and STRAWIMP have many calculations involving short vectors. For most of these programs the CRAY X-MP performed fastest, with the Fujitsu faster than the Hitachi. .PP Below we describe some of the benchmarks and analyze the results. .sp .B 5.1.1 APW .R .PP The APW program is a solid-state quantum mechanics electronic structure code. APW calculates self-consistent field wave functions and energy band structures for a sodium chloride lattice using an antisymmetrized plane wave basis set and a muffin-tin potential. The majority of loops in this program are short and are coded as IF loops rather than DO loops; they do not vectorize on any of the benchmarked computers. The calculations are predominately scalar. .bp .PP This program highlights the CRAY X-MP advantage when executing "quasi-vector" code (vector-like loops that do not vectorize for some reason). 
The CRAY executes scalar code on segmented functional units and can achieve a higher degree of concurrency in scalar than either the Fujitsu or Hitachi machines, which execute scalar instructions one at a time. .sp .B 5.1.2 BIGMAIN .R .PP BIGMAIN is a highly vectorized Monte Carlo algorithm for computing Wilson line observables in SU(2) lattice gauge theory. This program has the longest vector lengths of the benchmarks. All the vectors begin on the same memory bank boundary, and all have a stride of twelve. The only significant nonvectorized code is an IF loop, which seriously limits the peak performance. .PP The superior performance of the CRAY on BIGMAIN reflects both the CRAY's insensitivity to the vector stride and its greater levels of concurrency when executing scalar loops. The Fujitsu performance reflects a quartering of memory bandwidth when using a vector stride of twelve. The Hitachi performance reflects its slower scalar performance. .sp .B 5.1.3 BFAUCET and FFAUCET .R .PP BFAUCET and FFAUCET compute the ground state energies of drops of liquid helium by the variational Monte Carlo method. The BFAUCET codes involve Bose statistics, and a table-lookup operation is an important component of the time. The FFAUCET cases use Fermi statistics and are dominated by the evaluation of determinants using LU decomposition. The different cases correspond to different sized drops, as shown in Table 7. .PP BFAUCET1, 2, and 3 and FFAUCET1 and 2 perform only a single Monte Carlo iteration each; these cases are typical of checkout runs and are dominated by non-repeated setup work. BFAUCET4, 5, and 6 and FFAUCET3 are long production runs. .sp .B 5.1.4 LINPACK .R .PP The LINPACK timing is dominated by memory reference as a result of array access through the calls to SAXPY. For this problem the vector length changes during the calculation from length 100 down to length 1 (see Table 8). 
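.PP
The effect of the shrinking vector length can be illustrated with a short Python sketch. It is not part of the benchmark; it assumes the usual column-oriented elimination, in which step k of an order-n factorization updates n-k columns with one SAXPY of length n-k each, and it ignores the triangular solves. The function names are hypothetical. Under this model the longest update has n-1 elements, one fewer than the matrix order quoted above.
.DS L
```python
def saxpy_lengths(n):
    """Vector length of every SAXPY call in an order-n factorization.

    Assumption (not taken from the report): step k, for k = 1..n-1,
    issues n-k SAXPY calls, each operating on n-k elements.
    """
    lengths = []
    for k in range(1, n):
        lengths.extend([n - k] * (n - k))
    return lengths


def mean_length(lengths):
    """Average vector length, weighted by number of calls."""
    return sum(lengths) / len(lengths)


def calls_at_or_below(lengths, limit):
    """Number of SAXPY calls whose vector length is at most `limit`."""
    return sum(1 for length in lengths if length <= limit)


if __name__ == "__main__":
    lens = saxpy_lengths(100)
    print(len(lens), max(lens), min(lens))
    print(round(mean_length(lens), 1), calls_at_or_below(lens, 50))
```
.DE
.PP
Under these assumptions an order-100 factorization issues 4950 SAXPY calls with a mean length of about 66 elements, and about a quarter of the calls (1275 of 4950) operate on 50 elements or fewer.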
.PP
The Fujitsu and Hitachi results reflect the fact that these machines do not perform as well as the CRAY on short vectors.
.bp
.ce 1
Table 6
.br
.ce 1
Programs Used for Benchmarking
.TS
center tab(|);
lp8 lp8 cp8
lp8 np8 lp8.
Code|No. of Lines|Description
_|_|_
APW|1448|T{
Solid-state code, for anti-symmetric plane wave calculations for solids.
T}
BANDED|1539|Band linear algebra equation solver, for parallel processors.
BIGMAIN|774|Vectorized Monte Carlo algorithm, for SU(2) lattice gauge theory.
DIF3D|527|1, 2, and 3-D diffusion theory kernels.
LATFERM3|1149|Statistical-mechanical approach to lattice gauge calculations.
LATFERM4|1149|Statistical-mechanical approach to lattice gauge calculations.
LATTICE8|1149|Statistical-mechanical approach to lattice gauge calculations.
MOLECDYN|1020|Molecular dynamics code simulating a fluid.
NODAL0|345|Kernel of 3-D neutronics code using nodal method.
NODAL1|345|Kernel of 3-D neutronics code using nodal method.
NODALX|345|Kernel of 3-D neutronics code using nodal method.
BFAUCET|5460|T{
Variational Monte Carlo for drops of He-4 atoms \(em Bose statistics.
T}
FFAUCET|5577|T{
Variational Monte Carlo for drops of He-3 atoms \(em Fermi statistics.
T}
SPARSESP|1617|ICCG for non-symmetric sparse matrices based on normal equations.
SPARSE1|3228|T{
MA32 from the Harwell library sparse matrix code using frontal techniques and software run on a 64 x 64 problem.
T}
STRAWEXP|4806|T{
2-D nonlinear explicit solution of finite element program with weakly coupled thermomechanical formulation in addition to structural and fluid structural interaction capability.
T}
STRAWIMP|4806|Same as STRAWEXP but implicit solution.
.TE
.bp
.ce 1
Table 7
.br
.ce 1
Average Vector Length for BFAUCET and FFAUCET
.TS
center tab(|);
l l
l n.
Case|Average Vector Length
_|_
BFAUCET1|10
BFAUCET2|35
BFAUCET3|56
BFAUCET4|120
BFAUCET5|10
BFAUCET6|35
FFAUCET1|10
FFAUCET2|17
FFAUCET3|10
.TE
.sp 4
.KF
.ce
Table 8
.br
.ce
LINPACK Timing for a Matrix of Order 100
.TS
center tab(|);
l l l
l n n.
Machine|MFLOPS|Seconds
_|_|_
CRAY X-MP|21|.032
Fujitsu VP-200|17|.040
Hitachi S-810/20|17|.042
.TE
.KE
.sp 2
.B
5.1.5 LU, Cholesky Decomposition, and Matrix Multiply
.R
.PP
The LU, Cholesky decomposition, and matrix multiply benchmarks are based on matrix vector operations. As a result, memory reference is not a limiting factor since results are retained in vector registers during the operation. The technique used in these tests is based on vector unrolling [1], which works equally well on CRAY, Fujitsu, and Hitachi machines.
.bp
.PP
The routines used in Tables 9 through 11 have a very high percentage of floating-point arithmetic operations. The algorithms are all based on column accesses to the matrices. That is, the programs reference array elements sequentially down a column, not across a row. With the exception of matrix multiply, the vector lengths start out as the order of the matrix and decrease during the course of the computation to a vector length of one.
.sp
.KS
.ce
Table 9
.br
.ce
LU Decomposition Based on Matrix Vector Operations
.TS
center tab(|);
c c s s
c c c c
n n n n.
|MFLOPS
Order|CRAY X-MP (1 CPU)|Fujitsu VP-200|Hitachi S-810/20
_
50|24.5|20.5|17.9
100|51.6|51.8|47.5
150|72.1|84.6|76.3
200|87.4|117.1|102.2
250|99.2|148.8|126.4
300|108.4|178.8|147.8
.TE
.KE
.sp 3
.KS
.ce
Table 10
.br
.ce
Cholesky Decomposition Based on Matrix Vector Operations
.TS
center tab(|);
c c s s
c c c c
n n n n.
|MFLOPS
Order|CRAY X-MP (1 CPU)|Fujitsu VP-200|Hitachi S-810/20
_
50|29.9|25.8|18.8
100|65.6|70.6|60.1
150|91.9|117.6|104.9
200|107.7|162.2|144.9
250|119.1|202.2|179.7
300|132.3|238.1|211.8
.TE
.KE
.bp
.KS
.ce
Table 11
.br
.ce
Matrix Multiply Based on Matrix Vector Operations
.TS
center tab(|);
c c s s
c c c c
n n n n.
|MFLOPS
Order|CRAY X-MP (1 CPU)|Fujitsu VP-200|Hitachi S-810/20
_
50|98.4|112.9|100.0
100|135.7|225.2|213.3
150|149.0|328.1|279.3
200|156.2|404.5|336.8
250|165.9|462.2|366.7
300|167.9|469.2|390.4
.TE
.KE
.sp
.PP
For low-order problems the CRAY X-MP is slightly faster than the VP-200 and S-810/20, because it has the smallest vector startup overhead (primarily due to faster memory access). As the order increases, and the calculations become saturated by longer vectors, the Fujitsu VP-200 attains the fastest overall execution rate.
.PP
With matrix multiply, the vectors remain the same length throughout; here Fujitsu comes close to attaining its peak theoretical speed in Fortran.
.sp
.B
5.2 Results
.R
.PP
Table 12 contains the timing data for our benchmark codes. We also include the timing results on other machines for comparison.
.ps 11
.fi
.sp
.B
6. Fortran Compilers and Tools
.R
.sp
.B
6.1 Fortran Compilers
.R
.PP
The three compilers tested exhibit several similarities. All three tested systems include a full Fortran 77 vectorizing compiler as the primary programming language. The CRAY compiler includes most IBM and CDC Fortran extensions; the two Japanese compilers include all the IBM extensions to Fortran 77. All three compilers can generate vectorized code from standard Fortran; no explicit vector syntax is provided. All three compilers recognize a variety of compiler directives \(em special Fortran comments that, when placed in a Fortran source code, aid the compiler in optimizing and vectorizing the generated code. Each compiler, in its options and compiler directives, provides users with a great deal of control over the optimization and vectorization of their programs.
.bp .nr PO .5i .nr LL 7.0i .po .5i .ll 7.0i .ce Table 12 .br .ce Timing Data (in seconds) for Various Computers (a) .nf .ps 8 .TS center; lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 lp8 np8 np8 np8 np8 np8 np8 np8 np8 np8 np8. Program CRAY X-MP-4 Fujitsu Hitachi Hitachi(b) Hitachi(b) IBM IBM IBM Amdahl Name using 1 proc. VP-200 S810/20 S810/20 S810/20 370/195 3033 3033 5860 _ CFT 1.13 f77 f77 FORTVS H EXT H EXT FORTVS H EXT f77 (scalar) (scalar) APW \f330.69\f1 40.58 54.37 171 62 BANDED \f324.3\f1 34.15 38.3 41.0 102.65 35 BIGMAIN \f310.86\f1 23.49 34.36 157.66 DIF3DS1/1 \f323.7\f1 \f320.31\f1 \f321.9\f1 45.1 39.2 74.82 151.81 134.2 62 DIF3DS2/1 \f319.0\f1 21.93 21.9 47.4 41.5 81.27 157.44 142. 67 DIF3DV0/1 \f39.31\f1 16.37 11.8 50.1 39.5 73 168 138 73 DIF3DV1/1 \f39.37\f1 16.59 12.1 49.3 38.7 74 167 137 70 LATFERM3 \f36.1\f1 \f36.2\f1 \f36.6\f1 15.8 33.3 52.07 87.8 18 LATFERM4 121.8 \f365.29\f1 \f365.3\f1 345.2 820.6 640 LATTICE8 10.2 \f35.54\f1 6.7 16 19.4 46.38 53.8 17 MOLECDYN \f38.68\f1 \f39.07\f1 15.78 16.6 17.2 36.26 51.44 51.74 17 NODAL0 \f36.41\f1 14.31 20.1 19.5 19.7 28.36 45.53 45.5 27 NODAL1 \f36.45\f1 14.47 19.8 19.3 19.5 27.58 45.35 45. 23 NODELX .25 \f3.14\f1 .20 1.14 1.45 1.57 BFAUCET1 \f311.2\f1 16.13 22.9 22.8 74 73 31 BFAUCET2 \f38.96\f1 11.66 23.9 24.2 79 78 34 BFAUCET3 \f310.6\f1 18.48 38.7 38.9 130 128 405 BFAUCET4 \f3259.4\f1 551.2 621.0 2100 2048 920 BFAUCET5 \f3787.4\f1 923.04 1529.4 2351 BFAUCET6 \f3727.5\f1 823.98 2786 FFAUCET1 \f313.6\f1 19.45 26.7 94 82 35 FFAUCET2 \f344.4\f1 \f342.31\f1 114.3 419 397 150 FFAUCET3 \f31144.0\f1 1691.83 2440 SPARSESP \f31200\f1 \f31361\f1 \f31264.29\f1 1484 SPARSE1 \f32.51\f1 6.74 9.85 14.26 33.06 26 STRAWEXP \f337.3\f1 45.74 59.2 116.28 143.35 142.28 51 STREWEXP2 \f3153.4\f1 179.37 231.13 273.9 216 STRAWIMP \f3151.5\f1 \f3151.51\f1 172.61 ? 
382.73 381.51 360.55 .TE (a) Numbers in boldface denote "fastest" time for a given program. .br (b) From load modules created on an IBM machine. .nr PO 1.i .po 1.i .nr LL 6.5i .ll 6.5i .ps 11 .bp .PP All three compilers provide excellent optimization of scalar code. The compilers differ primarily in the range of Fortran statements they can vectorize, the complexity of the DO loops that they vectorize, and the quantity and quality of messages they provide the programmer about the success or failure of vectorization. .PP All three Fortran compilers have similar capabilities for vectorizing simple inner DO loops and DO loops with a single IF statement. The two Japanese compilers can also vectorize outer DO loops and loops with compound, multiple, and nested IF statements. The Fujitsu compiler has multiple strategies for vectorizing DO loops containing IF statements, based on compiler directive estimates of the IF statement true ratio. The Japanese compilers can vectorize loops that contain a mix of vectorizable and non-vectorizable statements; the CRAY compiler requires the user to divide such code into separate vectorizable and non-vectorizable DO loops. .PP The vector macro instructions (e.g., inner product, MAX/MIN, iteration) on the two Japanese computers permit their compilers to vectorize a wider range of Fortran statements than can the CRAY compiler. And, the Japanese compilers seem more successful at using information from outside a DO loop in determining whether that loop is vectorizable. .PP All three compilers convert loops with small iteration counts to scalar code, when the advantages of vectorization will not repay the loop vector start-up times. The CRAY compiler can completely unroll inner DO loops with constant iteration counts less than ten, eliminating entirely the scalar loop overhead. Often an unrolled inner loop will then vectorize on an outer loop index, with dramatic performance improvement. 
The Fujitsu compiler can double the statements and halve the iteration count of all DO loops. This loop doubling improves scalar performance, but usually degrades vector performance by converting each vector operation to two new operations with half the vector length and double the stride of the original. The similar Hitachi option, "pair-processing," usually improves performance because the two new vector operations can execute concurrently on separate functional units.
.PP
All three compilers, in their output listings, indicate which DO loops vectorized and which did not. The two Japanese compilers provide more detailed explanations of why a particular DO loop or statement does not vectorize. The Fujitsu compiler listing is the most effective of the three: in addition to the vectorization commentary, the Fujitsu compiler labels each DO statement in the source listing with a V if it vectorizes totally, an S if the loop compiles to scalar code, and an M if the loop is a mix of scalar and vector code. Each statement in the loop itself is similarly labeled.
.PP
The Fujitsu and Hitachi compilers make all architectural features of their respective machines available from standard Fortran. As a measure of confidence in their compilers, Fujitsu has written all, and Hitachi nearly all, of their scientific subroutine libraries in standard Fortran.
.sp
.B
6.2 Fortran Tools
.R
.PP
All three systems include tools to trace program execution and identify the most time-consuming program areas for tuning attention. In addition, Fujitsu and Hitachi provide Fortran source program analysis tools which guide the user in optimizing program performance. The Fujitsu interactive vectorizer is a powerful tool for both the novice and the experienced user; it allows one to tune a program despite an unfamiliarity with vector machine architecture and programming practices.
The interactive vectorizer (which runs on any IBM-compatible system with MVS/TSO) displays the Fortran source with each statement labeled with a V (vectorized), S (scalar), or M (partially vectorized), and a static estimate of the execution cost of the statement. As the user interactively modifies a code, the vectorization labels and statement execution costs are updated on-screen. The vectorizer gives detailed explanations for failure to vectorize a statement, suggests alternative codings that will vectorize, and inserts compiler directives into the source based on user responses to the vectorizer's queries. Statement execution cost analyses are based on assumed DO loop iteration counts and IF statement true ratios. The user can supply his own estimate of these values, or run the FORTUNE execution analyzer to gather run-time statistics for a program, which can then be input to the interactive vectorizer to provide a more accurate dynamic statement execution cost analysis. .PP The Hitachi VECTIZER runs in batch mode; it provides additional information much like the Hitachi Fortran compiler's vectorization messages. .sp .B 7. Conclusions .R .PP The results of our benchmark show the CRAY X-MP-4 to be a consistently strong performer across a wide range of problems. The CRAY was particularly fast on programs dominated by scalar calculations and short vectors. The fast CRAY memory contributes to low vector startup times, leading to its exceptional short-vector performance. The CRAY scalar performance derives from its segmented functional units; the X-MP achieves enough concurrency in many scalar loops to outperform the Japanese machines, even though individual scalar arithmetic instruction times are longer on the CRAY than on the Fujitsu. .PP The Fujitsu and Hitachi computers perform faster than the CRAY for highly vectorizable programs, especially those with long (>50) vector lengths. 
The Fujitsu VP achieved the most dramatic peak performance in the benchmark, outperforming a single CRAY X-MP processor by factors of two to three on matrix-vector algorithms, with the Hitachi not far behind. Over the life cycle of a program, the Fujitsu and Hitachi machines should benefit relatively more than the CRAY from tuning that increases the degree of program vectorization. .PP The CRAY has I/O weaknesses that were not probed in this exercise. With an SSD, the CRAY has the highest I/O bandwidth of the three machines. However, owing to severe limits on the number of disk I/O paths and disk devices, the total CRAY disk storage capacity and aggregate disk I/O bandwidth fall far below that of the two Japanese machines. The CRAY is forced to depend on a front-end machine's mass storage system to manage the large quantities of disk data created and consumed by such a high-performance machine. .PP Several weaknesses were evident in the Fujitsu VP in this benchmark. The Fujitsu memory performance degrades seriously for nonconsecutive vectors. This was particularly evident in the BIGMAIN, DIF3D, and FAUCET benchmark programs. Even-number vector strides reduce the Fujitsu memory bandwidth by 75%, and a stride proportional to the number of memory banks (stride=n*128) reduces the memory bandwidth about 94%. This results in poor performance for vectorized Fortran COMPLEX arithmetic (stride=2). Fujitsu users will profit by reprogramming their complex arithmetic using only REAL arrays, and by ensuring that multidimensional-array algorithms are vectorized by column (stride=1) rather than by row. .PP Fujitsu's vector performance is substantially improved if a program's maximum vector lengths are evident at compile time, whether from explicit DO loop bounds, array dimension statements, or compiler directives. 
For example, the order-100 LINPACK benchmark improves by 12% to 19 MFLOPS, and the order-300 matrix-vector LU benchmark improves by 23% to 220 MFLOPS, when a Fujitsu compiler directive is included to specify the maximum vector length (numbers from the LINPACK benchmark paper [2]). When maximum vector lengths are known, the Fujitsu compiler can optimize the numbers and lengths of the vector registers and frequently avoid the logic that divides vectors into segments no larger than the vector registers. Fujitsu's short-loop performance, not strong to begin with, is particularly degraded by unnecessary vector segmentation ("stripmining") logic. None of the benchmark problems had explicit vector length information. .PP In many ways, the Hitachi computer seems to have the greatest vector potential. Despite its slower memory technology, the Hitachi has the highest single processor memory bandwidth, owing to its four memory pipes. Also, Hitachi has the most powerful vector macro instruction set and the most flexible set of arithmetic pipelines; in addition, the Hitachi is the only computer able to process vectors longer than its vector registers, entirely in hardware. The vectorizing Fortran compiler is impressive, although the compiler is rarely able to exploit fully the potential concurrency of the arithmetic pipelines. The Hitachi performs best on the benchmarks with little scalar content; its slow scalar performance \(em about half that of the Fujitsu computer \(em burdens its performance on every problem. .PP At present the Japanese Fortran compilers are superior to the CRAY compiler at vectorization. Advanced Fujitsu and Hitachi hardware features provide opportunities for vectorization that are unavailable on the CRAY. 
For example, the Japanese machines have macro instructions to vectorize dot products, simple recurrences, and the search for the maximum and minimum elements of an array, and they have multiple mask registers to allow vectorization of loops with nested IF statements.
Thus, a wider range of algorithms can vectorize on the Japanese computers than on the CRAY.
Also, the Japanese compilers provide the user with more useful information about the success and failure of vectorization.
Moreover, there is no CRAY equivalent to the Fujitsu interactive vectorizer and FORTUNE performance analyzer.
These advanced hardware features and vectorizing tools will make it easier to tune programs for optimum performance on the Japanese computers than on the CRAY.
.PP
The CRAY X-MP and the Japanese computers require different tuning strategies.
The CRAY compiler does not partially vectorize loops.
Therefore, CRAY users typically break up loops into their vectorizable and nonvectorizable parts.
The Japanese compilers, however, automatically segment loops into their vectorizable and nonvectorizable parts.
On the Japanese computers it is instead advantageous to merge smaller loops together, to take maximum advantage of their large vector register sets.
.sp 3
.B
References
.IP [1]
.R
J. J. Dongarra and S. C. Eisenstat, "Squeezing the Most out of an Algorithm in CRAY Fortran,"
.I
ACM Trans. Math. Software,
.R
Vol. 10, No. 3, pp. 221-230 (1984).
.sp
.IP [2]
.R
J. J. Dongarra,
.I
Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment,
.R
Argonne National Laboratory Report MCS-TM-23 (October 1985).
.sp
.IP [3]
.R
O. Lubeck, J. Moore, and R. Mendez,
.I
A Benchmark Comparison of Three Supercomputers: Fujitsu VP-200, Hitachi S-810/20 and CRAY X-MP-2
.sp 3
.B
Acknowledgment
.R
.sp
We would like to thank Gail Pieper for her excellent help in editing this report.
.R
.bp
.cs 1
.ps 11
.in 0
.ce 1
.B
Distribution for ANL-85-19
.ce 0
.sp 2
.B
Internal:
.sp
.in .75i
.nf
.R
J. J. Dongarra (40)
A. Hinds (40)
K. L. Kliewer
A. B. Krisciunas
P. C. Messina
G. W. Pieper
D. M. Pool
T. M. Woods (2)
ANL Patent Department
ANL Contract File
ANL Libraries
TIS Files (6)
.sp 2
.B
.in 0
External:
.R
.sp
.in .75i
DOE-TIC, for distribution per UC-32 (167)
Manager, Chicago Operations Office, DOE
Mathematics and Computer Science Division Review Committee:
.in +.4i
J. L. Bona, U. Chicago
T. L. Brown, U. of Illinois, Urbana
S. Gerhart, MCC, Austin, Texas
G. Golub, Stanford University
W. C. Lynch, Xerox Corp., Palo Alto
J. A. Nohel, U. of Wisconsin, Madison
M. F. Wheeler, Rice U.
.in -.4i
D. Austin, ER-DOE
J. Greenberg, ER-DOE
G. Michael, LLL
.