\begin{center}
{\bf Advanced Architecture Computers*} \\
\vspace{.4in}
{\em Jack J. Dongarra and Iain S. Duff} \\
\vspace{.15in}
({\em dongarra@mcs.anl.gov} and {\em na.duff@na-net.stanford.edu}) \\
\vspace{.15in}
Mathematics and Computer Science Division \\
Argonne National Laboratory \\
Argonne, Illinois 60439-4844 \\
\vspace{.15in}
Computer Science and Systems Division \\
Building 8.9 \\
Harwell Laboratory \\
Oxfordshire OX11 0RA \\
England
\end{center}
\vspace{.4in}
{\bf Abstract:} We describe the characteristics of several recent computers that employ vectorization or parallelism to achieve high performance in floating-point calculations. We consider both top-of-the-range supercomputers and computers based on readily available and inexpensive basic units. In each case we discuss the architectural base, novel features, performance, and cost. We intend to update this report regularly, and to this end we welcome comments.
\vspace{.3in}
\noindent {\bf Keywords}
\par\noindent vector processors, array processors, parallel architectures, supercomputers, high-performance computers
\section{Introduction}
In the past few years several machines have been announced that use some form of parallelism to achieve performance in excess of that attainable directly from the underlying technology of the constituent chips. To a large degree the availability of low-cost chips as building blocks has given rise to many of these new machines. After listening to numerous technical and sales presentations on these new computers, we became overwhelmed and confused by the characteristics of each product and its relative strengths and weaknesses. In an effort to clarify these issues - both for ourselves and for other computational scientists - we have written this report summarizing the range of machines available, the architectures employed, and the principal features of each machine.
In Section 2, we list the computers considered and discuss the criteria we have used to select them. We present a rough classification based on architectural features and their niche in the marketplace. This classification divides the machines into five categories: supercomputers, minisupercomputers, vector add-ons or vector-assisted mainframes, parallel processors, and high-performance graphics workstations. Each category is discussed in turn in Sections 3 through 7. More detailed information on the machines is provided in Appendix B. The guidelines used in preparing the detailed descriptions are given in Section 8.
In some cases, our data are incomplete and nonuniform. This situation reflects the technical level of the presentations, the documentation available to us, the stage of development of the product being described, and the comments received from vendors on draft copies of our document. We welcome comments and criticisms that might help to remedy any deficiencies. This is the second edition of the report. We intend to continue updating it to reflect both the changing marketplace and further information on currently listed machines.
\section{Summary and Classification of Machines Considered}
In the past few years there has been an unprecedented explosion in the number of different computers in the marketplace. This explosion has been fueled partly by the availability of powerful and cheap building blocks and partly by the availability of venture capital. There have been two main directions to this explosion.
One has been the personal computer and workstation market, and the other the development and marketing of computers using advanced architectural concepts. In this report we restrict our study to the latter group, with particular interest in architectures that use some form of parallelism to increase performance over that of the basic chip. We also restrict our attention to machines that are available commercially, and thus exclude research projects in universities and government laboratories and products still at the design stage. We would, however, welcome being alerted to ongoing activities. We have necessarily had to exclude information obtained under non-disclosure agreements. We will update this report as such information is released through product announcements.
A much-referenced and useful taxonomy of computer architectures was given by Flynn (1966). He divided machines into four categories:
\begin{flushleft}
(i) SISD - single instruction stream, single data stream\\
(ii) SIMD - single instruction stream, multiple data stream\\
(iii) MISD - multiple instruction stream, single data stream\\
(iv) MIMD - multiple instruction stream, multiple data stream\\
\end{flushleft}
Although these categories give a helpful coarse division, we find immediately that the current situation is more complicated, with some architectures exhibiting aspects of more than one category. Many of today's machines are really hybrid designs. For example, the CRAY X-MP has up to four processors (MIMD), but each processor uses pipelining (SIMD) for vectorization; a short code sketch at the end of this section illustrates these two levels of parallelism. Moreover, where there are multiple processors, the memory can be local, global, or a combination of these. There may or may not be caches and virtual memory systems, and the interconnections can be by crossbar switches, multiple bus-connected systems, time-shared bus systems, etc.
We thus choose a different method of subdividing and classifying the machines from that used in our original report (Dongarra and Duff 1987). As before, we identify the supercomputers separately and discuss these in Section 3. However, we split the other machines according to their niche in the marketplace rather than their connectivity or mode of data access or data transfer. Minisupercomputers can be defined as junior versions of supercomputers that offer a similar interface to the larger machines but with lower performance and reduced costs. We consider machines in this class in Section 4. Some powerful vector computers do not fall into either of the previous classes but are based on an enhancement to a mainframe computer through the addition of an array processor or an integrated vector facility. We discuss both types of computer in Section 5. In Section 6, we consider machines that rely primarily on parallelism rather than pipelined vector processing, and divide these into two categories depending on whether we regard them as good experimental vehicles for studying parallelism and parallel algorithms or whether we consider them as potential supercomputers of the future. In Section 7 we summarize the high-performance graphics workstations that do not themselves qualify for the previous categories but that are clearly in a different class from regular top-of-the-line workstations.
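To make concrete the distinction drawn above between pipelined (SIMD) vectorization and MIMD multiprocessing, the following simple sketch (written by us in C purely as an illustration, and not representing the programming model of any particular machine) computes a matrix-vector product $y = Ax$. The inner loop is the kind of regular, independent arithmetic that a pipelined vector unit exploits within a single processor, while the independent outer-loop iterations could be divided among the processors of an MIMD machine.
\begin{verbatim}
/* Illustration only: y = A*x for an n-by-n matrix A.
 * The inner loop (over j) is the regular, independent arithmetic
 * that a pipelined vector unit exploits within one processor;
 * the n outer-loop iterations are independent of one another and
 * could be shared among the processors of an MIMD machine.
 */
#include <stdio.h>

#define N 4

void matvec(int n, double a[N][N], double x[N], double y[N])
{
    int i, j;

    for (i = 0; i < n; i++) {        /* independent rows: MIMD candidate */
        double sum = 0.0;
        for (j = 0; j < n; j++)      /* vectorizable inner loop: SIMD    */
            sum += a[i][j] * x[j];
        y[i] = sum;
    }
}

int main(void)
{
    double a[N][N] = {{1, 0, 0, 0}, {0, 2, 0, 0}, {0, 0, 3, 0}, {0, 0, 0, 4}};
    double x[N] = {1.0, 1.0, 1.0, 1.0};
    double y[N];
    int i;

    matvec(N, a, x, y);
    for (i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
\end{verbatim}
On a hybrid machine such as the CRAY X-MP, the inner loop would typically be vectorized on each processor while the outer-loop iterations are divided among the processors.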
\section{Supercomputers}
Supercomputers are by definition the fastest and most powerful general-purpose scientific computing systems available at any given time. They offer speed and capacity significantly greater than those of mainframe computers, defined as top-of-the-range widely available machines built primarily for commercial use. The term supercomputer became prevalent in the early 1960s, with the development of the CDC 6600. That machine, first marketed in 1963, boasted a performance of 1 Megaflops (millions of floating-point operations per second). During the next fifteen years, the peak performance of supercomputers grew at a rapid rate, and since 1980 that trend has accelerated. The projected 1995 machine is expected to have a maximum speed of 200 Gigaflops, more than 200,000 times that of the CDC 6600 (see Table 1).
\begin{center}
Table 1. Performance Trends in Scientific Supercomputing
\end{center}
\vspace{.1in}
\begin{tabular}{l|l l l l}
Year & Machine & Speed & \multicolumn{2}{l}{Speed Increase} \\
 & & & 10 years & 20 years \\
\hline
1963 & CDC 6600 & 1 MFLOPS & - & - \\
1969 & CDC 7600 & 4 MFLOPS & 4 & - \\
1979 & CRAY-1 & 160 MFLOPS & 100 & - \\
1983 & CYBER 205 & 400 MFLOPS & 100 & 400 \\
1986 & CRAY-2 & 2 GFLOPS & 500 & 2000 \\
1990-1995 & - & 200 - 1000 GFLOPS & 1000 & 250,000 \\
\end{tabular}
\vspace{.1in}
Many companies have devoted their resources to producing the fastest and most powerful machines on the market. Their strategy has been to develop a few state-of-the-art machines that enable scientists and engineers to tackle problems previously considered computationally infeasible. From these commercial ventures we have seen the development of vector and, more recently, parallel computers capable of solving complex numerical and nonnumerical problems. The second generation, with higher speed and more parallelism, is already under development. In Table 2, we summarize the currently available supercomputers.
\begin{center}
Table 2. Supercomputers
\end{center}
\vspace{.1in}
\begin{tabular}{l | r r c r}
Machine & Maximum Rate, & Memory, & OS & Number \\
 & in MFLOPS & in Mbytes & & of Processors \\
\hline
CRAY-1 & 160 & 32 & Own & 1 \\
CRAY X-MP & 941 & 512 & Own/UNIX & 4 \\
CRAY Y-MP & 2667 & 256 & Own/UNIX & 8 \\
CRAY-2 & 1951 & 4096 & UNIX & 4 \\
CYBER 205 & 400 & 128 & Own & 1 \\
ETA-10G & 5714(a) & 2048(b) & UNIX/VSOS & 8 \\
ETA-10E & 3810(a) & 2048(b) & UNIX/VSOS & 8 \\
ETA-10Q & 526(a) & 512(b) & UNIX/VSOS & 2 \\
Fujitsu VP-400E & 1714 & 1024 & Own & 1 \\
Fujitsu VP-200E & 857 & 1024 & Own & 1 \\
Fujitsu VP-100E & 429 & 1024 & Own & 1 \\
Fujitsu VP-50E & 286 & 1024 & Own & 1 \\
Fujitsu VP-30E & 133 & 1024 & Own & 1 \\
Hitachi S-820/80 & 2000 & 512(c) & Own & 1 \\
Hitachi S-810/20 & 857 & 512(c) & Own & 1 \\
NEC SX-2A & 1300 & 1024(d) & Own & 1 \\
NEC SX-1A & 650 & 1024(d) & Own & 1 \\
NEC SX-1E & 324 & 1024(d) & Own & 1 \\
\end{tabular}
\vspace{.1in}
\begin{tabbing}
aaa\=bbb\= \kill
\>(a) For 64-bit processing on 2 pipelines with linked triad and overlapped\\
\>\> scalar processing\\
\>(b) Also 16 MWord (128 Mbyte) local memory for each processor\\
\>(c) Also a 12-Gbyte extended memory\\
\>(d) Also an 8-Gbyte extended memory\\
\end{tabbing}
\vspace{.1in}
The actual price of the systems in Table 2 depends on the configuration, with most manufacturers offering systems in the \$5 million to \$20 million range. All use ECL logic with LSI, except the CRAY X-MP and the CRAY-1, which use SSI, and the ETA-10, which uses CMOS ALSI (Advanced Large Scale Integration); all use pipelining and/or multiple functional units to achieve vectorization/parallelization within each processor.
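As an indication of how the peak rates in Table 2 follow from pipelining and multiple functional units, the theoretical peak is simply the number of floating-point results that can be completed per clock period, summed over the processors and divided by the cycle time. Taking, purely as nominal illustrative figures, a cycle time of 8.5 ns and one addition plus one multiplication completed per cycle on each of the four processors of a CRAY X-MP, we obtain
\[
\mbox{peak rate} \;=\; \frac{\mbox{results per cycle} \times \mbox{number of processors}}
                            {\mbox{cycle time}}
                 \;=\; \frac{2 \times 4}{8.5 \mbox{ ns}}
                 \;\approx\; 941 \mbox{ Mflops},
\]
which is consistent with the figure quoted in Table 2. Such rates are upper bounds; sustained performance on real codes is normally well below the theoretical peak.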
Cray and ETA are the only supercomputer manufacturers to offer multiple-processor machines, although other vendors have announced multiprocessor machines for delivery in the near future. The form of synchronization on both the Cray and ETA machines is essentially event handling. Both the Fujitsu and the Hitachi systems are IBM System/370 compatible. We have included the CRAY-1 computer in the above table largely as a benchmark, since it could not now be considered a supercomputer in terms of performance and is no longer manufactured by Cray. The Fujitsu machines are marketed in Europe and North America by Amdahl (the 500E to 1400E range) and by Siemens (the VP-50 to 400 range).
\section{Minisupercomputers}
Below the supercomputer market, a new class of near-supercomputers or minisupercomputers has emerged. These systems typically feature strong vector or advanced scalar capabilities and have been used for traditional high-performance technical computing applications. Priced well below supercomputers, generally from \$100,000 to no more than \$1 million, minisupercomputers are frequently sold when budgets are limited to this price range or when stand-alone capabilities are required. Early leaders in the field of minisupercomputing were Alliant, Convex, and Scientific Computer Systems. More recently, this market has experienced high growth, and many new products and companies have emerged, including Multiflow and Gould (see Table 3).
\begin{center}
Table 3. Minisupercomputers
\end{center}
\vspace{.1in}
\begin{tabular}{l | c c c }
Vendor & Theoretical Peak & LINPACK & First Shipment \\
 & Performance & Performance & \\
 & Mflops (64 bits) & Mflops & \\
\hline
Alliant FX/8 & 94 & 7.6 & 1985 \\
Alliant FX/80 & 188 & 8.5 & 1987 \\
Astronautics & 90 & 7.1 & 1988 \\
Convex C1 & 20 & 7.3 & 1984 \\
Convex C2 & 200 & 16 & 1987 \\
FPS 500 & -- & -- & 1988 \\
Multiflow Trace 28/200 & 60 & 10 & 1987 \\
\end{tabular}
\vspace{.1in}
\section{Enhanced Mainframes}
An alternative in the near-supercomputer category is the add-on array processor. Companies such as Floating Point Systems and Star Technology are actively marketing these add-on products in an effort to attract current supercomputer users. In a related vein, vector-processing enhancements are now being marketed for commercial mainframes. These vector enhancements allow machines produced for general-purpose applications to offer users increased numerical capability. In some cases, the vector capability is extended to more than one processor in multiprocessing mode. Companies currently offering such vector-processing capabilities include Control Data, Hitachi (marketed in the West by NAS and COMPAREX), Honeywell, IBM, and UNISYS. We summarize some of the machines in this category in Table 4.
\begin{center}
Table 4. Power-assisted mainframes
\end{center}
\vspace{.1in}
\begin{tabular}{l|c c c c}
Machine & Maximum Rate, & Memory, & OS & Number of \\
 & Mflops & Mbytes & & Processors \\
\hline
CDC 180 990 & 125 & 256 & NOS/VE & 1-2 \\
FPS M64/140 & 187 & 128 & Own & 1 \\
IBM 3090S/VF & 696 & 256 (a) & Own & 1 - 6 \\
NAS AS/91X0 & ? & 64 & Own & 1 or 2 \\
Unisys 1190/ISP & 266 & 128 & Own & 1,2,4 (c) \\
\end{tabular}
\vspace{.15in}
\begin{tabbing}
aaa\= \kill
\> (a) Also a 2-Gbyte extended memory\\
\> (b) In 32-bit arithmetic\\
\> (c) Only 1 or 2 ISPs can be attached\\
\end{tabbing}
\vspace{.1in}
\section{Parallel Machines}
While most of the supercomputers and minisupercomputers utilize vector processing to provide performance, a number of new companies are developing parallel processing systems. Such systems range from smaller (8- to 30-processor) machines like the Sequent or Encore to massively parallel systems like the Thinking Machines CM-2, with up to 65,536 processors. Others in this area include Floating Point Systems, Myrias, BBN Advanced Computing, and DEC; and they may be joined soon by IBM, which has indicated that it will offer a product in this category by 1989.
While it is certainly true that the parallel architectures fall into two camps depending on whether or not they are potential supercomputers, it is less easy to assign a particular machine to one of these classes. We have, however, made a partly subjective judgment and compare the parallel architectures in two tables. Table 5 summarizes those parallel architectures that are designed for experimentation with parallel constructs, and Table 6 lists machines with potential for future elevation to the status of a supercomputer.
\pagebreak
\begin{center}
Table 5. Experimental parallel machines
\end{center}
\vspace{.1in}
\begin{tabular}{l|c c c }
Machine & Chip & Max. Parallelism & Connection \\
\hline
Elxsi 6400 & ECL & 12 & bus \\
Encore Multimax & 32332/32081 & 20 & bus \\
(optional Weitek 1164/1165) & & & \\
Flex/32 & 32032/32081 & 20 & bus \\
Sequent Symmetry S81 & 80386/80387 & 30 & bus \\
\end{tabular}
\vspace{.1in}
\begin{center}
Table 6. Potential supercomputers
\end{center}
\vspace{.1in}
\begin{tabular}{l|c c c}
Machine & Chip & Parallelism & Connection \\
\hline
Active Memory (DAP) & CMOS & 4096 (SIMD) & near-neighbor \\
BBN Butterfly TC 2000 & 88000 & 256 & Banyan network \\
CYBERPLUS & Own & 256 & ring \\
Intel iPSC/2 & 80386/80387 & 128 & hypercube \\
IP-1 & Own & 33 & cross-bar \\
Meiko & Transputer & No limit (a) & user-configurable \\
Myrias SPS-2 & 68020/68882 & 512 minimum & hierarchical bus \\
NCUBE & VLSI & 1024 & hypercube \\
TMC CM-2 & VLSI & 65536 (SIMD) & hypercube \\
\end{tabular}
\vspace{.15in}
\begin{tabbing}
aaa\= \kill
\> (a) Maximum system delivered to date has 1024 processors\\
\end{tabbing}
Because of the widely differing architectures of the machines in Tables 5 and 6, it is not really advisable to give one or even two values for the memory. In some instances there is an identifiable global memory; in others there is a fixed amount of memory per processor. Additionally, it may be possible to configure memory as either local or global. A value for the maximum speed is even less meaningful than in the previous tables, since a high Megaflop rate is not necessarily the objective of these machines and the actual speed will depend on the algorithm and application.
\section{High-Performance Graphics Workstations}
Finally, the supercomputer market has been expanded by the introduction of supercomputing workstations and single-user high-performance graphics systems, such as those from Apollo, Ardent, Stellar, and Silicon Graphics. We summarize these machines in Table 7.
\begin{center}
Table 7. High-performance graphics workstations
\end{center}
\vspace{.1in}
\begin{tabular}{l|c c c}
Machine & Chip & Peak performance, & Memory, \\
 & & Mflops & Mbytes \\
\hline
Apollo DN10000 & Own & ? & ? \\
Ardent TITAN & MIPS/Weitek & 64 & 128 \\
Silicon Graphics IRIS GT & MIPS/Weitek & 100 & 16 \\
Stellar GS2000 & Own/Weitek & 80 & 128 \\
\end{tabular}
\vspace{.15in}
\section{Template for Machine Description}
As we mentioned in the introduction, the level of technical information on each machine varied significantly. We have, however, attempted to organize the available information in a consistent manner. In Table 8, we give the template used in presenting the data in the appendixes.
\begin{center}
Table 8. Template for Description of Machines
\end{center}
\begin{verbatim}
Name of machine, manufacturer, backers, etc.
Architecture
    Basic chip used
    Local, shared memory, or both
    Connectivity (for example, grid, hypercube)
    Range of memory sizes available; virtual memory
    Floating point unit (IEEE standard?)
Configuration
    Stand-alone or range of front-ends
    Peripherals
Software
    UNIX or other?
    Languages available
    Fortran characteristics
        F77
        Extensions
        Debugging facilities
        Vectorizing/parallelizing capabilities
Applications
    Run on prototype
    Software available
Performance
    Peak
    Benchmarks on codes and kernels
Status
    Date of delivery of first machine, beta sites, etc.
    Expected cost (cost range)
    Proposed market (numbers and class of users)
Contact: technical and sales
\end{verbatim}
\newpage
\begin{tabular}{l r}
Machine & Page \\
\end{tabular}