Newsgroups: comp.parallel
From: gottlieb@allan.ultra.nyu.edu (Allan Gottlieb)
Subject: Info on some new parallel machines
Nntp-Posting-Host: allan.ultra.nyu.edu
Organization: New York University, Ultracomputer project
Date: 18 Dec 92 12:31:15

A week or two ago, in response to a request for information on KSR, I posted the KSR section of a paper I presented at PACTA'92 in Barcelona in September.
I received a bunch of requests for a posting of the entire paper, which I "did".
Unfortunately, it seems to have disappeared somewhere between here and Clemson, so I am trying again.
I doubt if anyone will get this twice, but if so, please let me know and accept my apologies.

Allan Gottlieb

.\" Format via
.\" troff -me filename
.\" New Century Schoolbook fonts
.\" Delete next three lines if you don't have the font
.fp 1 NR \" normal
.fp 2 NI \" italic
.fp 3 NB \" bold
.sz 11
.nr pp 11
.nr ps 1v
.\" They want double space before paragraph
.nr sp 12
.nr fp 10
.pl 26c
.m1 1c
.m2 0
.m3 0
.m4 0
.ll 14c
.tp
.(l C
.sz +2
.b "Architectures for Parallel Supercomputing"
.sz -2
.sp .5c
Allan Gottlieb
.sp 1.5c
Ultracomputer Research Laboratory
New York University
715 Broadway, Tenth Floor
New York NY 10003 USA
.)l
.sp 1c
.sh 1 Introduction
.lp
In this talk, I will describe the architectures of new commercial offerings from Kendall Square Research, Thinking Machines Corporation, Intel Corporation, and the MasPar Computer Corporation.
These products span much of the currently active design space for parallel supercomputers, including shared-memory and message-passing, MIMD and SIMD, and processor sizes from a square millimeter to hundreds of square centimeters.
However, there is at least one commercially important class omitted: the parallel vector supercomputers, whose death at the hands of the highly parallel invaders has been greatly exaggerated (shades of Mark Twain).
Another premature death notice may have been given to FORTRAN, since all these machines speak (or rather understand) this language, but that is another talk.
.sh 1 "New Commercial Offerings"
.lp
I will describe the architectures of four new commercial offerings: the shared-memory MIMD KSR1 from Kendall Square Research; two message-passing MIMD computers, the Connection Machine CM-5 from Thinking Machines Corporation and the Paragon XP/S from Intel Corporation; and the SIMD MP-1 from the MasPar Computer Corporation.
Much of this section is adapted from material prepared for the forthcoming second edition of
.i "Highly Parallel Computing" ,
a book I co-author with George Almasi from IBM's T.J. Watson Research Center.
.sh 2 "The Kendall Square Research KSR1"
.lp
The KSR1 is a shared-memory MIMD computer with private, consistent caches; that is, each processor has its own cache, and the system hardware guarantees that the multiple caches are kept in agreement.
In this regard the design is similar to the MIT Alewife [ACDJ91] and the Stanford Dash [LLSJ92].
There are, however, three significant differences between the KSR1 and the two university designs.
First, the Kendall Square machine is a large-scale, commercial effort: the current design supports 1088 processors and can be extended to tens of thousands.
Second, the KSR1 features an ALLCACHE memory, which we explain below.
Finally, the KSR1, like the Illinois Cedar [GKLS84], is a hierarchical design: a small machine is a ring or
.q "Selection Engine"
of up to 32 processors (called an SE:0); to achieve 1088 processors, an SE:1 ring of 34 SE:0 rings is assembled.
Larger machines would use yet higher level rings.
More information on the KSR1 can be found in [Roth92].
.sh 3 Hardware
.lp
A 32-processor configuration (i.e. a full SE:0 ring) with 1 gigabyte of memory and 10 gigabytes of disk requires 6 kilowatts of power and 2 square meters of floor space.
This configuration has a peak computational performance of 1.28 GFLOPS and a peak I/O bandwidth of 420 megabytes/sec.
In a March 1992 posting to the comp.parallel electronic newsgroup, Tom Dunigan reported that a 32-processor KSR1 at the Oak Ridge National Laboratory attained 513 MFLOPS on the 1000\(mu1000 LINPACK benchmark.
A full SE:1 ring with 1088 processors equipped with 34.8 gigabytes of memory and 1 terabyte of disk would require 150 kilowatts and 74 square meters.
Such a system would have a peak floating point performance of 43.5 GFLOPS and a peak I/O bandwidth of 15.3 gigabytes/sec.
.pp
Each KSR1 processor is a superscalar 64-bit unit able to issue up to two instructions every 50 ns, giving a peak performance rating of 40 MIPS.
(KSR is more conservative and rates the processor as 20 MIPS since only one of the two instructions issued can be computational, but I feel that both instructions should be counted.
If there is any virtue in peak MIPS ratings, and I am not sure there is, it is that the ratings are calculated the same way for all architectures.)
Since a single floating point instruction can perform a multiply and an add, the peak floating point performance is 40 MFLOPS.
At present, a KSR1 system contains from eight to 1088 processors (giving a system-wide peak of 43,520 MIPS and 43,520 MFLOPS), all sharing a common virtual address space of one million megabytes.
.pp
The processor is implemented as a four-chip set consisting of a control unit and three co-processors, with all chips fabricated in 1.2 micron CMOS.
Up to two instructions are issued on each clock cycle.
The floating point co-processor supports IEEE single and double precision and includes linked triads similar to the multiply and add instructions found in the Intel Paragon.
The integer/logical co-processor contains its own set of thirty-two 64-bit registers and performs the usual arithmetic and logical operations.
The final co-processor provides a 32-MB/sec I/O channel at each processor.
Each processor board also contains a 256KB data cache and a 256KB instruction cache.
These caches are conventional in organization though large in size, and should not be confused with the ALLCACHE (main) memory discussed below.
.sh 3 "ALLCACHE Memory and the Ring of Rings"
.lp
Normally, caches are viewed as small temporary storage vehicles for data, whose permanent copy resides in central memory.
The KSR1 is more complicated in this respect.
It does have, at each processor, standard instruction and data caches, as mentioned above.
However, these are just the first-level caches.
.i Instead
of having main memory to back up these first-level caches, the KSR1 has second-level caches, which are then backed up by
.i disks .
That is, there is no central memory; all machine-resident data and instructions are contained in one or more caches, which is why KSR uses the term ALLCACHE memory.
The data (as opposed to control) portion of the second-level caches is implemented using the same DRAM technology normally found in central memory.
Thus, although they function as caches, these structures have the capacity and performance of main memory.
.pp
When a (local, second-level) cache miss occurs on processor A, the address is sent around the SE:0 ring.
If the requested address resides in B, another of the processor/local-cache pairs on the same SE:0 ring, then B forwards the cache line (a 128-byte unit, called a subpage by KSR) to A, again using the (unidirectional) SE:0 ring.
Depending on the access performed, B may keep a copy of the subpage (thus sharing it with A) or may cause all existing copies to be invalidated (thus giving A exclusive access to the subpage).
When the response arrives at A, it is stored in the local cache, possibly evicting previously stored data.
(If this is the only copy of the old data, special actions are taken not to evict it.)
Measurements at Oak Ridge indicate a 6.7 microsecond latency for their (32-processor) SE:0 ring.
.pp
If the requested address resides in processor/local-cache C, which is located on
.i another
SE:0 ring, the situation is more interesting.
Each SE:0 includes an ARD (ALLCACHE routing and directory cell), containing a large directory with an entry for every subpage stored on the entire SE:0.\**
.(f
\**Actually, an entry for every page, giving the state of every subpage.
.)f
If the ARD determines that the subpage is not contained in the current ring, the request is sent
.q up
the hierarchy to the (unidirectional) SE:1 ring, which is composed solely of ARDs, each essentially a copy of the ARD
.q below
it.
When the request reaches the SE:1 ARD above the SE:0 ring containing C, the request is sent down and traverses the ring to C, where it is satisfied.
The response from C continues on the SE:0 ring to the ARD, goes back up, then around the SE:1 ring, down to the SE:0 ring containing A, and finally around this ring to A.
.pp
Another difference between the KSR1 caches and the more conventional variety is size.
These are BIG caches, 32MB per processor.
Recall that they replace the conventional main memory and hence are implemented using dense DRAM technology.
.pp
The SE:0 bandwidth is 1 GB/sec and the SE:1 bandwidth can be configured to be 1, 2, or 4 GB/sec, with larger values more appropriate for systems with many SE:0s (cf. the fat-trees used in the CM-5).
Readers interested in a performance comparison between ALLCACHE and more conventional memory organizations should read [SJG92].
Another architecture built on the same cache-only memory idea is the Data Diffusion Machine from the Swedish Institute of Computer Science [HHW90].
.sh 4 Software
.lp
The KSR operating system is an extension of the OSF/1 version of Unix.
As is often the case with shared-memory systems, the KSR operating system runs on the KSR1 itself and not on an additional
.q host
system.
The latter approach is normally used on message-passing systems like the CM-5, in which case only a subset of the OS functions run directly on the main system.
Using the terminology of [AG89], the KSR operating system is symmetric, whereas the CM-5 uses a master-slave approach.
Processor allocation is performed dynamically by the KSR operating system, i.e. the number of processors assigned to a specific job varies with time.
.pp
A fairly rich software environment is supplied, including the X window system with the Motif user interface; FORTRAN, C, and COBOL; the ORACLE relational database management system; and AT&T's Tuxedo for transaction processing.
.pp
A FORTRAN programmer may request automatic parallelization of his/her program or may specify the parallelism explicitly; a C programmer has only the latter option.
.sh 2 "The TMC Connection Machine CM-5"
.lp
Thinking Machines Corporation has become well known for its SIMD Connection Machines CM-1 and CM-2.
Somewhat surprisingly, its next offering, the CM-5, has moved into the MIMD world (although, as we shall see, there is still hardware support for a synchronous style of programming).
Readers seeking additional information should consult [TMC91].
.sh 3 Architecture
.lp
At the very coarsest level of detail, the CM-5 is simply a message-passing MIMD machine, another descendant of the Caltech Cosmic Cube [Seit85].
But such a description leaves out a great deal.
The interconnection topology is a fat tree, there is support for SIMD, a combining control network is provided, vector units are available, and the machine is powerful.
We discuss each of these in turn.
.pp
A fat tree is a binary tree in which links higher in the tree have greater bandwidth (e.g. one can keep the clock constant and use wider busses near the root).
Unlike hypercube machines such as CM-1 and CM-2, a node in the CM-5 has a constant number of nearest neighbors independent of the size of the machine.
In addition, the bandwidth available per processor for random communication patterns remains constant as the machine size increases, whereas this bandwidth decreases for meshes (or non-fat trees).
Local communication is favored by the CM-5, but by only a factor of 4 over random communication (20 MB/sec vs. 5 MB/sec), which is much less than in other machines such as CM-2.
Also attached to this fat tree are I/O interfaces.
The device side of these interfaces can support 20 MB/sec; higher-speed devices are accommodated by ganging together multiple interfaces.
(If the destination node for the I/O is far from the interface, the sustainable bandwidth is also limited by the fat tree to 5 MB/sec.)
.pp
The fat tree just discussed is actually one of three networks on the CM-5.
In addition to this
.q "data network" ,
there is a diagnostic network used for fault detection and a control network, to which we turn next.
One function of the control network is to provide rapid synchronization of the processors, which is accomplished by a global OR operation that completes shortly after the last participating processor sets its value.
This
.q "cheap barrier"
permits the main advantage of SIMD (permanent synchrony implying no race conditions) without requiring that the processors always execute the same instruction.
.pp
A second function of the control network is to provide a form of hardware combining, specifically to support reduction and parallel prefix calculations.
A parallel prefix computation for a given binary operator \(*f (say addition) begins with each processor specifying a value and ends with each processor obtaining the sum of the values provided by itself and all lower-numbered processors.
These parallel prefix computations may be viewed as the synchronous, and hence deterministic, analogue of the fetch-and-\(*f operation found in the NYU Ultracomputer [GGKM83].
The CM-5 supports addition, maximum, logical OR, and XOR.
Two variants are also supplied: a parallel suffix and a segmented parallel prefix (and suffix).
With a segmented operation (think of worms, not virtual memory, and see [SCHW80]), each processor can set a flag indicating that it begins a segment, and the prefix computation is done separately for each segment.
Reduction operations are similar: each processor supplies a value and all processors obtain the sum of all values (again max, OR, and XOR are supported as well).
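.pp
To make the semantics of these scan operations concrete, the following C fragment is a purely sequential sketch of what the control network computes in hardware for the addition operator; the function names and the \f(CWNPROC\fP constant are illustrative only and do not correspond to any TMC library interface.
.(l
.ft CW
#define NPROC 8   /* number of participating processors (illustrative) */

/* Inclusive parallel prefix: out[i] = in[0] + ... + in[i]. */
void prefix_add(const int in[NPROC], int out[NPROC])
{
    int i, sum = 0;

    for (i = 0; i < NPROC; i++) {
        sum += in[i];
        out[i] = sum;
    }
}

/* Segmented prefix: a nonzero seg_start[i] marks processor i as the
   first of a new segment, where the running sum restarts. */
void segmented_prefix_add(const int in[NPROC],
                          const int seg_start[NPROC], int out[NPROC])
{
    int i, sum = 0;

    for (i = 0; i < NPROC; i++) {
        if (seg_start[i])
            sum = 0;
        sum += in[i];
        out[i] = sum;
    }
}

/* Reduction: every processor obtains the sum of all NPROC values. */
int reduce_add(const int in[NPROC])
{
    int i, sum = 0;

    for (i = 0; i < NPROC; i++)
        sum += in[i];
    return sum;
}
.ft
.)l
The parallel suffix variants are simply the mirror images of these loops, scanning from the highest-numbered processor downward.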
.pp
Each node of a CM-5 contains a SPARC microprocessor for scalar operations (users are advised against coding in assembler, a hint that the engine may change), a 64KB cache, and up to 32 MB of local memory.
Memory is accessed 64 bits at a time (plus 8 bits for ECC).
An option available with the CM-5 is the incorporation of four vector units between each processor and its associated memory.
When the vector units are installed, memory is organized as four 8 MB banks, one connected to each unit.
Each vector unit can perform both floating-point and integer operations, either at a peak rate of 32 million 64-bit operations per second.
.pp
As mentioned above, the CM-5 is quite a powerful computer.
With the vector units present, each node has a peak performance of 128 64-bit MFLOPS or 128 64-bit integer MOPS.
The machine is designed for a maximum of 256K nodes, but the current implementation is
.q "limited"
to 16K due to restrictions on cable lengths.
Since the peak computational rate for a 16K-node system exceeds 2 teraflops, one might assert that the age of (peak)
.q "teraflop computing"
has arrived.
However, as I write this in May 1992, the largest announced delivery of a CM-5 is a 1K-node configuration without vector units.
A full 16K-node system would cost about one-half billion U.S. dollars.
.sh 3 "Software and Environment"
.lp
In addition to the possibly thousands of computation nodes just described, a CM-5 contains a few control processors that act as hosts into which users log in.
The reason for multiple control processors is that the system administrator can divide the CM-5 into partitions, each with an individual control processor as host.
The host provides a conventional
.sm UNIX -like
operating system; in particular, users can timeshare a single partition.
Each computation node runs an operating system microkernel supporting a subset of the full functionality available on the control processor acting as its host (a master-slave approach, see [AG89]).
.pp
Parallel versions of Fortran, C, and Lisp are provided.
CM Fortran is a mild extension of Fortran 90.
Additional features include a \f(CWforall\fP statement and vector-valued subscripts.
For an example of the latter, assume that \f(CWA\fP and \f(CWP\fP are vectors of size 20 with all \f(CWP(I)\fP between 1 and 20; then \f(CWA=A(P)\fP performs the 20 parallel assignments \f(CWA(I)=A(P(I))\fP.
.pp
An important contribution is the CM Scientific Software Library, a growing set of numerical routines hand-tailored to exploit the CM-5 hardware.
Although primarily intended for the CM Fortran user, the library is also usable from TMC's versions of C and Lisp, C* and *Lisp.
To date the library developers have concentrated on linear algebra, FFTs, random number generators, and statistical analyses.
.pp
In addition to supporting the data parallel model of computing typified by Fortran 90, the CM-5 also supports synchronous (i.e. blocking) message passing, in which the sender does not proceed until its message is received.
(This is the rendezvous model used in Ada and CSP.)
Limited support for asynchronous message passing is provided, and further support is expected.
.sh 2 "The Intel Paragon XP/S"
.lp
The Intel Paragon XP/S Supercomputer [Inte91] is powered by a collection of up to 4096 Intel i860 XP processors and can be configured to provide peak performance ranging from 5 to 300 GFLOPS (64-bit).
The processing nodes are connected in a rectangular mesh pattern, unlike the hypercube connection pattern used in the earlier Intel iPSC/860.
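.pp
The top end of the 5-300 GFLOPS range quoted above follows directly from the per-node figures given below; the short C fragment here merely carries out the arithmetic and is not part of any Intel software.
.(l
.ft CW
#include <stdio.h>

int main(void)
{
    const double mflops_per_node = 75.0;   /* peak 64-bit MFLOPS per i860 XP node */
    const int    max_nodes       = 4096;   /* largest Paragon XP/S configuration  */
    const double peak_gflops = mflops_per_node * max_nodes / 1000.0;

    printf("peak: %.1f GFLOPS", peak_gflops);   /* prints 307.2, roughly 300 GFLOPS */
    return 0;
}
.ft
.)l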
.pp
The i860 XP node processor chip (2.5 million transistors) has a peak performance of 75 MFLOPS (64-bit) and 42 MIPS when operating at 50 MHz.
The chip contains 16KByte data and instruction caches, and can issue a multiply and add instruction in one cycle [DS90].
The maximum bandwidth from cache to floating point unit is 800 MBytes/sec.
Communication bandwidth between any two nodes is 200 MBytes/sec full duplex.
Each node also has 16-128 MBytes of memory and a second i860 XP processor devoted to communication.
.pp
The prototype for the Paragon, the Touchstone Delta, was installed at Caltech\** in 1991
.(f
\**\^The machine is owned by the Concurrent Supercomputing Consortium, an alliance of universities, laboratories, federal agencies, and industry.
.)f
and immediately began to compete with the CM2 Connection Machine for the title of
.q "world's fastest supercomputer" .
The lead changed hands several times.\**
.(f
\**\^One point of reference is the 16 GFLOPS reported at the Supercomputing '91 conference for seismic modeling on the CM2 [MS91].
.)f
.pp
The Delta system consists of 576 nodes arranged in a mesh that has 16 rows and 36 columns.
Thirty-three of the columns form a computational array of 528 numeric nodes (computing nodes) that each contain an Intel i860 microprocessor and 16 MBytes of memory.
This computational array is flanked on each side by a column of I/O nodes that each contain a 1.4 GByte disk (the number of disks is to be doubled later).
The last column contains two HIPPI interfaces (100 MBytes/sec each) and an assortment of tape, Ethernet, and service nodes.
Routing chips provide internode communication at 25 MBytes/sec with a latency of 80 microseconds.
The peak performance of the i860 processor is 60 MFLOPS (64-bit), which translates to a peak performance for the Delta of over 30 GFLOPS (64-bit).
Achievable speeds in the range 1-15 GFLOPS have been claimed.
Total memory is 8.4 GBytes; on-line disk capacity is 45 GBytes, to be increased to 90 GBytes.
.pp
The operating system being developed for the Delta consists of OSF/1 with extensions for massively parallel systems.
The extensions include a decomposition of OSF/1 into a pure Mach kernel (OSF/1 is based on Mach), and a modular server framework that can be used to provide distributed file, network, and process management services.
.pp
The system software for interprocess communication is compatible with that of the iPSC/860.
The Express environment is also available.
Language support includes Fortran and C.
The Consortium intends to allocate 80% of the Delta's time for
.q "Grand Challenge"
problems (q.v.).
.sh 2 "The MasPar MP-1"
.lp
Given the success of the CM1 and CM2, it is not surprising to see another manufacturer produce a machine in the same architectural class (SIMD, tiny processor).
What perhaps
.i "is"
surprising is that Thinking Machines, with the new CM-5, has moved to an MIMD design.
The MasPar Computer Corporation's MP-1 system, introduced in 1990, features an SIMD array of up to 16K 4-bit processors organized as a 2-dimensional array with each processor connected to its 8 nearest neighbors (i.e., the NEWS grid of CM1 plus the four diagonals).
MasPar refers to this interconnection topology as the X-Net.
The MP-1 also contains an array control unit that fetches and decodes instructions, computes addresses and other scalars, and sends control signals to the processor array.
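.pp
As a concrete picture of the X-Net topology, the following C sketch enumerates the eight neighbors of the processing element at a given row and column; the 128-by-128 shape assumed for the full 16K array and the toroidal treatment of the array edges are assumptions of the sketch, not specifications taken from MasPar.
.(l
.ft CW
#define ROWS 128   /* assumed layout of the maximum configuration: */
#define COLS 128   /* 128 x 128 = 16K processing elements          */

/* Fill nbr_row[]/nbr_col[] with the coordinates of the eight X-Net
   neighbors (N, S, E, W plus the four diagonals) of PE (row, col). */
void xnet_neighbors(int row, int col, int nbr_row[8], int nbr_col[8])
{
    static const int drow[8] = { -1,  1,  0,  0, -1, -1,  1,  1 };
    static const int dcol[8] = {  0,  0,  1, -1, -1,  1, -1,  1 };
    int k;

    for (k = 0; k < 8; k++) {
        nbr_row[k] = (row + drow[k] + ROWS) % ROWS;   /* assumed wraparound */
        nbr_col[k] = (col + dcol[k] + COLS) % COLS;
    }
}
.ft
.)l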
.pp
An MP-1 system of maximum size has a peak speed of 26 GIPS (32-bit operations) or 550 MFLOPS (double precision) and dissipates about a kilowatt (not including I/O).
The maximum memory size is 1 GB and the maximum bandwidth to memory is 12 GB/sec.
When the X-Net is used, the maximum aggregate inter-PE communication bandwidth is 23 GB/sec.
In addition, a three-stage global routing network is provided, utilizing custom routing chips and achieving up to 1.3 GB/sec aggregate bandwidth.
This same network is also connected to a 256 MB I/O RAM buffer that is in turn connected to a frame buffer and various I/O devices.
.pp
Although the processor is internally a 4-bit device (e.g. the datapaths are 4 bits wide), it contains 40 programmer-visible, 32-bit registers and supports integer operands of 1, 8, 16, 32, or 64 bits.
In addition, the same hardware performs 32- and 64-bit floating point operations.
This last characteristic is reminiscent of the CM1 design, but not the CM2 with its separate Weiteks.
Indeed, a 16K MP-1 does perform 16K floating point adds as fast as it performs one, whereas a 64K CM2 performs only 2K floating point adds concurrently (one per Weitek).
The tradeoff is naturally in single-processor floating point speed.
The larger, and hence less numerous, Weiteks produce several MFLOPS each; the MP-1 processors achieve only a few dozen KFLOPS (which surpasses the older CM1 processors).
.pp
MasPar is able to package 32 of these 4-bit processors on a single chip, illustrating the improved technology now available (two-level metal, 1.6 micron CMOS with 450,000 transistors) compared to the circa 1985 technology used in CM1, which contained only 16 1-bit processors per chip.
Each 14"x19" processor board contains 1024 processors, clocked at 80 ns, and 16 MB of ECC memory, the latter organized as 16KB per processor and implemented using page-mode 1 Mb DRAMs.
.pp
A DECstation 5000 is used as a host and manages program execution, user interface, and network communications for an MP-1 system.
The languages supported include data parallel versions of FORTRAN and C, as well as the MasPar Parallel Application Language (MPL), which permits direct program control of the hardware.
Ultrix, DEC's version of UNIX, runs on the host and provides a standard user interface.
DEC markets the MP-1 as the DECmpp 12000.
.pp
Further information on the MP-1 can be found in [Chri90], [Nick90], [Blan90], and [Masp91].
An unconventional assessment of virtual processors, as used for example in CM2, appears in [Chri91].
.uh References
.(b I F
.ll 14c
.ti 0
[ACDJ91] Anant Agarwal, David Chaiken, Godfrey D'Souza, Kirk Johnson, David Kranz, John Kubiatowicz, Kiyoshi Kurihara, Beng-Hong Lim, Gino Maa, Dan Nussbaum, Mike Parkin, and Donald Yeung,
.q "The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor" ,
in
.i "Proceedings of Workshop on Scalable Shared Memory Multiprocessors" ,
Kluwer Academic Publishers, 1991.
.)b
.(b I F
.ll 14c
.ti 0
[AG89] George Almasi and Allan Gottlieb,
.i "Highly Parallel Computing" ,
Benjamin/Cummings, 1989, 519 pages.
.)b
.(b I F
.ll 14c
.ti 0
[Blan90] Tom Blank,
.q "The MasPar MP-1 Architecture" ,
.i "IEEE COMPCON Proceedings" ,
1990, pp. 20-24.
.)b
.(b I F
.ll 14c
.ti 0
[Chri90] Peter Christy,
.q "Software to Support Massively Parallel Computing on the MasPar MP-1" ,
.i "IEEE COMPCON Proceedings" ,
1990, pp. 29-33.
.)b
.(b I F
.ll 14c
.ti 0
[Chri91] Peter Christy,
.q "Virtual Processors Considered Harmful" ,
.i "Sixth Distributed Memory Computing Conference Proceedings" ,
1991.
.)b
.(b I F
.ll 14c
.ti 0
[DS90] Robert B.K. Dewar and Matthew Smosna,
.i "Microprocessors: A Programmer's View" ,
McGraw-Hill, New York, 1990.
.)b
.(b I F
.ll 14c
.ti 0
[GKLS84] Daniel Gajski, David Kuck, Duncan Lawrie, and Ahmed Sameh,
.q Cedar ,
in
.i "Supercomputers: Design and Applications" ,
Kai Hwang, ed., 1984.
.)b
.(b I F
.ll 14c
.ti 0
[HHW90] E. Hagersten, S. Haridi, and D.H.D. Warren,
.q "The Cache Coherence Protocol of the Data Diffusion Machine" ,
.i "Cache and Interconnect Architectures in Multiprocessors" ,
edited by Michel Dubois and Shreekant Thakkar, 1990.
.)b
.(b I F
.ll 14c
.ti 0
[Inte91] Intel Corporation literature, November 1991.
.)b
.(b I F
.ll 14c
.ti 0
[LLSJ92] Dan Lenoski, James Laudon, Luis Stevens, Truman Joe, Dave Nakahira, Anoop Gupta, and John Hennessy,
.q "The DASH Prototype: Implementation and Performance" ,
.i "Proc. 19th Annual International Symposium on Computer Architecture" ,
May 1992, Gold Coast, Australia, pp. 92-103.
.)b
.(b I F
.ll 14c
.ti 0
[Masp91]
.q "MP-1 Family Massively Parallel Computers" ,
MasPar Computer Corporation, 1991.
.)b
.(b I F
.ll 14c
.ti 0
[MS91] Jacek Myczkowski and Guy Steele,
.q "Seismic Modeling at 14 gigaflops on the Connection Machine" ,
.i "Proc. Supercomputing '91" ,
Albuquerque, November 1991.
.)b
.(b I F
.ll 14c
.ti 0
[Nick90] John R. Nickolls,
.q "The Design of the MasPar MP-1: A Cost Effective Massively Parallel Computer" ,
.i "IEEE COMPCON Proceedings" ,
1990, pp. 25-28.
.)b
.(b I F
.ll 14c
.ti 0
[Roth92] James Rothnie,
.q "Overview of the KSR1 Computer System" ,
Kendall Square Research Report TR 9202001, March 1992.
.)b
.(b I F
.ll 14c
.ti 0
[Seit85] Charles L. Seitz,
.q "The Cosmic Cube" ,
.i "Communications of the ACM" ,
.b 28
(1), January 1985, pp. 22-33.
.)b
.(b I F
.ll 14c
.ti 0
[SJG92] Per Stenstrom, Truman Joe, and Anoop Gupta,
.q "Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures" ,
.i "Proceedings, 19th International Symposium on Computer Architecture" ,
1992.
.)b
.(b I F
.ll 14c
.ti 0
[TMC91]
.q "The Connection Machine CM-5 Technical Summary" ,
Thinking Machines Corporation, 1991.
.)b