.B .nr BT ''-%- '' .he '''' .pl 11i .de tt 'sp 3 'tl ''-%-'' 'sp 2 .. .wh 0 tt .tt .B .nr BT ''-%-'' .he '''' .pl 11i .de fO 'bp .. .wh -.5i fO .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 Advanced Architecture Computers .ps 11 .sp 3 .ft 2 Jack J. Dongarra and Iain S. Duff .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 57 (Revision 1) .sp .7i \*(DY .pn 1 .in .ft 3 .ps 11 .LP .EQ delim @@ .EN .nr PO .5i .nr LL 7.0i .po .5i .ll 7.0i .B .ps 14 .rm $s .de $s \l'2i' .nr _B \\n(bmu-((\\n(ppu*\\n($ru)/2u) .. .sz 11 .nr pp 11 .nr fp 9 .vs 16p .nr $r 9 .he '''' .EQ define begin 'bold "begin"' define I 'bold "I"' define U 'bold "U"' define Ux 'bold "Ux"' define L 'bold "L"' define Ly 'bold "Ly"' define A 'bold "A"' define Ax 'bold "Ax"' define end 'bold "end"' define for 'bold "for"' define until 'bold "until"' define do 'bold "do"' .EN .ce 100 .bp .ps 13 .B Advanced Architecture Computers\|@{"" sup *}@ .ps 11 .sp .vs 12p .he ''%'' .EQ delim %% .EN .AU Jack J. Dongarra and Iain S. Duff (dongarra@anl-mcs.arpa and na.duff@su-score.arpa) .sp .4 .ps 10 .AI Mathematics and Computer Science Division Argonne National Laboratory Argonne, Illinois 60439-4844 Computer Science and Systems Division Building 8.9 Harwell Laboratory Oxfordshire OX11 ORA England .ps 11p .vs 16p .FS %size -1 {"" sup *}%\|Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. Department of Energy, under Contract W-31-109-Eng-38. During preparation of the original report, the second author was on leave from Harwell Laboratory. This version was typeset on \*(DY. 
.FE .ce 0 .ps 10 .in .25i .ll -.25i .sp 2 .QS .B Abstract: .R We describe the characteristics of several recent computers that employ vectorization or parallelism to achieve high performance in floating-point calculations. We consider both top-of-the-range supercomputers and computers based on readily available and inexpensive basic units. In each case we discuss the architectural base, novel features, performance, and cost. It is intended that this report will be continually updated, and to this end the authors welcome comments. .QE .in -.25i .ll +.25i .nr PS 11 .nr VS 16 .nr PD 0.5v .SH .ps 10 Keywords .PP .ps 10 vector processors, array processors, parallel architectures, supercomputers, high-performance computers .sp 0.7 .ps 11 .SH 1. Introduction .PP In the last few years several machines have been announced that use some form of parallelism to achieve a performance in excess of that attainable directly from the underlying technology used in the design of the constituent chips. To a large degree the availability of low-cost chips as building blocks has given rise to many of these new machines. We give a list of such chips in Appendix A. .PP After listening to a great number of both technical and sales presentations on these new computers, we quickly became overwhelmed and confused with the characteristics of each product and its relative strengths and weaknesses. In an effort to clarify our understanding, we have written this report summarizing the principal features of each machine. We hope that the publication of this report will provide similar assistance to other computational scientists and will clarify what architectures are currently being employed and the range of machines available. .PP In Section 2 we list the computers considered and discuss the criteria we have used to select these computers. We present a rough classification based on architectural features and use this in our list of machines. 
We also summarize principal features of the machines in two tables: one for the expensive supercomputers and the other for cheaper machines. More detailed information on the machines is provided as Appendix B of this report. .PP The guidelines used in preparing the detailed descriptions are given in Section 3. In some cases, our data are incomplete and nonuniform. This situation reflects the technical level of the presentations, the documentation available to us, the stage of development of the product being described, and the comments received from vendors on draft copies of the document. We would be grateful for comments and criticisms that might help to remedy these deficiencies. We intend to update this report from time to time to reflect both the changing marketplace and further information on currently listed machines. .SH 2. Summary and Classification of Machines Considered .PP In recent months there has been an unprecedented explosion in the number of computers in the marketplace. This explosion has been fueled partly by the availability of powerful and cheap building blocks and by the availability of venture capital. There have been two main directions to this explosion. One has been the personal computer market and the other the development and marketing of computers using advanced architectural concepts. In this report we restrict our study to the latter group, with particular interest in architectures that use some form of parallelism to increase performance over that of the basic chip. .PP We also restrict our attention to machines that are available commercially, and thus exclude research projects in universities and government laboratories and products still at the design stage. We would, however, be delighted to be alerted to ongoing activities. .PP Some machines not commonly thought of as multiprocessors can be used as such. For example, the IBM 3081, 3084, and 3090 are .Ie "Multiple-processor machines" multiple-processor machines. 
Most installations use this feature to increase the throughput, but it is possible to use them as multiple processors (with multiplicity up to 2, 3, and 4 for the three machines, respectively) using the IBM Program Product MTF, which runs under MVS. We do not, however, give further details of these machines in Appendix B. In addition, we include information only on attached processors whose performance is in the supercomputer range. .PP We have necessarily had to exclude information obtained under non-disclosure agreements. We will update this report as such information is released through product announcements. .PP A much-referenced and useful taxonomy of computer architectures was given by Flynn (1966). .Ie "Flynn" "categories of machines" .Ie "Categories of machines" .Ie "Machines" "categories of" .Ie "Categories" "machine" He divided machines into four categories: .sp .in +.5i (i) SISD - single instruction stream, single data stream (ii) SIMD - single instruction stream, multiple data stream (iii) MISD - multiple instruction stream, single data stream (iv) MIMD - multiple instruction stream, multiple data stream .in -.5i .sp .hw ex-am-in-ing Although these categories give a helpful coarse division, we find immediately on examining current machines that the situation is more complicated, with some architectures exhibiting aspects of more than one category. .PP Many of today's machines are really a hybrid design. .Ie "Machines" "hybrid design" .Ie "Hybrid design" For example, the CRAY X-MP has up to four processors (MIMD), but each processor uses pipelining (SIMD) for vectorization. Moreover, where there are multiple processors, the memory can be local, global, or a combination of these. There may or may not be caches and virtual memory systems, and the interconnections can be by crossbar switches, multiple bus-connected systems, time-shared bus systems, etc.
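As a concrete illustration of the SIMD/MIMD distinction, consider the following C sketch (illustrative only; the function names and the two-processor split are ours, not taken from any machine described here). The first loop is the kind of single-instruction-stream vector operation a pipelined processor streams through one functional unit; the second distributes independent strips of the same loop across processors in MIMD fashion.

```c
#include <assert.h>

#define N 8
#define NPROC 2   /* hypothetical two-processor machine */

/* SIMD-style: one vectorizable loop; a pipelined processor executes
   these independent iterations back-to-back in a single pipeline. */
void saxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* MIMD-style: the same work split into independent strips, one per
   processor; the parallel calls are simulated sequentially here. */
void saxpy_split(int n, double a, const double *x, double *y)
{
    int strip = n / NPROC;
    for (int p = 0; p < NPROC; p++)   /* each call could run on its own CPU */
        saxpy(strip, a, x + p * strip, y + p * strip);
}
```

A hybrid machine such as the X-MP would do both at once: each of its processors runs one strip, and within each strip the loop is pipelined.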
.PP With this caveat on the difficulty of classifying machines, we list below the machines considered in this report. We group those with similar architectural features. We have not included the machines from American Information Technology, Cydrome (Axiom), Data Technology Corporation, and Vitesse in this list since the documentation we have on these machines has insufficient technical details for us to classify them. .bp .2C .B scalar .R pipelined (e.g., 7600, 3090) parallel pipelined wide instruction words .I CHoPP FPS 164 FPS 264 Multiflow STAR ST-100 .B vector .R memory to memory .I CDC CYBER 205 .R register to register .I Convex C-1 CRAY-1 CRAY X-MP-1 Amdahl 500,1100,1200,1400 (Fujitsu VP-50,100,200,400) Galaxy YH-1 Hitachi S-810 NEC SX-1E, SX-1, SX-2 Scientific Computer Systems .R cache-based r-to-r .I Alliant FX/1 .B parallel .R global memory bus connect .I Alliant FX/8 (vector capability) Culler 7 Elxsi 6400 Encore Multimax FLEX/32 IP-1 Sequent Balance 21000 .sp 4 .R direct connect .I CRAY-2 (vector capability) CRAY-3 (vector capability) CRAY X-MP-2/4 (vector cap.) Denelcor HEP-1 IBM 3090/VF (vector capability) NAS AS/91X0 (vector capability) Sperry 1190/ISP (vector capability) .R Banyan network connect .I BBN Butterfly .R local memory hypercube .I Ametek System 14 Connection Machine FPS T-Series Intel iPSC NCUBE .R ring-bus .I CDC CYBERPLUS .R lattice .I Goodyear MPP Active Memory Systems (DAP) .R dataflow .I Loral DATAFLO .R user configurable .I Meiko .R .R multilevel memory .I ETA-10 (vector capability) Myrias 4000 .R systolic .I SAXPY .R high-performance graphic workstation .I Dana Group Silicon Graphics Inc Stellar .1C .PP A more empirical subdivision can be made on the basis of cost. We split the machines into two classes: those costing over $1 million and those under $1 million. The former group is usually classed as supercomputers, the latter as high-performance engines. With this subdivision, we can summarize the machines in the following tables.
Since we do not have sufficient technical information on the Galaxy YH-1, Vitesse machines, PS-2000, and MIPS, we have excluded them from these summary tables. .sp .Ie "Cost of machines" "over 1 million dollars" .Ie "Machines" "higher cost" .KS .ce 100 Table 1 Machines Costing over $1M (base system) .ce 0 .TS center; lp9|cp9 cp9 cp9 cp9 cp9 cp9 lp9|lp9 cp9 cp9 cp9 lp9 lp9 lp9|cp9 np9 np9 lp9 cp9 lp9. Machine Word Length Maximum Rate Memory OS Number of Proc. in MFLOPS in Mbytes _ Amdahl 1400 32/64 1142 256 Own 1 (Fujitsu VP-400) CHoPP 64 ? ? Own 16 CRAY-1 64 160 32 Own 1 CRAY X-MP 64 235/proc 128 Own/UNIX 1,2,4 CRAY-2 64 488/proc 2048 UNIX 4 CRAY-3 64 1000/proc 16000 UNIX 16 CYBER 205 32/64 800(f) 128 Own 1 CYBERPLUS 32/64 100/proc 4(a) Own 256 Denelcor HEP-1 32/64 10/PEM 16/PEM UNIX 16(b) ETA-10 32/64 1250/proc 2048(c) Own 1,2,4,6,8 FPS T-Series 32/64 16/proc 16384 Own 8 - 16384 Hitachi S-810/20 32/64 840 256 Own 1 IBM 3090/VF 32/64 108/proc 256 Own 1,2,4 Myrias 4000 32/64/128 ??? 512/Krate UNIX 1024/Krate NAS AS/91X0 32/64 ??? 64 Own 1 or 2 NEC SX-2 32/64 1300 320(d) Own 1 SAXPY 32 32/proc 512 Own 32 Sperry 1190/ISP 36/72 133/proc 64 Own 1,2,4 (e) .TE (a) Memory per processor. (b) 64 processes possible for each PEM; however, effective parallelism per PEM is 8-10. (c) Also 32 Mwords of local memory with each processor. (d) Also a 2-Gbyte extended memory. (e) Only 1 or 2 ISPs can be attached. (f) 800 MFLOPS for 32-bit arithmetic / 400 MFLOPS for 64-bit arithmetic. .KE .sp .PP The actual price of the systems in Table 1 is very dependent on the configuration, with most manufacturers offering systems in the $5 million to $20 million range. All use ECL logic with LSI (except the CRAY-1 in SSI, CRAY X-MP, and ETA-10 in CMOS ALSI (Advanced Large Scale Integration)), and all use pipelining and/or multiple functional units to achieve vectorization/parallelization within each processor.
For the multiple-processor systems, the form of synchronization varies: event handling on the CRAYs, asynchronous variables on the HEP, send/receive on the CYBERPLUS. The CRAY-3 and ETA-10 are not yet available. Both Amdahl and Hitachi systems are IBM System 370 compatible. .PP In Table 2 we summarize machines in the lower price category. The data presented in Table 2 differ from those in Table 1. Full details for all the machines are given in Appendix B. Because of the widely differing architectures of the machines in Table 2, it is not really advisable to give one or even two values for the memory. In some instances there is an identifiable global memory; in others there is a fixed amount of memory per processor. Additionally, it may be possible to configure memory either as local or global. A value for the maximum speed is even less meaningful than in Table 1, since a high megaflop rate is not necessarily the objective of the machines in Table 2, and the actual speed will be very dependent upon the algorithm and application. In the other aspects quoted in Table 1, all the machines in Table 2 are similar. All machines have both 32- and 64-bit arithmetic hardware, and most adhere closely to the IEEE standard. The exceptions are the FPSs and the SCS (64 bit only), the DAP, MPP, and Connection Machine (all bit-slice, supporting variable-precision floating point), the Star and SAXPY (32 bit), and the Sperry (36 and 72 bit). .sp .sp .Ie "Machines" "lower cost" .Ie "Cost of machines" "under 1 million dollars" .KS .ce 100 Table 2 Machines costing under $1M .ce 0 .TS center; lp9 | lp9 lp9 lp9 lp9 lp9 lp9.
Machine Chip Parallelism Connection _ Active Memory (DAP) ECL 1024 near-neighbor Alliant FX/8 WTL 1064/1065 8+vector cross bar (reg to cache) and plus 10 gate arrays bus (cache to memory) Ametek System 14 80286/80287 256 hypercube Analogic MC68000/VLSI Vector (scalar) BBN Butterfly 68020/68881 256 Banyan network TMI Connection VLSI 64000 hypercube Convex C-1 Gate array Vector (vector) Culler 7 Gate array 4 bus Cydrome (Axiom) LSI VLIW (scalar) Dana Group Gate array vector (vector) Elxsi 6400 ECL 12 bus Encore Multimax 32032/32081 20 bus Flex/32 32032/32081 20 bus FPS-164 LSI VLIW (scalar) FPS-264 ECL VLIW (scalar) FPS-164/MAX VLSI 16 bus FPS-5000 VLSI 4 bus FPS MP32 VLSI 3 bus Intel iPSC 80286/80287 128 hypercube IP-1 ???? 8 cross-bar Loral DATAFLO 32016/32081 256 bus Goodyear MPP VLSI 16384 near-neighbor Meiko Transputer 157 user-configurable Multiflow Gate array VLIW (scalar) NCUBE Custom VLSI 1024 hypercube Numerix VLSI Vector (scalar) SCS-40 ECL/LSI Vector (vector) Sequent Balance 21000 32032/32081 30 bus Silicon Graphics Gate array vector (vector) Star ST-100 VLSI VLIW (scalar) Stellar Gate array vector (vector) .TE VLIW - Very Long Instruction Word .KE .sp .SH 3. Template for Machine Description .PP As we mentioned in the introduction, the level of technical information on each machine varied significantly. We have, however, attempted to organize the available information in a consistent manner. In Table 3, we give the template used in presenting the data in the appendices. .sp .KS .ce 100 Table 3 Template for Description of Machines .ce 0 Name of machine, manufacturer, backers, etc. Contact: technical and sales Architecture Basic chip used Local, global-shared memory, or both Connectivity (for example, grid, hypercube) Range of memory sizes available; virtual memory Floating point unit (IEEE standard?) Configuration Stand-alone or range of front-ends Peripherals Software UNIX or other?
Languages available Fortran characteristics F77 Extensions Debugging facilities Vectorizing/parallelizing capabilities Applications Run on prototype Software available Performance Peak Benchmarks on codes and kernels Status Date of delivery of first machine, beta sites, etc. Expected cost (cost range) Proposed market (numbers and class of users) .KE .sp .SH Reference .IP Flynn, M. J. (1966) Very high-speed computing systems. Proc. IEEE, vol. 54, pp. 1901-1909. .bp .sp 3i .ce 100 APPENDIX A .br .sp 2 LIST OF BASIC CHIPS USED .Ie "Chips used" .ce 0 .bp .sp 0.25i .nf .B General-Purpose Floating-Point Processors .R Intel 8087/80287 National 32081 Motorola 68881 Zilog 8070 AMD 9511A/9512 Fairchild F9450 .B Building-Block Floating-Point Processors .R Weitek WTL1032/1033 TRW TDC 1022/1042 Weitek WTL 1064/1065 AMD 29325 Analog Devices ADSP2310/2320 .B General-Purpose Building-Block Floating-Point Processors .R Weitek WTL 1164/1165 (Fandrianto and Woo 1985) .B Memory, control, and communication chips .R INMOS T414 transputer INMOS T800 transputer (integral floating point) .fi .B Reference .R .IP Fandrianto, J. and Woo, B.Y. (1985), VLSI floating-point processors. IEEE Proceedings of the 7th Symposium on Computer Arithmetic, pp. 93-100. .bp .sp 3i .ce 100 APPENDIX B .br .sp 2 DETAILS OF MACHINES CONSIDERED .LP .nf .bp .nf .B ALLIANT FX/1 and ALLIANT FX/8 .R .Ie "Alliant" "FX/1" .Ie "Alliant" "FX/8" Alliant Computer Systems Corp. 42 Nagog Park Acton, MA 01720 617-263-9110 In Europe: Peter Smith Sales Manager DPS9000 Products Apollo Computer (UK) Ltd Oriel House 26 The Quadrant Richmond Surrey TW9 1DL UK 01-948-6055 Telex 8953944 Fax 01-948-5845 Contact: Technical: Craig J. Mundie, vice president of software Contact: Sales: David L. Micciche, vice president marketing, sales and customer services Backers: Venrock Hambrecht and Quist Kleiner, Perkins, Caulfield and Byers Formerly, the company was called Dataflow.
.B Vector Register Parallel Shared Memory Architecture .R .fi Computational elements (CEs) execute applications code using vector instructions. An FX/1 has one CE. An FX/8 has 1-8 CEs. The CEs transparently execute the code of an application in parallel. CEs may be added in the field, increasing performance without recompilation or relinking. Each CE has 8 vector registers, each with 32 64-bit elements, and 8 64-bit scalar floating point, 8 32-bit integer, and 8 32-bit address registers. Interactive Processors (IPs) execute operating system, interactive code, and I/O operations. An FX/1 has 1-2 IPs. An FX/8 has 1-12 IPs. Basic chip used: Weitek 1064/1065 plus ten different gate array types with 2600 to 8000 gates. In addition, the Motorola 68012 is used in the IP. The cycle time is 170 ns. CEs are cross-bar connected on the backplane to a 64 Kbyte/128 Kbyte write-back computational processor (CP) cache (FX/8). Bandwidth is 376 Mbyte/sec. Each 32-Kbyte IP cache is connected to 1-3 IPs (FX/8) or 1-2 IPs and a CE (FX/1). The FX/8 has 1-4 IP caches; the FX/1 has one IP cache. The CP and IP caches are attached by two 72-bit busses to the main memory. Memory bus bandwidth is 188 Mbyte/sec. Connectivity: crossbar (CE to cache), bus (cache to memory, cache to cache) Range of memory sizes available: 8-16 Mbytes (FX/1), 8-64 Mbytes (FX/8), all with ECC. Virtual memory: 2 Gbytes per process Floating point unit: IEEE 32- and 64-bit formats including hardware divide and square root and microcoded elementary functions. Configuration: Standalone. TCP/IP network support. Size (inches): FX/1 system - 28h x 13w x 25d (the FX/1 I/O expansion cabinet is the same size); FX/8 system - 43h x 29w x 34d (the FX/8 I/O expansion cabinet is 22w and same height and depth). Cooling: Both the FX/8 and FX/1 are air-cooled. The FX/8 system consumes 4950 watts (max. configuration), the FX/1 system 1155 watts (max. configuration). 
Peripherals: 800/1600/6250 BPI start-stop tape drive 67, 134, and 379 Mbyte (formatted) Winchester disk drives 45 Mbyte cartridge tape drive Floppy disk drive 8/16 line multichannel communications controllers 600 lpm printer Ethernet controller Software: Concentrix, Alliant's enhancement of Berkeley 4.2 UNIX with multiprocessor support. Compiler runs on production hardware and software. Languages: Fortran, C, Pascal Fortran characteristics: F77 - Conforms to 1978 ANSI standard. Extensions - Most of VAX/VMS extensions and Fortran 8x array extensions. Debugging facilities - Yes. Vectorizing/parallelizing capabilities - Automatic detection of vectors and parallelism. Feedback to user via diagnostic messages. User control of transformations via directives in the form of Fortran comments. Does interprocedural dependency analysis for automatic determination of parallel subroutine calls in loops. Performance: Scalar 32 bit - 4.45 MIPs / CE. (4450 Kwhetstones) Scalar 64 bit - 3.63 MIPs / CE. (3630 Kwhetstones) Vector 32 bit: 11.8 MFLOPS / CE. (1 chime multiply-add triad at 170ns/chime) Vector 64 bit: 5.9 MFLOPS / CE. (2 chime multiply-add triad at 170ns/chime) (64-bit multiply is 2 chimes; 64-bit add, subtract, and move are 1 chime). Applications: Engineering and scientific end-user and OEM applications, stand-alone or as a computational server to a network of engineering workstations. Status: First beta delivery May 1985; first production shipment September 1985. Expected cost: FX/1 - $132,000 to $200,000; FX/8 - $270,000 to $750,000 .bp .nf .B Amdahl Vector Processors (Fujitsu VP) .R .Ie "Amdahl" "vector processors" .Ie "Fujitsu VP" John Roberts Amdahl Corp. 1250 East Arques Ave. P.O. Box 3470 Sunnyvale, CA 94088 408-746-6880 In Europe: AMDAHL UK Dr.
Horst-Peter Rother Product Manager Amdahl Vector Processor International Management Services Limited Dogmersfield Park Hartley Wintney Hampshire RG27 8TE ENGLAND (0252)-24555 Telex 858486 .B Vector Register Architecture .R .fi The Amdahl 500, 1100, 1200, and 1400 Vector Processors are marketed by Amdahl Corp. in the U.S., Canada, and Europe. These products are manufactured by Fujitsu, and similar models are marketed in Japan as the VP-50, VP-100, VP-200, and VP-400. The VP-100 and VP-200 are also marketed by Siemens in mainland Europe. These are all register-to-register machines. All models have one scalar and one vector unit which can execute computations independently. The scalar unit fetches all instructions and passes each instruction to the appropriate unit for execution. The scalar processor is based on the Fujitsu M380/382 series mainframes and runs the IBM S/370 extended architecture instruction set plus 10 unique instructions. The vector performance varies according to model as follows: .sp .TS center; c c n n. Model Peak MFLOPS _ 500 133 1100 267 1200 533 1400 1142 .TE The scalar processor cycle time is 14 ns (VP 1400 only) or 15 ns (compared to the X-MP's 9.5 ns), but a sampling of scalar instructions indicates that the VP operations may be slightly faster than the X-MP's. There is, moreover, a difference in the pipelining between the X-MP and VP. Each VP scalar instruction is pipelined in three stages: fetch, decode, and execute. However, unlike the X-MP, the execution stage in the VP is not segmented. Thus, there is less potential purely scalar overlap in the VP than in the X-MP. (Note that all scalar work can overlap vector operations.) The vector unit consists of 5 or 6 pipelines, a vector register memory, and a mask memory. The 5 or 6 pipelines comprise 1 or 2 load/store pipelines, plus 1 mask pipeline, 1 add/logical pipeline, 1 multiply pipeline, and 1 divide pipeline.
The number of concurrent pipelines, vector register size, and mask register size differ for each model, as shown below. Main memory capacity ranges from 32 Mbytes to 256 Mbytes (4 to 32 M 64-bit words). .KS .TS center; c c s s s c c c c c l n n n n. Model Configuration 500 1100 1200 1400 _ # pipes total 5 6 6 5 # concurrent load/store pipes 1 2 2 1 # 64 bit words/vect cyc/pipe 1 1 2 4 Scalar cycle time (ns) 15 15 15 14 Vector cycle time (ns) 7.5 7.5 7.5 7 # concurrent arith pipes 1 2 2 2 # 64-bit results/vect cyc/pipe 1 1 2 4 Vect. reg. size (Kbytes) 32 32 64 128 Mask reg. size (Bytes) 512 512 1024 2048 Max. main memory (Mbytes) 128 128 256 256 Min. main memory (Mbytes) 32 32 64 64 Max. interleaving (ways) 128 128 256 256 .TE .KE The total vector register capacity is 32-128 Kbytes. The registers can be reconfigured dynamically to 6 different combinations with varying vector register lengths, as shown below: .nf .bp .ce 1 Configuration of Vector Registers .KS .TS center; c c s s s c c s s s c c s s s c n n n n n n n n n. Register Length by Model (# of 64-bit word elements) # registers 500 1100 1200 1400 _ 8 512 512 1024 2048 16 256 256 512 1024 32 128 128 256 256 64 64 64 128 128 128 32 32 64 64 256 16 16 32 32 .TE .KE Technology: 400 and 1300 gate ECL, 350-picosecond delay main memory - 64 Kbit, 55 ns, MOS static RAM 380-470 square feet 36-62 KVA power consumption air cooled Software: Automatic vectorizing Fortran compiler Scalar Fortran compiler Interactive debugger Performance measurement tools Interactive vectorizer Scientific subroutine library (223 routines) .bp .B AMETEK System 14 .R .Ie "AMETEK System 14" .nf Ametek Computer Research 610 North Santa Anita Avenue Arcadia, California 91006 Technical Contact: Dr. Jeff Fier Sales: John C. Wyckoff, IL 818-445-6811 .B Hypercube Architecture .R .fi This is the first generation of AMETEK Concurrent Processing Systems. 
Each node is based on an 80286/80287 Applications Processor/Floating Point Co-processor with a separate 80186 Communication Processor. Each node has 8 bidirectional communications channels at 3 Mbits/sec; the system is connected to the host machine through a 1 Mbyte/sec parallel interface. Effective node-to-node throughput is 100 Kbyte/channel. Software overheads per message are about 300 microseconds. Local memory - 1 Mbyte per node. Connectivity - 16 to 256 nodes are connected in a hypercube to form a System 14. Floating Point Unit - IEEE Standard Floating Point Arithmetic Configuration: Front-end machines (host) are DEC VAXs (MicroVAX II through VAX 8600). Support is available for the host running either UNIX 4.2bsd or VMS. A copy of the AMETEK Operating System, XOS, runs in each node. XOS supports automatic message buffering, message forwarding, process creation, and machine partitioning for multiple users. Language: C Software: Consisting of a simulator, single- and multi-process debuggers, and user interfaces, the AMETEK Development Environment (ADE) is designed to provide a complete set of software development tools for parallel program development. .sp ADE allows the programmer to develop, compile, and link programs that run on the simulator and/or the hypercube. Only one copy of the source code exists for debugging on the simulator and running on the hypercube. The ADE allows the user to switch between the simulator mode and the hardware mode with a single command - automatically locating the correct libraries, using the correct compilers, and generating the executables for either mode. .sp The simulator enables the programmer to simulate and debug parallel processes on a sequential computer. While the single-process debugger allows the debugging of one task at a time, the multi-process debugger enables the debugging of many concurrent processes. The programmer has the ability to shift on command between processes at any time.
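The hypercube connectivity used by the System 14 (and by the other hypercube machines listed in Section 2) has a simple addressing rule: in a d-dimensional cube of 2^d nodes, node i is wired to the d nodes whose binary addresses differ from i in exactly one bit, and the routing distance between two nodes is the number of bit positions in which their addresses differ. A minimal C sketch (the function names are our own, not part of any vendor's software):

```c
#include <assert.h>

/* Neighbor of a node across dimension k (0 <= k < d): flip bit k
   of the node's address. */
int neighbor(int node, int k)
{
    return node ^ (1 << k);
}

/* Routing distance between two nodes: the Hamming distance of
   their addresses, i.e., the popcount of the XOR. */
int hops(int src, int dst)
{
    int x = src ^ dst, h = 0;
    while (x) {
        h += x & 1;
        x >>= 1;
    }
    return h;
}
```

For the maximum 256-node System 14, d = 8, so every node has 8 neighbors and no message needs more than 8 hops, which matches the 8 bidirectional channels per node described above.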
.sp The user interface will automatically assign the type of topology requested by the programmer. The choices consist of the nodes being defined as ring, 2-D nearest neighbor, and 3-D nearest neighbor. This enables the programmer to spend time where it is most important - writing and debugging the program. .sp ADE training classes have shown that the experienced sequential programmer will be running successful parallel programs in two to three days. .sp .ul .nf STATUS: .sp Production shipments since first quarter 1986. .bp .nf .B ANALOGIC AP500 .R .Ie "Analogic AP500" Analogic Corporation Audubon Road Wakefield, MA 01880 (617) 246-0300 In Europe: Analogic Limited 68 High Street Weybridge Surrey KT13 8BN ENGLAND (0932) 56011 .B Pipelined Array Processor .R Control processor uses Motorola MC68000 Cycle time 160 nsec. .fi Pipelined adder can deliver a result each clock cycle, whereas pipelined multiplier produces a result every other cycle for a maximum rate of 9.375 Mflops. .sp 32-bit words but arithmetic performed in 40-bit pipeline. Program memory of 256K bytes and data memory of 912K words. I/O: DMA/PIO Host Interface. RS-232 serial port with user-settable transmission rate to 19.2K baud. Two 6.25 MHz auxiliary I/O ports (optional) IEEE-796 standard multibus (optional) Software includes : Linker Assembler Debugger Diagnostics Program optimization Function libraries Applications can be written in : Host high-level language Host assembly language AP assembly language Arithmetic: 32-bit DEC floating-point arithmetic. multiple-precision capabilities. 1024-point complex FFT in 4.7 msec. 100 x 100 matrix inversion in 649 msec. size: 5.25"h x 19"w x 21"d (rack-mountable) weight: 55lbs. power: 200 Watts for basic system .nf .bp .nf .B BBN Butterfly Parallel Processor .R .Ie "BBN Butterfly" Bolt, Beranek and Newman; Advanced Computer Inc. Gary Schmidt BBN Advanced Computers Inc. 
Cambridge, MA 02238 617-497-3931 .B Parallel Butterfly Network Architecture .R .fi The Butterfly Parallel processor is a tightly coupled, shared memory multiprocessor housing up to 256 processor boards, each with an MC68000 microprocessor or, optionally, an MC68020 microprocessor and MC68881 floating point coprocessor. Every processor board includes either 1 or 4 megabytes of globally shared memory. Any processor can access any memory location through the Butterfly switch, a fast, modular, multi-stage interconnect. Processors also have direct access to their own 1- or 4-megabyte share of the global memory pool. .sp .nf Other features: .sp Tightly coupled, shared memory, symmetrical multiprocessing. Multiple instruction, multiple data (MIMD) architecture. Up to 256 Mips of processing power in 1-Mip increments. All processors have equal access to as much as 1024 megabytes (i.e., one gigabyte) of main memory. Memory bandwidth up to 1024 megabytes/sec (one gigabyte/sec). Memory access time less than 1 microsecond typical, 4 microseconds worst case (without contention). Distributed I/O system supports RS-232, RS-449, Ethernet, and Multibus. Field expandable in single processor increments. .sp .fi Each processor node is a separate circuit board with its own MC68000 (or MC68020 with MC68881 floating point coprocessor), an AMD2901 bit slice processor that extends the MC68000 instruction set, an onboard switching power supply, and either 1 or 4 megabytes of memory. Processors access their onboard "home" memory directly in less than 1 microsecond; they can access the home memory of any other processor through the Butterfly switch in about 4 microseconds. Providing true parallel access to memory, the Butterfly performs up to 256 simultaneous reads or writes and automatically resolves contention for memory. .sp Software includes the Chrysalis Operating System (somewhat like UNIX) with full C and Fortran support. A Lisp system is being developed.
Extensions to all languages simplify parallel programming. Any of several "front end" processors, such as Sun Microsystems or VAX family computers, provide the familiar Berkeley UNIX development environment where parallel programs for the Butterfly can be written, maintained, and partially debugged. .sp Cost varies from $40,000 to $2,500,000 depending on size. .nf .bp .B CHoPP .R .Ie "CHoPP" .sp Sullivan Computer Corporation 1012 Prospect Street Suite 300 La Jolla, California 92037 (619) 454-3116 .sp Lee Higbee, VP Research .sp .B VLIW (Very Long Instruction Word) Architecture .R .sp .fi The computer is under development by Sullivan Computer Corporation. The single processor is claimed to be several times faster than current supercomputers and will not require special coding techniques such as those required for vector processors, hypercubes, or other highly parallel systems. The machine under current development is the Demonstration Unit (DU), a single processor version of the CHoPP 1. The CHoPP 1 will include up to 16 parallel processors. The list below highlights some features of the DU. .sp A superinstruction that includes up to 9 instructions is executed each clock cycle, providing one of the highest instruction issue rates available today. .br Four address arithmetic and logic units (ALUs) and four computational functional units, each an ALU and floating point unit, support the 8 concurrent computations in each superinstruction. .br A zero delay branch is the ninth executable instruction. The central processing unit has multiple register sets to support many tasks in concurrent execution (multiprogramming). .br The memory bandwidth is approximately 200 MWDS/sec or 1600 MB/sec and the I/O bandwidth is approximately 16 MWDS/sec or 130 MB/sec. .br On the Livermore Loops, the DU is expected to perform at over twice the rate of the CRAY X-MP. Delivered performance to price ratio is expected to be over 4 times that of the CRAY X-MP/12.
.sp The machine is small and air cooled; it is compatible with most computer environments; it does NOT require special cooling systems. Much of the lowest level of the Operating System is in hardware, providing much lower O/S overhead. .br Optimizing compilers are easy to construct because there is no need for special techniques such as automatic vectorization or parallelization. This implies that it will be easy for Sullivan to support many languages with very high quality code from their compilers. Porting will be easy. .br The Fortran compiler will accept the common extensions, both those that extend Fortran's functionality and those that allow for improved optimization of the compiler's output. .sp Plans .sp The CHoPP 1, which is essentially a multiprocessor version of the DU described above, allows from four to 16 processors (and will be about four to 16 times as fast) because of their (patented and proprietary) conflict-free, crashless memory and memory interconnect design. .br The CHoPP 2, which is the CHoPP 1 with very high speed circuitry (ECL), is expected to allow from four to 32 CPUs, each running at about five times the clock speed of the CHoPP 1. The CHoPP 2 is projected to provide ten times the performance of the CHoPP 1. .bp .nf .B Connection Machine .R .Ie "Connection Machine" .Ie "Thinking Machine" "Connection" .Ie "TMI Connection" Thinking Machines Inc. 245 First St. Cambridge, Mass. 02142-1214 617 876-1111 James Bailey - Director of Marketing .B Parallel Hypercube Architecture .R .fi The Connection Machine is a very fine grain parallel computer with an architecture suitable for artificial intelligence applications. The 64000-processor prototype will have 1000 times the logical inference performance of current LISP workstations. The processing elements are one-bit machines having 4096 bits of memory connected so that each processor can communicate with any other through a fast message-routing system that forms a hypercube network.
All linkages are software controlled with system-wide message flow being handled by a 3 Gigabit per second message routing system. All memory is dual ported and is hence directly accessible by both the Connection Machine and the front end. Configuration: The Connection Machine system has 65536 physical processors but may be configured for a much larger number of logical processors by means of the global-reset and configure commands. Access is through a front-end processor, currently either a VAX or a Symbolics 3600. The front-end provides the operating system environment, including terminal interaction and file management. The clock rate may range up to 10 MHz, giving an expected performance of 2 billion 32-bit integer additions per second in the 64K (65536) node configuration. Average instruction mixes are expected to exceed 1000 Mips. I/O can be through the front end or direct to a 1.2 Gigabyte disk at the rate of 500 Megabits per second. Languages: Applications programs reside in the host and can be written in CM-C (a Connection Machine extension of C), CM-Lisp, or an assembly language REL-2. Applications: One of the principal applications is expected to be image processing. Other applications include VLSI simulation and FFTs. The prototype currently available uses a conservative VLSI technology of 10000-gate CMOS gate arrays. .nf .bp .nf .B Convex C-1 (XL and XP). .R Convex Computer Corporation .Ie "Convex C-1" 701 N. Plano Rd. Richardson, Texas 75081 Phone: 214-952-0200 Technical: Steve Wallach Sales: Bob Shaw In Europe: CONVEX Computer Limited Hays Wharf Millmead Guildford GU2 5BE England 0483-69000 Telex 858136 Fax 0483-36775 .B Vector Register Architecture .R .fi The machine is based on CMOS VLSI gate arrays with 8000 gates/chip (24 different chips in the machine). The C1-XP also uses two 20000-gates/chip CMOS VLSI gate arrays. It uses vector architecture, register to register, with pipelined functional units (each of which operates asynchronously - 3 present).
The machine is based on a 100-ns major cycle time, 50-ns minor cycle time, with virtual memory (page size 4096 bytes) and a 1024-byte logical cache between memory and registers. Also a 64-Kbyte, 50-ns access physical cache. Vector operations bypass the cache (cache bypass). Scalar operands are encached. .nf Physical memory - up to 1024 MB (1 billion bytes) dynamic RAM (32-way interleaved). Virtual address space - 4 Gbytes User address space - 2 Gbytes. Memory - on a 32-Mbyte board (256-Kbit DRAM) or 128-Mbyte board (1-Mbit DRAM), 2 banks per board, each 4-way interleaved. Transfer rates between memory and CPU - rated at 80 Mbytes/sec. Single memory pipe between memory and registers. Note: 64-bit vector references that are aligned on 32-bit boundaries will bypass the cache. Vector registers - 8, each with 128 elements (64-bit elements). VL and VS registers .br 0.512 Mbyte IOP buffer. IOP 68K based with event-driven monitor .br I/O transfer rates of 80 Mbyte/sec Floating point IEEE Standard format. 5 independent I/O processors each rated at 80 Mbyte/sec. Concurrent operation of scalar and vector units (fixed and float). Mask/merge and compress operations supported. Reduction operators max, min, sum, prod, any, all, and parity supported. Degradation for indirect addressing not specified. .nf .sp A(i) = B(C(i)) ...
LD VL
LD C,V0
SHF 4,V0,V1
LD B,V0,V1
STORE A
.sp Byte-addressable with integer*1, *2, *4, and *8 arithmetic supported. Also real*4 and real*8, logical*1, *2, *4, and *8, and complex*8 and complex*16. Configuration: Designed as a stand-alone multiuser machine. Software: UNIX 4.2 bsd operating system. .fi Languages: Fortran 77 and C, with an excellent vectorizing Fortran compiler. Fortran compiler accepts VAX VMS Fortran. C compiler (VC) automatically vectorizes scalar code. Performance: Peak performance 20 MFLOPS in double precision (64-bit arithmetic), 40 MFLOPS in single precision (32-bit arithmetic).
LINPACK timings - expect around 3-4 MFLOPS. Note: Convex rates their machine as 1/6 of a CRAY 1-S, 600 ns per subroutine call, 9 cycles latency (cf. 11 on CRAY, 30 on FACOM VP) Basic system: two 19-in. racks and 16-Mbytes memory, 1 I/O processor, service processor, 414 Mbyte Winchester, 6250 bpi tape drive. Size: 25 x 62 x 40 inches for each cabinet. Base system requires two cabinets, each about 500 lb. Forced air cooling. .br Power consumption 3200-4500 watts Cost: XL base system $350,000, XP base system $500,000 .fi .TS center; c s s s l l l l l l l l c s s s l l l l l l l l. Model 10 16 Mbytes	414-Mbyte disk	one IOP [16 lines]	$495,000 32 Mbytes	828-Mbyte disk	one IOP "	$545,000 Model 20 64 Mbytes	828-Mbyte disk	two IOP [32 lines]	$745,000 128 Mbytes	3312-Mbyte disk	two IOP "	$1,400,000 .TE 3312 Mbytes = 8 Fuji eagles. Can have 3 asynchronous 16-line ports. .TS center; l l. F77 compiler	$24.5K VC compiler	$24.5K (PCC comes with OS) (has GPROF, PROF, and BPROF run-time profilers) Networking package	$15K .TE .fi .nf .bp .nf .B CRAY-1 .R .Ie "CRAY-1" Cray Research Inc. 1440 Northland Drive Mendota Heights, MN 55120 612-452-6650 In Europe: CRAY UK Malcolm Hammerton Cray Research (UK) Ltd Cray House London Road Bracknell Berkshire RG12 2SY ENGLAND (0344) 485971 Telex 848841 .B Vector Register Architecture .R .fi This machine is no longer being produced, although when first introduced in 1976 (Los Alamos), it was undisputedly the fastest processor in the world and is still used as a benchmark for high-speed computing. Since many CRAY customers are currently upgrading their systems to an X-MP, there are opportunities to buy second-hand CRAY-1s at knockdown prices. Features: A uni-processor. Vector processor, uses pipelining and chaining to gain speed. 12.5-nsec clock. Fast scalar. Uses only four chip types with 2 gates per chip. 64-bit word size up to 4 M words of storage.
The CRAY 1-S has bipolar memory (in units of 4K RAM), and the newer (1982) CRAY 1-M has MOS memory (in units of 16K RAM). Logic chips - ECL with a gate delay of .7 nsec. Main memory banked up to 16 ways. The bank busy time is 50 nsec (70 nsec on 1-M) and the memory access time (latency) is 12 clocks (150 nsec). No virtual memory. Register-to-register machine. 8 registers of length 64 (64-bit) words each. Word addressable (64-bits). No half precision. Double precision is through software and is extremely slow (factors of about 50 times single precision are common). There is only one pipe from memory to vector registers, resulting in a major bottleneck with loads and stores to memory from registers. Loads can be chained with arithmetic operations; stores cannot. Performance: Low vector startup times and fast scalar performance make this a very general-purpose machine. Max. performance 160 MFLOPS; 64-bit arithmetic; max. attainable sustained performance 150 MFLOPS. There are codes for matrix multiplication and the solution of equations that get close to this. Maximum scalar rate is 80 MIPS. It is easy to attain over 100 MFLOPS for certain problems, even using Fortran. Software: An extensive range of software exists for this machine. Since the instruction set is compatible with the X-MP range, this software will also run on that range. .bp .nf .B CRAY-2 .R .Ie "CRAY-2" Cray Research Inc. 1440 Northland Drive Mendota Heights, MN 55120 Phone: 612-452-6650 1100 Lowater Rd. Cray Research Inc. Chippewa Falls, Wisconsin 54701 Phone: 715-726-1211 In Europe: CRAY UK Malcolm Hammerton Cray Research (UK) Ltd Cray House London Road Bracknell Berkshire RG12 2SY ENGLAND (0344) 485971 Telex 848841 .B Vector Register Parallel Shared Memory Architecture .R .fi This is a 4-processor (quadrant) vector machine with pipelining and overlapping but no chaining. .br There are more segments in the pipes than in the other CRAYs. .br Multitasking is compatible with the X-MP.
The system has a 4.1-nsec clock cycle time. Memory is 256 M words of 256 K DRAM in 128 banks. The bank busy time is 57 clocks, and the scalar memory access time is 59 clocks. .br Local memory is 16 Kwords, 4 clocks from local memory to vector registers. .br Vector references from local memory must be with unit stride. There are 8 vector registers each with 64 elements. Overheads for vector operations are large: 63 cycles for vector load 22 cycles for vector multiply 22 cycles for vector add 63 cycles for vector store The machine is liquid cooled using inert fluorocarbon. Software: UNIX-based OS (called UNICOS) C compiler CFT2 (Fortran compiler) CFT77 Performance: Max. quoted at 500 MFLOPS per processor. Cost: $15M - $20M Delivered: NMFECC, NASA Ames, University of Minnesota, Stuttgart, Ecole Polytechnique (Paris). Orders placed by AERE Harwell. .bp .nf .B CRAY-3 .R .Ie "CRAY-3" Cray Research Inc. 1440 Northland Drive Mendota Heights, MN 55120 612-452-6650 1100 Lowater Rd. Cray Research Inc. Chippewa Falls, Wisconsin 54701 715-726-1211 In Europe: CRAY UK Malcolm Hammerton Cray Research (UK) Ltd Cray House London Road Bracknell Berkshire RG12 2SY ENGLAND (0344) 485971 Telex 848841 .B Vector Parallel Architecture .R .fi The machine is essentially a GaAs version of the CRAY-2 being developed by a team under Seymour Cray at Chippewa Falls. Architecture: 16 processors 2-nsec cycle time 4 logical functions/clock period Memory twice as fast as CRAY-2. Speed about 8 times CRAY-2. CRAY-2 imbalance removed by increasing scalar speed to four times that of a CRAY-2 on each processor so, 12x scalar. Aim is 100 times a CRAY-1. Boards reduced from the 4 x 8 x 1 of the CRAY-2 to 1 x 1 x .1. Only 1 cu ft in size, with power dissipation of 180 kW as in CRAY-2. Power supplies take 10 cu ft and liquid coolant 100 cu ft. Status: 1988 production version; 1990 sales .bp .nf .B CRAY X-MP .R .Ie "CRAY X-MP" Cray Research Inc. 
1440 Northland Drive Mendota Heights, MN 55120 612-452-6650 Steve Chen Chris Hsiung 1100 Lowater Rd. Cray Research Inc. Chippewa Falls, Wisconsin 54701 715-726-1211 In Europe: CRAY UK Malcolm Hammerton Cray Research (UK) Ltd Cray House London Road Bracknell Berkshire RG12 2SY ENGLAND (0344) 485971 Telex 848841 .B Vector Register Parallel Shared Memory Architecture .R .fi This is a multiprocessor pipelined vector machine. It has the same architecture as the CRAY-1. The major difference is that there are now three paths from memory to the vector registers, and the clock cycle time is now 8.5 ns on all machines shipped after August 1986 (machines built before August have a cycle time of 9.5 ns). The current machines come with 1, 2, or 4 processors. Gather/scatter hardware is available on the 2- or 4-processor version of the machine. The gather/scatter can be chained to load/store operations. Users can control all processors through calls in Fortran. The processors share memory. Other features: Memory up to 16 M (64-bit) words. X-MP-2 - MOS (bank busy time is 68 ns and the memory access time is 17 clocks). X-MP-4 - ECL (bank busy time on the ECL machine is 34 ns and the memory access time is 14 clocks). ECL logic with .35-.5 ns gate delay and 16 gates/chip. Main memory - ECL 4K RAMs with 25-ns access time. (Interleaving to 64 banks is possible.) .fi High-speed connection at 1024 Mbytes/sec per channel (max. 2) to a CRAY SSD. The SSD comes in various sizes up to 512 M words of secondary MOS memory. Data transfer to the high-speed (1200-Mbyte) DD-49 disk runs at 10 Mbytes/sec. Configuration: There are many possible front ends including IBM, CDC, VAX, and Apollo. Performance: Max. per processor is 235 MFLOPS. Status: Announced in August 1982, first system delivered in June 1983.
.bp .nf .B Culler 7 .R .Ie "Culler" "7" .Ie "Culler" "PSC" Culler Scientific Systems Corporation 100 Burns Place Santa Barbara, CA 93117 805-683-5631 Ward Davidson Vice President, Sales and Support .B Parallel Array Processor .R .fi Up to four processors. Each processor is a proprietary 64-bit high-performance computational processor. Global data memory of 96 Mbytes of real memory with 120-nsec access time. Local memory consists of program memory up to 256 KB and array memory of 4 x 16 KB with 40 nsec access time. Each processor rated at 18 MIPS and around 11 MFLOPS. Software is an enhanced version of 4.2 BSD UNIX, with Fortran and C. The Fortran and C compilers generate instructions in parallel streams which employ all the computational function units to achieve execution concurrency within a processor. Cost: $275K - $750K .B Culler PSC .R .sp Connects to a front-end workstation such as a Sun. .br Designed as a network compute server; architecture and performance are similar to a single-processor Culler 7 unit. .sp Cost: $98.5K (order quantity one, discounts for OEMs). .bp .nf .B CDC CYBER 205 .R .Ie "CYBER 205" ETA Systems, Incorporated 1450 Energy Park Drive St. Paul, MN 55108 612/642-3400 Charles D. Swanson - Account Support In Europe: CDC and ETA UK D.
Swanston Control Data Limited Genesis Centre Garrett Field Birchwood Science Park Birchwood Warrington Cheshire WA3 7BH ENGLAND (0925) 824757 Telex 629900 .B Vector Architecture .R Architecture: ECL/LSI logic (168 gates/chip) .fi Sequential and parallel processing on single bits, 8-bit bytes and 32- or 64-bit floating-point operands .nf 20-nanosecond cycle time Scalar Unit Segmented functional units 64-word instruction stack 256-word high-speed register file Vector Unit 1, 2, or 4 segmented vector pipelines memory-to-memory data streaming maximum vector length of 65,536 words gather/scatter instructions up to 800 million 32-bit floating-point operations/second Memory MOS semiconductor memory Memory size: 1, 2, 4, 8 or 16 million 64-bit words Virtual memory accessing mechanism with multiple, concurrently usable page sizes SECDED on each 32-bit half word 48-bit address (address space of 4 trillion words per user) 80 nanosecond memory bank cycle time Memory bandwidth: 25.6 or 51.2 Gigabits/second I/O Eight I/O ports, 32-bits in width, expandable to 16 200 M bits/second for each port Maximum I/O port bandwidth of 3200 M bits/sec Miscellaneous Cooling: freon Dimensions: floor area (four pipe model) 23 ft x 19 ft "footprint" (with I/O system) 105 sq ft Software: Virtual operating system Batch and interactive access FORTRAN compiler ANSI 77 with vector extensions 32-bit half-precision data type Special calls to machine instructions Automatic vectorization Scalar optimization utilizing large register file Utilities Interactive symbolic debugger Source code maintenance Object code maintenance Performance: .fi Linked triad performance on long vectors approaches asymptotic speed of machine. Performance can be severely degraded at short vector lengths (that is, the typical %n sub 1/2% is around 100) and if the vector is not held contiguously. For this reason most tuned software employs long, contiguously held vectors.
.bp .nf .B CYBERPLUS .R .Ie "CYBERPLUS" Control Data Corporation CYBERPLUS Marketing P.O. Box O HQS09B Minneapolis, MN 55440 Martin Ferrante 800-828-8001 ext 88 In Europe: CDC and ETA UK D Swanston Control Data Limited Genesis Centre Garrett Field Birchwood Science Park Birchwood Warrington Cheshire WA3 7BH ENGLAND (0925) 824757 Telex 629900 .B Ring Bus Architecture .R .fi This is a multiple parallel processor system. It grew from the Flexible Project and the subsequent Advanced Flexible Processor Project (AFP), used in military applications since 1976. The machine is based on ring technology with an 800 Megabits/second transfer rate with a read and a write possible between processors at this sustained rate. There are two CYBERPLUS processor models: 16-bit integer and 32- and 64-bit floating point. The integer processor has 15 independent functional units capable of 8-, 16- and 32-bit working; each processor has a 20-nsec cycle time. The floating point processor is an extension of the integer one through the addition of three floating point functional units capable of 32- and 64-bit precision, with rated maximum performance of 65 MFLOPS (103 in 32-bit mode). Each processor contains 2048 Kbytes of memory which can be expanded to 4096 Kbytes. A crossbar architecture allows the output of one functional unit to go to any or all other functional units in one machine cycle and permits all functional units to fire every cycle. There are 15 independent functional units: - 1 program unit - 9 I/O units including 4 read/write 16-bit memory units - 2 read/write 64-bit memory units, 2 ring port I/O units, - 5 integer/Boolean units (2 add/subtract, 1 multiply, and 2 shift Boolean) .fi Floating point: 1 add/subtract, 1 multiply, 1 divide/square root connected by an additional crossbar. Floating-point units can run simultaneously with fixed-point ones. Each instruction can initiate multiple functional units. 
.fi Configuration: Up to 16 rings can be connected to a CYBER 800 computer (each connected through a channel ring port) with up to 16 CYBERPLUS processors per ring. Within this ring all processors can operate autonomously and may execute each clock cycle. Processor Memory Interface allows direct reading and writing of the memory of any processor by another processor on the ring every machine cycle. Central Memory Interface (CMI) for transfer of data to host. The central memory ring is 64 bits wide with an 80 nanosecond cycle time, and this provides a direct transfer of 64 bits between the CYBER and a Cyberplus processor. Data transfers are controlled by the system ring and will be direct memory-memory transfers with the HPM memory on the CYBERPLUS processors. There are two rings connecting the processors: the system ring and the application ring. The ring packet has 13 bits of control information and 16 bits of data. A function code in the ring packet can determine whether access to other memories (one or several) is direct or indirect, the latter requiring acceptance by the target processor. .nf There are three distinct memory systems:
1. 4K 16-bit data memory: 4 independent bi-polar data memories with a one-cycle read/write.
2. 256K 64-bit high-performance data memory: 4 banks with 4-cycle memory access, expandable to 512K 64-bit words with 8 banks.
3. Program Instruction Memory with 4096 200-bit words. Each machine cycle, the instruction memory fetches and initiates the execution of one or all of the parallel functional units. When the floating point option is in use, the size of these memory words increases to 240 bits.
.fi The host CDC 170 Series 800 (under NOS 2) loads code into the processors, transmits data from host to processors, and starts and stops processor tasks. Software includes a cross assembler (MICA), a CYBERPLUS instructor load simulator (ECHOS), and an ANSI 77 Fortran cross-compiler.
.EQ delim @@ .EN 64-bit floating point is accurate to 14 decimal digits with a range of @10 sup -293 @ to @ 10 sup +322 @. 32-bit is accurate to 7 decimal digits with range @10 sup -39@ to @10 sup +37@. Water cooled. Performance: Claimed performance of 64 CYBERPLUS systems linked to a single Control Data 170 Series 800 is 16 billion calculations per second on signal data applications. Change detection algorithm for image processing is about 100 times faster than on a CDC 7600. Software: Floating point hardware and software delivered in first quarter 1985. Fortran compiler available for research activities fourth quarter 1984 and released April 1985. Cost: Entry-level CYBERPLUS base processor is priced at $735,000, which includes a 16-bit integer unit and 2.048 Mbytes of memory. With all available options the price is $1.6 million. Status: Announced formally on October 4, 1983; deliveries started in the first quarter of 1985. .bp .nf .B Cydrome (formerly AXIOM Systems) .R .Ie "Cydrome" .Ie "AXIOM" .nf 1589 Centre Pointe Milpitas, California 95035 Richard Lipes Bob Rau 408-943-9460 Ross Towle (compiler person, student of Kuck) Bob Rau (Architect from University of Illinois and Elxsi) .B Dataflow Architecture .R .bp .nf .B Dana Group .R .Ie "Dana Group" Ben Wegbreit Dana Group 550 Del Ray Sunnyvale, CA 94086 408-732-0400 .B Very High Performance Integrated Graphics Workstation .R Company founded by Allen Michels (from Convergent Tech) Vector register architecture .fi Heavy emphasis on interactive graphics for large computational problems. 48 MFLOPS peak performance UNIX Fortran C Availability: 1987 Markets: CAD/CAM/CAE Molecular Modeling Image Processing Scientific Engineering Research and Development .sp Cost: $50 - 75K .bp .nf .B DAP-3 .R .Ie "DAP" .Ie "Active Memory" Bruce Apler Active Memory Technology Inc. 6600 Peachtree Dunwoody Road 300 Embassy Row Suite 670 Atlanta, GA 30328 404-399-5633 In Europe: S. MacQueen/I.
Merry International Computers Ltd ICL Defence Systems Lovelace Road Bracknell Berkshire RG12 4SN England 0344-24842 Telex 22971 Professor Dennis Parkinson DAP Support Unit Computer Centre Queen Mary College Mile End Road London E1 4NS 01-980-4811 Active Memory Technology Limited Eggington House 25-28 Buckingham Gate London SW1E 6LD England 01-630-9811 Telex 296923 (ADVENT G) Fax 01-828-4919 .B Bit Parallel Architecture .R .fi Configuration: This is an SIMD lockstep machine which operates on multiple data one bit at a time. It has variable-length arithmetic. Configuration is as a grid of processing elements with nearest neighbor connections. There are also row and column data highways (not present on the ILLIAC IV) so that broadcasts can be used to sum efficiently the entries of an array or to find the maximum entry, for example. The other main advantage over the ILLIAC IV lies in the far greater memory for each processing element and the greater reliability of the components. Three versions of the machine have been produced to date. The first, the prototype 32 x 32 machine, was followed by a larger 64 x 64 version which had an ICL 2900 host. The DAP was configured as one of the host's store modules. This resulted in no communication costs between the two machines when a common data to memory mapping format was used. The standard machine had 2 Megabytes of store, but the QMC (Queen Mary College) machine was later upgraded to 8 Megabytes (i.e., it can be visualized as a cube of dimensions 64 x 64 x 2048 bytes). Six of these machines are in use. The third version of the machine, the one currently marketed, has returned to the 32 x 32 array size, and has 8 Megabytes of array storage. The machine is approximately two orders of magnitude smaller, (it now fits under a desk) and can run without a host. The only architectural change has been the provision of a 40 Megabyte/sec I/O subsystem to permit real time processing. 
The instruction cycle time has also been reduced from 200 to 150 nsec. Software: The development environment (cross-compilers and run time debugging aids) is supplied running under UNIX. The DAP is linked as a peripheral via a 1.5 Megabyte/sec parallel interface. Language: The principal programming language used is DAP Fortran, an augmented Fortran that includes most of the array features proposed for Fortran 8X. Applications: Some of its main applications are in lattice gauge theory and molecular dynamics. It is particularly powerful on the Ising model because of its bit arithmetic. It is also used in many Monte-Carlo calculations and in image processing where the major problem is in data movement rather than processing speed. For some specialized applications, the DAP will outperform a CRAY-1. The new mini DAP has also been used to implement a high-performance military radar system. Basic System Configuration: 32 x 32 processor array 8 MBytes of array memory 1 MByte of MCU code memory 10 MHz instruction rate Micro-Vax II host Single cabinet, approx. 17 x 13 x 20 inches .bp Cost: The DAP-3 is currently priced at around $150,000, including the Micro-Vax and development software. Status: Work has already begun on a new machine that will use VLSI to achieve further improvements in integration levels and heat dissipation, with a dramatically improved arithmetic performance. .nf .bp .nf .B Elxsi System 6400 .R .Ie "Elxsi 6400" Len Shar Elxsi 2334 Lundy Place San Jose, CA 95131 408-942-1111 Harvey Goldman - Marketing Len Shar - Research .B Parallel Processor/Bus Architecture .R .fi This machine uses ECL-technology high-density LSI components. The system can be used as a multiprocessor for multitasking of a single Fortran program, or as a loosely coupled architecture with no parallel processing capability executing independent programs or processes, or both ways.
The system can be configured with 768 Mbytes of memory and many disk drives (474 Mbytes each). Up to 12 processors can be configured with this machine, with up to 64 Kbytes of cache on each processor. Global memory architecture is via a fast bus. The bus is a 64-bit-wide channel providing a gross bandwidth of 320 Mbytes per second, giving a transfer rate of 160-213 Mbytes/second. All major components are connected to the bus. Up to 768 Mbytes of MOS memory are available (4 Gbytes virtual). Other features: Each CPU occupies 3 boards, rated at 6 MIPS for the M6410 CPU and at 10 MIPS for the M6420 CPU. 64-bit wide data paths. 50-nsec cycle time. 64-Kbyte, 2-way set associative cache (100-nsec access time). 16 sets of 64-bit general-purpose registers. IEEE floating point arithmetic. Software: The operating system, called EMBOS, is a message-based OS. There is also Elxsi's version of UNIX, a port of AT&T System V.2 and 4.2 BSD. Size: The 5-CPU system fits in a single cabinet, 32 in. deep by 59 in. wide. Languages: Fortran 77, Pascal, COBOL 74, C, MAINSAIL Cost: A single-processor system is in the range of $400,000. A new model, the 6420 CPU, outperforms the old 6410 by a factor of 1.5 to 2. The new CPU can coexist with the old CPUs. .bp .nf .B Encore Multimax .R .Ie "Encore Multimax" Encore Computer Corp 257 Cedar Hill St Marlboro, Mass. 01752 617-460-0500 Julius Marcus - VP of Marketing .B Parallel/Bus Multiprocessor Architecture .R .fi Architecture: National Semiconductor 32032 chip set running at 10 MHz. 32-Kbyte write-through cache per processor pair. Processors connected via a fast, 64-bit wide bus with data throughput rate of 100 Mbytes/sec. Address space of 4 Gbytes Main memory 32 Mbytes of RAM in 4 independent banks, in increments of 4 Mbytes. Configuration: Terminal and unit record I/O connected via Annex 16 line terminal concentrators attached to Ethernet, providing pre-processing. Is compatible with 19-in. Encore workstation.
Note: The company plans successor machines using the best available microprocessors, including RISC architectures. 20 processors maximum configuration. .br Ethernet communications using TCP/IP. Performance: Range quoted from 1.5 MIPS to 15 MIPS by adding processors per module. Languages: UNIX 4.2 with C, Fortran, and Pascal. Status: Product available November 1985. .nf .bp .nf .B ETA-10 .R .Ie "ETA-10" ETA Systems, Incorporated 1450 Energy Park Drive St. Paul, MN 55108 612/642-3400 Charles D. Swanson - Account Support In Europe: D. Swanston Control Data Limited Genesis Centre Garrett Field Birchwood Science Park Warrington Cheshire WA3 7BH ENGLAND 0925-824757 Telex 629900 .B Vector Parallel Architecture .R .fi The ETA-10 is a successor to the CYBER 205, designed to operate at 10 GFLOPS by the end of 1986. Architecture: Central Processors Multiprocessor system with 2, 4, 6, or 8 CPUs (a one-CPU system will also be available) Very high density CMOS circuitry (20,000 gates/chip) Liquid nitrogen cooling for performance and reliability CYBER 205 instruction compatibility Each CPU with a scalar and vector processor, and 4 million words of local memory Scalar unit Independent, segmented functional units 256-word high-speed register file 64-word instruction stack Vector unit 2 vector pipelines Memory Up to 32 million words of CPU memory (4Mw/CPU) MOS semiconductor Shared Memory using 256K VLSI chips Shared Memory sizes: 32, 64, 128, 192, or 256 million words 1 million word communication buffer for interprocessor communication Virtual memory addressing SECDED on each 32-bit half word 48-bit address (address space of 4 trillion words/user) I/O Up to 18 400-Mbit/sec Input/Output units for accessing disks, tapes, front-end systems and networks Miscellaneous Very low power requirement: 700 Watts/CPU (i.e., about 200 Watts per 205 equivalent) Liquid nitrogen cooling Compact packaging High reliability: 100 per cent functional availability Software: Virtual operating system Kernel operating system
for basic processes User environments for control languages and utilities: VSOS (CYBER 205 OS - provides CYBER 205 software compatibility) UNIX Utilities Interactive symbolic debugger Symbolic postmortem dump Performance analyzer Source and object code maintenance Languages: Fortran ANSI 77 with vector extensions 32-bit half-precision data type Special calls to machine instructions Support for anticipated FORTRAN 8X array notation Automatic vectorization Scalar optimization Multiprocessing library Pascal C .fi .EQ delim %% .EN Performance: Too early to say. The performance of the product line is claimed to range from 2 to 4 times faster than the CYBER 205 for a single-processor entry-level system to 40 times faster at the high end (8 processors). The vector unit has been designed to reduce start-up times (%n sub 1/2%) relative to the CYBER 205; however, performance will still be degraded for noncontiguous vectors. Status: Complete system checkout by early 1987, with initial beta site deliveries in December 1986. The fully configured high-performance machines will be shipping by the third quarter of 1987. .nf .bp .nf .B FLEX/32 MultiComputer .R .Ie "FLEX-32" Flexible Computer System Larry Samartin Flexible Computer Corporation 1801 Royal Lane Bldg 8 Dallas, TX 75229 214-869-1234 President/Chairman Larry B. Samartin President/CEO Dr. M. Nicholas Matelan William T. Walker National Manager Flexible Computer Corporation 5 Great Valley Parkway Suite 226 Malvern, PA 19355 215-648-3916 .B Parallel Bus Architecture .R .fi This machine is a true 32-bit multicomputer with variable architecture structure and is an MIMD machine. It uses National Semiconductor 32032 chips at 10 MHz, with an independent self-testing system using a Z80 micro. The "local memory" cycle time is 145 nsec. The claimed limit on the number of CPUs is 20480. Each processor is on one PC board with a full 32-bit data bus and full 32-bit address capability, with a speed of approximately 1 MIP using the 32032.
Each card has a hardware floating-point processor and hardware memory management and memory protection with a local bus interface and a 32-bit VMEbus I/O interface. Also, each processor board has 1 Mbyte or 4 Mbytes of ECC RAM in addition to cache memory and 128 K of ROM. An optional 1 Mbyte of RAM (later planned to have up to 8 Mbyte) with integral error detection and correction code logic is available. Also, an optional floating point accelerator (1 MFLOP) is available on each processor. The company envisages attaching array processors that are VME compatible, such as the SKY Warrior. Other features: Standard VME bus open architecture supporting Eurocard standard. Communication rates on the 10 local buses: 160 Mbit/sec each. Communication rates on the common buses: 380 Mbit/sec each. Time to get on local bus - 1 msec. Time to do an arbitrated read/write through high speed (45 nsec) common memory - 170-185 nsec Direct messaging to another processor's memory via global memory. .fi Configuration: The machine can have flexible configuration of local (145 nsec) and common memory (45 nsec). Mass memory cards (local memory) contain from 1 to 8 Mbytes RAM connected by local and/or 32-bit VMEbus I/O interface and can be used in any combination or permutation with CPU cards (these memory cards also have a microprocessor for SelfTest diagnostics and fault isolation). The system can be dynamically configured and reconfigured using the SelfTest mechanism. Software: A full UNIX System V can run on each processor, with extensions for concurrent processing. FLEX has a 4.2 license. The software license is for 32 users, with an optional software license for unlimited users. FLEX's own multicomputing multitasking operating system (MMOS) provides real-time operating system support, with all the tools for interprocessor communication and signaling, synchronization, event management, etc.
Ethernet-supported TCP/IP Languages: Fortran 77 with ISA S61.1 extensions Ratfor C Concurrent C and Fortran by using a preprocessor Assembly Ada under development Base system: Each cabinet can include up to 20 32-bit processors or 160 Mbytes of memory. There are two computers in two 19-in. standard cabinets: - one cabinet (the peripheral control cabinet PCC) for the SelfTest System and VME Eurocard card cage (with room for further 19-in. card cages for peripherals) - the other cabinet (the MultiComputer Cabinet MCC) with a 30-slot card cage partitioned into three 10-slot sections. The backplane contains 2 common buses, 10 local buses, and 20 VMEbus interfaces. The MCC also houses a local bus to common bus interface (common control card) with a fair arbitration mechanism, up to 9 common access cards with 128 Kbytes to 512 Kbytes of common memory (45 ns) each, and a universal card with 128 Kbytes ROM, 1MByte or 4 MBytes of ECC RAM, 1 MIP processor, and VME interface with a separate microprocessor for the SelfTest System. Cabinet size is 24"x76"x36". Cost: Price starts at approximately $100,000; list price is $36,000 per CPU with 1 Mbyte RAM, 128 Kbytes ROM, FPP, and MMU. .bp .B Floating Point Systems MP32 SERIES MODEL 3000 .R .Ie "FPS/32" .Ie "Floating Point Systems" .nf MP32 Series, Model 3000, Floating Point Systems, Inc. Steve Cannon 3601 SW Murray Blvd, Beaverton, OR, 503-641-3151 x1883 In Europe: David A. Tanqueray Floating Point Systems U.K.
Limited Apex House London Road Bracknell Berks RG12 2TE England .sp 2 Architecture: MIMD .R Basic chip used: M68000 (Control Processor), AMD & Weitek Chips (arithmetic processor) Local, global-shared memory, or both: Both Connectivity (for example, grid, hypercube): Bus Range of memory sizes available, virtual memory: 1Mword to 7Mword (32-bit) Floating point unit (IEEE standard?): IEEE standard 32-bit Configuration: Stand-alone or range of front-ends: Front ends: DG MV Series, Perkin-Elmer, Microvax II, VAX Peripherals: I/O processors Software: Unix or other? Other Language available: MAX 68 control language, XPAL assembler FORTRAN characteristics: N/A F77 Extensions Debugging facilities Vectorizing/parallelizing capabilities: Horizontal microcode synthesis that allows up to 10 operations to execute simultaneously. Applications: Run on prototype: Yes, or on front-end simulator Software available: Math Libraries: Basic math, Signal, Image, & Geophysical Performance: Peak: 18 to 54 MFLOPS Benchmarks on codes and kernels: 2D CFFT 1024 x 1024 pts - 1.89 sec. Status: Date of delivery of first machine, beta sites, etc.: Available since 8/85 Expected cost (cost range): $57,500 to $125,000 Proposed market (numbers and class of users): Signal processing, Image processing, and Computational physics .bp .B Floating Point Systems FPS-5000 SERIES .R .Ie "FPS-5000" .nf FPS-5000, Floating Point Systems Inc. Steve Cannon, 3601 SW Murray Blvd., Beaverton, OR, 503-641-3151, x1883 In Europe: David A. Tanqueray Floating Point Systems U.K.
Limited Apex House London Road Bracknell Berks RG12 2TE ENGLAND .B Architecture: MIMD .R Basic chip used: AMD Chips, Weitek Chips on coprocessor Local, global-shared memory, or both: Both Connectivity (for example, grid, hypercube): Bus Range of memory sizes available, virtual memory: 256K to 1024K (38-bit words) Floating point unit (IEEE standard?): 32-bit IEEE (coprocessor) Configuration: Stand-alone or range of front-ends: Front ends: VAX; PDP-11; Perkin-Elmer 3200; Gould 32; IBM 4300, 3080, 3090; Prime 750, 9950; Harris 800, HP 1000E Peripherals: 300MB and 80MB, Disks, I/O processors Software: UNIX or other? Other Language available: CP FORTRAN, MAXL control language (FORTRAN-like); APAL and XPAL assemblers FORTRAN characteristics: F77 (CPFORTRAN, which is F77 less I/O and character data type support) Extensions: Calls to coprocessor programs Debugging facilities: Symbolic debugger Vectorizing/parallelizing capabilities: Horizontal microcode synthesis that allows up to 10 operations to execute simultaneously Applications: Run on prototype: Yes, or run on simulator on front end Software available: Math Libraries: Basic & advanced math signal and image processing, simulation and geophysical Performance: Peak: 8 to 62 MFLOPS Benchmarks on codes and kernels: 2D convolution 31x31 operations - 33 MFLOPS (FPS-5430) Status: Date of delivery of first machine, beta sites, etc.: Oct. 1983 Expected cost (cost range): $45,000 to $99,000 for 256Kword system + standard software Proposed market (numbers and class of users): 350+ units per year in signal processing, image processing, geophysical analysis, computational physics, and real-time simulation .bp .B Floating Point Systems FPS-164/MAX .R .Ie "FPS-164/MAX" .nf FPS-164/MAX, Floating Point Systems Inc. Dave Vickers (Technical), Mike Saunders (Sales) 3601 SW Murray Blvd., Beaverton, OR, 503-641-3151 In Europe: David A Tanqueray Floating Point Systems U.K.
Limited Apex House London Road Bracknell Berks RG12 2TE ENGLAND .B Pipeline Scalar Processor with Attached Processor .sp .R Architecture: Basic chip used: Proprietary (CPU), Weitek Chips (MAX) Local, global-shared memory, or both: Both Connectivity (for example, grid, hypercube): Bus Range of memory sizes available, virtual memory: .5Mwords to 15Mwords (64- bit words) or 4Mbytes to 120Mbytes Floating point unit (IEEE standard?): IEEE Standard compatibility Configuration: Stand-alone or range of front-ends: Front-end connection to IBM 4300, 308x, 303x, 309x under MVS, MVS/XA, VM/CMS; DEC VAX under VMS; Sperry 1100 Series; Apollo Domain Peripherals: FD64 Disk subsystem (1-6 controllers, 4-24 drives), 680MB to 16.2GB Software: UNIX or other? System Job Executive Language available: FORTRAN, ASSEMBLY FORTRAN characteristics: F77 ANSI '77 optimizing compiler, 5 levels of optimization Extensions: DOE Extensions for asynchronous I/O Debugging facilities: Symbolic debugger Vectorizing/parallelizing capabilities: Takes advantage of architecture through horizontal micro-coding allowing 10 different operations to occur in 8 separate functional units per machine cycle. The matrix algebra accelerator (MAX) modules allow up to 15 concurrent vector operations at any one time. Applications: Run on prototype. Software available: Math Library routines (500+), Fast Matrix Solution Library (FMSLIB) over 40 third party software packages available. Performance: Peak: 33-341 MFLOPS Benchmarks on codes and kernels: 1000 x 1000 Matrix multiply - 66 seconds with 1 MAX module; - 10 seconds with 15 MAX modules Status: Date of delivery of first machine, beta sites, etc.: Available since 4/1/85 Expected cost (cost range): $435,000 to $1,900,000 Proposed market (numbers and class of users): Computational Chemistry/Physics, Electronic Circuit Design, Oil Reservoir Simulation, Structural Analysis .bp .B Floating Point Systems FPS-264 .R .Ie "FPS-264" FPS-264, Floating Point Systems Inc.
Dave Vickers (Technical), Mike Saunders (Sales), 3601 SW Murray Blvd., Beaverton, OR, 503-641-3151 In Europe: David A. Tanqueray Floating Point Systems U.K. Limited Apex House London Road Bracknell Berks RG12 2TE ENGLAND .B Pipelined Scalar Architecture .R Basic chip used: Proprietary ECL implementation Local, global-shared memory, or both: Both Connectivity (for example, grid, hypercube): Bus Range of memory sizes available, virtual memory: .5MW to 4.5MW (64-bit words), or 4Mbytes to 36Mbytes Floating point unit (IEEE standard?): IEEE standard compatibility Configuration: Stand-alone or range of front-ends: Front-end connection to IBM 4300, 308x, 303x, 309x under MVS, MVS/XA, VM/CMS; DEC VAX under VMS; Sperry 1100 Series; Apollo Domain Peripherals: FD64 Disk subsystem (1-6 controllers, 4-24 drives), 680MB to 16.2GB Software: UNIX or other? System Job Executive Language available: FORTRAN, ASSEMBLY FORTRAN characteristics: F77 ANSI '77 optimizing compiler, 5 levels of optimization Extensions: DOE Extensions for asynchronous I/O Debugging facilities: Symbolic debugger Vectorizing/parallelizing capabilities: Takes advantage of architecture through horizontal micro-coding allowing 10 different operations to occur in 8 separate functional units per machine cycle. Applications: Run on prototype: Software available: Math Library routines (500+), Fast Matrix Solution Library (FMSLIB) over 40 third party software packages available. Performance: Peak: 38 MFLOPS Benchmarks on codes and kernels: 1000 x 1000 Matrix multiply 53 seconds Proposed market (numbers and class of users): Computational Chemistry/Physics, Electronic Circuit Design, Oil Reservoir Simulation, Structural Analysis .sp Expected cost: $640,000 to $1,350,000 .sp Status: Date of delivery of first machine, beta sites, etc.: Available since July .bp .B Floating Point Systems FPS-364 .R .Ie "FPS-364" FPS-364, Floating Point Systems Inc.
Dave Vickers (Technical) Mike Saunders (Sales) 3601 SW Murray Blvd., Beaverton, OR, 503-641-3151 In Europe: David A. Tanqueray Floating Point Systems U.K. Limited Apex House London Road Bracknell Berks RG12 2TE ENGLAND .B Scalar Pipelined Architecture .R Basic chip used: Proprietary ECL implementation Local, global-shared memory, or both: Both Connectivity (for example, grid, hypercube): Bus Range of memory sizes available, virtual memory: .5MW to 9MW (64-bit words) or 4Mbytes to 72Mbytes Floating point unit (IEEE standard?): IEEE Standard compatibility Configuration: Stand-alone or range of front-ends: Front end connection to IBM 4300, 308x, 303x, 309x under MVS, MVS/XA, VM/CMS; DEC VAX under VMS, Sperry 1100 Series; Apollo Domain Peripherals: FD64 (same as MAX except capacity) 1-2 controllers, 1-8 disks, 680 MB to 5.44 Gbytes Software: System Job Executive Language available: FORTRAN, ASSEMBLY FORTRAN characteristics: F77 ANSI '77 optimizing compiler, 5 levels of optimization Extensions: DOE Extensions for asynchronous I/O Debugging facilities: Symbolic debugger Vectorizing/parallelizing capabilities: Takes advantage of architecture through horizontal micro-coding allowing 10 different operations to occur in 8 separate functional units per machine cycle. Applications: Run on prototype: Software available: Math Library routines (500+), Fast Matrix Solution Library (FMSLIB) over 40 third-party software packages available. Performance: Peak: 11 MFLOPS Benchmarks on codes and kernels: 1000 x 1000 matrix multiply - 189 seconds Proposed market (numbers and class of users): Computational Chemistry/Physics, Electronic Circuit Design, Oil Reservoir Simulation, Structural Analysis Status: Date of delivery of first machine, beta sites, etc.: Available since Sept. 1, 1985.
Expected cost (cost range): $298,000 to $950,000 .bp .nf .B Floating Point Systems .R .Ie "FPS T-Series" FPS T Series Floating Point Systems Beaverton, OR 97005 1-800-547-1445 .B Hypercube architecture - Vector processors .R Each node consists of an Inmos transputer, memory, and a vector processor. Vector processor: The vector processor occupies two-thirds of the board surface and is a proprietary state machine with its own instruction stream and microcode. Three of the chips are currently Weitek parts. A 6-stage, 8-MFLOPS adder and a 7-stage, 8-MFLOPS multiplier. Peak performance is 16 MFLOPS for 64-bit operands and 24 MFLOPS with 32-bit operands. IEEE arithmetic. 192 MBytes/sec to/from memory. Inmos transputer: 32-bit CMOS processor 7.5 MIPS processor .fi 2 KB of on-chip RAM with one-cycle access that serves like a large register set. 19MB/sec between local memory and transputer. .nf Local memory is 1MB of dual ported RAM. .fi Aggregate external bandwidth for a single node 8 MB/sec. 4 input and 4 output channels may be active simultaneously. .nf .EQ delim @@ .EN Maximum number of nodes that can be connected is @ 2 sup 14 @ (16384). Maximum execution rate of 262 GFLOPS for 64-bit operands. Eight nodes make up a module. Two modules make up a cabinet. Maximum of 1024 cabinets. I/O peak transfer rate 80 MB/sec for a 16-node cabinet system. Stand-alone system. .fi A cabinet contains two system disks the user may reference through a system manager network. .nf Direct disks, up to 1 GByte/node, are planned for July 1987. Software: Occam is the language used on the Transputer. Occam is enhanced with a library of mathematical subroutines. Sequential languages C, Fortran, and Pascal can run on each node, but Occam is still needed to manage concurrency. Each cabinet is air cooled, requires 1000 watts of power and has a footprint of 5 sq. ft. Delivered: Cornell University, one cabinet 2nd quarter 86 Northrop, one cabinet Michigan State University, two cabinets Caltech, one cabinet.
.bp .B .nf Galaxy YH-1 .R .Ie "Chinese supercomputer" "Galaxy YH-1" .Ie "Galaxy YH-1" China .B Vector Register Architecture .R .fi China has built its first supercomputer, as was revealed by \f2China Pictorial\f1. The development of this machine, which has the appearance of a CRAY computer, started in 1978 at the University of Defense Science and Technology in Changsha. Performance: The YH-1 (Galaxy), as it is called, can execute 100 million operations per second. Status: According to \f2China Pictorial,\f1 the YH-1 was finished two years ahead of schedule and at only one-fifth of the planned budget. .bp .nf .B HEP .R .Ie "Denelcor HEP-1" .Ie "HEP" Denelcor, Inc. 17000 E. Ohio Place Aurora Colorado 80017 8-303-337-7900 Dr. Burton Smith - architect .B Shared Memory Multiprocessor .R .fi The Heterogeneous Element Processor (HEP) is an MIMD machine with two levels of parallelism. Each Process Execution Module (PEM) can run asynchronously, and all can have access to the common storage through a proprietary switch. Although the HEP has been designed for use with up to 16 PEMs, the largest built was a 4-PEM machine. Each PEM is itself an MIMD machine with parallelism achieved through an instruction execution pipeline. Up to 64 user-defined tasks can be executing concurrently, but the length of the pipeline on a 1-PEM machine effectively limits the degree of parallelism to between 8 and 16, depending on memory accesses. The memory accesses are also pipelined. An instruction progresses to the next stage of the pipeline every clock cycle of 100 nsec, although a memory fetch or store can be proceeding simultaneously. The CPU uses MSI ECL, mostly ECL 10 K with a gate delay of 3 ns, although some critical circuits use ECL 100 K with a .75-nsec gate delay. SECDED memory is used throughout. Parallelism is obtained in Fortran by explicit task creation (with minimal overhead), and synchronization is by means of asynchronous variables.
Program, constant, register, and data memories all use 64-bit words. - Program memory size is from 32 Kwords to 1 Mword. - There are 2048 registers, and the minimum size of the read-only constant memory is 4096 words. - The data memory is separate from the CPU and can be expanded in 128-Kword increments to a maximum of 1M words (8 Mbytes) per PEM. Memory access time is 50 nsec, and half and quarter word and byte addressing is possible. .bp Configuration: The HEP switch that connects memory with CPUs is a flexibly configured, programmable network which uses packet switching techniques to route messages. Each node on the switch network has three full-duplex ports. Arbitration is through a priority system based on longevity. The propagation time through a node is 50 nsec. Although designed as a stand-alone system, the HEP is probably best front-ended by a machine with good interactive capabilities, such as a VAX. Software: A version of UNIX III is used as the operating system, although not all utilities are available. The debugging and diagnostic capabilities are poor. Floating point uses IBM-compatible 32- and 64-bit formats. Little software outside of linear algebra kernels is available. Languages: Fortran 77, C, and Pascal are available in addition to HEP assembler. Performance: Each PEM is rated at 10 MIPS, and speeds in excess of 7 MFLOPS have been achieved on one PEM for linear algebra kernels coded in HEP assembler language. It is rare to exceed 3 MFLOPS for purely Fortran code on one PEM. Cost: The cost of a 1-PEM configuration is around $3 million. Status: Company filed Chapter 11 in 1985. No systems operational. HEP2 plans uncertain. .nf .bp .nf .B Hitachi S-810 .R .Ie "Hitachi S-810" Yoshihiro Koshimizu Hitachi America Ltd. Computer Division 950 Elm Ave.
Suite 100 San Bruno, CA 94066-3094 415-872-1902 .B Vector Register Architecture .R .fi The Hitachi comes in three models: the S-810/5, the S-810/10, and the S-810/20 (not available in the United States, only for the Japanese market). Hitachi's approach has been to employ independent scalar and vector processors. The S-810/20 relies on Hitachi's current top-of-the-line mainframe (the M280H) for its scalar processor, which has a cycle time of 28 nsec and runs the complete IBM 370 instruction set. The vector unit was designed with a cycle time of 14 nsec. The main memory capacity of the S-810/20 is 256 megawords. The model 20 has four floating point add/logical units and eight combination multiply/divide-add units. In addition, there are two load pipes and two load/store pipes to/from memory, each capable of loads/stores at a rate of two words (64 bits each) per cycle. The scalar speed of the Hitachi S-810 may be slower than either the CRAY X-MP or Fujitsu VP-200. The vector register capacity is 32 registers, each with a fixed length of 256 elements (64 bits). A unique feature of the Hitachi design is that vectors greater than 256 elements are managed automatically by the hardware. .nf .bp .nf .B IBM 3090/VF .R .Ie "IBM 3090/VF" IBM Neighborhood Rd Kingston, New York 12401 In Europe: David Marshall IBM Warwick Engineering, Science and Industrial Centre PO Box 31 Birmingham Road Warwick CV34 5JL England 0926-32525 Telex 311601 .B Vector Register Parallel Shared Memory Architecture .R .fi The IBM 3090 is the top end system available from IBM. It uses the System/370 Extended Architecture for scalar operations. 18.5 nsec cycle time. 3090 Model 150 is a uni-processor with 32 MB or 64 MB of central memory. 3090 Model 180 is a uni-processor with 32 MB or 64 MB of central memory and 64 up to 256 MB extended storage. 3090 Model 200 is a dyadic processor with 64 MB of central storage and up to 256 MB of expanded storage.
3090 Model 400 is a four-way processor with 128 MB of central storage and up to 512 MB of expanded storage. For the 3090 each processor has a high-speed cache of 64 KB. The cache is system controlled. Vector Facility (VF): Optional feature to the 3090. Pipelined vector processor with vector registers. Each VF has 8 vector floating point registers of 128 64-bit elements. 171 vector instructions are added for the VF. 32-bit operands in the VF are treated as 64-bit operands. Fixed stride addressing on vectors is allowed as well as indirect addressing or mask control. Each VF has a theoretical peak performance of 108 MFLOPS. .nf Models 150 and 180 can have 1 VF added. Model 200 can have one or two VFs added. Model 400 can have one, two, three, or four VFs added. System Software: MVS/XA VM/XA VM/SP High Performance Option Languages: Assembler H Version 2 VS Fortran 2 including Library Program Multitasking Facility and Interactive Debug. Engineering and Scientific Subroutine Library. The Fortran compiler will automatically vectorize existing codes. Power consumption: 7.8 KWatts Closed water/air cooled. 171 Sq. Ft. Cost: .fi 3090 Model 200 rough cost is $5M, VF option is 10 per cent per processor additional cost. .nf .bp .nf .B International Parallel Machines Inc. (IP-1) .R .Ie "IP-1" .Ie "International Parallel Machines Inc." Robin Chang International Parallel Machines, Inc. 700 Pleasant Street New Bedford, Massachusetts 02740 617-990-2977 .B Parallel Architecture .R .nf Sales: Walter Stuart Pye V.P. Marketing 6767 Forest Hill Ave. Suite 305 Richmond, VA 23225 U.S.A. 804/272-5678 Telex 888648 Technical: Dr. Robin Chang President 700 Pleasant St.
Top Floor New Bedford, MA 02740 617/990-2977 Telex 888648 .fi .sp Parallel Architecture .sp Proprietary CPUs (9 used in base system) (IP-1-9) Local and global-shared memory NxN crossbar interconnection switch 32-bit physical memory addressing, expandable to 48 bits; 64-bit data paths 80M to 430M main memory 170M to 3G disk space double-precision IEEE standard 9-CPU system, 133 MFLOPS double precision 72 MIPS (9 CPU configuration) 52 I/O ports .sp Configurations: Stand-alone VAX front-end IBM MVS front-end various VME/Unix workstation front-ends Symbolic processing workstation front-end (Prolog or Lisp) .sp Can add: 1/2-inch tape drives multiple disk drives running in parallel plotters and printers close-coupling high speed communication interface to other CPUs TCP/IP, HyperChannel more CPUs up to 33 for 1987 delivery, up to 1025 CPUs for 1Q 1989 delivery .sp Software UNIX System V.3, up to 64 users, real-time version available C with IP parallel math routines called from library Fortran 77-to-C converter Fortran 77 (VAX compatible) IP-1 virtual machine package for software developers, IBM-AT and VAX hosts, with debugging facilities, nominal charge .sp Application Software Available: Database management, printed circuit board layout, oil reservoir simulation, seismic data analysis, will port serious applications depending on market potential .sp Performance: 9-CPU peak, 144 MFLOPS double precision IEEE 33-CPU peak, 600 MFLOPS double precision IEEE .bp Status: First machine delivered October, 1985 Oil reservoir simulation beta sites in progress multiple OEM contracts Cost: $22K to $1M+, plus possible application porting charges Scientific, aerospace, engineering, military and university users .bp .nf .B Intel's Personal Supercomputers (iPSC) .R .Ie "Intel iPSC" .Ie "iPSC" Intel Scientific Computers 15201 NW Greenbriar PW Beaverton, Oregon 97006 503-629-7600 General Manager: Robert Rockwell Applications Manager: Cleve Moler Marketing Manager: Charlie
Bishop Marketing and Customer Support: Ellen Bailey In Europe: David Moody Intel Scientific Computers Intel International Limited Pipers Way Swindon SN3 1RJ ENGLAND .B Hypercube Architecture .R Developed from Caltech work on Cosmic Cube. .PP The cube manager, or intermediate host, is a 286/310 workstation with 2-4 Mbytes of memory, a 140-Mbyte Winchester disk, a 320-Kbyte floppy, a proprietary ethernet connection to the hypercube itself, and a TCP/IP ethernet connection to remote hosts. The manager runs Xenix. .PP The hypercube has 32, 64, or 128 nodes, termed the iPSC/d5, d6, or d7. Each node consists of an 80286 CPU, an 80287 floating point coprocessor, and 0.5 megabytes of memory. The 80287 has IEEE arithmetic with 32-, 64-, and 80-bit formats and a speed of about 30-50 kiloflops. Each node also has 8 bi-directional communication channels rated at 10 Mbits/sec per channel. One of the channels is used for communication with the cube manager and the other 5, 6, or 7 are used for communication with other nodes in the cube. .PP The basic system may be modified by replacing node boards with memory expansion boards or higher speed floating point vector boards. A memory board increases the node memory from 0.5 to 4.5 megabytes. The resulting systems are known as the iPSC-MX/d4, MX/d5 and MX/d6. Software available from Gold Hill called CCLISP, for Concurrent Common LISP, provides communicating LISP environments for each node of the MX systems. .PP The vector extension, or VX, boards consist of two 100 nsec cycle, pipelined floating point units, one for addition/subtraction and one for multiplication, an additional megabyte of 250 nsec data memory, and 16 kilobytes of 100 nsec fast data memory. The speed of vector operations is determined largely by the memory speed. 
For example, a DAXPY involving long-precision vectors in the large, main memory has a peak rate of 2.6 Megaflops on a single node, while a dot product involving short precision vectors in the small, fast memory can approach 20 Megaflops. Peak floating point rates of the VX systems, obtained by multiplying the peak rate of a single node by the number of nodes, reach 424 megaflops for long precision and 1280 megaflops for short precision on a 64-node iPSC-VX/d6. VAST II, a Fortran vectorizer from Pacific Sierra Research, is expected to be available in the summer of '87. Software: Manager operating system: Microsoft Xenix 3.0 Node executive: Intel NX Languages: Fortran, C, LISP, FCP (Flat Concurrent Prolog), ASM286, Ada under development. Tools: CCLISP, VAST II, Debugger, Crystalline Operations System (Caltech), Cosmic Environment (Caltech), NETCUBE (Oak Ridge) Physical characteristics of one 32-node cabinet: 16 x 16 x 19 inches; footprint 26 x 26 inches; 180 lb. Cost and performance summary: .KS .TS center; l l l l l l n l n l. System Nodes Memory MFLOPS Price iPSC/d5 32 16 MBytes 2 $155K iPSC/d6 64 32 MBytes 4 $280K iPSC/d7 128 64 MBytes 6 $525K iPSC-MX/d4 16 72 MBytes 2 $176K iPSC-MX/d5 32 144 MBytes 4 $306K iPSC-MX/d6 64 288 MBytes 6 $556K iPSC-VX/d4 16 24 MBytes 106 $250K iPSC-VX/d5 32 48 MBytes 212 $450K iPSC-VX/d6 64 96 MBytes 424 $850K .TE .KE .bp .nf .B Loral Dataflo .R .Ie "Loral DATAFLO" .sp Loral Instrumentation 8401 Aero Drive San Diego, California 92123 619-560-5888 .B Parallel Dataflow Architecture .R .fi .sp The Loral DATAFLO system is a parallel processor that can be incrementally expanded from approximately 10 processors to approximately 256 processors. Each processor is composed of two National Semiconductor NS32016 microprocessors. One processor is dedicated to token (data) management and store and the other is dedicated to application execution. The application processor has a National Floating Point Unit associated with it.
The applications processors each have 128 K of local static RAM that is used for application execution. In general, communication between processors is via messages (dataflow tokens). Communication is handled on a 32-bit time-multiplexed bus. This bus is used to broadcast dataflow tokens that have 16 bits of tag and 16 bits of data. A large dataflow system is composed of multiple chassis, with at most 14 dataflow processors programmed to pass dataflow tokens between chassis. Since these interfaces pass only those tokens that they are programmed to pass, bus saturation within a chassis is minimized. Shared memory can be added to the system in 2-Mbyte increments by replacing a dataflow processor with a shared memory board. Shared memory can be accessed by any processor in the chassis via a device bus that is separate from the dataflow bus. A program is composed of two components, a data graph description and a set of graph node implementations written in some standard language like C or Fortran. Application development and monitoring of system activity are accomplished through a dedicated UNIX-based processor occupying a position in one of the clusters. The "grain" size for the system is approximately the size of a procedure, around 60 to 100 lines of source code. A wide variety of real-time I/O and data storage controllers may be included in the dataflow environment through an extension of the dataflow bus. Price: $67K to $2M .nf .bp .nf .B Meiko .R .Ie "Meiko" Meiko Incorporated 6201 Ascot Drive Oakland, CA 94611 (415) 530 3055 Telex 797748 In Europe: Meiko Limited Whitefriars Lewins Mead Bristol BS1 2NT England (0272) 277409 Telex 449731 Fax (0272) 277082 .B Parallel MIMD architecture .R Founded in 1977. First shown in July 1985 at SIGGRAPH in San Francisco. Contact: Roy Bottomley and Miles Chesney (England) .fi The founders of Meiko were the managers of the design group responsible for the transputer and its peripherals.
Thus the whole design philosophy of the Meiko system units is based around the INMOS Transputer. These are available in three flavors: .sp T414-15 15MHz 32-bit 7.5 MIPS T414-20 20MHz 32-bit 10 MIPS T800-20 20MHz 32-bit 1.2 Mflops sustained (peak of 3 Mflops) Connection topology is user configured, either (i) hardwired by means of wire wrap, patch links or PCBs plugged onto the backplane, or (ii) by electronic configuration. The connectivity is defined by the program. A distributed electronic switch implements this connectivity on the computing surface. .fi Each unit contains a transputer processor with eight unidirectional 10Mbit/sec autonomous message channels. These communication channels can be used for high-speed direct memory access or for low latency message passing to or from other computing elements. .sp Communication between units is by explicit I/O or message passing. .sp Message passing is a single instruction in which the appropriate process scheduling is achieved in an efficient microcode sequence. The units are: Local host with 3Mbytes 15Mbytes/sec error-checked RAM and 128Kbytes of 10 Mbyte/sec EPROM. IEEE 488 and dual RS232 I/O interfaces. At least one local host is required in any system (computing surface). Computing element. The only memory is that of the transputer, namely, 256Kbytes of 15Mbytes/sec error-checked RAM. Mass store with 8Mbytes of 15Mbyte/sec error-checked RAM and 2Mbytes/sec DMA controlled SCSI disk and peripheral interface. The third level of this memory hierarchy is 2048 bytes of single cycle static RAM for frequently accessed local variables. Display which has 128Kbytes of private SRAM, 1.5 Mbytes dual-ported display store. 70 MHz pixel rates and 200Mbytes/sec pixel highway. CCIR/RS-343-compatible video with programmable sync generator supports interlace and non-interlace. The units are held in slots in the Computing Surface. 
The local host, mass store, and display each require one slot, but the Compute Board contains 4 computing elements and occupies only one slot. The units are grouped as Computing Surface Modules which can themselves be combined to form the Computing Surface. Two standard modules are the 10-slot M10 and the 40-slot M40. The Computing Surface contains an infrastructure to facilitate debugging. The Computing Surface can be used stand alone or as an attached resource to a VAX, SUN, IBM PC, or Prime. .nf Software includes : Occam II compiler and the sequential language compilers ... C Fortran 66 Fortran 77 Pascal BCPL .fi Current applications include molecular modelling, naval simulators, computational fluid dynamics, lattice gauge theory, quantum chromodynamics, ray tracing, and solution of partial differential equations. A single M40 module with computing elements employing the T800 transputer is capable of a sustained performance of 187 Mflops. Price depends on the system ordered. Prices for a fully operational system start at around $13K (the M10). The M40 Computing Surface Module with 39 computing elements and a local host costs around 250K pounds ($417K). This configuration has 157-way parallelism, a total MIPS rating of 1175, and 42 Mbytes of RAM. Computing Surfaces containing over 300 processors, spread across several modules, have been demonstrated. The delivery of a 1024-processor, 1 Gbyte, 1 Gflop, 3M-pound Computing Surface is expected during the middle of 1987. First deliveries were in March 1986 and since then over 2 dozen machines have been shipped. .bp .nf .B MIPS Computer Systems, Inc. .R .Ie "MIPS" John Hennessy MIPS Computer Systems, Inc. 930 Arques Ave. Sunnyvale, CA 94086 408-720-1700 .B RISC Technology .R .fi This is a new organization (2 years old), with about 95 people, including the founders John Hennessy, John Moussouris, and Skip Stritter.
Architecture:
Family of products:
- component kits
- boards
- development systems
Family of CPU boards: 3, 5, and 8 MIPS (VAX = 1.0 MIPS)
Custom floating point: 3 MFLOPS, IEEE arithmetic
Software: UNIX (C, IEEE Pascal, Fortran 77)
Cost: $4,000 for the OEM board
Status: products are shipping now
.bp .nf .B Goodyear MPP .R .Ie "Goodyear MPP" .Ie "MPP" Goodyear Aerospace Corporation 1210 Massillon Road Akron, Ohio 44315 Ken E. Batcher 216-796-4511 .B Parallel Architecture .R .fi The MPP is the product of research and development designed to evaluate the application of a computer architecture containing thousands of processing elements, all operating concurrently. The major elements are the array unit, the array control unit, and the staging buffer. The 128x128 array of processing elements has nearest-neighbor connection with full edge closure. The 16,384 processing elements, not including the extra columns provided for reliability, are simple bit-serial processors, each with a 32-element on-chip shift register. The heart of the array unit is a custom integrated circuit containing eight processing elements. A total of 2112 chips have been combined with commercial memory and control chips to give the capability to perform 400 million floating-point operations per second. The array control unit contains all the logic to provide a pipeline of commands to the array unit, an I/O controller, and a custom-built 16-bit high-performance microprocessor for program management. The staging buffer is a 16-Mbyte multidimensional I/O buffer. This unit has the capability necessary to reformat input data into the bit-plane format of the MPP I/O system. The staging buffer has an external input rate of 40 Mbytes/sec and an internal transfer rate to and from the array unit of 160 Mbytes/sec in each direction. Language: Parallel Pascal Status: The Massively Parallel Processor was delivered to NASA Goddard Space Flight Center in May 1983. .nf .bp .nf .B Multiflow .R .Ie "Multiflow" Donald E. Eckdahl Joseph A.
Fisher Multiflow Computer, Inc. 175 N. Main St. Branford, CT 06405 203-488-6090 .B VLIW (Very Long Instruction Word) Architecture .R Performance: vector/parallel capabilities achieved by compile-time instruction-scheduling techniques Software: IEEE standard arithmetic and UNIX Applications: scientific/engineering market Languages: Fortran 77 (with VAX extensions), C Cost: under $1 million .nf .bp .nf .B Myrias 4000 System .R .Ie "Myrias 4000" Martin Walker Myrias Research Corporation 200 - 10328 - 81st Avenue Edmonton AB T6E 1X2 Canada (403) 432 1616 Telex 037 - 42759 Martin Walker - R&D Program Manager UUCP:ihnp4!alberta!myrias!maw .B Parallel Architecture, Hierarchically Managed Local Memory .R .fi The main design goal of the architecture is scalability of memory capacity and performance. Each processing element (PE) contains one Motorola MC68000 (10 MHz) and 512 Kbytes of 150-nsec DRAM with a DMA interface to a board-level bus; a multiple processing element (MPE) board contains 8 PEs, a supervisory PE, and an interface to a printed-wire backplane; 16 MPEs fit in one card cage. Card cages have eight parallel ports for communication with other card cages or with external devices; they can be interconnected in a fractal network of arbitrary size; the physical packaging is in crates of eight cages (1024 PEs; 512 Mbytes of memory). The architecture supports the Myrias memory model: independent parallel tasks execute in distinct memory spaces; the spaces are merged upon task completion; these memory spaces are not tied to particular PEs. Virtual memory (32-bit addressing) and the hierarchical clustering of PEs provide a distributed cache system. The architecture is implemented as a virtual machine on which all user software (applications, compilers, editors, and optimizers) runs. The virtual machine provides user-transparent virtual memory (paging and scheduling) and run-time support to user processes.
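The distinct-memory-space semantics just described can be sketched as follows. This is a toy Python model of the idea only: the parallel_do function, its sequential execution of the "parallel" tasks, and the simplified merge rule are our illustration, not the Myrias runtime.

```python
from copy import deepcopy

def parallel_do(memory, tasks):
    """Toy model of the Myrias memory semantics: each task of a parallel
    DO loop runs in its own copy of the parent memory space, and the
    spaces are merged when all tasks complete."""
    parent = deepcopy(memory)
    children = [deepcopy(parent) for _ in tasks]   # distinct spaces
    for task, child in zip(tasks, children):       # conceptually concurrent
        task(child)
    # Merge rule (simplified): a location changed by a child overrides the
    # parent's value; untouched locations keep the parent's value.
    merged = dict(parent)
    for child in children:
        for key, value in child.items():
            if value != parent.get(key):
                merged[key] = value
    return merged

# Two "iterations" update disjoint variables in their private spaces.
mem = {"a": 0, "b": 0}
merged = parallel_do(mem, [lambda m: m.update(a=1),
                           lambda m: m.update(b=2)])
print(merged)   # {'a': 1, 'b': 2}
```

The point of the model is that tasks never see each other's intermediate writes, which is what frees the system to place them on any PEs it likes.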
The virtual machine can run on many different hardware substrates; hardware failures are circumvented by the machine's control mechanism. Configuration: - off-the-shelf components - two-sided printed circuit boards - two kinds of board - maintenance by on-site board swap - stand-alone system - standard network interface (e.g., VME) or to suit customer Arithmetic: 32-, 64-, and 128-bit floating point; 8-, 16-, 32-, and arbitrary-precision fixed point; IEEE 754 option. Software: - UNIX System V and BSD 4.2 operating system (user visible)/Myrias 4000 (user transparent) - upwards compatible with existing serial computers Languages available: Myrias Parallel Fortran (Fortran 77 with parallel DO loops, recursion, and dynamic array dimensions); Myrias Parallel C (ANSI C with parallel DO loops). Fortran characteristics: - a single construct provides access to parallelism (parallel DO loops) - upwards compatible with Fortran 77 - will run conforming programs - will have parallel debugging aids - recursive parallel programming methods allow straightforward implementation of optimal divide-and-conquer algorithms, which can minimize computational complexity Applications: physical modeling (neutron transport, magnetic fusion, drug design, chemical engineering, quantum chemistry, aerodynamics and hydrodynamics, seismic processing and hydrocarbon recovery, geophysics, meteorology, and structural design); data processing (image processing and generation, searching and sorting); VLSI design; algebraic manipulation. A (recursive parallel) mathematical library will be provided. Performance: proportional to the size of the configuration; achieved through the scalable architecture and algorithmic reduction of computational complexity. Status: prototype 1986. Cost: price proportional to performance; more than $1M. .nf .bp .nf .B AS/91X0 .R .Ie "National Advanced Systems" "AS/91X0" .Ie "NAS AS/91X0" .Ie "AS/91X0" Claud Stoudmeyer National Advanced Systems 800 East Middlefield Rd.
PO Box 7300 Mountain View, CA 94039 415-962-6100 .B Integrated Vector Processor .R .fi The NAS 91X0 is the top-end system available from National Advanced Systems. It uses the System/370 Extended Architecture for scalar operations. The AS/9140/50 are uniprocessors with 48 MB of central memory. The AS/9160 is a uniprocessor with 64 MB of central memory. The AS/9170/80 are dyadic processors with 64 MB of central memory. Each processor has a high-speed cache for scalar operands. The cache is system controlled. Vector Processing Facility (VPF): an optional feature of the 91X0. Pipelined vector processor using memory-to-memory operations (no vector registers). 46 vector instructions are added for the VPF. 32-bit operands in the VPF are treated as 64-bit operands. Fixed-stride addressing on vectors is allowed, as well as indirect addressing or mask control. Based on the Hitachi S-9 plus IAP. System Software: MVS/XA VM/XA VM/SP High Performance Option Languages: Assembler H Version 2. The Fortran compiler will automatically vectorize existing codes using Pacific Sierra's VAST. Closed water/air cooling. Cost: rough cost is $3M .bp .B NCUBE .R .Ie "NCUBE" .nf Sales Office: 700 E. Baseline Rd., Suite D1 Tempe, AZ 85283 Headquarters: 1815 NW 169th Place Suite 2030 Beaverton, OR 97006 John Palmer (602)839-7545 .B Hypercube Architecture .R
Node Processor
Custom VLSI
11 interrupt-driven DMA channels at 2 Mbytes/sec
10 channels for hypercube; 1 for system I/O
VAX-style 32-bit byte-addressable architecture
16 general registers (32 bits)
complete, orthogonal 2-address instruction set
8-, 16-, 32-bit integer and logical operations
32-, 64-bit IEEE floating-point operations
17 addressing modes (e.g., autoincrement, autodecrement, autostride)
Performance (8 MHz: approx. VAX 780 with fl.pt.
accelerator) 1-2 MIPS (32 bits); 0.5 MFLOPS (32 bits); 0.3 MFLOPS (64 bits)
Memory: 128 Kbytes SECDED; about 110 KB available for application
Processor Board (16"x22") contains 64 nodes + 8 Mbytes SECDED memory
Host Board (16"x22") contains:
Intel 80286/80287 with 4 Mbytes SECDED memory
1 ESMD disk interface for up to 4 disks (160, 330, 500 Mbyte)
8 serial RS-232 channels
1 parallel Centronics-compatible interface
3 iSBX interfaces
16 node processors with memory; provide a small cube for a starter system or 128 DMA channels for a larger system
Performance: up to 180 Mbytes/sec bandwidth to hypercube
Graphics Board (16"x22") contains 2Kx1Kx8 frame buffer (768x1024 displayed at 60 Hz); color table (16 M colors); 180 Mbytes/sec data bandwidth (30 frames/sec); zoom; pan; 16 local NCUBE nodes; text processor; RS-343 RGB output
Intersystem Link Board: connects multiple NCUBE/ten systems together
Open System Board: allows user-defined interfaces to the hypercube.
Disk Farm Board: allows direct disk connection to hypercube nodes.
Configurations
NCUBE/ten: 16 to 1024 nodes; 3 ft cubed; 220 V; 8 KW max; air cooled; 24-slot backplane: 8 for I/O options, 16 for Processor Boards; 160, 330, or 500 Mbyte disk drives and 60 Mbyte cartridge tape
NCUBE/seven: 16 to 128 nodes; 14" wide by 29" by 29"; 110 V; office environment; 4-slot backplane: 2 for I/O options, 2 for Processor Boards; 160 or 330 MB disk, 16 MB tape drive
NCUBE/four: 4 to 16 nodes; PC-AT accelerator (4 nodes + AT bus interface); up to 4 boards per AT; for software development plus workstation.
Software
Axis (Host): Unix-style multiuser; distributed file system; .fi EMACS-style screen editor with up to 4 windows; cube managed as a device that can be allocated in subcubes; parallel symbolic debugger. .br Vertex (Node): message routing; message typing; process debugging support .br Fortran 77 and C are available. Axis, Vertex, and compilers run on the NCUBE/four (PC-AT).
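The hypercube topology underlying these configurations has a convenient binary-address structure: two nodes are linked exactly when their addresses differ in one bit, so an n-cube node has n neighbors and any message crosses at most n links. A small Python sketch of neighbor enumeration and one minimal routing order (our illustration of the standard idea, not NCUBE's microcode):

```python
def neighbors(node, dim):
    """Nodes of a dim-cube adjacent to `node`: flip one address bit."""
    return [node ^ (1 << i) for i in range(dim)]

def route(src, dst, dim):
    """One minimal path: correct the differing address bits from low
    to high (the usual dimension-order discipline)."""
    path = [src]
    diff = src ^ dst
    for i in range(dim):
        if diff & (1 << i):
            path.append(path[-1] ^ (1 << i))
    return path

# In a 10-cube (the 1024-node NCUBE/ten), every node has 10 neighbors.
print(len(neighbors(0, 10)))      # 10
print(route(0b0000, 0b1011, 4))   # [0, 1, 3, 11]
```

The same bit structure is what makes subcube allocation natural: any set of nodes agreeing in some fixed address bits is itself a smaller hypercube, which is how the operating system can hand out the cube "as a device" in subcubes.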
Price: NCUBE/ten or /seven: $40K (cabinets + peripherals) + $60K/Host Board + $100K/Processor Board (university discount available)
NCUBE/four: $10K/board (4 nodes) + $4K O.S. license.
Schedule:
Beta sites working with I/O systems since February 1985
Product announcement November 18, 1985, at the SIAM meeting on Parallel Processing
First complete system shipments in December 1985
Approximately 30 systems sold and installed.
.nf .bp .nf .B NEC SX-1E, SX-1 and SX-2 .R .Ie "NEC SX-2" Mr. S. Adams NEC Information Systems 1414 Massachusetts Ave. Boxborough, Massachusetts 01719 617-264-8800 In Europe: Garry Foley Manager - Marketing Communications Systems Division NEC Business Systems (Europe) Ltd. NEC House 164-166 Drummond Street London NW1 3HP 01-388-6100 Telex 261914 NEC LDN Fax: (01) 387 4867 (GIII) (01) 388 5704 (GIII) .B Vector Register Architecture .R .fi The SX system has two processors, the Central Processor (CP) and the Arithmetic Processor (AP), sharing the main memory. The CP is a front-end mainframe processor on which system control programs and user programs run. The AP is a kind of Fortran engine dedicated to executing user programs. Although the SX runs in standalone mode, NEC supports connection to its ACOS-series mainframes and also to IBM mainframes. .nf
.TS
center;
l l l l.
	SX-1E	SX-1	SX-2
_
Cycle time	7 ns	7 ns	6 ns
Number pipes	4 v-pipe	8 v-pipe	16 v-pipe
Length regs	20K v-reg	40K v-reg	80K v-reg
.TE
.bp .fi Architecture: AP architecture - 16 vector arithmetic pipelines: four identical sets, each with an add, multiply, logical, and shift pipe. - 1000-gate LSIs with 250-picosecond gate delay. - 1-Kbit bipolar memory with 3.5-nanosecond cache memory access time. - 256-Mbyte memory (512-way interleaving) with 2-Gbyte extended memory. - 64-Kbit static MOS memory chips with 40-nanosecond access time, giving a memory-to-register rate of 11 Gbytes per second. - Register-to-register machine with 40 (80 on the SX-2) Kbytes of vector registers.
- register-to-register architecture with far more (and more flexible) vector functional units. Scalar arithmetic is pipelined (128 scalar registers) and operates in parallel with the vector units. The NEC scalar cycle time is the same as the vector cycle time, and the scalar unit is segmented and pipelined to allow more than one pair of operands to progress through the same functional unit concurrently. CP architecture - an extension of the NEC mainframe computer. - virtual storage support. Software: - does not run the IBM instruction set (unlike the other Japanese supercomputers) - Fortran 77 with automatic vectorization. The performance tuning tools available are VECTORIZER/SX and ANALYZER/SX. The compiler vectorizes IF statements, intrinsic functions, and indirect addressing using vector gather and scatter instructions (into temporaries). - uses its own operating system Languages: Fortran 77, ALGOL, PL/I, BASIC, Pascal, C, LISP, PROLOG, and COBOL. In vector mode only Fortran is supported. Performance: the maximum rating of the SX-1E is 285 MFLOPS, of the SX-1 570 MFLOPS, and of the SX-2 1.3 GFLOPS. The SX-2 appears to be the most powerful of the Japanese supercomputers, and the only one to aggressively address the scalar bottleneck. Status: the first delivery in the U.S. was in July 1986. The NEC machine is available for benchmarking. NEC has sold seven of its supercomputers in Japan and the USA. Cost: SX-1E: $8-9 million SX-1: $10-12 million SX-2: $14-16 million .nf .bp .nf .B NUMERIX MARS-432 .R .Ie "NUMERIX" "MARS-432" .Ie "MARS-432" Numerix Corporation 20 Ossipee Road Newton, MA 02164 (617) 964 2500 In Europe: Numerix UK Limited Ambassador House 181 Farnham Road Slough SL1 4XP ENGLAND (0753) 29411 Attn: Martin C. Allen, Director of Sales and Marketing Company formed in 1980 as a co-operative exercise between Analog Devices Inc. and Standard Oil (Indiana). .B Pipelined Array Processor .R .fi 32-bit floating-point array processor.
Clock cycle time is 100 nsec. There are two pipelined adders and one pipelined multiplier that can each deliver one result per cycle. Simultaneously, two data reads or one write can be performed. Computational power is 30 MFLOPS (32-bit arithmetic). Access to memory from the arithmetic pipes is via a crossbar switch. Data memory is 64 Mbytes of directly addressable memory. Program memory is a 4K-word cache with a virtual memory space (64-word pages) of up to 64K words. A path exists between the memories so that programs can be stored in the data memory. Communication with the host is through an interface box and a 5-MHz 32-bit data bus, with control through a second bus (the CBUS). DMA transfers at I/O bus rates of 20 Mbytes/second. Interfaces currently exist to DEC machines, ELXSI (Embos), Apollo (Aegis), and Sperry (OS1100) systems. Software includes: Fortran development system, microcode development system, AP run-time executive support package, and application libraries including mathematics, signal processing, and geophysical processing. A 1024-point complex FFT takes 1.7 msec. Dimensions are 19"w x 21"h x 24"d; weight 180 lbs. Customers include ERIM (Michigan), Honeywell, Naval Research Lab, Kodak, Pratt & Whitney, Naval Weapons Center (USA) and Rolls-Royce, GESMA, Ensign Geophysics, Queen Mary College, and BGT (Europe). .nf .bp .nf .B PS 2000 (Russian supercomputer) .R U.S.S.R. .R .Ie "Russian supercomputer" "PS 2000" .Ie "PS 2000" .B Parallel Architecture (SIMD) .R .fi Today in the Soviet Union there is assembly-line production of PS-2000 computers with a capability of up to 200 million operations per second. All the processors (the number of which varies with the model of the machine) perform the same operation at the same time or are in wait mode. .sp The PS-2000 complex is classified as SIMD (single instruction stream, multiple data stream). The complex includes an SM-2 and the PS-2000 processor.
The latter consists of 8-64 processor elements, each with its own memory of 4K-16K 24-bit words. All processor elements are under common control. The complex was 'first commissioned' in 1980. The speed of an addition (of unspecified type) is 0.3 microsec; the source quotes 0.64 microsec for the memory access or cycle time, without indicating which. .sp The PS-2000 computer consists of 8, 16, 32, or 64 processor elements (PEs). They are connected to each other in an identical fashion, are located under a unified control, and are of a single type. Each processing element has its own (local) direct-access semiconductor memory of 12K or 48K bytes. This makes it easy to upgrade the system and thus change its performance within wide limits. The performance of the minimum 8-processor PS-2000 configuration is approximately 25 million short operations per second; the maximum 64-processor configuration permits about 200 million short operations per second. .sp The PS-2000 operates on 12-, 16-, and 24-bit words and can work in both fixed- and floating-point modes. .sp The basic programming language for the PS-2000 is assembly, which reflects the PS-2000 microinstruction set. .sp The 8, 16, 32, or 64 processors can be connected under program control into a ring structure. It is possible to form two identical rings, each consisting of 8, 16, or 32 processors. These processors are controlled by the PS-2000 CPU, which uses 64-bit instructions from its own 16K semiconductor memory. A basic 8-processor configuration fills a 28" rack. A full 64-processor 40-Mflop configuration fills 5 such racks. By comparison, the U.S.-made 30-Mflop Numerix 432 fills half of a 22" rack. .sp While the bulk of the applications of the PS-2000 appears to be seismic data processing, other problems such as near-sonic gas flow studies and nuclear reactor simulations have been reported.
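The SIMD discipline described above, in which every processor element either executes the common operation or waits, together with the ring connection, can be modeled in a few lines of Python. This is a toy model of the execution style only; the function names and the mask convention are ours.

```python
def simd_step(pes, op, mask=None):
    """One SIMD step: every enabled PE applies the same operation to its
    own local data; masked-off PEs simply wait (keep their value)."""
    return [op(x) if (mask is None or mask[i]) else x
            for i, x in enumerate(pes)]

def ring_shift(pes):
    """Neighbor communication in the ring: each PE passes its value to
    the next PE around the ring."""
    return pes[-1:] + pes[:-1]

data = list(range(8))                    # an 8-PE configuration
data = simd_step(data, lambda x: x * 2)  # all PEs double in lockstep
data = simd_step(data, lambda x: x + 1,  # odd-numbered PEs wait
                 mask=[i % 2 == 0 for i in range(8)])
print(data)               # [1, 2, 5, 6, 9, 10, 13, 14]
print(ring_shift(data))   # [14, 1, 2, 5, 6, 9, 10, 13]
```

The single shared instruction stream is what keeps the control unit simple (one 64-bit instruction drives all 64 PEs), at the cost of idle PEs whenever the computation branches.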
.sp The PS-3000 array processor is designed to augment the computing capability of the SM-1210 computer, which is either a new machine or an upgraded SM-2. The PS-3000 is probably not yet in production. It will be a multiprocessor superior to the PS-2000 and capable of 100-Mflop computing rates. The PS-3000 will apparently have four parallel processors, each of which has three arithmetic units that run in parallel. .sp Cost: "retails at 800,000 rubles". .bp .nf .B SAXPY Computer Corporation .R .Ie "SAXPY" SAXPY-1M B. Friedlander, Director of Advanced Technology SAXPY Computer Corporation 255 San Geronimo Way Sunnyvale, California 94086 408-732-6700 .B Reconfigurable Systolic Architecture .R The machine has 5 basic components: .in 2
System Control Unit (DEC MicroVAX II)
Matrix Processing Unit (systolic processor)
Vector Processing Unit (Numerix MARS 432)
System Memory (64 MB to 2 GB)
SAXPY Interconnect (320 MB/sec transfer rate)
.in 0
Stand-alone computer .in 2
With capability to connect to the VAX family of equipment
Additional interface - High-speed mass storage subsystem (HMS) (100 MB/sec); connection to disks, tapes, VME, hyperchannel
Network Input/Output (NIO) - VAX Cluster interface
.in 0 .fi The matrix processing unit is a linear array of 32 systolically connected processors. .br MPU-to-system-memory transfer rate is 62.5 MWords/sec. .br 64-nsec cycle time. .br 32-bit floating-point arithmetic. .br Peak performance 1000 MFLOPS. Software on the System Control Unit: .in 2
VMS operating system
Fortran 77
Pascal
Ada
C
Matrix math subroutine libraries.
Access to the MPU is through subroutine calls. .in 0
Size: 95.2" wide x 78.2" high x 40.4" deep
Power: 15 KWatt
Air cooled
.sp Cost: $2 million base price .bp .nf .B SCS-40 .R .Ie "SCS-40" .Ie "Scientific Computer Systems Corporation" "SCS-40" Scientific Computer Systems Corporation 25195 S.W. Parkway Ave.
Wilsonville, OR 97070 503-682-7223 President: Bob Schuhmann Technical: Carl Haberland In Europe: Pierre Hassid Scientific Computer Systems Corp. 5 Villa Alexandrine 92100 Boulogne Billancourt France +33-1-48.25.73.47 .B Vector Register Architecture .R .fi Architecture: - register-to-register CRAY-compatible architecture (all CRAY software should run on this machine) - microcode-driven emulator to emulate the CRAY X-MP instruction set. - 64-bit scientific computer with pipelined, asynchronous functional units. - multiple pipelined functional units. - 45-nsec cycle time. - 5 vector instructions, 1 scalar instruction, and an address calculation can execute concurrently. - transfer rate from registers to functional units of up to 6 words/clock cycle (1.07 Gbytes/sec). - 256-word buffer between memory and instruction decode logic allows execution of one instruction per cycle (two cycles for a conditional branch). - supports flexible hardware chaining of functional units and memory references. Configuration: - 8-, 16-, 32-Mbyte field-upgradable memory configurations with 4-16 banks. - four ports to memory (like the CRAY X-MP, i.e., 2 vector loads and a store can be in progress at the same time). - designed to interface to a front end, either a VAX 11/780 or a VAX 11/750. (Interfaces planned for CRAY X-MP, IBM 4300 series, and NSC hyperchannel.) - 2-10 programmable I/O channels, each with a 16-Kbyte buffer and a transfer rate of 20 Mbytes/sec. Transfer rate of buffers to central memory is 1 word/clock period (178 Mbytes/sec). - DD-550 disk drive holds 550 Mbytes and can sustain a read/write data transfer rate of 10 Mbytes/sec with an average access time (seek plus latency) of 24 msec - a maximum of eight drives can be attached to each of the eight optional I/O channels. Other features: - Size: 55 x 55 x 60 inches - forced-air cooling. - Power consumption: 208 V 3-phase, 11-16.5 kVA - Weight: 1 ton Software: - software licensing agreement with CRAY.
- multiuser, multiprogramming OS supports interactive job execution. Languages: - Fortran 77: Fortran compilation expected at 20,000 to 40,000 lines per minute; Fortran vectorizing compiler; interactive debugger - Assembler Performance: - peak of 44 MFLOPS in 64-bit arithmetic - LINPACK timings around 1/4 the performance of a single-CPU X-MP. - matrix-vector operations (subroutine SMXPY), around 37.6 MFLOPS (simulated). Status: prototype available 11/85; first customer shipment 4/86 Cost: base system $500,000. The market target is to provide a CRAY-compatible general-purpose scientific computer that computes at 1/4 the speed of the CRAY X-MP but has the price of a super-mini, and thus the price/performance of a supercomputer. .bp .nf .B Sequent Balance 21000 .R .Ie "Sequent Balance 21000" Ron Parsons Sequent Computer Systems, Inc. 15450 SW Koll Parkway Beaverton, Oregon 97006-6063 503-626-5700 800-854-0428 Telex 296559 Casey Powell and Scott Gibson, co-founders. Technical: David Rodgers and Gary Fielland Chicago Office Karl von Spreckelsen District Manager 200 Tri-State International Drive Suite 110 Lincolnshire, IL 60015-1480 312-940-9299 In Europe: SEQUENT UK Chris Arnold Compass Peripheral Systems Bridge House Faraday Road Newbury Berkshire RG13 2DH ENGLAND (0635) 33933 Telex 846301 Incorporated in January 1983 (the company was originally named Sequel) .sp .B Parallel Bus Architecture .R .fi The machine has 2-30 NS32032 processors running at 10 MHz, each with a floating-point unit, memory management unit, and 8-Kbyte cache, sharing a global memory via a 32-bit-wide pipelined packet bus supporting multiple, overlapped memory and I/O transactions with a sustained data transfer rate of up to 27 Mbytes/sec. Memory: The machine has up to 28 Mbytes of physical memory, a 4-Mbyte I/O address space, and a 16-Mbyte virtual memory address space for each user process. Memory can be two-way interleaved, and there can be up to 4 memory controllers, each managing 2 to 8 Mbytes using 256K-RAM components.
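Because all processors share one global memory, user-level parallel programs on a machine of this class coordinate through shared variables and locks. A minimal Python sketch of that style (generic shared-memory code, not Sequent's actual parallel programming library):

```python
import threading

# All "processors" see one global memory; a lock provides the kind of
# synchronization primitive a shared-memory parallel library supplies.
counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:            # serialize the read-modify-write
            counter += 1

# Four workers stand in for four processors updating shared memory.
threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)   # 4000
```

Without the lock the read-modify-write of the shared counter could interleave and lose updates, which is exactly why the hardware and OS must supply cheap synchronization services.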
Processor and memory boards can go in any slot on the SB21000 bus. A Sequent-designed IC chip (SLIC: System Link and Interrupt Controller) resides on each board to manage interprocessor communication, synchronization, interrupts, diagnostics, and configuration control. There is an extensive diagnostic subsystem. Software: The operating system, called DYNIX, is a version of Berkeley 4.2bsd UNIX, enhanced for application-transparent multiprocessing and user-controlled parallel processing. Among the enhancements are a completely reentrant kernel, user-level shared memory, and synchronization services. All processors run a shared copy of the operating system. The configuration is symmetric, and load balancing is automatic and dynamic. Industry-standard I/O interfaces: MULTIBUS - has terminal multiplexor with controllers; Ethernet - at 10 Mbits/sec, with connection to a PC as a virtual disk through Ethernet; SCSI - at 2.5 Mbytes/sec, offering 5-1/4 in. disk drives (72 Mbytes formatted) and streamer tape drives with adaptor boards for the SCSI bus. Peripherals include a 1/2-in. tape drive and a 396-Mbyte asynchronous disk drive. The packaged system includes a 26-slot SB21000 bus backplane and a 21-slot MULTIBUS backplane and can take up to fifteen dual-processor boards. Other features: table-height packaging. Dimensions 30.5" x 23.25" x 28.625" (HWD); SB800 chassis 15.5" x 10.5" x 13.5"; MULTIBUS chassis 14.2" x 6.68" x 8.5". 11 amps max at 60 Hz, 115 VAC. Maximum configuration dissipates 1500 Watts. Software: supports ARPANET TCP/IP protocols plus all the networking facilities of UNIX 4.2. Support is also available for customer-provided application accelerators. Languages: Ada, C, Fortran 77, ANSI-standard Pascal, assembly language, and Lisp. Parallel programming library callable from any language. Extension to Fortran to allow shared common blocks. Performance: a fully populated machine is seen as 21 times a VAX 11/780 in power.
Designed as a high-throughput system, with support for parallel processing at user level. Status: Shipments began 12/84, and Sequent has manufactured more than 140 systems (as of Nov 86). Cost: $286,000 for the complete machine with all software, 10 processors, 8-Kbyte cache/processor, 16-Mbyte memory, and four Fujitsu Swallow 264-Mbyte disks (total of 1056 Mbytes); $140,000 for a 4-processor system; and $62,000 for a small 2-processor Balance 8000 system. .nf .bp .nf .B Silicon Graphics Inc .R .Ie "Silicon Graphics" Forest Baskett Silicon Graphics 2011 Stierlin Rd. Mountain View, CA 94045 415-960-1980 .B Very High Performance Workstation .R .fi Goals: heavy emphasis on interactive graphics for large computational problems. Markets: CAD/CAM/CAE, molecular modeling, image processing, and scientific/engineering research and development. .bp .nf .B Unisys Integrated Scientific Processor System ISP 1100/90 .R .Ie "Sperry ISP" .Ie "Unisys" Dave Deak Unisys Corporation Information Systems Group P.O. Box 500 Blue Bell, PA 19424 215-542-5216 .sp 2 .B Vector Parallel Architecture .R .sp .fi The ISP operates under the control of the host: a basic Integrated Scientific Processing system consists of a Unisys 1100/90 CPU with one I/O unit, the ISP, and a 4-Mword Scientific Processor Storage unit. .sp The peak performance of a single ISP is 133 MFLOPS in single precision (36-bit word) and 67 MFLOPS in double precision (72-bit word). Two ISPs may be connected to a single Unisys 1100/90 host system. .sp The high-speed memory that supports the ISP is capable of transferring data to an ISP at 133 Mwords/sec. The sustained performance is 20 to 30 MFLOPS in double precision and may double for single precision. .sp 2 .nf First delivery was June 1986.
.sp 30-nsec clock .sp 16 Mwords of memory .sp Peak performance, single precision (36 bits): 133 MFLOPS .sp Peak performance, double precision (72 bits): 67 MFLOPS .sp Cache-based: 4K words (36-bit words) - scalar processor part only, although the vector processor can address into the cache .sp Register-to-register architecture; the vector register set is 16 x 64 words. .sp The vector processor also has an embedded scalar processor. .sp Heterogeneous processing system - up to four scalar processors (IP) and two vector processors (ISP). .sp Vectorizing compiler UFTN .bp .nf .B ST-100 .R .Ie "Star ST-100" .Ie "ST-100" Star Technologies Inc. 515 Shaw Road Sterling, Virginia 22170 703-689-4400 Technical: Phil Cannon In Europe: Stephen D Bean Star Technologies Inc. Rosemount House Rosemount Avenue West Byfleet Surrey KT14 6NP ENGLAND 09323 5281 Telex 928764 STAR G .B Pipelined Floating Point Architecture .R .fi The ST-100 is an array processor, designed to attach to a more general-purpose computer or host via a bus. It has four independent programmable processors. A separate processor is dedicated to each of the following functions: external data flow, internal data flow, arithmetic processing, and synchronization. A hierarchical memory system consists of external storage devices, a large main memory, a high-speed random-access partitioned data cache, and a universal register set. The main memory consists of a 320-nsec memory, 8-way interleaved, composed of 64K dynamic RAMs with SECDED. It is expandable to 32 Mbytes in increments of 2048 Kbytes. All main memory is byte addressable (address range 4 Gbytes) and can be partitioned and protected at multiples of 16 Kbytes. Memory access time is 40 nsec (per 32-bit word). The random-access data cache memory consists of 6 banks of 8192 32-bit words for a total of 192 Kbytes. During each machine cycle, four cache references are permitted: three by the arithmetic processor and one by the storage/move processor.
Information flows from the host to main memory to cache to the functional units, and back from the functional units through cache and main memory to the host. .bp Other features: 40-nsec clock cycle; bipolar VLSI circuits with 1200 gates; 32-bit floating-point arithmetic; pipelined functional units; 2 adders, 2 multipliers, and a 480-nsec divide/square-root functional unit; ambient air cooled; size 56" x 33" x 67". A data interchange unit permits one of 16 operands to be selected for each arithmetic input register. During each machine cycle, three cache banks may be referenced, one loop-control operation computed, four arithmetic operations started, and a conditional branch executed. The 25-Mbyte I/O channel supports 7 device adapters; 12.5-Mbyte/sec data transfer rate. Software: Fortran-like control language (APCL), macro assembler, simulator/debugger and linker, library maintenance program; an applications library is available. .fi The Fortran compiler is implemented using the KAP precompiler from Kuck and Associates. Performance: 100 MFLOPS peak in single-precision (32-bit) arithmetic for convolution and matrix operations. Cost: $265,000 base price. .nf .bp .nf .B Stellar .R .Ie "Stellar" Wallace E. Smith, VP Sales, Stellar 100 Wells Ave. Newton, MA 02159 617-964-1000 .B Very High Performance Workstation .R Company founded by John Poduska (formerly of Apollo). Goals: heavy emphasis on interactive graphics for large computational problems. Price: $75K - $125K Availability: 2nd half of 1987 Markets: CAD/CAM/CAE, molecular modeling, image processing, and scientific/engineering research and development .bp .nf .B Vitesse Electronics .R .Ie "Vitesse" 741 Calle Plano Camarillo, CA 93011 805-388-3700 .B Parallel Architecture .R .fi Plans are to build a scalar machine with 1 Gbyte of memory and a 40-nsec cycle time. The machine will be made of CMOS. It is to support hardware optimization for high run-time performance. Configuration: The first machine is to have up to 8 processors. The connectivity allows for a large number of processors, in the thousands.
It can be used as a co-processor on a VAX. Software: 32- and 64-bit floating-point arithmetic supporting the IEEE standard. Languages: Fortran, Pascal, and C. Performance: 25 to 150 MFLOPS (the uniprocessor range of performance, the result of optional hardware boards for each processor). Status: the company was started in July 1984 and expects to produce a machine by late 1986. A GaAs version is planned in a couple of years.