From KARP@IBM.COM Tue Mar 22 16:47:15 1988
Return-Path:
Received: from anl-mcs.ARPA by antares.mcs.anl (3.2/SMI-3.2) id AA07225; Tue, 22 Mar 88 16:47:09 CST
Received: from IBM.COM (ibm.com.ARPA) by anl-mcs.ARPA (4.12/4.9) id AA01909; Tue, 22 Mar 88 16:45:41 cst
Date: Tue, 22 Mar 88 16:42:48 cst
From: KARP@IBM.COM
Message-Id: <8803222245.AA01909@anl-mcs.ARPA>
Forward: COMPMAIL
To: soft.one@anl-mcs.ARPA
Status: R

Galen,

Here is the draft of the paper. Thanks for clarifying some of my fuzzy thinking. I have added comments in several places. I know you don't like to use all caps for acronyms, but sometimes that is the standard notation. I have changed those that I think absolutely should be in caps.

Three places to send the press release would be

Science - you can find the address

Physics Today
American Institute of Physics
500 Sunnyside Blvd.
Woodbury, NY 11797

Computers in Physics - same as above

(Ken and Jack, please send your nominations directly to Galen.)

In addition, there are a number of society-published magazines. Check with your resident chemists, biologists, engineers, etc. to get their favorites.

Addresses are:

Jack Dongarra
Mathematics and Computer Science Division
Argonne National Laboratory
9700 South Cass Avenue
Argonne, Illinois 60439

Alan Karp
IBM Systems Center
18100 Frederick Pike
Gaithersburg, Maryland 20879

-----------------------------------------------------------------------

Alan,

You did a great job on this piece. We'll run it at full length. I've added some details and tried to expand some areas (especially in the technical discussions, so readers unfamiliar with the problems at least get an idea of what the problems were). Please check those areas (I've prefaced areas where I had questions with comments enclosed in vertical bars (|) and addressed "au:"). Please enclose comments to me in the same bars and address them "ed:". Call me at (714) 821-8380 if you have any questions. If I'm off-site, call me at (213) 438-5088 (I'll be moving between the two sites all week next week).

When I get this back from you, I'll write up the press release and mail it out. Add any publications (include addresses or at least cities so I can get their addresses) you think should get a copy, especially specialized journals or noncomputing publications. Also, provide me your and Dongarra's addresses (I have Kennedy's, since he once wrote an article for us) so I can make sure you get copies of the May issue this will appear in.

--Galen

==========================================

AWARDS.TYP
Gordon Bell Awards
IEEE Software
May 1988
22225

|au: verify and supply missing information.|
|ed: response given to each question|

[ts4][ts7][ir]
[io]Dongarra, Karp, and Kennedy judged the 1987 Gordon Bell Awards. Dongarra is a numerical analyst at the Mathematics and Computer Science Division at Argonne National Laboratory; Karp, who chaired the judging committee, is a staff member at the IBM Palo Alto Scientific Center currently on assignment at the IBM Washington Systems Center in Gaithersburg, Maryland; Kennedy is a professor of computer science at Rice University.[ix]

[hs4]1987 Gordon Bell Awards: Winners achieved speedup of 400
[bs2]Jack Dongarra, Alan Karp, and Ken Kennedy

|au: Shouldn't the first sentence read " ... application of parallel processing to ... "? Also, verify total number of entries -- we describe a winner, second place, honorable mention, and special award (not eligible for the award), plus four others.
That makes seven, unless we're not including the NCAR entry (which Bell entered, not NCAR).|
|ed: I would prefer to leave the first sentence as it is since next year's rules have two categories that don't mention parallelism explicitly. The second sentence takes care of the parallelism issue. I did not count the NCAR entry; it's a special award Bell gave, not an official entry. I have reworded the intro to deemphasize this year's rules.|

[ts4] The Gordon Bell Awards recognize outstanding achievement in the application of supercomputers to scientific and engineering problems. In 1987, these awards were for the largest speedup on multiple-instruction, multiple-data computers in two categories of machines: general- and special-purpose. Six of the nine entries met the rules for the competition.

Gordon Bell, vice president of engineering at Ardent Computer in Sunnyvale, Calif., will sponsor two $1,000 awards each year for 10 years to promote practical parallel-processing research. "Utility is the key," he said at this year's awards presentation, held March 1 in San Francisco at the Compcon conference. He made the offer to sponsor the awards during an interview with [io]IEEE Software[ix] conducted while Bell managed the Computer and Information Science Directorate at the National Science Foundation ("Parallel, Distributed Processing Lead NSF Software Research Directions," Soft News, July, pp. 102-104). [io]IEEE Software[ix] administers the awards and judging.

|au: fill in the missing information in the last sentence.|
|ed: done|

The winning entry in the general-purpose computer category was submitted by Robert Benner, John Gustafson, and Gary Montry of the Parallel Processing Division of Sandia National Laboratory in Albuquerque, N.M. They used a 1,024-processor NCUBE to run three production applications with speedups ranging from 400 to more than 600 times. The judges had expected the winning speedup to be about 50. The applications were a beam stress analysis, a surface-wave simulation, and an unstable fluid-flow model. Bell presented a $1,000 check to the Sandia team, which has donated the award to the Association for Retarded Citizens of Albuquerque.

|au: please expand on this point, which you mentioned at Compcon.|
|ed: done|

The type of work that won the prize is not normally undertaken at Sandia. Montry told the judges that the possibility of winning the Bell Award and Karp's challenge (see sidebar) were key factors in convincing his management to expend resources on this research.

There were no acceptable entries in the special-purpose category, so Bell decided to divide the money among several groups, although some of them did not meet the criteria specified in the rules. "It's my money, and I can do whatever I want with it," he said.

Second place in the competition went to Marina Chen of Yale University, Erik Benedictus of AT&T Bell Laboratories, Geoffrey Fox of the California Institute of Technology, Jingke Li of Yale, and David Walker of Caltech for work done on the Caltech and NCUBE hypercubes. They submitted results from three programs: a computational kernel with a speedup of 98, a quantum-chromodynamics calculation with a speedup of 458, and a circuit simulation with a speedup of 39. Bell decided to give $300 to this group as a special award.

|au: Bell's secretary spelled the name "Scherrin."
Did you get your spelling from NCAR?|
|ed: Bob was quite certain that I got the spelling of his name right.|

Bell gave one special award for $500 to Robert Chervin of the National Center for Atmospheric Research in Boulder, Colo., for a global ocean-climate model running at 450 Mflops on all four processors of a Cray X-MP/48. This work is impressive in two respects: (1) It is a production code running in parallel and (2) it achieves a significant fraction of the peak performance of the machine. The judges decided this entry deserved an honorable mention.

Although the rules for the competition limited consideration to multiple-instruction, multiple-data machines, both the judges and Bell thought that the submission by Stavros Zenios of the Wharton School at the University of Pennsylvania was worthy of recognition. Zenios used a 16,384-processor Connection Machine to solve nonlinear network-optimization problems. Although it is impossible to measure parallel speedup on such a machine, Zenios was able to solve a problem in 1.5 seconds that ran longer than 92 seconds on one processor of an IBM 3090-600. Bell decided to give Zenios a special award of $200.

[bo]First place.[bx] The Sandia researchers did their work on an NCUBE hypercube multicomputer with 1,024 processors. Each processor has 512 Kbytes of private memory and is directly connected to 10 other processors. Data is shared among processors by explicitly sending messages from one processor to another. Each processor is capable of about 80 Kflops.

One problem the Sandia team addressed is measuring speedup. Speedup is the time it takes to solve the problem on one processor divided by the time it takes to solve the problem on all the processors. But there are two difficulties with this definition. To get a reliable measurement, the program must run for at least a few minutes. However, a problem that runs one minute on 1,024 processors would take tens of hours on one processor. The Sandia team solved this problem by running a program that takes at least 15 seconds on 1,024 processors and about five hours on one processor.

|au: transmission errors obscured the original. Please supply again. Also, you might expand "compute-bound" and "communication-bound" a bit. I think I know what you mean, but being explicit would make it easier for the reader.|
|ed: done|

The second problem is more severe. Although the total memory of a multicomputer is large, each processor has access to only a modest amount of the total. Thus, the largest problem that fits on one processor uses only a small part of the full configuration. A key factor in the performance of a multicomputer is the ratio of the time spent computing to the time spent communicating; the machine is only doing useful work when it is computing. Large problems typically have a larger fraction of computation than do small problems. Therefore, a problem small enough to fit on one processor will be compute-bound on a small number of processors but communication-bound on the full system.

|au: verify my parenthetical.|
|ed: done|

The Sandia team solved this problem by following two curves (which they called the research line and application line) in the problem-size versus number-of-processors plane. The research line runs the largest problem that fits on one processor on different numbers of processors. The numbers submitted for the competition follow this research line. These speedup numbers represent a lower bound on what is achievable because the problem is very small relative to the full system. The Sandia team also presented timings for the application line, where problem size increases as more processors are used. The application line is more representative of how multicomputers are used in production work. By measuring the idle time of each processor, they could estimate the speedup they would get if the large problem fit on one processor. They have validated this model and have found the predictions are correct to within a few percent.
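To make the two measures concrete, here is a minimal sketch of how a research-line speedup and an application-line (scaled) speedup might be computed. The timings and idle fractions are invented, and the simple busy-fraction model is an illustrative assumption, not the Sandia team's actual bookkeeping:

    # Illustrative only: the numbers below are hypothetical.

    def research_line_speedup(t_one, t_many):
        """Classical speedup: time on one processor divided by time on all
        processors, for the largest problem that fits on one node."""
        return t_one / t_many

    def scaled_speedup(num_procs, busy_fraction):
        """Application-line estimate: the problem grows with the machine, so
        the one-processor time cannot be measured directly.  If each
        processor does useful work (rather than sitting idle or
        communicating) a fraction f of the run, the machine delivers
        roughly num_procs * f processors' worth of work."""
        return num_procs * busy_fraction

    print(research_line_speedup(18000.0, 40.0))   # hypothetical: 450.0
    print(scaled_speedup(1024, 0.997))            # hypothetical: about 1,021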
All three programs submitted by the Sandia team showed a speedup of more than 400. While one such result might be considered a special case, the weight of evidence is significant. The judges were particularly impressed with these results because they expected the winning speedup to be 50 or less.

|au: what are dot products?|
|ed: explained|

One program, the beam strain-analysis problem, looked at the two-dimensional deflection of a beam fixed at one end. The program uses a finite-element formulation with a matrix-free, preconditioned, conjugate-gradient algorithm. This algorithm is frequently used on supercomputers when memory constraints prevent the use of more efficient methods. The most time-consuming part of this approach is the accumulation of dot products in which two vectors are multiplied element by element and the resulting products added to produce a scalar.

The beam is divided into 2,048 bilinear finite elements -- the largest problem that will fit on one node. Each processor is assigned a rectangular subdomain. Synchronization occurs at two points during each iteration. First, each processor must exchange boundary information with the four processors sharing edges with it. Second, five dot products must be computed and the results distributed to each processor. Following the research line, the Sandia team found a speedup of more than 452 for this code.

|ed: bad grammar changed.|

Equally interesting was the application line. Running a problem with a 64[xm]32 grid on one processor takes 1,614 seconds. A problem with 64[xm]32 finite elements per processor on 1,024 processors (2 million elements) takes 1,619 seconds. The nearly constant execution time as the problem size increases indicates that their scaled speedup of 1,021 is reasonably accurate.

The second application submitted by the Sandia team tracks two-dimensional waves as they move past a set of barriers. The program can handle any configuration of barriers and wave velocities that depend nonlinearly on position and wave state.

|au: "discretized"? How about "made discrete"?|
|ed: discretized is the proper term. If anything needs explaining it is "stencil".|

The wave equation is discretized using a five-point stencil in space and an explicit time-step scheme. The domain is divided into rectangular blocks, and boundary information is passed to adjoining processors on each time step. To reduce communications overhead, the Sandia team used a special quadruple buffering technique.

Because this program does relatively few computations per communication, the researchers used some assembly language to improve the speedup. First, they wrote a routine to transfer data. This routine helped most when noncontiguous data was to be transferred. Second, they wrote the inner loop of the update in assembly language to reduce the relative importance of loop start-up.
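The flavor of such a stencil update can be seen in the following serial sketch of one explicit time step for a two-dimensional wave equation. The grid size, wave speed, and coefficients are illustrative; in the Sandia code each processor holds one rectangular block of the grid and exchanges its boundary rows and columns with neighboring processors by message passing before every step:

    import numpy as np

    n, c, h, dt = 64, 1.0, 1.0, 0.5          # grid points, wave speed, spacings
    u_old = np.zeros((n, n))                 # solution at time step k-1
    u = np.zeros((n, n))                     # solution at time step k
    u[n // 2, n // 2] = 1.0                  # an initial disturbance

    def step(u, u_old):
        """Advance the interior points one time step; edges stay fixed here
        (on a multicomputer they would come from neighboring processors)."""
        lap = (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
               - 4.0 * u[1:-1, 1:-1])        # five-point stencil
        u_new = u.copy()
        u_new[1:-1, 1:-1] = (2.0 * u[1:-1, 1:-1] - u_old[1:-1, 1:-1]
                             + (c * dt / h) ** 2 * lap)
        return u_new, u

    for _ in range(100):                     # explicit time-step loop
        u, u_old = step(u, u_old)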
Following the research line, the Sandia team reported a speedup of 637, while the scaled results along the application line indicated a speedup of 1,020. These scaled results appear to be reliable because a 4,096-point problem run on one node took 12,780 seconds while a 4,096-point-per-node problem (4 million points) took only 12,824 seconds.

|au: "discretization"? Is there a less jargony word or phrase?|
|ed: If this were going into the Style section of the newspaper, I would agree to change "discretization". In Software, it should not be a problem.|

The third problem submitted by the Sandia team was a computational fluid-dynamics program that uses a flux-corrected transport algorithm. This approach uses a low-order spatial discretization where the flow is smooth and second- or fourth-order schemes in regions of large gradients. The time behavior is followed explicitly. The test problem follows the development of a Kelvin-Helmholtz instability in a shear flow.

This code is nearly perfectly parallel. As in the previous problems, each processor must share boundary information with its neighbors on each time step. In addition, each processor determines the allowed time step based on the grid points in its subdomain. These local time steps must be combined to produce a single step size, and this value is broadcast to all processors.

Speedup along the research line reached 519, while the scaled result along the application line is 1,016. A 32[xm]32 grid run on one processor took 28,039 seconds; a 32[xm]32 grid for each of 1,024 processors took 28,258 seconds. As before, the overhead of running in parallel is small.

|au: what is LU decomposition and why will it not be described here?|
|ed: done|

[bo]Second place.[bx] Second place went to the group from Yale, Bell Labs, and Caltech. This group submitted three programs running on different hypercubes. One program was a computational kernel -- an LU decomposition used to solve a system of linear equations -- and will not be described here since kernels were explicitly excluded from consideration in the rules.

|au: "discretized" again. Also, describe the lattice-gauge theories for in-the-dark readers, as I did later for the El Nino and greenhouse effects.|
|ed: done|

Another calculation submitted was the computation of the potential energy of a quark and an antiquark using quantum-chromodynamics theory to model the quark behavior. The computational solution to the problem uses lattice-gauge theory. A gauge theory is one that is defined in terms of a metric, a measure of length that is independent of the observer's frame of reference. With a gauge theory, a problem, such as computing the mass of the proton, can be reduced to a multidimensional integral with each dimension of the integral representing a different path a particle can take through space-time. The numerical solution to a problem is obtained by discretizing space-time on a four-dimensional grid, the lattice of the lattice-gauge theories. Each lattice point has a set of variables representing the quarks associated with it; the links between points contain information on gluons (which bind quarks).

|au: Check my explanation of the Monte Carlo procedure.|
|ed: OK|

A step of the Monte Carlo procedure to evaluate multidimensional integrals involves randomly changing the variables on one link. This change modifies the corresponding lattice variables. After many changes have been made, it is possible to determine several physical quantities from the statistical fluctuations of these variables.
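The shape of such a calculation can be illustrated with a deliberately simplified sketch. This is not the Yale/Bell Labs/Caltech code -- there the variables are 3[xm]3 complex matrices on a four-dimensional lattice -- but a generic Metropolis-style accept/reject sweep over a toy one-dimensional lattice of scalar link variables, with a made-up action:

    import math
    import random

    random.seed(1)
    links = [0.0] * 100                      # one scalar variable per link
    beta, step_size = 2.0, 0.5

    def local_action(x):
        return beta * x * x                  # illustrative action term only

    def sweep(links):
        """One Monte Carlo sweep: propose a random change on each link and
        accept it with the standard Metropolis probability."""
        for i in range(len(links)):
            old = links[i]
            new = old + random.uniform(-step_size, step_size)
            delta = local_action(new) - local_action(old)
            if delta <= 0.0 or random.random() < math.exp(-delta):
                links[i] = new               # accept the proposed change

    samples = []
    for it in range(2000):
        sweep(links)
        if it > 500:                         # discard early, unequilibrated sweeps
            samples.append(sum(x * x for x in links) / len(links))

    # Physical quantities are estimated from the statistics of the samples.
    print(sum(samples) / len(samples))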
|au: for the last sentence, sent between or among?|
|ed: between, see change|

Each processor is assigned a part of the lattice. At each step, all links that are not connected to a common lattice point can be updated simultaneously. Because each processor runs at its own rate, the only way to assure that this condition is met is to synchronize at each step. Furthermore, if the link data is held by one processor and the lattice point variables by another processor, a communications step is needed. When run on many processors, data on most links is sent between a pair of processors at least once on each sweep over the lattice.

A further synchronization is needed because of an artifact of the implementation. Each update involves multiplying 3[xm]3 complex matrices. To save memory and time, these calculations are carried out using 16-bit integers instead of floating-point numbers. The round-off errors associated with the multiplications are large enough to destroy the accuracy of the calculation unless a correction is applied. This correction is applied after every second sweep through the lattice.

|au: "discretized" again. Also, check my description of Coulomb law.|
|ed: good, but make it parenthetical so it flows better.|

The Yale/Bell Labs/Caltech researchers submitted a problem discretized on a 24[xm]24[xm]24[xm]48-point lattice. They calculated how the potential energy of a quark/antiquark pair varies with their separation. They showed that, when the quarks are close, the potential energy follows Coulomb's Law, as would a proton pair or electron pair. (Coulomb's Law says the force of repulsion between two like charges of electricity concentrated at two points is proportional to the product of their magnitudes and inversely proportional to the square of the distance between them.) At larger separations, the potential energy continues to rise with the separation. Such an energy law implies that it would take an infinite amount of energy to separate these particles. A speedup of 458 was obtained on a 512-processor Caltech hypercube.

Their second application was a circuit simulation to predict the time behavior of a circuit in response to a set of inputs. At each time step, each circuit element takes its input voltages and computes a set of output voltages. The problem is made parallel by dividing the circuit elements among the processors. The difficult part -- making sure that each circuit element uses the correct input voltages -- is handled with queues.

The calculation cannot proceed completely asynchronously because some circuit elements require more computation than others to update their output voltages. During an arbitrarily long run, the input queues of the slowest elements would certainly overflow. One way to handle this problem is to run the simulation in segments of some fixed number of time steps. At the end of each segment, the processors synchronize (without passing any data) and then continue. Using such a programming style, the Yale/Bell Labs/Caltech group reported a speedup of 39 on a 127-processor NCUBE hypercube.

|au: "Chervin" or "Scherrin"? Also, verify Semter.|
|ed: Chervin, Semtner|

[bo]Honorable mention.[bx] Robert Chervin of the National Center for Atmospheric Research won an honorable mention for the global ocean-climate model he parallelized with the help of Albert Semtner of the Naval Postgraduate School in Monterey, Calif. This program is used to study such large-scale ocean phenomena as El Ni[at]no episodes in the Pacific Basin.
(An El Ni[at]no is an unusual change in sea-surface temperatures that shifts cloud-cover and rainfall patterns across the Pacific Ocean, resulting in drought, flooding, and altered fish migration.) The program is also used to produce input data to the climate models run to predict such things as the long-term consequences of the greenhouse effect (the rise in global air temperature caused by the retention of solar energy by carbon dioxide, which is added to the atmosphere by industrial processes).

|au: "discretization" again. Also, what is a leapfrog-explicit time step?|
|ed: done|

The ocean-model program uses a standard, second-order, finite-difference discretization in three spatial dimensions and a leapfrog-explicit time-step procedure. In a leapfrog procedure, the change in the unknowns in going from time n-1 to time n+1 depends only on the solution at time n. This method has higher accuracy than a single-step method but may become unstable. The leapfrog method also requires more memory than single-step methods since space for the solution is needed at three time levels (at n to compute the change in the variables, at n-1 to which that change is added to produce the new values, and at n+1, which holds the desired result). These disadvantages are more than compensated for by the larger time steps allowed by the higher accuracy.

The problem for which Chervin was recognized used a grid made up of a point every 0.5 degree in latitude and longitude and 20 points in height for a total of 4 million grid points. Four variables are computed at each grid point. Because the time-stepping procedure requires data from three consecutive times, the problem is far too large to be contained in the main memory of a Cray. The program uses 6 Mwords of main memory and 60 Mwords on the Cray SSD, a large, relatively slow memory used as a fast I/O device. On each time step, more than 100 Mwords are transferred to and from the SSD.

In spite of the large amount of I/O, this program shows a speedup of 3.74 on a four-processor Cray X-MP/48 using Cray microtasking. For the runs cited in the awards entry, Chervin used 280 tasks. The sustained processing rate of 450 Mflops is more than half the theoretical peak of 880 Mflops.

This speedup is considerably higher than others have obtained. One of the limiting factors for parallel performance on the Cray is usually memory-bank conflicts that occur when one task accesses a memory bank needed by another task. Most researchers believe that nothing can be done about this problem, but Chervin was able to eliminate virtually all such bank conflicts. The trick is to set up the data so each task uses a distinct set of memory banks. It may be necessary to do some extra data movement, as Chervin does, but the gain in memory access rates clearly exceeds the cost.

|au: "discretization" again. Also, explain why the program's being nearly 100-percent parallel relates to the first part of the last sentence. Is it that this high parallelism prohibits speedup comparable to the ocean model's or that this high parallelism is a benefit worth the lower speedup?|
|ed: There is no connection so I have made two sentences.|

Chervin's success with this program is not an isolated event. He has also achieved a speedup of 3.7 on a climate model that uses a spectral method for the horizontal spatial discretization. Here, too, he was able to eliminate most of the intertask memory bank conflicts by carefully distributing his data. This spectral algorithm is not vectorizable, so the raw speed is not as impressive as the ocean model's. On the other hand, this program is 99.5-percent parallel.
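A minimal sketch of the leapfrog-explicit stepping described above, applied to a simple one-dimensional advection equation rather than the ocean model, shows the three time levels that must be kept in memory (the grid and coefficients are illustrative):

    import numpy as np

    n, c, dx, dt = 200, 1.0, 1.0, 0.5
    x = np.arange(n) * dx
    u_prev = np.exp(-0.01 * (x - 50.0) ** 2)      # solution at step n-1
    u_now = u_prev - (c * dt / (2 * dx)) * (
        np.roll(u_prev, -1) - np.roll(u_prev, 1)  # one ordinary step to start
    )

    for _ in range(400):
        # Leapfrog: the change from step n-1 to n+1 depends only on step n.
        u_next = u_prev - (c * dt / dx) * (np.roll(u_now, -1) - np.roll(u_now, 1))
        u_prev, u_now = u_now, u_next             # rotate the three time levels

    print(float(u_now.max()))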
[bo]Special award.[bx] A special award was given to a program that did not meet the eligibility criteria. However, the judges and Bell were so impressed with the results that they awarded a special prize to Stavros Zenios of the Wharton School at the University of Pennsylvania. Zenios solved several problems in nonlinear network optimization on a single-instruction, multiple-data CM-1 Connection Machine.

Nonlinear network optimization involves minimizing a strictly convex, continuously differentiable, real-valued function subject to a set of linear constraints. These problems are difficult because the spatial dimension is high, making the resulting matrices extremely large. Fortunately, these matrices are also extremely sparse. This means that if one variable is changed, only a small subset of the equations must be reevaluated to account for the change. A standard approach is to change one variable at a time (coordinate-wise minimization) until a global minimum is achieved.

You can think of the nonlinear system as a graph. Each node in the graph represents an unknown. An arc joins two nodes, say [io]i[ix] and [io]j[ix], if variable [io]j[ix] appears in equation [io]i[ix]. When variable [io]j[ix] is changed, only equations explicitly containing it must be updated.

Zenios used this fact to map the problem onto the CM-1. Each arc in the graph is associated with two processors. On each iteration, all variables are updated simultaneously. Next, each equation is evaluated by combining the results of all processors associated with each unknown using a special function on the CM-1 called a segmented-plus-scan, which forms a set of running sums. Finally, the results of this scan are distributed to all the nodes.

This procedure appears to be inefficient in that it uses two processors per unknown and does some computations twice. However, the resulting reduction in communication overhead makes it run very fast. About 35 percent of the time is spent communicating. On three different problems, the CM-1 needed one, eight, and 29 seconds, while an IBM 3090 uniprocessor needed 93, 109, and 83 seconds for the same jobs. The CM-2 Connection Machine with floating-point hardware should be even faster.

[bo]Other entries.[bx] Four other entries were judged eligible for consideration. Although they did not win, they represent work worthy of commendation.

Paul Fischer of the Massachusetts Institute of Technology submitted measurements of NEKTON, a commercial fluid-dynamics and heat-flow package, running on an Intel iPSC-VX hypercube. The package uses the so-called "p" finite-element method, sometimes called the spectral-element method. Convergence is obtained by increasing the order of approximation on each finite element instead of increasing the number of finite elements as done in the "h" finite-element methods. The resulting linear systems are solved with a diagonally preconditioned conjugate-gradient algorithm.

Fischer's results point out one of the problems with the contest rules. He got a speedup of 7.2 running with 32 vector processors. When he turned off vector processing, he achieved a speedup of 16. However, the execution time of the nonvector job was almost 100 times longer than that of the vector job. The judges have revised the rules to take this into account; the revised rules appear in the [bo]box on p. XX[bx].
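The arithmetic behind this anomaly is worth spelling out. The timings below are invented simply to mirror the situation Fischer described, not his actual measurements:

    def speedup(t_one, t_many):
        return t_one / t_many

    # Vector mode: fast in absolute terms, but modest speedup.
    t1_vector, t32_vector = 720.0, 100.0           # hypothetical seconds
    # Nonvector mode: every processor is much slower, so speedup looks better.
    t1_scalar, t32_scalar = 160000.0, 10000.0      # hypothetical seconds

    print(speedup(t1_vector, t32_vector))          # 7.2
    print(speedup(t1_scalar, t32_scalar))          # 16.0
    # The "better" speedup belongs to the job that takes 100 times longer to
    # finish, which is why the revised rules weigh absolute performance and
    # price/performance rather than speedup alone.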
Richard Roloff of the Center for Supercomputing Research and Development at the University of Illinois at Urbana-Champaign parallelized a program used heavily by the university's Chemical Engineering Department. It models turbulent, incompressible flow with a pseudospectral algorithm. The compiler did most of the parallelization by vectorizing inner loops and parallelizing outer loops. In addition, compiler directives were used so subroutine calls could be executed in parallel. Roloff achieved a speedup of 6.5 on an eight-processor Alliant FX/8.

Richard Hessel of Alliant Computer Systems submitted measurements of ANSYS, a general-purpose finite-element package, on an Alliant FX/8. He ran a standard benchmark data set called S4. This linear, static-analysis problem requires 4,000 elements and has more than 24,000 unknowns. Hessel achieved a speedup of 5.0 on the eight-processor Alliant. Most of the parallelism was extracted by the compiler, with the exception of one routine. He recoded the rank-[io]n[ix] update routine to use a block algorithm better suited to the Alliant architecture. As is often true, optimizing for parallelism leads to other improvements. Hessel reports that the modified program runs 1.5 times faster on one processor than the best previous code.

David George and Frederica Darema-Rogers of IBM Research in Hawthorne, N.Y., submitted a parallelized version of AIR3D, a three-dimensional fluid code from the National Aeronautics and Space Administration. This program models flow past a complex surface, including boundary layers. It is heavily used to model the space shuttle. The computationally intensive part is the solution of large linear systems, which are solved using an alternating-direction-implicit scheme on planes. On each time step, the system is decoupled into a set of two-dimensional problems by assuming the unknowns in the [io]x[ix] direction are known and solving the resulting set of two-dimensional problems. Next, the unknowns in [io]y[ix] are fixed, and a set of two-dimensional problems solved. Finally, the [io]z[ix] unknowns are fixed. This procedure is repeated for each iteration of the nonlinear solution step.

The IBM group used EPEX, an experimental system for use within IBM, to parallelize at the loop level. On each of the three steps of the iteration, the solutions of the two-dimensional problems are independent and were run in parallel. They ran on an IBM 3090 with six vector processors and achieved a speedup of 4.9.

[hs5]1988 rules

|au: After talking with Bell Thursday, I added "two of" to the second sentence. He intends to make two $1000 awards each year from the best submissions in two of the three categories. I guess the top entry (assuming there is one) in the third category would be an honorable mention.|

[ts4][ll19] The Gordon Bell Awards recognize achievements in large-scale scientific computing. The 1988 Awards will be given in two of three categories:

1.[ns]Performance: The entrant will be expected to convince the judges that the submitted program is running faster than any other comparable engineering or scientific application. Suitable evidence will be the megaflop rate based on actual operation counts, or the solution of the same problem with a properly tuned code on a machine of known performance such as a Cray X-MP. If neither of these measurements can be made, the submitter should document the performance claims as well as possible.
2.[ns]Price/performance: The entrant must show that the performance of the application divided by the cost of the system is better than that of any other entry. Performance will be evaluated as for the performance prize. For the purposes of this contest, cost will be the list price of the computational engine (CPUs, including enough memory to run the problem). Peripherals and software need not be included in the price.

3.[ns]Compiler parallelization: The compiler/application that generates the best speedup will be the winner. Speedup will be measured by dividing the execution time of the program compiled without automatic parallelization by the execution time with automatic parallelization. A third run may be submitted that uses compiler directives to improve parallelization. The judges will weight these two parallel runs to determine the winner.

There are some general conditions:

1.[ns]The submitted program must have utility -- it must solve a problem that is considered a routine production run, such as making daily weather predictions or solving an important engineering or scientific problem. It should not be a contrived or experimental problem that is intended to show speedup.

2.[ns]Entrants in the price/performance category must submit the output from the Linpack 100[xm]100 benchmark showing a speed of at least 5 Mflops.

3.[ns]In all cases, the burden of proof is on the contestants. The judges will make an honest effort to compare results of different programs running on different machines, but will depend primarily on the submitted material.

4.[ns]Contestants should describe their entries in a three- to four-page executive summary and submit this to Ted Lewis, Editor-in-Chief, [io]IEEE Software[ix], c/o Computer Science Dept., Oregon State University, Corvallis, OR 97331. The deadline is December 31, 1988. If they need additional information, the judges will contact the contestants.

|au: please verify and expand this as needed.|
|ed: done|

[hs5]Sandia team wins Karp challenge

[ts4][ll25] In 1985, Alan Karp of IBM issued a challenge in a number of places, including the NANET (Numerical Analysis Network), SIGNUM Newsletter, and ACM Forum, saying he would pay $100 to the first person to demonstrate a speedup of at least 200 on a general-purpose, multiple-instruction, multiple-data computer used for scientific applications. The offer was good through Dec. 31, 1995.

In March, Karp awarded the $100 to the Sandia National Laboratory team that won the 1987 Gordon Bell Award for parallel-processing speedup. They donated the award to the Association for Retarded Citizens in Albuquerque, N.M.

-------