.. To process try: .. refer -sA+D -p refer paper | pic | dtbl | eqn | psroff -ms .ds TT "\v'-.55m'T\v'.55m' .ds TH "\h'-.75m'\v'-.50m'^\v'.50m' .ds TD "\h'-.75m'\v'-.45m'~\v'.45m' .B .nr BT ''-%-'' .he '''' .pl 10.5i .de fO 'bp .. .wh -.5i fO .LP .nr LL 5.8i .ll 5.8i .nr LT 5.8i .lt 5.8i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 SCHEDULE: Tools for Developing and Analyzing Parallel Fortran Programs .ps 11 .sp 3 Jack J. Dongarra and Danny C. Sorensen .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 86 .sp .7i November, 1986 .pn 1 .he ''-%-'' .he '''' .in .bp .ft 3 .ps 11 .bp .LP .ds HT "\h'-.65m'\v'-.40m'^\v'.40m' .ds TD "\h'-.65m'\v'-.40m'~\v'.40m' .ds CT "\(mu\h'-.66m'\(ci .ds NI "\(mo\h'-.65m'/ .nr EP 1 .nr VS 14 .EQ gsize 11 delim @@ define times '\(mu' define ctimes '\(mu back 80 \(ci ^ ' define bl '\0' .EN .TL SCHEDULE: Tools for Developing and Analyzing Parallel Fortran Programs * .AU .ps 11 .in 0 J. J. Dongarra and D. C. Sorensen .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' 9700 South Cass Avenue\h'.20i' Argonne, Illinois 60439-4844\h'.20i' .sp and\h'.20i' .sp Center for Supercomputing Research and Development\h'.20i' University of Illinois at Urbana-Champaign\h'.20i' 305 Talbot Laboratory\h'.20i' 104 South Wright Street\h'.20i' Urbana, Illinois 61801-2932\h'.20i' .nr PS 12 .nr VS 16 .nr PD 0.5v .FS * Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under Contracts W-31-109-Eng-38, DE-AC05-840R21400, and DE-FG02-85ER25001. .FE .NH Introduction .PP Many new parallel computers are now emerging as commercial products .[[ dongarra duff .]]. Exploitation of the parallel capabilities requires either extensions to an existing language such as Fortran or development of an entirely new language. A number of activities .[[ vanrosendale .], .[ mcgraw sisal .]] are under way to develop new languages that promise to provide the ability to exploit parallelism without the considerable effort that may be required in using an inherently serial language that has been extended for parallelism. We applaud such activities and expect they will offer a true solution to the software dilemma in the future. However, in the short term we feel there is a need to confront some of the software issues, with particular emphasis placed on transportability and use of existing software. .PP Our interests lie mainly with mathematical software typically associated with scientific computations. Therefore, we concentrate here on using the Fortran language. Each vendor of a parallel machine designed primarily for numerical calculations has provided its own parallel extensions to Fortran. These extensions have taken many forms already and are usually dictated by the underlying hardware and by the capabilities that the vendor wishes to supply the user. This has led to widely different extensions ranging from the ability to synchronize on every assignment of a variable with a full empty property .[[ jordan .]] to attempts at automatically detecting loop-based parallelism with a preprocessing compiler aided by user directives .[[ alliant .]]. 
The act of getting a parallel process executing on a physical processor ranges
from a simple "create" statement
.[[
jordan
.]]
which imposes the overhead of a subroutine call, to "tskstart"
.[[
cray multitasking
.]]
which imposes an overhead on the order of @10 sup 6 @ machine cycles,
to no formal mechanism whatsoever
.[[
alliant
.]].
These different approaches reflect characteristics of the underlying hardware
and operating systems and to a large extent are dictated by the vendors' view
of which aspects of parallelism are marketable.
It is too early to impose a standard on these vendors,
yet it is disconcerting that there is no agreement among any of them
on which extensions should be included.
There is not even an agreed-upon naming convention for extensions that have
identical functionality.
Program developers interested in producing implementations of parallel
algorithms that will run on a number of different parallel machines are
therefore faced with an overwhelming task.
The process of developing portable parallel packages is complicated by
additional factors that lie beyond each computer manufacturer supplying its
own, very different mechanism for parallel processing.
A given implementation may require several different communicating parallel
processes, perhaps with different levels of granularity.
An efficient implementation may require the ability to dynamically start
processes, perhaps many more than the number of physical processors in the
system.
This feature is either lacking or prohibitively expensive on most commercially
available parallel computers.
Instead, many of the manufacturers have limited themselves to providing
one-level loop-based parallelism.
.PP
This paper describes an environment for the transportable implementation of
parallel algorithms in a Fortran setting.
By this we mean that a user's code is virtually identical for each machine.
The main tool in this environment is a package called SCHEDULE, which has been
designed to aid a programmer familiar with a Fortran programming environment
in implementing a parallel algorithm in a manner that will lend itself to
transporting the resulting program across a wide variety of parallel machines.
The package is designed to allow existing Fortran subroutines to be called
through SCHEDULE, without modification, thereby permitting users access to a
wide body of existing library software in a parallel setting.
Machine intrinsics are invoked within the SCHEDULE package, and considerable
effort may be required on our part to move SCHEDULE from one machine to
another.
On the other hand, the user of SCHEDULE is relieved of the burden of modifying
each code he desires to transport from one machine to another.
.PP
Our work has primarily been influenced by the work of Babb
.[[
babb
.]],
Browne
.[[
browne
.]],
and Lusk and Overbeek
.[[
lusk overbeek
.]].
We present here our approach, which aids in the programming of explicitly
parallel algorithms in Fortran and which allows one to make use of existing
Fortran libraries in the parallel setting.
The approach taken here should be regarded as minimalist: it has a very
limited scope.
There are two reasons for this.
First, the goal of portability of user code will be less difficult to achieve.
Second, the real hope for a solution to the software problems associated with
parallel programming lies with new programming languages or perhaps with the
"right" extension to Fortran.
Our approach is expected to have a limited lifetime.
Its purpose is to allow us to exploit existing hardware immediately.
.NH
Terminology
.PP
Within the science of parallel computation there seems to be no standard
definition of terms.
A certain terminology will be adopted here for the sake of dialogue.
It will not be "standard" and is intended only to apply within the scope of
this document.
.R
.IP "Process -"
A unit of computation, an independently executable Fortran subroutine
together with calling sequence parameters, common data, and externals.
.IP "Task -"
A main program, processes, and a virtual processor.
.IP "Virtual Processor -"
A process designed to assume the identity of every process within a given
task (through an appropriate subroutine call).
.IP "Processor -"
A physical device capable of executing a main program or a virtual processor.
.IP "Shared Data -"
Variables that are read and/or written by more than one process (including
copies of processes).
.IP "Data Dependency -"
A situation wherein one process (A) reads any shared data that another
process (B) writes.
This data dependency is satisfied when B has written the shared data.
.IP "Schedulable Process -"
A process whose data dependencies have all been satisfied.
.NH
Parallel Programming Ideas
.PP
When designing a parallel algorithm one is required to describe the data
dependencies, parallel structures, and shared variables involved in the
solution.
Typically, such algorithms are first designed at a conceptual level and later
implemented in Fortran and its extensions.
Each manufacturer provides a different set of extensions and targets these
extensions at different implementation levels.
For example, some manufacturers allow only test-and-set along with
spawn-a-process, while others allow concurrent execution of different loop
iterations.
.PP
Our attempt here is to allow the user to define the data dependencies,
parallel structures, and shared variables in his application and then to
implement these ideas in a Fortran program written in terms of subroutine
calls to SCHEDULE.
Each set of subroutine calls specifies a unit of computation or process which
consists of a subroutine name along with the calling parameters and the data
dependencies necessary to coordinate the parallel execution.
.PP
The basic philosophy here is that Fortran programs are naturally broken into
subroutines that identify self-contained units of computation which operate
on shared data structures.
This allows one to call on existing library subroutines in a parallel setting
without modification, and without having to write an envelope around the
library subroutine call in order to conform to some unusual data-passing
conventions imposed by a given parallel programming environment.
.PP
A parallel(izable) program is written in terms of calls to subroutines which,
in principle, may be performed either independently or according to data
dependency requirements that the user is responsible for defining.
The result is a serial program that can run in parallel given a way to
schedule the units of computation on a system of parallel processors while
obeying the data dependencies.
.NH
Parallel Programming Using SCHEDULE
.PP
The package SCHEDULE requires a user to specify the subroutine calls along
with the execution dependencies in order to carry out a parallel computation.
Each of these calls represents a process, and the user must take the
responsibility of ensuring that the data dependencies represented by the
graph are valid.
This concept is perhaps difficult to grasp without some experience with
writing parallel programs.
We shall try to explain it in this section by example; in the following section we shall describe the underlying concepts and the SCHEDULE mechanism. .PP To use SCHEDULE, one must be able to express (i.e., program) an algorithm in terms of processes and execution dependencies among the processes. A convenient way to view this is through a computational graph. For example, the following graph .KS .sp 2 .PS .ft R .ps 8 .vs 9 scale = 1 circlerad = 0.282i boxwid = 0.918i boxht = 0.282i circle rad 0.169 at (2.485i, 6.995i) "A" circle rad 0.169 at (2.033i, 6.544i) "B" circle rad 0.169 at (2.936i, 6.544i) "C" circle rad 0.169 at (2.033i, 5.866i) "D" circle rad 0.169 at (1.355i, 5.866i) "D" circle rad 0.169 at (2.711i, 5.866i) "E" arrow from (2.623i, 6.850i) to (2.598i, 6.875i) line from (2.824i, 6.664i) to (2.623i, 6.850i) arrow from (2.347i, 6.850i) to (2.372i, 6.875i) line from (2.146i, 6.664i) to (2.347i, 6.850i) arrow from (1.895i, 6.399i) to (1.920i, 6.424i) line from (1.468i, 5.986i) to (1.895i, 6.399i) arrow from (2.033i, 6.346i) to (2.033i, 6.381i) line from (2.033i, 6.028i) to (2.033i, 6.346i) arrow from (2.171i, 6.399i) to (2.146i, 6.424i) line from (2.598i, 5.986i) to (2.171i, 6.399i) .ps .vs .ft .PE .sp .I .ce Figure 4.1. .ce Data Dependency Graph .KE .R .sp denotes five subroutines A, B, C, D, and E (here with two "copies" of subroutine D operating on different data). We intend the execution to start simultaneously on subroutines C,D,D, and E since they appear as leaves in the data dependency graph (D will be initiated twice with different data). Once D,D,and E have completed then B may execute. When B and C have completed execution then A may start and the entire computation is finished when A has completed. To use SCHEDULE, one is required to specify the subroutine calling sequence of each of the six units of computation, along with a representation of this dependency graph. .PP For each node in the graph, SCHEDULE requires two subroutine calls. One contains information about the user's routine to be called, such as the name of the routine, calling sequence parameters, and a simple tag to identify the process. The second subroutine call defines the dependency in the graph to nodes above and below the one being specified, and specifies the tag to identify the process. In this example, after an initial call to set up the environment for SCHEDULE, six pairs of calls would be made to define the relationships and data in the computational graph. .PP These concepts are perhaps more easily grasped through an actual Fortran example. A very simple example is a parallel algorithm for computing the inner product of two vectors. The intention here is to illustrate the mechanics of using SCHEDULE. This algorithm and the use of SCHEDULE on a problem of such small granularity are not necessarily recommended. .nf Problem: Given real vectors @a@ and @b@, each of length @n@, compute @sigma ~=~ a sup T b @. Parallel Algorithm: Let @ a sup T ~=~ ( a sub 1 sup T , a sub 2 sup T ,...,a sub k sup T ) @ and @ b sup T ~=~ ( b sub 1 sup T , b sub 2 sup T ,...,b sub k sup T ) @ be a partitioning of the vectors @a@ and @b@ into smaller vectors @ a sub i @ and @ b sub i @. .sp Compute ( in parallel ) @sigma sub j ~=~ a sub j sup T b sub j@ , @ j ~=~ 1,k@. When all done @ sigma ~=~ sigma sub 1 ~+~ sigma sub 2 ~+~ ... ~+~ sigma sub k @. 
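.sp
For instance, with @n ~=~ 6@ and @k ~=~ 2@, and with
@a sup T ~=~ (1,2,3,4,5,6)@ and @b@ a vector of ones
(the data set up by the program below), the two processes compute
@sigma sub 1 ~=~ 1+2+3 ~=~ 6@ and @sigma sub 2 ~=~ 4+5+6 ~=~ 15@
independently, and the final accumulation gives @sigma ~=~ 21@.
.sp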
Each of the parallel processes will execute code of the following form:
.nf
.cs 1 24
.vs 10
.ps 9
      subroutine inprod(m,a,b,sig)
      integer m
      real a(*),b(*),sig
      sig = 0.0
      do 100 j = 1,m
         sig = sig + a(j)*b(j)
  100 continue
      return
      end
.fi
.ps 12
.cs 1
.vs 14
The following routine is used to accumulate the result:
.nf
.cs 1 24
.vs 10
.ps 9
      subroutine addup(k,sigma,temp)
      integer k
      real sigma,temp(*)
      sigma = 0.0
      do 100 j = 1,k
         sigma = sigma + temp(j)
  100 continue
      return
      end
.fi
.ps 12
.cs 1
.vs 14
.PP
The first step in constructing a code is to understand the parallel algorithm
in terms of schedulable processes and a data dependency graph.
Then the algorithm is expressed in a standard (serial) Fortran code.
This code consists of a main program which initializes the shared data and a
"parallel" subroutine \f3parprd\f1 to compute the inner product by invoking
the parallel processes \f3inprod\f1 and \f3addup\f1.
The program and associated data dependency graph are shown below.
Serial Code:
.nf
.cs 1 24
.vs 10
.ps 9
      program main
      integer n,k
      real a(1000),b(1000),temp(50),sigma
      read (5,*) n,k
      do 100 j = 1,n
         a(j) = j
         b(j) = 1
  100 continue
c
      call parprd(n,k,a,b,temp,sigma)
c
      write(6,*) ' sigma = ',sigma
      stop
      end
      subroutine parprd(n,k,a,b,temp,sigma)
c
c     declare shared variables
c
      integer n,k
      real a(*),b(*),temp(*),sigma
c
c     declare local variables
c
      integer m,indx,j
c
      m = n/k
      indx = 1
      do 200 j = 1,k
c
         call inprod(m,a(indx),b(indx),temp(j))
c
         indx = indx + m
         if (j .eq. k-1) m = n - indx + 1
  200 continue
c
      call addup(k,sigma,temp)
c
      return
      end
.fi
.ps 12
.cs 1
.vs 14
.PS
.ft R
.ps 8
.vs 9
scale = 1
circlerad = 0.282i
boxwid = 0.918i
boxht = 0.282i
circle rad 0.176 at (2.485i, 6.769i) "k+1"
circle rad 0.155 at (1.355i, 6.092i) "1"
circle rad 0.155 at (1.807i, 6.092i) "2"
circle rad 0.155 at (2.259i, 6.092i) "3"
circle rad 0.155 at (3.162i, 6.092i) "k-1"
circle rad 0.155 at (3.614i, 6.092i) "k"
arrow from (2.306i, 6.659i) to (2.336i, 6.678i)
line from (1.482i, 6.169i) to (2.306i, 6.659i)
arrow from (2.340i, 6.617i) to (2.365i, 6.642i)
line from (1.913i, 6.198i) to (2.340i, 6.617i)
arrow from (2.424i, 6.574i) to (2.435i, 6.607i)
line from (2.301i, 6.233i) to (2.424i, 6.574i)
arrow from (2.630i, 6.617i) to (2.605i, 6.642i)
line from (3.056i, 6.198i) to (2.630i, 6.617i)
arrow from (2.663i, 6.659i) to (2.633i, 6.678i)
line from (3.487i, 6.169i) to (2.663i, 6.659i)
.ps
.vs
.ft
.PE
.I
.ce
Figure 4.2.
.ce
Dependency Graph for Parallel Inner Product
.R
In this data dependency graph we have identified @k@ processes
.EQ
inprod(~ m,~ a(indx),~ b(indx),~ temp(j)~ ) ~,~~~ j~=~1,~ 2,...,~ k ~ ~~, ~ ~indx~=~1~+~ (j-1)*m
.EN
which are not data dependent.
Each of them reads a segment of the shared data @a ~,~b @ and writes on its
own entry of the array @temp@, but none of them needs to read data that some
other process will write.
This fact is evident in the graphical representation shown in Figure 4.2,
where they are @leaves@.
One process,
.EQ
addup(k,sigma,temp),
.EN
labeled @k+1@, is data dependent on each of the processes @ 1,2,...,k@.
This is because \f3addup\f1 needs to read each entry of the array @temp@ in
order to compute the sum and place it into @sigma@.
.PP
From this data dependency graph we may proceed to write the parallel program.
Once we have understood the computation well enough to have carried out these
two steps, the invocation of SCHEDULE to provide for the parallel execution
of schedulable processes is straightforward.
Calls to \f3parprd\f1, \f3inprod\f1, and \f3addup\f1 are replaced by calls to
SCHEDULE to identify the routines to be executed as well as the information
relating to the dependency graph.
The modified code follows.
.nf
Parallel Main:
.cs 1 24
.vs 10
.ps 9
      program main
      integer n,k
c
      EXTERNAL PARPRD
c
      real a(1000),b(1000),temp(50),sigma
      read (5,*) n,k,NPROCS
      do 100 j = 1,n
         a(j) = j
         b(j) = 1
  100 continue
c
      CALL SCHED(nprocs,PARPRD,n,k,a,b,temp,sigma)
c
      write(6,*) ' sigma = ',sigma
      stop
      end
      subroutine parprd(n,k,a,b,temp,sigma)
c
c     declare shared variables
c
      integer n,k
      real a(*),b(*),temp(*),sigma
c
c     declare local variables
c
      integer m1,m2,indx,j,jobtag,icango,ncheks,mychkn(2)
c
      EXTERNAL INPROD,ADDUP
      save m1,m2
c
      m1 = n/k
      indx = 1
      do 200 j = 1,k-1
c
c     express data dependencies
c
         JOBTAG = j
         ICANGO = 0
         NCHEKS = 1
         MYCHKN(1) = k+1
c
         CALL DEP(jobtag,icango,ncheks,mychkn)
         CALL PUTQ(jobtag,INPROD,m1,a(indx),b(indx),temp(j))
c
         indx = indx + m1
  200 continue
      m2 = n - indx + 1
c
c     express data dependencies for clean up step
c
      JOBTAG = k
      ICANGO = 0
      NCHEKS = 1
      MYCHKN(1) = k+1
c
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,INPROD,m2,a(indx),b(indx),temp(k))
c
      indx = indx + m1
c
      JOBTAG = k+1
      ICANGO = k
      NCHEKS = 0
c
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,ADDUP,k,sigma,temp)
c
      return
      end
.ps 12
.cs 1
.vs 14
.fi
.PP
The code that will execute in parallel has been derived from the serial code
by replacing calls to \f3parprd\f1, \f3inprod\f1, \f3addup\f1 with calls to
SCHEDULE routines that invoke these routines.
The modifications are signified by putting calls to SCHEDULE routines in
capital letters.
Let us now describe the purpose of each of these calls.
.nf
.cs 1 24
.vs 10
.ps 9
      CALL SCHED(nprocs,PARPRD,n,k,a,b,temp,sigma)
.ps 12
.cs 1
.vs 14
.fi
This replaces the call to \f3parprd\f1 in the serial code.
The effect is to devote @nprocs@ virtual processors to the parallel
subroutine \f3parprd\f1.
The parameter list following the subroutine name consists of the calling
sequence one would use to make a normal call to \f3parprd\f1.
Each of these parameters must be passed by reference and not by value.
No constants or arithmetic expressions should be passed as parameters through
a call to \f3sched\f1.
This call to \f3sched\f1 will activate @nprocs@ copies of a virtual processor
\f3work\f1.
This virtual processor is a SCHEDULE procedure (written in C) that is internal
to the package and not explicitly available to the user.
.nf
.cs 1 24
.vs 10
.ps 9
      JOBTAG = j
      ICANGO = 0
      NCHEKS = 1
      MYCHKN(1) = k+1
c
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,INPROD,m1,a(indx),b(indx),temp(j))
.cs 1
.vs 14
.ps 12
.fi
This code fragment shows the @j-th@ instance of the process \f3inprod\f1 being
placed on a queue.
The information needed to schedule this process is contained in the data
dependency graph.
In this case, the @j-th@ instance of a call to \f3inprod\f1 is being placed on
the queue, so @jobtag@ is set to @j@.
The value zero is placed in @icango@, indicating that this process does not
depend on any others.
If this process were dependent on @p@ other processes, then @icango@ would be
set to @p@.
.PP
The mechanism just described allows static scheduling of parallel processes.
In this program the partitioning and data dependencies are known in advance
even though they are parameterized.
It is possible to dynamically allocate processes; this capability will be
explained later.
It might be worthwhile at this point to discuss the mechanism that this
package relies on.
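.PP
Before turning to that mechanism, it may help to see how the abstract graph of
Figure 4.1 would be expressed with these same calls.
The following fragment is only a sketch: the subroutines A through E and the
shared array @x@ passed to them are hypothetical, and the numbering of the
nodes (1 for A through 6 for E) is arbitrary.
Node 1 (A) waits for nodes 2 and 3 (B and C), node 2 waits for the two copies
of D and for E, and each leaf reports to the node above it.
.nf
.cs 1 24
.vs 10
.ps 9
c
c     node 1 (A): depends on B and C; no process checks on it
c
      JOBTAG = 1
      ICANGO = 2
      NCHEKS = 0
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,A,x)
c
c     node 2 (B): depends on D, D, and E; A (node 1) checks on it
c
      JOBTAG = 2
      ICANGO = 3
      NCHEKS = 1
      MYCHKN(1) = 1
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,B,x)
c
c     node 3 (C) is a leaf reporting to A; nodes 4, 5, and 6
c     (D, D, and E) are leaves reporting to B and are specified
c     in the same way with MYCHKN(1) = 2
c
      JOBTAG = 3
      ICANGO = 0
      NCHEKS = 1
      MYCHKN(1) = 1
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,C,x)
.fi
.ps 12
.cs 1
.vs 14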
.NH
The SCHEDULE Mechanism
.PP
The calls to the SCHEDULE routines \f3dep\f1 and \f3putq\f1 place process
dependencies and process descriptors, respectively, on a queue.
A unique user-supplied identifier @jobtag@ is associated with each node of the
dependency graph.
This identifier is a positive integer.
Internally it represents a pointer to a process.
The items needed to specify a data dependency are non-negative integers
@icango@ and @ncheks@ and an integer array @mychkn@.
The @icango@ specifies the number of processes that process @jobtag@ depends
on.
The @ncheks@ specifies the number of processes that depend on process
@jobtag@.
The @mychkn@ is an integer array whose first @ncheks@ entries contain the
identifiers (i.e., @jobtag@s) of the processes that depend on process
@jobtag@.
.KS
.PS
.ft R
.ps 8
.vs 9
scale = 1
circlerad = 0.282i
boxwid = 0.918i
boxht = 0.282i
circle rad 0.169 at (1.807i, 6.769i)
circle rad 0.169 at (1.807i, 6.318i)
circle rad 0.169 at (1.807i, 5.866i)
circle rad 0.282 at (0.904i, 6.318i)
box invis ht 0.918 wid 0.282 at (0.452i, 5.640i)
box invis ht 0.918 wid 0.282 at (1.129i, 5.640i)
box invis wid 0.918 ht 0.282 at (0.000i, 6.995i) "icango = 2"
box invis wid 0.918 ht 0.282 at (0.000i, 6.769i) "ncheks = 3"
spline -> from (1.024i, 6.565i) to (1.129i, 6.769i)to (1.645i, 6.769i)
spline -> from (1.179i, 6.318i) to (1.355i, 6.318i)to (1.694i, 5.986i)
spline -> from (1.151i, 6.438i) to (1.355i, 6.544i)to (1.659i, 6.388i)
arrow from (0.729i, 6.055i) to (0.748i, 6.085i)
line from (0.452i, 5.640i) to (0.729i, 6.055i)
arrow from (0.999i, 6.023i) to (0.988i, 6.056i)
line from (1.129i, 5.640i) to (0.999i, 6.023i)
.ps
.vs
.ft
.PE
.I
.ce
Figure 5.1.
.ce
A Node in a Dependency Graph
.R
.KE
.sp
In Figure 5.1 a typical node of a data dependency graph is shown.
This node has two incoming arcs and three outgoing arcs.
As shown to the left of the node, one would set @icango ~=~ 2@,
@ncheks ~=~ 3@, and the first three entries of @mychkn@ to the identifiers of
the processes pointed to by the outgoing arcs.
.PP
The initial call to @sched@(nprocs,subname,<parms>) causes @nprocs@ virtual
processors called \f3work\f1 to begin executing on @nprocs@ separate physical
processors.
Typically, @nprocs@ should be set to a value that is less than or equal to the
number of physical processors available on the given system.
These \f3work\f1 routines access a ready queue of @jobtag@s for schedulable
processes.
Recall that a schedulable process is one whose data dependencies have been
satisfied.
After a \f3work\f1 routine has been successful in obtaining the @jobtag@ of a
schedulable process, it makes the subroutine call that was associated with
that @jobtag@ by the call to \f3putq\f1.
When this subroutine executes a @return@, control is returned to \f3work\f1,
and a SCHEDULE routine \f3chekin\f1 is called, which decrements the @icango@
counter of each of the @ncheks@ processes that depend on process @jobtag@.
If any of these @icango@ values has been decremented to zero, the identifier
of that process is placed on the ready queue immediately.
.PP
We depict this mechanism in Figure 5.2.
The array labeled \f3parmq\f1 holds a process descriptor for each @jobtag@.
A process descriptor consists of data dependency information and a subroutine
name together with a calling sequence for that subroutine.
This information is placed on \f3parmq\f1 through the two calls
.sp
.cs 1 24
.vs 10
.ps 9
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,subname,<parms>).
.fi .cs 1 .vs 14 .ps 12 When making these two calls the user has assured that a call to @subname@ with the argument list @@ is valid in a data dependency sense whenever the counter @icango@ has been decremented to the value zero. When a \f3work\f1 routine has finished a call to \f3chekin\f1 , it gets the @jobtag@ of the next available schedulable process off the @readyq@ and then assumes the identity of the appropriate subroutine by making a call to @subname@ with the argument list @@. .KS .PS .ft R .ps 8 .vs 9 scale = 1 circlerad = 0.282i boxwid = 0.918i boxht = 0.282i box invis ht 0.918 wid 0.282 at (1.581i, 7.673i) circle rad 0.282 at (2.033i, 6.544i) "SCHEDULE" box wid 0.275 ht 0.762 at (1.355i, 5.414i) "W" "O" "R" "K" box wid 0.275 ht 0.762 at (1.807i, 5.414i) "W" "O" "R" "K" box wid 0.275 ht 0.762 at (2.711i, 5.414i) "W" "O" "R" "K" box invis ht 0.918 wid 0.282 at (2.711i, 6.995i) "Head" box invis ht 0.918 wid 0.282 at (1.355i, 6.995i) "Tail" box invis ht 0.918 wid 0.282 at (1.581i, 7.673i) box wid 0.918 ht 0.282 at (2.033i, 6.995i) "READYQ" box wid 1.285 ht 0.593 at (2.033i, 7.673i) "PARMQ" box invis ht 0.918 wid 0.282 at (3.614i, 6.092i) "10 myjob = getprb(jobtag)" box invis ht 0.918 wid 0.282 at (3.614i, 6.092i) box invis ht 0.918 wid 0.282 at (3.656i, 5.633i) "call checkin(jobtag)" box invis ht 0.918 wid 0.282 at (3.346i, 5.456i) "go to 10" box invis ht 0.918 wid 0.282 at (3.656i, 5.866i) "call subname()" box invis ht 0.918 wid 0.282 at (2.936i, 6.318i) box invis ht 0.918 wid 0.282 at (3.614i, 6.318i) box invis ht 0.918 wid 0.282 at (2.936i, 5.188i) box invis ht 0.918 wid 0.282 at (3.614i, 5.188i) box invis ht 0.918 wid 0.282 at (2.259i, 5.414i) box invis ht 0.918 wid 0.282 at (2.033i, 5.414i) box invis ht 0.918 wid 0.282 at (2.259i, 5.414i) box invis ht 0.918 wid 0.282 at (2.259i, 5.414i) box invis ht 0.918 wid 0.282 at (2.485i, 5.414i) spline -> from (2.252i, 6.374i) to (2.485i, 6.318i)to (2.661i, 5.795i) spline -> from (1.864i, 6.325i) to (1.807i, 6.092i)to (1.807i, 5.795i) spline -> from (1.793i, 6.402i) to (1.355i, 6.318i)to (1.355i, 5.795i) spline -> from (1.595i, 6.854i) to (1.355i, 6.769i)to (1.772i, 6.628i) line dashed from (2.809i, 5.795i) to (2.936i, 6.318i) line dashed from (2.936i, 6.318i) to (3.614i, 6.318i) line dashed from (2.845i, 5.273i) to (2.936i, 5.188i) line dashed from (2.936i, 5.188i) to (3.614i, 5.188i) spline -> from (2.301i, 6.600i) to (2.929i, 6.727i)to (2.492i, 6.854i) .ps .vs .ft .PE .I .ce Figure 5.2. .ce The SCHEDULE Mechanism .R .KE .sp .NH Low-Level Synchronization .PP Ideally, the mechanism we have just described will relieve the user of explicitly invoking any synchronization primitives. Unfortunately, some powerful parallel constructs are not so easily described by this mechanism. It may be desirable to have two processes executing simultaneously that are not truly data independent of each other. A typical example is in pipelining a computation, that is, when several parallel processes are writing on the same data in a specified order which is coordinated through explicit synchronization. To provide this capability, two low-level synchronization primitives have been made available within SCHEDULE. They are \f3lockon\f1 and \f3unlock\f1. Each takes an integer argument. 
An example of usage is .sp2 .nf .ps 9 .cs 1 24 .vs 10 call lockon(ilock) ilocal = indx indx = indx + 5 call unlock(ilock) .fi .ps 12 .cs 1 .vs 14 .sp2 In this example a critical section has been placed around the act of getting a local copy of the shared variable @indx@ and updating the value of @indx@. If several concurrent processes are executing this code, then only one of them will be able to occupy this critical section at any given time. The variable @ilock@ must be a globally shared variable and it must be initialized before any usage as an argument to \f3lockon\f1 or \f3unlock\f1 by calling the routine \f3lckasn\f1. In the above example the statement .sp .nf .ps 9 .cs 1 24 .vs 10 call lckasn(ilock) .fi .ps 12 .cs 1 .vs 14 .sp must execute exactly once and before any of the calls to \f3lockon\f1 are made. After execution of this statement the lock variable @ilock@ has been declared as a lock variable and has been initiated to be @off@. If there are low-level data dependencies among any of the processes that will be scheduled, then it will be necessary to enforce those data dependencies using locks. It is preferable to avoid using locks if possible. However, in certain cases such as pipelining, locks will be required. .NH Dynamic Allocation of Processes .PP The scheme presented above might be considered static allocation of processes. By this we mean that the number of processes and their data dependencies were known in advance. Therefore the entire data structure (internal to SCHEDULE) representing the computational graph could be recorded in advance of the computation and is fixed throughout the computation. In many situations, however, we will not know the computational graph in advance, and we will need the ability for one process to start or spawn another depending on a computation that has taken place up to a given point in the spawning process. This dynamic allocation of processes is accomplished through the use of the SCHEDULE subroutine \f3spawn\f1. The method of specifying a process is similar to the use of \f3putq\f1 described above. .PP We shall use the same example to illustrate this mechanism. .nf Processes: subroutine inprod .... same as above .ps 9 .cs 1 24 .vs 10 subroutine addup(n,k,a,b,sigma,temp) integer myid,n,k real a(*),b(*),sigma,temp(*) c c declare local variables c integer j,jdummy,m1,m2 c EXTERNAL INPROD c m1 = n/k indx = 1 c CALL PRTSPN(myid) do 200 j = 1,k-1 c c replace the call to inprod with a call to spawn c CALL SPAWN(myid,jdummy,INPROD,m1,a(indx),b(indx),temp(j)) indx = indx + m1 200 continue m2 = n - indx + 1 c c clean up step c call inprod(m2,a(indx),b(indx),temp(k)) c c keep one of the inner product assignments and then call wait c to be sure all others have completed. c CALL WAIT(myid) c c c All have checked in, now addup the results. c sigma = 0.0 do 100 j = 1,k sigma = sigma + temp(j) 100 continue return end .sp2 .fi .cs 1 .vs 14 .ps 12 The subroutine \f3parprd\f1 must change somewhat. .nf .sp2 .cs 1 24 .vs 10 .ps 9 subroutine parprd(n,k,a,b,temp,sigma) c c declare shared variables c integer n,k real a(*),b(*),temp(*),sigma c c declare local variables c integer mychkn(1),icango,ncheks,jobtag EXTERNAL ADDUP c JOBTAG = 1 ICANGO = 0 NCHEKS = 0 c CALL DEP(jobtag,icango,ncheks,mychkn) CALL PUTQ(jobtag,ADDUP,n,k,a,b,sigma,temp) c return end .fi .cs 1 .vs 14 .ps 12 .PP In this example the call to @spawn@ invokes @k ~-~ 1@ instances of the subroutine @inprod@. Note that the user does not specify data dependencies. 
Instead, the jobtags for these processes are assigned internally, and they are
required to report to the spawning process.
More details of this mechanism are given in the next section.
.NH
Syntax Summary
.PP
In this section we summarize the SCHEDULE operations and their syntax.
We begin with a list of these operations and follow up with a general
description of their usage.
With this we hope to supplement the examples given above.
.nf
.sp2
.ce
Table 1
.ce
.I
User Calls
.sp
.R
.cs 1 24
.vs 12
.ps 10
.R
CALL SCHED ( nprocs, subname, <parms> )
CALL DEP   ( jobtag, icango, ncheks, mychkn )
CALL PUTQ  ( jobtag, subname, <parms> )
CALL PRTSPN( myid )
CALL SPAWN ( myid, jdummy, subname, <parms> )
CALL WAIT  ( myid )
CALL LCKASN( ilock )
CALL LOCKON( ilock )
CALL UNLOCK( ilock )
.R
nprocs  - Number of virtual processors to be used on subname.
subname - Subroutine to be executed.
<parms> - List of parameters for subname.
jobtag  - A unique user-supplied identifier.
icango  - The number of processes that must complete before this
          process starts.
ncheks  - Specifies the number of processes that depend on this
          process.
mychkn  - An integer array whose first @ncheks@ entries contain the
          identifiers of the processes that depend on this process.
myid    - An alias process id assigned by SCHEDULE to record
          completion of the spawned processes.
jdummy  - A dummy integer parameter.
ilock   - A globally shared variable used as a lock.
.fi
.cs 1
.vs 14
.ps 12
.PP
There are three basic operations in SCHEDULE other than the lowest-level
synchronization provided through locks.
They are \f3call sched\f1, \f3call putq\f1, and \f3call spawn\f1.
Various additional calls must be made in association with these three basic
operations to enforce data dependencies and to orchestrate the correct
execution of the parallel program.
.PP
The statement
.EQ
call~~sched(nprocs,paralg,<parms>)
.EN
is used to initiate the execution of a parallel program @paralg@ and will
devote @nprocs@ physical processors to the computation.
The symbol @<parms>@ stands for a set of parameters that would normally be
supplied to the Fortran subroutine @paralg@ through a Fortran call in a
sequential program.
One should think of the memory associated with the parameter list represented
by @<parms>@ as shared memory throughout the execution of @paralg@.
Another way to share memory is through the use of @named ~common@.
.PP
The pair of statements
.EQ
mark call ~~ dep ( jobtag,icango,ncheks,mychkn)
.EN
.EQ
lineup call ~~ putq(jobtag,subname,<parms>)
.EN
are used to place a process in the large-grained data dependency graph, with
@dep@ used to record the data dependencies and @putq@ to record the unit of
computation.
In this case, the unit of computation will be a call to the subroutine
@subname@ with the calling sequence @<parms>@.
When the data dependencies for this process have been satisfied, the @icango@
counter will have been decremented to zero, and this process will be placed
upon the ready queue.
At some point in time a physical processor executing a @work@ program will
become free, pick this process off the ready queue, and execute a
@call~~subname(<parms>)@.
It is the user's responsibility to assure that the addresses passed in
@<parms>@ are addresses in shared memory.
That is to say, these addresses must reference shared memory that has been
declared in the program that made the call to @sched@.
Hence these addresses must either come from named common that is declared in
that program and referenced by the subroutine @subname@, or have been passed
as parameters in the call to @sched@.
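.PP
As an illustration of these rules, the following sketch shares the array @x@
through the parameter list of @sched@ and the scalar @alpha@ through named
common.
The subroutine names \f3driver\f1, \f3step1\f1, and \f3step2\f1 and their
argument lists are hypothetical and are not part of the examples above;
process 2 depends on process 1, so its @icango@ is one and process 1 names it
in @mychkn@.
.nf
.cs 1 24
.vs 10
.ps 9
      program main
      integer nprocs
      real x(100),alpha
      common /share/ alpha
      EXTERNAL DRIVER
      nprocs = 4
      alpha = 2.0
      CALL SCHED(nprocs,DRIVER,x)
      stop
      end
      subroutine driver(x)
c
c     x is shared through the parameter list of sched; step1 and
c     step2 (not shown) obtain alpha from the common block /share/
c     declared in the main program
c
      real x(*)
      integer jobtag,icango,ncheks,mychkn(1)
      EXTERNAL STEP1,STEP2
c
c     process 1 has no dependencies; process 2 checks on it
c
      JOBTAG = 1
      ICANGO = 0
      NCHEKS = 1
      MYCHKN(1) = 2
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,STEP1,x)
c
c     process 2 waits for process 1; no process depends on it
c
      JOBTAG = 2
      ICANGO = 1
      NCHEKS = 0
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,STEP2,x)
      return
      end
.fi
.ps 12
.cs 1
.vs 14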
.PP
The sequence of statements
.EQ
mark call ~~ prtspn(myid)
.EN
.EQ
lineup ...
.EN
.EQ
lineup ~~~~~call~~ spawn(myid,jdummy,subname,<parms>)
.EN
.EQ
lineup ...
.EN
.EQ
lineup call~~wait(myid)
.EN
accomplishes dynamic spawning of processes.
Usually calls to spawn are made in a loop, but they need not be.
The call to @prtspn@ secures an alias process id @myid@ to record completion
of the spawned processes.
This id must be used in the subsequent calls to @spawn@ and to @wait@.
There is an implied barrier at the call to @wait@, and execution does not
continue past this point in the spawning process until all spawned processes
have completed.
The syntax of the call to @spawn@ is very similar to that of @putq@.
The parameter @jdummy@ is a temporary location that must be supplied to
@spawn@, and the result is a @call~~ subname(<parms>) @ as soon as there is an
instantiation of @work@ available to execute this call.
A key difference between a call to @spawn@ and a call to @putq@ is the lack of
user specification of jobtags and data dependencies.
The only data dependency indicated here is that a spawned process must report
to the parent process and no other.
This is taken care of automatically with use of the above mechanism.
If communication is desired at a lower level among spawned processes, this
must be accomplished through the use of shared data and/or locks.
Local variables in the spawning routine may be passed as arguments to the
spawned routines, with the understanding that their scope extends to all
members of the subtree rooted at the spawning process.
These variables must be treated as shared data within this subtree.
.PP
The implications of the mechanism used for spawning should be discussed
further.
We rely upon the stack associated with the calling tree rooted at the spawning
process.
If one wishes to use recursion to gain parallelism, then there is a potential
for stack overflow.
This is an unfortunate consequence of an attempt to avoid the burden of
placing and keeping track of entry points and saving the state of an executing
process which spawns other processes.
The key problem is to retain the ability to execute the parallel program on a
single physical processor.
The call to @wait@ does not result in a busy wait, as one might suspect.
Instead, a call is made to a routine that is quite similar to the @work@
routines discussed earlier.
This routine accesses the readyq, makes a call, and then "checks in" and
returns to @wait@.
Within the routine @wait@, completion of spawned processes is monitored.
If any have failed to complete, another call to the special work routine is
made.
Obviously, this may result in stack overflow when spawned routines spawn other
routines.
An alternative that avoided stack overflow was used in this package previously
.[[
dongarra sorensen full parallel
.]]
but was deemed overly complicated for the user.
.NH
An Environment for the Development of Explicitly Parallel Programs
.PP
As we mentioned at the outset, SCHEDULE is a programming aid that is intended
to serve as a backbone to an environment for the development of explicitly
parallel programs.
Our goal is to provide a uniform interface to the parallel capabilities
provided by existing and forthcoming parallel systems.
Included in this are tools for debugging and analyzing the performance of
parallel programs.
In this section we shall describe some of our goals in this area.
Preliminary attempts at achieving these goals have either been implemented or
are in the design stage.
.PP
We regard SCHEDULE as a tool primarily designed as an aid to constructing an
implementation of a "new" parallel algorithm.
We do not think it is particularly well suited to converting an existing
serial code to a parallel version, although this would be possible in some
cases.
The reason for this statement is that to program with SCHEDULE it is
imperative that the large-grain data dependencies be well understood by the
programmer.
This is, of course, required in any successful implementation of a parallel
program.
However, with SCHEDULE the data partitioning and construction of the data
dependency graph are an explicit part of the programming effort.
We expect that the designer of a parallel algorithm will have this information
naturally at hand.
At least this has been our experience in the design of our own parallel
algorithms
.[[
dongarra sorensen Linear Algebra on High-Performance Computers
.]].
Enforcing this programming style goes a long way toward avoiding bugs
traditionally associated with parallel programming.
.PP
We have found it very useful to retain the capability of executing a parallel
program in serial mode.
A SCHEDULE program does not depend upon the number of physical processors
available in a given system.
It will execute with one processor active.
Others have found this to be a useful property of such a programming tool
.[[
lusk overbeek
.]].
However, Gentleman
.[[
gentleman private
.]]
does not feel that this feature is essential to the debugging of parallel
programs.
We find at least two reasons to provide it.
First, it allows one to develop and test code on a serial machine such as a
VAX 11/780.
This is quite important when the parallel machine is at a remote site or if
the software development environment existing on the parallel machine is
inadequate or difficult to use.
Second, when developing a new algorithm one would like to be assured that the
numerical properties of the algorithm are correct in serial mode before
entering the parallel testing regime.
Operating in this sequence tends to separate ordinary programming bugs from
those associated with parallel programming.
Moreover, due to the SCHEDULE mechanism, one can be fairly confident that a
parallel bug is due either to incorrect specification of the data dependency
graph or to incorrect partitioning of data.
Explicit synchronization is generally not a part of a SCHEDULE program, and
thus difficult synchronization bugs do not generally arise.
.PP
To aid in ascertaining a correct specification of the data dependency graph,
we envision a tool that will graphically represent the units of computation
and the data dependency graph.
A preliminary version of this tool has been implemented, and experience with
it is reported below.
Similar ideas have been proposed in
.[[
babb
.]].
In our view, the level of detail in those efforts is too fine.
We expect the level of detail in such a representation of a parallel program
to roughly correspond to the abstract level at which the programmer has
partitioned his parallel algorithm.
Let us illustrate this with a simple example.
A favorite example of ours is the solution of a triangular linear system
partitioned by blocks.
.NH
Triangular Solve Example
.PP
We can consider solving a triangular system of equations @ T x = b @ in
parallel by partitioning the matrix @T@ and the vectors @x@ and @b@ as shown
in Figure 10.1.
.PS .ft R .ps 8 .vs 9 scale = 1 circlerad = 0.282i boxwid = 0.918i boxht = 0.282i box invis ht 0.918 wid 0.282 at (0.685i, 6.995i) box invis ht 0.918 wid 0.282 at (0.678i, 6.536i) box invis ht 0.918 wid 0.282 at (0.685i, 6.085i) box invis ht 0.918 wid 0.282 at (0.678i, 5.640i) box invis ht 0.918 wid 0.282 at (0.685i, 5.181i) box invis ht 0.918 wid 0.282 at (1.136i, 6.536i) box invis ht 0.918 wid 0.282 at (1.136i, 6.092i) box invis ht 0.918 wid 0.282 at (1.136i, 5.640i) box invis ht 0.918 wid 0.282 at (1.136i, 5.188i) box invis ht 0.918 wid 0.282 at (1.588i, 6.085i) box invis ht 0.918 wid 0.282 at (1.588i, 5.640i) box invis ht 0.918 wid 0.282 at (1.588i, 5.188i) box invis ht 0.918 wid 0.282 at (2.040i, 5.640i) box invis ht 0.918 wid 0.282 at (2.040i, 5.188i) box invis ht 0.918 wid 0.282 at (2.492i, 5.181i) box invis ht 0.918 wid 0.282 at (0.685i, 4.729i) box invis ht 0.918 wid 0.282 at (1.136i, 4.729i) box invis ht 0.918 wid 0.282 at (1.588i, 4.736i) box invis ht 0.918 wid 0.282 at (2.033i, 4.736i) box invis ht 0.918 wid 0.282 at (2.492i, 4.736i) box invis ht 0.918 wid 0.282 at (2.944i, 4.729i) box invis ht 0.918 wid 0.282 at (0.847i, 6.713i) "1" box invis ht 0.918 wid 0.282 at (0.875i, 6.325i) "2" box invis ht 0.918 wid 0.282 at (0.861i, 5.873i) "3" box invis ht 0.918 wid 0.282 at (0.889i, 5.400i) "4" box invis ht 0.918 wid 0.282 at (0.868i, 4.948i) "5" box invis ht 0.918 wid 0.282 at (1.313i, 6.219i) "6" box invis ht 0.918 wid 0.282 at (1.299i, 5.859i) "7" box invis ht 0.918 wid 0.282 at (1.341i, 5.407i) "8" box invis ht 0.918 wid 0.282 at (1.348i, 4.962i) "9" box invis ht 0.918 wid 0.282 at (1.765i, 5.795i) "10" box invis ht 0.918 wid 0.282 at (1.793i, 5.400i) "11" box invis ht 0.918 wid 0.282 at (1.821i, 4.941i) "12" box invis ht 0.918 wid 0.282 at (2.181i, 5.336i) "13" box invis ht 0.918 wid 0.282 at (2.273i, 4.948i) "14" box invis ht 0.918 wid 0.282 at (2.612i, 4.864i) "15" box invis ht 0.918 wid 0.282 at (3.388i, 7.002i) box invis ht 0.918 wid 0.282 at (3.395i, 4.744i) box invis ht 0.918 wid 0.282 at (3.388i, 6.586i) box invis ht 0.918 wid 0.282 at (3.388i, 6.508i) box invis ht 0.918 wid 0.282 at (3.388i, 6.127i) box invis ht 0.918 wid 0.282 at (3.388i, 6.056i) box invis ht 0.918 wid 0.282 at (3.395i, 5.654i) box invis ht 0.918 wid 0.282 at (3.388i, 5.591i) box invis ht 0.918 wid 0.282 at (3.395i, 5.224i) box invis ht 0.918 wid 0.282 at (3.395i, 5.160i) box invis ht 0.918 wid 0.282 at (4.271i, 6.988i) box invis ht 0.918 wid 0.282 at (4.278i, 4.729i) box invis ht 0.918 wid 0.282 at (4.271i, 6.572i) box invis ht 0.918 wid 0.282 at (4.271i, 6.494i) box invis ht 0.918 wid 0.282 at (4.271i, 6.113i) box invis ht 0.918 wid 0.282 at (4.271i, 6.042i) box invis ht 0.918 wid 0.282 at (4.278i, 5.640i) box invis ht 0.918 wid 0.282 at (4.271i, 5.576i) box invis ht 0.918 wid 0.282 at (4.278i, 5.209i) box invis ht 0.918 wid 0.282 at (4.278i, 5.146i) box invis ht 0.918 wid 0.282 at (3.162i, 6.769i) "x1" box invis ht 0.918 wid 0.282 at (3.176i, 6.296i) "x2" box invis ht 0.918 wid 0.282 at (3.162i, 5.859i) "x3" box invis ht 0.918 wid 0.282 at (3.155i, 5.400i) "x4" box invis ht 0.918 wid 0.282 at (3.169i, 4.941i) "x5" box invis ht 0.918 wid 0.282 at (4.052i, 4.969i) "b5" box invis ht 0.918 wid 0.282 at (4.073i, 5.414i) "b4" box invis ht 0.918 wid 0.282 at (4.087i, 5.859i) "b3" box invis ht 0.918 wid 0.282 at (4.080i, 6.318i) "b2" box invis ht 0.918 wid 0.282 at (4.052i, 6.784i) "b1" box invis ht 0.918 wid 0.282 at (3.734i, 5.845i) "=" box invis ht 0.918 wid 0.282 at (3.169i, 5.449i) line from (0.685i, 
6.995i) to (0.678i, 6.536i) line from (0.678i, 6.536i) to (0.685i, 4.729i) line from (0.685i, 4.729i) to (2.944i, 4.729i) line from (2.944i, 4.729i) to (0.685i, 6.995i) line from (0.685i, 6.085i) to (1.588i, 6.085i) line from (0.678i, 5.640i) to (2.040i, 5.640i) line from (0.685i, 5.181i) to (2.492i, 5.181i) line from (0.678i, 6.536i) to (1.136i, 6.536i) line from (1.136i, 6.536i) to (1.136i, 4.729i) line from (1.588i, 6.085i) to (1.588i, 4.736i) line from (2.040i, 5.640i) to (2.033i, 4.736i) line from (2.492i, 5.181i) to (2.492i, 4.736i) line from (3.388i, 7.002i) to (3.388i, 6.586i) line from (3.388i, 6.508i) to (3.388i, 6.127i) line from (3.388i, 6.056i) to (3.395i, 5.654i) line from (3.388i, 5.591i) to (3.395i, 5.160i) line from (3.395i, 5.160i) to (3.395i, 5.224i) line from (3.395i, 5.160i) to (3.395i, 4.744i) line from (4.271i, 6.988i) to (4.271i, 6.572i) line from (4.271i, 6.494i) to (4.271i, 6.113i) line from (4.271i, 6.042i) to (4.278i, 5.640i) line from (4.271i, 5.576i) to (4.278i, 5.146i) line from (4.278i, 5.146i) to (4.278i, 4.729i) line from (4.278i, 5.146i) to (4.278i, 5.209i) .ps .vs .ft .PE .I .ce Figure 10.1 .ce Partitioning for the Triangular Solve .sp .R The first step is to solve the system @ T sub 1 x sub 1 ~=~ b sub 1@, this will determine the solution for that part of the vector labeled @ x sub 1@. After @x sub 1 @ has been computed it can be used to update the right hand side with the computations .EQ I b sub 2 ~=~ b sub 2 - T sub 2 x sub 1 .EN .EQ I b sub 3 ~=~ b sub 3 - T sub 3 x sub 1 .EN .EQ I b sub 4 ~=~ b sub 4 - T sub 4 x sub 1 .EN .EQ I b sub 5 ~=~ b sub 5 - T sub 5 x sub 1 ~. .EN Notice that these matrix-vector multiplications can occur in parallel, as there are no dependences. However, there may be several processes attempting to update the value of a vector @b sub j@ (for example 4,8,11 will update @b sub 4@) and this will have to be synchronized through the use of locks or the use of temporary arrays for each process. As soon as @b sub 2 @ has been updated, the computation of @x sub 2 @ can proceed as @x sub 2 ~=~ T sub 6 sup -1 b sub 2@. Notice that this computation is independent of the other matrix-vector operations involving @ b sub 3@, @b sub 4@, and @b sub 5@. After @ x sub 2 @ has been computed, it can be used to update the right hand side as follows: .EQ I b sub 3 ~=~ b sub 3 - T sub 7 x sub 2 .EN .EQ I b sub 4 ~=~ b sub 4 - T sub 8 x sub 2 .EN .EQ I b sub 5 ~=~ b sub 5 - T sub 9 x sub 2 ~. .EN The process is continued until the full solution is determined. The data dependency graph for this can be represented in Figure 10.2. 
.KS .PS .ft R .ps 8 .vs 9 scale = 1 circlerad = 0.282i boxwid = 0.918i boxht = 0.282i circle rad 0.106 at (0.678i, 6.995i) "1" circle rad 0.106 at (0.678i, 6.544i) "2" circle rad 0.106 at (0.678i, 6.092i) "3" circle rad 0.106 at (0.678i, 5.640i) "4" circle rad 0.106 at (0.678i, 5.188i) "5" circle rad 0.106 at (1.355i, 6.544i) "6" circle rad 0.106 at (1.355i, 6.092i) "7" circle rad 0.106 at (1.355i, 5.640i) "8" circle rad 0.106 at (1.355i, 5.188i) "9" circle rad 0.106 at (2.047i, 6.078i) "10" circle rad 0.106 at (2.047i, 5.640i) "11" circle rad 0.106 at (2.047i, 5.188i) "12" circle rad 0.106 at (2.725i, 5.626i) "13" circle rad 0.106 at (2.725i, 5.188i) "14" circle rad 0.106 at (3.388i, 5.174i) "15" circle rad 0.282 at (4.052i, 6.755i) "13" box invis ht 0.918 wid 0.282 at (4.066i, 6.092i) "14" box invis ht 0.918 wid 0.282 at (3.176i, 7.151i) "4" box invis ht 0.918 wid 0.282 at (3.162i, 6.784i) "8" box invis ht 0.918 wid 0.282 at (3.162i, 6.416i) "11" box invis ht 0.918 wid 0.282 at (4.038i, 5.612i) "jobtag = 13" "icango = 3" "ncheks = 1" "myckhn(1) = 14" spline -> from (0.649i, 6.896i) to (0.621i, 6.798i)to (0.656i, 6.642i) spline -> from (0.614i, 6.918i) to (0.508i, 6.784i)to (0.522i, 6.247i)to (0.607i, 6.162i) spline -> from (0.586i, 6.953i) to (0.452i, 6.868i)to (0.424i, 6.007i)to (0.621i, 5.725i) spline -> from (0.579i, 6.981i) to (0.311i, 6.953i)to (0.409i, 5.598i)to (0.621i, 5.273i) spline -> from (0.776i, 6.515i) to (1.031i, 6.431i)to (1.256i, 6.515i) spline -> from (1.327i, 6.445i) to (1.299i, 6.346i)to (1.334i, 6.191i) spline -> from (1.454i, 6.064i) to (1.694i, 6.007i)to (1.948i, 6.056i) spline -> from (2.146i, 5.612i) to (2.344i, 5.555i)to (2.626i, 5.605i) spline -> from (2.824i, 5.167i) to (3.106i, 5.118i)to (3.289i, 5.153i) spline -> from (0.762i, 6.035i) to (0.988i, 5.866i)to (1.624i, 5.880i)to (1.955i, 6.035i) spline -> from (1.440i, 5.584i) to (1.666i, 5.442i)to (2.414i, 5.456i)to (2.633i, 5.584i) spline -> from (0.769i, 5.598i) to (1.285i, 5.329i)to (2.485i, 5.358i)to (2.661i, 5.548i) spline -> from (2.096i, 5.104i) to (2.188i, 4.962i)to (3.148i, 4.962i)to (3.311i, 5.111i) spline -> from (1.440i, 5.139i) to (2.005i, 4.807i)to (3.304i, 4.864i)to (3.360i, 5.075i) spline -> from (0.769i, 5.146i) to (1.426i, 4.779i)to (3.374i, 4.765i)to (3.381i, 5.075i) spline -> from (1.299i, 6.459i) to (1.129i, 6.247i)to (1.158i, 5.795i)to (1.278i, 5.704i) spline -> from (1.278i, 6.480i) to (1.059i, 6.289i)to (1.045i, 5.400i)to (1.271i, 5.245i) spline -> from (2.019i, 5.979i) to (1.976i, 5.838i)to (2.019i, 5.739i) spline -> from (1.991i, 5.993i) to (1.835i, 5.767i)to (1.878i, 5.329i)to (1.969i, 5.252i) spline -> from (2.696i, 5.527i) to (2.668i, 5.386i)to (2.696i, 5.287i) arrow from (3.765i, 6.883i) to (3.798i, 6.868i) line from (3.176i, 7.151i) to (3.765i, 6.883i) arrow from (3.741i, 6.763i) to (3.776i, 6.762i) line from (3.162i, 6.784i) to (3.741i, 6.763i) arrow from (3.758i, 6.644i) to (3.791i, 6.656i) line from (3.162i, 6.416i) to (3.758i, 6.644i) arrow from (4.065i, 6.127i) to (4.066i, 6.092i) line from (4.059i, 6.480i) to (4.065i, 6.127i) .ps .vs .ft .PE .I .ce Figure 10.2 .ce Data Dependency Graph for Triangular Example .R .KE .KS .sp 20 .I .ce Figure 10.3 .ce Output from SCHEDULE for Triangular Example .R .KE .PP In Figure 10.3 we show the output from an executing SCHEDULE implementation of this example. The program was run on an Alliant FX/8. An output file was produced as the program executed which recorded the units of computation as they were defined and executed. 
The file was then shipped to a SUN workstation, where a graphics program
interpreted this output, constructed the graph, and played back the execution
sequence that had been run on the Alliant.
In the graph shown above in Figure 10.3, the black nodes show processes that
have completed execution, the hatched nodes show executing processes, and the
white nodes show processes waiting to execute.
The program was able to produce this playback output automatically simply by
linking to SCHEDULE with a request for the graph option.
The user's program did not change in any way.
At the moment, this is the only capability we have.
It would be very useful to have the user construct the part of the Fortran
program that corresponds to the dependency graph, i.e., the calls to the
SCHEDULE routines, from a "mouse-driven" graphics input device.
.PP
In addition to discovering bugs in the specification of the graph, this
representation is useful in exposing more subtle aspects of the executing
program.
For example, when the graph produced by SCHEDULE is contrasted with the
abstract user-specified graph, there are noticeable differences.
The graph produced by SCHEDULE exposes inherent serial bottlenecks in the
algorithm.
During execution, the two bar graphs at the top track the current and the
maximum number of concurrently executing processes, respectively.
In the ideal situation, the bottom bar has a value equal to the number of
physical processors available, and the top bar is almost always equal to the
bottom bar.
In addition to discovering serial bottlenecks, one can learn about
load-balancing problems within an executing program.
It may even be useful to put in artificial data dependencies to force certain
processes to complete before others in order to achieve a better load balance.
Another interesting feature is that the playback of this execution sequence
can be accelerated if the program took a long time to execute, or slowed down
if it took a short time, while maintaining the ratios of execution times among
the parallel processes.
.PP
A final feature that is possible but not yet fully implemented is the ability
to observe the executing program within a node.
We envision the capability of opening a workstation window corresponding to a
selected node.
Within this window we would see a display of selected variables being updated
as execution takes place.
At the same time, the currently executing lines of Fortran code would be
displayed.
This capability currently exists for a serial Fortran program
.[[
dongarra sorensen aid to fortran
.]].
The output of such a display is shown in Figure 10.4.
.KS
.sp 17
.I
.ce
Figure 10.4
.ce
Screen Output from Debugger
.R
.KE
The main problem to overcome here is the selective display of data.
A real program executing on large arrays could produce an overwhelming amount
of output.
.NH
Experience with SCHEDULE
.PP
At present, the experience with using SCHEDULE is limited but encouraging.
Versions are running successfully on the VAX 11/780, Alliant FX/8, and CRAY-2
computers.
That is, the same user code executes without modification on all three
machines.
Only the SCHEDULE internals are modified, and these modifications are usually
minor, but they can be difficult in some cases.
They involve such things as naming and parameter-passing conventions for the
C-Fortran interface.
They also involve coding the low-level synchronization primitives and managing
to "create" the \f3work\f1 processes.
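.PP
For instance, the creation of the \f3work\f1 processes on a machine that
provides loop-based concurrency might take roughly the following form.
This is only a sketch of the approach described in the next paragraph for the
Alliant; \f3work\f1 is internal to SCHEDULE, and the argument shown here is
hypothetical.
.nf
.cs 1 24
.vs 10
.ps 9
c
c     the directive permits the iterations, and hence the nprocs
c     copies of work, to execute concurrently
c
cvd$l cncall
      do 10 i = 1,nprocs
         call work(i)
   10 continue
.fi
.ps 12
.cs 1
.vs 14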
.PP On the CRAY-2 process creation is accomplished using \f3tskstart\f1, and the low-level synchronization is very similar to the low-level synchronization routines provided by the CRAY multitasking library .[[ cray multitasking .]]. For the Alliant FX/8 we coded the low-level synchronization primitives using their test-and-set instruction. To "create" the work routines, we used the CVD$L CNCALL directive before a loop that performed @nprocs@ calls to the subroutine \f3work\f1. .PP In addition to some toy programs used for debugging SCHEDULE, several codes have been written and executed using SCHEDULE. These codes include the algorithm TREEQL for the symmetric tridiagonal eigenvalue problem .[[ dongarra sorensen symmetric .]], a domain decomposition code for singularly perturbed convection-diffusion PDE .[[ chin mcgraw .]], an adaptive quadrature code .[[ comer .]], and a block preconditioned conjugate gradient code for systems arising in reservoir simulation .[[ diaz .]]. .PP The graph associated with TREEQL is probably the most illustrative of the capabilities of SCHEDULE so we shall discuss it in some detail. This is a parallel algorithm for computing the eigensystem of a real symmetric tridiagonal matrix @T@. The problem is to compute an orthogonal matrix @Q@ and a diagonal matrix @D@ such that @ T ~=~ Q D Q sup T @. The crux of the algorithm is to divide a given problem into two smaller subproblems. To do this, we consider the symmetric tridiagonal matrix .EQ (11.1) T mark ~=~ left ( ~ ~ matrix { ccol { T sub 1 above beta e sub 1 e sub k sup T } ccol { beta e sub k e sub 1 sup T above T sub 2 } } ~ ~ right ) .EN .sp .EQ lineup ~=~ left ( ~ matrix { ccol { T hat sub 1 above 0 } ccol { 0 above T hat sub 2 }} ~ right ) ~+~ beta left ( ~ matrix { ccol { e sub k above e sub 1 } } ~ right ) ( e sub k sup T ~,~ e sub 1 sup T ) .EN .sp where @ 1 ~<=~ k ~<=~ n @ and @ e sub j @ represents the @j-th@ unit vector of appropriate dimension. The @k-th@ diagonal element of @ T sub 1@ has been modified to give @T hat sub 1 @ and the first diagonal element of @ T sub 2@ has been modified to give @ T hat sub 2 @. .PP Now we have two smaller tridiagonal eigenvalue problems to solve. According to equation (11.1) we compute the two eigensystems .EQ T hat sub 1 ~=~ Q sub 1 D sub 1 Q sub 1 sup T ~,~~~~ T hat sub 2 ~=~ Q sub 2 D sub 2 Q sub 2 sup T ~. .EN This gives .EQ (11.2) T mark ~=~ left ( ~ matrix { ccol { Q sub 1 D sub 1 Q sub 1 sup T above 0 } ccol { 0 above Q sub 2 D sub 2 Q sub 2 sup T } } ~ right ) ~ +~ beta left ( ~ matrix { ccol { e sub k above e sub 1} } ~ right ) ( e sub k sup T ~,~ e sub 1 sup T ) .EN .EQ lineup ~=~ left ( ~ matrix { ccol { Q sub 1 above 0 } ccol { 0 above Q sub 2 } } ~ right ) left (~ left ( ~ matrix { ccol { D sub 1 above 0 } ccol { 0 above D sub 2 } } ~ right ) +~ beta left ( ~ matrix { ccol { q sub 1 above q sub 2} } ~ right ) ( q sub 1 sup T ~,~ q sub 2 sup T ) ~ ~ right ) left ( ~ matrix { ccol { Q sub 1 sup T above 0 } ccol { 0 above Q sub 2 sup T } } ~ right ) .EN where @ q sub 1 ~=~ Q sub 1 sup T e sub k@ and @ q sub 2 ~=~ Q sub 2 sup T e sub 1@. The problem at hand now is to compute the eigensystem of the interior matrix in equation (11.2). A numerical method for solving this problem has been provided in [2] and we shall discuss this method in the next section. .PP It should be fairly obvious how to proceed from here to exploit parallelism. 
One simply repeats the tearing recursively on each of the two halves until the original problem has been divided into the desired number of subproblems; the rank-one modification routine may then be applied from the bottom up to glue the results together again.
.PP
This @gluing@ process is called updating, and the general updating problem we are required to solve for each partition is that of computing the eigensystem of a matrix of the form
.EQ (11.3)
Q hat D hat Q hat sup T ~=~ D ~+~ rho zz sup T
.EN
where @ D @ is a real @n times n@ diagonal matrix, @rho@ is a nonzero scalar, and @z@ is a real vector of order @n@.
.PP
A formula for computing an eigen-pair of the matrix on the right-hand side of (11.3) is obtained by noting that if @lambda@ is a root of the equation
.EQ (11.4)
1 ~+~ rho z sup T (D ~-~ lambda I ) sup -1 z ~=~ 0
.EN
and
.EQ (11.5)
q ~=~ theta (D ~-~ lambda I ) sup -1 z
.EN
then @ q ^,^ lambda @ is such an eigen-pair; that is, they satisfy the relation
.EQ
( D ~+~ rho z z sup T ) q ~=~ lambda q.
.EN
In (11.5) the scalar @theta@ may be chosen so that @ || q || ~=~ 1 @ to obtain an orthonormal eigensystem.
.PP
If we write equation (11.4) in terms of the components @ zeta sub j @ of @z@, then @lambda@ must be a root of the equation
.EQ (11.6)
f ( lambda ) ~==~ 1 ~+~ rho sum from {j = 1 } to n { zeta sub j sup 2 } over { delta sub j ~-~ lambda } ~=~ 0 .
.EN
Under certain assumptions this equation has precisely @n@ roots: one in each of the open intervals @ ( delta sub j ~,~ delta sub j+1 ) @, @ j ~=~ 1,2,...,n-1 @, and one to the right of @delta sub n @ if @rho ~>~ 0 @ or to the left of @delta sub 1 @ if @rho ~<~ 0 @.
We construct the eigenvectors corresponding to each of these roots using formula (11.5).
.PP
Because of this structure, an excellent numerical method may be devised to find the roots of equation (11.6) and, as a by-product, to compute the eigenvectors to full accuracy.
It must be stressed, however, that great care must be exercised in the numerical method used to solve the secular equation and to construct the eigenvectors from formula (11.5).
Theoretical results and numerical details may be found in .[[ dongarra sorensen fully parallel .]].
.PP
The static partitioning of this algorithm results in a divide-and-conquer graph with a node corresponding to each split of the original matrix.
There is, of course, a problem with such a scheme, since parallelism is lost as the computation proceeds.
Fortunately, in this problem the computation executing in a given node may be parallelized by splitting the root-finding problem into a number of parallel parts.
Each root is bracketed by the old eigenvalues, and each root may be computed independently.
It is a simple matter, therefore, to divide the roots into groups that will be computed in parallel.
An important numerical aspect of this algorithm, which will not be discussed in detail here, is the reduction of the number of roots to be computed in a given node through a deflation technique.
This deflation is of considerable importance and has a great impact on both the efficiency and the numerical accuracy of the algorithm.
Details are reported in .[[ dongarra sorensen fully parallel .]], but we wish to emphasize here that this aspect of the computation is @problem ~dependent@: the number of roots to be found at any given step can be known only at the time of execution.
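To make the root-finding task within a node concrete, the following sketch locates the single root of equation (11.6) lying in the interval ( delta(j), delta(j+1) ).
It is an illustration only: bisection is used to keep the fragment short, the names \f3secfun\f1 and \f3secrot\f1 are invented, and the production code uses the far more careful iteration and deflation strategy of the reference cited above.
.KS
.nf
      double precision function secfun(n, rho, delta, zeta, xlam)
c     evaluate the secular function of equation (11.6),
c     f(xlam) = 1 + rho * sum( zeta(i)**2 / (delta(i) - xlam) ).
      integer n, i
      double precision rho, delta(n), zeta(n), xlam, s
      s = 0.0d0
      do 10 i = 1, n
         s = s + zeta(i)**2/(delta(i) - xlam)
   10 continue
      secfun = 1.0d0 + rho*s
      return
      end
c
      double precision function secrot(n, rho, delta, zeta, j, tol)
c     find the root of (11.6) in the open interval
c     (delta(j), delta(j+1)), 1 .le. j .le. n-1, by bisection;
c     the extreme root outside (delta(1), delta(n)) needs a
c     different initial bracket.  delta is assumed to be sorted
c     in increasing order.
      integer n, j, it
      double precision rho, delta(n), zeta(n), tol, a, b, xm, secfun
      a = delta(j)
      b = delta(j+1)
      do 20 it = 1, 60
         if (b - a .le. tol) go to 30
         xm = 0.5d0*(a + b)
c        near delta(j) f has the sign of -rho and near delta(j+1)
c        the sign of rho, so the sign of rho*f(xm) tells which
c        half of the interval contains the root
         if (rho*secfun(n, rho, delta, zeta, xm) .lt. 0.0d0) then
            a = xm
         else
            b = xm
         end if
   20 continue
   30 secrot = 0.5d0*(a + b)
      return
      end
.fi
.KE
Calls of this kind are independent of one another, so the roots, and through formula (11.5) the corresponding eigenvectors, can be divided among parallel processes; how many nondeflated roots a node must actually find, however, emerges only at run time.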
Therefore, the number of groups into which the roots can reasonably be divided must be determined at run time, and assigning a parallel process to one of these groups requires dynamic allocation of processes.
These aspects are illustrated in the following graph produced by SCHEDULE.
.KS
.sp 20
.I
.ce
Figure 11.1
.ce
Output from SCHEDULE
.R
.KE
Note that the graph produced by SCHEDULE corresponds very closely to the abstract graph a user might define for this algorithm.
In Figure 11.1 we see a partially executed program.
The square nodes show dynamically spawned processes, while the round nodes show the static partitioning.
A particular point of interest is static node number 5, which had the potential for spawning processes just as the other nodes on that level did, but did not do so because a computation performed at run time indicated that there was not enough work at this node to warrant spawning additional processes.
This graph demonstrates the utility of such a capability, because the maximum number of concurrently executing processes was attained at nearly all levels of the computation.
Without the possibility of dynamically spawned processes, the performance of the parallel algorithm would be very poor because of the loss of concurrency associated with the divide-and-conquer nature of the static partition: without dynamic spawning, the number of concurrent processes decreases by a factor of two at each level of the graph.
This is all the more disappointing when one realizes that the amount of work is usually greatest in the nodes near the root of the tree.
Moreover, with this particular algorithm dynamic spawning is essential, because the amount of work at any given node is proportional to the number of roots to be found at that node, and this number is problem dependent and determinable only at the time of execution.
.PP
Another interesting example is associated with an algorithm for solving a singularly perturbed convection-diffusion PDE.
An algorithm based upon the ideas presented in .[[ Chin Hedstrom .]] has been implemented using SCHEDULE on the Alliant FX/8.
This algorithm solves
.EQ
u sub t ~+~ c(x,t) u sub x ~=~ epsilon DELTA u ~,~~ 0 ~<=~ t ~<~ infinity , ~~ 0 ~<=~ x ~<=~ 1 ~,~~ 0 ~<~ epsilon ~<~ 1
.EN
given appropriate initial data and boundary conditions.
The basis of the method is to use asymptotic analysis of the dominant scales of the equation in the various subdomains depicted in Figure 11.2.
This analysis is made evident by the characteristic coordinate transformation @ (x,t) ~->~ ( xi , tau ) @ with
.EQ
dx over dt ~=~ c(x,t) ~,~~ x(0) ~=~ xi ~,~~ t ~==~ tau ~.
.EN
.KS
.sp 20
.I
.ce
Figure 11.2
.ce
Subdomains of the Convection-Diffusion Equation
.R
.KE
We refer the reader to .[[ Chin Hedstrom .]] for details.
The parallelism is obtained by solving for these characteristics in parallel, block fashion, with the number of characteristics associated with each block tuned to the vector register length of a physical processor; for the Alliant computer this length is 32.
The vector processor was used as an SIMD engine to compute the characteristics in lock step, with the step-size control of the ODE solver applied to each block as a single system of ODEs.
The solver used in this case was the code of Shampine and Allen .[[ Shampine and Allen .]].
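The lock-step computation of a block of characteristics has roughly the following shape.
The fragment below is a schematic only: a fixed-step Euler update stands in for the Shampine and Allen solver and its step-size control, and the names \f3charbk\f1 and \f3cvel\f1 (representing @c(x,t)@) are invented for this illustration.
.KS
.nf
      subroutine charbk(x, t0, t1, nstep, cvel)
c     advance a block of 32 characteristics  dx/dt = c(x,t)
c     from t = t0 to t = t1 in lock step.  schematic only: a
c     fixed-step Euler update replaces the Shampine-Allen solver
c     with step-size control used in the actual code.
      integer nstep, i, j
      double precision x(32), t0, t1, t, h, cvel
      external cvel
      h = (t1 - t0)/nstep
      t = t0
      do 20 j = 1, nstep
c        the inner loop vectorizes, so all 32 characteristics
c        advance together on the vector processor
         do 10 i = 1, 32
            x(i) = x(i) + h*cvel(x(i), t)
   10    continue
         t = t + h
   20 continue
      return
      end
.fi
.KE
A block of this kind is the unit of work handed to each dynamically spawned process in the example discussed next.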
The domain decomposition consisted of two regions in which the solution was taken to be constant along the characteristics, an internal layer in which the full equation was solved, a boundary layer in which two-point boundary value problems were solved, and a region where the internal layer meets the boundary layer.
.PP
In the graphics output from SCHEDULE shown in Figure 11.3, we see this program executing.
Node 4 corresponds to gridding the internal layer with characteristics; note that blocks of 32 characteristics are dynamically spawned there.
Nodes 2 and 3 correspond to the convection-only regions, and blocks of 32 characteristics are spawned there as well.
These processes simply continue on to solve the boundary layer problem as soon as the characteristics meet the boundary at @x ~=~ 1@.
Process 6 solves the full equation in the internal layer, while process 5 grids the region where the internal and boundary layers meet.
Both 5 and 6 must complete before node 7 can execute to solve the equation in the region where the internal and boundary layers meet.
Node 1 is a dummy node used to collect all the results.
.KS
.sp 20
.I
.ce
Figure 11.3
.ce
SCHEDULE Output from Domain Decomposition
.R
.KE
.NH
Conclusions
.PP
The construction of SCHEDULE has evolved naturally during the course of our experience in programming various parallel machines.
We were motivated primarily by the lack of uniformity and the limited capabilities of the facilities offered by vendors for explicit parallel programming.
We are deeply indebted to our colleagues who are active in similar pursuits, and we have used many of their ideas in constructing SCHEDULE.
Although we have been greatly influenced by their work, we feel that our experience with numerical software has guided us in subtly different directions.
It is our view that, at this point in time, the user interface is paramount to the utility of such an effort.
Our goal has been to provide a methodology that is within the grasp of a capable Fortran programmer and to provide a syntax that does not represent a radical departure in appearance from standard Fortran programming.
We have already modified this interface several times on the basis of interaction with users.
The current approach has evolved from a package written entirely in Fortran to one requiring a @C ~-~ Fortran@ interface.
This requirement makes our goal of transportability more difficult to achieve because of the lack of standardization of such an interface.
However, our goal of providing a clean user interface to the parallel capabilities seems to require this mixed-language approach.
.PP
Our experience with SCHEDULE has been encouraging for the most part.
We do not view it as a "solution" to the software problem we face in parallel programming.
However, we do think it will be useful in the short term and perhaps will have some influence on the development of a long-term solution.
.sp
.SH
Acknowledgements
.PP
We would like to thank Terry Disz for introducing us to the graphics environment RT/1 and for his assistance in making the graphics interface work.
.sp
.SH
References
.