\documentstyle[11pt]{article} \hyphenation{time-stamp time-stamps} \begin{document} \title{ParaGraph: A Tool for Visualizing Performance \\ of Parallel Programs \thanks{ This user guide is adapted from a technical report \protect\cite{hea91a}, which also appears in a modified form in \protect\cite{hea91b}. The principal modifications to produce this manual were the omission of illustrations and most bibliographic citations, and the addition of a considerable amount of detail to aid in using and understanding ParaGraph. This edition also reflects many new features recently added to ParaGraph, and is up to date as of June 8, 1994.} } \author{Michael T. Heath, University of Illinois \and Jennifer Etheridge Finger, Oak Ridge National Laboratory} \maketitle {\abstract ParaGraph is a graphical display system for visualizing the behavior and performance of parallel programs on message-passing multicomputer architectures. The visual animation is based on execution trace information monitored during an actual run of a parallel program on a message-passing parallel computer. The resulting trace data are replayed pictorially to provide a dynamic depiction of the behavior of the parallel program, as well as graphical summaries of its overall performance. Many different visual perspectives are provided from which to view the same performance data, in an attempt to gain insights that might be missed by any single view. We describe this visualization tool, outline the motivation and philosophy behind its design, and provide details on how to use it. \endabstract } \clearpage \tableofcontents \clearpage \section{Motivation and Design Philosophy} Graphical visualization is a standard technique for facilitating human comprehension of complex phenomena and large volumes of data. The behavior of parallel programs on advanced computer architectures is often extremely complex, and hardware or software performance monitoring of such programs can generate vast quantities of data. Thus, it seems natural to use visualization techniques to gain insight into the behavior of parallel programs so that their performance can be understood and improved. We have developed such a software tool, called ParaGraph, that provides a detailed, dynamic, graphical animation of the behavior of message-passing parallel programs, as well as graphical summaries of their performance. The purpose of this document is to explain how to use the visualization tool to analyze parallel programs. \subsection{Graphical Simulation} For lack of a better term, we will often use the word ``simulation'' to refer to the graphical animation of a parallel program. The use of this term should not be taken to suggest that there is anything artificial about the programs or their behavior as we portray them. ParaGraph displays the behavior and performance of real parallel programs running on real parallel computers to solve real problems. In effect, ParaGraph simply provides a visual replay of the events that actually occurred when a parallel program was run on a parallel machine. To date, ParaGraph has been used only in such a ``post processing'' manner, using a tracefile created during the execution of the parallel program and saved for later study. But the design of the package does not rule out the possibility that the data for the visualization could be arriving at the graphical workstation as the parallel program executes on the parallel machine. In practice, however, there are major impediments to such real-time performance visualization. 
With the current generation of distributed-memory parallel architectures, it is difficult to extract performance data from the processors and send it to the outside world during execution without significantly perturbing the application program being monitored. Moreover, the network bandwidth between the parallel processor and the graphical workstation, as well as the drawing speed of the workstation, are usually inadequate to handle the extremely high data transmission rates that would be required for real-time display. Finally, even if these other limitations were not a factor, human visual perception would be hard pressed to digest a detailed graphical depiction as it flies by in real time. One of the strengths of ParaGraph is the insight that can be gained from repeated replays of the same execution trace data, much the way ``instant'' replays are used in televised sports events. Program visualization can be thought of in either static or dynamic terms. After a parallel program has completed execution, the tracefile of events saved on disk can be considered as a static, immutable object to be studied by various analytical or statistical means. Some performance visualization packages reflect this philosophy in that they provide graphical tools designed for visual browsing of the performance data from various perspectives using scrollbars and the like. In designing ParaGraph, we have adopted a more dynamic approach whose conceptual basis is algorithm animation. We see the tracefile as a script to be played out, visually re-enacting the original live action of parallel program execution in order to provide insight into the program's dynamic behavior. There are advantages and disadvantages in both static and dynamic approaches. Algorithm animation is good at capturing a sense of motion and change, but it is difficult to control the apparent speed of the simulation. The static ``browser with scrollbars'' approach, on the other hand, gives the user control over the speed with which the data are viewed (indeed, ``time'' can even move backward), but does not provide such an intuitive feeling for the dynamic behavior of parallel programs. In designing ParaGraph, we have opted for the dynamic animation approach, sacrificing some control over simulation speed (as will be discussed in greater detail below). \subsection{Design Goals} In designing ParaGraph, our principal goals were: \begin{itemize} \item ease of understanding, \item ease of use, and \item portability. \end{itemize} We now briefly discuss each of these goals in turn. \subsubsection{Ease of understanding} Since the whole point of visualization is to facilitate human understanding, it is imperative that the visual displays provided be as intuitively meaningful as possible. The charts and diagrams should be aesthetically appealing, and the information they convey should be as self-evident as possible. A diagram is not likely to be useful if it requires an extensive explanation. The type of information conveyed by a diagram should be immediately obvious, or at least easily remembered once learned. The choice of colors used should take advantage of existing conventions to reinforce the meaning of graphical objects, and should also be consistent across views. Above all, it is essential to provide many different visual perspectives, since no single view is likely to provide full insight into the complex behavior and large volume of data associated with the execution of parallel programs. 
ParaGraph in fact provides some twenty-five different displays or views, all based on the same underlying execution trace data. \subsubsection{Ease of use} One of the main purposes of software tools is to relieve tedium, not promote it. Through the use of color and animation, we have tried to make ParaGraph painless, perhaps even entertaining, to use. It certainly seems reasonable that any graphics package should have a graphical user interface. ParaGraph has an interactive, mouse- and menu-oriented user interface so that the various features of the package are easily invoked and customized. Another important factor in ease of use is that the user's parallel program (the object under study) need not be modified extensively to obtain the data on which the visualization is based. ParaGraph currently takes its input data from execution tracefiles produced by PICL (Portable Instrumented Communication Library \cite{gei90a,gei90b}), which enables the user to produce such trace data automatically. We have also tried to keep the user's learning curve for ParaGraph very short, even at the expense of limiting the flexibility of its data processing and graphical display capabilities. Our intent is to require minimal data manipulation and to provide a variety of views that are customized to the task at hand, rather than providing more general data processing and graphics capabilities in a toolkit for constructing views of program behavior. \subsubsection{Portability} There are two senses in which portability is important in the present context. One is that the graphics package itself be portable. ParaGraph is based on the X Window System, and thus runs on a wide variety of scientific workstations from many different vendors. ParaGraph does not require any X toolkit or widget set, as it is based directly on the standard Xlib library, which is available in any distribution of the X Window System. ParaGraph has been tested with all MIT distributions of X11 through X11R5, as well as several vendor-supplied versions of X Windows. Although ParaGraph is most effective in color, it also works on monochrome and grayscale monitors. It automatically detects which type of monitor is in use, but this automatic choice can be overridden by the user, if desired. A second sense in which portability is important is that the package be capable of displaying execution behavior from different parallel architectures and parallel programming paradigms. ParaGraph inherits a high degree of such portability from PICL, which runs on parallel architectures from a number of different vendors (e.g., Cogent, Intel, Meiko, Ncube, Symult, Thinking Machines). On the other hand, many of the displays in ParaGraph are based on a message-passing paradigm, and thus the package does not currently offer support for displaying the behavior of programs based explicitly on shared-memory constructs. Further comments on the programming model supported by ParaGraph are given below. \subsection{Relationship to Previous Work} ParaGraph is certainly not the first software tool to be developed for visualizing parallel programs. Graphical animation techniques for visualizing serial algorithms have received considerable study. Visualization of parallel computations has been the subject of a number of Ph.D. theses, technical articles, and books. 
Graphical visualization has also been an important component of several environments that have been developed for parallel programming, debugging, and monitoring, as well as integrated environments that combine several of these components. Algorithm visualization tools have also been developed for specific applications, such as matrix computations. See \cite{hea91a} for an extensive list of references on much of this prior work. ParaGraph is a general-purpose performance visualization tool that is distinguished from most previous efforts in the following ways: \begin{itemize} \item The multiplicity of displays provided by ParaGraph is unique. Other packages have emphasized the importance of multiple views, but ParaGraph provides a substantially greater variety of perspectives than any other package of which we are aware. Some of the displays we have incorporated into ParaGraph appear to be original, while others have been motivated by similar displays found in previous packages. \item Many previous packages for visualizing parallel programs have targeted a particular parallel architecture and/or been based on a proprietary graphical display system. ParaGraph is applicable to any parallel architecture having message passing as its programming paradigm, and ParaGraph itself is based on the X Window System, which is widely available on workstations from many vendors. \item We have tried to attain high standards in the intuitive appeal and aesthetic quality of the displays provided by ParaGraph, including both the new displays we have devised and the display concepts that we have borrowed from previous packages. Of course, the perceived success of this attempt is in the eye of the beholder and can be judged only by users. \item We have also tried to make ParaGraph exceptionally easy to use, both through its interactive, graphical user interface and by relying on an instrumented communication library (PICL) to provide the requisite trace data without requiring the user to instrument explicitly the parallel program under study. We have also emphasized a short learning curve, minimal data manipulation, and views that are already tuned to the specific task at hand. \item Another unusual feature of ParaGraph is its extensibility. ParaGraph provides a mechanism for users to add new displays of their own design that can be viewed along with the other displays already provided. This capability is intended primarily to support special-purpose displays for particular applications, and is described in more detail below. \end{itemize} An indication of our degree of success in making ParaGraph easy to use and easy to understand is the fact that many users obtained an early version from {\tt netlib@ornl.gov} \cite{don87} over the Internet and were able to build the program at their locations and use it effectively without the benefit of any documentation beyond a one-page README file. For some case studies of performance tuning with ParaGraph, see \cite{hea90,hea93,tom93}. \subsection{Relationship to PICL} PICL is a Portable Instrumented Communication Library \cite{gei90a,gei90b} that is available from {\tt netlib@ornl.gov} and runs on a variety of message-passing parallel architectures. As its name implies, PICL provides both portability and instrumentation for programs that use its communication facilities for passing messages between processors. On request, PICL provides a tracefile that records important events in the execution of the user's parallel program (e.g., sending and receiving messages). 
The tracefile contains one event record per line, and each event record consists of a set of integers that specify the event type, timestamp, processor number, message length, and other similar information. The current format of the tracefile is documented in \cite{wor92}. (Tracefiles in the original PICL format documented in \cite{gei90a} can be converted to the newer format by a conversion utility provided with the ParaGraph distribution.) To obtain further information about PICL, including documentation and source code, send email to {\tt netlib@ornl.gov} containing the message {\tt send index from picl}. ParaGraph has a producer-consumer relationship with PICL: ParaGraph consumes trace data produced by PICL. By using PICL rather than the ``native'' parallel programming interface for a particular machine, the user gains portability, instrumentation, and the ability to use ParaGraph in analyzing the behavior and performance of the parallel program. These benefits are essentially ``free'' in that once the parallel program is implemented using PICL, no further changes are required to the source code to move it to a new machine (provided PICL is available on the target machine), and little or no effort is required to instrument the program for performance analysis. On the other hand, since ParaGraph's dependence on PICL is solely for its input data, ParaGraph could in fact work equally well with any other source of data having the same format and semantics. Thus, other message-passing systems could be instrumented to produce trace data in the format expected by ParaGraph, or else ParaGraph's input routine could be adapted to a different input format. In this manner, ParaGraph can be, and indeed has been, used in conjunction with communication systems other than PICL. Several vendors of parallel computers have instrumented their native messaging systems to produce PICL-compatible tracefiles, which can then be viewed with ParaGraph. For a meaningful simulation, the timestamps of the events should be as accurate and consistent across processors as possible. This is not necessarily easy to accomplish on a machine in which each processor may have its own clock with its own starting time, running at its own rate. Moreover, the resolution of the clock may be inadequate to resolve events precisely. Poor resolution and/or poor synchronization of the processor clocks can lead to ``tachyons'' in the tracefile, that is, messages that appear to be received before they are sent. Such an occurrence will confuse ParaGraph, since much of its logic depends on correctly pairing sends and receives, and will invalidate the information in some of the displays. For this reason, PICL goes to considerable lengths to synchronize the processor clocks, and also to adjust for potential clock drift, so that the timestamps will be as consistent and meaningful as possible \cite{dun91}. On some machines, PICL actually provides a higher-resolution clock than the one supplied by the system vendor. Fortunately, the trend is towards more accurate clocks and centralized clock pulses in distributed parallel machines, so clock consistency should be less of a problem in the future.
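In terms of the matched send and receive timestamps, the tachyon condition is simple to state. The following fragment is a minimal illustration (it is not taken from the ParaGraph source, and the structure layout is assumed purely for this sketch); it presumes that the send and receive events of each message have already been paired:

\begin{verbatim}
/* Hypothetical sketch: flag a "tachyon", i.e., a message whose
   receive timestamp precedes its send timestamp, as can happen
   when processor clocks are poorly synchronized. */
typedef struct {
    double send_time;   /* timestamp of the send event    */
    double recv_time;   /* timestamp of the receive event */
    int    source;      /* sending processor              */
    int    dest;        /* receiving processor            */
} message_t;

static int is_tachyon(const message_t *m)
{
    return m->recv_time < m->send_time;
}
\end{verbatim}

A tracefile for which this test ever succeeds will defeat ParaGraph's pairing of sends with receives, and is best regenerated after improving the clock synchronization.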
Another important issue is the amount of additional overhead introduced by the collection of trace information compared to the execution time of an equivalent uninstrumented program. PICL tries to minimize the perturbation due to tracing by saving the trace data locally in each processor's memory, then downloading it to disk only after the program has finished execution. Nevertheless, such monitoring inevitably introduces some extra overhead; in PICL the clock calls necessary to determine the timestamps for the event records, plus other minor overhead, add a fixed amount (independent of message size) to the cost of sending each message. The overall perturbation is thus a function of the frequency and volume of communication traffic, and it also varies from machine to machine. In general, we believe that this perturbation is small enough that the behavior of parallel programs is not fundamentally altered. It is certainly true that in our experience the lessons we learn from visual study of instrumented runs invariably lead to improved performance of uninstrumented runs. \section{Programming Model} The most fundamental restriction in the parallel programming model currently supported by ParaGraph (and PICL) is the assumption that there is only one user process per processor. Such a programming style is typical of SPMD (single program multiple data) programs for multicomputers and is adequate for the vast majority of applications on these machines. Nevertheless, this restriction is occasionally limiting, and it will probably be relaxed in a future release of ParaGraph. The interprocessor communication model supported by ParaGraph is typical of interrupt-driven, loosely synchronous message passing systems from a variety of vendors. If a message is sent before it has been requested by the receiving user process, then the message is buffered and queued by the communication system. When the receiving user process subsequently requests the message, its contents are then transferred from a system buffer into an array provided by the user process. If a message is requested before it has arrived, then the receiving user process blocks further execution until the requested message arrives, at which point the message is transferred into the receiving array in the user process and execution resumes. In any case, the sending user process resumes execution immediately after the send and does not block. Currently, ParaGraph does not explicitly support fully synchronous communication (in which the sender blocks until the receive is executed) or fully asynchronous communication (in which neither sender nor receiver blocks, and probes must be done to detect whether messages have arrived). It may still be possible for ParaGraph to visualize programs that use such communication, but the visual simulation and the figures of merit produced may not be very accurate. Explicit support for these and other communication protocols may be added to ParaGraph in the future. It is assumed that each message has an integer type or tag. The value for the type is assigned by the sending user process and can be used by the receiving user process to search the queue of incoming messages selectively for a particular kind of message. Such tags can play an important role in synchronization and control of the parallel program. Some message passing systems support ``global sends,'' that is, messages that go from one processor to all other processors as a result of a single send command. Such a global send is usually indicated by a special value, such as -1, for the destination in the usual send function.
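To make this model concrete, the following fragment sketches a typical exchange in the style of PICL's low-level routines. The names {\tt send0} and {\tt recv0} and the argument orders shown are assumptions made purely for illustration; consult the PICL documentation \cite{gei90a} for the actual calling sequences.

\begin{verbatim}
/* Illustrative sketch only; not actual PICL declarations. */
void send0(char *buf, int bytes, int type, int dest);
void recv0(char *buf, int bytes, int type);

void exchange_example(void)
{
    double x[100];

    /* Sender: returns immediately; if the receiver has not yet
       asked for the message, the communication system buffers
       and queues it. */
    send0((char *) x, sizeof(x), 7, 3);  /* type 7, to node 3 */

    /* Receiver: selects a message of type 7 from its input
       queue, blocking until one has arrived. */
    recv0((char *) x, sizeof(x), 7);

    /* "Global send": a special destination value such as -1
       sends the same message to all other processors. */
    send0((char *) x, sizeof(x), 7, -1);
}
\end{verbatim}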
The actual implementation of such global sends varies from one machine to another, depending on the most efficient way of using a given interconnection network topology (e.g., a spanning tree in a hypercube). ParaGraph handles global sends (i.e., send event records with destination -1) as if there were a whole set of separate messages, one for each possible destination processor, with each message having the same source and contents. This may or may not accurately reflect how the given communication system actually implements global sends, but is about all that ParaGraph can do in representing global sends pictorially, given its lack of knowledge of the underlying interconnection network topology. Note that PICL includes a special routine, {\tt bcast0}, for global broadcasting that is optimized for various particular topologies, and this routine in turn generates individual send and receive event records, which are therefore accurately depicted by ParaGraph. For this reason, as well as for greater portability, the user may prefer to use the explicit broadcasting facility of PICL rather than a simple send with a special destination value. \section{Software Design} ParaGraph is an interactive, event-driven program. Its basic structure is that of an event loop and a large switch that selects actions based on the nature of each event. There are in fact two separate event queues: a queue of X events produced by the user (mouse clicks, keypresses, window exposures, etc.) and a queue of trace events produced by the parallel program under study. Thus, ParaGraph must alternate between these two queues to provide both a dynamic depiction of the parallel program and responsive interaction with the user. Menu selections determine the execution behavior of ParaGraph, both statically (initial selection of displays, options, parameter values) and dynamically (pause/resume, single-step mode, etc.). ParaGraph is written in C, and the source code contains about 18,000 lines. The {\tt main} program of ParaGraph calls the {\tt preprocess} function to determine necessary parameters, initializes many variables, allocates graphical resources such as windows and fonts, and then goes into a {\tt while} loop that repeatedly calls the functions {\tt get\_event} and {\tt get\_trace}, which check the X event queue and the trace event queue, respectively, for the next event upon which to act. The {\tt get\_event} routine is simply a switch containing a series of calls to appropriate routines to handle the various X events. The {\tt get\_trace} routine calls {\tt scan} to read a trace event record, and then calls {\tt draw} to update the drawing of the displays that have been selected. The X event queue must be checked frequently enough to provide good interactive responsiveness, but not so frequently as to degrade the drawing speed during the simulation. On the other hand, the trace event queue should be processed as rapidly as possible while the simulation is active, but need not be checked at all if the next possible event must be an X event (e.g., before the simulation starts, after the simulation finishes, when in single-step mode, or when the simulation has been paused and can be resumed only by user input). To address these issues, the alternation between the two queues is not strict. Since not all trace event records produced by PICL are of interest to ParaGraph, it ``fast forwards'' through any series of such ``uninteresting'' records before rechecking the X event queue. 
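The overall control structure just described can be summarized schematically as follows. This is a reconstruction for exposition only, not an excerpt from the ParaGraph source; in particular, the return convention assumed for {\tt get\_trace} is an assumption of the sketch.

\begin{verbatim}
/* Schematic of the alternation between the X event queue and
   the trace event queue (not the actual ParaGraph source). */
#include <X11/Xlib.h>

extern Display *dpy;              /* connection to the X server */
extern int simulation_active;     /* is a replay in progress?   */

extern void get_event(XEvent *);  /* dispatch one X event       */
extern int  get_trace(void);      /* scan one trace record and
                                     update the displays; assume
                                     it returns 0 at end of file */

void event_loop(void)
{
    XEvent ev;
    for (;;) {
        if (simulation_active) {
            /* Drain any pending X events without blocking ... */
            while (XPending(dpy)) {
                XNextEvent(dpy, &ev);
                get_event(&ev);
            }
            /* ... then process the next trace event. */
            if (get_trace() == 0)
                simulation_active = 0;
        } else {
            /* Nothing to replay: block until the user acts. */
            XNextEvent(dpy, &ev);
            get_event(&ev);
        }
    }
}
\end{verbatim}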
Moreover, both blocking and nonblocking calls are used to check the X event queue, depending on the circumstances, so that workstation resources are not consumed unnecessarily when the simulation is inactive. The relationship between the apparent simulation speed and the original execution speed of the parallel program is necessarily somewhat imprecise. The speed of the graphical simulation is determined primarily by the drawing speed of the workstation, which in turn is a function of the number and complexity of displays that have been selected. There is no way, in general, to make the apparent simulation speed uniformly proportional to the original execution speed of the parallel program. The reason is that the time required to compute some event on the parallel machine and the time required to draw some graphical depiction of that event on the workstation screen have no particular correlation with each other. For the most part, ParaGraph simply processes the event records and draws the resulting displays as rapidly as it can. If there are gaps between consecutive timestamps, however, the intervening time is ``filled in'' by a spin loop so that there is at least a rough (but not uniform) correspondence between simulation time and original execution time. Fortunately, this issue does not seem to be of critical importance in visual performance analysis. The most important consideration in understanding parallel program behavior is simply that the correct relative order of events be preserved in the graphical replay. Moreover, the figures of merit produced by ParaGraph are based on the actual timestamps, not the apparent speed with which the simulation unfolds. Since ParaGraph's speed of execution is determined primarily by the drawing speed of the workstation, it can be slowed down or speeded up by selecting more or fewer displays, and by the options used within those displays (e.g., jump vs. smooth scrolling). For a given fixed configuration of displays and options, there is no way to speed up the simulation, since it is already drawing as fast as it can. However, a ``slow-motion'' slider is provided for precise control over the simulation speed if slower execution is desired without resorting to opening additional displays. Further, one can always use single-step mode if arbitrarily slow drawing speed is desired for very close study of program behavior. \section{Using ParaGraph} ParaGraph supports the following command-line options: \begin{itemize} \item[{\tt -c}] to specify color display mode, \item[{\tt -d}] to specify a hostname and screen (e.g., {\tt hostname:0.0}) for remote display across a network, \item[{\tt -e}] to specify an environment file (default: {\tt .pgrc}), \item[{\tt -f}] (or no switch) to specify the path name of a tracefile directory or of an individual tracefile, \item[{\tt -g}] to specify grayscale display mode, \item[{\tt -l}] to specify an animation layout file (default: {\tt .pganim}), \item[{\tt -m}] to specify monochrome display mode, \item[{\tt -n}] to specify a name for the base window (default: {\tt ParaGraph}), \item[{\tt -o}] to specify a file defining an alternate order and/or optional names for the processors (default: {\tt .pgorder}), \item[{\tt -r}] to specify a file of RGB color values for use in color-coding user-defined tasks (default: {\tt .pgcolors}). \end{itemize} It is not normally necessary to specify the display mode (color, grayscale, or monochrome), as ParaGraph by default will detect the most appropriate choice for the workstation in use. 
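When options are needed, an invocation that loads a saved environment file and forces monochrome mode might look like the following (the executable name {\tt paragraph} and the file names are hypothetical):

\vspace{0.5cm} {\tt \% paragraph -m -e myenv.pgrc run42.trf} \vspace{0.5cm}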
Overriding this automatic choice of display mode can be useful, however, for making black-and-white hardcopies from a color screen or to accommodate workstations with multiple screens of different types. The environment file, if present, defines the initial selection of displays and options with which ParaGraph begins execution. Typically, such an environment file is created and saved during a previous ParaGraph session. Specifying a unique name for the base window (i.e., the main menu window) is useful for distinguishing among multiple instances of ParaGraph when running them simultaneously. These and other options are explained in greater detail below. The tracefile can be specified on the command line, or it can be selected using the {\tt tracefile} submenu available from the main menu. The tracefile directory can be specified on the command line, or it can be entered (or changed) during execution by typing the path name in the subwindow provided in the {\tt tracefile} menu. If the path name of a tracefile is specified on the command line, then the directory portion of that path name is taken as the tracefile directory. Once a tracefile has been selected, ParaGraph preprocesses the tracefile to determine relevant parameters automatically (e.g., time scale, number of processors) before the graphical simulation begins; most of these values can be overridden by the user, if desired, by using the {\tt options} menu. Faulty tracefiles are usually detected during the preprocessing stage, in which case ParaGraph will issue an error message and terminate before going into the graphical simulation. To produce the necessary trace records for optimal use in ParaGraph, tracing in PICL should be done with {\tt tracelevel(4,4,0)}. For graphical animation with ParaGraph, the tracefile needs to be sorted into time order. Since the tracing option of PICL produces a tracefile with records in node order, the necessary reordering can be accomplished with the Unix sort command: \vspace{0.5cm} {\tt \% sort +2n -3 +0n -1 +1rn -2 tracefile.raw > tracefile.trf} \vspace{1.0cm} By default, ParaGraph initially displays only its main menu, which contains buttons for controlling execution and for selecting various additional menus. The submenus available include those for three types, or families, of displays ({\tt utilization}, {\tt communication}, and {\tt tasks}), an {\tt other} menu of miscellaneous additional displays, a {\tt tracefile} menu for selecting a tracefile, an {\tt options} menu for specifying various options and parameters, and a {\tt record options} menu for selecting displays that are to produce numerical output to files, if desired. As many displays can be selected as will fit on the screen; the displays can be resized within reasonable bounds. Although it is difficult to pay close attention to many displays at once, it is still useful to have several available simultaneously for comparison and selective scrutiny with repeated replays. Many of the displays have various options that can be selected by clicking on the appropriate subwindow button. Pressing the right or middle mouse button cycles forward through the choices, while pressing the left mouse button cycles backward. The selection of displays, their sizes, their locations on the screen, and the options in effect can be saved in an environment file for use in subsequent ParaGraph sessions, as explained below. The {\tt tracefile} menu provides a graphical means for browsing a directory from which to select a desired tracefile.
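(A note on the Unix {\tt sort} command shown earlier in this section: the {\tt +pos -pos} style of key specification is obsolete, and many modern implementations accept only the POSIX {\tt -k} syntax. An equivalent modern invocation, our translation of the command above with the same key fields, is \vspace{0.5cm} {\tt \% sort -k3,3n -k1,1n -k2,2rn tracefile.raw > tracefile.trf} \vspace{0.5cm} which sorts numerically on the third field of each event record, breaking ties with the first field and then with the second field in reverse numeric order.)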
If a directory path name is supplied on the command line when invoking ParaGraph, then it will appear in the {\tt path name} subwindow; otherwise, the current working directory when ParaGraph is invoked is taken as the default directory. A new path name can be typed into the {\tt path name} subwindow of the {\tt tracefile} menu at any time. The filenames in the given directory are displayed, and the user selects the desired tracefile (or another directory) by clicking the mouse pointer on the corresponding name. If the name selected is a directory, then it becomes the new tracefile directory and the files it contains are displayed. If the name selected is a tracefile, then the filename is highlighted in reverse video, and the corresponding tracefile is processed by ParaGraph. A new tracefile can be selected at any time simply by clicking again on a different filename. Note that the directory may contain the names of files that are not in fact legitimate tracefiles. It is the responsibility of the user to select only valid tracefiles for processing by ParaGraph. You may wish to adopt some standard filename suffix, such as {\tt .trf}, to help distinguish tracefiles from other files. To provide greater selectivity in listing filenames, a {\tt pattern} subwindow is provided that supports the wildcard characters {\tt *}, which stands for any string, and {\tt ?}, which stands for any single character. The pattern can be changed by typing in the {\tt pattern} subwindow. Only those filenames in the current tracefile directory that fit the given pattern are displayed for possible selection. Typical patterns might be {\tt *.trf} or {\tt run??}. After selecting the desired displays, options, and tracefile, the user presses {\tt start} to begin the graphical simulation of the parallel program based on the tracefile specified. The animation proceeds straight through to the end of the tracefile, but it can be interrupted for detailed study by use of the {\tt pause/resume} button. Repeated use of this button alternates between pausing and resuming the simulation. For even more detailed study, the {\tt step} button provides a single-step mode that processes the tracefile one event (or a user-specified number of events) at a time. A particular time interval can be singled out for study by specifying starting and stopping times (the defaults are the beginning and ending of the tracefile), or the simulation can be optionally stopped each time a user-specified event occurs in the tracefile. The entire animation can be restarted at any time (whether in the middle or at the end of the tracefile) simply by pressing the {\tt start} button again. Most of the displays show program behavior dynamically as individual events occur, but some show only overall summary information at the end of the run (a few displays serve both purposes, as will be discussed below). The {\tt slow motion} button opens a window containing a ``slider'' for controlling the simulation speed. Clicking or dragging the mouse cursor along the slider slows down execution as much as desired. The position of the slider can be altered dynamically during the animation, and such changes take effect immediately. The {\tt save env} button causes a record of the current screen configuration and the various option settings in ParaGraph to be written in a file, so that, if desired, the same selection of displays and options can be established immediately upon subsequent invocations of ParaGraph. 
By default, the environment file is called {\tt .pgrc}, but a different name can be specified using the {\tt -e} command-line option. The screen locations of all displays are among the information saved in the environment file, but this placement may or may not be honored by a given window manager. Regardless of user requests, some window managers insist on interactive placement of windows and others insist on choosing their own locations beyond the user's control. The {\tt open env/close all} button alternately opens whatever set of displays is specified in the current environment file, or closes all currently open displays except the main menu. The intent is to allow for quick reconfiguration of displays, including reestablishment of the initial setup, without having to close or open many windows individually or restart ParaGraph. The {\tt screen dump} button enables any window (e.g., a single display or the entire screen) to be printed on a hardcopy output device, usually a laser printer. After pressing the {\tt screen dump} button, a particular window is selected for printing by clicking the mouse with the cross-hairs cursor in the desired location. The appropriate local command for routing the resulting screen dump to a suitable output device must appear in the {\tt print command} subwindow of the {\tt options} menu (see below). The {\tt reset} button clears all displays and returns to the beginning of the current tracefile, without restarting the animation. The {\tt quit} button terminates ParaGraph. \section{Displays} In this section we describe the individual displays provided by ParaGraph. For color illustrations of many of the displays, see \cite{hea91a,hea91b}. Some of the displays change in place dynamically as events occur, with execution time in the original run represented by simulation time in the replay. Others depict time evolution by representing execution time in the original run by one space dimension on the screen. The latter displays scroll as necessary (by a user-controllable amount) as simulation time progresses, in effect providing a moving window for viewing what could be considered a static picture. No matter which representation of time is used, all displays of both types are updated simultaneously and synchronized with each other. As stated earlier, most of the displays fall into one of three basic categories -- utilization, communication, and task information -- although some displays contain more than one type of information, and a few do not fit these categories at all. Below we provide brief descriptions of the displays. Most of the displays scale to reasonably large numbers of processors, but a few contain too much detail to scale up well. The current limit for most of the displays is 512 processors; the few exceptions are noted specifically below. \subsection{Utilization Displays} The displays described in this section are concerned primarily with processor utilization. They are helpful in determining the effectiveness with which the processors are used and how evenly the computational work is distributed across the processors. \subsubsection{Utilization Count} This display shows the total number of processors in each of three states -- busy, overhead, and idle -- as a function of time. The number of processors is on the vertical axis and time is on the horizontal axis, which scrolls as necessary as the simulation proceeds.
The color scheme used is borrowed from traffic signals: green (go) for busy, yellow (caution) for overhead, and red (stop) for idle. By convention, we show green at the bottom, yellow in the middle, and red at the top along the vertical axis. At any given time, ParaGraph categorizes each processor as {\em idle} if it has suspended execution awaiting a message that has not yet arrived (or if it has ceased execution at the end of the run), {\em overhead} if it is executing in the communication subsystem (but not awaiting a message), and {\em busy} if it is executing some portion of the program other than the communication subsystem. Since the three categories are mutually exclusive and exhaustive, the total height of the composite is always equal to the total number of processors. Ideally, we would like to interpret {\em busy} as meaning that a processor is doing useful work, {\em overhead} as meaning that a processor is doing work that would be unnecessary in a serial program, and {\em idle} as meaning that a processor is doing nothing. Unfortunately, the monitoring required to make such a determination would almost certainly be nonportable and/or excessively intrusive. Thus, the ``busy'' time we report may well include redundant work or other work that would not be necessary in a serial program, since our monitoring detects only overhead associated with communication. However, we find that the definitions we have adopted based on the data provided by PICL are quite adequate in practice to convey the effectiveness of parallel programs pictorially. \subsubsection{Gantt Chart} This display, which is patterned after graphical charts used in industrial management, depicts the activity of individual processors by a horizontal bar chart in which the color of each bar indicates the busy/overhead/idle status of the corresponding processor as a function of time, again using the traffic-signal color scheme. Processor number is on the vertical axis and time is on the horizontal axis, which scrolls as necessary as the simulation proceeds. The Gantt chart provides the same basic information as the Utilization Count display, but on an individual processor, rather than aggregate, basis; in fact, the Utilization Count display is simply the Gantt chart with the green sunk to the bottom, the red floated to the top, and the yellow sandwiched between. \subsubsection{Kiviat Diagram} This display, which is adapted from related graphs used in other types of performance evaluation, gives a geometric depiction of the utilization of individual processors and the overall load balance across processors. Each processor is represented by a spoke of a wheel. The recent average fractional utilization of each processor determines a point on its spoke, with the hub of the wheel representing zero (completely idle) and the outer rim representing one (completely busy). Taken together, the points for all the processors determine the vertices of a polygon whose size and shape give a pictorial indication of both processor utilization and load balance across processors. Low utilization causes the polygon to be concentrated near the center, while high utilization causes the polygon to lie near the perimeter. Poor load balance across processors causes the polygon to be strongly skewed or asymmetric. Any change in load balance is clearly shown pictorially; for example, with many ring-oriented algorithms the moving polygon has the appearance of a rotating camshaft as the heavier workload moves around the ring. 
Other algorithms show a rhythmic oscillation of the polygon, much like a systolic ``heartbeat.'' The current utilization is shown in dark shading, while the ``high water mark'' seen thus far is shown in lighter shading. Since the Kiviat polygon may not be convex, and the high water mark for different processors may occur at different times, the outer figure may not have simple straight sides connecting the spokes. The ``current'' utilization used in this diagram is in fact a moving average over a time interval of user-specified width, since instantaneous utilization would of course always be either zero or one for each processor. The width of this smoothing interval can be changed via the {\tt options} menu. A button is provided for the user to choose whether the utilization plotted includes only busy time, or both busy and overhead (i.e., not idle). \subsubsection{Streak} This display is based loosely on newspaper listings of team sports standings that often include data on winning and losing streaks. Processor numbers are on the horizontal axis, and the length of the current streak for each processor is on the vertical axis. Busy is always considered winning and idle is always considered losing. Overhead (perhaps analogous to ties in sports) can be lumped in with either winning or losing, as selected using the button provided (so that the streaks might be more accurately termed undefeated or winless, respectively). By convention, winning streaks rise from the horizontal axis and losing streaks fall below the horizontal axis. This distinction is further emphasized by color coding (green for winning and red for losing). As the current streak for each processor grows, the corresponding vertical bar rises (or falls) from the horizontal axis. When a streak for a processor ends, its bar returns to the horizontal axis to begin a new streak. At the end of the run, the longest streaks (both winning and losing) at any point during the run are shown for each processor. This display often provides insight into rhythmic patterns in parallel programs or imbalances across processors. \subsubsection{Utilization Summary} This display shows the cumulative percentage of time, over the entire run, that each processor spent in each of the three busy/overhead/idle states. The percentage of time is shown on the vertical axis and the processor number on the horizontal axis. Again, the green/yellow/red color scheme is used to indicate the three states. In addition to giving a visual impression of the overall efficiency of the parallel program, this display also gives a visual indication of the load balance across processors. \subsubsection{Utilization Meter} This display uses a colored vertical bar, with the usual green/yellow/red color scheme, to indicate the percentage of the total number of processors that are currently in each of the three busy/overhead/idle states. The visual effect is similar to that of a thermometer or some automobile speedometers. This display provides essentially the same information as the Utilization Count display, but saves screen space (which may be needed for other displays) by changing in place rather than scrolling with time. \subsubsection{Concurrency Profile} For each possible number of processors $k$, $0 \leq k \leq p$, where $p$ is the maximum number of processors for this run, this display shows the percentage of time during the run that {\em exactly} $k$ processors were in a given state (i.e., busy/overhead/idle). 
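In symbols, one plausible formalization is the following: if $s_i(t)$ denotes the state of processor $i$ at time $t$ and $T$ is the total duration of the run, then the height plotted at $k$ for state $s$ is \[ P_s(k) \, = \, \frac{100}{T} \int_0^T \delta \Bigl( k , \; \bigl| \{ \, i : s_i(t) = s \, \} \bigr| \Bigr) \, dt , \] where $\delta(a,b)$ is $1$ if $a = b$ and $0$ otherwise, and $| \cdot |$ denotes the number of elements of a set.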
The percentage of time is shown on the vertical axis and the number of processors $k$ is shown on the horizontal axis. The profile for each possible state is shown separately, and the user can cycle through the three states by clicking the mouse on the appropriate subwindow. This display is defined only at the end of the run. The actual concurrency profile for real programs shown by this display is usually in marked contrast to the idealized conditions that are the basis for Amdahl's Law, where the concurrency profile is assumed to be bimodal, with nonzero values at $k = 1$ and $k = p$ and zero elsewhere (i.e., at any given time the computational work is either strictly serial or fully parallel). \subsection{Communication Displays} The displays described in this section are concerned primarily with depicting interprocessor communication. They are helpful in determining the frequency, volume, and overall pattern of communication, and whether there is congestion in the message queues or on the links of the interconnection network. \subsubsection{Communication Traffic} This display is a simple plot of the total traffic in the communication system (interconnection network and message buffers) as a function of time. The curve plotted is the total of all messages that are currently pending (i.e., sent but not yet received), and can be optionally expressed either by message count or by volume in bytes. The communication traffic shown can also optionally be either the aggregate over all processors or just the messages pending for any individual processor the user selects. Message volume or count is shown on the vertical axis, and time is shown on the horizontal axis, which scrolls as necessary. \subsubsection{Spacetime Diagram} This display is patterned after the diagrams used in physics, particularly in relativity theory, to depict interactions between particles through space and time. This type of diagram has been used by Lamport for describing the order of events in a distributed computing system. The same pictorial concept was used over a century ago to prepare graphical railway schedules. In our adaptation of the Spacetime Diagram, processor number is on the vertical axis, and time is on the horizontal axis, which scrolls as necessary as time proceeds. Processor activity (busy/idle) is indicated by horizontal lines, one for each processor, with the line drawn solid if the corresponding processor is busy (or doing overhead), and blank if the processor is idle. Messages between processors are depicted by slanted lines between the sending and receiving processor activity lines, indicating the times at which each message was sent and received. These sending and receiving times are from user process to user process (not simply the physical transmission time), and hence the slopes of the resulting lines give a visual indication of how soon a given piece of data produced by one processor was needed by the receiving processor. The communication lines are color coded according to the Color Code display (see below). Each message line is drawn when its receive time has been reached, so this display may appear to be ``behind'' other displays that depict messages as soon as the send event is encountered. The Spacetime Diagram is one of the most informative of all the displays, since it depicts both individual processor utilization and all message traffic in full detail. 
For example, it can easily be seen which particular message ``wakes up'' an idle processor that was previously blocked awaiting its arrival. Unfortunately, this fine level of detail does not scale up well to large numbers of processors, as the diagram becomes extremely cluttered, and its current limit is 128 processors. \subsubsection{Message Queues} This display depicts the size of the queue of incoming messages for each processor by a vertical bar whose height varies with time as messages are sent, buffered, and received. The processor number is shown on the horizontal axis. At the user's option, the queue size can be measured either by the number of messages or by their total length in bytes. The input queue size for a given processor is incremented each time a message is sent to that processor, and decremented each time the user process on that processor receives a message. On most message-passing parallel systems, the physical transmission time between processors is negligible compared to the software overhead in handling messages, so that the time interval between the send and receive events is a reasonable approximation to the time a given message actually spends in the destination processor's input queue. Of course, depending on message types, the messages may not be received in the same order in which they arrive for queuing, so the queues may grow and shrink in complicated ways. As before, dark shading depicts the current queue size on each processor, and lighter shading indicates the ``high water mark'' seen so far. The Message Queue display gives a pictorial indication of whether there is communication congestion in a parallel program (i.e., whether messages are accumulating in the input queue), or the messages are being consumed at about the same rate as they arrive. Of course, it is best if messages arrive slightly before they are actually needed, so that the receiving processor does not become idle awaiting a message. But a large backlog of incoming messages can consume excessive buffer space, so a happy medium (analogous to ``just in time'' manufacturing) is desirable. \subsubsection{Communication Matrix} In this display, messages are represented by squares in a two-dimensional array whose rows and columns correspond to the sending and receiving processors, respectively, for each message. During the simulation, each message is depicted by coloring the appropriate square at the time the message is sent, and erasing it at the time the message is received. The color used is determined by the Color Code display (see below). Thus, the durations and overall pattern of messages are depicted by this display. The nodes can be ordered along the axes in either natural, Gray code, or user-defined order, and the choice may strongly affect the appearance of the communication pattern. At the end of the simulation, the Communication Matrix display shows the cumulative statistics (e.g., communication volume) for the entire run between each pair of processors, depending on the particular choice of color code. \subsubsection{Communication Meter} This display uses a vertical bar to indicate the percentage of maximum communication volume (or number of messages) currently pending (i.e., sent but not yet received). This display provides essentially the same information as the Communication Traffic display, but saves screen space (which may be needed for other displays) by changing in place rather than scrolling with time. 
Conceptually, this thermometer-like display is similar to the Utilization Meter display, except that it shows communication instead of utilization, and the two are interesting to observe side by side. \subsubsection{Animation} In this display, the parallel system is represented by a graph whose nodes (depicted by numbered circles) represent processors, and whose arcs (depicted by lines between the circles) represent communication between processors. The status of each node (busy, idle, sending, receiving) is indicated by its color, so that the circles can be thought of as the ``front-panel lights'' of the parallel computer. A line is drawn between the source and destination processors when each message is sent, and erased when the message is received. Thus, both the colors of the nodes and the connectivity of the graph change dynamically as the simulation proceeds. The lines represent the logical communication structure of the parallel program and do not necessarily reflect the actual interconnectivity of the underlying physical network. In particular, the possible routing of messages through intermediate nodes is not depicted unless the program being visualized does such forwarding explicitly. The nodes can be arranged in ring, mesh, or user-defined configurations by clicking on the appropriate subwindow. For the mesh, the user can also select the desired aspect ratio and row-wise or column-wise numbering by clicking on the appropriate buttons. In addition, the nodes can be arranged in natural, Gray code, or user-defined order, and the user's choice may strongly affect the appearance of the communication pattern among processors. The various arrangements of the nodes are merely pictorial conveniences, and do not necessarily imply anything about the structure of the underlying interconnection network topology on which the parallel program was run. If a user-defined layout is selected, then the processors can be placed arbitrarily within the display by using the mouse. Initially, the nodes are arranged in a default layout. A given node can be moved anywhere within the window by first clicking on the chosen node to select it, and then clicking again at the desired new location. If desired, a layout determined in this way can be saved in a file for future use. The default name for such an animation layout file is {\tt .pganim}, but a different name can be specified using the {\tt -l} command-line option or by typing the name into the appropriate subwindow when using the {\tt read file} or {\tt write file} options of the display. Note that various combinations of states are possible for the sending and receiving processors on either end of a message line. For example, both processors could be busy, one having already sent the message and resumed computing, while the other has not yet stopped computing to receive the message. Upon conclusion, this display shows a summary of all (logical) communication links used throughout the run. Because of its level of detail, this display is currently limited to depicting up to 128 processors. \subsubsection{Hypercube} This display is similar to the Animation display, except that it provides a number of additional layouts for the nodes in order to exhibit more clearly communication patterns corresponding to the various networks that can be embedded in a hypercube. 
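In reading this display and the next, it may help to recall that two nodes are directly linked in a hypercube precisely when their binary labels differ in a single bit. This test has a classic one-line form in C (shown purely for illustration; it is not code from ParaGraph):

\begin{verbatim}
/* Nodes x and y are joined by a physical hypercube link
   exactly when their labels differ in a single bit. */
int hypercube_link(int x, int y)
{
    int d = x ^ y;                        /* differing bits  */
    return d != 0 && (d & (d - 1)) == 0;  /* exactly one bit */
}
\end{verbatim}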
Note that this display does not require that the interconnection network of the machine on which the parallel program executed actually be a hypercube; it merely highlights the hypercube structure as a matter of potential interest. The scheme for coloring nodes and drawing arcs is the same as that for the Animation display, except that curved arcs are often used to avoid, as much as possible, intersecting other nodes. To help the user of a hypercube to determine if the network's physical connectivity is correctly honored by the communication in the parallel program, message arcs corresponding to genuine physical hypercube links are drawn in a different color from message arcs along ``virtual'' links that do not exist in a hypercube and therefore entail indirect routing through intermediate processors. If the actual number of processors is not a power of two, then any ``unused'' nodes in the selected layout are indicated by black shading. Upon conclusion, this display shows a summary of all (logical) communication links used throughout the run. Unfortunately, the method used to draw this rather elaborate display does not scale up well to large numbers of processors, so it is currently limited to 16. \subsubsection{Network} This display depicts interprocessor communication in terms of various network topologies. Unlike the Animation and Hypercube displays, the Network display shows the actual path that each message takes, which may include routing through one or more intermediate nodes between the original source and ultimate destination. Obviously, depicting message routing through a network requires a knowledge of the interconnection topology. The user selects from among several of the most common interconnection networks, each of which may also have a choice of routing schemes. Some of the available topologies are represented as multistage networks, with duplicate sets of source and destination nodes, between which are several ``stages'' of nodes or switches through which intermediate routing occurs. Networks depicted in this manner include butterfly, hypercube, omega, baseline, and crossbar. Other available topologies are represented by a single set of nodes that serve as both sources and destinations, with messages moving in either direction through the network. Networks depicted in this manner include binary tree, quadtree, and mesh. Each physical link in the network is color coded according to the number of messages currently sharing that link. A temperature-like color code is used, so that ``hot spots'' appear red while less heavily used links appear blue. In monochrome, the message count on a link is indicated instead by the line width, so that, for example, the tree networks look like ``fat'' trees, as the message count tends to be higher nearer the root. Unlike the Animation or Hypercube displays, in the Network display the sending or receiving of a message does not always cause the drawing or erasure of a given link, but will often merely change its color to be one step hotter or cooler than it was previously. A given message may use several links, causing each link involved to be incremented or decremented separately. On conclusion, the coloring of the links indicates the cumulative message count over the entire run, and the color-code legend is recalibrated accordingly to indicate the range of cumulative totals for the various links. 
The network and routing scheme (and, for the mesh, the aspect ratio and row-wise or column-wise ordering) are selected by clicking on the appropriate subwindow. The choice of network topology and routing scheme need not match those of the machine on which the parallel program actually ran, but the representation is obviously most accurate if they do match. On the other hand, one might want to choose a different network deliberately in order to get some idea how a program that ran on one topology might perform on a different topology. Thus, for example, the user of an Intel Paragon (mesh) can see a visual simulation of how his program might behave on a Thinking Machines CM-5 (quadtree), or vice versa. The node numbers of the peripheral nodes are always shown, but to avoid excessive clutter amid the message lines, by default the interior node numbers are omitted in the multistage and mesh networks. All of the node numbers can be shown, however, by clicking on the appropriate option subwindow. Due to its high degree of detail, this display is currently limited to 128 processors. \subsubsection{Node Data} This display provides, in graphical form, detailed communication data for any single processor the user selects. The choices of data plotted are the source/destination, type, length, and distance traveled for all messages sent to or from the chosen processor. The length of a message is in bytes, and the distance traveled is in hops from source to destination as determined by the distance function chosen using the {\tt options} menu. Time is on the horizontal axis, and the chosen statistic is on the vertical axis, with incoming and outgoing messages shown in separate subwindows. This display is helpful in analyzing communication behavior in detail, especially in perceiving trends or patterns in the communication structure that improve understanding of program behavior and performance. It has been used as an aid in designing ``synthetic programs,'' which are simple programs that mimic the behavior and performance of much more complex programs, and are useful for performance modeling and benchmarking. This display is currently limited to depicting 256 processors. \subsubsection{Color Code} This display permits the user to select which statistic will determine the color code for coloring the messages in displays such as Spacetime and Communication Matrix. The choices include the size of the message in bytes, the distance between the source and destination nodes (according to the distance function chosen using the {\tt options} menu), and the message type. Clicking on the subwindow cycles through the choices, and the resulting color code is displayed to enable proper interpretation of the other displays that use it. \subsection{Task Displays} The displays we have considered thus far depict a number of important aspects of parallel program behavior that help in detecting performance bottlenecks. However, they contain no information indicating the location in the parallel program at which the observed behavior occurs. To remedy this situation, we considered a number of automated approaches to providing such information (e.g., picking up line numbers in the source code from the compiler), but all of these encounter nasty practical difficulties (such as dealing with multiple source files). Thus, we reluctantly made an exception to our rule that the user need do nothing to instrument the parallel program under study in order to use ParaGraph.
We developed a number of ``task'' displays that use information provided by the user, with the help of PICL, to depict the portion of the user's parallel program that is executing at any given time. Specifically, the user defines ``tasks'' within the program by using special PICL routines to mark the beginning and ending of each task and assign it a user-selected, nonnegative task number. The scope of what is meant by a task is left entirely to the user: a task can be a single line of code, a loop, an entire subroutine, or any other unit of work that is meaningful in a given application. For example, in matrix factorization one might define the computation of each column to be a task, and assign the column number as the task number. Tasks are defined simply by calling PICL's {\tt traceblockbegin} and {\tt traceblockend} routines, with the desired task number as argument, immediately before and after the selected section of code. This causes PICL to produce event records that are interpreted appropriately by ParaGraph to depict the given task, using displays to be described in this section. We emphasize that task definitions are required {\em only} if the user wishes to view the task displays. If the tracefile contains no event records defining tasks, then the task displays will simply be blank, but the remaining displays in ParaGraph will still show their normal information. Tasks can be nested, one inside another, but if so, they must be properly bracketed by matching task begin and end records. More than one processor can be assigned the same task (or, more accurately, each processor can be assigned its own portion of the same task); indeed, the model we have in mind is that all processors collaborate on each task, rather than that each task is assigned to a single processor. In many contexts, such as the matrix factorization example mentioned above, there is a natural ordering and corresponding numbering of the tasks in a parallel program. In most of the task displays described below, the task numbers are indicated by a color coding. Since the number of tasks may be larger than the number of colors that can be easily distinguished, we recycle a limited number of colors to depict successive task numbers. We use a maximum of 64 different colors for indicating individual tasks. To aid in distinguishing consecutively numbered tasks (the most common case), we stride through these 64 colors in groups of eight rather than in strict rainbow sequence. If desired, the user can override these default task colors by supplying a file containing up to 64 sets of RGB values. The default name for such a file is {\tt .pgcolors}, but an alternative filename can be specified on the command line by using the {\tt -r} option. Each line of the color file contains four integers, the first of which is the color number (0-63) that is being replaced, and the other three are the R, G, and B values (0-255) for the substitute color. For example, a line containing {\tt 3 255 0 0} would replace color 3 with pure red. In monochrome mode, stipple patterns are used to distinguish tasks. The eight different stipple patterns available are recycled as needed for larger numbers of tasks.
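As a concrete illustration of the task mechanism just described, the following sketch shows how the matrix factorization example mentioned above might be instrumented in C. The names {\tt ncols} and {\tt factor\_column} are hypothetical placeholders for application code, and the task routines are assumed, as described above, to take the task number as their sole argument; the PICL manuals give the exact calling sequences.
\begin{verbatim}
/* Hypothetical sketch: define one task per column of the
   factorization.  traceblockbegin/traceblockend are the PICL
   routines described above, assumed here to take the task number
   as their sole argument; ncols and factor_column are
   placeholders for application code. */
int j;
for (j = 0; j < ncols; j++) {
    traceblockbegin(j);    /* task j begins: column j     */
    factor_column(j);      /* the actual work of the task */
    traceblockend(j);      /* task j ends                 */
}
\end{verbatim}
When the resulting tracefile is replayed, the task displays described below color each processor's activity according to the column it is currently factoring.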
\subsubsection{Task Count} During the simulation, this display shows the number of processors that are executing a given task at the current time. The number of processors is shown on the vertical axis and the task number is shown on the horizontal axis. At the end of the run, this display changes to show a summary over the entire run. Specifically, it shows the average number of processors that were executing each task over the lifetime of that task (i.e., the time interval starting when the first processor began the task and ending when the last processor finished the task). \subsubsection{Task Gantt} This display depicts the task activity of individual processors by a horizontal bar chart in which the color of each bar indicates the current task being executed by the corresponding processor as a function of time. Processor number is on the vertical axis and time is on the horizontal axis, which scrolls as necessary as the simulation proceeds. This display can be compared with the Utilization Gantt chart to correlate busy/overhead/idle status with the task information. \subsubsection{Task Status} In this display the tasks are represented by a two-dimensional array of squares, with task numbers filling the array in row-wise order. Initially, all of the squares are white. As each task is begun, its corresponding square is lightly shaded to indicate that the task is now in progress. When a task is subsequently completed, its corresponding square is then darkly shaded. \subsubsection{Task Summary} This display, which is defined only at the end of the simulation run, indicates the duration of each task (from earliest beginning to last completion by any processor) as a percentage of the overall execution time of the parallel program, and furthermore places the duration interval of each task within the overall execution interval of the parallel program. The percentage of the total execution time is shown on the vertical axis, and the task number is shown on the horizontal axis. \subsection{Other Displays} In this section we describe some additional displays that either do not fit into any of the three categories above or else cut across more than one category. \subsubsection{Clock} This display provides both digital and analog clock readings during the graphical simulation of the parallel program. The current simulation time is shown as a numerical reading, and the proportion of the full tracefile that has been completed thus far is shown by a colored horizontal bar. The clock reading is updated synchronously with the other displays, and it ``ticks'' through all integral time values, not just those that happen to correspond to event timestamps. \subsubsection{Trace} This is a non-graphical display that prints an annotated version of each trace event as it is read from the tracefile. It is primarily useful in the single-step mode for debugging or other detailed study of the parallel program on an event-by-event basis. Although the trace records are drawn in this display one at a time, space is allowed to show several consecutive trace records in context, and the display scrolls vertically as necessary with time. By default, all trace events are printed, but events can be printed selectively by node or type by changing the appropriate setting in the {\tt options} menu. \subsubsection{Statistical Summary} This is a non-graphical display that gives numerical values for various statistics summarizing processor utilization and communication, both for individual processors and aggregated over all processors.
The data provided include percentage of busy, overhead, and idle time; total count and volume of messages sent and received; maximum queue size; and maxima, minima, and averages for the size, distance traveled (according to the distance function chosen using the {\tt options} menu), transit time, and overhead incurred for both incoming and outgoing messages. While this tabular display may yield less insight than the graphical displays provided by ParaGraph, exact numerical quantities are occasionally useful in preparing tables and graphs for printed reports, or for analytical performance modeling. Due to limited space on the screen, this display shows data for at most 16 processors at a time, but the subset of processors shown can be varied by clicking the mouse on the subwindow provided, which enables one to browse the entire data array. \subsubsection{Processor Status} This is a comprehensive display that attempts to capture detailed information about processor utilization, communication, and tasks, but in a compact format that scales up well to large numbers of processors. This display contains four subdisplays, in each of which the processors are represented by a two-dimensional array of squares, with processor numbers filling the array in row-wise order. The upper left subdisplay shows the current state of each processor (busy/overhead/idle), using the usual green/yellow/red color scheme. The upper right subdisplay shows the task currently being executed by each processor, using the same task coloring scheme as discussed previously. The lower left subdisplay shows the volume of messages currently being sent by each processor, and the lower right subdisplay shows the volume of messages currently awaiting receipt by each processor; both of these communication subdisplays indicate message volume in bytes using the same color code as discussed previously for the other communication displays. Although this comprehensive display is somewhat difficult to follow due to the large amount of information it contains, it has the virtue of readily scaling to very large numbers of processors. \subsubsection{Critical Path} This display is similar to the Spacetime display described earlier, but uses a different color coding to highlight the longest serial thread in the parallel computation. Specifically, the processor and message lines along the critical path are shown in red, while all other processor and message lines are shown in light blue. This display is intended to aid in identifying performance bottlenecks and tuning the parallel program by focusing attention on the portion of the computation that is currently limiting performance. Any improvement in performance must necessarily shorten the longest serial thread running through the computation, so this is a primary place to look for potential algorithm improvements. For larger numbers of processors, the noncritical message lines are suppressed so that they do not obscure the critical path. \subsubsection{Phase Portrait} This display is patterned after the phase portraits used in differential equations and classical mechanics to depict the relationship between two variables (e.g., position and velocity) that depend on some independent variable (e.g., time). In our case, we are attempting to illustrate pictorially the relationship over time between communication and processor utilization. 
At any given point in time, the current percentage utilization (i.e., the percentage of processors that are in the busy state), and the percentage of the maximum volume of communication currently in transit, together define a single point in a two-dimensional plane. This point changes with time as communication and processor utilization vary, thereby tracing out a trajectory in the plane that is plotted graphically in this display, with communication and utilization on the two axes. To filter out noise in plotting the trajectory, this display uses the same smoothing interval as the Kiviat diagram, and thus the amount of smoothing can be controlled via the {\tt options} menu. Since the overhead and potential idleness due to communication inhibit processor utilization, one expects communication and utilization generally to have an inverse relationship. Thus one expects the phase trajectory to tend to lie along a diagonal of the display. This display is particularly useful for revealing repetitive or periodic behavior in a parallel program, which tends to show up in the phase portrait as an orbit pattern. The color used for drawing the trajectory is determined by the current task number on processor 0 (default is black if no such task is active), so by setting task numbers appropriately, the user can color code the trajectory to highlight either major phases or individual orbits. \subsubsection{Coordinate Information} The Info display is a non-graphical display that reports information produced by mouse clicks on the other displays. Many of the displays respond to mouse clicks by printing in the Info display the coordinates (in units meaningful to the user) of the point at which the cursor is located at the time the button is pressed. This feature is intended to give the user precise values that may be difficult to read accurately from the axis scales alone. In addition, clicking a mouse button with the cursor placed on one of the nodes in the Animation display causes the following information to be printed in the Info display: node number, task number (if any), number of incoming messages pending, number of outgoing messages pending. The latter information can be used in conjunction with the color code in the Animation display to determine the state of each node more precisely. \subsection{Application-Specific Displays} All of the displays we have discussed thus far are generic in the sense that they are applicable to any parallel program based on message passing and do not depend on the particular application or problem domain that the program addresses. While this wide applicability is generally a virtue, knowledge of the specific application can often enable one to design a special-purpose display that reveals greater detail or insight than generic displays alone would permit. In studying a parallel sorting algorithm, for example, generic displays can show which processors are communicating with each other, and the volume of communication, but they cannot show which specific data items are being exchanged between processors. Since we obviously could not provide such application-specific displays as part of ParaGraph, we instead made ParaGraph extensible so that users can add application-specific displays of their own design that can be selected from a menu and viewed along with the usual generic displays. The mechanism we use for supporting this capability works as follows.
ParaGraph contains calls at appropriate points to routines that provide initialization, data input, event handling, drawing, etc., for application-specific displays. If the corresponding routines for such displays are not supplied by the user when the executable module for ParaGraph is built, then dummy ``stub'' routines are linked into ParaGraph instead, and the {\tt user} submenu selection does not appear on the main menu in the list of available submenus. If application-specific displays have been linked into ParaGraph and the resulting module is executed, then a {\tt user} item appears in the main menu, and its selection opens a {\tt user} submenu that is analogous to the other submenus of available displays. The {\tt user} submenu may contain any number of separate user-defined displays that can be selected individually. Each such user-supplied display is given access to all of the event records in the tracefile that ParaGraph reads, as well as all X events, and can use them in any manner it chooses. Thus, the user-supplied displays can receive input interactively via the mouse or keyboard. The usual events generated by PICL may suffice for the application-specific displays, or the user may wish to insert additional events during execution of the parallel program in order to supply additional data for the application-specific display. The {\tt tracedata} command of PICL is perhaps the most useful for this purpose, as it allows the user to insert into the tracefile timestamped records containing arbitrary lists of integers. Such records might provide loop counters, array indices, memory addresses, identifiers of particles or transactions, or any other information that would enable the user-supplied display to convey more fully and precisely the activity of the parallel program in the context of the particular application.
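As a hedged illustration, the following fragment shows how a particle simulation might record the identifiers of the particles exchanged at each step for later use by a user-supplied display. All of the surrounding names ({\tt step}, {\tt nexchanged}, {\tt particle\_id}) are hypothetical, and the calling sequence assumed here for {\tt tracedata} (a count followed by an array of integers) is only an assumption that should be checked against the PICL manuals.
\begin{verbatim}
/* Hypothetical sketch: insert extra data into the tracefile for
   a user-supplied display.  The calling sequence assumed for
   tracedata (count, values) should be verified against the PICL
   documentation; step, nexchanged, and particle_id are
   placeholders for application data. */
#define MAXEXCH 64                  /* hypothetical upper bound   */
int ids[MAXEXCH + 2];
int i;
ids[0] = step;                      /* current time step          */
ids[1] = nexchanged;                /* number of particles sent   */
for (i = 0; i < nexchanged; i++)
    ids[i + 2] = particle_id[i];    /* the identifiers themselves */
tracedata(nexchanged + 2, ids);     /* timestamped trace record   */
\end{verbatim}
The corresponding user-supplied drawing routine would then recognize these records among the events passed to it and update the application-specific display accordingly.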
Unfortunately, writing the necessary routines to support application-specific displays is a decidedly nontrivial task that requires a general knowledge of X Window System programming. But at least the potential user of this capability can concentrate on only those portions of the graphics programming that are relevant to his application, taking advantage of the supporting infrastructure of ParaGraph to provide all of the other necessary facilities to drive the overall graphical simulation. As an aid to users who may wish to develop application-specific displays to add to ParaGraph, we have developed several prototype displays for depicting such applications as parallel sorting algorithms, recursive matrix transposition, general matrix computations, and graph algorithms, and for displaying operation counts (e.g., flops, particles, transactions). These example routines are distributed along with the source code for ParaGraph. \section{Options} The execution behavior and visual appearance of ParaGraph can be customized in a number of ways to suit each user's taste or needs. The individual items in the {\tt options} menu are described in this section. Some of the menu items cycle through the available choices as the mouse is clicked on the corresponding subwindow, while others accept keyboard input to specify numerical values or character strings. The type of user input expected for a given menu entry is indicated by the type of mouse cursor it displays. For the menu items that take keyboard input, existing values can be erased by the {\tt backspace} or {\tt delete} keys. When typing a new value, the characters are echoed in reverse video. Hitting the {\tt return} key completes the keyboard input and makes the new value take effect, at which point the new characters revert to normal video display. We now briefly discuss each of the items in turn. \begin{itemize} \item Order: In many of the displays, the user can choose to have the processors arranged in natural, Gray code, or user-defined order, and the choice will affect the appearance of communication patterns. The Gray code order is not permitted if the number of processors is not a power of two. If desired, a user-defined ordering can be supplied by means of an order file. The default name for an order file is {\tt .pgorder}, but an alternate filename can be given by using the {\tt -o} command-line option. An order file contains two numbers per line, the first of which is a node number and the second of which is the desired position of that node in the user-defined order. Optionally, a third field can be specified on each line which, if present, is interpreted as a character string to be used as the name for that node; for example, a line of the form {\tt 4 1 mgr} would assign node 4 to position 1 and display it with the name {\tt mgr}. The ability to rename the nodes is intended to support heterogeneity in either the application program (e.g., master-slave) or the underlying architecture (e.g., a network of various workstations), in which case it may be desirable to be able to distinguish nodes of different types. Due to limited space available in the displays where they will be used, node names are limited to three characters. If no order file is supplied by the user, then the {\tt user} item does not appear among the choices for the ordering. \item Scrolling: Those windows that represent time along the horizontal dimension of the screen can scroll smoothly or jump scroll by a user-specified amount as simulation time advances. Smooth scrolling provides an appealing sense of visual continuity, but results in a slower drawing speed. \item Time Unit: The relationship between simulation time and the timestamps of the trace events is determined by the {\tt time unit} chosen. By convention, PICL provides event timestamps with a resolution of microseconds. Consequently, a value of 100 for the time unit in ParaGraph, for example, means that each ``tick'' of the simulation clock corresponds to 100 microseconds in the original execution of the parallel program. During preprocessing, ParaGraph scans the timestamps in the tracefile and attempts to determine a reasonable value for the time unit. The user can override this automatic choice, however, simply by entering a different choice in the {\tt time unit} subwindow. Once the time unit is set, all displays (as well as user input) are expressed in terms of this time unit rather than the units of the original raw timestamps in the tracefile. \item Magnification: This parameter determines the visual resolution of the horizontal axis in the displays that scroll with time. It specifies the number of pixels on the screen that represent each unit of simulation time. A larger number of pixels per time unit magnifies the horizontal dimension of the scrolling displays to bring out more detail, but with less of the overall behavior of the program visible at once. The choices available for the magnification factor are 1, 2, 4, and 8. The user can override the default value chosen automatically by ParaGraph. The visual effect of this parameter is much like that of using a magnifying glass of the given power.
The magnification factor and the time unit, as we have defined them, are related to each other in the effect they have on the appearance of the displays that scroll with time, but they serve distinct purposes: the choice of {\tt magnification} determines the visual resolution of the drawing on the screen, while the choice of {\tt time unit} determines the time resolution of trace events. Thus, these two quantities can be varied in concert to produce any desired effect. \item Start Time and Stop Time: By default, ParaGraph starts the simulation at the beginning of the tracefile and proceeds to the end of the tracefile. By choosing other starting and stopping times, however, the user can isolate any particular time period of interest for visual scrutiny without having to view a possibly long simulation in its entirety. Once the specified stopping time is reached, the simulation pauses, and then can be resumed by typing a new (still later) stopping time, or by clicking on the {\tt pause/resume} menu button, or by clicking on the {\tt step} menu button to proceed from this point in single-step mode. \item Step Increment: This parameter determines how many consecutive records from the tracefile are processed each time the {\tt step} button is pressed on the {\tt controls} menu. The default value of one provides the finest control for detailed scrutiny, but can be tedious and time consuming to use, so the user may prefer a larger value. \item Smoothing Interval: The user can select the amount of smoothing used in the Kiviat Diagram and Phase Portrait displays to avoid an excessively noisy or jumpy appearance. The amount of smoothing is determined by the width of a moving time interval, with a larger value giving more smoothing and a smaller value giving less smoothing. This parameter is expressed in simulation time units and it can be changed simply by typing a new value. \item Pause on Tracemark/msg: Another way to stop the simulation for detailed study at a given predetermined point is to insert {\tt tracemark} or {\tt tracemsg} event records into the tracefile during the original execution of the parallel program. These special records provided by PICL can be used to mark milestones in the user's program, such as the completion of a major phase of the program or the beginning of a new one, or a point at which a bug is suspected. This provides a program-dependent means of isolating particular points of interest for close scrutiny. After the simulation has stopped at a {\tt tracemark} or {\tt tracemsg} event, it can be resumed by any of the usual actions, including single stepping. \item Pause on Error: This option determines whether the simulation is paused if an error is detected, such as a mismatched send/receive pair or incorrectly nested blocks. Again, the simulation can be resumed by any of the usual actions. \item Trace Node and Trace Type: These parameters determine which trace events are printed in the Trace display window. This feature allows the user to focus on events for a specific node and/or of a specific type, since looking at every event for every processor can be tedious and time consuming. The default value for both parameters is {\tt all}. \item Print Command: This specifies the command string used by the {\tt screen dump} button on the {\tt controls} menu to route images to a printer for hardcopy output. The default print command is installation dependent. It can be changed simply by typing a new print command in this subwindow. 
The command string will often include invocation of a remote shell and piping through a number of filters for converting image formats, etc., before reaching the physical output device. \item Distance Function: This option determines the network topology used to compute the distance traveled by each message, which may be used in some displays for color coding messages and is also tallied in summaries of communication statistics. The choices available include Hamming distance (appropriate for hypercubes, for example), 1-D and 2-D mesh (without wrap-around), 1-D and 2-D torus (with wrap-around), binary tree, quadtree, and unit distance (appropriate for a fully connected network, for example). For the mesh and torus, the user also selects the desired aspect ratio and whether the processors are numbered row-wise or column-wise. The choice of distance function does not necessarily have to agree with the layout or topology chosen for the Animation and Network displays. \end{itemize} \section{Record Options} The data generated by ParaGraph for drawing the various displays can be saved in files, if desired. Such data may be useful for inclusion in printed documents, for mathematical modeling or statistical analysis of performance, or as input to other graphical packages. The {\tt record options} menu is used to select which displays, if any, are to have their data recorded in files on disk during the simulation. Each data file created in this manner will have the name shown in the prefix subwindow, with a suffix added to indicate the particular display name. The default filename prefix is the path name of the tracefile, if one was specified on the command line when ParaGraph was invoked. The filename prefix can be changed by entering a new name into the prefix subwindow. Another subwindow allows the user to specify start and stop times for saving data in files, which by default span the entire run. Each data file produced in this manner begins with a header that identifies the subsequent fields, followed by one line of data per event. \section{General Advice} In this section we provide a few tips that may make using ParaGraph easier and more meaningful. Perhaps the most important piece of advice is to keep the tracefile to be viewed as small as possible without losing the phenomenon to be studied. The best way to accomplish this is to use a relatively small number of processors and a brief execution time. Although ParaGraph currently supports the use of up to 512 processors, and has no limit on the duration of the simulation run, the size of the tracefile for a large number of processors and/or a long execution time can be enormous (many megabytes). Such large tracefiles can quickly consume large amounts of disk space and will require a great deal of time for ParaGraph to preprocess and then animate visually. Fortunately, in our experience, basic algorithm behavior and most fundamental bottlenecks and inefficiencies in parallel programs are usually already apparent when viewed with small numbers of processors and relatively small test problems that run quickly. Moreover, many programs behave repetitively, so that only a few iterations need be examined in detail to get the gist of the behavior, rather than a long sequence of replications. In a lengthy program, it is also a good idea to invoke PICL's tracing commands only for the portion of immediate interest.
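For example, under the assumption that {\tt tracelevel} can be called more than once during a run to raise or lower the level of tracing detail (the PICL manuals give the definitive calling sequences), the portion of interest might be bracketed as in the following sketch, where {\tt setup\_phase}, {\tt solver\_phase}, and {\tt cleanup\_phase} are hypothetical placeholders for sections of the user's program.
\begin{verbatim}
/* Hypothetical sketch: trace only the phase of interest.
   Assumes tracelevel() may be called repeatedly to change the
   amount of detail recorded; see the PICL manuals to confirm. */
tracelevel(0, 0, 0);      /* assumed: no tracing during setup  */
setup_phase();
tracelevel(4, 4, 0);      /* full event tracing for ParaGraph  */
solver_phase();           /* the portion of immediate interest */
tracelevel(0, 0, 0);      /* assumed: disable tracing again    */
cleanup_phase();
\end{verbatim}
This keeps the tracefile focused on the solver, at the cost of losing the utilization picture for the untraced phases.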
On some machines, a more insidious problem with large numbers of processors and/or long run times is the increasing probability of ``tachyons'' in the tracefile (messages whose timestamps suggest that they were received before they were sent) as individual processor clocks drift apart with time. In creating tracefiles for viewing with ParaGraph, be sure to use the highest resolution clock and sharpest clock synchronization that PICL offers. On machines with independent node clocks, PICL will try to compensate for clock skew and drift by monitoring clock behavior during a brief interval before the program begins execution. To increase the accuracy of this computation, you might try calling {\tt clocksync0} at the end of your program, just before calling {\tt close0} on each node. This will cause the entire duration of your run to be used in determining the appropriate adjustment for clock drift, improving its statistical significance. Tachyons may cause unpredictable behavior in ParaGraph, possibly including outright failure. Therefore, before reporting ``bugs'' in ParaGraph, check the tracefile for tachyons using the awk script {\tt tachyon.awk} supplied in the ParaGraph distribution (e.g., {\tt awk -f tachyon.awk tracefile.trf}), which will print any tachyons it finds and otherwise remain silent. Other common causes of faulty tracefiles include failure to sort the records into time order, inadvertent concatenation of multiple tracefiles, and incomplete tracefiles due to full trace buffers. Another way to reduce the size of the tracefile is to refrain from tracing on the host. ParaGraph ignores all events involving the host anyway, so tracing on the host pointlessly clutters the tracefile with data that are irrelevant to the visualization. The decision to ignore the host in ParaGraph was based on a number of factors, including the difficulty of representing a host pictorially without spoiling the symmetry of many of the displays, the difficulty of obtaining accurate and reliable timestamps on a time-shared multiuser host, the fact that most parallel programs do not use the host for any substantive computations anyway, and the fact that many vendors support multiple hosts or are eliminating the need for separate hosts in their systems. The various parameters given in the {\tt options} menu can have a dramatic effect on the behavior of ParaGraph for a given tracefile, and the user may or may not find the default values to be the most desirable. For example, during preprocessing a rough heuristic is used to choose an appropriate {\tt time unit}, and the value chosen strongly affects the appearance and behavior of the scrolling displays. An attempt is made to choose a value that will fill at least one window width but not need to scroll more than a few window widths. The value chosen automatically may be so large that it obscures detail the user would like to see, or so small that the simulation runs for too long. So, the user should feel free to adjust the value for the {\tt time unit}, if desired. Note, however, that the {\tt magnification} parameter also affects the visual resolution of the scrolling displays, so it may also be changed to produce a desired effect. In addition, the speed of the drawing is strongly affected by the type and amount of scrolling employed, so this is subject to experimentation as well. In using the Kiviat Diagram and Phase Portrait displays, some experimentation with the {\tt smoothing interval}, as well as the {\tt time unit}, may be required to produce the most meaningful visual results.
As pointed out previously, the execution speed of ParaGraph is normally determined by how fast it can read trace records and perform the resulting drawing. If the visual simulation is too rapid for the eye to follow, then its execution can be slowed down either by using the {\tt slow motion} slider or else by selecting some additional displays, especially those that scroll with time. If the visual simulation is too slow, it can be sped up by using fewer displays at a time or by selecting jump scrolling. Changing the {\tt time unit} and/or {\tt magnification} also affects the drawing speed, so these are subject to experimentation as well. Finally, the {\tt step} button, or repeated use of {\tt pause/resume}, can also be used to control the speed with which the animation unfolds. By some combination of these means, the user should be able to produce an animation speed that can be followed visually in sufficient detail, yet does not take an inordinate amount of time to finish. In building an executable module for ParaGraph from the distributed source code, there are a few compile-time parameters that the user may wish to modify for a particular situation. These parameters are found in the {\tt defines.h} file. The parameters most likely to require modification are as follows: \begin{itemize} \item[{\tt ALL}] integer destination value used to indicate global sends (default -1), \item[{\tt HOST}] integer identifier for the host processor (default -32768, consistent with PICL), \item[{\tt MAXP}] maximum number of node processors allowed (default 128, maximum possible 512). \end{itemize} The usual default for the maximum number of nodes allowed is set at 128 in order to conserve memory. {\tt MAXP} can be increased up to 512 to accommodate larger systems, but this may cause sluggish performance and should be avoided unless necessary.
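As a sketch of what such settings look like, the corresponding entries in {\tt defines.h} would take roughly the following form (the values shown are the documented defaults; we assume ordinary C preprocessor definitions, and the exact formatting in the distributed file may differ):
\begin{verbatim}
/* Compile-time parameters in defines.h (documented defaults;
   the exact form in the distribution may differ). */
#define ALL   -1       /* destination indicating a global send     */
#define HOST  -32768   /* host processor id, consistent with PICL  */
#define MAXP  128      /* max node processors (may be raised, up
                          to 512, at the cost of more memory)      */
\end{verbatim}
After changing any of these values, the executable module must be rebuilt for the new limits to take effect.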
By default, ParaGraph uses two fonts, namely {\tt 6x12} and {\tt 8x13}, which are both available in most distributions of X Windows. In case these fonts are not available, however, the names of alternate constant-width fonts of similar size can be substituted into the initialization of {\tt font1\_name} and {\tt font2\_name} in the {\tt defaults.h} file. \section{Future Work} In terms of the number and appearance of displays it provides, ParaGraph is a reasonably mature software tool, although we intend to add more displays as helpful new perspectives are devised. There are a few technical points about ParaGraph that could stand improvement. The contents of many of the displays are lost if the window is obscured and then reexposed. This inability to repair or redraw windows, short of rerunning the simulation from the beginning, was a deliberate design decision based on a desire to conserve the substantial amount of memory that would be required to save the contents of all windows for possible restoration. Nevertheless, this ``feature'' can be annoying at times and should eventually be fixed. A related problem is that ParaGraph cannot reliably support dynamic changes in parameters during a simulation run (e.g., dynamic zooming of the time resolution). A more serious limitation of ParaGraph in its current form is the number of processors that can be depicted effectively. A few of the current displays are simply too detailed to scale up beyond about 128 processors and still be comprehensible. Most of the displays scale up well to a level of 512 or 1024 processors on a normal-sized workstation screen, but at this point they are down to representing each processor by a single pixel (or pixel line), and hence cannot be scaled any further in their current form. To visualize programs for massively parallel architectures having thousands of processors, we must devise new displays that scale up to this level, or else we must adapt the existing displays, either by aggregating or selecting information. For example, the current displays could depict either clusters of processors or subsets of individual processors (e.g., cross sections). While it is fairly easy to imagine how graphics technology might be adapted to meet the needs of visualizing massively parallel computations, it is much less obvious how to handle the vast volume of execution trace data that would result from monitoring thousands of processors. Even with the more modest numbers of processors currently supported by ParaGraph, storage and processing of the large volume of trace data resulting from runs of significant duration are already difficult problems. To go beyond the present level will almost certainly require some degree of abstraction of essential behavior in a more concise and compact form, both in the data and in its graphical presentation. We simply cannot afford to continue to record or display all communication events when they become so voluminous. Unfortunately, many of the current displays in ParaGraph depend critically on the availability of data on each individual event. Thus, the development of new visual displays and new data abstractions must proceed in tandem so that the execution monitoring facility will produce data that can be visually displayed in a meaningful way to provide helpful insights into program behavior and performance. \section{Acknowledgements} The original implementation of ParaGraph was done almost entirely by undergraduate students during research internships at Oak Ridge National Laboratory. The overall structure of the software and the conceptual designs of the individual displays were developed by Michael Heath. Most of the programming for the initial release of ParaGraph was done by Jennifer Finger (then Etheridge) while she was an undergraduate student, first at Roanoke College and later at the University of Tennessee. Heath and Finger have continued joint development of ParaGraph since he moved from ORNL to the University of Illinois and she became a regular staff member at ORNL. Two other undergraduate students also contributed to the development of ParaGraph: Loretta Auvil, then of Alderson Broaddus College, originally developed the Hypercube display, and Michelle Hribar, then of Albion College, developed the first two application-specific displays (to illustrate parallel sorting and matrix transposition) as extensions to ParaGraph. In each case these undergraduates began their work on ParaGraph without any prior knowledge of Unix, C, workstations, computer graphics, or the X Window System, and within a single term each was contributing to the relatively sophisticated software tool described in this manual. Thus, the development of ParaGraph has been an interesting educational experiment that has provided a useful tool for the performance analysis of parallel programs. This research was supported by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S.
Department of Energy under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc. \bibliographystyle{plain} \begin{thebibliography}{1} \bibitem{don87} J.~J.~Dongarra and E.~Grosse. \newblock Distribution of mathematical software via electronic mail. \newblock {\em Communications of the ACM}, 30(5), May 1987, pp. 403--407. \bibitem{dun91} T.~H.~Dunigan. \newblock Hypercube clock synchronization. \newblock {\em Concurrency: Practice and Experience}, 4(3), May 1992, pp. 257--268. \bibitem{gei90a} G.~A.~Geist, M.~T.~Heath, B.~W.~Peyton, and P.~H.~Worley. \newblock {PICL}: a portable instrumented communication library, {C} reference manual. \newblock Technical Report ORNL/TM-11130, Oak Ridge National Laboratory, Oak Ridge, TN, July 1990. \bibitem{gei90b} G.~A.~Geist, M.~T.~Heath, B.~W.~Peyton, and P.~H.~Worley. \newblock A users' guide to {PICL}, a portable instrumented communication library. \newblock Technical Report ORNL/TM-11616, Oak Ridge National Laboratory, Oak Ridge, TN, October 1990. \bibitem{hea90} M.~T.~Heath. \newblock Visual animation of parallel algorithms for matrix computations. \newblock In D. Walker and Q. Stout, editors, {\em Proceedings of the Fifth Distributed Memory Computing Conference}, volume~II, pages 1213--1222, Los Alamitos, CA, April 1990. IEEE Computer Society Press. \bibitem{hea91a} M.~T.~Heath and J.~A.~Etheridge. \newblock Visualizing performance of parallel programs. \newblock Technical Report ORNL/TM-11813, Oak Ridge National Laboratory, Oak Ridge, TN, May 1991. \bibitem{hea91b} M.~T.~Heath and J.~A.~Etheridge. \newblock Visualizing the performance of parallel programs. \newblock {\em IEEE Software}, 8(5), September 1991, pp. 29--39. \bibitem{hea93} M.~T.~Heath. \newblock Recent developments and case studies in performance visualization using {ParaGraph}. \newblock In G.~Haring and G.~Kotsis, editors, {\em Performance Measurement and Visualization of Parallel Systems}, pages 175--200, Amsterdam, The Netherlands, 1993. Elsevier Science Publishers. \bibitem{tom93} G.~Tomas and C.~W.~Ueberhuber. \newblock {\em Visualization of scientific parallel programs}. \newblock Technical University of Vienna, April 1993. \bibitem{wor92} P.~H.~Worley. \newblock A new {PICL} trace file format. \newblock Technical Report ORNL/TM-12125, Oak Ridge National Laboratory, Oak Ridge, TN, October 1992. \end{thebibliography} {\bf Biographies} {\em Michael T. Heath} is Professor in the Department of Computer Science and Research Scientist at the National Center for Supercomputing Applications at the University of Illinois in Urbana-Champaign. Previously he was a Senior Research Staff Member and Computer Science Group Leader in the Mathematical Sciences Section at Oak Ridge National Laboratory. He received a Ph.D. in Computer Science from Stanford University in 1978. His current research interests are in large-scale scientific computing on parallel computers, numerical linear algebra, and performance visualization. He can be contacted by email at the following address: {\tt heath@ncsa.uiuc.edu}. {\em Jennifer E. Finger} is a Technical Research Associate in the Mathematical Sciences Section at Oak Ridge National Laboratory. She received a B.S. degree from the University of Tennessee, Knoxville, in 1990, with a major in Mathematics. Her current interests are in computer graphics and visualization. She can be contacted by email at the following address: {\tt jenn@msr.epm.ornl.gov}.
\newpage \begin{verbatim}
ParaGraph(L)              LOCAL COMMANDS              ParaGraph(L)

NAME
     ParaGraph - performance visualization of parallel programs

SYNOPSIS
     PG [-c | -g | -m] [-d hostname:0.0] [-e envfile]
        [-f tracefile] [-l layoutfile] [-n windowname]
        [-o orderfile] [-r rgbfile]

DESCRIPTION
     ParaGraph is a graphical display system for visualizing the
     behavior and performance of parallel programs on
     message-passing parallel computers.  It takes as input
     execution trace data provided by PICL (Portable
     Instrumented Communication Library), developed at Oak Ridge
     National Laboratory and available from netlib.  PICL
     optionally produces an execution trace during an actual run
     of a parallel program on a message-passing machine, and the
     resulting trace data can then be replayed pictorially with
     ParaGraph to display a dynamic, graphical depiction of the
     behavior of the parallel program.  ParaGraph provides
     several distinct visual perspectives from which to view
     processor utilization, communication traffic, and other
     performance data in an attempt to gain insights that might
     be missed by any single view.

     ParaGraph is based on the X Window System and runs on a
     wide variety of graphical workstations.  It uses no X
     toolkit and requires only Xlib.  Although ParaGraph is most
     effective in color, it also works on monochrome and
     grayscale monitors.  It has a graphical, menu-oriented user
     interface that accepts user input via mouse clicks and
     keystrokes.  The execution of ParaGraph is event driven,
     including both user-generated X Window events and trace
     events in the input data file provided by PICL.  Thus,
     ParaGraph displays a dynamic depiction of the parallel
     program while also providing responsive interaction with
     the user.  Menu selections determine the execution behavior
     of ParaGraph both statically (e.g., initial selection of
     parameter values) and dynamically (e.g., pause/resume,
     single-step mode).  ParaGraph preprocesses the input
     tracefile to determine relevant parameters (e.g., time
     scale, number of processors) automatically before the
     graphical simulation begins, but these values can be
     overridden by the user, if desired.

     ParaGraph currently provides about 25 different displays or
     views, all based on the same underlying trace data, but
     each giving a distinct perspective.  Some of these displays
     change dynamically in place, with execution time in the
     original run represented by simulation time in the replay.
     Other displays represent execution time in the original run
     by one space dimension on the screen.  The latter displays
     scroll as necessary (by a user-controllable amount) as
     visual simulation time progresses.  The user can view as
     many of the displays simultaneously as will fit on the
     screen, and all visible windows are updated appropriately
     as the tracefile is read.  The displays can be resized
     within reasonable bounds.  Most of the displays depict up
     to 512 processors in the current implementation, although a
     few are limited to 128 processors and one is limited to 16.

     ParaGraph is extensible so that users can add new displays
     of their own design that can be viewed along with those
     views already provided.  This capability is intended
     primarily to support application-specific displays that
     augment the insight that can be gained from the generic
     views provided by ParaGraph.  Sample application-specific
     displays are supplied with the source code.  If no
     user-supplied display is desired, then dummy "stub"
     routines are linked with ParaGraph instead.

     The ParaGraph source code comes with several sample
     tracefiles for use in demonstrating the package and
     verifying its correct installation.  To create your own
     tracefiles for viewing with ParaGraph, you will need PICL,
     which is also available from netlib.  The tracing option of
     PICL produces a tracefile with records in node order.  For
     graphical animation with ParaGraph, the tracefile needs to
     be sorted into time order, which can be accomplished with
     the Unix sort command:

     % sort +2n -3 +0n -1 +1rn -2 tracefile.raw > tracefile.trf

     When using PICL to produce tracefiles for viewing with
     ParaGraph, set tracelevel(4,4,0) to produce the trace
     events required by ParaGraph.  You may also want to define
     tasks using the traceblockbegin and traceblockend commands
     of PICL to delimit sections of code and assign them task
     numbers to be depicted by ParaGraph in some of its displays
     as an aid in correlating the visual simulation with your
     parallel program.  ParaGraph does not depict a "host"
     processor graphically and ignores all trace events
     involving the host, so tracing on the host is not
     encouraged when the tracefile is to be viewed using
     ParaGraph.

OPTIONS
     The following command-line options are supported by
     ParaGraph.

     -c   to force color display mode.

     -d   to specify a hostname and screen (e.g., hostname:0.0)
          for remote display across a network.

     -e   to specify an environment file (default: .pgrc).

     -f   (or no switch) to specify a tracefile directory path
          or filename.

     -g   to force grayscale display mode.

     -l   to specify an animation layout file (default:
          .pganim).

     -m   to force monochrome display mode.

     -n   to specify a name for the base window (default:
          ParaGraph).

     -o   to specify an order file (default: .pgorder).

     -r   to specify a file containing RGB values of task colors
          (default: .pgcolors).

     By default, ParaGraph automatically detects the appropriate
     display mode (color, grayscale, or monochrome), but a
     particular display mode can be forced, if desired, by the
     corresponding command-line option.  This facility is
     useful, for example, in making black-and-white hardcopies
     from a color monitor.

FILES
     The following environment files can optionally be supplied
     by the user to customize ParaGraph's appearance and
     behavior.  The default filenames given below can be changed
     by the appropriate command-line options.

     .pgrc       defines the initial state of ParaGraph upon
                 invocation, including which menus and displays
                 are open and various options.

     .pgorder    defines an optional order or alternative names
                 for the processors.

     .pgcolors   defines the color scheme to be used for
                 identifying tasks.

     .pganim     defines an animation layout file.

     The following files are provided in the ParaGraph
     distribution from netlib.

     *.c         several C source files.

     *.h         several include files.

     Makefile.*  sample makefiles for several machine
                 configurations, which should be modified to
                 incorporate the local location for Xlib, etc.

     manual.tex  a user guide in Latex format.

     pg.man      a Unix man page.

     tracefiles  a directory containing several sample
                 tracefiles.

     u_*         several directories containing example
                 application-specific displays.

SEE ALSO
     A machine-readable manual for ParaGraph, in Latex format,
     is provided along with the source code from netlib.
     Additional information is contained in the article
     "Visualizing Performance of Parallel Programs" in the
     September 1991 issue of IEEE Software, pages 29-39, and in
     the technical report ORNL/TM-11813.  Documentation for PICL
     is available from netlib and in the technical reports
     ORNL/TM-11130 and ORNL/TM-11616.

BUGS
     Some of the displays are not repaired when re-exposed after
     having been partially obscured.  Changing parameters
     dynamically while the visual animation is active may give
     erratic results.  The apparent speed of visual animation is
     determined primarily by the drawing speed of the
     workstation and is not necessarily uniformly proportional
     to the original execution speed of the parallel program.

AUTHORS
     Michael T. Heath and Jennifer E. Finger
\end{verbatim}
\end{document}