%Version of August 3, 1993
%Minor edits by S. Otto, August 7, 1993
\chapter{Point to Point Communication}
\label{sec:pt2pt}
\label{chap:pt2pt}

\section{Introduction}
\label{sec:pt2pt-intro}

This chapter is a draft of the current proposal for point-to-point communication. Sending and receiving of messages by processes is the basic MPI communication mechanism. All other communication functions can be implemented on top of this basic communication layer (although a more direct implementation may lead to greater efficiency). The basic point-to-point communication operations are {\bf send} and {\bf receive}.

A {\bf send} operation creates and sends a message. The operation specifies a {\bf send buffer} in the sender memory from which the message data is taken. In addition, the send operation associates an {\bf envelope} with the message. This envelope specifies the message destination and contains distinguishing information that can be used by the {\bf receive} operation to select a particular message. A {\bf receive} operation consumes a message. The message to be received is selected according to the values in its envelope, and the message data is put into the {\bf receive buffer}.

The next sections describe the basic (blocking) send and receive operations. We discuss send, receive, basic communication semantics, type matching requirements, type conversion in heterogeneous environments, and more general communication modes. Nonblocking communication is addressed next, followed by channel-like constructs and send-receive operations. We then consider general datatypes that allow one to transfer heterogeneous and noncontiguous data, and conclude with a description of an implementation of MPI point-to-point communication using a small number of primitives.

\section{Basic send operation}
\label{sec:pt2pt-basicsend}

The syntax of the simplest send operation is given below.

\begin{funcdef}{MPI\_SEND (start, count, datatype, dest, tag, comm)}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements in send buffer (nonnegative integer)}
\funcarg{\IN}{datatype}{datatype of each send buffer element}
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

\subsection{Message data}
\label{subsec:pt2pt-messagedata}

The send buffer specified by the \func{MPI\_SEND} operation consists of \mpiarg{count} successive entries of the type indicated by \mpiarg{datatype}, starting with the entry at address \mpiarg{start}. Note that we specify the message length in terms of number of {\em elements}, not number of {\em bytes}. The former is machine independent and closer to the application level. The data part of the message consists of a sequence of \mpiarg{count} values, each of the type indicated by \mpiarg{datatype}. \mpiarg{count} may be zero, in which case the data part of the message is empty. The basic datatypes that can be specified for message data values correspond to the basic datatypes of the host language.
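For illustration (the declaration and the values of \mpiarg{tag} and \mpiarg{comm} are ours, not part of the proposal), a Fortran fragment that sends the ten real values {\tt a(1), ..., a(10)} to the process with rank 1 could look as follows; note that \mpiarg{count} is 10 elements, not 40 bytes. As in the other Fortran examples of this chapter, the final {\tt ierr} argument returns the error code.
\begin{verbatim}
REAL a(10)
...
CALL MPI_SEND(a(1), 10, MPI_REAL, 1, tag, comm, ierr)
\end{verbatim}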
The possible values of the \mpiarg{datatype} parameter for Fortran and the corresponding Fortran types are listed below.

\vspace{5ex}
\begin{center}
\begin{tabular}{|l|l|}
\hline
MPI datatype & Fortran datatype \\
\hline
\type{MPI\_INTEGER} & \ftype{INTEGER} \\
\type{MPI\_REAL} & \ftype{REAL} \\
\type{MPI\_DOUBLE} & \ftype{DOUBLE PRECISION} \\
\type{MPI\_COMPLEX} & \ftype{COMPLEX} \\
\type{MPI\_DOUBLE\_COMPLEX} & \ftype{DOUBLE COMPLEX} \\
\type{MPI\_LOGICAL} & \ftype{LOGICAL} \\
\type{MPI\_CHARACTER} & \ftype{CHARACTER} \\
\type{MPI\_BYTE} & \\
\hline
\end{tabular}
\end{center}
\vspace{5ex}

The possible values of this parameter for C and the corresponding C types are listed below.

\vspace{5ex}
\begin{center}
\begin{tabular}{|l|l|}
\hline
MPI datatype & C datatype \\
\hline
\type{MPI\_SHORT} & \ctype{short} \\
\type{MPI\_INT} & \ctype{int} \\
\type{MPI\_LONG} & \ctype{long} \\
\type{MPI\_UNSIGNED} & \ctype{unsigned} \\
\type{MPI\_FLOAT} & \ctype{float} \\
\type{MPI\_DOUBLE} & \ctype{double} \\
\type{MPI\_CHAR} & \ctype{char} \\
\type{MPI\_BYTE} & \\
\hline
\end{tabular}
\end{center}
\vspace{5ex}

The datatype \type{MPI\_BYTE} does not correspond to a Fortran or C datatype. A value of type \type{MPI\_BYTE} consists of 8 binary digits. A byte is uninterpreted and is different from a character. Different machines may have different representations for characters, or may use more than one byte to represent characters. On the other hand, a byte has the same binary value on all machines.
\discuss{
Need to decide whether we want (for C) \ctype{unsigned char}, \ctype{unsigned short}, \ctype{unsigned long}, \ctype{long double}.

\ftype{DOUBLE COMPLEX} may not be standard Fortran 77.

\type{MPI\_DOUBLE} is now used both for Fortran and C. Should we use different names?
}

\subsection{Message envelope}
\label{subsec:pt2pt-envelope}

In addition to the data part, messages contain information that can be used to distinguish messages and selectively receive them. This information is contained in a fixed number of fixed-format fields, which we collectively call the {\bf message envelope}. These fields are
\begin{description}
\item[source]
\item[destination]
\item[tag]
\item[context]
\end{description}
The message source is implicitly determined by the identity of the message sender. The other fields are specified by parameters in the send operation.

The integer-valued message tag is specified by the \mpiarg{tag} parameter. This integer can be used by the program to distinguish different types of messages. The range of valid tag values is implementation dependent and can be queried using the \mpifunc{TagRange} environmental inquiry function, as described in Chapter~\ref{chap:inquiry}.

The context of the message sent is specified by the \mpiarg{comm} parameter. The message carries the context associated with this communicator (see Chapter~\ref{chap:context}). The message destination is specified by the \mpiarg{dest} parameter as a rank within the process group associated with that same communicator. The range of valid values is {\tt 0, ... , n-1}, where {\tt n} is the number of processes in this group. Thus, point-to-point communications do not use absolute addresses, but only relative ranks within a group. This provides important modularity.

The message envelope would normally be encoded by a fixed-length message header.
However, the actual mechanism used to associate an envelope with a message is implementation dependent: some of the information (e.g., source or destination) may be implicit, and need not be explicitly carried by messages; processes may be identified by relative ranks, or absolute ids; etc.

\section{Basic receive operation}
\label{sec:pt2pt-basicreceive}

The syntax of the simplest receive operation is given below.

\begin{funcdef}{MPI\_RECV (start, count, datatype, source, tag, comm, status)}
\funcarg{\OUT}{start}{initial address of receive buffer (choice)}
\funcarg{\IN}{count}{number of elements in receive buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each receive buffer element (state)}
\funcarg{\IN}{source}{rank of source (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\funcarg{\OUT}{status}{status object}
\end{funcdef}

The receive buffer consists of the storage containing \mpiarg{count} consecutive elements of the type specified by \mpiarg{datatype}, starting at address \mpiarg{start}. The length of the received message must be less than or equal to the length of the receive buffer. I.e., all incoming data must fit, without truncation, into the receive buffer. The \mpifunc{MPI\_PROBE} function described in Section~\ref{sec:pt2pt-probe} can be used when receiving messages of unknown length.

The selection of a message by a receive operation is done solely according to the values in the message envelope. The receive operation specifies an {\bf envelope pattern}; a message can be received by that receive operation only if its envelope matches that pattern. A pattern specifies values for the \mpiarg{source}, \mpiarg{tag} and \mpiarg{context} fields of the message envelope. The receiver may specify a wildcard \const{MPI\_ANY\_SOURCE} value for \mpiarg{source}, and/or a wildcard \const{MPI\_ANY\_TAG} value for \mpiarg{tag}, indicating that any source and/or tag are acceptable. It cannot specify a wildcard value for \mpiarg{context}. Thus, a message can be received by a receive operation only if it is addressed to the receiving process, has a matching context, has a matching source unless source=\const{MPI\_ANY\_SOURCE} in the pattern, and has a matching tag unless tag=\const{MPI\_ANY\_TAG} in the pattern.

The message tag is specified by the \mpiarg{tag} parameter of the receive operation. The message \mpiarg{context} is the context associated with the communicator specified by the parameter \mpiarg{comm}. The message source, if different from \const{MPI\_ANY\_SOURCE}, is specified as a rank within the process group associated with that same communicator. Thus, the range of valid values for the \mpiarg{source} parameter is {\tt \{ 0, ... , n-1 \} }$\cup${\tt \{ MPI\_ANY\_SOURCE \}}, where {\tt n} is the number of processes in this group.

Note the asymmetry between send and receive operations: A receive operation may accept messages from an arbitrary sender; on the other hand, a send operation must specify a unique receiver. This matches a ``push'' communication mechanism, where data transfer is effected by the sender (rather than a ``pull'' mechanism, where data transfer is effected by the receiver). Source = destination is allowed: a process can send a message to itself.

\subsection{Return status}
\label{subsec:pt2pt-status}
\discuss{\bf This is a change from previous draft}

The source or the tag of a received message may not be known if wildcard values were used in the receive operation.
Also, the actual length of the message received may not be known. Thus, this information needs to be returned by the receive operation. For reasons that will become clear later, it is preferable to return this information in a separate \type{out} variable, rather than using the \mpiarg{count}, \mpiarg{source} and \mpiarg{tag} parameters both for input and output. The information is returned by the \mpiarg{status} parameter of the \mpifunc{MPI\_RECV} function. This is a parameter of a special MPI-defined type. The status variable can be ``decoded'' to retrieve the \mpiarg{count}, \mpiarg{source} and \mpiarg{tag} fields, using the query functions listed below.

\begin{funcdef}{MPI\_GET\_SOURCE(status, source)}
\funcarg{\IN}{status}{return status of receive operation (Status)}
\funcarg{\OUT}{source}{source rank (integer)}
\end{funcdef}
Returns the rank of the message source (in the group associated with the communicator that was used to receive).

\begin{funcdef}{MPI\_GET\_TAG(status, tag)}
\funcarg{\IN}{status}{return status of receive operation (Status)}
\funcarg{\OUT}{tag}{message tag (integer)}
\end{funcdef}
Returns the tag of the received message.

\begin{funcdef}{MPI\_GET\_COUNT(status, count)}
\funcarg{\IN}{status}{return status of receive operation (Status)}
\funcarg{\OUT}{count}{number of received elements (integer)}
\end{funcdef}
Returns the number of elements received. (Here, again, we count {\em elements}, not {\em bytes}.)

The information returned by these query functions is the information last stored in the \mpiarg{status} variable by a receive function. It is erroneous to call these query functions if the \mpiarg{status} variable was never set by a receive. Note that it is not mandatory to query the return status after a receive. The receiver will use these calls, or some of them, only when the information they return is needed.
\discuss{
The use of a separate status parameter prevents errors that are often associated with \type{\INOUT} parameters (e.g., passing the \const{MPI\_ANY\_TAG} constant as the actual argument for \mpiarg{tag}). The use of an explicit user object becomes important for nonblocking communication as it allows the receive operation to be stateless and, hence, reentrant. This prevents confusion in cases where multiple receives can be posted by a process.
}
\implement{
One expects decode functions to be in-lined in many implementations.
}

All send and receive operations use the \mpiarg{start}, \mpiarg{count}, \mpiarg{datatype}, \mpiarg{source}, \mpiarg{dest}, \mpiarg{tag}, \mpiarg{comm} and \mpiarg{status} parameters in the same way as the basic \func{MPI\_SEND} and \func{MPI\_RECV} operations described in this section.

\section{Semantics of point-to-point communication}
\label{sec:pt2pt-semantics}

A valid MPI implementation guarantees certain general properties of point-to-point communication, which are described in this section.

\paragraph*{Order}
Messages are {\em non-overtaking}, within each context: if two messages are successively sent from the same source, to the same destination, with the same context, then the messages are received in the order they were sent. I.e., if the receiver posts two successive receives that both match either message, then the first receive will receive the first message, and the second receive will receive the second message. This requirement facilitates matching of sends to receives.
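A sketch of this guarantee (the buffers {\tt buf1} and {\tt buf2} and the tags {\tt tag1} and {\tt tag2} are illustrative names, not part of the proposal): process 0 sends two messages to process 1; the first receive matches either message and therefore receives the first message sent, and the second receive then receives the second message.
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(buf1, count, MPI_REAL, 1, tag1, comm, ierr)
    CALL MPI_SEND(buf2, count, MPI_REAL, 1, tag2, comm, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_RECV(buf1, count, MPI_REAL, 0, MPI_ANY_TAG, comm, status, ierr)
    CALL MPI_RECV(buf2, count, MPI_REAL, 0, tag2, comm, status, ierr)
END IF
\end{verbatim}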
The non-overtaking requirement also guarantees that message-passing code is deterministic, if processes are single-threaded and the wildcard \const{MPI\_ANY\_SOURCE} is not used in receives. If a process has a single thread of execution, then any two communications executed by this process are ordered. On the other hand, if the process is multi-threaded, then the semantics of thread execution may not define a relative order between two send operations executed by two distinct threads: the operations are logically concurrent, even though one physically precedes the other. In such a case, the two messages sent can be received in any order. Similarly, if two receive operations that are logically concurrent receive two successively sent messages, then the two messages can match the two receives in either order.

\paragraph*{Progress}
If a matching send and receive have been initiated on two processes, then at least one of these two operations will complete, independently of other actions in the system: the send operation will complete, unless the receive is satisfied by another message and completes; the receive operation will complete, unless the message sent is consumed by another matching receive that was posted at the same destination process.

\paragraph*{Fairness}
MPI makes no guarantee of {\em fairness} in the handling of communication. Suppose that a send was posted. Then it is possible that the destination repeatedly posts a receive that matches this send, yet the message is never received, because it is overtaken each time by another message, sent from another source. Similarly, suppose that a receive was posted. Then it is possible that messages that match this receive are repeatedly received, yet the receive is never satisfied, because it is overtaken by other receives posted at this node (by other executing threads). It is the programmer's responsibility to prevent starvation in such situations.
\discuss{
{\bf Needs to be discussed by MPIF} Alternative positions:

1. Sends are handled fairly; e.g. a posted send cannot be repeatedly overtaken by other sends. An example of an implementation that violates this: Assume that arriving messages are kept in a hash table organized by source, and that a receive with source=dontcare is handled by searching this table sequentially. Then a message stored at the end of this table may be repeatedly overtaken by messages that get stored at the head of the table. (Easy fix: rotate the starting point of the search.)

2. Receives are handled fairly; e.g. a posted receive cannot be repeatedly overtaken by other receives. An example of an implementation that violates this: assume that there are two different posting tables, one for receives with a specific source and one for receives with source=dontcare; assume further that these two structures are searched in a fixed order when a message arrives. Then a receive with source=dontcare may be repeatedly overtaken by receives with source specified. (Fix: alternate the search order; but this has performance problems, if we assume that one type of receive is much more frequent than the other type.)

3. Both sends and receives are handled fairly.

Opinions?
}

\paragraph*{Resource limitations}
The current practice for many commercial message-passing libraries is that (short) messages are buffered by the system, thus allowing blocking send operations to complete ahead of the matching receives. It is expected that many MPI implementations will follow this practice, and provide the same level of buffering that is available on the libraries they replace.
However, message buffering is not a universal practice. Even on systems where buffering occurs, the amount of buffer space available and the way it is allocated are bound to be implementation dependent. Therefore, message buffering is not mandated by MPI and is seen as a quality of implementation issue. A valid MPI implementation of \func{MPI\_SEND} is one that blocks the sender until a matching receive has been initiated. In general, the programmer can make no assumptions about the availability of buffer space, or about how this space is allocated. Thus, portable (safe) MPI code should work under the assumption that an arbitrary subset of the send operations are going to return before a matching receive is posted, and the rest will block until a matching receive is posted. MPI implementations will provide information on the amount of available buffer space and on the buffering policy via the environmental inquiries described in Chapter~\ref{chap:inquiry}.

The following examples involve two processes, with ranks 0 and 1.

The following program is safe, and should always succeed.
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
END IF
\end{verbatim}

The following program is erroneous, and should always fail.
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
END IF
\end{verbatim}
The receive operation of the first process must complete before its send, and can complete only if the matching send of the second process is executed; the receive operation of the second process must complete before its send and can complete only if the matching send of the first process is executed. This program will deadlock.

The following program is unsafe, and may succeed or fail, depending on the implementation.
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
END IF
\end{verbatim}
The message sent by each process has to be copied out before the send operation returns and the receive operation starts. For the program to complete, it is necessary that at least one of the two messages sent is buffered. Thus, this program can succeed only if the MPI implementation buffers messages and the communication system has sufficient buffer space for \mpiarg{count} words of data.
\section{Data Type Matching}
\label{sec:pt2pt-typematch}

One can think of message transmission as consisting of three phases:
\begin{enumerate}
\item Data is pulled out of the send buffer and a message is assembled
\item A message is transferred from sender to receiver
\item Data is pulled from the incoming message and disassembled into the receive buffer
\end{enumerate}
Type matching has to be observed at each of these three phases: the type of each variable in the sender buffer has to match the type specified for that entry by the send operation; the type specified by the send operation has to match the type specified by the receive operation; and the type of each variable in the receive buffer has to match the type specified for that entry by the receive operation. A program that fails to observe these three rules is erroneous.

To define type matching more precisely, we need to deal with two issues: matching of types of the host language with types specified in communication operations; and matching of types at sender and receiver.

A type specified for an entry by a send operation matches the type specified for that entry by a receive operation if both operations use identical names: \type{MPI\_INTEGER} matches \type{MPI\_INTEGER}, \type{MPI\_REAL} matches \type{MPI\_REAL}, and so on. The type of a variable in a host program matches the type specified in the communication operation if the datatype name used by that operation corresponds to the basic type of the host program variable: an entry with type name \type{MPI\_INTEGER} matches a Fortran variable of type \ftype{INTEGER}, an entry with type name \type{MPI\_REAL} matches a Fortran variable of type \ftype{REAL}, and so on. There is one exception to this last rule: an entry with type name \type{MPI\_BYTE} can be used to match any byte of storage (on a byte-addressable machine), irrespective of the datatype of the variable that contains this byte. The value of the message entry will be the binary value of the corresponding byte in memory.

We thus have two cases:
\begin{itemize}
\item Communication of typed values (i.e., with datatype different from \type{MPI\_BYTE}), where the datatypes of the corresponding entries in the sender program, in the send call, in the receive call and in the receiver program should all match.
\item Communication of untyped values (i.e., of datatype \type{MPI\_BYTE}), where both sender and receiver use the datatype \type{MPI\_BYTE}. In this case, there are no requirements on the types of the corresponding entries in the sender and the receiver programs, nor is it required that they be the same.
\end{itemize}

Examples:

1st program:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(a(1), 10, MPI_REAL, 1, tag, comm, ierr)
ELSE
    CALL MPI_RECV(a(1), 15, MPI_REAL, 0, tag, comm, status, ierr)
END IF
\end{verbatim}
This code is correct if both sender and receiver programs have allocated consecutive storage for ten real numbers, starting from {\tt a(1)}.

2nd program:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(a(1), 10, MPI_REAL, 1, tag, comm, ierr)
ELSE
    CALL MPI_RECV(a(1), 40, MPI_BYTE, 0, tag, comm, status, ierr)
END IF
\end{verbatim}
This code is erroneous, since sender and receiver do not provide matching datatype parameters.
3rd program:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(a(1), 40, MPI_BYTE, 1, tag, comm, ierr)
ELSE
    CALL MPI_RECV(a(1), 60, MPI_BYTE, 0, tag, comm, status, ierr)
END IF
\end{verbatim}
This code is correct, irrespective of the type of {\tt a(1)} and of the following variables in storage.
\discuss{
The behavior of erroneous code is implementation-dependent. It is desirable but not required that errors are reported at run-time. An MPI implementation may decide to avoid type-checking for the sake of efficiency, in which case none of the three programs will report an error. On a system where real numbers occupy four bytes and both sender and receiver execute in the same environment, it is likely that all three codes will have the same behavior.
}

\section{Data conversion}
\label{sec:pt2pt-conversion}

One of the goals of MPI is to support parallel computations across heterogeneous environments. Communication in a heterogeneous environment may require data conversions. We use the following terminology:
\begin{description}
\item[type conversion] changes the datatype of a value, e.g. by rounding a \type{REAL} to an \type{INTEGER}.
\item[representation conversion] changes the binary representation of a value, e.g. from hexadecimal floating point to IEEE floating point.
\end{description}
Representation conversion is needed when data is moved across different environments, e.g. different machines that use different binary encodings for the same datatype, or codes compiled with different compiler options for datatype representation. Representation conversion preserves, to the greatest possible extent, the {\bf value} represented. However, rounding errors and overflow and underflow exceptions may occur during floating point conversions; conversion of integers may also lead to exceptions when a value that can be represented in one system cannot be represented in the other system. MPI does not specify rules for representation conversion.

The type matching rules imply that MPI communication never entails type conversion. On the other hand, MPI requires that a representation conversion be performed when a typed value is transferred across environments that use different representations for the datatype of this value. An exception occurring during representation conversion results in a failure of the communication; an error occurs either in the send operation, or the receive operation, or both.

If a value sent in a message is untyped (i.e., of type \type{MPI\_BYTE}), then the binary representation of the byte stored at the receiver is identical to the binary representation of the byte loaded at the sender. This holds true whether sender and receiver run in the same or in distinct environments; no representation conversion is required. Note that no conversion ever occurs when an MPI program executes in a homogeneous system, where all processes run in the same environment.

Also note the different behavior of \type{MPI\_BYTE} and of \type{MPI\_CHARACTER}. A buffer descriptor entry with datatype \type{MPI\_CHARACTER} can only match a Fortran variable of type \ftype{CHARACTER}, and representation conversion may occur when values of type \type{MPI\_CHARACTER} are transferred, e.g., from an EBCDIC encoding to an ASCII encoding.

Consider the previous three examples. The first program is correct, assuming both sender and receiver declared ten consecutive real variables in storage starting at {\tt a(1)}.
If the sender and receiver execute in different environments, then the ten real values that are fetched from the send buffer will be converted to the representation for reals on the receiver site before they are stored in the receive buffer. While the number of real elements fetched from the send buffer equals the number of real elements stored in the receive buffer, the number of bytes stored need not equal the number of bytes loaded: e.g. the sender may use a four-byte representation and the receiver an eight-byte representation for reals. If the send or receive buffer does not contain ten consecutive real variables, then the program is erroneous, and its behavior is undefined.

The second program is erroneous, and its behavior is undefined.

The third program is correct. The exact same sequence of forty bytes that was loaded from the send buffer will be stored in the receive buffer, even if sender and receiver run in different environments. The message sent has exactly the same length (in bytes) and the same binary representation as the message received. If the variables in the send buffer are of different types than the variables in the receive buffer, or if they are of the same type but different data representations are used, then the bits stored in the receive buffer may encode values that are different from the values they encoded in the send buffer.

Data representation conversion also applies to the envelope of a message: source, destination and tag are all integers that may need to be converted.

The current draft does not provide a mechanism for inter-language communication: messages sent by Fortran calls should be received by Fortran calls and messages sent by C calls should be received by C calls (this follows from the requirements of type matching and the fact that Fortran and C datatypes are distinct). If inter-language communication is needed then one needs to invoke C communication routines from a Fortran program or Fortran communication routines from a C program.
\implement{
The current definition does not require messages to carry data type information. A message can be composed and sent using only the information provided by the send call, and can be received and stored using only the information provided by the receive call. If messages are sent between different machines then one can either use a ``universal'' data encoding for messages, use knowledge of the receiver environment in order to convert data at the sender, or use knowledge of the sender environment in order to convert data at the receiver. In any case the datatype parameter in the local call can be used to derive the types of the values transferred. However, additional type information carried by messages can provide better error detection.
}
\discuss{
MPI does not handle inter-language communication because there seem to be no agreed standards for inter-language type conversions. The current (non)solution means that inter-language communication relies on the vendor-specific rules for inter-language calls at each site. It may be desirable to provide information on the correspondence between MPI C datatypes (such as \type{MPI\_FLOAT}) and MPI Fortran datatypes (such as \type{MPI\_REAL}) with suitable environment query functions.
}

\section{Communication Modes}
\label{sec:pt2pt-modes}

The basic send operation described in Section~\ref{sec:pt2pt-basicsend} uses the {\bf standard} communication mode. In this communication mode, a send operation can be started whether or not a matching receive was posted.
The completion of the send operation indicates that the message and its envelope have been safely stored away and that the sender is free to access and modify the send buffer. Thus, the operation is {\bf blocking}: it does not return until the send operation has completed {\bf locally}, on the sender side. The completion of a send operation gives no indication that the message was received on the receiver side.

A blocking send may be implemented so that it returns only after a matching receive has been executed on the receiver side. This avoids the need to buffer message data outside the sender or receiver memory. In this case the send operation completes only after the matching receive has started executing. On the other hand, it is also possible for MPI to buffer messages, so as to allow the sender to proceed ahead of the receiver. In such a case the send operation may complete successfully before the message has been received. Thus, the basic send operation described in the previous section is {\bf asynchronous}, since its return does not imply a synchronization with the (remote) receive operation, and does not imply {\bf global} termination of the communication.

There are two additional communication modes:

A send that uses the {\bf ready} communication mode may be started only if a matching receive is already posted; otherwise the operation is erroneous and its outcome is undefined. In some systems, this allows the removal of a hand-shake operation that is otherwise required, and results in improved performance. The completion of the send operation does not depend on the status of a matching receive, and merely indicates that the send buffer can be reused. A send operation that uses the ready mode has the same semantics as a standard send operation; when using the ready mode, the sender merely provides additional information to the communication subsystem.

A send that uses the {\bf synchronous} mode can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive is posted, and the receive operation has started to receive the message sent by the synchronous send. (I.e., the receive has been posted, and the incoming message has been matched to this posted receive.) Thus, the completion of a synchronous send not only indicates that the send buffer can be reused, but also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive. If both sends and receives are blocking operations then the use of the synchronous mode provides synchronous communication semantics: a communication does not complete at either end before both processes ``attend'' to the communication; the completion of a synchronous send is a {\bf global} event.

Two additional send functions are provided for the two additional communication modes. The communication mode is indicated by a one-letter prefix: {\tt R} for ready and {\tt S} for synchronous.
Send in ready mode:
\begin{funcdef}{MPI\_RSEND (start, count, datatype, dest, tag, comm)}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements in send buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each send buffer element }
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

Send in synchronous mode:
\begin{funcdef}{MPI\_SSEND (start, count, datatype, dest, tag, comm)}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements in send buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each send buffer element }
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

There is only one receive mode, which can match any of the send modes. The receive operation described in the last section is {\bf blocking}: it returns only after the receive buffer contains the newly received message. It is {\bf asynchronous}: the completion of a receive is a local operation, and a receive can complete before the matching send has completed (of course, it can complete only after the matching send has started).

Communication imposes an order on the events occurring at the communicating nodes. It is always the case that the completion of a receive occurs after the start of the matching send. If the synchronous send mode is used, then it is also the case that the completion of a send occurs after the start of the matching receive. Of course, on each process, a communication completes after it is started. No other order is imposed by MPI. E.g., if the standard send mode is used, the send operation may complete before the matching receive has started.

In a multi-threaded implementation of MPI, a blocking send or receive operation may only block the executing thread, not the entire process. Thus, while a thread is blocked waiting for a send or receive to complete, other threads may execute within the same address space. In such a case it is the user's responsibility not to modify the send buffer until the send completes, and not to access or modify the receive buffer until the receive completes; otherwise the outcome of the computation is undefined.
\discuss{
Do we want to say that a blocking operation MAY or MUST only block the executing thread?
}
\implement{
A ready send can be implemented as a standard send; in such a case there will be no performance advantage (or disadvantage) for the use of ready send. A standard send can be implemented as a synchronous send. In such a case, no data buffering is needed.

A possible communication protocol for the various communication modes is outlined below:

{\tt ready send}: The message is sent as soon as possible.

{\tt synchronous send}: The sender sends a request-to-send message. The receiver stores this request. When a matching receive is posted, the receiver sends back a permission-to-send message, and the sender now sends the message.

{\tt standard send}: The first protocol may be used for short messages, and the second protocol for long messages.

Additional control messages might be needed for flow control and error recovery. Of course, there are many other possible choices.
}

\section{Nonblocking communication}
\label{sec:pt2pt-nonblock}

One can improve performance on many systems by overlapping communication and computation.
This is especially true on systems where communication can be executed autonomously by an intelligent communication controller. Light-weight threads are one mechanism for achieving such overlap. An alternative mechanism that often leads to better performance is to use {\bf nonblocking communication}. A nonblocking send call initiates the send operation, but does not complete it. The send call will return before the message is copied out of the send buffer. A separate call is needed to complete the communication, i.e. to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed. Similarly, a nonblocking receive initiates the receive operation, but does not complete it. The call will return before a message is stored into the receive buffer. A separate call is needed to complete the receive operation and verify that the data has been received into the receive buffer. With suitable hardware, the transfer of data into the receiver memory may proceed concurrently with computations done after the receive was initiated and before it completed.

Nonblocking sends can use the same three modes as blocking sends: {\tt standard}, {\tt ready} and {\tt synchronous}. These carry the same meaning. The initiation and completion of a standard send do not depend on the status of a matching receive. A ready send can be initiated only if a matching receive has already been initiated, otherwise the call is erroneous; its completion does not depend on the status of a matching receive. A synchronous send can be initiated before a matching receive has been initiated, but will complete only after a matching receive has been initiated, and has started receiving the message generated by the send operation. In all cases the operation that initiates the communication (send or receive) is a local, nonblocking call that returns as soon as possible.

A nonblocking communication call may fail because the system has exhausted available resources (e.g., exceeded the limit on the number of pending communications per node). Good quality implementations of MPI will set these limits high enough so that a nonblocking communication call will fail only in ``pathological'' cases. Nonblocking sends can be matched with blocking receives, and vice versa.

\subsection{Communication Objects}
\label{subsec:pt2pt-commobject}

Nonblocking communications use opaque communication objects to identify communication operations and to match the operation that initiates the communication with the operation that terminates it. These are system objects that are accessed via a handle. An opaque communication object identifies various properties of a communication operation, such as the (send or receive) buffer that is associated with it, its context, the tag and destination parameters to be used for a send, or the tag and source parameters to be used for a receive. In addition, this object stores information about the status of the pending communication operation that is performed with this object.

\subsection{Communication initiation}
\label{subsec:pt2tp-commstart}

We use the same naming conventions as for blocking communication: a prefix of {\tt R} ({\tt S}) is used for {\tt READY} ({\tt SYNCHRONOUS}) mode. In addition, a prefix of {\tt I} (for {\tt IMMEDIATE}) indicates that the call is nonblocking.

Initiate a standard mode nonblocking communication.
\begin{funcdef}{MPI\_ISEND(handle, start, count, datatype, dest, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements in send buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each send buffer element }
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

Initiate a ready mode nonblocking communication.
\begin{funcdef}{MPI\_IRSEND(handle, start, count, datatype, dest, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements in send buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each send buffer element }
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

Initiate a synchronous mode nonblocking communication.
\begin{funcdef}{MPI\_ISSEND(handle, start, count, datatype, dest, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements in send buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each send buffer element }
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

Initiate a nonblocking receive.
\begin{funcdef}{MPI\_IRECV (handle, start, count, datatype, source, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\OUT}{start}{initial address of receive buffer (choice)}
\funcarg{\IN}{count}{number of elements in receive buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each receive buffer element }
\funcarg{\IN}{source}{rank of source (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

The call allocates a communication object and associates it with the handle. The handle can be used later to query about the status of the communication or wait for its completion. A sender should not update any part of a send buffer after a nonblocking send operation returns, until the send completes. A receiver should not access or update any part of a receive buffer after a nonblocking receive operation returns, until the receive completes.

\subsection{Communication Completion}
\label{subsec:pt2pt-commend}

The completion of a send operation indicates that the sender is now free to update the locations in the send buffer, or any other location that can be referenced by the send operation (the send operation itself leaves the content of the send buffer unchanged). It does not indicate that the message has been received; rather, it may have been buffered by the communication subsystem. However, if a {\tt synchronous} mode send was used, the completion of the send operation indicates that a matching receive was initiated, and that the message will eventually be received by this matching receive.

The completion of a receive operation indicates that the receive buffer contains the received message, and that the status object is set; the receiver is now free to access the receive buffer, or any other location that can be referenced by the receive operation. It does not indicate that the matching send operation has completed (but indicates, of course, that the send was initiated).
\begin{funcdef}{MPI\_WAIT (handle, status)}
\funcarg{\IN}{handle}{handle to communication object}
\funcarg{\OUT}{status}{status object}
\end{funcdef}

A call to \func{MPI\_WAIT} returns when the operation identified by \mpiarg{handle} is complete. If the communication object associated with this handle was created by a nonblocking send or receive call, then the object is deallocated by the call to \func{MPI\_WAIT} and the handle becomes null. The call returns in \mpiarg{status} information on the completed operation. The status object for a receive operation can be queried using the functions described in Section~\ref{subsec:pt2pt-status}. The status parameter is not used or updated for a send operation, and an arbitrary value may be passed for this parameter in the call.
\discuss{
Do we agree on the last sentence? (this implies some restriction on the implementation)
}

\begin{funcdef}{MPI\_TEST (handle, flag, status)}
\funcarg{\IN}{handle}{handle to communication object}
\funcarg{\OUT}{flag}{(logical)}
\funcarg{\OUT}{status}{status object}
\end{funcdef}

A call to \func{MPI\_TEST} returns \mpiarg{flag = true} if the operation identified by \mpiarg{handle} is complete. In such a case, the status object is set to contain information on the completed operation; if the communication object was created by a nonblocking send or receive, then it is deallocated and the handle becomes null. The call returns \mpiarg{flag = false} otherwise. In such a case, the value of the status object is undefined.
\discuss{
We could dispose of the flag parameter and add an additional query function for status objects.
}
\func{MPI\_TEST} is a local, nonblocking operation. If repeatedly called for an operation that is enabled, it must eventually succeed. The return status object for a receive operation carries information that can be accessed using the functions described in Section~\ref{subsec:pt2pt-status}. The status argument is not used or updated for a send operation and an arbitrary value can be passed for this parameter in the call.

In a multi-threaded environment, the use of a blocking completion call (\func{MPI\_WAIT}) may allow the operating system to de-schedule the blocked thread and schedule another thread for execution, if one is available. The use of a nonblocking completion call (\func{MPI\_TEST}) allows the user to schedule alternative activities within a single thread of execution.

Example:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_ISEND(handle, a(1), 10, MPI_REAL, 1, tag, comm, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(handle, status, ierr)
ELSE
    CALL MPI_IRECV(handle, a(1), 15, MPI_REAL, 0, tag, comm, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(handle, status, ierr)
END IF
\end{verbatim}

The functions \func{MPI\_WAIT} and \func{MPI\_TEST} can be used to complete both sends and receives; they will also be used to complete any other nonblocking communication call provided by MPI. Nonblocking communications cannot be replaced by blocking communications, even in the {\tt synchronous} communication mode.
Consider the following example:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_ISEND(handle1, a(1), count, MPI_REAL, 1, tag1, comm, ierr)
    CALL MPI_ISEND(handle2, b(1), count, MPI_REAL, 1, tag2, comm, ierr)
    CALL MPI_WAIT(handle1, status, ierr)
    CALL MPI_WAIT(handle2, status, ierr)
ELSE
    CALL MPI_IRECV(handle2, b(1), count, MPI_REAL, 0, tag2, comm, ierr)
    CALL MPI_IRECV(handle1, a(1), count, MPI_REAL, 0, tag1, comm, ierr)
    CALL MPI_WAIT(handle2, status, ierr)
    CALL MPI_WAIT(handle1, status, ierr)
END IF
\end{verbatim}
The code is guaranteed to execute correctly, even in an implementation that does not buffer messages (the only resource requirement is the ability to have two pending communications). On the other hand, the code may deadlock if the nonblocking operations are replaced by blocking ones.

\subsection{Multiple Completions}
\label{subsec:pt2pt-multiple}

It is convenient to be able to wait for the completion of any or all the operations in a set, rather than having to wait for a specific message. A call to \func{MPI\_WAITANY} or \func{MPI\_TESTANY} can be used to wait or test for the completion of one out of several operations; a call to \func{MPI\_WAITALL} or \func{MPI\_TESTALL} can be used to wait or test for the completion of all pending operations in a list.
\discuss{
\bf \func{MPI\_TESTALL} is new.
}

\begin{funcdef}{MPI\_WAITANY (count, array\_of\_handles, index, status)}
\funcarg{\IN}{count}{list length (integer)}
\funcarg{\IN}{array\_of\_handles}{array of handles to communication objects}
\funcarg{\OUT}{index}{index of handle for operation that completed (integer)}
\funcarg{\OUT}{status}{return status object}
\end{funcdef}

Blocks until one of the operations associated with the communication handles in the array has completed. Returns the index of that handle in the array, and returns the status of that operation in \mpiarg{status}. The successful execution of \func{MPI\_WAITANY(count, array\_of\_handles, index, status)} has the same effect as the successful execution of \func{MPI\_WAIT(array\_of\_handles[i], status)}, where \mpiarg{i} is the value returned in \mpiarg{index}. In particular, the associated communication object is deallocated, and the handle to it in \mpiarg{array\_of\_handles} is set to null. If more than one operation is enabled and can terminate, one is arbitrarily chosen. There is no requirement that the choice satisfy any fairness criterion.
\discuss{
If the statement of fairness is changed in Section~\ref{sec:pt2pt-semantics}, then it will need to change here, too.
}

\begin{funcdef}{MPI\_TESTANY( count, array\_of\_handles, index, status)}
\funcarg{\IN}{count}{list length (integer)}
\funcarg{\IN}{array\_of\_handles}{array of handles to communication objects}
\funcarg{\OUT}{index}{index of handle for operation that completed, or -1 if none completed (integer)}
\funcarg{\OUT}{status}{return status object}
\end{funcdef}

Causes either one or none of the operations associated with the communication handles to complete. In the former case, it returns the index of that handle in the array, and returns the status of that operation in \mpiarg{status}. The communication object is deallocated and the handle to it in \mpiarg{array\_of\_handles} is set to null. In the latter case, it returns a value of -1 in \mpiarg{index} and \mpiarg{status} is undefined.
Like \mpifunc{MPI\_TEST}, this is a nonblocking operation that returns immediately; if one busy-waits with \mpifunc{MPI\_TESTANY}, and some operation on the list is enabled, then \mpifunc{MPI\_TESTANY} will eventually return with a valid \mpiarg{index}.
\discuss{
Could have a \mpiarg{flag} parameter, rather than using index = -1 for unsuccessful return. This is more consistent and more elegant, but requires an additional parameter. If we stick with the current version, we may replace -1 by an MPI constant.
}

\begin{funcdef}{MPI\_WAITALL( count, array\_of\_handles, array\_of\_status)}
\funcarg{\IN}{count}{lists length (integer)}
\funcarg{\IN}{array\_of\_handles}{array of handles to communication objects}
\funcarg{\OUT}{array\_of\_status}{array of status objects}
\end{funcdef}

Blocks until all communication operations associated with handles in the list complete, and returns the status of all these operations. Both arrays have the same number of valid entries. The \mpiarg{i}-th entry in \mpiarg{array\_of\_status} is set to the return status of the \mpiarg{i}-th operation. All communication objects are deallocated and all handles are set to null.

\begin{funcdef}{MPI\_TESTALL(count, array\_of\_handles, flag, array\_of\_status)}
\funcarg{\IN}{count}{lists length (integer)}
\funcarg{\IN}{array\_of\_handles}{array of handles to communication objects}
\funcarg{\OUT}{flag}{(logical)}
\funcarg{\OUT}{array\_of\_status}{array of status objects}
\end{funcdef}

Causes either all or none of the operations associated with the communication handles to complete. It returns \mpiarg{flag = true} if all communications associated with handles in the array have completed. In this case, each status entry is set to the status of the corresponding communication. All communication objects are deallocated, and all handles are set to null. Otherwise, \mpiarg{flag = false} is returned and the values of the status entries are undefined. This is a nonblocking operation that returns immediately; if one busy-waits with \mpifunc{MPI\_TESTALL}, and all operations in the list become enabled, then \mpifunc{MPI\_TESTALL} will eventually return \mpiarg{flag = true}.

Example:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.LT.2) THEN
    CALL MPI_ISEND(handle, a, n, MPI_REAL, 2, tag, comm, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(handle, status, ierr)
ELSE    ! rank.EQ.2
    CALL MPI_IRECV(handle_list(0), a, n, MPI_REAL, 0, tag, comm, ierr)
    CALL MPI_IRECV(handle_list(1), b, n, MPI_REAL, 1, tag, comm, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAITANY(2, handle_list, index, status, ierr)
    IF (index.EQ.0) THEN
        **** handle message from process 0 ****
        CALL MPI_WAIT(handle_list(1), status, ierr)
        **** handle message from process 1 ****
    ELSE
        **** handle message from process 1 ****
        CALL MPI_WAIT(handle_list(0), status, ierr)
        **** handle message from process 0 ****
    END IF
END IF
\end{verbatim}

The calls introduced in this subsection can be used to wait or test for the completion of an arbitrary mix of nonblocking communication calls, including a mix of sends and receives, and of any additional nonblocking communication calls provided by MPI. The \mpiarg{status} argument in \mpifunc{MPI\_WAITANY} and \mpifunc{MPI\_TESTANY} and the \mpiarg{array\_of\_status} argument in \mpifunc{MPI\_WAITALL} and \mpifunc{MPI\_TESTALL} are not used or updated if the completing communication is a send.
If all operations in the list provided by the argument \mpiarg{array\_of\_handles} are send operations, then an arbitrary value can be passed to the \mpiarg{status} (resp.\ \mpiarg{array\_of\_status}) argument.
\discuss{
Is the last paragraph agreed?

We may want to allow null pointers in \mpiarg{array\_of\_handles}. This will allow looping with the same \mpiarg{array\_of\_handles} until all communications have completed. On the other hand, this will encourage potentially inefficient code.

I added the requirement that the handle to a communication that completes is set to null. This prevents the occurrence of dangling references, but adds some overhead.
}

\section{Probe and Cancel}
\label{sec:pt2pt-probe}

The \func{MPI\_PROBE} and \func{MPI\_IPROBE} operations allow incoming messages to be checked for, without actually receiving them. The user can then decide how to receive them, based on the information returned by the probe (basically, the information returned in \mpiarg{status}). In particular, the user may allocate memory for the receive buffer, according to the length of the probed message.

The \func{MPI\_CANCEL} operation allows pending communications to be cancelled. This is required for cleanup. Posting a send or a receive ties up user resources (send or receive buffers), and a cancel may be needed to free these resources gracefully.
\discuss{
Made one change from the approved draft on probe: removed the datatype argument from probe. If generalized datatypes are adopted, then it seems more natural to provide the argument in the status decode function.
}

\begin{funcdef}{MPI\_IPROBE(source, tag, comm, flag, status)}
\funcarg{\IN}{source}{source rank, or \const{MPI\_ANY\_SOURCE} (integer)}
\funcarg{\IN}{tag}{tag value or \const{MPI\_ANY\_TAG} (integer)}
\funcarg{\IN}{comm}{handle to communicator}
\funcarg{\OUT}{flag}{(logical)}
\funcarg{\OUT}{status}{status object}
\end{funcdef}

\func{MPI\_IPROBE} returns \mpiarg{flag = true} if there is a message that can be received and that matches the pattern specified by the parameters \mpiarg{source}, \mpiarg{tag}, and \mpiarg{comm}. It returns \mpiarg{flag = false} otherwise. If \func{MPI\_IPROBE} returns \mpiarg{flag = true}, then the source and tag of the message matched are returned in the status object. These are the same values that would have been returned by a call to \func{MPI\_RECV} executed at the same point in the program. These values can be decoded from the return status object using the status query functions described in Section~\ref{subsec:pt2pt-status}. The return status object is undefined if \mpiarg{flag = false}.

The probe calls do not carry a datatype argument: the message type may not be known when the probe is executed. Accordingly, a probe call does not return in the status object the length of the probed message, and \mpifunc{MPI\_GET\_COUNT} cannot be used after a probe to query the length of the probed message. (Remember that the length of a message, in elements, depends on the types of these elements; some implementations may not carry type information in the transmitted message.) Instead, a different status decoding function that takes the datatype as an argument can be used in this case.
\begin{funcdef}{MPI\_PROBE\_COUNT(status, datatype, count)}
\funcarg{\IN}{status}{status object}
\funcarg{\IN}{datatype}{message datatype}
\funcarg{\OUT}{count}{number of entries in message (integer)}
\end{funcdef}
Returns the length of the message that was last probed with \mpiarg{ status} as an argument. The call is erroneous if \mpiarg{ datatype} does not match the type of the message.
\discuss{ Message passing in MPI can be implemented without appending type information to messages. A message is merely a string of bytes and the interpretation of these bytes into a sequence of typed elements is done using the datatype information provided by the local communication call. The ability to use such an implementation strategy is deemed to be an important goal. In such an implementation, when a message arrives, it is not known how many elements it contains, or even how much storage is needed to receive that message (because of possible representation conversion in a heterogeneous environment). The length of the incoming message can be computed by \func{ MPI\_PROBE\_COUNT} from the byte length of the message and from the datatype parameter. The current solution saves us the need for one additional word per message that would otherwise be needed to transfer the message length (in elements) with the message. {\bf Needs to be discussed by MPIF.} We may want a call to \func{ MPI\_PROBE\_COUNT} to be significant not just after a probe, but also after a receive, for consistency reasons. It is tempting to merge \func{ MPI\_GET\_COUNT} and \func{ MPI\_PROBE\_COUNT} into one call. E.g., always provide a datatype argument, even though it is redundant after a receive. But then we impose an additional burden on regular receives. We may want to relax type restrictions so that \mpiarg{ datatype} = \type{ MPI\_BYTE} could be used as an argument to \func{ MPI\_PROBE\_COUNT}, even if the message was sent with another datatype; this would allow counting the number of bytes received, and using this count in a malloc. }
A subsequent receive executed with the same context, and with the source and tag returned by the call to \func{ MPI\_IPROBE}, will receive the message that was matched by the probe, if no other intervening receive occurred after the probe. If the receiving process is multi-threaded, it is the user's responsibility to ensure that the last condition holds.
\discuss{ MPI guarantees that successive messages sent from a source to a destination within the same context are received in the order they are sent. Thus, MPI must support, either explicitly or implicitly, a FIFO structure to manage messages between each pair of processes, for each context. \func{ MPI\_PROBE} returns information on the first matching message in this FIFO; this will also be the message received by the first subsequent receive with the same source, tag and context as the message matched by \func{ MPI\_PROBE}. }
\begin{funcdef}{MPI\_PROBE( source, tag, comm, status)}
\funcarg{\IN}{source}{source rank, or \const{ MPI\_ANY\_SOURCE} (integer)}
\funcarg{\IN}{tag}{tag value, or \const{ MPI\_ANY\_TAG} (integer)}
\funcarg{\IN}{comm}{handle to communicator}
\funcarg{\OUT}{status}{status object}
\end{funcdef}
\func{ MPI\_PROBE} behaves like \func{ MPI\_IPROBE} except that it is a blocking call which returns only after a matching message has been found.
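The following sketch illustrates the intended use of probe: the receiver first probes for a matching message, uses \func{ MPI\_PROBE\_COUNT} to determine how many elements it carries, and only then receives it. The Fortran bindings shown here are assumed for the purpose of the illustration (each call takes a trailing {\tt ierr} argument, and the blocking receive has the argument list {\tt (start, count, datatype, source, tag, comm, status)}, mirroring \func{ MPI\_SEND}); {\tt buf} is assumed to be large enough to hold {\tt count} reals.
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
   CALL MPI_SEND(a, n, MPI_REAL, 1, tag, comm, ierr)
ELSE    ! rank.EQ.1
C  block until a matching message is available, without receiving it
   CALL MPI_PROBE(0, tag, comm, status, ierr)
C  find out how many reals the probed message carries
   CALL MPI_PROBE_COUNT(status, MPI_REAL, count, ierr)
C  receive the message into a buffer of sufficient size
   CALL MPI_RECV(buf, count, MPI_REAL, 0, tag, comm, status, ierr)
END IF
\end{verbatim}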
\discuss{ \bf CANCEL has been changed since last draft }
\begin{funcdef}{MPI\_CANCEL(handle)}
\funcarg{\IN}{handle}{handle to communication object}
\end{funcdef}
A call to \func{ MPI\_CANCEL} marks for cancellation a pending nonblocking communication operation (send or receive). The cancel call is nonblocking, and local. It returns immediately, possibly before the communication is actually cancelled. It is still necessary to complete a communication that has been marked for cancellation, using a call to \func{ MPI\_WAIT} or \func{ MPI\_TEST} (or any of the derived operations). If the operation has been cancelled, then information to that effect will be returned in the status argument of the operation that completes the communication.
If a communication is marked for cancellation, then a \func{ MPI\_WAIT} call for that communication is guaranteed to return, irrespective of the status of other processes; similarly, if \func{ MPI\_TEST} is repeatedly called in a busy wait loop for a cancelled communication, then \mpifunc{MPI\_TEST} will eventually be successful.
Either the cancellation succeeds, or the communication succeeds, but not both. If a send is marked for cancellation, then it must be the case that either the send completes normally, in which case the message sent was received at the destination process, or that the send is successfully cancelled, in which case no part of the message was received at the destination. Then, any matching receive has to be satisfied by another send. If a receive is marked for cancellation, then it must be the case that either the receive completes normally, or that the receive is successfully cancelled, in which case no part of the receive buffer is altered. Then, any matching send has to be satisfied by another receive.
\begin{funcdef}{MPI\_TEST\_CANCELLED(status, flag)}
\funcarg{\IN}{status}{return status object}
\funcarg{\OUT}{flag}{(logical)}
\end{funcdef}
Returns \mpiarg{ flag = true} if the communication associated with the return status object was cancelled successfully. In such a case, all other fields of \mpiarg{ status} (such as \mpiarg{ length} or \mpiarg{ tag}) are undefined. Returns \mpiarg{ flag = false} otherwise. If a receive operation might be cancelled, then one should call \func{ MPI\_TEST\_CANCELLED} first, to check whether the operation was cancelled, before checking the other values of the return status.
\section{Persistent communication objects}
\discuss { \bf there are some changes here not yet approved by MPIF }
Often a communication with the same parameter list is repeatedly executed within the inner loop of a parallel computation. In such a situation, it may be possible to optimize the communication by binding the list of communication parameters to the communication object once and, then, repeatedly using the communication handle to initiate and complete messages. The communication handle thus created can be thought of as a communication port or a ``half-channel''. It does not provide the full functionality of a conventional channel, since there is no binding of the send port to the receive port: this construct allows reduction of the overhead for communication between processor and communication controller, but not the overhead for communication between one communication controller and another.
A communication object is created using one of the four following calls. These calls involve no communication.
The function \mpifunc{MPI\_CREATE\_SEND} creates a communication object for a standard mode send operation, and binds to it all the parameters of a send operation.
\begin{funcdef}{MPI\_CREATE\_SEND(handle, start, count, datatype, dest, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements sent (integer)}
\funcarg{\IN}{datatype}{type of each element}
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{handle to communicator object}
\end{funcdef}
The call allocates a new communication object and associates the communication handle with it.
\discuss{ Two changes: 1. \mpiarg{ buffer\_handle} is replaced with \mpiarg{ start, count, datatype}, to be in sync with the new datatype proposal. 2. Took out \mpiarg{ persistence}. Does not seem very useful here. Instead, I introduce in the last section a ``universal'' communication function, which would not usually be used directly by programmers, but can be used to define the other communication functions. }
The function \mpifunc{MPI\_CREATE\_RSEND} creates a communication object for a ready mode send operation.
\begin{funcdef}{MPI\_CREATE\_RSEND(handle, start, count, datatype, dest, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\IN}{start}{initial address of send buffer}
\funcarg{\IN}{count}{number of elements sent (integer)}
\funcarg{\IN}{datatype}{type of each element}
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{handle to communicator object}
\end{funcdef}
The function \mpifunc{MPI\_CREATE\_SSEND} creates a communication object for a synchronous mode send operation.
\begin{funcdef}{MPI\_CREATE\_SSEND(handle, start, count, datatype, dest, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\IN}{start}{initial address of send buffer}
\funcarg{\IN}{count}{number of elements sent (integer)}
\funcarg{\IN}{datatype}{type of each element}
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{handle to communicator object}
\end{funcdef}
The function \mpifunc{MPI\_CREATE\_RECV} creates a communication object for a receive operation.
\begin{funcdef}{MPI\_CREATE\_RECV(handle, start, count, datatype, source, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\OUT}{start}{initial address of receive buffer}
\funcarg{\IN}{count}{number of elements received (integer)}
\funcarg{\IN}{datatype}{type of each element}
\funcarg{\IN}{source}{rank of source, or \const{ MPI\_ANY\_SOURCE} (integer)}
\funcarg{\IN}{tag}{message tag, or \const{ MPI\_ANY\_TAG} (integer)}
\funcarg{\IN}{comm}{handle to communicator object}
\end{funcdef}
A communication (send or receive) that uses a predefined handle is initiated by the function \mpifunc{MPI\_START}.
\begin{funcdef}{MPI\_START(handle)}
\funcarg{\INOUT}{handle}{handle to communication object}
\end{funcdef}
The communication object associated with \mpiarg{ handle} should be an object that was created by one of the previous four functions, so that all the communication parameters are already defined. A send can be started provided that the previous send using the same object has completed, or as soon as the object is created, if it has not yet been used in a communication. In addition, if the communication mode is {\tt ready} then a matching receive should be posted.
The send buffer should not be updated after the send is started, until the operation completes. A receive can be started provided that the preceding receive using the same object has completed, or as soon as the object is created, if it has not yet been used in a communication. The receive buffer should not be accessed after the receive is started, until the operation completes. The call is nonblocking, with similar semantics as the nonblocking communication operations described in Section~\ref{sec:pt2pt-nonblock}.
A communication started with a call to \mpifunc{MPI\_START} is completed by a call to \mpifunc{MPI\_WAIT}, \mpifunc{MPI\_TEST}, or one of the derived functions described in Section~\ref{subsec:pt2pt-multiple}. These communication completion functions do not deallocate the communication object, and it can be reused by another \mpifunc{MPI\_START} call. The object needs to be explicitly deallocated by a call to the function \mpifunc{MPI\_COMM\_FREE}, below.
We thus have two types of communication objects: {\bf persistent} objects, which are allocated by a call to \mpifunc{MPI\_CREATE\_xxx}, and are explicitly deallocated by \mpifunc{MPI\_COMM\_FREE}, and {\bf ephemeral} objects that persist for one communication only: they are created by a nonblocking communication initiation function, and are freed by the communication completion call.
\begin{funcdef}{MPI\_COMM\_FREE(handle)}
\funcarg{\INOUT}{handle}{handle to communication object}
\end{funcdef}
Marks the communication object for deallocation. The object will be deallocated when there are no pending communications involving this object, at which point the handle becomes null. The call is nonblocking, and the object may not be deallocated when the call returns. It is permissible to call \mpifunc{MPI\_COMM\_FREE(handle)} after a communication that uses \mpiarg{ handle} has been initiated, but before it has completed. The object will be deallocated after the communication completes. It is erroneous to try to reuse the handle for a subsequent communication.
A correct invocation of the functions described in this section will occur in a sequence of the form
\[ \bf Create \ (Start \ Complete)^* \ Free \ , \]
where $*$ indicates zero or more repetitions. If the same communication object is used in several concurrent threads, it is the user's responsibility to coordinate calls so that the correct sequence is obeyed.
A send operation initiated with \func{ MPI\_START} can be matched with any receive operation and, likewise, a receive operation initiated with \func{ MPI\_START} can receive messages generated by any send operation.
\section{Send-receive}
\label{sec:pt2pt-sendrecv}
\discuss{ \bf This section has not yet been approved by MPIF. }
The {\bf send-receive} operation combines in one call the sending of a message to one destination and the receiving of another message, from another destination, possibly the same. The {\bf exchange} operations are the same as send-receive, except that the send buffer and the receive buffer are identical. A send-receive operation is very useful for executing a shift operation across a chain of processes. If blocking sends and receives are used for such a shift, then one needs to order the sends and receives correctly (e.g., even processes send first and then receive, while odd processes receive first and then send) so as to prevent cyclic dependencies that lead to deadlock. When a send-receive or exchange operation is used, the communication subsystem takes care of these issues.
Also, a send-receive operation is useful for implementing remote procedure calls. A message sent by a send-receive or exchange operation can be received by a regular receive operation, and vice versa.
\begin{funcdef}{MPI\_SENDRECV(send\_start, send\_count, send\_type, dest, recv\_start, recv\_count, recv\_type, source, tag, comm, status)}
\funcarg{\IN}{send\_start}{initial address of send buffer (choice)}
\funcarg{\IN}{send\_count}{number of elements in send buffer (integer)}
\funcarg{\IN}{send\_type}{type of elements in send buffer}
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\OUT}{recv\_start}{initial address of receive buffer (choice)}
\funcarg{\IN}{recv\_count}{number of elements in receive buffer (integer)}
\funcarg{\IN}{recv\_type}{ type of elements in receive buffer}
\funcarg{\IN}{source}{rank of source (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{handle to communicator}
\funcarg{\OUT}{status}{status object}
\end{funcdef}
Executes a blocking send and receive operation. Both send and receive use the same tag value and the same communicator. However, the send buffer and receive buffer are disjoint, and may have different lengths and different datatypes.
\discuss{ For a shift it is more natural to have the same type and the same length both for send and receive; for a remote procedure call, the additional freedom makes sense. }
\begin{funcdef}{MPI\_EXCHANGE(start, count, datatype, dest, source, tag, comm, status)}
\funcarg{\IN}{start}{initial address of send and receive buffer (choice)}
\funcarg{\IN}{count}{number of elements in send and receive buffer (integer)}
\funcarg{\IN}{datatype}{type of elements in send and receive buffer}
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{source}{rank of source (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{handle to communicator}
\funcarg{\OUT}{status}{status object}
\end{funcdef}
Executes a blocking send and receive; the same buffer is used both for the send and for the receive.
\discuss{ I omitted the ready and synchronous modes for send/receive and exchange. Also, I omitted nonblocking send/receive and exchange. Do we want any of those? }
The semantics of a send-receive operation is what would obtain if the caller forked two concurrent threads, one to execute the send, and one to execute the receive, followed by a join of these two threads. A send-receive cannot be implemented by a blocking send followed by a blocking receive, or by a blocking receive followed by a blocking send. Consider the following code, where two processes exchange messages:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
   CALL MPI_SENDRECV(send_buff, count, MPI_REAL, 1, recv_buff, count, MPI_REAL, 1, tag, comm, status, ierr)
ELSE
   CALL MPI_SENDRECV(send_buff, count, MPI_REAL, 0, recv_buff, count, MPI_REAL, 0, tag, comm, status, ierr)
END IF
\end{verbatim}
If the send-receives are replaced either by a blocking send followed by a blocking receive, or by a blocking receive followed by a blocking send, then the code may deadlock. On the other hand, send-receive can be implemented using nonblocking sends and receives.
Note that some system buffering is required for a correct implementation of \func{ MPI\_EXCHANGE}. (Consider the last example, with send-receive replaced by exchange.)
\section{Null processes}
\label{sec:pt2pt-nullproc}
\discuss{ This section has not been reviewed by MPIF. }
In many instances, it is convenient to specify a ``dummy'' source or destination for communication.
This simplifies the code that is needed for dealing with boundaries, e.g., in the case of a non-circular shift done with calls to send-receive.
The special value \const{ MPI\_PROCNULL} can be used instead of a rank wherever a source or a destination parameter is required in a call. A communication with process \const{ MPI\_PROCNULL} has no effect: a send to \const{ MPI\_PROCNULL} succeeds and returns as soon as possible. A receive from \const{ MPI\_PROCNULL} succeeds and returns as soon as possible with no modifications to the receive buffer. When a receive with \mpiarg{ source} = \const{ MPI\_PROCNULL} is executed, the status object returns \mpiarg{ source} = \const{ MPI\_PROCNULL}, \mpiarg{ tag} = \const{ MPI\_ANY\_TAG} and \mpiarg{ count=0}.
\discuss{ It would be nice to have \const{ MPI\_PROCNULL} ``equivalenced'' to -1 and to group\_size, so that, when a shift is executed in a group, the first and last processes automatically communicate with the null process. One choice is to say that a source/destination with rank -1 or group\_size is always the null process. We then lose some error catching capability. Another choice is to say that this holds only for send-receive and exchange. A third choice is to live without this nicety. It means that one still has to execute conditional code (if myrank=0 then myneighbor=mpi\_procnull). But this code needs to be executed only when the communication pattern is set, and the conditional boundary code is not needed in the inner loop. }
\section{Derived datatypes}
\label{sec:pt2pt-datatype}
\discuss{ \bf This is a new section, and the material has not yet been discussed by MPIF. }
Up to now, all point to point communication has involved only contiguous buffers containing a sequence of elements of the same type. This is too constraining on two accounts: one often wants to pass messages that contain values with different datatypes (e.g., an integer count, followed by count real numbers); and one often wants to send noncontiguous data (e.g., a sub-block of a matrix). One solution is to provide functions that pack noncontiguous data into a contiguous buffer at the sender site and unpack it back at the receiver site. This has the disadvantage of requiring additional memory-to-memory copying at both sites, even when the communication subsystem has scatter-gather capabilities. Instead, MPI provides mechanisms to specify more general, mixed and noncontiguous communication buffers. It is up to the implementation to decide whether data should first be packed into a contiguous buffer before being transmitted, or whether it can be collected directly from where it resides.
The general mechanisms provided here allow one to transfer directly, without copying, objects of various shapes and sizes. It is not assumed that the MPI library is cognizant of the objects declared in the host language; thus, if one wants to transfer a structure, or an array section, it will be necessary to provide in MPI a definition of a communication buffer that mimics the definition of the structure or array section in question. These facilities can be used by library designers to define communication functions that can transfer objects defined in the host language -- by decoding their definitions as available in a symbol table or a dope vector. Such higher-level communication functions are not part of MPI.
More general communication buffers are specified by replacing the basic datatypes that have been used so far with derived datatypes that are constructed from basic datatypes using the constructors described in this section. These methods of constructing derived datatypes can be applied recursively.
A {\bf general datatype} is an opaque object that specifies two things:
\begin{itemize}
\item A sequence of basic types
\item A sequence of integer (byte) displacements
\end{itemize}
The displacements are not required to be positive, distinct, or in increasing order; therefore the order of items need not coincide with their order in store, and an item may appear more than once. Let
\[ Datatype = \{ (type_0,disp_0), ..., (type_{n-1}, disp_{n-1}) \} , \]
be such a general datatype, where $type_i$ are basic types, and $disp_i$ are displacements. This general datatype, together with a base address \mpiarg{ start}, specifies a communication buffer: the communication buffer that consists of $n$ entries, where the $i$-th entry is at address $\mpiarg{ start} + disp_i$ and has type $type_i$. A message assembled from this communication buffer will contain a sequence of $n$ values, where value $i$ has type $type_i$. The general datatype determines the {\bf type signature} of the message (i.e., the number of values carried by the message and the basic type of each value).
We can use a handle to a general datatype as an argument in a send or receive operation, in place of a basic datatype argument. The operation \func{ MPI\_SEND(start, 1, datatype,...)} will use the send buffer defined by the base address \mpiarg{ start} and the general datatype associated with \mpiarg{ datatype}; it will generate a message with the type signature determined by the \mpiarg{ datatype} argument. \func{ MPI\_RECV(start, 1, datatype,...)} will use the receive buffer defined by the base address \mpiarg{ start} and the general datatype associated with \mpiarg{ datatype}. General datatypes can be used in all send and receive operations. We address later, in Section~\ref{subsec:pt2pt-datatypeuse}, the case where the second argument \mpiarg{ count} has value $> 1$.
The predefined basic datatypes presented in Section~\ref{subsec:pt2pt-messagedata} are particular cases of a general datatype. Thus, \type{ MPI\_INT} is a predefined handle to the datatype $\{ ({\tt int}, 0) \}$, with one entry of type {\tt int} and displacement zero. And similarly for all other basic datatypes.
The {\bf extent} of a datatype is defined to be the span from the first byte to the last byte occupied by entries in this datatype, rounded up to satisfy alignment requirements. I.e., if
\[ Datatype = \{ (type_0,disp_0), ..., (type_{n-1}, disp_{n-1}) \} , \]
$disp_r = \min_j disp_j$ and $disp_s = \max_j disp_j$, then
\begin{equation}
\label{eq:pt2pt-extent}
extent(Datatype) = disp_s + sizeof(type_s) - disp_r ;
\end{equation}
if, furthermore, $type_i$ requires alignment to a byte address that is a multiple of $k_i$, then $extent(Datatype)$ is rounded up to the next multiple of $\max_i k_i$.
Example: Assume that
\[ \label{eq:pt2pt-type1} Type1 = \{ ({\tt double},0), ({\tt char}, 8) \} \]
(a {\tt double} at displacement zero, followed by a {\tt char} at displacement eight). Assume, furthermore, that doubles have to be strictly aligned at addresses that are multiples of eight. Then, the extent of this datatype is 16 (9 rounded up to the next multiple of 8). A datatype that consists of a character immediately followed by a double will also have an extent of 16.
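To spell out the arithmetic behind the last claim (assuming the character sits at displacement zero and the double immediately after it at displacement one, with 8-byte alignment for doubles as above), Eq.~\ref{eq:pt2pt-extent} gives
\[ \{ ({\tt char}, 0), ({\tt double}, 1) \} : \quad extent = 1 + 8 - 0 = 9 , \]
which is then rounded up to the next multiple of 8, yielding an extent of 16.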
\subsection{Datatype constructors}
\label{subsec:pt2pt-datatypeconst}
The simplest datatype constructor is \mpifunc{MPI\_TYPE\_CONTIGUOUS}, which allows replication of a datatype into contiguous locations.
\begin{funcdef}{MPI\_TYPE\_CONTIGUOUS(count, oldtype, newtype)}
\funcarg{\IN}{count}{replication count (integer)}
\funcarg{\IN}{oldtype}{handle to input datatype object}
\funcarg{\OUT}{newtype}{handle to output datatype object}
\end{funcdef}
\mpiarg{ newtype} is the datatype obtained by concatenating \mpiarg{ count} copies of \mpiarg{ oldtype}. Padding may be added to satisfy alignment requirements of the underlying architecture.
Example: let \mpiarg{ oldtype} be a handle to the datatype
\[ \{ ({\tt double}, 0), ({\tt char}, 8) \} , \]
with extent 16, and let $\mpiarg{ count} = 3$. The resulting datatype returned in \mpiarg{ newtype} is
\[ \{ ({\tt double}, 0), ({\tt char}, 8), ({\tt double}, 16), ({\tt char}, 24), ({\tt double}, 32), ({\tt char}, 40) \} ; \]
i.e., alternating {\tt double} and {\tt char} elements, with displacements $0, 8, 16, 24, 32, 40$.
In general, assume that \mpiarg{ oldtype} is a handle to the datatype
\[ \{ (type_0,disp_0), ..., (type_{n-1}, disp_{n-1}) \} , \]
with extent $extent$. Then \mpiarg{ newtype} is a handle to the datatype with ${\tt count} \cdot n$ entries defined by:
\[ \{ (type_0, disp_0), ..., (type_{n-1}, disp_{n-1}), (type_0, disp_0 +extent), ... ,(type_{n-1}, disp_{n-1} + extent) ,..., \]
\[ (type_0, disp_0 +extent \cdot({\tt count}-1) ), ... , (type_{n-1} , disp_{n-1} + extent \cdot ({\tt count}-1)) \} . \]
\mpifunc{MPI\_TYPE\_VECTOR} is a more general constructor that allows replication of a datatype into locations that consist of equally spaced contiguous blocks of equal size -- block sizes and block displacements are all multiples of the old type extent.
\begin{funcdef}{MPI\_TYPE\_VECTOR( count, oldtype, stride, blocklen, newtype)}
\funcarg{\IN}{count}{number of blocks (integer)}
\funcarg{\IN}{oldtype}{handle to input datatype object}
\funcarg{\IN}{stride}{number of elements between start of each block (integer)}
\funcarg{\IN}{blocklen}{number of elements in each block (positive integer)}
\funcarg{\OUT}{newtype}{handle to output datatype object}
\end{funcdef}
Example: Assume, again, that \mpiarg{ oldtype} is a handle to the type
\[ \{ ({\tt double}, 0), ({\tt char}, 8) \} , \]
with extent 16. A call to \mpifunc{MPI\_TYPE\_VECTOR( 2, oldtype, 4, 3, newtype)} will create the datatype
\[ \{ ({\tt double}, 0), ({\tt char}, 8), ({\tt double}, 16), ({\tt char}, 24), ({\tt double}, 32), ({\tt char}, 40), \]
\[ ({\tt double}, 64), ({\tt char}, 72), ({\tt double}, 80), ({\tt char}, 88), ({\tt double}, 96), ({\tt char}, 104) \} : \]
two blocks with three copies each of the old type, starting 4*16 bytes apart. A call to \mpifunc{MPI\_TYPE\_VECTOR( 3, oldtype, -2, 1, newtype)} will create the datatype
\[ \{ ({\tt double}, 0), ({\tt char}, 8), ({\tt double}, -32), ({\tt char}, -24), ({\tt double}, -64), ({\tt char}, -56) \} . \]
In general, assume that \mpiarg{ oldtype} is a handle to the datatype
\[ \{ (type_0,disp_0), ..., (type_{n-1}, disp_{n-1}) \} , \]
with extent $extent$. The newly created datatype is a sequence of length ${\tt count} \cdot {\tt blocklen} \cdot n$ with entries:
\[ \{ (type_0, disp_0), ... , (type_{n-1} , disp_{n-1}), \]
\[ (type_0 ,disp_0 + extent) , ... , (type_{n-1} , disp_{n-1} + extent ), ..., \]
\[ (type_0 , disp_0 + ({\tt blocklen} -1) \cdot extent ) , ... , (type_{n-1} , disp_{n-1} + ({\tt blocklen} -1) \cdot extent ) , \]
\[ (type_0 ,disp_0 + {\tt stride} \cdot extent ) , ... , (type_{n-1} , disp_{n-1} + {\tt stride} \cdot extent ), ... , \]
\[ (type_0 , disp_0 + ({\tt stride} + {\tt blocklen} -1) \cdot extent ) , ... , \]
\[ (type_{n-1}, disp_{n-1} + ({\tt stride} + {\tt blocklen} -1) \cdot extent ) , ... , \]
\[ (type_0 ,disp_0 + {\tt stride} \cdot ({\tt count}-1) \cdot extent ) , ... , \]
\[ (type_{n-1} , disp_{n-1} + {\tt stride} \cdot ({\tt count} -1) \cdot extent ) , ... , \]
\[ (type_0 , disp_0 + ({\tt stride} \cdot ({\tt count} -1) + {\tt blocklen} -1) \cdot extent ) , ... , \]
\[ (type_{n-1}, disp_{n-1} + ({\tt stride} \cdot ({\tt count} -1) + {\tt blocklen} -1) \cdot extent ) \} \]
\discuss{ I changed \mpiarg{ count} from being the number of entries (or replicates), as in the old draft, to being the number of blocks. The reason is that in the next calls I allow block lengths to vary; it is easy to say what it means to have $n$ blocks of sizes $b_0 , ... , b_{n-1}$; it is harder to explain what it means to have $n$ elements filling up successive positions in blocks of sizes $b_0 , b_1 , ...$. We lose the ability to use \type{MPI\_TYPE\_VECTOR} to create a partially filled last block (the more general \type{MPI\_TYPE\_INDEXED} function is needed); we gain consistency with the next calls, and convenience in these calls. Opinions? }
A call to \mpifunc{MPI\_TYPE\_CONTIGUOUS(count, oldtype, newtype)} is equivalent to a call to \mpifunc{MPI\_TYPE\_VECTOR(count, oldtype, 1, 1, newtype)}.
The function \mpifunc{MPI\_TYPE\_HVECTOR} is identical to \mpifunc{MPI\_TYPE\_VECTOR}, except that \mpiarg{ stride} is given in bytes, rather than in elements. ({\tt H} is used for ``heterogeneous''.) The use of both types of vector constructors is illustrated in Section~\ref{subsec:pt2pt-examples}.
\begin{funcdef}{MPI\_TYPE\_HVECTOR( count, oldtype, stride, blocklen, newtype)}
\funcarg{\IN}{count}{number of blocks (integer)}
\funcarg{\IN}{oldtype}{handle to input datatype object}
\funcarg{\IN}{stride}{number of bytes between start of each block (integer)}
\funcarg{\IN}{blocklen}{number of elements in each block (positive integer)}
\funcarg{\OUT}{newtype}{handle to output datatype object}
\end{funcdef}
\mpifunc{MPI\_TYPE\_INDEXED} is a more general function that allows replication of an old datatype into a sequence of contiguous blocks, where block lengths and block displacements can all differ -- but are all multiples of the old type extent.
\begin{funcdef}{MPI\_TYPE\_INDEXED( count, oldtype, array\_of\_indices, array\_of\_blocklen, newtype)}
\funcarg{\IN}{count}{number of blocks (integer)}
\funcarg{\IN}{oldtype}{handle to input datatype object}
\funcarg{\IN}{array\_of\_indices}{displacement for each block, in multiples of \mpiarg{ oldtype} extent (array of integer)}
\funcarg{\IN}{array\_of\_blocklen}{number of elements per block (array of integer)}
\funcarg{\OUT}{newtype}{handle to output datatype object}
\end{funcdef}
Example: let \mpiarg{ oldtype} be a handle to the type
\[ \{ ({\tt double}, 0), ({\tt char}, 8) \} , \]
with extent 16. Let {\tt I = (64, 0)} and let {\tt B = (3, 1)}.
A call to \mpifunc{MPI\_TYPE\_INDEXED(2, oldtype, I, B, newtype)} returns in \mpiarg{ newtype} a handle to the type
\[ \{ ({\tt double}, 64), ({\tt char}, 72), ({\tt double}, 80), ({\tt char}, 88), ({\tt double}, 96), ({\tt char}, 104), \]
\[ ({\tt double}, 0), ({\tt char}, 8) \} : \]
three copies of the old type starting at displacement 64, and one copy starting at displacement 0.
In general, assume that \mpiarg{ oldtype} is a handle to the datatype
\[ \{ (type_0,disp_0), ..., (type_{n-1}, disp_{n-1}) \} , \]
with extent {\em extent}. Let \mpiarg{ B} be the \mpiarg{ array\_of\_blocklen} parameter and \mpiarg{ I} be the \mpiarg{ array\_of\_indices} parameter. The newly created type returned in \mpiarg{ newtype} is the sequence of $n \cdot \sum_{i=0}^{{\tt count}-1} \mpiarg{ B[i]}$ entries
\[ \{ (type_0, disp_0 + \mpiarg{ I[0]} \cdot extent ) , ... , (type_{n-1} , disp_{n-1} + \mpiarg{ I[0]} \cdot extent ) , ... , \]
\[ (type_0 , disp_0 + (\mpiarg{ I[0]} + \mpiarg{ B[0]} -1) \cdot extent) ,..., (type_{n-1} , disp_{n-1} + (\mpiarg{ I[0]} +\mpiarg{ B[0]} -1) \cdot extent ) , ..., \]
\[ (type_0, disp_0 + \mpiarg{ I[count-1]} \cdot extent ) , ... , (type_{n-1} , disp_{n-1} + \mpiarg{ I[count-1]} \cdot extent ) , ... , \]
\[ (type_0 , disp_0 + (\mpiarg{ I[count-1]} + \mpiarg{ B[count-1]} -1) \cdot extent) ,..., \]
\[ (type_{n-1} , disp_{n-1} + (\mpiarg{ I[count-1]} +\mpiarg{ B[count-1]} -1) \cdot extent ) \} . \]
A call to \mpifunc{MPI\_TYPE\_VECTOR(count, oldtype, stride, blocklen, newtype)} is equivalent to a call to \mpifunc{MPI\_TYPE\_INDEXED(count, oldtype, I, B, newtype)} where
\[ \mpiarg{ I[j]} = j \cdot \mpiarg{ stride} \ , j=0 ,..., {\tt count} -1 , \]
and
\[ \mpiarg{ B[j]} = \mpiarg{ blocklen} \ , j=0 ,..., {\tt count} -1 . \]
The function \mpifunc{MPI\_TYPE\_HINDEXED} is identical to \mpifunc{MPI\_TYPE\_INDEXED}, except that block displacements in \mpiarg{ array\_of\_indices} are specified in bytes, rather than in multiples of the old type extent.
\begin{funcdef}{MPI\_TYPE\_HINDEXED( count, oldtype, array\_of\_indices, array\_of\_blocklen, newtype)}
\funcarg{\IN}{count}{number of blocks (integer)}
\funcarg{\IN}{oldtype}{handle to input datatype object}
\funcarg{\IN}{array\_of\_indices}{byte displacement of each block (array of integer)}
\funcarg{\IN}{array\_of\_blocklen}{number of elements in each block (array of integer)}
\funcarg{\OUT}{newtype}{handle to output datatype object}
\end{funcdef}
\mpifunc{MPI\_TYPE\_STRUCT} is the most general constructor: it further generalizes the previous one in that it allows each block to consist of replications of a different datatype.
\begin{funcdef}{MPI\_TYPE\_STRUCT(count, array\_of\_types, array\_of\_indices, array\_of\_blocklen, newtype)}
\funcarg{\IN}{count}{number of blocks (integer)}
\funcarg{\IN}{array\_of\_types}{type of elements in each block (array of datatype objects)}
\funcarg{\IN}{array\_of\_indices}{byte displacement of each block (array of integer)}
\funcarg{\IN}{array\_of\_blocklen}{number of elements in each block (array of integer)}
\funcarg{\OUT}{newtype}{handle to output datatype object}
\end{funcdef}
Example: Let \mpiarg{ type1} be a handle to the type
\[ \{ ({\tt double}, 0), ({\tt char}, 8) \} , \]
with extent 16. Let {\tt I = (0, 16, 26)}, {\tt B = (2, 1, 3)} and {\tt T = (MPI\_FLOAT, type1, MPI\_BYTE)}.
Then a call to \mpifunc{MPI\_TYPE\_STRUCT(3, T, I, B, newtype)} returns in \mpiarg{ newtype} a handle to the datatype
\[ \{ ({\tt float}, 0), ({\tt float}, 4), ({\tt double}, 16), ({\tt char}, 24), ({\tt byte}, 26), ({\tt byte}, 27), ({\tt byte}, 28) \} : \]
two copies of \type{ MPI\_FLOAT} starting at 0, followed by one copy of \mpiarg{ type1} starting at 16, followed by three copies of \type{ MPI\_BYTE}, starting at 26. (We assume that a \type{ float} occupies four bytes.)
In general, let \mpiarg{ T} be the \mpiarg{ array\_of\_types} parameter, where \mpiarg{ T[i]} is a handle to
\[ type_i = \{ (type_0^i , disp_0^i ) , ... , (type_{n_i}^i , disp_{n_i}^i ) \} , \]
with extent $extent_i$. Let \mpiarg{ B} be the \mpiarg{ array\_of\_blocklen} parameter and \mpiarg{ I} be the \mpiarg{ array\_of\_indices} parameter. Then the newly created datatype is the sequence of $\sum_{i=0}^{{\tt count}-1}\mpiarg{ B[i]} \cdot (n_i +1)$ entries
\[ \{ (type_0^0 , disp_0^0 +\mpiarg{ I[0]}) , ... , (type_{n_0}^0 , disp_{n_0}^0 + \mpiarg{ I[0]} ) , ... , \]
\[ (type_0^0 , disp_0^0 + \mpiarg{ I[0]} + (\mpiarg{ B[0]}-1) \cdot extent_0 ) , ... , (type_{n_0}^0 , disp_{n_0}^0 + \mpiarg{ I[0]} + (\mpiarg{ B[0]}-1) \cdot extent_0 ) , ... , \]
\[ (type_0^{{\tt count}-1} , disp_0^{{\tt count}-1} +\mpiarg{ I[count-1]}) , ... , (type_{n_{{\tt count}-1}}^{{\tt count}-1} , disp_{n_{{\tt count}-1}}^{{\tt count}-1} + \mpiarg{ I[count-1]} ) , ... , \]
\[ (type_0^{{\tt count}-1} , disp_0^{{\tt count}-1} + \mpiarg{ I[count-1]} + (\mpiarg{ B[count-1]}-1) \cdot extent_{{\tt count}-1} ) , ... , \]
\[ (type_{n_{{\tt count}-1}}^{{\tt count}-1} , disp_{n_{{\tt count}-1}}^{{\tt count}-1} + \mpiarg{ I[count-1]} + (\mpiarg{ B[count-1]}-1) \cdot extent_{{\tt count}-1} ) \} \]
A call to \mpifunc{MPI\_TYPE\_HINDEXED( count, oldtype, I, B, newtype)} is equivalent to a call to \mpifunc{MPI\_TYPE\_STRUCT( count, T, I, B, newtype)}, where each entry of \mpiarg{ T} is equal to \mpiarg{ oldtype}.
\discuss{ I am open to suggestions for better names for these functions. }
\subsection{Additional functions}
The displacements in a general datatype are relative to some start address. {\bf Absolute addresses} can be substituted for these displacements: we treat them as displacements relative to ``address zero'', the start of the address space. This initial address zero is indicated by the constant \const{ MPI\_BOTTOM}. Thus, a datatype can specify the absolute addresses of the entries in the communication buffer, in which case the \mpiarg{ start} parameter is passed the value \const{ MPI\_BOTTOM}.
The address of a location in memory can be found by invoking the function \mpifunc{MPI\_ADDRESS}.
\begin{funcdef}{MPI\_ADDRESS(location, address)}
\funcarg{\IN}{location}{location in caller memory (choice)}
\funcarg{\OUT}{address}{address of location (integer)}
\end{funcdef}
Returns the (byte) address of \mpiarg{ location}.
Another useful auxiliary function is \mpifunc{MPI\_TYPE\_EXTENT}, which returns the extent of a datatype -- where extent is as defined in Eq.~\ref{eq:pt2pt-extent}.
\begin{funcdef}{MPI\_TYPE\_EXTENT(datatype, extent)}
\funcarg{\IN}{datatype}{handle to datatype object}
\funcarg{\OUT}{extent}{datatype extent (integer)}
\end{funcdef}
As defined now, the extent of a datatype goes from its first occupied byte to its last occupied byte. It is often convenient to define a datatype that has a ``hole'' at its beginning and/or its end, or a datatype with entries that extend beyond the datatype ``extent''.
This allows one to use the functions \func{ MPI\_TYPE\_CONTIGUOUS}, \func{ MPI\_TYPE\_VECTOR}, or \func{ MPI\_TYPE\_INDEXED} to replicate a more general pattern; examples of such usage are provided in Section~\ref{subsec:pt2pt-examples}. To achieve this, we add two additional ``pseudo-datatypes'', \type{ MPI\_LB} and \type{ MPI\_UB}, that can be used, respectively, to mark the lower bound or the upper bound of a datatype. These pseudo-datatypes occupy no space ($extent(\type{ MPI\_LB}) = extent(\type{ MPI\_UB}) =0$). They do not affect the signature of a message created with this datatype. However, they do affect the definition of the extent of a datatype and, therefore, affect the outcome of a replication of this datatype by a datatype constructor.
Example: Let {\tt I = (-3, 0, 6)}, {\tt T = (MPI\_LB, MPI\_INT, MPI\_UB)}, and {\tt B = (1, 1, 1)}. Then a call to \mpifunc{MPI\_TYPE\_STRUCT(3, T, I, B, type1)} creates a new datatype that has an extent of 9 (from -3 to 6), and contains an integer at displacement 0. This is the datatype defined by the sequence {\tt \{(lb, -3), (int, 0), (ub, 6)\} }. If this type is replicated twice by a call to \mpifunc{MPI\_TYPE\_CONTIGUOUS(2, type1, type2)} then the newly created type can be described by the sequence {\tt \{(lb, -3), (int, 0), (int, 9), (ub, 15)\} }. (Entries of type {\tt lb} or {\tt ub} can be deleted if they are not at the endpoints of the datatype.)
In general, if
\[ Datatype = \{ (type_0 , disp_0 ) , ... , (type_{n-1} , disp_{n-1}) \} , \]
then the {\bf lower bound} of $Datatype$ is defined to be
\[ lb(Datatype) = \left\{ \begin{array}{ll} \min_j disp_j & \mbox{if no entry has basic type $type_j = {\tt lb}$} \\ \min \{ disp_j \ : \ type_j = {\tt lb} \} & \mbox{otherwise} \end{array} \right. \]
Similarly, the {\bf upper bound} of $Datatype$ is defined to be
\[ ub(Datatype) = \left\{ \begin{array}{ll} \max_j \{ disp_j + sizeof(type_j) \} & \mbox{if no entry has basic type $type_j = {\tt ub}$} \\ \max \{ disp_j \ : \ type_j = {\tt ub} \} & \mbox{otherwise} \end{array} \right. \]
Then
\[ extent(Datatype) = ub(Datatype) - lb(Datatype) . \]
Furthermore, if $type_i$ requires alignment to an address that is a multiple of $k_i$, then $extent(Datatype)$ is further rounded up to the nearest multiple of $\max_i k_i$. The formal definitions given for the various datatype constructors apply now, with the amended definition of {\bf extent}.
A datatype object has to be {\bf committed} before it can be used in a communication. A committed datatype object cannot be used to build new derived datatypes. The system may use a different internal representation for committed datatype objects so as to facilitate communication, e.g., change from a compacted representation to a flat representation of the datatype, and select the most convenient transfer mechanism.
\begin{funcdef}{MPI\_TYPE\_COMMIT(intype, outtype)}
\funcarg{\IN}{intype}{handle to datatype object to be committed}
\funcarg{\OUT}{outtype}{handle to committed datatype object}
\end{funcdef}
There is no need to commit basic datatypes; they are ``pre-committed''. They are the exception to the rule that committed datatypes cannot be used to form derived datatypes.
\discuss{ 1. Do we really need a commit here? 2. Do we want one \INOUT argument, or two distinct arguments? 3. Are committed and uncommitted objects of the same type? (The answer must be yes, if it is an \INOUT argument.) 4.
If we have only one \INOUT argument, does a commit change the datatype object, or does it create a new object and return a pointer to it in \mpiarg{ datatype}? I.e., if we have another pointer to the same uncommitted object, can we continue to use it?
We prohibit the use of committed datatypes in the construction of new datatypes in order to allow the commit function to change a datatype representation into one suitable for communication, and perhaps unsuitable for recursive datatype construction. This restriction could be relaxed by assuming that the ``uncommitted'' datatype representation is kept alongside the ``committed'' one. With such an implementation \func{ MPI\_TYPE\_COMMIT} could have only one argument, and only one type of datatype object is needed. Such a representation could also simplify the implementation of a datatype ``flatten'' function (e.g., for debugging). }
\begin{funcdef}{MPI\_TYPE\_FREE(datatype)}
\funcarg{\INOUT}{datatype}{handle to datatype object}
\end{funcdef}
Marks the datatype object associated with \mpiarg{ datatype} for deallocation. The object will be deallocated after any pending communication that uses this object completes, at which point \mpiarg{ datatype} becomes null.
\discuss { If we have different datatypes for committed and uncommitted, then we have 2 free functions (or never free uncommitted datatypes). }
\subsection{Use of general datatypes in communication}
\label{subsec:pt2pt-datatypeuse}
Handles to derived datatypes can be passed to a communication call wherever a datatype argument is required. A call of the form \mpifunc{MPI\_SEND(start, count, datatype, ...)}, where ${\tt count} =1$, uses the send buffer defined by \mpiarg{ start} and \mpiarg{ datatype}; i.e., if \mpiarg{ datatype} is a handle to the datatype defined by the sequence $\{ (type_0 , disp_0) , ... , (type_{n-1}, disp_{n-1} ) \}$ (with empty entries deleted), then the send buffer consists of the entries of types $type_i$, at addresses ${\tt start} + disp_i$, $i=0, ..., n-1$. The message generated consists of a sequence of $n$ entries, of types $type_0 , ... , type_{n-1}$.
The receive buffer for a call \mpifunc{MPI\_RECV( start, count, datatype , ...)} is similarly defined, for {\tt count =1}. The same applies to all other send and receive operations that have a \mpiarg{ datatype} parameter. A datatype used in a receive operation may not have overlapping entries; otherwise the outcome of the receive operation is undefined.
If a send operation using a general datatype \type{ type1} is matched by a receive operation using a general datatype \type{ type2}, then the sequence of (nonempty) basic type entries defined by each datatype should be identical. Furthermore, each entry in the send buffer should be of a type that matches the type of the corresponding entry in \type{ type1}, and each entry in the receive buffer should be of a type that matches the type of the corresponding entry in \type{ type2} (with type matching defined as in Section~\ref{sec:pt2pt-typematch}). Thus, type matching is defined by the ``flat'' structure of a datatype, and does not depend on its defining expression.
Example:
\begin{verbatim}
...
MPI_TYPE_CONTIGUOUS( 2, MPI_REAL, type2)
MPI_TYPE_CONTIGUOUS( 4, MPI_REAL, type4)
MPI_TYPE_CONTIGUOUS( 2, type2, type22)
\end{verbatim}
A message sent with datatype \type{ type4} can be received with datatype \type{ type22}.
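To illustrate the flat matching rule with the datatypes just defined, the following sketch has the sender describe four reals with \type{ type4} while the receiver describes the same four reals with \type{ type22}. The Fortran bindings with a trailing {\tt ierr}, a blocking receive with argument list {\tt (start, count, datatype, source, tag, comm, status)}, and the commit step described in the previous subsection are assumed; the committed handles {\tt stype4} and {\tt stype22} are introduced here only for the illustration.
\begin{verbatim}
C process 0 sends four reals described by type4
CALL MPI_TYPE_COMMIT(type4, stype4, ierr)
CALL MPI_SEND(a, 1, stype4, 1, tag, comm, ierr)
C process 1 receives the same four reals, described by type22
CALL MPI_TYPE_COMMIT(type22, stype22, ierr)
CALL MPI_RECV(b, 1, stype22, 0, tag, comm, status, ierr)
\end{verbatim}
Since the two type signatures are identical (four reals), the communication is legal.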
Type matching requirements imply that (nonempty) entries of different types in a datatype cannot overlap, unless they coincide and have the same basic type, or one of them has basic type \type{ byte}.
The length of a message received (i.e., the value returned when the receive status object is decoded by the function \mpifunc{MPI\_GET\_COUNT}) is equal to the number of basic entries received. Consider again the datatypes defined in the last example. If a message is sent using either datatype \type{ type22} or \type{ type4} and is received using either of these two datatypes, then the length of the message received will be four. In the general case the basic entries need not all be of the same type.
\discuss{ An alternative definition that finds favor with many members of the subcommittee is that the length of the message received is measured in terms of subunits of the receiving datatype. Thus, if four reals are received using \type{ type4}, the length of the received message is four; if they are received using \type{ type22}, the length of the received message is two. With such a definition, it is required that the arriving message fill an integer number of subunits. The rationale is that these are the units that are meaningful to the programmer. Consider the case of an array of structures that is passed in a message (example 4.1 below). Then one might want to enforce the restriction that only complete structures are passed; and it is natural to return in the receive status the number of structures received, rather than this number multiplied by 14, the number of entries per structure. Consider, on the other hand, the case of a sequence of reals that is received in a 3D array section (defined as in the first example below). We may want to receive an arbitrary number of reals, and the 1D and 2D subsections may not carry any significance for the purpose of this communication. Opinions? }
A call of the form \mpifunc{MPI\_SEND(start, count, datatype , ...)}, where ${\tt count} > 1$, is interpreted as if the call was passed a new datatype which is the concatenation of \mpiarg{ count} copies of \mpiarg{ datatype}, such as would be produced by a call to \mpifunc{MPI\_TYPE\_CONTIGUOUS(count, datatype, newtype)}, and a \mpiarg{ count} argument of one. The same holds true for the other communication functions that have a datatype argument.
\subsection{Examples}
\label{subsec:pt2pt-examples}
The following examples illustrate the use of derived datatypes.
First example: extract a section of a 3D matrix.
\begin{verbatim}
REAL a(100,100,100), e(9,9,9)
INTEGER oneslice, twoslice, threeslice, finaltype, sizeofreal
C extract the section a(1:17:2, 3:11, 2:10)
C and store it in e(*,*,*).
CALL MPI_TYPE_EXTENT( MPI_REAL, sizeofreal, ierr)
C create datatype for a 1D section
CALL MPI_TYPE_VECTOR( 9, MPI_REAL, 2, 1, oneslice, ierr)
C create datatype for a 2D section
CALL MPI_TYPE_HVECTOR( 9, oneslice, 100*sizeofreal, 1, twoslice, ierr)
C create datatype for the entire section
CALL MPI_TYPE_HVECTOR( 9, twoslice, 100*100*sizeofreal, 1, threeslice, ierr)
CALL MPI_TYPE_COMMIT( threeslice, finaltype, ierr)
CALL MPI_SENDRECV(a(1,3,2), 1, finaltype, MPI_ME, e, 9*9*9, MPI_REAL, MPI_ME, 0, MPI_ALL, status, ierr)
\end{verbatim}
Second example: Copy the (strictly) lower triangular part of a matrix.
\begin{verbatim}
REAL a(100,100), b(100,100)
INTEGER index(100), blocklen(100), ltype, finaltype
C copy lower triangular part of array a
C onto lower triangular part of array b
C compute start and size of each column
DO i=1, 100
  index(i) = 100*(i-1) + i
  blocklen(i) = 100-i
END DO
C create datatype for lower triangular part
CALL MPI_TYPE_INDEXED( 100, MPI_REAL, index, blocklen, ltype, ierr)
CALL MPI_TYPE_COMMIT(ltype, finaltype, ierr)
CALL MPI_SENDRECV( a, 1, finaltype, MPI_ME, b, 1, finaltype, MPI_ME, 0, MPI_ALL, status, ierr)
\end{verbatim}
Third example: Transpose a matrix.
\begin{verbatim}
REAL a(100,100), b(100,100)
INTEGER row, xpose, finaltype, sizeofreal
C transpose matrix a onto b
CALL MPI_TYPE_EXTENT( MPI_REAL, sizeofreal, ierr)
C create datatype for one row
CALL MPI_TYPE_VECTOR( 100, MPI_REAL, 100, 1, row, ierr)
C create datatype for matrix in row-major order
CALL MPI_TYPE_HVECTOR( 100, row, sizeofreal, 1, xpose, ierr)
CALL MPI_TYPE_COMMIT( xpose, finaltype, ierr)
C send matrix in row-major order and receive in column major order
CALL MPI_SENDRECV( a, 1, finaltype, MPI_ME, b, 100*100, MPI_REAL, MPI_ME, 0, MPI_ALL, status, ierr)
\end{verbatim}
Another approach to the transpose problem:
\begin{verbatim}
REAL a(100,100), b(100,100)
INTEGER index(2), blen(2), type(2), row, row1, finaltype, sizeofreal
C transpose matrix a onto b
CALL MPI_TYPE_EXTENT( MPI_REAL, sizeofreal, ierr)
C create datatype for one row
CALL MPI_TYPE_VECTOR( 100, MPI_REAL, 100, 1, row, ierr)
C create datatype for one row, with the extent of one real number
index(1) = 0
index(2) = sizeofreal
type(1) = row
type(2) = MPI_UB
blen(1) = 1
blen(2) = 1
CALL MPI_TYPE_STRUCT( 2, type, index, blen, row1, ierr)
CALL MPI_TYPE_COMMIT( row1, finaltype, ierr)
C send 100 rows and receive in column major order
CALL MPI_SENDRECV( a, 100, finaltype, MPI_ME, b, 100*100, MPI_REAL, MPI_ME, 0, MPI_ALL, status, ierr)
\end{verbatim}
Fourth example: manipulate an array of structures.
\begin{verbatim}
struct Partstruct
   {
   int    index;   /* particle type */
   double d[6];    /* particle coordinates */
   char   b[7];    /* some additional information */
   };

struct Partstruct particle[1000];

/* build datatype describing first array entry */

MPI_datatype Particletype;
MPI_datatype stype[3] = {MPI_int, MPI_double, MPI_char};
int          sblock[3] = {1, 6, 7};
int          sindex[3];

MPI_address( (void *)particle, sindex);
MPI_address( (void *)particle[0].d, sindex+1);
MPI_address( (void *)particle[0].b, sindex+2);
MPI_type_struct( 3, stype, sindex, sblock, Particletype);

/* Particletype describes first array entry -- using absolute addresses */

/* 4.1: send the entire array */

MPI_datatype Type1;

MPI_type_commit( Particletype, Type1);
MPI_send( MPI_bottom, 1000, Type1, dest, tag, comm);

/* 4.2: send the entries with index value zero,
        preceded by the number of such entries */

MPI_datatype Zparticles;  /* datatype describing all particles
                             with index zero (needs to be recomputed
                             if indices change) */
MPI_datatype Ztype, Type2;
int zindex[1000], zblock[1000], i, j, k;
int zzblock[2] = {1,1};
MPI_datatype zztype[2];
int zzindex[2];

/* compute indices of particles with index zero */
j = 0;
for(i=0; i < 1000; i++)
   if (particle[i].index==0)
      {
      zindex[j] = i;
      zblock[j] = 1;
      j++;
      }

/* create datatype for particles with index zero */
MPI_type_indexed( j, Particletype, zindex, zblock, Zparticles);

/* prepend particle count */
MPI_address((void *)&j, zzindex);
zzindex[1] = MPI_bottom;
zztype[0] = MPI_int;
zztype[1] = Zparticles;
MPI_type_struct(2, zztype, zzindex, zzblock, Ztype);

MPI_type_commit( Ztype, Type2);
MPI_send( MPI_bottom, 1, Type2, dest, tag, comm);

/* A possibly more efficient way of defining Zparticles:
   consecutive particles with index zero are handled as one block */

j=0;
for (i=0; i < 1000; i++)
   if (particle[i].index==0)
      {
      for (k=i+1; (k < 1000)&&(particle[k].index == 0) ; k++);
      zindex[j] = i;
      zblock[j] = k-i;
      j++;
      i = k;
      }
MPI_type_indexed( j, Particletype, zindex, zblock, Zparticles);

/* 4.3: send the first two coordinates of all entries */

MPI_datatype Allpairs;  /* datatype for all pairs of coordinates */
MPI_datatype Type3;
int sizeofentry;

MPI_type_extent( Particletype, sizeofentry);
/* sizeofentry can also be computed by subtracting the address
   of particle[0] from the address of particle[1] */

MPI_type_hvector( 1000, MPI_double, sizeofentry, 2, Allpairs);
MPI_type_commit( Allpairs, Type3);
MPI_send( particle[0].d, 1, Type3, dest, tag, comm);

/* an alternative solution to 4.3 */

MPI_datatype Onepair;  /* datatype for one pair of coordinates,
                          with the extent of one particle entry */
MPI_datatype Type4;
int indexp[3];
MPI_datatype typep[3] = {MPI_lb, MPI_double, MPI_ub};
int blockp[3] = {1, 2, 1};

MPI_address( (void *)particle, indexp);
MPI_address( (void *)particle[0].d, indexp+1);
MPI_address( (void *)(particle+1), indexp+2);
MPI_type_struct( 3, typep, indexp, blockp, Onepair);
MPI_type_commit( Onepair, Type4);
MPI_send( MPI_bottom, 1000, Type4, dest, tag, comm);
\end{verbatim}
\discuss{ The examples carry two implicit assumptions on the way padding is done in an array of structures. First, that padding is done in the same way in the successive array entries, so that all entries have an identical memory layout. E.g., we assume that the displacement from {\tt particle[i].b} to {\tt particle[j].b} is the same as the displacement from {\tt particle[i].d} to {\tt particle[j].d}. Second, that the amount of padding added is, in some sense, minimal.
I.e., if the most stringent alignment requirement for a basic component of a structure is alignment to multiples of $2^k$, then no contiguous padding space has size greater than or equal to $2^k$. This assumption is implicit in the definition of {\bf extent}. I don't think either assumption is mandated by the C or Fortran standards, although I would be very surprised to see an implementation that does not fulfill these conditions. If I am wrong (e.g., if some compilers always align structures to double word boundaries, even if they have no double word components), then the definition of \mpiarg{ extent} should be revisited. }
\subsection{Correct use of addresses}
\label{subsec:pt2pt-segmented}
Successively declared variables in C or Fortran are not necessarily stored at contiguous locations. Thus, care must be exercised that displacements do not cross from one variable to another. Also, in machines with a segmented address space, addresses are not unique and address arithmetic has some peculiar properties. Thus, the use of {\bf addresses}, i.e., displacements relative to the start address \const{ MPI\_BOTTOM}, has to be restricted.
Variables belong to the same {\bf sequential storage} if they belong to the same array, to the same {\tt COMMON} block in Fortran, or to the same structure in C. Valid addresses are defined recursively as follows:
\begin{enumerate}
\item The function \mpifunc{MPI\_ADDRESS} returns a valid address, when passed as argument a host program variable.
\item The \mpiarg{ start} parameter of a communication function evaluates to a valid address, when passed as argument a host program variable.
\item If \mpiarg{ v} is a valid address, and \mpiarg{ i} is an integer, then \mpiarg{ v+i} is a valid address, provided \mpiarg{ v} and \mpiarg{ v+i} are in the same sequential storage.
\item If \mpiarg{ v} is a valid address then \const{ MPI\_BOTTOM + v} is a valid address.
\end{enumerate}
A correct program uses only valid addresses to identify the locations of entries in communication buffers. Furthermore, if \mpiarg{ u} and \mpiarg{ v} are two valid addresses, then the (integer) difference \mpiarg{ u - v} can be computed only if both \mpiarg{ u} and \mpiarg{ v} are in the same sequential storage; no other arithmetic operations can be meaningfully executed on addresses.
We say that a datatype is {\bf absolute} if all displacements within this datatype are valid (absolute) addresses; it is {\bf relative} otherwise. A correct program obeys the following constraints in the use of datatypes:
\begin{itemize}
\item If the \mpiarg{ oldtype} argument used in \func{ MPI\_TYPE\_CONTIGUOUS, MPI\_TYPE\_VECTOR, MPI\_TYPE\_HVECTOR, MPI\_TYPE\_INDEXED} or \func{ MPI\_TYPE\_HINDEXED} is absolute, then all addresses within it must be to variables contained within the same array or structure; the result datatype is also absolute, and all computed addresses must also fall within the same array or structure.
\item If all entries of the \mpiarg{ array\_of\_indices} argument of \mpifunc{MPI\_TYPE\_STRUCT} are absolute addresses (computed by \mpifunc{MPI\_ADDRESS}), then the result datatype is also absolute. Each new address computed from an old address must fall within the same array or record as the old address. (Addresses in the old type need not all be within the same array or record.)
\item If a communication call is invoked with parameters \mpiarg{ start} and \mpiarg{ datatype}, then either \mpiarg{ start} = \const{ MPI\_BOTTOM} and \mpiarg{ datatype} is a handle to an absolute datatype, or \mpiarg{ start} is set to the address of a program variable, and all displacements in \mpiarg{ datatype}, relative to this base, yield addresses that are within the same array or record as this variable.
\end{itemize}
In summary, the type constructors \func{ MPI\_TYPE\_CONTIGUOUS, MPI\_TYPE\_VECTOR, MPI\_TYPE\_HVECTOR, MPI\_TYPE\_INDEXED} and \func{ MPI\_TYPE\_HINDEXED} can be recursively applied to build datatypes that combine variables that belong to the same sequential storage; variables that do not belong to the same sequential storage can be combined using one application of \mpifunc{MPI\_TYPE\_STRUCT}.
\discuss{ It is not expected that MPI implementations will be able to detect erroneous, ``out of bound'' displacements -- unless those overflow the user address space -- since the MPI call may not know the extent of the arrays and records in the host program. }
\section{Universal communication functions}
Although the number of communication functions defined in this chapter is fairly large, it is possible to construct all of them from a much smaller number of primitive functions. The construction is outlined in this section. It can be used to build a portable implementation of MPI, although it is expected that more efficient implementations will implement directly most or all of the functions in this chapter. Also, this construction can be used in order to derive the behavior of MPI point to point communication from the behavior of a small subset of functions.
We assume the availability of the primitive functions listed at the end of the section, and the ability to allocate memory for new variables. We assume that the function \mpifunc{MPI\_TYPE\_COMMIT} does not change the representation of a datatype, but merely creates a new copy of it.
Communication with a \mpiarg{ count} argument that is $> 1$ can be reduced to communication with \mpiarg{ count =1}, by replicating the \mpiarg{ datatype} argument \mpiarg{ count} times. Communication with a \mpiarg{ start} argument that is not \const{ MPI\_BOTTOM} can be reduced to communication with \mpiarg{ start} = \const{ MPI\_BOTTOM} by creating a suitably displaced new datatype. Thus, \mpifunc{MPI\_SEND( start, count, datatype, dest, tag, comm)} is
\begin{verbatim}
Type[1] = datatype
MPI_ADDRESS( start, Index[1])
Block[1] = count
MPI_TYPE_STRUCT( 1, Type, Index, Block, Newtype)
MPI_TYPE_COMMIT( Newtype, datatype1)
MPI_SEND( MPI_BOTTOM, 1, datatype1, dest, tag, comm)
\end{verbatim}
The same construction applies to all other communication functions with arguments \mpiarg{ count} and \mpiarg{ datatype}. We shall henceforth restrict ourselves to communication that involves only one element with an absolute address (\mpiarg{ count =1} and \mpiarg{ start} = \const{ MPI\_BOTTOM}). (If the operation \mpifunc{MPI\_TYPE\_COMMIT} is not assumed to be trivial, then the above construction cannot be used; this adds one more primitive function, namely \mpifunc{MPI\_TYPE\_COMMIT}, to our list, and introduces small changes in the construction.)
It is convenient to make explicit the {\bf persistence} attribute of communication objects. A \mpiarg{ persistent} communication object needs to be explicitly deallocated by a \func{ MPI\_COMM\_FREE} operation.
\discuss{
It is not expected that MPI implementations will be able to detect erroneous, ``out of bound'' displacements -- unless those overflow the user address space -- since the MPI call may not know the extent of the arrays and records in the host program.
}

\section{Universal communication functions}

Although the number of communication functions defined in this chapter is fairly large, it is possible to construct all of them from a much smaller number of primitive functions. This construction is outlined in this section. It can be used to build a portable implementation of MPI, although it is expected that more efficient implementations will implement most or all of these functions directly. The construction can also be used to derive the behavior of MPI point to point communication from the behavior of a small subset of functions.

We assume the availability of the primitive functions listed at the end of the section, and the ability to allocate memory for new variables. We assume that the function \mpifunc{MPI\_TYPE\_COMMIT} does not change the representation of a datatype, but merely creates a new copy of it.

Communication with a \mpiarg{ count} argument that is $> 1$ can be reduced to communication with \mpiarg{ count} $= 1$, by replicating the \mpiarg{ datatype} argument \mpiarg{ count} times. Communication with a \mpiarg{ start} argument that is not \const{ MPI\_BOTTOM} can be reduced to communication with \mpiarg{ start} = \const{ MPI\_BOTTOM} by creating a suitably displaced new datatype. Thus:
\mpifunc{MPI\_SEND( start, count, datatype, dest, tag, comm)} is
\begin{verbatim}
Type[1] = datatype
MPI_ADDRESS( start, Index[1]);
Block[1] = count;
MPI_TYPE_STRUCT( 1, Type, Index, Block, Newtype)
MPI_TYPE_COMMIT( Newtype, datatype1)
MPI_SEND( MPI_BOTTOM, 1, datatype1, dest, tag, comm)
\end{verbatim}
The same construction applies to all other communication functions with arguments \mpiarg{ count} and \mpiarg{ datatype}. We shall henceforth restrict ourselves to communication that involves only one element with an absolute address (\mpiarg{ count} $= 1$ and \mpiarg{ start} = \const{ MPI\_BOTTOM}). (If the operation \mpifunc{MPI\_TYPE\_COMMIT} is not assumed to be trivial, then the above construction cannot be used; this adds one more primitive function, namely \mpifunc{MPI\_TYPE\_COMMIT}, to our list, and introduces small changes in the construction.)

It is convenient to make explicit the {\bf persistence} attribute of communication objects. A {\bf persistent} communication object must be explicitly deallocated by a \func{ MPI\_COMM\_FREE} operation. On the other hand, an {\bf ephemeral} communication object is used for a single communication only; it is deallocated by the system when the communication it is used for completes.

The function \mpifunc{MPI\_COMM\_INIT} is a universal function for the creation of communication objects:
\begin{funcdef}{MPI\_COMM\_INIT(handle, datatype, source-dest, tag, comm, op-mode, persistence)}
\funcarg{\OUT}{handle}{handle to newly created communication object}
\funcarg{\IN}{datatype}{datatype of element sent or received}
\funcarg{\IN}{source-dest}{rank of destination or source (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\funcarg{\IN}{op-mode}{one of \const{ MPI\_STANDARD}, \const{ MPI\_READY}, \const{ MPI\_SYNCHRONOUS} or \const{ MPI\_RECV}}
\funcarg{\IN}{persistence}{one of \const{ MPI\_PERSISTENT} or \const{ MPI\_EPHEMERAL}}
\end{funcdef}
The communication described by such an object involves only one element, with an absolute address.

\subsection{Persistent communication objects}

\mpifunc{MPI\_CREATE\_SEND(handle, MPI\_BOTTOM, 1, datatype, dest, tag, comm)} is
\begin{verbatim}
MPI_COMM_INIT(handle, datatype, dest, tag, comm, MPI_STANDARD, MPI_PERSISTENT)
\end{verbatim}
The functions \mpifunc{MPI\_CREATE\_RSEND}, \mpifunc{MPI\_CREATE\_SSEND} and \mpifunc{MPI\_CREATE\_RECV} are dealt with in a similar manner. The functions \mpifunc{MPI\_START} and \mpifunc{MPI\_COMM\_FREE} are primitive.
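To illustrate the intended use of a persistent object, a send object can be created once, started repeatedly, and freed at the end. The sketch below is informal: the loop notation is schematic, and \mpifunc{MPI\_COMM\_FREE} is assumed to take the object handle as its argument.
\begin{verbatim}
MPI_CREATE_SEND( handle, MPI_BOTTOM, 1, datatype, dest, tag, comm)
for each communication step:
    MPI_START( handle)                  /* initiate the send */
    MPI_WAIT( handle, dontcarestatus)   /* complete it; the object remains allocated */
MPI_COMM_FREE( handle)                  /* persistent objects are freed explicitly */
\end{verbatim}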
\subsection{Nonblocking communication initiation}

\mpifunc{MPI\_ISEND(handle, MPI\_BOTTOM, 1, datatype, dest, tag, comm)} is
\begin{verbatim}
MPI_COMM_INIT(handle, datatype, dest, tag, comm, MPI_STANDARD, MPI_EPHEMERAL)
MPI_START(handle)
\end{verbatim}
The functions \mpifunc{MPI\_IRSEND}, \mpifunc{MPI\_ISSEND} and \mpifunc{MPI\_IRECV} are handled in a similar manner.

\subsection{Communication completion}

The two primitive completion operations are \mpifunc{MPI\_WAITANY} and \mpifunc{MPI\_TESTALL}. The function \mpifunc{MPI\_WAIT} can be implemented by a call to \mpifunc{MPI\_WAITANY} with an \mpiarg{ array-of-handles} argument of length one. The function \mpifunc{MPI\_WAITALL} can be implemented as a loop where an \func{ MPI\_WAIT} call is executed for each successive handle in the \mpiarg{ array-of-handles}. The function \mpifunc{MPI\_TEST} can be implemented by a call to \mpifunc{MPI\_TESTALL} with an \mpiarg{ array-of-handles} argument of length one. The function \mpifunc{MPI\_TESTANY} can be implemented as a loop where an \func{ MPI\_TEST} call is executed for each successive handle in the \mpiarg{ array-of-handles}.

\subsection{Blocking communication}

\mpifunc{MPI\_SEND(MPI\_BOTTOM, 1, datatype, dest, tag, comm)} is
\begin{verbatim}
MPI_COMM_INIT(handle, datatype, dest, tag, comm, MPI_STANDARD, MPI_EPHEMERAL)
MPI_START(handle)
MPI_WAIT(handle, dontcarestatus)
\end{verbatim}
The functions \mpifunc{MPI\_RSEND} and \mpifunc{MPI\_SSEND} are handled in a similar manner.
\mpifunc{MPI\_RECV(MPI\_BOTTOM, 1, datatype, source, tag, comm, status)} is
\begin{verbatim}
MPI_COMM_INIT(handle, datatype, source, tag, comm, MPI_RECV, MPI_EPHEMERAL)
MPI_START(handle)
MPI_WAIT(handle, status)
\end{verbatim}

\subsection{Probe and cancel}

The functions \mpifunc{MPI\_PROBE} and \mpifunc{MPI\_CANCEL} are primitive.

\subsection{Return status}

The functions \mpifunc{MPI\_GET\_SOURCE}, \mpifunc{MPI\_GET\_TAG}, \mpifunc{MPI\_GET\_LEN}, \mpifunc{MPI\_PROBE\_LEN}, and \mpifunc{MPI\_IS\_CANCELLED} are primitive. These functions are simple macros that access fields of a structure.

\subsection{Send-receive and exchange}

\mpifunc{MPI\_SENDRECV( send\_start, send\_count, send\_type, dest, recv\_start, recv\_count, recv\_type, source, tag, comm, status)} is
\begin{verbatim}
MPI_ISEND(handle[0], send_start, send_count, send_type, dest, tag, comm)
MPI_IRECV(handle[1], recv_start, recv_count, recv_type, source, tag, comm)
MPI_WAITALL(2, handle, status_array)
status = status_array[1]
\end{verbatim}
The nonblocking sends and receives can be replaced by their primitive implementations. The function \mpifunc{MPI\_EXCHANGE} can be handled in a similar manner; a temporary buffer must be allocated to replicate the send and receive buffer. A size for this buffer can be computed using the function \mpifunc{MPI\_TYPE\_EXTENT}.

\subsection{Derived datatypes}

We have outlined in Section~\ref{subsec:pt2pt-datatypeconst} how each datatype constructor can be expressed in terms of the next one. Thus, all datatype constructors can be expressed in terms of the constructor \mpifunc{MPI\_TYPE\_STRUCT}; the functions \mpifunc{MPI\_TYPE\_FREE}, \mpifunc{MPI\_TYPE\_EXTENT} and \mpifunc{MPI\_ADDRESS} are primitive.

\subsection{Summary}

In order to implement MPI point-to-point communication one needs 16 functions; about half of those are trivial macros:
\begin{enumerate}
\item \mpifunc{MPI\_COMM\_INIT}
\item \mpifunc{MPI\_START}
\item \mpifunc{MPI\_WAITANY}
\item \mpifunc{MPI\_TESTALL}
\item \mpifunc{MPI\_PROBE}
\item \mpifunc{MPI\_CANCEL}
\item \mpifunc{MPI\_GET\_SOURCE}
\item \mpifunc{MPI\_GET\_TAG}
\item \mpifunc{MPI\_GET\_LEN}
\item \mpifunc{MPI\_PROBE\_LEN}
\item \mpifunc{MPI\_IS\_CANCELLED}
\item \mpifunc{MPI\_TYPE\_STRUCT}
\item \mpifunc{MPI\_TYPE\_EXTENT}
\item \mpifunc{MPI\_ADDRESS}
\item \mpifunc{MPI\_TYPE\_FREE}
\item \mpifunc{MPI\_COMM\_FREE}
\end{enumerate}
On the other hand, we expect most implementations to use more primitives, in order to reduce communication overheads, reduce buffering, and improve error reporting.
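As a closing illustration of the datatype reduction mentioned above, a constructor such as \mpifunc{MPI\_TYPE\_CONTIGUOUS} can be expressed by a single call to \mpifunc{MPI\_TYPE\_STRUCT}. The sketch below is illustrative only; it assumes the argument order \mpifunc{MPI\_TYPE\_CONTIGUOUS( count, oldtype, newtype)} and uses the same calling style as the constructions earlier in this section.
\begin{verbatim}
Type[1]  = oldtype
Index[1] = 0          /* relative displacement from the start of the new type */
Block[1] = count      /* count contiguous copies of oldtype */
MPI_TYPE_STRUCT( 1, Type, Index, Block, newtype)
\end{verbatim}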