%Version of August 3, 1993
%Minor edits by S. Otto, August 7, 1993
\chapter{Point to Point Communication}
\label{sec:pt2pt}
\label{chap:pt2pt}

\section{Introduction}
\label{sec:pt2pt-intro}

This chapter is a draft of the current proposal for point-to-point communication. Sending and receiving of messages by processes is the basic MPI communication mechanism. All other communication functions can be implemented on top of this basic communication layer (although a more direct implementation may lead to greater efficiency). The basic point-to-point communication operations are {\bf send} and {\bf receive}.

A {\bf send} operation creates and sends a message. The operation specifies a {\bf send buffer} in the sender memory from which the message data is taken. In addition, the send operation associates an {\bf envelope} with the message. This envelope specifies the message destination and contains distinguishing information that can be used by the {\bf receive} operation to select a particular message. A {\bf receive} operation consumes a message. The message to be received is selected according to the values in its envelope, and the message data is put into the {\bf receive buffer}.

The next sections describe the basic (blocking) send and receive operations. We discuss send, receive, basic communication semantics, type matching requirements, type conversion in heterogeneous environments, and more general communication modes. Nonblocking communication is addressed next, followed by channel-like constructs and send-receive operations. We then consider general datatypes that allow one to transfer heterogeneous and noncontiguous data, and conclude with a description of an implementation of MPI point-to-point communication using a small number of primitives.

\section{Basic send operation}
\label{sec:pt2pt-basicsend}

The syntax of the simplest send operation is given below.

\begin{funcdef}{MPI\_SEND (start, count, datatype, dest, tag, comm)}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements in send buffer (nonnegative integer)}
\funcarg{\IN}{datatype}{datatype of each send buffer element}
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

\subsection{Message data}
\label{subsec:pt2pt-messagedata}

The send buffer specified by the \func{MPI\_SEND} operation consists of \mpiarg{count} successive entries of the type indicated by \mpiarg{datatype}, starting with the entry at address \mpiarg{start}. Note that we specify the message length in terms of number of {\em elements}, not number of {\em bytes}. The former is machine independent and closer to the application level. The data part of the message consists of a sequence of \mpiarg{count} values, each of the type indicated by \mpiarg{datatype}. \mpiarg{count} may be zero, in which case the data part of the message is empty. The basic datatypes that can be specified for message data values correspond to the basic datatypes of the host language.
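For illustration (the declaration and the values of \mpiarg{tag} and \mpiarg{comm} are ours, not part of the proposal), a Fortran fragment that sends the ten real values {\tt a(1), ..., a(10)} to the process with rank 1 could look as follows; note that \mpiarg{count} is 10 elements, not 40 bytes. As in the other Fortran examples of this chapter, the final {\tt ierr} argument returns the error code.
\begin{verbatim}
REAL a(10)
...
CALL MPI_SEND(a(1), 10, MPI_REAL, 1, tag, comm, ierr)
\end{verbatim}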
The possible values of the \mpiarg{datatype} parameter for Fortran and the corresponding Fortran types are listed below.

\vspace{5ex}
\begin{center}
\begin{tabular}{|l|l|}
\hline
MPI datatype & Fortran datatype \\
\hline
\type{MPI\_INTEGER} & \ftype{INTEGER} \\
\type{MPI\_REAL} & \ftype{REAL} \\
\type{MPI\_DOUBLE} & \ftype{DOUBLE PRECISION} \\
\type{MPI\_COMPLEX} & \ftype{COMPLEX} \\
\type{MPI\_DOUBLE\_COMPLEX} & \ftype{DOUBLE COMPLEX} \\
\type{MPI\_LOGICAL} & \ftype{LOGICAL} \\
\type{MPI\_CHARACTER} & \ftype{CHARACTER} \\
\type{MPI\_BYTE} & \\
\hline
\end{tabular}
\end{center}
\vspace{5ex}

The possible values of this parameter for C and the corresponding C types are listed below.

\vspace{5ex}
\begin{center}
\begin{tabular}{|l|l|}
\hline
MPI datatype & C datatype \\
\hline
\type{MPI\_SHORT} & \ctype{short} \\
\type{MPI\_INT} & \ctype{int} \\
\type{MPI\_LONG} & \ctype{long} \\
\type{MPI\_UNSIGNED} & \ctype{unsigned} \\
\type{MPI\_FLOAT} & \ctype{float} \\
\type{MPI\_DOUBLE} & \ctype{double} \\
\type{MPI\_CHAR} & \ctype{char} \\
\type{MPI\_BYTE} & \\
\hline
\end{tabular}
\end{center}
\vspace{5ex}

The datatype \type{MPI\_BYTE} does not correspond to a Fortran or C datatype. A value of type \type{MPI\_BYTE} consists of 8 binary digits. A byte is uninterpreted and is different from a character. Different machines may have different representations for characters, or may use more than one byte to represent characters. On the other hand, a byte has the same binary value on all machines.
\discuss{
Need to decide whether we want (for C) \ctype{unsigned char}, \ctype{unsigned short}, \ctype{unsigned long}, \ctype{long double}.

\ftype{DOUBLE COMPLEX} may not be standard Fortran 77.

\type{MPI\_DOUBLE} is now used both for Fortran and C. Should we use different names?
}

\subsection{Message envelope}
\label{subsec:pt2pt-envelope}

In addition to the data part, messages contain information that can be used to distinguish messages and selectively receive them. This information is contained in a fixed number of fixed-format fields, which we collectively call the {\bf message envelope}. These fields are
\begin{description}
\item[source]
\item[destination]
\item[tag]
\item[context]
\end{description}
The message source is implicitly determined by the identity of the message sender. The other fields are specified by parameters in the send operation.

The integer-valued message tag is specified by the \mpiarg{tag} parameter. This integer can be used by the program to distinguish different types of messages. The range of valid tag values is implementation dependent and can be queried using the \mpifunc{TagRange} environmental inquiry function, as described in Chapter~\ref{chap:inquiry}.

The context of the message sent is specified by the \mpiarg{comm} parameter. The message carries the context associated with this communicator (see Chapter~\ref{chap:context}). The message destination is specified by the \mpiarg{dest} parameter as a rank within the process group associated with that same communicator. The range of valid values is {\tt 0, ... , n-1}, where {\tt n} is the number of processes in this group. Thus, point-to-point communications do not use absolute addresses, but only relative ranks within a group. This provides important modularity.

The message envelope would normally be encoded by a fixed-length message header.
However, the actual mechanism used to associate an envelope with a message is implementation dependent: some of the information (e.g., source or destination) may be implicit, and need not be explicitly carried by messages; processes may be identified by relative ranks, or absolute ids; etc.

\section{Basic receive operation}
\label{sec:pt2pt-basicreceive}

The syntax of the simplest receive operation is given below.

\begin{funcdef}{MPI\_RECV (start, count, datatype, source, tag, comm, status)}
\funcarg{\OUT}{start}{initial address of receive buffer (choice)}
\funcarg{\IN}{count}{number of elements in receive buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each receive buffer element (state)}
\funcarg{\IN}{source}{rank of source (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\funcarg{\OUT}{status}{status object}
\end{funcdef}

The receive buffer consists of the storage containing \mpiarg{count} consecutive elements of the type specified by \mpiarg{datatype}, starting at address \mpiarg{start}. The length of the received message must be less than or equal to the length of the receive buffer. I.e., all incoming data must fit, without truncation, into the receive buffer. The \mpifunc{MPI\_PROBE} function described in Section~\ref{sec:pt2pt-probe} can be used when receiving messages of unknown length.

The selection of a message by a receive operation is done solely according to the values in the message envelope. The receive operation specifies an {\bf envelope pattern}; a message can be received by that receive operation only if its envelope matches that pattern. A pattern specifies values for the \mpiarg{source}, \mpiarg{tag} and \mpiarg{context} fields of the message envelope. The receiver may specify a wildcard \const{MPI\_ANY\_SOURCE} value for \mpiarg{source}, and/or a wildcard \const{MPI\_ANY\_TAG} value for \mpiarg{tag}, indicating that any source and/or tag are acceptable. It cannot specify a wildcard value for \mpiarg{context}. Thus, a message can be received by a receive operation only if it is addressed to the receiving process, has a matching context, has a matching source unless source=\const{MPI\_ANY\_SOURCE} in the pattern, and has a matching tag unless tag=\const{MPI\_ANY\_TAG} in the pattern.

The message tag is specified by the \mpiarg{tag} parameter of the receive operation. The message \mpiarg{context} is the context associated with the communicator specified by the parameter \mpiarg{comm}. The message source, if different from \const{MPI\_ANY\_SOURCE}, is specified as a rank within the process group associated with that same communicator. Thus, the range of valid values for the \mpiarg{source} parameter is {\tt \{ 0, ... , n-1 \} }$\cup${\tt \{ MPI\_ANY\_SOURCE \}}, where {\tt n} is the number of processes in this group.

Note the asymmetry between send and receive operations: A receive operation may accept messages from an arbitrary sender; on the other hand, a send operation must specify a unique receiver. This matches a ``push'' communication mechanism, where data transfer is effected by the sender (rather than a ``pull'' mechanism, where data transfer is effected by the receiver). Source = destination is allowed: a process can send a message to itself.

\subsection{Return status}
\label{subsec:pt2pt-status}
\discuss{\bf This is a change from previous draft}

The source or the tag of a received message may not be known if wildcard values were used in the receive operation.
Also, the actual length of the message received may not be known. Thus, this information needs to be returned by the receive operation. For reasons that will become clear later, it is preferable to return this information in a separate \type{out} variable, rather than using the \mpiarg{count}, \mpiarg{source} and \mpiarg{tag} parameters both for input and output. The information is returned by the \mpiarg{status} parameter of the \mpifunc{MPI\_RECV} function. This is a parameter of a special MPI-defined type. The status variable can be ``decoded'' to retrieve the \mpiarg{count}, \mpiarg{source} and \mpiarg{tag} fields, using the query functions listed below.

\begin{funcdef}{MPI\_GET\_SOURCE(status, source)}
\funcarg{\IN}{status}{return status of receive operation (Status)}
\funcarg{\OUT}{source}{source rank (integer)}
\end{funcdef}
Returns the rank of the message source (in the group associated with the communicator that was used to receive).

\begin{funcdef}{MPI\_GET\_TAG(status, tag)}
\funcarg{\IN}{status}{return status of receive operation (Status)}
\funcarg{\OUT}{tag}{message tag (integer)}
\end{funcdef}
Returns the tag of the received message.

\begin{funcdef}{MPI\_GET\_COUNT(status, count)}
\funcarg{\IN}{status}{return status of receive operation (Status)}
\funcarg{\OUT}{count}{number of received elements (integer)}
\end{funcdef}
Returns the number of elements received. (Here, again, we count {\em elements}, not {\em bytes}.)

The information returned by these query functions is the information last stored in the \mpiarg{status} variable by a receive function. It is erroneous to call these query functions if the \mpiarg{status} variable was never set by a receive. Note that it is not mandatory to query the return status after a receive. The receiver will use these calls, or some of them, only when the information they return is needed.
\discuss{
The use of a separate status parameter prevents errors that are often associated with \type{\INOUT} parameters (e.g., passing the \const{MPI\_ANY\_TAG} constant as the actual argument for \mpiarg{tag}). The use of an explicit user object becomes important for nonblocking communication as it allows the receive operation to be stateless and, hence, reentrant. This prevents confusion in cases where multiple receives can be posted by a process.
}
\implement{
One expects decode functions to be in-lined in many implementations.
}

All send and receive operations use the \mpiarg{start}, \mpiarg{count}, \mpiarg{datatype}, \mpiarg{source}, \mpiarg{dest}, \mpiarg{tag}, \mpiarg{comm} and \mpiarg{status} parameters in the same way as the basic \func{MPI\_SEND} and \func{MPI\_RECV} operations described in this section.

\section{Semantics of point-to-point communication}
\label{sec:pt2pt-semantics}

A valid MPI implementation guarantees certain general properties of point-to-point communication, which are described in this section.

\paragraph*{Order}
Messages are {\em non-overtaking}, within each context: if two messages are successively sent from the same source, to the same destination, with the same context, then the messages are received in the order they were sent. I.e., if the receiver posts two successive receives that both match either message, then the first receive will receive the first message, and the second receive will receive the second message. This requirement facilitates matching of sends to receives.
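A sketch of this guarantee (the buffers {\tt buf1} and {\tt buf2} and the tags {\tt tag1} and {\tt tag2} are illustrative names, not part of the proposal): process 0 sends two messages to process 1; the first receive matches either message and therefore receives the first message sent, and the second receive then receives the second message.
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(buf1, count, MPI_REAL, 1, tag1, comm, ierr)
    CALL MPI_SEND(buf2, count, MPI_REAL, 1, tag2, comm, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_RECV(buf1, count, MPI_REAL, 0, MPI_ANY_TAG, comm, status, ierr)
    CALL MPI_RECV(buf2, count, MPI_REAL, 0, tag2, comm, status, ierr)
END IF
\end{verbatim}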
The non-overtaking requirement also guarantees that message-passing code is deterministic, if processes are single-threaded and the wildcard \const{MPI\_ANY\_SOURCE} is not used in receives. If a process has a single thread of execution, then any two communications executed by this process are ordered. On the other hand, if the process is multi-threaded, then the semantics of thread execution may not define a relative order between two send operations executed by two distinct threads: the operations are logically concurrent, even though one physically precedes the other. In such a case, the two messages sent can be received in any order. Similarly, if two receive operations that are logically concurrent receive two successively sent messages, then the two messages can match the two receives in either order.

\paragraph*{Progress}
If a matching send and receive have been initiated on two processes, then at least one of these two operations will complete, independently of other actions in the system: the send operation will complete, unless the receive is satisfied by another message and completes; the receive operation will complete, unless the message sent is consumed by another matching receive that was posted at the same destination process.

\paragraph*{Fairness}
MPI makes no guarantee of {\em fairness} in the handling of communication. Suppose that a send was posted. Then it is possible that the destination repeatedly posts a receive that matches this send, yet the message is never received, because it is overtaken each time by another message, sent from another source. Similarly, suppose that a receive was posted. Then it is possible that messages that match this receive are repeatedly received, yet the receive is never satisfied, because it is overtaken by other receives posted at this node (by other executing threads). It is the programmer's responsibility to prevent starvation in such situations.
\discuss{
{\bf Needs to be discussed by MPIF} Alternative positions:

1. Sends are handled fairly; e.g. a posted send cannot be repeatedly overtaken by other sends. An example of an implementation that violates this: Assume that arriving messages are kept in a hash table organized by source, and that a receive with source=dontcare is handled by searching this table sequentially. Then a message stored at the end of this table may be repeatedly overtaken by messages that get stored at the head of the table. (Easy fix: rotate the starting point of the search.)

2. Receives are handled fairly; e.g. a posted receive cannot be repeatedly overtaken by other receives. An example of an implementation that violates this: assume that there are two different posting tables, one for receives with a specific source and one for receives with source=dontcare; assume further that these two structures are searched in a fixed order when a message arrives. Then a receive with source=dontcare may be repeatedly overtaken by receives with source specified. (Fix: alternate the search order; but this has performance problems, if we assume that one type of receive is much more frequent than the other type.)

3. Both sends and receives are handled fairly.

Opinions?
}

\paragraph*{Resource limitations}
The current practice for many commercial message-passing libraries is that (short) messages are buffered by the system, thus allowing blocking send operations to complete ahead of the matching receives. It is expected that many MPI implementations will follow this practice, and provide the same level of buffering that is available on the libraries they replace.
However, message buffering is not a universal practice. Even on systems where buffering occurs, the amount of buffer space available and the way it is allocated are bound to be implementation dependent. Therefore, message buffering is not mandated by MPI and is seen as a quality of implementation issue. A valid MPI implementation of \func{MPI\_SEND} is one that blocks the sender until a matching receive has been initiated. In general, the programmer can make no assumptions about the availability of buffer space, or about how this space is allocated. Thus, portable (safe) MPI code should work under the assumption that an arbitrary subset of the send operations are going to return before a matching receive is posted, and the rest will block until a matching receive is posted. MPI implementations will provide information on the amount of available buffer space and on the buffering policy via the environmental inquiries described in Chapter~\ref{chap:inquiry}.

The following examples involve two processes, with ranks 0 and 1.

The following program is safe, and should always succeed.
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
END IF
\end{verbatim}

The following program is erroneous, and should always fail.
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
END IF
\end{verbatim}
The receive operation of the first process must complete before its send, and can complete only if the matching send of the second process is executed; the receive operation of the second process must complete before its send and can complete only if the matching send of the first process is executed. This program will deadlock.

The following program is unsafe, and may succeed or fail, depending on the implementation.
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
END IF
\end{verbatim}
The message sent by each process has to be copied out before the send operation returns and the receive operation starts. For the program to complete, it is necessary that at least one of the two messages sent is buffered. Thus, this program can succeed only if the MPI implementation buffers messages and the communication system has sufficient buffer space for \mpiarg{count} words of data.
\section{Data Type Matching}
\label{sec:pt2pt-typematch}

One can think of message transmission as consisting of three phases:
\begin{enumerate}
\item Data is pulled out of the send buffer and a message is assembled
\item A message is transferred from sender to receiver
\item Data is pulled from the incoming message and disassembled into the receive buffer
\end{enumerate}
Type matching has to be observed at each of these three phases: the type of each variable in the sender buffer has to match the type specified for that entry by the send operation; the type specified by the send operation has to match the type specified by the receive operation; and the type of each variable in the receive buffer has to match the type specified for that entry by the receive operation. A program that fails to observe these three rules is erroneous.

To define type matching more precisely, we need to deal with two issues: matching of types of the host language with types specified in communication operations; and matching of types at sender and receiver.

A type specified for an entry by a send operation matches the type specified for that entry by a receive operation if both operations use identical names: \type{MPI\_INTEGER} matches \type{MPI\_INTEGER}, \type{MPI\_REAL} matches \type{MPI\_REAL}, and so on. The type of a variable in a host program matches the type specified in the communication operation if the datatype name used by that operation corresponds to the basic type of the host program variable: an entry with type name \type{MPI\_INTEGER} matches a Fortran variable of type \ftype{INTEGER}, an entry with type name \type{MPI\_REAL} matches a Fortran variable of type \ftype{REAL}, and so on. There is one exception to this last rule: an entry with type name \type{MPI\_BYTE} can be used to match any byte of storage (on a byte-addressable machine), irrespective of the datatype of the variable that contains this byte. The value of the message entry will be the binary value of the corresponding byte in memory.

We thus have two cases:
\begin{itemize}
\item Communication of typed values (i.e., with datatype different from \type{MPI\_BYTE}), where the datatypes of the corresponding entries in the sender program, in the send call, in the receive call and in the receiver program should all match.
\item Communication of untyped values (i.e., of datatype \type{MPI\_BYTE}), where both sender and receiver use the datatype \type{MPI\_BYTE}. In this case, there are no requirements on the types of the corresponding entries in the sender and the receiver programs, nor is it required that they be the same.
\end{itemize}

Examples:

1st program:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(a(1), 10, MPI_REAL, 1, tag, comm, ierr)
ELSE
    CALL MPI_RECV(a(1), 15, MPI_REAL, 0, tag, comm, status, ierr)
END IF
\end{verbatim}
This code is correct if both sender and receiver programs have allocated consecutive storage for ten real numbers, starting from {\tt a(1)}.

2nd program:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(a(1), 10, MPI_REAL, 1, tag, comm, ierr)
ELSE
    CALL MPI_RECV(a(1), 40, MPI_BYTE, 0, tag, comm, status, ierr)
END IF
\end{verbatim}
This code is erroneous, since sender and receiver do not provide matching datatype parameters.
3rd program:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(a(1), 40, MPI_BYTE, 1, tag, comm, ierr)
ELSE
    CALL MPI_RECV(a(1), 60, MPI_BYTE, 0, tag, comm, status, ierr)
END IF
\end{verbatim}
This code is correct, irrespective of the type of {\tt a(1)} and of the following variables in storage.
\discuss{
The behavior of erroneous code is implementation-dependent. It is desirable but not required that errors are reported at run-time. An MPI implementation may decide to avoid type-checking for the sake of efficiency, in which case none of the three programs will report an error. On a system where real numbers occupy four bytes and both sender and receiver execute in the same environment, it is likely that all three codes will have the same behavior.
}

\section{Data conversion}
\label{sec:pt2pt-conversion}

One of the goals of MPI is to support parallel computations across heterogeneous environments. Communication in a heterogeneous environment may require data conversions. We use the following terminology:
\begin{description}
\item[type conversion] changes the datatype of a value, e.g. by rounding a \type{REAL} to an \type{INTEGER}.
\item[representation conversion] changes the binary representation of a value, e.g. from hexadecimal floating point to IEEE floating point.
\end{description}
Representation conversion is needed when data is moved across different environments, e.g. different machines that use different binary encodings for the same datatype, or codes compiled with different compiler options for datatype representation. Representation conversion preserves, to the greatest possible extent, the {\bf value} represented. However, rounding errors and overflow and underflow exceptions may occur during floating point conversions; conversion of integers may also lead to exceptions when a value that can be represented in one system cannot be represented in the other system. MPI does not specify rules for representation conversion.

The type matching rules imply that MPI communication never entails type conversion. On the other hand, MPI requires that a representation conversion be performed when a typed value is transferred across environments that use different representations for the datatype of this value. An exception occurring during representation conversion results in a failure of the communication; an error occurs either in the send operation, or the receive operation, or both.

If a value sent in a message is untyped (i.e., of type \type{MPI\_BYTE}), then the binary representation of the byte stored at the receiver is identical to the binary representation of the byte loaded at the sender. This holds true whether sender and receiver run in the same or in distinct environments; no representation conversion is required. Note that no conversion ever occurs when an MPI program executes in a homogeneous system, where all processes run in the same environment.

Also note the different behavior of \type{MPI\_BYTE} and of \type{MPI\_CHARACTER}. A buffer descriptor entry with datatype \type{MPI\_CHARACTER} can only match a Fortran variable of type \ftype{CHARACTER}, and representation conversion may occur when values of type \type{MPI\_CHARACTER} are transferred, e.g., from an EBCDIC encoding to an ASCII encoding.

Consider the previous three examples. The first program is correct, assuming both sender and receiver declared ten consecutive real variables in storage starting at {\tt a(1)}.
If the sender and receiver execute in different environments, then the ten real values that are fetched from the send buffer will be converted to the representation for reals on the receiver site before they are stored in the receive buffer. While the number of real elements fetched from the send buffer equals the number of real elements stored in the receive buffer, the number of bytes stored need not equal the number of bytes loaded: e.g. the sender may use a four-byte representation and the receiver an eight-byte representation for reals. If the send or receive buffer does not contain ten consecutive real variables, then the program is erroneous, and its behavior is undefined.

The second program is erroneous, and its behavior is undefined.

The third program is correct. The exact same sequence of forty bytes that was loaded from the send buffer will be stored in the receive buffer, even if sender and receiver run in different environments. The message sent has exactly the same length (in bytes) and the same binary representation as the message received. If the variables in the send buffer are of different types than the variables in the receive buffer, or if they are of the same type but different data representations are used, then the bits stored in the receive buffer may encode values that are different from the values they encoded in the send buffer.

Data representation conversion also applies to the envelope of a message: source, destination and tag are all integers that may need to be converted.

The current draft does not provide a mechanism for inter-language communication: messages sent by Fortran calls should be received by Fortran calls and messages sent by C calls should be received by C calls (this follows from the requirements of type matching and the fact that Fortran and C datatypes are distinct). If inter-language communication is needed then one needs to invoke C communication routines from a Fortran program or Fortran communication routines from a C program.
\implement{
The current definition does not require messages to carry data type information. A message can be composed and sent using only the information provided by the send call, and can be received and stored using only the information provided by the receive call. If messages are sent between different machines then one can either use a ``universal'' data encoding for messages, use knowledge of the receiver environment in order to convert data at the sender, or use knowledge of the sender environment in order to convert data at the receiver. In any case the datatype parameter in the local call can be used to derive the types of the values transferred. However, additional type information carried by messages can provide better error detection.
}
\discuss{
MPI does not handle inter-language communication because there seem to be no agreed standards for inter-language type conversions. The current (non)solution means that inter-language communication relies on the vendor-specific rules for inter-language calls at each site. It may be desirable to provide information on the correspondence between MPI C datatypes (such as \type{MPI\_FLOAT}) and MPI Fortran datatypes (such as \type{MPI\_REAL}) with suitable environment query functions.
}

\section{Communication Modes}
\label{sec:pt2pt-modes}

The basic send operation described in Section~\ref{sec:pt2pt-basicsend} uses the {\bf standard} communication mode. In this communication mode, a send operation can be started whether or not a matching receive was posted.
The completion of the send operation indicates that the message and its envelope have been safely stored away and that the sender is free to access and modify the send buffer. Thus, the operation is {\bf blocking}: it does not return until the send operation has completed {\bf locally}, on the sender side. The completion of a send operation gives no indication that the message was received on the receiver side.

A blocking send may be implemented so that it returns only after a matching receive has been executed on the receiver side. This avoids the need to buffer message data outside the sender or receiver memory. In this case the send operation completes only after the matching receive has started executing. On the other hand, it is also possible for MPI to buffer messages, so as to allow the sender to proceed ahead of the receiver. In such a case the send operation may complete successfully before the message has been received. Thus, the basic send operation described in the previous section is {\bf asynchronous}, since its return does not imply a synchronization with the (remote) receive operation, and does not imply {\bf global} termination of the communication.

There are two additional communication modes:

A send that uses the {\bf ready} communication mode may be started only if a matching receive is already posted; otherwise the operation is erroneous and its outcome is undefined. In some systems, this allows the removal of a hand-shake operation that is otherwise required, and results in improved performance. The completion of the send operation does not depend on the status of a matching receive, and merely indicates that the send buffer can be reused. A send operation that uses the ready mode has the same semantics as a standard send operation; when using the ready mode, the sender merely provides additional information to the communication subsystem.

A send that uses the {\bf synchronous} mode can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive is posted, and the receive operation has started to receive the message sent by the synchronous send. (I.e., the receive has been posted, and the incoming message has been matched to this posted receive.) Thus, the completion of a synchronous send not only indicates that the send buffer can be reused, but also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive. If both sends and receives are blocking operations then the use of the synchronous mode provides synchronous communication semantics: a communication does not complete at either end before both processes ``attend'' to the communication; the completion of a synchronous send is a {\bf global} event.

Two additional send functions are provided for the two additional communication modes. The communication mode is indicated by a one-letter prefix: {\tt R} for ready and {\tt S} for synchronous.
Send in ready mode:
\begin{funcdef}{MPI\_RSEND (start, count, datatype, dest, tag, comm)}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements in send buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each send buffer element }
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

Send in synchronous mode:
\begin{funcdef}{MPI\_SSEND (start, count, datatype, dest, tag, comm)}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements in send buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each send buffer element }
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

There is only one receive mode, which can match any of the send modes. The receive operation described in the last section is {\bf blocking}: it returns only after the receive buffer contains the newly received message. It is {\bf asynchronous}: the completion of a receive is a local operation, and a receive can complete before the matching send has completed (of course, it can complete only after the matching send has started).

Communication imposes an order on the events occurring at the communicating nodes. It is always the case that the completion of a receive occurs after the start of the matching send. If the synchronous send mode is used, then it is also the case that the completion of a send occurs after the start of the matching receive. Of course, on each process, a communication completes after it is started. No other order is imposed by MPI. E.g., if the standard send mode is used, the send operation may complete before the matching receive has started.

In a multi-threaded implementation of MPI, a blocking send or receive operation may only block the executing thread, not the entire process. Thus, while a thread is blocked waiting for a send or receive to complete, other threads may execute within the same address space. In such a case it is the user's responsibility not to modify the send buffer until the send completes, and not to access or modify the receive buffer until the receive completes; otherwise the outcome of the computation is undefined.
\discuss{
Do we want to say that a blocking operation MAY or MUST only block the executing thread?
}
\implement{
A ready send can be implemented as a standard send; in such a case there will be no performance advantage (or disadvantage) for the use of ready send. A standard send can be implemented as a synchronous send. In such a case, no data buffering is needed.

A possible communication protocol for the various communication modes is outlined below:

{\tt ready send}: The message is sent as soon as possible.

{\tt synchronous send}: The sender sends a request-to-send message. The receiver stores this request. When a matching receive is posted, the receiver sends back a permission-to-send message, and the sender now sends the message.

{\tt standard send}: The first protocol may be used for short messages, and the second protocol for long messages.

Additional control messages might be needed for flow control and error recovery. Of course, there are many other possible choices.
}

\section{Nonblocking communication}
\label{sec:pt2pt-nonblock}

One can improve performance on many systems by overlapping communication and computation.
This is especially true on systems where communication can be executed autonomously by an intelligent communication controller. Light-weight threads are one mechanism for achieving such overlap. An alternative mechanism that often leads to better performance is to use {\bf nonblocking communication}. A nonblocking send call initiates the send operation, but does not complete it. The send call will return before the message is copied out of the send buffer. A separate call is needed to complete the communication, i.e. to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed. Similarly, a nonblocking receive initiates the receive operation, but does not complete it. The call will return before a message is stored into the receive buffer. A separate call is needed to complete the receive operation and verify that the data has been received into the receive buffer. With suitable hardware, the transfer of data into the receiver memory may proceed concurrently with computations done after the receive was initiated and before it completed.

Nonblocking sends can use the same three modes as blocking sends: {\tt standard}, {\tt ready} and {\tt synchronous}. These carry the same meaning. The initiation and completion of a standard send do not depend on the status of a matching receive. A ready send can be initiated only if a matching receive has already been initiated, otherwise the call is erroneous; its completion does not depend on the status of a matching receive. A synchronous send can be initiated before a matching receive has been initiated, but will complete only after a matching receive has been initiated, and has started receiving the message generated by the send operation. In all cases the operation that initiates the communication (send or receive) is a local, nonblocking call that returns as soon as possible.

A nonblocking communication call may fail because the system has exhausted available resources (e.g., exceeded the limit on the number of pending communications per node). Good quality implementations of MPI will set these limits high enough so that a nonblocking communication call will fail only in ``pathological'' cases. Nonblocking sends can be matched with blocking receives, and vice versa.

\subsection{Communication Objects}
\label{subsec:pt2pt-commobject}

Nonblocking communications use opaque communication objects to identify communication operations and to match the operation that initiates the communication with the operation that terminates it. These are system objects that are accessed via a handle. An opaque communication object identifies various properties of a communication operation, such as the (send or receive) buffer that is associated with it, its context, the tag and destination parameters to be used for a send, or the tag and source parameters to be used for a receive. In addition, this object stores information about the status of the pending communication operation that is performed with this object.

\subsection{Communication initiation}
\label{subsec:pt2tp-commstart}

We use the same naming conventions as for blocking communication: a prefix of {\tt R} ({\tt S}) is used for {\tt READY} ({\tt SYNCHRONOUS}) mode. In addition, a prefix of {\tt I} (for {\tt IMMEDIATE}) indicates that the call is nonblocking.

Initiate a standard mode nonblocking communication.
\begin{funcdef}{MPI\_ISEND(handle, start, count, datatype, dest, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements in send buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each send buffer element }
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

Initiate a ready mode nonblocking communication.
\begin{funcdef}{MPI\_IRSEND(handle, start, count, datatype, dest, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements in send buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each send buffer element }
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

Initiate a synchronous mode nonblocking communication.
\begin{funcdef}{MPI\_ISSEND(handle, start, count, datatype, dest, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements in send buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each send buffer element }
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

Initiate a nonblocking receive.
\begin{funcdef}{MPI\_IRECV (handle, start, count, datatype, source, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\OUT}{start}{initial address of receive buffer (choice)}
\funcarg{\IN}{count}{number of elements in receive buffer (integer)}
\funcarg{\IN}{datatype}{datatype of each receive buffer element }
\funcarg{\IN}{source}{rank of source (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\end{funcdef}

The call allocates a communication object and associates it with the handle. The handle can be used later to query about the status of the communication or wait for its completion. A sender should not update any part of a send buffer after a nonblocking send operation returns, until the send completes. A receiver should not access or update any part of a receive buffer after a nonblocking receive operation returns, until the receive completes.

\subsection{Communication Completion}
\label{subsec:pt2pt-commend}

The completion of a send operation indicates that the sender is now free to update the locations in the send buffer, or any other location that can be referenced by the send operation (the send operation itself leaves the content of the send buffer unchanged). It does not indicate that the message has been received; rather, it may have been buffered by the communication subsystem. However, if a {\tt synchronous} mode send was used, the completion of the send operation indicates that a matching receive was initiated, and that the message will eventually be received by this matching receive.

The completion of a receive operation indicates that the receive buffer contains the received message, and that the status object is set; the receiver is now free to access the receive buffer, or any other location that can be referenced by the receive operation. It does not indicate that the matching send operation has completed (but indicates, of course, that the send was initiated).
\begin{funcdef}{MPI\_WAIT (handle, status)}
\funcarg{\IN}{handle}{handle to communication object}
\funcarg{\OUT}{status}{status object}
\end{funcdef}

A call to \func{MPI\_WAIT} returns when the operation identified by \mpiarg{handle} is complete. If the communication object associated with this handle was created by a nonblocking send or receive call, then the object is deallocated by the call to \func{MPI\_WAIT} and the handle becomes null. The call returns in \mpiarg{status} information on the completed operation. The status object for a receive operation can be queried using the functions described in Section~\ref{subsec:pt2pt-status}. The status parameter is not used or updated for a send operation, and an arbitrary value may be passed for this parameter in the call.
\discuss{
Do we agree on the last sentence? (this implies some restriction on the implementation)
}

\begin{funcdef}{MPI\_TEST (handle, flag, status)}
\funcarg{\IN}{handle}{handle to communication object}
\funcarg{\OUT}{flag}{(logical)}
\funcarg{\OUT}{status}{status object}
\end{funcdef}

A call to \func{MPI\_TEST} returns \mpiarg{flag = true} if the operation identified by \mpiarg{handle} is complete. In such a case, the status object is set to contain information on the completed operation; if the communication object was created by a nonblocking send or receive, then it is deallocated and the handle becomes null. The call returns \mpiarg{flag = false} otherwise. In such a case, the value of the status object is undefined.
\discuss{
We could dispose of the flag parameter and add an additional query function for status objects.
}
\func{MPI\_TEST} is a local, nonblocking operation. If repeatedly called for an operation that is enabled, it must eventually succeed. The return status object for a receive operation carries information that can be accessed using the functions described in Section~\ref{subsec:pt2pt-status}. The status argument is not used or updated for a send operation and an arbitrary value can be passed for this parameter in the call.

In a multi-threaded environment, the use of a blocking completion call (\func{MPI\_WAIT}) may allow the operating system to de-schedule the blocked thread and schedule another thread for execution, if one is available. The use of a nonblocking completion call (\func{MPI\_TEST}) allows the user to schedule alternative activities within a single thread of execution.

Example:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_ISEND(handle, a(1), 10, MPI_REAL, 1, tag, comm, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(handle, status, ierr)
ELSE
    CALL MPI_IRECV(handle, a(1), 15, MPI_REAL, 0, tag, comm, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(handle, status, ierr)
END IF
\end{verbatim}

The functions \func{MPI\_WAIT} and \func{MPI\_TEST} can be used to complete both sends and receives; they will also be used to complete any other nonblocking communication call provided by MPI. Nonblocking communications cannot be replaced by blocking communications, even in the {\tt synchronous} communication mode.
Consider the following example:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_ISEND(handle1, a(1), count, MPI_REAL, 1, tag1, comm, ierr)
    CALL MPI_ISEND(handle2, b(1), count, MPI_REAL, 1, tag2, comm, ierr)
    CALL MPI_WAIT(handle1, status, ierr)
    CALL MPI_WAIT(handle2, status, ierr)
ELSE
    CALL MPI_IRECV(handle2, b(1), count, MPI_REAL, 0, tag2, comm, ierr)
    CALL MPI_IRECV(handle1, a(1), count, MPI_REAL, 0, tag1, comm, ierr)
    CALL MPI_WAIT(handle2, status, ierr)
    CALL MPI_WAIT(handle1, status, ierr)
END IF
\end{verbatim}
The code is guaranteed to execute correctly, even in an implementation that does not buffer messages (the only resource requirement is the ability to have two pending communications). On the other hand, the code may deadlock if the nonblocking operations are replaced by blocking ones.

\subsection{Multiple Completions}
\label{subsec:pt2pt-multiple}

It is convenient to be able to wait for the completion of any or all the operations in a set, rather than having to wait for a specific message. A call to \func{MPI\_WAITANY} or \func{MPI\_TESTANY} can be used to wait or test for the completion of one out of several operations; a call to \func{MPI\_WAITALL} or \func{MPI\_TESTALL} can be used to wait or test for the completion of all pending operations in a list.
\discuss{
\bf \func{MPI\_TESTALL} is new.
}

\begin{funcdef}{MPI\_WAITANY (count, array\_of\_handles, index, status)}
\funcarg{\IN}{count}{list length (integer)}
\funcarg{\IN}{array\_of\_handles}{array of handles to communication objects}
\funcarg{\OUT}{index}{index of handle for operation that completed (integer)}
\funcarg{\OUT}{status}{return status object}
\end{funcdef}

Blocks until one of the operations associated with the communication handles in the array has completed. Returns the index of that handle in the array, and returns the status of that operation in \mpiarg{status}. The successful execution of \func{MPI\_WAITANY(count, array\_of\_handles, index, status)} has the same effect as the successful execution of \func{MPI\_WAIT(array\_of\_handles[i], status)}, where \mpiarg{i} is the value returned in \mpiarg{index}. In particular, the associated communication object is deallocated, and the handle to it in \mpiarg{array\_of\_handles} is set to null. If more than one operation is enabled and can terminate, one is arbitrarily chosen. There is no requirement that the choice satisfy any fairness criterion.
\discuss{
If the statement of fairness is changed in Section~\ref{sec:pt2pt-semantics}, then it will need to change here, too.
}

\begin{funcdef}{MPI\_TESTANY( count, array\_of\_handles, index, status)}
\funcarg{\IN}{count}{list length (integer)}
\funcarg{\IN}{array\_of\_handles}{array of handles to communication objects}
\funcarg{\OUT}{index}{index of handle for operation that completed, or -1 if none completed (integer)}
\funcarg{\OUT}{status}{return status object}
\end{funcdef}

Causes either one or none of the operations associated with the communication handles to complete. In the former case, it returns the index of that handle in the array, and returns the status of that operation in \mpiarg{status}. The communication object is deallocated and the handle to it in \mpiarg{array\_of\_handles} is set to null. In the latter case, it returns a value of -1 in \mpiarg{index} and \mpiarg{status} is undefined.
Like \mpifunc{MPI\_TEST}, this is a nonblocking operation that returns immediately; if one busy-waits with \mpifunc{MPI\_TESTANY}, and some operation on the list is enabled, then \mpifunc{MPI\_TESTANY} will eventually return with a valid \mpiarg{index}.
\discuss{
Could have a \mpiarg{flag} parameter, rather than using index = -1 for unsuccessful return. This is more consistent and more elegant, but requires an additional parameter. If we stick with the current version, we may replace -1 by an MPI constant.
}

\begin{funcdef}{MPI\_WAITALL( count, array\_of\_handles, array\_of\_status)}
\funcarg{\IN}{count}{lists length (integer)}
\funcarg{\IN}{array\_of\_handles}{array of handles to communication objects}
\funcarg{\OUT}{array\_of\_status}{array of status objects}
\end{funcdef}

Blocks until all communication operations associated with handles in the list complete, and returns the status of all these operations. Both arrays have the same number of valid entries. The \mpiarg{i}-th entry in \mpiarg{array\_of\_status} is set to the return status of the \mpiarg{i}-th operation. All communication objects are deallocated and all handles are set to null.

\begin{funcdef}{MPI\_TESTALL(count, array\_of\_handles, flag, array\_of\_status)}
\funcarg{\IN}{count}{lists length (integer)}
\funcarg{\IN}{array\_of\_handles}{array of handles to communication objects}
\funcarg{\OUT}{flag}{(logical)}
\funcarg{\OUT}{array\_of\_status}{array of status objects}
\end{funcdef}

Causes either all or none of the operations associated with the communication handles to complete. It returns \mpiarg{flag = true} if all communications associated with handles in the array have completed. In this case, each status entry is set to the status of the corresponding communication. All communication objects are deallocated, and all handles are set to null. Otherwise, \mpiarg{flag = false} is returned and the values of the status entries are undefined. This is a nonblocking operation that returns immediately; if one busy-waits with \mpifunc{MPI\_TESTALL}, and all operations in the list become enabled, then \mpifunc{MPI\_TESTALL} will eventually return \mpiarg{flag = true}.

Example:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.LT.2) THEN
    CALL MPI_ISEND(handle, a, n, MPI_REAL, 2, tag, comm, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(handle, status, ierr)
ELSE    ! rank.EQ.2
    CALL MPI_IRECV(handle_list(0), a, n, MPI_REAL, 0, tag, comm, ierr)
    CALL MPI_IRECV(handle_list(1), b, n, MPI_REAL, 1, tag, comm, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAITANY(2, handle_list, index, status, ierr)
    IF (index.EQ.0) THEN
        **** handle message from process 0 ****
        CALL MPI_WAIT(handle_list(1), status, ierr)
        **** handle message from process 1 ****
    ELSE
        **** handle message from process 1 ****
        CALL MPI_WAIT(handle_list(0), status, ierr)
        **** handle message from process 0 ****
    END IF
END IF
\end{verbatim}

The calls introduced in this subsection can be used to wait or test for the completion of an arbitrary mix of nonblocking communication calls, including a mix of sends and receives, and of any additional nonblocking communication calls provided by MPI. The \mpiarg{status} argument in \mpifunc{MPI\_WAITANY} and \mpifunc{MPI\_TESTANY} and the \mpiarg{array\_of\_status} argument in \mpifunc{MPI\_WAITALL} and \mpifunc{MPI\_TESTALL} are not used or updated if the completing communication is a send.
If all operations in the list provided by the argument \mpiarg{array\_of\_handles} are send operations, then an arbitrary value can be passed to the \mpiarg{status} (resp.\ \mpiarg{array\_of\_status}) argument.
\discuss{
Is the last paragraph agreed?

We may want to allow null pointers in \mpiarg{array\_of\_handles}. This will allow looping with the same \mpiarg{array\_of\_handles} until all communications have completed. On the other hand, this will encourage potentially inefficient code.

I added the requirement that the handle to a communication that completes is set to null. This prevents the occurrence of dangling references, but adds some overhead.
}

\section{Probe and Cancel}
\label{sec:pt2pt-probe}

The \func{MPI\_PROBE} and \func{MPI\_IPROBE} operations allow incoming messages to be checked for, without actually receiving them. The user can then decide how to receive them, based on the information returned by the probe (basically, the information returned in \mpiarg{status}). In particular, the user may allocate memory for the receive buffer, according to the length of the probed message.

The \func{MPI\_CANCEL} operation allows pending communications to be cancelled. This is required for cleanup. Posting a send or a receive ties up user resources (send or receive buffers), and a cancel may be needed to free these resources gracefully.
\discuss{
Made one change from the approved draft on probe: removed the datatype argument from probe. If generalized datatypes are adopted, then it seems more natural to provide the argument in the status decode function.
}

\begin{funcdef}{MPI\_IPROBE(source, tag, comm, flag, status)}
\funcarg{\IN}{source}{source rank, or \const{MPI\_ANY\_SOURCE} (integer)}
\funcarg{\IN}{tag}{tag value or \const{MPI\_ANY\_TAG} (integer)}
\funcarg{\IN}{comm}{handle to communicator}
\funcarg{\OUT}{flag}{(logical)}
\funcarg{\OUT}{status}{status object}
\end{funcdef}

\func{MPI\_IPROBE} returns \mpiarg{flag = true} if there is a message that can be received and that matches the pattern specified by the parameters \mpiarg{source}, \mpiarg{tag}, and \mpiarg{comm}. It returns \mpiarg{flag = false} otherwise. If \func{MPI\_IPROBE} returns \mpiarg{flag = true}, then the source and tag of the message matched are returned in the status object. These are the same values that would have been returned by a call to \func{MPI\_RECV} executed at the same point in the program. These values can be decoded from the return status object using the status query functions described in Section~\ref{subsec:pt2pt-status}. The return status object is undefined if \mpiarg{flag = false}.

The probe calls do not carry a datatype argument: the message type may not be known when the probe is executed. Accordingly, a probe call does not return in the status object the length of the probed message, and \mpifunc{MPI\_GET\_COUNT} cannot be used after a probe to query the length of the probed message. (Remember that the length of a message, in elements, depends on the types of these elements; some implementations may not carry type information in the transmitted message.) Instead, a different status decoding function that takes the datatype as an argument can be used in this case.
\begin{funcdef}{MPI\_PROBE\_COUNT(status, datatype, count)}
\funcarg{\IN}{status}{status object}
\funcarg{\IN}{datatype}{message datatype}
\funcarg{\OUT}{count}{number of entries in message (integer)}
\end{funcdef}
Returns the length of the message that was last probed with \mpiarg{ status} as an argument. The call is erroneous if \mpiarg{ datatype} does not match the type of the message.
\discuss{ Message passing in MPI can be implemented without appending type information to messages. A message is merely a string of bytes and the interpretation of these bytes into a sequence of typed elements is done using the datatype information provided by the local communication call. The ability to use such an implementation strategy is deemed to be an important goal. In such an implementation, when a message arrives, it is not known how many elements it contains, or even how much storage is needed to receive that message (because of possible representation conversion in a heterogeneous environment). The length of the incoming message can be computed by \func{ MPI\_PROBE\_COUNT} from the byte length of the message and from the datatype parameter. The current solution saves us the need for one additional word per message that would otherwise be needed to transfer the message length (in elements) with the message. {\bf Needs to be discussed by MPIF.} We may want a call to \func{ MPI\_PROBE\_COUNT} to be significant not just after a probe, but also after a receive, for consistency reasons. It is tempting to merge \func{ MPI\_GET\_COUNT} and \func{ MPI\_PROBE\_COUNT} into one call. E.g., always provide a datatype argument, even though it is redundant after a receive. But then we impose an additional burden on regular receives. We may want to relax type restrictions so that \mpiarg{ datatype} = \type{ MPI\_BYTE} could be used as an argument to \func{ MPI\_PROBE\_COUNT}, even if the message was sent with another datatype; this would allow counting the number of bytes received, and using this count in a malloc. }
A subsequent receive executed with the same context, and with the source and tag returned by the call to \func{ MPI\_IPROBE}, will receive the message that was matched by the probe, if no other intervening receive occurred after the probe. If the receiving process is multi-threaded, it is the user's responsibility to ensure that the last condition holds.
\discuss{ MPI guarantees that successive messages sent from a source to a destination within the same context are received in the order they are sent. Thus, MPI must support, either explicitly or implicitly, a FIFO structure to manage messages between each pair of processes, for each context. \func{ MPI\_PROBE} returns information on the first matching message in this FIFO; this will also be the message received by the first subsequent receive with the same source, tag and context as the message matched by \func{ MPI\_PROBE}. }
\begin{funcdef}{MPI\_PROBE( source, tag, comm, status)}
\funcarg{\IN}{source}{source rank, or \const{ MPI\_ANY\_SOURCE} (integer)}
\funcarg{\IN}{tag}{tag value, or \const{ MPI\_ANY\_TAG} (integer)}
\funcarg{\IN}{comm}{handle to communicator}
\funcarg{\OUT}{status}{status object}
\end{funcdef}
\func{ MPI\_PROBE} behaves like \func{ MPI\_IPROBE} except that it is a blocking call which returns only after a matching message has been found.
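The following sketch illustrates the intended use of probe: the receiver first probes for a matching message, uses \func{ MPI\_PROBE\_COUNT} to determine how many elements it carries, and only then receives it. The Fortran bindings shown here are assumed for the purpose of the illustration (each call takes a trailing {\tt ierr} argument, and the blocking receive has the argument list {\tt (start, count, datatype, source, tag, comm, status)}, mirroring \func{ MPI\_SEND}); {\tt buf} is assumed to be large enough to hold {\tt count} reals.
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
   CALL MPI_SEND(a, n, MPI_REAL, 1, tag, comm, ierr)
ELSE    ! rank.EQ.1
C  block until a matching message is available, without receiving it
   CALL MPI_PROBE(0, tag, comm, status, ierr)
C  find out how many reals the probed message carries
   CALL MPI_PROBE_COUNT(status, MPI_REAL, count, ierr)
C  receive the message into a buffer of sufficient size
   CALL MPI_RECV(buf, count, MPI_REAL, 0, tag, comm, status, ierr)
END IF
\end{verbatim}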
\discuss{ \bf CANCEL has been changed since last draft }
\begin{funcdef}{MPI\_CANCEL(handle)}
\funcarg{\IN}{handle}{handle to communication object}
\end{funcdef}
A call to \func{ MPI\_CANCEL} marks for cancellation a pending nonblocking communication operation (send or receive). The cancel call is nonblocking, and local. It returns immediately, possibly before the communication is actually cancelled. It is still necessary to complete a communication that has been marked for cancellation, using a call to \func{ MPI\_WAIT} or \func{ MPI\_TEST} (or any of the derived operations). If the operation has been cancelled, then information to that effect will be returned in the status argument of the operation that completes the communication.
If a communication is marked for cancellation, then a \func{ MPI\_WAIT} call for that communication is guaranteed to return, irrespective of the status of other processes; similarly, if \func{ MPI\_TEST} is repeatedly called in a busy wait loop for a cancelled communication, then \mpifunc{MPI\_TEST} will eventually be successful.
Either the cancellation succeeds, or the communication succeeds, but not both. If a send is marked for cancellation, then it must be the case that either the send completes normally, in which case the message sent was received at the destination process, or that the send is successfully cancelled, in which case no part of the message was received at the destination. Then, any matching receive has to be satisfied by another send. If a receive is marked for cancellation, then it must be the case that either the receive completes normally, or that the receive is successfully cancelled, in which case no part of the receive buffer is altered. Then, any matching send has to be satisfied by another receive.
\begin{funcdef}{MPI\_TEST\_CANCELLED(status, flag)}
\funcarg{\IN}{status}{return status object}
\funcarg{\OUT}{flag}{(logical)}
\end{funcdef}
Returns \mpiarg{ flag = true} if the communication associated with the return status object was cancelled successfully. In such a case, all other fields of \mpiarg{ status} (such as \mpiarg{ length} or \mpiarg{ tag}) are undefined. Returns \mpiarg{ flag = false} otherwise. If a receive operation might be cancelled, then one should call \func{ MPI\_TEST\_CANCELLED} first, to check whether the operation was cancelled, before checking the other values of the return status.
\section{Persistent communication objects}
\discuss { \bf there are some changes here not yet approved by MPIF }
Often a communication with the same parameter list is repeatedly executed within the inner loop of a parallel computation. In such a situation, it may be possible to optimize the communication by binding the list of communication parameters to the communication object once and, then, repeatedly using the communication handle to initiate and complete messages. The communication handle thus created can be thought of as a communication port or a ``half-channel''. It does not provide the full functionality of a conventional channel, since there is no binding of the send port to the receive port: this construct allows reduction of the overhead for communication between processor and communication controller, but not the overhead for communication between one communication controller and another.
A communication object is created using one of the four following calls. These calls involve no communication.
The function \mpifunc{MPI\_CREATE\_SEND} creates a communication object for a standard mode send operation, and binds to it all the parameters of a send operation.
\begin{funcdef}{MPI\_CREATE\_SEND(handle, start, count, datatype, dest, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\IN}{start}{initial address of send buffer (choice)}
\funcarg{\IN}{count}{number of elements sent (integer)}
\funcarg{\IN}{datatype}{type of each element}
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{handle to communicator object}
\end{funcdef}
The call allocates a new communication object and associates the communication handle with it.
\discuss{ Two changes: 1. \mpiarg{ buffer\_handle} is replaced with \mpiarg{ start, count, datatype}, to be in sync with the new datatype proposal. 2. Took out \mpiarg{ persistence}. Does not seem very useful here. Instead, I introduce in the last section a ``universal'' communication function, which would not usually be used directly by programmers, but can be used to define the other communication functions. }
The function \mpifunc{MPI\_CREATE\_RSEND} creates a communication object for a ready mode send operation.
\begin{funcdef}{MPI\_CREATE\_RSEND(handle, start, count, datatype, dest, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\IN}{start}{initial address of send buffer}
\funcarg{\IN}{count}{number of elements sent (integer)}
\funcarg{\IN}{datatype}{type of each element}
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{handle to communicator object}
\end{funcdef}
The function \mpifunc{MPI\_CREATE\_SSEND} creates a communication object for a synchronous mode send operation.
\begin{funcdef}{MPI\_CREATE\_SSEND(handle, start, count, datatype, dest, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\IN}{start}{initial address of send buffer}
\funcarg{\IN}{count}{number of elements sent (integer)}
\funcarg{\IN}{datatype}{type of each element}
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{handle to communicator object}
\end{funcdef}
The function \mpifunc{MPI\_CREATE\_RECV} creates a communication object for a receive operation.
\begin{funcdef}{MPI\_CREATE\_RECV(handle, start, count, datatype, source, tag, comm)}
\funcarg{\OUT}{handle}{handle to communication object}
\funcarg{\OUT}{start}{initial address of receive buffer}
\funcarg{\IN}{count}{number of elements received (integer)}
\funcarg{\IN}{datatype}{type of each element}
\funcarg{\IN}{source}{rank of source, or \const{ MPI\_ANY\_SOURCE} (integer)}
\funcarg{\IN}{tag}{message tag, or \const{ MPI\_ANY\_TAG} (integer)}
\funcarg{\IN}{comm}{handle to communicator object}
\end{funcdef}
A communication (send or receive) that uses a predefined handle is initiated by the function \mpifunc{MPI\_START}.
\begin{funcdef}{MPI\_START(handle)}
\funcarg{\INOUT}{handle}{handle to communication object}
\end{funcdef}
The communication object associated with \mpiarg{ handle} should be an object that was created by one of the previous four functions, so that all the communication parameters are already defined. A send can be started provided that the previous send using the same object has completed, or as soon as the object is created, if it has not yet been used in a communication. In addition, if the communication mode is {\tt ready} then a matching receive should be posted.
The send buffer should not be updated after the send is started, until the operation completes. A receive can be started provided that the preceding receive using the same object has completed, or as soon as the object is created, if it has not yet been used in a communication. The receive buffer should not be accessed after the receive is started, until the operation completes. The call is nonblocking, with similar semantics as the nonblocking communication operations described in Section~\ref{sec:pt2pt-nonblock}.
A communication started with a call to \mpifunc{MPI\_START} is completed by a call to \mpifunc{MPI\_WAIT}, \mpifunc{MPI\_TEST}, or one of the derived functions described in Section~\ref{subsec:pt2pt-multiple}. These communication completion functions do not deallocate the communication object, and it can be reused by another \mpifunc{MPI\_START} call. The object needs to be explicitly deallocated by a call to the function \mpifunc{MPI\_COMM\_FREE}, below.
We thus have two types of communication objects: {\bf persistent} objects, which are allocated by a call to \mpifunc{MPI\_CREATE\_xxx}, and are explicitly deallocated by \mpifunc{MPI\_COMM\_FREE}, and {\bf ephemeral} objects that persist for one communication only: they are created by a nonblocking communication initiation function, and are freed by the communication completion call.
\begin{funcdef}{MPI\_COMM\_FREE(handle)}
\funcarg{\INOUT}{handle}{handle to communication object}
\end{funcdef}
Marks the communication object for deallocation. The object will be deallocated when there are no pending communications involving this object, at which point the handle becomes null. The call is nonblocking, and the object may not be deallocated when the call returns. It is permissible to call \mpifunc{MPI\_COMM\_FREE(handle)} after a communication that uses \mpiarg{ handle} has been initiated, but before it has completed. The object will be deallocated after the communication completes. It is erroneous to try to reuse the handle for a subsequent communication.
A correct invocation of the functions described in this section will occur in a sequence of the form
\[ \bf Create \ (Start \ Complete)^* \ Free \ , \]
where $*$ indicates zero or more repetitions. If the same communication object is used in several concurrent threads, it is the user's responsibility to coordinate calls so that the correct sequence is obeyed.
A send operation initiated with \func{ MPI\_START} can be matched with any receive operation and, likewise, a receive operation initiated with \func{ MPI\_START} can receive messages generated by any send operation.
\section{Send-receive}
\label{sec:pt2pt-sendrecv}
\discuss{ \bf This section has not yet been approved by MPIF. }
The {\bf send-receive} operation combines in one call the sending of a message to one destination and the receiving of another message, from another destination, possibly the same. The {\bf exchange} operations are the same as send-receive, except that the send buffer and the receive buffer are identical. A send-receive operation is very useful for executing a shift operation across a chain of processes. If blocking sends and receives are used for such a shift, then one needs to order the sends and receives correctly (e.g., even processes send first and then receive, while odd processes receive first and then send) so as to prevent cyclic dependencies that lead to deadlock. When a send-receive or exchange operation is used, the communication subsystem takes care of these issues.
Also, a send-receive operation is useful for implementing remote procedure calls. A message sent by a send-receive or exchange operation can be received by a regular receive operation, and vice versa.
\begin{funcdef}{MPI\_SENDRECV(send\_start, send\_count, send\_type, dest, recv\_start, recv\_count, recv\_type, source, tag, comm, status)}
\funcarg{\IN}{send\_start}{initial address of send buffer (choice)}
\funcarg{\IN}{send\_count}{number of elements in send buffer (integer)}
\funcarg{\IN}{send\_type}{type of elements in send buffer}
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\OUT}{recv\_start}{initial address of receive buffer (choice)}
\funcarg{\IN}{recv\_count}{number of elements in receive buffer (integer)}
\funcarg{\IN}{recv\_type}{ type of elements in receive buffer}
\funcarg{\IN}{source}{rank of source (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{handle to communicator}
\funcarg{\OUT}{status}{status object}
\end{funcdef}
Executes a blocking send and receive operation. Both send and receive use the same tag value and the same communicator. However, the send buffer and receive buffer are disjoint, and may have different lengths and different datatypes.
\discuss{ For a shift it is more natural to have the same type and the same length both for send and receive; for a remote procedure call, the additional freedom makes sense. }
\begin{funcdef}{MPI\_EXCHANGE(start, count, datatype, dest, source, tag, comm, status)}
\funcarg{\IN}{start}{initial address of send and receive buffer (choice)}
\funcarg{\IN}{count}{number of elements in send and receive buffer (integer)}
\funcarg{\IN}{datatype}{type of elements in send and receive buffer}
\funcarg{\IN}{dest}{rank of destination (integer)}
\funcarg{\IN}{source}{rank of source (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{handle to communicator}
\funcarg{\OUT}{status}{status object}
\end{funcdef}
Executes a blocking send and receive; the same buffer is used both for the send and for the receive.
\discuss{ I omitted the ready and synchronous modes for send/receive and exchange. Also, I omitted nonblocking send/receive and exchange. Do we want any of those? }
The semantics of a send-receive operation is what would obtain if the caller forked two concurrent threads, one to execute the send, and one to execute the receive, followed by a join of these two threads. A send-receive cannot be implemented by a blocking send followed by a blocking receive, or by a blocking receive followed by a blocking send. Consider the following code, where two processes exchange messages:
\begin{verbatim}
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
   CALL MPI_SENDRECV(send_buff, count, MPI_REAL, 1, recv_buff, count, MPI_REAL, 1, tag, comm, status, ierr)
ELSE
   CALL MPI_SENDRECV(send_buff, count, MPI_REAL, 0, recv_buff, count, MPI_REAL, 0, tag, comm, status, ierr)
END IF
\end{verbatim}
If the send-receives are replaced either by a blocking send followed by a blocking receive, or by a blocking receive followed by a blocking send, then the code may deadlock. On the other hand, send-receive can be implemented using nonblocking sends and receives.
Note that some system buffering is required for a correct implementation of \func{ MPI\_EXCHANGE}. (Consider the last example, with send-receive replaced by exchange.)
\section{Null processes}
\label{sec:pt2pt-nullproc}
\discuss{ This section has not been reviewed by MPIF. }
In many instances, it is convenient to specify a ``dummy'' source or destination for communication.
This simplifies the code that is needed for dealing with boundaries, e.g., in the case of a non-circular shift done with calls to send-receive.
The special value \const{ MPI\_PROCNULL} can be used instead of a rank wherever a source or a destination parameter is required in a call. A communication with process \const{ MPI\_PROCNULL} has no effect: a send to \const{ MPI\_PROCNULL} succeeds and returns as soon as possible. A receive from \const{ MPI\_PROCNULL} succeeds and returns as soon as possible with no modifications to the receive buffer. When a receive with \mpiarg{ source} = \const{ MPI\_PROCNULL} is executed, the status object returns \mpiarg{ source} = \const{ MPI\_PROCNULL}, \mpiarg{ tag} = \const{ MPI\_ANY\_TAG} and \mpiarg{ count=0}.
\discuss{ It would be nice to have \const{ MPI\_PROCNULL} ``equivalenced'' to -1 and to group\_size, so that, when a shift is executed in a group, the first and last processes automatically communicate with the null process. One choice is to say that a source/destination with rank -1 or group\_size is always the null process. We then lose some error catching capability. Another choice is to say that this holds only for send-receive and exchange. A third choice is to live without this nicety. It means that one still has to execute conditional code (if myrank=0 then myneighbor=mpi\_procnull). But this code needs to be executed only when the communication pattern is set, and the conditional boundary code is not needed in the inner loop. }
\section{Derived datatypes}
\label{sec:pt2pt-datatype}
\discuss{ \bf This is a new section, and the material has not yet been discussed by MPIF. }
Up to now, all point to point communication has involved only contiguous buffers containing a sequence of elements of the same type. This is too constraining on two accounts: one often wants to pass messages that contain values with different datatypes (e.g., an integer count, followed by count real numbers); and one often wants to send noncontiguous data (e.g., a sub-block of a matrix). One solution is to provide functions that pack noncontiguous data into a contiguous buffer at the sender site and unpack it back at the receiver site. This has the disadvantage of requiring additional memory-to-memory copying at both sites, even when the communication subsystem has scatter-gather capabilities. Instead, MPI provides mechanisms to specify more general, mixed and noncontiguous communication buffers. It is up to the implementation to decide whether data should first be packed into a contiguous buffer before being transmitted, or whether it can be collected directly from where it resides.
The general mechanisms provided here allow one to transfer directly, without copying, objects of various shapes and sizes. It is not assumed that the MPI library is cognizant of the objects declared in the host language; thus, if one wants to transfer a structure, or an array section, it will be necessary to provide in MPI a definition of a communication buffer that mimics the definition of the structure or array section in question. These facilities can be used by library designers to define communication functions that can transfer objects defined in the host language -- by decoding their definitions as available in a symbol table or a dope vector. Such higher-level communication functions are not part of MPI.
More general communication buffers are specified by replacing the basic datatypes that have been used so far with derived datatypes that are constructed from basic datatypes using the constructors described in this section. These methods of constructing derived datatypes can be applied recursively.
A {\bf general datatype} is an opaque object that specifies two things:
\begin{itemize}
\item A sequence of basic types
\item A sequence of integer (byte) displacements
\end{itemize}
The displacements are not required to be positive, distinct, or in increasing order; therefore the order of items need not coincide with their order in store, and an item may appear more than once. Let
\[ Datatype = \{ (type_0,disp_0), ..., (type_{n-1}, disp_{n-1}) \} , \]
be such a general datatype, where $type_i$ are basic types, and $disp_i$ are displacements. This general datatype, together with a base address \mpiarg{ start}, specifies a communication buffer: the communication buffer that consists of $n$ entries, where the $i$-th entry is at address $\mpiarg{ start} + disp_i$ and has type $type_i$. A message assembled from this communication buffer will contain a sequence of $n$ values, where value $i$ has type $type_i$. The general datatype determines the {\bf type signature} of the message (i.e., the number of values carried by the message and the basic type of each value).
We can use a handle to a general datatype as an argument in a send or receive operation, in place of a basic datatype argument. The operation \func{ MPI\_SEND(start, 1, datatype,...)} will use the send buffer defined by the base address \mpiarg{ start} and the general datatype associated with \mpiarg{ datatype}; it will generate a message with the type signature determined by the \mpiarg{ datatype} argument. \func{ MPI\_RECV(start, 1, datatype,...)} will use the receive buffer defined by the base address \mpiarg{ start} and the general datatype associated with \mpiarg{ datatype}. General datatypes can be used in all send and receive operations. We address later, in Section~\ref{subsec:pt2pt-datatypeuse}, the case where the second argument \mpiarg{ count} has value $> 1$.
The predefined basic datatypes presented in Section~\ref{subsec:pt2pt-messagedata} are particular cases of a general datatype. Thus, \type{ MPI\_INT} is a predefined handle to the datatype $\{ ({\tt int}, 0) \}$, with one entry of type {\tt int} and displacement zero. And similarly for all other basic datatypes.
The {\bf extent} of a datatype is defined to be the span from the first byte to the last byte occupied by entries in this datatype, rounded up to satisfy alignment requirements. I.e., if
\[ Datatype = \{ (type_0,disp_0), ..., (type_{n-1}, disp_{n-1}) \} , \]
$disp_r = \min_j disp_j$ and $disp_s = \max_j disp_j$, then
\begin{equation}
\label{eq:pt2pt-extent}
extent(Datatype) = disp_s + sizeof(type_s) - disp_r ;
\end{equation}
if, furthermore, $type_i$ requires alignment to a byte address that is a multiple of $k_i$, then $extent(Datatype)$ is rounded up to the next multiple of $\max_i k_i$.
Example: Assume that
\[ \label{eq:pt2pt-type1} Type1 = \{ ({\tt double},0), ({\tt char}, 8) \} \]
(a {\tt double} at displacement zero, followed by a {\tt char} at displacement eight). Assume, furthermore, that doubles have to be strictly aligned at addresses that are multiples of eight. Then, the extent of this datatype is 16 (9 rounded up to the next multiple of 8). A datatype that consists of a character immediately followed by a double will also have an extent of 16.
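To spell out the arithmetic behind the last claim (assuming the character sits at displacement zero and the double immediately after it at displacement one, with 8-byte alignment for doubles as above), Eq.~\ref{eq:pt2pt-extent} gives
\[ \{ ({\tt char}, 0), ({\tt double}, 1) \} : \quad extent = 1 + 8 - 0 = 9 , \]
which is then rounded up to the next multiple of 8, yielding an extent of 16.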
\subsection{Datatype constructors}
\label{subsec:pt2pt-datatypeconst}
The simplest datatype constructor is \mpifunc{MPI\_TYPE\_CONTIGUOUS}, which allows replication of a datatype into contiguous locations.
\begin{funcdef}{MPI\_TYPE\_CONTIGUOUS(count, oldtype, newtype)}
\funcarg{\IN}{count}{replication count (integer)}
\funcarg{\IN}{oldtype}{handle to input datatype object}
\funcarg{\OUT}{newtype}{handle to output datatype object}
\end{funcdef}
\mpiarg{ newtype} is the datatype obtained by concatenating \mpiarg{ count} copies of \mpiarg{ oldtype}. Padding may be added to satisfy alignment requirements of the underlying architecture.
Example: let \mpiarg{ oldtype} be a handle to the datatype
\[ \{ ({\tt double}, 0), ({\tt char}, 8) \} , \]
with extent 16, and let $\mpiarg{ count} = 3$. The resulting datatype returned in \mpiarg{ newtype} is
\[ \{ ({\tt double}, 0), ({\tt char}, 8), ({\tt double}, 16), ({\tt char}, 24), ({\tt double}, 32), ({\tt char}, 40) \} ; \]
i.e., alternating {\tt double} and {\tt char} elements, with displacements $0, 8, 16, 24, 32, 40$.
In general, assume that \mpiarg{ oldtype} is a handle to the datatype
\[ \{ (type_0,disp_0), ..., (type_{n-1}, disp_{n-1}) \} , \]
with extent $extent$. Then \mpiarg{ newtype} is a handle to the datatype with ${\tt count} \cdot n$ entries defined by:
\[ \{ (type_0, disp_0), ..., (type_{n-1}, disp_{n-1}), (type_0, disp_0 +extent), ... ,(type_{n-1}, disp_{n-1} + extent) ,..., \]
\[ (type_0, disp_0 +extent \cdot({\tt count}-1) ), ... , (type_{n-1} , disp_{n-1} + extent \cdot ({\tt count}-1)) \} . \]
\mpifunc{MPI\_TYPE\_VECTOR} is a more general constructor that allows replication of a datatype into locations that consist of equally spaced contiguous blocks of equal size -- block sizes and block displacements are all multiples of the old type extent.
\begin{funcdef}{MPI\_TYPE\_VECTOR( count, oldtype, stride, blocklen, newtype)}
\funcarg{\IN}{count}{number of blocks (integer)}
\funcarg{\IN}{oldtype}{handle to input datatype object}
\funcarg{\IN}{stride}{number of elements between start of each block (integer)}
\funcarg{\IN}{blocklen}{number of elements in each block (positive integer)}
\funcarg{\OUT}{newtype}{handle to output datatype object}
\end{funcdef}
Example: Assume, again, that \mpiarg{ oldtype} is a handle to the type
\[ \{ ({\tt double}, 0), ({\tt char}, 8) \} , \]
with extent 16. A call to \mpifunc{MPI\_TYPE\_VECTOR( 2, oldtype, 4, 3, newtype)} will create the datatype
\[ \{ ({\tt double}, 0), ({\tt char}, 8), ({\tt double}, 16), ({\tt char}, 24), ({\tt double}, 32), ({\tt char}, 40), \]
\[ ({\tt double}, 64), ({\tt char}, 72), ({\tt double}, 80), ({\tt char}, 88), ({\tt double}, 96), ({\tt char}, 104) \} : \]
two blocks with three copies each of the old type, starting 4*16 bytes apart. A call to \mpifunc{MPI\_TYPE\_VECTOR( 3, oldtype, -2, 1, newtype)} will create the datatype
\[ \{ ({\tt double}, 0), ({\tt char}, 8), ({\tt double}, -32), ({\tt char}, -24), ({\tt double}, -64), ({\tt char}, -56) \} . \]
In general, assume that \mpiarg{ oldtype} is a handle to the datatype
\[ \{ (type_0,disp_0), ..., (type_{n-1}, disp_{n-1}) \} , \]
with extent $extent$. The newly created datatype is a sequence of length ${\tt count} \cdot {\tt blocklen} \cdot n$ with entries:
\[ \{ (type_0, disp_0), ... , (type_{n-1} , disp_{n-1}), \]
\[ (type_0 ,disp_0 + extent) , ... , (type_{n-1} , disp_{n-1} + extent ), ..., \]
\[ (type_0 , disp_0 + ({\tt blocklen} -1) \cdot extent ) , ... , (type_{n-1} , disp_{n-1} + ({\tt blocklen} -1) \cdot extent ) , \]
\[ (type_0 ,disp_0 + {\tt stride} \cdot extent ) , ... , (type_{n-1} , disp_{n-1} + {\tt stride} \cdot extent ), ... , \]
\[ (type_0 , disp_0 + ({\tt stride} + {\tt blocklen} -1) \cdot extent ) , ... , \]
\[ (type_{n-1}, disp_{n-1} + ({\tt stride} + {\tt blocklen} -1) \cdot extent ) , ... , \]
\[ (type_0 ,disp_0 + {\tt stride} \cdot ({\tt count}-1) \cdot extent ) , ... , \]
\[ (type_{n-1} , disp_{n-1} + {\tt stride} \cdot ({\tt count} -1) \cdot extent ) , ... , \]
\[ (type_0 , disp_0 + ({\tt stride} \cdot ({\tt count} -1) + {\tt blocklen} -1) \cdot extent ) , ... , \]
\[ (type_{n-1}, disp_{n-1} + ({\tt stride} \cdot ({\tt count} -1) + {\tt blocklen} -1) \cdot extent ) \} \]
\discuss{ I changed \mpiarg{ count} from being the number of entries (or replicates), as in the old draft, to being the number of blocks. The reason is that in the next calls I allow block lengths to vary; it is easy to say what it means to have $n$ blocks of sizes $b_0 , ... , b_{n-1}$; it is harder to explain what it means to have $n$ elements filling up successive positions in blocks of sizes $b_0 , b_1 , ...$. We lose the ability to use \type{MPI\_TYPE\_VECTOR} to create a partially filled last block (the more general \type{MPI\_TYPE\_INDEXED} function is needed); we gain consistency with the next calls, and convenience in these calls. Opinions? }
A call to \mpifunc{MPI\_TYPE\_CONTIGUOUS(count, oldtype, newtype)} is equivalent to a call to \mpifunc{MPI\_TYPE\_VECTOR(count, oldtype, 1, 1, newtype)}.
The function \mpifunc{MPI\_TYPE\_HVECTOR} is identical to \mpifunc{MPI\_TYPE\_VECTOR}, except that \mpiarg{ stride} is given in bytes, rather than in elements. ({\tt H} is used for ``heterogeneous''.) The use of both types of vector constructors is illustrated in Section~\ref{subsec:pt2pt-examples}.
\begin{funcdef}{MPI\_TYPE\_HVECTOR( count, oldtype, stride, blocklen, newtype)}
\funcarg{\IN}{count}{number of blocks (integer)}
\funcarg{\IN}{oldtype}{handle to input datatype object}
\funcarg{\IN}{stride}{number of bytes between start of each block (integer)}
\funcarg{\IN}{blocklen}{number of elements in each block (positive integer)}
\funcarg{\OUT}{newtype}{handle to output datatype object}
\end{funcdef}
\mpifunc{MPI\_TYPE\_INDEXED} is a more general function that allows replication of an old datatype into a sequence of contiguous blocks, where block lengths and block displacements can all differ -- but are all multiples of the old type extent.
\begin{funcdef}{MPI\_TYPE\_INDEXED( count, oldtype, array\_of\_indices, array\_of\_blocklen, newtype)}
\funcarg{\IN}{count}{number of blocks (integer)}
\funcarg{\IN}{oldtype}{handle to input datatype object}
\funcarg{\IN}{array\_of\_indices}{displacement for each block, in multiples of \mpiarg{ oldtype} extent (array of integer)}
\funcarg{\IN}{array\_of\_blocklen}{number of elements per block (array of integer)}
\funcarg{\OUT}{newtype}{handle to output datatype object}
\end{funcdef}
Example: let \mpiarg{ oldtype} be a handle to the type
\[ \{ ({\tt double}, 0), ({\tt char}, 8) \} , \]
with extent 16. Let {\tt I = (64, 0)} and let {\tt B = (3, 1)}.
A call to \mpifunc{MPI\_TYPE\_INDEXED(2, oldtype, I, B, newtype)} returns in \mpiarg{ newtype} a handle to the type
\[ \{ ({\tt double}, 64), ({\tt char}, 72), ({\tt double}, 80), ({\tt char}, 88), ({\tt double}, 96), ({\tt char}, 104), \]
\[ ({\tt double}, 0), ({\tt char}, 8) \} : \]
three copies of the old type starting at displacement 64, and one copy starting at displacement 0.
In general, assume that \mpiarg{ oldtype} is a handle to the datatype
\[ \{ (type_0,disp_0), ..., (type_{n-1}, disp_{n-1}) \} , \]
with extent {\em extent}. Let \mpiarg{ B} be the \mpiarg{ array\_of\_blocklen} parameter and \mpiarg{ I} be the \mpiarg{ array\_of\_indices} parameter. The newly created type returned in \mpiarg{ newtype} is the sequence of $n \cdot \sum_{i=0}^{{\tt count}-1} \mpiarg{ B[i]}$ entries
\[ \{ (type_0, disp_0 + \mpiarg{ I[0]} \cdot extent ) , ... , (type_{n-1} , disp_{n-1} + \mpiarg{ I[0]} \cdot extent ) , ... , \]
\[ (type_0 , disp_0 + (\mpiarg{ I[0]} + \mpiarg{ B[0]} -1) \cdot extent) ,..., (type_{n-1} , disp_{n-1} + (\mpiarg{ I[0]} +\mpiarg{ B[0]} -1) \cdot extent ) , ..., \]
\[ (type_0, disp_0 + \mpiarg{ I[count-1]} \cdot extent ) , ... , (type_{n-1} , disp_{n-1} + \mpiarg{ I[count-1]} \cdot extent ) , ... , \]
\[ (type_0 , disp_0 + (\mpiarg{ I[count-1]} + \mpiarg{ B[count-1]} -1) \cdot extent) ,..., \]
\[ (type_{n-1} , disp_{n-1} + (\mpiarg{ I[count-1]} +\mpiarg{ B[count-1]} -1) \cdot extent ) \} . \]
A call to \mpifunc{MPI\_TYPE\_VECTOR(count, oldtype, stride, blocklen, newtype)} is equivalent to a call to \mpifunc{MPI\_TYPE\_INDEXED(count, oldtype, I, B, newtype)} where
\[ \mpiarg{ I[j]} = j \cdot \mpiarg{ stride} \ , j=0 ,..., {\tt count} -1 , \]
and
\[ \mpiarg{ B[j]} = \mpiarg{ blocklen} \ , j=0 ,..., {\tt count} -1 . \]
The function \mpifunc{MPI\_TYPE\_HINDEXED} is identical to \mpifunc{MPI\_TYPE\_INDEXED}, except that block displacements in \mpiarg{ array\_of\_indices} are specified in bytes, rather than in multiples of the old type extent.
\begin{funcdef}{MPI\_TYPE\_HINDEXED( count, oldtype, array\_of\_indices, array\_of\_blocklen, newtype)}
\funcarg{\IN}{count}{number of blocks (integer)}
\funcarg{\IN}{oldtype}{handle to input datatype object}
\funcarg{\IN}{array\_of\_indices}{byte displacement of each block (array of integer)}
\funcarg{\IN}{array\_of\_blocklen}{number of elements in each block (array of integer)}
\funcarg{\OUT}{newtype}{handle to output datatype object}
\end{funcdef}
\mpifunc{MPI\_TYPE\_STRUCT} is the most general constructor: it further generalizes the previous one in that it allows each block to consist of replications of a different datatype.
\begin{funcdef}{MPI\_TYPE\_STRUCT(count, array\_of\_types, array\_of\_indices, array\_of\_blocklen, newtype)}
\funcarg{\IN}{count}{number of blocks (integer)}
\funcarg{\IN}{array\_of\_types}{type of elements in each block (array of datatype objects)}
\funcarg{\IN}{array\_of\_indices}{byte displacement of each block (array of integer)}
\funcarg{\IN}{array\_of\_blocklen}{number of elements in each block (array of integer)}
\funcarg{\OUT}{newtype}{handle to output datatype object}
\end{funcdef}
Example: Let \mpiarg{ type1} be a handle to the type
\[ \{ ({\tt double}, 0), ({\tt char}, 8) \} , \]
with extent 16. Let {\tt I = (0, 16, 26)}, {\tt B = (2, 1, 3)} and {\tt T = (MPI\_FLOAT, type1, MPI\_BYTE)}.
Then a call to \mpifunc{MPI\_TYPE\_STRUCT(3, T, I, B, newtype)} returns in \mpiarg{ newtype} a handle to the datatype
\[ \{ ({\tt float}, 0), ({\tt float}, 4), ({\tt double}, 16), ({\tt char}, 24), ({\tt byte}, 26), ({\tt byte}, 27), ({\tt byte}, 28) \} : \]
two copies of \type{ MPI\_FLOAT} starting at 0, followed by one copy of \mpiarg{ type1} starting at 16, followed by three copies of \type{ MPI\_BYTE}, starting at 26. (We assume that a \type{ float} occupies four bytes.)
In general, let \mpiarg{ T} be the \mpiarg{ array\_of\_types} parameter, where \mpiarg{ T[i]} is a handle to
\[ type_i = \{ (type_0^i , disp_0^i ) , ... , (type_{n_i}^i , disp_{n_i}^i ) \} , \]
with extent $extent_i$. Let \mpiarg{ B} be the \mpiarg{ array\_of\_blocklen} parameter and \mpiarg{ I} be the \mpiarg{ array\_of\_indices} parameter. Then the newly created datatype is the sequence of $\sum_{i=0}^{{\tt count}-1}\mpiarg{ B[i]} \cdot (n_i +1)$ entries
\[ \{ (type_0^0 , disp_0^0 +\mpiarg{ I[0]}) , ... , (type_{n_0}^0 , disp_{n_0}^0 + \mpiarg{ I[0]} ) , ... , \]
\[ (type_0^0 , disp_0^0 + \mpiarg{ I[0]} + (\mpiarg{ B[0]}-1) \cdot extent_0 ) , ... , (type_{n_0}^0 , disp_{n_0}^0 + \mpiarg{ I[0]} + (\mpiarg{ B[0]}-1) \cdot extent_0 ) , ... , \]
\[ (type_0^{{\tt count}-1} , disp_0^{{\tt count}-1} +\mpiarg{ I[count-1]}) , ... , (type_{n_{{\tt count}-1}}^{{\tt count}-1} , disp_{n_{{\tt count}-1}}^{{\tt count}-1} + \mpiarg{ I[count-1]} ) , ... , \]
\[ (type_0^{{\tt count}-1} , disp_0^{{\tt count}-1} + \mpiarg{ I[count-1]} + (\mpiarg{ B[count-1]}-1) \cdot extent_{{\tt count}-1} ) , ... , \]
\[ (type_{n_{{\tt count}-1}}^{{\tt count}-1} , disp_{n_{{\tt count}-1}}^{{\tt count}-1} + \mpiarg{ I[count-1]} + (\mpiarg{ B[count-1]}-1) \cdot extent_{{\tt count}-1} ) \} \]
A call to \mpifunc{MPI\_TYPE\_HINDEXED( count, oldtype, I, B, newtype)} is equivalent to a call to \mpifunc{MPI\_TYPE\_STRUCT( count, T, I, B, newtype)}, where each entry of \mpiarg{ T} is equal to \mpiarg{ oldtype}.
\discuss{ I am open to suggestions for better names for these functions. }
\subsection{Additional functions}
The displacements in a general datatype are relative to some start address. {\bf Absolute addresses} can be substituted for these displacements: we treat them as displacements relative to ``address zero'', the start of the address space. This initial address zero is indicated by the constant \const{ MPI\_BOTTOM}. Thus, a datatype can specify the absolute addresses of the entries in the communication buffer, in which case the \mpiarg{ start} parameter is passed the value \const{ MPI\_BOTTOM}.
The address of a location in memory can be found by invoking the function \mpifunc{MPI\_ADDRESS}.
\begin{funcdef}{MPI\_ADDRESS(location, address)}
\funcarg{\IN}{location}{location in caller memory (choice)}
\funcarg{\OUT}{address}{address of location (integer)}
\end{funcdef}
Returns the (byte) address of \mpiarg{ location}.
Another useful auxiliary function is \mpifunc{MPI\_TYPE\_EXTENT}, which returns the extent of a datatype -- where extent is as defined in Eq.~\ref{eq:pt2pt-extent}.
\begin{funcdef}{MPI\_TYPE\_EXTENT(datatype, extent)}
\funcarg{\IN}{datatype}{handle to datatype object}
\funcarg{\OUT}{extent}{datatype extent (integer)}
\end{funcdef}
As defined now, the extent of a datatype goes from its first occupied byte to its last occupied byte. It is often convenient to define a datatype that has a ``hole'' at its beginning and/or its end, or a datatype with entries that extend beyond the datatype ``extent''.
This allows one to use the functions \func{ MPI\_TYPE\_CONTIGUOUS}, \func{ MPI\_TYPE\_VECTOR}, or \func{ MPI\_TYPE\_INDEXED} to replicate a more general pattern; examples of such usage are provided in Section~\ref{subsec:pt2pt-examples}. To achieve this, we add two additional ``pseudo-datatypes'', \type{ MPI\_LB} and \type{ MPI\_UB}, that can be used, respectively, to mark the lower bound or the upper bound of a datatype. These pseudo-datatypes occupy no space ($extent(\type{ MPI\_LB}) = extent(\type{ MPI\_UB}) =0$). They do not affect the signature of a message created with this datatype. However, they do affect the definition of the extent of a datatype and, therefore, affect the outcome of a replication of this datatype by a datatype constructor.
Example: Let {\tt I = (-3, 0, 6)}, {\tt T = (MPI\_LB, MPI\_INT, MPI\_UB)}, and {\tt B = (1, 1, 1)}. Then a call to \mpifunc{MPI\_TYPE\_STRUCT(3, T, I, B, type1)} creates a new datatype that has an extent of 9 (from -3 to 6), and contains an integer at displacement 0. This is the datatype defined by the sequence {\tt \{(lb, -3), (int, 0), (ub, 6)\} }. If this type is replicated twice by a call to \mpifunc{MPI\_TYPE\_CONTIGUOUS(2, type1, type2)} then the newly created type can be described by the sequence {\tt \{(lb, -3), (int, 0), (int, 9), (ub, 15)\} }. (Entries of type {\tt lb} or {\tt ub} can be deleted if they are not at the endpoints of the datatype.)
In general, if
\[ Datatype = \{ (type_0 , disp_0 ) , ... , (type_{n-1} , disp_{n-1}) \} , \]
then the {\bf lower bound} of $Datatype$ is defined to be
\[ lb(Datatype) = \left\{ \begin{array}{ll} \min_j disp_j & \mbox{if no entry has basic type $type_j = {\tt lb}$} \\ \min \{ disp_j \ : \ type_j = {\tt lb} \} & \mbox{otherwise} \end{array} \right. \]
Similarly, the {\bf upper bound} of $Datatype$ is defined to be
\[ ub(Datatype) = \left\{ \begin{array}{ll} \max_j \{ disp_j + sizeof(type_j) \} & \mbox{if no entry has basic type $type_j = {\tt ub}$} \\ \max \{ disp_j \ : \ type_j = {\tt ub} \} & \mbox{otherwise} \end{array} \right. \]
Then
\[ extent(Datatype) = ub(Datatype) - lb(Datatype) . \]
Furthermore, if $type_i$ requires alignment to an address that is a multiple of $k_i$, then $extent(Datatype)$ is further rounded up to the nearest multiple of $\max_i k_i$. The formal definitions given for the various datatype constructors apply now, with the amended definition of {\bf extent}.
A datatype object has to be {\bf committed} before it can be used in a communication. A committed datatype object cannot be used to build new derived datatypes. The system may use a different internal representation for committed datatype objects so as to facilitate communication, e.g., change from a compacted representation to a flat representation of the datatype, and select the most convenient transfer mechanism.
\begin{funcdef}{MPI\_TYPE\_COMMIT(intype, outtype)}
\funcarg{\IN}{intype}{handle to datatype object to be committed}
\funcarg{\OUT}{outtype}{handle to committed datatype object}
\end{funcdef}
There is no need to commit basic datatypes; they are ``pre-committed''. They are the exception to the rule that committed datatypes cannot be used to form derived datatypes.
\discuss{ 1. Do we really need a commit here? 2. Do we want one \INOUT argument, or two distinct arguments? 3. Are committed and uncommitted objects of the same type? (The answer must be yes, if it is an \INOUT argument.) 4.
If we have only one \INOUT argument, does a commit change the datatype object, or does it create a new object and return a pointer to it in \mpiarg{ datatype}? I.e., if we have another pointer to the same uncommitted object, can we continue to use it?
We prohibit the use of committed datatypes in the construction of new datatypes in order to allow the commit function to change a datatype representation into one suitable for communication, and perhaps unsuitable for recursive datatype construction. This restriction could be relaxed by assuming that the ``uncommitted'' datatype representation is kept alongside the ``committed'' one. With such an implementation \func{ MPI\_TYPE\_COMMIT} could have only one argument, and only one type of datatype object is needed. Such a representation could also simplify the implementation of a datatype ``flatten'' function (e.g., for debugging). }
\begin{funcdef}{MPI\_TYPE\_FREE(datatype)}
\funcarg{\INOUT}{datatype}{handle to datatype object}
\end{funcdef}
Marks the datatype object associated with \mpiarg{ datatype} for deallocation. The object will be deallocated after any pending communication that uses this object completes, at which point \mpiarg{ datatype} becomes null.
\discuss { If we have different datatypes for committed and uncommitted, then we have 2 free functions (or never free uncommitted datatypes). }
\subsection{Use of general datatypes in communication}
\label{subsec:pt2pt-datatypeuse}
Handles to derived datatypes can be passed to a communication call wherever a datatype argument is required. A call of the form \mpifunc{MPI\_SEND(start, count, datatype, ...)}, where ${\tt count} =1$, uses the send buffer defined by \mpiarg{ start} and \mpiarg{ datatype}; i.e., if \mpiarg{ datatype} is a handle to the datatype defined by the sequence $\{ (type_0 , disp_0) , ... , (type_{n-1}, disp_{n-1} ) \}$ (with empty entries deleted), then the send buffer consists of the entries of types $type_i$, at addresses ${\tt start} + disp_i$, $i=0, ..., n-1$. The message generated consists of a sequence of $n$ entries, of types $type_0 , ... , type_{n-1}$.
The receive buffer for a call \mpifunc{MPI\_RECV( start, count, datatype , ...)} is similarly defined, for {\tt count =1}. The same applies to all other send and receive operations that have a \mpiarg{ datatype} parameter. A datatype used in a receive operation may not have overlapping entries; otherwise the outcome of the receive operation is undefined.
If a send operation using a general datatype \type{ type1} is matched by a receive operation using a general datatype \type{ type2}, then the sequence of (nonempty) basic type entries defined by each datatype should be identical. Furthermore, each entry in the send buffer should be of a type that matches the type of the corresponding entry in \type{ type1}, and each entry in the receive buffer should be of a type that matches the type of the corresponding entry in \type{ type2} (with type matching defined as in Section~\ref{sec:pt2pt-typematch}). Thus, type matching is defined by the ``flat'' structure of a datatype, and does not depend on its defining expression.
Example:
\begin{verbatim}
...
MPI_TYPE_CONTIGUOUS( 2, MPI_REAL, type2)
MPI_TYPE_CONTIGUOUS( 4, MPI_REAL, type4)
MPI_TYPE_CONTIGUOUS( 2, type2, type22)
\end{verbatim}
A message sent with datatype \type{ type4} can be received with datatype \type{ type22}.
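To illustrate the flat matching rule with the datatypes just defined, the following sketch has the sender describe four reals with \type{ type4} while the receiver describes the same four reals with \type{ type22}. The Fortran bindings with a trailing {\tt ierr}, a blocking receive with argument list {\tt (start, count, datatype, source, tag, comm, status)}, and the commit step described in the previous subsection are assumed; the committed handles {\tt stype4} and {\tt stype22} are introduced here only for the illustration.
\begin{verbatim}
C process 0 sends four reals described by type4
CALL MPI_TYPE_COMMIT(type4, stype4, ierr)
CALL MPI_SEND(a, 1, stype4, 1, tag, comm, ierr)
C process 1 receives the same four reals, described by type22
CALL MPI_TYPE_COMMIT(type22, stype22, ierr)
CALL MPI_RECV(b, 1, stype22, 0, tag, comm, status, ierr)
\end{verbatim}
Since the two type signatures are identical (four reals), the communication is legal.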
Type matching requirements imply that (nonempty) entries of different types in a datatype cannot overlap, unless they coincide and have the same basic type, or one of them has basic type \type{ byte}.
The length of a message received (i.e., the value returned when the receive status object is decoded by the function \mpifunc{MPI\_GET\_COUNT}) is equal to the number of basic entries received. Consider again the datatypes defined in the last example. If a message is sent using either datatype \type{ type22} or \type{ type4} and is received using either of these two datatypes, then the length of the message received will be four. In the general case the basic entries need not all be of the same type.
\discuss{ An alternative definition that finds favor with many members of the subcommittee is that the length of the message received is measured in terms of subunits of the receiving datatype. Thus, if four reals are received using \type{ type4}, the length of the received message is four; if they are received using \type{ type22}, the length of the received message is two. With such a definition, it is required that the arriving message fill an integer number of subunits. The rationale is that these are the units that are meaningful to the programmer. Consider the case of an array of structures that is passed in a message (example 4.1 below). Then one might want to enforce the restriction that only complete structures are passed; and it is natural to return in the receive status the number of structures received, rather than this number multiplied by 14, the number of entries per structure. Consider, on the other hand, the case of a sequence of reals that is received in a 3D array section (defined as in the first example below). We may want to receive an arbitrary number of reals, and the 1D and 2D subsections may not carry any significance for the purpose of this communication. Opinions? }
A call of the form \mpifunc{MPI\_SEND(start, count, datatype , ...)}, where ${\tt count} > 1$, is interpreted as if the call was passed a new datatype which is the concatenation of \mpiarg{ count} copies of \mpiarg{ datatype}, such as would be produced by a call to \mpifunc{MPI\_TYPE\_CONTIGUOUS(count, datatype, newtype)}, and a \mpiarg{ count} argument of one. The same holds true for the other communication functions that have a datatype argument.
\subsection{Examples}
\label{subsec:pt2pt-examples}
The following examples illustrate the use of derived datatypes.
First example: extract a section of a 3D matrix.
\begin{verbatim}
REAL a(100,100,100), e(9,9,9)
INTEGER oneslice, twoslice, threeslice, finaltype, sizeofreal
C extract the section a(1:17:2, 3:11, 2:10)
C and store it in e(*,*,*).
CALL MPI_TYPE_EXTENT( MPI_REAL, sizeofreal, ierr)
C create datatype for a 1D section
CALL MPI_TYPE_VECTOR( 9, MPI_REAL, 2, 1, oneslice, ierr)
C create datatype for a 2D section
CALL MPI_TYPE_HVECTOR( 9, oneslice, 100*sizeofreal, 1, twoslice, ierr)
C create datatype for the entire section
CALL MPI_TYPE_HVECTOR( 9, twoslice, 100*100*sizeofreal, 1, threeslice, ierr)
CALL MPI_TYPE_COMMIT( threeslice, finaltype, ierr)
CALL MPI_SENDRECV(a(1,3,2), 1, finaltype, MPI_ME, e, 9*9*9, MPI_REAL, MPI_ME, 0, MPI_ALL, status, ierr)
\end{verbatim}
Second example: Copy the (strictly) lower triangular part of a matrix.
\begin{verbatim}
REAL a(100,100), b(100,100)
INTEGER index(100), blocklen(100), ltype, finaltype
C copy lower triangular part of array a
C onto lower triangular part of array b
C compute start and size of each column
DO i=1, 100
  index(i) = 100*(i-1) + i
  blocklen(i) = 100-i
END DO
C create datatype for lower triangular part
CALL MPI_TYPE_INDEXED( 100, MPI_REAL, index, blocklen, ltype, ierr)
CALL MPI_TYPE_COMMIT(ltype, finaltype, ierr)
CALL MPI_SENDRECV( a, 1, finaltype, MPI_ME, b, 1, finaltype, MPI_ME, 0, MPI_ALL, status, ierr)
\end{verbatim}
Third example: Transpose a matrix.
\begin{verbatim}
REAL a(100,100), b(100,100)
INTEGER row, xpose, finaltype, sizeofreal
C transpose matrix a onto b
CALL MPI_TYPE_EXTENT( MPI_REAL, sizeofreal, ierr)
C create datatype for one row
CALL MPI_TYPE_VECTOR( 100, MPI_REAL, 100, 1, row, ierr)
C create datatype for matrix in row-major order
CALL MPI_TYPE_HVECTOR( 100, row, sizeofreal, 1, xpose, ierr)
CALL MPI_TYPE_COMMIT( xpose, finaltype, ierr)
C send matrix in row-major order and receive in column major order
CALL MPI_SENDRECV( a, 1, finaltype, MPI_ME, b, 100*100, MPI_REAL, MPI_ME, 0, MPI_ALL, status, ierr)
\end{verbatim}
Another approach to the transpose problem:
\begin{verbatim}
REAL a(100,100), b(100,100)
INTEGER index(2), blen(2), type(2), row, row1, finaltype, sizeofreal
C transpose matrix a onto b
CALL MPI_TYPE_EXTENT( MPI_REAL, sizeofreal, ierr)
C create datatype for one row
CALL MPI_TYPE_VECTOR( 100, MPI_REAL, 100, 1, row, ierr)
C create datatype for one row, with the extent of one real number
index(1) = 0
index(2) = sizeofreal
type(1) = row
type(2) = MPI_UB
blen(1) = 1
blen(2) = 1
CALL MPI_TYPE_STRUCT( 2, type, index, blen, row1, ierr)
CALL MPI_TYPE_COMMIT( row1, finaltype, ierr)
C send 100 rows and receive in column major order
CALL MPI_SENDRECV( a, 100, finaltype, MPI_ME, b, 100*100, MPI_REAL, MPI_ME, 0, MPI_ALL, status, ierr)
\end{verbatim}
Fourth example: manipulate an array of structures.
\begin{verbatim}
struct Partstruct
   {
   int    index;   /* particle type */
   double d[6];    /* particle coordinates */
   char   b[7];    /* some additional information */
   };

struct Partstruct particle[1000];

/* build datatype describing first array entry */

MPI_datatype Particletype;
MPI_datatype stype[3] = {MPI_int, MPI_double, MPI_char};
int          sblock[3] = {1, 6, 7};
int          sindex[3];

MPI_address( (void *)particle, sindex);
MPI_address( (void *)particle[0].d, sindex+1);
MPI_address( (void *)particle[0].b, sindex+2);
MPI_type_struct( 3, stype, sindex, sblock, Particletype);

/* Particletype describes first array entry -- using absolute addresses */

/* 4.1: send the entire array */

MPI_datatype Type1;

MPI_type_commit( Particletype, Type1);
MPI_send( MPI_bottom, 1000, Type1, dest, tag, comm);

/* 4.2: send the entries with index value zero,
        preceded by the number of such entries */

MPI_datatype Zparticles;  /* datatype describing all particles
                             with index zero (needs to be recomputed
                             if indices change) */
MPI_datatype Ztype, Type2;
int zindex[1000], zblock[1000], i, j, k;
int zzblock[2] = {1,1};
MPI_datatype zztype[2];
int zzindex[2];

/* compute indices of particles with index zero */
j = 0;
for(i=0; i < 1000; i++)
   if (particle[i].index==0)
      {
      zindex[j] = i;
      zblock[j] = 1;
      j++;
      }

/* create datatype for particles with index zero */
MPI_type_indexed( j, Particletype, zindex, zblock, Zparticles);

/* prepend particle count */
MPI_address((void *)&j, zzindex);
zzindex[1] = MPI_bottom;
zztype[0] = MPI_int;
zztype[1] = Zparticles;
MPI_type_struct(2, zztype, zzindex, zzblock, Ztype);

MPI_type_commit( Ztype, Type2);
MPI_send( MPI_bottom, 1, Type2, dest, tag, comm);

/* A possibly more efficient way of defining Zparticles:
   consecutive particles with index zero are handled as one block */

j=0;
for (i=0; i < 1000; i++)
   if (particle[i].index==0)
      {
      for (k=i+1; (k < 1000)&&(particle[k].index == 0) ; k++);
      zindex[j] = i;
      zblock[j] = k-i;
      j++;
      i = k;
      }
MPI_type_indexed( j, Particletype, zindex, zblock, Zparticles);

/* 4.3: send the first two coordinates of all entries */

MPI_datatype Allpairs;  /* datatype for all pairs of coordinates */
MPI_datatype Type3;
int sizeofentry;

MPI_type_extent( Particletype, sizeofentry);
/* sizeofentry can also be computed by subtracting the address
   of particle[0] from the address of particle[1] */

MPI_type_hvector( 1000, MPI_double, sizeofentry, 2, Allpairs);
MPI_type_commit( Allpairs, Type3);
MPI_send( particle[0].d, 1, Type3, dest, tag, comm);

/* an alternative solution to 4.3 */

MPI_datatype Onepair;  /* datatype for one pair of coordinates,
                          with the extent of one particle entry */
MPI_datatype Type4;
int indexp[3];
MPI_datatype typep[3] = {MPI_lb, MPI_double, MPI_ub};
int blockp[3] = {1, 2, 1};

MPI_address( (void *)particle, indexp);
MPI_address( (void *)particle[0].d, indexp+1);
MPI_address( (void *)(particle+1), indexp+2);
MPI_type_struct( 3, typep, indexp, blockp, Onepair);
MPI_type_commit( Onepair, Type4);
MPI_send( MPI_bottom, 1000, Type4, dest, tag, comm);
\end{verbatim}
\discuss{ The examples carry two implicit assumptions on the way padding is done in an array of structures. First, that padding is done in the same way in the successive array entries, so that all entries have an identical memory layout. E.g., we assume that the displacement from {\tt particle[i].b} to {\tt particle[j].b} is the same as the displacement from {\tt particle[i].d} to {\tt particle[j].d}. Second, that the amount of padding added is, in some sense, minimal.
I.e., if the most stringent alignment requirement for a basic component of a structure is alignment to multiples of $2^k$, then no contiguous padding space has size greater than or equal to $2^k$. This assumption is implicit in the definition of {\bf extent}. I don't think either assumption is mandated by the C or Fortran standards, although I would be very surprised to see an implementation that does not fulfill these conditions. If I am wrong (e.g., if some compilers always align structures to double word boundaries, even if they have no double word components), then the definition of \mpiarg{ extent} should be revisited. }
\subsection{Correct use of addresses}
\label{subsec:pt2pt-segmented}
Successively declared variables in C or Fortran are not necessarily stored at contiguous locations. Thus, care must be exercised that displacements do not cross from one variable to another. Also, in machines with a segmented address space, addresses are not unique and address arithmetic has some peculiar properties. Thus, the use of {\bf addresses}, i.e., displacements relative to the start address \const{ MPI\_BOTTOM}, has to be restricted.
Variables belong to the same {\bf sequential storage} if they belong to the same array, to the same {\tt COMMON} block in Fortran, or to the same structure in C. Valid addresses are defined recursively as follows:
\begin{enumerate}
\item The function \mpifunc{MPI\_ADDRESS} returns a valid address, when passed as argument a host program variable.
\item The \mpiarg{ start} parameter of a communication function evaluates to a valid address, when passed as argument a host program variable.
\item If \mpiarg{ v} is a valid address, and \mpiarg{ i} is an integer, then \mpiarg{ v+i} is a valid address, provided \mpiarg{ v} and \mpiarg{ v+i} are in the same sequential storage.
\item If \mpiarg{ v} is a valid address then \const{ MPI\_BOTTOM + v} is a valid address.
\end{enumerate}
A correct program uses only valid addresses to identify the locations of entries in communication buffers. Furthermore, if \mpiarg{ u} and \mpiarg{ v} are two valid addresses, then the (integer) difference \mpiarg{ u - v} can be computed only if both \mpiarg{ u} and \mpiarg{ v} are in the same sequential storage; no other arithmetic operations can be meaningfully executed on addresses.
We say that a datatype is {\bf absolute} if all displacements within this datatype are valid (absolute) addresses; it is {\bf relative} otherwise. A correct program obeys the following constraints in the use of datatypes:
\begin{itemize}
\item If the \mpiarg{ oldtype} argument used in \func{ MPI\_TYPE\_CONTIGUOUS, MPI\_TYPE\_VECTOR, MPI\_TYPE\_HVECTOR, MPI\_TYPE\_INDEXED} or \func{ MPI\_TYPE\_HINDEXED} is absolute, then all addresses within it must be to variables contained within the same array or structure; the result datatype is also absolute, and all computed addresses must also fall within the same array or structure.
\item If all entries of the \mpiarg{ array\_of\_indices} argument of \mpifunc{MPI\_TYPE\_STRUCT} are absolute addresses (computed by \mpifunc{MPI\_ADDRESS}), then the result datatype is also absolute. Each new address computed from an old address must fall within the same array or record as the old address. (Addresses in the old type need not all be within the same array or record.)
\item If a communication call is invoked with parameters \mpiarg{ start} and \mpiarg{ datatype}, then either \mpiarg{ start} = \const{ MPI\_BOTTOM} and \mpiarg{ datatype} is a handle to an absolute datatype, or \mpiarg{ start} is set to the address of a program variable, and all displacements in \mpiarg{ datatype}, relative to this base, yield addresses that are within the same array or record as this variable.
\end{itemize}
In summary, the type constructors \func{ MPI\_TYPE\_CONTIGUOUS, MPI\_TYPE\_VECTOR, MPI\_TYPE\_HVECTOR, MPI\_TYPE\_INDEXED} and \func{ MPI\_TYPE\_HINDEXED} can be recursively applied to build datatypes that combine variables that belong to the same sequential storage; variables that do not belong to the same sequential storage can be combined using one application of \mpifunc{MPI\_TYPE\_STRUCT}.
\discuss{ It is not expected that MPI implementations will be able to detect erroneous, ``out of bound'' displacements -- unless those overflow the user address space -- since the MPI call may not know the extent of the arrays and records in the host program. }
\section{Universal communication functions}
Although the number of communication functions defined in this chapter is fairly large, it is possible to construct all of them from a much smaller number of primitive functions. The construction is outlined in this section. It can be used to build a portable implementation of MPI, although it is expected that more efficient implementations will implement directly most or all of the functions in this chapter. Also, this construction can be used in order to derive the behavior of MPI point to point communication from the behavior of a small subset of functions.
We assume the availability of the primitive functions listed at the end of the section, and the ability to allocate memory for new variables. We assume that the function \mpifunc{MPI\_TYPE\_COMMIT} does not change the representation of a datatype, but merely creates a new copy of it.
Communication with a \mpiarg{ count} argument that is $> 1$ can be reduced to communication with \mpiarg{ count =1}, by replicating the \mpiarg{ datatype} argument \mpiarg{ count} times. Communication with a \mpiarg{ start} argument that is not \const{ MPI\_BOTTOM} can be reduced to communication with \mpiarg{ start} = \const{ MPI\_BOTTOM} by creating a suitably displaced new datatype. Thus, \mpifunc{MPI\_SEND( start, count, datatype, dest, tag, comm)} is
\begin{verbatim}
Type[1] = datatype
MPI_ADDRESS( start, Index[1])
Block[1] = count
MPI_TYPE_STRUCT( 1, Type, Index, Block, Newtype)
MPI_TYPE_COMMIT( Newtype, datatype1)
MPI_SEND( MPI_BOTTOM, 1, datatype1, dest, tag, comm)
\end{verbatim}
The same construction applies to all other communication functions with arguments \mpiarg{ count} and \mpiarg{ datatype}. We shall henceforth restrict ourselves to communication that involves only one element with an absolute address (\mpiarg{ count =1} and \mpiarg{ start} = \const{ MPI\_BOTTOM}). (If the operation \mpifunc{MPI\_TYPE\_COMMIT} is not assumed to be trivial, then the above construction cannot be used; this adds one more primitive function, namely \mpifunc{MPI\_TYPE\_COMMIT}, to our list, and introduces small changes in the construction.)
It is convenient to make explicit the {\bf persistence} attribute of communication objects. A \mpiarg{ persistent} communication object needs to be explicitly deallocated by a \func{ MPI\_COMM\_FREE} operation.
\discuss{
It is not expected that MPI implementations will be able to detect erroneous, ``out of bound'' displacements -- unless those overflow the user address space -- since the MPI call may not know the extent of the arrays and records in the host program.
}

\section{Universal communication functions}

Although the number of communication functions defined in this chapter is fairly large, it is possible to construct all of them from a much smaller number of primitive functions. This construction is outlined in this section. It can be used to build a portable implementation of MPI, although it is expected that more efficient implementations will implement most or all of these functions directly. The construction can also be used to derive the behavior of MPI point to point communication from the behavior of a small subset of functions.

We assume the availability of the primitive functions listed at the end of the section, and the ability to allocate memory for new variables. We assume that the function \mpifunc{MPI\_TYPE\_COMMIT} does not change the representation of a datatype, but merely creates a new copy of it.

Communication with a \mpiarg{ count} argument that is $> 1$ can be reduced to communication with \mpiarg{ count} $= 1$, by replicating the \mpiarg{ datatype} argument \mpiarg{ count} times. Communication with a \mpiarg{ start} argument that is not \const{ MPI\_BOTTOM} can be reduced to communication with \mpiarg{ start} = \const{ MPI\_BOTTOM} by creating a suitably displaced new datatype. Thus:
\mpifunc{MPI\_SEND( start, count, datatype, dest, tag, comm)} is
\begin{verbatim}
Type[1] = datatype
MPI_ADDRESS( start, Index[1]);
Block[1] = count;
MPI_TYPE_STRUCT( 1, Type, Index, Block, Newtype)
MPI_TYPE_COMMIT( Newtype, datatype1)
MPI_SEND( MPI_BOTTOM, 1, datatype1, dest, tag, comm)
\end{verbatim}
The same construction applies to all other communication functions with arguments \mpiarg{ count} and \mpiarg{ datatype}. We shall henceforth restrict ourselves to communication that involves only one element with an absolute address (\mpiarg{ count} $= 1$ and \mpiarg{ start} = \const{ MPI\_BOTTOM}). (If the operation \mpifunc{MPI\_TYPE\_COMMIT} is not assumed to be trivial, then the above construction cannot be used; this adds one more primitive function, namely \mpifunc{MPI\_TYPE\_COMMIT}, to our list, and introduces small changes in the construction.)

It is convenient to make explicit the {\bf persistence} attribute of communication objects. A {\bf persistent} communication object must be explicitly deallocated by a \func{ MPI\_COMM\_FREE} operation. On the other hand, an {\bf ephemeral} communication object is used for a single communication only; it is deallocated by the system when the communication it is used for completes.

The function \mpifunc{MPI\_COMM\_INIT} is a universal function for the creation of communication objects:
\begin{funcdef}{MPI\_COMM\_INIT(handle, datatype, source-dest, tag, comm, op-mode, persistence)}
\funcarg{\OUT}{handle}{handle to newly created communication object}
\funcarg{\IN}{datatype}{datatype of element sent or received}
\funcarg{\IN}{source-dest}{rank of destination or source (integer)}
\funcarg{\IN}{tag}{message tag (integer)}
\funcarg{\IN}{comm}{communicator (handle)}
\funcarg{\IN}{op-mode}{one of \const{ MPI\_STANDARD}, \const{ MPI\_READY}, \const{ MPI\_SYNCHRONOUS} or \const{ MPI\_RECV}}
\funcarg{\IN}{persistence}{one of \const{ MPI\_PERSISTENT} or \const{ MPI\_EPHEMERAL}}
\end{funcdef}
The communication described by such an object involves only one element, with an absolute address.

\subsection{Persistent communication objects}

\mpifunc{MPI\_CREATE\_SEND(handle, MPI\_BOTTOM, 1, datatype, dest, tag, comm)} is
\begin{verbatim}
MPI_COMM_INIT(handle, datatype, dest, tag, comm, MPI_STANDARD, MPI_PERSISTENT)
\end{verbatim}
The functions \mpifunc{MPI\_CREATE\_RSEND}, \mpifunc{MPI\_CREATE\_SSEND} and \mpifunc{MPI\_CREATE\_RECV} are dealt with in a similar manner. The functions \mpifunc{MPI\_START} and \mpifunc{MPI\_COMM\_FREE} are primitive.
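To illustrate the intended use of a persistent object, a send object can be created once, started repeatedly, and freed at the end. The sketch below is informal: the loop notation is schematic, and \mpifunc{MPI\_COMM\_FREE} is assumed to take the object handle as its argument.
\begin{verbatim}
MPI_CREATE_SEND( handle, MPI_BOTTOM, 1, datatype, dest, tag, comm)
for each communication step:
    MPI_START( handle)                  /* initiate the send */
    MPI_WAIT( handle, dontcarestatus)   /* complete it; the object remains allocated */
MPI_COMM_FREE( handle)                  /* persistent objects are freed explicitly */
\end{verbatim}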
\subsection{Nonblocking communication initiation}

\mpifunc{MPI\_ISEND(handle, MPI\_BOTTOM, 1, datatype, dest, tag, comm)} is
\begin{verbatim}
MPI_COMM_INIT(handle, datatype, dest, tag, comm, MPI_STANDARD, MPI_EPHEMERAL)
MPI_START(handle)
\end{verbatim}
The functions \mpifunc{MPI\_IRSEND}, \mpifunc{MPI\_ISSEND} and \mpifunc{MPI\_IRECV} are handled in a similar manner.

\subsection{Communication completion}

The two primitive completion operations are \mpifunc{MPI\_WAITANY} and \mpifunc{MPI\_TESTALL}. The function \mpifunc{MPI\_WAIT} can be implemented by a call to \mpifunc{MPI\_WAITANY} with an \mpiarg{ array-of-handles} argument of length one. The function \mpifunc{MPI\_WAITALL} can be implemented as a loop where an \func{ MPI\_WAIT} call is executed for each successive handle in the \mpiarg{ array-of-handles}. The function \mpifunc{MPI\_TEST} can be implemented by a call to \mpifunc{MPI\_TESTALL} with an \mpiarg{ array-of-handles} argument of length one. The function \mpifunc{MPI\_TESTANY} can be implemented as a loop where an \func{ MPI\_TEST} call is executed for each successive handle in the \mpiarg{ array-of-handles}.

\subsection{Blocking communication}

\mpifunc{MPI\_SEND(MPI\_BOTTOM, 1, datatype, dest, tag, comm)} is
\begin{verbatim}
MPI_COMM_INIT(handle, datatype, dest, tag, comm, MPI_STANDARD, MPI_EPHEMERAL)
MPI_START(handle)
MPI_WAIT(handle, dontcarestatus)
\end{verbatim}
The functions \mpifunc{MPI\_RSEND} and \mpifunc{MPI\_SSEND} are handled in a similar manner.
\mpifunc{MPI\_RECV(MPI\_BOTTOM, 1, datatype, source, tag, comm, status)} is
\begin{verbatim}
MPI_COMM_INIT(handle, datatype, source, tag, comm, MPI_RECV, MPI_EPHEMERAL)
MPI_START(handle)
MPI_WAIT(handle, status)
\end{verbatim}

\subsection{Probe and cancel}

The functions \mpifunc{MPI\_PROBE} and \mpifunc{MPI\_CANCEL} are primitive.

\subsection{Return status}

The functions \mpifunc{MPI\_GET\_SOURCE}, \mpifunc{MPI\_GET\_TAG}, \mpifunc{MPI\_GET\_LEN}, \mpifunc{MPI\_PROBE\_LEN}, and \mpifunc{MPI\_IS\_CANCELLED} are primitive. These functions are simple macros that access fields of a structure.

\subsection{Send-receive and exchange}

\mpifunc{MPI\_SENDRECV( send\_start, send\_count, send\_type, dest, recv\_start, recv\_count, recv\_type, source, tag, comm, status)} is
\begin{verbatim}
MPI_ISEND(handle[0], send_start, send_count, send_type, dest, tag, comm)
MPI_IRECV(handle[1], recv_start, recv_count, recv_type, source, tag, comm)
MPI_WAITALL(2, handle, status_array)
status = status_array[1]
\end{verbatim}
The nonblocking sends and receives can be replaced by their primitive implementations. The function \mpifunc{MPI\_EXCHANGE} can be handled in a similar manner; a temporary buffer must be allocated to replicate the send and receive buffer. A size for this buffer can be computed using the function \mpifunc{MPI\_TYPE\_EXTENT}.

\subsection{Derived datatypes}

We have outlined in Section~\ref{subsec:pt2pt-datatypeconst} how each datatype constructor can be expressed in terms of the next one. Thus, all datatype constructors can be expressed in terms of the constructor \mpifunc{MPI\_TYPE\_STRUCT}; the functions \mpifunc{MPI\_TYPE\_FREE}, \mpifunc{MPI\_TYPE\_EXTENT} and \mpifunc{MPI\_ADDRESS} are primitive.

\subsection{Summary}

In order to implement MPI point-to-point communication one needs 16 functions; about half of those are trivial macros:
\begin{enumerate}
\item \mpifunc{MPI\_COMM\_INIT}
\item \mpifunc{MPI\_START}
\item \mpifunc{MPI\_WAITANY}
\item \mpifunc{MPI\_TESTALL}
\item \mpifunc{MPI\_PROBE}
\item \mpifunc{MPI\_CANCEL}
\item \mpifunc{MPI\_GET\_SOURCE}
\item \mpifunc{MPI\_GET\_TAG}
\item \mpifunc{MPI\_GET\_LEN}
\item \mpifunc{MPI\_PROBE\_LEN}
\item \mpifunc{MPI\_IS\_CANCELLED}
\item \mpifunc{MPI\_TYPE\_STRUCT}
\item \mpifunc{MPI\_TYPE\_EXTENT}
\item \mpifunc{MPI\_ADDRESS}
\item \mpifunc{MPI\_TYPE\_FREE}
\item \mpifunc{MPI\_COMM\_FREE}
\end{enumerate}
On the other hand, we expect most implementations to use more primitives, in order to reduce communication overheads, reduce buffering, and improve error reporting.
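As a closing illustration of the datatype reduction mentioned above, a constructor such as \mpifunc{MPI\_TYPE\_CONTIGUOUS} can be expressed by a single call to \mpifunc{MPI\_TYPE\_STRUCT}. The sketch below is illustrative only; it assumes the argument order \mpifunc{MPI\_TYPE\_CONTIGUOUS( count, oldtype, newtype)} and uses the same calling style as the constructions earlier in this section.
\begin{verbatim}
Type[1]  = oldtype
Index[1] = 0          /* relative displacement from the start of the new type */
Block[1] = count      /* count contiguous copies of oldtype */
MPI_TYPE_STRUCT( 1, Type, Index, Block, newtype)
\end{verbatim}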