The basic remote queue device assumes the following:

There is a simple mechanism for transferring short (control) messages.  These
are typically packet headers, but may also include short data messages.  There
is no assumption that a short message can always be sent immediately; a device
must be prepared to enqueue an outgoing message.

Long messages are *always* transferred with remote memory operations.  That
is, either a put or a get operation.  The tcp device shows how this is
simulated for environments that do not have remote memory operations.  Note
that the "address" for the put or get operation is an opaque bit-field; where
true remote memory operations are available, it could be an address; in other
cases, it could be an offset, an index into a table entry, or something else.
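As a concrete (and purely illustrative) sketch of that opacity, a union lets each method interpret the same bits its own way.  The names below are assumptions, not part of any specification:

```c
#include <assert.h>

/* The "address" carried in a put/get packet is opaque to the device
   core; each method decides how to interpret it. */
typedef union {
    void          *ptr;     /* true remote memory: a virtual address */
    unsigned long  offset;  /* registered-segment methods: an offset */
    unsigned int   index;   /* tcp-style methods: a request-table slot */
} RMQ_Raddr;

/* A tcp-like method might encode the matching request's table slot;
   the receiver decodes the same field to find where the data goes. */
static RMQ_Raddr raddr_from_slot(unsigned int request_slot)
{
    RMQ_Raddr a;
    a.index = request_slot;
    return a;
}

static unsigned int raddr_to_slot(RMQ_Raddr a)
{
    return a.index;
}
```

Because only the owning method ever decodes the field, the encoding never needs to be portable across methods.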

Collective operations can be accelerated through the use of a few operations
that can be specified at the destination of each message.  Think of these as
extensions to the usual active message data copy routine.  These operations
are optional; a device need not support them (the operations are join, cat,
and copy).
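If these destination operations were expressed in a packet header, they might look like the enum below.  The enum itself and the semantics in the comments are guesses; the text above only names the operations:

```c
#include <assert.h>

/* Optional destination-side operations for collective acceleration.
   COPY is the usual active-message behavior; CAT and JOIN semantics
   are assumptions for illustration. */
typedef enum {
    RMQ_DEST_COPY = 0,   /* the usual active-message data copy */
    RMQ_DEST_CAT,        /* concatenate incoming data onto earlier data */
    RMQ_DEST_JOIN        /* combine incoming data with earlier data */
} RMQ_Dest_op;
```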

Different devices cannot be mixed together in a single MPICH implementation.
However, different methods may be.  A method may specify how to move data
using a single transport (e.g., TCP, shared memory, Myrinet).  A multi-method
implementation shares basic data structures (e.g., MPID_Request).  Note that
MPID_Requests are shared by all methods in a device, and are "vertically
integrated" (used from the very lowest levels of the device to the very top).
By mandating a common form for requests (and for the common part of a packet),
it is possible to write the dispatch loop for incoming data even if there is a
single queue (see below).

The interface between the MPI routines (e.g., MPI_Send) and the ADI is defined
in terms of requests:  the MPI routine allocates a request, fills in the
required fields, and calls the appropriate ADI routine.

The receive queues (unexpected and posted) are (roughly) visible at the MPID
level.  Any other queues or lists are part of particular device
implementations.  The reason that the receive queues are visible is that, for
thread safety, starting a receive must be an atomic operation: ``is there a
matching message in the unexpected queue?  If so, return that match; if not,
add the receive to the posted receive queue.''
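That atomicity can be sketched with a single lock covering both queues.  The queue layout, the matching fields, and all names below are illustrative assumptions (pthreads is used for the lock):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct Q_Entry {
    int tag, ctxid, lrank;
    struct Q_Entry *next;
} Q_Entry;

typedef struct { Q_Entry *head; } Queue;

static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;

/* Remove and return a matching entry from 'unexpected', or, failing
   that, add 'recv' to 'posted'.  Both actions happen under one lock,
   which is exactly the atomicity the text requires. */
static Q_Entry *match_or_post(Queue *unexpected, Queue *posted, Q_Entry *recv)
{
    Q_Entry **pp, *e;
    pthread_mutex_lock(&rq_lock);
    for (pp = &unexpected->head; (e = *pp) != NULL; pp = &e->next) {
        if (e->tag == recv->tag && e->ctxid == recv->ctxid &&
            e->lrank == recv->lrank) {
            *pp = e->next;                  /* dequeue the match */
            pthread_mutex_unlock(&rq_lock);
            return e;
        }
    }
    recv->next = posted->head;              /* no match: post the receive */
    posted->head = recv;
    pthread_mutex_unlock(&rq_lock);
    return NULL;
}

static int match_or_post_demo(void)
{
    Queue unexpected = { NULL }, posted = { NULL };
    Q_Entry u  = { 5, 1, 0, NULL };         /* one unexpected message */
    Q_Entry r1 = { 5, 1, 0, NULL };         /* receive that matches it */
    Q_Entry r2 = { 9, 1, 0, NULL };         /* receive with no match */
    unexpected.head = &u;
    if (match_or_post(&unexpected, &posted, &r1) != &u) return 0;
    if (match_or_post(&unexpected, &posted, &r2) != NULL) return 0;
    return posted.head == &r2;
}
```

Without the single lock, a message arriving between the check and the post could be lost; that is the race this operation exists to close.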

Example ADI implementations
============================

1. A multi-method device (TCP and Shared Memory)

Overview:  Messages are sent by sending packets; these packets either carry
data (short messages) or set up a rendezvous (long messages).  Additional
packet types support message cancel and flow control.  Long data is sent using
a separate mechanism; this *looks* like remote memory operations, but may
simply be another packet operation.

There are three basic queues: posted receives, unexpected sends, and pending
sends.  The last queue is used to handle messages that have not yet been sent
because of flow control or another temporary restriction on delivery of
messages.  The pending send queue is visible only to the RMQ device.

Rules for names:
MPID prefix is used for routines seen by the MPI implementation and that are
part of the ADI-3 specification.
RMQ prefix is used for the RMQ implementation of ADI-3.

typedef enum { RMQ_TCP, RMQ_SHMEM } RMQ_Method;
typedef <whatever is 32 bits> int32;

typedef struct _MPID_Request *MPID_Request_p;

/* This is the packed version using bit fields; limited to 1024 processes.
   For maximum efficiency, it should be L1 cache-aligned and, if possible,
   the minimum packet should be 1 cache-line long */
typedef struct _RMQ_Packet {
    unsigned int packet_type:8;
    unsigned int packet_len:14;
    unsigned int lrank:10;
    unsigned int from:10;
    unsigned int to:10;   /* ??? do we need this, or is it in the request? */
    unsigned int ctxid:12;
    int32        tag;
    /* ? flow control ? */
    /* ??? */
    /* ? data area ? */
    } RMQ_Packet;
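The size constraints can be checked at compile time.  This is a hedged sketch: the 64-byte cache line and the struct/macro names are assumptions, and plain int stands in for int32 so the fragment is self-contained:

```c
#include <assert.h>

typedef struct {
    unsigned int packet_type:8;
    unsigned int packet_len:14;
    unsigned int lrank:10;
    unsigned int from:10;
    unsigned int to:10;
    unsigned int ctxid:12;
    int          tag;                 /* int32 in the sketch above */
} RMQ_PacketHeader;

#define RMQ_MAX_PROCS (1 << 10)       /* 10-bit rank fields => 1024 */

/* C89-style compile-time assertion: the array size is negative (a
   compile error) if the header outgrows the assumed 64-byte line */
typedef char rmq_header_fits_line[sizeof(RMQ_PacketHeader) <= 64 ? 1 : -1];
```

The bit fields pack into two 32-bit units (8+14+10 and 10+12, with the second `from` field spilling as the compiler lays it out), so the header comfortably fits well under a cache line on typical ABIs.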

/* Requests are also carefully aligned. */
typedef struct _MPID_Request {
   MPID_Request_p next;     /* Easy enqueue of requests */
   RMQ_Method     method;
   int            to;       /* ? if not in packet */
   MPI_Comm       comm;     /* needed to invoke correct error handler; 
                               perhaps identify communication queue */
   MPI_Datatype   datatype; /* needed to create byte stream */
   /* ??? other data not in packet ??? */
   char           pad[PAD_SIZE]; /* padding to position packet */
   RMQ_Packet     packet;
   /* ??? */
   /* Method-specific data begins here */
   } MPID_Request;

Device Implementation
=====================

MPID_Isend( MPID_Request *request )
{
    /* Determine method from destination; setup any method-specific
       info (use request->to or request->packet.to) */
    RMQ_Set_method( request );
    // This could return the method as its value or RMQ_NO_METHOD for error 
    // Alternatively, it could just invoke the appropriate method.
    
    /* Invoke appropriate method */
    switch (request->method) {
        case RMQ_TCP:   RMQ_TCP_Isend( request ); break;
	case RMQ_SHMEM: RMQ_SHMEM_Isend( request ); break;
	/* Add other methods here */
	/* Always catch the something-is-wrong case */
	default: Panic();
    }
}

MPID_Irecv( MPID_Request *request )
{
    // Atomically match against the unexpected message queue or insert
    // into the posted receives
    if (MPID_Match_recv( request )) {
	// found a matching unexpected message.  Process it
	RMQ_Push_request( request );
	// request may be complete (short message) or may not (long message,
        // push-request sent ack).
    }
}

MPID_Wait( MPID_Request *request, MPI_Status *status )
{
    while (!request->complete) {
	// In the multi-threaded case, the request may be completed by
        // another thread (?).  Should there be a MPID_Req_active(request)
	// to indicate that some thread is actively waiting on request?
	MPID_Poll( 1 );
    }
}

/* This routine handles the queue operations.  Note that not all operations
   will reach this routine (e.g., the TCP device may choose to handle
   RMQ_PKT_OK_TO_SEND messages directly) 

   MPID_Poll is called by
       Random MPI routines (which ones?)
       A timer (thread lock?)
       A thread (thread lock?)
*/
RMQ_Poll( int blocking )
{
    make sure we are atomic: if running return else lock;

    /* Get next packet from anywhere */
    packet = RMQ_Next_packet( blocking );

    if (!packet) { unlock; return; }
    /* Also need to attempt to write any pending messages in the write 
       queue */
    /* This can use a single queue or a blended access to different 
       method queues based on access time */
    switch (packet->packet_type) {
        case RMQ_PKT_SHORT:
	     Try to match message against posted receives, else 
	     add to unexpected message queue
	     break;
	case RMQ_PKT_REQUEST_SEND:
	     Try to match message against posted receives.  If
	     found, generate ok-to-send response.  Else add to 
             unexpected message queue
	     break;
	case RMQ_PKT_OK_TO_SEND:
	     Execute RMQ_Put_data
	     break;
	case RMQ_PKT_ANTI_SEND:
	     Try to remove message from unexpected message queue; 
	     generate ack.
	     break;
	case RMQ_PKT_ANTI_SEND_OK:
	     Update request with success/failure of cancel
	     break;
	case RMQ_PKT_FLOW:
	     Update flow-control information for link
	     break;
        default: Panic();
    }
    unlock;
}

// This only handles communication initialization.  
// For mmap/forked shmem, we need to pass pre-existing communication context
// to the initialization (or have pre and post-process creation
// initialization?) 
MPID_Init()  // no cmdline args required 
{
    RMQ_TCP_Init();
    RMQ_SHMEM_Init();
}

or

MPID_Init()  // no cmdline args required 
{
 
    RMQ_TCP_CommPreInit();
    RMQ_SHMEM_CommPreInit();
    RMQ_TCP_ProcessCreate();
    RMQ_SHMEM_ProcessCreate();
    RMQ_TCP_CommPostInit();
    RMQ_SHMEM_CommPostInit();
}

Question:  What do we guarantee about the environment of the processes?
Environment variables?  Command-line arguments?  What do we try to provide?

MPID_Mem_register( void *buf, long n )
{
   // may be a nop for these methods    
}
 
Mem_register is used by MPI_Send_init and MPI_Recv_init, as well as
MPI_Win_create.  

void *MPID_Mem_alloc( long n )
{
    return RMQ_SHMEM_Mem_alloc( n );
}
This attempts to allocate memory in a shared memory area.  Note that MPI-2 can
restrict third-party access to such memory.  For example, the device could
choose to allow this only if *only* the shared memory method was used.

void MPID_Mem_free( void *buf )
{
    RMQ_SHMEM_Mem_free( buf );
}
Free the memory.

RMQ device (general routines)
==============================

/* This routine could be simply "add to queue" */
RMQ_Dispatch_packet( packet )
{
   if (multi-threaded)
      Add packet to queue of pending packets
   else
      call routine to process packet (switch in MPID_Poll?)
}
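The multi-threaded branch ("add packet to queue of pending packets") might look like the lock-protected FIFO below.  All names here are illustrative assumptions, and a placeholder struct stands in for RMQ_Packet:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* A packet placeholder; the real RMQ_Packet would be used here */
typedef struct Pkt { struct Pkt *next; int type; } Pkt;

static struct { Pkt *head, *tail; } pkt_q;
static pthread_mutex_t pkt_q_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called from RMQ_Dispatch_packet: append to the pending-packet FIFO */
static void pkt_enqueue(Pkt *p)
{
    p->next = NULL;
    pthread_mutex_lock(&pkt_q_lock);
    if (pkt_q.tail) pkt_q.tail->next = p;
    else            pkt_q.head = p;
    pkt_q.tail = p;
    pthread_mutex_unlock(&pkt_q_lock);
}

/* Called from the poll loop: take the oldest pending packet, if any */
static Pkt *pkt_dequeue(void)
{
    Pkt *p;
    pthread_mutex_lock(&pkt_q_lock);
    p = pkt_q.head;
    if (p) {
        pkt_q.head = p->next;
        if (!pkt_q.head) pkt_q.tail = NULL;
    }
    pthread_mutex_unlock(&pkt_q_lock);
    return p;
}

static int pkt_demo(void)
{
    Pkt a = { NULL, 1 }, b = { NULL, 2 };
    pkt_enqueue(&a);
    pkt_enqueue(&b);
    return pkt_dequeue() == &a && pkt_dequeue() == &b &&
           pkt_dequeue() == NULL;    /* FIFO order preserved */
}
```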

RMQ_Push_request( MPID_Request *request )
{
    switch (request->method) {
        case RMQ_TCP:   RMQ_TCP_Push_request( request ); break;
	case RMQ_SHMEM: RMQ_SHMEM_Push_request( request ); break;
	/* Add other methods here */
	/* Always catch the something-is-wrong case */
	default: Panic();
    }
}

? Should there be an RMQ_Request_matched that is called by MPID (or RMQ?) when
a previously unexpected receive is now matched by an (MPI) receive?

TCP method
==========

typedef struct _RMQ_TCP_Request {
    MPID_Request common;   /* Could also macro include this if the 
                              compiler generates poor code */
    /* ??? */
    } RMQ_TCP_Request;

// This doesn't address process startup
// Only the communication is initialized by this
RMQ_TCP_Init()
{
    Create connection listener
    Get connection server ip/port name from environment
    Send connection server my listener ip:port
    Contact connection server to get others' ip:port
}

RMQ_TCP_Isend( MPID_Request_p request )
{
    if (message is short) {
	pack message into packet (memcpy or MPID_pack_datatype)
    }
    else {
	set packet kind to request-to-send
    }
    //send packet (enqueue request at remote location)
    RMQ_TCP_Enqueue_envelope( request );
}

RMQ_TCP_Poll( int blocking )
{
    select( open-fds, timeout = (blocking ? NULL : 0) );
    for each set fd {
        if read {
            read packet.  Process or add the packet queue:
            RMQ_Dispatch_packet( packet );
        }
        else {
            advance pending writes
        }
    }
}

RMQ_TCP_Enqueue_envelope( MPID_Request_p request_p )
{
    Do flow-control for packet
    if writes allowed then {
        write( request_p->tcp.fd, request_p->common.packet, 
	       request_p->common.packet_len );
    }
    if (not all written) {
        add to pending write queue
    }
}
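The partial-write case can be sketched against a POSIX nonblocking socket.  The helper below is an assumption, not part of the interface; it returns the count of unsent bytes that the caller would move to the pending write queue:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Write as much of buf as the kernel will take; return bytes left
   unsent.  A nonzero return means the request goes on the pending
   write queue, as in RMQ_TCP_Enqueue_envelope above. */
static size_t try_write(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n > 0) { buf += n; len -= (size_t)n; continue; }
        if (n < 0 && errno == EINTR)
            continue;            /* interrupted: just retry */
        break;  /* EAGAIN/EWOULDBLOCK (buffer full) or a real error:
                   the caller deals with the remainder */
    }
    return len;                  /* bytes still unsent */
}

static int try_write_demo(void)
{
    int fds[2];
    size_t left;
    if (pipe(fds) != 0) return -1;
    left = try_write(fds[1], "hello", 5);   /* fits in the pipe buffer */
    close(fds[0]);
    close(fds[1]);
    return (int)left;
}
```

Distinguishing EINTR (retry now) from EAGAIN (retry later, from the poll loop) is the detail that makes the pending write queue necessary.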

/* ?? How to double buffer this ?? */
/* Vaddress is the "virtual" address that was returned by the receiver in the 
   packet.  When the data is returned, with this address, to the receiver, it
   will know where to put the data (Vaddress could be index of matching
   request, which solves heterogeneity issues)
 */
RMQ_TCP_Put_data( RMQ_Packet *packet_p )
{
   Find request from packet
   if (message is contiguous) {
       compute address of source data from request
   }
   else {
       use MPID_pack_datatype into a buffer attached to the request
       (space allocated at Isend?  At Put_data time?)
       save state of pack in request if we aren't done
   }
   writev( with return Vaddress, next lump of data ) 
   if (not all written)
       add to pending write queue
}

Shared memory method
====================

(not done)

RMQ_SHMEM_Init()
{
    ?? (if sysv segments, this would look much like TCP_Init)
}

RMQ_SHMEM_Put_data( RMQ_Packet *packet_p )
{
   Find request from packet
   if (message is contiguous) {
       compute address of source data from request
   }
   else {
       use MPID_pack_datatype into a buffer attached to the request
       (space allocated at Isend?  At Put_data time?)
       save state of pack in request if we aren't done
   }
   memcpy( dest-address, source-address, n );
   change packet type to indicate some data sent
   RMQ_SHMEM_Enqueue_envelope( request_p );
   // An enhancement is to have the initial message allow *two* memcpy's
   // to start.  This establishes a double-buffering copy.  The ack is sent
   // after the first is copied and then after each copy.  This is better
   // for one-way data delivery but not for exchanges.
}

Notes on the implementation
===========================
Rather than use an enum type and a switch statement, the MPID_Request could
replace method with a (virtual) function table.  The advantage is that any
number of methods can be supported; the disadvantage is that the requests
themselves become more complex and vulnerable (any damage to the request can
mangle the function pointers), even if method is a pointer to a static
function table.  
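A minimal sketch of that function-table alternative follows.  All names are assumptions, and the dummy method exists only to exercise the dispatch:

```c
#include <assert.h>
#include <stddef.h>

struct MPID_Request_vt;

/* One entry per ADI operation; methods fill in their own functions */
typedef struct {
    int (*isend)(struct MPID_Request_vt *);
    int (*push_request)(struct MPID_Request_vt *);
} RMQ_MethodOps;

typedef struct MPID_Request_vt {
    const RMQ_MethodOps *ops;   /* replaces the RMQ_Method enum */
    /* ... remainder of the request ... */
} MPID_Request_vt;

/* Dispatch becomes a single indirect call; no switch, and any number
   of methods can be registered */
static int MPID_Isend_vt(MPID_Request_vt *request)
{
    return request->ops->isend(request);
}

/* A stand-in method table, for illustration only */
static int dummy_isend(MPID_Request_vt *r) { (void)r; return 7; }
static const RMQ_MethodOps dummy_ops = { dummy_isend, NULL };

static int vt_demo(void)
{
    MPID_Request_vt req;
    req.ops = &dummy_ops;
    return MPID_Isend_vt(&req);
}
```

Keeping the table itself static and const limits the damage a corrupted request can do: only the one pointer, not the function addresses, lives in the request.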

We need a way (similar to the ch2 device in ADI-2) to build the "common" parts
of the method implementations.  Perhaps the name-mangling approach in the next
section could be used?

If "short" doesn't depend on the method, or is easily extracted, we may be
able to eliminate one layer of calls.

Short cuts in the method implementations could process some packets
(particularly "continue sending data") without ever reaching the device-level
poll routine.

Single Method Implementation
============================
In the single method case, we can dispense with the switch on method type and
the code to determine the correct method.  To enable method code to become
device code, we might consider declarations that use
/* The two-level definition is needed so that RMQ_BASE_NAME is expanded
   before the tokens are pasted */
#define RMQ_NAME2(a,b) a##b
#define RMQ_NAME1(a,b) RMQ_NAME2(a,b)
#define RMQ_NAME(b)    RMQ_NAME1(RMQ_BASE_NAME,b)
...
RMQ_NAME(_Isend)( MPID_Request_p request )
...

Then -DRMQ_BASE_NAME=RMQ_TCP creates the TCP method for use in a
multi-method device, and -DRMQ_BASE_NAME=RMQ creates the TCP method for a
single (TCP) method device.  ## is the ISO C token concatenation (pasting)
operator; pre-ISO C can use /**/ instead (and configure can test with
PAC_C_CPP_CONCAT). 

Adding VIA
==========

(not done)

Unlimited Methods
=================
All of the functions can be virtualized, allowing the method to be
dynamically loaded, at some additional cost in function dispatch.

Questions
=========
Error returns.  Do all functions return an error value?  Do they use the MPI
error classes?  Error codes (encoded according to MPICH error handling)?

How should requests be allocated?  The allocation must be fast (don't use
malloc).

Who defines MPI_Status structure?  Is this something defined by the particular
device implementation (and hence must be included within the mpi.h file)?

Should there be allowed command-line args to make it easier to pass
debug/option values to the device?

MPI-2 Operations
================

Spawn
=====
One way to consider spawn is to replace mpirun with a small MPI program that
calls MPI_Comm_spawn_multiple (or a slight variation):

mpirun.c:
#include <mpi.h>
int main( int argc, char *argv[] )
{
    int rc = 0;
    MPI_Comm intercomm;
    MPI_Init( &argc, &argv );
    /* convert command-line arguments into args for ... */
    MPI_Comm_spawn_multiple( ..., &intercomm );
    MPI_Comm_free( &intercomm );
    /* handle io, signals, etc. */
    MPI_Finalize();
    return rc;
}

This won't quite work, but it is close.  Close enough that we should consider
replacing MPI_Comm_spawn_multiple with a slightly enhanced version that would
work.  The enhanced version should
1. Provide an environment consisting of a common part and a per-process part,
allowing per-process information to be passed within the environment
2. Provide hooks for I/O routing
3. Provide hooks for signal/event delivery
4. Provide hooks for return code vector

Note that MPI_Comm_spawn_multiple requires the ability to provide different
command line arguments to each process.

Remote Memory Access
====================

Because the RMQ device has put as an abstraction, the MPI_PUT operation, at
least for contiguous data, is fairly simple.  For MPI_GET and ACCUMULATE, we
may want to add those operations (at least as options).

A complication is the ability to specify at the source a datatype to be
applied at the target.  This means sending datatype descriptions to the target
of a put/get/accumulate.  For efficiency, we need to cache such datatype
descriptions and optimize for contiguous data.

We should also allow a target to flush the datatype cache, so we will need to
handle "cache miss" on a datatype.
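A hedged sketch of that target-side cache, with a flush operation, is below.  A failed lookup is the "cache miss" that would trigger an RMQ_PKT_DATATYPE_MISS reply; the size, the direct-mapped layout, and all names are assumptions:

```c
#include <assert.h>
#include <stddef.h>

#define DT_CACHE_SIZE 64

/* One cached datatype description; the defn itself is elided */
typedef struct { int id; int valid; } DT_Entry;

static DT_Entry dt_cache[DT_CACHE_SIZE];

/* Return the cached entry for 'id', or NULL.  NULL means the target
   must send RMQ_PKT_DATATYPE_MISS so the origin resends the defn. */
static DT_Entry *dt_lookup(int id)
{
    DT_Entry *e = &dt_cache[id % DT_CACHE_SIZE];
    return (e->valid && e->id == id) ? e : NULL;
}

/* Install (or replace) the entry for 'id' after a DATATYPE_DEFN packet */
static void dt_insert(int id)
{
    DT_Entry *e = &dt_cache[id % DT_CACHE_SIZE];
    e->id = id;
    e->valid = 1;
}

/* The target may flush the whole cache at any time */
static void dt_flush(void)
{
    int i;
    for (i = 0; i < DT_CACHE_SIZE; i++) dt_cache[i].valid = 0;
}

static int dt_demo(void)
{
    dt_insert(5);
    if (dt_lookup(5) == NULL) return 0;   /* hit after insert */
    dt_flush();
    return dt_lookup(5) == NULL;          /* miss after flush */
}
```

A direct-mapped cache also handles eviction implicitly: inserting a colliding id silently invalidates the old one, which the miss protocol then repairs on demand.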

Additional Packet Types:

RMQ_PKT_DATATYPE_DEFN - data contains datatype defn.  Complication: the
    datatype defn may be arbitrarily long
RMQ_PKT_DATATYPE_MISS - The datatype cache value (id) referred to in a
    previous RMA operation is no longer valid and must be resent
RMQ_PKT_RMA_PUT - MPI_Put operation
RMQ_PKT_RMA_GET - MPI_Get operation
RMQ_PKT_RMA_ACCUMULATE - MPI_Accumulate operation

What do we need for synchronization?  Perhaps

RMQ_PKT_RMA_SYNC - Sync the specified RMA window

Question: Can we combine these with the message-passing operations?  E.g., can
we eliminate the rendezvous send by including these?  Even the datatype cache
could be used; for example, a receive rendezvous could return the datatype id,
allowing the put operation to exploit any scatter capability.

Random Thoughts and Notes
=========================

Managing persistent resources
-----------------------------
All resources that the OS doesn't automatically clean up (e.g., SYSV ipcs,
processes) should be registered somewhere.  E.g., RSRC_Register( const char *
command_to_remove ).

An implementation of this API simply appends to a well-known file that can be
read and executed.

Fast multimethod devices
------------------------
For best performance, the fastest method should have the shortest software
path.  This suggests that only the fastest method provides input queues, and
that all other methods convert their messages into messages for the fast
method.  This is relatively easy if the slower methods all run in separate
threads; then only the fastest method does polling and/or waiting; all other
methods wait in their threads.


Collective Operations
=====================

points:

1. To catch user errors in matching up collective operations, we'd like to
have the following test in the device:

   if (is-collective-message && is-from-expected-source && 
       collective-type != active-collective-type-for-this-communicator) {
       Generate-error-mismatched-collective
   }

(Do we need the "is-from-expected-source"?  It is here because two successive
collective operations may use different communication patterns and hence 
might not arrive in the same order; but if communication is on a pair-by-pair
basis, then messages must arrive in-order for each communicator.)

This and some other items suggest that the envelope of the messages contain
some "user"-specified data; in this case, that the operation is collective
and that additional data is provided besides just the tag (note that we did
NOT test the tag; the point is that an unexpected tag is an error).
Other data might include "errors on input", allowing collective reporting of
user-argument errors (e.g., a bcast with an erroneous datatype at the root
would still do a broadcast, but with "rank=root detected error" attached,
causing all processes involved in the bcast to return an error).

2. Fast algorithms for individual operations, particularly in
store-and-forward algorithms, rely on pipelining the operations (i.e., sending
one long message in multiple parts so that the first part can be forwarded
before the last part has been sent).  This in turn
relies on being able to send partial messages.  Since a single MPI datatype
may represent an arbitrary amount of data, we need an interface that can send
partial messages, remembering how far through the datatype it has gotten.
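Such an interface can be sketched as a pack state that remembers its position.  A flat byte buffer stands in for a real (datatype, buffer) pair, and all names are assumptions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *src;   /* stand-in for a real (datatype, buffer) pair */
    size_t      total;
    size_t      done;  /* how far through the datatype we have gotten */
} Pack_State;

/* Produce up to 'max' more bytes into 'out'; returns bytes produced,
   0 once the message is fully packed.  Each call resumes where the
   previous one stopped, which is what pipelined forwarding needs. */
static size_t pack_next(Pack_State *s, char *out, size_t max)
{
    size_t n = s->total - s->done;
    if (n > max) n = max;
    memcpy(out, s->src + s->done, n);
    s->done += n;
    return n;
}

static int pack_demo(void)
{
    char out[8];
    Pack_State s = { "abcdefgh", 8, 0 };
    return pack_next(&s, out, 3) == 3 &&   /* first segment */
           pack_next(&s, out, 3) == 3 &&   /* second segment */
           pack_next(&s, out, 3) == 2 &&   /* short final segment */
           pack_next(&s, out, 3) == 0;     /* done */
}
```

A forwarding node can call pack_next as each segment arrives, so the first part of a long message leaves before the last part has been received.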

The limiting case of pipelining is a stream of data, which can be forwarded
after the first byte arrives.  We won't express this directly at the top
level, though we should allow a device to do this on a packet-by-packet level.

