I. Mapping.

a. Overview.
The gm mapper sends host messages to other hosts, and switch messages
to itself through switches. Hosts receiving host messages send
replies. The mapper will timeout and retransmit when these replies
aren't received. In some cases when a host fails to reply to a host
message, it may receive switch messages, which it should drop.  The
mapper will also timeout and retransmit switch messages when these
messages don't return to the mapper. After a certain number of failed
retransmissions, the mapper moves on.  Mapping occurs in three phases:
exploration, configuration and verification. The mapper cycles
through these phases, pausing briefly during the interregnum at the
end of each cycle.  Mappers can be alive on multiple hosts, and there
is a scheme to decide which of the mappers is active. If the active
mapper dies, a dormant mapper will take its place. There are many
options to the mapper. These options are defined as part of the
GM distribution in the file gm/mt/tools/active.args. The mapper
messages themselves are defined in gm/mt/libmt/mt_Message.h, and in
this document. Contradictions should be resolved by ignoring this
document.

b. Exploration.
First the mapper explores the network. In this phase host messages may
be sent to hosts that don't exist, and switch messages may be routed
through switches that don't exist; these failures will cause timeouts
and a fixed number of retransmissions. The mapper learns about the
network from its failures and successes.

c. Configuration.
The second phase of mapping is the distribution of network information
to the hosts on the network. The mapper sends configuration messages to
each host, and waits for acknowledgments. Unacknowledged configuration
messages are resent a certain number of times; this failing still,
the mapper will give up, and try again after it has re-mapped the
network. Each different mapping of the network is given a nonzero map
version number. Hosts remember map versions they were configured with,
and only hosts with out of date map versions will be re-configured.

d. Verification.
The third phase of mapping is the verification phase, in which the mapper
verifies that the network hasn't changed. If the mapper discovers that
the network has changed, the mapper will go back to the exploration
phase. Verification is an optimization. It is simpler than exploration,
and it uses fewer, and simpler messages. If the network is unchanging,
the mapper will continue to verify it periodically, and never re-explore
it. The user can tell the mapper to skip the verification phase.
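
The cycle through the three phases can be pictured as a small state
machine. The C sketch below is only an illustration: the names
map_phase and next_phase are invented here, and the choice to return
to exploration when verification is skipped is an assumption, since
this document does not say what the mapper does in that case.

```c
/* Hypothetical names; not taken from the GM sources. */
enum map_phase { PHASE_EXPLORE, PHASE_CONFIGURE, PHASE_VERIFY };

/*
 * Pick the phase for the next step of the cycle.
 *   ok          - for PHASE_CONFIGURE: every host acknowledged;
 *                 for PHASE_VERIFY: the network is unchanged;
 *                 ignored for PHASE_EXPLORE.
 *   skip_verify - nonzero if the user asked to skip verification
 *                 (assumed to mean "explore again instead").
 */
enum map_phase next_phase(enum map_phase current, int ok, int skip_verify)
{
    switch (current) {
    case PHASE_EXPLORE:
        return PHASE_CONFIGURE;
    case PHASE_CONFIGURE:
        if (!ok)
            return PHASE_EXPLORE;   /* give up, re-map, try again */
        return skip_verify ? PHASE_EXPLORE : PHASE_VERIFY;
    case PHASE_VERIFY:
        /* Keep verifying until a change is seen. */
        return ok ? PHASE_VERIFY : PHASE_EXPLORE;
    }
    return PHASE_EXPLORE;
}
```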

e. Interregnum.
The active mapper will sleep for 10 seconds between mappings. A dormant
mapper will wake up every 5 seconds and determine if it should become
the active mapper. These values can be changed by the user in
the mapper.args file.

f. Arbitration.
When a mapper is started, it is given a nonnegative priority. The
mapper with the highest priority will become the active mapper. If all
mappers have the same priority, the mapper with the highest 48-bit
Myrinet board ID will become the active mapper. This priority is
sometimes called the "map level".
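
A minimal sketch of the comparison, in C. The struct and function
names are made up for this example. It assumes the 48-bit board ID is
stored most significant byte first, and it applies the board ID
tie-break to any pair of mappers with equal priority, which is one
reading of the rule above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one mapper taking part in arbitration. */
struct mapper_id {
    uint32_t level;       /* priority, also called the map level */
    uint8_t  address[6];  /* 48-bit Myrinet board ID, assumed to be
                             stored most significant byte first */
};

/* Returns nonzero if mapper a should be active rather than mapper b. */
int mapper_outranks(const struct mapper_id *a, const struct mapper_id *b)
{
    if (a->level != b->level)
        return a->level > b->level;
    /* memcmp compares bytes as unsigned values, so with the most
       significant byte first it orders the IDs numerically. */
    return memcmp(a->address, b->address, 6) > 0;
}
```

A mapper that learns of another mapper's address and level could stay
active only while mapper_outranks(self, other) holds.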

g. Messages
Each mapping message begins with a 16-bit Myrinet type (0x000f), and
a 16-bit mapper subtype. Message types and all fields are in network byte
order (big endian).  All routes are to be treated as an opaque byte
string that needs to go on the front of a message.  The length (in bytes)
of each route is given by a separate route length field.
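
To illustrate the byte-order rule, here is a C sketch that reads the
leading fields of a received mapping message. The helper names
(get_be16, get_be32, mapper_header, parse_mapper_header) are invented
for this example. It assumes that any route bytes have already been
consumed, and that port and phase follow the subtype, as they do in
every message layout given in this document.

```c
#include <stddef.h>
#include <stdint.h>

#define MYRINET_MAPPER_TYPE 0x000f  /* 16-bit Myrinet type for mapping */

/* Hypothetical in-memory form of the fields shared by all messages. */
struct mapper_header {
    uint16_t type;
    uint16_t subtype;
    uint32_t port;
    uint32_t phase;
};

/* Read big-endian integers one byte at a time, so the code works on
   hosts of either endianness and on unaligned buffers. */
uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 1 and fills *h if buf starts with a mapping message header,
   otherwise returns 0. */
int parse_mapper_header(const uint8_t *buf, size_t len,
                        struct mapper_header *h)
{
    if (len < 12)
        return 0;                   /* too short for type..phase */
    h->type = get_be16(buf);
    if (h->type != MYRINET_MAPPER_TYPE)
        return 0;                   /* not a mapping message */
    h->subtype = get_be16(buf + 2);
    h->port    = get_be32(buf + 4);
    h->phase   = get_be32(buf + 8);
    return 1;
}
```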

II. Host Messages

a. Scout Message
struct ScoutMessage
{
  const uint16 type    = 0x000f;
  const uint16 subtype = 0x0000;
  uint32 port;
  uint32 phase;

  uint32 routeLength;
  uint8  routeToMapper[32];
  uint8  mapperAddress[6];
  uint16 command;
  uint32 level;
};

The mapper sends scout messages to a host to discover it, or to verify
that it still exists. The host should reply to the scout message with
a Reply message, using the route provided in the scout message. The
mapper will retransmit unacknowledged scout messages at 10 millisecond
intervals 3 times before giving up. (These values can be changed by
the user.)  Hosts running mappers should pass received scout messages
up to the mapper, which will use the address and level fields to
decide if it should be dormant or active.

1. port should be copied into the reply message.
2. phase should be copied into the reply message.
3. routeLength is the length of the route back to the mapper.
4. routeToMapper is the route back to the mapper.
5. mapperAddress is the 48-bit Myrinet board ID of the mapper.
6. command is optional. It tells the host to perform some unusual act.
   The possible values are:
   0 - default - no additional action necessary.
   1 - reset your mapping-related state (gmID, mapVersion, etc.)
       this is used to clear gmID conflicts.
7. level is the priority of the mapper.

b. Reply Message
struct ReplyMessage
{
  const uint16 type    = 0x000f;
  const uint16 subtype = 0x0002;
  uint32 port;
  uint32 phase;

  uint8  address[6];
  uint8  mapperAddress[6];
  uint16 gmID;
  uint16 pad;
  uint32 mapVersion;
  uint8  hostname[32];
  uint32 level;
  uint16 nodeType;
  uint16 option = 0;
};

When a host receives a scout message from the mapper, it should send a
ReplyMessage.  Some of the fields in the reply message are copied back
out of the original scout message, and other fields are about the host
itself. The host should use the route provided by the scout message to
send the reply back to the mapper.

1. port should be copied from the scout message.
2. phase should be copied from the scout message.
3. address is the host's 48-bit Myrinet board id.
4. mapperAddress is the 48-bit Myrinet board id of the
   mapper host from the last configuration. (see II.f.7, below)
5. gmID is
   a. the host's gm ID, if it has been configured.
   b. 0 otherwise.
6. mapVersion is
   a. the mapVersion from the last time the host was configured
   b. 0 if the host has never been configured.
7. hostname is the network hostname of the host, null terminated.
   A null string is an acceptable response.
8. level is the map level of the host, if there is a mapper
   running. Map levels are used to arbitrate between multiple mappers.
9. nodeType is a number determined by Myricom. It is
   a. 0 for workstations running gm.
   b. 1 for the Myricom "hybrid" MCP
   c. 2 for CompanyX switches
10. option must be set to zero and is reserved for future use.
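
As a rough sketch of how a host might build the reply, the C code
below fills a ReplyMessage from a received scout. The host_state
struct and all function names are hypothetical. The byte offsets
assume the fields are packed in the order listed, with no padding
beyond the explicit pad field, which gives a 60-byte scout and a
72-byte reply. The caller is assumed to have checked the scout's type
and subtype already.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SCOUT_SIZE = 60, REPLY_SIZE = 72 };  /* assuming packed fields */

/* Hypothetical record of what a host remembers between mappings. */
struct host_state {
    uint8_t  address[6];         /* this host's 48-bit board ID */
    uint8_t  mapper_address[6];  /* mapper from the last configuration */
    uint16_t gm_id;              /* 0 until configured */
    uint32_t map_version;        /* 0 until configured */
    uint32_t level;              /* map level of a local mapper, if any */
    uint16_t node_type;          /* 0 = workstation running gm */
    char     hostname[32];       /* null terminated, may be empty */
};

/* Store integers in network byte order (big endian). */
void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)((v >> 16) & 0xff);
    p[2] = (uint8_t)((v >> 8) & 0xff);
    p[3] = (uint8_t)(v & 0xff);
}

/* Writes REPLY_SIZE bytes to reply and returns REPLY_SIZE, or returns
   0 if the scout is too short. */
size_t build_reply(const uint8_t *scout, size_t scout_len,
                   const struct host_state *hs, uint8_t *reply)
{
    if (scout_len < SCOUT_SIZE)
        return 0;
    memset(reply, 0, REPLY_SIZE);              /* zeroes pad and option */
    put_be16(reply + 0, 0x000f);               /* type */
    put_be16(reply + 2, 0x0002);               /* subtype */
    memcpy(reply + 4, scout + 4, 8);           /* port and phase, as is */
    memcpy(reply + 12, hs->address, 6);
    memcpy(reply + 18, hs->mapper_address, 6);
    put_be16(reply + 24, hs->gm_id);
    put_be32(reply + 28, hs->map_version);
    memcpy(reply + 32, hs->hostname, 32);
    reply[63] = 0;                             /* force termination */
    put_be32(reply + 64, hs->level);
    put_be16(reply + 68, hs->node_type);
    return REPLY_SIZE;
}
```

Before sending, the host would put the first routeLength bytes of the
scout's routeToMapper field on the front of the reply.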

c. Cloud Reply Message

struct CloudMember
{
  uint8 address[6];
  uint16 control;
};

struct CloudReplyMessage
{
  const uint16 type    = 0x000f;
  const uint16 subtype = 0x0002;
  uint32 port;
  uint32 phase;

  uint8  address[6];
  uint8  mapperAddress[6];
  uint16 gmID;
  uint16 pad;
  uint32 mapVersion;
  uint8  hostname[32];
  uint32 level;
  uint16 nodeType;
  uint16 option = 1;

  uint32 numHosts;
  CloudMember [175];
};


Network devices that connect ethernet networks to myrinets should
reply to mapper scout messages with a Cloud Reply Message. A Cloud
Reply Message is a ReplyMessage with the option field set to 1
(hexadecimal 0x1), and with additional fields at the end: first the
number of hosts on the ethernet and then the ethernet addresses and
two bytes of control information for up to 175 hosts. The limit of
175 assumes an ethernet MTU of 1500 bytes. If there are more than
175 hosts, the mapper will query the network device for additional
ethernet addresses with a Cloud Query message, and the device should
respond with a Cloud Query Reply message. The querying assumes that
the node representing the cloud keeps a table of the ethernet
addresses of the hosts in the cloud, and will be able to reply with
the contents of this table a few rows at a time. The responding node
should be able to handle requests for out of bounds rows, which can
occur if the table gets smaller after the initial reply message was
sent, and before all rows were received. The node should send back a
reply with numHosts equal to zero in this case. The mapper will
handle duplicate row entries, which can show up for the same reason.
The control information will be distributed to all GM nodes as the
control field in a row of the configuration message.
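
On the cloud node side, the out of bounds rule might be handled with
a helper like the one below. This is a sketch in C with invented
names. Returning the rows that still exist when a request runs only
partly past the end of the table is an assumption; the text above
only covers the fully out of bounds case.

```c
#include <stdint.h>

#define CLOUD_MAX_ROWS 175  /* most members carried in one message */

/*
 * Decide how many table rows to put in a Cloud Query Reply.
 *   table_size - current number of rows in the cloud node's table
 *   first      - first row requested by the mapper
 *   requested  - numHosts from the Cloud Query message
 * Returns the numHosts value for the reply, which is 0 when the
 * request starts beyond the end of the table.
 */
uint32_t cloud_rows_to_send(uint32_t table_size, uint32_t first,
                            uint32_t requested)
{
    uint32_t available;

    if (first >= table_size)
        return 0;                   /* out of bounds: numHosts = 0 */
    available = table_size - first;
    if (requested > available)
        requested = available;      /* table shrank part way (assumed) */
    if (requested > CLOUD_MAX_ROWS)
        requested = CLOUD_MAX_ROWS; /* never exceed one message */
    return requested;
}
```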

For instance, if there are 500 member nodes in the cloud the messages
might look like this:

1. Cloud node gets a scout message.
2. Cloud node responds with a reply, numHosts = 500, information on
   members 0 - 174 in the same message.
3. Mapper gets reply.
4. Mapper sends query request for members 175 - 349: first = 175,
   numHosts = 175.
5. Mapper sends query request for members 350 - 499: first = 350,
   numHosts = 150.
6. Cloud node receives message 4, replies with query reply,
   numHosts = 175, members from 175 to 349.
7. Cloud node receives message 5, replies with query reply,
   numHosts = 150, members from 350 to 499.
8. Mapper drops message 6.
9. Mapper receives message 7.
10. Mapper times out on message 4, resends it.
11. Cloud node receives message 10, replies with query reply,
    numHosts = 175, members from 175 to 349.
12. Mapper receives message 11.
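
The first and numHosts values in the query requests of this example
can be computed mechanically. The C sketch below uses invented names
(cloud_query, plan_cloud_queries) and assumes the initial reply
always carries rows 0 through 174, so queries start at row 175.

```c
#include <stddef.h>
#include <stdint.h>

#define CLOUD_MAX_ROWS 175  /* most members carried in one message */

/* Hypothetical pair of values for one Cloud Query message. */
struct cloud_query {
    uint32_t first;
    uint32_t num_hosts;
};

/*
 * Plan the queries needed to fetch the rows that did not fit in the
 * initial Cloud Reply. Writes at most max_out entries to out and
 * returns the number written.
 */
size_t plan_cloud_queries(uint32_t total_hosts, struct cloud_query *out,
                          size_t max_out)
{
    size_t n = 0;
    uint32_t first = CLOUD_MAX_ROWS;  /* rows 0..174 came with the reply */

    while (first < total_hosts && n < max_out) {
        uint32_t left = total_hosts - first;

        out[n].first = first;
        out[n].num_hosts = left < CLOUD_MAX_ROWS ? left : CLOUD_MAX_ROWS;
        first += out[n].num_hosts;
        n++;
    }
    return n;
}
```

For total_hosts = 500 this produces two queries, (first = 175,
numHosts = 175) and (first = 350, numHosts = 150).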

d. Cloud Query Message

struct CloudQueryMessage
{
  const uint16 type    = 0x000f;
  const uint16 subtype = 0x0007;
  uint32 port;
  uint32 phase;

  uint32 first;
  uint32 numHosts;
};

e. Cloud Query Reply Message
struct CloudQueryReplyMessage
{
  const uint16 type    = 0x000f;
  const uint16 subtype = 0x0008;
  uint32 port;
  uint32 phase;

  uint8  address[6];
  uint16 pad;
  uint32 numHosts;
  CloudMember [175];
};


f. Config Message
struct ConfigEntry
{
  uint8  address[6];
  uint16 gmID;
  uint16 nodeType;
  uint16 routeLength;
  uint8  route[32];
  uint8  hostname[32];
  uint16 control;
  uint16 pad;
};

struct ConfigMessage
{
   const uint16 type    = 0x000f;
   const uint16 subtype = 0x0001;
   uint32 port;
   uint32 phase;

   uint32 serial;
   uint32 routeLength;
   uint8  routeToMapper[32];

   uint8  address[6];
   uint8  mapperAddress[6];
   uint16 gmID;
   uint16 numHosts;
   uint32 mapVersion;
   uint32 numEntries;
   ConfigEntry entries[];
};

The mapper sends a series of configuration messages to each host to
give the host information about the network. This information includes
information about the host itself (its gm ID), information about the
other hosts (hostnames and gm and Myrinet board IDs), and routes. The
host should acknowledge each configuration message by sending a
HostReply message back to the mapper. The mapper will retransmit
unacknowledged configuration messages at 10 millisecond intervals 3
times before giving up. (The values can be changed by the user.) The
mapper may send more than one configuration message at a time, and
they may arrive in any order (unless the user specifies otherwise).  A
host should ignore configuration messages with map versions less than
its current map version, unless the new map version is zero.

1. port should be copied back in the reply message.
2. phase should be copied back also.
3. serial is the sequence number, starting at zero. This field is not
   necessarily useful. 
4. routeLength is the length of routeToMapper.
5. routeToMapper is the route to use when replying to the mapper.
6. address is the 48-bit Myrinet board ID of the host; if the address
   doesn't match, the message should be dropped.
7. mapperAddress is the 48-bit board ID of the mapper. This
   value should be copied back in the reply message. (see II.b.4)
8. gmID is the unchanging gm ID of the receiving host.
9. numHosts is the total number of routes for the configuration.
   The host can use it to reserve space for routes, or to
   determine when the configuration has completed.
10. mapVersion is the version of the current map. It should be
   remembered by the host for two reasons:
   a. for the reply messages.
   b. if the current mapVersion differs from the remembered one, the
      host should clear its route table. New routes may not be consistent
      with old ones.
11. numEntries is the number of entries following in this message.
12. entries is an array of entries. Each entry is information about a
    host on the network.
    a. address is the Myrinet board ID of some host.
    b. gmID is the gm ID of the same host.
    c. nodeType is the node type of the same host.
    d. routeLength is the length of the route to the same host.
    e. route is the route to the same host.
    f. hostname is the network name of the same host.
    g. control is for cloud members.
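
The rules for accepting a configuration message (address match, map
version ordering, clearing of old routes) can be collected into one
decision function. The C sketch below uses invented names and only
classifies the message; storing the entries and sending the
acknowledgment are left out.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical result codes for a received configuration message. */
enum config_action {
    CONFIG_DROP,            /* ignore the message */
    CONFIG_APPLY,           /* same map version: add the entries */
    CONFIG_CLEAR_AND_APPLY  /* new map version: clear routes first */
};

/*
 * my_address, my_map_version   - what this host currently holds
 * msg_address, msg_map_version - fields from the ConfigMessage
 */
enum config_action classify_config(const uint8_t *my_address,
                                   uint32_t my_map_version,
                                   const uint8_t *msg_address,
                                   uint32_t msg_map_version)
{
    /* Item 6: the message must be addressed to this host. */
    if (memcmp(my_address, msg_address, 6) != 0)
        return CONFIG_DROP;
    /* Older map versions are ignored, unless the new version is 0. */
    if (msg_map_version < my_map_version && msg_map_version != 0)
        return CONFIG_DROP;
    /* Item 10b: a different version invalidates the old routes. */
    if (msg_map_version != my_map_version)
        return CONFIG_CLEAR_AND_APPLY;
    return CONFIG_APPLY;
}
```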

III. Switch Messages
The mapper will retransmit unreturned switch messages at 5
millisecond intervals 3 times before giving up. (The values can be
changed by the user.)
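
The timeout and retransmit behavior, for switch messages here and for
host messages in Section II, can be sketched as a small retry loop.
The C code below is an illustration with invented names. It reads
"3 times" as three retransmissions after the original send, for up to
four transmissions in total; that reading is an assumption. The
fake_link helpers only simulate a link so the loop can be exercised.

```c
/* Hypothetical callbacks supplied by the caller. */
typedef void (*send_fn)(void *ctx);                      /* transmit once */
typedef int  (*wait_fn)(void *ctx, unsigned timeout_ms); /* 1 if answered */

/*
 * Send a message and retransmit it until it is answered (or, for a
 * switch message, until it returns). Returns 1 on success and 0
 * after the original send plus `retries` retransmissions all fail.
 */
int send_with_retries(send_fn send, wait_fn wait, void *ctx,
                      unsigned interval_ms, unsigned retries)
{
    unsigned attempt;

    for (attempt = 0; attempt <= retries; attempt++) {
        send(ctx);
        if (wait(ctx, interval_ms))
            return 1;
    }
    return 0;   /* the mapper gives up and moves on */
}

/* Simulated link for demonstration: an answer arrives once the
   message has been sent succeed_on times. */
struct fake_link {
    unsigned sends;
    unsigned succeed_on;
};

void fake_send(void *ctx)
{
    ((struct fake_link *)ctx)->sends++;
}

int fake_wait(void *ctx, unsigned timeout_ms)
{
    struct fake_link *link = ctx;

    (void)timeout_ms;   /* the simulation does not really wait */
    return link->sends >= link->succeed_on;
}
```

With the defaults in this document, a switch message would use
interval_ms = 5 and retries = 3, and a scout or configuration message
interval_ms = 10 and retries = 3.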

a. Probe Message
If a host receives a probe message, it should drop it.
struct ProbeMessage
{
  const uint16 type    = 0x000f;
  const uint16 subtype = 0x0003;
  uint32 port;
  uint32 phase;
};

b. Verify Message
If a host receives a verify message, it should drop it.
struct VerifyMessage
{
  const uint16 type    = 0x000f;
  const uint16 subtype = 0x0004;
  uint32 port;
  uint32 phase;
};


IV. Extended HostReply Messages
The HostReply message has an "option" field. If option is
non-zero the message will be interpreted differently.

Option	Message
------  -------------------------------------------
0x0001	Cloud
0x0002	Do not send me config messages with routes


struct ReplyMessage0x0002
{
 ... 
 const uint16 option = 0x0002; /* no config route messages wanted */
};


V. A Minimal GM Node

A GM-based Myrinet network might contain various endpoints, not
necessarily just hosts. A minimal endpoint in a GM network should
adhere to the following basic rules.

a. Use registered "typed" Myrinet messages for all communication.
Any message generated by a node should have a 16-bit type field
on the front of the message (after the routing bytes). Myricom
assigns these unique types and suggests that protocols create
their own 32-bit types by adding 16-bits of subtype to the
16-bits of registered type.

b. Assign a unique 48-bit ID 
Every Myrinet endpoint should have a unique 48-bit ID. A simple
way to generate these IDs is to register with the IEEE and
purchase a range of IDs from the 48-bit UID (Ethernet) space.

c. Respond to mapping Scout messages
A node that does not respond to mapping messages will get probed
repeatedly by the mapper. You will not be optimizing your code by
ignoring these messages.  See Sections II.a. and II.b. The hostname field
may be set to zero. You will need to save 

d. Accept a minimal configure message
The configure messages serve two functions. They provide each endpoint
with a GMId and deliver routes to all the other endpoints. If you 
specify option=0x0002 in the HostReply message, then your
node will only receive a minimal configure message. At the
very least, you will need to save 
   uint8  mapperAddress[6];
   uint16 gmID;
   uint32 mapVersion;
so that you can put them in the Reply message to keep the
mapper from thinking the map has changed.
See Section II.f. for more information on configure messages.

A node such as a "RAM Disk" might never generate its own
messages and would only respond with data when it got
a read message and an ack when it got a write message.
Such a node would not need any GM information or
routes since each request message could contain a
reply path just as the map messages do. The main reason
it participates in mapping is to allow other nodes
to find it.

[********** Things to be added or at least discussed ************]
Should we restructure the Scout and Reply messages to try
to make it easier for nodes to reply? An example would be
to put the mapversion and mapperID in the reply message
right next to each other so that the saved versions could
be written in a "bcopy" like fashion. The Scout and Reply
messages have evolved and were not designed from the ground
up with simplicity in mind.

Should we implement a set of request messages so that
nodes can ask for routes to all nodes of a specific type?
Or a request message so that nodes can ask for a route
to a particular GMId or 48-bit ID? This would make the
mapper more like the "Address Consultant" from the ISI days.

Should we define a specific type (16-bits) that we can assign
as a "pass-through" type so that GM hosts could send/recv
low-level protocol messages designed by customers? 
We already have a RAW GM port that the mapper can use.
This might be a better port for people to use for their
own host-based protocols. It is quite possible (and likely)
that customers will build their own MCPs and just use
a GM mapper in the network. CSPI is doing this now with
the BDMP MCP. I'd suggest that we punt on this issue right
now. The raw port gives people trivial access to sending
raw messages to "minimal" nodes.


$Id: mapping.txt,v 1.6 1999/06/17 22:31:25 finucane Exp $
