                  Portable MPI Model Implementation
                               over GM

                    Version mpich_1_2__2, July 7, 2000


********************
Status & Limitations
********************

GM does NOT work with threading or forking, and neither does
MPICH-over-GM. "system()" calls from a GM program or an
mpich-over-gm program will cause GM to fail with strange errors.
See /examples/basic/osc.c for examples.

Contents of this file:
----------------------
 MPI-GM compilation and usage (READ THIS!!)
 Running TOTALVIEW
 More on SMP support
 More examples of building and running
 Configuring on different ARCHs
 Running under solaris
 More on registering memory
 Some Tunable Parameters in MPICH
 Other Notes
 TCP/IP performance
 Sigalarm

MPI-GM compilation and usage:
*****************************

1. Configure MPICH
2. Make
3. Create a conf file
4. Run a program

Note: You may wish to use the script mpich.make to assist with
steps 1 and 2. Edit it using the instructions in step 1, then type:
mpich.make

STEP 1: Configure MPICH
-------

Configure in the mpich source directory:

WARNING: You must compile/link MPICH for GM on the same architecture
         and OS version that you'll be running on. Do NOT compile
         on a linux-2.0.x box if you'll be running under linux-2.2.x.

Here are some examples for LINUX. Other architectures will need
a different -arch= parameter. (See "Configuring on different ARCHs"
below for Silicon Graphics machines and Digital Unix (OSF).)

We have not tried to compile mpich for NT. We would expect a lot
of trouble getting it to work.

setenv GM_HOME <location of GM base dir>

for LINUX -
./configure -nodevdebug -cflags="-I$GM_HOME/binary/include -I$GM_HOME/include"\
            -opt=-O2  -device=ch_gm -noromio -noc++ \
            --lib="-L$GM_HOME/binary/lib/ -L$GM_HOME/lib/ -lgm" \
            -arch=LINUX -fc=g77 -rsh=rsh -gm-can-register-memory \
            -shared-memory-support


For other archs that don't support memory registration (like IRIX64,
sun4, and Digital Unix), see the examples below - you cannot use
-gm-can-register-memory on them.

Or, to compile on Linux with mpe, without f77, with ssh instead of rsh,
and without (direct) SMP support -

./configure -nodevdebug -cflags="-I$GM_HOME/binary/include -I$GM_HOME/include"\
            -opt=-O2  -device=ch_gm -nof77 -noromio -noc++ \
            --lib="-L$GM_HOME/binary/lib/ -L$GM_HOME/lib/ -lgm" \
            -arch=LINUX -rsh=ssh -gm-can-register-memory

To add mpich-gm debugging, add -DGM_DEBUG=1 to the cflags.

STEP 2: Make 
-------

In the main mpich directory:

make

To avoid compiling the profiling libraries and other stuff:
make mpilib mpiflib

NOTE: The mpich make process generates lots of output.
If the make fails in one directory, it skips that directory and
continues with the rest of the make. As a result, building an
application later fails with 'undefined reference to' errors.

In that case, make mpich again using
make >& make.out
to redirect the output (stdout and stderr, csh syntax) to a file, and
then check the output file (make.out) for the first error. Fix the
problem and then try making again.


STEP 3: Create a conf file 
-------

This "conf" file must be accessible on all the nodes where you are going 
to be running the MPI processes. Each process will read this conf file as
will the mpirun script. Comments (lines that begin with '#') and
blank lines are allowed.

The default location for this file is $HOME/.gmpi/conf .
If the file is in a different location or of a different name, 
then the --gm-f option can be used to tell mpirun.ch_gm where 
to find the conf information.

The generic description of the contents of this file is:

<num_nodes>
<node_0_name> <node_0_port> [node_0_board_optional]
<node_1_name> <node_1_port> [node_1_board_optional]
<node_2_name> <node_2_port> [node_2_board_optional]
<node_3_name> <node_3_port> [node_3_board_optional]
.
.
.
<node_N_name> <node_N_port> [node_N_board_optional]


An example conf file is given below:

# .gmpi/conf file begin
# first the number of nodes in the file
11
# the list of (node, port, board) that make the MPI World
node1.myri.com 2
node1.myri.com 4
node1.myri.com 5
node1.myri.com 6
node2.myri.com 2
node2.myri.com 4
node2.myri.com 5
node2.myri.com 6
node3.myri.com 2 
node3.myri.com 2 1
node3.myri.com 4 1
# .gmpi/conf file end

To set up the conf file for SMP use, simply list the SMP machine
N times (once for each processor), using a different GM port on each line.
The example above uses two SMP machines (node1 and node2), each of which
has 4 processors.

If you have multiple boards in a single machine, then you need to add the
board number to the end of the line. Missing board numbers are assumed to
be zero.  

In general, gm has 8 ports. Ports 2,4,5,6,7 are for users. Other ports 
are not for general user-process use.

In the example above, node3 has two boards and will run one
process on board 0, port 2, another process on board 1, port 2 and
a third process on board 1, port 4.
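
As an illustration of the entry format, here is a minimal C sketch that
parses one "<name> <port> [board]" line, defaults a missing board to 0,
and rejects ports outside the user range listed above. It is NOT the
parser used by mpirun.ch_gm or MPI-GM: the names gmpi_conf_entry,
is_user_port and parse_conf_entry are made up for this example, and
allowing whitespace before '#' is an assumption. The leading <num_nodes>
line is not an entry and has to be read separately.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One "<name> <port> [board]" entry (hypothetical type for this sketch). */
struct gmpi_conf_entry {
    char host[256];
    unsigned int port;
    unsigned int board;   /* 0 when the conf line omits it */
};

/* Ports this README lists as available to user processes: 2,4,5,6,7. */
int is_user_port(unsigned int port)
{
    return port == 2 || (port >= 4 && port <= 7);
}

/*
 * Parse one entry line. Returns 1 on success, 0 for a comment or blank
 * line, and -1 for a malformed line or a non-user port.
 */
int parse_conf_entry(const char *line, struct gmpi_conf_entry *out)
{
    int fields;

    assert(line != NULL && out != NULL);

    line += strspn(line, " \t");      /* assumption: leading blanks are ok */
    if (*line == '#' || *line == '\n' || *line == '\r' || *line == '\0')
        return 0;

    out->board = 0;                   /* missing board number means board 0 */
    fields = sscanf(line, "%255s %u %u", out->host, &out->port, &out->board);
    if (fields < 2 || !is_user_port(out->port))
        return -1;
    return 1;
}
```

Feeding it the line "node3.myri.com 2 1" yields host "node3.myri.com",
port 2, board 1; "node1.myri.com 4" yields board 0.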

The machine names must match what
	gm/binary/bin/gm_board_info
shows in its route table (full host names).

NOTE: the exception is for multiple boards in a machine. The hostname
in the conf file needs to be a valid hostname that can be used for rsh.
The name in the gm_board_info output will be machinename:1 for board #1.
Don't put the ":1" in the conf file, just use "machinename" and put the
"1" at the end of the line to indicate board "1".

Here's an example (with two cards in "node3").

Route table for this node follows:
  gmID MAC Address                               Hostname Route
  ---- ----------------- -------------------------------- ---------------------
    96 00:60:dd:7f:ec:fa                   node1.myri.com 87 be bb
    97 00:60:dd:7f:e5:f2                   node2.myri.com 85 82 be
    99 00:60:dd:7f:ec:f9                   node3.myri.com 87 be b9
   100 00:60:dd:7f:ee:a5                   node3.myri.com:1 80 (this node)
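
The rule above (drop the ":1" and put the board number at the end of the
conf line) can be sketched in C as follows. This is only an illustration
of the conversion; split_board_suffix is a hypothetical helper, not part
of GM or MPICH.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split a gm_board_info hostname such as "node3.myri.com:1" into the
 * plain hostname and the board number used in the conf file.
 * Returns 0 on success, -1 if the name does not fit in host[] or the
 * suffix after ':' is not a decimal number.
 */
int split_board_suffix(const char *gm_name, char *host, size_t host_len,
                       unsigned int *board)
{
    const char *colon;
    size_t n;

    assert(gm_name != NULL && host != NULL && board != NULL);

    colon = strchr(gm_name, ':');
    n = colon ? (size_t)(colon - gm_name) : strlen(gm_name);
    if (n == 0 || n >= host_len)
        return -1;
    memcpy(host, gm_name, n);
    host[n] = '\0';

    *board = 0;                       /* no suffix means board 0 */
    if (colon) {
        char *end;
        unsigned long v = strtoul(colon + 1, &end, 10);
        if (end == colon + 1 || *end != '\0')
            return -1;
        *board = (unsigned int)v;
    }
    return 0;
}
```

For the route table above, "node3.myri.com:1" becomes host
"node3.myri.com" with board 1, and "node1.myri.com" stays as it is with
board 0.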


STEP 4: Run a program
-------

Sample test programs are in examples/basic, examples/perftest and examples/tests.
To run the cpi program in examples/basic:

cd examples/basic
make
../../bin/mpirun.ch_gm --gm-v cpi

examples/perftest/myrunex will gather performance information.

If the make process fails with 'undefined reference' errors, see the NOTE 
under step 2 on making mpich.


Running TOTALVIEW
*****************

To run with totalview, first set the TOTALVIEW environment
variable, and then run with the -tv flag :

setenv TOTALVIEW <dir-where-your-totalview-is>/totalview
../../bin/mpirun.ch_gm --gm-v --gm-f ~/.gmpi/conf.mpi2 -tv cpi


More on SMP support:
********************

Configure with
   "-shared-memory-support"
for SMP with a two-copy protocol,

OR
   "-shared-memory-support -shared-memory-enable-directcopy"
for SMP with one direct copy.

NOTE: To be able to use SMP with one direct copy,
the linux kernel must be rebuilt, and then gm must
be configured with the --enable-directcopy option.
See README-linux in GM. (Warning - not for the faint of heart).

NOTE 2: Check the usage of the direct copy flag carefully; it has changed
since version 1.2..0.

NOTE 3: Starting with gm version 1.4, the directcopy feature is
automatic for linux. (i.e., no directcopy flag is necessary).


More examples of building and running:
**************************************

MPI programs can be compiled with 
	mpicc *.[oc] -o exec
or
	mpif77 *.[oc] -o exec

They can be launched with 

	mpirun -np <num_node> exec

In case of a problem with the mpirun script, try instead:

    mpirun.ch_gm -np <num_node> exec
          ^^^^^^

To use a different config file just use the "-f" parameter to
mpirun.ch_gm.

	mpirun.ch_gm -np <num_nodes> -f <new_conf_file> program

Example:

On a linux machine, running the cpi test on 4 processors:
../../build/bin/mpirun.ch_gm --gm-v --gm-f ~/.gmpi/conf.linux --gm-np 4 cpi

Add the -mpichecksum flag to the end of these lines to enable a software
checksum.

Add the -mpichecksum-no-die flag to the end of these lines to enable a
software checksum that will not end execution if a checksum error is
found (for advanced users only!).


Configuring on different ARCHs -
********************************

Making MPICH on Silicon Graphics machines:

The configure line should specify:

./configure -nodevdebug -cc="cc -64 -mips4 -r10000 -DGM_CPU_mips" \
            -cflags="-I$GM_HOME/binary/include -I$GM_HOME/include" \
            -opt=-O2 -device=ch_gm -nof77 -noromio -noc++ \
            --lib="-L$GM_HOME/binary/lib/ -L$GM_HOME/lib/ -lgm" \
            -arch=IRIX64

-DGM_CPU_mips does not need to be specified if gm.h defines this
for "mips".

We were able to configure and make and run the tests cpi, srtest, and stress.

During the make we did get the following errors and warnings.
They did not prevent us from making and running the tests.

Errors during make:
"configure", line 2181: error(1020): identifier "bogus" is undefined
   bogus endian macros
   ^

"configure", line 2181: error(1065): expected a ";"
   bogus endian macros
                ^

2 errors detected in the compilation of "conftest.c".
configure: failed program was:
#line 2175 "configure"

Warnings during make:
ld64: WARNING 84: /ufs/gm/ruth/mpich_sgi/mpich/build/IRIX64/ch_gm/lib/libpmpich.a is not used for resolving any symbol.
ld64: WARNING 84: /ufs/gm/ruth/mpich_sgi/mpich/build/IRIX64/ch_gm/lib/libmpich.a is not used for resolving any symbol.
ld64: WARNING 84: /ufs/gm/ruth/gm_sgi/gm/binary/lib/libgm.a is not used for resolving any symbol.

Also, mpe did not build.

When building in /examples/basic, we received errors because the profiler
did not build. 

Making MPICH on Digital Unix machines:

configure:
./configure -nodevdebug -cflags="-I$GM_HOME/binary/include -I$GM_HOME/include" \
            -opt=-O2 -device=ch_gm -nof77 -noromio -noc++ \
            --lib="-L$GM_HOME/binary/lib/ -L$GM_HOME/lib/ -lgm"

Note that -gm-can-register-memory can NOT be used for Digital Unix.


Running under solaris:
**********************

Currently, memory registration is not supported under Solaris.
If you see errors of this form:

_gm_mmap: mmap failed: Protocol error
gmpriv.c:203  DMA_ALLOC returns NULL
gmpriv.c:204: failed assertion: "send_buf".

then you will need to make the following modifications to your
MPICH. They decrease the amount of memory used by MPICH.

0. From the main mpich directory, cd to mpid/ch_gm

1. Modify the file mpigm.h:
from:
#define GMPI_MAX_FRAG (1<<17) /* maximum message = 128Kbytes (temporary) */

to:
#define GMPI_MAX_FRAG (1<<15) /* maximum message = 32Kbytes (temporary) */

2. Change the following lines in mpigm.c, as given in this patch -

p2bl 1955% diff -r -u mpigm.c.old mpigm.c
--- mpigm.c.old        Mon Jul 10 12:16:02 2000
+++ mpigm.c     Mon Jul 10 12:21:59 2000
@@ -110,7 +110,7 @@
   gm_assert(GMPI_MAX_FRAG <= 4*GMPI_MAX_DMA_BYTES);
 
   gmpi.stoken = gmpi.max_stoken = MIN(gm_num_send_tokens(gmpi.port)/2,GMPI_NSTOKEN);
-  gmpi.rtoken = gm_num_receive_tokens(gmpi.port)/4;
+  gmpi.rtoken = 4   /*gm_num_receive_tokens(gmpi.port)/4  */ ;
   for (i=0;i<MPID_MyWorldSize;i++) {
     gmpi.node_ids[i] = gm_host_name_to_node_id(gmpi.port,gmpi.node_names[i]);
     if (gmpi.node_ids[i] == GM_NO_SUCH_NODE_ID) {
@@ -166,7 +166,7 @@
   if (MPID_MyWorldSize > 1) {
     gm_init_sync();
   }
-  for (i=0;i<gm_num_receive_tokens(gmpi.port)/2;i++) {
+  for (i=0;i< 8   /*gm_num_receive_tokens(gmpi.port)/2 */ ;i++) {
     void * p = DMA_ALLOC(gmpi.port, gm_max_length_for_size(GMPI_CONTROL_TAG), GMPI_RDMA);
     gm_provide_receive_buffer(gmpi.port, p , 
                              GMPI_CONTROL_TAG, GM_LOW_PRIORITY);

3. In this directory, run:
make clean
make

Applications will have to be rebuilt as well.

Your performance might suffer - the number of receive buffers
has been greatly reduced. (So free those buffers as quickly as you can!)

Also, by changing GMPI_MAX_FRAG to (1<<15), the bandwidth of
your MPICH is limited to the BW obtained at this message size.

You may wish to tune these parameters for your system.

More on registering memory:
***************************

If GM supports memory registration (I believe that is true only
for Linux and NT currently), it is recommended to compile MPI with
-gm-can-register-memory to enable the registration/zero-copy code.

To do this, add -gm-can-register-memory to the configure line.

If memory registration *is* available, the GM_NO_MEM_REG_CACHE option 
allows one to switch between two variants:

- Define GM_NO_MEM_REG_CACHE, which causes registration/deregistration
for each communication (which is currently more costly than actually
copying). NOTE: this should be used only for debugging purposes.

- Or, do NOT define GM_NO_MEM_REG_CACHE. This will avoid
using memory registration for each communication event and instead
will maintain a cache of what part of the memory has been registered 
previously and what part is still in use.
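
To make the cache idea concrete, here is a small C sketch of a
registration cache: a region that is already covered by a cached entry
is reused without registering again, and only entries with no send in
flight may be replaced. It is an illustration only, not the MPI-GM
cache. All names (sketch_reg_cache, sketch_cache_acquire,
sketch_cache_release) and the 4-slot table are assumptions, and a
counter stands in for the real GM register/deregister calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_SLOTS 4   /* arbitrary size for the sketch */

struct sketch_region {
    uintptr_t start;
    size_t len;
    int in_use;                /* sends still in flight on this region */
    int valid;
};

struct sketch_reg_cache {
    struct sketch_region slot[SKETCH_CACHE_SLOTS];
    unsigned int registrations;   /* stands in for real register calls */
};

/*
 * Pin [start, start+len) for a send. Returns the slot index, or -1 when
 * every slot holds a region that is still in use.
 */
int sketch_cache_acquire(struct sketch_reg_cache *c, uintptr_t start,
                         size_t len)
{
    int i, free_slot = -1, idle_slot = -1, victim;

    assert(c != NULL && len > 0);

    for (i = 0; i < SKETCH_CACHE_SLOTS; i++) {
        struct sketch_region *r = &c->slot[i];
        if (!r->valid) {
            if (free_slot < 0)
                free_slot = i;
        } else if (start >= r->start && start - r->start <= r->len &&
                   len <= r->len - (start - r->start)) {
            r->in_use++;
            return i;             /* hit: already registered */
        } else if (r->in_use == 0 && idle_slot < 0) {
            idle_slot = i;        /* registered, but nobody is using it */
        }
    }
    victim = free_slot >= 0 ? free_slot : idle_slot;
    if (victim < 0)
        return -1;

    /* Miss: real code would deregister an idle victim and register the
     * new region here. */
    c->slot[victim].start = start;
    c->slot[victim].len = len;
    c->slot[victim].in_use = 1;
    c->slot[victim].valid = 1;
    c->registrations++;
    return victim;
}

/* The send finished; the region stays cached but may now be replaced. */
void sketch_cache_release(struct sketch_reg_cache *c, int slot)
{
    assert(c != NULL && slot >= 0 && slot < SKETCH_CACHE_SLOTS);
    assert(c->slot[slot].in_use > 0);
    c->slot[slot].in_use--;
}
```

Defining GM_NO_MEM_REG_CACHE corresponds to skipping this table and
paying the registration cost on every acquire.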


Some Tunable Parameters in MPICH
********************************

The short explanation -

If you want to play with MPI-GM constants and try to find the optimal
values, you can try to change : 

* MPID_PKT_MAX_DATA_SIZE : limit between SHORT and 3-WAY (should be a power of 2)

* GMPI_MAX_FRAG : maximum fragment size = maximum BW

* need_scopy and need_rcopy : limit copy/register with gm-can-register-memory


The longer explanation -

For messages larger than MPID_PKT_MAX_DATA_SIZE, we change the
protocol from SHORT (0 to MPID_PKT_MAX_DATA_SIZE, one message=one
packet, the sender sends it immediately and the message is buffered on
the receiver side if the receive is not posted) to 3-WAY
(MPID_PKT_MAX_DATA_SIZE to GMPI_MAX_FRAG), which is a rendezvous
protocol: the sender sends a small message ("Are you ready"), waits for
a reply ("Yes, you can send it, my receive is posted") and then sends all
the data. So you have this handshaking at the beginning of the protocol
to avoid a copy on the receiver side, and this handshaking is expensive.
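
The size-based switch described above can be written down as a few lines
of C. This is a sketch, not the MPICH code: the SKETCH_ names are
invented here, and 16K is simply the MPID_PKT_MAX_DATA_SIZE value quoted
elsewhere in this file (check the value in your own source tree).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed threshold: the 16K figure quoted in this README. */
#define SKETCH_PKT_MAX_DATA_SIZE ((size_t)16 * 1024)

enum sketch_protocol { SKETCH_SHORT, SKETCH_3WAY };

/* The README asks for the threshold to be a power of 2. */
int sketch_is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* SHORT up to the threshold, 3-WAY (rendezvous) above it. */
enum sketch_protocol sketch_choose_protocol(size_t len)
{
    assert(sketch_is_power_of_two(SKETCH_PKT_MAX_DATA_SIZE));
    return len > SKETCH_PKT_MAX_DATA_SIZE ? SKETCH_3WAY : SKETCH_SHORT;
}
```

Raising the threshold trades more receiver-side buffering and copying
for fewer handshakes; lowering it does the opposite.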

So, now what is the power of the gm-can-register-memory flag? When GM
sends a message, it has to be sure that the data will stay at the same memory
location during all of the send, and not be swapped out to disk, for
example. GM can provide a "register" function to lock a user buffer in
memory and be sure that the buffer won't be swapped. So, when you want
to send a buffer, you call "gm_register" on the buffer, then you call
"gm_send" and then you can unregister the buffer (it's more complicated than
that, but this is enough to explain). There are several possible optimizations:

* if you send the same memory area several times, you should avoid the
unregister at the end of a gm_send and the register (again) before
the next gm_send. That's why there's a cache in MPI-GM: when you
register a memory area, this buffer stays registered as long as there
are enough resources, and then the next sends won't need to register that
buffer.

* gm_register is not very, very expensive, but it can be for big
buffers.  So it can be more efficient to register a memory area
one time and copy the buffer to send to a part of this area before
sending it. In this case, you replace the cost of the "register" by
the cost of the "copy", and the copy is faster for small
buffers. That's why MPI-GM uses the functions "need_scopy" and
"need_rcopy". The limit between copy and register is 8K in MPI-GM,
which means that for messages up to 8K, there's a copy to/from a special
location, and for messages of 8K and above, the user buffer is
registered using gm_register.

* if GM doesn't provide a "register" function on this platform, MPI-GM
will have to copy the user buffer to this special safe memory area
before sending and copy from this area after the receive on the
receiver side. In this case, i.e. without the gm-can-register-memory flag,
the limit between copy and register is set to infinity: "need_scopy"
and "need_rcopy" always return 1. That is the only difference made by
gm-can-register-memory. But the copy before the send on the
sender side and the copy after the receive on the receiver side are
expensive for big messages (more than 8K), which is why the BW is not
good.

Note that the copy/register limit is 8K in MPI-GM with
gm-can-register-memory, so messages larger than
MPID_PKT_MAX_DATA_SIZE (16K) are never copied with the
gm-can-register-memory flag, and are always copied without it.
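
The copy-versus-register rule can be summed up in a short C sketch. It
is not the real need_scopy/need_rcopy code (their actual signatures are
not shown in this file): sketch_need_copy is a made-up name, the 8K
limit is the figure given above, and treating a message of exactly 8K as
"register" is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Copy/register limit quoted in this README. */
#define SKETCH_COPY_LIMIT ((size_t)8 * 1024)

/*
 * Returns 1 if the message should be copied through the pre-registered
 * area, 0 if the user buffer should be registered instead.
 * can_register is 1 when built with -gm-can-register-memory, else 0.
 */
int sketch_need_copy(size_t len, int can_register)
{
    assert(can_register == 0 || can_register == 1);

    if (!can_register)
        return 1;                 /* limit is "infinity": always copy */
    return len < SKETCH_COPY_LIMIT;
}
```

With registration available, a 1K message is copied and a 64K message
is registered; without it, both are copied.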


Other notes:
************

To remove synchronization messages like the ones shown below,
comment out this define in mpich/mpid/ch_gm/gmpriv.c:
#define GM_SEE_SYNC

=> gmpi: node 0: opened GM board 0 (gm_id=1 gm_port=2)
=> gmpi:node 0 out of 5 waiting for sync
=> gmpi:node 0 starting ring comm for sync,magic=0xdeafa39b
=> gmpi: node 2: opened GM board 0 (gm_id=3 gm_port=2)
=> gmpi: node 3: opened GM board 0 (gm_id=4 gm_port=2)
=> gmpi: node 4: opened GM board 0 (gm_id=5 gm_port=2)
=> gmpi:node 2 out of 5 waiting for sync
=> gmpi:node 3 out of 5 waiting for sync
=> gmpi:node 4 out of 5 waiting for sync
=> gmpi: node 1: opened GM board 0 (gm_id=2 gm_port=2)
=> gmpi: node 0 starting final sync,magic=0xdeafa39a
=> node: 0 Synchronization done

The mpirun script does not yet handle non-SPMD programs, although they
can work if you launch all processes by hand.

 >
 > Should aborttest  work in mpich-gm?
 >
 > aborttest completes if I use one process. It hangs otherwise.
 >

In fact, MPI-GM makes no attempt at terminating the other nodes when
MPI_Abort is called on one node. Only the calling process is
terminated, and the others do not notice (and so will generally hang
at some point).

Although this is a buggish shortcut, some other MPI implementations
behave similarly. That is certainly why the aborttest comes last in the
"env" test suite (not counting the Fortran tests): in this
position it does not prevent further tests from being run.


TCP/IP performance
******************

For people wanting to run PVM or MPI on top of TCP/IP, we can suggest
two patches to the linux kernel that are in the Myricom-distributed
mpich tree:
tcp.c-noagle-small-alloc.patch
tcp_input.c.2.0.33+.patch
They can sometimes drastically improve performance when the
TCP_NODELAY option is used, and I think both MPI and PVM use it.
Without them, MPI/TCP-IP bandwidth is quite bad (as is netperf with
the -D option).
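
For reference, TCP_NODELAY is the standard socket option that turns off
Nagle's algorithm on a TCP connection. The sketch below shows how an
application sets and reads it back with the usual sockets API; the two
helper names are invented for this example, and nothing here is taken
from the MPICH or PVM sources.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Disable Nagle's algorithm on a TCP socket. Returns 0 on success. */
int set_tcp_nodelay(int fd)
{
    int one = 1;

    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
}

/* Read the option back: 1 if set, 0 if clear, -1 on error. */
int get_tcp_nodelay(int fd)
{
    int value = 0;
    socklen_t len = sizeof value;

    assert(fd >= 0);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) != 0)
        return -1;
    return value != 0;
}
```

Call set_tcp_nodelay() on a socket created with
socket(AF_INET, SOCK_STREAM, 0) before sending small messages.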

Sigalarm - currently disabled.
******************************

To avoid cases where the mpich process does not get a chance
to periodically wake up and check its message queues, we have
added a sigalarm handler. If you do not want this
sigalarm handler to be used, configure with the flag -gm-disable-sigalarm .
Reasons to use this flag (and disable gm-mpich's use of sigalarms) include:
- you are already using sigalarm in your mpich program
- sigaction does not exist on your system
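
The reason for the first point is that a process has only one SIGALRM
handler and one ITIMER_REAL timer, so a library and an application
cannot both own them. The C sketch below shows the general pattern of a
periodic wake-up built on sigaction and setitimer. It is NOT the MPI-GM
handler; every sketch_ name is hypothetical, and the handler only sets a
flag that the main code is expected to poll.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Set by the handler; the main code checks and clears it. */
volatile sig_atomic_t sketch_poll_requested = 0;

static void sketch_on_alarm(int signo)
{
    (void)signo;
    sketch_poll_requested = 1;    /* do the real work outside the handler */
}

/* Install the handler and fire SIGALRM every interval_usec microseconds. */
int sketch_start_periodic_alarm(long interval_usec)
{
    struct sigaction sa;
    struct itimerval tv;

    assert(interval_usec > 0 && interval_usec < 1000000);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sketch_on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;     /* restart interrupted system calls */
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;                /* replaces any handler the app had */

    memset(&tv, 0, sizeof tv);
    tv.it_interval.tv_usec = interval_usec;
    tv.it_value.tv_usec = interval_usec;
    return setitimer(ITIMER_REAL, &tv, NULL);
}

/* Stop the timer (the handler stays installed). */
int sketch_stop_periodic_alarm(void)
{
    struct itimerval off;

    memset(&off, 0, sizeof off);
    return setitimer(ITIMER_REAL, &off, NULL);
}
```

An application that already calls alarm() or setitimer(ITIMER_REAL, ...)
would overwrite, or be overwritten by, a timer like this one.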







