

			       README for MPD-2

			 Ralph Butler and Rusty Lusk


General
-------

MPD is a process management system for starting parallel jobs,
especially MPICH jobs.  Before running a job (with mpiexec), the
mpd daemons must be running on each host and connected into a ring.
This README explains how to do that, and how to test and manage the
daemons after they have been started.

You need to have Python version 2.2 or later installed to run the mpd.
You can type

    which python

to make sure you have it installed, and 

    python

to find out what your version is.  The current version can be obtained
from  www.python.org.
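
The same check can be made from inside the interpreter.  The following is
only a sketch and is not part of MPD; the name meets_minimum is made up for
illustration, and the code is written for a present-day Python rather than
for 2.2 itself.  It compares sys.version_info with the minimum stated above.

```python
import sys

# Minimum version stated in this README.
MINIMUM = (2, 2)

def meets_minimum(version_info, minimum=MINIMUM):
    """Return True if (major, minor, ...) is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

if meets_minimum(sys.version_info):
    print("python meets the stated minimum of %d.%d" % MINIMUM)
else:
    print("python is older than %d.%d" % MINIMUM)
```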

Type 

    mpdhelp

for a list of mpd-related commands.  Each command can be run with the
--help argument for usage information.


How to use MPD
--------------

You can start one mpd on the current host by running

   mpd &

This starts a ring of one mpd.  Other mpd's join the ring by being run
with host and port arguments for the first mpd.  You can automate this
process by using mpdboot.
 
Make a file with machine names in it.  This file may or may not include the
local machine.  It will be handy to use the default, which is ./mpd.hosts .

donner% cat ./mpd.hosts
donner.mcs.anl.gov
foo.mcs.anl.gov
shakey.mcs.anl.gov
terra.mcs.anl.gov
donner% 
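
If you want to look at such a file from a script, a small reader is enough.
This is a sketch under the assumption that the file holds one machine name
per line and that blank lines can be ignored; read_hosts is a hypothetical
helper, not an mpd command, and this README does not say whether mpdboot
accepts any other syntax in the file.

```python
def read_hosts(text):
    """Return the non-blank lines of a hosts file, stripped of whitespace."""
    hosts = []
    for line in text.splitlines():
        name = line.strip()
        if name:
            hosts.append(name)
    return hosts

# The contents of ./mpd.hosts from the example above.
sample = """donner.mcs.anl.gov
foo.mcs.anl.gov
shakey.mcs.anl.gov
terra.mcs.anl.gov
"""
print(read_hosts(sample))
```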

After mpich is built, the mpd commands are in mpich2/bin, or the bin
subdirectory of the install directory if you have done an install.
You should put this (bin) directory in your PATH in your .cshrc or .bashrc,
so that it will be picked up by the mpd's that are started remotely:

Put in .cshrc:  setenv PATH /home/you/mpich2/bin:$PATH

Put in .bashrc: export PATH=/home/you/mpich2/bin:$PATH

To start some mpds, use mpdboot.  It uses the mpd.hosts file:

donner% mpdboot -n 4 
donner%

This command starts a total of 4 daemons, one on the local machine and the
rest on machines in the mpd.hosts file.  You can specify another file (-f) or
another mpd command (-m).  The mpdboot command uses ssh to start the mpd on
each machine in the mpd.hosts file.
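
When mpdboot is driven from a script, it can help to assemble the argument
list in one place.  The sketch below uses only the -n, -f and -m options
described above; the function name mpdboot_command and the file name
my.hosts are made up for illustration.  It builds the list but does not run
anything.

```python
def mpdboot_command(total, hosts_file=None, mpd_cmd=None):
    """Build an mpdboot argument list from the options described above."""
    if total < 1:
        raise ValueError("total must be at least 1")
    cmd = ["mpdboot", "-n", str(total)]
    if hosts_file is not None:
        cmd += ["-f", hosts_file]      # another hosts file
    if mpd_cmd is not None:
        cmd += ["-m", mpd_cmd]         # another mpd command
    return cmd

print(" ".join(mpdboot_command(4)))
print(" ".join(mpdboot_command(4, hosts_file="my.hosts",
                               mpd_cmd="/home/you/mpich2/bin/mpd")))
```

The resulting list can be handed to subprocess.call() once the mpd commands
are in your PATH.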

You can use mpdtrace to see where your mpd's are running:

donner% mpdtrace
donner
foo
shakey
donner% 
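
Because mpdtrace prints short hostnames while mpd.hosts may hold full ones,
comparing the two takes a little care.  Here is one possible way to list the
machines from the hosts file that do not show up in the trace; short_name
and missing_hosts are hypothetical helpers, and the data is copied from the
two listings above.

```python
def short_name(host):
    """Return the part of a host name before the first dot."""
    return host.split(".", 1)[0]

def missing_hosts(expected, traced):
    """Return the expected hosts whose short names are absent from `traced`."""
    seen = set(short_name(h) for h in traced)
    return [h for h in expected if short_name(h) not in seen]

expected = ["donner.mcs.anl.gov", "foo.mcs.anl.gov",
            "shakey.mcs.anl.gov", "terra.mcs.anl.gov"]
traced = ["donner", "foo", "shakey"]
print(missing_hosts(expected, traced))   # ['terra.mcs.anl.gov']
```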

You can run something with mpdrun:

donner% mpdrun -np 2 hostname 
donner.mcs.anl.gov
foo.mcs.anl.gov
donner%

You can run an mpich2 job:

donner% mpiexec -np 10 /home/lusk/hellow
Hello world from process 0 of 10
Hello world from process 1 of 10
Hello world from process 2 of 10
Hello world from process 3 of 10
Hello world from process 4 of 10
Hello world from process 5 of 10
Hello world from process 6 of 10
Hello world from process 7 of 10
Hello world from process 9 of 10
Hello world from process 8 of 10
donner% 

You can take down the daemons:

donner% mpdallexit
donner%

If things go bad and daemons seem to be in a bad state, you can remove the
Unix sockets on all the machines in mpd.hosts by doing a cleanup:

donner% mpdcleanup

Here is the usage information for all of the mpd commands:

************************************************************ mpdhelp

The following mpd commands are available.  For usage of any specific one,
invoke it with the single argument --help .

mpd           start an mpd daemon
mpdtrace      show all mpd's in ring
mpdboot       start a ring of daemons all at once
mpdringtest   test how long it takes for a message to circle the ring 
mpdallexit    take down all daemons in ring
mpdcleanup    repair local Unix socket if ring crashed badly
mpdrun        start a parallel job
mpdlistjobs   list processes of jobs (-a or --all: all jobs for all users)
mpdkilljob    kill all processes of a single job
mpdsigjob     deliver a specific signal to the application processes of a job

Each command can be invoked with the --help argument, which prints usage
information for the command without running it.

************************************************************ mpdboot

usage:  mpdboot --totalnum=<n_to_start> [--file=<hostsfile>]  [--help] \
                [--rsh=<rshcmd>] [--user=<user>] [--mpd=<mpdcmd>] \
                [--loccons] [--remcons] [--shell] [--verbose] [-1]
 or, in short form,
        mpdboot -n n_to_start [-f <hostsfile>] [-h] [-r <rshcmd>] \
                [-u <user>] [-m <mpdcmd>] [-s] [-v] [-1]

--totalnum specifies the total number of mpds to start; at least
  one mpd will be started locally, and others on the machines specified
  by the file argument
--file specifies the file of machines to start the rest of the mpds on;
  it defaults to mpd.hosts
--mpd specifies the full path name of mpd on the remote hosts if it is
  not in your path
--rsh specifies the name of the command used to start remote mpds; it
  defaults to ssh; an alternative is rsh
--shell says that the Bourne shell is your default for rsh
--verbose shows the ssh attempts as they occur; it does not provide
  confirmation that the sshs were successful
--loccons says you do not want console commands available on the local mpd
--remcons says you do not want consoles available on remote mpds
-1 means start two mpds on the local machine if it occurs in the file
-z specifies a number of remote mpdboots to start

************************************************************ mpd

usage: mpd [--host=<host> --port=<portnum>] [--noconsole] \
           [--trace] [--echo] [--daemon] [--bulletproof] \
           [--idmyhost=<hostname>]

Long parameter names may be abbreviated to their first letters by using
  only one hyphen and no equal sign:
     mpd -h donner -p 4268 -n
  is equivalent to
     mpd --host=donner --port=4268 --noconsole
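
The abbreviation rule is mechanical enough to express in a few lines.  The
following sketch applies it to a list of long-form arguments; abbreviate is
a made-up helper, not something mpd provides, and it assumes that every
argument it is given starts with two hyphens.

```python
def abbreviate(long_args):
    """Convert ['--host=donner', '--noconsole'] style args to short form."""
    short = []
    for arg in long_args:
        if not arg.startswith("--") or len(arg) < 3:
            raise ValueError("not a long option: %r" % arg)
        # Split '--name=value' into its name and (optional) value.
        name, sep, value = arg[2:].partition("=")
        short.append("-" + name[0])    # one hyphen, first letter only
        if sep:
            short.append(value)        # value follows without an equal sign
    return short

args = abbreviate(["--host=donner", "--port=4268", "--noconsole"])
print(" ".join(["mpd"] + args))        # mpd -h donner -p 4268 -n
```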

--host and --port must be specified together; they tell the new mpd where
  to enter an existing ring;  if they are omitted, the new mpd forms a
  stand-alone ring that other mpds may enter later
--noconsole is useful for running 2 mpds on the same machine; only one of
  them can have a unix socket which a console program can connect to, so
  only that one will accept mpd commands
--trace yields lots of traces thru mpd routines; currently too verbose
--echo causes the mpd to echo its listener port, by which other mpds may connect
--daemon causes mpd to run backgrounded, with no controlling tty
--bulletproof says to turn bulletproofing on (experimental)
--idmyhost specifies an alternate hostname for the host this mpd is running on
--listenport specifies a port for this mpd to listen on; by default it will
  acquire one from the system.

A file named .mpd.conf must be present in the user's home directory
  with read and write access only for the user, and must contain at least
  a line with password=<password>
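
One way to create such a file with the right permissions is sketched below.
It is an illustration rather than part of mpd: write_mpd_conf and the
password "changeme" are placeholders, a POSIX file system is assumed, and
the demonstration writes into a scratch directory instead of the real home
directory.

```python
import os
import stat
import tempfile

def write_mpd_conf(path, password):
    """Create `path` holding password=<password>, owner read/write only."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, ("password=%s\n" % password).encode("utf-8"))
    finally:
        os.close(fd)
    # The mode given to os.open is filtered by the umask and is not applied
    # to a file that already existed, so set it explicitly as well.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)

# Demonstrate in a scratch directory rather than the real home directory.
scratch = tempfile.mkdtemp()
conf = os.path.join(scratch, ".mpd.conf")
write_mpd_conf(conf, "changeme")
print(oct(stat.S_IMODE(os.stat(conf).st_mode)))
```

For real use, pass os.path.expanduser("~/.mpd.conf") as the path and choose
your own password.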

This version of mpd is (0, 2, 0, 'Aug 2003 release')

************************************************************ mpdallexit

usage: mpdallexit (no args)
causes all mpds in the ring to exit

************************************************************ mpdcleanup

usage: mpdcleanup [-f <hostsfile>] [-r <rshcmd>] [-u <user>] [-c <cleancmd>] 
   or: mpdcleanup [--file=<hostsfile>] [--rsh=<rshcmd>] [--user=<user>] \
                  [--clean=<cleancmd>]
Removes the Unix socket on local (the default) and remote machines
This is useful in case the mpd crashed badly and did not remove it,
which it normally does.

************************************************************ mpiexec

usage:
mpiexec [ -h   or  -help   or  --help ]
mpiexec -file filename  # where filename contains xml for job description
mpiexec -configfile filename  # where filename contains cmd-line arg-sets
mpiexec [ -default defaultArgs : ] argset : more_arg_sets : ...
    where each argset contains some of:
        -n <n> -host <h> -wdir <w> -path <p> cmd args 
    note: cmd must be specified for each argset; it cannot be a default arg
    other default arguments can be -l (line labels on stdout, stderr) and
    -setenvall (pass entire environment of mpiexec to all processes),
    -env KEY1=VALUE1 -env KEY2=VALUE2 ...
    defaultArgs are passed to all processes unless overridden
sample executions:
    mpiexec -n 1 pwd : -wdir /tmp pwd : printenv
    mpiexec -default -n 2 -wdir /bin -env RMB3=e3 : pwd : printenv
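
The colon-separated layout is easy to generate from a script.  The sketch
below joins argsets (each a list of words) into one command line and
reproduces the two sample executions; mpiexec_command is a hypothetical
helper, and it does no shell quoting, so it assumes that no word contains
spaces.

```python
def mpiexec_command(argsets, defaults=None):
    """Join argsets (lists of words) into one mpiexec command line."""
    parts = []
    if defaults:
        # Default arguments come first, introduced by -default.
        parts.append(["-default"] + list(defaults))
    for argset in argsets:
        if not argset:
            raise ValueError("each argset needs at least a cmd")
        parts.append(list(argset))
    return "mpiexec " + " : ".join(" ".join(p) for p in parts)

print(mpiexec_command([["-n", "1", "pwd"], ["-wdir", "/tmp", "pwd"],
                       ["printenv"]]))
print(mpiexec_command([["pwd"], ["printenv"]],
                      defaults=["-n", "2", "-wdir", "/bin",
                                "-env", "RMB3=e3"]))
```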

************************************************************ mpdkilljob

usage: mpdkilljob  jobnum  [mpdid]  # as obtained from mpdlistjobs
   or: mpdkilljob  -a jobalias      # as obtained from mpdlistjobs
    mpdid is mpd where process 0 starts
    mpdid of form 1@linux02_32996 (may need \@ in csh)
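
If a script needs the pieces of such an id, it can be split as sketched
below.  The layout <number>@<host>_<port> is an assumption read off the one
example above, not a documented format, and parse_mpdid is a made-up name.

```python
def parse_mpdid(mpdid):
    """Split '<num>@<host>_<port>' into (num, host, port)."""
    num, at, rest = mpdid.partition("@")
    # Split at the last underscore, in case the host name contains one.
    host, underscore, port = rest.rpartition("_")
    if not (at and underscore and num.isdigit() and host and port.isdigit()):
        raise ValueError("unexpected mpdid: %r" % mpdid)
    return int(num), host, int(port)

print(parse_mpdid("1@linux02_32996"))   # (1, 'linux02', 32996)
```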

************************************************************ mpdlistjobs

usage: mpdlistjobs [-u | --user username] [-a | --alias jobalias]  [-j | --jobid jobid]
  (only use one of jobalias or jobid)
lists jobs being run by an mpd ring, all by default, or filtered
by user, mpd job id, or alias assigned when the job was submitted

************************************************************ mpdringtest

usage: mpdringtest [number of loops]
Times a single message going around the ring of mpds [num] times (default once)

************************************************************ mpdrun

mpdrun for mpd version: (0, 2, 0, 'Aug 2003 release')
usage: mpdrun [args] pgm_to_execute [pgm_args]
   where args may be: -a alias -np nprocs -hf hostsfile -l -1 -s
       (nprocs must be a positive integer)
       (-hf is a hostsfile containing names of nodes on which to run)
       (-l (ell) means attach line labels identifying which client prints each line)
       (-1 means do NOT start the first process locally)
       (-a means assign this alias to the job)
       (-s means send stdin to all processes; not just first)
or:    mpdrun -f filename
   where filename contains all the arguments in xml format

************************************************************ mpdsigjob

usage: mpdsigjob  sigtype  jobnum  [mpdid]  # as obtained from mpdlistjobs
   or: mpdsigjob  sigtype  -a jobalias      # as obtained from mpdlistjobs
    mpdid is mpd where process 0 starts
    mpdid of form 1@linux02_32996 (may need \@ in csh)
Delivers a Unix signal to all the application processes in the job

************************************************************ mpdtrace

usage: mpdtrace [-l]
Lists the (short) hostname of each of the mpds in the ring
The -l (ell) option shows full hostnames and listening ports
