
Parallel Execution Utility - dsh 0.1.3

  (C) A. Fachat <a.fachat@physik.tu-chemnitz.de>
     _________________________________________________________________
   
   When we got our computing clusters we found that we needed a tool to
   automatically distribute processes to the least loaded cluster node.
   This is exactly what this tool does: it takes a command and tries to
   run it on the least-loaded machine in a Linux cluster.
   
   It is a first (working) draft, and has some limitations.
   
   The tool consists of three parts:
     1. A daemon (Python 1.5 script) that keeps track of the current
        load state of each cluster machine. The daemons across the
        network communicate with each other. They know of each other by
        means of a cluster description file, which holds one nodename
        (+ relative load factor, + relative memory factor) per line. A
        daemon need not communicate with all nodes; with this file you
        can, for example, restrict the communication to the
        neighbouring nodes.
        When connected to at one TCP port, this daemon returns a list
        of machines with their respective loads. On another TCP port it
        returns the best machine according to its internal evaluation
        function. At the same time this machine gets an additional,
        virtual load for a short time. This accounts for network
        latency, i.e. the time from starting the command until the new
        load has been reported back to the daemon.
     2. A script (Python 1.5 script) that connects to either port of
        the daemon and uses the returned list for internal evaluation.
        The given command is then sent to the best machine. It uses rsh
        or ssh to connect to the remote machine.
     3. A helper script (/bin/sh script) that has to be on the remote
        machine. It handles the file locking that is used to cope with
        NFS latency - basically dsh creates a file and the remote
        process is only started once this file exists. Conversely, dsh
        assumes all files from the remote process have been written
        once the file no longer exists.
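   The client side of step 2 boils down to connecting to one of the
   daemon's TCP ports, reading the reply, and closing. A minimal sketch
   of such a query in modern Python (the reply format - a single
   hostname, with the daemon closing the connection when done - is an
   assumption for illustration):

```python
import socket

def query_best_node(host="localhost", port=8282, timeout=5.0):
    """Ask a dshd-style daemon for its best node and return the reply.

    Port 8282 is the default 'best node' (rxport) port from the usage
    section; the exact reply format is assumed here.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = b""
        while True:
            chunk = sock.recv(1024)
            if not chunk:  # daemon closes the connection when done
                break
            data += chunk
    return data.decode().strip()
```

   The returned hostname could then be handed to rsh or ssh, which is
   what dsh does.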
       
   You need Python (1.5 or better) for dsh and dshd.
   
   It uses the Linux /proc filesystem to determine the load and the
   free memory. It has been tested with (non-SMP) Linux 2.0.29 and
   2.0.32. There is also an LSM entry.
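   On Linux the two inputs are one line from /proc/loadavg and the
   MemFree line from /proc/meminfo. A small sketch of parsing them
   (field layout as in Linux 2.x; the helper names are mine):

```python
def parse_loadavg(text):
    """Parse /proc/loadavg content, e.g. '0.42 0.30 0.25 1/123 4567'.

    Returns the 1-, 5- and 15-minute load averages.
    """
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)

def parse_meminfo_free_kb(text):
    """Return the MemFree value in kB from /proc/meminfo-style text."""
    for line in text.splitlines():
        if line.startswith("MemFree:"):
            return int(line.split()[1])
    raise ValueError("MemFree not found")
```

   On a Linux box the daemon would feed these with
   open('/proc/loadavg').read() and open('/proc/meminfo').read().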
   
   Furthermore it assumes that all binaries and files are at the same
   places on the remote machine. This means the directory where it is
   executed should be the same as on the local machine.
   
   It does not catch signals sent to the process. I/O redirection is
   not explicitly handled either, although it should work as usual,
   with the shell taking care of it before invoking dsh.
   
   The evaluation function in the daemon (as well as in the script)
   can easily be changed to fit your needs.
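   As an illustration, a hypothetical evaluation function of this kind
   might combine the weighted load with a small penalty for scarce free
   memory. This is not the daemon's actual formula, just a sketch of
   the shape such a function can take:

```python
def evaluate(load, free_mem_kb, load_weight=1.0, mem_weight=1.0):
    """Score a node; lower is better. Hypothetical formula:
    weighted load plus a penalty that grows as free memory shrinks."""
    mem_penalty = 1.0 / (mem_weight * max(free_mem_kb, 1))
    return load_weight * load + mem_penalty

def best_node(nodes):
    """Pick the best node. nodes maps a name to a tuple
    (load, free_mem_kb, load_weight, mem_weight)."""
    return min(nodes, key=lambda name: evaluate(*nodes[name]))
```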
     _________________________________________________________________
   
  What's new?
  
      * 0.1.3 adds an rc.dsh script to start the daemon at startup.
        Also dshd has been reorganized a bit so that it only goes to
        the background once all the setup has completed correctly.
        This script can be tailored to run the daemon as user "nobody".
     _________________________________________________________________
   
  Acknowledgements
  
   Thanks go to Alexander Schreiber for explaining to me where the
   'stale NFS file handle' comes from :-). Thanks go to Tino Schwarze
   for discussions and comments. He wrote a similar package, CLSH,
   where the daemon itself starts the remote process to avoid rsh/ssh
   latency. This is a good point, but with my approach the daemons can
   be run as nobody and the user has to set up the usual privileges
   for rsh/ssh.
   There is also the perfs Perl script (it runs on Suns, and I don't
   do Perl anyway...) and the Beowulf cluster procps utilities. The
   problem is that the procps utilities try to get the current values
   each time any of the cluster utilities is invoked. This is horribly
   slow, especially if one machine is down. In my approach the local
   daemon takes care of that with UDP packets. If it does not receive
   any, the machine does not exist for the daemon. The Mosix approach
   goes even further, but there you need a kernel patch etc., and it
   is not ready for Linux.
     _________________________________________________________________
   
  Usage
  
Usage: dshd [-f clusterdesc] [-b] [-p] [-t txport] [-r rxport]
  -f filename    = location of cluster file
  -b             = fork to background
  -p             = print own pid on startup (only if background)
  -t txport      = use other port than  8181  for full status
  -r rxport      = use other port than  8282  for best node report

   dshd -b starts the daemon in the background. It reads the cluster
   description file specified in the script (or given with the -f
   option) that defines the cluster. The cluster file looks like:

newton.foo.bar 1.0 1.0
galileo.foo.bar 2.0 2.0
kepler.foo.bar 2.0

   The first number behind the nodename is the load weight. In this
   example it means that galileo and kepler are half as fast as the
   newton machine - they need double the time. The second number is a
   memory weight similar to the load weight, except that a higher
   value means faster RAM. The weights are optional.
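   Parsing such a file is straightforward. A sketch (that a missing
   memory weight also defaults to 1.0 is my assumption):

```python
def parse_cluster_file(text):
    """Parse a cluster description: 'nodename [loadweight [memweight]]'
    per line. Missing weights default to 1.0 (assumed)."""
    cluster = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        load_weight = float(parts[1]) if len(parts) > 1 else 1.0
        mem_weight = float(parts[2]) if len(parts) > 2 else 1.0
        cluster[parts[0]] = (load_weight, mem_weight)
    return cluster
```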
   
   The daemon sends its state to those machines and accepts state
   reports only from this cluster.

    dsh command

   This connects to the local daemon and remote-execs the command on the
   least-loaded machine.
     _________________________________________________________________
   
  Copying Policy
  
   This package is distributed under the GNU General Public License.
     _________________________________________________________________
   
  Download
  
     * dsh-0.1.2.tar.gz the full package with docs.
     * dshd daemon script (0.1.2)
     * dsh remote exec script (0.1.2)
     * dshexec shell script that wraps around the command on the remote
       machine.
     _________________________________________________________________
   
Documentation and random comments

   Here are some comments from the files:
# dshd v0.1.0 (c) Andre Fachat
# distributed under GPL (this is too small to include a copy, go to
# www.gnu.org to get a copy or refer to your favorite GNU program for the
# file COPYING)
#
# This daemon runs in the background on each computer in a cluster.
# The cluster is defined in the file etcname (see below)
# The format is one machine per line with
#    machinename loadscale memscale
# where loadscale and memscale are multiplied with the respective
# load and mem values before evaluation.
# The daemon sends its state information (load, mem) to all machines
# in the cluster. Then it tries to receive the information
# from the other machines. If it does not receive a state info during
# maxloops loops it removes the machine from the list - it might be down.
#
# Telnetting to stport gives the state info of the complete cluster
# Telnetting to dport gives the state of the best machine. The load
# of this machine is locally increased (extraload) to handle the latency
# between starting and the new state info to be received.
#
# This is not exactly an example of good programming.
# I am especially inexperienced with socket programming, so this might
# be improvable.
# Also there may be memory leaks that I have not found.
#
# Further possible improvements:
# - cluster definition also by broadcast addresses
# - include memory value in evaluation
# - make evaluation function more flexible
#
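   The "remove after maxloops loops" bookkeeping can be pictured as a
   table of last-seen timestamps that is pruned on every query. A
   sketch (I model maxloops times the loop interval as a max_age in
   seconds; the class name is mine):

```python
import time

class NodeTable:
    """Track when each node was last heard from; nodes not seen
    within max_age seconds are considered down (a hypothetical
    simplification of dshd's maxloops counter)."""

    def __init__(self, max_age=10.0):
        self.max_age = max_age
        self.last_seen = {}

    def report(self, node, now=None):
        """Record a state report from a node."""
        self.last_seen[node] = time.time() if now is None else now

    def alive(self, now=None):
        """Return the nodes heard from within the last max_age seconds."""
        now = time.time() if now is None else now
        return [node for node, t in self.last_seen.items()
                if now - t <= self.max_age]
```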

   That was from the daemon; and this from the script:
# dsh v0.1.1 (c) Andre Fachat
# distributed under GPL (this is too small to include a copy, go to
# www.gnu.org to get a copy or refer to your favorite GNU program for the
# file COPYING)
#
# This handy script uses the dshd daemon to find the currently least
# loaded machine in a cluster. It then distributes the command given
# to this machine (via rsh or ssh). The directory where dsh is started must
# be at the same place on the remote machine.
# To avoid NFS problems a temporary file is created by the local
# process and the remote process waits for it to exist
# (needs the "waitfile" shell script). After completion the remote
# process removes the file and exits.
# The local process waits for the child to terminate and then waits
# for the temp file to disappear, to be sure all NFS stuff has been done.
#
# possible improvements:
# - catch SIGINT and send to remote process
# - own cmdline options for verbosity (print remote host name) etc
#
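   The handshake described above - the local side creates a temp file,
   the remote side waits for it to appear, removes it when done, and
   the local side waits for it to disappear - amounts to two polling
   loops. A sketch (polling interval and timeout are my choices):

```python
import os
import time

def wait_for_file(path, timeout=30.0, poll=0.1):
    """Remote side: block until path exists, i.e. until the NFS
    server has made the local side's file visible."""
    deadline = time.time() + timeout
    while not os.path.exists(path):
        if time.time() > deadline:
            return False
        time.sleep(poll)
    return True

def wait_for_removal(path, timeout=30.0, poll=0.1):
    """Local side: block until path is gone, i.e. until the remote
    process has finished and its NFS writes are complete."""
    deadline = time.time() + timeout
    while os.path.exists(path):
        if time.time() > deadline:
            return False
        time.sleep(poll)
    return True
```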

   It is not (yet :-) perfect. Sometimes NFS seems to cause weird
   problems that have not yet been solved.
   
   The scripts have been tested with Linux 2.0.29 as cluster machines and
   a Sun Ultra with Solaris 2.5 as NFS server.
   
   One word about clustered compiling:
I tried to compile the Linux kernel on the cluster. However, one of
the machines was preloaded with 0.9 load already, and two seemed to have
gone swapping due to other memory-intensive stuff...

I tried three runs on a single machine (with make, make -j2 and make -j4)
and three runs on the cluster, with 2, 4 and 8 parallel compilers.
I simply redefined MAKE and CC in the toplevel Makefile to
"make -j2" and "dsh.py gcc" resp.

make:                   5m47
make -j2:               5m13
make -j4:               5m12
dsh.py, make -j2        8m38  didn't do version.o
dsh.py, make -j4        5m44  worked with 0.1.0, didn't do version.o with 0.1.1
dsh.py, make -j8        4m37 and 4m48 on two runs with 0.1.1

The load on the NFS server increased (from practically 0.0 to 0.15...).
BTW, the network is 10 MBit Ethernet (10BaseT) via a hub.
xosview showed that the main machine was mostly doing network stuff
(the IRQ for the network card was almost always on and the CPU was doing
a lot of work in the system).

I guess normal compiles are just too short and too NFS-dependent to
distribute effectively.

However, I don't know where the version.o thing comes from...

   One more thing: the daemons communicate with each other
   approximately every 1.5 seconds. This might increase your net load!
   Currently there is no feature to use broadcasts.
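   The state exchange itself is connectionless: each daemon
   periodically fires one UDP packet at every peer. A sketch of the
   sending side (the port number and the JSON payload are purely
   illustrative; the real packet format is not documented here):

```python
import json
import socket

def send_state(sock, peers, load, free_mem_kb, port=8383):
    """Send this node's (load, free memory) state to each peer via
    UDP. Port 8383 and the JSON payload are made up for this sketch."""
    packet = json.dumps({"load": load, "memfree": free_mem_kb}).encode()
    for peer in peers:
        sock.sendto(packet, (peer, port))
```

   A receiving daemon would recvfrom() on the same port, decode the
   packet, and update its node table.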
     _________________________________________________________________
   
   Contents last modified 03 Aug 1998
