$Id: README,v 1.10 1997/05/07 15:47:57 thc Exp $
$Log: README,v $
Revision 1.10  1997/05/07 15:47:57  thc
Added a tiny bit more about factoring, at James's suggestion.

Revision 1.9  1997/05/07 12:59:58  thc
Added list of needed files.

Revision 1.8  1997/05/06 22:04:41  thc
Added copyright notice.

Revision 1.7  1997/05/06 19:24:24  thc
Renamed test to test_bmmc and test_comm to test_bmmc_comm.

Revision 1.6  1997/05/05 21:19:53  thc
Added an item on other useful functions.

Revision 1.5  1997/05/05 20:58:41  thc
Minor formatting change.

Revision 1.4  1997/05/05 20:56:25  thc
Added clarification about matrix transpose involving 2 matrices.

Revision 1.3  1997/05/05 19:36:04  thc
Minor change.

Revision 1.2  1997/05/05 19:34:59  thc
Minor change.

Revision 1.1  1997/05/05 19:33:17  thc
Initial revision

------------------------------------------------------------------------
Copyright (C) 1997, Thomas H. Cormen, thc@cs.dartmouth.edu

This software may be freely copied, modified, and redistributed,
provided that this copyright notice is preserved on all copies.

There is no warranty or other guarantee of fitness for this
software, and it is provided solely "as is".  Bug reports or fixes
may be sent to the author, who may or may not act on them as he
desires.

Rights are granted to use this software in any non-commercial
enterprise.  For commercial rights to this software, please contact
the author.
------------------------------------------------------------------------


This library contains C-callable functions to perform BMMC
permutations on multiprocessor systems.  Interprocessor communication
is via MPI calls.

To build the library, named libbmmc_mpi.a, you will need the following
files:
  Makefile
  bit_matrix_fns.h
  bit_matrix_fns.c
  bit_matrix_types.h
  bmmc_mpi.h
  bmmc_mpi.c

To build the test programs, you will also need the following files:
  test_bmmc_mpi.c
  test_bmmc_mpi_comm.c



Contents of this README file:
1. What are BMMC permutations?
2. Why are BMMC permutations useful?
3. Why should I use this package?
4. How do I specify a BMMC permutation using this package?
5. How do I call a function to perform a BMMC permutation using
   this package?
6. How do I build the library that contains the function that performs
   BMMC permutations?
7. How do I test that the library runs correctly?
8. What other functions might I find useful?
9. Who do I contact with further questions?



1  WHAT ARE BMMC PERMUTATIONS?

A permutation of data is a one-to-one mapping in which the data at
each source index maps to the position given by a corresponding target
index.  Source and target indices range from 0 to N-1 for an N-element
permutation.

A BMMC (bit-matrix-multiply/complement) permutation on N elements is
defined only when N is an integer power of 2.  That is, let n = lg N
be an integer.  (lg denotes the log-base-2 function.)  We treat each
index x as an n-bit vector (x_0, x_1, ..., x_(n-1)).

To specify a BMMC permutation, we use an n x n characteristic matrix A
= (a_ij) whose entries are 0 or 1 and which is invertible over GF(2).
(Think of matrix multiplication over GF(2) as standard matrix
multiplication but with all arithmetic performed modulo 2.
Equivalently, replace multiplication by logical-and and addition by
exclusive-or.)  The specification for BMMC permutations also includes
a complement vector c = (c_0, c_1, ..., c_(n-1)) of length n.

To determine the target index y = (y_0, y_1, ..., y_(n-1)) that a
source index x = (x_0, x_1, ..., x_(n-1)) maps to, we perform
matrix-vector multiplication over GF(2) and then complement some
subset of the resulting bits: y = Ax + c, where + denotes
exclusive-or.  Written out as full matrices and vectors:

 +-       -+
 |   y_0   |
 |   y_1   |
 |   y_2   | =
 |   ...   |
 | y_(n-1) |
 +-       -+

     +-                                     -+ +-       -+   +-       -+
     | a_00      a_01      ... a_0,(n-1)     | |   x_0   |   |   c_0   |
     | a_10      a_11      ... a_1,(n-1)     | |   x_1   |   |   c_1   |
     | a_20      a_21      ... a_2,(n-1)     | |   x_2   | + |   c_2   |
     | ...       ...       ... ...           | |   ...   |   |   ...   |
     | a_(n-1),0 a_(n-1),1 ... a_(n-1),(n-1) | | x_(n-1) |   | c_(n-1) |
     +-                                     -+ +-       -+   +-       -+

(Note our convention that least significant bits are toward the top
and left.)

As long as the matrix A is invertible over GF(2), the mapping of
source indices x to target indices y is one-to-one and thus a
permutation.
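
As a minimal sketch of the mapping y = Ax + c, the following C
functions compute a target index one row at a time and check by brute
force that the mapping is one-to-one.  This is an illustration only,
not code from this library.  The names parity, target_index, and
is_one_to_one are hypothetical, and the sketch assumes that row i of A
is packed into a 32-bit word whose bit j holds a_ij.  (The library
itself stores matrices by columns, as described in item 4.)

```c
#include <assert.h>
#include <stdint.h>

/* Parity of w: 1 if w has an odd number of 1 bits, 0 otherwise. */
static unsigned parity(uint32_t w)
{
  unsigned par = 0;

  while (w != 0) {
    par ^= 1u;
    w &= w - 1;                 /* clear the lowest set bit */
  }
  return par;
}

/* Hypothetical helper: returns y = Ax + c over GF(2).
   rows[i] holds row i of A, with bit j of rows[i] equal to a_ij.
   Bit i of x, c, and the result holds x_i, c_i, and y_i. */
uint32_t target_index(const uint32_t rows[], uint32_t c, int n, uint32_t x)
{
  uint32_t y = 0;
  int i;

  assert(n >= 1 && n <= 32);
  for (i = 0; i < n; i++)
    y |= (uint32_t) parity(rows[i] & x) << i;   /* y_i = (row i) . x */
  return y ^ c;                                 /* add c by exclusive-or */
}

/* Returns 1 if x -> Ax + c reaches every target index exactly once.
   Brute force over all N = 2^n indices, so this sketch assumes n <= 16. */
int is_one_to_one(const uint32_t rows[], uint32_t c, int n)
{
  static unsigned char seen[1 << 16];
  uint32_t N, x;

  assert(n >= 1 && n <= 16);
  N = (uint32_t) 1 << n;
  for (x = 0; x < N; x++)
    seen[x] = 0;
  for (x = 0; x < N; x++) {
    uint32_t y = target_index(rows, c, n, x);
    if (seen[y])
      return 0;                 /* two sources share a target */
    seen[y] = 1;
  }
  return 1;
}
```

For example, with n = 3 and rows {3, 6, 4} (a 3 x 3 matrix with 1's on
the diagonal and just to its right), is_one_to_one() returns 1, whereas
rows {1, 1, 4} give a singular matrix and it returns 0.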

BMMC permutations are also known as affine transformations or, if the
complement vector is all 0, linear transformations or bit-linear
transformations.



2  WHY ARE BMMC PERMUTATIONS USEFUL?

Although not all permutations on N elements can be expressed as BMMC
permutations, many useful ones can be.  When a permutation can be
expressed as BMMC, its representation is compact.  Rather than a
vector of N target indices, only (lg N)^2 + lg N bits are needed to
represent the characteristic matrix and complement vector.  Moreover,
you can use the functions in this package to perform BMMC permutations
quickly.

Any BPC (bit-permute/complement) permutation is BMMC.  In a BPC
permutation, we form each target index by permuting the bits of each
source index according to a fixed permutation on the lg N bit
positions and then complementing a subset of the bits.  (The BMMC
characteristic matrix for a BPC permutation is a permutation matrix,
which means that it has one 1 in each row and in each column.)

The following permutations are BPC, and hence BMMC:

* Matrix transposition when dimensions are powers of 2.  If there are N
  entries in an r x s matrix stored in row-major order, then each (lg
  N)-bit index consists of a (lg r)-bit row number in the most
  significant bits followed by a (lg s)-bit column number in the least
  significant bits.  Transposing an r x s matrix entails mapping the
  (i,j) entry to the (j,i) position.  In other words, swap the upper lg
  r bits and the lower lg s bits.  This is done by a cyclic rotation by
  lg s bits to the right (or lg r bits to the left).  The characteristic
  matrix is the n x n identity matrix (n = lg N), but with columns
  rotated lg s positions to the right:

         lg s   lg r
      +-      |      -+        +-   -+          +-   -+
      |       |       |        |     |          |     |
      |   0   |   I   | lg r   | col | lg s     | row | lg r
      |       |       |        |     |          |     |
      |-------+-------|        |-----|       =  |-----|
      |       |       |        |     |          |     |
      |   I   |   0   | lg s   | row | lg r     | col | lg s
      |       |       |        |     |          |     |
      +-      |      -+        +-   -+          +-   -+

  Submatrix and subvector dimensions are indicated, and the submatrices
  are either 0 or identity matrices.  The complement vector is all 0 and
  is omitted here.

  Note that there are two matrices involved here.  The data is an r x s
  matrix, with entry types unspecified.  The characteristic matrix is an
  n x n matrix, with each entry being one bit.

* Bit-reversal permutations, which are often used in performing FFTs.  The
  record with source index x = (x_0, x_1, ..., x_(n-1)) maps to target
  index y = (x_(n-1), x_(n-2), ..., x_1, x_0).  The characteristic
  matrix has 1's only on the antidiagonal:

      +-           -+
      | 0 0 ... 0 1 |
      | 0 0 ... 1 0 |
      |     ...     |
      | 0 1 ... 0 0 |
      | 1 0 ... 0 0 |
      +-           -+

  Again, the complement vector is all 0.

* Vector-reversal permutations.  The record with source index i maps to
  target index N-i-1 for i = 0, 1, ..., N-1.  Here the characteristic
  matrix is the identity matrix and the complement vector is all 1's.

* Hypercube permutations.  The source and target indices differ in only
  one bit.  The characteristic matrix is the identity matrix and the
  complement vector has a 1 only in the bit position that changes.
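
The four BPC permutations above can also be written directly as
operations on the index bits.  The following sketch is illustrative
only: the function names are hypothetical and are not part of this
package, and indices are assumed to fit in 31 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Transposition of an r x s row-major matrix, with r = 2^lg_r and
   s = 2^lg_s: rotate the n-bit index right by lg_s bits, where
   n = lg_r + lg_s. */
uint32_t transpose_index(uint32_t x, int lg_r, int lg_s)
{
  int n = lg_r + lg_s;
  uint32_t mask;

  assert(lg_r >= 0 && lg_s >= 0 && n >= 1 && n <= 31);
  mask = ((uint32_t) 1 << n) - 1;
  return ((x >> lg_s) | (x << lg_r)) & mask;
}

/* Bit reversal: source bit i becomes target bit n-1-i. */
uint32_t bit_reverse_index(uint32_t x, int n)
{
  uint32_t y = 0;
  int i;

  assert(n >= 1 && n <= 31);
  for (i = 0; i < n; i++)
    if ((x >> i) & 1)
      y |= (uint32_t) 1 << (n - 1 - i);
  return y;
}

/* Vector reversal: complementing all n bits maps i to N-i-1. */
uint32_t vector_reverse_index(uint32_t x, int n)
{
  assert(n >= 1 && n <= 31);
  return x ^ (((uint32_t) 1 << n) - 1);
}

/* Hypercube: complement only bit k. */
uint32_t hypercube_neighbor(uint32_t x, int k)
{
  assert(k >= 0 && k <= 31);
  return x ^ ((uint32_t) 1 << k);
}
```

For a 4 x 8 matrix (lg r = 2, lg s = 3), entry (3,5) has source index
3*8 + 5 = 29, and transpose_index(29, 2, 3) yields 5*4 + 3 = 23, the
row-major index of entry (5,3) in the 8 x 4 transpose.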

The following permutations are BMMC but not BPC:

* Gray code.  To map an ordinal number to its corresponding value in the
  standard binary reflected Gray code, the characteristic matrix is

      +-                   -+
      | 1 1 0 0 ... 0 0 0 0 |
      | 0 1 1 0 ... 0 0 0 0 |
      | 0 0 1 1 ... 0 0 0 0 |
      |         ...         |
      | 0 0 0 0 ... 1 1 0 0 |
      | 0 0 0 0 ... 0 1 1 0 |
      | 0 0 0 0 ... 0 0 1 1 |
      | 0 0 0 0 ... 0 0 0 1 |
      +-                   -+

  The complement vector is 0.

* Inverse Gray code.  To map a number in a binary reflected Gray code to
  its ordinal number, the characteristic matrix is

      +-                   -+
      | 1 1 1 1 ... 1 1 1 1 |
      | 0 1 1 1 ... 1 1 1 1 |
      | 0 0 1 1 ... 1 1 1 1 |
      |         ...         |
      | 0 0 0 0 ... 1 1 1 1 |
      | 0 0 0 0 ... 0 1 1 1 |
      | 0 0 0 0 ... 0 0 1 1 |
      | 0 0 0 0 ... 0 0 0 1 |
      +-                   -+

  The complement vector is 0.  Note that this matrix is the inverse over
  GF(2) of the characteristic matrix for Gray code.
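
Both Gray-code matrices correspond to familiar shift-and-exclusive-or
idioms.  A minimal sketch follows, with hypothetical function names
that are not part of this package.  Row i of the Gray code matrix has
1's in columns i and i+1, so y_i = x_i + x_(i+1).  Row i of the inverse
matrix has 1's in columns i through n-1, so x_i is the exclusive-or of
bits i and above.

```c
#include <stdint.h>

/* Ordinal number -> binary reflected Gray code:
   y_i = x_i XOR x_(i+1). */
uint32_t gray_encode(uint32_t x)
{
  return x ^ (x >> 1);
}

/* Binary reflected Gray code -> ordinal number:
   x_i = g_i XOR g_(i+1) XOR ... XOR g_(n-1). */
uint32_t gray_decode(uint32_t g)
{
  uint32_t x = 0;

  for (; g != 0; g >>= 1)
    x ^= g;
  return x;
}
```

Because the two matrices are inverses over GF(2),
gray_decode(gray_encode(x)) returns x for every x.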


BMMC permutations have some other nice properties:

* They are closed under inverse, i.e., the inverse of a BMMC permutation
  is a BMMC permutation.  In particular, if y = Ax + c, then the inverse
  permutation is given by x = A^(-1) y + A^(-1) c, where A^(-1) is the
  inverse of A over GF(2).

* They are closed under composition.  If y = Ax + c and z = A'y + c',
  then z = (A' A)x + (A'c + c').  That is, the characteristic matrix of
  the composition is the matrix product A' A, and the complement vector
  is A'c + c'.
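
The composition rule follows by substitution: z = A'(Ax + c) + c' =
(A' A)x + (A'c + c').  The sketch below checks this numerically.  It is
not code from this package: mat_vec and compose are hypothetical names,
and each matrix is assumed to be an array of packed 32-bit columns, in
the manner of the column representation described in item 4.

```c
#include <assert.h>
#include <stdint.h>

/* Returns M v over GF(2).  cols[j] is column j of M, packed so that
   bit i of cols[j] is m_ij.  The product is the exclusive-or of the
   columns selected by the 1 bits of v. */
uint32_t mat_vec(const uint32_t cols[], uint32_t v, int n)
{
  uint32_t y = 0;
  int j;

  assert(n >= 1 && n <= 32);
  for (j = 0; j < n; j++)
    if ((v >> j) & 1)
      y ^= cols[j];
  return y;
}

/* Composes y = A x + c followed by z = A2 y + c2.
   Fills prod with the matrix A2 A and returns the complement
   vector A2 c + c2.  prod must not overlap A or A2. */
uint32_t compose(uint32_t prod[], const uint32_t A2[], uint32_t c2,
                 const uint32_t A[], uint32_t c, int n)
{
  int j;

  for (j = 0; j < n; j++)
    prod[j] = mat_vec(A2, A[j], n);   /* column j of A2 A = A2 (column j of A) */
  return mat_vec(A2, c, n) ^ c2;
}
```

For example, with n = 4, taking A to be the Gray code matrix (columns
1, 3, 6, 12) and A2 to be the inverse Gray code matrix (columns 1, 3,
7, 15) makes prod the identity matrix, as expected for a matrix and its
inverse.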



3  WHY SHOULD I USE THIS PACKAGE?

There are two good reasons to use this package: it's easy to use, and
it's fast.

Please note, however, that like the number of elements, the number of
processors must be an integer power of 2.

It's easy to use because all the hard work is done for you.  You just
have to specify the characteristic matrix, complement vector, and some
additional parameters that make it flexible.  You can even specify
which bits of the index contain the processor number.  (If there are P
processors, then lg P consecutive bits of an index determine which
processor the element resides in.)  This facility lets you organize
your data in processor-major order (the most significant
lg P bits contain the processor number, so that processor 0 contains
elements 0 to N/P-1, processor 1 contains N/P to 2N/P-1, etc.),
processor-minor order (the least significant lg P bits contain the
processor number, so that processor k contains all elements congruent
to k modulo P), or anything in between.
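
To make the layouts concrete, here is a minimal sketch of how an index
splits into a processor number and an offset within that processor.
The function names are hypothetical and not part of this package.  The
sketch writes f for the position of the least significant
processor-number bit, and it ASSUMES that the offset within a processor
is formed from the remaining n-p index bits kept in their original
order; the library's actual local ordering is not spelled out here.

```c
#include <assert.h>
#include <stdint.h>

/* Processor number: bits f, f+1, ..., f+p-1 of index x. */
uint32_t proc_of(uint32_t x, int f, int p)
{
  assert(f >= 0 && p >= 0 && f + p <= 31);
  return (x >> f) & (((uint32_t) 1 << p) - 1);
}

/* Offset within the processor: the bits of x outside the processor
   field, squeezed together (an assumption of this sketch). */
uint32_t offset_of(uint32_t x, int f, int p)
{
  uint32_t low, high;

  assert(f >= 0 && p >= 0 && f + p <= 31);
  low = x & (((uint32_t) 1 << f) - 1);   /* bits below the field */
  high = x >> (f + p);                   /* bits above the field */
  return (high << f) | low;
}

/* Inverse of the two functions above. */
uint32_t global_index(uint32_t rank, uint32_t offset, int f, int p)
{
  uint32_t low, high;

  assert(f >= 0 && p >= 0 && f + p <= 31);
  low = offset & (((uint32_t) 1 << f) - 1);
  high = offset >> f;
  return (high << (f + p)) | (rank << f) | low;
}
```

With N = 16 and P = 4 (n = 4, p = 2), index 13 lies in processor 3 at
offset 1 under processor-major order (f = 2), and in processor 1 at
offset 3 under processor-minor order (f = 0).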

It's fast because the algorithm used is developed specifically for
interprocessor BMMC permutations.  Suppose that processor j contains m
elements that are destined for processor k.  Then processor j sends
only one message to processor k, and it contains all m elements.  As
few messages as possible are sent, which reduces message overhead.
Moreover, the sending and receiving processor implicitly agree on the
source and target indices of the elements in the message, and so no
indices are ever transmitted.  This protocol saves on bandwidth, since
the only information sent is data.  The MPI function
MPI_Sendrecv_replace() is used for even greater efficiency.  Finally,
if the same BMMC permutation is to be performed multiple times on
different data, the preprocessing performed before data transmission
can be factored out into a separate call.  That is, you can preprocess
once and then perform the same BMMC permutation on multiple sets of
data, saving the modest preprocessing cost.



4  HOW DO I SPECIFY A BMMC PERMUTATION USING THIS PACKAGE?

A characteristic matrix is stored by columns.  (If you are accustomed
to row-major storage of matrices, this might take a little getting
used to, but there are great advantages to storing characteristic
matrices by columns.)  The file bit_matrix_types.h defines a typedef
for matrix_column; this should be an unsigned 32-bit or 64-bit word.
(If the data you wish to permute has at most 2^(32) elements
altogether, a 32-bit word should suffice.  Otherwise, you should use a
64-bit word.)  A matrix_column is just a packed bit sequence.  For
example, to specify the matrix column

    +- -+
    | 0 |
    | 1 |
    | 0 |
    | 1 |
    | 1 |
    | 0 |
    | 1 |
    | 0 |
    +- -+

you could write either of the following:
((matrix_column) 0x5a)
  OR
(((matrix_column) 1) << 1) | (((matrix_column) 1) << 3) |
   (((matrix_column) 1) << 4) | (((matrix_column) 1) << 6)

A bit_matrix is an array of matrix_column.  Array element 0 is the
leftmost (0th) column, and so on.  Using the functions
allocate_bit_matrix() and free_bit_matrix() in bit_matrix_fns.h and
bit_matrix_fns.c, we could create, use, and free an n x n identity
matrix I as follows:

  bit_matrix I = allocate_bit_matrix(n);
  int j;

  for (j = 0; j < n; j++)
    I[j] = ((matrix_column) 1) << j;

  /* code that uses I goes here */

  free_bit_matrix(I);

An easier way to create an identity matrix would be to use the
function identity_matrix() in bit_matrix_fns.h and bit_matrix_fns.c:

  bit_matrix I = allocate_bit_matrix(n);
  identity_matrix(I, n);
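
One convenient consequence of column storage is that column j of A is
simply the image of the index 2^j when the complement vector is 0.  The
sketch below uses that fact to derive the columns and the complement
vector from an index-mapping function.  It is illustrative only:
index_map and matrix_from_map are hypothetical names, a uint32_t stands
in for matrix_column, and a plain array stands in for bit_matrix.  The
result is meaningful only if the mapping really is BMMC; the sketch
does not check that.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: a function that maps an n-bit source index to
   its target index. */
typedef uint32_t (*index_map)(uint32_t x, int n);

/* Fills cols[0..n-1] with the columns of A and returns c such that
   map(x) = A x + c, assuming that map is a BMMC permutation. */
uint32_t matrix_from_map(uint32_t cols[], index_map map, int n)
{
  uint32_t c;
  int j;

  assert(n >= 1 && n <= 31);
  c = map(0, n);                               /* A 0 + c = c */
  for (j = 0; j < n; j++)
    cols[j] = map((uint32_t) 1 << j, n) ^ c;   /* A e_j is column j */
  return c;
}

/* Sample mappings from item 2. */
uint32_t vector_reversal(uint32_t x, int n)
{
  return (((uint32_t) 1 << n) - 1) - x;        /* i -> N-i-1 */
}

uint32_t bit_reversal(uint32_t x, int n)
{
  uint32_t y = 0;
  int i;

  for (i = 0; i < n; i++)
    if ((x >> i) & 1)
      y |= (uint32_t) 1 << (n - 1 - i);
  return y;
}

uint32_t gray_code(uint32_t x, int n)
{
  (void) n;                                    /* unused */
  return x ^ (x >> 1);
}
```

For n = 4, vector_reversal yields the identity columns 1, 2, 4, 8 with
complement vector 0xf, bit_reversal yields columns 8, 4, 2, 1, and
gray_code yields columns 1, 3, 6, 12, both with complement vector 0.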



5  HOW DO I CALL A FUNCTION TO PERFORM A BMMC PERMUTATION USING
   THIS PACKAGE?

You have many options.  In each case, there are P processors (P must
be a power of 2) within an MPI communicator.  Each processor has a
unique rank from 0 to P-1 and contains N/P elements to permute in a
buffer named "data".  The permutation is performed "in place" in the
sense that the permuted elements end up in the "data" buffer, although
probably in a different processor/location.  It is NOT in place in the
strict sense, however: some of the elements are copied to a temporary
buffer that you, the caller, must allocate, pass to the permutation
function, and free when you are done.

The easiest way to perform a BMMC permutation is to call either of the
following functions from bmmc_mpi.h and bmmc_mpi.c:

int BMMC_MPI_proc_major(bit_matrix A,
			matrix_column c,
			int n,
			int p,
			int rank,
			MPI_Comm comm,
			int size,
			void *data,
			void *temp);

int BMMC_MPI_proc_minor(bit_matrix A,
			matrix_column c,
			int n,
			int p,
			int rank,
			MPI_Comm comm,
			int size,
			void *data,
			void *temp);

The parameters and return values are as follows:

  A is the characteristic matrix, as described above.

  c is the complement vector, given as a matrix_column.

  n is the log-base-2 of the problem size, and it should equal the
  number of rows and columns of A.

  p is the log-base-2 of the number of processors.

  rank is the rank of this processor in the MPI communicator.

  comm is the MPI communicator.  You would normally use MPI_COMM_WORLD,
  but you can use any communicator you want as long as it has a
  power-of-2 number of processors.

  size is the size of each element to permute, in bytes.

  data points to the buffer within the processor containing the elements
  to permute.  data should contain ((N/P) * size) bytes per processor.

  temp points to another buffer within the processor, of the same size
  as data.  You must allocate this buffer before the call, and it is
  your responsibility to free it afterward.

  The return value is an error code.  Error codes are listed near the
  beginning of bmmc_mpi.h.  A return value of 0 indicates that the call
  completed without error, and a nonzero return value indicates an
  error.

The difference between the two calls is that in BMMC_MPI_proc_major(),
the data is assumed to be in processor-major order, described in item
3 above.  That is, the processor-number bits are bits n-p, n-p+1, ...,
n-1.  In BMMC_MPI_proc_minor(), the data is in processor-minor order,
and so the processor-number bits are bits 0, 1, ..., p-1.

If you are using some other data layout in which the processor-number
bits are f, f+1, ..., f+p-1, call the following function:

int BMMC_MPI(bit_matrix A,
	     matrix_column c,
	     int n,
	     int p,
	     int f,
	     int rank,
	     MPI_Comm comm,
	     int size,
	     void *data,
	     void *temp);

As you might have guessed, BMMC_MPI_proc_major() and
BMMC_MPI_proc_minor() are simply wrappers that call BMMC_MPI with the
parameter f set to n-p and 0, respectively.
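
The intended effect of these calls can be modeled on a single
processor.  The sketch below is NOT the library's algorithm and uses no
MPI; it only illustrates where each element ends up.  All names are
hypothetical.  The model lays the P buffers end to end in one array of
N ints, and it ASSUMES that an element's offset within its buffer is
given by the index bits outside positions f through f+p-1, kept in
order.  As in the real calls, a temporary buffer of the same size as
the data is supplied by the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns A x + c over GF(2); cols[j] is column j of A, packed so
   that bit i of cols[j] is a_ij. */
uint32_t bmmc_target(const uint32_t cols[], uint32_t c, int n, uint32_t x)
{
  uint32_t y = c;
  int j;

  for (j = 0; j < n; j++)
    if ((x >> j) & 1)
      y ^= cols[j];
  return y;
}

/* Position of index x in the flat model array, in which the buffer of
   processor `rank` occupies positions rank*(N/P) to (rank+1)*(N/P)-1.
   The offset rule is an assumption of this sketch. */
uint32_t model_position(uint32_t x, int n, int p, int f)
{
  uint32_t rank = (x >> f) & (((uint32_t) 1 << p) - 1);
  uint32_t low = x & (((uint32_t) 1 << f) - 1);
  uint32_t high = x >> (f + p);

  return (rank << (n - p)) | (high << f) | low;
}

/* Moves the element with source index x to target index A x + c for
   every x.  data and temp each hold N = 2^n ints. */
void model_bmmc(const uint32_t cols[], uint32_t c, int n, int p, int f,
                int data[], int temp[])
{
  uint32_t N, x;

  assert(n >= 1 && n <= 30 && p >= 0 && f >= 0 && f + p <= n);
  N = (uint32_t) 1 << n;
  for (x = 0; x < N; x++) {
    uint32_t y = bmmc_target(cols, c, n, x);
    temp[model_position(y, n, p, f)] = data[model_position(x, n, p, f)];
  }
  memcpy(data, temp, N * sizeof(int));   /* result ends up in data */
}
```

With f = n-p, model_position(x, ...) is just x, which matches the
processor-major description in item 3.  For a bit-reversal permutation
with n = 4, p = 2, and f = 2, the element that starts at position 1
ends up at position 8.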

The functions BMMC_MPI(), BMMC_MPI_proc_major(), and
BMMC_MPI_proc_minor() perform a modest amount of preprocessing that is
independent of the data.  The preprocessing performs some factoring of
the characteristic matrix.  (You don't need to understand how the
factoring works to use this software, but you are welcome to read the
comments in the code.)  If you are going to perform the same BMMC
permutation multiple times on different data, you can do the
preprocessing once.  Use any of the following functions:

int factor_BMMC_MPI_proc_major(bit_matrix A,
			       matrix_column c,
			       int n,
			       int p,
			       BMMC_MPI_factor_info *info);

int factor_BMMC_MPI_proc_minor(bit_matrix A,
			       matrix_column c,
			       int n,
			       int p,
			       BMMC_MPI_factor_info *info);

int factor_BMMC_MPI(bit_matrix A,
		    matrix_column c,
		    int n,
		    int p,
		    int f,
		    BMMC_MPI_factor_info *info);

The parameters A, c, n, p, and f are as before.  The parameter info is
a pointer to the BMMC_MPI_factor_info structure, which is an opaque
type defined in bmmc_mpi.h.  The return value is 0 when the factoring
occurs without error, or a nonzero code defined in bmmc_mpi.h when an
error occurs.  Note that you must allocate the BMMC_MPI_factor_info
structure yourself (from either the stack or heap), but that it will
point to structures dynamically allocated by the factor_BMMC_MPI...()
functions.

Once you have created and filled in a BMMC_MPI_factor_info structure
with one of the above calls, you can perform BMMC permutations with
the following function:

int perform_BMMC_MPI(BMMC_MPI_factor_info *info,
		     int n,
		     int p,
		     int rank,
		     MPI_Comm comm,
		     int size,
		     void *data,
		     void *temp);

All parameters are as described above.

When you are done with the BMMC_MPI_factor_info structure, free the
memory dynamically allocated by the factor_BMMC_MPI...() functions by
calling the function

void free_BMMC_MPI_factor_info(BMMC_MPI_factor_info *info);

Freeing the memory occupied by the BMMC_MPI_factor_info structure is
your responsibility.  (If you allocate it from the stack, it is of
course freed when the structure goes out of scope.)

Again, you probably won't be surprised to see that the function
BMMC_MPI() is a wrapper that calls factor_BMMC_MPI(),
perform_BMMC_MPI(), and free_BMMC_MPI_factor_info().



6  HOW DO I BUILD THE LIBRARY THAT CONTAINS THE FUNCTION THAT PERFORMS
   BMMC PERMUTATIONS?

Get the Makefile into your favorite editor.  Edit it as follows:

  MPI_HOME should be the root directory for MPI.

  MPI_INC should be the directory that contains mpi.h.  As the Makefile
  says, it would normally be $(MPI_HOME)/include.

  MPI_LIB should be the directory that contains libmpi.a.  The Makefile
  has it at $(MPI_HOME)/lib/alpha/ch_p4 because I developed this code on
  a DEC Alpha workstation with MPI-CH running on top of p4.  But you
  should change this directory as necessary to link in the correct
  version of libmpi.a.

  Optionally, edit the VERBOSE flag to turn off verbosity.  You will
  want to leave it alone (i.e., leave verbosity on) for the tests
  described in the next item.

Having edited the Makefile, type
  make depend
  make libbmmc_mpi.a

and the library, libbmmc_mpi.a, is built.  Note that "make depend"
assumes that you have the makedepend program on your search path.



7  HOW DO I TEST THAT THE LIBRARY RUNS CORRECTLY?

We have provided two test programs.  To build them, first make sure
that the line

VERBOSE = -DVERBOSE

has NOT been commented out of the Makefile.  Then type
  make test_bmmc test_bmmc_comm

How you run MPI programs varies by implementation and site.  Because
we use MPI-CH, we use the mpirun command.  You should check to see how
to run MPI on your system.

The two test programs are similar in that each one either generates
randomized test cases or reads specific test cases from standard
input.  They differ in that the "test_bmmc" program tests
only for the communicator MPI_COMM_WORLD, but "test_bmmc_comm" tests
by splitting MPI_COMM_WORLD into two equal halves and running them
independently and concurrently.

The test programs have command lines of the following form:

[test_bmmc | test_bmmc_comm] [-r | -m] n f [trials]

  test_bmmc or test_bmmc_comm is the program name.

  The -r option randomly generates the characteristic matrices and
  complement vectors.

  The -m option randomly generates the characteristic matrices and uses
  0 for all complement vectors.

  If neither -r nor -m is specified, then the characteristic matrices
  and complement vectors are read from standard input.  They appear as
  whitespace-separated bits.  First, the characteristic matrix appears,
  in row-major order.  Then the complement vector appears.  For example,
  to specify a bit-reversal permutation on N = 2^4 elements (so that lg
  N is 4), use the input

      0 0 0 1
      0 0 1 0
      0 1 0 0
      1 0 0 0

      0 0 0 0

  n is lg N, the log-base-2 of the number of elements over all processors.

  f is the least significant processor-number bit.  Use n-p (where p is
  the log-base-2 of the number of processors) for processor-major
  layout, and use 0 for processor-minor layout.  You may use any value
  from 0 to n-p.

  trials is how many trials to run.  If left unspecified, the default is
  1.
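
Note that the standard-input format above lists the characteristic
matrix by rows, whereas item 4 stores it by columns.  The following
sketch shows one way to convert between the two.  It is not the test
programs' actual input code: parse_bmmc_text and read_bit are
hypothetical names, a uint32_t stands in for matrix_column, and the
sketch ASSUMES that the complement vector is listed with c_0 first.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Reads one whitespace-separated bit from *text and advances *text.
   Returns the bit (0 or 1), or -1 if the input is malformed. */
static int read_bit(const char **text)
{
  int bit, used;

  if (sscanf(*text, "%d%n", &bit, &used) != 1 || (bit != 0 && bit != 1))
    return -1;
  *text += used;
  return bit;
}

/* Parses n*n matrix bits in row-major order, followed by n
   complement-vector bits.  Stores column j of the matrix in cols[j],
   with bit i of cols[j] equal to a_ij, and stores the complement
   vector in *c.  Returns 1 on success, 0 on malformed input. */
int parse_bmmc_text(const char *text, int n, uint32_t cols[], uint32_t *c)
{
  int i, j, bit;

  assert(n >= 1 && n <= 32);
  for (j = 0; j < n; j++)
    cols[j] = 0;
  *c = 0;
  for (i = 0; i < n; i++)            /* row i of the input ... */
    for (j = 0; j < n; j++) {        /* ... entry in column j */
      if ((bit = read_bit(&text)) < 0)
        return 0;
      if (bit)
        cols[j] |= (uint32_t) 1 << i;
    }
  for (i = 0; i < n; i++) {          /* complement vector, c_0 first */
    if ((bit = read_bit(&text)) < 0)
      return 0;
    if (bit)
      *c |= (uint32_t) 1 << i;
  }
  return 1;
}
```

Applied to the bit-reversal input shown above with n = 4, this sketch
produces columns 8, 4, 2, 1 and complement vector 0.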

The test programs will produce output telling you what they are doing.
Remember to use the -r or -m option if you don't want to specify a
particular characteristic matrix/complement vector in standard input.
It is easy to forget to use one of these options and then mistakenly
think that the program has hung, when it is in fact just waiting for
input.

Again, note that once you have completed testing, you will want to
turn off verbosity in libbmmc_mpi.a.  To do so, comment out the VERBOSE
line in the Makefile and then type

  make clean
  make libbmmc_mpi.a



8  WHAT OTHER FUNCTIONS MIGHT I FIND USEFUL?

The files bit_matrix_fns.h and bit_matrix_fns.c contain other
bit-matrix functions that you may find useful.  They assume the same
bit-matrix representation as the BMMC permutation functions.

bit_matrix identity_matrix(bit_matrix A, int n);
  As mentioned above, it makes A be an n x n identity matrix.  A must be
  allocated already.  It returns A, which is useful in chaining calls
  together.

int is_identity_matrix(bit_matrix A, int n);
  Returns 1 if A is an n x n identity matrix, 0 otherwise.

bit_matrix bit_matrix_multiply(bit_matrix C, bit_matrix A, bit_matrix B,
			       int n);
  Performs matrix-matrix multiplication on n x n matrices, forming the
  product C = A x B.  All three matrices must be allocated.  C and B may
  point to the same storage, but C and A must be distinct.  It returns
  C.

matrix_column bit_matrix_vector_multiply(bit_matrix A, matrix_column x, int n);
  Performs matrix-vector multiplication to return the product A x, where
  A is n x n and x is an n-vector.

int invert_bit_matrix(bit_matrix A_inv, bit_matrix A, int n);
  Inverts A to produce A_inv.  Both matrices are n x n and must already
  be allocated.  Returns 1 if A is invertible, 0 if noninvertible.

bit_matrix allocate_bit_matrix(int n);
  Allocates an n x n bit matrix and returns its address.

bit_matrix copy_bit_matrix(bit_matrix target, bit_matrix source, int n);
  Copies bit matrix source into target.  Both matrices are n x n and
  must already be allocated.  Returns target.

bit_matrix dup_bit_matrix(bit_matrix A, int n);
  Allocates a new n x n matrix, copies A into it, and returns the copy.

void free_bit_matrix(bit_matrix A);
  Frees the storage allocated for A.

bit_matrix extract_bit_submatrix(bit_matrix target, bit_matrix source,
				 int start_row, int start_col,
				 int rows, int cols);
  Extracts a submatrix of source into target, both of which must already
  be allocated.  The rows extracted from source are numbered start_row
  to (start_row + rows - 1), and the columns extracted from source are
  numbered start_col to (start_col + cols - 1).  The extracted bits
  appear in target starting at row 0 and column 0.

void print_bit_matrix(bit_matrix A, int m, int n, char *name);
  Prints bit matrix A to standard output.  A has m rows and n columns.
  name points to a string printed to indicate the name of the matrix
  upon printing.
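
For readers curious about what inversion over GF(2) involves, here is a
minimal sketch using Gauss-Jordan elimination on packed columns.  It is
NOT the implementation of invert_bit_matrix(); the name invert_gf2 is
hypothetical, a uint32_t stands in for matrix_column, and n is assumed
to be at most 32.  Column operations applied to A until it becomes the
identity are applied in parallel to an identity matrix, which thereby
becomes the inverse of A.

```c
#include <assert.h>
#include <stdint.h>

/* Computes the inverse of the n x n matrix A over GF(2).
   A[j] and inv[j] are columns, with bit i holding the row-i entry.
   Returns 1 and fills inv if A is invertible; returns 0 otherwise. */
int invert_gf2(uint32_t inv[], const uint32_t A[], int n)
{
  uint32_t work[32], t;
  int i, j, k;

  assert(n >= 1 && n <= 32);
  for (j = 0; j < n; j++) {
    work[j] = A[j];
    inv[j] = (uint32_t) 1 << j;      /* start from the identity */
  }
  for (i = 0; i < n; i++) {
    /* Find a column at or after position i with a 1 in row i. */
    j = i;
    while (j < n && ((work[j] >> i) & 1) == 0)
      j++;
    if (j == n)
      return 0;                      /* no pivot: A is singular */

    /* Swap it into position i. */
    t = work[i]; work[i] = work[j]; work[j] = t;
    t = inv[i];  inv[i] = inv[j];   inv[j] = t;

    /* Clear row i in every other column. */
    for (k = 0; k < n; k++)
      if (k != i && ((work[k] >> i) & 1)) {
        work[k] ^= work[i];
        inv[k] ^= inv[i];
      }
  }
  return 1;
}
```

For n = 4, inverting the Gray code matrix (columns 1, 3, 6, 12) gives
the inverse Gray code matrix (columns 1, 3, 7, 15), in agreement with
item 2.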



9  WHO DO I CONTACT WITH FURTHER QUESTIONS?

Please direct questions to Tom Cormen, Dartmouth College Department of
Computer Science, thc@cs.dartmouth.edu, phone 603-448-2442, fax
603-646-2417.

