            Open Fabrics Enterprise Distribution (OFED)
                CHELSIO T3 RNIC RELEASE NOTES
			February 2011


The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the
Chelsio S series adapters.  Make sure you choose the 'cxgb3' and
'libcxgb3' options when generating your OFED rpms.

============================================
New for OFED-1.5.3
============================================

- Bug fixes. Various upstream bug fixes have been included in this
release.

============================================
Enabling Various MPIs
============================================

For OpenMPI, Intel MPI, HP MPI, and Scali MPI: you must set the iw_cxgb3
module option peer2peer=1 on all systems.  This can be done by writing
to the /sys/module file system during boot.  EG:

# echo 1 > /sys/module/iw_cxgb3/parameters/peer2peer

Or you can add the following line to /etc/modprobe.conf to set the option
at module load time:

options iw_cxgb3 peer2peer=1

For Intel MPI, HP MPI, and Scali MPI: Enable the chelsio device by adding
an entry to /etc/dat.conf for the chelsio interface.  For instance,
if your chelsio interface name is eth2, then the following line adds
a DAT version 1.2 and 2.0 devices named "chelsio" and "chelsio2" for
that interface:

chelsio u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
chelsio2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""

=============
Intel MPI:
=============

The following env vars enable Intel MPI version 3.1.038.  Place these
in your user env after installing and setting up Intel MPI:

export RSH=ssh
export DAPL_MAX_INLINE=64
export I_MPI_DEVICE=rdssm:chelsio
export MPIEXEC_TIMEOUT=180
export MPI_BIT_MODE=64

Logout & log back in.

Populate mpd.hosts with node names.
Note: The hosts in this file should be Chelsio interface IP addresses. 

Note: I_MPI_DEVICE=rdssm:chelsio assumes you have an entry in
/etc/dat.conf named "chelsio".

Note: MPIEXEC_TIMEOUT value might be required to increase if heavy traffic 
is going across the systems.

Contact Intel for obtaining their MPI with DAPL support.

To run Intel MPI applications:

 mpdboot -n <num nodes> -r ssh --ncpus=<num cpus>
 mpiexec -ppn <process per node> -n <num nodes> <MPI Application Path>


=============
HP MPI:
=============

The following env vars enable HP MPI version 2.03.01.00.  Place these
in your user env after installing and setting up HP MPI:

export MPI_ROOT=/opt/hpmpi
export PATH=$MPI_ROOT/bin:/opt/bin:$PATH
export MANPATH=$MANPATH:$MPI_ROOT/share/man

Log out & log back in. 

To run HP MPI applications, use these mpirun options:

-prot -e DAPL_MAX_INLINE=64 -UDAPL

EG:

$ mpirun -prot -e DAPL_MAX_INLINE=64 -UDAPL -hostlist r1-iw,r2-iw ~/tests/presta-1.4.0/glob

Where r1-iw and r2-iw are hostnames mapping to the chelsio interfaces.

Also this assumes your first entry in /etc/dat.conf is for the chelsio
device.

Contact HP for obtaining their MPI with DAPL support.

=============
Scali MPI:
=============

The following env vars enable Scali MPI.  Place these in your user env
after installing and setting up Scali MPI for running over IWARP:

export DAPL_MAX_INLINE=64
export SCAMPI_NETWORKS=chelsio
export SCAMPI_CHANNEL_ENTRY_COUNT="chelsio:128"

Log out & log back in.

Note: SCAMPI_NETWORKS=chelsio assumes you have an entry in /etc/dat.conf
named "chelsio".

Note: SCAMPI supports only dapl 1.2 library not dapl 2.0

Contact Scali for obtaining their MPI with DAPL support.

To run SCALI MPI applications:

 mpimon <SCALI Application Path> -- <node1_IP> <procs> <node2_IP> <procs>

Note: <procs> is the number of processes to run on the node Note:
<node#_IP> should be the IP of Chelsio's interface

=============
OpenMPI:
=============

OpenMPI iWARP support is only available in OpenMPI version 1.3 or greater.

Open MPI will work without any specific configuration via the openib btl.
Users wishing to performance tune the configurable options may wish to
inspect the receive queue values.  Those can be found in the "Chelsio T3"
section of mca-btl-openib-hca-params.ini.

Note: OpenMPI version 1.3 does not support newer Chelsio card with device
ID 0x0035 and 0x0036. To use those cards add the device id of the cards 
in the "Chelsio T3" section of mca-btl-openib-hca-params.ini file.

To run OpenMPI applications:

 mpirun --host <node1>,<node2> -mca btl openib,sm,self <OpenMPI Application Path>

=============
MVAPICH2:
=============

The following env vars enable MVAPICH2 version 1.4-2.  Place these
in your user env after installing and setting up MVAPICH2 MPI:

export MVAPICH2_HOME=/usr/mpi/gcc/mvapich2-1.4/
export MV2_USE_IWARP_MODE=1
export MV2_USE_RDMA_CM=1

On each node, add this to the end of /etc/profile.

 ulimit -l 999999

On each node, add this to the end of /etc/init.d/sshd and restart sshd.

 ulimit -l 999999
 % service sshd restart

Verify the ulimit changes worked. These should show '999999':

 % ulimit -l
 % ssh <peer> ulimit -l

Note: You may have to restart sshd a few times to get it to work.

Create mpd.hosts with list of hostname or ipaddrs in the cluster. They
should be names/addresses that you can ssh to without passwords. (See
Passwordless SSH Setup).

On each node, create /etc/mv2.conf with a single line containing the
IP address of the local T3 interface. This is how MVAPICH2 picks which
interface to use for RDMA traffic.

On each node, edit /etc/hosts file. Comment the entry if there is an
entry with 127.0.0.1 IP Address and local host name. Add an entry for
corporate IP address and local host name (name that you have given in
mpd.hosts file) in /etc/hosts file.

To run MVAPICH2 application:

 mpirun_rsh -ssh -np 8 -hostfile mpd.hosts <MVAPICH2 Application Path>

============================================
Loadable Module options:
============================================

The following options can be used when loading the iw_cxgb3 module to
tune the iWARP driver:

cong_flavor     - set the congestion control algorithm.  Default is 1.
                  0 == Reno
                  1 == Tahoe
                  2 == NewReno
                  3 == HighSpeed

snd_win         - set the TCP send window in bytes. Default is 32kB.

rcv_win         - set the TCP receive window in bytes. Default is 256kB.

crc_enabled     - set whether MPA CRC should be negotiated.  Default is 1.

markers_enabled - set whether to request receiving MPA markers.  Default is
                  0; do not request to receive markers.

                  NOTE: The Chelsio RNIC fully supports markers, but
                  the current OFA RDMA-CM doesn't provide an API for
                  requesting either markers or crc to be negotiated.  Thus
                  this functionality is provided via module parameters.

mpa_rev         - set the MPA revision to be used.  Default is 1, which is 
                  spec compliant.  Set to 0 to connect with the Ammasso 1100 
                  rnic.

ep_timeout_secs - set the number of seconds for timing out MPA start up
                  negotiation and normal close.  Default is 60.

peer2peer	- Enables connection setup changes to allow peer2peer
		  applications to work over chelsio rnics.  This enables
		  the following applications:
			Intel MPI
			HP MPI
			Open MPI
			Scali MPI
			MVAPICH2
		  Set peer2peer=1 on all systems to enable these
		  applications.

The following options can be used when loading the cxgb3 module to
tune the NIC driver:

msi             - whether to use MSI or MSI-X.  Default is 2.
                  0 = only pin
                  1 = only MSI or pin
                  2 = use MSI/X, MSI, or pin, based on system

============================================
Updating Firmware:
============================================

This release requires firmware version 7.11.0, and Protocol SRAM
version 1.1.0.  These versions are included in the OFED-1.5.3 release
and will be automatically loaded when the cxgb3 module is loaded and
the interface configured.  To load later/newer versions of the firmware,
follow this procedure:

If your distro/kernel supports firmware loading, you can place the chelsio
firmware and psram images in /lib/firmware/cxgb3, then unload and reload
the cxgb3 module to get the new images loaded.  If this does not work,
then you can load the firmware images manually:

Obtain the cxgbtool tool and the update_eeprom.sh script from Chelsio.

To build cxgbtool:

# cd <path-to-cxgbtool>
# make && make install

Then load the cxgb3 driver:

# modprobe cxgb3

Now note the ethernet interface name for the T3 device.  This can be
done by typing 'ifconfig -a' and noting the interface name for the
interface with a HW address that begins with "00:07:43".  Then load the
new firmware and eeprom file:

# cxgbtool ethxx loadfw <firmware_file>
# update_eeprom.sh ethxx <eeprom_file>
# reboot

============================================
Testing connectivity with ping and rping:
============================================

Configure the ethernet interfaces for your cxgb3 device.   After you
modprobe iw_cxgb3 you will see one or two ethernet interfaces for the
T3 device.  Configure them with an appropriate ip address, netmask, etc.
You can use the Linux ping command to test basic connectivity via the
T3 interface.

To test RDMA, use the rping command that is included in the librdmacm-utils 
rpm:

On the server machine:

# rping -s -a 0.0.0.0 -p 9999

On the client machine:

# rping -c -VvC10 -a server_ip_addr -p 9999

You should see ping data like this on the client:

ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
client DISCONNECT EVENT...
#

============================================
Addition Notes and Issues
============================================

1) To run uDAPL over the chelsio device, you must export this environment
variable:

        export DAPL_MAX_INLINE=64

2) If you have a multi-homed host and the physical ethernet networks
are bridged, or if you have multiple chelsio rnics in the system, then
you need to configure arp to only send replies on the interface with
the target ip address:

        sysctl -w net.ipv4.conf.all.arp_ignore=2

3) If you are building OFED against a kernel.org kernel later than
2.6.20, then make sure your kernel is configured with the cxgb3 and
iw_cxgb3 modules enabled.  This forces the kernel to pull in the genalloc
allocator, which is required for the OFED iw_cxgb3 module.  Make sure
these config options are included in your .config file:

	CONFIG_CHELSIO_T3=m
	CONFIG_INFINIBAND_CXGB=m

4) If you run the RDMA latency test using the ib_rdma_lat program, make
sure you use the following command lines to limit the amount of inline
data to 64:

	server:	ib_rdma_lat -c -I 64
	client:	ib_rdma_lat -c -I 64 server_ip_addr

5) If you're running NFSRDMA over Chelsio's T3 RNIC and your cients are
using a 64KB page size (like PPC64 and IA64 systems) and your server is
using a 4KB page size (like i386 and X86_64), then you need to mount the
server using rsize=32768,wsize=32768 to avoid overrunning the Chelsio
RNIC fast register limits.  This is a known firmware limitation in the
Chelsio RNIC.
