            Open Fabrics Enterprise Distribution (OFED)
      NetEffect Ethernet Cluster Server Adapter Release Notes
                           January 2012



The iw_nes module and libnes user library provide RDMA and L2IF
support for the NetEffect Ethernet Cluster Server Adapters.

==========
What's New
==========

OFED 1.5.4.1 contains bug fixes for the iw_nes driver.

* Fixed a problem with QP destroy timer and improved AE handling.
* Fixed a problem with sending MPA reject message.
* Fixed fast memory registration issues.

OFED 1.5.4 contains several enhancements and bug fixes to the iw_nes driver.

* Added backports for 2.6.35 to 3.0 kernels.
* Fixed a couple of problems which caused IMA to crash.
* Fixed a problem with VLAN flag for IMA.
* Enabled bonding with iw_nes.
* Fixed a couple of IB_EVENT issues.
* Fixed an SFP+ link status issue.
* Added support for Chelsio Interoperability.
* Added support for MPA version 2.


============================================
Required Setting - RDMA Unify TCP port space
============================================
RDMA connections use the same TCP port space as the host stack.  To avoid
conflicts, set the rdma_cm module option unify_tcp_port_space to 1 by adding
the following line to /etc/modprobe.conf:

    options rdma_cm unify_tcp_port_space=1


========================================
Required Setting - Power Management Mode
========================================
If possible, disable Active State Power Management in the BIOS, e.g.:

  PCIe ASPM L0s - Advanced State Power Management: DISABLED


=======================
Loadable Module Options
=======================
The following options can be set when loading the iw_nes module by modifying
the /etc/modprobe.conf file.

wide_ppm_offset=0
    Setting this to 1 increases the CX4 interface clock ppm offset to 300 ppm.
    The default setting, 0, is 100 ppm.

mpa_version=1
    MPA version to be used in MPA Req/Resp (1 or 2).

disable_mpa_crc=0
    Disable checking of MPA CRC.
    Set to 1 to disable MPA CRC checking; the default, 0, leaves it enabled.

send_first=0
    Send RDMA Message First on Active Connection.

nes_drv_opt=0x00000100
    Following options are supported:

    0x00000010 - Enable MSI
    0x00000080 - No Inline Data
    0x00000100 - Disable Interrupt Moderation
    0x00000200 - Disable Virtual Work Queue
    0x00001000 - Disable extra doorbell read after write
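
The nes_drv_opt value is a bitwise OR of the flags listed above.  As a rough
sketch in Python, the value can be composed and decoded as shown below.  The
flag names and helper functions are made up for illustration; only the hex
values come from the list above, and the "options iw_nes ..." line simply
follows the /etc/modprobe.conf pattern shown earlier for rdma_cm.

```python
# Hypothetical flag names; only the hex values are taken from the option list.
NES_DRV_OPT_FLAGS = {
    "enable_msi": 0x00000010,
    "no_inline_data": 0x00000080,
    "disable_int_moderation": 0x00000100,
    "disable_virtual_wq": 0x00000200,
    "disable_extra_doorbell_read": 0x00001000,
}


def compose_drv_opt(*names):
    """OR the selected flags together into one nes_drv_opt value."""
    value = 0
    for name in names:
        value |= NES_DRV_OPT_FLAGS[name]
    return value


def decode_drv_opt(value):
    """Return the names of the known flags that are set in value."""
    return [name for name, bit in NES_DRV_OPT_FLAGS.items() if value & bit]


def modprobe_line(value):
    """Format the value as an options line for /etc/modprobe.conf."""
    return "options iw_nes nes_drv_opt=0x%08x" % value


if __name__ == "__main__":
    opt = compose_drv_opt("enable_msi", "disable_int_moderation")
    print(modprobe_line(opt))
    print(decode_drv_opt(0x00000100))
```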

nes_debug_level=0
    Specify debug output level.

wqm_quanta=65536
    Set size of data to be transmitted at a time.

limit_maxrdreqsz=0
    Limit PCI read request size to 256 bytes.


===============
Runtime Options
===============
The following options can be used to alter the behavior of the iw_nes module.
NOTE: These examples assume the NetEffect Ethernet Cluster Server Adapter is
assigned eth2.

    ifconfig eth2 mtu 9000  - largest MTU supported

    ethtool -K eth2 tso on  - enables TSO
    ethtool -K eth2 tso off - disables TSO

    ethtool -C eth2 rx-usecs-irq 128 - set static interrupt moderation

    ethtool -C eth2 adaptive-rx on  - enable dynamic interrupt moderation
    ethtool -C eth2 adaptive-rx off - disable dynamic interrupt moderation
    ethtool -C eth2 rx-frames-low 16 - low watermark of rx queue for dynamic
                                       interrupt moderation
    ethtool -C eth2 rx-frames-high 256 - high watermark of rx queue for
                                         dynamic interrupt moderation
    ethtool -C eth2 rx-usecs-low 40 - smallest interrupt moderation timer
                                      for dynamic interrupt moderation
    ethtool -C eth2 rx-usecs-high 1000 - largest interrupt moderation timer
                                         for dynamic interrupt moderation

===================
uDAPL Configuration
===================
The rest of this document assumes the following uDAPL settings in
/etc/dat.conf:

    OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
    ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""

========================
Chelsio Interoperability
========================
The firmware version supporting interoperability with Chelsio is 3.23 or greater.
The load time Chelsio parameter peer2peer must be set to 1.

==============
mpd.hosts file
==============
mpd.hosts is a text file with a list of nodes, one per line, in the MPI ring.
Use either the fully qualified hostname or the IP address of each node.
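
As an illustration, a two-node mpd.hosts file could be generated with the
Python sketch below.  The hostname, the IP address and the helper name are
placeholders, not values from this release.

```python
import os
import tempfile


def write_mpd_hosts(path, nodes):
    """Write one node (hostname or IP address) per line, skipping blanks."""
    entries = [n.strip() for n in nodes if n.strip()]
    with open(path, "w") as f:
        for entry in entries:
            f.write(entry + "\n")
    return entries


if __name__ == "__main__":
    # Placeholder node names; a temporary directory keeps the demo harmless.
    path = os.path.join(tempfile.mkdtemp(), "mpd.hosts")
    write_mpd_hosts(path, ["node1.example.com", "192.168.1.12"])
    with open(path) as f:
        print(f.read(), end="")
```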

===========================
100% CPU Utilization remark
===========================
Most RDMA applications use CQ Polling mode to decrease latency.
This operational mode can cause 100% CPU utilization.

To switch to Event Driven mode and lower CPU utilization, refer to the README
or Release Notes for the specific application.

========================================
Recommended Settings for Intel MPI 4.0.x
========================================
Add the following to the mpiexec command:

    -genv I_MPI_FALLBACK_DEVICE 0
    -genv I_MPI_DEVICE rdma:ofa-v2-iwarp
    -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1

Example mpiexec command line for uDAPL-2.0:

    mpiexec -genv I_MPI_FALLBACK_DEVICE 0
            -genv I_MPI_DEVICE rdma:ofa-v2-iwarp
            -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1
            -ppn 1 -n 2
            /opt/intel/impi/4.0.0.025/bin64/IMB-MPI1

Example mpiexec command line for uDAPL-1.2:

    mpiexec -genv I_MPI_FALLBACK_DEVICE 0
            -genv I_MPI_DEVICE rdma:OpenIB-iwarp
            -genv I_MPI_USE_RENDEZVOUS_RDMA_WRITE 1
            -ppn 1 -n 2
            /opt/intel/impi/4.0.0.025/bin64/IMB-MPI1

Intel MPI uses CQ Polling mode by default.
To switch to wait mode, add the following to the mpiexec command:
     -genv I_MPI_WAIT_MODE 1

NOTE: Wait mode supports the sock device only.

========================================
Recommended Setting for MVAPICH2 and OFA
========================================
Example mpirun_rsh command line:

    mpirun_rsh -ssh -np 2 -hostfile /root/mpd.hosts
            /usr/mpi/gcc/mvapich2-1.7/tests/osu_benchmarks-3.1.1/osu_latency

MVAPICH2 uses CQ Polling mode by default.
To switch to Blocking mode, add the following to the mpirun_rsh command:
     MV2_USE_BLOCKING=1

==========================================
Recommended Setting for MVAPICH2 and uDAPL
==========================================
Add the following to the mpirun_rsh command for 64 or more processes:

    -env MV2_ON_DEMAND_THRESHOLD <number of processes>

Example mpirun_rsh command with uDAPL-2.0:

    mpirun_rsh -ssh -np 64 -hostfile /root/mpd.hosts
            MV2_DAPL_PROVIDER=ofa-v2-iwarp
            MV2_ON_DEMAND_THRESHOLD=64
            /usr/mpi/gcc/mvapich2-1.7/tests/IMB-3.2/IMB-MPI1

Example mpirun_rsh command with uDAPL-1.2:

    mpirun_rsh -ssh -np 64 -hostfile /root/mpd.hosts
            MV2_DAPL_PROVIDER=OpenIB-iwarp
            MV2_ON_DEMAND_THRESHOLD=64
            /usr/mpi/gcc/mvapich2-1.7/tests/IMB-3.2/IMB-MPI1

MVAPICH2 uses CQ Polling mode by default.
To switch to Blocking mode, add the following to the mpirun_rsh command:
     MV2_USE_BLOCKING=1

===========================
Modify Settings in Open MPI
===========================
There is more than one way to specify MCA parameters in
Open MPI.  Please visit this link and use the best method
for your environment:

http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

=======================================
Recommended Settings for Open MPI 1.4.3
=======================================
Allow the sender to use RDMA Writes:

    -mca btl_openib_flags 2

Example mpirun command line:

    mpirun -np 2 -hostfile /opt/mpd.hosts
           -mca btl openib,self,sm
           -mca btl_mpi_leave_pinned 0
           -mca btl_openib_flags 2
           /usr/mpi/gcc/openmpi-1.4.3/tests/IMB-3.2/IMB-MPI1

Open MPI uses CQ Polling mode by default.
No command parameter is available to switch to Event Driven mode.

===================================
iWARP Multicast Acceleration (IMA)
===================================

iWARP multicast acceleration enables kernel bypass for raw L2 multicast
traffic through the user-space verbs API, using the newly defined QP type
IBV_QPT_RAW_ETH.

The L2 RAW_ETH acceleration assumes that the user application transmits and
receives whole L2 frames, including the MAC/IP/UDP/TCP headers.

ETH RAW QP usage:
First, the application creates an IBV_QPT_RAW_ETH QP with its associated CQ,
PD, and completion channels, in the same way as for an RDMA connection.

The next step is to enable the L2 MAC address RX filters that direct received
multicasts to the RAW_ETH QPs, using the ibv_attach_mcast() verb.
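
For IPv4 multicast, the L2 destination MAC address is conventionally derived
from the group address: the fixed prefix 01:00:5e followed by the low 23 bits
of the group.  The Python sketch below only illustrates that mapping.  The
helper name is hypothetical, and how the resulting address is passed to the
attach verb is not shown here.

```python
import ipaddress


def multicast_mac(group):
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError("%s is not an IPv4 multicast address" % group)
    low23 = int(addr) & 0x7FFFFF          # keep the low 23 bits of the group
    mac = bytes([0x01, 0x00, 0x5E,
                 (low23 >> 16) & 0x7F, (low23 >> 8) & 0xFF, low23 & 0xFF])
    return ":".join("%02x" % b for b in mac)


if __name__ == "__main__":
    # Example group address chosen for illustration only.
    print(multicast_mac("239.1.2.3"))
```

Because only 23 bits are kept, several groups share one MAC address, so an
application may still see frames for groups it did not ask for.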

From this point the application is ready to receive and transmit multicast
traffic.

In multicast acceleration, the user application passes the whole IGMP frame
to ibv_post_send(), including the MAC header, IP header, UDP header and UDP
payload. The user is responsible for performing IP fragmentation when the
payload is larger than the MTU; every fragment is transmitted as a separate
L2 frame. ibv_poll_cq() reports the status of the transmit buffer.
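
To make the transmit-side requirements concrete, the following Python sketch
builds complete Ethernet/IPv4/UDP frames and splits an oversized datagram into
IP fragments, one L2 frame per fragment.  It is an illustration under stated
assumptions, not code from this release: the function names, the default MTU
of 1500, the TTL of 1 and the IP identification value are all hypothetical,
and posting the resulting buffers to the QP is not shown.

```python
import socket
import struct


def inet_checksum(data):
    """RFC 1071 one's-complement checksum over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_frames(dst_mac, src_mac, src_ip, dst_ip, sport, dport, payload,
                 mtu=1500, ident=1, ttl=1):
    """Return a list of complete L2 frames, one per IP fragment.

    dst_mac and src_mac are 6-byte values; the IP addresses are dotted quads.
    """
    src, dst = socket.inet_aton(src_ip), socket.inet_aton(dst_ip)
    udp_len = 8 + len(payload)
    udp = struct.pack("!HHHH", sport, dport, udp_len, 0) + payload
    # UDP checksum covers a pseudo header plus the UDP header and payload.
    pseudo = src + dst + struct.pack("!BBH", 0, socket.IPPROTO_UDP, udp_len)
    udp_csum = inet_checksum(pseudo + udp) or 0xFFFF
    udp = udp[:6] + struct.pack("!H", udp_csum) + udp[8:]

    eth = dst_mac + src_mac + struct.pack("!H", 0x0800)   # EtherType IPv4
    # Every fragment except the last must carry a multiple of 8 bytes.
    chunk = (mtu - 20) // 8 * 8
    frames = []
    for off in range(0, len(udp), chunk):
        part = udp[off:off + chunk]
        more = 0x2000 if off + chunk < len(udp) else 0    # MF flag
        hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(part), ident,
                          more | (off // 8), ttl, socket.IPPROTO_UDP, 0,
                          src, dst)
        hdr = hdr[:10] + struct.pack("!H", inet_checksum(hdr)) + hdr[12:]
        frames.append(eth + hdr + part)
    return frames


if __name__ == "__main__":
    frames = build_frames(bytes.fromhex("01005e010203"),
                          bytes.fromhex("001122334455"),
                          "192.168.1.10", "239.1.2.3", 5000, 5001,
                          b"x" * 3000)
    print([len(f) for f in frames])
```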

On the receive path, ibv_poll_cq() returns information about the received L2
packet. The RX buffer (previously posted by ibv_post_recv()) contains the
whole L2 frame, including the MAC header, IP header and UDP header.
The user application is responsible for checking that the received packet is
a valid UDP frame, so fragments must be checked and checksums must be
computed.
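
The receive-side checks could look roughly like the Python sketch below.  It
is an illustration only: the function names are hypothetical, only IPv4 is
handled, fragments are rejected rather than reassembled, and a UDP checksum
of zero is treated as "not computed by the sender".

```python
import struct


def csum16(data):
    """RFC 1071 one's-complement checksum over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def parse_udp_frame(frame):
    """Return (sport, dport, payload) for a valid unfragmented UDP frame.

    Returns None when a check fails or the frame is an IP fragment, which
    the application would have to reassemble itself.
    """
    if len(frame) < 14 + 20 + 8:
        return None
    if struct.unpack("!H", frame[12:14])[0] != 0x0800:   # not IPv4
        return None
    ihl = (frame[14] & 0x0F) * 4
    if frame[14] >> 4 != 4 or ihl < 20 or len(frame) < 14 + ihl + 8:
        return None
    ip = frame[14:14 + ihl]
    if csum16(ip) != 0 or ip[9] != 17:                   # bad header, not UDP
        return None
    total_len, _ident, frag = struct.unpack("!HHH", ip[2:8])
    if frag & 0x3FFF:                                    # MF set or offset != 0
        return None
    if total_len < ihl + 8 or 14 + total_len > len(frame):
        return None
    udp = frame[14 + ihl:14 + total_len]                 # drops any L2 padding
    sport, dport, ulen, ucsum = struct.unpack("!HHHH", udp[:8])
    if ulen != len(udp):
        return None
    if ucsum:                                            # 0 means no checksum
        pseudo = ip[12:20] + struct.pack("!BBH", 0, 17, ulen)
        if csum16(pseudo + udp) != 0:
            return None
    return sport, dport, udp[8:]
```

An application would call parse_udp_frame() on each RX buffer reported as
complete and re-post the buffer afterwards.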

IMA API description (NE020 specific):
The user application must create separate CQs for the RX and TX paths.
Only a single SGE is supported on transmit.
The user application must post at least 65 RX buffers to keep the RX path
working.

IMA device:
IMA requires the /dev/infiniband/nes_ud_sksq device to be created in order to
access the optimized IMA transmit path. The best way to create this device is
to manually add the following line to the /etc/udev/rules.d/90-ib.rules file
after installing the OFED distribution, and then reboot the machine.

KERNEL=="nes_ud_sksq", NAME="infiniband/%k", MODE="0644"

As a result the 90-ib.rules should look like:

KERNEL=="umad*", NAME="infiniband/%k"
KERNEL=="issm*", NAME="infiniband/%k"
KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"
KERNEL=="nes_ud_sksq", NAME="infiniband/%k", MODE="0644"
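
If the rule is added by a script rather than by hand, the edit should be
idempotent so that repeated runs do not duplicate the line.  The Python sketch
below shows one way to do that; the helper name is hypothetical, the rule text
is the line given above, and the demo writes to a temporary file rather than
to /etc/udev/rules.d/90-ib.rules.

```python
import os
import tempfile

RULE = 'KERNEL=="nes_ud_sksq", NAME="infiniband/%k", MODE="0644"'


def ensure_rule(path, rule=RULE):
    """Append rule to the file at path unless an identical line exists.

    Returns True if the file was modified.
    """
    lines = []
    if os.path.exists(path):
        with open(path) as f:
            lines = f.read().splitlines()
    if rule in (line.strip() for line in lines):
        return False
    lines.append(rule)
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return True


if __name__ == "__main__":
    # Temporary stand-in for the real rules file.
    path = os.path.join(tempfile.mkdtemp(), "90-ib.rules")
    print(ensure_rule(path))   # first call adds the rule
    print(ensure_rule(path))   # second call leaves the file unchanged
```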



NetEffect is a trademark of Intel Corporation in the U.S. and other countries.
