	     Open Fabrics Enterprise Distribution (OFED)
		    IPoIB in OFED 3.18 Release Notes

			   July 2015


===============================================================================
Table of Contents
===============================================================================
1. Overview
2. Known Issues
3. DHCP Support of IPoIB
4. Child interfaces
5. Bug Fixes and Enhancements Since OFED 1.3
6. Bug Fixes and Enhancements Since OFED 1.3.1
7. Bug Fixes and Enhancements Since OFED 1.4
8. Bug Fixes and Enhancements Since OFED 1.4.2
9. Bug Fixes and Enhancements Since OFED 1.5.0
10. Bug Fixes and Enhancements Since OFED 1.5.2
11. Bug Fixes and Enhancements Since OFED 1.5.3
12. Bug Fixes and Enhancements Since OFED 1.5.4
13. Bug Fixes and Enhancements Since OFED 3.12
14. Performance tuning

===============================================================================
1. Overview
===============================================================================
IPoIB is a network driver implementation that enables transmitting IP and ARP
protocol packets over an InfiniBand UD channel. The implementation conforms to
the relevant IETF working group's RFCs (http://www.ietf.org).


Usage and configuration:
========================
1. To check the current mode used for outgoing connections, enter:
   cat /sys/class/net/ib0/mode
2. To disable IPoIB CM at compile time, enter:
   cd OFED-3.12
   export OFA_KERNEL_PARAMS="--without-ipoib-cm"
   ./install.pl
3. To change the run-time configuration for IPoIB, enter:
   edit /etc/infiniband/openib.conf, change the following parameters:
   # Enable IPoIB Connected Mode
   SET_IPOIB_CM=yes
   # Set IPoIB MTU
   IPOIB_MTU=65520

4. You can also change the mode and MTU for a specific interface manually.
   
   To enable connected mode for interface ib0, enter:
   echo connected > /sys/class/net/ib0/mode
   
   To increase MTU, enter:
   ifconfig ib0 mtu 65520

5. Switching between CM and UD mode can be done in run time:
   echo datagram > /sys/class/net/ib0/mode sets the mode of ib0 to UD
   echo connected > /sys/class/net/ib0/mode sets the mode ib0 to CM


===============================================================================
2. Known Issues
===============================================================================
1. If a host has multiple interfaces and (a) each interface belongs to a
   different IP subnet, (b) they all use the same InfiniBand Partition, and (c)
   they are connected to the same IB Switch, then the host violates the IP rule
   requiring different broadcast domains. Consequently, the host may build an
   incorrect ARP table.

   The correct setting of a multi-homed IPoIB host is achieved by using a
   different PKEY for each IP subnet. If a host has multiple interfaces on the
   same IP subnet, then to prevent a peer from building an incorrect ARP entry
   (neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X
   stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This
   causes the network stack to send ARP replies only on the interface with the
   IP address specified in the ARP request:

   sysctl -w net.ipv4.conf.ib0.arp_ignore=1
   sysctl -w net.ipv4.conf.ib1.arp_ignore=1

   Or, globally,

   sysctl -w net.ipv4.conf.all.arp_ignore=1

   To learn more about the arp_ignore parameter, see 
   Documentation/networking/ip-sysctl.txt.
   Note that distributions have the means to make kernel parameters persistent.

2. There are IPoIB alias lines in /etc/modprobe.d/ib_ipoib.conf which prevent
   stopping/unloading the stack (i.e., '/etc/init.d/openibd stop' will fail). 
   These alias lines cause the drivers to be loaded again by udev scripts.

   Workaround: Change modprobe.conf to set
   OFA_KERNEL_PARAMS="--without-modprobe" before running install.pl, or remove 
   the alias lines from /etc/modprobe.d/ib_ipoib.conf.

3. After a hotplug event, the IPoIB interface falls back to datagram mode, and
   MTU is reduced to 2K.
   Workaround: Re-enable connected mode and increase MTU manually:
   echo connected > /sys/class/net/ib0/mode
   ifconfig ib0 mtu 65520

4. Since the IPoIB configuration files (ifcfg-ib<n>) are installed under the
   standard networking scripts location (RedHat:/etc/sysconfig/network-scripts/
   and SuSE: /etc/sysconfig/network/), the option IPOIB_LOAD=no in openib.conf
   does not prevent the loading of IPoIB on boot.

5. If IPoIB connected mode is enabled, it uses a large MTU for connected mode
   messages and a small MTU for datagram (in particular, multicast) messages,
   and relies on path MTU discovery to adjust MTU appropriately. Packets sent
   in the window before MTU discovery automatically reduces the MTU for a
   specific destination will be dropped, producing the following message in the
   system log:
   "packet len <actual length> (> <max allowed length>) too long to send, dropping"

   To warn about this, a message is produced in the system log each time MTU is
   set to a value higher than 2K.

6. In connected mode, TCP latency for short messages is larger by approx. 1usec
   (~5%) than in datagram mode. As a workaround, use datagram mode.

7. ConnectX only: If you have a port configured as ETH and IPoIB is running
    in connected mode, and then you change the port type to IB, the IPoIB mode
    will change to datagram mode.

8. IPoIB datagram mode initial packet loss (bug #1287): When the datagram test
    gets to packet size 8192 or larger, it always loses the first packet in the
    sequence. 
    Workaround: Increase the number of pending skb's before a neighbor is
    resolved (default is 3). This value can be changed with:
    sysctl net.ipv4.neigh.ib0.unres_qlen.

9. If bonding uses an IPoIB slave, then un-enslaving all slaves (or downing
    them with ifdown) followed by unloading the module ib_ipoib might crash the
     kernel. To avoid this leave the IPoIB interfaces enslaved when unloading
     ib_ipoib.

10. On SLES 11, sysconfig scripts override the interface mode and set it to
    datagram on each call to ifup, ifdown, etc. To avoid this, add the line
    IPOIB_MODE=connected
    to the interface configuration file (e.g. ifcfg-ib0)

11. Under SLES11, if an IP configuration exists for an IPoIB interface
    that later becomes a slave of a bonding master, a network restart
    does not erase the IP configuration from the slave and it appears to have
    an IP address even though the new configuration does not set one. This
    may cause problems when trying to use the bonded network interface. To
    avoid this, restart the IB stack (openib restart) once you change the
    configuration.
    This issue is described in
    https://bugs.openfabrics.org/show_bug.cgi?id=1975

===============================================================================
3. IPoIB Configuration Based on DHCP
===============================================================================

Setting an IPoIB interface configuration based on DHCP (v3.0.4 which is
available via www.isc.org) is performed similarly to the configuration of
Ethernet interfaces. In other words, you need to make sure that IPoIB
configuration files include the following line:
	For RedHat:
	BOOTPROTO=dhcp
	For SLES:
	BOOTPROTO=dhcp
Note: If IPoIB configuration files are included, ifcfg-ib<n> files will be 
installed under:
/etc/sysconfig/network-scripts/ on a RedHat machine
/etc/sysconfig/network/ on a SuSE machine

Note: Two patches for DHCP are required for supporting IPoIB. The patch files
for DHCP v3.0.4 are available under the docs/dhcp/ directory.

Standard DHCP fields holding MAC addresses are not large enough to contain an 
IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages 
convey a client identifier field used to identify the DHCP session. This client
identifier field can be used to associate an IP address with a client identifier 
value, such that the DHCP server will grant the same IP address to any client 
that conveys this client identifier.

Note: Refer to the DHCP documentation for more details how to make this 
association.

The length of the client identifier field is not fixed in the specification. 

4.1 DHCP Server
In order for the DHCP server to provide configuration records for clients, an 
appropriate configuration file needs to be created. By default, the DHCP server 
looks for a configuration file called dhcpd.conf under /etc. You can either
edit this file or create a new one and provide its full path to the DHCP server
using the -cf flag. See a file example at docs/dhcpd.conf of this package.
The DHCP server must run on a machine which has loaded the IPoIB module.

To run the DHCP server from the command line, enter:
dhcpd <IB network interface name> -d
Example:
host1# dhcpd ib0 -d

4.2 DHCP Client (Optional)

Note: A DHCP client can be used if you need to prepare a diskless machine with 
an IB driver.

In order to use a DHCP client identifier, you need to first create a 
configuration file that defines the DHCP client identifier. Then run the DHCP 
client with this file using the following command:
dhclient cf <client conf file> <IB network interface name>
Example of a configuration file for the ConnectX (PCI Device ID 26428), called
dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
}
Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218),
called dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
}

In order to use the configuration file, run:
host1# dhclient -cf dhclient.conf ib1


===============================================================================
4. Child interfaces
===============================================================================

4.1 Subinterfaces
-----------------
You can create subinterfaces for a primary IPoIB interface to provide traffic
isolation. Each such subinterface (also called a child interface) has
different IP and network addresses from the primary (parent) interface. The
default Partition Key (PKey), ff:ff, applies to the primary (parent) interface.

4.1.1 Creating a Subinterface
-----------------------------
To create a child interface (subinterface), follow this procedure:
Note: In the following procedure, ib0 is used as an example of an IB
subinterface.

Step 1. Decide on the PKey to be used in the subnet. Valid values are 0-255.
The actual PKey used is a 16-bit number with the most significant bit set. For
example, a value of 0 will give a PKey with the value 0x8000.

Step 2. Create a child interface by running:
host1$ echo <PKey> > /sys/class/net/<IB subinterface>/create_child
Example:
host1$ echo 0 > /sys/class/net/ib0/create_child
This will create the interface ib0.8000.

Step 3. Verify the configuration of this interface by running:
Using the example of Step 2:
host1$ ifconfig ib0.8000
ib0.8000 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-
00-00-00-00-00-00
BROADCAST MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

Step 4. As can be seen, the interface does not have IP or network addresses so
it needs to be configured.

Step 5. To be able to use this interface, a configuration of the Subnet Manager
is needed so that the PKey chosen, which defines a broadcast address, can be
recognized.

4.1.2 Removing a Subinterface
To remove a child interface (subinterface), run:
echo <subinterface PKey> /sys/class/net/<ib_interface>/delete_child
Using the example of Step 2:
echo 0x8000 > /sys/class/net/ib0/delete_child
Note that when deleting the interface you must use the PKey value with the most
significant bit set (e.g., 0x8000 in the example above).


===============================================================================
5. Bug Fixes and Enhancements Since OFED 1.3
===============================================================================
- There is no default configuration for IPoIB interfaces: One should manually
  specify the full IP configuration or use the ofed_net.conf file. See
  OFED_Installation_Guide.txt for details on ipoib configuration.
- Don't drop multicast sends when they can be queued
- IPoIB panics with RHEL5U1, RHEL4U6 and RHEL4U5: Bug fix when copying small
  SKBs (bug 989)
- IPoIB failed on stress testing (bug 1004)
- Kernel Oops during "port up/down test" (bug 1040)
- Restart the stack during iperf 2.0.4 ver2.0.4 in client side cause to kernel
  panic (bug 985)
- Fix neigh destructor oops on kernel versions between 2.6.17 and 2.6.20
- Set max CM MTU when moving to CM mode, instead of setting it in openibd script
- Fix CQ size calculations for ipoib
- Bonding: Enable build for SLES10 SP2
- Bonding: Fix  issue in using the bonding module for Ethernet slaves (see
  documentation for details)

===============================================================================
6. Bug Fixes and Enhancements Since OFED 1.3.1
===============================================================================
- IPoIB: Refresh paths instead of flushing them on SM change events to improve 
  failover respond
- IPoIB: Fix loss of connectivity after bonding failover on both sides
- Bonding: Fix link state detection under RHEL4
- Bonding: Avoid annoying messages from initscripts when starting bond
- Bonding: Set default number of grat. ARP after failover to three (was one)

===============================================================================
7. Bug Fixes and Enhancements Since OFED 1.4
===============================================================================
- Performance tuning is enabled by default for IPOIB CM.
- Clear IPOIB_FLAG_ADMIN_UP if ipoib_open fails 
- Disable napi while cq is being drained (bugzilla #1587)
- rdma_cm: Use the rate from the ipoib broadcast when joining an ipoib
  multicast. When joining an IPoIB multicast group, use the same rate as in the
  broadcast group. Otherwise, if rdma_cm creates this group before IPoIB does,
  it might get a different rate. This will cause IPoIB to fail joining the same
  group later on, because IPoIB has a strict rate selection.
- Fixed unprotected use of priv->broadcast in ipoib_mcast_join_task.
- Do not join broadcast group if interface is brought down

  
===============================================================================
8. Bug Fixes and Enhancements Since OFED 1.4.2
===============================================================================

- Check that the format of multicast link addresses is correct before taking
  them from dev->mc_list to priv->multicast_list.  This way we never try to
  send a bogus address to the SA, which prevents badness from erroneous
  'ip maddr addr add', broken bonding drivers, etc. (bugzilla #1664)
- IPoIB: Don't turn on carrier for a non-active port.
  If a bonding interface uses this IPoIB interface as a slave it might
  not detect that this slave is almost useless and failover
  functionality will be damaged.  The fix checks the state of the IB
  port in the carrier_task before calling netif_carrier_on(). (bugzilla #1726)
- Clear ipoib_neigh.dgid in ipoib_neigh_alloc()
  IPoIB can miss a change in destination GID under some conditions.  The
  problem is caused when ipoib_neigh->dgid contains a stale address.
  The fix is to set ipoib_neigh->dgid to zero in ipoib_neigh_alloc().

===============================================================================
9. Bug Fixes and Enhancements Since OFED 1.5.0
===============================================================================

- Fixed lockup of the TX queue on mixed CM/UD traffic
  When there is a high rate of send traffic on both CM and UD QPs, the
  transmitter can be stopped by the CM path but not re-enabled.

===============================================================================
10. Bug Fixes and Enhancements Since OFED 1.5.2
===============================================================================
1. Fix IPoIB rx_frames and rx_usecs to conform to ethtool documentation.
2. Fix faulty list maintenance in path and neigh list
3. Leave stale send-only multicast groups

===============================================================================
11. Bug Fixes and Enhancements Since OFED 1.5.3
===============================================================================
1. Set pkt_type correctly for multicast packets (fix IGMP breakage)
2. Add support to TSO status query
3. Fix the function that performs dma unmap of memory buffers
4. Remove TX moderation settings from ethtool support
5. Allow disabling/enabling TSO on the fly through ethtool

===============================================================================
12. Bug Fixes and Enhancements Since OFED 1.5.4
===============================================================================
1. IPoIB code based on linux-3.12

===============================================================================
13. Bug Fixes and Enhancements Since OFED 3.12
===============================================================================
 - Validate struct ipoib_cb size
 - Remove unnecessary port query
 - Remove unnecessary test for NULL before debugfs_remove()
 - Avoid multicast join attempts with invalid P_key
 - Avoid flushing the workqueue from worker context
 - Use P_Key change event instead of P_Key polling mechanism
 - Get rid of SET_ETHTOOL_OPS
 - Report operstate consistently when brought up without a link
 - Add flow steering support for IPoIB UD traffic
 - Lower NAPI weight
 - Start multicast join process only on active ports
 - Add path query flushing in ipoib_ib_dev_cleanup
 - Fix usage of uninitialized multicast objects
 - Avoid flushing the driver workqueue on dev_down
 - Fix deadlock between dev_change_flags() and __ipoib_dev_flush()
 - Change CM skb memory allocation to be non-atomic during init
 - Fix crash in dev_open error flow

===============================================================================
14. Performance Tuning
===============================================================================
When IPoIB is configured to run in connected mode, tcp parameter tuning is
performed at driver startup to improve the throughput of medium and large
messages.
The driver startup scripts set the following TCP parameters as follows:

      net.ipv4.tcp_timestamps=0
      net.ipv4.tcp_sack=1
      net.core.netdev_max_backlog=250000
      net.core.rmem_max=4194304
      net.core.wmem_max=4194304
      net.core.rmem_default=4194304
      net.core.wmem_default=4194304
      net.core.optmem_max=4194304
      net.ipv4.tcp_rmem="4096 87380 4194304"
      net.ipv4.tcp_wmem="4096 65536 4194304"
      net.ipv4.tcp_low_latency=1

This tuning is effective only for connected mode. If you run in datagram mode,
it actually reduces performance.

If you change the IPoIB run mode to "datagram" while the driver is running,
the tuned parameters do not get reset to their default values.  We therefore
recommend that you change the IPoIB mode only while the driver is down
(by setting line "SET_IPOIB_CM=yes" to "SET_IPOIB_CM=no" in file
/etc/infiniband/openib.conf, and then restarting the driver).


