InfiniBand is among the most common and well-known cluster interconnect technologies. However, the complexities of an InfiniBand (IB) network can frustrate even the most experienced cluster administrators. Maintaining a balanced fabric topology, dealing with underperforming hosts and links, and chasing potential improvements keep all of us on our toes. Sometimes, though, a little research and experimentation can find unexpected performance and stability gains.
For example, consider a 1300-node cluster using Intel TrueScale IB for job communication and a Panasas ActiveStor filesystem for storage. Panasas communicates with clients only over Ethernet, not IB, so a group of Mellanox switches act as gateways from the Panasas Ethernet to the TrueScale IB.
Every system has bottlenecks; in our case, the links to and from these IB/Ethernet gateways showed congestion due to the large amount of disk traffic. This congestion adversely affected the whole cluster: jobs couldn’t get the data they needed, and the congestion interfered with other IB traffic as well.
Fortunately, InfiniBand provides a congestion control mechanism that can help mitigate the effects of severe congestion on the fabric. Implementing this feature saved us the expense and trouble of adding more IB/Ethernet gateways.
InfiniBand is intended to be a lossless fabric. IB switches won’t drop packets for flow control unless they absolutely have to, usually in cases of hardware failure or malformed packets. Instead of dropping packets and retransmitting, like Ethernet does, InfiniBand uses a system of credits to perform flow control.
Communication occurs between IB endpoints, which in turn are issued credits based on the amount of buffer space the receiving device has. If the credit cost of the data to be transmitted is less than the credits remaining on the receiving device, the data is transmitted. Otherwise, the transmitting device holds on to the data until the receiving device has sufficient credits free.
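The credit check described above can be sketched in a few lines of Python. This is a toy model, not how an IB HCA or switch is implemented: the `Receiver` class, the `try_send` helper, and the credit values are all hypothetical, and the sketch assumes that a packet which exactly fits the remaining credits may be sent.

```python
from collections import deque


class Receiver:
    """Hypothetical receiver that advertises a fixed pool of buffer credits."""

    def __init__(self, buffer_credits):
        self.credits = buffer_credits

    def accept(self, cost):
        # Buffer space is consumed when a packet arrives.
        self.credits -= cost

    def drain(self, cost):
        # Buffer space is returned once the packet has been processed.
        self.credits += cost


def try_send(receiver, pending):
    """Send queued packets in order while credits allow; return the costs sent."""
    sent = []
    while pending and pending[0] <= receiver.credits:
        cost = pending.popleft()
        receiver.accept(cost)
        sent.append(cost)
    return sent


rx = Receiver(buffer_credits=4)
queue = deque([2, 2, 3])       # credit cost of each queued packet

first = try_send(rx, queue)    # two packets fit; the third is held
rx.drain(2)                    # receiver frees 2 credits
second = try_send(rx, queue)   # still not enough credits for the held packet
rx.drain(2)                    # receiver frees 2 more credits
third = try_send(rx, queue)    # the held packet now goes out
```

Note that nothing is dropped at any point: the sender simply holds the packet until enough credits are free, which is the property that makes the fabric lossless.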
This method of flow control works well for normal loads on well-balanced, non-oversubscribed IB fabrics. However, if the fabric is unbalanced or oversubscribed or just heavily loaded, some links may be oversaturated with traffic beyond the ability of the credit mechanism to help.
Congestion can be observed by checking the IB error counters. When an IB device attempts to transmit data but the receiving device cannot receive data due to congestion, the PortXmitWait counter is incremented. If the congestion is so bad that the data cannot be transmitted before the time-to-live on the packet expires, the packet is discarded and the PortXmitDiscards counter is incremented. If you’re seeing high values of PortXmitWait and PortXmitDiscards counters, enabling congestion control may help manage congestion on your InfiniBand fabric.
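As a rough illustration of watching these two counters, the Python sketch below parses counter text and flags possible congestion. It makes several assumptions: the sample text uses a "Name:....value" layout, the `parse_counters` and `congestion_hints` helpers are hypothetical, and the `wait_threshold` value is an arbitrary placeholder rather than a recommended limit. Adapt the parsing to whatever your diagnostic tools actually print.

```python
import re

# Sample counter dump; the "Name:....value" layout is an assumption.
SAMPLE = """\
PortXmitDiscards:................12
PortXmitWait:....................48213
PortRcvErrors:...................0
"""

COUNTER_LINE = re.compile(r"^(\w+):\.*(\d+)\s*$")


def parse_counters(text):
    """Return a dict mapping counter name to integer value."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def congestion_hints(counters, wait_threshold=10_000):
    """Return human-readable hints; the threshold is a hypothetical placeholder."""
    hints = []
    if counters.get("PortXmitDiscards", 0) > 0:
        hints.append("packets discarded")
    if counters.get("PortXmitWait", 0) > wait_threshold:
        hints.append("transmit waits above threshold")
    return hints


counters = parse_counters(SAMPLE)
hints = congestion_hints(counters)
```

Because these counters accumulate over time, comparing two samples taken a known interval apart is usually more informative than reading a single absolute value.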
When an IB switch detects congestion on a link, it sets a special bit, called the Forward Explicit Congestion Notification (FECN) bit, on packets crossing that link to inform the destination device that congestion has been detected. When the destination receives a packet marked with the FECN bit, it notifies the sending device of the congestion via a Backward Explicit Congestion Notification (BECN) bit.
When the source receives the BECN bit notification from the destination, the sending (source) device begins to throttle the amount of data it sends to the destination. The mechanism it uses is the credits system – by reducing the credits available to the destination, the size and rate of the packets are effectively decreased. The sending device may also add a delay between packets to provide the destination device time to catch up on data.
Over time, the source device increases credits for the destination device, gradually increasing the number of packets sent. If the destination device continues to receive FECN packets from its switch, it again transmits BECN packets to the source device and the throttling is increased again. Without the reception of BECN packets from the destination device, the source device eventually returns to normal packet transmission. This balancing act is managed by congestion control parameters which require tuning for each environment.
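The feedback loop in the last three paragraphs can be modeled as a throttle level that rises on each BECN and decays when none arrive. The sketch below is a deliberately simplified toy: the `SourceThrottle` class, its `max_level` and `increase` parameters, and the one-step-per-tick recovery are all hypothetical stand-ins for the real congestion control parameters, which have to be tuned per environment.

```python
class SourceThrottle:
    """Toy model of a sender reacting to BECN notifications."""

    def __init__(self, max_level=8, increase=2):
        self.level = 0              # 0 means no throttling
        self.max_level = max_level  # hypothetical ceiling on throttling
        self.increase = increase    # hypothetical step applied per BECN

    def on_becn(self):
        # Destination reported congestion: throttle harder, up to the ceiling.
        self.level = min(self.max_level, self.level + self.increase)

    def on_quiet_tick(self):
        # No BECN this tick: recover one step toward normal transmission.
        self.level = max(0, self.level - 1)


def simulate(congested_ticks, total_ticks):
    """Return the throttle level after each tick."""
    src = SourceThrottle()
    history = []
    for tick in range(total_ticks):
        if tick in congested_ticks:
            # Switch marks FECN; destination answers the source with BECN.
            src.on_becn()
        else:
            src.on_quiet_tick()
        history.append(src.level)
    return history


# Congestion during the first three ticks, then a clear link.
levels = simulate(congested_ticks={0, 1, 2}, total_ticks=10)
```

In this model the throttle climbs quickly while BECNs keep arriving and then steps back down to zero once they stop. How fast it rises and how fast it recovers are exactly the kind of knobs that, if mistuned, make congestion control either too aggressive or ineffective.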
After enabling InfiniBand congestion control and tuning it properly, we realized a 15 percent improvement in our Panasas filesystem benchmark testing. PortXmitDiscards counters were completely clear, and PortXmitWait counters were significantly smaller, indicating that congestion control was doing its job.
Given that no additional hardware or other costs were required to achieve these results, a speed increase of 15 percent plus increased stability of the IB fabric was a nice result.
Congestion control must be enabled on all IB devices and hosts, as well as on the IB subnet manager. This process includes turning on congestion control and setting a congestion control key on each device, as well as tuning the congestion control tables and parameters on each host and switch.
After congestion control is enabled on each IB device, the OpenSM configuration file must be modified to tune the subnet manager’s congestion control manager. Please note that mistuned parameters will either wreak havoc on a fabric or be completely ineffectual, so be careful – and do plenty of testing on a safe “test” system. Never attempt this on a live or production system.
Enabling InfiniBand congestion control had an immediate positive effect on our IB fabric. If you are suffering from fabric congestion, enabling congestion control may provide similar relief for your fabric, without the cost of adding hardware. Contact us today for a deeper discussion on how to get maximum performance out of your existing InfiniBand fabric.