
2020-03-07 - Fun with Infiniband at home (part 2)

Note - though I'm using Fedora 31 and saying it's easy, I experienced hard crashes when the Infiniband network came up after a `dnf upgrade`, just as I sat down to write this. I had to fall back to a previous kernel due to a bug in kernels >5.4.21: https://bugzilla.redhat.com/show_bug.cgi?id=1806981

This type of bug is not typical, in my experience.

Infiniband Software Stack

In a post a month ago I wrote a bit about buying some FDR Infiniband network cards for my home lab setup. I use 2 machines networked with ConnectX3 FDR Infiniband cards to test that things work with IB in the HPC-focused projects I'm involved in, and as a fast general-use network between my workstation and a box acting as a server of sorts. Now let's take a look at the software stack.

OFED, MLNX_OFED, and Distro Packages

If you've been around people using or administering HPC systems, where Infiniband is common and used with large MPI workloads, it might seem like it's very complicated to get an IB network set up correctly. Luckily it's now very straightforward in standard Linux distributions. IB networking requires kernel modules that provide a driver for the card and support for the native and IP protocols. There are also various libraries associated with using the native Infiniband 'verbs', RDMA etc. Many deployments use the OpenFabrics Enterprise Distribution (OFED) which bundles all of these things, packaging them up for enterprise Linux distributions such as Red Hat Enterprise Linux and SUSE Linux Enterprise Server. Mellanox provides their own distribution (MLNX_OFED) which is further tested / optimized for their cards.

OFED and MLNX_OFED are only distributed for long-term supported 'enterprise' Linux distributions. You won't find an official OFED package for e.g. Arch Linux. Luckily all the component parts of OFED are open-source, and are packaged by distributions. If you're reading around on the net, note that Mellanox refers to the upstream drivers, as opposed to those distributed as part of their MLNX_OFED packages, as 'inbox' drivers.

Installation on Fedora / CentOS

I'm running Fedora 31 on the machines I want to use the Infiniband cards with. Mellanox do distribute a Fedora version of their OFED package, but it lags behind new Fedora releases. The current version of MLNX_OFED supports Fedora 30. Instead, I can install the distribution's drivers and libraries easily with:

sudo yum install libibverbs ucx-ib ucx-rdmacm opensm infiniband-diags

This will bring in a bunch of other packages covering most IB needs. On CentOS or Red Hat Enterprise Linux it's easier still as there's a group you can install:

sudo yum groupinstall "Infiniband Support"

Modifying memlock limits

Later on when you start doing much with programs that use IB natively you may get cryptic errors about not being able to allocate memory. Programs performing RDMA transfers etc. need to lock large regions of memory. The default security limits on most Linux distributions only allow non-root users to lock a small amount:

08:47 PM $ ulimit -a
...
max locked memory       (kbytes, -l) 65536
...

You can increase this limit by setting a higher value in /etc/security/limits.conf. For simplicity I'm setting the hard and soft limits to unlimited by adding the lines:

*		hard	memlock	unlimited
*		soft	memlock	unlimited
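
Entries in limits.conf are applied by pam_limits when a session starts, so the new values only show up after logging in again. A minimal sketch for checking the result follows. The helper name memlock_at_least and the 1 GiB threshold are made up for illustration and don't come from any IB tool:

```shell
# Hypothetical helper: succeeds if a locked-memory limit, in the format
# printed by `ulimit -l` (kbytes or "unlimited"), is at least $2 kbytes.
memlock_at_least() {
    [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}

# Check the current shell's limit against 1 GiB (1048576 kbytes).
# The threshold is an arbitrary example value, not an IB requirement.
if memlock_at_least "$(ulimit -l)" 1048576; then
    echo "memlock limit looks sufficient: $(ulimit -l)"
else
    echo "memlock limit is low: $(ulimit -l) kbytes"
fi
```

With the default shown above (65536) this takes the second branch. After the limits.conf change and a fresh login it should take the first.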

Start a subnet manager

Infiniband networks are co-ordinated by something called a subnet manager. This is a process that runs on a machine in the network, or on an IB switch. It discovers the topology of the network and manages the routes for traffic through the network etc. In my simple network consisting of 2 hosts connected with a single IB cable I still need a subnet manager running for the machines to be able to communicate. opensm is the software subnet manager that we can run on a Linux system:

sudo systemctl enable opensm
sudo systemctl start opensm

Check the connection

Now that a subnet manager is running on one of the machines, the lights on the cards of both should be active, and we can check the connection state with the ibstat command:

dave@ythel~> ibstat
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.42.5000
	Hardware version: 1
	Node GUID: 0xf4521403007f2a10
	System image GUID: 0xf4521403007f2a13
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40 (FDR10)
		Base lid: 1
		LMC: 0
		SM lid: 1
		Capability mask: 0x0259486a
		Port GUID: 0xf4521403007f2a11
		Link layer: InfiniBand

We can see my card is now Active and the link speed is 40Gbps (since I have a cheaper 40Gbps FDR10 cable… though the cards are capable of a 56Gbps FDR link).

The Base lid: 1 is the local identifier of this host within the network. The SM lid: 1 shows where the subnet manager is running - on the same machine in this case. The GUID values are global IDs that would be important if we had multiple Infiniband networks and routed between them, but are not interesting on a small network.
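
When checking more than one machine or port it can be handy to boil the ibstat output down to the port number, state and rate. A rough sketch, assuming the output layout shown above. ib_port_summary is a made-up helper name, not part of infiniband-diags:

```shell
# Hypothetical helper: reads ibstat-style text on stdin and prints
# "<port> <state> <rate>" for every port found.
ib_port_summary() {
    awk '
        # "Port 1:" starts a port section ("Port GUID:" does not match)
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:$/, "", port) }
        # "State:" is the logical state ("Physical state:" does not match)
        $1 == "State:"                   { state = $2 }
        # "Rate:" follows the state lines, so print the summary here
        $1 == "Rate:"                    { print port, state, $2 }
    '
}

# Usage on a machine with an IB card:
#   ibstat | ib_port_summary
```

The matching is done on whitespace-separated fields rather than on indentation, so it should not matter whether the output is indented with tabs or spaces.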

We can use the ibnetdiscover command to get an overview of the network, which shows 2 hosts:

dave@ythel~> sudo ibnetdiscover
[sudo] password for dave:
#
# Topology file: generated on Sat Mar  7 21:25:30 2020
#
# Initiated from node f4521403007f2a10 port f4521403007f2a11

vendid=0x2c9
devid=0x1003
sysimgguid=0x2c90300a51963
caguid=0x2c90300a51960
Ca	2 "H-0002c90300a51960"		# "piran mlx4_0"
[1](2c90300a51961) 	"H-f4521403007f2a10"[1] (f4521403007f2a11) 		# lid 2 lmc 0 "ythel mlx4_0" lid 1 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf4521403007f2a13
caguid=0xf4521403007f2a10
Ca	2 "H-f4521403007f2a10"		# "ythel mlx4_0"
[1](f4521403007f2a11) 	"H-0002c90300a51960"[1] (2c90300a51961) 		# lid 1 lmc 0 "piran mlx4_0" lid 2 4xFDR10

Set up IP networking over IB

Infiniband is different than the Ethernet with TCP/IP networking that we are more familiar with. There's a nice overview of how communication works here: https://blog.zhaw.ch/icclab/infiniband-an-introduction-simple-ib-verbs-program-with-rdma-write/

Luckily we can run an IP network over Infiniband using IPoIB. This lets us run any software that expects to talk to a host using an IP address and port, alongside programs written to exploit IB's native low-latency verbs and RDMA transfers.

These days an IPoIB interface can be set up easily using NetworkManager, e.g. via the nmtui command or in the GUI. I chose to network my 2 machines as 10.1.1.215 and 10.1.1.216, on the 10.1.1.0/24 network. Your IB network must use a different subnet from any existing Ethernet interfaces, so that traffic is routed correctly. I used nmtui to add a connection, chose the Infiniband option and entered those static IPv4 details. The first port of the cards appears as the device ibp1s0 on both of my machines. ib0 would be the more traditional device name on systems that do not rename net devices by physical location in the system.
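
For a scripted setup, the same connection can probably be created with nmcli instead of nmtui. This is a sketch under that assumption, not a tested recipe for these machines. It reuses the ibp1s0 device name and the addresses from above, and add_ipoib_connection and the NMCLI override are invented for illustration:

```shell
# Hypothetical wrapper around `nmcli connection add` for a static IPoIB
# address. $1 = device name (also used as the connection name),
# $2 = IPv4 address with prefix length.
# Set NMCLI=echo to print the arguments instead of changing anything.
add_ipoib_connection() {
    ${NMCLI:-nmcli} connection add type infiniband \
        con-name "$1" ifname "$1" \
        ipv4.addresses "$2" ipv4.method manual
}

# Usage (as a user allowed to modify NetworkManager connections):
#   add_ipoib_connection ibp1s0 10.1.1.215/24    # on the first machine
#   add_ipoib_connection ibp1s0 10.1.1.216/24    # on the second machine
```

Running it first with NMCLI=echo shows the exact command line before anything is changed.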

Testing things

After bringing up the IPoIB networking with nmtui I can ping 10.1.1.216 from the machine that was set up as 10.1.1.215, and the reverse should work too:

dave@ythel~> ping 10.1.1.216
PING 10.1.1.216 (10.1.1.216) 56(84) bytes of data.
64 bytes from 10.1.1.216: icmp_seq=1 ttl=64 time=0.215 ms
64 bytes from 10.1.1.216: icmp_seq=2 ttl=64 time=0.173 ms
64 bytes from 10.1.1.216: icmp_seq=3 ttl=64 time=0.201 ms
^C
--- 10.1.1.216 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2087ms
rtt min/avg/max/mdev = 0.173/0.196/0.215/0.017 ms

I can test the performance of the IPoIB networking with iperf3:

# Run a server on one machine
dave@ythel~> iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
...
# Client on another machine
dave@piran~> iperf3 -c 10.1.1.215
Connecting to host 10.1.1.215, port 5201
[  5] local 10.1.1.216 port 42710 connected to 10.1.1.215 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.44 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   1.00-2.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   2.00-3.00   sec  1.44 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   3.00-4.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   4.00-5.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   5.00-6.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   6.00-7.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   7.00-8.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   8.00-9.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   9.00-10.00  sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  14.3 GBytes  12.3 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  14.3 GBytes  12.3 Gbits/sec                  receiver

This shows that I have 12.3Gbits/sec using IP between my 2 Infiniband hosts. We know from the earlier hardware blog post that the card in my workstation is limited by available PCIe lanes to 16Gbps max. Given that's the maximum at the PCIe layer, and there is overhead in the IB stack and the IP stack, 12.3Gbits/sec seems pretty fast.
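
The iperf3 figures can be cross-checked with a little arithmetic. This sketch assumes iperf3 reports Transfer in 1024-based units and Bitrate in 1000-based units. That is an assumption about its unit handling, though it is consistent with the output shown. Both function names are invented for this example:

```shell
# Convert an iperf3 "Transfer" figure in GBytes (taken here as GiB) over a
# number of seconds into decimal Gbits/sec, to one decimal place.
transfer_to_gbps() {
    awk -v gib="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", gib * 1024 * 1024 * 1024 * 8 / secs / 1e9 }'
}

# Express a measured rate as a whole-number percentage of a limit.
percent_of_limit() {
    awk -v rate="$1" -v limit="$2" \
        'BEGIN { printf "%.0f\n", 100 * rate / limit }'
}

transfer_to_gbps 14.3 10     # 14.3 GBytes transferred in 10 s
percent_of_limit 12.3 16     # 12.3 Gbits/sec against the 16Gbps PCIe limit
```

With the figures from this run the first call comes out at 12.3, matching the Bitrate column, and the second at 77. So IPoIB is delivering roughly three quarters of what the PCIe slot allows.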

I can now go ahead and use my fast IB network just as I do my normal 1Gbps Ethernet, but get more than 10x the speed, simply by using the IP addresses associated with the Infiniband interfaces. When I go on to try out software that is written specifically to use native Infiniband verbs and RDMA instead of the IP layer, I should get closer to the 16Gbps PCIe limit.

