Low-latency streaming with Heron on InfiniBand (part 1 of 2)

October 20, 2017

Supun Kamburugamuve

Ethernet is the most widely used networking technology available today, in both public cloud infrastructure and enterprise datacenters. Apart from Ethernet, there are other networking technologies that offer substantially higher performance. These are used predominantly by high-performance computing (HPC) clusters, including supercomputers. With recent advancements in hardware, these high-performance networks have become cheaper. They require separate installation, network administration, and changes to application code, but performance-critical applications can nonetheless achieve significant gains.

HPC systems have been around for close to three decades, while so-called “Big Data” computing has become dominant in the last decade. High-performance parallel computations require operations running in tight synchronization across thousands of cores. To achieve this level of parallelism, the communication layer must provide ultra-low latencies even as the system scales to thousands of nodes. HPC frameworks achieve efficient communication in such environments, with latencies often below 20 µs. The use of specialized hardware is a big factor in scaling HPC applications while keeping latencies low.

Today, there are many high-performance networks available, including InfiniBand, Intel Omni-Path, and Cray Aries. InfiniBand is the most widely used interconnect, both in enterprise datacenters for performance-critical applications and in large HPC clusters for scientific applications. The latest InfiniBand hardware provides 100 Gbps of bandwidth and less than 0.5 µs peer-to-peer latency. As a research project, we have integrated the InfiniBand interconnect with Heron to evaluate its potential benefits. The initial results show good performance gains in both latency and throughput, and we have identified further potential improvements.

InfiniBand

Unlike sockets/TCP, which provide a streaming protocol, InfiniBand works at the message level and preserves message boundaries when transferring data. InfiniBand supports several modes of message transfer; Table 1 shows a summary of these modes. Both connection-oriented and connectionless data transfer are available, and messaging can be reliable or unreliable depending on the requirements. Data can be transferred using either channel semantics (send/receive) or memory semantics. Channel semantics are similar to TCP send/receive, while memory semantics allow a process to write directly to, or read directly from, the memory of a remote machine.

All computations related to protocol handling are performed on the InfiniBand network interface card, leaving the host CPU free for other work such as data processing. Data is transferred directly from the user buffer to the network without additional copying. The host kernel is not involved in either reading data from or writing data to the network, so the application can work in user space without any context switches.

Operation     Unreliable datagram   Unreliable connection   Reliable connection   Reliable datagram
Send          Yes                   Yes                     Yes                   Yes
Receive       Yes                   Yes                     Yes                   Yes
RDMA write    No                    Yes                     Yes                   Yes
RDMA read     No                    No                      Yes                   Yes
Atomic        No                    No                      Yes                   Yes

Table 1: Operations supported by each InfiniBand transfer mode
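
The zero-copy path described earlier depends on registering user memory with the NIC before it can be used in transfers. As a rough illustration (not code from the Heron integration), a registration call with the Verbs API might look like the following sketch; the protection domain pd is assumed to have been allocated already with ibv_alloc_pd.

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

/* Sketch: register a user-space buffer so the NIC can DMA to and from it
 * directly. Assumes a protection domain 'pd' allocated with ibv_alloc_pd.
 * Registration pins the pages and yields the keys used in work requests. */
struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);
    if (buf == NULL)
        return NULL;
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |   /* NIC may write received data here */
                      IBV_ACCESS_REMOTE_WRITE |  /* peers may RDMA-write into it     */
                      IBV_ACCESS_REMOTE_READ);   /* peers may RDMA-read from it      */
}
```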

Channel and memory semantics

In channel mode, a queue pair is used, consisting of a transmission (send) queue and a receive queue. To transfer a message, a descriptor is posted to the transmission queue; the descriptor includes the address of the memory buffer to transfer. To receive a message, a descriptor must be submitted along with a pre-allocated receive buffer. The user program queries the completion queue associated with the transmission or the receive queue to determine the success or failure of a work request. Once a message arrives, the hardware puts it into the posted receive buffer, and the user program learns of the new message through the completion queue. Note that this mode requires receive buffers to be posted before the transmission can happen successfully.
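
To make this concrete, here is a minimal sketch of channel semantics using the low-level Verbs API. It assumes a connected queue pair qp, a completion queue cq, and a registered memory region mr already exist (placeholder names), and it omits error handling.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Post a receive descriptor pointing at a pre-allocated, registered buffer.
 * This must be done before the peer's send arrives. */
int post_receive(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,
        .length = len,
        .lkey   = mr->lkey,              /* local key from memory registration */
    };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_wr;
    return ibv_post_recv(qp, &wr, &bad_wr);
}

/* Post a send descriptor to the transmission queue. */
int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t) buf, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 2,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,       /* channel semantics: plain send */
        .send_flags = IBV_SEND_SIGNALED, /* ask for a completion entry    */
    };
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}

/* Poll the completion queue to learn whether a work request finished. */
int wait_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);     /* busy-poll; 0 means the queue is empty */
    } while (n == 0);
    return n > 0 && wc.status == IBV_WC_SUCCESS;
}
```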

With memory semantics, Remote Direct Memory Access (RDMA) operations are used. Two processes preparing to communicate register memory and share the details with each other. Write and read operations are then used instead of send and receive operations. If a process wishes to write to remote memory, it posts a write operation with the local address of the data. The completion of the write can be detected using the associated completion queue. The receiving side is not notified about the write operation and has to use out-of-band mechanisms to discover that the write has completed; the same is true for remote reads. RDMA is more suitable for large message transfers, while channel mode is better suited to small messages.
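
A corresponding sketch for memory semantics posts a one-sided RDMA write with the Verbs API. The remote buffer address and key (remote_addr and rkey below) are assumed to have been exchanged out of band; as before, the names are placeholders and error handling is omitted.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Write the contents of a local registered buffer directly into a buffer the
 * peer has registered and advertised. The remote CPU is not involved and
 * receives no notification; only the initiator sees a completion entry. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t) local_buf, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 3,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE, /* one-sided write into remote memory */
        .send_flags = IBV_SEND_SIGNALED, /* local completion only              */
    };
    wr.wr.rdma.remote_addr = remote_addr; /* address registered by the peer */
    wr.wr.rdma.rkey        = rkey;        /* remote key shared by the peer  */

    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```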

Programming

InfiniBand is programmed using the Verbs API, which is available on most major platforms. Because each interconnect exposes its own API, programming against each of them individually is cumbersome. The libfabric library provides a unified programming API over these interconnects, and because of this advantage we have chosen libfabric as the API for our integration.
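
As a rough illustration of the libfabric resource model (again, not code from the Heron integration), setting up a reliable, connection-oriented endpoint might look like the following sketch. Error handling, connection management, memory registration, and the actual data-transfer calls are omitted.

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

int setup_endpoint(void)
{
    struct fi_info *hints = fi_allocinfo(), *info;
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_ep *ep;
    struct fid_cq *cq;
    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };

    hints->ep_attr->type = FI_EP_MSG;   /* connection-oriented, reliable endpoint */
    hints->caps          = FI_MSG;      /* channel (send/receive) semantics       */

    /* Discover a provider (e.g. the verbs provider on InfiniBand) matching the hints. */
    fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);

    fi_fabric(info->fabric_attr, &fabric, NULL); /* fabric: the network itself           */
    fi_domain(fabric, info, &domain, NULL);      /* domain: one NIC / protection domain  */
    fi_endpoint(domain, info, &ep, NULL);        /* endpoint: analogous to a queue pair  */
    fi_cq_open(domain, &cq_attr, &cq, NULL);     /* completion queue for both directions */
    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);

    /* Connection management (fi_connect/fi_accept via an event queue), memory
     * registration, and fi_send/fi_recv/fi_cq_read calls would follow here. */
    fi_freeinfo(hints);
    return 0;
}
```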

In this blog post, we provided an overview of high-performance interconnects. In the second part of the series, we’ll take a look at how Heron’s architecture makes it easy to use InfiniBand, and quantify the dramatic performance improvement through experiments.

About the guest author

Supun Kamburugamuve is a PhD student in the Pervasive Technology Institute Laboratory of Indiana University. His research currently focuses on sensor networks.