Low-latency streaming with Heron on InfiniBand (part 2 of 2)

November 8, 2017

Supun Kamburugamuve

In part 1 of this InfiniBand with Heron blog series, we introduced the InfiniBand high-performance networking standard and described some of its advantages. In this second and final post we’ll take a look at how we integrated InfiniBand with the Heron stream processing engine and outline some performance improvements resulting from the integration.

Integration

We integrated InfiniBand at the Heron Stream Manager level, as shown in Figure 1 below. The implementation uses fixed-size buffer pools with channel (send/receive) semantics for better performance, along with a credit-based flow control mechanism for sending and receiving messages.

Figure 1. InfiniBand integration

Figure 2 below shows the architecture of a Heron topology, which consists of an arbitrarily complex network of spouts that inject data into the topology and bolts that process that data.

Figure 2. An example Heron topology

Buffer management

Each side of the communication uses pools of equally sized buffers. Two such pools are used per channel, one for sending and one for receiving data. For receive operations, all the buffers are posted to the fabric at the start. For transmit operations, buffers are filled with messages and posted to the fabric; once a transmission completes, the buffer is returned to the pool. Send and receive completions are discovered through the completion queues. Individual buffers are kept relatively large to accommodate the largest messages expected. If a single message does not fit in one buffer, it is split into pieces that are placed in multiple buffers; every network message carries the total message length, which the receiver uses to reassemble the pieces.
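
To make this concrete, below is a minimal C++ sketch of a fixed-size receive buffer pool using the standard ibverbs send/receive interface. It is an illustration only, not Heron's actual Stream Manager code: the class name and structure are hypothetical, the real implementation may use a different networking API, and error handling, the send-side pool, and message splitting are omitted.

```cpp
// Illustrative sketch only -- not Heron's actual Stream Manager code.
#include <infiniband/verbs.h>

#include <cstdint>
#include <cstdlib>
#include <vector>

class RecvBufferPool {
 public:
  // Allocate and register num_bufs buffers of buf_size bytes each.
  RecvBufferPool(ibv_pd *pd, ibv_qp *qp, size_t buf_size, int num_bufs)
      : qp_(qp), buf_size_(buf_size) {
    for (int i = 0; i < num_bufs; i++) {
      void *buf = std::malloc(buf_size);
      // Register the buffer so the HCA can DMA incoming data into it.
      ibv_mr *mr = ibv_reg_mr(pd, buf, buf_size, IBV_ACCESS_LOCAL_WRITE);
      buffers_.push_back(buf);
      mrs_.push_back(mr);
    }
  }

  // Post every buffer to the fabric up front, as described above.
  bool PostAll() {
    for (size_t i = 0; i < buffers_.size(); i++) {
      if (!PostOne(i)) return false;
    }
    return true;
  }

  // Re-post a single buffer after its receive completion has been processed.
  bool PostOne(size_t index) {
    ibv_sge sge = {};
    sge.addr = reinterpret_cast<uint64_t>(buffers_[index]);
    sge.length = static_cast<uint32_t>(buf_size_);
    sge.lkey = mrs_[index]->lkey;

    ibv_recv_wr wr = {};
    wr.wr_id = index;  // identifies which buffer completed
    wr.sg_list = &sge;
    wr.num_sge = 1;

    ibv_recv_wr *bad_wr = nullptr;
    return ibv_post_recv(qp_, &wr, &bad_wr) == 0;
  }

 private:
  ibv_qp *qp_;
  size_t buf_size_;
  std::vector<void *> buffers_;  // pre-allocated, equally sized buffers
  std::vector<ibv_mr *> mrs_;    // registered memory region for each buffer
};
```

Receive completions for these buffers would be drained from the completion queue (for example with ibv_poll_cq), after which the corresponding buffer is handed to the message-assembly code and then re-posted with PostOne.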

Flow control

InfiniBand requires flow control between the communicating parties so that the sender does not overflow the receiver. At the same time, flow control should ensure that the sender is not blocked while buffer space is still available at the receiver. Our implementation uses a standard credit-based approach: the credit available to the sender is equal to the number of buffers the receiver has posted to the fabric. Credit information is piggybacked on data transmissions, or sent in separate credit-only messages when there is no outgoing data to carry it. Each data message carries the current credit of the communicating party as a 4-byte integer value.
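
The sketch below shows how this credit accounting might look in C++, independent of any particular networking API. It is a hypothetical illustration of the bookkeeping described above; the struct and class names, and the threshold for sending a standalone credit message, are assumptions rather than details of the actual implementation.

```cpp
// Illustrative sketch of credit-based flow control -- not Heron's actual code.
#include <cstddef>
#include <cstdint>

// Assumed wire header: the total logical message length (used to reassemble
// messages that span multiple buffers) and the sender's current credit,
// carried as a 4-byte integer as described in the text.
struct MessageHeader {
  uint32_t total_length;
  int32_t credit;
};

class CreditManager {
 public:
  // Initial credit equals the number of receive buffers the peer has posted.
  explicit CreditManager(int32_t initial_credit)
      : send_credit_(initial_credit), credit_to_report_(0) {}

  // Sender side: each outgoing buffer consumes one credit, because it will
  // occupy one posted receive buffer at the peer. Returns false if the peer
  // has no free buffers, in which case the send must wait.
  bool TryConsumeSendCredit() {
    if (send_credit_ <= 0) return false;
    --send_credit_;
    return true;
  }

  // Sender side: apply credit carried on an incoming message.
  void OnCreditReceived(int32_t credit) { send_credit_ += credit; }

  // Receiver side: called after a receive buffer has been processed and re-posted.
  void OnBufferReposted() { ++credit_to_report_; }

  // Receiver side: credit to piggyback on the next outgoing data message, or to
  // send in a standalone credit message when no data flows in that direction.
  int32_t TakeCreditToReport() {
    int32_t c = credit_to_report_;
    credit_to_report_ = 0;
    return c;
  }

  // Example policy (an assumption, not Heron's): send a credit-only message
  // once half of the receive pool is owed back to the sender.
  bool ShouldSendCreditMessage(size_t pool_size) const {
    return credit_to_report_ >= static_cast<int32_t>(pool_size / 2);
  }

 private:
  int32_t send_credit_;       // credits available for sending to the peer
  int32_t credit_to_report_;  // credits accumulated locally but not yet reported
};
```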

Experiments

Figure 2 above shows the example Heron topology we used to measure the performance of the InfiniBand implementation. It uses a spout and a bolt arranged in a “shuffle” grouping. The experiments were conducted on a Haswell cluster with 56 Gbps InfiniBand and 1 Gbps Ethernet connections. Figure 3 shows the round-trip latency of a message originating from a spout and traveling to a bolt, with its acknowledgment coming back to the spout. The experiments used message sizes ranging from 16 to 512 bytes.

Figure 3. Latency of the topology on 8 nodes with very small-sized messages

Figure 4 below shows the round-trip latency for the same topology with larger message sizes, up to 500 KB.

Figure 4. Latency of the topology on 8 nodes with larger-sized messages

The initial results showed overall better performance for InfiniBand. There are two main sources of overhead when transferring messages: the network stack and message serialization. With the InfiniBand implementation, the total network time is pushed closer to the time taken by message serialization (which is done using Protocol Buffers and Kryo). Improvements at the message serialization level would reduce Heron's latency even further; a latency goal of half a millisecond would not be unreachable. Further details about the implementation can be found in our paper.1

1. http://dsc.soic.indiana.edu/publications/Heron_Infiniband.pdf

About the guest author

Supun Kamburugamuve is a PhD student in the Pervasive Technology Institute Laboratory at Indiana University. His research currently focuses on sensor networks.