In part 1 of this InfiniBand with Heron blog series, we introduced the InfiniBand high-performance networking standard and described some of its advantages. In this second and final post we’ll take a look at how we integrated InfiniBand with the Heron stream processing engine and outline some performance improvements resulting from the integration.
We integrated InfiniBand at the Heron Stream Manager level, as shown in Figure 1 below. The implementation uses fixed-size buffer pools operating in channel mode for better performance, along with a credit-based flow control mechanism for sending and receiving messages.
Figure 2 below shows the architecture of a Heron topology, which consists of an arbitrarily complex network of spouts that inject data into the topology and bolts that process that data.
Each side of the communication uses pools of equally sized buffers. Two such pools, one for sending and one for receiving, are used for each channel. For receive operations, all the buffers are posted to the fabric up front. For transmissions, buffers are filled with messages and posted to the fabric; once a transmission completes, the buffer is returned to the pool. Send and receive completions are discovered via completion queues. Individual buffers are kept relatively large to accommodate the largest messages expected. If a single message does not fit in one buffer, it is divided into pieces spread across multiple buffers. Every network message carries the total message length, which is used to reassemble the pieces.
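The fragmentation scheme can be sketched as follows. This is an illustrative example, not Heron's actual code; the buffer size and length-header encoding are assumptions for demonstration purposes:

```python
# Sketch: splitting a message across fixed-size buffers and reassembling it
# using a total-length header, as the text describes.
import struct

BUFFER_SIZE = 64  # hypothetical fixed buffer size; real buffers are much larger

def fragment(message: bytes) -> list[bytes]:
    """Prefix the total message length, then split across equally sized buffers."""
    framed = struct.pack(">I", len(message)) + message
    return [framed[i:i + BUFFER_SIZE] for i in range(0, len(framed), BUFFER_SIZE)]

def reassemble(buffers: list[bytes]) -> bytes:
    """Concatenate the buffers and use the length header to recover the message."""
    data = b"".join(buffers)
    (total,) = struct.unpack(">I", data[:4])
    return data[4:4 + total]
```

A 200-byte message with a 64-byte buffer size would occupy four buffers (204 framed bytes); the receiver only needs the length header in the first buffer to know when the message is complete.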
InfiniBand requires flow control between the communicating parties to prevent the sender from overflowing the receiver. At the same time, flow control must ensure that the sender is not blocked while buffer space is available at the receiver. This implementation uses a standard credit-based approach: the credit available to the sender is equal to the number of buffers the receiver has posted to the fabric. Credit information is passed to the other side as part of data transmissions, or via separate messages when there are no data transmissions to carry it. Each data message carries the sender's current credit as a 4-byte integer value.
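The sender-side bookkeeping for this scheme can be sketched as below. This is a simplified illustration under assumed names (`Sender`, `on_credit_update` are not Heron APIs); it shows only the credit accounting and the 4-byte piggybacked credit header described above:

```python
# Sketch of credit-based flow control: the sender's credit equals the number
# of receive buffers the peer has posted; each outgoing data message
# piggybacks this side's own current credit as a 4-byte integer.
import struct

class Sender:
    def __init__(self, initial_credit: int):
        self.credit = initial_credit  # receive buffers posted by the peer

    def can_send(self) -> bool:
        return self.credit > 0

    def send(self, payload: bytes, my_credit: int) -> bytes:
        if not self.can_send():
            raise RuntimeError("no credit: receiver has no free buffers")
        self.credit -= 1  # one of the peer's receive buffers will be consumed
        # Piggyback this side's own credit in a 4-byte header.
        return struct.pack(">I", my_credit) + payload

    def on_credit_update(self, new_credit: int):
        self.credit = new_credit  # the peer re-posted receive buffers
```

When the sender runs out of credit it simply stops posting transmissions until a credit update arrives, so the receiver's buffer pool can never overflow.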
Figure 2 above shows the example Heron topology that we used for measuring the performance of the InfiniBand implementation. It consists of a spout and a bolt arranged in a "shuffle" grouping. The experiments were conducted on a Haswell cluster with 56Gbps InfiniBand and 1Gbps Ethernet interconnects. Figure 3 shows the round-trip latency of a message originating from the spout and going to the bolt, with its acknowledgment coming back to the spout. The experiments were conducted with message sizes ranging from 16 to 512 bytes.
Figure 4 below shows round-trip latency for the same topology with larger message sizes, up to 500 KB.
The initial results showed overall better performance for InfiniBand. There are two main sources of overhead at the network layer: one imposed by the network stack, the other produced by message serialization. With the InfiniBand implementation, the total network time is pushed closer to the time spent on message serialization (which is done using Protocol Buffers and Kryo). Improvements at the message serialization level would therefore reduce Heron's latency even further; a latency target of half a millisecond would not be unreachable. Further details about the implementation can be found in our paper.1