Apache Pulsar is a next-generation distributed pub-sub messaging system with enterprise features including multi-tenancy, multi-datacenter replication, and strong durability guarantees. It is also a very flexible system, supporting both queuing and streaming. For an introduction to Apache Pulsar, please check out this post.
For architects and developers who are new to the streaming space, it is important to understand the difference between a work queue and a stream, and when to use one versus the other. Use a queue when the order of processing is not important. Let’s look at an example use case.
Consider a website that accepts image uploads and does some basic processing, such as resizing the images to generate thumbnails for display. It doesn’t matter whether one image is processed before another, as long as all the images are guaranteed to be processed within a certain time window. Each worker takes an image from the queue, processes it, outputs the result, and then takes another image from the queue. Because ordering is not a critical factor, this is a work queue use case.
Until the introduction of Apache Pulsar, this use case would have required a dedicated queuing system such as RabbitMQ or ActiveMQ.
Now let’s look at streaming. Unlike a work queue, a stream is the right model when the order in which messages are received and processed matters.
Take retail weblog analysis as an example. Retailers who want to analyze what customers are doing on their website and how they interact with it need to see the correct sequence of events. For example, someone comes to a site, browses clothing, adds some items to the shopping cart, and then abandons the cart. If the ordering is not exact, the analysis of the customer’s behavior becomes meaningless. This is what we mean by streaming.
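To see why ordering matters for this kind of analysis, here is a minimal Python sketch of an abandoned-cart check. The event names and the `abandoned_cart` function are hypothetical, chosen only for illustration; they are not part of Apache Pulsar. The same events give a different answer when they arrive out of order.

```python
from typing import List


def abandoned_cart(events: List[str]) -> bool:
    """Return True if the visitor left while items were still in the cart.

    The result depends on the order of the events: the cart contents are
    evaluated at the moment the "leave" event is seen.
    """
    items_in_cart = 0
    for action in events:
        if action == "add_to_cart":
            items_in_cart += 1
        elif action == "checkout":
            items_in_cart = 0
        elif action == "leave":
            return items_in_cart > 0
    return False


# Hypothetical session, in the order the events actually happened.
in_order = ["visit", "browse", "add_to_cart", "add_to_cart", "leave"]

# The same events, delivered out of order ("leave" arrives too early).
out_of_order = ["visit", "leave", "browse", "add_to_cart", "add_to_cart"]

in_order_result = abandoned_cart(in_order)
out_of_order_result = abandoned_cart(out_of_order)
```

With the events in order, the check reports an abandoned cart. With the same events out of order, it does not, so the analysis silently misses the behavior it was meant to detect.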
In the past, streaming has required a system, such as Apache Kafka, that is separate from the queuing system. Running both increases complexity and hurts developer productivity, because developers need to learn two different systems. It also makes operations harder and more expensive.
In the following short video, Sanjeev Kulkarni and Christian Hasker discuss the difference between queuing and streaming and describe a use case that needs both.
As we outlined above, Apache Pulsar is a very flexible system. Let’s return to the work queue use case, where an enterprise handles incoming images to display as thumbnails on its website. In Apache Pulsar, this is done with shared subscriptions, which let you attach as many consumers as you want to the same subscription. Incoming images are distributed round-robin across the consumers, and any given image is delivered to only one consumer for processing.
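The round-robin behavior described here can be sketched with a small in-memory model. This is a toy illustration, not the Pulsar client API: `dispatch_shared`, the worker names, and the image file names are all assumptions made for the example.

```python
from itertools import cycle
from typing import Dict, List


def dispatch_shared(messages: List[str], consumers: List[str]) -> Dict[str, List[str]]:
    """Distribute messages round-robin so each goes to exactly one consumer."""
    if not consumers:
        raise ValueError("a shared subscription needs at least one consumer")
    assignments: Dict[str, List[str]] = {name: [] for name in consumers}
    next_consumer = cycle(consumers)
    for message in messages:
        assignments[next(next_consumer)].append(message)
    return assignments


# Six hypothetical uploads spread across three hypothetical workers.
images = [f"image-{i}.jpg" for i in range(1, 7)]
result = dispatch_shared(images, ["worker-a", "worker-b", "worker-c"])
# worker-a gets image-1 and image-4, worker-b gets image-2 and image-5,
# worker-c gets image-3 and image-6.
```

Adding a worker to the list spreads the same images across more consumers, which is how a work queue scales out. No image is ever handed to two workers.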
For the use case where an enterprise wants to analyze web log data to understand customer buying patterns, you would use the streaming model in Apache Pulsar. Streaming is strictly ordered, exclusive messaging: there is always only one consumer on the messaging channel, and that consumer receives the messages in the exact order in which they were written.
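As a rough sketch of the exclusive model, the following in-memory Python class allows a single consumer and hands messages back in the order they were written. It is a simplified illustration under stated assumptions, not Pulsar's implementation or client API; the class `ExclusiveSubscription`, its methods, and the consumer names are hypothetical.

```python
from typing import List, Optional


class ExclusiveSubscription:
    """Toy model: one consumer at a time, messages read in write order."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._consumer: Optional[str] = None
        self._log: List[str] = []  # messages in the order they were written
        self._cursor = 0           # index of the next message to deliver

    def attach(self, consumer: str) -> None:
        # Reject a second consumer while one is already attached.
        if self._consumer is not None:
            raise RuntimeError(f"{self.name} already has a consumer")
        self._consumer = consumer

    def publish(self, message: str) -> None:
        self._log.append(message)

    def receive(self, consumer: str) -> Optional[str]:
        if consumer != self._consumer:
            raise PermissionError(f"{consumer} is not attached to {self.name}")
        if self._cursor >= len(self._log):
            return None  # nothing left to read
        message = self._log[self._cursor]
        self._cursor += 1
        return message


subscription = ExclusiveSubscription("weblog-analysis")
subscription.attach("analyzer-1")

clickstream = ["visit", "browse", "add_to_cart", "leave"]
for event in clickstream:
    subscription.publish(event)

received: List[str] = []
while True:
    event = subscription.receive("analyzer-1")
    if event is None:
        break
    received.append(event)

try:
    subscription.attach("analyzer-2")
    second_attach_rejected = False
except RuntimeError:
    second_attach_rejected = True
```

Because only one consumer reads from the channel, there is no interleaving between consumers to reason about, and `received` matches the written order exactly.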
In this video, Sanjeev walks through the architecture for an enterprise use case that uses both queuing and streaming with Apache Pulsar.
In this post and the accompanying videos, we have discussed the differences between queuing and streaming and looked at how Apache Pulsar’s flexible architecture supports both models in a single system. This offers advantages over systems that support only queuing or only streaming, for both enterprise developers and operations teams.