Building Data-Driven Applications with Apache Pulsar at STICORP

August 14, 2018

Daniel Ferreira Jorge

Daniel Ferreira Jorge is Director of Operations at STICORP.

Adopting any new technology, including an open source one, is usually seen as a big risk even when it has significant technical advantages. I wanted to share a summary of our experience with Apache Pulsar, a key technology we’ve adopted to support our applications, because there are certainly other companies like ours that could also benefit from taking a closer look at Pulsar.

STICORP is a software company based in Brazil. We provide software solutions that help over 7,000 customers automate and manage tax reporting, helping them ensure compliance, avoid penalties, and identify places where they can save money on taxes. That requires managing a lot of documents and a large number of processing workflows triggered by activities that happen all the time at our customers: invoices, shipments, and payments all generate documents that our software needs to process. In a single day, one customer can generate over 200,000 documents that our software needs to handle.

Our software obtains these documents automatically through integrations with client and government systems and organizes them so that the customer can analyze the data behind them. The content and complexity of these documents depend on the specific business process being addressed. For example, there are different types of documents for invoices, shipments, and payments sent between customers and suppliers, and additional documents for recording when activities such as these have been completed.

Handling all of the different information in these documents requires a workflow consisting of several steps. The documents arrive in an encrypted form, so the first step is to decrypt them. Then the raw message content is extracted as an XML file and converted to JSON. From there, the data is split across multiple topics and published onto an event bus. It ultimately lands downstream in a Couchbase NoSQL database, where other applications can access it.
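As a rough illustration of the middle of that pipeline, the XML-to-JSON conversion step might look like the following Python sketch. The document structure, field names, and flattening approach are hypothetical examples, not STICORP's actual schema:

```python
import json
import xml.etree.ElementTree as ET

def xml_to_json(xml_payload: str) -> str:
    """Convert a (hypothetical) invoice-style XML document to JSON."""
    root = ET.fromstring(xml_payload)
    record = {
        "type": root.tag,  # e.g. "invoice"
        # flatten the child elements into a simple key/value map
        **{child.tag: child.text for child in root},
    }
    return json.dumps(record)

# A minimal made-up document for illustration
doc = "<invoice><customer>1234</customer><total>99.90</total></invoice>"
print(xml_to_json(doc))
# {"type": "invoice", "customer": "1234", "total": "99.90"}
```

In practice each resulting JSON record would then be routed to the appropriate topic on the event bus.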

Performance and scalability are critical for us. Because of the nature of the transactions that generate these documents, we can see sudden spikes of up to 25,000 messages per second from a single customer. It is also important for us to support large numbers of topics: breaking the information in documents up into multiple topics makes it much easier to organize and manage how we connect the right data to the right workflow processors, but it also means that we may have 30 different topics for a single customer.
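One way to keep that many per-customer topics organized is Pulsar's tenant/namespace/topic hierarchy, where a fully qualified topic name looks like `persistent://tenant/namespace/topic`. The naming convention sketched below (one namespace per customer) is a hypothetical example, not our actual scheme:

```python
def topic_name(tenant: str, customer_id: int, doc_type: str) -> str:
    """Build a fully qualified Pulsar topic name.

    Pulsar topics live in a tenant/namespace hierarchy, so giving
    each customer its own namespace keeps that customer's topics
    (invoices, shipments, payments, ...) grouped together.
    """
    return f"persistent://{tenant}/customer-{customer_id}/{doc_type}"

print(topic_name("sticorp", 1234, "invoices"))
# persistent://sticorp/customer-1234/invoices
```

Namespaces are also the level at which Pulsar applies policies such as retention, so a per-customer namespace makes per-customer policy management straightforward.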

Why we needed new technology

We had originally deployed Apache Kafka to provide this event bus. Although we had a stable Kafka infrastructure and the decision to change wasn't easy, we came to realize that Kafka was not the best technology for our needs. Kafka was not originally designed for the cloud-native world we live in today, and because of that we had to spend a lot of time making it work for our application. The main challenge we had with Kafka was that it wasn't good at handling large numbers of topics. In addition, Kafka's architecture made scaling painful for us: because Kafka brokers are stateful, expanding or shrinking a Kafka cluster requires rebalancing partitions, which hurts our performance and latency and limits how easily and quickly we can react to changes in workloads.

Deploying Apache Pulsar

Looking for alternatives, we learned about Apache Pulsar and decided to evaluate it. Because Apache Kafka and Apache Pulsar use similar messaging concepts, we found that 100% of our Kafka use cases could be implemented with Pulsar in exactly the same way they were implemented with Kafka. That is one of the things that made the decision to switch to Pulsar easier.

We also saw some important differences in the architecture and design of Pulsar. A key difference is the decoupling of storage from brokers: Pulsar brokers are stateless, whereas in Kafka data is stored on the brokers themselves. That is one example of an architectural difference that allows Pulsar to do things that are difficult or impossible in Kafka. Examples include:

  • Topic scalability: We need to have over 100,000 topics (without accounting for growth) not only to help us manage different types of data processed by our applications but also to allow individual customers to connect to data in our system with their custom applications. Pulsar’s architecture can handle millions of topics easily.
  • Performance: Because of Pulsar's architecture, reads and writes use different paths. As a result, a spike in reads will not impact write performance at all, and vice versa. Pulsar also supports non-persistent topics, allowing very high throughput for topics that do not need persistence at all, which is great for real-time applications.
  • Message queuing: Pulsar can act as a queue for use cases where a particular subscription does not need ordering. Pulsar handles this at the subscription level rather than the topic level, which is extremely valuable when you add a new consumer that doesn't require ordering to a topic that has other consumers that do. If that new consumer needs to read the topic from the beginning, with Kafka you'd be forced to sacrifice throughput, or to repartition your topic and sacrifice ordering. With Pulsar you can just add a new subscription, and Pulsar will fan the messages out to increase throughput for your new consumer.
  • Easier operations: With Apache Kafka, any capacity expansion requires partition rebalancing, which in turn requires copying whole partitions to the newly added brokers. With Pulsar we can add and remove nodes easily without having to rebalance the entire cluster. In addition, with Pulsar you never have to worry about a partition outgrowing a broker's physical disk space; in Kafka, a partition has to fit on a single broker's disk.
  • Unlimited data retention: Some of our customers need to access their documents even months later. We love being able to keep the data in Pulsar, without deleting it, and use it later when needed. We don't have to start over, and we don't have to worry about losing messages. When we need to execute another business process with the documents for another system, we don't need to go to a database; we can simply read the documents off the message bus and reprocess them for that specific subsystem.

Apache Pulsar simply offered too many benefits to ignore. We made the decision to implement Pulsar and have been very happy with it. We’ve already moved over 30% of our production data flow to Pulsar and aim to have all of it running through Pulsar within the next six months.