An explanation of key processing semantics including at least once, at most once and exactly once
In most data processing scenarios, how often a specific data element is processed is critical to the correct use of data. Data processing guarantees specify what happens when failures or retries occur during the course of processing. These guarantees matter because the possibility of failure, whether in the network, in a machine, or elsewhere, can never be ruled out. After all, if your bank accidentally deducted a payment from your account multiple times due to some minor failure in its systems, you certainly wouldn’t be pleased, and regulators probably wouldn’t be particularly happy either. Let’s take a look at the most common guarantees and what they mean.
We’ll discuss these guarantees in the context of event and message processing systems, where they refer to how many times an event or message is delivered to a consuming application, but the concepts are analogous in other systems and scenarios.
“At most once” means that a system guarantees that a given data element, for example an event or message, will be processed zero or one times but never more. This is essentially a “best effort” approach: if a failure occurs before a streaming or messaging application can fully process a message or event, no additional attempts will be made to retry or retransmit it. The figure below illustrates an example of this.
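To make this concrete, here is a minimal sketch of at-most-once handling in Python. The function and message names are hypothetical, not from any particular messaging system; the point is that each message gets exactly one processing attempt and a failed attempt is simply dropped.

```python
def deliver_at_most_once(messages, process):
    """Attempt each message exactly once; a failure means the message is lost."""
    processed = []
    for msg in messages:
        try:
            processed.append(process(msg))
        except RuntimeError:
            pass  # best effort: no retry and no retransmission
    return processed

# Hypothetical handler that fails on one specific message.
def flaky(msg):
    if msg == "b":
        raise RuntimeError("transient failure")
    return msg.upper()

result = deliver_at_most_once(["a", "b", "c"], flaky)
# "b" was attempted once, failed, and was dropped: result == ["A", "C"]
```

Note that nothing records the failure for later recovery; that absence of retry logic is what makes the guarantee at most once.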
At most once is useful for scenarios in which processing an event or message multiple times is undesirable, but missing a specific message is acceptable. For example, suppose you have a task whose job is to provide frequent updates on the current location of a drone. If processing a specific update fails, and the next update would arrive before the failed one could be reprocessed, there is no value in requiring the out-of-date update to be processed before the most recent one.
“At least once” means that data is guaranteed to be processed one or more times and will never be dropped. This usually means an event or message is replayed or retransmitted if it is lost before it can be fully processed. That replay can lead to an event or message being processed more than once, hence the term “at least once”. The figure below illustrates an example of this scenario: the first operator initially fails to process an event, succeeds on retry, and then succeeds again on a second retry that turns out to have been unnecessary.
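The retry behavior described above can be sketched as follows. This is a simplified model, not any real broker’s API: a message is redelivered until an acknowledgment gets through, and a lost acknowledgment is exactly what causes a duplicate.

```python
def deliver_at_least_once(message, process, ack_succeeds):
    """Redeliver a message until an acknowledgment succeeds; return delivery count."""
    deliveries = 0
    while True:
        deliveries += 1
        process(message)         # processing happens on every delivery
        if next(ack_succeeds):   # a lost ack triggers redelivery
            return deliveries

seen = []
# Simulate one lost acknowledgment followed by a successful one.
acks = iter([False, True])
deliveries = deliver_at_least_once("window-broken", seen.append, acks)
# The message was delivered and processed twice: deliveries == 2
```

The consumer processed the same event twice even though the producer sent it once; the guarantee only promises it is never processed zero times.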
At least once is used when it is critical that no events are missed, but the number of times an event is processed is not important. For example, suppose your application alerts you whenever your security system reports that a window in your home has been broken. You want to guarantee that you receive that notification, but it is not critically important whether you receive one or multiple notifications that a specific window has been broken.
“Exactly once” means that events or messages are guaranteed to be processed once and only once, even in the event of failures and retries.
There are two popular mechanisms typically used to achieve “exactly-once” processing semantics.
The distributed snapshot/state checkpointing method of achieving “exactly-once” is inspired by the Chandy-Lamport distributed snapshot algorithm. With this mechanism, all the state for each operator in the streaming application is periodically checkpointed, and in the event of a failure anywhere in the system, the state of every operator is rolled back to the most recent globally consistent checkpoint. During the rollback, all processing is paused. Sources are also reset to the offset corresponding to the most recent checkpoint. The whole streaming application is essentially rewound to its most recent consistent state, and processing can then restart from that state. The figure below illustrates the basics of this mechanism.
In this diagram, the streaming application is working normally at T1 and the state is checkpointed. At time T2, however, the operator fails to process an incoming message or event. At this point, the state value S = 4 has been saved to durable storage, while the state value S = 12 is held only in the operator’s memory. To resolve this discrepancy, at time T3 the processing graph rewinds the state to S = 4 and “replays” every message or event in the stream from that point up to the most recent, processing each one. The end result is that some data have been processed multiple times, but the outcome is unchanged, because the resulting state is the same no matter how many rollbacks are performed. For this reason, the term “effectively once” is sometimes used to be more precise.
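The rewind-and-replay walkthrough above can be modeled in a few lines. The operator name and the checkpoint shape are illustrative assumptions, but the numbers mirror the diagram: the state and the source offset are checkpointed together, and after a failure both are rolled back so that replay produces the same final state.

```python
class SummingOperator:
    """Toy stateful operator that sums incoming values (S in the diagram)."""

    def __init__(self):
        self.state = 0
        self.checkpoint = (0, 0)  # (saved state, source offset), saved together

    def snapshot(self, offset):
        self.checkpoint = (self.state, offset)

    def restore(self):
        state, offset = self.checkpoint
        self.state = state  # rewind to the last consistent state
        return offset       # the source resets to this offset

events = [4, 8, 3]
op = SummingOperator()
op.state += events[0]
op.snapshot(offset=1)       # checkpoint: S = 4, next offset = 1

op.state += events[1]       # S = 12, held in memory only
# ... failure strikes before the next checkpoint ...
offset = op.restore()       # rewind: S = 4 again
for e in events[offset:]:   # replay from the checkpointed offset
    op.state += e
# Final state is 15 either way: events[1] was seen twice, but the rollback
# means its effect on the state counts exactly once ("effectively once").
```

The key design point is that state and offset are saved atomically as one checkpoint; rolling back one without the other would break the guarantee.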
Another method used to achieve “exactly-once” is through implementing at-least-once event delivery in conjunction with event deduplication on a per-operator basis. Systems utilizing this approach replay failed events for further attempts at processing and remove duplicated events for every operator prior to the events entering the logic in the operator. This mechanism requires that a transaction log be maintained for every operator to track which events it has already processed. The figure below illustrates this approach.
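A minimal sketch of that deduplication step, with hypothetical names, might look like this: delivery upstream is at least once, so duplicates can arrive, but each operator consults a log of already-processed event IDs before its logic runs.

```python
class DedupingOperator:
    """Operator that drops duplicate events before applying its logic."""

    def __init__(self):
        self.processed_ids = set()  # stand-in for a durable transaction log
        self.results = []

    def on_event(self, event_id, payload):
        if event_id in self.processed_ids:
            return  # duplicate from an upstream replay; drop it here
        self.processed_ids.add(event_id)
        self.results.append(payload)  # the operator's actual logic

op = DedupingOperator()
# Event 1 arrives twice because the upstream replayed it after a failure.
for event_id, payload in [(1, "deposit:100"),
                          (2, "withdraw:40"),
                          (1, "deposit:100")]:
    op.on_event(event_id, payload)
# The replayed event was filtered out: results == ["deposit:100", "withdraw:40"]
```

In a real system the log must be durable and updated transactionally with the operator’s output, otherwise a crash between the two steps reintroduces duplicates or losses.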
A good example of where exactly once is needed is in processing most financial transaction data: both sides of a transaction require guarantees that the transaction is processed only once, whether or not there are failures in the processing systems.
Hopefully you now have a better understanding of the processing guarantees commonly used in systems such as stream processing and messaging systems. Choosing a messaging and event processing solution that provides the guarantees you require is crucial to success. One messaging and streaming solution that supports all the guarantees described in this post is Apache Pulsar. You can read more about how Pulsar implements delivery guarantees in this blog post and in the Pulsar documentation.