Configuring Apache Pulsar Tiered Storage with Amazon S3

October 12, 2018

Jia Zhai

Ivan Kelly

Apache Pulsar’s tiered storage feature enables Pulsar to offload older messages on a topic to a long-term storage system, freeing up space in Apache BookKeeper and taking advantage of scalable low-cost storage options such as cloud storage.

Tiered storage is valuable for topics whose data you want to retain for a very long time. For example, if a topic contains user actions that you use to train your recommendation systems, you may want to keep that data for a long time so that, when you create a new version of your recommendation model, you can rerun it against the full user history.

How Apache Pulsar Stores Data

Apache Pulsar stores topics using what we call a segment-oriented architecture. A topic in Pulsar is persisted to a log, known as a managed ledger, stored in Apache BookKeeper. This log is composed of an ordered list of segments. Because a log is append-only, Pulsar only writes to the final segment of the log. All previous segments are sealed, and the data within the segment is immutable.

The tiered storage offloading mechanism takes advantage of this segment-oriented architecture. When a topic is offloaded to an external storage system, the segments of its log are copied, one by one, to that storage system. All segments of the log, apart from the segment currently being written to, can be offloaded.
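To make the segment model concrete, here is a minimal Python sketch of an append-only log that rolls over to a new segment and offloads only sealed segments. It is a toy model under our own assumptions, not Pulsar or BookKeeper code: the `Segment` and `SegmentedLog` names, the in-memory dictionary standing in for long-term storage, and the roll-over-on-next-append behavior are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    segment_id: int
    entries: List[bytes] = field(default_factory=list)
    sealed: bool = False      # sealed segments are never written to again
    offloaded: bool = False   # True once copied to the stand-in cold store


class SegmentedLog:
    """Toy append-only log split into segments (hypothetical model)."""

    def __init__(self, max_entries_per_segment: int) -> None:
        self.max_entries = max_entries_per_segment
        self.segments = [Segment(segment_id=0)]

    def append(self, entry: bytes) -> None:
        current = self.segments[-1]
        if len(current.entries) >= self.max_entries:
            # Roll over: seal the full segment and open a new one.
            current.sealed = True
            current = Segment(segment_id=current.segment_id + 1)
            self.segments.append(current)
        current.entries.append(entry)

    def offloadable(self) -> List[Segment]:
        # Everything except the segment currently being written to.
        return [s for s in self.segments if s.sealed and not s.offloaded]

    def offload(self, store: Dict[int, List[bytes]]) -> None:
        # Copy sealed segments one by one into the cold store.
        for segment in self.offloadable():
            store[segment.segment_id] = list(segment.entries)
            segment.offloaded = True


log = SegmentedLog(max_entries_per_segment=1000)
for i in range(2500):
    log.append(b"msg-%d" % i)

cold_store: Dict[int, List[bytes]] = {}
log.offload(cold_store)
print(sorted(cold_store))  # ids of the segments copied to the cold store
```

In this model, 2,500 appends leave two sealed segments and one open segment, and only the two sealed ones are copied.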

Tiered storage illustration

Apache Pulsar currently supports multiple cloud storage systems for tiered storage. In this post we’ll walk through a simple example of configuring a standalone Pulsar cluster to use Amazon S3 to store the offloaded segments. The key steps we’ll cover are:

  • Creating and configuring a bucket in Amazon S3.
  • Configuring Pulsar to use that S3 bucket as a storage tier.
  • Starting Pulsar and validating offload.

There’s also a video recording of these steps at the end of this blog post.

Step-by-Step Configuration

Step 1: Setting up an S3 bucket

Our first step is to create the S3 bucket that will be used as tiered storage. To do that, we first log in to the AWS console and choose the S3 service.

Using the AWS Console to create an Amazon S3 bucket

Next, create a bucket: click the “Create bucket” button, name the bucket, click the “Next” button, and confirm the creation.
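Since the bucket name is reused later in the Pulsar configuration, it can help to sanity-check it before creating the bucket. The sketch below covers only a subset of the S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit; no adjacent dots; not shaped like an IP address). The function name is hypothetical, and the AWS documentation remains the authority on the full rule set.

```python
import re

# Subset of the S3 bucket naming rules; see the AWS documentation for the rest.
_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IP_LIKE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the basic checks listed above."""
    if _NAME.fullmatch(name) is None:
        return False
    if ".." in name:
        return False
    return _IP_LIKE.fullmatch(name) is None


print(looks_like_valid_bucket_name("offload-test-aws"))
```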

Creating an Amazon S3 bucket

After this, the new bucket should have been created successfully.

Successful creation of an Amazon S3 bucket

Also make sure your AWS credentials are set correctly:

$ cat ~/.aws/credentials
[default]
aws_access_key_id = XXXXXXXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXX
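If you prefer to check the file programmatically, the following sketch uses Python’s standard `configparser` to confirm that a `[default]` profile defines both keys with non-empty values. The `credentials_ok` helper is our own hypothetical name, and the check only looks at the file’s shape; it does not verify that AWS accepts the keys.

```python
import configparser

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def credentials_ok(text: str, profile: str = "default") -> bool:
    """Return True if the profile defines both keys with non-empty values."""
    parser = configparser.ConfigParser()
    try:
        parser.read_string(text)
    except configparser.Error:
        return False  # not a parseable INI file
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return all(section.get(key, "").strip() for key in REQUIRED_KEYS)


# Placeholder values only; never paste real keys into scripts.
sample = """
[default]
aws_access_key_id = XXXXXXXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXX
"""
print(credentials_ok(sample))
```

To check the real file, read it first, for example with `pathlib.Path.home().joinpath(".aws", "credentials").read_text()`, and pass the result to `credentials_ok`.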

Step 2: Configure Pulsar

Now let’s configure Pulsar to use that S3 bucket as a cold storage tier.

To do that, first download the Pulsar binary release (apache-pulsar-x.x.x-bin.tar.gz) from http://pulsar.apache.org/en/download and untar it.

Then cd into the root directory of the extracted release and add the offload configuration settings to the end of conf/standalone.conf:

managedLedgerOffloadDriver=S3
s3ManagedLedgerOffloadBucket=offload-test-aws
s3ManagedLedgerOffloadRegion=us-east-1

In the same file, conf/standalone.conf, also lower the maximum ledger size and the minimum rollover time so that topics roll over to new segments more quickly:

# Max number of entries to append to a ledger before triggering a rollover
# A ledger rollover is triggered on these conditions
#  * Either the max rollover time has been reached
#  * or max entries have been written to the ledger and at least min-time has passed
managedLedgerMaxEntriesPerLedger=1000

# Minimum time between ledger rollover for a topic
managedLedgerMinLedgerRolloverTimeMinutes=0
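Before starting the broker, it can be worth confirming that the edited file really carries the values we intend. The sketch below parses properties-style text and reports any of the five settings that are missing or different, so a misspelled key shows up as a mismatch. It assumes the last assignment of a key wins (as with Java’s `Properties.load`); the helper names and the sample file contents, including the earlier default values, are illustrative assumptions.

```python
from typing import Dict, Optional

EXPECTED = {
    "managedLedgerOffloadDriver": "S3",
    "s3ManagedLedgerOffloadBucket": "offload-test-aws",
    "s3ManagedLedgerOffloadRegion": "us-east-1",
    "managedLedgerMaxEntriesPerLedger": "1000",
    "managedLedgerMinLedgerRolloverTimeMinutes": "0",
}


def parse_properties(text: str) -> Dict[str, str]:
    """Parse key=value lines, skipping blanks and # comments; last value wins."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def mismatches(text: str) -> Dict[str, Optional[str]]:
    """Map each expected key whose effective value differs to that value."""
    actual = parse_properties(text)
    return {k: actual.get(k) for k, v in EXPECTED.items() if actual.get(k) != v}


# Illustrative file contents: earlier defaults followed by our appended settings.
SAMPLE_CONF = """
managedLedgerMaxEntriesPerLedger=50000
managedLedgerMinLedgerRolloverTimeMinutes=10
managedLedgerOffloadDriver=

managedLedgerOffloadDriver=S3
s3ManagedLedgerOffloadBucket=offload-test-aws
s3ManagedLedgerOffloadRegion=us-east-1
managedLedgerMaxEntriesPerLedger=1000
managedLedgerMinLedgerRolloverTimeMinutes=0
"""
print(mismatches(SAMPLE_CONF))  # an empty dict means every setting matches
```

To check the real file, pass the contents of conf/standalone.conf to `mismatches` instead of `SAMPLE_CONF`.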

Then start Pulsar in standalone mode:

$ bin/pulsar standalone

Step 3: Verify Segment Offloading

Now let’s test our configuration by consuming and producing messages.

In a new terminal tab, run the consume command so that the topic has a subscription and its data is not automatically dropped.

$ bin/pulsar-client consume -s "my-sub-name" my-topic-for-offload

In another new terminal tab, run the produce command twice so that the topic holds enough data to create two segments (each with 1000 entries).

$ bin/pulsar-client produce my-topic-for-offload --messages "hello pulsar this is the content for each message" -n 1000

Now let’s manually trigger an offload using the Pulsar admin CLI.

$ bin/pulsar-admin topics offload --size-threshold 10K public/default/my-topic-for-offload

Offload triggered for persistent://public/default/my-topic-for-offload for messages before 32:0:-1
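The position in that output, 32:0:-1, reads as ledger 32, entry 0, with -1 as the partition index: everything stored before ledger 32 is eligible for offload. The sketch below is a simplified model of how a size threshold could select such a cut-off, not Pulsar’s actual implementation. It walks the ledgers from newest to oldest, adds up the data that would stay in BookKeeper, and cuts at the point where the total exceeds the threshold. The function name is hypothetical, and we assume 10K means 10 × 1024 bytes.

```python
from typing import List, Optional, Tuple


def offload_cutoff(ledgers: List[Tuple[int, int]], threshold_bytes: int) -> Optional[str]:
    """Pick a cut-off position for offloading (simplified, hypothetical model).

    ledgers holds (ledger_id, size_bytes) pairs, oldest first; the last
    entry is the ledger currently being written to.
    """
    if not ledgers:
        return None
    kept_bytes = 0
    newer_id = ledgers[-1][0]
    for ledger_id, size in reversed(ledgers):
        kept_bytes += size
        if kept_bytes > threshold_bytes:
            # Keeping this ledger too would exceed the threshold, so
            # everything before the next-newer ledger gets offloaded.
            return f"{newer_id}:0:-1"
        newer_id = ledger_id
    return None  # everything fits under the threshold; nothing to offload


# Ledger ids and sizes taken from this walkthrough's topic.
print(offload_cutoff([(31, 92744), (32, 0)], threshold_bytes=10 * 1024))  # 32:0:-1
```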

Check the offload status using the CLI.

$ bin/pulsar-admin topics offload-status public/default/my-topic-for-offload

Offload was a success
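Offloading runs in the background, so the first status check may come back before the copy has finished. A small polling loop, sketched below, reruns the status command until the output mentions success. The helper names, the retry count, and the delay are our own assumptions; matching on the word “success” relies only on the output shown above, and any other output is simply treated as “not finished yet”.

```python
import subprocess
import time
from typing import Callable

TOPIC = "public/default/my-topic-for-offload"


def read_offload_status(topic: str = TOPIC) -> str:
    """Run the offload-status command from the Pulsar root directory."""
    result = subprocess.run(
        ["bin/pulsar-admin", "topics", "offload-status", topic],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def wait_for_offload(
    read_status: Callable[[], str] = read_offload_status,
    attempts: int = 30,
    delay_seconds: float = 2.0,
) -> bool:
    """Poll until the status output mentions success; False if we give up."""
    for _ in range(attempts):
        if "success" in read_status().lower():
            return True
        time.sleep(delay_seconds)
    return False
```

Calling `wait_for_offload()` from the Pulsar root directory polls the real CLI; passing a different `read_status` callable makes the loop easy to try out without a running cluster.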

Once the status is “success”, we can find the offloaded ledger in S3 using the AWS console.

Offloaded segment files stored in Amazon S3

If we use the Pulsar admin command to get topic internal stats, we will find that ledger-31 is in the state “offloaded: true”.

$ bin/pulsar-admin topics stats-internal public/default/my-topic-for-offload

{
  "entriesAddedCounter" : 3200,
  "numberOfEntries" : 1200,
  "totalSize" : 111344,
  "currentLedgerEntries" : 200,
  "currentLedgerSize" : 18600,
  "lastLedgerCreatedTimestamp" : "2018-10-11T07:06:14.891+08:00",
  "waitingCursorsCount" : 0,
  "pendingAddEntriesCount" : 0,
  "lastConfirmedEntry" : "32:199",
  "state" : "LedgerOpened",
  "ledgers" : [ {
    "ledgerId" : 31,
    "entries" : 1000,
    "size" : 92744,
    "offloaded" : true
  }, {
    "ledgerId" : 32,
    "entries" : 0,
    "size" : 0,
    "offloaded" : false
  } ],
  "cursors" : false
}
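For topics with many ledgers, scanning this output by eye gets tedious. The sketch below pulls the offloaded ledger ids out of the JSON with Python’s standard `json` module. The `offloaded_ledger_ids` helper is a hypothetical name, and the sample is a trimmed copy of the fields shown above.

```python
import json
from typing import List

# Trimmed copy of the stats-internal output shown above.
SAMPLE_STATS = """
{
  "lastConfirmedEntry": "32:199",
  "state": "LedgerOpened",
  "ledgers": [
    {"ledgerId": 31, "entries": 1000, "size": 92744, "offloaded": true},
    {"ledgerId": 32, "entries": 0, "size": 0, "offloaded": false}
  ]
}
"""


def offloaded_ledger_ids(stats_json: str) -> List[int]:
    """Return the ids of ledgers flagged as offloaded in the stats JSON."""
    stats = json.loads(stats_json)
    return [
        ledger["ledgerId"]
        for ledger in stats.get("ledgers", [])
        if ledger.get("offloaded")
    ]


print(offloaded_ledger_ids(SAMPLE_STATS))  # [31]
```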

Video Demonstration

To see the step-by-step configuration process, watch this video, which walks through how to configure a full Pulsar cluster to use tiered storage in Amazon S3.

For More Information

If you want to get more details about how tiered storage works and how to configure it, please refer to these links: