Globally distributed Apache Pulsar quickstart

October 23, 2017

Ivan Kelly

Evaluating distributed systems can be a time-consuming affair. Although many projects offer a standalone or single-node setup that lets you try things out on a single machine, that only lets you play with the APIs, or perhaps just a subset of them. To give the system a true workout, you need to run it on multiple machines, and if the system supports multi-region setups, you need to run it in multiple geographical regions, preferably hundreds if not thousands of kilometers apart.

Pulsar is a distributed pub-sub messaging system, open sourced by Yahoo and recently adopted by the Apache Software Foundation, that has features like multi-tenancy and geo-replication baked in from the start. We wanted to enable you to quickly and painlessly set up a multi-region Pulsar deployment so that you could get a taste not just of Pulsar’s application-facing APIs but also of more advanced features like geo-replication. To that end, we’ve provided an Ansible playbook that will get you up and running in no time and with minimal legwork. You just need to provide some Google Cloud Platform (GCP) credentials and the playbook will do the rest.

In particular, the playbook will:

  • Provision all the machines needed, in three regions. In each of the three regions, the following will be deployed:
      • 3 ZooKeeper nodes
      • 3 Pulsar brokers
      • 3 BookKeeper bookies
  • Configure the global ZooKeeper cluster with one ZooKeeper quorum in each region
  • Configure a 3-node local ZooKeeper cluster in each region
  • Boot the local BookKeeper cluster in each region
  • Start a Pulsar cluster in each region

You can stand up a distributed Pulsar installation using the Ansible playbook by following either the video walkthrough or the text walkthrough below.

Video walkthrough

If you’re more of a visual learner, check out the YouTube video walkthrough I made.

Text walkthrough

If you prefer traditional text tutorials, follow along here. Note that this playbook is for demonstration purposes only. In a production Pulsar instance you would need to pay more attention to how your ZooKeeper clusters are distributed and provision separate journal and ledger disks for each BookKeeper bookie. However, the playbook does give you a good starting point for building out your own deployment.

The Pulsar Ansible playbook runs on GCP preemptible nodes. Preemptible nodes run on the spare capacity of GCP and are thus much less expensive. At current GCP pricing, running the playbook costs roughly $0.18 per hour, and that’s for 18 machines spread over 3 geographical regions. GCP does sometimes reclaim preemptible instances when it needs the capacity, which means that some of the machines in the cluster may randomly shut down. On the bright side, this will give you a chance to check out Pulsar’s fault tolerance capabilities!

Preemptible machines also shut down automatically after 24 hours, so even if you forget to shut your Pulsar installation down, you’ll pay at most around $5 (24 hours × $0.18/hour ≈ $4.32). You almost can’t afford not to try it!

Initial setup

The first thing you’ll need is a jump host on which you will install Ansible and clone, configure, and run the playbook. I’ve used a g1-small running Debian 9. Other distros may work, but YMMV.

For a guide to creating Google Compute Engine instances, see Creating and Starting an Instance in the GCP docs.
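If you’d rather create the jump host from the command line and already have the gcloud CLI set up locally, a sketch along these lines should work (the instance name and zone are arbitrary placeholders of mine, not anything the playbook depends on):

$ gcloud compute instances create pulsar-jump \
    --machine-type=g1-small \
    --image-family=debian-9 \
    --image-project=debian-cloud \
    --zone=us-east1-c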

Connect to your jump host and install Ansible and some other utilities:

$ sudo apt-get update
$ sudo apt-get install git ansible python-libcloud python-pip tmux emacs-nox

Libcloud is required for Ansible to be able to access the GCP API.
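Before going further, you can sanity-check the installation by confirming that Ansible is on the path and that libcloud is importable (the exact versions you see will depend on what Debian ships):

$ ansible --version
$ python -c "import libcloud; print(libcloud.__version__)"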

Clone the Pulsar Ansible repo:

$ git clone http://github.com/streamlio/pulsar-ansible

There is a sample credentials file at vars/credentials.yaml.sample in the repo. You need to copy this to vars/credentials.yaml and fill in some of your own GCE credentials.

$ cd pulsar-ansible
$ cp vars/credentials.yaml.sample vars/credentials.yaml
$ emacs vars/credentials.yaml # or use Vim or nano or something else

Getting credentials from Google Compute Engine

The first thing you need is a project. From the Google Cloud Console, you can create a new project from the dropdown in the top right.

Once the project is created, you can see the ID by clicking on this dropdown. Put this ID into your credentials.yaml file.
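If the gcloud CLI is available on your jump host (it ships with the standard GCE Debian images), you can also list your projects and their IDs from the command line:

$ gcloud projects list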

When the project has been created (it can take a few minutes), create a service account in “IAM & admin”.

The service account should have the following roles:

  • Project > Service Account Actor
  • Compute Engine > Compute Admin

Make sure to tick “Furnish a new private key”, with “JSON” selected. Copy the value of the “Service Account ID”, and put it as the value for service_account_email in your credentials.yaml.

When you click Create, a key file will be generated and downloaded. Copy this to the root directory of the Ansible playbook and rename it to pulsar-credentials.json.
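Since the browser saves the key to your local machine rather than to the jump host, you’ll probably need to copy it across. Here’s a sketch with scp, where the key file name and jump host address are placeholders to substitute:

$ scp ~/Downloads/your-project-key.json <jump-host>:pulsar-ansible/pulsar-credentials.json

You should now be ready to run the playbook.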

The Ansible GCP module has detailed instructions on setting up credentials.

Setting up SSH keys

For Ansible to be able to configure the GCP instances, it needs to be able to SSH into them. To allow this, we need to generate a new key pair on the jump host and add it to the SSH key metadata for the GCP project.

On your jump host, generate a new key pair:

$ ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""

Then add the contents of ~/.ssh/id_rsa.pub to the SSH keys for your GCP project. This can be done in the “Compute Engine > Metadata” section. You may also want to add your own public key so that you can log into the created instances directly without having to go through the jump host.
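To get the exact text to paste into the metadata page, print the public key on the jump host:

$ cat ~/.ssh/id_rsa.pub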

Running the playbook

Running the playbook involves running a single command. You also need to disable host key checking, as the SSH host keys from all the new GCE instances will be unrecognised. From the pulsar-ansible directory, run:

$ export ANSIBLE_HOST_KEY_CHECKING=False
$ ansible-playbook pulsar.yaml

As we are using preemptible instances, it is possible that some of them will be reclaimed while the playbook is running. If this happens, just run the playbook again.

When the script finishes, it will print out a list of the nodes in the different regions.

TASK [Host data] ***************************************************************
ok: [localhost] => {
    "msg": [
        "104.192.62.189, name=zk2-us-central, zone=us-central1-c",
        "104.191.209.154, name=pulsar2-us-central, zone=us-central1-c",
        "104.191.78.153, name=zk3-eu-west, zone=europe-west1-b",
        "146.154.68.201, name=pulsar1-us-central, zone=us-central1-c",
        "35.123.2.45, name=pulsar3-us-central, zone=us-central1-c",
        "35.123.98.47, name=zk3-us-central, zone=us-central1-c",
        "35.194.173.231, name=zk1-us-east, zone=us-east1-c",
        "35.194.214.71, name=zk2-eu-west, zone=europe-west1-b",
        "35.191.183.115, name=pulsar1-eu-west, zone=europe-west1-b",
        "35.132.210.166, name=pulsar2-eu-west, zone=europe-west1-b",
        "35.132.252.83, name=zk1-eu-west, zone=europe-west1-b",
        "35.132.59.28, name=pulsar3-eu-west, zone=europe-west1-b",
        "35.132.160.174, name=zk2-us-east, zone=us-east1-c",
        "35.132.2.43, name=pulsar3-us-east, zone=us-east1-c",
        "35.133.224.1, name=pulsar2-us-east, zone=us-east1-c",
        "35.133.32.12, name=pulsar1-us-east, zone=us-east1-c",
        "35.133.43.48, name=zk3-us-east, zone=us-east1-c",
        "35.213.218.39, name=zk1-us-central, zone=us-central1-c",
        ""
    ]
}

Trying it out

Now that you have a running globally distributed Pulsar instance, it’s time to try it out. SSH into one of the pulsar instances in us-east and create a property that spans all three clusters.
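For example, using the address of pulsar1-us-east from the host list printed earlier (your own IPs will differ):

$ ssh 35.133.32.12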

$ cd /opt/pulsar/
$ bin/pulsar-admin properties create test \
  --allowed-clusters europe-west1-b,us-east1-c,us-central1-c \
  --admin-roles test-admin-role

Create a namespace and enable it on all the clusters.

$ bin/pulsar-admin namespaces create test/global/ns1
$ bin/pulsar-admin namespaces set-clusters \
  -c europe-west1-b,us-east1-c,us-central1-c test/global/ns1
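If you want to confirm that the namespace really is set up for all three clusters, you should be able to read the setting back (a quick sanity check, assuming your pulsar-admin build includes the get-clusters subcommand):

$ bin/pulsar-admin namespaces get-clusters test/global/ns1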

Now you can start sending messages on the namespace. To try this out, log into one of the EU pulsar nodes and start a consumer running.

$ bin/pulsar-perf consume persistent://test/global/ns1/my-topic

Then start sending messages to this topic from us-east.

$ bin/pulsar-perf produce persistent://test/global/ns1/my-topic

You may see an error about not being able to create a .hgrm file. Don’t worry about this; it only appears if you run the command from a directory you don’t have write permission for, which will be the case if you are running from /opt/pulsar.
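If you’d rather avoid the error entirely, one option is to run pulsar-perf from a directory you can write to, for example:

$ cd /tmp
$ /opt/pulsar/bin/pulsar-perf produce persistent://test/global/ns1/my-topic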

On the consumer side you should start to see messages coming through.

2017-10-11 10:01:51,085 - INFO  - [main:PerformanceConsumer@289] - Throughput received: 0.000  msg/s -- 0.000 Mbit/s
2017-10-11 10:02:01,086 - INFO  - [main:PerformanceConsumer@289] - Throughput received: 24.699  msg/s -- 0.193 Mbit/s
2017-10-11 10:02:11,086 - INFO  - [main:PerformanceConsumer@289] - Throughput received: 100.094  msg/s -- 0.782 Mbit/s
2017-10-11 10:02:21,087 - INFO  - [main:PerformanceConsumer@289] - Throughput received: 99.996  msg/s -- 0.781 Mbit/s
2017-10-11 10:02:31,087 - INFO  - [main:PerformanceConsumer@289] - Throughput received: 99.996  msg/s -- 0.781 Mbit/s
2017-10-11 10:02:41,087 - INFO  - [main:PerformanceConsumer@289] - Throughput received: 100.096  msg/s -- 0.782 Mbit/s

And that’s it! You now have a globally distributed pub-sub system running in three geographical regions. You can throw some traffic at it, fail some machines, or write a test app. When you’re finished, just delete the instances in GCP and everything will be cleaned up.
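If you want to exercise the fault tolerance mentioned earlier, one simple approach (my suggestion, not something the playbook automates) is to stop one of the broker instances with gcloud while the producer and consumer are running and watch traffic continue. The instance names and zones come from the host list the playbook printed, for example:

$ gcloud compute instances stop pulsar2-us-east --zone us-east1-c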