[Draft]: Using realistic benchmarks to test performance to derive new cluster configurations #597

Draft · wants to merge 3 commits into main
README.md (10 changes: 6 additions & 4 deletions)
@@ -19,7 +19,7 @@ This definition and everything related to that can be found under [principlesofc

### Resources

Resources which you might find interesting regarding to that topic.
Resources which you might find interesting regarding that topic.

* https://principlesofchaos.org/?lang=ENcontent
* https://www.oreilly.com/library/view/chaos-engineering/9781491988459/
@@ -31,7 +31,9 @@ Resources which you might find interesting regarding to that topic.
In order to test or execute chaos experiments we use a tool called [chaos-toolkit](https://chaostoolkit.org/), which
makes it easy to create chaos experiments and also automate them later on.

[Complete list of experiments](inventory.md)
All our current experiments are located under `chaos-days/blog/`, for more
details please have a look at the [README](chaos-days/blog/README.md).

Alternatively all our chaos-days experiments can be found [here](https://zeebe-io.github.io/zeebe-chaos/) in blog
format.

All our current experiments are located under `chaos-experiments/`, for more details please have a look
at the [README](chaos-experiments/README.md).
chaos-days/blog/2024-10-14-realistic-benchmarks/index.md (226 changes: 226 additions & 0 deletions)
@@ -0,0 +1,226 @@
---
layout: posts
title: "Using realistic benchmarks to test performance to derive new cluster configurations"
date: 2024-10-14
categories:
- chaos_experiment
- performance
tags:
- benchmark
authors: rodrigo
---

# Chaos Day Summary

Our first goal is to further improve the benchmarks and make them more
realistic, by using a process model that covers the average process
orchestration use case and a payload that reflects reality.

The second goal is to use these benchmarks to derive a new minimal viable
cluster configuration that can handle at least 40 tasks per second, while
maintaining low backpressure and low latency.

The third goal is to scale out the minimal viable cluster configuration's
resources linearly and see whether the performance scales accordingly.

**TL;DR;**

We used a realistic process model to benchmark our system and derived a new
cluster configuration based on previous requirements.

When we scale this base configuration linearly, we see that the performance
increases significantly more than linearly, while backpressure and latency
stay low.

## Chaos Experiment

### Expected

We expect to find a cluster configuration that can handle at least 40
tasks per second with significantly fewer resources than our smaller
clusters ([G3-S HA Plan](https://accounts.cloud.dev.ultrawombat.com/consoleadmin/clusterplans/0af08654-28ec-413a-8dcf-c5938e828ddd)), since
these can process 50-70 tasks per second.

We also expect that we can scale this base configuration linearly, with the
task processing rate initially growing a bit faster than linearly due to the
lower relative overhead, and flattening if we keep expanding further, as the
partition count becomes a bottleneck.

### Actual

#### Benchmarking a realistic process model

In our previous benchmarks we used a simple process model with a single
task and several decision symbols. For the newer benchmarks we wanted to
significantly increase the number of symbols and service tasks used, as
well as make the load more configurable.

![bank-customer-complaint-dispute-handling](bank-customer-complaint-dispute-handling.png)

For this realistic benchmark, we tried to use a situation closer to a real
world scenario, where a bank customer complaint dispute handling process is
handled by several service tasks. Additionally, the fraud claim
investigation part of the process is unfolded into multiple instances. The
number of disputes we send in the instance payload scales the number of
multi-instances in the fraud claim investigation up or down, which lets us
scale the load generated by each process instance.

Finally, when deploying our benchmark we can define the number of instances
started per second. After choosing a uniform payload for all tests, we can
start benchmarking our different cluster configurations, measuring
performance by how many process instances are completed per second while
maintaining low backpressure (below 10%) to preserve the user experience.
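
To make these two load knobs concrete, below is a minimal, hypothetical
sketch of such a load generator using the Zeebe Java client. This is not the
actual benchmark project code: the process id
`bank-customer-complaint-dispute-handling`, the gateway address, and the
`disputes` variable name are assumptions for illustration only.

```java
import io.camunda.zeebe.client.ZeebeClient;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RealisticBenchmarkStarter {

  public static void main(String[] args) throws InterruptedException {
    // Load knobs mirroring the benchmark described above (values are examples).
    final int disputesPerInstance = 5; // scales the multi-instance fraud claim investigation
    final int instancesPerSecond = 2;  // process instances started per second

    // Build a payload with the requested number of customer disputes.
    final List<Map<String, Object>> disputes = new ArrayList<>();
    for (int i = 0; i < disputesPerInstance; i++) {
      disputes.add(Map.of("disputeId", "dispute-" + i));
    }
    final Map<String, Object> payload = Map.of("disputes", disputes);

    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // assumes a port-forwarded benchmark gateway
            .usePlaintext()
            .build()) {

      // Start instances at a fixed rate; completions, backpressure and latency
      // are observed on the cluster side via the benchmark dashboards.
      while (true) {
        for (int i = 0; i < instancesPerSecond; i++) {
          client
              .newCreateInstanceCommand()
              .bpmnProcessId("bank-customer-complaint-dispute-handling") // hypothetical id
              .latestVersion()
              .variables(payload)
              .send();
        }
        Thread.sleep(1_000);
      }
    }
  }
}
```

In the benchmarks in this post the payload is kept uniform (5 disputes per
instance), and the start rate is the knob we turn between runs.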

#### Minimal Viable Requirements for our Cluster

For our minimal viable cluster configuration we determined, through several
metrics on users and our own experience, that we need to be able to handle
at least 40 tasks per second, or 3.5 million tasks per day.
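
As a quick back-of-the-envelope check on that daily figure:

$$
40~\text{tasks/s} \times 86{,}400~\text{s/day} = 3{,}456{,}000~\text{tasks/day} \approx 3.5~\text{million tasks/day}
$$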

Other metrics that we want to keep track of are the backpressure, to
preserve the user experience; the exporting speed, which must keep up with
the processing speed; and the write-to-import latency, which tells us how
long it takes for a record to go from being written to being imported by
our other apps, such as Operate.

#### Reverse Engineering the Cluster Configuration

For our new configuration the only resources we are going to change are
the ones relevant to the factors described above: the resources allocated
to the Zeebe brokers, the gateway, and Elasticsearch.

Our starting point in resources was the configuration for our [G3-S HA Plan](https://accounts.cloud.dev.ultrawombat.com/consoleadmin/clusterplans/0af08654-28ec-413a-8dcf-c5938e828ddd)
as this already had the capability to significantly outperform the current
goal of 40 tasks per second (close to 50-70 in reality).

The next step was to deploy our realistic benchmark with a payload of 5
customer disputes per instance and start 2 instances per second, which
generated approximately 40 tasks per second.
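
As a rough cross-check, derived from these numbers rather than measured
separately: with 5 disputes per payload each process instance unfolds into
roughly 20 task executions, so

$$
2~\text{instances/s} \times \sim\!20~\text{tasks/instance} \approx 40~\text{tasks/s}
$$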

After this we reduced the resources iteratively until we saw any increase
in backpressure, while making sure that there was no backlog of records and
no significant increase in the write-to-import latency.

The results for our new cluster are specified below in the tables, where
our starting cluster configuration is the G3-S HA plan and the new cluster
configuration is the G3 - BasePackage HA.

<div style="display: flex; justify-content: space-between;">

<div style="width: 48%;">

| G3-S HA | CPU Limit | Memory Limit in GB |
|------------------------|-----------|--------------------|
| operate | 2 | 2 |
| operate.elasticsearch | 6 | 6 |
| optimize | 2 | 2 |
| tasklist | 2 | 2 |
| zeebe.broker | 2.88 | 12 |
| zeebe.gateway | 0.9 | 0.8 |
| zeebeAnalytics | 0.4 | 0.45 |
| connectorBridge | 0.4 | 0.512 |
| **TOTAL** | **16.58** | **25.762** |
[Cluster Plan](https://accounts.cloud.dev.ultrawombat.com/consoleadmin/clusterplans/0af08654-28ec-413a-8dcf-c5938e828ddd)
</div>

<div style="width: 48%;">

| G3 - BasePackage HA | CPU Limit | Memory Limit in GB |
|-----------------------|-----------|--------------------|
| operate | 1 | 1 |
| operate.elasticsearch | 3 | 4.5 |
| optimize | 1 | 1.6 |
| tasklist | 1 | 1 |
| zeebe.broker | 1.5 | 4.5 |
| zeebe.gateway | 0.6 | 1 |
| zeebeAnalytics | 0.2 | 0.3 |
| connectorBridge | 0.4 | 1 |
| **TOTAL** | **8.7** | **14.9** |
[Cluster Plan](https://accounts.cloud.dev.ultrawombat.com/consoleadmin/clusterplans/9b8b444a-636e-48cf-be83-2387a0d11aba)
</div>

</div>

##### Reduction in Resources for our Minimal Viable Cluster

| | CPU Reduction (%) | Memory Reduction (%) |
|:----------------------|--------------------:|-----------------------:|
| zeebe.broker | 47.92 | 62.5 |
| zeebe.gateway | 33.33 | -25.0 |
| operate.elasticsearch | 50.00 | 25.0 |


Total cluster reduction:

| | G3-S HA | G3 - BasePackage HA | Reduction (%) |
|:----------------------|--------:|--------------------:|--------------:|
| CPU Limits | 16.58 | 8.7 | 48 |
| Memory Limits | 25.762 | 14.9 | 42 |
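
Both reduction tables use the same relative reduction against the G3-S HA
plan; for example, for the total CPU limits:

$$
\frac{16.58 - 8.7}{16.58} \times 100\,\% \approx 48\,\%
$$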


The process of reducing the hardware requirements was done initially by
scaling down the resources of the Zeebe brokers, the gateway and
Elasticsearch. The other components were left untouched, as they had no
impact on our key metrics, and were scaled down later in separate
experiments while maintaining the user experience.

#### Scaling out the Cluster

Now, for the scaling procedure, we want to see whether linearly increasing
the allocated resources yields a corresponding performance increase, while
keeping backpressure and latency low and preserving the user experience.

For this we started with the G3 - BasePackage HA configuration and
increased the load until we saw an increase in backpressure, captured our
key metrics, and repeated the process with the cluster configuration
resources multiplied by 2x, 3x, and 4x respectively.

This means that the resources allocated for our clusters were:

| | Base 1x | Base 2x | Base 3x | Base 4x |
|:--------------|----------:|----------:|----------:|----------:|
| CPU Limits | 8.7 | 17.4 | 26.1 | 34.8 |
| Memory Limits | 14.9 | 29.8 | 44.7 | 59.6 |

The results in the table below show the performance of our several cluster
configurations:

| | Base 1x | Base 2x | Base 3x | Base 4x |
|:-------------------------|--------:|--------:|--------:|--------:|
| Process Instances/s | 2 | 6 | 9 | 20 |
| Tasks/s | 40 | 108 | 160 | 360 |
| Average Backpressure | 0% | 5% | 9% | 2% |
| Write-to-Import Latency | 2.8s | 9s | 3s | 2.5min |
| Write-to-Process Latency | 40ms | 200ms | 240ms | 15ms |
| Records Processed | 750 | 2100 | 3700 | 6750 |
| Records Exported | 700 | 1900 | 3200 | 5700 |

The first observation is that the performance scales particularly well
just by adding more resources to the cluster: for a linear increase in
resources, the performance as measured by tasks completed increases more
than linearly. This might be because, when the resources are doubled, the
fixed application overhead becomes proportionally smaller relative to the
resources added.
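
Normalizing the task rate against the 1x baseline makes this super-linear
trend explicit:

$$
\frac{108}{40} \approx 2.7\times \text{ at } 2\times~\text{resources}, \qquad
\frac{160}{40} = 4\times \text{ at } 3\times, \qquad
\frac{360}{40} = 9\times \text{ at } 4\times
$$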

This is a very good result, as it means that we can scale our system
linearly (at least initially) to handle the expected increase in load.

Importantly, the backpressure is kept low, and the write-to-import latency
only increases significantly for the last 4x configuration when it is kept
at this maximum rate. This might imply that at the 4x configuration the
amount of records generated starts to be too much for either Elasticsearch
or our web apps that import these records to handle. Some further
investigation could be done here to find the bottleneck.

Another relevant metric, not shown in this table, is the backlog of
records not yet exported, which stayed at almost zero throughout all the
experiments conducted.

### Bugs found

During the initial tests, we had several OOM errors in the gateway pods.
After some investigation, we found that this was exclusive to Camunda
8.6.0, which consumes more memory in the gateway than previous versions.
This explains why the gateway memory limit was the only resource that was
increased in the new, reduced cluster configuration.
