Introduction and Goals

The Zeebe Cluster Testbench is a tool to run end-to-end tests against Zeebe clusters.

Requirements Overview

  • It must be possible to run tests periodically
  • It must be possible to run tests ad hoc
  • It must be possible to run tests against a Zeebe cluster deployed in Camunda Cloud
  • Clusters under test shall be created on demand and destroyed when they are no longer needed
  • Engineers shall be notified when a test failure occurs
  • The cluster in which a failure occurred must be kept alive until an engineer has been able to analyse the failure

Quality Goals

  • The solution shall use hardware resources economically.
  • The solution shall be able to scale with increased demand.
  • The solution shall require low maintenance in production.
  • The solution shall be fault tolerant to unavailability of external systems.
  • The solution shall automatically recover in case of failures.
  • The solution shall be flexible to adapt to future needs.

Architecture Constraints

| Constraint | Rationale |
| --- | --- |
| Use a Zeebe cluster to orchestrate the tests | This is part of the "drink your own champagne" / "eat your own dog food" initiative. Apart from that, using Zeebe for orchestration is a perfect fit for the given quality goals. |
| Java (or other languages in the Java ecosystem) for implementation | Camunda has many experienced Java software developers |
| Maven / Jenkins build pipeline | Integrates well with existing Camunda infrastructure |
| Kubernetes as runtime environment | Follows current best practices for scalable applications |

System Scope and Context

System Scope

Solution Strategy

The Zeebe Cluster Testbench is essentially a collection of workers that request jobs from the testbench test orchestration cluster.

The test orchestration cluster contains different processes that define the steps of a test. See README.md for documentation of the different processes.

Once a test is started, this triggers, for example, the worker that creates a cluster. Afterwards, another worker is triggered to run the test. Finally, further workers are triggered to notify engineers (if needed) and to destroy the cluster again.

The workers are stateless. The only state is kept by Zeebe in the testbench test orchestration cluster. Workers exchange information by reading from and writing to variables in the process, as sketched below.
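
A minimal sketch of such a worker, using the Zeebe Java client (package names as in recent client versions); the gateway address, job type, and variable names are assumptions invented for this example, not the testbench's actual configuration:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.Map;

// Sketch of a testbench-style worker; the job type and variable names are hypothetical.
public final class CreateClusterWorkerSketch {

    public static void main(final String[] args) throws InterruptedException {
        try (final ZeebeClient client =
            ZeebeClient.newClientBuilder()
                .gatewayAddress("localhost:26500") // would be read from the environment
                .usePlaintext()
                .build()) {

            // Register a stateless worker for a hypothetical "create-cluster" job type.
            client.newWorker()
                .jobType("create-cluster")
                .handler((jobClient, job) -> {
                    // Read the input from the process variables ...
                    final Object clusterPlan = job.getVariablesAsMap().get("clusterPlan");

                    // ... do the actual work (omitted here) ...

                    // ... and hand the result back to the process as a variable.
                    jobClient.newCompleteCommand(job.getKey())
                        .variables(Map.of("clusterId", "id-of-created-cluster"))
                        .send()
                        .join();
                })
                .open();

            Thread.sleep(Long.MAX_VALUE); // keep the worker polling for jobs
        }
    }
}
```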

Building Block View

System Whitebox

Modules

  • core - contains the launcher and the workers that are orchestrated by the test orchestration cluster
  • core/chaos-workers - contains the worker to run chaos experiments (implemented in Bash)
  • cloud-client - contains a server facade to interact with the Cloud API
  • internal-cloud-client - contains a server facade to interact with the internal cloud backend. These services are not part of the official cloud API. They are accessed via a user account, not a service account.
  • testdriver-api - contains interfaces and shared classes to be used by several test drivers
  • testdriver-sequential - contains the workers for the sequential test

*) All modules are implemented in Java, unless otherwise indicated

Service Task Reference

Runtime View

Most of the runtime behavior is determined by the processes deployed to the testbench cluster. See README.md for documentation of the different processes.

Startup

  1. Read all environment variables
  2. Perform a self test - the objective of the self test is to check connectivity to external systems; if these checks fail, the application does not launch.
  3. Deploy the test orchestration processes
  4. Register the workers
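
A minimal sketch of this startup sequence, assuming a recent Zeebe Java client (newDeployResourceCommand replaces newDeployCommand in older client versions); the environment variable, BPMN resource, and job type names are invented for the example:

```java
import io.camunda.zeebe.client.ZeebeClient;

// Hypothetical bootstrap following the four startup steps above.
public final class BootstrapSketch {

    public static void main(final String[] args) {
        // 1. read all environment variables up front
        final String gatewayAddress = System.getenv("TESTBENCH_GATEWAY_ADDRESS");

        final ZeebeClient client =
            ZeebeClient.newClientBuilder().gatewayAddress(gatewayAddress).build();

        // 2. self test - fail fast if the orchestration cluster is unreachable
        //    (connectivity checks for other external systems would go here as well)
        client.newTopologyRequest().send().join();

        // 3. deploy the test orchestration processes
        client.newDeployResourceCommand()
            .addResourceFromClasspath("processes/run-all-tests.bpmn")
            .send()
            .join();

        // 4. register the workers
        client.newWorker()
            .jobType("create-cluster")
            .handler((jobClient, job) -> { /* actual work omitted */ })
            .open();
    }
}
```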

Fault Tolerance/Recovery

Fault tolerance and recovery are handled by the testbench cluster:

  • Whenever a task fails, it will be retried.
  • If a worker dies, the job will time out and be reactivated.
  • If the application crashes, all workers die. As soon as the application is back up, the workers can poll for jobs again.

The testbench cluster is deployed to a high availability cluster.
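
How these mechanics surface on the client side can be sketched as follows; the job type, timeout value, and retry handling are illustrative, not the testbench's actual values:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

final class FaultToleranceSketch {

    // Sketch of the retry/timeout mechanics described above.
    static void registerTestWorker(final ZeebeClient client) {
        client.newWorker()
            .jobType("run-sequential-test")
            .handler((jobClient, job) -> {
                try {
                    // ... run the test (omitted) ...
                    jobClient.newCompleteCommand(job.getKey()).send().join();
                } catch (final Exception e) {
                    // A failed task is retried until its retry counter reaches zero.
                    jobClient.newFailCommand(job.getKey())
                        .retries(job.getRetries() - 1)
                        .errorMessage(e.getMessage())
                        .send()
                        .join();
                }
            })
            // If the worker dies, the job times out after this duration
            // and can be reactivated by another (or a restarted) worker.
            .timeout(Duration.ofMinutes(5))
            .open();
    }
}
```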

Design Decisions

  • Keep it simple - most workers are currently part of one deployment. This makes it possible to start and manage all workers with a single deployment, while still leaving room for a future scale-out when several instances of each worker are needed.
  • No framework (yet) - mostly to keep the dependencies to a minimum and not to commit to any architectural pattern too soon. (Given that the current implementation is mostly workers, a reactive, non-blocking IO framework would be ideal. However, it is questionable whether the load will ever get high enough for the benefits of such frameworks to materialize. CDI has been missed while implementing the current solution, so developer convenience might be a stronger driver for architectural commitment than technical criteria.)
  • All environment variables used by the solution are read in the bootstrap class, mostly because the code becomes opaque when environment variables are sprinkled throughout it.
  • Environment variables are currently the only way to configure the application.
  • Each Java worker defines its input and output parameters in dedicated classes. This is a little more verbose than strictly necessary, but it also documents the worker's interface (see the sketch after this list).
  • The communication with the Cloud API and the Internal Cloud API uses the RESTEasy Client API. This was the option with the fewest dependencies (compared to Spring RestTemplate, MicroProfile RestClient, and others).
  • The chaos worker is implemented based on zbctl, which can forward jobs to Bash scripts.
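
As an illustration of the input/output convention, consider the following hypothetical parameter classes; the class, variable, and accessor names are invented, while getVariablesAsType is the Zeebe Java client's typed variable deserialization:

```java
import io.camunda.zeebe.client.api.response.ActivatedJob;

// Hypothetical input class: documents which process variables the worker reads.
final class CreateClusterInput {
    private String clusterPlan;

    public String getClusterPlan() { return clusterPlan; }
    public void setClusterPlan(final String clusterPlan) { this.clusterPlan = clusterPlan; }
}

// Hypothetical output class: documents which process variables the worker writes.
final class CreateClusterOutput {
    private final String clusterId;

    CreateClusterOutput(final String clusterId) { this.clusterId = clusterId; }
    public String getClusterId() { return clusterId; }
}

final class ParameterSketch {
    // The worker deserializes the process variables into its typed input class ...
    static CreateClusterInput readInput(final ActivatedJob job) {
        return job.getVariablesAsType(CreateClusterInput.class);
    }
    // ... and an instance of the output class would be passed to
    // newCompleteCommand(...).variables(output) to write the results back.
}
```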

Risks and Technical Debts

  • Currently, there is no orderly shutdown. So far this has not caused any problems. However, it does slow the tests down when the application crashes or is restarted due to redeploys: the delay is the time Zeebe waits for the jobs to time out before rescheduling them.
  • All Java workers share the same thread pool. If this thread pool dies or is blocked, then nothing will move forward.
  • The chaos worker currently has only one instance. This can become a bottleneck when volume increases. Scaling it is not as trivial as just increasing the instance count: first, we need to find a way to correlate chaos experiment failures with the logs they produce.
  • The application has no self monitoring. It relies heavily on Zeebe to restart jobs when things go wrong.
  • The process to create a cluster has a potential infinite loop: if the cluster is created but never becomes ready, the process will not terminate.
  • The worker MapNamesToUUIDsWorker writes to the same variables that it uses as input. Once they are overwritten, it is no longer possible to see what the input was. This has already hampered root cause analysis for a bug.
  • The Cloud API might change and become incompatible. Diagnosing such problems is quite tricky with the RESTEasy Client API: it does a great job of giving developers a nice interface, but if a REST endpoint returns, for example, an HTML page with a helpful human-readable error message, it is difficult to get hold of that message.
  • The Internal Cloud API might change and become incompatible. We are using a Camunda-internal API here with no guarantees on backwards compatibility.
  • The chaos worker is currently written in Bash, which might be harder to maintain than other languages. The zbctl worker itself comes with some limitations, which we could overcome with other clients. Related issue #110

Appendix

Glossary

| Term | Definition |
| --- | --- |
| Cloud API | API provided by Camunda Cloud to create, query, and destroy clusters, and to create, query, and delete client accounts for these clusters |
| Internal Cloud API / Cloud Backend | Internal API of Camunda Cloud. It is used to perform administrative actions, like creating new generations for clusters to test. |
| testbench cluster | The Zeebe cluster in which the tests are orchestrated |
| test driver | Set of classes to run a test and determine its outcome |
| worker | Class that handles Zeebe jobs. Workers are registered for service tasks of a given job type. |

Arc42 template

This document takes inspiration from arc42, the Template for documentation of software and system architecture by Dr. Gernot Starke, Dr. Peter Hruschka and contributors.

Template Revision: 7.0 EN (based on asciidoc), January 2017

© We acknowledge that this document uses material from the arc 42 architecture template, http://www.arc42.de. Created by Dr. Peter Hruschka & Dr. Gernot Starke.