Production S high load chaos test fails #135

Zelldon · 2020-11-16T09:06:14Z

The high load chaos test https://github.com/zeebe-io/zeebe-chaos/blob/master/chaos-experiments/camunda-cloud/production-m/high-load/experiment.json currently fails for Production S because the resource limits are too low for Production S cluster plans.

Experiment output:

[2020-11-16 09:57:20 INFO] Validating the experiment's syntax
[2020-11-16 09:57:20 INFO] Experiment looks valid
[2020-11-16 09:57:20 INFO] Running experiment: No leader change due to high load
[2020-11-16 09:57:20 INFO] Steady-state strategy: default
[2020-11-16 09:57:20 INFO] Rollbacks strategy: default
[2020-11-16 09:57:20 INFO] Steady state hypothesis: Zeebe is alive and has a consistent topology
[2020-11-16 09:57:20 INFO] Probe: All pods should be ready
[2020-11-16 09:57:21 INFO] Probe: Should be able to create workflow instances on partition 1
[2020-11-16 09:57:23 INFO] Probe: Cluster status should remain constant
[2020-11-16 09:57:24 INFO] Steady state hypothesis is met!
[2020-11-16 09:57:24 INFO] Playing your experiment's method now...
[2020-11-16 09:57:24 INFO] Action: Start many instances (5000)
[2020-11-16 09:57:47 INFO] Pausing after activity for 5s...
[2020-11-16 09:57:52 INFO] Steady state hypothesis: Zeebe is alive and has a consistent topology
[2020-11-16 09:57:52 INFO] Probe: All pods should be ready
[2020-11-16 09:58:11 INFO] Probe: Should be able to create workflow instances on partition 1
[2020-11-16 09:58:14 INFO] Probe: Cluster status should remain constant
[2020-11-16 09:58:15 CRITICAL] Steady state probe 'Cluster status should remain constant' is not in the given tolerance so failing this experiment
[2020-11-16 09:58:15 INFO] Experiment ended with status: deviated
[2020-11-16 09:58:15 INFO] The steady-state has deviated, a weakness may have been discovered

Output of kgpo -w (a common alias for kubectl get pods -w) during the execution:

zeebe-0                               0/1     Running                 0          10s
zeebe-0                               1/1     Running                 0          35s
zeebe-0                               0/1     OOMKilled               0          67s
zeebe-0                               0/1     Running                 1          68s
zeebe-0                               1/1     Running                 1          90s
zeebe-1                               1/1     Terminating             0          5m54s
zeebe-1                               0/1     Terminating             0          6m1s
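
To make the OOM kill in the pod watch output easier to spot, here is a minimal sketch that scans such output for pods whose status is OOMKilled. It assumes the default column order of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE); the function name find_oom_killed is hypothetical and not part of the chaos tooling.

```python
# Sample lines copied from the watch output in this issue.
WATCH_OUTPUT = """\
zeebe-0   0/1   Running       0   10s
zeebe-0   1/1   Running       0   35s
zeebe-0   0/1   OOMKilled     0   67s
zeebe-0   0/1   Running       1   68s
zeebe-0   1/1   Running       1   90s
zeebe-1   1/1   Terminating   0   5m54s
zeebe-1   0/1   Terminating   0   6m1s
"""


def find_oom_killed(watch_output):
    """Return (pod name, restart count) for each line with STATUS OOMKilled.

    Assumes whitespace-separated columns in the default order:
    NAME READY STATUS RESTARTS AGE.
    """
    hits = []
    for line in watch_output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        name, status, restarts = fields[0], fields[2], fields[3]
        if status == "OOMKilled":
            hits.append((name, int(restarts)))
    return hits


if __name__ == "__main__":
    print(find_oom_killed(WATCH_OUTPUT))  # prints [('zeebe-0', 0)]
```

The restart count on the OOMKilled line is still 0 because the restart is only reflected on the following line of the watch output.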


Zelldon · 2020-11-16T09:19:08Z

Created an issue for increasing the limits: https://github.com/camunda-cloud/zeebe-controller-k8s/issues/363

Zelldon · 2020-11-16T09:31:50Z

For now, I will not add this experiment for the Production S plan.

Zelldon · 2020-11-18T11:17:48Z

cc @npepinpe

Zelldon · 2020-11-23T13:59:47Z

We (@npepinpe and I) decided to also turn off the experiment for the other cluster plans, because it might cause flakiness: the embedded gateway and the resource limits are a problem on those plans as well. The current assumption is that once we move the embedded gateway out to a standalone gateway, this experiment should succeed, provided we give it enough resources. Related: camunda/camunda#5874

Blocked by https://github.com/camunda-cloud/zeebe-controller-k8s/issues/244
and https://github.com/camunda-cloud/zeebe-controller-k8s/issues/363

Commit zeebe-io/zeebe-chaos@8272ec3

npepinpe · 2021-10-12T12:20:17Z

Does this still apply with the newer cluster plans? Assuming we do migrate to the newer one (which we should)?

Zelldon · 2021-10-12T12:35:53Z

Yes, since the GA plan is the same as S and we still have the embedded gateway.

Zelldon · 2022-03-04T10:48:59Z

Once we have done #573, we can enable new experiments such as the high load and gateway termination experiments. 🎉

pihme added the hiatus label Nov 26, 2020

npepinpe added the Status: Backlog label Jan 11, 2021

pihme added the team/process-automation label Jun 16, 2022
