Production S high load chaos test fails #135

Open
Zelldon opened this issue Nov 16, 2020 · 7 comments

@Zelldon
Member

Zelldon commented Nov 16, 2020

The high load chaos test https://github.com/zeebe-io/zeebe-chaos/blob/master/chaos-experiments/camunda-cloud/production-m/high-load/experiment.json currently fails for Production S, because the resource limits are too low for Production S cluster plans.

Experiment output:

[2020-11-16 09:57:20 INFO] Validating the experiment's syntax
[2020-11-16 09:57:20 INFO] Experiment looks valid
[2020-11-16 09:57:20 INFO] Running experiment: No leader change due to high load
[2020-11-16 09:57:20 INFO] Steady-state strategy: default
[2020-11-16 09:57:20 INFO] Rollbacks strategy: default
[2020-11-16 09:57:20 INFO] Steady state hypothesis: Zeebe is alive and has a consistent topology
[2020-11-16 09:57:20 INFO] Probe: All pods should be ready
[2020-11-16 09:57:21 INFO] Probe: Should be able to create workflow instances on partition 1
[2020-11-16 09:57:23 INFO] Probe: Cluster status should remain constant
[2020-11-16 09:57:24 INFO] Steady state hypothesis is met!
[2020-11-16 09:57:24 INFO] Playing your experiment's method now...
[2020-11-16 09:57:24 INFO] Action: Start many instances (5000)
[2020-11-16 09:57:47 INFO] Pausing after activity for 5s...
[2020-11-16 09:57:52 INFO] Steady state hypothesis: Zeebe is alive and has a consistent topology
[2020-11-16 09:57:52 INFO] Probe: All pods should be ready
[2020-11-16 09:58:11 INFO] Probe: Should be able to create workflow instances on partition 1
[2020-11-16 09:58:14 INFO] Probe: Cluster status should remain constant
[2020-11-16 09:58:15 CRITICAL] Steady state probe 'Cluster status should remain constant' is not in the given tolerance so failing this experiment
[2020-11-16 09:58:15 INFO] Experiment ended with status: deviated
[2020-11-16 09:58:15 INFO] The steady-state has deviated, a weakness may have been discovered
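For readers less familiar with the Chaos Toolkit format, the log above corresponds roughly to an experiment shaped like the sketch below. It is reconstructed from the log only: the title, probe names, action name, and the 5s pause come from the output, while the script paths (verify-readiness.sh and the others) and the description are hypothetical placeholders and not the contents of the linked experiment.json.

```python
import json


def process_probe(name, script):
    """Build a probe that runs a script; tolerance 0 means exit code 0 is expected."""
    return {
        "type": "probe",
        "name": name,
        "tolerance": 0,
        # Hypothetical script path, not taken from the real experiment.
        "provider": {"type": "process", "path": script},
    }


experiment = {
    "version": "0.1.0",
    "title": "No leader change due to high load",
    "description": "Placeholder description (assumption).",
    "steady-state-hypothesis": {
        "title": "Zeebe is alive and has a consistent topology",
        "probes": [
            process_probe("All pods should be ready", "verify-readiness.sh"),
            process_probe(
                "Should be able to create workflow instances on partition 1",
                "verify-steady-state.sh",
            ),
            process_probe("Cluster status should remain constant", "verify-topology.sh"),
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "Start many instances (5000)",
            # Hypothetical script path, not taken from the real experiment.
            "provider": {"type": "process", "path": "start-instances.sh"},
            # Matches "Pausing after activity for 5s..." in the log.
            "pauses": {"after": 5},
        }
    ],
    "rollbacks": [],
}

if __name__ == "__main__":
    print(json.dumps(experiment, indent=2))
```

The hypothesis is checked once before and once after the method, which is why the three probes appear twice in the log and the run only deviates on the second pass.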

Output of kgpo -w (shorthand for kubectl get pods -w) during the execution:

zeebe-0                               0/1     Running                 0          10s
zeebe-0                               1/1     Running                 0          35s
zeebe-0                               0/1     OOMKilled               0          67s
zeebe-0                               0/1     Running                 1          68s
zeebe-0                               1/1     Running                 1          90s
zeebe-1                               1/1     Terminating             0          5m54s
zeebe-1                               0/1     Terminating             0          6m1s
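To spot this kind of failure without reading the watch output by hand, a small helper along the following lines could summarize restarts and abnormal statuses per pod. This is a minimal sketch, not part of the chaos tooling: the function name summarize_pod_watch is made up for illustration, and it assumes the five-column layout (name, ready, status, restarts, age) shown above.

```python
from collections import defaultdict


def summarize_pod_watch(text):
    """Return, per pod, the highest restart count and each non-Running status seen."""
    summary = defaultdict(lambda: {"restarts": 0, "abnormal": []})
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # not a five-column pod row (e.g. blank line)
        name, _ready, status, restarts, _age = fields
        if not restarts.isdigit():
            continue  # header row such as "NAME READY STATUS RESTARTS AGE"
        entry = summary[name]
        entry["restarts"] = max(entry["restarts"], int(restarts))
        if status != "Running" and status not in entry["abnormal"]:
            entry["abnormal"].append(status)
    return dict(summary)


# The watch output from this issue, copied as-is.
WATCH_OUTPUT = """\
zeebe-0                               0/1     Running                 0          10s
zeebe-0                               1/1     Running                 0          35s
zeebe-0                               0/1     OOMKilled               0          67s
zeebe-0                               0/1     Running                 1          68s
zeebe-0                               1/1     Running                 1          90s
zeebe-1                               1/1     Terminating             0          5m54s
zeebe-1                               0/1     Terminating             0          6m1s
"""

if __name__ == "__main__":
    for pod, info in summarize_pod_watch(WATCH_OUTPUT).items():
        print(pod, info)
```

On the output above this reports zeebe-0 as OOMKilled with one restart and zeebe-1 as Terminating, which matches the deviation the topology probe detected.
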
@Zelldon
Member Author

Zelldon commented Nov 16, 2020

Created an issue for increasing the limits: https://github.com/camunda-cloud/zeebe-controller-k8s/issues/363

@Zelldon
Member Author

Zelldon commented Nov 16, 2020

For now, I will not add this experiment for the Production S plan.

@Zelldon
Member Author

Zelldon commented Nov 18, 2020

\cc @npepinpe

@Zelldon
Member Author

Zelldon commented Nov 23, 2020

We (@npepinpe and I) decided to also turn off the experiment for the other cluster plans, because it might cause flakiness: the embedded gateway and the resource limits are a problem on the other cluster plans as well. The current assumption is that once we move the embedded gateway out to a standalone gateway, this experiment should succeed if we give it enough resources. Related: camunda/camunda#5874

Blocked by https://github.com/camunda-cloud/zeebe-controller-k8s/issues/244
and https://github.com/camunda-cloud/zeebe-controller-k8s/issues/363

Commit zeebe-io/zeebe-chaos@8272ec3

@npepinpe
Member

Does this still apply with the newer cluster plans, assuming we do migrate to the newer ones (which we should)?

@Zelldon
Member Author

Zelldon commented Oct 12, 2021

Yes, since the GA plan is the same as S and we still have the embedded gateway.

@Zelldon
Member Author

Zelldon commented Mar 4, 2022

Once we have done #573, we can enable new experiments such as high load and gateway termination. 🎉
