Hypothesis: CPU stress on standalone gateway should not cause harm cluster wise #28

Open
Zelldon opened this issue Jun 11, 2020 · 1 comment
Labels
Contribution: Availability This issue will contribute to build up confidence in reliability. Hypothesis A thing which worries us and is ready for exploration. Impact: Low The issue has a low impact on the system. Impact: Medium The issue has a medium impact on the system. Likelihood: Medium The issue is not so likely.

Comments

Zelldon commented Jun 11, 2020

Hypothesis

We believe that stressing the CPU of the standalone gateway will not affect the stability of the cluster.

We expect latency to go up and throughput to go down while the stress is applied. After the stress is removed, both should return to normal.

Zelldon commented Jun 11, 2020

Today we ran a chaos experiment using our standard setup, with a baseline load of 100 workflow instances and 6 workers, which can activate at most 120 jobs.

In our steady state we were able to start and complete 100 workflow instances per second. A single instance took 1 to 2.5 seconds.
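As a rough sanity check on these numbers, Little's law (items in flight = throughput × latency) gives an estimate of how many workflow instances are in progress at any moment. This is only a back-of-the-envelope sketch: it assumes the 100 instances per second and the 1 to 2.5 second duration describe the same steady state, and the function name is illustrative.

```python
def in_flight(throughput_per_s: float, latency_s: float) -> float:
    """Little's law: average number of items in the system."""
    return throughput_per_s * latency_s


# Steady-state figures reported in this comment.
low = in_flight(100, 1.0)
high = in_flight(100, 2.5)
print(f"{low:.0f} to {high:.0f} instances in flight")
```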

We expected that introducing stress on the standalone gateway CPU would increase the processing latency and decrease the throughput, but that no cluster-wide failures would occur. We also expected that after removing the stress the system would return to normal and reach the baseline again.
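The recovery part of this expectation could be checked automatically along the following lines. This is a minimal sketch, not the tooling we used: the `Sample` type, the `recovered` function, the 10% tolerance, and the "during" and "after" values are all made up for illustration. Only the baseline of 100 instances per second and the upper latency bound of 2.5 seconds come from the experiment.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    throughput: float  # completed workflow instances per second
    latency_s: float   # seconds per workflow instance


def recovered(baseline: Sample, current: Sample, tolerance: float = 0.1) -> bool:
    """True if throughput and latency are within `tolerance` of the baseline."""
    throughput_ok = current.throughput >= baseline.throughput * (1 - tolerance)
    latency_ok = current.latency_s <= baseline.latency_s * (1 + tolerance)
    return throughput_ok and latency_ok


baseline = Sample(throughput=100, latency_s=2.5)  # steady state from the experiment
during = Sample(throughput=40, latency_s=6.0)     # hypothetical values under stress
after = Sample(throughput=98, latency_s=2.6)      # hypothetical values after stress

print("degraded during stress:", not recovered(baseline, during))
print("recovered after stress:", recovered(baseline, after))
```

The hypothesis holds if the system is allowed to degrade during the stress window but `recovered` is true once the stress is removed.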

The results look promising:

[Figure: gw-stress-proc]
[Figure: gw-stress-proc-latency]
[Figure: gw-cpu]

We ran the test twice. Both times the throughput went down and the latency went up under stress, and both returned to normal after the stress was removed.

What was unexpected, and what we found out:

Unexpectedly, the broker backpressure went up, which means the broker dropped requests while the gateway was under stress. We did not expect this, since the latency between writing to the dispatcher and processing the event should not change. We probably need to investigate this further. Our current assumption is that the stressed gateway sends requests in batches, which causes higher spikes in backpressure. We need more metrics on the transport module to verify that.
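To illustrate why batching alone could explain this, here is a toy model. It is not the broker's actual backpressure algorithm. It only assumes a fixed limit on in-flight requests and a fixed processing rate per tick, and all numbers are made up. The same total number of requests is sent either evenly or in bursts.

```python
def dropped(arrivals, limit, processed_per_tick):
    """Count requests rejected because the in-flight limit is reached."""
    in_flight = 0
    drops = 0
    for incoming in arrivals:
        # Accept only as many requests as fit under the in-flight limit.
        accepted = min(incoming, limit - in_flight)
        drops += incoming - accepted
        in_flight += accepted
        # The broker finishes a fixed number of requests per tick.
        in_flight -= min(in_flight, processed_per_tick)
    return drops


ticks = 100
smooth = [10] * ticks                                      # 10 requests every tick
batched = [50 if t % 5 == 0 else 0 for t in range(ticks)]  # same total, sent in bursts

print("total requests:", sum(smooth), sum(batched))
print("dropped (smooth): ", dropped(smooth, limit=20, processed_per_tick=10))
print("dropped (batched):", dropped(batched, limit=20, processed_per_tick=10))
```

In this model the broker's processing speed is identical in both cases, yet only the bursty arrival pattern hits the limit. That matches the assumption above but does not prove it, which is why the transport metrics are still needed.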

We found out that the standalone gateway is not resource limited, which meant that at some point it used 12 CPU cores.
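One way to address this would be to give the gateway container an explicit CPU request and limit. The sketch below assumes the gateway runs as a container on Kubernetes, which this issue does not state, and the values of 1 and 2 cores are placeholders, not a recommendation. The helper `cpu_resources` is hypothetical. It only builds the `resources` stanza of a container spec.

```python
import json


def cpu_resources(request_cores: float, limit_cores: float) -> dict:
    """Build a Kubernetes-style `resources` stanza for a container."""
    if request_cores > limit_cores:
        raise ValueError("CPU request must not exceed the CPU limit")
    return {
        "requests": {"cpu": str(request_cores)},
        "limits": {"cpu": str(limit_cores)},
    }


# Placeholder values; the right limit depends on the expected gateway load.
print(json.dumps({"resources": cpu_resources(1, 2)}, indent=2))
```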
