Chaos: Terminating Gateway Stops Workflow Processing #336

Open
shahamit opened this issue Mar 21, 2023 · 5 comments
Labels
Chaos Experiment This issue describes a chaos experiments, which should be created.

Comments

@shahamit

Chaos Experiment

When running the terminate chaos experiment against a zeebe cluster that was under load, we observed that the cluster stops processing any workflows thereafter.

Config: 6 brokers, 2 gateways, 6 partitions, replication factor 2.

Note that we don't have an ingress controller configured in front of the zeebe-gateway. Since our client (the benchmarking tool, in this case) runs within the same zeebe cluster, this should be fine, given that the k8s service (zeebe-gateway) load-balances between the gateway instances (which isn't happening, but that's a separate issue).

We were hoping that since the client (benchmarking tool) connects to the k8s zeebe-gateway service, terminating one of the gateway instances shouldn't have any impact on the client. I don't understand why we see errors on the client. Could you share more insight?

Thanks.

Benchmarking tool logs
ksnip_20230321-181948

Terminate command output
ksnip_20230321-181029

@shahamit added the Chaos Experiment label Mar 21, 2023
@Zelldon
Member

Zelldon commented Mar 31, 2023

Hey @shahamit sorry for the late reply.

Regarding:

When running the terminate chaos experiment against a zeebe cluster that was under load, we observed that the cluster stops processing any workflows thereafter.

What does "stops" mean here? Does it recover eventually, after some time?

@Zelldon
Member

Zelldon commented Mar 31, 2023

We were hoping that since the client (benchmarking tool) connects to the k8s zeebe-gateway service, terminating one of the gateway instances shouldn't have any impact on the client.

The effect will never be zero, because some requests might fail or time out, but I agree: after a retry, the request should succeed and be routed to the next gateway.
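
The retry behavior described here can be sketched roughly as follows. This is a minimal, self-contained illustration and not the Zeebe client's actual implementation: `GatewayUnavailable`, `call_with_retry`, and `make_flaky_request` are hypothetical names, the request is simulated, and the backoff values are arbitrary.

```python
import time


class GatewayUnavailable(Exception):
    """Hypothetical error raised when the targeted gateway instance is gone."""


def call_with_retry(request, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call request() and retry with exponential backoff on GatewayUnavailable.

    Returns the result of the first successful attempt and re-raises the
    last error once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except GatewayUnavailable:
            if attempt == max_attempts:
                raise
            # Back off before retrying: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_request(failures):
    """Build a simulated request that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def request():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise GatewayUnavailable("gateway terminated")
        return f"ok after {state['calls']} attempt(s)"

    return request


if __name__ == "__main__":
    # One simulated failure (the terminated gateway), then a successful retry.
    print(call_with_retry(make_flaky_request(failures=1)))
```

In a real setup the retried request would only reach a different gateway if the client opens a new connection or the service routes the retry elsewhere; the sketch does not model that part.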

@shahamit
Author

What does "stops" mean here? Does it recover eventually, after some time?

Yes, after a few seconds the workflows did start getting processed again. In between, though, some workflows do fail (indicated by the backpressure % increasing). Given that the gateway replicas are behind a k8s service, shouldn't it automatically go to the next gateway instance instead of failing the workflows?

@Zelldon
Member

Zelldon commented Mar 31, 2023

Do you have any metrics to show? What type of load are we talking about? 🤔

Given that the gateway replicas are behind a k8s service, shouldn't it automatically go to the next gateway instance

Yes, if a new request comes in, I would expect something like that.

failing the workflows?

Be aware that the process instances are not failing; they are just not being continued, right?

@shahamit
Author

shahamit commented Apr 5, 2023

Sorry for the late reply @Zelldon

Do you have any metrics to show? What type of load are we talking about? 🤔

We are running the benchmarking tool against a zeebe cluster of 7 brokers and 2 gateways. We observed a throughput of 170 PI/s.

Be aware that the process instances are not failing; they are just not being continued, right?

This is hard to find out since the benchmarking tool starts around 170 process instances per second. If you can think of a way to find this out, please let me know.

Thanks
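
One possible way to approach this question is to tally the outcome of each process instance creation request on the client side, so that requests that were never accepted can be told apart from instances that were created but not yet continued. The sketch below is only an illustration under stated assumptions: the status names are standard gRPC status codes, but the mapping to outcomes (in particular, treating `RESOURCE_EXHAUSTED` as a backpressure rejection) and the `summarize` helper are assumptions, not something the benchmarking tool is known to provide.

```python
from collections import Counter

# Assumed mapping from gRPC status code names to a coarse outcome label.
OUTCOME_BY_STATUS = {
    "OK": "started",
    "RESOURCE_EXHAUSTED": "rejected_backpressure",
    "UNAVAILABLE": "gateway_unreachable",
    "DEADLINE_EXCEEDED": "timed_out",
}


def summarize(statuses):
    """Count creation requests per outcome; unknown statuses count as 'other'."""
    counts = Counter(OUTCOME_BY_STATUS.get(status, "other") for status in statuses)
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical statuses collected while one gateway was being terminated.
    observed = ["OK", "OK", "UNAVAILABLE", "RESOURCE_EXHAUSTED", "OK"]
    for outcome, count in sorted(summarize(observed).items()):
        print(f"{outcome}: {count}")
```

Requests counted under anything other than `started` never produced a process instance, so they would show up as client errors rather than as failed instances.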
