
Unexpected periodical istiod scale down #915

Open
vitalii-buchyn-exa opened this issue Apr 12, 2023 · 15 comments

@vitalii-buchyn-exa
Describe the bug
Hello community,

We are experiencing periodic, unexpected istiod deployment scale-downs that go far below the HPA recommendation and below the minimum replica count.
It happens every 12 hours in our case.

This behaviour leads to an increase in control plane traffic and thus an increase in outbound traffic cost for us.

If we stop the operator (scale its deployment down to 0 replicas), we don't see this behaviour.

Example of replicas spec

replicas:
  count: 30
  max: 160
  min: 30
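
For reference, a min/max like the above would be reflected in a HorizontalPodAutoscaler roughly like the following; the resource names, API version, and the CPU target are illustrative assumptions, not values taken from this cluster:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: istiod-istio2
  namespace: istio-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istiod-istio2
  minReplicas: 30
  maxReplicas: 160
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80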

Please also find an example screenshot attached.

Please let me know if any additional info is required.

Thank you!

Expected behavior
Not to have such a drastic scale down; a rollout restart could be used instead.

Screenshots
(screenshot attached)

Additional context
Platform 1.23.16-gke.1400
Operator version v2.16.1
Pilot version 1.15.3

@LuciferInLove
Collaborator

Hi @vitalii-buchyn-exa. Is there something interesting in the operator's logs during these events?

@vitalii-buchyn-exa
Author

@LuciferInLove Hi, nothing I can spot, at least; here are a couple of minutes before and after the issue:
some-istio-operator.log

For the istiod deployment we see events like:

Events:
  Type    Reason             Age                    From                   Message
  ----    ------             ----                   ----                   -------
  Normal  ScalingReplicaSet  69s (x7 over 4d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 54
  Normal  ScalingReplicaSet  69s                    deployment-controller  Scaled up replica set istiod-istio2-65745fd459 to 71
  Normal  ScalingReplicaSet  53s (x6 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 51
  Normal  ScalingReplicaSet  53s (x5 over 2d12h)    deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 53
  Normal  ScalingReplicaSet  52s (x7 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 50
  Normal  ScalingReplicaSet  52s (x3 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 46
  Normal  ScalingReplicaSet  52s (x4 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 45
  Normal  ScalingReplicaSet  52s (x3 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 44
  Normal  ScalingReplicaSet  52s                    deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 41
  Normal  ScalingReplicaSet  51s (x123 over 4d12h)  deployment-controller  (combined from similar events): Scaled down replica set istiod-istio2-65745fd459 to 0
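
For anyone trying to correlate these events, commands like the following show the same information live; the deployment/HPA name istiod-istio2 is taken from the events above, and the namespace istio-system is an assumption:

kubectl -n istio-system describe deployment istiod-istio2
kubectl -n istio-system get hpa istiod-istio2 --watch
kubectl -n istio-system get deployment istiod-istio2 --watch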

@LuciferInLove
Collaborator

Thanks for the information. I'm trying to reproduce it, but with no success yet. Did you notice any changes in resources? For example, was the HPA resource perhaps changed?

@vitalii-buchyn-exa
Author

HPA config doesn't seem to change: min and max values stay the same.

The HPA may get confused for a while with <unknown> in TARGETS, but from what we can observe, once the HPA can read the target metrics again, it tries to set the replicas to at least the minimum value (desired replicas is the yellow line on the screenshot above).

We tried scaling the operator down just before the issue and then scaling it back up after a while; the result is a postponed istiod deployment scale down, for example:
(screenshot attached)

We also tried scaling the operator down/up at random times, but the issue doesn't occur randomly. It seems something happens every 12 hours (the timezone on the screenshot is GMT+3).

@LuciferInLove
Collaborator

If it's possible, please try to run the operator with -devel-mode and -verbose flags for verbose logging.

@vitalii-buchyn-exa
Author

Sure, --devel-mode and --verbose added.
I'll wait until the next downscale event and share the logs.
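
For reference, a minimal sketch of what enabling these flags looks like in the operator Deployment's container args; the container name and any pre-existing args are placeholders, not taken from this cluster:

spec:
  template:
    spec:
      containers:
        - name: istio-operator
          args:
            - --devel-mode
            - --verbose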

@vitalii-buchyn-exa
Author

Here is the log from 5 minutes before until 5 minutes after the issue (the timezone of the logs is GMT):
some-debug-istio-operator.log

The scale down happened at 03:00:30 GMT+3.
(screenshot attached)

@LuciferInLove
Collaborator

The most interesting part:

2023-04-17T23:55:20.746Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ui-ingressgateway"}

Does the gateway address change every 12 hours?

@vitalii-buchyn-exa
Author

I can see a lot of such events in the log:

% grep "controllers.IstioMeshGateway      gateway address has changed" istio-operator-67db7ccb48-9z2pp.log | wc -l
     237

The events look like:

2023-04-18T08:51:40.263Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ui-ingressgateway"}
2023-04-18T09:08:14.282Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ingress-apigee"}
2023-04-18T09:08:20.065Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ingress-api"}
2023-04-18T09:08:25.864Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ui-ingressgateway"}
2023-04-18T09:24:59.887Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ingress-apigee"}
2023-04-18T09:25:05.662Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ingress-api"}
2023-04-18T09:25:11.464Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ui-ingressgateway"}
2023-04-18T09:41:45.482Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ingress-apigee"}
2023-04-18T09:41:51.252Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ingress-api"}
2023-04-18T09:41:57.051Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ui-ingressgateway"}
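
A quick way to check whether the gateway addresses are actually flapping on that cadence; this assumes the CRD's plural is istiomeshgateways, and the resource names are the ones from the log above:

kubectl -n istio-system get istiomeshgateways
kubectl -n istio-system get svc -o wide
kubectl -n istio-system get istiomeshgateways ui-ingressgateway -o yaml | grep -iA3 address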

@vitalii-buchyn-exa
Author

Should there be any connection between the istiod deployment and the reconciling of the istiomeshgateway deployments?

@LuciferInLove
Collaborator

An istiomeshgateway IP change shouldn't cause the istiod deployment to scale down. I checked that on the same cluster. Only IstioControlPlane changes can cause this behavior. Could something be changing every 12 hours on your cluster that causes these resource changes? Could you please also share the IstioControlPlane that you use?
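
One cheap way to check whether the IstioControlPlane spec itself changes on a 12-hour cadence is to watch its metadata.generation, which only increments on spec changes; the resource name istio2 is an assumption inferred from the istiod-istio2 deployment name:

kubectl -n istio-system get istiocontrolplanes istio2 -o jsonpath='{.metadata.generation}{"\n"}'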

@vitalii-buchyn-exa
Author

Could something be changed every 12 hours on your cluster that causes these resource changes?
I wish I knew... I can only find logWindowDuration

and something JWT token related.

Here is an example of IstioControlPlane CR:
istiocontrolplane.yaml.zip

@LuciferInLove
Collaborator

These settings can't affect such behavior. Otherwise, it would be reproducible everywhere, including our test envs.

Unfortunately, there are no clues in the IstioControlPlane. I tried this ICP with some modifications according to my env, and there were no downscales for 16 hours.

Another interesting string in the logs:

2023-04-18T00:00:20.039Z	DEBUG	controllers.IstioControlPlane.discovery	resource diff
{"name": "istiod-istio2", "namespace": "istio-system", "apiVersion": "apps/v1", "kind": "Deployment",
"patch": "{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"exa_service\":\"ecp-inf-istio\"}}}}}"}

Could something be causing these periodic label changes?
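
Purely as an illustration of what that diff means: a patch like the one below (the label key and value are copied from the log line above; the command itself is hypothetical) is the kind of external change, for example from a cronjob, that would make the operator's reconciler see a drifted Deployment spec:

kubectl -n istio-system patch deployment istiod-istio2 --type merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"exa_service":"ecp-inf-istio"}}}}}'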

@vitalii-buchyn-exa
Author

@LuciferInLove you are a genius, Sir; there is indeed a label-injector cronjob with schedule 0 */12 * * *, and that job modifies the istiod labels...
Just wondering: could such a change trigger a rollout restart instead of a full scale down?

Thank you for your time and help!

@LuciferInLove
Collaborator

@vitalii-buchyn-exa no problem. Feel free to reach out.
We'll take a look at what is possible to do. For now, you can try changing the PodDisruptionBudget settings. Maybe the tips from here will come in handy as well:
kubernetes/kubernetes#72231
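
A minimal PodDisruptionBudget sketch along those lines; the selector labels and the minAvailable value are assumptions, not taken from this cluster:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: istiod-istio2
  namespace: istio-system
spec:
  minAvailable: 30
  selector:
    matchLabels:
      app: istiod

Note that a PDB only limits voluntary disruptions such as evictions; it does not stop the deployment controller from scaling a ReplicaSet down, so fixing whatever rewrites the labels remains the main remedy.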
