Pods occasionally come up without injected sidecars when AKS cluster is restarted #2901

Closed
KarstenWintermann opened this issue Apr 26, 2024 · 7 comments
Labels
area:collector Issues for deploying collector bug Something isn't working

Comments

@KarstenWintermann

Component(s)

No response

What happened?

Description

I have configured an OpenTelemetry collector sidecar and configured several pods for sidecar injection. The sidecars are always injected successfully as expected when I restart individual pods, for example by "kubectl delete pod ...".

However, when the whole AKS cluster is stopped and started (e.g. via "az aks stop ...", "az aks start ..."), occasionally some or all of the pods come up without the sidecars.

When pods come up without sidecars, there are no relevant Kubernetes events or log messages.

I would estimate that about 50% of the time, there are sidecar injections missing after a cluster restart.

Steps to Reproduce

  • deploy opentelemetry operator
  • configure opentelemetry sidecar
  • configure pods for sidecar injection
  • watch that pods come up with sidecars
  • restart cluster
  • check pods for sidecars
  • if sidecars are present, restart cluster again
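The "check pods for sidecars" step can be scripted. The following is a minimal sketch, not part of the operator: it assumes that pods opt in through the `sidecar.opentelemetry.io/inject` annotation and that the injected container is named `otc-container`. Both names are assumptions that should be checked against the operator version in use. The function name `pods_missing_sidecar` is hypothetical.

```python
import json

# Assumptions (not confirmed in this issue): the annotation key used to request
# injection and the name the operator gives the injected collector container.
INJECT_ANNOTATION = "sidecar.opentelemetry.io/inject"
SIDECAR_CONTAINER = "otc-container"


def pods_missing_sidecar(pod_list_json: str) -> list:
    """Return "namespace/name" for pods that requested injection but have no sidecar.

    `pod_list_json` is the text produced by `kubectl get pods -A -o json`.
    """
    missing = []
    for pod in json.loads(pod_list_json).get("items", []):
        meta = pod.get("metadata", {})
        annotations = meta.get("annotations") or {}
        requested = str(annotations.get(INJECT_ANNOTATION, "false"))
        if requested.lower() == "false":
            continue  # this pod did not ask for a sidecar
        containers = pod.get("spec", {}).get("containers", [])
        names = {c.get("name") for c in containers}
        if SIDECAR_CONTAINER not in names:
            missing.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return missing
```

To use it, save the output of `kubectl get pods -A -o json` after each cluster restart and pass the text to `pods_missing_sidecar`. An empty list means every annotated pod has its sidecar.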

Expected Result

Sidecars are always present in all pods configured for sidecar injection, even after cluster restart

Actual Result

Occasionally, sidecars are not present after cluster restart in the running pods

Kubernetes Version

1.28.5

Operator version

0.98.0

Collector version

0.98.0

Environment information

Environment

AKS 1.28.5
Quarkus 3.6.x
dapr.io 1.13.2
Node image: AKSUbuntu-2204gen2containerd-202402.07.0

Log output

No response

Additional context

I have noticed similar issues with dapr.io's sidecar injection. However, in dapr.io it is possible to configure a watchdog (https://docs.dapr.io/concepts/dapr-services/operator/#injector-watchdog, https://github.com/dapr/dapr/blob/c75c08f6f364620238b67cb2bfd231b3bde57c79/pkg/operator/watchdog.go) which periodically checks for pods which are annotated for sidecar injection and are missing sidecars. If any pods are found, they are deleted so that they have a chance to come up again with injected sidecar. This fixes the same issue in dapr.
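For illustration, the watchdog idea described above can be sketched as a small loop. This is not dapr's implementation and not an existing operator feature. All names (`watchdog_pass`, `run_watchdog`, and the three callables) are hypothetical, and the cluster access is deliberately left to the caller so the logic stays independent of any client library.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def watchdog_pass(
    list_pods: Callable[[], Iterable[dict]],
    is_missing_sidecar: Callable[[dict], bool],
    delete_pod: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """Run one check: delete every pod that should have a sidecar but does not."""
    deleted = []
    for pod in list_pods():
        if is_missing_sidecar(pod):
            meta = pod["metadata"]
            # Deleting the pod gives its controller a chance to recreate it,
            # so the mutating webhook gets another attempt at injection.
            delete_pod(meta["namespace"], meta["name"])
            deleted.append((meta["namespace"], meta["name"]))
    return deleted


def run_watchdog(
    list_pods: Callable[[], Iterable[dict]],
    is_missing_sidecar: Callable[[dict], bool],
    delete_pod: Callable[[str, str], None],
    interval_seconds: float = 60.0,
    max_passes: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Repeat watchdog_pass every interval_seconds (forever if max_passes is None)."""
    passes = 0
    while max_passes is None or passes < max_passes:
        watchdog_pass(list_pods, is_missing_sidecar, delete_pod)
        passes += 1
        if max_passes is None or passes < max_passes:
            sleep(interval_seconds)
```

Such a loop only helps for pods owned by a controller such as a Deployment, because a bare pod is not recreated after deletion.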

I am aware of issue #1765, but that doesn't fix my issue. I have also found no way to configure the changes described there through the operator Helm chart.

KarstenWintermann added the bug and needs triage labels on Apr 26, 2024
@pavolloffay
Member

Sidecar injection uses a pod mutating webhook, which is declared with this kubebuilder marker:

// +kubebuilder:webhook:path=/mutate-v1-pod,mutating=true,failurePolicy=ignore,groups="",resources=pods,verbs=create,versions=v1,name=mpod.kb.io,sideEffects=none,admissionReviewVersions=v1
The marker is then rendered into the webhook configuration.

The webhook defines failurePolicy=ignore. The kubebuilder documentation (https://book.kubebuilder.io/reference/markers/webhook.html) describes this setting as follows:

failurePolicy specifies what should happen if the API server cannot reach the webhook.
It may be either "ignore" (to skip the webhook and continue on) or "fail" (to reject the object in question).

That said, I would recommend trying to change the policy to fail. It might have other side effects, but it is worth trying.
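To make the effect of the two policies concrete, here is a toy model of the admission decision. It is a sketch under simplifying assumptions, not the API server's actual code: the webhook is represented by a plain callable that raises `ConnectionError` when it is unreachable, and the names `admit_pod` and `AdmissionRejected` are hypothetical.

```python
from typing import Callable


class AdmissionRejected(Exception):
    """Raised when the pod is rejected because the webhook could not be reached."""


def admit_pod(pod: dict, call_webhook: Callable[[dict], dict],
              failure_policy: str = "Ignore") -> dict:
    """Return the pod as it would be stored after the mutating webhook step."""
    try:
        return call_webhook(pod)  # webhook reachable: pod comes back mutated
    except ConnectionError as exc:
        if failure_policy.lower() == "ignore":
            # The webhook is skipped and the pod is admitted unmodified,
            # which is how a pod can start without its sidecar.
            return pod
        # With "fail" the object is rejected; the owning controller is then
        # expected to retry creating the pod later.
        raise AdmissionRejected(str(exc)) from exc


def inject_sidecar(pod: dict) -> dict:
    """Stand-in for a healthy webhook that appends a sidecar container."""
    return {**pod, "containers": pod["containers"] + ["sidecar"]}


def unreachable_webhook(pod: dict) -> dict:
    """Stand-in for a webhook whose service has no endpoints yet."""
    raise ConnectionError("no endpoints available for service")
```

With `unreachable_webhook` and the `Ignore` policy, `admit_pod` returns the pod without a sidecar and without any error, which matches the silent behaviour reported after a cluster restart. With `Fail`, the same call raises instead.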

pavolloffay added the area:collector label on Apr 26, 2024
@KarstenWintermann
Author

When I set admissionWebhooks.pods.failurePolicy to Fail in the Helm chart (I use 0.55.3), the OTel operator fails to start with:

Error creating: Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://otel-operator-opentelemetry-operator-webhook.otel.svc:443/mutate-v1-pod?timeout=10s": no endpoints available for service "otel-operator-opentelemetry-operator-webhook"

which causes some (but not necessarily all) other deployments to also fail with a similar error message.

When I use the additional settings from issue #1765 (excluding the namespace of the OTel operator), the above error messages don't appear for the OTel operator deployment, and the other deployments eventually start as expected with the sidecars after initially generating the above error message.
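As a rough illustration of what excluding the operator's namespace can look like, the sketch below builds a webhook entry with a `namespaceSelector` and evaluates it against namespace labels. It assumes the exclusion is expressed with a `NotIn` expression on the `kubernetes.io/metadata.name` label, which may differ from the exact settings in #1765. The helper names are hypothetical, and the namespace `otel` is taken from the service URL in the error message above.

```python
def pod_webhook_entry(operator_namespace: str) -> dict:
    """Build a webhook entry that skips pods created in the operator's namespace."""
    return {
        "name": "mpod.kb.io",
        "failurePolicy": "Fail",
        "namespaceSelector": {
            "matchExpressions": [
                {
                    # Assumed label: set automatically on namespaces by recent
                    # Kubernetes versions, with the namespace name as its value.
                    "key": "kubernetes.io/metadata.name",
                    "operator": "NotIn",
                    "values": [operator_namespace],
                }
            ]
        },
    }


def webhook_applies(entry: dict, namespace_labels: dict) -> bool:
    """Evaluate the selector; only NotIn is modelled in this sketch."""
    for expr in entry["namespaceSelector"]["matchExpressions"]:
        if expr["operator"] != "NotIn":
            raise ValueError("only NotIn is modelled in this sketch")
        # NotIn matches when the label is absent or its value is not listed.
        if namespace_labels.get(expr["key"]) in expr["values"]:
            return False
    return True
```

With this selector, pods in the `otel` namespace bypass the webhook, so the operator can start even when the policy is `Fail`, while pods in other namespaces are still rejected until the webhook becomes reachable.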

I still don't understand how to configure the helm chart in order to include these settings.

@KarstenWintermann
Author

I just found out that there is an "admission enforcer" in AKS (Azure/AKS#4002) which apparently overwrites whatever I configure for the namespaceSelector :-(

@jaronoff97
Contributor

Is this the same as #1329?

@KarstenWintermann
Author

> Is this the same as #1329?

Yes, sounds similar. I'm doing sidecar injection instead of auto-instrumentation, but the effect seems to be the same. Plus AKS adds another layer of difficulty with its admission enforcement.

@jaronoff97
Contributor

Yeah we had some discussion on that issue, and ultimately came to this conclusion. Would you mind commenting on that issue with what you expect to happen here / how we could help bubble this up better? Any opposition to me closing this issue?

@jaronoff97
Contributor

@KarstenWintermann I'm going to close this issue in favor of #1329. Let me know if you have any further issues. Thank you!


3 participants