Pods occasionally come up without injected sidecars when AKS cluster is restarted #2901

KarstenWintermann · 2024-04-26T10:08:00Z

Component(s)

No response

What happened?

Description

I have configured an OpenTelemetry collector sidecar and configured several pods for sidecar injection. The sidecars are always injected successfully as expected when I restart individual pods, for example by "kubectl delete pod ...".

However, when the whole AKS cluster is stopped and started (e.g. via "az aks stop ...", "az aks start ..."), occasionally some or all of the pods come up without the sidecars.

There are no relevant K8S events or log messages in case the pods come up without sidecars.

I would estimate that about 50% of the time, there are sidecar injections missing after a cluster restart.

Steps to Reproduce

1. Deploy the OpenTelemetry operator.
2. Configure the OpenTelemetry sidecar.
3. Configure pods for sidecar injection.
4. Watch that the pods come up with sidecars.
5. Restart the cluster.
6. Check the pods for sidecars.
7. If the sidecars are present, restart the cluster again.

Expected Result

Sidecars are always present in all pods configured for sidecar injection, even after cluster restart

Actual Result

Occasionally, sidecars are not present after cluster restart in the running pods

Kubernetes Version

1.28.5

Operator version

0.98.0

Collector version

0.98.0

Environment information

Environment

AKS 1.28.5
Quarkus 3.6.x
dapr.io 1.13.2
Node image: AKSUbuntu-2204gen2containerd-202402.07.0

Log output

No response

Additional context

I have noticed similar issues with dapr.io's sidecar injection. However, in dapr.io it is possible to configure a watchdog (https://docs.dapr.io/concepts/dapr-services/operator/#injector-watchdog, https://github.com/dapr/dapr/blob/c75c08f6f364620238b67cb2bfd231b3bde57c79/pkg/operator/watchdog.go) which periodically checks for pods which are annotated for sidecar injection and are missing sidecars. If any pods are found, they are deleted so that they have a chance to come up again with injected sidecar. This fixes the same issue in dapr.
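The check such a watchdog performs can be sketched for the OpenTelemetry case as follows. This is only an illustration, not an existing operator feature. It assumes that injection is requested through the pod annotation `sidecar.opentelemetry.io/inject` and that the injected container is named `otc-container`; both should be verified against the operator version in use. The function names are hypothetical, and pods are plain dictionaries in the shape returned by `kubectl get pods -o json`.

```python
from typing import Iterable

# Assumptions (verify against your operator version):
# - pods opt in to injection through this annotation
# - the injected collector container carries this name
INJECT_ANNOTATION = "sidecar.opentelemetry.io/inject"
SIDECAR_CONTAINER = "otc-container"


def wants_sidecar(pod: dict) -> bool:
    """True if the pod-level annotation requests injection.

    Any value other than "false" (for example "true" or a collector
    name) is treated as a request. Namespace-level annotations are
    not considered in this sketch.
    """
    annotations = pod.get("metadata", {}).get("annotations") or {}
    value = annotations.get(INJECT_ANNOTATION, "false")
    return value.lower() != "false"


def has_sidecar(pod: dict) -> bool:
    """True if a container with the assumed sidecar name is present."""
    containers = pod.get("spec", {}).get("containers") or []
    return any(c.get("name") == SIDECAR_CONTAINER for c in containers)


def pods_missing_sidecar(pods: Iterable[dict]) -> list[str]:
    """Return "namespace/name" for every annotated pod lacking the sidecar."""
    missing = []
    for pod in pods:
        if wants_sidecar(pod) and not has_sidecar(pod):
            meta = pod["metadata"]
            missing.append(f'{meta.get("namespace", "default")}/{meta["name"]}')
    return missing
```

The input could be the `items` list from `kubectl get pods -A -o json`. Deleting the reported pods with "kubectl delete pod ..." gives their controllers a chance to recreate them once the webhook is reachable, which is what the dapr watchdog does for its own sidecars.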

I am aware of issue #1765, but that doesn't fix my issue. I have also found no way to configure the changes described there through the operator Helm chart.

pavolloffay · 2024-04-26T11:57:43Z

Sidecar injection uses a pod mutation webhook, declared in opentelemetry-operator/internal/webhook/podmutation/webhookhandler.go (line 32 in 35d8891):

    // +kubebuilder:webhook:path=/mutate-v1-pod,mutating=true,failurePolicy=ignore,groups="",resources=pods,verbs=create,versions=v1,name=mpod.kb.io,sideEffects=none,admissionReviewVersions=v1

It is rendered into the webhook configuration in opentelemetry-operator/config/webhook/manifests.yaml (line 74 in 6af3e4c):

    failurePolicy: Ignore

The webhook sets failurePolicy=ignore. According to the kubebuilder documentation (https://book.kubebuilder.io/reference/markers/webhook.html), failurePolicy

    specifies what should happen if the API server cannot reach the webhook.
    It may be either "ignore" (to skip the webhook and continue on) or "fail" (to reject the object in question).

That said, I would recommend trying to change the policy to fail. It might have other side effects, but it is worth trying.
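To make the difference between the two policies concrete, here is a toy model of the admission step. It is not the API server's real code. The pod is reduced to a dictionary with a list of container names, `admit_pod` is a hypothetical helper, and the container name `otc-container` is an assumption about what the operator injects.

```python
def admit_pod(pod: dict, webhook_reachable: bool, failure_policy: str) -> dict:
    """Toy model of pod admission with a mutating sidecar webhook.

    pod is a simplified dict such as {"containers": ["app"]}.
    """
    if webhook_reachable:
        # Normal case: the webhook answers and appends the sidecar.
        mutated = dict(pod)
        mutated["containers"] = list(pod["containers"]) + ["otc-container"]
        return mutated
    if failure_policy.lower() == "ignore":
        # Webhook unreachable and policy Ignore: the pod is admitted
        # unchanged, so it runs without a sidecar and nothing is reported.
        return pod
    # Webhook unreachable and policy Fail: the create request is rejected.
    raise RuntimeError('failed calling webhook "mpod.kb.io"')
```

With `webhook_reachable=False`, the Ignore branch reproduces the symptom from the report: the pod starts without a sidecar and without any error. The Fail branch rejects the pod instead, so the owning controller has to retry until the webhook is available.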

KarstenWintermann · 2024-04-26T13:25:11Z

When I set admissionWebhooks.pods.failurePolicy to Fail in the helm chart (I use 0.55.3), then the OTel operator fails to start with

Error creating: Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://otel-operator-opentelemetry-operator-webhook.otel.svc:443/mutate-v1-pod?timeout=10s": no endpoints available for service "otel-operator-opentelemetry-operator-webhook"

which causes some (but not necessarily all) other deployments to also fail with a similar error message.

When I use the additional settings from issue #1765 (excluding the namespace of the OTel operator), the above error messages no longer appear for the OTel operator deployment, and the other deployments eventually start as expected with the sidecars, after initially generating the above error message.
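For illustration, excluding one namespace from a webhook is usually expressed as a `namespaceSelector` on the webhook configuration. The sketch below builds such a selector and evaluates it against namespace labels. It assumes the exclusion is keyed on the `kubernetes.io/metadata.name` label that Kubernetes sets on every namespace; whether issue #1765 uses exactly this form is not confirmed here. The helper names are hypothetical, and only `matchExpressions` is handled.

```python
def exclude_namespace_selector(operator_namespace: str) -> dict:
    """Build a namespaceSelector that skips the given namespace."""
    return {
        "matchExpressions": [
            {
                "key": "kubernetes.io/metadata.name",
                "operator": "NotIn",
                "values": [operator_namespace],
            }
        ]
    }


def selector_matches(selector: dict, namespace_labels: dict) -> bool:
    """Evaluate the matchExpressions of a label selector.

    Returns True if the webhook would be called for pods in a
    namespace carrying these labels.
    """
    for expr in selector.get("matchExpressions", []):
        key, op = expr["key"], expr["operator"]
        present = key in namespace_labels
        value = namespace_labels.get(key)
        if op == "In":
            ok = present and value in expr["values"]
        elif op == "NotIn":
            # A missing label also satisfies NotIn.
            ok = not present or value not in expr["values"]
        elif op == "Exists":
            ok = present
        elif op == "DoesNotExist":
            ok = not present
        else:
            raise ValueError(f"unsupported operator: {op}")
        if not ok:
            return False
    return True
```

With the operator in the `otel` namespace, as the service name in the error above suggests, `selector_matches(exclude_namespace_selector("otel"), {"kubernetes.io/metadata.name": "otel"})` is false, so the operator's own pods no longer depend on the webhook being reachable, while pods in other namespaces are still sent to it.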

I still don't understand how to configure the helm chart in order to include these settings.

KarstenWintermann · 2024-04-26T14:22:12Z

I just found out that there is an "admission enforcer" in AKS (Azure/AKS#4002) which apparently overwrites whatever I configure for the namespaceSelector :-(

jaronoff97 · 2024-04-26T17:15:21Z

Is this the same as #1329?

KarstenWintermann · 2024-04-26T20:24:46Z

> Is this the same as #1329?

Yes, sounds similar. I'm doing sidecar injection instead of auto-instrumentation, but the effect seems to be the same. Plus AKS adds another layer of difficulty with its admission enforcement.

jaronoff97 · 2024-04-30T15:13:28Z

Yeah we had some discussion on that issue, and ultimately came to this conclusion. Would you mind commenting on that issue with what you expect to happen here / how we could help bubble this up better? Any opposition to me closing this issue?

jaronoff97 · 2024-07-26T15:00:20Z

@KarstenWintermann I'm going to close this issue in favor of #1329. Let me know if you have any further issues. Thank you!

KarstenWintermann added the bug and needs triage labels Apr 26, 2024

pavolloffay added the area:collector label Apr 26, 2024

jaronoff97 removed the needs triage label Apr 26, 2024

jaronoff97 closed this as completed Jul 26, 2024
