
[stable/vpa] Probe defaults don't make sense #1519

Open · 2 tasks done

piepmatz opened this issue Aug 9, 2024 · 1 comment

Labels
bug (Something isn't working), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments


piepmatz commented Aug 9, 2024

What happened?

The VPA chart sets readinessProbe and livenessProbe for several containers. Here are the values for the recommender:

# recommender.livenessProbe -- The liveness probe definition inside the recommender pod
livenessProbe:
  failureThreshold: 6
  httpGet:
    path: /health-check
    port: metrics
    scheme: HTTP
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 3
# recommender.readinessProbe -- The readiness probe definition inside the recommender pod
readinessProbe:
  failureThreshold: 120
  httpGet:
    path: /health-check
    port: metrics
    scheme: HTTP
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 3

Both probes are essentially the same and differ only in their failureThreshold values, and those values are the problem: if 6 failed liveness probes (about 30 seconds at a periodSeconds of 5) already restart the container, the readiness probe can never accumulate 120 consecutive failures and mark the pod unready, because the restart happens long before that.

The behavior has been like this since the probes were added in #399.

In a scenario with a couple of thousand VPA resources, I've seen the recommender restarted over and over because its liveness probe failed before the container had finished starting up.

What did you expect to happen?

A failureThreshold of 120 for the readinessProbe seems very high; I suspect it is actually meant to be a startupProbe. Used that way, the high threshold would give the container up to 120 × 5 s = 10 minutes to come up before the livenessProbe takes over.
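
For illustration, that intent could look roughly like this (a sketch only, not a tested or proposed configuration; the thresholds are just the existing defaults rearranged):

# Sketch: move the generous threshold into a startupProbe so the livenessProbe
# only starts probing once startup has succeeded.
startupProbe:
  failureThreshold: 120
  httpGet:
    path: /health-check
    port: metrics
    scheme: HTTP
  periodSeconds: 5
livenessProbe:
  failureThreshold: 6
  httpGet:
    path: /health-check
    port: metrics
    scheme: HTTP
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 3

Kubernetes suspends liveness and readiness probing until the startupProbe has succeeded, which is exactly the "give it time to come up" behavior the current readinessProbe threshold seems to be aiming for.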

How can we reproduce this?

Create lots of VPA resources to slow down the startup of the VPA's recommender pod. It then gets restarted due to the failing liveness probe.
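
A single VPA object looks roughly like this (names are illustrative; any existing Deployment can be the target). Creating a few thousand of them is enough to push the recommender's startup past the roughly 30-second window the default livenessProbe allows:

# Illustrative only: a minimal VPA resource, repeated many times with unique names.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-vpa-0001          # repeat with unique names
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment      # any existing Deployment
  updatePolicy:
    updateMode: "Off"             # recommendation-only, no evictions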

Version

4.5.0

Search

  • I did search for other open and closed issues before opening this.

Code of Conduct

  • I agree to follow this project's Code of Conduct

Additional context

No response

piepmatz added the bug (Something isn't working) and triage (This bug needs triage) labels on Aug 9, 2024
@sudermanjr
Member

Good catch. Please feel free to open a PR to update these to more sensible values. The only reason to have a readiness probe on the VPA pods at all is if you're using the metrics endpoint, since the pods do not expose any API and just run controller loops.
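
(For anyone hitting this before a fix lands: the probes can already be overridden per release through the chart values. The key paths below mirror the defaults quoted above; the numbers are only an illustration, not an agreed-upon fix.)

recommender:
  # Illustrative override: keep liveness as-is but give startup some headroom,
  # and use a more conventional readiness threshold.
  livenessProbe:
    failureThreshold: 6
    httpGet:
      path: /health-check
      port: metrics
      scheme: HTTP
    initialDelaySeconds: 60
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 3
  readinessProbe:
    failureThreshold: 3
    httpGet:
      path: /health-check
      port: metrics
      scheme: HTTP
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 3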

sudermanjr added the help wanted (Extra attention is needed) and good first issue (Good for newcomers) labels and removed the triage (This bug needs triage) label on Sep 20, 2024