oomkill in aks-based openshift 4.10 cluster #984

Open
cirocosta opened this issue Aug 19, 2022 · 2 comments
Comments

@cirocosta
Contributor

Bug description:

In an Azure-based OpenShift 4.10 cluster, Cartographer's default memory
limits are not enough when the cluster holds 15k+ Secrets and thousands of
ServiceAccounts (Cartographer v0.4.3).

Steps to reproduce:

TODO (/cc todd)

Expected behavior:

Cartographer stays up and running even when the cluster holds a large number
of objects that are unrelated to Cartographer's own functionality.

Actual behavior:

OOMKill

Logs, k8s object dumps:

TODO

Versions:

TODO

  • Infrastructure (kind, TKG, GKE etc.)
  • cartographer version
  • k8s version
  • ... other versions of related software being used

Deployment info:

TODO

Additional context:

@cirocosta cirocosta changed the title oomkill in aks-based openshift 4.10 cluster draft: oomkill in aks-based openshift 4.10 cluster Aug 19, 2022
@scothis
Contributor

scothis commented Aug 22, 2022

If you can avoid it, it's best not to run a caching informer on Secrets and instead look them up as needed. ServiceAccounts are probably still OK, as each resource is very small.
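
To make the trade-off concrete, here is a minimal, stdlib-only sketch of the two read paths. Everything in it (`Secret`, `FakeAPI`, `CachedReader`, `DirectReader`) is a hypothetical stand-in, not Cartographer or client library code. It only illustrates that a cache retains every Secret in the cluster, while an on-demand lookup retains nothing between calls, at the cost of one API round trip per read.

```go
package main

import "fmt"

// Secret is a simplified, hypothetical stand-in for a Kubernetes Secret.
type Secret struct {
	Namespace, Name string
	Data            []byte
}

// FakeAPI is a hypothetical stand-in for the API server, keyed by "namespace/name".
type FakeAPI struct {
	secrets map[string]Secret
}

func key(namespace, name string) string { return namespace + "/" + name }

// NewFakeAPI creates n Secrets of size bytes each in namespace "default".
func NewFakeAPI(n, size int) *FakeAPI {
	api := &FakeAPI{secrets: make(map[string]Secret, n)}
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("secret-%d", i)
		api.secrets[key("default", name)] = Secret{
			Namespace: "default",
			Name:      name,
			Data:      make([]byte, size),
		}
	}
	return api
}

// Get fetches a single Secret, like an uncached client read.
func (a *FakeAPI) Get(namespace, name string) (Secret, bool) {
	s, ok := a.secrets[key(namespace, name)]
	return s, ok
}

// CachedReader mimics an informer-backed cache: it copies every Secret up
// front, so its memory use grows with the number and size of all Secrets.
type CachedReader struct {
	store map[string]Secret
}

func NewCachedReader(api *FakeAPI) *CachedReader {
	c := &CachedReader{store: make(map[string]Secret, len(api.secrets))}
	for k, s := range api.secrets {
		s.Data = append([]byte(nil), s.Data...) // the cache owns its own copy
		c.store[k] = s
	}
	return c
}

// HeldBytes reports how many payload bytes the cache retains.
func (c *CachedReader) HeldBytes() int {
	total := 0
	for _, s := range c.store {
		total += len(s.Data)
	}
	return total
}

// DirectReader keeps no state and asks the API on every lookup.
type DirectReader struct {
	api *FakeAPI
}

func (d DirectReader) Get(namespace, name string) (Secret, bool) {
	return d.api.Get(namespace, name)
}

func main() {
	api := NewFakeAPI(15000, 1024)
	cached := NewCachedReader(api)
	direct := DirectReader{api: api}

	s, _ := direct.Get("default", "secret-42")
	fmt.Println("bytes retained by the cached reader:", cached.HeldBytes())
	fmt.Println("bytes fetched by one direct lookup:", len(s.Data))
}
```

The direct path trades memory for latency and API server load, so it fits objects that are read rarely relative to how many of them exist.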

@idoru
Contributor

idoru commented Aug 30, 2022

What are the goals for this issue?

Is raising the requirements acceptable?

If the goal is reducing our memory footprint, what are the parameters for that, considering we currently know we have a cache that grows linearly and unbounded with the number and size of objects? Are we looking at reducing the overhead per object, or are we looking to constrain memory consumption to a ceiling? The former is easier, but without the latter, operators will potentially always have to manage resource limits for Cartographer, which has always felt sub-par to me.

Some activities that might help us understand where we could have impact reducing memory footprint:

  1. Audit code to discover redundancy in cache semantics (i.e., are we over-caching?)
  2. What complete objects in the cache can we reduce to hashes instead?
  3. Can we evict more aggressively? (It's been a long time, but as far as I remember, we don't do this because we never, ever want to redundantly re-stamp. Can that be solved another way? Is re-stamping really that evil?)
  4. Can we externalize the cache (and how)?
  5. Audit unpaginated list calls
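
Item 2 can be sketched as follows: instead of retaining a full object to detect whether it changed, retain only a fixed-size digest of its serialized form. This is a minimal example under stated assumptions: `DigestCache` is a hypothetical type, and hashing the `encoding/json` output of a map is just one possible canonical form (Go's `json.Marshal` sorts map keys, which keeps the digest stable for maps).

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// DigestCache is a hypothetical cache that remembers a 32-byte SHA-256 digest
// per object key instead of the object itself.
type DigestCache struct {
	digests map[string][sha256.Size]byte
}

func NewDigestCache() *DigestCache {
	return &DigestCache{digests: make(map[string][sha256.Size]byte)}
}

// Changed records the digest of obj under key and reports whether it differs
// from the digest recorded previously. It returns true the first time a key
// is seen.
func (c *DigestCache) Changed(key string, obj any) (bool, error) {
	raw, err := json.Marshal(obj)
	if err != nil {
		return false, err
	}
	sum := sha256.Sum256(raw)
	prev, seen := c.digests[key]
	c.digests[key] = sum
	return !seen || prev != sum, nil
}

func main() {
	cache := NewDigestCache()
	obj := map[string]any{"kind": "ConfigMap", "data": map[string]any{"a": "1"}}

	first, _ := cache.Changed("default/example", obj)
	second, _ := cache.Changed("default/example", obj)
	obj["data"] = map[string]any{"a": "2"}
	third, _ := cache.Changed("default/example", obj)

	fmt.Println(first, second, third)
}
```

The per-object cost becomes constant regardless of object size, but a digest only answers "did it change?"; any code path that needs the object's fields still has to fetch the object.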

@cirocosta cirocosta changed the title draft: oomkill in aks-based openshift 4.10 cluster oomkill in aks-based openshift 4.10 cluster Sep 28, 2022

3 participants