oomkill in aks-based openshift 4.10 cluster #984

cirocosta · 2022-08-19T19:49:33Z

Bug description:

In an Azure-based OpenShift 4.10 cluster we observe that Cartographer's default
memory limits are not enough in the presence of 15k+ Secrets and thousands of
ServiceAccounts (carto v0.4.3).

Steps to reproduce:

TODO (/cc todd)

Expected behavior:

Have Cartographer up and running despite the large number of other objects
that are unrelated to the functionality Cartographer needs.

Actual behavior:

Cartographer is OOMKilled.

Logs, k8s object dumps:

TODO

Versions:

TODO

Infrastructure (kind, TKG, GKE etc.)
cartographer version
k8s version
... other versions of related software being used

Deployment info:

TODO

Additional context:

scothis · 2022-08-22T14:22:31Z

If you can avoid it, it's best not to inform with a cache on Secrets and instead look them up as needed. ServiceAccounts are probably still ok, as each resource is very small.
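A minimal sketch of the trade-off described above, using only the Go standard library. Everything here (`fakeAPI`, `cachedReader`, `directReader`, `newFakeAPI`) is a hypothetical stand-in and not Cartographer's code or a real Kubernetes client. It only illustrates that a cache-backed reader keeps an entry for every Secret in the cluster, while an on-demand reader keeps none and pays one API call per lookup.

```go
package main

import (
	"errors"
	"fmt"
)

// Secret is a hypothetical, trimmed-down stand-in for a Kubernetes Secret.
type Secret struct {
	Namespace, Name string
	Data            []byte
}

func key(ns, name string) string { return ns + "/" + name }

// fakeAPI stands in for the API server; it is not a real client.
type fakeAPI struct {
	secrets map[string]Secret
	gets    int // number of single-object lookups served
}

// Get serves one Secret and counts the request.
func (a *fakeAPI) Get(ns, name string) (Secret, error) {
	a.gets++
	s, ok := a.secrets[key(ns, name)]
	if !ok {
		return Secret{}, errors.New("secret not found: " + key(ns, name))
	}
	return s, nil
}

// newFakeAPI creates n Secrets of 1 KiB each in namespace "ns".
func newFakeAPI(n int) *fakeAPI {
	api := &fakeAPI{secrets: make(map[string]Secret, n)}
	for i := 0; i < n; i++ {
		s := Secret{Namespace: "ns", Name: fmt.Sprintf("secret-%d", i), Data: make([]byte, 1024)}
		api.secrets[key(s.Namespace, s.Name)] = s
	}
	return api
}

// cachedReader mimics an informer-backed cache: it keeps an entry for every
// Secret up front, whether or not it is ever read.
type cachedReader struct{ store map[string]Secret }

func newCachedReader(api *fakeAPI) *cachedReader {
	c := &cachedReader{store: make(map[string]Secret, len(api.secrets))}
	for k, s := range api.secrets {
		c.store[k] = s
	}
	return c
}

// Held reports how many Secrets the reader keeps in memory.
func (c *cachedReader) Held() int { return len(c.store) }

// directReader keeps nothing in memory and asks the API on every lookup.
type directReader struct{ api *fakeAPI }

func (d directReader) Get(ns, name string) (Secret, error) { return d.api.Get(ns, name) }

// Held is always zero for the on-demand reader.
func (d directReader) Held() int { return 0 }

func main() {
	api := newFakeAPI(15000)
	cached := newCachedReader(api)
	direct := directReader{api: api}
	if _, err := direct.Get("ns", "secret-42"); err != nil {
		panic(err)
	}
	fmt.Printf("cached reader holds %d secrets; direct reader holds %d\n",
		cached.Held(), direct.Held())
}
```

In controller-runtime, the manager's `GetAPIReader()` returns a reader that bypasses the cache. Whether that fits Cartographer's reconcilers is not something this thread establishes.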

idoru · 2022-08-30T19:05:35Z

What are the goals for this issue?

Is raising the requirements acceptable?

If the goal is reducing our memory footprint, what are the parameters for that, considering we currently know we have a cache that grows linearly and unbounded with the number and size of objects? Are we looking at reducing the overhead per object, or are we looking to constrain memory consumption to a ceiling? The former is easier, but without the latter, operators will potentially always have to manage resource limits for Cartographer, which has always felt sub-par to me.
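To make the "ceiling" option concrete, here is a minimal sketch of a cache bounded by a byte budget with least-recently-used eviction, written with only the Go standard library. `boundedCache` and its methods are hypothetical names for illustration and do not describe Cartographer's existing cache. The budget counts payload bytes only and ignores map and list overhead.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key   string
	value []byte
}

// boundedCache is a hypothetical LRU cache capped by total payload bytes.
type boundedCache struct {
	maxBytes int
	curBytes int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

func newBoundedCache(maxBytes int) *boundedCache {
	return &boundedCache{
		maxBytes: maxBytes,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Put stores value under key, evicting least recently used entries until the
// total payload size is back under the budget.
func (c *boundedCache) Put(key string, value []byte) {
	if el, ok := c.items[key]; ok {
		// Replace an existing entry: drop the old one first.
		c.curBytes -= len(el.Value.(*entry).value)
		c.order.Remove(el)
		delete(c.items, key)
	}
	if len(value) > c.maxBytes {
		return // larger than the whole budget; not cached at all
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	c.curBytes += len(value)
	for c.curBytes > c.maxBytes {
		oldest := c.order.Back()
		e := oldest.Value.(*entry)
		c.order.Remove(oldest)
		delete(c.items, e.key)
		c.curBytes -= len(e.value)
	}
}

// Get returns the cached value and marks it as recently used.
func (c *boundedCache) Get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Bytes reports the payload bytes currently held.
func (c *boundedCache) Bytes() int { return c.curBytes }

func main() {
	cache := newBoundedCache(3 * 1024) // 3 KiB budget
	for i := 0; i < 5; i++ {
		cache.Put(fmt.Sprintf("obj-%d", i), make([]byte, 1024))
	}
	_, stillThere := cache.Get("obj-0")
	fmt.Println("bytes held:", cache.Bytes(), "obj-0 cached:", stillThere)
}
```

The cost of a ceiling is that an evicted entry has to be fetched or recomputed on the next access, which is where the re-stamping concern raised below comes in.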

Some activities that might help us understand where we could have impact reducing memory footprint:

- Audit the code to discover redundancy in cache semantics (i.e. are we over-caching?)
- Which complete objects in the cache can we reduce to hashes instead?
- Can we evict more aggressively? (It's been a long time, but as far as I remember, we don't do this because we never, ever want to redundantly re-stamp. Can that be solved another way? Is re-stamping really that evil?)
- Can we externalize the cache (and how)?
- Audit unpaginated list calls
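Two of the ideas above, storing hashes instead of complete objects and listing in pages, can be sketched together with only the Go standard library. `Object`, `listPage`, `digest`, and `buildDigestIndex` are hypothetical names, and `listPage` only loosely imitates the limit/continue pattern of Kubernetes list requests. It is not a real client call.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// Object is a hypothetical stand-in for a stamped Kubernetes object.
type Object struct {
	Name string            `json:"name"`
	Spec map[string]string `json:"spec"`
}

// listPage is a hypothetical paginated list call. It returns at most limit
// objects starting at index start, plus the next start index, or -1 when
// there are no more pages.
func listPage(all []Object, start, limit int) ([]Object, int) {
	end := start + limit
	if end >= len(all) {
		return all[start:], -1
	}
	return all[start:end], end
}

// digest reduces an object to a fixed 32-byte SHA-256 of its JSON encoding.
// encoding/json writes map keys in sorted order, so equal objects hash equally.
func digest(o Object) ([sha256.Size]byte, error) {
	raw, err := json.Marshal(o)
	if err != nil {
		return [sha256.Size]byte{}, err
	}
	return sha256.Sum256(raw), nil
}

// buildDigestIndex walks the objects one page at a time and keeps only a
// name -> digest mapping, so at most one page of full objects is in use.
func buildDigestIndex(all []Object, pageSize int) (map[string][sha256.Size]byte, error) {
	if pageSize < 1 {
		pageSize = 1 // guard against a page size that would never advance
	}
	index := make(map[string][sha256.Size]byte, len(all))
	for next := 0; next != -1; {
		var page []Object
		page, next = listPage(all, next, pageSize)
		for _, o := range page {
			d, err := digest(o)
			if err != nil {
				return nil, err
			}
			index[o.Name] = d
		}
	}
	return index, nil
}

func main() {
	objs := []Object{
		{Name: "a", Spec: map[string]string{"image": "v1"}},
		{Name: "b", Spec: map[string]string{"image": "v1"}},
	}
	index, err := buildDigestIndex(objs, 1)
	if err != nil {
		panic(err)
	}
	changed := Object{Name: "a", Spec: map[string]string{"image": "v2"}}
	d, err := digest(changed)
	if err != nil {
		panic(err)
	}
	fmt.Println("a changed:", d != index["a"])
}
```

A digest can tell whether an object changed but cannot reproduce its fields, so any code path that needs the actual values would still have to look the object up.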

cirocosta changed the title ~~oomkill in aks-based openshift 4.10 cluster~~ draft: oomkill in aks-based openshift 4.10 cluster Aug 19, 2022

cirocosta changed the title ~~draft: oomkill in aks-based openshift 4.10 cluster~~ oomkill in aks-based openshift 4.10 cluster Sep 28, 2022
