Helm Upgrade Failing #3250

Closed
ghost opened this issue Feb 2, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@ghost

ghost commented Feb 2, 2024

What steps did you take and what happened:
Trying to bump our gatekeeper version from v3.12.0 to v3.13.3 using helm.
$ helm upgrade --install gatekeeper gatekeeper/gatekeeper --namespace gatekeeper-system --version 3.13.3 -f values.yaml --wait --timeout=180s
The direct upgrade just times out; there was no change to the values file.
In debug mode everything runs as expected until:
ready.go:281: [debug] Deployment is not ready: gatekeeper-system/gatekeeper-audit. 0 out of 1 expected pods are ready
I can see the audit and controller-manager pods trying to launch, but they never become ready:

$ kubectl get pods -n gatekeeper-system 
NAME                                             READY   STATUS    RESTARTS   AGE
gatekeeper-audit-6955859697-rb9k9                0/1     Running   0          8m58s
gatekeeper-audit-859cb9ddbd-wvtn8                1/1     Running   0          16m
gatekeeper-controller-manager-7cd8bbf7fc-76qpb   0/1     Running   0          8m57s
gatekeeper-controller-manager-7d65d89866-7q7lj   1/1     Running   0          16m
gatekeeper-controller-manager-7d65d89866-nx2xp   1/1     Running   0          16m

Running kubectl describe on these pods shows that the readiness probe is failing, which I think suggests the application is failing to start:
Readiness probe failed: HTTP probe failed with statuscode: 500

I managed to successfully bump the version if I uninstall all constraints/templates/mutations, uninstall gatekeeper, then reinstall the lot again from scratch.
What didn't work is if I leave gatekeeper itself installed, then delete constraints/templates/mutations, then try bumping the version.

So this has led me to the conclusion that gatekeeper needs to be entirely 'clean' before upgrading, which doesn't feel right.
In the past we have successfully bumped the version with no issues.

What did you expect to happen:
The new version to be installed, and gatekeeper to continue working as normal.

Anything else you would like to add:
Relevant values for audit and controller manager:

controllerManager:
  tolerations:
    - key: system-no-schedule
      operator: Equal
      value: "true"
      effect: NoSchedule
  nodeSelector:
    nodegroup: core
    kubernetes.io/os: linux

audit:
  tolerations:
    - key: system-no-schedule
      operator: Equal
      value: "true"
      effect: NoSchedule
  nodeSelector:
    nodegroup: core
    kubernetes.io/os: linux

Environment:

  • Gatekeeper version: v3.12.0 > v3.13.3
  • Kubernetes version: (use kubectl version): v1.26.12
  • Helm version: 3.13.3
@ghost ghost added the bug Something isn't working label Feb 2, 2024
@cbugneac-nex

Just to complement this, here is what I get when checking the readiness endpoint of the gatekeeper-audit pod:

curl -v http://100-***-26-129.gatekeeper-system.pod.cluster.local:9090/readyz
* Host 100-***-26-129.gatekeeper-system.pod.cluster.local:9090 was resolved.
* IPv6: (none)
* IPv4: 100.***.26.129
*   Trying 100.***.26.129:9090...
* Connected to 100-***-26-129.gatekeeper-system.pod.cluster.local (100.***.26.129) port 9090
> GET /readyz HTTP/1.1
> Host: 100-***-26-129.gatekeeper-system.pod.cluster.local:9090
> User-Agent: curl/8.5.0
> Accept: */*
> 
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Fri, 02 Feb 2024 13:19:38 GMT
< Content-Length: 56
< 
[-]tracker failed: reason withheld
healthz check failed

Not very informative, which tracks back to issue #696.

@ghost
Author

ghost commented Feb 2, 2024

Have confirmed that the version upgrade will work only if the constraints and templates are never deployed while gatekeeper is installed.
Version bump works if I freshly install gatekeeper v3.12.0, upgrade to v3.13.3 (no installation of templates/constraints).
Version bump will not work if I install gatekeeper v3.12.0, install constraints/templates, uninstall constraints/templates, upgrade to v3.13.3.
Having a look at the logs, I can see the following when it's failing:

{"level":"error","ts":1706891806.7624226,"msg":"Reconciler error","controller":"constrainttemplate-controller","object":{"name":"k8scontainerlimits"},"namespace":"","name":"k8scontainerlimits","reconcileID":"56bd9d16-5941-49c0-a36d-50c7b1cf6429","error":"Operation cannot be fulfilled on customresourcedefinitions.apiextensions.k8s.io \"k8scontainerlimits.constraints.gatekeeper.sh\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/open-policy-agent/gatekeeper/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226"}
{"level":"info","ts":1706891806.7961633,"logger":"controller","msg":"handling constraint template status update","process":"constraint_template_status_controller","instance":{"apiVersion":"templates.gatekeeper.sh/v1beta1","kind":"ConstraintTemplate","name":"k8scontainerlimits"}}

@ghost
Author

ghost commented Feb 5, 2024

Another update: I have found that the helm upgrade works fine if the following is run first. Seems a bit hacky though:
kubectl delete crd -l gatekeeper.sh/system=yes

@ghost
Author

ghost commented Feb 16, 2024

Commenting to keep this alive, any ideas?

@maxsmythe
Contributor

If you create a config resource with spec.readiness.statsEnabled = true, what do the logs for the failing audit/webhook pods say?

example config:

apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: "gatekeeper-system"
spec:
  readiness:
    statsEnabled: true
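
One caveat when trying this: Gatekeeper only reconciles the Config named config, so if one already exists in the cluster, the readiness flag would probably need to be added to it rather than applied as a separate resource. A minimal sketch, assuming an existing Config that already has a sync section (the Namespace entry is a placeholder, not something taken from this thread):

```yaml
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config                  # Gatekeeper ignores Config resources with any other name
  namespace: gatekeeper-system
spec:
  sync:
    syncOnly:
      # Placeholder entry: substitute whatever the existing Config already syncs.
      - group: ""
        version: "v1"
        kind: "Namespace"
  readiness:
    statsEnabled: true          # the debugging flag from the example above
```

Keeping the existing sync entries in place means the readiness stats should reflect the same configuration that is failing, rather than a stripped-down one.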

@maxsmythe
Contributor

3.14.0 also contained a readiness fix; maybe upgrading to that version would remediate the issue?

https://github.com/open-policy-agent/gatekeeper/releases/tag/v3.14.0

@ghost
Author

ghost commented Feb 21, 2024

Hi @maxsmythe
Thanks for this info; it helped resolve the problem. We had some additional sync rules in the Config that were causing the issue. The extensions/v1beta1 and networking.k8s.io/v1beta1 Ingress entries were the root cause; removing them allowed us to upgrade as expected. This is the Config as it was before the fix:

apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  sync:
    syncOnly:
      - group: extensions
        version: v1beta1
        kind: Ingress
      - group: networking.k8s.io
        version: v1beta1
        kind: Ingress
      - group: networking.k8s.io
        version: v1
        kind: Ingress
      {{- end }}
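
For reference, a sketch of what the Config might look like with the two v1beta1 Ingress entries removed, as described above. Kubernetes stopped serving extensions/v1beta1 and networking.k8s.io/v1beta1 Ingress in v1.22, so on the v1.26 cluster from this issue only the networking.k8s.io/v1 entry refers to an API that still exists. The trailing {{- end }} is dropped here on the assumption that it is a leftover from a Helm template whose opening block was not pasted.

```yaml
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  sync:
    syncOnly:
      # Only the Ingress API version still served on Kubernetes v1.22+ is kept.
      - group: networking.k8s.io
        version: v1
        kind: Ingress
```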

@ghost ghost closed this as completed Feb 21, 2024
@maxsmythe
Contributor

Glad I could help!

This issue was closed.