
Calico Helm Chart upgrade fails after upgrade from rke2 v1.28.8+rke2r1 to v1.28.12+rke2r1 / v1.29.6+rke2r1 #6633

Open
shindebshekhar opened this issue Aug 26, 2024 · 4 comments
@shindebshekhar

shindebshekhar commented Aug 26, 2024

Environmental Info:
RKE2 Version: v1.28.8+rke2r1

:~ # rke2 -v
rke2 version v1.28.8+rke2r1 (42cab2f)
go version go1.21.8 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Linux hostname 5.3.18-150300.59.161-default #1 SMP Thu May 9 06:59:05 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

3 master nodes, 3 worker nodes

Describe the bug:

We are trying to upgrade RKE2 from v1.28.8+rke2r1 (fresh install) to v1.28.12+rke2r1 / v1.29.6+rke2r1.

After the upgrade the rke2 service comes up, but all the Helm jobs for the Calico system components fail. The Helm jobs are re-triggered in a continuous loop (possibly trying to upgrade those components).

For some reason, instead of upgrading the Calico chart, it tries to uninstall the Tigera operator CRDs and Calico CRDs. In the process it hangs because resources are still present. Please see the log output below for the Calico CRD job.

kubectl get crds | grep -i calico --> No result

kubectl logs job/helm-install-rke2-calico-crd -n kube-system -f

if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
        echo "KUBERNETES_SERVICE_HOST is using IPv6"
        CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
        CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/rke2-calico-crd.tgz.base64
+ CHART_PATH=/tmp/rke2-calico-crd.tgz
+ [[ ! -f /chart/rke2-calico-crd.tgz.base64 ]]
+ base64 -d /chart/rke2-calico-crd.tgz.base64
+ CHART=/tmp/rke2-calico-crd.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/rke2-calico-crd.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=192.168.128.0/17 --set-string global.clusterCIDRv4=192.168.128.0/17 --set-string global.clusterDNS=192.168.64.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=192.168.64.0/18
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls --all -f '^rke2-calico-crd$' --namespace kube-system --output json
++ jq -r '"\(.[0].chart),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
+ LINE=rke2-calico-crd-v3.27.002,uninstalling
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ [[ install = \d\e\l\e\t\e ]]
+ [[ rke2-calico-crd-v3.27.002 =~ ^(|null)$ ]]
+ [[ uninstalling =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ uninstalling == \d\e\p\l\o\y\e\d ]]
+ [[ uninstalling =~ ^(deleted|failed|null|unknown)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --set-string global.clusterCIDR=192.168.128.0/17 --set-string global.clusterCIDRv4=192.168.128.0/17 --set-string global.clusterDNS=192.168.64.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=192.168.64.0/18 rke2-calico-crd /tmp/rke2-calico-crd.tgz
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
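
The stuck "uninstalling" status is also visible outside the job. A minimal sketch of the check, assuming helm v3 is available on a server node and using RKE2's kubeconfig (namespace and release name taken from the job log above):

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

# Release status as reported by helm (matches the job's "helm_v3 ls" output):
helm ls --all -n kube-system -f '^rke2-calico-crd$'

# Helm v3 records the same status as a label on its release secrets:
kubectl get secrets -n kube-system -l owner=helm,name=rke2-calico-crd --show-labels
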
@brandond
Member

It looks like for some reason the Helm job was interrupted while upgrading the chart. The helm controller responded by trying to uninstall and reinstall the chart, but the uninstall job was also interrupted - so now the chart is stuck in the "uninstalling" status.

You might try deleting the Helm secrets for the rke2-calico-crd release, and rke2-calico as well if necessary. This should allow it to successfully reinstall the chart.

What process did you use to upgrade your cluster? We do not generally see issues with the Helm jobs being interrupted while upgrading, unless the upgrade is interrupted partway through, leaving nodes deploying conflicting component versions.
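
A minimal sketch of that cleanup, assuming the releases live in kube-system as the job log shows (the label selector matches the sh.helm.release.v1.* secrets that Helm v3 uses for release storage - review the matches before deleting anything):

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

# Inspect the release secrets for the stuck release first:
kubectl get secrets -n kube-system -l owner=helm,name=rke2-calico-crd

# Then remove them so the helm controller can reinstall cleanly:
kubectl delete secret -n kube-system -l owner=helm,name=rke2-calico-crd

# Repeat for rke2-calico if that release is also wedged:
kubectl get secrets -n kube-system -l owner=helm,name=rke2-calico
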

@rjchicago

Was there any recovery from this? We ran into this issue yesterday and had to restore controller VM and etcd from snapshots.

The symptoms and logs match exactly what was posted above. We initially attempted to install the CRDs and recreate the required resources, but the Calico controller continued to crashloop.

Ultimately, the restore from snapshots worked, but we actually had to do it twice: after adding additional controllers, the helm upgrade was re-triggered and we had to restart the process. We're currently running with just the one controller - not an ideal state.
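
For reference, restoring from an etcd snapshot roughly follows the documented RKE2 flow (sketch only; the snapshot path is an example and must be adjusted to your environment):

# On the server node being restored:
systemctl stop rke2-server
rke2 server \
    --cluster-reset \
    --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>
systemctl start rke2-server
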

@wzrdtales

Probably this: projectcalico/calico#9068, which was fixed upstream, but you will likely need to wait quite some time for the fix to become available in RKE2 and Rancher.

@brandond is there any possibility in rke2 to override the calico version being deployed?

@brandond
Member

Calico 3.28.2 should go into next month's releases: rancher/rke2-charts#524

The issue is in the chart itself, so no, you can't just bump the version of Calico that the chart deploys. You'll need to wait for us to update the chart in RKE2.
