Current csi deployments mandate usage of three nodes or else rollout gets stuck #140

Open
leelavg opened this issue Sep 18, 2024 · 6 comments · May be fixed by #156

Comments

@leelavg
Contributor

leelavg commented Sep 18, 2024

// ============================ we only have two nodes
> k get no
NAME                          STATUS   ROLES    AGE   VERSION
hcp416-bm1-b-2624094b-dn4gn   Ready    worker   22h   v1.29.7+d77deb8
hcp416-bm1-b-2624094b-pvd2m   Ready    worker   22h   v1.29.7+d77deb8

// ============================ controller plugin occupies those two nodes because of replica 2
> k get po -owide | grep -P 'NAME|ceph.com'
NAME                                                             READY  STATUS   RESTARTS  AGE   IP            NODE                       
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-747cffd6479rgt  7/7    Running  0         4h6m  10.132.1.14   hcp416-bm1-b-2624094b-dn4gn
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-747cffd64b5k9p  7/7    Running  0         4h6m  10.133.1.138  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.cephfs.csi.ceph.com-nodeplugin-d5wht           3/3    Running  0         4h6m  10.128.3.248  hcp416-bm1-b-2624094b-dn4gn
openshift-storage.cephfs.csi.ceph.com-nodeplugin-qkg7h           3/3    Running  0         4h6m  10.131.1.157  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.rbd.csi.ceph.com-ctrlplugin-6444559dc-ktwrw    7/7    Running  0         4h6m  10.133.1.139  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.rbd.csi.ceph.com-ctrlplugin-6444559dc-sp2ms    7/7    Running  0         4h6m  10.132.1.13   hcp416-bm1-b-2624094b-dn4gn
openshift-storage.rbd.csi.ceph.com-nodeplugin-xfmst              4/4    Running  0         4h6m  10.131.1.157  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.rbd.csi.ceph.com-nodeplugin-xvgkp              4/4    Running  0         4h6m  10.128.3.248  hcp416-bm1-b-2624094b-dn4gn

// ============================ refer to https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment
// ============================ we use the default rollingUpdate values and added a hard (i.e. requiredDuringScheduling... rather than preferredDuringScheduling...) anti-affinity rule
> k get deploy openshift-storage.cephfs.csi.ceph.com-ctrlplugin -oyaml | yq -r '.spec.strategy,.spec.template.spec.affinity'
rollingUpdate:
  maxSurge: 25%
  maxUnavailable: 25%
type: RollingUpdate
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: openshift-storage.cephfs.csi.ceph.com-ctrlplugin
      topologyKey: kubernetes.io/hostname
    
// ============================ refer to https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#performing-a-rolling-update
// ============================ we use the default rollingUpdate values
> k get ds openshift-storage.cephfs.csi.ceph.com-nodeplugin -oyaml | yq -r .spec.updateStrategy
rollingUpdate:
  maxSurge: 0
  maxUnavailable: 1
type: RollingUpdate

// ============================ I updated a value (a container arg) in the OperatorConfig so that csi-op rolls that change out
> k get po -owide | grep -P 'NAME|ceph.com'
NAME                                                             READY  STATUS   RESTARTS  AGE    IP            NODE                       
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-747cffd6479rgt  7/7    Running  0         4h15m  10.132.1.14   hcp416-bm1-b-2624094b-dn4gn
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-747cffd64b5k9p  7/7    Running  0         4h15m  10.133.1.138  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-8d98c9787xnm4d  0/7    Pending  0         83s    <none>        <none>                                  
openshift-storage.cephfs.csi.ceph.com-nodeplugin-2gtvl           3/3    Running  0         53s    10.131.1.157  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.cephfs.csi.ceph.com-nodeplugin-gs56r           3/3    Running  0         21s    10.128.3.248  hcp416-bm1-b-2624094b-dn4gn
openshift-storage.rbd.csi.ceph.com-ctrlplugin-6444559dc-ktwrw    7/7    Running  0         4h15m  10.133.1.139  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.rbd.csi.ceph.com-ctrlplugin-6444559dc-sp2ms    7/7    Running  0         4h15m  10.132.1.13   hcp416-bm1-b-2624094b-dn4gn
openshift-storage.rbd.csi.ceph.com-ctrlplugin-7949d485f9-bnskh   0/7    Pending  0         83s    <none>        <none>                                  
openshift-storage.rbd.csi.ceph.com-nodeplugin-8jln8              4/4    Running  0         81s    10.128.3.248  hcp416-bm1-b-2624094b-dn4gn
openshift-storage.rbd.csi.ceph.com-nodeplugin-scfxd              4/4    Running  0         82s    10.131.1.157  hcp416-bm1-b-2624094b-pvd2m

// ============================ due to the existing anti-affinity rule the new pod cannot be scheduled; we need the rollout strategy we already use for the daemonset
> k describe po openshift-storage.cephfs.csi.ceph.com-ctrlplugin-8d98c9787xnm4d
Name:                 openshift-storage.cephfs.csi.ceph.com-ctrlplugin-8d98c9787xnm4d
Namespace:            openshift-storage-client
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      ceph-csi-cephfs-ctrlplugin-sa
Node:                 <none>
Labels:               app=openshift-storage.cephfs.csi.ceph.com-ctrlplugin
                      pod-template-hash=8d98c9787
Annotations:          openshift.io/scc: ceph-csi-op-scc
Status:               Pending
[...]
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  4m27s                  default-scheduler  0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules. preemption: 0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules.
  Warning  FailedScheduling  4m22s (x2 over 4m25s)  default-scheduler  0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules. preemption: 0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules.

IOW,

  1. we want the controller plugin pods to occupy distinct nodes, and when there are only two nodes each node hosts one replica
  2. when a deployment rollout happens, this configuration is unsatisfiable, i.e., it asks for 3 pods (maxSurge 25% rounded up to 1 extra pod) on 3 distinct nodes

Possible solution,

  1. do for the deployment what we already do for the daemonset, i.e., maxSurge: 0 and maxUnavailable: 1; this makes the kube controller kill one of the running deployment pods, which is fine since the other pod takes over after lease expiry, and the pod from the new replicaset can then be scheduled on the freed node (a minimal sketch follows below).
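A minimal sketch of what csi-op could set on the controller-plugin Deployment (Go names and imports are illustrative, not the operator's actual code):

package main

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/utils/ptr"
)

// Never surge above the replica count; take one replica down instead, so the
// replacement pod can be scheduled on the freed node despite the hard
// anti-affinity rule.
var proposedCtrlPluginStrategy = appsv1.DeploymentStrategy{
	Type: appsv1.RollingUpdateDeploymentStrategyType,
	RollingUpdate: &appsv1.RollingUpdateDeployment{
		MaxSurge:       ptr.To(intstr.FromInt32(0)),
		MaxUnavailable: ptr.To(intstr.FromInt32(1)),
	},
}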
@Madhu-1
Collaborator

Madhu-1 commented Sep 18, 2024

We already have this flexibility with the cephcsi Helm charts; the same can be configured here: https://github.com/ceph/ceph-csi/blob/28dc64dcae3cec8d11d84bdf525bda0ef757c688/charts/ceph-csi-cephfs/values.yaml#L145-L151. This feature is also required for compatibility with the cephcsi driver.

@leelavg
Contributor Author

leelavg commented Sep 18, 2024

ack, in that case we've probably already included it in the APIs; let me check.

@leelavg
Contributor Author

leelavg commented Sep 18, 2024

I see we have the option for the nodeplugin but not for the controllerplugin:

// Driver's plugin daemonset update strategy, supported values are OnDelete and RollingUpdate.
// Default value is RollingUpdate with MaxUnavailable set to 1
//+kubebuilder:validation:Optional
UpdateStrategy *appsv1.DaemonSetUpdateStrategy `json:"updateStrategy,omitempty"`
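For comparison, a hypothetical counterpart for the controllerplugin Deployment could look like the following (this field does not exist yet; name and defaults are illustrative):

// Hypothetical: driver's controller-plugin deployment update strategy,
// supported values are Recreate and RollingUpdate.
// Default value is RollingUpdate with maxSurge 25% and maxUnavailable 25%.
//+kubebuilder:validation:Optional
UpdateStrategy *appsv1.DeploymentStrategy `json:"updateStrategy,omitempty"`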

@Madhu-1
Collaborator

Madhu-1 commented Sep 18, 2024

Yes, correct. We don't have it for the controllerplugin and need to provide the same flexibility there; the cephcsi pointer is just to show that we already support it in cephcsi.

@nb-ohad
Collaborator

nb-ohad commented Sep 19, 2024

I am not sure this needs to be configurable separately for the node plugin and the controller plugin; a single knob should have done the trick. But the current API design already forces that separation.
So the only "fix" we can offer here is to move the UpdateStrategy field to PodCommonSpec and implement it for controller plugin reconciliation.

@Madhu-1
Collaborator

Madhu-1 commented Sep 20, 2024

I am not sure this needs to be configurable separately for the node plugin and the controller plugin; a single knob should have done the trick. But the current API design already forces that separation. So the only "fix" we can offer here is to move the UpdateStrategy field to PodCommonSpec and implement it for controller plugin reconciliation.

The Deployment update strategy and the DaemonSet update strategy are not the same type, so they need to be separate. What is the reason to put it under PodCommonSpec, and what is it going to look like?
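For reference, the two upstream types are distinct (abridged from k8s.io/api/apps/v1, json tags omitted), which is why a single shared field would be awkward:

// Deployment: type is Recreate or RollingUpdate.
type DeploymentStrategy struct {
	Type          DeploymentStrategyType
	RollingUpdate *RollingUpdateDeployment // maxUnavailable, maxSurge
}

// DaemonSet: type is RollingUpdate or OnDelete.
type DaemonSetUpdateStrategy struct {
	Type          DaemonSetUpdateStrategyType
	RollingUpdate *RollingUpdateDaemonSet // maxUnavailable, maxSurge
}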

@leelavg leelavg linked a pull request Oct 15, 2024 that will close this issue