Current csi deployments mandate usage of three nodes or else rollout gets stuck #140

Open
leelavg opened this issue Sep 18, 2024 · 6 comments · May be fixed by #156

Comments

@leelavg
Contributor

leelavg commented Sep 18, 2024

// ============================ we only have two nodes
> k get no
NAME                          STATUS   ROLES    AGE   VERSION
hcp416-bm1-b-2624094b-dn4gn   Ready    worker   22h   v1.29.7+d77deb8
hcp416-bm1-b-2624094b-pvd2m   Ready    worker   22h   v1.29.7+d77deb8

// ============================ controller plugin occupies those two nodes because of replica 2
> k get po -owide | grep -P 'NAME|ceph.com'
NAME                                                             READY  STATUS   RESTARTS  AGE   IP            NODE                       
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-747cffd6479rgt  7/7    Running  0         4h6m  10.132.1.14   hcp416-bm1-b-2624094b-dn4gn
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-747cffd64b5k9p  7/7    Running  0         4h6m  10.133.1.138  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.cephfs.csi.ceph.com-nodeplugin-d5wht           3/3    Running  0         4h6m  10.128.3.248  hcp416-bm1-b-2624094b-dn4gn
openshift-storage.cephfs.csi.ceph.com-nodeplugin-qkg7h           3/3    Running  0         4h6m  10.131.1.157  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.rbd.csi.ceph.com-ctrlplugin-6444559dc-ktwrw    7/7    Running  0         4h6m  10.133.1.139  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.rbd.csi.ceph.com-ctrlplugin-6444559dc-sp2ms    7/7    Running  0         4h6m  10.132.1.13   hcp416-bm1-b-2624094b-dn4gn
openshift-storage.rbd.csi.ceph.com-nodeplugin-xfmst              4/4    Running  0         4h6m  10.131.1.157  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.rbd.csi.ceph.com-nodeplugin-xvgkp              4/4    Running  0         4h6m  10.128.3.248  hcp416-bm1-b-2624094b-dn4gn

// ============================ refer to https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment
// ============================ we use the default rollingUpdate values and added a hard (i.e. requiredDuringScheduling... rather than preferredDuringScheduling...) anti-affinity rule
> k get deploy openshift-storage.cephfs.csi.ceph.com-ctrlplugin -oyaml | yq -r '.spec.strategy,.spec.template.spec.affinity'
rollingUpdate:
  maxSurge: 25%
  maxUnavailable: 25%
type: RollingUpdate
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: openshift-storage.cephfs.csi.ceph.com-ctrlplugin
      topologyKey: kubernetes.io/hostname
    
// ============================ refer to https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#performing-a-rolling-update
// ============================ we use the default rollingUpdate values
> k get ds openshift-storage.cephfs.csi.ceph.com-nodeplugin -oyaml | yq -r .spec.updateStrategy
rollingUpdate:
  maxSurge: 0
  maxUnavailable: 1
type: RollingUpdate

// ============================ I updated a value (a container arg) in the OperatorConfig so that csi-op rolls that change out
> k get po -owide | grep -P 'NAME|ceph.com'
NAME                                                             READY  STATUS   RESTARTS  AGE    IP            NODE                       
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-747cffd6479rgt  7/7    Running  0         4h15m  10.132.1.14   hcp416-bm1-b-2624094b-dn4gn
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-747cffd64b5k9p  7/7    Running  0         4h15m  10.133.1.138  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-8d98c9787xnm4d  0/7    Pending  0         83s    <none>        <none>                                  
openshift-storage.cephfs.csi.ceph.com-nodeplugin-2gtvl           3/3    Running  0         53s    10.131.1.157  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.cephfs.csi.ceph.com-nodeplugin-gs56r           3/3    Running  0         21s    10.128.3.248  hcp416-bm1-b-2624094b-dn4gn
openshift-storage.rbd.csi.ceph.com-ctrlplugin-6444559dc-ktwrw    7/7    Running  0         4h15m  10.133.1.139  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.rbd.csi.ceph.com-ctrlplugin-6444559dc-sp2ms    7/7    Running  0         4h15m  10.132.1.13   hcp416-bm1-b-2624094b-dn4gn
openshift-storage.rbd.csi.ceph.com-ctrlplugin-7949d485f9-bnskh   0/7    Pending  0         83s    <none>        <none>                                  
openshift-storage.rbd.csi.ceph.com-nodeplugin-8jln8              4/4    Running  0         81s    10.128.3.248  hcp416-bm1-b-2624094b-dn4gn
openshift-storage.rbd.csi.ceph.com-nodeplugin-scfxd              4/4    Running  0         82s    10.131.1.157  hcp416-bm1-b-2624094b-pvd2m

// ============================ due to the existing anti-affinity rule the new pod cannot be scheduled; we need the rollout strategy we already use for the daemonset
> k describe po openshift-storage.cephfs.csi.ceph.com-ctrlplugin-8d98c9787xnm4d
Name:                 openshift-storage.cephfs.csi.ceph.com-ctrlplugin-8d98c9787xnm4d
Namespace:            openshift-storage-client
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      ceph-csi-cephfs-ctrlplugin-sa
Node:                 <none>
Labels:               app=openshift-storage.cephfs.csi.ceph.com-ctrlplugin
                      pod-template-hash=8d98c9787
Annotations:          openshift.io/scc: ceph-csi-op-scc
Status:               Pending
[...]
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  4m27s                  default-scheduler  0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules. preemption: 0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules.
  Warning  FailedScheduling  4m22s (x2 over 4m25s)  default-scheduler  0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules. preemption: 0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules.

IOW,

  1. we want the controller plugin pods to occupy distinct nodes, and when there are only two nodes each node hosts one replica
  2. when a deployment rollout happens, this configuration is unsatisfiable, i.e., it asks for 3 pods (maxSurge 25% rounded up to 1 extra pod) on 3 distinct nodes

Possible solution,

  1. do for the deployment what we already do for the daemonset, i.e., maxSurge: 0 and maxUnavailable: 1; this makes the kube controller kill one of the running deployment pods, which is fine since the other pod takes over after lease expiry, and the pod from the new replicaset can then be scheduled on the freed node (a minimal sketch follows below).
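A minimal sketch of what csi-op could set on the controller-plugin Deployment (Go names and imports are illustrative, not the operator's actual code):

package main

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/utils/ptr"
)

// Never surge above the replica count; take one replica down instead, so the
// replacement pod can be scheduled on the freed node despite the hard
// anti-affinity rule.
var proposedCtrlPluginStrategy = appsv1.DeploymentStrategy{
	Type: appsv1.RollingUpdateDeploymentStrategyType,
	RollingUpdate: &appsv1.RollingUpdateDeployment{
		MaxSurge:       ptr.To(intstr.FromInt32(0)),
		MaxUnavailable: ptr.To(intstr.FromInt32(1)),
	},
}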
@Madhu-1
Collaborator

Madhu-1 commented Sep 18, 2024

We already have this flexibility with the cephcsi Helm charts; the same can be configured here: https://github.com/ceph/ceph-csi/blob/28dc64dcae3cec8d11d84bdf525bda0ef757c688/charts/ceph-csi-cephfs/values.yaml#L145-L151. This feature is also required for compatibility with the cephcsi driver.

@leelavg
Contributor Author

leelavg commented Sep 18, 2024

ack, in that case we've probably already included it in the APIs; let me check.

@leelavg
Contributor Author

leelavg commented Sep 18, 2024

I see we have the option for the nodeplugin but not for the controllerplugin:

// Driver's plugin daemonset update strategy, supported values are OnDelete and RollingUpdate.
// Default value is RollingUpdate with MaxUnavailable set to 1
//+kubebuilder:validation:Optional
UpdateStrategy *appsv1.DaemonSetUpdateStrategy `json:"updateStrategy,omitempty"`
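For comparison, a hypothetical counterpart for the controllerplugin Deployment could look like the following (this field does not exist yet; name and defaults are illustrative):

// Hypothetical: driver's controller-plugin deployment update strategy,
// supported values are Recreate and RollingUpdate.
// Default value is RollingUpdate with maxSurge 25% and maxUnavailable 25%.
//+kubebuilder:validation:Optional
UpdateStrategy *appsv1.DeploymentStrategy `json:"updateStrategy,omitempty"`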

@Madhu-1
Collaborator

Madhu-1 commented Sep 18, 2024

Yes, correct. We don't have it for the controllerplugin and need to provide the same flexibility there; the cephcsi pointer is just to show that we already support it in cephcsi.

@nb-ohad
Collaborator

nb-ohad commented Sep 19, 2024

I am not sure this needs to be configurable separately for the node plugin and the controller plugin; a single knob should have done the trick. But the current API design already forces that separation.
So the only "fix" we can offer here is to move the UpdateStrategy field to PodCommonSpec and implement it for controller plugin reconciliation.

@Madhu-1
Collaborator

Madhu-1 commented Sep 20, 2024

I am not sure this needs to be configurable separately for the node plugin and the controller plugin; a single knob should have done the trick. But the current API design already forces that separation. So the only "fix" we can offer here is to move the UpdateStrategy field to PodCommonSpec and implement it for controller plugin reconciliation.

The Deployment update strategy and the DaemonSet update strategy are not the same type, so they need to be separate. What is the reason to put it under PodCommonSpec, and what is it going to look like?
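For reference, the two upstream types are distinct (abridged from k8s.io/api/apps/v1, json tags omitted), which is why a single shared field would be awkward:

// Deployment: type is Recreate or RollingUpdate.
type DeploymentStrategy struct {
	Type          DeploymentStrategyType
	RollingUpdate *RollingUpdateDeployment // maxUnavailable, maxSurge
}

// DaemonSet: type is RollingUpdate or OnDelete.
type DaemonSetUpdateStrategy struct {
	Type          DaemonSetUpdateStrategyType
	RollingUpdate *RollingUpdateDaemonSet // maxUnavailable, maxSurge
}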

@leelavg leelavg linked a pull request Oct 15, 2024 that will close this issue