
[cinder-csi-plugin] node plugin pods restart with health check failing #1674

Closed
bhachn opened this issue Oct 22, 2021 · 8 comments

bhachn commented Oct 22, 2021

BUG Report
cinder-csi-plugin

What happened:
The openstack-cinder-csi node plugin pod is getting restarted, with the health check failing with the error below:

health check failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The liveness container of the pod is being restarted.
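To confirm which container is triggering the restarts and to capture its logs, something like the following can be used (the namespace, label selector, and container name here are assumptions based on typical cinder-csi node-plugin deployments; adjust to the actual manifest):

```shell
# List the cinder-csi node-plugin pods and their restart counts
kubectl -n kube-system get pods -l app=openstack-cinder-csi-nodeplugin

# Show recent events for the failing pod, including which container restarted
kubectl -n kube-system describe pod <csi-cinder-nodeplugin-pod>

# Logs from the liveness sidecar's previous (crashed) instance;
# the container name "liveness-probe" is an assumption
kubectl -n kube-system logs <csi-cinder-nodeplugin-pod> -c liveness-probe --previous
```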

What you expected to happen:
Pods should not restart, and the service should run smoothly.

Anything else we need to know?:
The load on the server is not high and is well under limits, at around ~2% CPU utilization.

Environment: Production

  • OpenStack version: openstack-cinder-csi-1.4.6
  • K8s version 1.20.7
@jichenjc (Contributor) commented:

@bhachn Can you provide the configuration and logs of your node plugin?
I assume some errors occurred earlier, so that information should be helpful.


bhachn commented Oct 25, 2021

@jichenjc, please find attached the logs from all the containers running inside the pod.

cinder-csi.log
cinder-csi-liveness.log
cinder-csi-node-driver.log

@jichenjc (Contributor) commented:

Looks like the error log shows:

I1023 01:09:45.610277       1 connection.go:153] Connecting to unix:///csi/csi.sock
E1023 01:09:46.610083       1 main.go:74] health check failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

and 

E1020 14:32:46.634270       1 connection.go:131] Lost connection to unix:///csi/csi.sock.
E1020 15:14:46.631193       1 connection.go:131] Lost connection to unix:///csi/csi.sock.
E1020 20:06:46.630356       1 connection.go:131] Lost connection to unix:///csi/csi.sock.

I don't know the reason behind the lost connection; it's strange. @ramineni, have you seen this before?
How about enabling debug logging (add --v=5 to the pod startup command) so that we might get further info?
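As a rough sketch of that suggestion: the verbosity flag goes into the container args of the node-plugin DaemonSet. The container name and other args below are assumptions based on typical cinder-csi-plugin manifests; only the `--v=5` line is the actual change being proposed:

```yaml
# Fragment of the cinder-csi node-plugin DaemonSet spec (names are illustrative)
containers:
  - name: cinder-csi-plugin
    args:
      - /bin/cinder-csi-plugin
      - --endpoint=$(CSI_ENDPOINT)
      - --cloud-config=$(CLOUD_CONFIG)
      - --v=5   # raise log verbosity for debugging
```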

@ramineni (Contributor) commented:

@bhachn Is this happening on all nodes or only some nodes have this problem?
Also, could you check whether restarting the pod resolves the error? That is, simply restart the pod by deleting it and letting the replicaset/daemonset run it again (not redeploying).
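A minimal sketch of that restart-by-delete (pod name, namespace, and label selector are placeholders; the DaemonSet recreates the pod automatically):

```shell
# Delete the misbehaving node-plugin pod; its DaemonSet recreates it
kubectl -n kube-system delete pod <csi-cinder-nodeplugin-xxxxx>

# Watch the replacement come up and confirm the restart count stays at 0
kubectl -n kube-system get pods -w -l app=openstack-cinder-csi-nodeplugin
```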

This issue looks similar to kubernetes-csi/node-driver-registrar#139


bhachn commented Oct 26, 2021

@ramineni
It's happening on a single node. As suggested, we've restarted the pod and will monitor it for a day; we will update here if we experience the same again.

ramineni changed the title cinder-csi node plugin pods restart with health check failing → [cinder-csi-plugin] node plugin pods restart with health check failing Oct 27, 2021

bhachn commented Oct 27, 2021

@ramineni
We checked, and after the restart things look better.
However, we would like to observe it this week and will update here if any abnormal behavior is observed.

@ramineni (Contributor) commented:

@bhachn Thanks for the update.
As I mentioned above, the issue is related to node-driver-registrar, not the plugin itself.
I suppose we can close the issue in this repo, and you can track kubernetes-csi/node-driver-registrar#139 for updates.


bhachn commented Oct 29, 2021

Thanks @ramineni.
Closing the issue.

@bhachn bhachn closed this as completed Oct 29, 2021