
csi-node-driver-registrar consumed ~44% cpu time to read container registry TimeZone information in Windows #229

Open
Howard-Haiyang-Hao opened this issue Oct 14, 2022 · 14 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@Howard-Haiyang-Hao

I was trying to narrow down a perf issue and noticed that csi-node-driver-registrar.exe consumed ~44% CPU time within 359 ms. The majority of this time was spent reading the container's TimeZone registry information.

I worked with @andyzhangx and he recommended tracking this issue here.

[screenshots: csi-node-driver-registrar CPU profile trace]

I would like to know:

  1. whether csi-node-driver-registrar.exe delays the container startup time
  2. the purpose of reading the timezone information.

Best regards,
Howard Hao

@ddebroy
Contributor

ddebroy commented Oct 14, 2022

The csi node driver registrar primarily executes when the CSI node plugin is initializing and registering with Kubelet. Did you observe the above happening beyond the CSI node daemonset pod's initialization? Were there any interesting bits (especially errors/retries) in the logs of the csi node driver registrar container from the CSI node plugin pod?
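For background, "registering with Kubelet" here means the registrar serves kubelet's plugin-registration gRPC API on a socket under the kubelet plugins_registry directory: kubelet calls GetInfo to discover the CSI driver and NotifyRegistrationStatus to report the outcome. A rough Go sketch of those two RPCs, based on the pluginregistration v1 API (simplified, not the registrar's actual code):

package sketch

import (
	"context"

	registerapi "k8s.io/kubelet/pkg/apis/pluginregistration/v1"
)

// registrationServer sketches the service the registrar exposes to kubelet.
// Field names are illustrative; the real implementation lives in this repo.
type registrationServer struct {
	driverName string   // e.g. "disk.csi.azure.com"
	endpoint   string   // the --kubelet-registration-path value
	versions   []string // supported plugin versions
}

// GetInfo is called by kubelet to discover which CSI driver sits behind this socket.
func (s *registrationServer) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {
	return &registerapi.PluginInfo{
		Type:              registerapi.CSIPlugin,
		Name:              s.driverName,
		Endpoint:          s.endpoint,
		SupportedVersions: s.versions,
	}, nil
}

// NotifyRegistrationStatus is called by kubelet to report whether registration
// succeeded; the real registrar logs (and may exit) on failure.
func (s *registrationServer) NotifyRegistrationStatus(ctx context.Context, status *registerapi.RegistrationStatus) (*registerapi.RegistrationStatusResponse, error) {
	return &registerapi.RegistrationStatusResponse{}, nil
}

Once registration succeeds there is nothing left for the registrar to do except keep that socket available, which is why sustained CPU use after initialization would be surprising.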

@Howard-Haiyang-Hao
Author

It's within the bounds of Pod initialization. I am curious why this process needs to read time zone registry keys. This operation seems to take quite a bit of CPU time. Thank you for the quick response.

@ddebroy
Contributor

ddebroy commented Oct 14, 2022

It's within the bounds of Pod initialization

This is quite unexpected in the first place (assuming Pod initialization above is referring to a general stateful workload pod that mounts PVs backed by the CSI plugin) as the driver registrar does not have a role to play beyond CSI Node registration. The logs may reveal if you have a situation where the plugin is failing to register or restarting for some reason.

I am curious why this process needs to read time zone registry keys

Just a guess (logs/stack traces needed to confirm) but it could be this sequence:

csiConn, err := connection.Connect(*csiAddress, cmm)
=> https://github.com/kubernetes-csi/node-driver-registrar/blob/master/vendor/github.com/kubernetes-csi/csi-lib-utils/connection/connection.go#L113 => https://github.com/kubernetes-csi/node-driver-registrar/blob/master/vendor/github.com/kubernetes-csi/csi-lib-utils/connection/connection.go#L211
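If it is that path, one plausible way time-zone registry reads enter the picture is logging: the connect path periodically logs progress via klog, and the first log line that formats a local-time timestamp forces Go's lazy time.Local initialization, which on Windows is backed by registry lookups. A rough sketch of that kind of wait-and-log loop (illustrative only; names are made up, see the linked connection.go for the real code):

package sketch

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"k8s.io/klog/v2"
)

// waitForCSISocket sketches a loop that waits for the CSI socket to become
// reachable, logging every 10 seconds while it waits.
func waitForCSISocket(ctx context.Context, conn *grpc.ClientConn) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		if conn.GetState() == connectivity.Ready {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// klog prefixes every line with a local-time timestamp; the first such
			// timestamp on Windows loads time-zone data from the registry.
			klog.Infof("Still connecting to %s", conn.Target())
		}
	}
}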

@mauriciopoppe
Member

mauriciopoppe commented Oct 14, 2022

Adding on profiling, you can use

enableProfile = flag.Bool("enable-pprof", false, "enable pprof profiling")

to enable these endpoints:

if *enableProfile {
	klog.InfoS("Starting profiling", "endpoint", httpEndpoint)
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
}
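With that flag set (and the registrar's HTTP endpoint enabled so the mux above is actually served), a CPU profile can be pulled from the pod with something along the lines of go tool pprof http://<http-endpoint>/debug/pprof/profile?seconds=30, which should show exactly where the registry reads originate. (The flag names above are taken from main.go; double-check them against the version you are running.)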

@Howard-Haiyang-Hao
Author

Howard-Haiyang-Hao commented Oct 14, 2022

I think the daemon process (9684) was still running; these instances were the kubelet-registration-probe operations that were happening every 10 seconds, with a command like: /csi-node-driver-registrar.exe --kubelet-registration-path=C:\var\lib\kubelet\plugins\disk.csi.azure.com\csi.sock --mode=kubelet-registration-probe

[screenshot: kubelet-registration-probe invocations in the trace]

@mauriciopoppe
Member

If you're running a recent cluster version (1.25+) I'd suggest removing --mode=kubelet-registration-probe. This mode was added as a workaround for #143, but that was fixed in kubernetes/kubernetes#104584, i.e. we no longer need to probe to check that node-driver-registrar is up.

@Howard-Haiyang-Hao
Author

Let me ask the team to see if they can move to 1.25. Thanks.

I wonder if the daemon and probe modes use the same code base; if they do, then we still need to figure out the purpose of the Time Zone registry reads. Based on the call stack, csi-node-driver-registrar.exe calls RegOpenKeyExW directly, but as you mentioned, I also can't find RegOpenKeyEx in this repo. I am guessing this may come from the Go runtime.
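In case it helps confirm the Go runtime theory: as far as I know, Go on Windows initializes the local time zone lazily, and that initialization reads (and enumerates) keys under SOFTWARE\Microsoft\Windows NT\CurrentVersion\Time Zones via RegOpenKeyExW. Any first use of local time, e.g. formatting a log timestamp the way klog does, triggers it. A minimal program that should reproduce the same registry reads on Windows (assumption: the registrar hits this path through ordinary logging rather than anything time-zone specific):

package main

import (
	"fmt"
	"time"
)

func main() {
	// The first use of local time forces the lazy initialization of time.Local.
	// On Windows the Go runtime loads the local zone from the registry at this
	// point, which shows up as RegOpenKeyExW calls in a trace.
	fmt.Println(time.Now().Format("2006-01-02 15:04:05 MST"))
}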

@andyzhangx
Member

Let me ask the team to see if they can move to 1.25. Thanks.

I wonder if the daemon and probe modes use the same code base; if they do, then we still need to figure out the purpose of the Time Zone registry reads. Based on the call stack, csi-node-driver-registrar.exe calls RegOpenKeyExW directly, but as you mentioned, I also can't find RegOpenKeyEx in this repo. I am guessing this may come from the Go runtime.

@Howard-Haiyang-Hao we don't need to wait for 1.25; we could change the current Azure Disk daemonset config directly by removing the livenessProbe and check whether that solves the issue. Let's discuss offline, and I can share the steps with you.
https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/09066645538325be70cf0f28915ef484186c2ba9/deploy/csi-azuredisk-node-windows.yaml#L63-L70

@Howard-Haiyang-Hao
Author

@andyzhangx, let me work with you offline to see if the workaround solves the issue. Thanks!

@Howard-Haiyang-Hao
Author

I guess the following call actually triggers the Time Zone registry key enumerations:

[screenshot of the call in question]

Any idea what the purpose of this call is?

thanks,
Howard.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 17, 2023
@mauriciopoppe
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2023
@mauriciopoppe mauriciopoppe changed the title csi-node-driver-registrar consumed ~44% cpu time to read container registry TimeZone information csi-node-driver-registrar consumed ~44% cpu time to read container registry TimeZone information in Windows Feb 17, 2023
@mauriciopoppe
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 17, 2023
@andyzhangx
Member

/remove-lifecycle stale

Sneha-at added a commit to Sneha-at/node-driver-registrar that referenced this issue Aug 17, 2023
de2fba88 Merge pull request kubernetes-csi#233 from andyzhangx/andyzhangx-patch-1
cee895e1 remove windows 20H2 build since it's EOL long time ago
670bb0ef Merge pull request kubernetes-csi#229 from marosset/fix-codespell-errors
35d5e783 Merge pull request kubernetes-csi#219 from yashsingh74/update-registry
63473cc9 Merge pull request kubernetes-csi#231 from coulof/bump-go-version-1.20.5
29a5c76c Merge pull request kubernetes-csi#228 from mowangdk/chore/adopt_kubernetes_recommand_labels
8dd28211 Update cloudbuild image with go 1.20.5
2b8b80ea fixing some codespell errors
72984ec0 chore: adopt kubernetes recommand label
901bcb5a Update registry k8s.gcr.io -> registry.k8s.io

git-subtree-dir: release-tools
git-subtree-split: de2fba88becec7dec6744355a8ddb0057c5fe2f9