
csi-node-driver-registrar consumed ~44% cpu time to read container registry TimeZone information in Windows #229

Open
Howard-Haiyang-Hao opened this issue Oct 14, 2022 · 14 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@Howard-Haiyang-Hao

I was trying to narrow down a perf issue and noticed that csi-node-driver-registrar.exe consumed ~44% CPU time within 359 ms. The majority of this time was spent reading the container's TimeZone registry information.

I worked with @andyzhangx and he recommended tracking this issue here.

[screenshots: csi-node-driver-registrar CPU profile trace]

I would like to know:

  1. whether csi-node-driver-registrar.exe delays the container startup time
  2. the purpose of reading the timezone information.

Best regards,
Howard Hao

@ddebroy
Contributor

ddebroy commented Oct 14, 2022

The csi node driver registrar primarily executes when the CSI node plugin is initializing and registering with Kubelet. Did you observe the above happening beyond the CSI node daemonset pod's initialization? Were there any interesting bits (especially errors/retries) in the logs of the csi node driver registrar container from the CSI node plugin pod?
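For background, "registering with Kubelet" here means the registrar serves kubelet's plugin-registration gRPC API on a socket under the kubelet plugins_registry directory: kubelet calls GetInfo to discover the CSI driver and NotifyRegistrationStatus to report the outcome. A rough Go sketch of those two RPCs, based on the pluginregistration v1 API (simplified, not the registrar's actual code):

package sketch

import (
	"context"

	registerapi "k8s.io/kubelet/pkg/apis/pluginregistration/v1"
)

// registrationServer sketches the service the registrar exposes to kubelet.
// Field names are illustrative; the real implementation lives in this repo.
type registrationServer struct {
	driverName string   // e.g. "disk.csi.azure.com"
	endpoint   string   // the --kubelet-registration-path value
	versions   []string // supported plugin versions
}

// GetInfo is called by kubelet to discover which CSI driver sits behind this socket.
func (s *registrationServer) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {
	return &registerapi.PluginInfo{
		Type:              registerapi.CSIPlugin,
		Name:              s.driverName,
		Endpoint:          s.endpoint,
		SupportedVersions: s.versions,
	}, nil
}

// NotifyRegistrationStatus is called by kubelet to report whether registration
// succeeded; the real registrar logs (and may exit) on failure.
func (s *registrationServer) NotifyRegistrationStatus(ctx context.Context, status *registerapi.RegistrationStatus) (*registerapi.RegistrationStatusResponse, error) {
	return &registerapi.RegistrationStatusResponse{}, nil
}

Once registration succeeds there is nothing left for the registrar to do except keep that socket available, which is why sustained CPU use after initialization would be surprising.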

@Howard-Haiyang-Hao
Author

It's within the bounds of Pod initialization. I am curious why this process needs to read time zone registry keys. This operation seems to take quite a bit of CPU time. Thank you for the quick response.

@ddebroy
Contributor

ddebroy commented Oct 14, 2022

It's within the bounds of Pod initialization

This is quite unexpected in the first place (assuming Pod initialization above is referring to a general stateful workload pod that mounts PVs backed by the CSI plugin) as the driver registrar does not have a role to play beyond CSI Node registration. The logs may reveal if you have a situation where the plugin is failing to register or restarting for some reason.

I am curious why this process needs to read time zone registry keys

Just a guess (logs/stack traces needed to confirm) but it could be this sequence:

csiConn, err := connection.Connect(*csiAddress, cmm)
=> https://github.com/kubernetes-csi/node-driver-registrar/blob/master/vendor/github.com/kubernetes-csi/csi-lib-utils/connection/connection.go#L113 => https://github.com/kubernetes-csi/node-driver-registrar/blob/master/vendor/github.com/kubernetes-csi/csi-lib-utils/connection/connection.go#L211
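If it is that path, one plausible way time-zone registry reads enter the picture is logging: the connect path periodically logs progress via klog, and the first log line that formats a local-time timestamp forces Go's lazy time.Local initialization, which on Windows is backed by registry lookups. A rough sketch of that kind of wait-and-log loop (illustrative only; names are made up, see the linked connection.go for the real code):

package sketch

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"k8s.io/klog/v2"
)

// waitForCSISocket sketches a loop that waits for the CSI socket to become
// reachable, logging every 10 seconds while it waits.
func waitForCSISocket(ctx context.Context, conn *grpc.ClientConn) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		if conn.GetState() == connectivity.Ready {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// klog prefixes every line with a local-time timestamp; the first such
			// timestamp on Windows loads time-zone data from the registry.
			klog.Infof("Still connecting to %s", conn.Target())
		}
	}
}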

@mauriciopoppe
Member

mauriciopoppe commented Oct 14, 2022

Adding on profiling, you can use

enableProfile = flag.Bool("enable-pprof", false, "enable pprof profiling")

to enable these endpoints:

if *enableProfile {
	klog.InfoS("Starting profiling", "endpoint", httpEndpoint)
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
}
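With that flag set (and the registrar's HTTP endpoint enabled so the mux above is actually served), a CPU profile can be pulled from the pod with something along the lines of go tool pprof http://<http-endpoint>/debug/pprof/profile?seconds=30, which should show exactly where the registry reads originate. (The flag names above are taken from main.go; double-check them against the version you are running.)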

@Howard-Haiyang-Hao
Author

Howard-Haiyang-Hao commented Oct 14, 2022

I think the daemon process (9684) was still running; these instances were the kubelet-registration-probe operations that were happening every 10 seconds, with a command like: /csi-node-driver-registrar.exe --kubelet-registration-path=C:\var\lib\kubelet\plugins\disk.csi.azure.com\csi.sock --mode=kubelet-registration-probe

[screenshot: kubelet-registration-probe invocations in the trace]

@mauriciopoppe
Member

If you're running a recent cluster version (1.25+) I'd suggest removing --mode=kubelet-registration-probe. This mode was added as a workaround for #143, but that was fixed in kubernetes/kubernetes#104584, i.e. we no longer need to probe to check that node-driver-registrar is up.

@Howard-Haiyang-Hao
Author

Let me ask the team to see if they can move to 1.25. Thanks.

I wonder if the daemon and probe modes use the same code base; if they do, then we still need to figure out the purpose of the Time Zone registry reads. Based on the call stack, csi-node-driver-registrar.exe calls RegOpenKeyExW directly, but as you mentioned, I also can't find RegOpenKeyEx in this repo. I am guessing this may come from the Go runtime.
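In case it helps confirm the Go runtime theory: as far as I know, Go on Windows initializes the local time zone lazily, and that initialization reads (and enumerates) keys under SOFTWARE\Microsoft\Windows NT\CurrentVersion\Time Zones via RegOpenKeyExW. Any first use of local time, e.g. formatting a log timestamp the way klog does, triggers it. A minimal program that should reproduce the same registry reads on Windows (assumption: the registrar hits this path through ordinary logging rather than anything time-zone specific):

package main

import (
	"fmt"
	"time"
)

func main() {
	// The first use of local time forces the lazy initialization of time.Local.
	// On Windows the Go runtime loads the local zone from the registry at this
	// point, which shows up as RegOpenKeyExW calls in a trace.
	fmt.Println(time.Now().Format("2006-01-02 15:04:05 MST"))
}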

@andyzhangx
Member

Let me ask the team to see if they can move to 1.25. Thanks.

I wonder if the daemon and probe modes use the same code base; if they do, then we still need to figure out the purpose of the Time Zone registry reads. Based on the call stack, csi-node-driver-registrar.exe calls RegOpenKeyExW directly, but as you mentioned, I also can't find RegOpenKeyEx in this repo. I am guessing this may come from the Go runtime.

@Howard-Haiyang-Hao we don't need to wait for 1.25; we could change the current Azure Disk daemonset config directly by removing the livenessProbe and check whether that solves the issue. Let's discuss offline, and I can share the steps with you.
https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/09066645538325be70cf0f28915ef484186c2ba9/deploy/csi-azuredisk-node-windows.yaml#L63-L70

@Howard-Haiyang-Hao
Author

@andyzhangx, let me work with you offline to see if the workaround solves the issue. Thanks!

@Howard-Haiyang-Hao
Author

I guess the following call actually triggers the Time Zone registry key enumerations:

[screenshot of the call in question]

Any idea what the purpose of this call is?

thanks,
Howard.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 17, 2023
@mauriciopoppe
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2023
@mauriciopoppe mauriciopoppe changed the title csi-node-driver-registrar consumed ~44% cpu time to read container registry TimeZone information csi-node-driver-registrar consumed ~44% cpu time to read container registry TimeZone information in Windows Feb 17, 2023
@mauriciopoppe
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 17, 2023
@andyzhangx
Member

/remove-lifecycle stale

Sneha-at added a commit to Sneha-at/node-driver-registrar that referenced this issue Aug 17, 2023
de2fba88 Merge pull request kubernetes-csi#233 from andyzhangx/andyzhangx-patch-1
cee895e1 remove windows 20H2 build since it's EOL long time ago
670bb0ef Merge pull request kubernetes-csi#229 from marosset/fix-codespell-errors
35d5e783 Merge pull request kubernetes-csi#219 from yashsingh74/update-registry
63473cc9 Merge pull request kubernetes-csi#231 from coulof/bump-go-version-1.20.5
29a5c76c Merge pull request kubernetes-csi#228 from mowangdk/chore/adopt_kubernetes_recommand_labels
8dd28211 Update cloudbuild image with go 1.20.5
2b8b80ea fixing some codespell errors
72984ec0 chore: adopt kubernetes recommand label
901bcb5a Update registry k8s.gcr.io -> registry.k8s.io

git-subtree-dir: release-tools
git-subtree-split: de2fba88becec7dec6744355a8ddb0057c5fe2f9