Remove grpcServer.Stop() when stream.Send() fails on ListAndWatch #412

syaganti · 2024-10-22T21:06:29Z

The error handling for a failed stream.Send(resp) during ListAndWatch is not reliable.

For instance, if stream.Send() fails and kubelet never restarts, stopping the grpcServer means the device plugin will never reconnect with kubelet: the socket will still exist in `/var/lib/kubelet/device-plugins/`, so the device plugin will not trigger itself to re-register with kubelet, and the gRPC server is never started again.

The grpcServer is only created during registration, so in this case we are stuck in a state where the grpcServer is stopped with no way of starting it again. This puts the device plugin in an unrecoverable state.

Instead, we should not stop the grpcServer when a send fails. We should continue with the ListAndWatch logic and retry the send on the next device health update.
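
To illustrate the proposed control flow, here is a minimal sketch, not the actual change in this PR. It assumes simplified stand-ins for the device plugin API: `Device`, `ListAndWatchResponse`, the `sender` interface, the `flakyStream` fake, and the `listAndWatch` helper are all hypothetical names made up for this example. The loop logs a failed send and waits for the next health update instead of stopping the gRPC server. Whether a later send on the same stream can succeed depends on why the first one failed; the sketch only shows that the server is left running.

```go
package main

import (
	"errors"
	"fmt"
)

// Device and ListAndWatchResponse are simplified stand-ins (assumptions)
// for the kubelet device plugin API types.
type Device struct {
	ID     string
	Health string
}

type ListAndWatchResponse struct {
	Devices []Device
}

// sender is a hypothetical, minimal view of the ListAndWatch server stream.
type sender interface {
	Send(*ListAndWatchResponse) error
}

// listAndWatch sends the device list on every health update. When Send fails,
// it logs the error and waits for the next update; it does not stop the
// gRPC server. It returns how many sends succeeded and how many failed.
func listAndWatch(stream sender, updates <-chan []Device, logf func(string, ...interface{})) (sent, failed int) {
	for devs := range updates {
		if err := stream.Send(&ListAndWatchResponse{Devices: devs}); err != nil {
			// No grpcServer.Stop() here: the server stays up and the
			// send is retried on the next device health update.
			logf("ListAndWatch: send failed, retrying on next health update: %v", err)
			failed++
			continue
		}
		sent++
	}
	return sent, failed
}

// flakyStream is a test fake that fails the first `failures` sends.
type flakyStream struct {
	failures int
	received []*ListAndWatchResponse
}

func (f *flakyStream) Send(r *ListAndWatchResponse) error {
	if f.failures > 0 {
		f.failures--
		return errors.New("simulated send failure")
	}
	f.received = append(f.received, r)
	return nil
}

func main() {
	// Three health updates; the first send is made to fail.
	updates := make(chan []Device, 3)
	updates <- []Device{{ID: "dev0", Health: "Healthy"}}
	updates <- []Device{{ID: "dev0", Health: "Unhealthy"}}
	updates <- []Device{{ID: "dev0", Health: "Healthy"}}
	close(updates)

	stream := &flakyStream{failures: 1}
	logf := func(format string, args ...interface{}) {
		fmt.Printf(format+"\n", args...)
	}
	sent, failed := listAndWatch(stream, updates, logf)
	fmt.Printf("sent=%d failed=%d\n", sent, failed)
}
```

In this sketch the failed update is not resent on its own. The next health update carries the current device list, which is what "retry send on the next device health update" amounts to here.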


SergeyKanzhelev · 2024-10-23T07:14:53Z

/cc @ffromani

Interestingly, the NVIDIA version is doing something similar, but not the same: https://github.com/NVIDIA/k8s-device-plugin/blob/620aaed3474d6dca586451ec63ca8e6b94100410/internal/plugin/server.go#L278C5-L278C15

I wonder what the best practice is here. I feel like a retry, as suggested in this PR, is a good way to go. @syaganti, maybe it is a good idea to file an issue against the NVIDIA device plugin as well.

ffromani · 2024-10-23T07:17:19Z

> /cc @ffromani
>
> Interestingly, the NVIDIA version is doing something similar, but not the same: https://github.com/NVIDIA/k8s-device-plugin/blob/620aaed3474d6dca586451ec63ca8e6b94100410/internal/plugin/server.go#L278C5-L278C15
>
> I wonder what the best practice is here. I feel like a retry, as suggested in this PR, is a good way to go. @syaganti, maybe it is a good idea to file an issue against the NVIDIA device plugin as well.

@SergeyKanzhelev in general I'd support an organized effort from us in the Kubernetes 1.33 cycle to improve things in this area; there are multiple related issues and efforts ongoing. Maybe we need a KEP or something more structured.

I'll dig into the git history and past conversations and comment with recommendations.
