Remove grpcServer.Stop() when stream.Send() fails on ListAndWatch #412
The error handling for stream.Send(resp) failures during ListAndWatch is not completely reliable.
For instance, if stream.Send() fails and kubelet never restarts, stopping the grpcServer means the device plugin will never reconnect with kubelet. The socket will still exist in /var/lib/kubelet/device-plugins/, so the device plugin will not trigger itself to re-register with kubelet, and the grpc server is never started again. The grpcServer is only created during registration, so in this case we are stuck in a state where the grpcServer is stopped with no way of starting it again. This puts device plugins in an unrecoverable state.
Instead, we should not stop the grpcServer when a send fails. ListAndWatch should continue with its normal logic and retry the send on the next device health update.
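
A minimal sketch of the proposed behaviour, not the actual diff in this PR. The `Device` type, the `sender` interface, and `flakyStream` are hypothetical stand-ins for the device plugin API types and the gRPC server stream, so the example runs with only the standard library. The point it illustrates is that a failed send is logged and skipped, and the loop keeps running so the next health update acts as the retry.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a hypothetical stand-in for the device plugin API's device type.
type Device struct {
	ID      string
	Healthy bool
}

// sender is a hypothetical stand-in for the ListAndWatch server stream;
// only the Send method matters for this sketch.
type sender interface {
	Send(devs []Device) error
}

// listAndWatch sends the initial device list, then one message per health
// update. A failed Send is logged and counted, but nothing is stopped:
// the loop continues and the next update serves as the retry.
func listAndWatch(s sender, initial []Device, updates <-chan []Device) (failures int) {
	if err := s.Send(initial); err != nil {
		fmt.Println("send failed, will retry on next health update:", err)
		failures++
	}
	for devs := range updates {
		if err := s.Send(devs); err != nil {
			fmt.Println("send failed, will retry on next health update:", err)
			failures++
			continue
		}
	}
	return failures
}

// flakyStream fails the first failFirst calls to Send and records the rest.
// It exists only to simulate a transient stream error.
type flakyStream struct {
	failFirst int
	calls     int
	sent      [][]Device
}

func (f *flakyStream) Send(devs []Device) error {
	f.calls++
	if f.calls <= f.failFirst {
		return errors.New("simulated send failure")
	}
	f.sent = append(f.sent, devs)
	return nil
}

func main() {
	updates := make(chan []Device, 1)
	updates <- []Device{{ID: "dev0", Healthy: false}}
	close(updates)

	s := &flakyStream{failFirst: 1}
	failures := listAndWatch(s, []Device{{ID: "dev0", Healthy: true}}, updates)
	fmt.Println("failed sends:", failures, "delivered sends:", len(s.sent))
}
```

In this sketch the initial send fails, yet the later health update is still delivered because the server side was never torn down.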