
webhook EOF errors #1509

Open
mattmoor opened this issue Jul 16, 2020 · 46 comments
Labels
area/API, area/test-and-release, kind/bug (Categorizes issue or PR as related to a bug.), lifecycle/frozen (Indicates that an issue or PR should not be auto-closed due to staleness.)

Comments

@mattmoor
Member

/area API
/area test-and-release
/kind bug

Expected Behavior

When we run our e2e tests with chaos there are no failures due to the webhook shutting down.

Actual Behavior

We intermittently see failures like this: Post https://eventing-webhook.knative-eventing-qh1fjbnng8.svc:443/resource-conversion?timeout=30s: EOF ever since we enabled webhook chaos.

@knative-prow-robot added the area/API, area/test-and-release, and kind/bug labels on Jul 16, 2020
@mattmoor
Member Author

cc @vaikas

@mattmoor
Member Author

I already made two changes to try to mitigate this:

  1. Bumped the terminationGracePeriodSeconds in our webhook pods to 300 to give the webhook itself greater control over the drain duration.
  2. Bumped network.DefaultDrainTimeout (Bump the drain timeout #1501) to give the API server more time to observe the endpoints change.

@mattmoor
Member Author

From the Go http.Server Shutdown documentation:

When Shutdown is called, Serve, ListenAndServe, and ListenAndServeTLS immediately return ErrServerClosed. Make sure the program doesn't exit and waits instead for Shutdown to return.

I wondered if this might be our problem, but we call ListenAndServeTLS here:

pkg/webhook/webhook.go

Lines 207 to 210 in f1b8240

eg.Go(func() error {
	if err := server.ListenAndServeTLS("", ""); err != nil && err != http.ErrServerClosed {
		logger.Errorw("ListenAndServeTLS for admission webhook returned error", zap.Error(err))
		return err

and we seem to properly check for ErrServerClosed.
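
For reference, the pattern the docs are describing looks roughly like this (a minimal sketch with placeholder cert paths, not the webhook's actual wiring):

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	server := &http.Server{Addr: ":8443"}

	go func() {
		// ErrServerClosed is the expected return once Shutdown is called.
		if err := server.ListenAndServeTLS("cert.pem", "key.pem"); err != nil && err != http.ErrServerClosed {
			panic(err)
		}
	}()

	// Wait for SIGTERM, then drain in-flight requests before exiting.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	<-sigs

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	server.Shutdown(ctx) // the program must not exit before this returns
}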

@mattmoor
Member Author

I started a thread in #sig-api-machinery here: https://kubernetes.slack.com/archives/C0EG7JC6T/p1594910493127400

@mattmoor
Member Author

Alright, so one of the things I have been wondering about is keep-alives, and whether they might be a reason the API server takes so long to realize an endpoint is no longer good. I noticed the following comment on Go's SetKeepAlivesEnabled:

SetKeepAlivesEnabled controls whether HTTP keep-alives are enabled. By default, keep-alives are always enabled. Only very resource-constrained environments or **servers in the process of shutting down** should disable them.

... emphasis mine.

Now it turns out that Go's http logic is already smart about disabling keep-alive when it is shutting down:

func (s *Server) doKeepAlives() bool {
	return atomic.LoadInt32(&s.disableKeepAlives) == 0 && !s.shuttingDown()
}

However, when we lame-duck we aren't yet .shuttingDown(), so I believe keep-alives will continue throughout our network.DefaultDrainTimeout until Shutdown() is actually called.

I believe testing this should be as simple as calling server.SetKeepAlivesEnabled(false) here:
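
Roughly like this (a sketch: the drainAndShutdown helper and its placement in our lame-duck path are assumptions, not the actual webhook code):

package sketch

import (
	"context"
	"net/http"
	"time"
)

// drainAndShutdown is a hypothetical lame-duck sequence: stop honoring
// keep-alives immediately, wait out the drain period so the API server can
// observe the endpoint change, then shut down for real.
func drainAndShutdown(server *http.Server, drainTimeout time.Duration) error {
	// Without this, existing keep-alive connections stay usable until
	// Shutdown() is called, since the server isn't .shuttingDown() yet.
	server.SetKeepAlivesEnabled(false)

	time.Sleep(drainTimeout) // e.g. network.DefaultDrainTimeout

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	return server.Shutdown(ctx)
}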

However, if this works then we should probably consider doing something similar across our various dataplane components as well. cc @tcnghia

mattmoor added a commit to mattmoor/pkg that referenced this issue Jul 16, 2020
@mattmoor
Member Author

We believe that @vaikas is still seeing some webhook failures on builds that should already be synced past the above change, so we aren't out of the woods yet.

mattmoor added a commit to mattmoor/pkg that referenced this issue Jul 17, 2020
This implements a new `http.Handler` called `Drainer`, which is intended to wrap some inner `http.Handler` business logic with a new outer handler that can respond to Kubelet probes (successfully until told to "Drain()").

This takes over the webhook's relatively new probe handling and lame duck logic with one key difference.  Previously the webhook waited for a fixed period after SIGTERM before exiting, but the new logic waits for this same grace period AFTER THE LAST REQUEST.  So if the handler keeps getting (non-probe) requests, the timer will continually reset, and once it stops receiving requests for the configured grace period, "Drain()" will return and the webhook will exit.

The goal of this work is to try to better cope with what we believe to be high tail latencies of the API server seeing that a webhook replica is shutting down.

Related: knative#1509
mattmoor added a commit to mattmoor/pkg that referenced this issue Jul 17, 2020
@mattmoor
Member Author

The new "Drainer" handler in the linked PR uses a dynamic drain timeout where it waits for at least network.DefaultDrainTimeout after the last request before terminating the webhook. With that constant set to 45s, let's see if there are any failures... 🤞
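
Conceptually, the Drainer behaves something like this (a heavily simplified sketch, not the actual implementation that landed in knative.dev/pkg):

package sketch

import (
	"net/http"
	"strings"
	"sync"
	"time"
)

// Drainer answers kubelet probes and resets a quiet-period timer on every
// real request; Drain() only returns once the quiet period elapses.
type Drainer struct {
	sync.Mutex
	Inner       http.Handler
	QuietPeriod time.Duration // e.g. network.DefaultDrainTimeout

	draining bool
	timer    *time.Timer
	done     chan struct{}
	once     sync.Once
}

func (d *Drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Treat kubelet health checks separately from real traffic.
	if strings.HasPrefix(r.Header.Get("User-Agent"), "kube-probe/") {
		d.Lock()
		draining := d.draining
		d.Unlock()
		if draining {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
		} else {
			w.WriteHeader(http.StatusOK)
		}
		return
	}

	// Every non-probe request pushes the shutdown deadline out again.
	d.Lock()
	if d.timer != nil {
		d.timer.Reset(d.QuietPeriod)
	}
	d.Unlock()

	d.Inner.ServeHTTP(w, r)
}

// Drain flips the handler into lame duck and blocks until QuietPeriod has
// elapsed with no non-probe requests.
func (d *Drainer) Drain() {
	d.Lock()
	d.draining = true
	d.done = make(chan struct{})
	d.timer = time.AfterFunc(d.QuietPeriod, func() {
		d.once.Do(func() { close(d.done) })
	})
	d.Unlock()
	<-d.done
}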

knative-prow-robot pushed a commit that referenced this issue Jul 18, 2020
* Implement a new shared "Drainer" handler.

* Switch to RWLock
mattmoor added a commit to mattmoor/eventing that referenced this issue Jul 18, 2020
This attempts to combat the webhook Post EOF errors we have been seeing intermittently: knative/pkg#1509
knative-prow-robot pushed a commit to knative/eventing that referenced this issue Jul 18, 2020
This attempts to combat the webhook Post EOF errors we have been seeing intermittently: knative/pkg#1509
@mattmoor
Member Author

One of the two flakes in eventing yesterday was this:

TestChannelDataPlaneSuccess/InMemoryChannel-messaging.knative.dev/v1/full-event_encoding_binary_version_1.0: creation.go:108: Failed to create subscription "full-event-binary-10-sub": Internal error occurred: failed calling webhook "webhook.eventing.knative.dev": Post https://eventing-webhook.knative-eventing-sfqaktzdqa.svc:443/defaulting?timeout=2s: EOF

I need to track down whether that CI run had the drainer stuff in yet.

@mattmoor
Member Author

mattmoor commented Jul 19, 2020

Hmm looks like it was at 239f7fc, which includes the drainer.

Link to the full run: https://prow.knative.dev/view/gcs/knative-prow/logs/ci-knative-eventing-continuous/1284652424807583744

@vaikas
Contributor

vaikas commented Jul 21, 2020

Tomorrow I'm going to add a workaround for this by using retry loops, just like we do for pod creates (or configmap failures) with very specific error cases. This will hopefully be a workaround for now, so that tests won't fail if they hit this condition. Reconcilers should be resilient to this, but tests that use create*orFail will not be, so this will at least cut down some of the noise. I'll make sure to log these so we can at least see how often this happens.
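
Something in the spirit of that workaround, as a sketch (the helper names and error matching here are illustrative, not the eventing test-library API):

package sketch

import (
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// isWebhookEOF matches the transient "Post ...: EOF" failures seen when a
// webhook replica is draining. The string match is intentionally narrow.
func isWebhookEOF(err error) bool {
	return err != nil && strings.HasSuffix(err.Error(), ": EOF")
}

// createOrRetry wraps a create call (e.g. the body of a create*orFail test
// helper) so that only these specific errors are retried and logged.
func createOrRetry(logf func(string, ...interface{}), create func() error) error {
	backoff := wait.Backoff{Steps: 5, Duration: time.Second, Factor: 2.0}
	return retry.OnError(backoff, isWebhookEOF, func() error {
		if err := create(); err != nil {
			logf("create failed (will retry if webhook EOF): %v", err)
			return err
		}
		return nil
	})
}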

@markusthoemmes
Contributor

@mattmoor
Member Author

It definitely is. Open to suggestions on how we might keep chipping away at this / instrument / debug / ...

@mattmoor
Member Author

I had an idea chatting with @markusthoemmes a bit on slack.

Right now there's so much machinery standing between us and the EOFs that it makes them harder to debug than they should be. The thought is: can we have a more dedicated probing process that we can use to reproduce this (maybe more consistently)?
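
For instance, a throwaway prober that hammers the webhook endpoint during chaos and logs every failure might reproduce this more consistently. A sketch (the target URL and payload are placeholders, and cert verification is skipped because this is a debugging tool only):

package main

import (
	"bytes"
	"crypto/tls"
	"log"
	"net/http"
	"time"
)

func main() {
	// Placeholder target; point this at the webhook Service under chaos.
	const target = "https://eventing-webhook.knative-eventing.svc:443/defaulting?timeout=2s"

	client := &http.Client{
		Timeout:   2 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	for {
		resp, err := client.Post(target, "application/json", bytes.NewReader([]byte(`{}`)))
		if err != nil {
			log.Printf("probe failed: %v", err) // look for ": EOF" here
		} else {
			resp.Body.Close()
		}
		time.Sleep(100 * time.Millisecond)
	}
}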

@dprotaso
Member

Interestingly, one way to reproduce this consistently is to panic in the webhook when handling a request (i.e. in the defaulting logic).

Go's http server recovers these panics and logs an error.

We should potentially recover ourselves so we can return an 'internal server error' type response.
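
Something like this wrapped around the admission handlers, as a sketch (the surrounding wiring is assumed):

package sketch

import (
	"net/http"

	"go.uber.org/zap"
)

// withRecovery turns handler panics into a 500 response instead of letting
// net/http tear down the connection, which is what the caller sees as EOF.
func withRecovery(inner http.Handler, logger *zap.SugaredLogger) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				logger.Errorw("panic while handling admission request", "panic", rec)
				http.Error(w, "internal error", http.StatusInternalServerError)
			}
		}()
		inner.ServeHTTP(w, r)
	})
}

If the panic happens after the handler has already started writing the response the connection is compromised anyway, but admission handlers typically fail before writing anything, so this would at least hand the API server a well-formed error.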

@dprotaso
Member

dprotaso commented May 19, 2021

As part of knative/serving#11225 I encountered EOFs / context deadline exceeded errors. After adding some tracing, I've seen our webhooks respond in <10ms but then the API server still returns a timeout.

So I don't think this is isolated to just our webhooks.

My next step is to start testing with a non-managed k8s service to be able to get API server logs.

knative-prow-robot pushed a commit to knative/eventing that referenced this issue May 20, 2021
Serving has done this as well; the chaosduck killing the webhook causes the errors being tracked in knative/pkg#1509
pierDipi added a commit to pierDipi/eventing-kafka that referenced this issue Nov 9, 2021
Mitigation for knative/pkg#1509.

The same fix was used in eventing core to mitigate webhook EOF errors.

Signed-off-by: Pierangelo Di Pilato <[email protected]>
knative-prow-robot pushed a commit to knative-extensions/eventing-kafka that referenced this issue Nov 9, 2021
pierDipi added a commit to pierDipi/eventing-kafka that referenced this issue Nov 10, 2021
pierDipi added a commit to pierDipi/eventing-kafka that referenced this issue Nov 10, 2021
pierDipi added a commit to pierDipi/eventing-kafka that referenced this issue Nov 10, 2021
pierDipi added a commit to pierDipi/eventing-kafka-broker that referenced this issue Aug 8, 2022
This test sometimes fails due to [1].

[1] knative/pkg#1509

Signed-off-by: Pierangelo Di Pilato <[email protected]>
pierDipi added a commit to pierDipi/eventing-kafka-broker that referenced this issue Aug 8, 2022
knative-prow bot pushed a commit to knative-extensions/eventing-kafka-broker that referenced this issue Aug 9, 2022