Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IMC and Multi-Tenant Channel Based Broker retries #2932

Merged
merged 10 commits into from
Jul 21, 2020

Conversation

pierDipi
Copy link
Member

@pierDipi pierDipi commented Apr 7, 2020

Fixes #2210
Fixes #2212

Proposed Changes

  • MessageDispatcher retries sending events.
  • Added e2e tests for the redelivery feature.

Release Note

- 🎁 Add new feature 
In Memory Channel and Multi-Tenant Channel Based Broker retry sending events

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Apr 7, 2020
@knative-prow-robot knative-prow-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. area/test-and-release Test infrastructure, tests or release labels Apr 7, 2020
@knative-prow-robot
Copy link
Contributor

Hi @pierDipi. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 7, 2020
@pierDipi
Copy link
Member Author

pierDipi commented Apr 7, 2020

@lionelvillard
Copy link
Member

/hold

let's wait for the next CloudEvent SDK preview release to land.

@knative-prow-robot knative-prow-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 8, 2020
@slinkydeveloper
Copy link
Contributor

Let's keep this on hold, we need to understand if we can use the protocol/http.Protocol from the sdk

@vaikas
Copy link
Contributor

vaikas commented May 8, 2020

Just a ping on the status of this?

@slinkydeveloper
Copy link
Contributor

@vaikas i have no updates, let's check this next week

@slinkydeveloper
Copy link
Contributor

slinkydeveloper commented Jun 4, 2020

What I'm really worried about is that we're adding ourselves code for retrying http requests that already exists out there (which is error prone and we need to test). For example, i'm not really sure what happens in that code if you pass an unbuffered io.Reader in the request body, because after it's read for the first time it could be broken to read it a second time...
Have you considered using this library? https://github.com/hashicorp/go-retryablehttp
We can integrate it in zero time, we don't need any testing, and it just works (libraries from these org are also well mantained)

@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 4, 2020
@pierDipi
Copy link
Member Author

pierDipi commented Jun 4, 2020

Have you considered using this library? https://github.com/hashicorp/go-retryablehttp
We can integrate it in zero time, we don't need any testing, and it just works (libraries from these org are also well mantained)

No, I didn't, but I'm fine with this

Let me know if I can go ahead using that library.

@slinkydeveloper
Copy link
Contributor

I think we need somebody else on this topic, @n3wscott wdyt?

@lionelvillard
Copy link
Member

really cool ideas in this library! Thanks for sharing @slinkydeveloper! It is very flexible (BYO backoff policy and retry policy) so from a functional PoV it does what we want.

@matzew
Copy link
Member

matzew commented Jun 25, 2020

Have you considered using this library? https://github.com/hashicorp/go-retryablehttp
We can integrate it in zero time, we don't need any testing, and it just works (libraries from these org are also well mantained)

I second what @slinkydeveloper said here! Let's use proven and maintained code, instead of hacking /cc @pierDipi

@pierDipi
Copy link
Member Author

Ok, I'll continue to work on this.

@knative-prow-robot knative-prow-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 16, 2020
@pierDipi pierDipi changed the title Message sender retries [WIP] Retries Jul 16, 2020
@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 16, 2020
@slinkydeveloper
Copy link
Contributor

/assign

Copy link
Contributor

@slinkydeveloper slinkydeveloper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little hint on the signature of MessageDispatcher

pkg/channel/message_dispatcher.go Outdated Show resolved Hide resolved
pkg/kncloudevents/message_sender.go Outdated Show resolved Hide resolved
Signed-off-by: Pierangelo Di Pilato <[email protected]>
Signed-off-by: Pierangelo Di Pilato <[email protected]>
@knative-metrics-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-knative-eventing-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/channel/fanout/fanout_message_handler.go 92.6% 89.1% -3.5
pkg/channel/message_dispatcher.go 86.1% 87.3% 1.3
pkg/channel/multichannelfanout/multi_channel_fanout_message_handler.go 90.9% 84.6% -6.3
pkg/reconciler/inmemorychannel/dispatcher/inmemorychannel.go 78.9% 77.1% -1.8

@pierDipi
Copy link
Member Author

/retest

@knative-test-reporter-robot

The following jobs failed:

Test name Triggers Retries
pull-knative-eventing-integration-tests 0/3

Failed non-flaky tests preventing automatic retry of pull-knative-eventing-integration-tests:

test/e2e.TestEventTransformationForTriggerV1Beta1BrokerV1
test/e2e.TestEventTransformationForTriggerV1Beta1BrokerV1/InMemoryChannel-messaging.knative.dev/v1beta1

@pierDipi
Copy link
Member Author

Failed to create v1 broker "inmemorychannel": conversion webhook for eventing.knative.dev/v1, Kind=Broker failed: Post https://eventing-webhook.knative-eventing-n2htblyaln.svc:443/resource-conversion?timeout=30s: EOF

Have you ever seen something similar?

@vaikas
Copy link
Contributor

vaikas commented Jul 21, 2020

Yeah :(

knative/pkg#1509

@pierDipi
Copy link
Member Author

Thanks!

/retest

@slinkydeveloper
Copy link
Contributor

/unhold

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 21, 2020
eventingduck "knative.dev/eventing/pkg/apis/duck/v1beta1"
"knative.dev/eventing/pkg/channel"
"knative.dev/eventing/pkg/kncloudevents"
)

const (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not your doing, but 15 minute timeout? :) I'll look what this is for. Just a note :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems too much :)

It's used here:

errorCh := make(chan error, subs)
for _, sub := range f.config.Subscriptions {
go func(s Subscription) {
errorCh <- f.makeFanoutRequest(ctx, bufferedMessage, additionalHeaders, s)
}(sub)
}
for range f.config.Subscriptions {
select {
case err := <-errorCh:
if err != nil {
f.logger.Error("Fanout had an error", zap.Error(err))
return err
}
case <-time.After(f.timeout):
f.logger.Error("Fanout timed out")
return errors.New("fanout timed out")
}
}

)

const (
defaultTimeout = 15 * time.Minute
)

type Subscription struct {
eventingduck.SubscriberSpec
RetriesConfig kncloudevents.RetryConfig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: RetriesConfig => RetryConfig?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vaikas
Copy link
Contributor

vaikas commented Jul 21, 2020

Does this also fix / help with:
#2357

/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 21, 2020
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pierDipi, vaikas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2020
@knative-prow-robot knative-prow-robot merged commit d5ea26f into knative:master Jul 21, 2020
@pierDipi
Copy link
Member Author

Does this also fix / help with:
#2357

It looks like we have the opposite behavior: https://prow.knative.dev/view/gcs/knative-prow/pr-logs/pull/knative_eventing/2932/pull-knative-eventing-upgrade-tests/1285477487198867458

event #* received 2 times, but should be received only once

@pierDipi pierDipi mentioned this pull request Jul 21, 2020
@slinkydeveloper
Copy link
Contributor

Thanks @pierDipi for looking in this long-standing issue!

@pierDipi
Copy link
Member Author

Thanks for your reviews 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test-and-release Test infrastructure, tests or release cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Broker/Trigger should retry sending failed events. In-memory Dispatcher should retry sending events
9 participants