feat: turn healthCheckHttpClient timeout from 500ms to 3s #1321

Open
batleforc wants to merge 1 commit into base: main

Conversation


@batleforc batleforc commented Sep 16, 2024

What does this PR do?

It changes the timeout of the health check client from 500ms to 3s.

What issues does this PR fix or reference?

Linked to eclipse-che/che#23067
Fixes #1325

Is it tested? How?

In progress

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che


openshift-ci bot commented Sep 16, 2024

Hi @batleforc. Thanks for your PR.

I'm waiting for a devfile member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@batleforc
Author

Set up in a working environment, together with the work from the linked PR.

Collaborator

@AObuchow AObuchow left a comment


@batleforc Thank you for the PR :)

I assume you're submitting this PR with the hopes of getting it into upstream DWO (and Che) rather than just making changes for your own testing, correct?

Rather than changing the default timeout as part of your PR, my gut instinct is that we should instead expose a configuration option in the DevWorkspaceOperatorConfig that would allow users to customize the healthCheckHttpClient timeout. You could then configure the timeout to your desired value from there.

If you're okay with reworking your PR to do this, let me know and I can help guide you further.
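
(For illustration, a rough sketch of how setupHttpClients could consume such an option; the HealthCheckTimeoutMillis field and the operatorConfig variable are hypothetical and not part of the current DevWorkspaceOperatorConfig API:)

	// Hypothetical sketch: fall back to the current 500ms default when the
	// (not yet existing) HealthCheckTimeoutMillis config field is unset.
	timeout := 500 * time.Millisecond
	if operatorConfig.Workspace.HealthCheckTimeoutMillis != nil {
		timeout = time.Duration(*operatorConfig.Workspace.HealthCheckTimeoutMillis) * time.Millisecond
	}
	healthCheckHttpClient = &http.Client{
		Transport: healthCheckTransport,
		Timeout:   timeout,
	}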

@@ -70,7 +70,7 @@ func setupHttpClients(k8s client.Client, logger logr.Logger) {
 	}
 	healthCheckHttpClient = &http.Client{
 		Transport: healthCheckTransport,
-		Timeout:   500 * time.Millisecond,
+		Timeout:   3 * 500 * time.Millisecond,
Collaborator


This would actually be 1500 ms (1.5s) instead of 3s.
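
(For reference, 3 * 500 * time.Millisecond evaluates to 1500ms; a literal three-second timeout in Go would read:)

	Timeout: 3 * time.Second, // 3s, matching the PR title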

Author


Without a doubt the result of checking out the wrong stash; the build I actually deployed for testing was hard-coded to 3s.

@AObuchow
Collaborator

/ok-to-test

@batleforc
Author

Hi @AObuchow,
This PR is part of an issue on the Che side where I have a problem with a slow CNI.
Your gut instinct is the same as mine: the end goal would be to have this merged with the possibility to set the value, but I was conflicted about how to keep it in sync between the Che Operator and the DevWorkspace Operator.
I'm totally okay with reworking this PR, and if you have time I'd welcome your guidance.

@dkwon17
Collaborator

dkwon17 commented Sep 30, 2024

Instead of increasing the timeout, what about returning a RetryError when the health check fails here, and handling it with checkDWError? That way another reconcile request would be created if the attempt to ping the health endpoint fails.

@batleforc does that work for your use case?

We do something similar when waiting for the workspace deployment to be ready:

return &dwerrors.RetryError{Message: "Deployment is not ready"}

if shouldReturn, reconcileResult, reconcileErr := r.checkDWError(workspace, err, "Error creating DevWorkspace deployment", metrics.DetermineProvisioningFailureReason(err.Error()), reqLogger, &reconcileStatus); shouldReturn {
	reqLogger.Info("Waiting on deployment to be ready")
	reconcileStatus.setConditionFalse(conditions.DeploymentReady, "Waiting for workspace deployment")
	return reconcileResult, reconcileErr
}
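
(A hypothetical sketch of how the same pattern could apply to the health check path; checkServerStatus and the surrounding variable names are assumed here, not quoted from the codebase:)

	// Hypothetical sketch: wrap a failed health-endpoint ping in a RetryError so that
	// checkDWError requeues the workspace instead of failing the reconcile outright.
	serverReady, serverStatusCode, err := checkServerStatus(clusterWorkspace)
	if err != nil {
		err = &dwerrors.RetryError{Message: "Failed to reach workspace health check endpoint"}
		if shouldReturn, reconcileResult, reconcileErr := r.checkDWError(workspace, err, "Error checking server status", metrics.ReasonInfrastructureFailure, reqLogger, &reconcileStatus); shouldReturn {
			return reconcileResult, reconcileErr
		}
	}
	// serverReady / serverStatusCode would then be handled as before (see the 502 case below).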

@batleforc
Author

I think it could fix the problem and totally answers my case @dkwon17, and it could remove the need to change the Che operator source code.
What I really need is to not wait five more minutes when the IDE is already up but the CNI took too long to broadcast the Pod's IP (it annoys me and kind of irritates the users).

@batleforc
Author

batleforc commented Sep 30, 2024

And your approach wouldn't block the operator on this action, which could free up the process for future actions.

@batleforc
Author

batleforc commented Oct 7, 2024

@dkwon17, I've tried to set it up correctly, though I'm not sure it is. I chose to requeue the reconcile after 1 second (it could be less, but I don't know).

Collaborator

@AObuchow AObuchow left a comment


Thank you for the update to your PR.
@batleforc Have you checked whether your latest changes resolve the issue you were encountering?

In my experience, I've seen DWO continuously re-attempt the health check when getting a 502 bad gateway response. However, in these cases you'd see a Main URL server not ready message -- are you not seeing it?

-		return reconcile.Result{}, err
+		if shouldReturn, reconcileResult, reconcileErr := r.checkDWError(workspace, err, "Error checking server status", metrics.ReasonInfrastructureFailure, reqLogger, &reconcileStatus); shouldReturn {
+			reqLogger.Info("Waiting for DevWorkspace to be ready")
+			reconcileStatus.setConditionFalse(conditions.DeploymentReady, "Waiting for DevWorkspace to be ready") // Not sure if conditions.DeploymentReady is the right one
Collaborator


dw.DevWorkspaceReady may be a more appropriate condition (we use it when the health check fails)

Collaborator

@dkwon17 dkwon17 left a comment


Thank you @batleforc !

-		return reconcile.Result{}, err
+		if shouldReturn, reconcileResult, reconcileErr := r.checkDWError(workspace, err, "Error checking server status", metrics.ReasonInfrastructureFailure, reqLogger, &reconcileStatus); shouldReturn {
+			reqLogger.Info("Waiting for DevWorkspace to be ready")
+			reconcileStatus.setConditionFalse(conditions.DeploymentReady, "Waiting for DevWorkspace to be ready") // Not sure if conditions.DeploymentReady is the right one
Collaborator

@dkwon17 dkwon17 Oct 7, 2024


WDYT @AObuchow? After removing the comment, it looks good to me.

I prefer @AObuchow's suggestion above.

Collaborator


After this change, this PR looks good to me :)

@dkwon17
Collaborator

dkwon17 commented Oct 7, 2024

In my experience, I've seen DWO continuously re-attempt the health check when getting a 502 bad gateway response. However, in these cases you'd see a Main URL server not ready -- are you not seeing that message?

I've actually experienced the opposite: after handling all the queued reconciles, a retry doesn't happen after a failed health check.

@dkwon17
Collaborator

dkwon17 commented Oct 7, 2024

@AObuchow I think you're thinking about this case, where the health check endpoint is accessible, but a 502 is returned:

if !serverReady {
	reqLogger.Info("Main URL server not ready", "status-code", serverStatusCode)
	reconcileStatus.setConditionFalse(dw.DevWorkspaceReady, "Waiting for editor to start")
	return reconcile.Result{RequeueAfter: 1 * time.Second}, nil
}

This PR handles the case where there is a timeout when trying to access the health check endpoint.
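
(Put differently, the two failure modes surface at different points of the health check. A sketch, assuming healthCheckHttpClient and a healthzURL variable are in scope:)

	resp, err := healthCheckHttpClient.Get(healthzURL)
	if err != nil {
		// Client-side timeout: the request never completes (e.g. the CNI has not yet
		// propagated the Pod IP), so there is no status code. This is the case the PR targets.
		return false, 0, err
	}
	defer resp.Body.Close()
	// Endpoint reachable but the editor is not up yet: a 502 falls into the
	// "Main URL server not ready" branch above, which already requeues after 1 second.
	return resp.StatusCode == http.StatusOK, resp.StatusCode, nil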

@AObuchow
Collaborator

AObuchow commented Oct 7, 2024

@AObuchow I think you're thinking about this case, where the health check endpoint is accessible, but a 502 is returned:

if !serverReady {
	reqLogger.Info("Main URL server not ready", "status-code", serverStatusCode)
	reconcileStatus.setConditionFalse(dw.DevWorkspaceReady, "Waiting for editor to start")
	return reconcile.Result{RequeueAfter: 1 * time.Second}, nil
}

This PR is to handle the case where there is a timeout when trying to access the health check endpoint

@dkwon17 thank you for the confirmation, yes you're right - I believe that's the case I was thinking about.

@openshift-ci openshift-ci bot removed the lgtm label Oct 8, 2024
@batleforc
Author

Hi @AObuchow @dkwon17,
I didn't have time to try it yet (it was pretty late). I will try to set it up in both instances that encounter the problem during the following day and will get back to you as fast as possible.

Collaborator

@AObuchow AObuchow left a comment


@batleforc this looks just about good to me to merge 😁 Thank you so much for your contribution.

One last request: Could you please re-organize your commits from this PR?
Here are my suggestions:

  • Squash all 3 of your commits into a single commit, as they are all iterations on the change introduced by this PR
  • Change the final commit message to feat: queue workspace reconcile if workspace health check encounters an error. In the description, add fix #1325.

The final commit message should resemble something like:

feat: queue workspace reconcile if workspace health check encounters an error

fix #1325

Signed-off-by: Max batleforc <[email protected]>

@AObuchow
Collaborator

AObuchow commented Oct 8, 2024

Hi @AObuchow @dkwon17 , Didn't have time to try it yet (was pretty late), I will try to set it up in both instance that encounter the problem during the following day and will come back to you as fast as possible.

Sounds great to me :) I hope this resolves your issue. FWIW: This PR is approved and can be merged (after my final request on cleaning up the commit log) :)

@batleforc
Author

So, I squashed the commits, but I still need to test it on the last env that has this problem quite frequently (a Devspaces 3.16.1 instance updated recently).

@AObuchow
Collaborator

AObuchow commented Oct 9, 2024

@batleforc Thank you for updating your PR, sounds good to me. Keep us updated when you have a moment :) It's greatly appreciated.

Collaborator

@AObuchow AObuchow left a comment


Some last-minute comments, but looks great overall @batleforc :)

controllers/workspace/devworkspace_controller.go
-		return reconcile.Result{}, err
+		if shouldReturn, reconcileResult, reconcileErr := r.checkDWError(workspace, err, "Error checking server status", metrics.ReasonInfrastructureFailure, reqLogger, &reconcileStatus); shouldReturn {
+			reqLogger.Info("Waiting for DevWorkspace to be ready")
+			reconcileStatus.setConditionFalse(dw.DevWorkspaceReady, "Waiting for DevWorkspace to be ready") // Not sure if conditions.DeploymentReady is the right one
Collaborator


@dkwon17 @batleforc Any thoughts on whether we should change the DevWorkspace status message & log here ("Waiting for DevWorkspace to be ready") to something more specific to the health check timing out?

Maybe something like:

reqLogger.Info("Waiting for DevWorkspace health check endpoint to become available")
reconcileStatus.setConditionFalse(dw.DevWorkspaceReady, "Waiting for workspace health check to become available") 


openshift-ci bot commented Oct 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AObuchow, batleforc, dkwon17, ibuziuk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Successfully merging this pull request may close these issues:

  • Queue workspace for reconcile if workspace health check encounters an error

4 participants