Workflow executions are getting stuck #209

Open
arorashivam opened this issue Jul 16, 2024 · 1 comment

Comments

@arorashivam

Describe the bug
Workflow executions are getting stuck due to tasks taking too long to schedule.

Further debugging details:

  1. In the sweeper flow, if a task is in the SCHEDULED state and no taskDefinition is present, the un-ack time is set to workflowTimeout. In other words, the sweeper will not sweep this workflow again until workflowTimeout has elapsed.
  2. Note: I am not sure whether the un-ack timeout is reset once the task moves from SCHEDULED to IN_PROGRESS.
  3. As a result, any workflow execution that reaches a state where it depends on the sweeper to trigger the decide remains stuck.
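
To make the fallback in point 1 concrete, here is a minimal sketch of the behavior as described in this issue. It is not Conductor's actual sweeper code: the class name, the method name, and the rule that a task-level timeout wins when present are all assumptions made for illustration.

```java
import java.util.OptionalLong;

// Illustrative sketch only. The class and method names are hypothetical
// and do not correspond to Conductor's real sweeper implementation.
public class SweeperUnackSketch {

    /**
     * Returns the delay (in seconds) before the sweeper would revisit a
     * workflow whose current task is in the SCHEDULED state.
     *
     * Assumption: a task-level timeout is used when a task definition is
     * present; otherwise the workflow timeout is the fallback, as the
     * issue describes.
     */
    public static long unackSecondsForScheduledTask(
            OptionalLong taskDefTimeoutSeconds, long workflowTimeoutSeconds) {
        return taskDefTimeoutSeconds.orElse(workflowTimeoutSeconds);
    }

    public static void main(String[] args) {
        long oneDay = 24L * 60 * 60;

        // No task definition: the next sweep is pushed out by the full workflow timeout.
        System.out.println(unackSecondsForScheduledTask(OptionalLong.empty(), oneDay));

        // Task definition with a 60-second timeout: the next sweep happens much sooner.
        System.out.println(unackSecondsForScheduledTask(OptionalLong.of(60), oneDay));
    }
}
```

With a one-day workflow timeout and no task definition, the sketch yields a one-day delay before the next sweep, which matches the "stuck" behavior reported here.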

Details
Conductor version: 3.20.0
Persistence implementation: Postgres
Queue implementation: Dynoqueues
Lock: Redis
Workflow definition: N/A
Task definition: N/A
Event handler definition: N/A

Expected behavior
The sweeper should continue sweeping a workflow once a task moves from SCHEDULED to IN_PROGRESS.

@lbestatlas
Contributor

I'd like to add some additional context to this issue.

As noted above,

In sweeper flow, If a task is in SCHEDULED state, the un-ack time is set as workflowTimeout if taskDefinition is not present.
In other words the sweeper will now only sweep this workflow after workflowTimeout.

This issue has been observed for async system tasks, but it could also occur for SIMPLE tasks if no timeouts are set on the TaskDefinition but a timeout is set on the workflow. These types of tasks do not transition from SCHEDULED to IN_PROGRESS within a "decide", so the sweep can pick them up while they are still in the SCHEDULED state.

Having a timely workflow sweep is critical when an execution lock cannot be obtained for some reason, because the decide is deliberately deferred to the sweep in that case. Furthermore, we have seen issues with the JOIN task when it was set to synchronous, as it does not trigger a decide when it completes (this was resolved when it was reverted to async).

It seems like there should be another setting, "maxSweepDelay", to use as the fallback un-ack time, set at the workflow level, the system level, or both.
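
A rough sketch of how such a setting might be resolved is below. Everything in it is hypothetical: "maxSweepDelay" does not exist in Conductor today, and the precedence shown (task-level timeout first, then a workflow-level maxSweepDelay, then a system-level default) is only one possible design. Whether the result should additionally be capped by workflowTimeout is left open.

```java
import java.util.OptionalLong;

// Hypothetical sketch of the proposed "maxSweepDelay" fallback.
// None of these names exist in Conductor; they only illustrate the idea.
public class MaxSweepDelaySketch {

    /**
     * Resolves the un-ack delay (in seconds) for a workflow whose task is SCHEDULED.
     *
     * Assumed precedence:
     *   1. the task-level timeout, if a task definition provides one;
     *   2. otherwise the workflow-level maxSweepDelay, if set;
     *   3. otherwise the system-level maxSweepDelay.
     */
    public static long unackSeconds(
            OptionalLong taskDefTimeoutSeconds,
            OptionalLong workflowMaxSweepDelaySeconds,
            long systemMaxSweepDelaySeconds) {
        long fallback = workflowMaxSweepDelaySeconds.orElse(systemMaxSweepDelaySeconds);
        return taskDefTimeoutSeconds.orElse(fallback);
    }

    public static void main(String[] args) {
        long systemDefault = 30; // hypothetical system-wide setting

        // No task definition and no workflow override: the system default applies.
        System.out.println(unackSeconds(OptionalLong.empty(), OptionalLong.empty(), systemDefault));

        // No task definition, workflow-level override of 120 seconds.
        System.out.println(unackSeconds(OptionalLong.empty(), OptionalLong.of(120), systemDefault));

        // A task-level timeout still takes priority over both fallbacks.
        System.out.println(unackSeconds(OptionalLong.of(10), OptionalLong.of(120), systemDefault));
    }
}
```

The point of the sketch is that the fallback no longer depends on workflowTimeout, so a workflow with a long timeout would still be swept within a bounded delay.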