Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend workflowcheck tool to detect reads from (WorkflowExecution).RunId and suggest alternatives. #1385

Closed
askreet opened this issue Feb 9, 2024 · 12 comments
Labels
enhancement New feature or request

Comments

@askreet
Copy link

askreet commented Feb 9, 2024

Is your feature request related to a problem? Please describe.

I've recently learned that using the value of RunId within workflow task code is non-deterministic under workflow retry reset scenarios.

Describe the solution you'd like

I'd love for the workflowcheck tool to detect reads of this particular property within workflow functions and suggest use of the WorkflowInfo.OriginalRunId (or FirstRunId) field instead.

Describe alternatives you've considered

N/A.

Additional context

N/A.

@askreet askreet added the enhancement New feature or request label Feb 9, 2024
@Quinn-With-Two-Ns
Copy link
Contributor

Can you clarify the scenario? Workflow retry starts a whole new workflow so the fact that RunID changes should not lead to non determinism

@askreet
Copy link
Author

askreet commented Feb 9, 2024

@Quinn-With-Two-Ns It leads to non-determinism if you use the value of RunID as an input to an activity or child workflow. For example, we were rendering the URL to the Temporal UI before kicking off a child workflow that creates a pull request, linking our developers back to the workflow that was created for them. In my scenario, the workflow had previously passed this step successfully so during replay the input value to the child workflow had changed.

Edit: In fact this journey led me this field, which documents itself as a more viable alternative.

@Quinn-With-Two-Ns
Copy link
Contributor

Activities and Child workflows are not carried over retries. Non determinism only applies to a single workflow execution. A retried workflow can take a totally different path then the parent. It sounds like your trying to use RunID as an idempotency token across workflow retries, which is an error in your code, but is not non determinism.

The workflow check is designed to catch non determinism errors not all possible programming errors,

@Quinn-With-Two-Ns
Copy link
Contributor

Note: in general we do not recommend workflow retries, we recommend handling errors in your workflow.

@askreet
Copy link
Author

askreet commented Feb 9, 2024

Activities and Child workflows are not carried over retries.

I don't follow what you're saying here.

The input values need to match during the execution of a workflow task, or you get a non-determinism error when you reach the point in the workflow task when it attempts to run the activity or child workflow, right? In my workflow I execute a child workflow at a point in the event stream, and during the retry the input value is not the same as it was during the previous workflow task because it refers to a Run ID, which has changed due to the retry.

To be clear, I don't expect (or desire) that the child workflow is actually run again - we're talking about replaying the event stream to determine the next action for the workflow to take.

Note: in general we do not recommend workflow retries, we recommend handling errors in your workflow.

In my case I had a workflow fail near the end of it's run due to a program defect. I've corrected the defect and would like to replay the workflow from the point of failure. Are you suggesting that the workflow retry command is not intended for me to do that? What alternative methods are suggested for resuming my terminated, failed workflow execution so that I can complete the intended actions?

Edit: Wait, I now see the confusion in the second scenario - I've been saying "retry", I mean workflow "reset". That's what I'm trying to do 😓. I am not using workflow retries at all.

@Quinn-With-Two-Ns
Copy link
Contributor

Ah OK, reset makes more sense

The input values need to match during the execution of a workflow task, or you get a non-determinism error when you reach the point in the workflow task when it attempts to run the activity or child workflow, right?

No, input is not checked when the SDK is validating a workflow is deterministic.

@askreet
Copy link
Author

askreet commented Feb 9, 2024

No, input is not checked when the SDK is validating a workflow is deterministic.

Oh? If that's so that I actually am back at square one and don't understand this determinism error! Will keep digging.

@Quinn-With-Two-Ns
Copy link
Contributor

Only the Activity ID and Type is checked I believe

@askreet
Copy link
Author

askreet commented Feb 9, 2024

@Quinn-With-Two-Ns You're definitely pointing me in the right direction here, I updated my code to use OriginalRunId and I'm still seeing this error:

unknown command CommandType: ChildWorkflow, ID: 70b7166d-0175-40bb-9ec0-e0f6473625c1_29, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition

That ID is the original Run ID, plus the Event ID where the child was spawned. My current thesis is that since I'm not setting a workflow ID on the child workflow execution, it's comparing a newly generated child workflow ID that does not match that value.

@askreet
Copy link
Author

askreet commented Feb 9, 2024

I think the interesting piece of code is: https://github.com/temporalio/sdk-go/blob/master/internal/internal_event_handlers.go#L538-L540.

func (wc *workflowEnvironmentImpl) ExecuteChildWorkflow(
	params ExecuteWorkflowParams, callback ResultHandler, startedHandler func(r WorkflowExecution, e error),
) {
	if params.WorkflowID == "" {
		params.WorkflowID = wc.workflowInfo.WorkflowExecution.RunID + "_" + wc.GenerateSequenceID()
	}

If I understand this correctly, any child workflow execution without a deterministically set Workflow ID will make the parent execution not "reset safe".

@Quinn-With-Two-Ns
Copy link
Contributor

Ah if your using child workflow without an ID then this is a know issue #723.

I would recommend setting your own ID for now since as the linked issue says the SDK generated ID needs to not conflict with future server work that is not currently prioritized.

@askreet
Copy link
Author

askreet commented Feb 9, 2024

Awesome, thanks @Quinn-With-Two-Ns! I'll update our internal guidelines to always set child workflow IDs in the meantime. I wonder if that is a useful check for workflowcheck in the meantime if we don't have a timeline for the blocking server work? Anyway, I'll close this out for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants