Replies: 1 comment
Implemented in #289
Problem
Currently, the Hatchet Python, TypeScript, and Go SDKs support defining workflows as a Directed Acyclic Graph (DAG), where steps are declared upfront and dependencies between steps are specified. However, there are scenarios where the number of child workflows needed is not known until runtime, and the results of these child workflows need to be joined and processed by a parent workflow. The SDKs currently lack the ability to dynamically spawn child workflows based on runtime conditions and to join their results before downstream steps continue executing.
Example Use Cases
For example, a parent workflow may need to process a batch of records whose size is only known at runtime, spawning one child workflow per record and aggregating their results before continuing.
Proposed Solution
Extend the Hatchet Python, TypeScript, and Go SDKs to support procedural child workflow spawning (fanout) and joining of the results of these child workflows. This will involve introducing new SDK methods and modifying the workflow execution engine to handle dynamic child workflow creation and result joining.
By exposing a spawn_workflow method on the step context, we're able to reference the child from the parent workflow run. In other words, we'll be able to trace the workflow state for the entire parent-and-child invocation.
Proposed SDK Design
Python
Signatures:
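A rough sketch of what the Python signature could look like; the parameter names, types, and return value here are illustrative assumptions, not the final SDK signature:

```python
# Illustrative sketch only: parameter names, types, and the return value are
# assumptions about the proposed API, not a confirmed Hatchet SDK signature.
from typing import Any, Awaitable


class Context:
    def spawn_workflow(
        self,
        workflow_name: str,       # registered child workflow to spawn
        input: dict[str, Any],    # input payload for the child run
        key: str | None = None,   # optional idempotency key (see durability notes below)
    ) -> Awaitable[dict[str, Any]]:
        """Spawn a child workflow run linked to the current parent run and
        return an awaitable ("promise") that resolves with the child's result."""
        ...
```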
Example Usage:
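A minimal usage sketch of the fanout-and-join pattern, assuming a context.workflow_input() accessor and an awaitable return value from spawn_workflow (both are assumptions, not confirmed API):

```python
# Illustrative usage sketch; workflow_input() and the awaitable returned by
# spawn_workflow() are assumptions about the proposed API.
import asyncio


async def spawn_children(context) -> dict:
    # The number of child workflows is only known at runtime.
    items = context.workflow_input()["items"]

    # Fanout: dispatch one child workflow run per item. The key argument keeps
    # re-spawning idempotent if this step is retried (see durability notes below).
    children = [
        context.spawn_workflow("process-item", {"item": item}, key=f"item-{i}")
        for i, item in enumerate(items)
    ]

    # Join: wait for every child run to finish and collect the results.
    results = await asyncio.gather(*children)
    return {"results": results}
```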
TypeScript
Signatures:
Example Usage:
Workflow Execution Engine Modifications
The workflow execution engine will need to be modified to track child workflow runs created at runtime via the context.spawn_workflow() method and to link each child run back to its parent run.
Risks/Unknowns
One open question is how dynamically spawned child workflows interact with concurrency controls; for example, a user may need to set a workflow's concurrency limit (maxRuns) to 1 to ensure sequential execution.
Notes on Durability and Idempotency
To ensure durability and idempotency in child workflow spawning, we propose two strategies:
Input Hashing:
- When spawn_workflow is called with specific input data, the input data is hashed to generate a unique identifier.
- The parent workflow run ID (workflowRunId) and the input hash are used to check for collisions.
- If a child workflow with the same workflowRunId and input hash already exists, instead of spawning a new child workflow, the method subscribes to the existing child workflow and returns its promise.
Key Argument:
- The spawn_workflow method accepts an optional key argument that allows the user to provide a custom identifier for the child workflow.
- If a key argument is provided, it overrides the default behavior of input hashing.
- The parent workflow run ID (workflowRunId) and the provided key are used to check for collisions.
- If a child workflow with the same workflowRunId and key already exists, the method subscribes to the existing child workflow and returns its promise.
After all child workflows are dispatched, the parent workflow should use context.join (which calls Promise.all or asyncio.gather under the hood) to wait for the completion of all child workflows and retrieve their results. This ensures that the parent workflow progresses only when all child workflows have finished executing. If the parent workflow step fails and retries, the join will resume from the last spawned state.
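A minimal sketch of the input-hashing and key strategy described above; the function name and identifier format are illustrative, and a real engine would persist these identifiers rather than compute them in memory:

```python
# Sketch of deriving a stable dispatch identifier for collision checks.
# The name child_dispatch_id and the "parent:hash" format are assumptions
# made for illustration, not the engine's actual scheme.
import hashlib
import json


def child_dispatch_id(parent_run_id: str, input: dict, key: str | None = None) -> str:
    """Derive a stable identifier for a child workflow dispatch.

    A user-supplied key overrides input hashing; otherwise the input is
    canonicalized and hashed so the same (parent run, input) pair maps to the
    same identifier across retries.
    """
    if key is not None:
        return f"{parent_run_id}:{key}"
    canonical = json.dumps(input, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode()).hexdigest()
    return f"{parent_run_id}:{digest}"


# Retrying the parent step reproduces the same identifiers, so the engine can
# detect the collision and subscribe to the existing child run instead of
# spawning a duplicate.
assert child_dispatch_id("run-123", {"item": 42}) == child_dispatch_id("run-123", {"item": 42})
```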
UI Considerations
Representing a large number of spawned child workflows in the parent workflow's DAG view can be challenging. Instead of cluttering the DAG with numerous child workflow nodes, the spawn_workflows step can render a table or list of "spawned" or "linked" child workflow runs. This table provides an overview of the spawned child workflows' status and progress, with links to individual runs. Clicking a run in the table opens a modal or navigates to that child workflow's detailed view, showing the child's own DAG, steps, and execution details.