Provide inputs to non-root nodes #9

ztaylor54 · 2022-01-13T23:53:42Z

💪 Motivation

Situations often arise where it would be useful to inject inputs at a node further down the pipeline than the root node. This is particularly helpful during testing, where the behavior of an individual pipeline step needs to be examined without first running data and inputs through the entire pipeline.

It is also useful when pipeline steps fail or must be re-run due to misconfiguration or other issues, such as a failure in an externally-configured service. In cases like these, it would be desirable to execute a partial re-run of a pipeline, starting from where the previous run left off. This would avoid duplication of (possibly expensive) work performed by earlier pipeline steps.

Note: The use-case for a partial re-run likely warrants some method of "replaying" pipeline inputs - this could be achieved by caching inputs in the manager's work queues, or something similar.

📖 Additional Details

For a more concrete example, consider the following pipeline:

              +-------+      +--------+      +------+
 datainput -> | start | ---> | middle | ---> | last | -> end
              +-------+      +--------+      +------+

If an error occurs in middle, we might have reason to send data from the datainput directly to middle, thus bypassing start. This might be implemented in a DataInput spec as follows:

spec:
  data: 
    <data block>
  target: middle # add target: <node>

This would result in the DataInput's container pushing data to middle's work queue instead of the root node's.
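As a rough illustration of the routing decision, here is a minimal Python sketch. The `PIPELINE` dict, `find_root`, and `resolve_target` are hypothetical names invented for this example and are not part of the existing codebase. It assumes the spec is available as a plain dict and that an omitted target falls back to the root node.

```python
# Hypothetical pipeline graph: node name -> list of downstream node names.
PIPELINE = {"start": ["middle"], "middle": ["last"], "last": []}


def find_root(pipeline):
    """Return the single node that has no incoming edges."""
    downstream = {n for targets in pipeline.values() for n in targets}
    roots = [n for n in pipeline if n not in downstream]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root node, found {roots}")
    return roots[0]


def resolve_target(spec, pipeline):
    """Pick the node whose work queue the DataInput should push to.

    Falls back to the root node when the spec has no `target` key
    (an assumed default for this sketch).
    """
    target = spec.get("target")
    if target is None:
        return find_root(pipeline)
    if target not in pipeline:
        raise ValueError(f"unknown target node: {target!r}")
    return target
```

Rejecting an unknown target up front seems preferable to silently falling back to root, since a typo in the spec would otherwise re-run the whole pipeline.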

There are a few considerations / caveats:

- The DataInput schema will need to be updated to include the target: <node> option, which specifies that the output queue of the DataInput should be something other than the root node's. It will default to the root node if target is not specified.
- With the current implementation, a given node may have more than one work queue (one per incoming edge), which it reads inputs from in round-robin order. Shortcut inputs could be evenly distributed across those queues, all put into one queue, or handled separately; the correct approach is unclear.
- While the DataInput can fairly easily be configured to pass data to a different step in the pipeline, it is less straightforward to get the underlying container to pass inputs that middle would care about (i.e. emulate start's output).
  - This is where an input "replay" will come in handy, but there is still the case where inputs are unavailable, such as during a test of a single pipeline step. This likely requires a new DataInput container to be created specifically for this purpose.
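To make the three queue-distribution options from the second consideration easier to compare, here is a small Python sketch. The `distribute` function, the strategy names, and the `"shortcut"` queue name are all made up for illustration. Work queues are modeled as in-memory deques, which is an assumption and not how the manager necessarily stores them.

```python
from collections import deque
from itertools import cycle


def distribute(items, queues, strategy):
    """Place shortcut inputs into a node's incoming work queues.

    queues:   dict of queue name -> deque (one per incoming edge)
    strategy: "even"     - spread items round-robin across existing queues
              "single"   - put every item into one existing queue
              "separate" - put every item into a dedicated extra queue
    """
    if strategy == "separate":
        # "shortcut" is a hypothetical name for the dedicated queue.
        queues.setdefault("shortcut", deque()).extend(items)
        return queues
    if not queues:
        raise ValueError("node has no incoming work queues")
    names = sorted(queues)  # sorted for a deterministic order
    if strategy == "even":
        for name, item in zip(cycle(names), items):
            queues[name].append(item)
    elif strategy == "single":
        queues[names[0]].extend(items)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return queues
```

The "separate" option is the only one that works for a node with no incoming edges, which might matter if a pipeline step is tested in isolation.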

ztaylor54 · 2022-01-14T00:07:24Z

Adding some more thoughts on the replay option. I think this is the easiest case (for the user, at least):

We might support a replay key in the DataInput spec that links to the ID of a previously-run DataInput:

spec:
  replay: <datainput-id>
  target: middle # add target: <node>

This would eliminate the need for an image specification, and a replay of inputs for the past DataInput would be handled by internal mechanisms (likely on the manager). We'd need to keep track of DataInput IDs during execution, but that wouldn't be an issue.
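One possible shape for the internal replay mechanism, sketched in Python. `ReplayCache` and its methods are hypothetical and only meant to show the bookkeeping. The sketch assumes the manager records every item pushed to each node's work queue, keyed by DataInput ID, so that a replay targeting middle re-sends what middle originally received instead of the raw DataInput payload.

```python
from collections import deque


class ReplayCache:
    """Remembers what each DataInput run pushed to each node's work queue."""

    def __init__(self):
        # (datainput_id, node) -> list of items, in arrival order
        self._items = {}

    def record(self, datainput_id, node, item):
        """Call whenever an item is enqueued for `node` during a run."""
        self._items.setdefault((datainput_id, node), []).append(item)

    def replay(self, datainput_id, node, queue):
        """Re-enqueue the recorded items for `node`; return how many."""
        key = (datainput_id, node)
        if key not in self._items:
            raise KeyError(f"nothing recorded for {datainput_id!r} at {node!r}")
        items = self._items[key]
        queue.extend(items)
        return len(items)
```

Usage would look roughly like this:

```python
cache = ReplayCache()
cache.record("di-1", "middle", {"row": 1})
cache.record("di-1", "middle", {"row": 2})

middle_queue = deque()
cache.replay("di-1", "middle", middle_queue)  # middle_queue now holds both rows
```

This only helps when a previous run got far enough to populate the target node's queue, so the single-step testing case would still need a separate solution.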

ztaylor54 added enhancement New feature or request needs:triage labels Jan 13, 2022

ztaylor54 self-assigned this Jan 13, 2022
