Automated pipeline shutdown #10

Open · ztaylor54 opened this issue Jan 14, 2022 · 0 comments
Labels: enhancement (New feature or request), needs:triage

@ztaylor54 (Contributor) commented:

💪 Motivation

I've toyed with the idea of a pipeline that operates like a Kubernetes Job and automatically 'shuts down', deallocating its cluster resources. This would be useful for one-off processing jobs, and really for any pipeline that doesn't need to hang around after it has done its allotted work.

For example, one might create a pipeline to perform content validation on a large dataset. Given the nature of KDP's parallel execution & pipelines' tendency to have large swathes of pods (often in the thousands) with heterogeneous loads, it can be challenging to know when a pipeline is "finished."

Additionally, uneven processing loads can waste resources: most of a step's pods may churn through their input quickly and then sit idle while the handful of pods with heavier inputs prevent the user from shutting down the pipeline. It would be far better for pods to know they've reached the end of their input and then complete, no longer taking up valuable cluster resources.

📖 Additional Details

I see a few options for implementation - we've already used a similar pattern for DataInputs, which are by default ephemeral and only take up resources while they are running. Pipelines complicate the matter, as it is not straightforward for a pod listening to an input queue to know when it should stop listening. Note that a lack of inputs in the queue could simply mean that a previous step is taking a while, or that a conditional branch hasn't been met yet.

There are two components to the solution:

  • Specifying this type of pipeline to KDP
  • Facilitating automated shutdown

Specification

For specification, I currently see the following options:

  1. Create a new CRD that ties a Pipeline and any number of DataInputs together. This would make it easy to determine when no input remains, as we could simply wait for all DataInputs to complete. It would also have the added benefit of being a single resource whose status could be easily monitored, whether by KDP or by external user code.
  2. Add a flag to a DataInput that indicates it should shut down the pipeline after completion. This could lead to headaches if a user runs a shutdown DataInput and it completes before other critical DataInputs. It also might complicate use cases where it is undesirable for users to be able to shut down the pipeline, such as when a KDP pipeline is offered as a managed service.

Option (1) will likely avoid the most headaches, and would allow the shutdown logic to be handled by the Operator rather than by the DataInputs (which seems more 'correct').
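
To make option (1) concrete, here's a minimal sketch of what such a CRD's Go types might look like in a controller-runtime style Operator. The PipelineJob name and every field here are hypothetical, not an existing KDP API:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PipelineJobSpec ties one Pipeline to the DataInputs that feed it.
type PipelineJobSpec struct {
	// PipelineRef names the Pipeline to run to completion.
	PipelineRef string `json:"pipelineRef"`

	// DataInputRefs lists the DataInputs whose completion means
	// no further input will arrive.
	DataInputRefs []string `json:"dataInputRefs"`
}

// PipelineJobStatus gives KDP or user code a single resource to watch
// to see whether the pipeline has finished.
type PipelineJobStatus struct {
	// Completed is set once all DataInputs have finished and all
	// pipeline pods have exited.
	Completed bool `json:"completed"`

	// CompletionTime records when the pipeline shut down, if it has.
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`
}

// PipelineJob is the proposed resource tying a Pipeline and any number
// of DataInputs together.
type PipelineJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PipelineJobSpec   `json:"spec,omitempty"`
	Status PipelineJobStatus `json:"status,omitempty"`
}
```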

Implementation

There are a few ways to shut down a pipeline, but the easiest and 'best' is likely a poison pill. Once we know (in the Operator, most likely) that the pipeline won't receive any more inputs, we can seed the root input queue with a signal that the Manager knows to propagate to all output queues, regardless of conditional branches. Because pipelines are DAGs (and thus all nodes are reachable from the root), we are guaranteed that every node will see the poison pill.
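
A rough sketch of the Operator-side trigger, building on the hypothetical PipelineJob type above; the Queue abstraction, Message type, and complete callback are all illustrative assumptions, not existing KDP code:

```go
package operator

import "context"

// Queue is an assumed abstraction over a pipeline input queue.
type Queue interface {
	Enqueue(ctx context.Context, msg Message) error
}

// Message is the unit flowing through the queues. Edge identifies which
// incoming edge a message arrived on (used by the node sketch below).
type Message struct {
	PoisonPill bool
	Edge       string
	Payload    []byte
}

// maybeSeedPoisonPill would run inside the Operator's reconcile loop:
// once every DataInput referenced by the PipelineJob has completed, it
// injects a single poison pill at the root of the pipeline DAG.
func maybeSeedPoisonPill(ctx context.Context, pj *PipelineJob, root Queue, complete func(dataInput string) bool) error {
	for _, name := range pj.Spec.DataInputRefs {
		if !complete(name) {
			// A DataInput is still producing; reconcile again later.
			return nil
		}
	}
	return root.Enqueue(ctx, Message{PoisonPill: true})
}
```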

We have to be slightly clever about node behavior on seeing a poison pill, however. Node resources (pods) cannot shut down until they have seen a poison pill from each incoming edge. So long as we can get an accurate count, pods will be able to determine when they have seen the last of the data and can safely shut down. This means that in the case described above, idle pods will quickly shut down.
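
Here's a sketch of that per-edge bookkeeping, reusing the hypothetical Queue and Message types from the previous sketch. It simply assumes the queue layer delivers exactly one pill per incoming edge to each pod; getting that count accurate is the real work:

```go
package manager

import "context"

// node is a single pod's view of its DAG node: how many edges feed it,
// which of those edges have delivered a poison pill, and where its
// outputs go. The shape of this struct is illustrative.
type node struct {
	name      string
	inEdges   int
	pillsSeen map[string]bool
	outputs   []Queue // one queue per outgoing edge
}

// consume handles one message and returns true once the pod may exit,
// i.e. once a poison pill has arrived from every incoming edge.
func (n *node) consume(ctx context.Context, msg Message) (bool, error) {
	if !msg.PoisonPill {
		n.process(msg)
		return false, nil
	}
	n.pillsSeen[msg.Edge] = true
	if len(n.pillsSeen) < n.inEdges {
		// Another upstream edge may still send data; keep listening.
		return false, nil
	}
	// Every incoming edge is drained: propagate the pill on all output
	// queues, ignoring conditional branches, then signal shutdown.
	for _, out := range n.outputs {
		if err := out.Enqueue(ctx, Message{PoisonPill: true, Edge: n.name}); err != nil {
			return false, err
		}
	}
	return true, nil
}

func (n *node) process(msg Message) { /* step-specific work elided */ }
```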

After all pods have exited, the pipeline will be completed and will remain in k8s as a "completed" resource.

There are a few caveats to this general approach, but it's a good start. One is that pipelines are by default implemented as parallel Jobs with no spec.completions, so we might run into trouble if one pod exits while others are still doing work, logging, etc. We will likely need to set spec.completions on the underlying Jobs when we want automatic shutdown.
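
Concretely, that fix would look something like the following when the Operator builds a step's Job (batchv1 is the real Kubernetes API; deriving podCount from the step is assumed):

```go
package operator

import batchv1 "k8s.io/api/batch/v1"

// jobSpecForStep pins both parallelism and completions to the step's pod
// count, so the Job completes once every pod has exited successfully
// instead of running as an open-ended parallel Job.
func jobSpecForStep(podCount int32) batchv1.JobSpec {
	return batchv1.JobSpec{
		Parallelism: ptr(podCount),
		Completions: ptr(podCount),
		// Template (the step's pod spec) omitted for brevity.
	}
}

func ptr(n int32) *int32 { return &n }
```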

@ztaylor54 added the enhancement and needs:triage labels on Jan 14, 2022
@ztaylor54 self-assigned this on Jan 14, 2022