[FEATURE] Automatic resource adjustments for streaming jobs #129

s-vitaliy · 2024-10-01T13:32:45Z

Description

It would be nice if, when a streaming job is killed by an OOM event, the Arcane operator could recreate the stream with increased resources.

Possible solution

1. Multiple job references

For each streaming job, add an annotation that records which job template was used to construct the job.
Replace the single reference in jobTemplateRef and backfillJobTemplateRef with an ordered array of references.
If a job fails with a specific exception, exit code, etc., select the next job template from this array.

This approach is not backward-compatible, but it is not limited to OOM events: it can also handle other errors, for example automatically moving a job from a failing AZ to another AZ.
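A minimal sketch of the template selection step, in Python for illustration only (not the operator's actual code). The annotation key `arcane.example/job-template`, the function name `next_template`, and the `failure_is_retryable` flag are all hypothetical; the issue does not define them.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical annotation key; the issue does not name one.
TEMPLATE_ANNOTATION = "arcane.example/job-template"


def next_template(
    template_refs: Sequence[str],
    job_annotations: Mapping[str, str],
    failure_is_retryable: bool,
) -> Optional[str]:
    """Pick the job template for the next run.

    Returns None when the ordered list of templates is exhausted.
    """
    current = job_annotations.get(TEMPLATE_ANNOTATION)
    if current not in template_refs:
        # Missing or unknown annotation: start from the first template.
        return template_refs[0] if template_refs else None
    if not failure_is_retryable:
        # The failure is not one we escalate on: keep the same template.
        return current
    index = template_refs.index(current)
    if index + 1 < len(template_refs):
        return template_refs[index + 1]
    return None
```

With `["small", "medium", "large"]` as the ordered references, a job annotated with `small` that fails with a matching error would be recreated from `medium`. What happens once the list is exhausted (here: `None`) is a design decision the issue leaves open.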

2. Scale factor

Add a scalingFactor field to the SD.
If a job fails with OOM, multiply its resource demands by that scalingFactor.

This is easier to implement, but not as flexible as the multiple job references approach. On the other hand, it does not require increasing the number of job templates in the cluster.
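A rough sketch of the scaling step, again in Python for illustration only. The helper names are hypothetical, only memory quantities with a subset of Kubernetes suffixes are handled, and rounding the result up to whole `Mi` is an assumption rather than something the issue specifies. The only Kubernetes fact relied on is that an OOM-killed container reports the termination reason `OOMKilled`.

```python
import math
import re
from typing import Mapping

# Subset of the suffixes used by Kubernetes resource quantities.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def is_oom_killed(terminated_state: Mapping[str, object]) -> bool:
    """Check the `terminated` part of a container status for an OOM kill."""
    return terminated_state.get("reason") == "OOMKilled"


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|k|M|G|T)?", quantity)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(float(number) * _SUFFIXES[suffix or ""])


def scale_memory(quantity: str, scaling_factor: float) -> str:
    """Multiply a memory quantity by scaling_factor, rounded up to whole Mi."""
    scaled_bytes = parse_memory(quantity) * scaling_factor
    return f"{math.ceil(scaled_bytes / 2**20)}Mi"
```

For example, a job requesting `512Mi` with a scalingFactor of 1.5 would be recreated with `768Mi`. A real implementation would probably also need an upper bound so that repeated OOM failures cannot grow the request indefinitely.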

Alternatives

Use VerticalPodAutoscaler

Context

No response

s-vitaliy added the code/new-feature New feature or request label Oct 1, 2024

s-vitaliy added this to Arcane Oct 1, 2024

s-vitaliy moved this to Backlog in Arcane Oct 1, 2024