[FEATURE] Automatic resource adjustments for streaming jobs #129

Open
s-vitaliy opened this issue Oct 1, 2024 · 0 comments
Labels
code/new-feature New feature or request


s-vitaliy commented Oct 1, 2024

Description

It would be nice if, in case of OOM events, the Arcane operator could recreate the stream with increased resources.

Possible solution

  1. Multiple job references
  • For each streaming job, add an annotation that describes which job template was used to construct the job
  • Replace the single reference in jobTemplateRef and backfillJobTemplateRef with an ordered array of references
  • If a job fails with a specific exception, exit code, etc., select the next job template from this array.

This approach is not backward-compatible, but it can handle more than OOM events. For example, it could automatically move a job out of a failing AZ into another AZ.
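
A minimal sketch of the template fallback described above, in Python for illustration only (it is not taken from the operator's code). The annotation key `arcane.example/job-template`, the function name, and the set of exit codes that trigger escalation are all assumptions. Exit code 137 (128 + SIGKILL) is what an OOM-killed container usually reports.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical annotation key recording which template was used to build the job.
TEMPLATE_ANNOTATION = "arcane.example/job-template"

# Exit codes that cause a move to the next template (assumed set).
# 137 = 128 + SIGKILL, the usual exit code of an OOM-killed container.
ESCALATING_EXIT_CODES = frozenset({137})


def select_next_template(
    template_refs: Sequence[str],
    annotations: Mapping[str, str],
    exit_code: int,
) -> Optional[str]:
    """Pick the template for the next run of a failed job.

    Returns None when the ordered list is empty or already exhausted.
    """
    if not template_refs:
        return None
    current = annotations.get(TEMPLATE_ANNOTATION)
    if current not in template_refs:
        # Unknown or missing annotation: start from the first template.
        return template_refs[0]
    if exit_code not in ESCALATING_EXIT_CODES:
        # Failure is not one we escalate on: retry with the same template.
        return current
    index = template_refs.index(current)
    if index + 1 < len(template_refs):
        return template_refs[index + 1]
    return None  # no further template to fall back to


templates = ["small", "medium", "large"]
failed_job = {TEMPLATE_ANNOTATION: "small"}
next_template = select_next_template(templates, failed_job, exit_code=137)
```

Returning None when the list is exhausted leaves the decision (stop the stream, alert, keep retrying) to the caller.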

  2. Scale factor
  • Add a scalingFactor field to the SD.
  • If a job fails with OOM, multiply its resource demands by scalingFactor

This is easier to implement, but less flexible than option 1. It also does not require more job templates in the cluster.
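
The scale-factor idea could look roughly like the Python sketch below, which scales a single memory quantity. It is an illustration under assumptions: the function name is hypothetical, only the Kubernetes binary suffixes Ki, Mi, Gi and Ti are parsed, the result is always expressed in Mi (rounded up), and the optional `max_quantity` cap is an addition that the proposal does not mention.

```python
import math
import re
from typing import Optional

_BINARY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_QUANTITY = re.compile(r"^(\d+)(Ki|Mi|Gi|Ti)$")


def _to_bytes(quantity: str) -> int:
    """Parse a quantity such as '512Mi' into bytes (binary suffixes only)."""
    match = _QUANTITY.match(quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _BINARY_UNITS[match.group(2)]


def scale_memory(
    quantity: str,
    scaling_factor: float,
    max_quantity: Optional[str] = None,
) -> str:
    """Multiply a memory quantity by scaling_factor after an OOM failure.

    max_quantity is a hypothetical upper bound so that repeated OOM events
    cannot grow the request without limit.
    """
    if scaling_factor <= 1:
        raise ValueError("scaling_factor must be greater than 1")
    scaled = _to_bytes(quantity) * scaling_factor
    if max_quantity is not None:
        scaled = min(scaled, _to_bytes(max_quantity))
    # Round up to whole mebibytes so the request never shrinks due to rounding.
    return f"{math.ceil(scaled / _BINARY_UNITS['Mi'])}Mi"


new_request = scale_memory("512Mi", 1.5)
capped_request = scale_memory("1Gi", 2.0, max_quantity="1536Mi")
```

Without some upper bound, a job with a genuine memory leak would be recreated with ever larger requests, which is why the sketch includes a cap.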

Alternatives

Use VerticalPodAutoscaler

Context

No response

@s-vitaliy s-vitaliy added the code/new-feature New feature or request label Oct 1, 2024
@s-vitaliy s-vitaliy added this to Arcane Oct 1, 2024
@s-vitaliy s-vitaliy moved this to Backlog in Arcane Oct 1, 2024
Projects
Status: Backlog
Development

No branches or pull requests

1 participant