Modify StartExecution architecture to submit Batch jobs at the maximum rate #1272

Open
jtherrmann opened this issue Oct 17, 2022 · 3 comments
Labels
bug Something isn't working Jira Spike Create a Jira Spike for this issue

Comments

@jtherrmann
Contributor

jtherrmann commented Oct 17, 2022

Jira: https://asfdaac.atlassian.net/browse/TOOL-2043

According to the AWS Batch service quotas, SubmitJob operations are limited to 50 transactions per second (TPS) per account, per AWS Region. So we can submit at most 50 Batch jobs per second.

We refactored our StartExecution architecture in #1263 to use the manager/worker model, which starts batches of step function executions in parallel. Unfortunately, this caused executions to be submitted in bursts that exceeded Batch's SubmitJob service limit, so we reduced the concurrency in #1271.

There are two problems with the current approach:

  • Our StartExecution manager/worker system currently starts at most 900 executions per minute (15 per second), which is well under the maximum rate of 50 Batch jobs per second.
  • As described here, there may be some duplicate step function execution attempts, which would reduce throughput.
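To make the gap in the first bullet concrete, here is the arithmetic as a small Python sketch. It uses only the two figures quoted above and assumes, purely for illustration, that each started execution results in one SubmitJob call at a time; the variable names are mine, not from the hyp3 code.

```python
# Figures taken from the issue text above.
BATCH_SUBMIT_JOB_TPS = 50      # SubmitJob quota per account, per Region
EXECUTIONS_PER_MINUTE = 900    # current manager/worker throughput

# Assumption for illustration: one execution -> one SubmitJob call at a time.
current_rate = EXECUTIONS_PER_MINUTE / 60           # executions per second
utilization = current_rate / BATCH_SUBMIT_JOB_TPS   # fraction of the quota in use
max_per_minute = BATCH_SUBMIT_JOB_TPS * 60          # ceiling if we matched the quota

print(f"current rate: {current_rate:.0f}/s")
print(f"quota utilization: {utilization:.0%}")
print(f"theoretical ceiling: {max_per_minute}/min")
```

Under that assumption we are using roughly a third of the available SubmitJob rate.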

Here is my quick brain dump from 2022-10-14 summarizing my discussion with @asjohnston-asf regarding this issue:

So Batch is limited to 50 submissions per sec, and we’re starting executions in big bursts that overrun that limit. Our tentative solution is to keep the manager running for 15 min at a time, invoking a steady stream of workers to start jobs at close to the max rate that Batch can handle. We may need to refactor how we get pending jobs; ideally it would operate as a queue so that step function executions are started in the order that the jobs were submitted through the hyp3 API. (Although what if someone submits a ton of jobs right before someone else submits just a few? Will the round-robin priority system take care of that even if the first user’s jobs are all submitted to Batch before the second user’s jobs?)
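A rough sketch of the "steady stream" idea, not the actual manager code: space out starts so that no two are closer together than 1 / max rate. The function name `start_at_steady_rate` and the `start_execution` callback are hypothetical; in practice the callback might invoke a worker or call Step Functions, and the real design would also need to handle the 15 min manager lifetime.

```python
import time
from typing import Callable, Iterable


def start_at_steady_rate(
    jobs: Iterable[dict],
    start_execution: Callable[[dict], None],  # hypothetical: starts one execution
    max_per_second: float = 50.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Start jobs one at a time, never faster than max_per_second."""
    interval = 1.0 / max_per_second
    next_slot = clock()
    started = 0
    for job in jobs:
        # Wait until the next slot opens, if we are ahead of schedule.
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)
        # Schedule the following slot relative to this start, so that falling
        # behind never turns into a catch-up burst.
        next_slot = clock() + interval
        start_execution(job)
        started += 1
    return started
```

Scheduling each slot relative to the previous start (rather than a fixed timetable) trades a little throughput for the guarantee that a slow call is never followed by a burst, which is the failure mode described above.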

See put_jobs for the implementation of job priority.

At some point I would like to continue this discussion and improve our StartExecution architecture in order to solve the two problems described above.

@jtherrmann jtherrmann added the bug Something isn't working label Oct 17, 2022
@jtherrmann
Contributor Author

Could we implement a priority queue to store submitted hyp3 jobs according to their priority, instead of using Batch's priority system?
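To illustrate what I mean, a minimal in-memory sketch using `heapq`. The class name `PendingJobQueue` is hypothetical, and I'm assuming that a larger number means higher priority and that ties should come out in submission order; a real version would need persistent storage rather than a Python list.

```python
import heapq
import itertools


class PendingJobQueue:
    """Hypothetical priority queue of pending jobs (in-memory sketch only)."""

    def __init__(self) -> None:
        self._heap = []
        # Monotonic counter: breaks ties in submission order and keeps
        # heapq from ever comparing the job objects themselves.
        self._counter = itertools.count()

    def push(self, job, priority: int) -> None:
        # heapq is a min-heap, so negate priority to pop the highest first
        # (assumption: larger number = higher priority).
        heapq.heappush(self._heap, (-priority, next(self._counter), job))

    def pop(self):
        _, _, job = heapq.heappop(self._heap)
        return job

    def __len__(self) -> int:
        return len(self._heap)


queue = PendingJobQueue()
queue.push({"job_id": "a"}, priority=1)
queue.push({"job_id": "b"}, priority=5)
queue.push({"job_id": "c"}, priority=5)
queue.push({"job_id": "d"}, priority=3)
order = [queue.pop()["job_id"] for _ in range(len(queue))]
```

The manager could then pop from such a queue at a steady rate, so priority would be decided before jobs ever reach Batch.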

@jtherrmann
Contributor Author

jtherrmann commented Dec 14, 2022

The StartExecution architecture may be further modified by #1365

@jtherrmann
Contributor Author

@asjohnston-asf Would be interested to get your feedback on this issue at some point.
