Ensure client and scheduler are resilient to server autoscaling #2277

trxcllnt · 2024-10-25T23:49:55Z

While profiling distributed build cluster performance, I found that clients falling back to local compilation is the largest contributor to overall build time. Presently this happens due to at least one bug, as well as sub-optimal error handling in the client and scheduler.

These issues are amplified when autoscaling sccache-dist servers: the errors happen more frequently and can lead to sub-optimal autoscaling behavior, which in turn leads to more errors.

So this PR is a collection of fixes for the sccache client, scheduler, and server to better support dist-server autoscaling, as well as general improvements for tracing and debugging distributed compilation across clients, schedulers, and workers.

3c4547b Adds the output file, job_id, and server_id to client logs, and adds job_id to scheduler and server logs. This makes it significantly easier to trace build cluster failures across client and server logs.
Nit: Searching through unstructured/ad-hoc log lines is difficult; using something like structured_logger instead of env_logger would also improve this experience.
df2e4a1 Ensures the client and scheduler use the latest certificates for each server. This is necessary for resiliency when the servers scale in or out (more on this below).
fe83892 Ensures the scheduler is resilient to server errors, and attempts to allocate jobs to the next-best server candidate (more on this below).
4657454 Adds the ability for clients to retry distributed compilations. In conjunction with the two prior commits, this ensures clients with jobs assigned to servers that are scaled in can ask the scheduler to allocate the job to a new server (more on this below).
6cd9ff3 Adds envvar-configurable connection and request timeouts. Fixes #2276 (make "REQUEST_TIMEOUT_SECS" configurable).
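To illustrate the envvar-configurable timeouts, here is a minimal sketch of reading a timeout in seconds with a fallback. The helper names `timeout_from` and `timeout_from_env` are hypothetical and not taken from the PR; `SCCACHE_DIST_REQUEST_TIMEOUT` is the variable named in the 5f1d50e commit message, and the default is whatever the caller passes, since the actual default is not stated here.

```rust
use std::time::Duration;

/// Parses a timeout in whole seconds from an optional raw value,
/// falling back to `default_secs` when the value is unset or unparsable.
/// (Hypothetical helper, for illustration only.)
pub fn timeout_from(raw: Option<&str>, default_secs: u64) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(default_secs);
    Duration::from_secs(secs)
}

/// Reads the timeout from the environment variable `var`,
/// e.g. "SCCACHE_DIST_REQUEST_TIMEOUT".
pub fn timeout_from_env(var: &str, default_secs: u64) -> Duration {
    timeout_from(std::env::var(var).ok().as_deref(), default_secs)
}
```

Keeping the parsing separate from the environment lookup makes the fallback behavior easy to test without mutating process state.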

Build cluster configuration

Before diving into these changes, I should describe the architecture of the cluster for which these changes are necessary.

A Traefik API Gateway to terminate SSL and expose a single endpoint for clients. This could be any API Gateway/LB/router; I just like Traefik.
An sccache-dist scheduler instance, which receives forwarded connections from Traefik.
An autoscaling group of sccache-dist servers, which scale in and out based on load, and are associated with one of a fixed pool of ports on the Traefik instance. For example, if the ASG includes up to 10 instances, Traefik will open 10 ports (e.g. 10500-10509) and associate each port with a worker.

Workers are associated with and forwarded traffic from one of Traefik's open ports when they start up, and un-associated with that port when they shut down. When a new worker starts up, it could be associated with any free port, even ports previously associated with a different worker.
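The port association described above can be modeled with a small sketch. `PortPool`, `attach`, and `detach` are hypothetical names, and the "lowest free port first" policy is an assumption made for illustration; the real gateway may hand out any free port.

```rust
use std::collections::BTreeMap;

/// Hypothetical model of the gateway's fixed pool of ports.
pub struct PortPool {
    /// port -> worker currently behind it (None when the port is free)
    slots: BTreeMap<u16, Option<String>>,
}

impl PortPool {
    /// Opens `count` consecutive ports starting at `first`, all free.
    pub fn new(first: u16, count: u16) -> Self {
        Self { slots: (first..first + count).map(|p| (p, None)).collect() }
    }

    /// Associates `worker` with the lowest free port (assumed policy).
    /// Returns None when every port is taken.
    pub fn attach(&mut self, worker: &str) -> Option<u16> {
        let (port, slot) = self.slots.iter_mut().find(|(_, w)| w.is_none())?;
        *slot = Some(worker.to_string());
        Some(*port)
    }

    /// Frees whichever port `worker` was using, if any.
    pub fn detach(&mut self, worker: &str) {
        for slot in self.slots.values_mut() {
            if slot.as_deref() == Some(worker) {
                *slot = None;
            }
        }
    }
}
```

With a pool starting at 10500, attaching A and B, detaching A, then attaching C gives C the port A used to have. That address reuse is what makes cached certificates a problem.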

Note: While not directly related to this PR, this description assumes sccache has been compiled with the changes in #1922, as that's necessary for the workers to report the public_url of the API Gateway instead of their private VPC address.

Certificate handling for server scale in and out

When the server cluster goes through a cycle of scaling out, in, then out again, the new servers may be available at addresses that were previously associated with an old server. This presents a challenge for certificate handling, because the client and scheduler may have cached certificates for the initial instance, and those certs are not valid for communicating with the new instance:

# scale out to two instances:
127.0.0.1:10500 - server A
127.0.0.1:10501 - server B

# scale in to one instance:
127.0.0.1:10501 - server B

# scale out to three instances:
127.0.0.1:10500 - server C
127.0.0.1:10501 - server B
127.0.0.1:10502 - server D

In the initial state, the client and scheduler cached certificates for servers A and B. After scaling in and out again, the client and scheduler attempt and fail to use the certificates generated by server A to communicate with server C. I believe this is because the certificates for A and C both embed 127.0.0.1:10500 as their SubjectAlternativeName, and this confuses reqwest.

df2e4a1 updates the client to track certificates by server_id like the server does, and updates both the client and scheduler to remove the old certificate from the certs map before adding the existing certs to the reqwest client builder.
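A minimal sketch of the idea, not the code in df2e4a1: certificates are kept in a map keyed by server id, and a changed cert replaces the old entry so only one cert per address is ever trusted. `ServerId` here is a simplified stand-in, and `CertStore`, `upsert`, and `trusted` are hypothetical names.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a server identifier such as "127.0.0.1:10500".
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ServerId(pub String);

/// Certificates keyed by server id, so a reused address never keeps a stale cert.
#[derive(Default)]
pub struct CertStore {
    certs: HashMap<ServerId, Vec<u8>>,
}

impl CertStore {
    /// Records `cert` for `id`. Returns true when the trusted set changed,
    /// meaning the HTTP client should be rebuilt from `trusted()`.
    pub fn upsert(&mut self, id: ServerId, cert: Vec<u8>) -> bool {
        if self.certs.get(&id) == Some(&cert) {
            return false; // same cert as before, nothing to rebuild
        }
        // Drop any stale cert for this address before adding the new one.
        self.certs.remove(&id);
        self.certs.insert(id, cert);
        true
    }

    /// The certs to hand to the HTTP client builder: one per known server id.
    pub fn trusted(&self) -> Vec<&[u8]> {
        self.certs.values().map(|c| c.as_slice()).collect()
    }
}
```

In the scale-in/scale-out example, server C's cert replaces server A's entry for 127.0.0.1:10500, so the rebuilt client no longer carries A's cert for that address.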

Scheduler job allocation resiliency

There's a delay between when servers scale in and when the scheduler prunes them from the list of active servers. In this time, the scheduler may attempt to allocate jobs to these servers. When this fails, the current behavior is to return an error to the client, which then runs a local compile.

This is sub-optimal for an autoscaling strategy: when the scheduler rejects jobs, the additional work pushed back to the client isn't visible to the autoscaler.

For example, if the autoscaler scales in from 64 to 32 CPUs, and in the meantime the scheduler rejects the next 32 jobs so they compile locally, the autoscaler believes it is in a steady state rather than recognizing there are 64 units of work to handle.

At best, this leads to delays in scaling up, and at worst it can cause the autoscaler to believe it can continue to scale down.

The best solution is for the scheduler to handle the alloc_job failure and attempt to allocate to the next-best server candidate, until either the job is allocated or the candidate list is exhausted. This ensures the autoscaler will see the existing instances get busier, and stop scaling in/start scaling out again.
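The fallback strategy can be sketched as follows. This is an illustration of the approach rather than the rewritten handle_alloc_job: `alloc_with_fallback` and the `assign` callback are hypothetical, and candidates are assumed to arrive already sorted best-first.

```rust
/// Tries each candidate in order (best first) and returns the first server
/// that accepts the job, or every collected error if the list is exhausted.
pub fn alloc_with_fallback<F>(
    candidates: &[&str],
    mut assign: F,
) -> Result<String, Vec<String>>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut errors = Vec::new();
    for &server in candidates {
        match assign(server) {
            // The job is allocated as soon as one server accepts it.
            Ok(()) => return Ok(server.to_string()),
            // Record the failure and move on to the next-best candidate.
            Err(e) => errors.push(format!("{server}: {e}")),
        }
    }
    // Only now does the caller need to fall back to a local compile.
    Err(errors)
}
```

Because a failed assignment only costs one extra request, the job still lands on a live server, and the autoscaler sees that server's load rise.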

Example of starting a cluster with 3 initial workers, scaling down to 1, then running a distributed compile before the scheduler has pruned the dead servers:

$ docker compose up -d --scale worker=3
# ... wait till cluster is up
$ docker compose up -d --scale worker=1
$ sccache ...
[INFO  sccache::dist::http::server] Scheduler listening for clients on 0.0.0.0:80
[INFO  sccache::dist::http::server] Adding new certificate for 172.18.0.2:10500 to scheduler
[INFO  sccache_dist] Registered new server ServerId(172.18.0.2:10500)
[INFO  sccache::dist::http::server] Adding new certificate for 172.18.0.2:10501 to scheduler
[INFO  sccache_dist] Registered new server ServerId(172.18.0.2:10501)
[INFO  sccache::dist::http::server] Adding new certificate for 172.18.0.2:10502 to scheduler
[INFO  sccache_dist] Registered new server ServerId(172.18.0.2:10502)
[WARN  sccache_dist] [alloc_job(0)]: POST to scheduler assign_job failed, caused by: error sending request for url (https://172.18.0.2:10502/api/v1/distserver/assign_job/0), caused by: client error (Connect), caused by: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:2091: (self-signed certificate), caused by: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:2091:
[INFO  sccache_dist] [alloc_job(0)]: Job created and assigned to server 172.18.0.2:10501 with state Ready
[WARN  sccache_dist] [alloc_job(1)]: POST to scheduler assign_job failed, caused by: error sending request for url (https://172.18.0.2:10500/api/v1/distserver/assign_job/1), caused by: client error (Connect), caused by: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:2091: (self-signed certificate), caused by: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:2091:
[INFO  sccache_dist] [alloc_job(2)]: Job created and assigned to server 172.18.0.2:10501 with state Ready
[INFO  sccache_dist] [alloc_job(1)]: Job created and assigned to server 172.18.0.2:10501 with state Ready
[INFO  sccache_dist] [alloc_job(3)]: Job created and assigned to server 172.18.0.2:10501 with state Ready
[INFO  sccache_dist] [update_job_state(0, 172.18.0.2:10501)]: Job state updated from Ready to Started
[INFO  sccache_dist] [update_job_state(2, 172.18.0.2:10501)]: Job state updated from Ready to Started
[INFO  sccache_dist] [update_job_state(1, 172.18.0.2:10501)]: Job state updated from Ready to Started
[INFO  sccache_dist] [update_job_state(3, 172.18.0.2:10501)]: Job state updated from Ready to Started
[INFO  sccache_dist] [update_job_state(0, 172.18.0.2:10501)]: Job state updated from Started to Complete
[INFO  sccache_dist] [update_job_state(2, 172.18.0.2:10501)]: Job state updated from Started to Complete
[INFO  sccache_dist] [update_job_state(1, 172.18.0.2:10501)]: Job state updated from Started to Complete
[INFO  sccache_dist] [update_job_state(3, 172.18.0.2:10501)]: Job state updated from Started to Complete
[WARN  sccache_dist] Server 172.18.0.2:10500 appears to be dead, pruning it in the scheduler
[WARN  sccache_dist] Server 172.18.0.2:10502 appears to be dead, pruning it in the scheduler

Client job execution resiliency

It's also possible for a server to be taken offline while it's running jobs for clients. In this scenario the scheduler's alloc_job succeeds while the worker is still alive, but the worker is destroyed while the client is waiting on the run_job response.

To avoid the expensive local compilation, the client should handle the failure and retry the job on a new server assigned by the scheduler. When combined with the feature described in the previous section, the scheduler should reallocate the job to a live server.
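A minimal sketch of such a retry loop, assuming the retry limit counts retries after the first try (consistent with a limit of 5 producing "attempt 1 of 6" in the client logs). `run_with_retries` is a hypothetical helper; the `attempt` closure stands for one full "request allocation, then run the job" cycle.

```rust
/// Runs `attempt` up to `retry_limit + 1` times (the first try plus retries),
/// returning the first success or the error from the final attempt.
/// The closure receives the 1-based attempt number.
pub fn run_with_retries<T, E, F>(retry_limit: u32, mut attempt: F) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    let total = retry_limit.saturating_add(1);
    let mut n = 1;
    loop {
        match attempt(n) {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error so the caller can
            // fall back to a local compile.
            Err(e) if n >= total => return Err(e),
            // Otherwise try again, asking the scheduler for a fresh allocation.
            Err(_) => n += 1,
        }
    }
}
```

Each retry goes back through allocation rather than reusing the dead server, which is why the second attempt can receive a new job id on a different server.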

Example client logs when a worker shuts down during run_job and the client retries:

$ docker compose up -d --scale worker=10
# ... wait till cluster is up
$ SCCACHE_DIST_RETRY_LIMIT=5 sccache ...
$ docker compose up -d --scale worker=1
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Run distributed compilation (attempt 1 of 6)
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Creating distributed compile request
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Identifying dist toolchain for "/usr/local/cuda/bin/../nvvm/bin/cicc"
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Requesting allocation
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Successfully allocated job 2
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Running job 2 on server 172.18.0.2:10504
[WARN  sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Error running distributed compilation (attempt 1 of 6), retrying. Could not run distributed compilation job on 172.18.0.2:10504: error sending request for url (https://172.18.0.2:10504/api/v1/distserver/run_job/2): client error (SendRequest): connection closed before message completed
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Run distributed compilation (attempt 2 of 6)
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Creating distributed compile request
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Identifying dist toolchain for "/usr/local/cuda/bin/../nvvm/bin/cicc"
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Requesting allocation
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Successfully allocated job 21
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Running job 21 on server 172.18.0.2:10500
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Fetched [("/tmp/sccache_nvcc5K0cEY/0.cudafe1.c", "Size: 209->174"), ("/tmp/sccache_nvcc5K0cEY/1.cudafe1.stub.c", "Size: 1429->635"), ("/tmp/sccache_nvcc5K0cEY/simpleP2P.compute_50.cudafe1.gpu", "Size: 25849->3584"), ("/tmp/sccache_nvcc5K0cEY/simpleP2P.compute_50.ptx", "Size: 874->445")]
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Compiled in 3.963 s, storing in cache
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Created cache artifact in 0.006 s
[DEBUG sccache::server] [simpleP2P.compute_50.ptx]: compile result: cache miss
[DEBUG sccache::server] [simpleP2P.compute_50.ptx]: CompileFinished retcode: exit status: 0
[DEBUG sccache::compiler::compiler] [simpleP2P.compute_50.ptx]: Stored in cache successfully!

codecov-commenter · 2024-10-25T23:56:47Z

Codecov Report

Attention: Patch coverage is 32.96089% with 120 lines in your changes missing coverage. Please review.

Project coverage is 40.78%. Comparing base (0cc0c62) to head (5f1d50e).
Report is 83 commits behind head on main.

Files with missing lines	Patch %	Lines
src/compiler/compiler.rs	38.51%	28 Missing and 55 partials ⚠️
src/dist/http.rs	0.00%	21 Missing ⚠️
src/server.rs	0.00%	4 Missing and 7 partials ⚠️
src/compiler/rust.rs	0.00%	3 Missing ⚠️
src/util.rs	50.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2277      +/-   ##
==========================================
+ Coverage   30.91%   40.78%   +9.87%     
==========================================
  Files          53       55       +2     
  Lines       20112    20978     +866     
  Branches     9755     9677      -78     
==========================================
+ Hits         6217     8556    +2339     
- Misses       7922     8247     +325     
+ Partials     5973     4175    -1798

trxcllnt added 5 commits October 25, 2024 20:00

3c4547b consistently report job_id and optionally server_id in scheduler and server logs
df2e4a1 ensure client tracks certs by server_id, and both the scheduler and client remove certs when a server's cert changes
fe83892 rewrite scheduler handle_alloc_job to be resilient to server errors and try other candidates instead of failing
4657454 support retrying distributed compilations
6cd9ff3 Allow configuring connection and request timeouts
5f1d50e track started job mtimes and assume jobs that take longer than `SCCACHE_DIST_REQUEST_TIMEOUT` seconds are stale and should be removed

trxcllnt force-pushed the fea/autoscaling branch from a22c102 to 5f1d50e Compare October 28, 2024 23:36
