[Infra] Migrate rest of linux builder workflows off GCP runners. #18511

Merged: 22 commits, Sep 13, 2024

@saienduri saienduri commented Sep 12, 2024

This is part of the larger issue tracking our migration off the GCP runners, storage buckets, etc.: #18238.

This builds on #18381, which migrated

  • linux_x86_64_release_packages
  • linux_x64_clang_debug
  • linux_x64_clang_tsan

Here, we move the rest of the critical Linux builder workflows off the GCP runners:

  • linux_x64_clang
  • linux_x64_clang_asan

This also drops all CI usage of the GCP cache (http://storage.googleapis.com/iree-sccache/ccache). Some workflows now use sccache backed by Azure Blob Storage as a replacement. There are a few issues with this (mozilla/sccache#2258) that prevent us from providing read-only access to the cache in PRs created from forks, so PRs from forks currently don't use the cache and will have slower builds. We're covering for this slowdown by using larger runners, but if we can roll out caching to all builds then we might use runners with fewer cores.

Along with the cache changes, Docker usage is now based on images from the https://github.com/iree-org/base-docker-images/ repo, and the build_tools/docker/docker_run.sh script is now used only by unmigrated workflows (linux_arm64_clang and build_test_all_bazel).

saienduri and others added 8 commits August 29, 2024 13:52
Progress on #15332. This uses a
new `cpubuilder_ubuntu_jammy_x86_64` dockerfile from
https://github.com/iree-org/base-docker-images.

This stops using the remote cache that is hosted on GCP. Build time
_without a cache_ is about 20 minutes on current runners, while build
_with a cache_ is closer to 10 minutes. Build time without a cache is
closer to 28-30 minutes on new runners. We can try adding back a cache
using GitHub or our own hosted storage.

I tried to continue using the previous cache during this transition
period, but the `gcloud` command needs to run on the host, and I'd like
to stop using the `docker_run.sh` script. I'm hoping we can keep folding
away this sort of complexity by having the build machines run a
dockerfile that includes key environment components like utility tools
and any needed authorization/secrets (see
#18238).

ci-exactly: linux_x64_clang
Progress on #15332. I'm trying to
get rid of the `docker_run.sh` scripts, replacing them with GitHub's
`container:` feature. While local development flows _may_ want to use
Docker like the CI workflows do, those scripts contained a lot of
special handling and file mounting to be compatible with Bazel. Much of
that is not needed for CMake and can be folded away, though the
`--privileged` option needed here is one exception.

This stops using the remote cache that is hosted on GCP. We can try
adding back a cache using GitHub or our own hosted storage as part of
#18238.
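
For local reproduction, a job that uses `container:` with the `--privileged` option can be approximated with a plain `docker run`. This is a sketch only: the helper name, the `/work` mount point, and the example arguments are hypothetical, and the image reference is deliberately left as a parameter rather than guessed. The function prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print a docker command that roughly mirrors a CI job
# running inside a container with the --privileged option.
#   $1: image reference (required, intentionally not hard-coded here)
#   $2: source directory on the host to mount into the container
#   remaining arguments: the command to run inside the container
print_container_command() {
  image=$1
  src_dir=$2
  shift 2
  # --rm removes the container on exit; /work is an arbitrary mount point
  # chosen for this sketch.
  echo docker run --rm --privileged \
    --volume "${src_dir}:/work" --workdir /work \
    "${image}" "$@"
}
```

Calling `print_container_command <image> "$PWD" <build command>` shows the full command line, which can be inspected before running it by hand.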

Job | Cache? | Runner cluster | Time | Logs
-- | -- | -- | -- | --
ASan | Cache | GCP runners | 14 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10620030527/job/29438925064)
ASan | No cache | GCP runners | 28 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10605848397/job/29395467181)
ASan | Cache | Azure runners | (not configured yet) |
ASan | No cache | Azure runners | 35 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10621238709/job/29442788013?pr=18396)
| | | | | |
TSan | Cache | GCP runners | 12 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10612418711/job/29414025939)
TSan | No cache | GCP runners | 21 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10605848414/job/29395467002)
TSan | Cache | Azure runners | (not configured yet) |
TSan | No cache | Azure runners | 32 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10621238738/job/29442788341?pr=18396)

ci-exactly: linux_x64_clang_asan
Following iree-org/base-docker-images#6, the new
cpubuilder dockerfile should have all the software needed for ASan and
TSan building + testing (specifically `clang-19` instead of just
`clang-14`).

Progress on #15332. The only
remaining uses of `gcr.io/iree-oss/base.*` are:

* `build_test_all_bazel` uses `gcr.io/iree-oss/base-bleeding-edge`
* `publish_website` uses `gcr.io/iree-oss/base`
* arm64 workflows use `gcr.io/iree-oss/base-arm64`
* `gcr.io/iree-oss/emscripten` (used by web test workflows) depends on
`gcr.io/iree-oss/base`
Signed-off-by: Elias Joseph <[email protected]>
Implemented caching with Azure containers using sccache. This only works when merging from a branch (not from a fork).

ci-exactly: linux_x64_clang
saienduri and others added 6 commits September 12, 2024 13:25
Signed-off-by: saienduri <[email protected]>
Signed-off-by: saienduri <[email protected]>
Signed-off-by: saienduri <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: saienduri <[email protected]>
@saienduri saienduri force-pushed the shared/runner-cluster-migration branch from 9f06194 to d43fc31 September 12, 2024 18:25
This fixes the issue shown here:
https://github.com/iree-org/iree/actions/runs/10800586282/job/29958939877#step:6:16
where `SCCACHE_AZURE_CONNECTION_STRING` is defined as an empty string
instead of being undefined. It also introduces a config script, as
discussed here:
#18489 (comment).

We may still want to limit writing to the cache to `push` events.

ci-exactly: linux_x64_clang
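
The empty-versus-undefined handling described in the commit above can be sketched as follows. This is not the config script from this PR: the function names are hypothetical, and it assumes CMake 3.17 or newer, which reads `CMAKE_C_COMPILER_LAUNCHER` and `CMAKE_CXX_COMPILER_LAUNCHER` from the environment. The idea is to treat an empty connection string (for example, when a secret is unavailable to the job) the same as a missing one, so sccache is never handed a blank value.

```shell
#!/bin/sh
# Hypothetical sketch: enable sccache only when the Azure connection string
# is actually usable.

# Prints "sccache" when a non-empty connection string is present, else "none".
select_compiler_launcher() {
  # ${VAR:-} expands to an empty string both when VAR is unset and when it
  # is set to "", so the two cases are handled identically.
  if [ -n "${SCCACHE_AZURE_CONNECTION_STRING:-}" ]; then
    echo "sccache"
  else
    echo "none"
  fi
}

# Exports or clears the environment variables that control caching.
configure_cache_env() {
  if [ "$(select_compiler_launcher)" = "sccache" ]; then
    export CMAKE_C_COMPILER_LAUNCHER=sccache
    export CMAKE_CXX_COMPILER_LAUNCHER=sccache
  else
    # Make the variable truly undefined instead of leaving it empty, and
    # build without a compiler launcher.
    unset SCCACHE_AZURE_CONNECTION_STRING
    unset CMAKE_C_COMPILER_LAUNCHER CMAKE_CXX_COMPILER_LAUNCHER
  fi
}
```

A workflow step could source a script like this before configuring CMake. Restricting cache writes to `push` events, as mentioned above, would need a separate check that this sketch does not attempt.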
@ScottTodd ScottTodd added the infrastructure Relating to build systems, CI, or testing label Sep 12, 2024
Eliasj42 and others added 2 commits September 12, 2024 16:19
| Workflow      | Un-cached      | Cached   |
| ------------- | ------------- | ------------- |
| Clang | 15m57s | 10m12s |
| Clang ASan | NA | 15m7s |
| Clang TSan | NA | 9m42s |
| Clang Debug | 12m19s | 8m15s |
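
As a rough check of what the cache buys, the sketch below computes the percentage of wall time saved for the two rows in the table above that have both measurements. The helper names are hypothetical, and the durations are copied from the table.

```shell
#!/bin/sh
# Convert a duration such as "15m57s" into seconds.
to_seconds() {
  minutes=${1%%m*}   # text before the first "m"
  rest=${1#*m}       # text after the first "m", e.g. "57s"
  seconds=${rest%s}  # drop the trailing "s"
  echo $(( minutes * 60 + seconds ))
}

# Percentage of build time saved by the cache (integer division, rounds down).
#   $1: un-cached duration, $2: cached duration
percent_saved() {
  uncached=$(to_seconds "$1")
  cached=$(to_seconds "$2")
  echo $(( (uncached - cached) * 100 / uncached ))
}

echo "Clang: $(percent_saved 15m57s 10m12s)% time saved"        # 36%
echo "Clang Debug: $(percent_saved 12m19s 8m15s)% time saved"   # 33%
```

The ASan and TSan rows are left out because their un-cached times are listed as NA.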
ScottTodd and others added 2 commits September 13, 2024 09:43
Also testing these workflows with PRs from a fork.
Some of these tests are taking 30-60 seconds on new runner machines under ASan/TSan, getting close to the 60 second timeout. Increase the timeout to 5 minutes.

We could also do something TSan/ASan-specific here, but developers running the tests on slower systems can also benefit from these timeout changes.
@ScottTodd ScottTodd merged commit cc891ba into main Sep 13, 2024
45 of 46 checks passed
@ScottTodd ScottTodd deleted the shared/runner-cluster-migration branch September 13, 2024 20:53
raikonenfnu pushed a commit to raikonenfnu/iree that referenced this pull request Sep 16, 2024
…e-org#18511)

Signed-off-by: saienduri <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Co-authored-by: Scott Todd <[email protected]>
Co-authored-by: Elias Joseph <[email protected]>