Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor CI workflow #439

Merged
merged 153 commits into from
May 9, 2024
Merged

Refactor CI workflow #439

merged 153 commits into from
May 9, 2024

Conversation

danielhollas
Copy link
Contributor

@danielhollas danielhollas commented Mar 20, 2024

Overview

The goal of the PR is twofold:

  • Improve CI runtimes (currently the overall workflow takes ~1h40min!)
  • Reduce complexity

In the end, I mostly ended up reverting things back to where they were roughly around PR #372, while (hopefully) keeping the current modularity. Closes #414

Major changes

  • The main change here is to stop uploading Docker images as artifacts: Instead we upload them to ghcr.io straightaway. This already brings a lot of simplification and time reduction -- the only "disadvantage" is that the PRs cannot be open from forks, but as discussed in Slow build test and push #414, this is more of a feature rather than a bug.

  • The second approach I took was to reduce our reliance on the self-hosted ARM64 runner, which is a limited resource (can only run one job at a time). In this PR, the arm64 image is build on ubuntu GHA runner with the help of QEMU virtualization. This is a bit slower then building natively, but not by that much (12min versus 9min). The other big advantage of this is that it reduces a lot of complexity, since we don't need to create / manage manifests manually, and simply let the docker/bake-action take care of the multiplatform build and upload. We can re-evaluate this in a future PR.

  • We first build the amd64 images and immediately test full-stack. This overall takes only 6 minutes, so any obvious issues should be caught early.
    I believe some of the improvements here should be ported to QEApp and aiida-core repos as well.

Final results

  • The workflow now takes 20 minutes. The first amd64 build and full-stack tests take only around 6 minutes, so any issues are surfaced early.
  • The overall diff here in terms of lines (+551/-502) is a bit misleading, since I've added a lot of explanation and example inputs / outputs in the auxiliary shell scripts. The total lines of the yaml files is actually +352/-470 +320/-470.

Future work

  • Revamp release procedure so that we do not have to push to main
  • Reconsider buidling of arm64 image on self-hosted runner, if GHA turns out to be unreliable. On some commits on this PR the build took over 1h instead of 12 minutes.
  • Speed up the build by optimizing what we put in the image. I have some ideas but I'll keep them for a separate PR.

@unkcpz
Copy link
Member

unkcpz commented Mar 21, 2024

Thanks!
Just careful that I don't want to go back to the original situation that tests for all 4*2 images are just in one GH workflow (or two for two architectures). The bake can happened in one workflow but tests are better separated.

@danielhollas
Copy link
Contributor Author

Yes, tests should be run separately and in parallel. But the current build strategy is extremely wasteful, since we're not reusing the previous layers (e.g. the base image is build 4 times, unless I am missing something).

@unkcpz
Copy link
Member

unkcpz commented Mar 21, 2024

(e.g. the base image is build 4 times, unless I am missing something).

That is correct! Initially I want to keep things as separated as possible and notice in the end every build cost < mins so didn't put effort to combine the built processes back to one.
It definitely worth to try. Let me know if you encounter any problem and sorry I delayed on starting to work on this.

@danielhollas
Copy link
Contributor Author

@mbercx would you mind installing jq on the ARM runner? Thx

@danielhollas
Copy link
Contributor Author

@unkcpz I've pushed a fix to dodo.py for building locally. Can you try it out by running doit build? Here's what I get after that

$ docker images
aiidalab/full-stack              2024.1017.post145.dev0_8f8a30e   1b069e00e91e   About a minute ago   2.54GB
aiidalab/lab                     2024.1017.post145.dev0_8f8a30e   af94ef50d9cc   About a minute ago   2.19GB
aiidalab/base-with-services      2024.1017.post145.dev0_8f8a30e   e6be8727a8d1   4 minutes ago        2.54GB
aiidalab/base                    2024.1017.post145.dev0_8f8a30e   dba20a49b4f7   34 minutes ago       2.16GB

The image tag (determined by VERSION variable) is automatically generated from git commit.

Otherwise I think I replied to all your comments, thank you for a detailed review! Once I get a green light from you I will merge and then make a release for the aiida v2.5 image.

@unkcpz
Copy link
Member

unkcpz commented May 8, 2024

Can you try it out by running doit build?

Thanks for the fix. This works for me locally.

But the command README seems not work doit build --arch=arm64 --version=my-version. And I specify with --targe base but it build all. Besides this, I am good will all other changes.

@unkcpz
Copy link
Member

unkcpz commented May 8, 2024

Before merge this can we make a release with aiida-core==2.5.1 first?

@danielhollas
Copy link
Contributor Author

But the command README seems not work doit build --arch=arm64 --version=my-version. And I specify with --targe base but it build all.

This should now be fixed, can you try again? Note that the doit tests command is still broken, but this will require a bit more change to fix, I'll open an issue.

@danielhollas
Copy link
Contributor Author

I've temporarily switched to buildjet runners (and you can see that the tests pass now @unkcpz).
My plan is to merge this and try making a release again so we don't have to wait till the ARM64 runners are fixed.

@unkcpz
Copy link
Member

unkcpz commented May 9, 2024

My plan is to merge this and try making a release again so we don't have to wait till the ARM64 runners are fixed.

Make sense to me. And I think we really need to move to buildjet.

@danielhollas
Copy link
Contributor Author

@unkcpz can you approve so I can merge? (I could bypass it but don't want to :-) )

@danielhollas danielhollas requested a review from unkcpz May 9, 2024 18:19
Copy link
Member

@unkcpz unkcpz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good. This is much clear! Thanks. @danielhollas (and I know how difficult to figure out all the details 😄 ) I'll let you write up for a detail commit message, 153 commits seems a legend PR.

@danielhollas danielhollas merged commit 4ca0b40 into main May 9, 2024
15 checks passed
@danielhollas danielhollas deleted the remove-artifacts branch May 9, 2024 19:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slow build test and push
2 participants