
ARM64 docker images take 6h+ to build #879

Closed
1 task done
M0NsTeRRR opened this issue Sep 22, 2024 · 24 comments · Fixed by #919
Labels: github_actions (Pull requests that update GitHub Actions code), help wanted (Extra attention is needed)

Comments

M0NsTeRRR commented Sep 22, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Description of the bug

https://github.com/orhun/git-cliff/actions/runs/10978847295/job/30482488830

The GitHub Actions workflow failed to build and push the 2.6.0 tag to ghcr.io.

The Docker job times out after 360 minutes, as described in the GitHub documentation.

Steps To Reproduce

.

Expected behavior

.

Screenshots / Logs

No response

Software information

GitHub Actions

Additional context

No response

orhun (Owner) commented Sep 23, 2024

Thanks for reporting!

Yup, I tried running the workflow two more times and they all failed: https://github.com/orhun/git-cliff/actions/runs/10978847295

I think it is timing out due to the QEMU-emulated ARM64 builds :/

According to the docs, it is not possible to increase this limit.

Job execution time - Each job in a workflow can run for up to 6 hours of execution time. If a job reaches this limit, the job is terminated and fails to complete.

Not sure what I can do except disabling them temporarily, which I did in cde2a8e
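In buildx terms, the temporary change roughly amounts to building only the native platform, something like this (a sketch; the tag is assumed):

# Build and push only the amd64 image, skipping the QEMU-emulated arm64 build
docker buildx build --platform linux/amd64 \
  -t ghcr.io/orhun/git-cliff:2.6.0 --push .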

M0NsTeRRR (Author) commented Sep 24, 2024

@orhun Looks like it was an issue on GitHub Actions' side; the Docker action is running now (4 completed). Could you rerun the 2.6.0 Docker action?

orhun (Owner) commented Sep 25, 2024

Yup, I need to create a new release soon. I don't think it is possible to trigger the pipeline again with the new changes in the workflow file :)

orhun (Owner) commented Sep 25, 2024

In the meantime you can probably use one of the latest versions.
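E.g. something like this (the tag is just an assumption):

docker pull ghcr.io/orhun/git-cliff:latest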

M0NsTeRRR (Author) commented:

Yup, I'm closing this one as the issue seems resolved :)

orhun (Owner) commented Sep 26, 2024

I refactored the Docker builds in #888 based on these docs - it should be faster in theory but I feel like there are still some rough edges in the workflow.
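The gist of the pattern from those docs, expressed as buildx commands (just a sketch; the digests are placeholders):

# Each matrix job builds a single platform and pushes it by digest only:
docker buildx build --platform linux/arm64 \
  --output type=image,name=ghcr.io/orhun/git-cliff,push-by-digest=true,push=true .

# A final job then merges the per-platform digests into one tagged manifest list:
docker buildx imagetools create -t ghcr.io/orhun/git-cliff:latest \
  ghcr.io/orhun/git-cliff@sha256:<amd64-digest> \
  ghcr.io/orhun/git-cliff@sha256:<arm64-digest>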

orhun (Owner) commented Sep 27, 2024

It takes 6hr+ even with the build matrix: https://github.com/orhun/git-cliff/actions/runs/11053756749/job/30708900152 💀

@orhun orhun changed the title from "Git cliff 2.6.0 docker image not published" to "ARM64 docker images takes 6h+ to build" Sep 27, 2024
@orhun orhun reopened this Sep 27, 2024
@orhun orhun removed their assignment Sep 27, 2024
@orhun orhun added the help wanted and github_actions labels and removed the bug label Sep 27, 2024
@orhun orhun changed the title from "ARM64 docker images takes 6h+ to build" to "ARM64 docker images take 6h+ to build" Sep 27, 2024
LtdSauce (Contributor) commented Sep 27, 2024

Hey!

As I have the impression I caused some trouble with my comment (that I deleted... which might have been a mistake), I thought I'd share my rationale for removing it.

When I saw the change removing the arm64 image build, I thought it was because building the images and publishing them was all done on the same node. But then I found the setup-buildx-action docs, which mention that it already creates one node for each architecture, which is basically what this suggests as well. Thus, I deleted my comment in the hope that I had not done any harm 😅 Sorry 'bout that!

Also, I guess in the current state this is not yet achievable, as GitHub has not yet released its official arm64 support as of this comment.
As soon as that is available for OSS, it could be used in the open PR to build the arm64 image on a native arm64 node. This suggests it will not happen within this year. (In fact, that was the second reason I deleted my comment... next time I will just update it!)

M0NsTeRRR (Author) commented Sep 27, 2024

From the log I'm reading, it looks like there is an issue with shellexpand, right? It takes one hour from the start to reach shellexpand, and then the compilation takes at least 5 hours (which looks really long from my POV):
https://github.com/orhun/git-cliff/actions/runs/11053756749/job/30708900152#step:10:1072

Maybe we could make cargo build more verbose to see if it turns up anything useful?
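Something like this in the Dockerfile's build step, keeping the existing feature flags (just a sketch):

# -vv (very verbose) also prints the exact rustc invocations and build script output
cargo build -vv --release --locked --no-default-features \
  --features github --features gitlab --features bitbucket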

But maybe, as @LtdSauce pointed out, it's just the QEMU build being really slow, as expected :/

For now, as a quick fix, we could disable the ARM build until we find a solution.

M0NsTeRRR (Author) commented:

Maybe docker buildx on the macOS ARM runners (for the ARM build) would be better?

https://github.com/actions/runner-images

LtdSauce (Contributor) commented:

Just to share some insights: at my day job we recently had to switch away from QEMU emulation for building arm64 images as well. In our case we were only running apt install commands, but even that increased the build time from 10 minutes for our amd64 builds to 60 minutes for our QEMU-emulated arm64 builds.
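A quick way to see the emulation in action, assuming QEMU/binfmt is registered (e.g. via setup-qemu-action):

# Runs an arm64 container on an amd64 host; it prints "aarch64" because
# everything inside goes through the QEMU emulator:
docker run --rm --platform linux/arm64 debian:buster-slim uname -m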

orhun (Owner) commented Sep 27, 2024

As I have the impression I caused some trouble with my comment (that I deleted... which might have been a mistake), I thought I'd share my rationale for removing it.

No worries!

Also, I guess in the current state this is not yet achievable, as GitHub has not yet released its official arm64 support as of this comment.

Ah, thanks for sharing the links. I subscribed to those issues and I hope it happens one day 🙏🏼

it looks like there is an issue with shellexpand, right?

I think the issue is not about a specific crate, it is related to QEMU emulated builds.

Maybe we could make cargo build more verbose to see if it turns up anything useful?
Maybe docker buildx on the macOS ARM runners (for the ARM build) would be better?

I tried your suggestion in #888 but I guess setup-qemu-action is not available for the macOS runner: https://github.com/orhun/git-cliff/actions/runs/11072292415/job/30766175378?pr=888

M0NsTeRRR (Author) commented Sep 27, 2024

I think the issue is not about a specific crate, it is related to QEMU emulated builds.

I just found it strange at first glance that one package takes more than 5 hours to build compared to the others.

I tried your suggestion in #888 but I guess setup-qemu-action is not available for the macOS runner: https://github.com/orhun/git-cliff/actions/runs/11072292415/job/30766175378?pr=888

Sad; yes, I saw the same error in the setup-qemu-action repository discussions.

orhun (Owner) commented Sep 27, 2024

Can you send a link to that discussion?

Man... all I wanted was to build arm64 images...

LtdSauce (Contributor) commented:

I thought QEMU is not needed when using the arm64 macOS runner? The architecture is already native, and AFAIK you should already be able to do Docker builds on macOS that produce a Linux image?

M0NsTeRRR (Author) commented:

Can you send a link to that discussion?

Man... all I wanted was to build arm64 images...

docker/setup-qemu-action#137
docker/setup-qemu-action#24

M0NsTeRRR (Author) commented:

I thought QEMU is not needed when using the arm64 macOS runner? The architecture is already native, and AFAIK you should already be able to do Docker builds on macOS that produce a Linux image?

Yes, with two different jobs instead of a matrix 👍

orhun (Owner) commented Sep 27, 2024

I thought QEMU is not needed when using the arm64 macOS runner? The architecture is already native, and AFAIK you should already be able to do Docker builds on macOS that produce a Linux image?

Ah damn, right...

LtdSauce (Contributor) commented:

FTR: this seems to be the way it is until GitHub launches their linux/arm64 runners.

There seems to be no way to get Docker builds working on the arm64 macOS runners.

LtdSauce (Contributor) commented Oct 5, 2024

TL;DR

cargo chef cook does not use the rust-toolchain.toml file while cargo build does. Thus the dependencies are built twice instead of being cached, which might be the cause of the timeout. Copying the toolchain file before invoking cargo chef cook solves this.

In-Depth

I had another look at the build logs of the failing actions and noticed something weird: although cargo chef cook already builds the dependencies, they are built again by cargo build during the Build and push step.
This made me wonder whether the arguments passed to those commands need to be exactly the same, or whether something else is wrong here.
After some fiddling around, I did manage to use the cached dependencies from the chef cook step, on my 4th try...

The following build times are from my local machine. As a benchmark, the current HEAD took ~860s with a plain docker build .
Disclaimer: I pre-pulled the cargo-chef base image in all cases to reduce the "noise" a normal docker pull would add here.
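(I.e. the measurements were taken roughly like this:)

# Pre-pull the base image so pull time doesn't skew the numbers, then time a cold build:
docker pull lukemathwalker/cargo-chef:0.1.63-rust-1.76-slim-buster
time docker build .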

Try 1 (Pass same arguments to cargo-chef cook)

So I basically applied the following patch to the current HEAD to try to get caching going again:

diff --git a/Dockerfile b/Dockerfile
index 6175d23..8723799 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -5,13 +5,13 @@ WORKDIR app
 FROM chef AS planner
 COPY . .
 RUN cargo chef prepare --recipe-path recipe.json
-
 FROM chef AS builder
 COPY --from=planner /app/recipe.json recipe.json
 ENV CARGO_NET_GIT_FETCH_WITH_CLI=true
-RUN cargo chef cook --release --recipe-path recipe.json
+ENV BUILD_ARGS="--release --locked --no-default-features --features github --features gitlab --features bitbucket"
+RUN cargo chef cook ${BUILD_ARGS} --recipe-path recipe.json
 COPY . .
-RUN cargo build --release --locked --no-default-features --features github --features gitlab --features bitbucket
+RUN cargo build ${BUILD_ARGS}
 RUN rm -f target/release/deps/git_cliff*
 
 FROM debian:buster-slim as runner

The build with just docker build . took 880.942s to complete.
So that did not solve it, as the dependencies were built twice again. Basically, I have the feeling that cargo chef cook is not doing what I'd expect from it, or I did not manage to use it correctly. But as the current state builds the dependencies twice, this might be why the arm64 build times out (though I am not sure why it gets so obviously stuck in the second round of compiling, as the reported numbers from the Docker builds do not add up to 6h).

Try 2 (remove cargo-chef cook)

So my next idea: if the dependency caching is not actually caching anything, it is better removed.
(Patch again based on HEAD, not on my previous patch.)

diff --git a/Dockerfile b/Dockerfile
index 6175d23..e60a085 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -2,14 +2,8 @@
 FROM lukemathwalker/cargo-chef:0.1.63-rust-1.76-slim-buster AS chef
 WORKDIR app
 
-FROM chef AS planner
-COPY . .
-RUN cargo chef prepare --recipe-path recipe.json
-
 FROM chef AS builder
-COPY --from=planner /app/recipe.json recipe.json
 ENV CARGO_NET_GIT_FETCH_WITH_CLI=true
-RUN cargo chef cook --release --recipe-path recipe.json
 COPY . .
 RUN cargo build --release --locked --no-default-features --features github --features gitlab --features bitbucket
 RUN rm -f target/release/deps/git_cliff*

This build took 603.229s. So a speed-up of ~280 seconds...

Try 3 (Toolchain file)

Now I noticed that this project is using a toolchain file... and cargo-chef added a commit emitting the toolchain file in a version not yet used here... so let's try our luck by "just" bumping the cargo-chef version:

diff --git a/Dockerfile b/Dockerfile
index 6175d23..ed61a33 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,5 +1,5 @@
 # syntax=docker/dockerfile:1.4.3-labs
-FROM lukemathwalker/cargo-chef:0.1.63-rust-1.76-slim-buster AS chef
+FROM lukemathwalker/cargo-chef:0.1.68-rust-1.81-slim AS chef
 WORKDIR app
 
 FROM chef AS planner

But that yields the same result... so let's also add the arguments passed to cargo to chef, like in my first try above. But it still did not work...

Try 4 (copy toolchain file by hand)

Then I stumbled across LukeMathWalker/cargo-chef#271, which looks exactly like the issue from my 3rd try. So, as cargo chef cook does not use the toolchain file to build the dependencies, let's just copy it by hand. I applied the following patch to the unmodified HEAD again:

diff --git a/Dockerfile b/Dockerfile
index 6175d23..abcae25 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -9,6 +9,10 @@ RUN cargo chef prepare --recipe-path recipe.json
 FROM chef AS builder
 COPY --from=planner /app/recipe.json recipe.json
 ENV CARGO_NET_GIT_FETCH_WITH_CLI=true
+# cargo-chef prepare only started emitting the toolchain file in 0.1.64, but cargo chef cook does not use it to build the dependencies.
+# See https://github.com/LukeMathWalker/cargo-chef/issues/271.
+# Until that is fixed, we have to manually copy it before invoking cargo chef cook.
+COPY ./rust-toolchain.toml rust-toolchain.toml
 RUN cargo chef cook --release --recipe-path recipe.json
 COPY . .
 RUN cargo build --release --locked --no-default-features --features github --features gitlab --features bitbucket

Aaaand 🥁 it reduced the build time to 576.634s and actually cached nearly all dependencies!
So this sounds like a viable workaround until cargo-chef can use the toolchain file.

(Some dependencies are rebuilt... I suspect this is connected to macro expansion or something like that, which is not cacheable AFAIK.)

Conclusion

The build times in the Docker build are quite high because cargo-chef, in its current state, builds the dependencies twice. By injecting the toolchain file manually, the caching works again and the build times drop. This might not be enough to stop hitting the timeout on its own, but as the cargo chef cook step is already fast enough, it should in theory make the build work again.

Building the dependencies with the nightly compiler took roughly the same time as with the 1.76 installed in the image.

So I suggest just copying the toolchain file, and hopefully the arm64 images can be built again.
But it is getting late... so I will not submit a PR now, and I am unsure when I will get the time to do so. Feel free to apply the patch from my 4th try.

LtdSauce (Contributor) commented:

FTR: I submitted a PR to cargo-chef that hopefully solves the double install of dependencies and would make my suggested workaround unnecessary.

LtdSauce added a commit to LtdSauce/git-cliff that referenced this issue Oct 11, 2024
…ly (orhun#879)"

This reverts commit cde2a8e.
Commit 7b1bf94 made it possible to
build the arm64 image again without running into timeouts.
LtdSauce (Contributor) commented:

I guess I found an easier way to solve this and opened a PR to fix it. Let me know if you agree.
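The gist of it: keep the toolchain files out of the Docker build context via .dockerignore, roughly (a sketch of the change):

# Append the known rust-toolchain file names to .dockerignore:
printf 'rust-toolchain\nrust-toolchain.toml\n' >> .dockerignore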

LtdSauce added a commit to LtdSauce/git-cliff that referenced this issue Oct 14, 2024
…ly (orhun#879)"

This reverts commit cde2a8e.
Commit 73f75d5 made it possible to
build the arm64 image again without running into timeouts.
orhun (Owner) commented Oct 17, 2024

Wow man, you rock! Thanks for investigating it once again.

orhun (Owner) commented Oct 17, 2024

Merged your PR, let's see how it'll go 🤞🏼

https://github.com/orhun/git-cliff/actions/runs/11380858481/job/31661022271

orhun pushed a commit that referenced this issue Oct 17, 2024
* chore(docker): ignore rust toolchain in docker builds

This commit adds the known names of the rust-toolchain files to the
.dockerignore file. There are two reasons why this makes sense:

- The initial Docker layer already has a Rust toolchain set up that is
  sufficient to build the project. Thus, by providing a toolchain file,
  the toolchain would be installed again during docker build.
- Currently cargo-chef only copies the toolchain files during cooking,
  but they are not used when building the dependencies in the cook
  call, see LukeMathWalker/cargo-chef#271.
  With this in mind, the dependencies were actually built twice: once
  with the installed toolchain from the image itself, and then in the
  actual cargo build call with the toolchain specified in the
  toolchain file. Building them twice resulted in timeouts when
  building the arm64 images, as they are emulated using QEMU, which is
  itself already slower than building natively.

Now one could argue that, as soon as the mentioned issue is solved,
using the toolchain file again would be fine. But the Dockerfile would
still need to be assembled in a way that the toolchain is not
installed twice, because the current structure of the Dockerfile
installs the toolchain once in the cargo-chef prepare step and once
during the cargo build step (and would later install it during the
cargo-chef cook instead of cargo build).

With all this in mind, using no toolchain file and instead relying on
the sufficient Rust installation from the base image makes sense.

* Revert "chore(docker): disable building arm64 docker images temporarily (#879)"

This reverts commit cde2a8e.
Commit 73f75d5 made it possible to
build the arm64 image again without running into timeouts.