Minimal Nvidia GPU image #5

Draft
wants to merge 16 commits into
base: main

@ruffsl ruffsl commented Sep 25, 2024

Opening for visibility and collaboration. It would be nice to include a minimal Nvidia GPU AMI for use with RunsOn.
This PR currently modifies the default templates re-used by the gpu RELEASE_DIST example to slim down the resulting AMI, and changes the source AMI to the NVIDIA GPU-Optimized AMI.

Because of the constraints imposed by the source AMI, building the custom GPU AMI requires an Nvidia GPU instance_type to kick off the packer process. Perhaps this could be sidestepped by manually installing the Nvidia drivers and Nvidia container runtime, but that is something I've not yet bothered to reverse engineer.

View the commit log for some notable, subtle patches. One accommodates apt-lock blocking caused by the Nvidia source AMI's use of bashrc to bootstrap the drivers on first boot; another disables the AWS CLI installation, since it conflicts with the pre-installed version that ships with the Nvidia source AMI. The Nvidia source AMI is also initialized from a larger drive (128GB), so our child AMI also (unfortunately) requires a bump in minimum disk size, larger than the current default large option in the RunsOn disk size of 80GB. Thus some editing of the RunsOn CloudFormation settings was also needed. This may be another motivation to manually install the Nvidia drivers rather than rely on the source AMI.
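As a rough illustration of the lock-waiting idea (not the actual patch in this PR), a provisioning script could poll until nothing holds the apt/dpkg lock files before running any install step. The function name `wait_for_apt_locks` and the `LOCK_CHECK` override are hypothetical, introduced only so the sketch can be exercised without `fuser`; the default check relies on `fuser` exiting with status 0 while at least one of the files is in use.

```shell
#!/bin/bash
# Hypothetical helper, not taken from the PR: block until no process holds
# any of the given lock files, or give up after a timeout.
# Usage: wait_for_apt_locks TIMEOUT_SECONDS INTERVAL_SECONDS LOCK_FILE...
wait_for_apt_locks() {
  local timeout="$1" interval="$2"
  shift 2
  local waited=0
  # LOCK_CHECK is an assumption for testability; it defaults to `fuser`,
  # which exits 0 while at least one of the files is being accessed.
  while "${LOCK_CHECK:-fuser}" "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s waiting for apt/dpkg locks" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}
```

A provisioner might call it as `wait_for_apt_locks 600 5 /var/lib/dpkg/lock-frontend /var/lib/dpkg/lock /var/lib/apt/lists/lock` before the first `apt-get` step. On recent apt versions, passing `-o DPkg::Lock::Timeout=600` to `apt-get` may achieve a similar effect without a helper, though I have not checked that against this source AMI.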

Context:

given that the Nvidia source image requires a GPU
>   2024-08-09T03:45:28Z: ==> amazon-ebs.build_ebs: Error waiting for fleet request (fleet-d6b5ce15-8087-660d-0e92-0982be357801) to become ready:The instance configuration for this AWS Marketplace product is not supported. Please see the AWS Marketplace site for more information about supported instance types, regions, and operating systems.
given that 80 is too small for Nvidia source image
> 2024-08-09T03:11:58Z: ==> amazon-ebs.build_ebs: Error waiting for fleet request (fleet-5415661f-0887-e407-0630-81802dcb1f95) to become ready:Your requested instance type (c7a.xlarge) is not supported in your requested Availability Zone (eu-west-2c).Your requested instance type (m7a.xlarge) is not supported in your requested Availability Zone (eu-west-2c).Volume of size 80GB is smaller than snapshot 'snap-00c0e57c77605a262', expect size>= 128GB
as it conflicts with Nvidia source image
`./bin/patch/ubuntu22-x64 releases/ubuntu22/x64`
given that nvidia source image apt installs drivers at startup
so we need to wait for dpkg locks to release
`./bin/patch/ubuntu22-x64 releases/ubuntu22/x64`
as it conflicts with Nvidia source image
`./bin/patch/ubuntu22-x64 releases/ubuntu22/x64`
```
2024/08/14 19:31:55 ui error: 2024-08-14T19:31:55Z: ==> amazon-ebs.build_ebs: Error modify AMI attributes: AuthFailure: AMIs with product codes can't be made public
```
`./bin/patch/ubuntu22-x64 releases/ubuntu22/x64`
because it takes so much time to build
`./bin/patch/ubuntu22-x64 releases/ubuntu22/x64`
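For what it's worth, the volume-size failure in the log above could be caught before launching Packer with a small pre-flight check. This is only a sketch under assumptions: `check_volume_size` and `SOURCE_AMI_ID` are hypothetical names, and the commented AWS CLI query assumes the root volume is the first block device mapping of the source AMI.

```shell
#!/bin/bash
# Hypothetical pre-flight check, not part of this PR: fail early when the
# requested root volume is smaller than the source AMI's snapshot.
# Usage: check_volume_size REQUESTED_GB SNAPSHOT_GB
check_volume_size() {
  local requested_gb="$1" snapshot_gb="$2"
  if [ "$requested_gb" -lt "$snapshot_gb" ]; then
    echo "volume of ${requested_gb}GB is smaller than snapshot size ${snapshot_gb}GB" >&2
    return 1
  fi
  return 0
}

# The snapshot size could be looked up with the AWS CLI, for example
# (SOURCE_AMI_ID is a placeholder; index 0 is assumed to be the root device):
#   aws ec2 describe-images --image-ids "$SOURCE_AMI_ID" \
#     --query 'Images[0].BlockDeviceMappings[0].Ebs.VolumeSize' --output text
```

With the sizes from the log, `check_volume_size 80 128` returns a non-zero status, while `check_volume_size 128 128` passes.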
@crohr
Contributor

crohr commented Sep 26, 2024

Thanks @ruffsl! I think you forgot to push configure-apt-mock.sh?

I will experiment with this and also try @samayala22's approach, since it would be nice to be able to simply extend the base RunsOn images with the additional drivers.

@ruffsl
Author

ruffsl commented Sep 26, 2024

> I think you forgot to push configure-apt-mock.sh?

I forget how/where the templates populate from, but it's already included in the source tree here:

#!/bin/bash -e
################################################################################
## File: configure-apt-mock.sh
## Desc: A temporary workaround for https://github.com/Azure/azure-linux-extensions/issues/1238.
## Cleaned up during cleanup.sh.
################################################################################

> I will experiment with this and also try @samayala22's approach, since it would be nice to be able to simply extend the base RunsOn images with the additional drivers.

That could be a better and more customizable approach; I've updated the OP to add a note about the disk usage.
