initialization actions which use apt-get update fail due to purged oldoldstable backports repository #1157

Open
olbapjose opened this issue Apr 14, 2024 · 10 comments

@olbapjose

olbapjose commented Apr 14, 2024

Very recently, Dataproc clusters started to fail at creation. The Kafka initialization script errors out because a Debian repository it uses is no longer available:

https://deb.debian.org/debian buster-backports Release

The error says:

  • Initialization action failed. Failed action 'gs://goog-dataproc-initialization-actions-europe-west1/kafka/kafka.sh', see output in: gs://XXXXX/google-cloud-dataproc-metainfo/a697c722-2bd7-440b-b4da-9494892703ac/XXXXXX-m/dataproc-initialization-script-0_output"

The contents of that file are shown below. Any advice or workaround is more than welcome.

image
@kishida-yuki

kishida-yuki commented Apr 15, 2024

I am in the same situation with 1.5-debian10.

@cjac changed the title from "Kafka initialization actions script fails due to missing Debian repository" to "initialization actions which use apt-get update fail due to purged oldoldstable backports repository" on Apr 15, 2024
@cjac
Contributor

cjac commented Apr 15, 2024

Thank you for the report. We are addressing this issue with the highest priority.

@akhanna213
Contributor

The fix in #1161 for the GPU init actions has been verified. We are already working on the same patch for the other init actions that are failing with the same error.

As an urgent fix, customers and developers can clone the init action, add the same lines of code as in the fix to their copy, and use that copy for cluster creation. Please note that we do not encourage customers to use cloned init scripts: a clone does not receive updates, so it has to be cloned again every time the init actions repository changes. So unless it is urgent, please wait for the other fixes to go in :)
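
For anyone patching a cloned copy, the kind of change involved might look roughly like the sketch below. It is not the code from #1161. It assumes the failure comes from an apt source entry that still points at `buster-backports`, and it simply comments such entries out before `apt-get update` runs. The function name `disable_backports` and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the code from fix #1161.
# disable_backports is a hypothetical helper that comments out every apt
# source line mentioning the given suite in the files passed to it.
disable_backports() {
  local suite="$1"   # suite name to disable, e.g. buster-backports
  shift
  local file tmp
  for file in "$@"; do
    [ -f "$file" ] || continue
    tmp="$(mktemp)"
    # Prefix matching lines with "# " unless they are already comments.
    awk -v suite="$suite" '
      index($0, suite) && $0 !~ /^[ \t]*#/ { print "# " $0; next }
      { print }
    ' "$file" > "$tmp"
    # Write back in place so the original file mode and owner are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
  done
}

# Intended call site in a cloned init action, before any apt-get update:
#   disable_backports buster-backports /etc/apt/sources.list /etc/apt/sources.list.d/*.list
```

Commenting the lines out rather than deleting them keeps the change easy to inspect on a failed node, and running the helper twice leaves the file unchanged.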

@ahmedetefy

@akhanna213 I just tried the latest version of install_gpu_driver.sh: I created a Dataproc cluster through the UI, pointing it at that latest version of the script, and I am still running into initialization issues.

@olbapjose
Author

olbapjose commented May 20, 2024

@akhanna213 @cjac I have run the command and it is still failing. Could you please provide an update? It is very important for us to have this up and running. I am using --image-version 2.0-debian10 which I know is a bit old but I don't think it is related to the issue, correct?

Thanks

@akhanna213
Contributor

Hi @ahmedetefy @olbapjose, could you confirm whether the error message is still the same? We rolled out the fix a while back.

@olbapjose
Author

olbapjose commented May 21, 2024

@akhanna213 Please see the image below and the attachment, which is the output file mentioned in the error.

image

google-cloud-dataproc-metainfo_initialization-script-0_output.txt

Long story short, the error says 'Unable to update packages lists.'

@ahmedetefy

@akhanna213 Yes, I can confirm the error is still there.

Reproducing the error is quite straightforward:

gcloud dataproc clusters create cluster-e485 \
  --enable-component-gateway \
  --bucket <bucket_name> \
  --region <your-region> \
  --single-node \
  --master-machine-type n1-standard-8 \
  --master-boot-disk-type pd-balanced \
  --master-boot-disk-size 500 \
  --master-accelerator type=nvidia-tesla-t4 \
  --image-version <any 2.1 or above image version> \
  --optional-components JUPYTER \
  --initialization-actions '< gcs_path to latest install GPU driver script >' \
  --project <project_name>

I have also had issues with 2.0-ubuntu18 (even though it sometimes succeeds in installing the GPU drivers).

The error logs follow, in case they help:

E: Repository 'https://packages.cloud.google.com/apt google-cloud-logging-bionic-all InRelease' changed its 'Codename' value from 'google-cloud-logging-stretch-all' to 'google-cloud-logging-bionic-all'

@akhanna213
Contributor

Hi @ahmedetefy @olbapjose, this looks like a different issue from the one users were facing earlier. Let me check with the team to understand what is causing this breakage. I appreciate your patience on this; I will get back to you as soon as possible.

@olbapjose
Author

Hi @akhanna213, do you have any updates on this? Initially I was able to work around it by adding --allow-releaseinfo-change:

function update_apt_get() {
  retry_apt_command "apt-get update --allow-releaseinfo-change"
}

and it worked, but today it is failing again with a different message:

The following NEW packages will be installed:
gnupg2
0 upgraded, 1 newly installed, 0 to remove and 3 not upgraded.
Need to get 393 kB of archives.
After this operation, 411 kB of additional disk space will be used.
Err:1 http://deb.debian.org/debian buster/main amd64 gnupg2 all 2.2.12-1+deb10u1
404 Not Found [IP: 151.101.22.132 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/g/gnupg2/gnupg2_2.2.12-1+deb10u1_all.deb 404 Not Found [IP: 151.101.22.132 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

I will try again with --fix-missing, but it looks like the script is not robust, as it is exposed to several possible points of failure.
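
One possible reading of the 404 above is that the local package index is stale, which is what apt itself hints at when it suggests running apt-get update. Under that assumption, a more defensive pattern is to retry the index refresh and only then install. The sketch below is illustrative: `retry_command` is a hypothetical stand-in for `retry_apt_command`, whose body is not shown in this thread, and it takes the command as separate arguments rather than as one string. The attempt count and delay are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for retry_apt_command (the real helper is not shown
# in this thread). Runs a command up to <attempts> times, sleeping <delay>
# seconds between failed attempts. Returns 0 on the first success, 1 otherwise.
retry_command() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt ${i}/${attempts} failed: $*" >&2
    # Sleep only if another attempt is coming.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Hypothetical wrapper: refresh the package index first, then install, so apt
# does not ask the mirror for a package version that is no longer published.
install_with_fresh_index() {
  retry_command 5 10 apt-get update --allow-releaseinfo-change &&
    retry_command 5 10 apt-get install -y "$@"
}

# Intended call site in a cloned init action:
#   install_with_fresh_index gnupg2
```

This does not help when the repository itself has been removed, as with buster-backports; in that case the source entry has to be disabled or replaced before the update can succeed.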
