The PostCommit Python Arm job is flaky #30760

Open
github-actions bot opened this issue Mar 27, 2024 · 19 comments

Comments

@github-actions
Contributor

The PostCommit Python Arm job is failing over 50% of the time.
Please visit https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python_Arm.yml?query=is%3Afailure+branch%3Amaster to see the logs.
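
As a rough illustration of what "failing over 50% of the time" could mean in practice, here is a minimal sketch that classifies a list of workflow run conclusions. This is not the bot's actual logic: the function names and the "permared"/"flaky"/"healthy" labels are hypothetical, and the only inputs assumed are GitHub Actions conclusion strings such as "success", "failure", and "cancelled".

```python
def failure_rate(conclusions):
    """Fraction of finished runs that failed; cancelled or skipped runs are ignored."""
    finished = [c for c in conclusions if c in ("success", "failure")]
    if not finished:
        return 0.0
    failures = sum(1 for c in finished if c == "failure")
    return failures / len(finished)


def classify(conclusions, flaky_threshold=0.5):
    """Label a run history (hypothetical labels, threshold taken from the report above)."""
    rate = failure_rate(conclusions)
    if rate == 1.0:
        return "permared"  # every finished run failed
    if rate > flaky_threshold:
        return "flaky"  # failing more than half the time
    return "healthy"
```

Note that a history viewed through an is:failure filter would contain only failures, so it would always look permared under this kind of check.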

@chamikaramj
Contributor

@tvalentyn do we have a good owner for this?

@ahmedabu98
Contributor

I actually can't find a single green run since this test suite was created (back in September).

@tvalentyn
Contributor

You may be right, thanks for the correction, @ahmedabu98

2024-04-24T12:03:53.0963029Z Please verify that you have permissions to write to the parent directory..
2024-04-24T12:03:53.0964903Z The configuration directory may not be writable. To learn more, see https://cloud.google.com/sdk/docs/configurations#creating_a_configuration
2024-04-24T12:03:53.0968080Z ERROR: (gcloud.auth.docker-helper) Could not create directory [/var/lib/kubelet/pods/573a1844-124b-4e12-bb0f-0325d0f3c3aa/volumes/kubernetes.io~empty-dir/gcloud]: Permission denied.
2024-04-24T12:03:53.0969612Z 
2024-04-24T12:03:53.0970063Z Please verify that you have permissions to write to the parent directory.
2024-04-24T12:03:53.3953756Z #29 pushing layers 1.4s done
2024-04-24T12:03:53.3956208Z #29 ERROR: failed to push us.gcr.io/apache-beam-testing/github-actions/beam_python3.8_sdk:2.57.0-SNAPSHOT: error getting credentials - err: exit status 1, out: ``
2024-04-24T12:03:53.8953735Z ------

cc: @damccorm - do you remember whether this suite ever worked, or is the above error an artifact of the GHA migration?

We can reclassify this as part of the ARM backlog work.
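
For anyone reproducing the gcloud.auth.docker-helper error in the log above, here is a minimal sketch of checking whether the gcloud configuration directory can be written or created. The function name is hypothetical. CLOUDSDK_CONFIG is the environment variable gcloud reads for its configuration directory, and ~/.config/gcloud is assumed here as the default location on Linux.

```python
import os


def config_dir_is_writable(path):
    """Return True if `path` is a writable directory, or could be created
    under its nearest existing ancestor directory."""
    probe = os.path.abspath(path)
    # Walk up until we reach something that exists on disk.
    while not os.path.exists(probe):
        parent = os.path.dirname(probe)
        if parent == probe:  # reached the filesystem root without a match
            return False
        probe = parent
    # Creating entries needs write and search permission on that directory.
    return os.path.isdir(probe) and os.access(probe, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # Assumed default location; gcloud honors CLOUDSDK_CONFIG when it is set.
    default_dir = os.path.expanduser("~/.config/gcloud")
    config_dir = os.environ.get("CLOUDSDK_CONFIG", default_dir)
    print(config_dir, config_dir_is_writable(config_dir))
```

This only reports the permission state; it does not say why the emptyDir volume on the runner pod is not writable.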

@tvalentyn tvalentyn added P2 and removed P1 labels Apr 24, 2024
@damccorm
Contributor

@damccorm
Contributor

Looks like it went flaky, then permared, around then.

@ahmedabu98
Contributor

Ahh, my apologies, I was looking at it through an is:failure filter.

@volatilemolotov
Contributor

So by removing
https://github.com/apache/beam/blob/master/.github/workflows/beam_PostCommit_Python_Arm.yml#L113

I get the test to move along, but it's still failing on my fork due to some permission issue with the Healthcare API.
The OAuth scope is wrong or something:
https://github.com/volatilemolotov/beam/actions/runs/8820257015/job/24213449686#step:13:13113

@damccorm
Contributor

@volatilemolotov could you put up a PR to make that change? Definitely seems like it is getting further.

@svetakvsundhar do you know what scope is missing? Given the normal postcommit python isn't failing, it might just be an issue with your service account specifically?

@volatilemolotov
Contributor

Sure, here it is:
#31102

@damccorm
Contributor

Thanks - merged, let's see what the result on master is.

@svetakvsundhar
Contributor

> @svetakvsundhar do you know what scope is missing? Given the normal postcommit python isn't failing, it might just be an issue with your service account specifically?

+1, it could be a service-account-specific issue. I'd want to see a couple more runs to confirm it's a real issue. If so, one option is to add ["https://www.googleapis.com/auth/cloud-platform"] as a scope manually in the test.
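
A minimal sketch of what adding that scope manually might look like, assuming the test obtains Application Default Credentials through the google-auth library. The helper names are hypothetical and this is not how the Beam Healthcare tests are currently written; google.auth.default(scopes=...) is the only library call used.

```python
CLOUD_PLATFORM_SCOPE = "https://www.googleapis.com/auth/cloud-platform"


def with_cloud_platform_scope(scopes=None):
    """Return a new list of scopes with cloud-platform appended if absent."""
    result = list(scopes or [])
    if CLOUD_PLATFORM_SCOPE not in result:
        result.append(CLOUD_PLATFORM_SCOPE)
    return result


def load_scoped_credentials(extra_scopes=None):
    """Hypothetical helper: request ADC with the cloud-platform scope included."""
    # Imported lazily so the pure helper above works without google-auth installed.
    import google.auth

    credentials, project_id = google.auth.default(
        scopes=with_cloud_platform_scope(extra_scopes)
    )
    return credentials, project_id
```

Whether this helps depends on the credential type; if the service account itself lacks the Healthcare permissions, widening the scope will not change the result.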

@volatilemolotov
Contributor

@damccorm
Contributor

Great, thanks @volatilemolotov

Looks like we're still flaky - https://github.com/apache/beam/actions/runs/8843342204/job/24283441647 - but that's an improvement, and it looks like a test flake rather than an infra issue.

@kennknowles
Member

Permared now

@damccorm
Contributor

Contributor Author

Reopening since the workflow is still flaky

@damccorm
Contributor

Fixed by #32530

@github-actions github-actions bot reopened this Nov 5, 2024
Contributor Author

github-actions bot commented Nov 5, 2024

Reopening since the workflow is still flaky

@damccorm
Contributor

damccorm commented Nov 5, 2024

This is failing because of Dataflow issues, not because of Beam. Dataflow is requesting Arm machines in regions where there are none, which fails the job. I reopened an internal bug (ID 352725422).
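
To rule out configuration on the test side, one could pin the region and worker machine type explicitly. The sketch below is an assumption-heavy illustration, not a fix for the service-side bug: the function name is hypothetical, the region allow-list is an assumed set of regions with Arm (T2A) capacity that should be checked against current Google Cloud documentation, and t2a-standard-1 is assumed as the Arm machine type. The flags themselves (--runner, --region, --worker_machine_type) are standard Beam Python pipeline options.

```python
# ASSUMPTION: regions believed to offer Arm (T2A) machines; verify before relying on this.
ASSUMED_ARM_REGIONS = frozenset({"us-central1", "europe-west4", "asia-southeast1"})


def arm_dataflow_args(region, machine_type="t2a-standard-1"):
    """Build Dataflow pipeline flags, refusing regions outside the assumed Arm list."""
    if region not in ASSUMED_ARM_REGIONS:
        raise ValueError(
            f"region {region!r} is not in the assumed Arm region list: "
            f"{sorted(ASSUMED_ARM_REGIONS)}"
        )
    return [
        "--runner=DataflowRunner",
        f"--region={region}",
        f"--worker_machine_type={machine_type}",
    ]
```

The returned list can be passed to PipelineOptions or appended to a test's argument list. Failing fast on an unexpected region at least separates a misconfigured test from the Dataflow placement problem described above.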

@damccorm damccorm removed this from the 2.60.0 Release milestone Nov 5, 2024

7 participants