Gcsio migration #29360

Merged
merged 2 commits into from
Nov 16, 2023

@shunping shunping commented Nov 9, 2023

This is a continuation of the gcsio migration work. Most of the work was done by @BjornPrime; here I am fixing a few edge cases and making sure all tests pass before submission.

fixes #25676




github-actions bot commented Nov 9, 2023

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @liferoad for label python.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@shunping
Contributor Author

I am going to run some more tests before this is merged. I will let you know.

@shunping
Contributor Author

stop reviewer notifications

Contributor

Stopping reviewer notifications for this pull request: requested by reviewer

codecov bot commented Nov 11, 2023

Codecov Report

Attention: 174 lines in your changes are missing coverage. Please review.

Comparison is base (729c4de) 38.32% compared to head (acb57f3) 37.85%.
Report is 3 commits behind head on master.

Files                                                   Patch %   Lines
sdks/python/apache_beam/io/gcp/gcsio.py                 10.31%    113 Missing ⚠️
...apache_beam/runners/dataflow/internal/apiclient.py   0.00%     22 Missing ⚠️
..._beam/runners/portability/sdk_container_builder.py   0.00%     15 Missing ⚠️
sdks/python/apache_beam/io/gcp/gcsfilesystem.py         0.00%     13 Missing ⚠️
...ks/python/apache_beam/runners/interactive/utils.py   0.00%     10 Missing ⚠️
sdks/python/apache_beam/internal/gcp/auth.py            50.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #29360      +/-   ##
==========================================
- Coverage   38.32%   37.85%   -0.47%     
==========================================
  Files         694      690       -4     
  Lines      102373   101305    -1068     
==========================================
- Hits        39235    38354     -881     
+ Misses      61546    61359     -187     
  Partials     1592     1592              
Flag     Coverage Δ
python   29.01% <7.44%> (-0.87%) ⬇️
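
As a quick consistency check on the figures in this report, the sketch below recomputes the coverage percentages from the Hits and Lines rows. It assumes the percentages are truncated to two decimals rather than rounded; that is only an inference from these particular numbers, not documented Codecov behavior, and `truncated_percent_bp` is a hypothetical helper.

```python
def truncated_percent_bp(hits: int, lines: int) -> int:
    """Coverage in hundredths of a percent, truncated (assumption: not rounded)."""
    return hits * 10000 // lines


# Values copied from the coverage diff (base = master, head = this PR).
base = {"hits": 39235, "misses": 61546, "partials": 1592, "lines": 102373}
head = {"hits": 38354, "misses": 61359, "partials": 1592, "lines": 101305}

# Hits + Misses + Partials should add up to Lines in each column.
for row in (base, head):
    assert row["hits"] + row["misses"] + row["partials"] == row["lines"]

base_bp = truncated_percent_bp(base["hits"], base["lines"])
head_bp = truncated_percent_bp(head["hits"], head["lines"])

# Per-file "Missing" counts from the patch table.
missing_by_file = [113, 22, 15, 13, 10, 1]

print(base_bp / 100, head_bp / 100, (head_bp - base_bp) / 100, sum(missing_by_file))
```

Under that assumption the recomputed values agree with the reported 38.32%, 37.85%, the -0.47% delta, and the 174 missing lines.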


BjornPrime and others added 2 commits November 16, 2023 09:18
The two commits are merged into one:
* Reapply "Replace StorageV1 client with GCS client (apache#28079)" (apache#28721)
* added project parameter to apiclient

@johnjcasey johnjcasey left a comment

Approved! Let's hope this attempt and our safeguards are enough!

@johnjcasey johnjcasey merged commit 8ac8b20 into apache:master Nov 16, 2023
89 of 92 checks passed
jto pushed a commit to jto/beam that referenced this pull request Nov 17, 2023
* Cherry pick two previous commits on migrating gcs client

The two commits are merged into one:
* Reapply "Replace StorageV1 client with GCS client (apache#28079)" (apache#28721)
* added project parameter to apiclient

* Initialize storage client with project from pipeline option.

---------

Co-authored-by: Bjorn Pedersen <[email protected]>
Abacn pushed a commit that referenced this pull request Nov 17, 2023
@@ -352,27 +352,20 @@ def delete(self, paths):
Args:
Contributor

Is the comment on line 350 still relevant? Directories are not being deleted recursively.

Contributor Author

GCS is currently a flat system. I think it will match and delete all the files with that prefix, which is equivalent to deleting the files recursively.
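
A minimal sketch of the flat-namespace point above, using an in-memory dict in place of a real bucket. `delete_prefix` and the object names are hypothetical and only illustrate the idea; this is not Beam's or the GCS client's API.

```python
def delete_prefix(objects: dict[str, bytes], prefix: str) -> list[str]:
    """Delete every object whose name starts with `prefix`; return the deleted names."""
    doomed = sorted(name for name in objects if name.startswith(prefix))
    for name in doomed:
        del objects[name]
    return doomed


# Hypothetical object names. The "/" characters are just part of each name;
# there is no real directory tree behind them.
bucket = {
    "data/run1/a.txt": b"a",
    "data/run1/logs/b.txt": b"b",
    "data/run2/keep.txt": b"c",
}

deleted = delete_prefix(bucket, "data/run1/")
print(deleted)
print(sorted(bucket))
```

In this model, deleting by prefix removes nested names as well, which is why it behaves like a recursive delete.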

Contributor

Hmm, I am running a few examples, and the files within the matched prefix are still not getting deleted.

For example,

I have path=gs://anandinguva-test/artifacts/53b617/ and when I call GCSFileSystem(options).delete([path]), I expect it to delete the directories/buckets and files/objects within the path. Maybe we can clarify this in comments.

Contributor

https://github.com/apache/beam/pull/29477/files - this follows a pattern similar to the local filesystem. Would this work?

Contributor Author

Have you debugged the original code? I am curious which part doesn't work. I guess it is the file-matching part?

Contributor

Yes, the matching part. When a directory is provided, we append *.

  • Expected - all the dirs and objects should be deleted.
  • Actual - only objects are getting deleted.

Solution - let's use bucket.list_buckets() and match the path with the provided dirs and delete them. I can spin up a PR if this sounds good.
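
One way such a gap could arise is sketched below, under the assumption that `*` in the match pattern does not cross `/`. The thread does not confirm that this is the cause; `glob_to_regex`, `match_for_delete`, and the object names are hypothetical and are not Beam's matcher.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Hypothetical translator: `*` matches any run of non-slash characters."""
    parts = ("[^/]*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile("".join(parts) + r"\Z")


def match_for_delete(path: str, names: list[str]) -> list[str]:
    # Mirror the step described above: append `*` when a directory is given.
    pattern = path + "*" if path.endswith("/") else path
    regex = glob_to_regex(pattern)
    return [name for name in names if regex.match(name)]


# Hypothetical object names under one "directory".
names = [
    "gs://example-bucket/run/a.txt",
    "gs://example-bucket/run/sub/b.txt",
]

matched = match_for_delete("gs://example-bucket/run/", names)
print(matched)
```

Under this assumption only the direct child is matched, so anything under a nested "directory" would survive a delete that relies on the match result.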

Contributor Author

That makes sense to me. Thanks for working on this! Please go ahead and create a PR.

shunping added a commit to shunping/beam that referenced this pull request Nov 21, 2023
This value was changed in PR apache#29360 to 1000, which led to
internal test failure.
johnjcasey pushed a commit that referenced this pull request Nov 21, 2023
This value was changed in PR #29360 to 1000, which led to
internal test failure.
riteshghorse pushed a commit to riteshghorse/beam that referenced this pull request Nov 21, 2023
This value was changed in PR apache#29360 to 1000, which led to
internal test failure.
Successfully merging this pull request may close these issues.

Replace storage_v1_client with GCS client
5 participants