
[Python BQ] Allow setting a fixed number of Storage API streams #28592

Merged
merged 5 commits on Sep 22, 2023

Conversation

@ahmedabu98 (Contributor) commented Sep 21, 2023:

Allow users to set a fixed number of streams for the Storage Write API sink.

Fixes #28587
Provides a workaround for issue in #28168

Update: this had to be reverted (#28613) to fix some linting errors; the follow-up PR with the fixes is #28618.
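For context, a minimal sketch of how the new option fits together, based on the parameter name in the diff (`num_storage_api_streams`) and the semantics discussed in review: 0 means runner-side auto-sharding, and exactly-once streaming writes need a fixed, positive stream count. The helper name below is hypothetical and not part of the Beam API:

```python
# Illustrative sketch only, NOT the Beam SDK itself. It mirrors the
# convention this PR adds to WriteToBigQuery: num_storage_api_streams=0
# (the default) means runner-side auto-sharding, while a positive value
# fixes the number of Storage Write API streams.
def storage_write_kwargs(num_storage_api_streams=0, at_least_once=False):
    if num_storage_api_streams < 0:
        raise ValueError("num_storage_api_streams must be >= 0")
    # Per the review discussion: exactly-once streaming writes need a
    # fixed stream count, while at-least-once works with auto-sharding.
    if not at_least_once and num_storage_api_streams == 0:
        raise ValueError(
            "exactly-once Storage API streaming needs a fixed stream count")
    kwargs = {"method": "STORAGE_WRITE_API"}
    if num_storage_api_streams > 0:
        kwargs["num_storage_api_streams"] = num_storage_api_streams
    return kwargs
```

One might then splat the result into a `WriteToBigQuery(...)` call, e.g. `storage_write_kwargs(4)`. Note this is a sketch of the documented requirement only; the SDK itself deliberately avoids enforcing such runner-dependent checks (see the review thread).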

@ahmedabu98 (Contributor Author):

R: @bvolpato
R: @liferoad

@github-actions (bot):

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@ahmedabu98 (Contributor Author):

Run Python_Xlang_Gcp_Direct PostCommit

@github-actions github-actions bot added the infra label Sep 21, 2023
@ahmedabu98 (Contributor Author):

Run Python_Xlang_Gcp_Dataflow PostCommit

@codecov (bot) commented Sep 21, 2023:

Codecov Report

Merging #28592 (3ec836e) into master (534f93a) will decrease coverage by 0.03%.
Report is 17 commits behind head on master.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master   #28592      +/-   ##
==========================================
- Coverage   72.23%   72.20%   -0.03%     
==========================================
  Files         684      684              
  Lines      100991   101132     +141     
==========================================
+ Hits        72949    73021      +72     
- Misses      26463    26532      +69     
  Partials     1579     1579              
Flag   | Coverage Δ
python | 82.75% <50.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

Files Changed                              | Coverage Δ
sdks/python/apache_beam/io/gcp/bigquery.py | 69.92% <50.00%> (-0.05%) ⬇️

... and 11 files with indirect coverage changes


@@ -2018,6 +2019,8 @@ def __init__(
determined number of shards to write to BigQuery. This can be used for
all of FILE_LOADS, STREAMING_INSERTS, and STORAGE_WRITE_API. Only
applicable to unbounded input.
num_storage_api_streams: If set, the Storage API sink will default to
using this number of write streams. Only applicable to unbounded data.
Collaborator:

num_storage_api_streams: specifies the number of write streams that the Storage API sink will use. This parameter is only applicable to unbounded data.

Collaborator:

Shall we check that this is always set for unbounded data, since it won't work otherwise?

Contributor (Author):

> Shall we check that this is always set for unbounded data, since it won't work otherwise?

Streaming writes with at-least-once semantics still work without setting this parameter.

Contributor (Author):

changed the documentation, thanks for the suggestion!

Collaborator:

I see. Shall we add something like "This parameter must be set for Storage API writes with the exactly once method."?

Contributor (Author):

I hesitate to do this because, conventionally, we don't do runner-based checks in the SDK.

Contributor (Author):

I clarified it in the public BigQuery connector doc (https://beam.apache.org/documentation/io/built-in/google-bigquery/)

@liferoad (Collaborator):

Just had minor comments. LGTM from the Python side. Thanks.

@@ -260,33 +262,43 @@ def run_streaming(
streaming=True,
allow_unsafe_triggers=True)

auto_sharding = num_streams == 0
Contributor:

Nit (it might be common practice, but it took me a moment to parse): auto_sharding = (num_streams == 0) reads better.
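The nit above as a minimal before/after sketch (the `num_streams` value here is an arbitrary example; in this PR, 0 means auto-sharding):

```python
num_streams = 0  # example value; 0 means "let the runner auto-shard"

# Before: correct, but the reader must recall that comparison binds
# tighter than assignment.
auto_sharding = num_streams == 0

# After the nit: identical behavior, intent visible at a glance.
auto_sharding = (num_streams == 0)
```

Both forms assign the boolean result of the comparison; the parentheses only make the grouping explicit.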

@bvolpato (Contributor) left a comment:

LGTM, thanks!

@ahmedabu98 ahmedabu98 merged commit 04a26da into apache:master Sep 22, 2023
90 of 95 checks passed
ahmedabu98 added a commit that referenced this pull request Sep 22, 2023
ahmedabu98 added a commit that referenced this pull request Sep 22, 2023
ahmedabu98 added a commit that referenced this pull request Sep 22, 2023
…age API streams (#28618)

* expose num streams option; fix some streaming tests

* add test phrase in description

* lint fix
kennknowles pushed a commit to kennknowles/beam that referenced this pull request Sep 22, 2023
…he#28592)

* expose num streams option; fix some streaming tests

* clarify docs; address nit
kennknowles pushed a commit to kennknowles/beam that referenced this pull request Sep 22, 2023