-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Python BQ] Allow setting a fixed number of Storage API streams #28592
Conversation
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control |
Run Python_Xlang_Gcp_Direct PostCommit |
Run Python_Xlang_Gcp_Dataflow PostCommit |
Codecov Report
@@ Coverage Diff @@
## master #28592 +/- ##
==========================================
- Coverage 72.23% 72.20% -0.03%
==========================================
Files 684 684
Lines 100991 101132 +141
==========================================
+ Hits 72949 73021 +72
- Misses 26463 26532 +69
Partials 1579 1579
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 11 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@@ -2018,6 +2019,8 @@ def __init__( | |||
determined number of shards to write to BigQuery. This can be used for | |||
all of FILE_LOADS, STREAMING_INSERTS, and STORAGE_WRITE_API. Only | |||
applicable to unbounded input. | |||
num_storage_api_streams: If set, the Storage API sink will default to | |||
using this number of write streams. Only applicable to unbounded data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_storage_api_streams: specifies the number of write streams that the Storage API sink will use. This parameter is only applicable to unbounded data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we check this should be always set now for the unbounded data since it won't work otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we check this should be always set now for the unbounded data since it won't work otherwise.
streaming writes with at-least-once still works without setting this parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed the documentation, thanks for the suggestion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Shall we add something like "This parameter must be set for Storage API writes with the exactly once method."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hesitate on doing this because conventionally we don't do runner-based checks in the SDK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I clarified it in the public BigQuery connector doc (https://beam.apache.org/documentation/io/built-in/google-bigquery/)
Just had minor comments. LGTM from the Python side. Thanks. |
@@ -260,33 +262,43 @@ def run_streaming( | |||
streaming=True, | |||
allow_unsafe_triggers=True) | |||
|
|||
auto_sharding = num_streams == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took me some time to parse it a bit, nit and might be common practice, but auto_sharding = (num_streams == 0)
looks better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
…age API streams (#28618) * expose num streams option; fix some streaming tests * add test phrase in description * lint fix
…he#28592) * expose num streams option; fix some streaming tests * clarify docs; address nit
…ms (apache#28592)" (apache#28613) This reverts commit 04a26da.
Allow users to set a number of streams for the Storage Write API sink.
Fixes #28587
Provides a workaround for issue in #28168
Update: had to revert this (#28613) and fix some linting errors. PR after fixes is in #28618