Add file loads streaming integration tests #28312
Conversation
Fixes #28309

Run PostCommit_Java_Dataflow

R: @Abacn

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
Thanks! This will fill an important gap; I had a comment below.
.apply("GroupByKey", GroupByKey.create())
.apply("Extract Values", Values.create())
// We use this and the following GBK to aggregate by final destination.
// This way, each destination has its own pane sequence
I see this fixes #28309, thanks!
@@ -689,6 +689,10 @@ public static Row toBeamRow(Schema rowSchema, TableSchema bqSchema, TableRow jso
  }
}

if (jsonBQValue instanceof byte[] && fieldType.getTypeName() == TypeName.BYTES) {
Just curious: what does this shortcut return do?
This type (byte[]) was previously unsupported, so converting a TableRow with a byte[] field to a Beam Row would throw an UnsupportedOperationException below.
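A minimal sketch of how such an early-return guard works (this is not the actual Beam implementation; the class name, method name, and `typeName` parameter here are illustrative stand-ins for the real `toBeamRow` conversion logic):

```java
public class ByteArrayShortcut {
    // Hypothetical stand-in for the conversion helper: when the incoming value
    // is already a byte[] and the target Beam type is BYTES, return it
    // unchanged rather than falling through to the generic conversion path,
    // which previously threw UnsupportedOperationException for this type.
    static Object toBeamValue(Object jsonBQValue, String typeName) {
        if (jsonBQValue instanceof byte[] && "BYTES".equals(typeName)) {
            return jsonBQValue; // shortcut: value is already in its Beam form
        }
        throw new UnsupportedOperationException(
            "Converting to " + typeName + " is not supported");
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3};
        Object converted = toBeamValue(payload, "BYTES");
        System.out.println(converted == payload); // prints "true"
    }
}
```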
"Checking there is no duplication", expectedBeamRows.size(), actualBeamRows.size());
}

@Test
It may not be necessary to run a full matrix of every variable (fixed shards, dynamic destinations, use copyJobs), which generates 2*2*2 possibilities.
One way to reduce the complexity is to test the default, then each flag in its own test, then all features combined; this reduces the count from 2^n to n+2. Also, given that the default case (with auto-sharding) will be tested in #28272 (serving as the default), I propose keeping the following ones:
- testLoadWithFixedShards
- testWithAutoShardingAndCopyJobs
- testDynamicDestinationsWithAutoSharding
- testDynamicDestinationsWithAutoShardingAndCopyJobs
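The counting argument above can be sketched as follows (the flag names are illustrative, not the actual test parameters):

```java
public class TestMatrixSize {
    public static void main(String[] args) {
        // Three illustrative boolean test dimensions.
        String[] flags = {"fixedShards", "dynamicDestinations", "useCopyJobs"};
        int n = flags.length;

        // Full matrix: every on/off combination of n flags.
        int fullMatrix = 1 << n; // 2^n = 8

        // Reduced set: one default case (all flags off), one case per flag
        // toggled alone, and one case with every flag on.
        int reduced = 1 + n + 1; // n + 2 = 5

        System.out.println(fullMatrix + " cases reduced to " + reduced);
        // prints "8 cases reduced to 5"
    }
}
```

Each flag still gets exercised in isolation against the default, and the all-on case catches interactions between features, at a fraction of the full matrix cost.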
So we're saying that fixed/dynamic sharding is independent from single/multi loads. I think I agree with that.
I mostly agree with that new list, I just think we should do testDynamicDestinationsWithFixedShards instead of testDynamicDestinationsWithAutoSharding because there is a slightly different code path there
P.S. I think we can remove the FILE_LOADS test in #28272 and keep FileLoadsStreamingIT as the hub for streaming loads tests. That way we can keep the other file as BigQueryIOStorageWriteIT for storage write. wdyt?
we should do testDynamicDestinationsWithFixedShards instead of testDynamicDestinationsWithAutoSharding
Sure, that's good.
...e-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/FileLoadsStreamingIT.java
Thanks for the feedback @Abacn! I reduced test complexity, PTAL
Run PostCommit_Java_Dataflow

Sounds good to me

Most recent Legacy runner post commits passed

Failing tests are unrelated

Postcommit failure was due to quota: "Startup of the worker pool in zone us-central1-b failed to bring up any of the desired 1 workers." Good to check: does the test specify a particular zone? As we bring up more tests, we should request a quota increase and/or distribute work across different regions.

No specific zone for these tests; they all default to the US multi-region.
Addressing this large BigQuery integration testing gap.
Includes streaming with single load jobs, copy jobs, auto-sharding/fixed shards, and dynamic destinations.
Limiting it to testing only ~half the cases with the Dataflow runner.
Uncovered an issue with dynamic destinations while doing this: #28309