Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add file loads streaming integration tests #28312

Merged
merged 10 commits into from
Sep 13, 2023

Conversation

ahmedabu98
Copy link
Contributor

@ahmedabu98 ahmedabu98 commented Sep 5, 2023

Addressing this large BigQuery integration testing gap.

Includes streaming with single load jobs, copy jobs, auto-sharding/fixed shards, and dynamic destinations.
Limiting it to test only ~half the cases with Dataflow runner.

Uncovered an issue with dynamic destinations while doing this: #28309

@ahmedabu98
Copy link
Contributor Author

Fixes #28309

@ahmedabu98
Copy link
Contributor Author

Run PostCommit_Java_Dataflow

@ahmedabu98 ahmedabu98 marked this pull request as ready for review September 12, 2023 14:17
@ahmedabu98
Copy link
Contributor Author

R: @Abacn
R: @johnjcasey
R: @shunping-google

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

Copy link
Contributor

@Abacn Abacn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! this will fill an important gap, had comment below

.apply("GroupByKey", GroupByKey.create())
.apply("Extract Values", Values.create())
// We use this and the following GBK to aggregate by final destination.
// This way, each destination has its own pane sequence
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this fixes #28309, thanks!

@@ -689,6 +689,10 @@ public static Row toBeamRow(Schema rowSchema, TableSchema bqSchema, TableRow jso
}
}

if (jsonBQValue instanceof byte[] && fieldType.getTypeName() == TypeName.BYTES) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious - what does this short cut return do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type (bye[]) was previously unsupported so converting a TableRow with byte[] field to Beam Row would throw an UnsupportedOperationException below.

"Checking there is no duplication", expectedBeamRows.size(), actualBeamRows.size());
}

@Test
Copy link
Contributor

@Abacn Abacn Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not be necessary to run a full matrix of every variable (fixed shard, dynamical destination, use copyJobs) which generates 2*2*2 possibilities.

A choice to reduce the complexity is test default + each flag in a test + test all feature added, this reduces the complexity from 2^n to n+2. Also, given the default case (had autoshard) will be tested in #28272 (serve as default), I propose keep the following one:

  • testLoadWithFixedShards
  • testWithAutoShardingAndCopyJobs
  • testDynamicDestinationsWithAutoSharding
  • testDynamicDestinationsWithAutoShardingAndCopyJobs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we're saying that fixed/dynamic sharding is independent from single/multi loads. I think I agree with that.

I mostly agree with that new list, I just think we should do testDynamicDestinationsWithFixedShards instead of testDynamicDestinationsWithAutoSharding because there is a slightly different code path there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. I think we can remove the FILE_LOADS test in #28272 and keep FileLoadsStreamingIT as the hub for streaming loads tests. That way we can keep the other file as BigQueryIOStorageWriteIT for storage write. wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should do testDynamicDestinationsWithFixedShards instead of testDynamicDestinationsWithAutoSharding

Sure that's good

Copy link
Contributor Author

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @Abacn! I reduced test complexity, PTAL

@@ -689,6 +689,10 @@ public static Row toBeamRow(Schema rowSchema, TableSchema bqSchema, TableRow jso
}
}

if (jsonBQValue instanceof byte[] && fieldType.getTypeName() == TypeName.BYTES) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type (bye[]) was previously unsupported so converting a TableRow with byte[] field to Beam Row would throw an UnsupportedOperationException below.

"Checking there is no duplication", expectedBeamRows.size(), actualBeamRows.size());
}

@Test
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we're saying that fixed/dynamic sharding is independent from single/multi loads. I think I agree with that.

I mostly agree with that new list, I just think we should do testDynamicDestinationsWithFixedShards instead of testDynamicDestinationsWithAutoSharding because there is a slightly different code path there

"Checking there is no duplication", expectedBeamRows.size(), actualBeamRows.size());
}

@Test
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. I think we can remove the FILE_LOADS test in #28272 and keep FileLoadsStreamingIT as the hub for streaming loads tests. That way we can keep the other file as BigQueryIOStorageWriteIT for storage write. wdyt?

@ahmedabu98
Copy link
Contributor Author

Run PostCommit_Java_Dataflow

@Abacn
Copy link
Contributor

Abacn commented Sep 12, 2023

P.S. I think we can remove the FILE_LOADS test in #28272 and keep FileLoadsStreamingIT as the hub for streaming loads tests.

Sounds good to me

@ahmedabu98
Copy link
Contributor Author

Most recent Legacy runner post commits passed
Direct Runner tests passed as well

@ahmedabu98
Copy link
Contributor Author

Failing tests are unrelated

@Abacn
Copy link
Contributor

Abacn commented Sep 13, 2023

Postcommit failure was due to quota "Startup of the worker pool in zone us-central1-b failed to bring up any of the desired 1 workers."

Good to check - is the test specified specific zone? As we are bringing up more tests, we should request increasing quota and/or distribute works in different region.

@ahmedabu98
Copy link
Contributor Author

is the test specified specific zone

No specific zone for these tests, they all default to US multi-region

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants