Support Map in BQ for StorageWrites API for Beam Rows #32512

prodriguezdefino · 2024-09-20T06:28:13Z

Currently, the BigQuery table schema utility and the StorageWrites implementation for Beam Rows do not support sending rows whose schema contains properties of type Map or array of Map.

This PR adds that functionality by transforming the Map into a Message type that contains two fields, key and value. The translation respects the types coming from the Row schema and mimics the behavior seen when using TableRows with the BigQueryIO PTransform.
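To make the shape of that translation concrete, here is a minimal plain-Java sketch of the idea: each map entry becomes one (key, value) element of a repeated field. It does not use Beam or protobuf classes; the class `MapAsKeyValueSketch` and its methods are hypothetical names chosen for illustration, not code from this PR.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a plain-Java stand-in for "Map becomes a repeated (key, value) message". */
final class MapAsKeyValueSketch {

  /** Turns each map entry into one (key, value) element, keeping the map's iteration order. */
  static <K, V> List<Map.Entry<K, V>> toKeyValueList(Map<K, V> map) {
    List<Map.Entry<K, V>> entries = new ArrayList<>();
    for (Map.Entry<K, V> e : map.entrySet()) {
      entries.add(new SimpleImmutableEntry<>(e.getKey(), e.getValue()));
    }
    return entries;
  }

  /** Rebuilds a map from the (key, value) list; a repeated key overwrites the earlier value. */
  static <K, V> Map<K, V> fromKeyValueList(List<Map.Entry<K, V>> entries) {
    Map<K, V> map = new LinkedHashMap<>();
    for (Map.Entry<K, V> e : entries) {
      map.put(e.getKey(), e.getValue());
    }
    return map;
  }

  public static void main(String[] args) {
    Map<String, Long> counts = new LinkedHashMap<>();
    counts.put("a", 1L);
    counts.put("b", 2L);
    List<Map.Entry<String, Long>> entries = toKeyValueList(counts);
    System.out.println(entries); // prints [a=1, b=2]
    System.out.println(fromKeyValueList(entries).equals(counts)); // prints true
  }
}
```

In the actual change the key and value types come from the Row schema; the generics here only stand in for that.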

Copied from #22179, which was closed after a long period of inactivity (and I cannot reopen it).


prodriguezdefino · 2024-09-20T06:29:45Z

fixes #23618

@JohnZZGithub FYI

github-actions · 2024-09-20T07:34:35Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

github-actions · 2024-09-20T19:34:45Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.
R: @johnjcasey for label io.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

JohnZZGithub · 2024-09-26T16:42:00Z

The patch LGTM. And we tested it on our GCP env.

damccorm · 2024-10-03T15:38:29Z

@robertwb @johnjcasey could you please take a look at this one?

github-actions · 2024-10-11T12:14:06Z

Reminder, please take a look at this pr: @robertwb @johnjcasey

github-actions · 2024-10-15T12:14:26Z

Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @chamikaramj for label io.


johnjcasey · 2024-10-15T13:52:10Z

...oud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BeamRowToStorageApiProto.java

+        @Nullable FieldType keyType = field.getType().getMapKeyType();
+        @Nullable FieldType valueType = field.getType().getMapValueType();
+        if (keyType == null || valueType == null) {
+          throw new RuntimeException("Unexpected null element type!");


Can you add some context to this exception indicating that the error occurred while converting to the Storage API proto? That would help users diagnose their pipelines without also needing to know the Beam code.

johnjcasey · 2024-10-15T13:55:54Z

...oud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BeamRowToStorageApiProto.java

-        return list.stream()
-            .map(v -> toProtoValue(fieldDescriptor, arrayElementType, v))
-            .collect(Collectors.toList());
+        boolean shouldFlatMap =


This looks like it supports one level of nested collection. Is that an intended limit?

Also, can you add a comment to that effect?

Yes I will add the comment.

Handling recursive collection flattening would help cover the other collection types. This PR focuses on the special case of having a MAP or an ARRAY as the type of a field in a Row.

Removed the flattening of nested container types.

johnjcasey · 2024-10-15T13:57:53Z

...o/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java

@@ -403,7 +403,7 @@ private static List<TableFieldSchema> toTableFieldSchema(Schema schema) {
      }
      if (type.getTypeName().isCollectionType()) {
        type = Preconditions.checkArgumentNotNull(type.getCollectionElementType());
-        if (type.getTypeName().isCollectionType() || type.getTypeName().isMapType()) {
+        if (type.getTypeName().isCollectionType() && !type.getTypeName().isMapType()) {


BQ supports arrays of maps, but not arrays of other collections?

BigQuery only supports arrays of structs and scalar types, not nested collection/map types.

This change enables storing a simple map or an array of maps (after flattening them) as a repeated struct field.

A separate change may help to support flattening other collection types (arrays of arrays, for example).

Removed this change.
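As a rough illustration of the check this thread ends up keeping (array element types that are themselves collections or maps are rejected), the sketch below uses a hypothetical `Kind` enum in place of Beam's schema type names. The class, enum, method names, and error text are assumptions made for the example, not the code in `BigQueryUtils`.

```java
/** Illustrative only: rejects array element types that are themselves containers. */
final class NestedContainerCheckSketch {

  /** Hypothetical stand-in for a schema type name. */
  enum Kind { SCALAR, ROW, ARRAY, ITERABLE, MAP }

  static boolean isCollection(Kind kind) {
    return kind == Kind.ARRAY || kind == Kind.ITERABLE;
  }

  /** Throws if the element type of an array field is a collection or a map. */
  static void checkArrayElement(String fieldName, Kind elementKind) {
    if (isCollection(elementKind) || elementKind == Kind.MAP) {
      throw new IllegalArgumentException(
          "Field '" + fieldName + "': arrays of " + elementKind
              + " cannot be converted to a BigQuery schema");
    }
  }

  public static void main(String[] args) {
    checkArrayElement("tags", Kind.SCALAR); // accepted
    checkArrayElement("items", Kind.ROW);   // accepted: array of structs
    try {
      checkArrayElement("attrs", Kind.MAP);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```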

robertwb

Thanks for this contribution. Supporting maps will be very nice.

robertwb · 2024-10-15T15:50:51Z

...oud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BeamRowToStorageApiProto.java

+        @Nullable FieldType keyType = field.getType().getMapKeyType();
+        @Nullable FieldType valueType = field.getType().getMapValueType();
+        if (keyType == null || valueType == null) {
+          throw new RuntimeException("Unexpected null element type!");


It'd be more informative if the key and value raised distinct errors.
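One possible way to address both review requests on this check (context about the Storage API proto conversion, and distinct errors for key and value) is sketched below. Types are represented as plain strings rather than Beam `FieldType`, and the class name, method name, and message wording are assumptions for illustration only.

```java
/** Illustrative only: separate, contextual errors for a missing map key type or value type. */
final class MapTypeCheckSketch {

  /** Validates that both type descriptors are present; strings stand in for schema types. */
  static void checkMapTypes(String fieldName, String keyType, String valueType) {
    if (keyType == null) {
      throw new IllegalArgumentException(
          "Cannot convert field '" + fieldName
              + "' to a Storage API proto: the map key type is null");
    }
    if (valueType == null) {
      throw new IllegalArgumentException(
          "Cannot convert field '" + fieldName
              + "' to a Storage API proto: the map value type is null");
    }
  }

  public static void main(String[] args) {
    checkMapTypes("labels", "STRING", "INT64"); // passes silently
    try {
      checkMapTypes("labels", "STRING", null);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```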

robertwb · 2024-10-15T15:53:13Z

...oud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BeamRowToStorageApiProto.java

@@ -272,6 +286,8 @@ private static Object messageValueFromRowValue(
    if (value == null) {
      if (fieldDescriptor.isOptional()) {
        return null;
+      } else if (fieldDescriptor.isRepeated()) {


Currently we distinguish between the empty list and a missing value. I think we want to keep that distinction.

Keeping the return value as null.

It seems the previous code fails when a schema has a simple array of strings as a field and that field is marked as nullable. Adding this check here also fixes that case. I am adding a test to cover those particularities.

Huh, it seems master now has the same code as initially proposed by this PR (see here). For my changes this was not necessary, but other things may have been modified.

I will use the Collections.emptyList() reference from there.
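The null handling discussed in this thread could look roughly like the sketch below, assuming the order shown in the diff hunk: an optional field keeps null, a repeated field falls back to an empty list, and anything else is an error. The boolean flags stand in for the field descriptor's `isOptional()` and `isRepeated()`; the class name, method name, and message text are hypothetical.

```java
import java.util.Collections;

/** Illustrative only: what to emit when a Row value is null, by field cardinality. */
final class NullValueSketch {

  static Object valueForNull(String fieldName, boolean optional, boolean repeated) {
    if (optional) {
      return null; // a missing optional value stays missing
    }
    if (repeated) {
      return Collections.emptyList(); // a repeated field cannot be null, so send no elements
    }
    throw new IllegalArgumentException(
        "Received a null value for non-nullable field '" + fieldName + "'");
  }

  public static void main(String[] args) {
    System.out.println(valueForNull("nickname", true, false)); // prints null
    System.out.println(valueForNull("tags", false, true));     // prints []
  }
}
```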

robertwb · 2024-10-15T15:59:14Z

...oud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BeamRowToStorageApiProto.java

+            list.stream().map(v -> toProtoValue(fieldDescriptor, arrayElementType, v));
+
+        if (shouldFlatMap) {
+          valueStream = valueStream.flatMap(vs -> ((List) vs).stream());


Why are we introducing this flattening here?

See the comment here: #22179 (comment)

Also added a comment on the code to explain it.

Removed the flattening of nested container types and added a missing check for arrays.


robertwb · 2024-10-16T19:18:01Z

...oud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BeamRowToStorageApiProto.java

-            .collect(Collectors.toList());
+        // We currently only support maps as non-row or non-scalar element types
+        // given that BigQuery does not support nested arrays. If the element type is of map type
+        // we should flatten it given how is being translated (as a list of proto(key, value).


IMHO, if BQ doesn't support arrays of arrays (or arrays of maps), we should reject such schemas rather than implicitly flattening them (which is lossy and could be unexpected from a user's perspective).

As you correctly stated, BQ does not support those nested container types, but it does not support maps either.

Currently, as a Beam user translating formats for BQ ingestion (from Avro, Thrift, or others that support MAP natively), I need to inspect the schemas or IDLs to find out whether a MAP or a nested container is present and translate it into something that works for BQ. This adds complexity to the pipelines and can hurt overall performance (because of the packing and unpacking potentially needed to translate data from the original format).

This change aims to aid that translation for both the MAP type and ARRAY/ITERABLE of MAPs, which are both supported in Beam Rows (used for simplicity after translating the original format) but not in the BQ storage write proto translation. For MAP we are making a structural decision for the translation, and for ARRAY/ITERABLE of MAPs as well.

I agree that we lose key functionality of the original structures with this translation decision (indexing and key uniqueness, for starters), but I think improved documentation can alert users to these caveats (which do not affect existing pipelines, given that this is a net-new feature).

For Maps it's fine as there's no surprise on the user side (e.g. seeing key-value records) plus it can be losslessly translated back if read as a MAP type. But the same cannot be said of the flattening that's done.

If we are concerned about convenience for users, a separate explicit transform that flattens nested structures could be provided (which would be the identity if no unnesting is required). This should have comparable performance to doing it as part of the write, and its cost is likely negligible compared to the cost of actually talking to the services in question.
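The lossiness concern can be illustrated with a small plain-Java sketch: two different lists of maps flatten to the same list of key-value records, so the original grouping cannot be recovered. The class `FlattenLossSketch` and its helpers are hypothetical and only demonstrate the argument; they are not the flattening code that was removed from this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: flattening a list of maps discards the boundaries between the maps. */
final class FlattenLossSketch {

  /** Concatenates the entries of every map into one flat list of "key=value" records. */
  static List<String> flatten(List<Map<String, String>> maps) {
    List<String> out = new ArrayList<>();
    for (Map<String, String> m : maps) {
      for (Map.Entry<String, String> e : m.entrySet()) {
        out.add(e.getKey() + "=" + e.getValue());
      }
    }
    return out;
  }

  /** Small helper that builds an insertion-ordered map from alternating keys and values. */
  static Map<String, String> mapOf(String... keysAndValues) {
    Map<String, String> m = new LinkedHashMap<>();
    for (int i = 0; i + 1 < keysAndValues.length; i += 2) {
      m.put(keysAndValues[i], keysAndValues[i + 1]);
    }
    return m;
  }

  public static void main(String[] args) {
    List<Map<String, String>> twoMaps = new ArrayList<>();
    twoMaps.add(mapOf("a", "1"));
    twoMaps.add(mapOf("b", "2"));
    List<Map<String, String>> oneMap = Collections.singletonList(mapOf("a", "1", "b", "2"));

    System.out.println(flatten(twoMaps));       // prints [a=1, b=2]
    System.out.println(flatten(oneMap));        // prints [a=1, b=2]
    System.out.println(twoMaps.equals(oneMap)); // prints false: the inputs differ
  }
}
```

A single map, by contrast, can be rebuilt from its key-value records, which is why the thread treats the Map translation as acceptable.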

Sounds good, I will remove the flattening here. FYI @JohnZZGithub.

I noticed that we are not checking for map types here. I will add the check and improve the messaging so users understand what's going on.

I will also try to work, in a separate PR, on a more general flattening configuration, probably in the form of a lambda, so we can delegate to users the decision of what to do when a nested container type is encountered during translation.


implemented support for maps and array of maps for BigQuery StorageWrite API for Beam Row

b6afd4a

github-actions bot added java io gcp labels Sep 20, 2024

prodriguezdefino marked this pull request as ready for review September 20, 2024 06:30

fix cdc test

f6ec654

github-actions bot added the Next Action: Reviewers label Sep 20, 2024

github-actions bot added the slow-review label Oct 11, 2024

github-actions bot removed the slow-review label Oct 15, 2024

johnjcasey requested changes Oct 15, 2024


robertwb reviewed Oct 15, 2024


prodriguezdefino added 2 commits October 15, 2024 14:17

addressing comments and adding test for arrays and maps particularities.

02cf883

merging from master

479d8bb

prodriguezdefino requested a review from johnjcasey October 15, 2024 23:07

prodriguezdefino added 2 commits October 16, 2024 11:48

including tests to validate behavior for multimaps and iterables of maps as well.

df8f7c2

Merge branch 'master' into avro_map_and_arrayofmap_bq_storagewrites

c59db8f

robertwb reviewed Oct 16, 2024


removing the flattening of nested container types and adding a check for array<map> not being supported

743698a

prodriguezdefino changed the title ~~Support Map and Arrays of Maps in BQ for StorageWrites API for Beam Rows~~ Support Map in BQ for StorageWrites API for Beam Rows Oct 18, 2024

prodriguezdefino requested a review from robertwb October 18, 2024 17:34

robertwb approved these changes Oct 22, 2024


robertwb merged commit 4f4853e into apache:master Oct 22, 2024
18 checks passed
