BigQueryIO.read(SerializableFunction): Collect records that could not be parsed into the custom-typed object into a PCollection of TableRows #20704

damccorm · 2022-06-04T19:02:36Z

Just as org.apache.beam.sdk.io.gcp.bigquery.WriteResult.getFailedInserts() allows a user to collect failed writes for downstream processing (e.g., sinking the records into some kind of dead-letter store), could the results of a BigQueryIO.read(SerializableFunction) be collected in the same way? This would let a user access the TableRows that the provided function could not parse, for downstream processing (e.g., some kind of dead-letter handling).

In our use case, all data loaded into our Apache Beam pipeline must meet a specified schema, where certain fields are required to be non-null. It would be ideal to collect records that do not meet the schema and output them to some kind of dead-letter store.

Our current implementation requires us to use the slower BigQueryIO.readTableRows() and then attempt, in a subsequent transform, to parse these TableRows into a custom-typed object, outputting any failures to a side output for downstream processing. This isn't incredibly cumbersome, but it would be a nice feature of the connector itself.
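
For illustration, the parse-or-divert step of that workaround can be modeled without Beam. This is a minimal plain-Java sketch, not Beam API: `DeadLetterSplit`, `User`, `SplitResult`, `parseUser`, and `parseWithDeadLetters` are hypothetical names, the required field `id` is an assumption, and a `Map<String, Object>` stands in for a TableRow. In an actual pipeline the two lists would correspond to the main output and the side output of the parsing transform.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class DeadLetterSplit {

    /** Hypothetical custom type; "id" is assumed to be the required, non-null field. */
    public static final class User {
        public final String id;
        public final String name;

        public User(String id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Parsed values plus the raw rows that the parse function rejected. */
    public static final class SplitResult<T> {
        public final List<T> parsed = new ArrayList<>();
        public final List<Map<String, Object>> failed = new ArrayList<>();
    }

    /** Hypothetical parse function: throws when the required field is missing or null. */
    public static User parseUser(Map<String, Object> row) {
        Object id = row.get("id");
        if (id == null) {
            throw new IllegalArgumentException("required field 'id' is null");
        }
        Object name = row.get("name");
        return new User(id.toString(), name == null ? null : name.toString());
    }

    /** Applies parseFn to every row; rows that make it throw are kept unmodified in 'failed'. */
    public static <T> SplitResult<T> parseWithDeadLetters(
            List<Map<String, Object>> rows, Function<Map<String, Object>, T> parseFn) {
        SplitResult<T> result = new SplitResult<>();
        for (Map<String, Object> row : rows) {
            try {
                result.parsed.add(parseFn.apply(row));
            } catch (RuntimeException e) {
                // Keep the raw row so it can be sent to a dead-letter store later.
                result.failed.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> valid = new HashMap<>();
        valid.put("id", "42");
        valid.put("name", "Ada");

        Map<String, Object> invalid = new HashMap<>();
        invalid.put("id", null); // violates the non-null requirement
        invalid.put("name", "Bob");

        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(valid);
        rows.add(invalid);

        SplitResult<User> result = parseWithDeadLetters(rows, DeadLetterSplit::parseUser);
        System.out.println("parsed: " + result.parsed.size()
                + ", dead letters: " + result.failed.size());
    }
}
```

Keeping the original row, rather than only the exception message, is what makes reprocessing or inspection possible downstream; the request in this issue is for the connector to expose that second collection directly.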

Imported from Jira BEAM-11919. Original Jira may contain additional context.
Reported by: jacquelynwax.

kennknowles · 2022-12-02T00:28:26Z

@johnjcasey I'm triaging issues - is this still relevant?

RustedBones · 2024-09-04T09:12:00Z

Done in #30081?

damccorm added awaiting triage io-java-gcp P3 wish labels Jun 4, 2022

damccorm added gcp io java and removed io-java-gcp labels Jun 16, 2022

kennknowles added new feature and removed awaiting triage labels Dec 2, 2022

damccorm closed this as completed Sep 4, 2024

github-actions bot added this to the 2.60.0 Release milestone Sep 4, 2024
