
[Bug]: MongoDB to BigQuery breaking schema change from STRING to JSON #1834

Open · pilis opened this issue Sep 2, 2024 · 7 comments
Labels: bug (Something isn't working), needs triage, p2

Comments

@pilis

pilis commented Sep 2, 2024

Related Template(s)

mongodb_to_bigquery

Template Version

2024-08-20-00_rc00

What happened?

I have been using this template for a long time and have a BigQuery table whose source_data column is of type STRING (the table is appended to by a daily job run). Pull Request #1796 introduces a change that alters the schema from STRING to JSON. This change is not backwards compatible, and I now get the error below when running the job (a minimal schema check that confirms the mismatch is sketched after the log output).

Relevant log output

Error message from worker: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000, reached max retries: 3, last failed job: {
  "configuration" : {
    "jobType" : "LOAD",
    "labels" : {
      "beam_job_id" : "2024-09-01_19_07_45-5426785199031709769"
    },
    "load" : {
      "createDisposition" : "CREATE_IF_NEEDED",
      "destinationTable" : {
        "datasetId" : "MY_DATASET_ID",
        "projectId" : "MY_PROJECT_ID",
        "tableId" : "MY_TABLE_ID"
      },
      "ignoreUnknownValues" : false,
      "sourceFormat" : "NEWLINE_DELIMITED_JSON",
      "useAvroLogicalTypes" : false,
      "writeDisposition" : "WRITE_APPEND"
    }
  },
  "etag" : "R6mh8Mzo3e8NvZKP0av9xw==",
  "id" : "MY_PROJECT_ID:MY_REGION_ID.beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000-6",
  "jobCreationReason" : {
    "code" : "REQUESTED"
  },
  "jobReference" : {
    "jobId" : "beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000-6",
    "location" : "MY_REGION_ID",
    "projectId" : "MY_PROJECT_ID"
  },
  "kind" : "bigquery#job",
  "principal_subject" : "MY_SERVICE_ACCOUNT",
  "selfLink" : "https://bigquery.googleapis.com/bigquery/v2/projects/MY_PROJECT_ID/jobs/beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000-6?location=MY_REGION_ID",
  "statistics" : {
    "creationTime" : "1725243135832",
    "endTime" : "1725243135858",
    "startTime" : "1725243135858"
  },
  "status" : {
    "errorResult" : {
      "message" : "Provided Schema does not match Table MY_PROJECT_ID:MY_DATASET_ID.MY_TABLE_ID. Field source_data has changed type from STRING to JSON",
      "reason" : "invalid"
    },
    "errors" : [ {
      "message" : "Provided Schema does not match Table MY_PROJECT_ID:MY_DATASET_ID.MY_TABLE_ID. Field source_data has changed type from STRING to JSON",
      "reason" : "invalid"
    } ],
    "state" : "DONE"
  },
  "user_email" : "MY_SERVICE_ACCOUNT_ID"
}.
	org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
	org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
	org.apache.beam.fn.harness.FnApiDoFnRunner.finishBundle(FnApiDoFnRunner.java:1776)
	org.apache.beam.fn.harness.data.PTransformFunctionRegistry.lambda$register$0(PTransformFunctionRegistry.java:116)
	org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:560)
	org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:150)
	org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:115)
	java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	org.apache.beam.sdk.util.UnboundedScheduledExecutorService$ScheduledFutureTask.run(UnboundedScheduledExecutorService.java:163)
	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000, reached max retries: 3, last failed job: {
  "configuration" : {
    "jobType" : "LOAD",
    "labels" : {
      "beam_job_id" : "2024-09-01_19_07_45-5426785199031709769"
    },
    "load" : {
      "createDisposition" : "CREATE_IF_NEEDED",
      "destinationTable" : {
        "datasetId" : "MY_DATASET_ID",
        "projectId" : "MY_PROJECT_ID",
        "tableId" : "MY_TABLE_ID"
      },
      "ignoreUnknownValues" : false,
      "sourceFormat" : "NEWLINE_DELIMITED_JSON",
      "useAvroLogicalTypes" : false,
      "writeDisposition" : "WRITE_APPEND"
    }
  },
  "etag" : "R6mh8Mzo3e8NvZKP0av9xw==",
  "id" : "MY_PROJECT_ID:MY_REGION_ID.beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000-6",
  "jobCreationReason" : {
    "code" : "REQUESTED"
  },
  "jobReference" : {
    "jobId" : "beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000-6",
    "location" : "MY_REGION_ID",
    "projectId" : "MY_PROJECT_ID"
  },
  "kind" : "bigquery#job",
  "principal_subject" : "MY_SERVICE_ACCOUNT",
  "selfLink" : "https://bigquery.googleapis.com/bigquery/v2/projects/MY_PROJECT_ID/jobs/beam_bq_job_LOAD_mongodbtobigqueryMY_TABLE_ID_0ed4e5c8caba494390c4f392c04b7746_379378654e40315a5bba6b82b44a8523_00001_00000-6?location=MY_REGION_ID",
  "statistics" : {
    "creationTime" : "1725243135832",
    "endTime" : "1725243135858",
    "startTime" : "1725243135858"
  },
  "status" : {
    "errorResult" : {
      "message" : "Provided Schema does not match Table MY_PROJECT_ID:MY_DATASET_ID.MY_TABLE_ID. Field source_data has changed type from STRING to JSON",
      "reason" : "invalid"
    },
    "errors" : [ {
      "message" : "Provided Schema does not match Table MY_PROJECT_ID:MY_DATASET_ID.MY_TABLE_ID. Field source_data has changed type from STRING to JSON",
      "reason" : "invalid"
    } ],
    "state" : "DONE"
  },
  "user_email" : "MY_SERVICE_ACCOUNT_ID"
}.
	org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:208)
	org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:161)
	org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:389)
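
A quick way to confirm what the error refers to is to read the destination table's schema. A minimal sketch, assuming the google-cloud-bigquery client and the placeholder project/dataset/table names from the redacted log above:

```python
# Minimal schema check (sketch only): print the current type of source_data
# in the destination table. Names are the placeholders used in the log above.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="MY_PROJECT_ID")
table = client.get_table("MY_PROJECT_ID.MY_DATASET_ID.MY_TABLE_ID")

for field in table.schema:
    if field.name == "source_data":
        # Prints "source_data STRING" for tables created by earlier template
        # versions, while the 2024-08-20-00_rc00 template now writes a JSON field.
        print(field.name, field.field_type)
```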
@pilis added the bug (Something isn't working), needs triage, and p2 labels on Sep 2, 2024
@pilis
Author

pilis commented Sep 2, 2024

I am unable to run the previous version of the job, since only the latest template version is published. As a workaround, I will rebuild the previous version of the template and host it on my end.

Link to the previous working image: https://console.cloud.google.com/gcr/images/dataflow-templates/global/2024-08-13-00_rc00/mongodb-to-bigquery
Link to the template spec: https://console.cloud.google.com/storage/browser/_details/dataflow-templates-europe-west3/latest/flex/MongoDB_to_BigQuery;tab=version_history (I cannot download the previous version because object version history is not enabled).

👉 Quick global rollback to restore soft-deleted version
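
A rough sketch of that workaround, assuming a self-hosted copy of the pre-#1796 flex-template spec has already been staged at a placeholder gs://MY_BUCKET path and that application default credentials are available. The parameter names shown are an illustrative subset, not the full list; it uses the Dataflow v1b3 REST API via google-api-python-client rather than the shared latest/flex path:

```python
# Sketch only: launch a pinned, self-hosted template spec instead of the
# shared .../latest/flex/MongoDB_to_BigQuery object. Assumes the spec JSON
# points at the 2024-08-13-00_rc00 container image.
from googleapiclient.discovery import build  # pip install google-api-python-client

dataflow = build("dataflow", "v1b3")
body = {
    "launchParameter": {
        "jobName": "mongodb-to-bigquery-pinned",
        "containerSpecGcsPath": "gs://MY_BUCKET/templates/MongoDB_to_BigQuery_2024-08-13-00_rc00.json",
        "parameters": {
            # Illustrative subset of the template parameters (placeholders).
            "mongoDbUri": "mongodb+srv://...",
            "database": "MY_DB",
            "collection": "MY_COLLECTION",
            "outputTableSpec": "MY_PROJECT_ID:MY_DATASET_ID.MY_TABLE_ID",
        },
    }
}
response = (
    dataflow.projects()
    .locations()
    .flexTemplates()
    .launch(projectId="MY_PROJECT_ID", location="MY_REGION_ID", body=body)
    .execute()
)
print(response["job"]["id"])  # ID of the launched Dataflow job
```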

@turkerakyuz

Changing the data type of the source_data column means that every data pipeline appending to an existing BigQuery table through this template has to be modified. This breaks backward compatibility.

@kabir-aviva

Same issue here. Is there any way this can be prioritized?

@ahmedabu98
Contributor

CC @Polber: is it necessary to fix the source_data column to the JSON type?

@ggprod
Contributor

ggprod commented Sep 17, 2024

@Polber when will the merged fix be deployed for users of the template?

@pcortez

pcortez commented Sep 23, 2024

Hi!
Due to urgency, I had to change the source_data schema to JSON. Now that the change has been reverted in production, is there a way to keep using JSON instead of STRING?

Perhaps a new parameter could be added to the template so users can choose how source_data is parsed.
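
As far as I know, BigQuery does not allow ALTER COLUMN ... SET DATA TYPE between STRING and JSON, so switching an existing table either way means rewriting it. A hedged sketch of the STRING-to-JSON direction with placeholder names (the reverse direction would use TO_JSON_STRING instead of SAFE.PARSE_JSON), using google-cloud-bigquery:

```python
# One-time rewrite of source_data from STRING to JSON (sketch, placeholder names).
# Note: CREATE OR REPLACE TABLE drops partitioning/clustering unless they are
# re-specified, so adjust the DDL to match the existing table definition.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="MY_PROJECT_ID")
sql = """
CREATE OR REPLACE TABLE `MY_PROJECT_ID.MY_DATASET_ID.MY_TABLE_ID` AS
SELECT * REPLACE (SAFE.PARSE_JSON(source_data) AS source_data)
FROM `MY_PROJECT_ID.MY_DATASET_ID.MY_TABLE_ID`
"""
client.query(sql).result()  # blocks until the rewrite job finishes
```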

@ggprod
Contributor

ggprod commented Sep 24, 2024

> @Polber when will the merged fix be deployed for users of the template?

Never mind, I see it was deployed to the GCS bucket (gs://dataflow-templates-us-central1/latest/flex/MongoDB_to_BigQuery) yesterday at 10 am. Thanks!
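
One way to check which image the shared "latest" spec currently points at is to read the spec object itself. A sketch, assuming the bucket object is readable with your credentials and that the spec JSON carries an "image" field, via google-cloud-storage:

```python
# Sketch: inspect the shared "latest" flex-template spec to see which container
# image it references and when the spec object was last overwritten.
import json
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.bucket("dataflow-templates-us-central1")
blob = bucket.get_blob("latest/flex/MongoDB_to_BigQuery")  # fetches object metadata
spec = json.loads(blob.download_as_text())

print(spec.get("image"))  # container image the template launches
print(blob.updated)       # last time the spec was rewritten
```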
