[Bug]: [Java BQ FILE_LOADS] When streaming to dynamic destinations with copy jobs and CREATE_IF_NEEDED, only the first destination's table is created #28309

Closed
2 of 15 tasks
ahmedabu98 opened this issue Sep 5, 2023 · 4 comments

Comments

@ahmedabu98
Contributor

ahmedabu98 commented Sep 5, 2023

What happened?

I was testing FILE_LOADS streaming writes and found that when dynamic destinations are set, copy jobs are used (i.e. large data), and CREATE_IF_NEEDED is set, only the first table is created. For example, if I'm writing to two tables A and B, which table gets created depends on a race: whichever copy job the pipeline sees first wins. If the copy job to table A is performed first, then table A will be created and all subsequent copy jobs to table B will fail with an error similar to the following:

WARNING: Load job beam_bq_job_COPY_testpipelineahmedabualsaud0905154941c26eb941_17a94e2694554455aa31cca9f9389b49_4e2479b4160d04b56a8075645f4974e1_00003-0 failed, will retry: {
  "errorResult" : {
    "message" : "Not found: Table <project>:<dataset>.mytable_B",
    "reason" : "notFound"
  },
  "errors" : [ {
    "message" : "Not found: Table <project>:<dataset>.mytable_B",
    "reason" : "notFound"
  } ],
  "state" : "DONE"
}. Next job id beam_bq_job_COPY_testpipelineahmedabualsaud0905154941c26eb941_17a94e2694554455aa31cca9f9389b49_4e2479b4160d04b56a8075645f4974e1_00003-1

What we would expect instead is for all tables to be created.

P.S. I'm not seeing this behavior in batch mode.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@ahmedabu98
Contributor Author

ahmedabu98 commented Sep 5, 2023

This behavior is likely due to these lines:

boolean isFirstPane =
    firstTempTable != null && firstTempTable.isFirstPane() && c.pane().isFirst();
WriteDisposition writeDisposition =
    isFirstPane ? firstPaneWriteDisposition : WriteDisposition.WRITE_APPEND;
CreateDisposition createDisposition =
    isFirstPane ? firstPaneCreateDisposition : CreateDisposition.CREATE_NEVER;

The general idea is that after the first pane, we set appropriate create and write dispositions so that subsequent jobs don't overwrite previous data. However, in streaming, c.pane().isFirst() is only true for the first copy job. Subsequent copy jobs seem to appear in different panes (maybe because of this GBK). As a result, Beam sets the CREATE_NEVER disposition on everything after the first copy job, even if it's the first job for a particular destination. BigQuery then tries to copy into a non-existent table and, instead of creating the table, throws the error mentioned above.
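
To make the failure mode concrete, here is a minimal plain-Java sketch with no Beam dependency. Everything in it (DispositionSketch, CopyJob, paneBased, perDestination) is a hypothetical name made up for illustration, and the firstTempTable check from the snippet above is left out for brevity. It compares a disposition chosen from the pane flag alone with one chosen by tracking which destinations have already been seen. It is not the change made in the actual fix, and an in-memory set is only a stand-in for whatever state a real pipeline would need.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these types come from Beam.
class DispositionSketch {
  enum CreateDisposition { CREATE_IF_NEEDED, CREATE_NEVER }

  // Stand-in for one copy job: its destination table and whether its pane is the first pane.
  static final class CopyJob {
    final String destination;
    final boolean paneIsFirst;

    CopyJob(String destination, boolean paneIsFirst) {
      this.destination = destination;
      this.paneIsFirst = paneIsFirst;
    }
  }

  // Simplified version of the quoted logic: only the pane flag decides the disposition.
  static CreateDisposition paneBased(CopyJob job, CreateDisposition firstPaneDisposition) {
    return job.paneIsFirst ? firstPaneDisposition : CreateDisposition.CREATE_NEVER;
  }

  // Alternative: the first job seen for each destination keeps the user's disposition.
  static List<CreateDisposition> perDestination(
      List<CopyJob> jobs, CreateDisposition firstPaneDisposition) {
    Set<String> seen = new HashSet<>();
    List<CreateDisposition> result = new ArrayList<>();
    for (CopyJob job : jobs) {
      // Set.add returns true only the first time a destination is inserted.
      boolean firstForDestination = seen.add(job.destination);
      result.add(firstForDestination ? firstPaneDisposition : CreateDisposition.CREATE_NEVER);
    }
    return result;
  }

  public static void main(String[] args) {
    // Assumed ordering: only the very first copy job arrives in a first pane.
    List<CopyJob> jobs = List.of(
        new CopyJob("mytable_A", true),
        new CopyJob("mytable_B", false),
        new CopyJob("mytable_A", false));

    List<CreateDisposition> tracked =
        perDestination(jobs, CreateDisposition.CREATE_IF_NEEDED);
    for (int i = 0; i < jobs.size(); i++) {
      CopyJob job = jobs.get(i);
      System.out.println(job.destination
          + " paneBased=" + paneBased(job, CreateDisposition.CREATE_IF_NEEDED)
          + " perDestination=" + tracked.get(i));
    }
  }
}
```

With this assumed ordering, the pane-based choice gives the first job for mytable_B CREATE_NEVER, which matches the notFound error above, while the per-destination choice gives it CREATE_IF_NEEDED.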

@Abacn
Contributor

Abacn commented Sep 5, 2023

This looks identical to https://issues.apache.org/jira/browse/BEAM-7195. Is the fix in #14238 no longer effective?

@ahmedabu98
Contributor Author

ahmedabu98 commented Sep 5, 2023

I don't think a test was ever created for the changes in that PR, so I can't tell.

But I see that the solution in that PR was not fully extended to the multiple partitions path. I can try implementing it there as well. Thanks @Abacn!

@ahmedabu98 ahmedabu98 changed the title [Bug]: [Java BQ FILE_LOADS] When streaming to dynamic destinations with copy jobs and CREATE_IF_NEEDED, only the first table is created [Bug]: [Java BQ FILE_LOADS] When streaming to dynamic destinations with copy jobs and CREATE_IF_NEEDED, only the first destination's table is created Sep 15, 2023
@ahmedabu98
Contributor Author

Fixed in #28312

@github-actions github-actions bot added this to the 2.51.0 Release milestone Sep 19, 2023