Use new BigQuery sink and remove num_bigquery_write_shards flag usage. #499

Open · tneymanov wants to merge 5 commits into master from beam_sink

Conversation

tneymanov (Collaborator)

Modify the BigQuery writing mechanism to use the new WriteToBigQuery PTransform, and deprecate the num_bigquery_write_shards flag.

Previously, issue #199 forced us to use a hack to shard the variants before they are written to BigQuery, which negatively affected processing speed. With the new sink in place, the flag is no longer needed.

The flag will remain as a dummy so as not to break any current callers.
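For illustration only, a minimal sketch of what writing through the new sink looks like; this is not the exact code in this PR, and names such as write_variants_to_bq are placeholders:

import apache_beam as beam

def write_variants_to_bq(variant_rows, output_table, schema, append=False):
  # variant_rows is a PCollection of dicts that already match the BQ schema.
  return variant_rows | 'VariantToBigQuery' >> beam.io.WriteToBigQuery(
      output_table,
      schema=schema,
      create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
      write_disposition=(
          beam.io.BigQueryDisposition.WRITE_APPEND
          if append
          else beam.io.BigQueryDisposition.WRITE_TRUNCATE),
      # FILE_LOADS stages rows in temp files and runs BigQuery load jobs, so
      # the manual sharding controlled by num_bigquery_write_shards is no
      # longer needed.
      method=beam.io.WriteToBigQuery.Method.FILE_LOADS)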

@coveralls commented Jun 25, 2019

Pull Request Test Coverage Report for Build 1767

  • 0 of 2 (0.0%) changed or added relevant lines in 2 files are covered.
  • 2 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.07%) to 89.178%

Files missing coverage (covered / changed or added lines):
  • gcp_variant_transforms/pipeline_common.py: 0 of 1 (0.0%)
  • gcp_variant_transforms/transforms/variant_to_bigquery.py: 0 of 1 (0.0%)
Files with coverage reduction (new missed lines):
  • gcp_variant_transforms/pipeline_common.py: 1 (71.63%)
  • gcp_variant_transforms/vcf_to_bq.py: 1 (33.02%)
Totals:
  • Change from base Build 1758: +0.07%
  • Covered lines: 7721
  • Relevant lines: 8658

💛 - Coveralls

@@ -173,7 +173,8 @@ def add_arguments(self, parser):
     parser.add_argument(
         '--num_bigquery_write_shards',
         type=int, default=1,
-        help=('Before writing the final result to output BigQuery, the data is '
+        help=('Note: This flag is now deprecated and should not be used! '

Can we nuke all the help text and just say "This flag is deprecated and may be removed in future releases" ?

Contributor

Can we nuke all the help text and just say "This flag is deprecated and may be removed in future releases" ?

+1

Collaborator Author

Done.

operations require "shuffling" the data (i.e. redistributing the data among
workers), which require significant disk I/O.
As a result, we recommend using SSDs if [merging](variant_merge.md) is enabled:
these operations require "shuffling" the data (i.e. redistributing the data
Contributor

nit: It is not 'these' any more.

Collaborator Author

Done.

beam.io.BigQueryDisposition.WRITE_APPEND
if self._append
else beam.io.BigQueryDisposition.WRITE_TRUNCATE),
method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS))
Contributor

Why did you choose to use streaming here? It costs extra money.

Collaborator Author

Done.

@tneymanov (Collaborator Author) left a comment

Addressed the comments.

Also enabled the 'use_beam_bq_sink' flag; using it, I was able to run the new thousand genomes dataset without receiving the BQ error.
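(For context: use_beam_bq_sink is a Beam experiment flag. A minimal sketch of enabling it programmatically, assuming the standard Beam pipeline options API; passing --experiments use_beam_bq_sink on the command line works as well:)

from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions

# Hypothetical args; in vcf_to_bq these come from the command line.
pipeline_args = ['--runner=DirectRunner']
options = PipelineOptions(pipeline_args)
# Route BigQuery writes through the new (still experimental) BigQuery sink.
options.view_as(DebugOptions).add_experiment('use_beam_bq_sink')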

@allieychen (Contributor) left a comment

LGTM. Is the performance of the BQ sink similar to the Beam sink? Also, use_beam_bq_sink is an experimental flag; I don't know whether there is any known downside, or what the Beam team's plan for this flag is. It might be worth talking to them.

@tneymanov (Collaborator Author)

For the new 1k genomes dataset it finished in 47 min 19 sec with 3,712 cores. Without it, it takes more than an hour.

@samanvp (Member) left a comment

Thanks Tural,
Please also remove transforms/limit_write.py and transforms/limit_write_test.py

@@ -29,7 +28,6 @@
 from gcp_variant_transforms.libs import processed_variant
 from gcp_variant_transforms.libs import vcf_field_conflict_resolver
 from gcp_variant_transforms.libs.variant_merge import variant_merge_strategy  # pylint: disable=unused-import
-from gcp_variant_transforms.transforms import limit_write


# TODO(samanvp): remove this hack when BQ custom sink is added to Python SDK,
Member

Please also remove the TODO and the const under it.

@tneymanov force-pushed the beam_sink branch 2 times, most recently from 482151b to 4d4481c on January 13, 2020 15:18
@tneymanov (Collaborator Author)

Addressed Saman's comments from July.

@samanvp (Member) left a comment

Thanks Tural, please address the comments.
Once all tests are passing, we can submit this PR.

setup.py Outdated
@@ -42,8 +42,7 @@
# Nucleus needs uptodate protocol buffer compiler (protoc).
'protobuf>=3.6.1',
'mmh3<2.6',
# Refer to issue #528
'google-cloud-storage<1.23.0',
Member

We still need to pin to 1.22 due to #528

Collaborator Author

Done.

@@ -383,7 +383,8 @@ def _run_annotation_pipeline(known_args, pipeline_args):

 def _create_sample_info_table(pipeline,  # type: beam.Pipeline
                               pipeline_mode,  # type: PipelineModes
-                              known_args,  # type: argparse.Namespace
+                              known_args,  # type: argparse.Namespace,
+                              pipeline_args,  # type: List[str]
Member

pipeline_args is not used anywhere in this method.

Collaborator Author

I was going to use them to create temp_location/_directory for SampleInfoToBigQuery, but I guess I forgot and there are no tests to catch it...

Removed it, since I can just pass temp_directory from the parent.

@@ -480,6 +480,8 @@ def run(argv=None):
     num_partitions = 1

   if known_args.output_table:
+    options = pipeline_options.PipelineOptions(pipeline_args)
Member

This variable is used only once in the next line. Is it possible to combine these two lines?

Collaborator Author

Inlined even further to get temp_directory, which is used in both of the following calls. Also added a check to prevent using BQ output (since there is also an Avro one) without temp_directory, as the new sink requires it. The check is at the beginning so that the customer doesn't waste compute resources only to be shut down at the end.
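(A rough sketch of the inlining described here, assuming the standard Beam options API; the helper name is illustrative, not the exact code in this PR:)

from apache_beam.options import pipeline_options

def _get_temp_directory(pipeline_args):
  # Read --temp_location once from the raw pipeline args so it can be reused
  # by both of the BigQuery-writing calls that follow.
  return pipeline_options.PipelineOptions(pipeline_args).view_as(
      pipeline_options.GoogleCloudOptions).temp_location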

@tneymanov (Collaborator Author) left a comment

All but the 1k genomes test have passed (that one takes a bit longer).

Also modified the sample table code a bit, as apparently it was broken. I presume your follow-up PR was going to fix it, but I didn't want broken code to be in the last release for Python 2.7. PTAL.

@samanvp (Member) left a comment

Thanks Tural, after addressing these comments we should be able to land this PR.

I have 2 main concerns:

  • What will happen if the pipeline is in DirectRunner mode? Does the new sink still work as expected? Do we need to set the temp_location on GCS, or does it have to be a local directory on the machine that runs the pipeline? Please make sure we check this scenario thoroughly.
  • We need to run the code on a large input, for example 1000Genome. If you remember, that's how we found out this new sink had a bug and dropped some of the output rows.

Thanks!

-        'It is recommended to use 20 for loading large inputs without '
-        'merging. Use a smaller value (2 or 3) if both merging and '
-        'optimize_for_large_inputs are enabled.'))
+        help=('This flag is deprecated and may be removed in future releases.'))
Member

...and will be removed in the next release.

Collaborator Author

Done.

Comment on lines 409 to 410
  if known_args.output_table and '--temp_location' not in pipeline_args:
    raise ValueError('--temp_location is required for BigQuery imports.')
Member

This check seems out of place in this module; could we move it somewhere else?
How about this:
We can conduct this check in pipeline_common.parse_args, perhaps in _raise_error_on_invalid_flags, by adding pipeline_args as its second input argument. We should check that temp_directory is set, is not null, and is a valid GCS bucket.
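(A minimal sketch of the suggested check, assuming it lives in pipeline_common and that _raise_error_on_invalid_flags grows a pipeline_args parameter; details here are illustrative, not the final code:)

from apache_beam.options import pipeline_options

def _raise_error_on_invalid_flags(known_args, pipeline_args):
  # ... existing known_args validation would stay here ...
  if getattr(known_args, 'output_table', None):
    temp_location = pipeline_options.PipelineOptions(pipeline_args).view_as(
        pipeline_options.GoogleCloudOptions).temp_location
    if not temp_location or not temp_location.startswith('gs://'):
      raise ValueError('--temp_location must be set to a valid GCS path when '
                       'writing to BigQuery.')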

Collaborator Author

Done.

Comment on lines 488 to 489
  if not temp_directory:
    raise ValueError('--temp_location must be set when writing to BigQuery.')
Member

If we conduct the check as I mentioned above, this check can be removed.

Collaborator Author

Oops, artifact. Removed.

if self._append
else beam.io.BigQueryDisposition.WRITE_TRUNCATE),
method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
custom_gcs_temp_location=self._temp_location))
Member

It seems we don't need to set this argument as long as we have it in pipeline_args:
https://github.com/apache/beam/blob/7bea94c59d1ead24659e09b0e467beeb82f4cadd/sdks/python/apache_beam/io/gcp/bigquery.py#L1261

custom_gcs_temp_location (str): A GCS location to store files to be used for file loads into BigQuery. By default, this will use the pipeline's temp_location, but for pipelines whose temp_location is not appropriate for BQ File Loads, users should pass a specific one.

Could you please test this and make sure it will work without setting this arg? If that's the case, there is no need to pass temp_location to this module or to sample_info_to_bigquery.py.

Collaborator Author

Tested, seems to be the case... removed manual selection of temp_dir.

@samanvp (Member) left a comment

Please make sure all the comments are addressed before merging this PR.

@@ -1,23 +0,0 @@
[
Member

Why is this file deleted?

Collaborator Author

We are getting a 'non-homogenized' list error from this test, as I told you offline.

Seems like the old sink was fixing it on its own. This test will be reintroduced in the PySam PR.

@tneymanov force-pushed the beam_sink branch 2 times, most recently from ad9582b to 1ed2f38 on January 14, 2020 21:59
@tneymanov (Collaborator Author) left a comment

Thanks, Saman.

@samanvp (Member) commented Jan 14, 2020

Thanks Tural.
Please make sure we add the removed test back in the next PR.
Also, this PR shouldn't be merged unless we get a successful 1000Genome run without any missing rows.
