
[Bug]: Python WriteToBigtable get stuck for large jobs due to client dead lock #28715

Closed
2 of 16 tasks
Domlenart opened this issue Sep 28, 2023 · 20 comments
Labels
bug · dataflow · done & done (Issue has been reviewed after it was closed for verification, followups, etc.) · P1 · python

Comments

@Domlenart

Domlenart commented Sep 28, 2023

What happened?

Hello,

We are running Dataflow Python jobs using Beam 2.49.0, starting them from a notebook using the functionality described here. By the way, this example crashes on the Beam 2.50.0 notebook kernel; I reported that problem to our Google support. Let me know if it is of interest and I will file a separate issue here.

Problem description:

We have a very simple pipeline that reads data with ReadFromBigQuery, applies two beam.Map operations to clean the data and transform it into google.cloud.bigtable.row.DirectRow objects, and then writes it out with WriteToBigTable.

We are testing the performance of BigTable HDD- vs SSD-based instances, so we wanted to run jobs that insert 10kk (10 million) and 100kk (100 million) rows.

Unfortunately, the 10kk job that was writing to the HDD instance got stuck after writing 9,999,567 rows.
[screenshot]
As you can see in the screenshot, the job scaled up to about 500 workers and wrote most of the records in ~20 minutes; it then scaled down to 2 workers and made no progress for ~18 hours. I canceled the job manually at that point.

After rerunning, the job ran to completion in 20 minutes.

[screenshot]

Today, I started two more jobs, each meant to write 100kk rows to BigTable (one to an HDD-based and the other to an SSD-based instance). Both got stuck near completion. Here are some details about one of those jobs:
[screenshots]

One thing I noticed in all of those jobs is that "stragglers" are detected.
[screenshot]

However, the reason why they are straggling is undetermined:

[screenshot]

Repro code:

import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
from apache_beam.runners import DataflowRunner

from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.io.gcp.bigtableio import WriteToBigTable

from google.cloud.bigtable import row

import datetime

import google.auth  # needed for google.auth.default() below

from typing import Dict, Any, Tuple, List


def to_bt_row(beam_row: Tuple[str, Dict[str, Any]]) -> row.DirectRow:
    """
    Creates a BigTable row from a standard Dataflow row: a (key, dict) tuple.
    The key is used as the BigTable row key and the dict keys are used as BigTable column names.
    The dict values are used as the column values.

    To keep it simple:
    - all columns are assigned to a column family called "default"
    - the cell timestamp is set to the current time
    """
    # Local imports so the function is self-contained when serialized by Beam.
    import datetime
    from google.cloud.bigtable import row as row_

    (key, values) = beam_row
    bt_row = row_.DirectRow(row_key=key)
    for k, v in values.items():
        bt_row.set_cell(
            "default",
            k.encode(),
            str(v).encode(),
            datetime.datetime.now()
        )
    return bt_row

def set_device_id_as_key(row: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """
    Given a dict, convert it to a two-element tuple.
    The first element of the tuple is the original dict's value under the "device_id" key.
    The second element is the original dict without the "device_id" key.
    """
    k = row.pop("device_id")
    return k, row

def insert_data(n: int, source_bq_table: str, instance: str, destination_table: str, jobname="test_job"):
    options = pipeline_options.PipelineOptions(
        flags={},
        job_name=jobname
    )
    _, options.view_as(GoogleCloudOptions).project = google.auth.default()
    options.view_as(GoogleCloudOptions).region = 'us-east1'
    dataflow_gcs_location = 'gs://redacted-gcs-bucket/dataflow'
    options.view_as(GoogleCloudOptions).staging_location = '%s/staging' % dataflow_gcs_location
    options.view_as(GoogleCloudOptions).temp_location = '%s/temp' % dataflow_gcs_location

    p = beam.Pipeline(InteractiveRunner())

    res = (
        p | 'QueryTable' >> beam.io.ReadFromBigQuery(
            query=f"""
            SELECT * FROM `redacted.redacted.{source_bq_table}`
            limit {n}
            """,
            use_standard_sql=True,
            project="redacted",
            use_json_exports=True,
            gcs_location="gs://redactedbucket/bq_reads"
        )
        | "set device id" >> beam.Map(set_device_id_as_key)
        | "create bt rows" >> beam.Map(to_bt_row)
        | "write out" >> WriteToBigTable(
            project_id="another-project",
            instance_id=instance,
            table_id=destination_table
        )
    )

    DataflowRunner().run_pipeline(p, options=options)

insert_data(100_000_000, "bq_table_with_100kk_rows", "xyz-ssd", "some_table", "test_100kk_ssd")

Let me know if you need any further details, I'd be very glad to help!

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@Domlenart
Author

Domlenart commented Sep 28, 2023

I have managed to rerun the stuck jobs on Beam 2.45. I retried 4 jobs with the 100kk-row write, and they all ran to completion.

I will do some more testing on 2.45 and report if we manage to observe this problem on that version as well.

@liferoad
Collaborator

@Abacn @ahmedabu98 FYI.
@Domlenart Thanks a lot for reporting the issue with the detailed repro code. Please ask the cloud support team to reach out to the Beam IO team at Google, and we can do more debugging from there.

@tvalentyn
Contributor

tvalentyn commented Sep 28, 2023

I think Dataflow support is the proper channel for this issue; we can open a follow-up issue for a Beam SDK improvement once it is root-caused.

@github-actions github-actions bot added this to the 2.52.0 Release milestone Sep 28, 2023
@liferoad
Collaborator

liferoad commented Sep 28, 2023

Please let us know if you filed the cloud support ticket. Thanks.

@Domlenart
Author

Please let us know if you filed the cloud support ticket. Thanks.

Yes, right after you made that request. Unfortunately, support is asking me to try different Beam versions (2.46) or to downgrade Protobuf, which I have no interest in doing since our testing already shows the bug does not exist on 2.45. It looks like support is trying to blame this on the memory leak issue reported in other tickets.

@liferoad
Collaborator

Please ask them to route this to our Dataflow team.

@Domlenart
Author

Done. I've repeated the request in the ticket to route it to the proper team internally.

@liferoad
Collaborator

Thanks a lot. If you hear anything back, please let us know.

@Domlenart
Author

According to support, the Dataflow team was contacted about this. Can you please confirm, @liferoad? Also, if that's the case, can you please provide a bit more detail about the next steps and whether there will be a bugfix for this problem in a future release?
While 2.45 works, it does not support Python 3.11, so once the bug is fixed we'd love to go back to 2.50+. Thanks!

@liferoad
Collaborator

liferoad commented Oct 1, 2023

Just saw the ticket. I will ask our engineers to take a closer look. Thanks.

@liferoad
Collaborator

liferoad commented Oct 1, 2023

This might be related to #28562 (Java SDK) based on the symptom.

@liferoad
Collaborator

liferoad commented Oct 1, 2023

@mutianf FYI.

@ammppp

ammppp commented Oct 3, 2023

I believe the reason the pipeline fails on 2.50 is because of this: #28399

Effectively, when a pipeline is created/run like the example below, it ends up trying to use the Dataflow Runner V1 which is no longer allowed from Beam SDK 2.50+:

p = beam.Pipeline(InteractiveRunner())
DataflowRunner().run_pipeline(p, options=options)

A workaround (until 2.51 is released) is to manually specify the "--experiments=use_runner_v2" in the pipeline options.
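
For reference, a minimal sketch of how the experiment could be set on the repro's pipeline options (assuming the standard PipelineOptions/DebugOptions API; the job name here is a placeholder):

from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import DebugOptions

# Option 1: pass the experiment as a flag string when building the options.
options = pipeline_options.PipelineOptions(
    flags=["--experiments=use_runner_v2"],
    job_name="test_job",
)

# Option 2: add the experiment programmatically on already-built options.
options.view_as(DebugOptions).add_experiment("use_runner_v2")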

@liferoad
Collaborator

liferoad commented Oct 3, 2023

But from the screenshots, the jobs indeed already used Runner V2.

@Abacn
Contributor

Abacn commented Oct 3, 2023

I believe the reason the pipeline fails on 2.50 is because of this: #28399

Effectively, when a pipeline is created/run like the example below, it ends up trying to use the Dataflow Runner V1 which is no longer allowed from Beam SDK 2.50+:

p = beam.Pipeline(InteractiveRunner())
DataflowRunner().run_pipeline(p, options=options)

A workaround (until 2.51 is released) is to manually specify the "--experiments=use_runner_v2" in the pipeline options.

This is unexpected. In Beam 2.49.0 Python, if the "disable_runner_v2" experiment is not explicitly set, it should default to Runner v2, and "disable_runner_v2" is needed to run on the Dataflow legacy runner; in 2.50.0 Python, it should always run on Runner v2 without needing to specify any experiment.

Let us take a closer look, and please file a customer ticket with Dataflow so we can take a look at your job ID.

EDIT: I see, this is a notebook job. It's a bug in Beam 2.50.0 and should be fixed in 2.51.0; "--experiments=use_runner_v2" could be a workaround.

Update: Another workaround would be to revert go/beampr/27085 and build a custom SDK on top of the release-2.50.0 branch.

@PanJ

PanJ commented Oct 17, 2023

I have a similar issue posted here

In my case, even a small WriteToBigTable job can get stuck (but with a very low chance). Not sure if my logs help with the diagnosis:

Unable to perform SDK-split for work-id: 5193980908353266575 due to error: INTERNAL: Empty split returned. [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto] { trail_point { source_file_loc { filepath: "dist_proc/dax/workflow/worker/fnapi_operators.cc" line: 2738 } } }']
=== Source Location Trace: ===
dist_proc/dax/internal/status_utils.cc:236
 And could not Checkpoint reader due to error: OUT_OF_RANGE: Cannot checkpoint when range tracker is finished. [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto] { trail_point { source_file_loc { filepath: "dist_proc/dax/workflow/worker/operator.cc" line: 340 } } }']
=== Source Location Trace: ===
dist_proc/dax/io/dax_reader_driver.cc:253
dist_proc/dax/workflow/worker/operator.cc:340

Also, the issue still occurs in version 2.51.0.

@liferoad liferoad assigned liferoad and unassigned liferoad Oct 17, 2023
@liferoad
Collaborator

Yes, the issue is actually in the Bigtable client library. For now, please use Beam 2.45.

@Abacn Abacn reopened this Oct 30, 2023
@Abacn Abacn changed the title [Bug]: Large dataflow jobs get stuck [Bug]: Python WriteToBigtable get stuck for large jobs due to client dead lock Oct 30, 2023
@damccorm damccorm removed this from the 2.52.0 Release milestone Oct 31, 2023
@ee07dazn

Any update on this?

@Abacn
Contributor

Abacn commented Jan 12, 2024

Any update on this?

It is resolved. Upgrading to Beam 2.53.0, or pinning google-cloud-bigtable==2.22.0 for the older versions between 2.49.0 and 2.52.0, should resolve the issue.
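
For example, a sketch of how such a pin could be shipped to the Dataflow workers via the standard --requirements_file / SetupOptions mechanism (the requirements.txt path is a placeholder; the file would contain the line google-cloud-bigtable==2.22.0):

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(flags=[], job_name="test_job")
# Ship a requirements file that pins the Bigtable client on the workers.
options.view_as(SetupOptions).requirements_file = "requirements.txt"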

@Abacn Abacn closed this as completed Jan 12, 2024
@Abacn Abacn added this to the 2.53.0 Release milestone Jan 12, 2024
@damccorm damccorm added the "done & done" label Jan 23, 2024
@chamikaramj
Contributor

@Abacn seems like we need to re-enable the BigTable test here:
