[Bug]: Python WriteToBigtable gets stuck for large jobs due to client deadlock #28715
Comments
I have managed to run the stuck jobs on Beam 2.45. I retried 4 jobs with the 100kk-row write, and they all ran to completion. I will do some more testing on 2.45 and report if we manage to observe this problem on that version as well.
@Abacn @ahmedabu98 FYI.
I think Dataflow support is the proper channel for this issue; we can open a follow-up issue for a Beam SDK improvement once it is root-caused.
Please let us know if you filed the cloud support ticket. Thanks.
Yes, right after you made that request. Unfortunately, support is asking me to try different Beam versions (2.46) or downgrade Protobuf, which I have no interest in doing since our testing already shows that the bug does not exist on 2.45. It looks like support is trying to blame this on the memory leak issue reported in other tickets.
Please ask them to route this to our Dataflow team.
Done. I've repeated the request in the ticket to route it to the proper team internally.
Thanks a lot. If you hear anything back, please let us know.
According to support, the Dataflow team was contacted about this. Can you please confirm @liferoad? Also, if that's the case, can you please provide a bit more detail about next steps and whether there will be a bugfix for this problem in a future release?
Just saw the ticket. I will ask our engineers to take a closer look. Thanks.
This might be related to #28562 (Java SDK) based on the symptom.
@mutianf FYI.
I believe the reason the pipeline fails on 2.50 is because of this: #28399. Effectively, when a pipeline is created/run like the example below, it ends up trying to use Dataflow Runner v1, which is no longer allowed from Beam SDK 2.50+: p = beam.Pipeline(InteractiveRunner()). A workaround (until 2.51 is released) is to manually specify "--experiments=use_runner_v2" in the pipeline options.
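For reference, a minimal sketch of applying that workaround with the standard PipelineOptions API (only the experiment flag comes from this thread; everything else is a placeholder):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

# Pass the experiment flag through the pipeline options so the job
# is submitted with Runner v2 enabled.
options = PipelineOptions(['--experiments=use_runner_v2'])
p = beam.Pipeline(InteractiveRunner(), options=options)
```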
But from the screenshots, the jobs indeed already used
This is unexpected. In Beam 2.49.0 Python, if the "disable_runner_v2" experiment is not explicitly set, it should default to Runner v2.
EDIT: I see, this is a notebook job. It's a bug in Beam 2.50.0 and should be fixed in 2.51.0; "--experiments=use_runner_v2" can be used as a workaround. Update: Another workaround would be to revert go/beampr/27085 and build a custom SDK on top of the release-2.50.0 branch.
I have a similar issue posted here. In my case, even a small
Also, the issue still occurs in
Yes, the issue is actually in the Bigtable client library. For now, please use Beam 2.45.
Any update on this?
It is resolved. Upgrading to Beam 2.53.0, or pinning google-cloud-bigtable==2.22.0 for older versions between 2.49.0 and 2.52.0, should resolve the issue.
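To make the two options concrete, the dependency pins might look like this in a requirements file (versions are the ones mentioned above; adjust to your setup):

```
# Option 1: upgrade to a release with the fix
apache-beam[gcp]==2.53.0

# Option 2: stay on an affected release (2.49.0-2.52.0) and pin the client
apache-beam[gcp]==2.49.0
google-cloud-bigtable==2.22.0
```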
@Abacn it seems like we need to re-enable the BigTable test here:
|
What happened?
Hello,
We are running Dataflow Python jobs using Beam 2.49.0. We are starting those jobs from a notebook using the functionality described here. By the way, this example crashes on the Beam 2.50.0 notebook kernel; I reported this problem to our Google support. Let me know if this is of interest and I will report a separate issue here.
Problem description:
We have a very simple pipeline that reads data using ReadFromBigQuery and applies two beam.Map operations to clean the data and transform it into google.cloud.bigtable.row.DirectRow, and then WriteToBigTable is used to write the data. We are testing the performance of Bigtable HDD- vs SSD-based instances, so we wanted to run jobs that insert 10kk and 100kk rows.
Unfortunately, the 10kk job that was writing to the HDD instance got stuck after writing 9,999,567 rows.
As you can see in the screenshot, the job scaled to about 500 workers, wrote most of the records in ~20 minutes, then scaled down to 2 workers, and no progress was made for ~18 hours. I canceled the job manually at that point.
After rerunning, the job ran to completion in 20 minutes.
Today, I've started two more jobs, each meant to write 100kk rows to BigTable (one to an HDD and the other to an SSD-based instance). Both got stuck near completion. Here are some details about one of those jobs:
One thing I noticed in all of those jobs is that "stragglers" are detected.
However, the reason why they are straggling is undetermined:
Repro code:
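(The original repro snippet is not reproduced here; below is a minimal sketch reconstructed from the description above. The project, instance, table, and column-family names, the query, and the cleaning logic are placeholders, not taken from the actual job.)

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud.bigtable import row as bt_row


def to_direct_row(record):
    # Illustrative conversion of a BigQuery row dict into a Bigtable DirectRow.
    direct_row = bt_row.DirectRow(row_key=str(record['id']).encode('utf-8'))
    direct_row.set_cell('cf', b'value', str(record['value']).encode('utf-8'))
    return direct_row


options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder
    region='us-central1',                # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
)

with beam.Pipeline(options=options) as p:
    (
        p
        | 'Read' >> ReadFromBigQuery(
            query='SELECT id, value FROM `my-project.my_dataset.my_table`',  # placeholder
            use_standard_sql=True)
        | 'Clean' >> beam.Map(lambda r: {'id': r['id'], 'value': r['value']})  # placeholder cleaning step
        | 'ToDirectRow' >> beam.Map(to_direct_row)
        | 'Write' >> WriteToBigTable(
            project_id='my-project',     # placeholder
            instance_id='my-instance',   # placeholder
            table_id='my-table')         # placeholder
    )
```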
Let me know if you need any further details, I'd be very glad to help!
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components