[Bug]: Artifact materialization can cause a race when multiple callers materialize the same artifact in the same destination. #28605
Comments
cc: @lostluck
This seems like a Python-specific issue, since the Python setup runs multiple SDK processes or multiple containers on a worker VM. The other SDKs (Go and Java at least) only have a single boot cycle that downloads from the artifact repo. It's hard to block race conditions across separate processes that can't meaningfully communicate, and it's also hard to know whether that semi-persist directory is actually shared with other workers. In more actionable commentary: the "don't download if the SHA matches" idea is probably a good one, since we can check for existing files, and if a file exists with a matching SHA, then it's the expected file. Whoever works on this needs to take that into consideration.
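For illustration, a minimal sketch of the "skip the download if an existing file's SHA already matches" idea. The package, function name, and wiring are hypothetical, not Beam's actual helper:

```go
// A sketch (not Beam code) of skipping a download when an existing file
// already has the expected SHA-256.
package materialize

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"os"
)

// alreadyMaterialized reports whether the file at path exists and its SHA-256
// matches expectedSHA256 (hex-encoded). If it returns true, a second
// materializer can skip the download instead of overwriting the file.
func alreadyMaterialized(path, expectedSHA256 string) (bool, error) {
	f, err := os.Open(path)
	if os.IsNotExist(err) {
		return false, nil // not present yet; caller should download
	}
	if err != nil {
		return false, err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return false, err
	}
	return hex.EncodeToString(h.Sum(nil)) == expectedSHA256, nil
}
```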
I'm gonna say that a process writing to a shared location had better assume there might be other processes that want that location, even by accident. The two choices are: (1) generate a fresh location that no one else is going to conflict with (a la …)
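A minimal sketch of choice (1) above, assuming "a fresh location" means a per-process unique directory under the staging root. The helper name and flow are illustrative, not Beam's actual boot sequence:

```go
// A hypothetical sketch of choice (1): give each booting process its own
// unique staging directory so concurrent processes never share file paths.
package staging

import (
	"os"
)

// stageIntoFreshDir creates a directory unique to this caller under
// stagingRoot and runs the supplied download function against it.
func stageIntoFreshDir(stagingRoot string, download func(dest string) error) (string, error) {
	// os.MkdirTemp guarantees a name no concurrent caller will collide with.
	dest, err := os.MkdirTemp(stagingRoot, "staged-artifacts-")
	if err != nil {
		return "", err
	}
	if err := download(dest); err != nil {
		os.RemoveAll(dest) // discard the partial download
		return "", err
	}
	return dest, nil
}
```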
Which is just me saying the same thing from the peanut gallery: objecting to processes working in a way that makes assumptions about how they'll be used, such as an SDK harness assuming it has a particular relationship to Docker containers and VMs. That's just pain down the line.
I think this was on my radar due to the release. Is it common enough that it makes the sibling-process feature unusable, or can we defer to 2.52.0? (I know there is not already a milestone attached; I am inviting you to attach 2.51.0 if this would have a huge negative impact on users.)
The sibling worker protocol has been re-enabled now, and with sibling workers this issue doesn't happen.
At least in the Dataflow runner.
Can you point to a change that addressed this? Is it already in 2.51.0? (Apologies if this is already answered and I missed it or forgot.)
A change that is now fully rolled out in the Dataflow runner enables the sibling SDK worker protocol, thereby removing this failure mode in the default configuration for Dataflow. Dataflow now starts only one Python container instead of many, and since artifacts are downloaded only once per worker machine, there is no race. It is a runner-controlled change.
What happened?
Beam SDK containers use the helper below to download artifacts.
beam/sdks/go/pkg/beam/artifact/materialize.go
Line 60 in 0146a83
When a worker instantiates multiple SDK containers that share a destination for staged artifacts (--semi_persist_dir), the containers materialize the same artifacts into the same destination and the downloads can race:
beam/sdks/java/container/boot.go
Line 50 in 0146a83
This race was observed in Dataflow Python pipelines that didn't use the sibling SDK worker protocol.
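For illustration only (standard library, not Beam code), the sketch below shows why two materializers that share a destination can corrupt each other's output: each one truncates and rewrites the same artifact path, so the other writer or a reader can observe a partial or interleaved file.

```go
// Illustrative only: two concurrent "materializers" (standing in for two SDK
// containers on the same VM) write the same artifact into a shared directory.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

func main() {
	dest := filepath.Join(os.TempDir(), "shared-staging")
	_ = os.MkdirAll(dest, 0o755)
	target := filepath.Join(dest, "workflow.tar.gz") // hypothetical artifact name

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// os.Create truncates an existing file, wiping whatever the
			// other writer has already written.
			f, err := os.Create(target)
			if err != nil {
				fmt.Println("create failed:", err)
				return
			}
			defer f.Close()
			for j := 0; j < 1000; j++ {
				fmt.Fprintf(f, "writer %d chunk %d\n", id, j)
			}
		}(i)
	}
	wg.Wait()
	// The resulting file may interleave chunks from both writers.
}
```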
We can consider using a file lock during materialization, and possibly not re-downloading artifacts whose SHA already matches.
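A minimal sketch of the file-lock idea, assuming the shared destination is on a local Unix filesystem where flock(2) is meaningful; the function and lock-file name are hypothetical, not Beam's implementation:

```go
//go:build unix

// A sketch of serializing concurrent materializers that share a destination
// directory by taking an exclusive flock on a lock file inside it.
package materialize

import (
	"os"
	"path/filepath"
	"syscall"
)

func withMaterializeLock(dest string, materialize func() error) error {
	if err := os.MkdirAll(dest, 0o755); err != nil {
		return err
	}
	lockFile, err := os.OpenFile(filepath.Join(dest, ".materialize.lock"), os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer lockFile.Close()

	// Blocks until no other process holds the lock on this destination.
	if err := syscall.Flock(int(lockFile.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(lockFile.Fd()), syscall.LOCK_UN)

	return materialize()
}
```

Note that an flock only serializes processes on the same machine and filesystem, which matches the earlier caveat that it is hard to know whether the semi-persist directory is shared more widely; combining the lock with a SHA check like the one sketched above would let later processes skip the download entirely.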
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components