
[Bug]: Artifact materialization can cause a race when multiple callers materialize the same artifact in the same destination. #28605

Open
tvalentyn opened this issue Sep 21, 2023 · 9 comments

Comments

@tvalentyn (Contributor) commented Sep 21, 2023

What happened?

Beam SDK containers use the helper below to download artifacts:

func Materialize(ctx context.Context, endpoint string, dependencies []*pipepb.ArtifactInformation, rt string, dest string) ([]*pipepb.ArtifactInformation, error) {

When a worker instantiates multiple SDK containers that share a destination for staged artifacts (--semi_persist_dir):

semiPersistDir = flag.String("semi_persist_dir", "/tmp", "Local semi-persistent directory (optional).")

a race can occur: while one container is writing a staged artifact into the staging location, another container may be unable to read it, even though the artifact was previously materialized.

This race was observed in Dataflow Python pipelines that didn't use the sibling SDK worker protocol.

We could consider using a file lock during materialization, and possibly skipping re-downloads of artifacts with the same SHA.
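
A minimal sketch of what that could look like in the boot container, assuming a Linux-only advisory flock and placeholder helper names (this is not the actual Beam boot code):

```go
// Hypothetical sketch only: serialize materialization into a shared --semi_persist_dir
// with an advisory file lock, and skip downloads whose SHA-256 already matches.
package artifact

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"os"
	"path/filepath"
	"syscall"
)

// withDirLock runs fn while holding an exclusive flock on <dir>/.staging.lock.
// The lock is advisory and Linux-specific; other containers sharing the volume
// block here until the first writer finishes materializing.
func withDirLock(dir string, fn func() error) error {
	lockPath := filepath.Join(dir, ".staging.lock")
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return fn()
}

// sha256Matches reports whether the file at path exists and hashes to wantHex.
func sha256Matches(path, wantHex string) bool {
	f, err := os.Open(path)
	if err != nil {
		return false
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return false
	}
	return hex.EncodeToString(h.Sum(nil)) == wantHex
}
```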

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@tvalentyn (Contributor Author)

cc: @lostluck

@lostluck (Contributor)

This seems like a Python-specific issue, as Python runs multiple processes or multiple containers on a worker VM. The other SDKs (Go and Java at least) will only have a single boot cycle to download from the artifact repo. It's hard to guard against race conditions across separate processes that can't meaningfully communicate.

It's also hard to know whether that semi-persist directory is actually shared with other workers.

In more actionable commentary, that "don't download if the SHA matches" idea is probably a good one, since we can check for existing files; if a file exists with a matching SHA, then it's the expected file.
If not, it might either be wrong or still in the process of being downloaded.

Whoever works on this needs to take that into consideration.
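
A sketch of how that "check, then download" flow could avoid exposing in-progress files, reusing the hypothetical sha256Matches helper from the earlier sketch and a placeholder fetch callback (again, not the actual Beam boot code):

```go
// Hypothetical sketch: download to a temp file in the same directory and rename
// into place, so readers never observe a partially written artifact.
package artifact

import (
	"os"
	"path/filepath"
)

func materializeOne(dest, wantSHA string, fetch func(w *os.File) error) error {
	// Reuse the existing file if it already has the expected hash.
	if wantSHA != "" && sha256Matches(dest, wantSHA) {
		return nil
	}
	tmp, err := os.CreateTemp(filepath.Dir(dest), ".partial-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup; harmless after a successful rename
	if err := fetch(tmp); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// rename(2) is atomic within a filesystem, so dest is either absent, the old
	// file, or the complete new file -- never a half-written one.
	return os.Rename(tmp.Name(), dest)
}
```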

@kennknowles (Member)

I'm gonna say that a process writing to a shared location had better assume there might be other processes that want that location, even by accident. The two choices are: (1) generate a fresh location that no one else is going to conflict with (a la mktemp -d), or (2) use an intentionally shared location. I feel like in this case intentional sharing is what you want: download once, use by all.
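
A small sketch contrasting those two choices, under the same hypothetical-helper assumptions as the earlier snippets (the function and path names are illustrative only):

```go
// Hypothetical sketch: a fresh, collision-free scratch directory (a la mktemp -d)
// versus a deliberately shared cache path that every container targets. The shared
// variant is the one that needs the locking and SHA checks sketched above.
package artifact

import (
	"os"
	"path/filepath"
)

func stagingDir(shared bool, semiPersistDir string) (string, error) {
	if shared {
		// Intentional sharing: all containers agree on one cache path, so
		// concurrent writers must be tolerated.
		dir := filepath.Join(semiPersistDir, "staged")
		return dir, os.MkdirAll(dir, 0o755)
	}
	// Private scratch space that nothing else will conflict with.
	return os.MkdirTemp(semiPersistDir, "staged-*")
}
```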

@kennknowles (Member)

Which is just me saying the same thing from the peanut gallery: I object to processes working in a way that makes assumptions about how they'll be used, such as an SDK harness assuming it has a particular relationship to Docker containers and VMs. That's just pain down the line.

@kennknowles (Member)

I think this was on my radar due to the release. Is it common enough that it makes the sibling process feature unusable, or can we defer to 2.52.0? (I know there is currently no milestone attached - I am inviting you to attach 2.51.0 if it would have a huge negative impact on users.)

@tvalentyn (Contributor Author)

The sibling worker protocol has been re-enabled now, and with sibling workers this issue doesn't happen.

@tvalentyn (Contributor Author)

At least in the Dataflow runner.

@kennknowles (Member)

Can you point to a change that addressed this? Is it already in 2.51.0? (Apologies if this was already answered and I missed it or forgot.)

@tvalentyn (Contributor Author) commented Oct 3, 2023

A change that is now fully rolled out in the Dataflow runner enables the sibling SDK worker protocol, thereby removing this failure mode in the default Dataflow configuration: Dataflow now starts only one Python container instead of many, and since artifacts are downloaded only once per worker machine, there is no race. It is a runner-controlled change.

@tvalentyn added the P3 label and removed the P2 label on Oct 3, 2023