You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
While preparing the Go SDK to make prism the default runner, I discovered that a DoFn that consumes it's main input as a side input as well can lead to the SDK to execute the dofn multiple times.
The cause is that the exec.Plan building logic doesn't take whether an input is a "side input" or not, into account, leading to the decision that the consumer must be multiplexed.
I believe that other runners rename the pcollection inputs for the side inputs, leading to avoiding this issue accidentally.
This should be validated with additional regression test pipelines, for both side inputs, and for flattens, which could be executed SDK side as well at the direction of the runner, this will validate whether this is an issue against Dataflow and other portable runners. I don't suspect this to be a common pattern however.
In any case, since prism is intended to challenge SDK assumptions, I'll have an SDK side fix for 2.50. There's sufficient information to determine whether an input is a side input or not, it's simply not presently used.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner
The text was updated successfully, but these errors were encountered:
lostluck
changed the title
[Bug][Go SDK]: On prism consuming the same PCollection as both side input and parallel input leads to duplicate elements.
[Bug][Go SDK]: On prism, consuming the same PCollection as both side input and parallel input leads to duplicate elements.
Jul 23, 2023
What happened?
While preparing the Go SDK to make prism the default runner, I discovered that a DoFn that consumes it's main input as a side input as well can lead to the SDK to execute the dofn multiple times.
In this case, dofn3x1 is duplicated 3 times.
The cause is that the exec.Plan building logic doesn't take whether an input is a "side input" or not, into account, leading to the decision that the consumer must be multiplexed.
https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/translate.go#L194
I believe that other runners rename the pcollection inputs for the side inputs, leading to avoiding this issue accidentally.
This should be validated with additional regression test pipelines, for both side inputs, and for flattens, which could be executed SDK side as well at the direction of the runner, this will validate whether this is an issue against Dataflow and other portable runners. I don't suspect this to be a common pattern however.
In any case, since prism is intended to challenge SDK assumptions, I'll have an SDK side fix for 2.50. There's sufficient information to determine whether an input is a side input or not, it's simply not presently used.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
The text was updated successfully, but these errors were encountered: