You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
What is the preferred / ideal solution for combining two streams?
Here are my specifics, in case there are some good examples:
I am combining two Pub/Sub streams
Each of the streams is large enough that I cannot treat either of them as a side input
One of the streams will be joined multiple times to the other stream
The time window is difficult to determine ==> it could be seconds or minutes, or it could take days
The specific case I have is in payment processing:
I have an initial payment transaction with all of the initial details
I have payment transaction "events" like the payment is created, authorized, captured, refunded, etc
Each event has the ID of the initial payment, so that it can be joined simply on the payment ID
But there might be 6 events, or maybe 20 events
And the events may occur over the span of a few seconds and be done, or there may be events that are associated with payments from weeks or months ago
I was initially thinking of creating two streams with a reasonable window of maybe a minute or two; whatever we can find an acceptable delay for reaching our storage. Then in that window, all the payments will be saved and available for joining the payment events to them. Any payment events that are associated with a payment that has fallen outside of the window wherein it is retained can be pulled from storage to append the new events.
I'd like to have a nice example for that first half of what I proposed: where both the payment and payment event are available in the time window.
There are several examples of how to do this with side inputs, but I cannot seem to find streams of two large datasets (eg, two Pub/Sub or Kafka streams).
I would ideally like an example in Python, but any language is fine.
Add utility class that enables a temporal join between two streams where Stream A is matched to Stream B where
A.timestamp = (max(b.timestamp) where b.timestamp <= a.timestamp)
This will use the following overall flow:
KV(key, Timestamped<V>)
| Window
| GBK
| Statefull DoFn
Imported from Jira BEAM-7386. Original Jira may contain additional context.
Reported by: [email protected].
The text was updated successfully, but these errors were encountered: