Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Utility BiTemporalStreamJoin #19553

Open
kennknowles opened this issue Jun 4, 2022 · 1 comment
Open

Add Utility BiTemporalStreamJoin #19553

kennknowles opened this issue Jun 4, 2022 · 1 comment

Comments

@kennknowles
Copy link
Member

Add utility class that enables a temporal join between two streams where Stream A is matched to Stream B where

A.timestamp = (max(b.timestamp) where b.timestamp <= a.timestamp) 

This will use the following overall flow:

KV(key, Timestamped<V>) 

| Window

| GBK

| Statefull DoFn

 

 

Imported from Jira BEAM-7386. Original Jira may contain additional context.
Reported by: [email protected].

@lazarillo
Copy link
Contributor

It looks like this isn't moving anywhere.

What is the preferred / ideal solution for combining two streams?

Here are my specifics, in case there are some good examples:

  • I am combining two Pub/Sub streams
  • Each of the streams is large enough that I cannot treat either of them as a side input
  • One of the streams will be joined multiple times to the other stream
  • The time window is difficult to determine ==> it could be seconds or minutes, or it could take days

The specific case I have is in payment processing:

  • I have an initial payment transaction with all of the initial details
  • I have payment transaction "events" like the payment is created, authorized, captured, refunded, etc
  • Each event has the ID of the initial payment, so that it can be joined simply on the payment ID
  • But there might be 6 events, or maybe 20 events
  • And the events may occur over the span of a few seconds and be done, or there may be events that are associated with payments from weeks or months ago

I was initially thinking of creating two streams with a reasonable window of maybe a minute or two; whatever we can find an acceptable delay for reaching our storage. Then in that window, all the payments will be saved and available for joining the payment events to them. Any payment events that are associated with a payment that has fallen outside of the window wherein it is retained can be pulled from storage to append the new events.

I'd like to have a nice example for that first half of what I proposed: where both the payment and payment event are available in the time window.

There are several examples of how to do this with side inputs, but I cannot seem to find streams of two large datasets (eg, two Pub/Sub or Kafka streams).

I would ideally like an example in Python, but any language is fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants