[Bug]: Unable to drain Flink job when RequiresStableInput is used #28554
Comments
@dmvk Is it possible that this is a bug in Flink? I think there should be a checkpoint barrier strictly after emitting MAX_WATERMARK, does that make sense?
I have tried stopping the job using both the CLI (with --drain) and the Flink RestClusterClient library. Note that when drain is set to false, the pipeline is able to successfully stop with a savepoint, including the DoFn with RequiresStableInput. So this issue only occurs in the specific scenario where the drain flag is used to advance the watermark and permanently stop the pipeline.
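For reference, a minimal sketch of the programmatic stop-with-drain path mentioned above (the REST endpoint, job ID, and savepoint path are placeholders, and the exact stopWithSavepoint signature varies across Flink versions, e.g. newer releases also take a SavepointFormatType). The CLI equivalent is roughly `flink stop --drain --savepointPath s3://bucket/savepoints <jobId>`:

```java
import java.util.concurrent.CompletableFuture;
import org.apache.flink.api.common.JobID;
import org.apache.flink.client.deployment.StandaloneClusterId;
import org.apache.flink.client.program.rest.RestClusterClient;
import org.apache.flink.configuration.Configuration;

public class DrainExample {
  public static void main(String[] args) throws Exception {
    // Placeholder REST endpoint of the Flink cluster.
    Configuration conf = new Configuration();
    conf.setString("rest.address", "localhost");
    conf.setInteger("rest.port", 8081);

    try (RestClusterClient<StandaloneClusterId> client =
        new RestClusterClient<>(conf, StandaloneClusterId.getInstance())) {
      JobID jobId = JobID.fromHexString("00000000000000000000000000000000"); // placeholder

      // advanceToEndOfEventTime = true corresponds to --drain: MAX_WATERMARK is
      // emitted before the final savepoint barrier.
      CompletableFuture<String> savepoint =
          client.stopWithSavepoint(jobId, /* advanceToEndOfEventTime= */ true, "s3://bucket/savepoints");
      System.out.println("Savepoint at: " + savepoint.get());
    }
  }
}
```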
I think I see the problem. From my understanding, after the savepoint is taken in step c) we process the buffered data, and this results in more elements being sent downstream after the savepoint barrier has passed. This problem should only arise when there is a chain of at least two transforms with RequiresStableInput.
My job only has one DoFn using the RequiresStableInput annotation.
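For readers unfamiliar with the annotation, this is roughly what such a DoFn looks like in Beam Java (the class name and element types are placeholders, not the reporter's actual code). @RequiresStableInput on the @ProcessElement method asks the runner to deliver only elements that have been durably checkpointed, which is why the Flink runner buffers input until the next checkpoint completes:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Placeholder DoFn that performs a side-effecting write and requires stable input.
class WriteToSinkFn extends DoFn<KV<String, Iterable<String>>, Void> {
  @ProcessElement
  @RequiresStableInput
  public void processElement(@Element KV<String, Iterable<String>> batch) {
    // Write the batch to an external sink here; stable input makes the write
    // effectively replay-safe across failures, at the cost of buffering elements
    // until a checkpoint has completed.
  }
}
```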
The job always fails during the final savepoint operation with the watermark hold error, not after the savepoint is taken (it fails between steps b and c).
This feels strange. Can you try to log the elements arriving at the stable DoFn after the savepoint to see where they are coming from? I would be surprised if they were coming from the KafkaIO (provided it stops sending data before triggering the checkpoint).
These events are the buffered events that are emitted from DoFn1. I tried the following steps:
From my experience, the only case in which the drain operation works is when the buffered elements have already been fully processed (via a checkpoint trigger) and the buffer is empty at the time the drain is executed.
Sorry for the late reaction, I was OOO. I once again walked through the code and it seems to make sense now. Here is the problem: elements for the stable DoFn stay buffered until the checkpoint completes, so the watermark hold is still present when the operator is being finished. The only solution then seems to be to make sure that the buffered elements are flushed in the call that finishes the operator.
No worries, and thanks for your response. I wanted to double-check the implications of flushing the buffer during that final flush. The issue I see is that once we flush the buffer, it triggers the downstream DoFn processing. And in the scenario where the downstream stable DoFn operation fails (say it is writing to a sink and the sink is unavailable, and/or the checkpoint itself times out), our final checkpoint and drain operation will fail, which could lead to inconsistent state.
Yes, if the final checkpoint fails (for whatever reason), then the retry can yield a different result than the first run. I think we have only two options: a) flush and process the buffered elements during drain, giving up the stable-input guarantee, or b) fail the drain with an exception. Currently, we do b), which is actually semantically correct. We can change the exception to be more explanatory, or better yet throw the exception unconditionally if the DoFn has stable input (currently it might fail non-deterministically). Because choosing between a) and b) might be subject to user decision, we might introduce a flag to control this behavior.
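Purely as an illustration of what such a user-facing switch could look like, a hypothetical pipeline option is sketched below. The option name does not exist in Beam; it only mirrors the a)/b) choice described above:

```java
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Hypothetical illustration only -- not an actual Beam/Flink runner option.
public interface HypotheticalFlinkDrainOptions extends PipelineOptions {
  @Description(
      "If true, flush elements buffered for @RequiresStableInput DoFns when the job is "
          + "drained (best effort, no stability guarantee); if false, fail the drain instead.")
  @Default.Boolean(false)
  boolean getFlushBufferedElementsOnDrain();

  void setFlushBufferedElementsOnDrain(boolean value);
}
```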
OK, adding a flag for drain behavior sounds reasonable. Lastly, I just wanted to double check: what issues do you see if we instead skip the watermark hold check inside flushData? If the watermark-skip approach doesn't seem correct, then I will update my PR to implement approach a).
There are two problems:
Thanks for the clarification! Makes sense to make this operation best-effort in that case. |
Closed via #29102 |
What happened?
Issue:
The Flink pipeline does not get drained when the RequiresStableInput annotation is used. Stack trace when drain is triggered:
Caused by: java.lang.RuntimeException: There are still watermark holds left when terminating operator KVStoreTransform/GroupIntoBatches/ParDo(GroupIntoBatches)/ParMultiDo(GroupIntoBatches) -> PersistenceTransform/WriteSpendStateToDb/ParMultiDo(WriteSpendStateDbStreaming) (1/1)#0 Watermark held 1695147850921
    at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.flushData(DoFnOperator.java:631)
    at org.apache.beam.runners.flink.translation.wrappers.streaming.AbstractStreamOperatorCompat.finish(AbstractStreamOperatorCompat.java:67)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
    at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.finishOperator(StreamOperatorWrapper.java:239)
    at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.lambda$deferFinishOperatorToMailbox$3(StreamOperatorWrapper.java:213)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.tryYield(MailboxExecutorImpl.java:97)
    at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.quiesceTimeServiceAndFinishOperator(StreamOperatorWrapper.java:186)
    ... 13 more
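Based on the operator names in the stack trace, the failing chain roughly corresponds to a GroupIntoBatches followed by a stable-input write. The sketch below is a reconstruction with placeholder types (reusing the WriteToSinkFn placeholder shown earlier), not the reporter's actual code:

```java
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class PipelineShape {
  // Placeholder reconstruction: GroupIntoBatches (which holds the watermark for
  // buffered batches) feeding a ParDo whose DoFn requires stable input.
  static void addStages(PCollection<KV<String, String>> input) {
    input
        .apply("KVStoreTransform/GroupIntoBatches", GroupIntoBatches.<String, String>ofSize(100))
        .apply("PersistenceTransform/WriteSpendStateToDb", ParDo.of(new WriteToSinkFn()));
  }
}
```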
Cause:
When drain is triggered, MAX_WATERMARK is emitted before the last checkpoint barrier. This triggers all the registered event-time timers and flushes out any state. Therefore, the expectation is that when flushData is finally invoked in DoFnOperator, all the event-time timers have fired and the watermark has advanced.
However, when the RequiresStableInput annotation is used, elements are buffered and only processed by the DoFn after the checkpoint operation is complete. Since the flush is invoked while the final checkpoint/savepoint operation is still in progress, the watermark is held by the DoFn with the RequiresStableInput annotation, which is waiting for the checkpoint to complete.
Potential Solution:
Skip this check when the DoFn has RequiresStableInput set, since we know that all the remaining buffered data will be processed after the final savepoint operation completes.
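A hypothetical, heavily simplified sketch of this idea (not the actual DoFnOperator code): only treat leftover watermark holds as an error when the DoFn does not require stable input.

```java
// Hypothetical sketch of the proposed fix; names and structure are illustrative only.
final class WatermarkHoldCheck {
  static void checkOnFinish(boolean requiresStableInput, long heldWatermark, String operatorName) {
    boolean holdsLeft = heldWatermark < Long.MAX_VALUE;
    if (holdsLeft && !requiresStableInput) {
      throw new RuntimeException(
          "There are still watermark holds left when terminating operator "
              + operatorName + " Watermark held " + heldWatermark);
    }
    // With @RequiresStableInput the hold is expected at this point and is released
    // once the buffered elements are processed after the final checkpoint/savepoint.
  }
}
```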
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components